Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing

Nihar B. Shah (University of California, Berkeley) and Dengyong Zhou (Microsoft Research)

Abstract. Crowdsourcing has gained immense popularity in machine learning applications for obtaining large amounts of labeled data. Crowdsourcing is cheap and fast, but suffers from the problem of low-quality data. To address this fundamental challenge in crowdsourcing, we propose a simple payment mechanism to incentivize workers to answer only the questions that they are sure of and to skip the rest. We show that, surprisingly, under a mild and natural "no-free-lunch" requirement, this mechanism is the one and only payment mechanism possible. We also show that, among all possible mechanisms that may or may not satisfy this requirement, our mechanism makes the smallest possible payment to spammers. Interestingly, this unique mechanism takes a multiplicative form. The simplicity of the mechanism is an added benefit. In preliminary experiments involving several hundred workers, we observe a significant reduction in the error rates under our unique mechanism, for the same or lower monetary expenditure.

Introduction

Complex machine learning tools such as deep learning are gaining increasing popularity and are being applied to a wide variety of problems. These tools, however, require large amounts of labeled data. Such large labeling tasks are performed by coordinating crowds of workers through the internet, a practice known as crowdsourcing. Crowdsourcing, as a means of collecting labeled training data, has now become indispensable to the engineering of intelligent systems.

Most workers in crowdsourcing are not experts. As a consequence, labels obtained from crowdsourcing typically contain a significant amount of error. Recent efforts have focused on developing statistical techniques to post-process the noisy labels in order to improve their quality. However, when the inputs to these algorithms are erroneous, it is difficult to guarantee that the processed labels will be reliable enough for subsequent use by machine learning or other applications. In order to avoid "garbage in, garbage out", we take a complementary approach to this problem: cleaning the data at the time of collection.

We consider crowdsourcing settings in which the workers are paid for their services, such as on the popular crowdsourcing platform of Amazon Mechanical Turk and others. These commercial platforms have gained substantial popularity due to their support for a diverse range of machine learning labeling tasks, varying from image annotation and text recognition to speech captioning and machine translation. We consider problems that are objective in nature, that is, problems that have a definite answer. Figure 1 depicts an example of such a question, where the worker is shown a set of images and, for each image, is required to identify whether the image depicts the Golden Gate Bridge.

[Figure 1: Different interfaces in a crowdsourcing setup, illustrated with the question "Is this the Golden Gate Bridge?" (options Yes / No): (a) the conventional interface, and (b) the interface with an option to skip ("Not sure").]

Our approach builds on the simple insight that in typical crowdsourcing setups, workers are simply paid in proportion to the number of tasks they complete. As a result, workers attempt to answer even questions that they are not sure of, thereby increasing the error rate of the labels. For the questions that a worker is not sure of, her answers could be very unreliable. To ensure the acquisition of only high-quality labels, we wish to encourage the worker to skip the questions about which she is unsure, for instance by providing an explicit "Not sure" option for every question
see figure our goal is to develop payment mechanisms to encourage the worker to select this option when she is unsure we will term any payment mechanism that incentivizes the worker to do so as incentive compatible in addition to incentive compatibility preventing spammers is another desirable requirement from incentive mechanisms in crowdsourcing spammers are workers who answer randomly without regard to the question being asked in the hope of earning some free money and are known to exist in large numbers on crowdsourcing platforms it is thus of interest to deter spammers by paying them as low as possible an intuitive objective to this end is to ensure zero expenditure on spammers who answer randomly in this paper however we impose strictly and significantly weaker condition and then show that there is one and only one mechanism that can satisfy this weak condition our requirement referred to as the axiom says that if all the questions attempted by the worker are answered incorrectly then the payment must be zero we propose payment mechanism for the aforementioned setting incentive compatibility plus and show that surprisingly this is the only possible mechanism we also show that additionally our mechanism makes the smallest possible payment to spammers among all possible incentive compatible mechanisms that may or may not satisfy the axiom our payment mechanism takes multiplicative form the evaluation of the worker response to each question is certain score and the final payment is product of these scores this mechanism has additional appealing features in that it is simple to compute and is also simple to explain to the workers our mechanism is applicable to any type of objective questions including multiple choice annotation questions transcription tasks etc in order to test whether our mechanism is practical and to assess the quality of the final labels obtained we conducted experiments on the amazon mechanical turk crowdsourcing platform in our preliminary experiments that involved over several hundred workers we found that the quality of data improved by under our unique mechanism with the total monetary expenditure being the same or lower as compared to the conventional baseline problem setting in the crowdsourcing setting that we consider one or more workers perform task where task consists of multiple questions the questions are objective by which we mean each question has precisely one correct answer examples of objective questions include classification questions such as figure questions on transcribing text from audio or images etc for any possible answer to any question we define the worker confidence about an answer as the probability according to her belief of this answer being correct in other words one can assume that the worker has in her mind probability distribution over all possible answers to question and the confidence for an answer is the probability of that answer being correct as shorthand we also define the confidence about question as the confidence for the answer that the worker is most confident about for that question we assume that the worker confidences for different questions are independent our goal is that for every question the worker should be incentivized to skip if the confidence is below certain threshold otherwise select the answer that she thinks is most confident about more formally let be predefined value the goal is to design payment mechanisms that incentivize the worker to skip the questions for which her confidence is lower than and 
attempt those for which her confidence is higher than moreover for the questions that she attempts to answer she must be incentivized to select the answer that she believes is most likely to be correct the threshold may be chosen based on various factors of the problem at hand for example on the downstream machine learning algorithms using the crowdsourced data or the knowledge of the statistics of worker abilities etc in this paper we assume that the threshold is given to us let denote the total number of questions in the task among these we assume the existence of some gold standard questions that is set of questions whose answers are known to the requester let denote the number of gold standard questions the gold standard questions are assumed to be distributed uniformly at random in the pool of questions of course the worker does not know which of the questions form the gold standard the payment to worker for task is computed after receiving her responses to all the questions in the task the payment is based on the worker performance on the gold standard questions since the payment is based on known answers the payments to different workers do not depend on each other thereby allowing us to consider the presence of only one worker without any loss in generality we will employ the following standard notation for any positive integer the set is denoted by the indicator function is denoted by if is true and otherwise the notation denotes the set of all real numbers let xg denote the evaluations of the answers that the worker gives to the gold standard questions here denotes that the worker skipped the question denotes that the worker attempted to answer the question and that answer was incorrect and denotes that the worker attempted to answer the question and that answer was correct let denote the payment function namely function that determines the payment to the worker based on these evaluations xg note that the crowdsourcing platforms of today mandate the payments to be we will let denote the budget the maximum amount that can be paid to any individual worker for this task max xg xg the amount is thus the amount of compensation paid to perfect agent for her work we will assume this budget condition of throughout the rest of the paper we assume that the worker attempts to maximize her overall expected payment in what follows the expression the worker expected payment will refer to the expected payment from the worker point of view and the expectation will be taken with respect to the worker confidences about her answers and the uniformly random choice of the gold standard questions among the questions in the task for any question let yi if the worker attempts question and set yi otherwise further for every question such that yi let pi be the confidence of the worker for the answer she has selected for question and for every question such that yi let pi be any arbitrary value let then from the worker perspective the expected payment for the selected answers and is yjg pji pji jg in the expression above the outermost summation corresponds to the expectation with respect to the randomness arising from the unknown choice of the gold standard questions the inner summation corresponds to the expectation with respect to the worker beliefs about the correctness of her responses in the event that the confidence about question is exactly equal to the worker may be equally incentivized to answer or skip we will call any payment function as an mechanism if the expected payment of the worker under 
this payment function is strictly maximized when the worker responds in the manner described above (skipping questions for which her confidence is below T, and otherwise selecting the answer she believes is most likely to be correct).

Main results: mechanism and guarantees

In this section, we present the main results of the paper, namely the design of mechanisms with practically useful properties. To this end, we impose the following natural requirement on the payment function, motivated by the practical considerations of budget constraints and of discouraging spammers and miscreants. We term this requirement the "no-free-lunch" axiom.

Axiom 1 (no-free-lunch axiom). If all the answers attempted by the worker in the gold standard are wrong, then the payment is zero. More formally, for every set of evaluations (x_1, ..., x_G) that satisfies sum_{i=1}^{G} 1{x_i = +1} = 0 and sum_{i=1}^{G} 1{x_i = -1} > 0, we require the payment to satisfy f(x_1, ..., x_G) = 0.

Observe that no-free-lunch is an extremely mild requirement. In fact, it is significantly weaker than imposing a zero payment on workers who answer randomly: for instance, if the questions are of a binary-choice format, then randomly choosing among the two options for each question would result in half of the answers being correct in expectation, while the axiom is applicable only when none of them turns out to be correct.

Proposed multiplicative mechanism

We now present our proposed payment mechanism in Algorithm 1.

Algorithm 1 (multiplicative mechanism).
Inputs: threshold T, budget mu_max, and evaluations (x_1, ..., x_G) in {-1, 0, +1}^G of the worker's answers to the G gold standard questions.
Let C = sum_{i=1}^{G} 1{x_i = +1} and W = sum_{i=1}^{G} 1{x_i = -1}.
The payment is f(x_1, ..., x_G) = mu_max * T^{G - C} * 1{W = 0}.

The proposed mechanism has a multiplicative form: each answer in the gold standard is given a score based on whether it was correct (score 1/T), incorrect (score 0), or skipped (score 1), and the final payment is simply the product of these scores, scaled by mu_max * T^G. The mechanism is easy to describe to workers. For instance, if T = 0.5 and the budget is a certain number of cents, the description reads: "The reward starts at a small base amount. For every correct answer among the gold standard questions, the reward will double. However, if any of these questions is answered incorrectly, then the reward will become zero. So please use the 'Not sure' option." Observe how this payment rule is similar to the popular "double or nothing" paradigm.

The algorithm makes a zero payment if one or more attempted answers in the gold standard are wrong. Note that this property is significantly stronger than the no-free-lunch property which we originally required, where we wanted a zero payment only when all attempted answers were wrong. Surprisingly, as we prove shortly, Algorithm 1 is the only incentive-compatible mechanism that satisfies no-free-lunch.

The following theorem shows that the proposed payment mechanism indeed incentivizes a worker to skip the questions for which her confidence is below T, while answering those for which her confidence is greater than T; in the latter case, the worker is incentivized to select the answer which she thinks is most likely to be correct.

Theorem 1. The payment mechanism of Algorithm 1 is incentive compatible and satisfies the no-free-lunch condition. (Such a payment function, based on gold standard questions, is also called a strictly proper scoring rule.)

The proof of Theorem 1 is presented in the appendix. It is easy to see that the mechanism satisfies no-free-lunch. The proof of incentive compatibility is also not hard: we consider an arbitrary worker with arbitrary belief distributions and compute the expected payment for that worker for the case when her choices in the task follow the requirements; we then show that any other choice leads to a strictly smaller expected payment.

While we started out with the very weak no-free-lunch condition, of making a zero payment when all attempted answers are wrong, the mechanism proposed in Algorithm 1 is significantly more strict and makes a zero payment when any of the attempted answers is wrong. A natural question that arises is: can we design an alternative mechanism satisfying incentive compatibility and no-free-lunch that operates somewhere in between?
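To make the payment rule concrete, the following is a minimal Python sketch of Algorithm 1 as reconstructed above. The function name and the {+1, 0, -1} encoding of the gold standard evaluations are ours; the arithmetic follows the per-answer scores (1/T for correct, 1 for skipped, 0 for incorrect) scaled by mu_max * T^G.

```python
def multiplicative_payment(evaluations, threshold, budget):
    """Skip-based multiplicative mechanism (Algorithm 1, as described above).

    evaluations : list of {+1, 0, -1} for the G gold standard questions
                  (+1 = correct, 0 = skipped, -1 = incorrect).
    threshold   : T, the confidence level above which a worker should attempt.
    budget      : maximum payment, earned when all gold answers are correct.
    """
    G = len(evaluations)
    payment = budget * threshold ** G   # base reward if every question is skipped
    for x in evaluations:
        if x == +1:
            payment /= threshold        # each correct answer multiplies by 1/T
        elif x == -1:
            return 0.0                  # any incorrect attempt zeroes the reward
    return payment

# With T = 0.5 the rule reads literally as "double or nothing": the reward starts
# at budget / 2**G and doubles with every correct gold answer, but a single
# incorrect attempt reduces it to zero.
print(multiplicative_payment([+1, 0, +1], threshold=0.5, budget=1.0))   # 0.5
print(multiplicative_payment([+1, -1, +1], threshold=0.5, budget=1.0))  # 0.0
```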
Uniqueness of the mechanism

In the previous section, we showed that our proposed multiplicative mechanism is incentive compatible and satisfies the intuitive no-free-lunch requirement. It turns out, perhaps surprisingly, that this mechanism is unique in this respect.

Theorem 2. The payment mechanism of Algorithm 1 is the only incentive-compatible mechanism that satisfies the no-free-lunch condition.

Theorem 2 gives a strong result despite imposing very weak requirements. To see this, recall our earlier discussion on deterring spammers, that is, incurring a low expenditure on workers who answer randomly. For instance, one may wish to design mechanisms which make a zero payment when the responses to a certain number or more of the questions in the gold standard are incorrect. The no-free-lunch axiom is a much weaker requirement, and yet the only incentive-compatible mechanism that can satisfy it is the mechanism of Algorithm 1.

The proof of Theorem 2 is available in the appendix. The proof relies on the following key lemma, which establishes a condition that any incentive-compatible mechanism must necessarily satisfy; the lemma applies to any incentive-compatible mechanism, and not just to those satisfying no-free-lunch.

Lemma 1. Any incentive-compatible payment mechanism f must satisfy, for every question i in {1, ..., G} and every evaluation (y_1, ..., y_{i-1}, y_{i+1}, ..., y_G) of the remaining gold standard questions,
T * f(y_1, ..., y_{i-1}, +1, y_{i+1}, ..., y_G) + (1 - T) * f(y_1, ..., y_{i-1}, -1, y_{i+1}, ..., y_G) = f(y_1, ..., y_{i-1}, 0, y_{i+1}, ..., y_G).

The proof of this lemma is provided in the appendix. Given this lemma, the proof of Theorem 2 is then completed via an induction on the number of skipped questions.

Optimality against spamming behavior

As discussed earlier, crowdsourcing tasks, especially those with multiple-choice questions, often encounter spammers who answer randomly, without heed to the question being asked. For instance, under a binary-choice setup, a spammer will choose one of the two options uniformly at random for every question. A highly desirable objective in crowdsourcing settings is to deter spammers. To this end, one may wish to impose a condition of zero payment when the responses to a certain number or more of the attempted questions in the gold standard are incorrect. A second desirable metric could be to minimize the expenditure on a worker who simply skips all questions. While the aforementioned requirements are deterministic functions of the worker's responses, one may alternatively wish to impose requirements that depend on the distribution of the worker's answering process; for instance, a third desirable feature would be to minimize the expected payment to a worker who answers all questions uniformly at random. We now show that, interestingly, our unique multiplicative payment mechanism simultaneously satisfies all of these requirements. The result is stated assuming a multiple-choice setup, but extends trivially to other settings.

Theorem 3 (distributional). Consider any value of the threshold T. Among all incentive-compatible mechanisms that may or may not satisfy no-free-lunch, Algorithm 1 strictly minimizes the expenditure on a worker who skips some of the questions in the gold standard and chooses answers to the remaining questions uniformly at random.

Theorem 4 (deterministic). Consider any value of the threshold T. Among all incentive-compatible mechanisms that may or may not satisfy no-free-lunch, Algorithm 1 strictly minimizes the expenditure on a worker who gives incorrect answers to a given fraction or more of the questions attempted in the gold standard.

The proofs of these theorems are presented in the appendix. We see from this result that the multiplicative payment mechanism of Algorithm 1 thus possesses very useful properties geared to deter spammers, while ensuring that a good worker will be paid a high enough amount. To illustrate this point, let us compare the mechanism of Algorithm 1 with the popular additive class of payment mechanisms.

Example 1. Consider the popular class of additive
mechanisms where the payments to worker are added across the gold standard questions this additive payment mechanism offers reward of µt for every correct answer in the gold standard for every question skipped and for every incorrect answer importantly the final payment to the worker is the sum of the rewards across the gold standard questions one can verify that this additive mechanism is incentive compatible one can also see that that as guaranteed by our theory this additive payment mechanism does not satisfy the axiom suppose each question involves choosing from two options let us compute the expenditure that these two mechanisms make under spamming behavior of choosing the answer randomly to each question given the likelihood of each question being correct on can compute that the additive mechanism makes payment of in expectation on the other hand our mechanism pays an expected amount of only the payment to spammers thus reduces exponentially with the number of gold standard questions under our mechanism whereas it does not reduce at all in the additive mechanism now consider different means of exploiting the mechanism where the worker simply skips all questions to this end observe that if worker skips all the questions then the additive payment mechanism will incur an expenditure of µt on the other hand the proposed payment mechanism of algorithm pays an exponentially smaller amount of µt recall that simulations and experiments in this section we present synthetic simulations and experiments to evaluate the effects of our setting and our mechanism on the final label quality synthetic simulations we employ synthetic simulations to understand the effects of various kinds of labeling errors in crowdsourcing we consider questions in this set of simulations whenever worker answers question her confidence for the correct answer is drawn from distribution independent of all else we investigate the effects of the following five choices of the distribution the uniform distribution on the support triangular distribution with lower upper and mode of beta distribution with parameter values and the distribution that is uniform on the discrete set truncated gaussian distribution truncation of to the interval when worker has confidence drawn from the distribution and attempts the question the probability of making an error equals we compare the setting where workers attempt every question with the setting where workers skip questions for which their confidence is below certain threshold in this set of simulations we set in either setting we aggregate the labels obtained from the workers for each question via majority vote on the two classes ties are broken by choosing one of the two options uniformly at random figure error under different interfaces for synthetic simulations of five distributions of the workers error probabilities figure depicts the results from these simulations each bar represents the fraction of questions that are labeled incorrectly and is an average across trials the standard error of the mean is too small to be visible we see that the setting consistently outperforms the conventional setting and the gains obtained are moderate to high depending on the underlying distribution of the workers errors in particular the gains are quite striking under the model this result is not surprising since the mechanism ideally screens the spammers out and leaves only the hammers who answer perfectly experiments on amazon mechanical turk we conducted preliminary experiments on the amazon 
mechanical turk commercial crowdsourcing platform to evaluate our proposed scheme in scenarios the complete data including the interface presented to the workers in each of the tasks the results obtained from the workers and the ground truth solutions are available on the website of the first author goal before delving into details we first note certain caveats relating to such study of mechanism design on crowdsourcing platforms when worker encounters mechanism for only small amount of time handful of tasks in typical research experiments and for small amount of money at most few dollars in typical crowdsourcing tasks we can not expect the worker to completely understand the mechanism and act precisely as required for instance we wouldn expect our experimental results to change significantly even upon moderate modifications in the promised amounts and furthermore we do expect the outcomes to be noisy incentive compatibility kicks in when the worker encounters mechanism across longer term for example when proposed mechanism is adopted as standard for platform or when higher amounts are involved this is when we would expect workers or others bloggers or researchers to design strategies that can game the mechanism the theoretical guarantee of incentive compatibility or strict properness then prevents such gaming in the long run we thus regard these experiments as preliminary our intentions towards this experimental exercise were to evaluate the potential of our algorithms to work in practice and to investigate the effect of the proposed algorithms on the net error in the collected labelled data experimental setup we conducted the five following experiments tasks on amazon mechanical turk identifying the golden gate bridge from pictures identifying the breeds of dogs from pictures identifying heads of countries identifying continents to which flags belong and identifying the textures in displayed images each of these tasks comprised to figure error under different interfaces and mechanisms for five experiments conducted on mechanical turk ple choice for each experiment we compared baseline setting figure with an additive payment mechanism that pays fixed amount per correct answer and ii our setting figure with the multiplicative mechanism of algorithm for each experiment and for each of the two settings we had workers independently perform the task upon completion of the tasks on amazon mechanical turk we aggregated the data in the following manner for each mechanism in each experiment we subsampled and workers and took majority vote of their responses we averaged the accuracy across all questions and across iterations of this procedure results figure reports the error in the aggregate data in the five experiments we see that in most cases our setting results in higher quality data and in many of the instances the reduction is or higher all in all in the experiments we observed substantial reduction in the amount of error in the labelled data while expending the same or lower amounts and receiving no negative comments from the workers these observations suggest that our proposed setting coupled with our multiplicative payment mechanisms have potential to work in practice the underlying fundamental theory ensures that the system can not be gamed in the long run discussion and conclusions in an extended version of this paper we generalize the setting considered here to one where we also elicit the workers confidence about their answers moreover in companion paper we construct mechanisms to 
elicit the support of worker beliefs our mechanism offers some additional benefits the pattern of skips of the workers provide reasonable estimate of the difficulty of each question in practice the questions that are estimated to be more difficult may now be delegated to an expert or to additional workers secondly the theoretical guarantees of our mechanism may allow for better of the data incorporating the confidence information and improving the overall accuracy developing statistical aggregation algorithms or augmenting existing ones for this purpose is useful direction of research thirdly the simplicity of our mechanisms may facilitate an easier adoption among the workers in conclusion given the uniqueness and optimality in theory simplicity and good performance observed in practice we envisage our multiplicative payment mechanisms to be of interest to practitioners as well as researchers who employ crowdsourcing see the extended version of this paper for additional experiments involving responses such as text transcription references john bohannon social science for pennies science cbw andrew carlson justin betteridge richard wang estevam hruschka jr and tom mitchell coupled learning for information extraction in acm wsdm pages jia deng wei dong richard socher li kai li and li imagenet hierarchical image database in ieee conference on computer vision and pattern recognition pages double or nothing http last accessed july tilmann gneiting and adrian raftery strictly proper scoring rules prediction and estimation journal of the american statistical association geoffrey hinton li deng dong yu george dahl mohamed navdeep jaitly andrew senior vincent vanhoucke patrick nguyen tara sainath et al deep neural networks for acoustic modeling in speech recognition the shared views of four research groups ieee signal processing magazine panagiotis ipeirotis foster provost victor sheng and jing wang repeated labeling using multiple noisy labelers data mining and knowledge discovery srikanth jagabathula lakshminarayanan subramanian and ashwin venkataraman worker filtering in crowdsourcing in advances in neural information processing systems pages gabriella kazai jaap kamps marijn koolen and natasa crowdsourcing for book search evaluation impact of hit design on comparative system ranking in acm sigir pages david karger sewoong oh and devavrat shah iterative learning for reliable crowdsourcing systems in advances in neural information processing systems pages qiang liu jian peng and alexander ihler variational inference for crowdsourcing in nips pages vikas raykar shipeng yu linda zhao gerardo hermosillo valadez charles florin luca bogoni and linda moy learning from crowds the journal of machine learning research nihar shah and dengyong zhou double or nothing multiplicative incentive mechanisms for crowdsourcing nihar shah dengyong zhou and yuval peres approval voting and incentives in crowdsourcing in international conference on machine learning icml jeroen vuurens arjen de vries and carsten eickhoff how much spam can you take an analysis of crowdsourcing results to increase accuracy in acm sigir workshop on crowdsourcing for information retrieval pages paul wais shivaram lingamneni duncan cook jason fennell benjamin goldenberg daniel lubarov david marin and hari simons towards building highquality workforce with mechanical turk nips workshop on computational social science and the wisdom of crowds dengyong zhou qiang liu john platt christopher meek and nihar shah regularized minimax conditional 
entropy for crowdsourcing arxiv preprint 
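To complement Example 1 above, here is a small Monte Carlo sketch (ours) of the expenditure each class of mechanism makes on a spammer who answers every binary-choice gold standard question uniformly at random. The per-question rewards of the additive baseline (budget/G per correct answer, budget*T/G per skipped question, nothing for an incorrect answer) are an assumption consistent with the description in Example 1. Under these assumptions, the simulation reflects the comparison made above: the additive rule pays such a spammer about half the budget regardless of G, while the multiplicative rule of Algorithm 1 pays only about budget/2^G.

```python
import random

def additive_payment(evals, T, budget):
    # Assumed additive baseline: budget/G per correct answer,
    # budget*T/G per skipped question, nothing for an incorrect answer.
    G = len(evals)
    per_correct, per_skip = budget / G, budget * T / G
    return sum(per_correct if x == +1 else per_skip if x == 0 else 0.0
               for x in evals)

def multiplicative_payment(evals, T, budget):
    # Algorithm 1: product of per-answer scores (1/T, 1, 0), scaled by budget*T^G.
    G = len(evals)
    pay = budget * T ** G
    for x in evals:
        if x == +1:
            pay /= T
        elif x == -1:
            return 0.0
    return pay

def expected_spammer_payment(mechanism, G, T, budget, trials=100_000):
    # A spammer answers every binary question uniformly at random, so each
    # gold standard evaluation is +1 (correct) or -1 (incorrect) with prob. 1/2.
    total = 0.0
    for _ in range(trials):
        evals = [random.choice([+1, -1]) for _ in range(G)]
        total += mechanism(evals, T, budget)
    return total / trials

G, T, budget = 5, 0.5, 1.0
print(expected_spammer_payment(additive_payment, G, T, budget))        # about budget/2
print(expected_spammer_payment(multiplicative_payment, G, T, budget))  # about budget/2**G
```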
Learning with Symmetric Label Noise: The Importance of Being Unhinged

Brendan van Rooyen and Aditya Krishna Menon (The Australian National University), Robert C. Williamson (National ICT Australia)

Abstract. Convex potential minimisation is the de facto approach to binary classification. However, Long and Servedio proved that, under symmetric label noise (SLN), minimisation of any convex potential over a linear function class can result in classification performance equivalent to random guessing. This ostensibly shows that convex losses are not SLN-robust. In this paper, we propose a convex loss and prove that it is SLN-robust. The loss avoids the Long and Servedio result by virtue of being negatively unbounded. The loss is a modification of the hinge loss, where one does not clamp at zero; hence, we call it the unhinged loss. We show that the optimal unhinged solution is equivalent to that of a strongly regularised SVM, and is the limiting solution for any convex potential; this implies that strong regularisation makes most standard learners SLN-robust. Experiments confirm that the SLN-robustness of the unhinged loss is borne out in practice. So, with apologies to Wilde: while the truth is rarely pure, it can be simple.

Learning with symmetric label noise

Binary classification is the canonical supervised learning problem: given an instance space and samples from some distribution D over instances and binary labels, the goal is to learn a scorer with low misclassification error on future samples drawn from D. Our interest is in the more realistic scenario where the learner observes samples from some corruption of D in which labels have some constant probability of being flipped, and the goal is still to perform well with respect to D. This problem is known as learning from symmetric label noise (SLN learning) (Angluin and Laird).

Long and Servedio showed that there exist linearly separable D where, when the learner observes some corruption with symmetric label noise of any nonzero rate, minimisation of any convex potential over a linear function class results in classification performance on D that is equivalent to random guessing. Ostensibly, this establishes that convex losses are not SLN-robust, and motivates the use of non-convex losses (Stempfel and Ralaivola; Masnadi-Shirazi et al.; Ding and Vishwanathan; Denchev et al.; Manwani and Sastry).

In this paper, we propose a convex loss and prove that it is SLN-robust. The loss avoids the result of Long and Servedio by virtue of being negatively unbounded. The loss is a modification of the hinge loss where one does not clamp at zero; thus, we call it the unhinged loss. This loss has several appealing properties, such as being the unique convex loss satisfying a notion of strong SLN-robustness, being classification calibrated, being consistent when minimised on the corrupted distribution, and having a simple optimal solution that is the difference of two kernel means. Finally, we show that this optimal solution is equivalent to that of a strongly regularised SVM and of any convex potential, implying that strong regularisation endows most standard learners with SLN-robustness. The classifier resulting from minimising the unhinged loss is not new (Devroye et al.; Schölkopf and Smola; Shawe-Taylor and Cristianini); however, establishing this classifier's SLN-robustness, the strong uniqueness thereof, and its equivalence to a highly regularised SVM solution is, to our knowledge, novel.

Background and problem setup

Fix an instance space X. We denote by D a distribution over instances and binary labels, with random variables (X, Y) drawn from D. Any D may be expressed via the class-conditional distributions and base rate, or via the marginal distribution over instances and the class-probability function; we interchangeably write D in either parameterisation.

Classifiers, scorers, and risks. A scorer is any real-valued function s on X. A loss is any real-valued function l of a label and a score; we use l to refer to the pair of partial losses (l(-1, .), l(+1, .)), and define risks as follows. Given a distribution D, the l-risk of a scorer s is defined as L_D(s) = E[l(Y, s(X))], so that
ld for set is the set of for all scorers in function class is any rx given some the set of restricted scorers for loss are those scorers in that minimise the sd argmin ld the set of unrestricted scorers is sd sd for rx the restricted of scorer is its excess risk over that of any restricted scorer regretd ld inf binary classification is concerned with the loss jyv jv loss is if all its scorers are also optimal for loss sd sd convex potential is any loss yv where is convex differentiable with and long and servedio definition all convex potentials are bartlett et theorem learning with symmetric label noise sln learning the problem of learning with symmetric label noise sln learning is the following angluin and laird kearns blum and mitchell natarajan et for some notional clean distribution which we would like to observe we instead observe samples from some corrupted distribution sln for some the distribution sln is such that the marginal distribution of instances is unchanged but each label is independently flipped with probability the goal is to learn scorer from these corrupted samples such that ld is small for any quantity in we denote its corrupted counterparts in sln with bar for the corrupted marginal distribution and for the corrupted function additionally when is clear from context we will occasionally refer to sln by it is easy to check that the corrupted marginal distribution and natarajan et lemma formalisation we consider learners for loss and function class with learning being the search for some that minimises the informally is robust to symmetric label noise slnrobust if minimising over gives the same classifier on both the clean distribution which the learner would like to observe and sln for any which the learner actually observes we now formalise this notion and review what is known about learners learners formal definition for some fixed instance space let denote the set of distributions on given notional clean distribution nsln returns the set of possible corrupted versions of the learner may observe where labels are flipped with unknown probability nsln sln equipped with this we define our notion of definition we say that learner is if nsln ld ld that is requires that for any level of label noise in the observed distribution the classification performance wrt of the learner is the same as if the learner directly observes unfortunately widely adopted class of learners is not as we will now see convex potentials with linear function classes are not fix rd and consider learners with convex potential and function class of linear scorers flin hw xi rd this captures the linear svm and logistic regression which are widely studied in theory and applied in practice disappointingly these learners are not long and servedio theorem give an example where when learning under symmetric label noise for any convex potential the corrupted minimiser over flin has classification performance equivalent to random guessing on this implies that flin is not as per definition proposition long and servedio theorem let rd for any pick any convex potential then flin is not the fallout what learners are in light of proposition there are two ways to proceed in order to obtain learners either we change the class of losses or we change the function class the first approach has been pursued in large body of work that embraces losses stempfel and ralaivola et ding and vishwanathan denchev et manwani and sastry while such losses avoid the conditions of proposition this does not automatically imply that they 
are when used with flin in appendix we present evidence that some of these losses are in fact not when used with flin the second approach is to consider suitably rich that contains the scorer for by employing universal kernel with this choice one can still use convex potential loss and in fact owing to equation any loss proposition pick any then rx is both approaches have drawbacks the first approach has computational penalty as it requires optimising loss the second approach has statistical penalty as estimation rates with rich will require larger sample size thus it appears that involves tradeoff however there is variant of the first option pick loss that is convex but not convex potential such loss would afford the computational and statistical advantages of minimising convex risks with linear scorers manwani and sastry demonstrated that square loss yv is one such loss we will show that there is simpler loss that is convex and but is not in the class of convex potentials by virtue of being negatively unbounded to derive this loss we first robustness via procedure even if we were content with difference of between the clean and corrupted minimisers performance long and servedio theorem implies that in the worst case loss perspective on we now to reason about optimal scorers on the same distribution but with two different losses this will help characterise set of strongly losses reformulating via losses given any natarajan et al lemma showed how to associate with loss counterpart such that ld the loss is defined as follows definition loss given any loss and the loss is since depends on the unknown parameter it is not directly usable to design an learner nonetheless it is useful theoretical device since by construction for any sd sd sd this means that sufficient condition for to be is for sd ghosh et al theorem proved sufficient condition on such that this holds namely interestingly equation is necessary for stronger notion of robustness which we now explore characterising stronger notion of as the first step towards stronger notion of robustness we rewrite with slight abuse of notation ld where is distribution over labels and scores standard requires that label noise does not change the minimisers that if is such that for all the same relation holds with in place of strong strengthens this notion by requiring that label noise does not affect the ordering of all pairs of joint distributions over labels and scores this of course trivially implies as with the definition of given distribution over labels and scores let be the corresponding distribution where labels are flipped with probability strong can then be made precise as follows definition strong call loss strongly if for every we now strong using notion of order equivalence of loss pairs which simply requires that two losses order all distributions over labels and scores identically order equivalent if definition order equivalent loss pairs call pair of losses clearly order equivalence of implies sd sd which in turn implies it is thus not surprising that we can relate order equivalence to strong of proposition loss is strongly iff for every are order equivalent this connection now lets us exploit classical result in decision theory about order equivalent losses being affine transformations of each other combined with the definition of this lets us conclude that the sufficient condition of equation is also necessary for strong of proposition loss is strongly if and only if it satisfies equation we now return to our original goal which 
was to find convex that is for flin and ideally more general function classes the above suggests that to do so it is reasonable to consider those losses that satisfy equation unfortunately it is evident that if is convex and bounded below by zero then it can not possibly be admissible in this sense but we now show that removing the boundedness restriction allows for the existence of convex admissible loss the unhinged loss convex strongly loss consider the following simple but convex loss unh unh and compared to the hinge loss the loss does not clamp at zero it does not have hinge thus peculiarly it is negatively unbounded an issue we discuss in thus we call this the unhinged the loss has number of attractive properties the most immediate being is its the unhinged loss is strongly unh unh since unh is strongly and thus that proposition implies that unh is for any further the following uniqueness property is not hard to show proposition pick any convex loss then that is up to scaling and translation unh is the only convex loss that is strongly returning to the case of linear scorers the above implies that unh flin is this does not contradict proposition since unh is not convex potential as it is negatively unbounded intuitively this property allows the loss to offset the penalty incurred by instances that are misclassified with high margin by awarding gain for instances that correctly classified with high margin the unhinged loss is classification calibrated is by itself insufficient for learner to be useful for example loss that is uniformly zero is strongly but is useless as it is not fortunately the unhinged loss is as we now establish for technical reasons see we operate with fb the set of scorers with range bounded by proposition fix unh for any dm fb sign thus for every the restricted scorer over fb has the same sign as the classifier for loss in the limiting case where rx the optimal scorer is attainable if we operate over the extended reals so that unh is enforcing boundedness of the loss while the of unh is encouraging proposition implies that its unrestricted is thus the regret of every scorer is identically which hampers analysis of consistency in orthodox decision theory analogous theoretical issues arise when attempting to establish basic theorems with unbounded losses ferguson pg we can this issue by restricting attention to bounded scorers so that unh is effectively bounded by proposition this does not affect the of the loss in the context of linear scorers boundedness of scorers can be achieved by regularisation instead of ing with flin one can instead use flin hw xi where so that flin for observe that as unh is for any unh flin is for any as we shall see in working with flin also lets us establish of the hinge loss when is large unhinged loss minimisation on corrupted distribution is consistent using bounded scorers makes it possible to establish surrogate regret bound for the unhinged loss this shows classification consistency of unhinged loss minimisation on the corrupted distribution this loss has been considered in sriperumbudur et al reid and williamson in the context of maximum mean discrepancy see the appendix the analysis of its is to our knowledge novel proposition fix unh then for any and scorer fb fb regretd regret fb regret standard rates of convergence via generalisation bounds are also trivial to derive see the appendix learning with the unhinged loss and kernels we now show that the optimal solution for the unhinged loss when employing regularisation and 
kernelised scorers has simple form this sheds further light on and regularisation the centroid classifier optimises the unhinged loss consider minimising the unhinged risk over the class of kernelised scorers fh hw ih for some where is feature mapping into reproducing kernel hilbert space with kernel equivalently given we want wunh argmin hw hw wih the optimality condition implies that wunh which is the kernel mean map of smola et and thus the optimal unhinged scorer is from equation the unhinged solution is equivalent to nearest centroid classifier manning et pg tibshirani et and cristianini section equation gives simple way to understand the of unh fh as the optimal scorers on the clean and corrupted distributions only differ by scaling see the appendix interestingly servedio theorem established that nearest centroid classifier which they termed average is robust to general class of label noise but required the assumption that is uniform over the unit sphere our result establishes that sln robustness of the classifier holds without any assumptions on in fact ghosh et al theorem lets one quantify the unhinged loss performance under more general noise model see the appendix for discussion practical considerations we note several points relating to practical usage of the unhinged loss with kernelised scorers first is not required to select since changing only changes the magnitude of scores not their sign thus for the purposes of classification one can simply use second we can easily extend the scorers to use bias regularised with strength λb tuning λb is equivalent to computing as per equation and tuning threshold on holdout set third when rd for small we can store wunh explicitly and use this to make predictions for high or infinite dimensional we can either make predictions directly via equation or use random fourier features rahimi and recht to approximately embed into some dimensional rd and then store wunh as usual the latter requires kernel we now show that under some assumptions wunh coincides with the solution of two established methods the appendix discusses some further relationships to the maximum mean discrepancy given training sample dn we can use plugin estimates as appropriate equivalence to highly regularised svm and other convex potentials there is an interesting equivalence between the unhinged solution and that of highly regularised svm this has been noted in hastie et al section which showed how svms approach nearest centroid classifier which is of course the optimal unhinged solution proposition pick any and with for any let whinge argmin max hw ih hw wih be the svm solution then if whinge wunh since unh fh is it follows that for hinge max hinge fh is similarly provided is sufficiently large that is strong regularisation and bounded feature map endows the hinge loss with proposition can be generalised to show that wunh is the limiting solution of any twice differentiable convex potential this shows that strong regularisation endows most learners with intuitively with strong regularisation one only considers the behaviour of loss near zero since convex potential has it will behave similarly to its linear approximation around zero viz the unhinged loss proposition pick any bounded feature mapping and twice differentiable convex potential with bounded let wφ be the minimiser of the regularised risk then wunh lim equivalence to fisher linear discriminant with whitened data for binary classification on dm the fisher linear discriminant fld finds weight vector proportional 
to the minimiser of square loss sq yv bishop section wsq xxt λi wsq is only changed by scaling by equation and the fact that the corrupted marginal factor under label noise this provides an alternate proof of the fact that sq flin is manwani and sastry theorem clearly the unhinged loss solution wunh is equivalent to the fld and square loss solution wsq when the input data is whitened xx with with universal kernel both the unhinged and square loss asymptotically recover the optimal classifier but the unhinged loss does not require matrix inversion with misspecified one can not in general argue for the superiority of the unhinged loss over square loss or as there is no universally good surrogate to the loss reid and williamson appendix the appendix illustrate examples where both losses may underperform of unhinged loss empirical illustration we now illustrate that the unhinged loss is empirically manifest we reiterate that with high regularisation the unhinged solution is equivalent to an svm and in the limit any loss solution thus we do not aim to assert that the unhinged loss is better than other losses but rather to demonstrate that its is not purely theoretical we first show that the unhinged risk minimiser performs well on the example of long and servedio henceforth figure shows the distribution where with marginal distribution and all three instances are deterministically positive we pick the unhinged minimiser perfectly classifies all three points regardless of the level of label noise figure the hinge minimiser is perfect when there is no noise but with even small amount of noise achieves error rate long and servedio section show that regularisation does not endow square loss escapes the result of long and servedio since it is not monotone decreasing unhinged hinge noise hinge noise hinge unhinged table mean and standard deviation of the error over trials on grayed cells denote the best performer at that noise rate figure dataset we next consider empirical risk minimisers from random training sample we construct training set of instances injected with varying levels of label noise and evaluate classification performance on test set of instances we compare the hinge for ding and vishwanathan and unhinged minimisers using linear scorer without bias term and regularisation strength from table even at label noise the unhinged classifier is able to find perfect solution by contrast both other losses suffer at even moderate noise rates we next report results on some uci datasets where we additionally tune threshold so as to ensure the best training set accuracy table summarises results on sample of four datasets the appendix contains results with more datasets performance metrics and losses even at noise close to the unhinged loss is often able to learn classifier with some discriminative power hinge unhinged hinge unhinged iris housing hinge unhinged hinge unhinged splice table mean and standard deviation of the error over trials on uci datasets conclusion and future work we proposed convex loss proved that is robust to symmetric label noise showed it is the unique loss that satisfies notion of strong established that it is optimised by the nearest centroid classifier and showed that most convex potentials such as the svm are also when highly regularised so with apologies to wilde while the truth is rarely pure it can be simple acknowledgments nicta is funded by the australian government through the department of communications and the australian research council through the ict centre of 
excellence program the authors thank cheng soon ong for valuable comments on draft of this paper references dana angluin and philip laird learning from noisy examples machine learning peter bartlett michael jordan and jon mcauliffe convexity classification and risk bounds journal of the american statistical association christopher bishop pattern recognition and machine learning new york avrim blum and tom mitchell combining labeled and unlabeled data with in conference on computational learning theory colt pages vasil denchev nan ding hartmut neven and vishwanathan robust classification with adiabatic quantum optimization in international conference on machine learning icml pages luc devroye and lugosi probabilistic theory of pattern recognition springer nan ding and vishwanathan regression in advances in neural information processing systems nips pages curran associates thomas ferguson mathematical statistics decision theoretic approach academic press aritra ghosh naresh manwani and sastry making risk minimization tolerant to label noise neurocomputing trevor hastie saharon rosset robert tibshirani and ji zhu the entire regularization path for the support vector machine journal of machine learning research december issn michael kearns efficient learning from statistical queries journal of the acm november philip long and rocco servedio random classification noise defeats all convex potential boosters machine learning issn christopher manning prabhakar raghavan and hinrich introduction to information retrieval cambridge university press new york ny usa isbn naresh manwani and sastry noise tolerance under risk minimization ieee transactions on cybernetics june hamed vijay mahadevan and nuno vasconcelos on the design of robust classifiers for computer vision in ieee conference on computer vision and pattern recognition cvpr nagarajan natarajan inderjit dhillon pradeep ravikumar and ambuj tewari learning with noisy labels in advances in neural information processing systems nips pages ali rahimi and benjamin recht random features for kernel machines in advances in neural information processing systems nips pages mark reid and robert williamson composite binary losses journal of machine learning research december mark reid and robert williamson information divergence and risk for binary experiments journal of machine learning research mar bernhard and alexander smola learning with kernels volume mit press rocco servedio on pac learning using winnow perceptron and algorithm in conference on computational learning theory colt john and nello cristianini kernel methods for pattern analysis cambridge uni press alex smola arthur gretton le song and bernhard hilbert space embedding for distributions in algorithmic learning theory alt bharath sriperumbudur kenji fukumizu arthur gretton gert lanckriet and bernhard kernel choice and classifiability for rkhs embeddings of probability distributions in advances in neural information processing systems nips guillaume stempfel and liva ralaivola learning svms from sloppily labeled data in artificial neural networks icann volume pages springer berlin heidelberg robert tibshirani trevor hastie balasubramanian narasimhan and gilbert chu diagnosis of multiple cancer types by shrunken centroids of gene expression proceedings of the national academy of sciences oscar wilde the importance of being earnest 
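Before moving on, here is a toy numerical sketch (ours, on synthetic Gaussian data) of the preceding paper's central point: the unhinged-loss minimiser over regularised linear scorers is proportional to the empirical mean of y * x, i.e. the difference of the two class means (the nearest-centroid classifier), and symmetric label noise at rate rho merely rescales this vector by (1 - 2*rho), leaving the induced classifier unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Two well-separated Gaussian classes in the plane.
    y = rng.choice([-1, 1], size=n)
    X = rng.normal(size=(n, 2)) + 2.0 * y[:, None] * np.array([1.0, 0.5])
    return X, y

def flip_labels(y, rho):
    # Symmetric label noise: each label is flipped independently with probability rho.
    return np.where(rng.random(len(y)) < rho, -y, y)

def unhinged_minimiser(X, y):
    # With the unhinged loss l(y, v) = 1 - y*v and l2 regularisation over linear
    # scorers, the optimal weight vector is proportional to the empirical mean of
    # y * x, i.e. the difference of the two (weighted) class means.
    return (y[:, None] * X).mean(axis=0)

Xtr, ytr = make_data(5000)
Xte, yte = make_data(5000)

for rho in [0.0, 0.2, 0.4]:
    w = unhinged_minimiser(Xtr, flip_labels(ytr, rho))
    acc = np.mean(np.sign(Xte @ w) == yte)
    print(f"noise rate {rho:.1f}: w = {np.round(w, 2)}, test accuracy = {acc:.3f}")

# Up to sampling error, the noisy-label solution is the clean solution shrunk by a
# factor (1 - 2*rho): its direction, and hence the classifier sign(w . x), is
# unaffected by the symmetric label noise.
```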
Algorithmic Stability and Uniform Generalization

Ibrahim Alabdulmohsin, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

Abstract. One of the central questions in statistical learning theory is to determine the conditions under which agents can learn from experience. This includes the necessary and sufficient conditions for generalization from a given finite training set to new observations. In this paper, we prove that algorithmic stability in the inference process is equivalent to uniform generalization across all parametric loss functions. We provide various interpretations of this result. For instance, a relationship is proved between stability and data processing, which reveals that algorithmic stability can be improved by post-processing the inferred hypothesis or by augmenting training examples with artificial noise prior to learning. In addition, we establish a relationship between algorithmic stability and the size of the observation space, which provides a formal justification for dimensionality reduction methods. Finally, we connect algorithmic stability to the size of the hypothesis space, which recovers the classical PAC result that the size (complexity) of the hypothesis space should be controlled in order to improve algorithmic stability and, hence, generalization.

Introduction

One fundamental goal of any learning algorithm is to strike the right balance between underfitting and overfitting. In mathematical terms, this is often translated into two separate objectives. First, we would like the learning algorithm to produce a hypothesis that is reasonably consistent with the empirical evidence, i.e., to have a small empirical risk. Second, we would like to guarantee that the empirical risk (training error) is a valid estimate of the true unknown risk (test error). The former condition protects against underfitting, while the latter condition protects against overfitting.

The rationale behind these two objectives can be understood if we define the generalization risk R_gen as the absolute difference between the empirical and true risks, R_gen = |R_emp - R_true|. Then it is elementary to observe that the true risk R_true is bounded from above by the sum R_emp + R_gen. Hence, by minimizing both the empirical risk (underfitting) and the generalization risk (overfitting), one obtains an inference procedure whose true risk is minimal. Minimizing the empirical risk alone can be carried out using the empirical risk minimization (ERM) procedure or some approximations to it. However, the generalization risk is often impossible to deal with directly. Instead, it is common practice to bound it analytically, so that we can establish conditions under which it is guaranteed to be small. By establishing conditions for generalization, one hopes to design better learning algorithms that both perform well empirically and generalize well to novel observations in the future. A prominent example of such an approach is the support vector machine (SVM) algorithm for binary classification.

However, bounding the generalization risk is quite intricate, because it can be approached from various angles. In fact, several methods have been proposed in the past to prove generalization bounds, including uniform convergence, algorithmic stability, Rademacher and Gaussian complexities, generic chaining bounds, the PAC-Bayesian framework, and robustness-based analysis; concentration of measure inequalities form the building blocks of these rich theories. The proliferation of generalization bounds can be understood if we look into the general setting of learning introduced by Vapnik. In this setting, we have an observation space and a hypothesis
space learning algorithm henceforth denoted uses finite set of observations to infer hypothesis in the general setting the inference process is influenced by three key factors the nature of the observation space the nature of the hypothesis space and the details of the learning algorithm by imposing constraints on any of these three components one may be able to derive new generalization bounds for example the vc theory derives generalization bounds by assuming constraints on while stability bounds are derived by assuming constraints on given that different generalization bounds can be established by imposing constraints on any of or it is intriguing to ask if there exists single view for generalization that ties all of these different components together in this paper we answer this question in the affirmative by establishing that algorithmic stability alone is equivalent to uniform generalization informally speaking an inference process is said to generalize uniformly if the generalization risk vanishes uniformly across all bounded parametric loss functions at the limit of large training sets more precise definition will be presented in the sequel we will show why constraints that are imposed on either or to improve uniform generalization can be interpreted as methods of improving the stability of the learning algorithm this is similar in spirit to result by kearns and ron who showed that having finite vc dimension in the hypothesis space implies certain notion of algorithmic stability in the inference process our statement however is more general as it applies to all learning algorithms that fall under vapnik general setting of learning well beyond uniform convergence the rest of the paper is as follows first we review the current literature on algorithmic stability generalization and learnability then we introduce key definitions that will be repeatedly used throughout the paper next we prove the central theorem which reveals that algorithmic stability is equivalent to uniform generalization and provide various interpretations of this result afterward related work perhaps the two most fundamental concepts in statistical learning theory are those of learnability and generalization the two concepts are distinct from each other as will be discussed in more details next whereas learnability is concerned with measuring the excess risk within hypothesis space generalization is concerned with estimating the true risk in order to define learnability and generalization suppose we have an observation space probability distribution of observations and bounded stochastic loss function where is an inferred hypothesis note that is implicitly function of parameterized by as well we define the true risk of hypothesis by the risk functional rtrue then learning algorithm is called consistent if the true risk of its inferred hypothesis converges to the optimal true risk within the hypothesis space at the limit of large training sets problem is called learnable if it admits consistent learning algorithm it has been known that learnability for supervised classification and regression problems is equivalent to uniform convergence however et al recently showed that uniform convergence is not necessary in vapnik general setting of learning and proposed algorithmic stability as an alternative key condition for learnability unlike learnability the question of generalization is concerned primarily with how representative the empirical risk remp is to the true risk rtrue to elaborate suppose we have finite 
training set sm zi which comprises of observations zi we define the empirical risk of hypothesis with respect to sm by remp sm zi zi we also let rtrue be the true risk as defined in eq then learning algorithm is said to generalize if the empirical risk of its inferred hypothesis converges to its true risk as similar to learnability uniform convergence is by definition sufficient for generalization but it is not necessary because the learning algorithm can always restrict its search space to smaller subset of artificially so to speak by contrast it is not known whether algorithmic stability is necessary for generalization it has been shown that various notions of algorithmic stability can be defined that are sufficient for generalization however it is not known whether an appropriate notion of algorithmic stability can be defined that is both necessary and sufficient for generalization in vapnik general setting of learning in this paper we answer this question by showing that stability in the inference process is not only sufficient for generalization but it is in fact equivalent to uniform generalization which is notion of generalization that is stronger than the one traditionally considered in the literature preliminaries to simplify the discussion we will always assume that all sets are countable including the observation space and the hypothesis space this is similar to the assumptions used in some previous works such as however the main results which are presented in section can be readily generalized in addition we assume that all learning algorithms are invariant to permutations of the training set hence the order of training examples is irrelevant moreover if is random variable drawn from the alphabet and is function of we write to mean often we will simply write ex to mean if the distribution of is clear from the context if takes its values from finite set uniformly at random we write to denote this distribution of if is boolean random variable then if and only if is true otherwise in general random variables are denoted with capital letters instances of random variables are denoted with small letters and alphabets are denoted with calligraphic typeface also given two probability mass functions and defined on the same alphabet we will write hp qi to denote the overlapping coefficient intersection between and that is hp qi min note that hp qi where is the total variation distance last we will write nk φk to denote the binomial distribution in this paper we consider the general setting of learning introduced by vapnik to reiterate we have an observation space and hypothesis space our learning algorithm receives set of observations sm zi generated from fixed unknown distribution and picks hypothesis with probability pl formally is stochastic map in this paper we allow the hypothesis to be any summary statistic of the training set it can be measure of central tendency as in unsupervised learning or it can be mapping from an input space to an output space as in supervised learning in fact we even allow to be subset of the training set itself in formal terms is stochastic map between the two random variables and sm where the exact interpretation of those random variables is irrelevant in any learning task we assume bounded loss function is used to measure the quality of the inferred hypothesis on the observation most importantly we assume that is parametric definition parametric loss functions loss function is called parametric if it is independent of the training set sm given the 
inferred hypothesis that is parametric loss function satisfies the markov chain sm for any fixed hypothesis we define its true risk rtrue by eq and define its empirical risk on training set sm denoted remp sm by eq we also define the true and empirical risks of the learning algorithm by the expected risk of its inferred hypothesis esm eh rtrue esm rtrue esm eh remp sm esm remp sm to simplify notation we will write and instead of and we will consider the following definition of generalization definition generalization learning algorithm with parametric loss function generalizes if for any distribution on we have where and are given in eq and eq respectively in other words learning algorithm generalizes according to definition if its empirical performance training loss becomes an unbiased estimator to the true risk as next we define uniform generalization definition uniform generalization learning algorithm generalizes uniformly if for any there exists such that for all distributions on all parametric loss functions and all sample sizes we have uniform generalization is stronger than the original notion of generalization in definition in particular if learning algorithm generalizes uniformly then it generalizes according to definition as well the converse however is not true even though uniform generalization appears to be quite strong condition at first sight key contribution of this paper is to show that it is not strong condition because it is equivalent to simple condition namely algorithmic stability main results before we prove that algorithmic stability is equivalent to uniform generalization we introduce probabilistic notion of mutual stability between two random variables in order to abstract away any labeling information the random variables might possess the observation space may or may not be metric space we define stability by the impact of observations on probability distributions definition mutual stability let and be two random variables then the mutual stability between and is defined by hp ex hp ey hp if we recall that hp qi is the overlapping coefficient between the two probability distributions and we see that given by definition is indeed probabilistic measure of mutual stability it measures how stable the distribution of is before and after observing an instance of and vice versa small value of means that the probability distribution of or is heavily perturbed by single observation of the other random variable perfect mutual stability is achieved when the two random variables are independent of each other with this probabilistic notion of mutual stability in mind we define the stability of learning algorithm by the mutual stability between its inferred hypothesis and random training example be learning algorithm that receives definition algorithmic stability let finite set of training examples sm zi drawn from fixed distribution let pl be the hypothesis inferred by and let ztrn sm be single random training example we define the stability of by inf ztrn where the infimum is taken over all possible distributions of observations learning algorithm is called algorithmically stable if note that the above definition of algorithmic stability is rather weak it only requires that the contribution of any single training example on the overall inference process to be more and more negligible as the sample size increases in addition it is even if the learning algorithm is deterministic because the hypothesis if it is deterministic function of an entire training set of observations 
remains stochastic function of any individual observation we illustrate this concept with the following example example suppose that observations zi are bernoulli trials with zi and that the hypothesis produced by is the empirical average zi because ztrn and ztrn it can be shown using stirling approximation that the algorithmic stability of this learning algorithm is asymptotically given by which is achieved when more general statement will be proved later in section next we show that the notion of algorithmic stability in definition is equivalent to the notion of uniform generalization in definition before we do that we first state the following lemma lemma data processing inequality let and be three random variables that satisfy the markov chain then proof the proof consists of two steps first we note that because the markov chain implies that we have by direct substitution into definition second similar to the can not inequality in information theory it can be shown that for any random variables and this is proved using some algebraic minimum of the sums is always larger than the thep the fact that sum of minimums min min αi βi combining both results yields which is the desired result now we are ready to state the main result of this paper theorem for any learning algorithm algorithmic stability as given in inition is both necessary and sufficient for uniform generalization see definition in addition ztrn where rtrue and remp are the true and empirical risks of the learning algorithm defined in eq and respectively proof here is an outline of the proof first because parametric loss function is itself random variable that satisfies the markov chain sm it is not independent of ztrn sm hence the empirical risk is given by el eztrn ztrn by contrast the true risk is given by el eztrn ztrn the difference is el eztrn ztrn eztrn ztrn to sandwich the side between an upper and lower bound we note that if and are two distributions defined on the same alphabet and is bounded loss function then where is the total variation distance the proof to this result can be immediately deduced by considering the two regions and separately this is then used to deduce the inequalities ztrn ztrn where the second inequality follows by the data processing inequality in lemma whereas the last inequality follows by definition of algorithmic stability see definition this proves that if is algorithmically stable as then converges to zero uniformly across all parametric loss functions therefore algorithmic stability is sufficient for uniform generalization the converse is proved by showing that for any bounded there exists parametric loss and distribution pδ such that therefore algorithmic stability is also necessary for uniform generalization interpreting algorithmic stability and uniform generalization in this section we provide several interpretations of algorithmic stability and uniform generalization in addition we show how theorem recovers some classical results in learning theory algorithmic stability and data processing the relationship between algorithmic stability and data processing is presented in lemma given the random variables and and the markov chain we always have this presents us with qualitative insights into the design of machine learning algorithms first suppose we have two different hypotheses and we will say that contains less informative than if the markov chain sm holds for example if observations zi are bernoulli trials then can be the empirical average as given in example while can be the label 
that occurs most often in the training set because the hypothesis contains strictly less information about the original training set than formally we have sm in this case enjoys better uniform generalization bound than because of intuitively we know that such result should hold because is less tied to the original training set than this brings us to the following remark detailed proofs are available in the supplementary file remark we can improve the uniform generalization bound or equivalently algorithmic stability of learning algorithm by its inferred hypothesis in manner that is conditionally independent of the original training set given example hypotheses is common technique used in machine learning this includes sparsifying the coefficient vector rd in linear methods where wj is set to zero if it has small absolute magnitude it also includes methods that have been proposed to reduce the number of support vectors in svm by exploiting linear dependence by the data processing inequality such methods improve algorithmic stability and uniform generalization needless to mention better generalization does not immediately translate into smaller true risk this is because the empirical risk itself may increase when the inferred hypothesis is independently of the original training set second if the markov chain holds we also obtain by applying the data processing inequality to the reverse markov chain as result we can improve algorithmic stability by contaminating training examples with artificial noise prior to learning this is because if is perturbed version of training set sm then sm implies that ztrn when ztrn sm and are random training examples drawn uniformly at random from each training set respectively this brings us to the following remark remark we can improve the algorithmic stability of learning algorithm by introducing artificial noise to training examples and applying the learning algorithm on the perturbed training set example corrupting training examples with artificial noise such as the recent dropout method are popular techniques in neural networks to improve generalization by the data processing inequality such methods indeed improve algorithmic stability and uniform generalization algorithmic stability and the size of the observation space next we look into how the size of the observation space influences algorithmic stability first we start with the following definition definition lazy learning learning algorithm is called lazy if its hypothesis is mapped with the training set sm the mapping sm is injective lazy learner is called lazy if its hypothesis is equivalent to the original training set in its information content hence no learning actually takes place one example is learning when sm despite their simple nature lazy learners are useful in practice they are useful theoretical tools as well in particular because of the equivalence sm and the data processing inequality the algorithmic stability of lazy learner provides lower bound to the stability of any possible learning algorithm therefore we can relate algorithmic stability uniform generalization to the size of the observation space by quantifying the algorithmic stability of lazy learners because the size of is usually infinite however we introduce the following definition of effective set size definition in countable space endowed with probability mass function the effective size of is defined by ess at one extreme if is uniform over finite alphabet then ess at the other extreme if is kronecker delta distribution 
then ess as proved next this notion of effective set size determines the rate of convergence of an empirical probability mass function to its true distribution when the distance is measured in the total variation sense as result it allows us to relate algorithmic stability to property of the observation space theorem let be countable space endowed with probability mass function let sm be set of samples zi define psm to be the empirical probability mass function induced by drawing samples uniformly at random from sm then esm psm ess where ess is the effective size of see inition in addition for any learning algorithm we have ztrn ess where the bound is achieved by lazy learners see definition special case of theorem was proved by de moivre in the who showed that the pempirical mean of bernoulli trials with probability of success converges to the true mean at rate of πm proof here is an outline of the proof first we know that sm where is the multinomial coefficient using the relation the multinomial series and de moivre formula for the mean deviation of the binomial random variable it can be shown with some algebraic manipulations that esm psm pk pk pk using stirling approximation to the factorial we obtain the simple asymptotic expression pk ess esm psm πm which is tight due to the tightness of the stirling approximation the rest of the theorem follows from the markov chain sm sm the data processing inequality and definition corollary given the conditions of theorem if is in addition finite then for any learning algorithm we have proof because in finite observation space the maximum effective set size see definition is which is attained at the uniform distribution intuitively speaking theorem and its corollary state that in order to guarantee good uniform generalization for all possible learning algorithms the number of observations must be sufficiently large to cover the entire effective size of the observation space needless to mention this is difficult to achieve in practice so the algorithmic stability of machine learning algorithms must be controlled in order to guarantee good generalization from few empirical observations similarly the uniform generalization bound can be improved by reducing the effective size of the observation space such as by using dimensionality reduction methods algorithmic stability and the complexity of the hypothesis space finally we look into the hypothesis space and how it influences algorithmic stability first we look into the role of the size of the hypothesis space this is formalized in the following theorem theorem denote by the hypothesis inferred by learning algorithm then the following bound on algorithmic stability always holds log where is the shannon entropy measured in nats using natural logarithms proof the proof is if we let be the mutual information between the and and let sm zm be random choice of training set we have hx sm sm sm zi because conditioning reduces entropy for any and we have sm zi zi ztrn ztrn therefore ztrn sm on average this is believed to be the first appearance of the law in statistical inference in the literature because the effective set size of the bernoulli distribution according to definition is given by theorem agrees with in fact generalizes de moivre result next we use pinsker inequality which states that for any probability distributions and where is total variation distance and is the divergence measured in nats using natural logarithms if we recall that sm sm sm while mutual information is sm sm sm we deduce from 
pinsker inequality and eq ztrn ztrn ztrn ztrn sm log in the last line we used the fact that for any random variables and theorem the classical pac result on the finite hypothesis space in terms of algorithmic stability learning algorithm will enjoy high stability if the size of the hypothesis space is small in terms of uniform generalization it states that the generalizationp risk of learning algorithm is bounded from above uniformly across all parametric loss functions by log where is the shannon entropy of next we relate algorithmic stability to the vc dimension despite the fact that the vc dimension is defined on functions whereas algorithmic stability is functional of probability distributions there exists connection between the two concepts to show this we first introduce notion of an induced concept class that exists for any learning algorithm definition the concept class induced by learning algorithm is defined to be the set of total boolean functions ztrn ztrn for all intuitively every hypothesis induces total partition on the observation space given by the boolean function in definition that is splits into two disjoint sets the set of values in that are posteriori less likely to have been present in the training set than before given that the inferred hypothesis is and the set of all other values the complexity richness of the induced concept class is related to algorithmic stability via the vc dimension theorem let be learning algorithm with an induced concept class let dv be the vc dimension of then the following bound holds if dv dv log in particular is algorithmically stable if its induced concept class has finite vc dimension proof the is bounded from below by proof relies on the fact that algorithmic stability supp esm ch ch where ch ztrn ztrn the final bound follows by applying uniform convergence results conclusions in this paper we showed that probabilistic notion of algorithmic stability was equivalent to uniform generalization in informal terms learning algorithm is called algorithmically stable if the impact of single training example on the probability distribution of the final hypothesis always vanishes at the limit of large training sets in other words the inference process never depends heavily on any single training example if algorithmic stability holds then the learning algorithm generalizes well regardless of the choice of the parametric loss function we also provided several interpretations of this result for instance the relationship between algorithmic stability and data processing reveals that algorithmic stability can be improved by either the inferred hypothesis or by augmenting training examples with artificial noise prior to learning in addition we established relationship between algorithmic stability and the effective size of the observation space which provided formal justification for dimensionality reduction methods finally we connected algorithmic stability to the complexity richness of the hypothesis space which the classical pac result that the complexity of the hypothesis space should be controlled in order to improve stability and hence improve generalization references vapnik an overview of statistical learning theory neural networks ieee transactions on vol september cortes and vapnik networks machine learning vol pp blumer ehrenfeucht haussler and warmuth learnability and the vapnikchervonenkis dimension journal of the acm jacm vol no pp talagrand majorizing measures the generic chaining the annals of probability vol no pp mcallester 
stochastic model selection machine learning vol pp bousquet and elisseeff stability and generalization the journal of machine learning research jmlr vol pp bartlett and mendelson rademacher and gaussian complexities risk bounds and structural results the journal of machine learning research jmlr vol pp audibert and bousquet combining and generic chaining bounds the journal of machine learning research jmlr vol pp xu and mannor robustness and generalization machine learning vol no pp elisseeff pontil et error and stability of learning algorithms with applications series on learning theory and practice science series sub series iii computer and systems sciences kutin and niyogi algorithmic stability and generalization error in proceedings of the eighteenth conference on uncertainty in artificial intelligence uai poggio rifkin mukherjee and niyogi general conditions for predictivity in learning theory nature vol pp kearns and ron algorithmic stability and bounds for crossvalidation neural computation vol no pp shamir srebro and sridharan learnability stability and uniform convergence the journal of machine learning research jmlr vol pp devroye and lugosi probabilistic theory of pattern recognition springer vapnik and chapelle bounds on error expectation for support vector machines neural computation vol no pp robbins remark on stirling formula american mathematical monthly pp cover and thomas elements of information theory wiley sons downs gates and masters exact simplification of support vector solutions jmlr vol pp wager wang and liang dropout training as adaptive regularization in nips pp stigler the history of statistics the measurement of uncertainty before harvard university press diaconis and zabell closed form summation for classical distributions variations on theme of de moivre statlstlcal science vol no pp and understanding machine learning from theory to algorithms cambridge university press 
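Since several of the displayed formulas in the stability paper above did not survive extraction, the key quantities are restated below in a cleaned-up form. The risks, the overlapping-coefficient notion of mutual stability, and the main bound follow the definitions and theorem statement given in the text; the exact normalization of the effective set size is an assumption, chosen so that it matches the two special cases stated above (a uniform distribution over a finite alphabet and a Kronecker delta).

```latex
% Cleaned-up restatement (sketch) of the main quantities used above.
% The effective-set-size formula is an assumption consistent with the
% stated extremes; everything else follows the definitions in the text.
R_{\mathrm{true}}(h) \;=\; \mathbb{E}_{z \sim P}\, L(h, z),
\qquad
R_{\mathrm{emp}}(h; S_m) \;=\; \frac{1}{m}\sum_{i=1}^{m} L(h, z_i)

% Overlapping coefficient and mutual stability:
\langle P, Q \rangle \;=\; \sum_{x} \min\{P(x), Q(x)\} \;=\; 1 - \|P - Q\|_{\mathrm{TV}},
\qquad
\mathcal{S}(X; Y) \;=\; \mathbb{E}_{y \sim P(Y)}\,\bigl\langle\, P(X \mid Y = y),\; P(X) \,\bigr\rangle

% Algorithmic stability of a learner L with hypothesis H and a random
% training example Z_trn drawn uniformly from S_m:
\mathcal{S}(\mathcal{L}) \;=\; \inf_{P(Z)} \mathcal{S}(H; Z_{\mathrm{trn}}),
\qquad
\mathcal{L} \text{ is stable iff } \mathcal{S}(\mathcal{L}) \to 1 \text{ as } m \to \infty

% Central bound (uniform generalization is controlled by stability):
\bigl|\mathbb{E}\, R_{\mathrm{emp}} - \mathbb{E}\, R_{\mathrm{true}}\bigr|
\;\le\; 1 - \mathcal{S}(H; Z_{\mathrm{trn}})

% Effective set size (assumed normalization), giving Ess = K for the uniform
% distribution on K symbols and Ess = 1 for a Kronecker delta:
\mathrm{Ess}[P] \;=\; 1 + \Bigl(\sum_{z} \sqrt{P(z)\,\bigl(1 - P(z)\bigr)}\Bigr)^{2}
```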
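As a small numerical companion to the mutual-stability definition and the Bernoulli example above, the following sketch computes S(H; Z_trn) exactly when the hypothesis is the empirical mean of m Bernoulli(phi) observations, using only binomial probability mass functions. The reference value printed alongside is the 1 - 1/sqrt(2*pi*m) rate quoted in the example for phi = 1/2; treat the comparison as illustrative rather than as the paper's exact constant.

```python
# Sketch: exact mutual stability S(H; Z_trn) for the Bernoulli-mean example,
# where H is the empirical mean of m Bernoulli(phi) observations and Z_trn is
# a single training example chosen uniformly at random.
import numpy as np
from scipy.stats import binom

def mutual_stability_bernoulli_mean(m: int, phi: float) -> float:
    """S(H; Z_trn) = E_z <P(H | Z_trn = z), P(H)> for H = sample mean."""
    k = np.arange(m + 1)
    p_h = binom.pmf(k, m, phi)                      # marginal distribution of m*H
    # Conditional on Z_trn = 1: the remaining m-1 draws contribute k-1 successes.
    p_h_given_1 = binom.pmf(k - 1, m - 1, phi)
    # Conditional on Z_trn = 0: the remaining m-1 draws contribute k successes.
    p_h_given_0 = binom.pmf(k, m - 1, phi)
    overlap_1 = np.minimum(p_h_given_1, p_h).sum()  # <P(H | z=1), P(H)>
    overlap_0 = np.minimum(p_h_given_0, p_h).sum()  # <P(H | z=0), P(H)>
    return phi * overlap_1 + (1.0 - phi) * overlap_0

if __name__ == "__main__":
    phi = 0.5
    for m in (10, 100, 1000, 10000):
        s = mutual_stability_bernoulli_mean(m, phi)
        ref = 1.0 - 1.0 / np.sqrt(2.0 * np.pi * m)
        print(f"m={m:6d}  S(H;Z_trn)={s:.5f}  1 - 1/sqrt(2*pi*m)={ref:.5f}")
```

The printed gap 1 - S(H; Z_trn) shrinks like 1/sqrt(m), which is the uniform-generalization rate the main theorem would predict for this learner.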
adaptive sequential inference for dirichlet process mixture models theodoros tsiligkaridis keith forsythe massachusetts institute of technology lincoln laboratory lexington ma usa ttsili forsythe abstract we develop a sequential inference procedure for dirichlet process mixtures of gaussians for online clustering and parameter estimation when the number of clusters is unknown we present an easily computable closed form parametric expression for the conditional likelihood in which hyperparameters are recursively updated as a function of the streaming data assuming conjugate priors motivated by large-sample asymptotics we propose a novel adaptive design for the dirichlet process concentration parameter and show that the number of classes grows at most at a logarithmic rate we further prove that in the limit the conditional likelihood and data predictive distribution become asymptotically gaussian we demonstrate through experiments on synthetic and real data sets that our approach is superior to other online methods introduction dirichlet process mixture models dpmm have been widely used for clustering data neal rasmussen traditional finite mixture models often suffer from overfitting or underfitting of data due to a possible mismatch between the model complexity and the amount of data thus model selection or model averaging is required to find the correct number of clusters or the model with the appropriate complexity this requires significant computation for large data sets or sample sizes bayesian nonparametric modeling is an alternative approach to parametric modeling an example being the dpmm which can automatically infer the number of clusters from the data via bayesian inference techniques the use of markov chain monte carlo mcmc methods for dirichlet process mixtures has made inference tractable neal however these methods can exhibit slow convergence and their convergence can be tough to detect alternatives include variational methods blei jordan which are deterministic algorithms that convert inference to optimization these approaches can take significant computational effort even for moderate sized data sets for large data sets and applications with streaming data there is a need for inference algorithms that are much faster and do not require multiple passes through the data in this work we focus on low-complexity algorithms that adapt to each sample as it arrives making them highly scalable an online algorithm for learning dpmm based on sequential variational approximation sva was proposed in lin and the authors in wang dunson recently proposed a sequential maximum a posteriori map estimator for the class labels given streaming data the algorithm is called sequential updating and greedy search sugs and each iteration is composed of a greedy selection step and a posterior update step the choice of the concentration parameter α is critical for dpmm as it controls the number of clusters antoniak while most fast dpmm algorithms use a fixed α fearnhead daume kurihara et al imposing a prior distribution on α and sampling from it provides more flexibility but this approach still heavily relies on experimentation and prior knowledge thus many fast inference methods for dirichlet process mixture models have been proposed that can adapt α to the data including the works escobar west where learning of α is incorporated in the gibbs sampling analysis blei jordan where a gamma prior is used in a conjugate manner directly in the variational inference algorithm wang dunson also account for model uncertainty on the concentration parameter α in a bayesian manner directly in the
sequential inference procedure this approach can be computationally expensive as discretization of the domain of is needed and its stability highly depends on the initial distribution on and on the range of values of to the best of our knowledge we are the first to analytically study the evolution and stability of the adapted sequence of in the online learning setting in this paper we propose an adaptive approach for adapting motivated by largesample asymptotics and call the resulting algorithm asugs adaptive sugs while the basic idea behind asugs is directly related to the greedy approach of sugs the main contribution is novel stable method for choosing the concentration parameter adaptively as new data arrive which greatly improves the clustering performance we derive an upper bound on the number of classes logarithmic in the number of samples and further prove that the sequence of concentration parameters that results from this adaptive design is almost bounded we finally prove that the conditional likelihood which is the primary tool used for online clustering is asymptotically gaussian in the limit implying that the clustering part of asugs asymptotically behaves as gaussian classifier experiments show that our method outperforms other methods for online learning of dpmm the paper is organized as follows in section we review the sequential inference framework for dpmm that we will build upon introduce notation and propose our adaptive modification in section the probabilistic data model is given and sequential inference steps are shown section contains the growth rate analysis of the number of classes and the concentration parameters and section contains the gaussian approximation to the conditional likelihood experimental results are shown in section and we conclude in section sequential inference framework for dpmm here we review the sugs framework of wang dunson for online clustering here the nonparametric nature of the dirichlet process manifests itself as modeling mixture models with countably infinite components let the observations be given by yi rd and γi to denote the class label of the ith observation latent variable we define the available information at time as yi and the online sequential updating and greedy search sugs algorithm is summarized next for completeness set and calculate for choose best class label for yi γi arg γi update the posterior distribution yi θγi using yi γi θγi where θh are the parameters of class yi is the observation density conditioned on class and is the number of classes created at time the algorithm sequentially allocates observations yi to classes based on maximizing the conditional posterior probability to calculate the posterior probability γi define the variables def def li yi yi πi γi from bayes rule γi li yi πi for here is considered fixed at this iteration and is not updated in fully bayesian manner according to the dirichlet process prediction the predictive probability of assigning observation yi to class is πi algorithm adaptive sequential updating and greedy search asugs input streaming data yi rate parameter set and calculate for do update concentration parameter yi πi choose best label for yi γi qh yi update posterior distribution end for θγi yi θγi where γl counts the number of observations labeled as class at time and is the concentration parameter adaptation of concentration parameter it is well known that the concentration parameter has strong influence on the growth of the number of classes antoniak our experiments show that 
in this sequential framework the choice of is even more critical choosing fixed as in the online sva algorithm of lin requires which is computationally prohibitive for data sets furthermore in the streaming data setting where no estimate on the data complexity exists it is impractical to perform although the parameter is handled from fully bayesian treatment in wang dunson grid of possible values can take say αl along with the prior distribution over them needs to be chosen in advance storage and updating of matrix of size and further marginalization is needed to compute γi at each iteration thus we propose an alternative method for choosing that works well in practice is simple to compute and has theoretical guarantees the idea is to start with prior distribution on that favors small and shape it into posterior distribution using the data define pi as the posterior distribution formed at time which will be used in asugs at time let denote the prior for an exponential distribution the dependence on and is trivial only at this first step then by bayes rule pi yi γi πi γi where πi γi is given in once this update is made after the selection of γi the to be used in the next selection step is the mean of the distribution pi αi as will be shown in section the distribution pi can be approximated by gamma distribution with shape parameter ki and rate parameter log under this approximation we have αi ki only requiring storage and update of one scalar parameter ki at each iteration the asugs algorithm is summarized in algorithm the selection step may be implemented by sampling the probability mass function qh the posterior update step can be efficiently performed by updating the hyperparameters as function of the streaming data for the case of conjugate distributions section derives these updates for the case of multivariate gaussian observations and conjugate priors for the parameters sequential inference under unknown mean unknown covariance we consider the general case of an unknown mean and covariance for each class the probabilistic model for the parameters of each class is given as yi co where denotes the multivariate normal distribution with mean and precision matrix and is the wishart distribution with degrees of freedom and scale matrix the follow joint distribution the model leads parameters rd to expressions for li yi due to conjugacy tzikas et al to calculate the class posteriors the conditional likelihoods of yi given assignment to class and the previous class assignments need to be calculated first the conditional likelihood of yi given assignment to class and the history is given by li yi yi θh dθh due to the conjugacy of the distributions the posterior θh always has the form θh µh ch th th vh where µh ch δh vh are hyperparameters that can be recursively computed as new samples come in the form of this recursive computation of the hyperparameters is derived in appendix for ease of interpretation and numerical stability we define σh vh vh as the σh inverse of the mean of the wishart distribution the matrix has the natural interpretation as the covariance matrix of class at iteration once the γi th component is chosen the parameter updates for the γi th class become γi cγi cγi cγi µγ γi cγi γi γi δγ δγ cγi yi cγi µγ yi γi if the starting matrix σh is positive definite then all the matrices σh will remain positive definite let us return to the calculation of the conditional likelihood by iterated integration it follows that rh ρd δh det σh li yi rh σh yi µh yi µh def where ρd def and rh 
ch detailed mathematical derivation of this conditional likelihood is included in appendix we remark that for the new class li has the form with the initial choice of hyperparameters growth rate analysis of number of classes stability in this section we derive model for the posterior distribution pn using approximations which will allow us to derive growth rates on the number of classes and the sequence of concentration parameters showing that the number of classes grows as kn for arbitarily small under certain mild conditions the probability density of the parameter is updated at the jth step in the following fashion innovation class chosen pj otherwise where only the factors in the update are shown the factors are absorbed by the normalization to probability density choosing the innovation class pushes mass toward infinity while choosing any other class pushes mass toward zero thus there is possibility that the innovation probability grows in undesired manner we assess the growth of the number of def innovations rn kn under simple assumptions on some likelihood functions that appear naturally in the asugs algorithm assuming that the initial distribution of is the distribution used at step is proportional to αrn αj we make use of the limiting relation theorem the following asymptotic behavior holds log log proof see appendix using theorem model for pn is αrn suitably normalized recognizing this as the gamma distribution with shape parameter rn and rate parameter log its rn mean is given by αn we use the mean in this form to choose class membership in alg this asymptotic approximation leads to very simple scalar update of the concentration parameter there is no need for discretization for tracking the evolution of continuous probability distributions on in our experiments this approximation is very accurate recall that the innovation class is labeled at the nth step the modeled updates randomly select previous class or innovation new class by sampling from the probability distrip bution qk γn note that mn where mn represents the number of members in class at time we assume the data follows the gaussian mixture distribution def pt πh σh where πh are the prior probabilities and µh σh are the parameters of the gaussian clusters define the probability density function which plays the role of the predictive distribution def ln so that the probabilities of choosing previous class or an innovation using equ are proporp tional to ln yn yn and ln yn tively if denotes the innovation probability at step then we have yn ln yn for some positive proportionality factor define the likelihood ratio lr at the beginning of stage as def ln ln conceptually the mixture represents modeled distribution fitting the currently observed data if all modes of the data have been observed it is reasonable to expect that is good model for future observations the lr ln yn is not large when the future observations are by in fact we expect pt as as discussed in section ln yn ln yn lemma the following bound holds min yn proof the result follows directly from after simple calculation the innovation random variable rn is described by the random process associated with the probabilities of transition τn rn τn rn def here ln is independent of and only depends on the initial choice of hyperparameters as discussed in sec the expectation of rn is majorized by the expectation of similar random process based on the def transition probability σn min instead of τn as appendix shows where the random sequence an is given by log the latter 
can be described as modification of polya urn process with selection probability σn the asymptotic behavior of rn and related variables is described in the following theorem theorem let τn be sequence of random variables τn satisfying τn for where an log and where the nonnegative random variables rn evolve according to assume the following for ln yn pt where is the divergence between distributions and then as rn op αn op logζ proof see appendix theorem bounds the growth rate of the mean of the number of class innovations and the concentration parameter αn in terms of the sample size and parameter the bounded lr and bounded kl divergence conditions of thm manifest themselves in the rate exponents of the experiments section shows that both of the conditions of thm hold for all iterations for some in fact assuming the correct clustering the mixture distribution converges to the true mixture distribution pt implying that the number of class innovations grows at most as and the sequence of concentration parameters is where can be arbitrarily small asymptotic normality of conditional likelihood in this section we derive an asymptotic expression for the conditional likelihood in order to gain insight into the of the algorithm we let πh denote the true prior probability of class using the bounds of the gamma function ρd in theorem from batir it follows that under normal convergence conditions of the algorithm with the pruning and merging steps included all classes will be correctly identified and populated with approximately πh observations at time thus the conditional class prior for each class converges to πh as in virtue of πi πh according rh ch op log to we expect as since πh also we expect πh as according to also from before ρd δh δh πh the parameter updates imply µh µh and σh σh as this follows from the strong law of large numbers as the updates are recursive implementations of the sample mean and sample covariance matrix thus the approximation to the conditional likelihood becomes πh lim li yi det σh yi σh yi det σh where we used uc ec the conditional likelihood corresponds to the multivariate gaussian distribution with mean µh and covariance matrix σh similar asymptotic normality result was recently obtained in tsiligkaridis forsythe for gaussian observations with von mises prior the asymptotics πh µh µh σh σh ln σh as imply that the mixture distribution in converges to the true gaussian mixture distribution pt of thus for any small we expect pt for all validating the assumption of theorem experiments we apply the asugs learning algorithm to synthetic example and to real data set to verify the stability and accuracy of our method the experiments show the value of adaptation of the dirichlet concentration parameter for online clustering and parameter estimation since it is possible that multiple clusters are similar and classes might be created due to outliers or due to the particular ordering of the streaming data sequence we add the pruning and merging step in the asugs algorithm as done in lin we compare asugs and with sugs sva and proposed in lin since it was shown in lin that sva and outperform the methods that perform iterative updates over the entire data set including collapsed gibbs sampling mcmc with and variational inference synthetic data set we consider learning the parameters of gaussian mixture each with equal variance of the training set was made up of iid samples and the test set was made up of iid samples the clustering results are shown in fig showing that the approaches 
are more stable than algorithms performs best and identifies the correct number of clusters and their parameters fig shows the data on the test set averaged over monte carlo trials the mean and variance of the number of classes at each iteration the approaches achieve higher than approaches asymptotically fig provides some numerical verification for the assumptions of theorem as expected the predictive likelihood converges to the true mixture distribution pt and the likelihood ratio li yi is bounded after enough samples are processed
[figure: clustering performance of sva asugs and on synthetic data set identifies the clusters correctly joint on synthetic data mean and variance of number of classes as function of iteration the likelihood values were evaluated on set of samples achieves the highest and has the lowest asymptotic variance on the number of classes]
[figure: likelihood ratio li yi yi left and between and true mixture distribution pt right for synthetic example see]
real data set we applied the online nonparametric bayesian methods for clustering image data we used the mnist data set which consists of training samples and test samples each sample is image of handwritten digit total of dimensions and we perform pca preprocessing to reduce dimensionality to dimensions as in kurihara et al we use only random subset consisting of random samples for training this training set contains data from all digits with an approximately uniform proportion fig shows the predictive over the test set and the mean images for clusters obtained using asugspm and respectively we note that achieves higher values and finds all digits correctly using only clusters while finds some digits using clusters
[figure: predictive on test set mean images for clusters found using and on mnist data set]
discussion although both sva and asugs methods have similar computational complexity and use decisions and information obtained from processing previous samples in order to decide on class innovations the mechanics of these methods are quite different asugs uses an adaptive motivated by asymptotic theory while sva uses fixed furthermore sva updates the parameters of all the components at each iteration in weighted fashion while asugs only updates the parameters of the cluster thus minimizing leakage to unrelated components the parameter of asugs does not affect performance as much as the threshold parameter of sva does which often leads to instability requiring lots of pruning and merging steps and increasing latency this is critical for large data sets or streaming applications because would be required to set appropriately we observe higher and better numerical stability for methods in comparison to sva the mathematical formulation of asugs allows for theoretical guarantees theorem and asymptotically normal predictive distribution conclusion we developed fast online clustering and parameter estimation algorithm for dirichlet process mixtures of gaussians capable of learning in single data pass motivated by asymptotics we proposed novel adaptive design for the concentration parameter and showed it leads to logarithmic growth rates on the number of classes through experiments on synthetic and real data sets we show our method achieves better performance and is as fast as other online learning dpmm methods references antoniak mixtures of dirichlet processes with
applications to bayesian nonparametric problems the annals of statistics batir inequalities for the gamma function archiv der mathematik blei and jordan variational inference for dirichlet process mixtures bayesian analysis daume fast search for dirichlet process mixture models in conference on artificial intelligence and statistics escobar and west bayesian density estimation and inference using mixtures journal of the american statistical association june fearnhead particle filters for mixture models with an uknown number of components statistics and computing kurihara welling and vlassis accelerated variational dirichlet mixture models in advances in neural information processing systems nips lin dahua online learning of nonparametric mixture models via sequential variational approximation in burges bottou welling ghahramani and weinberger eds advances in neural information processing systems pp curran associates neal bayesian mixture modeling in proceedings of the workshop on maximum entropy and bayesian methods of statistical analysis volume pp neal markov chain sampling methods for dirichlet process mixture models journal of computational and graphical statistics june rasmussen the infinite gaussian mixture model in advances in neural information processing systems pp mit press tsiligkaridis and forsythe sequential bayesian inference framework for blind frequency offset estimation in proceedings of ieee international workshop on machine learning for signal processing boston ma september tzikas likas and galatsanos the variational approximation for bayesian inference ieee signal processing magazine pp november wang and dunson fast bayesian inference in dirichlet process mixture models journal of computational and graphical statistics 
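The recursive hyperparameter updates referred to in the sequential-inference section of the ASUGS paper above follow the standard one-observation Normal-Wishart conjugate recursion. The form below is the textbook version of that recursion, stated here because the displayed update in the text did not survive extraction; the bookkeeping of the paper's δ_h parameter is not recoverable from the text, so this should be read as a sketch of the structure rather than a verbatim restatement.

```latex
% Standard one-observation Normal-Wishart update (textbook form) for a
% class-h posterior of the form
%   p(mu, Lambda | data) = N(mu | mu_h, (c_h Lambda)^{-1}) W(Lambda | V_h, v_h),
% applied when a new observation y is assigned to class h:
c_h' = c_h + 1, \qquad
\mu_h' = \frac{c_h\,\mu_h + y}{c_h + 1}, \qquad
v_h' = v_h + 1, \qquad
(V_h')^{-1} = V_h^{-1} + \frac{c_h}{c_h + 1}\,(y - \mu_h)(y - \mu_h)^{\top}

% With Sigma_h defined above as the inverse of the mean of the Wishart factor,
\Sigma_h' = \bigl(v_h'\, V_h'\bigr)^{-1}
```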
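The greedy selection step of SUGS/ASUGS and the adaptive concentration-parameter rule described above can be illustrated with a deliberately simplified sketch: a known, shared isotropic covariance (so each class predictive is a plain Gaussian rather than the Student-t form derived in the paper), a zero-mean Gaussian prior on class means, and the Gamma-approximation update alpha = (1 + k) / log(i). The "+1" offset, the use of log(i + 1) at the first step, and the simplified likelihood model are all assumptions of this sketch.

```python
# Minimal ASUGS-style sketch: adaptive concentration parameter plus greedy
# class assignment, with simplified Gaussian class predictives.
import numpy as np

class SimpleASUGS:
    def __init__(self, dim, var=1.0, prior_var=10.0):
        self.dim, self.var, self.prior_var = dim, var, prior_var
        self.counts, self.means = [], []   # per-class sufficient statistics
        self.i = 0                         # number of observations seen so far

    def _log_gauss(self, y, mean, var):
        return (-0.5 * np.sum((y - mean) ** 2) / var
                - 0.5 * self.dim * np.log(2.0 * np.pi * var))

    def step(self, y):
        self.i += 1
        k = len(self.counts)
        alpha = (1.0 + k) / np.log(self.i + 1.0)   # adaptive alpha (log(i+1) avoids /0)
        # Dirichlet-process prior weights: existing classes vs. innovation.
        prior = np.array(self.counts + [alpha], dtype=float)
        prior /= (self.i - 1 + alpha)
        # Conditional log-likelihoods for existing classes and for a new class.
        loglik = [self._log_gauss(y, m, self.var) for m in self.means]
        loglik.append(self._log_gauss(y, np.zeros(self.dim), self.var + self.prior_var))
        post = np.log(prior) + np.array(loglik)
        h = int(np.argmax(post))                   # greedy MAP assignment
        if h == k:                                 # innovation: open a new class
            self.counts.append(1)
            self.means.append(y.astype(float).copy())
        else:                                      # update the running class mean
            self.counts[h] += 1
            self.means[h] += (y - self.means[h]) / self.counts[h]
        return h

# Usage: stream samples one at a time from a three-component mixture.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(m, 1.0, size=(200, 2)) for m in (-4.0, 0.0, 4.0)])
rng.shuffle(data)
model = SimpleASUGS(dim=2)
labels = [model.step(y) for y in data]
print("classes found:", len(model.counts))
```

The full method replaces the Gaussian predictive with the closed-form Student-t expression, maintains the class parameters with the standard Normal-Wishart recursion, and adds pruning and merging of similar clusters.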
adaptive langevin thermostat for bayesian sampling xiaocheng university of edinburgh zhanxing university of edinburgh benedict leimkuhler university of edinburgh amos storkey university of edinburgh abstract monte carlo sampling for bayesian posterior inference is common approach used in machine learning the markov chain monte carlo procedures that are used are often analogues of associated stochastic differential equations sdes these sdes are guaranteed to leave invariant the required posterior distribution an area of current research addresses the computational benefits of stochastic gradient methods in this setting existing techniques rely on estimating the variance or covariance of the subsampling error and typically assume constant variance in this article we propose adaptive langevin thermostat that can effectively dissipate noise while maintaining desired target distribution the proposed method achieves substantial speedup over popular alternative schemes for machine learning applications introduction in machine learning applications direct sampling with the entire dataset is computationally infeasible for instance standard markov chain monte carlo mcmc methods as well as typical hybrid monte carlo hmc methods require the calculation of the acceptance probability and the creation of informed proposals based on the whole dataset in order to improve the computational efficiency number of stochastic gradient methods have been proposed in the setting of bayesian sampling based on random and much smaller subsets to approximate the likelihood of the whole dataset thus substantially reducing the computational cost in practice welling and teh proposed the stochastic gradient langevin dynamics sgld combining the ideas of stochastic optimization and traditional brownian dynamics with sequence of stepsizes decreasing to zero fixed stepsize is often adopted in practice which is the choice in this article as in vollmer et al where modified sgld msgld was also introduced that was designed to reduce the sampling bias sgld generates samples from first order brownian dynamics and thus with fixed timestep one can show that it is unable to dissipate excess noise in gradient approximations while maintaining the desired invariant distribution stochastic gradient hamiltonian monte carlo sghmc method was proposed by chen et al which relies on second order langevin dynamics and incorporates diffusion matrix that is intended to effectively offset the stochastic perturbation of the gradient however it is difficult to accommodate the additional diffusion term in practice the first and second authors contributed equally and the listed author order was decided by lot moreover as pointed out in poor estimation of it may have significant adverse influence on the sampling of the target distribution for example the effective system temperature may be altered the thermostat idea which is widely used in molecular dynamics was recently adopted in the stochastic gradient thermostat sgnht by ding et al in order to adjust the kinetic energy during simulation in such way that the canonical ensemble is preserved so that prescribed constant temperature distribution is maintained in fact the sgnht method is essentially equivalent to the adaptive langevin thermostat proposed earlier by jones and leimkuhler in the molecular dynamics setting see for discussions despite the substantial interest generated by these methods the mathematical foundation for stochastic gradient methods has been incomplete the underlying dynamics of 
the sgnht method was taken up by leimkuhler and shang together with the design of discretization schemes with high effective order of accuracy sgnht methods are designed based on the assumption of constant noise variance in this article we propose adaptive langevin ccadl thermostat that can handle noise improving both robustness and reliability in practice and which can effectively speed up the convergence to the desired invariant distribution in machine learning applications the rest of the article is organized as follows in section we describe the setting of bayesian sampling with noisy gradients and briefly review existing techniques section considers the construction of the novel ccadl method that can effectively dissipate noise while maintaining the correct distribution various numerical experiments are performed in section to verify the usefulness of ccadl in wide range of machine learning applications finally we summarize our findings in section bayesian sampling with noisy gradients in the typical setting of bayesian sampling one is interested in drawing states from posterior distribution defined as where rnd is the parameter vector of interest denotes the entire dataset and and are the likelihood and prior distributions respectively we introduce potential energy function by defining exp where is positive parameter and can be interpreted as being proportional to the reciprocal temperature in an associated physical system kb kb is the boltzmann constant and is the temperature in practice is often set to be unity for notational simplicity taking the logarithm of yields log assuming the data are independent and identically distributed the logarithm of the likelihood can be calculated as log log xi where is the size of the entire dataset however as already mentioned it is computationally infeasible to deal with the entire dataset at each timestep as would typically be required in mcmc and hmc methods instead in order to improve the efficiency random and much smaller subset is preferred in stochastic gradient methods in which the likelihood of the dataset for given parameters is approximated by nx log log xri where xri represents random subset of thus the noisy potential energy can be written as nx log xri where the negative gradient of the potential is referred to as the noisy force our goal is to correctly sample the gibbs distribution exp as in the gradient noise is assumed to be gaussian with mean zero and unknown variance in which case one may rewrite the noisy force as where typically is diagonal matrix represents the covariance matrix of the noise and is vector of standard normal random variables note that here is actually equivalent to in typical setting of numerical integration stepsize has hς and therefore assuming constant covariance matrix where is the identity matrix the sgnht method by ding et al has the following underlying dynamics written as standard stochastic differential equation sde system dθ pdt dp dwa dξ kb dt where colloquially dw and dwa represent vectors of independent wiener increments and are often informally denoted by dti the coefficient represents the strength of artificial noise added into the system to improve ergodicity and which can be termed the effective friction is positive parameter and proportional to the variance of the noise the auxiliary variable is governed by device via negative feedback mechanism when the instantaneous temperature average kinetic energy per degree of freedom calculated as kb pt is below the target temperature the dynamical 
friction would decrease allowing an increase of temperature while would increase when the temperature is above the target is coupling parameter which is referred to as the thermal mass in the molecular dynamics setting proposition see jones and leimkuhler the sgnht method preserves the modified gibbs stationary distribution exp exp where is the normalizing constant pt is the hamiltonian and proposition tells us that the sgnht method can adaptively dissipate excess noise pumped into the system while maintaining the correct distribution the variance of the gradient noise does not need to be known priori as long as is constant the auxiliary variable will be able to automatically find its mean value on the fly however with covariance matrix the sgnht method would not produce the required target distribution ding et al claimed that it is reasonable to assume the covariance matrix is constant when the size of the dataset is large in which case the variance of the posterior of is small the magnitude of the posterior variance does not actually relate to the constancy of the however in general is not constant simply assuming the of the can have significant impact on the performance of the method most notably the stability measured by the largest usable stepsize therefore it is essential to have an approach that can handle noise in the following section we propose thermostat that can effectively dissipate noise while maintaining the target stationary distribution adaptive langevin thermostat as mentioned in the previous section the sgnht method can only dissipate noise with constant covariance matrix when the covariance matrix becomes in general covariance matrix does not imply the required thermal equilibrium the system can not be expected to converge to the desired invariant distribution typically resulting in poor estimation of functions of parameters of interest in fact in that case it is not clear whether or not there exists an invariant distribution at all in order to construct system that preserves the canonical distribution we suggest adding suitable damping viscous term to effectively dissipate the gradient noise to this end we propose the following adaptive langevin ccadl thermostat dθ pdt dp hς βς dwa dξ pt kb dt proposition the ccadl thermostat preserves the modified gibbs stationary distribution exp exp proof the equation corresponding to is ρt pρ pρ pt kb just insert into the operator to see that it vanishes the incorporation of the covariance matrix in is intended to offset the covariance matrix coming from the gradient approximation however in practice one does not know priori thus instead one must estimate during the simulation task which will be addressed in section this procedure is related to the method used in the sghmc method proposed by chen et al which uses dynamics of the following form dθ pdt dp hς dwa it can be shown that the sghmc method preserves the gibbs canonical distribution ρβ exp although both ccadl and sghmc preserve their respective invariant distributions let us note several advantages of the former over the latter in practice ccadl and sghmc both require estimation of the covariance matrix during simulation which can be costly in high dimension in numerical experiments we have found that simply using the diagonal of the covariance matrix at significantly reduced computational cost works quite well in ccadl by contrast it is difficult to find suitable value of the parameter in sghmc since one has to make sure the matrix is positive one may attempt to use large 
value of the effective friction or a small stepsize. However, a large friction would essentially reduce SGHMC to SGLD, which is not desirable as pointed out in the literature, while an extremely small stepsize would significantly impact the computational efficiency. (ii) Estimation of the covariance matrix unavoidably introduces additional noise in both CCAdL and SGHMC. Nonetheless, CCAdL can still effectively control the system temperature, maintaining the correct distribution of the momenta, due to the use of the stabilizing thermostat control, while in SGHMC a poor estimate of the covariance matrix may lead to significant deviations of the system temperature as well as of the distribution of the momenta, resulting in poor sampling of the parameters of interest.

Covariance estimation of noisy gradients. Under the assumption that the noise of the stochastic gradient follows a normal distribution, we apply a method similar to earlier work to estimate the covariance matrix associated with the noisy gradient. Writing the per-example gradient of the log likelihood as g_i(theta), and assuming that the size n of the subset is large enough for the central limit theorem to hold, the subsampled mean gradient is approximately normal with mean E[g(theta)] and covariance I(theta)/n, where I(theta) = cov(g(theta)) is the covariance of the gradient at theta. Comparing the noisy stochastic gradient based on the current subset with the clean full gradient then shows that the gradient noise has mean zero and covariance proportional to I(theta). Assuming I(theta) does not change dramatically over time, we use a moving-average update to estimate it, I_t = (1 - kappa_t) I_{t-1} + kappa_t Ihat_t, where kappa_t is a step-dependent weight and Ihat_t is the empirical covariance of the per-example gradients around the mean gradient of the log likelihood computed from the current subset. As proved in earlier work, this estimator has a well-characterized convergence order.

Algorithm (CCAdL thermostat): given the stepsize, friction, thermal mass, and weights kappa_t, initialize the state and then, at each step, estimate the gradient-noise covariance via the moving-average update above and perform the momentum, position, and thermostat updates in turn.

As already mentioned, estimating the full covariance matrix is computationally infeasible in high dimension; however, we have found that employing a diagonal approximation of the covariance matrix, estimating the variance only along each dimension of the noisy gradient, works quite well in practice, as demonstrated in the experiments below. The procedure of the CCAdL method is summarized in the algorithm above, where the remaining parameters are set to be consistent with the original implementation of SGNHT. Note that this is a simple first-order (in terms of the stepsize) algorithm; a recent article has introduced schemes with a higher order of accuracy, which can improve accuracy, but our interest here is in the direct comparison of the underlying machinery of SGHMC, SGNHT, and CCAdL, so we avoid further modifications and enhancements related to timestepping at this stage. In the following section we compare the newly established CCAdL method with SGHMC and SGNHT on various machine learning tasks to demonstrate the benefits of CCAdL in Bayesian sampling with noisy gradients.

Numerical experiments.

Bayesian inference for a Gaussian distribution. We first compare the performance of the newly established CCAdL method with SGHMC and SGNHT on a simple task using synthetic data: Bayesian inference of both the mean and variance of a normal distribution. We apply the same experimental setting as in earlier work: we generated samples from a standard normal distribution, used a normal likelihood for the observations, and assigned a normal-gamma prior to the mean and precision. The corresponding posterior distribution is then another normal-gamma distribution whose parameters are determined by the sample size and the sufficient statistics of the data. A random subset of fixed size was selected at each timestep to approximate the full gradient, resulting in stochastic gradients with respect to the mean and precision. It can be seen that the variance of the stochastic gradient noise is no longer constant and actually depends on the size of the subset and the values of the mean and precision in each iteration.
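To make the procedure concrete, the following is a minimal sketch of one step of an SGNHT-style thermostat with an optional CCAdL-style covariance correction, assuming the standard Euler-type discretization of Ding et al. with k_B T = 1 and a diagonal moving-average estimate of the gradient-noise variance as described above. The function names, the splitting order, and the placement of the correction term are simplifications for illustration, not the authors' implementation.

```python
import numpy as np

def estimate_noise_var(per_example_grads, prev_var, kappa):
    """Moving-average estimate of the diagonal gradient-noise variance from the
    per-example gradients of the current minibatch (illustrative sketch)."""
    emp_var = per_example_grads.var(axis=0)          # empirical diagonal covariance
    return (1.0 - kappa) * prev_var + kappa * emp_var

def ccadl_step(theta, p, xi, noisy_grad, h, A, mu, noise_var=None):
    """One Euler-type step of an SGNHT/CCAdL-style thermostat (illustrative only).

    theta, p  : parameter and momentum vectors
    xi        : scalar thermostat variable (adaptive friction)
    noisy_grad: stochastic gradient of the potential U at theta
    h, A, mu  : stepsize, injected-noise strength, thermal mass
    noise_var : optional diagonal estimate of the gradient-noise covariance;
                when given, an extra viscous damping term dissipates that noise
                (the covariance control described in the text).
    """
    d = theta.size
    damping = xi * p
    if noise_var is not None:
        damping = damping + 0.5 * h * noise_var * p  # covariance-controlled damping
    p = p - h * noisy_grad - h * damping + np.sqrt(2.0 * A * h) * np.random.randn(d)
    theta = theta + h * p
    xi = xi + (h / mu) * (p @ p / d - 1.0)           # drive kinetic temperature toward k_B T = 1
    return theta, p, xi
```

In a full run one would refresh `noise_var` with `estimate_noise_var` at every iteration before calling `ccadl_step`, in the order the algorithm above prescribes.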
This directly violates the constant noise variance assumption of SGNHT, while CCAdL adjusts to the varying noise variance. The marginal distributions of the two parameters obtained from the various methods, with different combinations of stepsize and friction, were compared and plotted in the figure below, with the accompanying table containing the corresponding root mean square error (RMSE) of the distribution and the autocorrelation time computed from the samples. In most of the cases both SGNHT and CCAdL easily outperform the SGHMC method, possibly due to the presence of the thermostat device, with SGHMC only showing superiority for a small value of the stepsize and a large value of the friction, neither of which is desirable in practice, as discussed in the previous section. Between SGNHT and the newly proposed CCAdL method, the latter achieves better performance in each of the cases investigated, highlighting the importance of the covariance control with parameter-dependent noise.

Figure: comparisons of the marginal distribution density of the two parameters (top and bottom rows) for the true posterior, SGHMC, SGNHT, and CCAdL, with various values of the stepsize and friction indicated in each column; the peak region is highlighted in the inset.

Table: comparisons of RMSE of the distribution and autocorrelation time for SGHMC, SGNHT, and CCAdL in Bayesian inference of the mean and variance of a Gaussian distribution.

Bayesian logistic regression. We then consider a Bayesian logistic regression model trained on the benchmark MNIST dataset for binary classification of two digits, using a subset of the training data, a held-out test set, and a random subset of the original features. We used the logistic likelihood for the labels and a smooth exponential-form prior on the weights, and a subset of fixed size was used at each timestep. Since the dimensionality of this problem is not that high, full covariance estimation was used for CCAdL. We investigate in the figure below (top row) the convergence speed of each method by measuring the test log likelihood, using the posterior mean, against the number of passes over the entire dataset. CCAdL displays significant improvements over SGHMC and SGNHT for different values of the stepsize and friction: CCAdL converges much faster than the other two, which also indicates its faster mixing and shorter burn-in period. CCAdL also shows robustness across different values of the effective friction, whereas SGHMC and SGNHT rely on a relatively large value of the friction, especially the SGHMC method, in which the friction is intended to dominate the gradient noise. To compare the quality of the samples obtained from each method, the bottom row of the figure plots the two-dimensional marginal posterior distribution in two randomly selected dimensions, based on samples collected from each method after the burn-in period (we start to collect samples when the test log likelihood stabilizes). The true reference distribution was obtained by a sufficiently long run of standard HMC; we carried out multiple independent runs of standard HMC and found there was no visible variation between these runs, which justifies its use as the true reference distribution. Again, CCAdL shows much better performance than SGHMC and SGNHT. Note that the contour of SGHMC does not even fit in the region of the plot, and in fact it shows significant deviation even in the estimation of the mean.

Figure: comparisons of Bayesian logistic regression for the various methods on the MNIST dataset of two digits with various values of the stepsize and friction. Top row: test log likelihood, using the posterior mean, against the number of passes over the entire dataset. Bottom row: marginal posterior distribution in two randomly selected dimensions, with the stepsize fixed, based on samples collected from each method after the burn-in period (samples are collected once the test log likelihood stabilizes); the magenta circle is the true reference posterior mean obtained from standard HMC, crosses represent the sample means computed from the various methods, and ellipses represent contours covering a fixed probability mass. The contour of SGHMC is well beyond the scale of the plot, especially in the large-stepsize regime, in which case it is not included.

Discriminative restricted Boltzmann machine (DRBM). The DRBM is a classifier, and the gradient of its discriminative objective can be explicitly computed; due to limited space we refer the reader to the original reference for more details. We trained a DRBM on several datasets from a standard dataset collection, including letter and SensIT Vehicle (acoustic); detailed information on these datasets is presented in the table below. We selected the number of hidden units following earlier work to achieve their best results. Since the dimension of the parameter vector is relatively high, we used only the diagonal covariance matrix estimation for CCAdL to significantly reduce the computational cost, estimating the variance only along each dimension. The size of the subset was chosen to obtain a reasonable variance estimate. For each dataset we chose the first portion of the total number of passes over the entire dataset as the burn-in period and collected the remaining samples for prediction.

Table: datasets used in the DRBM experiments with the corresponding parameter configurations (number of classes, features, hidden units, and total number of parameters for the letter and acoustic datasets, among others).

The error rates computed by the various methods on the test set, using the posterior mean, against the number of passes over the entire dataset are plotted in the figure below. It can be observed that SGHMC and SGNHT only work well with a large value of the effective friction, which corresponds to a strong random-walk effect and thus slows down the convergence. On the contrary, CCAdL works reliably and much better than the other two across a wide range of stepsizes and frictions and, more importantly, in the large-stepsize regime, which speeds up the convergence rate in relation to the computational work performed. It can easily be seen that the performance of SGHMC relies heavily on using a small value of the stepsize and a large value of the friction, which significantly limits its usefulness in practice.

Figure: comparisons of the DRBM on the datasets (top row: letter; bottom row: acoustic; a third dataset in the middle row) with various values of the stepsize and friction indicated; test error rates of the various methods, using the posterior mean, against the number of passes over the entire dataset.

Conclusions and future work. In
this article we have proposed novel ccadl formulation that can effectively dissipate noise while maintaining desired invariant distribution ccadl combines ideas of sghmc and sgnht from the literature but achieves significant improvements over each of these methods in practice the additional error introduced by covariance estimation is expected to be small in relative sense substantially smaller than the error arising from the noisy gradient our findings have been verified in machine learning applications in particular we have consistently observed that sghmc relies on small stepsize and large friction which significantly reduces its usefulness in practice as discussed the techniques presented in this article could be of use in more general settings of bayesian sampling and optimization which we leave for future work naive nonsymmetric splitting method has been applied for ccadl for fair comparison in this article however we point out that optimal design of splitting methods in ergodic sde systems has been explored recently in the mathematics community moreover it has been shown in that certain type of symmetric splitting method for the method with clean full gradient inherits the superconvergence property fourth order convergence to the invariant distribution for configurational quantities recently demonstrated in the setting of langevin dynamics we leave further exploration of this direction in the context of noisy gradients for future work references abdulle vilmart and zygalakis long time accuracy of splitting methods for langevin dynamics siam journal on numerical analysis ahn korattikara and welling bayesian posterior sampling via stochastic gradient fisher scoring in proceedings of the international conference on machine learning pages brooks gelman jones and meng handbook of markov chain monte carlo crc press chen fox and guestrin stochastic gradient hamiltonian monte carlo in proceedings of the international conference on machine learning pages ding fang babbush chen skeel and neven bayesian sampling using stochastic gradient thermostats in advances in neural information processing systems pages duane kennedy pendleton and roweth hybrid monte carlo physics letters frenkel and smit understanding molecular simulation from algorithms to applications second edition academic press hoover computational statistical mechanics studies in modern thermodynamics elsevier science horowitz generalized guided monte carlo algorithm physics letters jones and leimkuhler adaptive stochastic methods for sampling driven molecular systems the journal of chemical physics larochelle and bengio classification using discriminative restricted boltzmann machines in proceedings of the international conference on machine learning pages leimkuhler and matthews rational construction of stochastic numerical methods for molecular sampling applied mathematics research express leimkuhler and matthews molecular dynamics with deterministic and stochastic numerical methods springer leimkuhler matthews and stoltz the computation of averages from equilibrium and nonequilibrium langevin molecular dynamics ima journal of numerical analysis leimkuhler and shang adaptive thermostats for noisy gradient systems siam journal on scientific computing metropolis rosenbluth rosenbluth teller and teller equation of state calculations by fast computing machines the journal of chemical physics unified formulation of the constant temperature molecular dynamics methods the journal of chemical physics robbins and monro stochastic 
approximation method annals of mathematical statistics robert and casella monte carlo statistical methods second edition springer vollmer zygalakis and teh asymptotic properties of stochastic gradient langevin dynamics arxiv preprint welling and teh bayesian learning via stochastic gradient langevin dynamics in proceedings of the international conference on machine learning pages 
robust portfolio optimization fang han department of biostatistics johns hopkins university baltimore md fhan huitong qiu department of biostatistics johns hopkins university baltimore md han liu department of operations research and financial engineering princeton university princeton nj hanliu brian caffo department of biostatistics johns hopkins university baltimore md bcaffo abstract we propose robust portfolio optimization approach based on quantile statistics the proposed method is robust to extreme events in asset returns and accommodates large portfolios under limited historical data specifically we show that the risk of the estimated portfolio converges to the oracle optimal risk with parametric rate under weakly dependent asset returns the theory does not rely on higher order moment assumptions thus allowing for asset returns moreover the rate of convergence quantifies that the size of the portfolio under management is allowed to scale exponentially with the sample size of the historical data the empirical effectiveness of the proposed method is demonstrated under both synthetic and real stock data our work extends existing ones by achieving robustness in high dimensions and by allowing serial dependence introduction markowitz analysis sets the basis for modern portfolio optimization theory however the analysis has been criticized for being sensitive to estimation errors in the mean and covariance matrix of the asset returns compared to the covariance matrix the mean of the asset returns is more influential and harder to estimate therefore many studies focus on the global minimum variance gmv formulation which only involves estimating the covariance matrix of the asset returns estimating the covariance matrix of asset returns is challenging due to the high dimensionality and of asset return data specifically the number of assets under management is usually much larger than the sample size of exploitable historical data on the other hand extreme events are typical in financial asset prices leading to asset returns to overcome the curse of dimensionality structured covariance matrix estimators are proposed for asset return data considered estimators based on factor models with observable factors studied covariance matrix estimators based on latent factor models proposed to shrink the sample covariance matrix towards highly structured covariance matrices including the identity matrix order autoregressive covariance matrices and covariance matrix estimators these estimators are commonly based on the sample covariance matrix sub gaussian tail assumptions are required to guarantee consistency for data robust estimators of covariance matrices are desired classic robust covariance matrix estimators include minimum volume ellipsoid mve and minimum ance determinant mcd estimators and estimators based on data outlyingness and depth these estimators are specifically designed for data with very low dimensions and large sample sizes for generalizing the robust estimators to high dimensions proposed the orthogonalized ogk estimator which extends estimator by the eigenvalues studied shrinkage estimators based on tyler however although ogk is computationally tractable in high dimensions consistency is only guaranteed under fixed dimension the shrunken tylor involves iteratively inverting large matrices moreover its consistency is only guaranteed when the dimension is in the same order as the sample size the aforementioned robust estimators are analyzed under independent data points their 
performance under time series data is questionable in this paper we build on scatter estimator and propose robust portfolio optimization approach our contributions are in three aspects first we show that the proposed method accommodates high dimensional data by allowing the dimension to scale exponentially with sample size secondly we verify that consistency of the proposed method is achieved without any tail conditions thus allowing for asset return data thirdly we consider weakly dependent time series and demonstrate how the degree of dependence affects the consistency of the proposed method background in this section we introduce the notation system and provide review on the constrained portfolio optimization that will be exploited in this paper notation let vd be real vector and mjk be matrix with mjk as the entry for we define the vector norm of as pd kvkq and the vector norm of as let the matrix qp max norm of be kmkmax maxjk and the frobenius norm be kmkf jk mjk let xd and yd be two random vectors we write if and are identically distributed we use to denote vectors with at every entry constrained gmv formulation under the gmv formulation found that imposing constraint improves portfolio efficiency relaxed the constraint by constraint and showed that portfolio efficiency can be further improved let rd be random vector of asset returns portfolio is characterized by vector of investment allocations wd among the assets the constrained gmv portfolio optimization can be formulated as min wt σw here is the covariance matrix of is the budget constraint and is the grossexposure constraint is called the gross exposure constant which controls the percentage of long and short positions allowed in the portfolio the optimization problem can be converted into quadratic programming problem and solved by standard software method in this section we introduce the portfolio optimization approach let be random variable with distribution function and zt be sequence of observations from for constant we define the of and zt to be inf where min scatter matrix is defined to be any matrix proportional to the covariance matrix by constant here are the order statistics of zt we say is unique if there is unique if there exists unique exists unique such that we say zt such that following the estimator qn we define the population and sample scales to be and bq zt zt here ze is an independent copy of based on and we can further define robust scatter matrices for asset returns in detail let xd rd be random vector representing the returns of assets and xt be sequence of observations from where xt xtd we define the population and sample scatter matrices qne to be bq bq rq rq jk and rjk are given by where the entries of rq and bq xtj rjj xj jj rq jk tj tk tj tk jk is since bq can be computed using log time the computational complexity of log since in practice can be computed almost as efficiently as the sample covariance matrix which has complexity let wd be the vector of investment allocations among the assets for matrix we define risk function rd by wt mw when has covariance matrix var wt is the variance of the portfolio return wt and is employed as the objected function in the gmv formulation however estimating is difficult due to the heavy tails of asset returns in this paper we adopt rq as robust alternative to the risk metric and consider the following oracle portfolio optimization problem wopt argmin rq here is the constraint introduced in section in practice rq is onto the cone unknown and has to be estimated 
for convexity of the risk function we project of positive definite matrices argminr max sλ mt λmin id λmax id the optimization here λmin and λmax set the lower and upper bounds for the eigenvalues of problem can be solved by projection and contraction algorithm we summarize the we formulate the empirical robust portfolio algorithm in the supplementary material using optimization by opt argmin remark the robust portfolio optimization approach involves three parameters λmin λmax and empirically setting λmin and λmax proves to work well is typically provided by investors for controlling the percentages of short positions when choice is desired we refer to for approach remark the rationale behind the positive definite projection lies in two aspects first in order that the portfolio optimization is convex and well conditioned positive definite matrix with lower bounded eigenvalues is needed this is guaranteed by setting λmin secondly the projection is more robust compared to the ogk estimate ogk induces positive definiteness by the eigenvalues using the variances of the principal components robustness is lost when the data possibly containing outliers are projected onto the principal directions for estimating the principal components remark we adopt the quantile in the definitions of and bq to achieve breakdown point however we note that our methodology and theory carries through if is replaced by any absolute constant theoretical properties in this section we provide theoretical analysis of the proposed portfolio optimization approach for opt based on an estimate of rq the next lemma shows that the error an optimized portfolio opt rq and wopt rq is essentially related to the estimation error in between the risks opt lemma let be the solution to min for an arbitrary matrix then we have opt rq wopt rq kr rq kmax opt where is the solution to the oracle portfolio optimization problem and is the grossexposure constant opt rq which relates to the rate of convergence next we derive the rate of convergence for kmax to this end we first introduce dependence condition on the asset return series in kr xt and definition let xt be stationary process denote by xt the generated by xt and xt respectively the coefficient is defined by sup the process xt is if and only if condition xt rd is stationary process such that for any xtj xtj xtk and xtj xtk are processes satisfying for any and some constant the parameter determines the rate of decay in and characterizes the degree of dependence in xt next we introduce an identifiability condition on the distribution function of the asset returns ed be an independent copy of for any condition let ej ej ek and let and be the distribution functions of xj xk we assume there exist constants and such that inf dy for any condition guarantees the identifiability of the quantiles and is standard in the literature on quantile statistics based on conditions and we can present the rates of convergence and for theorem let xt be an absolutely continuous stationary process satisfying conditions and suppose log as then for any and large enough with probability no smaller than we have rq kmax rt kr here the rate of convergence rt is defined by log log rt max log log io where max xj qσ xj xk xj xk and moreover if sλ for sλ defined in we further have rq kmax kr the implications of theorem are as follows when the parameters and σmax do not scale with the rate of convergence reduces to op log thus the number of assets under management is allowed to scale exponentially with sample size 
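Before continuing with the analysis, here is a brief sketch of how the quantile-based scatter estimate and the resulting portfolio could be computed in practice. It assumes the pairwise-difference quantile scale and a Gnanadesikan-Kettenring-style polarization for the off-diagonal entries, as suggested by the definitions in the Method section, substitutes simple eigenvalue clipping for the max-norm projection onto S_lambda used in the paper, and uses cvxpy for the gross-exposure constrained program; all names, the quantile level, and the bounds are illustrative choices, not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def q_scale(z, q=0.25):
    """Quantile-based scale: the q-th quantile of pairwise absolute differences
    |z_i - z_j|, i < j (quadratic-time sketch; faster selection algorithms exist)."""
    diffs = np.abs(z[:, None] - z[None, :])
    return np.quantile(diffs[np.triu_indices(len(z), k=1)], q)

def q_scatter(X, q=0.25):
    """Quantile-based scatter matrix via a polarization identity:
    R_jj = scale(X_j)^2, R_jk = (scale(X_j + X_k)^2 - scale(X_j - X_k)^2) / 4."""
    d = X.shape[1]
    R = np.zeros((d, d))
    for j in range(d):
        R[j, j] = q_scale(X[:, j], q) ** 2
    for j in range(d):
        for k in range(j + 1, d):
            s_plus = q_scale(X[:, j] + X[:, k], q)
            s_minus = q_scale(X[:, j] - X[:, k], q)
            R[j, k] = R[k, j] = (s_plus ** 2 - s_minus ** 2) / 4.0
    return R

def project_eigenvalues(R, lam_min=1e-4, lam_max=1e4):
    """Surrogate for the positive-definite projection: clip eigenvalues into
    [lam_min, lam_max] (the paper instead projects in the max norm)."""
    vals, vecs = np.linalg.eigh((R + R.T) / 2.0)
    return (vecs * np.clip(vals, lam_min, lam_max)) @ vecs.T

def gmv_portfolio(R, c=1.5):
    """Gross-exposure constrained GMV: minimize w'Rw s.t. sum(w)=1, ||w||_1 <= c."""
    d = R.shape[0]
    w = cp.Variable(d)
    prob = cp.Problem(cp.Minimize(cp.quad_form(w, R)),
                      [cp.sum(w) == 1, cp.norm1(w) <= c])
    prob.solve()
    return w.value
```

In this sketch one would call `q_scatter` on the return matrix, pass the result through `project_eigenvalues`, and hand the projected matrix to `gmv_portfolio`, mirroring the plug-in pipeline analyzed above.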
Compared to similar rates of convergence obtained for covariance-based estimators, we do not require any moment or tail conditions, thus accommodating heavy-tailed asset return data. The effect of serial dependence on the rate of convergence is characterized by the mixing parameter: as the degree of dependence increases, the corresponding factor in r_T grows, inflating the rate; the dimension is still allowed to scale with the sample size such that log d = o(T). The rate of convergence r_T is inversely related to the lower bound on the marginal density functions around the quantiles: when this bound is small, the distribution functions are flat around the quantiles, making the population quantiles harder to estimate.

Combining the lemma and theorem above, we obtain a rate of convergence for the risk of the optimized portfolio.

Theorem. Let X_t be an absolutely continuous stationary process satisfying the two conditions above, and suppose that log d / T goes to zero as T grows and that R_Q lies in the projection set S_lambda. Then, for T large enough, with high probability, the excess risk R_Q(w_hat_opt) - R_Q(w_opt) is of order r_T, up to a constant factor involving the gross-exposure constant, where r_T is defined above. The theorem shows that the risk of the estimated portfolio converges to the oracle optimal risk at the parametric rate r_T; the number of assets is allowed to scale exponentially with the sample size, and the rate of convergence does not rely on any tail conditions on the distribution of the asset returns.

For the rest of this section we build the connection between the proposed robust portfolio optimization and its covariance-based counterpart; specifically, we show that they are consistent under the elliptical model.

Definition. A random vector X in R^d follows an elliptical distribution with location mu in R^d and scatter S if and only if there exist a nonnegative random variable xi, a matrix A of rank q with A A^T = S, and a random vector U, independent from xi and uniformly distributed on the unit sphere in R^q, such that X = mu + xi A U. We write X ~ EC_d(mu, S, xi); xi is called the generating variate. Commonly used elliptical distributions include the Gaussian and t distributions. Elliptical distributions have been widely used for modeling financial return data, since they naturally capture many stylized properties, including heavy tails and tail dependence.

The next theorem relates the quantile-based scale and risk to their covariance-based counterparts under the elliptical model.

Theorem. Let X ~ EC_d(mu, S, xi) be an absolutely continuous elliptical random vector and let X~ be an independent copy of X. Then the robust risk R_Q(w) is proportional to w^T S w, for a constant depending only on the distribution of xi. Moreover, if xi has a finite second moment, then R_Q(w) is proportional to the standard risk w^T Sigma w, where Sigma = cov(X) is the covariance matrix of X and the constant of proportionality c_Q is determined by quantiles of coordinate differences relative to their variances; the latter statements hold when the relevant variances are nonzero.

By this theorem, under the elliptical model, minimizing the robust risk metric R_Q is equivalent to minimizing the standard risk metric; thus the robust portfolio optimization is equivalent to its covariance-based counterpart at the population level. Plugging this equivalence into the earlier rate result leads to the following theorem.

Theorem. Let X_t be an absolutely continuous stationary process satisfying the two conditions above, suppose that X_t follows an elliptical distribution with covariance matrix Sigma, and suppose log d / T goes to zero as T grows. Then the excess standard risk of w_hat_opt relative to w_opt is again of order r_T, up to constants involving the gross-exposure constant and c_Q. Thus, under the elliptical model, the portfolio w_hat_opt obtained from the robust portfolio optimization also attains a parametric rate of convergence for the standard risk.

Experiments. In this section we investigate the empirical performance of the proposed portfolio optimization approach. We first demonstrate the robustness of the proposed approach using synthetic data, and we then simulate portfolio management using Standard and Poor's stock index data.
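As a concrete illustration of the elliptical model defined above, and of the synthetic returns used in the experiments that follow, here is a small sketch that draws samples X = mu + xi A U whose covariance matrix matches a given Sigma. The log-normal generating variate, its parameters, and the Monte Carlo scaling of the scatter are assumptions chosen for illustration only.

```python
import numpy as np

def sample_elliptical(n, mu, Sigma, gen_variate, rng=None):
    """Draw n samples X = mu + xi * A * U from an elliptical distribution whose
    covariance matrix equals Sigma (illustrative sketch)."""
    rng = np.random.default_rng(0) if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    d = len(mu)
    xi = gen_variate(n, rng)                           # nonnegative generating variate
    # cov(X) = E[xi^2]/d * A A^T, so scale the scatter to make cov(X) = Sigma.
    scale = d / np.mean(gen_variate(100000, rng) ** 2)
    A = np.linalg.cholesky(scale * Sigma)
    g = rng.standard_normal((n, d))
    U = g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform on the unit sphere
    return mu + xi[:, None] * (U @ A.T)

# Example generating variate: a log-normal, with placeholder parameters.
lognormal_xi = lambda n, rng: rng.lognormal(mean=0.0, sigma=1.0, size=n)
```

Setting `gen_variate=lognormal_xi` reproduces the heavy-tailed elliptical scenario described below, while a chi-distributed generating variate would recover the Gaussian case.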
The proposed portfolio optimization approach, QNE, is compared with three competitors. These competitors are constructed by replacing the covariance matrix in the gross-exposure constrained program by commonly used covariance matrix estimators: OGK, the orthogonalized Gnanadesikan-Kettenring estimator, constructs a pilot scatter matrix estimate using a robust measure of scale and then recomputes the eigenvalues using the variances of the principal components; Factor, the principal factor estimator, iteratively solves for the specific variances and the factor loadings; Shrink, the shrinkage estimator, shrinks the sample covariance matrix towards a one-factor covariance estimator.

Synthetic data. Following earlier work, we construct the covariance matrix of the asset returns using a three-factor model, x_j = b_j^T f + eps_j, where x_j is the return of the j-th stock, b_{jk} is the loading of the j-th stock on the k-th factor f_k, and eps_j is idiosyncratic noise independent of the three factors. Under this model the covariance matrix of the stock returns is given by Sigma = B Sigma_f B^T + diag(sigma_1^2, ..., sigma_d^2), where B = [b_{jk}] is the matrix of factor loadings, Sigma_f is the covariance matrix of the three factors, and sigma_j^2 is the variance of the noise eps_j. We adopt this covariance structure in our simulations. Following earlier work, we generate the factor loadings from a trivariate normal distribution with mean mu_b and covariance Sigma_b as specified in the table below; after the factor loadings are generated, they are fixed as parameters throughout the simulations. The covariance matrix Sigma_f of the three factors is also given in the table below. The standard deviations sigma_j of the idiosyncratic noises are generated independently from a truncated gamma distribution with fixed shape, scale, and restricted support; again, these standard deviations are fixed as parameters once they are generated. These parameters are obtained by fitting the three-factor model using historical daily return data of industry portfolios, and the covariance matrix Sigma is fixed throughout the simulations. Since we are only interested in risk optimization, we set the mean of the asset returns to zero, and the number of stocks under consideration is fixed. Given the covariance matrix Sigma, we generate the asset return data from the following three distributions: a multivariate Gaussian distribution with covariance Sigma; a multivariate t distribution with covariance matrix Sigma; and an elliptical distribution with a log-normal generating variate and covariance matrix Sigma.

Table: parameters for generating the covariance matrix in the factor model, namely the mean mu_b and covariance Sigma_b of the factor loadings and the covariance matrix Sigma_f of the factor returns.

Figure: portfolio risks, number of selected stocks, and matching rates to the oracle optimal portfolios for QNE, OGK, Factor, and Shrink under the multivariate Gaussian, multivariate t, and elliptical models, as a function of the gross-exposure constant.

Under each distribution, we generate an asset return series covering half a year. We estimate the scatter matrices using QNE and the three competitors and plug them into the constrained program to optimize the portfolio allocations. We also solve the program with the true covariance matrix Sigma to obtain the oracle optimal portfolios as benchmarks, and we vary the gross-exposure constant over a range of values. The results are averaged over repeated simulations; the figure above shows the portfolio risks and the matching rates between the optimized portfolios and the oracle optimal ones. Here the matching rate is defined as follows: for two portfolios w1 and w2, let S1 and S2 be the corresponding sets of selected assets (the assets for which the weights are nonzero); the matching rate between w1 and w2 is defined through the cardinalities of S1, S2, and their overlap, where |S| denotes the cardinality of a set S. We note two observations from the figure: (i) the four estimators lead to comparable portfolio risks under the Gaussian model; however, under the multivariate t
and qne achieves lower portfolio risk ii the matching rates of qne are stable across the three models and are higher than the competing methods under distributions and thus we conclude that qne is robust to heavy tails in both risk minimization and asset selection real data in this section we simulate portfolio management using the stocks we collect adjusted daily closing for stocks that stayed in the index from january due to the regularization in the constraint the solution is generally sparse the adjusted closing prices accounts for all corporate actions including stock splits dividends and rights offerings table annualized sharpe ratios returns and risks under competing approaches using index data sharpe ratio qne ogk factor shrink return in risk in to december using the closing prices we obtain daily returns as the daily growth rates of the prices we manage portfolio consisting of the stocks from january to december on days we optimize the portfolio allocations using the past months stock return data sample points we hold the portfolio for one day and evaluate the portfolio return on day in this way we obtain portfolio returns we repeat the process for each of the four methods under comparison and range the constant from to since the true covariance matrix of the stock returns is unknown we adopt the sharpe ratio for evaluating the performances of the portfolios table summarizes the annualized sharpe ratios mean returns and empirical risks standard deviations of the portfolio returns we observe that qne achieves the largest sharpe ratios under all values of the constant indicating the lowest risks under the same returns or equivalently the highest returns under the same risk discussion in this paper we propose robust portfolio optimization framework building on scatter matrix we obtain rates of convergence for the scatter matrix estimators and the risk of the estimated portfolio the relations of the proposed framework with its counterpart are well understood the main contribution of the robust portfolio optimization approach lies in its robustness to heavy tails in high dimensions heavy tails present unique challenges in high dimensions compared to low dimensions for example asymptotic theory of guarantees consistency in the rate op even for data if statistical error diminishes rapidly with increasing however when statistical error may scale rapidly with dimension thus stringent tail conditions such as subgaussian conditions are required to guarantee consistency for estimators in high dimensions in this paper based on quantile statistics we achieve consistency for portfolio risk without assuming any tail conditions while allowing to scale nearly exponentially with another contribution of his work lies in the theoretical analysis of how serial dependence may affect consistency of the estimation we measure the degree of serial dependence using the coefficient we show that the effect of the serial dependence the rate of convergence is summarized by the parameter which characterizes the size of we drop the data after to avoid the financial crisis when the stock prices are likely to violate the stationary assumption imposes upper bound on the percentage of short positions in practice the percentage of short positions is usually strictly controlled to be much lower references harry markowitz portfolio selection the journal of finance michael best and robert grauer on the sensitivity of portfolios to changes in asset means some analytical and computational results review of financial studies 
vijay kumar chopra and william ziemba the effect of errors in means variances and covariances on optimal portfolio choice the journal of portfolio management robert merton on estimating the expected return on the market an exploratory investigation journal of financial economics jarl kallberg and william ziemba in portfolio selection problems in risk and capital pages springer jianqing fan yingying fan and jinchi lv high dimensional covariance matrix estimation using factor model journal of econometrics james stock and mark watson forecasting using principal components from large number of predictors journal of the american statistical association jushan bai kunpeng li et al statistical analysis of factor models of high dimension the annals of statistics jianqing fan yuan liao and martina mincheva large covariance estimation by thresholding principal orthogonal complements journal of the royal statistical society series statistical methodology olivier ledoit and michael wolf improved estimation of the covariance matrix of stock returns with an application to portfolio selection journal of empirical finance olivier ledoit and michael wolf estimator for covariance matrices journal of multivariate analysis olivier ledoit and michael wolf honey shrunk the sample covariance matrix the journal of portfolio management peter huber robust statistics wiley ricardo maronna and ruben zamar robust estimates of location and dispersion for highdimensional datasets technometrics ramanathan gnanadesikan and john kettenring robust estimates residuals and outlier detection with multiresponse data biometrics yilun chen ami wiesel and alfred hero robust shrinkage estimation of covariance matrices ieee transactions on signal processing romain couillet and matthew mckay large dimensional analysis and optimization of robust shrinkage covariance matrix estimators journal of multivariate analysis ravi jagannathan and ma risk reduction in large portfolios why imposing the wrong constraints helps the journal of finance jianqing fan jingjin zhang and ke yu vast portfolio selection with constraints journal of the american statistical association peter rousseeuw and christophe croux alternatives to the median absolute deviation journal of the american statistical association xu and shao solving the matrix nearness problem in the maximum norm by applying projection and contraction method advances in operations research alexandre belloni and victor chernozhukov quantile regression in sparse models the annals of statistics lan wang yichao wu and runze li quantile regression for analyzing heterogeneity in dimension journal of the american statistical association peter bickel and elizaveta levina covariance regularization by thresholding the annals of statistics tony cai zhang and harrison zhou optimal rates of convergence for covariance matrix estimation the annals of statistics fang samuel kotz and kai wang ng symmetric multivariate and related distributions chapman and hall harry joe multivariate models and dependence concepts chapman and hall rafael schmidt tail dependence for elliptically contoured distributions mathematical methods of operations research svetlozar todorov rachev handbook of heavy tailed distributions in finance elsevier svetlozar rachev christian menn and frank fabozzi and skewed asset return distributions implications for risk management portfolio selection and option pricing wiley kevin dowd measuring market risk wiley torben gustav andersen handbook of financial time series springer jushan bai and 
shuzhong shi estimating high dimensional covariance matrices and its applications annals of economics and finance sara van de geer and sa van de geer empirical processes in cambridge university press cambridge alastair hall generalized method of moments oxford university press oxford peter and sara van de geer statistics for data methods theory and applications springer 
logarithmic time online multiclass prediction anna choromanska courant institute of mathematical sciences new york ny usa achoroma john langford microsoft research new york ny usa jcl abstract we study the problem of multiclass classification with an extremely large number of classes with the goal of obtaining train and test time complexity logarithmic in the number of classes we develop tree construction approaches for constructing logarithmic depth trees on the theoretical front we formulate new objective function which is optimized at each node of the tree and creates dynamic partitions of the data which are both pure in terms of class labels and balanced we demonstrate that under favorable conditions we can construct logarithmic depth trees that have leaves with low label entropy however the objective function at the nodes is challenging to optimize computationally we address the empirical problem with new online decision tree construction procedure experiments demonstrate that this online algorithm quickly achieves improvement in test error compared to more common logarithmic training time approaches which makes it plausible method in computationally constrained applications introduction the central problem of this paper is computational complexity in setting where the number of classes for multiclass prediction is very large such problems occur in natural language which translation is best search what result is best and detection who is that tasks almost all machine learning algorithms with the exception of decision trees have running times for multiclass classification which are with canonical example being classifiers in this setting the most efficient possible accurate approach is given by information theory in essence any multiclass classification algorithm must uniquely specify the bits of all labels that it predicts correctly on consequently kraft inequality equation implies that the expected computational complexity of predicting correctly is per example where is the shannon entropy of the label for the worst case distribution on classes this implies log computation is required hence our goal is achieving log computational time per for both training and testing while effectively using online learning algorithms to minimize passes over the data the goal of logarithmic in complexity naturally motivates approaches that construct logarithmic depth hierarchy over the labels with one label per leaf while this hierarchy is sometimes available through prior knowledge in many scenarios it needs to be learned as well this naturally leads to partition problem which arises at each node in the hierarchy the partition problem is finding classifier which divides examples into two subsets with purer set of labels than the original set definitions of purity vary but canonical examples are the number of labels remaining in each subset or softer notions such as the average shannon entropy of the class labels despite resulting in classifier this problem is fundamentally different from standard binary classification to see this note that replacing with is very bad for binary classification but has no impact on the quality of the partition problem is fundamentally throughout the paper by logarithmic time we mean logarithmic time per example the problem bears parallels to clustering in this regard for symmetric classes since the average of and is poor partition the function places all points on the same side the choice of partition matters in problem dependent ways for example consider examples on 
a line, with the label of each example given by its position, and threshold classifiers: in this case, trying to separate certain class labels from another class label results in poor accuracy, since no threshold realizes such a partition.

The partition problem is typically solved for decision tree learning via a search amongst a small set of possible classifiers. In the multiclass setting it is desirable to achieve substantial error reduction for each node in the tree, which motivates using a richer set of classifiers in the nodes to minimize the number of nodes and thereby decrease the computational complexity. The main theoretical contribution of this work is to establish a boosting algorithm for learning trees with a controlled number of nodes and logarithmic depth, thereby addressing the goal of logarithmic-time train and test complexity. Our main theoretical result, presented below, generalizes a binary boosting theorem to multiclass boosting. As in all boosting results, performance is critically dependent on the quality of the weak learner, supporting the intuition that we need sufficiently rich partitioners at the nodes. The approach uses a new objective for decision tree learning, which we optimize at each node of the tree; the objective and its theoretical properties are presented below.

A complete system with multiple partitions could be constructed top down, as in the boosting theorem, or bottom up, as in the Filter Tree. A bottom-up partition process appears infeasible with representational constraints, as shown in the supplementary material, so we focus on top-down tree creation. Whenever there are representational constraints on partitions, such as linear classifiers, finding a strong partition function requires an efficient search over this set of classifiers. Efficient searches over large function classes are routinely performed via gradient descent techniques for supervised learning, so they seem like a natural candidate. In the existing literature, examples for doing this exist when the problem is indeed binary, or when there is a prespecified hierarchy over the labels and we just need to find partitioners aligned with that hierarchy. Neither of these cases covers our setting, which has multiple labels and wants to dynamically create the choice of partition rather than assuming that one was handed to us. Does there exist a purity criterion amenable to a gradient descent approach? The precise objective studied in theory fails this test due to its discrete nature, and even natural approximations are challenging to tractably optimize under computational constraints. As a result, we use the theoretical objective as motivation and construct a new Logarithmic Online Multiclass Tree (LOMtree) algorithm for empirical evaluation.

Figure: comparison of one-against-all (OAA) and the Logarithmic Online Multiclass Tree (LOMtree), with OAA constrained to use the same training time as LOMtree (by dataset truncation) and LOMtree constrained to use the same representation complexity as OAA, plotted against the number of classes; as the number of class labels grows, the problem becomes harder and LOMtree becomes more dominant.

Creating a tree in an online fashion creates a new class of problems: what if some node is initially created but eventually proves useless because no examples go to it? At best this results in a wasteful solution, while in practice it starves other parts of the tree which need representational complexity. To deal with this, we design an efficient process for recycling orphan nodes into locations where they are needed, and prove that the number of times a node is recycled is at most logarithmic in the number of examples. The algorithm is described and analyzed in later sections. And is it effective, given the inherent difficulty of the partition problem? This is
unavoidably an empirical question which we answer on range of datasets varying from to classes in section we find that under constrained training times this approach is quite effective compared to all baselines while dominating other log train time approaches what new to the best of our knowledge the splitting criterion the boosting statement the lomtree algorithm the swapping guarantee and the experimental results are all new here prior work only few authors address logarithmic time training the filter tree addresses consistent and robust multiclass classification showing that it is possible in the statistical limit the filter tree does not address the partition problem as we do here which as shown in our experimental section is often helpful the partition finding problem is addressed in the conditional probability tree but that paper addresses conditional probability estimation conditional probability estimation can be converted into multiclass prediction but doing so is not logarithmic time operation quite few authors have addressed logarithmic testing time while allowing training time to be or worse while these approaches are intractable on our larger scale problems we describe them here for context the partition problem can be addressed by recursively applying spectral clustering on confusion graph other clustering approaches include empirically this approach has been found to sometimes lead to badly imbalanced splits in the context of ranking another approach uses hierarchical clustering to recover the label sets for given partition the more recent work on the multiclass classification problem addresses it via sparse output coding by tuning multiclass categorization into decoding problem the authors decouple the learning processes of coding matrix and bit predictors and use probabilistic decoding to decode the optimal class label the authors however specify class similarity which is to compute see section in and hence this approach is in different complexity class than ours this is also born out experimentally the variant of the popular error correcting output code scheme for solving prediction problems with large output spaces under the assumption of output sparsity was also considered in their approach in general requires running time to decode since in essence the fit of each label to the predictions must be checked and there are labels another approach proposes iterative algorithms for and prediction with relatively large number of examples and data dimensions and the work of focusing in particular on the multiclass classification both approaches however have training time decision trees are naturally structured to allow logarithmic time prediction traditional decision trees often have difficulties with large number of classes because their splitting criteria are not to the large class setting however newer approaches have addressed this effectively at significant scales in the context of multilabel classification multilabel learning with missing labels is also addressed in more specifically the first work performs brute force optimization of multilabel variant of the gini index defined over the set of positive labels in the node and assumes label independence during random forest construction their method makes fast predictions however has high training costs the second work optimizes rank sensitive loss function discounted cumulative gain additionally problem with hierarchical classification is that the performance significantly deteriorates lower in the hierarchy which some 
authors solve by biasing the training distribution to reduce error propagation while simultaneously combining and approaches during training the reduction approach we use for optimizing partitions implicitly optimizes differential objective approach to this has been tried previously on other objectives yielding good results in different context framework and theoretical analysis in this section we describe the essential elements of the approach and outline the theoretical properties of the resulting framework we begin with ideas setting we employ hierarchical approach for learning multiclass decision tree structure training this structure in fashion we assume that we receive examples rd with labels we also assume access to hypothesis class where each is binary classifier the overall objective is to learn tree of depth log where each node in the tree consists of classifier from the classifiers are trained in such way that hn hn denotes the classifier in node of the means that the example is sent to the right subtree of node while hn sends to the left subtree when we reach leaf we predict according to the label with the highest frequency amongst the examples reaching that leaf further in the paper we skip index whenever it is clear from the context that we consider fixed tree node in the interest of computational complexity we want to encourage the number of examples going to the left and right to be fairly balanced for good statistical accuracy we want to send examples of class almost exclusively to either the left or the right subtree thereby refining the purity of the class distributions at subsequent levels in the tree the purity of tree node is therefore measure of whether the examples of each class reaching the node are then mostly sent to its one child node pure split or otherwise to both children impure split the formal definitions of balancedness and purity are introduced in section an objective expressing both and resulting theoretical properties are illustrated in the following sections key consideration in picking this objective is that we want to effectively optimize it over hypotheses while streaming over examples in an online this seems unsuitable with some of the more standard decision tree objectives such as shannon or gini entropy which leads us to design new objective at the same time we show in section that under suitable assumptions optimizing the objective also leads to effective reduction of the average shannon entropy over the entire tree an objective and analysis of resulting partitions we now define criterion to measure the quality of hypothesis in creating partitions at fixed node in the tree let denotes the proportion of label amongst the examples reaching this node let and denote the fraction of examples reaching for which marginally and conditional on class respectively then we define the we aim to maximize the objective to obtain high quality partitions intuitively the objective encourages the fraction of examples going to the right from class to be substantially different from the background fraction for each class as concrete simple scenario if for some hypothesis then the objective prefers to be as close to or as possible for each class leading to pure partitions we now make these intuitions more formal definition purity the hypothesis induces pure split if min where and is called the purity factor in particular partition is called maximally pure if meaning that each class is sent exclusively to the left or the right we now define similar definition for the 
balancedness of split definition balancedness the hypothesis induces balanced split if where and is called the balancing factor partition is called maximally balanced if meaning that an equal number of examples are sent to the left and right children of the partition the balancing factor and the purity factor are related as shown in lemma the proofs of lemma and the following lemma lemma are deferred to the supplementary material lemma for any hypothesis and any distribution over examples the purity factor and the balancing factor satisfy min partition is called maximally pure and balanced if it satisfies both and we see that for hypothesis inducing maximally pure and balanced partition as captured in the next lemma of course we do not expect to have hypotheses producing maximally pure and balanced splits in practice lemma for any hypothesis the objective satisfies furthermore if induces maximally pure and balanced partition then we want an objective to achieve its optimum for simultaneously pure and balanced split the standard criteria such as shannon or gini entropy as well as the criterion we will propose posed in equation satisfy this requirement for the criteria see for our criterion see lemma our algorithm could also be implemented as batch or streaming where in case of the latter one can for example make one pass through the data per every tree level however for massive datasets making multiple passes through the data is computationally costly further justifying the need for an online approach the proposed objective function exhibits some similarities with the carnap measure used in probability and inductive logic quality of the entire tree the above section helps us understand the quality of an individual split produced by effectively maximizing we next reason about the quality of the entire tree as we add more and more nodes we measure the quality of trees using the average entropy over all the leaves in the tree and track the decrease of this entropy as function of the number of nodes our analysis extends the theoretical analysis in originally developed to show the boosting properties of the decision trees for binary classification problems to the multiclass classification setting given tree we consider the entropy function gt as the measure of the quality of tree gt wl ln where are the probabilities that randomly chosen data point drawn from where is fixed target distribution over has label given that reaches node denotes the set of all tree leaves denotes the number of internal tree nodes and wl is the weight of leaf defined as the probability randomly chosen drawn from reaches leaf note that wl we next state the main theoretical result of this paper it is captured in theorem we adopt the weak learning framework the weak hypothesis assumption captured in definition posits that each node of the tree has hypothesis in its hypothesis class which guarantees simultaneously weak purity and weak balancedness of the split on any distribution over under this assumption one can use the new decision tree approach to drive the error below any threshold definition weak hypothesis assumption let denote any node of the tree and let hm and pm hm furthermore let be such that for all min we say that the weak hypothesis assumption is satisfied when for any distribution over at each node of the tree there exists hypothesis hm such that pk hm theorem under the weak hypothesis assumption for any to obtain gt it suffices to make ln splits we defer the proof of theorem to the supplementary material 
and provide its sketch now the analysis studies tree construction algorithm where we recursively find the leaf node with the highest weight and choose to split it into two children let be the heaviest leaf at time consider splitting it to two children the contribution of node to the tree entropy changes after it splits this change entropy reduction corresponds to gap in the jensen inequality applied to the concave function and thus can further be we use the fact that shannon entropy is strongly concave with respect to see example in the obtained turns out to depend proportionally on hn this implies that the larger the objective hn is at time the larger the entropy reduction ends up being which further reinforces intuitions to maximize in general it might not be possible to find any hypothesis with large enough objective hn to guarantee sufficient progress at this point so we appeal to weak learning assumption this assumption can be used to further the entropy reduction and prove theorem the lomtree algorithm the objective function of section has another convenient form which yields simple online algorithm for tree construction and training note that equation can be written details are shown in section in the supplementary material as ex maximizing this objective is discrete optimization problem that can be relaxed as follows ex where ex is the expected score of class we next explain our empirical approach for maximizing the relaxed objective the empirical estimates of the expectations can be easily stored and updated online in every tree node the decision whether to send an example reaching node to its left or right child node is based on the sign of the difference between the two expectations ex and ex where is label of the data point when ex ex the data point is sent to the left else it is sent to the right this procedure is conveniently demonstrated on toy example in section in the supplement during training the algorithm assigns unique label to each node of the tree which is currently leaf this is the label with the highest frequency amongst the examples reaching that leaf while algorithm lomtree algorithm online tree training input regression algorithm max number of tree nodes swap resistance rs subroutine setnode mv mv sum of the scores for class lv lv number of points of class reaching nv nv number of points of class which are used to train regressor in ev ev expected score for class ev expected total score cv the size of the smallest in the subtree with root subroutine updatec while and cparent cv parent cv min cleft cright subroutine swap find leaf for which cs cr grandpa if spa left sgpa left sgpa ssib else right sgpa ssib updatec ssib setnode left setnode spa right spa create root setnode for each example do set do if lj mj lj nj ej lj if is leaf if lj has at least entries if or cj maxi lj rs if setnode left setnode right else swap cleft cright cleft updatec left if is not leaf if ej ej else train hj with example pk mj nj mj hj ej mj ej nj set to the child of corresponding to hj else cj break testing test example is pushed down the tree along the path from the root to the leaf where in each node of the path its regressor directs the example either to the left or right child node the test example is then labeled with the label assigned to the leaf that this example descended to the training algorithm is detailed in algorithm where each tree node contains classifier we use linear classifiers hj is the regressor stored in node and hj is the value of the prediction of hj on example 
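The online routing rule described above can be illustrated as follows. This is a minimal sketch under our reading of the text; the variable names, the direction convention for ties, and the cold-start handling are assumptions. Each node keeps running means of the regressor score, overall and per class, and uses their difference to form the binary regression target used to train the node regressor h_j; the example then descends to the child given by the sign of h_j(x).

```python
class NodeStats:
    """Running per-node statistics used to form binary regression targets,
    under our reading of the routing rule (names and conventions assumed)."""
    def __init__(self, num_classes):
        self.sum_all, self.count_all = 0.0, 0
        self.sum_cls = [0.0] * num_classes
        self.count_cls = [0] * num_classes

    def make_target(self, score, label):
        """Compare E[h(x)] with E[h(x) | y = label] and emit a +1 / -1 target;
        the node regressor is then trained online on (x, target)."""
        e_all = self.sum_all / self.count_all if self.count_all else 0.0
        e_cls = (self.sum_cls[label] / self.count_cls[label]
                 if self.count_cls[label] else 0.0)
        target = 1.0 if e_cls >= e_all else -1.0
        # update the running expectations with the observed score
        self.sum_all += score; self.count_all += 1
        self.sum_cls[label] += score; self.count_cls[label] += 1
        return target
```

Because both expectations are kept as running sums, updating them per example is constant-time, which is what makes the procedure usable in a streaming setting.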
the stopping criterion for expanding the tree is when the number of nodes reaches threshold swapping consider scenario where the current training example descends to leaf the leaf can split create two children if the examples that reached it in the past were coming from at least two different the smallest leaf is the one with the smallest total number of data points reaching it in the past parent left and right denote resp the parent and the left and right child of node grandpa and sibling denote respectively the grandparent of node and the sibling of node the node which has the same parent as in the implementation both sums are stored as variables thus updating ev takes computations we also refer to this prediction value as the score in this section sgpa ssib spa spa sgpa ssib figure illustration of the swapping procedure left before the swap right after the swap classes however if the number of nodes of the tree reaches threshold no more nodes can be expanded and thus can not create children since the tree construction is done online some nodes created at early stages of training may end up useless because no examples reach them later on this prevents potentially useful splits such as at leaf this problem can be solved by recycling orphan nodes subroutine swap in algorithm the general idea behind node recycling is to allow nodes to split if certain condition is met in particular node splits if the following holds cj max lj rs cr where denotes the root of the entire tree cj is the size of the smallest leaf in the subtree with root where the smallest leaf is the one with the smallest total number of data points reaching it in the past lj is vector of integers where the ith element is the count of the number of data points with label reaching leaf in the past and finally rs is swap resistance the subtraction of lj in equation ensures that pure node will not be recycled if the condition in inequality is satisfied the swap of the nodes is performed where an orphan leaf which was reached by the smallest number of examples in the past and its parent spa are detached from the tree and become children of node whereas the old sibling ssib of an orphan node becomes direct child of the old grandparent sgpa the swapping procedure is shown in figure the condition captured in the inequality allows us to prove that the number of times any given node is recycled is by the logarithm of the number of examples whenever the swap resistance is or more lemma lemma let the swap resistance rs be greater or equal to then for all sequences of examples the number of times algorithm recycles any given node is by the logarithm with base of the sequence length experiments we address several hypotheses experimentally the lomtree algorithm achieves true logarithmic time computation in practice the lomtree algorithm is competitive with or better than all other logarithmic time algorithms for multiclass classification the lomtree algorithm has statistical performance close to more common approaches to address these hypotheses we contable dataset sizes ducted experiments on variety of isolet sector aloi imnet odp benchmark multiclass datasets isosize let sector aloi imagenet im features net and the details of the examples datasets are provided in table the datasets were divided into training classes and testing furthermore of the training dataset was used as validation set the baselines we compared lomtree with are balanced random tree of logarithmic depth rtree and the filter tree where computationally feasible we also 
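The node-recycling test is only partially legible in the source, so the following is a guess at its shape rather than a reconstruction: a leaf becomes eligible for a swap when its accumulated example count, discounted by its dominant class count (the subtraction that protects pure leaves from being recycled), exceeds the swap resistance times the size of the smallest leaf in the tree. All names and the exact inequality below are assumptions.

```python
def swap_eligible(count_j, label_counts_j, smallest_leaf_count, swap_resistance):
    """Hypothetical form of the recycling condition discussed in the text.
    count_j             -- total number of examples that reached leaf j
    label_counts_j      -- per-class counts at leaf j (the vector l_j)
    smallest_leaf_count -- size of the smallest leaf in the tree (c_r)
    swap_resistance     -- the swap resistance parameter R_s"""
    return count_j - max(label_counts_j) > swap_resistance * smallest_leaf_count
```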
compared with classifier oaa as representative approach all methods were implemented in the vowpal wabbit learning system and have similar levels of optimization the regressors in the tree nodes for lomtree rtree and filter tree as well as the oaa regressors were trained by online gradient descent for which we explored step sizes chosen from the set we used compressed the details of the source of each dataset are provided in the supplementary material linear regressors for each method we investigated training with up to passes through the data and we selected the best setting of the parameters step size and number of passes as the one minimizing the validation error additionally for the lomtree we investigated different settings of the stopping criterion for the tree expansion and swap resistance rs in table and we report respectively train time and test time the best performer is indicated in bold training time and later reported test error is not provided for oaa on imagenet and odp due to are petabyte scale table training time on selected problems table test time on all problems isolet sector aloi isolet sector aloi imnet odp lomtree lomtree oaa oaa ms time ratio the first hypothesis is consistent with the experimental results lomtree significantly outperforms oaa due to building only logarithmic depth trees the improvement in the training time increases with the number of classes in the classification problem for instance on aloi training with lomtree is times faster than with oaa the same can be said about the test time where the test time for aloi imagenet and odp are respectively and times faster than oaa the significant advantage of lomtree over oaa is also captured in figure next in table the best logarithmic time perlomtree vs former is indicated in bold we report test error of logarithmic time algorithms we also show the binomial symmetrical confidence intervals for our results clearly the ond hypothesis is also consistent with the experimental results since the rtree imposes random label partition the resulting error it tains is generally worse than the error obtained by the competitor methods including lomtree which learns the label partitioning directly from the data at the same time lomtree beats ter tree on every dataset though for imagenet number of classes figure logarithm of the ratio of and odp both have high level of noise the advantage of lomtree is not as significant test times of oaa and lomtree on all problems table test error and confidence interval on all problems lomtree rtree filter tree oaa isolet sector aloi imnet na odp na the third hypothesis is weakly consistent with the empirical results the time advantage of lomtree comes with some loss of statistical accuracy with respect to oaa where oaa is tractable we conclude that lomtree significantly closes the gap between other logarithmic time methods and oaa making it plausible approach in computationally constrained applications conclusion the lomtree algorithm reduces the multiclass problem to set of binary problems organized in tree structure where the partition in every tree node is done by optimizing new partition criterion online the criterion guarantees pure and balanced splits leading to logarithmic training and testing time for the tree classifier we provide theoretical justification for our approach via boosting statement and empirically evaluate it on multiple multiclass datasets empirically we find that this is the best available logarithmic time approach for multiclass classification problems note 
however that the mechanics of testing datastes are much easier one can simply test with effectively untrained parameters on few examples to measure the test speed thus the test time for oaa on imagenet and odp is provided also to the best of our knowledge there exist no results of the oaa performance on these datasets published in the literature acknowledgments we would like to thank alekh agarwal dean foster robert schapire and matus telgarsky for valuable discussions references rifkin and klautau in defense of classification mach learn cover and thomas elements of information theory john wiley sons breiman friedman olshen and stone classification and regression trees crc press llc boca raton florida kearns and mansour on the boosting ability of decision tree learning algorithms journal of computer and systems sciences also in stoc beygelzimer langford and ravikumar tournaments in alt beygelzimer langford lifshits sorkin and strehl conditional probability tree estimation analysis and algorithms in uai bishop pattern recognition and machine learning springer bengio weston and grangier label embedding trees for large tasks in nips madzarov gjorgjevikj and chorbev svm classifier utilizing binary decision tree informatica deng satheesh berg and fast and balanced efficient label tree learning for large scale object recognition in nips weston makadia and yee label partitioning for sublinear ranking in icml zhao and xing sparse output coding for visual recognition in cvpr hsu kakade langford and zhang prediction via compressed sensing in nips agarwal kakade karampatziakis song and valiant least squares revisited scalable approaches for prediction in icml beijbom saberian kriegman and vasconcelos loss functions for costsensitive multiclass boosting in icml agarwal gupta prabhu and varma learning with millions of labels recommending advertiser bid phrases for web pages in www prabhu and varma fastxml fast accurate and stable for extreme learning in acm sigkdd yu jain kar and dhillon learning with missing labels in icml liu yang wan zeng chen and ma support vector machines classification with very taxonomy in sigkdd explorations bennett and nguyen refined experts improving classification in large taxonomies in sigir montillo tu shotton winn iglesias metaxas and criminisi entanglement and differentiable information gain maximization decision forests for computer vision and medical image analysis tentori crupi bonini and osherson comparison of confirmation measures cognition carnap logical foundations of probability ed chicago university of chicago press par pp online learning and online convex optimization found trends mach langford li and strehl http nesterov introductory lectures on convex optimization basic course applied optimization kluwer academic deng dong socher li li and imagenet hierarchical image database in cvpr 
planar ultrametrics for image segmentation charless fowlkes department of computer science university of california irvine fowlkes julian yarkony experian data lab san diego ca abstract we study the problem of hierarchical clustering on planar graphs we formulate this in terms of finding the closest ultrametric to specified set of distances and solve it using an lp relaxation that leverages minimum cost perfect matching as subroutine to efficiently explore the space of planar partitions we apply our algorithm to the problem of hierarchical image segmentation introduction we formulate hierarchical image segmentation from the perspective of estimating an ultrametric distance over the set of image pixels that agrees closely with an input set of noisy pairwise distances an ultrametric space replaces the usual triangle inequality with the ultrametric inequality max which captures the transitive property of clustering if and are in the same cluster and and are in the same cluster then and must also be in the same cluster thresholding an ultrametric immediately yields partition into sets whose diameter is less than the given threshold varying this distance threshold naturally produces hierarchical clustering in which clusters at high thresholds are composed of clusters at lower thresholds inspired by the approach of our method represents an ultrametric explicitly as hierarchical collection of segmentations determining the appropriate segmentation at single distance threshold is equivalent to finding multicut in graph with both positive and negative edge weights finding an ultrametric imposes the additional constraint that these multicuts are hierarchically consistent across different thresholds we focus on the case where the input distances are specified by planar graph this arises naturally in the domain of image segmentation where elements are pixels or superpixels and distances are defined between neighbors and allows us to exploit fast combinatorial algorithms for partitioning planar graphs that yield tighter lp relaxations than the local polytope relaxation often used in graphical inference the paper is organized as follows we first introduce the closest ultrametric problem and the relation between multicuts and ultrametrics we then describe an lp relaxation that uses delayed column generation approach and exploits planarity to efficiently find cuts via the classic reduction to perfect matching we apply our algorithm to the task of natural image segmentation and demonstrate that our algorithm converges rapidly and produces optimal or solutions in practice closest ultrametric and multicuts let be weighted graph with edge weights indexed by edges our goal is to find an ultrametric distance over vertices of the graph that is close to in the sense that the distortion kθ is minimized we begin by reformulating this closest ultrametric problem in terms of finding set of nested multicuts in family of weighted graphs we specify partitioning or multicut of the vertices of the graph into components using binary vector where indicates that the edge is cut and that the vertices and associated with the edge are in separate components of the partition we use mcut to denote the set of binary indicator vectors that represent valid multicuts of the graph for notational simplicity in the remainder of the paper we frequently omit the dependence on which is given as fixed input necessary and sufficient condition for an indicator vector to define valid multicut in is that for every cycle of edges if one edge on 
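To make the thresholding picture concrete, here is a small illustration (not code from the paper): thresholding an ultrametric at τ and taking connected components over the pairs with distance at most τ yields a partition whose clusters have diameter at most τ, and raising τ only merges clusters, which is exactly the hierarchical behaviour described above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def threshold_ultrametric(d, tau):
    """Partition induced by thresholding a (symmetric) ultrametric distance
    matrix d at tau: connect i and j whenever d[i, j] <= tau and take
    connected components.  Illustrative sketch only."""
    adj = csr_matrix((d <= tau).astype(int))
    _, labels = connected_components(adj, directed=False)
    return labels

# A tiny ultrametric over three points: d(0,1) = 1, d(0,2) = d(1,2) = 2.
d = np.array([[0, 1, 2],
              [1, 0, 2],
              [2, 2, 0]])
print(threshold_ultrametric(d, 1.0))   # two clusters, e.g. [0, 0, 1]
print(threshold_ultrametric(d, 2.0))   # one cluster
```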
the cycle is cut then at least one other edge in the cycle must also be cut let denote the set of all cycles in where each cycle is set of edges and is the set of edges in cycle excluding edge we can express mcut in terms of these cycle inequalities as mcut hierarchical clustering of graph can be described by nested collection of multicuts we denote the space of valid hierarchical partitions with layers by which we represent by set of vectors in which any cut edge remains cut at all finer layers of the hierarchy mcut given valid hierarchical clustering an ultrametric can be specified over the vertices of the graph by choosing sequence of real values that indicate distance threshold associated with each level of the hierarchical clustering the ultrametric distance specified by the pair assigns distance to each pair of vertices based on the coarsest level of the clustering at which they remain in separate clusters for pairs corresponding to an edge in the graph we can write this explicitly in terms of the multicut indicator vectors as de max pairs that do not correspond to an and we assume by convention that edge in the original graph can still be assigned unique distance based on the coarsest level at which they lie in different connected components of the cut specified by to compute the quality of an ultrametric with respect to an input set of edge weights we measure the squared difference between the edge weights and the ultrametric distance kθ to write this compactly in terms of multicut pm indicator vectors we construct set of weights for each edge and layer denoted θel so that θel kθe these weights are given explicitly by the telescoping series kθe we use θel kθe kθe to denote the vector containing θel for all for fixed number of levels and fixed set of thresholds the problem of finding the closest ultrametric can then be written as an integer linear program ilp over the edge cut indicators xx min min kθe θe kθe kθe kθe min kθe min θel min θl this optimization corresponds to solving collection of multicut problems where the multicuts are constrained to be hierarchically consistent linear combination of cut vectors hierarchical cuts figure any partitioning can be represented as linear superposition of cuts where each cut isolates connected component of the partition and is assigned weight by introducing an auxiliary slack variables we are able to represent larger set of valid indicator vectors using fewer columns of by introducing additional slack variables at each layer of the hierarchical segmentation we can efficiently represent many hierarchical segmentations here that are consistent from layer to layer while using only small number of cut indicators as columns of computing multicuts also known as correlation clustering is np hard even in the case of planar graphs direct approach to finding an approximate solution to eq is to relax the integrality constraints on and instead optimize over the whole polytope defined by the set of cycle inequalities we use ωl to denote the corresponding relaxation of while the resulting polytope is not the convex hull of mcut the integral vertices do correspond exactly to the set of valid multicuts in practice we found that applying straightforward approach that successively adds violated cycle inequalities to this relaxation of eq requires far too many constraints and is too slow to be useful instead we develop column generation approach tailored for planar graphs that allows for efficient and accurate approximate inference the cut cone and planar 
multicuts consider partition of planar graph into two disjoint sets of nodes we denote the space of indicator vectors corresponding to such cuts by cut cut may yield more than two connected components but it can not produce every possible multicut it can not split triangle of three nodes into three separate components let be an indicator matrix where each column specifies valid cut with zek if and only if edge is cut in twoway cut the indicator vector of any multicut in planar graph can be generated by suitable linear combination of of cuts columns of that isolate the individual components from the rest of the graph where the weight of each such cut is let be vector specifying positive weighted combination of cuts the set zγ is the conic hull of cut or cut cone since any multicut can be expressed as superposition of cuts the cut cone is identical to the conic hull of mcut this equivalence suggests an lp relaxation of the multicut given by min zγ zγ where the vector specifies the edge weights for the case of planar graphs any solution to this lp relaxation satisfies the cycle inequalities see supplement and expanded multicut objective since the matrix contains an exponential number of cuts eq is still intractable instead we consider an approximation using constraint set which is subset of columns of in previous work we showed that since the optimal multicut may no longer lie in the span of the reduced cut matrix it is useful to allow some values of exceed see figure for an example we introduce slack vector that tracks the presence of any overcut edges and prevents them from contributing to the objective when the corresponding edge weight is negative let min θe denote the component of θe the expanded objective is given by min for any edge such that θe any decrease in the objective from overcutting by an amount βe is exactly compensated for in the objective by the term βe when contains all cuts then eq and eq are equivalent further if is the minimizer of eq when only contains subset of columns then the edge indicator vector given by min still satisfies the cycle inequalities see supplement for details expanded lp for finding the closest ultrametric to develop an lp relaxation of the closest ultrametric problem we replace the multicut problem at each layer with the expanded multicut objective described by eq we let and denote the collection of weights and slacks for the levels of the hierarchy and let max θel and min θel denote the positive and negative components of θl to enforce hierarchical consistency between layers we would like to add the constraint that zγ zγ however this constraint is too rigid when does not include all possible cuts it is thus computationally useful to introduce an additional slack vector associated with each level and edge which we denote as the introduction of αel allows for cuts represented by zγ to violate the hierarchical constraint we modify the objective so that violations to the original hierarchy constraint are paid for in proportion to the introduction of allows us to find valid ultrametrics while using smaller number of columns of to be used than would otherwise be required illustrated in figure we call this relaxed closest ultrametric problem including the slack variable the expanded closest ultrametric objective written as min θl zγ αl zγ zγ αl zγ where by convention we define αl and we have dropped the constant term from eq given solution we can recover relaxed solution to the closest ultrametric problem eq over ωl by setting xel min zγ in the supplement 
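A small sketch of the two facts used above, with assumed variable names: a multicut indicator can be decoded from a non-negative combination of two-way cuts by clamping, x_e = min(1, Σ_c γ_c z_{ec}), and the decoded indicator can be checked against the cycle inequalities that characterize valid multicuts (for every cycle C and every edge e in C, x_e ≤ Σ_{e'∈C\{e}} x_{e'}).

```python
import numpy as np

def decode_multicut(Z, gamma):
    """x_e = min(1, sum_c gamma_c * Z[e, c]); the columns of Z are two-way cuts."""
    return np.minimum(1.0, Z @ gamma)

def cycle_inequalities_ok(cycles, x):
    """For every cycle C and every edge e in C: x_e <= sum of x over C \\ {e}."""
    return all(x[e] <= sum(x[f] for f in c) - x[e] for c in cycles for e in c)

# Triangle a-b-c with edges indexed (ab, bc, ca).  Two cuts, isolating {a} and
# {b}, superpose to the three-component multicut that no single two-way cut
# of the triangle can produce.
Z = np.array([[1, 1],    # ab is cut by both
              [0, 1],    # bc is cut when {b} is isolated
              [1, 0]],   # ca is cut when {a} is isolated
             dtype=float)
x = decode_multicut(Z, np.array([1.0, 1.0]))
print(x)                                         # [1. 1. 1.]
print(cycle_inequalities_ok([[0, 1, 2]], x))     # True: a valid multicut
```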
we demonstrate that for any that obeys the constraints in eq this thresholding operation yields solution that lies in ωl and achieves the same or lower objective value the dual objective we optimize the dual of the objective in eq using an efficient column generation approach based on perfect matching we introduce two sets of lagrange multipliers and λl corresponding to the between and within layer constraints respectively for algorithm dual closest ultrametric via cutting planes residual while residual do solve eq given residual for do arg θl λl residual residual θl λl isocuts end for end while notational convenience let the dual objective can then be written as max the dual lp can be interpreted as finding small modification of the original edge weights θl so that every possible cut of each resulting graph at level has weight observe that the introduction of the two slack terms and in the primal problem eq results in bounds on the lagrange multipliers and in the dual problem in eq in practice these dual constraints turn out to be essential for efficient optimization and constitute the core contribution of this paper solving the dual via cutting planes the chief complexity of the dual lp is contained in the constraints including which encodes of an exponential number of cuts of the graph represented by the columns of to circumvent the difficulty of explicitly enumerating the columns of we employ cutting plane method that efficiently searches for additional violated constraints columns of which are then successively added let denote the current working set of columns our dual optimization algorithm iterates over the following three steps solve the dual lp with find the most violated constraint of the form θl λl for layer append column to the matrix for each such cut found we terminate when no violated constraints exist or computational budget has been exceeded finding violated constraints identifying columns to add to is carried out for each layer separately finding the most violated constraint of the full problem corresponds to computing the cut of graph with edge weights θl λl if this cut has weight then all the constraints are satisfied otherwise we add the corresponding cut indicator vector as an additional column of to generate new constraint for layer based on the current lagrange multipliers we solve arg min θel λle ωel ze and subsequently add the new constraints from all layers to our lp unlike the multicut problem finding cut in planar graph can be solved exactly by reduction to perfect matching this is classic result that provides an exact solution for the ground state of lattice ising model without ferromagnetic field in log time ub lb bound counts time sec objective ratio ucm um figure the average convergence of the upper blue and red as function of running time values plotted are the gap between the bound and the best computed at termination for given problem instance this relative gap is averaged over problem instances which have not yet converged at given time point we indicate the percentage of problem instances that have yet to terminate using black bars marking percent histogram of the ratio of closest ultrametric objective values for our algorithm um and the baseline clustering produced by ucm all ratios were less than showing that in no instances did um produce worse solution than ucm computing lower bound at given iteration prior to adding newly generated set of constraints we can compute the total residual constraint violation over all layers of hierarchy by θl λl in 
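The overall column-generation loop can be sketched as below. Both callables are placeholders introduced for illustration: `solve_restricted_dual` would solve the dual LP restricted to the current columns and return the per-layer modified edge weights, and `planar_min_cut` would compute the minimum-weight two-way cut of the planar graph (done exactly in the paper via the reduction to perfect matching). The termination test mirrors the description above: stop when no layer yields a negative-weight cut, i.e. no violated constraint remains.

```python
def dual_cutting_planes(solve_restricted_dual, planar_min_cut, layers,
                        tol=1e-6, max_iters=100):
    """Skeleton of the cutting-plane / column-generation loop (our paraphrase,
    not the authors' implementation)."""
    columns = {l: [] for l in layers}        # working set of cut indicator columns
    for _ in range(max_iters):
        duals, modified_weights = solve_restricted_dual(columns)
        residual = 0.0
        for l in layers:
            cut_value, cut_indicator = planar_min_cut(modified_weights[l])
            if cut_value < -tol:             # violated constraint: add its column
                residual += cut_value
                columns[l].append(cut_indicator)
        # dual objective + residual gives a lower bound; stop when no violation remains
        if residual >= -tol:
            break
    return columns
```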
the supplement we demonstrate that the value of the dual objective plus is on the relaxed closest ultrametric problem in eq thus as the costs of the matchings approach zero from below the objective of the reduced problem over approaches an accurate on optimization over expanding generated cut constraints when given cut produces more than two connected components we found it useful to add constraint corresponding to each component following the approach of let the number of connected components of be denoted for each of the components then we add one column to corresponding to the cut that isolates that connected component from the rest this allows more flexibility in representing the final optimum multicut as superpositions of these components in addition we also found it useful in practice to maintain separate set of constraints for each layer maintaining independent constraints can result in smaller overall lp speeding convergence of we found that adding an explicit penalty term to the objective that encourages small values of speeds up convergence dramatically with no loss in solution quality in our experiments this penalty is scaled by parameter which is chosen to be extremely small in magnitude relative to the values of so that it only has an influence when no other forces are acting on given term in primal decoding algorithm gives summary of the dual solver which produces as well as set of cuts described by the constraint matrices the subroutine isocuts computes the set of cuts that isolate each connected component of to generate hierarchical clustering we solve the primal eq using the reduced set in order to recover fractional solution xel min we use an lp solver ibm cplex which provides this primal solution for free when solving the dual in alg we round the fractional primal solution to discrete hierarchical clustering by thresholding xel we then repair uncut any cut edges that lie inside connected component in our implementation we test few discrete thresholds and take that threshold that yields with the lowest cost after each pass through the loop of alg we compute these and retain the optimum solution observed thus far precision maximum ucm um recall um ucm time sec figure boundary detection performance of our closest ultrametric algorithm um and the baseline ultrametric contour maps algorithm with ucm and without length weighting on bsds black circles indicate thresholds used in the closest um optimization anytime performance on the bsds benchmark as function of um ucm with and without length weighting achieve maximum of and respectively experiments we applied our algorithm to segmenting images from the berkeley segmentation data set bsds we use superpixels generated by performing an oriented watershed transform on the output of the global probability of boundary gpb edge detector and construct planar graph whose vertices are superpixels with edges connecting neighbors in the image plane whose base distance is derived from gp let gp be be the local estimate of boundary contrast given by averaging the gp classifier output over the boundary between pair of neighboring superpixels we truncate extreme values to enforce gp be that gp be with and set θe log be log the additive offset assures that θe in our experiments we use fixed set of eleven distance threshold levels δl chosen to uniformly span the useful range of threshold values finally we weighted edges proportionally to the length of the corresponding boundary in the image we performed dual cutting plane iterations until 
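The primal decoding step described above, rounding the fractional edge indicators and then repairing any edge that was cut inside a connected component, might look like the following sketch (names are ours; the search over a few discrete thresholds mentioned in the text is omitted).

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def round_and_repair(x_frac, edges, num_nodes, thresh):
    """Threshold fractional edge indicators, take connected components of the
    uncut edges, then un-cut any edge whose endpoints fall in one component."""
    cut = np.asarray(x_frac) >= thresh
    rows = [u for (u, v), c in zip(edges, cut) if not c]
    cols = [v for (u, v), c in zip(edges, cut) if not c]
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)),
                     shape=(num_nodes, num_nodes))
    _, comp = connected_components(adj, directed=False)
    repaired = np.array([comp[u] != comp[v] for (u, v) in edges], dtype=int)
    return repaired, comp
```

In practice one would evaluate the cost of the repaired partition for each of a few candidate thresholds and keep the best, as the text describes.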
convergence or seconds had passed lowerbounds for the bsds segmentations were on the order of or we terminate when the total residual is greater than all codes were written in matlab using the blossom implementation of perfect matching and the ibm ilog cplex lp solver with default options baseline we compare our results with the hierarchical clusterings produced by the ultrametric contour map ucm ucm performs agglomerative clustering of superpixels and assigns the averaged gp value as the distance between each pair of merged regions while ucm was not explicitly designed to find the closest ultrametric it provides strong baseline for hierarchical clustering to compute the closest ultrametric corresponding to the ucm clustering result we solve the minimization in eq while restricting each multicut to be the partition at some level of the ucm hierarchy convergence and timing figure shows the average behavior of convergence as function of runtime we found the given by the cost of the decoded integer solution and the lowerbound estimated by the dual lp are very close the integrality gap is typically within of the and never more than convergence of the dual is achieved quite rapidly most instances require less than iterations to converge with roughly linear growth in the size of the lp at each iteration as cutting planes are added in fig we display histogram computed over test image problem instances of the cost of ucm solutions relative to those produced by closest ultrametric um estimated by our method ratio of less than indicates that our approach generated solution with lower distortion ultrametric in no problem instance did ucm outperform our um algorithm um mc um mc figure the proposed closest ultrametric um enforces consistency across levels while performing independent clustering mc at each threshold does not guarantee hierarchical segmentation first image columns and in the second image hierarchical segmentation um better preserves semantic parts of the two birds while correctly merging the background regions segmentation quality figure shows the segmentation benchmark accuracy of our closest ultrametric algorithm denoted um along with the baseline ultrametric contour maps algorithm ucm with and without length weighting in terms of segmentation accuracy um performs nearly identically to the state of the art ucm algorithm with some small gains in the regime it is worth noting that the bsds benchmark does not provide strong penalties for small leaks between two segments when the total number of boundary pixels involved is small our algorithm may find strong application in domains where the local boundary signal is noisier biological imaging or when is more heavily penalized while our approach is slower than agglomerative clustering it is not necessary to wait for convergence in order to produce high quality results we found that while the upper and lower bounds decrease as function of time the clustering performance as measured by is often nearly optimal after only ten seconds and remains stable figure shows plot of the achieved by um as function of time importance of enforcing hierarchical constraints although independently finding multicuts at different thresholds often produces hierarchical clusterings this is by no means guaranteed we ran algorithm while setting ωel allowing each layer to be solved independently fig shows examples where hierarchical constraints between layers improves segmentation quality relative to independent clustering at each threshold conclusion we have 
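One plausible reading of the edge-weight construction used in these experiments is sketched below; the truncation level, the sign convention of the log-odds transform, and the additive offset are all assumptions, since the exact constants are garbled in the source.

```python
import numpy as np

def edge_weights_from_gpb(gpb, eps=0.05, offset=None):
    """Hypothetical reconstruction of the edge-weight construction: clip the
    averaged gPb boundary strength away from 0 and 1, take a log-odds
    transform, and add a constant offset fixing the sign at the clip floor."""
    g = np.clip(gpb, eps, 1.0 - eps)
    theta = np.log(g / (1.0 - g))
    if offset is None:
        offset = -np.log(eps / (1.0 - eps))   # theta >= 0 at the clipped minimum
    return theta + offset

print(edge_weights_from_gpb(np.array([0.01, 0.5, 0.99])))
```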
introduced new method for approximating the closest ultrametric on planar graphs that is applicable to hierarchical image segmentation our contribution is dual cutting plane approach that exploits the introduction of novel slack terms that allow for representing much larger space of solutions with relatively few cutting planes this yields an efficient algorithm that provides rigorous bounds on the quality the resulting solution we empirically observe that our algorithm rapidly produces compelling image segmentations along with and that are nearly tight on the benchmark bsds test data set acknowledgements jy acknowledges the support of experian cf acknowledges support of nsf grants and references nir ailon and moses charikar fitting tree metrics hierarchical clustering and phylogeny in foundations of computer science pages bjoern andres joerg kappes thorsten beier ullrich kothe and fred hamprecht probabilistic image segmentation with closedness constraints in proc of iccv pages bjoern andres thorben kroger kevin briggman winfried denk natalya korogod graham knott ullrich kothe and fred hamprecht globally optimal segmentation for connectomics in proc of eccv bjoern andres julian yarkony manjunath stephen kirchhoff engin turetken charless fowlkes and hanspeter pfister segmenting planar superpixel adjacency graphs nonplanar superpixel affinity graphs in proc of emmcvpr pablo arbelaez michael maire charless fowlkes and jitendra malik contour detection and hierarchical image segmentation ieee trans pattern anal mach may yoram bachrach pushmeet kohli vladimir kolmogorov and morteza zadimoghaddam optimal coalition structure generation in cooperative graph games in proc of aaai shai bagon and meirav galun large scale correlation clustering in corr barahona on the computational complexity of ising spin glass models journal of physics mathematical nuclear and general april barahona on cuts and matchings in planar graphs mathematical programming november barahona and mahjoub on the cut polytope mathematical programming september thorsten beier thorben kroeger jorg kappes ullrich kothe and fred hamprecht cut glue and cut fast approximate solver for multicut partitioning in computer vision and pattern recognition cvpr ieee conference on pages michel deza and monique laurent geometry of cuts and metrics volume springer science business media michael fisher on the dimer solution of planar ising models journal of mathematical physics sungwoong kim sebastian nowozin pushmeet kohli and chang dong yoo correlation clustering for image segmentation in advances in neural information processing pages vladimir kolmogorov blossom new implementation of minimum cost perfect matching algorithm mathematical programming computation david martin charless fowlkes doron tal and jitendra malik database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics in proc of iccv pages david martin charless fowlkes and jitendra malik learning to detect natural image boundaries using local brightness color and texture cues ieee trans pattern anal mach may julian yarkony analyzing planarcc nips workshop julian yarkony thorsten beier pierre baldi and fred hamprecht parallel multicut segmentation via dual decomposition in new frontiers in mining complex patterns julian yarkony alexander ihler and charless fowlkes fast planar correlation clustering for image segmentation in proc of eccv chong zhang julian yarkony and fred hamprecht cell detection and segmentation 
using correlation clustering in miccai
expressing an image stream with sequence of natural sentences cesc chunseong park gunhee kim seoul national university seoul korea gunhee https abstract we propose an approach for retrieving sequence of natural sentences for an image stream since general users often take series of pictures on their special moments it would better take into consideration of the whole image stream to produce natural language descriptions while almost all previous studies have dealt with the relation between single image and single natural sentence our work extends both input and output dimension to sequence of images and sequence of sentences to this end we design multimodal architecture called coherence recurrent convolutional network crcn which consists of convolutional neural networks bidirectional recurrent neural networks and an local coherence model our approach directly learns from vast resource of blog posts as parallel training data we demonstrate that our approach outperforms other candidate methods using both quantitative measures bleu and recall and user studies via amazon mechanical turk introduction recently there has been hike of interest in automatically generating natural language descriptions for images in the research of computer vision natural language processing and machine learning while most of existing work aims at discovering the relation between single image and single natural sentence we extend both input and output dimension to sequence of images and sequence of sentences which may be an obvious next step toward joint understanding of the visual content of images and language descriptions albeit in current literature our problem setup is motivated by that general users often take series of pictures on their memorable moments for example many people who visit new york city nyc would capture their experiences with large image streams and thus it would better take the whole photo stream into consideration for the translation to natural language description figure an intuition of our problem statement with new york city example we aim at expressing an image stream with sequence of natural sentences we leverage natural blog posts to learn the relation between image streams and sentence sequences we propose coherence recurrent convolutional networks crcn that integrate convolutional networks bidirectional recurrent networks and the coherence model illustrates an intuition of our problem statement with an example of visiting nyc our objective is given photo stream to automatically produce sequence of natural language sentences that best describe the essence of the input image set we propose novel multimodal architecture named coherence recurrent convolutional networks crcn that integrate convolutional neural networks for image description bidirectional recurrent neural networks for the language model and the local coherence model for smooth flow of multiple sentences since our problem deals with learning the semantic relations between long streams of images and text it is more challenging to obtain appropriate parallel corpus than previous research of single sentence generation our idea to this issue is to directly leverage online natural blog posts as parallel training data because usually blog consists of sequence of informative text and multiple representative images that are carefully selected by authors in way of storytelling see an example in we evaluate our approach with the blog datasets of the nyc and disneyland consisting of more than blog posts with associated images although 
we focus on the tourism topics in our experiments our approach is completely unsupervised and thus applicable to any domain that has large set of blog posts with images we demonstrate the superior performance of our approach by comparing with other alternatives including we evaluate with quantitative measures bleu and recall and user studies via amazon mechanical turk amt related work due to recent surge of volume of literature on this subject of generating natural language descriptions for image data here we discuss representative selection of ideas that are closely related to our work one of the most popular approaches is to pose the text generation as retrieval problem that learns ranking and embedding in which the caption of test image is transferred from the sentences of its most similar training images our approach partly involves the text retrieval because we search for candidate sentences for each image of query sequence from training database however we then create final paragraph by considering both compatibilities between individual images and text and the coherence that captures text relatedness at the level of transitions there have been also works our key novelty is that we explicitly include the coherence model unlike videos consecutive images in the streams may show sharp changes of visual content which cause the abrupt discontinuity between consecutive sentences thus the coherence model is more demanded to make output passages fluent many recent works have exploited multimodal networks that combine deep convolutional neural networks cnn and recurrent neural network rnn notable architectures in this category integrate the cnn with bidirectional rnns recurrent convolutional nets longshort term memory nets deep boltzmann machines rnn and other variants of multimodal rnns although our method partly take advantage of such recent progress of multimodal neural networks our major novelty is that we integrate it with the coherence model as unified architecture to retrieve fluent sequential multiple sentences in the following we compare more previous work that bears particular resemblance to ours among multimodal neural network models the recurrent convolutional net is related to our objective because their framework explicitly models the relations between sequential inputs and outputs however the model is applied to video description task of creating sentence for given short video clip and does not address the generation of multiple sequential sentences hence unlike ours there is no mechanism for the coherence between sentences the work of addresses the retrieval of image sequences for query paragraph which is the opposite direction of our problem they propose latent structural svm framework to learn the semantic relevance relations from text to image sequences however their model is specialized only for the image sequence retrieval and thus not applicable to the natural sentence generation contributions we highlight main contributions of this paper as follows to the best of our knowledge this work is the first to address the problem of expressing image streams with sentence sequences we extend both input and output to more elaborate forms with respect to whole body of existing methods image streams instead of individual images and sentence sequences instead of individual sentences we develop multimodal architecture of coherence recurrent convolutional networks crcn which integrates convolutional networks for image representation recurrent networks for sentence modeling and the 
local coherence model for fluent transitions of sentences we evaluate our method with large datasets of unstructured blog posts consisting of blog posts with associated images with both quantitative evaluation and user studies we show that our approach is more successful than other alternatives in verbalizing an image stream parallel dataset from blog posts we discuss how to transform blog posts to training set of parallel data streams each of which is sequence of pairs in tn the training set size is denoted by shows the summary of steps for blog posts blog we assume that blog authors augment their text with multiple images in semantically meaningful manner in order to decompose each blog into sequence of images and associated text we first perform text segmentation and then text summarization the purpose of text segmentation is to divide the input blog text into set of text segments each of which is associated with single image thus the number of segments is identical to the number of images in the blog the objective of text summarization is to reduce each text segment into single key sentence as result of these two processes we can transform each blog into form of in tn text segmentation we first divide the blog passage into text blocks according to paragraphs we apply standard paragraph tokenizer of nltk that uses regular expressions to detect paragraph divisions we then use the heuristics based on the block distances proposed in simply we assign each text block to the image that has the minimum index distance where each text block and image is counted as single index distance in the blog text summarization we summarize each text segment into single key sentence we apply the latent semantic analysis lsa summarization method which uses the singular value decomposition to obtain the concept dimension of sentences and then recursively finds the most representative sentences that maximize the similarity for each topic in text segment data augmentation the data augmentation is technique for convolutional neural networks to improve image classification accuracies its basic idea is to artificially increase the number of training examples by applying transformations horizontal reflection or adding noise to training images we empirically observe that this idea leads better performance in our problem as well for each sequence in tn we augment each sentence tn with multiple sentences for training that is when we perform the text summarization we select highest ranked summary sentences among which the one becomes the summary sentence for the associated image and all the ones are used for training in our model with slight abuse of notation we let tnl to denote both the single summary sentence and augmented sentences we choose after thorough empirical tests text description once we represent each text segment with sentences we extract the paragraph vector to represent the content of text the paragraph vector is based unsupervised algorithm that learns feature representation from pieces of passage we learn dense vector representation separately from the two classes of the blog dataset using the gensim code we use pn to denote the paragraph vector representation for text tn we then extract parsed tree for each tn to identify coreferent entities and grammatical roles of the words we use the stanford core nlp library the parse trees are used for the local coherence model which will be discussed in section our architecture many existing sentence generation models combine words or phrases from training 
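As an illustration of the LSA summarization step, one could rank sentences by their weight in the leading SVD concepts of a sentence-term matrix, as in the minimal scikit-learn sketch below. This is not the exact pipeline used to build the dataset; the salience score and the number of sentences kept per segment are assumptions (the paper's value is not legible in the source).

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def lsa_summary(sentences, k=3):
    """LSA-style extractive summarization sketch: build a sentence-term matrix,
    take its SVD to obtain concept dimensions, and keep the k sentences with
    the largest weight across the leading concepts."""
    if len(sentences) <= k:
        return sentences
    tfidf = TfidfVectorizer().fit_transform(sentences)            # sentences x terms
    svd = TruncatedSVD(n_components=min(k, tfidf.shape[1] - 1))
    concept = svd.fit_transform(tfidf)                            # sentences x concepts
    scores = np.linalg.norm(concept, axis=1)                      # per-sentence salience
    top = np.argsort(-scores)[:k]
    return [sentences[i] for i in sorted(top)]                    # keep original order
```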
data to generate sentence for novel image our approach is one level higher we use sentences from training database to author sequence of sentences for novel image stream although our model can be easily extended to use words or phrases as basic building blocks such granularity makes sequences too long to train the language model which may cause several difficulties for learning the rnn models for example the vanishing gradient effect is hardship to backpropagate an error signal through temporal interval therefore we design our approach that retrieves individual candidate sentences for each query image from training database and crafts best sentence sequence considering both the fitness of individual pairs and coherence between consecutive sentences figure illustration of steps of blog posts and the proposed crcn architecture illustrates the structure of our crcn it consists of three main components which are convolutional neural networks cnn for image representation bidirectional recurrent neural networks brnn for sentence sequence modeling and the local coherence model for smooth flow of multiple sentences each data stream is sequence denoted by in tn we use to denote position of in sequence we define the cnn and brnn model for each position separately and the coherence model for whole data stream for the cnn component our choice is the vggnet that represents images as vectors we discuss the details of our brnn and coherence model in section and section respectively and finally present how to combine the output of the three components to create single compatibility score in section the brnn model the role of brnn model is to represent content flow of text sequences in our problem the brnn is more suitable than the normal rnn because the brnn can simultaneously model forward and backward streams which allow us to consider both previous and next sentences for each sentence to make the content of whole sequence interact with one another as shown in our brnn has five layers input layer layer output layer and relu activation layer which are finally merged with that of the coherence model into two fully connected layers note that each text is represented by paragraph vector pt as discussed in section the exact form of our brnn is as follows see together for better understanding xft wif pt bfi hft xft wf xbt wib pt bbi bf hbt xbt wb bb ot wo hft hbt bo the brnn takes sequence of text vectors pt as input we then compute xft and xbt which are the activations of input units to forward and backward units unlike other brnn models we separate the input activation into forward and backward ones with different sets of parameters wif and wib which empirically leads better performance we set the activation function to the rectified linear unit relu max then we create two independent forward and backward hidden units denoted by hft and hbt the final activation of the brnn ot can be regarded as description for the content of the sentence at location which also implicitly encodes the flow of the sentence and its surrounding context in the sequence the parameter sets to learn include weights wif wib wf wb wo and biases bfi bbi bf bb bo the local coherence model the brnn model can capture the flow of text content but it lacks learning the coherence of passage that reflects distributional syntactic and referential information between discourse entities thus we explicitly include local coherence model based on the work of which focuses on resolving the patterns of local transitions of discourse entities 
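Read literally, the BRNN equations above amount to the forward pass below (a sketch with our variable names; whether the forward and backward hidden states are summed or concatenated before the output projection is not fully legible in the source, so summation is assumed).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def brnn_forward(P, Wif, bif, Wib, bib, Wf, bf, Wb, bb, Wo, bo):
    """Forward pass of the bidirectional RNN.  P is a (T, d) sequence of
    paragraph vectors p_t; W*/b* are the weight matrices and biases."""
    T, H = P.shape[0], Wf.shape[0]
    hf, hb = np.zeros((T, H)), np.zeros((T, H))
    xf = relu(P @ Wif.T + bif)                 # forward input activations
    xb = relu(P @ Wib.T + bib)                 # backward input activations
    for t in range(T):                         # forward hidden chain
        prev = hf[t - 1] if t > 0 else np.zeros(H)
        hf[t] = relu(xf[t] + Wf @ prev + bf)
    for t in reversed(range(T)):               # backward hidden chain
        nxt = hb[t + 1] if t + 1 < T else np.zeros(H)
        hb[t] = relu(xb[t] + Wb @ nxt + bb)
    return (hf + hb) @ Wo.T + bo               # per-position outputs o_t
```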
coreferent noun phrases in the whole text as shown in we first extract parse trees for every summarized text denoted by zt and then concatenate all sequenced parse trees into one large one from which we make an entity grid for the whole sequence the entity grid is table where each row corresponds to discourse entity and each column represents sentence grammatical role are expressed by three categories and one for absent not referenced in the sentence subjects objects other than subject or object and absent after making the entity grid we enumerate the transitions of the grammatical roles of entities in the whole text we set the history parameter to three which means we can obtain transition descriptions or oox by computing the ratio of the occurrence frequency of each transition we finally create representation that captures the coherence of sequence finally we make this descriptor to vector by and forward it to relu layer as done for the brnn output combination of cnn rnn and coherence model after the relu activation layers of the rnn and the coherence model their output ot and goes through two fully connected fc layers whose role is to decide proper combination of the brnn language factors and the coherence factors we drop the bias terms for the layers and the dimensions of variables are wf wf ot st and wf wf we use the shared parameters for and so that the output mixes well the interaction between the content flows and coherency in our tests joint learning outperforms learning the two terms with separate parameters note that the multiplication wf wf of the last two fc layers does not reduce to single linear mapping thanks to dropout we assign and dropout rates to the two layers empirically it improves generalization performance much over single fc layer with dropout training the crcn to train our crcn model we first define the compatibility score between an image stream and paragraph sequence while our score function is inspired by karpathy et al there are two major differences first the score function of deals between sentence fragments and image fragments and thus the algorithm considers all combinations between them to find out the best matching on the other hand we define the score by an ordered and paired compatibility between sentence sequence and an image sequence second we also add the term that measures the relevance relation of coherency between an image sequence and text sequence finally the score skl for sentence sequence and an image stream is defined by skl skt vtl vtl where vtl denotes the cnn feature vector for image of stream we then define the cost function to train our crcn model as follows xhx max skl skk max slk skk where skk denotes the score between training pair of corresponding image and sentence sequence the objective based on the structured loss encourages aligned sequence pairs to have higher score by margin than misaligned pairs for each positive training example we randomly sample ne examples from the training set since each contrastive example has random length and is sampled from the dataset of wide range of content it is extremely unlikely that the negative examples have the same length and the same content order of sentences with positive examples optimization we use the backpropagation through time bptt algorithm to train our model we apply the stochastic gradient descent sgd with of data streams among many sgd techniques we select rmsprop optimizer which leads the best performance in our experiments we initialize the weights of our crcn model using 
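The compatibility score and training cost described above can be sketched as follows (our notation; the hinge margin is an assumption, since the source leaves its value illegible, and the shapes of positive and contrastive sequences are assumed aligned for this illustration). The score sums per-position dot products between the fused text outputs s_t and the CNN image vectors v_t, and the cost hinges misaligned pairs against the aligned pair in both directions.

```python
import numpy as np

def compatibility(S_text, V_img):
    """S_kl = sum_t <s_t, v_t> between a sentence sequence and an image stream."""
    return float(np.sum(S_text * V_img))

def structured_loss(S_pos_text, V_pos_img, negatives, margin=1.0):
    """Structured max-margin cost sketch; `negatives` is a list of contrastive
    (S_text, V_img) pairs sampled from the training set."""
    s_kk = compatibility(S_pos_text, V_pos_img)
    loss = 0.0
    for S_neg, V_neg in negatives:
        loss += max(0.0, compatibility(S_neg, V_pos_img) - s_kk + margin)   # wrong text
        loss += max(0.0, compatibility(S_pos_text, V_neg) - s_kk + margin)  # wrong images
    return loss
```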
the method of he et al which is robust in deep rectified models we observe that it is better than simple gaussian random initialization although our model is not extremely deep we use dropout regularization in all layers except the brnn with dropout for the last fc layer and for the other remaining layers retrieval of sentence sequences at test time the objective is to retrieve best sentence sequence for given query image stream iqn first we select images for each query image from training database using the on the cnn vggnet features in our experiments is successful we then generate set of sentence sequence candidates by concatenating the sentences associated with the images at each location finally we use our learned crcn model to compute the compatibility score between the query image stream and each sequence candidate according to which we rank the candidates however one major difficulty of this scenario is that there are exponentially many candidates to resolve this issue we use an approximate strategy we recursively halve the problem into subproblems until the size of the subproblem is manageable for example if we halve the search candidate length times then the search space of each subproblem becomes using the beam search idea we first find the best sequence candidates in the subproblem of the lowest level and recursively increase the candidate lengths while the maximum candidate size is limited to we set though it is an approximate search our experiments assure that it achieves almost optimal solutions with plausible combinatorial search mainly because the local fluency and coherence is undoubtedly necessary for the global one that is in order for whole sentence sequence to be fluent and coherent its any subparts must be as well experiments we compare the performance of our approach with other candidate methods via quantitative measures and user studies using amazon mechanical turk amt please refer to the supplementary material for more results and the details of implementation and experimental setting experimental setting dataset we collect blog datasets of the two topics nyc and disneyland we reuse the blog data of disneyland from the dataset of and newly collect the data of nyc using the same crawling method with in which we first crawl blog posts and their associated pictures from two popular blog publishing sites blogspot and wordpress by changing query terms from google search then we manually select the travelogue posts that describe stories and events with multiple images finally the dataset includes unique blog posts and images for nyc and blog posts and images for disneyland task for quantitative evaluation we randomly split our dataset into as training set as validation and the others as test set for each test post we use the image sequence as query iq and the sequence of summarized sentences as groundtruth tg each algorithm retrieves the best sequences from training database for query image sequence and ideally the retrieved sequences match well with tg since the training and test data are disjoint each algorithm can only retrieve similar but not identical sentences at best for quantitative measures we exploit two types of metrics of language similarity bleu cider and meteor scores and retrieval accuracies recall and median rank which are popularly used in text generation literature the recall is the recall rate of groundtruth retrieval given top candidates and the median rank indicates the median ranking value of the first retrieved groundtruth better performance is 
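The divide-and-conquer candidate search described above can be sketched as the beam procedure below (a sketch, not the authors' code). `score_fn` is assumed to score a partial sentence sequence against the corresponding slice of the query image stream, for example with the learned CRCN compatibility; `candidate_sents[t]` lists the candidate sentences retrieved for query image t.

```python
from itertools import product

def beam_retrieve(candidate_sents, score_fn, beam_size=10, chunk=2):
    """Approximate retrieval of the best sentence sequence via recursive
    halving with a beam of partial sequences at every level."""
    # Base level: enumerate short spans exhaustively and keep the best few.
    spans = []   # list of (start_index, top partial sequences for that span)
    for start in range(0, len(candidate_sents), chunk):
        block = candidate_sents[start:start + chunk]
        combos = sorted((list(c) for c in product(*block)),
                        key=lambda seq: -score_fn(start, seq))
        spans.append((start, combos[:beam_size]))
    # Recursively merge neighbouring spans, re-scoring and pruning each time.
    while len(spans) > 1:
        merged = []
        for i in range(0, len(spans) - 1, 2):
            (s1, left), (_, right) = spans[i], spans[i + 1]
            joined = sorted((a + b for a in left for b in right),
                            key=lambda seq: -score_fn(s1, seq))
            merged.append((s1, joined[:beam_size]))
        if len(spans) % 2 == 1:
            merged.append(spans[-1])
        spans = merged
    return spans[0][1][0]   # highest-scoring full sentence sequence
```

Because the beam keeps only a bounded number of partial sequences per span, the search stays tractable even though the full candidate space grows exponentially with the stream length.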
indicated by higher bleu cider meteor scores and lower median rank values baselines since the sentence sequence generation from image streams has not been addressed yet in previous research we instead extend several models that have publicly available codes as baselines including the multimodal models by kiros et al and recurrent convolutional models by karpathy et al and vinyals et al for we use the three variants introduced in the paper which are the standard model lbl and two extensions lbl and factored lbl we use the neuraltalk package authored by karpathy et al for the baseline of denoted by and denoted by as the simplest baseline we also compare with the global matching glomatch in for all the baselines we create final sentence sequences by concatenating the sentences generated for each image in the query stream lbl glomatch rcn crcn lbl glomatch rcn crcn language metrics retrieval metrics cider meteor medrank new york city disneyland table evaluation of sentence generation for the two datasets new york city and disneyland with language similarity metrics bleu and retrieval metrics median rank better performance is indicated by higher bleu cider meteor scores and lower median rank values we also compare between different variants of our method to validate the contributions of key components of our method we test the search without the rnn part as the simplest variant for each image in test query we find its most similar training images and simply concatenate their associated sentences the second variant is the method denoted by rcn that excludes the coherence model from our approach our complete method is denoted by crcn and this comparison quantifies the improvement by the coherence model to be fair we use the same vggnet feature for all the algorithms quantitative results table shows the quantitative results of experiments using both language and retrieval metrics our approach crcn and rcn outperform with large margins other baselines which generate passages without consideration of transitions unlike ours the shows the best performance among the three models of albeit with small margin partly because they share the same word dictionary in training among models the significantly outperforms the because the lstm units help learn models from irregular and lengthy data of natural blogs more robustly we also observe that crcn outperforms and rcn especially with the retrieval metrics it shows that the integration of two key components the brnn and the coherence model indeed contributes the performance improvement the crcn is only slightly better than the rcn in language metrics but significantly better in retrieval metrics it means that rcn is fine with retrieving fairly good solutions but not good at ranking the only correct solution high compared to crcn the small margins in language metrics are also attributed by their inherent limitation for example the bleu focuses on counting the matches of words and thus is not good at comparing between sentences even worse between paragraphs for fully evaluating their fluency and coherency illustrates several examples of sentence sequence retrieval in each set we show query image stream and text results created by our method and baselines except we show parts of sequences because they are rather long for illustration these qualitative examples demonstrate that our approach is more successful to verbalize image sequences that include variety of content user studies via amazon mechanical turk we perform user studies using amt to observe general 
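for concreteness, the retrieval metrics used above can be computed from the rank of the first retrieved ground truth per query, as in this small sketch

```python
import numpy as np

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Recall@K and median rank.

    `ranks` holds, for every test query, the 1-based position of the first
    retrieved ground-truth sequence in the ranked candidate list.
    """
    ranks = np.asarray(ranks)
    recall = {k: float(np.mean(ranks <= k)) for k in ks}   # fraction retrieved in top K
    return recall, float(np.median(ranks))                 # median rank, lower is better
```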
users preferences between text sequences by different algorithms since our evaluation involves multiple images and long passages of text we design our amt task to be sufficiently simple for general turkers with no background knowledge figure examples of sentence sequence retrieval for nyc top and disneyland bottom in each set we present part of query image stream and its corresponding text output by our method and baseline baselines glomatch rcn rcn nyc disneyland table the results of amt pairwise preference tests we present the percentages of responses that turkers vote for our crcn over baselines the length of query streams is except the last column which has we first randomly sample test streams from the two datasets we first set the maximum number of images per query to if query is longer than that we uniformly sample it to in an amt test we show query image stream iq and pair of passages generated by our method crcn and one baseline in random order we ask turkers to choose more agreed text sequence with iq we design the test as pairwise comparison instead of question to make answering and analysis easier the questions look very similar to the examples of we obtain answers from three different turkers for each query we compare with four baselines we choose among the three variants of and among methods we also select glomatch and rcn as the variants of our method table shows the results of amt tests which validate that amt annotators prefer our results to those of baselines the glomatch is the worst because it uses too weak image representation gist and tiny images the differences between crcn and rcn column of table are not as significant as previous quantitative measures mainly because our query image stream is sampled to relatively short the coherence becomes more critical as the passage is longer to justify this argument we run another set of amt tests in which we use images per query as shown in the last column of table the performance margins between crcn and rcn become larger as the lengths of query image streams increase this result assures that as passages are longer the coherence becomes more important and thus crcn output is more preferred by turkers conclusion we proposed an approach for retrieving sentence sequences for an image stream we developed coherence recurrent convolutional network crcn which consists of convolutional networks bidirectional recurrent networks and local coherence model with quantitative evaluation and users studies using amt on large collections of blog posts we demonstrated that our crcn approach outperformed other candidate methods acknowledgements this research is partially supported by hancom and basic science research program through national research foundation of korea references barzilay and lapata modeling local coherence an approach in acl bird loper and klein natural language processing with python reilly media chen and zitnick mind eye recurrent visual representation for image caption generation in cvpr choi and moore latent semantic analysis for text segmentation in emnlp donahue hendricks guadarrama rohrbach venugopalan saenko and darrell recurrent convolutional networks for visual recognition and description in cvpr gong wang hodosh hockenmaier and lazebnik improving embeddings using large weakly annotated photo collections in eccv he zhang ren and sun delving deep into rectifiers surpassing performance on imagenet classification in arxiv hodosh young and hockenmaier framing image description as ranking task data models and evaluation 
metrics jair
karpathy and deep alignments for generating image descriptions in cvpr
kim moon and sigal joint photo stream and blog post summarization and exploration in cvpr
kim moon and sigal ranking and retrieval of image sequences from multiple paragraph queries in cvpr
kiros salakhutdinov and zemel multimodal neural language models in icml
krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips
kulkarni premraj dhar li choi berg and berg baby talk understanding and generating image descriptions in cvpr
kuznetsova ordonez berg and choi treetalk composition and compression of trees for image descriptions in tacl
lavie meteor an automatic metric for mt evaluation with improved correlation with human judgments in acl
le and mikolov distributed representations of sentences and documents in icml
manning surdeanu bauer finkel bethard and mcclosky the stanford corenlp natural language processing toolkit in acl
mao xu yang wang huang and yuille deep captioning with multimodal recurrent neural networks in iclr
mikolov statistical language models based on neural networks in ph thesis brno university of technology
ordonez kulkarni and berg describing images using million captioned photographs in nips
papineni roukos ward and zhu bleu method for automatic evaluation of machine translation in acl
rohrbach qiu titov thater pinkal and schiele translating video content to natural language descriptions in iccv
schuster and paliwal bidirectional recurrent neural networks in ieee tsp
simonyan and zisserman very deep convolutional networks for image recognition in iclr
socher karpathy le manning and ng grounded compositional semantics for finding and describing images with sentences in tacl
srivastava and salakhutdinov multimodal learning with deep boltzmann machines in nips
tieleman and hinton lecture rmsprop in coursera
vedantam zitnick and parikh cider image description evaluation in
vinyals toshev bengio and erhan show and tell neural image caption generator in cvpr
werbos generalization of backpropagation with application to recurrent gas market model neural networks
xu xiong chen and corso jointly modeling deep video and compositional text to bridge vision and language in unified framework in aaai
parallel correlation clustering on big graphs xinghao dimitris samet benjamin kannan and michael amplab eecs at uc berkeley statistics at uc berkeley abstract given similarity graph between items correlation clustering cc groups similar items together and dissimilar ones apart one of the most popular cc algorithms is kwikcluster an algorithm that serially clusters neighborhoods of vertices and obtains ratio unfortunately in practice kwikcluster requires large number of clustering rounds potential bottleneck for large graphs we present and clusterwild two algorithms for parallel correlation clustering that run in polylogarithmic number of rounds and provably achieve nearly linear speedups uses concurrency control to enforce serializability of parallel clustering process and guarantees ratio clusterwild is coordination free algorithm that abandons consistency for the benefit of better scaling this leads to provably small loss in the approximation ratio we demonstrate experimentally that both algorithms outperform the state of the art both in terms of clustering accuracy and running time we show that our algorithms can cluster graphs in under seconds on cores while achieving speedup introduction clustering items according to some notion of similarity is major primitive in machine learning correlation clustering serves as basic means to achieve this goal given similarity measure between items the goal is to group similar items together and dissimilar items apart in contrast to other clustering approaches the number of clusters is not determined priori and good solutions aim to balance the tension between grouping all items together versus isolating them the simplest cc variant can be described on complete signed graph our input is graph on vertices with weights on edges between similar items and edges between dissimilar ones our goal is to generate partition of vertices into disjoint sets that minimizes the number of disagreeing edges this equals the number of edges cut by the clusters plus the number of edges inside the clusters this metric is commonly called the number of disagreements in figure we give toy example of cc instance cluster cluster cost edges inside clusters edges across clusters figure in the above graph solid edges denote similarity and dashed dissimilarity the number of disagreeing edges in the above clustering clustering is we color the bad edges with red entity deduplication is the archetypal motivating example for correlation clustering with applications in chat disentanglement resolution and spam detection the input is set of entities say results of keyword search and pairwise classifier that with some between entities two results of keyword search might refer to the same item but might look different if they come from different sources by building similarity graph between entities and then applying cc the hope is to cluster duplicate entities in the same group in the context of keyword search this implies more meaningful and compact list of results cc has been further applied to finding communities in signed networks classifying missing edges in opinion or trust networks gene clustering and consensus clustering kwikcluster is the simplest cc algorithm that achieves provable ratio and works in the following way pick vertex at random cluster center create cluster for and its positive neighborhood vertices connected to with positive edges peel these vertices and their associated edges from the graph and repeat until all vertices are clustered beyond its theoretical 
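a minimal serial sketch of the kwikcluster peeling procedure just described, assuming the positive neighborhoods are given as sets

```python
import random

def kwikcluster(vertices, positive_neighbors):
    """Serial KwikCluster: pick a random pivot, cluster it with its positive
    neighborhood, peel those vertices away, and repeat until nothing is left.

    `positive_neighbors[v]` is the set of vertices joined to v by a + edge.
    A minimal sketch of the 3-approximation peeling algorithm described above.
    """
    unclustered = set(vertices)
    clusters = []
    while unclustered:
        pivot = random.choice(tuple(unclustered))
        cluster = {pivot} | (positive_neighbors[pivot] & unclustered)
        clusters.append(cluster)
        unclustered -= cluster          # peel the new cluster from the graph
    return clusters
```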
guarantees experimentally kwikcluster performs well when combined with local heuristics kwikcluster seems like an inherently sequential algorithm and in most cases of interest it requires many peeling rounds this happens because small number of vertices are clustered per round this can be bottleneck for large graphs recently there have been efforts to develop scalable variants of kwikcluster in distributed peeling algorithm was presented in the context of mapreduce using an elegant analysis the authors establish in polylogarithmic number of rounds the algorithm employs simple step that rejects vertices that are executed in parallel but are conflicting however we see in our experiments this seemingly minor coordination step hinders in parallel core setting in sketch of distributed algorithm was presented this algorithm achieves the same approximation as kwikcluster in logarithmic number of rounds in expectation however it performs significant redundant work per iteration in its effort to detect in parallel which vertices should become cluster centers our contributions we present and clusterwild two parallel cc algorithms with provable performance guarantees that in practice outperform the state of the art both in terms of running time and clustering accuracy is parallel version of kwikcluster that uses concurrency control to establish ratio clusterwild is simple to implement algorithm that abandons consistency for the benefit of better scaling while having provably small loss in the approximation ratio achieves approximation ratio in number of rounds by enforcing consistency between concurrently running peeling threads consistency is enforced using concurrency control notion extensively studied for databases transactions that was recently used to parallelize inherently sequential machine learning algorithms clusterwild is parallel cc algorithm that waives consistency in favor of speed the cost we pay is an arbitrarily small loss in clusterwild accuracy we show that clusterwild achieves opt approximation in number of rounds with provable nearly linear speedups our main theoretical innovation for clusterwild is analyzing the algorithm as serial variant of kwikcluster that runs on noisy graph in our experimental evaluation we demonstrate that both algorithms gracefully scale up to graphs with billions of edges in these large graphs our algorithms output valid clustering in less than seconds on threads up to an order of magnitude faster than kwikcluster we observe how not unexpectedly clusterwild is faster than and quite surprisingly abandoning coordination in this parallel setting only amounts to of relative loss in the clustering accuracy furthermore we compare against state of the art parallel cc algorithms showing that we consistently outperform these algorithms in terms of both running time and clustering accuracy notation denotes graph with vertices and edges is complete and only has edges we denote by dv the positive degree of vertex the number of vertices connected to with positive edges denotes the positive maximum degree of and denotes the positive neighborhood of moreover let cv two vertices are termed as friends if and vice versa we denote by permutation of two parallel algorithms for correlation clustering the formal definition of correlation clustering is given below correlation clustering given graph on vertices partition the vertices into an arbitrary number of disjoint subsets ck such that the sum of negative edges within the subsets plus the sum of positive edges across the 
subsets is minimized opt min min ci ci ci ci ci cj where and are the sets of positive and negative edges in kwikcluster is remarkably simple algorithm that approximately solves the above combinatorial problem and operates as follows random vertex is picked cluster cv is created with and its positive neighborhood then the vertices in cv are peeled from the graph and this process is repeated until all vertices are clustered kwikcluster can be equivalently executed as noted by if we substitute the random choice of vertex per peeling round with random order preassigned to vertices see alg that is select random permutation on vertices then peel the vertex indexed by and its friends remove from the vertices in cv and repeat this process having an order among vertices makes the discussion of parallel algorithms more convenient parallel cc using concurency control algorithm kwikcluster with suppose we now wish to run parallel version of kwikcluster say on two threads one thread random permutation of picks vertex indexed by and the other while do thread picks indexed by concurrently select the vertex indexed by cv can both vertices be cluster centers they can remove clustered vertices from and iff they are not friends in if and are end while nected with positive edge then the vertex with the smallest order wins this is our concurency rule no now assume that and are not friends in and both and become cluster centers moreover assume that and have common unclustered friend say should be clustered with or we need to follow what would happen with kwikcluster in alg will go with the vertex that has the smallest permutation number in this case this is concurency rule no following the above simple rules we develop our serializable parallel cc algorithm since constructs the same clusters as kwikcluster for given ordering it inherits its approximation the above idea of identifying the cluster centers in rounds was first used in to obtain parallel algorithm for maximal independent set mis shown as alg starts by assigning random permutation to the vertices it then samples an active set of unclustered vertices this sample is taken from the prefix of after sampling each of the threads picks vertex with the smallest order in then checks if that vertex can become cluster center we first enforce concurrency rule no adjacent vertices can not be cluster centers at the same time enforces it by making each thread check the friends of the vertex say that is picked from thread will check in attemptcluster whether its vertex has any preceding friends that are cluster centers if there are none it will go ahead and label as cluster center and proceed with creating cluster if preceding friend of is cluster center then is labeled as not being cluster center if preceding friend of call it has not yet received label is currently being processed and is not yet labeled as cluster center or not then the thread processing will wait on to receive label the major technical detail is in showing that this wait time is bounded we show that no more than log threads can be in conflict at the same time using new subgraph sampling lemma since is serializable it has to respect concurrency rule no if vertex is adjacency to two cluster centers then it gets assigned to the one with smaller permutation order this is accomplished in createcluster after processing all vertices in all threads are synchronized in bulk the clustered vertices are removed new active set is sampled and the same process is repeated until everything has been clustered in 
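the display defining the correlation clustering objective earlier in this section is garbled by extraction; one consistent reading, writing E^+ and E^- for the sets of positive and negative edges, is

```latex
\mathrm{OPT} \;=\; \min_{\mathcal{C}=\{C_1,\dots,C_k\}}
  \;\sum_{i}\bigl|E^{-}(C_i,C_i)\bigr|
  \;+\; \sum_{i<j}\bigl|E^{+}(C_i,C_j)\bigr|
```

that is, the disagreements being minimized are the negative edges kept inside clusters plus the positive edges cut between clusters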
the following section we present the theoretical guarantees for algorithm clusterwild createcluster clusterid for do clusterid min clusterid end for input clusterid clusterid random permutation of while do attemptcluster maximum vertex degree in if clusterid and iscenter then the first vertices in createcluster while do in parallel end if first element in iscenter if then concurrency control for do check friends in order of attemptcluster if then if they precede you wait else if clusterwild then coordination free wait until clusterid till clustered createcluster if iscenter then end if return friend is center so you can be end while end if remove clustered vertices from and end if end while end for output clusterid clusterid return no earlier friends are centers so you are clusterwild correlation clustering clusterwild speeds up computation by ignoring the first concurrency rule it uniformly samples unclustered vertices and builds clusters around all of them without respecting the rule that cluster centers can not be friends in in clusterwild threads bypass the attemptcluster routine this eliminates the waiting part of clusterwild samples set of vertices from the prefix of each thread picks the first ordered vertex remaining in and using that vertex as cluster center it creates cluster around it it peels away the clustered vertices and repeats the same process on the next remaining vertex in at the end of processing all vertices in all threads are synchronized in bulk the clustered vertices are removed new active set is sampled and the parallel clustering is repeated careful analysis along the lines of shows that the number of rounds bulk synchronization steps is only quite unsurprisingly clusterwild is faster than interestingly abandoning consistency does not incur much loss in the approximation ratio we show how the error introduced in the accuracy of the solution can be bounded we characterize this error theoretically and show that in practice it only translates to only relative loss in the objective the main intuition of why clusterwild does not introduce too much error is that the chance of two randomly selected vertices being friends is small hence the concurrency rules are infrequently broken theoretical guarantees in this section we bound the number of rounds required for each algorithms and establish the theoretical speedup one can obtain with parallel threads we proceed to present our approximation guarantees we would like to remind the reader in relevant consider graphs that are complete signed and unweighted the omitted proofs can be found in the appendix number of rounds and running time our analysis follows those of and the main idea is to track how fast the maximum degree decreases in the remaining graph at the end of each round lemma and clusterwild terminate after log log rounds we now analyze the running time of both algorithms under simplified bsp model the main idea is that the the running time of each super step round is determined by the straggling thread the one that gets assigned the most amount of work plus the time needed for synchronization at the end of each round assumption we assume that threads operate asynchronously within round and synchronize at the end of round memory cell can be concurrently by multiple threads the time spent per round of the algorithm is proportional to the time of the slowest thread the cost of thread synchronization at the end of each batch takes time where is the number of threads the total computation cost is proportional to the sum 
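the sketch below shows one bulk-synchronous round of a clusterwild-style step, with the coordination-free rule that concurrently chosen centers simply claim whatever unclustered positive neighbors remain; the batch-size constants are assumptions, and a real implementation would run the inner loop across threads rather than serially

```python
def clusterwild_round(order, unclustered, positive_neighbors, cluster_id, eps, delta_max):
    """One BSP round of a ClusterWild!-style peeling step (simplified sketch).

    An active set is taken from the prefix of the random order `order`; each
    active vertex greedily claims its unclustered positive neighbors, ignoring
    the conflict rule between adjacent centers that C4 would enforce.
    `eps` and `delta_max` control the batch size and are assumed constants.
    """
    batch = max(1, int(eps * len(unclustered) / max(delta_max, 1)))
    active = [v for v in order if v in unclustered][:batch]
    for center in active:               # threads would process these concurrently
        if center not in unclustered:
            continue                    # already peeled by an earlier center this round
        cluster = {center} | (positive_neighbors[center] & unclustered)
        for v in cluster:
            cluster_id[v] = center      # last write wins, as in the lock-free variant
        unclustered -= cluster
    return unclustered                  # caller resamples and repeats until empty
```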
of the time spent for all rounds plus the time spent during the bulk synchronization step under this simplified model we show that both algorithms obtain nearly linear speedup with clusterwild being faster than precisely due to lack of coordination our main tool for analyzing is recent result from theorem which guarantees that if one samples an subset of vertices in graph the sampled subgraph has connected component of size at most log combining the above in the appendix we show the following result theorem the running time of on cores is upper bounded by log log log as long as the number of cores is smaller than mini nii where nii is the size of the batch in the round of each algorithm the running time of clusterwild on cores is upper bounded by log log approximation ratio we now proceed with establishing the approximation ratios of and clusterwild is serializable it is straightforward that obtains precisely the same approximation ratio as kwikcluster one has to simply show that for any permutation kwikcluster and will output the same clustering this is indeed true as the two simple concurrency rules mentioned in the previous section are sufficient for to be equivalent to kwikcluster theorem achieves approximation ratio in expectation clusterwild as serial procedure on noisy graph analyzing clusterwild is bit more involved our guarantees are based on the fact that clusterwild can be treated as if one was running peeling algorithm on noisy graph since adjacent active vertices can still become cluster centers in clusterwild one can view the edges between them as deleted by somewhat unconventional adversary we analyze this new noisy graph and establish our theoretical result theorem clusterwild achieves approximation in expectation we provide sketch of the proof and delegate the details to the appendix since clusterwild ignores the edges among active vertices we treat these edges as deleted in our main result we quantify the loss of clustering accuracy that is caused by ignoring these edges before we proceed we define bad triangles combinatorial structure that is used to measure the clustering quality of peeling algorithm definition bad triangle in is set of three vertices such that two pairs are joined with positive edge and one pair is joined with negative edge let tb denote the set of bad triangles in to quantify the cost of clusterwild we make the below observation lemma the cost of any greedy algorithm that picks vertex irrespective of the sampling order creates cv peels it away and repeats is equal to the number of bad triangles adjacent to each cluster center lemma let denote the random graph induced by deleting all edges between active vertices per round for given run of clusterwild and let denote the number of additional bad triangles thatp has compared to then the expected cost of clusterwild can be upper bounded as where pt is the event that triangle with end points is bad and at least one of its end points becomes active while is still part of the original unclustered graph proof we begin by bounding the second term by considering the number of new bad triangles created at each round ai using the result that clusterwild terminates after at most log log rounds we get we are left to bound pt to do that we use the following lemma pt lemma if pt satisfies then pt op proof let be one of the many sets thatpattribute in thep cost of an optimal possibly of edges pt pt pt algorithm then opt now as with we will simply have to boundsthe expectation of the bad triangles adjacent to an edge 
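a brute-force counter for the bad triangles used in the lemmas above, assuming a complete signed graph so that any pair not joined by a positive edge is a negative edge; it is cubic and meant only to make the definition concrete

```python
from itertools import combinations

def count_bad_triangles(vertices, positive_neighbors):
    """Count bad triangles: vertex triples with exactly two + edges and one - edge."""
    bad = 0
    for u, v, w in combinations(vertices, 3):
        pos = ((v in positive_neighbors[u]) + (w in positive_neighbors[u])
               + (w in positive_neighbors[v]))
        if pos == 2:        # two positive pairs, the third pair is negative
            bad += 1
    return bad
```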
let su be the union of the sets of nodes of the bad triangles that contain both vertices and observe that if some becomes active before and then cost of the cost of the bad triangle is incurred on the other hand if either or or both are selected as pivots in some round then cu can be as high as at most equal to all bad triangles containing the edge let auv or are activated before any other vertices in su then cu cu au au cu ac au where the last inequality is obtained by union bound over and we now bound the following probability observe that hence we need to upper bound the probability per round that no positive neighbors in become activated is upper bounded by hence the probability can be upper bounded as we know that and also then hence we have cu exp the overall expectation is then bounded by opt ln log our approximation ratio for clusterwild opt which establishes bsp algorithms as proxy for asynchronous algorithms we would like to note that the analysis under the bsp model can be useful proxy for the performance of completely asynchronous variants of our algorithms specifically see alg where we remove the synchronization barriers the only difference between the asynchronous execution in alg compared to alg is the complete lack of bulk synchronization at the end of the processing of each active set although the analysis of the bsp variants of the algorithms is tractable unfortunately analyzing precisely the speedup of the asynchronous and the approximation guarantees for the asynchronous clusterwild is challenging however in our experimental section we test the completely asynchronous algorithms against the bsp algorithms of the previous section and observe that they perform quite similarly both in terms of accuracy of clustering and running times we skip the constants to simplify the presentation however they are all smaller than related work correlation clustering was formally introduced by bansal et al in the general case minimizing disagreements is and hard to approximate within an arbitrarily small constant there are two variations of the problem cc on complete graphs where all edges are present and all weights are and ii cc on general graphs with arbitrary edge weights both problems are hard however the general graph setup seems fundamentally harder the best known approximation ratio for the latter is log and reduction to the minimum multicut problem indicates that any improvement to that requires fundamental breakthroughs in theoretical algorithms algorithm clusterwild asynchronous execution input clusterid clusterid random permutation of while do first element in if then concurrency control attemptcluster else if clusterwild then coordination free createcluster end if remove clustered vertices from and end while output clusterid clusterid in the case of complete unweighted graphs long series of results establishes approximation via rounded linear program lp recent result establishes approximation using an elegant rounding to the same lp relaxation by avoiding the expensive lp and by just using the rounding procedure of as basis for greedy algorithm yields kwikcluster approximation for cc on complete unweighted graphs variations of the cost metric for cc change the algorithmic landscape maximizing agreements the dual measure of disagreements or maximizing the difference between the number of agreements and disagreements come with different hardness and approximation results there are also several variants chromatic cc overlapping cc small number of clusters cc with added 
constraints that are suitable for some biology applications the way finds the cluster centers can be seen as variation of the mis algorithm of the main difference is that in our case we passively detect the mis by locking on memory variables and by waiting on preceding ordered threads this means that vertex only pushes its cluster id and status cluster to its friends versus pulling or asking for its friends cluster status this saves substantial amount of computational effort experiments our parallel algorithms were all in defer full discussion of the implementation details to appendix we ran all our experiments on amazon vcpus memory instances using threads the real graphs listed in table were each graph vertices edges description dblp network link graph of english part of wikipedia crawl of the domain crawl of the domain crawl by webbase crawler table graphs used in the evaluation of our parallel algorithms tested with different random orderings we measured the runtimes speedups ratio of runtime on thread to runtime on threads and objective values obtained by our parallel algorithms for comparison we also implemented the algorithm presented in which we denote as cdk for values of were used for bsp clusterwild bsp and cdk in the interest of space we present only representative plots of our results full results are given in our appendix code available at https cdk was only tested on the smaller graphs of and because cdk was prohibitively slow often orders of magnitude slower than clusterwild and even kwikcluster mean runtime mean runtime ms mean runtime ms mean runtime serial as bsp cw as cw bsp mean speedup serial as bsp cw as cw bsp ideal as bsp cw as cw bsp speedup number of threads mean runtimes of blocked vertices number of rounds mean number of synchronization rounds for bsp algorithms number of threads number of threads mean speedup webbase objective value relative to serial bsp min bsp mean bsp max bsp min bsp mean bsp max of blocked vertices bsp bsp bsp bsp cdk bsp cdk number of threads mean runtimes mean number of rounds algo obj value serial obj value percent of blocked vertices for bsp run with cw bsp mean cw bsp median cw as mean cw as median cdk mean cdk median number of threads median objective values cw bsp and cdk run with figure in the above figures cw is short for clusterwild bsp is short for the variants of the parallel algorithms and as is short for the asynchronous variants runtimes speedups and clusterwild are initially slower than serial due to the overheads required for atomic operations in the parallel setting however all our parallel algorithms outperform kwikcluster with threads as more threads are added the asychronous variants become faster than their bsp counterparts as there are no synchronization barrriers the difference between bsp and asychronous variants is greater for smaller clusterwild is also always faster than since there are no coordination overheads the asynchronous algorithms are able to achieve speedup of on threads the bsp algorithms have poorer speedup ratio but nevertheless achieve speedup with synchronization rounds the main overhead of the bsp algorithms lies in the need for synchronization rounds as increases the amount of synchronization decreases and with our algorithms have less than synchronization rounds which is small considering the size of the graphs and our multicore setting blocked vertices additionally incurs an overhead in the number of vertices that are blocked waiting for earlier vertices to complete we note that this overhead 
is extremely small in on all graphs less than of vertices are blocked on the larger and sparser graphs this drops to less than in of vertices objective value by design the algorithms also return the same output and thus objective value as kwikcluster we find that clusterwild bsp is at most worse than serial across all graphs and values of the behavior of asynchronous clusterwild worsens as threads are added reaching worse than serial for one of the graphs finally on the smaller graphs we were able to test cdk on cdk returns worse median objective value than both clusterwild variants conclusions and future directions in this paper we have presented two parallel algorithms for correlation clustering with nearly linear speedups and provable approximation ratios overall the two approaches support each is relatively fast relative to clusterwild we may prefer for its guarantees of accuracy and when clusterwild is accurate relative to we may prefer clusterwild for its speed in the future we intend to implement our algorithms in the distributed environment where synchronization and communication often account for the highest cost both and clusterwild are for the distributed environment since they have polylogarithmic number of rounds references ahmed elmagarmid panagiotis ipeirotis and vassilios verykios duplicate record detection survey knowledge and data engineering ieee transactions on arvind arasu christopher and dan suciu deduplication with constraints using dedupalog in data engineering icde ieee international conference on pages ieee micha elsner and warren schudy bounding and comparing methods for correlation clustering beyond ilp in proceedings of the workshop on integer linear programming for natural langauge processing pages association for computational linguistics bilal hussain oktie hassanzadeh fei chiang hyun chul lee and miller an evaluation of clustering algorithms in duplicate detection technical report francesco bonchi david and edo liberty correlation clustering from theory to practice in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages acm flavio chierichetti nilesh dalvi and ravi kumar correlation clustering in mapreduce in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages acm bo yang william cheung and jiming liu community mining from signed social networks knowledge and data engineering ieee transactions on gentile vitale zappella et al correlation clustering approach to link classification in signed networks in annual conference on learning theory pages microtome amir ron shamir and zohar yakhini clustering gene expression patterns journal of computational biology nir ailon moses charikar and alantha newman aggregating inconsistent information ranking and clustering journal of the acm jacm xinghao pan joseph gonzalez stefanie jegelka tamara broderick and michael jordan optimistic concurrency control for distributed unsupervised learning in advances in neural information processing systems pages guy blelloch jeremy fineman and julian shun greedy sequential maximal independent set and matching are parallel on average in proceedings of the annual acm symposium on parallelism in algorithms and architectures pages acm michael krivelevich the phase transition in site percolation on graphs arxiv preprint nikhil bansal avrim blum and shuchi chawla correlation clustering in ieee annual symposium on foundations of computer science pages ieee computer society moses charikar venkatesan 
guruswami and anthony wirth clustering with qualitative information in foundations of computer science proceedings annual ieee symposium on pages ieee erik demaine dotan emanuel amos fiat and nicole immorlica correlation clustering in general weighted graphs theoretical computer science shuchi chawla konstantin makarychev tselil schramm and grigory yaroslavtsev near optimal lp rounding algorithm for correlation clustering on complete and complete graphs in proceedings of the annual acm on symposium on theory of computing stoc pages chaitanya swamy correlation clustering maximizing agreements via semidefinite programming in proceedings of the fifteenth annual symposium on discrete algorithms pages society for industrial and applied mathematics ioannis giotis and venkatesan guruswami correlation clustering with fixed number of clusters in proceedings of the seventeenth annual symposium on discrete algorithm pages acm moses charikar and anthony wirth maximizing quadratic programs extending grothendieck inequality in foundations of computer science proceedings annual ieee symposium on pages ieee noga alon konstantin makarychev yury makarychev and assaf naor quadratic forms on graphs inventiones mathematicae francesco bonchi aristides gionis francesco gullo and antti ukkonen chromatic correlation clustering in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages acm francesco bonchi aristides gionis and antti ukkonen overlapping correlation clustering in data mining icdm ieee international conference on pages ieee gregory puleo and olgica milenkovic correlation clustering with constrained cluster sizes and extended weights bounds arxiv preprint boldi and vigna the webgraph framework compression techniques in www boldi rosa santini and vigna layered label propagation multiresolution ordering for compressing social networks in www acm press boldi codenotti santini and vigna ubicrawler scalable fully distributed web crawler software practice experience 
faster towards object detection with region proposal networks shaoqing kaiming he ross girshick jian sun microsoft research kahe rbg jiansun abstract object detection networks depend on region proposal algorithms to hypothesize object locations advances like sppnet and fast have reduced the running time of these detection networks exposing region proposal computation as bottleneck in this work we introduce region proposal network rpn that shares convolutional features with the detection network thus enabling nearly region proposals an rpn is network that simultaneously predicts object bounds and objectness scores at each position rpns are trained to generate highquality region proposals which are used by fast for detection with simple alternating optimization rpn and fast can be trained to share convolutional features for the very deep model our detection system has frame rate of including all steps on gpu while achieving object detection accuracy on pascal voc map and map using proposals per image code is available at https introduction recent advances in object detection are driven by the success of region proposal methods and convolutional neural networks although cnns were computationally expensive as originally developed in their cost has been drastically reduced thanks to sharing convolutions across proposals the latest incarnation fast achieves near rates using very deep networks when ignoring the time spent on region proposals now proposals are the computational bottleneck in detection systems region proposal methods typically rely on inexpensive features and economical inference schemes selective search ss one of the most popular methods greedily merges superpixels based on engineered features yet when compared to efficient detection networks selective search is an order of magnitude slower at per image in cpu implementation edgeboxes currently provides the best tradeoff between proposal quality and speed at per image nevertheless the region proposal step still consumes as much running time as the detection network one may note that fast cnns take advantage of gpus while the region proposal methods used in research are implemented on the cpu making such runtime comparisons inequitable an obvious way to accelerate proposal computation is to it for the gpu this may be an effective engineering solution but ignores the detection network and therefore misses important opportunities for sharing computation in this paper we show that an algorithmic proposals with deep to an elegant and effective solution where proposal computation is nearly given the shaoqing ren is with the university of science and technology of china this work was done when he was an intern at microsoft research tection network computation to this end we introduce novel region proposal networks rpns that share convolutional layers with object detection networks by sharing convolutions at the marginal cost for computing proposals is small per image our observation is that the convolutional conv feature maps used by detectors like fast can also be used for generating region proposals on top of these conv features we construct rpns by adding two additional conv layers one that encodes each conv map position into short feature vector and second that at each conv map position outputs an objectness score and regressed bounds for region proposals relative to various scales and aspect ratios at that location is typical value our rpns are thus kind of network fcn and they can be trained specifically for the task for generating 
detection proposals to unify rpns with fast object detection networks we propose simple training scheme that alternates between for the region proposal task and then for object detection while keeping the proposals fixed this scheme converges quickly and produces unified network with conv features that are shared between both tasks we evaluate our method on the pascal voc detection benchmarks where rpns with fast produce detection accuracy better than the strong baseline of selective search with fast meanwhile our method waives nearly all computational burdens of ss at effective running time for proposals is just milliseconds using the expensive very deep models of our detection method still has frame rate of including all steps on gpu and thus is practical object detection system in terms of both speed and accuracy map on pascal voc and map on code is available at https related work several recent papers have proposed ways of using deep networks for locating or classagnostic bounding boxes in the overfeat method fc layer is trained to predict the box coordinates for the localization task that assumes single object the fc layer is then turned into conv layer for detecting multiple objects the multibox methods generate region proposals from network whose last fc layer simultaneously predicts multiple boxes which are used for object detection their proposal network is applied on single image or multiple large image crops we discuss overfeat and multibox in more depth later in context with our method shared computation of convolutions has been attracting increasing attention for efficient yet accurate visual recognition the overfeat paper computes conv features from an image pyramid for classification localization and detection pooling spp on shared conv feature maps is proposed for efficient object detection and semantic segmentation fast enables detector training on shared conv features and shows compelling accuracy and speed region proposal networks region proposal network rpn takes an image of any size as input and outputs set of rectangular object proposals each with an objectness we model this process with fullyconvolutional network which we describe in this section because our ultimate goal is to share computation with fast object detection network we assume that both nets share common set of conv layers in our experiments we investigate the zeiler and fergus model zf which has shareable conv layers and the simonyan and zisserman model vgg which has shareable conv layers to generate region proposals we slide small network over the conv feature map output by the last shared conv layer this network is fully connected to an spatial window of the input conv region is generic term and in this paper we only consider rectangular regions as is common for many methods objectness measures membership to set of object classes background scores coordinates cls layer person anchor boxes dog horse reg layer car cat dog person intermediate layer bus person boat person person person person sliding window conv feature map figure left region proposal network rpn right example detections using rpn proposals on pascal voc test our method detects objects in wide range of scales and aspect ratios feature map each sliding window is mapped to vector for zf and for vgg this vector is fed into two sibling layer reg and layer cls we use in this paper noting that the effective receptive field on the input image is large and pixels for zf and vgg respectively this mininetwork is illustrated at single position in 
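a compact pytorch-style sketch of the mini-network just described, a small conv over the shared feature map followed by two sibling 1x1 conv layers for objectness scores and box regression; this is an illustration, not the authors' caffe implementation

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Sliding 3x3 network over the last shared conv feature map, with sibling
    1x1 conv layers for objectness (cls) and box regression (reg).
    `in_channels` is 256 for the ZF net and 512 for VGG-16; `num_anchors` is
    the number k of anchors per position."""
    def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(mid_channels, 2 * num_anchors, kernel_size=1)  # object / not-object scores
        self.reg = nn.Conv2d(mid_channels, 4 * num_anchors, kernel_size=1)  # box deltas per anchor

    def forward(self, feat):
        h = self.relu(self.conv(feat))
        return self.cls(h), self.reg(h)
```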
fig left note that because the operates in fashion the layers are shared across all spatial locations this architecture is naturally implemented with an conv layer followed by two sibling conv layers for reg and cls respectively relus are applied to the output of the conv layer anchors at each location we simultaneously predict region proposals so the reg layer has outputs encoding the coordinates of boxes the cls layer outputs scores that estimate probability of object for each the proposals are parameterized relative to reference boxes called anchors each anchor is centered at the sliding window in question and is associated with scale and aspect ratio we use scales and aspect ratios yielding anchors at each sliding position for conv feature map of size typically there are hk anchors in total an important property of our approach is that it is translation invariant both in terms of the anchors and the functions that compute proposals relative to the anchors as comparison the multibox method uses to generate anchors which are not translation invariant if one translates an object in an image the proposal should translate and the same function should be able to predict the proposal in either location moreover because the multibox anchors are not translation invariant it requires output layer whereas our method requires output layer our proposal layers have an order of magnitude fewer parameters million for multibox using googlenet million for rpn using and thus have less risk of overfitting on small datasets like pascal voc loss function for learning region proposals for training rpns we assign binary class label of being an object or not to each anchor we assign positive label to two kinds of anchors the with the highest iou overlap with box or ii an anchor that has an iou overlap higher than with any box note that single box may assign positive labels to multiple anchors we assign negative label to anchor if its iou ratio is lower than for all boxes anchors that are neither positive nor negative do not contribute to the training objective with these definitions we minimize an objective function following the loss in fast rcnn our loss function for an image is defined as pi ti lcls pi lreg ti ncls nreg for simplicity we implement the cls layer as softmax layer alternatively one may use logistic regression to produce scores here is the index of an anchor in and pi is the predicted probability of anchor being an object the label is if the anchor is positive and is if the anchor is negative ti is vector representing the parameterized coordinates of the predicted bounding box and is that of the box associated with positive anchor the classification loss lcls is log loss over two classes object not object for the regression loss we use lreg ti ti where is the robust loss function smooth defined in the term lreg means the regression loss is activated only for positive anchors and is disabled otherwise the outputs of the cls and reg layers consist of pi and ti respectively the two terms are normalized with ncls and nreg and balancing weight for regression we adopt the parameterizations of the coordinates following tx xa ty ya tw log th log tx xa ya log log where and denote the two coordinates of the box center width and height variables xa and are for the predicted box anchor box and box respectively likewise for this can be thought of as regression from an anchor box to nearby box nevertheless our method achieves regression by different manner from previous methods in regression is performed on 
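the multi-task loss and the box parameterization displays above are garbled by extraction; written out with the symbols the text defines (p_i, t_i, N_cls, N_reg, lambda, and the anchor coordinates x_a, y_a, w_a, h_a), a reconstruction consistent with the published faster r-cnn objective reads

```latex
L(\{p_i\},\{t_i\}) \;=\; \frac{1}{N_{\mathrm{cls}}}\sum_i L_{\mathrm{cls}}(p_i,\,p_i^{*})
  \;+\; \lambda\,\frac{1}{N_{\mathrm{reg}}}\sum_i p_i^{*}\,L_{\mathrm{reg}}(t_i,\,t_i^{*})
```

```latex
\begin{aligned}
t_x &= (x - x_a)/w_a, & t_y &= (y - y_a)/h_a, & t_w &= \log(w/w_a), & t_h &= \log(h/h_a),\\
t_x^{*} &= (x^{*} - x_a)/w_a, & t_y^{*} &= (y^{*} - y_a)/h_a, & t_w^{*} &= \log(w^{*}/w_a), & t_h^{*} &= \log(h^{*}/h_a)
\end{aligned}
```

here starred quantities refer to the ground-truth box associated with a positive anchor, so the regression term is active only when p_i^* = 1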
features pooled from arbitrarily sized regions and the regression weights are shared by all region sizes in our formulation the features used for regression are of the same spatial size on the feature maps to account for varying sizes set of regressors are learned each regressor is responsible for one scale and one aspect ratio and the regressors do not share weights as such it is still possible to predict boxes of various sizes even though the features are of fixed optimization the rpn which is naturally implemented as network can be trained by and stochastic gradient descent sgd we follow the imagecentric sampling strategy from to train this network each arises from single image that contains many positive and negative anchors it is possible to optimize for the loss functions of all anchors but this will bias towards negative samples as they are dominate instead we randomly sample anchors in an image to compute the loss function of where the sampled positive and negative anchors have ratio of up to if there are fewer than positive samples in an image we pad the with negative ones we randomly initialize all new layers by drawing weights from gaussian distribution with standard deviation all other layers the shared conv layers are initialized by pretraining model for imagenet classification as is standard practice we tune all layers of the zf net and and up for the vgg net to conserve memory we use learning rate of for and for the next on the pascal dataset we also use momentum of and weight decay of our implementation uses caffe sharing convolutional features for region proposal and object detection thus far we have described how to train network for region proposal generation without considering the object detection cnn that will utilize these proposals for the detection network we adopt fast and now describe an algorithm that learns conv layers that are shared between the rpn and fast both rpn and fast trained independently will modify their conv layers in different ways we therefore need to develop technique that allows for sharing conv layers between the two networks rather than learning two separate networks note that this is not as easy as simply defining single network that includes both rpn and fast and then optimizing it jointly with backpropagation the reason is that fast training depends on fixed object proposals and it is in our early implementation as also in the released code was set as and the cls term in eqn was normalized by the size ncls and the reg term was normalized by the number of anchor locations nreg both cls and reg terms are roughly equally weighted in this way https not clear priori if learning fast while simultaneously changing the proposal mechanism will converge while this joint optimizing is an interesting question for future work we develop pragmatic training algorithm to learn shared features via alternating optimization in the first step we train the rpn as described above this network is initialized with an model and for the region proposal task in the second step we train separate detection network by fast using the proposals generated by the rpn this detection network is also initialized by the model at this point the two networks do not share conv layers in the third step we use the detector network to initialize rpn training but we fix the shared conv layers and only the layers unique to rpn now the two networks share conv layers finally keeping the shared conv layers fixed we the fc layers of the fast as such both networks share the same conv 
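the anchor sampling rule for each mini-batch can be made concrete with a short sketch: draw up to an even split of positive and negative anchors and pad with negatives when positives are scarce; array names and the batch size default are illustrative

```python
import numpy as np

def sample_rpn_minibatch(labels, batch_size=256, pos_fraction=0.5, rng=np.random):
    """Select anchors that contribute to one RPN mini-batch loss.

    `labels` holds 1 for positive anchors, 0 for negative anchors, and -1 for
    anchors that are ignored (e.g. cross-boundary anchors)."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), int(batch_size * pos_fraction))
    n_neg = min(len(neg), batch_size - n_pos)      # pad with negatives if positives are few
    keep = np.concatenate([rng.choice(pos, n_pos, replace=False),
                           rng.choice(neg, n_neg, replace=False)])
    mask = np.zeros_like(labels, dtype=bool)
    mask[keep] = True
    return mask                                    # anchors used in the loss
```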
layers and form unified network implementation details we train and test both region proposal and object detection networks on images we the images such that their shorter side is pixels feature extraction may improve accuracy but does not exhibit good we also note that for zf and vgg nets the total stride on the last conv layer is pixels on the image and thus is pixels on typical pascal image even such large stride provides good results though accuracy may be further improved with smaller stride for anchors we use scales with box areas of and pixels and aspect ratios of and we note that our algorithm allows the use of anchor boxes that are larger than the underlying receptive field when predicting large proposals such predictions are not one may still roughly infer the extent of an object if only the middle of the object is visible with this design our solution does not need features or sliding windows to predict large regions saving considerable running time fig right shows the capability of our method for wide range of scales and aspect ratios the table below shows the learned average proposal size for each anchor using the zf net numbers for anchor proposal the anchor boxes that cross image boundaries need to be handled with care during training we ignore all anchors so they do not contribute to the loss for typical image there will be roughly anchors in total with the anchors ignored there are about anchors per image for training if the outliers are not ignored in training they introduce large difficult to correct error terms in the objective and training does not converge during testing however we still apply the rpn to the entire image this may generate proposal boxes which we clip to the image boundary some rpn proposals highly overlap with each other to reduce redundancy we adopt nonmaximum suppression nms on the proposal regions based on their cls scores we fix the iou threshold for nms at which leaves us about proposal regions per image as we will show nms does not harm the ultimate detection accuracy but substantially reduces the number of proposals after nms we use the ranked proposal regions for detection in the following we train fast using rpn proposals but evaluate different numbers of proposals at experiments we comprehensively evaluate our method on the pascal voc detection benchmark this dataset consists of about trainval images and test images over object categories we also provide results in the pascal voc benchmark for few models for the imagenet network we use the fast version of zf net that has conv layers and fc layers and the public that has conv layers and fc layers we primarily evaluate detection mean average precision map because this is the actual metric for object detection rather than focusing on object proposal proxy metrics table top shows fast results when trained and tested using various region proposal methods these results use the zf net for selective search ss we generate about ss table detection results on pascal voc test set trained on voc trainval the detectors are fast with zf but using various proposal methods for training and testing region proposals method boxes ss eb shared region proposals method proposals map ss eb shared unshared no nms no cls no cls no cls no reg no reg ablation experiments follow below unshared ss ss ss ss ss ss ss ss ss ss proposals by the fast mode for edgeboxes eb we generate the proposals by the default eb setting tuned for iou ss has an map of and eb has an map of rpn with fast achieves competitive results with an 
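the anchor enumeration can be written in a few lines of numpy; the box areas, aspect ratios, and stride below are the standard faster r-cnn defaults, supplied here because the exact numbers are stripped from the text above, so treat them as assumed values

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     areas=(128 ** 2, 256 ** 2, 512 ** 2),
                     ratios=(0.5, 1.0, 2.0)):
    """Enumerate k = len(areas) * len(ratios) anchors at every feature-map
    position; returns (feat_h * feat_w * k, 4) boxes as (x1, y1, x2, y2)."""
    base = []
    for area in areas:
        for r in ratios:                 # r is the height/width ratio
            w = np.sqrt(area / r)
            h = w * r
            base.append([-w / 2.0, -h / 2.0, w / 2.0, h / 2.0])
    base = np.array(base)                # (k, 4) anchors centred at the origin
    xs = (np.arange(feat_w) + 0.5) * stride      # anchor centres on the input image
    ys = (np.arange(feat_h) + 0.5) * stride
    shift_x, shift_y = np.meshgrid(xs, ys)
    shifts = np.stack([shift_x, shift_y, shift_x, shift_y], axis=-1).reshape(-1, 1, 4)
    return (shifts + base[None, :, :]).reshape(-1, 4)
```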
map of while using up to using rpn yields much faster detection system than using either ss or eb because of shared conv computations the fewer proposals also reduce the fc cost next we consider several ablations of rpn and then show that proposal quality improves when using the very deep network ablation experiments to investigate the behavior of rpns as proposal method we conducted several ablation studies first we show the effect of sharing conv layers between the rpn and fast detection network to do this we stop after the second step in the training process using separate networks reduces the result slightly to unshared table we observe that this is because in the third step when the features are used to the rpn the proposal quality is improved next we disentangle the rpn influence on training the fast detection network for this purpose we train fast model by using the ss proposals and zf net we fix this detector and evaluate the detection map by changing the proposal regions used at in these ablation experiments the rpn does not share features with the detector replacing ss with rpn proposals at leads to an map of the loss in map is because of the inconsistency between the proposals this result serves as the baseline for the following comparisons somewhat surprisingly the rpn still leads to competitive result when using the topranked proposals at indicating that the rpn proposals are accurate on the other extreme using the rpn proposals without nms has comparable map suggesting nms does not harm the detection map and may reduce false alarms next we separately investigate the roles of rpn cls and reg outputs by turning off either of them at when the cls layer is removed at thus no is used we randomly sample proposals from the unscored regions the map is nearly unchanged with but degrades considerably to when this shows that the cls scores account for the accuracy of the highest ranked proposals on the other hand when the reg layer is removed at so the proposals become anchor boxes the map drops to this suggests that the proposals are mainly due to regressed positions the anchor boxes alone are not sufficient for accurate detection for rpn the number of proposals is the maximum number for an image rpn may produce fewer proposals after nms and thus the average number of proposals is smaller table detection results on pascal voc test set the detector is fast and training data voc trainval union set of voc trainval and voc trainval for rpn the proposals for fast are this was reported in using the repository provided by this paper this number is higher in six runs method proposals ss ss unshared shared shared data map time ms table detection results on pascal voc test set the detector is fast and training data voc trainval union set of voc and voc trainval for rpn the proposals for fast are http http method proposals data map ss ss table timing ms on gpu except ss proposal is evaluated in cpu includes nms pooling fc and softmax see our released code for the profiling of running time model system conv proposal total rate vgg vgg zf ss fast rpn fast rpn fast fps fps fps we also evaluate the effects of more powerful networks on the proposal quality of rpn alone we use to train the rpn and still use the above detector of the map improves from using to using this is promising result because it suggests that the proposal quality of is better than that of because proposals of are competitive with ss both are when consistently used for training and testing we may expect to be better than ss the 
following experiments justify this hypothesis detection accuracy and running time of table shows the results of for both proposal and detection using the fast result is for unshared features slightly higher than the ss baseline as shown above this is because the proposals generated by are more accurate than ss unlike ss that is the rpn is actively trained and benefits from better networks for the variant the result is than the strong ss baseline yet with nearly proposals we further train the rpn and detection network on the union set of pascal voc trainval and trainval following the map is on the pascal voc test set table our method has an map of trained on the union set of voc and voc trainval following in table we summarize the running time of the entire object detection system ss takes seconds depending on content on average and fast with takes on ss proposals or if using svd on fc layers our system with takes in total for both proposal and detection with the conv features shared the rpn alone only takes computing the additional layers our computation is also low thanks to fewer proposals our system has of fps with the zf net analysis of next we compute the recall of proposals at different iou ratios with boxes it is noteworthy that the metric is just loosely related to the ultimate detection accuracy it is more appropriate to use this metric to diagnose the proposal method than to evaluate it zğđăůů figure recall iou overlap ratio on the pascal voc test set table detection proposal detection detection results are on the pascal voc test set using the zf model and fast rpn uses unshared features regions rpn zf unshared dense scales asp ratios dense scales asp ratios detector map fast zf scale fast zf scale fast zf scales in fig we show the results of using and proposals we compare with ss and eb and the proposals are the ranked ones based on the confidence generated by these methods the plots show that the rpn method behaves gracefully when the number of proposals drops from to this explains why the rpn has good ultimate detection map when using as few as proposals as we analyzed before this property is mainly attributed to the cls term of the rpn the recall of ss and eb drops more quickly than rpn when the proposals are fewer detection proposal detection the overfeat paper proposes detection method that uses regressors and classifiers on sliding windows over conv feature maps overfeat is detection pipeline and ours is cascade consisting of proposals and detections in overfeat the features come from sliding window of one aspect ratio over scale pyramid these features are used to simultaneously determine the location and category of objects in rpn the features are from square sliding windows and predict proposals relative to anchors with different scales and aspect ratios though both methods use sliding windows the region proposal task is only the first stage of rpn fast detector attends to the proposals to refine them in the second stage of our cascade the features are adaptively pooled from proposal boxes that more faithfully cover the features of the regions we believe these features lead to more accurate detections to compare the and systems we emulate the overfeat system and thus also circumvent other differences of implementation details by fast in this system the proposals are dense sliding windows of scales and aspect ratios fast is trained to predict scores and regress box locations from these sliding windows because the overfeat system uses an image pyramid we also evaluate 
using conv features extracted from scales we use those scales as in table compares the system and two variants of the system using the zf model the system has an map of this is lower than the system by this experiment justifies the effectiveness of cascaded region proposals and object detection similar observations are reported in where replacing ss region proposals with sliding windows leads to degradation in both papers we also note that the system is slower as it has considerably more proposals to process conclusion we have presented region proposal networks rpns for efficient and accurate region proposal generation by sharing convolutional features with the detection network the region proposal step is nearly our method enables unified object detection system to run at fps the learned rpn also improves region proposal quality and thus the overall object detection accuracy references chavali agrawal mahendru and batra evaluation protocol is gameable arxiv dai he and sun convolutional feature masking for joint object and stuff segmentation in cvpr erhan szegedy toshev and anguelov scalable object detection using deep neural networks in cvpr everingham van gool williams winn and zisserman the pascal visual object classes challenge results girshick fast girshick donahue darrell and malik rich feature hierarchies for accurate object detection and semantic segmentation in cvpr he zhang ren and sun spatial pyramid pooling in deep convolutional networks for visual recognition in eccv hosang benenson and schiele what makes for effective detection proposals hosang benenson and schiele how good are detection proposals really in bmvc jia shelhamer donahue karayev long girshick guadarrama and darrell caffe convolutional architecture for fast feature embedding krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips lecun boser denker henderson howard hubbard and jackel backpropagation applied to handwritten zip code recognition neural computation lenc and vedaldi minus long shelhamer and darrell fully convolutional networks for semantic segmentation in cvpr nair and hinton rectified linear units improve restricted boltzmann machines in icml ren he girshick zhang and sun object detection networks on convolutional feature maps russakovsky deng su krause satheesh ma huang karpathy khosla bernstein berg and imagenet large scale visual recognition challenge sermanet eigen zhang mathieu fergus and lecun overfeat integrated recognition localization and detection using convolutional networks in iclr simonyan and zisserman very deep convolutional networks for image recognition in iclr szegedy reed erhan and anguelov scalable object detection szegedy toshev and erhan deep neural networks for object detection in nips uijlings van de sande gevers and smeulders selective search for object recognition ijcv zeiler and fergus visualizing and understanding convolutional neural networks in eccv zitnick and edge boxes locating object proposals from edges in eccv 
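Before moving on, here is a minimal NumPy sketch of the greedy non-maximum suppression step that the region proposal network described above applies to its cls-scored proposals before passing the top-ranked boxes to the detector. The 0.7 IoU threshold and the [x1, y1, x2, y2] corner convention are illustrative assumptions, not values taken from the text above.

    import numpy as np

    def nms(boxes, scores, iou_threshold=0.7):
        """Greedy non-maximum suppression on RPN proposals.

        boxes  : (N, 4) array of [x1, y1, x2, y2] corners (assumed convention).
        scores : (N,) array of cls scores produced by the RPN.
        Returns the indices of the kept proposals, highest score first.
        """
        x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
        areas = (x2 - x1) * (y2 - y1)
        order = scores.argsort()[::-1]            # visit boxes in descending score order
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # intersection of the current box with all remaining boxes
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
            iou = inter / (areas[i] + areas[order[1:]] - inter)
            # drop every remaining box that overlaps the kept one too strongly
            order = order[1:][iou < iou_threshold]
        return np.array(keep, dtype=int)

In this sketch proposals are processed in descending score order and any box whose IoU with an already-kept box reaches the threshold is discarded; the surviving indices correspond to the ranked proposal set handed to the detection network.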
local embeddings ke jun alexandros viper group computer vision and multimedia laboratory university of geneva and expedia switzerland and business informatics department university of applied sciences western switzerland abstract is profound concept in physics this concept was shown to be useful for dimensionality reduction we present basic definitions with interesting we give theoretical propositions to show that is more powerful representation than euclidean space we apply this concept to manifold learning for preserving local information empirical results on nonmetric datasets show that more information can be preserved in introduction as simple and intuitive representation the euclidean space has been widely used in various learning tasks in dimensionality reduction given points in or their pairwise similarities are usually represented as corresponding set of points in the representation power of is limited some of its limitations are listed next the maximum number of points which can share common nearest neighbor is limited for for while such centralized structures do exist in real data can at most embed points with uniform similarities it is hard to model relationships with less variance even if is large enough as metric space must satisfy the triangle inequality and therefore must admit transitive similarities meaning that neighbor neighbor should also be nearby such relationships can be violated on real data social networks the gram matrix of real vectors must be positive therefore can not faithfully represent the negative of input similarities which was discovered to be meaningful to tackle the above limitations of euclidean embeddings method is to impose statistical mixture model each embedding point is random point on several candidate locations some mixture weights these candidate locations can be in the same this allows an embedding point to jump across long distance through statistical or they can be in independent resulting in different views of the input data another approach beyond euclidean embeddings is to change the embedding destination to curved space md this md can be riemannian manifold with positive definite metric or equivalently curved surface embedded in euclidean space to learn such an embedding requires expression of the distance measure this md can also be with an indefinite metric this representation under the names space minkowski space or more conveniently was shown to be powerful representation for datasets in these works an embedding is obtained through spectral decomposition of matrix which is computed based on some input data on the other hand manifold learning methods are capable of learning kernel gram matrix that encapsulates useful information into narrow band of its corresponding author usually local neighborhood information is more strongly preserved as compared to information so that the input information is unfolded in manner to achieve the desired compactness the present work advocates the representation section introduces the basic concepts section gives several simple propositions that describe the representation power of as novel contributions section applies the representation to manifold learning section shows that using the same number of parameters more information can be preserved by such embeddings as compared to euclidean embeddings this leads to new data visualization techniques section concludes and discusses possible extensions the fundamental measurements in geometry are established by the concept of metric intuitively it is or 
inner product the metric of euclidean space is everywhere identity the inner product between any two vectors and is id where id is the identity matrix ds dt is ds dt real vector space where ds dt and the metric is id this metric is not trivial it is with background in physics point in ds dt is called an event denoted by ds ds ds the first ds dimensions are where the measurements are exactly the same as in euclidean space the last dt dimensions are which cause in accordance to the metric in eq ds dt ds dx in analogy to using inner products to define distances the following definition gives dissimilarity measure between two events in ds dt definition the interval or shortly interval between any two events and is dx ds the interval can be positive zero or negative with respect to reference point ds dt the set is called light cone figure shows light cone in within the light cone negative interval occurs outside the light cone the following help to establish the concept of ds dt can accommodate an arbitrarily large number of events sharing common nearest neighbor in let and put evenly on the circle at time then is the unique nearest neighbor of ds dt can represent uniform similarities between an arbitrarily large number of points in the similarities within ai ai are uniform in ds dt the triangle inequality is not necessarily satisfied in let then the trick is that as absolute time value increases its intervals with all events at time are shrinking correspondingly similarity measures in ds dt can be the fact that is similar to and independently does not necessarily mean that and are similar neighborhood of is where this hyperboloid has infinite volume no matter how small is comparatively neighborhood in is much narrower with an exponentially shrinking volume as its radius decreases time time spa space space lightcone figure compass in the colored lines show equalinterval contours with respect to the origin all possible embeddings in resp are mapped to of as shown by the red resp blue line dimensionality reduction projects the input onto these by minimizing the kl divergence the representation capability of this section formally discusses some basic properties of ds dt in relation to dimensionality reduction we first build tool to shift between two different representations of an embedding matrix of yi yj and matrix of hyi yj from straightforward derivations we have lemma pn cn cii cij cji and kn kij kij kji are two families of real symmetric matrices dim cn dim kn linear mapping from cn to kn and its inverse are given by in eet in eet diag et ediag where and diag means the diagonal entries of as column vector cn and kn are the sets of interval matrices and matrices respectively in particular kn means gram matrix and the corresponding means square distance matrix the double centering mapping is widely used to generate gram matrix from dissimilarity matrix proposition cn events in ds dt ds dt and their intervals are prank proof cn has the vl vl λl where rank and vl are orthonormal for each rank gives the coordinates in one dimension which is if or if remark ds dt ds dt can represent any interval matrix cn or equivalently any kn comparatively can only represent kn distance matrix in is invariant to rotations in other words the direction information of point cloud is completely discarded in ds dt some direction information is kept to distinguish between and dimensions as shown in fig one can tell the direction in by moving point along the curve and measuring its interval the origin local embedding 
techniques often use similarity measures in statistical simplex pij pij pij this has one less dimension than cn and kn so that dim mapping from kn cn to is given by pij cij where is strictly monotonically decreasing function so that large probability mass is assigned to pair of events with small interval proposition trivially extends to proposition events in ds dt ds dt and their similarities are remark ds dt ds dt can represent any symmetric positive similarities typically in the in eq we have exp cn of any given is the curve eet in cij ln ij where eet in means uniform on the the corresponding curve in increment entries of by eq ee where because kn is in ee shares with common eigenvector with zero eigenvalue and the rest eigenrank values are all there exist orthonormal vectors vl and real numbers λl rank vl vl and in eet vl vl therefore rank vl vl δvl vl depending on can be negative definite positive definite or somewhere in between this is summarized in the following theorem theorem if exp in eq the in kn of is continuous curve and the number of positive eigenvalues of increases monotonically with with enough dimensions any can be perfectly represented in or timeonly or ds dt there is no particular reason to favor model because the objective of dimensionality reduction is to get compact model with small number of dimensions regardless of whether they are or formally knds dt rank ds rank dt is subset of kn in the domain kn dimensionality reduction based on the input finds some ds dt knds dt which is close to the curve in the probability domain the image of knds dt under some mapping kn is knds dt as shown in fig dimensionality reduction finds some dt knds dt so that dt is the closest point to some information theoretic measure the proximity of to dt its proximity to knds dt measures the quality of the model ds dt as the embedding target space when the model scale or the number of dimensions is given we will investigate the latter approach which depends on the choice of ds dt the mapping and some proximity measure on we will show that with the same number of dimensions ds dt the region knds dt with dimensions is naturally close to certain input local embeddings we project given similarity matrix to some knds dt or equivalently to set of events yi ds dt so that hyi yj as in eq and the similarities among these events resemble as discussed in section mapping kn helps transfer knds dt into of so that the projection can be done inside this mapping expressed in the event coordinates is given by exp kyit yjt pij kyis yjs where ds ds ds and denotes the for any pair of events yi and yj pij increases when their space coordinates move close when their time coordinates move away this agrees with the basic intuitions of for dimensions the heat kernel is used to make pij sensitive to time variations this helps to suppress events with large absolute time values which make the embedding less interpretable for dimensions the kernel as suggested by is used so that there could be more volume to accommodate the often input data based on our experience this hybrid parametrization of pij can better model real data as compared to alternative parametrizations similar to sne and an optimal embedding can be obtained by minimizing the kl divergence from the input to the output given by ij kl ij ln pij according to some straightforward derivations its gradients are ij pij yit yjt ij pij yis yjs kyi yj where ij ji and pij pji as an intuitive interpretation of gradient descent process eqs and we have that if pij ij yi 
and yj are put too far from each other then yis and yjs are attracting and yit and yjt are repelling so that their interval becomes shorter if pij ij then yi and yj are repelling in space and attracting in time during gradient descent yis are updated by the scheme as used in where each scalar parameter has its own adaptive learning rate initialized to yit are updated based on one global adaptive learning rate initialized to the learning of time should be more cautious because pij is more sensitive to time variations by eq therefore the ratio should be very small empirical results aiming at potential applications in data visualization and social network analysis we compare sne and the method proposed in section denoted as snest they are based on the same optimizer but correspond to different of as presented by the curves in fig given different embeddings of the same dataset using the same number of dimensions we perform model selection based on the kl divergence as explained in the end of section we generated toy dataset school representing school with two classes each class has students standing evenly on circle where each student is communicating with his her nearest neighbours and one teacher who is communicating with all the students in the same class and the teacher in the other class the input is distributed evenly on the pairs who are socially connected contains matrix from nips to after discarding the authors who have only one nips paper we get authors who papers the matrix is where caij denotes the number of papers thatpauthor with author the input similarity is computed so that ij caij caij caij where the number of papers is normalized by each author total number of papers is built in the same way using only the first volumes grqc is an arxiv graph with nodes and edges after removing one isolated node matrix gives the numbers of papers between any two authors who submitted to the general relativity and quantum cosmology category from january to april the input similarity satisfies ij caij caij caij is the semantic similarities among english words in each wsij is an asymmetric similaritypfrom word to word the input is normalized into probability vector so that ij wsij wsij wsji wsji is built in the same way using subset of words table shows the kl divergence in eq in most cases snest for fixed number of free parameters has the lowest kl on grqc and the embedding by snest in is even better than sne and in meaning that the embedding by snest is both compact and faithful this is in contrast to the mixture approach for visualization which multiplies the number of parameters to get faithful representation fixing the free parameters to two dimensions in has the best overall performance and snest in is worse we also discovered that using dimensions usually performs better than alternative choices such as which are not shown due to space limitation timelike dimension allows adaptation to data the investigated similarities however are table kl divergence of different embeddings after repeated runs on different configurations for each embedding the minimal kl that we have achieved within epochs is shown the bold numbers show the winners among sne and snest using the same number of parameters sne sne sne snest snest snest school grqc exp kyit ytj kyis ysj kyit ytj time teachers kyis ysj figure the embedding of school by snest in the black resp colored dots denote the students resp teachers the paper coordinates resp color mean the space resp time exp ky coordinates the links mean social 
connections the contour of in eq as function of kyis yjs and kyit yjt the unit of the displayed levels is mainly in the sense that random pair of people or words are more likely to be dissimilar rather than similar according to our experience on such datasets good performance is often achieved with mainly dimensions mixed with small number of or as suggested by table to interpret the embeddings fig presents the embedding of school in where the space and time are represented by paper coordinates and three colors levels respectively each class is embedded as circle the center of each class the teacher is lifted to different time so as to be near to all students in the same class one teacher being blue while the other being red creates between the teachers because their large time difference makes them nearby in figures and show the embeddings of and in similar to the sne visualizations it is easy to find close authors or words embedded nearby the learned however is not equivalent to the visual proximity because of the time dimension how much does the visual proximity reflect the underlying from the histogram of the time coordinates we see that the time values are in the narrow range while the range of the space coordinates is at least times larger figure shows the similarity function on the of eq over an interesting range of kyis yjs and kyit yjt in this range large similarity values are very sensitive to space variations and their red level curves are almost vertical meaning that the similarity information is largely carried by space coordinates therefore the visualization of neighborhoods is relatively accurate visually nearby points are indeed similar proximity in neighborhood is informative regarding on the other hand small similarity values are less sensitive to space variations and their blue level curves span large distance in space meaning that the visual distance between dissimilar points is less informative regarding for das atiya frasconi sminchisescu mozer grimes bengio li kim yu achan gupta ballard gerstner rosenfeld touretzky black raocohn caruana mitchell thrun courville gordon wainwright bradley tesauro lafferty sahanipillow willsky tresp montague welling dayan smyth malik hofmann kulis rahimi grauman pouget roth blei teh hinton simoncelli baldi mjolsness riesenhuber winther lee saad ghahramani moody zemel griffiths bartlett seung opper leen barber marchand jordan scholkopf amari kakade wang muller atkeson bishop tishbysaul bengioseeger weston scott doya williamsfrey crammer ratsch zador jaakkolasinger ruppin waibel yuille bach poggio maass garrigues horn weinshalldarrell schraudolphchapelle williamson hochreiter vapnik deweese lewicki koller fukumizuplatt warmuth freeman bialek lee xinggrettonsmolasimard stevens weiss gray herbrich ng blair pearlmutter viola mohri denker mel moore kearns koch bower schuurmans hastie movellan singh lecun lee liu wangjin zhang sejnowski lippmann barto defreitas murray sutton hasler attias goldstein minch roweis nowlan rasmussentenenbaum buhmann sollich cristianini graepel morgan beck johnson rumelhart cauwenberghs meir kawato cottrell smith zhang sun obermayer histogram of time coordinates giles cowan figure an embedding of in major authors with at least nips papers or with time value in the range are shown by their names other authors are shown by small dots the paper coordinates are in dimensions the positions of the displayed names are adjusted up to tiny radius to avoid text overlap the color of each name represents the time dimension the 
font size is proportional to the absolute time value example visual distance of with time difference of has roughly the same similarity as visual distance of with no time difference this is matter of embedding dissimilar samples far or very far and does not affect much the visual perception which naturally requires less accuracy on such samples however perception errors could still occur in these plots although they are increasingly unlikely as the observation radius turns small in viewing such visualizations one must count in the time represented by the colors and font sizes and remember that point with large absolute time value should be weighted higher in similarity judgment consider the learning of yi by eq if the input ij is larger than what can be faithfully modeled in model then will push to different time therefore the absolute value of time is significance measurement by fig the connection hubs and points with remote connections are more likely to be at different time emphasizing the embedding points with large absolute time values helps the user to focus on important points one can easily identify authors and popular words in figs and this type of information is not discovered by traditional embeddings conclusions and discussions we advocate the use of representation for data while previous works on such embeddings compute an indefinite kernel by simple transformations of the input data we learn indefinite kernel by manifold learning trying to better preserve the her spider shutter outdoors squid gloves clam parrot pantyhose curtains turtle pronoun cowgirl lobster crane shark feather sinker fish boat bird frontier boot dolphin crew lake space gully marine telescope planets nest birdsducks river drift submarine eagle penguin fly cue waterfall butterfly harbor tropicalsunset rocks waterdisaster thirsty roach bassflute parade board sunshine tube piano marines terminal accident hiker fleet radiator music general flap pour trailer keyssticker battery bus wanderchase singer speaker dunk blow aura plug console rankcommander fuse train lightning poem dancer balloon maroon runner valve ozone truck steambreath haul clear paddy express breezeway hoop soapcleaner fumes unload saloon handle veer hiking fog bumps tar spit boilfever cupboard running counter menthol fireman cleaning cylinder marble slurp dragon main signal spontaneous cool mild oil cornerpassage vodkalemonade scum dust emergency campingglide levelelectrician wide crooked orient explorer dryer draft shake heat fireplace beyond covereddescent pile smear reaction measure clumsy stairs attic fuzz travel toothpastepale barrel thirstsnap motion swing direction miner lift drill cube tourist rain swampspraydrain slug damp lizard shadow drunk hole cold bowl blender hockey grind snow buffalo stick milksheep teapot silverware kitchenorange juice molasses cattle fry eggoatmeal turkey intake lunch hostess chinese hot dogs coleslaw italian mixed topping crush lemon scar communist make up red relief disgust squeeze body mole frail dangerous twelve foul drugsdoctor man maiden baby essence histogram of time coordinates impatience superman spoil ghost puberty party hormones ecstasy buzz inventor egypt grace prime cross prove persuade lover intimate integrity repentance nephew royal monarch oath favor engage groom please giving rare common some court nothing bet extra reduce stolen add void quality addition get use borrow till account charity forbid possession keep beg compulsion ritual threshold norm uneven distinct crime chance for capture 
downstairs stable boots horse environmentunicorn ahoy trade money quarter due cents receipt rate excise profit value sales shop inflation economic handbag extravagant precious broke blackmail luxurystingy bum ghetto welfare welcome meet figure an embedding of in only subset is shown for clear visualization the position of each word represents its space coordinates up to tiny adjustments to avoid overlap the color of each word shows its time value the font size represents the absolute time value bours we discovered that using the same number of dimensions certain input information is better preserved in than euclidean space we built visualizer of data which automatically discovers important points to enhance the proposed visualization an interactive interface can allow the user select one reference point and show the true similarity values by aligning other points so that the visual distances correspond to the similarities proper constraints or regularization could be proposed so that the time values are discrete or sparse and the resulting embedding can be more easily interpreted the proposed learning is on knds kn or corresponding of another interesting of kn could be ttt which extends the cone to any matrix in kn with compact negative it is possible to construct of kn so that the embedder can learn whether dimension is or as another axis of future investigation given the large family of manifold learners there can be many ways to project the input information onto these the proposed method snest is based on the kl divergence in some immediate extensions can be based on other dissimilarity measures in kn or this could also be useful for faithful representations of graph datasets with indefinite weights acknowledgments this work has been supported be the department of computer science university of geneva in collaboration with swiss national science foundation project maaya grant number references zeger and gersho how many points in euclidean space can have common nearest neighbor in international symposium on information theory page van der maaten and hinton visualizing similarities in multiple maps machine learning laub and feature discovery in pairwise data jmlr jul hinton and roweis stochastic neighbor embedding in nips pages mit press cook sutskever mnih and hinton visualizing similarity data with mixture of maps in aistats pages jost riemannian geometry and geometric analysis universitext springer edition wilson hancock pekalska and duin spherical embeddings for dissimilarities in cvpr pages lunga and ersoy spherical stochastic neighbor embedding of hyperspectral data geoscience and remote sensing ieee transactions on neill geometry with applications to relativity number in series pure and applied mathematics academic press goldfarb unified approach to pattern recognition pattern recognition pekalska and duin the dissimilarity representation for pattern recognition foundations and applications world scientific laub macke and wichmann inducing metric violations in human similarity judgements in nips pages mit press van der maaten and hinton visualizing data using jmlr nov lawrence spectral dimensionality reduction via maximum entropy in aistats jmlr cp pages weinberger sha and saul learning kernel matrix for nonlinear dimensionality reduction in icml pages leskovec kleinberg and faloutsos graph evolution densification and shrinking diameters acm transactions on knowledge discovery from data nelson mcevoy and schreiber the university of south florida word association rhyme and word 
fragment norms http freeassociation 
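As a small worked illustration of the interval and double-centering machinery used in this section, the NumPy sketch below computes pairwise squared intervals for events split into space-like and time-like coordinates and maps them to a Gram-like matrix whose spectrum reveals the indefinite geometry. The -1/2 J D J centering convention follows classical multidimensional scaling, and the toy data are assumptions made only for illustration.

    import numpy as np

    def interval_matrix(Ys, Yt):
        """Pairwise squared intervals between n events whose space-like
        coordinates are Ys (n, ds) and time-like coordinates are Yt (n, dt):
        d_ij = ||Ys_i - Ys_j||^2 - ||Yt_i - Yt_j||^2.
        Unlike a Euclidean squared distance, d_ij can be negative."""
        d_space = ((Ys[:, None, :] - Ys[None, :, :]) ** 2).sum(-1)
        d_time = ((Yt[:, None, :] - Yt[None, :, :]) ** 2).sum(-1)
        return d_space - d_time

    def double_center(D):
        """Classical-MDS-style double centering K = -1/2 * J D J with
        J = I - (1/n) 11^T, mapping an interval matrix to a Gram-like matrix.
        For spacetime intervals K is in general indefinite."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        return -0.5 * J @ D @ J

    # Toy usage: a few events in two space dimensions; one event (a hub,
    # like the teacher in the school example) is displaced along a
    # time-like dimension so it can be close to all the others at once.
    rng = np.random.default_rng(0)
    Ys = rng.normal(size=(6, 2))
    Yt = np.zeros((6, 1))
    Yt[0] = 1.5
    K = double_center(interval_matrix(Ys, Yt))
    print(np.linalg.eigvalsh(K))   # typically a mixed-sign spectrum => indefinite

Running the snippet on this toy configuration typically produces both positive and negative eigenvalues, which is exactly the behavior a Euclidean Gram matrix cannot exhibit and which the space-like and time-like dimensions are meant to capture.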
convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements john lafferty university of chicago lafferty qinqing zheng university of chicago qinqing abstract we propose simple scalable and fast gradient descent algorithm to optimize nonconvex objective for the rank minimization problem and closely related family of semidefinite programs with log random measurements of positive semidefinite matrix of rank and condition number our method is guaranteed to converge linearly to the global optimum introduction semidefinite programming has become key optimization tool in many areas of applied mathematics signal processing and machine learning sdps often arise naturally from the problem structure or are derived as surrogate optimizations that are relaxations of difficult combinatorial problems in spite of the importance of sdps in efficient algorithms with polynomial runtime is widely recognized that current optimization algorithms based on interior point methods can handle only relatively small problems thus considerable gap exists between the theory and applicability of sdp formulations scalable algorithms for semidefinite programming and closely related families of nonconvex programs more generally are greatly needed parallel development is the surprising effectiveness of simple classical procedures such as gradient descent for large scale problems as explored in the recent machine learning literature in many areas of machine learning and signal processing such as classification deep learning and phase retrieval gradient descent methods in particular first order stochastic optimization have led to remarkably efficient algorithms that can attack very large scale problems in this paper we build on this work to develop algorithms for solving the rank minimization problem under random measurements and closely related family of semidefinite programs our algorithms are efficient and scalable and we prove that they attain linear convergence to the global optimum under natural assumptions the affine rank minimization problem is to find matrix of minimum rank satisfying constraints where rm is an affine transformation the underdetermined case where np is of particular interest and can be formulated as the optimization min rank subject to this problem is direct generalization of compressed sensing and subsumes many machine learning problems such as image compression low rank matrix completion and metric embedding while the problem is natural and has many applications the optimization is nonconvex and challenging to solve without conditions on the transformation or the minimum rank solution it is generally np hard existing methods such as nuclear norm relaxation singular value projection svp and alternating least squares altminsense assume that certain restricted isometry property rip holds for in the random measurement setting this essentially means that at least log measurements are available where rank in this work we assume that is positive semidefinite and ii rm is defined as tr ai where each ai is random symmetric matrix from the gaussian orthogonal ensemble goe with ai jj and ai jk for our goal is thus to solve the optimization min rank subject to tr ai bi in addition to the wide applicability of affine rank minimization the problem is also closely connected to class of semidefinite programs in section we show that the minimizer of particular class of sdp can be obtained by linear transformation of thus efficient algorithms for problem 
can be applied in this setting as well noting that solution to can be decomposed as where our approach is based on minimizing the squared residual zz tr ai bi while this is nonconvex function we take motivation from recent work for phase retrieval by et al and develop gradient descent algorithm for optimizing using carefully constructed initialization and step size our main contributions concerning this algorithm are as follows we prove that with log constraints our gradient descent scheme can exactly recover with high probability empirical experiments show that this bound may potentially be improved to rn log we show that our method converges linearly and has lower computational cost compared with previous methods we carry out detailed comparison of rank minimization algorithms and demonstrate that when the measurement matrices ai are sparse our gradient method significantly outperforms alternative approaches in section we briefly review related work in section we discuss the gradient scheme in detail our main analytical results are presented in section with detailed proofs contained in the supplementary material our experimental results are presented in section and we conclude with brief discussion of future work in section semidefinite programming and rank minimization before reviewing related work and presenting our algorithm we pause to explain the connection between semidefinite programming and rank minimization this connection enables our scalable gradient descent algorithm to be applied and analyzed for certain classes of sdps consider standard form semidefinite program min tr subject to ei bi tr em sn if is positive definite then we can write ll where where is invertible it follows that the minimum of problem is the same as min tr subject to tr ai bi ei in particular minimizers of are obtained from minimizers of where ai via the transformation since is positive semidefinite tr is equal to hence problem is the nuclear norm relaxation of problem next we characterize the specific cases where so that the sdp and rank minimization solutions coincide the following result is from recht et al theorem let rm be linear map for every integer with define the isometry constant to be the smallest value δk such that δk kxkf ka δk kxkf holds for any matrix of rank at most suppose that there exists rank matrix such that if then is the only matrix of rank at most satisfying furthermore if then can be attained by minimizing over the affine subset in other words since if holds for the transformation and one finds matrix of rank satisfying the affine constraint then must be positive semidefinite hence one can ignore the semidefinite constraint when solving the rank minimization the resulting problem then can be exactly solved by nuclear norm relaxation since the minimum rank solution is positive semidefinite it then coincides with the solution of the sdp which is constrained nuclear norm optimization the observation that one can ignore the semidefinite constraint justifies our experimental comparison with methods such as nuclear norm relaxation svp and altminsense described in the following section related work burer and monteiro proposed general approach for solving semidefinite programs using factored nonconvex optimization giving mostly experimental support for the convergence of the algorithms the first nontrivial guarantee for solving affine rank minimization problem is given by recht et al based on replacing the rank function by the convex surrogate nuclear norm as already mentioned in the 
previous section while this is convex problem solving it in practice is nontrivial and variety of methods have been developed for efficient nuclear norm minimization the most popular algorithms are proximal methods that perform singular value thresholding at every iteration while effective for small problem instances the computational expense of the svd prevents the method from being useful for large scale problems recently jain et al proposed projected gradient descent algorithm svp singular value projection that solves min ka bk subject to rank where is the vector norm and is the input rank in the th iteration svp updates as the best rank approximation to the gradient update µa which is constructed from the svd if rank then svp can recover under similar rip condition as the nuclear norm heuristic and enjoys linear numerical rate of convergence yet svp suffers from the expensive svd for large problem instances subsequent work of jain et al proposes an alternating least squares algorithm altminsense that avoids the svd altminsense factorizes into two factors such that and minimizes the squared residual by updating and alternately each update is least squares problem the authors show that the iterates obtained by altminsense converge to linearly under rip condition however the least squares problems are often it is difficult to observe altminsense converging to in practice as described above considerable progress has been made on algorithms for rank minimization and certain semidefinite programming problems yet truly efficient scalable and provably convergent algorithms have not yet been obtained in the specific setting that is positive semidefinite our algorithm exploits this structure to achieve these goals we note that recent and independent work of tu et al proposes hybrid algorithm called procrustes flow pf which uses few iterations of svp as initialization and then applies gradient descent gradient descent algorithm for rank minimization our method is described in algorithm it is parallel to the wirtinger flow wf algorithm for phase retrieval to recover complex vector cn given the squared magnitudes of its linear measurements bi where am cn et al propose method to minimize the sum of squared residuals fwf bi the authors establish the convergence of wf to the global sufficient measurements the iterates of wf converge linearly to up to global phase with high probability if and the ai are the function fwf can be expressed as fwf ai ai ai which is special case of where ai ai and each of and are rank one see figure for an illustration figure shows the convergence rate of our method our methods and results are thus generalizations of wirtinger flow for phase retrieval before turning to the presentation of our technical results in the following section we present some intuition and remarks about how and why this algorithm works for simplicity let us assume that the rank is specified correctly initialization is of course crucial in nonconvex optimization as many local minima may be present to obtain sufficiently accurate initialization we use spectral method similar to those used in the starting point is the observation that linear combination of the constraint values and matrices yields an unbiased estimate of the solution pm lemma let bi ai then where the expectation is with respect to the randomness in the measurement matrices ai based on this fact let σu be the eigenvalue decomposition of where and diag σr such that σr are the nonzero eigenvalues of of associated with let clearly zs kzs 
is the top sth eigenvector eigenvalue kzs therefore we initialize according to vs where vs λs is the top sth eigenpair of for sufficiently large it is reasonable to expect that is close to this is confirmed by concentration of measure arguments certain key properties of will be seen to yield linear rate of convergence in the analysis of convex functions nesterov shows that for unconstrained optimization the gradient descent scheme with sufficiently small step size will converge linearly to the optimum if the objective function is strongly convex and has lipschitz continuous gradient however these two properties are global and do not hold for our objective function nevertheless we expect that similar conditions hold for the local area near if so then if we start close enough to we can achieve the global optimum in our subsequent analysis we establish the convergence of algorithm with constant step size of the form kz kf where is small constant since kz kf is unknown we replace it by convergence analysis in this section we present our main result analyzing the gradient descent algorithm and give sketch of the proof to begin note that the symmetric decomposition of is not unique since dist kz kf iteration figure an instance of where is and the underlying truth is both and are minimizers linear convergence of the gradient scheme for and the distance metric is given in definition algorithm gradient descent for rank minimization input ai bi initialization set vr λr to the top eigenpairs of where vs repeat tr ai bi ai pm bi ai pr until convergence zkzk output for any orthonormal matrix thus the solution set is for some with ze kx for any ze we define the distance to the optimal solution in terms of note that kzk this set definition define the distance between and as min kz min ze our main result for exact recovery is stated below assuming that the rank is correctly specified since the true rank is typically unknown in practice one can start from very low rank and gradually increase it theorem let the condition number denote the ratio of the largest to the smallest nonzero eigenvalues of there exists universal constant such that if log with high probability the initialization satisfies σr moreover there exists universal constant such that when using constant step size kz kf with and initial value obeying the kth step of algorithm satisfies κn σr with high probability we now outline the proof giving full details in the supplementary material the proof has four main steps the first step is to give regularity condition under which the algorithm converges linearly if we start close enough to this provides local regularity property that is similar to the nesterov criteria that the objective function is strongly convex and has lipschitz continuous gradient denote the matrix closest to in the solution set definition let arg min we say that satisfies the regularity condition rc if there exist constants such that for any satisfying we have zi σr kf kz kf using this regularity condition we show that the iterative step of the algorithm moves closer to the optimum if the current iterate is sufficiently close theorem consider the update if satisfies rc kz kf and min then ακr in the next step of the proof we condition on two events that will be shown to hold with high probability using concentration results let denote small value to be specified later for any such that kuk ai ai rδ for any for all zs zk zs zk here the expectations are with respect to the random measurement matrices under these assumptions we 
can show that the objective satisfies the regularity condition with high probability theorem suppose that and hold if σr then satisfies the regularity condition rc σr with probability at least where are universal constants next we show that under good initialization can be found theorem suppose that holds let vs λs be the top eigenpairs of bi ai such that let zr where zs vs if then finally we show that conditioning on and is valid since these events have high probability as long as is sufficiently large theorem if the number of samples log then for any min satisfying kuk ai ai rδ holds with probability at least where and are universal constants theorem for any rn if log then for any min for all zs zk zs zk with probability at least note that since we need min σr we have and the number of ments required by our algorithm scales as log while only log samples are required by the regularity condition we conjecture this bound could be further improved to be rn log this is supported by the experimental results presented below recently tu et al establish tighter bound overall specifically when only one svp step is used in preprocessing the initialization of pf is also the spectral decomposition of the authors show that measurements are sufficient for to satisfy σr with high probability and demonstrate an rn sample complexity for the regularity condition experiments in this section we report the results of experiments on synthetic datasets we compare our gradient descent algorithm with nuclear norm relaxation svp and altminsense for which we drop the positive semidefiniteness constraint as justified by the observation in section we use admm for the nuclear norm minimization based on the algorithm for the mixture approach in tomioka et al see appendix for simplicity we assume that altminsense svp and the gradient scheme know the true rank krylov subspace techniques such as the lanczos method could be used compute the partial eigendecomposition we use the randomized algorithm of halko et al to compute the low rank svd all methods are implemented in matlab and the experiments were run on macbook pro with intel core processor and gb memory computational complexity it is instructive to compare the cost of the different approaches see table suppose that the density fraction of nonzero entries of each ai is for altminsense the cost of solving the least squares problem is rρ the other three methods have cost to compute the affine transformation for the nuclear norm approach the cost is from the svd and the cost is due to the update of the dual variables the gradient scheme requires operations to compute and to multiply by matrix to obtain the gradient svp needs operations to compute the top singular vectors however in practice this partial svd is more expensive than the cost required for the matrix multiplies in the gradient scheme method complexity nuclear norm minimization via admm gradient descent svp altminsense rρ table computational complexities of different methods clearly altminsense is the least efficient for the other approaches in the dense case large the affine transformation dominates the computation our method removes the overhead caused by the svd in the sparse case small the other parts dominate and our method enjoys low cost runtime comparison we conduct experiments for both dense and sparse measurement matrices altminsense is indeed slow so we do not include it here in the first scenario we randomly generate matrix xx where we also generate matrices am from the goe and then take kf kf for 
we report the relative error measured in the frobenius norm defined as kx the nuclear norm approach we set the regularization parameter to we test three values for the penalty parameter and select as it leads to the fastest convergence similarly for svp we evaluate the three values for the step size and select as the largest for which svp converges for our approach we test the three values for and select in the same way probability of successful recovery kf kx kx kf kf kx kx kf nuclear norm svp gradient descent time seconds time seconds gradient svp nuclear gradient svp nuclear gradient svp nuclear gradient svp nuclear figure runtime comparison where is and ai are dense runtime comparison where is and ai are sparse sample complexity comparison in the second scenario we use more general and practical setting we randomly generate matrix as before we generate sparse ai whose entries are bernoulli with probability ai jk with probability where we use for all the methods we use the same strategies as before to select parameters for the nuclear norm approach we try three values and select for svp we test the three values for the step size and select for the gradient algorithm we check the three values for and choose the results are shown in figures and in the dense case our method is faster than the nuclear norm approach and slightly outperforms svp in the sparse case it is significantly faster than the other approaches sample complexity we also evaluate the number of measurements required by each method to exactly recover which we refer to as the sample complexity we randomly generate the true matrix and compute the solutions of each method given measurements where the ai are randomly drawn from the goe solution with relative error below is considered to be successful we run trials and compute the empirical probability of successful recovery we consider cases where or and is of rank one or two the results are shown in figure for svp and our approach the phase transitions happen around when is and when is this scaling is close to the number of degrees of freedom in each case this confirms that the sample complexity scales linearly with the rank the phase transition for the nuclear norm approach occurs later the results suggest that the sample complexity of our method should also scale as rn log as for svp and the nuclear norm approach conclusion we connect special case of affine rank minimization to class of semidefinite programs with random constraints building on recently proposed algorithm for phase retrieval we develop gradient descent procedure for rank minimization and establish convergence to the optimal solution with log measurements we conjecture that rn log measurements are sufficient for the method to converge and that the conditions on the sampling matrices ai can be significantly weakened more broadly the technique used in this the semidefinite matrix variable recasting the convex optimization as nonconvex optimization and applying firstorder proposed by burer and monteiro may be effective for much wider class of sdps and deserves further study acknowledgements research supported in part by nsf grant and onr grant references arash amini and martin wainwright analysis of semidefinite relaxations for sparse principal components the annals of statistics francis bach adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression the journal of machine learning research francis bach and eric moulines analysis of stochastic approximation algorithms for 
Acknowledgements. Research supported in part by an NSF grant and an ONR grant.

References
Arash Amini and Martin Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics.
Francis Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. The Journal of Machine Learning Research.
Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NIPS).
Samuel Burer and Renato D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming.
Jian-Feng Cai, Emmanuel Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization.
Emmanuel Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: theory and algorithms. arXiv preprint.
Alexandre d'Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert Lanckriet. A direct formulation for sparse PCA using semidefinite programming. In S. Thrun, L. Saul, and B. Schoelkopf, eds., Advances in Neural Information Processing Systems (NIPS).
Michel Goemans and David Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM.
Nathan Halko, Per-Gunnar Martinsson, and Joel Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review.
Matt Hoffman, David Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research.
Prateek Jain, Raghu Meka, and Inderjit Dhillon. Guaranteed rank minimization via singular value projection. In Advances in Neural Information Processing Systems (NIPS).
Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC).
Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics.
Michel Ledoux and Brian Rider. Small deviations for beta ensembles. Electronic Journal of Probability.
Raghu Meka, Prateek Jain, Constantine Caramanis, and Inderjit Dhillon. Rank minimization via online learning. In Proceedings of the International Conference on Machine Learning (ICML).
Yurii Nesterov. Introductory Lectures on Convex Optimization. Springer Science & Business Media.
Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems (NIPS).
Benjamin Recht, Maryam Fazel, and Pablo Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review.
Ryota Tomioka, Kohei Hayashi, and Hisashi Kashima. Estimation of low-rank tensors via convex optimization. arXiv preprint.
Joel Tropp. An introduction to matrix concentration inequalities. arXiv preprint.
Stephen Tu, Ross Boczar, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint.
smooth interactive submodular set cover yisong yue california institute of technology yyue bryan he stanford university bryanhe abstract interactive submodular set cover is an interactive variant of submodular set cover over hypothesis class of submodular functions where the goal is to satisfy all sufficiently plausible submodular functions to target threshold using as few actions as possible it models settings where there is uncertainty regarding which submodular function to optimize in this paper we propose new extension which we call smooth interactive submodular set cover that allows the target threshold to vary depending on the plausibility of each hypothesis we present the first algorithm for this more general setting with theoretical guarantees on optimality we further show how to extend our approach to deal with realvalued functions which yields new theoretical results for submodular set cover for both the interactive and settings introduction in interactive submodular set cover issc the goal is to interactively satisfy all plausible submodular functions in as few actions as possible issc is framework that generalizes both submodular set cover by virtue of being interactive as well as some instances of active learning by virtue of many active learning criteria being submodular key characteristic of issc is the priori uncertainty regarding the correct submodular function to optimize for example in personalized recommender systems the system does not know the user preferences priori but can learn them interactively via user feedback thus any algorithm must choose actions in order to disambiguate between competing hypotheses as well as optimize for the most plausible ones this issue is also known as the tradeoff in this paper we propose the smooth interactive submodular set cover problem which addresses two important limitations of previous work the first limitation is that conventional issc only allows for single threshold to satisfy and this all or nothing nature can be inflexible for settings where the covering goal should vary smoothly based on plausibility in smooth issc one can smoothly vary the target threshold of the candidate submodular functions according to their plausibility in other words the less plausible hypothesis is the less we emphasize maximizing its associated utility function we present simple greedy algorithm for smooth issc with provable guarantees on optimality we also show that our smooth issc framework and algorithm fully generalize previous instances of and algorithms for issc by reducing back to just one threshold one consequence of smooth issc is the need to optimize for functions which leads to the second limitation of previous work many natural classes of submodular functions are realvalued cf however submodular set cover both interactive and has only been rigorously studied for integral or rational functions with fixed denominator which highlights significant gap between theory and practice we propose relaxed version of smooth issc using an approximation tolerance such that one needs only to satisfy the set cover criterion to within we extend our greedy algorithm to provably optimize for submodular functions within this tolerance to the best of our knowledge this yields the first theoretically rigorous algorithm for submodular set cover both interactive and problem smooth interactive submodular set cover given hypothesis class does not necessarily contain query set and response set with known for modular query cost function defined over monotone 
submodular objective functions fh for monotone submodular distance functions gh for with gh gh for any if threshold function mapping distance to required objective function value protocol for ask question and receive response goal using minimal cost terminate when fh gh for all where and background submodular set cover in the basic submodular set cover problem we are given an action set and monotone submodular set function that maps subsets to scalar values set function is monotone and submodular if and only if and respectively where denotes set addition in other words monotonicity implies that adding set always yields gain and submodularity implies that adding to smaller set results in larger gain than adding to larger set we also assume that each is associated with modular or additive cost given pa target threshold the goal is to select set that satisfies with minimal cost this problem is nphard but for simple greedy forward selection can provably achieve cost of at most ln op and is typically very effective in practice one motivating application is content recommendation where are items to recommend captures the utility of and is the satisfaction goal monotonicity of captures the property that total utility never decreases as one recommends more items and submodularity captures the the diminishing returns property when recommending redundant items interactive submodular set cover in the basic interactive setting the decision maker must optimize over hypothesis class of submodular functions fh the setting is interactive whereby the decision maker chooses an action or query and the environment provides response each query is now function mapping hypotheses to responses and the environment provides responses according to an unknown true hypothesis this process iterates until where denotes the set of observed pairs the goal is to satisfy with minimal cost for example when recommending movies to new user with unknown interests cf can be set of user types or movie genres action drama horror then would contain individual movies that can be recommended and would be yes or no response or an integer rating representing how interested the user modeled as is in given movie the interactive setting is both learning and covering problem as opposed to just covering problem the decision maker must balance between disambiguating between hypotheses in identifying which is the true and satisfying the covering goal this issue is also known as the tradeoff noisy issc extends basic issc by no longer assuming the true is in and uses distance function gh and tolerance such that the goal is to satisfy fh for all sufficiently plausible where plausibility is defined as gh problem statement we now present the smooth interactive submodular set cover problem which generalizes basic and noisy issc described in section like basic issc each hypothesis is associated with utility function fh that maps sets of pairs to fh fh gh fh fh gh gh gh figure examples of multiple thresholds approximate multiple thresholds continuous convex threshold and an approximate continuous convex threshold for the approximate setting we essentially allow for satisfying any threshold function that resides in the yellow region scalars like noisy issc the hypothesis class does not necessarily contain the true the agnostic setting each is associated with distance or disagreement function gh which maps sets of pairs to disagreement score the larger gh is the more disagrees with we further require that fh and gh problem describes the general problem 
setting let denote the set of all possible pairs given by the goal is to construct set with minimal cost such that for every we have fh gh where maps disagreement values to desired utilities in general is function since the goal is to optimize more the most plausible hypotheses in we describe two versions of below version step function multiple thresholds the first version uses decreasing step function see figure given pair of sequences αn and κn the threshold function is αnκ where nκ min κn and αn κn the goal in problem is equivalently and satisfy fh αn whenever gh κn this version is strict generalization of noisy issc which uses only single and version convex threshold curve the second version uses convex that decreases continuously as gh increases see figure and is not strict generalization of noisy issc approximate thresholds finally we also consider relaxed version of smooth issc whereby we only require that the objectives fh be satisfied to within some tolerance more formally we say that we approximately solve problem with tolerance if its goal is redefined as using minimal cost guarantee fh gh for all see figure for the approximate versions of the multiple tresholds and convex versions respectively issc has only been rigorously studied when the utility functions are fh are with fixed denominator we show in section how to efficiently solve the approximate version of smooth issc when fh are which also yields new approach for approximately solving the classical submodular set cover problem with objective functions algorithm main results key question in the study of interactive optimization is how to balance the tradeoff on the one hand one should exploit current knowledge to efficiently satisfy the plausible submodular functions however hypotheses that seem plausible might actually not be due to imperfections in the algorithm knowledge one should thus explore by playing actions that disambiguate the plausibility of competing hypotheses our setting is further complicated due to also solving combinatorial optimization problem submodular set cover which is in general intractable approach outline we present general greedy algorithm described in algorithm below for solving smooth issc with provably cost algorithm requires as input submodular algorithm worst case greedy algorithm for smooth interactive submodular set cover input input input input while do play observe end while variable fh gh df dg αi κi αn submodular termination threshold for query or action set response set definition set of hypotheses set of actions or queries set of responses monotone submodular utility function monotone submodular distance function monotone submodular function unifying fh gh and the thresholds maximum value held by denominator for fh when rational denominator for gh when rational continuous convex threshold thresholds for is largest thresholds for is smallest number of thresholds approximation tolerance for the case surrogate utility function for the approximate version surrogate thresholds for the approximate version figure summary of notation used the top portion is used in all settings the middle portion is used for the multiple thresholds setting the bottom portion is used for functions that quantifies the and the specific instantiation of depends on which version of smooth issc is being solved algorithm greedily optimizes for the worst case outcome at each iteration line until termination condition has been met line the construction of is essentially reduction of smooth issc to simpler submodular 
set cover problem and generalizes the reduction approach in in particular we first lift the analysis of to deal with multiple thresholds section we then show how to deal with approximate thresholds in the setting section which finally allows us to address the continuous threshold setting section our cost guarantees are stated relative to the general cover cost gcc which lower bounds the optimal cost as stated in definition and lemma below via this reduction we can show that our approach achieves cost bounded by ln gcc ln op for clarity of exposition all proofs are deferred to the supplementary material definition general cover cost gcc define oracles rq to be functions mapping questions to responses and is the set of pairs given by for the set of questions define the general cover cost as gcc max min lemma lemma from if there is question asking strategy for satisfying with worst case cost then gcc thus gcc op multiple thresholds version we begin with the multiple thresholds version in this section we assume that each fh and gh are with fixed denominators df and dg we first define doubly when each fh gh are then df dg respectively figure fh gh max gh fh figure this figure shows the relationship between the terms defined in definition for between terms the if nmax either fhidefined or in ghdefinition generates nmaxtradeoff αnbetween κn then fhi this generates the tradeoff either of theeither two thresholds forg or max max for all the either this creates the thresholds requirement that the thresholds must be satisfied between satisfying of the two then hi nmax max for for all this creates the requirement that all of the hypotheses must be this enforces that all at least one of the thresholds or must be satisfied satisfied then this enforces that all hypotheses must be satisfied using and we define the general forms ofutility and and used in sections and truncated version of each hypothesis submodular function each of these sections will apply this definition to different choices of fh gh and to solve their problem in this definition is constant to make fh αvariants αj max min fh αn αj αj to be is the contribution to the maximum value from fh and and is the contribution to the maximum from gh and gh κn κvalue max min gh κn κj κj general and other definition words at αj and from above at αn it is assumed that αj is truncated from below is offset by so that gh is constructed αj analogously jh nn and which can be instantiated to using and we can the general address different of smooth versions issc definition form and max fh αn gh κn αn definition multiple thresholds to solve the multiple thresholds version of the problem fh κj and are used without modification the constants are set as the following cg df dg df dg the coefficient converts each to be cf is the contribution to and and to from gh maximizing and κn the most plausible fh and exthis definition trades off between is theofcontribution from fh ploration distinguishing between version more andof less plausible fh by each to reach its definition multiple thresholds issc given allowing αn and κn we inmaximum value either by having fh reach or having gh reach in other words each of the stantiate thresholds and definition via maxbeinsatisfied with either sufficiently large utility fh or sufficiently large distance gh figure shows the logical relationships between these components df dn cf df cg κn we prove in appendix that is monotone submodular and that finding such that is equivalent to solving problem for definition we require that and 
thresholds satisfy for to beexploitation submodular maximizing the plausible fh and exploration in definition trades off by disambiguating in fhh allowing each to reach its maximum by either fh reachconditionplausibility the sequence is ing or reaching in other words each can be satisfied with either sufficiently large theorem let fh and gh be monotone submodular and with fixed denominator utility large distance figure shows the relationships between components fhdor and dg respectively then if condition logical holds then applying algorithm using and from solves the multiple thresholds version of problem with cost at most max definition we in appendix an that is and that finding an such that ln gcc is equivalent to solving problem for to be submodular we also require condition which is essentially discrete analogue to the condition that continuous should be convex is condition the sequence ακnn theorem given condition algorithm definition solves the multiple thresholds qn version of problem using cost at most ln dg κn gcc if each gh is integral and κn then the bound simplifies to ln gcc we present an alternative formulation in appendix that has better bounds when dg is large but is less flexible and can not be easily extended to the and convex threshold curve settings approximate thresholds for functions solving even submodular set cover is extremely challenging when the utility functions fh are for example appendix describes setting where the greedy algorithm performs arbitrarily poorly we now extend the results from section to fh and αn rather than trying to solve the problem exactly we instead solve relaxed or approximate version which will be useful for the convex threshold curve setting let denote approximation tolerance for fh denote rounding up to the nearest multiple of and denote rounding down to the nearest multiple of we define surrogate problem definition approximate thresholds for functions define the following approximations to fh and αn fh αn κj αn dg κj dg instantiate and in definition using above gh κn and dg cf cg dg κn we prove in appendix that definition is an instance of smooth issc problem and that solving definition will approximately solve the original smooth issc problem theorem given condition algorithm using definition will approximately solve the multiple thresholds version of problem with tolerance using cost at most qn ln dg κn gcc we show in appendix how to apply this result to approximately solve the basic submodular set cover problem with objectives note that if is selected as the smallest distinct difference between values in fh then the approximation will be exact convex threshold curve version we now address the setting where the threshold curve is continuous and convex we again solve the approximate version since the threshold curve is necessarily let be the tolerance for let be defined so that dg is the maximal value of gh we convert the continuous version to multiple threshold version with thresholds that is within an of the former as shown below definition equivalent multiple thresholds for continuous convex curve instantiate and in definition using gh without modification and sequence of thresholds fh αn dg κj κn dg with constants set as cf cg dg κn dg note that the are not too expensive to compute we prove in appendix that satisfying this set of thresholds is equivalent to satisfying the original curve within note also that definition uses the same form as definition to handle the approximation of functions theorem applying algorithm using 
definition approximately solves the convex threshn old version of problem with tolerance using cost at most ln dg gcc note that if is sufficiently large then could in principle be smaller which can lead to less conservative approximations there may also be more precise approximations by reducing to other formulations for the setting appendix simulation experiments comparison of methods to solve multiple thresholds we compared our multiple threshold method against multiple baselines see appendix for more details in range of simulation settings see appendix figure shows the results we see that our approach is consistently amongst the best performing methods the primary competitor is the circuit of constraints approach from see appendix for comparison of the theoretical guarantees we also note that all approaches dramatically outperform their guarantees cost for setting cost for setting cost for setting multiple threshold def alternative def circuit def forward sec backward sec cost cost cost percentile percentile percentile figure comparison against baselines in three simulation settings validating approximation tolerances we also validated the efficacy of our approximate thresholds relaxation see appendix for more details of the setup figure shows the results we see that the actual deviation from the original smooth issc problem is much smaller than the specified which suggests that our guarantees are rather conservative for instance at the algorithm is allowed to terminate immediately we also see that the cost to completion steadily decreases as increases which agrees with our theoretical results cost vs deviation cost deviation vs figure comparing cost and deviation from the exact function for varying summary of results discussion figure summarizes the size of or for functions for the various settings recall that our cost guarantees take the form ln op when fh are then we instead solve the smooth issc problem approximately with cost guarantee ln op our results are well developed for many different versions of the utility functions fh but are less flexible for the distance functions gh for example even for gh scales as dg which is not desirable the restriction of gh to be rational or integral leads to relatively straightforward reduction of the continuous convex version of to multiple thresholds version in fact our formulation can be extended to deal with gh and κn in the multiple thresholds version however the resulting is no longer guaranteed to be submodular it is possible that different assumption than the one imposed in condition is required to prove more general results rational rational real rational multiple thresholds qn df dg κi qn dg κi convex threshold curve df dg dg figure summarizing when fh are we show instead our analysis appears to be overly conservative for many settings for instance all the approaches we evaluated empirically achieved much better performance than their guarantees it would be interesting to identify ways to constrain the problem and develop tighter theoretical guarantees other related work submodular optimization is an important problem that arises across many settings including sensor placements summarization inferring latent influence networks diversified recommender systems and multiple solution prediction however the majority of previous work has focused on offline submodular optimization whereby the submodular function to be optimized is fixed priori does not vary depending on feedback there are two typical ways that submodular optimization problem 
can be made interactive the first is in online submodular optimization where an unknown submodular function must be reoptimized repeatedly over many sessions in an online or fashion in this setting feedback is typically provided only at the conclusion of session and so adapting from feedback is performed between sessions in other words each session consists of submodular optimization problem and the technical challenge stems from the fact that the submodular function is unknown priori and must be learned from feedback provided post optimization in each session this setting is often referred to as interactive optimization the other way to make submodular optimization interactive which we consider in this paper is to make feedback available immediately after each action taken in this way one can simultaneously learn about and optimize for the unknown submodular function within single optimization session this setting is often referred to as interactive optimization one can also consider settings that allow for both and interactive optimization perhaps the most application of interactive submodular optimization is active learning where the goal is to quickly reduce the hypothesis class to some target residual uncertainty for planning or decision making many instances of noisy and approximate active learning can be formulated as an interactive submodular set cover problem related setting is adaptive submodularity which is probabilistic setting that essentially requires that the conditional expectation over the hypothesis set of submodular functions is itself submodular function in contrast we require that the hypothesis class be pointwise submodular each hypothesis corresponds to different submodular utility function although neither adaptive submodularity nor pointwise submodularity is strict generalization of the other cf in practice it can often be easier to model application settings using pointwise submodularity the flipped problem is to maximize utility with bounded budget which is commonly known as the budgeted submodular maximization problem interactive budgeted maximization has been analyzed rigorously for adaptive submodular problems but it remains challenge to develop provably interactive algorithms for pointwise submodular utility functions conclusions we introduced smooth interactive submodular set cover smoothed generalization of previous issc frameworks smooth issc allows for the target threshold to vary based on the plausibility of the hypothesis smooth issc also introduces an approximate threshold solution concept that can be applied to functions which also applies to basic submodular set cover with objectives we developed the first provably algorithm for this setting references dhruv batra payman yadollahpour abner and gregory shakhnarovich diverse solutions in markov random fields in european conference on computer vision eccv yuxin chen and andreas krause batch mode active learning and adaptive submodular optimization in international conference on machine learning icml debadeepta dey tommy liu martial hebert and andrew bagnell contextual sequence prediction via submodular function optimization in robotics science and systems conference rss khalid and carlos guestrin beyond keyword search discovering relevant scientific literature in acm conference on knowledge discovery and data mining kdd khalid gaurav veda dafna shahaf and carlos guestrin turning down the noise in the blogosphere in acm conference on knowledge discovery and data mining kdd victor gabillon branislav 
kveton zheng wen brian eriksson and muthukrishnan adaptive submodular maximization in bandit setting in neural information processing systems nips daniel golovin and andreas krause adaptive submodularity new approach to active learning and stochastic optimization in conference on learning theory colt manuel gomez rodriguez jure leskovec and andreas krause inferring networks of diffusion and influence in acm conference on knowledge discovery and data mining kdd andrew guillory active learning and submodular functions phd thesis university of washington andrew guillory and jeff bilmes interactive submodular set cover in international conference on machine learning icml andrew guillory and jeff bilmes simultaneous learning and covering with adversarial noise in international conference on machine learning icml steve hanneke the complexity of interactive machine learning master thesis carnegie mellon university shervin javdani yuxin chen amin karbasi andreas krause andrew bagnell and siddhartha srinivasa near optimal bayesian active learning for decision making in conference on artificial intelligence and statistics aistats shervin javdani matthew klingensmith andrew bagnell nancy pollard and siddhartha srinivasa efficient touch based localization through submodularity in ieee international conference on robotics and automation icra andreas krause ajit singh and carlos guestrin sensor placements in gaussian processes in international conference on machine learning icml jure leskovec andreas krause carlos guestrin christos faloutsos jeanne vanbriesen and natalie glance outbreak detection in networks in acm conference on knowledge discovery and data mining kdd hui lin and jeff bilmes learning mixtures of submodular shells with application to document summarization in conference on uncertainty in artificial intelligence uai george nemhauser laurence wolsey and marshall fisher an analysis of approximations for maximizing submodular set functions mathematical programming adarsh prasad stefanie jegelka and dhruv batra submodular meets structured finding diverse subsets in structured item sets in neural information processing systems nips filip radlinski robert kleinberg and thorsten joachims learning diverse rankings with bandits in international conference on machine learning icml karthik raman pannaga shivaswamy and thorsten joachims online learning to diversify from implicit feedback in acm conference on knowledge discovery and data mining kdd stephane ross jiaji zhou yisong yue debadeepta dey and andrew bagnell learning policies for contextual submodular prediction in international conference on machine learning icml sebastian tschiatschek rishabh iyer haochen wei and jeff bilmes learning mixtures of submodular functions for image collection summarization in neural information processing systems nips laurence wolsey an analysis of the greedy algorithm for the submodular set covering problem combinatorica yisong yue and carlos guestrin linear submodular bandits and their application to diversified retrieval in neural information processing systems nips yisong yue and thorsten joachims predicting diverse subsets using structural svms in international conference on machine learning icml 
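As a concrete illustration of the greedy forward selection for (non-interactive) submodular set cover referenced in the background section of the paper above, here is a minimal sketch; the unit-cost default, the toy coverage objective, and the cost-benefit selection rule written here follow a standard textbook variant rather than the paper's own code.

```python
def greedy_submodular_cover(actions, F, alpha, cost=lambda a: 1.0):
    """Greedy forward selection for submodular set cover.

    actions: iterable of candidate actions.
    F: monotone submodular set function mapping a frozenset of actions to a scalar.
    alpha: target threshold; we seek F(S) >= alpha with small total cost.
    cost: modular (additive) cost of each action.
    Greedily adds the action with the best marginal gain per unit cost of the
    truncated objective min(F(S), alpha), the rule behind the ln(.)-factor guarantee.
    """
    S = frozenset()
    remaining = set(actions)
    Fcap = lambda T: min(F(T), alpha)
    while Fcap(S) < alpha and remaining:
        best, best_ratio = None, 0.0
        for a in remaining:
            gain = Fcap(S | {a}) - Fcap(S)
            ratio = gain / cost(a)
            if ratio > best_ratio:
                best, best_ratio = a, ratio
        if best is None:          # no action has positive marginal gain
            break
        S = S | {best}
        remaining.discard(best)
    return S

# Toy usage: a coverage function over a small ground set (coverage is monotone submodular).
if __name__ == "__main__":
    covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 5, 6}}
    F = lambda S: float(len(set().union(*(covers[a] for a in S)) if S else set()))
    print(sorted(greedy_submodular_cover(covers.keys(), F, alpha=6.0)))
```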
galileo perceiving physical object properties by integrating physics engine with deep learning jiajun eecs mit jiajunwu joseph lim eecs mit lim ilker bcs mit the rockefeller university ilkery william freeman eecs mit billf joshua tenenbaum bcs mit jbt abstract humans demonstrate remarkable abilities to predict physical events in dynamic scenes and to infer the physical properties of objects from static images we propose generative model for solving these problems of physical scene understanding from videos and images at the core of our generative model is physics engine operating on an representation of physical properties including mass position shape and friction we can infer these latent properties using relatively brief runs of mcmc which drive simulations in the physics engine to fit key features of visual observations we further explore directly mapping visual inputs to physical properties inverting part of the generative process using deep learning we name our model galileo and evaluate it on video dataset with simple yet physically rich scenarios results show that galileo is able to infer the physical properties of objects and predict the outcome of variety of physical events with an accuracy comparable to human subjects our study points towards an account of human vision with generative physical knowledge at its core and various recognition models as helpers leading to efficient inference introduction our visual system is designed to perceive physical world that is full of dynamic content consider yourself watching rube goldberg machine unfold as the kinetic energy moves through the machine you may see objects sliding down ramps colliding with each other rolling entering other objects falling many kinds of physical interactions between objects of different masses materials and other physical properties how does our visual system recover so much content from the dynamic physical world what is the role of experience in interpreting novel dynamical scene recent behavioral and computational studies of human physical scene understanding push forward an account that people judgments are best explained as probabilistic simulations of realistic but mental physics engine specifically these studies suggest that the brain carries detailed but noisy knowledge of the physical attributes of objects and the laws of physical interactions between objects newtonian mechanics to understand physical scene and more crucially to predict the future dynamical evolution of scene the brain relies on simulations from this mental physics engine even though the probabilistic simulation account is very appealing there are missing practical and conceptual leaps first as practical matter the probabilistic simulation approach is shown to work only with synthetically generated stimuli either in worlds or in worlds but each indicates equal contribution the authors are listed in the alphabetical order object is constrained to be block and the joint inference of the mass and friction coefficient is not handled second as conceptual matter previous research rarely clarifies how mental physics engine could take advantage of previous experience of the agent it is the case that humans have life long experience with dynamical scenes and fuller account of human physical scene understanding should address it here we build on the idea that humans utilize realistic physics engine as part of generative model to interpret physical scenes we name our model galileo the first component of our generative model is the physical object 
representations where each object is rigid body and represented not only by its geometric shape or volume and its position in space but also by its mass and its friction all of these object attributes are treated as latent variables in the model and are approximated or estimated on the basis of the visual input the second part is realistic physics engine in this paper specifically the bullet physics engine the physics engine takes scene setup as input specification of each of the physical objects in the scene which constitutes hypothesis in our generative model and physically simulates it forward in time generating simulated velocity profiles and positions for each object the third part of galileo is the likelihood function we evaluate the observed videos with respect to the model hypotheses using the velocity vectors of objects in the scene we use standard tracking algorithm to map the videos to the velocity space now given video as observation to the model physical scene understanding in the model corresponds to inverting the generative model by probabilistic inference to recover the underlying physical object properties in the scene here we build video dataset to evaluate our model and humans on data which contains videos of different objects with range of materials and masses over simple yet physically rich scenario an object sliding down an inclined surface and potentially collide with another object on the ground note that in the fields of computer vision and robotics there have been studies on predicting physical interactions or inferring properties of objects for various purposes including reasoning and tracking however none of them focused on learning physical properties directly and nor they have incorporated physics engine with representation learning based on the estimates we derived from visual input with physics engine natural extension is to generate or synthesize training data for any automatic learning systems by bootstrapping from the videos already collected and labeling them with estimates of galileo this is learning algorithm for inferring generic physical properties and relates to the phases in helmholtz machines and to the cognitive development of infants extensive studies suggest that infants either are born with or can learn quickly physical knowledge about objects when they are very young even before they acquire more advanced knowledge like semantic categories of objects young babies are sensitive to physics of objects mainly from the motion of foreground objects from background in other words they learn by watching videos of moving objects but later in life and clearly in adulthood we can perceive physical attributes in just static scenes without any motion here building upon the idea of helmholtz machiness our approach suggests one potential computational path to the development of the ability to perceive physical content in static scenes following the recent work we train recognition model sleep cycle that is in the form of deep convolutional network where the training data is generated in manner by the generative model itself wake cycle videos observed by our model and the resulting physical inferences our work makes three contributions first we propose galileo novel model for estimating physical properties of objects from visual inputs by incorporating the feedback of physics engine in the loop we demonstrate that it achieves encouraging performance on video dataset second we train deep learning based recognition model that leads to efficient inference in 
the generative model and enables the generative model to predict future dynamical evolution of static scenes how would that scene unfold in time third we test our model and compare it to humans on variety of physical judgment tasks our results indicate that humans are quite successful in these tasks and our model closely matches humans in performance but also consistently makes similar errors as humans do providing further evidence in favor of the probabilistic simulation account of human physical scene understanding ra na na nb ga nb ib ia na physical object nb ra gb ga ga gb gb mass friction coefficient shape position offset draw two physical objects physics engine simulated velocities likelihood function observed velocities tracking algorithm figure snapshots of the dataset overview of the model our model formalizes hypothesis space of physical object representations where each object is defined by its mass friction coefficient shape and positional offset an origin to model videos we draw exactly two objects from that hypothesis space into the physics engine the simulations from the physics engine are compared to observations in the velocity space much nicer space than pixels scenario we seek to learn physical properties of objects by observing videos among many scenarios we consider an introductory setup an object is put on an inclined surface it may either slide down or keep static due to gravity and friction and may hit another object if it slides down this seemingly simple scenario is physically highly involved the observed outcome of these scenario are physical values which help to describe the scenario such as the velocity and moving distance of objects causally underlying these observations are the latent physical properties of objects such as the material density mass and friction coefficient as shown in section our galileo model intends to model the causal generative relationship between these observed and unobserved variables we collect video dataset of around objects sliding down ramp possibly hitting another object figure provides some exemplar videos in the dataset the results of collisions including whether it will happen or not are determined by multiple factors such as material density and friction coefficient size and shape volume and slope of surface gravity videos in our dataset vary in all these parameters specifically there are different materials cardboard dough foam hollow rubber hollow wood metal coin metal pole plastic block plastic doll plastic ring plastic toy porcelain rubber wooden block and wooden pole for each material there are to objects of different sizes and shapes the angle between the inclined surface and the ground is either or when an object slides down it may hit either cardboard box or piece of foam or neither galileo physical object model the gist of our model can be summarized as probabilistically inverting physics engine in order to recover unobserved physical properties of objects we collectively refer to the unobserved latent variables of an object as its physical representation for each object ti consists of its mass mi friction coefficient ki shape vi and position offset pi an origin in space we place uniform priors over the mass and the friction coefficient for each object mi uniform and ki uniform respectively for shape vi we have four variables shape type ti and the scaling factors for three dimensions xi yi zi we simplify the possible shape space in our model by constraining each shape type ti to be one of the three with equal 
probability box cylinder and torus note that applying scaling differently on each dimension to these three basic shapes results in large space of the scaling factors are chosen to be uniform over the range of values to capture the extent of different shapes in the dataset remember that our scenario consists of an object on the ramp and another on the ground the position offset pi for each object is uniform over the set this indicates that for the object on the ramp its position can be perturbed along the ramp in at most units upwards or downwards from its starting position which is units upwards on the ramp from the ground the next component of our generative model is realistic physics engine that we denote as specifically we use the bullet physics engine following the earlier related work the physics engine takes specification of each of the physical objects in the scene within the basic ramp setting as input and simulates it forward in time generating simulated velocity vectors for each object in the scene and respectively among other physical properties such as position rendered image of each simulation step etc in light of initial qualitative analysis we use velocity vectors as our feature representation in evaluating the hypothesis generated by the model against data we employ standard tracking algorithm klt point tracker to lift the visual observations to the velocity space that is for each video we first run the tracking algorithm and we obtain velocities by simply using the center locations of each of the tracked moving objects between frames this gives us the velocity vectors for the object on the ramp and the object on the ground and respectively note that we could replace the klt tracker with tracking algorithms for more complicated scenarios given pair of observed velocity vectors and the recovery of the physical object representations and for the two objects via simulation can be formalized as where we define the likelihood function as vo where vo is the concatenated vector of and vs is the concatenated vector of the dimensionality of vo and vs are kept the same for video by adjusting the number of simulation steps we use to obtain vo according to the length of the video but from video to video the length of these vectors may vary in all of our simulations we fix to which is the only free parameter in our model experiments show that the value of does not change our results significantly tracking as recognition the posterior distribution in equation is intractable in order to alleviate the burden of posterior inference we use the output of our recognition model to predict and fix some of the latent variables in the model specifically we determine the vi or ti xi yi zi using the output of the tracking algorithm and fix these variables without further sampling them furthermore we fix values of pi also on the basis of the output of the tracking algorithm inference once we initialize and fix the latent variables using the tracking algorithm as our recognition model we then perform metropolis hasting updates on the remaining four latent variables and at each mcmc sweep we propose new value for one of these random variables where the proposal distribution is uniform in order to help with mixing we also use broader proposal distribution uniform at every mcmc sweeps for shape type box xi yi and zi could all be different values for shape type torus we constrained the scaling factors such that xi zi and for shape type cylinder we constrained the scaling factors such that zi dough 
cardboard pole figure simulation results each row represents one video in the data the first frame of the video the last frame of the video the first frame of the simulated scene generated by bullet the last frame of the simulated scene the estimated object with larger mass the estimated object with larger friction coefficient simulations for each video as mentioned earlier we use the tracking algorithm to initialize and fix the shapes of the objects and and the position offsets and we also obtain the velocity vector for each object using the tracking algorithm we determine the length of the physics engine simulation by the length of the observed video that is the simulation runs until it outputs velocity vector for each object that is as long as the input velocity vector from the tracking algorithm as mentioned earlier we collect videos uniformly distributed across different object categories we perform mcmc simulations for single video each of which was mcmc sweeps long we report the results with the highest score across the chains the map estimate in figure we illustrate the results for three individual videos every two frame of the top row shows the first and the last frame of video and the bottom row images show the corresponding frames from our model simulations with the map estimate we quantify different aspects of our model in the following behavioral experiments where we compare our model against human subjects judgments furthermore we use the inferences made by our model here on the videos to train recognition model to arrive at physical object perception in static scenes with the model importantly note that our model can generalize across broad range of tasks beyond the ramp scenario for example once we infer the coefficient friction of an object we can make prediction on whether it will slide down ramp with different slope by doing simulation we test some of the generalizations in section bootstrapping as efficient perception in static scenes based on the estimates we derived from the visual input with physics engine we bootstrap from the videos already collected by labeling them with estimates of galileo this is learning algorithm for inferring generic physical properties as discussed in section this formulation is also related to the phases in helmholtz machines and to the cognitive development of infants here we focus on two physical properties mass and friction coefficient to do this we first estimate these physical properties using the method described in earlier sections then we train lenet widely used deep neural network for datasets using image patches cropped from videos based on the output of the tracker as data and estimated physical properties as labels the trained model can then be used to predict these physical properties of objects based on purely visual cues even though they might have never appeared in the training set we also measure masses of all objects in the dataset which makes it possible for us to quantitatively evaluate the predictions of the deep network we choose one object per material as our test cases initialization with recognition model random initialization mse corr oracle galileo uniform log likelihood mass methods figure mean squared errors of oracle estimation our estimation and uniform estimations of mass on scale and the correlations between estimations and ground truths number of mcmc sweeps figure the traces of several chains with and without lenet based initializations use all data of those objects as test data and the others as training 
data we compare our model with baseline which always outputs uniform estimate calculated by averaging the masses of all objects in the test data and with an oracle algorithm which is lenet trained using the same training data but has access to the ground truth masses of training objects as labels apparently the performance of the oracle model can be viewed as an upper bound of our galileo system table compares the performance of galileo the oracle algorithm and the baseline we can observe that galileo is much better than baseline although there is still some space for improvement because we trained lenet using static images to predict physical object properties such as friction and mass ratios we can use it to recognize those attributes in quick pass at the very first frame of the video to the extent that the trained lenet is accurate if we initialize the mcmc chains with these predictions we expect to see an overall boost in our traces we test by running several chains with and without initializations results can be seen in figure despite the fact that lenet is not achieving perfect performance by itself we indeed get boost in speed and quality in the inference experiments in this section we conduct experiments from multiple perspectives to evaluate our model specifically we use the model to predict how far objects will move after the collision whether the object will remain stable in different scene and which of the two objects is heavier based on observations of collisions for every experiment we also conduct behavioral experiments on amazon mechanical turk so that we may compare the performance of human and machine on these tasks outcome prediction in the outcome prediction experiment our goal is to measure and compare how well human and machines can predict the moving distance of an object if only part of the video can be observed specifically for behavioral experiments on amazon mechanical turk we first provide users four full videos of objects made of certain material which contain complete collisions in this way users may infer the physical properties associated with that material in their mind we select different object but made of the same material show users video of the object but only to the moment of collision we finally ask users to label where they believe the target object either cardboard or foam will be after the collision how far the target will move we tested users per case given partial video for galileo to generate predicted destinations we first run it to fit the part of the video to derive our estimate of its friction coefficient we then estimate its density by averaging human galileo uniform error in pixels ea le oo de po bl oo de po rc la to oc in ll do as tic pl tic as bl oc pl pl as tic po le et al al et llo oo co ug do ho ca rd oa rd figure mean errors in numbers of pixels of human predictions galileo outputs and uniform estimate calculated by averaging ground truth ending points over all test cases as the error patterns are similar for both target objects foam and cardboard the errors here are averaged across target objects for each material figure heat maps of user predictions galileo outputs orange crosses and ground truths white crosses the density values we derived from other objects with that material by observing collisions that they are involved we further estimate the density mass and friction coefficient of the target object by averaging our estimates from other collisions we now have all required information for the model to predict the ending point 
of the target after the collision note that the information available to galileo is exactly the same as that available to humans we compare three kinds of predictions human feedback galileo output and as baseline uniform estimate calculated by averaging ground truth ending points over all test cases figure shows the euclidean distance in pixels between each of them and the ground truth we can see that human predictions are much better than the uniform estimate but still far from perfect galileo performs similar to human in the average on this task figure shows for some test cases heat maps of user predictions galileo outputs orange crosses and ground truths white crosses the error correlation between human and pom is the correlation analysis for the uniform model is not useful because the correlation is constant independent of the uniform prediction value mass prediction the second experiment is to predict which of two objects is heavier after observing video of collision of them for this task we also randomly choose objects we test each of them on users for galileo we can directly obtain its guess based on the estimates of the masses of the objects figure demonstrates that human and our model achieve about the same accuracy on this task we also calculate correlations between different outputs here for correlation analysis we use the ratio of the masses of the two objects estimated by galileo as its predictor human responses are aggregated for each trial to get the proportion of people making each decision as the relation is highly nonlinear we calculate spearman coefficients from table we notice that human responses machine outputs and ground truths are all positively correlated human galileo mass human vs galileo human vs truth galileo vs truth spearman coeff will it move human vs galileo human vs truth galileo vs truth mass will it move figure average accuracy of human predictions and galileo outputs on the tasks of mass prediction and will it move prediction error bars indicate standard deviations of human accuracies pearson coeff table correlations between pairs of outputs in the mass prediction experiment in spearman coefficient and in the will it move prediction experiment in pearson coefficient will it move prediction our third experiment is to predict whether certain object will move in different scene after observing one of its collisions on amazon mechanical turk we show users video containing collision of two objects in this video the angle between the inclined surface and the ground is degrees we then show users the first frame of video of the same object and ask them to predict whether the object will slide down the surface in this case we randomly choose objects for the experiment and divide them into lists of objects per user and get each of the item tested on users overall for galileo it is straightforward to predict the stability of an object in the case using estimates from the video interestingly both humans and the model are at chance on this task figure and their responses are reasonably correlated table again here we aggregate human responses for each trial to get the proportion of people making each decision moreover both subjects and the model show bias towards saying it will future controlled experimentation and simulations will investigate what underlies this correspondence conclusion this paper accomplishes three goals first it shows that generative vision system with physical object representations and realistic physics engine at its core can efficiently deal 
with real-world data when proper recognition models and feature spaces are used. Second, it shows that humans' intuitions about physical outcomes are often accurate, and that our model largely captures these intuitions; crucially, humans and the model also make similar errors. Lastly, the experience of the model, that is, the inferences it makes on the basis of dynamical visual scenes, can be used to train a deep learning model, which leads to more efficient inference and to the ability to see physical properties in static images. Our study points towards an account of human vision with generative physical knowledge at its core and various recognition models as helpers to induce efficient inference.

Acknowledgements. This work was supported by an NSF Robust Intelligence grant (Reconstructive Recognition) and by the Center for Brains, Minds and Machines, funded by an NSF STC award.

References
Renée Baillargeon. Infants' physical world. Current Directions in Psychological Science.
Peter Battaglia, Jessica Hamrick, and Joshua Tenenbaum. Simulation as an engine of physical scene understanding. PNAS.
Susan Carey. The Origin of Concepts. Oxford University Press.
Erwin Coumans. Bullet physics engine. Open source software.
Peter Dayan, Geoffrey Hinton, Radford Neal, and Richard Zemel. The Helmholtz machine. Neural Computation.
Zhaoyin Jia, Andy Gallagher, Ashutosh Saxena, and Tsuhan Chen. 3D reasoning from blocks to stability. IEEE TPAMI.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE.
Adam Sanborn, Vikash Mansinghka, and Thomas Griffiths. Reconciling intuitive physics and Newtonian mechanics for colliding objects. Psychological Review.
John Schulman, Alex Lee, Jonathan Ho, and Pieter Abbeel. Tracking deformable objects with point clouds. In IEEE International Conference on Robotics and Automation (ICRA).
Carlo Tomasi and Takeo Kanade. Detection and tracking of point features. International Journal of Computer Vision.
Tomer Ullman, Andreas Stuhlmüller, Noah Goodman, and Josh Tenenbaum. Learning physics from dynamical scenes. In CogSci.
Ilker Yildirim, Tejas Kulkarni, Winrich Freiwald, and Joshua Tenenbaum. Efficient analysis-by-synthesis in vision: a computational framework, behavioral tests, and modeling neuronal representations. In Annual Conference of the Cognitive Science Society.
Bo Zheng, Yibiao Zhao, Joey Yu, Katsushi Ikeuchi, and Song-Chun Zhu. Detecting potential falling objects by inferring human action and natural disturbance. In ICRA.
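As a sketch of the bootstrapping step described in the paper above (training a LeNet-style network on tracker-cropped patches, using the generative model's estimates as regression targets), the following is one possible implementation; the 32x32 patch size, the layer sizes, the optimizer, and the full-batch training loop are assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class LeNetRegressor(nn.Module):
    """LeNet-style CNN that regresses physical properties (e.g., mass and friction
    estimates) from a 1x32x32 grayscale object patch; architecture details are
    illustrative, as the text only states that a LeNet-style network is used."""
    def __init__(self, n_outputs=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(), nn.Linear(84, n_outputs),
        )

    def forward(self, x):
        return self.head(self.features(x))

def bootstrap_train(patches, galileo_estimates, epochs=20, lr=1e-3):
    """Bootstrapping: image patches are 'labeled' with the physical-property
    estimates inferred by the generative model (no ground-truth masses used)."""
    model = LeNetRegressor(n_outputs=galileo_estimates.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(patches), galileo_estimates)
        loss.backward()
        opt.step()
    return model
```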
the of auctions jamie computer and information science university of pennsylvania philadelphia pa jamiemor tim roughgarden stanford university palo alto ca tim abstract this paper develops general approach rooted in statistical learning theory to learning an approximately auction from data we introduce auctions to interpolate between simple auctions such as welfare maximization with reserve prices and optimal auctions thereby balancing the competing demands of expressivity and simplicity we prove that such auctions have small representation error in the sense that for every product distribution over bidders valuations there exists auction with small and expected revenue close to optimal we show that the set of auctions has modest pseudodimension for polynomial and therefore leads to small learning error one consequence of our results is that in arbitrary settings one can learn mechanism with expected revenue arbitrarily close to optimal from polynomial number of samples introduction in the traditional economic approach to identifying auction one first posits prior distribution over all unknown information and then solves for the auction that maximizes expected revenue with respect to this distribution the first obstacle to making this approach operational is the difficulty of formulating an appropriate prior the second obstacle is that even if an appropriate prior distribution is available the corresponding optimal auction can be far too complex and unintuitive for practical use this motivates the goal of identifying auctions that are simple and yet in terms of expected revenue in this paper we apply tools from learning theory to address both of these challenges in our model we assume that bidders valuations willingness to pay are drawn from an unknown distribution learning algorithm is given samples from for example these could represent the outcomes of comparable transactions that were observed in the past the learning algorithm suggests an auction to use for future bidders and its performance is measured by comparing the expected revenue of its output auction to that earned by the optimal auction for the distribution the possible outputs of the learning algorithm correspond to some set of auctions we view as design parameter that can be selected by seller along with the learning algorithm central goal of this work is to identify classes that balance representation error the amount of revenue sacrificed by restricting to auctions in with learning error the generalization error incurred by learning over from samples that is we seek set that is rich enough to contain an auction that closely approximates an optimal auction whatever might be yet simple enough that the best auction in can be learned from small amount of data learning theory offers tools both for rigorously defining the simplicity of set of auctions through complexity measures such as the part of this work done while visiting stanford university partially supported by simons award for graduate students in theoretical computer science as well as nsf grant and for quantifying the amount of data necessary to identify the approximately best auction from our goal of learning auction also requires understanding the representation error of different classes this task is and we develop the necessary arguments in this paper our contributions the primary contributions of this paper are the following first we show that concepts from statistical learning theory can be directly applied to reason about learning from data an approximately 
auction precisely for set of auctions and an arbitrary unknown distribution over valuations in dc log samples from are enough to learn up to factor the best auction in where dc denotes the of the set defined in section second we introduce the class of auctions to interpolate smoothly between simple auctions such as welfare maximization subject to individualized reserve prices when and the complex auctions that can arise as optimal auctions as third we prove that in quite general auction settings with bidders the of the set of auctions is nt log nt fourth we quantify the number of levels required for the set of auctions to have low representation error with respect to the optimal auctions that arise from arbitrary product distributions for example for auctions and several generalizations thereof if then for every product distribution there exists auction with expected revenue at least times that of the optimal auction for in the above sense the in auctions is tunable sweet spot allowing designer to balance the competing demands of expressivity to achieve and simplicity to achieve learnability for example given fixed amount of past data our results indicate how much auction complexity in the form of the number of levels one can employ without risking overfitting the auction to the data alternatively given target approximation factor our results give sufficient conditions on and consequently on the number of samples needed to achieve this approximation factor the resulting sample complexity upper bound has polynomial dependence on and the number of bidders known results imply that any method of learning auction from samples must have sample complexity with polynomial dependence on all three of these parameters even for auctions related work the present work shares much of its spirit and goals with balcan et al who proposed applying statistical learning theory to the design of auctions the difference between the two works is that our work assumes bidders valuations are drawn from an unknown distribution while balcan et al study the more demanding setting since no auction can achieve revenue balcan et al define their revenue benchmark with respect to set of auctions on each input as the maximum revenue obtained by any auction of on the idea of learning from samples enters the work of balcan et al through the internal randomness of their partitioning of bidders rather than through an exogenous distribution over inputs as in this work both our work and theirs requires polynomial dependence on ours in terms of necessary number of samples and theirs in terms of necessary number of bidders as well as measure of the complexity of the class in our case the and in theirs an analagous measure the primary improvement of our work over of the results in balcan et al is that our results apply for single matroid feasibility and arbitrary singleparameter settings see section for definitions while their results apply only to settings of unlimited we also view as feature the fact that our sample complexity upper bounds can be deduced directly from results in learning theory we can focus instead on the and work of bounding the and representation error of auction classes elkind also considers similar model to ours but only for the special case of auctions while her proposed auction format is similar to ours our results cover the far more general see balcan et al for an extension to the case of large finite supply case of arbitrary settings and and support distributions our sample complexity bounds are also better 
even in the case of auction linear rather than quadratic dependence on the number of bidders on the other hand the learning algorithm in for singleitem auctions is computationally efficient while ours is not cole and roughgarden study auctions with bidders with valuations drawn from independent not necessarily identical regular distributions see section and prove upper and lower bounds polynomial in and on the sample complexity of learning auction while the formalism in their work is inspired by learning theory no formal connections are offered in particular both their upper and lower bounds were proved from scratch our positive results include auctions as very special case and for bounded or mhr valuations our sample complexity upper bounds are much better than those in cole and roughgarden huang et al consider learning the optimal price from samples when there is single buyer and single seller this problem was also studied implicitly in our general positive results obviously cover the and mhr settings in though the specialized analysis in yields better indeed almost optimal sample complexity bounds as function of medina and mohri show how to use combination of the and rademacher complexity to measure the sample complexity of selecting single reserve price for the vcg mechanism to optimize revenue in our notation this corresponds to analyzing single set of auctions vcg with reserve medina and mohri do not address the expressivity simplicity that is central to this paper dughmi et al also study the sample complexity of learning good auctions but their main results are negative exponential sample complexity for the difficult scenario of settings all settings in this paper are our work on auctions also contributes to the literature on simple approximately revenuemaximizing auctions here one takes the perspective of seller who knows the valuation distribution but is bound by simplicity constraint on the auction deployed thereby ruling out the optimal auction our results that bound the representation error of auctions theorems and can be interpreted as principled way to trade off the simplicity of an auction with its approximation guarantee while previous work in this literature generally left the term simple safely undefined this paper effectively proposes the of an auction class as rigorous and quantifiable simplicity measure preliminaries this section reviews useful terminology and notation standard in bayesian auction design and learning theory bayesian auction design we consider settings with bidders this means that each bidder has single unknown parameter its valuation or willingness to pay for every bidder has value for losing setting is specified by collection of subsets of each such subset represent collection of bidders that can simultaneously for example in setting with copies of an item where no bidder wants more than one copy would be all subsets of of cardinality at most generalization of this case studied in the supplementary materials section is matroid settings these satisfy whenever and and ii for two sets there is always an augmenting element such that the supplementary materials section also consider arbitrary settings where the only assumption is that to ease comprehension we often illustrate our main ideas using auctions where is the singletons and the empty set we assume bidders valuations are drawn from the continuous joint cumulative distribution except in the extension in section we assume that the support of is limited to as in most of optimal auction theory we 
usually assume that is product distribution with fn and each vi fi drawn independently but not identically the virtual value of bidder is denoted by vi vi fif vi distribution satisfies the rate mhr condition if fi vi fi vi is nondecreasing intuitively if its tails are no heavier than those of an exponential distribution in fundamental paper proved that when every virtual valuation function is nondecreasing the regular case the auction that maximizes expected revenue for bayesian bidders chooses winners in way which maximizes the sum of the virtual values of the winners this auction is known as myerson auction which we refer to as the result can be extended to the general case by replacing the virtual valuation functions by ironed virtual valuation the details are but technical see myerson and hartline for details sample complexity vc dimension and the this section reviews several definitions from learning theory suppose there is some domain and let be some unknown target function let be an unknown distribution over we wish to understand how many labeled samples are necessary and sufficient to be able to output which agrees with almost everywhere with respect to the sample complexity of learning depends fundamentally on the complexity of the set of binary functions from which we are choosing we define the relevant complexity measure next let be set of samples from the set is said to be shattered by if for every subset there is some ct such that ct if and ct if that is ranging over all induces all possible projections onto the vc dimension of denoted vc is the size of the largest set that can be shattered by let errs denote the empirical error of on and let err denote the true expected error of with respect to key result from learning theory is for every distribution sample of size vc ln is sufficient to guarantee that errs err err for every with probability in this case the error on the sample is close to the true error simultaneously for every hypothesis in in particular choosing the hypothesis with the minimum sample error minimizes the true error up to we say is learnable with sample complexity if given sample of size with probability for all err thus any class is learnable with vc ln samples conversely for every learning algorithm that uses fewer than vc samples there exists distribution and constant such that with probability at least outputs hypothesis with err err for some that is the true error of the output hypothesis is more than larger the best hypothesis in the class to learn functions we need generalization of vc dimension which concerns binary functions the does exactly formally let be realvalued function over and the class we are learning over let be sample drawn from labeled according to both the empirical and true error of hypothesis are defined as before though can now take on values in rather than in let rm be set of targets for we say rm witnesses the shattering of by if for each there exists some ct such that ft xi ri for all xi and ct xi ri for all xi if there exists some witnessing the shattering of we say is shatterable by the of denoted dc is the size of the largest set which is shatterable by the sample complexity upper bounds of this paper are derived from the following theorem which states that the sample complexity of learning over class of functions is governed by the class theorem suppose is class of functions with range in and dc for every the complexity of learning with respect to is dc ln ln moreover the guarantee in theorem is realized by the learning 
algorithm that simply outputs the function with the smallest empirical error on the sample the dimension is weaker condition that is also sufficient for sample complexity bounds all of our arguments give the same upper bounds on the and the dimension of various auction classes so we present the stronger statements applying to auction classes for the remainder of this paper we consider classes of truthful auctions when we discuss some auction we treat as the function that maps truthful bid tuples to the revenue achieved on them by the auction then rather than minimizing error we aim to maximize revenue in our setting the guarantee of theorem directly implies that with probability at least over the samples the output of the empirical revenue maximization learning algorithm which returns the auction with the highest average revenue on the samples chooses an auction with expected revenue over the true underlying distribution that is within an additive of the maximum possible auctions to illustrate out ideas we first focus on auctions the results of this section are generalized significantly in the supplementary see sections and section defines the class of auctions gives an example and interprets the auctions as approximations to virtual welfare maximizers section proves that the of the set of such auctions is nt log nt which by theorem implies upper bound section proves that taking yields low representation error auctions the case we now introduce auctions or ct for short intuitively one can think of each bidder as facing one of possible prices the price they face depends upon the values of the other bidders consider for each bidder numbers we refer to these numbers as thresholds this set of tn numbers defines auction with the following allocation rule consider valuation tuple for each bidder let ti vi denote the index of the largest threshold that lower bounds vi or if vi we call ti vi the level of bidder sort the bidders from highest level to lowest level and within level use fixed lexicographical ordering to pick the award the item to the first bidder in this sorted order unless ti for every bidder in which case there is no sale the payment rule is the unique one that renders truthful bidding dominant strategy and charges to losing bidders the winning bidder pays the lowest bid at which she would continue to win it is important for us to understand this payment rule in detail there are three interesting cases suppose bidder is the winner in the first case is the only bidder who might be allocated the item other bidders have level in which case her bid must be at least her lowest threshold in the second case there are multiple bidders at her level so she must bid high enough to be at her level and since ties are broken lexicographically this is her threshold to win in the final case she need not compete at her level she can choose to either pay one level above her competition in which case her position in the ordering does not matter or she can bid at the same level as her competitors in which case she only wins if she dominates all of those bidders at the level according to formally the payment of the winner if any is as follows let denote the highest level such that there at least two bidders at or above level and be the set of bidders other than whose level is at least monop if then pi she is the only potential winner but must have level mult if ti vi then pi she needs to be at level to win an auction is truthful if truthful bidding is dominant strategy for every bidder that is for every 
bidder and all possible bids by the other bidders maximizes its expected utility value minus price paid by bidding its true value in the settings that we study the expected revenue of the optimal auction measured at equilibrium with respect to the prior distribution is no larger than that of the optimal truthful auction when the valuation distributions are regular this can be done by value or randomly when it is done by value this equates to generalization of vcg with nonanonymous reserves and is ic and has identical representation error as this analysis when bidders are regular unique if ti vi if for all she pays pi otherwise she pays pi she either needs to be at level in which case her position in does not matter or at level in which case she would need to be the highest according to we now describle particular auction and demonstrate each case of the payment rule example consider the following auction for bidders let and for example if bidder bids less than she is at level bid in puts her at level bid in at level bid in at level and bid of at least at level let monop if va vb vc then are at level to which the item is never allocated so wins and pays the minimum she needs to bid to be at level mult if va vb vc then and are both at level and so will win and pays the minimum she needs to bid to be at level unique if va vb vc then is at level and and are at level since and need only pay enough to be at level if on the other hand va vb and vc has level at least while have level but needs to pay since remark connection to virtual valuation functions auctions are naturally interpreted as discrete approximations to virtual welfare maximizers and our representation error bound in theorem makes this precise each level corresponds to constraint of the form if any bidder has level at least do not sell to any bidder with level less than we can interpret the with fixed ranging over bidders as the bidder values that map to some common virtual value for example auctions treat all values below the single threshold as having negative virtual value and above the threshold uses values as proxies for virtual values auctions use the second threshold to the refine virtual value estimates and so on with this interpretation it is intuitively clear that as it is possible to estimate bidders virtual valuation functions and thus approximate myerson optimal auction to arbitrary accuracy the of auctions this section shows that the of the class of auctions with bidders is nt log nt combining this with theorem immediately yields sample complexity bounds parameterized by for learning the best such auction from samples theorem for fixed order the set of auctions has nt log nt proof recall from section that we need to upper bound the size of every set that is shatterable using auctions fix set of samples vm of size and potential witness rm each auction induces binary labeling of the samples vj of whether revenue on vj is at least rj or strictly less than rj the set is shattered with witness if and only if the number of distinct labelings of given by any auction is we the number of distinct labelings of given by auctions for some fixed potential witness counting the labelings in two stages note that involves nm numbers one value vij for each bidder for each sample and auction involves nt numbers thresholds for each bidder call two auctions with thresholds and ˆi equivalent if the relative order of the agrees with that of the ˆi in that both induce the same permutation of merging the sorted list of the vij with the sorted 
list of the yields the same partition of the vij as does merging it with the sorted list of the ˆi note that this is an equivalence relation if two auctions are equivalent every comparison between valuation and threshold or two valuations is resolved identically by those auctions using the defining properties of equivalence crude upper bound on the number of equivalence classes is nm nt nt nm nt nt we now the number of distinct labelings of that can be generated by auctions in single equivalence class first as all comparisons between two numbers valuations or thresholds are resolved identically for all auctions in each bidder in each sample vj of is assigned the same level across auctions in and the winner if any in each sample vj is constant across all of by the same reasoning the identity of the parameter that gives the winner payment some is uniquely determined by pairwise comparisons recall section and hence is common across all auctions in the payments however can vary across auctions in the equivalence class for bidder and level let si be the subset of samples in which bidder wins and pays the revenue obtained by each auction in on sample of si is simply and independent of all other parameters of the auction thus ranging over all auctions in generates at most distinct binary labelings of si the possible subsets of si for which an auction meets the corresponding target rj form nested collection summarizing within the equivalence class of auctions varying parameter generates at most different labelings of the samples si and has no effect on the other samples since the subsets si are disjoint varying all of the ranging over generates at most ty mnt distinct labelings of combining and the class of all auctions produces at most nm nt distinct labelings of since shattering requires distinct labelings we conclude that nm nt implying nt log nt as claimed the representation error of auctions in this section we show that for every bounded product distribution there exists auction with expected revenue close to that of the optimal auction when bidders are independent and bounded the analsysis rounds an optimal auction to auction without losing much expected revenue this is done using thresholds to approximate each bidder virtual value the lowest threshold at the bidder monopoly reserve price the next thresholds at the values at which bidder virtual value surpasses multiples of and the remaining thresholds at those values where bidder virtual value reaches powers of theorem formalizes this intuition theorem suppose is distribution over if ct contains auction with expected revenue at least times the optimal expected revenue theorem follows immediately from the following lemma with general result for later use we prove this more lemma consider bidders with valuations in and with maxi vi then ct contains auction with expected revenue at least times that of an optimal auction for proof consider fixed bidder we define thresholds for bucketing by her virtual value and prove that the auction using these thresholds for each bidder closely approximates the expected revenue of the optimal auction let be parameter defined later set bidder monopoly for for let let consider fixed valuation profile let denote the winner according to and the winner according to the optimal auction if there is no winner we interpret and as recall that always awards the item to bidder with the highest positive virtual value or no one if no such bidders exist the definition of the thresholds immediately implies the following only 
allocates to ironed bidders if there is no tie that is there is unique bidder at the highest level then when there is tie at level the virtual value of the winner of is close to that of if then if these facts imply that ev rev ev ev vi ev rev are equal the first and final equality follow from and allocations depending on ironed virtual values not on the values themselves thus the ironed virtual values are equal in expectation to the unironed virtual values and thus the revenue of the mechanisms see chapter for discussion as maxi vi it must be that rev posted price of will achieve this revenue combining this with and setting implies ev rev ev rev combining theorems and yields the following corollary corollary let be product distribution with all bidders valuations in that and nt log nt log log then with probability at least the empirical revenue maximizer of ct on set of samples from has expected revenue at least times that of the optimal auction open questions there are some significant opportunities for research first there is much to do on the design of computationally efficient in addition to algorithms for learning nearoptimal auction the present work focuses on sample complexity and our learning algorithms are generally not computationally the general research agenda here is to identify auction classes for various settings such that has low representation error has small there is algorithm to find an approximately auction from on given set of there are also interesting open questions on the statistical side notably for problems while the negative result in rules out universally good upper bound on the sample complexity of learning mechanism in settings we suspect that positive results are possible for several interesting special cases recall from section that denotes the virtual valuation function of bidder from here on we always mean the ironed version of virtual values it is convenient to assume that these functions are strictly increasing not just nondecreasing this can be enforced at the cost of losing an arbitrarily small amount of revenue there is clear parallel with computational learning theory while the foundations of classification vc dimension etc have been long understood this research area strives to understand which concept classes are learnable in polynomial time the and performance bounds implied by analysis as in theorem hold with such an approximation algorithm with the algorithm approximation factor carrying through to the learning algorithm guarantee see also references martin anthony and peter bartlett neural network learning theoretical foundations cambridge university press ny ny usa moshe babaioff nicole immorlica brendan lucier and matthew weinberg simple and approximately optimal mechanism for an additive buyer sigecom january balcan avrim blum and yishay mansour single price mechanisms for revenue maximization in unlimited supply combinatorial auctions technical report carnegie mellon university balcan avrim blum jason hartline and yishay mansour reducing mechanism design to algorithm design via machine learning jour of comp and system sciences yang cai and constantinos daskalakis theorems for optimal multidimensional pricing in foundations of computer science focs ieee annual symposium on pages palm springs ca oct ieee shuchi chawla jason hartline and robert kleinberg algorithmic pricing via virtual valuations in proceedings of the acm conf on electronic commerce pages ny ny usa acm shuchi chawla jason hartline david malec and balasubramanian sivan 
Mechanism design and sequential posted pricing. In Proceedings of the ACM Symposium on Theory of Computing. ACM, New York, NY, USA. Richard Cole and Tim Roughgarden. The sample complexity of revenue maximization. In Proceedings of the Annual ACM Symposium on Theory of Computing. SIAM, New York, NY, USA. Nikhil Devanur, Jason Hartline, Anna Karlin, and Thach Nguyen. Mechanism design. In Internet and Network Economics. Springer, Singapore. Peerapong Dhangwatnotai, Tim Roughgarden, and Qiqi Yan. Revenue maximization with a single sample. In Proceedings of the ACM Conference on Electronic Commerce. ACM, New York, NY, USA. Shaddin Dughmi, Li Han, and Noam Nisan. Sampling and representation complexity of revenue maximization. In Web and Internet Economics, Lecture Notes in Computer Science. Springer International Publishing, Beijing, China. Edith Elkind. Designing and learning optimal finite support auctions. In Proceedings of the Eighteenth Annual Symposium on Discrete Algorithms. SIAM. Jason Hartline. Mechanism Design and Approximation. Jason Hartline, Chicago, Illinois. Jason Hartline and Tim Roughgarden. Simple versus optimal mechanisms. In ACM Conference on Electronic Commerce. ACM, Stanford, CA. Zhiyi Huang, Yishay Mansour, and Tim Roughgarden. Making the most of your samples. URL http. Michael Kearns and Umesh Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA. Andres Munoz Medina and Mehryar Mohri. Learning theory and algorithms for revenue optimization in second price auctions with reserve. In Proceedings of the International Conference on Machine Learning. Roger Myerson. Optimal auction design. Mathematics of Operations Research. David Pollard. Convergence of Stochastic Processes. David Pollard, New Haven, Connecticut. Roughgarden and Schrijvers. Ironing in the dark. Submitted. Tim Roughgarden, Inbal, and Qiqi Yan. Mechanisms. In Proceedings of the ACM Conference on Electronic Commerce. ACM, New York, NY, USA. Leslie Valiant. A theory of the learnable. Communications of the ACM. Vladimir Vapnik and Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications. Andrew Yao. An bidder reduction for auctions and its applications. In Proceedings of the Annual Symposium on Discrete Algorithms. ACM, San Diego, CA.
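The t-level allocation and payment rules described above can be made concrete with a short sketch. This is an illustrative implementation for the single-item setting, not the authors' code: a bidder's level is the number of her own thresholds her bid meets, the highest-level bidder wins (ties broken by a fixed lexicographic order, here lower index first; no sale if all levels are zero), and the winner pays her critical bid, the smallest value at which she would still win. Because the allocation can only change at the winner's own thresholds, the critical bid is found by scanning those thresholds rather than enumerating the monop/mult/unique cases.

```python
# Minimal sketch of a single-item t-level auction (illustrative, not the authors' code).
from typing import List, Optional, Tuple

def level(thresholds: List[float], bid: float) -> int:
    """Number of the bidder's thresholds that the bid meets or exceeds (0 if below all)."""
    return sum(1 for th in thresholds if bid >= th)

def winner(all_thresholds: List[List[float]], bids: List[float]) -> Optional[int]:
    """Highest-level bidder wins; ties broken lexicographically (lower index first)."""
    levels = [level(th, b) for th, b in zip(all_thresholds, bids)]
    if max(levels) == 0:
        return None                       # no sale when every bidder has level 0
    return max(range(len(bids)), key=lambda i: (levels[i], -i))

def t_level_auction(all_thresholds: List[List[float]], bids: List[float]) -> Tuple[Optional[int], float]:
    """Return (winner, payment): the winner pays the lowest bid at which she would still win."""
    w = winner(all_thresholds, bids)
    if w is None:
        return None, 0.0
    for th in sorted(all_thresholds[w]):  # candidate bids: the points where w's level changes
        trial = list(bids)
        trial[w] = th
        if winner(all_thresholds, trial) == w:
            return w, th                  # smallest winning bid, i.e. the critical payment
    return w, bids[w]                     # unreachable: the winner always wins at some threshold

# Hypothetical three-bidder example with t = 3 thresholds per bidder.
thresholds = [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0], [2.5, 4.5, 7.0]]
print(t_level_auction(thresholds, [5.5, 3.0, 1.0]))   # bidder 0 wins and pays her lowest threshold
```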
unlocking neural population using hierarchical dynamics model mijung gergo jakob gatsby computational neuroscience unit university college london research center caesar an associate of the max planck society bonn max planck institute for biological cybernetics bernstein center for computational neuroscience mijung gbohner abstract neural population activity often exhibits rich variability this variability can arise from stochasticity neural dynamics on short as well as from modulations of neural firing properties on long often referred to as neural to better understand the nature of in neural circuits and their impact on cortical information processing we introduce hierarchical dynamics model that is able to capture both slow modulations in firing rates as well as neural population dynamics we derive bayesian laplace propagation algorithm for joint inference of parameters and population states on neural population recordings from primary visual cortex we demonstrate that our model provides better account of the structure of neural firing than stationary dynamics models introduction neural spiking activity recorded from populations of cortical neurons can exhibit substantial variability in response to repeated presentations of sensory stimulus this variability is thought to arise both from dynamics generated endogenously within the circuit as well as from variations in internal and behavioural states an understanding of how the interplay between sensory inputs and endogenous dynamics shapes neural activity patterns is essential for our understanding of how information is processed by neuronal populations multiple statistical and mechanistic models for characterising neuronal population dynamics have been developed in addition to these dynamics which take place on fast milliseconds up to few seconds there are also processes modulating neural firing activity which take place on much slower timescales seconds to hours slow drifts in rates across an experiment can be caused by fluctuations in arousal anaesthesia level or other physiological properties of the experimental preparation furthermore processes such as learning and plasticity can lead to slow changes in neural firing properties the statistical structure of these slow fluctuations has been modelled using models and related techniques recent experimental findings have shown that slow multiplicative fluctuations in neural excitability are dominant source of neural covariability in extracellular recordings from cortical circuits to accurately capture the the structure of neural dynamics and to disentangle the contributions of slow and fast modulatory processes to neural variability and it is therefore important to develop models that can capture neural dynamics both on fast within experimental trials and slow across trials few such models exist czanner et al presented statistical model of firing in which dynamics are modelled by generalised linear coupling from the recent spiking history of each neuron onto its instantaneous firing rate and acrosstrial dynamics were modelled by defining random walk model over parameters more recently mangion et al presented latent linear dynamical system model with poisson observations plds with latent space and used heuristic filtering approach for tracking parameters again based on model rabinowitz et al presented technique for identifying slow modulatory inputs from the recordings of single neurons using gaussian process model and an efficient inference technique using evidence optimisation here we 
present hierarchical model that consists of latent dynamical system with poisson observations plds to model neural population dynamics combined with gaussian process gp to model modulations in firing rates or across experimental trials the use of an exponential nonlinearity implies that latent modulations have multiplicative effect on neural firing rates compared to previous models using random walks over parameters using gp is more flexible and powerful way of modelling the statistical structure of and makes it possible to use that model the variability and smoothness of across time in this paper we focus on concrete variant of this general model we introduce new set of variables which control neural firing rate on each trial to capture in firing rates we derive bayesian laplace propagation method for inferring the posterior distributions over the latent variables and the parameters from population recordings of spiking activity our approach generalises the latent states in to models with states as well as to bayesian treatment of based on gaussian process priors the paper is organised as follows in sec we introduce our framework for constructing neural population models as well as the concrete model we will use for analyses in sec we derive the bayesian laplace propagation algorithm in sec we show applications to simulated data and neural population recordings from visual cortex hierarchical models of neural population dynamics we start by introducing hierarchical model for capturing short population dynamics as well as long in firing rates although we use the term to mean that the system is best described by parameters that change over time which is how the term is often used in the context of neural data analysis we note that the distribution over parameters can be described by stochastic process which might be strictly stationary in the statistical modelling framework we assume that the neural population activity of neurons yt rp depends on latent state xt rk and modulatory factor rk which is different for each trial the latent state models of spiking activity and the modulatory factor models slowly varying mean firing rates across experimental trials we model neural spiking activity as conditionally poisson given the latent state xt and modulator with log firing rate which is linear in parameters and latent factors yt poiss yt exp xt where the loading matrix specifies how each neuron is related to the latent state and the modulator rp is an offset term that controls the mean firing rate of each cell and poiss yt means that the ith entry of yt is drawn independently from poisson distribution with mean wi the ith entry of because of the use of an exponential nonlinearity latent factors have multiplicative effect on neural firing rates as has been observed experimentally following we assume that the latent dynamics evolve according to autoregressive process with gaussian innovations xt xt but here we allow for sensory stimuli or experimental covariates ut rd to influence the latent states linearly the dynamics matrix determines the state evolution models the dependence of latent states on external inputs and is the covariance of the innovation noise we set to be the identity matrix ik as in and we assume ik stochastic process is stationary if its joint distribution over any two and only depends on the elapsed time figure schematic of hierarchical nonstationary poisson observation latent dynamical system for capturing nonstationarity in mean firing rates the parameter slowly varies 
across trials and leads to fluctuations in mean firing rates recording recording the parameters in this model are we refer to this general model as nonstationary plds different variants of can be constructed by placing priors on individual parameters which allow them to vary across trials in which case they would then depend on the trial index or by omitting different components of the for the modulator we assume that it varies across trials according to gp with mean mh and modified squared exponential kernel gp mh where the th block of size is given by exp ik here we assume the independent on the diagonal to be constant and small as in when the modulator vanishes which corresponds to the conventional plds model with fixed parameters when the mean firing rates vary across trials and the parameter determines the timescale in units of trials of these fluctuations we impose ridge priors on the model parameters see appendix for details so that the total set of hyperparameters of the model is mh where is the set of ridge parameters bayesian laplace propagation our goal is to infer parameters and latent variables in the model the exact posterior distribution is analytically intractable due to the use of poisson likelihood and we therefore assume the joint posterior over the latent variables and parameters to be factorising qθ qx this factorisation simplifies computing the integrals involved in calculating bound on the marginal likelihood of the observations log log dθ dθ log similar to variational bayesian expectation maximization vbem algorithm our inference procedure consists of the following three steps we compute the approximate posterior over latent variables qx by integrating out the parameters qx exp dθqθ log which is performed by message passing relying on the dependency in latent states then we compute the approximate posterior over parameters qθ by integrating out the latent variables qθ exp qx log and we update the hyperparameters by computing the gradients of the bound on the eq after integrating out both latent variables and parameters we iterate the three steps until convergence unfortunately the integrals in both eq and eq are not analytically tractable even with the gaus sian distributions for qx and qθ for tractability and fast computation of messages in second variant of the model in which the dynamics matrix determining the correlations in the population varies across trials is described in the appendix the algorithm for eq we utilise the laplace propagation or laplace expectation propagation which makes gaussian approximation to each message based on laplace approximation then propagates the messages forward and backward while laplace propagation in the prior work is commonly coupled with point estimates of parameters we consider the posterior distribution over parameters for this reason we refer to our inference method as bayesian laplace propagation the use of approximate message passing in the laplace propagation implies that there is no longer guarantee that the lower bound will increase monotonically in each iteration which is the main difference between our method and the vbem algorithm we therefore monitored the convergence of our algorithm by computing ahead prediction scores the algorithm proceeds by iterating the following three steps approximating the posterior over latent states using the dependency in latent states we derive sequential algorithm to obtain qx generalising the approach of to latent states since this step decouples across trials it is easy to 
parallelize and we omit the for clarity we note that computation of the approximate posterior in this step is not more expensive than bayesian inference of the latent state in fixed parameter plds the forward message xt at time is given by xt exp hlog xt yt iqθ assuming that the forward message at time denoted by is gaussian the poisson likelihood term will render the forward message at time but we will approximate xt as gaussian using the first and second derivatives of the side of eq with respect to xt similarly the backward message at time is given by dxt xt exp hlog xt yt iqθ which we also approximate to gaussian for tractability in computing backward messages using the messages we compute the posterior marginal distribution over latent variables see appendix we need to compute the between neighbouring latent variables to obtain the sufficient statistics of latent variables which we will need for updating the posterior over parameters the pairwise marginals of latent variables are given by xt exp hlog iqθ xt which we approximate as joint gaussian distribution by using the derivatives of eq and extracting the term from the joint covariance matrix approximating the posterior over parameters after inferring the posterior over latent states we update the posterior distribution over the parameters the posterior over parameters factorizes as qθ qa qc where used the vectorized notations vec and vec we set to the maximum likelihood estimates for simplicity in inference the computational cost of this algorithm is dominated by the cost of calculating the posterior distribution over which involves manipulation of gaussian while this was still tractable without further approximations for the sizes used in our analyses below hundreds of trials variety of approximate methods for exist which could be used to improve efficiency of this computation in particular we will typically be dealing with systems in which which means that the is smooth and could be approximated using representations estimating hyperparameters finally after obtaining the the approximate posterior we update the hyperparameters of the prior by maximizing the lower bound with respect to the hyperparameters the variational lower bound simplifies to see in for details note that the usage of gaussian approximate posteriors ensures that this step is analogous to hyper parameter updating in fully gaussian lds log group group population activity trial neurons log mean firing rate true plds neurons trials log mean firing rate trial trials true plds condi cov group population activity true plds trial neurons neurons group covariance estimation total cov trial figure illustration of in firing rates simulated data spike rates of neurons are influenced by two slowly varying firing rate modulators the log mean firing rates of the two groups of neurons are red group and blue group across trials raster plots show the extreme cases trials and the traces show the posterior mean of estimated by light blue for light red for independent pldss fit plds to each trial data individually dark gray and plds light gray total and conditional on each trial covariance of recovered neural responses from each model averaged across all neuron pairs and then normalised for visualisation the covariances recovered by our model red well match the true ones black while those by independent pldss gray and single plds light gray do not where is constant here the kl divergence between the prior and posterior over parameters denoted by µφ σφ and respectively is given by kl 
log tr σφ µφ σφ µφ where the prior mean and covariance depend on the hyperparameters we update the hyperparameters by taking the derivative of kl each hyper parameter for the prior mean the first derivative expression provides update for time scale of fluctuations in firing rates and variance of fluctuations their derivative expressions do not provide closed form update in which case we compute the kl divergence on the grid defined in each hyperparameter space and choose the value that minimises kl predictive distributions for test data in our model different trials are no longer considered to be independent so we can predict parameters for trials using the gp model on and our approximations we have gaussian predictive distributions on for test data given training data mh µh mh where is the prior covariance matrix on and is on and is their prior crosscovariance as introduced in of and the negative hessian hh is defined as log yt hh in the applications to simulated and neurophysiological data described in the following we used this approach to predict the properties of neural dynamics on trials applications simulated data we first illustrate the performance of on simulated population recording from neurons consisting of trials of length time steps each we used latent state and assumed that the population consisted of two homogeneous subpopulations of size each with one modulatory input controlling rate fluctuations in each group see fig in addition we assumed that for half of each trial there was stimulus drifting grating represented by vector which consisted of the sine and cosine mean firing rate hz cell cell most neurons cell cell cell cell cell cell cell cell trial most stationary neurons data plds trial rmse most neurons most stationary neurons all neurons plds figure firing rates in population of neurons mean firing rates of neurons black trace across trials left the most neurons right the most stationary neurons the fitted solid line and the predicted circles mean firing rates are also shown for in red and plds in gray left the rmse in predicting single neuron firing rates across most neurons for varying latent dimensionalities where achieves significantly lower rmse middle rmse for the most stationary neurons where there is no difference between two methods apart from an outlier at right rmse for the all neurons of the phase of the stimulus frequency hz as well as an additional binary term which indicated whether the stimulus was active we fit to the data and found that it successfully captures the in log mean firing rates defined by as shown in fig and recovers the total and trialconditioned covariances the mean of the covariances of for comparison we also fit separate pldss to the data from each trial as well as single plds to the entire data the naive approach of fitting an individual plds to each trial can in principle follow the modulation however as each model is only fit to one trial the are very noisy since they are not sufficiently constrained by the data from each trial we note that single plds with fixed parameters as is conventionally used in neural data analysis is able to track the modulations in firing rates in the posterior mean however single plds would not be able to extrapolate firing rates for unseen trials as we will demonstrate in our analyses on neural data below in addition it will also fail to separate slow and fast modulations into different parameters by comparing the total covariance of the data averaged across neuron pairs to the covariance calculated by 
estimating the covariance on each trial individually and averaging covariances one can calculate how much of the can be explained by fluctuations in firing rates see in this simulation shown in fig which illustrates an extreme case dominated by strong effects the conditional covariance is much smaller than the full covariance neurophysiological data how big are in neural population recordings and can our model successfully capture them to address these questions we analyzed population recording from anaesthetized macaque primary visual cortex consisting of neurons stimulated by sine grating stimuli the details of data collection are described in but our also included units not used in the original study we binned the spikes recorded during trials of length stimulus was on for of the same orientation using bins resulting in trials of length bins analogously to the simulated dataset above we parameterised the stimulus as vector of the sine and cosine with the same temporal frequency of the drifting grating as well as an indicator that specifies whether there is stimulus or not we used cross validation to evaluate performance of the model repeatedly divided the data into test data trials and training data the remaining trials we fit the model on each training set and using the estimated parameters from the training data we made predictions on the modulator on test data by using the mean of the predictive distribution over we note that in contrast to conventional applications of which assume trials our model here also takes into correlations in firing rates across therefore we had to keep the in order to compute predictive distributions for test data using formulas in eq using these parameters we drew samples for spikes for the entire trials to compute the mean firing rates of each neuron at each trial for comparison we also fit single plds to the data as this model does not allow for modulations of firing rates we simply kept the parameters estimated from the training data for visualisation of results we quantified the of each neuron by first smoothing its firing rate across trials using kernel of size trials calculating the variance of the smoothed firing rate estimate and displaying firing rates for the most neurons in the population fig left as well as most stationary neurons fig right importantly the were also correctly interpolated for held out trials circles in fig to evaluate whether the additional parameters in result in superior model compared to conventional plds we tested the model with different latent dimensionalities ranging from to and compared each model against fixed plds of matched dimensionality fig we estimated predicted firing rates on held out trials by sampling replicate trials from the predictive distribution for both models and compared the median across samples of the mean firing rates of each neuron to those of the data the shown rmse values are the errors of predicted firing rate in hz per neuron per held out trial population mean across all neurons and trials is hz we found that outperformed plds provided that we had sufficiently many latent states at least for large latent dimensionalities performance degraded again which could be consequence of overfitting furthermore we show that for neurons there is large gain in predictive power fig left whereas for stationary neurons plds and have similar prediction accuracy fig middle the rmse on firing rates for all neurons fig right suggests that our model correctly identified the fluctuation in firing rates we also 
wanted to gain insights into the temporal scale of the underlying we first looked at the recovered of the latent modulators and found them to be highly preserved across multiple training folds and importantly across different values of the latent dimensionalities consistently peaked near trials fig we made sure that the peak near trials is not merely consequence of parameter parameters were initialised by fitting gaussian process with exponentiated quadratic kernel to each neuron mean firing rate over trials individually then taking the mean over neurons as the initial global for our kernel the initial values were differing slightly between training sets similarly we checked that the parameters of the final model after iterations of bayesian laplace propagation were indeed superior to the initial values by monitoring the prediction error on trials furthermore due to introducing smooth change with the correct time scale in the latent space the posterior mean of across trials shown in fig we find that recovers more of the covariance of neurons compared to the fixed plds model fig discussion are ubiquitous in neural data slow modulations in firing properties can result from diverse processes such as plasticity and learning fluctuations in arousal cortical reorganisation after injury as well as development and aging in addition in neural data can also be consequence of experimental artifacts and can be caused by fluctuations in anaesthesia level estimates normalized mean autocovariance estimated modulators data plds count trials trial index time lag ms figure firing rates in population of neurons continued histogram of across different latent dimensionalities and training sets mean at is indicated by the vertical red line estimated modulator the posterior mean of the modulator with an estimated length scale of approximately trials is smoothly varying across trials comparison of normalized mean across neurons stability of the physiological preparation or electrode drift whatever the origins of are it is important to have statistical models which can identify them and disentangle their effects from correlations and dynamics on faster we here presented hierarchical model for neural population dynamics in the presence of nonstationarity specifically we concentrated on variant of this model which focuses on nonstationarity in firing rates recent experimental studies have shown that slow fluctuations in neural excitability which have multiplicative effect on neural firing rates are dominant source of noise correlations in anaesthetized visual cortex because of the exponential spiking nonlinearity employed in our model the latent additive fluctuations in the also have multiplicative effect on firing rates applied to of neurophysiological recordings we demonstrated that this modelling approach can successfully capture in neurophysiological recordings from primary visual cortex in our model both neural dynamics and latent modulators are mediated by the same subspace parameterised by we note however that this assumption does not imply that neurons with strong correlations will also have strong correlations as different dimensions of this subspace as long as it is chosen big enough could be occupied by short and long term correlations respectively in our applications to neural data we found that the latent state had to be at least for the model to outperform stationary dynamics model and it might be the case that at least three dimensions are necessary to capture both fast and slow correlations it is an 
open question of how correlations on fast and slow timescales are related and the techniques presented have the potential to be of use for mapping out their relationships there are limitations to the current study we did not address the question of how to select amongst multiple different models which could be used to model neural for given dataset we did not present numerical techniques for how to scale up the current algorithm for larger trial numbers using approximations to the covariance matrix or large neural populations and we did not address the question of how to overcome the slow convergence properties of gp kernel parameter estimation while laplace propagation is flexible it is an approximate inference technique and the quality of its approximations might vary for different models of tasks we believe that extending our method to address these questions provides an exciting direction for future research and will result in powerful set of statistical methods for investigating how neural systems operate in the presence of acknowledgments we thank alexander ecker and the lab of andreas tolias for sharing their data with us see http and for allowing us to use it in this publication as well as maneesh sahani and alexander ecker for valuable comments this work was funded by the gatsby charitable foundation mp and gb and the german federal ministry of education and research mp and jhm through bmbf bernstein center code available at http references renart and machens variability in neural activity and behavior curr opin neurobiol destexhe intracellular and computational evidence for dominant role of internal network activity in cortical computations curr opin neurobiol maimon modulation of visual physiology by behavioral state in monkeys mice and flies curr opin neurobiol harris and thiele cortical state and attention nat rev neurosci ecker et al state dependence of noise correlations in macaque primary visual cortex neuron ralf haefner pietro berkes and fiser perceptual as probabilistic inference by neural sampling arxiv preprint alexander ecker george denfield matthias bethge and andreas tolias on the structure of population activity under fluctuations in attentional state biorxiv page smith and brown estimating model from point process observations neural comput eden frank barbieri solo and brown dynamic analysis of neural encoding by point process adaptive filtering neural comput yu afshar santhanam ryu shenoy and sahani extracting dynamical structure embedded in neural activity in nips pages mit press cambridge ma kulkarni and paninski models for multiple neural data network truccolo hochberg and donoghue collective dynamics in human and monkey sensorimotor cortex predicting single neuron spikes nat neurosci macke buesing cunningham yu shenoy and sahani empirical models of spiking in neural populations in nips pages van vreeswijk and sompolinsky chaos in neuronal networks with balanced excitatory and inhibitory activity science tomko and crapper neuronal variability responses to identical visual stimuli brain res brody correlations without synchrony neural comput goris movshon and simoncelli partitioning neuronal variability nat neurosci gilbert and li adult visual cortical plasticity neuron brown nguyen frank wilson and solo an analysis of neural receptive field plasticity by point process adaptive filtering proc natl acad sci frank et al contrasting patterns of receptive field plasticity in the hippocampus and the entorhinal cortex an adaptive filtering approach neurosci lesica and 
stanley improved tracking of encoding properties of visual neurons by extended recursive ieee trans neural syst rehabil eng ventura cai and kass variability and its effect on dependency between two neurons hass and durstewitz method for of spike train data with application to the pearson neurophysiol et al cortical state determines global variability and correlations in visual cortex neurosci gabriela uri sylvia marianna wendy and emery analysis of and neural spiking dynamics journal of neurophysiology mangion et al online variational inference for models with observations neural comput neil rabinowitz robbe lt goris johannes and eero simoncelli model of sensory neural responses in the presence of unknown modulatory inputs arxiv preprint rasmussen and williams gaussian processes for machine learning mit press cambridge ma usa beal variational algorithms for approximate bayesian inference phd thesis gatsby unit university college london yu et al factor analysis for analysis of neural population activity smola vishwanathan and eskin laplace propagation in sebastian thrun lawrence saul and bernhard editors nips pages mit press ypma and heskes novel approximations for inference in nonlinear dynamical systems using expectation propagation shenoy yu and sahani expectation propagation for inference in dynamical models with poisson observations in proc ieee nonlinear statistical signal processing workshop murray and adams slice sampling covariance hyperparameters of latent gaussian models in nips pages 
bayesian manifold learning the locally linear latent variable model mijung park wittawat jitkrittum ahmad lars maneesh sahani gatsby computational neuroscience unit university college london mijung wittawat atqamar lbuesing maneesh abstract we introduce the locally linear latent variable model probabilistic model for manifold discovery that describes joint distribution over observations their manifold coordinates and locally linear maps conditioned on set of neighbourhood relationships the model allows straightforward variational optimisation of the posterior distribution on coordinates and locally linear maps from the latent space to the observation space given the data thus the encapsulates the preserving intuitions that underlie methods such as locally linear embedding lle its probabilistic semantics make it easy to evaluate the quality of hypothesised neighbourhood relationships select the intrinsic dimensionality of the manifold construct extensions and to combine the manifold model with additional probabilistic models that capture the structure of coordinates within the manifold introduction many datasets comprise points derived from smooth manifold embedded within the space of measurements and possibly corrupted by noise for instance biological or medical imaging data might reflect the interplay of small number of latent processes that all affect measurements linear multivariate analyses such as principal component analysis pca or multidimensional scaling mds have long been used to estimate such underlying processes but can not always reveal structure when the mapping is or equivalently the manifold is curved thus there has been substantial recent interest in algorithms to identify manifolds in data many heuristic methods for manifold discovery are based on the idea of preserving the geometric properties of local neighbourhoods within the data while embedding unfolding or otherwise transforming the data to occupy fewer dimensions thus algorithms such as embedding lle and laplacian eigenmap attempt to preserve local linear relationships or to minimise the distortion of local derivatives others like isometric feature mapping isomap or maximum variance unfolding mvu preserve local distances estimating global manifold properties by continuation across neighbourhoods before embedding to lower dimensions by classical methods such as pca or mds while generally hewing to this same intuitive path the range of available algorithms has grown very substantially in recent years current affiliation thread genius current affiliation google deepmind however these approaches do not define distributions over the data or over the manifold properties thus they provide no measures of uncertainty on manifold structure or on the locations of the embedded points they can not be combined with structured probabilistic model within the manifold to define full likelihood relative to the observations and they provide only heuristic methods to evaluate the manifold dimensionality as others have pointed out they also make it difficult to extend the manifold definition to points in principled way an established alternative is to construct an explicit probabilistic model of the functional relationship between manifold coordinates and each measured dimension of the data assuming that the functions instantiate draws from priors the original gaussian process latent variable model required optimisation of the coordinates and thus still did not provide uncertainties on these locations or allow evaluation of the 
likelihood of model over them however recent extension exploits an auxiliary variable approach to optimise more general variational bound thus retaining approximate probabilistic semantics within the latent space the stochastic process model for the mapping functions also makes it straightforward to estimate the function at previously unobserved points thus generalising with ease however the gives up on the intuitive preservation of local neighbourhood properties that underpin the methods reviewed above instead the expected smoothness or other structure of the manifold must be defined by the gaussian process covariance function chosen priori here we introduce new probabilistic model over observations embedded locations and mappings between high and linear maps within each neighbourhood such that each group of variables is gaussian distributed given the other two this locally linear latent variable model thus respects the same intuitions as the common manifold discovery algorithms while still defining probabilistic model indeed variational inference in this model follows more directly and with fewer separate bounding operations than the sparse approach used with the thus uncertainty in the coordinates and in the manifold shape defined by the local maps is captured naturally lower bound on the marginal likelihood of the model makes it possible to select between different latent dimensionalities and perhaps most crucially between different definitions of neighbourhood thus addressing an important unsolved issue with neighbourhooddefined algorithms unlike existing probabilistic frameworks with locally linear models such as mixtures of factor analysers mfa and local tangent space analysis ltsa methods does not require an additional step to obtain the globally consistent alignment of local this paper is organised as follows in section we introduce our generative model for which we derive the variational inference method in section we briefly describe extension for and mathematically describe the dissimilarity between and gplvm at the end of section in section we demonstrate the approach on several real world problems notation in the following diagonal matrix with entries taken from the vector is written diag the vector of ones is and the identity matrix is in the euclidean norm of vector is kvk the frobenius norm of matrix is kmkf the kronecker delta is denoted by δij if and otherwise the kronecker product of matrices and is for random vector we denote the normalisation constant in its probability density function by zw the expectation of random vector with respect to density is hwiq the model suppose we have data points yn rdy and graph on nodes with edge set eg yi and yj are neighbours we assume that there is latent representation of the data with coordinates xn rdx dx dy it will be helpful to concatenate the vectors to form yn and xn this is also true of one previous method which finds model parameters and global coordinates by variational methods similar to our own space yj tyimy yi ci space tximx xi xj figure locally linear mapping ci for ith data point transforms the tangent space txi mx at xi in the lowdimensional space to the tangent space tyi my at the corresponding data point yi in the space neighbouring data point is denoted by yj and the corresponding latent variable by xj our key assumption is that the mapping between data and coordinates is locally linear fig the tangent spaces are approximated by yj yi and xj xi the pairwise differences between the ith point and neighbouring 
points the matrix ci rdy at the ith point linearly maps those tangent spaces as yj yi ci xj xi under this assumption we aim to find the distribution over the linear maps cn rdy and the latent variables that best describe the data likelihood given the graph zz log log dx dc the joint distribution can be written in terms of priors on and the likelihood of as in the following we highlight the essential components the locally linear latent variable model detailed derivations are given in the appendix adjacency matrix and laplacian matrix the edge set of for data points specifies symmetric adjacency matrix we write ηij for the jth element of which is if yj and yi are neighbours and if not including on the diagonal the graph laplacian matrix is then diag prior on we assume that the latent variables are with bounded expected scale and that latent variables corresponding to neighbouring points are close in euclidean distance formally the log prior on the coordinates is then log xn αkxi ηij kxi xj log zx where the parameter controls the expected scale this prior can be written as multivariate normal distribution on the concatenated where idx αindx prior on we assume that the linear maps corresponding to neighbouring points are similar in terms of frobenius norm thus favouring smooth manifold of low curvature this gives xx log cn ci ηij kci cj log zc tr log zc where idx the second line corresponds to the matrix normal density giving mn idy as the prior on in our implementation we fix to small since the magnitude of the product ci xi xj is determined by optimising the hyperparameter above sets the scale of the average linear map ensuring the prior precision matrix is invertible figure graphical representation of generative process in lllvm given dataset we construct neighbourhood graph the distribution over the latent variable is controlled by the graph as well as the parameter the distribution over the linear map is also governed by the graph the latent variable and the linear map together determine the data likelihood likelihood under the assumption we penalise the approximation error of eq which yields the log likelihood log ηij ci ci log zy yi where yj yi and xj xi thus is drawn from multivariate normal distribution given by µy σy µy σy and en rndy with yp idy ei ηji cj ci for computational simplicity we assume γidy the graphical representation of the generative process underlying the is given in fig variational inference our goal is to infer the latent variables as well as the parameters in we infer them by maximising the lower bound of the marginal likelihood of the observations zz log log dxdc following the common treatment for computational tractability we assume the posterior over factorises as we maximise the lower bound and by the variational expectation maximization algorithm which consists of the variational expectation step for computing by exp exp log dc log dx then the maximization step for estimating by arg maxθ step computing from eq requires rewriting the likelihood in eq as quadratic function in exp ax where the normaliser has all the terms that do not depend on from eq let ndx where the jth the matrix is given by σy ae aij pn pn dx dx block is aij ae ae and each jth dy dx block of ae rndy is given by ae cj ci δij ck ci the ηik ndx vector is defined with the component dx vectors bn pn as given by bi ηij cj yi yj ci yj yi the likelihood combined with the prior on gives us the gaussian posterior over solving eq σx where haiq µx σx hbiq the term centers the data and ensures the 
distribution can be normalised it applies in subspace orthogonal to that modelled by and and so its value does not affect the resulting manifold model datapoints posterior mean of average lwbs true post mean of figure simulated example data points drawn from swiss roll true latent points in used for generating the data posterior mean of and posterior mean of after em iterations given which was chosen by maximising the lower bound across different average lower bounds as function of each point is an average across random seeds similarly computing from eq requires rewriting the likelihood in eq as quadratic function in exp tr γc where the normaliser has all the terms that do not depend on from eq and the matrix qn rndx where the jth subvector of the ith column is qi dx ηij xi xj δij we define hn rdy ik pkn whose ith block is hi ηij yj yi xj xi the likelihood combined with the prior on gives us the gaussian posterior over solving eq mn µc σc where and µc hhiq hγiq the expected values of and are given in the appendix step we set the parameters by maximising which is split into two terms based on dependence on each parameter expected for updating by arg maxv eq log and negative kl divergence between the prior and the posterior on for updating by arg maxα eq log log the update rules for each hyperparameter are given in the appendix the full em starts with an initial value of in the given compute as in eq likewise given compute as in eq the parameters are updated in the by maximising eq the two steps are repeated until the variational lower bound in eq saturates to give sense of how the algorithm works we visualise fitting results for simulated example in fig using the graph constructed from observations given different we run our em algorithm the posterior means of and given the optimal chosen by the maximum lower bound resemble the true manifolds in and spaces respectively extension in the model one can formulate computationally efficient extension technique as follows given data points denoted by yn the variational em algorithm derived in the previous section converts into the posterior now given new data point one can first find the neighbourhood of without changing the current neighbourhood graph then it is possible to compute the distributions over the corresponding locally linear map and latent variable via simply performing the given freezing all other quantities the same as an implementation is available from http samples in representation posterior mean of in space without shortcut with shortcut lb lb figure resolving problems using variational lower bound visualization of samples drawn from swiss roll in space points red and blue are close to each other dotted grey in visualization of the samples on the latent manifold the distance between points and is seen to be large posterior mean of shortcircuiting the and the data points in the graph construction lllvm achieves higher lower bound when the shortcut is absent the red and blue parts are mixed in the resulting estimate in space right when there is shortcut the lower bound is obtained after em iterations comparison to closely related probabilistic dimensionality reduction algorithm to is defines the mapping from the latent space to data space using gaussian processes the likelihood of the observations ydy yk is the vector formed by the kth element of all high dimensional vectors given latent variables qdy xdx is defined by yk knn in where the jth of the covariancei matrix is of the exponentiated quadratic form xi xj element pdx σf exp 
αq xi xj with parameters αq in once we integrate out from eq we also obtain the gaussian likelihood given dc exp ll in contrast to the precision matrix ll depends on the graph laplacian matrix through and therefore in the graph structure directly determines the functional form of the conditional precision experiments mitigating the problem like other methods is sensitive to misspecified neighbourhoods the prior likelihood and posterior all depend on the assumed graph unlike other methods lllvm provides natural way to evaluate possible using the variational lower bound of eq fig shows samples drawn from swiss roll in space fig two points labelled and happen to fall close to each other in but are actually far apart on the latent surface fig graph might link these distorting the recovered coordinates however evaluating the model without this edge the correct graph yields higher variational bound fig although it is prohibitive to evaluate every possible graph in this way the availability of principled criterion to test specific hypotheses is of obvious value in the following we demonstrate on two real datasets handwritten digits and climate data modelling usps handwritten digits as first example we test our method on subset of samples each of the digits from the usps digit dataset where each digit is of size dy we follow and represent the latent variables in variational lower bound posterior mean of digit digit digit digit digit true estimate query query query query query em iterations classification error lle isomap lllvm isomap gplvm lle figure usps handwritten digit dataset described in section mean in solid and variance standard deviation shading of the variational lower bound across different random starts of em algorithm with different the highest lower bound is achieved when the posterior mean of in each digit is colour coded on the right side are reconstructions of for randomly chosen query points using neighbouring and posterior means of we can recover successfully see text fitting results by using the same data isomap and lle using the extracted features in we evaluated classifier for digit identity with the same data divided into training and test sets the classification error is shown in features yield the comparably low error with and isomap fig shows variational lower bounds for different values of using different em initialisations the posterior mean of obtained from using the best is illustrated in fig fig also shows reconstructions of one example of each digit using its coordinates as well as the posterior mean coordinates tangent spaces and actual images yi of its closest neighbours the reconstruction is basedhon the assumed structure pk of the generative model eq that is yi similar process could be used to reconstruct digits at locations finally we quantify the relevance of the recovered subspace by computing the error incurred using simple classifier to report digit identity using the features obtained by and various competing methods fig classification with coordinates performs similarly to and isomap and outperforms lle mapping climate data in this experiment we attempted to recover geographical relationships between weather stations from recorded monthly precipitation patterns data were obtained by averaging annual precipitation records from at weather stations scattered across the us see fig thus the data set comprised vectors the goal of the experiment is to recover the topology of the weather stations as given by their latitude and the dataset is made available by the 
national climatic data center at http we use version monthly data latitude longitude weather stations isomap lle ltsa figure climate modelling problem as described in section each example corresponding to weather station is vector of monthly precipitation measurements using only the measurements the projection obtained from the proposed recovers the topological arrangement of the stations to large degree tude using only these climatic measurements as before we compare the projected points obtained by with several widely used dimensionality reduction techniques for the methods ltsa isomap and lle we used with euclidean distance to construct the neighbourhood graph the results are presented in fig identified more arrangement for the weather stations than the other algorithms the fully probabilistic nature of and gplvm allowed these algorithms to handle the noise present in the measurements in principled way this contrasts with isomap which can be topologically unstable vulnerable to shortcircuit errors if the neighbourhood is too large perhaps coincidentally also seems to respect local geography more fully in places than does conclusion we have demonstrated new probabilistic approach to manifold discovery that embodies the central notion that local geometries are mapped linearly between manifold coordinates and observations the approach offers natural variational algorithm for learning quantifies local uncertainty in the manifold and permits evaluation of hypothetical neighbourhood relationships in the present study we have described the model conditioned on neighbourhood graph in principle it is also possible to extend so as to construct distance matrix as in by maximising the data likelihood we leave this as direction for future work acknowledgments the authors were funded by the gatsby charitable foundation references roweis and saul nonlinear dimensionality reduction by locally linear embedding science belkin and niyogi laplacian eigenmaps and spectral techniques for embedding and clustering in nips pages tenenbaum silva and langford global geometric framework for nonlinear dimensionality reduction science van der maaten postma and van den herik dimensionality reduction comparative review http cayton algorithms for manifold learning univ of california at san diego tech rep pages http platt fastmap metricmap and landmark mds are all algorithms in proceedings of international workshop on artificial intelligence and statistics pages lawrence gaussian process latent variable models for visualisation of high dimensional data in nips pages titsias and lawrence bayesian gaussian process latent variable model in aistats pages roweis saul and hinton global coordination of local linear models in nips pages brand charting manifold in nips pages zhan and yin robust local tangent space alignment in nips pages verbeek learning nonlinear image manifolds by global alignment of local linear models ieee transactions on pattern analysis and machine intelligence bishop pattern recognition and machine learning springer new york beal variational algorithms for approximate bayesian inference phd thesis gatsby unit university college london menne williams and vose the historical climatology network monthly temperature data version bulletin of the american meteorological society july mukund balasubramanian and eric schwartz the isomap algorithm and topological stability science january lawrence spectral dimensionality reduction via maximum entropy in aistats pages 
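To make the neighbourhood-graph prior used in the model section above concrete, the following is a minimal numpy sketch of how the adjacency matrix, the graph Laplacian, and the Gaussian prior precision over the stacked latent coordinates could be assembled. It is a sketch under stated assumptions, not the authors' implementation: the k-nearest-neighbour construction, the exclusion of self-edges, and the factor of 2 multiplying the Laplacian are choices made here to match the quadratic form alpha * sum_i ||x_i||^2 + sum_ij eta_ij ||x_i - x_j||^2 written in the text; the exact constants are fixed in the paper's appendix, which is not reproduced here.

```python
import numpy as np

def knn_adjacency(Y, k):
    """Symmetric k-nearest-neighbour adjacency (eta) from data Y of shape (n, d_y)."""
    n = Y.shape[0]
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                          # no self-edges (assumption)
    A = np.zeros((n, n))
    for i in range(n):
        A[i, np.argsort(d2[i])[:k]] = 1.0
    return np.maximum(A, A.T)                             # symmetrise

def latent_prior_precision(A, alpha, d_x):
    """Precision of the Gaussian prior on the stacked latent coordinates x.

    Encodes -alpha * sum_i ||x_i||^2 - sum_ij eta_ij ||x_i - x_j||^2 as a single
    quadratic form -0.5 * x' P x; the factor of 2 on the Laplacian accounts for the
    ordered-pair sum and is an assumption about the exact convention.
    """
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A                        # graph Laplacian
    return np.kron(2.0 * (alpha * np.eye(n) + 2.0 * L), np.eye(d_x))

# toy usage: 50 noisy 3-d observations, 2-d latent space
Y = np.random.randn(50, 3)
A = knn_adjacency(Y, k=5)
P = latent_prior_precision(A, alpha=1.0, d_x=2)           # (100, 100), positive definite
```

The same adjacency matrix also enters the prior on the linear maps and the likelihood, so comparing two candidate neighbourhood graphs via the variational lower bound, as in the short-circuit experiment above, only requires rebuilding A and rerunning the variational EM.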
color constancy by learning to predict chromaticity from luminance ayan chakrabarti toyota technological institute at chicago kenwood chicago il ayanc abstract color constancy is the recovery of true surface color from observed color and requires estimating the chromaticity of scene illumination to correct for the bias it induces in this paper we show that the color statistics of natural any spatial or semantic by themselves be powerful cue for color constancy specifically we describe an illuminant estimation method that is built around classifier for identifying the true chromaticity of pixel given its luminance absolute brightness across color channels during inference each pixel observed color restricts its true chromaticity to those values that can be explained by one of candidate set of illuminants and applying the classifier over these values yields distribution over the corresponding illuminants global estimate for the scene illuminant is computed through simple aggregation of these distributions across all pixels we begin by simply defining the classifier by computing empirical histograms over discretized chromaticity and luminance values from training set of natural images these histograms reflect preference for hues corresponding to smooth reflectance functions and for achromatic colors in brighter pixels despite its simplicity the resulting estimation algorithm outperforms current color constancy methods next we propose method to learn the classifier using stochastic gradient descent we set likelihoods to minimize errors in the final scene illuminant estimates on training set this leads to further improvements in accuracy most significantly in the tail of the error distribution introduction the spectral distribution of light reflected off surface is function of an intrinsic material property of the also of the spectral distribution of the light illuminating the surface consequently the observed color of the same surface under different illuminants in different images will be different to be able to reliably use color computationally for identifying materials and objects researchers are interested in deriving an encoding of color from an observed image that is invariant to changing illumination this task is known as color constancy and requires resolving the ambiguity between illuminant and surface colors in an observed image since both of these quantities are unknown much of color constancy research is focused on identifying models and statistical properties of natural scenes that are informative for color constancy while pschophysical experiments have demonstrated that the human visual system is remarkably successful at achieving color constancy it remains challenging task computationally early color constancy algorithms were based on relatively simple models for pixel colors for example the gray world method simply assumed that the average true intensities of different color channels across all pixels in an image would be equal while the retinex method assumed that the true color of the brightest pixels in an image is white most modern color constancy methods however are based on more complex reasoning with image features many methods use models for image derivatives instead of individual pixels others are based on recognizing and matching image segments to those in training set to recover true color recent method proposes the use of convolutional neural network cnn to regress from image patches to illuminant color there are also many color constancy algorithms that combine 
illuminant estimates from number of simpler unitary algorithms sometimes using image features to give higher weight to the outputs of some subset of methods in this paper we demonstrate that by appropriately modeling and reasoning with the statistics of individual pixel colors one can computationally recover illuminant color with high accuracy we consider individual pixels in isolation where the color constancy task reduces to discriminating between the possible choices of true color for the pixel that are feasible given the observed color and candidate set of illuminants central to our method is function that gives us the relative likelihoods of these true colors and therefore distribution over the corresponding candidate illuminants our global estimate for the scene illuminant is then computed by simply aggregating these distributions across all pixels in the image we formulate the likelihood function as one that measures the conditional likelihood of true pixel chromaticity given observed luminance in part to be agnostic to the scalar color channelindependent ambiguity in observed color intensities moreover rather than committing to parametric form we quantize the space of possible chromaticity and luminance values and define the function over this discrete domain we begin by setting the conditional likelihoods purely empirically based simply on the histograms of true color values over all pixels in all images across training set even with this purely empirical approach our estimation algorithm yields estimates with higher accuracy than current methods then we investigate learning the perpixel belief function by optimizing an objective based on the accuracy of the final global illuminant estimate we carry out this optimization using stochastic gradient descent and using approach similar to dropout to improve generalization beyond the training set this further improves estimation accuracy without adding to the computational cost of inference preliminaries assuming lambertian reflection the spectral distribution of light reflected by material is product of the distribution of the incident light and the material reflectance function the color intensity vector recorded by sensor at each pixel is then given by dλ where is the reflectance at is the spectral distribution of the incident illumination is shading factor and denotes the spectral sensitivities of the color sensors color constancy is typically framed as the task of computing from the corresponding color intensities that would have been observed under some canonical illuminant ℓref typically chosen to be ℓref we will refer to as the true color at since involves projection of the full incident light spectrum on to the three filters it is not generally possible to recover from even with knowledge of the illuminant however commonly adopted approximation shown to be reasonable under certain assumptions is to relate the true and observed colors and by simple adaptation where refers to the hadamard product and depends on the illuminant for ℓref with some abuse of terminology we will refer to as the illuminant in the remainder of the paper moreover we will focus on the case in this paper and assume in an image our goal during inference will be to estimate this global illuminant from the observed image the true color image can then simply be recovered as where denotes the inverse of note that color constancy algorithms seek to resolve the ambiguity between and in only up to scalar factor this is because scalar ambiguities show up in between 
and ℓref due to light attenuation between and due to the shading factor and in the observed image itself due to varying exposure settings therefore the performance metric typically used is the angular error vectors and mt between the true and estimated illuminant database for training and evaluation we use the database of natural indoor and outdoor images captured under various illuminants by gehler et al we use the version from shi and funt that contains linear images without gamma correction generated from the raw camera data the database contains images captured with two different cameras images with canon and with canon each image contains color checker chart placed in the image with its position manually labeled the colors of the gray squares in the chart are taken to be the value of the true illuminant for each image which can then be used to correct the image to get true colors at each pixel of course only up to scale the chart is masked out during evaluation we use over this dataset in our experiments each fold contains images from both cameras corresponding to one of partitions of each camera image set ordered by file of capture estimates for images in each fold are based on training only with data from the remaining folds we report results with and these correspond to average training set sizes of and images respectively color constancy with chromaticity statistics color vector can be characterized in terms of its luminance or absolute brightness across color channels and its chromaticity which is measure of the relative ratios between intensities in different channels while there are different ways of encoding chromaticity we will do so in terms of the unit vector in the direction of note that since intensities can not be negative is restricted to lie on the eighth of the unit sphere remember from sec that our goal is to resolve the ambiguity between the true colors and the illuminant only up to scale in other words we need only estimate the illuminant chromaticity and true chromaticities from the observed image which we can relate from as kx key property of natural illuminant chromaticities is that they are known to take fairly restricted set of values close to locus predicted by planck radiation law to be able to exploit this we denote as the set of possible values for illuminant chromaticity and construct it from training set specifically we the chromaticity vectors of the illuminants in the training set and let be the set of unique chromaticity values additionally we define prior bi log ni over this candidate set based on the number ni of training illuminants that were quantized to given the observed color at single pixel the ambiguity in across the illuminant set translates to corresponding ambiguity in the true chromaticity over the set figure illustrates this ambiguity for few different observed colors we note that while there is significant angular deviation within the set of possible true chromaticity values for any observed color values in each set lie close to one dimensional locus in chromaticity space this suggests that the illuminants in our training set are indeed good fit to planck the goal of our work is to investigate the extent to which we can resolve the above ambiguity in true chromaticity on basis without having to reason about the pixel spatial neighborhood or semantic context our approach is based on computing likelihood distribution over the possible values of given the observed luminance kv but as mentioned in sec there is considerable ambiguity in the 
scale of observed color intensities we address this partially by applying simple global normalization to the observed luminance to define quantization is over uniformly sized bins in see supplementary material for details in fact the chromaticities appear to lie on two curves that are slightly separated from each other this separation is likely due to differences in the sensor responses of the two cameras in the dataset ambiguity with observed color legend green re re lu ree ee gr set of possible true chromaticities for speciﬁc observed color blue red blue true chromaticity from empirical statistics from learning figure color constancy with distributions of natural scenes ambiguity in true chromaticity given observed color each set of points corresponds to the possible true chromaticity values location in see legend consistent with the pixel observed chromaticity color of the points and different candidate illuminants distributions over different values for true chromaticity of pixel conditioned on its observed luminance computed as empirical histograms over the training set values are normalized by the median luminance value over all pixels corresponding distributions learned with training to maximize accuracy of overall illuminant estimation kv kv this very roughly compensates for variations across images due to exposure settings illuminant brightness etc however note that since the normalization is global it does not compensate for variations due to shading the central component of our inference method is function that encodes the belief that pixel with normalized observed luminance has true chromaticity this function is defined over discrete domain by quantizing both chromaticity and luminance values we clip luminance values to four four times the median luminance of the image and quantize them into twenty equal sized bins and for chromaticity we use much finger quantization with bins in see supplementary material for details in this section we adopt purely empirical approach and define as log where is the number of pixels across all pixels in set of images in training set that have true chromaticity and observed luminance we visualize these empirical versions of for subset of the luminance quantization levels in fig we find that in general desaturated chromaticities with similar intensity values in all color channels are most common this is consistent with findings of statistical analysis of natural spectra which shows the dc component flat across wavelength to be the one with most variance we also note that the concentration of the likelihood mass in these chromaticities increasing for higher values of luminance this phenomenon is also predicted by traditional intuitions in color science materials are brightest when they reflect most of the incident light which typically occurs when they have flat reflectance function with all values of close to one indeed this is what forms the basis of the retinex method amongst saturated colors we find that hues which combine green with either red or blue occur more frequently than primary colors with pure green and combinations of red and blue being the least common this is consistent with findings that reflectance functions are usually smooth pca on pixel spectra in revealed basis both saturated green and combinations would require the reflectance to have either sharp peak or crest respectively in the middle of the visible spectrum we now describe method that exploits the belief function for illuminant estimation given the observed color at 
pixel we can obtain distribution over the set of possible true chromaticity values which can also be interpreted as distribution over the corresponding illuminants we then simply aggregate these distributions across all pixels in the image and define the global probability of being the scene illuminant as pi exp li exp where li βbi is the total number of pixels in the image and and are scalar parameters the final illuminant chromaticity estimate is then computed as pi arg min cos arg max km km pi mi note that also incorporates the prior bi over illuminants we set the parameters and using grid search to values that minimize mean illuminant estimation error over the training set the primary computational cost of inference is in computing the values of li we values of using over the discrete domain of quantized chromaticity values for and the candidate illuminant set for therefore computing each li essentially only requires the addition of numbers from table we need to do this for all illuminants where summations for different illuminants can be carried out in parallel our implementation takes roughly seconds for image on modern intel cpu with cores and is available at http this empirical version of our approach bears some similarity to the bayesian method of that is based on priors for illuminants and for the likelihood of different true reflectance values being present in scene however the key difference is our modeling of true chromaticity conditioned on luminance that explicitly makes estimation agnostic to the absolute scale of intensity values we also reason with all pixels rather than the set of unique colors in the image experimental results table compares the performance of illuminant estimation with our method see rows labeled empirical to the current using different quantiles of angular error across the database results for other methods are from the survey by li et al see the supplementary material for comparisons to some other recent methods we show results with both and we find that our errors with threefold have lower mean median and values than those of the best performing method from which combines illuminant estimates from twelve different unitary method many of which are also listed in table using regression the improvement in error is larger with respect to the other combination methods as well as those based the statistics of image derivatives moreover since our method has more parameters than most previous algorithms has entries it is likely table quantiles of angular error for different methods on the database method mean median bayesian gamut mapping deriv gamut mapping gray world gray edge scene geom comb comb comb neural comb elm comb proposed empirical trained empirical trained to benefit from more training data we find this to indeed be the case and observe considerable decrease in error quantiles when we switch to figure shows estimation results with our method for few sample images for each image we show the input image indicating the ground truth color chart being masked out and the output image with colors corrected by the global illuminant estimate to visualize the quality of contributions from individual pixels we also show map of angular errors for illuminant estimates from individual pixels these estimates are based on values of li computed by restricting the summation in to individual pixels we find that even these estimates are fairly accurate for lot of pixels even when it true color is saturated see cart in first row also to evaluate the weight of these 
distributions to the global li we show map of their variance on basis as expected from fig we note higher variances in relatively brighter pixels the image in the last row represents one of the poorest estimates across the entire dataset higher than note that much of the image is in shadow and contain only few distinct and likely atypical materials learning while the empirical approach in the previous section would be optimal if pixel chromaticities in typical image were infact that is clearly not the case therefore in this section we propose an alternate approach method to setting the beliefs in that optimizes for the accuracy of the final global illuminant estimate however unlike previous color constancy methods that explicitly model statistical between example by modeling spatial derivatives or learning functions on histograms retain the overall parametric form by which we compute the illuminant in therefore even though itself is learned through knowledge of of chromaticities in natural images estimation of the illuminant during inference is still achieved through simple aggregation of distributions specifically we set the entries of to minimize cost function over set of training images ct pti estimate error belief variance empirical global estimate error error empirical ground truth error error empirical ground truth error ground truth error figure estimation results on sample images along with output images corrected with the global illuminant estimate from our methods we also visualize illuminant information extracted at local level we show map of the angular error of illuminant estimates computed with li based on distributions from only single pixel we also show map of the variance var li of these beliefs to gauge the weight of their contributions to the global illuminant estimate where is the true illuminant chromaticity of the tth training image and pti is computed from the observed colors using we augment the training data available to us by each image with different illuminants from the training set we use the original image set and six copies for training and use seventh copy for validation we use stochastic gradient descent to minimize we initialize to empirical values as described in the previous section for convenience we multiply the empirical values by and then set for computing li and then consider individual images from the training set at each iteration we make multiple passes through the training set and at each iteration we randomly the pixels from each training image specifically we only retain of the total pixels in the image by randomly patches at time this approach which can be interpreted as being similar to dropout prevents and improves generalization derivatives of the cost function with respect to the current values of beliefs are given by pti we use momentum to update the values of at each iteration based on these derivative as where is the previous update value is the learning rate and is the momentum factor in our experiments we set run stochastic gradient descent for epochs with and another epochs with we retain the values of from each epoch and our final output is the version that yields the lowest mean illuminant estimation error on the validation set where we show the belief values learned in this manner in fig notice that although they retain the overall biases towards desaturated colors and combined and hues they are less smooth than their empirical counterparts in fig many instances there are sharp changes in the values for small changes in 
chromaticity while harder to interpret we hypothesize that these variations result from shifting beliefs of specific pairs to their neighbors when they correspond to incorrect choices within the ambiguous set of specific observed colors experimental results we also report errors when using these trained versions of the belief function in table and find that they lead to an appreciable reduction in error in comparison to their empirical counterparts indeed the errors with training using crossvalidation begin to approach those of the empirical version with which has access to much more training data also note that the most significant improvements for both and are in outlier performance in the and error values color constancy methods perform worst on images that are dominated by small number of materials with ambiguous chromaticity and our results indicate that training increases the reliability of our estimation method in these cases we also include results for the case for the example images in figure for all three images there is an improvement in the global estimation error more interestingly we see that the error and variance maps now have more variation since now reacts more sharply to slight chromaticity changes from pixel to pixel moreover we see that larger fraction of pixels generate fairly accurate estimates by themselves blue shirt in row there is also higher disparity in belief variance including within regions that visually look homogeneous in the input indicating that the global estimate is now more heavily influenced by smaller fraction of pixels conclusion and future work in this paper we introduced new color constancy method that is based on conditional likelihood function for the true chromaticity of pixel given its luminance we proposed two approaches to learning this function the first was based purely on empirical pixel statistics while the second was based on maximizing accuracy of the final illuminant estimate both versions were found to outperform color constancy methods including those that employed more complex features and semantic reasoning while we assumed single global illuminant in this paper the underlying reasoning can likely be extended to the case especially since as we saw in fig our method was often able to extract reasonable illuminant estimates from individual pixels another useful direction for future research is to investigate the benefits of using likelihood functions that are conditioned on using an intrinsic image decomposition of normalized luminance this would factor out the scalar ambiguity caused by shading which could lead to more informative distributions acknowledgments we thank the authors of for providing estimation results of other methods for comparison the author was supported by gift from adobe references brainard and radonjic color constancy in the new visual neurosciences buchsbaum spatial processor model for object colour perception franklin inst land the retinex theory of color vision scientific american gijsenij gevers and van de weijer generalized gamut mapping using image derivative structures for color constancy ijcv van de weijer gevers and gijsenij color constancy ieee trans image chakrabarti hirakawa and zickler color constancy with statistics pami joze and drew color constancy and multiple illumination pami li xiong and xu supervised combination strategy for illumination chromaticity estimation acm trans appl lu gijsenij gevers nedovic and xu color constancy using scene geometry in iccv bianco gasparini and schettini 
framework for illuminant chromaticity estimation electron
bianco ciocca cusano and schettini automatic color constancy algorithm selection and combination pattern recognition
srivastava hinton krizhevsky sutskever and salakhutdinov dropout a simple way to prevent neural networks from overfitting jmlr
chong gortler and zickler the von kries hypothesis and a basis for color constancy in proc iccv
gehler rother blake minka and sharp bayesian color constancy revisited in cvpr
shi and funt version of the gehler color constancy dataset of images accessed from http
judd macadam wyszecki budde condit henderson and simonds spectral distribution of typical daylight as a function of correlated color temperature josa
chakrabarti and zickler statistics of hyperspectral images in proc cvpr
li xiong hu and funt evaluating combinational illumination estimation methods on images ieee trans imag data at http
bianco cusano and schettini color constancy using cnns
forsyth a novel algorithm for color constancy ijcv
xiong and funt estimating illumination chromaticity via support vector regression imag sci
gijsenij gevers and van de weijer computational color constancy survey and experiments ieee trans image data at http
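As a concrete illustration of the inference rule described above, the following numpy sketch aggregates per-pixel log-beliefs over a candidate illuminant set and also computes the standard angular-error metric. All names here (the belief table, luminance bin edges, chromaticity-binning function) are illustrative placeholders rather than the authors' released code, and the final argmax over candidates is a simplification of the paper's expected-angular-error minimiser.

```python
import numpy as np

def angular_error_deg(m_true, m_est):
    """Angular error in degrees between true and estimated illuminant vectors."""
    c = np.dot(m_true, m_est) / (np.linalg.norm(m_true) * np.linalg.norm(m_est))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def estimate_illuminant(pixels, candidates, belief, lum_bin_edges, chrom_bin_of,
                        log_prior, alpha=1.0, beta=1.0):
    """Simplified global illuminant estimate by aggregating per-pixel beliefs.

    pixels        : (N, 3) observed linear RGB values
    candidates    : (M, 3) candidate illuminant chromaticities (unit vectors)
    belief        : (n_lum_bins, n_chrom_bins) table of log-beliefs L[y, x]
    lum_bin_edges : bin edges for the median-normalised luminance
    chrom_bin_of  : maps a chromaticity unit vector to a chromaticity bin index
    log_prior     : (M,) log-prior b_i over candidate illuminants
    """
    lum = np.linalg.norm(pixels, axis=1)
    lum = lum / np.median(lum)                                  # global normalisation
    y_bin = np.clip(np.digitize(lum, lum_bin_edges), 0, belief.shape[0] - 1)

    scores = np.zeros(len(candidates))
    for i, m in enumerate(candidates):
        x = pixels / m                                          # true colour implied by m
        x = x / np.linalg.norm(x, axis=1, keepdims=True)        # its chromaticity
        x_bin = np.array([chrom_bin_of(xi) for xi in x])
        scores[i] = alpha * belief[y_bin, x_bin].mean() + beta * log_prior[i]

    p = np.exp(scores - scores.max())
    p /= p.sum()                                                # distribution over candidates
    return candidates[np.argmax(p)], p
```

Because each candidate's score is just a sum of table lookups over pixels, the per-candidate cost is linear in the image size, consistent with the remark above that computing each aggregate belief only requires adding entries from the learned table.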
fast and accurate inference of models lucas maystre epfl matthias grossglauser epfl abstract we show that the ml estimate of models derived from luce choice axiom the model can be expressed as the stationary distribution of markov chain this conveys insight into several recently proposed spectral inference algorithms we take advantage of this perspective and formulate new spectral algorithm that is significantly more accurate than previous ones for the model with simple adaptation this algorithm can be used iteratively producing sequence of estimates that converges to the ml estimate the ml version runs faster than competing approaches on benchmark of five datasets our algorithms are easy to implement making them relevant for practitioners at large introduction aggregating pairwise comparisons and partial rankings are important problems with applications in econometrics psychometrics sports ranking and multiclass classification one possible approach to tackle these problems is to postulate statistical model of discrete choice in this spirit luce stated the choice axiom in foundational work published over fifty years ago denote the probability of choosing item when faced with alternatives in the set given two items and and any two sets of alternatives and containing and the axiom posits that in other words the odds of choosing item over item are independent of the rest of the alternatives this simple assumption directly leads to unique parametric choice model known as the terry model in the case of pairwise comparisons and the model in the generalized case of rankings in this paper we highlight connection between the ml estimate under these models and the stationary distribution of markov chain parametrized by the observed choices markov chains were already used in recent work to aggregate pairwise comparisons and rankings these approaches reduce the problem to that of finding stationary distribution by formalizing the link between the likelihood of observations under the choice model and certain markov chain we unify these algorithms and explicate them from an ml inference perspective we will also take detour and use this link in the reverse direction to give an alternative proof to recent result on the error rate of the ml estimate by using spectral analysis techniques beyond this we make two contributions to statistical inference for this model first we develop simple consistent and computationally efficient spectral algorithm that is applicable to wide range of models derived from the choice axiom the exact formulation of the markov chain used in the algorithm is distinct from related work and achieves significantly better statistical efficiency at no additional computational cost second we observe that with small adjustment the algorithm can be used iteratively and it then converges to the ml estimate an evaluation on five datasets reveals that it runs consistently faster than competing approaches and has much more predictable performance that does not depend on the structure of the data the key step finding stationary distribution can be offloaded to commonly available primitives which makes our algorithms scale well our algorithms are intuitively pleasing simple to understand and implement and they outperform the state of the art hence we believe that they will be highly useful to practitioners the rest of the paper is organized as follows we begin by introducing some notations and presenting few useful facts about the choice model and about markov chains by necessity our exposition 
is succinct and the reader is encouraged to consult luce and levin et al for more thorough exposition in section we discuss related work in section we present our algorithms and in section we evaluate them on synthetic and data we conclude in section discrete choice model denote by the number of items luce choice axiom implies that each itempi can be parametrized by positive strength πi such that πi πj for any containing the strengths πi are defined up to multiplicative factor for identifiability we let πi an alternative parametrization of the model is given by θi log πi in which case the model is sometimes referred to as conditional logit markov chain theory we represent finite stationary markov chain by directed graph where is the set of states and is the set of transitions with positive rate if is strongly connected the markov chain is said to be ergodic and admits unique stationary distribution the global balance equations relate the transition rates λij to the stationary distribution as follows πi λij πj λji the stationary distribution is therefore invariant to changes in the time scale to rescaling of the transition rates in the supplementary file we briefly discuss how to find given λij related work spectral methods applied to ranking and scoring items from noisy choices have history to the best of our knowledge saaty is the first to suggest using the leading eigenvector of matrix of inconsistent pairwise judgments to score alternatives two decades later page et al developed pagerank an algorithm that ranks web pages according to the stationary distribution of random walk on the hyperlink graph in the same vein dwork et al proposed several variants of markov chains for aggregating heterogeneous rankings the idea is to construct random walk that is biased towards items and use the ranking induced by the stationary distribution more recently negahban et al presented rank centrality an algorithm for aggregating pairwise comparisons close in spirit to that of when the data is generated under the model this algorithm asymptotically recovers model parameters with only log pairwise comparisons for the more general case of rankings under the model azari soufiani et al propose to break rankings into pairwise comparisons and to apply an algorithm similar to rank centrality they show that the resulting estimator is statistically consistent interestingly many of these spectral algorithms can be related to the method of moments broadly applicable alternative to estimation the history of algorithms for inference under luce model goes back even further in the special case of pairwise comparisons the same iterative algorithm was independently discovered by zermelo ford and dykstra much later this algorithm was explained by hunter as an instance of mm algorithm and extended to the more general choice model today hunter mm algorithm is the de facto standard for ml inference in luce model as the likelihood can be written as concave function optimization procedures such as the method can also be used although they have been been reported to be slower and less practical recently kumar et al looked at the problem of finding the transition matrix of markov chain given its stationary distribution the problem of inferring luce model parameters from data can be reformulated in their framework and the ml estimate is the solution to the inversion of the stationary distribution their work stands out as the first to link ml inference to markov chains albeit very differently from the way presented in our paper 
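All of the spectral approaches reviewed above, as well as the algorithms developed in the next section, reduce to the primitive recalled in the Markov-chain background: finding the stationary distribution of an ergodic chain from its transition rates. Below is a minimal dense sketch of that primitive using uniformisation followed by the power method, which is one of the implementation options the paper mentions; the tolerances and the toy example are illustrative only, not the authors' code.

```python
import numpy as np

def stationary_distribution(rates, tol=1e-10, max_iter=100_000):
    """Stationary distribution of an ergodic Markov chain given its rate matrix.

    rates[i, j] is the transition rate from state i to state j (diagonal ignored).
    Uniformisation turns the rates into a row-stochastic transition matrix with the
    same stationary distribution, which is then found by power iteration.
    """
    Q = np.array(rates, dtype=float)
    np.fill_diagonal(Q, 0.0)
    out = Q.sum(axis=1)
    gamma = out.max() * (1.0 + 1e-12)          # uniformisation constant
    P = Q / gamma
    np.fill_diagonal(P, 1.0 - out / gamma)     # row-stochastic, strictly positive diagonal

    pi = np.full(Q.shape[0], 1.0 / Q.shape[0])
    for _ in range(max_iter):
        new = pi @ P
        if np.abs(new - pi).sum() < tol:
            pi = new
            break
        pi = new
    return pi / pi.sum()

# toy 3-state chain
rates = np.array([[0.0, 2.0, 1.0],
                  [1.0, 0.0, 3.0],
                  [4.0, 1.0, 0.0]])
print(stationary_distribution(rates))
```

In a sparse regime the same iteration can be carried out with sparse matrix-vector products, as noted later in the paper when discussing the running time of the proposed algorithms.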
beyond algorithms properties of the estimator in this model were studied extensively hajek et al consider the model for rankings they give an upper bound to the estimation error and show that the ml estimator is in summary they show that only log samples are enough to drive the error down to zero as increases rajkumar and agarwal consider the model for pairwise comparisons they show that the ml estimator is able to recover the correct ranking even when the data is generated as per another model thurstone as long as condition is satisfied we also mention that as an alternative to likelihood maximization bayesian inference has also been proposed caron and doucet present gibbs sampler and guiver and snelson propose an approximate inference algorithm based on expectation propagation in this work we provide unifying perspective on recent advances in spectral algorithms from estimation perspective it turns out that this perspective enables us to make contributions on both sides on the one hand we develop an improved and more general spectral ranking algorithm and on the other hand we propose faster procedure for ml inference by using this algorithm iteratively algorithms we begin by expressing the ml estimate under the choice model as the stationary distribution of markov chain we then take advantage of this formulation to propose novel algorithms for model inference although our derivation is made in the general choice model we will also discuss implications for the special cases of pairwise data in section and ranking data in section suppose that we collect independent observations in the multiset each observation consists of choice among set of alternatives we say that wins over and denote by whenever and we define the directed comparison graph as gd with and if wins at least once over in in order to ensure that the ml estimate is we make the standard assumption that gd is strongly connected in practice if this assumption does not hold we can consider each strongly connected component separately ml estimate as stationary distribution for simplicity we denote the model parameter associated with item by the of parameters given observations is log log πj for each item we define two sets of indices let wi and li be the indices of the observations where item wins over and loses against the alternatives respectively the is not concave in it can be made strictly concave using simple reparametrization but we briefly show in the supplementary material that it admits unique stationary point at the ml estimate the optimality condition log implies log in order to go from to we multiply by and rearrange the terms to simplify the notation let us further introduce the function πi which takes observations and an instance of model parameters and returns real number let di wi lj the set of observations where wins over then can be rewritten as dj di algorithm luce spectral ranking require observations for do for do λji λji end for end for stat dist of markov chain return algorithm iterative luce spectral ranking require observations repeat for do for dop λji λji πt end for end for stat dist of markov chain until convergence this formulation conveys new viewpoint on the ml estimate it is easy to recognize the global balance equations of markov chain on states representing the items with transition rates λji di and stationary distribution these transition rates have an interesting interpretation di is the count of how many times wins over weighted by the strength of the alternatives at this point it is useful to 
observe that for any parameters di if and otherwise combined with the assumption that gd is strongly connected it follows that any parametrizes the transition rates of an ergodic homogeneous markov chain the ergodicity of the inhomogeneous markov chain where the transition rates are constantly updated to reflect the current distribution over states is shown by the following theorem theorem the markov chain with inhomogeneous transition rates λji di converges to the estimate for any initial distribution in the open probability simplex proof sketch by is the unique invariant distribution of the markov chain in the supplementary file we look at an equivalent uniformized chain using the contraction mapping principle one can show that this chain converges to the invariant distribution approximate and exact ml inference we approximate the markov chain described in by considering priori that all alternatives have equal strength that is we set the transition rates λji di by fixing to for the contribution of winning over to the rate of transition λji is in other words for each observation the winning item is rewarded by fixed amount of incoming rate that is evenly split across the alternatives the chunk allocated to itself is discarded we interpret the stationary distribution as an estimate of model parameters algorithm summarizes this procedure called luce spectral ranking lsr if we consider growing number of observations lsr converges to the true model parameters even in the restrictive case where the sets of alternatives are fixed theorem let be collection of sets of alternatives such that for any partition of into two sets and let be the number of choices observed over alternatives then as proof sketch the condition on ensures that asymptotically gd is strongly connected let be shorthand for we can show that if items and are compared in at least one set of alternatives the ratio of transition rates satisfies λij it follows that in the limit of the stationary distribution is rigorous proof is given in the supplementary file starting from the lsr estimate we can iteratively refine the transition rates of the markov chain and obtain sequence of estimates by the only fixed point of this iteration is the ml estimate we call this procedure and describe it in algorithm lsr or one iteration of entails filling matrix of weighted pairwise counts and finding stationary distribution let and let be the running time of finding the stationary distribution then lsr has running time as comparison one iteration of this is equivalent to stating that the hypergraph is connected the mm algorithm is finding the stationary distribution can be implemented in different ways for example in sparse regime where the stationary distribution can be found with the power method in few sparse matrix multiplications in the supplementary file we give more details about possible implementations in practice whether or turns out to be dominant in the running time is not foregone conclusion aggregating pairwise comparisons special case of luce choice model occurs when all sets of alternatives contain exactly two items when the data consists of pairwise comparisons this model was proposed by zermelo and later by bradley and terry as the stationary distribution is invariant to changes in the we can rescale the transition rates and set λji when using lsr on pairwise data let be the set containing the pairs of items that have been compared at least once in the case where each pair has been compared exactly times lsr is strictly 
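A minimal Python sketch of the two procedures just described, LSR and I-LSR, is given below. The (winner, alternatives) input format and the helper names are our own illustrative choices, not the authors' reference implementation; the only assumptions are those stated in the text (strong connectivity of the comparison graph, and the transition-rate construction).

```python
import numpy as np

def stationary_distribution(rates):
    """Stationary distribution of a continuous-time Markov chain whose
    off-diagonal transition rates are rates[j, i] (from state j to state i)."""
    n = rates.shape[0]
    q = rates - np.diag(rates.sum(axis=1))      # generator matrix, rows sum to zero
    a = np.vstack([q.T, np.ones(n)])            # pi @ q = 0, plus the normalization row
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(a, b, rcond=None)
    return pi

def lsr(n_items, observations):
    """Single-pass Luce Spectral Ranking sketch: all strengths fixed to one, so each
    observation hands its winner an evenly split unit of incoming rate."""
    rates = np.zeros((n_items, n_items))
    for winner, alternatives in observations:
        w = 1.0 / len(alternatives)
        for loser in alternatives:
            if loser != winner:
                rates[loser, winner] += w
    return stationary_distribution(rates)

def ilsr(n_items, observations, n_iter=100):
    """Iterative refinement sketch: the transition rates are reweighted by the
    current estimate; the fixed point of this iteration is the ML estimate."""
    pi = np.full(n_items, 1.0 / n_items)
    for _ in range(n_iter):
        rates = np.zeros((n_items, n_items))
        for winner, alternatives in observations:
            w = 1.0 / sum(pi[a] for a in alternatives)
            for loser in alternatives:
                if loser != winner:
                    rates[loser, winner] += w
        pi = stationary_distribution(rates)
    return pi
```

The first function returns the one-shot spectral estimate; the second iterates it until the weighted chain's invariant distribution no longer changes, which is the ML estimate described above.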
equivalent to formulation of rank centrality in fact our derivation justifies rank centrality as an approximate ml inference algorithm for the model furthermore we provide principled extension of rank centrality to the case where the number of comparisons observed is unbalanced rank centrality considers transition rates proportional to the ratio of wins whereas justifies making transition rates proportional to the count of wins negahban et al also provide an upper bound on the error rate of rank centrality which essentially shows that it is because the two estimators are equivalent in the setting of balanced pairwise comparisons the bound also applies to lsr more interestingly the expression of the ml estimate as stationary distribution enables us to reuse the same analytical techniques to bound the error of the ml estimate in the supplementary file we therefore provide an alternative proof of the recent result of hajek et al on the of the ml estimate aggregating partial rankings another case of interest is when observations do not consist of only single choice but of ranking over the alternatives we now suppose observations consisting of rankings for conciseness we suppose that is the same for all observations let one such observation be where is the item with rank luce and later plackett independently proposed model of rankings where πσ pk πσ in this model ranking can be interpreted as sequence of independent choices choose the first item then choose the second among the remaining alternatives etc with this point of view in mind lsr and can easily accommodate data consisting of rankings by decomposing the observations into choices azari soufiani et al provide class of consistent estimators for the model using the idea of breaking rankings into pairwise comparisons although they explain their algorithms from perspective it is straightforward to reinterpret their estimators as stationary distributions of particular markov chains in fact for their algorithm is identical to lsr when however breaking ranking into pairwise comparisons implicitly makes the incorrect assumption that these comparisons are statistically independent the markov chain that lsr builds breaks rankings into pairwise rate contributions but weights the contributions differently depending on the rank of the winning item in section we show that this weighting turns out to be crucial our approach yields significant improvement in statistical efficiency yet keeps the same attractive computational cost and ease of use applicability to other models several other variants and extensions of luce choice model have been proposed for example rao and kupper extend the model to the case where comparison between two items can result in tie in the supplementary file we show that the ml estimate in the model can also be formulated as stationary distribution and we provide corresponding adaptations of lsr and we believe that our algorithms can be generalized to further models that are based on the choice axiom however this axiom is key and other choice models such as thurstone do not admit the interpretation we derive here experimental evaluation in this section we compare lsr and to other inference algorithms in terms of statistical efficiency and empirical performance in order to understand the efficiency of the estimators we generate synthetic data from known ground truth then we look at five datasets and investigate the practical performance of the algorithms in terms of accuracy running time and convergence rate error metric as the 
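As a quick illustration of the pairwise special case just discussed, pairwise comparisons can be fed to the sketches above as two-element alternative sets. The data below is a made-up toy example, not from any of the datasets in this paper:

```python
# Toy pairwise data (hypothetical): (winner, alternatives) with two-element sets.
pairwise = [(0, (0, 1)), (0, (0, 1)), (1, (0, 1)),
            (0, (0, 2)), (2, (1, 2)), (2, (1, 2))]
pi_spectral = lsr(3, pairwise)   # one-shot estimate; rates proportional to win counts
pi_ml = ilsr(3, pairwise)        # iterates to the ML estimate
```

With all strengths fixed to one, each win contributes the same amount of rate, so the transition rates are proportional to raw win counts; this is exactly the unbalanced-data behaviour contrasted with Rank Centrality's win ratios above.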
probability of winning over depends on the ratio of strengths πi the strengths are typically logarithmically spaced in order to evaluate the accuracy of an estimate to ground truth parameters we therefore use log transformation reminiscent of the theoretic formulation of the choice model define log πi with chosen such that θi we will consider the error rmse erms kθ statistical efficiency to assess the statistical efficiency of lsr and other algorithms we follow the experimental procedure of hajek et al we consider items and draw uniformly at random in we generate full rankings over the items from model parametrized with eθi for given we break down each of the full rankings as follows first we partition the items into subsets of size uniformly at random then we store the rankings induced by the full ranking on each of those subsets as result we obtain statistically independent partial rankings for given estimator this data produces an estimate for which we record the error to we consider four estimators the first two lsr and ml work on the ranking data directly remaining two follow azari soufiani et al who suggest breaking down rankings into pairwise comparisons these comparisons are then used by lsr resulting in azari soufiani et estimator and by an ml estimator in short the four estimators vary according to whether they use rankings or derived comparisons and whether the model is fitted using an approximate spectral algorithm or using exact maximum likelihood figure plots erms for increasing sizes of partial rankings as well as lower bound to the error of any estimator for the model see hajek et al for details we observe that breaking the rankings into pairwise comparisons estimators incurs significant efficiency loss over using the rankings directly lsr and ml we conclude that by correctly weighting pairwise rates in the markov chain lsr distinctly outperforms the approach as increases we also observe that the ml estimate is always more efficient spectral estimators such as lsr provide quick asymptotically consistent estimate of parameters but this observation justifies calling them approximate inference algorithms empirical performance we investigate the performance of various inference algorithms on five datasets the nascar and sushi datasets contain multiway partial rankings the youtube gifgif and chess contain pairwise comparisons among those the chess dataset is particular in that it features of ties in this case we use the extension of the model proposed by rao and kupper we preprocess each dataset by discarding items that are not part of the largest strongly connected component in the comparison graph the number of items the number of rankings as well as the size of partial rankings for each dataset are given in table additional details on the experimental setup are given in the supplementary material we first compare the estimates produced by three approximate ml inference algorithms lsr and rank centrality rc note that rc applies only to pairwise comparisons and that lsr is the only see https http and https rmse lower bound ml lsr figure statistical efficiency of different estimators for increasing sizes of partial rankings as grows breaking rankings into pairwise comparisons becomes increasingly inefficient lsr remains efficient at no additional computational cost algorithm able to infer the parameters in the model also note that in the case of pairwise comparisons and lsr are strictly equivalent in table we report the deviation to the ml estimate and the running time of the 
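The log-transformed error metric described at the start of this section can be written as a short helper. Recentring each log-strength vector to zero mean is our reading of the (garbled) normalisation constraint on the theta values, so treat that detail as an assumption:

```python
import numpy as np

def rms_error(pi_hat, pi_star):
    """E_RMS between estimated and true strengths after the log transformation,
    with each theta vector recentred so that its entries sum to zero."""
    theta_hat = np.log(pi_hat) - np.log(pi_hat).mean()
    theta_star = np.log(pi_star) - np.log(pi_star).mean()
    return np.sqrt(np.mean((theta_hat - theta_star) ** 2))
```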
algorithm.

Table 2: performance of approximate ML inference algorithms (LSR, the rank-breaking estimator of Azari Soufiani et al., and Rank Centrality) on the NASCAR, sushi, YouTube, GIFGIF and chess datasets: E_RMS to the ML estimate and running time, with the smallest E_RMS highlighted in bold for each dataset (numerical values not reproduced here).

We observe that in the case of multiway partial rankings, LSR is almost four times more accurate than the rank-breaking estimator on the datasets considered. In the case of pairwise comparisons, RC is slightly worse than LSR, because the number of comparisons per pair is not homogeneous (see the discussion of pairwise comparisons above). The running time of the three algorithms is comparable.

Next we turn our attention to ML inference and consider three iterative algorithms: I-LSR, MM and Newton-Raphson. Each algorithm is initialized with a uniform estimate, and convergence is declared when E_RMS to the ML estimate falls below a small fixed threshold. In Table 3 we report the number of iterations needed to reach convergence, as well as the total running time of each algorithm.

Table 3: performance of iterative ML inference algorithms (I-LSR, MM and Newton-Raphson) on the same five datasets, together with the spectral gap of each comparison graph; the smallest total running time is highlighted in bold for each dataset (numerical values not reproduced here).

We observe that Newton-Raphson does not always converge, despite the reparametrized log-likelihood being strictly concave; on the NASCAR dataset this has also been noted by Hunter. Computing the Newton step appears to be severely ill-conditioned for many datasets; we believe that this can be addressed by a careful choice of starting point and step size. I-LSR consistently outperforms MM and Newton-Raphson in running time: even if the average running time per iteration is in general larger than that of MM, it needs considerably fewer iterations, and for the YouTube dataset I-LSR yields an increase in speed of up to two orders of magnitude. The slow convergence of MM algorithms is known, yet the scale of the issue and its apparent unpredictability is surprising. In Hunter's MM algorithm, updating the parameter of one item involves only the parameters of items to which it has been compared; therefore we speculate that the convergence rate of MM depends on the expansion properties of the comparison graph G_D. As an illustration, we consider the sushi dataset. To quantify the expansion properties, we look at the spectral gap of a simple random walk on G_D: intuitively, the larger the spectral gap, the better the expansion properties. The original comparison graph is almost complete and has a large spectral gap; by breaking each ranking into independent pairwise comparisons, we effectively sparsify the comparison graph, and as a result the spectral gap decreases. We plot the convergence rate of MM and I-LSR for the original and modified datasets, and observe that both algorithms display linear convergence; however, the rate at which MM converges appears to be sensitive to the structure of the comparison graph, whereas I-LSR is robust to changes in the structure. The spectral gap of each dataset is listed in Table 3.

Figure (RMSE against iteration count): convergence rate of I-LSR and MM on the sushi dataset. When partial rankings are broken down into independent comparisons, the comparison graph becomes sparser; I-LSR is robust to this change, whereas the convergence rate of MM significantly decreases.

Conclusion. In this paper we develop a Markov-chain perspective on the maximum-likelihood estimate of Luce's choice model. This perspective explains and unifies several recent spectral algorithms from an ML inference point of view. We present our own spectral algorithm that works on a wider range of data, and show that the resulting estimate significantly outperforms previous approaches in terms of accuracy. We also show that this simple algorithm, with a straightforward adaptation, can produce a sequence of estimates that converge to the ML estimate. On real-world datasets, our ML algorithm is always faster than the state of the art, at times by up to two orders of magnitude. Beyond statistical and
computational performance we believe that key strength of our algorithms is that they are simple to implement as an example our implementation of lsr fits in ten lines of python code the most complex stationary be readily offloaded to commonly available and highly optimized primitives as such we believe that our work is very useful for practitioners acknowledgments we thank holly stratis ioannidis ksenia konyushkova and brunella spinelli for careful proofreading and comments on the text of starting point step size or by monitoring the numerical stability however these modifications are and impose an additional burden on the practitioner references mcfadden conditional logit analysis of qualitative choice behavior in zarembka editor frontiers in econometrics pages academic press thurstone the method of paired comparisons for social values journal of abnormal and social psychology bradley and terry rank analysis of incomplete block designs the method of paired comparisons biometrika plackett the analysis of permutations journal of the royal statistical society series applied statistics elo the rating of chess players past present arco hastie and tibshirani classification by pairwise coupling the annals of statistics luce individual choice behavior theoretical analysis wiley dwork kumar naor and sivakumar rank aggregation methods for the web in proceedings of the international conference on world wide web www hong kong china negahban oh and shah iterative ranking from comparisons in advances in neural information processing systems nips lake tahoe ca azari soufiani chen parkes and xia generalized for rank aggregation in advances in neural information processing systems nips lake tahoe ca hajek oh and xu inference from partial rankings in advances in neural information processing systems nips montreal qc canada levin peres and wilmer markov chains and mixing times american mathematical society saaty the analytic hierarchy process planning priority setting resource allocation page brin motwani and winograd the pagerank citation ranking bringing order to the web technical report stanford university zermelo die berechnung der als ein maximumproblem der wahrscheinlichkeitsrechnung mathematische zeitschrift ford jr solution of ranking problem from binary comparisons the american mathematical monthly dykstra rank analysis of incomplete block designs method of paired comparisons employing unequal repetitions on pairs biometrics hunter mm algorithms for generalized models the annals of statistics kumar tomkins vassilvitskii and vee inverting in proceedings of the international conference on web search and data mining wsdm pages rajkumar and agarwal statistical convergence perspective of algorithms for rank aggregation from pairwise data in proceedings of the international conference on machine learning icml beijing china caron and doucet efficient bayesian inference for generalized models journal of computational and graphical statistics guiver and snelson bayesian inference for ranking models in proceedings of the international conference on machine learning icml montreal canada rao and kupper ties in experiments generalization of the model journal of the american statistical association kamishima and akaho efficient clustering for orders in mining complex data pages springer 
Probabilistic Line Searches for Stochastic Optimization

Maren Mahsereci and Philipp Hennig, Max Planck Institute for Intelligent Systems, Spemannstraße, Germany

Abstract. In deterministic optimization, line searches are a standard tool ensuring stability and efficiency. Where only stochastic gradients are available, no direct equivalent has so far been formulated, because uncertain gradients do not allow for a strict sequence of decisions collapsing the search space. We construct a probabilistic line search by combining the structure of existing deterministic methods with notions from Bayesian optimization. Our method retains a Gaussian process surrogate of the univariate optimization objective and uses a probabilistic belief over the Wolfe conditions to monitor the descent. The algorithm has very low computational cost and no free parameters. Experiments show that it effectively removes the need to define a learning rate for stochastic gradient descent.

Introduction. Stochastic gradient descent (SGD) is currently the standard in machine learning for the optimization of highly multivariate functions if their gradient is corrupted by noise. This includes the online or batch training of neural networks, logistic regression and variational models. In all these cases, noisy gradients arise because an exchangeable loss function of the optimization parameters, defined across a large dataset, is evaluated only on a subset of the data points. If the subset indices are random draws, then by the central limit theorem the error is unbiased and approximately normally distributed.

Despite its popularity and its low cost per step, SGD has deficiencies that can make it inefficient, or at least tedious to use in practice. Two main issues are that, first, the gradient itself, even without noise, is not the optimal search direction; and second, SGD requires a step size (learning rate) that has a drastic effect on the algorithm's efficiency, is often difficult to choose well, and is virtually never optimal for each individual descent step. The former issue, adapting the search direction, has been addressed by many authors (see the references for an overview). Existing approaches range from lightweight diagonal preconditioning approaches like ADAGRAD, to empirical estimates for the natural gradient or the Newton direction, to further algorithms providing more elaborate estimates of the Newton direction. Most of these algorithms also include an auxiliary adaptive effect on the learning rate, and Schaul et al. recently provided an estimation method to explicitly adapt the learning rate from one gradient descent step to another. None of these algorithms change the size of the current descent step. Accumulating statistics across steps in this fashion requires some conservatism: if the step size is initially too large or grows too fast, SGD can become unstable and explode, because individual steps are not checked for robustness at the time they are taken.

Figure 1 (axes: function value f(t) over distance t in the line search direction): sketch of a classic line search. The task is to tune the step taken by an optimization algorithm along a univariate search direction. The search starts at the endpoint t = 0 of the previous line search. A sequence of exponentially growing extrapolation steps finds a point of positive gradient; it is followed by interpolation steps until an acceptable point is found. Points of insufficient decrease, lying above the line f(0) + c1 t f'(0) (gray area), are excluded by the Armijo condition, while points of steep gradient (orange areas) are excluded by the curvature condition (weak Wolfe conditions in solid orange, strong extension in a lighter tone). In the sketch, one point is the first to fulfil both conditions and
is thus accepted the principally same problem exists in deterministic optimization problems there providing stability is one of several tasks of the line search subroutine it is standard constituent of algorithms like the classic nonlinear conjugate gradient and bfgs methods in the case line searches are considered solved problem but the methods used in deterministic optimization are not stable to noise they are easily fooled by even small disturbances either becoming overly conservative or failing altogether the reason for this brittleness is that existing line searches take sequence of hard decisions to shrink or shift the search space this yields efficiency but breaks hard in the presence of noise section constructs probabilistic line search for noisy objectives stabilizing optimization methods like the works cited above as line searches only change the length not the direction of step they could be used in combination with the algorithms adapting sgd direction cited above the algorithm presented below is thus complement not competitor to these methods connections deterministic line searches there is host of existing line search variants in essence though these methods explore univariate domain to the right of starting point until an acceptable point is reached figure more precisely consider the problem of minimizing rd with access to rd rd at iteration some outer loop chooses at location xi search direction si rd by the bfgs rule or simply si xi for gradient descent it will not be assumed that si has unit norm the line search operates along the univariate domain xi tsi for along this direction it collects scalar function values and projected gradients that will be denoted and most line searches involve an initial extrapolation phase to find point tr with tr this is followed by search in tr by interval nesting or by interpolation of the collected function and gradient values with cubic the wolfe conditions for termination as the line search is only an auxiliary step within larger iteration it need not find an exact root of it suffices to find point sufficiently close to minimum the wolfe conditions are widely accepted formalization of this notion they consider acceptable if it fulfills tf and using two constants chosen by the designer of the line search not the user is the armijo or sufficient decrease condition it encodes that acceptable functions values should lie below linear extrapolation line of slope is the curvature condition demanding in these algorithms another task of the line search is to guarantee certain properties of surrounding estimation rule in bfgs it ensures positive definiteness of the estimate this aspect will not feature here this is the strategy in by rasmussen which provided model for our implementation at the time of writing it can be found at http àá pwolfe pb pa weak strong distance in line search direction figure sketch of probabilistic line search as in fig the algorithm performs extrapolation and interpolation but receives unreliable noisy function and gradient values these are used to construct gp posterior top solid posterior mean thin lines at standard deviations local pdf marginal as shading three dashed sample paths this implies bivariate gaussian belief over the validity of the weak wolfe conditions middle three plots pa is the marginal for pb for their correlation points are considered acceptable if their joint probability pwolfe bottom is above threshold gray an approximation to the strong wolfe conditions is shown dashed decrease in slope the 
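For reference, the two termination conditions discussed here take the standard form below, where f denotes the univariate objective along the search direction and 0 < c1 < c2 < 1 are the designer-chosen constants (a reconstruction of the garbled formulas, consistent with the surrounding description):

\[
f(t) \;\le\; f(0) + c_1\, t\, f'(0) \qquad \text{(sufficient decrease / Armijo condition)}
\]
\[
f'(t) \;\ge\; c_2\, f'(0) \quad \text{(weak curvature condition)}, \qquad
|f'(t)| \;\le\; c_2\, |f'(0)| \quad \text{(strong variant)}
\]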
choice accepts any value below while rejects all points for convex functions for the curvature condition only accepts points with while accepts any point of greater slope than and are known as the weak form of the wolfe conditions the strong form replaces with this guards against accepting points of low function value but large positive gradient figure shows conceptual sketch illustrating the typical process of line search and the weak and strong wolfe conditions the exposition in will initially focus on the weak conditions which can be precisely modeled probabilistically section then adds an approximate treatment of the strong form bayesian optimization recently blossoming approach to global optimization revolves around modeling the objective with probability measure usually gaussian process gp searching for extrema evaluation points are then chosen by utility functional our line search borrows the idea of gaussian process surrogate and popular utility expected improvement bayesian optimization methods are often computationally expensive thus for task like line search but since line searches are governors more than information extractors the kind of expected of bayesian optimizer is not needed the following sections develop lightweight algorithm which adds only minor computational overhead to stochastic optimization probabilistic line search we now consider minimizing from eq that is the algorithm can access only noisy function values and gradients yt at location with gaussian likelihood σf yt yt the gaussian form is supported by the central limit argument at eq see regarding estimation of the variances our algorithm has three main ingredients robust yet lightweight gaussian process surrogate on facilitating analytic optimization simple bayesian optimization objective for exploration and probabilistic formulation of the wolfe conditions as termination criterion lightweight gaussian process surrogate we model information about the objective in probability measure there are two requirements on such measure first it must be robust to irregularity of the objective and second it must allow analytic computation of discrete candidate points for evaluation because line search should not call yet another optimization subroutine itself both requirements are fulfilled by wiener process gaussian process prior gp with covariance function here and denote shift by constant this ensures this kernel is positive the precise value is irrelevant as the algorithm only considers positive values of our implementation uses see regarding the scale with the likelihood of eq this prior gives rise to gp posterior whose mean function is cubic we note in passing that regression on and from observations of pairs yt can be formulated as filter and thus performed in time however since line search typically collects data points generic gp inference using gram matrix has virtually the same low cost because gaussian measures are closed under linear maps eq implies wiener process linear spline model on gp with using the indicator function if else thus min given set of evaluations vectors with elements ti yti with independent likelihood the posterior is gp with posterior mean and covariance and as follows ktt tt ktt ktt tt tt tt the posterior marginal variance will be denoted by to see that is indeed piecewise cubic cubic spline we note that it has at most three because this piecewise cubic form of is crucial for our purposes having collected values of and respectively all local minima of can be found analytically in time in 
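Because the posterior mean is piecewise cubic, each cell between adjacent evaluation points can be minimised in closed form. A minimal sketch of that per-cell computation (our own helper, assuming the cell's cubic is given by its leading coefficients; the constant term plays no role):

```python
import numpy as np

def cubic_min_in_cell(a, b, c, t_lo, t_hi):
    """Interior minimiser of a*t**3 + b*t**2 + c*t + d on (t_lo, t_hi), or None."""
    for r in np.roots([3.0 * a, 2.0 * b, c]):        # stationary points of the cubic
        if np.isreal(r):
            t = float(r.real)
            if t_lo < t < t_hi and 6.0 * a * t + 2.0 * b > 0.0:   # second-derivative test
                return t
    return None
```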
single sweep through the cells ti here denotes the start location where are inherited from the preceding line search for typical line searches in each cell is cubic polynomial with at most one minimum in the cell found by trivial quadratic computation from the three scalars ti ti ti this is in contrast to other gp regression example the one arising from gaussian give more involved posterior means whose local minima can be found only approximately another advantage of the cubic spline interpolant is that it does not assume the existence of higher derivatives in contrast to the gaussian kernel for example and thus reacts robustly to irregularities in the objective in our algorithm after each evaluation of yn yn we use this property to compute short list of candidates for the next evaluation consisting of the local minimizers of and one additional extrapolation node at tmax where tmax is the currently largest evaluated and is an extrapolation step size starting at and doubled after each extrapolation step choosing among candidates the previous section described the construction of discrete candidate points for the next evaluation to decide at which of the candidate points to actually call and we make use of popular utility from bayesian optimization expected improvement is the expected amount eq can be generalized to the natural spline removing the need for the constant however this notion is in the case of single observation which is crucial for the line search there is no probabilistic belief over and higher paths of the wiener process are almost surely almost everywhere but is always member of the reproducing kernel hilbert space induced by thus piecewise cubic σf σf σf σf σf σf σf σf σf σf pwolfe constraining extrapolation interpolation immediate accept high noise interpolation figure curated snapshots of line searches from mnist experiment showing variability of the objective shape and the decision process top row gp posterior and evaluations bottom row approximate pwolfe over strong wolfe conditions accepted point marked red under the gp surrogate by which the function might be smaller than current best value we set ti where ti are observed locations uei ep ft min erf exp the next evaluation point is chosen as the candidate maximizing this utility multiplied by the probability for the wolfe conditions to be fulfilled which is derived in the following section probabilistic wolfe conditions for termination the key observation for probabilistic extension of and is that they are positivity constraints on two variables at bt that are both linear projections of the jointly gaussian variables and at bt the gp of eq on thus implies at each value of bivariate gaussian distribution aa mt at ct ctab at bt bt mbt ctba ctbb with mat and ctaa ctbb ctab ctba and mbt the quadrant probability pwolfe at bt for the wolfe conditions to hold is an integral over bivariate normal probability ρt wolfe da db pt ma mb ρt taa bb ct with correlation coefficient ρt ctab ctaa ctbb it can be computed efficiently using readily available on laptop one evaluation of pwolfe cost about microseconds each line search requires such calls the line search computes this probability for all evaluation nodes after each evaluation if any of the nodes fulfills the wolfe conditions with pwolfe cw greater than some threshold cw it is accepted and returned if several nodes simultaneously fulfill this requirement the of the lowest is returned section below motivates fixing cw http approximation for strong conditions as noted in section 
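The quadrant probability over the two Wolfe variables reduces to a bivariate normal integral. A small sketch using SciPy is below; it assumes a positive-definite covariance and is not the authors' implementation, which uses a dedicated bivariate-normal routine:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def p_wolfe(m_a, m_b, c_aa, c_bb, c_ab):
    """P(a_t > 0 and b_t > 0) under the bivariate Gaussian belief (mean m, covariance C)."""
    # Inclusion-exclusion: P(a>0, b>0) = 1 - P(a<=0) - P(b<=0) + P(a<=0, b<=0).
    p_a_neg = norm.cdf(0.0, loc=m_a, scale=np.sqrt(c_aa))
    p_b_neg = norm.cdf(0.0, loc=m_b, scale=np.sqrt(c_bb))
    p_both_neg = multivariate_normal(mean=[m_a, m_b],
                                     cov=[[c_aa, c_ab], [c_ab, c_bb]]).cdf([0.0, 0.0])
    return 1.0 - p_a_neg - p_b_neg + p_both_neg
```

A candidate is accepted once this probability exceeds the threshold c_W, as described above.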
deterministic optimizers tend to use the strong wolfe conditions which use and precise extension of these conditions to the probabilistic setting is numerically taxing because the distribution over is requiring customized computations however straightforward variation to captures the spirit of the strong wolfe conditions that large positive derivatives should not be accepted assuming that the search direction is descent direction the strong second wolfe condition can be written exactly as bt the value is bounded to confidence by hence an approximation to the strong wolfe conditions can be reached by replacing the infinite upper integration limit on in eq with mbt ctbb the effect of this adaptation which adds no overhead to the computation is shown in figure as dashed line eliminating as inner loop the line search should not require any tuning by the user the preceding section introduced six undefined parameters cw σf σf we will now show that cw can be fixed by hard design decisions can be eliminated by standardizing the optimization objective within the line search and the noise levels can be estimated at runtime with low overhead for batch objectives of the form in eq the result is algorithm that effectively removes the one most problematic parameter from learning rate design parameters cw our algorithm inherits the wolfe thresholds and from its deterministic ancestors we set and this is standard setting that yields lenient line search one that accepts most descent points the rationale is that the stochastic aspect of sgd is not always problematic but can also be helpful through kind of annealing effect the acceptance threshold cw is new design parameter arising only in the probabilistic setting we fix it to cw to motivate this value first note that in the limit all values cw are equivalent because pwolfe then switches discretely between and upon observation of the function computation left out for space assuming only two evaluations at and and the same fixed noise level on and which then cancels out shows that function values barely fulfilling the conditions can have pwolfe while function values at for with unlucky evaluations both function and gradient values one from true value can achieve pwolfe the choice cw balances the two competing desiderata for precision and recall empirically fig we rarely observed values of pwolfe close to this threshold even at high evaluation noise function evaluation typically either clearly rules out the wolfe conditions or lifts pwolfe well above the threshold scale the parameter of eq simply scales the prior variance it can be eliminated by scaling the optimization objective we set and scale yi yi within the code of the line search this gives and and typically ensures the objective ranges in the single digits across where most line searches take place the division by causes disturbance but this does not seem to have notable empirical effect noise scales σf σf the likelihood requires standard deviations for the noise on both function values σf and gradients σf one could attempt to learn these across several line searches however in exchangeable models as captured by eq the variance of the loss and its gradient can be estimated directly within the batch at low computational approach already advocated by schaul et al we collect the empirical statistics yj and yj where denotes the square and estimate at the beginning of line search from xk xk xk and si xk this amounts to the cautious assumption that noise on the gradient is independent we finally scale the 
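The within-batch variance estimates described here can be computed alongside the usual minibatch gradient. A sketch follows, assuming per-example losses and gradients are available and using the stated independence assumption on gradient coordinates; the variable names are ours:

```python
import numpy as np

def noise_scales(example_losses, example_grads, direction):
    """Estimate the noise std. dev. on the batch loss and on its gradient
    projected onto the search direction (diagonal-covariance assumption).

    example_losses: shape (B,); example_grads: shape (B, D); direction: shape (D,).
    """
    batch_size = example_losses.shape[0]
    var_f = example_losses.var(ddof=1) / batch_size              # variance of the mean loss
    var_grad = example_grads.var(axis=0, ddof=1) / batch_size    # per-coordinate variance
    var_fprime = np.dot(direction ** 2, var_grad)                # project along the direction
    return np.sqrt(var_f), np.sqrt(var_fprime)
```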
two empirical estimates as described in σf σf and ditto for σf the overhead of this estimation is small if the computation of yj itself is more expensive than the summation over in the neural network examples of with their comparably simple the additional steps added only cost overhead to the evaluation of the loss of course this approach requires batch size for batches running averaging could be used instead batches are not necessarily good choice in our experiments for example vanilla sgd with batch size converged faster in time than sgd estimating noise separately for each input dimension captures the often inhomogeneous structure among gradient elements and its effect on the noise along the projected direction for example in deep models gradient noise is typically higher on weights between the input and first hidden layer hence line searches along the corresponding directions are noisier than those along directions affecting weights propagating step sizes between line searches as will be demonstrated in the line search can find good step sizes even if the length of the direction si which is proportional to the learning rate in sgd is since such scale issues typically persist over time it would be wasteful to have the algorithm good scale in each line search instead we propagate step lengths from one iteration of the search to another we set the initial search direction to with some initial learning rate then after each line search ending at xi si the next search direction is set to xi thus the next line search starts its extrapolation at times the step size of its predecessor remark on convergence of sgd with line searches we note in passing that it is straightforward to ensure that sgd instances using the line search inherit the convergence guarantees of sgd putting even an extremely loose bound on the step sizes taken by the line search such that and ensures the line sgd converges in probability experiments our experiments were performed on the problems of training neural net with logistic nonlinearity on the mnist and in both cases the network had hidden units giving optimization problems with and parameters respectively while this may be by contemporary standards it exhibits the stereotypical challenges of stochastic optimization for machine learning since the line search deals with only univariate subproblems the extrinsic dimensionality of the optimization task is not particularly relevant for an empirical evaluation leaving aside the cost of the function evaluations themselves computation cost associated with the line search is independent of the extrinsic dimensionality the central nuisance of sgd is having to choose the learning rate and potentially also schedule for its decrease theoretically decaying learning rate is necessary to guarantee convergence of sgd but empirically keeping the rate constant or only decaying it cautiously often work better fig in practical setting user would perform exploratory experiments say for steps to determine good learning rate and decay schedule then run longer experiment in the best found setting in our networks constant learning rates of and for mnist and respectively achieved the lowest test error after the first steps of sgd we then trained networks with vanilla sgd with and without using the schedule and sgd using the probabilistic line search with ranging across five orders of magnitude on batches of size fig top shows test errors after epochs as function of the initial learning rate error bars based on random across the broad range of 
values the line search quickly identified good step sizes stabilized the training and progressed efficiently reaching test errors similar http and http like other authors we only used the batch of mnist neural net neural net sgd fixed sgd decaying line search test error intial learning rate intial learning rate test error epoch epoch figure top row test error after epochs as function of initial learning rate note logarithmic ordinate for mnist bottom row test error as function of training epoch same color and symbol scheme as in top row no matter the initial learning rate the line sgd perform close to the in practice unknown optimal sgd instance effectively removing the need for exploratory experiments and tuning all plots show means and over repetitions to those reported in the literature for tuned versions of this kind of architecture on these datasets while in both datasets the best sgd instance without just barely outperformed the line searches the optimal value was not the one that performed best after steps so this kind of exploratory experiment which comes with its own cost of human designer time would have led to worse performance than simply starting single instance of sgd with the linesearch and letting the algorithm do the rest average time overhead excluding for the objective was about per line search this is independent of the problem dimensionality and expected to drop significantly with optimized code analysing one of the mnist instances more closely we found that the average length of line search was function evaluations of line searches terminated after the first evaluation this suggests good scale adaptation and thus efficient search note that an optimally tuned algorithm would always lead to accepts the supplements provide additional plots of raw objective values chosen encountered gradient norms and gradient noises during the optimization as well as error plots for each of the two datasets respectively these provide richer picture of the control performed by the line search in particular they show that the line search chooses step sizes that follow nontrivial dynamic over time this is in line with the empirical truism that sgd requires tuning of the step size during its progress nuisance taken care of by the line search using this structured information for more elaborate analytical purposes in particular for convergence estimation is an enticing prospect but beyond the scope of this paper conclusion the line search paradigm widely accepted in deterministic optimization can be extended to noisy settings our design combines existing principles from the case with ideas from bayesian optimization adapted for efficiency we arrived at lightweight algorithm that exposes no parameters to the user our method is complementary to and can in principle be combined with virtually all existing methods for stochastic optimization that adapt step direction of fixed length empirical evaluations suggest the line search effectively frees users from worries about the choice of learning rate any reasonable initial choice will be quickly adapted and lead to close to optimal performance our matlab implementation will be made available at time of publication of this article references robbins and monro stochastic approximation method the annals of mathematical statistics zhang solving large scale linear prediction problems using stochastic gradient descent algorithms in international conference on machine learning icml bottou machine learning with stochastic gradient descent in proceedings of 
the int conf on computational statistic compstat pages springer hoffman blei wang and paisley stochastic variational inference journal of machine learning research hensman rattray and lawrence fast variational inference in the conjugate exponential family in advances in neural information processing systems nips pages broderick boyd wibisono wilson and jordan streaming variational bayes in advances in neural information processing systems nips pages george and powell adaptive stepsizes for recursive estimation with applications in approximate dynamic programming machine learning duchi hazan and singer adaptive subgradient methods for online learning and stochastic optimization journal of machine learning research schraudolph local gain adaptation in stochastic gradient descent in ninth international conference on artificial neural networks icann volume pages amari park and fukumizu adaptive method of realizing natural gradient learning for multilayer perceptrons neural computation roux and fitzgibbon fast natural newton method in international conference on machine learning icml pages rajesh chong blei and xing an adaptive learning rate for stochastic variational inference in international conference on machine learning icml pages hennig fast probabilistic optimization from noisy gradients in international conference on machine learning icml schaul zhang and lecun no more pesky learning rates in international conference on machine learning pages fletcher and reeves function minimization by conjugate gradients the computer journal broyden new minimization algorithm notices of the ams fletcher new approach to variable metric algorithms the computer journal goldfarb family of variable metric updates derived by variational means math shanno conditioning of methods for function minimization math nocedal and wright numerical optimization springer verlag wolfe convergence conditions for ascent methods siam review pages armijo minimization of functions having lipschitz continuous first partial derivatives pacific journal of mathematics jones schonlau and welch efficient global optimization of expensive functions journal of global optimization rasmussen and williams gaussian processes for machine learning mit wahba spline models for observational data number in regional conferences series in applied mathematics siam bayesian filtering and smoothing cambridge university press papoulis probability random variables and stochastic processes new york ed edition adler the geometry of random fields wiley drezner and wesolowsky on the computation of the bivariate normal integral journal of statistical computation and simulation 
Inferring Algorithmic Patterns with Recurrent Nets

Tomas Mikolov (tmikolov), Facebook AI Research, Broadway, New York, USA
Armand Joulin (ajoulin), Facebook AI Research, Broadway, New York, USA

Abstract. Despite the recent achievements in machine learning, we are still very far from achieving real artificial intelligence. In this paper, we discuss the limitations of standard deep learning approaches and show that some of these limitations can be overcome by learning how to grow the complexity of a model in a structured way. Specifically, we study the simplest sequence prediction problems that are beyond the scope of what is learnable with standard recurrent networks: algorithmically generated sequences which can only be learned by models which have the capacity to count and to memorize sequences. We show that some basic algorithms can be learned from sequential data using a recurrent network associated with a trainable memory.

Introduction. Machine learning aims to find regularities in data to perform various tasks. Historically, there have been two major sources of breakthroughs: scaling up the existing approaches to larger datasets, and the development of novel approaches. In recent years, a lot of progress has been made in scaling up learning algorithms, by either using alternative hardware such as GPUs or by taking advantage of large clusters. While improving the computational efficiency of the existing methods is important to deploy the models in real-world applications, it is crucial for the research community to continue exploring novel approaches able to tackle new problems.

Recently, deep neural networks have become very successful at various tasks, leading to a shift in the computer vision and speech recognition communities. This breakthrough is commonly attributed to two aspects of deep networks: their similarity to the hierarchical, recurrent structure of the neocortex, and the theoretical justification that certain patterns are more efficiently represented by functions employing multiple non-linearities instead of a single one.

This paper investigates which patterns are difficult to represent and learn with the current state-of-the-art methods. This would hopefully give us hints about how to design new approaches which will advance machine learning research further. In the past, this approach has led to crucial breakthrough results: the XOR problem is an example of a trivial classification problem that cannot be solved using linear classifiers, but can be solved with a non-linear one. This popularized the use of hidden layers and kernel methods. Another example is the parity problem described by Papert and Minsky: it demonstrates that while a single hidden layer is sufficient to represent any function, it is not guaranteed to represent it efficiently, and in some cases can even require exponentially many more parameters (and thus also training data) than what is sufficient for a deeper model. This led to the use of architectures that have several layers of non-linearities, currently known as deep learning models.

Following this line of work, we study basic patterns which are difficult to represent and learn for standard deep models. In particular, we study learning regularities in sequences of symbols gen-

Sequence generator       Example
a^n b^n                   aabb  aaabbb  ab  aaaaabbbbb
a^n b^n c^n               aaabbbccc  abc  aaaaabbbbbccccc
a^n b^n c^n d^n           aabbccdd  aaabbbcccddd  abcd
a^n b^{2n}                aabbbb  aaabbbbbb  abb
a^n b^m c^{n+m}           aabccc  aaabbccccc  abcc

Table 1: examples generated from the algorithms studied in this paper; in bold, the characters which can be predicted deterministically. During training we do not have access to this information, and at test time we evaluate only on
deterministically predictable characters erated by simple algorithms interestingly we find that these regularities are difficult to learn even for some advanced deep learning methods such as recurrent networks we attempt to increase the learning capabilities of recurrent nets by allowing them to learn how to control an infinite structured memory we explore two basic topologies of the structured memory pushdown stack and list our structured memory is defined by constraining part of the recurrent matrix in recurrent net we use multiplicative gating mechanisms as learnable controllers over the memory and show that this allows our network to operate as if it was performing simple read and write operations such as push or pop for stack among recent work with similar motivation we are aware of the neural turing machine and memory networks however our work can be considered more as follow up of the research done in the early nineties when similar types of memory augmented neural networks were studied algorithmic patterns we focus on sequences generated by simple short algorithms the goal is to learn regularities in these sequences by building predictive models we are mostly interested in discrete patterns related to those that occur in the real world such as various forms of long term memory more precisely we suppose that during training we have only access to stream of data which is obtained by concatenating sequences generated by given algorithm we do not have access to the boundary of any sequence nor to sequences which are not generated by the algorithm we denote the regularities in these sequences of symbols as algorithmic patterns in this paper we focus on algorithmic patterns which involve some form of counting and memorization examples of these patterns are presented in table for simplicity we mostly focus on the unary and binary numeral systems to represent patterns this allows us to focus on designing model which can learn these algorithms when the input is given in its simplest form some algorithm can be given as context free grammars however we are interested in the more general case of sequential patterns that have short description length in some general computational system of particular interest are patterns relevant to develop better language understanding finally this study is limited to patterns whose symbols can be predicted in single computational step leaving out algorithms such as sorting or dynamic programming related work some of the algorithmic patterns we study in this paper are closely related to context free and context sensitive grammars which were widely studied in the past some works used recurrent networks with hardwired symbolic structures these networks are continuous implementation of symbolic systems and can deal with recursive patterns in computational linguistics while theses approaches are interesting to understand the link between symbolic and systems such as neural networks they are often hand designed for each specific grammar wiles and elman show that simple recurrent networks are able to learn sequences of the form an bn and generalize on limited range of while this is promising result their model does not truly learn how to count but instead relies mostly on memorization of the patterns seen in the training data rodriguez et al further studied the behavior of this network designs hardwired second order recurrent network to tackle similar sequences christiansen and chater extended these results to grammars with larger vocabularies this work shows that 
this type of architectures can learn complex internal representation of the symbols but it can not generalize to longer sequences generated by the same algorithm beside using simple recurrent networks other structures have been used to deal with recursive patterns such as pushdown dynamical automata or sequenctial cascaded networks hochreiter and schmidhuber introduced the long short term memory network lstm architecture while this model was orginally developed to address the vanishing and exploding gradient problems lstm is also able to learn simple and grammars this is possible because its hidden units can choose through multiplicative gating mechanism to be either linear or the linear units allow the network to potentially count one can easily add and subtract constants and store finite amount of information for long period of time these mechanisms are also used in the gated recurrent unit network in our work we investigate the use of similar mechanism in context where the memory is unbounded and structured as opposed to previous work we do not need to erase our memory to store new unit more recently graves et al have extended lstm with an attention mechansim to build model which roughly resembles turing machine with limited tape their memory controller works with fixed size memory and it is not clear if its complexity is necessary for the the simple problems they study finally many works have also used external memory modules with recurrent network such as stacks zheng et al use discrete external stack which may be hard to learn on long sequences das et al learn continuous stack which has some similarities with ours the mechnisms used in their work is quite different from ours their memory cells are associated with weights to allow continuous representation of the stack in order to train it with continuous optimization scheme on the other hand our solution is closer to standard rnn with special connectivities which simulate stack with unbounded capacity we tackle problems which are closely related to the ones addressed in these works and try to go further by exploring more challenging problems such as binary addition model simple recurrent network we consider sequential data that comes in the form of discrete tokens such as characters or words the goal is to design model able to predict the next symbol in stream of data our approach is based on standard model called recurrent neural network rnn and popularized by elman rnn consists of an input layer hidden layer with recurrent connection and an output layer the recurrent connection allows the propagation of information through sequence of tokens rnn takes as input the encoding xt of the current token and predicts the probability yt of next symbol there is hidden layer with units which stores additional information about the previous tokens seen in the sequence more precisely at each time the state of the hidden layer ht is updated based on its previous state and the encoding xt of the current token according to the following equation ht xt where exp is the sigmoid activation function applied coordinate wise is the token embedding matrix and is the matrix of recurrent weights given the state of these hidden units the network then outputs the probability vector yt of the next token according to the following equation yt ht where is the softmax function and is the output matrix where is the number of different tokens this architecture is able to learn relatively complex patterns similar in nature to the ones captured by while this has 
made the RNNs interesting for language modeling, they may not have the capacity to learn how algorithmic patterns are generated. In the next section, we show how to add an external memory to RNNs which has the theoretical capability to learn simple algorithmic patterns.

Figure: (a) a neural network extended with a push-down stack and a controlling mechanism that learns what action, among PUSH, POP and NO-OP, to perform; (b) the same model extended with a list, with actions INSERT, LEFT and RIGHT.

Pushdown network. In this section, we describe a simple structured memory inspired by the pushdown automaton, i.e., an automaton which employs a stack. We train our network to learn how to operate this memory with standard optimization tools.

A stack is a type of persistent memory which can be accessed only through its topmost element. Three basic operations can be performed with a stack: POP removes the top element, PUSH adds a new element on top of the stack, and NO-OP does nothing. For simplicity, we first consider a simplified version where the model can only choose between PUSH and POP at each time step. We suppose that this decision is made by a controller variable a_t which depends on the state of the hidden variable h_t:

a_t = softmax(A h_t),

where A is a 2 x m matrix and m is the size of the hidden layer. We denote by a_t[PUSH] the probability of the PUSH action and by a_t[POP] the probability of the POP action. We suppose that the stack is stored at time t in a vector s_t of size k. Note that k could be increased on demand and does not have to be fixed, which allows the capacity of the model to grow. The top element is stored at position 0, with value

s_t[0] = a_t[PUSH] sigma(D h_t) + a_t[POP] s_{t-1}[1],

where D is a 1 x m matrix and sigma the sigmoid. If a_t[POP] is equal to 1, the top element is replaced by the value below it (all values are moved by one position up in the stack structure). If a_t[PUSH] is equal to 1, we move all values down in the stack and add a value on top of the stack. Similarly, for an element stored at depth i > 0 in the stack, we have the following update rule:

s_t[i] = a_t[PUSH] s_{t-1}[i-1] + a_t[POP] s_{t-1}[i+1]

(a small numerical sketch of this soft update is given below). We use the stack to carry information to the hidden layer at the next time step; when the stack is empty, its entries are set to a constant "empty" value. The hidden layer h_t is now updated as

h_t = sigma(U x_t + R h_{t-1} + P s^k_{t-1}),

where U and R are the token embedding and recurrent matrices of the plain RNN above, P is a recurrent matrix between the stack and the hidden layer, and s^k_{t-1} are the top k elements of the stack at time t-1. In our experiments, we set k to a small constant. We call this model Stack RNN, and show it in the figure (without the recurrent matrix R, for clarity).

Stack RNN with NO-OP: adding the NO-OP action allows the stack to keep the same value on top, by a minor change of the top-of-stack update rule, which becomes

s_t[0] = a_t[PUSH] sigma(D h_t) + a_t[POP] s_{t-1}[1] + a_t[NO-OP] s_{t-1}[0].

Extension to multiple stacks: using a single stack has serious limitations, especially considering that at each time step only one action can be performed. We increase the capacity of the model by using multiple stacks in parallel; the stacks can interact through the hidden layer, allowing them to process more challenging patterns.

Table 2: comparison with RNN and LSTM on sequences generated by counting algorithms (methods: RNN, LSTM, List RNN, Stack RNN, Stack RNN with rounding; patterns: a^n b^n, a^n b^n c^n, a^n b^n c^n d^n, a^n b^{2n}, a^n b^m c^{n+m}; numerical values not reproduced here). The sequences seen during training are restricted to small values of n, and we test on sequences with larger n. We report the percentage of n for which the model was able to correctly predict the sequences; performance above the training range means the model is able to generalize to never-seen sequence lengths.

Lists. While in this paper we mostly focus on an infinite memory based on stacks, it is straightforward to extend the model to other forms of infinite memory, for example the doubly-linked list. A list is a one-dimensional memory where each node is connected to its left and right neighbors. There is a head associated with the list; the head can move between nearby nodes and insert
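Returning briefly to the stack update rules above, here is a minimal numerical sketch of one soft update step (our own toy code; the cell below the bottom of the stack is treated as zero here, whereas the actual model marks empty cells with a special constant):

```python
import numpy as np

def stack_update(prev_stack, a_push, a_pop, new_top):
    """One soft PUSH/POP update of the continuous stack (depth >= 2 assumed).

    The new top mixes the pushed value with the element popped up from below;
    deeper cells shift down (PUSH) or up (POP), weighted by the action probabilities.
    """
    s = np.empty_like(prev_stack)
    s[0] = a_push * new_top + a_pop * prev_stack[1]
    s[1:-1] = a_push * prev_stack[:-2] + a_pop * prev_stack[2:]
    s[-1] = a_push * prev_stack[-2]      # nothing to pop up into the bottom cell
    return s
```

With a_push = 1 this reduces to a hard push of new_top; with a_pop = 1 it discards the top element, exactly as described above.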
new node at its current position more precisely we consider three different actions insert which inserts an element at the current position of the head left which moves the head to the left and right which moves it to the right given list and fixed head position head the updates are at right at left at insert dht if head lt at right at left at insert if head at right at left at insert if head note that we can add operation as well we call this model list rnn and show it in figure without the recurrent matrix for clarity optimization the models presented above are continuous and can thus be trained with stochastic gradient descent sgd method and through time as patterns becomes more complex more complex memory controller must be learned in practice we observe that these more complex controller are harder to learn with sgd using several random restarts seems to solve the problem in our case we have also explored other type of search based procedures as discussed in the supplementary material rounding continuous operators on stacks introduce small imprecisions leading to numerical issues on very long sequences while simply discretizing the controllers partially solves this problem we design more robust rounding procedure tailored to our model we slowly makes the controllers converge to discrete values by multiply their weights by constant which slowly goes to infinity we finetune the weights of our network as this multiplicative variable increase leading to smoother rounding of our network finally we remove unused stacks by exploring models which use only subset of the stacks while would be exponential in the number of stacks we can do it efficiently by building tree of removable stacks and exploring it with deep first search experiments and results first we consider various sequences generated by simple algorithms where the goal is to learn their generation rule we hope to understand the scope of algorithmic patterns each model can capture we also evaluate the models on standard language modeling dataset penn treebank implementation details stack and list rnns are trained with sgd and backpropagation through time with steps hard clipping of to prevent gradient explosions and an initial learning rate of the learning rate is divided by each time the entropy on the validation set is not decreasing the depth defined in eq is set to the free parameters are the number of hidden units stacks and the use of the baselines are rnns with and units and lstms with and layers with and units the of the baselines are selected on the validation sets learning simple algorithmic patterns given an algorithm with short description length we generate sequences and concatenate them into longer sequences this is an unsupervised task since the boundaries of each generated sequences current next prediction proba next action pop pop push pop push push push push push push push push push push push push pop push pop push pop push pop push pop push pop push pop pop pop pop pop pop pop pop pop pop top top table example of the stack rnn with hidden units and stacks on sequence an with means that the stack is empty the depth is set to for clarity we see that the first stack pushes an element every time it sees and pop when it sees the second stack pushes when it sees when it sees it pushes if the first stack is not empty and pop otherwise this shows how the two stacks interact to correctly predict the deterministic part of the sequence shown in bold memorization binary addition figure comparison of rnn lstm list rnn and 
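The rounding procedure above can be illustrated with a small sketch: scaling the controller's weights by a growing constant makes the softmax over actions converge to a discrete (one-hot) choice. The specific multiplier schedule below is a hypothetical example; in the procedure described above the network is fine-tuned while the multiplier is increased slowly.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sharpened_actions(A, h, multiplier):
    """Action probabilities when the controller weights are scaled by `multiplier`;
    as the multiplier grows, the distribution approaches an argmax (a discrete action)."""
    return softmax(multiplier * (A @ h))

rng = np.random.default_rng(0)
A, h = rng.normal(size=(2, 8)), rng.normal(size=8)
for m in [1, 4, 16, 64]:   # illustrative schedule; in practice it is increased slowly during fine-tuning
    print(m, np.round(sharpened_actions(A, h, m), 3))
```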
stack rnn on memorization and the performance of stack rnn on binary addition the accuracy is in the proportion of correctly predicted sequences generated with given we use hidden units and stacks are not known we study patterns related to counting and memorization as shown in table to evaluate if model has the capacity to understand the generation rule used to produce the sequences it is tested on sequences it has not seen during training our experimental setting is the following the training and validation set are composed of sequences generated with up to while the test set is composed of sequences generated with up to during training we incrementally increase the parameter every few epochs until it reaches some at test time we measure the performance by counting the number of correctly predicted sequences sequence is considered as correctly predicted if we correctly predict its deterministic part shown in bold in table on these toy examples the recurrent matrix defined in eq is set to to isolate the mechanisms that stack and list can capture counting results on patterns generated by counting algorithms are shown in table we report the percentage of sequence lengths for which method is able to correctly predict sequences of that length list rnn and stack rnn have hidden units and either lists or stacks for these tasks the operation is not used table shows that rnns are unable to generalize to longer sequences and they only correctly predict sequences seen during training lstm is able to generalize to longer sequences which shows that it is able to count since the hidden units in an lstm can be linear with finer search the lstm should be able to achieve on all of these tasks despite the absence of linear units these models are also able to generalize for an bm rounding is required to obtain the best performance table show an example of actions done by stack rnn with two stacks on sequence of the form an for clarity we show sequence generated with equal to and we use discretization stack rnn pushes an element on both stacks when it sees the first stack pops elements when the input is and the second stack starts popping only when the first one is empty note that the second stack pushes special value to keep track of the sequence length memorization figure shows results on memorization for dictionary with two elements stack rnn has units and stacks and list rnn has lists we use random restarts and we repeat this process multiple times stack rnn and list rnn are able to learn memorization while rnn and lstm do not seem to generalize in practice list rnn is more unstable than stack rnn and overfits on the training set more frequently this unstability may be explained by the higher number of actions the controler can choose from versus for this reason we focus on stack rnn in the rest of the experiments figure an example of learned stack rnn that performs binary addition the last column is our interpretation of the functionality learned by the different stacks the color code is green means push red means pop and grey means actions equivalent to we show the current discretized value on the top of the each stack at each given time the sequence is read from left to right one character at time in bold is the part of the sequence which has to be predicted note that the result is written in reverse binary addition given sequence representing binary addition the goal is to predict the result where represents the end of the sequence as opposed to the previous tasks this task is supervised the location 
of the deterministic tokens is provided the result of the addition is asked in the reverse order in the previous example as previously we train on short sequences and test on longer ones the length of the two input numbers is chosen such that the sum of their lengths is equal to less than during training and up to at test time their most significant digit is always set to stack rnn has hidden units with stacks the right panel of figure shows the results averaged over multiple runs with random restarts while stack rnns are generalizing to longer numbers it overfits for some runs on the validation set leading to larger error bar than in the previous experiments figure shows an example of model which generalizes to long sequences of binary addition this example illustrates the moderately complex behavior that the stack rnn learns to solve this task the first stack keeps track of where we are in the sequence either reading the first number reading the second number or writing the result stack keeps in memory the first number interestingly the first number is first captured by the stacks and and then copied to stack the second number is stored on stack while its length is captured on stack by pushing one and then set of zeros when producing the result the values stored on these three stacks are popped finally stack takes care of the carry it switches between two states or which explicitly say if there is carry over or not while this use of stacks is not optimal in the sense of minimal description length it is able to generalize to sequences never seen before language modeling model validation perplexity test perplexity ngram ngram cache rnn lstm srcn stack rnn table comparison of rnn lstm srcn and stack rnn on penn treebank corpus we use the recurrent matrix in stack rnn as well as hidden units and stacks we compare stack rnn with rnn lstm and srcn on the standard language modeling dataset penn treebank corpus srcn is standard rnn with additional linear units which capture long term dependencies similar to bag of words the models have only one hidden layer with hidden units table shows that stack rnn performs better than rnn with comparable number of parameters but not as well as lstm and srcn empirically we observe that stack rnn learns to store exponentially decaying bag of words similar in nature to the memory of srcn discussion and future work continuous versus discrete model and search certain simple algorithmic patterns can be efficiently learned using continuous optimization approach stochastic gradient descent applied to continuous model representation in our case rnn note that stack rnn works better than prior work based on rnn from the nineties it seems also simpler than many other approaches designed for these tasks however it is not clear if continuous representation is completely appropriate for learning algorithmic patterns it may be more natural to attempt to solve these problems with discrete model this motivates us to try to combine continuous and discrete optimization it is possible that the future of learning of algorithmic patterns will involve such combination of discrete and continuous optimization memory while in theory using multiple stacks for representing memory is as powerful as turing complete computational system intricate interactions between stacks need to be learned to capture more complex algorithmic patterns stack rnn also requires the input and output sequences to be in the right format memorization is in reversed order it would be interesting to consider in 
the future other forms of memory which may be more flexible as well as additional mechanisms which allow to perform multiple steps with the memory such as loop or random access finally complex algorithmic patterns can be more easily learned by composing simpler algorithms designing model which possesses mechanism to compose algorithms automatically and training it on incrementally harder tasks is very important research direction conclusion we have shown that certain difficult pattern recognition problems can be solved by augmenting recurrent network with structured growing potentially unlimited memory we studied very simple memory structures such as stack and list but the same approach can be used to learn how to operate more complex ones for example tape while currently the topology of the long term memory is fixed we think that it should be learned from the data as well acknowledgment we would like to thank arthur szlam keith adams jason weston yann lecun and the rest of the facebook ai research team for their useful comments references bengio and lecun scaling learning algorithms towards ai kernel machines bishop pattern recognition and machine learning springer new york and wiles and dynamics in recurrent neural networks connection science the code is available at https bottou machine learning with stochastic gradient descent in compstat springer breiman random forests machine learning bridle probabilistic interpretation of feedforward classification network outputs with relationships to statistical pattern recognition in neurocomputing pages springer christiansen and chater toward connectionist model of recursion in human linguistic performance cognitive science chung gulcehre cho and bengio gated feedback recurrent neural networks arxiv ciresan meier masci gambardella and schmidhuber neural networks for visual object classification arxiv preprint crocker mechanisms for sentence processing university of edinburgh dahl yu deng and acero deep neural networks for speech recognition audio speech and language processing das giles and sun learning grammars capabilities and limitations of recurrent neural network with an external stack memory in accss das giles and sun using prior knowledge in nnpda to learn languages nips elman finding structure in time cognitive science fanty parsing in connectionist networks parallel natural language processing gers and schmidhuber lstm recurrent networks learn simple and languages transactions on neural networks graves wayne and danihelka neural turing machines arxiv preprint recurrent network that performs prediction task in accss hochreiter and schmidhuber long memory neural computation holldobler kalinke and lehmann designing counter another case study of dynamics and activation landscapes in recurrent networks in advances in artificial intelligence krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips lecun bottou bengio and haffner learning applied to document recognition mikolov statistical language models based on neural networks phd thesis brno university of technology mikolov joulin chopra mathieu and ranzato learning longer memory in recurrent neural networks arxiv preprint minsky and papert perceptrons mit press mozer and das connectionist symbol manipulator that discovers the structure of languages nips pollack the induction of dynamical recognizers machine learning recht re wright and niu hogwild approach to parallelizing stochastic gradient descent in nips rodriguez wiles and elman 
recurrent neural network that learns to count connection science rumelhart hinton and williams learning internal representations by error propagation technical report dtic document tabor fractal encoding of grammars in connectionist networks expert systems werbos generalization of backpropagation with application to recurrent gas market model neural networks weston chopra and bordes memory networks in iclr wiles and elman learning to count without counter case study of dynamics and activation landscapes in recurrent networks in accss williams and zipser learning algorithms for recurrent networks and their computational complexity theory architectures and applications pages zaremba and sutskever learning to execute arxiv preprint zeng goodman and smyth discrete recurrent neural networks for grammatical inference transactions on neural networks 
where are they looking aditya carl vondrick massachusetts institute of technology antonio torralba recasens khosla vondrick torralba indicates equal contribution abstract humans have the remarkable ability to follow the gaze of other people to identify what they are looking at following eye gaze or is an important ability that allows us to understand what other people are thinking the actions they are performing and even predict what they might do next despite the importance of this topic this problem has only been studied in limited scenarios within the computer vision community in this paper we propose deep neural networkbased approach for and new benchmark dataset gazefollow for thorough evaluation given an image and the location of head our approach follows the gaze of the person and identifies the object being looked at our deep network is able to discover how to extract head pose and gaze orientation and to select objects in the scene that are in the predicted line of sight and likely to be looked at such as televisions balls and food the quantitative evaluation shows that our approach produces reliable results even when viewing only the back of the head while our method outperforms several baseline approaches we are still far from reaching human performance on this task overall we believe that gazefollowing is challenging and important problem that deserves more attention from the community introduction you step out of your house and notice group of people looking up you look up and realize they are looking at an aeroplane in the sky despite the object being far away humans have the remarkable ability to precisely follow the gaze direction of another person task commonly referred to as gazefollowing see for review such an ability is key element to understanding what people are doing in scene and their intentions similarly it is crucial for computer vision system to have this ability to better understand and interpret people for instance person might be holding book but looking at the television or group of people might be looking at the same object which can indicate that they are collaborating at some task or they might be looking at different places which can indicate that they are not familiar with each other or that they are performing unrelated tasks figure we present model that learns to predict where people in images are looking we also introduce gazefollow new annotated dataset for see figure has applications in robotics and human interaction interfaces where it is important to understand the object of interest of person can also be used to predict what person will do next as people tend to attend to objects they are planning to interact with even before they start an action despite the importance of this topic only few works in computer vision have explored gazefollowing previous work on addresses the problem by limiting the scope people looking at each other only by restricting the situations scenes with multiple people only or synthetic scenarios or by using complex inputs multiple images or data only tackles the unrestricted scenario but relies on face detectors therefore can not handle situations such as people looking away from the camera and is not evaluated on task our goal is to perform in natural settings without making restrictive assumptions and when only single view is available we want to address the general problem to be able to handle situations in which several people are looking at each other and one or more people are interacting with one or more objects 
in this paper we formulate the problem of as given single picture containing one or more people the task is to the predict the location that each person in scene is looking at to address this problem we introduce deep architecture that learns to combine information about the head orientation and head location with the scene content in order to follow the gaze of person inside the picture the input to our model is picture and the location of the person for who we want to follow the gaze and the output is distribution over possible locations that the selected person might be looking at this output distribution can be seen as saliency map from the point of view of the person inside the picture to train and evaluate our model we also introduce gazefollow benchmark dataset for our model code and dataset are available for download at http related work saliency although strongly related there are number of important distinctions between and saliency models of attention in traditional models of visual attention the goal is to predict the eye fixations of an observer looking at picture while in gazefollowing the goal is to estimate what is being looked at by person inside picture most saliency models focus on predicting fixations while an observer is an image see for review however in the people in the picture are generally engaged in task or navigating an environment and therefore are not and might fixate on objects even when they are not the most salient model for has to be able to follow the line of sight and then select among all possible elements that cross the line of sight which objects are likely to be the center of attention both tasks and saliency modeling are related in several interesting ways for instance showed that of people inside picture can influence the fixations of an observer looking at the picture as the object being fixated by the people inside the picture will attract the attention of the observer of the picture related work gaze the work on in computer vision is very limited gazefollowing is used in to improve models of saliency prediction however they only estimate the gaze direction without identifying the object being attended further their reliance on face detector prevents them from being able to estimate gaze for people looking away from the camera another way of approaching is using wearable to precisely measure the gaze of several people in scene for instance used an eye tracker to predict the next object the user will interact with and to improve action recognition in egocentric vision in they propose detecting people looking at each other in movie in order to better identify interactions between people as in this work only relies on the direction of gaze without estimating the object being attended and therefore can not address the general problem of in which person is interacting with an object in they perform in scenes with multiple observers in an image by finding the regions in which multiple lines of sight intersect their method needs multiple people in the scene each with an egocentric camera used to get head location as the model only uses head orientation information and does not incorporate knowledge about the content of the scene in the authors propose system to infer the region attracting the attention of group of people social saliency prediction as in their method takes as input set of pictures taken from the viewpoint of each of the people present in the image and it does not perform our method only uses single view of the scene to infer gaze head 
location density fixation loc density norm fixation loc density avg gaze direction direction color code example test images and annotations test set statistics figure gazefollow dataset we introduce new dataset for in natural images on the left we show several example annotations and images in the graphs on the right we summarize few statistics about test partition of the dataset the top three heat maps show the probability density for the location of the head the fixation location and the fixation location normalized with respect to the head position the bottom shows the average gaze direction for various head positions gazefollow dataset in order to both train and evaluate models we built gazefollow dataset annotated with the location of where people in images are looking we used several major datasets that contain people as source of images images from sun images from ms coco images from actions images from pascal images from the imagenet detection challenge and images from the places dataset this concatenation results in challenging and large image collection of people performing diverse activities in many everyday scenarios since the source datasets do not have gaze we annotated it using amazon mechanical turk amt workers used our online tool to mark the center of person eyes and where the worker believed the person was looking workers could indicate if the person was looking outside the image or if the person head was not visible to control quality we included images with known and we used these to detect and discard poor annotations finally we obtained people in images with gaze locations inside the image we use about people of our dataset for testing and the rest for training we ensured that every person in an image is part of the same split and to avoid bias we picked images for testing such that the fixation locations were uniformly distributed across the image further to evaluate human consistency on we collected gaze annotations per person for the test set we show some example annotations and statistics of the dataset in we designed our dataset to capture various fixation scenarios for example some images contain several people with joint attention while others contain people looking at each other the number of people in the image can vary ranging from single person to crowd of people moreover we observed that while some people have consistent fixation locations others have bimodal or largely inconsistent distributions suggesting that solutions to the problem could be multimodal learning to follow gaze at high level our model is inspired by how humans tend to follow gaze when people infer where another person is looking they often first look at the person head and eyes to estimate their field of view and subsequently reason about salient objects in their perspective to predict where they are looking in this section we present model that emulates this approach full image xi product saliency pathway shifted grids conv conv conv conv conv conv fc saliency map xi head xh gaze pathway fc conv conv conv conv conv fc fc head location xp fc fc fc fc gaze mask xh xp fc gaze prediction figure network architecture we show the architecture of our deep network for our network has two main components the saliency pathway top to estimate saliency and the gaze pathway bottom to estimate gaze direction see section for details gaze and saliency pathways suppose we have an image xi and person for whom we want to predict gaze we parameterize this person with quantized spatial location of the 
person head xp and cropped image of their head xh given we seek to predict the spatial location of the person fixation encouraged by progress in deep learning we also use deep networks to predict person fixation keeping the motivation from section in mind we design our network to have two separate pathways for gaze and saliency the gaze pathway only has access to the closeup image of the person head and their location and produces spatial map xh xp of size the saliency pathway sees the full image but not the person location and produces another spatial map xi of the same size we then combine the pathways with an product xh xp xi where represents the product is fully connected layer that uses the multiplied pathways to predict where the person is looking since the two network pathways only receive subset of the inputs they can not themselves solve the full problem during training and instead are forced to solve subproblems our intention is that since the gaze pathway only has access to the person head xh and location xp we expect it will learn to predict the direction of gaze likewise since the saliency pathway does not know which person to follow we hope it learns to find objects that are salient independent of the person viewpoint the product allows these two pathways to interact in way that is similar to how humans approach this task in order for location in the product to be activated both the gaze and saliency pathways must have large activations saliency map to form the saliency pathway we use convolutional network on the full image to produce hidden representation of size since shows that objects tend to emerge in these deep representations we can create saliency map by learning the importance of these objects to do this we add convolutional layer that convolves the hidden representation with filter which produces the saliency map here the sign and magnitude of can be interpreted as weights indicating an object importance for saliency gaze mask in the gaze pathway we use convolutional network on the head image we concatenate its output with the head position and use several fully connected layers and final sigmoid to predict the gaze mask pathway visualization fig shows examples of the gaze masks and saliency maps learned by our network fig also compares the saliency maps of our network with the saliency computed using state of the art saliency model note that our model learns notion of saliency that is relevant for the task and places emphasis on certain objects that people tend to look at balls and televisions in the third example the red light coming from the computer mouse is salient in the judd et al model but that object is not relevant in task as the computer monitor is more likely to be the target of attention of the person inside the picture gaze mask saliency input image saliency gaze saliency input image saliency gaze saliency input image saliency gaze saliency figure pathway visualization the gaze mask output by our network for various head poses each triplet of images show from left to right the input image its saliency estimated using and the saliency estimated using our network these examples clearly illustrate the differences between saliency and saliency multimodal predictions although humans can often follow gaze reliably predicting gaze is sometimes ambiguous if there are several salient objects in the image or the eye pose can not be accurately perceived then humans may disagree when predicting gaze we can observe this for several examples in fig consequently we 
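The sketch below illustrates the two-pathway structure just described in runnable PyTorch code: a saliency pathway over the full image, a gaze pathway over the head crop and head position, an element-wise product of the two spatial maps, and a final fully connected prediction layer. The single-convolution trunks, the spatial map size D, the hidden widths and the single unshifted output grid are placeholders; the paper uses the first five AlexNet convolutional layers and the sizes quoted in the implementation details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeFollowSketch(nn.Module):
    def __init__(self, D=13, grid=5):
        super().__init__()
        # Saliency pathway: full image -> spatial saliency map (1x1 conv over deep features).
        self.saliency_trunk = nn.Sequential(nn.Conv2d(3, 256, 11, stride=17), nn.ReLU())
        self.saliency_conv = nn.Conv2d(256, 1, kernel_size=1)
        # Gaze pathway: head crop + head position -> D x D gaze mask (sigmoid output).
        self.gaze_trunk = nn.Sequential(nn.Conv2d(3, 256, 11, stride=17), nn.ReLU(),
                                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.gaze_fc = nn.Sequential(nn.Linear(256 + 2, 200), nn.ReLU(),
                                     nn.Linear(200, D * D), nn.Sigmoid())
        # Element-wise product of the two maps, then a fully connected prediction layer.
        self.out = nn.Linear(D * D, grid * grid)
        self.D = D

    def forward(self, full_image, head_crop, head_pos):
        s = self.saliency_conv(self.saliency_trunk(full_image))                     # saliency map
        s = F.interpolate(s, size=(self.D, self.D)).flatten(1)
        g = self.gaze_fc(torch.cat([self.gaze_trunk(head_crop), head_pos], dim=1))  # gaze mask
        return self.out(s * g)                                                      # fixation-grid logits

net = GazeFollowSketch()
logits = net(torch.randn(2, 3, 227, 227), torch.randn(2, 3, 227, 227), torch.rand(2, 2))
```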
want to design our model to support multimodal predictions we could formulate our problem as regression task regress the cartesian coordinates of fixations but then our predictions would be unimodal instead we can formulate our problem as classification task which naturally supports multimodal outputs because each category has confidence value to do this we quantize the fixation location into grid then the job of the network is to classify the inputs into one of classes the model output rn is the confidence that the person is fixating in each grid cell shifted grids for classification we must choose the number of grid cells if we pick small our predictions will suffer from poor precision if we pick large there will be more precision but the learning problem becomes harder because standard classification losses do not gradually penalize spatial categories misclassification that is off by just one cell should be penalized less than errors multiple cells away to alleviate this we propose the use of shifted grids as illustrated in fig where the network solves several overlapping classification problems the network predicts locations in multiple grids where each grid is shifted such that cells in one grid overlap with cells in other grids we then average the shifted outputs to produce the final prediction training we train our network using backpropagation we use softmax loss for each shifted grid and average their losses since we only supervise the network with gaze fixations we do not enforce that the gaze and saliency pathways solve their respective subproblems rather we expect that the proposed network structure encourages these roles to emerge automatically which they do as shown in fig implementation details we implemented the network using caffe the convolutional layers in both the gaze and saliency pathways follow the architecture of the first five layers of the alexnet architecture in our experiments we initialize these convolutional layers of the saliency pathway with the and those of the gaze pathway with the last convolutional layer of the saliency pathway has convolution kernel the remaining fully connected layers in the gaze pathway are of sizes and respectively the saliency map and gaze mask are in size and we use shifted grids of size each for learning we augment our training data with flips and random crops with the fixation locations adjusted accordingly figure qualitative results we show several examples of successes and failures of our model the red lines indicate ground truth gaze and the yellow our predicted gaze min min model auc dist dist ang auc dist dist ang model our no image grid no position grid no head judd no eltwise fixed bias grid center grid random loss one human our full main evaluation model diagnostics table evaluation we evaluate our model against baselines and analyze how it performances with some components disabled auc refers to the area under the roc curve higher is better dist refers to the distance to the average of ground truth fixation while min dist refers to the distance to the nearest ground truth fixation lower is better ang is the angular error of predicted gaze in degrees lower is better see section for details experiments setup we evaluate the ability of our model to predict where people in images are looking we use the disjoint train and test sets from gazefollow as described in section to train and evaluate our model the test set was randomly sampled such that the fixation location was approximately uniform and ignored people who were 
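As a small illustration of the shifted-grids idea, the sketch below maps a fixation point to one class label per shifted grid; the network is then trained with one softmax loss per grid, and the losses (and, at prediction time, the heatmaps) are averaged. The number of grids and the shift offsets used here are assumptions made for illustration, not the exact values from the implementation details.

```python
import numpy as np

def shifted_grid_labels(fixation, N=5,
                        shifts=((0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1))):
    """Return one class index (in an N x N grid) per shifted grid for a fixation (x, y) in [0, 1]^2."""
    labels = []
    for dx, dy in shifts:
        col = int(np.clip((fixation[0] + dx) * N, 0, N - 1))
        row = int(np.clip((fixation[1] + dy) * N, 0, N - 1))
        labels.append(row * N + col)
    return labels

print(shifted_grid_labels((0.52, 0.48)))   # a fixation near the image centre lands in a different cell per grid
```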
looking outside the picture or at the camera similar to pascal voc action recognition where person bounding boxes are available both during training and testing we assume that we are given the head location at both train and test time this allows us to focus our attention on the primary task of in section we show that our method performs well even when using simple head detector our primary evaluation metric compares the ground truth against the distribution predicted by our model we use the area under curve auc criteria from where the predicted heatmap is used as confidences to produce an roc curve the auc is the area under this roc curve if our model behaves perfectly the auc will be while chance performance is distance we evaluate the euclidean distance between our prediction and the average of ground truth annotations we assume each image is of size when computing the distance additionally as the ground truth may be multimodal we also report the minimum distance between our note that as mentioned in section we obtain annotations per person in the test set diction and all ground truth annotations angular error using the ground truth eye position from the annotation we compute the gaze vectors for the average ground truth fixations and our prediction and report the angular difference between them we compare our approach against several baselines ranging from simple center fixed bias to more complex svm saliency as described below center the prediction is always the center of the image fixed bias the prediction is given by the average of fixations from the training set for heads in similar locations as the test image svm we generate features by concatenating the quantized eye position with of the for both the full image and the head image we train svm on these features to predict gaze using similar classification grid setup as our model we evaluate this approach for both single grid and shifted grids freeviewing saliency we use saliency model as predictor of gaze although saliency models ignore head orientation and location they may still identify important objects in the image results we compare our model against baselines in our method archives an auc of and mean euclidean error of outperforming all baselines significantly in all the evaluation metrics the svm model using shifted grids shows the best baseline performance surpassing the one grid baseline by reasonable margin this verifies the effectiveness of the shifted grids approach proposed in this work shows some example outputs of our method these qualitative results show that our method is able to distinguish people in the image by using the gaze pathway to model person point of view as it produces different outputs for different people in the same image furthermore it is also able to find salient objects in images such as balls or food however the method still has certain limitations the lack of understanding generates some wrong predictions as illustrated by the image in the row of fig where one of the predictions is in different plane of depth to obtain an approximate upper bound on prediction performance we evaluate human performance on this task since we annotated our test set times we can quantify how well one annotation predicts the mean of the remaining annotations single human is able to achieve an auc of and mean euclidean error of while our approach outperforms all baselines it is still far from reaching human performance we hope that the availability of gazefollow will motivate further research in this direction 
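For concreteness, a minimal sketch of the distance and angular-error metrics is given below; the AUC metric additionally treats the predicted heatmap values as confidences for an ROC curve and is omitted here. The image is treated as the unit square, all inputs are (x, y) coordinates, and the example values are purely illustrative.

```python
import numpy as np

def gaze_metrics(pred_point, eye, gt_points):
    """Distance to the mean annotation, minimum distance to any annotation,
    and angular error (degrees) between predicted and mean ground-truth gaze vectors."""
    pred = np.asarray(pred_point, dtype=float)
    gt = np.asarray(gt_points, dtype=float)
    mean_dist = np.linalg.norm(pred - gt.mean(axis=0))
    min_dist = np.min(np.linalg.norm(gt - pred, axis=1))
    v_pred = pred - np.asarray(eye, dtype=float)
    v_gt = gt.mean(axis=0) - np.asarray(eye, dtype=float)
    cos = np.dot(v_pred, v_gt) / (np.linalg.norm(v_pred) * np.linalg.norm(v_gt) + 1e-12)
    ang = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return mean_dist, min_dist, ang

print(gaze_metrics(pred_point=(0.6, 0.4), eye=(0.2, 0.3),
                   gt_points=[(0.65, 0.45), (0.55, 0.5), (0.7, 0.35)]))
```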
allowing machines to reach human level performance analysis ablation study in tbl we report the performance after removing different components of our model one at time to better understand their significance in general all three of inputs image position and head contribute to the performance of our model interestingly the model with only the head and its position achieves comparable angular error to our full method suggesting that the gaze pathway is largely responsible for estimating the gaze direction further we show the results of our model with single output grids and removing shifted grids hurts performance significantly as shifted grids have spatially graded loss function which is important for learning internal representation in fig we visualize the various stages of our network we show the output of each of the pathways as well as the element wise product for example in the second row we have two different girls writing on the blackboard the gaze mask effectively creates heat map of the field of view for the girl in the right while the saliency map identifies the salient spots in the image the multiplication of the saliency map and gaze mask removes the responses of the girl on the left and attenuates the saliency of the right girl head finally our shifted grids approach accurately predicts where the girl is looking further we apply the technique from to visualize the top activations for different units in the fifth convolutional layer of the saliency pathway we use filter weights from the sixth convolutional layer to rank their contribution to the saliency map fig shows four units with positive left and negative right contributions to the saliency map interestingly learns positive weights for salient objects such as switched on tv monitors and balls and negative weights for objects input image gaze mask saliency product output prediction figure visualization of internal representations we visualize the output of different components of our model the green circle indicates the person whose gaze we are trying to predict the red show the ground truth gaze and the yellow line is our predicted gaze positive weight units negative weight units ball surface screen horizon head lights pizza floor figure visualization of saliency units we visualize several units in our saliency pathway by finding images with high scoring activations similar to we sort the units by the weights of the sixth convolutional layer see section for more details positive weights tend to correspond to salient everyday objects while negative weights tend to correspond to background objects automatic head detection to evaluate the impact of imperfect head locations on our system we built simple head detector and input its detections into our model for detections surpassing the intersection over union threshold of our model achieved an auc of as compared to an auc of when using head locations this demonstrates that our model is robust to inaccurate head detections and can easily be made conclusion accurate achieving performance will be an important tool to enable systems that can interpret human behavior and social situations in this paper we have introduced model that learns to do using gazefollow dataset of human annotated gaze our model automatically learns to extract the line of sight from heads without using any supervision on head pose and to detect salient objects that people are likely to interact with without requiring annotations during training we hope that our model and dataset will serve as important 
resources to facilitate further research in this direction acknowledgements we thank andrew owens for helpful discussions funding for this research was partially supported by the obra social la caixa fellowship for studies to ar and google phd fellowship to cv references borji parks and itti complementary effects of gaze direction and early saliency in guiding fixations during free viewing journal of vision borji sihite and itti salient object detection benchmark in eccv emery the eyes have it the neuroethology function and evolution of social gaze neuroscience biobehavioral reviews everingham van gool williams winn and zisserman the pascal visual object classes voc challenge ijcv fathi hodgins and rehg social interactions perspective in cvpr fathi li and rehg learning to recognize daily actions using gaze in eccv hoffman grimes shon and rao probabilistic model of gaze imitation and shared attention neural networks itti and koch computational modelling of visual attention nature reviews neuroscience jasso triesch and using eye direction cues for gaze developmental model in icdl jia caffe an open source convolutional architecture for fast feature embedding http judd ehinger durand and torralba learning to predict where humans look in cvpr krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips lin maire belongie hays perona ramanan and zitnick microsoft coco common objects in context in eccv zisserman eichner and ferrari detecting people looking at each other in videos ijcv park jain and sheikh predicting primary gaze behavior using social saliency fields in iccv parks borji and itti augmented saliency model using automatic head pose detection and learned gaze following in natural scenes vision research russakovsky deng su krause satheesh ma huang karpathy khosla bernstein et al imagenet large scale visual recognition challenge ijcv soo park and shi social saliency prediction in cvpr xiao hays ehinger oliva and torralba sun database scene recognition from abbey to zoo in cvpr yao jiang khosla lin guibas and human action recognition by learning bases of action attributes and parts in iccv zhou khosla lapedriza oliva and torralba object detectors emerge in deep scene cnns in iclr zhou lapedriza xiao torralba and oliva learning deep features for scene recognition using places database in nips zhu and ramanan face detection pose estimation and landmark localization in the wild in cvpr 
the pareto regret frontier for bandits tor lattimore department of computing science university of alberta canada abstract given bandit problem it may be desirable to achieve regret for some special actions show that the price for such unbalanced regret guarantees is rather high specifically if an algorithm enjoys regret of with respect to some action then there must exist another action for which the regret is at least where is the horizon and the number of actions also give upper bounds in both the stochastic and adversarial settings showing that this result can not be improved for the stochastic case the pareto regret frontier is characterised exactly up to constant factors introduction the bandit is the simplest class of problems that exhibit the dilemma in each time step the learner chooses one of actions and receives noisy reward signal for the chosen action learner performance is measured in terms of the regret which is the expected difference between the rewards it actually received and those it would have received in expectation by choosing the optimal action prior work on the regret criterion for bandits has treated all actions uniformly and has aimed for bounds on the regret that do not depend on which action turned out to be optimal take different approach and ask what can be achieved if some actions are given special treatment focussing on bounds ask whether or not it is possible to achieve improved regret for some actions and what is the cost in terms of the regret for the remaining actions such results may be useful in variety of cases for example company that is exploring some new strategies might expect an especially small regret if its existing strategy turns out to be nearly optimal this problem has previously been considered in the experts setting where the learner is allowed to observe the reward for all actions in every round not only for the action actually chosen the earliest work seems to be by hutter and poland where it that the learner can assign prior weight to each action and pays regret of log ρi for expert where ρi is the prior belief in expert and is the horizon the uniform regret is obtained by choosing ρi which leads to the log bound achieved by the exponential weighting algorithm the consequence of this is that an algorithm can enjoy constant regret with respect to single action while suffering minimally on the remainder the problem was studied in more detail by koolen where remarkably the author was able to exactly describe the pareto regret frontier when other related work also in the experts setting is where the objective is to obtain an improved regret against mixture of available et kapralov and panigrahy in similar vain sani et al showed that algorithms for prediction with expert advice can be combined with minimal cost to obtain the best of both worlds in the bandit setting am only aware of the work by liu and li who study the effect of the prior on the regret of thompson sampling in special case in contrast the lower bound given here applies to all algorithms in relatively standard setting the main contribution of this work is characterisation of the pareto regret frontier the set of achievable regret bounds for stochastic bandits let µi be the unknown mean of the ith arm and assume that supi µi µj in each time step the learner chooses an action it and receives reward git µi ηt where ηt is the noise term that assume to be sampled independently from distribution that may depend on it this model subsumes both gaussian and bernoulli or bounded 
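A minimal simulation of this reward model and of the pseudo-regret with respect to each arm is sketched below. The Gaussian noise follows the model above; the number of arms, the mean vector and the policy (uniformly random, purely for illustration) are assumptions of the sketch, not part of the paper.

```python
import numpy as np

def run_bandit(mu, policy, n, seed=0):
    """Play a K-armed Gaussian bandit for n rounds and return the pseudo-regret
    vector (R_1, ..., R_K), where R_i = n * mu_i - sum_t mu_{I_t}."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    chosen_means = 0.0
    for t in range(n):
        arm = policy(rng)
        reward = mu[arm] + rng.normal()   # what the learner observes; a real algorithm would use it to pick arms
        chosen_means += mu[arm]
    return n * mu - chosen_means          # pseudo-regret with respect to each arm

K, n = 4, 10_000
mu = [0.0, 0.2, 0.5, 0.1]
uniform_policy = lambda rng: rng.integers(K)   # illustrative policy only
print(run_bandit(mu, uniform_policy, n))
```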
rewards let be bandit strategy which is function from histories of observations to an action it then the expected pseudo regret with respect to the ith arm is rµ nµi µit where the expectation is taken with respect to the randomness in the noise and the actions of the policy throughout this work will be fixed so is omitted from the notation the expected with respect to arm is riπ sup rµ this means that rπ rk is vector of pseudo regrets with respect to each of the arms let rk be set defined by bi min for all bj the boundary of is denoted by δb the following theorem shows that δb describes the pareto regret frontier up to constant factors theorem there exist universal constants and such that lower bound for ηt and all strategies we have rπ upper bound for all there exists strategy such that riπ bi for all observe that the lower bound relies on the assumption that the noise term be gaussian while the upper bound holds for subgaussian noise the lower bound may be generalised to other noise models such as bernoulli but does not hold for all subgaussian noise models for example it does not hold if there is no noise ηt almost surely the lower bound also applies to the adversarial framework where the rewards may be chosen arbitrarily although was not able to derive matching upper bound in this case simple modification of the algorithm bubeck and leads to an algorithm with nk nk log for all and rk where the regret is the adversarial version of the expected regret details are in the supplementary material the new results seem elegant but disappointing in the experts setting we have seen that the learner can distribute prior amongst the actions and obtain bound on the regret depending in natural way on the prior weight of the optimal action in contrast in the bandit setting the learner pays an enormously higher price to obtain small regret with respect to even single arm in fact the learner must essentially choose single arm to favour after which the regret for the remaining arms has very limited flexibility unlike in the experts setting if even single arm enjoys constant regret then the regret with respect to all other arms is necessarily linear preliminaries use the same notation as bubeck and define ti to be the number of times action has been chosen after time step and to be the empirical estimate of µi from the first times action was sampled this means that ti is the empirical estimate of µi at the start of the tth round use the convention that since the noise model is we have µi exp this result is presumably well known but proof is included in the supplementary material for convenience the optimal arm is arg maxi µi with ties broken in some arbitrary way the optimal reward is maxi µi the gap between the mean rewards of the jth arm and the optimal arm is µj and µi µj the vector of regrets is rπ rk and has been defined already in eq write rπ rk if riπ bi for all for vector rπ and we have rπ riπ understanding the frontier before proving the main theorem briefly describe the features of the regret frontier first notice that if bi for all then xp bi bj thus as expected this particular is witnessed up to constant factors by moss audibert and bubeck and lattimore but not ucb auer et which suffers riucb nk log of course the uniform choice of is not the only option suppose the first arm is special so should be chosen especially small assume without loss of generality that bk then by the main theorem we have bk therefore bk this also proves the claim in the abstract since it implies that bk if is fixed 
then choosing bk does not lie on the frontier because log bk pk however if log then choosing bk does lie on the frontier and is factor of log away from the lower bound given in eq therefore up the log factor points on the regret frontier are characterised entirely by permutation determining the order of regrets and the smallest regret perhaps the most natural choice of assuming again that bk is np bk for for this leads to is at most log worse than that obtained by moss and while being factor of better for select few and assumptions the assumption that is used to avoid annoying boundary problems caused by the fact that time is discrete this means that if is extremely large then even single sample from this arm can big regret bound this assumption is already quite common for example regret of kn clearly does not hold if the gaps are permitted to be unbounded unfortunately there is no perfect resolution to this annoyance most elegant would be to allow time to be continuous with actions taken up to stopping times otherwise you have to deal with the problem with special cases or make assumptions as have done here lower bounds theorem assume ηt is sampled from standard gaussian let be an arbitrary strategy then rπ proof assume without loss of generality that mini riπ if this is not the case then simply the actions if then the result is trivial from now on assume let and define crkπ εk min define vectors µk rk by µk εk if if otherwise therefore the optimal action for the bandit with means µk is let rkπ and and assume then crkπ eπµk tk rkπ rµπk εk eπµk tj εk eπµk tk where follows since is the regret with respect to arm since the gap between the means of the kth arm and any other arm is atp least εk note that this is also true for since mink εk follows from the fact that ti and from the definition of εk therefore eπµk tk rkπ therefore for with we have eπµk tk tk nεk tk nεk tk nεk tk where follows from standard entropy inequalities and similar argument as used by auer et al details in supplementary material since and tk and by eq therefore tk which implies that εk tk εk rkπ therefore for all we have riπ rkπ rkπ therefore rkπ rk rkπ which implies that rπ as required upper bounds now show that the lower bound derived in the previous section is tight up to constant factors the algorithm is generalisation moss audibert and bubeck with two modifications first the width of the confidence bounds are biased in way and second the upper confidence bounds are shifted the new algorithm is functionally identical to moss in the special case that bi is uniform define max log input and bk ni for all for do it arg max ti end for ni ti ti ni algorithm unbalanced moss theorem let then the strategy given in algorithm satisfies rπ corollary for all the following hold rµ rµ mini the second part of the corollary is useful when is large but there exists an arm for which and bi are both small the proof of theorem requires few lemmas the first is somewhat standard concentration inequality that follows from combination of the peeling argument and doob maximal inequality the proof may be found in the supplementary material lemma let zi max µi then zi for all in the analysis of traditional bandit algorithms the gap measures how quickly the algorithm can detect the difference between arms and by pdesign however algorithm is negatively biasing its estimate of the empirical mean of arm by this has the effect of shifting the gaps which ji and define to be denote by ji µi µj lemma define stopping time τji by τji min µj ji then tj 
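For orientation, the sketch below implements a standard MOSS-style index policy in which the uniform exploration budget is replaced by a per-arm quantity n_i. This is only a simplified stand-in for Algorithm 1: the exact choice of n_i, the constants, and the additional shift applied to the upper confidence bounds are as specified in the paper, and are not reproduced here.

```python
import numpy as np

def log_plus(x):
    return max(1.0, np.log(x)) if x > 0 else 1.0

def moss_style_index(mean, pulls, n, n_i):
    """MOSS-style upper confidence index with a per-arm budget n_i in place of the
    uniform K; Algorithm 1 additionally shifts these indices (omitted here)."""
    return mean + np.sqrt(4.0 * log_plus(n / (n_i * pulls)) / pulls)

def run_policy(mu, n, n_budgets, seed=0):
    rng = np.random.default_rng(seed)
    K = len(mu)
    pulls, sums = np.zeros(K), np.zeros(K)
    for t in range(n):
        if t < K:
            arm = t                                   # pull every arm once to initialize
        else:
            idx = [moss_style_index(sums[i] / pulls[i], pulls[i], n, n_budgets[i]) for i in range(K)]
            arm = int(np.argmax(idx))
        pulls[arm] += 1
        sums[arm] += mu[arm] + rng.normal()           # Gaussian reward
    return n * np.max(mu) - np.dot(pulls, mu)         # pseudo-regret with respect to the best arm

mu, n = [0.5, 0.4, 0.3], 20_000
print(run_policy(mu, n, n_budgets=[1.0, 1.0, 1.0]))   # uniform budgets recover ordinary MOSS-like behaviour
```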
τji if zi proof let be the first time step such that tj τji then nj ji tj µj tj tj ji ji µj ji µi ni ti ti ti which implies that arm will not be chosen at time step and so also not for any subsequent time steps by the same argument and induction therefore tj τji nj ji ji then eτji lemma if productlog ji ji proof let be defined by productlog nj ji ji nj therefore eτji τji µi ji ji ji exp µi nj productlog ji where the last inequality follows since proof of theoremp let ni and then for we have ji and ji letting we have rµ tj tj ji τji max zi productlog nj ji zi zi nj zi zi ji on the other hand where follows by using lemma to bound tj τji when zi the total number of pulls for arms for which zi is at most follows by bounding τji in expectation using lemma follows from basic calculus and because for we have all that remains is to bound the expectation zi zi zi zi dz ni ni where have used lemma and simple identities putting it together we obtain rµ nj where applied the assumption and so nj bi the above proof may be simplified in the special case that is uniform where we recover the minimax regret of moss but with perhaps simpler proof than was given originally by audibert and bubeck on logarithmic regret in recent technical report demonstrated empirically that moss suffers problemdependent regret in terms of the minimum gap lattimore specifically it can happen that moss rµ log where mini on the other hand the asymptotic regret can be significantly smaller specifically ucb by auer et al satisfies ucb rµ log which for unequal gaps can be much smaller than eq and is asymptotically lai and robbins the problem is that moss explores only enough to obtain minimax regret but sometimes obtains minimax regret even when more conservative algorithm would do better it is worth remarking that this effect is harder to observe than one might think the example given in the afforementioned technical report is carefully tuned to exploit this failing but still requires and before significant problems arise in all other experiments moss was performing admirably in comparison to ucb problems can be avoided by modifying ucb rather than moss the cost is factor of log the algorithm is similar to algorithm but chooses the action that maximises the following index log log it arg max ti ni where is fixed arbitrary constant theorem if is the strategy of unbalanced ucb with ni and then the regret of the unbalanced ucb satisfies regret rµ log regret let log then log rµ log the proof is deferred to the supplementary material the indicator function in the problemdependent bound vanishes for sufficiently large provided log which is equivalent to log thus for reasonable choices of bk the algorithm is going to enjoy the same asymptotic performance as ucb theorem may be proven for any algorithm for which it can be shown that eti log which includes for example et and thompson sampling see analysis by agrawal and goyal and original paper by thompson but not lattimore or moss audibert and bubeck experimental results compare moss and unbalanced moss in two simple simulated examples both with horizon each data point is an empirical average of samples so error bars are too small to see is available in the supplementary material the first experiment has arms and and plotted the results for for varying as predicted the new algorithm performs significantly better than moss for positive and significantly worse otherwise fig the second experiment has arms this time and bk with results are shown for µk for and again the results agree with the 
theory the unbalanced algorithm is superior to moss for and inferior otherwise fig moss moss regret regret figure figure sadly the experiments serve only to highlight the plight of the biased learner which suffers significantly worse results than its unbaised counterpart for most actions discussion have shown that the cost of favouritism for bandit algorithms is rather serious if an algorithm exhibits small regret for specific action then the regret of the remaining actions is necessarily significantly larger than the uniform bound of kn this unfortunate result is in stark contrast to the experts setting for which there exist algorithms that suffer constant regret with respect to single expert at almost no cost for the remainder surprisingly the best achievable bounds are determined up to permutation almost entirely by the value of the smallest regret there are some interesting open questions most notably in the adversarial setting am not sure if the upper or lower bound is tight or neither it would also be nice to know if the constant factors can be determined exactly asymptotically but so far this has not been done even in the uniform case for the stochastic setting it is natural to ask if the algorithm can also be modified intuitively one would expect this to be possible but it would require the very long proof acknowledgements am indebted to the very careful reviewers who made many suggestions for improving this paper thank you references shipra agrawal and navin goyal further optimal regret bounds for thompson sampling in proceedings of international conference on artificial intelligence and statistics aistats shipra agrawal and navin goyal analysis of thompson sampling for the bandit problem in proceedings of conference on learning theory colt audibert and bubeck minimax policies for adversarial and stochastic bandits in colt pages peter auer nicolo yoav freund and robert schapire gambling in rigged casino the adversarial bandit problem in foundations of computer science annual symposium on pages ieee peter auer and paul fischer analysis of the multiarmed bandit problem machine learning bubeck and regret analysis of stochastic and nonstochastic multiarmed bandit problems foundations and trends in machine learning now publishers incorporated isbn olivier garivier maillard munos and gilles stoltz upper confidence bounds for optimal sequential allocation the annals of statistics nicolo prediction learning and games cambridge university press eyal michael kearns yishay mansour and jennifer wortman regret to the best regret to the average machine learning marcus hutter and jan poland adaptive online prediction by following the perturbed leader the journal of machine learning research michael kapralov and rina panigrahy prediction strategies without loss in advances in neural information processing systems pages wouter koolen the pareto regret frontier in advances in neural information processing systems pages tze leung lai and herbert robbins asymptotically efficient adaptive allocation rules advances in applied mathematics tor lattimore optimally confident ucb improved regret for bandits technical report url http liu and lihong li on the prior sensitivity of thompson sampling arxiv preprint amir sani gergely neu and alessandro lazaric exploiting easy data in online optimization in advances in neural information processing systems pages william thompson on the likelihood that one unknown probability exceeds another in view of the evidence of two samples biometrika 
on the limitation of spectral methods from the gaussian hidden clique problem to perturbations of gaussian tensors andrea montanari department of electrical engineering and department of statistics stanford university montanari daniel reichman department of cognitive and brain sciences university of california berkeley ca ofer zeitouni faculty of mathematics weizmann institute rehovot israel and courant institute new york university abstract we consider the following detection problem given realization of symmetric matrix of dimension distinguish between the hypothesis that all upper triangular variables are gaussians variables with mean and variance and the hypothesis that there is planted principal submatrix of dimension for which all upper triangular variables are gaussians with mean and variance whereas all other upper triangular elements of not in are gaussians variables with mean and variance we refer to this as the gaussian hidden clique problem when it is possible to solve this detection problem with probability on by computing the spectrum of and considering the largest eigenvalue of we prove that when no algorithm that examines only the eigenvalues of can detect the existence of hidden gaussian clique with error probability vanishing as the result above is an immediate consequence of more general result on perturbations of gaussian tensors in this context we establish lower bound on the critical ratio below which signal can not be detected introduction consider the following problem one is given symmetric matrix of dimension such that the entries xi are mutually independent random variables given realization of one would like to distinguish between the hypothesis that all random variables xi have the same distribution to the hypothesis where there is set with so that all random variables in the submatrix xu xs have distribution that is different from the distribution of all other elements in which are still distributed as we refer to xu as the hidden submatrix the same problem was recently studied in and for the asymmetric case where no symmetry assumption is imposed on the independent entries of in detection problems with similar flavor such as the hidden clique problem have been studied over the years in several fields including computer science physics and statistics we refer to section for further discussion of the related literature an intriguing outcome of these works is that while the two hypothesis are statistically distinguishable as soon as log for sufficiently large constant practical algorithms require significantly larger in this paper we study the class of spectral or eigenvaluebased tests detecting the hidden submatrix our proof technique naturally allow to consider two further generalizations of this problem that are of independent interests we briefly summarize our results below the gaussian hidden clique problem this is special case of the above hypothesis testing setting whereby and entries on the diagonal are defined slightly differently in order to simplify calculations here and below denote the gaussian distribution of mean and variance equivalently let be random matrix from the gaussian orthogonal ensemble goe zij independently for and zii then under hypothesis we have being the indicator vector of and under hypothesis the factor in the normalization is for technical convenience the gaussian hidden clique problem can be thought of as the following clustering problem there are elements and the entry measures the similarity between elements and the hidden 
submatrix corresponds to cluster of similar elements and our goal is to determine given the matrix whether there is large cluster of similar elements or alternatively whether all similarities are essentially random gaussian noise our focus in this work is on the following restricted hypothesis testing question let λn be the ordered eigenvalues of is there test that depends only on λn and that distinguishes from reliably with error probability converging to as notice that the eigenvalues distribution does not depend on as long as this is independent from the noise we can therefore think of as fixed for this question historically the first polynomial time algorithm for detecting planted clique of size in random graph relied on spectral methods see section for more details this is one reason for our interest in spectral tests for the gaussian hidden clique problem if then implies that simple test checking whether for some is reliable for the gaussian hidden clique problem we prove that this result is tight in the sense that no spectral test is reliable for matrices in gaussian noise our proof technique builds on simple observation since the noise is invariant under orthogonal the above question is equivalent to the following testing problem for and rn uniformly random unit vector test versus βvvt the correspondence between the two problems yields again this problem and closely related asymmetric version has been studied in the literature and it follows from that reliable test exists for we provide simple proof based on the second moment method that no test is reliable for tensors in gaussian noise it turns that the same proof applies to an even more general problem detecting signal in noisy tensor we carry out our analysis in this more general setting for two reasons first we think that this clarifies the what aspects of the model are important for our proof technique to apply second the problem estimating tensors from noisy data has attracted significant interest recently within the machine learning community nk more precisely we consider noisy tensor of the form where is gaussian noise and is random unit vector we consider the problem of testing this hypothesis against we establish threshold such that no test can be reliable for in particular two differences are worth remarking for with respect to the more familiar matrix case first we do not expect the second moment bound to be tight reliable test to exist for all on the other hand we can show that it is tight up to by this we mean that for any orthogonal matrix independent of rzrt is distributed as universal and independent constant second below the problem is more difficult than the matrix version below not only no reliable test exists but asymptotically any test behaves asymptotically as random guessing for more details on our results regarding noisy tensors see theorem main result for spectral detection let be goe matrix as defined in the previous section equivalently if is an asymmetric matrix with entries gi gt for deterministic sequence of vectors kv we consider the two hypotheses βvvt example is provided by the gaussian hidden clique problem in which case and for some set observe that the distribution of eigenvalues of under either alternative is invariant to the choice of the vector or subset as long as the norm of is kept fixed therefore any successful algorithm that examines only the eigenvalues will distinguish between and but not give any information on the vector or subset in the case of we let respectively denote the 
distribution of the eigenvalues of under respectively or spectral statistical test for distinguishing between and or simply spectral test is measurable map tn λn to formulate precisely what we mean by the word distinguish we introduce the following notion definition for each let be two probability measures on the same measure space ωn fn we say that the sequence is contiguous with respect to if for any sequence of events an fn lim an lim an note that contiguity is not in general symmetric relation in the context of the spectral statistical tests described above the sequences an in definition with pn and qn can be put in correspondence with spectral statistical tests tn by taking an λn tn λn we will thus say that is spectrally contiguous with respect to if qn is contiguous with respect to pn our main result on the gaussian hidden clique problem is the following theorem for any sequence satisfying lim the hypotheses are spectrally contiguous with respect to contiguity and integrability contiguity is related to notion of uniform absolute continuity of measures recall that probability measure on measure space is absolutely continuous with respect to another probability measure if for every measurable set implies that in which case there exists function dµ dν the derivative of with respect to so that dν for every measurable set we then have the following known useful fact lemma within the setting of definition assume that is absolutely continuous with respect its derivative to and denote by λn if lim then is contiguous with respect to if then ktv where ktv denotes the total variation distance ktv sup method and structure of the paper consider problem we use the fact that the law of the eigenvalues under both and are invariant under conjugations by orthogonal matrix once we conjugate matrices sampled under the hypothesis by an independent orthogonal matrix sampled according to the haar distribution we get matrix distributed as βvvt where is uniform on the sphere and is goe matrix with entries of variance letting denote the law of βuut and denote the law of we show that is contiguous with respect to which implies that the law of eigenvalues is contiguous with respect to to show the contiguity we consider more general setup of independent interest of gaussian tensors of order and in that setup show that the derivative λn is uniformly square integrable under an application of lemma then quickly yields theorem the structure of the paper is as follows in the next section we define formally the detection problem for symmetric tensor of order we show the existence of threshold under which detection is not possible theorem and show how theorem follows from this section is devoted to the proof of theorem and concludes with some additional remarks and consequences of theorem finally section is devoted to description of the relation between the gaussian hidden clique problem and hidden clique problem in computer science and related literature symmetric tensor model and reduction exploiting rotational invariance we will reduce the spectral detection problem to detection problem involving standard detection problem between random matrices since the latter generalizes to tensor setup we first introduce general gaussian hypothesis testing for which is of independent interest we then explain how the spectral detection problem reduces to the special case of preliminaries and notation we use boldface for vectors and boldface for matrices and tensors pn the ordinary scalar product and norm over vectors are denoted 
by hu vi ui vi and kvkp we write for the unit sphere in dimensions rn nk given real order tensor we let ik ik denote its coordinates the nk outer product of two tensors is and for rn we define nk as as the outer power of we define the inner product of two tensors hx yi ik ik ik we define the frobenius euclidean norm of tensor by kxkf norm by hx xi and its operator kxkop max hx uk kui it is easy to check that this is indeed norm for the special case it reduces to the ordinary matrix operator norm equivalently to the largest singular value of for permutation sk we will denote by xπ the tensor with permuted indices ik xπ ik we call the tensor symmetric if for any permutation sk xπ it is proved that for symmetric tensors we have the equivalent representation kxkop max we define with the usual conventions of arithmetic operations the symmetric tensor model and main result nk we denote by tensor with independent and identically distributed entries ik note that this tensor is not symmetric nk we define the symmetric standard normal noise tensor by note that the subset of entries with unequal indices form an collection ik ik nk with this normalization we have for any symmetric tensor eha zi exp we will also use the fact that is invariant in distribution under conjugation by orthogonal transformations that is that for any orthogonal matrix ik has the same distribution as jk jk given parameter we consider the following model for random symmetric tensor with standard normal tensor and uniformly distributed over the unit sphere in the case this is the standard deformation of goe matrix we let pβ pβ denote the law of under model theorem for let log inf assume then for any we have lim pβ tv further for and pβ is contiguous with respect to few remarks are in order following theorem first it is not difficult to derive the asymptotic log ok for large second for we get using log that recall that for and it is known that the largest eigenvalue of converges almost surely to as consequence pβ ktv for all the second moment bound is tight for it follows by the triangle inequality that kxkop kzkop and further lim kzkop µk almost surely as for some bounded µk it follows that pβ ktv for all hence the second moment bound is off by factor for large log ok and hence the factor is indeed bounded in behavior below the threshold let us stress an important qualitative difference between and for for the two models are indistinguishable and any test is essentially as good as random guessing formally for any measurable function rn we have lim pβ for our result implies that for pβ ktv is bounded away from on the other hand it is easy to see that it is bounded away from as well lim inf pβ ktv lim sup pβ ktv indeed consider for instance the statistics tr under while under pβ hence lim inf pβ ktv kn ktv rx here is the gaussian distribution function the same phenomenon for rectangular matrices is discussed in detail in reduction of spectral detection to the symmetric tensor model recall that in the setup of theorem is the law of the eigenvalues of under and is the law of the eigenvalues of under then is invariant by conjugation of orthogonal matrices therefore the detection problem is not changed if we replace by rxrt rzrt where is an orthogonal matrix sampled according to the haar measure direct calculation yields βvvt is goe matrix with offwhere is uniform on the dimensional sphere and are independent of one another diagonal entries of variance furthermore and note that with we can relate the detection let be the law of problem 
of to the detection problem of as follows lemma if is contiguous with respect to then is spectrally contiguous with respect to we have ktv ktv in view of lemma theorem is an immediate consequence of theorem proof of theorem the proof uses the following large deviations lemma which follows for instance from proposition lemma let uniformly random vector on the unit sphere and let hv be its first coordinate then for any interval with lim log hv max log proof of theorem we denote by the derivative of pβ with respect to by definition it is easy to derive the following formula nβ nβ exp hx µn dv where µn is the uniform measure on squaring and using we get nβ exp hx µn µn nβ µn µn exp nβ ik µn µn exp nβ hv ik µn dv exp where in the first step we used and in the last step we used rotational invariance let fβ be defined by fβ qk log using lemma and varadhan lemma for any nβ hv ik hv µn dv exp max fβ exp it follows from the definition of that fβ for any hence nβ exp hv ik µn dv for some and all large enough next notice that under µn hv where and is with degrees of freedom independent of then letting zn with degrees of freedom nβ exp zn nβ exp enβ exp enβ ec dx enβ where now for any we can and will choose small enough so that both enβ exponentially fast by tail bounds on random variables and if the argument of the exponent in the integral in the right hand side of is bounded above by which is possible since the argument vanishes at hence for any and all large enough we have ec dx for some now for the integrand in is dominated by to therefore since and converges pointwise as lim for the argument is independent of and can be integrated immediately yielding after taking the limit lim sup indeed the above calculation implies that the limit exists and is given by the side the proof is completed by invoking lemma related work in the classical planted clique problem the computational problem is to find the planted clique of cardinality in polynomial time where we assume the location of the planted clique is hidden and is not part of the input there are several algorithms that recover the planted clique in polynomial time when where is constant independent of despite significant effort no polynomial time algorithm for this problem is known when in the decision version of the planted clique problem one seeks an efficient algorithm that distinguishes between random graph distributed as or random graph containing planted clique of size log for the natural threshold for the problem is the size of the largest clique in random sample of which is asymptotic to log no polynomial time algorithm is known for this decision problem if as another example consider the following setting introduced by see also one is given realization of gaussian vector xn with entries the goal is to distinguish between the following two hypotheses under the first hypothesis all entries in are standard normals under the second hypothesis one is given family of subsets sm such that for every sk and there exists an such that for any si xα is gaussian random variable with mean and unit variance whereas for every si xα is standard normal the second hypothesis does not specify the index only its existence the main question is how large must be such that one can reliably distinguish between these two hypotheses in are vertices in certain undirected graphs and the family is set of paths in these graphs the gaussian hidden clique problem is related to various applications in statistics and computational biology that detection is statistically 
possible when log was established in in terms of polynomial time detection show that detection is possible when for the symmetric cases as noted no polynomial time algorithm is known for the gaussian hidden clique problem when it was hypothesized that the gaussian hidden clique problem should be difficult when the closest results to ours are the ones of in the language of the present paper these authors consider rectangular matrix of the form whereby has entries zij is deterministic of unit norm and has entries which are independent of they consider the problem of testing this distribution against setting it is proved in that the distribution of the singular values of under the null and the alternative are mutually contiguous if and not mutually contiguous if while derive some more refined results their proofs rely on advanced tools from random matrix theory while our proof is simpler and generalizable to other settings tensors references broutin devroye lugosi on combinatorial testing problems annals of statistics alon krivelevich and sudakov finding large hidden clique in random graph random structures and algorithms anderson guionnet and zeitouni an introduction to random matrices cambridge university press helgason and zeitouni searching for trail of evidence in maze annals of statistics auffinger ben arous and cerny random matrices and complexity of spin glasses communications on pure and applied mathematics balakrishnan kolar rinaldo singh and wasserman statistical and computational tradeoffs in biclustering nips workshop on computational in statistical learning bhamidi dey and nobel energy landscape for large average submatrix detection problems in gaussian random matrices deshpande and montanari finding hidden cliques of size in nearly linear time foundations of computational mathematics dembo and zeitouni matrix optimization under random external fields feige and krauthgamer finding and certifying large hidden clique in graph random struct algorithms and the largest eigenvalue of rank one deformation of large wigner matrices comm math phys and the eigenvalues of random symmetric matrices combinatorica guionnet and maida fourier view on and related asymptotics of spherical integrals journal of functional analysis grimmett and mcdiarmid on colouring random graphs math proc cambridge philos soc hsu kakade and zhang spectral algorithm for learning hidden markov models journal of computer and system sciences jerrum large cliques elude the metropolis process random struct algorithms knowles and yin the isotropic semicircle law and deformation of wigner matrices communications on pure and applied mathematics kolar balakrishnan rinaldo and singh minimax localization of structural information in large noisy matrices neural information processing systems nips talagrand free energy of the spherical mean field model probability theory and related fields ma and wu computational barriers in minimax submatrix detection montanari and richard statistical model for tensor pca neural information processing systems nips onatski moreira hallin et al asymptotic power of sphericity tests for data the annals of statistics waterhouse the estimate for symmetric multilinear forms linear algebra and its applications 
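The main spectral result above says that for the rank-one deformation X = Z + beta v v^T no eigenvalue-based test is reliable when beta < 1, whereas thresholding the largest eigenvalue succeeds when beta > 1. A minimal numpy simulation, with my own normalisation choices (GOE scaled so the bulk spectral edge sits at 2) and an arbitrary beta, illustrates the BBP-type behaviour lambda_max approaching beta + 1/beta above the threshold; it is a sketch of the phenomenon, not code from the paper.

import numpy as np

def sample_goe(n, rng):
    # symmetric Gaussian matrix with off-diagonal variance 1/n, so lambda_max -> 2 under H0
    a = rng.normal(size=(n, n))
    return (a + a.T) / np.sqrt(2 * n)

def largest_eigenvalue(m):
    return np.linalg.eigvalsh(m)[-1]

n, beta, trials = 400, 1.5, 20          # beta > 1: detectable; beta < 1: spectrum uninformative
rng = np.random.default_rng(0)
v = rng.normal(size=n)
v /= np.linalg.norm(v)                   # planted unit vector, uniform direction
null = [largest_eigenvalue(sample_goe(n, rng)) for _ in range(trials)]
alt = [largest_eigenvalue(sample_goe(n, rng) + beta * np.outer(v, v)) for _ in range(trials)]
print("H0 mean lambda_max:", np.mean(null))   # close to 2
print("H1 mean lambda_max:", np.mean(alt))    # close to beta + 1/beta ~ 2.17 for beta = 1.5

Rerunning the same script with beta below 1 shows the two empirical distributions of lambda_max overlapping, which is the regime the contiguity argument covers.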
measuring sample quality with stein method jackson gorham department of statistics stanford university lester mackey department of statistics stanford university abstract to improve the efficiency of monte carlo estimation practitioners are turning to biased markov chain monte carlo procedures that trade off asymptotic exactness for computational speed the reasoning is sound reduction in variance due to more rapid sampling can outweigh the bias introduced however the inexactness creates new challenges for sampler and parameter selection since standard measures of sample quality like effective sample size do not account for asymptotic bias to address these challenges we introduce new computable quality measure based on stein method that quantifies the maximum discrepancy between sample and target expectations over large class of test functions we use our tool to compare exact biased and deterministic sample sequences and illustrate applications to hyperparameter selection convergence rate assessment and quantifying tradeoffs in posterior inference introduction when faced with complex target distribution one often turns to rmarkov chain monte carlo mcmc to approximate intractable expectations ep dx with asymppn totically exact sample estimates eq xi xi these complex targets commonly arise as posterior distributions in bayesian inference and as candidate distributions in maximum likelihood estimation in recent years researchers have introduced asymptotic bias into mcmc procedures to trade off asymptotic correctness for improved sampling speed the rationale is that more rapid sampling can reduce the variance of monte carlo estimate and hence outweigh the bias introduced however the added flexibility introduces new challenges for sampler and parameter selection since standard sample quality measures like effective sample size asymptotic variance trace and mean plots and pooled and variance diagnostics presume eventual convergence to the target and hence do not account for asymptotic bias to address this shortcoming we develop new measure of sample quality suitable for comparing asymptotically exact asymptotically biased and even deterministic sample sequences the quality measure is based on stein method and is attainable by solving linear program after outlining our design criteria in section we relate the convergence of the quality measure to that of standard probability metrics in section develop streamlined implementation based on geometric spanners in section and illustrate applications to hyperparameter selection convergence rate assessment and the quantification of tradeoffs in posterior inference in section we discuss related work in section and defer all proofs to the appendix notation we denote the and norms on rd by and respectively we will often refer to generic norm on rd with associated dual norms kwk hw vi for vectors rd km km vk for matrices and kt kt for tensors we denote the standard basis vector by ej the partial derivative xk by rk and the gradient of any rd function by rg with components rg jk rk gj quality measures for samples consider target distribution with open convex support rd and continuously differentiable density we assume that is known up to its normalizing constant and that exact integration under is intractable for most functions of interest we will approximate expectations under with the aid of weighted sample collection of distinct sample points xn with weights xi encoded in probability mass function thepprobability mass function induces discrete distrin 
bution and an approximation eq xi xi for any target expectation ep we make no assumption about the provenance of the sample points they may arise as random draws from markov chain or even be deterministically selected our goal is to compare the fidelity of different samples approximating common target distribution that is we seek to quantify the discrepancy between eq and ep in manner that detects when sequence of samples is converging to the target ii detects when sequence of samples is not converging to the target and iii is computationally feasible we begin by considering the maximum deviation between sample and target expectations over class of test functions dh sup ep when the class of test functions is sufficiently large the convergence of dh qm to zero implies that the sequence of sample measures qm converges weakly to in this case the expression is termed an integral probability metric ipm by varying the class of test functions we can recover many probability metrics as ipms including the total variation distance generated by and the wasserstein distance also known as the or earth mover distance generated by kx yk the primary impediment to adopting an ipm as sample quality measure is that exact computation is typically infeasible when generic integration under is intractable however we could skirt this intractability by focusing on classes of test functions with known expectation under for example if we consider only test functions for which ep then the ipm value dh is the solution of an optimization problem depending on alone this at high level is our strategy but many questions remain how do we select the class of test functions how do we know that the resulting ipm will track convergence and of sample sequence desiderata and ii how do we solve the resulting optimization problem in practice desideratum iii to address the first two of these questions we draw upon tools from charles stein method of characterizing distributional convergence we return to the third question in section stein method stein method for characterizing convergence in distribution classically proceeds in three steps identify operator acting on set of rd functions of for which ep for all together and define the stein discrepancy sup sup ep dt an quality measure with no explicit integration under lower bound the stein discrepancy by familiar ipm dh this step can be performed once in advance for large classes of target distributions and ensures that for any sequence of probability measures µm µm converges to zero only if dh µm does desideratum ii upper bound the stein discrepancy by any means necessary to demonstrate convergence to zero under suitable conditions desideratum in our case the universal bound established in section will suffice scalar functions are more common in stein method but we will find rd more convenient while stein method is typically employed as an analytical tool we view the stein discrepancy as promising candidate for practical sample quality measure indeed in section we will adopt an optimization perspective and develop efficient procedures to compute the stein discrepancy for any sample measure and appropriate choices of and first we assess the convergence properties of an equivalent stein discrepancy in the subsections to follow identifying stein operator the generator method of barbour provides convenient and general means of constructing operators which produce functions under let zt represent markov process with unique stationary distribution then the infinitesimal generator of zt 
defined by au lim zt for rd satisfies ep au under mild conditions on and hence candidate operator can be constructed from any infinitesimal generator for example the overdamped langevin diffusion defined by the stochastic differential equation dzt log zt dt dwt for wt wiener process gives rise to the generator hru log hr ru after substituting for ru we obtain the associated stein ap tp hg log hr the stein operator tp is particularly to our setting as it depends on only through the derivative of its log density and hence is computable even when the normalizing constant of is not if we let denote the boundary of an empty set when rd and represent the outward unit normal vector to the boundary at then we may define the classical stein set rg krg sup max kg krg and kx yk hg with defined of sufficiently smooth functions satisfying boundary condition the following proposition consequence of integration by parts shows that is suitable domain for tp proposition if ep kr log then ep tp for all together tp and form the classical stein discrepancy tp our chief object of study lower bounding the classical stein discrepancy in the univariate setting it is known for wide variety of targets that the classical stein discrepancy µm tp converges to zero only if the wasserstein distance µm does in the multivariate setting analogous statements are available for multivariate gaussian targets but few other target distributions have been analyzed to extend the reach of the multivariate literature we show in theorem that the classical stein discrepancy also determines wasserstein convergence for large class of strongly densities including the bayesian logistic regression posterior under gaussian priors theorem stein discrepancy lower bound for strongly densities if rd and log is strongly concave with third and fourth derivatives bounded and continuous then for any probability measures µm µm tp only if µm we emphasize that the sufficient conditions in theorem are certainly not necessary for lower bounding the classical stein discrepancy we hope that the theorem and its proof will provide template for lower bounding tp for other large classes of multivariate target distributions the operator tp has also found fruitful application in the design of monte carlo control variates upper bounding the classical stein discrepancy we next establish sufficient conditions for the convergence of the classical stein discrepancy to zero proposition stein discrepancy upper bound if and with log integrable tp kx zk kr log log log kx zk kr log log kr log kx zk one implication of proposition is that qm tp converges to zero whenever xm qm converges in to and log xm converges in mean to log extension to stein sets the analyses and algorithms in this paper readily accommodate stein sets of the form kg krg krg rg and max kx yk hg with defined for constants known as stein factors in the literature we will exploit this additional flexibility in section to establish tight relations between the stein discrepancy and wasserstein distance for target distributions for general use however we advocate the classical stein set and graph stein sets to be introduced in the sequel indeed any stein discrepancy is equivalent to the classical stein discrepancy in strong sense proposition equivalence of stein discrepancies for any min tp tp max tp computing stein discrepancies in this section we introduce an efficiently computable stein discrepancy with convergence properties equivalent to those of the classical discrepancy we restrict attention to the 
unconstrained domain rd in sections and present extensions for constrained domains in section graph stein discrepancies evaluating stein discrepancy tp for fixed pair reduces to solving an optimization program over functions for example the classical stein discrepancy is the optimum pn tp sup xi hg xi log xi hr xi kg krg krg rg kx yk note that the objective associated with any stein discrepancy tp is linear in and since is discrete only depends on and rg through their values at each of the sample points xi the primary difficulty in solving the classical stein program stems from the infinitude of constraints imposed by the classical stein set one way to avoid this difficulty is to impose the classical smoothness constraints at only finite collection of points to this end for each finite graph with vertices and edges we define the graph stein set rd max kg krg and max kg krg rg kg rg kg rg kx yk kx yk kx yk kx yk the family of functions which satisfy the classical constraints and certain implied taylor compatibility constraints at pairs of points in remarkably if the graph consists of edges between all distinct sample points xi then the associated complete graph stein discrepancy tp is equivalent to the classical stein discrepancy in the following strong sense proposition equivalence of classical and complete graph stein discrepancies if rd and supp with xi xl supp xi xl then tp tp tp where is constant independent of depending only on the dimension and norm proposition follows from the extension theorem for smooth functions and implies that the complete graph stein discrepancy inherits all of the desirable convergence properties of the classical discrepancy however the complete graph also introduces order constraints rendering computation infeasible for large samples to achieve the same form of equivalence while enforcing only constraints we will make use of sparse geometric spanner subgraphs geometric spanners for given dilation factor is graph with weight kx yk on each edge and path between each pair with total weight no larger than the next proposition shows that spanner stein discrepancies enjoy the same convergence properties as the complete graph stein discrepancy proposition equivalence of spanner and complete graph stein discrepancies if rd gt supp is and supp xi xl supp xi xl then tp tp gt tp moreover for any norm with edges can be computed in log expected time for constant depending only on and as result we will adopt stein discrepancy tp as our standard quality measure decoupled linear programs the final unspecified component of our stein discrepancy is the choice of norm we recommend the norm as the resulting optimization problem decouples into independent linear programs lps that can be solved in parallel more precisely tp equals pd sup vi ji rj log vi jji and vi vl ei el ji jl ei vi vl max kvjii vljlk kv kv vl ji we have arbitrarily numbered the elements vi of the vertex set so that value gj vi and jki represents the gradient value rk gj vi jl el vi kvi vl ji vl represents the function constrained domains small modification to the unconstrained formulation extends our tractable stein discrepancy computation to any domain defined by coordinate boundary constraints that is to with for all specifically for each dimension we augment the coordinate linear program of with the boundary compatibility constraints max jibj ji vjji bij for each bj and ij these additional constraints ensure that our candidate function and gradient values can be extended to smooth function satisfying the 
boundary conditions hg on proposition in the appendix shows that the spanner stein discrepancy so computed is strongly equivalent to the classical stein discrepancy on algorithm summarizes the complete solution for computing our recommended spanner stein discrepancy in the multivariate setting notably the spanner step is unnecessary in the univariate setting as the complete graph stein discrepancy tp can be computed directly by sorting the sample and boundary points and only enforcing constraints between consecutive points in this ordering thus the complete graph stein discrepancy is our recommended quality measure when and recipe for its computation is given in algorithm algorithm multivariate spanner stein discrepancy input coordinate bounds with for all compute sparse of supp for to do in parallel rj solve coordinate linear program with graph and boundary constraints pd return rj algorithm univariate complete graph stein discrepancy input bounds with ort xn return sup dx log and max experiments we now turn to an empirical evaluation of our proposed quality measures we compute all spanners using the efficient greedy spanner implementation of bouts et al and solve all optimization programs using julia for mathematical programming with the default gurobi solver all reported timings are obtained using single core of an intel xeon cpu simple example we begin with simple example to illuminate few properties of the stein diagnostic for the target we generate sequence of sample points from the target and second sequence from scaled student distribution with matching variance and degrees of freedom the left panel of figure shows that the complete graph stein discrepancy applied to the first gaussian sample points decays to zero at an rate while the discrepancy applied to the scaled student sample remains bounded away from zero the middle panel displays optimal stein functions recovered by the stein program for different sample sizes each yields test function tp featured in the right panel that best discriminates the sample from the target notably the student test functions exhibit relatively large magnitude values in the tails of the support comparing discrepancies we show in theorem in the appendix that when the classical stein discrepancy is the optimum of convex quadratically constrained quadratic program with linear objective variables and constraints this offers the opportunity to directly compare the behavior of the graph and classical stein discrepancies we will also compare to the wasserstein distance number of sample points sample gaussian scaled student tp stein discrepancy scaled student gaussian scaled student gaussian figure left complete graph stein discrepancy for target middle right optimal stein functions and discriminating test functions tp recovered by the stein program seed seed gaussian discrepancy classical stein wasserstein uniform discrepancy value seed complete graph stein number of sample points figure comparison of discrepancy measures for sample sequences drawn from their targets which is computable for simple univariate target distributions and provably lower bounds the stein discrepancies with for unif and for for and unif targets and several random number generator seeds we generate sequence of sample points from the target distribution and plot the nonuniform classical and complete graph stein discrepancies and the wasserstein distance as functions of the first sample points in figure two apparent trends are that the graph stein discrepancy very closely 
approximates the classical and that both stein discrepancies track the fluctuations in wasserstein distance even when magnitude separation exists in the unif case the wasserstein distance in fact equals the classical stein discrepancy because tp is lipschitz function selecting sampler hyperparameters stochastic gradient langevin dynamics sgld with constant step size is biased mcmc procedure designed for scalable inference it approximates the overdamped langevin diffusion but because no mh correction is used the stationary distribution of sgld deviates increasingly from its target as grows if is too small however sgld explores the sample space too slowly hence an appropriate choice of is critical for accurate posterior inference to illustrate the value of the stein diagnostic for this task we adopt the bimodal gaussian mixture model gmm posterior of as our target for range of step sizes we use sgld with minibatch size to draw independent sequences of length and we select the value of with the highest median quality either the maximum effective sample size ess standard diagnostic based on autocorrelation or the minimum spanner stein discrepancy across these sequences the average discrepancy computation consumes for spanner construction and per coordinate linear program as seen in figure ess which does not detect distributional bias selects the largest step size presented to it while the stein discrepancy prefers an intermediate value the rightmost plot of figure shows that representative sgld sample of size using the selected by ess is greatly overdispersed the leftmost is greatly underdispersed due to slow mixing the middle sample with selected by the stein diagnostic most closely resembles the true posterior quantifying the approximate random walk mh arwmh sampler is second biased mcmc procedure designed for scalable posterior inference its tolerance parameter controls the number of datapoint likelihood evaluations used to approximate the standard mh correction step qualitatively larger implies fewer likelihood computations more rapid sampling and more rapid reduction of variance smaller yields closer approximation to the mh correction and less bias in the sampler stationary distribution we will use the stein discrepancy to explicitly quantify this we analyze dataset of prostate cancer patients with six binary predictors and binary outcome indicating whether cancer has spread to surrounding lymph nodes our target is the bayesian logistic regression posterior under prior on the parameters we run rwmh and arwmh and batch size for likelihood evaluations discard the points from the first evaluations and thin the remaining points to sequences of length the discrepancy computation time for points averages for the spanner and for coordinate lp figure displays the spanner stein discrepancy applied to the first points in each sequence as function of the likelihood evaluation count we see that the approximate sample is of higher stein quality for smaller computational budgets but is eventually overtaken by the asymptotically exact sequence diagnostic ess step size step size step size diagnostic spanner stein step size step size selection criteria spanner stein discrepancy stein discrepancy minimized at second moment error mean error normalized prob error sgld sample points with equidensity contours of overlaid figure ess maximized at discrepancy log median diagnostic hyperparameter number of likelihood evaluations figure curves for bayesian logistic regression with approximate rwmh to corroborate 
our result we use langevin chain of length as surrogate for the target and compute several error measures for each sample normalized probability max error maxl hx wl hz wl mean error maxjj zjj and second moment max error max for and wl the datapoint covariate zj zk vector the measures also found in figure accord with the stein discrepancy quantification assessing convergence rates the stein discrepancy can also be used to assess the quality of deterministic sample sequences in figure in the appendix for unif we plot the complete graph stein discrepancies of the first points of an unif sample deterministic sobol sequence and deterministic kernel herding sequence defined by the norm khkh dx we use the median value over sequences in the case and estimate the convergence rate for each sampler using the slope of the best least squares affine fit to each plot the discrepancy computation time averages for points and theprecovered rates of and for the and sobol sequences accord with expected and bounds from the literature as witnessed also inp other metrics the herding rate of outpaces its best known bound of dh qn suggesting an opportunity for sharper analysis discussion of related work we have developed quality measure suitable for comparing biased exact and deterministic sample sequences by exploiting an infinite class of known target functionals the diagnostics of also account for asymptotic bias but lose discriminating power by considering only finite collection of functionals for example for target the score statistic of can not distinguish two samples with equal first and second moments maximum mean discrepancy mmd on characteristic hilbert space takes full distributional bias into account but is only viable when the expected kernel evaluations are easily computed under the target one can approximate mmd but this requires access to separate trustworthy sample from the target acknowledgments the authors thank madeleine udell andreas eberle and jessica hwang for their pointers and feedback and quirijn bouts kevin buchin and francis bach for sharing their code and counsel references brooks gelman jones and meng handbook of markov chain monte carlo crc press geyer markov chain monte carlo maximum likelihood computer science and statistics proc symp interface pages welling and teh bayesian learning via stochastic gradient langevin dynamics in icml ahn korattikara and welling bayesian posterior sampling via stochastic gradient fisher scoring in proceedings of international conference on machine learning icml korattikara chen and welling austerity in mcmc land cutting the budget in proceedings of international conference on machine learning icml integral probability metrics and their generating classes of functions advances in applied probability pp stein bound for the error in the normal approximation to the distribution of sum of dependent random variables in proceedings of the sixth berkeley symposium on mathematical statistics and probability volume probability theory pages berkeley ca university of california press barbour stein method and poisson process convergence appl special vol celebration of applied probability oates girolami and chopin control functionals for monte carlo integration october hy chen goldstein and shao normal approximation by stein method springer science business media chatterjee and shao nonnormal approximation by stein method of exchangeable pairs with application to the model annals of applied probability reinert and multivariate normal approximation with stein 
method of exchangeable pairs under general linearity condition annals of probability chatterjee and meckes multivariate normal approximation using exchangeable pairs alea meckes on stein method for multivariate normal approximation in high dimensional probability the luminy volume pages institute of mathematical statistics glaeser de quelques tayloriennes analyse erratum insert to no shvartsman the whitney extension problem and lipschitz selections of mappings in jetspaces transactions of the american mathematical society chew there is planar graph almost as good as the complete graph in proc of annual symposium on computational geometry new york ny acm peleg and graph spanners journal of graph theory and mendel fast construction of nets in metrics and their applications siam journal on computing bouts ten brink and buchin framework for computing the greedy spanner in proc of annual symposium on computational geometry pages new york ny acm lubin and dunning computing in operations research using julia informs journal on computing gurobi optimization gurobi optimizer reference manual url http vallender calculation of the wasserstein distance between probability distributions on the line theory of probability its applications stein method of exchangeable pairs for the beta distribution and generalizations canty and ripley boot bootstrap functions package version roberts and tweedie exponential convergence of langevin distributions and their discrete approximations bernoulli pages sobol on the distribution of points in cube and the approximate evaluation of integrals ussr computational mathematics and mathematical physics chen welling and smola from kernel herding in uai del barrio and central limit theorems for the wasserstein distance between the empirical and the true distributions ann wang and sloan low discrepancy sequences in high dimensions how well are their projections distributed comput appl march issn bach and obozinski on the equivalence between herding and conditional gradient algorithms in proceedings of international conference on machine learning icml zellner and min gibbs sampler convergence criteria jasa fan brooks and gelman output assessment for monte carlo simulations via the score statistic journal of computational and graphical statistics gretton borgwardt rasch and smola kernel method for the in advances in neural information processing systems pages 
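Algorithm 2 above computes the univariate complete-graph Stein discrepancy by sorting the sample and enforcing the classical Stein set constraints only between consecutive points, with the Langevin operator T_p g(x) = g'(x) + g(x) d/dx log p(x) in the objective. The following scipy sketch is my reconstruction of that linear program under stated assumptions: all Stein-factor constants are set to 1, the Taylor-compatibility constraints take the quadratic form shown in the comments, and the function names are mine rather than the authors'.

import numpy as np
from scipy.optimize import linprog

def univariate_graph_stein_discrepancy(x, q, score):
    """x: sample points, q: probability weights, score: function returning d/dx log p."""
    order = np.argsort(x)
    x = np.asarray(x, float)[order]
    q = np.asarray(q, float)[order]
    n = x.size
    s = score(x)
    # variables z = [g(x_1)..g(x_n), g'(x_1)..g'(x_n)]; maximise sum_i q_i (g'(x_i) + g(x_i) s_i)
    c = -np.concatenate([q * s, q])
    A, b = [], []
    for i in range(n - 1):
        d = x[i + 1] - x[i]
        # |g'(x_{i+1}) - g'(x_i)| <= d   (smoothness enforced between consecutive points)
        r = np.zeros(2 * n); r[n + i + 1], r[n + i] = 1.0, -1.0
        A += [r, -r]; b += [d, d]
        # |g(x_{i+1}) - g(x_i) - d g'(x_i)| <= d^2/2   (assumed Taylor-compatibility form)
        r = np.zeros(2 * n); r[i + 1], r[i], r[n + i] = 1.0, -1.0, -d
        A += [r, -r]; b += [d * d / 2, d * d / 2]
        # the symmetric Taylor constraint anchored at x_{i+1}
        r = np.zeros(2 * n); r[i], r[i + 1], r[n + i + 1] = 1.0, -1.0, d
        A += [r, -r]; b += [d * d / 2, d * d / 2]
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(-1.0, 1.0)] * (2 * n), method="highs")
    return -res.fun

# toy check against a standard normal target, whose score function is -x
rng = np.random.default_rng(0)
xs = rng.normal(size=200)
print(univariate_graph_stein_discrepancy(xs, np.full(200, 1 / 200), lambda t: -t))

In the multivariate case the same construction is repeated once per coordinate over a sparse spanner graph, which is why the paper's recommended discrepancy decouples into d independent linear programs.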
bidirectional recurrent convolutional networks for yan wei liang center for research on intelligent perception and computing national laboratory of pattern recognition center for excellence in brain science and intelligence technology institute of automation chinese academy of sciences yhuang wangwei wangliang abstract super resolving video is usually handled by either sr or sr deals with each video frame independently and ignores intrinsic temporal dependency of video frames which actually plays very important role in video sr generally extracts motion information optical flow to model the temporal dependency which often shows high computational cost considering that recurrent neural networks rnns can model contextual information of temporal sequences well we propose bidirectional recurrent convolutional network for efficient sr different from vanilla rnns the recurrent full connections are replaced with convolutional connections and conditional convolutional connections from previous input layers to the current hidden layer are added for enhancing dependency modelling with the powerful temporal dependency modelling our model can super resolve videos with complex motions and achieve performance due to the cheap convolution operations our model has low computational complexity and runs orders of magnitude faster than other methods introduction since large numbers of displays have sprung up generating videos from previous contents namely video sr is under great demand recently various methods have been proposed to handle this problem which can be classified into two categories sr super resolves each of the video frames independently and sr models and exploits temporal dependency among video frames which is usually considered as an essential component of video sr existing sr methods generally model the temporal dependency by extracting subpixel motions of video frames estimating optical flow based on sparse prior integration or variation regularity but such accurate motion estimation can only be effective for video sequences which contain small motions in addition the high computational cost of these methods limits the applications several solutions have been explored to overcome these issues by avoiding the explicit motion estimation unfortunately they still have to perform implicit motion estimation to reduce temporal aliasing and achieve resolution enhancement when large motions are encountered given the fact that recurrent neural networks rnns can well model contextual information for video sequence we propose bidirectional recurrent convolutional network brcn to efficiently learn the temporal dependency for sr the proposed network exploits three convolutions feedforward convolution models visual spatial dependency between lowresolution frame and its result recurrent convolution connects the hidden layers of successive frames to learn temporal dependency different from the full recurrent connection in vanilla rnns it is convolutional connection here conditional convolution connects input layers at the previous timestep to the current hidden layer to further enhance dependency modelling to simultaneously consider the temporal dependency from both previous and future frames we exploit forward recurrent network and backward recurrent network respectively and then combine them together for the final prediction we apply the proposed model to super resolve videos with complex motions the experimental results demonstrate that the model can achieve performance as well as orders of magnitude 
faster speed than other sr methods our main contributions can be summarized as follows we propose bidirectional recurrent convolutional network for sr where the temporal dependency can be efficiently modelled by bidirectional recurrent and conditional convolutions it is an framework which does not need we achieve better performance and faster speed than existing multiframe sr methods related work we will review the related work from the following prospectives irani and peleg propose the primary work for this problem followed by freeman et al studying this problem in way to alleviate high computational complexity bevilacqua et al and chang et al introduce manifold learning techniques which can reduce the required number of image patch exemplars for further acceleration timofte et al propose the anchored neighborhood regression method yang et al and zeyde et al exploit compressive sensing to encode image patches with compact dictionary and obtain sparse representations dong et al learn convolutional neural network for sr which achieves the current result in this work we focus on sr by modelling temporal dependency in video sequences baker and kanade extract optical flow to model the temporal dependency in video sequences for video then various improvements around this work are explored to better handle visual motions however these methods suffer from the high computational cost due to the motion estimation to deal with this problem protter et al and takeda et al avoid motion estimation by employing nonlocal mean and steering kernel regression in this work we propose bidirectional recurrent and conditional convolutions as an alternative to model temporal dependency and achieve faster speed bidirectional recurrent convolutional network formulation given noisy and blurry video our goal is to obtain and version in this paper we propose bidirectional recurrent convolutional network brcn to map the frames to ones as shown in figure the proposed network contains forward recurrent convolutional and backward recurrent convolutional to model the temporal dependency from both previous and future frames note that similar bidirectional scheme has been proposed previously in the two of brcn are denoted by two black blocks with dash borders respectively in each there are four layers including the input layer the first hidden layer the second hidden layer and the output layer which are connected by three convolutional operations feedforward convolution the convolutions denoted by black lines learn visual spatial dependency between frame and its result similar configurations have also been explored previously in backward input layer frame 𝑿𝒊 first hidden layer second hidden layer second hidden layer first hidden layer output layer frame input layer frame 𝑿𝒊 forward feedforward convolution recurrent convolution conditional convolution figure the proposed bidirectional recurrent convolutional network brcn recurrent convolution the convolutions denoted by blue lines aim to model temporal dependency across video frames by connecting adjacent hidden layers of successive frames where the current hidden layer is conditioned on the hidden layer at the previous timestep we use the recurrent convolution in both forward and backward subnetworks such bidirectional recurrent scheme can make full use of the forward and backward temporal dynamics conditional convolution the convolutions denoted by red lines connect input layer at the previous timestep to the current hidden layer and use previous inputs to provide longterm 
contextual information they enhance dependency modelling with this kind of conditional connection we denote the frame sets of as xi and infer the other three layers as follows first hidden layer when inferring the first hidden layer xi or xi at the ith timestep in the forward or backward three inputs are considered the current input layer xi connected by feedforward convolution the hidden layer or at the or timestep connected by recurrent convolution and the input layer or at the or timestep connected by conditional convolution xi xi xi wvb xi where or wvb and or represent the filters of feedforward and conditional convolutions in the forward or backward respectively both of them have the size of where is the number of input channels is the filter size and is the number of filters or represents the filters of recurrent convolutions their filter size is set to to avoid border effects or represents biases the activation function is the rectified linear unit relu note that in equation the filter responses of recurrent and note that we upscale each frame in the sequence to the desired size with bicubic interpolation in advance xi vector hi xi trbm xi brcn figure comparison between trbm and the proposed brcn conditional convolutions can be regarded as dynamic changing biases which focus on modelling the temporal changes across frames while the filter responses of feedforward convolution focus on learning visual content second hidden layer this phase projects the obtained feature maps xi or xi from to dimensions which aims to capture the nonlinear structure in sequence data in addition to mapping by feedforward convolution we also consider two mappings using recurrent and conditional convolutions respectively the projected feature maps in the second hidden layer xi or xi in the forward or backward can be obtained as follows xi xi xi wvb xi where or wvb and or represent the filters of feedforward and conditional convolutions respectively both of which have the size of or represents the filters of recurrent convolution whose size is note that the inference of the two hidden layers can be regarded as representation learning phase where we could stack more hidden layers to increase the representability of our network to better capture the complex data structure output layer in this phase we combine the projected feature maps in both forward and backward to jointly predict the desired frame xi xi wvb xi where or wvb and or represent the filters of feedforward and conditional convolutions respectively their sizes are both we do not use any recurrent convolution for output layer connection with temporal restricted boltzmann machine in this section we discuss the connection between the proposed brcn and temporal restricted boltzmann machine trbm which is widely used model in sequence modelling as shown in figure trbm and brcn contain similar recurrent connections blue lines between hidden layers and conditional connections red lines between input layer and hidden layer they share the common flexibility to model and propagate temporal dependency along the time however trbm is generative model while brcn is discriminative model and trbm contains an additional connection green line between input layers for sample generation in fact brcn can be regarded as deterministic bidirectional and implementation of trbm specifically when inferring the hidden layer in brcn as illustrated in figure feedforward and conditional convolutions extract overlapped patches from the input each of which is fully connected to 
vector in the feature maps xi for recurrent convolutions since each filter size is and all the filters contain weights vector in xi is fully connected to the corresponding vector in at the previous time step therefore the patch connections of brcn are actually those of discriminative trbm in other words by setting the filter sizes of feedforward and conditional convolutions as the size of the whole frame brcn is equivalent to trbm compared with trbm brcn has the following advantages for handling the task of video superresolution brcn restricts the receptive field of original full connection to patch rather than the whole frame which can capture the temporal change of visual details brcn replaces all the full connections with convolutional ones which largely reduces the computational cost brcn is more flexible to handle videos of different sizes once it is trained on video dataset similar to trbm the proposed model can be generalized to other sequence modelling applications video motion modelling network learning through combining equations and we can obtain the desired prediction from the video where denotes the network parameters network learning proceeds by minimizing the mean square error mse between the predicted video and the groundtruth ko yk via stochastic gradient descent actually stochastic gradient descent is enough to achieve satisfying results although we could exploit other optimization algorithms with more computational cost during optimization all the filter weights of recurrent and conditional convolutions are initialized by randomly sampling from gaussian distribution with mean and standard deviation whereas the filter weights of feedforward convolution are on static images note that the pretraining step only aims to speed up training by providing better parameter initialization due to the limited size of training set this step can be avoided by alternatively using larger scale dataset we experimentally find that using smaller learning rate for the weights in the output layer is crucial to obtain good performance experimental results to verify the effectiveness we apply the proposed model to the task of video sr and present both quantitative and qualitative results as follows datasets and implementation details we use yuv format video as our training set which have been widely used in many video sr methods to enlarge the training set model training is performed in volumebased way cropping multiple overlapped volumes from training videos and then regarding each volume as training sample during cropping each volume has spatial size of and temporal step of the spatial and temporal strides are and respectively as result we can generate roughly volumes from the original dataset we test our model on variety of challenging videos including dancing flag fan treadmill and turbine which contain complex motions with severe motion blur and aliasing note that we do not have to extract volumes during testing since the convolutional operation can scale to videos of any spatial size and temporal step we generate the testing dataset with the following steps using gaussian filter with standard deviation to smooth each original frame and downsampling the frame by factor of with bicubic http here we focus on the factor of which is usually considered as the most difficult case in table the results of psnr db and running time sec on the testing video sequences video dancing flag fan treadmill turbine average video dancing flag fan treadmill turbine average bicubic psnr time sc psnr time psnr 
time psnr time anr psnr time psnr time psnr time psnr time enhancer psnr time brcn psnr time table the results of psnr db by variants of brcn on the testing video sequences feedforward convolution recurrent convolution conditional convolution bidirectional scheme video dancing flag fan treadmill turbine average brcn brcn brcn brcn brcn some important parameters of our network are illustrated as follows and note that varying the number and size of filters does not have significant impact on the performance because some filters with certain sizes are already in regime where they can almost reconstruct the videos quantitative and qualitative comparison we compare our brcn with two sr methods including and commercial software namely enhancer and seven sr methods including bicubic sc ksvd anr and the results of all the methods are compared in table where evaluation measures include both peak ratio psnr and running time time specifically compared with the sr methods anr and our method can surpass them by db which is mainly attributed to the beneficial mechanism of temporal dependency modelling brcn also performs much better than the two representative sr methods and enhancer by db and db respectively in fact most existing methods tend to fail catastrophically when dealing with very complex motions because it is difficult for them to estimate the motions with pinpoint accuracy for the proposed brcn we also investigate the impact of model architecture on the performance we take simplified network containing only feedforward convolution as benchmark and then study its several variants by successively adding other operations including bidirectional scheme recurrent and conditional convolutions the results by all the variants of brcn are shown in table where the elements in the brace represent the included operations as we can see due to the similar to we only deal with luminance channel in the ycrcb color space note that our model can be generalized to handle all the three channels by setting here we simply upscale the other two channels with bicubic method for well illustration original bicubic anr brcn figure closeup comparison among original frames and super resolved results by bicubic anr and brcn respectively efit of learning temporal dependency exploiting either recurrent convolution or conditional convolution can greatly improve the performance when combining these two convolutions together they obtain much better results the performance can still be further promoted when adding the bidirectional scheme which results from the fact that each video frame is related to not only its previous frame but also the future one in addition to the quantitative evaluation we also present some qualitative results in terms of singleframe in figure and in figure please enlarge and view these figures on the screen for better comparison from these figures we can observe that our method is able to recover more image details than others under various motion conditions running time we present the comparison of running time in both table and figure where all the methods are implemented on the brcn same machine intel cpu ghz and gb memory the publicly available codes of compared methods are anr in matlab while and ours are in python from the table and figure we can see that our brcn takes sec per frame on average which is orders of magnitude faster than the fast sr method sc it should be noted that the speed gap is not caused by the different implementations as stat sr method sr method ed in the computational 
bottleneck for existing SR methods is the accurate motion estimation. (Figure: running time vs. PSNR for all the methods.) Our model instead explores an alternative based on efficient convolutions, which has a lower computational complexity. Note that the speed of our method is worse than that of the fastest SR method, ANR; this is likely because our method involves the additional phase of temporal dependency modelling, but in exchange we achieve better PSNR. (Figure: comparison among original frames, from the top row to the bottom, of the Dancing video, and the super-resolved results by Bicubic, ANR, and BRCN, respectively.)

Filter visualization. (Figure: visualization of the learned filters of the proposed BRCN.) We visualize the learned filters of the feedforward and conditional convolutions in the figure. Some filters exhibit patterns that can be viewed as edge detectors, while others show patterns indicating that the predicted high-resolution frame is obtained by averaging over the feature maps in the second hidden layer. This averaging operation is also consistent with the corresponding reconstruction phase in existing SR methods, the difference being that our filters are automatically learned rather than pre-defined. When comparing the learned filters of the feedforward and conditional convolutions, we also observe that the patterns in the filters of the feedforward convolution are much more regular and clear.

Conclusion and future work. In this paper we have proposed the bidirectional recurrent convolutional network (BRCN) for multi-frame SR. Our main contribution is the novel use of a bidirectional scheme with recurrent and conditional convolutions for temporal dependency modelling. We have applied our model to super-resolve videos containing complex motions, and achieved better performance and faster speed. In the future we will perform comparisons with other SR methods.

Acknowledgments. This work is jointly supported by the National Natural Science Foundation of China and the National Basic Research Program of China.

References:
Video Enhancer. http:// (version).
Baker and Kanade. Super-resolution optical flow. Technical report, CMU.
Bascle, Blake, and Zisserman. Motion deblurring and super-resolution from an image sequence. European Conference on Computer Vision.
Bevilacqua, Roumy, Guillemot, and Morel. Super-resolution based on nonnegative neighbor embedding. British Machine Vision Conference.
Chang, Yeung, and Xiong. Super-resolution through neighbor embedding. IEEE Conference on Computer Vision and Pattern Recognition.
Dong, Loy, He, and Tang. Learning a deep convolutional network for image super-resolution. European Conference on Computer Vision.
Eigen, Krishnan, and Fergus. Restoring an image taken through a window covered with dirt or rain. IEEE International Conference on Computer Vision.
Freeman, Pasztor, and Carmichael. Learning low-level vision. International Journal of Computer Vision.
Glasner, Bagon, and Irani. Super-resolution from a single image. IEEE International Conference on Computer Vision.
Irani and Peleg. Improving resolution by image registration. CVGIP: Graphical Models and Image Processing.
Jain and Seung. Natural image denoising with convolutional networks. Advances in Neural Information Processing Systems.
Jia, Wang, and Tang. Image transformation based on learning dictionaries across image spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Liu and Sun. On Bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Mitzel, Pock, Schoenemann, and Cremers. Video super resolution using duality based optical flow. Pattern Recognition.
Nair and Hinton. Rectified linear units improve restricted Boltzmann machines. International Conference on Machine Learning.
Protter, Elad, Takeda, and Milanfar. Generalizing the nonlocal-means to super-resolution reconstruction. IEEE Transactions on Image Processing.
Schultz and Stevenson. Extraction of high-resolution frames from video sequences. IEEE Transactions on Image Processing.
Schuster and Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing.
Shahar, Faktor, and Irani. Space-time super-resolution from a single video. IEEE Conference on Computer Vision and Pattern Recognition.
Sutskever and Hinton. Learning multilevel distributed representations for high-dimensional sequences. International Conference on Artificial Intelligence and Statistics.
Takeda, Milanfar, Protter, and Elad. Super-resolution without explicit subpixel motion estimation. IEEE Transactions on Image Processing.
Taylor, Hinton, and Roweis. Modeling human motion using binary latent variables. Advances in Neural Information Processing Systems.
Timofte, De Smet, and Van Gool. Anchored neighborhood regression for fast super-resolution. IEEE International Conference on Computer Vision.
Xu, Ren, Liu, and Jia. Deep convolutional neural network for image deconvolution. Advances in Neural Information Processing Systems.
Yang, Wright, Huang, and Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing.
Zeyde, Elad, and Protter. On single image scale-up using sparse-representations. Curves and Surfaces.
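The formulation described above specifies, for each time step, a first hidden layer driven by a feedforward convolution of the current frame, a recurrent convolution of the previous hidden layer, and a conditional convolution of the previous frame. As a concrete illustration, the following is a minimal sketch of one such (forward) sub-network. It assumes PyTorch, single-channel (luminance) input, and illustrative hyperparameters (64 filters, 9x9 feedforward/conditional kernels, 1x1 recurrent kernels); these choices and the class name are placeholders for illustration, not the authors' released implementation.

```python
# Minimal sketch of one BRCN-style recurrent sub-network (assumptions: PyTorch,
# single-channel frames, illustrative filter counts/sizes; not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BRCNSubnet(nn.Module):
    """One (forward) recurrent convolutional sub-network.

    At time step t, the first hidden layer combines:
      - a feedforward convolution of the current frame X_t,
      - a recurrent convolution (1x1, to avoid border effects) of H_{t-1},
      - a conditional convolution of the previous frame X_{t-1}.
    """
    def __init__(self, in_ch=1, n_filters=64, k=9):
        super().__init__()
        self.feedforward = nn.Conv2d(in_ch, n_filters, k, padding=k // 2)
        self.conditional = nn.Conv2d(in_ch, n_filters, k, padding=k // 2)
        self.recurrent = nn.Conv2d(n_filters, n_filters, 1)  # 1x1 recurrent filter
        self.bias = nn.Parameter(torch.zeros(1, n_filters, 1, 1))

    def forward(self, frames):
        # frames: (T, C, H, W), already bicubically upscaled to the target size
        hidden, prev_h, prev_x = [], None, None
        for x in frames:
            x = x.unsqueeze(0)                        # (1, C, H, W)
            h = self.feedforward(x) + self.bias
            if prev_h is not None:
                # recurrent + conditional terms act as dynamically changing biases
                h = h + self.recurrent(prev_h) + self.conditional(prev_x)
            h = F.relu(h)
            hidden.append(h)
            prev_h, prev_x = h, x
        return torch.cat(hidden, dim=0)               # (T, n_filters, H, W)

# usage sketch: five upscaled luminance frames of size 32x32
video = torch.randn(5, 1, 32, 32)
feats = BRCNSubnet()(video)                            # (5, 64, 32, 32)
```

The full model would run one such sub-network forward in time and a second one backward, apply a second hidden layer in each direction, and combine the two directions in the output-layer convolutions to predict each high-resolution frame, as described in the formulation section.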
bounding errors of simon barthelmé cnrs guillaume dehaene university of geneva abstract expectation propagation is very popular algorithm for variational inference but comes with few theoretical guarantees in this article we prove that the approximation errors made by ep can be bounded our bounds have an asymptotic interpretation in the number of datapoints which allows us to study ep convergence with respect to the true posterior in particular we show that ep converges at rate of for the mean up to an order of magnitude faster than the traditional gaussian approximation at the mode we also give similar asymptotic expansions for moments of order to as well as excess cost defined as the additional kl cost incurred by using ep rather than the ideal gaussian approximation all these expansions highlight the superior convergence properties of ep our approach for deriving those results is likely applicable to many similar approximate inference methods in addition we introduce bounds on the moments of distributions that may be of independent interest introduction expectation propagation ep is an efficient approximate inference algorithm that is known to give good approximations to the point of being almost exact in certain applications it is surprising that while the method is empirically very successful there are few theoretical guarantees on its behavior indeed most work on ep has focused on efficiently implementing the method in various settings theoretical work on ep mostly represents new justifications of the method which while they offer intuitive insight do not give mathematical proofs that the method behaves as expected one recent breakthrough is due to dehaene and barthelmé who prove that in the large datalimit the ep iteration behaves like newton search and its approximation is asymptotically exact however it remains unclear how good we can expect the approximation to be when we have only finite data in this article we offer characterization of the quality of the ep approximation in terms of the distance between the true and approximate mean and variance when approximating probability distribution that is for some reason close to being gaussian natural approximation to use is the gaussian with mean equal to the mode or argmax of and with variance the inverse at the mode we call it the canonical gaussian approximation cga and its use is usually justified by appealing to the mises theorem which shows that in the limit of large amount of independent observations posterior distributions tend towards their cga this powerful justification and the ease with which the cga is computed finding the mode can be done using newton methods makes it good reference point for any method like ep which aims to offer better gaussian approximation at higher computational cost in section we introduce the cga and the ep approximation in section we give our theoretical results bounding the quality of ep approximations background in this section we present the cga and give short introduction to the ep algorithm descriptions of ep can be found in minka seeger bishop raymond et al the canonical gaussian approximation what we call here the cga is perhaps the most common approximate inference method in the machine learning cookbook it is often called the laplace approximation but this is misnomer the laplace approximation refers to approximating the integral from the integral of the cga the reason the cga is so often used is its compelling simplicity given target distribution exp we find the mode and compute the 
second derivatives of at argminφ to form gaussian approximation the cga is effectively just taylor expansion and its use is justified by the mises theorem which essentially saysqthat the cga becomes exact in the asymptotic limit roughly if pn where yn represent independent datapoints then pn in total variation cga vs gaussian ep gaussian ep as its name indicates provides an alternative way of computing gaussian approximation to target distribution there is broad overlap between the problems where ep can be applied and the problems where the cga can be used with ep coming at higher cost our contribution is to show formally that the higher computational cost for ep may well be worth bearing as ep approximations can outperform cgas by an order of magnitude to be specific we focus on the moment estimates mean and covariance computed by ep and cga and derive bounds on their distance to the true mean and variance of the target distribution our bounds have an asymptotic interpretation and under interpretation we show for example that the mean returned by ep is within an order of of the true mean where is the datapoints for the cga which uses the mode as an estimate of the mean we exhibit upper bound and we compute the error term responsible for this behavior this enables us to show that in the situations in which this error is indeed ep is better than the cga the ep algorithm we consider the task of approximating probability distribution over which we call the target distribution can be but for simplicity we focus on the case one important hypothesis that makes ep feasible is that factorizes into simple factor terms fi ep proposes to approximate each fi usually referred to as sites by gaussian function qi referred to as the it is convenient to use the parametrization of gaussians in terms of natural parameters qi βi exp ri βi which makes some of the further computations easier to understand note that ep could also be used with other exponential approximating families these gaussian approximations are computed iteratively starting from current approximation qit βit we select site for update with index we then compute the cavity distribution qjt this is very easy in natural parameters exp rjt βjt hti fi compute the hybrid distribution and its mean and variance compute the gaussian which minimizes the divergence to the hybrid ie the gaussian with same mean and variance hti argmin kl hti finally update the approximation of fi hti where the division is simply computed as subtraction between natural parameters we iterate these operations until fixed point is reached at which point we return gaussian approximation of qi the in this work we will characterize the quality of an ep approximation of we define this to be any fixed point of the iteration presented in section which could all be returned by the algorithm it is known that ep will have at least one but it is unknown under which conditions the is unique we conjecture that when all sites are one of our hypotheses to control the behavior of ep it is in fact unique but we can offer proof yet if isn logconcave it is straightforward to construct examples in which ep has multiple these open questions won matter for our result because we will show that all of ep should there be more than one produce good approximation of fixed points of ep have very interesting characterization if we note the at given the corresponding hybrid distributions and the global approximation of then the mean and variance of all the hybrids and is the as we will show in section 
this leads to very tight bound on the possible positions of these notation we will use repeatedly the following notation fi is the target distribution we want to approximate the sites fi are each approximated by gaussian qi yielding an approximation to qi the hybrids hi interpolate between and by replacing one site approximation qi with the true site fi our results make heavy use of the of the sites and the target distribution we note φi log fi and φp log φi we will introduce in section hypotheses on these functions parameter βm controls their minimum curvature and parameters kd control the maximum dth derivative we will always consider of ep where the mean and variance under all hybrids and is identical we will note these common values µep and vep we will also refer to the third and fourth centered moment of the hybrids denoted by and to the fourth moment of which is simply we will show how all these moments are related to the true moments of the target distribution which we will note for the mean and variance and for the third and fourth moment we also investigate the quality of the cga and φp where is the the mode of for approximations the expected values of all sufficient statistics of the exponential family are equal results in this section we will give tight bounds on the quality of the ep approximation ie of of the ep iteration our results lean on the properties of distributions in section we introduce new bounds on the moments of distributions the bounds show that those distributions are in certain sense close to being gaussian we then apply these results to study fixed points of ep where they enable us to compute bounds on the distance between the mean and variance of the true distribution and of the approximation given by ep which we do in section our bounds require us to assume that all sites fi are βm with slowlychanging that is if we note φi log fi φi βm φi kd the target distribution then inherits those properties from the sites noting φp log φi then φp is nβm and its higher derivatives are bounded φp nβm nkd natural concern here is whether or not our conditions on the sites are of practical interest indeed likelihoods are rare we picked these strong regularity conditions because they make the proofs relatively tractable although still technical and long the proof technique carries over to more complicated but more realistic cases one such interesting generalization consists of the case in which and all hybrids at the are with slowly changing with possibly differing constants in such case while the math becomes more unwieldy similar bounds as ours can be found greatly extending the scope of our results the results we present here should thus be understood as stepping stone and not as the final word on the quality of the ep approximation we have focused on providing rigorous but extensible proof distributions are strongly constrained distributions have many interesting properties they are of course unimodal and the family is closed under both marginalization and multiplication for our purposes however the most important property is result due to brascamp and lieb which bounds their even moments we give here an extension in the case of distributions with slowly changing as quantified by eq our results show that these are close to being gaussian the inequality states that if lc exp is βm ie βm then centered even moments of lc are bounded by the corresponding moments of gaussian with variance βm if we note these moments and µlc elc the mean of lc elc µlc βm where is the double 
factorial the product of all odd terms from to etc this result can be understood as stating that distribution must have small variance but doesn generally need to be close to gaussian with our hypothesis of slowly changing we were able to improve on this result our improved results include bound on odd moments as well as first order expansions of even moments eqs our extension to the inequality is as follows if is slowly changing in the sense that some of its higher derivatives are bounded as per eq then we can give bound on µlc showing that µlc is close to the mode of lc see eqs to and showing that lc is mostly symmetric µlc βm and we can compute the first order expansions of and and bound the errors in terms of βm and the µlc µlc with eq and we see that βm βm βm and µlc and in that µlc sense that lc is close to the gaussian with mean µlc and µlc these expansions could be extended to further orders and similar formulas can be found for the other moments of lc for example any odd moments can be bounded by ck βm with ck some constant and any even moment can be found to have expansion µlc the proof as well as more detailed results can be found in the supplement note how our result relates to the mises theorem which says that in the limit of large amount of observations posterior tends towards its cga if we consider the posterior obtained from likelihood functions that are all and slowly changing our results show the slightly different result that the moments of that posterior are close to those of gaussian with mean µlc instead of lc and µlc instead of lc this point is critical while the cga still ends up capturing the limit behavior of as µlc in the largedata limit see eq below an approximation that would return the gaussian approximation at µlc would be better this is essentially what ep does and this is how it improves on the cga computing bounds on ep approximations in this section we consider given ep qk βi and the corresponding approximation of ri βi we will show that the expected value and variance of resp µep and vep are close to the true mean and variance of resp and and also investigate the quality of the cga φp under our assumptions on the sites eq and we are able to derive bounds on the quality of the ep approximation the proof is quite involved and long and we will only present it in the supplement in the main text we give partial version we detail the first step of the demonstration which consists of computing rough bound on the distance between the true mean the ep approximation µep and the mode and give an outline of the rest of the proof let show that µep and are all close to one another we start from eq applied to φp which tells us that φp must thus be close to indeed φp φp φp φp φp nβm combining eq and we finally have let now show that µep is also close to we proceed similarly starting from eq but applied to all hybrids hi φi µep µep which is not really equivalent to eq yet recall that has mean µep we thus have βµep which gives µep µep if we sum all terms in eq the µep and thus cancel leaving us with φp µep which is equivalent to eq but for µep instead of this shows that µep is like close to at this point we can show that since they are both close to eq and µep which constitutes the first step of our computation of bounds on the quality of ep after computing this the next step is evaluating the quality of the approximation of the variance via computing vep for ep and φp for the cga from eq in both cases we find vep φp since is of order because of eq upper bound on variance 
this is decent approximation the relative error is of order we can find similarly that both ep and cga do good job of finding good approximation of the fourth moment of for ep this means that the fourth moment of each hybrid and of are close match φp in contrast the third moment of the hybrids doesn match at all the third moment of but their sum does finally we come back to the approximation of by µep these obey two very similar relationships φp φp µep µep vep since vep slight rephrasing of eq we finally have µep we summarize the results in the following theorem theorem characterizing of ep under the assumptions given by eq and sites with slowly changing log we can bound the quality of the ep approximation and the cga µep φp βm ep we give the full expression for the bounds and in the supplement note that the order of magnitude of the bound on is the best possible because it is attained for certain distributions for example consider gamma distribution with natural parameters nα nβ whose mean by its mode is approximated at order nβ more generally from eq we can compute the first order of the error φp φp which is the term causing the order error whenever this term is significant it is thus safe to conclude that ep improves on the cga also note that since is of order the relative error for the approximation is of order for both methods despite having convergence rate of the same order the ep approximation is demonstrably better than the cga as we show next let us first see why the approximation for is only of order for both methods the following relationship holds φp in this relationship φp is an order term while the rest are order if we now compare this to the cga approximation of we find that it fails at multiple levels first it completely ignores the two order terms and then because it takes the value of φp at which is at distance of from it adds another order error term since φp the cga is thus adding quite bit of error even if each component is of order meanwhile vep obeys relationship similar to eq mi vep φp µep φi µep ep vep we can see where the ep approximation produces errors the φp term is well approximated since µep we have φp φp µep the term involving is also well approximated and we can see that the only term that fails is the term the order error is thus entirely coming from this term which shows that ep performance suffers more from the skewness of the target distribution than from its kurtosis finally note that with our result we can get some intuitions about the quality of the ep approximation using other metrics for example if the most interesting metric is the kl divergence kl the excess kl divergence from using the ep approximation instead of the true minimizer qkl which has the same mean and variance as is given by qkl µep log log vep µep log vep vep µep vep vep which we recognize as kl qkl similar formula gives the excess kl divergence from using the cga instead of qkl for both methods the variance term is of order though it should be smaller for ep but the mean term is of order for ep while it is of order for the cga once again ep is found to be the better approximation finally note that our bounds are quite pessimistic the true value might be much better fit than we have predicted here first cause is the bounding of the derivatives of log eqs while those bounds are correct they might prove to be very pessimistic for example if the contributions from the sites to the cancel each other out much lower bound than nkd might apply similarly there might be another lower bound 
on the curvature much higher than nβm another cause is the bounding of the variance from the curvature while applying requires the distribution to have high everywhere distribution with close to the mode and in the tails still has very low variance in such case the bound is very pessimistic in order to improve on our bounds we will thus need to use tighter bounds on the of the hybrids and of the target distribution but we will also need an extension of the result that can deal with those cases where distribution is strongly around its mode but in the tails the is much lower conclusion ep has been used for now quite some time without any theoretical concrete guarantees on its performance in this work we provide explicit performance bounds and show that ep is superior to the cga in the sense of giving provably better approximations of the mean and variance there are now theoretical arguments for substituting ep to the cga in number of practical problems where the gain in precision is worth the increased computational cost this work tackled the first steps in proving that ep offers an appropriate approximation continuing in its tracks will most likely lead to more general and less pessimistic bounds but it remains an open question how to quantify the quality of the approximation using other distance measures for example it would be highly useful for machine learning if one could show bounds on prediction error when using ep we believe that our approach should extend to more general performance measures and plan to investigate this further in the future references thomas minka expectation propagation for approximate bayesian inference in uai proceedings of the conference in uncertainty in artificial intelligence pages san francisco ca usa morgan kaufmann publishers isbn url http malte kuss and carl rasmussen assessing approximate inference for binary gaussian process classification mach learn december issn url http hannes nickisch and carl rasmussen approximations for binary gaussian process classification journal of machine learning research october url http guillaume dehaene and simon barthelmé expectation propagation in the limit technical report march url http minka divergence measures and message passing technical report url http seeger expectation propagation for exponential families technical report url http christopher bishop pattern recognition and machine learning information science and statistics springer ed corr printing edition october isbn url http jack raymond andre manoel and manfred opper expectation propagation september url http anirban dasgupta asymptotic theory of statistics and probability springer texts in statistics springer edition march isbn url http adrien saumard and jon wellner and strong review statist doi url http herm brascamp and elliott lieb best constants in young inequality its converse and its generalization to more than three functions advances in mathematics may issn doi url http 
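To make the EP iteration summarized in the background section concrete, the following is a minimal one-dimensional sketch of Gaussian EP with sites parametrized in natural parameters (r_i, beta_i), where the hybrid moments are computed by brute-force quadrature on a grid. The function name, the toy log-concave sites, the initialization, and the absence of damping are illustrative assumptions, not the authors' implementation.

```python
# Minimal 1-D Gaussian EP sketch (assumptions: NumPy, quadrature-based hybrid
# moments, toy log-concave sites; illustrative only, not the paper's code).
import numpy as np

def ep_1d(log_sites, n_iters=50, grid=np.linspace(-10, 10, 4001)):
    """log_sites: list of functions phi_i(x) = log f_i(x).
    Each site approximation is a Gaussian in natural parameters,
    q_i(x) proportional to exp(r_i * x - beta_i * x**2 / 2)."""
    n = len(log_sites)
    r = np.zeros(n)
    beta = np.full(n, 1e-3)              # small initial precisions
    for _ in range(n_iters):
        for i in range(n):
            # cavity: remove site i from the global approximation
            # (natural parameters simply add up, so removal is subtraction)
            r_cav, b_cav = r.sum() - r[i], beta.sum() - beta[i]
            # hybrid h_i proportional to q_{-i}(x) * f_i(x); moments on a grid
            logh = r_cav * grid - 0.5 * b_cav * grid**2 + log_sites[i](grid)
            w = np.exp(logh - logh.max())
            w /= w.sum()
            mu = np.sum(w * grid)
            var = np.sum(w * (grid - mu) ** 2)
            # moment matching, then update site i by division in natural params
            # (may yield a negative site precision; real code often adds damping)
            beta[i] = 1.0 / var - b_cav
            r[i] = mu / var - r_cav
    b_tot, r_tot = beta.sum(), r.sum()
    return r_tot / b_tot, 1.0 / b_tot    # EP mean and variance

# toy example: a product of shifted log-concave sites f_i(x) = 1 / cosh(x - c_i)
sites = [lambda x, c=c: -np.log(np.cosh(x - c)) for c in np.linspace(-1, 1, 20)]
mu_ep, v_ep = ep_1d(sites)
```

A fixed point of this loop is exactly the kind of EP fixed point whose mean and variance the bounds above control.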
fast universal algorithm to learn parametric nonlinear embeddings miguel eecs university of california merced max vladymyrov uc merced and yahoo labs http maxv abstract nonlinear embedding algorithms such as stochastic neighbor embedding do dimensionality reduction by optimizing an objective function involving similarities between pairs of input patterns the result is projection of each input pattern common way to define an mapping is to optimize the objective directly over parametric mapping of the inputs such as neural net this can be done using the chain rule and nonlinear optimizer but is very slow because the objective involves quadratic number of terms each dependent on the entire mapping parameters using the method of auxiliary coordinates we derive training algorithm that works by alternating steps that train an auxiliary embedding with steps that train the mapping this has two advantages the algorithm is universal in that specific learning algorithm for any choice of embedding and mapping can be constructed by simply reusing existing algorithms for the embedding and for the mapping user can then try possible mappings and embeddings with less effort the algorithm is fast and it can reuse methods developed for nonlinear embeddings yielding iterations introduction given dataset yn of points in rd nonlinear embedding algorithms seek to find projections xn with by optimizing an objective function constructed using an matrix of similarities wnm between pairs of input patterns yn ym for example the elastic embedding ee optimizes pn wnm kxn xm exp kxn xm here the first term encourages projecting similar patterns near each other while the second term repels all pairs of projections other algorithms of this type are stochastic neighbor embedding sne neighbor retrieval visualizer nerv or the sammon mapping as well as spectral methods such as metric multidimensional scaling and laplacian eigenmaps though our focus is on nonlinear objectives nonlinear embeddings can produce visualizations of data that display structure such as manifolds or clustering and have been used for exploratory purposes and other applications in machine learning and beyond optimizing nonlinear embeddings is difficult for three reasons there are many parameters the objective is very nonconvex so gradient descent and other methods require many iterations and it involves terms so evaluating the gradient is very slow major progress in these problems has been achieved in recent years for the second problem the spectral direction is constructed by bending the gradient using the curvature of the quadratic part of the objective for ee this is the graph laplacian of this significantly reduces the number of iterations while evaluating the direction itself is about as costly as evaluating the gradient for the third problem methods such as tree methods and fast multipole methods approximate the gradient in log and for small dimensions respectively and have allowed to scale up embeddings to millions of patterns another issue that arises with nonlinear embeddings is that they do not define an mapping rd rl that can be used to project patterns not in the training set there are two basic approaches to define an mapping for given embedding the first one is variational argument originally put forward for laplacian eigenmaps and also applied to the elastic embedding the idea is to optimize the embedding objective for dataset consisting of the training points and one test point but keeping the training projections fixed essentially this 
constructs nonparametric mapping implicitly defined by the training points and its projections without introducing any assumptions the mapping comes out in closed form for laplacian eigenmaps estimator but not in general for ee in which case it needs numerical optimization in either case evaluating the mapping for test point is which is slow and does not scale for spectral methods one can also use the formula but it does not apply to nonlinear embeddings and is still at test time the second approach is to use mapping belonging to parametric family of mappings linear or neural net which is fast at test time directly fitting to is inelegant since is unrelated to the embedding and may not work well if the mapping can not model well the data if is linear better way is to involve in the learning from the beginning by replacing xn with yn in the embedding objective function and optimizing it over the parameters of for example for the elastic embedding of this means pn pn wnm kf yn ym exp kf yn ym this will give better results because the only embeddings that are allowed are those that are realizable by mapping in the family considered hence the optimal will exactly match the embedding which is still trying to optimize the objective this provides an intermediate solution between the nonparametric mapping described above which is slow at test time and the direct fit of parametric mapping to the embedding which is suboptimal we will focus on this approach which we call parametric embedding pe following previous work long history of pes exists using unsupervised or supervised embedding objectives and using linear or nonlinear mappings neural nets each of these papers develops specialized algorithm to learn the particular pe they define embedding objective and mapping family besides pes have also been used as regularization terms in semisupervised classification regression or deep learning our focus in this paper is on optimizing an unsupervised parametric embedding defined by given embedding objective such as ee or and given family for the mapping such as linear or neural net the straightforward approach used in all papers cited above is to derive training algorithm by applying the chain rule to compute gradients over the parameters of and feeding them to nonlinear optimizer usually gradient descent or conjugate gradients this has three problems first new gradient and optimization algorithm must be developed and coded for each choice of and for user who wants to try different choices on given dataset this is very the power of nonlinear embeddings and unsupervised methods in general is precisely as exploratory techniques to understand the structure in data so user needs to be able to try multiple techniques ideally the user should simply be able to plug different mappings into any embedding objective with minimal development work second computing the gradient involves terms each depending on the entire mapping parameters which is very slow third both and must be differentiable for the chain rule to apply here we propose new approach to optimizing parametric embeddings based on the recently introduced method of auxiliary coordinates mac that partially alleviates these problems the idea is to solve an equivalent constrained problem by introducing new variables the auxiliary coordinates alternating optimization over the coordinates and the mapping parameters results in step that trains an auxiliary embedding with regularization term and step that trains the mapping by solving regression both of which 
can be solved by existing algorithms section introduces important concepts and describes the based optimization of parametric embeddings section applies mac to parametric embeddings and section shows with different combinations of embeddings and mappings that the resulting algorithm is very easy to construct including use of methods and is faster than the based optimization embedding space free embedding direct fit pe figure left illustration of the feasible set for grayed areas of embeddings that can be produced by the mapping family this corresponds to the feasible set of the equality constraints in the problem parametric embedding is feasible embedding with locally minimal value of free embedding is minimizer of and is usually not feasible direct fit to the free embedding is feasible but usually not optimal right panels embeddings of objects from the dataset using linear mapping free embedding its direct fit and the parametric embedding pe optimized with mac free embeddings parametric embeddings and gradients consider given nonlinear embedding objective function that takes an argument and maps it to real value is constructed for dataset according to particular embedding model we will use as running example the equations for the elastic embedding which are simpler than for most other embeddings we call free embedding the result of optimizing local optimizer of parametric embedding pe objective function for using family of mappings rd rl for example linear mappings is defined as where yn as in eq for ee note that to simplify the notation we do not write explicitly the parameters of thus specific pe can be defined by any combination of embedding objective function ee sne and parametric mapping family linear neural net the result of optimizing local optimizer of is mapping which we can apply to any input rd not necessarily from among the training patterns finally we call direct fit the mapping resulting from fitting to by regression to map the input patterns to free embedding we have the following results theorem let be global minimizer of then proof theorem perfect direct fit let if and is global minimizer of then is global minimizer of proof let with then theorem means that if the direct fit of to has zero error then it is the solution of the parametric embedding and there is no need to optimize theorem means that pe can not do better than free this is obvious in that pe is not free but constrained to use only embeddings that can be produced by mapping in as illustrated in fig pe will typically worsen the free embedding more powerful mapping families such as neural nets will distort the embedding less than more restricted families such as linear mappings in this sense the free embedding can be seen as using as mapping family table with parameters it represents the most flexible mapping since every projection xn is free parameter but it can only be applied to patterns in the training set we will assume that the direct fit has positive error the direct fit is not perfect so that optimizing is necessary computationally the complexity of the gradient of appears to be where is the number of parameters in because involves terms each dependent on all the parameters of for linear this would cost ld however if manually simplified and coded the gradient can actually be computed in for example for the elastic embedding with linear mapping ay where is of the gradient of eq is pn ay ay λexp kay ay nm by continuity argument theorem carries over to the case where and are local minimizers of and 
respectively however theorem would apply only locally that is holds locally but there may be mappings with associated with another lower local minimizer of however the same intuition remains we can not expect pe to improve over good free embedding and this can be computed in dl if we precompute ay and take common factors of the summation over xn and xm an automatic differentiation package may or may not be able to realize these savings in general the obvious way to optimize is to compute the gradient wrt the parameters of by applying the chain rule since is function of and this is function of the parameters of assuming and are differentiable while perfectly doable in theory in practice this has several problems deriving debugging and coding the gradient of for nonlinear is cumbersome one could use automatic differentiation but current packages can result in inefficient gradients in time and memory and are not in widespread use in machine learning also combining autodiff with methods seems difficult because the latter require spatial data structures that are effective for points in low dimension no more than as far as we know and depend on the actual point values the pe gradient may not benefit from algorithms developed for embeddings for example the spectral direction method relies on special properties of the free embedding hessian which do not apply to the pe hessian given the gradient one then has to choose and possibly adapt suitable nonlinear optimization method and set its parameters line search parameters etc so that convergence is assured and the resulting algorithm is efficient simple choices such as gradient descent or conjugate gradients are usually not efficient and developing good algorithm is research problem in itself as evidenced by the many papers that study specific combinations of embedding objective and parametric mapping even having done all this the resulting algorithm will still be very slow because of the complexity of computing the gradient it may be possible to approximate the gradient using methods but again this would involve significant development effort as noted earlier the chain rule only applies if both and are differentiable finally all of the above needs to be redone if we change the mapping from neural net to rbf network or the embedding from ee to we now show how these problems can be addressed by using different approach to the optimization optimizing parametric embedding using auxiliary coordinates the pe objective function can be seen as nested function where we first apply and then recently proposed strategy the method of auxiliary coordinates mac can be used to derive optimization algorithms for such nested systems we write the nested problem min as the following equivalent constrained optimization problem min zn yn where we have introduced an auxiliary coordinate zn for each input pattern and corresponding equality constraint zn can be seen as the output of the projection for xn the optimization is now on an augmented space with extra parameters and the feasible set of the equality constraints is shown in fig we solve the constrained problem using method it is also possible to use the augmented lagrangian method by optimizing the following unconstrained problem and driving pn min pq kzn yn kz under mild assumptions the minima trace continuous path that converges to local optimum of and hence of finally we optimize pq using alternating optimization over the coordinates and the mapping this results in two steps pn over given kzn yn this is standard 
regression for dataset using and can be solved using existing code for many families of mappings for example for linear mapping ay we solve linear system efficiently done by caching in the first iteration and doing matrix multiplication in subsequent iterations for deep net we can use stochastic gradient descent with pretraining possibly on gpu for regression tree or forest we can use any algorithm etc also note that if we want to have regularization term in the pe objective for weight decay or for model complexity that term will appear in the step but not in the step hence the training and regularization of the mapping is confined to the step given the inputs and current outputs the mapping communicates with the embedding objective precisely through these coordinates over given minz kz this is regularized embedding since is the original embedding objective function and kz is quadratic regularization term on with weight which encourages to be close to given embedding we can reuse existing code to learn the embedding with simple modifications for example the gradient has an added term the spectral direction now uses curvature matrix the embedding communicates with the mapping through the outputs which are constant within the step which gradually force the embedding to agree with the output of member of the family of mappings hence the intricacies of nonlinear optimization line search method parameters etc remain confined within the regression for and within the embedding for separately from each other designing an optimization algorithm for an arbitrary combination of embedding and mapping is simply achieved by alternately calling existing algorithms for the embedding and for the mapping although we have introduced large number of new parameters to optimize over the auxiliary coordinates the cost of mac iteration is actually the same asymptotically as the cost of computing the pe gradient where is the number of parameters in in the step the objective function has terms but each term depends only on projections zn and zm parameters hence it costs in the step the objective function has terms each depending on the entire mapping parameters hence it costs another advantage of mac is that because it does not use gradients it is even possible to use something like regression tree for which is not differentiable and so the pe objective function is not differentiable either in mac we can use an algorithm to train regression trees within the step using as data reducing the constraint error kz and the pe objective final advantage is that we can benefit from recent work done on using methods to reduce the complexity of computing the embedding gradient exactly to log using treebased methods such as the algorithm or even using fast multipole methods at small approximation error we can reuse such code as is without any extra work to approximate the gradient of and then add to it the exact gradient of the regularization term kz which is already linear in hence each mac iteration and steps runs in linear time on the sample size and is thus scalable to larger datasets the problem of optimizing parametric embeddings is closely related to that of learning binary hashing for fast information retrieval using loss functions the only difference is that in binary hashing the mapping an hash function maps vector rd to an binary vector the mac framework can also be applied and the resulting algorithm alternates an step that fits classifier for each bit of the hash function and step that optimizes regularized binary 
embedding using combinatorial optimization schedule of initial and the path to minimizer the mac algorithm for parametric embeddings introduces no new optimization parameters except for the penalty parameter the convergence theory of methods and mac tells us that convergence to local optimum is guaranteed if each iteration achieves sufficient decrease always possible by running enough steps and if the latter condition ensures the equality constraints are eventually satisfied mathematically the minima of pq as function of trace continuous path in the space that ends at local minimum of the constrained problem and thus of the parametric embedding objective function hence our algorithm belongs to the family of methods such as quadratic penalty augmented lagrangian homotopy and methods widely regarded as effective with nonconvex problems in practice one follows that path loosely doing fast inexact steps on and for the current value of and then increasing how fast to increase does depend on the particular problem typically one multiplies times factor of around increasing very slowly will follow the path more closely but the runtime will increase since does not appear in the step increasing is best done within step we run several iterations over increase run several iterations over and then do an step the starting point of the path is here the step simply optimizes and hence gives us free embedding we just train an elastic embedding model on the dataset the step then fits to and hence gives us the direct fit which generally will have positive error kz otherwise we stop with an optimal pe thus the beginning of the path is the direct fit to the free embedding as increases we follow the path and as converges to minimizer of the pe and converges to hence the lifetime of the mac algorithm over the time starts with free embedding and direct fit which disagree with each other and progressively reduces the error in the fit by increasing the error in the embedding until and agree at an optimal pe although it is possible to initialize in different way random and start with large value of we find this converges to worse local optima than starting from free embedding with small good local optima for the free embedding itself can be found by homotopy methods as well experiments our experiments confirm that mac finds optima as good as those of the conventional optimization based on gradients but that it is faster particularly if using methods we demonstrate this with different embedding objectives the elastic embedding and and mappings linear and neural net we report on representative subset of experiments illustrative example the simple example of fig shows the different embedding types described in the paper we use the dataset containing rotation sequences of physical objects every degrees each grayscale image of pixels total points in dimensions thus each object traces closed loop in pixel space we produce embeddings of objects using the elastic embedding ee the free embedding results from optimizing the ee objective function without any limitations on the projections it gives the best visualization of the data but no mapping we now seek linear mapping the direct fit fits linear mapping to map the images to their projections from the free embedding the resulting predictions give quite distorted representation of the data because linear mapping can not realize the free embedding with low error the parametric embedding pe finds the linear mapping that optimizes which for ee is eq to optimize the pe we used 
mac which was faster than gradient descent and conjugate gradients the resulting pe represents the data worse than the free embedding since the pe is constrained to produce embeddings that are realizable by linear mapping but better than the direct fit because the pe can search for embeddings that while being realizable by linear mapping produce lower value of the ee objective function the details of the optimization are as follows we preprocess the data using pca projecting to dimensions otherwise learning mapping would be trivial since there are more degrees of freedom than there are points the free embedding was optimized using the spectral direction until consecutive iterates differed by relative error less than we increased from to with step of values and did iterations for each value the step uses the spectral direction stopping when the relative error is less than cost of the iterations fig left shows as function of the number of data points using swissroll dataset the time needed to compute the gradient of the pe objective red curve and the gradient of the mac and steps black and magenta respectively as well as their sum in blue we use and sigmoidal neural net with an architecture we approximate the gradient in log using the method the plot shows the asymptotically complexity to be quadratic for the pe gradient but linear for the step and log for the step the pe gradient runs out of memory for large quality of the local optima for the same swissroll dataset fig right shows as function of the number of data points the final value of the pe objective function achieved by the cg optimization and by mac both using the same initialization there is practically no difference between both optimization algorithms we sometimes do find they converge to different local optima as in some of our other experiments different embedding objectives and mapping families the goal of this experiment is to show that we can easily derive convergent efficient algorithm for various combinations of embeddings and mappings we consider as embedding objective functions and ee and as mappings neural net and linear mapping we apply each combination to learn parametric embedding for the mnist dataset containing images of handwritten digit images training pe objective function runtime seconds pe mac step step pe mac figure runtime per iteration and final pe objective for swissroll dataset using as mapping sigmoidal neural net with an architecture for for pe we give the runtime needed to compute the gradient of the pe objective using cg with gradients for mac we give the runtime needed to compute the steps separately and together the gradient of the step is approximated with an method errorbars over randomly generated swissrolls neural net embedding mac pe minibatch pe batch runtime seconds ee linear embedding mac pe runtime seconds figure mnist dataset top with neural net bottom ee with linear mapping left initial free embedding we show sample of points to avoid clutter middle final parametric embedding right learning curves for mac and optimization each marker indicates one iteration for mac the solid markers indicate iterations where increased nonlinear free embedding on dataset of this size was very slow until the recent introduction of methods for ee and other methods we are the first to use methods for pes thanks to the decoupling between mapping and embedding introduced by mac for each combination we derive the mac algorithm by reusing code available online for the ee and free embeddings we use the spectral 
direction for the methods to approximate the embedding objective function gradient we use the fast multipole method for ee and the method for and for training deep net we use unsupervised pretraining and backpropagation fig left shows the free embedding of mnist obtained with and ee after iterations of the spectral direction to compute the gaussian affinities between pairs of points we used entropic affinities with perplexity neighbors the optimization details are as follows for the neural net we replicated the setup of this uses neural net with an architecture initialized with pretraining as scribed in and for the pe optimization we used the code from because of memory limitations actually solved an approximate version of the pe objective function where rather than using all pairwise point interactions only bn interactions are used corresponding to using minibatches of points therefore the solution obtained is not minimizer of the pe objective as can be seen from the higher objective value in fig bottom however we did also solve the exact objective by using one minibatch containing the entire dataset each minibatch was trained with cg iterations and total of epochs for mac we used optimizing until the objective function decrease before the step and after the step was less than relative error of the rest of the optimization details concern the embedding and neural net and are based on existing code the initialization for is the free embedding the step like the free embedding uses the spectral direction with fixed step size using iterations of linear conjugate gradients to solve the linear system and using initialized from the the previous iteration direction the gradient of the free embedding is approximated in log using the method with accuracy altogether one iteration took around seconds we exit the step when the relative error between consecutive embeddings is less than for the step we used stochastic gradient descent with minibatches of points step size and momentum rate and trained for epochs for the first values of and for epochs for the rest for the linear mapping ay we implemented our own pe optimizer with gradient descent and backtracking line search for iterations in mac we used values spaced logarithmically from to optimizing at each value until the objective function decrease was less than relative error of both the step and the free embedding use the spectral direction with fixed step size we stop optimizing them when the relative error between consecutive embeddings is less than the gradient is approximated using fast multipole methods with accuracy the number of terms in the truncated series in the step the linear system to find was solved using iterations of linear conjugate gradients with warm start fig shows the final parametric embeddings for mac top and linear ee bottom and the learning curves pe error over iterations mac is considerably faster than the optimization in all cases for the mac is almost faster than using minibatch the approximate pe objective and faster than the exact batch mode this is partly thanks to the use of methods in the step the runtimes were excluding the time taken by pretraining mac pe minibatch pe batch free embedding without using methods mac is faster than pe batch and comparable to pe minibatch for the linear ee the runtimes were mac pe direct fit the embedding preserves the overall structure of the free embedding but both embeddings do differ for example the free embedding creates small clumps of points and the neural net being continuous 
mapping tends to smooth them out the linear ee embedding distorts the free ee embedding considerably more than if using neural net this is because linear mapping has much harder time at approximating the complex mapping from the data into that the free embedding implicitly demands conclusion in our view the main advantage of using the method of auxiliary coordinates mac to learn parametric embeddings is that it simplifies the algorithm development one only needs to plug in existing code for the embedding with minor modifications and the mapping this is particularly useful to benefit from complex highly optimized code for specific problems such as the methods we used here or perhaps gpu implementations of deep nets and other machine learning models in many applications the efficiency in programming an easy robust solution is more valuable than the speed of the machine but in addition we find that the mac algorithm can be quite faster than the based optimization of the parametric embedding acknowledgments work funded by nsf award we thank weiran wang for help with training the deep net in the mnist experiment references barnes and hut hierarchical log algorithm nature belkin and niyogi laplacian eigenmaps for dimensionality reduction and data representation neural computation bengio delalleau le roux paiement vincent and ouimet learning eigenfunctions links spectral embedding and kernel pca neural computation bromley bentz bottou guyon lecun moore and shah signature verification using siamese time delay neural network int pattern recognition and artificial intelligence the elastic embedding algorithm for dimensionality reduction icml and lu the laplacian eigenmaps latent variable model aistats and wang distributed optimization of deeply nested systems and wang distributed optimization of deeply nested systems aistats globerson and roweis metric learning by collapsing classes nips goldberger roweis hinton and salakhutdinov neighbourhood components analysis nips greengard and rokhlin fast algorithm for particle simulations comp griewank and walther evaluating derivatives principles and techniques of algorithmic differentiation siam second edition hadsell chopra and lecun dimensionality reduction by learning an invariant mapping cvpr he and niyogi locality preserving projections nips hinton and roweis stochastic neighbor embedding nips lowe and tipping neural networks and topographic mappings for exploratory data analysis neural computing applications mao and jain artificial neural networks for feature extraction and multivariate data projection ieee trans neural networks min yuan van der maaten bonner and zhang deep supervised embedding icml nocedal and wright numerical optimization springer second edition peltonen and kaski discriminative components of data ieee trans neural networks raziperchikolaei and learning hashing with loss functions using auxiliary coordinates salakhutdinov and hinton learning nonlinear embedding by preserving class neighbourhood structure aistats sammon nonlinear mapping for data structure analysis ieee trans computers teh and roweis automatic alignment of local representations nips van der maaten learning parametric embedding by preserving local structure aistats van der maaten int conf learning representations iclr van der maaten and hinton visualizing data using jmlr venna peltonen nybo aidos and kaski information retrieval perspective to nonlinear dimensionality reduction for data visualization jmlr vladymyrov and strategies for fast learning of nonlinear 
embeddings icml vladymyrov and entropic affinities properties and efficient numerical computation icml vladymyrov and training of nonlinear embeddings aistats webb multidimensional scaling by iterative majorization using radial basis functions pattern recognition weston ratle and collobert deep learning via embedding icml yang peltonen and kaski scalable optimization for neighbor embedding for visualization icml 
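The MAC scheme for parametric embeddings described above alternates between an embedding step, which optimizes the auxiliary coordinates Z under a quadratic penalty pulling them toward the mapping's output, and a mapping step, which refits the mapping to the current Z, while the penalty parameter is slowly increased along the path from the free embedding and direct fit to an optimal parametric embedding. The following is a minimal, self-contained sketch of that structure under several simplifying assumptions: a toy distance-matching (stress-like) objective stands in for the elastic embedding or t-SNE objectives, the mapping is linear and fit by least squares, the embedding step uses plain gradient descent rather than the spectral direction or N-body approximations, and the penalty schedule is arbitrary. It illustrates the shape of the algorithm, not the authors' implementation; the step names and function names below are my own.

```python
import numpy as np

def mac_parametric_embedding(X, d=2, mu_schedule=(1e-3, 1e-2, 1e-1, 1.0, 10.0),
                             z_steps=50, lr=0.05, seed=0):
    """Minimal sketch of MAC-style alternating optimization for a parametric
    embedding with a linear mapping F(x) = x @ W.  A simple stress objective
    E(Z) = sum_ij (||z_i - z_j|| - ||x_i - x_j||)^2 stands in for the
    embedding objectives discussed in the paper."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    Dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # target distances

    def embedding_grad(Z):
        # Gradient of the toy stress objective with respect to each z_i.
        diff = Z[:, None, :] - Z[None, :, :]
        dz = np.linalg.norm(diff, axis=-1) + 1e-12
        coef = 4.0 * (dz - Dx) / dz
        return (coef[:, :, None] * diff).sum(axis=1)

    # "Free embedding": optimise the embedding objective alone (mu = 0).
    Z = rng.standard_normal((n, d))
    for _ in range(200):
        Z -= lr * embedding_grad(Z) / n

    # "Direct fit": least-squares fit of the linear mapping to the free embedding.
    W = np.linalg.lstsq(X, Z, rcond=None)[0]

    # Follow the penalty path: Z-step on E(Z) + mu/2 ||Z - X W||^2, then W-step.
    for mu in mu_schedule:
        for _ in range(z_steps):
            g = embedding_grad(Z) / n + mu * (Z - X @ W)
            Z -= lr * g
        W = np.linalg.lstsq(X, Z, rcond=None)[0]
    return W, Z

if __name__ == "__main__":
    X = np.random.default_rng(1).standard_normal((100, 5))
    W, Z = mac_parametric_embedding(X)
    print("fit error ||Z - XW||:", np.linalg.norm(Z - X @ W))
```

As the penalty grows, the fit error between the auxiliary coordinates and the mapping output shrinks, mirroring the progression from free embedding and direct fit toward an embedding that the mapping can realize.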
texture synthesis using convolutional neural networks leon gatys centre for integrative neuroscience university of germany bernstein center for computational neuroscience germany graduate school of neural information processing university of germany alexander ecker centre for integrative neuroscience university of germany bernstein center for computational neuroscience germany max planck institute for biological cybernetics germany baylor college of medicine houston tx usa matthias bethge centre for integrative neuroscience university of germany bernstein center for computational neuroscience germany max planck institute for biological cybernetics germany abstract here we introduce new model of natural textures based on the feature spaces of convolutional neural networks optimised for object recognition samples from the model are of high perceptual quality demonstrating the generative power of neural networks trained in purely discriminative fashion within the model textures are represented by the correlations between feature maps in several layers of the network we show that across layers the texture representations increasingly capture the statistical properties of natural images while making object information more and more explicit the model provides new tool to generate stimuli for neuroscience and might offer insights into the deep representations learned by convolutional neural networks introduction the goal of visual texture synthesis is to infer generating process from an example texture which then allows to produce arbitrarily many new samples of that texture the evaluation criterion for the quality of the synthesised texture is usually human inspection and textures are successfully synthesised if human observer can not tell the original texture from synthesised one in general there are two main approaches to find texture generating process the first approach is to generate new texture by resampling either pixels or whole patches of the original texture these resampling techniques and their numerous extensions and improvements see for review are capable of producing high quality natural textures very efficiently however they do not define an actual model for natural textures but rather give mechanistic procedure for how one can randomise source texture without changing its perceptual properties in contrast the second approach to texture synthesis is to explicitly define parametric texture model the model usually consists of set of statistical measurements that are taken over the feature maps gradient descent input figure synthesis method texture analysis left the original texture is passed through the cnn and the gram matrices gl on the feature responses of number of layers are computed texture is passed through the cnn and loss function el is synthesis right white noise image computed on every layer included in the texture model the total loss function is weighted sum of the contributions el from each layer using gradient descent on the total loss with respect to the pixel values new image is found that produces the same gram matrices as the original texture spatial extent of the image in the model texture is uniquely defined by the outcome of those measurements and every image that produces the same outcome should be perceived as the same texture therefore new samples of texture can be generated by finding an image that produces the same measurement outcomes as the original texture conceptually this idea was first proposed by julesz who conjectured that visual texture can be 
uniquely described by the joint histograms of its pixels later on texture models were inspired by the linear response properties of the mammalian early visual system which resemble those of oriented gabor filters these texture models are based on statistical measurements taken on the filter responses rather than directly on the image pixels so far the best parametric model for texture synthesis is probably that proposed by portilla and simoncelli which is based on set of carefully handcrafted summary statistics computed on the responses of linear filter bank called steerable pyramid however although their model shows very good performance in synthesising wide range of textures it still fails to capture the full scope of natural textures in this work we propose new parametric texture model to tackle this problem fig instead of describing textures on the basis of model for the early visual system we use convolutional neural network functional model for the entire ventral stream as the foundation for our texture model we combine the conceptual framework of spatial summary statistics on feature responses with the powerful feature space of convolutional neural network that has been trained on object recognition in that way we obtain texture model that is parameterised by spatially invariant representations built on the hierarchical processing architecture of the convolutional neural network convolutional neural network we use the network convolutional neural network trained on object recognition that was introduced and extensively described previously here we give only brief summary of its architecture we used the feature space provided by the convolutional and pooling layers of the network we did not use any of the fully connected layers the network architecture is based on two fundamental computations linearly rectified convolution with filters of size where is the number of input feature maps stride and padding of the convolution is equal to one such that the output feature map has the same spatial dimensions as the input feature maps maximum pooling in regions which the feature maps by factor of two these two computations are applied in an alternating manner see fig number of convolutional layers is followed by layer after each of the first three pooling layers the number of feature maps is doubled together with the spatial this transformation results in reduction of the total number of feature responses by factor of two fig provides schematic overview over the network architecture and the number of feature maps in each layer since we use only the convolutional layers the input images can be arbitrarily large the first convolutional layer has the same size as the image and for the following layers the ratio between the feature map sizes remains fixed generally each layer in the network defines filter bank whose complexity increases with the position of the layer in the network the trained convolutional network is publicly available and its usability for new applications is supported by the for texture generation we found that replacing the maxpooling operation by average pooling improved the gradient flow and one obtains slightly cleaner results which is why the images shown below were generated with average pooling finally for practical reasons we rescaled the weights in the network such that the mean activation of each filter over images and positions is equal to one such can always be done without changing the output of neural network if the in the network are rectifying linear texture 
model

The texture model we describe in the following is much in the spirit of that proposed by Portilla and Simoncelli. To generate a texture from a given source image, we first extract features of different sizes homogeneously from this image. Next, we compute a spatial summary statistic on the feature responses to obtain a stationary description of the source image. Finally, we find a new image with the same stationary description by performing gradient descent on a random image that has been initialised with white noise (see the synthesis-method figure above).

The main difference to Portilla and Simoncelli's work is that, instead of using a linear filter bank and a set of carefully chosen summary statistics, we use the feature space provided by a high-performing deep neural network and only one spatial summary statistic: the correlations between feature responses in each layer of the network.

To characterise a given vectorised texture $\vec{x}$ in our model, we first pass $\vec{x}$ through the convolutional neural network and compute the activations for each layer $l$ in the network. Since each layer in the network can be understood as a filter bank, its activations in response to an image form a set of filtered images (feature maps). A layer with $N_l$ distinct filters has $N_l$ feature maps, each of size $M_l$ when vectorised. These feature maps can be stored in a matrix $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{jk}$ is the activation of the $j$-th filter at position $k$ in layer $l$. Textures are per definition stationary, so a texture model needs to be agnostic to spatial information. A summary statistic that discards the spatial information in the feature maps is given by the correlations between the responses of different features. These feature correlations are, up to a constant of proportionality, given by the Gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$, where $G^l_{ij}$ is the inner product between feature maps $i$ and $j$ in layer $l$:
$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}.$$
A set of Gram matrices from some layers of the network in response to a given texture provides a stationary description of the texture, which fully specifies a texture in our model. (Source code to generate textures with CNNs, as well as the rescaled network, can be found at http .)

Texture generation

To generate a new texture on the basis of a given image, we use gradient descent from a white noise image to find another image that matches the Gram-matrix representation of the original image. This optimisation is done by minimising the distance between the entries of the Gram matrix of the original image and the Gram matrix of the image being generated. Let $\vec{x}$ be the original image and $\hat{\vec{x}}$ the image that is generated, and $G^l$ and $\hat{G}^l$ their respective Gram-matrix representations in layer $l$. The contribution of layer $l$ to the total loss is then
$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - \hat{G}^l_{ij} \right)^2,$$
and the total loss is
$$\mathcal{L}(\vec{x}, \hat{\vec{x}}) = \sum_l w_l E_l,$$
where the $w_l$ are weighting factors for the contribution of each layer to the total loss. The derivative of $E_l$ with respect to the activations in layer $l$ can be computed analytically:
$$\frac{\partial E_l}{\partial \hat{F}^l_{ij}} = \begin{cases} \frac{1}{N_l^2 M_l^2} \left( (\hat{F}^l)^\top (\hat{G}^l - G^l) \right)_{ji} & \text{if } \hat{F}^l_{ij} > 0, \\ 0 & \text{if } \hat{F}^l_{ij} < 0. \end{cases}$$
The gradients of $E_l$, and thus the gradient of $\mathcal{L}$ with respect to the pixels $\hat{\vec{x}}$, can be readily computed using standard error back-propagation. The gradient can be used as input for some numerical optimisation strategy; in our work we use L-BFGS, which seemed a reasonable choice for the optimisation problem at hand. The entire procedure relies mainly on the standard back-propagation pass that is used to train the convolutional network; therefore, in spite of the large complexity of the model, texture generation can be done in reasonable time using GPUs and toolboxes for training deep neural networks. (A short numerical sketch of the Gram-matrix statistics and the layer loss is given after the references of this paper.)

Results

We show textures generated by our model from four different source images. Each row of images was generated using an increasing number of layers in the texture model to constrain the
gradient descent the labels in the figure indicate the layer included in other words for the loss terms above certain layer we set the weights wl while for the loss terms below and including that layer we set wl for example the images in the first row were generated only from the texture representation of the first layer of the vgg network the images in the second row where generated by jointly matching the texture representations on top of layer and in this way we obtain textures that show what structure of natural textures are captured by certain computational processing stages of the texture model the first three columns show images generated from natural textures we find that constraining all layers up to layer generates complex natural textures that are almost indistinguishable from the original texture fig fifth row in contrast when constraining only the feature correlations on the lowest layer the textures contain little structure and are not far from spectrally matched noise original portilla simoncelli figure generated stimuli each row corresponds to different processing stage in the network when only constraining the texture representation on the lowest layer the synthesised textures have little structure first row with increasing number of layers on which we match the texture representation we find that we generate images with increasing degree of naturalness rows labels on the left indicate the layer included the source textures in the first three columns were previously used by portilla and simoncelli for better comparison we also show their results last row the last column shows textures generated from image to give better intuition about how the texture model represents image information parameters parameters parameters parameters original figure number of parameters in the texture model we explore several ways to reduce the number of parameters in the texture model see main text and compare the results textures generated from the different layers of the caffe reference network the textures are of lesser quality than those generated with the vgg network textures generated with the vgg architecture but random weights texture synthesis fails in this case indicating that learned filters are crucial for texture generation fig first row we can interpolate between these two extremes by using only the constraints from all layers up to some intermediate layer we find that the statistical structure of natural images is matched on an increasing scale as the number of layers we use for texture generation increases we did not include any layers above layer since this did not improve the quality of the synthesised textures for comparability we used source textures that were previously used by portilla and simoncelli and also show the results of their texture model fig last row to give better intuition for how the texture synthesis works we also show textures generated from image taken from the imagenet validation set fig last column our algorithm produces texturised version of the image that preserves local spatial information but discards the global spatial arrangement of the image the size of the regions in which spatial information is preserved increases with the number of layers used for texture generation this property can be explained by the increasing receptive field sizes of the units over the layers of the deep convolutional neural network when using summary statistics from all layers of the convolutional neural network the number of parameters of the model is very large for 
each layer with nl feature maps we match nl nl parameters so if we use all layers up to and including our model has parameters fig fourth column however we find that this texture model is heavily overparameterised in fact when using only one layer on each scale in the network curious finding is that the yellow box which indicates the source of the original texture is also placed towards the bottom left corner in the textures generated by our model as our texture model does not store any spatial information about the feature responses the only possible explanation for such behaviour is that some features in the network explicitly encode the information at the image boundaries this is exactly what we find when inspecting feature maps in the vgg network some feature maps at least from layer onwards only show high activations along their edges this might originate from the that is used for the convolutions in the vgg network and it could be interesting to investigate the effect of such padding on learning and object recognition performance classification performance gram gram vgg vgg decoding layer figure performance of linear classifier on top of the texture representations in different layers in classifying objects from the imagenet dataset information is made increasingly explicit along the hierarchy of our texture model and the model contains parameters while hardly loosing any quality fig third column we can further reduce the number of parameters by doing pca of the feature vector in the different layers of the network and then constructing the gram matrix only for the first principal components by using the first principal components for layers and we can further reduce the model to parameters fig second column interestingly constraining only the feature map averages in layers and parameters already produces interesting textures fig first column these ad hoc methods for parameter reduction show that the texture representation can be compressed greatly with little effect on the perceptual quality of the synthesised textures finding minimal set of parameters that reproduces the quality of the full model is an interesting topic of ongoing research and beyond the scope of the present paper larger number of natural textures synthesised with the parameter model can be found in the supplementary material as well as on our there one can also observe some failures of the model in case of very regular structures brick walls in general we find that the very deep architecture of the vgg network with small convolutional filters seems to be particularly well suited for texture generation purposes when performing the same experiment with the caffe reference network which is very similar to the alexnet the quality of the generated textures decreases in two ways first the statistical structure of the source texture is not fully matched even when using all constraints fig second we observe an artifactual grid that overlays the generated textures fig we believe that the artifactual grid originates from the larger receptive field sizes and strides in the caffe reference network while the results from the caffe reference network show that the architecture of the network is important the learned feature spaces are equally crucial for texture generation when synthesising texture with network with the vgg architecture but random weights texture generation fails fig underscoring the importance of using trained network to understand our texture features better in the context of the original object recognition 
task of the network we evaluated how well object identity can be linearly decoded from the texture features in different layers of the network for each layer we computed the representation of each image in the imagenet training set and trained linear classifier to predict object identity as we were not interested in optimising prediction performance we did not use any data augmentation and trained and tested only on the centre crop of the images we computed the accuracy of these linear classifiers on the imagenet validation set and compared them to the performance of the original network also evaluated on the centre crops of the validation images the analysis suggests that our texture representation continuously disentangles object identity information fig object identity can be decoded increasingly well over the layers in fact linear decoding from the final pooling layer performs almost as well as the original network suggesting that our texture representation preserves almost all information at first sight this might appear surprising since the texture representation does not necessarily preserve the global structure of objects in images fig last column however we believe that this tency is in fact to be expected and might provide an insight into how cnns encode object identity the convolutional representations in the network are and the network task object recognition is agnostic to spatial information thus we expect that object information can be read out independently from the spatial information in the feature maps we show that this is indeed the case linear classifier on the gram matrix of layer comes close to the performance of the full network top accuracy fig discussion we introduced new parametric texture model based on convolutional neural network our texture model exceeds previous work as the quality of the textures synthesised using our model shows substantial improvement compared to the current state of the art in parametric texture synthesis fig fourth row compared to last row while our model is capable of producing natural textures of comparable quality to texture synthesis methods our synthesis procedure is computationally more expensive nevertheless both in industry and academia there is currently much effort taken in order to make the evaluation of deep neural networks more efficient since our texture synthesis procedure builds exactly on the same operations any progress made in the general field of deep convolutional networks is likely to be transferable to our texture synthesis method thus we expect considerable improvements in the practical applicability of our texture model in the near future by computing the gram matrices on feature maps our texture model transforms the representations from the convolutional neural network into stationary feature space this general strategy has recently been employed to improve performance in object recognition and detection or texture recognition and segmentation in particular cimpoi et al report impressive performance in material recognition and scene segmentation by using stationary representation built on the highest convolutional layer of readily trained neural networks in agreement with our results they show that performance in natural texture recognition continuously improves when using higher convolutional layers as the input to their representation as our main aim is to synthesise textures we have not evaluated the gram matrix representation on texture recognition benchmarks but would expect that it also provides good 
feature space for those tasks in recent years texture models inspired by biological vision have provided fruitful new analysis tool for studying visual perception in particular the parametric texture model proposed by portilla and simoncelli has sparked great number of studies in neuroscience and psychophysics our texture model is based on deep convolutional neural networks that are the first artificial systems that rival biology in terms of difficult perceptual inference tasks such as object recognition at the same time their hierarchical architecture and basic computational properties admit fundamental similarity to real neural systems together with the increasing amount of evidence for the similarity of the representations in convolutional networks and those in the ventral visual pathway these properties make them compelling candidate models for studying visual information processing in the brain in fact it was recently suggested that textures generated from the representations of convolutional networks may therefore prove useful as stimuli in perceptual or physiological investigations we feel that our texture model is the first step in that direction and envision it to provide an exciting new tool in the study of visual information processing in biological systems acknowledgments this work was funded by the german national academic foundation the bernstein center for computational neuroscience fkz and the german excellency initiative through the centre for integrative neuroscience references balas nakano and rosenholtz representation in peripheral vision explains visual crowding journal of vision cadieu hong yamins pinto ardila solomon majaj and dicarlo deep neural networks rival the representation of primate it cortex for core visual object recognition plos comput biol december cimpoi maji and vedaldi deep convolutional filter banks for texture recognition and segmentation cs november arxiv denton zaremba bruna lecun and fergus exploiting linear structure within convolutional networks for efficient evaluation in nips efros and leung texture synthesis by sampling in computer vision the proceedings of the seventh ieee international conference on volume pages ieee efros and freeman image quilting for texture synthesis and transfer in proceedings of the annual conference on computer graphics and interactive techniques pages acm freeman and simoncelli metamers of the ventral stream nature neuroscience september freeman ziemba heeger simoncelli and movshon functional and perceptual signature of the second visual area in primates nature neuroscience july he zhang ren and sun spatial pyramid pooling in deep convolutional networks for visual recognition arxiv preprint heeger and bergen texture in proceedings of the annual conference on computer graphics and interactive techniques siggraph pages new york ny usa acm jaderberg vedaldi and zisserman speeding up convolutional neural networks with low rank expansions in bmvc jia shelhamer donahue karayev long girshick guadarrama and darrell caffe convolutional architecture for fast feature embedding in proceedings of the acm international conference on multimedia pages acm julesz visual pattern discrimination ire transactions on information theory february and kriegeskorte deep supervised but not unsupervised models may explain it cortical representation plos comput biol november krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in advances in neural information processing systems pages kwatra essa turk 
and bobick graphcut textures image and video synthesis using graph cuts in acm transactions on graphics tog volume pages acm lebedev ganin rakhuba oseledets and lempitsky convolutional neural networks using arxiv preprint lecun bottou orr and efficient backprop in neural networks tricks of the trade pages springer movshon and simoncelli representation of naturalistic image structure in the primate visual cortex cold spring harbor symposia on quantitative biology cognition okazawa tajima and komatsu image statistics underlying natural texture selectivity of neurons in macaque pnas january portilla and simoncelli parametric texture model based on joint statistics of complex wavelet coefficients international journal of computer vision october rosenholtz huang raj balas and ilie summary statistic representation in peripheral vision explains visual search journal of vision russakovsky deng su krause satheesh ma huang karpathy khosla bernstein berg and imagenet large scale visual recognition challenge cs september arxiv simoncelli and freeman the steerable pyramid flexible architecture for derivative computation in image processing international conference on volume pages ieee computer society simonyan and zisserman very deep convolutional networks for image recognition cs september arxiv szegedy liu jia sermanet reed anguelov erhan vanhoucke and rabinovich going deeper with convolutions cs september arxiv wei lefebvre kwatra and turk state of the art in texture synthesis in eurographics state of the art report pages eurographics association wei and levoy fast texture synthesis using vector quantization in proceedings of the annual conference on computer graphics and interactive techniques pages acm publishing yamins hong cadieu solomon seibert and dicarlo performanceoptimized hierarchical models predict neural responses in higher visual cortex pnas page may zhu byrd lu and nocedal algorithm fortran subroutines for optimization acm transactions on mathematical software toms 
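The following is a small numerical sketch of the quantities defined in the texture-model section above: the Gram matrix of a set of feature maps, the layer loss $E_l$, and its analytic gradient with respect to the generated feature maps. The feature maps here are random non-negative arrays standing in for CNN activations; since the paper reports that trained filters are crucial for synthesis quality, these random features only illustrate the computation, not the synthesis itself.

```python
import numpy as np

def gram(F):
    """Gram matrix G[i, j] = sum_k F[i, k] * F[j, k] for feature maps F of
    shape (N_l, M_l): N_l filters, M_l vectorised spatial positions."""
    return F @ F.T

def layer_loss(F_orig, F_gen):
    """E_l = 1 / (4 N_l^2 M_l^2) * sum_ij (G_ij - G_hat_ij)^2."""
    N, M = F_orig.shape
    G, G_hat = gram(F_orig), gram(F_gen)
    return ((G - G_hat) ** 2).sum() / (4.0 * N ** 2 * M ** 2)

def layer_loss_grad(F_orig, F_gen):
    """Analytic gradient dE_l / dF_hat; zero where F_hat < 0 (rectified units)."""
    N, M = F_orig.shape
    G, G_hat = gram(F_orig), gram(F_gen)
    grad = ((G_hat - G) @ F_gen) / (N ** 2 * M ** 2)
    return grad * (F_gen > 0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F_orig = np.abs(rng.standard_normal((64, 32 * 32)))  # stand-in for CNN features
    F_gen = np.abs(rng.standard_normal((64, 32 * 32)))
    # finite-difference check of the analytic gradient at one entry
    eps, i, k = 1e-6, 3, 7
    Fp = F_gen.copy()
    Fp[i, k] += eps
    num = (layer_loss(F_orig, Fp) - layer_loss(F_orig, F_gen)) / eps
    print("analytic:", layer_loss_grad(F_orig, F_gen)[i, k], " numeric:", num)
```

In a full synthesis loop, the per-layer gradients would be back-propagated through the network to the pixels and fed to a quasi-Newton optimiser, as described in the texture-generation section.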
Extending Gossip Algorithms to Distributed Estimation of U-statistics

Igor Colin, Aurélien Bellet, Joseph Salmon, Stéphan Clémençon
LTCI, CNRS, Télécom ParisTech, Paris, France; Magnet team, INRIA Lille - Nord Europe, Villeneuve d'Ascq, France

Abstract

Efficient and robust algorithms for decentralized estimation in networks are essential to many distributed systems. Whereas distributed estimation of sample mean statistics has been the subject of a good deal of attention, computation of U-statistics, relying on more expensive averaging over pairs of observations, is a less investigated area. Yet such data functionals are essential to describe global properties of a statistical population, with important examples including the area under the curve, the empirical variance, the Gini mean difference, and the within-cluster point scatter. This paper proposes new synchronous and asynchronous randomized gossip algorithms which simultaneously propagate data across the network and maintain local estimates of the U-statistic of interest. We establish convergence rate bounds of $O(1/t)$ and $O(\log t / t)$ for the synchronous and asynchronous cases respectively, where $t$ is the number of iterations, with explicit data- and network-dependent terms. Beyond favorable comparisons in terms of rate analysis, numerical experiments provide empirical evidence that the proposed algorithms surpass the previously introduced approach.

Introduction

Decentralized computation and estimation have many applications in sensor and peer-to-peer networks, as well as for extracting knowledge from massive information graphs such as interlinked Web documents and social media. Algorithms running on such networks must often operate under tight constraints: the nodes forming the network cannot rely on a centralized entity for communication and synchronization, are not aware of the global network topology, and have limited resources (computational power, memory, energy). Gossip algorithms, where each node exchanges information with at most one of its neighbors at a time, have emerged as a simple yet powerful technique for distributed computation in such settings. Given a data observation on each node, gossip algorithms can be used to compute averages or sums of functions of the data that are separable across observations (see for instance the surveys of Dimakis et al. and Shah, and references therein). Unfortunately, these algorithms cannot be used to efficiently compute quantities that take the form of an average over pairs of observations, also known as U-statistics. Among classical U-statistics used in machine learning and data mining, one can mention, among others, the sample variance, the area under the curve (AUC) of a classifier on distributed data, the Gini mean difference, the Kendall tau rank correlation coefficient, the within-cluster point scatter, and several statistical hypothesis test statistics such as the Wilcoxon (Mann-Whitney) rank-sum statistic.

In this paper, we propose randomized synchronous and asynchronous gossip algorithms to efficiently compute U-statistics, in which each node maintains a local estimate of the quantity of interest throughout the execution of the algorithm. Our methods rely on two types of iterative information exchange in the network: propagation of local observations across the network, and averaging of local estimates. We show that the local estimates generated by our approach converge in expectation to the value of the U-statistic at rates of $O(1/t)$ and $O(\log t / t)$ for the synchronous and asynchronous versions respectively, where $t$ is the number of iterations. These convergence bounds feature data-dependent terms that reflect the hardness of the estimation problem, and terms related to the spectral gap of the network graph, showing that our algorithms are faster on well-connected networks. The proofs rely on an original reformulation of the problem using phantom
nodes, i.e., additional nodes that account for data propagation in the network. Our results largely improve upon those of Pelckmans and Suykens; in particular, we achieve faster convergence together with lower memory and communication costs. Experiments conducted on AUC and within-cluster point scatter estimation using real data confirm the superiority of our approach.

The rest of this paper is organized as follows. We first introduce the problem of interest as well as relevant notation, and provide a brief review of related work on gossip algorithms. We then describe our approach along with its convergence analysis, both in the synchronous and asynchronous settings, and finally present our numerical results.

Background

Definitions and notation. For any integer $p > 0$, we denote by $[p]$ the set $\{1, \ldots, p\}$ and by $|F|$ the cardinality of any finite set $F$. We represent a network of size $n > 0$ as an undirected graph $G = (V, E)$, where $V = [n]$ is the set of vertices and $E \subseteq V \times V$ the set of edges. We denote by $A(G)$ the adjacency matrix related to the graph $G$, that is, for all $(i, j) \in V^2$, $[A(G)]_{ij} = 1$ if and only if $(i, j) \in E$. For any node $i \in V$, we denote its degree by $d_i$. We denote by $L(G)$ the graph Laplacian of $G$, defined by $L(G) = D(G) - A(G)$, where $D(G) = \mathrm{diag}(d_1, \ldots, d_n)$ is the matrix of degrees. A graph $G$ is said to be connected if for all $(i, j) \in V^2$ there exists a path connecting $i$ and $j$; it is bipartite if there exist $V_1, V_2 \subset V$ such that $V_1 \cup V_2 = V$, $V_1 \cap V_2 = \emptyset$ and $E \subseteq (V_1 \times V_2) \cup (V_2 \times V_1)$. A matrix $M$ is nonnegative (resp. positive) if and only if $M_{ij} \geq 0$ (resp. $M_{ij} > 0$) for all $(i, j)$; we write $M \geq 0$ (resp. $M > 0$) when this holds. The transpose of $M$ is denoted by $M^\top$. A matrix $P$ is stochastic if and only if $P \geq 0$ and $P \mathbf{1}_n = \mathbf{1}_n$, where $\mathbf{1}_n = (1, \ldots, 1)^\top \in \mathbb{R}^n$; the matrix $P$ is bi-stochastic if and only if $P$ and $P^\top$ are stochastic. We denote by $I_n$ the identity matrix in $\mathbb{R}^{n \times n}$, by $(e_1, \ldots, e_n)$ the standard basis in $\mathbb{R}^n$, by $\mathbb{1}_A$ the indicator function of an event $A$, and by $\|\cdot\|$ the usual $\ell_2$ norm.

Problem statement. Let $\mathcal{X}$ be an input space and $(X_1, \ldots, X_n)$ a sample of $n \geq 2$ points in that space. We assume $\mathcal{X} \subseteq \mathbb{R}^d$ for some $d > 0$ throughout the paper, but our results straightforwardly extend to the more general setting. We denote by $\mathbf{X} = (X_1, \ldots, X_n)^\top$ the design matrix. Let $H : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a measurable function, symmetric in its two arguments. We consider the problem of estimating the following quantity, known as a degree-two U-statistic:
$$\hat{U}_n(H) = \frac{2}{n(n-1)} \sum_{1 \leq i < j \leq n} H(X_i, X_j).$$
In this paper, we illustrate the interest of U-statistics on two applications, among many others. The first one is the within-cluster point scatter, which measures the clustering quality of a partition $\mathcal{P}$ of $\mathcal{X}$ as the average distance between points in each cell $\mathcal{C} \in \mathcal{P}$. It is of the form $\hat{U}_n(H_{\mathcal{P}})$ with
$$H_{\mathcal{P}}(X, X') = \sum_{\mathcal{C} \in \mathcal{P}} \|X - X'\| \cdot \mathbb{1}_{\{(X, X') \in \mathcal{C}^2\}}.$$
We also study the AUC measure: for a given sample $(X_1, \ell_1), \ldots, (X_n, \ell_n)$ on $\mathcal{X} \times \{-1, +1\}$, the AUC of a linear classifier $\theta$ is a degree-two U-statistic whose kernel indicates whether a pair of observations with opposite labels is ranked correctly by $\theta$; this score is the probability for a classifier to rank a positive observation higher than a negative one. We point out that the usual definition of AUC differs slightly from $\hat{U}_n(H)$ by a normalization factor.

Algorithm 1: GoSta-sync, a synchronous gossip algorithm for computing $\hat{U}_n(H)$.
Require: each node $k$ holds observation $x_k$.
1. Each node $k$ initializes its auxiliary observation $y_k = x_k$ and its estimate $z_k = 0$.
2. For $t = 1, 2, \ldots$:
3. For each node $p \in [n]$: set $z_p \leftarrow (1 - \tfrac{1}{t}) z_p + \tfrac{1}{t} H(x_p, y_p)$.
4. Draw $(i, j)$ uniformly at random from $E$.
5. Set $z_i, z_j \leftarrow \tfrac{1}{2}(z_i + z_j)$.
6. Swap the auxiliary observations of nodes $i$ and $j$: $y_i \leftrightarrow y_j$.

(A small simulation sketch of this algorithm is given after the references of this paper.)

We focus here on the decentralized setting where the data sample is partitioned across a set of nodes in a network. For simplicity, we assume that the network has $n$ nodes and each node $i$ only has access to a single data observation $X_i$. We are interested in estimating $\hat{U}_n(H)$ efficiently using a gossip algorithm.

Related work

Gossip algorithms have been extensively studied in the context of decentralized averaging in networks, where the goal is to compute the average of $n$ real numbers. One of the earliest works on this canonical problem is due to Tsitsiklis, but more efficient algorithms have recently been proposed. Of particular interest to us is the work of Boyd et al., which introduces a randomized gossip algorithm for computing the empirical mean in a context where nodes wake up
asynchronously and simply average their local estimate with that of randomly chosen neighbor the communication probabilities are given by stochastic matrix where pij is the probability that node selects neighbor at given iteration as long as the network graph is connected and the local estimates converge to at rate where the constant can be tied to the spectral gap of the network graph showing faster convergence for such algorithms pncan be extended to compute other functions such as maxima and minima or sums of the form xi for some function as done for instance in some work has also gone into developing faster gossip algorithms for poorly connected networks assuming that nodes know their partial geographic location for detailed account of the literature on gossip algorithms we refer the reader to however existing gossip algorithms can not be used to efficiently compute as it depends on pairs of observations to the best of our knowledge this problem has only been investigated in their algorithm coined achieves convergence rate but has several drawbacks first each node must store two auxiliary observations and two pairs of nodes must exchange an observation at each iteration for problems large this leads to significant memory and communication load second the algorithm is not asynchronous as every node must update its estimate at each iteration consequently nodes must have access to global clock which is often unrealistic in practice in the next section we introduce new synchronous and asynchronous algorithms with faster convergence as well as smaller memory and communication cost per iteration gosta algorithms in this section we introduce gossip algorithms for computing our approach is based on the pn pn observation that hi with hi xi xj and we write hn the goal is thus similar to the usual distributed averaging problem with the our results generalize to the case where each node holds subset of the observations see section for the sake of completeness we provide an analysis of this algorithm in the supplementary material original graph new graph figure comparison of original network and phantom network key difference that each local value hi is itself an average depending on the entire data sample consequently our algorithms will combine two steps at each iteration data propagation step to allow each node to estimate hi and an averaging step to ensure convergence to the desired value we first present the algorithm and its analysis for the simpler synchronous setting in section before introducing an asynchronous version section synchronous setting in the synchronous setting we assume that the nodes have access to global clock so that they can all update their estimate at each time instance we stress that the nodes need not to be aware of the global network topology as they will only interact with their direct neighbors in the graph let us denote by zk the local estimate of by node at iteration in order to propagate data across the network each node maintains an auxiliary observation yk initialized to xk our algorithm coined gosta goes as follows at each iteration each node updates its local estimate by taking the running average of zk and xk yk then an edge of the network is drawn uniformly at random and the corresponding pair of nodes average their local estimates and swap their auxiliary observations the observations are thus each performing random walk albeit coupled on the network graph the full procedure is described in algorithm in order to prove the convergence of algorithm we 
consider an equivalent reformulation of the problem which allows us to model the data propagation and the averaging steps separately specifically for each we define phantom gk vk ek of the original network with vk vik and ek vik vjk we then create new graph where each node is connected to its counterpart vkk vk vk ek vkk the construction of is illustrated in figure in this new graph the nodes from the original network will hold the estimates zn as described above the role of each gk is to simulate the data propagation in the original graph for vik initially holds the value xk xi at each iteration we draw random edge of and nodes vik and vjk swap their value for all to update its estimate each node will use the current value at vkk we can now represent the system state at iteration by vector the first coefficients are associated with nodes in and correspond to the estimate vector zn the last coefficients are associated with nodes in vk and represent the data propagation in the network their initial value is set to en so that for any node vl initially stores the value xk xl remark the phantom network is of size but we stress the fact that it is used solely as tool for the convergence analysis algorithm operates on the original graph the transition matrix of this system accounts for three events the averaging step the action of on itself the data propagation the action of gk on itself for all and the estimate update the action of gk on node for all at given step we are interested in characterizing the transition matrix such that for the sake of clarity we write as an upper matrix with and rn the bottom left part is necessarily because does not influence any gk the upper left block corresponds to the averaging step therefore for any we have in ei ej ei ej where for any wα is defined by wα in ei ej ei ej in furthermore and are defined as follows and en where is block diagonal matrix corresponding to the observations being propagated and represents the estimate update for each node note that in where is the kronecker product we can now describe the expected state evolution at iteration one has using recursion we can write pt bc therefore in order to prove the convergence of algorithm one needs to show that pt bc we state this precisely in the next theorem theorem let be connected and graph with nodes design matrix and the sequence of estimates generated by algorithm for all we have xi xj lim zk moreover for any ct ct where and is the second largest eigenvalue of proof see supplementary material theorem shows that the local estimates generated by algorithm converge to at rate furthermore the constants reveal the rate dependency on the particular problem instance indeed the two norm terms are and quantify the difficulty of the estimation problem itself through dispersion measure in contrast is term since where is the second smallest eigenvalue of the graph laplacian see lemma in the supplementary material the value is also known as the spectral gap of and graphs with larger spectral gap typically have better connectivity this will be illustrated in section algorithm an asynchronous gossip algorithm for computing require each node holds observation xk and pk each node initializes yk xk zk and mk for do draw uniformly at random from set mi mi and mj mj set zi zj zi zj set zi zi xi yi set zj pj zj pj xj yj swap auxiliary observations of nodes and yi yj end for comparison to to estimate does not use averaging instead each node requires two auxiliary observations yk and yk which are both initialized to 
xk at each iteration each node updates its local estimate by taking the running average of zk and yk yk then two random edges are selected the nodes connected by the first resp second edge swap their first resp second auxiliary observations precise statement of the algorithm is provided in the supplementary material has several drawbacks compared to gosta it requires initiating communication between two pairs of nodes at each iteration and the amount of communication and memory required is higher especially when data is furthermore applying our convergence analysis to we obtain the following refined where and is the second largest eigenvalue of the advantage of propagating two observations in is seen in the term however the absence of averaging leads to an overall factor intuitively this is because nodes do not benefit from each other estimates in practice and are close to for reasonablysized networks for instance for the complete graph so the square term does not provide much gain and the factor dominates in we thus expect to converge slower than gosta which is confirmed by the numerical results presented in section asynchronous setting in practical settings nodes may not have access to global clock to synchronize the updates in this section we remove the global clock assumption and propose fully asynchronous algorithm where each node has local clock ticking at rate poisson process yet local clocks are so one can use an equivalent model with global clock ticking at rate poisson process and random edge draw at each iteration as in synchronous setting one may refer to for more details on clock modeling however at given iteration the estimate update step now only involves the selected pair of nodes therefore the nodes need to maintain an estimate of the current iteration number to ensure convergence to an unbiased estimate of hence for all let pk denote the probability of node being picked at any iteration with our assumption that nodes activate with uniform distribution over pk moreover the number of times node has been selected at given iteration follows binomial distribution with parameters and pk let us define mk such that mk and for mk if is picked at iteration mk mk otherwise for any and any one has mk pk therefore given that every node knows its degree and the total number of edges in the network the iteration estimates are unbiased we can now give an asynchronous version of gosta as stated in algorithm to show that local estimates converge to we use similar model as in the synchronous setting the time dependency of the transition matrix is more complex so is the upper bound the proof can be found in the supplementary material dataset wine quality complete graph graph table value of for each network theorem let be connected and non bipartite graph with nodes design matrix and the sequence of estimates generated by algorithm for all we have lim zk xi xj moreover there exists constant such that for any log khk proof see supplementary material remark our methods can be extended to the situation where nodes contain multiple observations when drawn node will pick random auxiliary observation to swap similar convergence results are achieved by splitting each node into set of nodes each containing only one observation and new edges weighted judiciously experiments in this section we present two applications on real datasets the decentralized estimation of the area under the roc curve auc and of the point scatter we compare the performance of our algorithms to that of see supplementary 
material for additional comparisons to some baseline methods we perform our simulations on the three types of network described below corresponding values of are shown in table complete graph this is the case where all nodes are connected to each other it is the ideal situation in our framework since any pair of nodes can communicate directly for complete graph of size see or for details grid here nodes are located on grid and each node is connected to its four neighbors on the grid this network offers regular graph with isotropic communication but its diameter is quite high especially in comparison to usual networks this random network generation technique is introduced in and allows us to create networks with various communication properties it relies on two parameters the average degree of the network and rewiring probability in expectation the higher the rewiring probability the better the connectivity of the network here we use and to achieve connectivity compromise between the complete graph and the grid auc measure we first focus on the auc measure of linear classifier as defined in we use the binary classification dataset which contains points in we set to the difference between the class means for each generated network we perform runs of algorithm and the top row of figure shows the evolution over time of the average relative error and the associated standard deviation across nodes for both algorithms on each type of network on average outperforms on every network the variance of the estimates across nodes is also lower due to the averaging step interestingly the performance gap between the two algorithms is greatly increasing early on presumably because the exponential term in the convergence bound of is significant in the first steps point scatter we then turn to the point scatter defined in we use the wine quality dataset which contains points in dimensions with total of we focus on the partition associated to class centroids and run the aforementioned this dataset is available at http this dataset is available at https figure evolution of the average relative error solid line and its standard deviation filled area with the number of iterations for red and algorithm blue on the dataset top row and the wine quality dataset bottom row error reaching time average relative error figure panel shows the average number of iterations needed to reach an relative error below for several network sizes panel compares the relative error solid line and its standard deviation filled area of synchronous blue and asynchronous red versions of gosta methods times the results are shown in the bottom row of figure as in the case of auc achieves better perfomance on all types of networks both in terms of average error and variance in figure we show the average time needed to reach relative error on complete graph ranging from to as predicted by our analysis the performance gap widens in favor of gosta as the size of the graph increases finally we compare the performance of and algorithm in figure despite the slightly worse theoretical convergence rate for both algorithms have comparable performance in practice conclusion we have introduced new synchronous and asynchronous randomized gossip algorithms to compute statistics that depend on pairs of observations we have proved the convergence rate in both settings and numerical experiments confirm the practical interest of the proposed algorithms in future work we plan to investigate whether adaptive communication schemes such as those of can be used 
to our algorithms our contribution could also be used as building block for decentralized optimization of extending for instance the approaches of acknowledgements this work was supported by the chair machine learning for big data of paristech and was conducted when bellet was affiliated with paristech references modern graph theory volume springer stephen boyd arpita ghosh balaji prabhakar and devavrat shah randomized gossip algorithms ieee transactions on information theory fan chung spectral graph theory volume american mathematical society on and clustering performance in advances in neural information processing systems pages alexandros dimakis soummya kar moura michael rabbat and anna scaglione gossip algorithms for distributed signal processing proceedings of the ieee alexandros dimakis anand sarwate and martin wainwright geographic gossip efficient averaging for sensor networks ieee transactions on signal processing john duchi alekh agarwal and martin wainwright dual averaging for distributed optimization convergence analysis and network scaling ieee transactions on automatic control james hanley and barbara mcneil the meaning and use of the area under receiver operating characteristic roc curve radiology richard karp christian schindelhauer scott shenker and berthold vocking randomized rumor spreading in symposium on foundations of computer science pages ieee david kempe alin dobra and johannes gehrke computation of aggregate information in symposium on foundations of computer science pages ieee wojtek kowalczyk and nikos vlassis newscast em in advances in neural information processing systems pages alan lee theory and practice marcel dekker new york wenjun li huaiyu dai and yanbing zhang fast distributed consensus in wireless networks ieee transactions on information theory henry mann and donald whitney on test of whether one of two random variables is stochastically larger than the other annals of mathematical statistics damon and devavrat shah fast distributed algorithms for computing separable functions ieee transactions on information theory angelia nedic and asuman ozdaglar distributed subgradient methods for optimization ieee transactions on automatic control kristiaan pelckmans and johan suykens gossip algorithms for computing in ifac workshop on estimation and control of networked systems pages devavrat shah gossip algorithms foundations and trends in networking john tsitsiklis problems in decentralized decision making and computation phd thesis massachusetts institute of technology duncan watts and steven strogatz collective dynamics of networks nature 
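The following is a minimal simulation sketch of the synchronous algorithm (Algorithm 1, GoSta-sync) as reconstructed above. The running-average weights $(1 - 1/t, 1/t)$ are one reasonable reading of the pseudocode, and the kernel $H(x, x') = (x - x')^2 / 2$ makes the U-statistic equal to the unbiased sample variance, one of the motivating examples of the paper. The graph choice and iteration count are arbitrary; as the rate analysis suggests, better-connected graphs converge faster.

```python
import numpy as np

def gosta_sync(x, edges, kernel, n_iters=20000, seed=0):
    """Simulate GoSta-sync: each node k keeps an auxiliary observation y_k
    (initially x_k) and an estimate z_k (initially 0).  At step t every node
    averages H(x_k, y_k) into its running estimate, then one random edge
    averages its two estimates and swaps its auxiliary observations."""
    rng = np.random.default_rng(seed)
    y = x.copy()          # auxiliary observations, performing coupled random walks
    z = np.zeros(len(x))  # local estimates of the U-statistic
    for t in range(1, n_iters + 1):
        z = (1.0 - 1.0 / t) * z + (1.0 / t) * kernel(x, y)  # running average
        i, j = edges[rng.integers(len(edges))]               # draw a random edge
        z[i] = z[j] = 0.5 * (z[i] + z[j])                    # average estimates
        y[i], y[j] = y[j], y[i]                              # swap observations
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 50
    x = rng.standard_normal(n)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]  # complete graph
    h = lambda a, b: 0.5 * (a - b) ** 2        # U-statistic = unbiased sample variance
    z = gosta_sync(x, edges, h)
    exact = sum(h(x[i], x[j]) for i in range(n) for j in range(i + 1, n)) * 2 / (n * (n - 1))
    print("exact U-statistic:", exact)
    print("node estimates: mean %.4f, max rel. err %.3f"
          % (z.mean(), np.abs(z - exact).max() / exact))
```

Replacing the complete graph with a cycle or grid in this sketch illustrates the dependence of the convergence speed on the spectral gap discussed in the analysis.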
streaming distributed variational inference for bayesian nonparametrics trevor julian john fisher jonathan lids csail mit tdjc jstraub csail fisher csail jhow abstract this paper presents methodology for creating streaming distributed inference algorithms for bayesian nonparametric bnp models in the proposed framework processing nodes receive sequence of data minibatches compute variational posterior for each and make asynchronous streaming updates to central model in contrast to previous algorithms the proposed framework is truly streaming distributed asynchronous and the key challenge in developing the framework arising from the fact that bnp models do not impose an inherent ordering on their components is finding the correspondence between minibatch and central bnp posterior components before performing each update to address this the paper develops combinatorial optimization problem over component correspondences and provides an efficient solution technique the paper concludes with an application of the methodology to the dp mixture model with experimental results demonstrating its practical scalability and performance introduction bayesian nonparametric bnp stochastic processes are streaming priors their unique feature is that they specify in probabilistic sense that the complexity of latent model should grow as the amount of observed data increases this property captures common sense in many data analysis problems for example one would expect to encounter far more topics in document corpus after reading documents than after reading and becomes crucial in settings with unbounded persistent streams of data while their fixed parametric cousins can be used to infer model complexity for datasets with known magnitude priori such priors are silent with respect to notions of model complexity growth in streaming data settings bayesian nonparametrics are also naturally suited to parallelization of data processing due to the exchangeability and thus conditional independence they often exhibit via de finetti theorem for example labels from the chinese restaurant process are rendered by conditioning on the underlying dirichlet process dp random measure and feature assignments from the indian buffet process are rendered by conditioning on the underlying beta process bp random measure given these properties one might expect there to be wealth of inference algorithms for bnps that address the challenges associated with parallelization and streaming however previous work has only addressed these two settings in concert for parametric models and only recently has each been addressed individually for bnps in the streaming setting and developed streaming inference for dp mixture models using sequential variational approximation stochastic variational inference and related methods are often considered streaming algorithms but their performance depends on the choice of learning rate and on the dataset having known fixed size priori outside of variational approaches which are the focus of the present paper there exist exact parallelized mcmc methods for bnps the tradeoff in using such methods is that they provide samples from the posterior rather than the distribution itself and results regarding assessing minibatch posterior retrieve original central posterior central node intermediate posterior node central node node central node node central node node retrieve data yt data stream retrieve the yt data stream yt perform component id data stream data stream perform inference update the model figure the 
four main steps of the algorithm that is run asynchronously on each processing node convergence remain limited sequential particle filters for inference have also been developed but these suffer issues with particle degeneracy and exponential forgetting the main challenge posed by the streaming distributed setting for bnps is the combinatorial problem of component identification most bnp models contain some notion of countably infinite set of latent components clusters in dp mixture model and do not impose an inherent ordering on the components thus in order to combine information about the components from multiple processors the correspondence between components must first found brute force search is tractable even for moderately sized models there are possible correspondences for two sets of components of sizes and furthermore there does not yet exist method to evaluate the quality of component correspondence for bnp models this issue has been studied before in the mcmc literature where it is known as the label switching problem but past solution techniques are generally and restricted to use on very simple mixture models this paper presents methodology for creating streaming distributed inference algorithms for bayesian nonparametric models in the proposed framework shown for single node in figure processing nodes receive sequence of data minibatches compute variational posterior for each and make asynchronous streaming updates to central model using mapping obtained from component identification optimization the key contributions of this work are as follows first we develop minibatch posterior decomposition that motivates streaming distributed framework suitable for bayesian nonparametrics then we derive the component identification optimization problem by maximizing the probability of component matching we show that the bnp prior regularizes model complexity in the optimization an interesting side effect of this is that regardless of whether the minibatch variational inference scheme is truncated the proposed algorithm is finally we provide an efficiently computable regularization bound for the dirichlet process prior based on jensen the paper concludes with applications of the methodology to the dp mixture model with experimental results demonstrating the scalability and performance of the method in practice streaming distributed bayesian nonparametric inference the proposed framework motivated by posterior decomposition that will be discussed in section involves collection of processing nodes with asynchronous access to central variational posterior approximation shown for single node in figure data is provided to each processing node as sequence of minibatches when processing node receives minibatch of data it obtains the central posterior figure and using it as prior computes minibatch variational posterior approximation figure when minibatch inference is complete the node then performs component identification between the minibatch posterior and the current central posterior accounting for possible modifications made by other processing nodes figure finally it merges the minibatch posterior into the central variational posterior figure in the following sections we use the dp mixture as guiding example for the technical development of the inference framework however it is emphasized that the material in this paper generalizes to many other bnp models such as the hierarchical dp hdp topic model bp latent feature model and py mixture see the supplement for further details 
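As a reading aid, the following is a schematic sketch (not the authors' implementation) of the update cycle just described — retrieve the central posterior, run minibatch inference, identify components, and merge — with a toy lockable central posterior; the one-component "inference" step and the `combine` helper are hypothetical stand-ins for the variational inference and exponential-family density multiplication developed in the later sections.

```python
# Schematic sketch of one processing node's asynchronous update cycle:
# (a) retrieve the central posterior, (b) minibatch variational inference,
# (c) component identification, (d) merge into the central posterior.
# All helpers here are toy stand-ins, not the authors' implementation.
import threading

class CentralPosterior:
    """Toy central node: a lockable list of component parameters."""
    def __init__(self, components):
        self.components = list(components)
        self._lock = threading.Lock()

    def snapshot(self):
        # Reads never block: other nodes may fetch the posterior at any time.
        return list(self.components)

    def merge(self, minibatch_components, mapping):
        # Writes are serialized; mapping[j] is the index of the central
        # component that minibatch component j matches (>= current size => new).
        with self._lock:
            for j, k in enumerate(mapping):
                if k < len(self.components):
                    self.components[k] = combine(self.components[k],
                                                 minibatch_components[j])
                else:
                    self.components.append(minibatch_components[j])

def combine(a, b):
    # Stand-in for multiplying exponential-family densities, which amounts
    # to adding natural parameters; here a component is a single float.
    return a + b

def process_minibatch(y, central):
    prior = central.snapshot()              # (a) original posterior
    local = [sum(y) / len(y)]               # (b) toy one-component "posterior"
    mapping = [len(prior) - 1]              # (c) trivial component matching
    central.merge(local, mapping)           # (d) streaming update

central = CentralPosterior([0.0])
process_minibatch([1.0, 2.0, 3.0], central)
print(central.components)                   # [2.0]
```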
regularization bounds for other popular bnp priors may be found in the supplement posterior decomposition consider dp mixture model with cluster parameters assignments and observed data for each asynchronous update made by each processing node the dataset is split into three subsets yo yi ym for analysis when the processing node receives minibatch of data ym it queries the central processing node for the original posterior zo which will be used as the prior for minibatch inference once inference is complete it again queries the central processing node for the intermediate posterior zo zi yi which accounts for asynchronous updates from other processing nodes since minibatch inference began each subset yr has nr observations yrj and each variable zrj assigns yrj to cluster parameter θzrj given the independence of and in the prior and the conditional independence of the data given the latent parameters bayes rule yields the following decomposition of the posterior of and given updated central posterior original posterior minibatch posterior intermediate posterior zi zm zo zm zo yo zi zo yo zi zm this decomposition suggests simple streaming distributed asynchronous update rule for processing node first obtain the current central posterior density zo and using it as prior compute the minibatch posterior zm zo ym and then update the central posterior density by using with the current central posterior density zi zo yo however there are two issues preventing the direct application of the decomposition rule unknown component correspondence since it is generally intractable to find the minibatch posteriors zm zo ym exactly approximate methods are required further as requires the multiplication of densities methods are difficult to use suggesting variational approach typical variational techniques introduce an artificial ordering of the parameters in the posterior thereby breaking symmetry that is crucial to combining posteriors correctly using density multiplication the use of with variational approximations thus requires first solving component identification problem unknown model size while previous posterior merging procedures required matching between the components of the minibatch posterior and central posterior bayesian nonparametric posteriors break this assumption indeed the datasets yo yi and ym from the same nonparametric mixture model can be generated by the same disjoint or an overlapping set of cluster parameters in other words the global number of unique posterior components can not be determined until the component identification problem is solved and the minibatch posterior is merged variational component identification suppose we have the following exponential family prior and approximate variational posterior densities in the minibatch decomposition θk θk θk zo qo zo ζo zo ko θk eηok θk ηok zm zo yo qm zm zo ζm zm ζo zo θk eηmk θk ηmk zi zo yo qi zi zo ζi zi ζo zo ki θk eηik θk ηik where ζr are products of categorical distributions for the cluster labels zr and the goal is to use the posterior decomposition to find the updated posterior approximation θk eηk θk ηk as mentioned in the previous section the artificial ordering of components causes the application of with variational approximations to fail as disparate components from the approximate posteriors may be merged erroneously this is demonstrated in figure which shows results from synthetic experiment described in section ignoring component identification as the number of parallel threads increases more matching mistakes 
are made leading to decreasing model quality to address this first note that there is no issue with the first ko components of qm and qi these can be merged directly since they each correspond to the ko components of qo thus the component identification problem reduces to finding the correspondence between the last km km ko components of the minibatch posterior and the last ki ki ko components of the intermediate posterior for notational simplicity and without loss of generality fix the component ordering of the intermediate posterior qi and define km ki km to be the mapping from minibatch posterior component to updated central posterior component where the fact that the first ko components have no ordering ambiguity can be expressed as ko note that the maximum number of components after merging is ki km since each of the last km components in the minibatch posterior may correspond to new components in the intermediate posterior after substituting the three variational approximations into the goal of the component identification optimization is to find the mapping that yields the largest updated posterior normalizing constant matches components with similar densities xz zi zm argmax qo zo qm zm zo qi zi zo qm zm ζm zm θσ eηmk θσ ηmk ko zmj pζ zmj zm is the distribution such that pζm taking the where ζm logarithm of the objective and exploiting the decoupling allows the separation of the objective into sum of two terms one expressing the quality of the matching between components the integral over and one that regularizes the final model size the sum over while the first term is available in closed form the second is in general not therefore using the concavity of the logarithm function jensen inequality yields lower bound that can be used in place of the intractable original objective resulting in the final component identification optimization ki argmax eσζ log zi zm zo ko more detailed derivation of the optimization may be found in the supplement eσζ denotes expecσ tation under the distribution ζo zo ζi zi ζm zm and km ηrk kr kr km where km denotes the range of the mapping the definitions in ensure that the prior is used whenever posterior does not contain particular component the intuition for the optimization is that it combines finding component correspondences with high similarity via the function with regularization on the final updated posterior model size despite its motivation from the dirichlet process mixture the component identification optimization is not specific to this model indeed the derivation did not rely on any properties specific to the dirichlet process mixture the optimization applies to any bayesian nonparametric model with set of components and set of combinatorial indicators for example the optimization applies to the hierarchical dirichlet process topic model with topic word distributions and topic correspondences and to the beta process latent feature model with features and this is equivalent to the regularization ζo zo ζi zi ζm zm zi zm zo binary assignment vectors the form of the objective in the component identification optimization reflects this generality in order to apply the proposed streaming distributed method to particular model one simply needs variational inference algorithm that computes posteriors of the form and way to compute or bound the expectation in the objective of updating the central posterior to update the central posterior the node first locks it and solves for via locking prevents other nodes from solving or modifying the central 
posterior but does not prevent other nodes from reading the central posterior obtaining minibatches or performing inference the synthetic experiment in section shows that this does not incur significant time penalty in practice then the processing node transmits and its minibatch variational posterior to the central processing node where the product decomposition is used to find the updated central variational posterior in with parameters max ki max ζi zi ζo zo ζm zm ηk km finally the node unlocks the central posterior and the next processing node to receive new minibatch will use the above and ηk from the central node as their ko ζo zo and ηok application to the dirichlet process mixture model the expectation in the objective of is typically intractable to compute in therefore suitable lower bound may be used in its place this section presents such bound for the dirichlet process and discusses the application of the proposed inference framework to the dirichlet process mixture model using the developed bound crucially the lower bound decomposes such that the optimization becomes bipartite matching problem such problems are solvable in polynomial time by the hungarian algorithm leading to tractable component identification step in the proposed streaming distributed framework regularization lower bound for the dirichlet process with concentration parameter zi zm zo is the exchangeable partition probability function eppf zi zm zo nk where nk is the amount of data assigned to cluster and is the set of labels of nonempty clusters given that the variational distribution ζr zr is product of independent categorqnr qkr zrj ical distributions ζr zr jensen inequality may be used to bound the πrjk regularization in below see the supplement for further details by ki eσζ log zi zm zo log log max where is constant with respect to the component mapping and pn pn log πrjk kr kr pn pn km km log πmjσ note that the bound allows incremental updates after finding the optimal mapping the central update can be augmented by updating the values of sk and tk on the central node to sk tk truth lower bound truth lower bound regularization regularization increasing increasing certain number of clusters clustering uncertainty uncertain figure the dirichlet process regularization and lower bound with fully uncertain labelling and varying number of clusters and the number of clusters fixed with varying labelling uncertainty as with ηk and from after performing the regularization statistics update processing node that receives new minibatch will use the above sk and tk as their sok and tok respectively figure demonstrates the behavior of the lower bound in synthetic experiment with datapoints for various dp concentration parameter values the true regularization log eζ was computed by sample approximation with samples in figure the number of this figure demonstrates clusters was varied with symmetric categorical label weights set to two important phenomena first the bound increases as in other words it gives preference to fewer larger clusters which is the typical bnp rich get richer property second the behavior of the bound as depends on the concentration parameter as increases more clusters are preferred in figure the number of clusters was fixed to and the categorical weights were sampled from symmetric dirichlet distribution with parameter this figure demonstrates that the bound does not degrade significantly with high labelling uncertainty and is nearly exact for low labelling uncertainty overall figure demonstrates 
that the proposed lower bound exhibits similar behaviors to the true regularization supporting its use in the optimization solving the component identification optimization given that both the regularization and component matching score in the objective decompose the objective can be rewritten using matrix of matching as sum of terms for each ki km ki ki scores and selector variables ki ki setting xkj indicates that component in the minibatch posterior is matched to component in the intermediate posterior providing score rkj defined using and as rkj log max the optimization can be rewritten in terms of and as argmax tr xt xt xkk ko ki ki the first two constraints express the property of the constraint xkk ko fixes the upper ko block of to due to the fact that the first ko components are matched directly and the blocks to denoting to be the lower right km km blocks of the remaining optimization problem is linear assignment problem on with cost matrix which can be solved using the hungarian note that if km ko or ki ko this implies that no matching problem needs to be solved the first ko components of the minibatch posterior are matched directly and the last km are set as new components in practical implementation of the framework new clusters are typically discovered at diminishing rate as more data are observed so the number of matching problems that are solved likewise tapers off the final optimal component mapping is found by finding the nonzero elements of argmax kj km for the experiments in this work we used the implementation at cpu time batch sva svi movb sc without component id with component id clusters matchings true clusters clusters matchings true clusters count count test traces time threads threads count cpu time test log likelihood test log likelihood test log likelihood cpu time test log likelihood test log likelihood cpu time merge time microseconds cpu time for component id minibatches merged id counts threads final id counts figure synthetic results over trials computation time and test log likelihood for with varying numbers of parallel threads with component identification disabled and enabled test log likelihood traces for threads and the comparison algorithms histogram of computation time in microseconds to solve the component identification optimization number of clusters and number of component identification problems solved as function of the number of minibatch updates threads final number of clusters and matchings solved with varying numbers of parallel threads experiments in this section the proposed inference framework is evaluated on the dp gaussian mixture with niw prior we compare the streaming distributed procedure coupled with standard variational inference to five inference algorithms memoized online variational inference movb stochastic online variational inference svi with learning rate sequential variational approximation sva with cluster creation old and threshold subcluster splits mcmc sc and batch variational inference batch priors were set by hand and all methods were initialized randomly methods that use multiple passes through the data movb svi were allowed to do so movb was allowed to make moves while had fixed truncations all experiments were performed on computer with cpu cores and of ram synthetic this dataset consisted of vectors generated from gaussian mixture model with clusters and niw prior with and the algorithms were given the true niw prior dp concentration and minibatches of size minibatch inference was truncated to components and all 
other algorithms were truncated to components figure shows the results from the experiment over trials which illustrate number of important properties of first and foremost ignoring the component identification problem leads to decreasing model quality with increasing number of parallel threads since more matching mistakes are made figure second if component identification is properly accounted for using the proposed optimization increasing the number of parallel threads reduces execution time but does not affect the final model quality figure third with threads converges to the same final test log likelihood as the comparison algorithms in significantly reduced time figure fourth each component identification optimization typically takes seconds and thus matching accounts for less than millisecond of total computation and does not affect the overall computation time significantly figure fifth the majority of the component matching problems are solved within the first minibatch updates out of total of afterwards the true clusters have all been discovered and the processing nodes contribute to those clusters rather than creating new ones as per the discussion at the end of section figure finally increased parallelization can be advantageous in discovering the correct number of clusters with only one thread mistakes made early on are built upon and persist whereas with more threads there are more component identification problems solved and thus more chances to discover the correct clusters figure count cluster airplane trajectory clusters algorithm svi sva movb sc batch airplane cluster weights mnist clusters numerical results on airplane mnist and sun airplane mnist sun time testll time testll time testll figure instances and counts for trajectory clusters generated by instances for clusters discovered by on mnist numerical results airplane trajectories this dataset consisted of automatic dependent surveillance broadcast messages collected from planes across the united states during the period to the messages were connected based on plane call sign and time stamp and erroneous trajectories were filtered based on reasonable bounds yielding trajectories with held out for testing the points in each trajectory were fit via linear regression and the parameter vectors were clustered data was split into minibatches of size and used parallel threads mnist digits this dataset consisted of images of digits with held out for testing the images were reduced to dimensions with pca prior to clustering data was split into minibatches of size and used parallel threads sun images this dataset consisted of images from scene categories with held out for testing the images were reduced to dimensions with pca prior to clustering data was split into minibatches of size and used parallel threads figure shows the results from the experiments on the three real datasets from qualitative standpoint discovers sensible clusters in the data as demonstrated in figures however an important quantitative result is highlighted by table the larger dataset is the more the benefits of parallelism provided by become apparent consistently provides model quality that is competitive with the other algorithms but requires orders of magnitude less computation time corroborating similar findings on the synthetic dataset conclusions this paper presented streaming distributed asynchronous inference algorithm for bayesian nonparametric models with focus on the combinatorial problem of matching minibatch posterior components to central 
posterior components during asynchronous updates the main contributions are component identification optimization based on minibatch posterior decomposition tractable bound on the objective for the dirichlet process mixture and experiments demonstrating the performance of the methodology on datasets while the present work focused on the dp mixture as guiding example it is not limited to this model exploring the application of the proposed methodology to other bnp models is potential area for future research acknowledgments this work was supported by the office of naval research under onr muri grant references agostino nobile bayesian analysis of finite mixture distributions phd thesis carnegie mellon university jeffrey miller and matthew harrison simple example of dirichlet process mixture inconsistency for the number of components in advances in neural information processing systems yee whye teh dirichlet processes in encyclopedia of machine learning springer new york thomas griffiths and zoubin ghahramani infinite latent feature models and the indian buffet process in advances in neural information processing systems tamara broderick nicholas boyd andre wibisono ashia wilson and michael jordan streaming variational bayes in advances in neural information procesing systems trevor campbell and jonathan how approximate decentralized bayesian inference in proceedings of the conference on uncertainty in artificial intelligence dahua lin online learning of nonparametric mixture models via sequential variational approximation in advances in neural information processing systems xiaole zhang david nott christopher yau and ajay jasra sequential algorithm for fast fitting of dirichlet process mixture models journal of computational and graphical statistics matt hoffman david blei chong wang and john paisley stochastic variational inference journal of machine learning research chong wang john paisley and david blei online variational inference for the hierarchical dirichlet process in proceedings of the international conference on artificial intelligence and statistics michael bryant and erik sudderth truly nonparametric online variational inference for hierarchical dirichlet processes in advances in neural information proecssing systems chong wang and david blei stochastic variational inference for bayesian nonparametric models in advances in neural information processing systems michael hughes and erik sudderth memoized online variational inference for dirichlet process mixture models in advances in neural information processing systems jason chang and john fisher iii parallel sampling of dp mixture models using splits in advances in neural information procesing systems willie neiswanger chong wang and eric xing asymptotically exact embarassingly parallel mcmc in proceedings of the conference on uncertainty in artificial intelligence carlos carvalho hedibert lopes nicholas polson and matt taddy particle learning for general mixtures bayesian analysis matthew stephens dealing with label switching in mixture models journal of the royal statistical society series ajay jasra chris holmes and david stephens markov chain monte carlo methods and the label switching problem in bayesian mixture modeling statistical science yee whye teh michael jordan matthew beal and david blei hierarchical dirichlet processes journal of the american statistical association finale and zoubin ghahramani accelerated sampling for the indian buffet process in proceedings of the international conference on machine learning 
avinava dubey sinead williamson and eric xing parallel markov chain monte carlo for mixture models in proceedings of the conference on uncertainty in artificial intelligence jack edmonds and richard karp theoretical improvements in algorithmic efficiency for network flow problems journal of the association for computing machinery jim pitman exchangeable and partially exchangeable random partitions probability theory and related fields david blei and michael jordan variational inference for dirichlet process mixtures bayesian analysis yann lecun corinna cortes and christopher burges mnist database of handwritten digits online jianxiong xiao james hays krista ehinger aude oliva and antonio torralba sun image database online 
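As an addendum to the component identification step described above, here is a minimal sketch (not the authors' code) of the reduced matching problem solved as a linear assignment via the Hungarian algorithm in SciPy, after the first shared components have been matched directly; the `score` function is a hypothetical placeholder for the matching scores derived in the paper, and the natural parameters are one-dimensional toys.

```python
# Minimal sketch (not the authors' code): match the new minibatch components
# to new central components (or open fresh ones) with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_new_components(eta_m_new, eta_i_new, eta_prior, score):
    """Return {minibatch component -> central component}; indices >=
    len(eta_i_new) mean "create a new central component"."""
    Km, Ki = len(eta_m_new), len(eta_i_new)
    # Columns: existing new central components, then Km slots for brand-new
    # components, which are scored against the prior's natural parameters.
    R = np.empty((Km, Ki + Km))
    for k, eta_m in enumerate(eta_m_new):
        for j in range(Ki + Km):
            target = eta_i_new[j] if j < Ki else eta_prior
            R[k, j] = score(eta_m, target)
    rows, cols = linear_sum_assignment(R, maximize=True)
    return dict(zip(rows.tolist(), cols.tolist()))

# Toy usage with a hypothetical negative-squared-distance score.
sigma = match_new_components(
    eta_m_new=[1.1, 4.9], eta_i_new=[5.0], eta_prior=0.0,
    score=lambda a, b: -(a - b) ** 2)
print(sigma)   # {0: 1, 1: 0}: 4.9 matches existing 5.0; 1.1 opens a new one
```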
learning visual biases from human imagination carl vondrick hamed aude oliva antonio torralba massachusetts institute of technology of maryland baltimore county vondrick oliva torralba hpirsiav abstract although the human visual system can recognize many concepts under challenging conditions it still has some biases in this paper we investigate whether we can extract these biases and transfer them into machine recognition system we introduce novel method that inspired by tools in human psychophysics estimates the biases that the human visual system might use for recognition but in computer vision feature spaces our experiments are surprising and suggest that classifiers from the human visual system can be transferred into machine with some success since these classifiers seem to capture favorable biases in the human visual system we further present an svm formulation that constrains the orientation of the svm hyperplane to agree with the bias from human visual system our results suggest that transferring this human bias into machines may help object recognition systems generalize across datasets and perform better when very little training data is available introduction computer vision researchers often go through great lengths to remove dataset biases from their models however not all biases are adversarial even natural recognition systems such as the human visual system have biases some of the most well known human biases for example are the canonical perspective prefer to see objects from certain perspective and gestalt laws of grouping tendency to see objects in collections of parts we hypothesize that biases in the human visual system can be beneficial for visual understanding since recognition is an underconstrained problem the biases that the human visual system developed may provide useful priors for perception in this paper we develop novel method to learn some biases from the human visual system and incorporate them into computer vision systems we focus our approach on learning the biases that people may have for the appearance of objects to illustrate our method consider what may seem like an odd experiment suppose we sample white noise from standard normal distribution and treat it as point in visual feature space cnn or hog what is the chance that this sample corresponds to visual features of car image visualizes some samples and as expected we see noise but let us not stop there we next generate one hundred fifty thousand points from the same distribution and ask workers on amazon mechanical turk to classify visualizations of each sample as car or not visualizes the average of visual features that workers believed were cars although our dataset consists of only white noise car emerges sampling noise may seem unusual to computer vision researchers but similar procedure named classification images has gained popularity in human psychophysics for estimating an approximate template the human visual system internally uses for recognition in the procedure an observer looks at an image perturbed with random noise and indicates whether they perceive target category after large number of trials psychophysics researchers can apply basic statistics to extract an approximation of the internal template the observer used for recognition since the white noise cnn features human visual system template for car figure although all image patches on the left are just noise when we show thousands of them to online workers and ask them to find ones that look like cars car emerges in the average 
shown on the right this method is based on well known tools in human psychophysics that estimates the biases that the human visual system uses for recognition we explore how to transfer these biases into machine procedure is done with noise the estimated template reveals some of the cues that the human visual system used for discrimination we propose to extend classification images to estimate biases from the human visual system however our approach makes two modifications firstly we estimate the template in computer vision feature spaces which allows us to incorporate these biases into learning algorithms in computer vision systems to do this we take advantage of algorithms that invert visual features back to images by estimating these biases in feature space we can learn biases for how humans may correspond features such as shapes and colors with objects to our knowledge we are the first to estimate classification images in vision feature spaces secondly we want our template to be biased by the human visual system and not our choice of dataset unlike classification images we do not perturb real images instead our approach only uses visualizations of feature space noise to estimate the templates we capitalize on the ability of people to discern visual objects from random noise in systematic manner related work mental images our methods build upon work to extract mental images from user head for both general objects faces and scenes however our work differs because we estimate mental images in computer vision feature spaces which allows us to integrate the mental images into machine recognition system visual biases our paper studies biases in the human visual system similar to but we wish to transfer these biases into computer recognition system we extend ideas to use computer vision to analyze these biases our work is also closely related to dataset biases which motivates us to try to transfer favorable biases into recognition systems the idea to transfer biases from the human mind into object recognition is inspired by many recent works that puts human in the computer vision loop trains recognition systems with active learning and studies crowdsourcing the primary difference of these approaches and our work is rather than using crowds as workforce we want to extract biases from the worker visual systems feature visualization our work explores novel application of feature visualizations rather than using feature visualizations to diagnose computer vision systems we use them to inspect and learn biases in the human visual system transfer learning we also build upon methods in transfer learning to incorporate priors into learning algorithms common transfer learning method for svms is to change the regularization term to where is the prior however this imposes prior on both the norm and orientation of in our case since the visual bias does not provide an additional prior on the norm we present svm formulation that constrains only the orientation of to be close to rgb hog figure we visualize white noise in rgb and feature spaces to visualize white noise features we use feature inversion algorithms white noise in feature space has correlations in image space that white noise in rgb does not we capitalize on this structure to estimate visual biases in feature space without using real images cnn our approach extends sign constraints on svms but instead enforces orientation constraints our method enforces hard orientation constraint which builds on soft orientation constraints classification 
images review the procedure classification images is popular method in human psychophysics that attempts to estimate the internal template that the human visual system might use for recognition of category we review classification images in this section as it is the inspiration for our method the goal is to approximate the template rd that human observer uses to discriminate between two classes and male female faces or chair not chair suppose we have intensity images rd and rd if we sample white noise id and ask an observer to indicate the class label for most of the time the observer will answer with the correct class label however there is chance that might manipulate to cause the observer to mistakenly label as class the insight into classification images is that if we perform large number of trials then we can estimate decision function that discriminates between and but makes the same mistakes as the observer since makes the same errors it provides an estimate of the template that the observer internally used to discriminate from by analyzing this model we can then gain insight into how visual system might recognize different categories since psychophysics researchers are interested in models that are interpretable classification images are often linear approximations of the form the template rd can be estimated in many ways but the most common is sum of the stimulus images µaa µba µab µbb where µxy is the average image where the true class is and the observer predicted class the template is fairly intuitive it will have large positive value on locations that the observer used to predict and large negative value for locations correlated with predicting although classification images is simple this procedure has led to insights in human perception for example used classification images to study face processing strategies in the human visual system for complete analysis of classification images we refer readers to review articles estimating human biases in feature spaces standard classification images is performed with perturbing real images with white noise however this approach may negatively bias the template by the choice of dataset instead we are interested in estimating templates that capture biases in the human visual system and not datasets we propose to estimate these templates by only sampling white noise with no real images unfortunately sampling just white noise in rgb is extremely unlikely to result in natural image see to overcome this we can estimate the templates in feature spaces used in computer vision feature spaces encode higher abstractions of images such as gradients shapes or colors while sampling white noise in feature space may still not lay on the manifold of natural images it is more likely to capture statistics relevant for recognition since humans can not directly interpret abstract feature spaces we can use feature inversion algorithms to visualize them using these ideas we first sample noise from gaussian distribution id we then invert the noise feature back to an image where is the cn car television person bottle fire hydrant figure we visualize some biases estimated from trials by mechanical turk workers feature inverse by instructing people to indicate whether visualization of noise is target category or not we can build linear template rd that approximates people internal templates µa µb where µa rd is the average in feature space of white noise that workers incorrectly believe is the target object and similarly µb rd is the average of noise that 
workers believe is noise is special case of the original classification images where the background class is white noise and the positive class is empty instead we rely on humans to hallucinate objects in noise to form µa since we build these biases with only white gaussian noise and no real images our approach may be robust to many issues in dataset bias instead templates from our method can inherit the biases for the appearances of objects present in the human visual system which we suspect provides advantageous signals about the visual world in order to estimate from noise we need to perform many trials which we can conduct effectively on amazon mechanical turk we sampled points from standard normal multivariate distribution and inverted each sample with the feature inversion algorithm from hoggles we then instructed workers to indicate whether they see the target category or not in the visualization since we found that the interpretation of noise visualizations depends on the scale we show the worker three different scales we paid workers to label images and workers often collectively solved the entire batch in few hours in order to assure quality we occasionally gave workers an easy example to which we knew the answer and only retained work from workers who performed well above chance we only used the easy examples to qualify workers and discarded them when computing the final template visualizing biases although subjects are classifying identity covariance white gaussian noise with no real images objects can emerge after many trials to show this we performed experiments with both hog and the last convolutional layer of convolutional neural network cnn trained on imagenet for several common object categories we visualize some of the templates from our method in although the templates are blurred they seem to show significant detail about the object for example in the car template we can clearly see object in the center sitting on top of dark road and lighter sky the television template resembles rectangular structure and the fire hydrant templates reveals red hydrant with two arms on the side the templates seem to contain the canonical perspective of objects but also extends them with color and shape biases in these visualizations we have assumed that all workers on mechanical turk share the same appearance bias of objects however this assumption is not necessarily true to examine this we instructed workers on mechanical turk to find sport balls in cnn noise and clustered workers by their geographic location shows the templates for both india and the united states even india united states figure we grouped users by their geographic location us or india and instructed each group to classify cnn noise as sports ball or not which allows us to see how biases can vary by culture indians seem to imagine red ball which is the standard color for cricket ball and the predominant sport in india americans seem to imagine brown or orange ball which could be an american football or basketball both popular sports in the though both sets of workers were labeling noise from the same distribution indian workers seemed to imagine red balls while american workers tended to imagine balls remarkably the most popular sport in india is cricket which is played with red ball and popular sports in the united states are american football and basketball which are played with balls we conjecture that americans and indians may have different mental images of sports balls in their head and the color is influenced 
by popular sports in their country this effect is likely attributed to phenomena in social psychology where human perception can be influenced by culture since environment plays role in the development of the human vision system people from different cultures likely develop slightly different images inside their head leveraging humans biases for recognition if the biases we learn are beneficial for recognition then we would expect them to perform above chance at recognizing objects in real images to evaluate this we use the visual biases directly as classifier for object recognition we quantify their performance on object classification in realworld images using the pascal voc dataset evaluating against the validation set since pascal voc does not have fire hydrant category we downloaded images from flickr with fire hydrants and added them to the validation set we report performance as the average precision on curve the results in suggest that biases from the human visual system do capture some signals useful for classifying objects in real images although the classifiers are estimated using only white noise in most cases the templates are significantly outperforming chance suggesting that biases from the human visual system may be beneficial computationally our results suggest that shape is an important bias to discriminate objects in cnn feature space notice how the top classifications in tend to share the same rough shape by category for example the classifier for person finds people that are upright and the television classifier fires on rectangular shapes the confusions are quantified bottles are often confused as people and cars are confused as buses moreover some templates appear to rely on color as well suggests that the classifier for correctly favors red objects which is evidenced by it frequently firing on people wearing red clothes the bottle classifier seems to be incorrectly biased towards blue objects which contributes to its poor performance hog cnn chance ap car person bottle tv hog cnn chance person car bottle tv firehydrant figure we show the average precision ap for object classification on pascal voc using templates estimated with noise even though the template is created without dataset it performs significantly above chance car figure we show some of the top classifications from the human biases estimated with cnn features note that real data is not used in building these models person bottle fire hydrant television car tvmonitor car person car bus train boat tvmonitor sofa motorbike bottle predicted category predicted category aeroplane predicted category firehydrant tvmonitor bus person aeroplane train boat chair firehydrant dog cat motorbike chair car diningtable sofa bird horse diningtable tvmonitor probability of retrieval probability of retrieval probability of retrieval figure we plot the class confusions for some human biases on top classifications with cnn features we show only the top classes for visualization notice that many of the confusions may be sensible the classifier for car tends to retrieve vehicles and the fire hydrant classifier commonly mistakes people and bottles while the motivation of this experiment has been to study whether human biases are favorable for recognition our approach has some applications although templates estimated from white noise will likely never be substitute for massive labeled datasets our approach can be helpful for recognizing objects when no training data is available rather our approach enables us to build 
classifiers for categories that person has only imagined and never seen in our experiments we evaluated on common categories to make evaluation simpler but in principle our approach can work for rare categories as well we also wish to note that the cnn features used here are trained to classify images on imagenet lsvrc and hence had access to data however we showed competitive results for hog as well which is feature as well as results for category that the cnn network did not see during training fire hydrants learning with human biases our experiments to visualize the templates and use them as object recognition systems suggest that visual biases from the human visual system provide some signals that are useful for discriminating objects in real world images in this section we investigate how to incorporate these signals into learning algorithms when there is some training data available we present an svm that constrains the separating hyperplane to have an orientation similar to the human bias we estimated svm with orientation constraints let xi rm be training point and yi be its label for standard svm seeks separating hyperplane rm with bias that maximizes the margin between positive and negative examples we wish to add the constraint that the svm hyperplane must be at most degrees away from the bias template min ξi yi wt xi ξi ξi wt wt where ξi are the slack variables is the regularization hyperfigure parameter and is the orientation prior such that bounds the maximum angle that the is allowed to deviate from note that we have assumed without loss of generality that shows visualization of this orientation constraint the feasible space for the solution is the grayed hypercone the svm solution is not allowed to deviate from the prior classifier by more than degrees optimization efficiently by writing the objective as conic program we rewrite wt as and introduce an auxiliary variable such that wt wθ substituting these constraints into and replacing the svm regularization term with leads to the conic program ξi min yi wt xi ξi wt ξi wt since at the minimum wt is equivalent to but in standard conic program form as conic programs are convex by construction we can then optimize it efficiently using solvers which we use mosek note that removing makes it equivalent to the standard svm specifies the angle of the cone in our experiments we found to be reasonable while this angle is not very restrictive in low dimensions it becomes much more restrictive as the number of dimensions increases experiments we previously used the bias template as classifier for recognizing objects when there is no training data available however in some cases there may be few real examples available for learning we can incorporate the bias template into learning using an svm with orientation constraints using the same evaluation procedure as the previous section we compare three approaches single svm trained with only few positives and the entire negative set the same svm with orientation priors for cos on the human bias and the human bias alone we then follow the same experimental setup as before we show full results for the svm with orientation priors in in general biases from the human visual system can assist the svm when the amount of positive training data is only few examples in these low data regimes acquiring classifiers from the human visual system can improve performance with margin sometimes ap furthermore standard computer vision datasets often suffer from dataset biases that harm cross dataset 
generalization performance since the template we estimate is biased by the human visual system and not datasets there is no dataset we believe our approach may help cross dataset generalization we trained an svm classifier with cnn features to recognize cars on caltech but we tested it on object classification with pascal voc suggest that by constraining the svm to be close to the human bias for car we are able to improve the generalization performance of our classifiers sometimes over ap we then tried the reverse experiment in we trained on pascal voc but tested on caltech while pascal voc provides much better sample of the visual world the orientation priors still help generalization performance when there is little training data available these results suggest that incorporating the biases from the human visual system may help alleviate some dataset bias issues in computer vision positives category chance human car person bottle tv positive positives svm svm figure we show ap for the svm with orientation priors for object classification on pascal voc for varying amount of positive data with cnn features all results are means over random subsamples of the training sets refers to svm with the human bias as an orientation prior car classification cnn train on pascal test on caltech car classification cnn train on caltech test on pascal ap ap only svm only svm train on caltech test on pascal train on pascal test on caltech figure since bias from humans is estimated with only noise it tends to be biased towards the human visual system instead of datasets we train an svm to classify cars on caltech that is constrained towards the bias template and evaluate it on pascal voc for every training set size constraining the svm to the human bias with is able to improve generalization performance we train constrained svm on pascal voc and test on caltech for low data regimes the human bias may help boost performance conclusion since the human visual system is one of the best recognition systems we hypothesize that its biases may be useful for visual understanding in this paper we presented novel method to estimate some biases that people have for the appearance of objects by estimating these biases in computer vision feature spaces we can transfer these templates into machine and leverage them computationally our experiments suggest biases from the human visual system may provide useful signals for computer vision systems especially when little if any training data is available acknowledgements we thank aditya khosla for important discussions and andrew owens and zoya bylinskii for helpful comments funding for this research was partially supported by google phd fellowship to cv and google research award and onr muri to at references the mosek optimization software http ahumada perceptual classification images from vernier acuity masked by noise aytar and zisserman tabula rasa model transfer for object category detection in iccv beard and ahumada technique to extract relevant image features for visual tasks in spie blais jack scheepers fiset and caldara culture shapes how we look at faces plos one branson wah schroff babenko welinder perona and belongie visual recognition with humans in the loop chua boland and nisbett cultural variation in eye movements during scene perception proceedings of the national academy of sciences of the united states of america dalal and triggs histograms of oriented gradients for human detection in cvpr deng dong socher li li and imagenet hierarchical image database in 
cvpr eckstein and ahumada classification images tool to analyze visual strategies journal of vision ellis source book of gestalt psychology psychology press epshteyn and dejong rotational prior knowledge for svms in ecml everingham van gool williams winn and zisserman the pascal visual object classes challenge ijcv fergus and perona learning of object categories pami ferecatu and geman statistical framework for image category search from mental picture pami gosselin and schyns superstitious perceptions reveal properties of internal representations psychological science greene botros beck and visual noise from natural scene statistics reveals human scene category representations arxiv jr and lovell stimulus features in signal detection the journal of the acoustical society of america krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips kulis saenko and darrell what you saw is not what you get domain adaptation using asymmetric kernel transforms in cvpr pages li concise formulas for the area and volume of hyperspherical cap asian journal of mathematics and statistics mahendran and vedaldi understanding deep image representations by inverting them cvpr mangini and biederman making the ineffable explicit estimating the information employed for face classifications cognitive science mezuman and weiss learning about canonical views from internet image collections in nips murray classification images review journal of vision palmer rosch and chase canonical perspective and the perception of objects attention and performance ix parikh and zitnick of machines in nips wcsswc ponce berg everingham forsyth hebert lazebnik marszalek schmid russell torralba et al dataset issues in object recognition in toward object recognition salakhutdinov torralba and tenenbaum learning to share visual appearance for multiclass object detection in cvpr sekuler gaspar gold and bennett inversion leads to quantitative not qualitative changes in face processing current biology sorokin and forsyth utility data annotation with amazon mechanical turk in cvpr workshops torralba and efros unbiased look at dataset bias in cvpr vijayanarasimhan and grauman live active learning training object detectors with crawled data and crowds in cvpr von ahn liu and blum peekaboom game for locating objects in images in sigchi human factors vondrick khosla malisiewicz and torralba hoggles visualizing object detection features iccv weinzaepfel and reconstructing an image from its local descriptors in cvpr yang yan and hauptmann adapting svm classifiers to data with shifted distributions in icdm workshops 
smooth and strong map inference with linear convergence ofer meshi tti chicago mehrdad mahdavi tti chicago alexander schwing university of toronto abstract maximum map inference is an important task for many applications although the standard formulation gives rise to hard combinatorial optimization problem several effective approximations have been proposed and studied in recent years we focus on linear programming lp relaxations which have achieved performance in many applications however optimization of the resulting program is in general challenging due to and complex constraints therefore in this work we study the benefits of augmenting the objective function of the relaxation with strong convexity specifically we introduce strong convexity by adding quadratic term to the lp relaxation objective we provide theoretical guarantees for the resulting programs bounding the difference between their optimal value and the original optimum further we propose suitable optimization algorithms and analyze their convergence introduction probabilistic graphical models are an elegant framework for reasoning about multiple variables with structured dependencies they have been applied in variety of domains including computer vision natural language processing computational biology and many more throughout finding the maximum map configuration the most probable assignment is one of the central tasks for these models unfortunately in general the map inference problem is despite this theoretical barrier in recent years it has been shown that approximate inference methods based on linear programming lp relaxations often provide high quality map solutions in practice although tractable in principle lp relaxations pose real computational challenge in particular for many applications standard lp solvers perform poorly due to the large number of variables and constraints therefore significant research effort has been put into designing efficient solvers that exploit the special structure of the map inference problem some of the proposed algorithms optimize the primal lp directly however this is hard due to complex coupling constraints between the variables therefore most of the specialized map solvers optimize the dual function which is often easier since it preserves the structure of the underlying model and facilitates elegant algorithms nevertheless the resulting optimization problem is still challenging since the dual function is piecewise linear and therefore in fact it was recently shown that lp relaxations for map inference are not easier than general lps this result implies that there exists an inherent between the approximation error accuracy of the relaxation and its optimization error efficiency in this paper we propose new ways to explore this specifically we study the benefits of adding strong convexity in the form of quadratic term to the map lp relaxation objective we show that adding strong convexity to the primal lp results in new smooth dual objective which serves as an alternative to this smooth objective can be computed efficiently and optimized via methods including accelerated gradient on the other hand introducing strong convexity in the dual leads to new primal formulation in which the coupling constraints are enforced softly through penalty term in the objective this allows us to derive an efficient conditional gradient algorithm also known as the fw algorithm we can then add strong convexity to both primal and dual to obtain smooth and strongly convex objective for which various 
algorithms enjoy linear convergence rate we provide theoretical guarantees for the new objective functions analyze the convergence rate of the proposed algorithms and compare them to existing approaches all of our algorithms are guaranteed to globally converge to the optimal value of the modified objective function finally we show empirically that our methods are competitive with other algorithms for map lp relaxation related work several authors proposed efficient approximations for map inference based on lp relaxations kumar et al show that lp relaxation dominates other convex relaxations for map inference due to the complex constraints only few of the existing algorithms optimize the primal lp directly ravikumar et al present proximal point method that requires iterative projections onto the constraints in the inner loop inexactness of these iterative projections complicates the convergence analysis of this scheme in section we show that adding quadratic term to the dual problem corresponds to much easier primal in which agreement constraints are enforced softly through penalty term that accounts for constraint violation this enables us to derive simpler algorithm based on conditional gradient for the primal relaxed program recently belanger et al used different penalty term for constraint violation and showed that it corresponds to on dual variables in contrast our penalty terms are smooth which leads to different objective function and faster convergence guarantees most of the popular algorithms for map lp relaxations focus on the dual program and optimize it in various ways the subgradient algorithm can be applied to the objective however its convergence rate is rather slow both in theory and in practice in particular the algorithm requires iterations to obtain an solution to the dual problem algorithms based on coordinate minimization can also be applied and often converge fast but they might get stuck in suboptimal fixed points due to the of the objective to overcome this limitation it has been proposed to smooth the dual objective using function coordinate minimization methods are then guaranteed to converge to the optimum of the smoothed objective meshi et al have shown that the convergence rate of such algorithms is where is the smoothing parameter accelerated gradient algorithms have also been successfully applied to the smooth dual obtaining improved convergence rate of which can be used to obtain rate the original objective in section we propose an alternative smoothing technique based on adding quadratic term to the primal objective we then show how algorithms can be applied efficiently to optimize the new objective function other globally convergent methods that have been proposed include augmented lagrangian bundle methods and steepest descent approach however the convergence rate of these methods in the context of map inference has not been analyzed yet making them hard to compare to other algorithms problem formulation in this section we formalize map inference in graphical models consider set of discrete variables xn and denote by xi particular assignment to variable xi we refer to subsets of these variables by also known as regions and the total number of regions is referred to as each subset is associated with local score function or factor xr the map problem is to find an assignment which maximizes global score function that decomposes over the factors max xr the above combinatorial optimization problem is hard in general and tractable only in several special cases 
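For concreteness, the following display restates the MAP objective just given, together with the standard local LP relaxation and its dual that the next paragraphs discuss. The notation (regions r, parent regions P(r), dual variables δ) is chosen here for illustration and may differ in minor ways from the paper's own symbols; this is the textbook form of these objects rather than a verbatim copy of the paper's equations.

```latex
% MAP as maximization of a sum of region (factor) scores
\max_{x} \; \sum_{r} \theta_r(x_r)

% Standard local LP relaxation over per-region marginals \mu_r(x_r),
% with P(r) denoting the parent regions of r
\max_{\mu \ge 0} \; \sum_{r} \sum_{x_r} \mu_r(x_r)\,\theta_r(x_r)
\quad \text{s.t.} \quad
\sum_{x_r} \mu_r(x_r) = 1 \;\; \forall r, \qquad
\sum_{x_p \setminus x_r} \mu_p(x_p) = \mu_r(x_r) \;\; \forall r,\; p \in P(r),\; x_r

% Its dual is a sum of per-region maxima over reparameterized scores; it is
% piecewise linear, hence convex but non-smooth, in the dual variables \delta
\min_{\delta} \; \sum_{r} \max_{x_r} \Big( \theta_r(x_r)
  + \sum_{p \in P(r)} \delta_{pr}(x_r)
  - \sum_{c \,:\, r \in P(c)} \delta_{rc}(x_c) \Big)
```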
most notably for graphs or pairwise score functions efficient dynamic programming algorithms can be applied here we do not make such simplifying assumptions and instead focus on approximate inference in particular we are interested in imations based on the lp relaxation taking the following form xx max µr xr xr where ml xr pxr µr xr xp µp xp µr xr xr where represents containment relationship between the regions and the dual program of the above lp is formulated as minimizing the of factors min max xr max xr pr xr rc xc xr xr this is piecewise linear function in the dual variables hence it is convex but not strongly and two commonly used optimization schemes for this objective are subgradient descent and block coordinate minimization while the convergence rate of the former can be upper bounded by the latter is due to the of the objective function to remedy this shortcoming it has been proposed to smooth the objective by replacing the local maximization with the resulting unconstrainted program is xr min log exp xr this dual form corresponds to adding local entropy terms to the primal given in eq obtaining xx max µr xr xr µr xr where µr xr µr xr log µr xr denotes the entropy the following guarantee holds for the smooth optimal value log vr where is the optimal value of the dual program given in eq and vr denotes the number of variables in region the dual given in eq is smooth function with lipschitz constant vr see in this case coordinate minimization algorithms are globally convergent to the smooth optimum and their convergence rate can be bounded by algorithms can also be applied to the smooth dual and have similar convergence rate this can be improved using nesterov acceleration scheme to obtain an rate the gradient of eq takes the simple form xr pr xr br xr bp xp where br xr exp xp introducing strong convexity in this section we study the effect of adding strong convexity to the objective function specifically we add the euclidean norm of the variables to either the dual section or primal section function we study the properties of the objectives and propose appropriate optimization schemes strong convexity in the dual as mentioned above the dual given in eq is piecewise linear function hence not smooth introducing strong convexity to control the convergence rate is an alternative to smoothing we propose to introduce strong convexity by simply adding the norm of the variables to the dual algorithm for primal initialize µr xr xr for all xr while not converged do pick at random let sr xr xr for all xr sr µr let ks and clip to ka update µr end while rc µr program given in eq min the corresponding primal objective is then see appendix max µp xp µr xr xp where preserves only the constraints in ml and for convenience simplex we define aµ xr importantly this primal program is similar xp to the original primal given in eq but the marginalization constraints in ml are enforced softly via penalty term in the objective interestingly the primal in eq is somewhat similar to the objective function obtained by the steepest descent approach proposed by schwing et al despite being motivated from different perspectives similar to schwing et al our algorithm below is also based on conditional gradient however ours is algorithm whereas theirs employs procedure we obtain the following guarantee for the optimum of the strongly convex dual see appendix where is chosen such that it can be shown that where maxr wr and wr is the number of configurations of region see appendix notice that this bound is worse than 
the bound for the entropy-smoothed case stated above, due to the dependence on the magnitude of the parameters and on the number of configurations W_r.

Optimization. It is easy to modify the subgradient algorithm to optimize the strongly convex dual given above: it only requires adding the regularization term to the subgradient. Since the objective is non-smooth but strongly convex, we obtain an improved convergence rate. We note that coordinate descent algorithms for this dual objective remain problematic, since the program is still non-smooth. Instead, we propose to optimize the soft-constrained primal given above via a conditional gradient algorithm; specifically, in Algorithm 1 we adapt a conditional gradient scheme from prior work, using shorthand notation for the per-factor reparameterizations and for the constraint-violation terms that appear in the penalized objective. In the appendix we show that the convergence rate of Algorithm 1 is similar to that of subgradient descent in the dual. However, Algorithm 1 has several advantages over subgradient descent. First, the step size requires no tuning, since the optimal step is computed analytically. Second, it is easy to monitor the quality of the current solution by keeping track of the duality gap, which provides a sound stopping criterion. Notice that the basic operation for the update is a maximization over factor configurations, max over x_r, which is similar to the subgradient computation; this operation is sometimes cheaper than coordinate minimization, which requires computing max-marginals. Similar rate guarantees can be derived for the duality gap itself. We also point out that, as in prior work, it is possible to execute Algorithm 1 in terms of dual variables, without storing primal variables µ_r(x_r) for large parent regions (see the appendix for details); as we demonstrate in the experiments, this can be important when using global factors. We note that Algorithm 1 can be used, with minor modifications, in the inner loop of an augmented Lagrangian algorithm, but we show later that this procedure is not necessary to obtain good results for some applications. Finally, Meshi et al. show how to use this soft-constrained objective to obtain an efficient training algorithm for learning the score functions from data.

Strong convexity in the primal. We next consider appending the primal LP given earlier with a similar squared-norm term. It turns out that the corresponding dual function (see the appendix) involves scaling the factor reparameterization and then projecting the resulting vector onto the probability simplex; we denote the result of this projection by u_r, or just u when clear from context. The squared norm here has the same role as the entropy terms in the smoothed primal, and serves to smooth the dual function; this is a consequence of the well-known duality between strong convexity and smoothness. In particular, the resulting dual is smooth, with a Lipschitz-continuous gradient. To calculate the objective value we need to compute the projection u_r onto the simplex for all factors. This can be done by sorting the elements of the scaled reparameterization and then shifting all elements by the same value such that all positive elements sum to one; the negative elements are then set to zero. Intuitively, we can think of u_r as a soft maximizer which does not place all of its weight on the maximum element, but instead spreads the weight among the top-scoring elements if their score is close enough to the maximum. The effect is similar to the entropy-smoothed case, where the beliefs b_r can also be viewed as a soft maximizer. On the other hand, unlike b_r, our u_r will most likely be sparse, since only a few elements tend to have scores close to the maximum and hence a non-zero value in u_r. Another interesting property of this dual is invariance to shifting, which is also the case for the original non-smooth dual and for the entropy-smoothed dual: specifically, shifting all elements of a factor reparameterization by the same value does not change the objective value, since the projection onto the simplex is invariant to such shifts.
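The sort-and-shift procedure just described is the classical Euclidean projection onto the probability simplex. The sketch below is a minimal NumPy implementation given for illustration only; the function name and interface are our own and are not taken from the paper.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex.

    Implements the classic sort-based procedure: sort the entries in decreasing
    order, find the largest number of leading entries that can stay positive
    after a common shift that makes them sum to one, apply that shift, and set
    the remaining (negative) entries to zero. Cost is O(n log n) due to the sort.
    """
    v = np.asarray(v, dtype=float)
    n = v.size
    u = np.sort(v)[::-1]                                   # decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1.0) / np.arange(1, n + 1) > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)                   # the common shift
    return np.maximum(v - tau, 0.0)

if __name__ == "__main__":
    u = project_to_simplex(np.array([0.5, 2.3, -0.1, 0.4]))
    print(u, u.sum())                                      # entries >= 0, summing to 1
```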
We next bound the difference between the ℓ2-smoothed optimum and the original one. The bound follows easily from the bounded norm of µ_r in the probability simplex, or an equivalent reformulation of it; we actually use the equivalent form in order to obtain an upper bound rather than a lower bound, and from strong duality we immediately get a similar guarantee for the dual optimum. Notice that this bound is better than the corresponding bound for entropy smoothing stated earlier, since it does not depend on the scope size V_r of the regions. (In our experiments we show the shifted objective value.)

Table: Summary of objective functions, algorithms, and convergence rates. Row and column headers pertain to the dual objective; previously known approaches are shaded.

Optimization. To solve the ℓ2-smoothed dual program we can use gradient-based algorithms. The gradient only requires computing the projections u_r, as does the objective function itself, and its form is very similar to the gradient of the entropy-smoothed dual, with projections taking the role of beliefs. The gradient descent algorithm applies these updates iteratively; the convergence rate of this scheme for our smooth dual is similar to the rate obtained with entropy smoothing, and, as in that case, Nesterov's accelerated gradient method achieves a better rate. Unfortunately, it is not clear how to derive efficient coordinate minimization updates for this dual, since the projection u_r depends on the dual variables in a non-trivial manner. Finally, we point out that the resulting program is very similar to the one solved in the inner loop of proximal point methods; therefore our algorithm can be used, with minor modifications, as a subroutine within such proximal algorithms. This requires mapping the final dual solution to a feasible primal solution.

Smooth and strong. In order to obtain a smooth and strongly convex objective function, we can add a quadratic regularizer to the smooth dual, in either its entropy-smoothed or ℓ2-smoothed form; gradient-based algorithms then have a linear convergence rate. Equivalently, we can add a smoothing term to the soft-constrained primal. Although conditional gradient is not guaranteed to converge linearly in the smooth and strongly convex case, stochastic dual coordinate ascent (SDCA) does enjoy linear convergence, and it can even be accelerated to gain a better dependence on the smoothing and convexity parameters. This requires only minor modifications to the algorithms discussed above, which are highlighted in the appendix. To conclude this section, we summarize all objective functions and algorithms in the table above.

Experiments. We now proceed to evaluate the proposed methods on real and synthetic data and compare them to existing approaches. We begin with a synthetic model adapted from Kolmogorov; this example was designed to show that coordinate descent algorithms might get stuck in suboptimal points due to non-smoothness. We compare the following MAP inference algorithms: coordinate descent (CD), subgradient descent, smooth CD (for entropy smoothing), gradient descent (GD) and accelerated GD (AGD) with either entropy or ℓ2 smoothing, our conditional gradient algorithm (FW), and the linear-convergence variants.

Figure: Comparison of various inference algorithms on a synthetic model. The objective value as a function of the number of iterations is plotted; the optimal value is shown as a thin dashed dark line.

We notice that CD (light blue, dashed) is indeed stuck at its initial point. Second, we observe that the subgradient algorithm (yellow) is extremely slow to converge. Third, we see that the smooth CD algorithms (green) converge nicely to the smooth optimum; gradient-based algorithms for the same smooth objective (purple) also converge to that optimum, with AGD much faster than GD. We can also see that algorithms for the ℓ2-smoothed objective (red) perform slightly better than their entropy-smoothed counterparts; in particular, they have faster convergence and a tighter objective for the same value of the smoothing parameter, as our theoretical analysis suggests (for example, compare the convergence of entropy-smoothed AGD and ℓ2-smoothed AGD at the same smoothing parameter, and for the optimal value compare smooth CD and ℓ2-smoothed AGD). Fourth, we note that the FW algorithm (blue) requires smaller values of its regularization parameter in order to achieve high accuracy, as our bound predicts. We point out that the dependence on the smoothing (or strong convexity) parameter is roughly linear, which is also aligned with our convergence bounds. Finally, we see that for this model the smooth and strongly convex algorithms (gray) perform similarly to, or even slightly worse than, either the purely smooth or the purely strongly convex counterparts.

In our experiments we compare the number of iterations rather than the runtime of the algorithms, since the computational cost per iteration is roughly the same for all algorithms (it includes a pass over all factors), and the actual runtime greatly depends on the implementation. For example, gradient computation for ℓ2 smoothing requires sorting factor values rather than just maximizing over them, incurring a cost of W_r log W_r per factor instead of just W_r for the entropy-smoothed gradient; however, one can use partitioning around a pivot value instead of sorting, yielding W_r cost in expectation, and caching the pivot can also reduce the runtime considerably. Moreover, the logarithm and exponent operations needed by the entropy gradient are much slower than the basic operations used for computing the ℓ2-smoothed gradient. As another example, we point out that AGD algorithms can be further improved by searching for the effective Lipschitz constant rather than using the conservative bound. In order to abstract away these details, we compare the iteration cost of the vanilla versions of all algorithms.

We next conduct experiments on real data from a protein prediction problem from Yanover et al.; this problem can be cast as MAP inference in a model with unary and pairwise factors. The left figure shows the convergence of various MAP algorithms for one of the proteins; similar behavior was observed for the other instances. The behavior is similar to the synthetic example above, except for the much better performance of coordinate descent. In particular, we see that coordinate minimization algorithms perform very well in this setting, better than gradient-based and FW algorithms; this finding is consistent with previous work. Only a closer look (left figure, bottom panel) reveals that smoothing actually helps to obtain a slightly better solution here: in particular, the smooth CD and AGD variants, as well as the primal SDCA and dual AGD algorithms for the smooth and strongly convex objective, are able to recover the optimal solution within the allowed number of iterations, and the FW algorithm also finds a good solution.

Finally, we apply our approach to an image segmentation problem with a global cardinality factor. Specifically, we use the Weizmann horse dataset for segmentation; all images are resized to a common resolution, and we use a subset of the images to learn the parameters of the model and the remaining images to test inference. Our model consists of unary and pairwise factors along with a single global cardinality factor that serves to encourage segmentations where the number of foreground pixels is not too far from the training-set mean. Specifically, we use the cardinality factor from Li and Zemel, a penalty that grows once the number of foreground pixels deviates from a reference cardinality, computed from the training set, by more than a tolerance parameter.

Figure: (Left) Comparison of MAP inference algorithms on a protein prediction problem. In the upper plot, solid lines show the optimized objective for each algorithm, and dashed lines show the score of the best decoded solution obtained via simple rounding; the bottom plot shows the value of the decoded solution in more detail. (Right) Comparison of MAP inference algorithms on an image segmentation problem; again, solid lines show the value of the optimized objective, while dashed lines show the score of the best decoded solution so far.

First, we notice that not all of the algorithms are efficient in this setting; in particular, algorithms that
optimize the smooth dual either or smoothing need to enumerate factor configurations in order to compute updates which is prohibitive for the global cardinality factor we therefore take the subgradient and coordinate descent mplp as baselines and compare their performance to that of our fw algorithm with we use the variant that does not store primal variables for the global factor appendix we point out that mplp requires calculating for factors rather than simple maximization for subgradient and fw in the case of cardinality factors this can be done at similar cost using dynamic programming however there are other types of factors where computation might be more expensive than max in fig right we show typical run for single image where we limit the number of iterations to we observe that subgradient descent is again very slow to converge and coordinate descent is also rather slow here in fact it is not even guaranteed to reach the optimum in contrast our fw algorithm converges orders of magnitude faster and finds high quality solution for runtime comparison see appendix over the entire test instances we found that fw gets the highest score solution for images while mplp finds the best solution in only images and subgradient never wins to explain this success recall that our algorithm enforces the agreement constraints between factor marginals only softly it makes sense that in this setting it is not crucial to reach full agreement between the cardinality factor and the other factors in order to obtain good solution conclusion in this paper we studied the benefits of strong convexity for map inference we introduced simple term to make either the dual or primal lp relaxations strongly convex we analyzed the resulting objective functions and provided theoretical guarantees for their optimal values we then proposed several optimization algorithms and derived upper bounds on their convergence rates using the same machinery we obtained smooth and strongly convex objective functions for which our algorithms retained linear convergence guarantees our approach offers new ways to the approximation error of the relaxation and the optimization error indeed we showed empirically that our methods significantly outperform strong baselines on problems involving cardinality potentials to extend our work we aim at natural language processing applications since they share characteristics similar to the investigated image segmentation task finally we were unable to derive coordinate minimization updates for our dual in eq we hope to find alternative smoothing techniques which facilitate even more efficient updates references belanger passos riedel and mccallum message passing for soft constraint dual decomposition in uai borenstein sharon and ullman combining and segmentation in cvpr duchi singer and chandra efficient projections onto the for learning in high dimensions in icml pages frank and wolfe an algorithm for quadratic programming volume pages garber and hazan linearly convergent conditional gradient algorithm with applications to online and stochastic optimization arxiv preprint globerson and jaakkola fixing convergent message passing algorithms for map in nips mit press hazan and shashua belief propagation for approximate inference ieee transactions on information theory johnson convex relaxation methods for graphical models lagrangian and maximum entropy approaches phd thesis eecs mit kappes savchynskyy and bundle approach to efficient by lagrangian relaxation in cvpr kolmogorov convergent message 
passing for energy minimization ieee transactions on pattern analysis and machine intelligence komodakis paragios and tziritas mrf energy minimization and beyond via dual decomposition ieee pami kumar kolmogorov and an analysis of convex relaxations for map estimation of discrete mrfs jmlr jaggi schmidt and pletscher optimization for structural svms in icml pages li and zemel high order regularization for learning of structured output problems in icml pages martins figueiredo aguiar smith and xing an augmented lagrangian approach to constrained map inference in icml pages meshi and globerson an alternating direction method for dual map lp relaxation in ecml meshi jaakkola and globerson convergence rate analysis of map coordinate minimization algorithms in nips pages meshi srebro and hazan efficient training of structured svms via soft constraints in aistats nemirovski and yudin problem complexity and method efficiency in optimization wiley nesterov introductory lectures on convex optimization basic course volume kluwer academic publishers nesterov smooth minimization of functions math prusa and werner universality of the local marginal polytope in cvpr pages ieee ravikumar agarwal and wainwright for linear programs proximal methods and rounding schemes jmlr savchynskyy schmidt kappes and schnorr study of nesterov scheme for lagrangian decomposition and map labeling cvpr schwing hazan pollefeys and urtasun globally convergent dual map lp relaxation solvers using margins in proc nips schwing hazan pollefeys and urtasun globally convergent parallel map lp relaxation solver using the algorithm in proc icml and zhang accelerated proximal stochastic dual coordinate ascent for regularized loss minimization in icml sontag globerson and jaakkola introduction to dual decomposition for inference in optimization for machine learning pages mit press tarlow givoni and zemel efficient message passing with high order potentials in aistats volume pages jmlr cp wainwright jaakkola and willsky map estimation via agreement on trees and linear programming ieee transactions on information theory werner linear programming approach to problem review ieee transactions on pattern analysis and machine intelligence werner revisiting the linear programming relaxation approach to gibbs energy minimization and weighted constraint satisfaction ieee pami yanover meltzer and weiss linear programming relaxations and belief propagation an empirical study journal of machine learning research 
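As an aside to the image segmentation experiment above: the subgradient and conditional gradient updates there only need a maximization over the configurations of the global cardinality factor. Assuming the global term depends on the labeling only through the number of foreground pixels, that maximization can be done exactly by sorting per-pixel gains, as in the sketch below. The penalty used here is an illustrative stand-in with the qualitative shape described in the text, not the exact factor from Li and Zemel, and all names are ours.

```python
import numpy as np

def max_with_cardinality_factor(unary0, unary1, card_penalty):
    """Maximize sum_i m_i(x_i) + theta_card(sum_i x_i) over binary labelings x.

    unary0[i], unary1[i] -- scores for x_i = 0 and x_i = 1.
    card_penalty(k)      -- score contribution of having exactly k foreground pixels.

    Because the global term depends on x only through the count k, the best
    labeling with exactly k ones uses the k largest gains m_i(1) - m_i(0);
    we then simply maximize over k.
    """
    unary0 = np.asarray(unary0, dtype=float)
    unary1 = np.asarray(unary1, dtype=float)
    gains = np.sort(unary1 - unary0)[::-1]                 # decreasing gains
    prefix = np.concatenate(([0.0], np.cumsum(gains)))     # prefix[k]: best gain with k ones
    scores = unary0.sum() + prefix + np.array(
        [card_penalty(k) for k in range(len(gains) + 1)])
    best_k = int(np.argmax(scores))
    return scores[best_k], best_k

if __name__ == "__main__":
    # toy numbers: 5 pixels, reference count 2, tolerance 1 (all illustrative)
    penalty = lambda k: -10.0 * max(0, abs(k - 2) - 1)
    value, count = max_with_cardinality_factor(
        [0, 0, 0, 0, 0], [3.0, 1.0, -0.5, -2.0, 0.2], penalty)
    print(value, count)   # best score and the corresponding foreground count
```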
copeland dueling bandits masrour zoghi informatics institute university of amsterdam netherlands zohar karnin yahoo labs new york ny zkarnin shimon whiteson department of computer science university of oxford uk maarten de rijke informatics institute university of amsterdam derijke abstract version of the dueling bandit problem is addressed in which condorcet winner may not exist two algorithms are proposed that instead seek to minimize regret with respect to the copeland winner which unlike the condorcet winner is guaranteed to exist the first copeland confidence bound ccb is designed for small numbers of arms while the second scalable copeland bandits scb works better for problems we provide theoretical results bounding the regret accumulated by ccb and scb both substantially improving existing results such existing results either offer bounds of the form log but require restrictive assumptions or offer bounds of the form log without requiring such assumptions our results offer the best of both worlds log bounds without restrictive assumptions introduction the dueling bandit problem arises naturally in domains where feedback is more reliable when given as pairwise preference when it is provided by human and specifying feedback instead would be arbitrary or inefficient examples include ranker evaluation in information retrieval ad placement and recommender systems as with other preference learning problems feedback consists of pairwise preference between selected pair of arms instead of scalar reward for single selected arm as in the bandit problem most existing algorithms for the dueling bandit problem require the existence of condorcet winner which is an arm that beats every other arm with probability greater than if such algorithms are applied when no condorcet winner exists no decision may be reached even after many comparisons this is key weakness limiting their practical applicability for example in industrial ranker evaluation when many rankers must be compared each comparison corresponds to costly live experiment and thus the potential for failure if no condorcet winner exists is unacceptable this risk is not merely theoretical on the contrary recent experiments on dueling bandit problems based on information retrieval datasets show that dueling bandit problems without condorcet winners arise regularly in practice figure in addition we show in appendix in the supplementary material that there are realistic situations in ranker evaluation in information retrieval in which the probability that the condorcet assumption holds decreases rapidly as the number of arms grows since the dueling bandit methods mentioned above do not provide regret bounds in the absence of condorcet winner applying them remains risky in practice indeed we demonstrate empirically the danger of applying such algorithms to dueling bandit problems that do not have condorcet winner cf appendix in the supplementary material the of the condorcet winner has been investigated extensively in social choice theory where numerous definitions have been proposed without clear contender for the most suitable resolution in the dueling bandit context few methods have been proposed to address this issue savage pbr and rankel which use some of the notions proposed by social choice theorists such as the copeland score or the borda score to measure the quality of each arm hence determining what constitutes the best arm or more generally the arms in this paper we focus on finding copeland winners which are arms that beat the 
greatest number of other arms because it is natural conceptually simple extension of the condorcet winner unfortunately the methods mentioned above come with bounds of the form log in this paper we propose two new dueling bandit algorithms for the copeland setting with significantly improved bounds the first algorithm called copeland confidence bound ccb is inspired by the recently proposed relative upper confidence bound method but modified and extended to address the unique challenges that arise when no condorcet winner exists we prove anytime and expected regret bounds for ccb of the form log furthermore the denominator of this result has much better dependence on the gaps arising from the dueling bandit problem than most existing results cf sections and for the details however remaining weakness of ccb is the additive term in its regret bounds in applications with large this term can dominate for any experiment of reasonable duration for example at bing experiments are run concurrently on any given day in which case the duration of the experiment needs to be longer than the age of the universe in nanoseconds before log becomes significant in comparison to our second algorithm called scalable copeland bandits scb addresses this weakness by eliminating the term achieving an expected regret bound of the form log log the price of scb tighter regret bounds is that when two suboptimal arms are close to evenly matched it may waste comparisons trying to determine which one wins in expectation by contrast ccb can identify that this determination is unnecessary yielding better performance unless there are very many arms ccb and scb are thus complementary algorithms for finding copeland winners our main contributions are as follows we propose two algorithms that address the dueling bandit problem in the absence of condorcet winner one designed for problems with small numbers of arms and the other scaling well with the number of arms we provide regret bounds that bridge the gap between two groups of results those of the form log that make the condorcet assumption and those of the form log that do not make the condorcet assumption our bounds are similar to those of the former but are as broadly applicable as the latter furthermore the result for ccb has substantially better dependence on the gaps than the second group of results we include an empirical evaluation of ccb and scb using problem arising from information retrieval ir the experimental results mirror the theoretical ones problem setting let the dueling bandit problem is modification of the bandit problem the latter considers arms ak and at each an arm ai can be pulled generating reward drawn from an unknown stationary distribution with expected value µi the dueling bandit problem is variation in which instead of pulling single arm we choose pair ai aj and receive one of them as the better choice with the probability of ai being picked equal to an unknown constant pij and that of aj being picked equal to pji pij problem instance is fully specified by preference matrix pij whose ij entry is equal to pij most previous work assumes the existence of condorcet winner an arm which without loss of generality we label such that for all in such work regret is defined relative to the condorcet winner however condorcet winners do not always exist in this paper we consider formulation of the problem that does not assume the existence of condorcet winner instead we consider the copeland dueling bandit problem which defines regret with respect to 
copeland winner which is an arm with maximal copeland score the copeland score of ai denoted cpld ai is the number of arms aj for which pij the normalized copeland score denoted cpld ai is simply cpld without loss of generality we assume that ac are the copeland winners where is the number of copeland winners we define regret as follows definition the regret incurred by comparing ai and aj is cpld ai cpld aj remark since our results see establish bounds on the number of queries to winners they can also be applied to other notions of regret related work numerous methods have been proposed for the dueling bandit problem including interleaved filter beat the mean relative confidence sampling relative upper confidence bound rucb doubler and multisbm and mergerucb all of which require the existence of condorcet winner and often come with bounds of the form log however as observed in and appendix problems do not always have condorcet winners there is another group of algorithms that do not assume the existence of condorcet winner but have bounds of the form log in the copeland setting sensitivity analysis of variables for generic exploration savage racing pbr and rank elicitation rankel all three of these algorithms are designed to solve more general or more difficult problems and they solve the copeland dueling bandit problem as special case this work bridges the gap between these two groups by providing algorithms that are as broadly applicable as the second group but have regret bounds comparable to those of the first group furthermore in the case of the results for ccb rather than depending on the smallest gap between arms ai and aj min as in the case of many results in the copeland our regret bounds depend on larger quantity that results in substantially lower cf in addition to the above bounds have been proven for other notions of winners including borda random walk and very recently von neumann the dichotomy discussed also persists in the case of these results which either rely on restrictive assumptions to obtain linear dependence on or are more broadly applicable at the expense of quadratic dependence on natural question for future work is whether the improvements achieved in this paper in the case of the copeland winner can be obtained in the case of these other notions as well we refer the interested reader to appendix for numerical comparison of these notions of winners in practice more generally there is proliferation of notions of winners that the field of social choice theory has put forth and even though each definition has its merits it is difficult to argue for any single definition to be superior to all others related setting is that of partial monitoring games while dueling bandit problem can be modeled as partial monitoring problem doing so yields weaker results in the authors present bounds from which regret bound of the form log can be deduced for the dueling bandit problem whereas our work achieves linear dependence in method we now present two algorithms that find copeland winners copeland confidence bound ccb ccb see algorithm is based on the principle of optimism followed by pessimism it maintains optimistic and pessimistic estimates of the preference matrix matrices and line it uses to choose an optimistic copeland winner ac lines and an arm that has some chance of being copeland winner then it uses to choose an opponent ad line an arm deemed likely to discredit the hypothesis that ac is indeed copeland winner more precisely an optimistic estimate of the copeland 
score of each arm ai is calculated using line and ac is selected from the set of top scorers with preference given to those in shortlist bt line these are arms that have roughly speaking been optimistic winners throughout history to maintain bt as soon as ccb discovers that the optimistic copeland score of an arm is lower than the pessimistic copeland score of another arm it purges the former from bt line the mechanism for choosing the opponent ad is as follows the matrices and define confidence interval around pij for each and in relation to ac there are three types of arms arms aj the confidence region of pcj is strictly above arms aj the confidence region of pcj is strictly below and arms aj the confidence region of pcj contains note that an arm of type or at time may become an arm of type at time even without queries to the corresponding pair as the size of the confidence intervals increases as time goes on cf equation in and theorem algorithm copeland confidence bound input copeland dueling bandit problem and an exploration parameter wij array of wins wij is the number of times ai beat aj ak potential best arms for each potential to beat ai lc estimated max losses of copeland winner for do ln ln uij and with uii lii ij cpld ai uik and cpld ai lik ct ai cpld ai maxj cpld aj set bt bt and bti bti and update as follows reset disproven hypotheses if for any and aj bti we have lij reset bt lc and btk for all set them to their original values as in lines above remove winners for each ai bt if cpld ai cpld aj holds for any set bt bt ai and if lc then set bti ak however if bt reset bt lc and bt for all add copeland winners for any ai ct with cpld ai cpld ai set bt bt ai bt and lc cpld ai for each if we have lc set btj and if lc randomly choose lc elements of btj and remove the rest with probability sample uniformly from the set aj bti and lij uij if it is and skip to line if bt ct then with probability set ct bt ct sample ac from ct uniformly at random with probability choose the set to be either bti or ak and then set arg max ljc ujc if there is tie is not allowed to be equal to compare arms ac and ad and increment wcd or wdc depending on which arm wins end for ccb always chooses ad from arms of type because comparing ac and type arm is most informative about the copeland score of ac among arms of type ccb favors those that have confidently beaten arm ac in the past line arms that in some round were of type such arms are maintained in shortlist of formidable opponents bti that are likely to confirm that ai is not copeland winner these arms are favored when selecting ad lines and the sets bti are what speed up the elimination of winners enabling regret bounds that scale asymptotically with rather than specifically for winner ai the set bti will eventually contain lc strong opponents for ai line where lc is the number of losses of each copeland winner since lc is typically small cf appendix asymptotically this leads to bound of only log on the number of when ai is chosen as an optimistic copeland winner instead of bound of log which more naive algorithm would produce scalable copeland bandits scb scb is designed to handle dueling bandit problems with large numbers of arms it is based on an algorithm described in algorithm designed for pac setting it finds an winner with probability although we are primarily interested in the case with algorithm relies on reduction to bandit problem where we have direct access algorithm approximate copeland bandit solver input copeland dueling bandit problem 
with preference matrix pij failure probability and approximation parameter also define define random variable reward for as the following procedure pick uniformly random from query the pair ai aj sufficiently many times in order to determine at least whether pij return if pij and otherwise invoke algorithm where in each of its calls to reward the feedback is determined by the above stochastic process return the same output returned by algorithm to noisy version of the copeland score the process of estimating the score of arm ai consists of comparing ai to random arm aj until it becomes clear which arm beats the other the sample complexity bound which yields the regret bound is achieved by combining bound for bandits and bound on the number of arms that can have high copeland score algorithm calls bandit algorithm as subroutine to this end we use the algorithm slight modification of algorithm in it implements an elimination tournament with confidence regions based on the between probability distributions the interested reader can find the in algorithm contained in appendix combining this with the squaring trick modification of the doubling trick that reduces the number of partitions from log to log log the scb algorithm described in algorithm repeatedly calls algorithm but if an increasing threshold is reached if it terminates early then the identified arm is played against itself until the threshold is reached algorithm scalable copeland bandits input copeland dueling bandit problem with preference matrix pij for all do set and run algorithm with failure probability log in order to find an exact copeland winner if it requires more than queries let be the number of queries used by invoking algorithm and let ai be the arm produced by it query the pair ai ai times end for theoretical results in this section we present regret bounds for both ccb and scb assuming that the number of copeland winners and the number of losses of each copeland winner are ccb regret bound takes the form log while scb is of the form log log note that these bounds are not directly comparable when there are relatively few arms ccb is expected to perform better by contrast when there are many arms scb is expected to be superior appendix in the supplementary material provides empirical evidence to support these expectations throughout this section we impose the following condition on the preference matrix there are no ties for all pairs ai aj with we have pij this assumption is not very restrictive in practice for example in the ranker evaluation setting from information retrieval each arm corresponds to ranker complex and highly engineered system so it is unlikely that two rankers are indistinguishable furthermore some of the results we present in this section actually hold under even weaker assumptions however for the sake of clarity we defer discussion of these nuanced differences to appendix in the supplementary material copeland confidence bound ccb in this section we provide rough outline of our argument for the bound on the regret accumulated by algorithm for more detailed argument the interested reader is referred to appendix consider copeland bandit problem with arms ak and preference matrix pij such that arms ac are the copeland winners with being the number of copeland winners moreover we define lc to be the number of arms to which copeland winner loses in expectation using this notation our expected regret bound for ccb takes the form ln here is notion of gap defined in appendix which is an improvement upon 
the smallest gap between any pair of arms this result is proven in two steps first we bound the number of comparisons involving noncopeland winners yielding result of the form ln second theorem closes the gap see appendix in the supplementary material for experimental evidence that this is the case in practice between this bound and the one in by showing that beyond certain time horizon ccb selects winning arms as the optimistic copeland winner very infrequently theorem given copeland bandit problem satisfying assumption and any and there exist constants and such that with probability the regret accumulated by ccb is bounded by the following lc ln ln using the high probability regret bound given in theorem we can deduce the expected regret result claimed in for as corollary by integrating over the interval scalable copeland bandits we now turn to our regret result for scb which lowers the dependence in the additive constant of ccb regret result to log we begin by defining the relevant quantities definition given copeland bandit problem and an arm ai we define the following recall that cpld ai cpld ai is called the normalized copeland score ai is an if cpld ai cpld max cpld cpld ai and hi with maxi hi max ij cpld we now state our main scalability result theorem given copeland problem satisfying assumption the expected regret of scb pk hi cpld ai algorithm is bounded by log which in turn can be bounded by log lc where lc and min are as in definition min recall that scb is based on algorithm an algorithm that identifies copeland winner with high probability as result theorem is an immediate corollary of lemma obtained by using the well known squaring trick as mentioned in section the squaring trick is minor variation on the doubling trick that reduces the number of partitions from log to log log lemma is result for finding an copeland winner see definition note that for the regret setting we are only interested in the special case with the problem of identifying the best arm lemma with probability cpld ai hi algorithm finds an copeland winner by time log log min lc log in particular when there is condorcet winner cpld lc or more generally cpld lc an exact solution is found with probability at least by using an expected number of queries of at most lc log log in the remainder of this section we sketch the main ideas underlying the proof of lemma detailed in appendix in the supplementary material we first treat the simpler deterministic setting in which single query suffices to determine which of pair of arms beats the other while solution can easily be obtained using many queries we aim for one with query complexity linear in the main ingredients of the proof are as follows cpld ai is the mean of bernoulli random variable defined as such sample uniformly at random an index from the set and return if ai beats aj and otherwise applying based algorithm algorithm to the bandit arising from the above observation we obtain bound by dividing the arms into two groups those with copeland scores close to that of the copeland winners and the rest for the former we use the result from lemma to bound the number of such arms for the latter the resulting regret is dealt with using lemma which exploits the possible distribution of copeland scores the exact expression requires replacing log with log cumulative regret mslr informational cm with rankers rucb rankel pbr scb savage ccb time figure regret results for copeland dueling bandit problem arising from ranker evaluation let us state the two key lemmas here 
lemma let ak be the set of arms for which cpld ai that is arms that are beaten by at most arms then proof consider fully connected directed graph whose node set is and the arc ai aj is in the graph if arm ai beats arm aj by the definition of cpld the of any node is upper bounded by therefore the total number of arcs in the graph is at most now the full connectivity of the graph implies that the total number of arcs in the graph is exactly thus and the claim follows lemma the sum ai cpld is in log proof follows from lemma via careful partitioning of arms details are in appendix given the structure of algorithm the stochastic case is similar to the deterministic case for the following reason while the latter requires single comparison between arms ai and aj to determine which arm beats the other in the stochastic case we need roughly log log ij ij between the two arms to correctly answer the same question with probability at least comparisons experiments to evaluate our methods ccb and scb we apply them to copeland dueling bandit problem arising from ranker evaluation in the field of information retrieval ir we follow the experimental approach in and use preference matrix to simulate comparisons between each pair of arms ai aj by drawing samples from bernoulli random variables with mean pij we compare our proposed algorithms against the state of the art dueling bandit algorithms rucb copeland savage pbr and rankel we include rucb in order to verify our claim that dueling bandit algorithms that assume the existence of condorcet winner have linear regret if applied to copeland dueling bandit problem without condorcet winner more specifically we consider dueling bandit problem obtained from comparing five rankers none of whom beat the other four there is no condorcet winner due to lack of space the details of the experimental setup have been included in appendix figure shows the regret accumulated by ccb scb the copeland variants of savage pbr rankel and rucb on this problem the horizontal time axis uses log scale while the vertical axis which measures cumulative regret uses linear scale ccb outperforms all other algorithms in this experiment note that three of the baseline algorithms under consideration here savage pbr and rankel require the horizon of the experiment as an input either directly or through failure probability sample code and the preference matrices used in the experiments can be found at http which we set to with being the horizon in order to obtain regret algorithm as prescribed in therefore we ran independent experiments with varying horizons and recorded the accumulated regret the markers on the curves corresponding to these algorithms represent these numbers consequently the regret curves are not monotonically increasing for instance savage cumulative regret at time is lower than at time because the runs that produced the former number were not continuations of those that resulted in the latter but rather completely independent furthermore rucb cumulative regret grows linearly which is why the plot does not contain the entire curve appendix contains further experimental results including those of our scalability experiment conclusion in many applications that involve learning from human behavior feedback is more reliable when provided in the form of pairwise preferences in the dueling bandit problem the goal is to use such pairwise feedback to find the most desirable choice from set of options most existing work in this area assumes the existence of condorcet winner an arm 
that beats all other arms with probability greater than even though these results have the advantage that the bounds they provide scale linearly in the number of arms their main drawback is that in practice the condorcet assumption is too restrictive by contrast other results that do not impose the condorcet assumption achieve bounds that scale quadratically in the number of arms in this paper we set out to solve natural generalization of the problem where instead of assuming the existence of condorcet winner we seek to find copeland winner which is guaranteed to exist we proposed two algorithms to address this problem one for small numbers of arms called ccb and more scalable one called scb that works better for problems with large numbers of arms we provided theoretical results bounding the regret accumulated by each algorithm these results improve substantially over existing results in the literature by filling the gap that exists in the current results namely the discrepancy between results that make the condorcet assumption and are of the form log and the more general results that are of the form log moreover we have included in the supplementary material empirical results on both dueling bandit problem arising from application domain and synthetic problem used to test the scalability of scb the results of these experiments show that ccb beats all existing copeland dueling bandit algorithms while scb outperforms ccb on the problem one open question raised by our work is how to devise an algorithm that has the benefits of both ccb and scb the scalability of the latter together with the former better dependence on the gaps at this point it is not clear to us how this could be achieved another interesting direction for future work is an extension of both ccb and scb to problems with continuous set of arms given the prevalence of cyclical preference relationships in practice we hypothesize that the nonexistence of condorcet winner is an even greater issue when dealing with an infinite number of arms given that both our algorithms utilize confidence bounds to make their choices we anticipate that algorithms like those proposed in can be combined with our ideas to produce solution to the copeland bandit problem that does not rely on the convexity assumptions made by algorithms such as the one proposed in finally it is also interesting to expand our results to handle scores other than the copeland score such as an variant of the copeland score as in or completely different notions of winners such as the borda random walk or von neumann winners see acknowledgments we would like to thank nir ailon and ulle endriss for helpful discussions this research was supported by amsterdam data science the dutch national program commit elsevier the european community seventh framework programme under grant agreement nr the esf research network program elias the royal dutch academy of sciences knaw under the elite network shifts project the microsoft research program the netherlands escience center under project number the netherlands institute for sound and vision the netherlands organisation for scientific research nwo under project nrs the yahoo faculty research and engagement program and yandex all content represents the opinion of the authors which is not necessarily shared or endorsed by their respective employers sponsors references yue broder kleinberg and joachims the dueling bandits problem journal of computer and system sciences joachims optimizing search engines using clickthrough data in kdd 
yue and joachims beat the mean bandit in icml hofmann whiteson and de rijke balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval information retrieval and editors preference learning schuth sietsma whiteson lefortier and de rijke multileaved comparisons for fast online evaluation in cikm li kim and zitouni toward predicting the outcome of an experiment for search relevance in wsdm zoghi whiteson de rijke and munos relative confidence sampling for efficient ranker evaluation in wsdm schulze new monotonic reversal symmetric and condorcetconsistent election method social choice and welfare urvoy clerot and naamane generic exploration and voting bandits in icml weng cheng and selection based on adaptive sampling of noisy preferences in icml and pac rank elicitation through adaptive sampling of stochastic pairwise preferences in aaai zoghi whiteson munos and de rijke relative upper confidence bound for the dueling bandits problem in icml kohavi deng frasca walker xu and pohlmann online controlled experiments at large scale in kdd thompson on the likelihood that one unknown probability exceeds another in view of the evidence of two samples biometrika pages ailon karnin and joachims reducing dueling bandits to cardinal bandits in icml zoghi whiteson and de rijke mergerucb method for online ranker evaluation in wsdm negahban oh and shah iterative ranking from comparisons in nips hofmann schapire slivkins and zoghi contextual dueling bandits in colt piccolboni and schindelhauer discrete prediction games with arbitrary feedback and loss in colt zolghadr and an adaptive algorithm for finite stochastic partial monitoring in icml garivier maillard munos stoltz et al upper confidence bounds for optimal sequential allocation the annals of statistics manning raghavan and introduction to information retrieval cambridge university press kleinberg slivkins and upfa bandits in metric space in stoc bubeck munos stoltz and szepesvari bandits jmlr srinivas krause kakade and seeger gaussian process optimization in the bandit setting no regret and experimental design in icml munos optimistic optimization of deterministic function without the knowledge of its smoothness in nips bull convergence rates of efficient global optimization algorithms jmlr de freitas smola and zoghi exponential regret bounds for gaussian process bandits with deterministic observations in icml valko carpentier and munos stochastic simultaneous optimistic optimization in icml yue and joachims interactively optimizing information retrieval systems as dueling bandits problem in icml altman and tennenholtz axiomatic foundations for ranking systems jair 
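To make the basic quantities used throughout the Copeland dueling bandit discussion above concrete (the preference matrix, Copeland and normalized Copeland scores, the simulated Bernoulli comparisons, and the per-comparison regret), here is a small illustrative sketch. The function names and the example matrix are ours, not the paper's, and the regret helper follows the regret definition given in the problem-setting section above.

```python
import numpy as np

rng = np.random.default_rng(0)

def copeland_scores(P):
    """Copeland score of each arm: the number of other arms it beats (p_ij > 1/2)."""
    beats = P > 0.5
    np.fill_diagonal(beats, False)
    return beats.sum(axis=1)

def normalized_copeland_scores(P):
    """Copeland scores divided by K - 1, so that values lie in [0, 1]."""
    return copeland_scores(P) / (P.shape[0] - 1)

def copeland_winners(P):
    """Indices of the arms with maximal Copeland score (always non-empty)."""
    scores = copeland_scores(P)
    return np.flatnonzero(scores == scores.max())

def duel(P, i, j):
    """Simulate one comparison: True if arm i wins against arm j (Bernoulli(p_ij))."""
    return rng.random() < P[i, j]

def comparison_regret(P, i, j):
    """Regret of comparing arms i and j: winner's normalized score minus their average."""
    cpld = normalized_copeland_scores(P)
    return cpld.max() - 0.5 * (cpld[i] + cpld[j])

if __name__ == "__main__":
    # A small cyclic preference matrix with no Condorcet winner (illustrative numbers).
    P = np.array([[0.5, 0.6, 0.4],
                  [0.4, 0.5, 0.7],
                  [0.6, 0.3, 0.5]])
    print(copeland_scores(P), copeland_winners(P))   # each arm beats exactly one other arm
    print(duel(P, 0, 1), comparison_regret(P, 0, 1))
```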
optimal ridge detection using coverage risk christopher genovese department of statistics carnegie mellon university genovese chen department of statistics carnegie mellon university yenchic larry wasserman department of statistics carnegie mellon university larry shirley ho department of physics carnegie mellon university shirleyh abstract we introduce the concept of coverage risk as an error measure for density ridge estimation the coverage risk generalizes the mean integrated square error to set estimation we propose two risk estimators for the coverage risk and we show that we can select tuning parameters by minimizing the estimated risk we study the rate of convergence for coverage risk and prove consistency of the risk estimators we apply our method to three simulated datasets and to cosmology data in all the examples the proposed method successfully recover the underlying density structure introduction density ridges are structures that characterize high density regions density ridges have been applied to computer vision remote sensing biomedical imaging and cosmology density ridges are similar to the principal curves figure provides an example for applying density ridges to learn the structure of our universe to detect density ridges from data proposed the subspace constrained mean shift scms algorithm scms is modification of usual mean shift algorithm to adapt to the local geometry unlike mean shift that pushes every mesh point to local mode scms moves the meshes along projected gradient until arriving at nearby ridges essentially the scms algorithm detects the ridges of the kernel density estimator kde therefore the scms algorithm requires preselected parameter which acts as the role of smoothing bandwidth in the kernel density estimator despite the wide application of the scms algorithm the choice of remains an unsolved problem similar to the density estimation problem poor choice of results in or undersmoothing for the density ridges see the second row of figure in this paper we introduce the concept of coverage risk which is generalization of the mean integrated expected error from function estimation we then show that one can consistently estimate the coverage risk by using data splitting or the smoothed bootstrap this leads us to selection rule for choosing the parameter for the scms algorithm we apply the proposed method to several famous datasets including the spiral dataset the three spirals dataset and the nips dataset in all simulations our selection rule allows the scms algorithm to detect the underlying structure of the data figure the cosmic web this is slice of the observed universe from the sloan digital sky survey we apply the density ridge method to detect filaments the top row is one example for the detected filaments the bottom row shows the effect of smoothing optimal smoothing under optimal smoothing we detect an intricate filament network if we or the dataset we can not find the structure density ridges density ridges are defined as follows assume xn are independently and identically distributed from smooth probability density function with compact support the density ridges are defined as where vd with vj being the eigenvector associated with the ordered eigenvalue λj λd for hessian matrix that is is the collection of points whose projected gradient it can be shown that under appropriate conditions is collection of smooth curves manifolds in rd the scms algorithm is estimate for by using bn vbn vbn pn pn are the associated quantities defined is the kde and 
vbn and where pbn by pbn hence one can clearly see that the parameter in the scms algorithm plays the same role of smoothing bandwidth for the kde coverage risk before we introduce the coverage risk we first define some geometric concepts let be the dimensional hausdorff measure namely is the length of set and is the area of let be the projection distance from point to set we define ur and urbn as bn random variables uniformly distributed over the true density ridges and the ridge estimator respectively assuming and rn are given we define the following two random variables bn wn ur fn rn bn are sets wn is the distance from randomly note that ur urbn are random variables while bn and fn is the distance from random point on bn to selected point on to the estimator let haus inf be the hausdorff distance between and where the following lemma gives some useful properties about wn fn and fn are bounded by haus cn namely lemma both random variables wn and bn wn haus fn haus bn fn are the cumulative distribution function cdf for wn and bn bn bn fn bn wn bn is the ratio of being covered by padding the regions around bn at distance thus wn this lemma follows trivially by definition so that we omit its proof lemma links the random fn to the hausdorff distance and the coverage for and bn thus we call them variables wn and bn as coverage random variables now we define the and coverage risk for estimating by fn wn bn note that is and is the expected square projected distance between and that the expectation in applies to both rn and ur one can view as generalized mean integrated square errors mise for sets nice property of and is that they are not sensitive to outliers of in the sense that small perturbation of will not change the risk too much on the contrary the hausdorff distance is very sensitive to outliers selection for tuning parameters based on risk minimization in this section we will show how to choose by minimizing an estimate of the risk we propose two risk estimators the first estimator is based on the smoothed bootstrap we the we estimate the risk by sample from the kde pbn and recompute the estimator wn wn xn wn wn xn risk risk where and rn rn rn the second approach is to use data splitting we randomly split the data into and assuming is even and we compute the estimated manifolds by using and then we compute half of the data which we denote as xn xn and risk where having estimated the risk we select by argmin risk where is an upper bound by the normal reference rule which is known to oversmooth so that we only select below this rule moreover one can choose by minimizing risk as well in they consider selecting the smoothing bandwidth for local principal curves by this criterion is different from ours the counts data points the is monotonic increasing function and they propose to select the bandwidth such that the derivative is highest our coverage risk yields simple curve and one can easily pick the optimal bandwidth by minimizing the estimated risk manifold comparison by coverage the concepts of coverage in previous section can be generalized to investigate the difference between two manifolds let and be an and an manifolds and are not necessarily the same we define the coverage random variables then by lemma the cdf for and contains information about how and are different from each other is the coverage on by padding regions with distance around we call the plots of the cdf of and coverage diagrams since they are linked to the coverage over and the coverage diagram allows us to study 
how two manifolds are different from each other when the coverage diagram can be used as similarity measure for two manifolds when the coverage diagram serves as measure for quality of representing high dimensional objects by low dimensional ones nice property for coverage diagram is that we can approximate the cdf for and by mesh of points or points uniformly distributed over and in figure we consider helix dataset whose support has dimension and we compare two curves spiral curve green and straight line orange to represent the helix dataset as can be seen from the coverage diagram right panel the green curve has better coverage at each distance compared to the orange curve so that the spiral curve provides better representation for the helix dataset in addition to the coverage diagram we can also use the following and losses as summary for the difference the expectation is take over and and both and here are fixed the risks in are the expected losses cn cn coverage figure the helix dataset the original support for the helix dataset black dots are regions we can use green spiral curves to represent the regions note that we also provide bad representation using straight line orange the coverage plot reveals the quality for representation left the original data dashed line is coverage from data points black dots over curves in the left panel and solid line is coverage from curves on data points right the coverage plot for the spiral curve green versus straight line orange theoretical analysis in this section we analyze the asymptotic behavior for the coverage risk and prove the consistency for estimating the coverage risk by the proposed method in particular we derive the asymptotic properties for the density ridges we only focus on risk since by jensen inequality the risk can be bounded by the risk before we state our assumption we first define the orientation of density ridges recall that the density ridge is collection of one dimensional curves thus for each point we can associate unit vector that represent the orientation of at the explicit formula for can be found in lemma of assumptions there exist δr such that for all δr kkp kmax where kp kmax is the element wise norm to the third derivative and for each the kernel function is three times bounded differetiable and is symmetric and dx dx for all the kernel function and its partial derivative satisfies condition in specifically let rd we require that satisfies supn for some positive number where denotes the number of the metric space and is the envelope function of and the supreme is taken over the whole rd ther and are usually called the vc characteristics of the norm kf supp dp assumption appears in and is very mild the first two inequality in are just the bound on eigenvalues the last inequality requires the density around ridges to be smooth the latter part of requires the direction of ridges to be similar to the gradient direction assumption is the common condition for kernel density estimator see and assumption is to regularize the classes of kernel functions that is widely assumed any bounded kernel function with compact support satisfies this condition both and hold for the gaussian kernel under the above condition we derive the rate of convergence for the risk theorem let be the coverage risk for estimating the density ridges and level sets assume and and is at least four times bounded differentiable then as log and nh σr br that depends only on the density and the kernel function for some br and σr the rate in theorem shows 
decomposition the first term involving is the bias term while the latter term is the variance part thanks to the jensen inequality the rate of convergence for risk is the square root of the rate theorem note that we require the smoothing log parameter to decay slowly to by nh this constraint comes from the uniform bound for estimating third derivatives for we need this constraint since we need the smoothness for estimated ridges to converge to the smoothness for the true ridges similar result for density level set appears in by lemma we can upper bound the risk by expected square of the hausdorff distance which gives the rate bn log the rate under hausdorff distance for density ridges can be found in and the rate for density ridges appears in the rate induced by theorem agrees with the bound from the hausdorff distance and has slightly better rate for variance without factor this phenomena is similar to the mise and error for nonparametric estimation for functions the mise converges slightly faster by factor than square to the error now we prove the consistency of the risk estimators in particular we prove the consistency for the smoothed bootstrap the case of data splitting can be proved in the similar way theorem let be the coverage risk for estimating the density ridges and level sets let be the corresponding risk estimator by the smoothed bootstrap assume and and risk log is at least four times bounded differentiable then as and nh risk theorem proves the consistency for risk estimation using the smoothed bootstrap this also leads to the consistency for data splitting estimated coverage risk smoothing parameter estimated coverage risk smoothing parameter estimated coverage risk smoothing parameter figure three different simulation datasets top row the spiral dataset middle row the three spirals dataset bottom row nips character dataset for each row the leftmost panel shows the estimated coverage risk using data splitting the red straight line indicates the bandwidth selected by least square cross validation which is either undersmooth or oversmooth then the rest three panels are the result using different smoothing parameters from left to right we show the result for optimal smoothing using the coverage risk and note that the second minimum in the coverage risk at the three spirals dataset middle row corresponds to phase transition when the estimator becomes big circle this is also locally stable structure applications simulation data we now apply the data splitting technique to choose the smoothing bandwidth for density ridge estimation note that we use data splitting over smooth bootstrap since in practice data splitting works better the density ridge estimation can be done by the subspace constrain mean shift algorithm we consider three famous datasets the spiral dataset the three spirals dataset and nips dataset figure shows the result for the three simulation datasets the top row is the spiral dataset the middle row is the three spirals dataset the bottom row is the nips character dataset for each row from left to right the first panel is the estimated risk by using data splitting note that there is no practical difference between and risk the second to fourth panels are optimal smoothing and note that we also remove the ridges whose density is below maxx pbn since they behave like random noise as can be seen easily the optimal bandwidth allows the density ridges to capture the underlying structures in every dataset on the contrary the and the does not capture the structure and 
have higher risk estimated coverage risk smoothing parameter figure another slice for the cosmic web data from the sloan digital sky survey the leftmost panel shows the estimated coverage risk right panel for estimating density ridges under different smoothing parameters we estimated the coverage risk by using data splitting for the rest panels from left to right we display the case for optimal smoothing and as can be seen easily the optimal smoothing method allows the scms algorithm to detect the intricate cosmic network structure cosmic web now we apply our technique to the sloan digital sky survey huge dataset that contains millions of galaxies in our data each point is an observed galaxy with three features the redshift which is the distance from the galaxy to earth ra the right ascension which is the longitude of the universe dec the declination which is the latitude of the universe these three features ra dec uniquely determine the location of given galaxy to demonstrate the effectiveness of our method we select slice of our universe at redshift with ra dec since the redshift difference is very tiny we ignore the redshift value of the galaxies within this region and treat them as data points thus we only use ra and then we apply the scms algorithm version of with data splitting method introduced in section to select the smoothing parameter the result is given in figure the left panel provides the estimated coverage risk at different smoothing bandwidth the rest panels give the result for second panel optimal smoothing third panel and right most panel in the third panel of figure we see that the scms algorithm detects the filament structure in the data discussion in this paper we propose method using coverage risk generalization of mean integrated square error to select the smoothing parameter for the density ridge estimation problem we show that the coverage risk can be estimated using data splitting or smoothed bootstrap and we derive the statistical consistency for risk estimators both simulation and real data analysis show that the proposed bandwidth selector works very well in practice the concept of coverage risk is not limited to density ridges instead it can be easily generalized to other manifold learning technique thus we can use data splitting to estimate the risk and use the risk estimator to select the tuning parameters this is related to the stability selection which allows us to select tuning parameters even in an unsupervised learning settings references bas ghadarghadar and erdogmus automated extraction of blood vessel networks from microscopy image stacks via principal curve tracing in biomedical imaging from nano to macro ieee international symposium on pages ieee bas erdogmus draft and lichtman local tracing of curvilinear structures in volumetric color images application to the brainbow analysis journal of visual communication and image representation cadre kernel estimation of density level sets journal of multivariate analysis chen genovese tibshirani and wasserman nonparametric modal regression arxiv preprint chen genovese and wasserman generalized mode and ridge estimation arxiv june chen genovese and wasserman asymptotic theory for density ridges arxiv preprint chen ho freeman genovese and wasserman cosmic web reconstruction through density ridges method and algorithm arxiv preprint cheng mean shift mode seeking and clustering pattern analysis and machine intelligence ieee transactions on cuevas and estimation of general level sets aust eberly ridges in image 
and data analysis springer einbeck bandwidth selection for based unsupervised learning techniques unified approach via journal of pattern recognition einmahl and mason uniform in bandwidth consistency for function estimators the annals of statistics evans and gariepy measure theory and fine properties of functions volume crc press fukunaga and hostetler the estimation of the gradient of density function with applications in pattern recognition information theory ieee transactions on genovese verdinelli and wasserman nonparametric ridge estimation the annals of statistics gine and guillou rates of strong uniform consistency for multivariate kernel density estimators in annales de institut henri poincare probability and statistics hastie principal curves and surfaces technical report dtic document hastie and stuetzle principal curves journal of the american statistical association jones marron and sheather brief survey of bandwidth selection for density estimation journal of the american statistical association mason polonik et al asymptotic normality of level set estimates the annals of applied probability miao wang shi and wu method for accurate road centerline extraction from classified image ozertem and erdogmus locally defined principal curves and surfaces journal of machine learning research rinaldo and wasserman generalized density clustering the annals of statistics scott multivariate density estimation theory practice and visualization volume john wiley sons silverman and young the bootstrap to smooth or not to smooth biometrika silverman density estimation for statistics and data analysis chapman and hall tibshirani principal curves revisited statistics and computing wasserman all of nonparametric statistics new york 
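As a companion to the bandwidth selection procedure of the coverage-risk paper above, here is a minimal sketch (my own code, under my own assumptions, not the authors') of the data-splitting risk estimator. The ridge estimator itself (for example, an SCMS implementation) is passed in as a callable; projection_dist approximates the projection distance by the nearest point on a discretized ridge; and the symmetric averaging over the two halves is one plausible reading of the estimator defined in the paper.

import numpy as np

def projection_dist(points, manifold):
    """Distance from each point to the nearest point of a discretized
    manifold (an (m, d) array of points, e.g., the output of SCMS)."""
    diffs = points[:, None, :] - manifold[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def coverage_risk_split(data, h, estimate_ridge, rng):
    """Data-splitting estimate of the squared coverage risk at bandwidth h.
    estimate_ridge(points, h) is a user-supplied ridge estimator returning
    a mesh of ridge points; the two-sided average is an assumption here."""
    n = data.shape[0]
    idx = rng.permutation(n)
    half1, half2 = data[idx[: n // 2]], data[idx[n // 2:]]
    r1, r2 = estimate_ridge(half1, h), estimate_ridge(half2, h)
    w1 = projection_dist(half2, r1) ** 2   # held-out points to the other half's ridge
    w2 = projection_dist(half1, r2) ** 2
    return 0.5 * (w1.mean() + w2.mean())

def select_bandwidth(data, h_grid, estimate_ridge, seed=0):
    """Pick the bandwidth on the grid that minimizes the estimated risk."""
    rng = np.random.default_rng(seed)
    risks = [coverage_risk_split(data, h, estimate_ridge, rng) for h in h_grid]
    return h_grid[int(np.argmin(risks))]

Passing the ridge estimator in as a callable keeps the risk estimator agnostic to the particular ridge-finding algorithm, which matches the spirit of the paper's remark that the coverage risk applies to manifold estimators beyond density ridges.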
multiclass svm maksim matthias and bernt max planck institute for informatics saarbrücken germany saarland university saarbrücken germany abstract class ambiguity is typical in image classification problems with large number of classes when classes are difficult to discriminate it makes sense to allow guesses and evaluate classifiers based on the error instead of the standard loss we propose multiclass svm as direct method to optimize for performance our generalization of the multiclass svm is based on tight convex upper bound of the error we propose fast optimization scheme based on an efficient projection onto the simplex which is of its own interest experiments on five datasets show consistent improvements in accuracy compared to various baselines introduction as the number of classes increases two important issues emerge class overlap and multilabel nature of examples this phenomenon asks for adjustments of both the evaluation metrics as well as the loss functions employed when predictor is allowed guesses and is not penalized for mistakes such an evaluation measure is known as error we argue that this is an important metric that will figure images from sun illustrating evitably receive more attention in the future as class ambiguity top left to right park river pond bottom park campus picnic area the illustration in figure indicates how obvious is it that each row of figure shows examples of different classes can we imagine human to predict correctly on the first attempt does it even make sense to penalize learning system for such mistakes while the problem of class ambiguity is apparent in computer vision similar problems arise in other domains when the number of classes becomes large we propose multiclass svm as generalization of the multiclass svm it is based on tight convex upper bound of the loss which we call hinge loss while it turns out to be similar to version of the ranking based loss proposed by we show that the hinge loss is lower bound on their version and is thus tighter bound on the loss we propose an efficient implementation based on stochastic dual coordinate ascent sdca key ingredient in the optimization is the biased projection onto the simplex this projection turns out to be tricky generalization of the continuous quadratic knapsack problem respectively the projection onto the standard simplex the proposed algorithm for solving it has complexity log for rm our implementation of the multiclass svm scales to large datasets like places with about million examples and classes finally extensive experiments on several challenging computer vision problems show that multiclass svm consistently improves in error over the multiclass svm equivalent to our multiclass svm svm and other methods based on different ranking losses loss in multiclass classification in multiclass classification one is given set xi yi of training examples xi along with the corresponding labels yi let rd be the feature space and the set of labels the task is to learn set of linear predictors wy rd such that the risk of the classifier arg hwy xi is minimized for given loss function which is usually chosen to be convex upper bound of the loss the generalization to nonlinear predictors using kernels is discussed below the classification problem becomes extremely challenging in the presence of large number of ambiguous classes it is natural in that case to extend the evaluation protocol to allow guesses which leads to the popular error and accuracy performance measures formally we consider ranking of 
labels induced by the prediction scores hwy xi let the bracket denote permutation of labels such that is the index of the largest score the loss errk is defined as errk xi hwy xi where xi hwm xi and if is true and otherwise note that the standard loss is recovered when and errk is always for therefore we are interested in the regime multiclass support vector machine in this section we review the multiclass svm of crammer and singer which will be extended to the multiclass svm in the following we mainly follow the notation of given training pair xi yi the multiclass svm loss on example xi is defined as max hwy xi hwyi xi since our optimization scheme is based on fenchel duality we also require convex conjugate of the primal loss function let where is the all ones vector and ej is the standard basis vector in rm let rm be defined componentwise as aj hwj xi hwyi xi and let rm xi xi proposition pair for the multiclass svm loss is hc bi if max otherwise note that thresholding with in is actually redundant as yi and is only given to enhance similarity to the version defined later support vector machine the main motivation for the loss is to relax the penalty for making an error in the predictions looking at in direct extension to the setting would be function ψk max which incurs loss iff since the ground truth score yi we conclude that ψk xi xi hwyi xi which directly corresponds to the loss errk with margin note that the function ψk ignores the values of the first scores which could be quite large if there are highly similar classes that would be fine in this model as long as the correct prediction is within the first guesses however the function ψk is unfortunately nonconvex since the function fk returning the largest coordinate is nonconvex for therefore finding globally optimal solution is computationally intractable instead we propose the following convex upper bound on ψk which we call the hinge loss φk max where the sum of the largest components is known to be convex we have that ψk φk for any and rm moreover φk unless all largest scores are the same this extra slack can be used to increase the margin between the current and the remaining least similar classes which should then lead to an improvement in the metric simplex and convex conjugate of the hinge loss in this section we derive the conjugate of the proposed loss we begin with well known result that is used later in the proof all proofs can be found in the supplement let max pk pm lemma lemma mint kt hj we also define set which arises naturally as the effective of the conjugate of by analogy we call it the simplex as for it reduces to the standard simplex with the inequality constraint let definition the simplex is convex polytope defined as xi xi xi where is the bound on the sum xi we let the crucial difference to the standard simplex is the upper bound on xi which limits their maximal contribution to the total sum xi see figure for an illustration figure simplex the first technical contribution of this work is as follows for unlike the standard proposition pair for the hinge loss simplex it has vertices is given as follows hc bi if φk max φk otherwise moreover φk max ha λi therefore we see that the proposed formulation naturally extends the multiclass svm of crammer and singer which is recovered when we have also obtained an interesting extension or rather contraction since of the standard simplex relation of the hinge loss to ranking based losses usunier et al have recently formulated very general family of convex losses for 
ranking and multiclass classification in their framework the hinge loss on example xi can be written as lβ βy max convex function has an effective domain dom where βm is sequence of numbers which act as weights for the ordered losses the relation to the hinge loss becomes apparent if we choose βj in that case we obtain another version of the hinge loss if and otherwise max it is straightforward to check that ψk φk the bound φk holds with equality if or otherwise there is gap and our loss is strictly better upper bound on the actual loss we perform extensive evaluation and comparison of both versions of the hinge loss in while employed larank and optimized an approximation of lβ we show in the supplement how the loss function can be optimized exactly and efficiently within the proxsdca framework multiclass to binary reduction it is also possible to compare directly to ranking based methods that solve binary problem using the following reduction we employ it in our experiments to evaluate the ranking based methods svmperf and toppush the trick is to augment the training set by embedding each xi rd into rmd using feature map φy for each the mapping φy places xi at the position in rmd and puts zeros everywhere else the example φyi xi is labeled and all φy xi for yi are labeled therefore we have new training set with mn examples and md dimensional sparse features moreover hw φy xi hwy xi which establishes the relation to the original multiclass problem another approach to general performance measures is given in it turns out that using the above reduction one can show that under certain constraints on the classifier the recall is equivalent to the error convex upper bound on recall is then optimized in via structured svm as their convex upper bound on the recall is not decomposable in an instance based loss it is not directly comparable to our loss while being theoretically very elegant the approach of does not scale to very large datasets optimization framework we begin with general multiclass classification problem where for notational convenience we keep the loss function unspecified the multiclass svm or the multiclass svm are obtained by plugging in the corresponding loss function from fenchel duality for multiclass classification problems let be the matrix of training examples xi rd let be the matrix of primal variables obtained by stacking the vectors wy rd and the matrix of dual variables before we prove our main result of this section theorem we first impose technical constraint on loss function to be compatible with the choice of the ground truth coordinate the hinge loss from section satisfies this requirement as we show in proposition we also prove an auxiliary lemma which is then used in theorem definition convex function is if for any rm with yj we have that sup hy xi xj this constraint is needed to prove equality in the following lemma lemma let be let hj and let hj then yj ej if yi otherwise we can now use lemma to compute convex conjugates of the loss functions theorem let φi be yi for each let be regularization parameter and let be the gram matrix the primal and fenchel dual objective functions are given as φi xi hwyi xi tr ai ayi eyi tr aka if ai otherwise moreover we have that xa and xi aki where ki is the column of finally we show that theorem applies to the loss functions that we consider proposition the hinge loss function from section is yi we have repeated the derivation from section in as there is typo in the optimization problem leading to the conclusionp that ayi 
must be at the optimum lemma fixes this by making the requirement ayi aj explicit note that this modification is already mentioned in their for optimization of multiclass svm via as an optimization scheme we employ the algorithm multiclass svm proximal stochastic dual coordinate ascent input training data xi yi parameters framework of loss regularization stopping cond and zhang which has strong convergence output guarantees and is easy to adapt to our initialize lem in particular we iteratively update batch repeat ai of dual variables corresponding to randomly permute training data the training pair xi yi so as to maximize the for to do dual objective from theorem we also si xi prediction scores ai cache previous values aold maintain the primal variables xa and ai update kxi yi si ai stop when the relative duality gap is below see for details this procedure is summarized in algorithm old xi ai ai let us make few comments on the advantages update of the proposed method first apart from the end for update step which we discuss below all main until relative duality gap is below operations can be computed using blas library which makes the overall implementation efficient second the update step in line is optimal in the sense that it yields maximal dual objective increase jointly over variables this is opposed to sgd updates with step sizes as well as to maximal but scalar updates in other sdca variants finally we have stopping criterion as we can compute the duality gap see discussion in the latter is especially attractive if there is time budget for learning the algorithm can also be easily kernelized since xi aki cf theorem dual variables update for the proposed hinge loss from section optimization of the dual objective over ai rm given other variables fixed is an instance of regularized biased projection problem onto let be obtained by removing the coordinate from vector the simplex λn proposition the following two problems are equivalent with ai and ayi xi max ai min kb xk xi λn ai where hxi xi qyi xi hxi xi ai and we discuss in the following section how to project onto the set λn efficiently efficient projection onto the simplex one of our main technical results is an algorithm for efficiently computing projections onto respectively the biased projection introduced in proposition the optimization problem in proposition reduces to the euclidean projection onto for and for it biases the solution to be orthogonal to let us highlight that is substantially different from the standard simplex and none of the existing methods can be used as we discuss below continuous quadratic knapsack problem finding the euclidean projection onto the simplex is an instance of the general optimization problem minx ka hb xi xi known as the continuous quadratic knapsack problem cqkp for example to project onto the simplex we set and this is well examined problem and several highly efficient algorithms are available see the surveys the first main difference to our set is the upper bound on the xi all existing algorithms expect that is fixed which allows them to consider decompositions minxi ai xi xi which can be solved in in our case the upper bound xi introduces coupling across all variables which makes the existing algorithms not applicable second main difference is the bias term xi added to the objective the additional difficulty introduced by this term is relatively minor thus we solve the problem for general including for the euclidean projection onto even though we need only in proposition the only case 
when our problem reduces to cqkp is when the constraint xi is satisfied with equality in that case we can let and use any algorithm for the knapsack problem we choose since it is easy to implement does not require sorting and scales linearly in practice the bias in the projection problem reduces to constant in this case and has therefore no effect projection onto the cone when the constraint xi is not satisfied with equality at the optimum it has essentially no influence on the projection problem and can be removed in that case we are left with the problem of the biased projection onto the cone which we address with the following lemma lemma let rd be the solution to the following optimization problem min ka xk xi xi and let xi if and then if and then is the index of the largest component in otherwise pk for where the following system of linear equations holds ai ai ρk ai ρk ai ρk together with the feasibility constraints on ρuk max ai min ai max ai min ai and we have min max we now show how to check if the biased projection is for the standard simplex where the cone is the positive orthant the projection is when all ai it is slightly more involved for pk lemma the biased projection onto the cone is zero if sufficient condition if this is also necessary projection lemmas and suggest simple algorithm for the biased projection onto the topk cone first we check if the projection is constant cases and in lemma in case we compute and check if it is compatible with the corresponding sets in the general case we suggest simple exhaustive search strategy we sort and loop over the feasible partitions until we find solution to that satisfies since we know that and we can limit the search to iterations in the worst case where each iteration requires constant number of operations for the biased projection we leave as the fallback case as lemma gives only sufficient condition this yields runtime complexity of log kd which is comparable to simplex projection algorithms based on sorting projection onto the simplex as we argued in the biased projection onto the simplex becomes either the knapsack problem or the biased projection onto the cone depending on the constraint xi at the optimum the following lemma provides way to check which of the two cases apply lemma let rd be the solution to the following optimization problem min ka xk xi xi xi xi let be the optimal thresholds such that min max and let be defined as in lemma then it must hold that kp ρr where ai projection we can now use lemma to compute the biased projection onto as follows first we check the special cases of zero and constant projections as we did before if that fails we proceed with the knapsack problem since it is faster to solve having the thresholds and the partitioning into the sets we compute the value of as given in lemma if we are done otherwise we know that xi and go directly to the general case in lemma experimental results we have two main goals in the experiments first we show that the biased projection onto the simplex is scalable and comparable to an efficient algorithm for the simplex projection see the supplement second we show that the multiclass svm using both versions of the hinge loss and denoted svmα and svmβ respectively leads to improvements in accuracy consistently over all datasets and choices of in particular we note improvements compared to the multiclass svm of crammer and singer which corresponds to svmα svmβ we release our implementation of the projection procedures and both sdca solvers as with matlab 
interface image classification experiments we evaluate our method on five image classification datasets of different scale and complexity caltech silhouettes mit indoor sun places and imagenet for caltech and for the others the results on the two large scale datasets are in the supplement we in the range to extending it when the optimal value is at the boundary we use liblinear for svmova svmperf with the corresponding loss function for recall and the code provided by for toppush when ranking method like recall and toppush does not scale to particular dataset using the reduction of the multiclass to binary problem discussed in we use the version of the corresponding method we implemented denoted based on the from table in on caltech we use features provided by for the other datasets we extract cnn features of cnn layer after relu for the scene recognition datasets we use the places cnn and for ilsvrc we use the caffe reference model https caltech silhouettes mit indoor method method svm toppush recall recall recall svmα svmα svmα svmβ svmβ svmβ ova method method method blh dge ras sp zlx kl jvj gwg sun splits accuracy method xhe spm lsh gwg zlx kl ova svm toppushova recall recall recall svmα svmα svmα svmβ svmβ svmβ table accuracy top section state of the art middle section baseline methods bottom section svms svmα with the loss svmβ with the loss experimental results are given in table first we note that our method is scalable to large datasets with millions of training examples such as places and ilsvrc results in the supplement second we observe that optimizing the hinge loss both versions yields consistently better performance this might come at the cost of decreased accuracy on mit indoor but interestingly may also result in noticeable increase in the accuracy on larger datasets like caltech silhouettes and sun this resonates with our argumentation that optimizing for is often more appropriate for datasets with large number of classes overall we get systematic increase in accuracy over all datasets that we examined for example we get the following improvements in accuracy with our svmα compared to svmα on caltech on mit indoor and on sun conclusion we demonstrated scalability and effectiveness of the proposed multiclass svm on five image recognition datasets leading to consistent improvements in performance in the future one could study if the hinge loss can be generalized to the family of ranking losses similar to the loss this could lead to tighter convex upper bounds on the corresponding discrete losses references bordes bottou gallinari and weston solving multiclass support vector machines with larank in icml pages bousquet and bottou the tradeoffs of large scale learning in nips pages boyd and vandenberghe convex optimization cambridge university press bu liu han and wu superpixel segmentation based structural scene recognition in mm pages acm crammer and singer on the algorithmic implementation of multiclass vector machines the journal of machine learning research doersch gupta and efros visual element discovery as discriminative mode seeking in nips pages fan chang hsieh wang and lin liblinear library for large linear classification journal of machine learning research gong wang guo and lazebnik orderless pooling of deep convolutional activation features in eccv gupta bengio and weston training highly multiclass classifiers jmlr jia shelhamer donahue karayev long girshick guadarrama and darrell caffe convolutional architecture for fast feature embedding arxiv preprint joachims 
support vector method for multivariate performance measures in icml pages juneja vedaldi jawahar and zisserman blocks that shout distinctive parts for scene classification in cvpr kiwiel variable fixing algorithms for the continuous quadratic knapsack problem journal of optimization theory and applications koskela and laaksonen convolutional network features for scene recognition in proceedings of the acm international conference on multimedia pages acm lapin schiele and hein scalable multitask representation learning for scene classification in cvpr li jin and zhou top rank optimization in linear time in nips pages ogryczak and tamir minimizing the sum of the largest functions in linear time information processing letters patriksson survey on the continuous nonlinear resource allocation problem european journal of operational research patriksson and strömberg algorithms for the continuous nonlinear resource allocation problem new implementations and numerical studies european journal of operational research quattoni and torralba recognizing indoor scenes in cvpr razavian azizpour sullivan and carlsson cnn features an astounding baseline for recognition arxiv preprint russakovsky deng su krause satheesh ma huang karpathy khosla bernstein berg and imagenet large scale visual recognition challenge sánchez perronnin mensink and verbeek image classification with the fisher vector theory and practice ijcv pages and zhang accelerated proximal stochastic dual coordinate ascent for regularized loss minimization mathematical programming pages sun and ponce learning discriminative part detectors for image classification and cosegmentation in iccv pages swersky frey tarlow zemel and adams probabilistic models for classification and ranking in nips pages usunier buffoni and gallinari ranking with ordered weighted pairwise classification in icml pages weston bengio and usunier wsabie scaling up to large vocabulary image annotation ijcai pages xiao hays ehinger oliva and torralba sun database scene recognition from abbey to zoo in cvpr zhou lapedriza xiao torralba and oliva learning deep features for scene recognition using places database in nips 
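To make the loss studied in the top-k SVM paper above concrete, here is a minimal sketch (function names and the NumPy usage are mine) of the top-k zero-one error err_k and the top-k hinge loss (alpha version) for a single example with linear scores s_j; for k = 1 the hinge loss reduces to the Crammer-Singer multiclass SVM loss.

import numpy as np

def topk_error(scores, y, k):
    """Top-k zero-one error err_k: 1 if the true label y is not among the
    k highest-scoring classes, 0 otherwise."""
    topk = np.argsort(scores)[::-1][:k]
    return 0.0 if y in topk else 1.0

def topk_hinge_alpha(scores, y, k):
    """Top-k hinge loss (alpha version): the average of the k largest
    components of a, where a_j = 1 + s_j - s_y for j != y and a_y = 0,
    clipped at zero. For k = 1 this is the Crammer-Singer multiclass
    hinge loss."""
    a = 1.0 + scores - scores[y]
    a[y] = 0.0
    topk = np.sort(a)[-k:]                # k largest components of a
    return max(0.0, float(topk.mean()))

# Example: the true class y = 2 is ranked third, so err_2 = 1 but err_3 = 0,
# and the top-3 hinge loss is smaller than the top-1 (Crammer-Singer) loss.
s = np.array([2.0, 1.5, 1.0, 0.2, -0.3])
print(topk_error(s, y=2, k=2), topk_error(s, y=2, k=3))              # 1.0 0.0
print(topk_hinge_alpha(s, y=2, k=1), topk_hinge_alpha(s, y=2, k=3))  # 2.0 vs about 1.23

The example illustrates the motivation in the paper: when the correct class is among the top k scores the loss relaxes, instead of penalizing the predictor as hard as the standard multiclass hinge loss does.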
policy evaluation using the scott niekum university of texas at austin philip thomas university of massachusetts amherst carnegie mellon university georgios theocharous adobe research george konidaris duke university abstract we propose the as an alternative to the currently used by the td family of algorithms the benefit of the is that it accounts for the correlation of different length returns because it is difficult to compute exactly we suggest one way of approximating the we provide empirical studies that suggest that it is superior to the and for variety of problems introduction most reinforcement learning rl algorithms learn value function that estimates the expected return obtained by following given policy from given state efficient algorithms for estimating the value function have therefore been primary focus of rl research the most widely used family of rl algorithms the td family forms an estimate of return called the that blends but biased temporal difference return estimates with but unbiased monte carlo return estimates using parameter while several different algorithms exist within the td original algorithm formulations and methods for adapting among formulation has remained unchanged since its introduction in recently konidaris et al proposed the as an alternative to the which uses more accurate model of how the variance of return increases with its length however both the and fail to account for the correlation of returns of different lengths instead treating them as statistically independent we propose the which uses statistical techniques to directly account for the correlation of returns of different lengths however unlike the and the is not simple to compute and often can only be approximated we propose method for approximating the and show that it outperforms the and on range of evaluation problems complex backups estimates of return lie at the heart of based rl algorithms an estimate of the value function estimates return from each state and the learning process aims to reduce the error between estimated and observed returns for brevity we suppress the dependencies of and on and write and temporal difference td algorithms use an estimate of the return obtained by taking single transition in the markov decision process mdp and then estimating the remaining return using the estimate of the value function rstd rt is the return estimate from state st rt is the reward for going from st to via action where rstd at and is discount parameter monte carlo algorithms for episodic tasks do not use intermediate estimates but instead use the full return rsmc for an episode transitions in length after time we assume that is finite these two types of return estimates can be considered instances of the more general notion of an return rst for here transitions are observed from the mdp and the remaining portion of return is estimated using the estimate of the value function since is state that occurs after the end of an episode we assume that always complex return is weighted average of the step returns rs where are weights and will be used to specify the weighting schemes of different approaches the question that this paper proposes an answer to is what weighting scheme will produce the best estimates of the true expected return the rsλt is the weighting scheme that is used by the entire family of td algorithms it uses parameter that determines how the weight given to return decreases as the length of the return increases if wλ wλ if which which has low variance but high bias when 
rsλt rsmc when rsλt rstd has high variance but is unbiased intermediate values of blend the but estimates from short returns with the but estimates from the longer returns the success of the is largely due to its using linear function approximation has time complexity linear in the number of features however this efficiency comes at cost the is not founded on principled statistical konidaris et al remedied this recently by showing that the is the maximum likelihood estimator of st given three assumptions specifically rsλt arg pr rst rst rst st if assumption independence rst rst are independent random variables assumption unbiased normal estimators rst is normally distributed with mean rst st for all assumption geometric variance var rst although this result provides theoretical foundation for the it is based on three typically false assumptions the returns are highly correlated only the monte carlo return is unbiased and the variance of the returns from each state do not usually increase geometrically this suggests three areas where the might be could be modified to better account for the correlation of returns the bias of the different returns and the true form of var rst the uses an approximate formula for the variance of an return in place of assumption this allows the to better account for how the variance of returns increases with their to be clear there is wealth of theoretical and empirical analyses of algorithms that use the until recently there was not derivation of the as the estimator of st that optimizes some objective maximizes log likelihood or minimizes expected squared error length while simultaneously removing the need for the parameter the is given by the weighting scheme pn wγ pl the we propose new complex return the that improves upon the and returns by account ing for the correlations of the returns to emphasize this problem notice that rst and rst will be almost identical perfectly correlated for many mdps particularly when is small this means that assumption is particularly egregious and suggests that new complex return might improve upon the and by properly accounting for the correlation of returns we formulate the problem of how best to combine different length returns to estimate the true expected return as linear regression problem this reformulation allows us to leverage the wellunderstood properties of linear regression algorithms consider regression problem with points xi yi where the value of yi depends on the value of xi the goal is to predict yi given xi we set xi and yi rst we can then construct the design matrix vector in this case and the response vector rst rst rst we seek regression coefficient such that this will be our estimate of the true expected return generalized least squares gls is method for selecting when the yi are not necessarily independent and may have different variances specifically if we use linear model with possibly correlated noise to model the data xβ where is unknown is random vector and var then the gls estimator is the best linear unbiased estimator blue for linear unbiased estimator with the lowest possible variance in our setting the assumptions about the true model that produced the data become that rst rst rst st st st where the returns are all unbiased estimates of the true expected return and var since in our case var cov rst st rst st cov rst rst where var denotes the element of var in the ith row and jth column so using only assumption gls solved for gives us the complex return rst rst rst pl st which can be written in 
the form of with weights pl wω where is an matrix with cov rst rst notice that the is generalization of the and returns the can be obtained by reintroducing the false assumption that the returns are independent and that their variance grows geometrically by making diagonal matrix with ωn similarly the can be pn obtained by making diagonal matrix with ωn notice that rsωt is blue of st if assumption holds since assumption does not hold the is not an unbiased estimator of still we expect it to outperform the and because it accounts for the correlation of returns and they do not however in some cases it may perform worse because it is still based on the false assumption that all of the returns are unbiased estimators of st furthermore given assumption there may be biased estimators of st that have lower expected mean squared error than blue which must be unbiased approximating the in practice the covariance matrix is unknown and must be approximated from data this approach known as feasible generalized least squares fgls can perform worse than ordinary least squares given insufficient data to accurately estimate we must therefore accurately approximate from small amounts of data to study the accuracy of covariance matrix estimates we estimated using large number of trajectories for four different domains gridworld variant of the canonical mountain car domain digital marketing problem and continuous control problem all of which are described in more detail in subsequent experiments the covariance matrix estimates are depicted in figures and we do not specify rows and columns in the figures because all covariance matrices and estimates thereof are symmetric because they were computed from very large number of trajectories we will treat them as ground truth we must estimate the when only few trajectories are available figures and show direct empirical estimates of the covariance matrices using only few trajectories these empirical approximations are poor due to the very limited amount of data except for the digital marketing domain where few trajectories means the solid black entries in figures and show the weights wω on different length returns when using different estimates of the noise in the direct empirical estimate of the covariance matrix using only few trajectories leads to poor estimates of the return weights when approximating from small number of trajectories we must be careful to avoid this overfitting of the available data one way to do this is to assume compact parametric model for below we describe parametric model of that has only four parameters regardless of which determines the size of we use this parametric model in our experiments as proof of we show that the using even this simple estimate of can produce improved results over the other existing complex returns we do not claim that this scheme for estimating is particularly principled or noteworthy estimating entries of notice in figures and that for cov rsi rsj cov rsi rsi var rsi this structure would mean that we can fill in given its diagonal values leaving only parameters we now explain why this relationship is reasonable in general and not just an artifact of our domains we can write each entry in as recurrence relation cov rs rs rs rs rs rs cov rs when the term is the temporal difference error steps in the future the proposed assumption that cov rsi rsj var rsi is equivalent to assuming that the covariance of this temporal difference error and the return is negligible cov rst the approximate independence of these two 
terms is reasonable in general due to the Markov property, which ensures that at least the corresponding conditional covariance (conditioned on the intermediate state) is zero. Because this relationship is not exact, the entries of Ω tend to grow as they get farther from the diagonal. However, when some trajectories are padded with absorbing states the relationship is quite accurate, since the temporal difference errors at the absorbing state are all zero. This results in a significant difference between the covariances among the shorter returns and the covariances involving the Monte Carlo return.
[Figure: Gridworld results. Panels show the empirical and approximate Ω estimated from millions of trajectories and from five trajectories, their diagonals, the approximate and empirical weights, and the mean squared error for each return.]
[Figure: Mountain car results. The same panels, with the small-sample estimates computed from two trajectories.]
[Figure: Digital marketing results. The same panels.]
[Figure: Functional electrical stimulation results. The same panels.]
Rather than try to model this drop, which can influence the weights significantly, we reintroduce the assumption that the Monte Carlo return is independent of the other returns, making the off-diagonal elements of the last row and column of Ω zero. Estimating diagonal entries of Ω. The remaining question is how best to approximate the diagonal of Ω from a very small number of trajectories. Consider the solid and dotted black curves in the figures above, which depict the diagonals of Ω when estimated from either a large or a small number of trajectories. When using only a few trajectories, the diagonal includes fluctuations that can have significant impacts on the resulting weights. However, when using many trajectories (which we treat as giving ground truth), the diagonal tends to be relatively smooth and monotonically increasing until it plateaus (ignoring the final entry). This suggests using a smooth parametric form to approximate the diagonal, which we do as follows. Let v_i denote the sample variance of the i-step return, and let v_max be the largest sample variance v_i. We parameterize the diagonal using four parameters: one sets the initial variance, v_L is the variance of the Monte Carlo return, one parameter enforces a ceiling on the variance of the n-step returns, and one captures the growth rate of the 
This reduces the problem of estimating an L × L matrix to estimating four numbers from return data. Consider the panels depicting the approximate Ω̂ as computed from many trajectories: the differences between these estimates and the ground truth show that this parameterization is not perfect, as we cannot represent the true Ω exactly. However, the estimate is reasonable, and the resulting weights (solid red) are visually similar to the ground-truth weights (solid black). More importantly, we can now get accurate estimates of Ω from very few trajectories: the approximate Ω̂ computed from only a few trajectories is similar to the one obtained using a large number of trajectories, and the resulting weights (unfilled red bars) are similar to those obtained using many more trajectories (the filled red bars).

Pseudocode for approximating the Ω-return is provided in Algorithm 1. Unlike the λ-return, which can be computed from a single trajectory, the Ω-return requires a set of trajectories in order to estimate Ω. The pseudocode assumes that every trajectory is of length L, which can be achieved by padding shorter trajectories with absorbing states. We also include simple constraints on the fitted parameters.

Algorithm 1 (computing the Ω-return). Require: trajectories beginning at s and of length L.
1. Compute the n-step returns R_s^(i) for every i and for each trajectory.
2. Compute the sample variances v_i = Var(R_s^(i)) for every i.
3. Search for the growth-rate and ceiling parameters that minimize the mean squared error between the parametric diagonal and the sample variances v_i.
4. Fill the diagonal of Ω̂ with the parametric model using the optimized parameters; fill every other entry with the diagonal value of the smaller index, i.e., Ω̂_ij = Ω̂_{min(i,j),min(i,j)}; if i = L or j = L and i ≠ j, set Ω̂_ij = 0 instead.
5. Compute the weights for the returns from Ω̂.
6. Compute the Ω-return for each trajectory as the corresponding weighted combination of its n-step returns.

Experiments. Approximations of the Ω-return could in principle replace the λ-return in the whole family of TD(λ)-style algorithms. However, using the Ω-return for TD raises several interesting questions that are beyond the scope of this initial work: is there a practical way to estimate Ω, since a different Ω is needed for every state? How can the Ω-return be used with function approximation, where most states will never be revisited? We therefore focus on the specific problem of off-policy policy evaluation: estimating the performance of a policy using trajectories generated by a possibly different policy. This problem is of interest for applications that require the evaluation of a proposed policy using historical data. Due to space constraints we relegate the details of our experiments to the appendix in the supplemental documents; the results are summarized in the mean-squared-error panels of the figures above, which show the MSE of value estimates when using the various returns. Notice that for all domains, using the Ω-return (the "emp" and "app" labels) results in lower MSE than the λ-return and the γ-return with any setting of λ.

Conclusions. Recent work has begun to explore the statistical basis of complex estimates of return and how we might reformulate them to be more statistically efficient. We have proposed a return estimator that improves upon the λ-return and the γ-return by accounting for the covariance of the return estimates. Our results show that understanding and exploiting the fact that, in control problems, unlike in standard supervised learning, samples are typically neither independent nor identically distributed can substantially improve data efficiency in an algorithm of significant practical importance. Many largely positive theoretical properties of the λ-return and TD(λ) have been discovered over the past few decades; this line of research into other complex returns is still in its infancy, and so there are many open questions. For example, can the Ω-return be improved upon by removing the unbiasedness assumption, or by keeping it but using a biased
estimator not blue is there method for approximating the that allows for value function approximation with the same time complexity as td or which better leverages our knowledge that the environment is markovian would td using the be convergent in the same settings as td while we hope to answer these questions in future work it is also our hope that this work will inspire other researchers to revisit the problem of constructing statistically principled complex return to compute the mse we used large number of monte carlo rollouts to estimate the true value of each policy references sutton learning to predict by the methods of temporal differences machine learning bradtke and barto linear algorithms for temporal difference learning machine learning march downey and sanner temporal difference bayesian model averaging bayesian perspective on adapting lambda in proceedings of the international conference on machine learning pages konidaris niekum and thomas tdγ complex backups in temporal difference learning in advances in neural information processing systems pages sutton and barto reinforcement learning an introduction mit press cambridge ma kariya and kurata generalized least squares wiley precup sutton and singh eligibility traces for policy evaluation in proceedings of the international conference on machine learning pages mahmood hasselt and sutton weighted importance sampling for learning with linear function approximation in advances in neural information processing systems tetreault and litman comparing the utility of state features in spoken dialogue using reinforcement learning in proceedings of the human language american association for computational linguistics konidaris osentoski and thomas value function approximation in reinforcement learning using the fourier basis in proceedings of the conference on artificial intelligence pages theocharous and hallak lifetime value marketing using reinforcement learning in the multidisciplinary conference on reinforcement learning and decision making thomas theocharous and ghavamzadeh high confidence evaluation in proceedings of the conference on artificial intelligence blana kirsch and chadwick combined feedforward and feedback control of redundant nonlinear dynamic musculoskeletal system medical and biological engineering and computing thomas branicky van den bogert and jagodnik application of the architecture to functional electrical stimulation control of human arm in proceedings of the twentyfirst innovative applications of artificial intelligence pages pilarski dawson degris fahimi carey and sutton online human training of myoelectric prosthesis controller via reinforcement learning in proceedings of the ieee international conference on rehabilitation robotics pages jagodnik and van den bogert proportional derivative fes controller for planar arm movement in annual conference international fes society philadelphia pa hansen and ostermeier completely derandomized in evolution strategies evolutionary computation 
orthogonal nmf through subspace exploration megasthenis asteris the university of texas at austin megas dimitris papailiopoulos university of california berkeley dimitrisp alexandros dimakis the university of texas at austin dimakis abstract orthogonal nonnegative matrix factorization onmf aims to approximate nonnegative matrix as the product of two nonnegative factors one of which has orthonormal columns it yields potentially useful data representations as superposition of disjoint parts while it has been shown to work well for clustering tasks where traditional methods underperform existing algorithms rely mostly on heuristics which despite their good empirical performance lack provable performance guarantees we present new onmf algorithm with provable approximation guarantees for any constant dimension we obtain an additive eptas without any assumptions on the input our algorithm relies on novel approximation to the related nonnegative principal component analysis nnpca problem given an arbitrary data matrix nnpca seeks nonnegative components that jointly capture most of the variance our nnpca algorithm is of independent interest and generalizes previous work that could only obtain guarantees for single component we evaluate our algorithms on several real and synthetic datasets and show that their performance matches or outperforms the state of the art introduction orthogonal nmf the success of nonnegative matrix factorization nmf in range of disciplines spanning data mining chemometrics signal processing and more has driven an extensive practical and theoretical study its power lies in its potential to generate meaningful decompositions of data into combinations of few nonnegative parts orthogonal nmf onmf is variant of nmf with an additional orthogonality constraint given real nonnegative matrix and target dimension typically much smaller than and we seek to approximate by the product of an nonnegative matrix with orthogonal orthonormal columns and an nonnegative matrix in the form of an optimization onmf min km since is nonnegative its columns are orthogonal if and only if they have disjoint supports in turn each row of is approximated by scaled version of single transposed column of despite the admittedly limited representational power compared to nmf onmf yields sparser partbased representations that are potentially easier to interpret while it naturally lends itself to certain applications in clustering setting for example serves as cluster membership matrix and the columns of correspond to cluster centroids empirical evidence shows that onmf performs remarkably well in certain clustering tasks such as document classification in the analysis of textual data where is words by documents matrix the orthogonal columns of can be interpreted as topics defined by disjoint subsets of words in the case of an image dataset with each column of corresponding to an image evaluated on multiple pixels each of the orthogonal base vectors highlights disjoint segment of the image area nonnegative pca for any given factor with orthonormal columns the second onmf factor is readily determined this follows from the fact that is by assumption nonnegative based on the above it can be shown that the onmf problem is equivalent to max nnpca where wk ik for arbitrary not necessarily matrices the maximization coincides with the nonnegative principal component analysis nnpca problem similarly to vanilla pca nnpca seeks orthogonal components that jointly capture most of the variance of the centered data in the 
nonzero entries of the extracted components however must be positive which renders the problem even in the case of single component our contributions we present novel algorithm for nnpca our algorithm approximates the solution to for any real input matrix and is accompanied with global approximation guarantees using the above as building block we develop an algorithm to approximately solve the onmf problem on any nonnegative matrix our algorithm outputs solution that strictly satisfies both the nonnegativity and the orthogonality constraints our main results are as follows theorem nnpca for any matrix desired number of components and accuracy parameter our nnpca algorithm computes wk such that where is the th singular value of in time tsvd here tsvd denotes the time required to compute approximation of the input using the truncated singular value decomposition svd our nnpca algorithm operates on the matrix the parameter controls natural higher values of lead to tighter guarantees but impact the running time of our algorithm finally note that despite the exponential dependence in and the complexity scales polynomially in the ambient dimension of the input if the input matrix is nonnegative as in any instance of the onmf problem we can compute an approximate orthogonal nonnegative factorization in two steps first obtain an orthogonal factor by approximately solving the nnpca problem on and subsequently set theorem onmf for any nonnegative matrix target dimension and desired accuracy our onmf algorithm computes an onmf pair such that in time tsvd kǫ km for any constant dimension theorem implies an additive eptas for the relative onmf approximation error this is to the best our knowledge the first general onmf approximation guarantee since we impose no assumptions on beyond nonnegativity we evaluate our nnpca and onmf algorithms on synthetic and real datasets as we discuss in section for several cases we show improvements compared to the previous state of the art related work onmf as variant of nmf first appeared implicitly in the formulation in was introduced in several algorithms in subsequent line of work approximately solve variants of that optimization problem most rely on modifying approaches for nmf to accommodate the orthogonality constraint either exploiting the additional structural properties in the objective introducing penalization term or updating the current estimate in suitable directions they typically reduce to multiplicative update rule which attains orthogonality only in limit sense in the authors suggest two alternative approaches an em algorithm motivated by connections to spherical and an augmented lagrangian formulation that explicitly enforces orthogonality but only achieves nonnegativity in the limit despite their good performance in practice existing methods only guarantee local convergence nmf onmf significant body of work has focused on separable nmf variant of nmf partially related to onmf nmf seeks to compose into the product of two nonnegative matrices and where contains figure onmf and separable nmf upon aptation of the identity matrix intuitively the propriate permutation of the rows of in the geometric picture of nmf should be quite first case each row of is approximated by different from that of onmf in the former the single row of while in the second by rows of are the extreme rays of convex cone nonnegative combination of all rows of enclosing all rows of while in the latter they should be scattered in the interior of that cone so that each row of has one 
representative in small angular distance algebraically onmf factors approximately satisfy the structural requirement of nmf but the converse is not true nmf solution is not valid onmf solution fig in the nnpca front nonnegativity as constraint on pca first appeared in which proposed scheme on penalized version of to compute set of nonnegative components in the authors developed framework stemming from em on generative model of pca to compute nonnegative and optionally sparse component in the authors proposed an algorithm based on sampling points from subspace of the data covariance and projecting them on the nonnegative orthant and focus on the problem multiple components can be computed sequentially employing heuristic deflation step our main theoretical result is generalization of the analysis of for multiple components finally note that despite the connection between the two problems existing algorithms for onmf are not suitable for nnpca as they only operate on nonnegative matrices algorithms and guarantees overview we first develop an algorithm to approximately solve the nnpca problem on any arbitrary not necessarily matrix the core idea is to solve the nnpca problem not directly on but approximation instead our main technical contribution is procedure that approximates the solution to the constrained maximization on matrix within multiplicative factor arbitrarily close to in time exponential in but polynomial in the dimensions of the input our low rank nnpca algorithm relies on generating large number of candidate solutions one of which provably achieves objective value close to optimal the nonnegative components wk returned by our low rank nnpca algorithm on the sketch are used as surrogate for the desired components of the original input intuitively the performance of the extracted nonnegative components depends on how well is approximated by the low rank sketch higher rank approximation leads to better results however the complexity of our low rank solver depends exponentially in the rank of its input natural arises between the quality of the extracted components and the running time of our nnpca algorithm using our nnpca algorithm as building block we propose novel algorithm for the onmf problem in an onmf instance we are given an nonnegative matrix and target dimension and seek to approximate with product of two nonnegative matrices where additionally has orthonormal columns computing such factorization is equivalent to solving the nnpca problem on the nonnegative matrix see appendix for formal argument once nonnegative orthogonal factor is obtained the second onmf factor is readily determined minimizes the frobenius approximation error in for given under an appropriate configuration of the accuracy parameters for any nonnegative input and constant target dimension our algorithm yields an additive eptas for the relative approximation error without any additional assumptions on the input data main results low rank nnpca we develop an algorithm to approximately solve the nnpca problem on an real matrix arg max km wk algorithm lowranknnpca input real matrix output wk see lemma candidate solutions svd trunc svd do for each uσc localoptw alg end for arg km the procedure which lies in the core of our subsequent developments is encoded in alg we describe it in detail in section the key observation is that irrespectively of the dimensions of the input the maximization in can be reduced to unknowns the algorithm generates large number of of points the tion of tuples is denoted by the kth 
cartesian power of an of the unit sphere using these points we effectively sample the of the input each tuple yields feasible solution wk through computationally efficient subroutine alg the best among those candidate solutions is provably close to the optimal with respect to the objective in the approximation guarantees are formally established in the following lemma lemma for any real matrix with rank desired number of components and accuracy parameter algorithm outputs wk such that km km where is the optimal solution defined in in time tsvd proof see appendix nonnegative pca given an arbitrary real matrix we can generate sketch and solve the low rank nnpca problem on using algorithm the output wk of the low rank problem can be used as surrogate for the desired components of the original input for simplicity here we consider the case where is the approximation of obtained by the truncated svd intuitively the performance of the extracted components on the original data matrix will depend on how well the latter is approximated by and in turn by the spectral decay of the input data for example if exhibits sharp spectral decay which is frequently the case in real data moderate value of suffices to obtain good approximation this leads to our first main theorem which formally establishes the guarantees of our nnpca algorithm theorem for any real matrix let be its best approximation algorithm with input and parameters and outputs wk such that where arg in time tsvd proof the proof follows from lemma it is formally provided in appendix theorem establishes between the computational complexity of the proposed nnpca approach and the tightness of the approximation guarantees higher values of imply smaller km and in turn tighter bound assuming that the singular values of decay but have an exponential impact on the running time despite the exponential dependence on and our approach is polynomial in the dimensions of the input dominated by the truncated svd in practice algorithm can be terminated early returning the best computed result at the time of termination sacrificing the theoretical approximation guarantees in section we empirically evaluate our algorithm on real datasets and demonstrate that even for small values of our nnpca algorithms significantly outperforms existing approaches orthogonal nmf the nnpca algorithm straightforwardly yields an algorithm for the onmf problem in an onmf instance the input matrix is by assumption nonnegative given any orthogonal nonnegative factor the optimal choice for the second factor is hence it suffices to determine which can be obtained by solving the nnpca problem on the proposed onmf algorithm is outlined in alg given nonnegative matrix we first obtain approximation via the truncated svd where is an accuracy parameter using alg on we compute an orthogonal nonnegative factor wk that approximately maximizes within desired accuracy the second onmf factor is readily determined as described earlier algorithm onmfs input real svd lowranknnpca alg output the accuracy parameter once again controls between the quality of the onmf factors and the complexity of the algorithm we note however that for any target dimension and desired accuracy parameter setting suffices to achieve an additive error on the relative approximation error of the onmf problem more formally theorem for any real nonnegative matrix target dimension and desired accuracy algorithm with parameter outputs an onmf pair such that km in time tsvd kǫ proof see appendix theorem implies an additive for 
the relative approximation error in the onmf problem for any constant target dimension algorithm runs in time polynomial in the dimensions of the input finally note that it did not require any assumption on beyond nonnegativity the low rank nnpca algorithm in this section we alg which plays central role in our developments as it is the key piece of our nnpca and in turn our onmf algorithm alg approximately solves the nnpca problem on matrix it operates by producing large but tractable number of candidate solutions wk and returns the one that maximizes the objective value in in the sequel we provide brief description of the ideas behind the algorithm we are interested in approximately solving the low rank nnpca problem let uσv denote the truncated svd of for any km kσu kσu wj max wj uσcj cj where denotes the sphere let denote the variable formed by stacking the vectors cj the key observation is that for given we can efficiently compute wk that maximizes the side of the procedure for that task is outlined in alg hence the nnpca problem is reduced to determining the optimal value of the variable but first let us we provide brief description of alg additive eptas efficient polynomial time approximation scheme refers to an algorithm that can approximate the solution of an optimization problem within an arbitrarily small additive error and has complexity that scales polynomially in the input size but possibly exponentially in eptas is more efficient than ptas because it enforces polynomial dependency on for any running time where is polynomial for example running time of is considered ptas but not eptas for fixed matrix algorithm computes arg max aj algorithm localoptw input real matrix arg pk output wj aj cw for each do diag ij for do arg maxj if then end if end for for do wj ij end for cw end for arg pk wj aj where uσc the challenge is to determine if an oracle the support of the optimal solution revealed the optimal supports ij of its columns then the exact value of the nonzero entries would be determined by the inequality and the contribution of the jth summand in would be equal to due to the nonnegativity constrains in wk the optimal support ij of the jth column must contain indices corresponding to only nonnegative or nonpositive entries of aj but not combination of both algorithm considers all possible sign combinations for the support sets implicitly by solving on all matrices diag hence we may assume without loss of generality that all support sets correspond to nonnegative entries of moreover if index is assigned to ij then the contribution of the entire ith row of to the objective is equal to based on the above algorithm constructs the collection of the support sets by assigning index to ij if and only if aij is nonnegative and the largest among the entries of the ith row of the algorithm runs in and guarantees that the output is the optimal solution to more formal analysis of the alg is provided in section thus far we have seen that any given value of can be associated with feasible solution wk via the maximization and alg if we could efficiently consider all possible values in the continuous domain of we would be able to recover the pair that maximizes and in turn the optimal solution of however that is not possible instead we consider fine discretization of the domain of and settle for an approximate solution in particular let nǫ denote finite of the sphere for any point in the net contains point within distance from the former see appendix for the construction of such net further 
let nǫ denote the kth cartesian power of the previous net the latter is collection of matrices alg operates on this collection for each it identifies candidate solution wk via the maximization using algorithm by the properties of the it can be shown that at least one of the computed candidate solutions must attain an objective value close to the optimal of the guarantees of alg are formally established in lemma detailed analysis of the algorithm is provided in the corresponding proof in appendix this completes the description of our algorithmic developments experimental evaluation nnpca we compare our nnpca algorithm against three existing approaches nspca em and nnspan on real datasets nspca computes multiple nonnegative but not necessarily orthogonal components parameter penalizes the overlap among their supports we set high penalty to promote orthogonality em and nnspan compute only single nonnegative component multiple components are computed consecutively interleaving an appropriate deflation step to ensure orthogonality the deflation step effectively zeroes out the variables used in previously extracted components finally note that both the em and nspca algorithms are randomly initialized all depicted values are the best results over multiple random restarts for our algorithm we use sketch of rank of the centered input data further we apply an early termination criterion execution is terminated if no improvement is observed in number of consecutive iterations samples this can only hurt the performance of our algorithm when used as subroutine in alg alg can be simplified into an procedure lines nnspca em nnspan onmfs cumulative expl variance onmfs nnspan em nnspca cumulative expl variance target components components figure cumul variance captured by nonnegative components cbcl dataset in fig we set and plot the cumul variance versus the number of components em and nnspan extract components greedily first components achieve high value but subsequent ones contribute less to the objective our algorithm jointly optimizes the components achieving improvement over the second best method fig depicts the cumul variance for various values of we note the percentage improvement of our algorithm over the second best method cbcl dataset the cbcl dataset contains pixel gray scale face images it has been used in the evaluation of all three methods we extract orthogonal nonnegative components using all methods and compare the total explained variance the objective in we note that input data has been centered and it is hence not nonnegative fig depicts the cumulative explained variance versus the number of components for em and nnspan extract components greedily with deflation step the first component achieves high value but subsequent ones contribute less to the total variance on the contrary our algorithm jointly optimizes the components achieving an approximately increase in the total variance compared to the second best method we repeat the experiment for fig depicts the total variance captured by each method for each value of our algorithm significantly outperforms the existing approaches additional datasets we solve the nnpca problem on various datasets obtained from we arbitrarily set the target number of components to and configure our algorithm to use sketch of the input table lists the total variance captured by the extracted components for each method our algorithm consistently outperforms the other approaches onmf we compare our algorithm with several onmf algorithms the algorithm of for 
iterations, and (ii), (iii) two more recent ONMF algorithms, also run for a fixed number of iterations. We also compare to clustering methods, namely vanilla k-means and spherical k-means, since such algorithms also yield an approximate ONMF.

[Table: total variance captured by the nonnegative components on various datasets. For each dataset we list the number of variables and the variance captured by NSPCA, EM, NNSPAN, and our algorithm (labeled ONMFS), which operates on a low-rank sketch in all cases; higher values are better. Our algorithm consistently achieves the best results; we note the percentage improvement over the second best method.]

Synthetic data. We generate a synthetic dataset as follows. We select five base vectors c_j randomly and independently from the unit hypercube, and then generate data points x_i = a_i · c_j + n_i for some j, where a_i is a random scale, n_i is additive noise, and a parameter controls the noise variance; any negative entries of x_i are set to zero. We vary the noise parameter over a range of values; for each value we compute an approximate ONMF on a number of randomly generated datasets and measure the relative Frobenius approximation error. For the methods that involve random initialization we average over multiple runs per Monte Carlo trial. Our algorithm is configured to operate on a low-rank sketch of the input. The figure below depicts the relative error achieved by each method, averaged over the random trials, versus the noise variance. Our algorithm (labeled ONMFS) achieves competitive or higher accuracy for most values in the range of noise power.

[Figure: relative Frobenius approximation error on synthetic data. Data points are generated by randomly scaling and adding noise to one of five base points that have been randomly selected from the unit hypercube; we run the ONMF methods with the same target dimension; our algorithm is labeled ONMFS.]

Real datasets. We apply the ONMF algorithms on various nonnegative datasets obtained from public repositories. We arbitrarily set the target number of components; the table below lists the relative Frobenius approximation error achieved by each algorithm. We note that on the text (bag-of-words) datasets we run the algorithms on the uncentered matrix. Our algorithm performs competitively compared to the other methods.

[Table: ONMF approximation error on nonnegative datasets, including several bag-of-words corpora. For each dataset we list the size (number of variables) and the relative Frobenius approximation error achieved by each method; lower values are better. We arbitrarily set the target dimension; dashes denote an invalid entry; for our method we note in parentheses the approximation rank used.]

Conclusions. We presented a novel algorithm for approximately solving the ONMF problem on a nonnegative matrix. Our algorithm relies on a new method for solving the NNPCA problem; the latter jointly optimizes multiple orthogonal nonnegative components and provably achieves an objective value close to optimal. Our ONMF algorithm is the first to be equipped with theoretical approximation guarantees: for constant target dimension, it yields an additive EPTAS for the relative approximation error. Empirical evaluation on synthetic and real datasets demonstrates that our algorithms outperform or match existing approaches on both problems.

Acknowledgments. DP is generously supported by NSF awards and a MURI AFOSR grant. This research has been supported by NSF grant CCF and an ARO YIP award.

References
Daniel Lee and Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems.
Gershon Buchsbaum and Orin Bloch. Color categories revealed by non-negative matrix factorization of Munsell color
spectra vision research farial shahnaz michael berry paul pauca and robert plemmons document clustering using nonnegative matrix factorization information processing management lin projected gradient methods for nonnegative matrix factorization neural computation andrzej cichocki rafal zdunek anh huy phan and amari nonnegative matrix and tensor factorizations applications to exploratory data analysis and blind source separation john wiley sons victor bittorf benjamin recht christopher re and joel tropp factoring nonnegative matrices with linear programs advances in neural information processing systems nicolas gillis and stephen vavasis fast and robust recursive algorithms for separable nonnegative matrix factorization arxiv preprint huang nd sidiropoulos and swamiy nmf revisited new uniqueness results and algorithms in acoustics speech and signal processing icassp ieee international conference on pages ieee chris ding tao li wei peng and haesun park orthogonal nonnegative matrix for clustering in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages acm tao li and chris ding the relationships among various nonnegative matrix factorization methods for clustering in data mining icdm sixth international conference on pages ieee filippo pompili nicolas gillis absil and glineur two algorithms for orthogonal nonnegative matrix factorization with application to clustering arxiv preprint seungjin choi algorithms for orthogonal nonnegative matrix factorization in neural networks ijcnn ieee world congress on computational intelligence ieee international joint conference on pages ieee zhirong yang and erkki oja linear and nonlinear projective nonnegative matrix factorization neural networks ieee transactions on da kuang haesun park and chris hq ding symmetric nonnegative matrix factorization for graph clustering in sdm volume pages siam zhirong yang tele hao onur dikmen xi chen and erkki oja clustering by nonnegative matrix factorization using graph random walk in advances in neural information processing systems pages ron zass and amnon shashua nonnegative sparse pca in advances in neural information processing systems pages cambridge ma mit press megasthenis asteris dimitris papailiopoulos and alexandros dimakis nonnegative sparse pca with provable guarantees in proceedings of the international conference on machine learning pages zhijian yuan and erkki oja projective nonnegative matrix factorization for image compression and feature extraction in image analysis pages springer hualiang li adal wei wang darren emge and andrzej cichocki matrix factorization with orthogonality constraints and its application to raman spectroscopy the journal of vlsi signal processing systems for signal image and video technology bin cao dou shen sun xuanhui wang qiang yang and zheng chen detect and track latent factors with online nonnegative matrix factorization in ijcai volume pages xin li william kw cheung jiming liu and zhili wu novel orthogonal belief compression for pomdps in proceedings of the international conference on machine learning pages acm gang chen fei wang and changshui zhang collaborative filtering using orthogonal nonnegative matrix information processing management na gillis and vavasis fast and robust recursive algorithms for separable nonnegative matrix factorization ieee transactions on pattern analysis and machine intelligence sanjeev arora rong ge yoni halpern david mimno ankur moitra david sontag yichen wu and michael zhu practical algorithm for 
topic modeling with provable guarantees arxiv preprint sanjeev arora rong ge ravindran kannan and ankur moitra computing nonnegative matrix in proceedings of the symposium on theory of computing pages abhishek kumar vikas sindhwani and prabhanjan kambadur fast conical hull algorithms for matrix factorization in proceedings of the international conference on machine learning pages christian sigg and joachim buhmann for sparse and pca in proceedings of the international conference on machine learning icml pages new york ny usa acm liming cai michael fellows david juedes and frances rosamond on efficient approximation schemes for problems on planar structures journal of computer and system sciences marco cesati and luca trevisan on the efficiency of polynomial time approximation schemes information processing letters sung learning and example selection for object and pattern recognition phd thesis phd thesis mit artificial intelligence laboratory and center for biological and computational learning cambridge ma lichman uci machine learning repository filippo pompili nicolas gillis absil and glineur an orthogonal nonnegative matrix factorization algorithm with application to clustering in esann bache and lichman uci machine learning repository 
stochastic online greedy learning with feedbacks tian lin tsinghua university beijing china jian li tsinghua university beijing china lapordge wei chen microsoft research beijing china weic abstract the greedy algorithm is extensively studied in the field of combinatorial optimization for decades in this paper we address the online learning problem when the input to the greedy algorithm is stochastic with unknown parameters that have to be learned over time we first propose the greedy regret and greedy regret as learning metrics comparing with the performance of offline greedy algorithm we then propose two online greedy learning algorithms with feedbacks which use bandit and pure exploration bandit policies at each level of greedy learning one for each of the regret metrics respectively both algorithms achieve log regret bound being the time horizon for general class of combinatorial structures and reward functions that allow greedy solutions we further show that the bound is tight in and other problem instance parameters introduction the greedy algorithm is simple and and can be applied to solve wide range of complex optimization problems either with exact solutions minimum spanning tree or approximate solutions maximum coverage or influence maximization moreover for many practical problems the greedy algorithm often serves as the first heuristic of choice and performs well in practice even when it does not provide theoretical guarantee the classical greedy algorithm assumes that certain reward function is given and it constructs the solution iteratively in each phase it searches for local optimal element to maximize the marginal gain of reward and add it to the solution we refer to this case as the offline greedy algorithm with given reward function and the corresponding problem the offline problems the process of the greedy algorithm naturally forms decision sequence to illustrate the decision flow in finding the solution which is named as the greedy sequence we characterize the decision class as an accessible set system general combinatorial structure encompassing many interesting problems in many real applications however the reward function is stochastic and is not known in advance and the reward is only instantiated based on the unknown distribution after the greedy sequence is selected for example in the influence maximization problem social influence are propagated in social network from the selected seed nodes following stochastic model with unknown parameters and one wants to find the optimal seed set of size that generates the largest influence spread which is the expected number of nodes influenced in cascade in this case the reward of seed selection is only instantiated after the seed selection and is only one of the random outcomes therefore when the stochastic reward function is unknown we aim at maximizing the expected reward overtime while gradually learning the key parameters of the expected reward functions this falls in the domain of online learning and we refer the online algorithm as the strategy of the player who makes sequential decisions interacts with the environment obtains feedbacks and accumulates her reward for online greedy algorithms in particular at each time step the player selects and plays candidate decision sequence while the environment instantiates the reward function and then the player collects the values of instantiated function at every phase of the decision sequence as the feedbacks thus the name of feedbacks and takes the value of the final 
phase as the reward cumulated in this step the typical objective for an online algorithm is to make sequential decisions against the optimal solution in the offline problem where the reward function is known priori for online greedy algorithms instead we compare it with the solution of the offline greedy algorithm and minimize their gap of the cumulative reward over time termed as the greedy regret furthermore in some problems such as influence maximization the reward function is estimated with error even for the offline problem and thus the greedily selected element at each phase may contain some error we call such greedy sequence as greedy sequence to accommodate these cases we also define the metric of greedy regret which compares the online solution against the minimum offline solution from all greedy sequences in this paper we propose two online greedy algorithms targeted at two regret metrics respectively the first algorithm uses the stochastic bandit mab in particular the ucb policy as the building block to minimize the greedy regret we apply the ucb policy to every phase by associating the confidence bound to each arm and then choose the arm having the highest upper confidence bound greedily in the process of decision for the second scenario where we allow tolerating for each phase we propose algorithm to minimize the greedy regret for every phase in the greedy process applies the lucb policy which depends on the upper and lower confidence bound to eliminate arms it first explores each arm until the lower bound of one arm is higher than the upper bound of any other arm within an then the stage of current phase is switched to exploit that best arm and continues to the next phase both and achieve the problemdependent log bound in terms of the respective regret metrics where the coefficients in front of depends on direct elements along the greedy sequence its decision frontier corresponding to the instance of learning problem the two algorithms have complementary advantages when we really target at greedy regret setting to for has slightly better regret guarantee and does not need an artificial switch between exploration and exploitation when we are satisfied with greedy regret works but can not be adapted for this case and may suffer larger regret we also show problem instance in this paper where the upper bound is tight to the lower bound in and other problem parameters we further show our algorithms can be easily extended to the knapsack problem and applied to the stochastic online maximization for consistent functions and submodular functions in the supplementary material to summarize our contributions include the following to the best of our knowledge we are the first to propose the framework using the greedy regret and greedy regret to characterize the online performance of the stochastic greedy algorithm for different scenarios and it works for wide class of accessible set systems and general reward functions we propose algorithms ogucb and that achieve the log regret bound and we also show that the upper bound matches with the lower bound up to constant factor due to the space constraint the analysis of algorithms applications and empirical evaluation of the lower bound are moved to the supplementary material related work the bandit mab problem for both stochastic and adversarial settings has been widely studied for decades most work focus on minimizing the cumulative regret over time or identifying the optimal solution in terms of pure exploration bandits among those work 
there is one line of research that generalizes mab to combinatorial learning problems our paper belongs to this line considering stochastic learning with feedbacks while we focus on the greedy algorithm the structure and its performance measure which have not been addressed the classical greedy algorithms in the offline setting are studied in many applications and there is line of work focusing on characterizing the greedy structure for solutions we adopt their characterizations of accessible set systems to the online setting of the greedy learning there is also branch of work using the greedy algorithm to solve online learning problem while they require the knowledge of the exact form of reward function restricting to special functions such as linear and submodular rewards our work does not assume the exact form and it covers much larger class of combinatorial structures and reward functions preliminaries online combinatorial learning problem can be formulated as repeated game between the environment and the player under stochastic bandit framework let en be finite ground set of size and be collection of subsets of we consider the accessible set system satisfying the following two axioms if and then there exists some in we define any set as feasible set if for any its accessible set is defined as we say feasible set is maximal if define the largest length of any feasible set as and the largest width of any feasible set as we say that such an accessible set system is the decision class of the player in the class of combinatorial learning problems the size of is usually very large exponential in and beginning with an empty set the accessible set system ensures that any feasible set can be acquired by adding elements one by one in some order cf lemma in the supplementary material for more details which naturally forms the decision process of the player for convenience we say the player can choose decision sequence defined as an ordered feasible sets sk satisfying that sk and for any si si where si besides define decision sequence as maximal if and only if sk is maximal let be an arbitrary set the environment draws samples from as at each time by following predetermined but unknown distribution consider reward function that is bounded and it is in the first parameter while the exact form of function is agnostic to the player we use shorthand ft ωt to denote the reward for any given at time and denote the expected reward as where the expectation eωt is taken from the randomness of the environment at time for ease of presentation we assume that the reward function for any time is normalized with arbitrary alignment as follows ft for any constant for any ft ft therefore reward function is implicitly bounded within we extend the concept of arms in mab and introduce notation to define an arm representing the selected element based on the prefix where is feasible set and and define as the arm space then we can define the marginal reward for function ft as ft ft ft and the expected marginal reward for as notice that the use of arms characterizes the marginal reward and also indicates that it is related to the player previous decision the offline problem and the offline greedy algorithm in the offline problem we assume that is provided as value oracle therefore the objective is to find the optimal solution arg which only depends on the player decision when the optimal solution is computationally hard to obtain usually we are interested in finding feasible set such that αf where then is called an 
α-approximation solution. That is a typical case where the greedy algorithm comes into play.

The offline greedy algorithm is a local search algorithm that refines the solution phase by phase. It goes as follows: start from the empty set; for each phase k, find the accessible element g_k that maximizes the expected reward of the augmented set, and add g_k to the current set. The above process ends when the resulting set is maximal. We define the maximal decision sequence (G_0, G_1, ..., G_{m_G}) found by the offline greedy algorithm, where m_G is its length, as the greedy sequence; for simplicity we assume that it is unique. Therefore the greedy solution is a maximal decision sequence. One important feature is that the greedy algorithm uses only a polynomial number of calls, poly(n), to the offline oracle, even though the size of the decision class may be exponentially large.

In some cases, such as the offline influence maximization problem, the value of the expected reward can only be accessed with some error or estimated approximately. Sometimes, even though the expected reward can be computed exactly, we may only need an approximate maximizer in each greedy phase in favor of computational efficiency (e.g., efficient submodular maximization). To capture such scenarios, we say a maximal decision sequence (S_0, S_1, ..., S_k, ...) is an ε-greedy sequence if the greedy decision can tolerate an ε error at every phase, that is, for each k the expected reward of S_{k+1} is within ε of the best accessible augmentation of S_k. Notice that there could be many ε-greedy sequences; we denote by (Q_0, Q_1, ..., Q_{m_Q}), where m_Q is its length, the one with the minimum reward, that is, the ε-greedy sequence for which the expected reward of Q_{m_Q} is minimized.

The online problem. In the online case, in contrast, the expected reward function is not provided; the player can only access one of the functions f_t generated by the environment for each time step during the repeated game. For each time t, the game proceeds in the following three steps: (1) the environment draws a sample ω_t from its predetermined distribution without revealing it; (2) the player, based on her previous knowledge, selects a decision sequence (S_0^t, S_1^t, ..., S_{m_t}^t), which reflects the process of her decision phase by phase; (3) the player plays the sequence and gains the reward f_t(S_{m_t}^t), while observing the intermediate feedbacks f_t(S_1^t), f_t(S_2^t), ..., f_t(S_{m_t}^t) to update her knowledge. We refer to such feedbacks as semi-bandit feedbacks in the decision order. For any time t, denote by S^t := S_{m_t}^t the final set played at time t.

The player is to make sequential decisions, and the classical objective is to minimize the cumulative gap of rewards against the optimal solution or an approximation solution. For example, when the optimal solution S* = arg max f̄(S) can be solved in the offline problem, we minimize the expected cumulative regret E[Σ_{t=1}^T (f̄(S*) − f_t(S^t))] over the time horizon T, where the expectation is taken over the randomness of the environment and the possibly random algorithm of the player.

In this paper, we are instead interested in online algorithms that are comparable to the solution of the offline greedy algorithm, namely the greedy sequence ending in G_{m_G}. Thus the objective is to minimize the greedy regret, defined as R_G(T) = T · f̄(G_{m_G}) − E[Σ_{t=1}^T f_t(S^t)]. Given ε, we define the ε-greedy regret as R_Q(T) = T · f̄(Q_{m_Q}) − E[Σ_{t=1}^T f_t(S^t)], where Q_{m_Q} is the final set of the minimum ε-greedy sequence. We remark that if the offline greedy algorithm provides an α-approximation solution, then the greedy regret or ε-greedy regret also bounds the α-approximation regret, which is the regret compared to the α fraction of the optimal solution as defined in prior work. In the rest of the paper, our goal is to design a policy for the player that is comparable to the offline greedy algorithm, in other words so that E[Σ_{t=1}^T f_t(S^t)] ≥ T · f̄(G_{m_G}) − o(T); thus, achieving a sublinear greedy regret R_G(T) is our main focus. A small simulation of this benchmark is sketched below.
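To make the offline benchmark and the greedy regret concrete, here is a small self-contained sketch on a toy instance. The uniform-matroid decision class, the modular Bernoulli reward, and the fixed comparison policy are illustrative assumptions, not constructions from the paper, and the reward is not normalized to [0, 1] for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 6, 3, 1000                      # ground set size, max set size, horizon
means = rng.uniform(0.1, 0.9, size=n)     # per-element means of a modular reward

def accessible(S):
    """Feasible sets reachable from S by adding one element (choose at most m)."""
    return [S | {e} for e in range(n) if e not in S] if len(S) < m else []

def f_bar(S):
    """Expected reward of a set: sum of the means of its elements."""
    return float(sum(means[list(S)]))

def offline_greedy():
    """Phase-by-phase local search with the expected-reward oracle f_bar."""
    S, sequence = frozenset(), [frozenset()]
    while accessible(S):
        S = max(accessible(S), key=f_bar)
        sequence.append(S)
    return sequence                        # the greedy sequence G_0, G_1, ..., G_mG

greedy_value = f_bar(offline_greedy()[-1])

# Greedy regret of a naive policy that always plays one fixed maximal set:
fixed_set = frozenset(range(m))
rewards = [rng.binomial(1, means[list(fixed_set)]).sum() for _ in range(T)]
print("greedy regret estimate:", T * greedy_value - sum(rewards))
```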
The online greedy and OG-UCB algorithm. In this section we propose our online greedy (OG) algorithm with the UCB policy to minimize the greedy regret defined above. For any arm a = (S, s), playing a at time t yields the marginal reward as a random variable X_t(a) = f_t(S ∪ {s}) − f_t(S), in which the randomness comes from ω_t, and we denote the expectation of X_t(a) as its true mean.

Algorithm OG (online greedy). Require: MaxOracle. For each time t = 1, 2, ...:
1. Start with the empty set and repeat the online greedy procedure: call MaxOracle on the arms accessible from the current set to find the current maximal element s_k and a signal h_k, and add s_k to the set, until a maximal sequence (S_0, S_1, ..., S_k) is found.
2. Play the sequence, observe f_t(S_1), ..., f_t(S_k), and gain the reward f_t(S_k).
3. For every phase i, update according to the signals from MaxOracle: if h_1, ..., h_i are all true, then update the empirical mean and the counter of the arm (S_{i-1}, s_i).

Subroutine UCB (to implement MaxOracle). Setup: a confidence radius rad_t(a) proportional to sqrt(ln t / n_t(a)) for each arm a. If some accessible arm is not initialized, return it with the signal true (breaking ties arbitrarily, to initialize arms); otherwise apply the UCB rule, returning the arm that maximizes θ̂_t(a) + rad_t(a), again with the signal true.

Let θ̂(a) be the empirical mean of the marginal reward of a, and n(a) the counter of its plays. More specifically, denote by θ̂_t(a) and n_t(a) their values for a particular arm a at the beginning of time step t; they are evaluated as θ̂_t(a) = (Σ_{i<t} X_i(a) · I_i(a)) / n_t(a) and n_t(a) = Σ_{i<t} I_i(a), where I_i(a) indicates whether a is updated at time i. We assume that our algorithm is lazily initialized, so that each θ̂(a) and n(a) is zero by default until a is first played.

The online greedy algorithm OG proposed above serves as a meta-algorithm allowing different implementations of the subroutine MaxOracle. For every time t, OG calls MaxOracle (to be specified later) to find the local maximal, phase by phase, until the decision sequence is made; then it plays the sequence, observes feedbacks, and gains the reward. Meanwhile, OG collects the boolean signals h_k from MaxOracle during the greedy process and updates the estimators θ̂ and n according to those signals. On the other hand, MaxOracle takes the accessible arms, the estimators, and the counted time, and returns an arm together with a signal h_k ∈ {true, false} instructing OG whether to update estimators for the following phase. The classical UCB can be used to implement MaxOracle, as described in the subroutine above; we term our algorithm OG in which MaxOracle is implemented by the UCB subroutine as algorithm OG-UCB. A few remarks are in order. First, the algorithm chooses an arm with the highest upper confidence bound for each phase. Second, the signal h_k is always true, meaning that OG-UCB always updates the empirical means of arms along the decision sequence. Third, because we use lazy initialization, memory is allocated only when it is needed.

Regret bound of OG-UCB. For any feasible set S, define the greedy element for S as g_S, the accessible element maximizing the expected reward of the augmented set; we use this shorthand for convenience. Denote by M the collection of all maximal feasible sets. We use the following gaps to measure the performance of the algorithm. Definition (gaps). The gap between the maximal greedy feasible set G_{m_G} and any maximal feasible set S is defined as Δ(S) = f̄(G_{m_G}) − f̄(S) if this is positive, and 0 otherwise; we define the maximum gap Δ_max as the worst such penalty over all maximal feasible sets. For any arm a = (S, s), we define the unit gap of a (the gap for one phase) as the gap between the marginal value of the greedy element g_S and that of s. For any arm a = (S, s), we also define a gap that is irreversible once a is selected: the largest gap we may still incur after we have fixed our prefix selection to be S ∪ {s}, where for two feasible sets, one is a prefix of the other if there exists a decision sequence that contains both in order; this gap is upper bounded by Δ_max. Definition (decision frontier). For any decision sequence (S_0, ..., S_k), define its decision frontier as the set of arms that need to be explored along the decision sequence (and a restricted variant similarly).

Theorem (greedy regret bound). For any time t, algorithm OG-UCB (algorithm OG with the UCB subroutine) achieves greedy regret of order ln t, where the coefficient sums, over the arms in the decision frontier of the greedy decision sequence, terms determined by their unit gaps and irreversible gaps. In the degenerate single-phase case, the theorem immediately recovers the regret bound of the classical UCB. In terms of the minimum unit gap Δ*, the greedy regret is bounded by O(mW log t / Δ*), and the memory cost is at most proportional to the regret. For a special class of linear bandits, a simple extension in which we treat arms (S, s) and (S′, s) as the same can make OG-UCB essentially the same as OMM in earlier work, while the regret remains of order log t and the memory cost is reduced (cf. the appendix of the supplementary material).
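A minimal sketch of the OG procedure with a UCB MaxOracle, as just described; the class name, the constant 1.5 inside the confidence radius, and the lazily allocated dictionaries are our implementation choices rather than the paper's exact specification.

```python
import math
from collections import defaultdict

class OGUCB:
    """At each phase, among the arms (S, s) accessible from the current prefix S,
    play any uninitialized arm first, otherwise the arm with the largest upper
    confidence bound. Estimators are allocated lazily, only for arms played."""

    def __init__(self, accessible):
        self.accessible = accessible      # S -> list of elements keeping S ∪ {e} feasible
        self.mean = defaultdict(float)    # empirical mean of each arm's marginal reward
        self.count = defaultdict(int)     # number of plays of each arm

    def _ucb(self, arm, t):
        if self.count[arm] == 0:
            return float("inf")           # force initialization of unseen arms
        return self.mean[arm] + math.sqrt(1.5 * math.log(t) / self.count[arm])

    def select_sequence(self, t):
        S, sequence = frozenset(), [frozenset()]
        while self.accessible(S):
            arm = max(((S, e) for e in self.accessible(S)), key=lambda a: self._ucb(a, t))
            S = S | {arm[1]}
            sequence.append(S)
        return sequence

    def update(self, sequence, phase_values):
        """phase_values[k] = f_t(S_k) observed as semi-bandit feedback; phase_values[0]
        is f_t of the empty set (typically 0 under the normalization)."""
        for k in range(1, len(sequence)):
            arm = (sequence[k - 1], next(iter(sequence[k] - sequence[k - 1])))
            marginal = phase_values[k] - phase_values[k - 1]
            n = self.count[arm]
            self.mean[arm] = (self.mean[arm] * n + marginal) / (n + 1)
            self.count[arm] = n + 1
```

A caller would obtain a sequence via select_sequence(t), play it, observe the per-phase values f_t(S_0), ..., f_t(S_k) as semi-bandit feedback, and pass them to update.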
Relaxing the greedy sequence with ε-tolerance. In this section we propose an online algorithm, called OG-LUCB, which learns an ε-greedy sequence with the goal of minimizing the ε-greedy regret. We learn ε-greedy sequences by a first-explore-then-exploit policy, which utilizes results from PAC learning with a fixed confidence setting. We first implement MaxOracle via the LUCB policy and derive its exploration time; we then assume knowledge of the time horizon T and analyze the ε-greedy regret; finally, we show that the assumption of knowing T can be removed.

OG with the LUCB policy. Given ε and a failure probability δ, we use the subroutine below to implement MaxOracle in algorithm OG; we call the resulting algorithm OG-LUCB. The subroutine is adapted from the LUCB approach for pure exploration bandits, specialized here to identifying the best element among the arms accessible in the current phase. Assume exploit is initially uninitialized. For each greedy phase, the algorithm first explores each arm in the accessible set (the exploration stage), during which the returned flag (the second return field) is always false; when the near-optimal arm is found, exploit is initialized with that arm, the algorithm sticks to exploit for the subsequent time steps (the exploitation stage), and the returned flag for this phase becomes true. The main algorithm OG then uses these flags in such a way that it updates arm estimates for a phase if and only if all earlier phases are already in the exploitation stage. This avoids maintaining useless arm estimates, and is a major memory saving compared to maintaining estimates for every phase at once.

Subroutine LUCB (to implement MaxOracle). Setup: a confidence radius rad_t(a) proportional to sqrt(ln(1/δ) / n_t(a)) for each arm a, and a cache exploit for the arm to exploit. (1) If exploit is initialized, return it with the flag true (the exploitation stage). (2) If some accessible arm is not initialized, return it with the flag false (breaking ties arbitrarily, to initialize arms). (3) Otherwise, let î_t be the arm with the largest empirical mean, and let i_t be the best challenger, the other arm with the largest upper confidence bound θ̂_t(a) + rad_t(a) (perturbing arms that are not yet separated, if needed). If î_t is not yet separated from i_t within ε, i.e., its lower confidence bound does not dominate the challenger's upper confidence bound up to ε, then return the more uncertain of the two (the one with the larger confidence radius) with the flag false; else initialize exploit with î_t and return it with the flag true.

We define the total exploration time as the time after which, at every later step, the algorithm is in the exploitation stage for all greedy phases it encounters. This also means that after the total exploration time, in every step we play the same maximal decision sequence, which we call the stable sequence. Following common practice, we define the hardness coefficient with prefix S in the usual pure-exploration fashion, as a sum over the relevant arms of terms inversely proportional to the square of their gaps truncated at ε.

Rewriting the definitions with respect to the ε-greedy regret: recall that Q_{m_Q} is the final set of the minimum ε-greedy sequence; in this section the gap of a maximal feasible set, the maximum gap, and the per-arm gaps are all defined with respect to f̄(Q_{m_Q}) rather than f̄(G_{m_G}).

The following theorem shows that, with high probability, we can find a stable ε-greedy sequence and that the total exploration time is bounded. Theorem (high-probability exploration time). Given any ε and δ, suppose that after the total exploration time, algorithm OG-LUCB (algorithm OG with the LUCB subroutine) sticks to a stable sequence of some length m′. With probability at least 1 − m′δ, the following claims hold: (1) the stable sequence is an ε-greedy sequence; (2) the total exploration time is bounded, up to constants, by the aggregate hardness coefficient along the sequence times a factor logarithmic in that coefficient and in 1/δ.

Time horizon T is known. Knowing the time horizon T, we may set δ as a suitable function of T to derive the regret as follows. Theorem. Given any ε, when the total time T is known, let algorithm OG-LUCB run with an appropriately small δ depending on T, and suppose the selected sequence has stabilized; then the ε-greedy regret R_Q(T) grows as log T, with a coefficient taking the minimum of a hardness-based term and a gap-based term, involving the largest length of a feasible set and the minimum unit gap truncated at ε.

In general, the two bounds (for OG-UCB and OG-LUCB) are for different regret metrics and thus cannot be directly compared. When we target the plain greedy regret, OG-UCB is slightly better, and only in the constant before log T; on the other hand, when we are satisfied with the ε-greedy regret, OG-LUCB may work better. Algorithm (OG-LUCB with restart): for each epoch, clean the empirical means and counters for all arms, and restart OG-LUCB
with defined in run for time steps exit halfway if the time is over some large for the bound takes the maximum in the denominator of the term and the fixed constant term and the memory cost is only mw time horizon is not known when time horizon is not known we can apply the squaring trick and restart the algorithm for each epoch as follows define the duration of epoch as and its accumulated time as where for any time horizon define the final epoch as the epoch where lies in that is τk then our algorithm is proposed in algorithm the following theorem shows that the log regret still holds with slight blowup of the constant hidden in the big notation for completeness the explicit constant before log can be found in theorem of the supplementary material theorem given any use and defined in and function rq defined in theorem in algorithm suppose sm is the sequence selected by the end of epoch of where is its length for any time denote final epoch as such that τk and the regret satisfies that pk max log where is the minimum unit gap lower bound on the greedy regret consider problem of selecting one element each from bandit instances and the player sequentially collects prize at every phase for simplicity we call it the problem which is defined as follows for each bandit instance denote set ei ei sm of size the accessible set system is defined as where ei fi and fi the reward function is first parameter and the form of is unknown to the player let minimum unit gap min where its value is also unknown to the player the objective of the player is to minimize the greedy regret denote the greedy sequence as gm and the greedy arms as ag gg we say an algorithm is consistent if the sum of playing all arms ag is in for any nt theorem for any consistent algorithm there exists problem instance of the problem as time tends to for any minimum unit gap such that for mw ln some constant the greedy regret satisfies that we remark that the detailed problem instance and the greedy regret can be found in theorem of the supplementary material furthermore we may also restrict the maximum gap to and ln the lower bound rg mw for any sufficiently large for the upper bound ucb theorem gives that log thus our upper bound of matches the lower bound within constant factor acknowledgments jian li was supported in part by the national basic research program of china grants and the national nsfc grants references audibert and bubeck best arm identification in bandits in colt audibert bubeck and lugosi minimax policies for combinatorial prediction games arxiv preprint auer and fischer analysis of the multiarmed bandit problem machine learning auer freund and schapire the nonstochastic multiarmed bandit problem siam journal on computing and ziegler introduction to greedoids matroid applications bubeck and regret analysis of stochastic and nonstochastic bandit problems arxiv preprint bubeck munos and stoltz pure exploration in and bandits theoretical computer science and lugosi combinatorial bandits journal of computer and system sciences chen lin king lyu and chen combinatorial pure exploration of bandits in nips chen wang and yuan combinatorial bandit general framework and applications in icml chvatal greedy heuristic for the problem mathematics of operations research gabillon kveton wen eriksson and muthukrishnan adaptive submodular maximization in bandit setting in nips gai krishnamachari and jain learning multiuser channel allocations in cognitive radio networks combinatorial bandit formulation in dyspan ieee garivier and 
Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. arXiv preprint.
Helman, Moret, and Shapiro. An exact characterization of greedy structures. SIAM Journal on Discrete Mathematics.
Kalyanakrishnan, Tewari, Auer, and Stone. PAC subset selection in stochastic multi-armed bandits. In ICML.
Kempe, Kleinberg, and Tardos. Maximizing the spread of influence through a social network. In SIGKDD.
Korte and Lovász. Greedoids and linear objective functions. SIAM Journal on Algebraic Discrete Methods.
Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society.
Kveton, Wen, Ashkan, Eydgahi, and Eriksson. Matroid bandits: fast combinatorial optimization with learning. arXiv preprint.
Kveton, Wen, Ashkan, and Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. arXiv preprint.
Lai and Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics.
Lin, Abrahao, Kleinberg, Lui, and Chen. Combinatorial partial monitoring game with linear feedback and its applications. In ICML.
Mirzasoleiman, Badanidiyuru, Karbasi, Vondrak, and Krause. Lazier than lazy greedy. In Proc. Conference on Artificial Intelligence (AAAI).
Prim. Shortest connection networks and some generalizations. Bell System Technical Journal.
Streeter and Golovin. An online algorithm for maximizing submodular functions. In NIPS.
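Referring back to the MaxOracle subroutine implemented with the LUCB policy above, the following is a minimal Python sketch of one greedy phase. The class and method names (MaxOracleLUCB, step, update) and the exact confidence-radius constant are illustrative assumptions rather than the paper's specification; the sketch only conveys the fixed-confidence logic of exploring until the empirical leader is separated from every other candidate and then caching it for exploitation.

```python
import math


class MaxOracleLUCB:
    """Sketch of a fixed-confidence best-arm subroutine for one greedy phase.

    Explores the candidate arms until the empirical leader's lower confidence
    bound exceeds every other arm's upper confidence bound, then caches that
    arm ("exploit") and keeps returning it with a done-flag of True, mirroring
    the exploration/exploitation flags described above.
    """

    def __init__(self, arms, delta):
        self.arms = list(arms)                   # candidate decisions for this phase
        self.delta = delta                       # per-phase failure probability
        self.counts = {a: 0 for a in self.arms}
        self.means = {a: 0.0 for a in self.arms}
        self.exploit = None                      # cached arm once separation holds
        self.t = 0

    def _rad(self, arm):
        # Illustrative confidence radius; the paper's exact constant differs.
        n = max(self.counts[arm], 1)
        return math.sqrt(math.log(4.0 * len(self.arms) * (self.t + 1) ** 3 / self.delta) / (2.0 * n))

    def update(self, arm, reward):
        # Incremental update of the empirical mean of the pulled arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

    def step(self):
        """Return (arm_to_play, exploiting_flag)."""
        self.t += 1
        if self.exploit is not None:
            return self.exploit, True                    # exploitation stage
        unplayed = [a for a in self.arms if self.counts[a] == 0]
        if unplayed:
            return unplayed[0], False                    # initial exploration pass
        if len(self.arms) == 1:
            self.exploit = self.arms[0]
            return self.exploit, True
        leader = max(self.arms, key=lambda a: self.means[a])
        challenger = max((a for a in self.arms if a != leader),
                         key=lambda a: self.means[a] + self._rad(a))
        if self.means[leader] - self._rad(leader) > self.means[challenger] + self._rad(challenger):
            self.exploit = leader                        # separated with high confidence
            return leader, True
        # Not separated yet: keep sampling the more uncertain contender.
        nxt = leader if self._rad(leader) >= self._rad(challenger) else challenger
        return nxt, False
```

The outer greedy loop would call step() for the current phase, feed the observed reward back through update(), and only begin updating estimates for a later phase once every earlier phase's subroutine reports the exploitation flag, which is the flag-handling logic the main algorithm uses to avoid maintaining useless arm estimates.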
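For the unknown-horizon case, the exact epoch lengths of the restart scheme did not survive extraction above; the standard squaring-trick choice, stated here only as an assumed reconstruction and not as the paper's definition, is

$$\tau_k = 2^{2^k}, \qquad T_k = \sum_{j=0}^{k} \tau_j, \qquad K(T) = \min\{k : T \le T_k\}.$$

With this choice, for all sufficiently large $T$ one has $\tau_{K(T)} \le T^2$ and $K(T) = O(\log\log T)$, so restarting the algorithm from a clean state at the start of each epoch inflates the $O(\log T)$ greedy-regret bound only in its constant factor, consistent with the theorem stated above.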
deeply learning the messages in message passing inference guosheng lin chunhua shen ian reid anton van den hengel the university of adelaide australia and australian centre for robotic vision abstract deep structured output learning shows great promise in tasks like semantic image segmentation we proffer new efficient deep structured model learning scheme in which we show how deep convolutional neural networks cnns can be used to directly estimate the messages in message passing inference for structured prediction with conditional random fields crfs with such cnn message estimators we obviate the need to learn or evaluate potential functions for message calculation this confers significant efficiency for learning since otherwise when performing structured learning for crf with cnn potentials it is necessary to undertake expensive inference for every stochastic gradient iteration the network output dimension of message estimators is the same as the number of classes rather than exponentially growing in the order of the potentials hence it is more scalable for cases that involve large number of classes we apply our method to semantic image segmentation and achieve impressive performance which demonstrates the effectiveness and usefulness of our cnn message learning method introduction learning deep structured models has attracted considerable research attention recently one popular approach to deep structured model is formulating conditional random fields crfs using deep convolutional neural networks cnns for the potential functions this combines the power of cnns for feature representation learning and of the ability for crfs to model complex relations the typical approach for the joint learning of crfs and cnns is to learn the cnn potential functions by optimizing the crf objective maximizing the the cnn and crf joint learning has shown impressive performance for semantic image segmentation for the joint learning of cnns and crfs stochastic gradient descent sgd is typically applied for optimizing the conditional likelihood this approach requires the marginal inference for calculating the gradient for loopy graphs marginal inference is generally expensive even when using approximate solutions given that learning the cnn potential functions typically requires large number of gradient iterations repeated marginal inference would make the training intractably slow applying an approximate training objective is solution to avoid repeat inference learning and piecewise learning are examples of this kind of approach in this work we advocate new direction for efficient deep structured model learning in conventional crf approaches the final prediction is the result of inference based on the learned potentials however our ultimate goal is the final prediction not the potentials themselves so we propose to directly optimize the inference procedure for the final prediction our focus here is on the extensively studied message passing based inference algorithms as discussed in we can directly learn message estimators to output the required messages in the inference procedure rather than learning the potential functions as in conventional crf learning approaches with the learned message estimators we then obtain the final prediction by performing message passing inference our main contributions are as follows we explore new direction for efficient deep structured learning we propose to directly learn the messages in message passing inference as training deep cnns in an learning fashion message learning 
does not require any inference step for the gradient calculation which allows efficient training furthermore when cast as tradiational classification task the network output dimension for message estimation is the same as the number of classes while the network output for general cnn potential functions in crfs is which is exponential in the order of the potentials for example for pairwise potentials for etc hence cnn based message learning has significantly fewer network parameters and thus is more scalable especially in cases which involve large number of classes the number of iterations in message passing inference can be explicitly taken into consideration in the message learning procedure in this paper we are particularly interested in learning messages that are able to offer crf prediction results with only one message passing iteration making the message passing inference very fast we apply our method to semantic image segmentation on the pascal voc dataset and achieve impressive performance related work combining the strengths of cnns and crfs for segmentation has been explored in several recent methods some methods resort to simple combination of cnn classifiers and crfs without joint learning in first train fully cnn for pixel classification and applies dense crf method as step later the method in extends deeplab by jointly learning the dense crfs and cnns in also performs joint learning of cnns and the dense crfs they implement the inference as recurrent neural networks which facilitates the learning these methods usually use cnns for modelling the unary potentials only the work in trains cnns to model both the unary and pairwise potentials in order to capture contextual information jointly learning cnns and crfs has also been explored for other applications like depth estimation the work in explores joint training of markov random fields and deep networks for predicting words from noisy images and image classification all these methods that combine cnns and crfs are based upon conventional crf approaches they aim to jointly learn or incorporate cnn potential functions and then perform using the potentials in contrast our method here directly learns cnn message estimators for the message passing inference rather than learning the potentials the inference machine proposed in is relevant to our work in that it has discussed the idea of directly learning message estimators instead of learning potential functions for structured prediction they train traditional logistic regressors with features as message estimators motivated by the tremendous success of cnns we propose to train deep cnns based message estimators in an learning style without using features unlike the approach in which aims to learn message estimators our proposed method aims to learn the message estimators thus we are able to naturally formulate the variable marginals which is the ultimate goal for crf inference as the training objective see sec the approach in jointly learns cnns and crfs for pose estimation in which they learn the marginal likelihood of body parts but ignore the partition function in the likelihood message learning is not discussed in that work and the exact relationship between this pose estimation approach and message learning remains unclear learning crf with cnn potentials before describing our message learning method we review the joint learning approach and discuss limitations an input image is denoted by and the corresponding labeling mask is denoted by the energy function is denoted by 
which measures the score of the prediction given the input image we consider the following form of conditional likelihood exp exp exp here is the partition function the crf model is decomposed by factor graph over set of factors generally the energy function is written as sum of potential functions factor functions ef xf here indexes one factor in the factor graph denotes the variable nodes which are connected to the factor ef is the potential function factor function the potential function can be unary pairwise or potential function the recent method in describes examples of constructing general cnn based unary and pairwise potentials take semantic image segmentation as an example to predict the pixel labels of test image we can find the mode of the joint label distribution by solving the maximum posteriori map inference problem argmax we can also obtain the final prediction by calculating the label marginal distribution of each variable which requires to solve marginal inference problem yp here indicates the output variables excluding yp for general crf graph with cycles the above inference problems is known to be thus approximate inference algorithms are applied message passing is type of widely applied algorithms for approximate inference loopy belief propagation bp message passing and approximation are examples of the message passing methods joint learning aims to learn cnn potential functions by optimizing the crf objective typically the negative conditional which is log log the energy function is constructed by cnns for which all the network parameters are denoted by adding regularization minimizing negative for crf learning is pn minθ log here denote the training image and its segmentation mask is the number of training images is the weight decay parameter we can apply stochastic gradient descent sgd to optimize the above problem for learning the energy function is constructed from cnns and its gradient can be easily computed by applying the chain rule as in conventional cnns however the partition function brings difficulties for optimization its gradient is log exp exp direct calculation of the above gradient is computationally infeasible for general crf graphs usually it is necessary to perform approximate marginal inference to calculate the gradients at each sgd iteration however repeated marginal inference can be extremely expensive as discussed in cnn training usually requires huge number of sgd iterations hundreds of thousands or even millions hence this inference based learning approach is in general not scalable or even infeasible learning cnn message estimators in conventional crf approaches the potential functions are first learned and then inference is performed based on the learned potential functions to generate the final prediction in contrast our approach directly optimizes the inference procedure for final prediction we propose to learn cnn estimators to directly output the required intermediate values in an inference algorithm here we focus on the message passing based inference algorithm which has been extensively studied and widely applied in the crf prediction procedure the message vectors are recursively calculated based on the learned potentials we propose to construct and learn cnns to directly estimate these messages in the message passing procedure rather than learning the potential functions in particular we directly learn message estimators our message learning framework is general and can accommodate all message passing based algorithms such as loopy 
belief propagation bp approximation and their variants here we discuss using loopy bp for calculating variable marginals as shown by yedidia et al loopy bp has close relation with bethe free energy approximation typically the message is vector is the number of classes which encodes the information of the label distribution for each connection we need to recursively compute the message rk and the message rk the unnormalized message is computed as yp yp here fp is set of factors connected to the variable fp is the set of factors fp excluding the factor for loopy graphs the message is normalized at each iteration exp yp yp log exp yp the message is computed as yp log exp ef yq here nf is set of variables connected to the factor nf is the set of variables nf excluding the variable once we get all the messages of one variable node we are able to calculate the marginal distribution beliefs of that variable yp exp yp zp in which zp is normalizer zp yp exp yp cnn message estimators the calculation of message depends on the messages substituting the definition of in can be as exp exp ef log yp log exp yq exp log exp ef log exp yq yp here denotes the variable node which is connected to the node by the factor in the factor graph we refer to the variable node as neighboring node of nf is set of variables connected to the factor excluding the node clearly for pairwise factor which only connects to two variables the set nf only contains one variable node the above equations show that the message depends on the potential ef and here is the message which is calculated from neighboring node and factor conventional crf learning approaches learn the potential function then follow the above equations to compute the messages for calculating marginals as discussed in given that the goal is to estimate the marginals it is not necessary to exactly follow the above equations which involve learning potential functions to calculate messages we can directly learn message estimators rather than indirectly learning the potential functions as in conventional methods consider the calculation in the message depends on the observation xpf and the messages here xpf denotes the observations that correspond to the node and the factor we are able to formulate message estimator which takes xpf and as inputs and outputs the message vector and we directly learn such estimators since one message depends on number of previous messages we can formulate sequence of message estimators to model the dependence thus the output from previous message estimator will be the input of the following message estimator there are two message passing strategies for loopy bp synchronous and asynchronous passing we here focus on the synchronous message passing for which all messages are computed before passing them to the neighbors the synchronous passing strategy results in much simpler message dependences than the asynchronous strategy which simplifies the training procedure we define one inference iteration as one pass of the graph with the synchronous passing strategy we propose to learn cnn based message estimator the message estimator models the interaction between neighboring variable nodes we denote by message estimator the message is calculated as yp mf xpf dpf yp we refer to dpf as the dependent message feature vector which encodes all dependent messages from the neighboring nodes that are connected to the node by note that the dependent messages are the output of message estimators at the previous inference iteration in the case of running 
only one message passing iteration there are no dependent messages for mf and thus we do not need to incorporate dpf to have general exposition we here describe the case of running arbitrarily many inference iterations we can choose any effective strategy to generate the feature vector dpf from the dependent messages here we discuss simple example according to we define the feature vector dpf as vector which aggregates all dependent messages in this case dpf is computed as exp mf xqf dqf dpf log exp mf xqf dqf with the definition of dpf in and in it clearly shows that the message estimation requires evaluating sequence of message estimators another example is to concatenate all dependent messages to construct the feature vector dpf there are different strategies to formulate the message estimators in different iterations one strategy is using the same message estimator across all inference iterations in this case the message estimator becomes recursive function and thus the cnn based estimator becomes recurrent neural network rnn another strategy is to formulate different estimator for each inference iteration details for message estimator networks we formulate the estimator mf as cnn thus the estimation is the network outputs pk yp mf xpf dpf yp yp zpf dpf here denotes the network parameter which we need to learn is the indicator function which equals if the input is true and otherwise we denote by pf rk as the output vector is the number of classes of the message estimator network for the node and the factor zpf is the value in the network output pf corresponding to the class we can consider any possible strategies for implementing pf with cnns for example we here describe strategy which is analogous to the network design in we denote by as fully convolutional network fcnn for convolutional feature generation and as traditional fully connected network for message estimation given an input image the network output is convolutional feature map in which is the feature map size and is the dimension of one feature vector each spatial position each feature vector in the feature map corresponds to one variable node in the crf graph we denote by rr the feature vector corresponding to the variable node likewise nf rr is the averaged vector of the feature vectors that correspond to the set of nodes nf recall that nf is set of nodes connected by the factor excluding the node for pairwise factors nf contains only one node we construct the feature vector pf for the pair by concatenating and nf finally we concatenate the feature vector pf and the dependent message feature vector dpf as the input for the second network thus the input dimension for is for running only one inference iteration the input for is pf alone the final output from the second network is the message vector pf to sum up we generate the final message vector pf as pf nf pf for general cnn based potential function in conventional crfs the potential network is usually required to have large number of output units exponential in the order of the potentials for example it requires is the number of classes outputs for the pairwise potentials large number of output units would significantly increase the number of network parameters it leads to expensive computations and tends to the training data in contrast for learning our cnn message estimator we only need to formulate output units for the network clearly it is more scalable in the cases of large number of classes training cnn message estimators our goal is to estimate the variable 
marginals in which can be with the estimators yp exp yp exp mf xpf dpf yp zp zp here zp is the normalizer the ideal variable marginal for example has the probability of for the ground truth class and for the remaining classes here we consider the cross entropy loss between the ideal marginal and the estimated marginal yp log yp yp yp exp mf xpf dpf yp yp log exp mf xpf dpf yp in which is the ground truth label for the variable node given set of training images and label masks the optimization problem for learning the message estimator network is pn minθ the work in proposed to learn the message unlike their approach we aim to learn the message for which we are able to naturally formulate the variable marginals which is the ultimate goal for prediction as the training objective moreover for learning in their approach the message estimator will depend on all neighboring nodes connected by any factors given that variable nodes will have different numbers of neighboring nodes they only consider fixed number of neighboring nodes and concatenate their features to generate feature vector for classification in our case for learning the message estimator only depends on fixed number of neighboring nodes connected by one factor thus we do not have this problem most importantly they learn message estimators by training traditional probabilistic classifiers simple logistic regressors with features and in contrast we train deep cnns in an learning style without using features message learning with budgets one advantage of message learning is that we are able to explicitly incorporate the expected number of inference iterations into the learning procedure the number of inference iterations defines the learning sequence of message estimators this is particularly useful if we aim to learn the estimators which are capable of predictions within only few inference iterations in contrast table segmentation results on the pascal voc val set we compare with several recent cnn based methods with available results on the val set our method performs the best method contextdcrf boxsup boxsup ours training set voc extra voc extra voc extra voc extra voc extra voc extra voc extra coco voc extra voc extra train approx iou val set conventional potential function learning in crfs is not able to directly incorporate the expected number of inference iterations we are particularly interested in learning message estimators for use with only one message passing iteration because of the speed of such inference in this case it might be preferable to have largerange neighborhood connections so that large range interaction can be captured within one inference pass experiments we evaluate the proposed cnn message learning method for semantic image segmentation we use the publicly available pascal voc dataset there are object categories and one background category in the dataset it contains images in the training set images in the val set and images in the test set following the common practice in the training set is augmented to images by including the extra annotations provided in for the voc images we use iou score to evaluate the segmentation performance for the learning and prediction of our method we only use one message passing iteration the recent work in referred to as contextdcrf learns fully convolutional cnns fcnns for unary and pairwise potential functions to capture contextual information we follow this crf learning method and replace the potential functions by the proposed message estimators we consider types of 
spatial relations for constructing the pairwise connections of variable nodes one is the surrounding spatial relation for which one node is connected to its surround nodes the other one is the spatial relation for which one node is connected to the nodes that lie above for the pairwise connections the neighborhood size is defined by range box we learn one type of unary message estimator and types of pairwise message estimators in total one type of pairwise message estimator is for the surrounding spatial relations and the other two are for the spatial relations we formulate one network for one type of message estimator we formulate our message estimators as fcnns for which we apply similar network configuration as in the network see sec for details has convolution blocks and has fully connected layers with output units our networks are initialized using the model we train all layers using our system is built on matconvnet we first evaluate our method on the voc val set we compare with several recent cnn based methods with available results on the val set results are shown in table our method achieves the best performance the comparing method contextdcrf follows conventional crf learning and prediction scheme they first learn potentials and then perform inference based on the learned potentials to output final predictions the result shows that learning the cnn message estimators is able to achieve similar performance compared to learning cnn potential functions in crfs note that since here we only use one message passing iteration for the training and prediction the inference is particularly efficient to further improve the performance we perform simple data augmentation in training we generate extra scales of the training images and their flipped images for training this result is denoted by in the result table bird boat bottle bus car cat chair cow table dog horse mbike person potted sheep sofa train tv mean bike method ours aero table category results on the pascal voc test set our method performs the best table segmentation results on the pascal voc test set compared to methods that use the same augmented voc dataset our method has the best performance method contextdcrf sds boxsup semi ours training set voc extra voc extra voc extra voc extra voc extra voc extra voc extra voc extra voc extra coco voc extra coco voc extra coco voc extra coco voc extra train approx iou test set we further evaluate our method on the voc test set we compare with recent cnn methods with competitive performance the results are described in table since the ground truth labels are not available for the test set we evaluate our method through the voc evaluation server we achieve very competitive performance on the test set iou which is to date the best performance amongst methods that use the same augmented voc training dataset marked as voc extra in the table these results validate the effectiveness of direct message learning with cnns we also include comparison with methods which are trained on the much larger coco dataset around training images our performance is comparable with these methods even though we make use of many fewer training images the results for each category is shown in table we compare with several recent methods which transfer layers from the same model and use the same training data our method performs the best for out of categories conclusion we have proposed new deep message learning framework for structured crf prediction learning deep message estimators for the message passing 
inference reveals new direction for learning deep structured model learning cnn message estimators is efficient which does not involve expensive inference steps for gradient calculation the network output dimension for message estimation is the same as the number of classes which does not increase with the order of the potentials and thus cnn message learning has less network parameters and is more scalable in the number of classes compared to conventional potential function learning our impressive performance for semantic segmentation demonstrates the effectiveness and usefulness of the proposed deep message learning our framework is general and can be readily applied to other structured prediction applications acknowledgements this research was supported by the data to decisions cooperative research centre and by the australian research council through the arc centre for robotic vision and through laureate fellowship to reid correspondence should be addressed to shen the result link provided by voc evaluation server http references zheng jayasumana vineet su du huang and torr conditional random fields as recurrent neural networks online available http schwing and urtasun fully connected deep structured networks online available http lin shen reid and van den hengel efficient piecewise training of deep structured models for semantic segmentation online available http liu shen and lin deep convolutional neural fields for depth estimation from single image in proc ieee conf comp vis pattern chen schwing yuille and urtasun learning deep structured models online available http besag efficiency of pseudolikelihood estimation for simple gaussian fields biometrika sutton and mccallum piecewise training for undirected models in proc conf uncertainty artificial intelli ross munoz hebert and bagnell learning inference machines for structured prediction in proc ieee conf comp vis pattern chen papandreou kokkinos murphy and yuille semantic image segmentation with deep convolutional nets and fully connected crfs online available http and koltun efficient inference in fully connected crfs with gaussian edge potentials in proc adv neural info process liu shen lin and reid learning depth from single monocular images using deep convolutional neural fields online available http tompson jain lecun and bregler joint training of convolutional network and graphical model for human pose estimation in proc adv neural info process nowozin and lampert structured learning and prediction in computer vision found trends comput graph kolmogorov convergent message passing for energy minimization ieee pattern analysis machine intelligence yedidia freeman weiss et generalized belief propagation in proc adv neural info process long shelhamer and darrell fully convolutional networks for semantic segmentation in proc ieee conf comp vis pattern mostajabi yadollahpour and shakhnarovich feedforward semantic segmentation with features online available http dai he and sun boxsup exploiting bounding boxes to supervise convolutional networks for semantic segmentation online available http everingham gool williams winn and zisserman the pascal visual object classes voc challenge int comp hariharan girshick and malik simultaneous detection and segmentation in proc european conf computer vision hariharan arbelaez bourdev maji and malik semantic contours from inverse detectors in proc int conf comp simonyan and zisserman very deep convolutional networks for image recognition online available http vedaldi and lenc matconvnet 
Convolutional neural networks for MATLAB. In Proceedings of the ACM Int. Conf. on Multimedia.
Noh, Hong, and Han. Learning deconvolution network for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Pattern Recogn.
Papandreou, Chen, Murphy, and Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. Online available: http
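To make the prediction and training objective of the message-learning scheme above concrete, the following is a minimal numpy sketch: each factor connected to a variable node contributes a K-dimensional message, the variable's marginal (belief) is the normalized exponential of the summed messages, and training minimizes the cross entropy between this marginal and the one-hot ground-truth marginal. The function message_net is only a stand-in for the CNN message estimator; its linear form, the toy features, and all names are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def variable_marginal(incoming_messages):
    """Belief of one variable node: normalized exponential of the sum of the
    K-dimensional messages from all factors connected to that node."""
    return softmax(np.sum(incoming_messages, axis=0))

def cross_entropy(marginal, true_label):
    """Negative log of the estimated marginal probability of the ground truth."""
    return -np.log(marginal[true_label] + 1e-12)

def message_net(features):
    """Stand-in for the CNN message estimator; a fixed linear map is used here
    purely so the example runs end to end."""
    W = np.array([[0.5, -0.2, 0.1],
                  [0.0,  0.3, -0.1],
                  [-0.4, 0.1,  0.6]])
    return W @ features

# Toy usage: one variable node, two connected pairwise factors, K = 3 classes.
features_per_factor = [np.array([1.0, 0.0, 2.0]),    # observation for factor 1
                       np.array([0.5, 1.5, 0.0])]    # observation for factor 2
messages = [message_net(f) for f in features_per_factor]
belief = variable_marginal(messages)
loss = cross_entropy(belief, true_label=1)
print(belief, loss)
```

Because only one message-passing iteration is used, each message depends solely on its local observation (there are no dependent messages), so the gradient of this loss reaches the estimator network directly, without the repeated marginal-inference step that joint CRF-CNN potential learning would require.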
synaptic sampling bayesian approach to neural network plasticity and rewiring david stefan robert legenstein wolfgang maass institute for theoretical computer science graz university of technology graz austria kappel habenschuss legi maass abstract we reexamine in this article the conceptual and mathematical framework for understanding the organization of plasticity in spiking neural networks we propose that inherent stochasticity enables synaptic plasticity to carry out probabilistic inference by sampling from posterior distribution of synaptic parameters this view provides viable alternative to existing models that propose convergence of synaptic weights to maximum likelihood parameters it explains how priors on weight distributions and connection probabilities can be merged optimally with learned experience in simulations we show that our model for synaptic plasticity allows spiking neural networks to compensate continuously for unforeseen disturbances furthermore it provides normative mathematical framework to better understand the permanent variability and rewiring observed in brain networks introduction in the century helmholtz proposed that perception could be understood as unconscious inference this insight has recently re gained considerable attention in models of bayesian inference in neural networks the hallmark of this theory is the assumption that the activity of neuronal networks can be viewed as an internal model for hidden variables in the outside world that give rise to sensory experiences this hidden state is usually assumed to be represented by the activity of neurons in the network network of stochastically firing neurons is modeled in this framework by probability distribution pn that describes the probabilistic relationships between set of inputs xn and corresponding network responses where denotes the vector of network parameters that shape this distribution via synaptic weights and network connectivity the likelihood pn pn of the actually occurring inputs under the resulting internal model can then be viewed as measure for the agreement between this internal model which carries out predictive coding and its environment which generates the goal of network learning is usually described in this probabilistic generative framework as finding parameter values that maximize this agreement or equivalently the likelihood of the inputs maximum likelihood learning arg maxθ pn locally optimal estimates of can be determined by gradient ascent on the data likelihood pn which led to many previous models of network plasticity while these models learn point estimates of locally optimal parameters theoretical considerations for artificial neural networks suggest that it is advantageous to learn full posterior distributions over parameters this full bayesian treatment of learning allows to integrate structural parameter priors in way and promises better generalization of the acquired knowledge to new inputs the problem how such posterior distributions could be learned by brain networks has been highlighted in as an important future challenge in computational neuroscience these authors contributed equally figure illustration of synaptic sampling for two parameters of neural network plot of an example likelihood function for fixed set of inputs it assigns probability density amplitude on to each parameter setting the likelihood function is defined by the underlying neural network example for prior that prefers small values for the posterior that results as product of the prior and the 
likelihood single trajectory of synaptic sampling from the posterior starting at the black dot the parameter vector fluctuates between different solutions the visited values cluster near local optima red triangles cartoon illustrating the dynamic forces plasticity rule that enable the network to sample from the posterior distribution in here we introduce possible solution to this problem we present new theoretical framework for analyzing and understanding local plasticity mechanisms of networks of neurons as stochastic processes that generate specific distributions of network parameters over which these parameters fluctuate we call this new theoretical framework synaptic sampling we use it here to analyze and model unsupervised learning and rewiring in spiking neural networks in section we show that the synaptic sampling hypothesis also provides unified framework for structural and synaptic plasticity which both are integrated here into single learning rule this model captures salient features of the permanent rewiring and fluctuation of synaptic efficacies observed in the cortex in computer simulations we demonstrate another advantage of the synaptic sampling framework it endows neural circuits with an inherent robustness against perturbations learning posterior distribution through stochastic synaptic plasticity in our learning framework we assume that not only neural network as described above but also prior ps for its parameters θm are given this prior ps can encode both structural constraints such as sparse connectivity and structural rules distribution of synaptic weights then the goal of network learning becomes learn the posterior distribution ps pn with normalizing constant key insight see fig for an illustration is that stochastic local plasticity rules for the parameters θi enable network to achieve the learning goal the distribution of network parameters will converge after while to the posterior distribution and produce samples from it if each network parameter θi obeys the dynamics dθi θi log ps θi log pn θi dt θi dwi for and θi θi the stochastic term dwi describes infinitesimal stochastic increments and decrements of wiener process wi where process increments over time are normally distributed with zero mean and variance wit wis ormal the dynamics extend previous models of bayesian learning via sampling by including temperature and parameter θi that can depend on the current value of θi without changing the stationary distribution for example the sampling speed of synaptic weight can be slowed down if it reaches very high or very low values the temperature parameter can be used to scale the diffusion term the noise the resulting stationary distribution of is proportional to so that the dynamics of the stochastic process can be described by the energy landscape log for high values of this energy landscape is flattened the main modes of become less pronounced for we arrive at the learning goal for the dynamics of approaches deterministic process and converges to the next local maximum of thus the learning process approximates for low values of maximum posteriori map inference the result is formalized in the following theorem theorem let be strictly positive continuous probability distribution over continuous or discrete states and continuous parameters θm twice continuously differentiable with respect to let be strictly positive twice continuously differentiable function then the set of stochastic differential equations leaves the distribution invariant with dθ furthermore is 
the unique stationary distribution of proof first note that the first two terms in the drift term of eq can be written as θi log ps θi log pn θi log θi where denotes the vector of parameters excluding parameter θi hence the dynamics can be written in terms of an stochastic differential equations with drift ai and diffusion bi dθi θi log θi θi dt θi dwi drift ai diffusion bi this describes the stochastic dynamics of each parameter over time for the stationary distribution we are interested in the dynamics of the distribution of parameters eq translate into the following equation that determines the temporal dynamics of the distribution pfp over network parameters at time see pfp bi pfp ai pfp dt plugging in the presumed stationary distribution on the right hand side of eq one obtains pfp ai bi dt θi log θi θi log which by inserting for the assumed stationary distribution becomes pfp θi log θi dt θi log log θi this proves that is stationary distribution of the parameter sampling dynamics under the assumption that θi is strictly positive this stationary distribution is also unique if the matrix of diffusion coefficients is invertible and the potential conditions are satisfied see section in for details the stationary distribution can be obtained uniquely by simple integration since the matrix of diffusion coefficients is diagonal in our model diag bi bm is trivially invertible since all elements all bi are positive convergence and uniqueness of the stationary distribution follows then for strictly positive θi see section in online synaptic sampling for sequences of inputs xn the weight update rule depends on all inputs such that synapses have to keep track of the whole set of all network inputs for the exact dynamics batch learning in an online scenario we assume that only the current network input xn is available according to the dynamics synaptic plasticity rules have to compute the log likelihood derivative log pn we assume that every τx time units different input is presented to the network and that the inputs xn are visited repeatedly in fixed regular order under the assumption that the input patterns are statistically independent the likelihood pn becomes pn pn xn pn xn each network input xn can be explained as being drawn individually from pn xn independently from other inputs the derivative of the log likelihood in is then given by pn log pn this batch dynamics does not map readily onto log pn network implementation because the weight update requires at any time knowledge of all inputs xn we provide here an online approximation for small sampling speeds to obtain an online learning rule we consider the parameter dynamics dθi θi log ps θi log pn θi dt θi dwi as in the batch learning setting we assume that each input xn is presented for time interval of τx although convergence to the correct posterior distribution can not be guaranteed theoretically for this online rule we show that it is reasonable approximation to the integrating the parameter changes over one full presentation of the data starting from with some initial parameter values up to time τx we obtain for slow sampling speeds τx θi τx θi θi τx θi log ps θi log pn θi win τx this is also what one obtains when integrating the batch rule for τx time units for slow θi hence for slow enough θi is good approximation of optimal weight sampling in the presence of hidden variables maximum likelihood learning can not be applied directly since the state of the hidden variables is not known from the observed data the expectation 
maximization algorithm can be used to overcome this problem we adopt this approach here in the online setting when pattern xn is applied to the network it responds with network state according to pn xn where the current network parameters are used in this inference process the parameters are updated in parallel according to the dynamics for the given values of xn and synaptic sampling for network rewiring in this section we present simple model to describe permanent network rewiring using the dynamics experimental studies have provided wealth of information about the stochastic rewiring in the brain see they demonstrate that the volume of substantial fraction of dendritic spines varies continuously over time and that all the time new spines and synaptic connections are formed and existing ones are eliminated we show that these experimental data on spine motility can be understood as special cases of synaptic sampling to arrive at concrete model we use the following assumption about dynamic network rewiring in accordance with experimental studies we require that spine sizes have multiplicative dynamics that the amount of change within some given time window is proportional to the current size of the spine we assume here for simplicity that there is single parameter θi for each potential synaptic connection the second requirement can be met by encoding the state of the synapse in an abstract form that represents synaptic connectivity and synaptic efficacy in single parameter θi we define that negative values of θi represent current disconnection and positive values represent functional synaptic connection we focus on excitatory connections the distance of the current value of θi from zero indicates how likely it is that the synapse will soon reconnect for negative values or withdraw for positive values in addition the synaptic parameter θi encodes for positive values the synaptic efficacy wi the resulting epsp amplitudes by simple mapping wi θi the first assumption which requires multiplicative synaptic dynamics supports an exponential function in our model in accordance with previous models of spine motility thus we assume in the following that the efficacy wi of synapse is given by wi exp θi note that for large enough offset negative parameter values θi which model synaptic connection are automatically mapped onto tiny region close to zero in the so that retracted spines have essentially zero synaptic efficacy in addition we use gaussian prior ps θi ormal θi with mean and variance over synaptic parameters in the simulations we used and prior of this form allows to include simple regularization mechanism in the learning scheme which prefers sparse solutions solutions with small parameters together with the exponential mapping this prior induces prior distribution over synaptic weights wi the network therefore learns solutions where only the most relevant synapses are much larger than zero the general rule for online synaptic sampling for the exponential mapping wi exp θi and the gaussian prior becomes for constant small learning rate and unit temperature dθi θi wi log pn xn dt dwi in eq the multiplicative synaptic dynamics becomes explicit the gradient log pn xn the contribution to synaptic plasticity is weighted by wi hence for negative values of θi synaptic connection the activities of the and neurons have negligible impact on the dynamics of the synapse assuming large enough retracted synapses therefore evolve solely according to the prior ps and the random fluctuations dwi for large 
values of θi the opposite is the case the influence of the prior log ps and the wiener process dwi become negligible and the dynamics is dominated by the likelihood term if the second term in eq that tries to maximize the likelihood is small because θi is small or parameters are near mode of the likelihood then eq implements an process this prediction of our model is consistent with previous analysis which showed that an process is viable model for synaptic spine motility spiking network model through the use of parameters which determine both synaptic connectivity and synaptic weights the synaptic sampling framework provides unified model for structural and synaptic plasticity eq describes the stochastic dynamics of the synaptic parameters θi in this section we analyze the resulting rewiring dynamics and structural plasticity by applying the synaptic sampling framework to networks of spiking neurons here we used wta networks to learn simple sensory integration task and show that learning with synaptic sampling in such networks is inherently robust to perturbations for the wta we adapted the model described in detail in briefly the wta neurons were modeled as stochastic spike response neurons with firing rate that depends exponentially on the membrane voltage the membrane potential uk of neuron at time is given by uk wki xi βk where xi denotes the unweighted input from input neuron wki denotes the efficacy of the synapse from input neuron and βk denotes homeostatic adaptation current see below the input xi models the additive excitatory postsynaptic current from neuron in our simulations we used kernel with time constants τm and τs the instantaneous firing rate ρk of network neuron depends exponentially on the membrane potential and is subject to divisive lateral inhibition ilat described below ρk ilatρnet exp uk where ρnet scales the firing rate of neurons spike trains were then drawn from independent poisson processes with instantaneous rate ρk for each neuron divisive inhibition between the neurons in the wta network was implemented in an idealized form pk ilat exp ul in addition each output spike caused slow depressing current giving rise to the adaptation current βk this implements slow homeostatic mechanism that regulates the output rate of individual neurons see for details the wta network defined above implicitly defines generative model inputs xn are assumed to be generated in dependence on the value of hidden multinomial random variable hn that can take on possible values each neuron in the wta circuit corresponds to one value of this hidden variable one obtains the probability of an input vector for given hidden cause as pn xn oisson xni with scaling parameter in other words the synaptic weight wki encodes in the firing rate of input neuron given that the hidden cause is the network implements inference in this generative model for given input xn the firing rate of network neuron zk is proportional to the posterior probability hn of the corresponding hidden cause online maximum likelihood learning is realized through the synaptic update rule see which realizes here the second term of eq log pn xn sk xi ewki where sk denotes the spike train of the kth neuron and xi denotes the value of the sum of epsps from presynaptic neuron at time in response to pattern xn simulation results here we consider network that allows us to study the of connections between hidden neurons additional details to this experiment and further analyses of the synaptic sampling model can be found in the 
architecture of the network is illustrated in fig it consists of eight wta circuits with arbitrary excitatory synaptic connections between neurons within the same or different ones of these wta circuits two populations of auditory and visual input neurons xa and xv project onto corresponding populations za and zv of hidden neurons each consisting of four wta circuits with neurons see lower panel of fig the hidden neuron populations receive exclusively auditory za neurons or visual inputs zv neurons and in addition arbitrary lateral excitatory connections between all hidden neurons are allowed this network models sensory integration and association in simplified manner biological neural networks are astonishingly robust against perturbations and lesions to investigate the inherent compensation capability of synaptic sampling we applied two lesions to the network within learning session of hours of equivalent biological time the network was trained by repeatedly drawing random instances of spoken and written digits of the same type digit or taken from mnist and utterances of speaker from ti and simultaneously presenting poisson spiking representations of these input patterns to the network fig shows example firing rates for one input pair input spikes were randomly drawn according to these rates firing rates of visual input neurons were kept fixed throughout the duration of the auditory stimulus in the first lesion we removed all neurons out of that became tuned for digit in the preceding learning the reconstruction performance of the network was measured through the capability of linear readout neuron which received input only from zv during these test trials only the auditory stimulus was presented the remaining utterances of speaker were used as test set and visual input neurons were clamped to background noise the lesion significantly impaired the performance of the network in stimulus reconstruction but it was able to recover from the lesion after about one hour of continuing network plasticity see fig in the second lesion all synaptic connections between hidden neurons that were present after recovery from the first lesion were removed and not allowed to regrow synapses in total after figure inherent compensation for network perturbations illustration of the network architecture recurrent spiking neural network received simultaneously spoken and handwritten spiking representations of the same digit first three pca components of the temporal evolution subset of the network parameters after each lesion the network parameters migrate to new manifold the generative reconstruction performance of the visual neurons zv for the test case when only an auditory stimulus is presented was tracked throughout the whole learning session colors of learning phases as in after each lesion the performance strongly degrades but reliably recovers learning with zero temperature dashed yellow or with approximate hmm learning dashed purple performed significantly worse insets at the top show the synaptic weights of neurons in zv at time points projected back into the input space network diagrams in the middle show ongoing network rewiring for synaptic connections between the hidden neurons each arrow indicates functional connection between two neurons only randomly drawn subset shown the neuron whose parameters are tracked in is highlighted in red numbers under the network diagrams show the total number of functional connections between hidden neurons at the time point about two hours of continuing synaptic 
sampling new synaptic connections between hidden neurons emerged these connections made it again possible to infer the auditory stimulus from the activity of the remaining hidden neurons in the population zv in the absence of input from the population xv the classification performance was around see bottom of fig in fig we track the temporal evolution of subset of network parameters parameters θi associated with the potential synaptic connections of the neuron marked in red in the middle of fig from or to other hidden neurons excluding those that were removed at lesion and not allowed to regrow the first three pca components of this parameter vector are shown the vector fluctuates first within one region of the parameter space while probing different solutions to the learning problem high probability regions of the posterior distribution blue trace each lesions induced fast switch to different region red green accompanied by recovery of the visual stimulus reconstruction performance see fig the network therefore compensates for perturbations by exploring new parameter spaces without the noise and the prior the same performance could not be reached for this experiment fig shows the result for the approximate hmm learning which is deterministic learning approach without prior using this approach the network was able to learn representations of the handwritten and spoken digits however these representation and the associations between them were not as distinctive as for synaptic sampling and the classification performance was significantly worse only first learning phase shown we also evaluated this experiment with deterministic version of synaptic sampling here the stochasticity inherent to the wta circuit was sufficient to overcome the first lesion however the performance was worse in the last learning phase after removing all active lateral synapses in this situation the random exploration of the parameter space that is inherent to synaptic sampling significantly enhanced the speed of the recovery discussion we have shown that stochasticity may provide an important function for network plasticity it enables networks to sample parameters from the posterior distribution that represents attractive combinations of structural constraints and rules such as sparse connectivity and distributions of synaptic weights and good fit to empirical evidence sensory inputs the resulting rules for synaptic plasticity contain prior distributions over parameters potential functional benefits of priors on emergent selectivity of neurons have recently been demonstrated in for restricted boltzmann machine the mathematical framework that we have presented provides normative model for evaluating empirically found stochastic dynamics of network parameters and for relating specific properties of this noise to functional aspects of network learning some systematic dependencies of changes in synaptic weights for the same pairing of and postsynaptic activity on their current values had already been reported in these can be modeled as the impact of priors in our framework models of learning via sampling from posterior distribution have been previously studied in machine learning and the underlying theoretical principles are well known in physics see section of the theoretical framework provided in this paper extends these previous models for learning by introducing the temperature parameter and by allowing to control the sampling speed in dependence of the current parameter setting through θi furthermore our model 
combines for the first time automatic rewiring in neural networks with bayesian inference via sampling the functional consequences of these mechanism are further explored in the postulate that networks should learn posterior distributions of parameters rather than maximum likelihood values had been proposed for artificial neural networks since such organization of learning promises better generalization capability to new examples the open problem of how such posterior distributions could be learned by networks of neurons in the brain in way that is consistent with experimental data has been highlighted in as key challenge for computational neuroscience we have presented here model whose primary innovation is to view experimentally found variability and ongoing fluctuations of parameters no longer as nuisance but as functionally important component of the organization of network learning this model may lead to better understanding of such noise and seeming imperfections in the brain it might also provide an important step towards developing algorithms for upcoming new technologies implementing analog spiking hardware which employ noise and variability as computational resource acknowledgments written under partial support of the european union project the human brain project hbp and project fwf pneuma we would like to thank seth grant christopher harvey jason maclean and simon rumpel for helpful comments references hatfield perception as unconscious inference in perception and the physical world psychological and philosophical issues in perception wiley pouget beck jm ma wj latham pe probabilistic brains knowns and unknowns nature neuroscience winkler denham mill tm bendixen multistability in auditory stream segregation predictive coding view phil trans soc biol sci brea senn pfister jp sequence learning with hidden units in spiking neural networks in nips vol rezende dj gerstner stochastic variational learning in recurrent spiking networks frontiers in computational neuroscience nessler pfeiffer maass stdp enables spiking neurons to detect hidden causes of their inputs in nips vol mackay dj bayesian interpolation neural computation bishop cm pattern recognition and machine learning new york springer holtmaat aj trachtenberg jt wilbrecht shepherd gm zhang knott gw et al transient and persistent dendritic spines in the neocortex in vivo neuron loewenstein kuras rumpel multiplicative dynamics underlie the emergence of the distribution of spine sizes in the neocortex in vivo neurosci marder variability compensation and modulation in neurons and circuits pnas gardiner cw handbook of stochastic methods ed springer welling teh yw bayesian learning via stochastic gradient langevin dynamics in proceedings of the international conference on machine learning sato nakagawa approximation analysis of stochastic gradient langevin dynamics by using fokkerplanck equation and ito process in nips kappel nessler maass stdp installs in circuits an online approximation to hidden markov model learning plos comp biol jolivet rauch gerstner predicting spike timing of neocortical pyramidal neurons by simple threshold models comp neurosci mensi naud gerstner from stochastic nonlinear to generalized linear models in nips vol gerstner kistler wm spiking neuron models cambridge university press carandini from circuits to behavior bridge too far nature neurosci habenschuss bill nessler homeostatic plasticity in bayesian spiking networks as expectation maximization with posterior constraints in nips vol habenschuss puhr 
and Maass. Emergence of optimal decoding of population codes through STDP. Neural Computation.
Kappel, Habenschuss, Legenstein and Maass. Network plasticity as Bayesian inference. PLoS Computational Biology.
Xiong, Szedmak and Piater. Towards sparsity and selectivity: Bayesian learning of restricted Boltzmann machine for early visual features. In ICANN.
Bi GQ and Poo MM. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience.
Sjöström PJ, Turrigiano GG and Nelson SB. Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron.
Montgomery JM, Pavlidis and Madison DV. Pair recordings reveal synaptic connections and the postsynaptic expression of long-term potentiation. Neuron.
Kennedy AD. The hybrid Monte Carlo algorithm on parallel computers. Parallel Computing.
Schemmel, Gruebl et al. Implementing synaptic plasticity in a VLSI spiking neural network model. In IJCNN.
Bill and Legenstein. A compound memristive synapse model for statistical learning through STDP in spiking neural networks. Frontiers in Neuroscience.
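As a concrete handle on the kind of stochastic parameter dynamics discussed in the synaptic sampling paper above (a drift given by the gradient of a log-posterior plus a temperature-scaled noise term), the following toy sketch runs discretised Langevin updates on a one-dimensional Gaussian model. It is only an illustration, not the authors' simulation code: the model, prior width, learning rate and temperature values are all assumptions made for the example, and a full synaptic-sampling network would replace these toy gradients with the STDP-based plasticity rules and rewiring described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: samples from a 1-D Gaussian whose mean we want to infer.
data = rng.normal(loc=2.0, scale=1.0, size=200)

def grad_log_prior(theta, sigma_prior=3.0):
    # Gaussian prior on theta, centred at 0 (illustrative choice).
    return -theta / sigma_prior**2

def grad_log_likelihood(theta, x, sigma=1.0):
    # Gradient of the Gaussian log-likelihood with respect to the mean.
    return np.sum(x - theta) / sigma**2

def langevin_sample(theta0, n_steps=5000, lr=1e-4, temperature=1.0):
    """Discretised Langevin dynamics:
    theta <- theta + lr * (grad log prior + grad log likelihood) + noise,
    where the noise variance is 2 * lr * temperature.
    With temperature = 0 this reduces to deterministic ascent to the MAP
    solution; with temperature = 1 the stationary distribution is
    (approximately) the posterior."""
    theta = theta0
    trace = []
    for _ in range(n_steps):
        drift = grad_log_prior(theta) + grad_log_likelihood(theta, data)
        noise = np.sqrt(2.0 * lr * temperature) * rng.normal()
        theta = theta + lr * drift + noise
        trace.append(theta)
    return np.array(trace)

posterior_samples = langevin_sample(theta0=-5.0, temperature=1.0)
map_trajectory = langevin_sample(theta0=-5.0, temperature=0.0)
print("posterior mean estimate:", posterior_samples[1000:].mean())
print("MAP estimate:", map_trajectory[-1])
```

With temperature 1 the trajectory keeps fluctuating around the posterior mode (sampling and exploration), while with temperature 0 it settles at a single point, mirroring the comparison with zero-temperature learning reported above.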
Accelerated Proximal Gradient Methods for Nonconvex Programming
Huan Li, Zhouchen Lin
Key Lab of Machine Perception (MOE), School of EECS, Peking University, China
Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, China
lihuanss, zlin

Abstract
Nonconvex and nonsmooth problems have recently received considerable attention in signal processing, statistics and machine learning. However, solving nonconvex and nonsmooth optimization problems remains a big challenge. Accelerated proximal gradient (APG) is an excellent method for convex programming, but it is still unknown whether the usual APG can ensure convergence to a critical point in nonconvex programming. In this paper we extend APG to general nonconvex and nonsmooth programs by introducing a monitor that satisfies the sufficient-descent property. Accordingly, we propose a monotone APG and a nonmonotone APG; the latter waives the requirement of a monotonic reduction of the objective function and needs less computation in each iteration. To the best of our knowledge, we are the first to provide APG-type algorithms for general nonconvex and nonsmooth problems ensuring that every accumulation point is a critical point, while the O(1/k^2) convergence rate remains when the problems are convex, where k is the number of iterations. Numerical results testify to the advantage of our algorithms in speed.

1 Introduction
In recent years, sparse and low-rank learning has been a hot research topic that leads to a wide variety of applications in signal processing, statistics and machine learning. The l1 norm and the nuclear norm, as the continuous and convex surrogates of the l0 norm and the rank, respectively, have been used extensively in the literature; see the recent collections. Although the l1 norm and the nuclear norm have achieved great success, in many cases they are suboptimal, as they can promote sparsity and low-rankness only under very limited conditions. To address this issue, many nonconvex regularizers have been proposed, such as the lp penalty, the capped-l1 penalty, the minimax concave penalty, the Geman penalty, the smoothly clipped absolute deviation (SCAD) and the l0 norm. This trend motivates a revived interest in the analysis and design of algorithms for solving nonconvex and nonsmooth problems, which can be formulated as

  min_x F(x) = f(x) + g(x),   (1)

where f is differentiable (it can be nonconvex) and g can be both nonconvex and nonsmooth.

Accelerated gradient methods have been at the heart of convex optimization research. In a series of celebrated works, several accelerated gradient methods were proposed for problem (1) with convex f and g; in these methods O(1/sqrt(epsilon)) iterations are sufficient to find a solution within an epsilon error from the optimal objective value. Recently, Ghadimi and Lan presented a unified treatment of the accelerated gradient method (UAG) for convex, nonconvex and stochastic optimization.

Table 1: Comparisons of GD (a general descent method), iPiano, GIST, GDPA, IR, IFB, APG, UAG and our method for problem (1). The measurements include the assumptions, whether the methods accelerate for convex programs (CP), and whether they converge for nonconvex programs (NCP).

Method   Assumption                   Accelerate (CP)   Converge (NCP)
GD       KL                           no                yes
iPiano   nonconvex f, convex g        no                yes
GIST     nonconvex f, convex g        no                yes
GDPA     nonconvex f, convex g        no                yes
IR       special f and g              no                yes
IFB      nonconvex f, nonconvex g     no                yes
APG      convex f, convex g           yes               unclear
UAG      nonconvex f, convex g        yes               yes
Ours     nonconvex f, nonconvex g     yes               yes

They proved that their algorithm converges in nonconvex programming with nonconvex f but convex g, and accelerates with an O(1/k^2) convergence rate in convex programming; for problem (1), a convergence rate in terms of the gradient mapping is also analyzed in that work. Attouch et al. proposed a unified framework to prove the convergence of a general class of descent methods using the Kurdyka-Łojasiewicz (KL)
inequality for problem and frankel et al studied the convergence rates of general descent methods under the assumption that the desingularising function in kl property has the form of cθ tθ typical example in their framework is the proximal gradient method however there is no literature showing that there exists an accelerated gradient method satisfying the conditions in their framework other typical methods for problem includes inertial ifb ipiano general iterative shrinkage and thresholding gist gradient descent with proximal average gdpa and iteratively reweighted algorithms ir table demonstrates that the existing methods are not ideal gd and ifb can not accelerate the convergence for convex programs gist and gdpa require that should be explicitly written as difference of two convex functions ipiano demands the convexity of and ir is suitable for some special cases of problem apg can accelerate the convergence for convex programs however it is unclear whether apg can converge to critical points for nonconvex programs uag can ensure the convergence for nonconvex programming however it requires to be convex this restricts the applications of uag to solving nonconvexly regularized problems such as sparse and low rank learning to the best of our knowledge extending the accelerated gradient method for general nonconvex and nonsmooth programs while keeping the convergence rate in the convex case remains an open problem in this paper we aim to extend beck and teboulle apg to solve general nonconvex and nonsmooth problem apg first extrapolates point yk by combining the current point and the previous point then solves proximal mapping problem when extending apg to nonconvex programs the chief difficulty lies in the extrapolated point yk we have little restriction on yk when the convexity is absent in fact yk can be arbitrarily larger than xk when yk is bad extrapolation especially when is oscillatory when is computed by proximal mapping at bad yk may also be arbitrarily larger than xk beck and teboulle monotone apg ensures xk however this is not enough to ensure the convergence to critical points to address this issue we introduce monitor satisfying the sufficient descent property to prevent bad extrapolation of yk and then correct it by this monitor in summary our contributions include we propose algorithms for general nonconvex and nonsmooth programs we first extend beck and teboulle monotone apg by replacing their descent condition with sufficient descent condition this critical change ensures that every accumulation point is critical point our monotone apg satisfies some modified conditions for the framework of and thus stronger results on convergence rate can be obtained under the kl except for the work under the kl assumption convergence for nonconvex problems in this paper and the references of this paper means that every accumulation point is critical point assumption then we propose nonmonotone apg which allows for larger stepsizes when line search is used and reduces the average number of proximal mappings in each iteration thus it can further speed up the convergence in practice for our apgs the convergence rates maintain when the problems are convex this result is of great significance when the objective function is locally convex in the neighborhoods of local minimizers even if it is globally nonconvex preliminaries basic assumptions note that function rn is said to be proper if dom where dom is lower semicontinuous at point if lim inf in problem we assume that is proper function 
with lipschitz continuous gradients and is proper and lower semicontinuous we assume that is coercive is bounded from below and when kxk where is the kl inequality definition function rn is said to have the kl property at rn if there exists neighborhood of and function φη such that for all rn the following inequality holds dist where φη stands for class of function satisfying is concave and on is continuous at and all functions and subanalytic functions satisfy the kl property specially the desingularising function of functions can be chosen to be the form of cθ tθ with typical functions include real polynomial functions kxkp with rank the indicator function of psd cone stiefel manifolds and constant rank matrices review of apg in the convex case we first review apg in the convex case bech and teboulle extend nesterov accelerated gradient method to the nonsmooth case it is named the accelerated proximal gradient method and consists of the following steps xk tk proxαk yk αk yk tk yk xk where the proximal mapping is defined as proxαg argminu kx apg is not monotone algorithm which means that may not be smaller than xk so beck and teboulle further proposed monotone apg which consists of the following steps zk xk xk tk tk proxαk yk αk yk tk if xk xk otherwise yk xk apgs for nonconvex programs in this section we propose two algorithms for general nonconvex nonsmooth problems we establish the convergence in the nonconvex case and the convergence rate in the convex case when the kl property is satisfied we also provide stronger results on convergence rate monotone apg we give two reasons that result in the difficulty of convergence analysis on the usual apg for nonconvex programs yk may be bad extrapolation in only descent property xk is ensured to address these issues we need to monitor and correct yk when it has the potential to fail and the monitor should enjoy the property of sufficient descent which is critical to ensure the convergence to critical point as is known proximal gradient methods can make sure sufficient descent cf so we use proximal gradient step as the monitor more specially our algorithm consists of the following steps yk xk zk xk xk tk tk proxαy yk αy yk proxαx xk αx xk tk if otherwise where αy and αx can be fixed constants satisfying αy and αx or dynamically computed by backtracking line search initialized by is the lipschitz constant of our algorithm is an extension of beck and teboulle monotone apg the difference lies in the extra as the role of monitor and the correction step of in is compared with xk while in is compared with further difference is that beck and teboulle algorithm only ensures descent while our algorithm makes sure sufficient descent which means xk xk where is small constant it is not difficult to understand that only the descent property can not ensure the convergence to critical point in nonconvex programming we present our convergence result in the following theorem let be proper function with lipschitz continuous gradients and be proper and lower semicontinuous for nonconvex and nonconvex nonsmooth assume that is coercive then xk and vk generated by are bounded let be any accumulation point of xk we have is critical point remarkable aspect of our is that although we have made some modifications on beck and teboulle algorithm the convergence rate in the convex case still holds similar to theorem in we have the following theorem on the accelerated convergence in the convex case theorem for convex and assume that is lipschitz continuous let be any global 
optimum then xk generated by satisfies xn αy when the objective function is locally convex in neighborhood of local minimizers theorem means that apg can ensure to have an convergence rate when approaching to local minimizer thus accelerating the convergence for better reference we summarize the proposed monotone apg algorithm in algorithm for the detail of line search with initializtion please see supplementary materials the proofs in this paper can be found in supplementary materials algorithm monotone apg initialize αy αx for do update yk and by end for convergence rate under the kl assumption the kl property is powerful tool and is studied by and for class of general descent methods the usual apg in does not satisfy the sufficient descent property which is crucial to use the kl property and thus has no conclusions under the kl assumption on the other hand due to the intermediate variables yk vk and zk our algorithm is more complex than the general descent methods and also does not satisfy the conditions therein however due to the step and some modified can be satisfied and we can still get some exciting results under the kl assumption with the same framework of we have the following theorem theorem let be proper function with lipschitz continuous gradients and be proper and lower semicontinuous for nonconvex and nonconvex nonsmooth assume that is coercive if we further assume that and satisfy the kl property and the desingularising function has the form of cθ tθ for some then if then there exists such that xk for all and the algorithm terminates in finite steps if then there exists such that for all xk if then there exists such that for all xk where is the same function value at all the accumulation points of xk rk vk and min when is function the desingularising function can be chosen to be the form of cθ tθ with in this case as shown in theorem our algorithm converges in finite iterations when converges with linear rate when and sublinear rate at least when for the gap xk this is the same as the results mentioned in although our algorithm does not satisfy the conditions therein nonmonotone apg algorithm is monotone algorithm when the problem is monotone algorithm has to creep along the bottom of narrow curved valley so that the objective function value does not increase resulting in short stepsizes or even zigzagging and hence slow convergence removing the requirement on monotonicity can improve convergence speed because larger stepsizes can be adopted when line search is used on the other hand in algorithm we need to compute and in each iteration and use to monitor and correct this is conservative strategy in fact we can accept as directly if it satisfies some criterion showing that yk is good extrapolation then is computed only when this criterion is not met thus we can reduce the average number of proximal for the details of difference please see supplementary materials mappings accordingly the computation cost in each iteration so in this subsection we propose nonmonotone apg to speed up convergence in monotone apg is ensured in nonmonotone apg we allow to make larger objective function value than xk specifically we allow to yield an objective function value smaller than ck relaxation of xk ck should not be too far from xk so the average of xk is good choice thus we follow to define ck as convex combination of xk with exponentially decreasing weights pk xj ck pk where controls the degree of nonmonotonicity in practice ck can be efficiently computed by the following recursion ηqk 
ηqk ck where and according to we can split into two parts by the different choices of accordingly in nonmonotone apg we consider the following two conditions to replace ck yk ck xk we choose as the criteria mentioned before when holds we deem that yk is good extrapolation and accept directly then we do not compute in this case however does not hold all the time when it fails we deem that yk may not be good extrapolation in this case we compute by satisfying and then monitor and correct by is ensured when αx when backtracking line search is used such that satisfies can be found in finite combing and we have ck yk similarly replacing and by and respectively we have ck xk this means that we replace the sufficient descent condition of xk in by the sufficient descent of ck we summarize the nonmonotone apg in algorithm similar to monotone apg nonmonotone apg also enjoys the convergence property in the nonconvex case and the convergence rate in the convex case we present our convergence result in theorem theorem still holds for algorithm with no modification so we omit it here define kj and mj such that in algorithm holds and is executed for all kj for all mj does not hold and is executed then we have and the following theorem holds theorem let be proper function with lipschitz continuous gradients and be proper and lower semicontinuous for nonconvex and nonconvex nonsmooth assume that is coercive then xk vk and ykj where kj generated by algorithm are bounded and if or is finite then for any accumulation point of xk we have see lemma in supplementary materials please see supplementary materials for nonmonotone apg with line search algorithm nonmonotone apg initialize αx for do yk xk xk tk zk xk tk proxαy yk αy yk if ck yk then else proxαx xk αx xk if otherwise end if tk ηqk ηqk end for αy if and are both infinite then for any accumulation point of xkj of ykj where kj and any accumulation point of vmj of xmj where mj we have and numerical results in this section we test the performance of our algorithm on the problem of sparse logistic regression lr lr is an attractive extension to lr as it can reduce overfitting and perform feature selection simultaneously sparse lr is widely used in areas such as bioinformatics and text categorization in this subsection we follow gong et al to consider sparse lr with nonconvex regularizer min log exp xti we choose as the capped penalty defined as min we compare monotone apg mapg and nonmonotone apg nmapg with monotone mgist nonmonotone gist nmgist and ifb we test the performance on the data which contains samples of dimensions we follow to set and the starting point as zero vectors in nmapg we set in ifb the inertial parameter is set at and the lipschitz constant is computed by backtracking to make fair comparison we first run mgist the algorithm is terminated when the relative change of two consecutive objective function values is less than or the number of iterations exceeds this termination condition is the same as in then we run nmgist mapg nmapg and ifb these four algorithms are terminated when they achieve an equal or smaller objective function value than that by mgist or the number of iterations exceeds we randomly choose of the data as training data and the rest as test data the experiment result is averaged over runs all algorithms are run on matlab and windows with an intel core ghz cpu and memory the result is reported in table we also plot the curves of objective function values iteration number and cpu time in figure for the sake of space limitation 
we leave another experiment sparse pca in supplementary materials http http table comparisons of apg gist and ifb on the sparse logistic regression problem the quantities include number of iterations averaged number of line searches in each iteration computing time in seconds and test error they are averaged over runs method iter line search time test error mgist nmgist ifb mapg nmapg we have the following observations methods need much fewer iterations and less computing time than gist and ifb to reach the same or smaller objective function values as gist is indeed proximal gradient method pg and ifb is an extension of pg this verifies that apg can indeed accelerate the convergence in practice nmapg is faster than mapg we give two reasons nmapg avoids the computation of vk in most of the time and reduces the number of line searches in each iteration we mention that in mapg line search is performed in both and while in nmapg only the computation of needs line search in every iteration is computed only when necessary we note that the average number of line searches in nmapg is nearly one this means that holds in most of the time so we can trust that zk can work well in most of the time and only in few times vk is computed to correct zk and yk on the other hand nonmonotonicity allows for larger stepsizes which results in fewer line searches mgist nmgist ifb mapg nmapg function value function value mgist nmgist ifb mapg nmapg iteration objective function value iteration cpu time objective function value time figure compare the objective function value produced by apg gist and ifb conclusions in this paper we propose two algorithms for efficiently solving general nonconvex nonsmooth problems which are abundant in machine learning we provide detailed convergence analysis showing that every accumulation point is critical point for general nonconvex nonsmooth programs and the convergence rate is maintained at for convex programs nonmonotone apg allows for larger stepsizes and needs less computation cost in each iteration and thus is faster than monotone apg in practice numerical experiments testify to the advantage of the two algorithms acknowledgments zhouchen lin is supported by national basic research program of china program grant no national natural science foundation nsf of china grant nos and and microsoft research asia collaborative research program he is the corresponding author references yun editor and sparse modeling for visual analysis springer candes wakin and boyd enhancing sparsity by reweighted minimization journal of fourier analysis and applications zhang analysis of convex relaxation for sparse regularization the journal of machine learning rearch foucart and lai sparsest solutions of underdeterminied linear systems via lq minimization for applied and computational harmonic analysis zhang nearly unbiased variable selection under minimax concave penalty the annals of statistics geman and yang nonlinear image recovery with regularization ieee transactions on image processing fan and li variable selection via nonconcave penalized likelihood and its oracle properties journal of the american statistical association mohan and fazel iterative reweighted algorithms for matrix rank minimization the journal of machine learning research nesterov method for unconstrained convex minimization problem with the rate of convergence soviet mathematics doklady nesterov smooth minimization of nonsmooth functions mathematical programming nesterov gradient methods for minimizing composite objective 
functions. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain.
Beck and Teboulle. Fast algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing.
Beck and Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences.
Tseng. On accelerated proximal gradient methods for convex-concave optimization. Technical report, University of Washington, Seattle.
Ghadimi and Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. arXiv preprint.
Attouch, Bolte and Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming.
Frankel, Garrigos and Peypouquet. Splitting methods with variable metric for Kurdyka-Łojasiewicz functions and general convergence rates. Journal of Optimization Theory and Applications.
Ochs, Chen, Brox and Pock. iPiano: inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences.
Gong, Zhang, Lu, Huang and Ye. A general iterative shrinkage and thresholding algorithm for nonconvex regularized optimization problems. In ICML.
Zhong and Kwok. Gradient descent with proximal average for nonconvex and composite regularization. In AAAI.
Ochs, Dosovitskiy, Brox and Pock. On iteratively reweighted algorithms for optimization in computer vision. SIAM Journal on Imaging Sciences.
Bot, Csetnek et al. An inertial forward-backward algorithm for the minimization of the sum of two nonconvex functions. Preprint.
Bolte, Sabach and Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming.
Zhang and Hager. A nonmonotone line search technique and its application to unconstrained optimization. SIAM Journal on Optimization.
Shevade and Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics.
Genkin, Lewis and Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics.
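To make the update pattern of the monotone APG described in the paper above concrete, here is a small self-contained sketch in Python/NumPy. It follows the steps given in the text (extrapolation to y_k, a proximal step at y_k giving z_{k+1}, a plain proximal-gradient "monitor" step at x_k giving v_{k+1}, and acceptance of whichever point has the smaller objective), but it is only an illustration, not the authors' implementation: it uses a conservative fixed step size in place of backtracking line search, a least-squares loss in place of the logistic loss used in the paper's experiments, and illustrative values for the capped-l1 parameters; the elementwise proximal map of the capped-l1 penalty is computed here by comparing the minimizers of its two branches.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 200))          # synthetic design matrix (illustrative)
b = rng.normal(size=100)
lam, theta = 0.1, 0.5                    # capped-l1 parameters (illustrative)
L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of grad f
alpha = 0.5 / L                          # conservative fixed step (paper uses line search)

def f(x):      return 0.5 * np.sum((A @ x - b) ** 2)               # smooth part
def grad_f(x): return A.T @ (A @ x - b)
def g(x):      return lam * np.sum(np.minimum(np.abs(x), theta))   # capped-l1 penalty
def F(x):      return f(x) + g(x)

def prox_capped_l1(u, step):
    """Elementwise argmin_t 0.5*(t-u)^2 + step*lam*min(|t|, theta),
    found by comparing the minimizers of the two branches of min(|t|, theta)."""
    le = lam * step
    # Branch |t| >= theta: the penalty is the constant le*theta.
    t1 = np.sign(u) * np.maximum(np.abs(u), theta)
    f1 = 0.5 * (t1 - u) ** 2 + le * theta
    # Branch |t| <= theta: soft-thresholding, clipped to [-theta, theta].
    t2 = np.clip(np.sign(u) * np.maximum(np.abs(u) - le, 0.0), -theta, theta)
    f2 = 0.5 * (t2 - u) ** 2 + le * np.abs(t2)
    return np.where(f1 <= f2, t1, t2)

def monotone_apg(n_iter=300):
    x_prev = x = z = np.zeros(A.shape[1])
    t_prev, t = 0.0, 1.0
    for _ in range(n_iter):
        # Extrapolation.
        y = x + (t_prev / t) * (z - x) + ((t_prev - 1.0) / t) * (x - x_prev)
        # Proximal step at the extrapolated point ...
        z = prox_capped_l1(y - alpha * grad_f(y), alpha)
        # ... and the monitor: a plain proximal-gradient step at x.
        v = prox_capped_l1(x - alpha * grad_f(x), alpha)
        # Accept whichever candidate gives the smaller objective value.
        x_prev, x = x, z if F(z) <= F(v) else v
        t_prev, t = t, (np.sqrt(4.0 * t ** 2 + 1.0) + 1.0) / 2.0
    return x

x_hat = monotone_apg()
print("objective:", F(x_hat), "nonzeros:", np.count_nonzero(x_hat))
```

If the capped-l1 penalty is replaced by the plain l1 norm (whose proximal map is ordinary soft-thresholding), the same loop reduces to an accelerated proximal gradient method for a convex problem, which is the sense in which the extension preserves the convex behaviour.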
approximating sparse pca from incomplete data abhisek kundu petros drineas malik abstract we study how well one can recover sparse principal components of data matrix using sketch formed from few of its elements we show that for wide class of optimization problems if the sketch is close in the spectral norm to the original data matrix then one can recover near optimal solution to the optimization problem by using the sketch in particular we use this approach to obtain sparse principal components and show that for data points in dimensions max elements gives an approximation to the sparse pca problem is the stable rank of the data matrix we demonstrate our algorithms extensively on image text biological and financial data the results show that not only are we able to recover the sparse pcas from the incomplete data but by using our sparse sketch the running time drops by factor of five or more introduction principal components analysis constructs low dimensional subspace of the data such that projection of the data onto this subspace preserves as much information as possible or equivalently maximizes the variance of the projected data the earliest reference to principal components analysis pca is in since then pca has evolved into classic tool for data analysis challenge for the interpretation of the principal components or factors is that they can be linear combinations of all the original variables when the original variables have direct physical significance genes in biological applications or assets in financial applications it is desirable to have factors which have loadings on only small number of the original variables these interpretable factors are sparse principal components spca the question we address is not how to better perform sparse pca rather it is whether one can perform sparse pca on incomplete data and be assured some degree of success can we do sparse pca when we have small sample of data points and those data points have missing features incomplete data is situation that one is confronted with all too often in machine learning for example with data one does not have all the ratings of any given user or in privacy preserving setting client may not want to give us all entries in the data matrix in such setting our goal is to show that if the samples that we do get are chosen carefully the sparse pca features of the data can be recovered within some provable error bounds significant part of this work is to demonstrate our algorithms on variety of data sets more formally the data matrix is data points in dimensions data matrices often have low effective rank let ak be the best approximation to in practice it is often possible to choose small value of for which ka ak is small the best approximation ak is obtained by projecting onto the subspace spanned by its principal components vk which is the matrix containing the right singular vectors of these principal department of computer science rensselaer polytechnic institute troy ny department of computer science rensselaer polytechnic institute troy ny drinep department of computer science rensselaer polytechnic institute troy ny magdon components are the solution to the variance maximization problem vk trace vt at av arg max vt we denote the maximum variance attainable by optk which is the sum of squares of the topk singular values of to get sparse principal components we add sparsity constraint to the optimization problem every column of should have at most entries the sparsity parameter is an input sk trace vt at av arg max 
vt kv the sparse pca problem is itself very hard problem that is not only but also inapproximable there are many heuristics for obtaining sparse factors including some approximation algorithms with provable guarantees the existing research typically addresses the task of getting just the top principal component some exceptions are while the sparse pca problem is hard and interesting it is not the focus of this work we address the question what if we do not know but only have sparse sampling of some of the entries in incomplete data the sparse sampling is used to construct sketch of denoted there is not much else to do but solve the sparse pca problem with the sketch instead of the full data to get trace vt arg max vt kv we study how performs as an approximation to sk with respective to the objective that we are trying to optimize namely trace st at as the quality of approximation is measured with respect to the true we show that the quality of approximation is controlled by how well approximates at as measured by the spectral norm of the deviation at this is general result that does not rely on how one constructs the sketch theorem sparse pca from sketch let sk be solution to the sparse pca problem that solves and solution to the sparse pca problem for the sketch which solves then trace at trace sk at ask theorem says that if we can closely approximate with then we can compute from sparse components which capture almost as much variance as the optimal sparse components computed from the full data in our setting the sketch is computed from sparse sampling of the data elements in incomplete data to determine which elements to sample and how to form the sketch we leverage some recent results in elementwise matrix completion in nutshell if one samples larger data elements with higher probability than smaller data elements then for the resulting sketch the error kat will be small the details of the sampling scheme and how the error depends on the number of samples is given in section combining the bound on ka from theorem in section with theorem we get our main result theorem sampling complexity for sparse pca sample from to form the sparse sketch using algorithm let sk be solution to the sparse pca problem that solves and let which solves be solution to the sparse pca problem for the sketch formed from the sampled data elements suppose the number of samples satisfies log and are dimensionless quantities that depend only on then with probability at least trace at trace sk at ask the dependence of and on are given in section roughly speaking we can ignore the term with since it is multiplied by and max where is the stable numerical rank of to paraphrase theorem when the stable rank is small constant with max samples one can recover almost as good sparse principal components as with all data the price being small fraction of the optimal variance since optk as far as we know the only prior work related to the problem we consider here is which proposed specific method to construct sparse pca from incomplete data however we develop general tool that can be used with any existing sparse pca heuristic moreover we derive much simpler bounds theorems and using matrix concentration inequalities as opposed to arguments in we also give an application of theorem to running sparse pca after denoising the data using greedy thresholding algorithm that sets the small elements to zero see theorem such denoising is appropriate when the observed matrix has been perturbed by small noise and the uncontaminated data 
matrix is sparse and contains large elements we show that if an appropriate fraction of the noisy data is set to zero one can still recover sparse principal components this gives principled approach to regularizing sparse pca in the presence of small noise when the data is sparse not only do our algorithms preserve the quality of the sparse principal components but iterative algorithms for sparse pca whose running time is proportional to the number of entries in the input matrix benefit from the sparsity of our experiments show about speed gains while producing sparse components using less than of the data discussion in summary we show that one can recover sparse pca from incomplete data while gaining computationally at the same time our result holds for the optimal sparse components from versus from one can not efficiently find these optimal components since the problem is nphard to even approximate so one runs heuristic in which case the approximation error of the heuristic would have to be taken into account our experiments show that using the incomplete data with the heuristics is just as good as those same heuristics with the complete data in practice one may not be able to sample the data but rather the samples are given to you our result establishes that if the samples are chosen with larger values being more likely then one can recover sparse pca in practice one has no choice but to run the sparse pca on these sampled elements and hope our theoretical results suggest that the outcome will be reasonable this is because while we do not have specific control over what samples we get the samples are likely to represent the larger elements for example with data users are more likely to rate items they either really like large positive value or really dislike large negative value notation we use bold uppercase for matrices and bold lowercase for column vectors the row of is and the column of is let denote the set is the expectation of random variable for matrix denotes the expectation for matrix the frobenius norm kxkf is kxkf and the spectral operator norm is we pm also have the and norms kxk and the number of entries in the largest singular value of is σk and log is the natural logarithm of sparse pca from sketch in this section we will prove theorem and give simple application to zeroing small fluctuations as way to regularize to noise in the next section we will use more sophisticated way to select the elements of the matrix allowing us to tolerate sparser matrix more incomplete data but still recovering sparse pca to reasonable accuracy theorem will be corollary of more general result for class of optimization problems involving objective function over an arbitrary not necessarily convex domain let be function that is defined for matrix variable and matrix parameter the optimization variable is in some feasible set which is arbitrary the parameter is also arbitrary we assume that is locally lipschitz in with that is kx note we allow the lipschitz constant to depend on the fixed matrix but not the variables this is more general than globally lipshitz objective the next lemma is the key tool we need to prove theorem and it may be on independent interest in other optimization settings we are interested in maximizing to obtain but we only have an approximation for and so we maximize to obtain which will be suboptimal solution with respect to we wish to bound which quantifies how suboptimal is lemma surrogate optimization bound let be lipschitz over the domain define arg arg then kx in 
the lemma the function and the domain are arbitrary in our setting the domain vt ik kv and trace vt xv we first show that is lipschitz with constant independent of let the representation of by its columns be vk then vt xv trace vt vvt σi kkx where σi is the largest singular value of we used trace inequality and the fact that vvt is projection now by lemma trace trace theorem follows by setting at and greedy thresholding we give the simplest scenario of incomplete data where theorem gives some reassurance that one can compute good sparse principal components suppose the smallest data elements have been set to zero this can happen for example if only the largest elements are measured or in noisy setting if the small elements are treated as noise and set to zero so aij recall kakf stable rank of and define kaδ let by construction kaδ then kat kat suppose the zeroing of elements only loses fraction of the energy in is selected so that kaδ that is an fraction of the total variance in has been lost in the unmeasured or zero data then theorem suppose that is created from by zeroing all elements that are less than and is such that the truncated norm satisfies kaδ then the sparse pca solution satisfies trace trace aat theorem shows that it is possible to recover sparse pca after setting small elements to zero this is appropriate when most of the elements in are small noise and few of the elements in contain large data elements for example elements of if the data consists of sparse nm large magnitude say and many nm nm small elements whose magnitude is nm high setting then kaδ and with just sparse sampling of the nm large elements very incomplete data we recover near optimal sparse pca greedily keeping only the large elements of the matrix requires particular structure in to work and it is based on crude bound for the spectral error in section we use recent results in matrix sparsification to choose the elements in randomized way with bias toward large elements with high probability one can directly bound the spectral error and hence get better performance theorem can also be proved as follows trace vt xv trace trace vt xv trace vt trace vt trace kkx trace vt trace kkx trace trace algorithm hybrid sampling input samples probabilities pij set for trials with replacement do randomly sample indices it jt with it jt pij update aij pij return with at most entries an based sketch in the previous section we created the sketch by deterministically setting the small data elements to zero instead we could randomly select the data elements to keep it is natural to bias this random sampling toward the larger elements therefore we define sampling probabilities for each data element aij which are proportional to mixture of the absolute value and square of the element pij kak kakf where is mixing parameter such sampling probability was used in to sample data elements in independent trials to get sketch we repeat the prototypical algorithm for matrix sampling in algorithm note that unlike with the deterministic zeroing of small elements in this sampling scheme one samples the element aij with probability pij and then rescales it by to see the intuition for this rescaling consider the expected outcome for single sample pij aij pij aij that is is sparse but unbiased estimate for this unbiasedness holds for any choice of the sampling probabilities pij defined over the elements of in algorithm however for an appropriate choice of the sampling probabilities we get much more than unbiasedness we can control the spectral 
norm of the deviation ka in particular the distribution in was analyzed in where they suggest an optimal choice for the mixing parameter which minimizes the theoretical bound on ka this algorithm to choose is summarized in algorithm of using the probabilities in to create the sketch using algorithm with selected using algorithm of one can prove bound for ka we state simplified version of the bound from in theorem theorem let and let be an accuracy parameter define probabilities pij as in with chosen using algorithm of let be the sparse sketch produced using algorithm with number of samples log where max kak then with probability at least and ka experiments we show the experimental performance of sparse pca from sketch using several real data matrices as we mentioned sparse pca is and so we must use heuristics these heuristics are discussed next followed by the data the experimental design and finally the results algorithms for sparse pca let ground truth denote the algorithm which computes the principal components which may not be sparse of the full data matrix the optimal variance is optk we consider six heuristics for getting sparse principal components gmax gsp hmax hsp umax usp the entries in each principal component generated by components using the spasm toolbox of with the largest entries of the principal components for the sketch components using spasm with the sketch the largest entries of the principal components for the uniformly sampled sketch components using spasm with the uniformly sampled sketch output of an algorithm is sparse principal components and our metric is trace vt at av where is the original centered data we consider the following statistics gmax relative loss of greedy thresholding versus spasm illustrating the value of good gsp sparse pca algorithm our sketch based algorithms do not address this loss relative loss of using the instead of complete data ratio close to is desired relative loss of using the uniform sketch instead of complete data benchmark to highlight the value of good sketch we also report the computation time for the algorithms we show results to confirm that sparse pca algorithms using the are nearly comparable to those same algorithms on the complete data and gain in computation time from sparse sketch is proportional to the sparsity data sets we show results on image text stock and gene expression data digit data we use the handwritten digit images in gray scale each pixel is feature normalized to be in each digit image forms row of the data matrix we focus on three digits samples samples and samples techtc data we use the technion repository of text categorization dataset techtc see from the open directory project odp we removed words features with fewer than letters each document row has unit norm stock data we use stock market data with snapshots of prices for stocks the prices of each day form row of the data matrix and principal component represents an index of sorts each stock is feature gene expression data we use gene expression data for lung cancer from the ncbi gene expression omnibus database there are samples lung tumor cases and normal lung controls forming the rows of the data matrix with probes features from the platform annotation table results we report results for primarily the top principal component which is the case most considered in the literature when our results do not qualitatively change we note the optimal mixing parameter using algorithm of for various datasets in table handwritten digits we sample approximately 
of the elements from the centered data using as well as uniform sampling the performance for small is shown in table including the running time for this data gmax gsp so it is important to use good sparse pca algorithm we see from table that the significantly outperforms the uniform sketch more extensive comparison of recovered variance is given in figure we also observe of factor of about for the we point out that the uniform sketch is reasonable for the digits data because most data elements are close to either or since the pixels are either black or white we show visualization of the principal components in figure we observe that the sparse components from the are almost identical to that of from the complete data techtc data we sample approximately of the elements from the centered data using our as well as uniform sampling for this data gmax gsp we observe very significant performance difference between the and uniform sketch more extensive comparison of recovered variance is given in figure we also observe digit techtc stock gene table comparison of sparse principal components from the and uniform sketch figure digits visualization of sparse principal components in each figure left panel shows gsp and right panel shows hsp hsp gsp usp gsp hsp gsp usp gsp sparsity constraint percent hsp gsp usp gsp sparsity constraint percent digit hsp gsp usp gsp sparsity constraint percent techtc sparsity constraint percent stock gene figure performance of sparse pca for and uniform sketch over an extensive range for the sparsity constraint the performance of the uniform sketch is significantly worse highlighting the importance of good sketch of factor of about for the unlike the digits data which is uniformly near the text data is spikey and now it is important to sample with bias toward larger elements which is why the performs very poorly as final comparison we look at the actual sparse top component with sparsity parameter the topic ids in the techtc data are us indiana evansville and us florida the features words in the full pca on the complete data are shown in table id top in gmax evansville florida south miami indiana information beach lauderdale estate spacer id other words service small frame tours faver transaction needs commercial bullet inlets producer gmax table techtc top ten words in top principal component of the complete data the other words are discovered by some of the sparse pca algorithms hmax umax gsp hsp usp table techtc relative ordering of the words gmax in the top sparse principal component with sparsity parameter in table we show which words appear in the top sparse principal component with sparsity using various sparse pca algorithms we observe that the sparse pca from the with only of the data sampled matches quite closely with the same sparse pca algorithm using the complete data matches stock data we sample about of the elements from the centered data using the sampling as well as uniform sampling for this data gmax gsp we observe very significant performance difference between the and uniform sketch more extensive comparison of recovered variance is given in figure we also observe of factor of about for the similar to techtc data this dataset is also spikey so biased sampling toward larger elements significantly outperforms the gene expression data we sample about of the elements from the centered data using the as well as uniform sampling for this data gmax gsp which means good sparse pca algorithm is imperative we observe very significant performance difference 
between the and uniform sketch more extensive comparison of recovered variance is given in figure we also observe of factor of about for the similar to techtc data this dataset is also spikey and consequently biased sampling toward larger elements significantly outperforms the performance of other sketches we briefly report on other options for sketching we consider suboptimal not from algorithm of in to construct suboptimal hybrid distribution and use this in to construct sparse sketch figure reveals that good sketch using the is important hsp hsp figure stock data performance of sketch using suboptimal to illustrate the importance of the optimal mixing parameter sparsity constraint percent conclusion it is possible to use sparse sketch incomplete data to recover nearly as good sparse principal components as one would have gotten with the complete data we mention that while gmax which uses the largest weights in the unconstrained pca does not perform well with respect to the variance it does identify good features simple enhancement to gmax is to recalibrate the sparse component after identifying the features this is an unconstrained pca problem on just the columns of the data matrix corresponding to the features this method of recalibrating can be used to improve any sparse pca algorithm our algorithms are simple and efficient and many interesting avenues for further research remain can the sampling complexity for the sparse pca be reduced from to we suspect pk that this should be possible by getting better bound on σi at we used the crude bound kkat we also presented general surrogate optimization bound which may be of interest in other applications in particular it is pointed out in that though pca optimizes variance more natural way to look at pca is as the linear projection of the data that minimizes the information loss gives efficient algorithms to find sparse linear dimension reduction that minimizes information loss the information loss of sparse pca can be considerably higher than optimal to minimize information loss the objective to maximize is trace at av av it would be interesting to see whether one can recover sparse linear projectors from incomplete data acknowledgments ak and pd are partially supported by nsf and was partially supported by the army research laboratory under cooperative agreement number the arl network science cta the views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies either expressed or implied of the army research laboratory or the government the government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation here on references asteris papailiopoulos and dimakis sparse pca with provable guarantees in proc icml cadima and jolliffe loadings and correlations in the interpretation of principal components applied statistics cai ma and wu sparse pca optimal rates and adaptive estimation the annals of statistics alexandre aspremont francis bach and laurent el ghaoui optimal solutions for sparse principal component analysis journal of machine learning research june alexandre aspremont laurent el ghaoui michael jordan and gert lanckriet direct formulation for sparse pca using semidefinite programming siam review gabrilovich and markovitch text categorization with many redundant features using aggressive feature selection to make svms competitive with in proceedings of international conference on machine learning 
Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Kundu, Drineas and Magdon-Ismail. Recovering PCA from sparse sampling of data elements.
Lei and Vu. Sparsistency and agnostic inference in sparse PCA. The Annals of Statistics.
Karim Lounici. Sparse principal component analysis with missing observations. arXiv report.
Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics.
NP-hardness and inapproximability of sparse PCA. arXiv report.
Magdon-Ismail and Boutsidis. arXiv report.
Moghaddam, Weiss and Avidan. Generalized spectral bounds for sparse LDA. In Proc. ICML.
Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine.
Haipeng Shen and Jianhua Huang. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis.
Sjöstrand, Clemmensen, Larsen and Ersbøll. SpaSM: a MATLAB toolbox for sparse statistical modeling. Journal of Statistical Software, accepted for publication.
Trendafilov, Jolliffe and Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics.
Wang, Lu and Liu. Nonconvex statistical optimization: sparse PCA in polynomial time. arXiv report.
Zou, Hastie and Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics.
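The element-wise sampling scheme at the core of the sparse PCA paper above is easy to prototype. The sketch below is a minimal illustration in Python/NumPy, not the authors' code: it forms the hybrid (l1 + l2) sampling probabilities with a fixed mixing parameter alpha (the paper selects alpha by a separate optimization), builds the rescaled sparse sketch in the manner of Algorithm 1, and then applies a simple truncation heuristic to the top singular vector of the sketch in the spirit of the Hmax baseline. The synthetic data, sample budget and sparsity level are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hybrid_sketch(A, s, alpha=0.5):
    """Element-wise sampling sketch: draw s entries i.i.d. with probabilities
    p_ij = alpha*|A_ij|/||A||_1 + (1-alpha)*A_ij^2/||A||_F^2 and add
    A_ij/(s*p_ij) to the sketch for every draw, so that E[sketch] = A."""
    p = alpha * np.abs(A) / np.abs(A).sum() + (1.0 - alpha) * A**2 / (A**2).sum()
    p_flat = p.ravel()
    draws = rng.choice(A.size, size=s, replace=True, p=p_flat)
    S = np.zeros_like(A)
    np.add.at(S.reshape(-1), draws, A.reshape(-1)[draws] / (s * p_flat[draws]))
    return S

def sparse_top_pc(M, r):
    """Truncation heuristic (Hmax-style): keep the r largest-magnitude entries
    of the top right singular vector of M and renormalise."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    v = Vt[0]
    keep = np.argsort(np.abs(v))[-r:]
    v_sparse = np.zeros_like(v)
    v_sparse[keep] = v[keep]
    return v_sparse / np.linalg.norm(v_sparse)

# Synthetic data with decaying column scales, centred as in the experiments.
A = rng.normal(size=(300, 100)) * np.linspace(3.0, 0.1, 100)
A -= A.mean(axis=0)

S = hybrid_sketch(A, s=int(0.10 * A.size))   # keep roughly 10% of the entries
for name, M in [("full data", A), ("10% sketch", S)]:
    v = sparse_top_pc(M, r=10)
    print(name, "variance captured:", float(v @ (A.T @ A) @ v))
```

In line with the theory above, the variance captured by the sparse component computed from the sketch should be close to that from the full matrix whenever the spectral error of the sketch is small; biasing the sampling toward large entries (the l1 part of the mixture) is what keeps that error controlled on spiky data.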
Nonparametric von Mises Estimators for Entropies, Divergences and Mutual Informations
Akshay Krishnamurthy, Microsoft Research, NY (akshaykr)
Kirthevasan Kandasamy, Carnegie Mellon University (kandasamy)
Larry Wasserman, Carnegie Mellon University (larry)
Barnabás Póczos, Carnegie Mellon University (bapoczos)
James Robins, Harvard University (robins)

Abstract
We propose and analyse estimators for statistical functionals of one or more distributions under nonparametric assumptions. Our estimators are derived from the von Mises expansion and are based on the theory of influence functions, which appear in the semiparametric statistics literature. We show that estimators based either on data-splitting or a leave-one-out technique enjoy fast rates of convergence and other favorable theoretical properties. We apply this framework to derive estimators for several popular information-theoretic quantities, and via empirical evaluation, show the advantage of this approach over existing estimators.

1 Introduction
Entropies, divergences and mutual informations are classical quantities that play fundamental roles in statistics, machine learning, and across the mathematical sciences. In addition to their use as analytical tools, they arise in a variety of applications including hypothesis testing, parameter estimation, feature selection and optimal experimental design. In many of these applications it is important to estimate these functionals from data so that they can be used in downstream algorithmic or scientific tasks. In this paper we develop a recipe for estimating statistical functionals of one or more nonparametric distributions based on the notion of influence functions.

Entropy estimators are used in applications ranging from independent components analysis and intrinsic dimension estimation to several signal processing applications. Divergence estimators are useful in statistical tasks such as two-sample testing; recently they have also gained popularity as they are used to measure dissimilarity between objects that are modeled as distributions, in what is known as the machine learning on distributions framework. Mutual information estimators have been used in learning Markov random fields, feature selection, clustering, and neuron classification; in the parametric setting, conditional divergence and conditional mutual information estimators are used for conditional two-sample testing or as building blocks for structure learning in graphical models. Nonparametric estimators for these quantities could potentially allow us to generalise several of these algorithms to the nonparametric domain. Our approach gives estimators for all these quantities (and many others) which often outperform the existing estimators both theoretically and empirically.

Our approach to estimating these functionals is based on a correction of a preliminary estimator using the von Mises expansion. This idea has been used before in the semiparametric statistics literature; however, most studies are restricted to functionals of one distribution and have focused on a data-splitting (DS) approach, which splits the samples between density estimation and functional estimation. While the DS estimator is known to achieve the parametric convergence rate for sufficiently smooth densities, in practical settings, as we show in our simulations, splitting the data results in poor empirical performance.

In this paper we introduce the method of influence-function-based nonparametric estimators to the machine learning community and expand on this technique in several novel and important ways. The main contributions of this paper are: we propose a leave-one-out (LOO) technique to estimate functionals of a single distribution, and we prove that it has
the same convergence rates as the ds estimator however the loo estimator has better empirical performance in our simulations since it makes efficient use of the data we extend both ds and loo methods to functionals of multiple distributions and analyse their convergence under sufficient smoothness both estimators achieve the parametric rate and the ds estimator has limiting normal distribution we prove lower bound for estimating functionals of multiple distributions we use this to establish minimax optimality of the ds and loo estimators under sufficient smoothness we use the approach to construct and implement estimators for various entropy divergence mutual information quantities and their conditional versions subset of these functionals are listed in table in the appendix our software is publicly available at we compare our estimators against several other approaches in simulation despite the generality of our approach our estimators are competitive with and in many cases superior to existing specialised approaches for specific functionals we also demonstrate how our estimators can be used in machine learning applications via an image clustering task our focus on information theoretic quantities is due to their relevance in machine learning applications rather than limitation of our approach indeed our techniques apply to any smooth functional history we provide brief history of the correction technique and influence functions we defer detailed discussion of other approaches to estimating functionals to section to our knowledge the first paper using correction estimator was that of bickel and ritov the line of work following this paper analysed integral functionals of single one dimensional density of the form recent paper by krishnamurthy et al also extends to functionals of multiple densities but only considers polynomial functionals of the form rthisαline for densities and all approaches above of use data splitting our work contributes to this line of research in two ways we extend the technique to more general class of functionals and study the empirically superior loo estimator fundamental quantity in the design of our estimators is the influence function which appears both in robust and semiparametric statistics indeed our work is inspired by that of robins et al and emery et al who propose based estimator for functionals of single distribution their analysis for nonparametric problems rely on ideas from semiparametric statistics they define influence functions for parametric models and then analyse estimators by looking at all parametric submodels through the true parameter preliminaries let be compact metric space equipped with measure the lebesgue measure let and be measures over that are absolutely continuous let be the derivatives with respect to we focus on estimating functionals of the form dµ or dµ where are real valued lipschitz functions that twice differentiable our framework permits more general functionals functionals based on the conditional densities but we will focus on this form for ease of exposition to facilitate presentation of the main definitions it is easiest to work with functionals of one distribution define to be the set of all measures that are absolutely continuous whose derivatives belong to central to our development is the von mises expansion vme which is the distributional analog of the taylor expansion for this we introduce the derivative which imposes notion of differentiability in topological spaces we then introduce the influence function 
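Concretely, one natural reading of the functionals of interest, with ν standing in for the real-valued, Lipschitz, twice-differentiable map mentioned in the text (the symbol ν is chosen here only for illustration and need not match the paper's own notation), is

\[ T(P) \;=\; \int \nu\big(p(x)\big)\, d\mu(x), \qquad T(P,Q) \;=\; \int \nu\big(p(x), q(x)\big)\, d\mu(x), \]

where p and q are the densities of P and Q with respect to µ. Shannon entropy corresponds to ν(p) = −p log p and the Kullback–Leibler divergence to ν(p, q) = p log(p/q).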
definition let and be any functional the map where is called the derivative at if the derivative exists and is linear and continuous in is differentiable at if the derivative exists at definition let ru be differentiable at function which satisfies dq is the influence function of the distribution by the riesz representation theorem the influence function exists uniquely since the domain of is bijection of and consequently hilbert space the classical work of fernholz defines the influence function in terms of the derivative by tδx δx where δx is the dirac delta function at while our functionals are defined only on distributions we can still use to compute the influence function the function computed this way can be shown to satisfy definition based on the above the first order vme is dq where is the second order remainder differentiability alone will not be sufficient for our purposes in what follows we will assign and fb where fb are the true and estimated distributions we would like to bound the remainder in terms of distance between and fb for functionals of the form we restrict the domain to be only measures with continuous densities then we can control using the metric of the densities this essentially means that our functionals satisfy stronger form of differentiability called differentiability in the metric consequently we can write all derivatives in terms of the densities and the vme reduces to functional taylor expansion on the densities lemmas in appendix dµ kp this expansion will be the basis for our estimators these ideas generalise to functionals of multiple distributions and to settings where the functional involves quantities other than the density functional of two distributions has two derivatives for formed by perturbing the ith argument with the other fixed the influence functions satisfy the vme can be written as dx dx estimating functionals first consider estimating functional of single distribution dµ from samples we wish to find an estimator tb with low expected mean squared error mse tb using the vme emery et al and robins et al suggest natural estimator if we use half of the data to construct an estimate fˆ of the density then by fˆ dµ kf fˆ as the influence function does not depend on the unknown the first term on the right hand side is simply an expectation of fˆ we can use the second half of the data to estimate this expectation with its sample mean this leads to the following preliminary estimator tbds fˆ xi fˆ we can similarly construct an estimator tbds by using for density estimation and for averaging our final estimator is obtained via tbds tbds tbds in what follows we shall refer to this estimator as the ds estimator the ds estimator for functionals of one distribution has appeared before in the statistics literature the rate of convergence of this estimator is determined by the kf fˆ error in the vme and the rate for estimating an expectation lower bounds from several literature confirm minimax optimality of the ds estimator when is sufficiently smooth the data splitting trick is common approach as the analysis is straightforward while in theory ds estimators enjoy good rates of convergence data splitting is unsatisfying from practical standpoint since using only half the data each for estimation and averaging invariably decreases the accuracy to make more effective use of the sample we propose loo version of the above estimator xi tbloo where is density estimate using all the samples except for xi we prove that the loo estimator achieves the same rate of 
convergence as the ds estimator but empirically performs much better our analysis is specialised to the case where is kernel density estimate section we can extend this method to estimate functionals of two distributions say we have samples from and samples from akin to the one distribution case we propose the following ds and loo versions tds ψf xi ψg yj fˆ tbloo max max ψf xi ψg yi here are defined similar to fˆ for the ds estimator we swap the samples to compute tbds and average for the loo estimator if we cycle through the points until we have summed over all or vice versa tbloo is asymmetric when seemingly natural alternative would be to sum over all nm pairings of xi and yj however this is computationally more expensive moreover straightforward modification of our proof in appendix shows that both approaches converge at the same rate if and are of the same order examples we demonstrate the generality of our framework by presenting estimators for several entropies divergences mutual informations and their conditional versions in table appendix for many functionals in the table these are the first computationally efficient estimators proposed we hope this table will serve as good reference for practitioners for several functionals conditional and unconditional divergence conditional mutual information the estimators are not listed only because the expressions are too long to fit into the table our software implements total of functionals which include all the estimators in the table in appendix we illustrate how to apply our framework to derive an estimator for any functional via an example as will be discussed in section when compared to other alternatives our technique has several favourable properties the computational complexity of our method is when compared to of other methods for several functionals we do not require numeric integration unlike most other methods we do not require any tuning of hyperparameters analysis some smoothness assumptions on the densities are warranted to make estimation tractable we use the class which is now standard in nonparametrics literature definition let rd be compact space for any rd ri define ri and dr the class is the set of functions on satisfying dr lkx for all bsc and for all moreover define the bounded class to be note that large implies higher smoothness given samples from pn here the kernel density estimator kde with bandwidth is fˆ nhd rd is smoothing kernel when by selecting the kde achieves the minimax rate of op in mean squared error further if is in the bounded class one can truncate the kde from below at and from above at and achieve the same convergence rate in our analysis the density estimators fˆ are formed by either kde or truncated kde and we will make use of these results we will also need the following regularity condition on the influence function this is satisfied for smooth functionals including those in table we demonstrate this in our example in appendix assumption for functional of one distribution the influence function satisfies kf as kf for functional of two distributions the influence functions ψf ψg satisfy ef ψf ψf kf kg as kf kg eg ψg ψg kf kg as kf kg under the above assumptions emery et al robins et al show that the ds estimator on single distribution achieves mse tbds and further is asymptotically normal when their analysis in the semiparametric setting contains the nonparametric setting as special case in appendix we review these results with simpler self contained analysis that directly uses the vme and has 
more interpretable assumptions an attractive property of our proof is that it is agnostic to the density estimator used provided it achieves the correct rates for the loo estimator equation we establish the following result theorem convergence of loo estimator for let and satisfy assumption then tbloo is when and when the key technical challenge in analysing the loo estimator when compared to the ds estimator is in bounding the variance as there are several correlated terms in the summation the bounded difference inequality is popular trick used in such settings but this requires supremum on the influence functions which leads to significantly worse rates instead we use the inequality which provides an integrated version of bounded differences that can recover the correct rate when coupled with assumption our proof is contingent on the use of the kde as the density estimator while our empirical studies indicate that tbloo limiting distribution is normal fig the proof seems challenging due to the correlation between terms in the summation we conjecture that tbloo is indeed asymptotically normal but for now leave it to future work we reiterate that while the convergence rates are the same for both ds and loo estimators the data splitting degrades empirical performance of tbds as we show in our simulations now we turn our attention to functionals of two distributions when analysing asymptotics we will assume that as denote for the ds estimator we generalise our analysis for one distribution to establish the theorem below theorem normality of ds estimator for let and ψf ψg satisfy assumption then tbds is when and when further when and when ψf ψg tbds is asymptotically normal tbds vf ψf vg ψg the convergence rate is analogous to the one distribution case with the estimator achieving the parametric rate under similar smoothness conditions the asymptotic normality result allows us to construct asymptotic confidence intervals for the functional even though the asymptotic variance of the influence function is not known by slutzky theorem any consistent estimate of the variance gives valid asymptotic confidence interval in fact we can use an influence function based estimator for the asymptotic variance since it is also differentiable functional of the densities we demonstrate this in our example in appendix the condition ψf ψg is somewhat technical when both ψf and ψg are zero the first order terms vanishes and the estimator converges very fast at rate however the asymptotic behavior of the estimator is unclear while this degeneracy occurs only on meagre set it does arise for important choices such as the null hypothesis in testing problems finally for the loo estimator on two distributions we have the following result convergence is analogous to the one distribution setting and the parametric rate is achieved when theorem convergence of loo estimator for let and ψf ψg satisfy assumption then tbloo is when and when for many functionals assumption alone is sufficient to guarantee the rates in theorems and however for some functionals such as the we require fˆ to be bounded above and below existing results demonstrate that estimating such quantities is difficult without this assumption now we turn our attention to the question of statistical difficulty via lower bounds given by and massart and laurent we know that the ds and loo estimators are minimax optimal when for functionals of one distribution in the following theorem we present lower bound for estimating functionals of two distributions 
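As a concrete, low-dimensional illustration of the DS and LOO constructions just described, the sketch below estimates Shannon entropy and KL divergence with a Gaussian kernel density estimate. The influence functions used in the comments (ψ(t; p) = −log p(t) − H(p) for entropy; ψ_p(t) = log(p(t)/q(t)) − KL and ψ_q(t) = 1 − p(t)/q(t) for KL) follow from the von Mises expansion. The kernel, the fixed bandwidth h = 0.3, the clipping constant, and the simplified leave-one-out pairing in kl_loo are illustrative assumptions, not the authors' implementation, which uses the kernel estimators and bandwidths analysed in the paper and cycles the paired LOO sums as described above.

import numpy as np

def gauss_kde(points, centers, h):
    """Gaussian kernel density estimate at `points`, built from samples `centers`."""
    z = (points[:, None] - centers[None, :]) / h
    return np.exp(-0.5 * z**2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def loo_kde(x, h):
    """Leave-one-out KDE: density estimate at each X_i from all samples except X_i."""
    z = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * z**2) / (h * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(K, 0.0)
    return np.clip(K.sum(axis=1) / (len(x) - 1), 1e-12, None)

def entropy_ds(x, h):
    """Data-split (DS) first-order estimator of H(p) = -integral of p log p.
    With influence function psi(t; p) = -log p(t) - H(p), the estimator is
    T(p_hat) + mean_i psi(X_i; p_hat), computed on one half of the data with
    p_hat fitted on the other half, then averaged over the two splits."""
    n = len(x) // 2
    grid = np.linspace(x.min() - 4 * h, x.max() + 4 * h, 4000)
    dg = grid[1] - grid[0]
    ests = []
    for fit, evl in [(x[:n], x[n:]), (x[n:], x[:n])]:
        p_grid = np.clip(gauss_kde(grid, fit, h), 1e-12, None)
        plug_in = -np.sum(p_grid * np.log(p_grid)) * dg           # T(p_hat)
        p_evl = np.clip(gauss_kde(evl, fit, h), 1e-12, None)
        ests.append(plug_in + np.mean(-np.log(p_evl) - plug_in))  # + mean of psi
    return 0.5 * sum(ests)

def entropy_loo(x, h):
    """Leave-one-out (LOO) variant: average of -log p_hat_{-i}(X_i) over all i."""
    return np.mean(-np.log(loo_kde(x, h)))

def kl_loo(x, y, h):
    """First-order estimator of KL(p||q), the two-distribution case.
    With psi_p(t) = log(p(t)/q(t)) - KL and psi_q(t) = 1 - p(t)/q(t), the plug-in
    term cancels against the corrections, so no numeric integration is needed;
    LOO is applied wherever a sample is evaluated under a density fitted with it."""
    q_at_x = np.clip(gauss_kde(x, y, h), 1e-12, None)
    p_at_y = np.clip(gauss_kde(y, x, h), 1e-12, None)
    return np.mean(np.log(loo_kde(x, h) / q_at_x)) + 1.0 - np.mean(p_at_y / loo_kde(y, h))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=2000)   # true entropy = 0.5 log(2 pi e), about 1.419
    y = rng.normal(0.5, 1.0, size=2000)   # true KL(p||q) = 0.5**2 / 2 = 0.125
    print(entropy_ds(x, 0.3), entropy_loo(x, 0.3), kl_loo(x, y, 0.3))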
theorem lower bound for let and tb be any estimator for define min then there exists strictly positive constant such that lim inf inf sup tb tb our proof given in appendix is based on lecam method and generalises the analysis of and massart for functionals of one distribution this establishes minimax optimality of the estimators for functionals of two distributions when however when there is gap between our upper and lower bounds it is natural to ask if it is possible to improve on our rates in this regime series of work shows that for integral functionals of one distribution one can achieve the rate when by estimating the second order term in the functional taylor expansion this second order correction was also done for polynomial functionals of two distributions with similar statistical gains while we believe this is possible here these estimators are conceptually complicated and computationally expensive requiring running time compared to the running time for our estimator the first order estimator has favorable balance between statistical and computational efficiency further not much is known about the limiting distribution of second order estimators ds loo knn kdp voronoi divergence hellinger divergence ds loo knn divergence ds loo knn ds loo knn kl divergence shannon entropy ds loo knn kdp shannon entropy ds loo knn figure comparison of estimators against alternatives on different functionals the is the error and the is the number of samples all curves were produced by averaging over experiments discretisation in hyperparameter selection may explain some of the unsmooth curves comparison with other approaches estimation of statistical functionals under nonparametric assumptions has received considerable attention over the last few decades large body of work has focused on estimating the shannon beirlant et al gives nice review of results and techniques more recent work in the setting includes estimation of and tsallis entropies there are also several papers extending some of these techniques to divergence estimation many of the existing methods can be categorised as methods they are based on estimating the densities either via kde or using neighbors and evaluating the functional on these estimates methods are conceptually simple but unfortunately suffer several drawbacks first they typically have worse convergence rate than our approach achieving the parametric rate only when as opposed to secondly using either the kde or obtaining the best rates for methods requires undersmoothing the density estimate and we are not aware for principled approaches for selecting this smoothing parameter in contrast the bandwidth used in our estimators is the optimal bandwidth for density estimation so we can select it using number of approaches cross validation this is convenient from practitioners perspective as the bandwidth can be selected automatically convenience that other estimators do not enjoy secondly plugin methods based on the kde always require computationally burdensome numeric integration in our approach numeric integration can be avoided for many functionals of interest see table another line of work focuses more specifically on estimating nguyen et al estimate by solving convex program and analyse the method when the likelihood ratio of the densities belongs to an rkhs comparing the theoretical results is not straightforward as it is not clear how to port the rkhs assumption to our setting further the size of the convex program increases with the sample size which is problematic for 
large samples moon and hero use weighted ensemble estimator for they establish asymptotic normality and the parametric convergence rate only when which is stronger smoothness assumption than is required by our technique both these works only consider whereas our method has wider applicability and includes as special case experiments we compare the estimators derived using our methods on series of synthetic examples we compare against the methods in software for the estimators was obtained either quantiles of tbloo ds loo quantiles of tbds conditional divergence quantiles of quantiles of figure fig comparison of the loo vs ds estimator on estimating the conditional tsallis divergence in dimensions note that the estimator is intractable due to numerical integration there are no other known estimators for the conditional tsallis divergence figs qq plots obtained using samples for hellinger divergence estimation in dimensions using the ds and loo estimators respectively directly from the papers or from for the estimators we estimate the density via kde with the smoothing kernels constructed using legendre polynomials in both cases and for the plug in estimator we choose the bandwidth by performing cross validation the integration for the plug in estimator is approximated numerically we test the estimators on series of synthetic datasets in dimension the specifics of the densities used in the examples and methods compared to are given in appendix the results are shown in figures and we make the following observations in most cases the loo estimator performs best the ds estimator approaches the loo estimator when there are many samples but is generally inferior to the loo estimator with few samples this as we have explained before is because data splitting does not make efficient use of the data the estimator for divergences requires choosing for this estimator we used the default setting for given in the software as performance is sensitive to the choice of it performs well in some cases but poorly in other cases we reiterate that the of our estimator bandwidth of the kernel can be selected automatically using cross validation next we test the ds and loo estimators for asymptotic normality on hellinger divergence estimation problem we use samples for repeat this experiment times and compare the empiriical asymptotic distribution the tb values where sb is the estimated asymptotic variance to distribution on qq plot the results in figure suggest that both estimators are asymptotically normal image clustering we demonstrate the use of our nonparametric divergence estimators in an image clustering task on the datset using our hellinger divergence estimator we achieved an accuracy of whereas naive spectral clustering approach achieved only when we used estimator for the hellinger divergence we achieved which attests to the superiority of our method since this is not the main focus of this work we defer this to appendix conclusion we generalise existing results in von mises estimation by proposing an empirically superior loo technique for estimating functionals and extending the framework to functionals of two distributions we also prove lower bound for the latter setting we demonstrate the practical utility of our technique via comparisons against other alternatives and an image clustering application an open problem arising out of our work is to derive the limiting distribution of the loo estimator acknowledgements this work is supported in part by nsf big data grant and doe grant references jan 
beirlant edward dudewicz and edward van der meulen nonparametric entropy estimation an overview international journal of mathematical and statistical sciences peter bickel and ya acov ritov estimating integrated squared density derivatives sharp best order of convergence estimates the indian journal of statistics lucien and pascal massart estimation of integral functionals of density ann of kevin carter raviv raich and alfred hero on local intrinsic dimension estimation and its applications ieee transactions on signal processing inderjit dhillon subramanyam mallela and rahul kumar divisive information theoretic feature clustering algorithm for text classification mach learn emery nemirovski and voiculescu lectures on prob theory and stat springer luisa fernholz von mises calculus for statistical functionals lecture notes in statistics springer mohammed nawaz goria nikolai leonenko victor mergel and pier luigi novi inverardi new class of random vector entropy estimators and its applications nonparametric statistics hero bing ma michel and gorman applications of entropic spanning graphs ieee signal processing magazine david and oleg seleznjev estimation of integral functionals arxiv kerkyacharian and dominique picard estimating nonquadratic functionals of density using haar wavelets annals of akshay krishnamurthy kirthevasan kandasamy barnabas poczos and larry wasserman nonparametric estimation of divergence and friends in icml akshay krishnamurthy kirthevasan kandasamy barnabas poczos and larry wasserman on estimating divergence in artificial intelligence and statistics laurent efficient estimation of integral functionals of density ann of erik and fisher john ica using spacings estimates of entropy mach learn bastian leibe and bernt schiele analyzing appearance and contour based methods for object categorization in cvpr nikolai leonenko and oleg seleznjev statistical inference for the and the quadratic entropy journal of multivariate analysis jeremy lewi robert butera and liam paninski adaptive optimization of neurophysiology experiments in nips han liu larry wasserman and john lafferty exponential concentration for mutual information estimation with application to forests in nips erik miller new class of entropy estimators for densities in icassp kevin moon and alfred hero multivariate estimation with confidence in nips xuanlong nguyen martin wainwright and michael jordan estimating divergence functionals and the likelihood ratio by convex risk minimization ieee transactions on information theory havva alizadeh noughabi and reza alizadeh noughabi on the entropy estimators journal of statistical computation and simulation and csaba estimation of entropy and mutual information based on generalized graphs in nips hanchuan peng fulmi long and chris ding feature selection based on mutual information criteria of and ieee pami fernando kl divergence estimation of continuous distributions in ieee isit and jeff schneider on the estimation of in aistats liang xiong and jeff schneider nonparametric divergence estimation with applications to machine learning on distributions in uai david ramırez javier vıa ignacio santamarıa and pedro crespo entropy and divergence estimation based on szegos theorem in eusipco james robins lingling li eric tchetgen and aad van der vaart quadratic semiparametric von mises calculus metrika elad schneidman william bialek and michael berry ii an information theoretic approach to the functional classification of neurons in nips shashank singh and barnabas poczos 
exponential concentration of density functional estimator in nips dan stowell and mark plumbley fast multidimensional entropy estimation by partitioning ieee signal process information theoretical estimators toolbox mach learn alexandre tsybakov introduction to nonparametric estimation springer aad van der vaart asymptotic statistics cambridge university press qing wang sanjeev kulkarni and sergio divergence estimation for multidimensional densities via distances ieee transactions on information theory 
column selection via adaptive sampling saurabh paul global risk sciences paypal saupaul malik cs rensselaer polytechnic institute magdon petros drineas cs rensselaer polytechnic institute drinep abstract selecting good column or row subset of massive data matrices has found many applications in data analysis and machine learning we propose new adaptive sampling algorithm that can be used to improve any column selection algorithm our algorithm delivers tighter theoretical bound on the approximation error which we also demonstrate empirically using two well known column subset selection algorithms our experimental results on synthetic and data show that our algorithm outperforms sampling as well as prior adaptive sampling approaches introduction in numerous machine learning and data analysis applications the input data are modelled as matrix where is the number of objects data points and is the number of features often it is desirable to represent your solution using few features to promote better generalization and interpretability of the solutions or using few data points to identify important coresets of the data for example pca sparse pca sparse regression coreset based regression etc these problems can be reduced to identifying good subset of the columns or rows in the data matrix the column subset selection problem cssp for example finding an optimal sparse linear encoder for the data dimension reduction can be explicitly reduced to cssp motivated by the fact that in many practical applications the left and right singular vectors of matrix lacks any physical interpretation long line of work focused on extracting subset of columns of the matrix which are approximately as good as ak at reconstructing to make our discussion more concrete let us formally define cssp column subset cssp find matrix containing columns of selection problem for which is in the prior work one measures the quality of csspsolution against ak the best approximation to obtained via the singular value decomposition svd where is user specified target rank parameter for example gives efficient algorithms to find with columns for which ka ak kf our contribution is not to directly attack cssp we present novel algorithm that can improve an existing cssp algorithm by adaptively invoking it in sense actively learning which columns to sample next based on the columns you have already sampled if you use the from as strawman benchmark you can obtain columns all at once and incur an error roughly ka ak kf or you can invoke the algorithm to obtain for example columns and then allow the algorithm to adapt to the columns already chosen for example by modifying before choosing the remaining columns we refer to the former as continued sampling and to the is the best possible reconstruction of by projection into the space spanned by the columns of latter as adaptive sampling we prove performance guarantees which show that adaptive sampling improves upon continued sampling and we present experiments on synthetic and real data that demonstrate significant empirical performance gains notation denote matrices and denote column vectors in is the identity matrix and denote matrix concatenation operations in and manner respectively given set as is the matrix that contains the columns of indexed by let rank min the economy svd of is vtk σk σi ui vit uk where uk and contain the left singular vectors ui vk and contain the right singular vectors vi and is diagonal matrix containing the singular values σρ the frobenius norm of is kakf and ak the 
best aij tr is the trace of the pseudoinverse of is vς pk approximation to under any unitarily invariant norm is ak uk σk vtk σi ui vit our contribution adaptive sampling we design novel that adaptively selects columns from the matrix in rounds in each round we remove from the information that has already been captured by the columns that have been thus far selected algorithm selects tc columns of in rounds where in each round columns of are selected using from prior work input target rank rounds columns per round output tc columns of and the indices of those columns for do sample indices of columns from using set as and return algorithm adaptive sampling at round in step we compute column indices and as using on the residual of the previous round to compute this residual remove from the best approximation to in the span of the columns selected from the first rounds similar strategy was developed in with sequential adaptive use of additive error csspalgorithms these additive error select columns according to column norms in the residual in step is defined differently as to motivate our result it helps to take closer look at the reconstruction error after adaptive rounds of the strategy in with the in rounds continued sampling tc columns using from ka ak ka ak adaptive sampling rounds of the strategy in with the from ka ak ka ak typically kakf ka ak kf and is small so adaptive sampling la wins over continued sampling for additive error this is especially apparent after rounds where continued sampling only attenuates the big term kakf by but adaptive sampling exponentially attenuates this term by recently powerful have been developed which give guarantees we can use the adaptive strategy from together with these newer relative error if one carries out the analysis from by replacing the additive error from with the relative error in the comparison of continued and adaptive sampling using the strategy from becomes rounds suffices to see the problem rounds continued sampling tc columns using from ka ak adaptive sampling rounds of the strategy in with the from ka ak adaptive sampling from gives worse theoretical guarantee than continued sampling for relative error in nutshell no matter how many rounds of adaptive sampling you do the theoretical bound will not be better than ka ak kf if you are using relative error this raises an obvious question is it possible to combine csspalgorithms with adaptive sampling to get provably and empirically improved the approach of does not achieve this objective we provide positive answer to this question our approach is subtle modification to the approach in in step of algorithm when we compute the residual matrix in round we subtract from the best approximation to the projection of onto the current columns selected as opposed to subtracting the full projection this subtle change is critical in our new analysis which gives tighter bound on the final error allowing us to boost for rounds of adaptive sampling we get reconstruction error of kekf ka kf ka ak kf where the critical improvement in the bound is that the dominant depends on ka kf and the dependence on ka ak kf is now to highlight this improved theoretical bound in an extreme case consider matrix that has rank exactly then ka kf continued sampling gives an ka ak kf where as our adaptive sampling gives an ka ak kf which is clearly better in this extreme case in practice data matrices have rapidly decaying singular values so this extreme case is not far from reality see figure singular values of hgdp avgd 
over chromosomes singular values of avgd over datasets singular values singular values figure figure showing the singular value decay for two real world datasets to state our main theoretical result we need to more formally define relative error definition relative error relative error takes as input matrix rank parameter rank and number of columns and outputs column indices with so that the columns xs satisfy ec kx kf kx xk kf where depends on and the expectation is over random choices made in the our main theorem bounds the reconstruction error when our adaptive sampling approach is used to boost the boost in performance depends on the decay of the spectrum of theorem let be matrix of rank and let be target rank if in step of algorithm we use the relative error with then ec ka tk kf ka atk kf ka aik kf comments the dominant term in our bound is ka atk kf not ka ak kf this is major improvement since the former is typically much smaller than the latter in real data further we need bound on the reconstruction error ka akf our theorem give stronger result than needed because ka akf ka tk kf we presented our result for the case of relative error with guarantee on the expected reconstruction error clearly if the is deterministic then theorem will also hold deterministically the result in theorem can also be boosted to hold with high probability by repeating the process log times and picking the columns which performed best then with probability at least ka tk kf ka atk kf ka aik kf if the itself only gives at least guarantee then the bound in theorem also holds with high probability at least tδ which is obtained by applying union bound to the probability of failure in each round our results hold for any relative error combined with our adaptive sampling strategy the relative error in has the relative error csspalgorithm in has log other algorithms can be found in we presented the simplest form of the result which can be generalized to sample different number of columns in each round or even use different in each round we have not optimized the sampling schedule how many columns to sample in each round at the moment this is largely dictated by the cssp algorithm itself which requires minimum number of samples in each round to give theoretical guarantee from the empirical perspective for example using leverage score sampling to select columns strongest performance may be obtained by adapting after every column is selected in the context of the additive error from our adaptive sampling strategy gives theoretical performance guarantee which is at least as good as the adaptive sampling strategy from lastly we also provide the first empirical evaluation of adaptive sampling algorithms we implemented our algorithm using two column selection algorithms the column selection algorithm of and the sampling algorithm of and compared it against the adaptive sampling algorithm of on synthetic and data the experimental results show that our algorithm outperforms prior approaches related work column selection algorithms have been extensively studied in prior literature such algorithms include qr factorizations for which only weak performance guarantees can be derived the qr approach was improved in where the authors proposed memory efficient implementation the randomized additive error was breakthrough which led to series of improvements producing relative using variety of randomized and for an cssp algorithm ec kx kf kx xk deterministic techniques these include leverage score sampling volume sampling the 
hybrid sampling approach of the column selection algorithms of as well as deterministic variants presented in we refer the reader to section of for detailed overview of prior work our focus is not on per se but rather on adaptively invoking existing the only prior adaptive sampling with provable guarantee was introduced in and further analyzed in this strategy is specifically boosts the additive error but does not work with relative error which are currently in use our modification of the approach in is delicate but crucial to the new analysis we perform in the context of relative error our work is motivated by relative error satisfying definition such algorithms exist which give expected guarantees as well as high probability guarantees specifically given and target rank the sampling approach of selects log columns of to form matrix to give error with probability at least similarly proposed relative error selecting columns and giving error guarantee in expectation which can be boosted to high probability guarantee via independent repetition proof of theorem we now prove the main result which analyzes the performance of our adaptive sampling in algorithm for relative error we will need the following linear algebraic lemma lemma let and suppose that rank then σi proof observe that σi the claim is now immediate from the theorem because has rank at most therefore σi kx kx we are now ready to prove theorem by induction on the number of rounds of adaptive sampling when the claim is that ka kf ka ak kf which is immediate from the definition of the relative error now for the induction suppose that after rounds columns ct are selected and we have the induction hypothesis that ect ka ct tk kf ka atk kf ka aik kf in the th round we use the residual et ct tk to select new columns our relative error gives the following guarantee ket et kf et et etk the last step follows because et ct tk and we can apply lemma with ct tk and rank tk to obtain et we now take the expectation of both sides with respect to the columns ct ii ect ket et kf et ect ka atk kf ka aik kf ka atk kf ka atk kf ka aik kf ka kf ka aik kf follows because of the induction hypothesis eqn the columns chosen after round are ct by the law of iterated expectation ii ect ket et kf et ket et kf observe that et et ct tk et where is in the column space of ct further rank since is the best approximation to in the column space of for any realization of ka kf ket et kf combining with we have that ka kf ka kf ka aik kf this is the desired bound after rounds concluding the induction it is instructive to understand where our new adaptive sampling strategy is needed for the proof to go through the crucial step is where we use lemma it is essential that the residual was perturbation of experiments we compared three adaptive column sampling methods using two real and two synthetic data adaptive sampling methods the prior adaptive method which uses the additive error our new adaptive method using the relative error our adaptive method using the near optimal relative error data sets hgdp chromosomes snps human chromosome data from the hgdp database we use all chromosome matrices rows columns and report the average each matrix contains entries and we randomly filled in missing entries matrices rows documents columns words we kept or larger words and report averages over synthetic random matrices with σi power law synthetic random matrices with σi exp exponential has two stages the first stage is deterministic dual set column selection in which ties could 
occur we break ties in favor of the column not already selected with the maximum norm for randomized algorithms we repeat the experiments five times and take the average we use the synthetic data sets to provide controlled environment in which we can see performance for different types of singular value spectra on very large matrices in prior work it is common to report on the quality of the columns selected by comparing the best approximation within the column span of to ak hence we report the relative error ka ak kf when comparing the algorithms we set the target rank and the number of columns in each round to we have tried several choices for and and the results are qualitatively identical so we only report on one choice our first set of results in figure is to compare the prior adaptive algorithm with the new adaptive ones and which boose relative error csspalgorithms our two new algorithms are both performing better the prior existing adaptive sampling algorithm further is performing better than and this is also not surprising because produces columns if you boost better you get better results further by comparing the performance on synthetic with synthetic we see that our algorithm as well as prior algorithms gain significantly in performance for rapidly decaying singular values our new theoretical analysis reflects this behavior whereas prior results do not hgdp chromosomes the theory bound depends on the figure to the right shows result for increases but is constant comparing the figure with the hgdp plot in figure we see that the quantitative performance is approximately the same as the theory predicts since has not changed the percentage error stays the same even though we are sampling more columns because the benchmark ka ak kf also get smaller when increases since is the superior algorithm we continue with results only for this algorithm of rounds our next experiment is to test which adaptive strategy works better in practice given the same datasets tial selection of columns that is in figure uses an adaptive sampling based on the residual and then adaptively ples according to the adaptive strategy in the initial columns are chosen with the additive error algorithm our approach chooses initial columns with the relative error and then continues to sample adaptively based on the relative error and the residual tk we now give all the adaptive sam of rounds pling algorithms the benefit of the initial columns chosen in the first round by the algorithm from the result shown to the right confirms that is best even if all adaptive strategies start from the same initial columns datasets adaptive versus continued sequential sampling our last experiment is to demonstrate that adaptive sampling works better than continued sequential sampling we consider the relative error in in two modes the first is which is our adaptive sampling algorithms which selects tc columns in rounds of columns each the second is which is just the relative error in sampling tc columns all in one go the results are shown on the right the adaptive boosting of the relative error can gives up to improvement in this data set of rounds datasets cc hgdp chromosomes of rounds synthetic data of rounds synthetic data cc of rounds of rounds figure plots of relative error ratio ka ak kf for various adaptive sampling gorithms for and in all cases performance improves with more rounds of sampling and rapidly converges to relative reconstruction error of this is most so in data matrices with singular values that decay quickly 
such as techtc and synthetic the hgdp singular values decay slowly because missing entries are selected randomly and synthetic has slowly decaying singular values by construction conclusion we present new approach for adaptive sampling algorithms which can boost relative error csspalgorithms in particular the near optimal in we showed theoretical and experimental evidence that our new adaptively boosted is better than the prior existing adaptive sampling algorithm which is based on the additive error in we also showed evidence theoretical and empirical that our adaptive sampling algorithms are better than sequentially sampling all the columns at once in particular our theoretical bounds give result which is tighter for matrices whose singular values decay rapidly several interesting questions remain we showed that the simplest adaptive sampling algorithm which samples constant number of columns in each round improves upon sequential sampling all at once what is the optimal sampling schedule and does it depend on the singular value spectrum of the data matric in particular can improved theoretical bounds or empirical performance be obtained by carefully choosing how many columns to select in each round it would also be interesting to see the improved adaptive sampling boosting of in the actual applications which require column selection such as sparse pca or unsupervised feature selection how do the improved theoretical estimates we have derived carry over to these problems theoretically or empirically we leave these directions for future work acknowledgements most of the work was done when sp was graduate student at rpi pd was supported by and mmi was partially supported by the army research laboratory under cooperative agreement number the arl network science cta the views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies either expressed or implied of the army research laboratory or the government the government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation here on references christos boutsidis petros drineas and malik near optimal coresets for regression ieee transactions on information theory october boutsidis and note on sparse regression information processing letters christos boutsidis and malik deterministic feature selection for clustering ieee transactions on information theory september christos boutsidis petros drineas and malik sparse features for regression in proc annual conference on neural information processing systems nips to appear malik and christos boutsidis optimal sparse linear and sparse pca chan and hansen some applications of the rank revealing qr factorization siam sci stat deshpande and rademacher efficient volume sampling for subset selection in proceedings of the ieee focs pages deshpande and vempala adaptive sampling and fast matrix approximation in approximation randomization and combinatorial optimization algorithms and techniques pages springer deshpande rademacher vempala and wang matrix approximation and projective clustering via volume sampling theory of computing drineas kerenidis and raghavan competitive recommendation systems in proceedings of the stoc pages frieze kannan and vempala fast algorithms for finding approximations journal of the acm jacm halko martinsson and tropp finding structure with randomness probabilistic algorithms for constructing approximate matrix decompositions siam may 
liberty woolfe martinsson rokhlin and tygert randomized algorithms for the lowrank approximation of matrices pnas michael mahoney and petros drineas cur matrix decompositions for improved data analysis pnas boutsidis drineas and matrix reconstruction siam journal of computing drineas mahoney and muthukrishnan subspace sampling and matrix approximation methods in approximation randomization and combinatorial optimization algorithms and techniques pages springer venkatesan guruswami and ali kemal sinop optimal matrix reconstruction in proceedings of the annual symposium on discrete algorithms pages boutsidis drineas and near optimal matrix reconstruction in ieee annual symposium on focs pages petros drineas michael mahoney and muthukrishnan cur matrix decompositions siam journal on matrix analysis and applications chan rank revealing qr factorizations linear algebra and its applications crystal maung and haim schweitzer unsupervised feature selection in advances in neural information processing systems pages boutsidis mahoney and drineas an improved approximation algorithm for the column subset selection problem in proceedings of the soda pages papailiopoulos kyrillidis and boutsidis provable deterministic leverage score sampling in proc sigkdd pages deshpande rademacher vempala and wang matrix approximation and projective clustering via volume sampling in proc soda pages drineas and mahoney randomized algorithm for generalization of the singular value decomposition linear algebra and its applications paschou lewis javed and drineas ancestry informative markers for individual assignment to worldwide populations journal of medical genetics davidov gabrilovich and markovitch parameterized generation of labeled datasets for text categorization based on hierarchical directory in proc sigir pages 
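To make the adaptive sampling procedure described above concrete (in each round, run a column subset selection subroutine on the residual obtained by subtracting from A the rank-k truncation of its projection onto the columns selected so far, rather than the full projection), a minimal sketch follows. The leverage-score subroutine is only a stand-in for whichever relative-error CSSP algorithm is being boosted, and the matrix sizes, bandwidthless sampling scheme, and parameter values (k, c, number of rounds) are illustrative assumptions rather than the authors' implementation.

import numpy as np

def proj_rank_k(A, C, k):
    """Best rank-k approximation of A within the span of the columns A[:, C]:
    project A onto that span, then truncate the projected matrix to rank k."""
    Q, _ = np.linalg.qr(A[:, C])
    B = Q.T @ A                                   # coordinates of the projection
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ (U[:, :k] * s[:k] @ Vt[:k])

def leverage_score_cssp(E, k, c, rng):
    """Stand-in relative-error CSSP subroutine: sample c column indices of E with
    probability proportional to their rank-k leverage scores."""
    _, _, Vt = np.linalg.svd(E, full_matrices=False)
    lev = np.sum(Vt[:k] ** 2, axis=0)
    return rng.choice(E.shape[1], size=c, replace=False, p=lev / lev.sum())

def adaptive_column_selection(A, k, c, T, rng=None):
    """T rounds of adaptive sampling; round t runs the CSSP subroutine on the
    residual E_t = A - (rank-k truncation of the projection of A onto the
    columns chosen so far), instead of on A itself."""
    rng = rng or np.random.default_rng(0)
    C = []
    for _ in range(T):
        E = A - proj_rank_k(A, C, k) if C else A
        new = leverage_score_cssp(E, k, c, rng)
        C.extend(int(i) for i in new if i not in C)
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((400, 300)) * (0.9 ** np.arange(300))   # decaying spectrum
    C = adaptive_column_selection(A, k=5, c=10, T=5, rng=rng)
    Q = np.linalg.qr(A[:, C])[0]
    # relative reconstruction error ||A - P_C A||_F / ||A - A_k||_F
    print(np.linalg.norm(A - Q @ (Q.T @ A)) /
          np.linalg.norm(A - proj_rank_k(A, list(range(A.shape[1])), 5)))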
honor hybrid optimization for regularized problems jieping ye univeristy of michigan ann arbor mi jpye pinghua gong univeristy of michigan ann arbor mi gongp abstract recent years have witnessed the superiority of sparse learning formulations over their convex counterparts in both theory and practice however due to the and of the regularizer how to efficiently solve the optimization problem for data is still quite challenging in this paper we propose an efficient hybrid optimization algorithm for regularized problems honor specifically we develop hybrid scheme which effectively integrates qn step and gradient descent gd step our contributions are as follows honor incorporates the information to greatly speed up the convergence while it avoids solving regularized quadratic programming and only involves matrixvector multiplications without explicitly forming the inverse hessian matrix we establish rigorous convergence analysis for honor which shows that convergence is guaranteed even for problems while it is typically challenging to analyze the convergence for problems we conduct empirical studies on data sets and results demonstrate that honor converges significantly faster than algorithms introduction sparse learning with convex regularization has been successfully applied to wide range of applications including marker genes identification face recognition image restoration text corpora understanding and radar imaging however it has been shown recently that many convex sparse learning formulations are inferior to their counterparts in both theory and practice popular penalties include smoothly clipped absolute deviation scad penalty lsp and minimax concave penalty mcp although sparse learning reveals its advantage over the convex one it remains challenge to develop an efficient algorithm to solve the optimization problem especially for data dc programming is popular approach to solve problems whose objective functions can be expressed as the difference of two convex functions however potentially convex subproblem is required to solve at each iteration which is not practical for problems sparsenet can solve least squares problem with penalty at each step sparsenet solves univariate subproblem with penalty which admits solution however to establish the convergence analysis the parameter of the penalty is required to be restricted to some interval such that the univariate subproblem with penalty is convex moreover it is quite challenging to extend sparsenet to problems with loss as the univariate subproblem generally does not admit solution the gist algorithm can solve class of regularized problems by iteratively solving possibly proximal operator problem which in turn admits solution however gist does not well exploit the information the algorithm can incorporate the information to solve regularized problems but it requires to solve regularized quadratic subproblem at each iteration in this paper we propose an efficient hybrid optimization algorithm for regularized problems honor which incorporates the information to speed up the convergence honor adopts hybrid optimization scheme which chooses either qn step or gradient descent gd step per iteration mainly depending on whether an iterate has very small components if an iterate does not have any small component the is adopted which uses to exploit the information the key advantage of the is that it does not need to solve regularized quadratic programming and only involves multiplications without explicitly forming the inverse hessian matrix 
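For reference, the three non-convex penalties named above have the following standard decomposable forms, written for a single coordinate x with regularization parameter λ > 0 and a second parameter θ (with the usual constraints on θ, e.g. θ > 2 for SCAD); the exact parameterization used in the paper may differ in minor details:

\[ r_{\mathrm{LSP}}(x) = \lambda \log\!\Big(1 + \frac{|x|}{\theta}\Big), \qquad
r_{\mathrm{MCP}}(x) = \begin{cases} \lambda|x| - \dfrac{x^2}{2\theta}, & |x| \le \theta\lambda,\\[4pt] \dfrac{\theta\lambda^2}{2}, & |x| > \theta\lambda, \end{cases} \]

\[ r_{\mathrm{SCAD}}(x) = \begin{cases} \lambda|x|, & |x| \le \lambda,\\[2pt] \dfrac{2\theta\lambda|x| - x^2 - \lambda^2}{2(\theta-1)}, & \lambda < |x| \le \theta\lambda,\\[4pt] \dfrac{\lambda^2(\theta+1)}{2}, & |x| > \theta\lambda. \end{cases} \]

Each penalty equals λ|x| near the origin (up to first order) and flattens out for large |x|, which is the source of both the statistical advantage and the non-convexity discussed above.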
if an iterate has small components we switch to our detailed theoretical analysis sheds light on the effect of such hybrid scheme on the convergence of the algorithm specifically we provide rigorous convergence analysis for honor which shows that every limit point of the sequence generated by honor is clarke critical point it is worth noting that the convergence analysis for problem is typically much more challenging than the convex one because many important properties for convex problem may not hold for problems empirical studies are also conducted on data sets which include up to millions of samples and features results demonstrate that honor converges significantly faster than algorithms sparse learning we focus on the following regularized optimization problem min where we make the following assumptions throughout the paper is coercive continuously differentiable and is lipschitz continuous with constant moreover for all rn pn where is continuously differentiable and concave with respect to in and with denoting the derivative of at the point remark assumption allows to be assumption implies that is generally with respect to xi and the only convex case is with moreover is continuously differentiable with respect to xi in and nondifferentiable at xi in particular xi for any xi where xi if xi xi if xi and xi otherwise in addition must hold otherwise implies for any contradicting the fact that is it is also easy to show that under the assumptions above both and are locally lipschitz continuous thus the clarke subdifferential is the commonly used least squares loss and the logistic regression loss satisfy the assumption we can add small term to make them coercive the following popular regularizers satisfy the assumption where and except that for scad lsp log scad mcp if if θλ if θλ if θλ if θλ due to the and of problem the traditional subdifferential concept for the convex optimization is not applicable here thus we use the clarke subdifferential to characterize the optimality of problem we say is clarke critical point of problem if where is the clarke subdifferential of at to be we briefly review the clarke subdifferential for locally lipschitz continuous function the clarke generalized directional derivative of at along the direction is defined as αd lim sup then the clarke subdifferential of at is defined as rn dt rn interested readers may refer to proposition in the supplement for more properties about the clarke subdifferential we want to emphasize that some basic properties of the subdifferential of convex function may not hold for the clarke subdifferential of function proposed optimization algorithm honor since each decomposable component function of the regularizer is only at the origin the objective function is differentiable if the segment between any two consecutive iterates do not cross any axis this motivates us to design an algorithm which can keep the current iterate in the same orthant of the previous iterate before we present the detailed honor algorithm we introduce two functions as follows define function rn rn with the entry being xi if xi yi πi xi yi otherwise where rn yi is the entry of is the parameter of the function is the sign function defined as follows xi if xi xi if xi and xi otherwise define the whose entry is given by if xi if xi if xi if xi otherwise where is the derivative of at the point remark if is convex is the of at thus is descent direction however is not even of if is this indicates that some obvious concepts and properties for convex problem may 
not hold in the case thus it is significantly more challenging to develop and analyze algorithms for problem interestingly we can still show that vk xk is descent direction at the point xk refer to supplement and replace pk dk vk with vk to utilize the information we may perform the optimization along the direction dk vk where is positive definite matrix containing the information however dk is not necessarily descent direction to address this issue we use the following slightly modified direction pk pk dk vk we can show that pk is descent direction proof is provided in supplement thus we can perform the optimization along the direction pk recall that we need to keep the current iterate in the same orthant of the previous iterate so the following iterative scheme is proposed xk xk αpk where ξik xki if xki vik if xki and is step size chosen by the following line search procedure for constants and find the smallest integer with such that the following inequality holds xk xk γα vk dk however only using the above iterative scheme may not guarantee the convergence the main challenge is if there exists subsequence such that xki converges to zero it is possible that for sufficiently large is arbitrarily small but never equal to zero refer to the proof of theorem for more details to address this issue we propose hybrid optimization scheme specifically for small constant if min kvk xki vik is not empty we switch the iteration to the following gradient descent step xk arg min xk xk kx xk where is step size chosen by the following line search procedure for constants and find the smallest integer with such that the following inequality holds xk xk kxk xk the detailed steps of the algorithm are presented in algorithm remark algorithm is similar to algorithms in however honor is significantly different from them the algorithms can only handle regularized convex problems while honor is applicable to class of problems beyond ones the convergence analyses of the algorithms heavily rely on the convexity of the problem in contrast the convergence analysis for honor is applicable to cases beyond the convex ones which is extension algorithm honor hybrid optimization for regularized problems initialize and choose for to maxiter do compute vk xk and ǫk xki vik where ǫk min kvk initialize if then compute dk vk with positive definite matrix using alignment pk dk vk while eq is not satisfied do αβ xk xk αpk end else while eq is not satisfied do αβ xk arg minx xk xk kx xk end end xk if some stopping criterion is satisfied then stop and return end end convergence analysis we first present few basic propositions and then provide the convergence theorem based on the propositions all proofs of the presented propositions are carefully handled due to the lack of convexity first of all an optimality condition is presented proof is provided in supplement which will be directly used in the proof of theorem proposition let xk vk xk and where is subsequence of if lim inf for all then and is clarke critical point of problem we subsequently show that we have inequality in the following proposition proof is provided in supplement which is crucial to prove the final convergence theorem proposition let vk xk xk xk and qkα with then under assumptions and we have xk xk xk xk xk vk xk xk ii xk xk vk qkα kqα we next show that both line search criteria in the eq and the eq at any iteration is satisfied in finite number of trials proof is provided in supplement proposition at any iteration of the honor algorithm if xk is not clarke 
critical point of problem then for the there exists an with such that the line search criterion in eq is satisfied for the the line search criterion in eq is satisfied whenever min that is both line search criteria at any iteration are satisfied in finite number of trials we are now ready to provide the convergence proof for the honor algorithm theorem the sequence xk generated by the honor algorithm has at least limit point and every limit point of xk is clarke critical point of problem proof it follows from proposition that both line search criteria in the eq and the eq at each iteration can be satisfied in finite number of trials let αk be the accepted step size at iteration then we have xk γαk vk dk γαk vk vk or xk xk xk recall that is positive definite and αk which together with eqs imply that xk is monotonically decreasing thus xk converges to finite value since is bounded from below note that and for all rn due to the boundedness of xk see proposition in supplement the sequence xk generated by the honor algorithm has at least limit point since is continuous there exists subsequence of such that lim xk lim xk lim xk in the following we prove the theorem by contradiction assume that is not clarke critical point of problem then by proposition there exists at least one such that lim inf we next consider the following two cases there exist subsequence of and an integer such that for all the is adopted then for all we have arg min xk xk kx xk thus by the optimality condition of the above problem and properties of the clarke subdifferential proposition in supplement we have xk xk αk taking limits with for eq and considering eqs we have lim xk xk lim lim taking limits with for eq and considering eq αk min proposition and is see proposition in the supplement we have which contradicts the assumption that is not clarke critical point of problem there exists an integer such that for all the is adopted according to remark in supplement we know that the smallest eigenvalue of is uniformly bounded from below by positive constant which together with eq implies lim inf vk vk taking limits with for eq we have lim γαk vk vk which together with αk and eq implies that lim αk eq implies that there exist an integer and constant such that ǫk min kvk for all notice that for all the is adopted thus we obtain that ǫk xki vik for all we also notice that if then there exists constant such that xki πi xki αpki ξik xki αpki for all as pki is bounded proposition in supplement therefore we conclude that for all max and for all at least one of the following three cases must happen xki xki πi xki αpki ξik xki αpki or ǫk xki πi xki αpki ξik xki αpki or xki vik xki pki xki πi xki αpki ξik xki αpki it follows that there exists constant such that qkα xk xk pk thus considering dki vik and vik pki vik dki for all we have kqkα kpk kdk vk vk qkα according to proposition in supplement we know that the largest eigenvalue of is uniformly bounded from above by some positive constant thus we have vk vk vk vk vk vk αl αl which together with eqs and dk vk implies qα vk dk kqα αl αl considering eqs we have αlm vk dk xk xk which together with vk dk vk vk implies that the line search criterion in the eq is satisfied if αlm and considering the backtracking form of the line search in eq we conclude that the line search criterion in the eq is satisfied whenever αk min min lm this leads to contradiction with eq by and we conclude that xk is clarke critical point of problem experiments in this section we evaluate the efficiency of honor on 
solving the regularized lopn gistic regression by setting log exp ati where ai rn is the sample associated with the label yi three regularizers lsp mcp and scad are included in experiments where the parameters are set as and is set as for scad as it requires we compare honor with the gist on three and sparse data sets which are summarized in table all data sets can be downloaded from http all algorithms are implemented in mattable data set statistics lab under linux operating sysdatasets url tem and executed on an intel core samples cpu with dimensionality memory we choose the starting points for the compared algorithms using the same random vector whose entries are sampled from the standard gaussian distribution we terminate the compared algorithms if the relative change of two consecutive objective function values is less than or the number of iterations exceeds honor or gist for honor we set and the number of unrolling steps in as for gist we use the line search in experiments as it usually performs better than its monotone counterpart to show how the convergence behavior of honor varies over the parameter we use three values we report the objective function value in cpu time in seconds plots in figure we can observe from figure that if is set to small value the is adopted at almost all steps in honor and honor converges significantly faster than gist for all three we do not include the term in the objective and find that the proposed algorithm still works well we do not involve sparsenet dc programming and in comparison because adapting sparsenet to the logistic regression problem is challenging dc programming is shown to be much inferior to gist the objective function value of is larger than gist in most cases cpu time seconds scad scad honor honor honor gist cpu time seconds cpu time seconds honor honor honor gist scad url cpu time seconds objective function value logged scale honor honor honor gist cpu time seconds honor honor honor gist honor honor honor gist mcp url cpu time seconds mcp mcp cpu time seconds honor honor honor gist cpu time seconds objective function value logged scale objective function value logged scale objective function value logged scale lsp url honor honor honor gist objective function value logged scale objective function value logged scale lsp honor honor honor gist objective function value logged scale lsp objective function value logged scale objective function value logged scale regularizers on all three data sets this shows that using the information greatly speeds up the convergence when increases the ratio of the adopted in honor increases meanwhile the convergence performance of honor generally degrades in some cases setting slightly larger and adopting small number of gd steps even sligtly boosts the convergence performance of honor the green curves in the first row but setting to very small value is always safe to guarantee the fast convergence of honor when is large enough the gd steps dominate all iterations of honor and honor converge much slower in this case honor converges even slower than gist the reason is that at each iteration of honor extra computational cost is required in addition to the basic computation in the moreover the line search is used in gist while the monotone line search is adopted in the in some cases the first row gist is trapped in local solution which has much larger objective function value than honor with small this implies that honor may have potential of escaping from high error plateau which often exists in high 
dimensional problems these results show the great potential of honor for solving sparse learning problems honor honor honor gist cpu time seconds figure objective function value in cpu time in seconds plots for different regularizers and different and data sets the ratios of the adopted in honor are lsp lsp lsp url mcp mcp mcp url scad scad scad url conclusions in this paper we propose an efficient optimization algorithm called honor for solving regularized sparse learning problems honor incorporates the information to speed up the convergence in practice and uses carefully designed hybrid optimization scheme to guarantee the convergence in theory experiments are conducted on data sets and results show that honor converges significantly faster than algorithms in our future work we plan to develop variants of honor to tackle much larger data sets acknowledgements this work is supported in part by research grants from nih and nsf references andrew and gao scalable training of models in icml pages and figueiredo new twist iterative algorithms for image restoration ieee transactions on image processing byrd chin nocedal and oztoprak family of methods for convex optimization technical report industrial engineering and management sciences northwestern university evanston il byrd chin nocedal and wu sample size selection in optimization methods for machine learning mathematical programming byrd lu nocedal and zhu limited memory algorithm for bound constrained optimization siam journal on scientific computing candes wakin and boyd enhancing sparsity by reweighted minimization journal of fourier analysis and applications clarke optimization and nonsmooth analysis john wiley sons new york dutta generalized derivatives and nonsmooth optimization finite dimensional tour top el ghaoui li duong pham srivastava and bhaduri sparse machine learning methods for understanding large text corpora in cidu pages fan and li variable selection via nonconcave penalized likelihood and its oracle properties journal of the american statistical association fan xue and zou strong oracle optimality of folded concave penalized estimation annals of statistics gasso rakotomamonjy and canu recovering sparse signals with certain family of nonconvex penalties and dc programming ieee transactions on signal processing gong and ye modified limited memory method with convergence analysis in icml gong zhang lu huang and ye general iterative shrinkage and thresholding algorithm for regularized optimization problems in icml volume pages jorge and stephen numerical optimization springer mazumder friedman and hastie sparsenet coordinate descent with nonconvex penalties journal of the american statistical association olsen oztoprak nocedal and rennie methods for sparse inverse covariance estimation in advances in neural information processing systems nips pages rakotomamonjy flamary and gasso dc proximal newton for optimization problems shevade and keerthi simple and efficient algorithm for gene selection using sparse logistic regression bioinformatics tan roberts li and stoica sparse learning via iterative minimization with application to mimo radar imaging ieee transactions on signal processing tao and an the dc difference of convex functions programming and dca revisited with dc models of real world nonconvex optimization problems annals of operations research wright yang ganesh sastry and ma robust face recognition via sparse representation ieee transactions on pattern analysis and machine intelligence zhang nearly unbiased variable 
selection under minimax concave penalty. The Annals of Statistics.
Zhang and Zhang. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science.
Zhang. Analysis of multi-stage convex relaxation for sparse regularization. JMLR.
Zhang. Multi-stage convex relaxation for feature selection. Bernoulli.
Zou and Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics.
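Returning to the HONOR algorithm described above: the piecewise definition of the pseudo-gradient, the orthant-constrained quasi-Newton update, and the two line-search tests are garbled by extraction. The following is a hedged LaTeX reconstruction, following the OWL-QN/GIST-style construction the text appears to describe; treat it as one plausible reading rather than a verbatim restoration, with the symbols γ, σ, β, ξ^k, π, H_k as introduced in the text.

```latex
% One plausible reading (not a verbatim restoration) of the quantities used above.
% Pseudo-gradient of f = l + \sum_i r_i(|x_i|):
\diamond_i f(x) =
\begin{cases}
\nabla_i l(x) + r_i'(|x_i|)\,\operatorname{sign}(x_i), & x_i \neq 0,\\
\nabla_i l(x) + r_i'(0), & x_i = 0,\ \nabla_i l(x) + r_i'(0) < 0,\\
\nabla_i l(x) - r_i'(0), & x_i = 0,\ \nabla_i l(x) - r_i'(0) > 0,\\
0, & \text{otherwise,}
\end{cases}
\qquad v^k = -\diamond f(x^k),\quad d^k = H_k v^k,\quad p^k = \pi(d^k; v^k).

% Quasi-Newton step and its line-search test (smallest t with \alpha = \beta^{t}):
x^{k+1} = \pi\!\left(x^k + \alpha p^k;\ \xi^k\right), \qquad
f(x^{k+1}) \le f(x^k) - \gamma\,\alpha\,\langle v^k, d^k\rangle.

% Gradient-descent (proximal) step and its acceptance test:
x^{k+1} = \arg\min_x\ \langle\nabla l(x^k),\, x - x^k\rangle + \tfrac{1}{2\alpha}\|x - x^k\|^2 + r(x),
\qquad
f(x^{k+1}) \le f(x^k) - \tfrac{\sigma}{2}\,\|x^{k+1} - x^k\|^2.
```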
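To make the hybrid scheme concrete, here is a minimal, self-contained sketch of one HONOR-style iteration for logistic regression with the LSP regularizer. It is not the authors' implementation: the quasi-Newton metric H_k is replaced by the identity for brevity, the GIST-style closed-form prox for LSP is an assumption about how the gradient-descent fallback would be realized, and all names and constants (`honor_step`, `prox_lsp`, `gamma`, `sigma`, `beta`, `eps`) are illustrative.

```python
import numpy as np

def smooth_loss_grad(x, A, y, mu=1e-6):
    """Logistic loss plus a tiny l2 term (cf. the remark that a small added term makes l coercive)."""
    m = A.shape[0]
    z = y * (A @ x)
    loss = np.mean(np.logaddexp(0.0, -z)) + 0.5 * mu * (x @ x)
    sig = 0.5 * (1.0 + np.tanh(-z / 2.0))               # numerically stable sigmoid(-z)
    grad = -(A.T @ (y * sig)) / m + mu * x
    return loss, grad

def reg_value(x, lam, theta):
    return lam * np.sum(np.log1p(np.abs(x) / theta))    # LSP: r_i(t) = lam * log(1 + t/theta)

def reg_deriv(t, lam, theta):                           # derivative of r_i at t >= 0
    return lam / (theta + t)

def objective(x, A, y, lam, theta):
    return smooth_loss_grad(x, A, y)[0] + reg_value(x, lam, theta)

def pseudo_gradient(x, grad, lam, theta):
    """OWL-QN-style pseudo-gradient of f = l + sum_i r_i(|x_i|) (the piecewise definition above)."""
    v = np.zeros_like(x)
    rp = reg_deriv(np.abs(x), lam, theta)
    v[x > 0] = grad[x > 0] + rp[x > 0]
    v[x < 0] = grad[x < 0] - rp[x < 0]
    r0 = reg_deriv(0.0, lam, theta)
    at0 = (x == 0)
    v[at0 & (grad + r0 < 0)] = (grad + r0)[at0 & (grad + r0 < 0)]
    v[at0 & (grad - r0 > 0)] = (grad - r0)[at0 & (grad - r0 > 0)]
    return v

def align(u, ref):
    """pi(u; ref): zero out the entries of u whose sign disagrees with ref."""
    return np.where(np.sign(u) == np.sign(ref), u, 0.0)

def prox_lsp(z, c, theta):
    """Per-coordinate proximal operator of c * log(1 + |x|/theta) (GIST-style closed form)."""
    a = np.abs(z)
    disc = (a + theta) ** 2 - 4.0 * c
    root = np.where(disc >= 0.0, (a - theta + np.sqrt(np.maximum(disc, 0.0))) / 2.0, 0.0)
    cand = np.maximum(root, 0.0)
    better = 0.5 * (cand - a) ** 2 + c * np.log1p(cand / theta) < 0.5 * a ** 2
    return np.sign(z) * np.where(better, cand, 0.0)

def honor_step(x, A, y, lam, theta, eps=1e-10, gamma=1e-4, sigma=1e-2, beta=0.5):
    """One hybrid iteration: a quasi-Newton-like step constrained to the current orthant,
    or a gradient-descent (proximal) step when some coordinates hover near an axis."""
    f_old, grad = smooth_loss_grad(x, A, y)
    f_old += reg_value(x, lam, theta)
    v = -pseudo_gradient(x, grad, lam, theta)           # shown in the paper to be a descent direction
    eps_k = min(eps, np.linalg.norm(v))
    near_axis = (np.abs(x) < eps_k) & (eps_k <= np.abs(v))
    alpha = 1.0
    if not near_axis.any():                             # quasi-Newton step
        d = v                                           # d = H v, with H = I in this sketch
        p = align(d, v)
        xi = np.where(x != 0, np.sign(x), np.sign(v))   # orthant of the previous iterate
        for _ in range(60):                             # backtracking line search, QN criterion
            x_new = align(x + alpha * p, xi)
            if objective(x_new, A, y, lam, theta) <= f_old - gamma * alpha * (v @ d):
                return x_new
            alpha *= beta
    else:                                               # gradient-descent (proximal) step
        for _ in range(60):                             # backtracking line search, GD criterion
            x_new = prox_lsp(x - alpha * grad, alpha * lam, theta)
            if objective(x_new, A, y, lam, theta) <= f_old - 0.5 * sigma * np.sum((x_new - x) ** 2):
                return x_new
            alpha *= beta
    return x_new

# Toy usage on synthetic data (illustrative only).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
y = np.where(rng.standard_normal(200) > 0, 1.0, -1.0)
x = rng.standard_normal(50)
for _ in range(100):
    x = honor_step(x, A, y, lam=0.1, theta=0.1)
print(objective(x, A, y, 0.1, 0.1), np.count_nonzero(np.abs(x) > 1e-8))
```

With the default small `eps`, the near-axis index set is almost always empty, so the quasi-Newton branch dominates; this mirrors the experimental observation above that HONOR with a small switching constant adopts the quasi-Newton step at almost all iterations and converges fastest.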
object proposals for accurate object class detection xiaozhi kaustav kundu huimin yukun sanja andrew raquel department of computer science university of toronto department of electronic engineering tsinghua university kkundu yukun mhmpub fidler urtasun abstract the goal of this paper is to generate object proposals in the context of autonomous driving our method exploits stereo imagery to place proposals in the form of bounding boxes we formulate the problem as minimizing an energy function encoding object size priors ground plane as well as several depth informed features that reason about free space point cloud densities and distance to the ground our experiments show significant performance gains over existing rgb and object proposal methods on the challenging kitti benchmark combined with convolutional neural net cnn scoring our approach outperforms all existing results on all three kitti object classes introduction due to the development of advanced warning systems cameras are available onboard of almost every new car produced in the last few years computer vision provides very cost effective solution not only to improve safety but also to one of the holy grails of ai fully autonomous cars in this paper we are interested in and object detection for autonomous driving with the large success of deep learning in the past years the object detection community shifted from simple appearance scoring on exhaustive sliding windows to more powerful visual representations extracted from smaller set of proposals this resulted in over absolute performance gains on the pascal voc benchmark the motivation behind these grouping approaches is to provide moderate number of region proposals among which at least few accurately cover the objects these approaches typically an image into super pixels and group them based on several similarity measures this is the strategy behind selective search which is used in most detectors these days contours in the image have also been exploited in order to locate object proposal boxes another successful approach is to frame the problem as energy minimization where parametrized family of energies represents various biases for grouping thus yielding multiple diverse solutions interestingly the approach does not work well on the autonomous driving benchmark kitti falling significantly behind the current top performers this is due to the low achievable recall of the underlying box proposals on this benchmark kitti images contain many small objects severe occlusion high saturated areas and shadows furthermore denotes equal contribution image stereo prior figure features from left to right original image stereo reconstruction features and our prior in the third image purple is free space in eq and occupancy is yellow in eq in the prior the ground plane is green and red to blue indicates distance to the ground kitti evaluation requires much higher overlap with for cars in order for detection to count as correct since most existing proposal methods rely on grouping super pixels based on intensity and texture they fail in these challenging conditions in this paper we propose new object proposal approach that exploits stereo information as well as contextual models specific to the domain of autonomous driving our method reasons in and places proposals in the form of bounding boxes we exploit object size priors ground plane as well as several depth informed features such as free space point densities inside the box visibility and distance to the ground our experiments show 
significant improvement in achievable recall over the at all overlap thresholds and object occlusion levels demonstrating that our approach produces highly accurate object proposals in particular we achieve higher recall for proposals than the method combined with cnn scoring our method outperforms all published results on object detection for car cyclist and pedestrian on kitti our code and data are online http related work with the wide success of deep networks which typically operate on fixed spatial scope there has been increased interest in object proposal generation existing approaches range from purely rgb to video in rgb most approaches combine superpixels into larger regions based on color and texture similarity these approaches produce around proposals per image achieving nearly perfect achievable recall on the pascal voc benchmark in regions are proposed by defining parametric affinities between pixels and solving the energy using parametric the proposed solutions are then scored using simple features and typically only proposals are needed to succeed in consequent recognition tasks introduces learning into proposal generation with parametric energies exhaustively sampled bounding boxes are scored in using several objectness features bing proposals also score windows based on an object closure measure as proxy for objectness edgeboxes score millions of windows based on contour information inside and on the boundary of each window detailed comparison is done in fewer approaches exist that exploit extend cpmc with additional affinities that encourage the proposals to respect occlusion boundaries extends mcg to by an additional set of features they show significant improvements in performance with respect to past work in videos are used to propose boxes around very accurate point clouds relevant to our work is sliding shapes which exhaustively evaluates cuboids in scenes this approach however utilizes an object scoring function trained on large number of rendered views of cad models and uses complex potentials that make the method run slow in both training and inference our work advances over prior work by exploiting the typical sizes of objects in the ground plane and very efficient scoring functions related to our work are also detection approaches for autonomous driving in objects are predetected via approach and deformable wireframe model is then fit using the image information inside the box pepik et al extend the deformable model to by linking parts across different viewpoints and using loss function in an ensemble of models derived from visual and geometrical clusters of object instances is employed in selective search boxes are using object level information proposes holistic model that about dpm detections based on priors from cartographic maps in kitti the best performing method so far is the recently proposed which uses the acf detector and learned occlusion patters in order to improve performance of occluded cars bing ss eb mcg ours bing ss eb mcg ours bing ss eb mcg ours candidates recall at iou threshold candidates bing ss eb mcg ours candidates recall at iou threshold recall at iou threshold candidates bing ss eb mcg ours bing ss eb mcg ours candidates candidates bing ss eb mcg ours cyclist candidates recall at iou threshold pedestrian recall at iou threshold candidates recall at iou threshold bing ss eb mcg ours recall at iou threshold recall at iou threshold car recall at iou threshold bing ss eb mcg ours candidates easy moderate hard figure proposal recall we use 
overlap threshold for car and for pedestrian and cyclist object proposals the goal of our approach is to output diverse set of object proposals in the context of autonomous driving since reasoning is of crucial importance in this domain we place our proposals in and represent them as cuboids we assume stereo image pair as input and compute depth via the approach by yamaguchi et al we use depth to compute and conduct all our reasoning in this domain we next describe our notation and present our framework proposal generation as energy minimization we represent each object proposal with bounding box denoted by which is parametrized by tuple where denotes the center of the box and represents its azimuth angle note that each box in principle lives in continuous space however for efficiency we reason in discretized space details in sec here denotes the object class of the box and tc indexes the set of box templates which represent the physical size variations of each object class the templates are learned from the training data we formulate the proposal generation problem as inference in markov random field mrf which encodes the fact that the proposal should enclose high density region in the point cloud furthermore since the point cloud represents only the visible portion of the space should not overlap with the free space that lies within the rays between the points in the point cloud and the camera if that was the case the box would in fact occlude the point cloud which is not possible we also encode the fact that the point cloud should not extend vertically beyond our placed box and that the height of the point cloud in the immediate vicinity of the box should be lower than the box our mrf energy thus takes the following form wc pcd φpcd wc φf wc ht φht wc note that our energy depends on the object class via weights wc which are trained using structured svm details in sec we now explain each potential in more detail point cloud density this potential encodes the density of the point cloud within the box φpcd where indicates whether the voxel is occupied or not contains point cloud points and denotes the set of voxels inside the box defined by fig visualizes the potential this potential simply counts the fraction of occupied voxels inside the box it can be efficiently computed in constant time via integral accumulators which is generalization of integral images to free space this potential encodes the constraint that the free space between the point cloud and the camera can not be occupied by the box let represent free space grid where means that the ray from the camera to the voxel does not hit an occupied voxel voxel lies in the free space we define the potential as follows φf this potential thus tries to minimize the free space inside the box and can also be computed efficiently using integral accumulators height prior this potential encodes the fact that the height of the point cloud inside the box should be close to the mean height of the object class this is encoded in the following way hc φht with exp hc dp µc ht σc ht if where dp indicates the height of the road plane lying below the voxel here µc ht σc ht are the mle estimates of the mean height and standard deviation by assuming gaussian distribution of the data integral accumulators can be used to efficiently compute these features height contrast this potential encodes the fact that the point cloud that surrounds the bounding box should have lower height than the height of the point cloud inside the box this is encoded as φht φht 
φht where represents the cuboid obtained by extending by in the direction of each face discretization and accumulators our point cloud is defined with respect to coordinate system where the positive is along the viewing direction of the camera and the is along the direction of gravity we discretize the continuous space such that the width of each voxel is in each dimension we compute the occupancy free space and height prior grids in this discretized space following the idea of integral images we compute our accumulators in inference inference in our model is performed by minimizing the energy argminy due to the efficient computation of the features using integral accumulators evaluating each configuration takes constant time still evaluating exhaustively in the entire grid would be slow in order to reduce the search space we carve certain regions of the grid by skipping configurations which do not overlap with the point cloud we further reduce the search space along the vertical dimension by placing all our bounding boxes on the road plane yroad we estimate the road by partitioning the image into super pixels and train road classifier using neural net with several easy bing ss eb mcg ours iou overlap threshold iou overlap threshold recall recall bing ss eb mcg ours bing ss eb mcg ours iou overlap threshold bing ss eb mcg ours iou overlap threshold recall recall pedestrian recall iou overlap threshold cyclist recall bing ss eb mcg ours iou overlap threshold bing ss eb mcg ours iou overlap threshold bing ss eb mcg ours recall bing ss eb mcg ours recall car recall bing ss eb mcg ours iou overlap threshold iou overlap threshold moderate hard figure recall vs iou for proposals the number next to the labels indicates the average recall ar and features we then use ransac on the predicted road pixels to fit the ground plane using the considerably reduces the search space along the vertical dimension however since the points are noisy at large distances from the camera we sample additional proposal boxes at locations farther than from the camera we sample these boxes at heights yroad σroad where σroad is the mle estimate of the standard deviation by assuming gaussian distribution of the distance between objects and the estimated ground plane using our sampling strategy scoring all possible configurations takes only fraction of second note that by minimizing our energy we only get one best object candidate in order to generate diverse proposals we sort the values of for all and perform greedy inference we pick the top scoring proposal perform nms and iterate the entire inference process and feature computation takes on average per image for proposals learning we learn the weights wc pcd wc wc ht wc of the model using structured svm given ground truth pairs the parameters are learnt by solving the following optimization problem min ξi wt ξi we use the parallel cutting plane of to solve this minimization problem we use iou between the set of gt boxes and candidates as the task loss we compute iou in as the volume of intersection of two boxes divided by the volume of their union this is very strict measure that encourages accurate placement of the proposals object detection and orientation estimation network we use our object proposal method for the task of object detection and orientation estimation we score bounding box proposals using cnn our network is built on fast which share squaresicf aog subcat filteredicf paucenst regionlets ours cars easy moderate hard pedestrians easy moderate hard 
cyclists easy moderate hard table average precision ap in on the test set of the kitti object detection benchmark aog subcat ours cars easy moderate hard pedestrians easy moderate hard cyclists easy moderate hard table aos scores in on the test set of kitti object detection and orientation estimation benchmark convolutional features across all proposals and use roi pooling layer to compute features we extend this basic network by adding context branch after the last convolutional layer and an orientation regression loss to jointly learn object location and orientation features output from the original and the context branches are concatenated and fed to the prediction layers the context regions are obtained by enlarging the candidate boxes by factor of we used smooth loss for orientation regression we use oxfordnet trained on imagenet to initialize the weights of convolutional layers and the branch for candidate boxes the parameters of the context branch are initialized by copying the weights from the original branch we then it end to end on the kitti training set experimental evaluation we evaluate our approach on the challenging kitti autonomous driving dataset which contains three object classes car pedestrian and cyclist kitti object detection benchmark has training and test images evaluation is done in three regimes easy moderate and hard containing objects at different occlusion and truncation levels the moderate regime is used to rank the competing methods in the benchmark since the test labels are not available we split the kitti training set into train and validation sets each containing half of the images we ensure that our training and validation set do not come from the same video sequences and evaluate the performance of our bounding box proposals on the validation set following we use the oracle recall as metric for each gt object we find the proposal that overlaps the most in iou best proposal we say that gt instance has been recalled if iou exceeds for cars and for pedestrians and cyclists this follows the standard kitti setup oracle recall thus computes the percentage of recalled gt objects and thus the best achievable recall we also show how different number of generated proposals affect recall comparison to the we compare our approach to several baselines mcgd mcg selective search ss bing and edge boxes eb fig shows recall as function of the number of candidates we can see that by using proposals we best prop ground truth top prop images figure qualitative results for the car class we show the original image top scoring proposals groundbest prop ground truth top prop images truth boxes and our best set of proposals that cover the figure qualitative examples for the pedestrian class method time seconds bing selective search edge boxes eb mcg ours table running time of different proposal methods achieve around recall for cars in the moderate and hard regimes while for easy we need only candidates to get the same recall notice that other methods saturate or require orders of magnitude more candidates to reach recall for pedestrians and cyclists our results show similar improvements over the baselines note that while we use features uses both depth and appearance based features and all other methods use only appearance features this shows the importance of information in the autonomous driving scenario furthermore the other methods use class agnostic proposals to generate the candidates whereas we generate them based on the object class this allows us to achieve higher 
recall values by exploiting size priors tailored to each class fig shows recall for proposals as function of the iou overlap our approach significantly outperforms the baselines particularly for cyclists running time table shows running time of different proposal methods our approach is fairly efficient and can compute all features and proposals in on single core qualitative results figs and show qualitative results for cars and pedestrians we show the input rgb image top proposals the gt boxes in as well as proposals from our method with the best iou chosen among proposals our method produces very precise proposals even for the more difficult far away or occluded objects object detection to evaluate our full object detection pipeline we report results on the test set of the kitti benchmark the results are presented in table our approach outperforms all the competitors significantly across all categories in particular we achieve and improvement in ap for cars pedestrians and cyclists in the moderate setting object orientation estimation average orientation similarity aos is used as the evaluation metric in object detection and orientation estimation task results on kitti test set are shown in table our approach again outperforms all approaches by large margin particularly our approach achieves higher scores than on cars in moderate and hard data the improvement on pedestrians and cyclists are even more significant as they are more than higher than the second best method suppl material we refer the reader to supplementary material for many additional results conclusion we have presented novel approach to object proposal generation in the context of autonomous driving in contrast to most existing work we take advantage of stereo imagery and reason directly in we formulate the problem as inference in markov random field encoding object size priors ground plane and variety of depth informed features our approach significantly outperforms existing object proposal methods on the challenging kitti benchmark in particular for proposals our approach achieves higher recall than the method combined with cnn scoring our method significantly outperforms all previous published object detection results for all three object classes on the kitti benchmark acknowledgements the work was partially supported by nsfc nserc and toyota motor corporation references felzenszwalb girshick mcallester and ramanan object detection with discriminatively trained part based models pami krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips simonyan and zisserman very deep convolutional networks for image recognition in van de sande uijlings gevers and smeulders segmentation as selective search for object recognition in iccv arbelaez barron marques and malik multiscale combinatorial grouping in cvpr girshick donahue darrell and malik rich feature hierarchies for accurate object detection and semantic segmentation arxiv preprint zhu urtasun salakhutdinov and fidler segdeepm exploiting segmentation and context in deep neural networks for object detection in cvpr everingham van gool williams winn and zisserman the pascal visual object classes challenge results zitnick and edge boxes locating object proposals from edges in eccv carreira and sminchisescu cpmc automatic object segmentation using constrained parametric pami geiger lenz and urtasun are we ready for autonomous driving the kitti vision benchmark suite in cvpr xiang choi lin and savarese voxel patterns for object 
category recognition in cvpr long wang hua yang and lin accurate object detection with location relaxation and regionlets relocalization in accv gupta girshick arbelaez and malik learning rich features from images for object detection and segmentation in eccv cheng zhang lin and torr bing binarized normed gradients for objectness estimation at in cvpr lee fidler and dickinson learning framework for generating region proposals with cues in iccv banica and sminchisescu semantic segmentation of images using cpmc and second order pooling in corr lin fidler and urtasun holistic scene understanding for object detection with rgbd cameras in iccv karpathy miller and li object discovery in scenes via shape analysis in icra oneata revaud verbeek and schmid object detection proposals in eccv carreira caseiro batista and sminchisescu semantic segmentation with pooling in eccv fidler mottaghi yuille and urtasun segmentation for detection in cvpr alexe deselares and ferrari measuring the objectness of image windows pami hosang benenson and schiele what makes for effective detection proposals song and xiao sliding shapes for object detection in depth images in eccv zia stark and schindler towards scene understanding with detailed object representations ijcv pepik stark gehler and schiele and deformable part models pami and trivedi learning to detect vehicles by clustering appearance patterns ieee transactions on intelligent transportation systems wang fidler and urtasun holistic scene understanding from single image in cvpr appel belongie and perona fast feature pyramids for object detection pami yamaguchi mcallester and urtasun efficient joint segmentation occlusion labeling stereo and flow estimation in eccv tsochantaridis hofmann joachims and altun support vector learning for interdependent and structured output spaces in icml schwing fidler pollefeys and urtasun box in the box joint layout and object reasoning from single images in iccv ross girshick fast in iccv geiger wojek and urtasun joint estimation of objects and scene layout in nips benenson mathias tuytelaars and van gool seeking the strongest rigid detector in cvpr yebes bergasa arroyo and lzaro supervised learning and evaluation of kitti cars detector with dpm in iv pepik stark gehler and schiele occlusion patterns for object class detection in cvpr li wu and zhu integrating context and occlusion for car detection by hierarchical model in eccv xu ramos vozquez and lopez hierarchical adaptive structural svm for domain adaptation in premebida carreira batista and nunes pedestrian detection combining rgb and dense lidar data in iros hosang omran benenson and schiele taking deeper look at pedestrians in arxiv zhang benenson and schiele filtered channel features for pedestrian detection in paisitkriangkrai shen and van den hengel pedestrian detection with spatially pooled features and structured ensemble learning in gonzalez villalonga xu vazquez amores and lopez multiview random forest of local experts combining rgb and lidar data for pedestrian detection in iv 
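To make the constant-time box scoring described above concrete, here is a small sketch of a 3-D integral accumulator (a summed-volume table, the 3-D analogue of an integral image) and how the point-cloud-density and free-space potentials of a candidate box could be evaluated with it. The grids, weights, sign convention (reward density, penalize free space), and function names are illustrative assumptions; the paper additionally uses height-prior and height-contrast terms and learns the weights with a structured SVM.

```python
import numpy as np

def integral_accumulator(grid):
    """3-D summed-volume table with a zero border: S[i, j, k] = sum(grid[:i, :j, :k])."""
    return np.pad(grid.cumsum(0).cumsum(1).cumsum(2), ((1, 0), (1, 0), (1, 0)))

def box_sum(S, lo, hi):
    """Sum of the underlying grid over the half-open voxel box [lo, hi), in O(1) time."""
    (x0, y0, z0), (x1, y1, z1) = lo, hi
    return (S[x1, y1, z1] - S[x0, y1, z1] - S[x1, y0, z1] - S[x1, y1, z0]
            + S[x0, y0, z1] + S[x0, y1, z0] + S[x1, y0, z0] - S[x0, y0, z0])

def score_box(S_occ, S_free, lo, hi, w_pcd=1.0, w_fs=1.0):
    """Two of the depth-informed potentials: point-cloud density inside the box (rewarded)
    and free space inside the box (penalized). Height terms are omitted in this sketch."""
    vol = float(np.prod(np.asarray(hi) - np.asarray(lo)))
    phi_pcd = box_sum(S_occ, lo, hi) / vol     # fraction of occupied voxels in the box
    phi_fs = box_sum(S_free, lo, hi) / vol     # fraction of free-space voxels in the box
    return w_pcd * phi_pcd - w_fs * phi_fs     # higher is better under this sign convention

# Toy usage: random occupancy/free-space grids stand in for the voxelized stereo point cloud.
rng = np.random.default_rng(0)
occ = (rng.random((80, 40, 60)) < 0.02).astype(float)
free = ((rng.random((80, 40, 60)) < 0.3) & (occ == 0)).astype(float)
S_occ, S_free = integral_accumulator(occ), integral_accumulator(free)
print(score_box(S_occ, S_free, lo=(10, 5, 20), hi=(26, 15, 44)))
```

Because each potential reduces to a handful of lookups in the accumulator, scoring one candidate box takes constant time, which is what makes exhaustive evaluation over the discretized grid feasible, as noted in the text.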
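The structured-SVM task loss above is the 3-D IoU between ground-truth and candidate boxes, i.e. the volume of their intersection divided by the volume of their union. A minimal sketch for axis-aligned boxes follows; handling the azimuth parameter of the paper's boxes is omitted, and the corner-based box representation is an assumption made for illustration.

```python
import numpy as np

def iou_3d_axis_aligned(box_a, box_b):
    """3-D IoU for axis-aligned boxes given as (min_corner, max_corner) pairs."""
    lo_a, hi_a = map(np.asarray, box_a)
    lo_b, hi_b = map(np.asarray, box_b)
    inter_dims = np.minimum(hi_a, hi_b) - np.maximum(lo_a, lo_b)
    inter = float(np.prod(np.clip(inter_dims, 0.0, None)))   # zero if the boxes do not overlap
    vol_a = float(np.prod(hi_a - lo_a))
    vol_b = float(np.prod(hi_b - lo_b))
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

# Example: a candidate covering most of a roughly car-sized ground-truth box.
gt = ((0.0, 0.0, 0.0), (2.0, 1.6, 4.0))
cand = ((0.2, 0.0, 0.3), (2.2, 1.6, 4.1))
print(round(iou_3d_axis_aligned(gt, cand), 3))
```

As the text notes, this volumetric overlap is a strict measure: it rewards proposals that are placed accurately in 3-D rather than merely overlapping well in the image plane.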
algorithms with logarithmic or sublinear regret for constrained contextual bandits huasen wu university of california at davis hswu srikant university of illinois at rsrikant xin liu university of california at davis liu chong jiang university of illinois at abstract we study contextual bandits with budget and time constraints referred to as constrained contextual bandits the time and budget constraints significantly complicate the exploration and exploitation tradeoff because they introduce complex coupling among contexts over time to gain insight we first study systems with known context distribution when the expected rewards are known we develop an approximation of the oracle referred to alp which achieves and only requires the ordering of expected rewards with these highly desirable features we then combine alp with the ucb method in the general case where the expected rewards are unknown priori we show that the proposed algorithm achieves logarithmic regret except for certain boundary cases further we design algorithms and obtain similar regret bounds for more general systems with unknown context distribution and heterogeneous costs to the best of our knowledge this is the first work that shows how to achieve logarithmic regret in constrained contextual bandits moreover this work also sheds light on the study of computationally efficient algorithms for general constrained contextual bandits introduction the contextual bandit problem is an important extension of the classic bandit mab problem where the agent can observe set of features referred to as context before making decision after the random arrival of context the agent chooses an action and receives random reward with expectation depending on both the context and action to maximize the total reward the agent needs to make careful tradeoff between taking the best action based on the historical performance exploitation and discovering the potentially better alternative actions under given context exploration this model has attracted much attention as it fits the personalized service requirement in many applications such as clinical trials online recommendation and online hiring in crowdsourcing existing works try to reduce the regret of contextual bandits by leveraging the structure of the models such as linearity or similarity and more recent work focuses on computationally efficient algorithms with minimum regret for markovian context arrivals algorithms such as ucrl for more general reinforcement learning problem can be used to achieve logarithmic regret however traditional contextual bandit models do not capture an important characteristic of real systems in addition to time there is usually cost associated with the resource consumed by each action and the total cost is limited by budget in many applications taking crowdsourcing as an example the budget constraint for given set of tasks will limit the number of workers that an employer can hire another example is the clinical trials where each treatment is usually costly and the budget of trial is limited although budget constraints have been studied in bandits where logarithmic or sublinear regret is achieved as we will see later these results are inapplicable in the case with observable contexts in this paper we study contextual bandit problems with budget and time constraints referred to as constrained contextual bandits where the agent is given budget and in addition to reward cost is incurred whenever an action is taken under context the bandit process ends when the agent 
runs out of either budget or time the objective of the agent is to maximize the expected total reward subject to the budget and time constraints we are interested in the regime where and grow towards infinity proportionally the above constrained contextual bandit problem can be viewed as special case of resourceful contextual bandits rcb in rcb is studied under more general settings with possibly infinite contexts random costs multiple budget constraints mixture elimination algorithm is proposed and shown to achieve regret however the benchmark for the definition of regret in is restricted to within finite policy set moreover the mixture elimination algorithm suffers high complexity and the design of computationally efficient algorithms for such general settings is still an open problem to tackle this problem motivated by certain applications we restrict the set of parameters in our model as follows we assume finite discrete contexts fixed costs and single budget constraint this simplified model is justified in many scenarios such as clinical trials and rate selection in wireless networks more importantly these simplifications allow us to design algorithms that achieve log regret except for set of parameters of zero lebesgue measure which we refer to as boundary cases where the regret is defined more naturally as the performance gap between the proposed algorithm and the oracle the optimal algorithm with known statistics even with simplified assumptions considered in this paper the tradeoff is still challenging due to the budget and time constraints the key challenge comes from the complexity of the oracle algorithm with budget and time constraints the oracle algorithm can not simply take the action that maximizes the instantaneous reward in contrast it needs to balance between the instantaneous and rewards based on the current context and the remaining budget in principle dynamic programming dp can be used to obtain this balance however using dp in our scenario incurs difficulties in both algorithm design and analysis first the implementation of dp is computationally complex due to the curse of dimensionality second it is difficult to obtain benchmark for regret analysis since the dp algorithm is implemented in recursive manner and its expected total reward is hard to be expressed in closed form third it is difficult to extend the dp algorithm to the case with unknown statistics due to the difficulty of evaluating the impact of estimation errors on the performance of algorithms to address these difficulties we first study approximations of the oracle algorithm when the system statistics are known our key idea is to approximate the oracle algorithm with linear programming lp that relaxes the hard budget constraint to an average budget constraint when fixing the average budget constraint at this lp approximation provides an upper bound on the expected total reward which serves as good benchmark in regret analysis further we propose an adaptive linear programming alp algorithm that adjusts the budget constraint to the average remaining budget bτ where is the remaining time and bτ is the remaining budget note that although the idea of approximating dp problem with an lp problem has been widely studied in literature the design and analysis of alp here is quite different in particular we show that alp achieves regret its expected total reward is within constant independent of from the optimum except for certain boundaries this alp approximation and its regret analysis make an important step 
towards achieving logarithmic regret for constrained contextual bandits using the insights from the case with known statistics we study algorithms for constrained contextual bandits with unknown expected rewards complicated interactions between information acquisition and decision making arise in this case fortunately the alp algorithm has highly desirable property that it only requires the ordering of the expected rewards and can tolerate certain estimation errors of system parameters this property allows us to combine alp with estimation methods that can efficiently provide correct rank of the expected rewards in this paper we propose algorithm by combining alp with the ucb method we achieves log regret except for certain boundary cases where its regret is we note that algorithms are proposed in for bandits with concave rewards and convex constraints and further extended to linear contextual bandits however focuses on static and achieves regret in our setting since it uses fixed budget constraint in each round in comparison we consider random context arrivals and use an adaptive after the online publication of our preliminary version two recent papers extend their previous work to the dynamic context case where they focus on possibly infinite contexts and achieve regret and restricts to finite policy set as budget constraint to achieve logarithmic regret to the best of our knowledge this is the first work that shows how to achieve logarithmic regret in constrained contextual bandits moreover the proposed algorithm is quite computationally efficient and we believe these results shed light on addressing the open problem of general constrained contextual bandits although the intuition behind alp and is natural the rigorous analysis of their regret is since we need to consider many interacting factors such as ranking errors remaining budget fluctuation and randomness of context arrival we evaluate the impact of these factors using series of novel techniques the method of showing concentration properties under adaptive algorithms and the method of bounding estimation errors under random contexts for the ease of exposition we study the alp and algorithms in systems with known context distribution in sections and respectively then we discuss the generalization to systems with unknown context distribution in section and with heterogeneous costs in section which are much more challenging and the details can be found in the supplementary material system model we consider contextual bandit problem with context set and an action set at each round context xt arrives independently with identical distribution xt πj and each action generates reward yk under given context xt the reward yk are independent random variables in the conditional expectation yk uj is unknown to the agent moreover cost is incurred if action is taken under context to gain insight into constrained contextual bandits we consider fixed and known costs in this paper where the cost is cj when action is taken under context similar to traditional contextual bandits the context xt is observable at the beginning of round while only the reward of the action taken by the agent is revealed at the end of round at the beginning of round the agent observes the context xt and takes an action at from where represents dummy action that the agent skips the current context let yt and zt be the reward and cost for the agent in round respectively if the agent takes an action at then the reward is yt yk and the cost is zt cxt otherwise if the agent 
takes the dummy action at neither reward nor cost is incurred yt and zt in this paper we focus on contextual bandits with known and limited budget the bandit process ends when the agent runs out of the budget or at the end of time contextual bandit algorithm is function that maps the historical observations and the current context xt to an action at the objective of the algorithm is to maximize the expected total reward uγ for given and budget maximizeγ uγ eγ yt subject to zt where the expectation is taken over the distributions of contexts and rewards note that we consider hard budget constraint the total costs should not be greater than under any realization we measure the performance of the algorithm by comparing it with the oracle which is the optimal algorithm with known statistics including the knowledge of πj uj and cj let be the expected total reward obtained by the oracle algorithm then the regret of the algorithm is defined as rγ uγ the objective of the algorithm is then to minimize the regret we are interested in the asymptotic regime where the and the budget grow to infinity proportionally with fixed ratio approximations of the oracle in this section we study approximations of the oracle where the statistics of bandits are known to the agent this will provide benchmark for the regret analysis and insights into the design of constrained contextual bandit algorithms as starting point we focus on systems cj for each and from section to section which will be relaxed in section in systems the quality of action under context is fully captured by its expected reward uj let be the highest expected reward under context and be the best action for context uj and arg uj for ease of exposition we assume that the best action under each context is unique uj for all and similarly we also assume for simplicity with the knowledge of uj the agent knows the best action and its expected reward under any context in each round the task of the oracle is deciding whether to take action kx or not depending on the remaining time and the remaining budget bτ the special case of systems is trivial where the agent just needs to procrastinate for the better context see appendix of the supplementary material when considering more general cases with however it is computationally intractable to exactly characterize the oracle solution therefore we resort to approximations based on linear programming lp upper bound static linear programming we propose an upper bound for the expected total reward of the oracle by relaxing the hard constraint to an average constraint and solving the corresponding constrained lp problem specifically let pj be the probability that the agent takes action for context and pj be the probability that the agent skips context taking action at denote the probability vector as pj for and budget consider the following lp problem lp maximizep pj πj subject to pj πj define the following threshold as function of the average budget max πj with the convention that if we can verify that the following solution is optimal for lp if πj pj if if correspondingly the optimal value of lp is πj this optimal value can be viewed as the maximum expected reward in single round with average budget summing over the entire horizon the total expected reward becomes which is an upper bound of lemma for system with known statistics if the is and the budget is then the proof of lemma is available in appendix of the supplementary material with lemma we can bound the regret of any algorithm by comparing its performance 
with the upper bound has simple expression as we will see later it significantly instead of since reduces the complexity of regret analysis adaptive linear programming although the solution provides an upper bound on the expected reward using such fixed algorithm will not achieve good performance as the ratio bτ referred to as average remaining budget fluctuates over time we propose an adaptive linear programming alp algorithm that adjusts the threshold and randomization probability according to the instantaneous value of bτ specifically when the remaining time is and the remaining budget is bτ we consider an lp problem lp which is the same as lp except that in eq is replaced with then the optimal solution for lp can be obtained by replacing in eqs and with the alp algorithm then makes decisions based on this optimal solution alp algorithm at each round with remaining budget bτ obtain pj by solving lp take action at kx with probability pxt and at with probability pxt the above alp algorithm only requires the ordering of the expected rewards instead of their accurate values this highly desirable feature allows us to combine alp with classic mab algorithms such as ucb for the case without knowledge of expected rewards moreover this simple alp algorithm achieves very good performance within constant distance from the optimum regret except for certain boundary cases specifically for let qj be the cumulative probability pj defined as qj πj with the convention that the following theorem states the near optimality of alp theorem given any fixed the regret of alp satisfies cases if qj for any then ralp where min boundary cases if qj for some then ralp where and min theorem shows that alp achieves regret except for certain boundary cases where it still achieves regret this implies that the regret due to the linear relaxation is negligible in most cases thus when the expected rewards are unknown we can achieve low regret logarithmic regret by combining alp with appropriate mechanisms sketch of proof although the alp algorithm seems fairly intuitive its regret analysis is nontrivial the key to the proof is to analyze the evolution of the remaining budget bτ by mapping alp to sampling without replacement specifically from eq we can verify that when the remaining time is and the remaining budget is bτ the system consumes one unit of budget with probability and consumes nothing with probability when considering the remaining budget the alp algorithm can be viewed as sampling without replacement thus we can show that bτ follows the hypergeometric distribution and has the following properties lemma under the alp algorithm the remaining budget bτ satisfies the expectation and variance of bτ are bτ ρτ and var bτ tt respectively for any positive number satisfying min the tail distribution of bτ satisfies bτ and bτ then we prove theorem pt based note that the expected total reward under alp is ualp bτ where is defined in and the expectation is taken over the distribution of bτ for the cases the expected reward satisfies bτ if the threshold bτ for all possible bτ the regret then is bounded by constant because the probability of the event bτ decays exponentially due to the concentration property of bτ for the boundary cases we show the conclusion by relating the regret with the variance of bτ please refer to appendix of the supplementary material for details algorithm for constrained contextual bandits now we get back to the constrained contextual bandits where the expected rewards are unknown to the agent we 
assume the agent knows the context distribution as which will be relaxed in section thanks to the desirable properties of alp the maxim of optimism under uncertainty is still applicable and alp can be extended to the bandit settings when combined with estimation policies that can quickly provide correct ranking with high probability here combining alp with the ucb method we propose algorithm for constrained contextual bandits ucb notations and property let cj be the number of times that action has been taken under context up to round if cj let be the empirical reward of action under context yt xt at where is the indicator function we define the ucb cj log of uj at as for cj and for cj furthermore we define the ucb of the maximum expected reward under context as as suggested in we use smaller coefficient in the exploration term log than the traditional ucb algorithm to achieve better performance we present the following property of ucb that is important in regret analysis lemma for two pairs and if uj uj then for any where log uj lemma states that for two pairs the ordering of their expected rewards can be identified correctly with high probability as long as the suboptimal pair has been executed for sufficient times on the order of log this property has been widely applied in the analysis of ucbbased algorithms and its proof can be found in with minor modification on the coefficients algorithm we propose adaptive linear programming algorithm as shown in algorithm as indicated by the name the algorithm maintains ucb estimates of expected rewards for all pairs and then implements the alp algorithm based on these estimates note that the ucb estimates may be in thus the solution of lp based on depends on the actual ordering of and may be different from eq we use rather than pj to indicate this difference algorithm input budget and context distribution πj init cj and for to do arg maxk if then obtain the probabilities by solving lp with replaced by take action kx with probability end if update cj and end for regret of we study the regret of in this section due to space limitations we only present sketch of the analysis specific representations of the regret bounds and proof details can be found in the supplementary material pj recall that qj πj are the boundaries defined in section we show that as the budget and the grow to infinity in proportion the proposed algorithm achieves logarithmic regret except for the boundary cases theorem given πj uj and fixed the regret of satisfies cases if for any then the regret of is jk log boundary cases if qj for some then the regret of is jk log theorem differs from theorem by an additional term jk log this term results from using ucb to learn the ordering of expected rewards under ucb each of the jk pairs should be executed roughly log times to obtain the correct ordering for the cases is because obtaining the correct action ranking under each context will result in log regret note that our results do not contradict the lower bound in because we consider discrete actions and focus on regret for the boundary cases we keep both the log terms because the constant in the log term is typically much larger than that in the term therefore the log term may dominate the regret particularly when the number of pairs is large for medium it is still an open problem if one can achieve regret lower than in these cases sketch of proof we bound the regret of by comparing its performance with the benchb the analysis of this bound is challenging due to the close interactions 
among differmark ent sources of regret and the randomness of context arrivals we first partition the regret according to the sources and then bound each part of regret respectively step partition the regret by analyzing the implementation of we show that its regret is bounded as pj where the first part uj cj is the regret from pt action ranking errors within context and the second part pj bτ πj uj is the regret from the fluctuations of bτ and context ranking errors step bound each part of regret for the first part we can show that log using similar techniques for traditional ucb methods the major challenge of regret analysis for then lies in the evaluation of the second part we first verify that the evolution of bτ under is similar to that under alp and lemma still holds under with respect to context ranking errors we note that unlike classic ucb methods not all context ranking errors contribute to the regret due to the threshold structure of alp therefore we carefully categorize the context ranking results based on their contributions we briefly discuss the analysis for the cases here recall that is the threshold for the static lp problem lp we define the following events that capture all possible ranking results based on ucbs the first event indicates roughly correct context ranking because under ucbalp obtains correct solution for lp bτ if bτ the last two events erank represent two types of context ranking errors corresponds to certain contexts with reward having lower ucb while corresponds to certain contexts pt with reward having higher ucb let erank for we can show that the expected number of context ranking errors satisfies jk log implying that jk log summarizing the two parts we have jk log for the cases the regret for the boundary cases can be bounded using similar arguments key insights from constrained contextual bandits involve complicated interactions between information acquisition and decision making alleviates these interactions by approximating the oracle with alp for decision making this approximation achieves performance while tolerating certain estimation errors of system statistics and thus enables the combination with estimation methods such as ucb in unknown statistics cases moreover the adaptation property of guarantees the concentration property of the system status bτ this allows us to separately study the impact of action or context ranking errors and conduct rigorous analysis of regret these insights can be applied in algorithm design and analysis for constrained contextual bandits under more general settings bandits with unknown context distribution when the context distribution is unknown reasonable heuristic is to replace the probability πj in pt alp with its empirical estimate we refer to this modified alp algorithm as empirical alp ealp and its combination with ucb as the empirical distribution provides maximum likelihood estimate of the context distribution and the ealp and algorithms achieve similar performance as alp and respectively as observed in numerical simulations however rigorous analysis for ealp and ucbealp is much more challenging due to the dependency introduced by the empirical distribution to tackle this issue our rigorous analysis focuses on truncated version of ealp where we stop updating the empirical distribution after given round using the method of bounded averaged differences based on coupling argument we obtain the concentration property of the average remaining budget bτ and show that this truncated ealp algorithm achieves regret 
except for the boundary cases the regret of the corresponding version can by bounded similarly as bandits with heterogeneous costs the insights obtained from systems can also be used to design algorithms for heterogeneous cost systems where the cost cj depends on and we generalize the alp algorithm to approximate the oracle and adjust it to the case with unknown expected rewards for simplicity we assume the context distribution is known here while the empirical estimate can be used to replace the actual context distribution if it is unknown as discussed in the previous section with heterogeneous costs the quality of an action under context is roughly captured by its normalized expected reward defined as ηj uj however the agent can not only focus on the best action arg ηj for context this is because there may exist another action such that ηj ηj but uj uj and of course cj cj if the budget allocated to context is sufficient then the agent may take action to maximize the expected reward therefore to approximate the oracle the alp algorithm in this case needs to solve an lp problem accounting for all pairs with an additional constraint that only one action can be taken under each context by investigating the structure of alp in this case and the concentration of the remaining budget we show that alp achieves regret in cases and regret in boundary cases then an alp algorithm is proposed for the unknown statistics case where an exploration stage is implemented first and then an exploitation stage is implemented according to alp conclusion in this paper we study algorithms that achieve logarithmic or sublinear regret for constrained contextual bandits under simplified yet practical assumptions we show that the close interactions between the information acquisition and decision making in constrained contextual bandits can be decoupled by adaptive linear relaxation when the system statistics are known the alp approximation achieves performance while tolerating certain estimation errors of system parameters when the expected rewards are unknown the proposed algorithm leverages the advantages of ucb and achieves log regret except for certain boundary cases where it achieves regret our study provides an efficient approach of dealing with the challenges introduced by budget constraints and could potentially be extended to more general constrained contextual bandits acknowledgements this research was supported in part by nsf grants and afosr muri grant fa references langford and zhang the algorithm for contextual bandits in advances in neural information processing systems nips pages lu and contextual bandits in international conference on artificial intelligence and statistics pages zhou survey on contextual bandits arxiv preprint auer and fischer analysis of the multiarmed bandit problem machine learning li chu langford and schapire approach to personalized news article recommendation in acm international conference on world wide web www pages slivkins contextual bandits with similarity information the journal of machine learning research agarwal hsu kale langford li and schapire taming the monster fast and simple algorithm for contextual bandits in international conference on machine learning icml auer and ortner logarithmic online regret bounds for undiscounted reinforcement learning in advances in neural information processing systems nips pages badanidiyuru kleinberg and singer learning on budget posted price mechanisms for online procurement in acm conference on electronic commerce pages lai and 
liao efficient adaptive randomization and stopping rules in clinical trials for testing new treatment sequential analysis chapman rogers and jennings knapsack based optimal policies for bandits in aaai conference on artificial intelligence badanidiyuru kleinberg and slivkins bandits with knapsacks in ieee annual symposium on foundations of computer science focs pages jiang and srikant bandits with budgets in ieee annual conference on decision and control cdc pages slivkins dynamic ad allocation bandits with budgets arxiv preprint xia li qin yu and liu thompson sampling for budgeted bandits in international joint conference on artificial intelligence combes jiang and srikant bandits with budgets regret lower bounds and optimal algorithms in acm sigmetrics badanidiyuru langford and slivkins resourceful contextual bandits in conference on learning theory colt combes proutiere yun ok and yi optimal rate sampling in systems in ieee infocom pages veatch approximate linear programming for average cost mdps mathematics of operations research agrawal and devanur bandits with concave rewards and convex knapsacks in acm conference on economics and computation pages acm agrawal devanur and li contextual bandits with global constraints and objective arxiv preprint agrawal and devanur linear contextual bandits with global constraints and objective arxiv preprint dubhashi and panconesi concentration of measure for the analysis of randomized algorithms cambridge university press garivier and the algorithm for bounded stochastic bandits and beyond in conference on learning theory colt pages golovin and krause dealing with partial feedback lai and robbins asymptotically efficient adaptive allocation rules advances in applied mathematics 
tensorizing neural networks alexander dmitry anton dmitry skolkovo institute of science and technology moscow russia inria sierra paris france national research university higher school of economics moscow russia institute of numerical mathematics of the russian academy of sciences moscow russia novikov vetrovd abstract deep neural networks currently demonstrate performance in several domains at the same time models of this class are very demanding in terms of computational resources in particular large amount of memory is required by commonly used layers making it hard to use the models on devices and stopping the further increase of the model size in this paper we convert the dense weight matrices of the layers to the tensor train format such that the number of parameters is reduced by huge factor and at the same time the expressive power of the layer is preserved in particular for the very deep vgg networks we report the compression factor of the dense weight matrix of layer up to times leading to the compression factor of the whole network up to times introduction deep neural networks currently demonstrate performance in many domains of largescale machine learning such as computer vision speech recognition text processing etc these advances have become possible because of algorithmic advances large amounts of available data and modern hardware for example convolutional neural networks cnns show by large margin superior performance on the task of image classification these models have thousands of nodes and millions of learnable parameters and are trained using millions of images on powerful graphics processing units gpus the necessity of expensive hardware and long processing time are the factors that complicate the application of such models on conventional desktops and portable devices consequently large number of works tried to reduce both hardware requirements memory demands and running times see sec in this paper we consider probably the most frequently used layer of the neural networks the fullyconnected layer this layer consists in linear transformation of input signal to output signal with large dense matrix defining the transformation for example in modern cnns the dimensions of the input and output signals of the layers are of the order of thousands bringing the number of parameters of the layers up to millions we use compact multiliniear format to represent the dense weight matrix of the layers using few parameters while keeping enough flexibility to perform signal transformations the resulting layer is compatible with the existing training algorithms for neural networks because all the derivatives required by the algorithm can be computed using the properties of the we call the resulting layer and refer to network with one or more as tensornet we apply our method to popular network architectures proposed for several datasets of different scales mnist imagenet we experimentally show that the networks with the match the performance of their uncompressed counterparts but require up to times less of parameters decreasing the size of the whole network by factor of the rest of the paper is organized as follows we start with review of the related work in sec we introduce necessary notation and review the tensor train tt format in sec in sec we apply the to the weight matrix of layer and in sec derive all the equations necessary for applying the algorithm in sec we present the experimental evaluation of our ideas followed by discussion in sec related work with sufficient amount of 
training data big models usually outperform smaller ones however neural networks reached the hardware limits both in terms the computational power and the memory in particular modern networks reached the memory limit with or even memory occupied by the weights of the layers so it is not surprising that numerous attempts have been made to make the layers more compact one of the most straightforward approaches is to use representation of the weight matrices recent studies show that the weight matrix of the layer is highly redundant and by restricting its matrix rank it is possible to greatly reduce the number of parameters without significant drop in the predictive accuracy an alternative approach to the problem of model compression is to tie random subsets of weights using special hashing techniques the authors reported the compression factor of for twolayered network on the mnist dataset without loss of accuracy memory consumption can also be reduced by using lower numerical precision or allowing fewer possible carefully chosen parameter values in our paper we generalize the ideas instead of searching for approximation of the weight matrix we treat it as tensor and apply the tensor train decomposition algorithm this framework has already been successfully applied to several tasks another possible advantage of our approach is the ability to use more hidden units than was available before recent work shows that it is possible to construct wide and shallow not deep neural networks with performance close to the deep cnns by training shallow network on the outputs of trained deep network they report the improvement of performance with the increase of the layer size and used up to hidden units while restricting the matrix rank of the weight matrix in order to be able to keep and to update it during the training restricting the of the weight matrix in contrast to the matrix rank allows to use much wider layers potentially leading to the greater expressive power of the model we demonstrate this effect by training very wide model hidden units on the dataset that outperforms other networks matrix and tensor decompositions were recently used to speed up the inference time of cnns while we focus on layers lebedev et al used the to compress convolution kernel and then used the properties of the decomposition to speed up the inference time this work shares the same spirit with our method and the approaches can be readily combined gilboa et al exploit the properties of the kronecker product of matrices to perform fast multiplication these matrices have the same structure as with unit compared to the tucker format and the canonical format the is immune to the curse of dimensionality and its algorithms are robust compared to the hierarchical tucker format tt is quite similar but has simpler algorithms for basic operations throughout this paper we work with arrays of different dimensionality we refer to the onedimensional arrays as vectors the arrays matrices the arrays of higher dimensions tensors bold lower case letters denote vectors ordinary lower case letters ai vector elements bold upper case letters matrices ordinary upper case letters matrix elements calligraphic bold upper case letters for tensors and ordinary calligraphic upper case letters id tensor elements where is the dimensionality of the tensor we will call arrays explicit to highlight cases when they are stored explicitly by enumeration of all the elements array tensor is said to be represented in the if for each dimension and for each 
possible value of the dimension index jk nk there exists matrix gk jk such that all the elements of can be computed as the following matrix product jd gd jd all the matrices gk jk related to the same dimension are restricted to be of the same size rk the values and rd equal in order to keep the matrix product of size in what follows we refer to the representation of tensor in the as the or the the sequence rk is referred to as the of the of or the ranks for short its maximum as the maximal of the of rk the collections of the matrices gk jk corresponding to the same dimension technically arrays are called the cores oseledets th shows that for an arbitrary tensor exists but is not unique the ranks among different can vary and it natural to seek representation with the lowest ranks we use the symbols gk jk αk to denote the element of the matrix gk jk in the position αk where αk rk equation can be equivalently rewritten as the sum of the products of the elements of the cores jd gd jd αd αd the representation of tensor via the explicit enumeration of all its elements requires to store qd pd nk numbers compared with nk rk numbers if the tensor is stored in the thus the is very efficient in terms of memory if the ranks are small an attractive property of the is the ability to efficiently perform several types of operations on tensors if they are in the basic linear algebra operations such as the addition of constant and the multiplication by constant the summation and the entrywise product of tensors the results of these operations are tensors in the generally with the increased ranks computation of global characteristics of tensor such as the sum of all elements and the frobenius norm see for detailed description of all the supported operations for vectors and matrices the direct application of the to matrix tensor coincides with the matrix format and the direct of vector is equivalent to explicitly storing its elements to be able to efficiently work with large vectors and matrices the qd for them is defined in special manner consider vector rn where nk we can establish bijection between the coordinate of and vectorindex µd of the corresponding tensor where µk nk the tensor is then defined by the corresponding vector elements building ttrepresentation of allows us to establish compact format for the vector we refer to it as qd now we define of matrix rm where mk and qd nk let bijections νd and µd map row and column indices and of the matrix to the whose dimensions are of length mk and nk respectively from the matrix we can form tensor whose dimension is of length mk nk and is indexed by the tuple νk µk the tensor can then be converted into the νd µd gd νd µd where the matrices gk νk µk serve as the cores with tuple νk µk being an index note that matrix in the is not restricted to be square although indexvectors and are of the same length the sizes of the domains of the dimensions can vary we call matrix in the all operations available for the are applicable to the and the as well for example one can efficiently sum two and get the result in the additionally the allows to efficiently perform the product if only one of the operands is in the the result would be an explicit vector matrix if both operands are in the the operation would be even more efficient and the result would be given in the as well generally with the increased ranks for the case of the product the computational complexity is max where is the number of the cores of the mk is the maximal qd rank and nk is the length of the vector 
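To make the TT-representation described above concrete, here is a minimal NumPy sketch (not code from the paper; the function name tt_element and the toy shapes are illustrative assumptions) that evaluates a single tensor element as the product of core slices G_k[j_k], matching the elementwise definition of the format given in this section.

```python
import numpy as np

def tt_element(cores, index):
    """Evaluate one element A[j_1, ..., j_d] of a tensor stored in TT-format.

    cores is a list of d arrays; core k has shape (r_{k-1}, n_k, r_k) with
    r_0 = r_d = 1, so cores[k][:, j_k, :] is the matrix G_k[j_k].  The element
    equals the product G_1[j_1] G_2[j_2] ... G_d[j_d], which is a 1x1 matrix.
    """
    result = np.ones((1, 1))
    for core, j in zip(cores, index):
        result = result @ core[:, j, :]      # (1, r_{k-1}) @ (r_{k-1}, r_k)
    return result[0, 0]

# Toy usage: a random 4 x 5 x 6 tensor with TT-ranks (1, 3, 2, 1).
# Storing the cores takes sum_k n_k * r_{k-1} * r_k numbers instead of
# prod_k n_k for the explicit tensor.
shapes, ranks = [4, 5, 6], [1, 3, 2, 1]
rng = np.random.default_rng(0)
cores = [rng.standard_normal((ranks[k], shapes[k], ranks[k + 1]))
         for k in range(len(shapes))]
print(tt_element(cores, (2, 0, 5)))
```

The same slice-product structure is what keeps operations such as the matrix-by-vector product efficient when the ranks are small: only matrices of size r_{k-1} x r_k are ever formed, never the full explicit tensor.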
the ranks and correspondingly the efficiency of the for vector matrix depend on the choice of the mapping mappings and between vector matrix elements and the underlying tensor elements in what follows we use matlab reshape command to form tensor from the data from multichannel image but one can choose different mapping in this section we introduce the of neural network in short the is fullyconnected layer with the weight matrix stored in the we will refer to neural network with one or more as tensornet layers apply linear transformation to an input vector where the weight matrix rm and the bias vector rm define the transformation consists in storing the weights of the layer in the allowing to use hundreds of thousands or even millions of hidden units while having moderate number of parameters to control the number of parameters one can vary the number of hidden units as well as the of the weight matrix transforms tensor formed from the corresponding vector to the ddimensional tensor which correspond to the output vector we assume that the weight matrix is represented in the with the cores gk ik jk the linear transformation of layer can be expressed in the tensor form id gd id jd jd id jd direct application of the operation for the eq yields the computational complexity of the forward pass max max learning neural networks are usually trained with the stochastic gradient descent algorithm where the gradient is computed using the procedure allows to compute the gradient of with respect to all the parameters of the network the method starts with the computation of the gradient of the output of the last layer and proceeds sequentially through the layers in the reversed order while computing the gradient the parameters and the input of the layer making use of the gradients computed earlier applied to the layers the method computes the gradients the input and the parameters and given the gradients to the output in what follows we derive the gradients required to use the algorithm with the ttlayer to compute the gradient of the loss function the bias vector and the input vector one can use equations the latter can be applied using the product where the matrix is in the with the complexity of max max http operation fc forward pass tt forward pass fc backward pass tt backward pass time max max memory max max table comparison of the asymptotic complexity and memory usage of an and an layer fc the input and output tensor shapes are md and nd respectively mk and is the maximal to perform step of stochastic gradient descent one can use equation to compute the gradient of the loss function the weight matrix convert the gradient matrix into the with the algorithm and then add this gradient multiplied by step size to the current estimate of the weight matrix wk γk however the direct computation of requires memory better way to learn the tensornet parameters is to compute the gradient of the loss function directly the cores of the of in what follows we use shortened notation for prefix and postfix sequences of indices we also introduce notations for partial core products pk ik jk jk gd id jd we now rewrite the definition of the transformation for any jk gk ik jk pk ik jk jk jk jk ik ik jk the gradient of the loss function to the core in the position can be computed using the chain rule rk the summation can be done explicitly in rk given the gradient matrices time where is the length of the output vector we now show how to compute the matrix for any values of the core index and mk nk for any id such that ik 
the value of doesn depend on the elements of gk making the corresponding gradient equal zero similarly any summand in the eq such that jk doesn affect the gradient these observations allow us to consider only ik and jk ik is linear function of the core gk and its gradient equals the following sion ik jk jk jk jk rk we denote the partial sum vector as rk rk id rk jk jk jk vectors rk for all the possible values of jk and ik can be computed via dynamic programming by pushing sums each jd inside the equation and summing out one index at time in max substituting these vectors into and using test error matrix rank uncompressed number of parameters in the weight matrix of the first layer figure the experiment on the mnist dataset we use neural network and substitute the first layer with the solid lines and with the matrix rank decomposition based layer dashed line the solid lines of different colors correspond to different ways of reshaping the input and output vectors to tensors the shapes are reported in the legend to obtain the points of the plots we vary the maximal or the matrix rank again dynamic programming yields us all the necesary matrices for summation the overall computational complexity of the backward pass is max the presented algorithm reduces to sequence of products and permutations of dimensions and thus can be accelerated on gpu device experiments parameters of the in this experiment we investigate the properties of the and compare different strategies for setting its parameters dimensions of the tensors representing the of the layer and the of the compressed weight matrix we run the experiment on the mnist dataset for the task of recognition as baseline we use neural network with two fullyconnected layers hidden units and rectified linear unit relu achieving error on the test set for more reshaping options we resize the original images to we train several networks differing in the parameters of the single the networks contain the following layers the with weight matrix of size relu the layer with the weight matrix of size we test different ways of reshaping the tensors and try different ranks of the as simple compression baseline in the place of the we use the layer such that the rank of the weight matrix is bounded implemented as follows the two consecutive layers with weight matrices of sizes and where controls the matrix rank and the compression factor the results of the experiment are shown in figure we conclude that the provide much better flexibility than the matrix rank when applied at the same compression level in addition we observe that the with too small number of values for each tensor dimension and with too few dimensions perform worse than their more balanced counterparts comparison with hashednet we consider neural network with hidden units and replace both layers by the by setting all the in the network to we achieved the test error of with parameters in total and by setting all the to the test error of with parameters chen et al report results on the same architecture by tying random subsets of weights they compressed the network by the factor of to the parameters in total with the test error equal dataset consists of images assigned to different classes airplane automobile bird cat deer dog frog horse ship truck the dataset contains train and test images following we preprocess the images by subtracting the mean and performing global contrast normalization and zca whitening as baseline we use the quick cnn which consists of convolutional pooling and layers 
followed by two layers of sizes and we fix the convolutional part of the network and substitute the part by compr compr compr top top top top fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc architecture table substituting the layers with the in and networks on the imagenet dataset fc stands for layer stands for with all the equal stands for layer with the matrix rank restricted to we report the compression rate of the matrices and of the whole network in the second third and fourth columns followed by relu and by layer with hidden units contrary to in the original network we achieve the test error of without which is slightly better than the test error of the baseline the treated input and output vectors as and tensors respectively all the equal making the number of the parameters in the equal the compression rate of the tensornet compared with the baseline all the parameters is in addition substituting the both layers by the yields the test error of and reduces the number of parameters of the layer matrices by the factor of and the total parameter number by the factor of for comparison in the layers in cnn were compressed by the factor of at most times with the loss of about in accuracy wide and shallow network with sufficient amount of hidden units even neural network with two layers and sigmoid can approximate any decision boundary traditionally very wide shallow networks are not considered because of high computational and memory demands and the overfitting risk tensornet can potentially address both issues we use tensornet of the following architecture the with the weight matrix of size relu the with the weight matrix of size relu the layer with the weight matrix of size we report the test error of which is to the best of our knowledge the best result achieved by neural network imagenet in this experiment we evaluate the on large scale task we consider the imagenet dataset which consist of million training images and validation images we use deep the cnns and as the reference both networks consist of the two parts the convolutional and the parts in the both networks the second part consist of layers with weight matrices of sizes and in each network we substitute the first layer with the to do this we reshape the input vectors to the tensors of the size and the output vectors to the tensors of the size the remaining fullyconnected layers are initialized randomly the parameters of the convolutional parts are kept fixed as trained by simonyan and zisserman we train the and the layers on the training set in table we vary the ranks of the and report the compression factor of the the original layer the resulting compression factor of the whole network and the top and top errors on the validation set in addition we substitute the second layer with the as baseline compression method we constrain the matrix rank of the weight matrix of the first layer using the approach of after we had started to experiment on the network the networks have been improved by the authors thus we report the results on slightly outdated version of and the version of type cpu layer cpu gpu layer gpu im time ms im time ms table inference time for layer and its corresponding with all the equal the memory usage for feeding forward one image is for the layer and for the in table we observe that the in the best case manages to reduce the number of the parameters in the matrix of the largest layer by factor of from parameters to while increasing the top error from to the compression factor of the whole network remains at 
the level of because the stops being the storage bottleneck by compressing the largest of the remaining layers the compression factor goes up to the baseline method when providing similar compression rates significantly increases the error for comparison consider the results of obtained for the compression of the layers of the network with the fastfood method the model achieves compression factors of without decreasing the network error implementation details in all experiments we use our matlab of the matconvnet for the operations related to the we use the implemented in matlab as well the experiments were performed on computer with intel core cpu gb ram and single nvidia geforce gtx gpu we report the running times and the memory usage at the forward pass of the and the baseline layer in table we train all the networks with stochastic gradient descent with momentum coefficient we initialize all the parameters of the and layers with gaussian noise and put weight on them discussion and future work recent studies indicate high redundancy in the current neural network parametrization to exploit this redundancy we propose to use the framework on the weight matrix of layer and to use the cores of the decomposition as the parameters of the layer this allows us to train the layers compressed by up to compared with the explicit parametrization without significant error increase our experiments show that it is possible to capture complex dependencies within the data by using much more compact representations on the other hand it becomes possible to use much wider layers than was available before and the preliminary experiments on the dataset show that wide and shallow tensornets achieve promising results setting new for neural networks another appealing property of the is faster inference time compared with the corresponding layer all in all wide and shallow tensornet can become time and memory efficient model to use in real time applications and on mobile devices the main limiting factor for an layer size is its parameters number the limiting factor for an is the maximal linear size max as future work we plan to consider the inputs and outputs of layers in the thus completely eliminating the dependency on and and allowing billions of hidden units in acknowledgements we would like to thank ivan oseledets for valuable discussions novikov podoprikhin vetrov were supported by rfbr project no and by microsoft moscow state university joint research center rpd osokin was supported by the joint center the results of the tensor toolbox application in sec are supported by russian science foundation no https http https references asanovi and morgan experimental determination of precision requirements for training of artificial neural networks international computer science institute tech ba and caruana do deep nets really need to be deep in advances in neural information processing systems nips pp caroll and chang analysis of individual differences in multidimensional scaling via generalization of decomposition psychometrika vol pp chen wilson tyree weinberger and chen compressing neural networks with the hashing trick in international conference on machine learning icml pp cybenko approximation by superpositions of sigmoidal function mathematics of control signals and systems pp denil shakibi dinh ranzato and de freitas predicting parameters in deep learning in advances in neural information processing systems nips pp denton zaremba bruna lecun and fergus exploiting linear structure within convolutional 
networks for efficient evaluation in advances in neural information processing systems nips pp gilboa saati and cunningham scaling multidimensional inference for structured gaussian processes arxiv preprint no gong liu yang and bourdev compressing deep convolutional networks using vector quantization arxiv preprint no goodfellow mirza courville and bengio maxout networks in international conference on machine learning icml pp hackbusch and new scheme for the tensor representation fourier anal vol pp krizhevsky learning multiple layers of features from tiny images master thesis computer science department university of toronto krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in advances in neural information processing systems nips pp lebedev ganin rakhuba oseledets and lempitsky convolutional neural networks using in international conference on learning representations iclr lecun cortes and burges the mnist database of handwritten digits novikov rodomanov osokin and vetrov putting mrfs on tensor train in international conference on machine learning icml pp oseledets decomposition siam scientific computing vol no pp rumelhart hinton and williams learning representations by errors nature vol no pp russakovsky deng su krause satheesh ma huang karpathy khosla bernstein berg and imagenet large scale visual recognition challenge international journal of computer vision ijcv sainath kingsbury sindhwani arisoy and ramabhadran matrix factorization for deep neural network training with output targets in international conference of acoustics speech and signal processing icassp pp simonyan and zisserman very deep convolutional networks for image recognition in international conference on learning representations iclr snoek larochelle and adams practical bayesian optimization of machine learning algorithms in advances in neural information processing systems nips pp tucker some mathematical notes on factor analysis psychometrika vol no pp vedaldi and lenc matconvnet convolutional neural networks for matlab in proceeding of the acm int conf on multimedia xue li and gong restructuring of deep neural network acoustic models with singular value decomposition in interspeech pp yang moczulski denil de freitas smola song and wang deep fried convnets arxiv preprint no zhang yang oseledets karniadakis and daniel enabling hierarchical uncertainty quantification by anova and decomposition design of integrated circuits and systems ieee transactions on pp 
parallelizing mcmc with random partition trees xiangyu wang dept of statistical science duke university fangjian guo dept of computer science duke university guo katherine heller dept of statistical science duke university kheller david dunson dept of statistical science duke university dunson abstract the modern scale of data has brought new challenges to bayesian inference in particular conventional mcmc algorithms are computationally very expensive for large data sets promising approach to solve this problem is embarrassingly parallel mcmc which first partitions the data into multiple subsets and runs independent sampling algorithms on each subset the subset posterior draws are then aggregated via some combining rules to obtain the final approximation existing algorithms are limited by approximation accuracy and difficulty in resampling in this article we propose new algorithm part that solves these problems the new algorithm applies random partition trees to combine the subset posterior draws which is easy to resample from and can adapt to multiple scales we provide theoretical justification and extensive experiments illustrating empirical performance introduction bayesian methods are popular for their success in analyzing complex data sets however for large data sets markov chain monte carlo mcmc algorithms widely used in bayesian inference can suffer from huge computational expense with large data there is increasing time per iteration increasing time to convergence and difficulties with processing the full data on single machine due to memory limits to ameliorate these concerns various methods such as stochastic gradient monte carlo and based monte carlo have been proposed among directions that have been explored embarrassingly parallel mcmc seems most promising algorithms typically divide the data into multiple subsets and run independent mcmc chains simultaneously on each subset the posterior draws are then aggregated according to some rules to produce the final approximation this approach is clearly more efficient as now each chain involves much smaller data set and the sampling is the key to successful algorithm lies in the speed and accuracy of the combining rule existing algorithms can be roughly divided into three categories the first relies on asymptotic normality of posterior distributions propose consensus monte carlo algorithm which produces final approximation by weighted averaging over all subset draws this approach is effective when the posterior distributions are close to gaussian but could suffer from huge bias when skewness and are present the second category relies on calculating an appropriate variant of mean or median of the subset posterior measures these approaches rely on asymptotics size of data increasing to infinity to justify accuracy and lack guarantees in finite samples the third category relies on the product density equation pde in assuming is the observed data and is the parameter of interest when the observations are iid conditioned on for any partition of the following identity holds qm if the prior on the full data and subsets satisfy πi proposes using kernel density estimation on each subset posterior and then combining via they use an independent metropolis sampler to resample from the combined density apply the weierstrass transform directly to and developed two sampling algorithms based on the transformed density these methods guarantee the approximation density converges to the true posterior density as the number of posterior draws increase 
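For readability, the product density equation (PDE) invoked by this third category of methods can be written out explicitly. The lines below are a reconstruction of the standard form of the identity under the assumptions stated in the paragraph above (observations i.i.d. given the parameter, and subset priors whose product recovers the full prior); treat it as the likely intended equation rather than a verbatim restoration of the garbled original.

```latex
% Product density equation (PDE): X is the full data, X_1, \dots, X_m a
% partition into m subsets, and the subset priors are chosen so that
% \prod_{i=1}^{m} \pi_i(\theta) = \pi(\theta).
p(\theta \mid X)
  \;\propto\; \pi(\theta)\, p(X \mid \theta)
  \;=\; \prod_{i=1}^{m} \pi_i(\theta)\, p(X_i \mid \theta)
  \;\propto\; \prod_{i=1}^{m} p_i(\theta \mid X_i)
% where p_i(\theta \mid X_i) \propto \pi_i(\theta)\, p(X_i \mid \theta) is the
% i-th subset posterior, sampled independently on machine i.
```

Both kernel-based approaches mentioned above operate on estimates of the subset posteriors p_i(θ | X_i) appearing on the right-hand side of this identity.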
however as both are the two methods are limited by two major drawbacks the first is the inefficiency of resampling kernel density estimators are essentially mixture distributions assuming we have collected posterior samples on each machine then multiplying just two densities already yields mixture distribution containing components each of which is associated with different weight the resampling requires the independent metropolis sampler to search over an exponential number of mixture components and it is likely to get stuck at one good component resulting in high rejection rates and slow mixing the second is the sensitivity to bandwidth choice with one bandwidth applied to the whole space in this article we propose novel algorithm termed parallel aggregation random trees part which solves the above two problems the algorithm inhibits the explosion of mixture components so that the aggregated density is easy to resample in addition the density estimator is able to adapt to multiple scales and thus achieve better approximation accuracy in section we motivate the new methodology and present the algorithm in section we present error bounds and prove consistency of part in the number of posterior draws experimental results are presented in section proofs and part of the numerical results are provided in the supplementary materials method recall the pde identity in the introduction when data set is partitioned into subsets the posterior distribution of the ith subset can be written as where is the prior assigned to the full data set assuming observations are iid given the relationship between the full data posterior and subset posteriors is captured by due to the flaws of applying density estimation to mentioned above we propose to use random partition trees or histograms let fk be the collection of all rp partitions formed by disjoint rectangular blocks where rectangular block takes the form of def ak lk rk rp for some lk rk histogram is then defined as nk fˆ ak where ak fk are the blocks and nk are the total number of posterior samples on the ith subset and of those inside the block ak respectively assuming the same across subsets we use to denote the area of block assuming each subset posterior is approximated by histogram if the partition ak is restricted to be the same across all subsets then the aggregated density after applying is still histogram illustrated in the supplement nk ak wk gk pk qm where nk is the normalizing constant wk are the updated weights and gk unif ak is the distribution common histogram blocks across subsets control the number of mixture components leading to simple aggregation and resampling procedures our part algorithm consists of space partitioning followed by density aggregation with aggregation simply multiplying densities across subsets for each block and then normalizing space partitioning to find good partitions our algorithm recursively bisects not necessarily evenly previous block along randomly selected dimension subject to certain rules such partitioning is and related to wavelets assume we are currently splitting the block along the dimension and denote the posterior samples in by θj for the ith subset the cut point on dimension is determined by partition rule θj θj θj the resulting two blocks are subject to further bisecting under the same procedure until one of the following stopping criteria is met nk δρ or ii the area of the block becomes smaller than the algorithm returns tree with leafs each corresponding to block ak details are provided in 
algorithm algorithm partition tree algorithm procedure uild ree θj θj θj while not empty do draw uniformly at random from δρ δa randomly choose the dimension to cut cardinality of θj for all if lq δa rq δa and min θj θj δρ for all then update left and right boundaries θj θj θj uild ree θj θj θj θj uild ree θj θj θj θj return else try cutting at another dimension end if end while null null return leaf node end procedure in algorithm becomes the minimum edge length of block δa possibly different across dimensions quantities rp are the left and right boundaries of the samples respectively which take the sample when the support is unbounded we consider two choices for the partition rule maximum empirical likelihood partition ml and partition kd maximum likelihood partition ml searches for partitions by greedily maximizing the empirical log likelihood at each iteration for we have φml θj arg max where and are counts of posterior samples in and respectively the solution to falls inside the set θj thus simple linear search after sorting samples suffices by the ordering sorting the whole block once is enough for the entire procedure for we have φq ml arg max similarly solved by linear search this is dominated by sorting and takes log time partition kd partition cuts at the empirical median of posterior samples when there are multiple subsets the median is taken over pooled samples to force ak to be the same across subsets searching for median takes time which is faster than especially when the number of posterior draws is large the same partitioning strategy is adopted by density aggregation given common partition algorithm aggregates all subsets in one stage however assuming single good partition for all subsets is overly restrictive when is large hence we also consider pairwise aggregation which recursively groups subsets into pairs combines each pair with algorithm and repeats until one final set is obtained run time of part is dominated by space partitioning uild ree with normalization and resampling very fast algorithm density aggregation algorithm drawing samples from the aggregated posterior procedure ne tage aggregate θj θj θj uild ree θj θj θj δρ δa δρ δa ak nk raverse eaf for do qm nk end for wk for all for do draw with weights wk and then draw θt gk end for return θn end procedure multiply inside each block normalize variance reduction and smoothing random tree ensemble inspired by random forests the full posterior is estimated by averaging independent trees output by algorithm smoothing and averaging can reduce variance and yield better approximation accuracy the trees can be built in parallel and resampling in algorithm only additionally requires picking tree uniformly at random local gaussian smoothing as another approach to increase smoothness the blockwise uniform distribution in can be replaced by gaussian distribution gk µk σk with mean and covariance estimated locally by samples within the block multiplied gaussian approximation pm pm is used σk µk σk where and are estimated th with the subset we apply both random tree ensembles and local gaussian smoothing in all applications of part in this article unless explicitly stated otherwise theory in this section we provide consistency theory in the number of posterior samples for histograms and the aggregated density we do not consider the variance reduction and smoothing modifications in these developments for simplicity in exposition but extensions are possible section provides error bounds on ml and histogram density estimators 
constructed from independent samples from single joint posterior modified bounds can be obtained for mcmc samples incorporating the mixing rate but will not be considered here section then provides corresponding error bounds for our density estimators in the and pairwise cases detailed proofs are provided in the supplementary materials let be posterior density function assume is supported on measurable set rp since one can always transform to bounded region by scaling we simply assume as in without loss of generality we also assume that space partitioning maximum likelihood partition ml for given ml partition solves the following problem nk nk log nk fm arg max for some and where kf we have the following result theorem choose for any if the sample size satisfies that log then with probability at least the optimal solution to satisfies that dkl kfˆm log max log log log log where log with kf and when multiple densities are presented our goal of imposing the same partition on all functions requires solving different problem fˆm arg max nk nk log nk where ni is the number of posterior samples for function similar result as theorem for is provided in the supplementary materials median kd the fˆkd cuts at the empirical median for different dimensions we have the following result theorem for any define rε for any if log log then with probability at least we have log kfkd fkd plk log if is further lower bounded by some we can then obtain an upper bound on constant the define and we have pld dkl kfˆkd log log when there are multiple functions and the median partition is performed on pooled data the partition might not happen at the empirical median on each subset however as long as the partition quantiles are upper and lower bounded by and for some we can establish results similar to theorem the result is provided in the supplementary materials posterior aggregation the previous section provides estimation error bounds on individual posterior densities through which we can bound the distance between the true posterior conditional on the full data set and the aggregated density via assume we have density functions and intend to approximate their aggregated density fi where notice that for any fi let maxi kfi is an upper rq bound on all posterior densities formed by subset of also define zi these quantities depend only on the model and the observed data not posterior samples we denote fˆm and fˆkd by fˆ as the following results apply similarly to both methods the aggregation algorithm first obtains an approximation each via either for or and then computes fi theorem aggregation denote the average total variation distance between and fˆ by assume the conditions in theorem and and for log log and for log log then with high probability the total variation distance between fi and fˆi is bounded by kfi where zi is constant that does not depend on the posterior samples zi the approximation error of algorithm increases dramatically with the number of subsets to ameliorate this we introduce the pairwise aggregation strategy in section for which we have the following result theorem pairwise aggregation denote the average total variation distance between and fˆ by assume the conditions in theorem then with high probability the total variation distance between fi and fˆi is bounded by kfi fˆi where maxi is constant that does not depend on posterior samples experiments in this section we evaluate the empirical performance of and compare the two algorithms and to the following posterior aggregation algorithms 
simple averaging average each aggregated sample is an arithmetic average of samples coming from subsets weighted averaging weighted also called consensus monte carlo algorithm where each aggregated sample is weighted average of samples the weights are optimally chosen for gaussian posterior weierstrass rejection sampler weierstrass subset posterior samples are passed through rejection sampler based on the weierstrass transform to produce the aggregated samples we use its for experiments parametric density product parametric aggregated samples are drawn from multivariate gaussian which is product of laplacian approximations to subset posteriors nonparametric density product nonparametric aggregated posterior is approximated by product of kernel density estimates of subset posteriors samples are drawn with an independent metropolis sampler semiparametric density product semiparametric similar to the nonparametric but with subset posteriors estimated semiparametrically all experiments except the two toy examples use adaptive mcmc for posterior sampling for aggregation algorithm is used only for the toy examples results from pairwise aggregation are provided in the supplement for other experiments pairwise aggregation is used which draws samples for intermediate stages and halves δρ after each stage to refine the resolution the value of δρ listed below is for the final stage the random ensemble of part consists of trees two toy examples the two toy examples highlight the performance of our methods in terms of recovering multiple modes and ii correctly locating posterior mass when subset posteriors are heterogeneous the results are obtained from algorithm without local gaussian smoothing matlab implementation available from https https http bimodal example figure shows an example consisting of subsets each subset consists of samples drawn from mixture of two univariate normals with the means and standard deviations slightly different across subsets given by and where δi independently for and the resulting true combined posterior red solid consists of two modes with different scales in figure the left panel shows the subset posteriors dashed and the true posterior the right panel compares the results with various methods to the truth few are omitted in the graph average and weighted average overlap with parametric and weierstrass overlaps with true density subset densities density true density parametric nonparametric semiparametric figure bimodal posterior combined from subsets left the true posterior and subset posteriors dashed right aggregated posterior output by various methods compared to the truth results are based on aggregated samples iid rare bernoulli example we consider bernoulli trials xi ber split into subsets the parameter is chosen to be so that on average each subset only contains successes by random partitioning the subset posteriors are rather heterogeneous as plotted in dashed lines in the left panel of figure the prior is set as beta the right panel of figure compares the results of various methods and weierstrass capture the true posterior shape while parametric average and weighted average are all biased the nonparametric and semiparametric methods produce flat densities near zero not visible in figure due to the scale true posterior subset posteriors density figure the posterior for the probability of rare event left the full posterior solid and subset posteriors dashed right aggregated posterior output by various methods all results are based on aggregated samples 
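Because the combining rule reduces to a blockwise product once all subsets share a common partition, the short NumPy sketch below illustrates the multiply-and-normalize step and the categorical resampling over blocks used by one-stage aggregation. This is a sketch under stated assumptions, not the authors' released code: the function and variable names are mine, the block counts and volumes are assumed to come from the tree-building stage, and the draw of a point inside each selected block (uniform, or the local-Gaussian variant) is omitted for brevity.

```python
import numpy as np

def aggregate_and_resample(counts, totals, block_volumes, n_draws, seed=0):
    """Sketch of one-stage PART-style aggregation on a common partition.

    counts[i, k]     : number of subset-i posterior draws in block A_k
    totals[i]        : total number of posterior draws for subset i
    block_volumes[k] : volume |A_k| of block k
    Returns indices of sampled blocks; a full sampler would then draw a
    point inside each selected block (uniformly, or from a local Gaussian).
    """
    rng = np.random.default_rng(seed)
    # Per-subset histogram height on block k: (n_ik / N_i) / |A_k|.
    heights = (counts / totals[:, None]) / block_volumes[None, :]
    # Multiply the m subset densities blockwise, then convert the resulting
    # density level into a block probability by multiplying by |A_k|.
    unnormalized = np.prod(heights, axis=0) * block_volumes
    weights = unnormalized / unnormalized.sum()
    # Resampling from the aggregated density is a categorical draw over blocks.
    return rng.choice(len(weights), size=n_draws, p=weights)
```

With many subsets the blockwise product can underflow in floating point, so a practical implementation would accumulate log-heights instead of taking a direct product; the weighting step itself is otherwise as simple as shown.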
bayesian logistic regression synthetic dataset the dataset xi yi consists of observations in dimensions all features xi are drawn from with and σk the model intercept is set to and the other coefficient are drawn randomly from conditional on xi yi follows yi exp xi the dataset is randomly split into subsets for both full chain and subset chains we run adaptive mcmc for iterations after thinning by results in samples the samples from the full chain denoted as θj are treated as the ground truth to compare the accuracy of different methods we resample points from each aggregated posterior and then compare them using the following metrics rmse of posterior mean pt θj approximate kl divergence dkl and dkl kp where and are both approximated by multivariate gaussians the posterior concentration ratio defined as qp kθj which measures how posterior spreads out around the true value with being ideal the result is provided in table figure shows the dkl versus the length of subset chains supplied to the aggregation algorithm the results of part are obtained with δρ δa and trees figure showcases the aggregated posterior for two parameters in terms of joint and marginal distributions method rmse dkl dkl part kd part ml average weighted weierstrass parametric nonparametric semiparametric table accuracy of posterior aggregation on logistic regression figure posterior of and real datasets we also run experiments on two real datasets the covertype consists of observations in dimensions and the task is to predict the type of forest cover with cartographic measurements the miniboone consists of observations in dimensions whose task is to distinguish electron neutrinos from muon neutrinos with experimental data for both datasets we reserve of the data as the test set the training set is randomly split into and subsets respectively for covertype and miniboone figure shows the prediction accuracy versus total runtime parallel subset mcmc aggregation time for different methods for each mcmc chain the first iterations are discarded before aggregation as the aggregated chain is required to be of the same length as the subset chains as reference we also plot the result for the full chain and lasso run on the full training set prediction accuracy weierstrass average weighted full chain lasso covertype figure approximate kl divergence between the full chain and the combined posterior versus the length of subset chains parametric nonparametric total time sec miniboone total time sec figure prediction accuracy versus total runtime running chain aggregation on covertype and miniboone datasets semiparametric is not compared due to its long running time plots against the length of chain are provided in the supplement conclusion in this article we propose new mcmc algorithm part that can efficiently draw posterior samples for large data sets part is simple to implement efficient in subset combining and has theoretical guarantees compared to existing algorithms part has substantially improved performance possible future directions include exploring other density estimators which share similar properties as partition trees but with better approximation accuracy developing tuning procedure for choosing good δρ and δa which are essential to the performance of part http https references max welling and yee teh bayesian learning via stochastic gradient langevin dynamics in proceedings of the international conference on machine learning dougal maclaurin and ryan adams firefly monte carlo exact mcmc with subsets of data proceedings 
of the conference on uncertainty in artificial intelligence uai steven scott alexander blocker fernando bonassi hugh chipman edward george and robert mcculloch bayes and big data the consensus monte carlo algorithm in efabbayes conference volume stanislav minsker sanvesh srivastava lizhen lin and david dunson scalable and robust bayesian inference via the median posterior in proceedings of the international conference on machine learning sanvesh srivastava volkan cevher quoc and david dunson wasp scalable bayes via barycenters of subset posteriors in proceedings of the international conference on artificial intelligence and statistics aistats volume willie neiswanger chong wang and eric xing asymptotically exact embarrassingly parallel mcmc in proceedings of the thirtieth conference annual conference on uncertainty in artificial intelligence pages corvallis oregon auai press xiangyu wang and david dunson parallel mcmc via weierstrass sampler arxiv preprint linxi liu and wing hung wong multivariate density estimation based on adaptive partitioning convergence rate variable selection and spatial adaptation arxiv preprint manuel blum robert floyd vaughan pratt ronald rivest and robert tarjan time bounds for selection journal of computer and system sciences jon louis bentley multidimensional binary search trees used for associative searching communications of the acm leo breiman random forests machine learning leo breiman bagging predictors machine learning xiaotong shen and wing hung wong convergence rate of sieve estimates the annals of statistics pages nils lid hjort and ingrid glad nonparametric density estimation with parametric start the annals of statistics pages heikki haario marko laine antonietta mira and eero saksman dram efficient adaptive mcmc statistics and computing heikki haario eero saksman and johanna tamminen an adaptive metropolis algorithm bernoulli pages jock blackard and denis dean comparative accuracies of neural networks and discriminant analysis in predicting forest cover types from cartographic variables in proc second southern forestry gis conf pages byron roe yang ji zhu yong liu ion stancu and gordon mcgregor boosted decision trees as an alternative to artificial neural networks for particle identification nuclear instruments and methods in physics research section accelerators spectrometers detectors and associated equipment lichman uci machine learning repository robert tibshirani regression shrinkage and selection via the lasso journal of the royal statistical society series methodological pages 
fmri shared response model janice yaara uri james peter department of electrical engineering princeton university princeton neuroscience institute and department of psychology princeton university department of psychological and brain sciences and center for cognitive neuroscience dartmouth college abstract fmri data is critical for evaluating the generality and validity of findings across subjects and its effective utilization helps improve analysis sensitivity we develop shared response model for aggregating fmri data that accounts for different functional topographies among anatomically aligned datasets our model demonstrates improved sensitivity in identifying shared response for variety of datasets and anatomical brain regions of interest furthermore by removing the identified shared response it allows improved detection of group differences the ability to identify what is shared and what is not shared opens the model to wide range of fmri studies introduction many modern fmri studies of the human brain use data from multiple subjects the use of multiple subjects is critical for assessing the generality and validity of the findings across subjects it is also increasingly important since from one subject one can gather at most few thousand noisy instances of functional response patterns to increase the power of multivariate statistical analysis one therefore needs to aggregate response data across multiple subjects however the successful aggregation of fmri brain imaging data across subjects requires resolving the major problem that both anatomical structure and functional topography vary across subjects moreover it is well known that standard methods of anatomical alignment do not adequately align functional topography hence anatomical alignment is often followed by spatial smoothing of the data to blur functional topographies recently functional spatial registration methods have appeared that use cortical warping to maximize correlation of time series or correlation of functional connectivity more radical approach learns latent multivariate feature that models the shared component of each subject response multivariate statistical analysis often begins by identifying set of features that capture the informative aspects of the data for example in fmri analysis one might select subset of voxels within an anatomical region of interest roi or select subset of principal components of the roi then use these features for subsequent analysis in similar way one can think of the fmri data aggregation problem as two step process first use training data to learn mapping of each subject measured data to shared feature space in way that captures the shared response then use these learned mappings to project held out data for each subject into the shared feature space and perform statistical analysis to make this more precise let xi denote matrices of training data voxels in the roi over trs for subjects we propose using this data to learn subject specific bases wi where is to be selected and shared matrix of feature responses such that xi wi ei where ei is an error term corresponding to unmodeled aspects of the subject response one can think of the bases wi as representing the individual functional topographies and as latent feature that captures the component of the response shared across subjects we don claim that is sufficient statistic but that is useful analogy problem problem problem problem problem problem train objective test accuracy iterations figure comparison of training objective value 
and testing accuracy for problem and over various on raider dataset with voxels of ventral temporal cortex vt in image stimulus classficiation experiment details in in all cases error bars show standard error the contribution of the paper is twofold first we propose probabilistic generative framework for modeling and estimating the subject specific bases wi and the shared response latent variable critical aspect of the model is that it directly estimates shared features this is in contrast to methods where the number of features equals the number of voxels moreover the bayesian nature of the approach provides natural means of incorporating prior domain knowledge second we give demonstration of the robustness and effectiveness of our data aggregation model using variety of fmri datasets captured on different mri machines employing distinct analysis pathways and based on various brain rois preliminaries fmri data xi is collected for subjects as they are presented with identical time synchronized stimuli here is the number of time samples in trs time of repetition and is the number of voxels our objective is to model each subject response as xi wi ei where wi is basis of topographies for subject is parameter selected by the experimenter is corresponding time series of shared response coordinates and ei is an error term to ensure uniqueness of coordinates it is necessary that wi has linearly independent columns we make the stronger assumption that each wi has orthonormal columns wit wi ik two approaches for estimating the bases wi and the shared response are illustrated below minwi minwi kxi wi skf kwi xi skf wi wi ik wi wi where kf denotes the frobenius norm for can be solved iteratively by first selecting conditions for wi and optimizing with respect to by setting initial wi xi with fixed becomes separate subproblems of the form min kxi wi skf with solution wi where is an svd of xi these two steps can be iterated until stopping criterion is satisfied similarly for can also be solved iteratively however for there is no known fast update of wi given hence this must be done using local gradient decent on the stiefel manifold both approaches yield the same solution when but are not equivalent in the more interesting situation sup what is most important however is that problem with often learns an uninformative shared response this is illustrated in fig which plots of the value of the training objective and the test accuracy for stimulus classification experiment versus iteration count image classification using the raider fmri dataset see for problem test accuracy increases with decreasing training error whereas for problem test accuracy decreases with decreasing training error this can be explained analytically see sup we therefore base our approach on generalization of problem we call the resulting and wi shared response model srm before extending this simple model we note few important properties first solution of is not unique if wi is solution then so is qs wi for any orthogonal matrix this is not problem as long as we only learn one template and one set of subject bases any new subjects or new data will be referenced to the original srm however if we independently learn two srms the group shared responses may not be registered use the same we register to by finding orthogonal matrix to minimize then use in place of and wj qt in place of wj for subjects in the first srm next when projected onto the span of its basis eachp subject training data xi has coordinates si wit xi and the learning 
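The alternating scheme for problem (1) described above admits a compact implementation: with the bases fixed, the shared response is the average of the projected data, and with the shared response fixed, each basis update is an orthogonal Procrustes problem solved with an SVD. Below is a minimal sketch of this procedure, assuming a list of per-subject data matrices; the function name, random initialization, and iteration count are illustrative rather than taken from the paper's code.

```python
# Minimal sketch of the deterministic alternating scheme for problem (1).
# X: list of (v_i x t) data arrays, one per subject; k: number of shared features.
import numpy as np

def srm_deterministic(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    # Initialize each W_i with orthonormal columns (QR of a random matrix).
    W = [np.linalg.qr(rng.standard_normal((Xi.shape[0], k)))[0] for Xi in X]
    for _ in range(n_iter):
        # With the bases fixed, the shared response is the average projection.
        S = sum(Wi.T @ Xi for Wi, Xi in zip(W, X)) / m
        # With S fixed, each basis solves min ||X_i - W_i S||_F s.t. W_i^T W_i = I_k,
        # whose solution is U V^T from the SVD of X_i S^T (orthogonal Procrustes).
        for i, Xi in enumerate(X):
            U, _, Vt = np.linalg.svd(Xi @ S.T, full_matrices=False)
            W[i] = U @ Vt
    return W, S
```

In practice one would monitor the training objective, the sum over subjects of ||X_i - W_i S||_F^2, and stop when it plateaus rather than running a fixed number of iterations.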
phase ensures si the projection to shared features and the averaging across subjects in feature space both contribute to denoising during the learning phase by mapping back into voxel space we obtain the voxel space manifestation wi of the denoised shared component of each subject training data the training data of subject can also be mapped through the shared response model to the functional topography and anatomy of subject by the mapping wi wjt xj new subjects are easily added to an existing srm wi we refer to as the training template to introduce new subject with training data xj form its orthonormal basis by minimizing the mean squared modeling error minwj wjt wj kxj wj we solve this for the least norm solution note that and the existing do not change we simply add new subject by using its training data for the same stimulus and the template to determine its basis of functional topographies we can also add new data to an srm let denote new data collected under distinct stimulus from the same subjects this is added to the study by forming wit then averaging these projections to form the shared response for the new data wit this assumes the learned subject specific topographies wi generalize to the new data this usually requires sufficiently rich stimulus in the learning phase probabilistic shared response model we now extend our simple shared response model to probabilistic setting let xit rv denote the observed pattern of voxel responses of the subject at time for the moment assume these observations are centered over time let st rk be hyperparameter modeling the shared response at time and model the observation at time for dataset as the outcome of random vector xit wi st with wit wi ik where xit takes values in rv wi and is subject independent hyperpap rameter the negative of this model is log log xit wi st xit st noting that xit is the column of xi we see that minimizing with respect to wi and sd requires the solution of min xit wi st xit wi st min kxi wi thus maximum likelihood estimation for this model matches in our fmri datasets and most fmri datasets available today since st is time specific but shared across the subjects we see that there is palpable value in regularizing its estimation in contrast subject specific variables such as wi are shared across time dimension in which data is relatively plentiful hence natural extension of is to make st shared latent random vector st σs taking values in rk the observation for dataset at time then has the conditional density xit wi st µi where the subject specific mean µi allows for mean and we assume subject dependent isotropic noise covariance this is an extended form of factor analysis but in factor analysis one normally assumes σs to form joint model let xtt xmt wm µt µtm diag and σx σs then xt st with xt σx taking values in rmv for this joint model we formulate srm as st σs xit wi st wit wi ik µi st wi µi xit figure graphical model for srm shaded nodes observations unshaded nodes latent variables and black squares hyperparameters where st takes values in rk xit takes values in rv and the hyperparameters wi are matrices in the latent variable st with covariance σs models shared elicited response across the subjects at time by applying the same orthogonal transform to each of the wi we can assume without loss of generality that σs is diagonal the srm graphical model is displayed in fig parameter estimation for srm to estimate the parameters of the srm model we apply constrained em algorithm to find maximum likelihood solutions let 
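Both operations just described, registering two independently learned shared responses and adding a new subject to an existing template, reduce to orthogonal Procrustes problems. The sketch below assumes k x t shared-response matrices and a v x t data matrix for the new subject; the helper names are ours, not from the paper.

```python
# Minimal sketch: register one SRM's shared response to another, and add a new
# subject to an existing shared-response template S.
import numpy as np

def register_srm(S1, S2):
    """Find an orthogonal Q minimizing ||S1 - Q S2||_F (orthogonal Procrustes).
    Afterwards use Q @ S2 in place of S2 and W_j @ Q.T in place of each W_j."""
    U, _, Vt = np.linalg.svd(S1 @ S2.T, full_matrices=False)
    return U @ Vt

def add_subject(X_new, S):
    """Estimate an orthonormal basis for a new subject from its training data
    X_new (v x t) and the existing template S (k x t); S itself is unchanged."""
    U, _, Vt = np.linalg.svd(X_new @ S.T, full_matrices=False)
    return U @ Vt
```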
denote the vector of all parameters in the given initial value or estimated value θold from the previous we calculate the sufficient statistics by taking expectation with respect to st θold st σs σs xt st stt st st st σs σts σs σs st st in the we update the parameter estimate to θnew by maximizing with respect to wi µi and σs this is given by θnew arg maxθ θold where pd θold st θold log xt st dst due to the model structure can be maximized with respect to each parameter separately to enforce the orthogonality of wi we bring symmetric matrix λi of lagrange multipliers and add the constraint term tr λi wit wi to the objective function setting the derivatives of the modified objective to zero we obtain the following update equations µnew xit new new wi ai ai ai ai xit µi st new new new new ρi dv kxit µi xit µi wi st tr st st new σs st st the orthonormal constraint wit wi ik in srm is similar to that of pca in general there is no reason to believe that key brain response patterns are orthogonal so the orthonormal bases found via srm are computational tool to aid statistical analysis within an roi from computational viewpoint orthogonality has the advantage of robustness and preserving temporal geometry connections with related methods for one subject srm is similar to variant of ppca that imposes an orthogonality constraint on the loading matrix ppca yields an orthogonal loading matrix however due to the increase in model complexity to handle multiple datasets srm has an explicit constraint of orthogonal loading matrices topographic factor analysis tfa is factor model using topographic basis composed of spherical gaussians with different centers and widths this choice of basis is constraining but since each factor is an blob in the brain it has the advantage of providing simple spatial interpretation hyperalignment ha learns shared representational by rotating subjects time series responses to maximize time series correlation the formulation in is based on problem with and wi orthogonal matrix sup so this method does not directly reduce the dimension of the feature space nor does it directly extend to this case see fig although dimensionality reduction can be done posthoc using pca shows that this doesn lead to performance improvement in contrast we show in that selecting can improve the performance of srm beyond that attained by ha the gica iva algorithms do not assume stimulus and hence concatenate data along the time dimension implying spatial consistency and learn spatial independent components we use the assumption of stimulus for anchoring the shared response to overcome spatial mismatch in functional topographies finally srm can be regarded as refinement of the concept of hyperalignment cast into probabilistic framework the ha approach has connections with regularized cca additional details of these connections and connections with canonical correlation analysis cca ridge regression independent component analysis ica regularized hyperalignment are discussed in the supplementary material experiments we assess the performance and robustness of srm using fmri datasets table collected using different mri machines subjects and preprocessing pipelines the sherlock dataset was collected dataset subjs trs region of interest roi voxels sherlock movie posterior medial cortex pmc raider movie ventral temporal cortex vt forrest audio movie planum temporale pt audiobook narrated story default mode network dmn table fmri datasets are shown in the left four columns and the rois are shown in right two 
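As a rough illustration of the constrained EM just described, the following sketch performs one E-step/M-step pass for the probabilistic SRM, assuming orthonormal bases, isotropic subject noise, and time-centered data (so the mean update reduces to the time average of each subject's data). Variable names and the exact update ordering are simplifications of the paper's derivation, not a reference implementation.

```python
# Minimal sketch of one constrained-EM iteration for the probabilistic SRM.
# X: list of (v x t) data; W: list of (v x k) orthonormal bases; mu: list of (v,)
# means; rho2: list of per-subject noise variances; Sigma_s: (k x k) covariance.
import numpy as np

def srm_em_step(X, W, mu, rho2, Sigma_s):
    t, k = X[0].shape[1], W[0].shape[1]
    # E-step: posterior of the shared s_t given all subjects' data at time t.
    # Because W_i^T W_i = I_k, the posterior precision is the same at every t.
    prec = np.linalg.inv(Sigma_s) + sum(1.0 / r for r in rho2) * np.eye(k)
    cov = np.linalg.inv(prec)
    resid = [Xi - mui[:, None] for Xi, mui in zip(X, mu)]
    Es = cov @ sum(Wi.T @ Ri / r for Wi, Ri, r in zip(W, resid, rho2))  # k x t
    Ess = t * cov + Es @ Es.T                     # sum over t of E[s_t s_t^T]
    # M-step.
    Sigma_s_new = Ess / t
    W_new, mu_new, rho2_new = [], [], []
    for Xi, Ri in zip(X, resid):
        mu_new.append(Xi.mean(axis=1))            # data assumed centered over time
        Ai = Ri @ Es.T                            # v x k
        U, _, Vt = np.linalg.svd(Ai, full_matrices=False)
        Wi_new = U @ Vt                           # equals A_i (A_i^T A_i)^{-1/2}
        W_new.append(Wi_new)
        v = Xi.shape[0]
        rho2_new.append((np.sum(Ri * Ri)
                         - 2.0 * np.trace(Wi_new.T @ Ai)
                         + np.trace(Ess)) / (v * t))
    return W_new, mu_new, rho2_new, Sigma_s_new, Es
```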
columns the rois vary in functional from visual language memory to mental states stands for hemisphere while subjects watched an episode of the bbc tv series sherlock mins raider dataset was collected while subjects viewed the movie raiders of the lost ark mins and series of still images categories runs the forrest dataset was collected while subjects listened to an auditory version of the film forrest gump mins the audiobook dataset was collected while subjects listened to narrated story mins with two possible interpretations half of the subjects had prior context favoring one interpretation the other half had prior context favoring the other interpretation post scanning questionnaires showed no difference in comprehension but significant difference in interpretations between groups experiment srm and spatial smoothing we first use spatial smoothing to determine if we can detect shared response in pmc for the sherlock dataset the subjects are randomly partitioned into two equal sized groups the data for each group is averaged we calculate the pearson correlation over voxels between these averaged responses for each time then average these correlations over time this is measure of similarity of the sequence of brain maps in the two average responses we repeat this for five random subject divisions and average the results if there is shared response we expect positive average correlation between the groups but if functional topographies differ significantly across subjects this correlation may be small if the result not distinct from zero shared response is not detected the computation yields the benchmark value shown as the purple bar in the right plot in fig this is support for shared response in pmc but we posit that the subject functional topographies in pmc are misaligned to test this we use gaussian filter with width at half height of and to spatially smooth each subject fmri data then recalculate the average pearson correlation as described above the results shown as blue bars in fig indicate higher correlations with greater spatial smoothing this indicates greater average correlation of the responses at lower spatial frequencies suggesting fine scale mismatch of functional topographies across subjects we now test the robustness of srm using the unsmoothed data the subjects are randomly partitioned into two equal sized groups the data in each group is divided in time into two halves and the same half in each group is used to learn shared response model for the group the independently obtained group templates are then registered using orthogonal matrix method outlined in for each group the second half of the data is projected to feature space using the bases and averaged then the pearson correlation over features is calculated between the group averaged shared responses and averaged over time this is repeated using the other the halves of the subject data for training and the results are averaged the average results over random subject divisions are report as the green bars in fig with there is no reduction of dimension and srm achieves correlation equivalent to spatial smoothing this strong average correlation between groups suggests some form of shared response as expected if the dimension of the feature space is reduced the correlation increases smaller value of forces learning subject specific bases half movie data subject learn bases subject wm shared group response half movie data learn bases subject subject wm shared group response computing correlation between groups half 
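A minimal sketch of the between-group correlation benchmark used in Experiment 1 is given below, assuming anatomically aligned v x t data matrices from the same ROI; the five random splits, spatial smoothing, and averaging over repetitions are omitted for brevity.

```python
# Minimal sketch: average the data within each random group, then compute the
# Pearson correlation over voxels at each time point and average over time.
import numpy as np

def between_group_correlation(X, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    g1 = np.mean([X[i] for i in idx[: len(X) // 2]], axis=0)   # v x t group average
    g2 = np.mean([X[i] for i in idx[len(X) // 2:]], axis=0)
    # z-score over voxels (axis 0) at each time point, then correlate per time.
    z1 = (g1 - g1.mean(axis=0)) / (g1.std(axis=0) + 1e-12)
    z2 = (g2 - g2.mean(axis=0)) / (g2.std(axis=0) + 1e-12)
    per_time_corr = np.sum(z1 * z2, axis=0) / g1.shape[0]
    return float(per_time_corr.mean())
```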
movie data subject project to bases subject wm qt qt shared group response subject subject wm shared group response figure experiment left learn using half of the data then compute between group correlation on other half right pearson correlation after spatial smoothing and srm with various error bars stand error learning subject specific bases movie data subject subject learn bases wm tr subject movie wm train output subject wm classifier subject subject wm subject testing subject projected response seg seg seg shared response seg seg seg seg subject subject segment overlapping segment testing segment overlapping segment segment testing on subject project to bases projected data shared response template image movie data segment segment seg seg test overlapping seg excluded seg overlapping seg excluded figure left experiment learn subject specific bases test on held out subject and data right experiment time segment matching by correlating with tr segments in the shared response figure experiment top comparison of time segment classification on three datasets using distinct rois bottom left srm time segment classification accuracy vs right learn bases from movie response classify stimulus category using still image response for raider and forrest we conduct experiment on roi in each hemisphere separately and then average the results for sherlock we conduct experiment over whole pmc the tal results for the raider dataset are from error bars stand error srm to focus on shared features yielding the best data representation and gives greater noise rejection learning features achieves higher average correlation in feature space than is achieved by spatial smoothing in voxel space commensurate improvement occurs when srm is applied to the spatially smoothed data experiment time segment matching and image classification we test if the shared response estimated by srm generalizes to new subjects and new data using versions of two experiments from unlike in here the held out subject is not included in learning phase the first experiment tests if an time segment from subject new data can be located in the corresponding new data of the training subjects shared response and subject specific bases are learned using half of the data and the held out subject basis is estimated using the shared response as template then random test segment from the unused half of the held out subject data is projected onto the subject basis and we locate the segment in the averaged shared response of the other subject new data that is maximally correlated with the test segment see fig the held out subject test segment is correctly located matched if its correlation with the average shared response at the same time point is the highest segments overlapping with the test segment are excluded we record the average accuracy and standard error by over the data halves and over subjects the results using three different fmri datasets with distinct rois are shown in the top plot of fig the accuracy is compared using anatomical alignment mni talairach tal standard pca and ica feature selection fastica implementation the hyperalignment ha method and srm pca and ica are directly applied on joint data matrix xm for learning and where and wm srm demonstrates the best matching of the estimated shared temporal features of the methods tested this suggests that the learned shared response is more informative of the shared brain state trajectory at an time scale moreover the experiment verifies generalization of the estimated shared 
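The time-segment matching test of Experiment 2 can be sketched as follows, assuming the held-out subject's test data has already been projected into the k-dimensional feature space (proj) and that avg_shared is the other subjects' averaged shared response for the same half of the data; the segment length in TRs is a free parameter here.

```python
# Minimal sketch of time-segment matching: a test segment is correctly located
# if its correlation with the averaged shared response is highest at the same
# time point; candidate segments overlapping the test segment are excluded.
import numpy as np

def segment_matching_accuracy(proj, avg_shared, seg_len=9):
    t = proj.shape[1]
    starts = range(t - seg_len + 1)
    correct = 0
    for s in starts:
        test = proj[:, s:s + seg_len].ravel()
        best_c, best_r = None, -np.inf
        for c in starts:
            if c != s and abs(c - s) < seg_len:
                continue                       # exclude overlapping candidates
            cand = avg_shared[:, c:c + seg_len].ravel()
            r = np.corrcoef(test, cand)[0, 1]
            if r > best_r:
                best_r, best_c = r, c
        correct += int(best_c == s)
    return correct / len(starts)
```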
features to subjects not included in the training phase and new but similar data collected during the other half of the movie stimulus since we expect accuracy to improve as the time segment is lengthened group subj subj group tr shared response across all subjects in voxel space tr residual shared by all shared response within group in voxel space voxel test train train train group classification accuracy with srm group shared response within group in voxel space test individual group group tr within group subj voxel subj original data voxel voxel tr train group fig fig fig figure experiment fig experimental procedure fig data components left and group classification performance with srm right in different steps of the procedure fig group classification on audiobook dataset in dmn before and after removing an estimated shared response for various values of and with srm pca and ica error bars stand error what is important is the relative accuracy of the compared methods the method in can be viewed as srm in this experiment it performs worse than srm but better than the other compared methods the effect of the number of features used in srm is shown in fig lower left this can be used to select similar test on the number of features used in pca and ica indicates lower performance than srm results not shown we now use the image viewing data and the movie data from the raider dataset to test the generalizability of learned shared response to subject and new data under very distinct stimulus the raider movie data is used to learn shared response model while excluding subject the subject basis is estimated by matching its movie response data to the estimated shared response the effectiveness of the learned bases is then tested using the image viewing dataset after projecting the image data using the subject bases to feature space an svm classifier is trained and the average classifier accuracy and standard error is recorded by across subject testing the results lower right plot in fig support the effectiveness of srm in generalizing to new subject and distinct new stimulus under srm the image stimuli can be slightly more accurately identified using other subjects data for training than using subject own data indicating that the learned shared response is informative of image category experiment differentiating between groups now consider the audiobook dataset and the dmn roi if subjects are given group labels according to the two prior contexts linear svm classifier trained on labeled voxel space data and tested on the voxel space data of held out subjects can distinguish the two groups at an above chance level this is shown as the leftmost bar in the bottom figure of fig this is consistent with previous similar studies we test if srm can distinguish the two subject groups with higher rate of success to do so we use the procedure outlined in rows of fig we first use the original data of all subjects all fig to learn shared response all and subject bases this shared response is then mapped to voxel space using each subject learned topography fig and all subtracted from the subject data to form the residual response xigj all for subject in group fig leaving out one subject from each group we use two applications of srm to find shared responses and subject bases for the residual response these are mapped into voxel space wigj gj for each subject fig the first application of srm yields an estimate of the response shared by all subjects this is used to form the residual response the subsequent 
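The residual construction underlying Experiment 3 can be sketched as follows, assuming a hypothetical srm_fit(data_list, k) helper that returns per-subject bases and a shared response (for example, by iterating the EM sketch above); the leave-one-subject-out SVM training and testing are omitted.

```python
# Minimal sketch: remove the response shared by all subjects, then keep only the
# group-shared component of each subject's residual, mapped back to voxel space.
import numpy as np

def group_residuals(X, labels, srm_fit, k_all, k_group):
    # Step 1: estimate and remove the response shared by all subjects.
    W_all, S_all = srm_fit(X, k_all)
    resid = [Xi - Wi @ S_all for Xi, Wi in zip(X, W_all)]
    # Step 2: within each group, keep only the group-shared part of the residual.
    out = [None] * len(X)
    for g in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == g]
        W_g, S_g = srm_fit([resid[i] for i in members], k_group)
        for Wi, i in zip(W_g, members):
            out[i] = Wi @ S_g          # voxel-space, group-shared residual response
    return out
```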
applications of srm to the residual give estimates of the shared residual response both applications of srm seek to remove components of the original response that are uninformative of group membership finally linear svm classifier is trained using the voxel space data and tested on the voxel space data of held out subjects the results are shown as the red bars in fig when using and we observe significant improvement in distinguishing the groups one can visualize why this works using the cartoon in fig showing the data for one subject modeled as the sum of three components the response shared by all subjects the response shared by subjects in the same group after the response shared by all subjects is removed and final residual term called the individual response fig we first identify the response shared by all subjects fig subtracting this from the subject response gives the residual fig the second application of srm removes the individual response fig by tuning in the first application of srm and tuning in the second application of srm we estimate and remove the uninformative components while keeping the informative component classification using the estimated shared response results in accuracy around chance fig indicating that it is uninformative for distinguishing the groups the classification accuracy using the residual response is statistically equivalent to using the original data fig indicating that only removing the response shared by all subjects is insufficient for improvement the classification accuracy that results by not removing the shared response and only applying srm fig is also statistically equivalent to using the original data this indicates that only removing the individual response is also insufficient for improvement by combining both applications of srm we remove both the response shared by all subjects and the individual responses keeping only the responses shared within groups for this leads to significant improvement in performance fig and fig we performed the same experiment using pca and ica fig in this case after removing the estimated shared response group identification quickly drops to chance since the shared response is informative of group difference around accuracy for distinguishing the groups sup detailed comparison of all three methods on the different steps of the procedure is given in the supplementary material discussion and conclusion the vast majority of fmri studies require aggregation of data across individuals by identifying shared responses between the brains of different individuals our model enhances fmri analyses that use aggregated data to evaluate cognitive states key attribute of srm is its dimensionality reduction leading to shared feature space we have shown that by tuning this dimensionality the aggregation achieved by srm demonstrates higher sensitivity in distinguishing multivariate functional responses across cognitive states this was shown across variety of datasets and anatomical brain regions of interest this also opens the door for the identification of shared and individual responses the identification of shared responses after srm is of great interest as it allows us to assess the degree to which functional topography is shared across subjects furthermore the srm allows the detection of group specific responses this was demonstrated by removing an estimated shared response to increase sensitivity in detecting group differences we posit that this technique can be adapted to examine an array of situations where group 
differences are the key experimental variable the method can facilitate studies of how neural representations are influenced by cognitive manipulations or by factors such as genetics clinical disorders and development successful decoding of particular cognitive state such as stimulus category in given brain area provides evidence that information relevant to that cognitive state is present in the neural activity of that brain area conducting such analyses in locations spanning the brain using searchlight approach can facilitate the discovery of information pathways in addition comparison of decoding accuracies between searchlights can suggest what kind of information is present and where it is concentrated in the brain srm provides more sensitive method for conducting such investigations this may also have direct application in designing better noninvasive interfaces references talairach and tournoux stereotaxic atlas of the human brain proportional system an approach to cerebral imaging thieme watson myers et al area of the human brain evidence from combined study using positron emission tomography and magnetic resonance imaging cereb cortex tootell reppas et al visual motion aftereffect in human cortical area mt revealed by functional magnetic resonance imaging nature mazziotta toga et al probabilistic atlas and reference system for the human brain philosophical transactions of the royal society biological sciences fischl sereno tootell and dale intersubject averaging and coordinate system for the cortical surface human brain mapping brett johnsrude and owen the problem of functional localization in the human brain nat rev neurosci sabuncu singer conroy bryan ramadge and haxby intersubject alignment of human cortical anatomy cerebral cortex conroy singer haxby and ramadge cortical alignment using functional connectivity in advances in neural information processing systems conroy singer guntupalli ramadge and haxby alignment of human cortical anatomy using functional connectivity neuroimage haxby guntupalli et al common model of the representational space in human ventral temporal cortex neuron lorbert and ramadge kernel hyperalignment in adv in neural inform proc systems huth griffiths theunissen and gallant pragmatic probabilistic and generative model of areas tiling the cortex arxiv horn and johnson matrix analysis cambridge university press edelman arias and smith the geometry of algorithms with orthogonality constraints siam journal on matrix analysis and applications ahn and oh constrained em algorithm for principal component analysis neural computation manning ranganath norman and blei topographic factor analysis bayesian model for inferring brain networks from neural data plos one michael anderson et al preserving subject variability in group fmri analysis performance evaluation of gica iva frontiers in systems neuroscience xu lorbert ramadge guntupalli and haxby regularized hyperalignment of fmri data in proc statistical signal processing workshop pages ieee hotelling relations between two sets of variates biometrika hyvärinen karhunen and oja independent component analysis john wiley sons chen leong norman and hasson reinstatement of neural patterns during narrative free recall abstracts of the cognitive neuroscience society margulies vincent and et al precuneus shares intrinsic functional architecture in humans and monkeys proceedings of the national academy of sciences haxby gobbini furey ishai schouten and pietrini distributed and overlapping representations of faces and 
objects in ventral temporal cortex. Science.
Hanke, Baumgartner, et al. An fMRI dataset from complex natural stimulation with an audio movie. Scientific Data.
Griffiths and Warren. The planum temporale as a computational hub. Trends in Neurosciences.
Yeshurun, Swanson, Chen, Simony, Honey, Lazaridi, and Hasson. How does the brain represent different ways of understanding the same story? Society for Neuroscience Abstracts.
Raichle. The brain's default mode network. Annual Review of Neuroscience.
Ames, Honey, Chow, Todorov, and Hasson. Contextual alignment of cognitive and neural dynamics. Journal of Cognitive Neuroscience.
deBettencourt, Cohen, Lee, Norman, et al. Training of attention with brain imaging. Nature Neuroscience.
spectral learning of large structured hmms for comparative epigenomics chicheng zhang uc san diego jimin song rutgers university song kevin chen rutgers university kcchen kamalika chaudhuri uc san diego kamalika abstract we develop latent variable model and an efﬁcient spectral algorithm motivated by the recent emergence of very large data sets of chromatin marks from multiple human cell types natural model for chromatin data in one cell type is hidden markov model hmm we model the relationship between multiple cell types by connecting their hidden states by ﬁxed tree of known structure the main challenge with learning parameters of such models is that iterative methods such as em are very slow while naive spectral methods result in time and space complexity exponential in the number of cell types we exploit properties of the tree structure of the hidden states to provide spectral algorithms that are more computationally efﬁcient for current biological datasets we provide sample complexity bounds for our algorithm and evaluate it experimentally on biological data from nine human cell types finally we show that beyond our speciﬁc model some of our algorithmic ideas can be applied to other graphical models introduction in this paper we develop latent variable model and efﬁcient spectral algorithm motivated by the recent emergence of very large data sets of chromatin marks from multiple human cell types chromatin marks are chemical modiﬁcations on the genome which are important in many basic biological processes after standard preprocessing steps the data consists of binary vector one bit for each chromatin mark for each position in the genome and for each cell type natural model for chromatin data in one cell type is hidden markov model hmm for which efﬁcient spectral algorithms are known on biological data sets spectral algorithms have been shown to have several practical advantages over maximum methods including speed prediction accuracy and biological interpretability here we extend the approach by modeling multiple cell types together we model the relationships between cell types by connecting their hidden states by ﬁxed tree the standard model in biology for relationships between cell types this comparative approach leverages the information shared between the different data sets in statistically uniﬁed and biologically motivated manner formally our model is an hmm where the hidden state zt at time has structure represented by tree graphical model of known structure for each tree node we can associate an individual hidden state ztu that depends not only on the previous hidden state for the same tree node but also on the individual hidden state of its parent node additionally there is an observation variable xut for each node and the observation xut is independent of other state and observation variables conditioned on the hidden state variable ztu in the bioinformatics literature studied this model with the additional constraint that all tree nodes share the same emission parameters in biological applications the main outputs of interest are the learned observation matrices of the hmm and segmentation of the genome into regions which can be used for further studies standard approach to unsupervised learning of hmms is the em algorithm when applied to hmms with very large state spaces em is very slow recent line of work on spectral learning has produced much more computationally efﬁcient algorithms for learning many graphical models under certain mild conditions including hmms however 
naive application of these algorithms to hmms with large state spaces results in computational complexity exponential in the size of the underlying tree here we exploit properties of the tree structure of the hidden states to provide spectral algorithms that are more computationally efﬁcient for current biological datasets this is achieved by three novel key ideas our ﬁrst key idea is to show that we can treat each path in the tree separately and learn its parameters using tensor decomposition methods this step improves the running time because our trees typically have very low depth our second key idea is novel tensor symmetrization technique that we call skeletensor construction where we avoid constructing the full tensor over the entire path instead we use carefully designed symmetrization matrices to reveal its range in skeletensor which has dimension equal to that of single tree node the third and ﬁnal key idea is called product projections where we exploit the independence of the emission matrices along the path conditioned on the hidden states to avoid constructing the full tensors and instead construct compressed versions of the tensors of dimension equal to the number of hidden states not the number of observations beyond our speciﬁc model we also show that product projections can be applied to other graphical models and thus we contribute general tool for developing efﬁcient spectral algorithms finally we implement our algorithm and evaluate it on biological data from nine human cell types we compare our results with the results of who used variational em approach we also compare with spectral algorithms for learning hmms for each cell type individually to assess the value of the tree model related work the ﬁrst efﬁcient spectral algorithm for learning hmm parameters was due to there has been an explosion of work on spectral algorithms for learning the parameters and structure of latent variable models gives spectral algorithm for learning an observable operator representation of an hmm under certain rank conditions and extend this algorithm to the case when the transition matrix and the observation matrix respectively are extends to hidden models gives general spectral algorithm for learning parameters of latent variable models that have structure there is hidden node and three or more observable nodes that are not connected to any other nodes and are independent conditioned on the hidden node many latent variable models have this structure including hmms tree graphical models topic models and mixture models provides simpler more robust algorithm that involves decomposing third order tensor provide algorithms for learning latent trees and of latent junction trees several algorithms have been designed for learning hmm parameters for chromatin modeling including stochastic variational inference and contrastive learning of two hmms however none of these methods extend directly to modeling multiple chromatin sequences simultaneously the model probabilistic model the natural probabilistic model for single epigenomic sequence is hidden markov model hmm where time corresponds to position in the sequence the observation at time is the sequence value at position and the hidden state at is the regulatory function in this position figure left tree with nodes right hmm whose hidden state has structure in comparative epigenomics the goal is to jointly model epigenomic sequences from multiple species or this is done by an hmm with hidden state where each node in the tree representing the 
hidden state has corresponding observation node formally we represent the model by tuple figure shows pictorial representation is directed tree with known structure whose nodes represent individual or species the hidden state zt and the observation xt are represented by vectors ztu and xut indexed by nodes if then is the parent of denoted by if is parent of then for all ztv is parent of ztu in addition the observations have the following product structure if then conditioned on ztu the observation xut is independent of ztu and xut as well as any and for is set of observation matrices ou xut for each and is set of transition tensors for each finally is the set of initial distributions where for each given tree structure and number of iid observation sequences corresponding to each node of the tree our goal is to determine the parameters of the underlying and then use these parameters to infer the most likely regulatory function at each position in the sequences below we use the notation to denote the number of nodes in the tree and to denote its depth for typical epigenomic datasets is small to moderate while is very small or as it is difﬁcult to obtain data with large experimentally typically the number of possible values assumed by the hidden state at single node is about while the number of possible observation values assumed by single node is much larger in our dataset tensors an tensor is array with entries with its entry denoted as given ni vectors vi their tensor product denoted by is the tensor whose entry is tensor that can be expressed as the tensor product of set of vectors is called rank tensor tensor is symmetric if and only if for any permutation mπ let if vi rni then is tensor of size whose entry is since matrix is tensor we also use the following shorthand to denote matrix multiplii then is matrix of size cation let if vi whose entry is this is equivalent to in the bioinformatics literature this model is also known as tree hmm and observations matrices and tensors given observations xut and at single node we use the notation pt to denote their expected frequenu cies pt xt and to denote their corresponding empirical version the tensor pt xt and its empirical version are deﬁned similarly occasionally we will consider the states or observations corresponding to subset of nodes in coalesced into single or given connected subset of nodes in the tree that includes the root we use the notation zts and xst to denote the represented by ztu and the represented by xut respectively we deﬁne the observation matrix for as os xst rn and the transition matrix for as rm respectively for sets of nodes and we use the notation pt to denote the expected frequencies of the xt and its empirical version is denoted by similarly we can deﬁne the notation pt and its empirical version background on spectral learning for latent variable models recent work by has provided novel elegant tensor decomposition method for learning latent variable models applied to hmms the main idea is to decompose transformed version of the third order tensor of the ﬁrst three observations to recover the parameters shows that given enough samples and under fairly mild conditions on the model this provides an approximation to the globally optimal solution the algorithm has three main steps first the third order tensor of the is symmetrized using the second order matrices to yield symmetric tensor this symmetric tensor is then orthogonalized by whitening transformation finally the resultant symmetric orthogonal tensor is 
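The orthogonalization step mentioned above can be made concrete with a short whitening sketch, assuming a symmetrized second-moment matrix M2 (n x n, rank k, positive semidefinite) and a symmetrized third-moment tensor M3 (n x n x n); the whitening matrix W satisfies W^T M2 W = I_k, and applying it to all three modes of M3 produces the k x k x k tensor handed to the decomposition step.

```python
# Minimal sketch of the whitening (orthogonalization) of the symmetrized moments.
import numpy as np

def whiten(M2, M3, k):
    U, s, _ = np.linalg.svd(M2)
    W = U[:, :k] / np.sqrt(s[:k])                      # n x k whitening matrix
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)    # k x k x k whitened tensor
    return T, W
```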
decomposed via the tensor power method in biological applications instead of multiple independent sequences we have single long sequence in the steady state in this case following ideas from we use the average over of the third order tensors of three consecutive observations starting at time the second order tensor is also modiﬁed similarly algorithm naive approach for learning parameters of hmms with hidden states is to directly apply the spectral method of since this method ignores the structure of the hidden state its running time is very high nd md even with optimized implementations this motivates the design of more computationally efﬁcient approaches plausible approach is to observe that at the observations are generated by tree graphical model thus in principle one could learn the parameters of the underlying tree using existing algorithms however this approach does not directly produce the hmm parameters it also does not work for biological sequences because we do not have multiple independent samples at instead we have single long sequence at the steady state and the steady state distribution of observations is not generated by latent tree another plausible approach is to use the spectral junction tree algorithm of however this algorithm does not provide the actual transition and observation matrix parameters which hold important biological information and instead provides an observable operator representation our main contribution is to show that we can achieve much better running time by exploiting the structure of the hidden state our algorithm is based on three key ideas partitioning skeletensor construction and product projections we explain these ideas next partitioning our ﬁrst observation is that to learn the parameters at node we can focus only on the unique path from the root to thus we partition the learning problem on the tree into separate learning problems on these paths this maintains correctness as proved in the appendix the partitioning step reduces the computational complexity since we now need to learn an hmm with md states and nd observations instead of the naive method where we learn an hmm with md states and nd observations as in biological data this gives us signiﬁcant savings constructing the skeletensor naive way to learn the parameters of the hmm corresponding to each path is to work directly on the nd nd nd tensor instead we show that for each node on path novel symmetrization method can be used to construct much smaller skeleton tensor of size which nevertheless captures the effect of the entire path and projects it into the skeleton tensor thus revealing the range of ou we call this the skeletensor hu hu be the empirical let hu be the path from the root to node and let tensor of of the hu and hu at times and respectively based on the data we construct the following symmetrization matrices hu hu hu hu hu hu hu hu note that and are matrices symmetrizing with and gives us an skeletensor which can in turn be decomposed to give an estimate of ou see lemma in the appendix even though naively constructing the symmetrization matrices and skeletensor takes time this procedure improves computational efﬁciency because tensor construction is onetime operation while the power method which takes many iterations is carried out on much smaller tensor product projections we further reduce the computational complexity by using novel algorithmic technique that we call product projections the key observation is as follows let hu be any path in the tree and consider the 
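The tensor power method with deflation named above is standard; a minimal sketch follows, assuming the whitened k x k x k tensor is symmetric and approximately orthogonally decomposable. The restart and iteration counts are illustrative.

```python
# Minimal sketch of the tensor power method with deflation: repeatedly apply the
# map v -> T(I, v, v), normalize, record the eigenpair, and subtract the
# recovered rank-one component before extracting the next one.
import numpy as np

def tensor_power_method(T, k, n_restarts=10, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    T = T.copy()
    eigvals, eigvecs = [], []
    for _ in range(k):
        best_v, best_lam = None, -np.inf
        for _ in range(n_restarts):
            v = rng.standard_normal(T.shape[0])
            v /= np.linalg.norm(v)
            for _ in range(n_iter):
                v = np.einsum('ijk,j,k->i', T, v, v)   # T(I, v, v)
                v /= np.linalg.norm(v)
            lam = float(np.einsum('ijk,i,j,k->', T, v, v, v))
            if lam > best_lam:
                best_lam, best_v = lam, v
        eigvals.append(best_lam)
        eigvecs.append(best_v)
        # Deflate: remove the recovered rank-one component.
        T = T - best_lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)
    return np.array(eigvals), np.array(eigvecs)
```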
hmm that generates the observations xut xut xt for even though the individual observations uj xt are highly dependent the range of ohu the emission matrix of the hmm describing the path hu is contained in the product of the ranges of ouj where ouj is the emission matrix at node uj lemma in the appendix furthermore even though the ouj matrices are difﬁcult to ﬁnd their ranges can be determined by computing the svds of the observation matrices at uj thus we can implicitly construct and store an estimate of the range of ohu this also gives us hu hu hu hu the column spaces of and and the range of the estimates of the range of hu hu therefore during skeletensor construction we can ﬁrst and third modes of the tensor hu hu and instead construct their projections onto their avoid explicitly constructing and ranges this reduces the time complexity of the skeletensor construction step to recall that the range has dimension while the number of hidden states could be as high as this is signiﬁcant gain in practice as in biological datasets observations hidden states product projections are more efﬁcient than random projections on the matrix of the matrices are nd nd matrices and random projections would take nd time also product projections differ from the suggestion of since we exploit properties of the model to efﬁciently ﬁnd good projections the product projections technique is general technique with applications beyond our model some examples are provided in appendix the full algorithm our ﬁnal algorithm follows from combining the three key ideas above algorithm shows how to recover the observation matrices ou at each node once the ou are recovered one can use standard techniques to recover and details are described in algorithm in the appendix performance guarantees we now provide performance guarantees on our algorithm since learning parameters of hmms and many other graphical models is spectral algorithms make simplifying assumptions on the properties of the model generating the data typically these assumptions take the form of some algorithm algorithm for observation matrix recovery input samples of the three consecutive observations generated by an hmm with tree structured hidden state with known tree structure for do perform svd on to get the ﬁrst left singular vectors end for for do let hu denote the set of nodes on the unique path from root to let hu construct projected skeletensor first compute symmetrization matrices hu hu hu hu hu hu hu hu hu hu hu hu compute symmetrized second and third for hu hu hu hu hu hu hu hu orthogonalization and tensor decomposition orthogonalize using and decomu as in see algorithm in the appendix for details pose to recover undo projection onto range estimate ou as where end for conditions on the rank of certain parameter matrices we state below the conditions needed for our algorithm to successfully learn parameters of hmm with tree structured hidden states observe that we need two kinds of rank conditions and to ensure that we can recover the full set of parameters on path assumption rank condition for all the matrix ou has rank and the has rank joint probability matrix assumption rank condition for any let hu denote the path from root to hu hu has rank then the joint probability matrix assumption is required to ensure that the skeletensor can be decomposed and that indeed captures the range of ou assumption ensures that the symmetrization operation succeeds this kind of assumption is very standard in spectral learning has provided spectral algorithm for 
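The product-projections idea described above can be sketched compactly, assuming per-node second-moment estimates P_u = E[x_t^u (x_t^u)^T] are available and that each observation is encoded as a one-hot vector; function names are ours. The top left singular vectors of each P_u estimate the range of that node's emission matrix, and a path observation is projected node by node before forming a small Kronecker product, so no n^d-dimensional object is ever constructed.

```python
# Minimal sketch of product projections along a root-to-node path.
import numpy as np
from functools import reduce

def node_range(P_u, m):
    """Top-m left singular vectors of the node's second moment: estimated range of O_u."""
    U, _, _ = np.linalg.svd(P_u)
    return U[:, :m]                                    # n x m

def project_path_observation(xs, Us):
    """xs: per-node observation vectors along the path (each length n, e.g. one-hot);
    Us: per-node range estimates. Returns the m^d projected path observation."""
    proj = [U.T @ x for U, x in zip(Us, xs)]           # each of length m
    return reduce(np.kron, proj)
```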
learning hmms involving fourth and higher order moments when assumption does not hold we believe similar approaches will apply to our problem as well and we leave this as an avenue for future work if assumptions and hold we can show that algorithm is consistent provided enough samples are available the model parameters learnt by the algorithms are close to the true model parameters ﬁnite sample guarantee is provided in the appendix theorem consistency suppose we run algorithm on the ﬁrst three observation vectors from iid sequences generated by an hmm with hidden states then for all nodes the recovered estimates satisfy the following property with high probability over the iid samples there exists permutation πu of the columns of such that as πu where as observe that the observation matrices as well as the transition and initial probabilities are recovered upto permutations of hidden states in globally consistent manner experiments data and experimental settings we ran our algorithm which we call on chromatin dataset on human chromosome from nine cell types hmec hsmm huvec nhek nhlf from the encode project following we used biologically motivated tree structure of star tree with the embryonic stem cell type as the root there are data for eight chromatin marks for each cell type which we preprocessed into binary vectors using standard poisson background assumption the chromosome is divided into segments of length following the observed data consists of binary vector of length eight for each segment so the number of possible observations is the number of all combinations of presence or absence of the chromatin marks we set the number of hidden states which we interpret as chromatin states to similar to the choice of encode our goals are to discover chromatin states corresponding to biologically important functional elements such as promoters and enhancers and to label each chromosome segment with the most probable chromatin state observe that instead of the ﬁrst few observations from iid sequences we have single long sequence in the steady state per cell type thus similar to we calculate the empirical cooccurrence matrices and tensors used in the algorithm based on two and three successive observations respectively so more formally instead of we use the average over of and so on additionally we use projection procedure similar to for rounding negative entries in the recovered observation matrices our experiments reveal that the rank conditions appear to be satisﬁed for our dataset run time and memory usage comparisons first we ﬂattened the hmm with hidden states into an ordinary hmm with an exponentially larger state space our python implementation of the spectral algorithm for hmms of ran out of memory while performing singular value decomposition on the matrix even using sparse matrix libraries this suggests that naive application of spectral hmm is not practical for biological data next we compared the performance of to similar model which additionally constrained all transition and observation parameters to be the same on each branch that work used several variational approximations to the em algorithm and reported that smf structured mean ﬁeld performed the best in their tests although we implemented in matlab and did not optimize it for efﬁciency took hr whereas the smf algorithm took hr for iterations to convergence this suggests that spectral algorithms may be much faster than variational em for our model biological interpretation of the observation matrices having examined the 
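The moment estimation from a single long steady-state sequence mentioned above can be sketched as below, assuming observations have been encoded as single categorical symbols (for instance the 2^8 = 256 possible combinations of the eight binarized chromatin marks); the empirical pair and triple co-occurrence statistics are simply averaged over consecutive time steps.

```python
# Minimal sketch: empirical second- and third-order co-occurrence moments from
# one long sequence of categorical observations obs in {0, ..., n-1}.
import numpy as np

def empirical_moments(obs, n):
    T = len(obs)
    P12 = np.zeros((n, n))
    P123 = np.zeros((n, n, n))
    for t in range(T - 2):
        P12[obs[t], obs[t + 1]] += 1.0
        P123[obs[t], obs[t + 1], obs[t + 2]] += 1.0
    return P12 / (T - 2), P123 / (T - 2)
```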
efﬁciency of we next studied the accuracy of the learned parameters we focused on the observation matrices which hold most of the interesting biological information since the full observation matrix is very large where each row is combination of chromatin marks figure shows the marginal distribution of each chromatin mark conditioned on each hidden state spectacletree identiﬁed most of the major types of functional elements typically discovered from chromatin data repressive strong enhancer weak enhancer promoter transcribed region and background state states respectively in figure in contrast the smf algorithm used three out of the six states to model the large background state the state with no chromatin marks it identiﬁed repressive transcribed and promoter states states respectively in figure but did not identify any enhancer states which are one of the most interesting classes for further biological studies we believe these results are due to that fact that the background state in the data set is large of the segments do not have chromatin marks for any cell type the background state has lower biological interest but is modeled well by the maximum likelihood approach in contast biologically interesting states such as promoters and enhancers comprise relatively small fraction of the genome we can not simply remove background segments to make the classes balanced because it would change the length distribution of the hidden states finally we observed that our model estimated signiﬁcantly different parameters for each cell type which captures different chromatin states appendix figure for example we found enhancer states with strong in all cell types except for where both enhancer states and had low signal for this mark this mark is known to be biologically important in these cells for distinguishing active from poised enhancers smf figure the compressed observation matrices for the cell type estimated by the smf and algorithms the hidden states are on the axis this suggests that modeling the additional parameters can yield interesting biological insights comparison of the chromosome segments labels we computed the most probable state for each chromosome segment using posterior decoding algorithm we tested the accuracy of the predictions using an experimentally deﬁned data set and compared it to smf and the spectral algorithm for hmms run for individual cell types without the tree speciﬁcally we assessed promoter prediction accuracy state for smf and state for in figure using cage data from which was available for six of the nine cell types we used the score harmonic mean of precision and recall for comparison and found that was much more accurate than smf for all six cell types table this was because the promoter predictions of smf were biased towards the background state so those predictions had slightly higher recall but much lower speciﬁcity finally we compared our predictions to to assess the value of the tree model is the root node so and have the same model and obtain the same accuracy table predicts promoters more accurately than for all other cell types except however is the most diverged from the root among the cell types based on the hamming distance between the chromatin marks we hypothesize that for the tree is not good model which slightly reduces the prediction accuracy cell type huvec nhek smf table score for predicting promoters for six cell types the highest score for each cell type is emphasized in bold labels for the other are currently unavailable our experiments 
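For reference, the F1 comparison used below is the harmonic mean of precision and recall over the predicted versus experimentally defined promoter segments; a minimal sketch, assuming boolean indicator vectors over chromosome segments, is:

```python
# Minimal sketch of the F1 score for promoter prediction.
import numpy as np

def f1_score(pred, truth):
    tp = np.sum(pred & truth)
    precision = tp / max(np.sum(pred), 1)
    recall = tp / max(np.sum(truth), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)
```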
show that has improved computational efﬁciency biological interpretability and prediction accuracy on an feature compared to variational em for similar tree hmm model and spectral algorithm for single hmms previous study showed improvements for spectral learning of single hmms over the em algorithm thus our algorithms may be useful to the bioinformatics community in analyzing the chromatin data sets currently being produced acknowledgements kc and cz thank nsf under iis for research support references anima anandkumar rong ge daniel hsu sham kakade and matus telgarsky tensor decompositions for learning latent variable models corr animashree anandkumar daniel hsu and sham kakade method of moments for mixture models and hidden markov models corr balle carreras luque and quattoni spectral learning of weighted automata perspective machine learning balle hamilton and pineau methods of moments for learning stochastic languages uniﬁed presentation and empirical comparison in icml pages jacob biesinger yuanfeng wang and xiaohui xie discovering and mapping chromatin states using tree hidden markov model bmc bioinformatics suppl chaganty and liang estimating graphical models using moments and likelihoods in icml encode project consortium an integrated encyclopedia of dna elements in the human genome nature jason ernst and manolis kellis discovery and characterization of chromatin states for systematic annotation of the human genome nature biotechnology bernstein et al the nih roadmap epigenomics mapping consortium nature biotechnology creyghton et al histone separates active from poised enhancers and predicts developmental state proc natl acad sci ernst et al mapping and analysis of chromatin state dynamics in nine human cell types nature jun zhu et al characterizing dynamic changes in the human blood transcriptional network plos comput biol hoffman et al unsupervised pattern discovery in human chromatin structure through genomic segmentation nature methods djebali et al landscape of transcription in human cells nature foster rodu and ungar spectral dimensionality reduction for hmms in corr foti xu laird and fox stochastic variational inference for hidden markov models in nips halko martinsson and tropp finding structure with randomness probabilistic algorithms for constructing approximate matrix decompositions siam review hsu kakade and zhang spectral algorithm for learning hidden markov models in colt melnyk and banerjee spectral algorithm for inference in hidden models in aistats mossel and roch learning phylogenies and hidden markov models ann appl parikh song and xing spectral algorithm for latent tree graphical models in icml pages parikh song ishteva teodoru and xing spectral algorithm for latent junction trees in uai siddiqi boots and gordon hidden markov models in aistats song and chen spectacle fast chromatin state annotation using spectral learning genome biology song ishteva parikh xing and park hierarchical tensor decomposition of latent tree graphical models in icml zou hsu parkes and adams contrastive learning using spectral methods in nips 
individual planning in multiagent settings inference structure and scalability xia qu epic systems verona wi quxiapisces prashant doshi thinc lab dept of computer science university of georgia athens ga pdoshi abstract this paper provides the ﬁrst formalization of planning in multiagent settings using em our formalization in the context of and interactive pomdps is distinct from em formulations for pomdps and cooperative multiagent planning frameworks we exploit the graphical model structure speciﬁc to and present new approach based on descent for further speed up forward sampling combination of exact ﬁltering with sampling is explored to exploit problem structure introduction generalization of bounded policy iteration bpi to interactive partially observable markov decision processes is currently the leading method for selfinterested multiagent planning and obtaining controllers as solutions however interactive bpi is acutely prone to converge to local optima which severely limits the quality of its solutions despite the limited ability to escape from these local optima attias posed planning using mdp as likelihood maximization problem where the data is the initial state and the ﬁnal goal state or the maximum total reward toussaint et al extended this to infer automata for pomdps experiments reveal good quality controllers of small sizes although run time is concern given bpi limitations and the compelling potential of this approach in bringing advances in inferencing to bear on planning we generalize it to and our generalization allows its use toward planning for an individual agent in noncooperation where we may not assume common knowledge of initial beliefs or common rewards due to which others beliefs capabilities and preferences are modeled analogously to pomdps we formulate mixture of dbns however the dbns differ by including models of other agents in special model node our approach labeled as improves on the straightforward extension of toussaint et em to by utilizing various types of structure instead of ascribing as many level controllers as candidate models and improving each using its own em we use the underlying graphical structure of the model node and its update to formulate single em that directly provides the marginal of others actions across all models this rests on new insight which considerably simpliﬁes and speeds em at level we present general approach based on descent for speeding up the nonasymptotic rate of convergence of the iterative em the problem is decomposed into optimization subproblems in which the objective function is optimized with respect to small subset block of variables while holding other variables ﬁxed we discuss the unique challenges and present the ﬁrst effective application of this iterative scheme to multiagent planning finally sampling offers way to exploit the embedded problem structure such as information in distributions the exact is replaced with forward sampling ffbs that generates trajectories weighted with rewards which are used to update the parameters of the controller while sampling has been integrated in em previously ffbs speciﬁcally mitigates error accumulation over long horizons due to the exact forward step overview of interactive pomdps for an agent with strategy level interacting with agent is ti ωi oi ri oci isi denotes the set of interactive states deﬁned as isi mj where mj θj smj for and where is the set of physical states θj is the set of computable intentional models ascribed to agent θj here bj is agent level belief bj 
isj where is the space of distributions and tj ωj oj rj ocj is frame at level and intentional model reduces to pomdp smj is the set of subintentional models of an example is ﬁnite state automaton ai aj is the set of joint actions of all agents other parameters transition function ti observations ωi observation function oi and preference function ri have their usual semantics analogously to pomdps but involve joint actions optimality criterion oci here is the discounted inﬁnite horizon sum an agent belief over its interactive states is sufﬁcient statistic fully summarizing the agent observation history given the associated belief update solution to an is policy using the bellman equation each belief state in an has value which is the maximum payoff the agent can expect starting from that belief and over the future planning in as inference we may represent the policy of agent for the inﬁnite horizon case as stochastic ﬁnite state controller fsc deﬁned as πi ti li vi where ni is the set of nodes in the controller ti ni ai ωi ni represents the node transition function li ni ai denotes agent action distribution at each node and an initial distribution over the nodes is denoted by vi ni for convenience we group vi ti and li in fˆi deﬁne controller at level for agent as πi ni fˆi where ni is the set of nodes in the controller and fˆi groups remaining parameters of the controller as mentioned before analogously to pomdps we formulate planning in multiagent settings formalized by as likelihood maximization problem πi arg max πi rit πi where πi are all fscs of agent rit is binary random variable whose value is or emitted after time steps with probability proportional to the reward ri ai aj nti oti ati otj atj rjt atj st rit st figure mixture of dbns with to time slices for with policy represented as standard fsc whose node state is denoted by ni the dbns differ from those for pomdps by containing special model nodes hexagons whose values are candidate models of other agents hexagonal model nodes and edges in bold for one other agent in decompose into this dbn values of the node are the candidate models cpt of chance node atj denoted by atj is inferred using likelihood maximization the planning problem is modeled as mixture of dbns of increasing time from onwards fig the transition and observation functions of parameterize the chance nodes and oi respectively along with rit atj st are the maximum and minimum reward values in ri ri st at aj rmax here rmax and rmin the networks include nodes ni of agent fsc therefore functions in fˆi parameterize the network as well which are to be inferred additionally the network includes the hexagonal model nodes one for each other agent that contain the candidate level models of the agent each model node provides the expected distribution over another agent actions without loss of generality no edges exist between model nodes in the same time step correlations between agents could be included as state variables in the models agent model nodes and the edges in bold between them and between the model and chance action nodes represent dbn of length as shown in fig values of the chance node are the candidate models of agent agent initial belief over the state and models of becomes the parameters of and the likelihood maximization at level seeks to obtain the distribution aj for each candidate model in node using em on the dbn proposition correctness the likelihood maximization problem as deﬁned in eq with the mixture models as given in fig is equivalent to the problem 
of solving the original with discounted inﬁnite horizon whose solution assumes the form of ﬁnite state controller all proofs are given in the supplement given the unique mixture models above the challenge is to generalize the iterative maximization for pomdps to the framework of single em for level models the straightforward approach is to infer likely fsc for each level model however this approach does not scale to many models proposition below shows that the dynamic atj is sufﬁcient predictive information about other agent from its candidate models at time to obtain the most likely policy of agent this is markedly different from using behavioral equivalence that clusters models with identical solutions the latter continues to require the full solution of each model proposition sufﬁciency distributions atj across actions atj aj for each state st is sufﬁcient predictive information about other agent to obtain the most likely policy of in the context of proposition we seek to infer atj for each updated model of at all time steps which is denoted as other terms in the computation of atj are known parameters of the level dbn the likelihood maximization for the level dbn is arg max rjt as the trajectory consisting of states models actions and observations of the other agent is hidden at planning time we may solve the above likelihood maximization using em let st atj otj where the observation at is null be the hidden trajectory the log likelihood is obtained as an expectation of these hidden trajectories rjt log rjt the data in the level dbn consists of the initial belief over the state and models and the observed reward at analogously to em for pomdps this motivates forward smoothing on network with joint state st for computing the log likelihood the transition function for the forward and backward steps is st otj tmj st otj aj aj otj omj st where mj in the subscripts is model at here oj is the delta function that is when belief in updated using aj and oj equals the belief in otherwise forward ﬁltering gives the probability of the next state as follows αt st st where is the initial belief of agent the smoothing by which we obtain the joint probability of the state and model at from the distribution at is st st where denotes the horizon to and st eatj rjt messages αt and give the probability of state at some time slice in the dbn as we consider mixture of bns we seek probabilities for all states in the mixture model subsequently we may compute the forward and backward messages at all states for the entire mixture model in one sweep αt model growth as the other agent performs its actions and makes observations the space of models grows exponentially starting from ﬁnite set of models we obtain models at time this greatly increases the number of trajectories in we limit the growth in the model space by sampling models at the next time step from the distribution αt st as we perform each step of forward ﬁltering it limits the growth by exploiting the structure present in and oj which guide how the models grow we obtain the updated from the full log likelihood in eq by separating the terms independent of and maximizing it atj atj mtj st rit rmj st atj st st oj atj omj atj st tmj st atj aj oj improved em for level at strategy levels eq deﬁnes the likelihood maximization problem which is iteratively solved using em we show the and next beginning with in multiagent setting the hidden variables additionally include what the other agent may observe and how it acts over time however key insight is that 
prop allows us to limit attention to the marginal distribution over other agents actions given the state thus let st oti nti ati atj atk where the observation at is null and other agents are labeled to this group is denoted the full log likelihood involves an expectation over hidden variables πi rit πi log rit πi due to the subjective perspective in computes the likelihood of agent fsc only and not of joint fscs as in team planning in the dbn of fig observed evidence includes the reward rit at the end and the initial belief we seek the likely distributions vi ti and li across time slices we may again realize the full joint in the expectation using algorithm on hidden markov model whose state is st nti the transition function of this model is st nti oi li ai ti oti nti ai ti oi ai oi in addition to parameters of which are given parameters of agent controller and those relating to other agents predicted actions are present in eq notice that in consequence of proposition eq precludes observation and node transition functions the forward message αt st nti represents the probability of being at some state of the dbn at time αt st nti st nti where vi the backward message gives the probability of observing the reward in the ﬁnal th time step given state of the markov model st nti rit nti st nti where st nti at ni rit ati li nti ati is the horizon here rit ati ri st ati and and side effect of being dependent on is that we can no longer conveniently deﬁne for use in at level instead the computations are folded in the we update the parameters li ti and vi of πi to obtain πi based on the expectation in the speciﬁcally take log of the likelihood zi πi with πi substituted and focus on terms involving the parameters of πi with πi log rt πi independent of πi log nti ati log nti ati log vi ni in order to update li we partially differentiate the of eq with respect to to facilitate differentiation we focus on the terms involving li as shown below πi indep of pr on maximizing the above equation is nti ati li nti ati st at pr rit πi log nti ati γt rit ati αt st nti node transition probabilities ti and node distribution vi for πi is updated analogously to li because fsc is inferred at level at strategy levels and greater candidate models are fscs em at these higher levels proceeds by replacing the state of the dbn st nti with st nti ntj ntk descent for speed up descent bcd is an iterative scheme to gain faster rate of convergence in the context of optimization problems in this scheme within each iteration set of variables referred to as coordinates are chosen and the objective function is optimized with respect to one of the coordinate blocks while the other coordinates are held ﬁxed bcd may speed up the rate of convergence of em for both and pomdps the speciﬁc challenge here is to determine which of the many variables should be grouped into blocks and how we empirically show in section that grouping the number of time slices and horizon in eqs and respectively at each level into coordinate blocks of equal size is beneﬁcial in other words we decompose the mixture model into blocks containing equal numbers of bns alternately grouping controller nodes is ineffective because distribution vi can not be optimized for subsets of nodes formally let be subset of tmax then the set of blocks is bt in practice because both and are ﬁnite say tmax the cardinality of bt is bounded by some analogously we deﬁne the set of blocks of denoted by bh in the now we compute αt for the time steps in single coordinate block ψtc only 
while using the values of αt from the previous iteration for the complementary coordinate blocks analogously we compute for the horizons in ψhc only while using values from the previous iteration for the remaining horizons we cyclically choose block ψtc at iterations qc where forward filtering backward sampling an approach for exploiting embedded structure in the transition and observation functions is to replace the exact message computations with exact forward ﬁltering and backward sampling ffbs to obtain sampled reverse trajectory consisting of nti ati ati oti nti and so on from to here rit ati is the likelihood weight of this trajectory sample parameters of the updated fsc πi are obtained by summing and normalizing the weights each trajectory is obtained by ﬁrst sampling which becomes the length of dbn for this sample forward message αt st nti is computed exactly eq followed by the backward message st nti and computing differs from eq by utilizing the forward message st nti ati αt st nti li nti ati ti st ati ti nti ati ati oi where st nti rit ati αt st nti nti ati rit ati subsequently we may easily sample nti rit followed by sampling sti nti from eq nti we sample ati oti ati where ati nti li nti ati ti nti ati ti ai oi ati computational complexity our em at level is signiﬁcantly quicker compared to ascribing fscs to other agents in the latter nodes of others controllers must be included alongside and ni proposition speed up each at level using the pass as shown previously results in net speed up of over the formulation that ascribes fscs each to other agents with each having nodes analogously updating the parameters li and ti in the exhibits speedup of while vi leads to this improvement is exponential in the number of other agents on the other hand the at level exhibits complexity that is typically greater compared to the total complexity of the for fscs proposition ratio at level when fscs are inferred for agents exhibit ratio of complexity compared to the for obtaining the ratio in prop is when controllers are sought and there are several models experiments five variants of em are evaluated as appropriate the exact em planning labeled as replacing the exact with its greedy variant analogously to the greedy maximization in em for pomdps iterating em based on coordinate blocks and coupled with greedy and lastly using forward sampling we use problem domains the noncooperative multiagent tiger problem for level at level and with total of agents and models for each other agent larger noncooperative money laundering ml problem forms the second domain it exhibits physical states for the subject agent blue team actions for blue and for the red team observations for subject and for the other with about models ml tiger level value time in log scale em methods time em methods time in log scale em methods level value uav time in log scale time in log scale time policing time time em methods figure fscs improve with time for in the tiger money laundering uav and policing contexts observe that bcd causes substantially larger improvements in the initial iterations until we are close to convergence or its greedy variant converges signiﬁcantly quicker than to fscs for all four problem domains as shown in and respectively all experiments were run on linux with intel xeon cpus and ram for red team we also evaluate uav reconnaissance problem involving uav tasked with intercepting two fugitives in grid before they both reach the safe house it has states for the uav actions observations for each agent and 
models for the two fugitives finally the recent policing protest problem is used in which police must maintain order in designated protest sites populated by groups of protesters who may be peaceful or disruptive it exhibits states policing and protesting actions observations and models per protesting group the latter two domains are historically the largest test problems for planning comparative performance of all methods in fig we compare the variants on all problems each method starts with random seed and the converged value is signiﬁcantly better than random fsc for all methods and problems increasing the sizes of fscs gives better values in general but also increases time using fscs of sizes and for the domains respectively demonstrated good balance we explored various coordinate block conﬁgurations eventually settling on blocks for both the tiger and ml blocks for uav and for policing protest and the greedy and bcd variants clearly exhibit an anytime property on the tiger uav and policing problems the noncooperative ml shows delayed increases because we show the value of agent controller and initial improvements in the other agent controller may maintain or decrease the value of controller this is not surprising due to competition in the problem nevertheless after small delay the values improve steadily which is desirable consistently improves on and is often the fastest the corresponding value improves by large steps initially fast rate of convergence in the context of ml and uav shows substantive improvements leading to controllers with much improved values compared to other approaches despite low sample size of about for the problems obtains fscs whose values improve in general for tiger and ml though slowly and not always to the level of others this is because the em gets caught in worse local optima due to sampling approximation this strongly impacts the uav problem more samples did not escape these optima however forward ﬁltering only as used in wu et al required much larger sample size to reach these levels ffbs did not improve the controller in the fourth domain characterization of local optima while an exact solution for the smaller tiger problem with agents or the larger problems could not be obtained for comparison climbs to the optimal value of for the downscaled version not shown in fig in comparison bpi does not get past the local optima of using an controller corresponding controller predominantly contains listening actions relying on adding nodes to eventually reach optimum while we are unaware of any general technique to escape local convergence in em can reach the global optimum given an appropriate seed this may not be coincidence the value function space exhibits single ﬁxed point the global optimum which in the context of sition makes the likelihood function πi unimodal if πi is appropriately sized as we is continuously differentiable for the do not have principled way of adding nodes if πi domain on hand corollary in wu indicates that πi will converge to the unique maximizer improvement on we compare the quickest of the variants with previous best algorithm figs allowing the latter to escape local optima as well by adding nodes observe that fscs improved using converge to values similar to those of almost two orders of magnitude faster beginning with nodes adds more nodes to obtain the same level of value as em for the tiger problem for money laundering converges to controllers whose value is at least times better than given the same amount of allocated time and 
less nodes failed to improve the seed controller and could not escape for the uav and policing protest problems to summarize this makes variants with emphasis on bcd the fastest iterative approaches for currently concluding remarks the em formulation of section builds on the em for pomdp and differs drastically from the eand for the cooperative the differences reﬂect how build on pomdps and differ from these begin with the structure of the dbns where the dbn for in fig adds to the dbn for pomdp hexagonal model nodes that contain candidate models chance nodes for action and model update edges for each other agent at each time step this differs from the dbn for which adds controller nodes for all agents and joint observation chance node the dbn contains controller nodes for the subject agent only and each model node collapses into an efﬁcient distribution on running em at level in domains where the joint reward function may be decomposed into factors encompassing subsets of agents allow the value function to be factorized as well kumar et al exploit this structure by simply decomposing the whole dbn mixture into mixture for each factor and iterating over the factors interestingly the may be performed individually for each agent and this approach scales beyond two agents we exploit both graphical and problem structures to speed up and scale in way that is contextual to bcd also decomposes the dbn mixture into equal blocks of horizons while it has been applied in other areas these applications do not transfer to planning additionally problem structure is considered by using ffbs that exploits information in the transition and observation distributions of the subject agent ffbs could be viewed as tenuous example of monte carlo em which is broad category and also includes the forward sampling utilized by wu et al for however fundamental differences exist between the two forward sampling may be run in simulation and does not require the transition and observation functions indeed wu et al utilize it in model free setting ffbs is model based utilizing exact forward messages in the backward sampling phase this reduces the accumulation of sampling errors over many time steps in extended dbns which otherwise afﬂicts forward sampling the advance in this paper for multiagent planning has wider relevance to areas such as game play and ad hoc teams where agents model other agents developments in online em for hidden markov models provide an interesting avenue to utilize inference for online planning acknowledgments this research is supported in part by nsf career grant and grant from onr we thank akshat kumar for feedback that led to improvements in the paper references ekhlas sonu and prashant doshi scalable solutions of interactive pomdps using generalized and bounded policy iteration journal of autonomous agents and systems pages doi in press hagai attias planning by probabilistic inference in ninth international workshop on ai and statistics aistats marc toussaint and amos storkey probabilistic inference for solving discrete and continuous state markov decision processes in international conference on machine learning icml pages jeffrey fessler and alfred hero generalized expectationmaximization algorithm ieee transactions on signal processing tseng convergence of block coordinate descent method for nondifferentiable minimization journal of optimization theory and applications feng wu shlomo zilberstein and nicholas jennings expectation maximization for decentralized pomdps in international joint 
conference on artiﬁcial intelligence ijcai pages piotr gmytrasiewicz and prashant doshi framework for sequential planning in multiagent settings journal of artiﬁcial intelligence research yifeng zeng and prashant doshi exploiting model equivalences for solving interactive dynamic inﬂuence diagrams journal of artiﬁcial intelligence research akshat kumar and shlomo zilberstein anytime planning for decentralized pomdps using expectation maximization in conference on uncertainty in ai uai pages ankan saha and ambuj tewari on the nonasymptotic convergence of cyclic coordinate descent methods siam journal on optimization carter and kohn markov chainmonte carlo in conditionally gaussian state space models biometrika marc toussaint laurent charlin and pascal poupart hierarchical pomdp controller optimization by likelihood maximization in conference on uncertainty in artiﬁcial intelligence uai pages prashant doshi and piotr gmytrasiewicz monte carlo sampling methods for approximating interactive pomdps journal of artiﬁcial intelligence research brenda ng carol meyers koﬁ boakye and john nitao towards applying interactive pomdps to adversary modeling in innovative applications in artiﬁcial intelligence iaai pages ekhlas sonu yingke chen and prashant doshi individual planning in agent populations anonymity and hypergraphs in international conference on automated planning and scheduling icaps pages jeff wu on the convergence properties of the em algorithm annals of statistics akshat kumar shlomo zilberstein and marc toussaint scalable multiagent planning using probabilistic inference in international joint conference on artiﬁcial intelligence ijcai pages arimoto an algorithm for computing the capacity of arbitrary discrete memoryless channels ieee transactions on information theory jeffrey fessler and donghwan kim axial block coordinate descent abcd algorithm for ct image reconstruction in international meeting on fully image reconstruction in radiology and nuclear medicine volume pages olivier cappe and eric moulines online algorithm for latent data models journal of the royal statistical society series statistical methodology 
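The forward filtering–backward sampling (FFBS) step described above can be illustrated on a generic discrete chain: the forward messages are computed exactly, and a reverse trajectory is then sampled with the terminal reward acting as the likelihood weight. The sketch below is a minimal, self-contained version of that scheme under assumed values; the chain size, horizon, transition matrix, and reward weights are all hypothetical, and in the I-POMDP mixture the chain state would be the joint (physical state, controller node) pair with the controller and observation functions folded into the transition.

```python
# Generic forward filtering-backward sampling (FFBS) sketch for a discrete
# Markov chain with a terminal reward-proportional likelihood: exact forward
# messages, then a sampled reverse trajectory.  All numerical values are
# hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(1)

S = 4                                   # number of joint states
T_len = 6                               # sampled DBN length (horizon)
b0 = np.full(S, 1.0 / S)                # initial belief
P = rng.dirichlet(np.ones(S), size=S)   # P[s, s'] = transition probability s -> s'
r = rng.uniform(size=S)                 # normalized terminal reward weights

# exact forward filtering: alpha_t(s') = sum_s alpha_{t-1}(s) P[s, s']
alpha = np.zeros((T_len + 1, S))
alpha[0] = b0
for t in range(1, T_len + 1):
    alpha[t] = alpha[t - 1] @ P

# backward sampling: draw the final state proportionally to alpha_T * reward,
# then draw s_t given s_{t+1} proportionally to alpha_t(s) * P[s, s_{t+1}]
traj = np.zeros(T_len + 1, dtype=int)
w_T = alpha[T_len] * r
traj[T_len] = rng.choice(S, p=w_T / w_T.sum())
for t in range(T_len - 1, -1, -1):
    w = alpha[t] * P[:, traj[t + 1]]
    traj[t] = rng.choice(S, p=w / w.sum())

print(traj)   # one reverse-sampled trajectory; its importance weight would
              # come from the reward term, and many such weighted trajectories
              # are accumulated to update the controller parameters
```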
estimating mixture models via mixtures of polynomials sida wang arun tejasvi chaganty percy liang computer science department stanford university stanford ca sidaw chaganty pliang abstract mixture modeling is general technique for making any simple model more expressive through weighted combination this generality and simplicity in part explains the success of the expectation maximization em algorithm in which updates are easy to derive for wide class of mixture models however the likelihood of mixture model is so em has no known global convergence guarantees recently method of moments approaches offer global guarantees for some mixture models but they do not extend easily to the range of mixture models that exist in this work we present polymom an unifying framework based on method of moments in which estimation procedures are easily derivable just as in em polymom is applicable when the moments of single mixture component are polynomials of the parameters our key observation is that the moments of the mixture model are mixture of these polynomials which allows us to cast estimation as generalized moment problem we solve its relaxations using semidefinite optimization and then extract parameters using ideas from computer algebra this framework allows us to draw insights and apply tools from convex optimization computer algebra and the theory of moments to study problems in statistical estimation simulations show good empirical performance on several models introduction mixture models play central role in machine learning and statistics with diverse applications including bioinformatics speech natural language and computer vision the idea of mixture modeling is to explain data through weighted combination of simple parametrized distributions in practice maximum likelihood estimation via expectation maximization em has been the workhorse for these models as the parameter updates are often easily derivable however em is to suffer from local optima the method of moments dating back to pearson in is enjoying recent revival due to its strong global theoretical guarantees however current methods depend strongly on the specific distributions and are not easily extensible to new ones in this paper we present method of moments approach which we call polymom for estimating wider class of mixture models in which the moment equations are polynomial equations section solving general polynomial equations is but our key insight is that for mixture models the moments equations are mixtures of polynomials equations and we can hope to solve them if the moment equations for each mixture component are simple polynomials equations that we can solve polymom proceeds as follows first we recover mixtures of monomials of the parameters from the data moments by solving an instance of the generalized moment problem gmp section we show that for many mixture models the gmp can be solved with basic linear algebra and in the general case can be approximated by an sdp in which the moment equations are linear constraints second we extend multiplication matrix ideas from the computer algebra literature mixture model xt data point rd zt latent mixture component parameters of component rp mixing proportion of all model parameters moments of data observation function fn observation function moments of parameters ly the riesz linear functional ly moment probability measure for the moment sequence mr moment matrix of degree sizes data dimensions mixture components parameters of mixture components data points constraints degree 
of the moment matrix size of the degree moment matrix polynomials polynomial ring in variables set of integers vector of exponents in np or nd qp monomial coefficient of in fn table notation we use lowercase letters for indexing and the corresponding uppercase letter to denote the upper limit in sizes we use lowercase letters for scalars lowercase bold letters for vectors and bold capital letters for matrices write down mixture model recover parameter moments multinomial mr minimize tr mr derive single mixture moment equations add data mr vpv sim diag diag mr pt xt pt pt xt solve for parameters user specified framework specified figure an overview of applying the polymom framework to extract the parameters by solving certain generalized eigenvalue problem section polymom improves on previous method of moments approaches in both generality and flexibility first while tensor factorization has been the main driver for many of the method of moments approaches for many types of mixture models each model required specific adaptations which are even for experts in contrast polymom provides unified principle for tackling new models that is as turnkey as computing gradients or em updates to use polymom figure one only needs to provide list of observation functions and derive their expected values expressed symbolically as polynomials in the parameters of the specified model fn polymom then estimates expectations of and outputs parameter estimates of the specified model since polymom works in an optimization framework we can easily incorporate constraints such as and parameter tying which is difficult to do in the tensor factorization paradigm in simulations we compared polymom with em and tensor factorization and found that polymom performs similarly or better section this paper assumes identifiability and infinite data with the exception of few specific models in section we defer issues of general identifiability and sample complexity to future work problem formulation the method of moments estimator in mixture model each data point rd is associated with latent component multinomial where are the mixing coefficients are the true model parameters for the th mixture component and rd is the random variable representing data we restrict our attention to mixtures where each component distribution comes from the same parameterized family for example for mixture of gaussians rd consists of the mean and covariance of component we define observation functions rd for and define fn to be the expectation of over single component with parameters which we assume is simple polynomial fn where qp the expectation of each observation function can then be expk pressed as mixture of polynomials of the true parameters pk fn the method of moments for mixture models seeks parameters that satisfy the moment conditions fn pt where can be estimated from the data xt the goal of this work is to find parameters satisfying moment conditions that can be written in the mixture of polynomial form we assume that the observations functions uniquely identify the model parameters up to permutation of the components example gaussian mixture consider of gaussians with parameters corresponding to the mean and variance respectively of the component figure steps and we choose the observation functions which have corresponding moment polynomials for example instantiating pk given and and data the polymom framework can recover the parameters note that the moments we use have been shown by to be sufficient for mixture of two gaussians 
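As a concrete illustration of the mixture-of-polynomials structure in the Gaussian example above, the following sketch (a numerical check, not the PolyMom estimator itself) compares the empirical moments of a two-component one-dimensional Gaussian mixture with the mixture of the per-component moment polynomials f_n; all mixing weights, means, and variances are hypothetical.

```python
# Numerical check that the data moments of a two-component 1-D Gaussian
# mixture are a mixture of the per-component moment polynomials f_n(mu, var).
import numpy as np

rng = np.random.default_rng(0)

# hypothetical mixture parameters: weights pi_k, means mu_k, variances var_k
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.5])
var = np.array([0.5, 1.2])

# moment polynomials of a single Gaussian component:
#   f1 = mu, f2 = mu^2 + var, f3 = mu^3 + 3 mu var, f4 = mu^4 + 6 mu^2 var + 3 var^2
def f(m, v):
    return np.array([m,
                     m**2 + v,
                     m**3 + 3 * m * v,
                     m**4 + 6 * m**2 * v + 3 * v**2])

# mixture of polynomials: E[x^n] = sum_k pi_k f_n(mu_k, var_k)
model_moments = sum(p * f(m, v) for p, m, v in zip(pi, mu, var))

# empirical moments from samples of the same mixture
z = rng.choice(2, size=200_000, p=pi)
x = rng.normal(mu[z], np.sqrt(var[z]))
empirical_moments = np.array([np.mean(x**n) for n in range(1, 5)])

print(model_moments)      # moments predicted by the mixture of polynomials
print(empirical_moments)  # agree up to sampling noise
```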
example mixture of linear regressions consider mixture of linear regressions where each data point is drawn from component by sampling from an unknown distribution independent of and setting wk where the parameters wk are the slope and noise variance for each component let us take our observation functions to be xy xy for which the moment polynomials are in example the coefficients in the polynomial fn are just constants determined by integration for the conditional model in example the coefficients depends on the data however we can not handle arbitrary data dependence see section for sufficient conditions and counterexamples solving the moment conditions our goal is to recover model parameters rp for each of the components of the mixture model that generated the data as well as their respective mixing proportions to start let ignore sampling noise and identifiability issues and suppose that we are given exact moment conditions as defined in each condition fn is polynomial of the parameters for equation is polynomial system of equations in the variables and rp it is natural to ask if standard polynomial solving methods can solve in the case where each fn is simple unfortunately the complexity of general polynomial equation solving is lower bounded by the number of solutions and each of the permutations of the mixture components corresponds to distinct solution of under this polynomial system representation while several methods can take advantage of symmetries in polynomial systems they still can not be adapted to tractably solve to the best of our knowledge the key idea of polymom is to exploit the mixture representation of the moment equations specifically let be particular mixture over the component parameters is probability measure then we can express the moment conditions in terms of fn where as result solving the original moment conditions is equivalent to solving the following feasibility problem over but where we deliberately forget the permutation of the components by using to represent the problem find rµ the set of probability measures over fn is sum of deltas if the true model parameters can be identified by the observed moments up to permutap tion then the measure solving problem is also unique polymom solves problem in two steps moment completion section we show that problem over the measure can be relaxed to an sdp over certain parameter moment matrix mr whose optimal solution pk is mr vr vr where vr is the vector of all monomials of degree at most solution extraction section we then take mr and construct series of generalized eigendecomposition problems whose eigenvalues yield remark from this point on distributions and moments refer to which is over parameters not over the data all the structure about the data is captured in the moment conditions moment completion the first step is to reformulate problem as an instance of the generalized moment problem gmp introduced by reference on the gmp algorithms for solving gmps and its various extensions is we start by observing that problem really only depends on the integrals of monomials under the measure for example if fn then we only need to know the integrals over the constituent monomials and in order to evaluate the integral over fn this suggests that we can optimize over the parameter moment sequence rather than rthe measure itself we say that the moment sequence has representing measure if for all but we do not assume that such exists the riesz linear functional ly is defined to be the linear map such that ly and ly for 
example ly if has representing measure then ly simply maps polynomials to integrals of against the key idea of the gmp approach is to convexify the problem by treating as free variables and then introduce constraints to guarantee that has representing measure first let vr be the vector of all monomials of degree no greater than then define the truncated moment matrix as mr ly vr vr where the linear functional ly is applied elementwise see example below if has representing measure then mr is simply positive integral over rank matrices vr vr with respect to so necessarily mr holds furthermore by theorem for to have representing measure it is sufficient that rank mr rank mr so problem is equivalent to find rn or equivalently find mr rank mr and rank mr unfortunately the rank constraints in problem are not tractable we use the following relaxation to obtain our final convex optimization problem minimize tr cmr mr where is chosen scaling matrix common choice is is corresponding to minimizing the nuclear norm of the moment matrix the usual convex relaxation for rank section discusses some other choices of example moment matrix for gaussian mixture recall that the parameters are the mean and variance of one dimensional gaussian let us choose the monomials step for figure shows the moment matrix when using each row and column of the moment matrix is labeled with monomial and entry is subscripted by the product of the monomials in row and column for we have which leads to the linear constraint for leading to the constraint related work readers familiar with the sum of squares and polynomial optimization literature will note that problem is similar to the sdp relaxation of polynomial optimization problem however in typical polynomial optimization we are only interested in solutions that actually satisfy the given constraints whereas here we are interested in solutions whose mixture satisfies constraints corresponding to the moment conditions within machine learning generalized pca has been formulated as moment problem and the hankel matrix basically the moment matrix has been used to learn weighted automata while similar tools are used the conceptual approach and the problems considered are different for example the moment matrix of this paper consists of unknown moments of the model parameters whereas exisiting works considered moments of the data that are always directly observable constraints constraints such as for parameters which represent probabilities or variances and parameter tying are quite common in graphical models and are not easily addressed with existing method of moments approaches the gmp framework allows us to incorporate some constraints using localizing matrices thus we can handle constraints during the estimation procedure rather than projecting back onto the constraint set as step this is necessary for models that only become identifiable by the observed moments after constraints are taken into account we describe this method and its learning implications in section guarantees and statistical efficiency in some circumstances in mixture models or the mixture of linear regressions the constraints fully determine the moment matrix we consider these cases in section and appendix while there are no general guarantee on problem the flat extension theorem tells us when the moment matrix corresponds to unique solution more discussions in appendix theorem flat extension theorem let be the solution to problem for particular if mr and rank mr rank mr then is the optimal solution to 
problem for rank mr and there exists unique supporting measure of mr recovering mr is linearly dependent on small perturbations of the input suggesting that the method has polynomial sample complexity for most models where the moments concentrate at polynomially rate finally in appendix we discuss few other important considerations like noise robustness making problem more statistical efficient along with some technical results on the moment completion problem and some open problems solution extraction having completed the parameter moment matrix mr section we now turn to the problem of extracting the model parameters the solution extraction method we present is based on ideas from solving multivariate polynomial systems where the solutions are eigenvalues of certain multiplication matrices the main advantage of the solution extraction view is that moments and structure in parameters are handled in the framework without modelspecific effort pk recall that the true moment matrix is mr where contains all the monomials up to degree we use for variables and for the true solutions to these variables note the boldface for example denotes the pth value of the th component which corresponds to solution for the variable typically and the elements of are arranged in degree ordering so that for we can also write mr as mr vpv where the canonical basis rs and diag at the high level we want to factorize mr to get however we can not simply mr since is not orthogonal to overcome this challenge we will exploit the internal structure of to construct several other matrices that share the same factors and perform simultaneous diagonalization specifically let be of with only the rows corresponding to monomials with exponents np typically are just the first monomials in now consider the exponent np which is in position and elsewhere corresponding to the monomial the key property of the canonical basis is that multiplying each column by monomial just performs shift to another set of rows dp where dp diag note that dp contains the pth parameter for all mixture components example shifting the canonical basis let and the true solutions be and to extract the solution for which are let and diag while the above reveals the structure of we don know however we recover its column space rs from the moment matrix mr for example with an svd thus we can relate and by linear transformation uq where is some unknown invertible matrix equation can now be rewritten as dp which is generalized eigenvalue problem where dp are the eigenvalues and are the eigenvectors crucially the eigenvalues dp diag give us solutions to our parameters note that for any choice of and we have generalized eigenvalue problems that share eigenvectors though their eigenvectors dp may differ corresponding eigenvalues and hence solutions can be obtained by solving simultaneous generalized eigenvalue problem by using random projections like algorithm of or more robust simutaneous diagonalization algorithms is short overview and is comprehensive treatment including numerical issues table applications of the polymom framework see appendix for more details mixture of linear regressions model observation functions is observed where rd is drawn from an unspecified distribution and and is known the parameters are wk rd for moment polynomials wp pp xp xq wp wq where the np is in position and elsewhere mixture of gaussians model observation functions rd is observed where is drawn from gaussian with diagonal covariance diag the parameters are ck for momentq 
polynomials cd multiview mixtures model observation functions with views is observed where rd and is drawn from an unspecified distribution with mean for the parameters are ijk xi xj xk where moment polynomials fijk we describe one approach to solve which is similar to algorithm of the idea is to take random weighted combinations of the equations and solve the resulting generalized eigendecomposition problems let rp be whose entries are drawn from pp then for each solve rq qdq the resulting eigenvalues can be collected in rp where dq note that pp by definition rq so we can simply invert to obtain although this simple approach does not have great numerical properties these eigenvalue problems are solvable if the eigenvalues are distinct for all which happens with probability as long as the parameters are different from each other in appendix we show how prior tensor decomposition algorithm from can be seen as solving equation for particular instantiation of applications let us now look at some applications of polymom table presents several models with corresponding observation functions and moment polynomials it is fairly straightforward to write down observation functions for given model the moment polynomials can then be derived by computing expectations under the this step can be compared to deriving gradients for em we implemented polymom for several mixture models in python code https we used cvxopt to handle the sdp and the random projections algorithm from to extract solutions in table we show the relative error maxk averaged over random models of each class in the rest of this section we will discuss guarantees on parameter recovery for each of these models ci and be the absolute value of the coefficient of the degree term of the univariate hermite polynomial for example the first few are gaussians spherical diagonal constrained others lin reg methd em tf poly em tf poly em tf poly table is the number of samples and the error metric is defined above methods em sklearn initialized with using random restarts tf tensor power method implemented in python poly polymom by solving problem models for mixture of gaussians we have spherical and diagonal describes the type of covariance matrix the mean parameters of constrained gaussians satisfies the best result is bolded tf only handles spherical variance but it was of interest to see what tf does if the data is drawn from mixture of gaussians with diagonal covariance these results are in strikeout mixture of linear regressions we can guarantee that polymom can recover parameters for this model when by showing that problem can be solved exactly observe that while no entry of the moment matrix is directly observed each observation gives us linear constraint on the entries of the moment matrix and when there are enough equations that this system admits an unique solution for chaganty et al were also able to recover parameters for this model under the same conditions by solving series of tensor recovery problems which ultimately requires the computation of the same moments described above in contrast the polymom framework makes the dependence on moments upfront and takes care of the in manner lastly the model can be extended to handle per component noise by including as parameter an extension that is not possible using the method in multiview mixtures we can guarantee parameter recovery when by proving that problem can be solved exactly see section mixture of gaussians in this case however the moment conditions are and we can not guarantee 
recovery of the true parameters however polymom is guaranteed to recover mixture of gaussians that match the moments we can also apply constraints to the model consider the case of mixture where the mean parameters for all components lies on parabola in this case we just need to add constraints to problem for all up to degree by incorporating these constraints at estimation time we can possibly identify the model parameters with less moments see section for more details conclusion we presented an unifying framework for learning many types of mixture models via the method of moments for example for the mixture of gaussians we can apply the same algorithm to both mixtures in needing moments and mixtures in high dimensions where lowerorder moments suffice the generalized moment problem and its semidefinite relaxation hierarchies is what gives us the generality although we rely heavily on the ability of nuclear norm minimization to recover the underlying rank as result while we always obtain parameters satisfying the moment conditions there are no formal guarantees on consistent estimation the second main tool is solution extraction which characterizes more general structure of mixture models compared the tensor structure observed by this view draws connections to the literature on solving polynomial systems where many techniques might be useful finally through the connections we ve drawn it is our hope that polymom can make the method of moments as turnkey as em on more models as well as improve the statistical efficiency of method of moments procedures acknowledgments this work was supported by microsoft faculty research fellowship to the third author and nserc fellowship for the first author references titterington smith and makov statistical analysis of finite mixture distributions volume wiley new york mclachlan and peel finite mixture models john wiley sons pearson contributions to the mathematical theory of evolution philosophical transactions of the royal society of london anandkumar hsu and kakade method of moments for mixture models and hidden markov models in conference on learning theory colt anandkumar foster hsu kakade and liu two svds suffice spectral decompositions for probabilistic topic modeling and latent dirichlet allocation in advances in neural information processing systems nips anandkumar ge hsu kakade and telgarsky tensor decompositions for learning latent variable models arxiv hsu kakade and liang identifiability and unmixing of latent parse trees in advances in neural information processing systems nips hsu and kakade learning mixtures of spherical gaussians moment methods and spectral decompositions in innovations in theoretical computer science itcs chaganty and liang spectral experts for estimating mixtures of linear regressions in international conference on machine learning icml kalai moitra and valiant efficiently learning mixtures of two gaussians in symposium on theory of computing stoc pages hardt and price sharp bounds for learning mixture of two gaussians arxiv preprint ge huang and kakade learning mixtures of gaussians in high dimensions arxiv preprint balle carreras luque and quattoni spectral learning of weighted automata forwardbackward perspective machine learning lasserre moments positive polynomials and their applications imperial college press lasserre semidefinite programming approach to the generalized problem of moments mathematical programming stetter multivariate polynomial equations as matrix eigenproblems wssia and stetter multivariate 
polynomial equations with multiple zeros solved by matrix eigenproblems numerische mathematik sturmfels solving systems of polynomial equations american mathematical society henrion and lasserre detecting global optimality and extracting solutions in gloptipoly in positive polynomials in control pages anandkumar ge hsu and kakade tensor spectral approach to learning mixed membership community models in conference on learning theory colt pages anandkumar ge and janzamin provable learning of overcomplete latent variable models and unsupervised settings arxiv preprint viele and tong modeling with mixtures of linear regressions statistics and computing sturmfels algorithms in invariant theory springer science business media corless gatermann and kotsireas using symmetries in the eigenvalue method for polynomial systems journal of symbolic computation curto and fialkow solution of the truncated complex moment problem for flat data volume american mathematical society lasserre global optimization with polynomials and the problem of moments siam journal on optimization laurent sums of squares moment matrices and optimization over polynomials in emerging applications of algebraic geometry pages parrilo and sturmfels minimizing polynomial functions algorithmic and quantitative real algebraic geometry dimacs series in discrete mathematics and theoretical computer science parrilo semidefinite programming relaxations for semialgebraic problems mathematical programming ozay sznaier lagoa and camps gpca with denoising convex approach in computer vision and pattern recognition cvpr pages 
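The solution-extraction step described above reduces, in the simplest setting of a single scalar parameter per component, to a generalized eigenvalue problem between two shifted blocks of the completed parameter moment matrix. The sketch below illustrates that special case with hypothetical weights and parameter values; the multi-parameter case combines the per-coordinate eigenvalue problems using random weighted combinations, as discussed in the text.

```python
# Solution extraction in the scalar-parameter case: the completed moment
# matrix is a Hankel matrix of parameter moments y_j = sum_k pi_k theta_k^j,
# and the component parameters are the generalized eigenvalues of two
# "shifted" sub-blocks.  Weights and parameters below are hypothetical.
import numpy as np
from scipy.linalg import eigvals

pi = np.array([0.4, 0.6])       # mixing weights
theta = np.array([-1.0, 2.5])   # per-component parameter values

# parameter moments y_j = sum_k pi_k theta_k^j for j = 0..3
y = np.array([np.sum(pi * theta**j) for j in range(4)])

# moment matrix indexed by the monomials (1, theta, theta^2); keep the rows
# indexed by (1, theta) and form two column blocks related by one
# multiplication by theta:
A0 = np.array([[y[0], y[1]],
               [y[1], y[2]]])   # L_y(v v^T) with v = (1, theta)
A1 = np.array([[y[1], y[2]],
               [y[2], y[3]]])   # same block shifted by theta

# generalized eigenvalues of (A1, A0) recover the component parameters
vals = eigvals(A1, A0)
print(np.sort(vals.real))       # approximately [-1.0, 2.5]
```

Here A0 = V diag(pi) V^T and A1 = V diag(pi) diag(theta) V^T for the Vandermonde-type matrix V of monomials, so the generalized eigenvalues are exactly the component parameters whenever they are distinct, mirroring the simultaneous-diagonalization argument used above.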
On the global linear convergence of Frank-Wolfe optimization variants. Simon Lacoste-Julien (INRIA SIERRA project-team, École Normale Supérieure, Paris, France) and Martin Jaggi (Dept. of Computer Science, ETH Zürich, Switzerland).

Abstract. The Frank-Wolfe (FW) optimization algorithm has lately regained popularity thanks in particular to its ability to nicely handle the structured constraints appearing in machine learning applications. However, its convergence rate is known to be slow (sublinear) when the solution lies at the boundary. A simple, less-known fix is to add the possibility to take "away steps" during optimization, an operation that importantly does not require a feasibility oracle. In this paper, we highlight and clarify several variants of the Frank-Wolfe optimization algorithm that have been successfully applied in practice: away-steps FW, pairwise FW, fully-corrective FW, and Wolfe's minimum norm point algorithm, and prove for the first time that they all enjoy global linear convergence under a weaker condition than strong convexity of the objective. The constant in the convergence rate has an elegant interpretation as the product of the (classical) condition number of the function with a novel geometric quantity that plays the role of a condition number of the constraint set. We provide pointers to where these algorithms have made a difference in practice, in particular with the flow polytope, the marginal polytope, and the base polytope for submodular optimization.

The Frank-Wolfe algorithm, also known as conditional gradient, is one of the earliest existing methods for constrained convex optimization, and has seen an impressive revival recently due to its nice properties compared to projected or proximal gradient methods, in particular for sparse optimization and machine learning applications. On the other hand, the classical projected gradient and proximal methods have been known to exhibit a very nice adaptive acceleration property, namely that the convergence rate becomes linear for a strongly convex objective, i.e. that the optimization error of the same algorithm after t iterations will decrease geometrically instead of at the usual sublinear rate for general convex objective functions. It has become an active research topic recently whether such an acceleration is also possible for Frank-Wolfe-type methods.

Contributions. We clarify several variants of the Frank-Wolfe algorithm and show that they all converge linearly for any strongly convex function optimized over a polytope domain, with a constant bounded away from zero that only depends on the geometry of the polytope. Our analysis does not depend on the location of the true optimum with respect to the domain, which was a disadvantage of earlier existing results and of the newer lines of work that rely on Robinson's condition. Our analysis yields a weaker sufficient condition than Robinson's condition; in particular, we can have linear convergence even in some cases where the function has more than one global minimum and is not globally strongly convex. The constant also naturally separates as the product of the condition number of the function with a novel notion of condition number of a polytope, which might have applications in complexity theory.

Related work. For the classical Frank-Wolfe algorithm, a linear rate was shown for the special case of quadratic objectives when the optimum is in the strict interior of the domain, a result already subsumed by the more general results discussed below. The early work on strongly convex constraint sets showed linear convergence under the strong requirement that the gradient norm is not too small (see for a discussion). The variant of the algorithm that can also remove weight from "bad" atoms in the current active set was proposed in and later also analyzed
in the precise method is stated below in algorithm showed local linear convergence rate on polytopes but the constant unfortunately depends on the distance between the solution and its relative boundary quantity that can be arbitrarily small more recently have obtained linear convergence results in the case that the optimum solution satisfies robinson condition in different recent line of work have studied variation of fw that repeatedly moves mass from the worst vertices to the standard fw vertex until specific condition is satisfied yielding linear rate on strongly convex functions their algorithm requires the knowledge of several constants though and moreover is not adaptive to the scenario unlike the algorithm with away steps and none of these previous works was shown to be affine invariant and most require additional knowledge about problem specific parameters setup we consider general constrained convex optimization problems of the form min conv with only access to lmoa arg minhr xi where rd is finite set of vectors that we call we assume that the function is µstrongly convex with continuous gradient over we also consider weaker conditions than strong convexity for in section as is finite is convex and bounded polytope the methods that we consider in this paper only require access to linear minimization oracle lmoa associated with the domain through generating set of atoms this oracle is defined as to return minimizer of linear subproblem over conv for any given direction rd examples optimization problems of the form appear widely in machine learning and signal processing applications the set of atoms can represent combinatorial objects of arbitrary type efficient linear minimization oracles often exist in the form of dynamic programs or other combinatorial optimization approaches as an example from tracking in computer vision could be the set of integer flows on graph where lmoa can be efficiently implemented by minimum cost network flow algorithm in this case can also be described with polynomial number of linear inequalities but in other examples might not have polynomial description in terms of linear inequalities and testing membership in might be much more expensive than running the linear oracle this is the case when optimizing over the base polytope an object appearing in submodular function optimization there the lmoa oracle is simple greedy algorithm another example is when represents the possible consistent value assignments on cliques of markov random field mrf is the marginal polytope where testing membership is in general though efficient linear oracles exist for some special cases optimization over the marginal polytope appears for example in structured svm learning and variational inference the original algorithm the fw optimization algorithm also known as conditional gradient is particularly suited for the setup where is only accessed through the linear minimization oracle it works as follows at current iterate the algorithm finds feasible search atom st to move towards by minimizing the linearization of the objective function over line in algorithm this is where the linear minimization oracle lmoa is used the next iterate is then obtained by doing on between and st line in algorithm one reason for the recent increased popularity of algorithms is the sparsity of their iterates in iteration of the algorithm the iterate can be represented as sparse convex combination of at most atoms of the domain which we write as αv we write for the active set containing the 
previously discovered search atoms sr for that have weight αsr in the expansion potentially also including the starting point while tracking the active set is not necessary for the original fw algorithm the improved variants of fw that we discuss will require that is maintained phenomenon when the optimal solution lies at the boundary of the gence rate of the iterates is slow sublinear for being an optimal solution this is because the iterates of the classical fw algorithm start to the atoms do not have to be extreme points vertices of all our convergence results can be carefully extended to approximate linear minimization oracles with multiplicative approximation guarantees we state them for exact oracles in this paper for simplicity vt vt st st st figure left the fw algorithm when the solution lies on the boundary middle adding the possibility of an away step attenuates this problem right as an alternative pairwise fw step between the vertices defining the face containing the solution see left of figure in fact the rate is tight for large class functions canon and cullum wolfe showed roughly that for any when lies on face of with some additional regularity assumptions note that this lower bound is different than the one presented in lemma which holds for all algorithms but assumes high dimensionality improved variants of the algorithm algorithm algorithm afw let and so that αv for and otherwise for do the fw direction let st lmoa and dtfw st the away direction let vt arg max and dta vt if dtfw then return fw gap is small enough so return if dtfw dta then dt dtfw and γmax choose the fw direction else dt dta and γmax αvt αvt choose away direction maximum feasible end if γt arg min γdt gtfw γmax update γt dt and accordingly for the weights see text update αv end for algorithm pairwise algorithm pfw as in algorithm except replacing lines to by dt dtpfw st and γmax αvt to address the problem of fw wolfe proposed to add the possibility to move away from an active atom in see middle of figure this simple modification is sufficient to make the algorithm linearly convergent for strongly convex functions we describe the variant of in algorithm the away direction dta is defined in line by finding the by atom vt in that maximizes the potential of descent given gt vt note that this search is over the typically small active set and is fundamentally easier than the linear oracle lmoa the maximum γmax as defined on line ensures that the new iterate γdta stays in in fact this guarantees that the convex representation is maintained and we stay inside conv when is simplex then the barycentric coordinates are unique and γmax dta truly lies on the boundary of on the other hand if dim for the cube then it could hypothetically be possible to have bigger than γmax which is still feasible computing the true maximum feasible would require the ability to know when we cross the boundary of along specific line which is not possible for general using the conservative maximum of line ensures that we the original algorithm presented in was not convergent this was corrected by and marcotte assuming tractable representation of with linear inequalities and called it the modified mfw algorithm our description in algorithm extends it to the more general setup of do not need this more powerful oracle this is why algorithm requires to maintain unlike standard fw finally as in classical fw the fw gap gtfw is an upper bound on the unknown suboptimality and can be used as stopping criterion gtfw dtfw by convexity if γt γmax then 
we call this step drop step as it fully removes the atom vt from the currently active set of atoms by settings its weight to zero the weight updates for lines and are of the following form for fw step we have st if γt otherwise st also we have αst αst and αv αv for st for an away step we have vt if γt γmax drop step otherwise also we have αvt γt αvt γt and αv γt αv for vt pairwise the next variant that we present is inspired by an early algorithm by mitchell et al called the mdm algorithm originally invented for the polytope distance problem here the idea is to only move weight mass between two atoms in each step more precisely the generalized method as presented in algorithm moves weight from the away atom vt to the fw atom st and keeps all other weights we call such swap of mass between the two atoms pairwise fw step αvt αvt and αst αst for some γmax αvt in contrast classical fw shrinks all active weights at every iteration the pairwise fw direction will also be central to our proof technique to provide the first global linear convergence rate for fw as well as the variant and wolfe algorithm as we will see in section the rate guarantee for the pairwise fw variant is more loose than for the other variants because we can not provide satisfactory bound on the number of the problematic swap steps defined just before theorem nevertheless the algorithm seems to perform quite well in practice often outperforming fw especially in the important case of sparse solutions that is if the optimal solution lies on face of and thus one wants to keep the active set small the pairwise fw step is arguably more efficient at pruning the coordinates in in contrast to the away step which moves the mass back uniformly onto all other active elements and might require more corrections later the pairwise fw step only moves the mass onto the good fw atom st slightly different version than algorithm was also proposed by et al though their convergence proofs were incomplete see appendix the algorithm is related to classical working set algorithms such as the smo algorithm used to train svms we refer to for an empirical comparison for svms as well as their section for more related work see also appendix for link between pairwise fw and and wolfe point algorithm when the linear oracle is expensive it might be worthwhile to do more work to optimize over the active set in between each call to the linear oracle rather than just performing an away or pairwise step we give in algorithm the fcfw variant that maintains correction polytope defined by set of atoms potentially larger than the active set rather than obtaining the next iterate by is obtained by over conv depending on how the correction is implemented and how the correction atoms are maintained several variants can be obtained these variants are known under many names such as the extended fw method by holloway or the simplicial decomposition method wolfe point mnp algorithm for polytope distance problems is often confused with fcfw for quadratic objectives the major difference is that standard fcfw optimizes over conv whereas mnp implements the correction as sequence of affine projections that potentially yield different update but can be computed more efficiently in several practical applications we describe precisely in appendix generalization of the mnp algorithm as specific case of the correction subroutine from step of the generic algorithm the original convergence analysis of the fcfw algorithm and also mnp algorithm only showed that they were finitely 
convergent with bound on the number of iterations in terms of the cardinality of unfortunately an exponential number in general holloway also argued that fcfw had an asymptotic linear convergence based on the flawed argument of wolfe as far as we know our work is the first to provide global linear convergence rates for fcfw and mnp for algorithm with approximate correction fcfw input set of atoms active set starting point αv stopping criterion let optionally bigger could be passed as argument for warm start for do let st lmoa the fw atom let dtfw st and gtfw dtfw fw gap if gtfw then return correction st approximate correction step end for algorithm approximate correction correction st return with the following properties is the active set for and min st make at least as much progress as fw step max the away gap is small enough general strongly convex functions moreover the proof of convergence for fcfw does not require an exact solution to the correction step instead we show that the weaker properties stated for the approximate correction procedure in algorithm are sufficient for global linear convergence rate this correction could be implemented using fw as done for example in global linear convergence analysis intuition for the convergence proofs we first give the general intuition for the linear convergence proof of the different fw variants starting from the work of and marcotte we assume that the objective function is smooth over compact set its gradient is lipschitz continuous with constant also let diam let dt be the direction in which the is executed by the algorithm line in algorithm by the standard descent lemma see in we have γdt dt lkdt γmax we let rt and let ht be the suboptimality error supposing for now that γmax hrt dt lkdt we can set to minimize the rhs of subtract on both sides and to get lower bound on the progress hrt dt ht hrt dˆt where we use the hat notation to denote normalized vectors dˆt dt let et be the error vector by convexity of we have γet et µket the rhs is lower bounded by its minimum as function of unconstrained achieved using hrt et µket we are then free to use any value of on the lhs and maintain valid bound in particular we use to obtain again we get hrt hrt dˆt ht and combining with we obtain ht ht hrt the inequality is fairly general and valid for any method in direction dt to get linear convergence rate we need to lower bound by positive constant the term in front of ht on the rhs which depends on the angle between the update direction dt and the negative gradient rt if we assume that the solution lies in the relative interior of with distance of at least from the boundary then hrt dt δkrt for the fw direction dtfw and by combining with kdt we get linear rate with constant this was the result from on the other hand if lies on the boundary then dˆt gets arbitrary close to zero for standard fw the phenomenon and the convergence is sublinear proof sketch for afw the key insight to prove the global linear convergence for afw is to relate hrt dt with the pairwise fw direction dtpfw st vt by the way the direction dt is chosen on lines to of algorithm we have fw pfw dt hrt dtfw hrt da da hrt dt hrt dt hrt dtpfw we thus have hrt dt now the crucial property of the pairwise fw direction is that for any potential negative gradient direction rt the worst case inner product dtpfw can be lower bounded away from zero by quantity depending only on st rt the geometry of unless we are at the optimum we call this quantity the pyramidal width of the figure on the right 
shows the six possible dpfw pairwise fw directions dtpfw for triangle domain depending on which colored area the rt direction falls into we will see that the pyramidal width is related to the smallest width of pyramids that we can construct vt from in specific way related to the choice of the away and towards atoms vt and st see and our main theorem in section fw this gives the main argument for the linear convergence of afw for steps where γmax when γmax is too small afw will perform drop step as the will truncate the to γt γmax we can not guarantee sufficient progress in this case but the drop step decreases the active set size by one and thus they can not happen too often not more than half the time these are the main elements for the global linear convergence proof for afw the rest is to carefully consider various boundary cases we can the same techniques to prove the convergence for pairwise fw though unfortunately the latter also has the possibility of problematic swap steps while their number can be bounded so far we only found the extremely loose bound quoted in theorem proof sketch for fcfw for fcfw by line of the correction algorithm the away gap satisfies gta at the beginning of new iteration supposing that the algorithm does not exit at line of algorithm we have gtfw and therefore dtfw hrt dtpfw using similar argument as in finally by line of algorithm the correction is guaranteed to make at least as much progress as in direction dtfw and so the progress bound applies also to fcfw convergence results we now give the global linear convergence rates for the four variants of the fw algorithm awaysteps fw afw alg pairwise fw pfw alg fw fcfw alg with approximate correction alg and wolfe point algorithm alg with as alg in appendix for the afw mnp and pfw algorithms we call drop step when the active set shrinks for the pfw algorithm we also have the possibility of swap step where γt γmax but the mass was fully swapped from the away atom to the fw atom nice property of fcfw is that it does not have any drop step it executes both fw steps and away steps simultaneously while guaranteeing enough progress at every iteration theorem suppose that has and is convex over conv let diam and width as defined by then the suboptimality ht of the iterates of all the four variants of the fw algorithm decreases geometrically at each step that is not drop step nor swap step when γt γmax called good step that is ht where let be the number of good steps up to iteration we have for fcfw for mnp and afw and for pfw because of the swap steps this yields global linear convergence rate of ht exp for all variants if general convex then ht instead see theorem in appendix for an affine invariant version and proof note that to our knowledge none of the existing linear convergence results showed that the duality gap was also linearly convergent the result for the gap follows directly from the simple manipulation of putting the fw gap to the lhs and optimizing the rhs for theorem suppose that has gradient over with diam then the fw gap gtfw for any algorithm is upper bounded by the primal error ht as follows gtfw ht lm when ht lm gtfw otherwise for afw and pfw we actually require that is over the larger domain pyramidal width we now describe the claimed lower bound on the angle between the negative gradient and the pairwise fw direction which depends only on the geometric properties of according to our argument about the progress bound and the pfw gap our goal is to find lower bound on hrt dtpfw first note that hrt 
dtpfw hrt max hrt vi where is sible active set for this looks like the directional width of pyramid with base and summit st to be conservative we consider the worst case possible active set for this is what we will call the pyramid directional width dirw rt we start with the following definitions directional width the directional width of set with respect to direction is defined as dirw maxs krk the width of is the minimum directional width over all possible directions in its affine hull pyramidal directional width we define the pyramidal directional width of set with respect to direction and base point to be dirw min dirw min max krk where sx such that is proper convex combination of all the elements in and arg hr vi is the fw atom used as summit pyramidal width to define the pyramidal width of set we take the minimum over the cone of possible feasible directions in order to avoid the problem of zero width direction is feasible for from if it points inwards conv cone we define the pyramidal width of set to be the smallest pyramidal width of all its faces width min dirw conv theorem let conv be suboptimal point and be an active set for let be an optimal point and corresponding error direction xk and negative gradient and so hr let be the pairwise fw direction obtained over and with negative gradient then hr di width hr properties of pyramidal width and consequences examples of values the pyramidal width of set is lower bounded by the minimal width over all subsets of atoms and thus is strictly greater than zero if the number of atoms is finite on the other hand this lower bound is often too loose to be useful as in particular vertex subsets of the unit cube in dimension can have exponentially small width see in on the other hand as we show here the pyramidal width of the unit cube is actually justifying why we kept the tighter but more involved definition see appendix for the proof lemma the pyramidal width of the unit cube in rd is for the probability simplex with vertices the pyramidal width is actually the same as its width which is when is even and when is odd see appendix in contrast the pyramidal width of an infinite set can be zero for example for curved domain the set of active atoms can contain vertices forming very narrow pyramid yielding zero width in the limit condition number of set the inverse of the rate constant appearing in theorem is the product of two terms is the standard condition number of the objective function appearing in the rates of gradient methods in convex optimization the second quantity diameter over pyramidal width can be interpreted as condition number of the domain or its eccentricity the more eccentric the constraint set large diameter compared to its pyramidal width the slower the convergence the best condition number of function is when its level sets are spherical the analog in term of the constraint sets is actually the regular simplex which has the maximum ratio amongst all simplices see corollary in its eccentricity is at most in contrast the eccentricity of the unit cube is which is much worse by proper convex combination we mean that all coefficients are in the convex combination we conjecture that the pyramidal width of set of vertices extrema of their convex hull is when another vertex is added assuming that all previous points remain vertices for example the unit cube can be obtained by iteratively adding the regular probability simplex and the pyramidal width thereby decreases from to this property could provide lower bounds for the pyramidal 
width of more complicated polytopes such as for the marginal polytope as it can be obtained by removing vertices from the unit cube complexity lower bounds combining the convergence theorem and the condition number of the unit simplex we get complexity of log to reach when optimizing strongly convex function over the unit simplex here the linear dependence on should not come as surprise in view of the known lower bound of for for type methods applications to submodular minimization see appendix for consequence of our linear rate for the popular mnp algorithm for submodular function optimization over the base polytope convex generalization building on the work of beck and shtern and wang and lin we can generalize our global linear convergence results for all variants for the more general case where ax hb xi for rd and where is µg convex and continuously differentiable over am we note that for general matrix is convex but not necessarily strongly convex in this case the linear convergence still holds but with the constant appearing in the rate of theorem replaced with the generalized constant appearing in lemma in appendix illustrative experiments fw awayfw pairfw fw we illustrate the performance of the presented algorithm ants in two numerical experiments shown in figure the away fw first example is constrained lasso problem least squares regression that is kax bk pa irf with scaled we used random sian matrix and noisy measurement ax with being sparse vector with entries and of iteration additive noise for the the linear minimization oracle lmoa just selects the column of of best inner product with the residual vector the second application comes from video the approach used by is formulated as fw quadratic program qp over flow polytope the convex hull of paths in network in this application the linear minimization awayfw oracle is equivalent to finding shortest path in the network pairf which can be done easily by dynamic programming for the lmoa we the code provided by and their included iteration aeroplane dataset resulting in qp over variables in both fw experiments we see that the modified fw variants figure duality gap gt vs iteraand pairwise outperform the original fw algorithm and tions on the lasso problem top and hibit linear convergence in addition the constant in the video bottom code is available from the authors website vergence rate of theorem can also be empirically shown to be fairly tight for afw and pfw by running them on an increasingly obtuse triangle see appendix gap fw awayfw pairfw gap discussion building on preliminary version of our work beck and shtern also proved linear rate for fw but with simpler lower bound for the lhs of using linear duality arguments however their lower bound see lemma in is looser they get constant for the eccentricity of the regular simplex instead of the tighter that we proved finally the recently proposed generic scheme for accelerating optimization methods in the sense of nesterov from applies directly to the fw variants given their global linear convergence rate that we proved this gives for the first time methods that only use linear oracles and obtain the rate for smooth convex functions or the accelerated constant in the linear rate for strongly convex functions given that the constants also depend on the dimensionality it remains an open question whether this acceleration is practically useful acknowledgements we thank alayrac hazan hubard osokin and marcotte for helpful discussions this work was partially supported by the joint 
center and google research award references sun and todd linear convergence of modified algorithm for computing enclosing ellipsoids optimization methods and software alexander the width and diameter of simplex geometriae dedicata bach learning with submodular functions convex optimization perspective foundations and trends in machine learning beck and shtern linearly convergent conditional gradient for convex functions beck and teboulle conditional gradient method with linear rate of convergence for solving convex linear systems mathematical methods of operations research zor canon and cullum tight upper bound on the rate of convergence of algorithm siam journal on control chari et al on pairwise costs for network flow tracking in cvpr dunn rates of convergence for conditional gradient algorithms near singular and nonsingular extremals siam journal on control and optimization frank and wolfe an algorithm for quadratic programming naval research logistics quarterly garber and hazan linearly convergent conditional gradient algorithm with applications to online and stochastic optimization garber and hazan faster rates for the method over sets in icml and marcotte some comments on wolfe away step mathematical programming hearn lawphongpanich and ventura restricted simplicial decomposition computation and extensions in computation mathematical programming volume pages springer holloway an extension of the frank and wolfe method of feasible directions mathematical programming jaggi revisiting sparse convex optimization in icml joulin tang and efficient image and video with algorithm in eccv kolmogorov and zabin what energy functions can be minimized via graph cuts ieee transactions on pattern analysis and machine intelligence krishnan and sontag barrier for marginal inference in nips kumar and yildirim linearly convergent algorithm for support vector classification with core set result informs journal on computing and jaggi an affine invariant linear convergence analysis for algorithms jaggi schmidt and pletscher optimization for structural svms in icml lan the complexity of convex programming under linear optimization oracle levitin and polyak constrained minimization methods ussr computational mathematics and mathematical physics lin mairal and harchaoui universal catalyst for optimization in nips mitchell demyanov and malozemov finding the point of polyhedron closest to the origin siam journal on control frandi sartori and allende novel algorithm analysis and applications to svm training information sciences nesterov introductory lectures on convex optimization kluwer academic publishers pena rodriguez and soheili on the von neumann and algorithms with away steps platt fast training of support vector machines using sequential minimal optimization in advances in kernel methods support vector learning pages robinson generalized equations and their solutions part ii applications to nonlinear programming springer von hohenbalken simplicial decomposition in nonlinear programming algorithms mathematical programming wainwright and jordan graphical models exponential families and variational inference foundations and trends in machine learning wang and lin iteration complexity of feasible descent methods for convex optimization journal of machine learning research wolfe convergence theory in nonlinear programming in integer and nonlinear programming wolfe finding the nearest point in polytope mathematical programming ziegler lectures on arxiv 
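As a concrete illustration of the away-step and pairwise Frank-Wolfe variants discussed in the paper above (Algorithms 1 and 2), here is a minimal numpy sketch. It makes the simplifying assumption that the domain is the probability simplex, whose atoms are the coordinate vectors, so the linear minimization oracle reduces to an argmin over the gradient; the paper's lasso experiment instead uses a scaled l1 ball, and all function and variable names below are our own rather than the authors' code.

```python
import numpy as np

def fw_with_away_steps(A, b, n_iters=2000, pairwise=False, tol=1e-10):
    """Minimize f(x) = 0.5 * ||A x - b||^2 over the probability simplex with
    away-step (default) or pairwise Frank-Wolfe.  Atoms are the coordinate
    vectors e_i, so the linear minimization oracle is an argmin over the gradient.
    On the simplex x coincides with the barycentric weights alpha; we keep both
    to mirror the structure of the general algorithm."""
    d = A.shape[1]
    alpha = np.zeros(d)
    alpha[0] = 1.0                                # start at a vertex
    x = alpha.copy()                              # x = sum_i alpha_i e_i
    gaps = []
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)
        s = int(np.argmin(grad))                  # FW atom from the LMO
        active = np.flatnonzero(alpha > 0)
        v = int(active[np.argmax(grad[active])])  # away atom from the active set
        d_fw = -x.copy()
        d_fw[s] += 1.0                            # d_fw = e_s - x
        gap_fw = -grad @ d_fw                     # FW (duality) gap, stopping criterion
        gaps.append(gap_fw)
        if gap_fw < tol:
            break
        if pairwise:
            direction = np.zeros(d)
            direction[s] += 1.0
            direction[v] -= 1.0                   # pairwise direction e_s - e_v
            gamma_max, step = alpha[v], "pairwise"
        else:
            d_away = x.copy()
            d_away[v] -= 1.0                      # away direction x - e_v
            if gap_fw >= -grad @ d_away:          # pick the better descent direction
                direction, gamma_max, step = d_fw, 1.0, "fw"
            else:
                direction, step = d_away, "away"
                gamma_max = alpha[v] / max(1.0 - alpha[v], 1e-16)
        # exact line search for this quadratic, clipped to the feasible range
        Ad = A @ direction
        gamma = float(np.clip(-(grad @ direction) / max(Ad @ Ad, 1e-16), 0.0, gamma_max))
        x = x + gamma * direction
        if step == "pairwise":                    # move mass from away atom to FW atom
            alpha[s] += gamma
            alpha[v] -= gamma
        elif step == "fw":                        # shrink all weights, grow the FW atom
            alpha *= (1.0 - gamma)
            alpha[s] += gamma
        else:                                     # away step; gamma == gamma_max is a drop step
            alpha *= (1.0 + gamma)
            alpha[v] -= gamma
    return x, gaps

# toy usage: a least-squares objective over the simplex
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
b = A @ (np.ones(50) / 50) + 0.01 * rng.standard_normal(100)
x_away, gaps_away = fw_with_away_steps(A, b)
x_pair, gaps_pair = fw_with_away_steps(A, b, pairwise=True)
```

Plotting the recorded FW gaps against the iteration count gives the empirical linear convergence behaviour described in the experiments section, in contrast with the sublinear decrease of the plain FW gap when the solution lies on a face.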
deep knowledge tracing chris jonathan jonathan surya mehran leonidas jascha stanford university khan academy google piech jbassen jascha abstract knowledge machine models the knowledge of student as they interact with well established problem in computer supported education though effectively modeling student knowledge would have high educational impact the task has many inherent challenges in this paper we explore the utility of using recurrent neural networks rnns to model student learning the rnn family of models have important advantages over previous methods in that they do not require the explicit encoding of human domain knowledge and can capture more complex representations of student knowledge using neural networks results in substantial improvements in prediction performance on range of knowledge tracing datasets moreover the learned model can be used for intelligent curriculum design and allows straightforward interpretation and discovery of structure in student tasks these results suggest promising new line of research for knowledge tracing and an exemplary application task for rnns introduction education promises open access to world class instruction and reduction in the growing cost of learning we can develop on this promise by building models of large scale student trace data on popular educational platforms such as khan academy coursera and edx knowledge tracing is the task of modelling student knowledge over time so that we can accurately predict how students will perform on future interactions improvement on this task means that resources can be suggested to students based on their individual needs and content which is predicted to be too easy or too hard can be skipped or delayed already intelligent tutoring systems that attempt to tailor content show promising results human tutoring can produce learning gains for the average student on the order of two standard deviations and machine learning solutions could provide these benefits of high quality personalized teaching to anyone in the world for free the knowledge tracing problem is inherently difficult as human learning is grounded in the complexity of both the human brain and human knowledge thus the use of rich models seems appropriate however most previous work in education relies on first order markov models with restricted functional forms in this paper we present formulation that we call deep knowledge tracing dkt in which we apply flexible recurrent neural networks that are deep in time to the task of knowledge tracing this family of models represents latent knowledge state along with its temporal dynamics using large vectors of artificial neurons and allows the latent variable representation of student knowledge to be learned from data rather than the main contributions of this work are novel way to encode student interactions as input to recurrent neural network gain in auc over the best previous result on knowledge tracing benchmark demonstration that our knowledge tracing model does not need expert annotations discovery of exercise influence and generation of improved exercise curricula correct incorrect line graph intuition slope of line solving for solving for graphing linear equations square roots exercise index predicted probability exercise attempted figure single student and her predicted responses as she solves khan academy exercises she seems to master finding and intercepts and then has trouble transferring knowledge to graphing linear equations the task of knowledge tracing can be formalized as 
given observations of interactions xt taken by student on particular learning task predict aspects of their next interaction in the most ubiquitous instantiation of knowledge tracing interactions take the form of tuple of xt qt at that combines tag for the exercise being answered qt with whether or not the exercise was answered correctly at when making prediction the model is provided the tag of the exercise being answered qt and must predict whether the student will get the exercise correct at figure shows visualization of tracing knowledge for single student learning grade math the student first answers two square root problems correctly and then gets single exercise incorrect in the subsequent interactions the student solves series of and graphing exercises each time the student answers an exercise we can make prediction as to whether or not she would answer an exercise of each type correctly on her next interaction in the visualization we only show predictions over time for relevant subset of exercise types in most previous work exercise tags denote the single concept that human experts assign to an exercise our model can leverage but does not require such expert annotation we demonstrate that in the absence of annotations the model can autonomously learn content substructure related work the task of modelling and predicting how human beings learn is informed by fields as diverse as education psychology neuroscience and cognitive science from social science perspective learning has been understood to be influenced by complex macro level interactions including affect motivation and even identity the challenges present are further exposed on the micro level learning is fundamentally reflection of human cognition which is highly complex process two themes in the field of cognitive science that are particularly relevant are theories that the human mind and its learning process are recursive and driven by analogy the problem of knowledge tracing was first posed and has been heavily studied within the intelligent tutoring community in the face of aforementioned challenges it has been primary goal to build models which may not capture all cognitive processes but are nevertheless useful bayesian knowledge tracing bayesian knowledge tracing bkt is the most popular approach for building temporal models of student learning bkt models learner latent knowledge state as set of binary variables each of which represents understanding or of single concept hidden markov model hmm is used to update the probabilities across each of these binary variables as learner answers exercises of given concept correctly or incorrectly the original model formulation assumed that once skill is learned it is never forgotten recent extensions to this model include contextualization of guessing and slipping estimates estimating prior knowledge for individual learners and estimating problem difficulty with or without such extensions knowledge tracing suffers from several difficulties first the binary representation of student understanding may be unrealistic second the meaning of the hidden variables and their mappings onto exercises can be ambiguous rarely meeting the model expectation of single concept per exercise several techniques have been developed to create and refine concept categories and mappings the current gold standard cognitive task analysis is an arduous and iterative process where domain experts ask learners to talk through their thought processes while solving problems finally the binary response data 
used to model transitions imposes a limit on the kinds of exercises that can be modeled. Other dynamic probabilistic models. Partially observable Markov decision processes (POMDPs) have been used to model learner behavior over time, in cases where the learner follows an open-ended path to arrive at a solution. Although POMDPs present an extremely flexible framework, they require exploration of an exponentially large state space, and current implementations are also restricted to a discrete state space with predefined meanings for the latent variables. This makes them intractable or inflexible in practice, though they have the potential to overcome both of those limitations. Simpler models from the performance factors analysis (PFA) framework and the learning factors analysis (LFA) framework have shown predictive power comparable to BKT. To obtain better predictive results than with any one model alone, various ensemble methods have been used to combine BKT and PFA; model combinations supported by AdaBoost, random forests, linear regression, logistic regression, and a neural network were all shown to deliver superior results to BKT and PFA on their own. But because of the learner models they rely on, these ensemble techniques grapple with the same limitations, including the requirement for accurate concept labeling. Recent work has explored combining item response theory (IRT) models with switched nonlinear Kalman filters, as well as with knowledge tracing. Though these approaches are promising, at present they are both more restricted in functional form and more expensive (due to inference of latent variables) than the method we present here.

Recurrent neural networks. Recurrent neural networks are a family of flexible dynamic models which connect artificial neurons over time. The propagation of information is recursive, in that hidden neurons evolve based on both the input to the system and on their previous activation. In contrast to hidden Markov models as they appear in education, which are also dynamic, RNNs have a high dimensional, continuous representation of latent state. A notable advantage of the richer representation of RNNs is their ability to use information from an input in a prediction at a much later point in time. This is especially true for long short term memory (LSTM) networks, a popular type of RNN. Recurrent neural networks are competitive or state of the art for several time series tasks, for instance speech to text, translation, and image captioning, where large amounts of training data are available. These results suggest that we could be much more successful at tracing student knowledge if we formulated the task as a new application of temporal neural networks.

Deep knowledge tracing. We believe that human learning is governed by many diverse properties of the material, the context, the timecourse of presentation, and the individual involved, many of which are difficult to quantify relying only on first principles to assign attributes to exercises or to structure a graphical model. Here we will apply two different types of RNNs, a vanilla RNN model with sigmoid units and a long short term memory (LSTM) model, to the problem of predicting student responses to exercises based upon their past activity.

Model. Traditional recurrent neural networks (RNNs) map an input sequence of vectors x_1, ..., x_T to an output sequence of vectors y_1, ..., y_T. This is achieved by computing a sequence of hidden states h_1, ..., h_T, which can be viewed as successive encodings of relevant information from past observations that will be useful for future predictions (see Figure 2 for a cartoon illustration). The variables are related using a simple network defined by the equations

h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h),
y_t = sigma(W_yh h_t + b_y),

where both tanh and the sigmoid function sigma are applied elementwise. The model is parameterized by an input weight matrix W_hx, a recurrent weight matrix W_hh, an initial state h_0, and a readout weight matrix W_yh; biases for latent and readout units are given by b_h and b_y.

Figure 2: the connection between variables in a simple recurrent neural network. The inputs x_t to the dynamic network are either one-hot encodings or compressed representations of a student action, and the prediction y_t is a vector representing the probability of getting each of the dataset exercises correct.

Long short term memory (LSTM) networks are a more complex variant of RNNs that often prove more powerful. In LSTMs, latent units retain their values until explicitly cleared by the action of a forget gate; they thus more naturally retain information for many time steps, which is believed to make them easier to train. Additionally, hidden units are updated using multiplicative interactions, and they can thus perform more complicated transformations for the same number of latent units. The update equations for an LSTM are significantly more complicated than for an RNN and can be found in the appendix.

Input and output time series. In order to train an RNN or LSTM on student interactions, it is necessary to convert those interactions into a sequence of fixed length input vectors x_t. We do this using two methods, depending on the nature of those interactions. For datasets with a small number M of unique exercises, we set x_t to be a one-hot encoding of the student interaction tuple h_t = {q_t, a_t} that represents the combination of which exercise was answered and whether the exercise was answered correctly, so x_t is in {0, 1}^{2M}. We found that having separate representations for q_t and a_t degraded performance. For large feature spaces, a one-hot encoding can quickly become impractically large. For datasets with a large number of unique exercises, we therefore instead assign a random Gaussian vector n_{q,a} to each input tuple and then set each input vector to the corresponding random vector, x_t = n_{q_t, a_t}. This random low dimensional representation of a one-hot vector is motivated by compressed sensing: compressed sensing states that a sparse signal in d dimensions can be recovered exactly from on the order of k log d random linear projections, where k is the sparsity (up to scaling and additive constants). Since a one-hot encoding is a sparse signal, the student interaction tuple can be exactly encoded by assigning it to a fixed random Gaussian input vector of length on the order of log(2M). Although the current paper deals only with one-hot vectors, this technique can be extended easily to capture aspects of more complex student interactions in a fixed length vector. The output y_t is a vector of length equal to the number of problems, where each entry represents the predicted probability that the student would answer that particular problem correctly. Thus the prediction of a_{t+1} can be read from the entry in y_t corresponding to q_{t+1}.

Optimization. The training objective is the negative log likelihood of the observed sequence of student responses under the model. Let delta(q_{t+1}) be the one-hot encoding of which exercise is answered at time t+1, and let ell denote binary cross entropy. The loss for a given prediction is ell(y_t . delta(q_{t+1}), a_{t+1}), and the loss for a single student is

L = sum_t ell(y_t . delta(q_{t+1}), a_{t+1}).

This objective was minimized using stochastic gradient descent on minibatches. To prevent overfitting during training, dropout was applied to h_t when computing the readout y_t, but not when computing the next hidden state. We prevent gradients from exploding as we backpropagate through time by truncating the length of gradients whose norm is above a threshold. For all models in this paper we consistently used the same hidden dimensionality and minibatch size.
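To make the update and loss equations above concrete, here is a minimal numpy sketch of the vanilla-RNN forward pass and per-student loss. The helper names, the explicit loop, and the parameter dictionary are our own illustrative choices, not the paper's code, which trained with minibatch SGD, dropout, and gradient clipping as described.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode_interaction(q, a, num_exercises):
    """One-hot encoding of the tuple (q_t, a_t): a vector of length 2M."""
    x = np.zeros(2 * num_exercises)
    x[q + a * num_exercises] = 1.0
    return x

def init_params(num_exercises, hidden_dim, seed=0):
    rng = np.random.default_rng(seed)
    scale = 0.1
    return {
        "W_hx": scale * rng.standard_normal((hidden_dim, 2 * num_exercises)),
        "W_hh": scale * rng.standard_normal((hidden_dim, hidden_dim)),
        "W_yh": scale * rng.standard_normal((num_exercises, hidden_dim)),
        "b_h": np.zeros(hidden_dim),
        "b_y": np.zeros(num_exercises),
        "h0": np.zeros(hidden_dim),
    }

def dkt_forward_loss(interactions, params, num_exercises):
    """interactions: list of (q_t, a_t) pairs for one student.
    Returns the summed binary cross-entropy loss over the sequence."""
    W_hx, W_hh, W_yh = params["W_hx"], params["W_hh"], params["W_yh"]
    b_h, b_y, h = params["b_h"], params["b_y"], params["h0"]
    loss = 0.0
    for t in range(len(interactions) - 1):
        q_t, a_t = interactions[t]
        x_t = encode_interaction(q_t, a_t, num_exercises)
        h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)   # h_t
        y = sigmoid(W_yh @ h + b_y)                # y_t: one probability per exercise
        q_next, a_next = interactions[t + 1]
        p = y[q_next]                              # y_t . delta(q_{t+1})
        loss += -(a_next * np.log(p) + (1 - a_next) * np.log(1 - p))
    return loss

# toy usage on a made-up student history of (exercise id, correct?) pairs
params = init_params(num_exercises=10, hidden_dim=32)
student = [(3, 1), (3, 1), (7, 0), (7, 1)]
print(dkt_forward_loss(student, params, num_exercises=10))
```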
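For the large-vocabulary setting described above, the compressed-sensing-style input representation can be sketched as follows; the dimension choice, seed, and names are illustrative assumptions, and the returned encoder would simply replace encode_interaction in the previous sketch.

```python
import numpy as np

def make_random_encoder(num_exercises, dim, seed=0):
    """Assign a fixed random Gaussian vector to each (exercise, correctness)
    tuple; dim can be on the order of log(2M) rather than 2M."""
    rng = np.random.default_rng(seed)
    table = rng.standard_normal((2 * num_exercises, dim))
    def encode(q, a):
        return table[q + a * num_exercises]
    return encode
```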
To facilitate research in DKT, we have published our code and the relevant preprocessed data. Educational applications. The training objective for knowledge tracing is to predict a student's future performance based on their past activity. This is directly useful; for instance, formal testing is no longer necessary if a student's ability undergoes continuous assessment. As explored experimentally in the results section, the DKT model can also power a number of other advancements. Improving curricula. One of the biggest potential impacts of our model is in choosing the best sequence of learning items to present to a student. Given a student with an estimated hidden knowledge state, we can query our RNN to calculate what their expected knowledge state would be if we were to assign them a particular exercise. For instance, in Figure 1, after the student has answered the exercises shown, we can test every possible next exercise we could show her and compute her expected knowledge state given that choice; the predicted optimal next problem for this student is to revisit solving for the intercept. We use a trained DKT to test two classic curricula rules from the education literature: mixing, where exercises from different topics are intermixed, and blocking, where students answer a series of exercises of the same type. Since choosing the entire sequence of next exercises so as to maximize predicted accuracy can be phrased as a Markov decision problem, we can also evaluate the benefits of using the expectimax algorithm (see the appendix) to choose an optimal sequence of problems (a one-step version is sketched after this paper's references). Discovering exercise relationships. The DKT model can further be applied to the task of discovering latent structure or concepts in the data, a task that is typically performed by human experts. We approached this problem by assigning an influence J_ij to every directed pair of exercises i and j, J_ij = y(j | i) / sum_k y(j | k), where y(j | i) is the correctness probability assigned by the RNN to exercise j on the second time step, given that a student answered exercise i correctly on the first (a sketch of this computation is also given after the references). We show that this characterization of the dependencies captured by the RNN recovers the latent structure associated with exercises. Datasets. We test the ability to predict student performance on three datasets: simulated data, Khan Academy data, and the Assistments benchmark dataset. On each dataset we measure area under the curve (AUC). For the non-simulated data we evaluate our results using cross validation, and in all cases hyperparameters are learned on training data. We compare the results of deep knowledge tracing to standard BKT and, when possible, to optimal variations of BKT. Additionally, we compare our results to predictions made by simply calculating the marginal probability of a student getting a particular exercise correct. Table 1 gives an overview of the datasets (number of students, exercise tags, and answers for the simulated, Khan Math, and Assistments data) together with the AUC results for all datasets tested, where BKT is the standard BKT, BKT* is the best reported result from the literature for Assistments, and DKT is the result of using LSTM deep knowledge tracing. Simulated data. We simulate virtual students learning virtual concepts and test how well we can predict responses in this controlled setting. For each run of this experiment we generate two thousand students who answer a fixed set of exercises drawn from a number of hidden concepts; for this dataset only, all students answer the same sequence of exercises. Each student has a latent knowledge state (skill) for each concept, and each exercise has both a single concept and a difficulty. The probability of a student getting an exercise with difficulty beta correct, if the student had concept skill alpha, is modelled using classic item response theory as p(correct | alpha, beta) = c + (1 - c) / (1 + e^{beta - alpha}), where c is the probability of a random guess, set to a constant. Students
learn over time via an increase to the concept skill which corresponded to the exercise they answered to understand how the different models can incorporate unlabelled data we do not provide models with the hidden concept labels instead the input is simply the exercise index and whether or not the exercise was answered correctly we evaluate prediction performance on an additional two thousand simulated test students for each number of concepts we repeat the experiment times with different randomly generated data to evaluate accuracy mean and standard error khan academy data we used sample of anonymized student usage interactions from the eighth grade common core curriculum on khan academy the dataset included million exercises completed by students across different exercise types it did not contain any personal information only the researchers working on this paper had access to this anonymized dataset and its use was governed by an agreement designed to protect student privacy in accordance with khan academy privacy notice khan academy provides particularly relevant source of learning data since students often interact with the site for an extended period of time and for variety of content and because students are often in the topics they work on and in the trajectory they take through material benchmark dataset in order to understand how our model compared to other models we evaluated models on the assistments skill builder public benchmark assistments is an online tutor that simultaneously teaches and assesses students in grade school mathematics it is to the best of our knowledge the largest publicly available knowledge tracing dataset results on all three datasets deep knowledge tracing substantially outperformed previous methods on the khan dataset using an lstm neural network model led to an auc of which was notable improvement over the performance of standard bkt auc especially when compared to the small improvement bkt provided over the marginal baseline auc see table and figure on the assistments dataset dkt produced gain over the previous best reported result auc and respectively the gain we report in auc compared to the marginal baseline is more than triple the largest gain achieved on the dataset to date the prediction results from the synthetic dataset provide an interesting demonstration of the capacities of deep knowledge tracing both the lstm and rnn models did as well at predicting student responses as an oracle which had perfect knowledge of all model parameters and only had to fit the latent student knowledge variables see figure in order to get accuracy on par with an oracle the models would have to mimic function that incorporates latent concepts the difficulty of each exercise the prior distributions of student knowledge and the increase in concept skill that happened https average predicted probability true positive rate test accuracy test testaccuracy accuracy oracle oracle rnn rnn lstm lstm bkt bkt lstm rnn bkt marginal number of concepts number of hidden concepts blocking mixing false positive rate exercise index figure left predictionoracle results for simulated data and khan academy data right predicted knowledge on assistments data for different exercise curricula error bars are standard error of the mean rnn lstm after each exercise in contrast bkt the bkt prediction degraded substantially as the number of hidden concepts increased as it doesn have mechanism to learn unlabelled concepts we tested our ability to intelligently chose exercises on subset of five 
concepts from the assistment dataset for each curricula method we used our dkt model to simulate how student would answer questions and evaluate how much student knew after exercises we repeated student and measured the average predicted of student getting future simulations times questions correct in the assistment contextofthe blocking strategy had notable advantage over number concepts mixing see figure while blocking performs on par with solving expectimax one exercise deep if we look further into the future when choosing the next problem we come up with curricula where students have higher predicted knowledge after solving fewer problems the prediction accuracy on the synthetic dataset suggest that it may be possible to use dkt models to extract the latent structure between the assessments in the dataset the graph of our model conditional influences for the synthetic dataset reveals perfect clustering of the five latent concepts see figure with directed edges set using the influence function in equation an interesting observation is that some of the exercises from the same concept occurred far apart in time for example in the synthetic dataset where node numbers depict sequence the exercise in the synthetic dataset was from hidden concept and even though it wasn until the problem that another problem from the same concept was asked we were able to learn strong conditional dependency between the two we analyzed the khan dataset using the same technique the resulting graph is compelling articulation of how the concepts in the grade common core are related to each other see figure node numbers depict exercise tags we restricted the analysis to ordered pairs of exercises such that after appeared appeared more than of the time in the remainder of the sequence to determine if the resulting conditional relationships are product of obvious underlying trends in the data we compared our results to two baseline measures the transition probabilities of students answering given they had just answered and the probability in the dataset without using dkt model of answering correctly given student had earlier answered correctly both baseline methods generated discordant graphs which are shown in the appendix while many of the relationships we uncovered may be unsurprising to an education expert their discovery is affirmation that the dkt network learned coherent model discussion in this paper we apply rnns to the problem of knowledge tracing in education showing improvement over prior performance on the assistments benchmark and khan dataset two particularly interesting novel properties of our new model are that it does not need expert annotations it can learn concept patterns on its own and it can operate on any student input that can be vectorized one disadvantage of rnns over simple hidden markov methods is that they require large amounts of training data and so are well suited to an online education environment but not small classroom environment hidden concept hidden concept hidden concept hidden concept hidden concept simulated data khan data scatter plots linear function intercepts interpreting function graphs constructing inconsistent system recognizing irrational numbers systems of equations elim pythagorean theorem proofs linear equations solutions to systems of equations scientific notation intuition theorem multiplication in scientific notation views of function line graph intuition lines parallel lines recog func multistep equations distribution angles systems of equations graphing 
proportional relationships fractions as repeating decimals equations word problems exponent rules cube roots functions slope of line angles scientific notation line graphs exponents systems of equations fractions linear models of bivariate data understand equations word problems pythagorean theorem systems of equations with elimination exponents functions plotting the line of best fit segment addition vertical angles integer sums systems of equations substitution solving for the intercept congruent angles comparing proportional relationships recognizing functions exponents solutions to linear equations interpreting scatter plots finding intercepts of linear functions slope and triangle similarity repeating decimals to fractions midpoint of segment distance formula graphical solutions to systems volume word problems converting decimals to fractions linear non linear functions constructing scatter plots age word problems square roots interpreting features of linear functions solving for the intercept pythagorean theorem repeating decimals to fractions graphing systems of equations comparing features of functions constructing linear functions frequencies of bivariate data orders of magnitude graphing linear equations comparing features of functions angle addition postulate computing in scientific notation angles parallel lines figure graphs of conditional influence between exercises in dkt models above we observe perfect clustering of latent concepts in the synthetic data below convincing depiction of how grade math common core exercises influence one another arrow size indicates connection strength note that nodes may be connected in both directions edges with magnitude smaller than have been thresholded cluster labels are added by hand but are fully consistent with the exercises in each cluster the application of rnns to knowledge tracing provides many directions for future research further investigations could incorporate other features as inputs such as time taken explore other educational impacts such as hint generation dropout prediction and validate hypotheses posed in education literature such as spaced repetition modeling how students forget because dkts take vector input it should be possible to track knowledge over more complex learning activities an especially interesting extension is to trace student knowledge as they solve programming tasks using recently developed method for vectorization of programs we hope to be able to intelligently model student knowledge over time as they learn to program in an ongoing collaboration with khan academy we plan to test the efficacy of dkt for curriculum planning in controlled experiment by using it to propose exercises on the site acknowledgments many thanks to john mitchell for his guidance and khan academy for its support chris piech is supported by grant number surya ganguli thanks the burroughs wellcome james mcdonnell sloan and mcknight foundations for support references khan academy privacy notice https baraniuk compressive sensing ieee signal processing magazine en koedinger and unker learning factors general method for cognitive model evaluation and improvement in intelligent tutoring systems springer pp ohen and arcia identity belonging and achievement model interventions implications current directions in psychological science orbett cognitive computer tutors solving the problem in user modeling springer pp orbett and nderson knowledge tracing modeling the acquisition of procedural knowledge user modeling and interaction baker 
orbett and leven more accurate student modeling through contextual estimation of slip and guess probabilities in bayesian knowledge tracing in intelligent tutoring systems springer pp baker pardos owda ooraei and effernan ensembling predictions of student knowledge within intelligent tutoring systems in user modeling adaption and personalization springer pp rasgow and ulin item response theory handbook of industrial and organizational psychology lliot and dweck handbook of competence and motivation guilford publications eng effernan and koedinger addressing the assessment challenge with an online system that tutors as it assesses user modeling and interaction itch auser and homsky the evolution of the language faculty clarifications and implications cognition entner theoretical framework for analogy cognitive science ong eck and effernan in intelligent tutoring systems springer raves ohamed and inton speech recognition with deep recurrent neural networks in acoustics speech and signal processing icassp ieee international conference on ieee pp ochreiter and chmidhuber long memory neural computation arpathy and ei ei deep alignments for generating image descriptions arxiv preprint hajah ing indsey and ozer incorporating latent factors into knowledge tracing to predict individual differences in learning proceedings of the international conference on educational data mining hajah uang onz renes ozer and rusilovsky integrating knowledge tracing and item response theory tale of two frameworks proceedings of the international workshop on personalization approaches in learning environments an tuder and baraniuk learning and content analytics via sparse factor analysis in proceedings of the acm sigkdd international conference on knowledge discovery and data mining acm pp innenbrink and intrich role of affect in cognitive processing in academic contexts motivation emotion and cognition integrative perspectives on intellectual functioning and development ikolov arafi urget ernock and hudanpur recurrent neural network based language model in interspeech annual conference of the international speech communication association makuhari chiba japan pp pardos and effernan introducing item difficulty to the knowledge tracing model in user modeling adaption and personalization springer pp pavlik en and koedinger performance factors new alternative to knowledge tracing online submission iech uang guyen hulsuksombati ahami and uibas learning program embeddings to propagate feedback on student code corr iech ahami uang and uibas autonomously generating hints by inferring problem solving policies in proceedings of the second acm conference on learning scale new york ny usa acm pp iech ahami koller ooper and likstein modeling how students learn to program in proceedings of the acm symposium on computer science education olson and ichardson foundations of intelligent tutoring systems psychology press afferty runskill riffiths and hafto faster teaching by pomdp planning in artificial intelligence in education springer pp rohrer the effects of spacing and mixing practice problems journal for research in mathematics education chraagen hipman and halin cognitive task analysis psychology press illiams and ipser learning algorithm for continually running fully recurrent neural networks neural computation udelson koedinger and ordon individualized bayesian knowledge tracing models in artificial intelligence in education springer pp 
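The curriculum idea discussed above (testing every candidate next exercise and picking the one that maximizes expected knowledge) can be sketched as a one-step lookahead; the paper's expectimax over longer horizons lives in its appendix and is not reproduced here. The predict_next callable below is an assumed interface: any trained DKT-style model mapping a history of (exercise, correctness) pairs to a vector of per-exercise correctness probabilities.

```python
import numpy as np

def choose_next_exercise(predict_next, history, num_exercises):
    """One-step lookahead: for each candidate exercise j, average the model's
    predicted knowledge over the two possible outcomes, weighted by the
    predicted probability of each, and return the best candidate."""
    y_now = predict_next(history)                      # current per-exercise predictions
    best_j, best_value = None, -np.inf
    for j in range(num_exercises):
        p_correct = float(y_now[j])                    # prob. of answering j correctly now
        value = 0.0
        for a, p in ((1, p_correct), (0, 1.0 - p_correct)):
            y_after = predict_next(history + [(j, a)]) # predicted knowledge after outcome a
            value += p * float(np.mean(y_after))
        if value > best_value:
            best_j, best_value = j, value
    return best_j
```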
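Similarly, the conditional-influence analysis behind the exercise-relationship graphs can be sketched as follows, assuming a hypothetical predict_after_one_interaction(q, a) that returns the model's per-exercise correctness probabilities after a single interaction; the normalization follows the J_ij definition quoted in the text, and the thresholding mirrors the pruning of weak edges in the figures.

```python
import numpy as np

def influence_matrix(predict_after_one_interaction, num_exercises, threshold=0.0):
    """J[i, j] approximates the influence of exercise i on exercise j:
    y(j | i), the probability of answering j correctly on the second time step
    given that i was answered correctly on the first, normalized over all
    conditioning exercises k."""
    y_given = np.zeros((num_exercises, num_exercises))
    for i in range(num_exercises):
        y_given[i] = predict_after_one_interaction(i, 1)      # row i holds y(. | i)
    J = y_given / y_given.sum(axis=0, keepdims=True)          # J_ij = y(j|i) / sum_k y(j|k)
    J[J < threshold] = 0.0                                    # optional edge thresholding
    return J
```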
rethinking lda moment matching for discrete ica anastasia podosinnikova francis bach simon inria normale paris abstract we consider moment matching techniques for estimation in latent dirichlet allocation lda by drawing explicit links between lda and discrete versions of independent component analysis ica we first derive new set of cumulantbased tensors with an improved sample complexity moreover we reuse standard ica techniques such as joint diagonalization of tensors to improve over existing methods based on the tensor power method in an extensive set of experiments on both synthetic and real datasets we show that our new combination of tensors and orthogonal joint diagonalization techniques outperforms existing moment matching methods introduction topic models have emerged as flexible and important tools for the modelisation of text corpora while early work has focused on approximate inference techniques such as variational inference or gibbs sampling moment matching techniques have recently emerged as strong competitors due to their computational speed and theoretical guarantees in this paper we draw explicit links with the independent component analysis ica literature and references therein by showing strong relationship between latent dirichlet allocation lda and ica we can then reuse standard ica techniques and results and derive new tensors with better sample complexity and new algorithms based on joint diagonalization is lda discrete pca or discrete ica notation following the text modeling terminology we define corpus xn as collection of documents each document is collection wnln of ln tokens it is convenient to represent the token of the document as encoding with an indicator vector wn with only where is the vocabulary size and each document one as the count vector xn wn in such representation the length ln of the document is ln xnm we will always use index to refer to topics index to refer to documents index to refer to words from the vocabulary and index ln to refer to tokens of the document the plate diagrams of the models from this section are presented in appendix latent dirichlet allocation is generative probabilistic model for discrete data such as text corpora in accordance to this model the document is modeled as an admixture over the vocabulary of words with latent topics specifically the latent variable which is sampled from the dirichlet distribution represents the topic mixture proportion over topics for the document given the topic choice zn for the token is sampled from the multinomial distribution with the probability vector the token wn is then sampled from the multinomial distribution with the probability vector dzn or dk if is the index of the element in zn this vector dk is the topic that is vector of probabilities over the words from the vocabulary subject to the simplex constraint dk where rm dm this generative process of document the index is omitted for simplicity can be summarized as dirichlet multinomial multinomial dz one can think of the latent variables as auxiliary variables which were introduced for convenience of inference but can in fact be marginalized out which leads to the following model dirichlet multinomial lda model where rm is the topic matrix with the column equal to the topic dk and rk is the vector of parameters for the dirichlet distribution while document is represented as set of tokens in the formulation the formulation instead compactly represents document as the count vector although the two representations are equivalent we focus 
on the second one in this paper and therefore refer to it as the lda model importantly the lda model does not model the length of documents indeed although the original paper proposes to model the document length as poisson this is never used in practice and in particular the parameter is not learned therefore in the way that the lda model is typically used it does not provide complete generative process of document as there is no rule to sample in this paper this fact is important as we need to model the document length in order to make the link with discrete ica discrete pca the lda model can be seen as discretization of principal component analysis pca via replacement of the normal likelihood with the multinomial one and adjusting the prior in the following probabilistic pca model normal ik and normal im where rm is transformation matrix and is parameter discrete ica dica interestingly small extension of the lda model allows its interpretation as discrete independent component analysis model the extension naturally arises when the document length for the lda model is modeled as random variable from the mixture which is equivalent to negative binomial random variable poisson and gamma where ck is the shape parameter and is the rate parameter the lda model with such document length is equivalent see appendix to gamma ck xm poisson gp model where all are mutually independent the parameters ck coincide with the ones of the lda model in and the free parameter can be seen see appendix as scaling parameter for the document length when is already prescribed this model was introduced by canny and later named as discrete ica model it is more natural however to name model as the gp model and the model mutually independent xm poisson dica model as the discrete ica dica model the only difference between and the standard ica model without additive noise is the presence of the poisson noise which enforces discrete instead of continuous values of xm note also that the discrete ica model is model that can adapt to any distribution on the topic intensities and that the gp model is particular case of both the lda model and the dica model thanks to this close connection between lda and ica we can reuse standard ica techniques to derive new efficient algorithms for topic modeling moment matching for topic modeling the method of moments estimates latent parameters of probabilistic model by matching theoretical expressions of its moments with their sample estimates recently the method of moments was applied to different latent variable models including lda resulting in computationally fast learning algorithms with theoretical guarantees for lda they construct lda moments with particular diagonal structure and develop algorithms for estimating the parameters of the model by exploiting this diagonal structure in this paper we introduce novel cumulants with similar to the lda moments structure this structure allows to reapply the algorithms of for the estimation of the model parameters with the same theoretical guarantees we also consider another algorithm applicable to both the lda moments and the cumulants cumulants of the gp and dica models in this section we derive and analyze the novel cumulants of the dica model as the gp model is particular case of the dica model all results of this section extend to the gp model the first three cumulant tensors for the random vector can be defined as follows cum cum cov cum where denotes the tensor product see some properties of cumulants in appendix the essential 
property of the cumulants which does not hold for moments that we use in this paper is that the cumulant tensor for random vector with independent components is diagonal let then for the poisson random variable xm poisson ym the expectation is xm ym hence by the law of total expectation and the linearity of expectation the expectation in has the following form de further the variance of the poisson random variable xm is var xm ym and as xm are conditionally independent given then their covariance matrix is diagonal cov diag therefore by the law of total covariance the covariance in has the form cov cov cov diag cov diag dcov where the last equality follows by the multilinearity property of cumulants see appendix moving the first term from the rhs of to the lhs we define cov diag dica from and by the independence of see appendix has the following diagonal structure var dk ddiag var by analogy with the second order case using the law of total cumulance the multilinearity property of cumulants and the independence of we derive in appendix expression similar to for the third cumulant moving the terms in this expression we define tensor with the following element cum cov cov dica cov where is the kronecker delta by analogy with appendix the diagonal structure of tensor cum dk dk dk in appendix we recall in our notation the matrix and the tensor for the lda model which are analogues of the matrix and the tensor for the models slightly abusing terminology we refer to the matrix and the tensor as the lda moments and to the matrix and the tensor as the cumulants the diagonal structure of the lda moments is similar to the diagonal structure of the cumulants though arising through slightly different argument as discussed at the end of appendix importantly due to this similarity the algorithmic frameworks for both the cumulants and the lda moments coincide the following sample complexity results apply to the sample estimates of the gp proposition under the gp model the expected error for the sample estimator sb for the gp cumulant is ks skf ks skf max where max kdk min and high probability bound could be derived using concentration inequalities for poisson random variables but the expectation already gives the right order of magnitude for the error for example via markov inequality the expression for an unbiased finite sample estimate sb of and the expression for an unbiased finite sample estimate tb of are in appendix sketch of proof for proposition can be found in appendix by following similar analysis as in we can rephrase the topic recovery error in term of the error on the gp cumulant importantly the whitening transformation introduced in section redivides the error on by which is the scale of see appendixp for details this means that the contribution from to the recovery error will scale as max where both and are smaller than and can be very small we do not present the exact expression for the expected squared error for the estimator of but due to similar structure in the derivation we expect the analogous bound of ktb kf max current sample complexity results of the lda moments can be summarized as however the proof which can be found in the supplementary material analyzes only the case when finite sample estimates of the lda moments are constructed from one triple per document only and not from the that average multiple dependent triples per document as in the practical expressions and moreover one has to be careful when comparing upper bounds nevertheless comparing the bound with the current 
theoretical results for the lda moments we see that the cumulants sample complexity contains the norm of the columns of the topic matrix in the numerator as opposed to the coefficient for the lda moments this norm can be significantly smaller than for vectors in the simplex for sparse topics this suggests that the cumulants may have better finite sample convergence properties than the lda moments and our experimental results in section are indeed consistent with this statement the cumulants have somewhat more intuitive derivation than the lda moments as they are expressed via the count vectors which are the sufficient statistics for the model and not the tokens note also that the construction of the lda moments depend on the unknown parameter given that we are in an unsupervised setting and that moreover the evaluation of lda is difficult task setting this parameter is in appendix we observe experimentally that the lda moments are somewhat sensitive to the choice of diagonalization algorithms how is the diagonal structure of and of going to be helpful for the estimation of the model parameters this question has already been thoroughly investigated in the signal processing see and references therein and machine learning see and references therein literature we review the approach in this section due to similar diagonal structure the algorithms of this section apply to both the lda moments and the cumulants for simplicity let us rewrite expressions and for and as follows dk tk note that the expected squared error for the dica cumulants is similar but the expressions are less compact and in general depend on the prior on for completeness we also present the finite sample estimates sb and tb of and for the lda moments which are consistent with the ones suggested in in appendix where sk var and tk cum introducing the rescaled topics dek sk dk ed following the same assumption from that the topic vectors are we can also rewrite is full rank we can compute whitening matrix of linearly independent matrix such that sw ik where ik is the identity matrix see appendix for more details as result the vectors zk dek form an orthonormal set of vectors further let us define projection of tensor onto vector rk applying the multilinear transformation see for the definition with to the tensor from and projecting the resulting tensor onto some vector rk we obtain tk hzk uizk zk where tk tk is due to the rescaling of topics and stands for the inner product as the vectors zk are orthonormal the pairs zk and tk hzk ui are eigenpairs of the matrix which are uniquely defined if the eigenvalues are all different if they are unique we can recover the as well as lda model parameters via dek zk and tk ui this procedure was referred to as the spectral algorithm for lda and the blind identification algorithm for ica indeed one can expect that the finite sample estimates sb and tb possess approximately the diagonal structure and and therefore the reasoning from above can be applied assuming that the effect of the sampling error is controlled this spectral algorithm however is known to be quite unstable in practice see to overcome this problem other algorithms were proposed for ica the most notable ones are probably the fastica algorithm and the jade algorithm the fastica algorithm with appropriate choice of contrast function estimates iteratively the topics making use of the orthonormal structure and performs the deflation procedure at every step the recently introduced tensor power method tpm for the lda model is close to 
the fastica algorithm alternatively the jade algorithm modifies the spectral algorithm by performing multiple projections for and then jointly diagonalizing the resulting matrices with an orthogonal matrix the spectral algorithm is special case of this orthogonal joint diagonalization algorithm when only one projection is chosen importantly fast implementation of the orthogonal joint diagonalization algorithm from was proposed which is based on iterative jacobi updates see for the later in practice the orthogonal joint diagonalization jd algorithm is more robust than fastica see or the spectral algorithm moreover although the application of the jd algorithm for the learning of topic models was mentioned in the literature it was never implemented in practice in this paper we apply the jd algorithm for the diagonalization of the cumulants as well as the lda moments which is described in algorithm note that the choice of projection up for some vector up rk is important and corresponds to vector vp rm obtained as vp along the third mode importantly in algorithm the the multilinear transformation of tb with joint diagonalization routine is performed over matrices of size where the number of topics is usually not too big this makes the algorithm computationally fast see appendix the same is true for the spectral algorithm but not for tpm in section we compare experimentally the performance of the spectral jd and tpm algorithms for the estimation of the parameters of the as well as lda models we are not aware of any experimental comparison of these algorithms in the lda context while already working on this manuscript the jd algorithm was also independently analyzed by in the context of tensor factorization for general latent variable models however focused mostly on the comparison of approaches for tensor factorization and their stability properties with brief experiments using latent variable model related but not equivalent to lda for community detection in contrast we provide detailed experimental comparison in the context of lda in this paper as well as propose novel estimator due to the space restriction the estimation of the topic matrix and the parameter are moved to appendix see appendix for discussion on the orders algorithm joint diagonalization jd algorithm for cumulants or lda moments input rm number of random projections and for lda moments rm for for lda in appendix compute sample estimate of sb see appendix estimate whitening matrix option choose vectors up rk uniformly at random from the unit up rm for all sphere and set vp yields the spectral algorithm option choose vectors up rk as the canonical basis ek of up rm for all rk and set vp tb vp for for lda appendix for compute bp sbw ik bp perform orthogonal joint diagonalization of matrices see and to find an orthogonal matrix and vectors ap rk such that sbw ik and bp diag ap vw and values ap estimate joint diagonalization matrix output estimate of and as described in appendix experiments in this section we compare experimentally the cumulants with the lda moments and the spectral algorithm the tensor power method tpm the joint diagonalization jd algorithm from algorithm and variational inference for lda real data the associated press ap dataset from blei web with documents the nips papers and vocabulary words and the average document length the kos from the dataset of nips papers and words and uci repository with documents and words and data are constructed by analogy with the lda parameters and are learned from the real datasets 
with variational inference and toy data are sampled from model of interest with the given parameters and this provides the ground truth parameters and for each setting data are sampled times and the results are averaged we plot error bars that are the minimum and maximum values for the ap data topics are learned and for the nips data topics are learned for larger the obtained topic matrix is illconditioned which violates the identifiability condition for topic recovery using moment matching techniques all the documents with less than tokens are resampled sampling techniques all the sampling models have the parameter which is set to where is the learned from the real dataset with variational lda and is parameter that we so that can vary the gp data are sampled from the model with the expected document length is see appendix the data are sampled from the lda model with the document length being fixed to given the data are sampled as follows of the documents are sampled from the model with given document length and of the documents are sampled from the model with given document length evaluation evaluation of topic recovery for data is performed with the and true topic matrices with the best permutation of columns error between the recovered err dk the minimization is over the possible and can be efficiently obtained with the hungarian permutations perm of the columns of algorithm for bipartite matching for the evaluation of topic recovery in the real data case we use an approximation of the for held out documents as the metric see appendix for more details http http https jd jd jd spec tpm number of docs in number of docs in figure comparison of the diagonalization algorithms the topic matrix and dirichlet parameter are learned for from ap is scaled to sum up to and is set to fit the expected document length the dataset is sampled from gp number of documents varies from to left moments right lda moments note smaller value of the is better we use our matlab implementation of the cumulants the lda moments and the diagonalization algorithms the datasets and the code for reproducing our experiments are available in appendix we discuss implementation and complexity of the algorithms we explain how we initialize the parameter for the lda moments in appendix comparison of the diagonalization algorithms in figure we compare the diagonalization algorithms on the ap dataset for using the gp sampling we compare the tensor power method tpm the spectral algorithm spec the orthogonal joint diagonalization algorithm jd described in algorithm with different options to choose the random projections jd takes vectors up sampled uniformly from the unit in rk and selects vp up option in algorithm jd selects the full basis ek in rk and sets vp ep as jade option in algorithm jd chooses the full canonical basis of rm as the projection vectors computationally expensive both the cumulants and lda moments are in this setup however the lda moments have slower finite sample convergence and hence larger estimation error for the same value as expected the spectral algorithm is always slightly inferior to the joint diagonalization algorithms with the cumulants where the estimation error is low all algorithms demonstrate good performance which also fulfills our expectations however although tpm shows almost perfect performance in the case of the cumulants left it significantly deteriorates for the lda moments right which can be explained by the larger estimation error of the lda moments and lack of robustness of tpm the 
running times are discussed in appendix overall the orthogonal joint diagonalization algorithm with initialization of random projections as multiplied with the canonical basis in rk jd is both computationally efficient and fast comparison of the cumulants and the lda moments in figure when sampling from the gp model top left both the cumulants and lda moments are well specified which implies that the approximation error the error for the infinite number of documents is low for both the cumulants achieve low values of the estimation error already for documents independently of the number of topics while the convergence is slower for the lda moments when sampling from the model top right the cumulants are and their approximation error is high although the estimation error is low due to the faster finite sample convergence one reason of poor performance of the cumulants in this case is the absence of variance in document length indeed if documents with two different lengths are mixed by sampling from the model bottom left the cumulants performance improves moreover the experiment with changing fraction of documents bottom right shows that variance on the length improves the performance of the cumulants as in practice real corpora usually have variance for the document length this bad scenario for the cumulants is not likely to happen https number of docs in number of docs in fraction of doc lengths number of docs in vi in bits in bits figure comparison of the cumulants and lda moments two topic matrices and parameters and are learned from the nips dataset for and and are scaled to sum up to four corpora of different sizes from to top left is set to fit the expected document length sampling from the gp model top right sampling from the model bottom left sampling from the model bottom right the number of documents here is fixed to sampling from the model varying the values of the fraction from to with the step note smaller value of the is better topics topics figure experiments with real data left the ap dataset right the kos dataset note higher value of the is better real data experiments in figure and are compared with variational inference vi and with variational inference initialized with the output of we measure held out per token see appendix for details on the experimental setup the orthogonal joint diagonalization algorithm with the cumulants demonstrates promising performance in particular the cumulants significantly outperform the lda moments moreover although variational inference performs better than the algorithm restarting variational inference with the output of the algorithm systematically leads to better results similar behavior has already been observed see conclusion in this paper we have proposed new set of tensors for discrete ica model related to lda where word counts are directly modeled these moments make fewer assumptions regarding distributions and are theoretically and empirically more robust than previously proposed tensors for lda both on synthetic and real data following the ica literature we showed that our joint diagonalization procedure is also more robust once the topic matrix has been estimated in way where topic intensities are left unspecified it would be interesting to learn the unknown distributions of the independent topic intensities acknowledgments this work was partially supported by the joint center the authors would like to thank christophe dupuy for helpful discussions references blei ng and jordan latent dirichlet allocation mach learn griffiths 
gibbs sampling in the generative model of latent dirichlet allocation technical report stanford university anandkumar foster hsu kakade and liu spectral algorithm for latent dirichlet allocation in nips anandkumar ge hsu kakade and telgarsky tensor decompositions for learning latent variable models mach learn comon and jutten handbook of blind source separation independent component analysis and applications academic press jutten calcul et traitement du signal analyse en composantes phd thesis grenoble jutten and blind separation of sources part an adaptive algorithm based on neuromimetric architecture signal comon independent component analysis new concept signal buntine variational extensions to em and multinomial pca in ecml tipping and bishop probabilistic principal component analysis stat roweis em algorithms for pca and spca in nips canny gap factor model for discrete data in sigir buntine and jakulin applying discrete pca in data analysis in uai boucheron lugosi and massart concentration inequalities nonasymptotic theory of independence oxford university press anandkumar foster hsu kakade and liu spectral algorithm for latent dirichlet allocation corr wallach murray salakhutdinov and mimno evaluation methods for topic models in icml cardoso source separation using higher order moments in icassp cardoso of the cumulant tensor with application to the blind source separation problem in icassp cardoso and comon independent component analysis survey of some algebraic methods in iscas fast and robust algorithms for independent component analysis ieee trans neural cardoso and souloumiac blind beamforming for non gaussian signals in iee cardoso contrasts for independent component analysis neural cardoso and souloumiac jacobi angles for simultaneous diagonalization siam mat anal byers and mehrmann numerical methods for simultaneous diagonalization siam matrix anal nocedal and wright numerical optimization springer edition bach and jordan kernel independent component analysis mach learn kuleshov chaganty and liang tensor factorization via matrix factorization in aistats globerson chechik pereira and tishby euclidean embedding of data mach learn arora ge halpern mimno moitra sontag wu and zhu practical algorithm for topic modeling with provable guarantees in icml cohen and collins provably correct learning algorithm for pcfgs in acl 
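To make the second-order construction from the cumulants section above concrete, the following sketch forms a plain sample estimate of the DICA cumulant S = cov(x) − diag(E[x]) from a document–term count matrix, and a whitening matrix W with W S Wᵀ ≈ I_K built from the top-K eigenpairs of S. It is a minimal sketch under the assumption that X is an n_docs × M NumPy array of word counts; the unbiased finite-sample estimators and the third-order tensor T treated in the paper's appendices are intentionally omitted.

import numpy as np

def dica_S(X):
    """Plain sample estimate of S = cov(x) - diag(E[x]) from a counts matrix X (n_docs x M)."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)       # simple sample covariance, not the paper's unbiased estimator
    return cov - np.diag(mean)

def whitening_matrix(S, K):
    """W of shape (K x M) with W S W^T approximately I_K, from the top-K eigenpairs of S."""
    vals, vecs = np.linalg.eigh(S)
    top = np.argsort(vals)[::-1][:K]
    return vecs[:, top].T / np.sqrt(np.maximum(vals[top], 1e-12))[:, None]

Projections of the third-order cumulant, whitened with such a W, are what the spectral and orthogonal joint diagonalization routines described above would then operate on.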
efficient compressive phase retrieval with constrained sensing vectors sohail bahmani justin romberg school of electrical and computer engineering georgia institute of technology atlanta ga jrom abstract we propose robust and efficient approach to the problem of compressive phase retrieval in which the goal is to reconstruct sparse vector from the magnitude of number of its linear measurements the proposed framework relies on constrained sensing vectors and reconstruction method that consists of two standard convex programs that are solved sequentially in recent years various methods are proposed for compressive phase retrieval but they have suboptimal sample complexity or lack robustness guarantees the main obstacle has been that there is no straightforward convex relaxations for the type of structure in the target given set of underdetermined measurements there is standard framework for recovering sparse matrix and standard framework for recovering matrix however general efficient method for recovering jointly sparse and matrix has remained elusive deviating from the models with generic measurements in this paper we show that if the sensing vectors are chosen at random from an incoherent subspace then the and sparse structures of the target signal can be effectively decoupled we show that recovery algorithm that consists of recovery stage followed by sparse recovery stage will produce an accurate estimate of the target when the number of measurements is log kd where and denote the sparsity level and the dimension of the input signal we also evaluate the algorithm through numerical simulation introduction problem setting the problem of compressive phase retrieval cpr is generally stated as the problem of estimating vector rd from noisy measurements of the form yi zi for where ai is the sensing vector and zi denotes the additive noise in this paper we study the cpr problem with specific sensing vectors ai of the form ai wi where and wi rm are known in words the measurement vectors live in fixed subspace the row space of these types of measurements can be applied in imaging systems that have control over how the scene is illuminated examples include systems that use structured illumination with spatial light modulator or scattering medium by standard lifting of the signal to the quadratic measurements can be expressed as zi yi ai at zi wi wt with the linear operator and defined as wi wt and xψ we can write the measurements compactly as our goal is to estimate the sparse and positive semidefinite matrix from the measurements which also solves the cpr problem and provides an estimate for the sparse signal up to the inevitable global phase ambiguity assumptions we make the following assumptions throughout the paper the vectors wi are independent and have the standard gaussian distribution on rm wi the matrix is restricted isometry matrix for vectors and for constant namely it obeys kψ for all vectors rd the noise vector is bounded as as will be seen in theorem and its proof below the gaussian distribution imposed by the assumption will be used merely to guarantee successful estimation of matrix through trace norm minimization however other distributions uniform distribution on the unit sphere can also be used to obtain similar guarantees furthermore the restricted isometry condition imposed by the assumption is not critical and can be replaced by weaker assumptions however the guarantees obtained under these weaker assumptions usually require more intricate derivations provide weaker noise 
robustness and often do not hold uniformly for all potential target signals therefore to keep the exposition simple and straightforward we assume which is known to hold with high probability for various ensembles of random matrices gaussian rademacher partial fourier etc because in many scenarios we have the flexibility of selecting the assumption is realistic as well notation let us first set the notation used throughout the paper matrices and vectors are denoted by bold capital and small letters respectively the set of positive integers less than or equal to is denoted by the notation is used when cg for some absolute constant for any matrix the frobenius norm the nuclear norm the entrywise and the largest entrywise absolute value of the entries are denoted by km kf km km and km respectively to indicate that matrix is positive semidefinite we write contributions the main challenge in the cpr problem in its general formulation is to design an accurate estimator that has optimal sample complexity and computationally tractable in this paper we address this challenge in the special setting where the sensing vectors can be factored as namely we propose an algorithm that provably produces an accurate estimate of the lifted target from only log kd measurements and can be computed in polynomial time through efficient convex optimization methods related work several papers including have already studied the application of convex programming for phase retrieval pr in various settings and have established estimation accuracy through different mathematical techniques these phase retrieval methods attain nearly optimal sample complexities that scales with the dimension of the target signal up to constant factor or at most logarithmic factor however to the best of our knowledge the exiting methods for cpr either lack accuracy and robustness guarantees or have suboptimal sample complexities the problem of recovering sparse signal from the magnitude of its subsampled fourier transforms is cast in as an with constraints while shows that sufficient number of measurements would grow quadratically in the sparsity of the signal the numerical simulations suggest that the method successfully estimates the sparse signal with only about log kd measurements another approach to cpr is considered in which poses the problem as finding vector that minimizes the residual error that takes quartic form local search algorithm called gespar is then applied to approximate the solution to the formulated optimization this approach is shown to be effective through simulations but it also lacks global convergence or statistical accuracy guarantees an alternating minimization method for both pr and cpr is studied in this method is appealing in large scale problems because of computationally inexpensive iterations more importantly proposes specific initialization using which the alternating minimization method is shown to converge linearly in pr and cpr however the number of measurements required to establish this convergence is effectively quadratic in in and the form of the trace minimization argmin trace subject to is proposed for the cpr problem the guarantees of are based on the restricted isometry propn erty of the sensing operator hai xi for sparse matrices in however the analysis is based on construction of dual certificate through an adaptation of the golfing scheme assuming standard gaussian sensing vectors ai and with appropriate choice of the regularization parameter it is shown in that solves the cpr when log 
furthermore this method fails to recover the target sparse and matrix if is dominated by estimation of simultaneously structured matrices through convex relaxations similar to is also studied in where it is shown that these methods do not attain optimal sample complexity more recently assuming that the sparse target has distribution generalized approximate message passing framework is proposed in to solve the cpr problem performance of this method is evaluated through numerical simulations for standard gaussian sensing matrices which show the empirical phase transition for successful estimation occurs at log kd and also the algorithms can have significantly lower runtime compared to some of the competing algorithms including gespar and cprl the phasecode algorithm is proposed in to solve the cpr problem with sensing vectors designed using sparse graphs and techniques adapted from coding theory although phasecode is shown to achieve the optimal sample complexity it lacks robustness guarantees while preparing the final version of the current paper we became aware of which has independently proposed an approach similar to ours to address the cpr problem main results algorithm we propose algorithm outlined in algorithm each stage of the algorithm is convex program for which various efficient numerical solvers exists in the first stage we solve to obtain which is an estimator of the matrix matrix is used in the second stage of the algorithm as the measurements for sparse estimation then expressed by the constraint of depends on an absolute constant that should be sufficiently large algorithm input the measurements the operator and the matrix output the estimate estimation stage argmin trace subject to kw sparse estimation stage argmin subject to cε xψ the result of the estimation stage is generally not simic that is it has larly the sparse estimation stage does not necessarily produce at most nonzero rows and columns and in fact since we have not imposed the posic is not even guaranteed to be tive semidefiniteness constraint in the estimate positive semidefinite psd however we can enforce the or the sparsity structure in postprocessing steps simply by projecting the produced estimate on the set of or onto the desired sets at psd matrices the simple but important observation is that projecting most doubles the estimation error this fact is shown by lemma in section in general setting alternatives there are alternative convex relaxations for the estimation and the sparse estimation stages of algorithm for example can be replaced by its regularized least squares analog argmin kw kbk for an appropriate choice of the regularization parameter similarly instead of we can use an least squares furthermore to perform the estimation and the sparse estimation we can use greedy type algorithms that typically have lower computational costs for example the estimation stage can be performed via the wirtinger flow method proposed in furthermore various greedy compressive sensing algorithms such as the iterative hard thresholding and cosamp can be used to solve the desired sparse estimation to guarantee the accuracy of these compressive sensing algorithms however we might need to adjust the assumption to have the restricted isometry property for vectors with being some small positive integer accuracy guarantees the following theorem shows that any solution of the proposed algorithm is an accurate estimator of theorem suppose that the assumptions and hold with sufficiently small constant then there exist positive 
absolute constants and such that if of the algorithm obeys then any estimate for all and matrices with probability exceeding the proof of theorem is straightforward and is provided in section the main idea is first to show the estimation stage produces an accurate estimate of because this stage can be viewed as standard phase retrieval through lifting we can simply use accuracy guarantees that are already established in the literature in particular we use theorem which established an error bound that holds uniformly for all valid thus we can ensure that is feasible in the sparse estimation stage then the accuracy of the sparse estimation stage can also be established by simple adaptation of the analyses based on the restricted isometry property such as the dependence of the number of measurements and the sparsity of the signal is not explicit in theorem this dependence is absorbed in which must be sufficiently large for assumption to hold considering gaussian matrix the following corollary gives concrete example where the dependence of non through is exposed corollary suppose that the assumptions theorem including hold furthermore suppose that is gaussian matrix with iid entries and log for some absolute constant then any estimate produced by algorithm obeys for all and matrices with probability exceeding for some constant proof it is that if has iid and we have then holds with high probability for example using standard covering argument and union bound shows that if holds for sufficiently large constant then we have for sufficiently small constant with probability exceeding for some constant that depends only on therefore theorem yields the desired result which holds with probability exceeding for some constant depending only on numerical experiments we evaluated the performance of algorithm through some numerical simulations the estimation stage and the sparse estimation stage are implemented using the tfocs package we considered the target signal to be in the support set of of the target signal is selected uniformly at random and the entry values on this support are independently from the noise vector is also gaussian with independent the operator and the matrix are drawn from some gaussian ensembles as described in corollary we measured the relative error kx of achieved by the compared methods over trials with sparsity level varying in the set in the first experiment for each value of the pair that determines the size and are selected from figure illustrates the quantiles of the relative error versus for the mentioned choices of in the second experiment we compared the performance of algorithm to the convex optimization methods that do not exploit the structure of the sensing vectors the setup for the same as in the first experiment except for the size of and we chose log kd and where dre denotes the smallest integer greater than figure illustrates the quantiles of the measured relative errors for algorithm the semidefinite program for and and the argmin subject to figure the empirical quantile of the relative estimation error sparsity for various choices of and with figure the empirical quantile of the relative estimation error sparsity for algorithm and different minimization methods with log kd and which are denoted by sdp and respectively the method did not perform significantly different for other values of in our complementary simulations the relative error for each trial is also overlaid in figure visualize its empirical distribution the empirical performance of the algorithms 
are in agreement with the theoretical results namely in regime where log kd algorithm can produce accurate estimates whereas while the other approaches fail in this regime the sdp and show nearly identical performance the however competes with algorithm for small values of this observation can be explained intuitively by the fact that the succeeds with measurements which for small values of can be sufficiently close to the considered log kd measurements proofs proof of theorem clearly is feasible in because of therefore we can of accurately estimates using existing results on show that any solution minimization in particular we can invoke theorem and section which guarantees that for some positive absolute constants and if holds then holds for all valid thereby for all valid with probability exceeding therefore with the target matrix would be feasible in now it suffices to show that the sparse estimation stage can produce an accurate estimate of recall that by the matrix is restricted isometry for vectors let be matrix that is matrix whose entries except for some submatrix are all zeros applying to the columns of and adding the inequalities yield kxkf kψ because the columns of are also we can repeat the same argument and obtain using the facts that kψ xkf and xψ the inequalities and imply that kxkf xψ kxkf the proof proceeds with an adaptation of the arguments used to prove accuracy of in compressive sensing based on the restricted isometry property see let furthermore let denote the support set of the target define to be matrix that is identical to over the index set and zero elsewhere by optimality of and feasibility of in we have kx kx kx ke kx ke ke where the last line follows from the fact that and have disjoint supports thus we have ke ke ke kf now consider decomposition of as the sum ej such that for the matrices have disjoint support sets of size except perhaps for the last few matrices that might have smaller supports more importantly the partitioning matrices are chosen to have decreasing frobenius norm ke kf ke kf for we have ke ke ke kf ke kf ke where the chain of inequalities follow from the triangle inequality the fact that ke ke by construction the fact that the matrices have disjoint support and satisfy the bound and the fact that and are orthogonal furthermore we have ej eψ eiψ ej where the first term is obtained by the inequality and the summation is obtained by by definition the triangle inequality and the fact that the triangle inequality because are feasible in imply that and eψ xψ furthermore lemma below which is adapted from lemma guarantees that for and we have ke kf ke kf therefore we obtain ke kf ke kf ke kf xx ke kf ke kf ke kf ke kf ke kf ke kf ke kf ke kf ke kf where the chain of inequalities follow from the lower bound in bound the upper bound the bound and the fact that ke kf ke kf ke kf if then we have and thus ke kf adding the above inequality to and applying the triangle then yields the desired result lemma let be matrix obeying then for any pair of matrices and with disjoint supports we have xψ kxkf proof suppose that and have unit frobenius norm using the identity xψ and the fact that and have disjoint supports it follows from that xψ the general result follows immediately as the desired inequality is homogeneous in the frobenius norms of and lemma projected estimator let be closed nonempty subset of normed vector space not necessarily in that obeys suppose that for we have an estimator denotes projection of onto then we have ke kb if kv bk therefore 
because we have proof by definition bk kb ke kb ke acknowledgements this work was supported by onr grant and nsf grants and references jacopo bertolotti elbert van putten christian blum ad lagendijk willem vos and allard mosk imaging through opaque scattering layers nature antoine liutkus david martina sébastien popoff gilles chardon ori katz geoffroy lerosey sylvain gigan laurent daudet and igor carron imaging with nature compressive imaging using multiply scattering medium scientific reports volume article no jul emmanuel candès thomas strohmer and vladislav voroninski phaselift exact and stable signal recovery from magnitude measurements via convex programming communications on pure and applied mathematics emmanuel candès and xiaodong li solving quadratic equations via phaselift when there are about as many equations as unknowns foundations of computational mathematics kueng rauhut and terstiege low rank matrix recovery from rank one measurements applied and computational harmonic analysis in press preprint joel tropp convex recovery of structured signal from independent random linear measurements preprint irène waldspurger alexandre aspremont and stéphane mallat phase recovery maxcut and complex semidefinite programming mathematical programming matthew moravec justin romberg and richard baraniuk compressive phase retrieval in proceedings of spie wavelets xii volume pages yoav shechtman yonina eldar alexander szameit and mordechai segev sparsity based subwavelength imaging with partially incoherent light via quadratic compressed sensing optics express yoav shechtman amir beck and yonina eldar gespar efficient phase retrieval of sparse signals signal processing ieee transactions on praneeth netrapalli prateek jain and sujay sanghavi phase retrieval using alternating minimization in advances in neural information processing systems nips pages xiaodong li and vladislav voroninski sparse signal recovery from quadratic measurements via convex programming siam journal on mathematical analysis henrik ohlsson allen yang roy dong and shankar sastry extension of compressive sensing to the phase retrieval problem in advances in neural information processing systems nips pages david gross recovering matrices from few coefficients in any basis information theory ieee transactions on mar samet oymak amin jalali maryam fazel yonina eldar and babak hassibi simultaneously structured models with application to sparse and matrices information theory ieee transactions on schniter and rangan compressive phase retrieval via generalized approximate message passing signal processing ieee transactions on february ramtin pedarsani kangwook lee and kannan ramchandran phasecode fast and efficient compressive phase retrieval based on codes in communication control and computing allerton annual allerton conference on pages extended preprint mark iwen aditya viswanathan and yang wang robust sparse phase retrieval made easy applied and computational harmonic analysis in press preprint emmanuel candès xiaodong li and mahdi soltanolkotabi phase retrieval via wirtinger flow theory and algorithms information theory ieee transactions on apr thomas blumensath and mike davies iterative hard thresholding for compressed sensing applied and computational harmonic analysis deanna needell and joel tropp cosamp iterative signal recovery from incomplete and inaccurate samples applied and computational harmonic analysis emmanuel candès the restricted isometry property and its implications for compressed sensing comptes rendus 
mathematique richard baraniuk mark davenport ronald devore and michael wakin simple proof of the restricted isometry property for random matrices constructive approximation stephen becker emmanuel candès and michael grant templates for convex cone problems with applications to sparse signal recovery mathematical programming computation 
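To make the lifted measurement model and the post-processing step concrete, here is a small sketch that (i) forms the constrained measurements y_i = |⟨w_i, Ψx⟩|², (ii) projects a symmetric estimate onto the PSD cone (the correction whose error is controlled by the projected-estimator lemma above), and (iii) extracts a rank-one estimate of x from the lifted matrix. It assumes NumPy arrays W (n × m), Psi (m × d), and x (d,), and sketches only the surrounding bookkeeping, not the two convex programs themselves, which would be handed to an off-the-shelf solver.

import numpy as np

def measurements(W, Psi, x, noise=None):
    """Constrained sensing model: y_i = |<w_i, Psi x>|^2 (+ optional additive noise)."""
    y = (W @ (Psi @ x)) ** 2
    return y if noise is None else y + noise

def project_psd(B):
    """Euclidean projection of a symmetric matrix onto the PSD cone."""
    B = (B + B.T) / 2.0
    vals, vecs = np.linalg.eigh(B)
    return (vecs * np.maximum(vals, 0.0)) @ vecs.T

def rank1_estimate(X):
    """Recover x (up to global sign) from a lifted estimate X ~ x x^T via its top eigenpair."""
    vals, vecs = np.linalg.eigh(X)
    return np.sqrt(max(vals[-1], 0.0)) * vecs[:, -1]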
barrier for marginal inference rahul krishnan courant institute new york university simon inria sierra normale paris david sontag courant institute new york university abstract we introduce algorithm for optimizing the trw variational objective over the marginal polytope the algorithm is based on the conditional gradient method and moves pseudomarginals within the marginal polytope through repeated maximum posteriori map calls this modular structure enables us to leverage map solvers both exact and approximate for variational inference and obtains more accurate results than algorithms that optimize over the local consistency relaxation theoretically we bound the for the proposed algorithm despite the trw objective having unbounded gradients at the boundary of the marginal polytope empirically we demonstrate the increased quality of results found by tightening the relaxation over the marginal polytope as well as the spanning tree polytope on synthetic and instances introduction markov random fields mrfs are used in many areas of computer science such as vision and speech inference in these undirected graphical models is generally intractable our work focuses on performing approximate marginal inference by optimizing the tree trw objective wainwright et the trw objective is concave is exact for mrfs and provides an upper bound on the function fast combinatorial solvers for the trw objective exist including belief propagation trbp wainwright et convergent based on geometric programming globerson and jaakkola and dual decomposition jancsary and matz these methods optimize over the set of pairwise consistency constraints also called the local polytope sontag and jaakkola showed that significantly better results could be obtained by optimizing over tighter relaxations of the marginal polytope however deriving algorithm for the trw objective over tighter relaxations of the marginal polytope is challenging instead sontag and jaakkola use the conditional gradient method also called and linear programming solvers to optimize trw over the cycle consistency relaxation rather than optimizing over the cycle relaxation belanger et al optimize the trw objective over the exact marginal polytope then using the linear minimization performed in the inner loop can be shown to correspond to map inference the optimization algorithm has seen increasing use in machine learning thanks in part to its efficient handling of complex constraint sets appearing with structured data jaggi and jaggi however applying to variational inference presents challenges that were never resolved in previous work first the linear minimization performed in the inner loop is computationally expensive either requiring repeatedly solving large linear program as in sontag and jaakkola or performing map inference as in belanger et al second the trw objective involves entropy terms whose gradients go to infinity near the boundary of the feasible set therefore existing convergence guarantees for do not apply third variational inference using trw involves both an outer and inner loop of where the outer loop optimizes the edge appearance probabilities in the trw entropy bound to tighten it neither sontag and jaakkola nor belanger et al explore the effect of optimizing over the edge appearance probabilities although map inference is in general np hard shimony it is often possible to find exact solutions to large instances within reasonable running times sontag et allouche et kappes et moreover as we show in our experiments even approximate map 
solvers can be successfully used within our variational inference algorithm as map solvers improve in their runtime and performance their iterative use could become feasible and as byproduct enable more efficient and accurate marginal inference our work provides fast deterministic alternative to recently proposed algorithms papandreou and yuille hazan and jaakkola ermon et contributions this paper makes several theoretical and practical innovations we propose modification to the algorithm that optimizes over adaptively chosen contractions of the domain and prove its rate of convergence for functions whose gradients can be unbounded at the boundary our algorithm does not require different oracle than standard and could be useful for other convex optimization problems where the gradient is at the boundary we instantiate the algorithm for approximate marginal inference over the marginal polytope with the trw objective with an exact map oracle we obtain the first provably convergent algorithm for the optimization of the trw objective over the marginal polytope which had remained an open problem to the best of our knowledge traditional proof techniques of convergence for first order methods fail as the gradient of the trw objective is not lipschitz continuous we develop several heuristics to make the algorithm practical variant of frankwolfe that reuses previously found integer assignments thereby reducing the need for new approximate map calls the use of local search between map calls and significant of computations between subsequent steps of optimizing over the spanning tree polytope we perform an extensive experimental evaluation on both synthetic and inference tasks background markov random fields mrfs are undirected probabilistic graphical models where the probability distribution factorizes over cliques in the graph we consider marginal inference on pairwise mrfs with random variables xn where each variable takes discrete states xi vali let be the markov network with an undirected edge for every two variables xi and xj that are connected together let refer to the set of neighbors of variable xi we organize the edge θij xi xj for all possible values of xi vali xj valj in the vector θij and similarly for the node vector θi we regroup these in the overall we introduce similar grouping for the marginal vector for example µi xi gives the vector coordinate of the marginal vector corresponding to the assignment xi to variable xi be the partition function for the tree objective wainwright et let mrf and be the set of all valid marginal vectors the marginal polytope the maximization of the trw objective gives the following upper bound on the log partition function min max hθ log where the trw entropy is trw ρij µi ρij µij µi µi xi log µi xi xi ij is the spanning tree polytope the convex hull of edge indicator vectors of all possible spanning trees of the graph elements of specify the probability of an edge being present under specific distribution over spanning trees is difficult to optimize over and most trw algorithms optimize over relaxation called the local consistency polytope xi µi xi xi µij xi xj µj xj xj µij xi xj µi xi is globally concave function of over assuming that is the trw objective trw obtained from valid distribution over spanning trees of the graph fw algorithm in recent years the aka conditional gradient algorithm has gained popularity in machine learning jaggi for the optimization of convex functions over compact domains denoted the algorithm is used to solve by iteratively 
finding good descent vertex by solving the linear subproblem arg si fw oracle and then taking convex step towards this vertex γs for suitably chosen the algorithm remains within the feasible set is projection free is invariant to affine transformations of the domain and can be implemented in memory efficient manner moreover the fw gap provides an upper bound on the suboptimality of the iterate the primal convergence of the algorithm is given by thm in jaggi restated here for convenience for the iterates satisfy where cf is called the curvature constant under the assumption that is on we can bound it as cf with marginal inference with to optimize maxµ trw where the perturbed potentials correspond the linear subproblem becomes arg maxµ elements of are of the form θc xc kc to the gradient of trw with respect to log µc xc evaluated at the pseudomarginals current location in where kc is the coefficient of the entropy for the term in the fw linear subproblem here is thus equivalent to performing map inference in graphical model with potentials belanger et as the vertices of the marginal polytope are in correspondence with valid joint assignments to the random variables of the mrf and the solution of linear program is always achieved at vertex of the polytope the trw objective does not have lipschitz continuous gradient over and so standard convergence proofs for do not hold optimizing over contractions of the marginal polytope motivation we wish to use the fewest possible map calls and avoid regions near the boundary where the unbounded curvature of the function slows down convergence viable option to address is through the use of correction steps where after step one optimizes over the polytope defined by previously visited vertices of called the fcfw algorithm and proven to be linearly convergence for strongly convex objectives and jaggi this does not require additional map calls however we found see sec that when optimizing the trw objective over performing correction steps can surprisingly hurt performance this leaves us in dilemma correction steps enable decreasing the objective without additional map calls but they can also slow global progress since iterates after correction sometimes lie close to the boundary of the polytope where the fw directions become less informative in manner akin to barrier methods and to garber and hazan local linear oracle our proposed solution maintains the iterates within contraction of the polytope this gives us most of the mileage obtained from performing the correction steps without suffering the consequences of venturing too close to the boundary of the polytope we prove global convergence rate for the iterates with respect to the true solution over the full polytope for the approach we adopt we describe convergent algorithms to optimize trw to deal with the issue of unbounded gradients at the boundary is to perform within contraction of the marginal polytope given by mδ for with either fixed or an adaptive definition contraction polytope mδ where is the vector representing the uniform distribution marginal vectors that lie within mδ are bounded away from zero as all the components of are strictly positive denoting as the set of vertices of mδ as the set of vertices of and the key insight that enables our novel approach is that arg min linear minimization over mδ arg min definition of arg min vi run map solver and shift vertex lkx for notice that the dual norm is needed here algorithm updates to after map call adaptive variant at iteration assuming are 
defined and has been computed compute compute fw gap compute gu compute uniform gap if gu then let compute new proposal for if then min shrink by at least factor of two if proposal is smaller end if end if and set if it was not updated therefore to solve the fw subproblem over mδ we can run as usual map solver and simply shift the resulting vertex of towards to obtain vertex of mδ our solution to optimize over restrictions of the polytope is more broadly applicable to the optimization problem defined below with satisfying prop satisfied by the trw objective in order to get convergence rates problem solve where is compact convex set and is convex and continuously differentiable on the relative interior of property controlled growth of lipschitz constant over dδ we define dδ for fixed in the relative interior of we suppose that there exists fixed and such that for any has bounded lipschitz constant lδ lδ dδ fixed the first algorithm fixes value for and performs the optimization over dδ the following theorem bounds the of the iterates with respect to the optimum over theorem suboptimality bound for algorithm let satisfy the properties in prob and prop and suppose further that is finite on the boundary of then the use of for realizes over bounded as diam where is the optimal solution in cδ lδ dδ and is the modulus of continuity function of the uniformly continuous in particular as the full proof is given in app the first term of the bound comes from the standard convergence analysis of the of relative to the optimum over dδ as in and using prop the second term arises by bounding with cleverly chosen dδ as is optimal in dδ we pick and note that diam as is continuous on compact set it is uniformly continuous and we thus have diam with its modulus of continuity function adaptive the second variant to solve iteratively perform fw steps over dδ but also decreases adaptively the update schedule for is given in alg and is motivated by the convergence proof the idea is to ensure that the fw gap over dδ is always at least half the fw gap over relating the progress over dδ with the one over it turns out that gu where the uniform gap gu quantifies the decrease of the function when contracting towards when gu is negative and large compared to the fw gap we need to shrink see step in alg to ensure that the direction is sufficient descent direction we can show that the algorithm converges to the global solution as follows theorem global convergence for variant over for function satisfying the properties in prob and prop the of the iterates obtained by running the fw updates over dδ with updated according to alg is bounded as full proof with precise rate and constants is given in app the hk traverses three stages with an overall rate as above the updates to as in alg enable us algorithm approximate marginal inference over solving here is the negative trw objective function init inputs probabilities init initial contraction of polytope inner loop stopping criterion fixed reference point in the interior of let init let visited vertices initialize the algorithm at the uniform distribution for max rho its do fw outer loop to optimize over for maxits do fcfw inner loop to optimize over compute gradient let let arg min vi run map solver to compute fw vertex compute inner loop fw duality gap if then break fcfw inner loop is end if for run alg to modify let and quantities arg min fw step with line search update correction polytope correction optional correction step vsearch localsearch optional fast map solver 
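As a rough illustration of two steps used by the algorithms above — shifting a MAP vertex toward the uniform point u to obtain a vertex of the contraction Mδ, and shrinking δ when the uniform gap indicates that the contracted direction is no longer a sufficient descent direction — the following sketch uses assumed variable names and an assumed shrinking test; it is not the authors' exact rule.

import numpy as np

def shift_vertex(v, u, delta):
    """Map a vertex v of the marginal polytope M to the corresponding vertex of
    the contraction M_delta = {(1 - delta) * mu + delta * u : mu in M}."""
    return (1.0 - delta) * v + delta * u

def update_delta(delta, fw_gap, uniform_gap, kappa=0.5):
    """Adaptive-delta heuristic (a sketch, not the paper's exact schedule):
    shrink delta when the change obtained by contracting toward u (uniform_gap)
    is negative and large relative to the FW gap, so that the shifted direction
    remains a sufficient descent direction."""
    if uniform_gap < 0 and abs(uniform_gap) > kappa * fw_gap:
        delta = delta / 2.0   # shrink by at least a factor of two
    return delta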
update correction polytope with vertices from localsearch vsearch end for ρv minspantree edgesmi fw vertex of the spanning tree polytope ρv fixed schedule fw update for kept in relint for fcfw inner loop if max rho its then correction end for return and to upper bound the duality gap over as function of the duality gap in dδ and lower bound the value of as function of hk applying the standard descent lemma with the lipschitz constant on the gradient of the form lδ prop and replacing by its bound in hk we get the recurrence hk solving this gives us the desired bound is akin to and the application to the trw objective minµ has been previously shown wainwright et london strong convexity of et the gradient of the trw objective is lipschitz continuous over mδ since all marginals are strictly positive its growth for prop can be bounded with as we show in app this gives rate of convergence of for the variant which interestingly is typical rate for convex optimization the hidden constant is of the order the modulus of continuity for the trw objective is close to linear it is almost lipschitz function and its constant is instead of the order algorithm alg describes the pseudocode for our proposed algorithm to do marginal inference with minspantree finds the minimum spanning tree of weighted graph and trw to edgesmi computes the mutual information of edges of from the pseudomarginals in perform fw updates over as in alg in wainwright et al it is worthwhile to note that our approach uses three levels of for the tightening optimization of over over and to perform to perform approximate marginal inference for the optimization of the correction steps lines and we detail few heuristics that aid practicality fast local search fast methods for map inference such as iterated conditional modes besag offer cheap low cost alternative to more expensive combinatorial map solver we the component ij has value µi µj µij warm start the icm solver with the last found vertex of the marginal polytope the subroutine localsearch alg in appendix performs fixed number of fw updates to the pseudomarginals using icm as the approximate map solver over the vertices of fcfw algorithm as the iterations of fw progress we keep track of the vertices of the marginal polytope found by alg in the set we make use of these vertices in the correction subroutine alg in appendix which the objective function over contraction of the convex hull of the elements of called the correction polytope in alg is initialized to the uniform distribution which is guaranteed to be in and mδ after updating we set to the approximate minimizer in the correction polytope the intuition is that changing by small amount may not substantially modify the optimal for the new and that the new optimum might be in the convex hull of the vertices found thus far if so correction will be able to find it without resorting to any additional map calls this encourages the map solver to search for new unique vertices instead of rediscovering old ones approximate map solvers we can swap out the exact map solver with an approximate map solver the primal objective plus the approximate duality gap may no longer be an upper bound on the function map solvers could be considered to optimize over an inner bound to the marginal polytope furthermore the gap over may be negative if the approximate map solver fails to find direction of descent since requires that the gap be positive in alg we take the max over the last gap obtained over the correction polytope which is always and the 
computed gap over Mδ, as a heuristic. Theoretically, one could obtain convergence rates similar to those in the theorems above using an approximate MAP solver with a multiplicative guarantee on the gap, as was done previously for FW-style algorithms; with an additive error guarantee on the MAP solution, one can prove similar rates up to a corresponding suboptimality error. Even if the approximate MAP solver does not provide an approximation guarantee, if it returns an upper bound on the value of the MAP assignment (as do solvers based on integer linear programming, or those of Sontag et al.), one can use this bound to obtain an upper bound on log Z (see the appendix).

Experimental results. Setup: the error in marginals is denoted ζµ. When using exact MAP inference, the error in log Z, denoted ζlogZ, is computed by adding the duality gap to the primal objective, since this guarantees an upper bound; for approximate MAP inference we plot the primal objective instead. We use an initialization of ρ computed with the matrix-tree theorem (cf. Sontag and Jaakkola; Koo et al.). We perform a fixed number of updates to ρ, optimize each inner problem to a fixed duality-gap threshold, and always perform correction steps. We use local search only for the real-world instances. We use the implementations of TRBP and of the junction-tree algorithm in libDAI (Mooij) to compute exact marginals. Unless specified otherwise, we compute marginals by optimizing the TRW objective using the adaptive-δ variant of the algorithm, denoted in the figures as Mδ.

MAP solvers: for approximate MAP we run three solvers in parallel — QPBO (Kolmogorov and Rother), the solver of Boykov and Kolmogorov / Kolmogorov, and ICM (Besag) — through OpenGM (Andres et al.), and use the result that realizes the highest energy. For exact inference we use Gurobi (Gurobi Optimization) or the solver of Allouche et al.

Test cases: all of our test cases are binary pairwise MRFs. Synthetic cliques: the same setup as Sontag and Jaakkola, with sets of instances whose coupling strengths are drawn from a fixed range. Synthetic grids: trials with grids in which we sample θi for nodes and θij for edges. Restricted Boltzmann machines (RBMs): from the Probabilistic Inference Challenge. Horses: large MRFs representing images from the Weizmann horse data (Borenstein and Ullman), with potentials learned by Domke. Chinese characters: an image-completion task from the KAIST database, compiled in OpenGM by Andres et al., with potentials learned using decision tree fields (Nowozin et al.); the MRF is not a grid due to skip edges that tie nodes at various offsets, and the potentials are a combination of submodular and supermodular terms, making this a harder task for inference algorithms.

[Figure: synthetic experiments — error in marginals (ζµ) and error in log Z (ζlogZ) versus the number of MAP calls, comparing exact and approximate MAP (on an RBM and on grids), optimization over L versus Mδ (on cliques and grids), and optimization over ρ (on cliques); curves include no-correction, perturb-and-MAP, and ρopt variants.] In one panel we unravel MAP calls across the updates to ρ. The RBM panel corresponds to a single RBM (not an aggregate over trials), where for approximate MAP we plot the absolute error between the primal objective and log Z (not guaranteed to be an upper bound).
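As a small illustration of how the reported quantities could be computed — assuming ζµ is a mean absolute deviation of the node marginals (the exact normalization is not specified here) and ζlogZ compares the guaranteed upper bound (primal plus duality gap) against the true log-partition function:

import numpy as np

def marginal_error(mu_hat, mu_true):
    """zeta_mu: average absolute error of the estimated node marginals
    (assumed form; the paper's exact normalization may differ)."""
    return np.mean(np.abs(np.asarray(mu_hat) - np.asarray(mu_true)))

def logZ_error(trw_primal, duality_gap, logZ_true):
    """zeta_logZ with an exact MAP oracle: primal value plus duality gap is a
    guaranteed upper bound on log Z, so the error is measured against it."""
    return (trw_primal + duality_gap) - logZ_true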
On the optimization over L versus Mδ: we compare the performance of the algorithm when optimizing over L with and without correction, optimizing over Mδ with a fixed δ, and optimizing over Mδ using the adaptive-δ variant; these plots are averaged across all trials. For the first iteration of optimizing over ρ, we show the error as a function of the number of MAP calls, since this is the bottleneck for large MRFs. The corresponding panels depict the results of this optimization aggregated across trials. We find that all variants settle on the same average error. The adaptive-δ variant converges fastest on average, followed by the fixed-δ variant. Despite relatively quick convergence for L with no correction on the grids, we found that correction was crucial to reducing the number of MAP calls in subsequent steps of inference, after updates to ρ. As highlighted earlier, correction steps on L (in blue) worsen convergence, an effect brought about by iterates wandering too close to the boundary of the polytope.

On the applicability of approximate MAP solvers. Synthetic grids: the figure depicts the accuracy of approximate versus exact MAP solvers, aggregated across trials for the grids. The results using approximate MAP inference are competitive with those of exact inference, even as the optimization is tightened over ρ. This is an encouraging result, since it indicates that one can achieve high-quality marginals through the use of relatively cheap approximate MAP oracles. RBMs: as in Salakhutdinov, we observe for RBMs that the bound provided by TRW over the local polytope Lδ is loose and does not get better when optimizing over ρ. As the figure depicts for a single RBM, optimizing over Mδ realizes significant gains in the upper bound on log Z, which improve with updates to ρ; the gains are preserved with the use of approximate MAP solvers. Note that there are also fast approximate MAP solvers designed specifically for RBMs (Wang et al.). Horses: see the figure below (right). The models are close to submodular, and the local relaxation is a good approximation to the marginal polytope. Our marginals are visually similar to those obtained by TRBP, and our algorithm is able to scale to large instances by using approximate MAP solvers.

[Figure: results on real-world test cases — ground truth, MAP assignment, TRBP marginals, FW marginals across iterations, and marginals after optimizing ρ, for the horses and Chinese characters tasks.] FW(i) corresponds to the final marginals at the i-th iteration of optimizing over ρ; the area highlighted on the Chinese characters depicts the region of uncertainty.

On the importance of optimizing over ρ. Synthetic cliques: we study the effect of tightening over the spanning tree polytope as a function of coupling strength. We consider the ζµ and ζlogZ obtained for the final marginals before updating ρ, and compare them to the values obtained after optimizing over ρ (marked ρopt). The optimization over ρ has little effect on TRW optimized over Lδ; for optimization over Mδ, updating ρ realizes better marginals and a better bound on log Z, over and above those obtained in Sontag and Jaakkola. Chinese characters: the figure (left) displays marginals across iterations of optimizing over ρ. The submodular and supermodular potentials lead to frustrated models for which Lδ is very loose, which results in TRBP obtaining poor estimates. Our method produces reasonable marginals even before the first update to ρ, and these improve with tightening over ρ.

Related work: for marginal inference with MAP calls, Hazan and Jaakkola estimate log Z by averaging MAP estimates obtained on randomly perturbed inflated graphs. Our implementation of the method
performed well in approximating log but the marginals estimated by fixing the value of each random variable and estimating log for the resulting graph were less accurate than our method fig discussion we introduce the first provably convergent algorithm for the trw objective over the marginal polytope under the assumption of exact map oracles we quantify the gains obtained both from marginal inference over and from tightening over the spanning tree polytope we give heuristics that improve the scalability of when used for marginal inference the runtime cost of iterative map calls reasonable rule of thumb is to assume an approximate map call takes roughly the same time as run of trbp is worthwhile particularly in cases such as the chinese characters where is loose specifically our algorithm is appropriate for domains where marginal inference is hard but there exist efficient map solvers capable of handling potentials code is available at https our work creates flexible modular framework for optimizing broad class of variational objectives not simply trw with guarantees of convergence we hope that this will encourage more research on building better entropy approximations the framework we adopt is more generally applicable to optimizing functions whose gradients tend to infinity at the boundary of the domain our method to deal with gradients that diverge at the boundary bears resemblance to barrier functions used in interior point methods insofar as they bound the solution away from the constraints iteratively decreasing in our framework can be compared to decreasing the strength of the barrier enabling the iterates to get closer to the facets of the polytope although its worthwhile to note that we have an adaptive method of doing so acknowledgements rk and ds gratefully acknowledge the support of the darpa probabilistic programming for advancing machine learning ppaml program under afrl prime contract no we run trbp for iterations using damping the algorithm converges with max norm difference between consecutive iterates of tightening over did not significantly change the results of trbp references allouche de givry and schiex an open source exact cost function network solver andres and kappes opengm library for discrete graphical models june belanger sheldon and mccallum marginal inference in mrfs using nips workshop on greedy optimization and friends besag on the statistical analysis of dirty pictures stat soc series borenstein and ullman segmentation in eccv boykov and kolmogorov an experimental comparison of algorithms for energy minimization in vision tpami domke learning graphical model parameters with approximate marginal inference tpami ermon gomes sabharwal and selman taming the curse of dimensionality discrete integration by hashing and optimization in icml garber and hazan linearly convergent conditional gradient algorithm with applications to online and stochastic optimization arxiv preprint globerson and jaakkola convergent propagation algorithms via oriented trees in uai gurobi optimization gurobi optimizer reference manual hazan and jaakkola on the partition function and random maximum perturbations in icml jaggi revisiting sparse convex optimization in icml jancsary and matz convergent decomposition solvers for free energies in aistats kappes et al comparative study of modern inference techniques for discrete energy minimization problems in cvpr kolmogorov convergent message passing for energy minimization tpami kolmogorov and rother minimizing nonsubmodular functions with 
graph review tpami koo globerson carreras and collins structured prediction models via the theorem in and jaggi on the global linear convergence of optimization variants in nips jaggi schmidt and pletscher optimization for structural svms in icml london huang and getoor the benefits of learning with strongly convex approximate inference in icml mooij libdai free and open source library for discrete approximate inference in graphical models jmlr nowozin rother bagon sharp yao and kohli decision tree fields in iccv papandreou and yuille random fields using discrete optimization to learn and sample from energy models in iccv salakhutdinov learning and evaluating boltzmann machines technical report shimony finding maps for belief networks is artificial intelligence sontag and jaakkola new outer bounds on the marginal polytope in nips sontag meltzer globerson weiss and jaakkola tightening lp relaxations for map using in uai wainwright jaakkola and willsky new class of upper bounds on the log partition function ieee transactions on information theory wang frostig liang and manning relaxations for inference in restricted boltzmann machines in iclr workshop 
learning theory and algorithms for forecasting time series vitaly kuznetsov courant institute new york ny mehryar mohri courant institute and google research new york ny vitaly mohri abstract we present learning bounds for the general scenario of nonstationary stochastic processes our learning guarantees are expressed in terms of measure of sequential complexity and discrepancy measure that can be estimated from data under some mild assumptions we use our learning bounds to devise new algorithms for time series forecasting for which we report some preliminary experimental results introduction time series forecasting plays crucial role in number of domains ranging from weather forecasting and earthquake prediction to applications in economics and finance the classical statistical approaches to time series analysis are based on generative models such as the autoregressive moving average arma models or their integrated versions arima and several other extensions engle bollerslev brockwell and davis box and jenkins hamilton most of these models rely on strong assumptions about the noise terms often assumed to be random variables sampled from gaussian distribution and the guarantees provided in their support are only asymptotic an alternative approach to time series analysis consists of extending the standard statistical learning theory framework to that of stochastic processes in much of this work the process is assumed to be stationary and suitably mixing doukhan early work along this approach consisted of the bounds for binary classification given by yu under the assumption of stationarity and under the same assumptions meir presented bounds in terms of covering numbers for regression losses and mohri and rostamizadeh proved general rademacher complexity learning bounds vidyasagar showed that pac learning algorithms in the setting preserve their pac learning property in the stationary scenario similar result was proven by shalizi and kontorovitch for mixtures of processes and by berti and rigo and pestov for exchangeable random variables alquier and wintenberger and alquier et al also established learning guarantees under weak dependence and stationarity number of bounds have also been derived for the stationary mixing setting lozano et al studied the convergence of regularized boosting mohri and rostamizadeh gave generalization bounds for stable algorithms for and stationary processes steinwart and christmann proved fast learning rates for regularized algorithms with stationary sequences and modha and masry gave guarantees for certain classes of models under the same assumptions however stationarity and mixing are often not valid assumptions for example even for markov chains which are among the most widely used types of stochastic processes in applications stationarity does not hold unless the markov chain is started with an equilibrium distribution similarly long memory models such as arfima may not be mixing or mixing may be arbitrarily slow baillie in fact it is possible to construct first order autoregressive processes that are not mixing andrews additionally the mixing assumption is defined only in terms of the distribution of the underlying stochastic process and ignores the loss function and the hypothesis set used this suggests that mixing may not be the right property to characterize learning in the setting of stochastic processes number of attempts have been made to relax the assumptions of stationarity and mixing adams and nobel proved asymptotic guarantees for stationary 
ergodic sequences agarwal and duchi gave generalization bounds for asymptotically stationary mixing processes in the case of stable learning algorithms kuznetsov and mohri established learning guarantees for fully and processes in this paper we consider the general case of processes we are not aware of any prior work providing generalization bounds in this setting in fact our bounds appear to be novel even when the process is stationary but not mixing the learning guarantees that we present hold for both bounded and unbounded memory models deriving generalization bounds for unbounded memory models even in the stationary mixing case was an open question prior to our work meir our guarantees cover the majority of approaches used in practice including various autoregressive and state space models the key ingredients of our generalization bounds are measure of sequential complexity expected sequential covering number or sequential rademacher complexity rakhlin et and measure of discrepancy between the sample and target distributions kuznetsov and mohri also give generalization bounds in terms of discrepancy however unlike the result of kuznetsov and mohri our analysis does not require any mixing assumptions which are hard to verify in practice more importantly under some additional mild assumption the discrepancy measure that we propose can be estimated from data which leads to learning guarantees for case we devise new algorithms for time series forecasting that benefit from our datadependent guarantees the parameters of generative models such as arima are typically estimated via the maximum likelihood technique which often leads to optimization problems in contrast our objective is convex and leads to an optimization problem with unique global solution that can be found efficiently another issue with standard generative models is that they address nonstationarity in the data via differencing transformation which does not always lead to stationary process in contrast we address the problem of in principled way using our learning guarantees the rest of this paper is organized as follows the formal definition of the time series forecasting learning scenario as well as that of several key concepts is given in section in section we introduce and prove our new generalization bounds in section we give learning bounds based on the empirical discrepancy these results combined with novel analysis of hypotheses for time series forecasting appendix are used to devise new forecasting algorithms in section in appendix we report the results of preliminary experiments using these algorithms preliminaries we consider the following general time series prediction setting where the learner receives realization xt yt of some stochastic process with xt yt the objective of the learner is to select out of specified family hypothesis that achieves small generalization error xt yt zt conditioned on observed data where is given loss function the generalization error that we consider in this work is finer measure of the generalization ability than the averaged generalization error xt yt xt yt zt since it only takes into consideration the realized history of the stochastic process and does not average over the set of all possible histories the results that we present in this paper also apply to the setting where the time parameter can take values and prediction lag is an arbitrary number that is the error is defined by xt yt zt but for notational simplicity we set our setup covers larger number of scenarios commonly 
used in practice the case corresponds to large class of autoregressive models taking leads to growing memory models which in particular include state space models more generally may contain both the history of the process yt and some additional side information to simplify the notation in the rest of the paper we will use the shorter notation for any and introduce the family containing such functions we will assume bounded loss function that is for all for some finally we will use the shorthand zba to denote sequence of random variables za zb the key quantity of interest in the analysis of generalization is the following supremum of the empirical process defined as follows sup zt qt zt where qt are real numbers which in the standard learning scenarios are chosen to be uniform in our general setting different zt may follow different distributions thus distinct weights could be assigned to the errors made on different sample points depending on their relevance to forecasting the future zt the generalization bounds that we present below are for an arbitrary sequence qt which in particular covers the case of uniform weights remarkably our bounds do not even require the of our generalization bounds are expressed in terms of measures of sequential complexity such as expected sequential covering number or sequential rademacher complexity rakhlin et we give brief overview of the notion of sequential covering number and refer the reader to the aforementioned reference for further details we adopt the following definition of complete binary tree complete binary tree is sequence zt of mappings zt path in the tree is σt to simplify the notation we will write zt instead of zt even though zt depends only on the first elements of the following definition generalizes the classical notion of covering numbers to sequential setting set of trees of depth is sequential with respect to norm of function class on tree of depth if for all and all there is such that vt zt where kq is the dual norm the sequential covering number np of function class on given tree is defined to be the size of the minimal sequential cover the maximal covering number is then taken to be np supz np one can check that in the case of uniform weights this definition coincides with the standard definition of sequential covering numbers note that this is purely combinatorial notion of complexity which ignores the distribution of the process in the given learning problem sequential covering numbers can be defined as follows given stochastic process distributed according to the distribution with pt denoting the conditional distribution at time we sample tree of depth according to the following procedure draw two independent samples from in the left child of the root draw according to and in the right child according to more generally for node that can be reached by path σt we draw zt according to pt where st zt and st let denote the tree formed using zt and define the expected covering number to be np where denotes the distribution of in similar manner one can define other measures of complexity such as sequential rademacher complexity and the littlestone dimension rakhlin et as well as their counterparts rakhlin et the final ingredient needed for expressing our learning guarantees is the notion of discrepancy between target distribution and the distribution of the sample sup zt qt zt the discrepancy is natural measure of the of the stochastic process with respect to both the loss function and the hypothesis set in particular note that if the 
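As a concrete reading of the weighted error appearing inside the supremum above, the following sketch computes the q-weighted empirical loss of a fixed hypothesis along the realized sample path; the squared loss and the hypothesis signature are assumptions made for the illustration.

import numpy as np

def weighted_empirical_error(h, X, y, q, loss=lambda p, t: (p - t) ** 2):
    """sum_t q_t * loss(h(x_t), y_t) along the realized sample path.
    Uniform weights q_t = 1/T recover the usual empirical error; non-uniform
    q_t can emphasize the sample points most relevant to forecasting the future."""
    preds = np.array([h(x) for x in X])
    return float(np.sum(np.asarray(q) * loss(preds, np.asarray(y))))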
process is then we simply have provided that qt form probability distribution it is also possible to give bounds on in terms of other natural distances between distribution for instance pinsker inequality yields pt pt pt qt pt pt qt pt tv where ktv is the total variation distance and the relative entropy the condipt tional distribution of and qt pt the mixture of the sample marginals alternatively if the target distribution at lag pt is stationary distribution of an asymptotically stationary process agarwal and duchi kuznetsov and mohri then for qt we have mx kp ktv where sups supz kp ktv is the coefficient of asymptotic stationarity the process is asymptotically stationary if however the most important property of the discrepancy is that as shown later in section it can be estimated from data under some additional mild assumptions kuznetsov and mohri also give generalization bounds for mixing processes in terms of related notion of discrepancy it is not known if the discrepancy measure used in kuznetsov and mohri can be estimated from data generalization bounds in this section we prove new generalization bounds for forecasting time series the first step consists of using decoupled tangent sequences to establish concentration results for the supremum of the empirical process given sequence of random variables we say that is decoupled tangent sequence if is distributed according to and is independent of it is always possible to construct such sequence of random variables de la and the next theorem is the main result of this section theorem let be sequence of random variables distributed according to fix then the following holds exp proof the first step is to observe that since the difference of the suprema is upper bounded by the supremum of the difference it suffices to bound the probability of the following event sup qt zt zt by markov inequality for any the following inequality holds sup qt zt zt exp exp sup qt zt zt since is tangent sequence the following equalities hold zt zt using these equalities and jensen inequality we obtain the following exp sup qt zt hx exp sup qt zt exp sup qt zt where the last expectation is taken over the joint measure of and applying lemma appendix we can further bound this expectation by exp sup σt qt zt exp sup σt qt sup qt zt exp sup σt qt exp sup σt qt zt σt qt zt exp sup where for the second inequality we used young inequality and for the last equality we used symmetry given let denote the minimal with respect to the of on then the following bound holds sup σt qt zt max σt qt ct by the monotonicity of the exponential function exp sup σt qt zt exp exp max σt qt ct exp exp σt qt ct since ct depends only on σt by hoeffding bound tx exp σt qt ct exp σt qt ct exp qt ct σt tx exp σt qt ct exp and iterating this inequality and using the union bound we obtain the following sup qt zt exp optimizing over completes the proof an immediate consequence of theorem is the following result corollary for any with probability at least for all and all zt qt zt log we are not aware of other finite sample bounds in case in fact our bounds appear to be novel even in the stationary case using chaining techniques bounds theorem and corollary can be further improved and we will present these results in the full version of this paper while rakhlin et al give high probability bounds for different quantity than the quantity of interest in time series prediction sup qt zt zt their analysis of this quantity can also be used in our context to derive high probability bounds for however 
this approach results in bounds that are in terms of purely combinatorial notions such as maximal sequential covering numbers while at first sight this may seem as minor technical detail the distinction is crucial in the setting of time series prediction consider the following example let be drawn from uniform distribution on and zt with being distribution over such that if and otherwise let be defined by then one can check that while the bounds of theorem and corollary highlight the fact that the task of time series prediction lies in between the familiar scenario and adversarial learning setting however the key component of our learning guarantees is the discrepancy term note that in the general case the bounds of theorem may not converge to zero due to the discrepancy between the target and sample distributions this is also consistent with the lower bounds of barve and long that we discuss in more detail in section however convergence can be established in some special cases in the case our bounds reduce to the standard covering numbers learning guarantees in the drifting scenario with being sequence of independent random variables our discrepancy measure coincides with the one used and studied in mohri and medina convergence can also be established in asymptotically stationary and stationary mixing cases however as we show in section the most important advantage of our bounds is that the discrepancy measure we use can be estimated from data estimating discrepancy in section we showed that the discrepancy is crucial for forecasting time series in particular if we could select distribution over the sample that would minimize the discrepancy and use it to weight training points then we would have better learning guarantee for an algorithm trained on this weighted sample in some special cases the discrepancy can be computed analytically however in general we do not have access to the distribution of and hence we need to estimate the discrepancy from the data furthermore in practice we never observe zt and it is not possible to estimate without some further assumptions one natural assumption is that the distribution pt of zt does not change drastically with on average under this assumption the last observations ztt are effectively drawn from the distribution close to pt more precisely we can write zt qt zt sup sup zt zt we will assume that the second term denoted by is sufficiently small and will show that the first term can be estimated from data but we first note that our assumption is necessary for learning in this setting observe that sup zt zr sup zt pt ktv for all therefore we must have sup zt zt where supt ktv barve and long showed that is lower bound on the generalization error in the setting of binary classification where is sequence of independent but not identically distributed random variables drifting this setting is special case of the more general scenario that we are considering the following result shows that we can estimate the first term in the upper bound on theorem let be sequence of random variables then for any with probability at least the following holds for all sup pt qt zt sup pt qt zt where kq log the last points and where is the uniform distribution over the proof of this result is given in appendix theorem and theorem combined with the union bound yield the following result corollary let be sequence of random variables then for any with probability at least the following holds for all and all zt kq log qt zt supf where pt qt zt algorithms in this section we use 
our learning guarantees to devise algorithms for forecasting time series we consider broad family of hypothesis classes with regression losses we present the full analysis of this setting in appendix including novel bounds on the sequential rademacher complexity the learning bounds can be generalized to hold uniformly of theorem over at the price of an additional term in kq we prove this result in theorem appendix suppose is the squared loss and kwkh where is feature mapping from to hilbert space by lemma appendix we can bound the complexity term in our generalization bounds by λr kq where is pds kernel associated with such that supx and is the uniform distribution over the sample then we can formulate joint optimization problem over both and based on the learning guarantee of theorem which holds uniformly over all min qt xt yt dt qt kwkh kq pt here we have upper bounded the empirical discrepancy term by dt qt with each dt defined pt by ps xs ys xt yt each dt can be precomputed using for general loss functions the approach only guarantees convergence to stationary point however for the squared loss our problem can be cast as an instance of the trust region problem which can be solved globally using the dca algorithm of tao and an note that problem is not jointly convex in and however using standard duality results it can be rewritten as follows min max αj αk qj qk kjk kq where dt is the kernel matrix and where denotes the hadamard product we further upper bound kq by kq then using the change of variable rj qj and restricting the problem domain yields the following optimization problem min max αj αk rj rk kjk where rt the optimization problem is convex since is convex set the first term in is convex as maximum of convex quadratic functions of and the second term is quadratic in to show that the last term is convex on we observe that the function is convex for and use the fact that sum of convex functions is convex note that enforcing also constrains to be closer to the uniform distribution which is consistent with our learning guarantees this problem can be solved using standard descent methods where at each iteration we solve standard qp in which admits solution parameters and are selected through an alternative simpler algorithm based on the bounds of corollary consists of first finding distribution minimizing the regularized discrepancy and then using that to find hypothesis minimizing the regularized weighted empirical risk this leads to the following twostage procedure first we find solution of the following convex optimization problem pt qt xt yt kq min sup where and are parameters that can be selected via our generalization bounds hold for arbitrary weights but we restrict them to being positive sequences note that other regularization terms such as and kq from the bound of corollary can be incorporated in the optimization problem but we discard them to minimize the number of parameters this problem can be solved using standard descent optimization methods where at each step we use to evaluate the supremum over alternatively one can upper bound the suprept mum by qt dt and then solve the resulting optimization problem the solution of is then used to solve the following weighted kernel ridge regression problem min xt yt note that in order to guarantee the convexity of this problem we require conclusion we presented general theoretical analysis of learning in the broad scenario of nonmixing processes the realistic setting for variety of applications we discussed in detail several 
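A minimal sketch of the second stage of the two-stage procedure described above: weighted kernel ridge regression with the weights q returned by the first stage. The Gaussian kernel, the regularization constant, and the particular dual system (QK + λI)α = Qy solved here are illustrative assumptions; any PDS kernel satisfying the boundedness condition would do.

import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - y_j||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def weighted_krr_fit(X, y, q, lam=1.0, gamma=1.0):
    """Minimize sum_t q_t (h(x_t) - y_t)^2 + lam * ||h||_H^2 over the RKHS.
    Writing h = sum_t alpha_t K(x_t, .), a sufficient optimality condition is
    (Q K + lam I) alpha = Q y, where Q = diag(q)."""
    K = gaussian_kernel(X, X, gamma)
    Q = np.diag(q)
    alpha = np.linalg.solve(Q @ K + lam * np.eye(len(y)), Q @ y)
    return alpha

def weighted_krr_predict(alpha, X_train, X_new, gamma=1.0):
    """Evaluate the learned hypothesis at new points."""
    return gaussian_kernel(X_new, X_train, gamma) @ alpha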
algorithms benefitting from the learning guarantees presented our theory can also provide finer analysis of several existing algorithms and help devise alternative principled learning algorithms acknowledgments this work was partly funded by nsf and and the nserc pgs references adams and nobel uniform convergence of classes under ergodic sampling the annals of probability agarwal and duchi the generalization ability of online algorithms for dependent data information theory ieee transactions on alquier and wintenberger model selection for weakly dependent time series forecasting technical report centre de recherche en economie et statistique alquier li and wintenberger prediction of time series by statistical learning general losses and fast rates dependence modelling andrews first order autoregressive processes and strong mixing cowles foundation discussion papers cowles foundation for research in economics yale university baillie long memory processes and fractional integration in econometrics journal of econometrics barve and long on the complexity of learning from drifting distributions in colt berti and rigo theorem for exchangeable random variables statistics probability letters bollerslev generalized autoregressive conditional heteroskedasticity econometrics box and jenkins time series analysis forecasting and control incorporated brockwell and davis time series theory and methods new york de la and decoupling from dependence to independence randomly stopped processes and processes martingales and beyond probability and its applications springer ny doukhan mixing properties and examples lecture notes in statistics new york engle autoregressive conditional heteroscedasticity with estimates of the variance of united kingdom inflation econometrica hamilton time series analysis princeton kuznetsov and mohri generalization bounds for time series prediction with processes in alt lozano kulkarni and schapire convergence and consistency of regularized boosting algorithms with stationary observations in nips pages meir nonparametric time series prediction through adaptive model selection machine learning pages modha and masry prediction of stationary random processes information theory ieee transactions on jan mohri and medina new analysis and algorithm for learning with drifting distributions in alt mohri and rostamizadeh rademacher complexity bounds for processes in nips mohri and rostamizadeh stability bounds for stationary and processes journal of machine learning research pestov predictive pac learnability paradigm for learning from exchangeable input data in grc rakhlin sridharan and tewari online learning random averages combinatorial parameters and learnability in nips rakhlin sridharan and tewari online learning stochastic constrained and smoothed adversaries in nips rakhlin sridharan and tewari sequential complexities and uniform martingale laws of large numbers probability theory and related fields shalizi and kontorovitch predictive pac learning and process decompositions in nips steinwart and christmann fast learning from observations in nips tao and an optimization algorithm for solving the subproblem siam journal on optimization vidyasagar theory of learning and generalization with applications to neural networks and control systems new york yu rates of convergence for empirical processes of stationary mixing sequences the annals of probability 
compressive spectral embedding sidestepping the svd upamanyu madhow madhow ece department uc santa barbara dinesh ramasamy dineshr ece department uc santa barbara abstract spectral embedding based on the singular value decomposition svd is widely used preprocessing step in many learning tasks typically leading to dimensionality reduction by projecting onto number of dominant singular vectors and rescaling the coordinate axes by predefined function of the singular value however the number of such vectors required to capture problem structure grows with problem size and even partial svd computation becomes bottleneck in this paper we propose compressive spectral embedding algorithm which employs random projections and finite order polynomial expansions to compute approximations to embedding for an matrix with its time complexity is log and the embedding dimension is log both of which are independent of the number of singular vectors whose effect we wish to capture to the best of our knowledge this is the first work to circumvent this dependence on the number of singular vectors for general embeddings the key to sidestepping the svd is the observation that for downstream inference tasks such as clustering and classification we are only interested in using the resulting embedding to evaluate pairwise similarity metrics derived from the rather than capturing the effect of the underlying matrix on arbitrary vectors as partial svd tries to do our numerical results on network datasets demonstrate the efficacy of the proposed method and motivate further exploration of its application to inference tasks introduction inference tasks encountered in natural language processing graph inference and manifold learning employ the singular value decomposition svd as first step to reduce dimensionality while retaining useful structure in the input such spectral embeddings go under various guises principle component analysis pca latent semantic indexing natural language processing kernel principal component analysis commute time and diffusion embeddings of graphs to name few in this paper we present compressive approach for accomplishing dimensionality reduction or embedding without actually performing the computationally expensive svd step the setting is as follows the input is represented in matrix form this matrix could represent the adjacency matrix or the laplacian of graph the probability transition matrix of random walker on the graph representation of documents the action of kernel on set of points rd kernel pca such as or kx where denotes the indicator function or matrices derived from graphs constructed from we wish to compute transformation of the rows of this matrix which succinctly captures the global structure of via euclidean distances or similarity metrics derived from the such as normalized correlations common approach is to pute partial svd of σl ul vlt and to use it to embed the rows of into space using the rows of σk uk for some function the embedding of the variable corresponding to the row of the matrix is the row of for example corresponds to principal component analysis pca the rows of are projections of the rows of along the first principal components vl other important choices include constant used to cut graphs and for commute time embedding of graphs inference tasks such as unsupervised clustering and supervised classification are performed using pairwise similarity metrics on the embedded coordinates rows of instead of the ambient data rows of beyond the obvious benefit of 
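For contrast with the compressive approach developed in this paper, the conventional embedding just described — a partial SVD followed by rescaling of the coordinates by a function f of the singular values — can be sketched as follows; the use of scipy.sparse.linalg.svds and the default choice of f are illustrative assumptions.

import numpy as np
from scipy.sparse.linalg import svds

def svd_embedding(A, k, f=lambda s: s):
    """Embed the rows of A into R^k via the k dominant singular vectors:
    row i of the embedding is row i of U_k f(Sigma_k).
    f(s) = s gives PCA-style projections; f(s) = 1 corresponds to embeddings
    used to cut graphs; other weightings arise for commute-time embeddings."""
    u, s, vt = svds(A, k=k)          # partial SVD: cost grows with k
    order = np.argsort(s)[::-1]      # svds does not guarantee descending order
    u, s = u[:, order], s[order]
    return u * f(s)                  # broadcast: scales column j of U_k by f(sigma_j)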
dimensionality reduction from to embeddings derived from the leading can often be interpreted as denoising since the noise in matrices arising from data manifests itself via the smaller singular vectors of see which analyzes graph adjacency matrices this is often cited as motivation for choosing pca over isotropic dimensionality reduction techniques such as random embeddings which under the setting of the jl lemma can also preserve structure the number of singular vectors needed to capture the structure of an matrix grows with its size and two bottlenecks emerge as we scale the computational effort required to extract large number of singular vectors using conventional iterative methods such as lanczos or simultaneous iteration or approximate algorithms like nystrom and randomized svd for computation of partial svd becomes prohibitive scaling as kt where is the number of in the resulting embedding becomes unwieldy for use in subsequent inference steps approach and contributions in this paper we tackle these scalability bottlenecks by focusing on what embeddings are actually used for computing pairwise similarity metrics typically used for supervised or unsupervised learning for example clustering uses pairwise euclidean distances and classification uses pairwise inner products we therefore ask the following question is it possible to compute an embedding which captures the pairwise euclidean distances between the rows of the spectral embedding σk uk while sidestepping the computationally expensive partial svd we answer this question in the affirmative by presenting compressive algorithm which directly computes embedding there are two key insights that drive our algorithm by approximating by min polynomial we can compute the embedding iteratively using products of the form aq or at the iterations can be computed compressively by virtue of the celebrated jl lemma the embedding geometry is approximately captured by small number log of randomly picked starting vectors the number of passes over at and time complexity of the algorithm are and log respectively these are all independent of the number of singular vectors whose effect we wish to capture via the embedding this is in stark contrast to embedding directly based on the partial svd our algorithm lends itself to parallel implementation as sequence of matrixvector products interlaced with vector additions run in parallel across log randomly chosen starting vectors this approach significantly reduces both computational complexity and embedding dimensionality relative to partial svd freely downloadable python implementation of the proposed algorithm that exploits this inherent parallelism can be found in related work as discussed in section the concept of compressive measurements forms key ingredient in our algorithm and is based on the jl lemma the latter which provides probabilistic guarantees on approximate preservation of the euclidean geometry for finite collection of points under random projections forms the basis for many other applications such as compressive sensing we now mention few techniques for exact and approximate svd computation before discussing algorithms that sidestep the svd as we do the time complexity of the full svd of an matrix is for partial svds are computed using iterative methods for eigen decompositions of symmetric matrices derived from such as aat and at the complexity of standard iterative eigensolvers such as simultaneous iteration and the lanczos method scales as kt where denotes the number of of the leading 
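The compressive ingredient referred to here is the standard Johnson-Lindenstrauss construction: projecting n points onto d = O(log n / eps^2) random +/-1 directions approximately preserves their pairwise distances. A minimal sketch (the constant in the choice of d is illustrative, not taken from the paper):

import numpy as np

def jl_project(X, eps=0.25, rng=None):
    """Project the rows of X onto d = O(log n / eps^2) random +/-1 directions.
    Pairwise Euclidean distances are preserved to within a (1 +/- eps) factor
    with high probability (JL lemma); the constant 8 is illustrative."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    d = int(np.ceil(8 * np.log(n) / eps ** 2))
    omega = rng.choice([-1.0, 1.0], size=(X.shape[1], d)) / np.sqrt(d)
    return X @ omega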
singular value vector triplets σl ul vl minimize the matrix reconstruction error under rank constraint they are solution to the optimization problem arg minka σl ul vl kf where kf denotes the frobenius norm approximate svd algorithms strive to reduce this error while also placing constraints on the computational budget the number of passes over commonly employed approximate eigendecomposition algorithm is the nystrom method based on random sampling of columns of which has time complexity ksn number of variants of the nystrom method for kernel matrices like have been proposed in the literature these aim to improve accuracy using preprocessing steps such as clustering or random projection trees methods to reduce the complexity of the nystrom algorithm to ksn enable nystrom sketches that see more columns of the complexity of all of these grow as ksn other randomized algorithms involving iterative computations include the randomized svd since all of these algorithms set out to recover eigenvectors exact or otherwise their complexity scales as kt we now turn to algorithms that sidestep svd computation in vertices of graph are embedded based on diffusion of probability mass in random walks on the graph using the power iteration run independently on random starting vectors and stopping prior to while this approach is specialized to probability transition matrices unlike our general framework and does not provide explicit control on the nature of the embedding as we do feature in common with the present paper is that the time complexity of the algorithm and the dimensionality of the resulting embedding are independent of the number of eigenvectors captured by it parallel implementation of this algorithm was considered in similar parallelization directly applies to our algorithm another specific application that falls within our general framework is the commute embedding on graph based on the normalized adjacency matrix and weighing function approximate commute time embeddings have been computed using solvers and the jl lemma in the complexity of the latter algorithm and the dimensionality of the resulting embedding are comparable to ours but the method is specially designed for the normalized adjacency matrix and the weighing function our more general framework would for example provide the flexibility of suppressing small eigenvectors from contributing to the embedding by setting thus while randomized projections are extensively used in the embedding literature to the best of our knowledge the present paper is the first to develop general compressive framework for spectral embeddings derived from the svd it is interesting to note that methods similar to ours have been used in different context to estimate the empirical distribution of eigenvalues of large hermitian matrix these methods use polynomial approximation of indicator functions and random projections to compute an approximate histogram of the number of eigenvectors across different bands of the spectrum λmin λmax algorithm we first present the algorithm for symmetric matrix later in section we show how to handle general matrix by considering related symmetric matrix let λl denote the eigenvalues of sorted in descending order and vl their corresponding eigenvectors chosen to be orthogonal in case of repeated eigenvalues for any tion we denote by the symmetric matrix λl vl vlt we now develop an log algorithm to compute log dimensional embedding which approximately captures pairwise euclidean distances between the rows of the embedding λn 
vn rotations are inconsequential we first observe that rotation of basis does not alter similarity metrics since vn satisfies in pairwise distances between the rows of are equal to corresponding pairwise distances between the rows of ev λl vl vl we use this observation to compute embeddings of the rows of rather than those of compressive embedding suppose now that we know this constitutes an embedding and similarity queries between two vertices we refer to the variables corresponding to rows of as vertices as we would for matrices derived from graphs requires operations however we can reduce this time to log by using the jl lemma which informs us that pairwise distances can be approximately captured by compressive projection onto log dimensions specifically for log let denote an matrix with entries drawn uniformly at random from according to the jl lemma pairwise distances between the rows of approximate pairwise distances between the rows of with high probability in particular the following statement holds with probability at least ku vk ωk ku vk for any two rows of the key are that we can reduce the embedding dimension to log since we are only interested in pairwise similarity measures and we do not need to compute we only need to compute we now discuss how to accomplish the latter efficiently polynomial approximation of embedding direct computation of from the eigenvectors and eigenvalues of as λl vl vlt would suggest is expensive however we now observe that computation of is easy when is polynomial in this case bp for some bp so that can be computed as sequence of products interlaced with vector additions run in parallel for each of the columns of therefore they only require ldt ldn fel where fel is an order flops our strategy is to approximate by polynomial approximation of we defer the details of computing good polynomial approximation to section for now we assume that one such approximation fel is available and give bounds on the loss in fidelity as result of this approximation performance guarantees the spectral norm of the error matrix fe λr fel λr vr vrt satisfies kzk maxl λl fel λl max fel where the spectral norm of matrix denoted by kbk refers to the induced for symmetric matrices kbk where λl are the eigenvalues of letting ip denote the unit vector along the coordinate of rn the distance between the rows of fe can be written as kfel ip iq kf ip iq ip iq ke ip iq similarly we have that kfel ip iq ke ip iq thus pairwise distances between the rows of fel approximate those between the rows of however the distortion term is additive and must be controlled by carefully choosing fel as discussed in section applying the jl lemma to the rows of fel we have that when log with fel captures pairwise entries drawn uniformly at random from the embedding distances between the rows of fel up to multiplicative distortion of with high probability et ip iq ωt fel ip iq fel ip iq ip iq ke ip iq using we can show that ke ip iq ke ip iq we state this result in theorem ke similarly theorem let fel denote an order polynomial such that maxl λl fel λl fel an matrix with entries drawn independently and uniformly at random from where is an integer satisfying log let rp rd denote the mapping from the row of λn vn to the fel the following statement is true with probability at least row of ku vk kg ku vk for any two rows of furthermore there exists an algorithm to compute each of the in flops independent of its other columns which makes log columns of passes over is the number of in choosing the polynomial 
In choosing the polynomial approximation, we restrict attention to matrices S which satisfy \|S\|_2 <= 1, which implies that the eigenvalues of S lie in [-1, 1]. We can trivially center and scale the spectrum of any matrix to satisfy this assumption: given bounds \sigma_min <= \lambda_l(S) <= \sigma_max, the rescaling and centering operation S' = (2S - (\sigma_max + \sigma_min) I_n) / (\sigma_max - \sigma_min), together with the corresponding modification of f so that it is evaluated on the rescaled argument, ensures \|S'\|_2 <= 1.

In order to compute a polynomial approximation f_L of f, we need to define the notion of a good approximation. We showed in the previous section that the errors introduced by the polynomial approximation can be summarized by a bound on the spectral norm of the error matrix f(S) - f_L(S). Since \|Z\|_2 = \max_l |f(\lambda_l) - f_L(\lambda_l)|, what matters is how well we approximate the function f at the eigenvalues \lambda_l of S. Indeed, if we knew the eigenvalues, we could minimize \|Z\|_2 by minimizing \max_l |f(\lambda_l) - f_L(\lambda_l)|. This is not a particularly useful approach, since computing the eigenvalues is expensive. However, we can use prior knowledge of the domain from which the matrix comes to penalize deviations from f differently for different values of x: for example, if we know the distribution of the eigenvalues of S, we can minimize the average error \int |f(x) - f_L(x)|^2 p(x) dx. In our examples, for the sake of concreteness, we assume that the eigenvalues are uniformly distributed over [-1, 1] and give a procedure to compute an order-L polynomial approximation of f(x) that minimizes \int_{-1}^{1} |f(x) - f_L(x)|^2 dx.

A numerically stable procedure to generate finite-order polynomial approximations of a function over [-1, 1], with the objective of minimizing \int |f(x) - f_L(x)|^2 dx, is via Legendre polynomials. They satisfy the recursion (p+1) q_{p+1}(x) = (2p+1) x q_p(x) - p q_{p-1}(x) and are orthogonal, \int_{-1}^{1} q_p(x) q_r(x) dx = (2/(2p+1)) \delta_{pr}. Therefore we set f_L(x) = \sum_{p=0}^{L} b_p q_p(x), where b_p = ((2p+1)/2) \int_{-1}^{1} f(x) q_p(x) dx. We give a method in Algorithm 1 that uses the Legendre recursion to compute E = f_L(S)\Psi using L d matrix-vector products and vector additions; the coefficients b_p are used to assemble f_L(S)\Psi by adding weighted versions of q_p(S)\Psi (a code sketch of this procedure follows below).

Algorithm 1: proposed algorithm to compute an approximate d-dimensional eigenvector embedding of a symmetric matrix S with \|S\|_2 <= 1, using the random projection matrix \Psi. Procedure fast_embed_eig(S, \Psi, L): compute the coefficients b_p = ((2p+1)/2) \int_{-1}^{1} f(x) q_p(x) dx of the polynomial approximation f_L that minimizes \int |f(x) - f_L(x)|^2 dx, where q_p is the order-p Legendre polynomial; generate q_p(S)\Psi for p = 0, ..., L by applying the Legendre recursion to the columns of \Psi; accumulate and return E = \sum_p b_p q_p(S)\Psi, which now holds f_L(S)\Psi, an approximation of f(S)\Psi. As described in the previous section, if we have prior knowledge of the distribution of eigenvalues, as we do for many commonly encountered large matrices, we can boost the performance of this generic algorithm, which is based on the assumption of eigenvalues uniformly distributed over [-1, 1].

Embedding general matrices. We complete the algorithm description by generalizing to any m x n matrix A, not necessarily symmetric, such that \|A\|_2 <= 1. The approach is to use Algorithm 1 to compute an approximate embedding of the (m+n) x (m+n) symmetric matrix [[0, A], [A^T, 0]]. Let A = \sum_l \sigma_l u_l v_l^T, l = 1, ..., min(m, n), be an SVD of A with \|A\|_2 = \sigma_1 <= 1. Consider the spectral mapping that sends the rows of A to the rows of a matrix E_row built from f(\Sigma) and the left singular vectors, and the columns of A to the rows of a matrix E_col built from f(\Sigma) and the right singular vectors. It can be shown that the orthonormal eigenvectors of the symmetric matrix above are obtained by stacking u_l and +/- v_l (suitably normalized), with corresponding eigenvalues \sigma_l and -\sigma_l respectively, and that its remaining eigenvalues are equal to 0. Therefore we call E_all = fast_embed_eig on this matrix, with \Psi an (m+n) x O(\epsilon^{-2}\log(m+n)) matrix whose entries are drawn independently and uniformly at random from {+1/\sqrt{d}, -1/\sqrt{d}}, and let E_row and E_col denote the first m and last n rows of E_all. From Theorem 1 we know that, with overwhelming probability, pairwise distances between any two rows of E_row approximate those between the corresponding rows of the exact row embedding, and similarly pairwise distances between any two rows of E_col approximate those between the corresponding rows of the exact column embedding.

Implementation considerations. We now briefly go over implementation considerations before presenting numerical results.
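The sketch below illustrates the Legendre-based construction described above, under the same assumptions as before (\|S\|_2 <= 1). The names legendre_coeffs and fast_embed_eig, the quadrature rule used for the coefficients, and the default arguments are ours, for illustration; this is not the authors' reference implementation.

```python
import numpy as np
from numpy.polynomial import legendre as leg

def legendre_coeffs(f, order):
    """b_p = (2p+1)/2 * integral_{-1}^{1} f(x) P_p(x) dx via Gauss-Legendre quadrature."""
    nodes, weights = leg.leggauss(2 * order + 2)
    fx = f(nodes)
    coeffs = []
    for p in range(order + 1):
        unit = np.eye(order + 1)[p]                    # selects the order-p Legendre polynomial
        coeffs.append((2 * p + 1) / 2 * np.sum(weights * fx * leg.legval(nodes, unit)))
    return np.array(coeffs)

def fast_embed_eig(S, f, order, Psi):
    """Approximate f(S) @ Psi via the three-term Legendre recursion
    (p+1) P_{p+1}(x) = (2p+1) x P_p(x) - p P_{p-1}(x), applied to the columns of Psi."""
    b = legendre_coeffs(f, order)
    Q_prev, Q_curr = Psi, S @ Psi                      # P_0(S) Psi and P_1(S) Psi
    E = b[0] * Q_prev + (b[1] * Q_curr if order >= 1 else 0.0)
    for p in range(1, order):
        Q_next = ((2 * p + 1) * (S @ Q_curr) - p * Q_prev) / (p + 1)
        E = E + b[p + 1] * Q_next
        Q_prev, Q_curr = Q_curr, Q_next
    return E
```

As a usage example, fast_embed_eig(S, lambda x: (x > 0.9).astype(float), 40, jl_projection(S.shape[0])) approximates an embedding that keeps only eigenvectors with eigenvalues above 0.9, without ever computing those eigenvectors.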
Spectral norm estimates. In order to ensure that the eigenvalues of S are within [-1, 1], as we have assumed, we scale the matrix by its spectral norm \|S\|_2. To this end, we obtain a tight lower bound (and good approximation) of the spectral norm using power iteration on O(\log n) randomly chosen starting vectors, and then scale this up by a small factor to obtain our estimate, which is typically an upper bound on \|S\|_2.

Polynomial approximation order. The error in approximating f(x) by f_L(x), as measured by \int |f(x) - f_L(x)|^2 dx, is a function of the polynomial order L; a reduction in this error generally corresponds to a reduction in \delta, which appears as an additive bound on the distortion in Theorem 1. Smooth functions generally admit lower-order approximations for the same target error and hence yield considerable savings in algorithm complexity, which scales linearly with L.

Polynomial approximation method. The rate at which \delta decreases as we increase L depends on the weighting function w(x) used to compute f_L by minimizing \int |f(x) - f_L(x)|^2 w(x) dx. The choice w(x) = 1 yields the Legendre recursion used in Algorithm 1, whereas w(x) = 1/\sqrt{1 - x^2} corresponds to the Chebyshev recursion, which is known to result in fast convergence. We defer a detailed study of the impact of alternative choices of w(x) on \delta to future work.

Denoising by cascading. In some problems it may be necessary to drive the contribution from certain singular vectors to zero; in many settings, singular vectors with smaller singular values correspond to noise, and the number of such singular values can scale as fast as min(m, n). Therefore, when we place nulls (zeros) in f(x), it is desirable to ensure that these nulls remain pronounced after we approximate f(x) by f_L(x). We do this by cascading, passing the polynomial approximation through the map x -> x^b: the separation between the small values in the polynomial approximation, which correspond to the nulls we have set in f(x), and the rest of the spectrum is amplified when they are passed through x^b.

Numerical results. While the proposed approach is particularly useful for large problems in which exact eigendecomposition is computationally infeasible, for the purpose of comparison our results are restricted to smaller settings where the exact solution can be computed. We compute the exact partial eigendecomposition using the ARPACK library, called from MATLAB. For a given choice of weighting function f, the associated exact embedding f(\Lambda_n)V_n^T is compared with the compressive embedding returned by Algorithm 1; the latter was implemented in Python using the scipy sparse routines and is available for download.

[Figure: DBLP collaboration network. Percentile curves of normalized correlations for the compressive embedding versus the exact eigenvector embedding: change in normalized inner product and the effect of the dimensionality of the embedding (left and middle), and the effect of cascading (right).]

We consider two real-world undirected graphs for our evaluation and compute embeddings for the normalized adjacency matrix D^{-1/2} A D^{-1/2}, where D is the diagonal matrix of row sums of the adjacency matrix A; the eigenvalues of the normalized adjacency matrix lie in [-1, 1] for undirected graphs. We study the accuracy of the embeddings by comparing pairwise normalized correlations between rows of the exact embedding, given by e_p^T e_q / (\|e_p\| \|e_q\|) for rows p and q, with those predicted by the approximate embedding (a sketch of this evaluation protocol follows below). DBLP collaboration network: this is an undirected graph; we compute the leading five hundred eigenvectors of the normalized adjacency matrix and, based on the smallest of these five hundred eigenvalues, choose the weighting function and the parameters of Algorithm 1 accordingly. We demonstrate the dependence of the quality of the embedding returned by the proposed algorithm on two parameters, described next.
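The following is a small sketch of the evaluation protocol just described: forming the normalized adjacency matrix D^{-1/2} A D^{-1/2} and comparing normalized correlations (cosine similarities) between rows of an exact and a compressive embedding. A is assumed to be a scipy.sparse adjacency matrix; the function names are ours.

```python
import numpy as np
import scipy.sparse as sp

def normalized_adjacency(A):
    """D^{-1/2} A D^{-1/2}; its eigenvalues lie in [-1, 1] for undirected graphs."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return d_inv_sqrt @ A @ d_inv_sqrt

def normalized_correlation(E, p, q):
    """Cosine similarity between rows p and q of an embedding E."""
    u, v = E[p], E[q]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def correlation_deviation(E_exact, E_compressive, pairs):
    """|exact - compressive| normalized correlation for a list of (p, q) vertex pairs."""
    return np.array([abs(normalized_correlation(E_exact, p, q)
                         - normalized_correlation(E_compressive, p, q))
                     for p, q in pairs])
```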
The two parameters are (i) d, the number of random starting vectors, which gives the dimensionality of the embedding, and (ii) the cascading parameter b. We study both using this dataset.

Dependence on the number of random projections. In the corresponding figure, d ranges up to a multiple of \log n, and we plot percentile values of the deviation between the compressive normalized correlations (computed from the rows of the compressive embedding) and the corresponding exact normalized correlations (rows of the exact embedding). The deviation decreases with increasing d, corresponding to the concentration promised by the JL lemma, but this payoff saturates for large values of d as polynomial approximation errors start to dominate. From the percentile curves we see that a significant fraction of pairwise normalized correlations in the compressive embedding lie within a small deviation of their corresponding values in the exact embedding once d is a modest multiple of \log n. For these plots we use a fixed number of matrix-vector products for each randomly picked starting vector and a fixed cascading parameter for the algorithm of the previous section.

Dependence on the cascading parameter. In the previous section we described how cascading can help suppress the contribution, to the embedding, of the eigenvectors whose eigenvalues lie in regions where we have set f(x) = 0. We illustrate the importance of this boosting procedure by comparing the quality of the embedding with and without cascading, keeping the other parameters of the algorithm fixed (the same number of matrix-vector products for each of the randomly picked starting vectors). We report the results in the corresponding figure, where we plot percentile values of the compressive normalized correlation (rows of the compressive embedding) for different values of the exact normalized correlation (rows of the exact embedding). Without cascading, the polynomial approximation of f(x) does not suppress the small eigenvectors; as a result, we notice a deviation (a bias) of the percentile curve from the ideal dotted line in the left panel. This bias disappears with cascading (right panel). The running time of our algorithm on a standard workstation was about two orders of magnitude smaller than that of a partial SVD using sparse eigensolvers: the d-dimensional compressive embedding of the leading eigenvectors of the DBLP graph took on the order of a minute, whereas their exact computation took far longer. A more detailed comparison of running times is beyond the scope of this paper, but it is clear that the promised gains in computational complexity are realized in practice.

Application to graph clustering. For the Amazon network, an undirected graph, we illustrate the potential downstream benefits of our algorithm by applying k-means clustering to embeddings, exact and compressive, of this network. For the purpose of our comparisons we compute the leading eigenvectors explicitly using an exact eigensolver, and we use a compressive embedding whose weighting function f captures the effect of these eigenvectors via the corresponding eigenvalue threshold. We compare this against the usual spectral embedding using the leading eigenvectors. We keep the k-means dimension fixed in the comparison because k-means complexity scales linearly with it and quickly becomes the bottleneck; indeed, our ability to embed a large number of eigenvectors directly into a low-dimensional space of dimension O(\epsilon^{-2}\log n) has the added benefit of dimensionality reduction within the subspace of interest, in this case the span of the largest eigenvectors. We consider several instances of k-means clustering throughout, reporting the median of a commonly used graph clustering score, modularity; larger values translate to better clustering solutions. The median modularity for clustering based on our compressive embedding is significantly better than that for the conventional spectral embedding, and the computational cost for the compressive embedding is also lower. When we replace the exact eigenvector embedding with an approximate eigendecomposition computed using randomized SVD, with parameters given by the number of power iterates
and excess dimensionality the time taken reduces from minutes to seconds but this comes at the expense of inference quality median modularity drops to on the other hand the median modularity increases to when we consider exact partial svd embedding with eigenvectors this indicates that our compressive embedding yields better clustering quality because it is able to concisely capture more eigenvectors in this example compared to and with conventional partial svd it is worth pointing out that even for known eigenvectors the number of dominant eigenvectors that yields the best inference performance is often unknown priori and is treated as for compressive spectral an elegant approach for implicitly optimizing over is to use the embedding function embedding with as conclusion we have shown that random projections and polynomial expansions provide powerful approach for spectral embedding of large matrices for an matrix our log algorithm computes an log compressive embedding that provably approximates pairwise distances between points in the desired spectral embedding numerical results for several data sets show that our method provides good approximations for embeddings based on partial svd while incurring much lower complexity moreover our method can also approximate spectral embeddings which depend on the entire svd since its complexity does not depend on the number of dominant vectors whose effect we wish to model glimpse of this potential is provided by the example of based clustering for estimating of the amazon graph where our method yields much better performance using graph metrics than partial svd with significantly higher complexity this motivates further investigation into applications of this approach for improving downstream inference tasks in variety of problems acknowledgments this work is supported in part by darpa graphs and by systems on nanoscale information fabrics sonic one of the six src starnet centers sponsored by marco and darpa any opinions findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies references smola and kernel principal component analysis in artificial neural networks icann ser lecture notes in computer science gerstner germond hasler and nicoud eds springer berlin heidelberg pp mika smola scholz and kernel pca and in feature spaces in advances in neural information processing systems white and smyth spectral clustering approach to finding communities in in sdm vol siam and jagers random walks on graphs stochastic processes and their applications nadakuditi and newman graph spectra and the detectability of community structure in networks physical review letters fowlkes belongie chung and malik spectral grouping using the method ieee transactions on pattern analysis and machine intelligence vol no drineas and mahoney on the method for approximating gram matrix for improved learning journal on machine learning resources halko martinsson and tropp finding structure with randomness probabilistic algorithms for constructing approximate matrix decompositions siam python implementation of online available https achlioptas random projections in proceedings of the twentieth acm symposium on principles of database systems ser pods candes and wakin an introduction to compressive sampling signal processing magazine ieee march trefethen and bau numerical linear algebra siam mccormick and noe simultaneous iteration for the matrix eigenvalue problem linear algebra and 
its applications vol no pp zhang tsang and kwok improved approximation and error analysis in proceedings of the international conference on machine learning ser icml acm yan huang and jordan fast approximate spectral clustering in proceedings of the acm sigkdd international conference on knowledge discovery and data mining ser kdd acm li kwok and lu making approximation in icml kumar mohri and talwalkar ensemble method in advances in neural information processing systems lin and cohen power iteration clustering in proceedings of the international conference on machine learning lin scalable methods for unsupervised and learning dissertation carnegie mellon university yan brahmakshatriya xue gilder and wise pic parallel power iteration clustering for big data journal of parallel and distributed computing random walks on graphs survey combinatorics paul erdos is eighty vol no pp spielman and teng time algorithms for graph partitioning graph sparsification and solving linear systems in proceedings of the annual acm symposium on theory of computing ser stoc new york ny usa acm spielman and teng nearly linear time algorithms for preconditioning and solving symmetric diagonally dominant linear systems siam journal on matrix analysis and applications vol spielman and srivastava graph sparsification by effective resistances siam journal on computing silver roeder voter and kress kernel polynomial approximations for densities of states and spectral functions journal of computational physics vol no pp mar di napoli polizzi and saad efficient estimation of eigenvalue counts in an interval cs yang and leskovec defining and evaluating network communities based on in ieee international conference on data mining icdm fortunato community detection in graphs physics reports vol no 
A Nonconvex Optimization Framework for Low Rank Matrix Estimation. Tuo Zhao, Johns Hopkins University; Zhaoran Wang and Han Liu, Princeton University.

Abstract. We study the estimation of low rank matrices via nonconvex optimization. Compared with convex relaxation, nonconvex optimization exhibits superior empirical performance for large scale instances of low rank matrix estimation. However, the understanding of its theoretical guarantees is limited. In this paper we define the notion of projected oracle divergence, based on which we establish sufficient conditions for the success of nonconvex optimization. We illustrate the consequences of this general framework for matrix sensing. In particular, we prove that a broad class of nonconvex optimization algorithms, including alternating minimization and gradient-type methods, geometrically converge to the global optimum and exactly recover the true low rank matrices under standard conditions.

Introduction. Let M be a rank-k matrix with k much smaller than its dimensions, and suppose our goal is to estimate M based on partial observations of its entries. For example, matrix sensing is based on linear measurements b_i = <A_i, M>, where the number of measurements is much smaller than the number of entries of M and A_i is the i-th sensing matrix. In the past decade, significant progress has been made on the recovery of low rank matrices. Among the existing works, most are based upon convex relaxation with a nuclear norm constraint or regularization. Nevertheless, solving these convex optimization problems can be computationally prohibitive in high dimensional regimes with large dimensions. A computationally more efficient alternative is nonconvex optimization: in particular, we reparametrize the matrix variable in the optimization problem as M = U V^T, with U and V tall factor matrices, and optimize over U and V. Such a reparametrization automatically enforces the low rank structure and leads to low computational cost per iteration. For this reason, the nonconvex approach is widely used in large scale applications such as recommendation systems.

Despite the superior empirical performance of the nonconvex approach, the understanding of its theoretical guarantees is relatively limited in comparison with the convex relaxation approach. Only recently has there been progress on coordinate-descent-type nonconvex optimization methods, known as alternating minimization: these works show that, provided a suitable initialization, the alternating minimization algorithm converges at a geometric rate to factors U and V which satisfy M = U V^T. Meanwhile, other works establish the convergence of gradient-type methods, and further works establish the convergence of a broad class of nonconvex algorithms including both gradient-type and coordinate-descent-type methods; however, the latter only establish asymptotic convergence for an infinite number of iterations rather than an explicit rate of convergence. Besides these works, some consider projected gradient-type methods which optimize over the matrix variable M rather than U and V; these methods involve calculating the top singular vectors of a full-size matrix at each iteration and, when the rank is much smaller than the dimensions, they incur much higher computational cost per iteration than the aforementioned methods that optimize over U and V. All these works except the last focus on specific algorithms, while the last does not establish an explicit optimization rate of convergence. (Research supported in part by NSF, NIH, and FDA grants.)

In this paper we propose a general framework that unifies a broad class of nonconvex algorithms for low rank matrix estimation. At the core of this framework is a quantity named projected oracle divergence, which sharply captures the evolution of generic optimization algorithms in the presence of nonconvexity. Based on the projected oracle divergence, we establish sufficient
conditions under which the iteration sequences geometrically converge to the global optima for matrix sensing direct consequence of this general framework is that broad family of nonconvex algorithms including gradient descent coordinate gradient descent and coordinate descent converge at geometric rate to the true low rank matrices and in particular our general framework covers alternating minimization as special case and recovers the results of under standard conditions meanwhile our framework covers methods which are also widely used in practice to the best of our knowledge our framework is the first one that establishes exact recovery guarantees and geometric rates of convergence for broad family of nonconvex matrix sensing algorithms to achieve maximum generality our unified analytic framework significantly differs from previous works in detail view alternating minimization as perturbed version of the power method however their point of view relies on the closed form solution of each iteration of alternating minimization which makes it hard to generalize to other algorithms methods meanwhile take geometric point of view in detail they show that the global optimum of the optimization problem is the unique stationary point within its neighborhood and thus broad class of algorithms succeed however such geometric analysis of the objective function does not characterize the convergence rate of specific algorithms towards the stationary point unlike existing analytic frameworks we analyze nonconvex optimization algorithms as perturbed versions of their convex counterparts for example under our framework we view alternating minimization as perturbed version of coordinate descent on convex objective functions we use the key quantity projected oracle divergence to characterize such perturbation effect which results from the local nonconvexity at intermediate solutions this framework allows us to establish explicit rate of convergence in an analogous way as existing convex optimization analysis notation for vector vd rd let the vector norm be kvkqq vjq for matrix we use amj to denote the column of and ain to denote the row of let max and min be thep largest and smallest nonzero singular values of we define the following matrix norms singular values of given another matrix max moreover we define to be the sum of allp we define the inner product as ha bi aij bij we define ei as an indicator vector where the entry is one and all other entries are zero for bivariate function we define ru to be the gradient with respect to moreover we use the common notations of and to characterize the asymptotics of two real sequences problem formulation and algorithms let be the unknown low rank matrix of interest we have sensing matrices ai with our goal is to estimate based on bi hai in the high dimensional regime with much smaller than mn under such regime common assumption is rank min existing approaches generally recover by solving the following convex optimization problem min km subject to where bd and rd is an operator defined as hai rd existing convex optimization algorithms for solving are computationally inefficient in the sense that they incur high computational cost and only attain sublinear rates of convergence to the global optimum instead in large scale settings we usually consider the following nonconvex optimization problem kb the reparametrization of though making the optimization problem in nonconvex significantly improves the computational efficiency existing literature has established 
convincing empirical evidence that can be effectively solved by board variety of nonconvex optimization algorithms including gradient descent alternating exact minimization alternating least squares or coordinate descent as well as alternating gradient descent coordinate gradient descent which are shown in algorithm min where it is worth noting the qr decomposition and rank singular value decomposition in algorithm can be accomplished efficiently in particular the qr decomposition can be accomplished in max operations while the rank singular value decomposition can be accomplished in kmn operations in fact the qr decomposition is not necessary for particular update schemes prove that the alternating exact minimization update schemes with or without the qr decomposition are equivalent algorithm family of nonconvex optimization algorithms for matrix sensing here ksvd is the rank singular value decomposition of here is diagonal matrix containing the top singular values of in decreasing order and and contain the corresponding top left and right singular vectors of here rv qr is the qr decomposition where is the corresponding orthonormal matrix and rv is the corresponding upper triangular matrix input bi ai parameter step size total number of iterations ksvd bi ai for alternating exact minimization argminv rv qr alternating gradient descent updating rv qr rv gradient descent rv qr rv alternating exact minimization argminu ru qr alternating gradient descent updating ru qr ru gradient descent ru qr ru end for output for gradient descent we use theoretical analysis we analyze the convergence properties of the general family of nonconvex optimization algorithms illustrated in before we present the main results we first introduce unified analytic framework based on key quantity named projected oracle divergence such unified framework equips our theory with the maximum generality without loss of generality we assume throughout the rest of this paper projected oracle divergence we first provide an intuitive explanation for the success of nonconvex optimization algorithms which forms the basis of our later proof for the main results recall that is special instance of the following optimization problem min key observation is that given fixed is strongly convex and smooth in under suitable conditions and the same also holds for given fixed correspondingly for the convenience of discussion we summarize this observation in the following technical condition which will be later verified for matrix sensing under suitable conditions condition strong biconvexity and bismoothness there exist universal constants and such that ku hu ru ku for all kv hv rv kv for all for the simplicity of discussion for now we assume and are the unique global minimizers to the generic optimization problem in assuming is given we can obtain by argmin condition implies the objective function in is strongly convex and smooth hence we can choose any algorithm to obtain for example we can directly solve for in rv or iteratively solve for using gradient descent where is the step size for the simplicity of discussion we put aside the renormalization issue for now in the example of gradient descent by invoking classical convex optimization results it is easy to prove that kv kf kf for all where is contraction coefficient which depends on and in condition however the oracle rv is not accessible in practice since we do not know instead we only have access to rv where is arbitrary to characterize the divergence between the ideal oracle rv and 
the accessible oracle rv we define key quantity named projected oracle divergence which takes the form rv rv kv kf where is the point for evaluating the gradient in the above example it holds for later we will illustrate that the projection of the difference of oracles onto specific one dimensional space the direction of is critical to our analysis in the above example of gradient descent we will prove later that for we have kv kf kf in other words the projection of the divergence of oracles onto the direction of captures the perturbation effect of employing the accessible oracle rv instead of the ideal rv for argminv we will prove that kv kf according to the update schemes shown in algorithm for alternating exact minimization we set in while for gradient descent or alternating gradient descent we set or in respectively correspondingly similar results hold for ku kf to establish the geometric rate of convergence towards the global minima and it remains to establish upper bounds for the projected oracle divergence in the example of gradient decent we will prove that for some kf which together with where we take implies kv kf kf kf correspondingly similar results hold for ku kf ku kf kf kf combining and we then establish the contraction max kv kf ku kf max kv kf ku kf which further implies the geometric convergence since respectively we can establish similar results for alternating exact minimization and alternating gradient descent based upon such unified analytic framework we now simultaneously establish the main results remark our proposed projected oracle divergence is inspired by previous work which analyzes the wirtinger flow algorithm for phase retrieval the expectation maximization em algorithm for latent variable models and the gradient descent algorithm for sparse coding though their analysis exploits similar nonconvex structures they work on completely different problems and the delivered technical results are also fundamentally different matrix sensing before we present our main results we first introduce an assumption known as the restricted isometry property rip recall that is the rank of the target low rank matrix assumption the linear operator rd defined in satisfies with parameter for all such that rank it holds that kf ka kf several random matrix ensembles satisfy for sufficiently large with high probability for example suppose that each entry of ai is independently drawn from distribution satisfies with parameter with high probability for kn log the following theorem establishes the geometric rate of convergence of the nonconvex optimization algorithms summarized in algorithm theorem assume there exists sufficiently small constant such that satisfies with and the largest and smallest nonzero singular values of are constants which do not scale with for any precision there exist an and universal constants and such that for all log we have km kf the proof of theorems is provided in appendices and theorem implies that all three nonconvex optimization algorithms geometrically converge to the global optimum moreover assuming that each entry of ai is independently drawn from distribution with mean zero and variance proxy one our result further suggests to achieve exact low rank matrix recovery our algorithm requires the number of measurements to satisfy log since we assume that this sample complexity result matches the result for nonconvex optimization methods which is established by in comparison with their result which only covers the alternating exact minimization algorithm 
our results holds for broader variety of nonconvex optimization algorithms note that the sample complexity in depends on polynomial of max min which is treated as constant in our paper if we allow max min to increase with the dimension we can plug the nonconvex optimization algorithms into the framework proposed by following similar lines to the proof of theorem we can derive new sample complexity which is independent of max min see more details in proof of main results due to space limitation we only sketch the proof of theorem for alternating exact minimization the proof of theorem for alternating gradient descent and gradient descent and related lemmas are provided in the appendix for notational simplicity let max and min before we proceed with the main proof we first introduce the following lemma which verifies condition lemma suppose that satisfies with parameter given an arbitrary orthonormal matrix for any we have kv hrv kv the proof of lemma is provided in appendix lemma implies that is strongly convex and smooth in given fixed orthonormal matrix as specified in condition equipped with lemma we now lay out the proof for each update scheme in algorithm proof of theorem alternating exact minimization proof throughout the proof of alternating exact minimization we define constant for notational simplicity we assume that at the iteration there exists matrix factorization of where is orthonormal we choose the projected oracle divergence as rv rv kv kf remark note that the matrix factorization is not necessarily unique because given factorizae ve where and tion of we can always obtain new factorization of ve for an arbitrary unitary matrix however this is not issue to our convergence analysis as will be shown later we can prove that there always exists factorization of satisfying the desired computational properties for each iteration see lemma corollaries and the following lemma establishes an upper bound for the projected oracle divergence lemma suppose that and satisfy and ku kf then we have ku kf the proof of lemma is provided in appendix lemma shows that the projected oracle di vergence for updating diminishes with the estimation error of following lemma quantifies the progress of an exact minimization step using the projected oracle divergence lemma we have kv kf the proof of lemma is provided in appendix lemma illustrates that the estimation error of diminishes with the projected oracle divergence the following lemma characterizes the effect of the renormalization step using qr decomposition lemma suppose that satisfies kv kf then there exists factorization of such that orthonormal matrix and satisfies kv kf kv is an kf the proof of lemma is provided in appendix the next lemma quantifies the accuracy of the initialization lemma suppose that satisfies then there exists factorization of such that matrix and satisfies ku kf is an orthonormal the proof of lemma is provided in appendix lemma implies that the initial solution attains sufficiently small estimation error combining the above lemmas we obtain the next corollary for complete iteration of updating corollary suppose that and satisfy and ku we then have kv ku kf and kv kf kf kf moreover we also have kv ku kf kf the proof of corollary is provided in appendix since the alternating exact minimization algorithm updates and in symmetric manner we can establish similar results for complete iteration of updating in the next corollary corollary suppose that and satisfy and kv kf then there exists factorization of such is an orthonormal 
matrix and satisfies ku kf moreover we also have ku kf kv kf and ku kf kv kf the proof of corollary directly follows appendix and is therefore omitted we then proceed with the proof of theorem for alternating exact minimization lemma ensures that of corollary holds for then corollary ensures that of corollary holds for by induction corollaries and can be applied recursively for all iterations thus we obtain kv kf ku kf kv kf ku kf where the comes therefore for accuracy we need at most log log iterations such that kv kf moreover corollary implies ku kf kv kf where the last inequality comes from therefore we need at most log log iterations such that ku kf then combining and we obtain km ku kv ku where the last inequality is from kv since and kf kf ku kv since kf is orthonormal and ku km is orthonormal thus we complete the proof extension to matrix completion under the same setting as matrix sensing we observe subset of the entries of namely we assume that is drawn uniformly at random mi is observed independently with probability to exactly recover common assumption is the incoherence of which will be specified later popular approach for recovering is to solve the following convex optimization problem min km subject to pw pw where pw is an operator defined as pw ij mij if and otherwise similar to matrix sensing existing algorithms for solving are computationally inefficient hence in practice we usually consider the following nonconvex optimization problem min fw where fw kpw pw similar to matrix sensing can also be efficiently solved by algorithms due to space limitation we present these matrix completion algorithms in algorithm of appendix for the convenience of later convergence analysis we partition the observation set into subsets using algorithm in appendix however in practice we do not need the partition scheme we simply set before we present the main results we introduce an assumption known as the incoherence property assumption the target rank matrix is incoherent with parameter given the rank singular value decomposition of we have max ku and max kv the incoherence assumption guarantees that is far from sparse matrix which makes it feasible to complete when its entries are missing uniformly at random the following theorem establishes the iteration complexity and the estimation error under the frobenius norm theorem suppose that there exists universal constant such that satisfies log log where is the precision then there exist an and universal constants and such that for any log we have km kf with high probability due to space limit we defer the proof of theorem to the longer version of this paper theorem implies that all three nonconvex optimization algorithms converge to the global optimum at geometric rate furthermore our results indicate that the completion of the true low rank matrix up to requires the entry observation probability to satisfy log log this result matches the result established by which is the result for alternating minimization moreover our analysis covers three nonconvex optimization algorithms experiments estimation error estimation error we present numerical experiments for matrix sensing to support our theoretical analysis we choose and and vary from to each entry of ai are independent sampled and ve are two matrices from we then generate where with all their entries independently sampled from we then generate measurements by bi hai for figure illustrates the empirical performance of the alternating exact minimization and alternating gradient descent 
algorithms for single realization the step size for the alternating gradient descent algorithm is determined by the backtracking line search procedure we see that both algorithms attain linear rate of convergence for and both algorithms fail for because is below the minimum requirement of sample complexity for the exact matrix recovery number of iterations number of iterations alternating exact minimization alternating gradient descent figure two illustrative examples for matrix sensing the vertical axis corresponds to estimation error km kf the horizontal axis corresponds to numbers of iterations both the alternating exact minimization and alternating gradient descent algorithms attain linear rate of convergence for and but both algorithms fail for because is below the minimum requirement of sample complexity for the exact matrix recovery references sanjeev arora rong ge tengyu ma and ankur moitra simple efficient and neural algorithms for sparse coding arxiv preprint sivaraman balakrishnan martin wainwright and bin yu statistical guarantees for the em algorithm from population to analysis arxiv preprint emmanuel xiaodong li and mahdi soltanolkotabi phase retrieval via wirtinger flow theory and algorithms ieee transactions on information theory emmanuel and benjamin recht exact matrix completion via convex optimization foundations of computational mathematics emmanuel and terence tao the power of convex relaxation matrix completion ieee transactions on information theory yudong chen matrix completion arxiv preprint david gross recovering matrices from few coefficients in any basis ieee transactions on information theory moritz hardt understanding alternating minimization for matrix completion in symposium on foundations of computer science pages moritz hardt raghu meka prasad raghavendra and benjamin weitz computational limits for matrix completion arxiv preprint moritz hardt and mary wootters fast matrix completion without the condition number arxiv preprint trevor hastie rahul mazumder jason lee and reza zadeh matrix completion and svd via fast alternating least squares arxiv preprint prateek jain raghu meka and inderjit dhillon guaranteed rank minimization via singular value projection in advances in neural information processing systems pages prateek jain and praneeth netrapalli fast exact matrix completion with finite samples arxiv preprint prateek jain praneeth netrapalli and sujay sanghavi matrix completion using alternating minimization in symposium on theory of computing pages raghunandan keshavan andrea montanari and sewoong oh matrix completion from few entries ieee transactions on information theory raghunandan keshavan andrea montanari and sewoong oh matrix completion from noisy entries journal of machine learning research yehuda koren the bellkor solution to the netflix grand prize netflix prize documentation kiryung lee and yoram bresler admira atomic decomposition for minimum rank approximation ieee transactions on information theory sahand negahban and martin wainwright estimation of near matrices with noise and scaling the annals of statistics yurii nesterov introductory lectures on convex optimization basic course volume springer arkadiusz paterek improving regularized singular value decomposition for collaborative filtering in proceedings of kdd cup and workshop volume pages benjamin recht simpler approach to matrix completion journal of machine learning research benjamin recht maryam fazel and pablo parrilo guaranteed solutions of linear matrix equations via nuclear 
norm minimization siam review benjamin recht and christopher parallel stochastic gradient algorithms for matrix completion mathematical programming computation angelika rohde and alexandre tsybakov estimation of matrices the annals of statistics gilbert stewart sun and harcourt jovanovich matrix perturbation theory volume academic press new york ruoyu sun and luo guaranteed matrix completion via factorization arxiv preprint and domonkos tikk major components of the gravity recommendation system acm sigkdd explorations newsletter 
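Before moving on to the next paper, here is a minimal sketch of the matrix-sensing updates discussed above: spectral initialization from the measurements followed by alternating exact minimization (least squares) over the two factors. This is a generic illustration under our own assumptions, not the authors' Algorithm 1; in particular, it omits the QR renormalization step that Algorithm 1 interleaves with the updates and that the paper's theory relies on. A_list is assumed to be a list of dense numpy sensing matrices and b the corresponding measurements.

```python
import numpy as np

def alt_min_sensing(A_list, b, r, n_iters=50):
    """Alternating least squares for b_i = <A_i, M>, with M ~ U V^T of rank r."""
    b = np.asarray(b, dtype=float)
    m, n = A_list[0].shape
    # Spectral initialization: rank-r truncation of (1/N) * sum_i b_i A_i.
    M0 = sum(bi * Ai for bi, Ai in zip(b, A_list)) / len(b)
    Uf, s, Vt = np.linalg.svd(M0, full_matrices=False)
    U = Uf[:, :r] * s[:r]
    V = Vt[:r, :].T
    for _ in range(n_iters):
        # V-step: b_i = <vec(V), vec(A_i^T U)> is linear in V.
        D = np.stack([(Ai.T @ U).ravel() for Ai in A_list])
        V = np.linalg.lstsq(D, b, rcond=None)[0].reshape(n, r)
        # U-step: b_i = <vec(U), vec(A_i V)> is linear in U.
        D = np.stack([(Ai @ V).ravel() for Ai in A_list])
        U = np.linalg.lstsq(D, b, rcond=None)[0].reshape(m, r)
    return U @ V.T
```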
Automatic Variational Inference in Stan. Alp Kucukelbir, Columbia University; Rajesh Ranganath, Princeton University; Andrew Gelman, Columbia University; David Blei, Columbia University.

Abstract. Variational inference is a scalable technique for approximate Bayesian inference. Deriving variational inference algorithms requires tedious model-specific calculations; this makes it difficult for non-experts to use. We propose an automatic variational inference algorithm, automatic differentiation variational inference (ADVI). We implement it in Stan (code available), a probabilistic programming system. In ADVI the user provides a Bayesian model and a dataset, nothing else. We make no conjugacy assumptions and support a broad class of models. The algorithm automatically determines an appropriate variational family and optimizes the variational objective. We compare ADVI to MCMC sampling across hierarchical generalized linear models, nonconjugate matrix factorization, and a mixture model. We train the mixture model on a quarter million images. With ADVI we can use variational inference on any model we write in Stan.

Introduction. Bayesian inference is a powerful framework for analyzing data. We design a model for data using latent variables; we then analyze data by calculating the posterior density of the latent variables. For machine learning models, calculating the posterior is often difficult, so we resort to approximation. Variational inference approximates the posterior with a simpler distribution: we search over a family of simple distributions and find the member closest to the posterior. This turns approximate inference into optimization. Variational inference has had tremendous impact on machine learning; it is typically faster than Markov chain Monte Carlo (MCMC) sampling, as we show here too, and has recently been scaled up to massive data.

Unfortunately, variational inference algorithms are difficult to derive. We must first define the family of approximating distributions and then calculate model-specific quantities relative to that family to solve the variational optimization problem. Both steps require expert knowledge, and the resulting algorithm is tied to both the model and the chosen approximation.

In this paper we develop a method for automating variational inference, automatic differentiation variational inference (ADVI). Given any model from a wide class, specifically probability models differentiable with respect to their latent variables, ADVI determines an appropriate variational family and an algorithm for optimizing the corresponding variational objective. We implement ADVI in Stan, a flexible probabilistic programming system. Stan provides a language to define probabilistic models (see the program listing below), as well as a model compiler, a library of transformations, and an efficient automatic differentiation toolbox. With ADVI we can now use variational inference on any model we write in Stan (see the appendices); ADVI is available in Stan.

[Figure: predictive accuracy results for a Gaussian mixture model of an image histogram dataset. Both panels plot average log predictive likelihood against wall-clock time in seconds: left, a subset of the images; right, the full dataset. ADVI outperforms NUTS, the default sampling method in Stan, and scales to large datasets by subsampling minibatches from the dataset at each iteration; more details appear in the empirical study and the appendix.]

The figure above illustrates the advantages of our method. Consider a nonconjugate Gaussian mixture model for analyzing natural images; this is a short program in Stan. The left panel illustrates Bayesian inference on a subset of the images. The y-axis is average log predictive likelihood, a measure of model fitness; the x-axis is time on a log scale. ADVI is orders of magnitude faster than NUTS, a state-of-the-art MCMC algorithm and Stan's default inference technique. We also study
nonconjugate matrix factorization models and hierarchical generalized linear models in the empirical study. The right panel of the figure above illustrates Bayesian inference on the full set of images, the size of data we more commonly find in machine learning. Here we use ADVI with stochastic subsampling, giving an approximate posterior in under two hours; for data like these, MCMC techniques cannot complete the analysis.

Related work. ADVI automates variational inference within the Stan probabilistic programming system. This draws on two major themes. The first is a body of work that aims to generalize variational inference. Kingma and Welling and Rezende et al. describe a reparameterization of the variational problem that simplifies optimization. Ranganath et al. and Salimans and Knowles propose a black-box technique, one that only requires the model and the gradient of the approximating family. Titsias and Lazaro-Gredilla leverage the gradient of the joint density for a small class of models. Here we build on and extend these ideas to automate variational inference; we highlight technical connections as we develop the method. The second theme is probabilistic programming. Wingate and Weber study variational inference in general probabilistic programs, as supported by languages like Church, Venture, and Anglican. Another probabilistic programming system implements variational message passing, an efficient algorithm for conditionally conjugate graphical models. Stan supports a more comprehensive class of nonconjugate models with differentiable latent variables.

Automatic differentiation variational inference. ADVI follows a straightforward recipe. First, we transform the support of the latent variables to the real coordinate space; for example, the logarithm transforms a positive variable, such as a standard deviation, to the real line. Then we posit a Gaussian variational distribution to approximate the posterior; this induces a non-Gaussian approximation in the original variable space. Last, we combine automatic differentiation with stochastic optimization to maximize the variational objective. We begin by defining the class of models we support.

Differentiable probability models. Consider a dataset X with N observations; each x_n is a discrete or continuous random vector. The likelihood p(X | theta) relates the observations to a set of latent random variables theta. Bayesian analysis posits a prior density p(theta) on the latent variables; combining the likelihood with the prior gives the joint density p(X, theta).

[Figure: specifying a simple nonconjugate probability model in Stan. The listing is, approximately:

    data {
      int N;                   // number of observations
      int x[N];                // discrete-valued observations
    }
    parameters {
      real<lower=0> theta;     // latent variable, must be positive
    }
    model {
      theta ~ weibull(1.5, 1); // nonconjugate Weibull prior (hyperparameters here are illustrative)
      for (n in 1:N)
        x[n] ~ poisson(theta); // likelihood
    }
]

We focus on approximate inference for differentiable probability models. These models have continuous latent variables theta; they also have a gradient of the log joint with respect to the latent variables, grad_theta log p(X, theta). The gradient is valid within the support of the prior, a subset of R^K, where K is the dimension of the latent variable space. We assume that the support of the posterior equals that of the prior, and we make no assumptions about conjugacy, either full or conditional.

For example, consider a model that contains a Poisson likelihood with unknown rate theta. The observed variable is discrete; the latent rate is continuous and positive. We place a Weibull prior on theta, defined over the positive real numbers. The resulting joint density describes a nonconjugate differentiable probability model (see the Stan program above); its partial derivative with respect to theta is valid within the support of the Weibull distribution, the positive reals. Because this model is nonconjugate, the posterior is not a Weibull distribution; this presents a challenge for classical variational inference. Below we will see how ADVI handles this model.

Many machine learning models are differentiable, for example linear
and logistic regression matrix factorization with continuous or discrete measurements linear dynamical systems and gaussian processes mixture models hidden markov models and topic models have discrete random variables marginalizing out these discrete variables renders these models differentiable we show an example in section however marginalization is not tractable for all models such as the ising model sigmoid belief networks and untruncated bayesian nonparametric models variational inference bayesian inference requires the posterior density which describes how the latent variables vary when conditioned on set of observations many posterior densities are intractable because their normalization constants lack closed forms thus we seek to approximate the posterior consider an approximating density parameterized by we make no assumptions about its shape or support we want to find the parameters of to best match the posterior according to some loss function variational inference minimizes the divergence from the approximation to the posterior arg min typically the divergence also lacks closed form instead we maximize the evidence lower bound proxy to the divergence log log the first term is an expectation of the joint density under the approximation and the second is the entropy of the variational density maximizing the minimizes the divergence the posterior of fully conjugate model is in the same family as the prior conditionally conjugate model has this property within the complete conditionals of the model the minimization problem from eq becomes arg max such that we explicitly specify the constraint implied in the we highlight this constraint as we do not specify the form of the variational approximation thus must remain within the support of the posterior which we assume equal to the support of the prior why is difficult to automate in classical variational inference we typically design conditionally conjugate model then the optimal approximating family matches the prior this satisfies the support constraint by definition when we want to approximate models that are not conditionally conjugate we carefully study the model and design custom approximations these depend on the model and on the choice of the approximating density one way to automate is to use variational inference if we select density whose support matches the posterior then we can directly maximize the using monte carlo integration and stochastic optimization another strategy is to restrict the class of models and use fixed variational approximation for instance we may use gaussian density for inference in unrestrained differentiable probability models where rk we adopt approach first we automatically transform the support of the latent variables in our model to the real coordinate space then we posit gaussian variational density the transformation induces approximation in the original variable space and guarantees that it stays within the support of the posterior here is how it works automatic transformation of constrained variables begin by transforming the support of the latent variables such that they live in the real coordinate space rk define differentiable function rk and identify the transformed variables as the transformed joint density is det jt where is the joint density in the original latent variable space and jt is the jacobian of the inverse of transformations of continuous probability densities require jacobian it accounts for how the transformation warps unit volumes see appendix consider again our running 
example: the rate theta lives in the positive reals. The logarithm zeta = T(theta) = log(theta) transforms the positive reals to the real line. Its Jacobian adjustment is the derivative of the inverse of the logarithm, |det J_{T^{-1}}(zeta)| = exp(zeta), and the transformed joint density is p(X, zeta) = p(X, theta = exp(zeta)) exp(zeta). The figures below depict this transformation.

As we describe in the introduction, we implement our algorithm in Stan to enable generic inference. Stan implements a model compiler that automatically handles transformations: it works by applying a library of transformations and their corresponding Jacobians to the joint model. This transforms the joint density of any differentiable probability model to the real coordinate space; now we can choose a variational distribution independent of the model. Stan provides transformations for upper and lower bounds, simplex and ordered vectors, and structured matrices such as covariance matrices and Cholesky factors.

Implicit non-Gaussian variational approximation. After the transformation, the latent variables zeta have support on all of R^K. We posit a diagonal Gaussian variational approximation on zeta; if instead we placed such an approximation directly on theta, then outside the support of the posterior we would have q > 0 while p = 0, and the expectations E_q[log q] and E_q[log p] in the variational objective would be ill-defined.

[Figure: transformations for the running example. The purple line is the posterior; the green line is the approximation. Panels show the latent variable space (positive reals), the real coordinate space after the transformation T, and the standardized space; the variational approximation is Gaussian in the real coordinate space, and the standardization absorbs the parameters of the Gaussian, so we maximize the objective in the standardized space with a fixed standard Gaussian approximation.]

The variational parameter vector contains the mean and standard deviation of each Gaussian factor; this defines our variational approximation in the real coordinate space. The transformation T maps the support of the latent variables to the real coordinate space, and its inverse T^{-1} maps back to the support of the latent variables. This implicitly defines the variational approximation in the original latent variable space as q(T(theta)) |det J_T(theta)|. The transformation ensures that the support of this approximation is always bounded by that of the true posterior in the original latent variable space; thus we can freely optimize the ELBO in the real coordinate space without worrying about the support-matching constraint. The ELBO in the real coordinate space is

L(mu, sigma) = E_{q(zeta)}[ log p(X, T^{-1}(zeta)) + log |det J_{T^{-1}}(zeta)| ] + sum_k log sigma_k + const,

where we plug in the analytic form of the Gaussian entropy; the derivation is in the appendix, and a small sketch of evaluating the integrand for the running example appears at the end of this subsection. We choose a diagonal Gaussian for efficiency. This choice may call to mind the Laplace approximation technique, where a Taylor expansion around the MAP estimate gives a Gaussian approximation to the posterior. However, using a Gaussian variational approximation is not equivalent to the Laplace approximation: the Laplace approximation relies on maximizing the probability density, and it fails with densities that have discontinuities on the boundary of their support; the Gaussian variational approximation considers probability mass and does not suffer this degeneracy. Furthermore, our approach is distinct in another way: because of the transformation, the posterior approximation in the original latent variable space is non-Gaussian.

Automatic differentiation for stochastic optimization. We now maximize the ELBO in the real coordinate space over mu and sigma, subject to sigma being positive elementwise. We use gradient ascent to reach a local maximum of the ELBO. Unfortunately, we cannot apply automatic differentiation to the ELBO in this form: the expectation defines an intractable integral that depends on mu and sigma, so we cannot directly represent it as a computer program; moreover, the standard deviations in sigma must remain positive. Thus we employ one final transformation, an elliptical standardization, shown in the figure above.
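The sketch below illustrates, for the running Poisson model with a Weibull prior, the quantity inside the expectation of the ELBO above: the log joint evaluated at theta = T^{-1}(zeta) = exp(zeta), plus the log-Jacobian term, which for the logarithm transform is simply zeta. The function names are ours, and the Weibull shape and scale values are illustrative, not taken from the paper.

```python
import numpy as np
from scipy import stats

def log_joint_constrained(theta, x, shape=1.5, scale=1.0):
    """log p(x, theta) = Weibull log prior + sum of Poisson log likelihoods (theta > 0)."""
    return (stats.weibull_min.logpdf(theta, c=shape, scale=scale)
            + np.sum(stats.poisson.logpmf(x, mu=theta)))

def log_joint_unconstrained(zeta, x):
    """log p(x, T^{-1}(zeta)) + log |det J_{T^{-1}}(zeta)| for T(theta) = log(theta)."""
    theta = np.exp(zeta)          # inverse transform
    return log_joint_constrained(theta, x) + zeta   # + log Jacobian, which equals zeta
```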
First, we re-parameterize the Gaussian distribution with the log of the standard deviation, omega = log(sigma), applied elementwise; the support of omega is now the real coordinate space and sigma = exp(omega) is always positive. Then we define the standardization eta = diag(exp(omega))^{-1} (zeta - mu). The standardization, also known as a whitening transformation (an invertible transformation) or the reparameterization trick, encapsulates the variational parameters and leaves a fixed standard Gaussian density.

Algorithm 1 (automatic differentiation variational inference). Input: dataset X, model p(X, theta). Set the iteration counter to zero and choose a stepsize sequence; initialize mu and omega. While the change in the ELBO is above some threshold: draw samples eta from the standard multivariate Gaussian; invert the standardization, zeta = mu + diag(exp(omega)) eta; approximate the gradients with respect to mu and omega using Monte Carlo integration; update mu and omega; increment the iteration counter. Return mu and omega. (A minimal code sketch of this loop follows below.)

The standardization transforms the variational problem into maximizing, over unconstrained mu and omega,

L(mu, omega) = E_{N(eta; 0, I)}[ log p(X, T^{-1}(zeta)) + log |det J_{T^{-1}}(zeta)| ] + sum_k omega_k, with zeta = mu + diag(exp(omega)) eta,

where we drop constant terms from the calculation. This expectation is with respect to a standard Gaussian, and the parameters mu and omega are both unconstrained. We push the gradient inside the expectation and apply the chain rule to get

grad_mu L = E_{N(eta)}[ grad_zeta ( log p(X, T^{-1}(zeta)) + log |det J_{T^{-1}}(zeta)| ) ],
grad_{omega_k} L = E_{N(eta)}[ grad_{zeta_k} ( log p(X, T^{-1}(zeta)) + log |det J_{T^{-1}}(zeta)| ) eta_k exp(omega_k) ] + 1.

The derivations are in the appendix. We can now compute the gradients inside the expectation with automatic differentiation. The only thing left is the expectation itself, for which Monte Carlo integration provides a simple approximation: draw samples from the standard Gaussian and evaluate the empirical mean of the gradients within the expectation. This gives unbiased noisy gradients of the ELBO for any differentiable probability model, and we can now use these gradients in a stochastic optimization routine to automate variational inference.

Scalable automatic variational inference. Equipped with unbiased noisy gradients of the ELBO, ADVI implements a stochastic gradient ascent algorithm. We ensure convergence by choosing a decreasing stepsize sequence; in practice we use an adaptive sequence with finite memory (see the appendix for details). The per-iteration complexity of ADVI scales with the dataset size, the number of Monte Carlo samples (typically a small number), and the dimension of the latent space, whereas coordinate-ascent variational inference has a comparable complexity per pass over the dataset. We scale ADVI to large datasets using stochastic optimization; the adjustment to Algorithm 1 is simple: sample a minibatch of size B from the dataset and scale the likelihood of the sampled minibatch by N/B, so that the per-iteration cost depends on B rather than N.

[Figure: hierarchical generalized linear models. Average log predictive likelihood as a function of wall time in seconds for ADVI, NUTS, and HMC: left, linear regression with ARD; right, hierarchical logistic regression.]

Empirical study. We now study ADVI across a variety of models. We compare its speed and accuracy to two Markov chain Monte Carlo sampling algorithms, Hamiltonian Monte Carlo (HMC) and the NUTS sampler; we assess ADVI convergence by tracking the ELBO. To place ADVI and MCMC on a common scale, we report predictive likelihood on held-out data as a function of time. We approximate the posterior predictive likelihood using a Monte Carlo estimate: for MCMC we plug in posterior samples, and for ADVI we draw samples from the posterior approximation during the optimization. We initialize ADVI with a draw from a standard Gaussian. We explore two hierarchical regression models, two matrix factorization models, and a mixture model; all of these models have nonconjugate prior structures. We conclude by analyzing a dataset of natural images, where we report results across a range of minibatch sizes.

Comparison to sampling: hierarchical regression models. We begin with two nonconjugate regression models, linear regression with automatic relevance determination (ARD) and hierarchical logistic regression. Linear regression with ARD: this is a sparse linear regression model with a hierarchical prior structure (details in the appendix).
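The following is a minimal sketch of the ADVI update loop in Algorithm 1 above, written under our own simplifying assumptions: a fixed step size instead of the paper's adaptive sequence, and a user-supplied grad_log_joint(zeta) that returns the gradient, with respect to zeta, of log p(X, T^{-1}(zeta)) + log |det J_{T^{-1}}(zeta)| (obtained analytically or by automatic differentiation). The function name and defaults are illustrative.

```python
import numpy as np

def advi(grad_log_joint, dim, n_iters=2000, n_samples=1, step=0.05, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    mu = np.zeros(dim)        # variational means
    omega = np.zeros(dim)     # log standard deviations (unconstrained)
    for _ in range(n_iters):
        g_mu = np.zeros(dim)
        g_omega = np.zeros(dim)
        for _ in range(n_samples):
            eta = rng.standard_normal(dim)
            zeta = mu + np.exp(omega) * eta          # invert the standardization
            g = grad_log_joint(zeta)
            g_mu += g / n_samples                    # Monte Carlo gradient w.r.t. mu
            g_omega += g * eta * np.exp(omega) / n_samples
        g_omega += 1.0                               # gradient of the Gaussian entropy term
        mu += step * g_mu                            # stochastic gradient ascent on the ELBO
        omega += step * g_omega
    return mu, omega
```

For the running Poisson example, advi(lambda z: some_gradient(z, x), dim=1) would return the mean and log standard deviation of the Gaussian approximation in the unconstrained space, which maps back to a non-Gaussian approximation of the rate theta.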
We simulate a dataset in which half of the regressors have no predictive power, train on a subset of the samples, and hold out the rest for testing. Logistic regression with a spatial hierarchical prior: this is a hierarchical logistic regression model from political science; the prior captures dependencies, such as states and regions, in a polling dataset from a United States presidential election (details in the appendix). We train on most of the data points and withhold the remainder for evaluation. The regressors contain age, education, state, and region indicators, which determine the dimension of the regression problem.

Results. The corresponding figure plots average log predictive accuracy as a function of time for these simple models; all methods reach the same predictive accuracy. We study ADVI with two settings of M, the number of MC samples used to estimate the gradients. A single sample per iteration is sufficient, and it is also the fastest; we set M = 1 from here on.

Exploring nonconjugacy: matrix factorization models. We continue by exploring two nonconjugate matrix factorization models, a constrained gamma-Poisson model and a Dirichlet-exponential model. Here we show how easy it is to explore new models using ADVI. In both models we use the Frey Face dataset, which contains frames of facial expressions extracted from a video sequence. Constrained gamma-Poisson: this is a gamma-Poisson factorization model with an ordering constraint, so that each row of the gamma matrix goes from small to large values (details in the appendix).

(Figure: matrix factorization of the Frey Faces dataset. Comparison of ADVI and NUTS in terms of average log predictive likelihood as a function of wall time in seconds, for the gamma-Poisson and Dirichlet-exponential models, together with the factors recovered by each model.)

Dirichlet-exponential: this is a nonconjugate Dirichlet-exponential factorization model with a Poisson likelihood (details in the appendix). Results: the figure shows average log predictive accuracy as well as ten factors recovered from both models. ADVI provides an order of magnitude speed improvement, and the samplers struggle with the Dirichlet-exponential model; in both cases HMC does not produce any useful samples within a budget of one hour, so we omit it from the plots.

Scaling to large datasets: Gaussian mixture model. We conclude with the Gaussian mixture model (GMM) example highlighted earlier. This is a nonconjugate GMM applied to color image histograms. We place a Dirichlet prior on the mixture proportions, a Gaussian prior on the component means, and a lognormal prior on the standard deviations (details in the appendix). We explore the imageCLEF image dataset and withhold a subset of the images for evaluation. In the first experiment we randomly select a small subset of the images and train a model with a moderate number of mixture components; the sampling-based methods struggle to find an adequate solution or fail altogether, likely due to label switching, which can affect MCMC techniques in mixture models. The second experiment shows results on the full dataset; here we use ADVI with stochastic subsampling of minibatches and increase the number of mixture components. With a sufficiently large minibatch size ADVI reaches high predictive accuracy, while smaller minibatch sizes lead to suboptimal solutions, an effect also observed in prior work; ADVI converges in about two hours.

Conclusion. We developed automatic differentiation variational inference (ADVI) in Stan. ADVI leverages automatic transformations, an implicit variational approximation, and automatic differentiation. This is a valuable tool: we can explore many models and analyze large datasets with ease. We emphasize that ADVI is currently available as part of
stan it is ready for anyone to use acknowledgments we thank dustin tran bruno jacobs and the reviewers for their comments this work is supported by nsf onr darpa sloan ies de ndseg facebook adobe amazon and the siebel scholar and john templeton foundations references michael jordan zoubin ghahramani tommi jaakkola and lawrence saul an introduction to variational methods for graphical models machine learning martin wainwright and michael jordan graphical models exponential families and variational inference foundations and trends in machine learning matthew hoffman david blei chong wang and john paisley stochastic variational inference the journal of machine learning research stan development team stan modeling language users guide and reference manual matthew hoffman and andrew gelman the sampler the journal of machine learning research diederik kingma and max welling variational bayes danilo rezende shakir mohamed and daan wierstra stochastic backpropagation and approximate inference in deep generative models in icml pages rajesh ranganath sean gerrish and david blei black box variational inference in aistats pages tim salimans and david knowles on using control variates with stochastic approximation for variational bayes arxiv preprint michalis titsias and miguel doubly stochastic variational bayes for nonconjugate inference in icml pages david wingate and theophane weber automated variational inference in probabilistic programming arxiv preprint noah goodman vikash mansinghka daniel roy keith bonawitz and joshua tenenbaum church language for generative models in uai pages vikash mansinghka daniel selsam and yura perov venture probabilistic programming platform with programmable inference frank wood jan willem van de meent and vikash mansinghka new approach to probabilistic programming inference in aistats pages john winn and christopher bishop variational message passing in journal of machine learning research pages christopher bishop pattern recognition and machine learning springer new york david olive statistical theory and inference springer manfred opper and cédric archambeau the variational gaussian approximation revisited neural computation wolfgang härdle and léopold simar applied multivariate statistical analysis springer christian robert and george casella monte carlo statistical methods springer john duchi elad hazan and yoram singer adaptive subgradient methods for online learning and stochastic optimization the journal of machine learning research mark girolami and ben calderhead riemann manifold langevin and hamiltonian monte carlo methods journal of the royal statistical society series andrew gelman and jennifer hill data analysis using regression and models cambridge university press john canny gap factor model for discrete data in acm sigir pages acm mauricio villegas roberto paredes and bart thomee overview of the imageclef scalable concept image annotation subtask in clef evaluation labs and workshop 
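To make the stochastic optimization loop of the preceding section concrete, here is a minimal sketch of one ADVI gradient step for a diagonal Gaussian approximation, written in Python with NumPy. The callables grad_log_joint and grad_log_det_jacobian stand in for quantities that Stan obtains via automatic differentiation; they, the array shapes, and the fixed step size are illustrative assumptions rather than part of Stan's actual implementation.

import numpy as np

def advi_step(mu, omega, grad_log_joint, grad_log_det_jacobian,
              num_samples=10, step=0.01):
    """One noisy gradient-ascent step on the ELBO in the real coordinate space.

    mu, omega: current variational mean and log standard deviation, shape (K,)
    grad_log_joint(zeta): gradient of log p(X, T^{-1}(zeta)) w.r.t. zeta, shape (K,)
    grad_log_det_jacobian(zeta): gradient of log |det J_{T^{-1}}(zeta)| w.r.t. zeta, shape (K,)
    """
    K = mu.shape[0]
    grad_mu = np.zeros(K)
    grad_omega = np.zeros(K)
    for _ in range(num_samples):
        eta = np.random.randn(K)             # draw from the standard Gaussian
        zeta = mu + np.exp(omega) * eta      # invert the standardization
        g = grad_log_joint(zeta) + grad_log_det_jacobian(zeta)
        grad_mu += g
        grad_omega += g * eta * np.exp(omega)
    grad_mu /= num_samples
    grad_omega = grad_omega / num_samples + 1.0   # +1 from the Gaussian entropy term
    return mu + step * grad_mu, omega + step * grad_omega

A full implementation would replace the fixed step size with the adaptive stepsize sequence mentioned above and, for large datasets, evaluate grad_log_joint on a minibatch scaled by the ratio of the dataset size to the minibatch size.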
models for speech recognition dzmitry bahdanau jacobs university bremen germany jan chorowski university of wrocław poland dmitriy serdyuk de kyunghyun cho de yoshua bengio de cifar senior fellow abstract recurrent sequence generators conditioned on input data through an attention mechanism have recently shown very good performance on range of tasks including machine translation handwriting synthesis and image caption generation we extend the with features needed for speech recognition we show that while an adaptation of the model used for machine translation in reaches competitive phoneme error rate per on the timit phoneme recognition task it can only be applied to utterances which are roughly as long as the ones it was trained on we offer qualitative explanation of this failure and propose novel and generic method of adding to the attention mechanism to alleviate this issue the new method yields model that is robust to long inputs and achieves per in single utterances and in longer repeated utterances finally we propose change to the attention mechanism that prevents it from concentrating too much on single frames which further reduces per to level introduction recently recurrent networks have been successfully applied to wide variety of tasks such as handwriting synthesis machine translation image caption generation and visual object classification such models iteratively process their input by selecting relevant content at every step this basic idea significantly extends the applicability range of training methods for instance making it possible to construct networks with external memory we introduce extensions to recurrent networks that make them applicable to speech recognition learning to recognize speech can be viewed as learning to generate sequence transcription given another sequence speech from this perspective it is similar to machine translation and handwriting synthesis tasks for which methods have been found suitable however compared to machine translation speech recognition principally differs by requesting much longer input sequences thousands of frames instead of dozens of words which introduces challenge of distinguishing similar speech in single utterance it is also different from handwriting synthesis since the input sequence is much noisier and does not have as clear structure for these reasons speech recognition is an interesting testbed for developing new architectures capable of processing long and noisy inputs application of models to speech recognition is also an important step toward building fully trainable speech recognition systems which is an active area of research the an early version of this work was presented at the nips deep learning workshop explained in more detail in sec dominant approach is still based on hybrid systems consisting of deep neural acoustic model triphone hmm model and an language model this requires dictionaries of pronunciation and phoneme lexicons and training procedure to make the components work together excellent results by an recognizer have recently been reported with the system consisting of neural network and language model still the language model was added only at the last stage in that work thus leaving open question of how much an acoustic model can benefit from being aware of language model during training in this paper we evaluate models on the timit phoneme recognition task for each generated phoneme an attention mechanism selects or weighs the signals produced by trained feature extraction mechanism at potentially 
all of the time steps in the input sequence speech frames the weighted feature vector then helps to condition the generation of the next element of the output sequence since the utterances in this dataset are rather short mostly under seconds we measure the ability of the considered models in recognizing much longer utterances which were created by artificially concatenating the existing utterances we start with model proposed in for the machine translation task as the baseline this model seems entirely vulnerable to the issue of similar speech fragments but despite our expectations it was competitive on the original test set reaching phoneme error rate per however its performance degraded quickly with longer concatenated utterances we provide evidence that this model adapted to track the absolute location in the input sequence of the content it is recognizing strategy feasible for short utterances from the original test set but inherently unscalable in order to circumvent this undesired behavior in this paper we propose to modify the attention mechanism such that it explicitly takes into account both the location of the focus from the previous step as in and the features of the input sequence as in this is achieved by adding as inputs to the attention mechanism auxiliary convolutional features which are extracted by convolving the attention weights from the previous step with trainable filters we show that model with such convolutional features performs significantly better on the considered task per more importantly the model with convolutional features robustly recognized utterances many times longer than the ones from the training set always staying below per therefore the contribution of this work is for one we present novel purely neural speech recognition architecture based on an attention mechanism whose performance is comparable to that of the conventional approaches on the timit dataset moreover we propose generic method of adding location awareness to the attention mechanism finally we introduce modification of the attention mechanism to avoid concentrating the attention on single frame and thus avoid obtaining less effective training examples bringing the per down to model for speech recognition general framework an recurrent sequence generator arsg is recurrent neural network that stochastically generates an output sequence yt from an input in practice is often processed by an encoder which outputs sequential input representation hl more suitable for the attention mechanism to work with in the context of this work the output is sequence of phonemes and the input is sequence of feature vectors each feature vector is extracted from small overlapping window of audio frames the encoder is implemented as deep bidirectional recurrent network birnn to form sequential representation of length at the step an arsg generates an output yi by focusing on the relevant elements of αi attend gi αi hj yi generate gi yi si gi αi αi αi αi hj figure two steps of the proposed recurrent sequence generator arsg with hybrid attention mechanism computing based on both content and location previous information the dotted lines correspond to eq thick solid lines to eq and dashed lines to eqs where is the state of the recurrent neural network to which we refer as the generator αi rl is vector of the attention weights also often called the alignment using the terminology from we call gi glimpse the step is completed by computing new generator state si recurrency gi yi long memory units lstm and gated 
recurrent units gru are typically used as recurrent activation to which we refer as recurrency the process is graphically illustrated in fig inspired by we distinguish between and hybrid attention mechanisms attend in eq describes the most generic hybrid attention if the term is dropped from attend arguments αi attend we call it see or in this case attend is often implemented by scoring each element in separately and normalizing the scores αi ei score hj exp ei exp ei the main limitation of such scheme is that identical or very similar elements of are scored equally regardless of their position in the sequence this is the issue of similar speech fragments raised above often this issue is partially alleviated by an encoder such as birnn or deep convolutional network that encode contextual information into every element of however capacity of elements is always limited and thus disambiguation by context is only possible to limited extent alternatively attention mechanism computes the alignment from the generator state and the previous alignment only such that αi attend for instance graves used the attention mechanism using gaussian mixture model in his handwriting synthesis model in the case of speech recognition this type of attention mechanism would have to predict the distance between consequent phonemes using only which we expect to be hard due to large variance of this quantity for these limitations associated with both and mechanisms we argue that hybrid attention mechanism is natural candidate for speech recognition informally we would like an attention model that uses the previous alignment to select short list of elements from from which the attention in eqs will select the relevant ones without confusion proposed model arsg with convolutional features we start from the model with the attention mechanism proposed in this model can be described by eqs where ei tanh hj and are vectors and are matrices we extend this attention mechanism of the original model to be by making it take into account the alignment produced at the previous step first we extract vectors fi rk for every position of the previous alignment by convolving it with matrix fi these additional vectors fi are then used by the scoring mechanism ei ei tanh hj fi score normalization sharpening and smoothing there are three potential issues with the normalization in eq first when the input sequence is long the glimpse gi is likely to contain noisy information from many irrelevant feature vectors hj as the normalized scores αi are all positive and sum to this makes it difficult for the proposed arsg to focus clearly on few relevant frames at each time second the attention mechanism is required to consider all the frames each time it decodes single output yi while decoding the output of length leading to computational complexity of lt this may easily become prohibitively expensive when input utterances are long and issue that is less serious for machine translation because in that case the input sequence is made of words not of acoustic frames the other side of the coin is that the use of softmax normalization in eq prefers to mostly focus on only single feature vector hj this prevents the model from aggregating multiple frames to form glimpse gi sharpening there is straightforward way to address the first issue of noisy glimpse by sharpening the scores αi one way to sharpen the weights is to introduce an inverse temperature to the softmax function such that ai exp βei exp βei or to keep only the frames according to the 
scores and them these sharpening methods however still requires us to compute the score of every frame each time lt and they worsen the second issue of overly narrow focus we also propose and investigate windowing technique at each time the attention mechanism considers only subsequence hpi hpi of the whole sequence where is the predefined window width and pi is the median of the alignment the scores for hj are not computed resulting in lower complexity of this windowing technique is similar to taking the frames and similarly has the effect of sharpening the proposed sharpening based on windowing can be used both during training and evaluation later in the experiments we only consider the case where it is used during evaluation smoothing we observed that the proposed sharpening methods indeed helped with long utterances however all of them and especially selecting the frame with the highest score negatively affected the model performance on the standard development set which mostly consists of short utterances this observations let us hypothesize that it is helpful for the model to aggregate selections from multiple frames in sense this brings more diversity more effective training examples to the output part of the model as more input locations are considered to facilitate this effect we replace the unbounded exponential function the softmax function in eq with the bounded logistic sigmoid such that ai ei ei this has the effect of smoothing the focus found by the attention mechanism related work speech recognizers based on the connectionist temporal classification ctc and its extension rnn transducer are the closest to the arsg model considered in this paper they follow earlier work on trainable deep learning over sequences with gradient signals flowing phoneme error rate dependency of error rate on beam search width baseline conv feats smooth focus dataset dev test beam width figure decoding performance the beam size for rigorous comparison if decoding failed to generate heosi we considered it wrongly recognized without retrying with larger beams size the models especially with smooth focus perform well even with beam width as small as through the alignment process they have been shown to perform well on the phoneme recognition task furthermore the ctc was recently found to be able to directly transcribe text from speech without any intermediate phonetic representation the considered arsg is different from both the ctc and rnn transducer in two ways first whereas the attention mechanism deterministically aligns the input and the output sequences the ctc and rnn transducer treat the alignment as latent random variable over which map maximum posteriori inference is performed this deterministic nature of the arsg alignment mechanism allows beam search procedure to be simpler furthermore we empirically observe that much smaller beam width can be used with the deterministic mechanism which allows faster decoding see sec and fig second the alignment mechanism of both the ctc and rnn transducer is constrained to be monotonic to keep marginalization of the alignment tractable on the other hand the proposed attention mechanism can result in alignment which makes it suitable for larger variety of tasks other than speech recognition hybrid attention model using convolution operation was also proposed in for neural turing machines ntm at each time step the ntm computes attention weights which are then convolved with predicted shifting distribution unlike the ntm approach the hybrid mechanism 
proposed here lets the learning procedure figure out how content-based and location-based addressing should be combined by a deep, parametric function (see the scoring equation above). Sukhbaatar et al. describe a similar hybrid attention mechanism in which location embeddings are used as input to the attention model. That approach has an important disadvantage: the model cannot work with input sequences longer than those seen during training. Our approach, on the other hand, works well on sequences many times longer than those seen during training (see the experiments below).

Experimental setup. We closely followed the procedure of earlier attention-based phoneme recognition work. All experiments were performed on the TIMIT corpus, using the split from the Kaldi TIMIT recipe: we trained on the standard training speaker set with all SA utterances removed, used the development speaker set for early stopping, and tested on the core test speaker set. All networks were trained on filterbank features together with the energy, computed for each frame, and their first and second temporal differences, which form the per-frame feature vector. Each feature was rescaled to have zero mean and unit variance over the training set. Networks were trained on the full phone set extended with an extra end-of-sequence token that was appended to each target sequence; similarly, we appended an end-of-sequence frame at the end of each input sequence to indicate the end of the utterance. Decoding was performed using the full phoneme set, while scoring was done on the reduced phoneme set.

Training procedure. One property of ARSG models is that different subsets of parameters are reused a different number of times: once per input position for the encoder parameters, once per input-output position pair for the attention weights, and once per output position for all the other parameters of the ARSG. This makes the scales of the derivatives with respect to different parameters vary significantly, so we used an adaptive learning-rate algorithm, AdaDelta, which has two hyperparameters. All weight matrices were initialized from a zero-mean Gaussian distribution with a small standard deviation; recurrent weights were orthogonalized.

As TIMIT is a relatively small dataset, proper regularization is crucial. We used adaptive weight noise as the main regularizer. We first trained our models with a column-norm constraint on the weight matrices (applying the weight noise from the very beginning of training caused severe underfitting) until the lowest development negative log-likelihood was reached; during this stage the two AdaDelta hyperparameters were kept at fixed values. At that point we began using the adaptive weight noise, with the model-complexity cost L_C scaled down, while disabling the column-norm constraints. Once a new lowest development cost was reached, we fine-tuned the model with a smaller stepsize parameter until no further improvement in the development phoneme error rate (PER) was observed over a fixed number of weight updates. A fixed batch size was used throughout training.

(Figure: alignments produced by the baseline model for the utterance "Michael colored the bedroom wall with crayons". The vertical bars indicate ground-truth phone locations from TIMIT; each row of the upper image indicates the frames selected by the attention mechanism to emit a phone symbol. The network has clearly learned to produce the alignment, with a tendency to look slightly ahead, and does not confuse the repeated "kcl k" phrase. Best viewed in color.)

Evaluated models. We evaluated ARSGs with different attention mechanisms. The encoder was a bidirectional RNN with GRU units in each direction, and the activations of its units were used as the representation h. The generator had a single recurrent layer of GRU units, and the generate network had a hidden layer of maxout units. The initial states of both the encoder and the generator were treated as additional trainable parameters. Our baseline model is the one with the purely content-based attention mechanism described above, with a scoring network that has a modest number of hidden units. The other two models use the
convolutional attention features introduced above, and one of them additionally uses the smooth focus described in the sharpening-and-smoothing section.

Decoding procedure. Beam search over phoneme sequences was used during decoding. Beam search was stopped when the end-of-sequence token <eos> was emitted. We started with a narrow beam and increased its width only when the network failed to produce <eos> with the narrower beam; as shown in the corresponding figure, decoding with a wider beam brings little benefit.

Results. (Table: phoneme error rates (PER) on the development and test sets for the baseline model, the baseline with convolutional features, and the baseline with convolutional features and smooth focus, compared against an RNN transducer and an HMM over a time-and-frequency convolutional net; the highlighted PER corresponds to the best error rate obtained with an attention-based recurrent sequence generator (ARSG) incorporating convolutional attention features and smooth focus.)

All the models achieved competitive PERs (see the table). With the convolutional features we see a relative improvement over the baseline, and a further one with the smoothing. To our surprise, the baseline model learned to align properly. An alignment produced by the baseline model on a sequence with repeated phonemes is presented in the figure above, which demonstrates that the baseline model is not confused by repetitions. We can also see from the figure that it prefers to select frames that are near the beginning of, or even slightly before, the phoneme location provided as part of the dataset. The alignments produced by the other models were visually very similar.

(Figure: results on the concatenated utterances. Each dot represents a single utterance, created either by concatenating multiple copies of the same utterance or by concatenating different, randomly chosen utterances. The highest robustness is achieved when the hybrid attention mechanism is combined with the proposed sharpening technique.)

Forced alignment of long utterances. The good performance of the baseline model led us to the question of how it distinguishes between repetitions of similar phoneme sequences, and how reliably it decodes longer sequences with more repetitions. We created two datasets of long utterances: one by repeating each test utterance, and the other by concatenating randomly chosen utterances. In both cases the waveforms were joined with silence inserted between them, marked as the "pau" phone, and progressively more utterances were concatenated. First, we checked the forced alignment on these longer utterances by forcing the generator to emit the correct phonemes. An alignment of a phone was considered correct if most of the alignment weight lies inside the phoneme window, extended by a few frames on each side; under this definition, all phones but the final <eos> in the alignment figure above are properly aligned. The first column of the corresponding figure shows the number of correctly aligned frames versus the utterance length in frames for some of the considered models. One can see that the baseline model was able to decode sequences only up to a limited length, both when a single utterance was repeated and when different utterances were concatenated; even when it failed, it still correctly aligned a sizeable prefix of the phones. The model with the hybrid attention mechanism with convolutional features, on the other hand, was able to align considerably longer sequences; however, once it began to fail, it failed to align almost all of the phones. The model with the smoothing behaved similarly to the one with convolutional features only. We examined failed alignments to understand these two different modes of failure; some examples are shown in the supplementary materials. We found that the baseline model properly aligns an initial portion of the utterance, then makes a jump to the end of the
recording and cycles over the last few phones. This behavior suggests that the baseline model learned to track its approximate location in the source sequence; however, this tracking capability is limited to the lengths observed during training, and once the tracker saturates it jumps to the end of the recording. In contrast, when the location-aware network failed, it simply stopped aligning: no particular frames were selected for each phone. We attribute this behavior to the issue of the noisy glimpse discussed in the sharpening-and-smoothing section: with a long utterance there are many irrelevant frames that negatively affect the weight assigned to the correct frames. In line with this conjecture, the network works slightly better on repetitions of the same utterance, where all frames are somehow relevant, than on concatenations of different utterances, where each misaligned frame is irrelevant.

To gain more insight, we applied the alignment-sharpening schemes described earlier. In the remaining columns of the forced-alignment figure we see that the sharpening methods help the location-aware network to find proper alignments, while they have little effect on the baseline network. The windowing technique helps both the baseline and the location-aware networks, with the latter properly aligning nearly all sequences. During visual inspection we noticed that, in the middle of very long utterances, the baseline model was confused by repetitions of similar content within the window, and that such confusions did not happen at the beginning; this supports the conjecture above.

(Figure: phoneme error rates obtained when decoding long sequences, formed either by concatenating different ("mixed") utterances or by repeating the same utterance, as a function of the number of repetitions. The baseline, convolutional-features, and smooth-focus networks are each decoded with the alignment-sharpening techniques, keeping only the top-scoring frames or windowing, that produced proper forced alignments. The proposed ARSGs are clearly more robust to the length of the utterances than the baseline is.)

Decoding long utterances. We then evaluated the models on the long sequences, decoding each model with the alignment-sharpening techniques that helped it obtain proper forced alignments; the results are presented in the figure above. The baseline model fails to decode long utterances, even when a narrow window is used to constrain the alignments it produces. The two other networks are able to decode utterances formed by concatenating many test utterances. Better results were obtained with a wider window, presumably because it more closely resembles the training conditions, in which the attention mechanism saw the whole input sequence at every step. With the wide window, both of these networks scored roughly the same PER on the long utterances, indicating that the proposed attention mechanism can scale to sequences much longer than those in the training set, with only minor modifications required at the decoding stage.

Conclusions. We proposed and evaluated a novel, end-to-end trainable speech recognition architecture based on a hybrid attention mechanism that combines both content and location information in order to select the next position in the input sequence for decoding. One desirable property of the proposed model is that it can recognize utterances much longer than the ones it was trained on. In the future, we expect this model to be used to directly recognize text from speech, in which case it may become important to incorporate a monolingual language model into the ARSG architecture. This work has contributed two novel ideas for attention mechanisms: a better normalization approach yielding smoother alignments, and a generic principle for extracting and using features from the previous alignments. Both of these can
potentially be applied beyond speech recognition for instance the proposed attention can be used without modification in neural turing machines or by using convolution instead of for improving image caption generation acknowledgments all experiments were conducted using theano and blocks libraries the authors would like to acknowledge the support of the following agencies for research funding and computing support national science center poland grant sonata nserc calcul compute canada the canada research chairs and cifar bahdanau also thanks planet intelligent systems gmbh and yandex references graves generating sequences with recurrent neural networks august bahdanau cho and bengio neural machine translation by jointly learning to align and translate in proc of the iclr xu ba kiros et al show attend and tell neural image caption generation with visual attention in proc of the icml mnih heess graves et al recurrent models of visual attention in proc of the nips chorowski bahdanau cho and bengio continuous speech recognition using recurrent nn first results cs stat december graves wayne and danihelka neural turing machines weston chopra and bordes memory networks gales and young the application of hidden markov models in speech recognition found trends signal january hinton deng yu et al deep neural networks for acoustic modeling in speech recognition the shared views of four research groups ieee signal processing magazine november hannun case casper et al deepspeech scaling up speech recognition hochreiter and schmidhuber long memory neural cho van merrienboer gulcehre et al learning phrase representations using rnn encoderdecoder for statistical machine translation in emnlp october to appear graves gomez and schmidhuber connectionist temporal classification labelling unsegmented sequence data with recurrent neural networks in proc of the graves sequence transduction with recurrent neural networks in proc of the icml lecun bottou bengio and haffner gradient based learning applied to document recognition proc ieee graves mohamed and hinton speech recognition with deep recurrent neural networks in icassp pages ieee graves and jaitly towards speech recognition with recurrent neural networks in proc of the icml sukhbaatar szlam weston and fergus weakly supervised memory networks garofolo lamel fisher et al darpa timit acoustic phonetic continuous speech corpus povey ghoshal boulianne et al the kaldi speech recognition toolkit in proc asru zeiler adadelta an adaptive learning rate method graves practical variational inference for neural networks in proc of the nips hinton srivastava krizhevsky sutskever and salakhutdinov improving neural networks by preventing of feature detectors sutskever vinyals and le sequence to sequence learning with neural networks in proc of the nips combining convolution in convolutional neural phone recognition in proc icassp gulcehre firat xu et al on using monolingual corpora in neural machine translation bergstra breuleux bastien et al theano cpu and gpu math expression compiler in proc scipy bastien lamblin pascanu et al theano new features and speed improvements deep learning and unsupervised feature learning nips workshop goodfellow lamblin et al machine learning research library arxiv preprint van bahdanau dumoulin et al blocks and fuel frameworks for deep learning cs stat june 
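As a compact illustration of the hybrid, location-aware scoring described in this paper, the following sketch (Python with NumPy) computes one step of attention weights from the decoder state, the encoder representation, and the previous alignment. The array shapes, the use of a single convolutional filter, and the parameter names are illustrative assumptions, not the exact parameterization used in the experiments.

import numpy as np

def hybrid_attention(s_prev, h, alpha_prev, W, V, U, b, w, conv_filter, smooth=False):
    """One step of hybrid (content plus location) attention.

    s_prev: previous generator state, shape (n,)
    h: encoder representation, shape (L, m)
    alpha_prev: previous alignment, shape (L,)
    conv_filter: 1-d filter used to extract location features from alpha_prev
    W (d, n), V (d, m), U (d, 1), b (d,), w (d,): scoring parameters
    """
    # convolutional location features extracted from the previous alignment
    f = np.convolve(alpha_prev, conv_filter, mode="same")          # shape (L,)
    e = np.array([
        w @ np.tanh(W @ s_prev + V @ h_j + U[:, 0] * f_j + b)
        for h_j, f_j in zip(h, f)
    ])                                                             # unnormalized scores
    if smooth:
        return 1.0 / (1.0 + np.exp(-e))   # sigmoid "smooth focus" instead of softmax
    e = e - e.max()                       # numerically stable softmax
    a = np.exp(e)
    return a / a.sum()

The glimpse used by the generator is then the weighted sum of the encoder states, for example g = alpha @ h with the returned weights alpha; inverse-temperature sharpening or windowing would be applied to the scores e before normalization.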
estimators for generalized linear models eunho yang ibm watson research center eunhyang lozano ibm watson research center aclozano pradeep ravikumar university of texas at austin pradeepr abstract we propose class of estimators for glms under sampling regimes our class of estimators is based on deriving variants of the vanilla unregularized mle but which are even under settings and available in we then perform thresholding operations on this mle variant to obtain our class of estimators we derive unified statistical analysis of our class of estimators and show that it enjoys strong statistical guarantees in both parameter error as well as variable selection that surprisingly match those of the more complex regularized glm mles even while our estimators are computationally much simpler we derive instantiations of our class of estimators as well as corollaries of our general theorem for the special cases of logistic exponential and poisson regression models we corroborate the surprising statistical and computational performance of our class of estimators via extensive simulations introduction we consider the estimation of generalized linear models glms under settings where the number of variables may greatly exceed the number of observations glms are very general class of statistical models for the conditional distribution of response variable given covariate vector where the form of the conditional distribution is specified by any exponential family distribution popular instances of glms include logistic regression which is widely used for binary classification as well as poisson regression which together with logistic regression is widely used in key tasks in genomics such as classifying the status of patients based on genotype data and identifying genes that are predictive of survival among others recently glms have also been used as key tool in the construction of graphical models overall glms have proven very useful in many modern applications involving prediction with data accordingly an important problem is the estimation of such glms under sampling regimes under such sampling regimes it is now that consistent estimators can not be obtained unless structural constraints are imposed upon the underlying regression model parameter vector popular structural constraints include that of sparsity which encourages parameter vectors supported with very few entries constraints and structure with parameters among others several lines of work have focused on consistent estimators for such structurally constrained glms popular instance for the case of glms is the regularized maximum likelihood estimator mle which has been shown to have strong theoretical guarantees ranging from risk consistency consistency in the and and model selection consistency another popular instance is the for regularized mle for logistic regression for which prediction consistency has been established all of these estimators solve general convex programs involving components due to regularization while strong line of research has developed computationally efficient optimization methods for solving these programs these methods are iterative and their computational complexity scales polynomially with the number of variables and samples making them expensive for very problems key reason for the popularity of these iterative methods is that while the number of iterations are some function of the required accuracy each iteration itself consists of small finite number of steps and can thus scale to very large problems but what 
if we could construct estimators that overall require only very small finite number of steps akin to single iteration of popular iterative optimization methods the computational gains of such an approach would require that the steps themselves be suitably constrained and moreover that the steps could be suitably profiled and optimized efficient linear algebra routines implemented in blas libraries systematic study of which we defer to future work we are motivated on the other hand by the simplicity of such potential class of estimators in this paper we thus address the following question is it possible to obtain estimators for glms under settings that nonetheless have the sharp convergence rates of the regularized convex programs and other estimators noted above this question was first considered for linear regression models and was answered in the affirmative our goal is to see whether positive response can be provided for the more complex statistical model class of glms as well in this paper we focus specifically on the class of glms though our framework should extend to more general structures as well as an inkling of why estimators for glms is much trickier than that for linear models is that under settings linear regression models do have statistically efficient estimator the ordinary ols estimator which also serves as the mle under gaussian noise for glms on the other hand even under settings we do not yet have statistically efficient estimators classical algorithm to solve for the mle of logistic regression models for instance is the iteratively reweighted least squares irls algorithm which as its name suggests is iterative and not available in closedform indeed as we show in the sequel developing our class of estimators for glms requires far more advanced mathematical machinery moment polytopes and projections onto an interior subset of these polytopes for instance than the linear regression case our starting point to devise estimator for glms is to nonetheless revisit this classical unregularized mle estimator for glms from statistical viewpoint and investigate the reasons why the estimator fails or is even in the setting these insights enable us to propose variants of the mle that are not only but can also be easily computed in we provide unified statistical analysis for our class of glm estimators and instantiate our theoretical results for the specific cases of logistic exponential and poisson regressions surprisingly our results indicate that our estimators have comparable statistical guarantees to the regularized mles in terms of both variable selection and parameter estimation error which we also corroborate via extensive simulations which surprisingly even show slight statistical performance edge for our estimators moreover our estimators are much simpler and competitive computationally as is corroborated by our extensive simulations with respect to the conditions we impose on the glm models we require that the population covariance matrix of our covariates be weakly sparse which is different condition than those typically imposed for regularized mle estimators we discuss this further in section overall we hope our simple class of statistically as well as computationally efficient estimators for glms would open up the use of glms in machine learning applications even to lay users on the one hand and on the other hand encourage the development of new classes of simple estimators with strong statistical guarantees extending the initial proposals in this paper setup we 
consider the class of generalized linear models glms where response variable conditioned on covariate vector rp follows an exponential family distribution exp xi xi where is fixed and known scale parameter rp is the glm parameter of interest and xi is the function or the constant of the distribution our goal is to estimate the glm parameter given samples by properties of exponential families the conditional moment of the response given the covariates can be written as xi xi examples popular instances of include the standard linear regression model the logistic regression model and the poisson regression model among others in the case of the linear regression response variable with the conditional distribution have xi exp where the function or constant of in this specific case is given by another popular glm instance is the logistic regression output variable model for categorical exp xi log exp xi exp xi where the function log exp exp the exponential regression model in turn is given by exp xi log xi here the domain of response variable is the set of real numbers it is typically used to model time intervals between events for instance and the function log our final example is the poisson regression model exp log xi exp xi where the response variable is with domain and with function exp any exponential family distribution can be used to derive canonical glm regression model of response conditioned on covariates by setting the canonical parameter of the exponential family distribution to xi for the parameterization to be valid the conditional density should be normalizable so that xi estimation suppose that we are given covariate vectors rp drawn from some distribution and corresponding response variables drawn from the distribution in key goal in statistical estimation is to estimate the parameters rp given just the samples such estimation becomes particularly challenging in regime where the dimension of covariate vector is potentially even larger than the number of samples in such regimes it is well understood that structural constraints on are necessary in order to find consistent estimators in this paper we focus on the structural constraint of sparsity so that the number of elements in is less than or equal to some value much smaller than estimators regularized convex programs the norm is known to encourage the estimation of such parameters accordingly popular class of for glm parameters is the regularized maximum estimator for given samples from the regularized mles can be pn pn written as minimize for notational simplicity we collate the observations in vector and matrix forms where we overload the notation rn to denote the vector of responses so that element of yi is and to denote the design matrix whose row is with this notation we can rewrite optimization problem characterizing the mle simply as minimize where we overload the notation for an input vector rn to denote and rn estimators for glms the goal of this paper is to derive general class of estimators for glms in contrast to solving huge regularized optimization problems before introducing our class of such estimators we first introduce some notation for any rp we use sign ui max to denote the softthresholding operator with thresholding parameter for any given matrix we denote by family of matrix thresholding operators that are defined so that they can be written as ij mij for any scalar thresholding operator that satisfies the following conditions for any input for and the standard and operators are both pointwise operators 
that satisfy these properties see for further discussion of such pointwise matrix thresholding operators for any rn we let ra denote the gradients ra we assume that the exponential family underlying the glm is minimal so that this map is invertible and so that for any rn in the range of ra we can denote ra as an inverse map of ra µn consider the response moment polytope ep for some distribution over and let mo denote the interior of our estimator will use carefully selected subset mo yi denote the projection of response variable onto this subset as arg where the subset is selected so that the projection step is always and the minimum exists given vector we denote the vector of projections of entries in as so that as the conditions underlying our theorem will make clear we will need the operator ra defined above to be both and lipschitz in the subset of the interior of the response moment polytope in later sections we will show how to carefully construct such subset for different glm models we now have the machinery to describe our class of estimators ra where the various mathematical terms were defined above it can be immediately seen that the estimator is available in in later section we will see instantiations of this class of estimators for various specific glm models and where we will see that these estimators take very simple forms before doing so we first describe some insights that led to our particular construction of the glm estimator above insights behind construction of our estimator we first revisit the classical unregularized mle for glms arg note that this optimization problem does not have unique minimum in general especially under sample settings where nonetheless it is instructive to study why this unregularized mle is either or even under settings the stationary condition of unregularized mle optimization problem can be written as ra there are two main caveats to solving for unique satisfying this stationary condition which we clarify below mapping to mean parameters in high dimensional sampling regime where can be seen to reduce to ra so long as has rank this then suggests solving for ra where we recall the definition of the operator ra in terms of operations involving the caveat however is that is only onto the interior mo of the response moment polytope so that is only when given mo when entries of the sample response vector however lie outside of mo as will typically be the case and which we will illustrate for multiple instances of glm models in later sections the inverse mapping would not be we thus first project the sample response vector onto mo to obtain as defined in armed with this approximation we then consider the more instead of the original stationary condition in amenable ra sample covariance we thus now have the approximate characterization of the mle as ra this then suggests solving for an approximate mle via least squares as ra the regime with poses caveat here since the sample covariance matrix would then be and hence not invertible our approach is to then use thresholded sample covariance matrix defined in the previous subsection instead which can be shown to be invertible and consistent to the population covariance matrix with high probability in particular recent work has shown that thresholded sample covariance is consistent with respect to the spectral norm with convergence rate op logn under some mild conditions detailed in our main theorem plugging in this thresholded sample covariance matrix to get an approximate least squares solution for the 
glm parameters and then performing precisely yields our estimator in our class of estimators in can thus be viewed as surgical approximations to the mle so that it is in settings as well as being available in but would such an approximation actually yield rigorous consistency guarantees surprisingly as we show in the next section not only is our class of estimators consistent but in our corollaries we show that the statistical guarantees are comparable to those of the state of the art iterative ways like regularized mles we note that our class of estimators in can also be written in an equivalent form that is more amenable to analysis minimize ra the equivalence between and easily follows from the fact that the optimization problem is decomposable into independent and each corresponds to it can be seen that this form is also amenable to extending the framework in this paper to structures beyond sparsity by substituting in alternative regularizers due to space constraints the computational complexity is discussed in detail in the appendix statistical guarantees in this subsection we provide an unified statistical analysis for the class of estimators under the following standard conditions namely sparse and design the parameter in is exactly sparse with elements indexed by the support set so that each row of the design matrix is sampled from distribution with covariance matrix such that for any rp the variable hv xi is with parameter at most for every row of xi our next assumption is on the covariance matrix of the covariate random vector the covariance matrix of satisfies that for all rp with fixed constant moreover is approximately sparse along the lines of for some positive ppconstant for all diagonal entries and moreover for some and maxi if then this condition will be equivalent with being sparse we also introduce some notations used in the followingptheorem under the condition we have that with high probability log for all samples let log we then let be the subset of such that where we also define and on the upper bounds of and max max respectively armed with these conditions and notations we derive our main theorem theorem consider any generalized linear model in where all the conditions and hold problem setting the thresholding parameter now suppose that we solve the estimation log where maxj for any constant and max furthermore suppose also that we set the constraint bound as lognp where and where depends on the approximation error induced by the log projection and is defined as then as long as log where is constant related only on and maxi any optimal solution of is guaranteed to be consistent lognp lognp lognp moreover the support set of the estimate correctly excludes all true zero values of moreover when it correctly includes all true supports of with probability at least for some universal constants depending on and remark while our class of estimators and analyses consider parameters these can be seamlessly extended to more general structures such as group sparsity and low rank using appropriate thresholding functions remark the condition required in theorem is different from and possibly stronger than the restricted strong convexity required for error bound of regularized mle key facet of our analysis with our condition however is that it provides much simpler and clearer identifying constants in our error bounds deriving constant factors in the analysis of the mle on the other hand with its restricted strong convexity condition involves many probabilistic statements and is 
as shown in another key facet of our analysis in theorem is that it also provides an error bound and guarantees the sparsistency of our estimator for regularized mles this requires separate sparsistency analysis in the case of the simplest standard linear regression models showed that the incoherence condition of is required for sparsistency where is the maximum of absolute row sum as discussed in instances of such incoherent covariance matrices include the identity and toeplitz matrices these matrices can be seen to also satisfy our condition on the other hand not all matrices that satisfy our condition need satisfy the stringent incoherence condition in turn for example consider where for matrix of ones is all zeros but the last column is and then this positive definite can be seen to satisfy our condition since each row has only however is equal to and larger than and consequently the incoherence condition required for the lasso will not be satisfied we defer relaxing our condition further as well as deeper investigation of all the above conditions to future work remark the constant in the statement depends on which in the worst case where only is bounded may scale with on the other hand our theorem does not require an explicit sample complexity condition that be larger than some function on while the analysis of mles do additionally require that log for some constant in our experiments we verify that our estimators outperform the mles even when is fairly large for instant when in order to apply theorem to specific instance of glms we need to specify the quantities in as well as carefully construct subset of the interior of the response moment polytope in case of the simplest linear models described in section we have the identity mapping the inequalities in can thus be seen to be satisfied with moreover we can set mo so that and trivially recover the previous results in as special case in the following sections we will derive the consequences of our framework for the complex instances of logistic and poisson regression models which are also important members in glms key corollaries in order to derive corollaries of our main theorem we need to specify the response polytope subsets in and respectively as well as bound the two quantities and in logistic regression models the exponential family function of logistic regression models described in section can be seen to be log exp exp consequently exp its double derivative exp for any so that holds with the response moment polytope for the binary response variable is the interval so that its interior is given by mo for the subset of the interior we define for some at the same time the forward mapping is given by exp exp and hence becomes where log the inverse mapping of logistic models is given by log and given and it can be seen that is lipschitz for with constant less than max log in note that with this setting of the subset we have that and moreover yi yi which we will use in the corollary below poisson regression models another important instance of glms is the poisson regression model that is becoming increasingly more relevant in modern settings with varied multivariate count data for the poisson regression model case the double derivative of is not uniformly upper bounded exp denoting plog we then have that for any in exp log log so that is satisfied with log the response moment polytope for the response variable is given by so that its interior is given by mo for the subset of the interior we define for some the forward mapping in 
this case is simply given by the exponential function, and the chosen subset of the interior of the moment polytope becomes the corresponding interval of the positive reals. The inverse mapping for the Poisson regression model is then given by the logarithm, which can be seen to be Lipschitz over the chosen subset. With this setting, the projection operator simply moves each response y_i to the nearest point of that subset. We are now ready to recover the error bounds, as a corollary of our main theorem, for logistic regression and Poisson models when the corresponding condition holds.

Corollary. Consider any logistic regression model or Poisson regression model where all the conditions of the theorem hold, and suppose that we solve our estimation problem with the thresholding parameter and the constraint bound scaling as in the theorem, with constants depending only on the model. Then the optimal solution of our estimator is guaranteed to be consistent in the ℓ∞, ℓ2, and ℓ1 norms, with error bounds of the same order as in the theorem (scaling with sqrt(log p / n) and the sparsity k), with high probability; moreover, under an additional condition on the magnitude of the smallest nonzero entries, the estimate is sparsistent. Remarkably, the rates in the corollary are asymptotically comparable to those for the ℓ1-regularized MLE (see, for instance, the corresponding results in prior work). In the appendix we place a slightly more stringent condition and guarantee error bounds with faster convergence rates.

Experiments. We corroborate the performance of our closed-form estimators on simulated data over varied regimes of the sample size n, the number of covariates p, and the sparsity size k. We consider two popular instances of GLMs: logistic and Poisson regression models. We compare against standard ℓ1-regularized MLE estimators run with different iteration bounds. We construct the design matrices by sampling the rows independently from a multivariate Gaussian whose covariance satisfies our conditions; for each simulation, the entries of the true model coefficient vector are set to zero everywhere except for a randomly chosen subset of k coefficients, which are drawn independently and uniformly from a fixed interval. We report results averaged over independent trials. Since our theoretical results were not sensitive to the setting of the projection margin, we simply report the results for a single setting of it across all experiments. While our theorem specified an optimal setting of the regularization parameters, this optimal setting depends on unknown model parameters; thus, as is standard with regularized estimators, we set the tuning parameters proportional to sqrt(log p / n) and chose the constants in a holdout-validated fashion, picking the value that minimizes the ℓ2 error on an independent validation set. The detailed experimental setup is described in the appendix.

(Table: comparisons on simulated datasets when parameters are tuned to minimize ℓ2 error on independent validation sets; for each method, our closed-form estimator and the ℓ1-regularized MLEs with different iteration limits, the table reports ℓ2 error, true positives (TP), false positives (FP), and training time.)

The table summarizes the performance of our estimator and of the ℓ1-regularized MLEs under different stopping criteria. Besides ℓ2 errors, the target tuning metric, we also provide the true and false positives for the support-set recovery task on a new test set where the best tuning parameters are used. The computation times, in seconds, indicate the overall training time summed over the whole parameter-tuning process. As we can see from our experiments, with respect to both statistical and computational performance, our closed-form estimators are quite competitive compared to the classical ℓ1-regularized MLE estimators, and in certain cases outperform them. Note that the MLE with the smallest iteration bound stops prematurely, so that its training time is sometimes comparable to our estimator; however, its statistical performance, measured by ℓ2 error, is much inferior to the other MLEs run with more iterations, as well as to our estimator. Due to the space limit, ROC curves, results for other settings of the problem dimensions, and more experiments on real datasets are presented in the appendix.
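To illustrate how simple the resulting procedure is, here is a minimal sketch of the closed-form estimator for logistic regression in Python with NumPy, following the construction above: project the binary responses into the interior of the moment polytope, apply the inverse mean mapping (the logit), premultiply by the inverse of an element-wise thresholded sample covariance, and soft-threshold. The hard-thresholding choice, the projection margin delta, and the plain linear solve are illustrative assumptions; they are not the only choices compatible with the theory.

import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def closed_form_logistic(X, y, nu, lam, delta=0.1):
    """Closed-form sparse estimator for logistic regression (a sketch).

    X: design matrix, shape (n, p); y: binary responses in {0, 1}, shape (n,)
    nu: covariance thresholding level; lam: soft-thresholding level
    delta: projection margin into the interior of the moment polytope (assumption)
    """
    n, p = X.shape
    # element-wise hard thresholding of the sample covariance, keeping the diagonal
    S = X.T @ X / n
    T = np.where(np.abs(S) >= nu, S, 0.0)
    np.fill_diagonal(T, np.diag(S))
    # project responses into a subset of the interior of the moment polytope
    y_proj = np.clip(y, delta, 1.0 - delta)
    # the inverse of the mean mapping A'(u) = exp(u) / (1 + exp(u)) is the logit
    z = np.log(y_proj / (1.0 - y_proj))
    theta = np.linalg.solve(T, X.T @ z / n)
    return soft_threshold(theta, lam)

The Poisson variant would replace the clipping by max(y, delta) and the logit by the logarithm; either way, the whole estimator amounts to one covariance thresholding, one linear solve, and one soft-thresholding pass.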
references mccullagh and nelder generalized linear models monographs on statistics and applied probability chapman and hoffman logsdon and mezey puma unified framework for penalized multiple regression analysis of gwas data plos computational biology witten and tibshirani survival analysis with covariates stat methods med yang ravikumar allen and liu graphical models via generalized linear models in neur info proc sys nips van de geer generalized linear models and the lasso annals of statistics bach analysis for logistic regression electron kakade shamir sridharan and tewari learning exponential families in strong convexity and sparsity in inter conf on ai and statistics aistats negahban ravikumar wainwright and yu unified framework for analysis of with decomposable regularizers arxiv preprint bunea honest variable selection in linear and logistic regression models via and penalization electron meier van de geer and the group lasso for logistic regression journal of the royal statistical society series kim kim and kim blockwise sparse regression statistica sinica friedman hastie and tibshirani regularization paths for generalized linear models via coordinate descent journal of statistical software koh kim and boyd an method for logistic regression jour mach learning yang lozano and ravikumar elementary estimators for linear regression in international conference on machine learning icml rothman levina and zhu generalized thresholding of large covariance matrices journal of the american statistical association theory and methods wainwright and jordan graphical models exponential families and variational inference foundations and trends in machine learning december bickel and levina covariance regularization by thresholding annals of statistics wainwright sharp thresholds for and noisy sparsity recovery using quadratic programming lasso ieee trans information theory may daniel spielman and teng solving sparse symmetric linear systems in time in symposium on foundations of computer science focs october cambridge ma usa proceedings pages michael cohen rasmus kyng gary miller jakub pachocki richard peng anup rao and shen chen xu solving sdd linear systems in nearly time in proceedings of the annual acm symposium on theory of computing stoc pages acm daniel spielman and teng nearly linear time algorithms for preconditioning and solving symmetric diagonally dominant linear systems siam matrix analysis applications ravikumar wainwright raskutti and yu covariance estimation by minimizing divergence electronic journal of statistics yang lozano and ravikumar elementary estimators for sparse covariance matrices and other structured moments in international conference on machine learning icml yang lozano and ravikumar elementary estimators for graphical models in neur info proc sys nips 
Online F-Measure Optimization

Department of Computer Science, University of Paderborn, Germany (busarobi); Technion, Haifa, Israel, and Research Group on Artificial Intelligence, Hungary (szorenyibalazs); Krzysztof, Institute of Computing Science, University of Technology, Poland (kdembczynski); Eyke, Department of Computer Science, University of Paderborn, Germany (eyke)

Abstract. The F-measure is an important and commonly used performance metric for binary prediction tasks. By combining precision and recall into a single score, it avoids disadvantages of simple metrics like the error rate, especially in cases of imbalanced class distributions. The problem of optimizing the F-measure, that is, of developing learning algorithms that perform optimally in the sense of this measure, has recently been tackled by several authors. In this paper, we study the problem of F-measure maximization in the setting of online learning. We propose an efficient online algorithm and provide a formal analysis of its convergence properties. Moreover, first experimental results are presented, showing that our method performs well in practice.

Introduction. Being rooted in information retrieval, the F-measure is nowadays routinely used as a performance metric in various prediction tasks. Given predictions $\hat{y}_1,\dots,\hat{y}_t$ of binary labels $y_1,\dots,y_t$, the F-measure is defined as
$$F = \frac{2\sum_{i=1}^{t} y_i \hat{y}_i}{\sum_{i=1}^{t} y_i + \sum_{i=1}^{t} \hat{y}_i} = \frac{2\cdot\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}},$$
where $\mathrm{precision} = \sum_i y_i\hat{y}_i / \sum_i \hat{y}_i$, $\mathrm{recall} = \sum_i y_i\hat{y}_i / \sum_i y_i$, and the degenerate case $0/0$ is handled by convention. Compared to measures like the error rate in binary classification, maximizing the F-measure enforces a better balance between performance on the minority and the majority class; it is therefore more suitable in the case of imbalanced data. Optimizing for such an imbalance-aware measure is very important in many applications where positive labels are significantly less frequent than negative ones. It can also be generalized to a weighted harmonic average of precision and recall; yet, for the sake of simplicity, we stick to the unweighted mean, which is often referred to as the F1-score or the F1-measure.

Given the importance and usefulness of the F-measure, it is natural to look for learning algorithms that perform optimally in the sense of this measure. However, optimizing the F-measure is a quite challenging problem, especially because the measure is not decomposable over the binary predictions. This problem has received increasing attention in recent years and has been tackled by several authors; however, most of this work has been done in the standard setting of batch learning.

In this paper, we study the problem of F-measure optimization in the setting of online learning, which is becoming increasingly popular in machine learning. In fact, there are many applications in which training data arrives progressively over time and models need to be updated and maintained incrementally. In our setting, this means that in each round $t$ the learner first outputs a prediction $\hat{y}_t$ and then observes the true label $y_t$. Formally, the protocol in round $t$ is as follows: first, an instance $x_t$ is observed by the learner; then, the predicted label $\hat{y}_t$ for $x_t$ is computed on the basis of the first $t$ instances $x_1,\dots,x_t$, the labels observed so far, and the corresponding predictions; finally, the label $y_t$ is revealed to the learner. The goal of the learner is then to maximize the F-measure of $(y_1,\dots,y_t)$ and $(\hat{y}_1,\dots,\hat{y}_t)$ over time.

Optimizing the F-measure in an online fashion is challenging, mainly because of the non-decomposability of the measure and the fact that a prediction $\hat{y}_t$ cannot be changed after round $t$. As a potential application of online F-measure optimization, consider the recommendation of news from RSS feeds or tweets. Besides, it is worth mentioning that online methods are also relevant in the context of big data and one-pass learning, where the volume of data, despite being finite, prevents from processing each data point more than
once treating the data as stream online algorithms can then be used as algorithms note however that algorithms are evaluated only at the end of the training process unlike online algorithms that are supposed to learn and predict simultaneously we propose an online algorithm for optimization which is not only very efﬁcient but also easy to implement unlike other methods our algorithm does not require extra validation data for tuning threshold that separates between positive and negative predictions and therefore allows the entire data to be used for training we provide formal analysis of the convergence properties of our algorithm and prove its statistical consistency under different assumptions on the learning process moreover ﬁrst experimental results are presented showing that our method performs well in practice formal setting in this paper we consider stochastic setting in which xt yt are assumed to be samples from some unknown distribution on where is the label space and is some instance space we denote the marginal distribution of the feature vector by then the posterior probability of the positive class the conditional probability that the prior distribution of class can given is be written as dµ let be the set of all binary classiﬁers over the set the of binary classiﬁer is calculated as dµ dµ dµ according to the expected value of to with when is used to calculate xt thus yt xt now let denote the set of all probabilistic binary classiﬁers over the set and let denote the set of binary classiﬁers that are obtained by thresholding classiﬁer is classiﬁers of the form for some threshold where is the indicator function that evaluates to if its argument is true and otherwise is assumed to exhibit the required measurability properties according to the optimal computed as maxf can be achieved by thresholded classiﬁer more precisely let us deﬁne the thresholded as dµ dµ dµ then the optimal threshold can be obtained as argmax clearly for the classiﬁer in the form of with and we have then as shown by see their theorem the performance of any binary classiﬁer can not exceed for all therefore estimating posteriors ﬁrst and adjusting threshold afterward appears to be reasonable strategy in practice this seems to be the most popular way of maximizing the in batch mode we call it the maximization approach or for short more speciﬁcally the approach consists of two steps ﬁrst classiﬁer is trained for estimating the posteriors and second threshold is tuned on the posterior estimates for the time being we are not interested in the training of this classiﬁer but focus on the second step that is the labeling of instances via thresholding posterior probabilities for doing this suppose ﬁnite set dn xi yi of labeled instances are given as training information moreover suppose estimates xi of the posterior probabilities pi xi are provided by classiﬁer next one might deﬁne the obtained by applying the threshold classiﬁer on the data dn as follows yi xi dn xi in order to ﬁnd an optimal threshold τn dn it sufﬁces to search the ﬁnite set which requires time log in it is shown that dn as for any and provides an even stronger result if classiﬁer gdn is induced from dn by an and threshold τn is obtained by maximizing τn then gd as under mild assumptions on the on an independent set dn data distribution maximizing the on population level in this section we assume that the data distribution is known according to the analysis in the previous section optimizing the boils down to ﬁnding the optimal threshold at this 
point an observation is in order remark in general the function is neither convex nor concave for example when is ﬁnite then the denominator and enumerator of are step functions whence so is therefore gradient methods can not be applied for ﬁnding nevertheless can be found based on recent result of who show that ﬁnding the root of max dµ is necessary and sufﬁcient condition for optimality note that is continuos and strictly decreasing with and therefore has unique solution which is moreover also prove an interesting relationship between the optimal threshold and the induced by that threshold the marginal distribution of the feature vectors induces distribution on the posteriors dµ for all by deﬁnition is the derivative of dµ dζ and the density of observing an instance for which the probability of the learning algorithm viewed as samples dn to classiﬁers gdn is called the data distribution if limn pdn dµ for all positive label is we shall write concisely dν dp since is an induced probability measure the measurable transformation allows us to rewrite the notions introduced above in terms of instead of for example section in for example the prior probability dµ can be written equivalently as dν likewise can be rewritten as follows max dν dν dν dν dν dν dν equation will play central role in our analysis note that precise knowledge of sufﬁces to ﬁnd the maxima of this is illustrated by two examples presented in appendix in which we assume speciﬁc distributions for namely uniform and beta distributions algorithmic solution in this section we provide an algorithmic solution to the online maximization problem for this we shall need in each round some classiﬁer gt that provides us with some estimate gt xt of the probability xt we would like to stress again that the focus of our analysis is on optimal thresholding instead of classiﬁer learning thus we assume the sequence of classiﬁers to be produced by an external online learner for example logistic regression trained by stochastic gradient descent as an aside we note that maximization is not directly comparable with the task that is most often considered and analyzed in online learning namely regret minimization this is mainly because the is performance metric in fact the cumulative regret is summation of regret rt which only depends on the prediction and the true outcome yt in the case of the the score and therefore the optimal prediction depends on the entire history that is all observations and decisions made by the learner till time this is discussed in more detail in section the most naive way of forecasting labels is to implement online learning as repeated batch learning that is to apply batch learner such as to dt xi yi in each time step obviously however this strategy is prohibitively expensive as it requires storage of all data points seen so far at least in as well as optimization of the threshold τt and of the classiﬁer gt on an ever growing number of examples in the following we propose more principled technique to maximize the online our approach is based on the observation that and for any such that moreover it is monotone decreasing continuous function therefore ﬁnding the optimal threshold can be viewed as root in practice however is not known and can only be estimated let us deﬁne for now assume to be known and write concisely we can compute the expectation of with respect to the data distribution for ﬁxed threshold as follows dν dν dν dν dν thus an unbiased estimate of can be obtained by evaluating for an instance this suggests 
designing stochastic approximation algorithm that is able to ﬁnd the root of similarly to the algorithm exploiting the relationship between the optimal and the optimal threshold we deﬁne the threshold in time step as τt at where at yi bt bt yi with this threshold the ﬁrst differences between thresholds τt can be written as follows proposition if thresholds τt are deﬁned according to and as τt then τt τt the proof of prop is deferred to appendix according to the method we obtain almost coincides with the update rule of the algorithm there are however some notable differences in particular the sequence of coefﬁcients namely the values does not consist of predeﬁned real values converging to zero as fast as instead it consists of random quantities that depend on the history namely the observed labels yt and the predicted labels moreover these coefﬁcients are not independent of τt either in spite of these additional difﬁculties we shall present convergence analysis of our algorithm in the next section the of our online optimization algorithm called online algorithm ofo optimizer ofo is shown in algorithm select from and set the forecast rule can be written in the form of for do observe the instance xt for xt where the threshold is xt estimate posterior deﬁned in and pt xt in practice we pt current prediction use xt as an estimate of the true observe label yt posterior pt in line of the code an online at calculate learner is assumed bt and τt bt which produces classiﬁers gt by incrementally gt xt yt update the classiﬁer updating the current classiﬁer with the newly return τt observed example gt xt yt in our experimental study we shall test and compare various online learners as possible choices for consistency in this section we provide an analysis of the online optimizer proposed in the previous section more speciﬁcally we show the statistical consistency of the ofo algorithm the sequence of online thresholds and produced by this algorithm converge respectively to the optimal threshold and the optimal thresholded in probability as ﬁrst step we prove this result under the assumption of knowledge about the true posterior probabilities then in second step we consider the case of estimated posteriors theorem assume the posterior probabilities pt xt of the positive class to be known in each step of the online learning process then the sequences of thresholds τt and online produced by ofo both converge probability their optimal and respectively for any we have and here is sketch of the proof of this theorem the details of which can be found in the supplementary material appendix we focus on τt which is stochastic process the ﬁltration of which is deﬁned as τt is ft ft for this ﬁltration one can show that and τt τt based on as ﬁrst step we can decompose the update rule given in as follows τt conditioned on the ﬁltration ft see lemma bt τt next we show that the sequence behaves similarly to in the sense that see lemma moreover one can show that although is not differentiable on in general it can be piecewise linear for example one can show that its ﬁnite difference is between and see proposition in the appendix as consequence of this result our process deﬁned in does not get stuck even close to the main of the proof is devoted to analyzing the properties of the sequence of βt τt for which we show that βt which is sufﬁcient for the statement of the theorem our proof follows the convergence analysis of nevertheless our analysis essentially differs from theirs since in our case the coefﬁcients can not be 
chosen freely instead as explained before they depend on the labels observed and predicted so far in addition the noisy estimation of depends on the labels too but the decomposition step allows us to handle this undesired effect remark in principle the algorithm can be applied for ﬁnding the root of as well this yields an update rule similar to with replaced by for constant in this case however the convergence of the online is difﬁcult to analyze if at all because the empirical process can not be written in nice form moreover as it has been found in the analysis the coefﬁcient should be set see proposition and the choice of kt at the end of the proof of theorem yet since is not known beforehand it needs to be estimated from the samples which implies that the coefﬁcients are not independent of the noisy evaluations of like in the case of the ofo algorithm interestingly ofo seems to properly adjust the values in an adaptive manner bt is sum of two terms the ﬁrst of which is in expectation which is very nice property of the algorithm empirically based on synthetic data we found the performance of the original algorithm to be on par with ofo as already announced we are now going to relax the assumption of known posterior probabilities pt xt instead estimates gt xt pt of these probabilities are obtained by classiﬁers gt that are provided by the external online learner in algorithm more concretely assume an online learner where is the set of probabilistic classiﬁers given current model gt and new example xt yt this learner produces an updated classiﬁer gt xt yt showing consistency result for this scenario requires some assumptions on the online learner with this formal deﬁnition of online learner statistical consistency result similar to theorem can be shown the proof of the following theorem is again deferred to supplementary material appendix are provided by an online theorem assume that the classiﬁers gt in the ofo framework learner for which the following holds there is such that gt dµ then and τt this theorem requirement on the online learner is stronger than what is assumed by and recalled in footnote first the learner is trained online and not in batch mode second we also require that the error of the learner goes to with convergence rate of order it might be interesting to note that universal rate of convergence can not be established without assuming regularity properties of the data distribution such as smoothness via absolute continuity results of that kind are beyond the scope of this study instead we refer the reader to for details on consistency and its connection to the rate of convergence discussion regret optimization and stochastic approximation stochastic approximation algorithms can be applied for ﬁnding the optimum of or equivalently to ﬁnd the unique root of based on noisy latter formulation is better suited for the classic version of the root ﬁnding algorithm these algorithms are iterative methods whose analysis focuses on the difference of τt from where τt denotes the estimate of in iteration whereas our online setting is concerned with the distance of yt from where is the prediction for yi in round this difference is crucial because τt only depends on τt and in addition if τt is close to then τt is also close to see for concentration properties whereas in the online optimization setup yt can be very different from even if the current estimate τt is close to in case the number of previous incorrect predictions is large in online learning and online optimization it is 
common to work with the notion of cumulative regret in our case this notion could be interpreted either as yi or as after division by the former becomes the average accuracy of the over time and the latter the accuracy of our predictions the former is hard to interpret because yi itself is an aggregate measure of our performance table main statistics of the benchmark datasets and one pass obtained by ofo and methods on various datasets the bold numbers indicate when the difference is signiﬁcant between the performance of ofo and methods the signiﬁcance level is set to one sigma that is estimated based on the repetitions learner dataset instances pos gisette replab webspamuni epsilon covtype url susy kdda kddb neg features logreg ofo pegasos ofo perceptron ofo over the ﬁrst rounds which thus makes no sense to aggregate again the latter on the other hand is the differs qualitatively from our ultimate goal in fact yt alternate measure that we are aiming to optimize for instead of the accuracy online optimization of measures online optimization of the can be seen as special case of optimizing loss functions as recently considered by their framework essentially differs from ours in several points first regarding the data generation process the adversarial setup with oblivious adversary is assumed unlike our current study where stochastic setup is assumed from this point of view their assumption is more general since the oblivious adversary captures the stochastic setup second the set of classiﬁers is restricted to differentiable parametric functions which may not include the maximizer therefore their proof of vanishing regret does in general not imply convergence to the optimal seen from this point of view their result is weaker than our proof of consistency convergence to the optimal in probability if the posterior estimates originate from consistent learner finally there are some other performance measures which are intensively used in many practical applications their optimization had already been investigated in the online or setup the most notable such measure might be the area under the roc curve auc which had been investigated in an online learning framework by experiments in this section the performance of the ofo algorithm is evaluated in learning scenario on benchmark datasets and compared with the performance of the maximization appraoch described in section we also assess the rate of convergence of the ofo algorithm in pure online learning the online learner in ofo was implemented in different ways using logistic regression og eg the classical perceptron algorithm erceptron and an online linear svm called egasos in the case of og eg we applied the algorithm introduced in which handles and regularization the hyperparameters of the methods and the validation procedures are described below and in more detail in appendix if necessary the raw outputs of the learners were turned into valid probability estimates they were rescaled to using logistic transform we used in the experiments nine datasets taken from the libsvm repository of binary classiﬁcation many of these datasets are commonly used as benchmarks in information retrieval where the is routinely applied for model selection in addition we also used the textual data released in the replab challenge of identifying relevant tweets we generated the features used by the winner team the main statistics of the datasets are summarized in table additional results of experiments conducted on synthetic data are presented in appendix http 
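For concreteness, the following Python sketch implements the OFO update described above, namely the threshold tau_t = a_t / b_t with a_t = sum_{i<=t} y_i * yhat_i and b_t = sum_{i<=t} (y_i + yhat_i), paired with a logistic model updated by one SGD step per example as a simplified stand-in for the LogReg learner used in the experiments. The synthetic stream, learning rate, and initial threshold are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

class OnlineLogReg:
    """Logistic model updated by one SGD step per example (illustrative learner)."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def predict_proba(self, x):
        # Estimate of P(y = 1 | x) via the logistic transform.
        return 1.0 / (1.0 + np.exp(-self.w @ x))

    def update(self, x, y):
        # One stochastic gradient step on the logistic loss.
        grad = (self.predict_proba(x) - y) * x
        self.w -= self.lr * grad

def ofo(stream, learner):
    """Online F-measure optimization: threshold tau_t = a_t / b_t,
    where a_t = sum_i y_i*yhat_i and b_t = sum_i (y_i + yhat_i)."""
    a = b = 0.0
    tau = 0.0  # initial threshold, chosen in [0, 1)
    for x, y in stream:
        p_hat = learner.predict_proba(x)   # posterior estimate for x_t
        y_hat = int(p_hat >= tau)          # predict with the current threshold
        a += y * y_hat
        b += y + y_hat
        tau = a / b if b > 0 else 0.0      # online F-measure equals 2 * tau
        learner.update(x, y)               # incremental learner update
    return tau

# Toy usage on a synthetic, imbalanced stream (illustrative only).
rng = np.random.default_rng(0)
dim = 5
w_true = rng.standard_normal(dim)

def stream(n):
    for _ in range(n):
        x = rng.standard_normal(dim)
        y = int(rng.random() < 1.0 / (1.0 + np.exp(-(w_true @ x - 1.5))))
        yield x, y

learner = OnlineLogReg(dim)
tau_final = ofo(stream(20000), learner)
print("final threshold:", tau_final, "online F-measure:", 2 * tau_final)
```

Note how little state the optimizer keeps: two running sums and the current threshold, independent of the number of examples seen, which is what makes the approach suitable for one-pass and streaming use.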
Figure: online F-measure obtained by the OFO algorithm on various datasets (panels: webspam-uni, kdda, url, SUSY; x-axis: number of samples; y-axis: online F-measure). The dashed lines represent the performance of the OFO algorithm from the table, which we considered as baseline.

One-pass learning. In one-pass learning, the learner is allowed to read the training data only once, whence online learners are commonly used in this setting. We ran OFO along with the three classifiers, trained on a portion of the data; the learner obtained by OFO is of the form $g_t^{\tau_t}$, where $t$ is the number of training samples. The rest of the data was used to evaluate $g_t^{\tau_t}$ in terms of the F-measure. We ran every method on randomly shuffled versions of the data and averaged the results. The means of the F-measures computed on the test data are shown in the table. As a baseline, we applied the two-step maximization approach described earlier: more concretely, we trained the same set of learners on one part of the data and validated the threshold on the remaining part by optimizing the F-measure. Since both approaches are consistent, the performance of OFO should be on par with the performance of the two-step approach. This is confirmed by the results, in which significant differences are observed in only a few of the cases; these differences in performance might be explained by the finiteness of the data. The advantage of our approach over the two-step approach is that there is no need for a validation set and the data needs to be read only once; therefore, it can be applied in a pure one-pass learning scenario. The hyperparameters of the learning methods are chosen based on their performance in this setting; we tuned the hyperparameters over a wide range of values, which we report in the appendix.

Online learning. The OFO algorithm has also been evaluated in the online learning scenario in terms of the online F-measure. The goal of this experiment is to assess the convergence rate of OFO. Since the optimal F-measure is not known for the datasets, we considered the one-pass test F-measure reported in the table as reference. The results are plotted in the figure for four benchmark datasets; the plots for the remaining datasets can be found in the appendix. As can be seen, the online F-measure converges to the test F-measure obtained in the one-pass evaluation in almost every case. There are some exceptions in the case of Pegasos and Perceptron. This might be explained by the fact that SVM-type methods, as well as the Perceptron, tend to produce poor probability estimates in general, which is a main motivation for calibration methods that turn output scores into valid probabilities.

Conclusion and future work. This paper studied the problem of online F-measure optimization. Compared to many conventional online learning tasks, this is a specifically challenging problem, mainly because of the non-decomposable nature of the F-measure. We presented a simple algorithm that converges to the optimal F-measure when the posterior estimates are provided by a sequence of classifiers whose estimation error converges to zero sufficiently fast. As a key feature of our algorithm, we note that it is a purely online approach; moreover, unlike two-step approaches, there is no need for a validation set held out in batch mode. Our promising results from extensive experiments validate the empirical efficacy of our algorithm. For future work, we plan to extend our online optimization algorithm to a broader family of complex performance measures which can be expressed as ratios of linear combinations of true positive, false positive, false negative, and true negative rates; the F-measure also belongs to this family. Moreover, going beyond consistency, we plan to analyze the rate of convergence of our OFO algorithm. This might be doable thanks to several nice properties of the function whose root defines the optimal threshold. Finally, an intriguing question is what can be said about the case when some bias is introduced because the classifier $g_t$ does not converge to the true posterior probabilities.
acknowledgments krzysztof is supported by the polish national science centre under grant no the research leading to these results has received funding from the european research council under the european union seventh framework programme erc grant agreement references de albornoz chugur corujo gonzalo meij de rijke and spina overview of replab evaluating online reputation monitoring systems in clef volume pages bubeck and regret analysis of stochastic and nonstochastic bandit problems foundations and trends in machine learning and gy szarvas tune and mix learning to rank using ensembles of calibrated classiﬁers machine learning and lugosi prediction learning and games cambridge university press devroye and nonparametric density estimation the view wiley ny devroye and lugosi probabilistic theory of pattern recognition springer ny gao jin zhu and zhou auc optimization in icml volume pages hangya and farkas filtering and polarity detection for reputation management on tweets in working notes of clef evaluation labs and workshop kar narasimhan and jain online and stochastic gradient methods for nondecomposable loss functions in nips nagarajan koyejo ravikumar and dhillon consistent binary classiﬁcation with generalized performance metrics in nips pages narasimhan vaish and agarwal on the statistical consistency of classiﬁers for performance measures in nips robbins and monro stochastic approximation method ann math rosenblatt the perceptron probabilistic model for information storage and organization in the brain psychological review singer and srebro pegasos primal estimated solver for svm in icml pages tsuruoka tsujii and ananiadou stochastic gradient descent training for models with cumulative penalty in acl pages van rijsbergen foundation and evalaution journal of documentation varadhan probability theory new york university waegeman jachnik cheng and on the bayesoptimality of maximizers journal of machine learning research ye chai lee and chieu optimizing tale of two approaches in icml zhao edakunni pocock and brown beyond fano inequality bounds on the optimal ber and risk and their implications jmlr pages zhao hoi jin and yang online auc maximization in icml pages 
online rank elicitation for dueling bandits approach technion haifa israel research group on artificial intelligence hungary szorenyibalazs adil paul eyke department of computer science university of paderborn paderborn germany busarobi eyke abstract we study the problem of online rank elicitation assuming that rankings of set of alternatives obey the distribution following the setting of the dueling bandits problem the learner is allowed to query pairwise comparisons between alternatives to sample pairwise marginals of the distribution in an online fashion using this information the learner seeks to reliably predict the most probable ranking or our approach is based on constructing surrogate probability distribution over rankings based on sorting procedure for which the pairwise marginals provably coincide with the marginals of the plackettluce distribution in addition to formal performance and complexity analysis we present first experimental studies introduction several variants of problems have recently been studied in an online setting with preferences over alternatives given in the form of stochastic pairwise comparisons typically the learner is allowed to select presumably most informative alternatives in an active connection to bandits where single alternatives are chosen instead of pairs this is also referred to as the dueling bandits problem methods for online ranking can mainly be distinguished with regard to the assumptions they make about the probabilities pi that in direct comparison between two alternatives and the former is preferred over the latter if these probabilities are not constrained at all complexity that grows quadratically in the number of alternatives is essentially unavoidable yet by exploiting stochastic transitivity properties which are quite natural in ranking context it is possible to devise algorithms with better performance guaranties typically of the order log the idea of exploiting transitivity in online learning establishes natural connection to sorting algorithms naively for example one could simply apply an efficient sorting algorithm such as mergesort as an active sampling scheme thereby producing random order of the alternatives what can we say about the optimality of such an order the problem is that the probability distribution on rankings induced by the sorting algorithm may not be well attuned with the original preference relation the probabilities pi in this paper we will therefore combine sorting algorithm namely quicksort and stochastic preference model that harmonize well with each technical sense to be detailed later on this harmony was first presented in and our main contribution is to show how it can be exploited for online rank elicitation more specifically we assume that pairwise comparisons obey the marginals of model widely used parametric distribution over rankings cf section despite the quadratic worst case complexity of quicksort we succeed in developing its budgeted version presented in section with complexity of log while only returning partial orderings this version allows us to devise algorithms that find respectively item section and ranking of all items section both with high probability related work several studies have recently focused on versions of the bandit setup also known as dueling bandits where the online learner is only able to compare arms in pairwise manner the outcome of the pairwise comparisons essentially informs the learner about pairwise preferences whether or not an option is preferred to another one first 
group of papers including assumes the probability distributions of pairwise comparisons to possess certain regularity property such as strong stochastic transitivity second group does not make assumptions of that kind instead target ranking is derived from the pairwise preferences for example using the copeland borda count and random walk procedures our work is obviously closer to the first group of methods in particular the study presented in this paper is related to which investigates similar setup for the mallows model there are several approaches to estimating the parameters of the pl model including standard statistical methods such as likelihood estimation and bayesian parameter estimation pairwise marginals are also used in in connection with the approach nevertheless the authors assume that full rankings are observed from pl model algorithms for noisy sorting assume total order over the items and that the comparisons are representative of that order if precedes then the probability of option being preferred to is bigger than some in the data is assumed to consist of pairwise comparisons generated by model however comparisons are not chosen actively but according to some fixed probability distribution pure exploration algorithms for the stochastic bandit problem sample the arms certain number of times not necessarily known in advance and then output recommendation such as the best arm or the best arms while our algorithms can be viewed as pure exploration strategies too we do not assume that numerical feedback can be generated for individual options instead our feedback is qualitative and refers to pairs of options notation set of to be ranked is denoted by to keep the presentation simple we assume that items are identified by natural numbers so ranking is bijection on which can also be represented as vector rm where rj is the rank of the jth item the set of rankings can be identified with the symmetric group sm of order each ranking naturally defines an associated ordering om sm of the items namely the inverse defined by or for all for permutation we write for the permutation in which ri and rj the ranks of items and are replaced with each other we denote by ri sm ri the subset of permutations for which the rank of item is and by rj ri sm rj ri those for which the rank of is higher than the rank of that is item is preferred to written we write to indicate that is preferred to with respect to ranking we assume sm to be equipped with probability distribution sm thus for each ranking we denote by the probability to observe this ranking moreover for each pair of items and we denote by pi rj ri the probability that is preferred to in ranking randomly drawn according to these pairwise probabilities are called the pairwise marginals of the ranking distribution we denote the matrix composed of the values pi by pi approximations our learning problem essentially consists of making good predictions about properties of concretely we consider two different goals of the learner depending on whether the application calls for the prediction of single item or full ranking of items in the first problem which we call or simply paci the goal is to find an item that is almost as good as the optimal one with optimality referring to the condorcet winner an item is condorcet winner if for all then we call an item if it is beaten by the condorcet winner with at most an this setting coincides with those considered in obviously it requires the existence of condorcet winner which is indeed guaranteed in our 
approach thanks to the assumption of model the second problem called ampr is defined as finding the most probable ranking that is this problem is especially challenging for ranking distributions for which the order of two items is hard to elicit because many entries of are close to therefore we again relax the goal of the learner and only require it to find ranking with the following property there is no pair of items such that ri rj and pi put in words the ranking is allowed to differ from only for those items whose pairwise probabilities are close to any ranking satisfying this property is called an approximately most probable ranking ampr both goals are meant to be achieved with probability at least for some our learner operates in an online setting in each iteration it is allowed to gather information by asking for single pairwise comparison between two using the dueling bandits jargon to pull two arms thus it selects two items and and then observes either preference or the former occurs with probability pi as defined in the latter with probability pj pi based on this observation the learner updates its estimates and decides either to continue the learning process or to terminate and return its prediction what we are mainly interested in is the sample complexity of the learner that is the number of pairwise comparisons it queries prior to termination before tackling the problems introduced above we need some additional notation the pair of items chosen by the learner in the comparison is denoted it where it and the feedback received is defined as ot if it and ot if it the set of steps among the first iterations in which the learner decides to compare items and is denoted by ii and the size of this set by ni ii the proportion of wins of item against item up to iteration is then given by pbi since our samples are independent and identically distributed the relative frequency pbi is reasonable estimate of the pairwise probability the model the pl model is probability distribution on rankings it is parameterized by skill vector vm rm and mimics the successive construction of ranking by selecting items position by position each time choosing one of the remaining items with probability proportional to its skill vi thus with the probability of ranking is oi vom oi as an appealing property of the pl model we note that the marginal probabilities are very easy to calculate as they are simply given by vi pi vi vj likewise the most probable ranking can be obtained quite easily simply by sorting the items according to their skill parameters that is iff vi vj moreover the pl model satisfies strong stochastic transitivity pi max pi pj whenever pi and pj ranking distributions based on sorting in the classical sorting literature the outcome of pairwise comparisons is deterministic and determined by an underlying total order of the items namely the order the sorting algorithm seeks to find now if the pairwise comparisons are stochastic the sorting algorithm can still be run however the result it will return is random ranking interestingly this is another way to define probability distribution over the rankings is the probability that is returned by the algorithm if we omit the index if there is no danger of confusion stochastic comparisons are specified by obviously this view is closely connected to the problem of noisy sorting see the related work section in recent work by ailon the quicksort algorithm is investigated in stochastic setting where the pairwise comparisons are drawn from the pairwise 
marginals of the model several interesting properties are shown about the ranking distribution based on quicksort notably the property of pairwise stability we denote the ranking distribution by pqs where the matrix contains the marginals of the model then it can be shown that pqs obeys the property of pairwise stability which means that it preserves the marginals although the distributions themselves might not be identical pqs theorem theorem in letp be given by the pairwise marginals pi vi vi vj then pi pqs rj ri pqs one drawback of the quicksort algorithm is its complexity to generate random ranking it compares items in the worst case next we shall introduce budgeted version of the quicksort algorithm which terminates if the algorithm compares too many pairs namely more than log upon termination the modified quicksort algorithm only returns partial order nevertheless we will show that it still preserves the pairwise stability property the budgeted algorithm algorithm shows budgeted version of the random ranking generation algorithm bqs process described in the previous section it require the set to be sorted and budget works in way quite similar to the standard ensure where is the remaining algorithm with the notable get and is the partial order that was condifference of terminating as soon as the number structed based on samples of pairwise comparisons exceeds the budget initialize to be the empty partial order over which is parameter assumed as an input if or then return viously the bqs algorithm run with pick an element uniformly at random and or recovers the for all do inal sampling algorithm as draw random sample oij according to the special case pl marginal update accordingly run of bqs can be represented quite naturally as random tree the root is labeled oi end whenever call to bqs oi ates recursive call bqs child node bqs with label is added to the node with label bqs note that each such tree determines ranking update based on and return which is denoted by in natural way the random ranking generated by bqs for some subset was analyzed by ailon who showed that it gives back the same marginals as the original model as recalled in theorem now for denote by the tree the algorithm would have returned for the budget instead of additionally let denote the set of all possible outcomes of and for two distinct indices and let ti denote the set of all trees in which and are incomparable in the associated ranking some leaf of is labelled by superset of the main result of this section is that bqs does not introduce any bias in the marginals theorem also holds for the budgeted version of bqs proposition for any any set and any indices the partial order generated by bqs satisfies ti that is whenever two items and are comparable by the partial ranking generated by bqs with probability exactly the basic idea of the proof deferred to the appendix is to show that conditioned on the event that and are incomparable by would have been put differently is obtained from by continuing the execution of bqs ignoring the stopping criterion obtained with probability in case execution of bqs had been continued see claim the result then follows by combining this with theorem the problem and its analysis our algorithm for finding the pac item is based on the sampling algorithm plpac nique described in the previous section the for do initialization pseudocode of the algorithm called plpac pbi pi is shown in algorithm in each iteration we generate ranking which is partial line and translate this ranking into 
pairwise set comparisons that are used to update the repeat bqs where timates of the pairwise marginals based on sorting based random ranking these estimates we apply simple eliminab and correspondtion strategy which consists of eliminating an update the entries of item if it is significantly beaten by another ing to basedron item that is pbi ci lines set ci log for all finally the algorithm terminates when it finds for which by definition for do to identify an item as if pbi ci then it is enough to guarantee that discard is not beaten by any with margin bigger than that is pi for all pbi ci this sufficient condition is implemented in line since we only have until ical estimates of the pi values the test of the return condition does of course also take the confidence intervals into account note that vi vj implies pi in this case it is not possible to decide whether pi is above or not on the basis of finite number of pairwise comparisons the of the goal to be achieved provides convenient way to circumvent this problem sample complexity analysis of plpac first let rt denote the partial ordering produced by bqs in the iteration note that each of these partial orderings defines bucket order the indices are partitioned into different classes buckets in such way that none of the pairs are comparable within one class but pairs from different classes are thus if and belong to some class and and belong to some other class then either rt and rt or rt and rt more specifically the bqs algorithm with budget line always results in bucket order containing only two buckets since no recursive call is carried out with this budget then one might show that the optimal arm and an arbitrary arm fall into different buckets often enough this observation allows us to the number of pairwise comparisons taken by plpac with high probability the proof of the next theorem is deferred to appendix max max for each index with probability at least after log mi calls for bqs with budget terminates and an arm therefore the total number of samples is log mi theorem set in theorem the dependence on is of order log it is easy to show that log is lower bound therefore our result is optimal from this point of view our model assumptions based on the pl model imply some regularity properties for the pairwise marginals such as strong stochastic transitivity and stochastic triangle inequality see appendix of for the proof therefore the nterleaved ilter and eat he ean algorithms can be directly applied in our online framework both algorithms achieve similar sample complexity of order log yet our experimental study in section clearly shows that provided our model assumptions on pairwise marginals are valid plpac outperforms both algorithms in terms of empirical sample complexity the ampr problem and its analysis for strictly more than two elements the surrogate distribution and the pl distribution are in general not identical although their mode rankings coincide the mode of pl model is the ranking that sorts the items in decreasing order of their skill values ri rj iff vi vj for any moreover since vi vj implies pi sorting based on the copeland score bi pi yields most probable ranking our algorithm is based on estimating the copeland score of the items its is shown in algorithm in appendix as first step it generates rankings based on sorting which is used to then it computes lower and upper bound and bi update the pairwise probability estimates for each of the scores bi the lower bound is given as bi pbi which is the number of items 
that are beaten by item based on the current empirical estimates of pairwise marginals similarly the upper bound is given as bi bi si where si pi pbi obviously si is the number of pairs for which based on the current empirical estimates it can not be decided whether pi is above or below as an important observation note that there is no need to generate full ranking based on sorting in every case because if bi bi bj bj then we already know the order of items and with respect to motivated by this observation consider the interval graph based on the bi bi where bi bi bj bj denote the connected components of this graph by ck obviously if two items belong to different components then they do not need to be compared anymore therefore it is enough to call the sampling with the connected components finally the algorithm terminates if the goal is achieved line more specifically it terminates if there is no pair of items and for which the ordering with respect to is not elicited yet bi bi bj bj and their pairwise probabilities is close to sample complexity analysis of denote by qm the expected number of comparisons of the standard quicksort algorithm on elements namely qm log log see thanks to the concentration property of the performance of the quicksort algorithm there is no pair of items that falls into the same bucket too often in bucket order which is output by bqs this observation allows us to the number of pairwise comparisons taken by with high probability the proof of the next theorem is deferred to appendix theorem set max for each where denotes the th largest skill parameter with probability at least after log calls for bqs with budget qm the algorithm arm plpac terminates and outputs an therefore the total number of samples is log log remark the rankcentrality algorithm proposed in converts the empirical pairwise into matrix then considering as transition matrix of marginals markov chain it ranks the items based on its stationary distribution in the authors show that if the pairwise marginals obey pl distribution this algorithm produces the mode of this distribution if the sample size is sufficiently large in their setup the learning algorithm has no influence on the selection of pairs to be compared instead comparisons are sampled using fixed underlying distribution over the pairs for any sampling distribution their pac bound is of order at least whereas our sample complexity bound in theorem is of order experiments our approach strongly exploits the assumption of data generating process that can be modeled by means of pl distribution the experimental studies presented in this section are mainly aimed at showing that it is doing so successfully namely that it has advantages compared to other approaches in situations where this model assumption is indeed valid to this end we work with synthetic data nevertheless in order to get an idea of the robustness of our algorithm toward violation of the model assumptions some first experiments on real data are presented in appendix the problem we compared our plpac algorithm with other algorithms applicable in our setting namely nterleaved ilter if eat he ean btm and allows mpi while each of these algorithms follows successive elimination strategy and discards items one by one they differ with regard to the sampling strategy they follow since the time horizon must be given in advance for if we run it with subsequently referred to as if the btm algorithm can be accommodated into our setup as is see algorithm in the allows mpi algorithm assumes 
mallows model instead of pl as an underlying probability distribution over rankings and it seeks to find the condorcet can be applied in our setting too since condorcet winner does exist for pl since the baseline methods are not able to handle except the btm we run our algorithm with and made sure that vi vj for all plpac if if if btm mallowsmpi plpac if if if btm mallowsmpi sample complexity sample complexity sample complexity plpac if if if btm mallowsmpi number of arms number of arms number of arms figure the sample complexity for over repetitions the results are averaged we tested the learning algorithm by setting the parameters of pl to vi with the parameter controls the complexity of the rank elicitation task since the gaps between pairwise probabilities and are of the form which converges to zero as we evaluated the algorithm on this test case with varying numbers of items and with various values of parameter and plotted the sample complexities that is the number of pairwise comparisons taken by the algorithms prior to termination the results are shown in figure only for the rest of the plots are deferred to appendix as can be seen the plpac algorithm significantly outperforms the baseline methods if the pairwise comparisons match with the model assumption namely they are drawn from the marginals of pl distribution allows mpi achieves performance that is slightly worse than plpac for and its performance is among the worst ones for this can be explained by the elimination strategy of allows mpi which heavily relies on the existence of gap between all pairwise probabilities and in our test case the minimal gap pm is getting smaller with increasing and the poor performance of btm for large and can be explained by the same argument the ampr problem since the rankcentrality algorithm produces the most probable ranking if the pairwise marginals obey pl distribution and the sample size is sufficiently large cf remark it was taken as baseline using the same test case as before input data of various size was generated for rankcentrality based on uniform sampling of pairs to be compared its performance is shown by the black lines in figure the results for are again deferred to appendix the accuracy in single run of the algorithm is if the output of rankcentrality is identical with the most probable ranking and otherwise this accuracy was averaged over runs in addition we conducted some experiments to asses the impact of parameter and to test our algorithms based on confidence intervals these experiments are deferred to appendix and due to lack of space rankcentrality rankcentrality rankcentrality rankcentrality rankcentrality rankcentrality optimal recovery fraction optimal recovery fraction optimal recovery fraction rankcentrality rankcentrality rankcentrality sample size sample size sample size figure sample complexity for finding the approximately most probable ranking ampr with parameters the results are averaged over repetitions we also run our algorithm and determined the number of pairwise comparisons it takes prior to termination the horizontal lines in figure show the empirical sample complexity achieved by with in accordance with theorem the accuracy of plpacampr was always significantly higher than actually equal to in almost every case as can be seen rankcentrality slightly outperforms in terms of sample complexity that is it achieves an accuracy of for smaller number of pairwise comparisons keep in mind however that only terminates when its output is correct with probability at 
least moreover it computes the confidence intervals for the statistics it uses based on the chernoffhoeffding bound which is known to be very conservative as opposed to this rankcentrality is an offline algorithm without any performance guarantee if the sample size in not sufficiently large see remark therefore it is not surprising that asymptotically its empirical sample complexity shows better behavior than the complexity of our online learner as final remark ranking distributions can principally be defined based on any sorting algorithm for example mergesort however to the best of our knowledge pairwise stability has not yet been shown for any sorting algorithm other than quicksort we empirically tested the mergesort algorithm in our experimental study simply by using it in place of budgeted quicksort in the algorithm we found mergesort inappropriate for the pl model since the accuracy of when being used with mergesort instead of quicksort drastically drops on complex tasks for details see appendix the question of pairwise stability of different sorting algorithms for various ranking distributions such as the mallows model is an interesting research avenue to be explored conclusion and future work in this paper we studied different problems of online rank elicitation based on pairwise comparisons under the assumption of model taking advantage of this assumption our idea is to construct surrogate probability distribution over rankings based on sorting procedure namely quicksort for which the pairwise marginals provably coincide with the marginals of the pl distribution in this way we manage to exploit the stochastic transitivity properties of pl which is at the origin of the efficiency of our approach together with the idea of replacing the original quicksort with budgeted version of this algorithm in addition to formal performance and complexity analysis of our algorithms we also presented first experimental studies showing the effectiveness of our approach needless to say in addition to the problems studied in this paper there are many other interesting problems that can be tackled within the framework of online learning for examb of the entire distriple going beyond single item or ranking we may look for good estimate with bution for example an estimate with small divergence kl regard to the use of sorting algorithms another interesting open question is the following is there any sorting algorithm with worst case complexity of order log which preserves the marginal probabilities this question might be difficult to answer since as we conjecture the mergesort and the insertionsort algorithms which are both algorithms with an log complexity do not satisfy this property references nir ailon reconciling real scores with binary comparisons new logistic based model for ranking in advances in neural information processing systems pages braverman and mossel noisy sorting without resampling in proceedings of the nineteenth annual symposium on discrete algorithms pages braverman and mossel sorting from noisy information corr bubeck munos and stoltz pure exploration in bandits problems in proceedings of the alt alt pages berlin heidelberg bubeck wang and viswanathan multiple identifications in bandits in proceedings of the icml pages and survey of online learning with bandit algorithms in algorithmic learning theory alt volume pages and rank elicitation using statistical models the case of mallows in icml volume pages and pac rank elicitation through adaptive sampling of stochastic pairwise 
preferences in aaai pages weng cheng and selection based on adaptive sampling of noisy preferences in proceedings of the icml jmlr cp volume clopper and pearson the use of confidence or fiducial limits illustrated in the case of the binomial biometrika mannor and mansour pac bounds for bandit and markov decision processes in proceedings of the colt pages uriel feige prabhakar raghavan david peleg and eli upfal computing with noisy information siam october gabillon ghavamzadeh lazaric and bubeck best arm identification in nips pages guiver and snelson bayesian inference for ranking models in proceedings of the icml pages hoare quicksort comput hoeffding probability inequalities for sums of bounded random variables journal of the american statistical association hunter mm algorithms for generalized models the annals of statistics luce and suppes handbook of mathematical psychology chapter preference utility and subjective probability pages wiley luce individual choice behavior theoretical analysis wiley mallows ranking models biometrika john marden analyzing and modeling rank data chapman hall mcdiarmid and hayward large deviations for quicksort journal of algorithms negahban oh and shah iterative ranking from pairwise comparisons in advances in neural information processing systems pages plackett the analysis of permutations applied statistics arun rajkumar and shivani agarwal statistical convergence perspective of algorithms for rank aggregation from pairwise data in icml pages soufiani chen parkes and xia generalized for rank aggregation in advances in neural information processing systems nips pages urvoy clerot and naamane generic exploration and voting bandits in proceedings of the icml jmlr cp volume pages yue broder kleinberg and joachims the dueling bandits problem journal of computer and system sciences yue and joachims beat the mean bandit in proceedings of the icml pages zoghi whiteson munos and rijke relative upper confidence bound for the dueling bandit problem in icml pages 
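As a concrete illustration of the sorting-based surrogate idea used by the preceding paper, the following Python sketch samples pairwise comparisons from the Plackett-Luce marginals P(i beats j) = v_i / (v_i + v_j) and runs a budget-limited randomised quicksort on them. It is a minimal toy illustration written for this text, not the authors' PLPAC implementation; the names pl_compare and noisy_quicksort and the budget handling are our own assumptions, and the actual algorithms additionally maintain Chernoff-Hoeffding confidence intervals before committing to an answer.

    import random

    def pl_compare(v, i, j, rng=random):
        # one noisy comparison under the plackett-luce marginal:
        # item i beats item j with probability v[i] / (v[i] + v[j])
        return rng.random() < v[i] / (v[i] + v[j])

    def noisy_quicksort(items, v, budget, rng=random):
        # randomised quicksort driven by noisy pl comparisons; each recursive
        # call draws from a shared comparison budget, and once the budget is
        # spent the remaining blocks are left in their current order.  this is
        # only a toy stand-in for the budgeted quicksort discussed in the paper.
        used = 0

        def sort(block):
            nonlocal used
            if len(block) <= 1 or used >= budget:
                return block
            pivot = rng.choice(block)
            better, worse = [], []
            for x in block:
                if x == pivot:
                    continue
                if used >= budget:
                    worse.append(x)  # budget exhausted: stop comparing
                    continue
                used += 1
                (better if pl_compare(v, x, pivot, rng) else worse).append(x)
            return sort(better) + [pivot] + sort(worse)

        return sort(list(items)), used

    # example: five items with skill parameters 5 > 4 > 3 > 2 > 1
    ranking, comparisons = noisy_quicksort(range(5), [5, 4, 3, 2, 1], budget=200)

With a generous budget the returned ranking concentrates on the true order of the skill parameters, while a small budget mimics early termination of the budgeted variant.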
labelings for submodular energies and beyond alexander dmitrij dmitry carsten rother bogdan tu dresden dresden germany skoltech moscow russia abstract we consider the problem of finding best diverse solutions of energy minimization problems for graphical models contrary to the sequential method of batra et which greedily finds one solution after another we infer all solutions jointly it was shown recently that such jointly inferred labelings not only have smaller total energy but also qualitatively outperform the sequentially obtained ones the only obstacle for using this new technique is the complexity of the corresponding inference problem since it is considerably slower algorithm than the method of batra et al in this work we show that the joint inference of best diverse solutions can be formulated as submodular energy minimization if the original problem is submodular hence fast inference techniques can be used in addition to the theoretical results we provide practical algorithms that outperform the current and can be used in both submodular and case introduction variety of tasks in machine learning can be formulated in the form of an energy minimization problem known also as maximum posteriori map or maximum likelihood estimation mle inference in an undirected graphical models related to markov or conditional random fields its modeling power and importance are which resulted into specialized benchmark and computational challenges for its solvers this underlines the importance of finding the most probable solution following and we argue however that finding diverse configurations with low energies is also of importance in number of scenarios such as expressing uncertainty of the found solution faster training of model parameters ranking of inference results empirical risk minimization we build on the new formulation for finding which was recently proposed in in this formulation all configurations are inferred jointly contrary to the established method where sequential greedy procedure is used as shown in the new formulation does not only reliably produce configurations with lower total energy but also leads to better results in several application scenarios in particular for the image segmentation scenario the results of significantly outperform those of this is true even when uses plain hamming distance as diversity measure and uses more powerful diversity measures our contributions we show that finding configurations of binary submodular energy minimization can be formulated as submodular problem and hence can be solved this project has received funding from the european research council erc under the european unions horizon research and innovation programme grant agreement no vetrov was supported by rfbr proj no and by microsoft rpd efficiently for any diversity measure we show that for certain diversity measures such as hamming distance the configurations of multilabel submodular energy minimization can be formulated as submodular problem which also implies applicability of efficient graph solvers we give the insight that if the problem is submodular then the configurations can be always fully ordered with respect to the natural partial order induced in the space of all configurations we show experimentally that if the problem is submodular we are quantitatively at least as good as and considerably better than the main advantage of our method is major speed up over up to the order of two magnitudes our method has the same order of magnitude as in the case our results are slightly 
inferior to but the advantage with respect to gain in speed up still holds related work the importance of the considered problem may be justified by the fact that procedure of computing solutions to discrete optimization problems was proposed in which dates back to later more efficient specialized procedures were introduced for on tree ch and general graphical models such methods are however not suited for scenarios where diversity of the solutions is required like in machine translation search engines producing hypothesis in cascaded algorithms since they do not enforce it explicitly structural determinant point processes is tool to model probabilistic distributions over structured models unfortunately an efficient sampling procedure is feasible for graphical models only the recently proposed algorithm to find best modes of distribution is limited to the same narrow class of problems training of independent graphical models to produce diverse solutions was proposed in in contrast we assume single fixed model supporting reasonable along with the most related to our work is the recent paper which proposes subclass of new diversity penalties for which the greedy nature of the algorithm can be substantiated due to submodularity of the used diversity measures in contrast to we do not limit ourselves to diversity measures fulfilling such properties and moreover we define class of problems for which our joint inference approach leads to polynomially and efficiently solvable problems in practice we build on top of the work which is explained in detail in section organization of the paper section provides background necessary for formulation of our results energy minimization for graphical models and existing approaches to obtain diverse solutions in section we introduce submodularity for graphical models and formulate the main results of our work finally section and are devoted to the experimental evaluation of our technique and conclusions supplementary material contains proofs of all mathematical claims and the concurrent submission preliminaries energy minimization let denote the powerset of set the pair is called and has as finite set of variable nodes and as set of factors each variable node is associated with variable yv taking its values in finite set of labels lv the set la lv denotes cartesian product of sets of labels corresponding to the subset of variables functions θf lf associated with factors are called potentials and define local costs on values of variables and their combinations potentials θf with are called unary with pairwise and higher order the set θf of all potentials is referred by for any factor the corresponding set of variables yv will be denoted by yf the energy minimization problem consists of finding labeling yv lv which minimizes the total sum of corresponding potentials arg min arg min θf yf problem is also known as labeling satisfying will be later called solution of the or problem shortly or general diversity measure diversity measure hamming distance diversity figure examples of factor graphs for diverse solutions of the original mrf with different diversity measures the circles represent nodes of the original model that are copied times for clarity the diversity factors of order higher than are shown as squares pairwise factors are depicted by edges connecting the nodes we omit for readability the most general diversity measure the diversity measure hamming distance as diversity measure finally model is defined by the triple lv the underlying the sets of labels 
and the potentials in the following we use brackets to distinguish between upper index and power means the power of whereas is an upper index in the expression an we will keep however the standard notation rn for the vector space sequential computation of best diverse solutions instead of looking for single labeling with lowest energy one might ask for set of labelings with low energies yet being significantly different from each other in order to find such diverse labelings the method proposed in solves sequence of problems of the form arg min for where determines between diversity and energy is the and the function lv lv defines the diversity of two labelings in other words takes large value if and are diverse in certain sense and small value otherwise this problem can be seen as an energy minimization problem where additionally to the initial potentials the potentials associated with an additional factor are used in the simplest and most commonly used form is represented by sum of diversity measures lv lv yv and the potentials are split to sum of unary potentials those associated with additional factors this implies that in case efficient based inference methods including αexpansion or their generalizations are applicable to the initial problem then they remain applicable to the augmented problem which assures efficiency of the method joint computation of labelings the notation will be used as shortcut for for any function lv instead of the greedy sequential procedure in it was suggested to infer all labelings jointly by minimizing for and some function defines the total diversity of any labelings it was shown in that the labelings obtained according to have both lower total energy pm and are better from the applied point of view than those obtained by the sequential method hence we will build on the formulation in this work though the expression looks complicated it can be nicely represented in the form and hence constitutes an energy minimization problem to achieve this one creates copies liv lv of the initial model lv the for the new task is defined as follows the set of nodes in the new graph is the union of the node sets from the sm sm considered copies factors are again the union of the initial ones extended by special factor corresponding to the diversity penalty that depends on all nodes of the new graph each node is associated with the label set liv lv the corresponding potentials are defined as see fig for illustration the model corresponds to the energy an optimal of these labelings corresponding to minimum of is between low energy of individual labelings and their total diversity complexity of the diversity problem though the formulation leads to better results than those of minimization of is computationally demanding even if the original energy can be easily approximatively optimized this is due to the intrinsic repulsive structure of the diversity potentials according to the intuitive meaning of the diversity similar labels are penalized more than different one consider the simplest case with the hamming distance applied as diversity measure yvi yvj where jy here expression jak equals if is true and otherwise the corresponding factor graph is sketched in fig such potentials can not be optimized with efficient based methods and moreover as shown in the bounds delivered by based solvers are very loose in practice indeed solutions delivered by such solvers are significantly inferior even to the results of the sequential method to cope with this issue clique encoding 
representation of was proposed in in this representation of labels yvm in the nodes corresponding to the single initial node were considered as the new labels in this way the difficult diversity factors were incorporated into the unary factors of the new representation and the pairwise factors were adjusted respectively this allowed to approximately solve the problem with based techniques if those techniques were applicable to the energy of single labeling the disadvantage of the clique encoding representation is the exponential growth of the label space which was reflected in significantly higher inference time for the problem compared to the procedure in what follows we show an alternative transformation of the problem which does not have this drawback its size is basically the same as those of and ii allows to exactly solve in the case the energy is submodular diversity in what follows we will mainly consider the diversity measures those which can be represented in the form for some node diversity measures see fig for illustration lv labelings for submodular problems submodularity in what follows we will assume that the sets lv of labels are completely ordered this implies that for any lv their maximum and minimum denoted as and respectively are similarly let and denote the maximum and minimum of any two labelings la potential θf is called submodular if for any two labelings lf it θf θf θf θf potential will be called supermodular if is submodular pairwise binary potentials satisfying θf θf θf θf build an important special case of this definition energy is called submodular if for any two labelings lv it holds submodularity of energy trivially follows from the submodularity of all its potentials θf in the pairwise case the inverse also holds submodularity of energy implies also submodularity of all its pairwise potentials thm there are efficient methods for solving energy minimization problems with submodular potentials based on its transformation into problem in case all potentials are either unary or pairwise or to submodular problem in the case ordered solutions in what follows we will write for any two vectors and meaning that the inequality holds for an arbitrary set we will call function of variables permutation invariant if for any xn and any permutation it holds xn xπ xπ xπ in what follows we will consider mainly permutation invariant diversity measures let us consider two arbitrary labelings lv and their minimum and maximum since is either equal to or to for any permutation invariant node diversity measure it holds this in its turn implies for any diversity measure of the form if is submodular then from it additionally follows that where is defined as in note that generalizing these considerations to labelings one obtains theorem let be submodular and be diversity measure with each component being permutation invariant then there exists an ordered for such that for any lv it holds where is defined as in theorem in particular claims that in the binary case lv the optimal labelings define nested subsets of nodes corresponding to the label submodular formulation of problem due to theorem for submodular energies and diversity measures it is sufficient to consider only ordered of labelings this order can be enforced by modifying the diversity measure accordingly ˆm otherwise ˆm and using it instead of the initial measure note that is not permutation invariant in practice one can use sufficiently big numbers in place of in this implies lemma let be submodular and be diversity measure 
with each component being permutation invariant then any solution of the ordering enforcing problem yv yv is solution of the corresponding problem yv yv where and are related by we will say that vector lv is ordered if it holds given submodularity of the submodularity an hence solvability of in would trivially follow from the supermodularity of however there hardly exist supermodular diversity measures the ordering provided by theorem and the corresponding form of the orderingenforcing diversity measure significantly weaken this condition which is precisely stated by the following lemma in the lemma we substitute of with sufficiently big values such as max for the sake of numerical implementation moreover this values will differ from each other to keep supermodular lemma let for any two ordered vectors lv and lv it holds where and are maximum and minimum respectively then defined as is supermodular note eq and are the same up to the infinity values in though condition resembles the supermodularity condition it has to be fulfilled for ordered vectors only the following corollaries of lemma give two most important examples of the diversity measures fulfilling corollary let for all then the statement of lemma holds for arbitrary lv pm pm corollary let then the condition of lemma is equivalent to for and in particular condition is satisfied for the hamming distance jy the following theorem trivially summarizes lemmas and theorem let energy and diversity measure satisfy conditions of lemmas and then the ordering enforcing problem delivers solution to the problem and is submodular moreover submodularity of all potentials of the energy implies submodularity of all potentials of the ordering enforcing energy experimental evaluation we have tested our algorithms in two application scenarios interactive image segmentation where annotation is available in the form of scribbles and category level segmentation on pascal voc data as baselines we use the sequential method divmbest proposed in and ii the ce method for an approximate joint computation of labelings as mentioned in section this method addresses the energy defined in however it has the disadvantage that its label space grows exponentially with our method that solves the problem with the hamming diversity measure by transforming it into problem and running the solver is denoted as diversity measures used in experiments are the hamming distance hd label cost lc label transitions lt and hamming ball hb the last three measures are higher order diversity potentials introduced in and used only in connection with the divmbest algorithm if not stated otherwise the hamming distance is used as diversity measure both the clique encoding ce based approaches and the methods proposed in this work use only the hamming distance as diversity measure as suggests certain combinations of different diversity measures may lead to better results to denote such combinations the signs and were used in we refer to for detailed description of this notation and treat such combined methods as black box for our comparison divmbest ce quality time quality time quality time table interactive segmentation accuracies quality for the best segmentation out of ones and compare to the average quality of single labeling hamming distance is used as diversity measure the is in milliseconds ms quantitatively outperforms divmbest and is equal to ce however it is considerably faster than ce interactive segmentation instead of returning single segmentation corresponding to diversity 
methods provide to the user small number of possible results based on the scribbles following we model only the first iteration of such an interactive procedure we consider user scribbles to be given and compare the sets of segmentations returned by the compared diversity methods authors of kindly provided us their graphical model instances corresponding to the mapinference problem they are based on subset of the pascal voc segmentation challenge with manually added scribbles pairwise potentials constitute contrast sensitive potts terms which are submodular this implies that the is solvable by algorithms and ii theorem is applicable and the solutions can be found by reducing the ordering preserving problem to and applying the corresponding algorithm quantitative comparison and of the considered methods is provided in table where each method was used with the parameter see optimally tuned via following as quality measure we used the per pixel accuracy of the best solution for each sample averaged over all test images methods ce and gave the same quality which confirms the observation made in that ce returns an exact map solution for each sample in this dataset combined methods with more sophisticated diversity measures return results that are either inferior to divmbest or only negligibly improved once hence we omitted them the runtime provided is also averaged over all samples the algorithm was used for divmbest and and for ce summary it can be seen that the qualitatively outperforms divmbest and is equal to ce however it is considerably faster than the latter the difference grows exponentially with and the runtime is of the same order of magnitude as the one of divmbest category level segmentation the category level segmentation from pascal voc challenge contains validation images with known ground truth which we used for evaluation of diversity methods corresponding pairwise models with contrast sensitive potts terms of the form θuv wuv jy uv were used in and kindly provided to us by the authors contrary to interactive segmentation the label sets contain elements and hence the respective problem is not submodular anymore however it still can be approximatively solved by or since the problem is not submodular in this experiment theorem is not applicable we used two ways to overcome it first we modified the diversity potentials according to as if theorem were to be correct this basically means we were explicitly looking for ordered best diverse labelings the resulting inference problem was addressed with since neither nor the algorithms are applicable we refer to this method as to the second way to overcome the problem is based on learning using structured svm technique we trained pairwise potentials with additional constraints enforcing their submodularity as it is done in we kept the contrast terms wuv and learned only single submodular function which we used in place of jy after the learning all our potentials had the form θuv wuv uv we refer to map inference quality time quality time quality time divmbest ce lt coop cuts lt coop cuts table pascal voc intersection over union quality time the best segmentation out of is considered compare to the average quality of single labeling time is in seconds notation correspond to absence of result due to computational reasons or inapplicability of the method methods were not run by us and the results were taken from directly the column references the slowest inference technique out of those used by the method this method as to for the model we 
use as an exact inference method and as fast approximate inference method quantitative comparison and of the considered methods is provided in table where each method was used with the parameter see optimally tuned via crossvalidation on the validation set in pascal voc following we used the intersection over union quality measure averaged over all images among combined methods with higher order diversity measures we selected only those providing the best results the method is hybrid of divmbest and ce delivering reasonable between running time and accuracy of inference for the model quantitative results delivered by and are very similar though the latter is negligibly better significantly outperform those of divmbest and only slightly inferior to those of however the for and version of are comparable to those of divmbest and outperform all other competitors due to use of the fast inference algorithms and linearly growing label space contrary to the label space of which grows as lv though we do not know exact for the combined methods where and are used we expect them to be significantly higher then those for divmbest and because of the intrinsically slow techniques used however contrary to the latter one the inference in can be exact due to submodularity of the underlying energy conclusions we have shown that submodularity of the problem implies fully ordered set of best diverse solutions given permutation invariant diversity measure enforcing such ordering leads to submodular formulation of the joint problem and implies its efficient solvability moreover we have shown that even in cases when the is approximately solvable with efficient based methods enforcing this ordering leads to the problem which is approximately solvable with based methods as well in our test cases and there are likely others such an approximative technique lead to notably better results then those provided by the established sequential divmbest technique whereas its remains quite comparable to the of divmbest and is much smaller than the of other competitors references arora banerjee kalra and maheshwari generalized flows for optimal inference in higher order tpami batra an efficient algorithm for the map problem batra yadollahpour and shakhnarovich diverse solutions in markov random fields in eccv springer boykov and jolly interactive graph cuts for optimal boundary region segmentation of objects in images in iccv boykov and kolmogorov an experimental comparison of algorithms for energy minimization in vision tpami boykov veksler and zabih fast approximate energy minimization via graph cuts tpami chen kolmogorov zhu metaxas and lampert computing the most probable modes of graphical model in aistats elidan and globerson the probabilistic inference challenge everingham van gool williams winn and zisserman the pascal visual object classes challenge results fix gruber boros and zabih graph cut algorithm for markov random fields in iccv franc and savchynskyy discriminative learning of classifiers jmlr fromer and globerson an lp view of the map problem in nips batra and kohli multiple choice learning learning to produce multiple structured outputs in nips kohli and batra divmcuts faster training of structural svms with diverse in aistats kohli batra and rutenbar efficiently enforcing diversity in structured prediction in aistats ishikawa exact optimization for markov random fields with convex priors tpami jegelka and bilmes submodularity beyond submodular energies coupling edges in graph cuts in cvpr kappes andres 
hamprecht nowozin batra kim kausler lellmann komodakis savchynskyy and rother comparative study of modern inference techniques for structured discrete energy minimization problems ijcv pages kirillov savchynskyy schlesinger vetrov and rother inferring diverse labelings in single one in iccv kolmogorov minimizing sum of submodular functions discrete applied mathematics kolmogorov and zabin what energy functions can be minimized via graph cuts tpami kulesza and taskar structured determinantal point processes in nips lawler procedure for computing the best solutions to discrete optimization problems and its application to the shortest path problem management science nilsson an efficient algorithm for finding the most probable configurationsin probabilistic expert systems statistics and computing prasad jegelka and batra submodular meets structured finding diverse subsets in structured item sets in nips premachandran tarlow and batra empirical minimum bayes risk prediction how to extract an extra few performance from vision models with just three more parameters in cvpr ramakrishna and batra expressing uncertainty via diverse solutions in nips workshop on perturbations optimization and statistics schlesinger and flach transforming an arbitrary minsum problem into binary one tu dresden fak informatik schlesinger and hlavac ten lectures on statistical and structural pattern recognition volume springer science business media tarlow givoni and zemel efficient message passing with high order potentials in aistats werner linear programming approach to problem review tpami yadollahpour batra and shakhnarovich discriminative of diverse segmentations in cvpr yanover and weiss finding the most probable configurations using loopy belief propagation in nips 
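The joint m-best-diverse objective studied in the preceding paper can be made concrete with a small brute-force sketch. The Python snippet below is our own toy illustration, not the authors' graph-cut solver; the names energy, hamming_diversity and joint_m_best are hypothetical. It evaluates the summed energy of m labelings of a tiny binary chain minus lambda times their total node-wise Hamming diversity and returns the minimising tuple; for submodular pairwise terms the minimiser can always be chosen nested, which is exactly the ordering property that the paper exploits to obtain an efficient min-cut formulation.

    import itertools

    def energy(y, unary, pairwise):
        # e(y) = sum_v theta_v(y_v) + sum over chain edges of theta_uv(y_u, y_v)
        e = sum(unary[v][y[v]] for v in range(len(y)))
        e += sum(pairwise[v][y[v]][y[v + 1]] for v in range(len(y) - 1))
        return e

    def hamming_diversity(labelings):
        # total node-wise hamming diversity summed over all pairs of labelings
        return sum(sum(int(a != b) for a, b in zip(y1, y2))
                   for y1, y2 in itertools.combinations(labelings, 2))

    def joint_m_best(unary, pairwise, m, lam, labels=(0, 1)):
        # brute-force minimiser of sum_m e(y^m) - lam * diversity(y^1, ..., y^m);
        # exponential in m and the chain length, so only usable on toy examples
        n = len(unary)
        best, best_val = None, float("inf")
        for combo in itertools.product(itertools.product(labels, repeat=n), repeat=m):
            val = sum(energy(y, unary, pairwise) for y in combo)
            val -= lam * hamming_diversity(combo)
            if val < best_val:
                best, best_val = combo, val
        return best, best_val

    # example: 3-node binary chain with submodular (potts-like) pairwise terms
    unary = [[0.0, 1.5], [1.0, 0.2], [0.3, 0.4]]
    potts = [[0.0, 1.0], [1.0, 0.0]]
    labelings, value = joint_m_best(unary, [potts, potts], m=2, lam=0.5)

On such toy instances one can check directly that the returned pair of labelings is ordered (the sets of nodes taking label 1 are nested), in line with the theorem on ordered solutions.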
tractable bayesian network structure learning with bounded vertex cover number janne korhonen helsinki institute for information technology hiit department of computer science university of helsinki pekka parviainen helsinki institute for information technology hiit department of computer science aalto university abstract both learning and inference tasks on bayesian networks are in general bounded bayesian networks have recently received lot of attention as way to circumvent this complexity issue however while inference on bounded networks is tractable the learning problem remains even for in this paper we propose bounded vertex cover number bayesian networks as an alternative to bounded networks in particular we show that both inference and learning can be done in polynomial time for any fixed vertex cover number bound in contrast to the general and bounded cases on the other hand we also show that learning problem is in parameter furthermore we give an alternative way to learn bounded vertex cover number bayesian networks using integer linear programming ilp and show this is feasible in practice introduction bayesian networks are probabilistic graphical models representing joint probability distributions of random variables they can be used as model in variety of prediction tasks as they enable computing the conditional probabilities of set of random variables given another set of random variables this is called the inference task however to use bayesian network as model for inference one must first obtain the network typically this is done by estimating the network based on observed data this is called the learning task both the inference and learning tasks are in general one approach to deal with this issue has been to investigate special cases where these problems would be tractable that is the basic idea is to select models from restricted class of bayesian networks that have structural properties enabling fast learning or inference this way the computational complexity will not be an issue though possibly at the cost of accuracy if the true distribution is far from the model family most notably it is known that the inference task can be solved in polynomial time if the network has bounded or more precisely the inference task is tractable in the of the network moreover this is in sense optimal as bounded is necessary for inference unless the exponential time hypothesis eth fails the possibility of tractable inference has motivated several recent studies also on learning bounded bayesian networks however unlike in the case of inference learning bayesian network of bounded is for any fixed bound at least furthermore it is known that learning many relatively simple classes such as paths and polytrees is also indeed so far the only class of bayesian networks for which polynomial time learning algorithm is known are trees graphs with it appears that our knowledge about structure classes allowing tractable learning is quite limited structure learning with bounded vertex cover number in this work we propose bounded vertex cover number bayesian networks as an alternative to the paradigm roughly speaking we consider bayesian networks where all pairwise dependencies edges in the moralised graph are covered by having at least one node from the vertex cover incident to each of them see section for technical details like bounded bayesian networks this is parameterised class allowing between the complexity of models and the size of the space of possible models by varying the parameter results 
complexity of learning bounded vertex cover networks crucially we show that learning an optimal bayesian network structure with vertex cover number at most can be done in polynomial time for any fixed moreover vertex cover number provides an upper bound for implying that inference is also tractable thus we identify rare example of class of bayesian networks where both learning and inference are tractable specifically our main theoretical result shows that an optimal bayesian network structure with vertex cover number at most can be found in time theorem however while the running time of our algorithm is polynomial with respect to the number of nodes the degree of the polynomial depends on we show that this is in sense best we can hope for that is we show that there is no algorithm with running time poly for any function even when the maximum allowed parent set size is restricted to unless the commonly accepted complexity assumption fpt fails theorem results ilp formulation and learning in practice while we prove that the learning bounded vertex cover bayesian network structures can be done in polynomial time the unavoidable dependence on in the degree the polynomial makes the algorithm of our main theorem infeasible for practical usage when the vertex cover number increases therefore we investigate using an integer linear programming ilp formulation as an alternative way to find optimal bounded vertex cover bayesian networks in practice section although the running time of an ilp is exponential in the worst case the actual running time in many practical scenarios is significantly lower indeed most of the algorithms for exact learning of bayesian networks in general and with bounded are based on ilps our experiments show that bounded vertex cover number bayesian networks can indeed be learned fast in practice using ilp section preliminaries directed graphs directed graph consists of node set and arc set for fixed node set we usually identify directed graph with its arc set directed graph is called directed acyclic graph or dag if it contains no directed cycles we write and uv for arc for with uv we say that is parent of and is child of we write av for the parent set of that is av uv bayesian network structure learning we consider the bayesian network structure learning using the approach where the input consists of the node set and the local scores fv for each node and the task is to find dag the network structure that maximises the score fv av we assume that the scores fv are computed beforehand and that we can access each entry fv in constant time we generally consider setting where only parent sets belonging to specified sets fv are permitted typically fv consists of parent sets of size at most in which case we assume that the scores fv are given only for that is the size of the input is nk moralised graphs for dag the moralised graph of is an undirected graph ma ea where ea is obtained by adding an undirected edge to ea for each arc uv and by adding an undirected edge to ea if and have common child that is uw vw in for some the edges added to ea due to rule are called moral edges and vertex cover number of graph is pair where issa tree with node set and xm is collection of subsets of with xi such that for each there is with xi and for each the graph xi is connected the width of is maxi the tw of graph is the minimum width of of for dag we define the tw as the of the moralised graph ma for graph set is vertex cover if each edge is incident to at least one vertex in the vertex cover 
number of graph is the size of the smallest vertex cover in as with we define the vertex cover number of dag as ma lemma for dag we have tw proof by definition the moralised graph ma has vertex cover of size we can construct for ma with central node with xi and leaf with xj for every clearly this has width thus we have tw tw ma structure learning with parameters finally we give formal definition for the bounded treewidth and bounded vertex cover number bayesian network structure learning problems that is let tw in the bayesian network structure learning we are given node set local scores fv and an integer and the task is to find dag maximising score fv av subject to for both and vertex cover number the parameter also bounds the maximum parent set size so we will assume that the local scores fv are given only if complexity results algorithm we start by making few simple observations about the structure of bounded vertex cover number bayesian networks in the following we slightly abuse the terminology and say that is vertex cover for dag if is vertex cover of ma lemma let be set of size and let be dag on set is vertex cover for if and only if for each node we have av and each node has at most one parent outside proof for we have that if there were nodes such that is the child of the moralised graph ma would have edge that is not covered by likewise for we have that if node had parents then ma would have edge not covered by thus both and have to hold if has vertex cover since holds all directed edges in have one endpoint in and thus the corresponding undirected edges in ma are covered by moreover by and no node has two parents outside so all moral edges in ma also have at least one endpoint in lemma allows us to partition dag with vertex cover number into core that covers at most nodes that are either in fixed vertex cover or are parents of those nodes core nodes and periphery figure example of dag with vertex cover number with sets and as in lemma reduction used in theorem each edge in the original graph is replaced by possible containing arcs going into nodes that have no children and all parents in the vertex cover peripheral nodes this is illustrated in figure and the following lemma formalises the observation lemma let be dag on with vertex cover of size then there is set of size at most and arc sets and such that and is dag on with vertex cover and contains only arcs uv with and proof first let av by lemma each can have at most one parent outside so we have now let uv and to see that holds for this choice of we observe that the edge set of the moralised graph mb is subset of the edges in ma and thus covers all edges of mb for the choice of and lemma ensure that nodes in have no children and again by lemma their parents are all in dually if we fix the core and peripheral node sets we can construct dag with bounded vertex cover number by the selecting the core independently from the parents of the peripheral nodes formally lemma let be disjoint let be dag on with vertex cover and let be dag on such that only contains arcs uv with and then is dag on with vertex cover and the score of is fv bv fv cv proof to see that holds we observe that is acyclic by assumption and addition of arcs from can not create cycles as there are no outgoing arcs from nodes in moreover for there are no arcs ending at in and likewise for there are no arcs ending at in thus we have av bv if and av cv otherwise this implies that since conditions of lemma hold for both and they also hold for and thus is vertex cover 
for finally the preceding observation implies also that fv av fv bv for and fv av fv cv otherwise which implies lemmas and give the basis of our strategy for finding an optimal bayesian with vertex cover number at most that is we iterate over all possible nk choices for sets and for each choice we construct the optimal core and periphery as follows keeping track of the best found dag step to find the optimal core we construct bayesian network structure learning instance on by removing nodes outside and restricting the possible choices of parent sets so that fv for all and fv for by lemma any solution for this instance is dag with vertex cover moreover this instance has nodes so it can be solved in time using the dynamic programming algorithm of silander and myllymäki step to construct the periphery we compute the value fˆv fv and select corresponding best parent set choice cv for each this can be done in time in time using the dynamic programming algorithm of ott and miyano step we check if and replace with if this holds by lemma all dags considered by the algorithm are valid solutions for bayesian network structure learning with bounded vertex cover number and by lemma we can find the optimal solution for fixed and by optimising the choice of the core and the periphery separately moreover by lemma each bounded vertex cover dag is included in the search space so we are guaranteed to find the optimal one thus we have proven our main theorem theorem bounded vertex cover number bayesian network structure learning can be solved in time lower bound although the algorithm presented in the previous section runs in polynomial time in the degree of the polynomial depends on the size of vertex cover which poses serious barrier to practical use when grows moreover the algorithm is essentially optimal in the general case as the input has size nk when parent sets of size at most are allowed however in practice one often assumes that node can have at most say or parents thus it makes sense to consider settings where the input is restricted by considering instances where parent set size is bounded from above by some constant while allowing vertex cover number to be higher in this case we might hope to do better as the input size is not restricting factor unfortunately we show that it is not possible to obtain algorithm where the degree of the polynomial does not depend on even when the maximum parent set size is limited to that is there is no algorithm with running time poly for any function unless the widely believed complexity assumption fpt fails specifically we show that bayesian network structure learning with bounded vertex cover number is when restricted to instances with parent set size implying the above claim for full technical details on complexity classes fpt and and the related theory we refer the reader to standard texts on the topic for our result it suffices to note that the assumption fpt implies that finding from graph can not be done in time poly for any function theorem bayesian network structure learning with bounded vertex cover number is in parameter even when restricted to instances with maximum parent set size proof we prove the result by reduction from clique which is known to be we use the same reduction strategy as korhonen and parviainen use in proving that the bounded version of the problem is that is given an instance of clique we construct new instance of bounded vertex cover number bayesian network structure learning as follows the node set of the instance is the parent 
scores are defined by setting fe for each and fv for all other and see figure finally the vertex cover size is required to be at most clearly the new instance can be constructed in polynomial time it now suffices to show that the original graph clique of size if and only if the optimal dag with vertex cover number at most has score assume has let be dag on obtained by setting ae for each and av for all other nodes all edges in the moralised graph ma are now clearly covered by furthermore since is clique in there are nodes with parent set giving assume now that there is dag on with vertex cover number and score there must be at least nodes such that ae as these are the only nodes that can contribute to positive score each of these triangles te for must contain at least two nodes from minimum vertex cover without loss of generality we may assume that these nodes are and as can not cover any other edges however this means that and there are at least edges implying that must be in integer linear programming to complement the combinatorial algorithm of section we will formulate the bounded vertex cover number bayesian network structure learning problem as an integer linear program ilp without loss of generality we may assume that nodes are labeled with integers as basis for the formulation let zsv be binary variable that takes value when is the parent set of and otherwise the objective function for the ilp is max fv zsv to ensure that the variables zsv encode valid dag we use the standard constraints introduced by jaakkola et al and cussens zsv zsv zsv fv now it remains to bound the vertex cover number of the moralised graph we introduce two sets of binary variables the variable yuv takes value if there is an edge between nodes and in the moralised graph and otherwise the variable cu takes value if the node is part of the vertex cover and otherwise by combining construction of the moralised graph and formulation for vertex cover we get the following zsv zt yuv zsv yuw yuv cu cv cu fv yuv cu the constraints and guarantee that encode the moral graph the constraint guarantees that if there is an edge between and in the moral graph then either or is included in the vertex cover finally the constraint bounds the size of the vertex cover experiments we implemented both the combinatorial algorithm of section and the ilp formulation of section to benchmark the practical performance of the algorithms and test how good approximations bounded vertex cover dags provide the combinatorial algorithm was implemented in matlab and is available the ilps were implemented using cplex python api and solved using cplex the implementation is available as part of twilp combinatorial algorithm as the and running time of the combinatorial algorithm are the same we tested it with synthetic data sets varying the number of nodes and the vertex cover bound limiting each run to at most hours the results are shown in figure with reasonable vertex cover number bounds the algorithm scales only up to about nodes this is mainly due to the fact that while the running time is polynomial in the degree of the polynomial depends on and when grows the algorithm becomes quickly infeasible http http time figure running times of the polynomial time algorithm number of nodes varies from to and the vertex cover number from to for and with the algorithm did not finish in hours integer linear program we ran our experiments using union of the data sets used by berg et al and those provided at gobnilp we benchmarked the results against other 
algorithms namely gobnilp for learning bayesian networks without any restrictions to the structure and twilp for learning bounded bayesian networks in our tests each algorithm was given hours of cpu time figure shows results for selected data sets due to space reasons full results are reported in the supplement the results show that optimal dags with moderate vertex cover number for flag for tend to have higher scores than optimal trees this suggests that often one can trade speed for accuracy by moving from trees to bounded vertex cover number dags we also note that bounded vertex cover number dags are usually learned quickly typically at least two faster than bounded dags however bounded dags are less constrained class and thus in multiple cases the best found bounded dag has better score than the corresponding bounded vertex cover number dag even when the bounded dag is not proven to be optimal this seems to be the case also if we have mismatching bound say for and for vertex cover number finally we notice that ilp solves easily problem instances with say nodes and vertex cover bound see the results for data set thus in practice ilp scales up to significantly larger data sets and vertex cover number bounds than the combinatorial algorithm of section presumably this is due to the fact that ilp solvers tend to use heuristics that can quickly prune out provably parts of choices for the vertex cover while the combinatorial algorithm considers them all discussion we have shown that bounded vertex cover number bayesian networks both allow tractable inference and can be learned in polynomial time the obvious point of comparison is the class of trees which has the same properties structurally these two classes are quite different in particular neither is subclass of the other dags with vertex cover number can contain dense substructures while path of nodes which is also tree has vertex cover number in contrast with trees bounded vertex cover number bayesian networks have densely connected core and each node outside the core is either connected to the core or it has no connections thus we would expect them to perform better than trees when the real network has few dense areas and only few connections between nodes outside these areas on the other hand bounding the vertex cover number bounds the total size of the core area which can be problematic especially in large networks when some parts of the network are not represented in the minimum vertex cover [figure panels: abalone scores, abalone running times, flag scores, flag running times; curves: no structure constraints, bounded treewidth, bounded vertex cover] figure results for selected data sets we report the score for the optimal dag without structure constraints and for the optimal dags with bounded and bounded vertex cover when the bound changes as well as the running time required for finding the optimal dag in each case if the computations were not finished at the time limit of hours we show the score of the best dag found so far the shaded area represents the unexplored part of the search space that is the upper bound of the shaded area is the best score upper bound proven by the ilp solver we also note that bounded vertex cover bayesian networks have close connection to naive bayes classifiers that is variables outside vertex cover are conditionally independent of each other given the vertex cover thus we can replace the vertex cover by single variable whose states are cartesian product of the states of the vertex
cover variables this network can then be viewed as naive bayes classifier finally we note some open question related to our current work from theoretical perspective we would like to classify different graph parameters in terms of complexity of learning ideally we would want to have graph parameter that has learning algorithm when we bound the maximum parent set size circumventing the barrier of theorem from practical perspective there is clearly room for improvement in efficiency of our learning algorithm for instance gobnilp uses various optimisations beyond the basic ilp encoding to speed up the search acknowledgments we thank james cussens for fruitful discussions this research was partially funded by the academy of finland finnish centre of excellence in computational inference research coin the experiments were performed using computing resources within the aalto university school of science project references mark bartlett and james cussens advances in bayesian network learning using integer programming in conference on uncertainty in artificial intelligence uai jeremias berg matti järvisalo and brandon malone learning optimal bounded treewidth bayesian networks via maximum satisfiability in international conference on artificial intelligence and statistics aistats david chickering learning bayesian networks is in learning from data artificial intelligence and statistics pages david chickering david heckerman and chris meek learning of bayesian networks is journal of machine learning research chow and liu approximating discrete probability distributions with dependence trees ieee transactions on information theory gregory cooper the computational complexity of probabilistic inference using bayesian belief networks artificial intelligence gregory cooper and edward herskovits bayesian method for the induction of probabilistic networks from data machine learning james cussens bayesian network learning with cutting planes in conference on uncertainty in artificial intelligence uai sanjoy dasgupta learning polytrees in conference on uncertainty in artificial intelligence uai rodney downey and michael fellows parameterized computational feasibility in feasible mathematics ii pages birkhauser rodney downey and michael fellows parameterized complexity gal elidan and stephen gould learning bounded treewidth bayesian networks journal of machine learning research jörg flum and martin grohe parameterized complexity theory david heckerman dan geiger and david chickering learning bayesian networks the combination of knowledge and statistical data machine learning tommi jaakkola david sontag amir globerson and marina meila learning bayesian network structure using lp relaxations in international conference on artificial intelligence and statistics aistats janne korhonen and pekka parviainen learning bounded bayesian networks in international conference on artificial intelligence and statistics aistats johan kwisthout hans bodlaender and van der gaag the necessity of bounded treewidth for efficient inference in bayesian networks in european conference on artificial intelligence ecai chris meek finding path is harder than finding tree journal of artificial intelligence research siqi nie denis deratani maua cassio polpo de campos and qiang ji advances in learning bayesian networks of bounded treewidth in advances in neural information processing systems nips rolf niedermeier invitation to algorithms oxford university press sascha ott and satoru miyano finding optimal gene networks using biological 
constraints genome informatics pekka parviainen hossein shahrabi farahani and jens lagergren learning bounded bayesian networks using integer linear programming in international conference on artificial intelligence and statistics aistats tomi silander and petri myllymäki simple approach for finding the globally optimal bayesian network structure in conference on uncertainty in artificial intelligence uai 
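The structural characterisation that drives the preceding paper's algorithm (a set C covers the moral graph of a DAG if and only if every node outside C has all of its parents inside C and no node has more than one parent outside C) is easy to state in code. The Python sketch below is our own illustration on a dict-of-parent-sets representation; the names moralize, covers_dag and vertex_cover_number are hypothetical, and the brute-force search is only meant for toy DAGs, not a replacement for the dynamic-programming or ILP algorithms of the paper. It builds the moral graph, checks the lemma condition directly, and verifies that the two tests agree.

    from itertools import combinations

    def moralize(parents):
        # undirected edges of the moral graph: every arc made undirected, plus an
        # edge between any two parents of a common child
        edges = set()
        for child, pa in parents.items():
            for u in pa:
                edges.add(frozenset((u, child)))
            for u, w in combinations(sorted(pa), 2):
                edges.add(frozenset((u, w)))
        return edges

    def is_vertex_cover(edges, cover):
        return all(e & cover for e in edges)

    def covers_dag(parents, cover):
        # lemma-style check: cover is a vertex cover of the moral graph iff every
        # node outside it has all parents inside it, and no node has more than
        # one parent outside it
        for v, pa in parents.items():
            outside = [u for u in pa if u not in cover]
            if v not in cover and outside:
                return False
            if len(outside) > 1:
                return False
        return True

    def vertex_cover_number(parents):
        # smallest vertex cover of the moral graph, by brute force over subsets
        nodes = sorted(set(parents) | {u for pa in parents.values() for u in pa})
        edges = moralize(parents)
        for k in range(len(nodes) + 1):
            for cand in combinations(nodes, k):
                if is_vertex_cover(edges, set(cand)):
                    return k
        return len(nodes)

    # toy dag: a and b are parents of c, b is the only parent of d
    parents = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"b"}}
    assert covers_dag(parents, {"a", "b"}) == is_vertex_cover(moralize(parents), {"a", "b"})
    assert vertex_cover_number(parents) == 2

In this toy example the moral graph contains the triangle on a, b, c (because a and b share the child c) plus the edge between b and d, so the smallest vertex cover has size two, matching the lemma-based check.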
learning poisson dag models based on overdispersion scoring gunwoong park department of statistics university of madison wi parkg garvesh raskutti department of statistics department of computer science wisconsin institute for discovery optimization group university of madison wi raskutti abstract in this paper we address the question of identifiability and learning algorithms for poisson directed acyclic graphical dag models we define general poisson dag models as models where each node is poisson random variable with rate parameter depending on the values of the parents in the underlying dag first we prove that poisson dag models are identifiable from observational data and present algorithm that learns the poisson dag model under suitable regularity conditions the main idea behind our algorithm is based on overdispersion in that variables that are conditionally poisson are overdispersed relative to variables that are marginally poisson our algorithms exploits overdispersion along with methods for learning sparse poisson undirected graphical models for faster computation we provide both theoretical guarantees and simulation results for both small and dags introduction modeling multivariate count data is an important challenge that arises in numerous applications such as neuroscience systems biology and amny others one approach that has received significant attention is the graphical modeling framework since graphical models include broad class of dependence models for different data types broadly speaking there are two sets of graphical models undirected graphical models or markov random fields and directed acyclic graphical dag models or bayesian networks between undirected graphical models and dags undirected graphical models have generally received more attention in the data setting since both learning and inference algorithms scale to larger datasets in particular for multivariate count data yang et al introduce undirected poisson graphical models yang et al define undirected poisson graphical models so that each node is poisson random variable with rate parameter depending only on its neighboring nodes in the graph as pointed out in yang et al one of the major challenges with poisson undirected graphical models is ensuring global normalizability directed acyclic graphs dags or bayesian networks are different class of generative models that model directional or causal relationships see for details such directional relationships naturally arise in most applications but are difficult to model based on observational data one of the benefits of dag models is that they have straightforward factorization into conditional distributions and hence no issues of normalizability arise as they do for undirected graphical models as mentioned earlier however number of challenges arise that make learning dag models often impossible for large datasets even when variables have natural causal or directional structure these issues are identifiability since inferring causal directions from data is often not possible computational complexity since it is often computationally infeasible to search over the space of dags sample size guarantee since fundamental identifiability assumptions such as faithfulness are often required extremely large sample sizes to be satisfied even when the number of nodes is small see in this paper we define poisson dag models and address these issues in section we prove that poisson dag models are identifiable and in section we introduce dag learning algorithm for poisson dags 
which we call overdispersion scoring ods the main idea behind proving identifiability is based on the overdispersion of variables that are conditionally poisson but not marginally poisson using overdispersion we prove that it is possible to learn the causal ordering of poisson dags using algorithm and once the ordering is known the problem of learning dags reduces to simple set of neighborhood regression problems while overdispersion with conditionally poisson random variables is phenomena that is exploited in many applications see using overdispersion has never been exploited in dag model learning to our knowledge statistical guarantees for learning the causal ordering are provided in section and we provide numerical experiments on both small dags and dags with up to nodes our theoretical guarantees prove that even in the setting where the number of nodes is larger than the sample size it is possible to learn the causal ordering under the assumption that the degree of the moralized graph of the dag has small degree our numerical experiments support our theoretical results and show that our ods algorithm performs well compared to other dag learning methods our numerical experiments confirm that our ods algorithm is one of the few algorithms that performs well in terms of statistical and computational complexity in the setting poisson dag models in this section we define general poisson dag models dag consists of set of vertices and set of directed edges with no directed cycle we usually set and associate random vector xp with probability distribution over the vertices in directed edge from vertex to is denoted by or the set pa of parents of vertex consists of all nodes such that one of the convenient properties of dag models is that the joint distribution xp factorizes in terms of the conditional distributions as follows xp fj xj where fj xj refers to the conditional distribution of node xj in terms of its parents the basic property of poisson dag models is that each conditional distribution fj xj has poisson distribution more precisely for poisson dag models xj poisson gj xpa where gj is an arbitrary function of xpa to take concrete example gj can represent the link function for the univariate poisson generalized linear model glm or gj xpa exp θj θjk xk where θjk represent the linear weights using the factorization the overall joint distribution is θj θjk xk xp exp θj xj θjk xk xj log xj to contrast this formulation with the poisson undirected graphical model in yang et al the joint distribution for undirected graphical models has the form xp exp θj xj θjk xk xj log xj where is the function or the log of the normalization constant while the two forms and look quite similar the key difference is the normalization constant of in as θj pa θkj xk in which depends on to ensure the undirected opposed to the term graphical model representation in is valid distribution must be finite which guarantees the distribution is normalizable and yang et al prove that is normalizable if and only if all values are less than or equal to identifiability in this section we prove that poisson dag models are identifiable under very mild condition in general dag models can only be defined up to their markov equivalence class see however in some cases it is possible to identify the dag by exploiting specific properties of the distribution for example peters and prove that for gaussian dags based on structural equation models with known or the same variance the models are identifiable shimizu et al prove 
identifiability for linear structural equation models and peters et al prove identifiability of structural equation models with additive independent noise here we show that poisson dag models are also identifiable using the idea of overdispersion to provide intuition we begin by showing the identifiability of poisson dag model the basic idea is that the relationship between nodes and generates the overdispersed child variable to be precise consider all three models poisson poisson where and are independent poisson and poisson and poisson and poisson our goal is to determine whether the underlying dag model is or figure directed graphs of and now we exploit the fact that for poisson random variable var while for distribution which is conditionally poisson the variance is overdispersed relative to the mean hence for var and var for var while var var var var as long as var similarly under var and var as long as var hence we can identify model and by testing whether the variance is greater than the expectation or equal to the expectation with finite sample size the quantities and var can be estimated from data and we consider the finite sample setting in section and now we extend this idea to provide an identifiability condition for general poisson dag models the key idea to extending identifiability from the bivariate to multivariate scenario involves condition on parents of each node and then testing overdispersion the general result is as follows theorem assume that for any pa and var gj xpa the poisson dag model is identifiable we defer the proof to the supplementary material once again the main idea of the proof is overdispersion to explain the required assumption note that for any and pa var xj xj var gj xpa note that if pa or var gj xpa otherwise var gj xpa by our assumption gm figure moralized graph gm for dag algorithm our algorithm which we call overdispersion scoring ods consists of three main steps estimating candidate parents set using existing learning undirected graph algorithms estimating causal ordering using overdispersion scoring and estimating directed edges using standard regression algorithms such as lasso steps is standard problem in which we use algorithms step allows us to reduce both computational and sample complexity by exploiting sparsity of the moralized or undirected graphical model representation of the dag which we inroduce shortly step exploits overdispersion to learn causal ordering an important concept we need to introduce for step of our algorithm is the moral graph or undirected graphical model representation of the dag see the moralized graph gm for dag is an undirected graph where gm eu where eu includes edge set without directions plus edges between any nodes that are parents of common child fig demonstrates concepts of moralized graph for simple example where for dag note that are parents of common child hence eu where the additional edge arises from the fact that nodes and are both parents of node further define or eu denote the neighborhood set of node in the moralized graph gm let denote samples drawn from the poisson dag model let be bijective function corresponding to permutation or causal ordering we will also use the convenient notation to denote an estimate based on the data for ease of notation for any and let and xs represent xj and xj xs respectively furthermore let and xs denote pn var xj and var xj xs respectively we also define xs xs xs and ns xs xs xs for an arbitrary the computation of the score sbjk in step of our ods algorithm 
involves the following equation sbjk bjk jk ncbjk bjk refers to an estimated candidate set of parents specified in step of our ods algorithm where bjk so that we ensure we have enough and bjk bjk bjk samples for each element we select in addition is tuning parameter of our algorithm that we specify in our main theorem and our numerical experiments we can use number of standard algorithms for step of our ods algorithm since it boils down to finding candidate set of parents the main purpose of step is to reduce both computational complexity and the sample complexity by exploiting sparsity in the moralized graph in step candidate set of parents is generated for each node which in principle could be the entire set of nodes however since step requires computation of conditional mean and variance both the sample complexity and computational complexity depend significantly on the number of variables we condition on as illustrated in section and hence by making the set of candidate parents for each node as small as possible we gain significant computational and statistical improvements by exploiting the graph structure similar step is taken in the mmhc and sc algorithms the way we choose candidate set of parents is by learning the moralized graph gm and then using the neighborhood set for each hence step reduces to standard undirected graphical model learning algorithm number of choices are available for step including the neighborhood regression approach of yang et al as well as standard dag learning algorithms which find candidate parents set such as hiton and mmpc algorithm overdispersion scoring ods input samples from the given poisson dag model output causal ordering np and graph structure bu corresponding to the moralized graph with step estimate the undirected edges neighborhood set step estimate causal ordering using overdispersion score for do bi sbi end the first element of causal ordering arg minj sbj for do for do the candidate parents set cjk calculate sbjk using end the th element of causal ordering bj arg mink sbjk bj step estimate directed edges toward bj denoted by end the pth element of the causal ordering bp bp the directed edges toward bp denoted by πp return the estimated causal ordering bp return the estimated edge structure step learns the causal ordering by assigning an overdispersion score for each node the basic idea is to determine which nodes are overdispersed based on the sample conditional mean and conditional variance the causal ordering is determined one node at time by selecting the node with the smallest overdispersion score which is representative of node that is least likely to be conditionally poisson and most likely to be marginally poisson finding the causal ordering is usually the most challenging step of dag learning since once the causal ordering is learnt all that remains is to find the edge set for the dag step the final step finds the directed edge set of the dag by finding the parent set of each node using steps and finding the parent set of node boils down to selecting which variables are parents out of the candidate parents of node generated in step intersected with all elements before node of the causal ordering in step hence we have regression variable selection problems which can be performed using glmlasso as well as standard dag learning algorithms computational complexity steps and use existing algorithms with known computational complexity clearly the computational complexity for steps and depend on the choice of algorithm for example if we 
use the neighborhood selection glmlasso algorithm as is used in yang et al the complexity is min np for single lasso run but since there are nodes the total complexity is min similarly if we use glmlasso for step the computational complexity is also min as we show in numerical experiments algorithms for step tend to run more slowly than neighborhood regression based on glmlasso for step where we estimate the causal ordering has iterations and each iteration has number of overdispersion scores sbj and sbjk computed which is bounded by where is set of candidates of each element of causal ordering which is also bounded by the maximum degree of the moralized graph hence the total number of overdispersion scores that need to be computed is pd since the time for calculating each overdispersion score which is the difference between conditional variance and expectation is proportional to the time complexity is npd in worst case where the degree of the moralized graph is the computational complexity of step is as we discussed earlier there is significant computational saving by exploiting sparse moralized graph which is why we perform step of the algorithm hence steps and are the main computational bottlenecks of our ods algorithm the addition of step which estimates the causal ordering does not significantly add to the computational bottleneck consequently our ods algorithm which is designed for learning dags is almost as computationally efficient as standard methods for learning undirected graphical models statistical guarantees in this section we show consistency of recovering valid causal ordering recovery of our ods algorithm under suitable regularity conditions we begin by stating the assumptions we impose on the functions gj assumption for all pa and all there exists an such that var gj xpa for all there exists an such that exp gj xpa is stronger version of the identifiability assumption in var gj xpa where since we are in the finite sample setting we need the conditional variance to be lower bounded by constant bounded away from is condition on the tail behavior of gj pa for controlling tails of the score sbjk in step of our ods algorithm to take concrete example for which and are satisfied it is straightforward to show that the glm dag model with values of θkj satisfies both and the constraint on the is sufficient but not necessary and ensures that the parameters do not grow too large now we present the main result under assumptions and for general dags the true causal ordering is not unique therefore let denote all the causal orderings that are consistent with the true dag further recall that denotes the maximum degree of the moralized graph theorem recovery of causal ordering consider poisson dag model as specified in with set of true causal orderings and the rate function gj satisfies assumptions if the sample size threshold parameter then there exist positive constants such that exp log max we defer the proof to the supplementary material the main idea behind the proof uses the overdispersion property exploited in theorem in combination with concentration bounds that exploit assumption note once again that the maximum degree of the undirected graph plays an important role in the sample complexity which is why step is so important this is because the size of the conditioning set depends on the degree of the moralized graph hence plays an important role in both the sample complexity and computational complexity theorem can be used in combination with sample complexity guarantees for 
steps and is the true dag with high probability of our ods algorithm to prove that our output dag sample complexity guarantees for steps and depend on the choice of algorithm but for neighborhood regression based on the glmlasso provided log steps and should be consistent for theorem if the triple satisfies log then our ods algorithm recovers the true dag hence if the moralized graph is sparse ods recovers the true dag in the highdimensional setting dag learning algorithms that apply to the setting are not common since they typically rely on faithfulness or similar assumptions or other restrictive conditions that are not satisfied in the setting note that if the dag is not sparse and our sample complexity is extremely large when is large this makes intuitive sense since if the number of candidate parents is large we would need to condition on large set of variables which is very our sample complexity is certainly not optimal since the choice of tuning parameter determining optimal sample complexity remains an open question causal ordering sample size sample size sample size causal ordering for large dags causal ordering accuracy accuracy causal ordering sample size figure accuracy rates of successful recovery for causal ordering via our ods algorithm using different base algorithms the larger sample complexity of our ods algorithm relative to undirected graphical models learning is mainly due to the fact that dag learning is an intrinsically harder problem than undirected graph learning when the causal ordering is unknown furthermore note that theorem does not require any additional identifiability assumptions such as faithfulness which severely increases the sample complexity for dags numerical experiments in this section we support our theoretical results with numerical experiments and show that our ods algorithm performs favorably compared to dag learning methods the simulation study was conducted using realizations of random poisson dag that was generated as follows the gj functions for the general poissonpdag model was chosen using the standard glm link function xpa exp θj θjk xk resulting in the glm dag model we experimented with other choices of gj but only present results for the glm dag model note that our ods algorithm works well as long as assumption is satisfied regardless of choices of gj in all results presented θjk parameters were chosen uniformly at random in the range θjk although any values far from zero and satisfying the assumption work well in fact smaller values of θjk are more favorable to our ods algorithm than dag learning methods because of weak dependency between nodes dags are generated randomly with fixed unique causal ordering with edges randomly generated while respecting desired maximum degree constraints for the dag in our experiments we always set the thresholding constant although any value below seems to work well in fig we plot the proportion of simulations in which our ods algorithm recovers the correct causal ordering in order to validate theorem all graphs in fig have exactly parents for each node and we plot how the accuracy in recovering the true varies as function of for and for different node sizes and as we can see even when our ods algorithm recovers the true causal ordering about of the time even when is approximately and for smaller dags accuracy is in each different algorithms are used for step glmlasso where we choose mmpc with and hiton again with and an oracle where the edges for the true moralized graph is used as fig shows the glmlasso 
seems to be the best performing algorithm in terms of recovery so we use the glmlasso for steps and for the remaining figures glmlasso was also the only algorithm that scaled to the setting however it should be pointed out that glmlasso is not necessarily consistent and it is highly depending on the choice of gj recall that the degree refers to the maximum degree of the moralized dag fig provides comparison of how our ods algorithm performs in terms of hamming distance compared to the pc mmhc ges and sc algorithms for the pc mmhc and sc algorithms we use while for the ges algorithm we use the mbde modified bayesian dirichlet equivalent score since it performs better than other score choices we consider node sizes of in and and in and since many of these algorithms do not easily scale to larger node sizes we consider two hamming distance measures in and we only measure the hamming distance to the skeleton of the true dag which is the set of edges of the dag without directions for and we measure the hamming distance for directed edges sample size normalized hamming dist normalized hamming dist skeletons skeletons sample size directed edges sample size sample size figure comparison of our ods algorithm black and pc ges mmhc sc algorithms in terms of hamming distance to skeletons and directed edges the edges with directions the reason we consider the skeleton is because the pc does not recover all directions of the dag we normalize the hamming distance by dividing by the total number of edges and respectively so that the overall score is percentage as we can see our ods algorithm significantly the other algorithms we can also see that as the sample size grows our algorithm recovers the true dag which is consistent with our theoretical results it must be pointed out that the choice of dag model is suited to our ods algorithm while these algorithms apply to more general classes of dag models now we consider the statistical performance for dags fig plots the statistical performance of ods for dags in terms of recovering the causal ordering hamming distance to the true skeleton hamming distance to the true dag with directions all graphs in fig have exactly parents for each node and accuracy varies as function of for and for different node sizes fig shows that our ods algorithm accurately recovers the causal ordering and true dag models even in high dimensional setting supporting our theoretical results causal ordering for large dags accuracy normalized hamming dist fig shows of our ods algorithm we measure the running time by varying node size from to with the fixed and parents sample size from to with the fixed and parents the number of parents of each node from to with the fixed and fig and support the section where the time complexity of our ods algorithm is at most fig shows running time is proportional to parents size which is minimum degree of graph it agrees with the time complexity of step of our ods algorithm is npd we can also see that the glmlasso has the fastest amongst all algorithms that determine the candidate parent set skeletons for large dags directed edges for large dags sample size sample size sample size figure performance of our ods algorithm for dags with time complexity time complexity time complexity running time sec node size sampe size parents size figure time complexity of our ods algorithm with respect to node size sample size and parents size references yang allen liu and ravikumar graphical models via generalized linear models in advances in neural information 
processing systems pp bonissone henrion kanal and lemmer equivalence and synthesis of causal models in uncertainty in artificial intelligence vol spirtes glymour and scheines causation prediction and search mit press lauritzen graphical models oxford university press chickering learning bayesian networks is in learning from data springer pp uhler raskutti yu et geometry of the faithfulness assumption in causal inference the annals of statistics vol no pp dean testing for overdispersion in poisson and binomial regression models journal of the american statistical association vol no pp zheng salganik and gelman how many people do you know in prison using overdispersion in count data to estimate social structure in networks journal of the american statistical association vol no pp peters and identifiability of gaussian structural equation models with equal error variances biometrika shimizu hoyer and kerminen linear acyclic model for causal discovery the journal of machine learning research vol pp peters mooij janzing et identifiability of causal graphs using functional models arxiv preprint tsamardinos brown and aliferis the bayesian network structure learning algorithm machine learning vol no pp aliferis tsamardinos and statnikov hiton novel markov blanket algorithm for optimal variable selection in amia annual symposium proceedings vol american medical informatics association cowell dawid lauritzen and spiegelhalter probabilistic networks and expert systems tsamardinos and aliferis towards principled feature selection relevancy filters and wrappers in proceedings of the ninth international workshop on artificial intelligence and statistics morgan kaufmann publishers key west fl usa friedman nachman and learning bayesian network structure from massive datasets the sparse candidate algorithm in proceedings of the fifteenth conference on uncertainty in artificial intelligence morgan kaufmann publishers pp friedman hastie and tibshirani glmnet lasso and regularized generalized linear models package version vol chickering optimal structure identification with greedy search the journal of machine learning research vol pp heckerman geiger and chickering learning bayesian networks the combination of knowledge and statistical data machine learning vol no pp 
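To make the overdispersion-scoring idea of this paper concrete, the sketch below is a minimal illustration (not the authors' implementation): it samples a toy GLM Poisson DAG and recovers a causal ordering by greedily picking, at each step, the node whose conditional variance is closest to its conditional mean. For simplicity it skips Step 1 and Step 3 of ODS, uses every previously ordered node as the conditioning set, and conditions by grouping on exact parent configurations, which is only practical for small graphs and large samples. Function names and parameter values are illustrative.

import numpy as np

def sample_glm_poisson_dag(n, intercepts, weights, order, seed=0):
    # draw n samples from a GLM Poisson DAG: node j is Poisson with rate
    # exp(intercepts[j] + sum_k weights[j, k] * x_k), the sum running over its parents
    rng = np.random.default_rng(seed)
    X = np.zeros((n, len(intercepts)))
    for j in order:                              # parents always precede children
        X[:, j] = rng.poisson(np.exp(intercepts[j] + X @ weights[j]))
    return X

def overdispersion_score(X, j, cond_set, min_count=10):
    # (conditional variance - conditional mean) of X_j, averaged over the observed
    # configurations of the conditioning set; a node that is Poisson given cond_set
    # scores near zero, a node with un-conditioned parents is overdispersed
    if not cond_set:
        return X[:, j].var() - X[:, j].mean()
    _, groups = np.unique(X[:, cond_set], axis=0, return_inverse=True)
    vals, wts = [], []
    for g in range(groups.max() + 1):
        xj = X[groups == g, j]
        if len(xj) >= min_count:                 # keep well-populated configurations only
            vals.append(xj.var() - xj.mean())
            wts.append(len(xj))
    return np.average(vals, weights=wts) if vals else np.inf

def ods_ordering(X):
    # greedy causal ordering: repeatedly take the least overdispersed node,
    # conditioning on all nodes already placed in the ordering
    order, remaining = [], list(range(X.shape[1]))
    while remaining:
        scores = {j: overdispersion_score(X, j, order) for j in remaining}
        nxt = min(scores, key=scores.get)
        order.append(nxt)
        remaining.remove(nxt)
    return order

# toy chain 0 -> 1 -> 2 with hypothetical parameters
W = np.zeros((3, 3)); W[1, 0] = 0.3; W[2, 1] = 0.3
X = sample_glm_poisson_dag(50000, np.array([0.5, 0.2, 0.2]), W, order=[0, 1, 2])
print(ods_ordering(X))                           # expected to recover [0, 1, 2]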
training restricted boltzmann machines via the free energy marylou eric tramel florent krzakala laboratoire de physique statistique umr cnrs normale pierre et marie curie paris france abstract restricted boltzmann machines are undirected neural networks which have been shown to be effective in many applications including serving as initializations for training deep neural networks one of the main reasons for their success is the existence of efficient and practical stochastic algorithms such as contrastive divergence for unsupervised training we propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the approach we demonstrate that our algorithm provides performance equal to and sometimes superior to persistent contrastive divergence while also providing clear and easy to evaluate objective function we believe that this strategy can be easily generalized to other models as well as to more accurate approximations paving the way for systematic improvements in training boltzmann machines with hidden units introduction restricted boltzmann machine rbm is type of undirected neural network with surprisingly many applications this model has been used in problems as diverse as dimensionality reduction classification collaborative filtering feature learning and topic modeling also quite remarkably it has been shown that generative rbms can be stacked into neural networks forming an initialization for deep network architectures such deep architectures are believed to be crucial for learning representations and concepts although the amount of training data available in practice has made pretraining of deep nets dispensable for supervised tasks rbms remain at the core of unsupervised learning key area for future developments in machine intelligence while the training procedure for rbms can be written as maximization an exact implementation of this approach is computationally intractable for all but the smallest models however fast stochastic monte carlo methods specifically contrastive divergence cd and persistent cd pcd have made rbm training both practical and efficient these methods have popularized rbms even though it is not entirely clear why such approximate methods should work as well as they do in this paper we propose an alternative deterministic strategy for training rbms and neural networks with hidden units in general based on the and extended methods of statistical mechanics this strategy has been used to train neural networks in number of earlier works in fact for entirely visible networks the use of adaptive cluster expansion methods has lead to spectacular results in learning boltzmann machine representations however unlike these fully visible models the hidden units of the rbm must be taken into account during the training procedure in welling and hinton presented similar deterministic learning algorithm for general boltzmann machines with hidden units considering it priori as potentially efficient extension of cd in tieleman tested the method in detail for rbms and found it provided poor performance when compared to both cd and pcd in the wake of these two papers little inquiry has been made in this direction with the apparent consensus being that the deterministic approach is ineffective for rbm training our goal is to challenge this consensus by going beyond mean field mere approximation by introducing and possibly order terms in principle it is even possible to extend the approach to arbitrary order using 
this extended approximation commonly known as the approach in statistical physics we find that rbm training performance is significantly improved over the approximation and is even comparable to pcd the clear and easy to evaluate objective function along with the extensible nature of the approximation paves the way for systematic improvements in learning efficiency training restricted boltzmann machines restricted boltzmann machine which can be viewed as two layer undirected bipartite neural network is specific case of an energy based model wherein layer of visible units is fully connected to layer of hidden units let us denote the binary visible and hidden units indexed by and respectively as vi and hj the energy of given state vi hj of the rbm is given by ai vi bj hj vi wij hj where wij are the entries of the matrix specifying the weights or couplings between the visible and hidden units and ai and bj are the biases or the external fields in the language of statistical physics of the visible and hidden units respectively thus the set of parameters wij ai bj defines the rbm model the joint probability distribution over the visible pand hidden units is given by the measure where is the normalization constant known as the partition function in physics for given data point represented by the marginal of the rbm is calculated as writing this marginal of in terms of its results in the difference ln where ln is the free energy of the rbm and ln can be interpreted as free energy as well but with visible units fixed to the training data point hence is referred to as the clamped free energy one of the most important features of the rbm model is that can be easily computed as may be summed out analytically since the hidden units are conditionally independent of the visible units owing to the rbm bipartite structure however calculating is computationally intractable since the number of possible states to sum over scales combinatorially with the number of units in the model this complexity frustrates the exact computation of the gradients of the needed in order to train the rbm parameters via gradient ascent monte carlo methods for rbm training rely on the observation that vi hj which can be simulated at ij lower computational cost nevertheless drawing independent samples from the model in order to approximate this derivative is itself computationally expensive and often approximate sampling algorithms such as cd or pcd are used instead extended mean field theory of rbms here we present tractable estimation of the free energy of the rbm this approximation is based on high temperature expansion of the free energy derived by georges and yedidia in the context of spin glasses following the pioneering works of we refer the reader to for review of this topic to apply the expansion to the rbm free energy we start with general energy based model which possesses arbitrary couplings wij between undifferentiated binary spins si such that the energy of measure on the configuration si is defined by ai si wij si sj we also restore the role of the temperature usually considered constant and for simplicity set to in most energy based models by multiplying the energy functional in the boltzmann weight by the inverse temperature next we apply legendre transform to the free energy standard procedure in statistical physics by first writingp the free energypas function of newly introduced auxiliary external field qi ln qi si this external field will be eventually set to the value in order to recover the true free energy 
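As a small illustration of why the clamped term is tractable, the sketch below (sizes and parameters are made up for the example) sums the binary hidden units out analytically; the remaining piece, F = -ln Z, is exactly what the extended mean-field expansion developed next is meant to approximate.

import numpy as np

def clamped_free_energy(v, W, a, b):
    # F(v) = -ln sum_h exp(-E(v, h)); with binary hidden units the sum over h
    # factorizes, giving F(v) = -a.v - sum_j ln(1 + exp(b_j + (v W)_j))
    return -(v @ a) - np.sum(np.logaddexp(0.0, b + v @ W))

# tiny RBM with hypothetical sizes and random parameters
rng = np.random.default_rng(0)
nv, nh = 6, 4
W = 0.1 * rng.standard_normal((nv, nh))
a, b = 0.05 * rng.standard_normal(nv), 0.05 * rng.standard_normal(nh)
v = rng.integers(0, 2, size=nv).astype(float)
print(clamped_free_energy(v, W, a, b))   # ln p(v) = F - F(v), with F = -ln Z the intractable part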
the legendre transform is then given as function of the conjugate variable mi by maximizing over max qi mi mi where the maximizing auxiliary field function of the conjugate variables is the inverse df function of df dq since the derivative dq is exactly equal to where the operator refers to the average configuration under the boltzmann measure the conjugate variable is in fact the equilibrium magnetization vector hsi finally we observe that the free energy is also the inverse lengendre transform of its legendre transform at min where minimizes which yields an expression of the free energy in terms of the magnetization vector following this formulation allows us to perform high temperature expansion of around at fixed where the dependence on of the product βq must carefully be taken into account at infinite temperature the spins decorrelate causing the average value of an arbitrary product of spins to equal the product of their local magnetizations useful property accounting for binary spins taking values in one obtains the following expansion mi ln mi mi ln mi ai mi wij mi mj wij mi mj wij mi mi mi mj mj mj wij wjk wki mi mj mk the term corresponds to the entropy of spins with constrained magnetizations values taking this expansion up to the term we recover the standard theory the term is known as the onsager reaction term in the tap equations the higher orders terms are systematic corrections which were first derived in returning to the rbm notation and truncating the expansion at for the remainder of the theoretical discussion we have mv mh mv mh ai mvi bj mhj the notation and mvi mvi mhj mhj wij mvi mhj refers to the sum over the distinct pairs and triplets of spins respectively where is the entropy contribution mv and mh are introduced to denote the magnetization of the visible and hidden units and is set equal to eq can be viewed as weak coupling expansion in wij to recover an estimate of the rbm free energy eq must be minimized with respect to dγ its arguments as in eq lastly by writing the stationary condition dm we obtain the selfconsistency constraints on the magnetizations at we obtain the following constraint on the visible magnetizations mvi sigm wij mhj mvi mhj mhj where sigm is logistic sigmoid function similar constraint must be satisfied for the hidden units as well clearly the stationarity condition for obtained at order utilizes terms up to the nth order within the sigmoid argument of these consistency relations whatever the order of the approximation the magnetizations are the solutions of set of coupled equations of the same cardinality as the number of units in the model finally provided we can define procedure to efficiently derive the value of the magnetizations satisfying these constraints we obtain an extended approximation of the free energy which we denote as emf rbm evaluation and unsupervised training with emf an iteration for calculating emf recalling the of the rbm we have shown that tractable approximation of emf is obtained via weak coupling expansion so long as one can solve the coupled system of equations over the magnetizations shown in eq in the spirit of iterative belief propagation we propose that these relations can serve as update rules for the magnetizations within an iterative algorithm in fact the convergence of this procedure has been rigorously demonstrated in the context of random spin glasses we expect that these convergence properties will remain present even for real data the iteration over the relations for both the hidden and visible 
magnetizations can be written using the time index as mj sigm bj wij mi wij mj mi mi mvi sigm mhj mhj wij mhj mvi where the time indexing follows from application of the values of mv and mh minimizing mv mh and thus providing the value of emf are obtained by running eqs until they converge to fixed point we note that while we present an iteration to find emf up to above terms can easily be introduced into the procedure deterministic emf training by using the emf estimation of and the iterative algorithm detailed in the previous section to calculate it it is now possible to estimate the gradients of the used for unsupervised training of the rbm model by substituting with emf we note that the deterministic iteration we propose for estimating is in stark contrast with the stochastic sampling procedures utilized in cd and pcd to the same end the gradient ascent update of weight wij is approximated as emf emf where can be computed by differentiating eq at fixed and and computing the value of this derivative at the fixed points of eqs obtained from the iterative procedure the emf gradients with respect to the visible and hidden biases can be derived similarly interestingly emf and are merely the magnetizations of the visible and hidden units mvi and mhj respectively priori the training procedure sketched above can be used at any order of the weak coupling expansion the training algorithm introduced in which was shown to perform poorly for rbm training in can be recovered by retaining only the of the expansion when calculating emf taking emf to we expect that training efficiency and performance will be greatly improved over in fact including the term in the training algorithm is just as easy as including the one due to the fact that the particular structure of the rbm model does not admit triangles in its corresponding factor graphs although the term in eq does include sum over distinct pairs of units as well as sum over coupled triplets of units such triplets are excluded by the bipartite structure of the rbm however coupled quadruplets do contribute to the term and therefore and approximations require much more expensive computations though it is possible to utilize adaptive procedures numerical experiments experimental framework to evaluate the performance of the proposed deterministic emf rbm training we perform number of numerical experiments over two separate datasets and compare these results with both and pcd we first use the mnist dataset of labeled handwritten digit images the dataset is split between training images and test images both subsets contain approximately the same fraction of the ten digit classes to each image is comprised of pixels taking values in the range the mnist dataset was binarized by setting all pixels to in all experiments second we use the pixel version of the caltech silhouette dataset constructed from the caltech image dataset the silhouette dataset consists of black regions of the primary foreground scene objects on white background the images are labeled according to the object in the original picture of which there are unevenly represented object labels the dataset is split between training images validation images and test images sets for both datasets the rbm models require visible units following previous studies evaluating rbms on these datasets we fix the number of rbm hidden units to in all our experiments during training we adopt the learning procedure for gradient averaging with training points per batch for mnist and training points per batch for 
caltech silhouette we test the emf learning algorithm presented in section in various settings first we compare implementations utilizing the mf and approximations of higher orders were not considered due to their greater complexity next we investigate training quality when the relations on the magnetizations were not converged when calculating the derivatives of emf instead iterated for small fixed number of times an approach similar to cd furthermore we also evaluate persistent version of our algorithm similar to as in pcd the iterative emf procedure possesses multiple initializationdependent magnetizations converging multiple chains allows us to collect proper statistics on these basins of attraction in this implementation the magnetizations of set of points dubbed fantasy particles are updated and maintained throughout the training in order to estimate this persistent procedure takes advantage of the fact that the boltzmann measure changes only slightly between parameter updates convergence to the new fixed point magnetizations at each minibatch should therefore be sped up by initializing with the converged state from the previous update our final experiments consist of persistent training algorithms using iterations of the magnetization relations and and one persistent training algorithm using iterations for comparison for comparison we also train rbm models using following the prescriptions of and pcd as implemented in given that our goal is to compare rbm training approaches rather than achieving the best possible training across all free parameters neither momentum nor adaptive learning rates were included in any of the implementations tested however we do employ weight available as julia package at https pcd lem pseudo epoch epoch figure estimates of the over the mnist test set normalized by the total number of units as function of the number of training epochs the results for the different training algorithms are plotted in different colors with the same color code used for both panels left panel pseudo estimate the difference between emf algorithms and contrastive divergence algorithms is minimal right panel emf estimate at order the improvement from mf to tap is clear perhaps reasonably tap demonstrates an advantage over cd and pcd notice how the emf approximation of provides less noisy estimates at lower computational cost decay regularization in all our trainings to keep weights small necessity for the weak coupling expansion on which the emf relies when comparing learning procedures on the same plot all free parameters of the training learning rate weight decay etc were set identically all results are presented as averages over independent trainings with standard deviations reported as error bars relevance of the emf our first observation is that the implementations of the emf training algorithms are not overly belabored the free parameters relevant for the pcd and procedures were found to be equally well suited for the emf training algorithms in fact as shown in the left panel of fig and the right inset of fig the ascent of the pseudo over training epochs is very similar between the emf training methods and both the and pcd trainings interestingly for the caltech silhouettes dataset it seems that the persistent algorithms tested have difficulties in ascending the in the first epochs of training this contradicts the common belief that persistence yields more accurate approximations of the likelihood gradients the complexity of the training set classes unevenly represented 
over only training points might explain this unexpected behavior the persistent fantasy particles all converge to similar noninformative blurs in the earliest training epochs with many epochs being required to resolve the particles to distribution of values which are informative about the pseudo examining the fantasy particles also gives an idea of the performance of the rbm as generative model in fig randomly chosen fantasy particles from the epoch of training with pcd and are displayed the rbm trained with pcd generates recognizable digits yet the model seems to have trouble generating several digit classes such as and the fantasy particles extracted from training are of poorer quality with half of the drawn particles featuring digits the algorithm however appears to provide qualitative improvements all digits can be visually discerned with visible defects found only in two of the particles these particles seem to indicate that it is indeed possible to efficiently persistently train an rbm without converging on the fixed point of the magnetizations the relevance of the emf for rbm training is further confirmed in the right panel of fig where we observe that both and pcd ascend the emf even though they are not explicitly constructed to optimize over this objective as expected the persistent algorithm with iterations of the magnetizations achieves the best maximization of lem however with only iterations of the magnetizations achieves very similar performance perhaps making it preferable when faster training algorithm is desired figure fantasy particles generated by hidden unit rbm after epochs of training on the mnist dataset with pcd top two rows middle two rows and bottom two rows these fantasy particles represent typical samples generated by the trained rbm when used as generative prior for handwritten numbers the samples generated by are of similar subjective quality and perhaps slightly preferable to those generated by pcd while certainly preferable to those generated by moreover we note that although demonstrates improvements with respect to the the does not yield significantly better results than this is perhaps not surprising since the third order term of the emf expansion consists of sum over as many terms as the second order but at smaller order in wij lastly we note the computation times for each of these approaches for julia implementation of the tested rbm training techniques running on ghz intel processor we report the trial average wall times for fitting single batch normalized against the model complexity pcd which uses only single sampling step required the three emf techniques and each of which use magnetization iterations required and respectively if fewer magnetization iterations are required as we have empirically observed in limited tests then the run times of the and approaches are commesurate with pcd classification task performance we also evaluate these rbm training algorithms from the perspective of supervised classification an rbm can be interpreted as deterministic function mapping the binary visible unit values to the hidden unit magnetizations in this case the hidden unit magnetizations represent the contributions of some learned features although no supervised of the weights is implemented we tested the quality of the features learned by the different training algorithms by their usefulness in classification tasks for both datasets logistic regression classifier was calibrated with the hidden units magnetizations mapped from the labeled training images 
using the toolbox we purposely avoid using more sophisticated classification algorithms in order to place emphasis on the quality of the rbm training not the classification method in fig we see that the mnist classification accuracy of the rbms trained with the algorithms is roughly equivalent with that obtained when using pcd training while training yields markedly poorer classification accuracy the slight decrease in performance of and along as the training epochs increase might be emblematic of by the algorithms although no decrease in the emf test set was observed finally for the caltech silhouettes dataset the classification task shown in the right panel of fig is much more difficult priori interestingly the persistent algorithms do not yield better results on this task however we observe that the performance of deterministic emf rbm training is at least comparable with both and pcd mnist caltech silhouette pcd direct epoch pseudo classification accuracy classification accuracy epoch epoch figure test set classification accuracy for the mnist left and caltech silhouette right datasets using logistic regression on the marginal probabilities as function of the number of epochs as baseline comparison the classification accuracy of logistic regression performed directly on the data is given as black dashed line the results for the different training algorithms are displayed in different colors with the same color code being used in both panels right inset pseudo over training epochs for the caltech silhouette dataset conclusion we have presented method for training rbms based on an extended mean field approximation although mean field learning algorithm had already been designed for rbms and judged unsatisfactory we have shown that extending beyond the mean field to include terms of and above brings significant improvements over the approach and allows for practical and efficient deterministic rbm training with performance comparable to the stochastic cd and pcd training algorithms the extended mean field theory also provides an estimate of the rbm which is easy to evaluate and thus enables practical monitoring of the progress of unsupervised learning throughout the training epochs furthermore training on magnetizations is theoretically wellfounded within the presented approach paving the way for many possible extensions for instance it would be quite straightforward to apply the same kind of expansion to rbms as well as to rbms the extended mean field approach might also be used to learn stacked rbms jointly rather than separately as is done in both deep boltzmann machine and deep belief network strategy that has shown some promise in fact the approach can be generalized even to nonrestricted boltzmann machines with hidden variables with very little difficulty another interesting possibility would be to make use of terms in the series expansion using adaptive cluster methods such as those used in we believe our results show that the extended mean field approach and in particular the one may be good starting point to theoretically analyze the performance of rbms and deep belief networks acknowledgments we would like to thank caltagirone and decelle for many insightful discussions this research was funded by european research council under the european union framework programme grant agreement references smolensky chapter information processing in dynamical systems foundations of harmony theory processing of the parallel distributed explorations in the microstructure of cognition volume 
foundations hinton training products of experts by minimizing contrastive divergence neural hinton and salakhutdinov reducing the dimensionality of data with neural networks science larochelle and bengio classification using discriminative restricted boltzmann machines in icml pages salakhutdinov mnih and hinton restricted boltzmann machines for collaborative filtering in icml pages coates ng and lee an analysis of networks in unsupervised feature learning in intl conf on artificial intelligence and statistics pages hinton and salakhutdinov replicated softmax an undirected topic model in nips pages salakhutdinov and hinton deep boltzmann machines in intl conf on artificial intelligence and statistics pages hinton osindero and teh fast learning algorithm for deep belief nets neural lecun bengio and hinton deep learning nature may neal connectionist learning of deep belief networks artificial tieleman training restricted boltzmann machines using approximations to the likelihood gradient in icml pages peterson and anderson mean field theory learning algorithm for neural networks complex systems hinton deterministic boltzmann learning performs steepest descent in neural galland the limitations of deterministic boltzmann machine learning network kappen and boltzmann machine learning using mean field theory and linear response correction in nips pages welling and hinton new learning algorithm for mean field boltzmann machines in intl conf on artificial neural networks pages cocco leibler and monasson neuronal couplings between retinal ganglion cells inferred by efficient inverse statistical physics methods pnas cocco and monasson adaptive cluster expansion for inferring boltzmann machines with noisy data physical review letters thouless anderson and palmer solution of solvable model of spin glass philosophical magazine georges and yedidia how to expand around theory using expansions journal of physics mathematical and general plefka convergence condition of the tap equation for the ising spin glass model journal of physics mathematical and general opper and saad advanced mean field methods theory and practice mit press bolthausen an iterative construction of solutions of the tap equations for the model communications in mathematical physics lecun bottou bengio and haffner learning applied to document recognition proc of the ieee marlin swersky chen and de freitas inductive principles for restricted boltzmann machine learning in intl conf on artificial intelligence and statistics pages hinton practical guide to training restricted boltzmann machines computer pedregosa et al machine learning in python jmlr goodfellow courville and bengio joint training deep boltzmann machines for classification arxiv preprint 
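As a companion to the magnetization relations described above, here is a minimal numpy sketch of the second-order (TAP) fixed-point iteration for an RBM with {0,1} units. It is not the authors' Julia implementation: the damping factor is an added stabilization knob, the toy parameters are illustrative, and the update follows the form of the self-consistency relations as written in the text. The converged magnetizations play the role that sampled fantasy particles play in PCD: they are substituted into the negative phase of the likelihood gradient.

import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def tap_magnetizations(W, a, b, m_v, m_h, n_iter=30, damping=0.5):
    # iterate the order-2 (TAP) self-consistency relations; the terms in W**2
    # are the Onsager reaction corrections to naive mean field
    W2 = W ** 2
    for _ in range(n_iter):
        field_h = b + m_v @ W - (m_h - 0.5) * ((m_v - m_v**2) @ W2)
        m_h = damping * m_h + (1.0 - damping) * sigm(field_h)
        field_v = a + m_h @ W.T - (m_v - 0.5) * ((m_h - m_h**2) @ W2.T)
        m_v = damping * m_v + (1.0 - damping) * sigm(field_v)
    return m_v, m_h

# tiny RBM and a uniform initialization of the magnetizations
rng = np.random.default_rng(1)
nv, nh = 8, 5
W = 0.05 * rng.standard_normal((nv, nh))
a, b = np.zeros(nv), np.zeros(nh)
m_v, m_h = tap_magnetizations(W, a, b, np.full(nv, 0.5), np.full(nh, 0.5))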
convolutional networks for text xiang zhang junbo zhao yann lecun courant institute of mathematical sciences new york university broadway floor new york ny xiang yann abstract this article offers an empirical exploration on the use of convolutional networks convnets for text classification we constructed several largescale datasets to show that convolutional networks could achieve or competitive results comparisons are offered against traditional models such as bag of words and their tfidf variants and deep learning models such as convnets and recurrent neural networks introduction text classification is classic topic for natural language processing in which one needs to assign predefined categories to documents the range of text classification research goes from designing the best features to choosing the best possible machine learning classifiers to date almost all techniques of text classification are based on words in which simple statistics of some ordered word combinations such as usually perform the best on the other hand many researchers have found convolutional networks convnets are useful in extracting information from raw signals ranging from computer vision applications to speech recognition and others in particular networks used in the early days of deep learning research are essentially convolutional networks that model sequential data in this article we explore treating text as kind of raw signal at character level and applying temporal convnets to it for this article we only used classification task as way to exemplify convnets ability to understand texts historically we know that convnets usually require datasets to work therefore we also build several of them an extensive set of comparisons is offered with traditional models and other deep learning models applying convolutional networks to text classification or natural language processing at large was explored in literature it has been shown that convnets can be directly applied to distributed or discrete embedding of words without any knowledge on the syntactic or semantic structures of language these approaches have been proven to be competitive to traditional models there are also related works that use features for language processing these include using with linear classifiers and incorporating features to convnets in particular these convnet approaches use words as basis in which features extracted at word or word level form distributed representation improvements for tagging and information retrieval were observed this article is the first to apply convnets only on characters we show that when trained on largescale datasets deep convnets do not require the knowledge of words in addition to the conclusion an early version of this work entitled text understanding from scratch was posted in feb as the present paper has considerably more experimental results and rewritten introduction from previous research that convnets do not require the knowledge about the syntactic or semantic structure of language this simplification of engineering could be crucial for single system that can work for different languages since characters always constitute necessary construct regardless of whether segmentation into words is possible working on only characters also has the advantage that abnormal character combinations such as misspellings and emoticons may be naturally learnt convolutional networks in this section we introduce the design of convnets for text classification the design is modular where the gradients are obtained by 
to perform optimization key modules the main component is the temporal convolutional module which simply computes convolution suppose we have discrete input function and discrete kernel function the convolution between and with stride is defined as where is an offset constant just as in traditional convolutional networks in vision the module is parameterized by set of such kernel functions fij and which we call weights on set of inputs gi and outputs hj we call each gi or hj input or output features and or input or output feature size the outputs hj is obtained by sum over of the convolutions between gi and fij one key module that helped us to train deeper models is temporal it is the version of the module used in computer vision given discrete input function the function of is defined as max where is an offset constant this very pooling module enabled us to train convnets deeper than layers where all others fail the analysis by might shed some light on this the used in our model is the rectifier or thresholding function max which makes our convolutional layers similar to rectified linear units relus the algorithm used is stochastic gradient descent sgd with minibatch of size using momentum and initial step size which is halved every epoches for times each epoch takes fixed number of random training samples uniformly sampled across classes this number will later be detailed for each dataset sparately the implementation is done using torch character quantization our models accept sequence of encoded characters as input the encoding is done by prescribing an alphabet of size for the input language and then quantize each character using encoding or encoding then the sequence of characters is transformed to sequence of such sized vectors with fixed length any character exceeding length is ignored and any characters that are not in the alphabet including blank characters are quantized as vectors the character quantization order is backward so that the latest reading on characters is always placed near the begin of the output making it easy for fully connected layers to associate weights with the latest reading the alphabet used in all of our models consists of characters including english letters digits other characters and the new line character the characters are later we also compare with models that use different alphabet in which we distinguish between and letters model design we designed convnets one large and one small they are both layers deep with convolutional layers and layers figure gives an illustration length feature quantization some text convolutions conv and pool layers figure illustration of our model the input have number of features equal to due to our character quantization method and the input feature length is it seems that characters could already capture most of the texts of interest we also insert dropout modules in between the layers to regularize they have dropout probability of table lists the configurations for convolutional layers and table lists the configurations for linear layers table convolutional layers used in our experiments the convolutional layers have stride and pooling layers are all ones so we omit the description of their strides layer large feature small feature kernel pool we initialize the weights using gaussian distribution the mean and standard deviation used for initializing the large model is and small model table layers used in our experiments the number of output units for the last layer is determined by the problem for example for 
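The character quantization just described can be written in a few lines. The sketch below is illustrative only: the exact 70-symbol inventory and the treatment of over-long inputs are assumptions, not the released implementation.

import numpy as np
import string

# illustrative alphabet: lowercase letters, digits, punctuation and the newline character
ALPHABET = string.ascii_lowercase + string.digits \
    + "-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}" + "\n"
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}

def quantize(text, length=1014):
    # one-of-m encoding into a (len(ALPHABET), length) matrix; characters outside
    # the alphabet (including blanks) remain all-zero columns, input beyond
    # `length` characters is dropped, and the sequence is written backwards so
    # the latest characters sit at the start of the output
    out = np.zeros((len(ALPHABET), length), dtype=np.float32)
    for pos, ch in enumerate(text.lower()[::-1][:length]):
        idx = CHAR_TO_IDX.get(ch)
        if idx is not None:
            out[idx, pos] = 1.0
    return out

x = quantize("Convolutional networks for text!")
print(x.shape)   # (number of alphabet symbols, 1014)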
classification problem it will be layer output units large output units small depends on the problem for different problems the input lengths may be different for example in our case and so are the frame lengths from our model design it is easy to know that given input length the output frame length after the last convolutional layer but before any of the layers is this number multiplied with the frame size at layer will give the input dimension the first layer accepts data augmentation using thesaurus many researchers have found that appropriate data augmentation techniques are useful for controlling generalization error for deep learning models these techniques usually work well when we could find appropriate invariance properties that the model should possess in terms of texts it is not reasonable to augment the data using signal transformations as done in image or speech recognition because the exact order of characters may form rigorous syntactic and semantic meaning therefore the best way to do data augmentation would have been using human rephrases of sentences but this is unrealistic and expensive due the large volume of samples in our datasets as result the most natural choice in data augmentation for us is to replace words or phrases with their synonyms we experimented data augmentation by using an english thesaurus which is obtained from the mytheas component used in project that thesaurus in turn was obtained from wordnet where every synonym to word or phrase is ranked by the semantic closeness to the most frequently seen meaning to decide on how many words to replace we extract all replaceable words from the given text and randomly choose of them to be replaced the probability of number is determined by geometric distribution with parameter in which pr the index of the synonym chosen given word is also determined by another geometric distribution in which this way the probability of synonym chosen becomes smaller when it moves distant from the most frequently seen meaning we will report the results using this new data augmentation technique with and comparison models to offer fair comparisons to competitive models we conducted series of experiments with both traditional and deep learning methods we tried our best to choose models that can provide comparable and competitive results and the results are reported faithfully without any model selection traditional methods we refer to traditional methods as those that using feature extractor and linear classifier the classifier used is multinomial logistic regression in all these models and its tfidf for each dataset the model is constructed by selecting most frequent words from the training subset for the normal we use the counts of each word as the features for the tfidf version we use the counts as the the inverse document frequency is the logarithm of the division between total number of samples and number of samples with the word in the training subset the features are normalized by dividing the largest feature value and its tfidf the models are constructed by selecting the most frequent up to from the training subset for each dataset the feature values are computed the same way as in the model on word embedding we also have an experimental model that uses on learnt from the training subset of each dataset and then use these learnt means as representatives of the clustered words we take into consideration all the words that appeared more than times in the training subset the dimension of the embedding is the features are 
computed the same way as in the model the number of means is deep learning methods recently deep learning methods have started to be applied to text classification we choose two simple and representative models for comparison in which one is convnet and the other simple term memory lstm recurrent neural network model convnets among the large number of recent works on convnets for text classification one of the differences is the choice of using pretrained or learned word representations we offer comparisons with both using the pretrained embedding and using lookup tables the embedding size is in both cases in the same way as our model to ensure fair comparison the models for each case are of the same size as our convnets in terms of both the number of layers and each layer output size experiments using thesaurus for data augmentation are also conducted http term memory we also offer comparison mean with recurrent neural network model namely term memory lstm the lstm model used in our case is using pretrained emlstm lstm lstm bedding of size as in previous models the model is formed by taking mean of the outputs of all lstm cells to form feature vector and then using multinomial logistic figure term memory regression on this feature vector the output dimension is the variant of lstm we used is the common vanilla architecture we also used gradient clipping in which the gradient norm is limited to figure gives an illustration choice of alphabet for the alphabet of english one apparent choice is whether to distinguish between and letters we report experiments on this choice and observed that it usually but not always gives worse results when such distinction is made one possible explanation might be that semantics do not change with different letter cases therefore there is benefit of regularization datasets and results previous research on convnets in different areas has shown that they usually work well with largescale datasets especially when the model takes in raw features like characters in our case however most open datasets for text classification are quite small and datasets are splitted with significantly smaller training set than testing therefore instead of confusing our community more by using them we built several datasets for our experiments ranging from hundreds of thousands to several millions of samples table is summary table statistics of our datasets epoch size is the number of minibatches in one epoch dataset ag news sogou news dbpedia yelp review polarity yelp review full yahoo answers amazon review full amazon review polarity classes train samples test samples epoch size ag news corpus we obtained the ag corpus of news article on the it contains categorized news articles from more than news sources we choose the largest classes from this corpus to construct our dataset using only the title and description fields the number of training samples for each class is and testing sogou news corpus this dataset is combination of the sogouca and sogoucs news corpora containing in total news articles in various topic channels we then labeled each piece of news using its url by manually classifying the their domain names this gives us large corpus of news articles labeled with their categories there are large number categories but most of them contain only few articles we choose categories sports finance entertainment automobile and technology the number of training samples selected for each class is and testing although this is dataset in chinese we used pypinyin package combined 
with jieba chinese segmentation system to produce pinyin phonetic romanization of chinese the models for english can then be applied to this dataset without change the fields used are title and content http table testing errors of all the models numbers are in percentage lg stands for large and sm stands for small is an abbreviation for and lk for lookup table th stands for thesaurus convnets labeled full are those that distinguish between lower and upper letters model bow bow tfidf ngrams ngrams tfidf lstm lg conv sm conv lg conv th sm conv th lg lk conv sm lk conv lg lk conv th sm lk conv th lg full conv sm full conv lg full conv th sm full conv th lg conv sm conv lg conv th sm conv th ag sogou dbp yelp yelp yah amz amz dbpedia ontology dataset dbpedia is community effort to extract structured information from wikipedia the dbpedia ontology dataset is constructed by picking nonoverlapping classes from dbpedia from each of these ontology classes we randomly choose training samples and testing samples the fields we used for this dataset contain title and abstract of each wikipedia article yelp reviews the yelp reviews dataset is obtained from the yelp dataset challenge in this dataset contains samples that have review texts two classification tasks are constructed from this dataset one predicting full number of stars the user has given and the other predicting polarity label by considering stars and negative and and positive the full dataset has training samples and testing samples in each star and the polarity dataset has training samples and test samples in each polarity yahoo answers dataset we obtained yahoo answers comprehensive questions and answers version dataset through the yahoo webscope program the corpus contains questions and their answers we constructed topic classification dataset from this corpus using largest main categories each class contains training samples and testing samples the fields we used include question title question content and best answer amazon reviews we obtained an amazon review dataset from the stanford network analysis project snap which spans years with reviews from users on products similarly to the yelp review dataset we also constructed datasets one full score prediction and another polarity prediction the full dataset contains training samples and testing samples in each class whereas the polarity dataset contains training samples and testing samples in each polarity sentiment the fields used are review title and review content table lists all the testing errors we obtained from these datasets for all the applicable models note that since we do not have chinese thesaurus the sogou news dataset does not have any results using thesaurus augmentation we labeled the best result in blue and worse result in red discussion tfidf lstm convnet ag news dbpedia lookup table convnet yelp yelp yahoo full alphabet convnet amazon amazon figure relative errors with comparison models to understand the results in table further we offer some empirical analysis in this section to facilitate our analysis we present the relative errors in figure with respect to comparison models each of these plots is computed by taking the difference between errors on comparison model and our convnet model then divided by the comparison model error all convnets in the figure are the large models with thesaurus augmentation respectively convnet is an effective method the most important conclusion from our experiments is that convnets could work for text classification without the need 
for words this is strong indication that language could also be thought of as signal no different from any other kind figure shows random patches learnt by one of our convnets for dbpedia dataset figure first layer weights for each patch height is the kernel size and width the alphabet size dataset size forms dichotomy between traditional and convnets models the most obvious trend coming from all the plots in figure is that the larger datasets tend to perform better traditional methods like tfidf remain strong candidates for dataset of size up to several hundreds of thousands and only until the dataset goes to the scale of several millions do we observe that convnets start to do better convnets may work well for data data vary in the degree of how well the texts are curated for example in our million scale datasets amazon reviews tend to be raw whereas users might be extra careful in their writings on yahoo answers plots comparing deep models figures and show that convnets work better for less curated texts this property suggests that convnets may have better applicability to scenarios however further analysis is needed to validate the hypothesis that convnets are truly good at identifying exotic character combinations such as misspellings and emoticons as our experiments alone do not show any explicit evidence choice of alphabet makes difference figure shows that changing the alphabet by distinguishing between uppercase and lowercase letters could make difference for datasets it seems that not making such distinction usually works better one possible explanation is that there is regularization effect but this is to be validated semantics of tasks may not matter our datasets consist of two kinds of tasks sentiment analysis yelp and amazon reviews and topic classification all others this dichotomy in task semantics does not seem to play role in deciding which method is better is misuse of one of the most obvious facts one could observe from table and figure is that the model performs worse in every case comparing with traditional models this suggests such simple use of distributed word representation may not give us an advantage to text classification however our experiments does not speak for any other language processing tasks or use of in any other way there is no free lunch our experiments once again verifies that there is not single machine learning model that can work for all kinds of datasets the factors discussed in this section could all play role in deciding which method is the best for some specific application conclusion and outlook this article offers an empirical study on convolutional networks for text classification we compared with large number of traditional and deep learning models using several largescale datasets on one hand analysis shows that convnet is an effective method on the other hand how well our model performs in comparisons depends on many factors such as dataset size whether the texts are curated and choice of alphabet in the future we hope to apply convnets for broader range of language processing tasks especially when structured outputs are needed acknowledgement we gratefully acknowledge the support of nvidia corporation with the donation of tesla gpus used for this research we gratefully acknowledge the support of inc for an aws in education research grant used for this research references bottou fogelman blanchet and lienard experiments with time delay networks and dynamic time warping for speaker independent isolated digit recognition in proceedings of 
eurospeech volume pages paris france boureau bach lecun and ponce learning features for recognition in computer vision and pattern recognition cvpr ieee conference on pages ieee boureau ponce and lecun theoretical analysis of feature pooling in visual recognition in proceedings of the international conference on machine learning pages collobert kavukcuoglu and farabet environment for machine learning in biglearn nips workshop number collobert weston bottou karlen kavukcuoglu and kuksa natural language processing almost from scratch mach learn dos santos and gatti deep convolutional neural networks for sentiment analysis of short texts in proceedings of coling the international conference on computational linguistics technical papers pages dublin ireland august dublin city university and association for computational linguistics fellbaum wordnet and wordnets in brown editor encyclopedia of language and linguistics pages oxford elsevier graves and schmidhuber framewise phoneme classification with bidirectional lstm and other neural network architectures neural networks greff srivastava steunebrink and schmidhuber lstm search space odyssey corr hinton srivastava krizhevsky sutskever and salakhutdinov improving neural networks by preventing of feature detectors arxiv preprint hochreiter and schmidhuber long memory neural joachims text categorization with suport vector machines learning with many relevant features in proceedings of the european conference on machine learning pages johnson and zhang effective use of word order for text categorization with convolutional neural networks corr jones statistical interpretation of term specificity and its application in retrieval journal of documentation kanaris kanaris houvardas and stamatatos words versus character for filtering international journal on artificial intelligence tools kim convolutional neural networks for sentence classification in proceedings of the conference on empirical methods in natural language processing emnlp pages doha qatar october association for computational linguistics lecun boser denker henderson howard hubbard and jackel backpropagation applied to handwritten zip code recognition neural computation winter lecun bottou bengio and haffner learning applied to document recognition proceedings of the ieee november lehmann isele jakob jentzsch kontokostas mendes hellmann morsey van kleef auer and bizer dbpedia multilingual knowledge base extracted from wikipedia semantic web journal lev klein and wolf in defense of word embedding for generic text representation in biemann handschuh freitas meziane and mtais editors natural language processing and information systems volume of lecture notes in computer science pages springer international publishing lewis yang rose and li new benchmark collection for text categorization research the journal of machine learning research mcauley and leskovec hidden factors and hidden topics understanding rating dimensions with review text in proceedings of the acm conference on recommender systems recsys pages new york ny usa acm mikolov sutskever chen corrado and dean distributed representations of words and phrases and their compositionality in burges bottou welling ghahramani and weinberger editors advances in neural information processing systems pages nair and hinton rectified linear units improve restricted boltzmann machines in proceedings of the international conference on machine learning pages pascanu mikolov and bengio on the difficulty of training recurrent neural networks in icml 
volume of jmlr proceedings pages polyak some methods of speeding up the convergence of iteration methods ussr computational mathematics and mathematical physics rumelhart hintont and williams learning representations by errors nature santos and zadrozny learning representations for tagging in proceedings of the international conference on machine learning pages shen he gao deng and mesnil latent semantic model with structure for information retrieval in proceedings of the acm international conference on conference on information and knowledge management pages acm sutskever martens dahl and hinton on the importance of initialization and momentum in deep learning in dasgupta and mcallester editors proceedings of the international conference on machine learning volume pages jmlr workshop and conference proceedings may waibel hanazawa hinton shikano and lang phoneme recognition using neural networks acoustics speech and signal processing ieee transactions on wang zhang ma and ru automatic online news issue construction in web environment in proceedings of the international conference on world wide web www pages new york ny usa acm 
robust linear discriminant analysis for brain disorders diagnosis ehsan thung le an feng shi dinggang shen for the department of radiology and bric university of north carolina at chapel hill nc usa eadeli khthung fengshi dgshen abstract wide spectrum of discriminative methods is increasingly used in diverse applications for classification or regression tasks however many existing discriminative methods assume that the input data is nearly which limits their applications to solve problems particularly for disease diagnosis the data acquired by the neuroimaging devices are always prone to different sources of noise robust discriminative models are somewhat scarce and only few attempts have been made to make them robust against noise or outliers these methods focus on detecting either the or moreover they usually use unsupervised procedures or separately the training and the testing data all these factors may induce biases in the learning process and thus limit its performance in this paper we propose classification method based on the formulation of linear discriminant analysis which simultaneously detects the and the proposed method operates under setting in which both labeled training and unlabeled testing data are incorporated to form the intrinsic geometry of the sample space therefore the violating samples or feature values are identified as or respectively we test our algorithm on one synthetic and two brain neurodegenerative databases particularly for parkinson disease and alzheimer disease the results demonstrate that our method outperforms all baseline and methods in terms of both accuracy and the area under the roc curve introduction discriminative methods pursue direct mapping from the input to the output space for classification or regression task as an example linear discriminant analysis lda aims to find the mapping that reduces the input dimensionality while preserving the most class discriminatory information discriminative methods usually achieve good classification results compared to the generative models when there are enough number of training samples but they are limited when there are small number of labeled data as well as when the data is noisy various efforts have been made to add robustness to these methods for instance and proposed robust discriminant analysis methods and introduced lda by minimizing the upper bound of the lda cost function these methods are all robust to on the other hand some methods were proposed to deal with the or such as parts of the data used in preparation of this article were obtained from the alzheimer disease neuroimaging initiative adni database http the investigators within the adni contributed to the design and implementation of adni provided data but did not participate in analysis or writing of this paper complete listing of adni investigators can be found at http as in many previous works the training and the testing data are often conducted separately this might induce bias or inconsistency to the whole learning process besides for many applications it is cumbersome task to acquire enough training samples to perform proper discriminative analysis hence we propose to take advantage of the unlabeled testing data available to build more robust classifier to this end we introduce discriminative classification model which unlike previous works jointly estimates the noise model both and on the whole labeled training and unlabeled testing data and simultaneously builds discriminative model upon the training data in this paper we 
introduce novel classification model based on lda which is robust against both and and hence here it is called robust linear discriminant analysis lda finds the mapping between the sample space and the label space through linear transformation matrix maximizing fisher discriminant ratio in practice the major drawback of the original lda is the small sample size problem which arises when the number of available training samples is less than the dimensionality of the feature space reformulation of lda based on the problem known tackles this problem finds the mapping by solving the following min kh ytr xtr where ytr is binary class label indicator matrix for different classes or labels and xtr is the matrix containing ntr training samples is normalization factor defined as ytr ytr that compensates for the different number of samples in each class as result the mapping is reduced rank transformation matrix which could be used to project test data xtst onto dimensional space the class label could therefore be simply determined using strategy to make lda robust against noisy data fidler et al proposed to construct basis which contains complete discriminative information for classification in the testing phase the estimated basis identifies the outliers in samples images in their case and then is used to calculate the coefficients using subsampling approach on the other hand huang et al proposed general formulation for robust regression rr and classification robust lda or rlda in the training stage they denoise the feature values using strategy similar to robust principle component analysis rpca and build the above model using the data in the testing stage they the data by performing locally compact representation of the testing samples from the training data this separate procedure could not effectively form the underlying geometry of sample space to the data huang et al only account for by imposing sparse noise model constraint on the features matrix on the other hand the data fitting term in is vulnerable to large recently in robust statistics it is found that loss functions are able to make more reliable estimations than fitting functions this has been adopted in many applications including robust face recognition and robust dictionary learning reformulating the objective in using this idea would yield to this problem min kh ytr xtr we incorporate this fitting function in our formulation to deal with the by iteratively each single sample while simultaneously the data from this is done through setting to take advantage of all labeled and unlabeled data to build the structure of the sample space more robustly learning has long been of great interest in different fields because it can make use of unlabeled or poorly labeled data for instance joulin and bach introduced convex relaxation and use the model in different semisupervised learning scenarios in another work cai et al proposed discriminant analysis where the separation between different classes is maximized using the labeled data points while the unlabeled data points estimate the structure of the data in contrast we incorporate the unlabeled testing data to form the intrinsic geometry of the sample space and the data whilst building the discriminative model bold capital letters denote matrices all letters denote scalar variables dij is the scalar in the row and column of denotes the inner product between and kdkp and represent the squared euclidean norm and the norm of respectively kdkf tr ij dij and designate the squared frobenius norm 
and the nuclear norm sum of singular values of respectively xtr xtst tr tr tr tr tr tr xdn xdn xdn tr tr dtr dtst tr tr tr tr tr ddn ddn ddn tr tr edn mapping tr yln tr ytr figure outline of the proposed method the original data matrix is composed of both labeled training and unlabeled testing data our method decomposes this matrix to data matrix and an error matrix to account for simultaneously we learn mapping from the training samples in dtr through robust fitting function dealing with the the same learned mapping on the testing data dtst leads to the test labels we apply our method for the diagnosis of neurodegenerative brain disorders the term neurodegenerative disease is an umbrella term for debilitating and incurable conditions related to progressive degeneration or death of the cells in the brain nervous system although neurodegenerative diseases manifest with diverse pathological features the cellular level processes resemble similar structures for instance parkinson disease pd mainly affects the basal ganglia region and the substansia nigra of the brain leading to decline in generation of chemical messenger dopamine lack of dopamine yields loss of ability to control body movements along with some problems depression anxiety in alzheimer disease ad deposits of tiny protein plaques yield into brain damage and progressive loss of memory these diseases are often incurable and thus early diagnosis and treatment are crucial to slow down the progression of the disease in its initial stages in this study we use two popular databases ppmi and adni the former aims at investigating pd and its related disorders while the latter is designed for diagnosing ad and its prodormal stage known as mild cognitive impairment mci contributions the contribution of this paper would therefore be we propose an approach to deal with the and simultaneously and build robust discriminative classification model the are penalized through an fitting function by the samples based on their prediction power while discarding the our proposed model operates under setting where the whole data labeled training and unlabeled testing samples are incorporated to build the intrinsic geometry of the sample space which leads to better the data we further select the most discriminative features for the learning process through regularizing the weights matrix with an norm this is specifically of great interest for the neurodegenerative disease diagnosis where the features from different regions of the brain are extracted but not all the regions are associated with certain disease therefore the most discriminative regions in the brain that utmost affect the disease would be identified leading to more reliable diagnosis model robust linear discriminant analysis let assume we have ntr training and ntst testing samples each with feature vector which leads to set of ntr ntst total samples let denote the set of all samples both training and testing in which each column indicates single sample and yi their corresponding ith labels in general with different labels we can define thus and are composed by stacking up the training and testing data as xtr xtst and ytr ytst our goal is to determine the labels of the test samples ytst formulation an illustration of the proposed method is depicted in fig first all the samples labeled or unlabeled are arranged into matrix we are interested in this matrix following this could be done by assuming that can be spanned on subspace and therefore should be this assumption supports the fact that 
samples from same classes should be more correlated therefore the original matrix is decomposed into two counterparts and which represent the data matrix and the error matrix respectively similar to rpca the data matrix shall hold the assumption and the error matrix is considered to be sparse but this process of does not incorporate the label information and is therefore unsupervised nevertheless note that we also seek mapping between the training samples and their respective labels so matrix should be spanned on subspace which also leads to good classification model of its dtr to ensure the of the matrix like in many previous works we approximate the rank function using the nuclear norm the sum of the singular values of the matrix the noise is modeled using the norm of the matrix which ensures sparse noise model on the feature values accordingly the objective function for under setting would be min kh ytr dtr where the first term is the regression model introduced in this term only operates on the denoised training samples from matrix with row of all is added to it to ensure an appropriate linear classification model the second and the third terms together with the first constraint are similar to the rpca formulation they the labeled training and unlabeled testing data together in combination with the first term we ensure that the data also provides favorable model the last term is regularization on the learned mapping coefficients to ensure the coefficients do not get trivial or unexpectedly large values the parameters and are constant regularization parameters which are discussed in more details later the regularization on the coefficients could be posed as simple norm of the matrix but in many applications like ours disease diagnosis many of the features in the feature vectors are redundant in practice features from different brain regions are often extracted but not all the regions contribute to certain disease therefore it is desirable to determine which features regions are the most relevant and the most discriminative to use following we are looking for sparse set of weights that ensures incorporating the least and the most discriminative features we propose regularization on the weights vector as combination of the and frobenius norms kβ γkβ kf evidently the solution to the objective function in is not easy to achieve since the first term contains quadratic term and minimization of the fitting function is not straightforward because of its indifferentiability to this end we formalize the solution with similar strategy as in iteratively least squares irls the minimization problem is approximated by conventional in which each of the samples in the matrix are weighted with the reverse of their regression residual therefore the new problem would be kh ytr min dtr is diagonal matrix the ith diagonal element of which is the ith sample weight where yi αii αij ntr where is very small positive number equal to in our experiments in the next subsection we introduce an algorithm to solve this optimization problem our work is closely related to the rr and rlda formulations in where the authors impose assumption on the training data feature values and an assumption on the noise model the discriminant model is learned similar to as illustrated in while strategy is employed to achieve more robust model on the other hand our model operates under learning setting where both the labeled training and the unlabeled testing samples are simultaneously therefore the geometry of the sample space is better 
modeled on the subspace by interweaving both labeled training and unlabeled testing data in addition our model further selects the most discriminative features to learn the model by regularizing the mapping weights vector and enforcing an sparsity condition on them algorithm optimization algorithm input xtr xtst ytr parameters and initialization xtr xtst xtr ytr γi dn kxtr dc kβ tr repeat main optimization loop βk update repeat αij and αii yk ntr γi ytr until kf kf or ytr dktr update ek ntr update update update update multipliers and parameters tr min min min until kx dk ek and dktr kf and kββ bk kf output and ytst xtst optimization problem could be efficiently solved using the augmented lagrangian multipliers alm approach hence we introduce the lagrangian multipliers and an auxiliary variable and write the lagrangian function as kh ytr γkβ kf ei kx dtr dtr bi kβ and contributwhere and are penalty parameters there are five variables ing to the problem we alternatively optimize for each variable while fixing the others except for the matrix all the variables have straightforward or solutions is calculated through and solving the conventional irls by iteratively calculating the weights in problem until convergence the detailed optimization steps are given in algorithm the normalization factor is omitted in this algorithm for easier readability in this algorithm is the identity matrix and the operators applies singular value threshdτ and sκ are defined in the following dτ udτ diag σi where uς olding algorithm on the intermediate matrix as dτ is the singular values decomposition svd of and σi are the singular values additionally sκ is the soft thresholding operator or the proximal operator for the norm note that is the positive part of defined as max algorithm analysis the solution for each of the matrices is convex function while all the other variables are fixed for the solution is achieved via the irls approach in an iterative manner both the fitting function and the approximated functions are convex we only need to ensure that the minimization of the latter is numerically better tractable than the minimization of the former this is discussed in depth and the convergence is proved in to estimate the computational complexity of the algorithm we need to investigate the complexity of the of the algorithm the two most computationally expensive steps in the loop are the iterative update of algorithm steps and the svt operation algorithm step the former includes solving iteratively which is in each iteration and the latter has the svd operation as the most computational intensive operation which is of by considering the maximum number of iterations for the first equal to tmax the overall computational complexity of the algorithm in each iteration would be the number of iterations of the whole algorithm until convergence is dependent on the choice of if penalty parameters are increasing smoothly in each iteration as in step algorithm the overall algorithm would be convergent reasonable choice for the sequence of all yields in decrease in the number of required svd operations experiments we compare our method with several baseline and methods in three different scenarios the first experiment is on synthetic data which highlights how the proposed method is robust against or separately or when they occur at the same time the next two experiments are conducted for neurodegenerative brain disorders diagnosis we use two popular databases one for parkinson disease pd and the other for alzheimer disease 
ad we compare our results with different baseline methods including conventional rlda rpca on the matrix separately to and then for the classification denoted as linear support vector machines svm and sparse feature selection with svm or with rlda except for the other methods in comparison do not incorporate the testing data in order to have fair set of comparisons we also compare against the transductive matrix completion mc approach additionally to also evaluate the effect of the regularization on matrix we report results for when reguβ kf denoted as instead of the term introduced in moreover we larized by only γkβ also train our proposed in fully supervised setting not involving any testing data in the training process to show the effect of the established learning framework in our proposed method this is simply done by replacing variable in with xtr and solving the problem correspondingly this method referred to as only uses the training data to form the geometry of the sample space and therefore only cleans the training for the choice of parameters the best parameters are selected through an inner cross validation on the training data for all the competing methods for the proposed method the parameters are set with same strategy as in min and controlling the in the algorithm is set to we have set and through inner cross validation and found that all set to yields to reasonable results across all datasets synthetic data we construct two independent subspaces with bases and same as described in is random orthogonal matrix and in which is random rotation matrix then vectors are sampled from each subspace through xi ui qi with qi matrix independent and identically distributed from this leads to binary classification problem we gradually add additional noisy samples and features to the data drawn from and evaluate our proposed method the accuracy means and standard deviations of three different runs are illustrated in fig this experiment is conducted under three settings first we analyze the behavior of the method against gradually added noise to some of the features illustrated in fig we randomly add some noisy samples to the aforementioned samples and evaluate the methods in the sole presence of results are depicted in fig finally we simultaneously add noisy features and samples fig shows the accuracy as function of the additional number of noisy features and samples note that all the reported results are obtained through as can be seen our method is able to select better subset of features and samples and achieve superior results compared to rlda and conventional approaches furthermore our method behaves more robust against the increase in the noise factor brain neurodegenrative disease diagnosis databases the first set of data used in this paper is obtained from the parkinson progression markers initiative ppmi ppmi is the first substantial study for identifying the pd progression biomarkers to advance the understanding of the disease in this research we use the mri data acquired by the ppmi study in which sequence mprage or spgr is acquired for each subject using siemens magnetom triotim syngo scanners we use subjects scanned using mprage sequence to http accuracy rlda of added noisy features only added noisy features of added noisy samples of added noisy samples and features only added noisy samples added noisy samples features figure results comparisons on synthetic data for three different runs table the accuracy acc and area under roc curve auc of the classification on ppmi 
database compared to the baseline methods method acc auc rlda svm mc minimize the effect of different scanning protocols the images were acquired for sagittal slices with the following parameters repetition time ms echo time ms flip angle and voxel size all the mr images were preprocessed by skull stripping cerebellum removal and then segmented into white matter wm gray matter gm and cerebrospinal fluid csf tissues the anatomical automatic labeling atlas parcellated with predefined regions of interest roi was registered using to each subject native space we further added more rois in basal ganglia and brainstem regions which are clinically important rois for pd we then computed wm gm and csf tissue volumes in each of the rois as features pd and normal control nc subjects are used in our experiments the second dataset is from alzheimer disease neuroimaging initiative adni including mri and data for this experiment we used ad patients mci patients and nc subjects to process the data same tools employed in and are used including spatial distortion and cerebellum removal the fsl package was used to segment each mr image into three different tissues gm wm and csf then rois are parcellated for each subject with atlas warping the volume of gm tissue in each roi was calculated as the image feature for images rigid transformation was employed to align it to the corresponding mr image and the mean intensity of each roi was calculated as the feature all these features were further normalized in similar way as in results the first experiment is set up on the ppmi database table shows the diagnosis accuracy of the proposed technique in comparisons with different baseline and methods using strategy as can be seen the proposed method outperforms all others this could be because our method deals with both and note that subjects and their corresponding feature vectors extracted from mri data are quite prone to noise because of many possible sources of noise the patient body movements rf emission due to thermal motion overall mr scanner measurement chain or preprocessing artifacts therefore some samples might not be useful and some might be contaminated by some amounts of noise our method deals with both types and achieves good results the goal for the experiments on adni database is to discriminate both mci and ad patients from nc subjects separately therefore nc subjects form our negative class while the positive class is defined as ad in one experiment and mci in the other the diagnosis results of the ad nc and mci nc experiments are reported in tables as it could be seen in comparisons with the our method achieves good results in terms of both accuracy and the area under curve this is because we successfully discard the and detect the could be downloaded at http http table the accuracy acc and the area under roc curve auc of the alzheimer disease classification on adni database compared to the baseline methods method acc auc acc auc rlda svm mc figure the top selected rois for ad nc left and mci nc right classification problems discussions in medical imaging applications many sources of noise patient movement radiations and limitation of imaging devices preprocessing artifacts contribute to the acquired data and therefore methods that deal with noise and outliers are of great interest our method enjoys from single optimization objective that can simultaneously suppress and which compared to the competing methods exhibits good performance one of the interesting functions of the proposed method is the 
regularization on the mapping coefficients with the norm which would select compact set of features to contribute to the learned mapping the magnitude of the coefficients would show the level of contribution of that specific feature to the learned model in our application the features from the whole brain regions are extracted but only small number of regions are associated with the disease ad mci or pd using this strategy we can determine which brain regions are highly associated with certain disease fig shows the top regions selected by our algorithm in ad nc and mci nc classification scenarios these regions including middle temporal gyrus medial gyrus postcentral gyrus caudate nucleus cuneus and amygdala have been reported to be associated with ad and mci in the literature the figures show the union of regions selected for both mri and features the most frequently used regions for the experiment are the substantial nigra left and right putamen right middle frontal gyrus right superior temporal gyrus left which are also consistent with the literature this selection of brain regions could be further incorporated for future clinical analysis the setting of the proposed method is also of great interest in the diagnosis of patients when new patients first arrive and are to be diagnosed the previous set of the patients with no certain diagnosis so far not labeled yet could still be used to build more reliable classifier in other words the current testing samples could contribute the diagnosis of future subjects as unlabeled samples conclusion in this paper we proposed an approach for discriminative classification which is robust against both and our method enjoys setting where all the labeled training and the unlabeled testing data are used to detect outliers and are simultaneously we have applied our method to the interesting problem of neurodegenerative brain disease diagnosis and directly applied it for the diagnosis of parkinson and alzheimer diseases the results show that our method outperforms all competing methods as direction for the future work one can develop learning reformulation of the proposed method to incorporate multiple modalities for the subjects or extend the method for the incomplete data case references and fathy matrix completion for action detection image vision bissantz munk and stratmann convergence analysis of generalized iteratively reweighted least squares algorithms on convex function spaces siam boyd and et al distributed optimization and statistical learning via the alternating direction method of multipliers found trends mach heiko braak kelly tredici udo rub rob de vos ernst jansen steur and eva braak staging of brain pathology related to sporadic parkinsons disease neurobio of aging cai he and han discriminant analysis in cvpr cai and shen singular value thresholding algorithm for matrix completion siam li ma and wright robust principal component analysis acm chapelle and zien editors learning mit press croux and dehon robust linear discriminant analysis using canadian of statistics de la torre framework for component analysis ieee tpami elhamifar and vidal robust classification using structured sparse representation in cvpr fidler skocaj and leonardis combining reconstructive and discriminative subspace methods for robust classification and regression by subsampling ieee tpami fritsch varoquaux thyreau poline and thirion detecting outliers in neuroimaging datasets with robust covariance estimators med image goldberg zhu recht xu and nowak transduction with 
matrix completion three birds with one stone in nips pages huang cabral and de la torre robust regression in eccv pages joulin and bach convex relaxation for weakly supervised classifiers in icml kim magnani and boyd robust fisher discriminant analysis in nips pages li jiang and zhang efficient and robust feature extraction by maximum margin criterion in nips pages li shen van den hengel and shi linear discriminant analysis as scalable semidefinite feasibility problems ieee tip lim and pfefferbaum segmentation of mr brain images into cerebrospinal fluid spaces white and gray matter of computer assisted tomography liu lin yan sun yu and ma robust recovery of subspace structures by representation ieee tpami lu shi and jia online robust dictionary learning in cvpr pages june marek and et al the parkinson progression marker initiative ppmi prog pearce palmer bowen wilcock esiri and davison neurotransmitter dysfunction and atrophy of the caudate nucleus in alzheimer disease neurochem shen and davatzikos hammer hierarchical attribute matching mechanism for elastic registration ieee tmi thung wee yap and shen neurodegenerative disease diagnosis using incomplete data via matrix shrinkage and completion neuroimage and et al automated anatomical labeling of activations in spm using macroscopic anatomical parcellation of the mni mri brain neuroimage wagner wright ganesh zihan zhou and yi ma towards practical face recognition system robust registration and illumination by sparse representation in cvpr pages wang nie yap li shi geng guo shen adni et al robust mri brain extraction for diverse neuroimaging studies on humans and primates plos one wang nie yap shi guo and shen robust for studies in miccai volume pages worker and et al cortical thickness surface area and volume measures in parkinson disease multiple system atrophy and progressive supranuclear palsy plos one zhang wang zhou yuan shen adni et al multimodal classification of alzheimer disease and mild cognitive impairment neuroimage zhang brady and smith segmentation of brain mr images through hidden markov random field model and the algorithm ieee tmi xiaojin zhu learning literature survey technical report computer sciences university of ziegler and augustinack harnessing advances in structural mri to enhance research on parkinson disease imaging 
Black-box optimization of noisy functions with unknown smoothness

Jean-Bastien Grill, Michal Valko (SequeL team, INRIA Lille - Nord Europe, France), Remi Munos (Google DeepMind; on leave from the SequeL team, INRIA Lille - Nord Europe, France).

Abstract. We study the problem of black-box optimization of a function f of any dimension, given function evaluations perturbed by noise. The function is assumed to be locally smooth around one of its global optima, but this smoothness is unknown. Our contribution is an adaptive optimization algorithm, POO or parallel optimistic optimization, that is able to deal with this setting. POO performs almost as well as the best known algorithms requiring the knowledge of the smoothness. Furthermore, POO works for a larger class of functions than what was previously considered, especially for functions that are difficult to optimize, in a very precise sense. We provide an analysis of POO which shows that its error after n evaluations is at most a logarithmic factor away from the error of the best known optimization algorithms using the knowledge of the smoothness.

Introduction. We treat the problem of optimizing a function f given a finite budget of n noisy evaluations. We consider that the cost of any of these function evaluations is high, which means we care about assessing the optimization performance in terms of the sample complexity, i.e., the number of function evaluations. This is typically the case when one needs to tune parameters for a complex system, seen as a black box, whose performance can only be evaluated by a costly simulation. One such example is the tuning of such parameters where the sensitivity to perturbations is large and the derivatives of the objective function with respect to these parameters do not exist or are unknown.

Such a setting fits the sequential decision-making setting under bandit feedback. In this setting, the actions are the points x that lie in a domain X. At each step t, an algorithm selects an action x_t and receives a reward r_t, which is a noisy function evaluation such that r_t = f(x_t) + eps_t, where eps_t is a bounded noise with E[eps_t | x_t] = 0. After n evaluations, the algorithm outputs its best guess x(n), which can be different from the last evaluated point x_n. The performance measure we want to minimize is the value of the function at the returned point compared to the optimum, also referred to as the simple regret,

R_n = sup_{x in X} f(x) - f(x(n)).

We assume there exists at least one point x* in X such that f(x*) = sup_{x in X} f(x).

The relationship with bandit settings motivated UCT, an empirically successful heuristic that hierarchically partitions the domain X and selects the next point x_t using upper confidence bounds. The empirical success of UCT on one side, but the absence of performance guarantees for it on the other, incited research on similar but theoretically founded algorithms. As the global optimization of an unknown function without absolutely any assumptions would be a daunting problem, most of the algorithms assume at least a very weak assumption that the function does not decrease faster than at a known rate around one of its global optima. In other words, they assume a certain local smoothness property of f. This smoothness is often expressed in the form of a semi-metric that quantifies this regularity. Naturally, this regularity also influences the guarantees that these algorithms are able to furnish; many of them define a near-optimality dimension or a zooming dimension. These are quantities used to bound the simple regret R_n or a related notion called cumulative regret. Our work focuses on a notion of such a dimension that does not directly relate the smoothness property of f to a specific metric, but directly to the hierarchical partitioning {P_{h,i}} representation of the space X used by the algorithm. Indeed, an interesting fundamental question is to determine a good
characterization of the difficulty of the optimization for an algorithm that uses given hierarchical partitioning of the space as its input the kind of hierarchical partitioning ph we consider is similar to the ones introduced in prior work for any depth in the tree representation the set of cells ph form partition of where ih is the number of cells at depth at depth the root of the tree there is single cell cell ph of depth is split into several children subcells of depth we refer to the standard partitioning as to one where each cell is split into regular subcells an important insight detailed in section is that dimension that is independent from the partitioning used by an algorithm as defined in prior work does not embody the optimization difficulty perfectly this is easy to see as for any we could define partitioning perfectly suited for an example is partitioning that at the root splits into and which makes the optimization trivial whatever is this insight was already observed by slivkins and bull whose zooming dimension depends both on the function and the partitioning in this paper we define notion of dimension which measures the complexity of the optimization problem directly in terms of the partitioning used by an algorithm first we make the following local smoothness assumption about the function expressed in terms of the partitioning and not any metric for given partitioning we assume that there exist and νρh ph where is the unique cell of depth containing then we define the dimension as def inf nh where for all nh is the number of cells ph of depth intuitively functions with smaller are easier to optimize and we denote for which is the smallest as obviously depends on and but does not depend on any choice of specific metric in section we argue that this definition of encompasses the optimization complexity better we stress this is not an artifact of our analysis and previous algorithms such as hoo taxonomyzoom or hct can be shown to scale with this new notion of most of the prior algorithms proposed for function optimization for either deterministic or stochastic setting assume that the smoothness of the optimized function is known this is the case of known and this assumption limits the application of these algorithms and opened very compelling question of whether this knowledge is necessary prior work responded with algorithms not requiring this knowledge bubeck et al provided an algorithm for optimization of lipschitz functions without the knowledge of the lipschitz constant however they have to assume that is twice differentiable and bound on the second order derivative is known combes and treat unimodal restricted to dimension one slivkins considered general optimization problem embedded in and provided guarantees as function of the quality of the taxonomy the quality refers to the probability of reaching two cells belonging to the same branch that can have values that differ by more that half of the diameter expressed by the true metric of the branch the problem is that the algorithm needs lower bound on this quality which can be tiny and the performance depends inversely on this quantity also it assumes that the quality is strictly positive in this paper we do not rely on the knowledge of quality and also consider more general class of functions for which the quality can be appendix we use the simplified notation instead of for clarity when no confusion is possible which is similar to the hierarchical partitioning previously defined simple regret after evaluations 
figure difficult function where if the fractional part of that is bxc is in and if it is in left oscillation between two envelopes of different smoothness leading to nonzero for standard partitioning right regret of hoo after evaluations for different values of another direction has been followed by munos where in the deterministic case the function evaluations are not perturbed by noise their soo algorithm performs almost as well as the best known algorithms without the knowledge of the function smoothness soo was later extended to stosoo for the stochastic case however stosoo only extends soo for limited case of easy instances of functions for which there exists under which also bull provided similar regret bound for the atb algorithm for class of functions called zooming continuous functions which is related to the class of functions for which there exists under which the dimension is but none of the prior work considers more general class of functions where there is no adapted to the standard partitioning for which to give an example of difficult function consider the function in figure lower and upper envelope around its global optimum that are equivalent to and and therefore have different smoothness thus for standard partitioning there is no of the form for which the dimension is as shown by valko et al other examples of nonzero dimension are the functions that for standard partitioning behave differently depending on the direction for instance using bad value for the parameter can have dramatic consequences on the simple regret in figure we show the simple regret after function evaluations for different values of for the values of that are too low the algorithm does not explore enough and is stuck in local maximum while for values of too high the algorithm wastes evaluations by exploring too much in this paper we provide new algorithm poo parallel optimistic optimization which competes with the best algorithms that assume the knowledge of the function smoothness for larger class of functions than was previously done indeed poo handles panoply of functions including hard instances such that like the function illustrated above we also recover the result of stosoo and atb for functions with in particular we bound the poo simple regret as rn this result should be compared to the simple regret of the best known algorithm that uses the knowledge of the metric under which the function is smooth or equivalently which is of the order of ln thus poo performance is at most factor of ln away from that of the best known optimization algorithms that require the knowledge of the function smoothness interestingly this factor decreases with the complexity measure the harder the function to optimize the less important it is to know its precise smoothness background and assumptions hierarchical optimistic optimization poo optimizes functions without the knowledge of their smoothness using subroutine an anytime algorithm optimizing functions using the knowledge of their smoothness in this paper we use modified version of hoo as such subroutine therefore we embark with quick review of hoo hoo follows an optimistic strategy close to uct but unlike uct it uses proper confidence bounds to provide theoretical guarantees hoo refines partition of the space based on hierarchical partitioning where at each step yet unexplored cell leaf of the corresponding tree is selected and the function is evaluated at point within this cell the selected path from the root to the leaf is the one that maximizes the minimum 
value U_{h,i} among all cells of each depth, where the value U_{h,i} of any cell P_{h,i} is defined as

U_{h,i} = μ̂_{h,i} + sqrt((2 ln t) / T_{h,i}) + ν ρ^h,

where t is the number of evaluations done so far, μ̂_{h,i} is the empirical average of all evaluations done within P_{h,i}, and T_{h,i} is the number of them. The second term in the definition of U_{h,i} is a UCB-type confidence interval measuring the estimation error induced by the noise. The third term, νρ^h, is by assumption a bound on the difference f(x*) − f(x) for any x in a cell P_{h,i} of depth h containing a global optimum x*. It is through this bound that HOO relies on the knowledge of the smoothness, because the algorithm requires the values of ν and ρ. In the next sections we clarify the assumptions made by HOO and related algorithms and point out the differences with POO.

Assumptions made in prior work. Most of the previous work relies on the knowledge of a semi-metric ℓ on X such that the function is either locally smooth near one of its maxima with respect to this semi-metric, or they require a stronger assumption; furthermore, Kleinberg et al. assume a full metric. Note that a semi-metric does not require the triangle inequality to hold: for instance, consider ℓ(x, y) = ‖x − y‖^α on R^p, with ‖·‖ being the Euclidean metric; when α > 1, this ℓ does not satisfy the triangle inequality, but it is a metric for α ≤ 1. Therefore, using only a semi-metric allows us to consider a larger class of functions. Prior work typically requires two assumptions. The first one is on ℓ and the function f. An example is the assumption needed by Bubeck et al., which requires that f(x*) − f(y) ≤ f(x*) − f(x) + max{f(x*) − f(x), ℓ(x, y)}; it is a weak version of a Lipschitz condition, restricting f in particular for values close to f(x*). More recent results assume only a local smoothness around one of the function maxima. The second common assumption links the hierarchical partitioning with the semi-metric ℓ: it requires the partitioning to be adapted to the semi-metric. More precisely, the assumption states that there exist ν₁ ≥ ν₂ > 0 such that for any depth h and index i ≤ I_h, the subset P_{h,i} is contained by and contains two open balls, of radius ν₁ρ^h and ν₂ρ^h respectively, where the balls are taken with respect to the same ℓ used in the definition of the function smoothness.

Local smoothness is weaker than weakly Lipschitz and therefore preferable. Algorithms requiring the local-smoothness assumption always sample a cell P_{h,i} at a special representative point and, in the stochastic case, collect several function evaluations from the same point before splitting the cell. This is not the case of HOO, which allows one to sample any point inside the selected cell and to expand each cell after one sample. This additional flexibility comes at the price of requiring the stronger assumption. Nevertheless, although HOO does not wait before expanding a cell, it does something similar by selecting the path from the root to the leaf that maximizes the minimum of the U-values over the cells of the path. As mentioned earlier, the fact that HOO follows an optimistic strategy even after reaching the cell that possesses the minimal U-value along the path is not used in the analysis of the HOO algorithm. Furthermore, a reason for the better dependency on the smoothness in other algorithms (HCT) is not only algorithmic: HCT needs to assume a slightly stronger condition on the cell, namely that the single center of the two balls (the one that covers the cell and the one it contains) is actually the same point that HCT uses for sampling. This is stronger than just assuming that there simply exist such centers of the two balls, not necessarily equal to the points where we sample, which is the HOO assumption; this is in contrast with HOO, which samples any point from the cell. In fact, it is straightforward to modify HOO to only sample at a representative point in each cell and only require the local smoothness assumption. In our
analysis and the algorithm we use this modified version of hoo thereby profiting from this weaker assumption prior work often defined some dimension of the space of measured according to the metric for example the dimension measures the size of the space xε in terms of packing numbers for any the dimension of with respect to is defined as inf xcε where for any subset the packing number is the maximum number of disjoint balls of radius contained in our assumption contrary to the previous approaches we need only single assumption we do not introduce any semi and instead directly relate to the hierarchical partitioning defined in section let be the maximum number of children cells jk per cell ph we remind the reader that given global maximum of denotes the index of the unique cell of depth containing such that ph with this notation we can state our sole assumption on both the partitioning ph and the function assumption there exists and such that ph νρh the values defines lower bound on the possible drop of near the optimum according to the partitioning the choice of the exponential rate νρh is made to cover very large class of functions as well as to relate to results from prior work in particular for standard partitioning on rp and any any function such that fits this assumption this is also the case for more complicated functions such as the one illustrated in figure an example of function and partitioning that does not satisfy this assumption is the function ln and standard partitioning of because the function decreases too fast around as observed by valko this assumption can be weaken to hold only for values of that are to up to an constant in the regret let us note that the set of assumptions made by prior work section can be reformulated using solely assumption for example for any one could consider the semimetric for which the corresponding dimension defined by equation for standard partitioning is yet we argue that our setting provides more natural way to describe the complexity of the optimization problem for given hierarchical partitioning indeed existing algorithms that use hierarchical partitioning of like hoo do not use the full metric information but instead only use the values and paired up with the partitioning hence the precise value of the metric does not impact the algorithms decisions neither their performance what really matters is how the hierarchical partitioning of fits indeed this fit is what we measure to reinforce this argument notice again that any function can be trivially optimized given perfectly adapted partitioning for instance the one that associates to one child of the root also the previous analyses tried to provide performance guaranties based only on the metric and however since the metric is assumed to be such that the cells of the partitioning are well shaped the large diversity of possible metrics vanishes choosing such metric then comes down to choosing only and hierarchical decomposition of another way of seeing this is to remark that previous works make an assumption on both the function and the metric and an other on both the metric and the partitioning we underline that the metric is actually there just to create link between the function and the partitioning by discarding the metric we merge the two assumptions into single one and convert topological problem into combinatorial one leading to easier analysis to proceed we define new dimension for any and the nearoptimality dimension of with respect to the partitioning is defined as follows 
definition dimension of is def inf nh where nh is the number of cells ph of depth such that the hierarchical decomposition of the space is the only prior information available to the algorithm the new dimension is measure of how well is this partitioning adapted to more precisely it is measure of the size of the set the cells which are such that intuitively this corresponds to the set of cells that any algorithm would have to sample in order to discover the optimum as an example any such that for any has zero dimension with respect to the standard partitioning and an appropriate choice of as discussed by valko et al any function such that the upper and lower envelopes of near its maximum are of the same order has dimension of zero for standard partitioning of an example of function with for the standard partitioning is in figure functions that behave differently in different dimensions have also for the standard partitioning nonetheless for some handcrafted partitioning it is possible to have even for those troublesome functions under our new assumption and our new definition of dimension one can prove the same regret bound for hoo as bubeck et al and the same can be done for other related algorithms the poo algorithm description of poo the poo algorithm uses as subroutine an optimizing algorithm that requires the knowledge of the function smoothness we use hoo as the base algorithm but other algorithms such as hct could be used as well poo with pseudocode in algorithm runs several hoo instances in parallel hence the name parallel optimistic optimization the number of base hoo instances and other parameters are adapted to the budget of evaluations and are automatically decided on the fly each instance of hoo requires two real numbers and running hoo parametrized with that are far from the optimal one would cause hoo to underperform surprisingly our analysis of this suboptimality gap reveals that it does not decrease too fast as we stray away from this motivates the following observation if we simultaneously run slew of hoos with different one of them is going to perform decently well in fact we show that to achieve good performance we only require ln hoo instances where is the current number of function evaluations notice that we do not require to know the total number of rounds in advance which hints that we can hope for naturally anytime algorithm algorithm poo parameters ph optional parameters ρmax νmax initialization dmax ln ln number of evaluation performed number of hoo instances νmax ρmax set of hoo instances while computational budget is available do while dmax ln ln do for do start new hoos νmax ρmax perform function evaluation with hoo update the average reward of hoo end for end while ensure there is enough hoos for do perform function evaluation with hoo update the average reward of hoo end for end while output random point evaluated by hoo the strategy of poo is quite simple it consists of running instances of hoo in parallel that are all launched with different at the end of the whole process poo selects the instance which performed the best and returns one of the points selected by this instance chosen uniformly at random note that just using doubling trick in hoo with increasing values of and is not enough to guarantee good performance indeed it is important to keep track of all hoo instances otherwise the regret rate would suffer way too much from using the value of that is too far from the optimal one the parameters satisfying assumption for which is the smallest for 
clarity the of algorithm takes ρmax and νmax as parameters but in appendix we show how to set ρmax and νmax automatically as functions of the number of evaluations ρmax νmax furthermore in appendix we explain how to share information between the hoo instances which makes the empirical performance better since poo is anytime the number of instances is and does not need to be known in advance in fact is increased alongside the execution of the algorithm more precisely we want to ensure that dmax ln ln where def dmax ln ln to keep the set of different well distributed the number of hoos is not increased one by one but instead is doubled when needed moreover we also require that hoos run in parallel perform the same number of function evaluations consequently when we start running new instances we first ensure to make these instances on par with already existing ones in terms of number of evaluations finally as our analysis reveals good choice of parameters ρi is not uniform grid on instead as suggested by our analysis we require that ln is uniform grid on ln as consequence we add hoo instances in batches such that ρi ρmax upper bound on poo regret poo does not require the knowledge of verifying assumption yet we prove that it achieves performance to the one obtained by hoo using the best parameters this result solves the open question of valko et al whether the stochastic optimization of with unknown parameters when for the standard partitioning is possible theorem let rn be the simple regret of poo at step for any verifying assumption such that νmax and ρmax there exists such that for all rn dmax moreover dmax νmax where is constant independent of ρmax and νmax we prove theorem in the appendix and notice that theorem holds for any νmax and ρmax and in particular for the parameters for which is minimal as long as νmax and ρmax in appendix we show how to make ρmax and νmax optional to give some intuition on dmax it is easy to prove that it is the attainable upper bound on the nearoptimality dimension of functions verifying assumption with ρmax moreover any function of lipschitz for the euclidean metric has ln ln for standard partitioning the poo performance should be compared to the simple regret of hoo run with the best parameters and which is of order ln thus poo performance is only factor of ln away from the optimally fitted hoo furthermore we our regret bound for poo is slightly better than the known regret bound for stosoo in the case when for the same partitioning rn ln with our algorithm and analysis we generalize this bound for any value of note that we only give simple regret bound for poo whereas hoo ensures bound on both the cumulative and simple notice that since poo runs several hoos with values of the parameters this algorithm explores much more than optimally fitted hoo which dramatically impacts the cumulative regret as consequence our result applies to the simple regret only note that several of those parameters are possible for the same function up to logarithmic term ln in the simple regret in fact the bound on the simple regret is direct consequence of the bound on the cumulative regret simple regret simple regret hoo hoo hoo hoo poo hoo hoo hoo hoo poo number of evaluations number of evaluation figure regret of poo and hoo run for different values of experiments we ran experiments on the function plotted in figure for hoo algorithms with different values of and the algorithm for ρmax this function as described in section has an upper and lower envelope that are not of the 
same order and therefore has for standard partitioning in figure we show the simple regret of the algorithms as function of the number of evaluations in the figure on the left we plot the simple regret after evaluations in the right one we plot the regret after evaluations in the scale in order to see the trend better the hoo algorithms return random point chosen uniformly among those evaluated poo does the same for the best empirical instance of hoo we compare the algorithms according to the expected simple regret which is the difference between the optimum and the expected value of function value at the point they return we compute it as the average of the value of the function for all evaluated points while we did not investigate possibly different heuristics we believe that returning the deepest evaluated point would give better empirical performance as expected the hoo algorithms using values of that are too low do not explore enough and become quickly stuck in local optimum this is the case for both uct hoo run for and hoo run for the hoo algorithm using that is too high waste their budget on exploring too much this way we empirically confirmed that the performance of the hoo algorithm is greatly impacted by the choice of this parameter for the function we considered in particular at the empirical regret of hoo with was half of the regret of uct in our experiments hoo with performed the best which is bit lower than what the theory would suggest since the performance of hoo using this parameter is almost matched by poo this is surprising considering the fact the poo was simultaneously running different hoos it shows that carefully sharing information between the instances of hoo as described and justified in appendix has major impact on empirical performance indeed among the hoo instances only two on average actually needed fresh function evaluation the could reuse the ones performed by another hoo instance conclusion we introduced poo for global optimization of stochastic functions with unknown smoothness and showed that it competes with the best known optimization algorithms that know this smoothness this results extends the previous work of valko et al which is only able to deal with nearoptimality dimension poo is provable able to deal with trove of functions for which for standard partitioning furthermore we gave new insight on several assumptions required by prior work and provided more natural measure of the complexity of optimizing function given hierarchical partitioning of the space without relying on any metric acknowledgements the research presented in this paper was supported by french ministry of higher education and research regional council doctoral grant of normale in paris inria and carnegie mellon university project eduband and french national research agency project code available at https references peter auer and paul fischer analysis of the multiarmed bandit problem machine learning mohammad gheshlaghi azar alessandro lazaric and emma brunskill online stochastic optimization under correlated bandit feedback in international conference on machine learning bubeck munos and gilles stoltz pure exploration in and bandits theoretical computer science bubeck munos gilles stoltz and csaba bandits journal of machine learning research bubeck gilles stoltz and jia yuan yu lipschitz bandits without the lipschitz constant in algorithmic learning theory adam bull bandits bernoulli richard combes and alexandre unimodal bandits without smoothness arxiv http coquelin and munos 
bandit algorithms for tree search in uncertainty in artificial intelligence robert kleinberg alexander slivkins and eli upfal bandit problems in metric spaces in symposium on theory of computing levente kocsis and csaba bandit based planning in european conference on machine learning munos optimistic optimization of deterministic functions without the knowledge of its smoothness in neural information processing systems munos from bandits to tree search the optimistic principle applied to optimization and planning foundations and trends in machine learning philippe preux munos and michal valko bandits attack function optimization in congress on evolutionary computation aleksandrs slivkins bandits on implicit metric spaces in neural information processing systems michal valko alexandra carpentier and munos stochastic simultaneous optimistic optimization in international conference on machine learning 
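The pieces of the POO strategy described in this paper (a grid of smoothness parameters that is uniform in 1/ln(1/ρ), doubling of the pool of instances, instances kept on an equal evaluation budget, and selection of the empirically best instance) can be put together in a short sketch. This is our own illustration, not the authors' code: the HOO class is a hypothetical stand-in exposing step() (perform one function evaluation), num_evals, average_reward, and sample_point(), and the constant in the pool-size schedule is chosen arbitrarily.

```python
import math

def poo(make_hoo, budget, rho_max=0.9, nu_max=1.0, branching=2):
    """Run parallel HOO-like instances with rho values spread on a grid
    uniform in 1/ln(1/rho); return a point from the empirically best one."""
    d_max = math.log(branching) / math.log(1.0 / rho_max)
    g_max = 1.0 / math.log(1.0 / rho_max)
    instances, n = [], 0
    while n < budget:
        # Grow the pool roughly like Dmax * ln(n) / ln(ln(n)); the constant
        # 1/2 below is illustrative, not taken from the paper.
        m = max(n, 3)
        target = max(1, math.ceil(0.5 * d_max * math.log(m)
                                  / math.log(math.log(m))))
        while len(instances) < target:
            old, new_total = len(instances), max(1, 2 * len(instances))
            # Doubling keeps the old grid points and interleaves new ones.
            new_gs = ([g_max] if old == 0
                      else [g_max * j / new_total
                            for j in range(1, new_total, 2)])
            for g in new_gs:
                h = make_hoo(nu_max, math.exp(-1.0 / g))
                # Bring the newcomer on par with the existing instances.
                for _ in range(instances[0].num_evals if instances else 0):
                    h.step(); n += 1
                instances.append(h)
        for h in instances:            # one more evaluation per instance
            h.step(); n += 1
    best = max(instances, key=lambda h: h.average_reward)
    return best.sample_point()         # e.g. uniform over its evaluated points
```

In an actual implementation the instances would additionally share their function evaluations, which, as the experiments above discuss, is what keeps the cost of running many instances close to that of a single well-tuned HOO.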
recovering communities in the general stochastic block model without knowing the parameters emmanuel abbe department of electrical engineering and pacm princeton university princeton nj eabbe colin sandon department of mathematics princeton university princeton nj sandon abstract the stochastic block model sbm has recently gathered significant attention due to new threshold phenomena however most developments rely on the knowledge of the model parameters or at least on the number of communities this paper introduces efficient algorithms that do not require such knowledge and yet achieve the optimal tradeoffs identified in in the constant degree regime an algorithm is developed that requires only on the relative sizes of the communities and achieves the optimal accuracy scaling for large degrees this requirement is removed for the regime of arbitrarily slowly diverging degrees and the model parameters are learned efficiently for the logarithmic degree regime this is further enhanced into fully agnostic algorithm that achieves the for exact recovery in quasilinear time these provide the first algorithms affording efficiency universality and optimality for strong and weak consistency in the sbm introduction this paper studies the problem of recovering communities in the general stochastic block model with linear size communities for constant and logarithmic degree regimes in contrast to this paper does not require knowledge of the parameters it shows how to learn these from the graph toplogy we next provide some motivations on the problem and further background on the model detecting communities or clusters in graphs is fundamental problem in networks computer science and machine learning this applies to large variety of complex networks social and biological networks as well as to data sets engineered as networks via similarly graphs where one often attempts to get first impression on the data by trying to identify groups with similar behavior in particular finding communities allows one to find people in social networks to improve recommendation systems to segment or classify images to detect protein complexes to find genetically related or discover new tumor subclasses see for references while large variety of community detection algorithms have been deployed in the past decades the understanding of the fundamental limits of community detection has only appeared more recently in particular for the sbm the sbm is canonical model for community detection we use here the notation sbm to refer to random graph ensemble on the where each vertex is assigned independently hidden or planted label σv in under probability distribution pk on and each unordered pair of nodes is connected independently with probability wσu σv where is symmetric matrix with entries in note that sbm denotes random graph drawn under this model without the hidden or planted clusters the labels σv revealed the goal is to recover these labels by observing only the graph recently the sbm came back at the center of the attention at both the practical level due to extensions allowing overlapping communities that have proved to fit well real data sets in massive networks and at the theoretical level due to new phase transition phenomena the latter works focus exclusively on the sbm with two symmetric communities each community is of the same size and the connectivity in each community is identical denoting by the and the probabilities most of the results are concerned with two figure of merits recovery also called exact recovery 
or strong consistency which investigates the regimes of and for which there exists an algorithm that recovers with high probability the two communities completely ii detection which investigates the regimes for which there exists an algorithm that recovers with high probability positively correlated partition the sharp threshold for exact recovery was obtained in that log log exact recovery is solvable if and only if with efficient algorithms achieving the threshold in addition introduces an sdp proved to achieve the threshold in while shows that spectral algorithm also achieves the threshold the sharp threshold for detection was obtained in showing that detection is solvable and so efficiently if and only if when settling conjecture from besides the detection and the recovery properties one may ask about the partial recovery of the communities studied in of particular interest to this paper is the case of strong recovery also called weak consistency where only vanishing fraction of the nodes is allowed to be misclassified for communities shows that strong recovery is possible if and only if diverges extended in for general sbms in the next section we discuss the results for the general sbm of interest in this paper and the problem of learning the model parameters we conclude this section by providing motivations on the problem of achieving the threshold with an efficient and universal algorithm threshold phenomena have long been studied in fields such as information theory shannon capacity and constrained satisfaction problems the sat threshold in particular the quest of achieving the threshold has generated major algorithmic developments in these fields ldpc codes polar codes survey propagation to name few likewise identifying thresholds in community detection models is key to benchmark and guide the development of clustering algorithms however it is particularly crucial to develop benchmarks that do not depend sensitively on the knowledge of the model parameters natural question is hence whether one can solve the various recovery problems in the sbm without having access to the parameters this paper answers this question in the affirmative for the exact and strong recovery of the communities prior results on the general sbm with known parameters most of the previous works are concerned with the sbm having symmetric communities mainly or sometimes with the exception of which provides the first general achievability results for the recently studied fundamental limits for the general model sbm with independent of the results are summarized below recall first the recovery requirements definition recovery requirements an algorithm recovers or detects communities in sbm with an accuracy of if it outputs labelling of the nodes which agrees with the true labelling on fraction of the nodes with probability on the agreement is maximized over relabellings of the communities strong recovery refers to on and exact recovery refers to the problem is solvable if there exists an algorithm that solves it and efficiently if the algorithm runs in in note that exact recovery in sbm requires the graph not to have vertices of degree in multiple communities with high probability therefore for exact recovery we focus on ln where is fixed partial and strong recovery in the general sbm the first result of concerns the regime where the connectivity matrix scales as for positive symmetric matrix the node generalizes this to also study variations of the model average degree is constant the following notion of snr is 
first introduced snr where λmin and λmax are respectively the smallest and largest eigenvalues of diag the algorithm is proposed that solves partial recovery with exponential accuracy and complexity when the snr diverges theorem given any with and symmetric matrix with no two rows equal let be the largest eigenvalue of and be the eigenvalue of with the smallest nonzero magnitude if snr and for some and in graphs cρ cρ drawn from sbm with accuracy exp provided pi that the above is larger than time moreover can be made ln and runs in arbitrarily small with ln ln and αq is independent of note that for symmetric clusters snr reduces to which is the quantity of interest for detection moreover the snr must diverge to ensure strong recovery in the symmetric case the following is an important consequence of the previous theorem stating that solves strong recovery when the entries of are amplified corollary for any with and symmetric matrix with no two rows equal there exist ln such that for all sufficiently large detects communities in sbm with accuracy and complexity on the above gives the optimal scaling both in accuracy and complexity ii exact recovery in the general sbm the second result in is for the regime where the connectivity matrix scales as ln independent of where it is shown that exact recovery has sharp threshold characterized by the divergence function max tf named the in specifically if all pairs of columns in diag are at at least from each other then exact recovery is solvable in the general sbm we refer to section in for discussion on the connection with shannon channel coding theorem and ch kl divergence an algorithm is also developed in that solves exact recovery down to the limit in time showing that exact recovery has no informational to computational gap theorem exact recovery is solvable in sbm ln if and only if min ii the algorithm see solves exact recovery whenever it is solvable and runs in time for all exact and strong recovery are thus solved for the general sbm with communities when the parameters are known we next remove the latter assumption estimating the parameters for the estimation of the parameters some results are known for communities in the logarithmic degree regime since the sdp is agnostic to the parameters it is relaxation of the the parameters can be estimated by recovering the communities for the regime shows that the parameters can be estimated above the threshold by counting cycles which is efficiently approximated by counting walks these are however for communities we also became aware of parallel work which considers private graphon estimation including sbms in particular for the logarithmic degree regime obtains procedure to estimate parameters of graphons in an appropriate version of the norm for the general sbm learning the model was to date mainly open the smallest eigenvalue of diag is the one with least magnitude results agnostic algorithms are developed for the constant and diverging node degrees with independent of these afford optimal accuracy and complexity scaling for large node degrees and achieve the limit for logarithmic node degrees in particular the sbm can be learned efficiently for any diverging degrees note that the assumptions on and being independent of could be slightly relaxed for example to slowly growing but we leave this for future work partial recovery our main result for partial recovery holds in the constant degree regime and requires lower bound on the least relative size of the communities this requirement is removed 
when working with diverging degrees as stated in the corollary below theorem given and for any with pi and min pi and any symmetric matrix with no two rows equal such that every entry in qk is positive in other words such that there is nonzero probability of path between vertices in any two communities in graph drawn from sbm there exist ln such that for all sufficiently large detects communities in graphs drawn from sbm with accuracy at least in on time note that vertex in community has degree with probability exponential in and there is no way to differentiate between vertices of degree from different communities so an error rate that decreases exponentially with is optimal in we provide more detailed version of this theorem which yields quantitate statement on the accuracy of the algorithm in terms of the snr for general sbm corollary if in theorem the knowledge requirement on can be removed exact recovery recall that from exact recovery is and computationally solvable in sbm ln if and only if min we next show that this can be achieved without any knowledge on the parameters for sbm ln theorem the algorithm see section solves exact recovery in any sbm ln for which exact recovery is solvable using no input except the graph in question and runs in time for all in particular exact recovery is efficiently and universally solvable whenever it is solvable proof techniques and algorithms partial recovery and the algorithm simplified version of the algorithm for the symmetric case to ease the presentation of the algorithm we focus first on the symmetric case the sbm with communities of relative size probability of connecting na inside communities and nb across communities let be the average degree definition for any vertex let nr be the set of all vertices with shortest path in to of length we often drop the subscript if the graph in question is the original sbm we also refer to as the vector whose entry is the number of vertices in nr that are in community for an arbitrary vertex and reasonably small there will be typically about dr vertices in nr and about more of them will be in community than in each other community of course this only holds when log log because there are not enough vertices in the graph otherwise the obvious way to try to determine whether or not two vertices and are in the same community is to guess that they are in the same community if nr and different communities otherwise unfortunately whether or not vertex is in nr is not independent of whether or not it is in nr which compromises this plan instead we propose to rely on the following step randomly assign every edge in to some set with fixed probability and then count the number of edges in that connect nr and formally definition for any and subset of edges let nr be the number of pairs such that nr and note that and are disjoint however in sbm is sparse enough that even if the two graphs were generated independently given pair of vertices would have an edge in both graphs with probability so is approximately independent of thus given and denoting by and the two eigvenvalues of in the symmetric case the expected number of neighbors at depth from is approximately whereas the expected number of neighbors at depth from is approximately for each of the other communities all of these are scaled by if we do the computations in using now the emulated independence between and and assuming and to be in the same community the expected number of edges in connecting nr to is approximately given by the inner product ut where and 
is the matrix with on the diagonal and elsewhere when and are in different communities the inner product is between and permutation of after simplifications this gives kδσv nr where δσv is if and are in the same community and otherwise in order for nr to depend on the relative communities of and it must be that is large enough more than so needs to be at least log log difficulty is that for specific pair of vertices the term will be multiplied by random factor dependent on the degrees of and the nearby vertices so in order to stop the variation in the term from drowning out the kδσv term it is necessary to cancel out the dominant term this brings us to introduce the following statistics ir nr kδσv in particular for odd ir will tend to be positive if and are in the same community and negative otherwise irrespective of the specific values of that suggests the following algorithm for partial recovery it requires knowledge of in the constant degree regime but not in the regime where scale with set log log and put each of the graph edges in with probability set kmax and select kmax ln random vertices vkmax ln compute ir vi vj for each and if there is possible assignment of these vertices to communities such that ir vi if and only if vi and vj are in the same community then randomly select one vertex from each apparent community otherwise fail for every in the graph guess that is in the same community as the that maximizes the value of ir this algorithm succeeds as long as to ensure that the above estimates on nr are reliable further if are scaled by setting log log allows removal of the knowledge requirement on in addition playing with to take different allows us to reduce the complexity of the algorithm one alternative to our approach could be to count the walks of given length between and like in instead of using nr however proving that the number of walks is close to its expected value is difficult proving that nr is within desired range is substantially easier because for any and whether or not there is an edge between and directly effects nr for at most one value of algorithms based on shortest path have also been studied in the general case in the general case define nr and nr as in the previous section now for any nr and with probability of approximately as result nr cq cq eσv eσv nr figure the purple edges represent the edges counted by nr let λh be the distinct eigenvalues of ordered so that also define so that if λh and if λh if wi is the eigenspace of corresponding to the eigenvalue λi and pwi is the projection operator on to wi then nr eσv λi pwi eσv pwi where the final equality holds because for all λi pwi eσv pwj pwi eσv qpwj pwi eσv λj pwj and since λi λj this implies that pwi eσv pwj definition let ζi pwi eσv pwi for all and equation is dominated by the term so getting good estimate of the through terms requires cancelling it out somehow as start if then nr note that the left hand side of this expression is equal to det definition let mm be the matrix such that mm for each and as shown in there exists constant λm such that det mm cm ζi where we assumed that above to simplify the discussion the case is similar this suggests the following plan for estimating the eigenvalues corresponding to graph first pick several vertices at random then use the fact that for any good vertex to estimate next take ratios of for and with and look for the smallest making that ratio small enough this will use the estimate on estimating by this value minus one then estimate consecutively all of 
eigenvalues for each selected vertex using ratios of finally take the median of these estimates in general whether or det mm λi det mm det λm λi det λm λm λm ζm λm λm this fact can be used to approximate ζi for arbitrary and of course this requires and to be large enough that ζi is large relative to the error terms for all this requires at least λi for all moreover for any and pwi eσv pwi eσv ζi ζi with equality for all if and only if σv so sufficiently good approximations of ζi ζi and ζi can be used to determine which pairs of vertices are in the same community one could generate reasonable classification based solely on this method of comparing vertices with an appropriate choice of the parameters as later detailed however that would require computing nr for every vertex in the graph with fairly large which would be slow instead we use the fact that for any vertices and with σv ζi ζi ζi ζi for all and the inequality is strict for at least one so subtracting ζi from both sides we have ζi ζi for all and the inequality is still strict for at least one so given representative vertex in each community we can determine which of them given vertex is in the same community as without needing to know the value of ζi this runs fairly quickly if is large and is small because the algorithm only requires focusing on vertices this leads to the following plan for partial recovery first randomly select set of vertices that is large enough to contain at least one vertex from each community with high probability next compare all of the selected vertices in an attempt to determine which of them are in the same communities then pick one in each community call these anchor nodes after that use the algorithm referred to above to determine which community each of the remaining vertices is in as long as there actually was at least one vertex from each community in the initial set and none of the approximations were particularly bad this should give reasonable classification the risk that this randomly gives bad classification due to bad set of initial vertices can be mitigated by repeating the previous classification procedure several times as discussed in this completes the algorithm we refer to for the details exact recovery and the algorithm the exact recovery part is similar to and uses the fact that once good enough clustering has been obtained from the classification can be finished by making local improvements based on the node neighborhoods similar techniques have been used in however we establish here sharp characterization of the local procedure error the key result is that when testing between two multivariate poisson distributions of means log and log respectively where the probability of error of maximum posteriori decoding is this is proved in in the case of unknown parameters the algorithmic approach is largely unchanged adding step where the best known classification is used to estimate and prior to any local improvement step the analysis of the algorithm requires however some careful handling first it is necessary to prove that given labelling of the graph vertices with an error rate of one can compute approximations of and that are within log of their true values with probability secondly one needs to modify the above hypothesis testing estimates to control the error probability in attempting to determine vertices communities based on estimates of and that are off by at most say and one must show that classification of its neighbors that has an error rate of classifies the vertices with an 
error rate only eo log times higher than it would be if the parameter really were and and the vertices neighbors were all classified correctly thirdly one needs to show that since is differentiable with respect to any element of the error rate if the parameters really were and is at worst eo log as high as the error rate with the actual parameters combining these yields the conclusion that any errors in the estimates of the sbm parameters do not disrupt vertex classification any worse than the errors in the preliminary classifications already were the algorithm the inputs are where is graph and see for how to set specifically the algorithm outputs each node label define the graph on the vertex set by selecting each edge in independently with probability and define the graph that contains the edges in that are not in run on with log log to obtain the classification determine the size of all alleged communities and estimate the edge density among these for each node determine the most likely community label of node based on its degree profile computed from the preliminary classification and call it use to get new estimates of and for each node determine the most likely community label of node based on its degree profile computed from output this labelling in step and the most likely label is the one that maximizes the probability that the degree profile comes from multivariate distribution of mean ln for note that this algorithm does not require lower bound on min pi because setting to slowly decreasing function of results in being within an acceptable range for all sufficiently large data implementation and open problems we tested simplified version of our algorithm on real data see for the blog network of adamic and glance we obtained an error rate of about best trial was worst achieving the as described in our extend quite directly to slowly growing number of communities up to logarithmic it would be interesting to extend the current approach to smaller sized watching the complexity scaling as well as to labelededges or overlapping communities though our approach already applies to overlaps acknowledgments this research was partly supported by nsf grant and the bell labs prize references abbe and sandon community detection in general stochastic block models fundamental limits and efficient recovery algorithms to appear in march decelle krzakala moore and asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications phys rev december community detection thresholds and the weak ramanujan property in stoc annual symposium on the theory of computing pages new york united states june mossel neeman and sly proof of the block model threshold conjecture available online at january abbe bandeira and hall exact recovery in the stochastic block model to appear in ieee transactions on information theory available at may mossel neeman and sly consistency thresholds for binary symmetric block models arxiv to appear in july xu chen tradeoffs in planted problems and submatrix localization with growing number of clusters and submatrices february gopalan and blei efficient discovery of overlapping communities in massive networks proceedings of the national academy of sciences holland laskey and leinhardt stochastic blockmodels first steps social networks bui chaudhuri leighton and sipser graph bisection algorithms with good average case behavior combinatorica dyer and frieze the solution of some random problems in polynomial expected time journal of algorithms mark 
jerrum and gregory sorkin the metropolis algorithm for graph bisection discrete applied mathematics condon and karp algorithms for graph partitioning on the planted partition model lecture notes in computer science snijders and nowicki estimation and prediction for stochastic blockmodels for graphs with latent block structure journal of classification january mcsherry spectral partitioning of random graphs in foundations of computer science proceedings ieee symposium on pages bickel and chen nonparametric view of network models and newmangirvan and other modularities proceedings of the national academy of sciences rohe chatterjee and yu spectral clustering and the stochastic blockmodel the annals of statistics choi wolfe and airoldi stochastic blockmodels with growing number of classes biometrika pages vu simple svd algorithm for finding hidden partitions available online at xu hajek wu achieving exact cluster recovery threshold via semidefinite programming november bandeira random laplacian matrices and convex relaxations yun and proutiere accurate community detection in the stochastic block model via spectral algorithms december mossel neeman and sly belief propagation robust reconstruction and optimal recovery of block models arxiv and vershynin community detection in sparse networks via grothendieck inequality november chin rao and vu stochastic block model and community detection in the sparse graphs spectral algorithm with optimal rate of recovery january mossel neeman and sly stochastic block models and reconstruction available online at borgs chayes and smith private graphon estimation for sparse graphs in preparation abbe and sandon recovering communities in the general stochastic block model without knowing the parameters june bordenave lelarge and spectrum of random graphs community detection and ramanujan graphs available at bhattacharyya and bickel community detection in networks using graph distance arxiv january alon and kahale spectral technique for coloring random graphs in siam journal on computing pages zhang zhou gao ma achieving optimal misclassification proportion in stochastic block model 
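The sphere-comparison statistic at the heart of the partial-recovery algorithm described above can be illustrated with a short sketch. This is our own simplification, not the authors' code: edges are split at random, distance-r neighborhoods are grown around two vertices, and the held-out edges joining the two neighborhoods are counted; the exact statistic I_{r,r'}(v, v') in the paper additionally subtracts a correction term and specifies precisely in which subgraph the neighborhoods are computed.

```python
# Illustrative sketch (ours) of the neighborhood-comparison statistic used
# by the partial-recovery algorithm: split the edges at random, grow
# distance-r "spheres" around two vertices, and count held-out edges that
# join the two spheres.
import random
from collections import deque

def split_edges(edges, c=0.1, seed=0):
    """Assign each edge to the held-out set E2 with probability c."""
    rng = random.Random(seed)
    e1, e2 = [], []
    for e in edges:
        (e2 if rng.random() < c else e1).append(e)
    return e1, e2

def sphere(adj, v, r):
    """Vertices at shortest-path distance exactly r from v (BFS)."""
    dist, q = {v: 0}, deque([v])
    while q:
        u = q.popleft()
        if dist[u] == r:
            continue
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return {u for u, d in dist.items() if d == r}

def crossing_count(e2, sphere_v, sphere_vp):
    """Held-out edges with one endpoint in each sphere."""
    return sum((a in sphere_v and b in sphere_vp) or
               (a in sphere_vp and b in sphere_v) for a, b in e2)
```

For suitable r and r', two vertices in the same community yield a systematically larger count than two vertices in different communities, which is what the comparison step of the algorithm exploits.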
deep learning with elastic averaging sgd anna choromanska courant institute nyu achoroma sixin zhang courant institute nyu zsx yann lecun center for data science nyu facebook ai research yann abstract we study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes local workers is based on an elastic force which links the parameters they compute with center variable stored by the parameter server master the algorithm enables the local workers to perform more exploration the algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master we empirically demonstrate that in the deep learning setting due to the existence of many local optima allowing more exploration can lead to the improved performance we propose synchronous and asynchronous variants of the new algorithm we provide the stability analysis of the asynchronous variant in the scheme and compare it with the more common parallelized method admm we show that the stability of easgd is guaranteed when simple stability condition is satisfied which is not the case for admm we additionally propose the version of our algorithm that can be applied in both synchronous and asynchronous settings asynchronous variant of the algorithm is applied to train convolutional neural networks for image classification on the cifar and imagenet datasets experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to downpour and other common baseline approaches and furthermore is very communication efficient introduction one of the most challenging problems in machine learning is how to parallelize the training of large models that use form of stochastic gradient descent sgd there have been attempts to parallelize training for deep learning models on large number of cpus including the google distbelief system but practical image recognition systems consist of convolutional neural networks trained on few gpu cards sitting in single computer the main challenge is to devise parallel sgd algorithms to train deep learning models that yield significant speedup when run on multiple gpu cards in this paper we introduce the elastic averaging sgd method easgd and its variants easgd is motivated by quadratic penalty method but is as parallelized extension of the averaging sgd algorithm the basic idea is to let each worker maintain its own local parameter and the communication and coordination of work among the local workers is based on an elastic force which links the parameters they compute with center variable stored by the master the center variable is updated as moving average where the average is taken in time and also in space over the parameters computed by local workers the main contribution of this paper is new algorithm that provides fast convergent minimization while outperforming downpour method and other baseline approaches in practice simultaneously it reduces the communication overhead between the master and the local workers while at the same time it maintains performance measured by the test error the new algorithm applies to deep learning settings such as parallelized training of convolutional neural networks the article is organized as follows section explains the problem setting section presents the synchronous easgd 
algorithm and its asynchronous and variants section provides stability analysis of easgd and admm in the scheme section shows experimental results and section concludes the supplement contains additional material including additional theoretical analysis problem setting consider minimizing function in parallel computing environment with workers and master in this paper we focus on the stochastic optimization problem of the following form min where is the model parameter to be estimated and is random variable that follows the probabilr ity distribution over such that dξ the optimization problem in equation can be reformulated as follows min xi kxi where each follows the same distribution thus we assume each worker can sample the entire dataset in the paper we refer to xi as local variables and we refer to as center variable the problem of the equivalence of these two objectives is studied in the literature and is known as the augmentability or the global variable consensus problem the quadratic penalty term in equation is expected to ensure that local workers will not fall into different attractors that are far away from the center variable this paper focuses on the problem of reducing the parameter communication overhead between the master and local workers the problem of data communication when the data is distributed among the workers is more general problem and is not addressed in this work we however emphasize that our problem setting is still highly under the communication constraints due to the existence of many local optima easgd update rule the easgd updates captured in resp equation and are obtained by taking the gradient descent step on the objective in equation with respect to resp variable xi and xit gti xit xit xit where gti xit denotes the stochastic gradient of with respect to xi evaluated at iteration xit and denote respectively the value of variables xi and at iteration and is the learning rate the update rule for the center variable takes the form of moving average where the average is taken over both space and time denote ηρ and pα then equation and become xit ηgti xit xit note that choosing pα leads to an elastic symmetry in the update rule there exists an symmetric force equal to xit between the update of each xi and it has crucial influence on the algorithm stability as will be explained in section also in order to minimize the staleness of the difference xit between the center and the local variable the update for the master in equation involves xit instead of note also that ηρ where the magnitude of represents the amount of exploration we allow in the model in particular small allows for more exploration as it allows xi to fluctuate further from the center the distinctive idea of easgd is to allow the local workers to perform more exploration small and the master to perform exploitation this approach differs from other settings explored in the literature and focus on how fast the center variable converges in this paper we show the merits of our approach in the deep learning setting asynchronous easgd we discussed the synchronous update of easgd algorithm in the previous section in this section we propose its asynchronous variant the local workers are still responsible for updating the local variables xi whereas the master is updating the center variable each worker maintains its own clock ti which starts from and is incremented by after each stochastic gradient update of xi as shown in algorithm the master performs an update whenever the local workers finished steps 
of their gradient updates where we refer to as the communication period as can be seen in algorithm whenever divides the local clock of the ith worker the ith worker communicates with the master and requests the current value of the center variable the worker then waits until the master sends back the requested parameter value and computes the elastic difference this entire procedure is captured in step in algorithm the elastic difference is then sent back to the master step in algorithm who then updates the communication period controls the frequency of the communication between every local worker and the master and thus the between exploration and exploitation algorithm asynchronous eamsgd processing by worker and the master input learning rate moving rate communication period momentum term initialize is initialized randomly xi ti repeat xi if divides ti then xi xi end δv ηgtii δv xi xi ti ti until forever algorithm asynchronous easgd processing by worker and the master input learning rate moving rate communication period initialize is initialized randomly xi ti repeat xi if divides ti then xi xi end xi xi ηgtii ti ti until forever momentum easgd the momentum easgd eamsgd is variant of our algorithm and is captured in algorithm it is based on the nesterov momentum scheme where the update of the local worker of the form captured in equation is replaced by the following update δvti ηgti xit δvti xit ηρ xit where is the momentum term note that when we recover the original easgd algorithm as we are interested in reducing the communication overhead in the parallel computing environment where the parameter vector is very large we will be exploring in the experimental section the asynchronous easgd algorithm and its variant in the relatively large regime less frequent communication stability analysis of easgd and admm in the scheme in this section we study the stability of the asynchronous easgd and admm methods in the roundrobin scheme we first state the updates of both algorithms in this setting and then we study their stability we will show that in the quadratic case admm algorithm can exhibit chaotic behavior leading to exponential divergence the analytic condition for the admm algorithm to be stable is still unknown while for the easgd algorithm it is very the analysis of the synchronous easgd algorithm including its convergence rate and its averaging property in the quadratic and strongly convex case is deferred to the supplement in our setting the admm method involves solving the following minimax max minp xi λi xi kxi where λi are the lagrangian multipliers the resulting updates of the admm algorithm in the scheme are given next let be global clock at each we linearize the function xit as in the updates become xi with xit xit xi xit λt xit if mod if mod λit xt xit if mod xt if mod each local variable xi is periodically updated with period first the lagrangian multiplier λi is updated with the dual ascent update as in equation it is followed by the gradient descent update of the local variable as given in equation then the center variable is updated with the most recent values of all the local variables and lagrangian multipliers as in equation note that since the step size for the dual ascent update is chosen to be by convention we have the lagrangian multiplier to be λit λit in the above updates the easgd algorithm in the scheme is defined similarly and is given below xt xit xit if mod if mod xit xt mod at time only the local worker whose index equals modulo is activated and performs 
the update in equations which is followed by the master update given in equation we will now focus on the quadratic case without noise for the admm algorithm let the state of the dynamical system at time be st λpt xpt the local worker updates in equations and are composed of three linear maps which can be written as st for simplicity we will only write them out below for the case when and ηρ ηρ for each of the linear maps it possible to find simple condition such that each map where the ith map has the form is stable the absolute value of the eigenvalues of the map are this condition resembles the stability condition for the synchronous easgd algorithm condition for in the analysis in the supplement the convergence analysis in is based on the assumption that at any master iteration updates from the workers have the same probability of arriving at the which is not satisfied in the scheme smaller or equal to one however when these maps are composed one after another as follows the resulting map can become unstable more precisely some eigenvalues of the map can sit outside the unit circle in the complex plane we now present the numerical conditions for which the admm algorithm becomes unstable in the scheme for and by computing the largest absolute eigenvalue of the map figure summarizes the obtained result eta eta rho rho figure the largest absolute eigenvalue of the linear map as function of and when and to simulate the chaotic behavior of the admm algorithm one may pick and and initialize the state either randomly or with figure should be read in color on the other hand the easgd algorithm involves composing only symmetric linear maps due to the elasticity let the state of the dynamical system at time be st xpt the activated local worker update in equation and the master update in equation can be written as st in case of the map and are defined as follows for the composite map to be stable the condition that needs to be satisfied is actually the same for each and is furthermore independent of since each linear map is symmetric it essentially involves the stability of the matrix whose two real eigenvalues satisfy the resulting stability condition is simple and given as experiments in this section we compare the performance of easgd and eamsgd with the parallel method downpour and the sequential method sgd as well as their averaging and momentum variants all the parallel comparator methods are listed downpour the of the implementation of downpour used in this paper is enclosed in the supplement momentum downpour mdownpour where the nesterov momentum scheme is applied to the master update note it is unclear how to apply it to the local workers or for the case when the is in the supplement method that we call adownpour where we compute the average over time of the center variable as follows zt and is moving rate and denotes the master clock which is initialized to and incremented every time the center variable is updated method that we call mvadownpour where we compute the moving average of the center variable as follows zt and the moving rate was chosen to be constant and denotes the master clock and is defined in the same way as for the adownpour method we have compared asynchronous admm with easgd in our setting as well the performance is nearly the same however admm momentum variant is not as stable for large communication periods all the sequential comparator methods are listed below sgd with constant learning rate momentum sgd msgd with constant momentum asgd with moving rate mvasgd with 
moving rate set to a constant.

We perform experiments in a deep learning setting on two benchmark datasets: CIFAR (we refer to it as CIFAR) and ImageNet ILSVRC (we refer to it as ImageNet). We focus on the image classification task with deep convolutional neural networks. We next explain the experimental setup; the details of the data preprocessing and prefetching are deferred to the supplement.

Experimental setup. For all our experiments we use nodes interconnected with InfiniBand. Each node has Titan GPU processors, and each local worker corresponds to one GPU processor. The center variable of the master is stored and updated on the centralized parameter server. To describe the architecture of the convolutional neural network we first introduce notation: the size of the input image to each layer is described by the number of color channels together with the horizontal and vertical dimension of the input (the two dimensions being equal), and we use separate operators for convolution, max pooling, the linear map with a given dropout rate, and the linear map with softmax output. We use the negative log-likelihood (NLL) loss, and all inner layers use rectified linear units. For the ImageNet experiment we use an approach similar to prior work, with the following convolutional neural network. For the CIFAR experiment we use an approach similar to prior work, with the following convolutional neural network. In our experiments all the methods we run use the same initial parameter, chosen randomly, except that we set all the biases to zero for the CIFAR case and to a different constant for the ImageNet case; this parameter is used to initialize the master and all the local workers. We add a regularization term proportional to ||x|| to the loss function; the regularization weight is set separately for ImageNet and for CIFAR. We also compute the stochastic gradient using mini-batches of a fixed sample size.

Experimental results. For all experiments in this section we use EASGD with a fixed setting of its parameters, for all momentum-based methods we set the momentum term to a fixed value, and finally for MVADOWNPOUR we set the moving rate to a fixed value. We start with the experiment on the CIFAR dataset with local workers running on a single computing node. For all the methods we examined communication periods from a fixed set. For comparison we also report the performance of MSGD, which outperformed SGD, ASGD and MVASGD, as shown in a figure in the supplement. For each method we examined a wide range of learning rates; the learning rates explored in all experiments are summarized in a table in the supplement. The CIFAR experiment was run multiple times independently from the same initialization, and for each method we report its best performance, measured by the smallest achievable test error.

From the results in the figure we conclude that all DOWNPOUR-based methods achieve their best performance (test error) for small communication periods and become highly unstable for larger ones, while EAMSGD significantly outperforms the comparator methods for all values of the communication period by having faster convergence; it also finds a better solution as measured by the test error, and this advantage becomes more significant for larger communication periods. Note that the tendency to achieve better test performance with a larger communication period is also characteristic of the EASGD algorithm. (The datasets were downloaded from http . Our implementation is available at https . On the contrary, initializing the local workers and the master with different random seeds traps the algorithm in the symmetry-breaking phase. Intuitively, the effective β is pα = pηρ in the asynchronous setting.)
Figure: training and test loss (NLL) and the test error for the center variable versus wall-clock time (min) for different communication periods on the CIFAR dataset with the convolutional neural network; curves are shown for MSGD, DOWNPOUR, ADOWNPOUR, MVADOWNPOUR, MDOWNPOUR, EASGD and EAMSGD.

We next explore different numbers of local workers, drawn from one set of values for the CIFAR experiment and another for ImageNet. For the ImageNet experiment we report the results of one run with the best setting we have found: EASGD and EAMSGD were run with one communication period, whereas DOWNPOUR and MDOWNPOUR were run with another. The results are shown in the figures below. For the CIFAR experiment it is noticeable that the lowest achievable test error of either EASGD or EAMSGD decreases with a larger number of workers; this can potentially be explained by the fact that more workers allow for more exploration of the parameter space. In the supplement we discuss further the trade-off between exploration and exploitation as a function of the learning rate and of the communication period. Finally, the results obtained for the ImageNet experiment also show the advantage of EAMSGD over the competitor methods.

Conclusion. In this paper we describe a new algorithm called EASGD and its variants for training deep neural networks in the stochastic setting when the computations are parallelized over multiple GPUs. Experiments demonstrate that this new algorithm quickly achieves improvement in test error compared to more common baseline approaches such as DOWNPOUR and its variants. We show that our approach is very stable and plausible under communication constraints. We provide the stability analysis of the asynchronous EASGD in this setting and show the theoretical advantage of the method over ADMM. The different behavior of the EASGD algorithm from its momentum-based variant EAMSGD is intriguing and will be studied in future works. (For the ImageNet experiment the training loss is measured on a subset of the training data.)

Figure: training and test loss (NLL) and the test error for the center variable versus wall-clock time (min) for different numbers of local workers used by the parallel methods, with MSGD shown for reference, on CIFAR with the convolutional neural network; curves are shown for MSGD, DOWNPOUR, MDOWNPOUR, EASGD and EAMSGD. EAMSGD achieves significant accelerations compared to the other methods; the best comparator method to achieve the same test error is then MSGD.

Figure: training and test loss (NLL) and the test error for the center variable versus wall-clock time (hours) for different numbers of local workers, with MSGD shown for reference, on ImageNet with the convolutional neural network; curves are shown for MSGD, DOWNPOUR, EASGD and EAMSGD. The initial learning rate is decreased twice by a multiplicative factor, when we observe that the online predictive loss stagnates. EAMSGD achieves significant accelerations compared to the other methods; the best comparator method to achieve the same test error is then DOWNPOUR, and simultaneously EAMSGD reduces the communication overhead: DOWNPOUR uses a certain communication period and EAMSGD
uses acknowledgments the authors thank power li for implementation guidance bruna henaff farabet szlam bakhtin for helpful discussion combettes bengio and the referees for valuable feedback references bottou online algorithms and stochastic approximations in online learning and neural networks cambridge university press dean corrado monga chen devin le mao ranzato senior tucker yang and ng large scale distributed deep networks in nips krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in advances in neural information processing systems pages sermanet eigen zhang mathieu fergus and lecun overfeat integrated recognition localization and detection using convolutional networks arxiv nocedal and wright numerical optimization second edition springer new york polyak and juditsky acceleration of stochastic approximation by averaging siam journal on control and optimization bertsekas and tsitsiklis parallel and distributed computation prentice hall hestenes optimization theory the finite dimensional case wiley boyd parikh chu peleato and eckstein distributed optimization and statistical learning via the alternating direction method of multipliers found trends mach shamir fundamental limits of online and distributed algorithms for statistical learning and estimation in nips yadan adams taigman and ranzato training of convnets in arxiv paine jin yang lin and huang gpu asynchronous stochastic gradient descent to speed up neural network training in arxiv seide fu droppo li and yu stochastic gradient descent and application to dataparallel distributed training of speech dnns in interspeech september bekkerman bilenko and langford scaling up machine learning parallel and distributed approaches camridge universityy press choromanska henaff mathieu arous and lecun the loss surfaces of multilayer networks in aistats ho cipar cui lee kim gibbons gibson ganger and xing more effective distributed ml via stale synchronous parallel parameter server in nips azadi and sra towards an optimal stochastic alternating direction method of multipliers in icml borkar asynchronous stochastic approximations siam journal on control and optimization bertsekas and borkar distributed asynchronous incremental subgradient methods in inherently parallel algorithms in feasibility and optimization and their applications volume of studies in computational mathematics pages langford smola and zinkevich slow learners are fast in nips agarwal and duchi distributed delayed stochastic optimization in nips recht re wright and niu hogwild approach to parallelizing stochastic gradient descent in nips zinkevich weimer smola and li parallelized stochastic gradient descent in nips nesterov smooth minimization of functions math lan an optimal method for stochastic composite optimization mathematical programming sutskever martens dahl and hinton on the importance of initialization and momentum in deep learning in icml zhang and kwok asynchronous distributed admm for consensus optimization in icml ouyang he tran and gray stochastic alternating direction method of multipliers in proceedings of the international conference on machine learning pages wan zeiler zhang lecun and fergus regularization of neural networks using dropconnect in icml conconi and gentile on the generalization ability of learning algorithms ieee transactions on information theory nesterov introductory lectures on convex optimization volume springer science business media 
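Before moving to the next topic, the elastic-averaging update at the heart of the algorithm above can be summarized in a short sketch. The following Python/NumPy code illustrates one synchronous EASGD round for p local workers under the convention used above that the coupling strength is α = ηρ and the center's effective moving rate is β = pα; the function stochastic_grad and the values of eta and rho are placeholders introduced for this example and are not taken from the paper.

import numpy as np

def easgd_round(workers, center, stochastic_grad, eta=0.01, rho=1.0):
    # One synchronous EASGD round (illustrative sketch, not the authors' code).
    # workers : list of parameter vectors x^1, ..., x^p, one per local worker
    # center  : center variable maintained by the parameter server
    # stochastic_grad(x) : returns a stochastic gradient at x (placeholder)
    p = len(workers)
    alpha = eta * rho            # elastic coupling strength per worker
    beta = p * alpha             # effective moving rate of the center variable

    new_workers = []
    for x in workers:
        g = stochastic_grad(x)
        # gradient step plus an elastic pull toward the center variable
        new_workers.append(x - eta * g - alpha * (x - center))

    # the center moves toward the average of the (old) worker variables
    new_center = (1 - beta) * center + beta * np.mean(workers, axis=0)
    return new_workers, new_center

Written this way, the center update is a moving average with rate β, which is the quantity whose stability as a function of η and ρ is analyzed above.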
monotone function maximization with size constraints yuichi yoshida national institute of informatics and preferred infrastructure yyoshida naoto ohsaka the university of tokyo ohsaka abstract function is generalization of submodular function where the input consists of disjoint subsets instead of single subset of the domain many machine learning problems including influence maximization with kinds of topics and sensor placement with kinds of sensors can be naturally modeled as the problem of maximizing monotone functions in this paper we give approximation algorithms for maximizing monotone ksubmodular functions subject to several size constraints the running time of our algorithms are almost linear in the domain size we experimentally demonstrate that our algorithms outperform baseline algorithms in terms of the solution quality introduction the task of selecting set of items subject to constraints on the size or the cost of the set is versatile in machine learning problems the objective can be often modeled as maximizing function with the diminishing return property where for finite set function satisfies the diminishing return property if for any and for example sensor placement influence maximization in social networks document summarization and feature selection involve objectives satisfying the diminishing return property it is well known that the diminishing return property is equivalent to submodularity where function is submodular if holds for any when the objective function is submodular and hence satisfies the diminishing return property we can find in polynomial time solution with provable guarantee on its solution quality even with various constraints in many practical applications however we want to select several disjoint sets of items instead of single set to see this let us describe two examples influence maximization viral marketing is marketing strategy that promotes products by giving free or discounted items to selected group of highly influential people in the hope that through the effects large number of product adoptions will occur suppose that we have kinds of items each having different topic and thus different effect then we want to distribute these items to people selected from group of people so as to maximize the expected number of product adoptions it is natural to impose constraint that each person can receive at most one item since giving many free items to one particular person would be unfair sensor placement there are kinds of sensors for different measures such as temperature humidity and illuminance suppose that we have bi many sensors of the kind for each and there is set of locations each of which can be instrumented with exactly one sensor then we want to allocate those sensors so as to maximize the information gain when these problems can be modeled as maximizing monotone submodular functions and admit unfortunately however the case of general can not be modeled as maximizing submodular functions and we can not apply the methods in the literature on maximizing submodular functions we note that the problem of selecting disjoint sets can be sometimes modeled as maximizing monotone submodular functions over the extended domain subject to partition matroid although algorithms are known the running time is around and is prohibitively slow our contributions to address the problem of selecting disjoint sets we use the fact that the objectives can be often modeled as functions let xk xi xi xj be the family of disjoint sets then function is called if for 
any xk and yk in we have where xk yk xi yi xk yk xi yi roughly speaking captures the property that if we choose exactly one set xe xk that an element can belong to for each then the resulting function is submodular see section for details when coincides with submodularity in this paper we give approximation algorithms for maximizing monotone ksubmodular functions with several constraints on the sizes of the sets here we say that is monotone if for any xk and yk with xi yi for each let be the size of the domain for the total size constraint under which the total size of the sets is bounded by we show that simple greedy algorithm outputs in knb time the approximation ratio of is asymptotically tight since the lower bound of for any is known even when combining the random sampling technique we also give randomized algorithm that outputs with probability at least in kn log log time hence even when is as large as the running time is almost linear in for the individual size constraint under which the size of the set is bounded by bi for each we give algorithm with pk running time knb where bi we then give randomized algorithm that outputs with probability at least in log log time to show the practicality of our algorithms we apply them to the influence maximization problem and the sensor placement problem and we demonstrate that they outperform previous methods based on submodular function maximization and several baseline methods in terms of the solution quality related work when is called bisubmodularity and applied bisubmodular functions to machine learning problems however their algorithms do not have any approximation guarantee huber and kolmogorov introduced as generalization of submodularity and bisubmodularity and minimizing functions was successfully used in computer vision application iwata et al gave algorithm algorithm for maximizing and monotone and functions respectively when there is no constraint organization the rest of this paper is organized as follows in section we review properties of functions sections and are devoted to show algorithms for the total size constraint and algorithms for the individual size constraint respectively we show our experimental results in section we conclude our paper in section algorithm input monotone function and an integer output vector with for to do arg return preliminaries for an integer denotes the set we define partial order on so that for xk and yk in if xi yi for every with we also define xi xk xk for and which is the marginal gain when adding to the set of then it is easy to see the monotonicity of is equivalent to for any xk and and also it is not hard to show see for details that the of implies the orthant submodularity for any with and and the pairwise monotonicity for any and with actually the converse holds theorem ward and function is if and only if is orthant submodular and pairwise monotone it is often convenient to identify with to analyze functions namely we associate xk with by xi for hence we sometimes abuse notation and simply write xk by regarding vector as disjoint sets of we define the support of as supp analogously for and we define suppi let be the zero vector in maximizing functions with the total size constraint in this section we give algorithm to the problem of maximizing monotone functions subject to the total size constraint namely we consider max subject to and where is monotone and is integer greedy algorithm the first algorithm we propose is simple greedy algorithm algorithm we show the following theorem 
algorithm outputs solution by evaluating knb times where the number of evaluations of is clearly knb hence in what follows we focus on analyzing the approximation ratio of algorithm our analysis is based on the framework of consider the iteration of the for loop from line let be the pair greedily chosen in this iteration and let be the solution after this iteration we define let be algorithm input monotone function an integer and failure probability output vector with for to do random subset of size min log bδ uniformly sampled from supp arg return the optimal solution we iteratively define as follows for each let supp supp then we set if and set to be an arbitrary element in otherwise then we define as the resulting vector obtained from by assigning to the element and then define as the resulting vector obtained from by assigning to the element note that supp holds for every and moreover we have for every proof of theorem we first show that for each for each let and then note that and from the monotonicity of it suffices to show that since and are chosen greedily we have since we have from the orthant submodularity combining these two inequalities we establish then we have which implies an almost algorithm by random sampling in this section we improve the number of evaluations of from knb to kn log log where is failure probability our algorithm is shown in algorithm the main difference from algorithm is that we sample sufficiently large subset of and then greedily assign value only looking at elements in we reuse notations and from section and let be in the iteration we iteratively define as follows if is empty then we regard that the algorithm failed suppose is then we set if and set to be an arbitrary element in otherwise finally we define and as in section using and if the algorithm does not fail and are well defined or in other words if is not empty for every then the rest of the analysis is completely the same as in section and we achieve an approximation ratio of hence it suffices to show that are well defined with high probability lemma with probability at least we have for every algorithm input monotone function and integers bk output vector with bi for each and bi for to do suppi bi arg return proof fix if then we clealy have pr otherwise we have pr supp log by the union bound over the lemma follows theorem algorithm outputs solution with probability at least by evaluating at most log log bδ times proof by lemma and the analysis in section algorithm outputs solution with probability at least the number of evaluations of is at most log log kn log log maximizing functions with the individual size constraint in this section we consider the problem of maximizing monotone functions subject to the individual size constraint namely we consider max subject to bi and where is monotone and bk are integers greedy algorithm we first consider simple greedy algorithm described in algorithm we show the following theorem algorithm outputs solution by evaluating at most knb times it is clear that the number of evaluations of is knb the analysis of the approximation ratio is given in appendix an almost algorithm by random sampling we next improve the number of evaluations of from knb to log rithm is given in algorithm in appendix we show the following log our theorem algorithm solution with probability at least by outputs evaluating at most log log times algorithm input monotone function integers bk and failure probability output vector with bi for each and bi for to do suppi bi and loop add 
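To make the greedy rule in the pseudocode above concrete, the following Python sketch spells out the greedy algorithm for the individual size constraint: at every step it assigns the element–position pair with the largest marginal gain among positions whose budget is not yet exhausted, so f is evaluated at most knB times with B = b_1 + ... + b_k, matching the bound stated for the greedy algorithm. The dictionary encoding of a solution and the function signature are assumptions made for this illustration, not the authors' implementation.

def greedy_individual_size(f, n, budgets):
    # Greedy for monotone k-submodular maximization under individual size
    # constraints (illustrative sketch of the greedy rule described above).
    # f       : callable mapping a dict {element: position in 1..k} to a value
    # n       : number of elements in the ground set (labelled 0..n-1)
    # budgets : list [b_1, ..., b_k]; at most b_i elements may receive position i
    k = len(budgets)
    used = [0] * (k + 1)                     # used[i] = elements already in set i
    s = {}
    total_budget = sum(budgets)
    for _ in range(total_budget):
        base = f(s)
        best_gain, best_pair = float("-inf"), None
        for e in range(n):
            if e in s:
                continue                     # the supports must stay disjoint
            for i in range(1, k + 1):
                if used[i] >= budgets[i - 1]:
                    continue                 # set i has exhausted its budget b_i
                gain = f({**s, e: i}) - base # marginal gain of assigning e to set i
                if gain > best_gain:
                    best_gain, best_pair = gain, (e, i)
        if best_pair is None:
            break
        e, i = best_pair
        s[e] = i
        used[i] += 1
    return s

The total size constraint is the special case in which a single budget B is shared across all k positions.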
random element in supp to arg if min log break the loop return then experiments in this section we experimentally demonstrate that our algorithms outperform baseline algorithms and our almost algorithms significantly improve efficiency in practice we conducted experiments on linux server with intel xeon ghz and of main memory we implemented all algorithms in we measured the computational cost in terms of the number of function evaluations so that we can compare the efficiency of different methods independently from concrete implementations influence maximization with topics under the total size constraint we first apply our algorithms to the problem of maximizing the spread of influence on several topics first we describe our information diffusion model called the independent cascade model which generalizes the independent cascade model in the model there are kinds of items each having different topic and thus kinds of rumors independently spread through social network let be social network with an edge probability piu for each edge representing the strength of influence from to on the topic given seed for each the diffusion process of the rumor about the topic starts by activating vertices in suppi independently from other topics then the process unfolds in discrete steps according to the following randomizes rule when vertex becomes active in the step for the first time it is given single chance to activate each current inactive vertex it succeeds with probability piu if succeeds then becomes active in the step whether or not succeeds it can not make any further attempt to activate in subsequent steps the process runs until no more activation is possible the influence spread in the model is defined as the expected total number of verticeshwho eventually becomei active in one of the diffusion processes given seed namely ai suppi where ai suppi is random variable representing the set of activated vertices in the diffusion process of the topic given directed graph edge probabilities piu and budget the problem is to select seed that maximizes subject to it is easy to see that the influence spread function is monotone see appendix for the proof experimental settings we use publicly available dataset of social news website this dataset consists of directed graph where each vertex represents user and each edge represents the friendship between pair of users and log of user votes for stories we set the number of topics to be and estimated edge probabilities on each topic from the log using the method of we set the value of to and compared the following algorithms http single degree random of evaluations influence spread budget budget figure comparison of influence spreads figure the number of influence estimations algorithm algorithm we chose single greedily choose vertices only considering the topic and assign them items of the topic degree choose vertices in decreasing order of degrees and assign them items of random topics random randomly choose vertices and assign them items of random topics for the first three algorithms we implemented the lazy evaluation technique for efficiency for we maintain an upper bound on the gain of inserting each pair to apply the lazy evaluation technique directly for we maintain an upper bound on the gain for each pair and we pick up pair in with the largest gain for each iteration during the process of the algorithms the influence spread was approximated by simulating the diffusion process times when the algorithms terminate we simulated the diffusion process 
times to obtain sufficiently accurate estimates of the influence spread results figure shows the influence spread achieved by each algorithm we only show single among single strategies since its influence spread is the largest and clearly outperform the other methods owing to their theoretical guarantee on the solution quality note that our two methods simulated the diffusion process times to choose seed set which is relatively small because of the high computation cost consequently the approximate value of the influence spread has relatively high variance and this might have caused the greedy method to choose seeds with small influence spreads remark that single works worse than degree for larger than which means that focusing on single topic may significantly degrade the influence spread random shows poor performance as expected figure reports the number of influence estimations of greedy algorithms we note that outperforms which implies that the random sampling technique is effective even when combined with the lazy evaluation technique the number of evaluations of drastically increases when is around since we run out of influential vertices and we need to reevaluate the remaining vertices indeed the slope of after is almost constant in figure which indicates that the remaining vertices have similar influence single is faster than our algorithms since it only considers single topic sensor placement with kinds of measures under the individual size constraint next we apply our algorithms for maximizing functions with the individual size constraint to the sensor placement problem that allows multiple kinds of sensors in this problem we want to determine the placement of multiple kinds of sensors that most effectively reduces the expected uncertainty we need several notions to describe our model let xn be pa set of discrete random variables the entropy of subset of is defined as pr log pr the conditional entropy of having observed is hence in order to reduce the uncertainty of we want to find set of as large entropy as possible now we formalize the sensor placement problem there are kinds of sensors for different measures suppose that we want to allocate bi many sensors of the kind for each and there single single of evaluations entropy single value of figure comparison of entropy value of figure the number of entropy evaluations are set of locations each of which can be instrumented with exactly one sensor let xei be the random variable representing the observation collected from sensor of the kind if it is installed at the location and xe then the problem is to select that maximizes subject to bi for each it is easy xe to see that is monotone see appendix for details experimental settings we use the publicly available intel lab this dataset contains log of approximately million readings collected from sensors deployed in the intel berkeley research lab between february and april we extracted temperature humidity and light values from each reading and discretized these values into several bins of degrees celsius each points each and luxes each respectively hence there are kinds of sensors to be allocated to locations budgets for sensors measuring temperature humidity and light are denoted by and we set where is parameter varying from to we compare the following algorithms algorithm algorithm we chose single allocate sensors of the kind to greedily chosen bj places we again implemented these algorithms with the lazy evaluation technique in similar way to the previous experiment also note 
that single strategies do not satisfy the individual size constraint results figure shows the entropy achieved by each algorithm and clearly outperform single strategies the maximum gap of entropies achieved by and is only figure shows the number of entropy evaluations of each algorithm we observe that outperforms especially when the number of entropy evaluations is reduced by single strategies are faster because they only consider sensors of fixed kind conclusions motivated by applications we proposed approximation algorithms for maximizing monotone functions our algorithms run in almost linear time and achieve the approximation ratio of for the total size constraint and for the individual size constraint we empirically demonstrated that our algorithms outperform baseline methods for maximizing submodular functions in terms of the solution quality improving the approximation ratio of for the individual size constraint or showing it tight is an interesting open problem acknowledgments is supported by jsps for young scientists no mext for scientific research on innovative areas and jst erato kawarabayashi large graph project is supported by jst erato kawarabayashi large graph project http references barbieri bonchi and manco social influence propagation models in icdm pages buchbinder feldman naor and schwartz tight linear time approximation for unconstrained submodular maximization in focs pages calinescu chekuri and maximizing monotone submodular function subject to matroid constraint siam journal on computing domingos and richardson mining the network value of customers in kdd pages filmus and ward monotone submodular maximization over matroid via local search siam journal on computing goldenberg libai and muller talk of the network complex systems look at the underlying process of marketing letters goldenberg libai and muller using complex systems analysis to advance marketing theory development modeling heterogeneity effects on new product growth through stochastic cellular automata academy of marketing science review gridchyn and kolmogorov potts model parametric maxflow and functions in iccv pages huber and kolmogorov towards minimizing functions in combinatorial optimization pages springer berlin heidelberg iwata tanigawa and yoshida improved approximation algorithms for function maximization in soda to appear kempe kleinberg and tardos maximizing the spread of influence through social network in kdd pages ko lee and queyranne an exact algorithm for maximum entropy sampling operations research krause mcmahon guestrin and gupta robust submodular observation selection the journal of machine learning research krause singh and guestrin sensor placements in gaussian processes theory efficient algorithms and empirical studies the journal of machine learning research lin and bilmes summarization via budgeted maximization of submodular functions in pages minoux accelerated greedy algorithms for maximizing submodular set functions optimization techniques mirzasoleiman badanidiyuru karbasi and krause lazier than lazy greedy in aaai pages nemhauser wolsey and fisher an analysis of approximations for maximizing submodular set mathematical programming richardson and domingos mining sites for viral marketing in kdd pages singh guillory and bilmes on bisubmodular maximization in aistats pages sviridenko note on maximizing submodular set function subject to knapsack constraint operations research letters ward and maximizing functions and beyond preliminary version appeared in soda pages 
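The lazy evaluation technique used for the greedy methods in the experiments above (maintaining an upper bound on the gain of inserting each pair (e, i) and re-evaluating a pair only when it reaches the top) can also be sketched in code. The snippet below is an illustrative lazy variant of the plain greedy rule for the total size constraint; it relies on the fact that marginal gains never increase as the solution grows (orthant submodularity), so a stale gain is always a valid upper bound. The signature of f and the heap-based bookkeeping are assumptions for this example; the individual size constraint would additionally track per-position budgets, and the paper's fast algorithms further combine greedy steps with random sampling.

import heapq

def lazy_greedy_total_size(f, n, k, B):
    # Lazy-evaluation variant of the greedy rule (illustrative sketch).
    # f : callable mapping a dict {element: position in 1..k} to a value
    # n : number of elements (ground set 0..n-1); k : number of positions
    # B : total size budget
    s = {}
    base = f(s)
    # max-heap of (negated gain upper bound, element, position, round when computed)
    heap = [(-(f({e: i}) - base), e, i, 0)
            for e in range(n) for i in range(1, k + 1)]
    heapq.heapify(heap)

    for t in range(1, B + 1):
        while heap:
            neg_gain, e, i, stamp = heapq.heappop(heap)
            if e in s:
                continue                      # element already assigned to some set
            if stamp == t:
                s[e] = i                      # bound is fresh: this is the greedy choice
                base = f(s)
                break
            # stale bound: recompute the gain against the current solution and push back
            gain = f({**s, e: i}) - base
            heapq.heappush(heap, (-gain, e, i, t))
        else:
            break                             # no assignable pair remains
    return s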
active learning from weak and strong labelers chicheng zhang uc san diego chichengzhang kamalika chaudhuri uc san diego kamalika abstract an active learner is given hypothesis class large set of unlabeled examples and the ability to interactively query labels to an oracle of subset of these examples the goal of the learner is to learn hypothesis in the class that ﬁts the data well by making as few label queries as possible this work addresses active learning with labels obtained from strong and weak labelers where in addition to the standard active learning setting we have an extra weak labeler which may occasionally provide incorrect labels an example is learning to classify medical images where either expensive labels may be obtained from physician oracle or strong labeler or cheaper but occasionally incorrect labels may be obtained from medical resident weak labeler our goal is to learn classiﬁer with low error on data labeled by the oracle while using the weak labeler to reduce the number of label queries made to this labeler we provide an active learning algorithm for this setting establish its statistical consistency and analyze its label complexity to characterize when it can provide label savings over using the strong labeler alone introduction an active learner is given hypothesis class large set of unlabeled examples and the ability to interactively make label queries to an oracle on subset of these examples the goal of the learner is to learn hypothesis in the class that ﬁts the data well by making as few oracle queries as possible as labeling examples is tedious task for any one person many applications of active learning involve synthesizing labels from multiple experts who may have slightly different labeling patterns while body of recent empirical work has developed methods for combining labels from multiple experts little is known on the theory of actively learning with labels from multiple annotators for example what kind of assumptions are needed for methods that use labels from multiple sources to work when these methods are statistically consistent and when they can yield beneﬁts over plain active learning are all open questions this work addresses these questions in the context of active learning from strong and weak labelers speciﬁcally in addition to unlabeled data and the usual labeling oracle in standard active learning we have an extra weak labeler the labeling oracle is gold standard an expert on the problem domain and it provides high quality but expensive labels the weak labeler is cheap but may provide incorrect labels on some inputs an example is learning to classify medical images where either expensive labels may be obtained from physician oracle or cheaper but occasionally incorrect labels may be obtained from medical resident weak labeler our goal is to learn classiﬁer in hypothesis class whose error with respect to the data labeled by the oracle is low while exploiting the weak labeler to reduce the number of queries made to this oracle observe that in our model the weak labeler can be incorrect anywhere and does not necessarily provide uniformly noisy labels everywhere as was assumed by some previous works plausible approach in this framework is to learn difference classiﬁer to predict where the weak labeler differs from the oracle and then use standard active learning algorithm which queries the weak labeler when this difference classiﬁer predicts agreement our ﬁrst key observation is that this approach is statistically inconsistent false negative 
errors that predict no difference when and differ lead to biased annotation for the target classiﬁcation task we address this problem by learning instead difference classiﬁer that ensures that false negative errors rarely occur our second key observation is that as existing active learning algorithms usually query labels in localized regions of space it is sufﬁcient to train the difference classiﬁer restricted to this region and still maintain consistency this process leads to signiﬁcant label savings combining these two ideas we get an algorithm that is provably statistically consistent and that works under the assumption that there is good difference classiﬁer with low false negative error we analyze the label complexity of our algorithm as measured by the number of label requests to the labeling oracle in general we can not expect any consistent algorithm to provide label savings under all circumstances and indeed our worst case asymptotic label complexity is the same as that of active learning using the oracle alone our analysis characterizes when we can achieve label savings and we show that this happens for example if the weak labeler agrees with the labeling oracle for some fraction of the examples close to the decision boundary moreover when the target classiﬁcation task is agnostic the number of labels required to learn the difference classiﬁer is of lower order than the number of labels required for active learning thus in realistic cases learning the difference classiﬁer adds only small overhead to the total label requirement and overall we get label savings over using the oracle alone related work there has been considerable amount of empirical work on active learning where multiple annotators can provide labels for the unlabeled examples one line of work assumes generative model for each annotator labels the learning algorithm learns the parameters of the individual labelers and uses them to decide which labeler to query for each example consider separate logistic regression models for each annotator while assume that each annotator labels are corrupted with different amount of random classiﬁcation noise second line of work that includes learning assumes that each labeler is an expert over an unknown subset of categories and uses data to measure the expertise in order to optimally place label queries in general it is not known under what conditions these algorithms are statistically consistent particularly when the modeling assumptions do not strictly hold and under what conditions they provide label savings over regular active learning the ﬁrst theoretical work to consider this problem consider model where the weak labeler is more likely to provide incorrect labels in heterogeneous regions of space where similar examples have different labels their formalization is orthogonal to ours while theirs is more natural in setting ours is more natural for ﬁtting classiﬁers in hypothesis class in nips workshop paper have also considered learning from strong and weak labelers unlike ours their work is in the online selective sampling setting and applies only to linear classiﬁers and robust regression study learning from multiple teachers in the online selective sampling setting in model where different labelers have different regions of expertise finally there is large body of theoretical work on learning binary classiﬁer based on interactive label queries made to single labeler in the realizable case show that generalization of binary search provides an exponential improvement in 
label complexity over passive learning the problem is more challenging however in the more realistic agnostic case where such approaches lead to inconsistency the two styles of algorithms for agnostic active learning are active learning dbal and the more recent marginbased or active learning our algorithm builds on recent work in dbal preliminaries the model we begin with general framework for actively learning from weak and strong labelers in the standard active learning setting we are given unlabelled data drawn from distribution over an input space label space hypothesis class and labeling oracle to which we can make interactive queries in our setting we additionally have access to weak labeling oracle which we can query interactively querying is signiﬁcantly cheaper than querying however querying generates label yw drawn from conditional distribution pw yw which is not the same as the conditional distribution po yo of let be the data distribution over labelled examples such that pd pu po our goal is to learn classiﬁer in the hypothesis class such that with probability over the sample we have pd pd while making as few interactive queries to as possible observe that in this model may disagree with the oracle anywhere in the input space this is unlike previous frameworks where labels assigned by the weak labeler are corrupted by random classiﬁcation noise with higher variance than the labeling oracle we believe this feature makes our model more realistic second unlike mistakes made by the weak labeler do not have to be close to the decision boundary this keeps the model general and simple and allows greater ﬂexibility to weak labelers our analysis shows that if is largely incorrect close to the decision boundary then our algorithm will automatically make more queries to in its later stages finally note that is allowed to be with respect to the target hypothesis class background on active learning algorithms the standard active learning setting is very similar to ours the only difference being that we have access to the weak oracle there has been long line of work on active learning our algorithms are based on style called active learning dbal the main idea is as follows based on the examples seen so far the algorithm maintains candidate set vt of classiﬁers in that is guaranteed with high probability to contain the classiﬁer in with the lowest error given randomly drawn unlabeled example xt if all classiﬁers in vt agree on its label then this label is inferred observe that with high probability this inferred label is xt otherwise xt is said to be in the disagreement region of vt and the algorithm queries for its label vt is updated based on xt and its label and algorithm continues recent works in dbal have observed that it is possible to determine if an xt is in the disagreement region of vt without explicitly maintaining vt instead labelled dataset st is maintained the labels of the examples in st are obtained by either querying the oracle or direct inference to determine whether an xt lies in the disagreement region of vt two constrained erm procedures are performed empirical risk is minimized over st while constraining the classiﬁer to output the label of xt as and respectively if these two classiﬁers have similar training errors then xt lies in the disagreement region of vt otherwise the algorithm infers label for xt that agrees with the label assigned by more deﬁnitions and notation the error of classiﬁer under labelled data distribution is deﬁned as errq we use the notation err to 
denote its empirical error on labelled data set we use the notation to denote the classiﬁer with the lowest error under and to denote its error errd where is the target labelled data distribution our active learning algorithm implicitly maintains set for throughout the algorithm given set of labelled examples set of classiﬁers is said to be conﬁdence set for with respect to if with probability over the disagreement between two classiﬁers and under an unlabelled data distribution denoted by ρu is observe that the disagreements under form pseudometric over we use bu to denote ball of radius centered around in this metric the disagreement region of set of classiﬁers denoted by dis is the set of all examples such that there exist two classiﬁers and in for which algorithm our main algorithm is standard dbal algorithm with major modiﬁcation when the dbal algorithm makes label query we use an extra to decide whether this query should be made to the oracle or the weak labeler and make it accordingly how do we make this decision we try to predict if weak labeler differs from the oracle on this example if so query the oracle otherwise query the weak labeler key idea cost sensitive difference classiﬁer how do we predict if the weak labeler differs from the oracle plausible approach is to learn difference classiﬁer hd in hypothesis class to determine if there is difference our ﬁrst key observation is when the region where and differ can not be perfectly modeled by the resulting active learning algorithm is statistically inconsistent any false negative errors that is incorrectly predicting no difference made by difference classiﬁer leads to biased annotation for the target classiﬁcation task which in turn leads to inconsistency we address this problem by instead learning difference classiﬁer and we assume that classiﬁer with low false negative error exists in while training we constrain the false negative error of the difference classiﬁer to be low and minimize the number of predicted positives or disagreements between and subject to this constraint this ensures that the annotated data used by the active learning algorithm has diminishing bias thus ensuring consistency key idea localized difference classiﬁer training unfortunately even with training directly learning difference classiﬁer accurately is expensive if is the of the difference hypothesis class to learn target classiﬁer to excess error we need difference classiﬁer with false negative error which from standard generalization theory requires labels our second key observation is that we can save on labels by training the difference classiﬁer in localized manner because the dbal algorithm that builds the target classiﬁer only makes label queries in the disagreement region of the current conﬁdence set for therefore we train the difference classiﬁer only on this region and still maintain consistency additionally this provides label savings because while training the target classiﬁer to excess error we need to train difference classiﬁer with only φk labels where φk is the probability mass of this disagreement region the localized training process leads to an additional technical challenge as the conﬁdence set for is updated its disagreement region changes we address this through an dbal algorithm where the conﬁdence set is updated and fresh difference classiﬁer is trained in each epoch main algorithm our main algorithm algorithm combines these two key ideas and like implicitly maintains the set for by through labeled dataset in epoch the target 
excess error is εk and the goal of algorithm is to generate labeled dataset that implicitly represents δk set on additionally has the property that the empirical risk minimizer over it has excess error εk naive way to generate such an is by drawing labeled examples where is the vc dimension of our goal however is to generate using much smaller number of label queries which is accomplished by algorithm this is done in two ways first like standard dbal we infer the label of any that lies outside the disagreement region of the current conﬁdence set for algorithm identiﬁes whether an lies in this region second for any in the disagreement region we determine whether and agree on using difference classiﬁer if there is agreement we query else we query the difference classiﬁer used to determine agreement is retrained in the beginning of each epoch by algorithm which ensures that the annotation has low bias the algorithms use constrained erm procedure given hypothesis class labeled dataset and set of constraining examples returns classiﬁer in that minimizes the empirical error on subject to xi yi for each xi yi identifying the disagreement region algorithm deferred to the appendix identiﬁes if an unlabeled example lies in the disagreement region of the current set for recall that this conﬁdence set is implicitly maintained through the identiﬁcation is based on two erm queries let be the empirical risk minimizer on the current labeled dataset and be the empirical risk minimizer on under the constraint that if the training errors of and are very different then all classiﬁers with training error close to that of assign the same label to and lies outside the current disagreement region training the difference classiﬁer algorithm trains difference classiﬁer on random set of examples which lies in the disagreement region of the current conﬁdence set for the training process is and is similar to hard bound is imposed on the falsenegative error which translates to bound on the annotation bias for the target task the number of positives the number of examples where and differ is minimized subject to this constraint this amounts to approximately minimizing the fraction of queries made to the number of labeled examples used in training is large enough to ensure false negative error εk over the disagreement region of the current conﬁdence set here φk is the probability mass of this disagreement region under this ensures that the overall annotation bias introduced by this procedure in the target task is at most εk as φk is small and typically diminishes with this requires less labels than training the difference classiﬁer globally which would have required queries to algorithm active learning algorithm from weak and strong labelers input unlabeled distribution target excess error conﬁdence labeling oracle weak oracle hypothesis class hypothesis class for difference classiﬁer output classiﬁer in initialize initial error conﬁdence total number of epochs initial number of examples ln ln draw fresh sample and query for its labels let for do set target excess error εk conﬁdence δk train difference classiﬁer call algorithm with inputs unlabeled distribution oracles and target excess error εk conﬁdence δk previously labeled dataset adaptive active learning using difference classiﬁer σk call algorithm with inputs unlabeled distribution oracles and difference classiﬁer target excess error εk conﬁdence δk previously labeled dataset end for return adaptive active learning using the difference classiﬁer finally algorithm 
deferred to the appendix is our main active learning procedure which generates labeled dataset that is implicitly used to maintain tighter set for speciﬁcally algorithm generates such that the set vk deﬁned as vk err min err has the property that errd errd εk vk errd errd εk this is achieved by labeling through inference or query large enough sample of unlabeled data drawn from labels are obtained from three sources direct inference if lies outside the disagreement region as identiﬁed by algorithm querying if the difference classiﬁer predicts difference and querying how large should the sample be to reach the target excess error if errd then achieving an excess error of requires samples where is the vc dimension of the hypothesis class as is unknown in advance we use doubling procedure in lines to iteratively determine the sample size note that if in algorithm the upper conﬁdence bound of in disagr region is lower than then we can halt algorithm and return an arbitrary hd in using this hd will still guarantee the correctness of algorithm algorithm training algorithm for difference classiﬁer input unlabeled distribution oracles and target error hypothesis class dence previous labeled dataset output difference classiﬁer let be an estimate of in disagr region obtained by calling rithm deferred to the appendix with failure probability let and ln ln repeat draw an example xi from if in disagr region xi then xi is inside the disagreement region query both and for labels to get yi and yi end if xi yi yi until learn classiﬁer based on the following empirical risk minimizer argminhd hd xi hd xi yi yi mε return performance guarantees we now examine the performance of our algorithm which is measured by the number of label queries made to the oracle additionally we require our algorithm to be statistically consistent which means that the true error of the output classiﬁer should converge to the true error of the best classiﬁer in on the data distribution since our framework is very general we can not expect any statistically consistent algorithm to achieve label savings over using alone under all circumstances for example if labels provided by are the complete opposite of no algorithm will achieve both consistency and label savings we next provide an assumption under which algorithm works and yields label savings assumption the following assumption states that difference hypothesis class contains good costsensitive predictor of when and differ in the disagreement region of bu predictor is good if it has low error and predicts positive label with low frequency if there is no such predictor then we can not expect an algorithm similar to ours to achieve label savings assumption let be the joint distribution pd yo yw pu pw yw po yo for any there exists an hηd rf with the following properties pd hdη rf dis bu yo yw pd hdη rf dis bu note that which states there is hd with low error is minimally restrictive and is trivially satisﬁed if includes the constant classiﬁer that always predicts theorem shows that is sufﬁcient to ensure statistical consistency in addition states that the number of positives predicted by the classiﬁer hηd rf is upper bounded by note pu dis bu always performance gain is obtained when is lower which happens when the difference classiﬁer predicts agreement on signiﬁcant portion of dis bu consistency provided assumption holds we next show that algorithm is statistically consistent establishing consistency is for our algorithm as the output classiﬁer is trained on labels from both and 
theorem consistency let be the classiﬁer that minimizes the error with respect to if assumption holds then with probability the classiﬁer output by algorithm satisﬁes errd errd label complexity the label complexity of standard dbal is measured in terms of the disagreement coefﬁcient the disagreement coefﬁcient at scale is deﬁned as pu dis intuitively this measures the rate of shrinkage of the disagreement region with the radius of the ball bu for any in it was shown by that the label plexity of dbal for target excess generalization error is νε where the notation hides factors logarithmic in and in contrast the label complexity of our algorithm can be stated in theorem here we use the notation for convenience we have the same dependence on log and log as the bounds for dbal theorem label complexity let be the vc dimension of and let be the vc dimension of if assumption holds and if the error of the best classiﬁer in on is then with probability the following hold the number of label queries made by algorithm to the oracle in epoch at most mk dis bu εk the total number of label queries made by algorithm to the oracle is at most sup discussion the ﬁrst terms in and represent the labels needed to learn the target classiﬁer and second terms represent the overhead in learning the difference classiﬁer in the realistic agnostic case where as the second terms are lower order compared to the label complexity of dbal thus even if is somewhat larger than ﬁtting the difference classiﬁer does not incur an asymptotically high overhead in the more realistic agnostic case in the realizable case when the second terms are of the same order as the ﬁrst therefore we should use simpler difference hypothesis class in this case we believe that the lower order overhead term comes from the fact that there exists classiﬁer in whose false negative error is very low comparing theorem with the corresponding results for dbal we observe that instead of we have the term since the worst case asymptotic label complexity is the same as that of standard dbal this label complexity may be considerably better however if is less than the disagreement coefﬁcient as we expect this will happen when the region of difference between and restricted to the disagreement regions is relatively small and this region is by the difference hypothesis class an interesting case is when the weak labeler differs from close to the decision boundary and agrees with away from this boundary in this case any consistent algorithm should switch to querying close to the decision boundary indeed in earlier epochs is low and our algorithm obtains good difference classiﬁer and achieves label savings in later epochs is high the difference classiﬁers always predict difference and the label complexity of the later epochs of our algorithm is the same order as dbal in practice if we suspect that we are in this case we can switch to plain active learning once εk is small enough case study linear classﬁcation under uniform distribution we provide simple example where our algorithm provides better asymptotic label complexity than dbal let be the class yo yo yw figure linear classiﬁcation over unit ball with left decision boundary of labeler and the region where differs from is shaded and has probability middle decision boundary of weak labeler right and note that yo yw of homogeneous linear separators on the unit ball and let furthermore let be the uniform distribution over the unit ball suppose that is deterministic labeler such that errd moreover suppose that is 
such that there exists difference classiﬁer with false negative error for which pu additionally we assume that observe that this is not strict assumption on as could be as much as constant figure shows an example in that satisﬁes these assumptions in this case as theorem gives the following label complexity bound corollary with probability the number of label queries made to oracle by algorithm is max νg νε νε where the notation hides factors logarithmic in and as this improves over the label complexity of dbal which is νε conclusion in this paper we take step towards theoretical understanding of active learning from multiple annotators through learning theoretic formalization for learning from weak and strong labelers our work shows that multiple annotators can be successfully combined to do active learning in statistically consistent manner under general setting with few assumptions moreover under reasonable conditions this kind of learning can provide label savings over plain active learning an avenue for future work is to explore more general setting where we have multiple labelers with expertise on different regions of the input space can we combine inputs from such labelers in statistically consistent manner second our algorithm is intended for setting where is biased and performs suboptimally when the label generated by is random corruption of the label provided by how can we account for both random noise and bias in active learning from weak and strong labelers acknowledgements we thank nsf under iis for research support and jennifer dy for introducing us to the problem of active learning from multiple labelers references balcan beygelzimer and langford agnostic active learning comput syst balcan and long active and passive learning of linear separators under logconcave distributions in colt beygelzimer hsu langford and zhang active learning with an erm oracle beygelzimer hsu langford and zhang agnostic active learning without constraints in nips nader bshouty and lynn burroughs maximizing agreements with error with applications to heuristic learning machine learning cohn atlas and ladner improving generalization with active learning machine learning crammer kearns and wortman learning from data of variable quality in nips dasgupta coarse sample complexity bounds for active learning in nips dasgupta hsu and monteleoni general agnostic active learning algorithm in nips dekel gentile and sridharan selective sampling and active learning from single and multiple teachers jmlr donmez and carbonell proactive learning active learning with multiple imperfect oracles in cikm meng fang xingquan zhu bin li wei ding and xindong wu active learning from crowds in icdm pages ieee hanneke bound on the label complexity of agnostic active learning in icml hsu algorithms for active learning phd thesis uc san diego panagiotis ipeirotis foster provost victor sheng and jing wang repeated labeling using multiple noisy labelers data mining and knowledge discovery adam tauman kalai varun kanade and yishay mansour reliable agnostic learning comput syst varun kanade and justin thaler reliable learning in colt lin mausam and weld to re label or not to re label in hcomp lin mausam and weld reactive learning actively trading off larger noisier training sets against smaller cleaner ones in icml workshop on crowdsourcing and machine learning and icml active learning workshop malago and renders online active learning with strong and weak annotators in nips workshop on learning from the wisdom of crowds nowak the 
geometry of generalized binary search ieee transactions on information theory hans ulrich simon in the presence of classiﬁcation noise ann math artif song chaudhuri and sarwate learning from data with heterogeneous noise using sgd in aistats urner and shamir learning from weak teachers in aistats pages vijayanarasimhan and grauman what it going to cost you predicting effort informativeness for image annotations in cvpr pages vijayanarasimhan and grauman active visual category learning ijcv welinder branson belongie and perona the multidimensional wisdom of crowds in nips pages yan rosales fung and dy active learning from crowds in icml pages yan rosales fung farooq rao and dy active learning from multiple knowledge sources in aistats pages zhang and chaudhuri beyond agnostic active learning in nips 
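To make the weak-and-strong-labeler setting analyzed above concrete, the following minimal sketch simulates the case-study geometry: a homogeneous linear separator under a roughly uniform distribution on the unit ball, with a weak labeler that errs only close to the decision boundary. It illustrates the high-level idea only and is not the paper's algorithm: it fits a difference classifier on a small strong-labeled probe set and then queries the strong oracle only where a difference is predicted, falling back to the weak label elsewhere, and it omits the disagreement-region machinery. The helpers oracle and weak_labeler, the 0.1 boundary band, and all sample sizes are assumptions made for the example.

```python
# Illustrative sketch only (not the paper's Algorithm 1): combine a weak labeler W
# with a strong oracle O by learning where they differ, and query O only there.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
w_true = np.array([1.0, 0.0])                      # decision boundary of the strong oracle

def oracle(x):                                     # strong labeler: ground truth
    return (x @ w_true > 0).astype(int)

def weak_labeler(x):                               # weak labeler: flips labels near the boundary
    y = oracle(x).copy()
    near = np.abs(x @ w_true) < 0.1                # assumed width of the unreliable band
    y[near] = 1 - y[near]
    return y

X = rng.normal(size=(4000, 2))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # points on the unit circle (uniform-ish)

labeled_X, labeled_y, queries = [], [], 0
h = None
for epoch in range(5):
    batch = X[rng.choice(len(X), 400, replace=False)]
    probe, rest = batch[:50], batch[50:]
    # Fit a "difference classifier": where does the weak labeler disagree with the oracle?
    diff = (weak_labeler(probe) != oracle(probe)).astype(int)
    queries += len(probe)
    diff_clf = LogisticRegression().fit(probe, diff) if 0 < diff.sum() < len(diff) else None
    # Label the remaining points: query the oracle only where a difference is predicted.
    use_oracle = (diff_clf.predict(rest).astype(bool)
                  if diff_clf is not None else np.zeros(len(rest), dtype=bool))
    y = weak_labeler(rest)
    y[use_oracle] = oracle(rest[use_oracle])
    queries += int(use_oracle.sum())
    labeled_X.append(rest)
    labeled_y.append(y)
    h = LogisticRegression().fit(np.vstack(labeled_X), np.concatenate(labeled_y))

test = rng.normal(size=(2000, 2))
print("strong-label queries:", queries,
      "test error:", float(np.mean(h.predict(test) != oracle(test))))
```

In this regime the number of strong-label queries grows with the probability mass of the region where a difference is predicted, which mirrors the label-complexity discussion above: the weak labeler is trusted away from the boundary and the strong oracle is reserved for the band where the two differ.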
on the optimality of classifier chain for classification weiwei liu ivor centre for quantum computation and intelligent systems university of technology sydney abstract to capture the interdependencies between labels in classification problems classifier chain cc tries to take the multiple labels of each instance into account under deterministic markov chain model since its performance is sensitive to the choice of label order the key issue is how to determine the optimal label order for cc in this work we first generalize the cc model over random label order then we present theoretical analysis of the generalization error for the proposed generalized model based on our results we propose dynamic programming based classifier chain algorithm to search the globally optimal label order for cc and greedy classifier chain algorithm to find locally optimal cc comprehensive experiments on number of data sets from various domains demonstrate that our proposed algorithm outperforms approaches and the ccgreedy algorithm achieves comparable prediction performance with introduction classification where each instance can belong to multiple labels simultaneously has significantly attracted the attention of researchers as result of its various applications ranging from document classification and gene function prediction to automatic image annotation for example document can be associated with range of topics such as sports finance and education gene belongs to the functions of protein synthesis metabolism and transcription an image may have both beach and tree tags one popular strategy for classification is to reduce the original problem into many binary classification problems many works have followed this strategy for example binary relevance br is simple approach for learning which independently trains binary classifier for each label recently dembczynski et al have shown that methods of learning which explicitly capture label dependency will usually achieve better prediction performance therefore modeling the label dependency is one of the major challenges in classification problems many learning models have been developed to capture label dependency amongst them the classifier chain cc model is one of the most popular methods due to its simplicity and promising experimental results cc works as follows one classifier is trained for each label for the th label each instance is augmented with the ith label as the input to train the th classifier given new instance to be classified cc firstly predicts the value of the first label then takes this instance together with the predicted value as the input to predict the value of the next label cc proceeds in this way until the last label is predicted however here is the question does the label order affect the performance of cc apparently yes because different classifier chains involve different corresponding author classifiers trained on different training sets thus to reduce the influence of the label order read et al proposed the ensembled classifier chain ecc to average the predictions of cc over set of random chain ordering since the performance of cc is sensitive to the choice of label order there is another important question is there any globally optimal classifier chain which can achieve the optimal prediction performance for cc if yes how can the globally optimal classifier chain be found to answer the last two questions we first generalize the cc model over random label order we then present theoretical analysis of the generalization error for 
the proposed generalized model our results show that the upper bound of the generalization error depends on the sum of reciprocal of square of the margin over the labels thus we can answer the second question the globally optimal cc exists only when the minimization of the upper bound is achieved over this cc to find the globally optimal cc we can search over different label where denotes the number of labels which is computationally infeasible for large in this paper we propose the dynamic programming based classifier chain algorithm to simplify the search algorithm which requires nd time complexity furthermore to speed up the training process greedy classifier chain algorithm is proposed to find locally optimal cc where the time complexity of the algorithm is nd notations assume xt rd is real vector representing an input or instance feature for denotes the number of training samples yt λq is the corresponding output label yt is used to represent the label set yt where yt if and only if λj yt related work and preliminaries to capture label dependency hsu et al first use compressed sensing technique to handle the classification problem they project the original label space into low dimensional label space regression model is trained on each transformed label recovering from the regression output usually involves solving quadratic programming problem and many works have been developed in this way such methods mainly aim to use different projection methods to transform the original label space into another effective label space another important approach attempts to exploit the different orders and of label correlations following this way some works also try to provide probabilistic interpretation for label correlations for example guo and gu model the label correlations using conditional dependency network pcc exploits markov chain model to capture the correlations between the labels and provide an accurate probabilistic interpretation of cc other works focus on modeling the label correlations in deterministic way and cc is one of the most popular methods among them this work will mainly focus on the deterministic classifier chain classifier chain similar to br the classifier chain cc model trains binary classifiers hj classifiers are linked along chain where each classifier hj deals with the binary classification problem for label λj the augmented vector xt yt yt is used as the input for training classifier given new testing instance classifier in the chain is responsible for predicting the value of using input then predicts the value of taking plus the predicted value of as an input following in this way predicts using the predicted value of as additional input information cc passes label information between classifiers allowing cc to exploit the label dependence and thus overcome the label independence problem of br essentially it builds deterministic markov chain model to capture the label correlations represents the factorial notation ensembled classifier chain different classifier chains involve different classifiers learned on different training sets and thus the order of the chain itself clearly affects the prediction performance to solve the issue of selecting chain order for cc read et al proposed the extension of cc called ensembled classifier chain ecc to average the predictions of cc over set of random chain ordering ecc first randomly reorders the labels λq many times then cc is applied to the reordered labels for each time and the performance of cc is averaged over those 
times to obtain the final prediction performance proposed model and generalization error analysis generalized classifier chain we generalize the cc model over random label order called generalized classifier chain gcc model assume the labels λq are randomly reordered as ζq where ζj λk means label λk moves to position from in the gcc model classifiers are also linked along chain where each classifier hj deals with the binary classification problem for label ζj λk gcc follows the same training and testing procedures as cc while the only difference is the label order in the gcc model for input xt yt if and only if ζj yt generalization error analysis in this section we analyze the generalization error bound of the classification problem using gcc based on the techniques developed for the generalization performance of classifiers with large margin and perceptron decision tree let represent the input space both and are samples drawn independently according to an unknown distribution we denote logarithms to base by log if is set denotes its cardinality means the norm we train support vector machine svm for each label ζj let xt as the feature and yt ζj as the label the output parameter of svm is defined as wj bj sv xt yt yt yt ζj the margin for label ζj is defined as γj we begin with the definition of the fat shattering dimension definition let be set of real valued functions we say that set of points is γshattered by relative to rp if there are real numbers rp indexed by such that for all binary vectors indexed by there is function fb satisfying rp if bp fb rp otherwise the fat shattering dimension at of the set is function from the positive real numbers to the integers which maps value to the size of the largest set if this is finite or infinity otherwise assume is the real valued function class and denotes the loss function the expected error of is defined as erd where drawn from the unknown distribution here we select loss function so erd ers is defined as ers yt xt suppose is the number of with respect to the measuring the maximum discrepancy on the sample the notion of the covering number can be referred to the supplementary materials we introduce the following general corollary regarding the bound of the covering number the expression yt xt evaluates to if yt xt is true and to otherwise corollary let be class of functions and distribution over choose and let at em then log dϵ where the expectation is over samples drawn according to dm we study the generalization error bound of the specified gcc with the specified number of labels and margins let be the set of classifiers of gcc hq ers denotes the fraction of the number of errors that gcc makes on define hj hj if an instance is correctly classified by hj then moreover we introduce the following proposition proposition if an instance is misclassified by gcc model then lemma given specified gcc model with labels and with margins for each label satisfying ki at where at is continuous from the right if gcc has correctly classified examples generated independently according to the unknown but fixed distribution and is set of another examples then we can bound the following probability to be less than gcc model it correctly classifies fraction of log and ki log sified where ki proof of lemma suppose is gcc model with labels and with margins the probability event in lemma can be described as ki at ers let and denote two different set of examples which are drawn from the distribution applying the definition of and proposition the event can also be 
written as ki at ers ri maxt ri mϵ here means the minimal value of which represents the margin for label ζi so let γki min at ki so γki we define the following function if if otherwise so let let represent the minimal γki set of in the we have that for any hi there exists γki for all for all by the definition of ri ri and γki so γki however there are at least mϵ points such that so maxt since only reduces separation between output values we conclude that the inequality maxt holds moreover the mϵ points in with the largest values must remain for the inequality to hold by the permutation argument at most of the sequences obtained by swapping corresponding points satisfy the conditions for fixed as for any hi there exists so there are possibilities of that satisfy the inequality for ki note that is positive integer which is usually bigger than and by the union bound we get the following inequality since every set of points by can be by so atπ where hence by corollary setting to to γki and to γki log log ki where atπ γki γki ki thus and we obtain log ki where provided log log ki log ki and so as required lemma applies to particular gcc model with specified number of labels and specified margin for each label in practice we will observe the margins after running the gcc model thus we must bound the probabilities uniformly over all of the possible margins that can arise to obtain practical bound the generalization error bound of the classification problem using gcc is shown as follows theorem suppose random sample can be correctly classified using gcc model and suppose this gcc model contains classifiers with margins for each label then we can bound the generalization error with probability greater than to be less than log log log where and is the radius of ball containing the support of the distribution before proving theorem we state one key symmetrization lemma and theorem lemma symmetrization let be the real valued function class and are samples both drawn independently according to the unknown distribution if then ps sup ers sup ers the proof details of this lemma can be found in the supplementary material theorem let be restricted to points in ball of dimensions of radius about the origin then ath min proof of theorem we must bound the probabilities over different margins we first use lemma to bound the probability of error in terms of the probability of the discrepancy between the performance on two halves of double sample then we combine this result with lemma we must consider all possible patterns of ki for label ζi the largest value of ki is thus for fixed we can bound the number of possibilities by mq hence there are mq of applications of lemma let ci denote the combination of margins varied in denotes set of gcc models the generalization error of can be represented as erd and ers is where the uniform convergence bound of the generalization error is ps sup ers applying lemma ps sup sup ers let jci gcc model with labels and with margins ci ki at ers clearly sup ers mq mq jci as ki still satisfies ki at lemma can still be applied to each case of jci let δk applying lemma replacing by δk we get jci δk where δk log log ki log ki by the union δk and bound it suffices to show that jci jci δk mq applying lemma ps sup ers sup ers mq mq jci thus ps ers let be the radius of ball containing the support of the distribution applying theorem we get ki at note that we have replaced the constant by in order to ensure the continuity from the right required for the application of lemma we have 
upperbounded log by log thus erd log log log log log where given the training data size and the number of labels theorem reveals one important factor in reducing the generalization error bound for the gcc model the minimization of the sum of reciprocal of square of the margin over the labels thus we obtain the following corollary corollary globally optimal classifier chain suppose random sample with labels can be correctly classified using gcc model this gcc model is the globally optimal classifier chain if and only if the minimization of in theorem is achieved over this classifier chain given the number of labels there are different label orders it is very expensive to find the globally optimal cc which can minimize by searching over all of the label orders next we discuss two simple algorithms optimal classifier chain algorithm in this section we propose two simple algorithms for finding the optimal cc based on our result in section to clearly state the algorithms we redefine the margins with label order information given label set λq suppose gcc model contains classifiers let oi oi denote the order of λi in the gcc model γioi represents the margin for label λi with previous oi labels as the augmented input if oi then represents the margin for label λi without augmented input then is redefined as dynamic programming algorithm to simplify the search algorithm mentioned before we propose the algorithm to find the globally optimal cc note that we oj γj explore the idea of dp to iteratively optimize over subset of with the length of finally we can obtain the optimal over assume let be the optimal over subset of with the length of where the label order is ending by label λi suppose miη represent the corresponding label set for when be the optimal over where the label order is ending by label λi the dp equation is written as min λi where is the margin for label λi with mjη as the augmented input the initial condition of dp is and λi then the optimal over can be obtained by solving assume the training of linear svm takes nd the algorithm is shown as the following procedure from the bottom we first compute which takes nd then we compute λi which requires at most qnd and set λi similarly it takes at most nd time complexity to calculate last we iteratively solve this dp equation and use to get the optimal solution which requires at most nd time complexity theorem correctness of can be minimized by which means this algorithm can find the globally optimal cc the proof can be referred to in the supplementary materials greedy algorithm we propose algorithm to find locally optimal cc to speed up the algorithm to save time we construct only one classifier chain with the locally optimal label order based on the training instances we select the label from λq as the first label if the maximum margin can be achieved over this label without augmented input the first label is denoted by then we select the label from the remainder as the second label if the maximum margin can be achieved over this label with as the augmented input we continue in this way until the last label is selected finally this algorithm will converge to the locally optimal cc we present the details of the algorithm in the supplementary materials where the time complexity of this algorithm is nd experiment in this section we perform experimental studies on number of benchmark data sets from different domains to evaluate the performance of our proposed algorithms for classification all the methods are implemented in matlab and all 
experiments are conducted on workstation with intel cpu and main memory running windows platform data sets and baselines we conduct experiments on eight data sets with various domains from three following the experimental settings in and we preprocess the llog yahoo art eurlex sm and eurlex ed data sets their statistics are presented in the supplementary materials we compare our algorithms with some baseline methods br cc ecc cca and mmoc to perform fair comparison we use the same linear package liblinear with square hinge loss primal to train the classifiers for all the methods ecc is averaged over several cc predictions with random order and the ensemble size in ecc is set to according to in our experiment the running time of pcc and epcc on most data sets like slashdot and yahoo art takes more than one week from the results in ecc is comparable with epcc and outperforms pcc so we do not consider pcc and epcc here cca and mmoc are two methods we can not get the results of cca and mmoc on yahoo art eurlex sm and eurlex ed data sets in one week following we consider the and measures to evaluate the prediction performance of all methods we perform on each data set and report the mean and standard error of each evaluation measurement the running time complexity comparison is reported in the supplementary materials http http datasets http data table results of on the various data sets mean standard deviation the best results are in bold numbers in square brackets indicate the rank data set br cc ecc cca mmoc yeast image slashdot enron llog yahoo art eurlex sm eurlex ed average rank prediction performance results for our method and baseline approaches in respect of the different data sets are reported in table other measure results are reported in the supplementary materials from the results we can see that br is much inferior to other methods in terms of our experiment provides empirical evidence that the label correlations exist in many real word data sets and because br ignores the information about the correlations between the labels br achieves poor performance on most data sets cc improves the performance of br however it underperforms ecc this result verifies the answer to our first question stated in section the label order does affect the performance of cc ecc which averages over several cc predictions with random order improves the performance of cc and outperforms cca and mmoc this studies verify that optimal cc achieve competitive results compared with approaches our proposed and algorithms are successful on most data sets this empirical result also verifies the answers to the last two questions stated in section the globally optimal cc exists and can find the globally optimal cc which achieves the best prediction performance the algorithm achieves comparable prediction performance with while it requires lower time complexity than in the experiment our proposed algorithms are much faster than cca and mmoc in terms of both training and testing time and achieve the same testing time with cc through the training time for our algorithms is slower than br cc and ecc our extensive empirical studies show that our algorithms achieve superior performance than those baselines conclusion to improve the performance of classification plethora of models have been developed to capture label correlations amongst them classifier chain is one of the most popular approaches due to its simplicity and good prediction performance instead of proposing new learning model we discuss three important 
questions in this work regarding the optimal classifier chain stated in section to answer these questions we first propose generalized cc model we then provide theoretical analysis of the generalization error for the proposed generalized model based on our results we obtain the answer to the second question the globally optimal cc exists only if the minimization of the upper bound is achieved over this cc it is very expensive to search over different label orders to find the globally optimal cc thus we propose the algorithm to simplify the search algorithm which requires nd complexity to speed up the algorithm we propose algorithm to find locally optimal cc where the time complexity of the ccgreedy algorithm is nd comprehensive experiments on eight data sets from different domains verify our theoretical studies and the effectiveness of proposed algorithms acknowledgments this research was supported by the australian research council future fellowship references robert schapire and yoram singer boostexter system for text categorization machine learning zafer and robert schapire and olga troyanskaya hierarchical prediction of gene function bioinformatics matthew boutell and jiebo luo and xipeng shen and christopher brown learning scene classification pattern recognition grigorios tsoumakas and ioannis katakis and ioannis vlahavas mining data in data mining and knowledge discovery handbook pages springer us krzysztof dembczynski and weiwei cheng and eyke bayes optimal multilabel classification via probabilistic classifier chains proceedings of the international conference on machine learning pages haifa israel omnipress jesse read and bernhard pfahringer and geoffrey holmes and eibe frank classifier chains for multilabel classification in proceedings of the european conference on machine learning and knowledge discovery in databases part ii pages berlin heidelberg yi zhang and jeff schneider maximum margin output coding proceedings of the international conference on machine learning pages new york ny omnipress yuhong guo and suicheng gu classification using conditional dependency networks proceedings of the international joint conference on artificial intelligence pages barcelona catalonia spain aaai press huang and zhou learning by exploiting label correlations locally proceedings of the aaai conference on artificial intelligence toronto ontario canada aaai press feng kang and rong jin and rahul sukthankar correlated label propagation with application to multilabel learning ieee computer society conference on computer vision and pattern recognition pages new york ny ieee computer society weiwei liu and ivor tsang large margin metric learning for prediction proceedings of the conference on artificial intelligence pages texas usa aaai press mingkui tan and qinfeng shi and anton van den hengel and chunhua shen and junbin gao and fuyuan hu and zhen zhang learning graph structure for image classification via clique generation the ieee conference on computer vision and pattern recognition daniel hsu and sham kakade and john langford and tong zhang prediction via compressed sensing advances in neural information processing systems pages curran associates yi zhang and jeff schneider output codes using canonical correlation analysis proceedings of the fourteenth international conference on artificial intelligence and statistics pages fort lauderdale usa farbound tai and lin multilabel classification with principal label space transformation neural computation zhang and kun zhang learning by 
exploiting label dependency proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages qwashington dc usa acm john and peter bartlett and robert williamson and martin anthony structural risk minimization over hierarchies ieee transactions on information theory kristin bennett and nello cristianini and john and donghui wu enlarging the margins in perceptron decision trees machine learning michael kearns and robert schapire efficient learning of probabilistic concepts proceedings of the symposium on the foundations of computer science pages los alamitos ca ieee computer society press peter bartlett and john generalization performance of support vector machines and other pattern classifiers advances in kernel methods support vector learning pages cambridge ma usa mit press fan and chang and hsieh and wang and lin liblinear library for large linear classification journal of machine learning research qi mao and ivor tsang and shenghua gao image annotation ieee transactions on image processing 
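The greedy label-ordering idea described above (at each position in the chain, pick the remaining label whose binary classifier attains the largest margin given the already-ordered labels as augmented input) can be sketched in a few lines. The snippet below is an illustrative simplification rather than the authors' CC-DP or CC-Greedy code: it uses scikit-learn's LinearSVC with the squared hinge loss in the primal and scores each candidate label by ||w||^2, since the margin scales as 1/||w||. The synthetic data, the regularization constant C, and the helper margin_sq_inv are assumptions made for the example.

```python
# Illustrative greedy classifier-chain ordering (not the authors' implementation):
# choose labels one by one, preferring the label with the largest SVM margin.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, d, q = 500, 20, 4
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, q))
Y = (X @ W_true + 0.1 * rng.normal(size=(n, q)) > 0).astype(int)   # q correlated labels

def margin_sq_inv(features, y):
    """Fit a primal squared-hinge linear SVM; smaller ||w||^2 means a larger margin."""
    clf = LinearSVC(C=1.0, dual=False).fit(features, y)
    return float(np.sum(clf.coef_ ** 2)), clf

order, chain, remaining = [], [], list(range(q))
aug = X
while remaining:
    # Greedy step: among the remaining labels, take the one with the largest margin
    # when trained on the features augmented with the labels chosen so far.
    scores = {j: margin_sq_inv(aug, Y[:, j]) for j in remaining}
    best = min(scores, key=lambda j: scores[j][0])
    order.append(best)
    chain.append(scores[best][1])
    aug = np.hstack([aug, Y[:, [best]]])          # augment with the true label at training time
    remaining.remove(best)

def predict(x):
    feats, preds = x.copy(), {}
    for j, clf in zip(order, chain):
        preds[j] = int(clf.predict(feats.reshape(1, -1))[0])
        feats = np.hstack([feats, [preds[j]]])    # augment with predicted labels at test time
    return [preds[j] for j in range(q)]

print("greedy label order:", order)
print("prediction for first point:", predict(X[0]), "true:", Y[0].tolist())
```

Note the train/test asymmetry characteristic of classifier chains: true labels augment the features during training and predicted labels during testing. The greedy search fits on the order of q^2 binary classifiers in total, which is the source of its lower cost relative to an exhaustive search over all q! label orders.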
robust regression via hard thresholding kush prateek and purushottam microsoft research india indian institute of technology kanpur india prajain purushot abstract we study the problem of robust least squares regression rlsr where several response variables can be adversarially corrupted more specifically for data matrix and an underlying model the response vector is generated as where rn is the corruption vector supported over at most coordinates existing exact recovery results for rlsr focus solely on penalty based convex formulations and impose relatively strict model assumptions such as requiring the corruptions to be selected independently of in this work we study simple algorithm called orrent which under mild conditions on can recover exactly even if corrupts the response variables in an adversarial manner both the support and entries of are selected adversarially after observing and our results hold under deterministic assumptions which are satisfied if is sampled from any distribution finally unlike existing results that apply only to fixed generated independently of our results are universal and hold for any rp next we propose gradient extensions of orrent that can scale efficiently to large scale problems such as high dimensional sparse recovery and prove similar recovery guarantees for these extensions empirically we find or rent and more so its extensions offering significantly faster recovery than the solvers for instance even on datasets with with around corrupted responses variant of our proposed method called is more than faster than the best solver if among these errors are some which appear too large to be admissible then those equations which produced these errors will be rejected as coming from too faulty experiments and the unknowns will be determined by means of the other equations which will then give much smaller legendre on the method of least squares introduction robust least squares regression rlsr addresses the problem of learning reliable set of regression coefficients in the presence of several arbitrary corruptions in the response vector owing to the of regression rlsr features as critical component of several important realworld applications in variety of domains such as signal processing economics computer vision and astronomy given data matrix xn with data points in rp and the corresponding response vector rn the goal of rlsr is to learn such that arg min yi xti this work was done while was postdoctoral researcher at microsoft research india that is we wish to simultaneously determine the set of corruption free points and also estimate the best model parameters over the set of clean points however the optimization problem given above is jointly in and in general and might not directly admit efficient solutions indeed there exist reformulations of this problem that are known to be to optimize to address this problem most existing methods with provable guarantees assume that the observations are obtained from some generative model commonly adopted model is the following where rp is the true model vector that we wish to estimate and rn is the corruption vector that can have arbitrary values common assumption is that the corruption vector is sparsely supported for some recently and obtained surprising result which shows that one can recover exactly even when when almost all the points are corrupted by solving an based convex optimization problem minw however these results require the corruption vector to be selected oblivious of and moreover the results 
impose severe restrictions on the data distribution requiring that the data be either sampled from an isotropic gaussian ensemble or from an incoherent orthogonal matrix finally these results hold only for fixed and are not universal in general in contrast studied rlsr with less stringent assumptions allowing arbitrary corruptions in response variables as well as in the data matrix and proposed trimmed inner product based algorithm for the problem however their recovery guarantees significantly weaker firstly they are able recover only upto an additive error or if is hence they require just to claim bound note that this amounts to being able to tolerate only vanishing fraction of corruptions more importantly even with and extremely small they are unable to guarantee exact recovery of similar result was obtained by albeit using based algorithm with stronger assumptions on in this paper we focus on simple and natural thresholding based algorithm for rlsr at high level at each step our algorithm alternately estimates an active set st of clean points and then updates the model to obtain by minimizing the least squares error on the active set this intuitive algorithm seems to embody long standing heuristic first proposed by legendre over two centuries ago see introductory quotation in this paper that has been adopted in later literature as well however to the best of our knowledge this technique has never been rigorously analyzed before in settings despite its appealing simplicity our contributions the main contribution of this paper is an exact recovery guarantee for the thresholding algorithm mentioned above that we refer to as see algorithm we provide our guarantees in the model given in where the corruptions are selected adversarially but restricted to have at most entries where is global constant dependent only on under deterministic conditions on namely the subset strong convexity ssc and smoothness sss properties see definition we guarantee that converges at geometric rate and recovers exactly we further show that these properties ssc and sss are satisfied if the data is sampled from distribution and log we would like to stress three key advantages of our result over the results of we allow to be adversarial both support and values of to be selected adversarially based on and we make assumptions on data that are natural as well as significantly less restrictive than what existing methods make and our analysis admits universal guarantees holds for any we would also like to stress that while based methods have been studied rigorously for the problem has not been studied formally for the robust regression problem study approaches to the robust regression problem but without any formal guarantees moreover the two problems are completely different and hence techniques from analysis do not extend to robust regression note that for an adaptive adversary as is the case in our work recovery can not be guaranteed for this for an adversarially chosen model since the adversary can introduce corruptions as bi thus making recovery impossible would make it impossible for any algorithm to distinguish between and despite its simplicity does not scale very well to datasets with large as it solves least squares problems at each iteration we address this issue by designing gradient descent based algorithm and hybrid algorithm both of which enjoy geometric rate of convergence and can recover under the model assumptions mentioned above we also propose extensions of orrent for the rlsr problem in the sparse 
regression setting where but our algorithm is based on but uses the iterative hard thresholding iht algorithm popular algorithm for sparse regression as before we show that also converges geometrically to if the corruption index is less than some constant is sampled from distribution and log finally we experimentally evaluate existing algorithms and our hard algorithms the results demonstrate that our proposed algorithms can be significantly faster than the best solvers exhibit better recovery properties as well as be more robust to dense white noise for instance on problem with dimensions and corruption was found to be faster than solvers as well as achieve lower error rates problem formulation given set of data points xn where xi rp and the corresponding response vector rn the goal is to recover parameter vector which solves the rlsr problem we assume that the response vector is generated using the following model where hence in the above model reduces to estimating we allow the model representing the regressor to be chosen in an adaptive manner after the data features have been generated the above model allows two kinds of perturbations to yi dense but bounded noise εi white noise εi as well as potentially unbounded corruptions bi to be introduced by an adversary the only requirement we enforce is that the gross corruptions be sparse shall represent the dense noise vector for example and the corruption vector such that for some corruption index we shall use the notation supp to denote the set of clean points points that have not faced unbounded corruptions we consider adaptive adversaries that are able to view the generated data points xi as well as the clean responses and dense noise values εi before deciding which locations to corrupt and by what amount we denote the unit sphere in dimensions using for any we let sγ denote the set of all subsets of size for any set we let xs xi denote the matrix whose columns are composed of points in that set also for any vector rn we use the notation vs to denote the vector consisting of those components that are in we use λmin and λmax to denote respectively the smallest and largest eigenvalues of square symmetric matrix we now introduce two properties namely subset strong convexity and subset strong smoothness which are key to our analyses definition ssc and sss properties matrix satisfies the subset strong convexity property resp subset strong smoothness property at level with strong convexity constant λγ resp strong smoothness constant λγ if the following holds λγ min λmin xs xs max λmax xs xs λγ remark we note that the uniformity enforced in the definitions of the ssc and sss properties is not for the sake of convenience but rather necessity indeed uniform bound is required in face of an adversary which can perform corruptions after data and response variables have been generated and choose to corrupt precisely that set of points where the ssc and sss parameters are the worst orrent thresholding robust regression method we now present orrent thresholding robust regression method for performing robust regression at scale key to our algorithms is the hard thresholding operator which we define below algorithm orrent thresholding algorithm update based robust regression method input current model current active set step size input training data xi yi step length xs xs ys thresholding parameter tolerance return while rtst do update algorithm update yi input current model current active set step ht size current residuals previous active set use the 
gd update if the active end while set is changing lot return wt if then else algorithm update if stable use the fc update input current model current active set return arg min yi hw xi end if return definition hard thresholding operator for any vector rn let σv be the permutation vσ vσ that orders elements of in ascending order of their magnitudes vσ then for any we define the hard thresholding operator as ht using this operator we present our algorithm orrent algorithm for robust regression or rent follows most natural iterative strategy of alternately estimating an active set of points which have the least residual error on the current regressor and then updating the regressor to provide better fit on this active set we offer three variants of our algorithm based on how aggressively the algorithm tries to fit the regressor to the current active set we first propose fully corrective algorithm algorithm that performs fully corrective least squares regression step in an effort to minimize the regression error on the active set this algorithm makes significant progress in each step but at cost of more expensive updates to address this we then propose milder gradient variant algorithm that performs much cheaper update of taking single step in the direction of the gradient of the objective function on the active set this reduces the regression error on the active set but does not minimize it this turns out to be beneficial in situations where dense noise is present along with sparse corruptions since it prevents the algorithm from overfitting to the current active set both the algorithms proposed above have their pros and cons the fc algorithm provides significant improvements with each step but is expensive to execute whereas the gd variant although efficient in executing each step offers slower progress to get the best of both these algorithms we propose third hybrid variant algorithm that adaptively selects either the fc or the gd update depending on whether the active set is stable across iterations or not in the next section we show that this hard strategy offers linear convergence rate for the algorithm in all its three variations we shall also demonstrate the applicability of this technique to high dimensional sparse recovery settings in subsequent section convergence guarantees for the sake of ease of exposition we will first present our convergence analyses for cases where dense noise is not present and will handle cases with dense noise and sparse corruptions later we first analyze the fully corrective algorithm the convergence proof in this case relies on the optimality of the two steps carried out by the algorithm the fully corrective step that selects the best regressor on the active set and the hard thresholding step that discovers new active set by selecting points with the least residual error on the current regressor theorem let xn be the given data matrix and be the corrupted output with let algorithm be executed on this data with the thresholding satisfies the parameter set to let be an invertible matrix such that ssc and sss properties at level with constants and respectively see definition if the data satisfies λβ then after log iterations algorithm obtains an solution kw proof sketch let rt wt be the vector of residuals at time and ct xst xs also let supp be the set of uncorrupted points the fully corrective step ensures that xst yst xst xs bst xst bst combining the two gives us whereas the hard thresholding step ensures that st st ct xst bst est es es bs es bs st st λβ 
kbst kbst es and es and xs where follows from setting st follows from the ssc and sss properties kbst and solving the quadratic equation and performing other manipulations gives us the claimed result theorem relies on deterministic fixed design assumption specifically in order to guarantee convergence we can show that large class of random designs including gaussian and designs actually satisfy this requirement that is to say data generated from these λβ distributions satisfy the ssc and sss conditions such that with high probability theorem explicates this for the class of gaussian designs theorem let xn be the given data matrix with let and also let and log then with kbk bility at least the data satisfies more specifically after log iterations of algorithm with the thresholding parameter set to we have wt remark note that theorem provides rates that are independent of the condition number λλmax min of the distribution we also note that results similar to theorem can be proven for the larger class of distributions we refer the reader to section for the same remark we remind the reader that our analyses can readily accommodate dense noise in addition to sparse unbounded corruptions we direct the reader to appendix which presents convergence proofs for our algorithms in these settings remark we would like to point out that the design requirements made by our analyses are very mild when compared to existing literature indeed the work of assumes the bouquet model where distributions are restricted to be isotropic gaussians whereas the work of assumes more stringent model of matrices something that even gaussian designs do not satisfy our analyses on the other hand hold for the general class of distributions we now analyze the algorithm which performs cheaper updates on the active set we will show that this method nevertheless enjoys linear rate of convergence theorem let the data settings be as stated in theorem and let algorithm be executed on this data with the thresholding parameter set to and the step length set to if the data kbk satisfies max λβ then after log iterations algorithm obtains an solution wt kwt similar to the assumptions made by the algorithm are also satisfied by the class of distributions the proof of theorem given in appendix details these arguments given the convergence analyses for and gd we now move on to provide convergence analysis for the hybrid algorithm which interleaves fc and gd steps since the exact interleaving adopted by the algorithm depends on the data and not known in advance this poses problem we address this problem by giving below uniform convergence guarantee one that applies to every interleaving of the fc and gd update steps theorem suppose algorithm is executed on data that allows algorithms and convergence rate of ηfc and ηgd respectively suppose we have then for any interleavings of the kbk fc and gd steps that the policy may enforce after log iterations algorithm ensures an solution kwt we point out to the reader that the assumption made by theorem ηfc ηgd is readily satisfied by random designs albeit at the cost of reducing the noise tolerance limit as we shall see offers attractive convergence properties merging the fast convergence rates of the fc step as well as the speed and protection against overfitting provided by the gd step robust regression in this section we extend our approach to the robust sparse recovery setting as before we assume that the response vector is obtained as where however this time we also assume that is as before we 
shall neglect noise for the sake of simplicity we reiterate that it is not possible to use existing results from sparse recovery such as directly to solve this problem our objective would be to recover sparse model so that the challenge here is to forgo sample complexity of and instead perform recovery with log samples alone for this setting we modify the fc update step of method to the following inf yi hw xi for some target sparsity level we refer to this modified algorithm as assuming satisfies the properties defined below can be solved efficiently using results from sparse recovery for example the iht algorithm analyzed in definition rsc and rss properties matrix will be said to satisfy the restricted strong convexity property resp restricted strong smoothness property at level with strong convexity constant resp strong smoothness constant if the following holds for all and αs ls for our results we shall require the subset versions of both these properties definition srsc and srss properties matrix will be said to satisfy the subset restricted strong convexity resp subset restricted strong smoothness property at level with strong convexity constant resp strong smoothness constant if for all subsets sγ the matrix xs satisfies the rsc resp rss property at level with constant αs resp ls we now state the convergence result for the algorithm theorem let be the given data matrix and be the corrupted output with and let be an invertible matrix such that satisfies the srsc and srss properties at level with constants and respectively see definition let algorithm be executed with the on this data update thresholding parameter set to and total points alpha sigma corrupted points corrupted points corrupted points sigma sigma total points total points kw sigma magnitude of corruption figure and diagrams depicting the recovery properties of the and algorithms the colors red and blue represent high and low probability of success resp method is considered successful in an experiment if it recovers upto relative error both variants of orrent can be seen to recover in presence of larger number of corruptions than the solver variation in recovery error with the magnitude of corruption as the corruption is increased and show improved performance while the problem becomes more difficult for the solver if also satisfies kbk then after log iterations algorithm obtains an solution wt kwt if is sampled from gaussian distribution and log then for all values of we can guarantee kwt kbk after log iterations of the algorithm in particular λmax λmin remark the sample complexity required by theorem is identical to the one required by analyses for high dimensional sparse recovery save constants also note that can tolerate the same corruption index as experiments several numerical simulations were carried out on linear regression problems in as well as sparse settings the experiments show that orrent not only offers statistically better recovery properties as compared to approaches but that it can be more than an order of magnitude faster as well data for the low dimensional setting the regressor rp was chosen to be random unit norm vector data was sampled as xi ip and response variables were generated as xi the set of corrupted points was selected as uniformly random αn subset of and the corruptions were set to bi for the corrupted responses were then generated as yi bi εi where εi for the sparse setting supp was selected to be random subset of diagrams figure were generated by repeating each experiment times for all 
other plots each experiment was run over random instances of the data and the plots were drawn to depict the mean results algorithms we compared various variants of our algorithm orrent to the regularized algorithm for robust the problem can be written as minz that regression where and λb we used the dual augmented lagrange multiplier dalm solver implemented by to solve the problem we ran fine tuned grid search over the parameter for the solver and quoted the best results obtained from the search in the setting we compared the recovery properties of algorithm and algorithm with the solver while for the case we compared against the solver both the solver as well as our methods were implemented in matlab and were run on single core machine with gb ram choice of an extensive comparative study of various minimization algorithms was performed by who showed that the dalm and homotopy solvers outperform other counterparts both in terms of recovery properties and timings we extended their study to our observation model and found the dalm solver to be significantly better than the other solvers see figure in the appendix we also observed similar to that the approximate message passing amp solver diverges on our problem as the input matrix to the solver is matrix hd alpha kw kw alpha kappa kw kw sigma fraction of corrupted points time in sec fraction of corrupted points time in sec figure in as well as sparse high dimensional settings orrent offers better recovery as the fraction of corrupted points is varied in terms of runtime orrent is an order of magnitude faster than solvers in both settings in the setting is the fastest of all the variants evaluation metric we measure the performance of various algorithms using the standard error for the plots figure we deemed an algorithm successful on rw kw with error rw an instance if it obtained model we also measured the cpu time required by each of the methods so as to compare their scalability low dimensional results recovery property the plots presented in figure represent our recovery experiments in graphical form both the and hybrid variants of orrent show better recovery properties than the approach indicated by the number of runs in which the algorithm was able to correctly recover out of runs figure shows the variation in recovery error as function of in the presence of white noise and exhibits the superiority of or rent and orrent over here again orrent and orrent achieve significantly lesser recovery error than for all figure in the apb with varying and follow similar trend with pendix show that the variations of kw orrent having significantly lower recovery error in comparison to the approach figure brings out an interesting trend in the recovery property of orrent as we increase the magnitude of corruption from to the recovery error for and decreases as expected since it becomes easier to identify the grossly corrupted points however the was unable to exploit this observation and in fact exhibited an increase in recovery error run time in order to ascertain the recovery guarantees for orrent on problems we performed an experiment where data was sampled as xi where diag figure plots the recovery error as function of time was able to correctly recover about faster than which spent considerable amount of time the data matrix even after allowing the algorithm to run for iterations it was unable to reach the desired residual error of figure also shows that our algorithm is able to converge to the optimal solution much faster than or this is because 
solves least square problem at each step and thus even though it requires significantly fewer iterations to converge each iteration in itself is very expensive while each iteration of is cheap it is still limited by the slow convergence rate of the gradient descent algorithm where is the condition number of the covariance matrix on the other hand is able to combine the strengths of both the methods to achieve faster convergence high dimensional results recovery property figure shows the variation in recovery error in the setting as the number of corrupted points was varied for these experiments was set to log and the fraction of corrupted points was varied from to while fails to recover for offers perfect recovery even for values upto run time figure shows the variation in recovery error as function of run time in this setting was found to be an order of magnitude slower than making it infeasible for sparse settings one key reason for this is that the solver is significantly slower in identifying the set of clean points for instance whereas was able to identify the clean set of points in only iterations it took around iterations to do the same references christoph studer patrick kuppinger graeme pope and helmut recovery of sparsely corrupted signals ieee transaction on information theory peter rousseeuw and annick leroy robust regression and outlier detection john wiley and sons john wright alan yang arvind ganesh shankar sastry and yi ma robust face recognition via sparse representation ieee transactions on pattern analysis and machine intelligence john wright and yi ma dense error correction via minimization ieee transaction on information theory nam nguyen and trac tran exact recoverability from dense corrupted observations via minimization ieee transaction on information theory yudong chen constantine caramanis and shie mannor robust sparse regression under adversarial corruption in international conference on machine learning icml brian mcwilliams gabriel krummenacher mario lucic and joachim buhmann fast and robust least squares estimation in corrupted linear models in annual conference on neural information processing systems nips legendre on the method of least squares in translated from the french smith editor source book in mathematics pages new york dover publications peter rousseeuw least median of squares regression journal of the american statistical association peter rousseeuw and katrien driessen computing lts regression for large data sets journal of data mining and knowledge discovery thomas blumensath and mike davies iterative hard thresholding for compressed sensing applied and computational harmonic analysis prateek jain ambuj tewari and purushottam kar on iterative hard thresholding methods for in annual conference on neural information processing systems nips yiyuan she and art owen outlier detection using nonconvex penalized regression rahul garg and rohit khandekar gradient descent with sparsification an iterative algorithm for sparse recovery with restricted isometry property in international conference on machine learning icml allen yang arvind ganesh zihan zhou shankar sastry and yi ma review of fast algorithms for robust face recognition corr beatrice laurent and pascal massart adaptive estimation of quadratic functional by model selection the annals of statistics thomas blumensath sampling and reconstructing signals from union of linear subspaces ieee transactions on information theory roman vershynin introduction to the analysis of random matrices in eldar and 
kutyniok editors compressed sensing theory and applications chapter pages cambridge university press 
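The alternating structure of the fully corrective variant discussed above, which solves least squares on the current active set and then re-selects the active set by hard thresholding the residuals, is simple enough to sketch directly. The code below is a minimal numpy illustration under an assumed data model, not the authors' implementation; the corruption model, the choice beta = 0.35, and the iteration count are assumptions made for the example.

```python
# Minimal sketch of fully-corrective hard-thresholding robust regression:
# alternate between least squares on the active set and residual-based reselection.
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha = 2000, 50, 0.3                       # 30% of responses grossly corrupted
X = rng.normal(size=(n, p))
w_star = rng.normal(size=p)
y = X @ w_star + 0.01 * rng.normal(size=n)        # dense bounded noise
corrupt = rng.choice(n, int(alpha * n), replace=False)
y[corrupt] += rng.uniform(5, 10, size=len(corrupt))   # sparse, unbounded corruptions

def torrent_fc(X, y, beta, iters=20):
    n = len(y)
    keep = n - int(beta * n)                      # size of the active set
    active = np.arange(n)                         # start with all points
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        # Fully corrective step: least squares restricted to the current active set.
        w, *_ = np.linalg.lstsq(X[active], y[active], rcond=None)
        # Hard thresholding step: keep the (1 - beta) n points with smallest residuals.
        r = np.abs(y - X @ w)
        active = np.argsort(r)[:keep]
    return w

w_hat = torrent_fc(X, y, beta=0.35)               # beta chosen at least as large as alpha
print("recovery error:", float(np.linalg.norm(w_hat - w_star)))
```

In this sketch, the gradient variant would replace the lstsq call with a single step w - eta * X_S^T (X_S w - y_S) on the active set, and the hybrid variant would switch between the two updates depending on how much the active set changes between iterations, as described above.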
sparse local embeddings for extreme classification kush himanshu purushottam manik and prateek microsoft research india indian institute of technology delhi india indian institute of technology kanpur india prajain manik purushot abstract the objective in extreme learning is to train classifier that can automatically tag novel data point with the most relevant subset of labels from an extremely large label set embedding based approaches attempt to make training and prediction tractable by assuming that the training label matrix is and reducing the effective number of labels by projecting the high dimensional label vectors onto low dimensional linear subspace still leading embedding approaches have been unable to deliver high prediction accuracies or scale to large problems as the low rank assumption is violated in most real world applications in this paper we develop the sleec classifier to address both limitations the main technical contribution in sleec is formulation for learning small ensemble of local distance preserving embeddings which can accurately predict infrequently occurring tail labels this allows sleec to break free of the traditional assumption and boost classification accuracy by learning embeddings which preserve pairwise distances between only the nearest label vectors we conducted extensive experiments on several as well as benchmark data sets and compared our method against methods for extreme classification experiments reveal that sleec can make significantly more accurate predictions then the methods including both by as much as as well as by as much as methods sleec can also scale efficiently to data sets with million labels which are beyond the pale of leading embedding methods introduction in this paper we develop sleec sparse local embeddings for extreme classification an extreme classifier that can make significantly more accurate and faster predictions as well as scale to larger problems as compared to embedding based approaches extreme learning xml addresses the problem of learning classifier that can automatically tag data point with the most relevant subset of labels from large label set for instance there are more than million labels categories on wikipedia and one might wish to build classifier that annotates new article or web page with the subset of most relevant wikipedia categories it should be emphasized that learning is distinct from classification where the aim is to predict single mutually exclusive label challenges xml is hard problem that involves learning with hundreds of thousands or even millions of labels features and training points although some of these problems can be ameliorated this work was done while was postdoctoral researcher at microsoft research india using label hierarchy such hierarchies are unavailable in many applications in this setting an obvious baseline is thus provided by the technique which seeks to learn an an independent classifier per label as expected this technique is infeasible due to the prohibitive training and prediction costs given the large number of labels approaches natural way of overcoming the above problem is to reduce the effective number of labels embedding based approaches try to do so by projecting label vectors onto low dimensional space based on an assumption that the label matrix is more specifically given set of training points xi yi with feature vectors xi rd and ldimensional label vectors yi embedding approaches project the label vectors onto lower linear subspace as zi uyi regressors are then 
trained to predict zi as vxi labels for novel point are predicted by vx where is decompression matrix which lifts the embedded label vectors back to the original label space embedding methods mainly differ in the choice of their compression and decompression techniques such as compressed sensing bloom filters svd landmark labels output codes etc the leml algorithm directly optimizes for using regularized least squares objective embedding approaches have many advantages including simplicity ease of implementation strong theoretical foundations the ability to handle label correlations as well as adapt to online and incremental scenarios consequently embeddings have proved to be the most popular approach for tackling xml problems embedding approaches also have limitations they are slow at training and prediction even for small for instance on wikilshtc wikipedia based challenge data embedding dimensions set leml with takes hours to train even with early termination whereas prediction takes nearly milliseconds per test point in fact for text applications with feature leml prediction time db vectors such as wikilshtc where db can be an order of magnitude more than even prediction time dl more importantly the critical assumption made by embedding methods that the training label matrix is is violated in almost all real world applications figure plots the approximation is varied on the wikilshtc data set as is clear even with in the label matrix as dimensional subspace the label matrix still has approximation error this happens primarily due to the presence of hundreds of thousands of tail labels figure which occur in at most data points each and hence can not be well approximated by any linear low dimensional basis the sleec approach our algorithm sleec extends embedding methods in multiple ways to address these limitations first instead of globally projecting onto linear subspace sleec learns embeddings zi which capture label correlations by preserving the pairwise distances between only the closest rather than all label vectors zi zj yi yj only if knn where is distance metric regressors are trained to predict zi vxi we propose novel formulation for learning such embeddings that can be formally shown to consistently preserve nearest neighbours in the label space we build an efficient pipeline for training these embeddings which can be orders of magnitude faster than embedding methods during prediction rather than using decompression matrix sleec uses neighbour knn classifier in the embedding space thus leveraging the fact that nearest neighbours have been preserved during training thus for novel point the predicted label vector is obtained using vxi vx yi the use of knn classifier is well motivated as knn outperforms discriminative methods in acutely low training data regimes as is the case with tail labels the superiority of sleec proposed embeddings over traditional embeddings can be seen by looking at figure which shows that the relative approximation error in learning sleec embeddings is significantly smaller as compared to the approximation error moreover we also find that sleec can improve the prediction accuracy of embedding methods by as much as absolute on the challenging wikilshtc data set sleec also significantly outperforms methods such as wsabie which also use knn classification in the embedding space but learn their embeddings using the traditional assumption clustering based speedup however knn classifiers are known to be slow at prediction sleec therefore clusters the training 
data into clusters learning separate embedding per cluster and performing knn classification within the test point cluster alone this allows sleec to be more than two orders of magnitude faster at prediction than leml and other embedding methods on the wikilshtc data (figure: error ky ylb in approximating the label matrix; global svd denotes the error incurred by computing the rank svd of the label matrix, local svd computes the rank svd within each cluster, and sleec nn objective denotes the sleec objective function; global svd incurs error and the error is decreasing at most linearly; the figure also shows the number of documents in which each label is present for the wikilshtc data set, there are about labels which are present in documents lending it heavy tailed distribution, as well as the precision accuracy of sleec and localleml on the data set as we vary the number of clusters) in fact sleec also scales well to the data set involving million labels which is beyond the pale of leading embedding methods moreover the clustering trick does not significantly benefit other methods see figure thus indicating that sleec embeddings are key to its performance boost since clustering can be unstable in large dimensions sleec compensates by learning small ensemble where each individual learner is generated by different random clustering this was empirically found to help tackle instabilities of clustering and significantly boost prediction accuracy with only linear increases in training and prediction time for instance on wikilshtc sleec prediction accuracy was with an millisecond prediction time whereas leml could only manage accuracy while taking milliseconds for prediction per test point tree based approaches recently tree based methods have also become popular for xml as they enjoy significant accuracy gains over the existing embedding methods for instance fastxml can achieve prediction accuracy of on wikilshtc using tree ensemble however using sleec we are now able to extend embedding methods to outperform tree ensembles achieving with learners and with thus sleec obtains the best of both worlds achieving the highest prediction accuracies across all methods on even the most challenging data sets as well as retaining all the benefits of embeddings and eschewing the disadvantages of large tree ensembles such as large model size and lack of theoretical understanding method let xn yn be the given training data set xi rd be the input feature vector yi be the corresponding label vector and yij iff the label is turned on for xi let xn be the data matrix and yn be the label matrix given the goal is to learn classifier rd that accurately predicts the label vector for given test point recall that in xml settings is very large and is of the same order as and ruling out several standard approaches such as we now present our algorithm sleec which is designed primarily to scale efficiently for large during training we map the label vectors yi to vectors zi rl and learn set of regressors zi xi during the test phase for an unseen point we first compute its embedding and then perform knn over the set xn to scale our algorithm we perform clustering of all the training points and apply the above mentioned procedures in each of the clusters separately below we first discuss our method to compute the embeddings zi and the regressors the next section then discusses our approach for scaling the method to large data sets
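to make the test-time pipeline just described concrete, the following is a minimal numpy sketch of an sleec-style prediction step (cluster assignment, linear embedding of the test point, knn in the embedding space, and label scores from the neighbours' label vectors); the array names, shapes, and the use of dense label vectors are illustrative assumptions rather than the authors' implementation, which works with sparse matrices and an ensemble of such learners

```python
import numpy as np

def sleec_predict(x, cluster_centers, V_list, Z_list, Y_list, n_neighbors=10, top_k=5):
    """Illustrative SLEEC-style prediction for one test point x (shape (d,))."""
    # 1. assign the test point to its nearest cluster centre in feature space
    c = int(np.argmin(np.linalg.norm(cluster_centers - x, axis=1)))
    V, Z, Y = V_list[c], Z_list[c], Y_list[c]   # regressors, training embeddings, label matrix of that cluster
    # 2. embed the test point with the cluster's linear regressors: z = V x
    z = V @ x
    # 3. k nearest neighbours among the training embeddings of the cluster
    nn = np.argsort(np.linalg.norm(Z - z, axis=1))[:n_neighbors]
    # 4. score labels by the empirical label distribution of the neighbours
    scores = Y[nn].mean(axis=0)
    return np.argsort(-scores)[:top_k], scores
```

an ensemble learner would repeat this over several random clusterings and average the resulting label scores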
(algorithm boxes: sleec train — require training points xn yn, embedding dimensionality, no. of neighbors, no. of clusters, regularization and smoothing parameters; partition the points into clusters, and for each partition form the neighborhood index set from the nearest label vectors, learn embeddings zj via svp and regressors vj via admm; sleec svp — require observations, index set and dimensionality, repeat projected gradient updates until convergence and output zj; sleec admm — require the data matrix, embeddings, regularization and smoothing parameters, repeat updates until convergence and output vj; sleec test — require the test point, no. of nearest neighbors and no. of desired labels; find the partition closest to the test point, compute the nearest neighbors of its embedding vx, form the empirical label distribution px over those neighbors and output the top labels ypred) learning embeddings as mentioned earlier our approach is motivated by the fact that typical data set tends to have large number of tail labels that ensure that the label matrix can not be well approximated using linear subspace see figure however it can still be accurately modeled using a manifold that is instead of preserving distances or inner products of given label vector to all the training points we attempt to preserve the distance to only few nearest neighbors that is we wish to find embedding matrix which minimizes the following objective zn min kpω pω where the index set denotes the set of neighbors that we wish to preserve iff ni ni denotes set of nearest neighbors of we select ni arg maxs yti yj which is the set of points with the largest inner products with yi is always chosen large enough so that distances inner products to few far away points are also preserved while optimizing for our objective function this prohibits points from entering the immediate neighborhood of any given point pω is defined as hyi yj if pω ij otherwise also we add regularization kzi to the objective function to obtain sparse embeddings sparse embeddings have three key advantages they reduce prediction time reduce the size of the model and avoid overfitting now given the embeddings zn we wish to learn model to predict the embeddings using the input features that is we require that where combining the two formulations and adding an for we get min kpω pω λkv µkv note that the above problem formulation is somewhat similar to few existing methods for nonlinear dimensionality reduction that also seek to preserve distances to few near neighbors however in contrast to our approach these methods do not have direct out of sample generalization do not scale well to data sets and lack rigorous generalization error bounds optimization we first note that optimizing is significant challenge as the objective function is as well as furthermore our goal is to perform optimization for data sets where to this end we divide the optimization into two phases we first learn embeddings zn and then learn regressors in the second stage that is is obtained by directly solving but without the penalty term min kpω pω min rank kpω pω where next is obtained by solving the following problem min kz λkv µkv note that the matrix obtained using need not be sparse however we store and use as our embeddings so that sparsity is still maintained optimizing note that even the simplified problem is an instance of the popular matrix completion problem and is known to be np hard in general the main challenge arises due to the rank constraint on however using the singular value projection svp method popular matrix completion method we can guarantee convergence to local minima svp is simple projected gradient descent method where the projection is onto the set
of matrices that is the step update for svp is given by plb mt ηpω mt where mt is the step iterate is the and plb is the projection of onto positive definite psd matrices note that while the set of rankthe set of psd matrices is we can still project onto this set efficiently using the eigenvalue decomposition of that is if um λm um be the eigenvalue decomposition of then plb um λm um and is the number of positive eigenvalues of λm denotes where min the eigenvalues of and um denotes the corresponding eigenvectors computing while the above update restricts the rank of all intermediate iterates mt to be at most eigenvalue decomposition can still be fairly expensive for large however by using special structure in the update one can significantly reduce eigenvalue decomposition computation complexity as well in general the eigenvalue decomposition can be computed in time lζ where is the time complexity of computing product now for svp update where matrix has special structure of mt ηpω mt hence nl is the average number of neighbors preserved by sleec hence the time which is linear in assuming is nearly constant complexity reduces to nl optimizing contains an term which makes the problem moreover as the term involves both and we can not directly apply the standard based algorithms instead we use the admm method to optimize see for the updates and for detailed derivation of the algorithm generalization error analysis let be fixed but unknown distribution over let each training point xi yi be sampled from then the goal of our embedding method is to learn an embedding matrix that preserves nearest neighbors in terms of label of any the above requirements can be formulated as the following stochastic optimization problem min rank he et ax and he where the loss function yi he yi yi he yi where is the indicator function hence loss is incurred only if and have large inner product for an appropriate selection of the neighborhood selection operator indeed minimizes regularized empirical estimate of the loss function it is regularized erm to indeed minimizes the loss upto an additive we now show that the optimal solution approximation error the existing techniques for analyzing excess risk in stochastic optimization require the empirical loss function to be decomposable over the training set and as such do not apply to which contains with two training points still using techniques from the auc maximization literature we can provide interesting excess risk bounds for problem theorem with probability at least over the sampling of the dataset the solution to the optimization problem satisfies ka kf inf log where is the minimizer of rank and see appendix for proof of the result note that the generalization error bound is independent of both and which is critical for extreme classification problems with large in fact the error bound is only dependent on which is the average number of positive labels per data point moreover our bound also provides way to compute best regularization parameter that minimizes the error bound however in practice we set to be fixed constant theorem only preserves the population neighbors of test point theorem given in appendix extends theorem to ensure that the neighbors in the training set are also preserved we would also like to stress that our excess risk bound is universal and hence holds even if does not minimize where is given in theorem scaling to data sets to be fairly large say few for data sets one might require the embedding dimension hundreds which might make computing 
the updates infeasible hence to scale to such large data sets sleec clusters the given datapoints into smaller local regions several data sets indeed reveal that there exist small local regions where the number of points as well as the number of labels is reasonably small hence we can train our embedding method over such local regions without significantly sacrificing overall accuracy we would like to stress that despite clustering datapoints into homogeneous regions the label matrix of any given cluster is still not close to hence applying linear embedding method such as leml to each cluster is still significantly less accurate when compared to our method see figure naturally one can cluster the data set into an extremely large number of regions so that eventually the label matrix is in each cluster however increasing the number of clusters beyond certain limit might decrease accuracy as the error incurred during the cluster assignment phase itself might nullify the gain in accuracy due to better embeddings figure illustrates this phenomenon where increasing the number of clusters beyond certain limit in fact decreases accuracy of leml the algorithm box provides a summary of our training algorithm we first cluster the datapoints into partitions then for each partition we learn set of embeddings using and then compute the regression parameters using for given test point we first find out the appropriate cluster then we find the embedding the label vector is then predicted using knn in the embedding space see algorithm for more details clustering turns out to be quite unstable for data sets with large dimensionality and in many cases leads to some drop in prediction accuracy to safeguard against such instability we use an ensemble of models generated using different sets of clusters we use different initialization points in our clustering procedure to obtain different sets of clusters our empirical results demonstrate that using such ensembles leads to significant increase in accuracy of sleec see figure and also leads to stable solutions with small variance see table experiments experiments were carried out on some of the largest xml benchmark data sets demonstrating that sleec could achieve significantly higher prediction accuracies as compared to the state of the art it is also demonstrated that sleec could be faster at training and prediction than leading embedding techniques such as leml (figure: variation in precision accuracy with model size in gb and with the number of learners for sleec and fastxml on data sets; clearly sleec achieves better accuracy than fastxml and at every point of the curve; for wikilshtc sleec with single learner is more accurate than with even learners; similarly sleec with learners achieves more accuracy than fastxml with learners) data sets experiments were carried out on data sets including labels amazon labels wikilshtc labels deliciouslarge labels and labels all the data sets are publicly available except which is proprietary and is included here to test the scaling capabilities of sleec unfortunately most of the existing embedding techniques do not scale to such large data sets we therefore also present comparisons on publicly available small data sets such as bibtex mediamill delicious and eurlex table in the appendix lists their statistics baseline algorithms this paper's primary focus is on comparing sleec to methods which can scale to the large data sets such as embedding based
leml and tree based fastxml and lpsr bayes was used as the base classifier in lpsr as was done in techniques such as cs cplst could only be trained on the small data sets given standard resources comparisons between sleec and such techniques are therefore presented in the supplementary material the implementation for leml and fastxml was provided by the authors we implemented the remaining algorithms and ensured that the published results could be reproduced and were verified by the authors wherever possible most of sleec were kept fixed including the number of clusters in learner bntrain embedding dimension for the small data sets and for the large number of learners in the ensemble and the parameters used for optimizing the remaining two the in knn and the number of neighbours considered during svp were both set by limited validation on validation set the for all the other algorithms were set using fine grained validation on each data set so as to achieve the highest possible prediction accuracy for each method in addition all the embedding methods were allowed much larger embedding dimension than sleec to give them as much opportunity as possible to outperform sleec evaluation metrics we evaluated algorithms using metrics that have been widely adopted for xml and ranking tasks precision at is one such metric that counts the fraction of correct predictions in the top scoring labels in and has been widely utilized we use the ranking measure ndcg as another evaluation metric we refer the reader to the supplementary material appendix and tables and for further descriptions of the metrics and results results on large data sets with more than labels table compares sleec prediction accuracy in terms of to all the leading methods that could be trained on five such data sets sleec could improve over the leading embedding method leml by as much as and in terms of and on wikilshtc similarly sleec outperformed leml by and in terms of and on the amazon data set which also has many tail labels the gains on the other data sets are consistent but smaller as the tail label problem is not so acute sleec also outperforms the leading tree method fastxml by in terms of both and on wikilshtc and respectively this demonstrates the superiority of sleec overall pipeline constructed using local distance preserving embeddings followed by knn classification sleec also has better scaling properties as compared to all other embedding methods in particular apart from leml no other embedding approach could scale to the large data sets and even leml could not scale to with million labels in contrast single sleec learner could be learnt on wikilshtc in hours on single core and already gave improvement in over leml see figure for the variation in accuracy vs sleec learners in fact sleec training table precision accuracies data sets our proposed method sleec is as much as more accurate in terms of and in terms of than leml leading embedding method other embedding based methods do not scale to the data sets we compare against them on data sets in table sleec is also more accurate and than fastxml tree method indicates leml could not be run with the standard resources data sets sleec consistently outperforms state of the art approaches wsabie which also uses knn classifier on its embeddings is significantly less accurate than sleec on all the data sets showing the superiority of our embedding learning algorithm data set sleec leml fastxml sleec leml fastxml wsabie onevsall bibtex delicious mediamill wikilshtc amazon data 
set eurlex time on wikilshtc was comparable to that of tree based fastxml fastxml trains trees in hours on single core to achieve of whereas sleec could achieve by training learners in hours similarly sleec training time on was hours per learner on single core sleec predictions could also be up to times faster than lemls for instance on wikilshtc sleec made predictions in milliseconds per test point as compared to leml sleec therefore brings the prediction time of embedding methods to be much closer to that of tree based methods fastxml took milliseconds per test point on wikilshtc and within the acceptable limit of most real world applications effect of clustering and multiple learners as mentioned in the introduction other embedding methods could also be extended by clustering the data and then learning local embedding in each cluster ensembles could also be learnt from multiple such clusterings we extend leml in such fashion and refer to it as localleml by using exactly the same clusters per learner in the ensemble as used in sleec for fair comparison as can be seen in figure sleec significantly outperforms localleml with single sleec learner being much more accurate than an ensemble of even localleml learners figure also demonstrates that sleec ensemble can be much more accurate at prediction as compared to the tree based fastxml ensemble the same plot is also presented in the appendix depicting the variation in accuracy with model size in ram rather than the number of learners in the ensemble the figure also demonstrates that very few sleec learners need to be trained before accuracy starts saturating finally table shows that the variance in sleec prediction accuracy different cluster initializations is very small indicating that the method is stable even though clustering in more than million dimensions results on small data sets table in the appendix compares the performance of sleec to several popular methods including embeddings trees knn and svms even though the tail label problem is not acute on these data sets and sleec was restricted to single learner sleec predictions could be significantly more accurate than all the other methods except on delicious where sleec was ranked second for instance sleec could outperform the closest competitor on eurlex by in terms of particularly noteworthy is the observation that sleec outperformed wsabie which performs knn classification on linear embeddings by as much as on multiple data sets this demonstrates the superiority of sleec local distance preserving embeddings over the traditional embeddings acknowledgments we are grateful to abhishek kadian for helping with the experiments himanshu jain is supported by google india phd fellowship at iit delhi references agrawal gupta prabhu and varma learning with millions of labels recommending advertiser bid phrases for web pages in www pages weston makadia and yee label partitioning for sublinear ranking in icml hsu kakade langford and zhang prediction via compressed sensing in nips usunier and gallinari robust bloom filters for large multilabel classification tasks in nips pages tai and lin classification with principal label space transformation in workshop proceedings of learning from data balasubramanian and lebanon the landmark selection method for multiple output prediction in icml bi and kwok efficient classification with many labels in icml zhang and schneider output codes using canonical correlation analysis in aistats pages yu jain kar and dhillon learning with missing labels icml chen 
and lin label space dimension reduction for classification in nips pages feng and lin classification with codes jmlr ji tang yu and ye extracting shared subspace for classification in kdd weston bengio and usunier wsabie scaling up to large vocabulary image annotation in ijcai lin ding hu and wang classification via implicit label space encoding in icml pages prabhu and varma fastxml fast accurate and stable for extreme learning in kdd pages wikipedia dataset for the large scale hierarchical text classification challenge ng and jordan on discriminative generative classifiers comparison of logistic regression and naive bayes in nips weinberger and saul an introduction to nonlinear dimensionality reduction by maximum variance unfolding in aaai pages shaw and jebara minimum volume embedding in aistats pages jain meka and dhillon guaranteed rank minimization via singular value projection in nips pages sprechmann litman yakar bronstein and sapiro efficient supervised sparse analysis and synthesis operators in nips kar sriperumbudur jain and karnick on the generalization ability of online learning algorithms for pairwise loss functions in icml leskovec and krevl snap datasets stanford large network dataset collection wetzker zimmermann and bauckhage analyzing social bookmarking systems cookbook in mining social data msoda workshop proceedings ecai pages july zubiaga enhancing navigation on wikipedia with social tags katakis tsoumakas and vlahavas multilabel text classification for automated tag suggestion in proceedings of the discovery challenge snoek worring van gemert geusebroek and smeulders the challenge problem for automated detection of semantic concepts in multimedia in acm multimedia tsoumakas katakis and vlahavas effective and effcient multilabel classification in domains with large number of labels in efficient pairwise multilabel classification for problems in the legal domain in chen and lin label space dimension reduction for classification in nips pages hariharan vishwanathan and varma efficient classification with applications to learning ml 
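as an addendum to the optimization discussion above, the following is a minimal numpy sketch of a rank-restricted svp-style update of the kind used to learn the embeddings: a gradient step on the observed entries of the neighbourhood-preserving objective followed by projection onto low-rank positive semi-definite matrices via an eigendecomposition; the data layout, step size, and function name are assumptions made for illustration, not the authors' code

```python
import numpy as np

def svp_psd_step(M, targets, eta, rank):
    """One SVP-style iteration: gradient step on the observed entries, then
    projection onto PSD matrices of rank at most `rank`.
    `targets` maps an observed index pair (i, j) to its target value <y_i, y_j>."""
    M_new = M.copy()
    for (i, j), t in targets.items():
        M_new[i, j] -= eta * (M[i, j] - t)        # gradient step restricted to observed entries
    w, U = np.linalg.eigh((M_new + M_new.T) / 2)  # symmetrise before eigendecomposition
    top = np.argsort(w)[::-1][:rank]
    w_top = np.clip(w[top], 0.0, None)            # PSD projection: drop negative eigenvalues
    return (U[:, top] * w_top) @ U[:, top].T
```

the embeddings can then be read off from a rank-limited factorization of the returned matrix, with the regressors fit to them in the second stage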
solving random quadratic systems of equations is nearly as easy as solving linear systems yuxin chen department of statistics stanford university stanford ca yxchen emmanuel candès department of mathematics and department of statistics stanford university stanford ca candes abstract this paper is concerned with finding solution to quadratic system of equations yi we demonstrate that it is possible to solve unstructured random quadratic systems in variables exactly from equations in linear time that is in time proportional to reading the data ai and yi this is accomplished by novel procedure which starting from an initial guess given by spectral initialization procedure attempts to minimize nonconvex objective the proposed algorithm distinguishes from prior approaches by regularizing the initialization and descent procedures in an adaptive fashion which discard terms bearing too much influence on the initial estimate or search directions these careful selection effectively serve as variance reduction tighter initial guess more robust descent directions and thus enhanced practical performance further this procedure also achieves nearoptimal statistical accuracy in the presence of noise empirically we demonstrate that the computational cost of our algorithm is about four times that of solving problem of the same size introduction suppose we are given response vector yi generated from quadratic transformation of an unknown object rn yi where the vectors ai rn are known in other words we acquire measurements about the linear product hai xi with all missing can we hope to recover from this nonlinear system of equations this problem can be recast as quadratically constrained quadratic program qcqp which subsumes as special cases various classical combinatorial problems with boolean variables the stone problem section in the physical sciences this problem is commonly referred to as phase retrieval the origin is that in many imaging applications crystallography diffraction imaging microscopy it is infeasible to record the phases of the diffraction patterns so that we can only record where is the electrical field of interest moreover this problem finds applications in estimating the mixture of linear regression since one can transform the latent membership variables into missing phases despite its importance across various fields solving the quadratic system is combinatorial in nature and in general np complete to be more realistic albeit more challenging the acquired samples are almost always corrupted by some amount of noise namely yi for instance in imaging applications the data are best modeled by poisson random variables ind yi poisson which captures the variation in the number of photons detected by sensor while we shall pay special attention to the poisson noise model due to its practical relevance the current work aims to accommodate even structures nonconvex optimization assuming independent samples the first attempt is to seek the maximum likelihood estimate mle xm minimizez yi where yi represents the of candidate given the outcome yi as an example under the poisson data model one has up to some constant offset yi yi log computing the mle however is in general intractable since yi is not concave in fortunately under unstructured random systems the problem is not as as it might seem and is solvable via convenient convex programs with optimal statistical guarantees the basic paradigm is to lift the quadratically constrained problem into linearly constrained problem by introducing matrix 
variable and relaxing the constraint nevertheless working with the auxiliary matrix variable significantly increases the computational complexity which exceeds the order of and is prohibitively expensive for data this paper follows different route which attempts recovery by minimizing the nonconvex objective or directly the main incentive is the potential computational benefit since this strategy operates directly upon vectors instead of lifting decision variables to higher dimension among this class of procedures one natural candidate is the family of type algorithms developed with respect to the objective this paradigm can be regarded as performing some variant of stochastic gradient descent over the random samples yi ai as an approximation to maximize the population likelihood while in general nonconvex optimization falls short of performance guarantees recently proposed approach called wirtinger flow wf promises efficiency under random features in nutshell wf initializes the iterate via spectral method and then successively refines the estimate via the following update rule µt xm yi where denotes the tth iterate of the algorithm and µt is the learning rate here yi represents the wirtinger derivative with respect to which reduces to the ordinary gradient in the real setting under gaussian designs wf allows exact recovery from log quadratic equations ii recovers up to within log time or flops and iii is stable and converges to the mle under gaussian noise despite these intriguing guarantees the computational complexity of wf still far exceeds the best that one can hope for moreover its sample complexity is logarithmic factor away from the limit this paper truncated wirtinger flow this paper develops novel algorithm called truncated wirtinger flow twf that achieves statistical accuracy the distinguishing features include careful initialization procedure and more adaptive gradient flow informally twf entails two stages initialization compute an initial guess by means of spectral method applied to subset of data yi that do not bear too much influence on the spectral estimates loop for µt yi for some index set over which yi are or resp means there exists constant such that resp means and are orderwise equivalent relative mse db relative error truncated wf least squares cg truncated wf mle phase iteration snr db figure relative errors of cg and twf iteration count where and relative mse snr in db where the curves are shown for two settings twf for solving quadratic equations blue and mle had we observed additional phase information green we highlight three aspects of the proposed algorithm with details deferred to section in contrast to wf and other gradient descent variants we regularize both the initialization and the gradient flow in more cautious manner by operating only upon some index sets tt the main point is that enforcing such careful selection rules lead to tighter initialization and more robust descent directions twf sets the learning rate µt in far more liberal fashion µt under suitable conditions as opposed to the situation in wf that recommends µt computationally each iterative step mainly consists in calculating yi which is inexpensive and can often be performed in linear time that is in time proportional to evaluating the data and the constraints take the poisson likelihood for example yi yi ai yi which essentially amounts to two products to see this rewrite iz yi vi otherwise where am hence az gives and the desired truncated gradient numerical surprises the power of twf is 
best illustrated by numerical examples since and are indistinguishable given we evaluate the solution based on metric that disregards the global phase dist minϕ xk in the sequel twf operates according to the poisson and takes µt we first compare the computational efficiency of twf for solving quadratic systems with that of conjugate gradient cg for solving least square problems as is well known cg is among the most popular methods for solving least square problems and hence offers desired benchmark we run twf and cg respectively over the following two problems find rn find bi bi ind where and ai this yields design matrix for which cg converges extremely fast the relative estimation errors of both methods are reported in fig where twf is seeded by power iterations the iteration counts are plotted in different scales so that twf iterations are tantamount to cg iteration since each iteration of cg and twf involves two matrix vector products az and the numerical plots lead to suprisingly positive observation for such an unstructured design figure recovery after top truncated spectral initialization and bottom twf iterations even when all phase information is missing twf is capable of solving quadratic system of equations only about slower than solving least squares problem of the same size the numerical surprise extends to noisy quadratic systems under the poisson data model fig displays the relative error mse of twf when the ratio snr varies here the relative mse and the snr are defined mse and snr where is an estimate both snr and mse are displayed on db scale the values of snr and mse are plotted to evaluate the quality of the twf solution we compare it with the mle applied to an ideal problem where the phases ϕi sign are revealed priori the presence of this precious side information gives away the phase retrieval problem and allows us to compute the mle via convex programming as illustrated in fig twf solves the quadratic system with nearly the best possible accuracy since it only incurs an extra db loss compared to the ideal mle with all true phases revealed to demonstrate the scalability of twf on real data we apply twf on image consider type of physically realizable measurements called coded diffraction patterns cdp where where nl denotes the vector of entrywise squared magnitudes and is the dft matrix here is diagonal matrix whose diagonal entries are randomly drawn from which models signal modulation before diffraction we generate masks for measurements and run twf on macbook pro with ghz intel core we run truncated power iterations and twf iterations which in total cost seconds for each color channel the relative errors after initialization and twf iterations are and respectively see fig main results we corroborate the preceding numerical findings with theoretical support for concreteness we assume twf proceeds according to the poisson we suppose the samples yi ai are independently and randomly drawn from the population and model the random features ai as ai to start with the following theorem confirms the performance of twf under noiseless data similar phenomena arise in many other experiments we ve conducted when the sample size ranges from to in fact this factor seems to improve slightly as increases to justify the definition of snr note that the signals and noise are captured by µi and yi µi respectively the snr is thus given by pm pm var yi pm pm theorem exact recovery consider the noiseless case with an arbitrary rn suppose that the learning rate µt is either taken to be 
constant µt or chosen via backtracking line search then there exist some constants and such that with probability exceeding exp the twf estimates algorithm obey dist kxk provided that and as discussed below we can take theorem justifies two intriguing properties of twf to begin with twf recovers the ground truth exactly as soon as the number of equations is on the same order of the number of unknowns which is information theoretically optimal more surprisingly twf converges at geometric rate it achieves dist kxk within at most log iterations as result the time taken for twf to solve the quadratic systems is proportional to the time taken to read the data which confirms the complexity of twf these outperform the theoretical guarantees of wf which requires log runtime and log sample complexity notably the performance gain of twf is the result of the key algorithmic changes rather than maximizing the data usage at each step twf exploits the samples at hand in more selective manner which effectively trims away those components that are too influential on either the initial guess or the search directions thus reducing the volatility of each movement with tighter initial guess and search directions in place twf is able to proceed with more aggressive learning rate taken collectively these efforts enable the appealing convergence property of twf next we turn to more realistic noisy data by accounting for general additive noise model yi ηi where ηi represents noise term the stability of twf is demonstrated in the theorem below theorem stability consider the noisy case suppose that the learning rate µt is either taken to be positive constant µt or chosen via backtracking line search if and kxk then with probability at least exp the twf estimates algorithm satisfy kηk dist kxk mkxk for all rn here and are some universal constants alternatively if one regards the snr for the model as follows xm snr then we immediately arrive at another form of performance guarantee stated in terms of snr kxk kxk dist snr as consequence the relative error of twf reaches within logarithmic number of iterations it is worth emphasizing that the above stability guarantee is deterministic which holds for any noise structure obeying encouragingly this statistical accuracy is nearly as revealed by minimax lower bound that we provide in the supplemental materials we pause to remark that several other nonconvex methods have been proposed for solving quadratic equations which exhibit intriguing empirical performances partial list includes the error reduction schemes by fienup alternating minimization kaczmarz method and generalized approximate message passing however most of them fall short of theoretical support the analytical difficulty arises since these methods employ the same samples in each iteration which introduces complicated dependencies across all iterates to circumvent this issue proposes version of the alternating minimization method that employs fresh samples in each iteration despite the mathematical convenience the sample complexity of this approach is log which is factor of from optimal and is empirically largely outperformed by the variant that reuses all samples in contrast our algorithm uses the same pool of samples all the time and is therefore practically appealing besides the approach in does not come with provable stability guarantees numerically each iteration of fienup algorithm or alternating minimization involves solving least squares problem and the algorithm converges in tens or hundreds of 
iterations this is computationally more expensive than twf whose computational complexity is merely about times that of solving least squares problem algorithm truncated wirtinger flow this section explains the basic principles of truncated wirtinger flow for notational convenience for any we denote am and truncated gradient stage in the case of independent data the descent direction of the wf is the gradient of the poisson which can be expressed as follows yi yi ai ai where νi represents the weight assigned to each feature ai (figure: the locus of for all unit vectors ai; the red arrows depict those directions with large weights) unfortunately the gradient of this form is and hence uncontrollable to see this consider any fixed rn the typical value of is on the order of kzk leading to some excessively large weights νi notably an underlying premise for nonconvex procedure to succeed is to ensure all iterates reside within basin of attraction that is neighborhood surrounding within which is the unique stationary point of the objective when gradient is unreasonably large the iterative step might overshoot and end up leaving this basin of attraction consequently wf moving along the preceding direction might not come close to the truth unless is already very close to this is observed in numerical twf addresses this challenge by discarding terms having too high of leverage on the search direction this is achieved by regularizing the weights νi via appropriate truncation specifically µt tr where tr denotes the truncated gradient given by tr xm yi ai ai for some appropriate truncation criteria specified by and in our algorithm we take and to be two collections of events given by lb ub αz kzk αz kzk αh zz kzk where αzlb αzub αz are predetermined truncation thresholds in words we drop components whose sizes fall outside some confidence range where the magnitudes of both the numerator and denominator of νi are comparable to their respective mean values this paradigm could be at first glance since one might expect the larger terms to be better aligned with the desired search direction the issue however is that the large terms are extremely volatile and could dominate all other components in an undesired way in contrast twf makes use of only gradient components of typical sizes which slightly increases the bias but remarkably reduces the variance of the descent direction we expect such gradient regularization and variance reduction schemes to be beneficial for solving broad family of nonconvex problems truncated spectral initialization key step to ensure meaningful convergence is to seed twf with some point inside the basin of attraction which proves crucial for other nonconvex procedures as well (for data wf converges empirically as mini is much larger than the real case) (algorithm: truncated wirtinger flow — input: measurements yi, feature vectors ai, and truncation thresholds αzlb αzub αh and αy with default values; initialize the iterate via the truncated spectral method, taking the leading eigenvector of the matrix formed from the yi and ai with unusually large yi excluded and rescaling it appropriately; loop: apply the truncated gradient update over the index set determined by the thresholds and the running scale kt; output the final iterate) an appealing initialization procedure is the spectral method which initializes as the leading eigenvector of this is based on the observation that for any fixed unit vector ye whose principal component is exactly with an eigenvalue of unfortunately the success of this method requires sample complexity exceeding log to see this recall that maxi yi log letting arg maxi yi and ak one can derive ak yk log which
dominates ye unless log as result is closer to the principal component of ye than when this drawback turns out to be substantial practical issue this issue can be remedied if we preclude those data yi with large magnitudes when running the spectral method specifically we propose to initialize as the leading eigenvector of xm pm yi ai yl followed by proper scaling so as to ensure kz kxk as illustrated in fig the empirical advantage of the truncated spectral method is increasingly more remarkable as grows (figure: relative initialization error against the signal dimension when ai, comparing the plain spectral method and the truncated spectral method) choice of algorithmic parameters one important implementation detail is the learning rate µt there are two alternatives that work well in both theory and practice fixed size take µt for some constant as long as is not too large this strategy always works under the condition our theorems hold for any positive constant backtracking line search with truncated objective this strategy performs line search along the descent direction and determines an appropriate learning rate that guarantees sufficient improvement with respect to the truncated objective details are deferred to the supplement the other algorithmic details to specify are the truncation thresholds αh αzlb αzub and αy the present paper isolates concrete set of combinations as given in in all theory and numerical experiments presented in this work we assume that the parameters fall within this range (figure: empirical success rates for real gaussian design, empirical success rates for complex gaussian design, and relative mse in db averaged over runs against snr for poisson data, comparing twf with the poisson objective and wf with the gaussian objective as the number of measurements varies) more numerical experiments and discussion we conduct more extensive numerical experiments to corroborate our main results and verify the applicability of twf on practical problems for all experiments conducted herein we take fixed step size µt employ power iterations for initialization and gradient iterations the truncation levels are taken to be the default values αzlb αzub αh and αy we first apply twf to sequence of noiseless problems with and varying generate the object at random and produce the feature vectors ai in two different ways ai ind ai jn monte carlo trial is declared success if the estimate obeys dist kxk fig and illustrate the empirical success rates of twf averaged over runs for each for noiseless data indicating that are often sufficient under real and complex gaussian designs respectively for the sake of comparison we simulate the empirical success rates of wf with the step size µt min as recommended by as shown in fig twf outperforms wf under random gaussian features implying that twf exhibits either better convergence rate or enhanced phase transition behavior while this work focuses on the objective for concreteness the proposed paradigm carries over to variety of nonconvex objectives and might have implications in solving other problems that involve latent variables matrix completion sparse coding dictionary learning and mixture problems we conclude this paper with an example on estimating mixtures of linear regression imagine ai with probability yi else where are unknown it has been shown in that in the noiseless case the ground truth satisfies next we empirically evaluate the stability of twf under noisy data
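before the noisy-data experiments, here is a compact numpy sketch of the real-valued recipe described above, a truncated spectral initialization followed by truncated gradient iterations on the poisson objective; the thresholds, step size, and the exact form of the truncation rules are simplified assumptions for illustration and do not reproduce the paper's precise defaults or the complex-valued case

```python
import numpy as np

def twf_real(A, y, iters=200, mu=0.2, alpha_y=3.0, alpha_lb=0.3, alpha_ub=5.0, alpha_h=5.0):
    """Minimal real-valued sketch of truncated Wirtinger flow (illustrative only)."""
    m, n = A.shape
    # truncated spectral initialization: exclude samples with unusually large y_i,
    # then take the (rescaled) leading eigenvector of the weighted covariance
    lam = np.sqrt(np.mean(y))                      # ||x|| estimate, since E[y_i] is about ||x||^2
    keep = y <= (alpha_y ** 2) * np.mean(y)
    Ymat = (A[keep] * y[keep, None]).T @ A[keep] / m
    _, U = np.linalg.eigh(Ymat)
    z = lam * U[:, -1]
    # truncated gradient iterations (Poisson-style gradient, real case)
    for _ in range(iters):
        Az = A @ z
        resid = Az ** 2 - y
        # keep only samples whose numerator and denominator have "typical" size
        E1 = (np.abs(Az) >= alpha_lb * np.linalg.norm(z)) & (np.abs(Az) <= alpha_ub * np.linalg.norm(z))
        K = np.mean(np.abs(resid))
        E2 = np.abs(resid) <= alpha_h * K * np.abs(Az) / np.linalg.norm(z)
        T = E1 & E2
        grad = 2.0 * (A[T] * (resid[T] / Az[T])[:, None]).sum(axis=0) / m
        z = z - mu * grad
    return z
```

under a random gaussian design one would expect behaviour qualitatively similar to the curves above, with recovery up to global sign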
set produce ai and generate yi according to the poisson model fig shows the relative mean square the db varying snr cf as can be seen the empirical relative mse scales inversely proportional to snr which matches our stability guarantees in theorem since on the db scale the slope is about as predicted by the theory number of measurements figure empirical success rate for mixed regression fi ai ai yi which forms set of quadratic constraints in particular if one further knows pm then this reduces to the form running twf with nonconvex objective with the assistance of grid search proposed in applied right after truncated initialization yields accurate estimation of under minimal sample complexity as illustrated in fig acknowledgments is partially supported by nsf under grant and by the math award from the simons foundation is supported by the same nsf grant references and nemirovski lectures on modern convex optimization volume fienup phase retrieval algorithms comparison applied optics chen yi and caramanis convex formulation for mixed regression with two components minimax optimal rates in conference on learning theory colt candès strohmer and voroninski phaselift exact and stable signal recovery from magnitude measurements via convex programming communications on pure and applied mathematics waldspurger aspremont and mallat phase recovery maxcut and complex semidefinite programming mathematical programming shechtman eldar szameit and segev sparsity based imaging with partially incoherent light via quadratic compressed sensing optics express candès and li solving quadratic equations via phaselift when there are about as many equations as unknowns foundations of computational ohlsson yang dong and sastry extension of compressive sensing to the phase retrieval problem in advances in neural information processing systems nips chen chi and goldsmith exact and stable covariance estimation from quadratic sampling via convex programming ieee trans on inf theory cai and zhang rop matrix recovery via projections annals of stats jaganathan oymak and hassibi recovery of sparse signals from the magnitudes of their fourier transform in ieee isit pages gross krahmer and kueng partial derandomization of phaselift using spherical designs journal of fourier analysis and applications candès li and soltanolkotabi phase retrieval via wirtinger flow theory and algorithms ieee transactions on information theory april netrapalli jain and sanghavi phase retrieval using alternating minimization nips schniter and rangan compressive phase retrieval via generalized approximate message passing ieee transactions on signal processing feb repetti chouzenoux and pesquet nonconvex regularized approach for phase retrieval international conference on image processing pages wei phase retrieval via kaczmarz methods white ward and sanghavi the local convexity of solving quadratic equations shechtman beck and eldar gespar efficient phase retrieval of sparse signals ieee transactions on signal processing soltanolkotabi algorithms and theory for clustering and nonconvex quadratic programming phd thesis stanford university trefethen and bau iii numerical linear algebra volume siam candès li and soltanolkotabi phase retrieval from coded diffraction patterns to appear in applied and computational harmonic analysis keshavan montanari and oh matrix completion from noisy entries journal of machine learning research jain netrapalli and sanghavi matrix completion using alternating minimization in acm symposium on theory of computing pages 
sun and luo guaranteed matrix completion via nonconvex factorization focs arora ge ma and moitra simple efficient and neural algorithms for sparse coding conference on learning theory colt sun qu and wright complete dictionary recovery over the sphere icml balakrishnan wainwright and yu statistical guarantees for the em algorithm from population to analysis arxiv preprint yi caramanis and sanghavi alternating minimization for mixed linear regression international conference on machine learning june 
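as a small addendum to the coded diffraction pattern experiment described above, here is an illustrative numpy sketch of how such magnitude-only measurements can be simulated (modulate the signal with random masks, apply the dft, record squared magnitudes); the mask alphabet and helper name are assumptions for illustration, not the exact modulation distribution used in the paper

```python
import numpy as np

def simulate_cdp(x, num_masks=12, seed=0):
    """Illustrative coded-diffraction-pattern measurements: y = |F (d * x)|^2
    for several random modulation masks d applied to the signal x."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    masks = rng.choice(np.array([1, -1, 1j, -1j]), size=(num_masks, n))  # assumed mask alphabet
    return np.abs(np.fft.fft(masks * x, axis=1)) ** 2, masks
```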
framework for individualizing predictions of disease trajectories by exploiting structure suchi saria dept of computer science johns hopkins university baltimore md ssaria peter schulam dept of computer science johns hopkins university baltimore md pschulam abstract for many complex diseases there is wide variety of ways in which an individual can manifest the disease the challenge of personalized medicine is to develop tools that can accurately predict the trajectory of an individual disease which can in turn enable clinicians to optimize treatments we represent an individual disease trajectory as function describing the severity of the disease over time we propose hierarchical latent variable model that individualizes predictions of disease trajectories this model shares statistical strength across observations at different population subpopulation and the individual level we describe an algorithm for learning population and subpopulation parameters offline and an online procedure for dynamically learning parameters finally we validate our model on the task of predicting the course of interstitial lung disease leading cause of death among patients with the autoimmune disease scleroderma we compare our approach against and demonstrate significant improvements in predictive accuracy introduction in complex chronic diseases such as autism lupus and parkinson the way the disease manifests may vary greatly across individuals for example in scleroderma the disease we use as running example in this work individuals may be affected across six organ lungs heart skin gastrointestinal tract kidneys and varying extents for any single organ system some individuals may show rapid decline throughout the course of their disease while others may show early decline but stabilize later on often in such diseases the most effective drugs have strong with tools that can accurately predict an individual disease activity trajectory clinicians can more aggressively treat those at greatest risk early rather than waiting until the disease progresses to high level of severity to monitor the disease physicians use clinical markers to quantify severity in scleroderma for example pfvc is clinical marker used to measure lung severity the task of individualized prediction of disease activity trajectories is that of using an individual clinical history to predict the future course of clinical marker in other words the goal is to predict function representing trajectory that is updated dynamically using an individual previous markers and individual characteristics predicting disease activity trajectories presents number of challenges first there are multiple latent factors that cause heterogeneity across individuals one such factor is the underlying biological mechanism driving the disease for example two different genetic mutations may trigger distinct disease trajectories as in figures and if we could divide individuals into groups according to their disease subtypes see would be straightforward to fit separate models to each subpopulation in most complex diseases however the mechanisms are poorly understood and clear definitions of subtypes do not exist if subtype alone determined trajectory then we could cluster individuals however other unobserved factors such as behavior and prior exposures affect health and can cause different trajectories across individuals of the same subtype for instance chronic smoker will typically have unhealthy lungs and so may have trajectory that is consistently lower than which we must 
account for using parameters an individual trajectory may also be influenced by transient an infection unrelated to the disease that makes it difficult to breath similar to the dips in figure or the third row in figure this can cause marker values to temporarily drop and may be hard to distinguish from disease activity we show that these factors can be arranged in hierarchy population subpopulation and individual but that not all levels of the hierarchy are observed finally the functional outcome is rich target and therefore more challenging to model than scalar outcomes in addition the marker data is observed in and is irregularly sampled making commonly used approaches to time series modeling or approaches that rely on imputation not well suited to this domain related work the majority of predictive models in medicine explain variability in the target outcome by conditioning on observed risk factors alone however these do not account for latent sources of variability such as those discussed above further these models are typically use features from data measured up until the current time to predict clinical marker or outcome at fixed point in the future as an example consider the mortality prediction model by lee et al where logistic regression is used to integrate features into prediction about the probability of death within days for given patient to predict the outcome at multiple time points typically separate models are trained moreover these models use data from window rather than growing history researchers in the statistics and machine learning communities have proposed solutions that address number of these limitations most related to our work is that by rizopoulos where the focus is on making dynamical predictions about outcome time until death their model updates predictions over time using all previously observed values of longitudinally recorded marker besides conditioning on observed factors they account for latent heterogeneity across individuals by allowing for adjustments to the for longitudinal marker deviations from the population baseline are modeled using random effects by sampling intercepts from common distribution other closely related work by et al tackle similar problem as rizopoulos but address heterogeneity using mixture model another common approach to dynamical predictions is to use markov models such as autoregressive models hmms state space models and dynamic bayesian networks see in although such models naturally make dynamic predictions using the full history by they typically assume discrete observation times gaussian processes gps are commonly used alternative for handling roberts et al for recent review of gp time series models since gaussian processes are nonparametric generative models of functions they naturally produce functional predictions dynamically by using the posterior predictive conditioned on the observed data mixtures of gps have been applied to model heterogeneity in the covariance structure across time series however as noted in roberts et appropriate mean functions are critical for accurate forecasts using gps in our work an individual trajectory is expressed as gp with highly structured mean comprising population subpopulation and components where some components are observed and others require inference more broadly models have been applied in many fields to model heterogeneous collections of units that are organized within hierarchy for example in predicting student grades over time individuals within school may have parameters 
sampled from the model and the model parameters in turn may be sampled from model in our setting the hierarchical individuals belong to the same not known priori similar ideas are studied in learning where relationships between distinct prediction tasks are used to encourage similar parameters this has been applied to modeling trajectories by treating predictions at each time point as separate task and enforcing similarity between submodels close in time this approach is limited however in that it models finite number of times others more recently have developed models for disease trajectories see and references within but these focus on retrospective analysis to discover disease etiology rather than dynamical prediction schulam et al incorporate differences in trajectories due to subtypes and factors we build upon this work here finally recommender systems also share subtype coefficients rdz subtype marginal model coefficients rqz zi subtype indicator subtype marginal model features rqz structured noise gp yij ni fi structured noise function rr individual model coefficients tij rd individual model covariance matrix rd population model map rdp population model coefficients rdp population model features figure plots show example marker trajectories plot shows adjustments to population and subpopulation fit row row makes an adjustment row makes shortterm structured noise adjustments plot shows the proposed graphical model levels in the hierarchy are model parameters are enclosed in dashed circles observed random variables are shaded information across individuals with the aim of tailoring predictions see but the task is otherwise distinct from ours contributions we propose hierarchical model of disease activity trajectories that directly addresses and of heterogeneity in complex chronic diseases using three levels the population level subpopulation level and individual level the model discovers the subpopulation structure automatically and infers structure over time when making predictions in addition we include gaussian process as model of structured noise which is designed to explain away temporary sources of variability that are unrelated to disease activity together these four components allow individual trajectories to be highly heterogeneous while simultaneously sharing statistical strength across observations at different resolutions of the data when making predictions for given individual we use bayesian inference to dynamically update our posterior belief over parameters given the clinical history and use the posterior predictive to produce trajectory estimate finally we evaluate our approach by developing trajectory prediction tool for lung disease in scleroderma we train our model using large national dataset containing individuals with scleroderma tracked over years and compare our predictions against alternative approaches we find that our approach yields significant gains in predictive accuracy of disease activity trajectories disease trajectory model we describe hierarchical model of an individual clinical marker values the graphical model is shown in figure for each individual we use ni to denote the number of observed markers we denote each individual observation using yij and its measurement time using tij where ni we use rni and rni to denote all of individual marker values and measurement times respectively in the following discussion tij denotes containing basis expansion of the time tij and we use tini to denote the matrix containing the basis expansion of points in in 
each of its rows we model the jth marker value for individual as normally distributed random variable with mean assumed to be the sum of four terms population component subpopulation component an individual component and structured noise component yij tij φz tij tij population subpopulation individual fi tij structured noise the four terms in the sum serve two purposes first they allow for number of different sources of variation to influence the observed marker value which allows for heterogeneity both across and within individuals second they share statistical strength across different subsets of observations the population component shares strength across all observations the subpopulation component shares strength across observations belonging to subgroups of individuals the individual component shares strength across all observations belonging to the same individual finally the structured noise shares information across observations belonging to the same individual that are measured at similar times predicting an individual trajectory involves estimating her subtype and individualspecific parameters as new clinical data becomes we describe each of the components in detail below population level the population model predicts aspects of an individual disease activity trajectory using observed baseline characteristics gender and race which are represented using the feature vector this is shown within the orange box in figure here we assume that this component is linear model where the coefficients are function of the features rqp the predicted value of the jth marker of individual measured at time tij is shown in eq where φp rdp is basis expansion of the observation time and rdp is matrix used as linear map from an individual covariates to coefficients ρi rdp at this level individuals with similar covariates will have similar coefficients the matrix is learned offline subpopulation level we model an individual subtype using latent variable zi where is the number of subtypes we associate each subtype with unique disease activity trajectory represented using where the number and location of the knots and the degree of the polynomial pieces are fixed prior to learning these determine basis expansion φz rdz mapping time to the basis function values at rdz that time trajectories for each subtype are parameterized by vector of coefficients for which are learned offline under subtype zi the predicted value of marker yij measured at time tij is shown in eq this component explains differences such as those observed between the trajectories in figures and in many cases features at baseline may be predictive of subtype for example in scleroderma the types of antibody an individual produces the presence of certain proteins in the blood are correlated with certain trajectories we can improve predictive performance by conditioning on baseline covariates to infer the subtype to do this we use multinomial logistic regression to define marginal probabilities zi mult where πg we denote the weights of the multinomial regression using where the weights of the first class are constrained to be to ensure model identifiability the remaining weights are learned offline individual level this level models deviations from the population and subpopulation models using parameters that are learned dynamically as the individual clinical history grows here we parameterize the individual component using linear model with basis expansion rd and coefficients rd an individual coefficients are modeled as latent variables 
with marginal distribution b_i ~ N(0, Sigma_b); we write phi_b for the individual-level basis expansion and b_i for its coefficients, so the individual component's contribution to the predicted value of marker y_ij measured at time t_ij is phi_b(t_ij)^T b_i. This component can explain, for example, differences in overall health due to an unobserved characteristic such as chronic smoking, which may cause atypically lower lung function than what is predicted by the population and subpopulation components; such an adjustment is illustrated across the first and second rows of the trajectory figure.

Structured noise. Finally, the structured noise component f_i captures transient trends. For example, an infection may cause an individual's lung function to temporarily appear more restricted than it actually is, which may cause trends like those shown in the earlier figure and the third row of the trajectory figure. We treat f_i as a latent variable and model it using a Gaussian process with zero mean function and OU covariance function K_OU(t, t') = a^2 exp(-|t - t'| / l). The amplitude a controls the magnitude of the structured noise that we expect to see, and the length-scale l controls the length of time over which we expect these temporary trends to occur. The OU kernel is well suited for modeling such deviations: it is both mean-reverting and stationary, and draws from the corresponding stochastic process are only continuous (not smooth), which eliminates long-range dependencies between deviations. Applications in other domains may require different kernel structures motivated by properties of the noise in the trajectories. The model focuses on predicting the trajectory of an individual when left untreated; in many chronic conditions, as is the case for scleroderma, drugs only provide temporary relief, accounted for in our model by the structured-noise adjustments. If treatments that alter the long-term course are available and commonly prescribed, then these should be included within the model as an additional component that influences the trajectory.

Learning. Objective function: to learn the parameters of our model (Lambda, {beta_g}, W, Sigma_b, a, l), we maximize the marginal probability of all individuals' marker values given measurement times and features. This requires marginalizing over the latent variables z_i, b_i, and f_i for each individual, which yields a mixture of multivariate normals:

  P(y_i | x_i, t_i) = sum_g pi_g(x_i) N(y_i ; Phi_i^p Lambda x_i^(p) + Phi_i^z beta_g , Sigma_i),  where Sigma_i = Phi_i^b Sigma_b (Phi_i^b)^T + K_i^OU.

The marginal log-likelihood for all individuals is therefore sum_i log P(y_i | x_i, t_i); a more detailed derivation is provided in the supplement.

Optimizing the objective: to maximize the objective with respect to the parameters, we partition them into two subsets. The first subset (Sigma_b, a, l) contains values that parameterize the covariance function above; as is often done when designing the kernel of a Gaussian process, we use a combination of domain knowledge to choose candidate values and model selection to choose among the candidates. The second subset contains the values that parameterize the mean of the multivariate normal distribution above, and we learn these parameters using expectation maximization (EM) to find a local maximum of the likelihood.

Expectation step: all parameters related to b_i and f_i are limited to the covariance kernel and are not optimized using EM; we therefore only need to consider the subtype indicators z_i as unobserved in the expectation step. Because z_i is discrete, its posterior is computed by normalizing the joint probability of z_i and y_i. Let pi_ig denote the posterior probability that individual i has subtype g; then

  pi_ig proportional to pi_g(x_i) N(y_i ; Phi_i^p Lambda x_i^(p) + Phi_i^z beta_g , Sigma_i).

Maximization step: in the maximization step, we optimize the marginal probability of the soft assignments under the multinomial logistic regression model with respect to W using gradient-based methods. To optimize the expected log-likelihood with respect to Lambda and {beta_g}, we note that the mean of the multivariate normal for each individual is a linear function of these parameters: holding {beta_g} fixed, we can therefore solve for Lambda in closed form, and vice versa. A short computational sketch of the covariance construction and the expectation step is given below.
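The following is a minimal sketch, not the authors' implementation, of the pieces just described: the OU kernel, the per-individual marginal covariance Sigma_i, and the E-step responsibilities. Function names are hypothetical, the notation (Phi_p, Phi_z, Phi_b, Sigma_b, amplitude, lengthscale) follows the reconstruction above, and a small i.i.d. noise term is added to the marginal covariance, assumed here only for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ou_kernel(t1, t2, amplitude, lengthscale):
    """Ornstein-Uhlenbeck covariance: a^2 * exp(-|t - t'| / l)."""
    t1, t2 = np.asarray(t1, float), np.asarray(t2, float)
    return amplitude ** 2 * np.exp(-np.abs(t1[:, None] - t2[None, :]) / lengthscale)

def marginal_covariance(Phi_b, Sigma_b, times, amplitude, lengthscale, noise_var=1e-6):
    """Sigma_i = Phi_b Sigma_b Phi_b^T + K_OU (+ small jitter) for one individual."""
    K = ou_kernel(times, times, amplitude, lengthscale)
    return Phi_b @ Sigma_b @ Phi_b.T + K + noise_var * np.eye(len(times))

def subtype_responsibilities(y, mean_pop, Phi_z, betas, priors, Sigma):
    """E-step: pi_ig is proportional to pi_g * N(y_i; population mean + Phi_z beta_g, Sigma_i)."""
    logp = np.array([
        np.log(priors[g]) + multivariate_normal.logpdf(y, mean_pop + Phi_z @ betas[g], Sigma)
        for g in range(len(betas))
    ])
    logp -= logp.max()            # subtract the max log-probability for numerical stability
    post = np.exp(logp)
    return post / post.sum()
```

The responsibilities returned by subtype_responsibilities are exactly the soft assignments used in the maximization step.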
We use a block coordinate ascent approach, alternating between solving for Lambda and for {beta_g} until convergence. Because the expected log-likelihood is concave with respect to all parameters in each maximization step, the procedure is guaranteed to converge; we provide additional details in the supplement.

Prediction. Our prediction for the value of the trajectory at a new time t* is the expectation of the marker under the posterior predictive distribution, conditioned on the markers observed at times thus far. This requires evaluating the following expression:

  E[y_i(t*) | y_i, x_i] = sum_{z_i} integral E[y_i(t*) | z_i, b_i, f_i] P(z_i, b_i, f_i | y_i, x_i) db_i df_i
                        = phi_p(t*)^T Lambda x_i^(p)            (population prediction)
                        + phi_z(t*)^T E[beta_{z_i} | y_i]       (subpopulation prediction)
                        + phi_b(t*)^T E[b_i | y_i]              (individual prediction)
                        + E[f_i(t*) | y_i]                      (structured noise prediction),

where E[. | y_i] denotes an expectation conditioned on the observed markers y_i (and covariates x_i). In moving from the first line to the second, we have written the integral as an expectation, substituted the inner expectation with the mean of the normal distribution defining the marker likelihood, and used linearity of expectation.

Figure: the first plots show dynamic predictions using the proposed model for two individuals; red markers are unobserved, blue shows the trajectory predicted using the most likely subtype, and green shows the second most likely. Another plot shows dynamic predictions using the GP baseline, and a final plot shows predictions made using the proposed model without individual-specific adjustments. (Vertical axes show the marker value; horizontal axes show years since first seen.)

Below we show how the expectations in the expression above are computed; an expanded version of these steps is provided in the supplement. Computing the population prediction is straightforward, as all quantities are observed. To compute the subpopulation prediction, we need the marginal posterior over z_i, which we used in the expectation step above; the expected subtype coefficients are therefore E[beta_{z_i} | y_i] = sum_g pi_ig beta_g. To compute the individual prediction, note that by conditioning on z_i, the integral of the likelihood with respect to f_i and the prior over b_i form the likelihood and prior of a Bayesian linear regression. Let K_f denote the structured-noise covariance over the observed times; then the posterior over b_i conditioned on z_i is

  P(b_i | z_i, y_i, x_i) proportional to N(b_i ; 0, Sigma_b) N(y_i ; Phi_i^p Lambda x_i^(p) + Phi_i^z beta_{z_i} + Phi_i^b b_i , K_f).

Just as before, we have integrated over f_i, moving its effect from the mean of the normal distribution to the covariance. Because the prior over b_i is conjugate to the likelihood on the right-hand side, the posterior can be written in closed form as a normal distribution, and its mean is available analytically. To compute the unconditional posterior mean of b_i, we take the expectation of this conditional mean with respect to the posterior over z_i; the expression is linear in beta_{z_i}, so we can directly replace beta_{z_i} with its expectation sum_g pi_ig beta_g. Finally, to compute the structured noise prediction, note that conditioned on z_i and b_i, the GP prior and marker likelihood form a standard GP regression, so the conditional posterior of f_i is a GP with mean

  E[f_i(t*) | z_i, b_i, y_i] = K_OU(t*, t_i) K_f^{-1} (y_i - Phi_i^p Lambda x_i^(p) - Phi_i^z beta_{z_i} - Phi_i^b b_i).

To compute the unconditional posterior expectation of f_i, we note that the expression above is linear in beta_{z_i} and b_i, and so their expectations can be plugged in.

Experiments. We demonstrate our approach by building a tool to predict the lung disease trajectories of individuals with scleroderma. Lung disease is currently the leading cause of death among scleroderma patients and is notoriously difficult to treat because there are few predictors of decline and there is tremendous variability across individual trajectories. Clinicians track lung severity using the percent of predicted forced vital capacity (pFVC), which is expected to drop as the disease progresses. In addition, demographic variables and molecular test results are often available at baseline to aid prognoses.
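Referring back to the prediction step derived just before this experiments section, here is a minimal sketch of the posterior-predictive mean. It combines the population term with per-subtype subpopulation, individual, and structured-noise terms weighted by the E-step responsibilities; by linearity of the expressions this is equivalent to plugging in the posterior expectations. It reuses the ou_kernel helper from the previous sketch; the function name and the Phi_fn basis-expansion callback are hypothetical, and an i.i.d. noise term is again assumed in K_f.

```python
import numpy as np

def predict_trajectory(t_star, y, t_obs, Phi_fn,   # Phi_fn maps times -> (Phi_p, Phi_z, Phi_b)
                       rho, betas, resp,           # rho = Lambda @ x_p ; resp = E-step weights pi_ig
                       Sigma_b, amplitude, lengthscale, noise_var):
    """Posterior-predictive mean at new times t_star: population + subpopulation
    + individual + structured-noise terms, averaged over the subtype posterior."""
    Pp, Pz, Pb = Phi_fn(t_obs)            # basis expansions at observed times
    Sp, Sz, Sb = Phi_fn(t_star)           # basis expansions at prediction times
    K_f = ou_kernel(t_obs, t_obs, amplitude, lengthscale) + noise_var * np.eye(len(t_obs))
    K_star = ou_kernel(t_star, t_obs, amplitude, lengthscale)
    Kinv = np.linalg.inv(K_f)

    pred = Sp @ rho                       # population term (all quantities observed)
    for g, pi_g in enumerate(resp):
        r_g = y - Pp @ rho - Pz @ betas[g]                       # residual under subtype g
        A = np.linalg.inv(np.linalg.inv(Sigma_b) + Pb.T @ Kinv @ Pb)
        b_g = A @ Pb.T @ Kinv @ r_g                              # Bayesian linear-regression mean
        f_g = K_star @ Kinv @ (r_g - Pb @ b_g)                   # GP-regression mean for f_i
        pred = pred + pi_g * (Sz @ betas[g] + Sb @ b_g + f_g)    # weight by subtype posterior
    return pred
```

Averaging the per-subtype predictions with the responsibilities gives the same result as substituting the posterior expectations of beta_{z_i} and b_i, since each term is linear in those quantities.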
we train and validate our model using data from the johns hopkins scleroderma center patient registry which is one of the largest in the world to select individuals from the registry we used the following criteria first we include individuals who were seen at the clinic within two years of their earliest symptom second we exclude all individuals with fewer than two pfvc measurements after their first visit finally we exclude individuals who received lung transplant the dataset contains individuals and total of pfvc measurements for the population model we use constant functions observed covariates adjust an individual intercept the population covariates are gender african american race and indicators of aca and proteins believed to be connected to lung disease note that all features are binary for the subpopulation we set boundary knots at and years the maximum observation time in our data set is years use two interior knots that divide the time period from years into three equally spaced chunks and use quadratics as the piecewise components these hyperparameters knots and polynomial degree are also used for all baseline models we select subtypes using bic the covariates in the subtype marginal model are the same used in the population model for the individual model we use linear functions for the σb we set σb to be diagonal covariance matrix with entries along the diagonal which correspond to intercept and slope variances respectively finally we set and using domain knowledge we expect transient deviations to last around years and to change pfvc by around units baselines first to compare against typical approaches used in clinical medicine that condition on baseline covariates only we fit regression model conditioned on all covariates included in above the mean is parameterized using bases as xi xi xj xi in xiz xi xj in pairs of xiz the second baseline is similar to and and extends the first baseline by accounting for heterogeneity the model has mean function identical to the first baseline and individualizes predictions using gp with the same kernel as in equation using as above another natural approach is to explain heterogeneity by using mixture model similar to however mixture model can not adequately explain away sources of variability that are unrelated to subtype and therefore fails to recover subtypes that capture canonical trajectories we discuss this in detail in the supplemental section the recovered subtypes from the full model do not suffer from this issue to make the comparison fair and to understand the extent to which the component contributes towards personalizing predictions we create mixture model proposed no personalization where the subtypes are fixed to be the same as those in the full model and the remaining parameters are learned note that this version does not contain the component evaluation we make predictions after one two and four years of errors are summarized within four disjoint time periods and to measure error we use the absolute difference between the prediction and smoothed version of the individual observed trajectory we estimate mean absolute error mae using cv at the level of individuals all of an individual data is and test for statistically significant reductions in error using paired for all models we use the map estimate of the individual trajectory in the models that include subtypes this means that we choose the trajectory predicted by the most likely subtype under the posterior although this discards information from the posterior in our 
experience, clinicians find this choice to be more interpretable.

Qualitative results. The prediction figure shown earlier presents dynamically updated predictions for two patients, one per row; dynamic updates move from left to right. Blue lines indicate the prediction under the most likely subtype and green lines indicate the prediction under the second most likely. (After the eighth year, data become too sparse to further divide the evaluation time span.)

Table: MAE of pFVC predictions for the two baselines (B-spline regression with baseline features, and its extension with a GP) and the proposed model with and without personalization, under one, two, and four years of observed data. Bold numbers indicate the best performance across models, statistically significant results are marked, and the final columns report the percent improvement over the next best model.

The first individual is a white woman with antibodies that are thought to be associated with active lung disease. Within the first year, her disease seems stable, and the model predicts this course with confidence. After another year of data, the model shifts belief to a rapidly declining trajectory, likely in part due to a sudden dip in the observed marker. We contrast this with the behavior of the GP baseline, which has limited capacity to express individualized behavior: it does not adequately adjust in light of the downward trend between years one and two. To illustrate the value of including individual-specific adjustments, we turn to the plots of predictions made by the proposed model with and without personalization. This individual is a white man who is antibody-negative, which makes declining lung function less likely. Both models use the same set of subtypes, but whereas the model without adjustment does not consider the recovering subtype to be likely until after year two, the full model shifts the recovering subtype's trajectory downward towards the man's initial pFVC value and identifies the correct trajectory using a single year of data.

Quantitative results. The table reports MAE for the baselines and the proposed model. We note that after observing two or more years of data, our model's errors are smaller than those of the two baselines, and statistically significantly so in all but one comparison, although the GP improves over the first baseline. These results suggest that both the subpopulation and individual components enable more accurate predictions of an individual's future course as more data are observed. Moreover, by comparing the proposed model with and without personalization, we see that subtypes alone are not sufficient and that individual-specific adjustments are critical. These improvements also have clinical significance: for example, individuals whose pFVC drops by more than a given threshold are candidates for aggressive immunosuppressive therapy. Of the individuals in our data who decline by more than this threshold, our model predicts such decline at twice the rate of the GP baseline, and with a lower false positive rate.

Conclusion. We have described a hierarchical model for making individualized predictions of disease activity trajectories that accounts for both latent and observed sources of heterogeneity. We empirically demonstrated that using all elements of the proposed hierarchy allows our model to dynamically personalize predictions and reduce error as more data about an individual is collected. Although our analysis focused on scleroderma, our approach is more broadly applicable to other complex, heterogeneous diseases; examples include asthma, autism, and COPD. There are
several promising directions for further developing the ideas presented here first we observed that predictions are less accurate early in the disease course when little data is available to learn the adjustments to address this shortcoming it may be possible to leverage covariates in addition to the baseline covariates used here second the quality of our predictions depends upon the allowed types of adjustments encoded in the model more sophisticated models of individual variation may further improve performance moreover approaches for automatically learning the class of possible adjustments would make it possible to apply our approach to new diseases more quickly references craig complex diseases research and applications nature education varga denton and wigley scleroderma from pathogenesis to comprehensive management springer science business media et al asthma endotypes new approach to classification of disease entities within the asthma syndrome journal of allergy and clinical immunology wiggins robins adamson bakeman and henrich support for dimensional view of autism spectrum disorders in toddlers journal of autism and developmental disorders castaldi et al cluster analysis in the copdgene study identifies subtypes of smokers with distinct patterns of airway disease and emphysema thorax saria and goldenberg subtyping what is it and its role in precision medicine ieee intelligent systems lee austin rouleau liu naimark and tu predicting mortality among patients hospitalized for heart failure derivation and validation of clinical model jama rizopoulos dynamic predictions and prospective accuracy in joint models for longitudinal and data biometrics et al joint latent class models for longitudinal and data review statistical methods in medical research murphy machine learning probabilistic perspective mit press roberts osborne ebden reece gibson and aigrain gaussian processes for timeseries modelling philosophical transactions of the royal society mathematical physical and engineering sciences shi and titterington hierarchical gaussian process mixtures for regression statistics and computing gelman and hill data analysis using regression and models cambridge university press wang et al feature learning to identify longitudinal phenotypic markers for alzheimer disease progression prediction in advances in neural information processing systems pages ross and dy nonparametric mixture of gaussian processes with constraints in proceedings of the international conference on machine learning pages schulam wigley and saria clustering longitudinal clinical marker trajectories from electronic health data applications to phenotyping and endotype discovery in proceedings of the twintyninth aaai conference on artificial intelligence marlin modeling user rating profiles for collaborative filtering in advances in neural information processing systems adomavicius and tuzhilin toward the next generation of recommender systems survey of the and possible extensions knowledge and data engineering ieee transactions on sontag bennett white dumais and billerbeck probabilistic models for personalizing web search in proceedings of the fifth acm international conference on web search and data mining pages acm rasmussen and williams gaussian processes for machine learning the mit press allanore et al systemic sclerosis nature reviews disease primers khanna et al clinical course of lung physiology in patients with scleroderma and interstitial lung disease analysis of the scleroderma lung study placebo group 
arthritis rheumatism shi wang will and west gaussian process functional regression models with application to curve prediction stat 
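As a small addendum to the paper above, the sketch below shows the subtype marginal model described in the model section: a multinomial logistic regression over baseline covariates, with the weights of the first class fixed (conventionally to zero) for identifiability. This is an illustrative reconstruction, not the authors' code, and the names are hypothetical.

```python
import numpy as np

def subtype_prior(x_baseline, W):
    """P(z_i = g | x_i): multinomial logistic regression over baseline covariates.
    W has one row of weights per subtype; the first row is held fixed (e.g., at zero)
    for identifiability."""
    scores = W @ x_baseline        # shape (G,), one score per subtype
    scores = scores - scores.max() # shift scores for numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Hypothetical usage: combine this prior with the E-step responsibilities computed from
# the observed markers, then report the trajectory of the most likely (MAP) subtype,
# as done in the paper's experiments.
W = np.vstack([np.zeros(4), np.random.randn(2, 4)])   # 3 subtypes, 4 binary covariates
print(subtype_prior(np.array([1.0, 0.0, 1.0, 0.0]), W))
```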
subspace clustering with irrelevant features via robust dantzig selector chao qu department of mechanical engineering national university of singapore huan xu department of mechanical engineering national university of singapore mpexuh abstract this paper considers the subspace clustering problem where the data contains irrelevant or corrupted features we propose method termed robust dantzig selector which can successfully identify the clustering structure even with the presence of irrelevant features the idea is simple yet powerful we replace the inner product by its robust counterpart which is insensitive to the irrelevant features given an upper bound of the number of irrelevant features we establish theoretical guarantees for the algorithm to identify the correct subspace and demonstrate the effectiveness of the algorithm via numerical simulations to the best of our knowledge this is the first method developed to tackle subspace clustering with irrelevant features introduction the last decade has witnessed fast growing attention in research of data images videos dna microarray data and data from many other applications all have the property that the dimensionality can be comparable or even much larger than the number of samples while this setup appears in the first sight the inference and recovery is possible by exploiting the fact that data often possess low dimensional structures on the other hand in this era of big data huge amounts of data are collected everywhere and such data is generally heterogeneous clean data and irrelevant or even corrupted information are often mixed together which motivates us to consider the big but dirty data problem in particular we study the subspace clustering problem in this setting subspace clustering is an important subject in analyzing data inspired by many real applications given data points lying in the union of multiple linear spaces subspace clustering aims to identify all these linear spaces and cluster the sample points according to the linear spaces they belong to here different subspaces may correspond to motion of different objects in video sequence different rotations translations and thickness in handwritten digit or the latent communities for the social graph variety of algorithms of subspace clustering have been proposed in the last several years including algebraic algorithms iterative methods statistical methods and spectral methods among them sparse subspace clustering ssc not only achieves empirical performance but also possesses elegant theoretical guarantees in the authors provide geometric analysis of ssc which explains rigorously why ssc is successful even when the subspaces are overlapping and extend ssc to the noisy case where data are contaminated by additive gaussian noise different from these work we focus on the case where some irrelevant features are involved mathematically ssc indeed solves for each sample sparse linear regression problem with the dictionary being all other samples many properties of sparse linear regression problem are well understood in the clean data case however the performance of most standard algorithms deteriorates lasso and omp even only few entries are corrupted as such it is well expected that standard ssc breaks for subspace clustering with irrelevant or corrupted features see section for numerical evidences sparse regression under corruption is hard problem and few work has addressed this problem our contribution inspired by we use simple yet powerful tool called robust inner product and 
propose the robust dantzig selector to solve the subspace clustering problem with irrelevant features while our work is based upon the robust inner product developed to solve robust sparse regression the analysis is quite different from the regression case since both the data structures and the tasks are completely different for example the rip condition essential for sparse regression is hardly satisfied for subspace clustering we provide sufficient conditions to ensure that the robust dantzig selector can detect the true subspace clustering we further demonstrate via numerical simulation the effectiveness of the proposed method to the best of our knowledge this is the first attempt to perform subspace clustering with irrelevant features problem setup and method notations and model the clean data matrix is denoted by xa where each column corresponds to data point normalized to unit vector the data points are lying on union of subspace sl each subspace sl is of dimension dl which is smaller than and contains nl data samples with we denote the observed dirty data matrix by out of the features up to of them are irrelevant without loss of generality let xo xa where xo denotes the irrelevant data the subscript and denote the set of row indices corresponding to true and irrelevant features and the superscript denotes the transpose notice that we do not know priori except its cardinality is the model is illustrated in figure let xa denote the selection of columns in xa that belongs to sl similarly denote the corresponding columns in by without loss of generality let be ordered further more we use the subscript to describe matrix that excludes the column xa xa xa xa xa nl we use the superscript lc to describe matrix that excludes column in subspace xa xa xa xa xa for matrix we use σs to denote the submatrix with row indices in set and column indices in set for any matrix denotes the symmetrized convex hull of its column conv we define xa for simplification the symmetrized convex hull of clean data in subspace except data finally we use to denote the norm of vector and to denote infinity norm of vector or matrix caligraphic letters such as xl represent the set containing all columns of the corresponding clean data matrix figure illustration of the model of irrelevant features in the subspace clustering problem the left one is the model addressed in this paper among total features up tp of them are irrelevant the right one illustrates more general case where the value of any element of each column can be arbitrary due to corruptions it is harder case and left for future work figure illustration of the subspace detection property here each figure corresponds to matrix where each column is ci and entries are in white the left figure satisfies this property the right one does not method in this secion we present our method as well as the intuition that derives it when all observed data are clean to solve the subspace clustering problem the celebrated ssc proposes to solve the following convex programming min kci ci xi ci for each data point xi when data are corrupted by noise of small magnitude such as gaussian noise straightforward extension of ssc is the lasso type method called min kci ci kxi ci note that while formulation has the same form as lasso it is used to solve the subspace clustering task in particular the support recovery analysis of lasso does not extend to this case as typically does not satisfy the rip condition this paper considers the case where contains corrupted features as we 
discussed above lasso is not robust to such corruption an intuitive idea is to consider the following formulation first proposed for sparse linear regression min kci ci kxi ci where is some norm corresponding to the sparse type of one major challenge of this formulation is that it is not convex as such it is not clear how to efficiently find the optimal solution and how to analyze the property of the solution typically done via convex analysis in the subspace clustering task our method is based on the idea of robust inner product the robust inner product ha bik is defined as follows for vector rd rd we compute qi ai bi then are sorted and the smallest are selected let be the set of selected indices then ha bik qi the largest terms are truncated our main idea is to replace all inner products involved by robust counterparts ha where is the upper bound of the number of irrelevant features the intuition is that the irrelevant features with large magnitude may affect the correct subspace clustering this simple truncation process will avoid this we remark that we do not need to know the exact number of irrelevant feature but instead only an upper bound of it extending using the robust inner product leads the following formulation min kci ci ci where and are robust counterparts of and xi unfortunately may not be positive semidefinite matrix thus is not convex program unlike the work which studies in linear regression the difficulty of in the subspace clustering task appears to be hard to overcome instead we turn to the dantzig selector which is essentially linear program and hence no positive semidefiniteness is required min kci ci xi ci replace all inner product by its robust counterpart we propose the following robust dantzig selector which can be easily recast as linear program min kci robust dantzig selector ci subspace detection property to measure whether the algorithm is successful we define the criterion subspace detection property following we say that the subspace detection property holds if and only if for all the optimal solution to the robust dantzig selector satisfies ci is not zero vector property nonzeros entries of ci correspond to only columns of sampled from the same subspace as xi see figure for illustrations main results to avoid repetition and cluttered notations we denote the following primal convex problem by min λkσc its dual problem denoted by is maxhξ γi subject to before we presents our results we define some quantities the dual direction is an important geometric term introdcued in analyzing ssc here we define similarly the dual direction of the robust dantzig selector notice that the dual of robust dantzig problem is where and are robust counterparts of xi and respectively recall that and xi are the dirty data we decompose into two parts xa xa where the first term corresponds to the clean data and the second term is due to the irrelevant features and truncation from the robust inner product thus the second constraint of the dual problem becomes xa xa let be the optimal solution to the above optimization problem we define xi xa and the dual direction as xli kv xli similarly as ssc we define the subspace incoherence let vn the incoherl ence of point set xl to other clean data points is defined as xl maxk xa recall that we decompose and as xa xa and xa xa intuitively for robust dantzig selecter to succeed we want and not too large particularly we assume xa and xa theorem deterministic model denote µl xl rl mini xi rl and suppose µl rl for all if rl ul min ul rl then the 
subspace detection property holds for all in the range rl ul min ul rl in an ideal case when the condition of the upper bound of reduces to rl ul similar to the condition for ssc in the noiseless case based on condition under randomized generative model we can derive how many irrelevant features can be tolerated theorem random model suppose there are subspaces and for simplicity all subspaces have same dimension and are chosen uniformly at random for each subspace there are ρd points chosen independently and uniformly at random up to features of data are irrelevant each data point including true and irrelevant features is independent from other data points then for some universal constants the subspace detection property holds with probability at least exp ρd if log log and log log log log log log log where log is constant only depending on the density of data points on subspace and satisfies for all there is numerical value such that for all one can take simplifying the above conditions we can determine the number of irrelevant features that can be tolerated in particular if log and we choose the as log then the maximal number of irrelevant feature that can be torelated is log log log log log log with probability at least exp ρd min if log and we choose the same then the number of irrelevant feature we can tolerate is dc logd log min log log log log with probability at least exp ρd remark if is much larger than the lower bound of is proportional to the subspace dimension when increases the upper bound of decreases since decreases thus the valid range of shrinks when increases remark ignoring the logarithm terms when is large the tolerable is proportional to min when is small is proportional to min roadmap of the proof in this section we lay out the roadmap of proof in specific we want to establish the condition with the number of irrelevant features and the structure of data the incoherence and inradius for the algorithm to succeed indeed we provide lower bound of such that the optimal solution ci is not trivial and an upper bound of so that the property holds combining them together established the theorems property the property is related to the upper bound of the proof technique is inspired by and we first establish the following lemma which provides sufficient condition such that property holds of the problem lemma consider matrix rn and rn if there exist pair such that has support and sgn σs ξη kσsc ξη kσt ξη where is the set of indices of entry such that then for all optimal solution to the problem we have the variable in lemma is often termed the dual certificate we next consider an oracle problem to construct such dual certificate this and use its dual optimal variable denoted by candidate satisfies all conditions in the lemma automatically except to show where denotes the set of indices expect the ones corresponding to subspace we can compare this condition with the corresponding one in analyzing ssc in which one need where is the dual certificate recall that we can decompose xa xa thus condition becomes xa xa to show this holds we need to bound two terms xa and bounding the following lemma relates with and lemma suppose and are robust counterparts of and xi respectively and among features up to are irrelevant we can decompose and into following form xa xa and xa xa we define and xa and xa then we then bound and in the random model cap using the upper bound of the spherical indeed we have log log and log log with high probability bounding by exploiting the feasible condition in 
the dual of the oracle problem we obtain the following bound log furthermore can be lower bound by and can be upper bounded by log log in the random model with high probability thus the rhs can be upper bounded plugging this upper bound into we obtain the upper bound of with sufficiently large to ensure that the solution is not trivial not we need lower bound on if satisfies the following condition the optimal solution to problem can not be zero the proof idea is to show when is large enough the trivial solution can not be optimal in particular if the corresponding value in the primal problem is we then establish lower bound of and upper bound of so that the following inequality always holds by some carefully choosen notice we then further lower bound the rhs of equation using the bound of and that condition requires that and condition requires where and are some terms depending on the number of irrelevant features thus we require to get the maximal number of irrelevant features that can be tolerated numerical simulations in this section we use three numerical experiments to demonstrate the effectiveness of our method to handle features in particular we test the performance of our method and effect of number of irrelevant features and dimension subspaces with respect to different in all experiments the ambient dimension sample density the subspace are drawn uniformly at random each subspace has points chosen independently and uniformly random we measure the success of the algorithms using the relative violation of the subspace detection property defined as follows relv iolation where cn is the ground truth mask containing all such that xi xj belong to same subspace if relv iolation then the subspace detection property is satisfied we also check whether we obtain trivial solution if any column in is we first compare the robust dantzig selector with ssc and the results are shown in figure the is the number of irrelevant features and the is the relviolation defined above the ambient dimension the relative sample density the values of irrelevant features are independently sampled from uniform distribution in the region in and in we observe from figure that both ssc and lasso ssc are very sensitive to irrelevant information notice that is pretty large and can be considered as clustering failure compared with that the proposed robust dantzig selector performs very well even when it still detects the true subspaces perfectly in the same setting we do some further experiments our method breaks when is about we also do further experiment for with different in the supplementary material to show is not robust to irrelevant features we also examine the relation of to the performance of the algorithm in figure we test the subspace detection property with different and when is too small the algorithm gives trivial solution the black region in the figure as we increase the value of the corresponding solutions satisfy the subspace detection property represented as the white region in the figure when is larger than certain upper bound relv iolation becomes indicating errors in subspace clustering in figure we test the subspace detection property with different and notice we rescale with since by theorem should be proportional to we observe that the valid region of shrinks with increasing which matches our theorem conclusion and future work we studied subspace clustering with irrelevant features and proposed the robust dantzig selector based on the idea of robust inner product essentially truncated version 
of inner product to avoid original ssc lasso ssc robust dantzig selector relviolation relviolation original ssc lasso ssc robust dantzig selector number of irrelevant features number of irrelevant features figure relviolation with different simulated with and from to number of irrelevant features exact recovery with different number of irrelevant features simulated with with an increasing from to black region trivial solution white region solution with gray region relviolation subspace dimension exact recovery with different subspace dimension simulated with and an increasing from to black region trivial solution white region solution with gray region relviolation figure subspace detection property with different any single entry having too large influnce on the result we established the sufficient conditions for the algorithm to exactly detect the true subspace under the deterministic model and the random model simulation results demonstrate that the proposed method is robust to irrelevant information whereas the performance of original ssc and significantly deteriorates we now outline some directions of future research an immediate future work is to study theoretical guarantees of the proposed method under the model where each subspace is chosen deterministically while samples are randomly distributed on the respective subspace the challenge here is to bound the subspace incoherence previous methods uses the rotation invariance of the data which is not possible in our case as the robust inner product is invariant to rotations acknowledgments this work is partially supported by the ministry of education of singapore acrf tier two grant and star serc psf grant references pankaj agarwal and nabil mustafa projective clustering in proceedings of the acm symposium on principles of database systems pages keith ball an elementary introduction to modern convex geometry flavors of geometry emmanuel xiaodong li yi ma and john wright robust principal component analysis journal of the acm yudong chen constantine caramanis and shie mannor robust sparse regression under adversarial corruption in proceedings of the international conference on machine learning pages yudong chen ali jalali sujay sanghavi and huan xu clustering partially observed graphs via convex optimization the journal of machine learning research ehsan elhamifar and vidal sparse subspace clustering in cvpr pages guangcan liu zhouchen lin and yong yu robust subspace segmentation by representation in proceedings of the international conference on machine learning pages loh and martin wainwright regression with noisy and missing data provable guarantees with in advances in neural information processing systems pages le lu and vidal combined central and subspace clustering for computer vision applications in proceedings of the international conference on machine learning pages yi ma harm derksen wei hong and john wright segmentation of multivariate mixed data via lossy data coding and compression ieee transactions on pattern analysis and machine intelligence shankar rao roberto tron vidal and yi ma motion segmentation via robust subspace separation in the presence of outlying incomplete or corrupted trajectories in cvpr mahdi soltanolkotabi emmanuel candes et al geometric analysis of subspace clustering with outliers the annals of statistics mahdi soltanolkotabi ehsan elhamifar emmanuel candes et al robust subspace clustering the annals of statistics tibshirani regression shrinkage and selection via the lasso journal of the royal 
statistical society series pages vidal tutorial on subspace clustering ieee signal processing magazine rene vidal yi ma and shankar sastry generalized principal component analysis gpca ieee transactions on pattern analysis and machine intelligence vidal roberto tron and richard hartley multiframe motion segmentation with missing data using powerfactorization and gpca international journal of computer vision wang and huan xu noisy sparse subspace clustering in proceedings of the international conference on machine learning pages xu caramanis and sanghavi robust pca via outlier pursuit ieee transactions on information theory jingyu yan and marc pollefeys general framework for motion segmentation independent articulated rigid degenerate and in eccv pages hao zhu geert leus and georgios giannakis total for perturbed compressive sampling ieee transactions on signal processing 
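Appended after the references of the preceding paper, here is a minimal sketch of its two main ingredients: the robust inner product (under one reading of the truncation, dropping the k largest-magnitude coordinate products) and the robust Dantzig selector, recast as a linear program and solved here with SciPy's linprog. The function names are hypothetical and this is not the authors' code; it is only meant to make the construction concrete.

```python
import numpy as np
from scipy.optimize import linprog

def robust_inner(a, b, k):
    """Robust inner product: sum the coordinate products after dropping the k largest
    in magnitude (one reading of the truncation described in the paper)."""
    q = a * b
    if k <= 0:
        return q.sum()
    keep = np.argsort(np.abs(q))[: len(q) - k]
    return q[keep].sum()

def robust_gram(X, k):
    """Matrix of pairwise robust inner products between the columns of X."""
    n = X.shape[1]
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = robust_inner(X[:, i], X[:, j], k)
    return G

def robust_dantzig(X, i, k, lam):
    """min ||c||_1  subject to  || Gtilde c - gtilde ||_inf <= lam, written as an LP
    with auxiliary variables u >= |c|."""
    G_full = robust_gram(X, k)
    cols = [j for j in range(X.shape[1]) if j != i]   # exclude the point being expressed
    G, g = G_full[:, cols], G_full[:, i]
    m, n = G.shape
    obj = np.concatenate([np.zeros(n), np.ones(n)])   # minimize sum(u)
    A_ub = np.block([[ G,            np.zeros((m, n))],
                     [-G,            np.zeros((m, n))],
                     [ np.eye(n),   -np.eye(n)],
                     [-np.eye(n),   -np.eye(n)]])
    b_ub = np.concatenate([lam + g, lam - g, np.zeros(2 * n)])
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n]
```

Collecting the vectors returned by robust_dantzig for every column of X into a coefficient matrix gives the affinity matrix used for spectral clustering, exactly as in standard SSC pipelines.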
sparse pca via bipartite matchings megasthenis asteris the university of texas at austin megas dimitris papailiopoulos university of california berkeley dimitrisp anastasios kyrillidis the university of texas at austin anastasios alexandros dimakis the university of texas at austin dimakis abstract we consider the following sparse pca problem given set of data points we seek to extract small number of sparse components with disjoint supports that jointly capture the maximum possible variance such components can be computed one by one repeatedly solving the problem and deflating the input data matrix but this greedy procedure is suboptimal we present novel algorithm for sparse pca that jointly optimizes multiple disjoint components the extracted features capture variance that lies within multiplicative factor arbitrarily close to from the optimal our algorithm is combinatorial and computes the desired components by solving multiple instances of the bipartite maximum weight matching problem its complexity grows as low order polynomial in the ambient dimension of the input data but exponentially in its rank however it can be effectively applied on sketch of the input data we evaluate our algorithm on real datasets and empirically demonstrate that in many cases it outperforms existing approaches introduction principal component analysis pca reduces data dimensionality by projecting it onto principal subspaces spanned by the leading eigenvectors of the sample covariance matrix it is one of the most widely used algorithms with applications ranging from computer vision document clustering to network anomaly detection see sparse pca is useful variant that offers higher data interpretability property that is sometimes desired even at the cost of statistical fidelity furthermore when the obtained features are used in subsequent learning tasks sparsity potentially leads to better generalization error given real data matrix representing centered data points in variables the first sparse principal component is the sparse vector that maximizes the explained variance arg max ax where is the empirical covariance matrix unfortunately the directly enforced sparsity constraint makes the problem and hence computationally intractable in general significant volume of prior work has focused on various algorithms for approximately solving this optimization problem while some theoretical results have also been established under statistical or spectral assumptions on the input data in most cases one is not interested in finding only the first sparse eigenvector but rather the first where is the reduced dimension where the data will be projected contrary to the problem there has been very limited work on computing multiple sparse components the scarcity is partially attributed to conventional wisdom stemming from pca multiple components can be computed one by one repeatedly solving the sparse pca problem and deflating the input data to remove information captured by previously extracted components in fact sparse pca is not uniquely defined problem in the literature approaches can lead to different output depending on the type of deflation extracted components may or may not be orthogonal while they may have disjoint or overlapping supports in the statistics literature where the objective is typically to recover true principal subspace branch of work has focused on the subspace row sparsity an assumption that leads to sparse components all supported on the same set of variables while in the authors discuss an 
alternative perspective on the fundamental objective of the sparse pca problem we focus on the sparse pca problem with disjoint supports the problem of computing small number of sparse components with supports that jointly maximize the explained variance arg max ax xk kxj kxj supp xi supp xj with xj denoting the jth column of the number of the desired components is considered small constant contrary to the greedy sequential approach that repeatedly uses deflation our algorithm jointly computes all the vectors in and comes with theoretical approximation guarantees note that even if we could solve the sparse pca problem exactly the greedy approach could be highly suboptimal we show this with simple example in sec of the appendix our contributions we develop an algorithm that provably approximates the solution to the sparse pca problem within multiplicative factor arbitrarily close to optimal our algorithm is the first that jointly optimizes multiple components with disjoint supports and operates by recasting the sparse pca problem into multiple instances of the bipartite maximum weight matching problem the computational complexity of our algorithm grows as low order polynomial in the ambient dimension but is exponential in the intrinsic dimension of the input data the rank of to alleviate the impact of this dependence our algorithm can be applied on sketch of the input data to obtain an approximate solution to this extra level of approximation introduces an additional penalty in our theoretical approximation guarantees which naturally depends on the quality of the sketch and in turn the spectral decay of we empirically evaluate our algorithm on real datasets and compare it against methods for the sparse pca problem in conjunction with the appropriate deflation step in many cases our algorithm significantly outperforms these approaches our sparse pca algorithm we present novel algorithm for the sparse pca problem with multiple disjoint components our algorithm approximately solves the constrained maximization on positive semidefinite psd matrix within multiplicative factor arbitrarily close to it operates by recasting the maximization into multiple instances of the bipartite maximum weight matching problem each instance ultimately yields feasible solution to the original sparse pca problem set of components with disjoint supports finally the algorithm exhaustively determines and outputs the set of components that maximizes the explained variance the quadratic objective in the computational complexity of our algorithm grows as low order polynomial in the ambient dimension of the input but exponentially in its rank despite the unfavorable dependence on the rank it is unlikely that substantial improvement can be achieved in general however decoupling the dependence on the ambient and the intrinsic dimension of the input has an interesting ramification instead of the original input our algorithm can be applied on surrogate to obtain an approximate solution alleviating the dependence on we discuss this in section in the sequel we describe the key ideas behind our algorithm leading up to its guarantees in theorem let denote the truncated eigenvalue decomposition of is diagonal whose ith diagonal entry is equal to the ith largest eigenvalue of while the columns of coincide with the corresponding eigenvectors by the inequality for any rd ax rr in fact equality in can always be achieved for colinear to ux rr and in turn uλ ax max where denotes the sphere in dimensions more generally for any ax ax max 
over the unit sphere S^{r-1}: for any x, x^T A x = max_{c in S^{r-1}} (c^T Sigma^{1/2} U^T x)^2, and hence, for any X = [x_1, ..., x_k], Tr(X^T A X) = sum_j x_j^T A x_j = sum_j max_{c_j in S^{r-1}} (c_j^T Sigma^{1/2} U^T x_j)^2. Under this variational characterization of the trace objective, the sparse PCA problem can be re-expressed as a joint maximization over the variables X and an auxiliary variable C = [c_1, ..., c_k]:

  max_X sum_j x_j^T A x_j = max_X max_{C : c_j in S^{r-1}} sum_j (c_j^T Sigma^{1/2} U^T x_j)^2.

The alternative formulation of the sparse PCA problem may be seemingly more complicated than the original one; however, it takes a step towards decoupling the dependence of the optimization on the ambient and intrinsic dimensions d and r, respectively. The motivation behind the introduction of the auxiliary variable C will become clearer in the sequel. For a given C, the value of X = [x_1, ..., x_k] that maximizes the objective is

  X_C = arg max_X sum_j (w_j^T x_j)^2,   where W = [w_1, ..., w_k] = U Sigma^{1/2} C is a real d x k matrix,

subject to the sparsity, unit-norm, and disjoint-support constraints on X. The constrained maximization above plays a central role in our developments; we will later describe a combinatorial procedure to efficiently compute X_C by reducing the maximization to an instance of the bipartite maximum weight matching problem. For now, however, let us assume that such a procedure exists.

Let (X*, C*) be the pair that attains the maximum; in other words, X* is the desired solution to the sparse PCA problem. If the optimal value C* of the auxiliary variable were known, then we would be able to recover X* by solving the maximization above at C*. Of course, C* is not known, and it is not possible to exhaustively consider all possible values in the domain of C. Instead, we examine only a finite number of possible values of C over a fine discretization of its domain. In particular, let N denote a finite epsilon-net of the sphere S^{r-1}: for any point on the sphere, the net contains a point within an epsilon radius of the former (there are several ways to construct such a net). Further, consider the kth Cartesian power of this net. By construction, this collection of points contains a matrix C that is close to C*; in turn, it can be shown using the properties of the net that the candidate solution X_C obtained at that point will be approximately as good as the optimal X* in terms of the quadratic objective.

All of the above observations yield a procedure for approximately solving the sparse PCA problem; the steps are outlined in Algorithm 1. Given A, the desired number of components k, the sparsity s, and an accuracy parameter epsilon, the algorithm generates a net and iterates over its points. At each point C, it computes a feasible solution for the sparse PCA problem (a set of k components with disjoint supports) by solving the constrained maximization via a procedure (Algorithm 2) that will be described in the sequel. The algorithm collects the candidate solutions identified at the points of the net; the best among them achieves an objective that provably lies close to optimal. More formally:

Theorem. For any real psd matrix A, desired number of components k, number s of nonzero entries per component, and accuracy parameter epsilon, Algorithm 1 outputs a feasible X such that its objective lies within a multiplicative factor, arbitrarily close to one as epsilon decreases, of the objective achieved by the optimal solution of the sparse PCA problem; the running time is the cost of the truncated eigendecomposition plus a term that grows polynomially in d but exponentially in r and k through the cardinality of the net.

Algorithm 1 (sparse PCA, multiple disjoint components). Input: psd matrix A, target rank r, number of components k, sparsity s, accuracy epsilon. Output: a feasible X. (1) Compute the truncated eigendecomposition U, Sigma of A. (2) For each C in the net, compute the candidate X_C = arg max_X sum_j (w_j^T x_j)^2 with W = U Sigma^{1/2} C, using Algorithm 2. (3) Output the candidate that maximizes Tr(X^T A X).

Algorithm 1 is the first nontrivial algorithm that provably approximates the solution of the sparse PCA problem: according to the theorem, it achieves an objective value that lies within a multiplicative factor arbitrarily close to one from the optimal. Its complexity grows as a polynomial in the dimension of the input, but exponentially in the intrinsic dimension r. Note, however, that it can be substantially better compared to the brute-force approach that exhaustively considers all candidate supports for the sparse components. The complexity of our algorithm follows from the cardinality of the net and the complexity of Algorithm 2, the subroutine that solves the constrained maximization; the latter is a key ingredient of our algorithm and is discussed in detail in the next subsection. A formal proof of the theorem is provided in the appendix.
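A minimal sketch of the procedure outlined in Algorithm 1, assuming d >= k*s. This is not the authors' implementation: uniformly random points on the sphere stand in for a proper epsilon-net, all names are hypothetical, and the candidate-computation step anticipates the bipartite-matching subroutine detailed in the next subsection (solved here with SciPy's linear_sum_assignment).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def candidate_components(W, s):
    """Given W (one column w_j per component), pick k disjoint supports of size s via a
    maximum-weight bipartite matching between the d variables and k*s slots, then fill
    in the nonzero entries proportionally to the matched entries of W."""
    d, k = W.shape
    weights = W ** 2                          # edge weight between variable i and any slot of component j
    cost = -np.repeat(weights, s, axis=1)     # duplicate each component's column into s identical slots
    rows, cols = linear_sum_assignment(cost)  # max-weight matching on the d x (k*s) bipartite graph
    X = np.zeros((d, k))
    for i, slot in zip(rows, cols):
        X[i, slot // s] = W[i, slot // s]
    norms = np.linalg.norm(X, axis=0)
    X[:, norms > 0] /= norms[norms > 0]       # unit-norm columns supported on the matched variables
    return X

def sparse_pca_disjoint(A, k, s, num_points=2000, seed=0):
    """Approximate the disjoint-support sparse PCA objective by scanning candidate
    auxiliary matrices C (random sphere points standing in for an epsilon-net)."""
    rng = np.random.default_rng(seed)
    lam, U = np.linalg.eigh(A)
    r = int((lam > 1e-10).sum())              # numerical rank of A
    U, lam = U[:, -r:], lam[-r:]
    B = U * np.sqrt(lam)                      # so that A is approximately B @ B.T
    best_X, best_val = None, -np.inf
    for _ in range(num_points):
        C = rng.standard_normal((r, k))
        C /= np.linalg.norm(C, axis=0)        # columns on the unit sphere S^{r-1}
        X = candidate_components(B @ C, s)
        val = np.trace(X.T @ A @ X)
        if val > best_val:
            best_val, best_X = val, X
    return best_X
```

The guarantee of the theorem relies on a genuine epsilon-net; the random scan above only illustrates the overall structure of the search over the auxiliary variable.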
Sparse components via bipartite matchings

At the core of Algorithm 1 lies the procedure that solves the constrained maximization, Algorithm 2. The latter breaks the maximization down into two stages: first, it identifies the support of the optimal solution by solving an instance of the maximum-weight matching problem on a bipartite graph; then, it recovers the exact values of its nonzero entries. In the sequel we provide a brief description of Algorithm 2, leading up to its guarantee in Lemma 1.

Let $I_j = \mathrm{supp}(\mathbf{x}_j)$ be the support of the $j$th column of $\mathbf{X}$. The objective of the constrained maximization becomes
$$\sum_{j=1}^{k}\big(\mathbf{w}_j^\top\mathbf{x}_j\big)^2 \;=\; \sum_{j=1}^{k}\Big(\sum_{i\in I_j} w_{ij}\, x_{ij}\Big)^{2} \;\le\; \sum_{j=1}^{k}\sum_{i\in I_j} w_{ij}^{2},$$
where the inequality is due to the Cauchy-Schwarz inequality and the constraint $\|\mathbf{x}_j\|_2 \le 1$. In fact, if an oracle reveals the supports $I_j$, the upper bound can always be achieved by setting the nonzero entries of $\mathbf{x}_j$ as in Algorithm 2 (the normalization step). Therefore, the key in solving the maximization is determining the collection of supports so as to maximize the right-hand side above. By the constraints of the problem, the sets $I_j$ must be pairwise disjoint, each with cardinality at most $s$.

Consider a weighted bipartite graph $G = (V, U = U_1\cup\cdots\cup U_k, E)$ constructed as follows (see the figure below). $V$ is a set of $d$ vertices $v_1,\ldots,v_d$ corresponding to the $d$ variables, i.e., the rows of $\mathbf{W}$. $U$ is a set of $k\cdot s$ vertices, conceptually partitioned into $k$ disjoint subsets $U_1,\ldots,U_k$, each of cardinality $s$. The $j$th subset $U_j$ is associated with the support $I_j$; the vertices in $U_j$ serve as placeholders for the indices in $I_j$. Finally, the edge set $E$ connects every vertex of $V$ with every vertex of $U$, and the edge weights are determined by the matrix $\mathbf{W}$: in particular, the weight of an edge $(v_i, u)$, for $u \in U_j$, is equal to $w_{ij}^{2}$. Note that all vertices in $U_j$ are effectively identical; they all share a common neighborhood and common edge weights.

[Figure: the bipartite graph $G$ generated by the construction above, with vertex sets $V = \{v_1,\ldots,v_d\}$ and $U = U_1\cup\cdots\cup U_k$; it is used to determine the support of the solution of the constrained maximization.]

The construction is formally outlined in the appendix. Any feasible collection of supports $\{I_j\}$ corresponds to a perfect matching in $G$, and vice versa. Recall that a matching is a subset of the edges containing no two edges incident to the same vertex, while a perfect matching, in the case of an unbalanced bipartite graph with $|U| \le |V|$, is a matching with one incident edge for each vertex in $U$. Given a perfect matching, the disjoint neighborhoods of the sets $U_j$ under the matching yield supports $I_j$; conversely, any valid collection of supports yields a perfect matching in $G$, unique up to the fact that all vertices in $U_j$ are isomorphic. Moreover, due to the choice of weights, the right-hand side of the bound for a given collection of supports $\{I_j\}$ is equal to the weight of the matching in $G$ induced by the former, $\sum_{j}\sum_{i\in I_j} w_{ij}^2$. It follows that determining the support of the solution reduces to solving the maximum-weight matching problem on the bipartite graph $G$.

Algorithm 2 readily follows: given $\mathbf{W}$, the algorithm generates the weighted bipartite graph as described and computes its maximum-weight matching. Based on the latter, it first recovers the desired support of the solution and subsequently the exact values of its nonzero entries.

Algorithm 2 (Compute candidate solution)
  Input: real $d \times k$ matrix $\mathbf{W}$, sparsity $s$. (Recall $\widehat{\mathbf{X}} = \arg\max_{\mathbf{X}\in\mathcal{X}_k}\sum_j(\mathbf{w}_j^\top\mathbf{x}_j)^2$.)
  Output: $\widehat{\mathbf{X}}$.
  1. Generate the weighted bipartite graph $G$ from $\mathbf{W}$.
  2. Compute a maximum-weight matching $M$ of $G$.
  3. For $j = 1,\ldots,k$ do:
  4.    $I_j \leftarrow$ the vertices of $V$ matched to $U_j$ under $M$.
  5.    $\widehat{\mathbf{x}}_j(I_j) \leftarrow \mathbf{w}_j(I_j)/\|\mathbf{w}_j(I_j)\|_2$; set the remaining entries of $\widehat{\mathbf{x}}_j$ to zero.
  6. End for.

The running time is dominated by the computation of the matching, which can be done in time polynomial in $d$, $k$, and $s$ using a variant of the Hungarian algorithm for unbalanced bipartite graphs. Hence:

Lemma 1. For any $\mathbf{W} \in \mathbb{R}^{d\times k}$, Algorithm 2 computes the solution of the constrained maximization in time polynomial in $d$, $k$, and $s$.

A more formal analysis and the proof of Lemma 1 are available in the appendix. This completes the description of our sparse PCA algorithm and the proof sketch of Theorem 1.

Sparse PCA on sketches

Algorithm 1 approximately solves the sparse PCA problem on a PSD matrix in time that grows as a polynomial in the ambient dimension but depends exponentially on the intrinsic dimension $r$. This dependence can be prohibitive in practice. To mitigate its effect, we can apply our sparse PCA algorithm on a low-rank sketch of the input:

Algorithm 3 (Sparse PCA on a low-dimensional sketch)
  Input: real data matrix (or its covariance), parameters $k$, $s$, $\epsilon$, target rank $r$.
  1. Compute a rank-$r$ sketch of the input and the corresponding surrogate covariance matrix.
  2. Run Algorithm 1 on the surrogate matrix.
  Output: the components returned by Algorithm 1.
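The matching step of Algorithm 2 is easy to realize with an off-the-shelf assignment solver. The sketch below (our own illustration, assuming SciPy 1.4 or later for rectangular, maximize-mode assignment) implements the `candidate_solution` routine assumed in the earlier sketch: it builds the $d \times (k\cdot s)$ weight matrix with entries $w_{ij}^2$, solves a maximum-weight assignment so that the $k$ supports are disjoint, and fills in the nonzero values as the normalized restriction of $\mathbf{w}_j$.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def candidate_solution(W, s):
    """Best k disjoint s-sparse unit-norm components for a fixed W (d-by-k),
    maximizing sum_j (w_j^T x_j)^2.  Assumes d >= k*s.  Supports are chosen by
    a maximum-weight assignment between the d variables and k*s placeholder
    slots (slot t of component j has weight w_{ij}^2 for variable i); the
    values on each support are the normalized restriction of w_j."""
    d, k = W.shape
    weights = np.repeat(W ** 2, s, axis=1)               # d-by-(k*s) weight matrix
    rows, cols = linear_sum_assignment(weights, maximize=True)
    X = np.zeros((d, k))
    for i, c in zip(rows, cols):
        j = c // s                                        # slot c belongs to component j
        X[i, j] = W[i, j]
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0
    return X / norms                                      # unit-norm columns, disjoint supports
```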
intuitively the quality of the extracted components should depend on how well that surrogate approximates the original input more formally let be the real data matrix representing potentially centered datapoints in variables and the corresponding covariance matrix further let be sketch of the original data an matrix whose rows lie in an subspace with being an accuracy parameter such sketch can be obtained in several ways including for example exact or approximate svd or online sketching methods finally let be the covariance matrix of the sketched data then instead of we can approximately solve the sparse pca problem by applying algorithm on the surrogate the above are formally outlined in algorithm we note that the covariance matrix does not need to be explicitly computed algorithm can operate directly on the sketched input data matrix theorem for any input data matrix with corresponding empirical covariance matrix any desired number of components and accuracy parameters and algorithm outputs xk such that ax ka where arg ax in time tsketch tsvd the error term ka and in turn the tightness of the approximation guarantees hinges on the quality of the sketch roughly higher values of the parameter should allow for sketch that more accurately represents the original data leading to tighter guarantees that is the case for example when the sketch is obtained through exact svd in that sense theorem establishes natural between the running time of algorithm and the quality of the approximation guarantees see for additional results formal proof of theorem is provided in appendix section related work significant volume of work has focused on the sparse pca problem we scratch the surface and refer the reader to citations therein representative examples range from early heuristics in to the lasso based techniques in the elastic net in and regularized optimization methods such as gpower in greedy technique in or semidefinite programming approaches many focus on statistical analysis that pertains to specific data models and the recovery of true sparse component in practice the most competitive results in terms of the maximization in seem to be achieved by the simple and efficient truncated power tpower iteration of ii the approach of stemming from an em formulation and iii the spanspca framework of which solves the sparse pca problem through low rank approximations based on we are not aware of any algorithm that explicitly addresses the sparse pca problem multiple components can be extracted by repeatedly solving with one of the aforementioned methods to ensure disjoint supports variables selected by component are removed from the dataset however this greedy approach can result in highly suboptimal objective value see sec more generally there has been relatively limited work in the estimation of principal subspaces or multiple components under sparsity constraints algorithms include extensions of the diagonal and iterative thresholding approaches while and propose methods that rely on the row sparsity for subspaces assumption of these methods yield components supported on common set of variables and hence solve problem different from in the authors discuss the sparse pca problem propose an alternative objective function and for that problem obtain interesting theoretical guarantees in they consider structured variant of sparse pca where structure is encoded by an atomic norm regularization finally develops framework for sparse matrix factorizaiton problems based on an atomic norm their framework captures 
sparse PCA, but not explicitly the constraint of disjoint supports, and the resulting optimization problem, albeit convex, is hard to solve in general.

Experiments

We evaluate our algorithm on a series of real datasets and compare it to deflation-based approaches for sparse PCA using TPower, EM-SPCA, and SpanSPCA; the latter are representative of the state of the art for the single-component sparse PCA problem. Multiple components are computed one by one; to ensure disjoint supports, the deflation step effectively amounts to removing from the dataset all variables used by previously extracted components. For algorithms that are randomly initialized, we depict the best results over multiple random restarts. Additional experimental results are listed in the appendix.

Our experiments are conducted in a Matlab environment. Due to its nature, our algorithm is easily parallelizable; its prototypical implementation utilizes Matlab's parallel pool feature to exploit multicore or distributed cluster capabilities. Recall that our algorithm operates on a low-rank approximation of the input data; unless otherwise specified, it is configured for an approximation obtained via truncated SVD. Finally, we note that our algorithm is slower than the competing methods. We set a barrier on the execution time of our algorithm: at the cost of the theoretical approximation guarantees, the algorithm returns the best result found at the time of termination. This early termination can only hurt the performance of our algorithm.

Leukemia dataset. We evaluate our algorithm on the Leukemia dataset, which comprises samples consisting of expression values over a set of probe sets. We extract sparse components, each active on a small number of features. In the figure below we plot the cumulative explained variance versus the number of components. The competing approaches are greedy: the leading components capture high values of variance, but subsequent ones contribute less. On the contrary, our algorithm jointly optimizes the components and achieves a higher total cumulative variance, and one cannot identify a single "top" component. We repeat the experiment for multiple values of the number of target components; the figure also depicts the total cumulative variance captured by each method for each such value.

[Figure: cumulative variance captured by the extracted components on the Leukemia dataset, for TPower, SpanSPCA, and SPCABiPart. Left panel: cumulative explained variance versus the number of components, for a fixed number of nonzero entries per component. Right panel: total cumulative explained variance versus the number of target components. Numeric axes are not preserved in this extraction.]

Additional datasets. We repeat the experiment on multiple datasets, arbitrarily selected from public repositories. The table below lists the total cumulative variance captured by the extracted components, each with a fixed number of nonzero entries, using the four methods; our algorithm achieves the highest values in most cases.

[Table: total cumulative variance captured by the extracted components on various additional datasets (recoverable labels include an Amazon reviews corpus, Arcene (train), CBCL face (train), and Leukemia). For each dataset, the table lists its size (samples and variables) and the variance captured by TPower, EM-SPCA, SpanSPCA, and SPCABiPart; our algorithm operates on a low-rank sketch in all cases. Numeric entries are not preserved in this extraction.]

Bag of Words (BoW) dataset. This is a collection of text corpora stored under the bag-of-words model. For each text corpus, a vocabulary of words is extracted upon tokenization and the removal of stopwords and words appearing fewer than ten times in total. Each document is then represented as a vector in that vocabulary space, with the $i$th entry corresponding to the number of appearances of the $i$th vocabulary entry in the document. We solve the sparse PCA problem on the word co-occurrence matrix and extract sparse components of fixed cardinality. We note that the co-occurrence matrix is not explicitly constructed; our algorithm can operate directly on the input matrix.
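For reference, the quantity reported throughout these experiments, the (cumulative) explained variance of a set of unit-norm, disjointly supported components, follows directly from the objective. A small sketch, assuming the components are stacked as the columns of a matrix X:

```python
import numpy as np

def cumulative_explained_variance(A, X):
    """Variance captured by components x_1, ..., x_k with respect to the
    (covariance or co-occurrence) matrix A: sum_j x_j^T A x_j, reported
    cumulatively over the components."""
    per_component = np.einsum('ij,il,lj->j', X, A, X)   # x_j^T A x_j for each j
    return np.cumsum(per_component)
```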
The total variance captured by each method on these corpora is listed in the table below; our algorithm consistently outperforms the other approaches.

[Table: total variance captured by the extracted components, each with a fixed number of nonzero entries, on the Bag of Words corpora BoW:NIPS, BoW:KOS, BoW:Enron, and BoW:NYTimes. For each corpus, the table lists its size and the explained variance achieved by TPower, EM-SPCA, SpanSPCA, and SPCABiPart; our algorithm operates on a low-rank sketch in all cases. Numeric entries are not preserved in this extraction.]

Finally, note that here each sparse component effectively selects a small set of words; in turn, the extracted components can be interpreted as a set of topics. In the table below we list the topics extracted from the NYTimes corpus, part of the Bag of Words dataset; the corpus consists of news articles over a fixed vocabulary of words.

[Table (BoW:NYTimes): for each extracted component (topic), the words corresponding to its nonzero entries; words corresponding to higher-magnitude entries appear higher in the topic. The recovered word groups appear to include, for example, finance and business (percent, million, money, company, companies, market, stock, billion, analyst, firm, sales), government and foreign affairs (zzz_united_states, american, attack, military, palestinian, war, administration, zzz_white_house, zzz_bush, official, government, president, campaign, election, tax), sports (team, game, season, player, play, point, run, won), and cooking (cup, minutes, add, tablespoon, oil, teaspoon, water, pepper, food).]

Conclusions

We considered the sparse PCA problem for multiple components with disjoint supports. Existing methods for the single-component problem can be used along with an appropriate deflation step to compute multiple components one by one, leading to potentially suboptimal results. We presented a novel algorithm for jointly computing multiple sparse and disjoint components with provable approximation guarantees. Our algorithm is combinatorial and exploits interesting connections between the sparse PCA and the bipartite maximum-weight matching problems. Its running time grows as a polynomial in the ambient dimension of the input data, but depends exponentially on its rank. To alleviate this dependency, we can apply the algorithm on a sketch of the input, at the cost of an additional error in our theoretical approximation guarantees. Empirical evaluation showed that in many cases our algorithm outperforms competing approaches.

Acknowledgments. DP is generously supported by NSF awards and a MURI AFOSR grant. This research has also been supported by NSF (CCF) grants and an ARO YIP award.

References

Majumdar. Image compression by sparse PCA coding in curvelet domain. Signal, Image and Video Processing.
Wang, Han, and Liu. Sparse principal component analysis for high dimensional multivariate time series. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics.
d'Aspremont, El Ghaoui, Jordan, and Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review.
Jiang, Fei, and Huan. Anomaly localization for network data streams with graph joint sparse PCA. In Proceedings of ACM SIGKDD.
Zou, Hastie, and Tibshirani. Sparse principal component analysis.
journal of computational and graphical statistics vol no pp kaiser the varimax criterion for analytic rotation in factor analysis psychometrika vol no pp jolliffe rotation of principal components choice of normalization constraints journal of applied statistics vol no pp jolliffe trendafilov and uddin modified principal component technique based on the lasso journal of computational and graphical statistics vol no pp boutsidis drineas and sparse features for linear regression in advances in neural information processing systems pp nesterov and sepulchre generalized power method for sparse principal component analysis the journal of machine learning research vol pp moghaddam weiss and avidan spectral bounds for sparse pca exact and greedy algorithms nips vol aspremont bach and ghaoui optimal solutions for sparse principal component analysis the journal of machine learning research vol pp zhang aspremont and ghaoui sparse pca convex relaxations algorithms and applications handbook on semidefinite conic and polynomial optimization pp yuan and zhang truncated power method for sparse eigenvalue problems the journal of machine learning research vol no pp sigg and buhmann for sparse and pca in proceedings of the international conference on machine learning icml new york ny usa pp acm papailiopoulos dimakis and korokythakis sparse pca through approximations in proceedings of the international conference on machine learning pp asteris papailiopoulos and karystinos the sparse principal component of matrix information theory ieee transactions on vol pp april mackey deflation methods for sparse pca nips vol pp vu and lei minimax rates of estimation for sparse pca in high dimensions in international conference on artificial intelligence and statistics pp and boutsidis optimal sparse linear and sparse pca arxiv preprint and inapproximability of sparse pca corr vol ramshaw and tarjan on assignments in unbalanced bipartite graphs hp labs palo alto ca usa tech halko martinsson and tropp finding structure with randomness probabilistic algorithms for constructing approximate matrix decompositions siam review vol no pp asteris papailiopoulos kyrillidis and dimakis sparse pca via bipartite matchings arxiv preprint johnstone and lu on consistency and sparsity for principal components analysis in high dimensions journal of the american statistical association vol no ma sparse principal component analysis and iterative thresholding the annals of statistics vol no pp vu cho lei and rohe fantope projection and selection convex relaxation of sparse pca in nips pp wang lu and liu nonconvex statistical optimization sparse pca in polynomial time arxiv preprint jenatton obozinski and bach structured sparse principal component analysis in proceedings of the thirteenth international conference on artificial intelligence and statistics aistats pp richard obozinski and vert tight convex relaxations for sparse matrix factorization in advances in neural information processing systems pp lichman uci machine learning repository 
fast randomized kernel ridge regression with statistical ahmed el alaoui michael mahoney electrical engineering and computer sciences statistics and international computer science institute university of california berkeley berkeley ca elalaoui eecs mmahoney stat abstract one approach to improving the running time of methods is to build small sketch of the kernel matrix and use it in lieu of the full matrix in the machine learning task of interest here we describe version of this approach that comes with running time guarantees as well as improved guarantees on its statistical performance by extending the notion of statistical leverage scores to the setting of kernel ridge regression we are able to identify sampling distribution that reduces the size of the sketch the required number of columns to be sampled to the effective dimensionality of the problem this latter quantity is often much smaller than previous bounds that depend on the maximal degrees of freedom we give an empirical evidence supporting this fact our second contribution is to present fast algorithm to quickly compute coarse approximations to these scores in time linear in the number of samples more precisely the running time of the algorithm is with only depending on the trace of the kernel matrix and the regularization parameter this is obtained via variant of squared length sampling that we adapt to the kernel setting lastly we discuss how this new notion of the leverage of data point captures fine notion of the difficulty of the learning problem introduction we consider the approximation of symmetric positive spsd matrices that arise in machine learning and data analysis with an emphasis on obtaining good statistical guarantees this is of interest primarily in connection with machine learning methods recent work in this area has focused on one or the other of two very different perspectives an algorithmic perspective where the focus is on running time issues and guarantees given fixed input matrix and statistical perspective where the goal is to obtain good inferential properties under some hypothesized model by using the approximation in place of the full kernel matrix the recent results of gittens and mahoney provide the strongest example of the former and the recent results of bach are an excellent example of the latter in this paper we combine ideas from these two lines of work in order to obtain fast randomized kernel method with statistical guarantees that are improved relative to the to understand our approach recall that several papers have established the crucial from the algorithmic the statistical leverage scores as they capture structural nonuniformities of the input matrix and they can be used to obtain very sharp approximation guarantees see work on cur matrix decompositions work on the the fast approximation of the statistical leverage scores and the recent review for more details here we technical report version of this conference paper is available at simply note that when restricted to an spsd matrix and rank parameter the statistical leverage scores relative to the best approximation to call them for are the diagonal elements of the projection matrix onto the best approximation of that is diag kk where kk is the best rank approximation of and where is the moorepenrose inverse of kk the recent work by gittens and mahoney showed that qualitatively improved bounds for the approximation of spsd matrices could be obtained in one of two related ways either compute with the fast algorithm of approximations to 
the leverage scores and use those approximations as an importance sampling distribution in random sampling algorithm or rotate with or random projection to random basis where those scores are uniformized and sample randomly in that rotated basis in this paper we extend these ideas and we show the statistical are able to obtain approximation that comes with improved statistical guarantees by using variant of this more traditional notion of statistical leverage in particular we improve the recent bounds of bach which provides the first known statistical convergence result when substituting the kernel matrix by its approximation to understand the connection recall that key component of bach approach is the quantity dmof nk diag nλi which he calls the maximal marginal degrees of bach main result is that by constructing lowrank approximation of the original kernel matrix by sampling uniformly at random dmof columns performing the vanilla method and then by using this approximation in prediction task the statistical performance is within factor of of the performance when the entire kernel matrix is used here we show that this uniform sampling is suboptimal we do so by sampling with respect to coarse but approximation of variant to the statistical leverage scores given in definition below and we show that we can obtain similar guarantees by sampling only deff columns where deff tr nλi dmof the quantity deff is called the effective dimensionality of the learning problem and it can be interpreted as the implicit number of parameters in this nonparametric setting we expect that our results and insights will be useful much more generally as an example of this we can directly compare the sampling method to related approach thereby answering an open problem of zhang et al recall that the zhang et al method consists of dividing the dataset xi yi into random partitions of equal size computing estimators on each partition in parallel and then averaging the estimators they prove the minimax optimality of their estimator although their multiplicative constants are suboptimal and in terms of the number of kernel evaluations their method requires with in the order of which gives total number of evaluations they noticed that the scaling of their estimator was not directly comparable to that of the sampling method which was proven to only require ndmof evaluations if the sampling is uniform and they left it as an open problem to determine which if either method is fundamentally better than the other using our theorem we are able to put both results on common ground for comparison indeed the estimator obtained by our sampling requires only ndeff kernel evaluations compared to and ndmof and it obtains the same bound on the statistical predictive performance as in in this sense our result combines the best of both worlds by having the reduced sample complexity of and the sharp approximation bound of preliminaries and notation let xi yi be pairs of points in where is the input space and is the response space the learning problem can be cast as the following minimization problem min yi xi kf where is reproducing kernel hilbert space and is loss function we denote by the positive definite kernel corresponding to and by corresponding feature map that is hφ if for every the representer theorem allows us to reduce problem to optimization problem in which we will refer to it as the maximal degrees of freedom case problem boils down to finding the vector rn that solves minn yi kα kα where kij xi xj we let σu be the 
eigenvalue decomposition of with diag σn σn and an orthogonal matrix the underlying data model is yi xi ξi with xi deterministic sequence and ξi are standard normal random variables we consider to be the squared loss in which case we will be interested in the mean squared error as measure of statistical risk for any estimator fˆ let fˆ eξ kfˆ be the risk function of fˆ where eξ denotes the expectation under the randomness induced by in this setting the problem is called kernel ridge regression krr the solution to problem is nλi and the estimate of at any training point xi is given by fˆ xi kα we will use fˆk as shorthand for the vector fˆ xi rn when the matrix is used as kernel matrix this notation will be used accordingly for other kernel matrices fˆl for matrix recall that the risk of the estimator fˆk can then be decomposed into bias and variance term eξ kk nλi eξ kk nλi nλi nλi tr nλi bias variance fˆk solving problem either by direct method or by an optimization algorithm needs at least quadratic and often cubic running time in which is prohibitive in the large scale setting the method approximates the solution to problem by substituting with lowrank approximation to in practice this approximation is often not only fast to construct but the resulting learning problem is also often easier to solve the method operates as follows small number of columns kp are randomly sampled from if we let kp denote the matrix containing the sampled columns the overlap between and in then the approximation of is the matrix cw more generally if we let be an arbitrary sketching matrix tall and skinny matrix that when by produces sketch of that preserves some desirable properties then the approximation associated with is ks ks for instance for random sampling algorithms would contain entry at position if the column of is chosen at the trial of the sampling process alternatively could also be random projection matrix or could be constructed with some other perhaps deterministic method as long as it verifies some structural properties depending on the application we will focus in this paper on analyzing this approximation in the statistical prediction context related to the estimation of by solving problem we proceed by revisiting and improving upon prior results from three different areas the first result theorem is on the behavior of the bias of fˆl when is constructed using general sketching matrix this result underlies the statistical analysis of the method to see this first it is not hard to prove that in the sense of usual the order on the positive cone second one can prove that the variance is hence the variance will decrease when replacing by on the other hand the bias while not matrix monotone in general can be proven to not increase too much when replacing by this latter statement will be the main technical difficulty for obtaining bound on fˆl see appendix form of this result is due to bach in the case where is uniform sampling matrix the second result theorem is concentration bound for approximating matrix multiplication when the components of the product are sampled non uniformly this result is derived from the matrix bernstein inequality and yields sharp quantification of the deviation of the approximation from the true product the third result definition is an extension of the definition of the leverage scores to the context of kernel ridge regression whereas the notion of leverage is established as an algorithmic tool in randomized linear algebra we introduce natural counterpart of it to 
this statistical setting by combining these contributions we are able to give sharp statistical statement on the behavior of the method if one is allowed to sample non uniformly all the proofs are deferred to the appendix or see revisiting prior work and new results structural result we begin by stating structural result that the bias of the estimator constructed using the approximation this result is deterministic it only depends on the properties of the input data and holds for any sketching matrix that satisfies certain conditions this way the randomness of the construction of is decoupled from the rest of the analysis we highlight the fact that this view offers possible way of improving the current results since better construction of deterministic or satisfying the conditions would immediately lead to down stream algorithmic and statistical improvements in this setting theorem let be sketching matrix and the corresponding mation for let nγi if the sketching matrix satisfies λmax ss for and λmaxn where λmax denotes the maximum eigenvalue and kop is the operator norm then bias bias in the special case where contains one non zero entry equal to pn in every column with the number of sampled columns the result and its proof can be found in appendix although we believe that their argument contains problematic statement we propose an alternative and complete proof in appendix the subsequent analysis unfolds in two steps assuming the sketching matrix satisfies the conditions stated in theorem we will have fˆl fˆk and matrix concentration is used to show that an appropriate random construction of satisfies the said conditions we start by stating the concentration result that is the source of our improvement section define notion of statistical leverage scores section and then state and prove the main statistical result theorem section we then present our main algorithmic result consisting of fast approximation to this new notion of leverage scores section concentration bound on matrix multiplication next we state our result for approximating matrix products of the form ψψ when few columns from are sampled to form the approximate product ψi where ψi contains the chosen columns the proof relies on matrix bernstein inequality see and is presented at the end of the paper appendix theorem let be positive integers consider matrix and denote by ψi the ith column of let and ip be subset of formed by elements chosen randomly with replacement according to the distribution pr choosing pi kψi for some let be sketching matrix such that sij pij only if ij and elsewhere then pr λmax ψψ ψss exp λmax ψψ remarks this result will be used for in conjunction with theorem to prove our main result in theorem notice that is scaled version of the eigenvectors with scaling given by the diagonal matrix nγi which should be considered as soft projection matrix that smoothly selects the top part of the spectrum of the setting of gittens et al in which is diagonal is the closest analog of our setting it is known that pi kψi is the optimal sampling ψss kf the above distribution in terms of minimizing the expected error ekψψ result exhibits robustness property by allowing the chosen sampling distribution to be different from the optimal one by factor the of such distribution is reflected in the upper bound by the amplification of the squared frobenius norm of by factor for instance if the sampling distribution is chosen to be uniform pi then the value of for which is tight is maxi kψ in which case we recover concentration 
result proven by Bach. Note that the theorem is derived from one of the simpler bounds on matrix concentration; it is one among many others in the literature, and while it constitutes the basis of our improvement, it is possible that a concentration bound more tailored to the problem might yield sharper results.

An extended definition of leverage. We introduce an extended notion of leverage scores that is specifically tailored to the ridge regression problem and that we call the $\lambda$-ridge leverage scores.

Definition. For $\lambda > 0$, the $\lambda$-ridge leverage scores associated with the kernel matrix $K$ and the parameter $\lambda$ are
$$l_i(\lambda) \;=\; \sum_{j=1}^{n} \frac{\sigma_j}{\sigma_j + n\lambda}\, U_{ij}^{2}, \qquad i = 1,\ldots,n.$$
Note that $l_i(\lambda)$ is the $i$th diagonal entry of $K(K + n\lambda I)^{-1}$.

The quantities $l_i(\lambda)$ are, in this setting, the analogs of the leverage scores used in the statistical literature, as they characterize the data points that "stick out" and, consequently, most affect the result of a statistical procedure. Leverage scores are classically defined as the squared row norms of the left singular matrix of the input matrix; they have been used in regression diagnostics for outlier detection and, more recently, in randomized matrix algorithms, as they often provide an optimal importance sampling distribution for constructing random sketches for low-rank approximation and least-squares regression when the input matrix is tall and skinny. In the case where the input matrix is square, this definition is vacuous, as the row norms of the (orthogonal) singular matrix are all equal. Recently, Gittens and Mahoney used a truncated version of these scores, which they called leverage scores relative to the best rank-$k$ space, to obtain the best algorithmic results known to date on low-rank approximation of positive semidefinite matrices. The definition above is a weighted version of the classical leverage scores, where the weights depend on the spectrum of $K$ and on the regularization parameter $\lambda$; in this sense, it is an interpolation between Gittens' scores and the classical leverage scores, where the parameter $\lambda$ plays the role of a rank parameter. In addition, we point out that Bach's maximal degrees of freedom $d_{\mathrm{mof}}$ is to the $\lambda$-ridge leverage scores what the coherence is to Gittens' leverage scores, namely their scaled maximum value, $d_{\mathrm{mof}} = n\max_i l_i(\lambda)$; and that, while the sum of Gittens' scores is the rank parameter, the sum of the $\lambda$-ridge leverage scores is the effective dimensionality $d_{\mathrm{eff}}(\lambda)$. We argue in the following that the definition above provides the relevant notion of leverage in the context of kernel ridge regression: it is the natural counterpart, in the prediction context, of the algorithmic notion of leverage, and we use it in the next section to make a statistical statement on the performance of the approximation method. In their work, Drineas et al. have a comparable robust statement for controlling the expected error; our result is a robust quantification of the tail probability of the error, which is a much stronger statement.

Main statistical result: an error bound on approximate kernel ridge regression. Now we are able to give an improved version of the theorem by Bach that establishes a performance guarantee on the use of the approximation method in the context of kernel ridge regression. It is improved in the sense that the sufficient number of columns that should be sampled, in order to incur no or little loss in prediction performance, is lower. This is due to a more informed way of sampling the columns of $K$, depending on the $\lambda$-ridge leverage scores, during the construction of the approximation. The proof is in the appendix.

Theorem. Let $\delta \in (0,1)$ and let $L$ be an approximation of $K$ obtained by choosing $p$ columns randomly with replacement according to a probability distribution $(p_i)_{i=1}^{n}$ with $p_i \ge l_i(\lambda)/\big(\beta\sum_{j}l_j(\lambda)\big)$ for some $\beta \ge 1$. If $p$ is at least of the order of $\beta\, d_{\mathrm{eff}}(\lambda)\,\log\!\big(d_{\mathrm{eff}}(\lambda)/\delta\big)$, up to a mild condition on $\lambda$, where
$$d_{\mathrm{eff}}(\lambda) \;=\; \sum_{i=1}^{n} l_i(\lambda),$$
then $R(\hat f_L)$ exceeds $R(\hat f_K)$ by at most an arbitrarily small multiplicative factor, with probability at least $1-\delta$, where the $l_i(\lambda)$ are introduced in the definition above and the remaining quantities are as in the structural result.
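As an illustration of the definition and of the effective dimensionality appearing in the theorem, the following sketch computes the exact $\lambda$-ridge leverage scores from an eigendecomposition of $K$; this is an $O(n^3)$ computation, which is precisely what the next subsection seeks to avoid. The function name and interface are ours.

```python
import numpy as np

def ridge_leverage_scores(K, lam):
    """Exact lambda-ridge leverage scores:
        l_i(lam) = [K (K + n*lam*I)^{-1}]_{ii}
                 = sum_j sigma_j / (sigma_j + n*lam) * U_{ij}^2.
    Also returns d_eff = sum_i l_i and d_mof = n * max_i l_i."""
    n = K.shape[0]
    sigma, U = np.linalg.eigh(K)                 # K = U diag(sigma) U^T
    weights = sigma / (sigma + n * lam)
    scores = (U ** 2) @ weights                  # length-n vector of l_i(lam)
    return scores, scores.sum(), n * scores.max()
```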
The theorem asserts that substituting the kernel matrix $K$ by an approximation of rank $p$ in the KRR problem induces an arbitrarily small prediction loss, provided that $p$ scales linearly with the effective dimensionality $d_{\mathrm{eff}}(\lambda)$ and that $\lambda$ is not too small. The non-uniform sampling appears to be crucial for obtaining this dependence, as the leverage scores provide information on which columns, hence which data points, capture most of the difficulty of the estimation problem. Also, as a sanity check: the smaller the target accuracy (i.e., the regularization parameter $\lambda$), the higher $d_{\mathrm{eff}}$ and the more uniform the sampling distribution $l_i(\lambda)$ becomes; in the limit, $d_{\mathrm{eff}}$ is of the order of $n$, the scores are uniform, and the method is essentially equivalent to using the entire matrix. Moreover, if the sampling distribution $(p_i)$ is a factor $\beta$ away from optimal, a slight oversampling (increasing $p$ by a factor $\beta$) achieves the same performance; in this sense, the above result shows robustness to the sampling distribution. This property is very beneficial from an implementation point of view, as the error bounds still hold when only an approximation of the leverage scores is available. If the columns are sampled uniformly, a worse lower bound on $p$, one that depends on $d_{\mathrm{mof}}$, is obtained.

Main algorithmic result: fast approximation to the leverage scores. Although the $\lambda$-ridge leverage scores can be naively computed using an SVD, the exact computation is as costly as solving the original problem. Therefore, the central role they play in the above result motivates the problem of fast approximation, in a similar way as the importance of the usual leverage scores motivated Drineas et al. to approximate them in roughly random-projection time. Success in this task will allow us to combine the running-time benefits with the improved statistical guarantees we have provided.

Algorithm (fast approximation of the $\lambda$-ridge leverage scores)
  Input: data points $x_1,\ldots,x_n$, probability vector $(p_i)$, sampling parameter $p$.
  Output: approximations $\hat l_i(\lambda)$ to $l_i(\lambda)$, $i = 1,\ldots,n$.
  1. Sample $p$ data points from $\{x_i\}$ with replacement, with probabilities $p_i$.
  2. Compute the corresponding columns $C$ of the kernel matrix, and the overlap block $W$, as presented earlier.
  3. Construct $B$ such that $BB^\top = CW^{+}C^\top$.
  4. For every $i$, set $\hat l_i(\lambda) = b_i^\top\big(B^\top B + n\lambda I_p\big)^{-1} b_i$, where $b_i$ is the $i$th row of $B$, and return these values.

Note that the effective dimension appearing in our bounds involves a precision parameter that is absent in the classical definition of the effective dimensionality; however, the bound $d_{\mathrm{eff}} \le \mathrm{tr}\big(K(K+n\lambda I)^{-1}\big)$ holds. This is not necessary if one constructs the approximation as $L = KS\,(S^\top K S)^{+}\,S^\top K$ (see the proof).

Running time. The running time of the above algorithm is dominated by steps 3 and 4. Indeed, constructing $B$ can be done using a Cholesky factorization of $W$ and then multiplication of $C$ by the inverse of the obtained Cholesky factor, which yields a running time of $O(np^2 + p^3)$; computing the approximate leverage scores in step 4 also runs in $O(np^2 + p^3)$. Thus, for $p \le n$, the overall algorithm runs in $O(np^2)$. Note that the formula in step 4 only involves matrices and vectors of size $p$: everything is computed in the smaller dimension, and the fact that this yields the correct approximation relies on the matrix inversion lemma (see the proof in the appendix). Also, only the relevant columns of $K$ are computed, and we never have to form the entire kernel matrix; this improves over earlier methods that require all of $K$ to be written down in memory. The improved running time is obtained by considering this construction, which is quite different from the regular setting of approximating the leverage scores of a rectangular matrix. We now give both additive and multiplicative error bounds on its approximation quality.

Theorem. Let $\lambda > 0$ and $\delta \in (0,1)$, and let the approximations $\hat l_i(\lambda)$ be computed as above, by choosing $p$ columns at random with probabilities $p_i \propto K_{ii}$. If $p$ is at least of the order of $\mathrm{tr}(K)/(n\lambda)$, up to a logarithmic factor, then, with probability at least $1-\delta$, the approximations satisfy both an additive error bound (each $\hat l_i(\lambda)$ lies within a small additive error of $l_i(\lambda)$) and a multiplicative error bound (each $\hat l_i(\lambda)$ lies within a constant factor of $l_i(\lambda)$, the factor depending on the smallest eigenvalue $\sigma_n$).
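A minimal sketch of the fast approximation just described, under the assumption that `kernel_fn(X, Y)` returns the kernel matrix between two sets of points (all names are illustrative): it samples landmarks proportionally to the kernel diagonal, forms the factor $B$ with $BB^\top = CW^{+}C^\top$, and reads off the approximate scores through the $p \times p$ system given by the matrix inversion lemma. The same factor $B$ is the one that would be used to form the low-rank approximation in the downstream KRR problem.

```python
import numpy as np

def approx_ridge_leverage_scores(kernel_fn, X, lam, p, rng=None):
    """Fast approximation of the lambda-ridge leverage scores.
    kernel_fn(X, Y) is assumed to return the kernel matrix between the rows
    of X and Y; for many kernels the diagonal is available in closed form."""
    rng = rng or np.random.default_rng(0)
    n = X.shape[0]
    diag = np.array([kernel_fn(X[i:i + 1], X[i:i + 1])[0, 0] for i in range(n)])
    probs = diag / diag.sum()
    idx = rng.choice(n, size=p, replace=True, p=probs)   # landmark indices
    C = kernel_fn(X, X[idx])                             # n-by-p sampled columns
    W = C[idx, :]                                        # p-by-p overlap block
    # B = C W^{-1/2} (pseudo-inverse square root), so that B B^T = C W^+ C^T
    sw, Uw = np.linalg.eigh(W)
    inv_sqrt = np.where(sw > 1e-10 * sw.max(),
                        1.0 / np.sqrt(np.clip(sw, 1e-30, None)), 0.0)
    B = C @ (Uw * inv_sqrt) @ Uw.T                       # n-by-p
    M = B.T @ B + n * lam * np.eye(p)                    # only p-by-p inversion needed
    scores = np.einsum('ij,ij->i', B @ np.linalg.inv(M), B)  # b_i^T M^{-1} b_i
    return scores
```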
Remarks. The theorem states that if the columns of $K$ are sampled proportionally to $K_{ii}$, then a number of samples of the order of $\mathrm{tr}(K)/(n\lambda)$ is sufficient. Recall that $K_{ii} = k(x_i, x_i) = \|\varphi(x_i)\|_{\mathcal F}^2$, so our procedure is akin to sampling according to the squared lengths of the (feature-space) data vectors, which has been extensively used in different contexts of randomized matrix approximation. Due to how the problem is defined, the $n$ in the denominator is artificial: $n\lambda$ should be thought of as a rescaled regularization parameter, and in regimes where the $\lambda$ that yields the best generalization error scales appropriately with $n$, a sample size of the order of $\mathrm{tr}(K)/(n\lambda)$ remains modest. On the other hand, if the columns are sampled uniformly, one would instead need a number of samples that depends on $d_{\mathrm{mof}} = n\max_i l_i(\lambda)$.

Experiments

We test our results on several datasets: one synthetic regression problem, used to illustrate the importance of the $\lambda$-ridge leverage scores; the Pumadyn family, consisting of three datasets; and the Gas Sensor Array Drift dataset from the UCI repository. The synthetic case consists of a regression problem on the unit interval where, given a sequence $(x_i)$ and a sequence of noise variables, we observe the sequence $y_i = f^*(x_i) + \xi_i$. The function $f^*$ belongs to the RKHS generated by a kernel defined through a Bernoulli polynomial of the fractional part of the difference of its arguments. One important feature of this regression problem is the distribution of the points $x_i$ on the interval. If they are spread uniformly over the interval, the leverage scores $l_i(\lambda)$ are uniform for every $\lambda$, and uniform column sampling is optimal in this case; in fact, for equispaced points the kernel matrix is a circulant matrix, in which case we can prove that the leverage scores are constant. Otherwise, if the data points are distributed asymmetrically on the interval, the leverage scores are non-uniform and importance sampling is beneficial (see the figure below). In this experiment, the data points $x_i$ have been generated with a distribution symmetric about the midpoint of the interval, with high density on the borders and low density at the center. In the left panel of the figure we can see that there are a few data points with high leverage, and those correspond to the region that is underrepresented in the data sample, the region close to the center of the interval, since it is the one that has the lowest density of observations. The leverage scores are able to capture the importance of these data points, thus providing a way to detect them, in the spirit of an analysis of outliers, had we not known of their existence.

[Figure: left, the $\lambda$-ridge leverage scores for the synthetic Bernoulli-kernel dataset described in the text; right, the MSE risk versus the number of sampled columns used to construct the approximation, for different sampling methods.]

For all datasets, we determine $\lambda$ and the kernel bandwidth by cross-validation, and we compute the effective dimensionality $d_{\mathrm{eff}}$ and the maximal degrees of freedom $d_{\mathrm{mof}}$. The table below summarizes the experiments; it is often the case that $d_{\mathrm{eff}} \ll d_{\mathrm{mof}}$ and that $R(\hat f_L) \approx R(\hat f_K)$, in agreement with the theory.

[Table: parameters and quantities of interest (dataset size, number of features, bandwidth, $\lambda$, $d_{\mathrm{eff}}$, $d_{\mathrm{mof}}$, and the risk ratio $R(\hat f_L)/R(\hat f_K)$) for the different datasets and kernels: the synthetic dataset using the Bernoulli kernel (denoted synth), the Gas Sensor Array Drift batches, and the Pumadyn datasets using linear and RBF kernels. Numeric entries are not preserved in this extraction.]

Conclusion

We showed in this paper that, in the case of kernel ridge regression, the sampling complexity of the approximation method can be reduced to the effective dimensionality of the problem, hence bridging and improving upon different previous attempts that established weaker forms of this result. This was achieved by defining a natural analog of the leverage scores in this statistical context
and using it as column sampling distribution we obtained this result by combining and improving upon results that have emerged from two different perspectives on low rank matrix approximation we also present way to approximate these scores that is computationally tractable runs in time with depending only on the trace of the kernel matrix and the regularization parameter one natural unanswered question is whether it is possible to further reduce the sampling complexity or is the effective dimensionality also lower bound on and as pointed out by previous work it is likely that the same results hold for smooth losses beyond the squared loss logistic regression however the situation is unclear for losses support vector regression acknowledgements we thank xixian chen for pointing out mistake in an earlier draft of this paper we thank francis bach for stimulating discussions and for contributing to rectified proof of theorem we thank jason lee and aaditya ramdas for fruitful discussions regarding the proof of theorem we thank yuchen zhang for pointing out the connection to his work references ahmed el alaoui and michael mahoney fast randomized kernel methods with statistical guarantees arxiv preprint alex gittens and michael mahoney revisiting the method for improved largescale machine learning in proceedings of the international conference on machine learning pages francis bach sharp analysis of kernel matrix approximations in proceedings of the conference on learning theory pages francis bach personal communication october petros drineas michael mahoney and muthukrishnan cur matrix decompositions siam journal on matrix analysis and applications michael mahoney and petros drineas cur matrix decompositions for improved data analysis proceedings of the national academy of sciences petros drineas malik michael mahoney and david woodruff fast approximation of matrix coherence and statistical leverage the journal of machine learning research michael mahoney randomized algorithms for matrices and data foundations and trends in machine learning yuchen zhang john duchi and martin wainwright divide and conquer kernel ridge regression in proceedings of the conference on learning theory pages jerome friedman trevor hastie and robert tibshirani the elements of statistical learning volume springer series in statistics springer berlin george kimeldorf and grace wahba some results on tchebycheffian spline functions journal of mathematical analysis and applications bernhard ralf herbrich and alex smola generalized representer theorem in computational learning theory pages springer shai fine and katya scheinberg efficient svm training using kernel representations the journal of machine learning research christopher williams and matthias seeger using the method to speed up kernel machines in proceedings of the annual conference on neural information processing systems pages sanjiv kumar mehryar mohri and ameet talwalkar sampling techniques for the method in international conference on artificial intelligence and statistics pages joel tropp tail bounds for sums of random matrices foundations of computational mathematics petros drineas ravi kannan and michael mahoney fast monte carlo algorithms for matrices approximating matrix multiplication siam journal on computing samprit chatterjee and ali hadi influential observations high leverage points and outliers in linear regression statistical science pages petros drineas ravi kannan and michael mahoney fast monte carlo algorithms for matrices ii computing 
approximation to matrix siam journal on computing petros drineas michael mahoney muthukrishnan and faster least squares approximation numerische mathematik alan frieze ravi kannan and santosh vempala fast algorithms for finding approximations journal of the acm jacm francis bach analysis for logistic regression electronic journal of statistics 
online learning for adversaries with memory price of past mistakes elad hazan princeton university new york usa ehazan oren anava technion haifa israel oanava shie mannor technion haifa israel shie abstract the framework of online learning with memory naturally captures learning problems with temporal effects and was previously studied for the experts setting in this work we extend the notion of learning with memory to the general online convex optimization oco framework and present two algorithms that attain low regret the first algorithm applies to lipschitz continuous loss functions obtaining optimal regret bounds for both convex and strongly convex losses the second algorithm attains the optimal regret bounds and applies more broadly to convex losses without requiring lipschitz continuity yet is more complicated to implement we complement the theoretical results with two applications statistical arbitrage in finance and ahead prediction in statistics introduction online learning is learning paradigm which has both theoretical and practical appeals the goal in this paradigm is to make sequential decision where at each trial the cost associated with previous prediction tasks is given in recent years online learning has been widely applied to several research fields including game theory information theory and optimization we refer the reader to for more comprehensive survey one of the most frameworks of online learning is online convex optimization oco in this framework an online player iteratively chooses decision in convex set then convex loss function is revealed and the player suffers loss that is the convex function applied to the decision she chose it is usually assumed that the loss functions are chosen arbitrarily possibly by an allpowerful adversary the performance of the online player is measured using the regret criterion which compares the accumulated loss of the player with the accumulated loss of the best fixed decision in hindsight the above notion of regret captures only memoryless adversaries who determine the loss based on the player current decision and fails to cope with adversaries who determine the loss based on the player current and previous decisions however in many scenarios such as coding compression portfolio selection and more the adversary is not completely memoryless and the previous decisions of the player affect her current loss we are particularly concerned with scenarios in which the memory is relatively and simple in contrast to models for which reinforcement learning models are more suitable an important aspect of our work is that the memory is not used to relax the adaptiveness of the adversary cf but rather to model the feedback received by the player in particular throughout this work we assume that the adversary is oblivious that is must determine the whole set of loss functions in advance in addition we assume counterfactual feedback model the player is aware of the loss she would suffer had she played any sequence of decisions in the previous rounds this model is quite common in the online learning literature see for instance our goal in this work is to extend the notion of learning with memory to one of the most general online learning frameworks the oco to this end we adapt the policy criterion of and propose two different approaches for the extended framework both attain the optimal bounds with respect to this criterion summary of results we present and analyze two algorithms for the framework of oco with memory both attain policy regret 
bounds that are optimal in the number of rounds our first algorithm utilizes the lipschitz property of the loss functions and to the best of our knowledge is the first algorithm for this framework that is not based on any blocking technique this technique is detailed in the related work section below this algorithm attains regret for general convex loss functions and log regret for strongly convex losses for the case of convex and loss functions our second algorithm attains the nearly optimal its downside is that it is randomized and more difficult to implement novel result that follows immediately from our analysis is that our second algorithm attains an expected along with decision switches in the standard oco framework similar result currently exists only for the special case of the experts problem we note that the two algorithms we present are related in spirit both designed to cope with adversaries but differ in the techniques and analysis framework experts with memory oco with memory convex losses oco with memory strongly convex losses previous bound our first approach our second approach not applicable log table on the policy regret as function of number of rounds for the framework of oco with memory the best known bounds are due to the works of and which are detailed in the related work section below related work the framework of oco with memory was initially considered in as an extension to the experts framework of merhav et al offered blocking technique that guarantees policy regret bound of against adversaries roughly speaking the proposed technique divides the rounds into blocks while employing constant decision throughout each of these blocks the small number of decision switches enables the learning in the extended framework yet the constant block size results in suboptimal policy regret bound later showed that policy regret bound of can be achieved by simply adapting the shrinking dartboard sd algorithm of to the framework considered in in short the sd algorithm is aimed at ensuring an expected decision switches in addition to regret these two properties together enable the learning in the considered framework and the randomized block size yields an optimal policy regret bound note that in both and the the policy regret compares the performance of the online player with the best fixed sequence of actions in hindsight and thus captures the notion of adversaries with memory formal definition appears in section the notation is variant of the notation that ignores logarithmic factors presented techniques are applicable only to the variant of the experts framework to adversaries with memory and not to the general oco framework the framework of online learning against adversaries with memory was studied also in the setting of the adversarial bandit problem in this context showed how to convert an online learning algorithm with regret guarantee of into an online learning algorithm that attains regret also using blocking technique this approach is in fact generalization of to the bandit setting yet the ideas presented are somewhat simpler despite the original presentation of in the bandit setting their ideas can be easily generalized to the framework of oco with memory yielding policy regret bound of for convex losses and regret for strongly convex losses an important concept that is captured by the framework of oco with memory is switching costs which can be seen as special case where the memory is of length this special case was studied in the works of who studied the relationship 
between second-order regret bounds and switching costs, and of the work that proved the blocking algorithm mentioned above to be optimal for the setting of the adversarial bandit with switching costs.

Preliminaries and model

We continue to formally define the notation for both the standard OCO framework and the framework of OCO with memory. For the sake of readability, we shall use the notation $g_t$ for memoryless loss functions (corresponding to memoryless adversaries) and $f_t$ for loss functions with memory (corresponding to bounded-memory adversaries).

The standard OCO framework. In the standard OCO framework, an online player iteratively chooses a decision $\mathbf x_t \in \mathcal K$ and suffers a loss equal to $g_t(\mathbf x_t)$. The decision set $\mathcal K$ is assumed to be a bounded convex subset of $\mathbb R^n$, and the loss functions $g_t$ are assumed to be convex functions from $\mathcal K$ to $[0,1]$. In addition, the set $\{g_t\}_{t=1}^{T}$ is assumed to be chosen in advance, possibly by an adversary that has full knowledge of our learning algorithm (see, for instance, the survey referenced above). The performance of the player is measured using the regret criterion, defined as follows:
$$R_T \;=\; \sum_{t=1}^{T} g_t(\mathbf x_t) \;-\; \min_{\mathbf x\in\mathcal K}\sum_{t=1}^{T} g_t(\mathbf x),$$
where $T$ is a predefined integer denoting the total number of rounds played. The goal in this framework is to design efficient algorithms whose regret grows sublinearly in $T$, corresponding to an average regret going to zero as $T$ increases.

The framework of OCO with memory. In this work we consider the framework of OCO with memory, detailed as follows. At each round $t$, the online player chooses a decision $\mathbf x_t \in \mathcal K \subset \mathbb R^n$; then a loss function $f_t : \mathcal K^m \rightarrow \mathbb R$ is revealed, and the player suffers a loss of $f_t(\mathbf x_{t-m+1},\ldots,\mathbf x_t)$. For simplicity, we assume that the memory length $m$ is fixed and that $f_t(\mathbf x_1,\ldots,\mathbf x_m) \in [0,1]$ for any $\mathbf x_1,\ldots,\mathbf x_m \in \mathcal K$. Notice that the loss at round $t$ depends on the previous decisions of the player as well as on her current one. We assume that, after $f_t$ is revealed, the player is aware of the loss she would suffer had she played any sequence of decisions in the previous rounds; this corresponds to the counterfactual feedback model mentioned earlier. Our goal in this framework is to minimize the policy regret, defined as
$$R_{T,m} \;=\; \sum_{t=m}^{T} f_t(\mathbf x_{t-m+1},\ldots,\mathbf x_t) \;-\; \min_{\mathbf x\in\mathcal K}\sum_{t=m}^{T} f_t(\mathbf x,\ldots,\mathbf x).$$
We define the notion of convexity for the loss functions $f_t$ as follows: we say that $f_t$ is a convex loss function with memory if the induced memoryless function $\tilde f_t(\mathbf x) = f_t(\mathbf x,\ldots,\mathbf x)$ is convex in $\mathbf x$. From now on, the first $m-1$ rounds are ignored; since we assume that the loss per round is bounded by a constant, this adds at most a constant to the final regret bound.

Algorithm 1 (RFTL for OCO with memory)
  Input: learning rate $\eta$, convex and smooth regularization function $R$.
  Choose $\mathbf x_1,\ldots,\mathbf x_m \in \mathcal K$ arbitrarily.
  For $t = m$ to $T$ do:
    Play $\mathbf x_t$ and suffer loss $f_t(\mathbf x_{t-m+1},\ldots,\mathbf x_t)$.
    Set $\mathbf x_{t+1} = \arg\min_{\mathbf x\in\mathcal K}\big\{\eta\sum_{s=m}^{t}\tilde f_s(\mathbf x) + R(\mathbf x)\big\}$.
  End for.

Here the $\tilde f_t$ are convex, i.e., the $f_t$ are convex loss functions with memory. This assumption is necessary in some cases if efficient algorithms are considered; otherwise, the optimization problem $\min_{\mathbf x\in\mathcal K}\{\eta\sum_s\tilde f_s(\mathbf x) + R(\mathbf x)\}$ might not be solvable efficiently.

Policy regret for Lipschitz continuous loss functions. In this section we assume that the loss functions $f_t$ are Lipschitz continuous for some Lipschitz constant $L$, that is,
$$\big|f_t(\mathbf x_1,\ldots,\mathbf x_m) - f_t(\mathbf y_1,\ldots,\mathbf y_m)\big| \;\le\; L\,\big\|(\mathbf x_1,\ldots,\mathbf x_m) - (\mathbf y_1,\ldots,\mathbf y_m)\big\|,$$
and adapt the regularized follow-the-leader (RFTL) algorithm to cope with bounded-memory adversaries. In the above and throughout the paper, $\|\cdot\|$ denotes the norm with respect to which the Lipschitz property holds. Due to space constraints, we present here only the algorithm and the main theorem, and defer the complete analysis to the supplementary material. Intuitively, Algorithm 1 relies on the fact that the corresponding memoryless functions $\tilde f_t$ are convex; thus, standard regret-minimization techniques are applicable, yielding a regret bound of $O(\sqrt T)$ for the $\tilde f_t$. This, however, is not the policy regret bound we are interested in, but it is in fact quite close if we use the Lipschitz property of the $f_t$ and set the learning rate properly.
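Before stating the guarantee, here is a minimal sketch of how Algorithm 1 can be run in practice, assuming the decision set is a Euclidean ball and $R(\mathbf x) = \tfrac12\|\mathbf x\|^2$; the follow-the-leader step is solved from scratch each round with a generic constrained solver, which is wasteful but keeps the sketch faithful to the pseudocode. All names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def rftl_with_memory(loss_fns, m, dim, T, eta, radius=1.0):
    """loss_fns[t](xs) is assumed to take the list of the last m decisions and
    return the scalar loss f_t(x_{t-m+1}, ..., x_t); the induced memoryless
    surrogate is f~_t(x) = f_t(x, ..., x).  Decisions live in a Euclidean ball."""
    xs = [np.zeros(dim) for _ in range(m)]            # arbitrary first m decisions
    total_loss = 0.0
    for t in range(m, T):
        total_loss += loss_fns[t](xs[-m:])            # play x_t, suffer f_t on the last m plays
        # RFTL objective: eta * sum of surrogates so far + 0.5 ||x||^2
        # (recomputed from scratch each round for clarity, O(t) per step)
        surrogate = lambda x, t=t: sum(loss_fns[s]([x] * m) for s in range(m, t + 1))
        objective = lambda x: eta * surrogate(x) + 0.5 * np.dot(x, x)
        ball = {'type': 'ineq', 'fun': lambda x: radius ** 2 - np.dot(x, x)}
        res = minimize(objective, xs[-1], constraints=[ball], method='SLSQP')
        xs.append(res.x)                              # x_{t+1}
    return xs, total_loss
```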
The analysis requires two standard quantities, defined as suprema over the decision set (see the supplementary material for more comprehensive background and the exact norm definitions): a bound on the gradients of the induced memoryless losses, denoted $\lambda$, and a bound on the regularizer $R$ over $\mathcal K$. Additionally, we denote by $\sigma$ the strong-convexity parameter of the regularization function. For Algorithm 1 we can prove the following.

Theorem 1. Let $\{f_t\}$ be Lipschitz continuous loss functions with memory from $\mathcal K^m$ to $[0,1]$, and let the quantities above be as defined. Then Algorithm 1 generates an online sequence $\{\mathbf x_t\}$ for which the policy regret satisfies $R_{T,m} = O(\eta T + 1/\eta)$, with constants depending on the Lipschitz constant, the memory length, and the regularizer. Setting $\eta \propto 1/\sqrt T$ yields $R_{T,m} = O(\sqrt T)$.

The following is an immediate corollary of Theorem 1 for strongly convex losses.

Corollary 1. Let $\{f_t\}$ be Lipschitz continuous and strongly convex loss functions with memory from $\mathcal K^m$ to $[0,1]$, and denote by $H$ a uniform lower bound on their strong-convexity parameter. Then Algorithm 1, run with time-varying learning rates $\eta_t$, generates an online sequence $\{\mathbf x_t\}$ whose policy regret is controlled by the sum of the $\eta_t$ and their inverses; setting $\eta_t \propto 1/(Ht)$ yields $R_{T,m} = O(\log T)$. The proof simply requires plugging the time-varying learning rate into the proof of Theorem 1 and is thus omitted.

Here, a function is called $H$-strongly convex if the usual strong-convexity inequality holds with parameter $H$ for all points of $\mathcal K$; we say that $f_t$ is an $H$-strongly convex loss function with memory if the induced $\tilde f_t$ is $H$-strongly convex.

Algorithm 2 (low-switch exponential weighting)
  Input: learning parameter $\eta$.
  Initialize the weights $w_1(\mathbf x) = 1$ for all $\mathbf x \in \mathcal K$, and choose $\mathbf x_1 \in \mathcal K$ arbitrarily.
  For $t = 1$ to $T$ do:
    Play $\mathbf x_t$ and suffer loss $g_t(\mathbf x_t)$.
    Define the weights $w_{t+1}(\mathbf x) \propto w_t(\mathbf x)\exp\{-\eta\, g_t(\mathbf x)\}$ (with the precise modification of $g_t$ given in the supplementary material).
    With probability $w_{t+1}(\mathbf x_t)/w_t(\mathbf x_t)$, set $\mathbf x_{t+1} = \mathbf x_t$; otherwise, sample $\mathbf x_{t+1}$ from the density proportional to $w_{t+1}(\mathbf x)\,d\mathbf x$.
  End for.

Policy regret with low switches. In this section we present a different approach to the framework of OCO with memory: low switches. This approach was considered before by work that adapted the shrinking dartboard (SD) algorithm to cope with memory in a coding-motivated setting; however, the authors there consider only the experts setting, in which the decision set is the simplex and the loss functions are linear. Here we adapt this approach to general decision sets and generally convex loss functions, and obtain nearly optimal policy regret against bounded-memory adversaries. Due to space constraints, we present here only the algorithm and the main theorem; the complete analysis appears in the supplementary material. Intuitively, Algorithm 2 defines a probability distribution over $\mathcal K$ at each round; by sampling from this probability distribution, one can generate an online sequence with an expected low-regret guarantee. This, however, is not sufficient in order to cope with bounded-memory adversaries, and thus an additional element, keeping $\mathbf x_{t+1} = \mathbf x_t$ with high probability, is necessary. Our analysis shows that if this probability is set to $w_{t+1}(\mathbf x_t)/w_t(\mathbf x_t)$, the regret guarantee remains and we additionally obtain a low-switches guarantee. For Algorithm 2 we can prove the following.

Theorem 2. Let $\{g_t\}$ be convex functions from $\mathcal K$ to $[0,1]$, with $\mathcal K$ of bounded diameter and the $g_t$ uniformly bounded, and define the weights as above. Then Algorithm 2 generates an online sequence $\{\mathbf x_t\}$ for which both the expected regret $R_T$ and the expected number of decision switches $S$ in the sequence are $\tilde O(\sqrt T)$; the exact bounds for $R_T$ and $S$ are given in the supplementary material.

Notice that Algorithm 2 applies to memoryless loss functions; yet its low-switches guarantee implies learning against bounded-memory adversaries, as stated and proven in a lemma in the supplementary material.
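To illustrate the low-switch mechanism, the sketch below specializes Algorithm 2 to a finite grid of candidate decisions (the algorithm itself maintains a density over a continuous convex set). The probability of keeping the current decision is the ratio of consecutive exponential weights at that point, which is what keeps the expected number of switches small while matching the exponential-weights distribution.

```python
import numpy as np

def low_switch_weighted_play(grid_losses, eta, rng=None):
    """grid_losses[t][i] is the loss of grid point i at round t (a T-by-N array).
    Returns the played indices and the number of decision switches."""
    rng = rng or np.random.default_rng(0)
    T, N = grid_losses.shape
    cum = np.zeros(N)                      # cumulative losses of the grid points
    choice = rng.integers(N)               # arbitrary first decision
    switches, played = 0, []
    for t in range(T):
        played.append(choice)
        new_cum = cum + grid_losses[t]
        # keep probability = ratio of new to old (unnormalized) exponential weights
        keep_prob = np.exp(-eta * grid_losses[t][choice])
        if rng.random() > keep_prob:
            w = np.exp(-eta * new_cum)     # resample from the updated distribution
            choice = rng.choice(N, p=w / w.sum())
            switches += 1
        cum = new_cum
    return played, switches
```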
arbitrary yet known in advance number of assets roughly speaking our goal is to synthetically create mean reverting portfolio by maintaining weights upon different assets the main problem arises in this context is how do we quantify the amount of mean reversion of given portfolio indeed mean reversion is somewhat an concept and thus different proxies are usually defined to capture its notion we refer the reader to in which few of these proxies such as predictability and are presented in this work we consider proxy that is aimed at preserving the mean price of the constructed portfolio over the last trading periods close to zero while maximizing its variance we note that due to the very nature of the problem weights of one trading period affect future performance the memory comes unavoidably into the picture we proceed to formally define the new mean reversion proxy and the use of our new algorithm in this model thus denote by yt rn the prices of assets at time and by xt rn distribution of weights over these assets since short selling is allowed the norm of xt can sum up to an arbitrary number determined by the loan flexibility without loss of generality we assume that kxt which is also assumed in the works of note that since xt determines the proportion of wealth to be invested in each asset and not the actual wealth it self any other constant would work as well consequently define ft xt for some notice that minimizing ft iteratively yields process yt such that its mean is close to zero due to the expression on the left and its variance is maximized due to the expression on the right we use the regret criterion to measure our performance against the best distribution of weights in hindsight and wish to generate series of weights xt such that the regret is sublinear thus define the memoryless loss function ft and denote and bt at notice we can write at bt since is not convex in general our techniques are not straightforwardly applicable here however the hidden convexity of the problem allows us to bypass this issue by simple and tight positive psd relaxation define ht at bt pn pn where is psd matrix with and is defined as pt now notice that the problem of minimizing ht is psd relaxation to the minimization pt problem and for the optimal solution it holds that min ht ht pt where arg also we can recover vector pn from the psd matrix using an eigenvector decomposition as follows represent λi vi vi where each vi is unit vector and λi are coefficients such that λi then by sampling the eigenvector vi with probability λi we get that ht technically this decomposition is possible due to the fact that is psd matrix with notice that ht is linear in and thus we can apply regret minimization techniques on the loss functions ht this procedure is formally given in algorithm for this algorithm we can prove the following corollary let ft be as defined in equation and ht be the corresponding memoryless functions as defined in equation then applying algorithm to the loss functions ht yields an online sequence xt for which the following holds ht xt min tr ht log sampling xt xt using the eigenvector decomposition described above yields rt ft xt min ft log algorithm online statistical arbitrage osa input learning rate memory parameter regularizer initialize for to do randomize xt xt using the eigenvector decomposition observe ft and define ht as in equation apply algorithm to ht xt to get end for remark we assume here that the prices of the assets at round are bounded for all by constant which is independent of 
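To make the eigenvector-sampling step described above concrete, here is a minimal NumPy sketch (the helper name `sample_weights_from_psd` is my own, not the authors' code) of how a weight vector x can be drawn from a PSD matrix X with unit trace so that E[x xᵀ] = X, which is the property used to transfer the guarantee on the relaxed losses h_t back to the original portfolio weights.

```python
import numpy as np

def sample_weights_from_psd(X, rng=None):
    """Draw a unit eigenvector v_i with probability lambda_i from X = sum_i lambda_i v_i v_i^T.

    Assumes X is PSD with trace(X) = 1, so its eigenvalues form a probability
    distribution (tiny negative values from round-off are clipped).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam, V = np.linalg.eigh(X)          # eigen-decomposition of the symmetric PSD matrix
    lam = np.clip(lam, 0.0, None)
    lam = lam / lam.sum()               # renormalize in case of numerical drift
    i = rng.choice(len(lam), p=lam)
    return V[:, i]                      # E[x x^T] = sum_i lambda_i v_i v_i^T = X
```

Because E[x xᵀ] = X for the sampled vector, the expected quadratic portfolio loss at x equals the relaxed loss evaluated at X, which is why sampling from the decomposition preserves the regret guarantee in expectation.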
the main novelty of our approach to the task of constructing mean reverting portfolios is the ability to maintain the weight distributions online this is in contrast to the traditional offline approaches that require training period to learn weight distribution and trading period to apply corresponding trading strategy application to ahead prediction our second application is motivated by statistical models for time series prediction and in particular by statistical models for ahead ar prediction thus let xt be time series that is series of signal observations the traditional ar short for autoregressive model parameterized by lag and coefficient vector rp assumes that each observation complies with xt αk where is white noise in words the model assumes that xt is noisy linear combination of the previous observations sometimes an additional additive term is included to indicate drift but we ignore this for simplicity the online setting for time series prediction is by now and appears in the works of here we adapt this setting to the task of ahead ar prediction as follows at round the online player has to predict while at her disposal are all the previous observations the parameter determines the number of steps ahead then xt is revealed and she suffers loss of ft xt where denotes her prediction for xt for simplicity we consider the squared loss to be our error measure that is ft xt xt in the statistical literature common approach to the problem of ahead prediction is to consider ahead recursive ar predictors essentially this approach makes use of standard methods maximum likelihood or least squares estimation to extract the ahead estimator for instance least squares estimator for at round would be xτ arg min xτ αk xτ αls arg min then αls is used to generate prediction for xt αls used as proxy for it in order to predict the value of ar ls ls ar ls pp αils which is in turn αkls the values of are predicted in the same recursive manner the most obvious drawback of this approach is that not much can be said on the quality of this predictor even if the ar model is let alone if it is not see for further discussion on this issue in light of this the motivation to formulate the problem of ahead prediction in the online setting is quite clear attaining regret in this setting would imply that our algorithm performance algorithm adaptation of algorithm to ahead prediction input learning rate regularization function signal xt choose kip arbitrarily for to do pp ip predict and suffer loss ip set arg xτ end for is comparable with the best ahead recursive ar predictor in hindsight even if the latter is misspecified thus our goal is to minimize the following regret term rt xt min xt where denotes the set of all ahead recursive ar predictors against which we want to compete note that since the feedback is delayed the ar coefficients chosen at round are used to generate the prediction at round the memory comes unavoidably into the picture nevertheless here also both of our techniques are not straightforwardly applicable due the structure of the problem each prediction contains products of coefficients that cause the losses to be in to circumvent this issue let our predictions to be of ppwe use learning techniques and np np the form wk for properly chosen set rp of the coefficients basically the idea is to show that attaining regret bound with respect to the best predictor in the new family can be done using the techniques we present in this work and the best predictor in the new family is better than the best 
ahead recursive ar predictor this would imply regret bound with respect to best ahead recursive ar predictor in hindsight our formal result is given in the following corollary corollary let and supw xt then algorithm generates an online sequence wt for which it holds that xt xt min remark the tighter bound in instead of follows directly by modifying the proof of theorem to this setting ft is affected only by and not by wt in the above the values of and are determined by the choice of the set for instance if we want to compete against the best we need to use the restriction wk for all in this if we consider to be the set of all rp such that case and αk we get that and the main novelty of our approach to the task of ahead prediction is the elimination of generative assumptions on the data that is we allow the time series to be arbitrarily generated such assumptions are common in the statistical literature and needed in general to extract ml estimators discussion and conclusion in this work we extended the notion of online learning with memory to capture the general oco framework and proposed two algorithms with tight regret guarantees we then applied our algorithms to two extensively studied problems construction of mean reverting portfolios and multistep ahead prediction it remains for future work to further investigate the performance of our algorithms in these problems and other problems in which the memory naturally arises acknowledgments this work has been supported by the european community seventh framework programme under grant agreement suprel references and lugosi prediction learning and games cambridge university press elad hazan the convex optimization approach to regret minimization optimization for machine learning page shai online learning and online convex optimization foundations and trends in machine learning puterman markov decision processes discrete stochastic dynamic programming wiley series in probability and statistics wiley raman arora ofer dekel and ambuj tewari online bandit learning against an adaptive adversary from regret to policy regret nicolo ofer dekel and ohad shamir online learning with switching costs and other adaptive adversaries in advances in neural information processing systems pages neri merhav erik ordentlich gadiel seroussi and marcelo weinberger on sequential strategies for loss functions with memory ieee transactions on information theory and gergely neu rates for universal lossy source coding in isit pages sascha geulen berthold and melanie winkler regret minimization for online buffering problems using the weighted majority algorithm in colt pages nick littlestone and manfred warmuth the weighted majority algorithm in focs pages eyal gofer regret bounds with switching costs in proceedings of the conference on learning theory pages ofer dekel jian ding tomer koren and yuval peres bandits with switching costs regret in proceedings of the annual acm symposium on theory of computing pages acm anatoly schmidt financial markets and trading an introduction to market microstructure and trading strategies wiley finance wiley edition august alexandre aspremont identifying small portfolios quant finance marco cuturi and alexandre aspremont mean reversion with variance threshold may oren anava elad hazan shie mannor and ohad shamir online learning for time series prediction arxiv preprint oren anava elad hazan and assaf zeevi online time series prediction with missing data in icml michael clements and david hendry estimation for forecasting oxford 
Bulletin of Economics and Statistics.
Massimiliano Marcellino, James Stock, and Mark Watson. A comparison of direct and iterated multistep AR methods for forecasting macroeconomic time series. Journal of Econometrics.
Maddala and Kim. Unit Roots, Cointegration, and Structural Change. Themes in Modern Econometrics, Cambridge University Press.
Søren Johansen. Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models. Econometrica, November.
Jakub Jurek and Halla Yang. Dynamic portfolio selection in arbitrage. EFA Meetings Paper.
Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning.
László Lovász and Santosh Vempala. Logconcave functions: geometry and efficient sampling algorithms. In FOCS. IEEE Computer Society.
Hariharan Narayanan and Alexander Rakhlin. Random walk approach to regret minimization. In John Lafferty, Christopher Williams, John Shawe-Taylor, Richard Zemel, and Aron Culotta, editors, NIPS. Curran Associates.
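As a concrete illustration of the memory-to-memoryless reduction used throughout the paper above, the following is a minimal sketch (my own, not the authors' implementation): it runs projected online gradient descent on the induced memoryless losses f̃_t(x) = f_t(x, ..., x), which is only the Euclidean-regularization special case of the RFTL-style Algorithm 1; `grad_tilde`, `project`, the step size `eta`, and the memory length `m` are placeholder names to be supplied by the caller.

```python
import numpy as np

def ogd_with_memory(grad_tilde, project, x0, T, m, eta):
    """Projected online gradient descent on the induced memoryless losses.

    grad_tilde(t, x): (sub)gradient of f~_t(x) = f_t(x, ..., x)
    project(x):       Euclidean projection back onto the decision set K
    m:                memory length of the adversary (kept only for bookkeeping)
    """
    x = project(np.asarray(x0, dtype=float))
    recent = [x.copy()] * (m + 1)      # the last m+1 decisions actually fed to f_t
    plays = []
    for t in range(T):
        plays.append(x.copy())         # play x_t; the true loss is f_t(x_{t-m}, ..., x_t)
        recent = recent[1:] + [x.copy()]
        x = project(x - eta * grad_tilde(t, x))   # update only through the surrogate f~_t
    return plays
```

With a small step size (of the order prescribed by the theorem above), consecutive iterates stay close together, so the loss on the true memory-dependent decision sequence differs from the memoryless surrogate by at most the Lipschitz constant times the step length, which is what lets the standard regret bound carry over to policy regret.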
convolutional covariance analysis for neural subunit models anqi il memming jonathan princeton neuroscience institute princeton university anqiw pillow department of neurobiology and behavior stony brook university abstract subunit models provide powerful yet parsimonious description of neural responses to complex stimuli they are defined by cascade of two ln stages with the first stage defined by linear convolution with one or more filters and common point nonlinearity and the second by pooling weights and an output nonlinearity recent interest in such models has surged due to their biological plausibility and accuracy for characterizing early sensory responses however fitting poses difficult computational challenge due to the expense of evaluating the and the ubiquity of local optima here we address this problem by providing theoretical connection between covariance analysis and nonlinear subunit models specifically we show that convolutional decomposition of average sta and covariance stc matrix provides an asymptotically efficient estimator for class of quadratic subunit models we establish theoretical conditions for identifiability of the subunit and pooling weights and show that our estimator performs well even in cases of model mismatch finally we analyze neural data from macaque primary visual cortex and show that our estimator outperforms highly regularized generalized quadratic model gqm and achieves nearly the same prediction performance as the full estimator yet at substantially lower cost introduction central problem in systems neuroscience is to build flexible and accurate models of the sensory encoding process neurons are often characterized as responding to small number of features in the space of natural stimuli this motivates the idea of using dimensionality reduction methods to identify the features that affect the neural response however many neurons in the early visual pathway pool signals from small population of upstream neurons each of which integrates and nolinearly transforms the light from small region of visual space for such neurons stimulus selectivity is often not accurately described with small number of filters more accurate description can be obtained by assuming that such neurons pool inputs from an earlier stage of shifted identical nonlinear subunits recent interest in subunit models has surged due to their biological plausibility and accuracy for characterizing early sensory responses in the visual system linear pooling of shifted rectified linear filters was first proposed to describe sensory processing in the cat retina and more recent work has proposed similar models for responses in other early sensory areas moreover recent research in machine learning and computer vision has focused on hierarchical stacks of such subunit models often referred to as convolutional neural networks cnn the subunit models we consider here describe neural responses in terms of an cascade that is cascade of two ln processing stages each of which involves linear projection and nonlinear transformation the first ln stage is convolutional meaning it is formed from one or more banks of identical spatially shifted subunit filters with outputs transformed by shared subunit nonlinearity the second ln stage consists of set of weights for linearly pooling the nonlinear subunits an output nonlinearity for mapping the output into the neuron response range and finally an noise source for capturing the stochasticity of neural responses typically assumed to be gaussian bernoulli or 
poisson vintch et al proposed one variant of this type of subunit model and showed that it could account parsimoniously for the properties revealed by analysis of responses stimulus ln stage subunit filter subunit nonliearity pooling weights output nonlinearity ln stage however fitting such models remains challengpoisson ing problem simple ln models with gaussian or spiking poisson noise can be fit very efficiently with based estimators but there response is no equivalent theory for or subunit models this paper aims to fill that gap we show that convolutional decomposition of the figure schematic of subunit casaverage sta and covariance stc provides an cade model for simplicity we show only asymptotically efficient estimator for poisson subunit type unit model under certain technical conditions the stimulus is gaussian the subunit nonlinearity is well described by polynomial and the final nonlinearity is exponential in this case the subunit model represents special case of canonical poisson generalized quadratic model gqm which allows us to apply the expected trick to reduce the to form involving only the moments of the stimulus distribution estimating the subunit model from these moments an approach we refer to as convolutional stc has fixed computational cost that does not scale with the dataset size after single pass through the data to compute sufficient statistics we also establish theoretical conditions under which the model parameters are identifiable finally we show that convolutional stc is robust to modest degrees of model mismatch and is nearly as accurate as the full maximum likelihood estimator when applied to neural data from simple and complex cells subunit model we begin with general definition of the poisson convolutional subunit model fig the model is specified by subunit outputs spike rate smi km xi wmi smi spike count poiss where km is the filter for the th type of subunit xi is the vectorized stimulus segment in the th position of the shifted filter during convolution and is the nonlinearity governing subunit outputs for the second stage wmi is linear pooling weight from the th subunit at position and is the neuron output nonlinearity spike count is conditionally poisson with rate fitting subunit models with arbitrary and poses significant computational challenges however if we set to exponential and takes the form of polynomial the model reduces to exp wmi km xi wmi km xi exp where km diag wm km km wm and km is toeplitz matrix consisting of shifted copies of km satisfying km km in essence these restrictions on the two nonlinearities reduce the subunit model to canonicalform poisson generalized quadratic model gqm that is model in which the poisson spike rate takes the form of an exponentiated quadratic function of the stimulus we will pursue the implications of this mapping below we assume that is spatial filter vector without time expansion if we have should be filter but the subunit convolution across filter position involves only the spatial dimension from eqs and it can be seen that the subunit model contains fewer parameters than full glm making it more parsimonious description for neurons with stimulus selectivity estimators for subunit model with the above definitions and formulations we now present three estimators for the model parameters to simplify the notation we omit the subscript in and but their dependence on the model parameters is assumed throughout maximum estimator the maximum estimator mle has excellent asymptotic properties though it comes with 
the high computational cost the function can be written lmle yi log yi exp cxi xi cxi xi tr ansp exp xi cxi xi where yi xi is the average sta and yi xi is the covariance stc and nsp yi is the total number of spikes we denote the mle as estimator with expected fitting if the stimuli are drawn from gaussian with covariance then the expression in square brackets divided by in eq will converge to its expectation given by exp exp cxi xi substituting this expectation into yields quantity called expected with the objective function as lell tr ansp exp where is the number of time bins we refer to arg lell as the mele maximum expected estimator estimator with least squares fitting maximizing yields analytical expected maximum likelihood estimates cmele bmele amele log nsp with these analytical estimates it is straightforward and to optimize and by directly minimizing squared error lls diag which corresponds to an optimal convolutional decomposition of the estimates this formulation shows that the eigenvectors of cmele are spanned by shifted copies of we denote this estimate all three estimators and should provide consistent estimates for the subunit model parameters due to consistency of ml and mele estimates however the estimates mele and ls are computationally much simpler and scale much better to large datasets due to the fact that they depend on the data only via the moments in fact their only dependence on the dataset size is the cost of computing the sta and stc in one pass through the data as for efficiency has the drawback of being sensitive to noise in the cmele estimate which has far more free parameters than in the two vectors and for model therefore accurate estimation of cmele should be precondition for good performance of and we expect to perform better for small datasets identifiability the equality diag is core assumption to bridge the theoretical connection between subunit model and the moments sta stc in case we care about recovering the underlying biological structure we maybe interested to know when the solution is unique and naively interpretable here we address the identifiability of the convolution decomposition of for and estimation specifically we briefly study the uniqueness of the form diag for single subunit and multiple subunits respectively we provide the proof for the single subunit case in the main text and the proof for multiple subunits sharing the same pooling weight in the supplement note that failure of identifiability only indicates that there are possible symmetries in the solution space so that there are multiple equivalent optima which is question of theoretical interest but it holds no implications for practical performance identifiability for single subunit model we will frequently make use of frequency domain representation let denote the discrete fourier transform dft matrix with column is bj be vector resulting from discrete fourier transform that is bk where let rd be fourier representation of bk is dk dft matrix and similarly we assume that and have full support in the frequency domain or is zero assumption no element in theorem suppose assumption holds the convolution decomposition diag is uniquely identifiable up to shift and scale where and dk dw to be unit vector to deal with the obvious scale invariance first note proof we fix and thus that we can rewrite the convolution operator using dft matrices as diag bk bw where is the dft matrix and denotes conjugate transpose operation thus bw diag diag diag bw diag bw note that is circulant matrix ed 
circulant ed ed hence we can rewrite in the frequency domain as hw bcb diag diag ek since is invertible the uniqueness of the original decomposition is equivalent to the uniqueness decomposition the newly defined decomposition is of ek and ve and where both suppose there are two distinct decompositions since both and have no zero are unit vectors such that kk gg then we have define the ratio ek eh eg eh note that rank ek eh rank eh gg is also circulant matrix which can be diagonalized by dft diag rd pd we can express as ri bi bh using the identity for hadamard product that for any vector and aah bbh we get ek eh ri bi bh by lemma in the appendix ek eh ri bi bd is linearly independent set ek eh ri can be at most single therefore to satisfy the rank constraint rank without loss of generality let ri and all other to be zero then we have ek eh ek eh diag bi eg eh ri diag bi eg eh ri bi bh and is the fourier are unit vectors ri by recognizing that diag bi because bi transform of positions shifted denoted as ki we have ki ki gg therefore thus vi that is must also ki moreover from and bi bh ve be shifted version of if restricting and to be unit vectors then any solution and would satisfy vi ki therefore the two decompositions are identical up to scale and shift and identifiability for multiple subunits model multiple subunits model with subunits is far more complicated to analyze due to large degree of hidden invariances in this study we only provide the analysis under specific condition when all subunits share common pooling weight assumption all models share common we make few additional assumptions we would like to consider tight parameterization where no combination of subunits can take over another subunit task assumption km spans an subspace where ki is the subunit filter for subunit and rdk in addition has orthogonal columns we denote with positions shifted along the column as kp kpm also note that trivially dk dk dw since dw to allow arbitrary scale corresponding to each unit vector ki we introduce coefficient to the subunit thus extending to ka ki is the dft of where is diagonal matrix of and assumption such that ki where rdk is the permutation matrix from to by shifting rows namely kj ki and is linear projection coefficient matrix satisfying kj ki assumption has all positive or all negative values on the diagonal given these assumptions we establish the proposition for multiple subunits model is ka proposition under assumptions the convolutional decomposition uniquely identifiable up to shift and scale the proof for the proposition and illustrations of assumption are in the supplement true parameters mele smoothmele exponential mse quadratic sigmoid subunit nonlinearity run time sec output nonlinearity smoothls sample size sample size smoothls smoothmele smoothmle smoothmele smoothmle figure true parameters and mele and smoothmele estimations speed performance for smoothls smoothmele and smoothmle the slightly decreasing running time along with larger size is resulted from more and more fully supported subspace which makes optimization require fewer iterations accuracy performance for all combinations of subunit and output nonlinearities for smoothls smoothmele and smoothmle top left is the subunit model matching the data others are model mismatch experiments initialization all three estimators are and contain many local optima thus the selection of model initialization would affect the optimization substantially similar to using convolutional stc for initialization we also use simple 
moment based method with some assumptions for simplicity we assume all subunit models sharing the same with different scaling factors as in eq our initializer is generated from shallow bilinear regression firstly initialize with wide gaussian from division of cmele by secondly use svd profile then estimate ka into an orthogonal base set and positive diagonal matrix where to decompose ka and contain information about ki and respectively hypothesizing that are orthogonal to each other and are all positive assumptions and based on the ki and we estimated from the rough gaussian profile of now we fix those and with the same elementf this bilinear iterative procedure proceeds only few times in order to avoid wise division for overfitting to cmele which is coarse estimate of smoothing prior neural receptive fields are generally smooth thus prior smoothing out high frequency fluctuations would improve the performance of estimators unless the data likelihood provides sufficient evidence for jaggedness we apply automatic smoothness determination asd to both and each with an associated balancing hyper parameter and assuming cw with cw exp where is the vector of differences between neighboring locations in and are variance and length scale of cw that belong to the hyper parameter set also has the same asd prior with hyper parameters and for multiple subunits each wi and ki would have its own asd prior smoothls smoothmele smoothmele smooth gqm smoothls smoothmle smoothmle speed training size running time sec performance smooth expected gqm training size figure fits from various estimators and their running speeds without gqm comparisons black curves are regularized gqm with and without expected trick blue is smooth ls green is smooth mele red is smooth mle all the subunit estimators have results for subunit and subunits the inset figure in performance is the enlarged view for large values the right figure is the speed result showing that methods require exponentially increasing running time when increasing the training size but our ones have quite consistent speed fig shows the true and and the estimations from mele and smoothmele mele with smoothing prior from now on we use smoothing prior by default simulations to illustrate the performance of our estimators we generated gaussian stimuli from an lnp neuron with nonlinearity and subunit model with filter and pooling weights mean firing rate is in our estimation each time bin stimulus with dimensions is treated as one sample to generate spike response fig and show the speed and accuracy performance of three estimators ls mele and mle with smoothing prior ls and mele are comparable with baseline mle in terms of accuracy but are exponentially faster although lnp with exponential nonlinearity has been widely adapted in neuroscience for its simplicity the actual nonlinearity of neural systems is often such as nonlinearity but exponential is favored as convenient approximation of within small regime around the origin also generally lnp neuron leans towards sigmoid subunit nonlinearity rather than quadratic quadratic could well approximate sigmoid within small nonlinear regime before the linear regime of the sigmoid therefore in order to check the generalization performance of ls and mele on mismatch models we stimulated data from neuron with sigmoid subunit nonlinearity or output nonlinearity as shown in fig all the full mles formulated with no model mismatch provide baseline for inspecting the performance of the ell methods despite the our estimators ls 
and mele are on par with mle when the subunit nonlinearity is quadratic but the performance is notably worse for the sigmoid nonlinearity even so in real applications we will explore fits with different subunit nonlinearities using full mle where the exponential and quadratic assumption is thus primarily useful for reasonable and extremely fast initializer moreover the running time for estimators is always exponentially faster application to neural data in order to show the predictive performance more comprehensively in real neural dataset we applied ls mele and mle estimators to data from population of simple and complex cells data published in the stimulus consisted of oriented binary white noise flickering bar aligned with the cell preferred orientation the size of receptive field was chosen to be of bars time bins yielding stimulus space the time bin size is ms and the number of bars is in our experiment we compared estimators and mle with smoothed expected gqm and smoothed gqm models are trained on stimuli with size varying from to and tested on samples each subunit filter has length of all hyper parameters are chosen by cross validation fig shows that gqm is weakly better than ls but its running time is far more than ls data not shown both mele and mle but not ls outfight gqm and subunit subunit subunit subunit sta excitatory stc filters suppressive stc filters responses subunit model figure estimating visual receptive fields from complex cell and by fitting smoothmele subunit is suppressive negative and is excitatory positive form the we can tell from that both imply that middle subunits contribute more than the ends qualitative analysis each image corresponds to normalized dimensions spatial pixels horizontal by time bins vertical filter top row from true data bottom row simulated response from mele model given true stimuli and applied the same subspace analysis expected gqm with both subunit and subunits especially the improvement is the greatest with subunit which results from the average over all simple and complex cells generally the more complex the cell is the higher probability that multiple subunits would fit better outstandingly mele outperforms others with best and flat speed curve the is defined to be the on the test set divided by spike count for qualitative analysis we ran smoothmele for complex cell and learned the optimal subunit filters and pooling weights fig and then simulated response by fitting mele generative model given the optimal parameters analysis is applied to both neural data and simulated response data the quality of the filters trained on stimuli are qualitatively close to that obtained by fig subunit models can recover sta the first six excitatory stc filters and the last four suppressive ones but with considerably parsimonious parameter space conclusion we proposed an asymptotically efficient estimator for quadratic convolutional subunit models which forges an important theoretical link between covariance analysis and nonlinear subunit models we have shown that the proposed method works well even when the assumptions about model specification nonlinearity and input distribution were violated our approach reduces the difficulty of fitting subunit models because computational cost does not depend on dataset size beyond the cost of single pass through the data to compute the moments we also proved conditions for identifiability of the convolutional decomposition which reveals that for most cases the parameters are indeed identifiable we applied our 
estimators to the neural data from macaque primary visual cortex and showed that they outperform highly regularized form of the gqm and achieve similar performance to the subunit model mle at substantially lower computational cost acknowledgments this work was supported by the sloan foundation jp mcknight foundation jp simons global brain award jp nsf career award jp and grant from the nih nimh grant jp we thank rust and movshon for data references de ruyter van steveninck and bialek performance of neuron in the blowfly visual system coding and information transmission in short spike sequences proc soc lond touryan lau and dan isolation of relevant visual features from random stimuli for cortical complex cells journal of neuroscience aguera arcas and fairhall what causes neuron to spike neural computation tatyana sharpee nicole rust and william bialek analyzing neural responses to natural signals maximally informative dimensions neural comput feb schwartz pillow rust and simoncelli neural characterization journal of vision pillow and simoncelli dimensionality reduction in neural models an generalization of average and covariance analysis journal of vision il memming park and jonathan pillow bayesian covariance analysis in shawetaylor zemel bartlett pereira and weinberger editors advances in neural information processing systems pages il park evan archer nicholas priebe and jonathan pillow spectral methods for neural characterization using generalized quadratic models in advances in neural information processing systems pages ross williamson maneesh sahani and jonathan pillow the equivalence of and methods for neural dimensionality reduction plos comput biol kanaka rajan olivier marre and learning quadratic receptive fields from neural responses to natural stimuli neural computation nicole rust odelia schwartz anthony movshon and eero simoncelli spatiotemporal elements of macaque receptive fields neuron jun vintch zaharia movshon and simoncelli efficient and direct estimation of neural subunit model for sensory coding in adv neural information processing systems nips volume cambridge ma mit press to be presented at neural information processing systems dec brett vintch andrew zaharia movshon and eero simoncelli convolutional subunit model for neuronal responses in macaque neursoci page in press hb barlow and ro levick the mechanism of directionally selective units in rabbit retina the journal of physiology hochstein and shapley linear and nonlinear spatial subunits in cat retinal ganglion cells jonathan demb kareem zaghloul loren haarsma and peter sterling bipolar cells contribute to nonlinear spatial summation in the ganglion cell in mammalian retina the journal of neuroscience joanna crook beth peterson orin packer farrel robinson john troy and dennis dacey receptive field and collicular projection of parasol ganglion cells in macaque monkey retina the journal of neuroscience px joris ce schreiner and rees neural processing of sounds physiological reviews kunihiko fukushima neocognitron neural network model for mechanism of pattern recognition unaffected by shift in position biological cybernetics serre wolf bileschi riesenhuber and poggio robust object recognition with mechanisms pattern analysis and machine intelligence ieee transactions on yann lecun bottou yoshua bengio and patrick haffner learning applied to document recognition proceedings of the ieee alexandrod ramirez and liam paninski fast inference in generalized linear models via expected loglikelihoods journal of computational 
neuroscience.
Philip Davis. Circulant Matrices. American Mathematical Society.
Sahani and Linden. Evidence optimization techniques for estimating stimulus-response functions. NIPS.
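Since the moment-based estimators in the paper above touch the data only through the spike-triggered mean and second moment, these sufficient statistics can be accumulated in a single pass. The sketch below is my own illustrative helper, not the authors' code; it returns the uncentered spike-weighted second moment, and whether one subsequently centers it by the STA is a convention left to the caller.

```python
import numpy as np

def sta_stc(X, y):
    """One-pass spike-triggered moments.

    X: (N, d) stimulus matrix, one (vectorized) stimulus per time bin
    y: (N,)   spike counts
    Returns (mu, Lam, n_sp) where mu = (1/n_sp) sum_i y_i x_i is the STA and
    Lam = (1/n_sp) sum_i y_i x_i x_i^T is the spike-weighted second moment.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_sp = y.sum()
    mu = (y[:, None] * X).sum(axis=0) / n_sp
    Lam = (X.T * y) @ X / n_sp
    return mu, Lam, n_sp
```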
convolutional lstm network machine learning approach for precipitation nowcasting xingjian shi zhourong chen hao wang yeung department of computer science and engineering hong kong university of science and technology xshiab zchenbb hwangaz dyyeung wong woo hong kong observatory hong kong china wkwong wcwoo abstract the goal of precipitation nowcasting is to predict the future rainfall intensity in local region over relatively short period of time very few previous studies have examined this crucial and challenging weather forecasting problem from the machine learning perspective in this paper we formulate precipitation nowcasting as spatiotemporal sequence forecasting problem in which both the input and the prediction target are spatiotemporal sequences by extending the fully connected lstm to have convolutional structures in both the and transitions we propose the convolutional lstm convlstm and use it to build an trainable model for the precipitation nowcasting problem experiments show that our convlstm network captures spatiotemporal correlations better and consistently outperforms and the operational rover algorithm for precipitation nowcasting introduction nowcasting convective precipitation has long been an important problem in the field of weather forecasting the goal of this task is to give precise and timely prediction of rainfall intensity in local region over relatively short period of time hours it is essential for taking such timely actions as generating emergency rainfall alerts producing weather guidance for airports and seamless integration with numerical weather prediction nwp model since the forecasting resolution and time accuracy required are much higher than other traditional forecasting tasks like weekly average temperature prediction the precipitation nowcasting problem is quite challenging and has emerged as hot research topic in the meteorology community existing methods for precipitation nowcasting can roughly be categorized into two classes namely nwp based methods and radar extrapolation based methods for the nwp approach making predictions at the nowcasting timescale requires complex and meticulous simulation of the physical equations in the atmosphere model thus the current operational precipitation nowcasting systems often adopt the faster and more accurate extrapolation based methods specifically some computer vision techniques especially optical flow based methods have proven useful for making accurate extrapolation of radar maps one recent progress along this path is the optical flow by variational methods for echoes of radar rover in systems radar echo maps are often constant altitude plan position indicator cappi images algorithm proposed by the hong kong observatory hko for its warning of intense rainstorms in localized system swirls rover calculates the optical flow of consecutive radar maps using the algorithm in and performs advection on the flow field which is assumed to be still to accomplish the prediction however the success of these optical flow based methods is limited because the flow estimation step and the radar echo extrapolation step are separated and it is challenging to determine the model parameters to give good prediction performance these technical issues may be addressed by viewing the problem from the machine learning perspective in essence precipitation nowcasting is spatiotemporal sequence forecasting problem with the sequence of past radar maps as input and the sequence of fixed number usually larger than of future radar maps as 
however such learning problems regardless of their exact applications are nontrivial in the first place due to the high dimensionality of the spatiotemporal sequences especially when predictions have to be made unless the spatiotemporal structure of the data is captured well by the prediction model moreover building an effective prediction model for the radar echo data is even more challenging due to the chaotic nature of the atmosphere recent advances in deep learning especially recurrent neural network rnn and long memory lstm models provide some useful insights on how to tackle this problem according to the philosophy underlying the deep learning approach if we have reasonable model and sufficient data for training it we are close to solving the problem the precipitation nowcasting problem satisfies the data requirement because it is easy to collect huge amount of radar echo data continuously what is needed is suitable model for learning the pioneering lstm framework proposed in provides general framework for learning problems by training temporally concatenated lstms one for the input sequence and another for the output sequence in it is shown that prediction of the next video frame and interpolation of intermediate frames can be done by building an rnn based language model on the visual words obtained by quantizing the image patches they propose recurrent convolutional neural network to model the spatial relationships but the model only predicts one frame ahead and the size of the convolutional kernel used for transition is restricted to their work is followed up later in which points out the importance of prediction in learning useful representations they build an lstm model which reconstructs the input sequence and predicts the future sequence simultaneously although their method can also be used to solve our spatiotemporal sequence forecasting problem the fully connected lstm layer adopted by their model does not take spatial correlation into consideration in this paper we propose novel convolutional lstm convlstm network for precipitation nowcasting we formulate precipitation nowcasting as spatiotemporal sequence forecasting problem that can be solved under the general learning framework proposed in in order to model well the spatiotemporal relationships we extend the idea of to convlstm which has convolutional structures in both the and transitions by stacking multiple convlstm layers and forming an structure we can build an trainable model for precipitation nowcasting for evaluation we have created new radar echo dataset which can facilitate further research especially on devising machine learning algorithms for the problem when evaluated on synthetic dataset and the radar echo dataset our convlstm model consistently outperforms both the and the operational rover algorithm preliminaries formulation of precipitation nowcasting problem the goal of precipitation nowcasting is to use the previously observed radar echo sequence to forecast fixed length of the future radar maps in local region hong kong new york or tokyo in real applications the radar maps are usually taken from the weather radar every minutes and nowcasting is done for the following hours to predict the frames ahead from the it is worth noting that our precipitation nowcasting problem is different from the one studied in which aims at predicting only the central region of just the next frame chine learning perspective this problem can be regarded as spatiotemporal sequence forecasting problem suppose we observe 
a dynamical system over a spatial region represented by an $M \times N$ grid consisting of $M$ rows and $N$ columns. Inside each cell in the grid there are $P$ measurements which vary over time. Thus, the observation at any time can be represented by a tensor $\mathcal{X} \in \mathbf{R}^{P \times M \times N}$, where $\mathbf{R}$ denotes the domain of the observed features. If we record the observations periodically, we will get a sequence of tensors $\hat{\mathcal{X}}_1, \hat{\mathcal{X}}_2, \ldots, \hat{\mathcal{X}}_t$. The spatiotemporal sequence forecasting problem is to predict the most likely length-$K$ sequence in the future given the previous $J$ observations, which include the current one:

$\tilde{\mathcal{X}}_{t+1}, \ldots, \tilde{\mathcal{X}}_{t+K} = \arg\max_{\mathcal{X}_{t+1}, \ldots, \mathcal{X}_{t+K}} \; p(\mathcal{X}_{t+1}, \ldots, \mathcal{X}_{t+K} \mid \hat{\mathcal{X}}_{t-J+1}, \ldots, \hat{\mathcal{X}}_t)$.

For precipitation nowcasting, the observation at every timestamp is a 2D radar echo map. If we divide the map into tiled non-overlapping patches and view the pixels inside a patch as its measurements (see the corresponding figure), the nowcasting problem naturally becomes a spatiotemporal sequence forecasting problem. We note that our spatiotemporal sequence forecasting problem is different from the one-step time series forecasting problem, because the prediction target of our problem is a sequence which contains both spatial and temporal structures. Although the number of free variables in a length-$K$ sequence can be up to $O(MNPK)$, in practice we may exploit the structure of the space of possible predictions to reduce the dimensionality and hence make the problem tractable.

Long Short-Term Memory for sequence modeling. For sequence modeling, LSTM, as a special RNN structure, has proven stable and powerful for modeling long-range dependencies in various previous studies. The major innovation of LSTM is its memory cell $c_t$, which essentially acts as an accumulator of the state information. The cell is accessed, written and cleared by several controlling gates. Every time a new input comes, its information will be accumulated to the cell if the input gate $i_t$ is activated. Also, the past cell status $c_{t-1}$ could be "forgotten" in this process if the forget gate $f_t$ is on. Whether the latest cell output $c_t$ will be propagated to the final state $h_t$ is further controlled by the output gate $o_t$. One advantage of using the memory cell and gates to control information flow is that the gradient will be trapped in the cell (also known as constant error carousels) and be prevented from vanishing too quickly, which is a critical problem for the vanilla RNN model. FC-LSTM may be seen as a multivariate version of LSTM, where the inputs, cell outputs and states are all 1D vectors. In this paper we follow the FC-LSTM formulation; the key equations are shown below, where $\circ$ denotes the Hadamard product:

$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \circ c_{t-1} + b_i)$
$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \circ c_{t-1} + b_f)$
$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \circ c_t + b_o)$
$h_t = o_t \circ \tanh(c_t)$

Multiple LSTMs can be stacked and temporally concatenated to form more complex structures. Such models have been applied to solve many sequence modeling problems.

The model. We now present our ConvLSTM network. Although the FC-LSTM layer has proven powerful for handling temporal correlation, it contains too much redundancy for spatial data. To address this problem, we propose an extension of FC-LSTM which has convolutional structures in both the input-to-state and state-to-state transitions. By stacking multiple ConvLSTM layers and forming an encoding-forecasting structure, we are able to build a network model not only for the precipitation nowcasting problem but also for more general spatiotemporal sequence forecasting problems.

[Figure: transforming a 2D image into a 3D tensor. Figure: inner structure of ConvLSTM.]

Convolutional LSTM. The major drawback of FC-LSTM in handling spatiotemporal data is its usage of full connections in the input-to-state and state-to-state transitions, in which no spatial information is encoded. To overcome this problem, a distinguishing feature of our design is that all the inputs $\mathcal{X}_t$, cell outputs $\mathcal{C}_t$, hidden states $\mathcal{H}_t$, and gates $i_t$, $f_t$,
ot of the convlstm are tensors whose last two dimensions are spatial dimensions rows and columns to get better picture of the inputs and states we may imagine them as vectors standing on spatial grid the convlstm determines the future state of certain cell in the grid by the inputs and past states of its local neighbors this can easily be achieved by using convolution operator in the and transitions see fig the key equations of convlstm are shown in below where denotes the convolution operator and as before denotes the hadamard product it ft ct ot ht wxi xt whi wci bi wxf xt whf wcf bf ft it tanh wxc xt whc bc wxo xt who wco ct bo ot tanh ct if we view the states as the hidden representations of moving objects convlstm with larger transitional kernel should be able to capture faster motions while one with smaller kernel can capture slower motions also if we adopt similar view as the inputs cell outputs and hidden states of the traditional represented by may also be seen as tensors with the last two dimensions being in this sense is actually special case of convlstm with all features standing on single cell to ensure that the states have the same number of rows and same number of columns as the inputs padding is needed before applying the convolution operation here padding of the hidden states on the boundary points can be viewed as using the state of the outside world for calculation usually before the first input comes we initialize all the states of the lstm to zero which corresponds to total ignorance of the future similarly if we perform which is used in this paper on the hidden states we are actually setting the state of the outside world to zero and assume no prior knowledge about the outside by padding on the states we can treat the boundary points differently which is helpful in many cases for example imagine that the system we are observing is moving ball surrounded by walls although we can not see these walls we can infer their existence by finding the ball bouncing over them again and again which can hardly be done if the boundary points have the same state transition dynamics as the inner points structure like convlstm can also be adopted as building block for more complex structures for our spatiotemporal sequence forecasting problem we use the structure shown in fig which consists of two networks an encoding network and forecasting network like in the initial states and cell outputs of the forecasting network are copied from the last state of the encoding network both networks are formed by stacking several convlstm layers as our prediction target has the same dimensionality as the input we concatenate all the states in the forecasting network and feed them into convolutional layer to generate the final prediction we can interpret this structure using similar viewpoint as the encoding lstm compresses the whole input sequence into hidden state tensor and the forecasting lstm unfolds this hidden rediction encoding network convlst copy convlst convlst copy convlst input forecasting network figure convlstm network for precipitation nowcasting state to give the final prediction arg max arg max fencoding gforecasting fencoding this structure is also similar to the lstm future predictor model in except that our input and output elements are all tensors which preserve all the spatial information since the network has multiple stacked convlstm layers it has strong representational power which makes it suitable for giving predictions in complex dynamical systems like the precipitation 
nowcasting problem we study here experiments we first compare our convlstm network with the network on synthetic movingmnist dataset to gain some basic understanding of the behavior of our model we run our model with different number of layers and kernel sizes and also study some cases as in to verify the effectiveness of our model on the more challenging precipitation nowcasting problem we build new radar echo dataset and compare our model with the rover algorithm based on several commonly used precipitation nowcasting metrics the results of the experiments conducted on these two datasets lead to the following findings convlstm is better than in handling spatiotemporal correlations making the size of convolutional kernel bigger than is essential for capturing the spatiotemporal motion patterns deeper models can produce better results with fewer parameters convlstm performs better than rover for precipitation nowcasting our implementations of the models are in python with the help of theano we run all the experiments on computer with single nvidia gpu also more illustrative gif examples are included in the appendix dataset for this synthetic dataset we use generation process similar to that described in all data instances in the dataset are frames long frames for the input and frames for the prediction and contain two handwritten digits bouncing inside patch the moving digits are chosen randomly from subset of digits in the mnist the starting position and velocity direction are chosen uniformly at random and the velocity amplitude is chosen randomly in this generation process is repeated times resulting in dataset with training sequences validation sequences and testing sequences we train all the lstm models by minimizing the using through time bptt and mnist dataset http pthe loss of the predicted frame and the frame is defined as ti log pi ti log pi table comparison of convlstm networks with network on the dataset and represent the corresponding kernel size which is either or and refer to the number of hidden states in the convlstm layers and represent the kernel size model convlstm convlstm convlstm convlstm convlstm number of parameters cross entropy figure an example showing an run from left to right input frames ground truth prediction by the network rmsprop with learning rate of and decay rate of also we perform on the validation set despite the simple generation process there exist strong nonlinearities in the resulting dataset because the moving digits can exhibit complicated appearance and will occlude and bounce during their movement it is hard for model to give accurate predictions on the test set without learning the inner dynamics of the system for the network we use the same structure as the unconditional future predictor model in with two lstm layers for our convlstm network we set the patch size to so that each frame is represented by tensor we test three variants of our model with different number of layers the network contains one convlstm layer with hidden states the network has two convlstm layers with hidden states each and the network has and hidden states respectively in the three convlstm layers all the and kernels are of size our experiments show that the convlstm networks perform consistently better than the network also deeper models can give better results although the improvement is not so significant between the and networks moreover we also try other network configurations with the and kernels of the and networks changed to and respectively although the 
number of parameters of the new network is close to the original one the result becomes much worse because it is hard to capture the spatiotemporal motion patterns with only transition meanwhile the new network performs better than the new network since the higher layer can see wider scope of the input nevertheless its performance is inferior to networks with larger kernel size this provides evidence that larger kernels are more suitable for capturing spatiotemporal correlations in fact for kernel the receptive field of the states will not grow as time advances but for larger kernels later states have larger receptive fields and are related to wider range of the input the average loss loss per sequence of each algorithm on the test set is shown in table we need to point out that our experiment setting is different from where an infinite number of training data is assumed to be available the current offline setting is chosen in order to understand how different models perform in occasions where not so much data is available comparison of the convlstm and in the online setting is included in the appendix next we test our model on some inputs we generate another sequences of three moving digits with the digits drawn randomly from different subset of mnist digits that does not overlap with the training set since the model has never seen any system with three digits such an run is good test of the generalization ability of the model the average error of the model on this dataset is by observing some of the prediction results we find that the model can separate the overlapping digits successfully and predict the overall motion although the predicted digits are quite blurred one prediction example is shown in fig radar echo dataset the radar echo dataset used in this paper is subset of the weather radar intensities collected in hong kong from to since not every day is rainy and our nowcasting target is precipitation we select the top rainy days to form our dataset for preprocessing we first transform the intensity values to pixels by setting max and crop the radar maps in the central region after that we apply the disk with radius and resize the radar maps to to reduce the noise caused by measuring instruments we further remove the pixel values of some noisy regions which are determined by applying clustering to the monthly pixel average the weather radar data is recorded every minutes so there are frames per day to get disjoint subsets for training testing and validation we partition each daily sequence into frame blocks and randomly assign blocks for training block for testing and block for validation the data instances are sliced from these blocks using sliding window thus our radar echo dataset contains training sequences testing sequences and validation sequences and all the sequences are frames long for the input and for the prediction although the training and testing instances sliced from the same day may have some dependencies this splitting strategy is still reasonable because in nowcasting we do have access to all previous data including data from the same day which allows us to apply online of the model such data splitting may be viewed as an approximation of the setting for this application we set the patch size to and train convlstm network with each layer containing hidden states and kernels for the rover algorithm we tune the parameters of the optical flow on the validation set and use the best parameters shown in the appendix to report the test results also we try three different 
initialization schemes for ROVER: the first computes the optical flow of the last two observed frames and performs advection afterwards; the second initializes the velocity by the mean of the last two flow fields; and the third gives the initialization by a weighted average of the last three flow fields. In addition, we train a fully connected LSTM network with two LSTM layers. Both the ConvLSTM network and the fully connected LSTM network optimize the error of the predictions. We evaluate these methods using several commonly used precipitation nowcasting metrics, namely rainfall mean squared error, critical success index (CSI), false alarm rate (FAR), probability of detection (POD), and correlation. The rainfall mean squared error is defined as the average squared error between the predicted rainfall and the ground truth. Since our predictions are done at the pixel level, we project them back to radar echo intensities and calculate the rainfall at every cell of the grid using the relationship $Z = 10\log a + 10 b \log R$, where $Z$ is the radar echo intensity in dB, $R$ is the rainfall rate, and $a, b$ are two constants. The CSI, FAR and POD are skill scores similar to the precision and recall commonly used by machine learning researchers. We convert the prediction and ground truth to a binary matrix using a threshold on the rainfall rate (indicating raining or not) and calculate the hits (prediction 1, truth 1), misses (prediction 0, truth 1) and false alarms (prediction 1, truth 0). The three skill scores are defined as $\mathrm{CSI} = \frac{\mathrm{hits}}{\mathrm{hits} + \mathrm{misses} + \mathrm{falsealarms}}$, $\mathrm{FAR} = \frac{\mathrm{falsealarms}}{\mathrm{hits} + \mathrm{falsealarms}}$, and $\mathrm{POD} = \frac{\mathrm{hits}}{\mathrm{hits} + \mathrm{misses}}$. The correlation of a predicted frame $P$ and a ground-truth frame $T$ is defined as $\frac{\sum_i P_i T_i}{\sqrt{(\sum_i P_i^2)(\sum_i T_i^2)}}$. (The disk filter is applied using the MATLAB function fspecial('disk'); we use an open-source project to calculate the optical flow, http .) A short code sketch of these skill scores is given after the references. Table: comparison of the average scores of the different models (CSI, FAR, POD, correlation) over the prediction steps. Figure: comparison of different models based on four precipitation nowcasting metrics (CSI, FAR, POD, correlation) over time. Figure: two prediction examples for the precipitation nowcasting problem; all predictions and ground truths are sampled at a fixed interval; from top to bottom: input frames, ground-truth frames, prediction by the ConvLSTM network, prediction by the competing model. All results are shown in the table and figures. We can find that the performance of the fully connected LSTM network is not so good for this task, which is mainly caused by the strong spatial correlation in the radar maps: the motion of clouds is highly consistent within a local region. The fully connected structure has too many redundant connections and makes it very unlikely for the optimization to capture these local consistencies. Also, it can be seen that ConvLSTM outperforms the optical-flow-based ROVER algorithm, which is mainly due to two reasons. First, ConvLSTM is able to handle the boundary conditions well. In nowcasting there are many cases when a sudden agglomeration of clouds appears at the boundary, which indicates that some clouds are coming from the outside. If the ConvLSTM network has seen similar patterns during training, it can discover this type of sudden change in the encoding network and give reasonable predictions in the forecasting network. This, however, can hardly be achieved by optical flow and advection based methods. Another reason is that ConvLSTM is trained for this task, and some complex spatiotemporal patterns in the dataset can be learned by the nonlinear and convolutional structure of the network. For the optical-flow-based approach, it is hard to find a reasonable way to update the future flow fields and train everything end to end. Some prediction results are shown in the figure. We can find that ConvLSTM can predict the future rainfall contour more accurately, especially
in the boundary although can give sharper predictions than convlstm it triggers more false alarms and is less precise than convlstm in general also the blurring effect of convlstm may be caused by the inherent uncertainties of the task it is almost impossible to give sharp and accurate predictions of the whole radar maps in predictions we can only blur the predictions to alleviate the error caused by this type of uncertainty conclusion and future work in this paper we have successfully applied the machine learning approach especially deep learning to the challenging precipitation nowcasting problem which so far has not benefited from sophisticated machine learning techniques we formulate precipitation nowcasting as spatiotemporal sequence forecasting problem and propose new extension of lstm called convlstm to tackle the problem the convlstm layer not only preserves the advantages of but is also suitable for spatiotemporal data due to its inherent convolutional structure by incorporating convlstm into the structure we build an trainable model for precipitation nowcasting for future work we will investigate how to apply convlstm to action recognition one idea is to add convlstm on top of the spatial feature maps generated by convolutional neural network and use the hidden states of convlstm for the final classification references bastien lamblin pascanu bergstra goodfellow bergeron bouchard wardefarley and bengio theano new features and speed improvements deep learning and unsupervised feature learning nips workshop bengio goodfellow and courville deep learning book in preparation for mit press bergstra breuleux bastien lamblin pascanu desjardins turian and bengio theano cpu and gpu math expression compiler in scipy volume page austin tx bridson fluid simulation for computer graphics ak peters series taylor francis brox bruhn papenberg and weickert high accuracy optical flow estimation based on theory for warping in eccv pages cheung and yeung application of technique to significant convection nowcast for terminal areas in hong kong in the wmo international symposium on nowcasting and very shortrange forecasting pages cho van merrienboer gulcehre bougares schwenk and bengio learning phrase representations using rnn for statistical machine translation in emnlp pages donahue hendricks guadarrama rohrbach venugopalan saenko and darrell recurrent convolutional networks for visual recognition and description in cvpr douglas the stormy weather group canada in radar in meteorology pages urs germann and isztar zawadzki of the predictability of precipitation from continental radar images part description of the methodology monthly weather review graves generating sequences with recurrent neural networks arxiv preprint hochreiter and schmidhuber long memory neural computation karpathy and deep alignments for generating image descriptions in cvpr klein wolf and afek dynamic convolutional layer for short range weather prediction in cvpr li wong chan and lai evolving nowcasting system hong kong special administrative region government long shelhamer and darrell fully convolutional networks for semantic segmentation in cvpr pascanu mikolov and bengio on the difficulty of training recurrent neural networks in icml pages ranzato szlam bruna mathieu collobert and chopra video language modeling baseline for generative models of natural videos arxiv preprint reyniers quantitative precipitation forecasts based on radar observations principles algorithms and operational systems institut royal de belgique 
sakaino image pattern prediction method based on physical model with timevarying optical flow ieee transactions on geoscience and remote sensing srivastava mansimov and salakhutdinov unsupervised learning of video representations using lstms in icml sun xue wilson zawadzki ballard joe barker li golding xu and pinto use of nwp for nowcasting convective precipitation recent progress and challenges bulletin of the american meteorological society sutskever vinyals and le sequence to sequence learning with neural networks in nips pages tieleman and hinton lecture rmsprop divide the gradient by running average of its recent magnitude coursera course neural networks for machine learning woo and wong application of optical flow techniques to rainfall nowcasting in the conference on severe local storms xu ba kiros courville salakhutdinov zemel and bengio show attend and tell neural image caption generation with visual attention in icml 
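The skill scores reconstructed above are simple to state as code. The following is a minimal sketch (not the authors' implementation), assuming numpy arrays of rainfall rates per frame; the rain/no-rain threshold, the small constant eps, and the Z-R constants a and b are placeholders, since their values are not given in this extract, and no guard is included for frames that contain no rain at all.

```python
import numpy as np

def nowcasting_skill_scores(pred_rain, true_rain, threshold=0.5):
    """CSI, FAR and POD for one predicted / ground-truth rainfall frame.

    pred_rain, true_rain: arrays of rainfall rates; threshold is an
    illustrative rain / no-rain cutoff (assumed, not from the source).
    """
    p = pred_rain >= threshold            # binarised prediction
    t = true_rain >= threshold            # binarised ground truth
    hits = np.sum(p & t)                  # prediction 1, truth 1
    misses = np.sum(~p & t)               # prediction 0, truth 1
    false_alarms = np.sum(p & ~t)         # prediction 1, truth 0
    csi = hits / (hits + misses + false_alarms)
    far = false_alarms / (hits + false_alarms)
    pod = hits / (hits + misses)
    return csi, far, pod

def frame_correlation(pred, true, eps=1e-9):
    """Correlation between a predicted frame and the ground-truth frame."""
    return np.sum(pred * true) / (np.sqrt(np.sum(pred**2) * np.sum(true**2)) + eps)

def rainfall_from_dbz(z, a, b):
    """Invert Z = 10*log10(a) + 10*b*log10(R): recover rainfall rate R from
    radar echo intensity z in dB; a, b are the (unspecified) constants."""
    return 10.0 ** ((z - 10.0 * np.log10(a)) / (10.0 * b))
```

In use, each predicted frame would first be projected back to radar echo intensities, converted to rainfall with rainfall_from_dbz, and then scored against the corresponding ground-truth frame with the two functions above, averaging the scores over all prediction steps.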
gap safe screening rules for sparse and models eugene ndiaye olivier fercoq alexandre gramfort joseph salmon ltci cnrs paristech paris france abstract high dimensional regression benefits from sparsity promoting regularizations screening rules leverage the known sparsity of the solution by ignoring some variables in the optimization hence speeding up solvers when the procedure is proven not to discard features wrongly the rules are said to be safe in this paper we derive new safe rules for generalized linear models regularized with and norms the rules are based on duality gap computations and spherical safe regions whose diameters converge to zero this allows to discard safely more variables in particular for low regularization parameters the gap safe rule can cope with any iterative solver and we illustrate its performance on coordinate descent for lasso binary and multinomial logistic regression demonstrating significant speed ups on all tested datasets with respect to previous safe rules introduction the computational burden of solving high dimensional regularized regression problem has lead to vast literature in the last couple of decades to accelerate the algorithmic solvers with the increasing popularity of regularization ranging from the lasso or to regularized logistic regression and learning many algorithmic methods have emerged to solve the associated optimization problems although for the simple regularized least square specific algorithm the lars can be considered for more general formulations penalties and possibly larger dimension coordinate descent has proved to be surprisingly efficient strategy our main objective in this work is to propose technique that can any solver for such learning problems and that is particularly well suited for coordinate descent method thanks to active set strategies the safe rules introduced by for generalized regularized problems is set of rules that allows to eliminate features whose associated coefficients are proved to be zero at the optimum relaxing the safe rule one can obtain some more at the price of possible mistakes such heuristic strategies called strong rules reduce the computational cost using an active set strategy but require difficult to check for features possibly wrongly discarded another road to screening method has been the introduction of sequential safe rules the idea is to improve the screening thanks to the computations done for previous regularization parameter this scenario is particularly relevant in machine learning where one computes solutions over grid of regularization parameters so as to select the best one to perform nevertheless such strategies suffer from the same problem as strong rules since relevant features can be wrongly disregarded sequential rules usually rely on theoretical quantities that are not known by the solver but only approximated especially for such rules to work one needs the exact dual optimal solution from the previous regularization parameter recently the introduction of safe dynamic rules has opened promising venue by letting the screening to be done not only at the beginning of the algorithm but all along the iterations following method introduced for the lasso we generalize this dynamical safe rule called gap safe rules because it relies on duality gap computation to large class of learning problems with the following benefits unified and flexible framework for wider family of problems easy to insert in existing solvers proved to be safe more efficient that previous safe rules achieves 
fast true active set identification we introduce our general gap safe framework in section we then specialize it to important machine learning use cases in section in section we apply our gap safe rules to multitask lasso problem relevant for brain imaging with magnetoencephalography data as well as to multinomial logistic regression regularized with norm for joint feature selection gap safe rules model and notations we denote by rds the set du for any integer and by qj the transpose of matrix our observation matrix is rnˆq where represents the number of samples and the number of tasks or classes the design matrix xppq xn sj rnˆp has explanatory variables or features and observations the standard norm is written the norm the norm the unit ball is denoted by or simply and we write bpc ball with center and radius for matrix rpˆq we řq rq the denote by bj the frobenius norm and by the associated inner product we consider the general optimization problem of minimizing separable function with grouplasso regularization the parameter to recover is matrix rpˆq and for any in rp bj is the row of while for any in rq is the column we would like to find pλq arg min fi pxj bq λωpbq bprpˆq loooooooooooomoooooooooooon pλ pbq řn where fi þñ is convex function with gradient so fi pxj bq pˆq is þñ is the norm ωpbq řpalso convex with lipschitz gradient the function bj promoting few lines of to be at time the parameter is constant controlling the between data fitting and regularization some elements of convex analysis used in the following are introduced here for convex function rd the of is the function rd defined by puq supzprd xz uy pzq the of function at point is denoted by bf pxq the dual norm of is the norm and reads pbq maxjprps bj remark for the ease of reading all groups are weighted with equal strength but extension of our results to weights as proposed in the original paper would be straightforward basic properties first we recall the associated fermat condition and dual formulation of the optimization problem theorem fermat condition see proposition for more general result for any convex function rn arg min pxq bf pxq xprn this is also often referred to as the convex conjugate of function theorem dual formulation of is given by pλq arg max looooooooomooooooooon dλ pθq where tθ rnˆq rps and dual solutions are linked by pjq rns tθ rnˆq px θq the primal pλq pxj pλq furthermore fermat condition reads rps pjq pλq pλq if pλq if remark contrarily to the primal the dual problem has unique solution under our assumption on fi indeed the dual function is strongly concave hence strictly concave remark for any rnˆq let us introduce gpθq qj pθn qj rnˆq pλq pλq then the link can be written critical parameter λmax for large enough the solution of the primal problem is simply thanks to the fermat rule is optimal if and only if thanks to the property of the dual norm this is equivalent to λq where is the dual norm of since is primal solution of pλ if and only if λmax maxjprps xpjq px this development shows that for λmax problem is trivial so from now on we will only focus on the case where λmax screening rules description safe screening rules rely on simple consequence of the fermat condition pλq pλq xpjq pλq is unknown unless λmax however stated in such way this relation is useless because it is often possible to construct set rnˆq called safe region containing it then note that pλq max xpjq θpr the so called safe screening rules consist in removing the variable from the problem whenever the pλq is then guaranteed 
to be zero this property leads to considerable previous test is satisfied since in practice especially with active sets strategies see for instance for the lasso case natural goal is to find safe regions as narrow as possible smaller safe regions can only increase the number of screened out variables however complex regions could lead to computational burden limiting the benefit of screening hence we focus on constructing satisfying the pλq is as small as possible and contains pjq computing maxθpr is cheap spheres as safe regions various shapes have been considered in practice for the set such as balls referred to as spheres domes or more refined sets see for survey here we consider the sphere regions choosing ball bpc rq as safe region one can easily obtain control on maxθpbpc rq xpjq by extending the computation of the support function of ball eq to the matrix case max xpjq xpjq xpjq θpbpc rq note that here the center is matrix in rpˆq we can now state the safe sphere test sphere test if xpjq xpjq pλq then gap safe rule description in this section we derive gap safe screening rule extending the one introduced in for this we rely on the strong convexity of the dual objective function and on weak duality finding radius remember that rns fi is differentiable with gradient as consequence rns is convex theorem and so dλ is concave rnˆq rnˆq dλ dλ pλq one has specifying the previous inequality for pλq pθ pλq pλq γλ pλq dλ pθq dλ pθ pλq maximizes dλ on so we have pθ pλq pλq this implies by definition pλq γλ pλq dλ pθq dλ pθ pλq pλ pbq so rpˆq dλ pθq pλ pbq by weak duality rpˆq dλ pθ γλ pλq and we deduce the following theorem theorem pλq pbq dλ pθqq pb θq rpˆq provided one knows dual feasible point and rpˆq it is possible to construct safe sphere with radius pb θq centered on we now only need to build relevant dual point to center such ball results from section ensure that λmax but it leads to static rule introduced in we need dynamic center to improve the screening as the solver proceeds pλq pλq now assume that one has converging finding center remember that pλq hence natural choice for creating dual algorithm for the primal problem bk feasible point θk is to choose it proportional to for instance by setting rk if px rk θk where rk otherwise px rk refined method consists in solving the one dimensional problem arg xspanprk dλ pθq in the lasso and case such step is simply projection on the intersection of line and the polytope dual set and can be computed efficiently however for logistic regression the computation is more involved so we have opted for the simpler solution in equation this still provides converging safe rules see proposition dynamic gap safe rule summarized we can now state our dynamical gap safe rule at the step of an iterative solver compute bk and then obtain θk and pbk θk using pλq and remove xpjq from if xpjq θk pbk θk xpjq then set dynamic safe screening rules are more efficient than existing methods in practice because they can increase the ability of screening as the algorithm proceeds since one has sharper and sharper dual regions available along the iterations support identification is improved provided one relies on primal converging algorithm one can show that the dual sequence we propose is converging too the convergence of the primal is unaltered by our gap safe rule screening out unnecessary coefficients of bk can only decrease its distance with its original limits moreover practical consequence is that one can observe surprising situations where lowering the 
tolerance of the solver can reduce the computation time this can happen for sequential setups pλq and θk defined in eq be the current proposition let bk be the current estimate of pλq then limkñ bk pλq implies limkñ θk pλq estimate of note that if the primal sequence is converging to the optimal our dual sequence is also converging but we know that the radius of our safe sphere is pbk dλ pθk qq by strong duality this radius converges to hence we have certified that our gap safe regions sequence bpθk pbk θk qq is converging safe rules in the sense introduced in definition remark the active set obtained by our gap safe rule the indexes of non pλq allowing us variables converges to the equicorrelation set eλ tj xpjq to early identify relevant features see proposition in the supplementary material for more details special cases of interest we now specialize our results to relevant supervised learning problems see also table lasso in rp pβq xβ řnthe lasso the parameter is vector pyi xi βq meaning that fi pzq pyi zq and ωpβq regression in the lasso which is special we assume that the observation is řn case of nˆq pbq pzq yi and ωpbq řp bj in signal processing this model is also referred to as multiple measurement vector mmv problem it allows to jointly select the same features for multiple regression tasks remark our framework could encompass easily the case of groups with various size and weights presented in since our aim is mostly for and multinomial applications we have rather presented matrix formulation regularized logistic regression here we consider the formulation given in chapter for the two classes logistic regression in such context one observes for each rns class label ci this information can be recast as yi and it is then customary to minimize where pβq xj log exp xi with rp fi pzq exppzqq and the penalty is simply the norm ωpβq let us introduce nh the binary negative entropy function defined by logpxq xq xq if nhpxq otherwise then one can easily check that pzi nhpzi yi and with the convention lasso regr logistic regr fi pzq pyi yi ez yi puq pyi ωpbq yi yi multinomial regr yi zk log ezk nhpu yi bj nhpu yi bj λmax px gpθq yq ez ez px qq rownormpeθ table useful ingredients for computing gap safe rules we have used lower case to indicate when the parameters are vectorial the function rownorm consists in normalizing matrix such that each row sums to one multinomial logistic regression we adapt the formulation given in chapter for the multinomial regression in such context one observes for each rns class label ci qu this information can be recast into matrix rnˆq filled by and yi ku in the same spirit as the lasso matrix rpˆq is formed by vectors encoding the hyperplanes for the linear classification the multinomial regularized regression reads pbq xi log exp xi řq řq with fi pzq zk log exp pzk qq to recover the formulation as in let us introduce nh the negative entropy function defined by still with the convention řq řq xi logpxi if σq tx rq xi nhpxq otherwise again one can easily check that pzq nhpz yi and remark for multinomial logistic regression dλ implicitly encodes the additional constraint dom dλ rns yi σq where σq is the dimensional simplex see as and rk both belong to this set any convex combination of them such as θk defined in satisfies this additional constraint remark the intercept has been neglected in our models for simplicity our gap safe framework can also handle such feature at the cost of more technical details by adapting the results from for instance however in 
practice the intercept can be handled in the present formulation by adding constant column to the design matrix the intercept is then regularized however if the constant is set high enough regularization is small and experiments show that it has little to no impact for problems this is the strategy used by the liblinear package experiments in this section we present results obtained with the gap safe rule results are on high dimensional data both dense and sparse implementation have been done in python and cython for low critical parts they are based on the lasso implementation of and coordinate descent logistic regression solver in the lightning software in all experiments the coordinate descent algorithm used follows the pseudo code from with screening step every iterations figure experiments on brain imaging dataset dense data with and on the left fraction of active variables as function of and the number of iterations the gap safe strategy has much longer range of with red small active sets on the right computation time to reach convergence using different screening strategies note that we have not performed comparison with the sequential screening rule commonly acknowledge as the safe screening rule such as th since we can show that this kind of rule is not safe indeed the stopping criterion is based on dual gap accuracy and comparisons would be unfair since such methods sometimes do not converge to the prescribed accuracy this is by counter example given in the supplementary material nevertheless modifications of such rules inspired by our gap safe rules can make them safe however the obtained sequential rules are still outperformed by our dynamic strategies see figure for an illustration regression to demonstrate the benefit of the gap safe screening rule for lasso problem we used neuroimaging data electroencephalography eeg and magnetoencephalography meg are brain imaging modalities that allow to identify active brain regions the problem to solve is regression problem with squared loss where every task corresponds to time instant using multitask lasso one can constrain the recovered sources to be identical during short time interval this corresponds to temporal stationary assumption in this experiment we used joint data with meg and eeg sensors leading to the number of possible sources is and the number of time instants with khz sampling rate it is equivalent to say that the sources stay the same for ms results are presented in figure the gap safe rule is compared with the dynamic safe rule from the experimental setup consists in estimating the solutions of the lasso problem for values of on logarithmic grid from λmax to λmax for the experiments on the left fixed number of iterations from to is allowed for each the fraction of active variables is reported figure illustrates that the gap safe rule screens out much more variables than the compared method as well as the converging nature of our safe regions indeed the more iterations performed the more the rule allows to screen variables on the right computation time confirms the effective our rule significantly improves the computation time for all duality gap tolerance from to especially when accurate estimates are required for feature selection binary logistic regression results on the leukemia dataset are reported in figure we compare the dynamic strategy of gap safe to sequential and non dynamic rule such as slores we do not compare to the actual slores rule as it requires the previous dual optimal solution which is not available 
slores is indeed not safe method see section in the supplementary materials nevertheless one can observe that dynamic strategies outperform pure sequential one see section in the supplementary material no screening gap safe sequential gap safe dynamic no screening gap safe sequential gap safe dynamic figure regularized binary logistic regression on the leukemia dataset simple sequential and full dynamic screening gap safe rules are compared on the left fraction of the variables that are active each line corresponds to fixed number of iterations for which the algorithm is run on the right computation times needed to solve the logistic regression path to desired accuracy with values of multinomial logistic regression we also applied gap safe to an multinomial logistic regression problem on sparse dataset data are bag of words features extracted from the dataset removing english stop words and words occurring only once or more than of the time one can observe on figure the dynamic screening and its benefit as more iterations are performed gap safe leads to significant speedup to get duality gap smaller than on the values of we needed without screening and only when gap safe was activated figure fraction of the variables that are active for regularized multinomial logistic regression on classes of the dataset sparse data with computation was run on the best of the features using univariate feature selection each line corresponds to fixed number of iterations for which the algorithm is run conclusion this contribution detailed new safe rules for accelerating algorithms solving generalized linear models regularized with and norms the rules proposed are safe easy to implement dynamic and converging allowing to discard significantly more variables than alternative safe rules the positive impact in terms of computation time was observed on all tested datasets and demonstrated here on high dimensional regression task using brain imaging data as well as binary and multiclass classification problems on dense and sparse data extensions to other generalized linear model poisson regression are expected to reach the same conclusion future work could investigate optimal screening frequency determining when the screening has correctly detected the support acknowledgment we acknowledge the support from chair machine learning for big data at paristech and from the paristech think tank this work benefited from the support of the fmjh program gaspard monge in optimization and operation research and from the support to this program from edf references argyriou evgeniou and pontil feature learning in nips pages argyriou evgeniou and pontil convex feature learning machine learning bauschke and combettes convex analysis and monotone operator theory in hilbert spaces springer new york blondel seki and uehara block coordinate descent algorithms for sparse multiclass classification machine learning bonnefoy emiya ralaivola and gribonval dynamic screening principle for the lasso in eusipco bonnefoy emiya ralaivola and gribonval dynamic screening accelerating algorithms for the lasso and ieee trans signal and van de geer statistics for data springer series in statistics springer heidelberg methods theory and applications efron hastie johnstone and tibshirani least angle regression ann with discussion and rejoinder by the authors el ghaoui viallon and rabbani safe feature elimination in sparse supervised learning pacific fan chang hsieh wang and lin liblinear library for large linear classification mach learn fercoq 
gramfort and salmon mind the duality gap safer rules for the lasso in icml pages friedman hastie and tibshirani regularization paths for generalized linear models via coordinate descent journal of statistical software gramfort kowalski and estimates for the inverse problem using accelerated gradient methods phys med and convex analysis and minimization algorithms ii volume berlin koh kim and boyd an method for logistic regression mach learn manning and foundations of statistical natural language processing mit press cambridge ma usa pedregosa varoquaux gramfort michel thirion grisel blondel prettenhofer weiss dubourg vanderplas passos cournapeau brucher perrot and duchesnay machine learning in python mach learn tibshirani regression shrinkage and selection via the lasso jrssb tibshirani bien friedman hastie simon taylor and tibshirani strong rules for discarding predictors in problems jrssb tibshirani the lasso problem and uniqueness electron wang wonka and ye lasso screening rules via dual polytope projection arxiv preprint wang zhou liu wonka and ye safe screening rule for sparse logistic regression in nips pages xiang wang and ramadge screening tests for lasso problems arxiv preprint yuan and lin model selection and estimation in regression with grouped variables jrssb 
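The dynamic gap safe rule described in the preceding sections specialises neatly to the plain lasso. The following is a minimal numpy sketch of one screening step, assuming the single-task objective $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ with a dense design matrix; it illustrates the rule rather than reproducing the authors' multi-task and multinomial implementation.

```python
import numpy as np

def gap_safe_screen_lasso(X, y, beta, lam):
    """One dynamic Gap Safe screening step for the lasso
        min_b 0.5 * ||y - X b||_2^2 + lam * ||b||_1 .
    Returns a boolean mask of features that cannot yet be safely discarded.
    Sketch of the single-task special case only.
    """
    r = y - X @ beta                              # residual at the current primal iterate
    primal = 0.5 * r @ r + lam * np.abs(beta).sum()
    # Rescale the residual so the dual point is feasible: |x_j^T theta| <= 1 for all j.
    theta = r / max(lam, np.abs(X.T @ r).max())
    dual = 0.5 * y @ y - 0.5 * lam**2 * np.sum((theta - y / lam) ** 2)
    gap = max(primal - dual, 0.0)                 # duality gap (clipped for numerical safety)
    radius = np.sqrt(2.0 * gap) / lam             # radius of the gap safe sphere
    # Sphere test: discard feature j whenever |x_j^T theta| + radius * ||x_j|| < 1.
    keep = np.abs(X.T @ theta) + radius * np.linalg.norm(X, axis=0) >= 1.0
    beta[~keep] = 0.0                             # screened coefficients are zero at the optimum
    return keep
```

In practice such a step would be invoked every few passes of a coordinate descent solver, as in the experiments above, with subsequent coordinate updates restricted to the surviving features; because the duality gap shrinks as the solver converges, the sphere radius shrinks and the rule screens out more and more variables dynamically.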
empirical localization of homogeneous divergences on discrete sample spaces takashi takenouchi department of complex and intelligent systems future university hakodate kamedanakano hakodate hokkaido japan ttakashi takafumi kanamori department of computer science and mathematical informatics nagoya university furocho chikusaku nagoya japan kanamori abstract in this paper we propose novel parameter estimator for probabilistic models on discrete space the proposed estimator is derived from minimization of homogeneous divergence and can be constructed without calculation of the normalization constant which is frequently infeasible for models in the discrete space we investigate statistical properties of the proposed estimator such as consistency and asymptotic normality and reveal relationship with the information geometry some experiments show that the proposed estimator attains comparable performance to the maximum likelihood estimator with drastically lower computational cost introduction parameter estimation of probabilistic models on discrete space is popular and important issue in the fields of machine learning and pattern recognition for example the boltzmann machine with hidden variables is very popular probabilistic model to represent binary variables and attracts increasing attention in the context of deep learning training of the boltzmann machine estimation of parameters is usually done by the maximum likelihood estimation mle the mle for the boltzmann machine can not be explicitly solved and the optimization is frequently used difficulty of the optimization is that the calculation of the gradient requires calculation of normalization constant or partition function in each step of the optimization and its computational cost is sometimes exponential order the problem of computational cost is common to the other probabilistic models on discrete spaces and various kinds of approximation methods have been proposed to solve the difficulty one approach tries to approximate the probabilistic model by tractable model by the approximation which considers model assuming independence of variables another approach such as the contrastive divergence avoids the exponential time calculation by the markov chain monte carlo mcmc sampling in the literature of parameters estimation of probabilistic model for continuous variables employs score function which is gradient of with respect to the data vector rather than parameters this approach makes it possible to estimate parameters without calculating the normalization term by focusing on the shape of the density function extended the method to discrete variables which defines information of neighbor by contrasting probability with that of flipped variable proposed generalized local scoring rules on discrete sample spaces and proposed an approximated estimator with the bregman divergence in this paper we propose novel parameter estimator for models on discrete space which does not require calculation of the normalization constant the proposed estimator is defined by minimization of risk function derived by an unnormalized model and the homogeneous divergence having weak coincidence axiom the derived risk function is convex for various kind of models including higher order boltzmann machine we investigate statistical properties of the proposed estimator such as the consistency and reveal relationship between the proposed estimator and the settings let be vector of random variables in discrete typically and bracket be summation of function on let and be 
space of all finite measures on and subspace consisting of all probability measures on respectively in this paper we focus on parameter estimation of probabilistic model on written as qθ zθ where is an vector of parameters qθ is an unnormalized model in and zθ is normalization constant computation of the normalization constant zθ sometimes requires calculation of exponential order and is sometimes difficult for models on the discrete space note that the unnormalized model qθ is not normalized and qθ does not necessarily hold let ψθ be function on and throughout the paper we assume without loss of generality that the unnormalized model qθ can be written as qθ exp ψθ remark by setting ψθ as ψθ log zθ the normalized model can be written as example the bernoulli distribution on is simplest example of the probabilistic model with the function ψθ θx example with function ψθ xd xd we can define order boltzmann machine example let xo be an observed vector and hidden vector and xh respectively and xo xh where indicates the transpose be concatenated vector function ψh xo for the boltzmann machine with hidden variables is written as ψh xo log exp where xh xh is the summation with respect to the hidden variable xh let us assume that dataset xi generated by an underlying distribution is given and be set of all patterns which appear in the dataset an empirical distribution associated with the dataset is defined as nx otherwise where nx xi is number of pattern appeared in the dataset definition for the unnormalized model and distributions and in probability functions rα and on are defined by rα qθ qθ qθ the distribution rα is an model of the unnormalized model and with ratio remark we observe that also if rα holds for an arbitrary to estimate the parameter of probabilistic model the mle defined by mle argmaxθ is frequently employed where log xi is the of the parameter with the model though the mle is asymptotically consistent and efficient estimator main drawback of the mle is that computational cost for probabilistic models on the discrete space sometimes becomes exponential unfortunately the mle does not have an explicit solution in general the estimation of the parameter can be done by the gradient based optimization with gradient of where while the first term can be easily calculated the second term includes calculation of the normalization term zθ which requires times summation for and is not feasible when is large homogeneous divergences for statistical inference divergences are an extension of the squared distance and are often used in statistical inference formal definition of the divergence is valued function on or on such that holds for arbitrary many popular divergences such as the kl divergence defined on enjoy the coincidence axiom leads to the parameter in the statistical model is estimated by minimizing the divergence with respect to in the statistical inference using unnormalized models the coincidence axiom of the divergence is not suitable since the probability and the unnormalized model do not exactly match in general our purpose is to estimate the underlying distribution up to constant factor using unnormalized models hence divergences having the property of the weak coincidence axiom if and only if cf for some are good candidate as class of divergences with the weak coincidence axiom we focus on homogeneous divergences that satisfy the equality cg for any and any representative of homogeneous divergences is the ps divergence or in other words that is defined from the inequality 
assume that is positive constant for all functions in the inequality holds the inequality becomes an equality if and only if and are linearly dependent the psdivergence dγ for is defined by dγ log log log the ps divergence is homogeneous and the inequality ensures the and the weak coincidence axiom of the one can confirm that the scaled dγ converges to the extended defined on as the is used to obtain robust estimator as shown in the standard from the empirical distribution to the unnormalized model qθ requires the computation of that may be infeasible in our setup to circumvent such an expensive computation we employ trick and substitute model localized by the empirical distribution for qθ which makes it possible to replace the total sum in with the empirical mean more precisely let us consider the from pα to pα for the probability distribution and the unnormalized model where are two distinct real numbers then the divergence vanishes if and only if pα pα we define the localized sα by sα dγ pα pα log pα log pβ where substituting the empirical distribution the total into sum over nx is replaced with variant of the empirical mean such as for real number since sα holds we can assume without loss of generality in summary the conditions of the real parameters are given by where the last condition denotes let us consider another aspect of the computational issue about the localized for the probability distribution and the unnormalized exponential model qθ we show that the localized sα qθ is convex in when the parameters and are properly chosen theorem let be any probability distribution and let qθ be the unnormalized exponential model qθ exp where is any function corresponding to the sufficient statistic in the normalized exponential model for given the localized sα qθ is convex in for any satisfying if and only if proof after some calculation we have vrα where vrα is the covariance matrix of under the probability rα thus the hessian matrix of sα qθ is written as qθ vrβ the hessian matrix is definite if the converse direction is deferred to the supplementary material up to constant factor the localized with characterized by theorem is denotes as sα that is defined by sα log pα for the parameter can be negative if is positive on clearly sα satisfies the homogeneity and the weak coincidence axiom as well as sα estimation with the localized divergence given the empirical distribution and the unnormalized model qθ we define novel estimator with the localized sα or sα though the localized the empirical distribution is not when we can formally define the following estimator by restricting the domain to the observed set of examples even for negative argmin sα qθ log qθ log qθ log qθ argmin remark the summation in is defined on and then is computable even when also the summation includes only terms and its computational cost is proposition for the unnormalized model the estimator is fisher consistent proof we observe sα qθ implying the fisher consistency of theorem let qθ be the unnormalized model and be the true parameter of underlying distribution then an asymptotic distribution of the estimator is written as where is the fisher information matrix proof we shall sketch proof and the detailed proof is given in supplementary material let us assume that the empirical distribution is written as note that because the asymptotic expansion of the equilibrium condition for the estimator around leads to sα qθ sα qθ sα qθ by the delta method we have sα qθ sα qθ and from the central limit theorem we observe that 
xi ψθ asymptotically follows the normal distribution with mean and variance which is known as the fisher information matrix also from the law of large numbers we have qθ in the limit of consequently we observe that remark the asymptotic distribution of is equal to that of the mle and its variance does not depend on remark as shown in remark the normalized model is special case of the unnormalized model and then theorem holds for the normalized model characterization of localized divergence sα throughout this section we assume that holds and investigate properties of the localized psdivergence sα we discuss influence of selection of and characterization of the localized sα in the following subsections influence of selection of we investigate influence of selection of for the localized sα with view of the estimating equation the estimator derived from sα satisfies qθ which is moment matching with respect to two distributions and on the other hand the estimating equation of the mle is written as mle ψθmle θmle mle θmle mle mle which is moment matching with respect to the empirical distribution θmle and the normalized model θmle while the localized sα is not defined with comparison of with implies that behavior the estimator becomes similar to that of the mle in the limit of and relationship with the the between two positive measures is defined as dα αf where is real number note that dα and if and only if and the reduces to kl and kl in the limit of and respectively remark an estimator defined by minimizing dα between the empirical distribution and normalized model satisfies qθ ψθ and requires calculation proportional to which is infeasible also the same hold for an estimator defined by minimizing dα qθ between the empirical distribution and unnormalized qθ model satisfying qθ here we assume that and consider trick to cancel out the term by mixing two as follows dα remark dα is divergence when holds dα and dα if and only if without loss of generality we assume for dα firstly we consider an estimator defined by the minmizer of nx min note that the summation in includes only terms we remark the following remark let be the underlying distribution and qθ be the unnormalized model then an estimator defined by minimizing dα qθ is not in general fisher consistent qθ this remark shows that an estimator associated with dα qθ does not have suitable properties such as asymptotic unbiasedness and consistency while required computational cost is drastically reduced intuitively this is because the mixture of satisfies the coincidence axiom to overcome this drawback we consider the following minimization problem for estimation of the parameter of model argmin dα rqθ where is constant corresponding to an inverse of the normalization term zθ proposition let qθ be the unnormalized model for and the minimization of dα rqθ is equivalent to the minimization of sα qθ proof for given we observe that argmin dα rqθ qθ note that computation of requires only sample order calculation by plugging into dα rqθ we observe argmin dα qθ argmin sα qθ if and hold the estimator is equivalent to the estimator associated with the localized sα implying that sα is characterized by the mixture of remark from viewpoint of the information geometry metric information geometrical structure induced by the is the fisher metric induced by the this implies that the estimation based on the mixture of is fisher efficient and is an intuitive explanation of the theorem the localized ps divergence sα and sα with can be interpreted as an extension 
of the which preserves fisher efficiency experiments we especially focus on setting of convexity of the risk function with the unnormalized model exp holds theorem and examined performance of the proposed estimator fully visible boltzmann machine in the first experiment we compared the proposed estimator with parameter settings with the mle and the ratio matching method note that the ratio matching method also does not require calculation of the normalization constant and the proposed method with may behave like the mle as discussed in section all methods were optimized with the optim function in language the dimension of input was set to and the synthetic dataset was randomly generated from the second order boltzmann machine example with parameter we repeated comparison times and observed averaged performance figure shows median of the root mean square errors rmses between and of each method over trials against the number of examples we observe that the proposed estimator works well and is superior to the ratio matching method in this experiment the mle outperforms the proposed method contrary to the prediction of theorem this is because observed patterns were only small portion of all possible patterns as shown in figure even in such case the mle can take all possible patterns into account through the normalization term log zθ const that works like regularizer on the other hand the proposed method genuinely uses only the observed examples and the asymptotic analysis would not be relevant in this case figure shows median of computational time of each method against the computational time of the mle does not vary against because the computational cost is dominated by the calculation of the normalization constant both the proposed estimator and the ratio matching method are significantly faster than the mle and the ratio matching method is faster than the proposed estimator while the rmse of the proposed estimator is less than that of the ratio matching boltzmann machine with hidden variables in this subsection we applied the proposed estimator for the boltzmann machine with hidden variables whose associated function is written as the proposed estimator with parameter settings was compared with the mle the dimension of observed variables was fixed to and of hidden variables was set to and the parameter was generated as including parameters corresponding to hidden variables note that the boltzmann machine with hidden variables is not identifiable and different values of the parameter do not necessarily generate different probability distributions implying that estimators are influenced by local minimums then we measured performance of each estimator by the averaged time mle ratio matching number of unique patterns rmse mle ratio matching figure median of rmses of each method against in log scale plot of number of unique patterns in the dataset against median of computational time of each method against in log scale mle mle time averaged log likelihood mle averaged log likelihood log xi rather than the rmse an initial value of the parameter was set by and commonly used by all methods we repeated the comparison times and observed the averaged performance figure shows median of averaged of each method over trials against the number of example we observe that the proposed estimator is comparable with the mle when the number of examples becomes large note that the averaged of mle once decreases when is samll and this is due to overfitting of the model figure shows median of averaged of each 
method for test dataset consists of examples over trials figure shows median of computational time of each method against and we observe that the proposed estimator is significantly faster than the mle figure median of averaged of each method against median of averaged of each method calculated for test dataset against median of computational time of each method against in log scale conclusions we proposed novel estimator for probabilistic model on discrete space based on the unnormalized model and the localized which has the homogeneous property the proposed estimator can be constructed without calculation of the normalization constant and is asymptotically efficient which is the most important virtue of the proposed estimator numerical experiments show that the proposed estimator is comparable to the mle and required computational cost is drastically reduced references hinton sejnowski learning and relearning in boltzmann machines mit press cambridge mass ackley hinton sejnowski learning algorithm for boltzmann machines cognitive science amari kurata nagaoka information geometry of boltzmann machines in ieee transactions on neural networks hinton salakhutdinov better way to pretrain deep boltzmann machines in advances in neural information processing systems pp cambridge ma mit press opper saad advanced mean field methods theory and practice mit press cambridge ma hinton training products of experts by minimizing contrastive divergence neural computation estimation of statistical models by score matching journal of machine learning research some extensions of score matching computational statistics data analysis dawid lauritzen parry proper local scoring rules on discrete sample spaces the annals of statistics gutmann hirayama bregman divergence as general framework to estimate unnormalized statistical models arxiv preprint amari nagaoka methods of information geometry volume of translations of mathematical monographs oxford university press sejnowski boltzmann machines in american institute of physics conference series good comment on measuring information and uncertainty by buehler in godambe sprott editors foundations of statistical inference pp toronto holt rinehart and winston fujisawa eguchi robust parameter estimation with small bias against heavy contamination journal of multivariate analysis van der vaart asymptotic statistics cambridge university press core team language and environment for statistical computing foundation for statistical computing vienna austria 
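To make the computational bottleneck motivating this estimator concrete, the sketch below (not the authors' code) shows the brute-force normalization constant of the fully visible second-order Boltzmann machine used in the experiments above, assuming configurations in $\{-1, +1\}^d$ and a symmetric interaction matrix with zero diagonal; each MLE gradient step needs such a $2^d$ summation, whereas the proposed estimator only touches the observed patterns.

```python
import itertools
import numpy as np

def psi(x, theta):
    """Unnormalised log-probability of a fully visible second-order Boltzmann
    machine: psi_theta(x) = sum_{j<k} theta[j, k] * x[j] * x[k]
    (theta assumed symmetric with zero diagonal)."""
    return 0.5 * x @ theta @ x

def log_partition(theta):
    """Brute-force log Z_theta: a sum over all 2^d configurations of {-1, +1}^d.
    This exponential-order cost is exactly what the normalization-free
    estimator avoids."""
    d = theta.shape[0]
    logs = np.array([psi(np.array(x), theta)
                     for x in itertools.product([-1.0, 1.0], repeat=d)])
    m = logs.max()
    return m + np.log(np.exp(logs - m).sum())   # log-sum-exp for stability

def log_prob(x, theta):
    """Normalised log-probability, feasible only for small d."""
    return psi(x, theta) - log_partition(theta)
```

Even at d = 20 this inner loop already visits over a million configurations, which is why the experiments above report the MLE's computation time as essentially flat in n but large in absolute terms.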
statistical model criticism using kernel two sample tests james robert lloyd department of engineering university of cambridge zoubin ghahramani department of engineering university of cambridge abstract we propose an exploratory approach to statistical model criticism using maximum mean discrepancy mmd two sample tests typical approaches to model criticism require practitioner to select statistic by which to measure discrepancies between data and statistical model mmd two sample tests are instead constructed as an analytic maximisation over large space of possible statistics and therefore automatically select the statistic which most shows any discrepancy we demonstrate on synthetic data that the selected statistic called the witness function can be used to identify where statistical model most misrepresents the data it was trained on we then apply the procedure to real data where the models being assessed are restricted boltzmann machines deep belief networks and gaussian process regression and demonstrate the ways in which these models fail to capture the properties of the data they are trained on introduction statistical model criticism or is an important part of complete statistical analysis when one fits linear model to data set complete analysis includes computing cook distances to identify influential points or plotting residuals against fitted values to identify or heteroscedasticity similarly modern approaches to bayesian statistics view model criticism as in important component of cycle of model construction inference and criticism as statistical models become more complex and diverse in response to the challenges of modern data sets there will be an increasing need for greater range of model criticism procedures that are either automatic or widely applicable this will be especially true as automatic modelling methods and probabilistic programming mature model criticism typically proceeds by choosing statistic of interest computing it on data and comparing this to suitable null distribution ideally these statistics are chosen to assess the utility of the statistical model under consideration see applied examples but this can require considerable expertise on the part of the modeller we propose an alternative to this manual approach by using statistic defined as supremum over broad class of measures of discrepancy between two distributions the maximum mean discrepancy mmd the advantage of this approach is that the discrepancy measure attaining the supremum automatically identifies regions of the data which are most poorly represented by the statistical model fit to the data we demonstrate mmd model criticism on toy examples restricted boltzmann machines and deep belief networks trained on mnist digits and gaussian process regression models trained on several time series our proposed method identifies discrepancies between the data and fitted models that would not be apparent from predictive performance focused metrics it is our belief that more effort should be expended on attempting to falsify models fitted to data using model criticism techniques or otherwise not only would this aid research in targeting areas for improvement but it would give greater confidence in any conclusions drawn from model we follow box using the term model criticism for similar reasons to hagan model criticism suppose we observe data obs yiobs and we attempt to fit model with parameters after performing statistical analysis we will have either an estimate or an approximate posterior obs for the 
parameters how can we check whether any aspects of the data were poorly modelled criticising prior assumptions the classical approach to model criticism is to attempt to falsify the null hypothesis that the data could have been generated by the model for some value of the parameters obs this is typically achieved by constructing statistic of the data whose distribution does not depend on the parameters pivotal quantity the extent to which the observed data obs differs from expectations under the model can then be quantified with based pfreq obs obs where for any analogous quantities in bayesian analysis are the prior predictive of box the null hypothesis is replaced with the claim that the data could have been generated from the prior predictive distribution obs dθ can then be constructed for any statistic of the data pprior obs obs where dθ both of these procedures construct function of the data obs whose distribution under suitable null hypothesis is uniform the quantifies how surprising it would be for the data obs to have been generated by the model the different null hypotheses reflect the different uses of the word model in frequentist and bayesian analyses frequentist model is class of probability distributions over data indexed by parameters whereas bayesian model is joint probability distribution over data and parameters criticising estimated models or posterior distributions constrasting method of bayesian model criticism is the calculation of posterior predictive ppost where the prior predictive distribution in is replaced with the posterior predictive distribution obs dθ the corresponding test for an analysis resulting in point estimate of the parameters would use the predictive distribution to form the pplug these quantify how surprising the data obs is even after having observed it simple variant of this method of model criticism is to use held out data generated from the same distribution as obs to compute this quantifies how surprising the held out data is after having observed obs which type of model criticism should be used different forms of model criticism are appropriate in different contexts but we believe that posterior predictive and will be most often useful for highly flexible models for example suppose one is fitting deep belief network to data classical would assume null hypothesis that the data could have been generated from some deep belief network since the space of all possible deep belief networks is very large it will be difficult to ever falsify this hypothesis more interesting null hypothesis to test in this example is whether or not our particular deep belief network can faithfully mimick the distribution of the sample it was trained on this is the null hypothesis of posterior or model criticism using maximum mean discrepancy two sample tests we assume that our data obs are samples from some distribution yiobs after performing inference resulting in point estimate of the parameters the null hypothesis associated with is yiobs we can test this null hypothesis using two sample test in particular we have samples of data yiobs and we can generate samples from the predictive distribution yirep and then test whether or not these samples could have been generated from the same distribution for consistency with two sample testing literature we now switch notation suppose we have samples xi and yi drawn from distributions and respectively the two sample problem asks if way of answering the two sample problem is to consider maximum mean discrepancy mmd statistics 
Model criticism using maximum mean discrepancy two-sample tests. We assume that our data y^obs are samples from some distribution, y_i^obs ~ p. After performing inference resulting in a point estimate θ̂ of the parameters, the null hypothesis associated with plug-in model criticism is y_i^obs ~ p(y | θ̂). We can test this null hypothesis using a two-sample test: we have samples of data y_i^obs, we can generate samples y_i^rep from the predictive distribution p(y | θ̂), and we can then test whether or not these samples could have been generated from the same distribution. For consistency with the two-sample testing literature we now switch notation: suppose we have samples (x_i) and (y_i) drawn from distributions p and q respectively; the two-sample problem asks if p = q. A way of answering the two-sample problem is to consider maximum mean discrepancy (MMD) statistics,

MMD(F, p, q) = sup_{f in F} ( E_{x~p}[f(x)] − E_{y~q}[f(y)] ),

where F is a set of functions. When F is the unit ball of a reproducing kernel Hilbert space (RKHS), the function attaining the supremum can be derived analytically and is called the witness function,

f(x) ∝ E_{x'~p}[k(x, x')] − E_{y'~q}[k(x, y')],

where k is the kernel of the RKHS. Substituting the witness function into the MMD and squaring yields

MMD² = E_{x,x'~p}[k(x, x')] − 2 E_{x~p, y~q}[k(x, y)] + E_{y,y'~q}[k(y, y')].

This expression only involves expectations of the kernel, which can be estimated empirically by

MMD̂² = (1/n²) Σ_{i,j} k(x_i, x_j) − (2/(mn)) Σ_{i,j} k(x_i, y_j) + (1/m²) Σ_{i,j} k(y_i, y_j).

One can also estimate the witness function from finite samples (x_i), (y_i),

f̂(x) ∝ (1/n) Σ_i k(x, x_i) − (1/m) Σ_i k(x, y_i),

that is, the empirical witness function is the difference of two kernel density estimates. This means that we can interpret the witness function as showing where the estimated densities of p and q are most different. While MMD two-sample tests are well known in the literature, the main contribution of this work is to show that this interpretability of the witness function makes them a useful tool as an exploratory form of statistical model criticism.

Examples on toy data. To illustrate the use of the MMD two-sample test as a tool for model criticism, we demonstrate its properties on two simple datasets and models.

Newcomb's speed of light data. A histogram of Simon Newcomb's measurements used to determine the speed of light is shown on the left of the figure below. We fit a normal distribution to this data by maximum likelihood and ask whether this model is a faithful representation of the data.

Figure: Left: histogram of Simon Newcomb's speed of light measurements (deviations, in nanoseconds). Middle: histogram together with density estimate (red solid line) and MMD witness function (green dashed line). Right: histogram together with updated density estimate and witness function.

We sampled points from the fitted distribution and performed an MMD two-sample test using a radial basis function kernel. The estimated p-value of the test was very small, indicating a clear disparity between the model and the data. The data, fitted density estimate (the normal distribution) and witness function are shown in the middle of the figure. The witness function has a trough at the centre of the data and peaks on either side, indicating that the fitted model has placed too little mass in its centre and too much mass outside its centre. Throughout this paper we estimate the null distribution of the MMD statistic using the bootstrap method described in the cited reference; we use a radial basis function kernel and select the lengthscale by cross-validation, using the predictive likelihood of the kernel density estimate as the selection criterion.

This suggests that we should modify our model, by either using a distribution with heavy tails or explicitly modelling the possibility of outliers. However, to demonstrate some of the properties of the MMD two-sample test, we make an unusual choice of fitting a Gaussian by maximum likelihood but ignoring the two outliers in the data. The new fitted density estimate (the normal distribution) and witness function of an MMD test are shown on the right of the figure; the estimated p-value associated with this MMD two-sample test is no longer small, despite the fitted model being a very poor explanation of the outliers. The nature of an MMD test depends on the kernel defining the RKHS: in this paper we use the radial basis function kernel, which encodes for smooth functions with a typical lengthscale. Consequently the test identifies dense discrepancies, only identifying outliers if the model and inference method are not robust to them. This is not a failure: a test that can identify too many types of discrepancy would have low statistical power; see the cited work for a discussion of the power of the MMD test and alternatives.
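The following is a minimal sketch of the machinery just described: the biased empirical estimate of MMD² with an RBF kernel, the empirical witness function, and a null distribution estimate. The paper selects the kernel lengthscale by cross-validation and uses a bootstrap null; the median heuristic and the permutation null used here are common substitutes, assumed for brevity, and all names are illustrative.

```python
# Minimal sketch: empirical MMD^2, witness function, and a permutation null.
import numpy as np

def rbf_kernel(a, b, lengthscale):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def mmd2_biased(x, y, lengthscale):
    # Biased estimate: mean of k over all pairs, including i = j.
    return (rbf_kernel(x, x, lengthscale).mean()
            - 2.0 * rbf_kernel(x, y, lengthscale).mean()
            + rbf_kernel(y, y, lengthscale).mean())

def witness(grid, x, y, lengthscale):
    # Difference of two kernel density estimates (up to a constant factor).
    return (rbf_kernel(grid, x, lengthscale).mean(axis=1)
            - rbf_kernel(grid, y, lengthscale).mean(axis=1))

def permutation_p_value(x, y, lengthscale, n_perm=500, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    observed = mmd2_biased(x, y, lengthscale)
    pooled, n = np.vstack([x, y]), len(x)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(len(pooled))
        null[i] = mmd2_biased(pooled[perm[:n]], pooled[perm[n:]], lengthscale)
    return np.mean(null >= observed)

# Toy usage: data (x) versus samples from a Gaussian fitted to it (y).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(-6, 0.5, 10)])[:, None]
y = rng.normal(x.mean(), x.std(), 400)[:, None]
ell = np.median(np.abs(x - x.T))            # median heuristic lengthscale
print("MMD^2:", mmd2_biased(x, y, ell))
print("p-value:", permutation_p_value(x, y, ell))
grid = np.linspace(-8, 4, 200)[:, None]
w = witness(grid, x, y, ell)                # peaks/troughs show where densities differ
```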
High dimensional data. The interpretability of the witness function comes from it being equal to the difference of two kernel density estimates. In high dimensional spaces, kernel density estimation is a very high variance procedure that can result in poor density estimates, which destroy the interpretability of the method. In response we consider using dimensionality reduction techniques before performing two-sample tests.

We generated synthetic data from a mixture of Gaussians in a high dimensional space; we then fit a mixture of Gaussians and performed an MMD two-sample test. We reduced the dimensionality of the data using principal component analysis (PCA), selecting the first two principal components. To ensure that the MMD test remains well calibrated, we include the PCA dimensionality reduction within the bootstrap estimation of the null distribution. The data and predictive samples are plotted on the left of the figure below; while we can see that one cluster is different from the rest, it is difficult to assess by eye whether these distributions are different, due in part to the difficulty of plotting two sets of samples on top of each other.

Figure: Left: PCA projection of synthetic high dimensional cluster data (green circles) and projection of samples from the fitted model (red circles). Right: witness function of MMD model criticism; the poorly fit cluster is clearly identified.

The MMD test returns a small p-value, and the witness function (right of the figure) clearly identifies the cluster that has been incorrectly modelled. Presented with this discrepancy, a statistical modeller might try a more flexible clustering model. The p-value of the MMD statistic can also be made non-significant by fitting a richer mixture of Gaussians: this is a sufficient approximation to the data distribution such that no discrepancy can be detected with the amount of data available.

What exactly do neural networks dream about? "To recognize shapes, first learn to generate images", quoth Hinton. Restricted Boltzmann machine (RBM) pretraining of neural networks was shown in the cited work to learn a deep belief network (DBN) generative model of the data. In agreement with this observation, as well as computing estimates of marginal likelihoods and testing errors, it is standard to demonstrate the effectiveness of a generative neural network by generating samples from the distribution it has learned (for details see code at [redacted]). When trained on the MNIST handwritten digit data, samples from RBMs (see the figure below for random samples) and DBNs certainly look like digits, but it is hard to detect any systematic anomalies purely by visual inspection. We now use MMD model criticism to investigate how faithfully RBMs and DBNs can capture the distribution over handwritten digits.

RBMs can consistently mistake the identity of digits. We trained an RBM using persistent contrastive divergence, with the same architecture and training settings (number of epochs, batch size and learning rate) as the code available at the deep learning tutorial. We generated independent samples from the learned generative model by initialising the network with a random training image and performing Gibbs updates, with the digit labels clamped, to generate each image as in the referenced work. Since we generated digits from the class-conditional distributions, we compare each class separately. Rather than show plots of the witness function for each digit, we summarise the witness function by examples of digits closest to the peaks and troughs of the witness function; the witness function estimate is differentiable, so we can find the peaks and troughs by gradient based optimisation. We apply MMD model criticism to each class
conditional distribution using pca to reduce to dimensions as in section figure random samples from an rbm peaks of the witness function for the rbm digits that are by the model peaks of the witness function for samples from rbms with differently initialised pseudo random number generators during training peaks of the witness function for the dbn troughs digits that are by the model of the witness function for samples from rbms troughs of the witness function for the dbn figure shows the digits closest to the two most extreme peaks of the witness function for each class the peaks indicate where the fitted distribution the distribution of true digits the estimated for all tests was less than the most obvious problem with these digits is that the first and look quite similar to test that this was not just an single unlucky rbm we trained rbms with differently initialised pseudo random number generators and generated one sample from each and performed the same tests the estimated were again all less than and the summaries of the peaks of the witness function are shown in figure on the first toy data example we observed that the mmd statistic does not highlight outliers and therefore we can conclude that rbms are making consistent mistakes generating from the distribution or when it should have been generating an dbns have nightmares about ghosts we now test the effectiveness of deep learning to represent the distribution of mnist digits in particular we fit dbn with architecture using rbm and generative fine tuning algorithm described in performing the same tests with samples results in estimated of less than except for the digit and digit summaries of the witness function peaks are shown in figure specifically these are the activations of the visible units before sampling sampling binary values this procedure is an attempt to be consistent with the grayscale input distribution of the images analogous discrepancies would be discovered if we had instead sampled binary pixel values that is input pixels and indicators of the class label are connected to hidden neurons without clamping the label neurons the generative distribution is heavily biased towards certain digits the witness function no longer shows any class label mistakes except perhaps for the digit which looks very peculiar but the and appear ghosted the digits fade in and out for comparison figure shows digits closest to the troughs of the witness function there is no trace of ghosting this discrepancy could be due to errors in the autoassociative memory of dbn propogating down the hidden layers resulting in spurious features in several visible neurons an extension to non data we now describe how the mmd statistic can be used for model criticism of non predictive distributions in particular we construct model criticism procedure for regression models obs we assume that our data consists of pairs of inputs and outputs xobs yi typical formulation of the problem of regression is to estimate the conditional distribution of the outputs given the inputs ignoring that our data are not we can generate data from the conditional distribution yirep xobs and compute the empirical mmd estimate between obs obs rep the only difference between this test and the mmd two xobs and sample test is that our data is generated from conditional distribution rather than being the null distribution of this statistic can be trivially estimated by sampling several sets of replicate data from the predictive distribution to demonstrate this test we apply it to 
regression algorithms and time series analysed in in this work the authors compare several methods for constructing gaussian process regression models example data sets are shown in figures and while it is clear that simple methods will fail to capture all of the structure in this data it is not clear priori how much better the more advanced methods will fair to construct we use held out data using the same split of training and testing data as the interpolation experiment in table shows table of for data sets and regression methods the four methods are linear regression lin gaussian process regression using squared exponential kernel se spectral mixture kernels sp and the method proposed in abcd values in bold indicate positive discovery after hochberg procedure with false discovery rate of applied to each model construction method dataset airline solar mauna wheat temperature internet call centre radio gas production sulphuric unemployment births wages lin se sp abcd table two sample test applied to time series and regression algorithms bold values indicate positive discovery using procedure with false discovery rate of for each method we now investigate the type of discrepancies found by this test by looking at the witness function which can still be interpreted as the difference of kernel density estimates figure shows the solar and gas production data sets the posterior distribution of the se fits to this data and the witness functions for the se fit the solar witness function has clear narrow trough indicating that the data is more dense than expected by the fitted model in this region we can see that this has identified region of low variability in the data it has identified local heteroscedasticity not captured by the model similar conclusions can be drawn about the gas production data and witness function of the four methods compared here only abcd is able to model heteroscedasticity explaining why it is the only method with substantially different set of significant however the procedure is still potentially failing to capture structure on four of the datasets gaussian processes when applied to regression problems learn joint distribution of all output values however this joint distribution information is rarely used typically only the pointwise conditional distributions xobs are used as we have done here gas production solar figure from left to right solar data with se posterior witness function of se fit to solar gas production data with se posterior witness function of se fit to gas production figure shows the unemployment and internet data sets the posterior distribution for the abcd fits to the data and the witness functions of the abcd fits the abcd method has captured much of the structure in these data sets making it difficult to visually identify discrepancies between model and data the witness function for unemployment shows peaks and troughs at similar values of the input comparing to the raw data we see that at these input values there are consistent outliers since abcd is based on gaussianity assumptions these consistent outliers have caused the method to estimate large variance in this region when the true data is there is also similar pattern of peaks and troughs on the internet data suggesting that has again been detected indeed the data appears to have hard lower bound which is inconsistent with gaussianity unemployment internet figure from left to right unemployment data with abcd posterior witness function of abcd fit to unemployment internet data with abcd 
posterior witness function of abcd fit to internet discussion of model criticism and related work are we criticising particular model or class of models in section we interpreted the differences between classical bayesian and as corresponding to different null hypotheses and interpretations of the word model in particular classical test null hypothesis that the data could have been generated by class of distributions all normal distributions whereas all other test particular probability distribution robins van der vaart ventura demonstrated that bayesian and are not classical frequentist in their terminology they do not have uniform distribution under the relevant null hypothesis however this was presented as failure of these methods in particular they demonstrated that methods proposed by bayarri berger based on posterior predictive are asymptotically classical this claimed inadequacy of posterior predictive was rebutted and while their usefulness is becoming more accepted see introduction of it would appear there is still confusion on the subject we hope that our interpretation of the differences between these methods as different null hypotheses appropriate in different circumstances sheds further light on the matter should we worry about using the same data for traning and criticism and posterior predictive test the null hypothesis that the observed data could have been generated by the fitted model or posterior predictive distribution in some situations it may be more appropriate to attempt to falsify the null hypothesis that future data will be generated by the or posterior predictive distribution as mentioned in section this can be achieved by reserving portion of the data to be used for model criticism alone rather than fitting model or updating posterior on the full data cross validation methods have also been investigated in this context other methods for evaluating statistical models other typical methods of model evaluation include estimating the predictive performance of the model analyses of sensitivities to modelling parameters priors graphical tests and estimates of model utility for recent survey of bayesian methods for model assessment selection and comparison see which phrases many techniques as estimates of the utility of model for some discussion of sensitivity analysis and graphical model comparison see in this manuscript we have focused on methods that compare statistics of data with predictive distributions ignoring parameters of the model the discrepancy measures of compute statistics of data and parameters examples can be found in hagan also proposes method and selectively reviews techniques for model criticism that also take model parameters into account in the spirit of scientific falsification ideally all methods of assessing model should be performed to gain confidence in any conclusions made of course when performing multiple hypothesis tests care must be taken in the intrepetation of individual conclusions and future work in this paper we have demonstrated an exploratory form of model criticism based on two sample tests using kernel maximum mean discrepancy in contrast to other methods for model criticism the test analytically maximises over broad class of statistics automatically identifying the statistic which most demonstrates the discrepancy between the model and data we demonstrated how this method of model criticism can be applied to neural networks and gaussian process regression and demonstrated the ways in which these models were misrepresenting the 
data they were trained on we have demonstrated an application of mmd two sample tests to model criticism but they can also be applied to any aspect of statistical modelling where two sample tests are appropriate this includes for example geweke tests of markov chain posterior sampler validity and tests of markov chain convergence the two sample tests proposed in this paper naturally apply to data and models but model criticism techniques should of course apply to models with other symmetries exchangeable data logitudinal data time series graphs and many others we have demonstrated an adaptation of the mmd test to regression models but investigating extensions to greater number of model classes would be profitable area for future study we conclude with question do you know how the model you are currently working with most misrepresents the data it is attempting to model in proposing new method of model criticism we hope we have also exposed the reader unfamiliar with model criticism to its utility in diagnosing potential inadequacies of model references george box sampling and bayes inference in scientific modelling and robustness stat soc ser hagan hsss model criticism highly structured stochastic systems pages dennis cook and sanford weisberg residuals and inuence in regression mon on stat and app gelman carlin stern dunson vehtari and rubin bayesian data analysis third edition chapman texts in statistical science taylor francis roger grosse ruslan salakhutdinov william freeman and joshua tenenbaum exploiting compositionality to explore large space of model structures in conf on unc in art int uai chris thornton frank hutter holger hoos and kevin combined selection and hyperparameter optimization of classification algorithms in proc int conf on knowledge discovery and data mining kdd pages new york ny usa acm james robert lloyd david duvenaud roger grosse joshua tenenbaum and zoubin ghahramani automatic construction and description of nonparametric regression models in association for the advancement of artificial intelligence aaai july koller mcallester and pfeffer effective bayesian inference for stochastic programs association for the advancement of artificial intelligence aaai milch marthi russel sontag ong and kolobov blog probabilistic models with unknown objects in proc int joint conf on artificial intelligence noah goodman vikash mansinghka daniel roy keith bonawitz and joshua tenenbaum church language for generative models in conf on unc in art int uai stan development team stan library for probability and sampling version arthur gretton karsten borgwardt malte rasch berhard and alexander smola kernel method for the problem journal of machine learning research irwin guttman the use of the concept of future observation in problems stat soc series stat donald rubin bayesianly justifiable and relevant frequency calculations for the applied statistician ann harold hotelling generalized and measure of multivariate dispersion in proc berkeley symp math stat and prob the regents of the university of california bickel distribution free version of the smirnov two sample test in the case ann math february murray rosenblatt remarks on some nonparametric estimates of density function ann math september parzen on estimation of probability density function and mode ann math stephen stigler do robust estimators work with real data ann november rasmussen and williams gaussian processes for machine learning the mit press cambridge ma usa peel and mclachlan robust mixture modelling using the 
distribution stat october tomoharu iwata david duvenaud and zoubin ghahramani warped mixtures for nonparametric cluster shapes in conf on unc in art int uai geoffrey hinton to recognize shapes first learn to generate images prog brain geoffrey hinton simon osindero and yee whye teh fast learning algorithm for deep belief nets neural deep learning tutorial http andrew gordon wilson and ryan prescott adams gaussian process covariance kernels for pattern discovery and extrapolation in proc int conf machine yoav benjamini and yosef hochberg controlling the false discovery rate practical and powerful approach to multiple testing stat soc series stat james robins aad van der vaart and valerie venture asymptotic distribution of in composite null models am stat bayarri and berger quantifying surprise in the data and model verification bayes andrew gelman bayesian formulation of exploratory data analysis and testing int stat bayarri and castellanos bayesian checking of the second levels of hierarchical models stat august andrew gelman understanding posterior elec gelfand dey and chang model determination using predictive distributions with implementation via methods technical report stanford uni ca dept stat marshall and spiegelhalter identifying outliers in bayesian hierarchical models simulationbased approach bayesian june aki vehtari and janne ojanen survey of bayesian predictive methods for model assessment selection and comparison stat andrew gelman meng and hal stern posterior predictive assessment of model fitness via realized discrepancies stat popper the logic of scientific discovery routledge john geweke getting it right am stat september mary kathryn cowles and bradley carlin markov chain monte carlo convergence diagnostics comparative review am stat june 
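As a supplement to the regression-model-criticism procedure described above (replicate outputs drawn from the fitted conditional predictive distribution at the observed inputs, MMD computed between observed and replicated input-output pairs, and the null estimated from further replicate sets), here is a minimal sketch. The fitted model below is ordinary least squares with Gaussian noise, a stand-in assumption; the paper criticises Gaussian process models, and all names and data are illustrative.

```python
# Minimal sketch: MMD model criticism for a conditional (regression) model.
import numpy as np

def rbf_kernel(a, b, lengthscale):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def mmd2_biased(x, y, lengthscale):
    return (rbf_kernel(x, x, lengthscale).mean()
            - 2.0 * rbf_kernel(x, y, lengthscale).mean()
            + rbf_kernel(y, y, lengthscale).mean())

def regression_mmd_p_value(x_obs, y_obs, sample_y_rep, n_null=200, rng=None):
    """sample_y_rep(rng) draws one replicate output vector at x_obs from the
    fitted conditional predictive distribution."""
    rng = rng if rng is not None else np.random.default_rng(0)
    obs = np.column_stack([x_obs, y_obs])
    ell = np.median(np.abs(obs[:, None, :] - obs[None, :, :]).sum(-1)) + 1e-12
    stat = mmd2_biased(obs, np.column_stack([x_obs, sample_y_rep(rng)]), ell)
    # Null distribution: MMD between two independent replicate sets.
    null = np.array([
        mmd2_biased(np.column_stack([x_obs, sample_y_rep(rng)]),
                    np.column_stack([x_obs, sample_y_rep(rng)]), ell)
        for _ in range(n_null)
    ])
    return np.mean(null >= stat)

# Toy usage: heteroscedastic data criticised with a homoscedastic linear fit.
rng = np.random.default_rng(2)
x_obs = np.linspace(0.0, 1.0, 150)
y_obs = 2.0 * x_obs + rng.normal(0.0, 0.05 + 0.5 * x_obs)
w = np.polyfit(x_obs, y_obs, 1)
sigma = np.std(y_obs - np.polyval(w, x_obs))
sample = lambda r: np.polyval(w, x_obs) + r.normal(0.0, sigma, size=len(x_obs))
print("p-value:", regression_mmd_p_value(x_obs, y_obs, sample, rng=rng))
```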
curves pr analysis done right meelis kull intelligent systems laboratory university of bristol united kingdom peter flach intelligent systems laboratory university of bristol united kingdom abstract analysis abounds in applications of binary classification where true negatives do not add value and hence should not affect assessment of the classifier performance perhaps inspired by the many advantages of receiver operating characteristic roc curves and the area under such curves for accuracybased performance assessment many researchers have taken to report precisionrecall pr curves and associated areas as performance metric we demonstrate in this paper that this practice is fraught with difficulties mainly because of incoherent scale assumptions the area under pr curve takes the arithmetic mean of precision values whereas the fβ score applies the harmonic mean we show how to fix this by plotting pr curves in different coordinate system and demonstrate that the new curves inherit all key advantages of roc curves in particular the area under curves conveys an expected score on harmonic scale and the convex hull of curve allows us to calibrate the classifier scores so as to determine for each operating point on the convex hull the interval of values for which the point optimises fβ we demonstrate experimentally that the area under traditional pr curves can easily favour models with lower expected score than others and so the use of curves will result in better model selection introduction and motivation in machine learning and related areas we often need to optimise multiple performance measures such as classification accuracies precision and recall in information retrieval etc we then have the option to fix particular way to trade off these performance measures we can use overall classification accuracy which gives equal weight to correctly classified instances regardless of their class or we can use the score which takes the harmonic mean of precision and recall however optimisation suggests that to delay fixing for as long as possible has practical benefits such as the ability to adapt model or set of models to changing operating contexts the latter is essentially what receiver operating characteristic roc curves do for binary classification in an roc plot we plot true positive rate the proportion of correctly classified positives also denoted tpr on the against false positive rate the proportion of incorrectly classified negatives also denoted fpr on the categorical classifier evaluated on test set gives rise to single roc point while classifier which outputs scores henceforth called model can generate set of points commonly referred to as the roc curve by varying the decision threshold figure left roc curves are widely used in machine learning and their main properties are well understood these properties can be summarised as follows precision true positive rate false positive rate recall figure left roc curve with points red circles and convex hull red dotted line right corresponding curve with points red circles universal baselines the major diagonal of an roc plot depicts the line of random performance which can be achieved without training more specifically random classifier assigning the positive class with probability and the negative class with probability has expected true positive rate of and true negative rate of represented by the roc point the triangle of roc plots hence denotes better worse than random performance related baselines include the and classifier which occupy 
fixed points in roc plots the origin and the upper corner respectively these baselines are universal as they don depend on the class distribution linear interpolation any point on straight line between two points representing the performance of two classifiers or thresholds and can be achieved by making suitably biased random choice between and effectively this creates an interpolated contingency table which is linear combination of the contingency tables of and and since all three tables involve the same numbers of positives and negatives it follows that the interpolated accuracy as well as true and false positive rates are also linear combinations of the corresponding quantities pertaining to and the slope of the connecting line determines the between the classes under which any linear combination of and would yield equivalent performance in particular test set accuracy assuming uniform misclassification costs is represented by accuracy isometrics with slope where is the proportion of positives optimality point dominates another point if tpr and fpr are not worse than and at least one of them is strictly better the set of points the pareto front establishes the set of classifiers or thresholds that are optimal under some between the classes due to linearity any interpolation between points is both achievable and giving rise to the convex hull rocch which can be easily constructed both algorithmically and by visual inspection area the proportion of the unit square which falls under an roc curve auroc has meaning as ranking performance measure it estimates the probability that randomly chosen positive is ranked higher by the model than randomly chosen negative more importantly in classification context there is linear relationship between auroc tpr fpr and the expected accuracy acc πtpr fpr averaged over all possible predicted positive rates rate πtpr fpr which can be established by change of variable acc acc rate calibration slopes of convex hull segments can be interpreted as empirical likelihood ratios associated with particular interval of raw classifier scores this gives rise to calibration procedure which is also called isotonic regression or pool adjacent violators and results in calibration map which maps each segment of rocch with slope to calibrated score πr define version of accuracy as accc fpr standard accuracy is then fectly calibrated classifier outputs for every instance the value of for which the instance is on the accc decision boundary alternative solutions for each of these exist for example parametric alternatives to rocch calibration exist based on the logistic function platt scaling as do alternative ways to aggregate classification performance across different operating points the brier score however the power of roc analysis derives from the combination of the above desirable properties which helps to explain its popularity across the machine learning discipline this paper presents fundamental improvements in analysis inspired by roc analysis as follows we identify in section the problems with current practice in curves by demonstrating that they fail to satisfy each of the above properties in some respect ii we propose principled way to remedy all these problems by means of change of coordinates in section iii in particular our improved curves enclose an area that is directly related to expected score on harmonic scale in similar way as auroc is related to expected accuracy iv furthermore with curves it is possible to calibrate model for fβ in the sense that the 
predicted score for any instance determines the value of for which the instance is on the fβ decision boundary we give experimental evidence in section that this matters by demonstrating that the area under traditional curves can easily favour models with lower expected score than others proofs of the formal results are found in the supplementary material see also http traditional analysis of negative examples is common phenomenon in many subfields of machine learning and data mining including information retrieval recommender systems and social network analysis indeed most web pages are irrelevant for most queries and most links are absent from most networks classification accuracy is not sensible evaluation measure in such situations as it the classifier neither does adjusting the class imbalance through costsensitive versions of accuracy help as this will not just downplay the benefit of true negatives but also the cost of false positives good solution in this case is to ignore true negatives altogether and use precision defined as the proportion of true positives among the positive predictions as performance metric instead of false positive rate in this context the true positive rate is usually renamed to recall more formally we define precision as prec tp fp and recall as rec tp fn where tp fp and fn denote the number of true positives false positives and false negatives respectively perhaps motivated by the appeal of roc plots many researchers have begun to produce precisionrecall or pr plots with precision on the against recall on the figure right shows the pr curve corresponding to the roc curve on the left clearly there is correspondence between the two plots as both are based on the same contingency tables in particular precision associated with an roc point is proportional to the angle between the line connecting the point with the origin and the however this is where the similarity ends as pr plots have none of the aforementioned desirable properties of roc plots baselines random classifier has precision and hence baseline performance is horizontal line which depends on the class distribution the classifier is at the end of this baseline the classifier has undefined precision interpolation the main reason for this is that precision in linearly interpolated contingency table is only linear combination of the original precision values if the two classifiers have the same predicted positive rate which is impossible if the two contingency tables arise from different decision thresholds on the same model discusses this further and also gives an interpolation formula more generally it isn meaningful to take the arithmetic average of precision values pareto front the set of operating points continues to be see the red circles in figure right but in the absence of linear interpolation this set isn convex for pr curves nor is it straightforward to determine by visual inspection uninterpretable area although many authors report the area under the pr curve aupr it doesn have meaningful interpretation beyond the geometric one of expected precision when uniformly varying the recall and even then the use of the arithmetic average can not be justified furthermore pr plots have unachievable regions at the lower side the size of which depends on the class distribution no calibration although some results exist regarding the relationship between calibrated scores and score more about this below these are unrelated to the pr curve to the best of our knowledge there is no published procedure to 
output scores that are calibrated for F_β, that is, which give the value of β for which the instance is on the F_β decision boundary.

The F_β measure. The standard way to combine precision and recall into a single performance measure is through the F1 score, commonly defined as the harmonic mean of precision and recall:

F1 = 2 · prec · rec / (prec + rec) = 2TP / (2TP + FP + FN).

The last form demonstrates that the harmonic mean is natural here, as it corresponds to taking the arithmetic mean of the numbers of false positives and false negatives. Another way to understand the F1 score is as the accuracy in a modified contingency table which copies the true positive count to the true negatives:

                Actual +    Actual −
Predicted +        TP          FP
Predicted −        FN          TP

The accuracy of this modified table is (TP + TP) / (2TP + FP + FN) = F1. We can also take a weighted harmonic mean, which is commonly parametrised as follows:

F_β = (1 + β²) · TP / ((1 + β²) · TP + β² · FN + FP).

There is a range of recent papers studying the F-score, several of which appeared in last year's NIPS conference. Relevant results include the following: (i) the non-decomposability of the F_β score, meaning it is not an average over instances but a ratio of such averages; (ii) estimators exist that are consistent, i.e. they are unbiased in the limit; (iii) given a model, operating points that are optimal for F_β can be achieved by thresholding the model scores; (iv) a classifier yielding perfectly calibrated posterior probabilities has the property that the optimal threshold for F1 is half the optimal F1 at that point, a result first proved in earlier work, later rediscovered, and generalised to F_β. The latter results tell us that optimal thresholds for F_β are lower than optimal thresholds for accuracy, or equal only in the case of a perfect model; they do not, however, tell us how to find such thresholds other than by tuning. Other cited work proposes a method inspired by cost-sensitive classification. The analysis in the next section significantly extends these results by demonstrating how we can identify all F_β thresholds, for any β, in a single calibration procedure.

Precision-Recall-Gain curves. In this section we demonstrate how PR analysis can be adapted to inherit all the benefits of ROC analysis. While technically straightforward, the implications of our results are far-reaching: for example, even something as seemingly innocuous as reporting the arithmetic average of F1 values over cross-validation folds is methodologically misguided. We will define the corresponding performance measure that can safely be averaged.

Baseline. A random classifier that predicts positive with probability p has F_β score (1 + β²) · π · p / (β² · π + p). This is monotonically increasing in p, and hence reaches its maximum for the always-positive classifier.

Figure: Left: conventional PR curve with hyperbolic isometrics (dotted lines) and the baseline performance of the always-positive classifier (solid hyperbole). Right: PRG curve with the minor diagonal as baseline, parallel isometrics, and a convex Pareto front.

Hence F-analysis differs from classification accuracy in that the baseline to beat is the always-positive classifier rather than any random classifier. This baseline has prec = π and rec = 1, and it is easily seen that any model with prec < π or rec < π loses against this baseline; hence it makes sense to consider only precision and recall values in the interval [π, 1]. Any variable φ in [min, max] can be rescaled by the linear mapping (φ − min)/(max − min); however, the linear scale is inappropriate here, and we should use a harmonic scale instead, hence map φ to (1/min − 1/φ)/(1/min − 1/max) = max · (φ − min)/((max − min) · φ). Taking max = 1 and min = π we arrive at the following definition.

Definition (Precision gain and recall gain).
precG = (prec − π) / ((1 − π) · prec) = 1 − (π/(1 − π)) · (FP/TP),
recG = (rec − π) / ((1 − π) · rec) = 1 − (π/(1 − π)) · (FN/TP).

A PRG curve plots precision gain on the y-axis against recall gain on the x-axis, in the unit square; negative gains are ignored. An example PRG curve is given on the right of the figure above. The always-positive classifier has recG = 1 and precG = 0 and hence gets
plotted in the lower corner of space regardless of the class distribution since we show in the next section that isometrics have slope in this space it follows that all classifiers with baseline performance end up on the minor diagonal in space in contrast the corresponding isometric in pr space is hyperbolic figure left and its exact location depends on the class distribution linearity and optimality one of the main benefits of prg space is that it allows linear interpolation this manifests itself in two ways any point on straight line between two endpoints is achievable by random choice between the endpoints theorem and fβ isometrics are straight lines with slope theorem theorem let and be points in the space representing the performance of models and with contingency tables and then model with an interpolated contingency table has precision gain and recall gain where theorem precg recg fgβ with fgβ fβ fβ fn tp fgβ is linearised version of fβ in the same way as precg and recg are linearised versions of precision and recall fgβ measures the gain in performance on linear scale relative to classifier with both precision and recall and hence fβ equal to isometrics are indicated in figure right by increasing decreasing these lines of constant fβ become steeper flatter and hence we are putting more emphasis on recall precision with regard to optimality we already knew that every classifier or threshold optimal for fβ for some is optimal for accc for some the reverse also holds except for the roc convex hull points below the baseline the classifier due to linearity the prg pareto front is convex and easily constructed by visual inspection we will see in section that these segments of the prg convex hull can be used to obtain classifier scores specifically calibrated for thereby the need for any more threshold tuning area define the area under the curve as auprg precg recg we will show how this area can be related to an expected score when averaging over the operating points on the curve in particular way to this end we define which expresses the extent to which recall exceeds precision reweighting by and guarantees that is monotonically increasing when changing the threshold towards having more positive predictions as shown in the proof of theorem in the supplementary material hence where denotes the precision gain at the operating point where recall gain is zero the following theorem shows that if the operating points are chosen such that is uniformly distributed in this range then the expected can be calculated from the area under the curve the supplementary material proves more general result for expected fgβ this justifies the use of auprg as performance metric without fixing the classifier operating point in advance theorem let the operating points of model with area under the curve auprg be chosen such that is uniformly distributed within then the expected score is equal to the expected reciprocal score can be calculated from the relationship which follows from the definition of fgβ in the special case where the expected score is calibration figure left shows an roc curve with empirically calibrated posterior probabilities obtained by isotonic regression or the roc convex hull segments of the convex hull are labelled with the value of for which the two endpoints have the same accuracy accc conversely if point connects two segments with then that point is optimal for any such that the calibrated values are derived from the roc slope by πr for example the point on the convex hull two steps 
up from the origin optimises accuracy accc for and hence also standard accuracy we are now in position to calculate similarly calibrated scores for theorem let two classifiers be such that and then these two classifiers have the same fβ score if and only if in line with roc calibration we convert these slopes into calibrated score between and it is important to note that there is no relationship between scores and scores so we can not derive from however we can equip model with two calibration maps one for accuracy and the other for precision gain true positive rate false positive rate recall gain figure left roc curve with scores empirically calibrated for accuracy the green dots correspond to regular grid in space right curve with scores calibrated for fβ the green dots correspond to regular grid in roc space clearly indicating that roc analysis the region figure right shows the prg curve for the running example with scores calibrated for fβ score corresponds to and score corresponds to so the point closest to the breakeven line optimises fβ for and hence also but note that the next point to the right on the convex hull is nearly as good for on account of the connecting line segment having calibrated score close to practical examples the key message of this paper is that precision recall and are expressed on harmonic scale and hence any kind of arithmetic average of these quantities is methodologically wrong we now demonstrate that this matters in practice in particular we show that in some sense aupr and auprg are as different from each other as aupr and auroc using the openml platform we took all those binary classification tasks which have predictions using at least models from different learning methods these are called flows in openml in each of the obtained tasks covering different datasets we applied the following procedure first we fetched the predicted scores of randomly selected models from different flows and calculated areas under roc prg and pr curves with hyperbolic interpolation as recommended by with minority class as positives we then ranked the models with respect to these measures figure plots against across all models figure left demonstrates that aupr and auprg often disagree in ranking the models in particular they disagree on the best method in of the tasks and on the top three methods in of the tasks they agree on top second and third method in of the tasks this amount of disagreement is comparable to the disagreement between aupr and auroc and disagreement for top and top respectively and between auprg and auroc and therefore aupr auprg and auroc are related quantities but still all significantly different the same conclusion is supported by the pairwise correlations between the ranks across all tasks the correlation between and is between aupr and auroc it is and between auprg and auroc it is figure right shows auprg vs aupr in two datasets with relatively low and high rank correlations and selected as lower and upper quartiles among all tasks in both datasets aupr and auprg agree on the best model however in the dataset the second best is adaboost according to auprg and logistic regression according to aupr as seen in figure this disagreement is caused by aupr taking into account the poor performance of adaboost in the early part of the ranking auprg ignores this part as it has negative recall gain count auprg rank of auprg dataset rank of aupr aupr precision gain precision true positive rate figure left comparison of vs each cell shows how many models across 
openml tasks have these ranks among the models in the same task right comparison of auprg vs aupr in openml tasks with ids and with models in each task some models perform worse than random auprg and are not plotted the models represented by the two encircled triangles are shown in detail in figure false positive rate recall recall gain figure left roc curves for adaboost solid line and logistic regression dashed line on the dataset openml run ids and respectively middle corresponding pr curves the solid curve is on average lower with aupr whereas the dashed curve has aupr right corresponding prg curves where the situation has reversed the solid curve has auprg while the dashed curve has lower auprg of concluding remarks if practitioner using and the should take one methodological recommendation from this paper it is to use the score instead to make sure baselines are taken into account properly and averaging is done on the appropriate scale if required the fgβ score can be converted back to an fβ score at the end the second recommendation is to use curves instead of pr curves and the third to use auprg which is easier to calculate than aupr due to linear interpolation has proper interpretation as an expected score and allows performance assessment over range of operating points to assist practitioners we have made matlab and java code to calculate auprg and prg curves available at http we are also working on closer integration of auprg as an evaluation metric in openml and performance visualisation platforms such as vipercharts as future work we mention the interpretation of auprg as measure of ranking performance we are working on an interpretation which gives weights to the positives and as such is related to discounted cumulative gain second line of research involves the use of cost curves for the fgβ score and associated threshold choice methods acknowledgments this work was supported by the reframe project granted by the european coordinated research on challenges in information and communication sciences technologies eranet and funded by the engineering and physical sciences research council in the uk under grant discussions with hendrik blockeel helped to clarify the intuitions underlying this work references boyd costa davis and page unachievable region in space and its effect on empirical evaluation in international conference on machine learning page davis and goadrich the relationship between and roc curves in proceedings of the international conference on machine learning pages fawcett an introduction to roc analysis pattern recognition letters fawcett and pav and the roc convex hull machine learning july flach the geometry of roc space understanding machine learning metrics through roc isometrics in machine learning proceedings of the twentieth international conference icml pages flach roc analysis in sammut and webb editors encyclopedia of machine learning pages springer us hand and till simple generalisation of the area under the roc curve for multiple class classification problems machine learning flach and ferri unified view of performance metrics translating threshold choice into expected classification loss journal of machine learning research koyejo natarajan ravikumar and dhillon consistent binary classification with generalized performance metrics in advances in neural information processing systems pages lipton elkan and naryanaswamy optimal thresholding of classifiers to maximize measure in machine learning and knowledge discovery in databases volume of lecture 
notes in computer science pages springer berlin heidelberg narasimhan vaish and agarwal on the statistical consistency of classifiers for performance measures in advances in neural information processing systems pages parambath usunier and grandvalet optimizing by classification in advances in neural information processing systems pages platt probabilistic outputs for support vector machines and comparisons to regularized likelihood methods in advances in large margin classifiers pages mit press boston provost and fawcett robust classification for imprecise environments machine learning sluban and vipercharts visual performance evaluation platform in blockeel kersting nijssen and editors machine learning and knowledge discovery in databases volume of lecture notes in computer science pages springer berlin heidelberg van rijsbergen information retrieval newton ma usa edition vanschoren van rijn bischl and torgo openml networked science in machine learning sigkdd explorations ye chai lee and chieu optimizing tale of two approaches in proceedings of the international conference on machine learning pages zadrozny and elkan obtaining calibrated probability estimates from decision trees and naive bayesian classifiers in proceedings of the eighteenth international conference on machine learning icml pages zhao edakunni pocock and brown beyond fano inequality bounds on the optimal ber and risk and their implications the journal of machine learning research 
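As a supplement to the Precision-Recall-Gain analysis above, here is a minimal sketch of the quantities it defines: precision gain, recall gain, FG_β (the weighted-arithmetic-mean form implied by the harmonic rescaling), and a simple trapezoidal area under the PRG curve with negative gains ignored. The function names and the trapezoidal integration are illustrative choices, not the authors' reference implementation (which is linked in the paper).

```python
# Minimal sketch: precision gain, recall gain, FG_beta and AUPRG.
import numpy as np

def gain(value, pi):
    # Harmonic rescaling of a quantity in [pi, 1] onto [0, 1].
    return (value - pi) / ((1.0 - pi) * value)

def prg_curve(scores, labels):
    """Precision gain and recall gain at every threshold on the scores."""
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    pi = labels.mean()
    tp = np.cumsum(labels)                  # true positives at each cut-off
    fp = np.cumsum(1 - labels)              # false positives at each cut-off
    prec, rec = tp / (tp + fp), tp / labels.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        precg, recg = gain(prec, pi), gain(rec, pi)
    keep = (recg >= 0) & (precg >= 0)       # negative gains are ignored
    return recg[keep], precg[keep]

def auprg(scores, labels):
    recg, precg = prg_curve(scores, labels)
    order = np.argsort(recg)
    return np.trapz(np.clip(precg[order], 0.0, 1.0), recg[order])

def fg_beta(prec, rec, pi, beta=1.0):
    # Gain version of F_beta: a weighted arithmetic mean of the two gains,
    # mirroring the weighting of precision and recall in F_beta.
    return (gain(prec, pi) + beta**2 * gain(rec, pi)) / (1.0 + beta**2)

# Toy usage with synthetic scores.
rng = np.random.default_rng(3)
labels = rng.integers(0, 2, size=500)
scores = labels * rng.normal(1.0, 1.0, 500) + (1 - labels) * rng.normal(0.0, 1.0, 500)
print("AUPRG:", auprg(scores, labels))
print("FG_1 at prec=0.8, rec=0.6:", fg_beta(0.8, 0.6, labels.mean()))
```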
generalization of submodular cover via the diminishing return property on the integer lattice tasuku soma the university of tokyo tasuku soma yuichi yoshida national institute of informatics and preferred infrastructure yyoshida abstract we consider generalization of the submodular cover problem based on the concept of diminishing return property on the integer lattice we are motivated by real scenarios in machine learning that can not be captured by traditional submodular set functions we show that the generalized submodular cover problem can be applied to various problems and devise bicriteria approximation algorithm our algorithm is guaranteed to output approximate solution that satisfies the constraints with the desired accuracy the running time of our algorithm is roughly log nr log where is the size of the ground set and is the maximum value of coordinate the dependency on is exponentially better than the naive reduction algorithms several experiments on real and artificial datasets demonstrate that the solution quality of our algorithm is comparable to naive algorithms while the running time is several orders of magnitude faster introduction function is called submodular if for all where is finite ground set an equivalent and more intuitive definition is by the diminishing return property for all and in the last decade the optimization of submodular function has attracted particular interest in the machine learning community one reason of this is that many models naturally admit the diminishing return property for example document summarization influence maximization in viral marketing and sensor placement can be described with the concept of submodularity and efficient algorithms have been devised by exploiting submodularity for further details refer to variety of proposed models in machine learning boil down to the submodular cover problem for given monotone and nonnegative submodular functions and we are to minimize subject to intuitively and represent the cost and the quality of solution respectively the objective of this problem is to find of minimum cost with the worst quality guarantee although this problem is since it generalizes the set cover problem simple greedy algorithm achieves tight approximation and it practically performs very well the aforementioned submodular models are based on the submodularity of set function function defined on however we often encounter problems that can not be captured by set function let us give two examples sensor placement let us consider the following sensor placement scenario suppose that we have several types of sensors with various energy levels we assume simple between information gain and cost sensors of high energy level can collect considerable amount of information but we have to pay high cost for placing them sensors of low energy level can be placed at low cost but they can only gather limited information in this scenario we want to decide which type of sensor should be placed at each spot rather than just deciding whether to place sensor or not such scenario is beyond the existing models based on submodular set functions optimal budget allocation similar situation also arises in the optimal budget allocation problem in this problem we want to allocate budget among ad sources so that at least certain number of customers is influenced while minimizing the total budget again we have to decide how much budget should be set aside for each ad source and hence set functions can not capture the problem we note that function can be 
seen as function defined on boolean hypercube then the above real scenarios prompt us to generalize the submodularity and the diminishing return property to functions defined on the integer lattice the most natural generalization of the diminishing return property to function is the following inequality χs χs for and where χs is the unit vector if satisfies then also satisfies the following lattice submodular inequality for all where and are the max and min operations respectively while the submodularity and the diminishing return property are equivalent for set functions this is not the case for functions over the integer lattice the diminishing return property is stronger than the lattice submodular inequality we say that is lattice submodular if satisfies and if further satisfies we say that is diminishing return submodular for short one might feel that the is too restrictive however considering the fact that the diminishing return is more crucial in applications we may regard the as the most natural generalization of the submodularity at least for applications mentioned so far for example under natural condition the objective function in the optimal budget allocation satisfies the was also considered in the context of submodular welfare in this paper we consider the following generalization of the submodular cover problem for set functions given monotone function subadditive function and we are to minimize subject to where we say that is subadditive if for all we call problem the cover problem this problem encompasses problems that boil down to the submodular cover problem for set functions and their generalizations to the integer lattice furthermore the cost function is generalized to subadditive function in particular we note that two examples given above can be rephrased using this problem see section for details if is also monotone one can reduce the problem to the set version for technical details see section the problem of this naive reduction is that it only yields time algorithm the running time depends on rather than log since can be huge in many practical settings the maximum energy level of sensor even linear dependence on could make an algorithm impractical furthermore for general subadditive function this naive reduction does not work our contribution for the problem we devise bicriteria approximation algorithm based on the decreasing threshold technique of more precisely our algorithm takes the additional parameters the output of our algorithm is guaranteed to satisfy that is at most log βd times the optimum and where is the curvature of see section for the definition maxs χs is the maximum value of over all standard unit vectors and is the minimum value of the positive increments of in the feasible region running time dependency on an important feature of our algorithm is that the running time depends on the bit length of only polynomially whereas the naive reduction algorithms depend on it exponentially as mentioned above more precisely the running time of our algorithm is max log nrc δcmin log which is polynomial in the input size whereas the naive algorithm is only time algorithm in fact our experiments using real and synthetic datasets show that our algorithm is considerably faster than naive algorithms furthermore in terms of the objective value that is the cost of the output our algorithm also exhibits comparable performance approximation guarantee our approximation guarantee on the cost is almost tight note that the dr submodular cover problem includes the set 
cover problem in which we are given collection of sets and we want to find minimum number of sets that covers all the elements in our context corresponds to the collection of sets the cost is the number of chosen sets and is the number of covered elements it is known that we can not obtain an log unless np where is the number of elements however since for the set cover problem we have and our approximation guarantee is log related work our result can be compared with several results in the literature for the submodular cover problem for set functions it is shown by wolsey that if simple greedy algorithm yields log βd which coincides with our approximation ratio except for the factor note that when or more generally when is modular recently wan et al discussed slightly different setting in which is also submodular and both and are integer valued they proved that the greedy algorithm achieves ρh where is the harmonic number again their ratio asymptotically coincides with our approximation ratio note that when is integer valued another common model in machine learning is in the form of the submodular maximization problem given monotone submodular set function and feasible set matroid polytope or knapsack polytope we want to maximize subject to such models can be widely found in various tasks as already described we note that the submodular cover problem and the submodular maximization problem are somewhat dual to each other indeed iyer and bilmes showed that bicriteria algorithm of one of these problems yields bicriteria algorithm for the other being parallel to our setting generalizing the submodular maximization problem to the integer lattice is natural question in this direction soma et al considered the maximization of lattice submodular functions not necessarily being and devised approximation time algorithm we note that our result is not implied by via the duality of in fact such reduction only yields time algorithm organization of this paper the rest of this paper is organized as follows section sets the mathematical basics of submodular functions over the integer lattice section describes our algorithm and the statement of our main theorem in section we show various experimental results using real and artificial datasets section sketches the proof of the main theorem finally we conclude the paper in section preliminaries let be finite set for each we denote the unit vector by χs that is χs if otherwise χs function zs is said to be lattice submodular if for all zs function is monotone if for all zs with for zs and function zs we denote function is diminishing return submodular or if χs χs for each zs and for function one can immediately check that kχs kχs for arbitrary and function is subadditive if for zs for each we define to be the multiset in which each is contained times in lattice submodular function zs is said to have the diminishing return property if is concave χs χs for each zs and we note that our definition is consistent with formally we have the following lemma whose proof can be found in appendix lemma function zs is if and only if is lattice submodular and concave the following is fundamental for monotone function proof is placed in appendix due to the limitation of space lemma for monotone function χs for arbitrary zs algorithm for the cover recall the cover problem let be monotone function and let be subadditive cost function the objective is to minimize subject to and where and are the given constants without loss of generality we can assume that max otherwise we can 
consider fb min instead of furthermore we can assume for any pseudocode description of our algorithm is presented in algorithm the algorithm can be viewed as modified version of the greedy algorithm and works as follows we start with the initial solution and increase each coordinate of gradually to determine the amount of increments the algorithm maintains threshold that is initialized to be sufficiently large enough for each the algorithm finds the largest integer step size such that the marginal kχs ratio kc is above the threshold if such exists the algorithm updates to kχs after repeating this for each the algorithm decreases the threshold by factor of if becomes feasible the algorithm returns the current even if does not become feasible the final satisfies if we iterate until gets sufficiently small algorithm decreasing threshold for the cover problem input output such that max χs cmin min χs cmax max χs for cmin ncmax do for all do find maximum integer such that if such exists then kχs if then break the outer for loop return kχs kc χs with binary search before we claim the theorem we need to define several parameters on and let min χs χs and maxs χs let cmax maxs χs and cmin mins χs define the curvature of to be χs min optimal solution definition for and vector is approximate solution if and our main theorem is described below we sketch the proof in section theorem algorithm outputs log βd approximate solution max in log nrc δcmin log time discussion case let us make simple remark on the case that is integer valued without loss of generality we can assume then algorithm always returns feasible solution for any therefore our algorithm can be easily modified to an approximation algorithm if is integer valued definition of curvature several authors use different notion of curvature called the total curvature whose natural extension for function over the integer lattice is as follows the total curvature of is defined as χsc note that if is modular while if is modular for example iyer and bilmes devised bicriteria approximation algorithm whose approximation guarantee is roughly log βd let us investigate the relation between and for functions one can show that see lemma in appendix which means that our bound in terms of is tighter than one in terms of comparison to naive reduction algorithm if is also monotone function one can reduce to the set version as follows for each create copies of and let be the set of these copies for define be the integral vector such that is the number of copies of contained in then is submodular similarly is also submodular if is function therefore we may apply standard greedy algorithm of to the reduced problem and this is exactly what greedy does in our experiment see section however this straightforward reduction only yields pseudopolynomial time algorithm since nr even if the original algorithm was linear the resulting algorithm would require nr time indeed this difference is not negligible since can be quite large in practical applications as illustrated by our experimental evaluation lazy evaluation we finally note that we can combine the lazy evaluation technique which significantly reduces runtime in practice with our algorithm specifically we first push all χs the elements in to priority queue here the key of an element is fc then the inner loop of algorithm is modified as follows instead of checking all the elements in we pop elements whose keys are at least for each popped element we find such that kχs with kc with binary search if there is such we 
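The decreasing-threshold procedure described above can be sketched in Python as follows. This is a simplified sketch, not the paper's exact pseudocode: the cost is assumed modular (c(x) = sum_s c[s] x_s), f is assumed monotone DR-submodular with f(0) = 0, and the lower cutoff for the threshold theta is an assumed constant because the extracted text does not preserve the exact one. The binary search for the step size is valid because the per-unit marginal gain (f(x + k chi_s) - f(x)) / k is nonincreasing in k when f is DR-submodular.

def decreasing_threshold_cover(f, c, r, alpha, delta, eps, ground):
    # f: monotone DR-submodular function of a dict x with coordinates in 0..r
    # c: dict of per-unit costs (the cost is assumed modular here for simplicity)
    x = {s: 0 for s in ground}
    d = max(f({**x, s: 1}) - f(x) for s in ground)       # largest single-step marginal gain
    theta = d
    theta_min = eps * delta * d / (len(ground) * r)      # assumed cutoff; the paper's exact
                                                         # constant is lost in extraction
    while theta >= theta_min:
        for s in ground:
            k = _max_step(f, c, x, s, r - x[s], theta)   # binary search for the step size
            if k > 0:
                x[s] += k
            if f(x) >= alpha:                            # feasible: stop early
                return x
        theta *= (1.0 - eps)
    return x                                             # approximately feasible: f(x) >= (1 - delta) * alpha

def _max_step(f, c, x, s, kmax, theta):
    # Largest k <= kmax with (f(x + k*chi_s) - f(x)) / (k * c[s]) >= theta, or 0 if none.
    # The per-unit gain is nonincreasing in k for DR-submodular f, so binary search is valid.
    fx = f(x)
    ok = lambda k: (f({**x, s: x[s] + k}) - fx) / (k * c[s]) >= theta
    if kmax < 1 or not ok(1):
        return 0
    lo, hi = 1, kmax
    while lo < hi:
        mid = (lo + hi + 1) // 2
        lo, hi = (mid, hi) if ok(mid) else (lo, mid - 1)
    return lo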
update with kχs finally we push again with the key χs χs if the correctness of this technique is obvious because of the of in particular χs where is the current vector the key of each element in the queue is always at least hence we never miss with kχs kc χs experiments experimental setting we conducted experiments on linux server with an intel xeon ghz processor and gb of main memory the experiments required at most gb of memory all the algorithms were implemented in and compiled with in our experiments the cost function is always chosen as let be submodular function and be the worst quality guarantee we implemented the following four methods is our method with the lazy evaluation technique we chose as stated otherwise greedy is method in which starting from we iteratively increment for that maximizes χs until we get we also implemented the lazy evaluation technique degree is method in which we assign value proportional to the marginal χs where is determined by binary search so that precisely speaking is approximately proportional to the marginal since must be an integer uniform is method that returns for minimum such that we use the following and synthetic datasets to confirm the accuracy and efficiency of our method against other methods we set for both problems sensor placement we used dataset acquired by running simulations on sensor network used in battle of the water sensor networks bwsn we used the program to simulate random injection events to this network for duration of hours let and be the set of the sensors in the network and the set of the events respectively for each sensor and event value is provided which denotes the time in minutes the pollution has reached after the injection we define function as follows let be vector where we regard as the energy level of the sensor suppose that when the pollution reaches sensor the probability that we can detect it is where in other words by spending unit energy we obtain an extra chance of detecting the pollution with probability for each event let se be the first sensor where the pollution is detected in that injection event note that se is random variable let max then we define as follows se se where se is defined as when there is no sensor that managed to detect the pollution intuitively speaking se expresses how much time we managed to save in the event se on average then we take the average over all the events similar function was also used in to measure the performance of sensor allocation although they only considered the case this corresponds to the case that by spending unit energy at sensor we can always detect the pollution that has reached we note that is see lemma for the proof budget allocation problem in order to observe the behavior of our algorithm for instances we created synthetic instance of the budget allocation problem as follows the instance can be represented as bipartite graph where is set of vertices and is set of vertices we regard vertex in as an ad source and vertex in as person then we fix the degrees of vertices in so that their distribution obeys the power law of that is the fraction of ad sources with is proportional to for vertex of the supposed degree we choose vertices in uniformly at random and connect them to with edges we define function as where is the set of vertices connected to and here we suppose that by investing unit cost to an ad source we have an extra chance of influencing person with with probability then can be seen as the expected number of people influenced by ad sources we note 
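The budget allocation objective used in the synthetic experiment can be written down directly. The sketch below assumes an influence probability p = 0.02 and a small hand-made bipartite graph purely for illustration; the actual probability and the power-law construction of the graph are those described above.

def budget_allocation_value(x, neighbors, p=0.02):
    # Expected number of influenced people: investing x[s] units at ad
    # source s gives each neighbouring person an independent chance
    # 1 - (1 - p)^{x[s]} of being influenced through that source.
    total = 0.0
    for person, sources in neighbors.items():
        not_influenced = 1.0
        for s in sources:
            not_influenced *= (1.0 - p) ** x[s]
        total += 1.0 - not_influenced
    return total

# tiny example: two ad sources, three people
neighbors = {"t1": ["s1"], "t2": ["s1", "s2"], "t3": ["s2"]}
print(budget_allocation_value({"s1": 3, "s2": 1}, neighbors))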
that is known to be monotone function experimental results figure illustrates the obtained objective value for various choices of the worst quality guarantee on each dataset we chose in decreasing threshold we can observe that decreasing threshold attains almost the same objective value as greedy and it outperforms degree and uniform figure illustrates the runtime for various choices of the worst quality guarantee on each dataset we chose in decreasing threshold we can observe that the runtime growth of decreasing threshold is significantly slower than that of greedy although three other values are provided they showed similar empirical results and we omit them sensor placement bwsn greedy time greedy decreasing threshold degree uniform relative cost increase greedy decreasing threshold degree uniform sensor placement bwsn objective value relative increase of the objective value time objective value uniform decreasing threshold degree greedy time uniform decreasing threshold degree greedy budget allocation synthetic budget allocation synthetic runtime figure objective values figure runtime figure effect of figures and show the relative increase of the objective value and the runtime respectively of our method against greedy on the bwsn dataset we can observe that the relative increase of the objective value gets smaller as increases this phenomenon can be well explained by considering the extreme case that max in this case we need to choose anyway in order to achieve the worst quality guarantee and the order of increasing coordinates of does not matter also we can see that the empirical runtime grows as function of which matches our theoretical bound proof of theorem in this section we outline the proof of the main theorem proofs of some minor claims can be found in appendix first we introduce notation let us assume that is updated times in the algorithm let xi be the variable after the update note that and xl is the final output of the algorithm let si and ki be the pair used in the update for that is χsi for let xi ki χsi for let and µi kiiχs and for where θi is the threshold value on the update note that for let be an optimal solution such that χs we regard that in the update the elements of are charged by the value of µi χs χs xi then the total charge on is defined as µi χs χs xi claim let us fix arbitrary and let be the threshold value on the update then χs ki χsi and ki χsi χs eliminating from the inequalities in claim we obtain ki χsi χs ki χsi χs furthermore we have µi claim µi for claim for each the total charge on is at most log χs proof let us fix and let be the minimum such that χs xi by we have ki χsi χs µi ki χsi χs then we have µi χs χs xi µi χs χs xi µl χs χs χs xi χs χs χs χs χs xi χs χs χs χs since log for log χs xi χs χs log log χs χs proof of theorem combining these claims we have log log ρc thus is an approximate solution with the desired ratio let us see that approximately satisfies the constraint that is we will now consider slightly modified version of the algorithm in the modified algorithm the threshold is updated until let be the output of the modified algorithm then we have δc χs χs δd δα cmax nr the third inequality holds since χs cmax and nr thus conclusions in this paper motivated by real scenarios in machine learning we generalized the submodular cover problem via the diminishing return property over the integer lattice we proposed bicriteria approximation algorithm with the following properties the approximation ratio to the cost almost matches the one 
guaranteed by the greedy algorithm and is almost tight in general ii we can satisfy the worst solution quality with the desired accuracy iii the running time of our algorithm is roughly log log the dependency on is exponentially better than that of the greedy algorithm we confirmed by experiment that compared with the greedy algorithm the solution quality of our algorithm is almost the same and the runtime is several orders of magnitude faster acknowledgments the first author is supported by jsps for jsps fellows the second author is supported by jsps for young scientists no mext for scientific research on innovative areas and jst erato kawarabayashi large graph project the authors thank satoru iwata and yuji nakatsukasa for reading draft of this paper references http alon gamzu and tennenholtz optimizing budget allocation among channels and influencers in proc of www pages badanidiyuru and fast algorithms for maximizing submodular functions in proc of soda pages chen shioi montesinos koh wich and krause active detection via adaptive submodularity in proc of icml pages iyer and bilmes submodular optimization with submodular cover and submodular knapsack constraints in proc of nips pages kapralov post and vondrak online submodular welfare maximization greedy is optimal in proc of soda pages kempe kleinberg and tardos maximizing the spread of influence through social network in proc of kdd pages krause and golovin submodular function maximization in tractability practical approaches to hard problems pages cambridge university press krause and leskovec efficient sensor placement optimization for securing large water distribution networks journal of water resources planning and management krause singh and guestrin sensor placements in gaussian processes theory efficient algorithms and empirical studies the journal of machine learning research leskovec krause guestrin faloutsos vanbriesen and glance outbreak detection in networks in proc of kdd pages lin and bilmes summarization via budgeted maximization of submodular functions in proceedings of the annual conference of the north american chapter of the association for computational linguistics pages lin and bilmes class of submodular functions for document summarization in proc of naacl pages minoux accelerated greedy algorithms for maximizing submodular set functions optimization techniques lecture notes in control and information sciences ostfeld uber salomons berry hart phillips watson dorini jonkergouw kapelan di pierro khu savic eliades polycarpou ghimire barkdoll gueli huang mcbean james krause leskovec isovitsch xu guestrin vanbriesen small fischbeck preis propato piller trachtman wu and walski the battle of the water sensor networks bwsn design challenge for engineers and algorithms journal of water resources planning and management raz and safra test and pcp characterization of np in proc of stoc pages soma kakimura inaba and kawarabayashi optimal budget allocation theoretical guarantee and efficient algorithm in proc of icml song girshick jegelka mairal harchaoui and darrell on learning to localize objects with minimal supervision in proc of icml sviridenko and ward optimal approximation for submodular and supermodular optimization with bounded curvature in proc of soda pages wan du pardalos and wu greedy approximations for minimum submodular cover with submodular cost computational optimization and applications wolsey an analysis of the greedy algorithm for the submodular set covering problem combinatorica 
bidirectional recurrent neural networks as generative models mathias berglund aalto university finland leo nokia labs finland tapani raiko aalto university finland akos vetek nokia labs finland mikko honkala nokia labs finland juha karhunen aalto university finland abstract bidirectional recurrent neural networks rnn are trained to predict both in the positive and negative time directions simultaneously they have not been used commonly in unsupervised tasks because probabilistic interpretation of the model has been difficult recently two different frameworks gsn and nade provide connection between reconstruction and probabilistic modeling which makes the interpretation possible as far as we know neither gsn or nade have been studied in the context of time series before as an example of an unsupervised task we study the problem of filling in gaps in time series with complex dynamics although unidirectional rnns have recently been trained successfully to model such time series inference in the negative time direction is we propose two probabilistic interpretations of bidirectional rnns that can be used to reconstruct missing gaps efficiently our experiments on text data show that both proposed methods are much more accurate than unidirectional reconstructions although bit less accurate than computationally complex bidirectional bayesian inference on the unidirectional rnn we also provide results on music data for which the bayesian inference is computationally infeasible demonstrating the scalability of the proposed methods introduction recurrent neural networks rnn have recently been trained successfully for time series modeling and have been used to achieve results in supervised tasks including handwriting recognition and speech recognition rnns have also been used successfully in unsupervised learning of time series recently rnns have also been used to generate sequential data in machine translation context which further emphasizes the unsupervised setting bahdanau et al used bidirectional rnn to encode phrase into vector but settled for unidirectional rnn to decode it into translated phrase perhaps because bidirectional rnns have not been studied much as generative models even more recently maas et al used deep bidirectional rnn in speech recognition generating text as output missing value reconstruction is interesting in at least three different senses firstly it can be used to cope with data that really has missing values secondly reconstruction performance of artificially missing values can be used as measure of performance in unsupervised learning thirdly reconstruction of artificially missing values can be used as training criterion while traditional rnn training criterions correspond to prediction training to reconstruct longer gaps can push the model towards concentrating on predictions note that the figure structure of the simple rnn left and the bidirectional rnn right prediction criterion is typically used even in approaches that otherwise concentrate on modelling dependencies see when using unidirectional rnns as generative models it is straightforward to draw samples from the model in sequential order however inference is not trivial in smoothing tasks where we want to evaluate probabilities for missing values in the middle of time series for discrete data inference with gap sizes of one is feasible however inference with larger gap sizes becomes exponentially more expensive even sampling can be exponentially expensive with respect to the gap size one strategy used for 
training models that are used for filling in gaps is to explicitly train the model with missing data see however such criterion has not to our knowledge yet been used and thoroughly evaluated compared with other inference strategies for rnns in this paper we compare different methods of using rnns to infer missing values for binary time series data we evaluate the performance of two generative models that rely on bidirectional rnns and compare them to inference using unidirectional rnn the proposed methods are very favourable in terms of scalability recurrent neural networks recurrent neural networks can be seen as extensions of the standard feedforward multilayer perceptron networks where the inputs and outputs are sequences instead of individual observations let us denote the input to recurrent neural network by xt where xt rn is an input vector for each time step let us further denote the output as yt where yt rm is an output vector for each time step our goal is to model the distribution although rnns map input sequences to output sequences we can use them in an unsupervised manner by letting the rnn predict the next input we can do so by setting yt unidirectional recurrent neural networks the structure of basic rnn with one hidden layer is illustrated in figure where the output yt is determined by yt xd wy ht by wx xt bh where ht tanh wh ht and wy wh and wx are the weight matrices connecting the hidden to output layer hidden to hidden layer and input to hidden layer respectively by and bh are the output and hidden layer bias vectors respectively typical options for the final nonlinearity are the softmax function for classification or categorical prediction tasks or independent bernoulli variables with sigmoid functions for other binary prediction tasks in this form the rnn therefore evaluates the output yt based on information propagated through the hidden layer that directly or indirectly depends on the observations xd xt bidirectional recurrent neural networks bidirectional rnns brnn extend the unidirectional rnn by introducing second hidden layer where the hidden to hidden connections flow in opposite temporal order the model is therefore able to exploit information both from the past and the future the output yt is traditionally determined by yt xd wyf hft wyb hbt by but we propose the use of yt xd wyf hft wyb by where hft tanh whf hft wxf xt bfh hbt tanh whb wxb xt bbh the structure of the brnn is illustrated in figure right compared with the regular rnn the forward and backward directions have separate weights and hidden activations and are denoted by the superscript and for forward and backward respectively note that the connections are acyclic note also that in the proposed formulation yt does not get information from xt we can therefore use the model in an unsupervised manner to predict one time step given all other time steps in the input sequence simply by setting probabilistic interpretation for unsupervised modelling probabilistic unsupervised modeling for sequences using unidirectional rnn is straightforward as the joint distribution for the whole sequence is simply the product of the individual predictions punidirectional xt xd for the brnn the situation is more complicated the network gives predictions for individual outputs given all the others and the joint distribution can not be written as their product we propose two solutions for this denoted by gsn and nade gsn generative stochastic networks gsn use denoising to estimate the data distribution as the asymptotic 
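A NumPy sketch of the forward pass implied by the proposed formulation above: the output at step t is computed from the forward hidden state at t-1 and the backward hidden state at t+1, so the prediction for x_t never conditions on x_t itself. All weight names and shapes are illustrative; the final nonlinearity is shown as independent sigmoids (the text data would use a softmax over characters instead).

import numpy as np

def brnn_predict(X, Wf, Wxf, bf, Wb, Wxb, bb, Wyf, Wyb, by):
    # X has shape (T, d); hidden size H is given by bf.size.
    T, _ = X.shape
    H = bf.size
    hf = np.zeros((T + 1, H))        # hf[t]: forward state after seeing x_1 .. x_t
    hb = np.zeros((T + 2, H))        # hb[t]: backward state after seeing x_T .. x_t
    for t in range(1, T + 1):
        hf[t] = np.tanh(Wf @ hf[t - 1] + Wxf @ X[t - 1] + bf)
    for t in range(T, 0, -1):
        hb[t] = np.tanh(Wb @ hb[t + 1] + Wxb @ X[t - 1] + bb)
    # output at step t uses hf[t-1] and hb[t+1] only, i.e. all inputs except x_t
    logits = np.stack([Wyf @ hf[t - 1] + Wyb @ hb[t + 1] + by for t in range(1, T + 1)])
    return 1.0 / (1.0 + np.exp(-logits))   # independent Bernoulli outputs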
distribution of the markov chain that alternates between corruption and denoising the resulting distribution is thus defined only implicitly and can not be written analytically we can define corruption function that masks xt as missing and denoising function that reconstructs it from the others it turns out that one feedforward pass of the brnn does exactly that our first probabilistic interpretation is thus that the joint distribution defined by brnn is the asymptotic distribution of process that replaces one observation vector xt at time in by sampling it from pbrnn xt xd in practice we will start from random initialization and use gibbs sampling nade the neural autoregressive distribution estimator nade defines probabilistic model by reconstructing missing components of vector one at time in random order starting from fully unobserved vector each reconstruction is given by an network that takes as input the observations so far and an auxiliary mask vector that indicates which values are missing we extend the same idea for time series firstly we concatenate an auxiliary binary element to input vectors to indicate missing input the joint distribution of the time series is defined by first drawing random permutation od of time indices and then setting data points observed one by one in that order starting from fully missing sequence pnade od xod xoe in practice the brnn will be trained with some inputs marked as missing while all the outputs are observed see section for more training details filling in gaps with recurrent neural networks the task we aim to solve is to fill in gaps of multiple consecutive data points in binary time series data the inference is not trivial for two reasons firstly we reconstruct multiple consecutive data points which are likely to depend on each other and secondly we fill in data in the middle of time series and hence need to consider the data both before and after the gap for filling in gaps with the gsn approach we first train bidirectional rnn to estimate pbrnn xt xd in order to achieve that we use the structure presented in section at test time the gap is first initialized to random values after which the missing values are sampled from the distribution pbrnn xt xd one by one in random order repeatedly to approximate the stationary distribution for the rnn structures used in this paper the computational complexity of this approach at test time is dc gm where is the dimensionality of data point is the number of hidden units in the rnn is the number of time steps in the data is the length of the gap and is the number of markov chain monte carlo mcmc steps used for inference for filling in gaps with the nade approach we first train bidirectional rnn where some of the inputs are set to separate missing value token at test time all data points in the gap are first initialized with this token after which each missing data point is reconstructed once until the whole gap is filled computationally the main difference to gsn is that we do not have to sample each reconstructed data point multiple times but the reconstruction is done in as many steps as there are missing data points in the gap for the rnn structures used in this paper the computational complexity of this approach at test time is dc where is the dimensionality of data point is the number of hidden units in the rnn is the length of the gap and is the number of time steps in the data in addition to the two proposed methods one can use unidirectional rnn to solve the same task we call this method 
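The GSN-style inference for a gap can be summarised by the following sketch; sample_conditional(x, t) is a stand-in (not a function from the paper) for one BRNN forward pass followed by sampling from the output distribution at position t.

import numpy as np

def fill_gap_gsn(x, gap, sample_conditional, n_steps=100, rng=None):
    # x: array of shape (T, d) of binary observation vectors; gap: list of missing indices.
    # Initialise the gap randomly, then repeatedly resample one missing position at a
    # time, in random order, from p_BRNN(x_t | x_{\t}), approximating the stationary
    # distribution of the resulting Markov chain.
    rng = rng or np.random.default_rng()
    x = x.copy()
    for t in gap:
        x[t] = rng.integers(0, 2, size=x[t].shape)   # random initialisation of the gap
    for _ in range(n_steps):
        for t in rng.permutation(gap):               # one Gibbs sweep over the gap
            x[t] = sample_conditional(x, t)
    return x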
bayesian mcmc using unidirectional rnn for the task of filling in gaps is not trivial as we need to take into account the probabilities of the values after the gap which the model does not explicitly do we therefore resort to similar approach as the gsn approach where we replace the pbrnn xt xd with unidirectional equivalent for the gibbs sampling as the unidirectional rnn models conditional probabilities of the form prnn xt xd we can use bayes theorem to derive prnn xt xd prnn xt xd prnn xd prnn xe xt xt xd where prnn xd is directly the output of the unidirectional rnn given an input sequence where one time step the one we gibbs sample is replaced by proposal the problem is that we have to go through all possible proposals separately to evaluate the probability xt xd we therefore have to evaluate the product of the outputs of the unidirectional rnn for time steps for each possible in some cases this is feasible to evaluate for categorical data text there are as many possible values for as there are however for other binary data the number of possibilities grows exponentially and is clearly not feasible to evaluate for the rnn structures used in this paper the computational complexity of this approach at test time is dc at where is the number of different values data point can have is the dimensionality of data point is the number of hidden units in the rnn is the number of time steps in the data and is the number of mcmc steps used for inference the critical difference in complexity to the gsn approach is the coefficient that for categorical data takes the value for binary vectors and for continuous data is infinite as simple baseline model we also evaluate the of the gaps the model assumes constant categorical distribution for the categorical task or for text the number of dimensions is the number of characters in the model alphabet vector of factorial binomial probabilities for the structured prediction task pone gram yt by this can be done in dg we also compare to inference where the data points in the gap are reconstructed in order without taking the future context into account using equations and directly the computational complexity is dc experiments we run two sets of experiments one for categorical prediction task and one for binary structured prediction task in the categorical prediction task we fill in gaps of five characters in wikipedia text while in the structural prediction task we fill in gaps of five time steps in different polyphonic music data sets training details for categorical prediction task for the categorical prediction task we test the performance of the two proposed methods gsn and nade in addition we compare the performance to mcmc using bayesian inference and inference with unidirectional rnn we therefore have to train three different rnns one for each method each rnn is trained as predictor network where the character at each step is predicted based on all the previous characters in the case of the rnn or all the previous and following characters in the case of the brnns we use the same data set as sutskever et al which consists of of english text from wikipedia for training we follow similar strategy as hermans and schrauwen the characters are encoded as binary vectors with dimensionality of characters and the output is modelled with softmax distribution we train the unirectional rnn with string lengths of characters where the error is propagated only from the last outputs in the brnn we use string length of characters where the error is propagated from the 
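For categorical data, the Bayes-rule conditional used by this MCMC baseline can be computed by enumerating all candidate values at position t and scoring the full sequence with the unidirectional RNN, as in the sketch below; log_prob_seq is a stand-in for the sum over d of log p_RNN(x_d | x_1, ..., x_{d-1}). The factors for d < t do not depend on the candidate and cancel after normalisation, and the enumeration over candidate values is what makes this approach infeasible for high-dimensional binary vectors.

import numpy as np

def categorical_conditional(log_prob_seq, x, t, n_values):
    # Posterior over x_t given all other positions: for each candidate value a,
    # substitute it at position t and score the whole sequence; normalise in log space.
    log_scores = np.empty(n_values)
    for a in range(n_values):
        x_prop = list(x)
        x_prop[t] = a
        log_scores[a] = log_prob_seq(x_prop)
    log_scores -= log_scores.max()
    probs = np.exp(log_scores)
    return probs / probs.sum()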
middle outputs we therefore avoid propagating the gradient from predictions that lack long temporal context for the brnn used in the nade method we add one dimension to the input which corresponds to missing value token during training in each minibatch we mark consecutive characters every time steps as gap during training the error is propagated only from these gaps for each gap we uniformly draw value from to and set that many characters in the gap to the missing value token the model is therefore trained to predict the output in different stages of inference where number of the inputs are still marked as missing for comparison we also train similar network but without masking in that variant the error is therefore propagated from all time steps we refer to nade masked and nade no mask respectively for these two training methods for all the models the weight elements are drawn from the uniform distribution wi where for the input to hidden layer and following glorot and bengio where din dout for the and the output layers the biases are initialized to zero we use hidden units in the unidirectional rnn and hidden units in the two hidden layers in the brnns the number of parameters in the two model types is therefore roughly the same in the recurrent layers we set the recurrent activation connected to the first time step to zero the networks are trained using stochastic gradient descent and the gradient is calculated using backpropagation through time we use minibatch size of each minibatch consists of randomly sampled sequences of length as the gradients tend to occasionally blow up when training rnns we normalize the gradients at each update to have length one the step size is set to for all layers in the beginning of training and it is linearly decayed to zero during training as training the model is very we do not optimize the hyperparameters or repeat runs to get confidence intervals around the evaluated performances we used about weeks of gpu time for the reported results training details for the binary structured prediction task in the other set of experiments we use four polyphonic music data sets the data sets consist of at least hours of polyphonic music each where each data point is binary vector that represents one time step of music indicating which of the keys of piano are pressed we test the performance of the two proposed methods but omit training the unidirectional rnns as the computational complexity of the bayesian mcmc is prohibitive we train all models for updates in minibatches of individual data as the data sets are small we select the initial learning rate on grid of based on the lowest validation set cost we use no as several of the scores are fairly short and therefore do not specifically mask out values in the beginning or end of the data set as we did for the text data for the nade method we use an additional dimension as missing value token in the data for the missing values we set the missing value token to one and the other dimensions to zero other training details are similar to the categorical prediction task evaluation of models at test time we evaluate the models by calculating the mean of the correct value of gaps of five consecutive missing values in test data in the gsn and bayesian mcmc approaches we first set the five values in the gap to random value for the categorical prediction task or to zero for the structured prediction task we then sample all five values in the gap in random order and repeat the procedure for mcmc for evaluating the of the 
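The masking scheme used to train the NADE variant can be sketched as follows; the gap length and stride are assumed values (the exact numbers are lost in extraction), and the two returned masks indicate where the loss is propagated and which inputs are replaced by the missing-value token.

import numpy as np

def make_nade_training_masks(seq_len, gap_len=5, stride=25, rng=None):
    # Every `stride` steps, mark a gap of `gap_len` consecutive positions: the training
    # loss is propagated only from these positions, and a uniformly drawn number of
    # them (0 .. gap_len) is replaced by the missing-value token on the input side.
    rng = rng or np.random.default_rng()
    loss_mask = np.zeros(seq_len, dtype=bool)      # where the error is propagated
    input_missing = np.zeros(seq_len, dtype=bool)  # which inputs get the missing token
    for start in range(0, seq_len - gap_len + 1, stride):
        gap = np.arange(start, start + gap_len)
        loss_mask[gap] = True
        k = rng.integers(0, gap_len + 1)           # how many inputs in this gap to hide
        hide = rng.choice(gap, size=k, replace=False)
        input_missing[hide] = True
    return loss_mask, input_missing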
correct value for the string we force the last five steps to sample the correct value and store the probability of the model sampling those values we also evaluate the probability of reconstructing correctly the individual data points by not forcing the last five time steps to sample the correct value but by storing the probability of reconstructing the correct value for each data point separately we run the mcmc chain times and use the log of the mean of the likelihoods of predicting the correct value over these runs when evaluating the performance of inference we use similar approach to mcmc however when evaluating the of the entire gap we only construct it once in sequential order and record the probabilities of reconstructing the correct value when evaluating the probability of reconstructing the correct value for each data point separately we use the same approach as for mcmc and sample the gap times recording for each step the probability of sampling the correct value the result for each data point is the log of the mean of the likelihoods over these runs on the wikipedia data we evaluate the gsn and nade methods on gaps on the test data on the music data all models are evaluated on all possible gaps of on the test data excluding gaps that intersect with the first and last time steps of score when evaluating the bayesian mcmc with the unidirectional rnn we have to significantly limit the size of the data set as the method is highly computationally complex we therefore run it on gaps on the test data for nade we set the five time steps in the gap to the missing value token we then reconstruct them one by one to the correct value and record the probability of the correct reconstruction we repeat this process for all possible permutations of the order in which to do the reconstruction and therefore acquire the exact probability of the correct reconstruction given the model and the data we also evaluate the individual character reconstruction probabilities by recording the probability of sampling the correct value given all other values in the gap are set to missing results from table we can see that the bayesian mcmc method seems to yield the best results while gsn or nade outperform inference it is worth noting that in the most difficult data sets minibatch can therefore consist of musical scores each of length mcmc steps means that each value in the gap of will be resampled times table negative log likelihood nll for gaps of five time steps using different models lower is better in the experiments gsn and nade perform well although they are outperformed by bayesian mcmc inference strategy wikipedia nottingham piano muse jsb gsn nade masked nade bayesian mcmc inference na na na na gsn nade bayesian mcmc inference data point nll data point nll gsn nade inference position in gap position in gap figure average nll per data point using different methods with the wikipedia data set left and the piano data set right for different positions in gap of consecutive missing values the middle data point is the most difficult to estimate for the most methods while the inference can not take future context into account making prediction of later positions difficult for the leftmost position in the gap the inference performs the best since it does not require any approximations such as mcmc piano and jsb oneway inference performs very well qualitative examples of the reconstructions obtained with the gsn and nade on the wikipedia data are shown in table supplementary material in order to get an 
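The reported per-gap score is the negative log of the mean likelihood over the independent MCMC runs, which is safest to compute in log space (a log-sum-exp over runs minus the log of the number of runs); a small helper with illustrative inputs:

import numpy as np

def log_mean_likelihood(log_liks):
    # log( mean_k exp(log_liks[k]) ), computed stably in log space.
    log_liks = np.asarray(log_liks, dtype=float)
    m = log_liks.max()
    return m + np.log(np.mean(np.exp(log_liks - m)))

runs = [-9.2, -8.7, -10.1]          # per-run log-likelihoods of the correct gap
print(-log_mean_likelihood(runs))   # reported NLL for this gap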
indication of how the number of mcmc steps in the gsn approach affects performance we plotted the difference in nll of gsn and nade of the test set as function of the number of mcmc steps in figure supplementary material the figure indicates that the music data sets mix fairly well as the performance of gsn quickly saturates however for the wikipedia data the performance could probably be even further improved by letting the mcmc chain run for more than steps in figure we have evaluated the nll for the individual characters in the gaps of length five as expected all methods except for inference are better at predicting characters close to both edges of the gap as sanity check we make sure our models have been successfully trained by evaluating the mean test of the brnns for gap sizes of one in table supplementary material we can see that the brnns expectedly outperform previously published results with unidirectional rnns which indicates that the models have been trained successfully conclusion and discussion although recurrent neural networks have been used as generative models for time series data it has not been trivial how to use them for inference in cases such as missing gaps in the sequential data in this paper we proposed to use bidirectional rnns as generative models for time series with two probabilistic interpretations called gsn and nade both provide efficient inference in both positive and negative directions in time and both can be used in tasks where bayesian inference of unidirectional rnn is computationally infeasible the model we trained for nade differed from the basic brnn in several ways firstly we artificially marked gaps of consecutive points as missing which should help in specializing the model for such reconstruction tasks it would be interesting to study the effect of the missingness pattern used in training on the learned representations and predictions secondly in addition to using all outputs as the training signal we tested using only the reconstructions of those missing values as the training signal this reduces the effective amount of training that the model went through thirdly the model had one more input the missingness indicator that makes the learning task more difficult we can see from table that the model we trained for nade where we only used the reconstructions as the training signal has worse performance than the brnn for reconstructing single values this indicates that these differences in training have significant impact on the quality of the final trained probabilistic model we used the same number of parameters when training an rnn and brnn the rnn can concentrate all the learning effort on forward prediction and the learned dependencies in backward inference by the computationally heavy bayesian inference it remains an open question which approach would work best given an optimal size of the hidden layers as future work other model structures could be explored in this context for instance the long shortterm memory specifically to our nade approach it might make sense to replace the regular additive connection from the missingness indicator input to the hidden activations in eq by multiplicative connection that somehow gates the dynamics mappings whf and whb another direction to extend is to use deep architecture with more hidden layers the midi music data is an example of structured prediction task components of the output vector depend strongly on each other however our model assumes independent bernoulli distributions for them one way to 
take those dependencies into account is to use stochastic hidden units hft and hbt which has been shown to improve performance on structured prediction tasks bayer and osendorfer explored that approach and reconstructed missing values in the middle of motion capture data in their reconstruction method the hidden stochastic variables are selected based on an auxiliary inference model after which the missing values are reconstructed conditioned on the hidden stochastic variable values both steps are done with maximum posteriori point selection instead of sampling further quantitative evaluation of the method would be an interesting point of comparison the proposed methods could be easily extended to data as an example application reconstructions with recurrent model has been shown to be effective in speech recognition especially under impulsive noise acknowledgements we thank kyunghyun cho and yoshua bengio for useful discussions the software for the simulations for this paper was based on theano nokia has supported mathias berglund and the academy of finland has supported tapani raiko references bahdanau cho and bengio neural machine translation by jointly learning to align and translate in proceedings of the international conference on learning representations iclr baldi brunak frasconi soda and pollastri exploiting the past and the future in protein secondary structure prediction bioinformatics bastien lamblin pascanu bergstra goodfellow bergeron bouchard and bengio theano new features and speed improvements deep learning and unsupervised feature learning nips workshop bayer and osendorfer learning stochastic recurrent networks arxiv preprint bengio simard and frasconi learning dependencies with gradient descent is difficult ieee transactions on neural networks bengio yao alain and vincent generalized denoising as generative models in advances in neural information processing systems pages bergstra breuleux bastien lamblin pascanu desjardins turian and bengio theano cpu and gpu math expression compiler in proceedings of the python for scientific computing conference scipy oral presentation bengio and vincent modeling temporal dependencies in sequences application to polyphonic music generation and transcription in proceedings of the international conference on machine learning icml pages brakel stroobandt and schrauwen training models for imputation the journal of machine learning research glorot and bengio understanding the difficulty of training deep feedforward neural networks in international conference on artificial intelligence and statistics pages goodfellow mirza courville and bengio deep boltzmann machines in advances in neural information processing systems pages graves liwicki bertolami bunke and schmidhuber novel connectionist system for unconstrained handwriting recognition ieee transactions on pattern analysis and machine intelligence graves mohamed and hinton speech recognition with deep recurrent neural networks arxiv preprint haykin neural networks and learning machines volume pearson education hermans and schrauwen training and analysing deep recurrent neural networks in burges bottou welling ghahramani and weinberger editors advances in neural information processing systems pages curran associates hochreiter and schmidhuber long memory neural computation greff gomez and schmidhuber clockwork rnn in proceedings of the st international conference on machine learning maas hannun jurafsky and ng large vocabulary continuous speech recognition using recurrent dnns arxiv 
preprint mikolov joulin chopra mathieu and ranzato learning longer memory in recurrent neural networks arxiv preprint pascanu mikolov and bengio on the difficulty of training recurrent neural networks in proceedings of the international conference on machine learning icml pages raiko and valpola missing values in nonlinear factor analysis in proc of the int conf on neural information processing shanghai pages raiko berglund alain and dinh techniques for learning binary stochastic feedforward neural networks in international conference on learning representations iclr san diego remes raiko honkela and kurimo reconstruction with bounded nonlinear model ieee signal processing letters rumelhart hinton and williams learning representations by errors nature schuster and paliwal bidirectional recurrent neural networks ieee transactions on signal processing sutskever martens and hinton generating text with recurrent neural networks in proceedings of the international conference on machine learning icml pages uria murray and larochelle deep and tractable density estimator in proceedings of the international conference on machine learning pages 
quartz randomized dual coordinate ascent with arbitrary sampling peter school of mathematics the university of edinburgh united kingdom zheng qu department of mathematics the university of hong kong hong kong zhengqu tong zhang department of statistics rutgers university piscataway nj tzhang abstract we study the problem of minimizing the average of large number of smooth convex functions penalized with strongly convex regularizer we propose and analyze novel method quartz which at every iteration samples and updates random subset of the dual variables chosen according to an arbitrary distribution in contrast to typical analysis we directly bound the decrease of the error in expectation without the need to first analyze the dual error depending on the choice of the sampling we obtain efficient serial and variants of the method in the serial case our bounds match the best known bounds for sdca both with uniform and importance sampling with standard our bounds predict initial speedup as well as additional speedup which depends on spectral and sparsity properties of the data keywords empirical risk minimization dual coordinate ascent arbitrary sampling speedup introduction in this paper we consider pair of structured convex optimization problems which has in several variants of varying degrees of generality attracted lot of attention in the past few years in the machine learning and optimization communities let an be collection of real matrices and φn be convex functions from rm to where further let rd be convex function and regularization parameter we are interested in solving the following primal problem pn def wd φi λg in the machine learning context matrices ai are interpreted as is linear predictor function φi is the loss incurred by the predictor on example ai is regularizer is regularization parameter and is the regularized empirical risk minimization problem in this paper we are especially interested in problems where is very big millions billions and much larger than this is often the case in big data applications stochastic gradient descent sgd was designed for solving this type of optimization problems in each iteration sgd computes the gradient of one single randomly chosen function φi and approximates the gradient using this unbiased but noisy estimation because of the variance of the stochastic estimation sgd has slow convergence rate recently many methods achieving fast linear convergence rate log have been proposed including sag svrg saga and miso all using different techniques to reduce the variance another approach such as stochastic dual coordinate ascent sdca solves by considering its dual problem that is defined as follows for each let rm be the convex conjugate of φi namely φi and similarly let rd be the convex conjugate of the dual problem of is defined as def max αn where αn rn rnm is obtained by stacking dual variables blocks αi rm on top of each other and functions and are defined by pn pn def def λg λn ai αi sdca and its proximal extension first solve the dual problem by updating uniformly at random one dual variable at each round and then recover the primal solution by setting let li λmax ai it is known that if we run sdca for at least li li max log max λγ λγ iterations then sdca finds pair such that by applying accelerated li randomized coordinate descent on the dual problem apcg needs at most max λγ number of iterations to get asdca and spdc are also accelerated and randomized methods moreover they can update of dual variables in each round we propose new 
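For readability, the primal-dual pair sketched above can be reconstructed in standard notation (the extraction dropped the symbols; this follows the surrounding definitions and the usual assumptions that g is 1-strongly convex and each loss phi_i is (1/gamma)-smooth):

P(w) = \frac{1}{n}\sum_{i=1}^{n} \phi_i(A_i^{\top} w) + \lambda g(w),
\qquad
D(\alpha) = -\lambda g^{*}\!\Big(\frac{1}{\lambda n}\sum_{i=1}^{n} A_i \alpha_i\Big) - \frac{1}{n}\sum_{i=1}^{n} \phi_i^{*}(-\alpha_i).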
algorithm algorithm which we call quartz for simultaneously solving the primal and dual problems on the dual side at each iteration our method selects and updates random subset sampling of the dual we assume that these sets are throughout the iterations however we do not impose any additional assumptions on the distribution of apart from the necessary requirement that each block needs to be chosen with def positive probability pi quartz is the first method analyzed for an arbitrary sampling the dual updates are then used to perform an update to the primal variable and the process is repeated our primal updates are different less aggressive from those used in sdca and thanks to which the decrease in the error can be bounded directly without first establishing the dual convergence as in and our analysis is novel and directly in nature as result our proof is more direct and the logarithmic term in our bound has simpler form main result we prove that starting from an initial pair quartz finds pair for which in expectation in at most max pi vλγn log iterations the parameters vn are assumed to satisfy the following eso expected separable overapproximation inequality pn ai hi pi vi khi where denotes the standard euclidean norm moreover the parameters vn are needed to run the method they determine stepsizes and hence it is critical that they can be cheaply computed before the method starts we wish to point out that always holds for some parameters vi indeed the left hand side is quadratic function of and hence the inequality holds for largeenough vi having said that the size of these parameters directly influences the complexity and hence one would want to obtain as tight bounds as possible as we will show for many samplings of interest small enough parameter can be obtained in time required to read the data ai in particular if the data matrix an is sufficiently sparse our iteration complexity result specialized to the case of standard can be better than that of accelerated methods such as asdca and spdc even when the condition number maxi li is larger than see proposition and figure as described above quartz uses an arbitrary sampling for picking the dual variables to be updated in each iteration to the best of our knowledge only two papers exist in the literature where stochastic method using an arbitrary sampling was analyzed nsync for unconstrained minimization of strongly convex function and alpha for composite minimization of convex function assumption was for the first time introduced in however nsync is not method besides nsync the closest works to ours in terms of the generality of the sampling are pcdm spcdm and approx all these are randomized coordinate descent methods and all were analyzed for arbitrary uniform samplings samplings satisfying for all again none of these methods were analyzed in framework in section we describe the algorithm show that it admits natural interpretation in terms of fenchel duality and discuss the flexibility of quartz we then proceed to section where we state the main result specialize it to the samplings discussed in section and give detailed comparison of our results with existing results for related stochastic methods in the literature in section we demonstrate how quartz compares to other related methods through numerical experiments the quartz algorithm throughout the paper we consider the standard euclidean norm denoted by function rm is if it is differentiable and has lipschitz continuous gradient with lispchitz constant kx yk for all rm function rd 
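A reconstruction, in standard notation, of the ESO assumption and of the stated iteration bound (the constants follow the usual form of this result; the extracted line omits the symbols). Here h = (h_1, ..., h_n) ranges over R^{nm} and p_i = P(i in S-hat):

\mathbf{E}\Big\|\sum_{i \in \hat S} A_i h_i\Big\|^2 \le \sum_{i=1}^{n} p_i v_i \|h_i\|^2 \quad \text{for all } h \in \mathbb{R}^{nm},

T \ge \max_i\Big(\frac{1}{p_i} + \frac{v_i}{p_i \lambda \gamma n}\Big)\log\Big(\frac{P(w^0) - D(\alpha^0)}{\epsilon}\Big)
\;\Longrightarrow\;
\mathbf{E}\big[P(w^T) - D(\alpha^T)\big] \le \epsilon.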
is convex if kw for all dom where dom denotes the domain of and is subgradient of at the most important parameter of quartz is random sampling which is random subset of the only assumption we make on the sampling in this paper is the following assumption proper sampling is proper sampling that is def pi this assumption guarantees that each block dual variable has chance to get updated by the method prior to running the algorithm we compute positive constants vn satisfying to define the stepsize parameter used throughout in the algorithm λγn min vpi note from that depends on both the data matrix and the sampling we shall show how to compute in less than two passes over the data the parameter satisfying for some examples of sampling in section interpretation of quartz through fenchel duality algorithm quartz parameters proper random sampling and positive vector rn λγn initialization rn rd pi min vpi λn pn ai for do wt αt generate random set st following the distribution of for st do αit αi ai end for λn ai αit end for output wt αt quartz algorithm natural interpretation in terms of fenchel duality let rd rn phas and define λn ai αi the duality gap for the pair can be decomposed as pn φi φi hw φi φi hai αi gapg gapφi αi by inequality gapg and gapφi αi for all which proves weak duality for the problems and the pair is optimal when both gapg and gapφi for all are zero it is known that this happens precisely when the following optimality conditions hold αi we will now interpret the primal and dual steps of quartz in terms of the above discussion it is easy to see that algorithm updates the primal and dual variables as follows wt θpi αi θpi ai st αit st pn where λn is constant defined in and st is random subset of ai αi in other words at iteration we first set the primal variable wt to be convex combination of its current value and value reducing gapg to zero see this is followed by adjusting subset of dual variables corresponding to randomly chosen set of examples st such that for each example st the dual variable αit is set to be convex combination of its current value and value reducing gapφi to zero see flexibility of quartz clearly there are many ways in which the distribution of can be chosen leading to numerous variants of quartz the convex combination constant used throughout the algorithm should be tuned according to where vn are constants satisfying note that the best possible is obtained by computing the maximal eigenvalue of the matrix where denotes the hadamard product of matrices and rn is an block matrix with all elements in block equal to see however the complexity of computing directly the maximal eigenvalue of amounts to which requires unreasonable preprocessing time in the context of machine learning where is assumed to be very large we now describe some examples of sampling and show how to compute in less than two passes over the data the corresponding constants vn more examples including distributed sampling are presented in the supplementary material serial sampling the most studied sampling in the literature on stochastic optimization is the serial sampling which corresponds to the selection of single block that is with probability the name serial is pointing to the fact that method using such sampling will typically be serial as opposed to being parallel method updating single block dual variable at time serial sampling is uniquely characterized by the vector of probabilities pn where pi is defined by for serial sampling it is easy to see that is satisfied for def vi li 
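A minimal NumPy sketch of the Quartz iteration with a serial sampling and scalar dual blocks (m = 1); grad_phi and grad_g_conj are assumed user-supplied callables, and the stepsize theta follows the formula used in the algorithm. For instance, with g(w) = (1/2)||w||^2 one simply has grad_g_conj = lambda s: s.

import numpy as np

def quartz_serial(A, grad_phi, grad_g_conj, lam, gamma, n_iter, p=None, v=None, rng=None):
    # A: (d, n) data matrix with columns A_i; scalar dual blocks (m = 1).
    # grad_phi(i, z): derivative of the loss phi_i at z   (user-supplied)
    # grad_g_conj(s): gradient of the conjugate g* at s   (user-supplied)
    rng = rng or np.random.default_rng()
    d, n = A.shape
    v = (A ** 2).sum(axis=0) if v is None else np.asarray(v)   # v_i = L_i for serial sampling
    p = np.full(n, 1.0 / n) if p is None else np.asarray(p)    # uniform serial sampling by default
    theta = np.min(p * lam * gamma * n / (v + lam * gamma * n))
    alpha = np.zeros(n)
    alpha_bar = (A @ alpha) / (lam * n)                        # (1/(lam n)) * sum_i A_i alpha_i
    w = grad_g_conj(alpha_bar)
    for _ in range(n_iter):
        w = (1.0 - theta) * w + theta * grad_g_conj(alpha_bar)            # primal step
        i = rng.choice(n, p=p)                                             # pick one dual block
        alpha_i_new = (1.0 - theta / p[i]) * alpha[i] \
                      - (theta / p[i]) * grad_phi(i, A[:, i] @ w)          # dual step
        alpha_bar += A[:, i] * (alpha_i_new - alpha[i]) / (lam * n)        # keep alpha_bar in sync
        alpha[i] = alpha_i_new
    return w, alpha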
λmax ai where λmax denotes the maximal eigenvalue standard we now consider which selects subsets of of cardinality uniformly at random in the terminology established in such is called this sampling satisfies pi pj for all and hence it is uniform this sampling is well suited for parallel computing indeed quartz could be implemented as follows if we have processors available then at the beginning of iteration we can assign each block dual variable in st to dedicated processor the processor assigned to would then compute and apply the update if all processors have fast access to the memory where all the data is stored as is the case in multicore workstation then this way of assigning workload to the individual processors does not cause any major problems for sampling is satisfied for pd vi λmax mi mi ji aji where for each ωj is the number of nonzero blocks in the row of matrix def ωj aji note that follows from an extension of formula given in from to main result the complexity of our method is given by the following theorem the proof can be found in the supplementary material theorem main result assume that is convex and that for each φi is convex and let be proper sampling assumption and vn be positive scalars satisfying then the sequence of primal and dual variables wt αt of quartz algorithm satisfies wt αt where is defined in in particular if we fix then for max pi vλγn log wt αt in order to put the above result into context in the rest of this section we will specialize the above result to two special samplings serial sampling and the sampling quartz with serial sampling when is serial sampling we just need to plug into and derive the bound log wt αt max pilλγn if in addition is uniform then pi for all and we refer to this special case of quartz as by replacing pi in we obtain directly the complexity of li max log wt αt λγ otherwise we can seek to maximize the side of the inequality in with respect to the sampling probability to obtain the best bound simple calculation reveals that the optimal probability is given by pn def li λγn li λγn we shall call the algorithm obtained by using the above serial sampling probability the following complexity result of can be derived easily by plugging into pn li nλγ log wt αt note that in contrast with the complexity result of we now have dependence on the average of the eigenvalues li vs should be compared to proximal stochastic dual coordinate ascent indeed the dual update of takes exactly the same form of see the main difference is how the primal variable wt is updated while quartz performs the update see also performs the more aggressive update wt and the complexity result of is as follows li li log max wt αt max λγ λγ where is the dual optimal solution notice that the dominant terms in and exactly match although our logarithmic term is better and simpler this is due to direct bound on the decrease of the error of quartz without the need to first analyze the dual error in contrast to the typical approach for most of the dual coordinate ascent methods vs the importance sampling was previously used in the algorithm which extends to serial sampling the complexity of should then be compared with the following complexity result of pn pn li li nλγ log nλγ wt αt again the dominant terms in and exactly match but our logarithmic term is smaller quartz with sampling standard we now specialize theorem to the case of the sampling we define such that ωj maxi li maxi λmax ji ji it is clear that maxj wj and can be considered as measure of the density of the data 
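The parameters driving the stepsize can be computed in a single pass over the data. The sketch below does this for scalar blocks (m = 1): v for the tau-nice sampling uses the row-sparsity counts omega_j as discussed above (the exact formula is a reconstruction of the standard ESO bound and should be treated as assumed), and the serial importance-sampling probabilities are taken proportional to lambda*gamma*n + L_i.

import numpy as np

def quartz_parameters_tau_nice(A, tau):
    # Assumed form for scalar blocks:
    #   v_i = sum_j (1 + (omega_j - 1)(tau - 1)/(n - 1)) * A_{ji}^2,
    # where omega_j is the number of nonzeros in row j, and p_i = tau / n.
    d, n = A.shape
    omega = (A != 0).sum(axis=1)
    scale = 1.0 + (omega - 1.0) * (tau - 1.0) / max(n - 1.0, 1.0)
    v = (scale[:, None] * A ** 2).sum(axis=0)
    p = np.full(n, tau / n)
    return v, p

def importance_probabilities(A, lam, gamma):
    # Serial importance sampling: p_i proportional to lam*gamma*n + L_i,
    # with L_i = ||A_i||^2 for scalar blocks.
    n = A.shape[1]
    L = (A ** 2).sum(axis=0)
    q = lam * gamma * n + L
    return q / q.sum()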
by plugging into we obtain directly the following corollary corollary assume is the sampling and is chosen as in if we let and maxi li log wt αt nτ λγτ let us now have detailed look at the above result especially in terms of how it compares with the serial uniform case for fully sparse data we get perfect linear speedup the bound in is def fraction of the bound in for fully dense data the condition number maxi li λγ is unaffected by for general data the behaviour of quartz with sampling interpolates these two extreme cases it is important to note that regardless of the condition number as long as the bound in is at most fraction of the bound in hence for sparser problems quartz can achieve linear speedup for larger sizes in the authors proposed five options of dual updating rule our dual updating formula should be compared with option in for the same reason as given in the beginning of appendix quartz implemented with the same other four options achieves the same complexity result as theorem quartz vs existing methods we now compare the above result with existing stochastic dual coordinate ascent methods the variants of sdca to which quartz with sampling can be naturally compared have been proposed and analyzed previously in and in the authors proposed to use safe which is precisely equivalent to finding the stepsize parameter satisfying in the special case of sampling however they only analyzed the case where the functions φi are in the authors studied accelerated minibatch sdca asdca specialized to the case when the regularizer is the squared norm they showed that the complexity of asdca interpolates between that of sdca and accelerated gradient descent agd through varying the size in the authors proposed extension of their stochastic coordinate algorithm spdc both asdca and spdc reach the same complexity as agd when the size equals to thus should be considered as accelerated algorithms the complexity bounds for all these algorithms are summarized in table in table we compare the complexities of sdca asdca spdc and quartz in several regimes algorithm sdca asdca spdc quartz with sampling iteration complexity λγ max nτ λγτ λγτ λγτ λγτ λγτ general general table comparison of the iteration complexity of several algorithms performing stochastic coordinate ascent steps in the dual using of examples of size with the exception of sdca which is serial method using algorithm sdca asdca spdc quartz γλn γλn γλn γλn table comparison of leading factors in the complexity bounds of several methods in regimes looking at table we see that in the γλn regime if the condition number is quartz matches the linear speedup when compared to sdca of asdca and spdc when the condition number is roughly equal to the sample size then quartz does better in particular this is the case when than both asdca and spdc as long as the data is sparse if the data is even more sparse and in many big data applications one has and we have then quartz significantly outperforms both asdca and spdc note that quartz can be better than both asdca and spdc even in the domain of accelerated methods that is when the condition number is larger than the number of examples γλ indeed we have the following result proposition assume that nλγ and that maxi li if the data is sufficiently sparse so that λγτ nλγ then the iteration complexity in order of quartz is better than that of asdca and spdc the result can be interpreted as follows if that is λγτ nλγ then there are problems for which quartz is better than both asdca and spdc apcg also reaches 
accelerated convergence rate but was not proposed in the setting experimental results in this section we demonstrate how quartz specialized to different samplings compares with other methods all of our experiments are performed with for smoothed functions φi with and squared see the experiments were performed on the three datasets reported in table and three randomly generated large dataset with examples features with different sparsity in figure we compare quartz specialized to serial sampling and for both uniform and optimal sampling with and previously discussed in section on three datasets due to the conservative primal date in quartz appears to be slower than in practice nevertheless in all the experiments shows almost identical convergence behaviour to that of in figure we compare quartz specialized to sampling with spdc for different values of in the domain of accelerated methods the datasets are randomly generated following section when it is clear that spdc outperforms quartz as the condition larger than however as increases the number of data processed by spdc is increased by as predicted by its theory but the number of data processed by quartz remains almost the same by taking advantage of the large sparsity of the data hence quartz is much better in the large regime dataset training size features sparsity nd table datasets used in our experiments primal dual gap primal dual gap primal dual gap nb of epochs nb of epochs nb of epochs figure comparison of uniform sampling optimal importance sampling proxsdca uniform sampling and optimal importance sampling nb of epochs primal dual gap primal dual gap primal dual gap nb of epochs nb of epochs figure comparison of quartz with spdc for different size in the regime the three random datasets and have respective sparsity and references defazio bach and saga fast incremental gradient method with support for convex composite objectives in advances in neural information processing systems pages fercoq and accelerated parallel and proximal coordinate descent siam journal on optimization after minor revision fercoq and smooth minimization of nonsmooth functions by parallel coordinate descent hsieh chang lin keerthi and sundararajan dual coordinate descent method for linear svm in proc of the international conference on machine learning icml pages jaggi smith takac terhorst krishnan hofmann and jordan communicationefficient distributed dual coordinate ascent in advances in neural information processing systems pages curran associates johnson and zhang accelerating stochastic gradient descent using predictive variance reduction in burges bottou welling ghahramani and weinberger editors advances in neural information processing systems pages lu and gradient descent in the proximal setting and gradient descent methods lin lu and xiao an accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization technical report july mairal incremental optimization with application to machine learning siam nemirovski juditsky lan and shapiro robust stochastic approximation approach to stochastic programming siam nesterov efficiency of coordinate descent methods on optimization problems siam nesterov gradient methods for minimizing composite functions math ser qu and coordinate descent methods with arbitrary sampling ii expected separable overapproximation qu and coordinate descent methods with arbitrary sampling algorithms and complexity and on optimal probabilities in stochastic coordinate descent methods 
optimization letters published online and parallel coordinate descent methods for big data optimization math published online robbins and monro stochastic approximation method ann math statistics schmidt le roux and bach minimizing finite sums with the stochastic average gradient and zhang proximal stochastic dual coordinate ascent and zhang accelerated stochastic dual coordinate ascent in advances in neural information processing systems pages and zhang stochastic dual coordinate ascent methods for regularized loss mach learn february bijral and srebro primal and dual methods for svms in proc of the international conference on machine learning pages yang trading computation for communication distributed stochastic dual coordinate ascent in advances in neural information processing systems pages zhang solving large scale prediction problems using stochastic gradient descent algorithms in proc of the international conference on machine learning pages zhang and xiao stochastic coordinate method for regularized empirical risk minimization in proc of the international conference on machine learning pages zhao and zhang stochastic optimization with importance sampling icml 
maximum likelihood learning with arbitrary treewidth via parameter sets justin domke nicta australian national university abstract inference is typically intractable in undirected graphical models making maximum likelihood learning challenge one way to overcome this is to restrict parameters to tractable set most typically the set of parameters this paper explores an alternative notion of tractable set namely set of parameters where markov chain monte carlo mcmc inference can be guaranteed to quickly converge to the stationary distribution while it is common in practice to approximate the likelihood gradient using samples obtained from mcmc such procedures lack theoretical guarantees this paper proves that for any exponential family with bounded sufficient statistics not just graphical models when parameters are constrained to set gradient descent with gradients approximated by sampling will approximate the maximum likelihood solution inside the set with when unregularized to find solution in requires total amount of effort cubic in disregarding logarithmic factors when strong convexity allows solution in parameter distance with effort quadratic in both of these provide of time randomized approximation scheme introduction in undirected graphical models maximum likelihood learning is intractable in general for example jerrum and sinclair show that evaluation of the partition function which can easily be computed from the likelihood for an ising model is and that even the existence of time randomized approximation scheme fpras for the partition function would imply that rp np if the model is meaning that the target distribution falls in the assumed family then there exist several methods that can efficiently recover correct parameters among them the pseudolikelihood score matching composite likelihoods mizrahi et method based on parallel learning in local clusters of nodes and abbeel et method based on matching local probabilities while often useful these methods have some drawbacks first these methods typically have inferior sample complexity to the likelihood second these all assume model if the target distribution is not in the assumed class the solution will converge to the minimum of the but these estimators do not have similar guarantees third even when these methods succeed they typically yield distribution in which inference is still intractable and so it may be infeasible to actually make use of the learned distribution given these issues natural approach is to restrict the graphical model parameters to tractable set in which learning and inference can be performed efficiently the gradient of the likelihood is determined by the marginal distributions whose difficulty is typically determined by the treewidth of the graph thus probably the most natural tractable family is the set of distributions where θij the algorithm provides an efficient method for finding the maximum likelihood parameter vector in this set by computing the mutual information of all empirical pairwise marginals and finding the maximum spanning tree similarly heinemann and globerson give method to efficiently learn models where correlation decay limits the error of approximate inference though this will not converge to the when the model is this paper considers fundamentally different notion of tractability namely guarantee that markov chain monte carlo mcmc sampling will quickly converge to the stationary distribution our fundamental result is that if is such set and one can project onto then there exists fpras 
for the maximum likelihood solution inside while inspired by graphical models this result works entirely in the exponential family framework and applies generally to any exponential family with bounded sufficient statistics the existence of fpras is established by analyzing common existing strategy for maximum likelihood learning of exponential families namely gradient descent where mcmc is used to generate samples and approximate the gradient it is natural to conjecture that if the markov chain is fast mixing is run long enough and enough gradient descent iterations are used this will converge to nearly the optimum of the likelihood inside with high probability this paper shows that this is indeed the case separate analysis is used for the case using strong convexity and the unregularized case which is merely convex setup though notation is introduced when first used the most important symbols are given here for more reference parameter vector to be learned mθ markov chain operator corresponding to θk estimated parameter vector at gradient descent iteration qk approximate distribution sampled from at iteration iterations of the markov chain corresponding to from arbitrary starting distribution constraint set for negative on training data lipschitz constant for the gradient of arg minimizer of likelihood inside of total number of gradient descent steps total number of samples drawn via mcmc length of vector number of markov chain transitions applied for each sample parameters determining the mixing rate of the markov chain equation ra sufficient statistics norm bound ϵf desired optimization accuracy for ϵθ desired optimization accuracy for permitted probability of failure to achieve given approximation accuracy this paper is concerned with an exponential family of the form pθ exp where is vector of sufficient statistics and the function ensures normalization an undirected model can be seen as an exponential family where consists of indicator functions for each possible configuration of each clique while such graphical models motivate this work the results are most naturally stated in terms of an exponential family and apply more generally initialize for draw samples for sample estimate the gradient as xi λθ ek update the parameter vector as θk πθ ek output θk or θk figure left algorithm approximate gradient descent with gradients approximated via mcmc analyzed in this paper right cartoon of the desired performance stochastically finding solution near the minimum of the regularized negative in the set we are interested in performing learning minimizing for dataset zd log pθ zi where we define zi it is easy to see that the gradient of takes the form epθ λθ if one would like to optimize using method computing the expectation of with respect to pθ can present computational challenge with discrete graphical models the expected value of is determined by the marginal distributions of each factor in the graph typically the computational difficulty of computing these marginal distributions is determined by the treewidth of the if the graph is tree or close to tree the marginals can be computed by the algorithm one option with high treewidth is to approximate the marginals with variational method this can be seen as exactly optimizing surrogate likelihood approximation of eq another common approach is to use markov chain monte carlo mcmc to compute sample xi from distribution close to pθ and then approximate epθ by xi this strategy is widely used varying in the model type the sampling algorithm how 
samples are initialized the details of optimization and so on recently steinhardt and liang proposed learning in terms of the stationary distribution obtained from chain with nonzero restart probability which is by design while popular such strategies generally lack theoretical guarantees if one were able to exactly sample from pθ this could be understood simply as stochastic gradient descent but with mcmc one can only sample from distribution approximating pθ meaning the gradient estimate is not only noisy but also biased in general one can ask how should the step size number of iterations number of samples and number of markov chain transitions be set to achieve convergence level the gradient descent strategy analyzed in this paper in which one updates parameter vector θk using approximate gradients is outlined and shown as cartoon in figure here and in the rest of the paper we use pk as shorthand for pθk and we let ek denote the difference between the estimated gradient and the true gradient the projection operator is defined by πθ arg we assume that the parameter set is constrained to set such that mcmc is guaranteed to mix at certain rate section with convexity this assumption can bound the mean and variance of the errors at each iteration leading to bound on the sum of errors with strong convexity the error of the gradient at each iteration is bounded with high probability then using results due to for projected gradient descent with errors in the gradient we show schedule the number of iterations the number of samples and the number of markov transitions such that with high probability θk ϵf or ϵθ for the convex or strongly convex cases respectively where arg the total number of markov transitions applied through the entire algorithm km grows as log for the convex case log for the strongly convex case and polynomially in all other parameters of the problem background mixing times and parameter sets this section discusses some background on mixing times for mcmc typically mixing times are defined in terms of the distance maxa where the maximum ranges over the sample space for discrete distributions this can be shown to be equivalent to we assume that sampling algorithm is known single iteration of which can be thought of an operator mθ that transforms some starting distribution into another the stationary distribution is pθ mvθ pθ for all informally markov chain will be fast mixing if the total variation distance between the starting distribution and the stationary distribution decays rapidly in the length of the chain this paper assumes that convex set and constants and are known such that for all and all distributions pθ cαv sup cαv this means that the distance between an arbitrary starting distribution and the stationary distribution pθ decays geometrically in terms of the number of markov iterations this assumption is justified by the convergence theorem theorem which states that if is irreducible and aperiodic with stationary distribution then there exists constants and such that many results on mixing times in the literature however are stated in less direct form given constant the mixing time is defined by min it often happens that bounds on mixing times are stated as something like ln for some constants and it follows from this that cαv with exp and exp simple example of exponential family is the ising model defined for as θi xi exp θij xi xj simple result for this model is that if the maximum degree of any node is and for log all then for univariate gibbs sampling with 
random updates tanh the algorithm discussed in this paper needs the ability to project some parameter vector onto to find arg projecting set of arbitrary parameters onto this set of parameters is simply set θij for θij and θij for θij for more dense graphs it is known that for matrix norm that is the spectral norm or induced or infinity norms log where rij domke and liu show how to perform this projection for the ising model when is the spectral norm with convex optimization utilizing the singular value decomposition in each iteration loosely speaking the above result shows that univariate gibbs sampling on the ising model is fastmixing as long as the interaction strengths are not too strong conversely jerrum and sinclair exhibited an alternative markov chain for the ising model that is rapidly mixing for arbitrary interaction strengths provided the model is ferromagnetic that all interaction strengths are positive with θij and that the field is unidirectional this markov chain is based on sampling in different subgraphs world nevertheless it can be used to estimate derivatives of the ising model function with respect to parameters which allows estimation of the gradient of the huber provided simulation reduction to obtain an ising model sample from subgraphs world sample more generally liu and domke consider pairwise markov random field defined as θi xi exp θij xi xj and show that if one defines rij maxa then again equation holds an algorithm for projecting onto the set exists there are many other bounds for different algorithms and different types of models the most common algorithms are univariate gibbs sampling often called glauber dynamics in the mixing time literature and sampling the ising model and potts models are the most common distributions studied either with grid or graph structure often the motivation for studying these systems is to understand physical systems or to mathematically characterize in mixing time that occur as interactions strengths vary as such many existing bounds assume uniform interaction strengths for all these reasons these bounds typically require some adaptation for learning setting main results lipschitz gradient for lack of space detailed proofs are postponed to the appendix however informal proof sketches are provided to give some intuition for results that have longer proofs our first main result is that the regularized has lipschitz gradient theorem the regularized gradient is with da proof sketch it is easy by the triangle inequality that da dθ dφ da next using the assumption that one can bound that da dθ dφ finally some effort can bound that pφ convex convergence now our first major result is guarantee on the convergence that is true both in the regularized case where and the unregularized case where theorem with probability at least at long as log algorithm will satisfy log kcαv θk kl proof sketch first note that is convex since the hessian of is the covariance of when and only adds quadratic now define the quantity dk xm eqk to be the difference between the estimated expected value of under qk and the true value an elementary argument can bound the expected value of while the inequality can bounds its variance using both of these bounds in bernstein inequality can then show that with probability log finally we can observe that epθk by the assumption on mixing speed the last bounded byv cα and so with probability log cα finally result due to schmidt et al on the convergence of gradient descent with errors in estimated gradients gives the result 
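To make the procedure being analyzed concrete, here is a minimal sketch (not the authors' implementation) of the algorithm of Figure 1 for the Ising-model setting used later in the paper: projected gradient descent in which the model expectation appearing in the gradient is approximated by samples obtained from v sweeps of single-site Gibbs sampling, and projection onto the fast-mixing set {θ : |θ_e| ≤ β} is coordinate-wise clipping. The step size, the number of samples M, the chain length v, and the regularization strength are illustrative placeholders rather than the schedules prescribed by the theorems, and the systematic sweep is a simplification of the random-update chain assumed in the mixing-time bound.

import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(x, theta, nbrs):
    # one systematic sweep of single-site Gibbs updates for
    # p(x) proportional to exp(sum_e theta[e] * x[i] * x[j]), with x[i] in {-1, +1}
    for i in range(len(x)):
        field = sum(theta[e] * x[j] for e, j in nbrs[i])
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
        x[i] = 1 if rng.random() < p_plus else -1
    return x

def suff_stats(x, edges):
    # pairwise sufficient statistics f(x) = (x_i * x_j) over the edges
    return np.array([x[i] * x[j] for i, j in edges], dtype=float)

def mcmc_gradient(theta, f_bar, edges, nbrs, n_sites, M, v, lam):
    # gradient of the regularized negative log-likelihood, writing the
    # regularizer as lam * ||theta||^2, so that
    # grad L(theta) = E_theta[f(X)] - f_bar + 2 * lam * theta,
    # with the expectation estimated from M independent chains of length v
    est = np.zeros(len(edges))
    for _ in range(M):
        x = rng.choice([-1, 1], size=n_sites)   # arbitrary starting distribution
        for _ in range(v):
            gibbs_sweep(x, theta, nbrs)
        est += suff_stats(x, edges)
    return est / M - f_bar + 2.0 * lam * theta

def learn(data, edges, n_sites, beta=0.2, lam=0.1, step=0.5, K=100, M=50, v=100):
    # projected gradient descent with MCMC-approximated gradients;
    # projection onto {theta : |theta_e| <= beta} is coordinate-wise clipping
    nbrs = [[] for _ in range(n_sites)]
    for e, (i, j) in enumerate(edges):
        nbrs[i].append((e, j))
        nbrs[j].append((e, i))
    f_bar = np.mean([suff_stats(z, edges) for z in data], axis=0)  # empirical statistics
    theta = np.zeros(len(edges))
    for _ in range(K):
        g = mcmc_gradient(theta, f_bar, edges, nbrs, n_sites, M, v, lam)
        theta = np.clip(theta - step * g, -beta, beta)
    return theta

In the grid example described below, data would be a list of ±1 configuration vectors and edges the list of grid-edge index pairs; the theorems above indicate how K, M and v should in principle grow with the desired accuracy and failure probability.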
intuitively this result has the right character if grows on the order of and grows on the order of log log then all terms inside the quadratic will be held constant and so if we set of the order the will on the order of with total computational effort roughly on the order of log the following results pursue this more carefully firstly one can observe that minimum amount of work must be performed theorem for if are set so that kcαv then log ac log since it must be true that kcα each of these three terms must also be at most giving on and multiplying these gives the result km next an explicit schedule for and is possible in terms of convex set of parameters comparing this to the above shows that this is not too far from optimal theorem suppose that if then setting ac log log is sufficient to guarantee that kcαv with total work of km ac log log simply verify that the bound holds and multiply the terms together log ac for example setting and gives that km finally we can give an explicit schedule for and and bound the total amount of work that needs to be performed theorem if max log then for all there is setting of such that θk ϵf with probability and km log ϵf proof sketch this follows from setting and as in theorem with log and ϵf strongly convex convergence this section gives the main result for convergence that is true only in the regularized case where again the main difficulty in this proof is showing that the sum of the errors of estimated gradients at each iteration is small this is done by using concentration inequality to show that the error of each estimated gradient is small and then applying union bound to show that the sum is small the main result is as follows theorem when the regularization constant obeys with probability at least algorithm will satisfy log cα proof sketch when is convex as in theorem and so is strongly convex when the basic proof technique here is to decompose the error in particular step as variant of xi eqk epθk multidimensional effding inequality can bound the first term with probability by log while our assumption on mixing speed can bound the second term by cαv applying this to all iterations using gives that all errors are simultaneously bounded as before this can then be used in another result due to schmidt et al on the convergence of gradient descent with errors in estimated gradients in the strongly convex case similar proof strategy could be used for the convex case where rather than directly bounding the sum of the norm of errors of all steps using the inequality and bernstein bound one could simply bound the error of each step using multidimensional inequality and then apply this with probability to each step this yields slightly weaker result than that shown in theorem the reason for applying uniform bound on the errors in gradients here is that schmidt et bound on the convergence of proximal gradient descent on strongly convex functions depends not just on the sum of the norms of gradient errors but weighted variant of these again we consider how to set parameters to guarantee that θk is not too far from with minimum amount of work firstly we show theorem suppose then for any such that log it must be the case that log aϵ log cϵ log aϵ km log log log log proof sketch this is established by noticing that log and cα must each be less than giving lower bounds on and next we can give an explicit schedule that is not too far off from this theorem suppose that if βi then setting log log log and log log is sufficient to guarantee that log cαv with total 
work of at most log log log log km log log log for example if you choose and then this varies from the in theorem by factor of two and multiplicative factor of inside the logarithmic terms corollary if we choose log and log log ϵλ then ϵθ with probability at least and the total amount of work is bounded by km log log log λδ discussion an important detail in the previous results is that the convex analysis gives convergence in terms of the regularized while the analysis gives convergence in terms of the parameter distance if we drop logarithmic factors the amount of work necessary for ϵf optimality in the using the convex algorithm is of the order while the amount of work necessary for ϵθ optimality using the strongly convex analysis is of the order though these quantities are not directly comparable the standard bounds on for convex functions with gradients are that ϵf thus roughly speaking when regularized for the analysis shows that ϵf optimality in the can be achieved with an amount of work only linear in iterations iterations iterations figure ising model example left the difference of the current test from the optimal on random runs center the distance of the current estimated parameters from the optimal parameters on random runs right the current estimated parameters on one run as compared to the optimal parameters far right example while this paper claims no significant practical contribution it is useful to visualize an example take an ising model exp θij xi xj for xi on grid with random vectors as training data the sufficient statistics are xi xj pairs and with pairs for set constrain for all pairs since log the maximum degree is tanh fix ϵθ and though the theory above suggests the lipschitz constant lower value of is used which converged faster in practice with exact or approximate gradients now one can derive that log and exp tanh applying corollary with and gives and fig shows the results in practice the algorithm finds solution tighter than the specified ϵθ indicating degree of conservatism in the theoretical bound conclusions this section discusses some weaknesses of the above analysis and possible directions for future work analyzing complexity in terms of the total sampling effort ignores the complexity of projection itself since projection only needs to be done times this time will often be very small in comparison to sampling time this is certainly true in the above example however this might not be the case if the projection algorithm scales in the size of the model another issue to consider is how the samples are initialized as far as the proof of correctness goes the initial distribution is arbitrary in the above example simple uniform distribution was used however one might use the empirical distribution of the training data which is equivalent to contrastive divergence it is reasonable to think that this will tend to reduce the mixing time when the pθ is close to the model generating the data however the number of markov chain transitions prescribed above is larger than typically used with contrastive divergence and algorithm does not reduce the step size over time while it is common to regularize to encourage fast mixing with contrastive divergence section this is typically done with simple heuristic penalties further contrastive divergence is often used with hidden variables still this provides bound for how closely variant of contrastive divergence could approximate the maximum likelihood solution the above analysis does not encompass the common strategy for 
maximum likelihood learning where one maintains pool of samples between iterations and initializes one markov chain at each iteration from each element of the pool the idea is that if the samples at the previous iteration were close to and is close to pk then this provides an initialization close to the current solution however the proof technique used here is based on the assumption that the samples xki at each iteration are independent and so can not be applied to this strategy acknowledgements thanks to ivona bezáková aaron defazio nishant mehta aditya menon cheng soon ong and christfried webers nicta is funded by the australian government through the dept of communications and the australian research council through the ict centre of excellence program references abbeel koller and ng learning factor graphs in polynomial time and sample complexity journal of machine learning research asuncion liu ihler and smyth learning with blocks composite likelihood and contrastive divergence in aistats besag statistical analysis of data journal of the royal statistical society series the statistician boucheron lugosi and massart concentration inequalities nonasymptotic theory of independence oxford university press and hinton on contrastive divergence learning in aistats chow and liu approximating discrete probability distributions with dependence trees ieee transactions on information theory descombes robin morris and berthod estimation of markov random field prior parameters using markov chain monte carlo maximum likelihood ieee transactions on image processing domke and liu projecting ising model parameters for fast mixing in nips dyer goldberg and jerrum matrix norms and rapid mixing for spin systems ann appl geyer markov chain monte carlo maximum likelihood in symposium on the interface gu and zhu maximum likelihood estimation for spatial models by markov chain monte carlo stochastic approximation journal of the royal statistical society series statistical methodology hayes simple condition implying rapid mixing of dynamics on spin systems in focs heinemann and globerson inferning with high girth graphical models in icml hinton practical guide to training restricted boltzmann machines technical report university of toronto huber simulation reductions for the ising model journal of statistical theory and practice hyvärinen estimation of statistical models by score matching journal of machine learning research jerrum and sinclair approximation algorithms for the ising model siam journal on computing koller and friedman probabilistic graphical models principles and techniques mit press levin peres and wilmer markov chains and mixing times american mathematical society lindsay composite likelihood methods contemporary mathematics liu and domke projecting markov random field parameters for fast mixing in nips marlin and de freitas asymptotic efficiency of deterministic estimators for discrete models ratio matching and pseudolikelihood in uai mizrahi denil and de freitas linear and parallel learning of markov random fields in icml papandreou and yuille random fields using discrete optimization to learn and sample from energy models in iccv salakhutdinov learning in markov random fields using tempered transitions in nips schmidt roux and bach convergence rates of inexact methods for convex optimization in nips schmidt gao and roth generative perspective on mrfs in vision in cvpr steinhardt and liang learning models for structured prediction in icml tieleman training restricted boltzmann machines 
using approximations to the likelihood gradient in icml varin reid and firth an overview of composite likelihood methods statistica sinica wainwright estimating the wrong graphical model benefits in the setting journal of machine learning research wainwright and jordan graphical models exponential families and variational inference found trends mach zhu wu and mumford filters random fields and maximum entropy frame towards unified theory for texture modeling international journal of computer vision 
optimization for learning deep multidimensional recurrent neural networks minhyung cho chandra shekhar dhir jaehyung lee applied research korea gracenote shekhardhir abstract multidimensional recurrent neural networks mdrnns have shown remarkable performance in the area of speech and handwriting recognition the performance of an mdrnn is improved by further increasing its depth and the difficulty of learning the deeper network is overcome by using hf optimization given that connectionist temporal classification ctc is utilized as an objective of learning an mdrnn for sequence labeling the of ctc poses problem when applying hf to the network as solution convex approximation of ctc is formulated and its relationship with the em algorithm and the fisher information matrix is discussed an mdrnn up to depth of layers is successfully trained using hf resulting in an improved performance for sequence labeling introduction multidimensional recurrent neural networks mdrnns constitute an efficient architecture for building multidimensional context into recurrent neural networks training of mdrnns in conjunction with connectionist temporal classification ctc has been shown to achieve performance in handwriting and speech recognition in previous approaches the performance of mdrnns having depth of up to five layers which is limited as compared to the recent progress in feedforward networks was demonstrated the effectiveness of mdrnns deeper than five layers has thus far been unknown training deep architecture has always been challenging topic in machine learning notable breakthrough was achieved when deep feedforward neural networks were initialized using layerwise recently approaches have been proposed in which supervision is added to intermediate layers to train deep networks to the best of our knowledge no such or bootstrapping method has been developed for mdrnns alternatively hf optimization is an appealing approach to training deep neural networks because of its ability to overcome pathological curvature of the objective function furthermore it can be applied to any connectionist model provided that its objective function is differentiable the recent success of hf for deep feedforward and recurrent neural networks supports its application to mdrnns in this paper we claim that an mdrnn can benefit from deeper architecture and the application of second order optimization such as hf allows its successful learning first we offer details of the development of hf optimization for mdrnns then to apply hf optimization for sequence labeling tasks we address the problem of the of ctc and formulate convex approximation in addition its relationship with the em algorithm and the fisher information matrix is discussed experimental results for offline handwriting and phoneme recognition show that an mdrnn with hf optimization performs better as the depth of the network increases up to layers multidimensional recurrent neural networks mdrnns constitute generalization of rnns to process multidimensional data by replacing the single recurrent connection with as many connections as the dimensions of the data the network can access the contextual information from directions allowing collective decision to be made based on rich context information to enhance its ability to exploit context information long shortterm memory lstm cells are usually utilized as hidden units in addition stacking mdrnns to construct deeper networks further improves the performance as the depth increases achieving the performance in phoneme 
recognition for sequence labeling ctc is applied as loss function of the mdrnn the important advantage of using ctc is that no sequences are required and the entire transcription of the input sample is sufficient learning mdrnns mdrnn with inputs and outputs is regarded as mapping from an input sequence rm to an output sequence rk of length where the input data for input neurons are given by the vectorization of data and td is the length of the sequence in each dimension all learnable weights and biases are concatenated to obtain parameter vector rn in the learning phase with fixed training data the mdrnn is formalized as mapping rn rk from the parameters to the output sequence the scalar loss function is defined over the output sequence as rk learning an mdrnn is viewed as an optimization of the objective with respect to the jacobian jf of function rm rn is the matrix where each element is partial derivative of an element of output with respect to an element of input the hessian hf of scalar function rm is the matrix of partial derivatives of the output with respect to its inputs throughout this paper vector sequence is denoted by boldface vector at time in is denoted by at and the element of at is denoted by atk optimization for mdrnns the application of hf optimization to an mdrnn is straightforward if the matching loss function for its output layer is adopted however this is not the case for ctc which is necessarily adopted for sequence labeling before developing an appropriate approximation to ctc that is compatible with hf optimization we discuss two considerations related to the approximation the first is obtaining quadratic approximation of the loss function and the second is the efficient calculation of the product used at each iteration of the conjugate gradient cg method hf optimization minimizes an objective by constructing local quadratic approximation for the objective function and minimizing the approximate function instead of the original one the loss function needs to be approximated at each point θn of the iteration qn θn θn δn δn gδn where δn θn is the search direction the parameters of the optimization and is local approximation to the curvature of at θn which is typically obtained by the generalized ggn matrix as an approximation of the hessian hf optimization uses the cg method in subroutine to minimize the quadratic objective above for utilizing the complete curvature information and achieving computational efficiency cg requires the computation of gv for an arbitrary vector but not the explicit evaluation of for neural networks an efficient way to compute gv was proposed in extending the study in in section we provide the details of the efficient computation of gv for mdrnns quadratic approximation of loss function the hessian matrix of the objective is written as jn hl jn kt jl where jn rkt hl rkt and denotes the component of the vector an indefinite hessian matrix is problematic for optimization because it defines an unbounded local quadratic approximation for nonlinear systems the hessian is not necessarily positive semidefinite and thus the ggn matrix is used as an approximation of the hessian the ggn matrix is obtained by ignoring the second term in eq as given by jn hl the sufficient condition for the ggn approximation to be exact is that the network makes perfect prediction for every given sample that is jl or stays in the linear region for all that is has less rank than kt and is positive semidefinite provided that hl is thus is chosen to be convex function 
so that hl is positive semidefinite in principle it is best to define and such that performs as much of the computation as possible with the positive semidefiniteness of hl as minimum requirement in practice nonlinear output layer together with its matching loss function such as the softmax function with loss is widely used computation of product for mdrnn hl jn amounts to the sethe product of an arbitrary vector by the ggn matrix gv jn quential multiplication of by three matrices first the product jn is jacobian times vector and is therefore equal to the directional derivative of along the direction of thus jn can be written using differential operator jn rv and the properties of the operator can be utilized for efficient computation because an mdrnn is composition of differentiable components the computation of rv throughout the whole network can be accomplished by repeatedly applying the sum product and chain rules starting from the input layer the detailed derivation of the operator to lstm normally used as hidden unit in mdrnns is provided in appendix next the multiplication of jn by hl can be performed by direct computation the dimension of hl could at first appear problematic since the dimension of the output vector used by the loss function can be as high as kt in particular if ctc is adopted as an objective for the mdrnn if the loss function can be expressed as the sum of individual loss functions with domain restricted in time the computation can be reduced significantly for example with the commonly used crossentropy loss function the kt kt matrix hl can be transformed into block diagonal matrix with blocks of hessian matrix let hl be the block in hl then the ggn matrix can be written as jn hl jnt where jnt is the jacobian of the network at time is calculated using the backfinally the multiplication of vector hl jn by the matrix jn propagation through time algorithm by propagating instead of the error at the output layer convex approximation of ctc for application to hf optimization connectioninst temporal classification ctc provides an objective function of learning an mdrnn for sequence labeling in this section we derive convex approximation of ctc inspired by the ggn approximation according to the following steps first the part of the original objective is separated out by reformulating the softmax part next the remaining convex part is approximated without altering its hessian making it well matched to the part finally the convex approximation is obtained by reuniting the convex and parts connectionist temporal classification ctc is formulated as the mapping from an output sequence of the recurrent network rk to scalar loss the output activations at time are normalized using the softmax function ykt exp atk exp where ykt is the probability of label given at time the conditional probability of the path is calculated by the multiplication of the label probabilities at each timestep as given by yπt where πt is the label observed at time along the path the path of length is mapped to label sequence of length by an operator which removes the repeated labels and then the blanks several mutually exclusive paths can map to the same label sequence let be set containing every possible sequence mapped by that is for some is the image of and let denote the cardinality of the set the conditional probability of label sequence is given by which is the sum of probabilities of all the paths mapped to label sequence by the loss assigns negative log probability to the correct answer given 
target sequence the loss function of ctc for the sample is written as log from the description above ctc is composed of the sum of the product of softmax components the function log ykt corresponding to the softmax with loss is convex therefore ykt is whereas is closed under multiplication the sum of functions is not in general as result the ctc objective is not convex in general because it contains the sum of softmax components in eq reformulation of ctc objective function we reformulate the ctc objective eq to separate out the terms that are responsible for the nonconvexity of the function by reformulation the softmax function is defined over the categorical label sequences by substituting eq into eq it follows that exp bπ exp bπ where bπ as atπt by substituting eq into eq and setting can be exp bπ exp fz exp where is the set of every possible label sequence and fz log exp bπ is the logp exp bπ which is proportional to the probability of observing the label sequence among all the other label sequences with the reformulation above the ctc objective can be regarded as the loss with the softmax output which is defined over all the possible label sequences because the loss function matches the softmax output layer the ctc objective is convex except the part that computes fz for each of the label sequences at this point an obvious candidate for the convex approximation of ctc is the ggn matrix separating the convex and parts let the part be nc and the convex part be lc the mapping nc rk is defined by nc xn log exn is the function defined on rn where fz is given above and is the number of all the possible label sequences for given as above the mapping lc is defined by exp fz lc log log exp exp fz where is the label sequence corresponding to the final reformulation for the loss function of ctc is given by lc nc convex approximation of ctc loss function the ggn approximation of eq immediately gives convex approximation of the hessian for ctc as glc jn hlc jnc although hlc has the form of diagonal matrix plus matrix diag the dimension of hlc is where becomes exponentially large as the length of the sequence increases this makes the practical calculation of hlc difficult on the other hand removing the linear team from in eq does not alter its hessian the resulting formula is lp log exp fz the ggn matrices of lc nc and lp nc are the same glc glp therefore their hessian matrices are approximations of each other the condition that the two hessian matrices hl and hm converges to the same matrix is discussed below interestingly is given as compact formula lp nc log exp atk where atk is the output unit at time its hessian hm can be directly computed resulting in block diagonal matrix each block is restricted in time and the block is given by hm diag and ykt is given in eq because the hessian of each block is positive where yk semidefinite hm is positive semidefinite convex approximation of the hessian of an mdrnn using the ctc objective can be obtained by substituting hm for hl in eq note that the resulting matrix is block diagonal and eq can be utilized for efficient computation our derivation can be summarized as follows hl hlc is not positive semidefinite glc glp is positive semidefinite but not computationally tractable hlp is positive semidefinite and computationally tractable sufficient condition for the proposed approximation to be exact pkt from eq the condition hlc hlp holds if and only if jlc nc pkt jlp nc since jlc jlp in general we consider only the case of nc for all which corresponds to 
the case where nc is linear mapping nc contains function mapping from paths to label sequence let be the label sequence corresponding to nc then nc fl bπ for if the probability of one path is sufficiently large to ignore all the other paths that is exp exp bπ for it follows that fl this is linear mapping which results in nc in conclusion the condition hlc hlp holds if one dominant path exists such that fl bπ bπ for each label sequence derivation of the proposed approximation from the fisher information matrix the identity of the ggn and the fisher information matrix has been shown for the network using the softmax with loss thus it follows that the ggn matrix of eq is identical to the fisher information matrix now we show that the proposed matrix in eq is derived from the fisher information matrix under the condition given in section the fisher information matrix of an mdrnn using ctc is written as log log ex jn jn where is the kt output of the network ctc assumes output probabilities at each timestep to be independent of those at other timesteps and therefore its fisher information matrix is given as the sum of every timestep it follows that log log jnt ex jnt under the condition in section the fisher information matrix is given by ex jnt diag jnt which is the same form as eqs and combined see appendix for the detailed derivation em interpretation of the proposed approximation the goal of the em algorithm is to find the maximum likelihood solution for models having latent variables given an input sequence and its corresponding target label sequence the log likelihood of is given by log log where represents the model parameters for each observation we have corresponding latent variable which is binary vector where is the number pof all the paths mapped to the log likelihood can be written in terms of as log log the em algorithm starts with an initial parameter and repeats the following process until convergence expectation step calculates maximization step updates argmaxθ where log in the context of ctc and rnn is given as as in eq where is the kt output of the neural network taking the derivative of log with respect to gives diag with as in eq tbecause this term is independent of and the hessian of with respect to is given by hq diag which is the same as the convex approximation in eq experiments in this section we present the experimental results for two different sequence labeling tasks offline handwriting recognition and phoneme recognition the performance of optimization for mdrnns with the proposed matrix is compared with that of stochastic gradient descent sgd optimization on the same settings database and preprocessing the database is database of handwritten arabic words which consists of images the entire dataset has five subsets the images corresponding to the subsets were used for training the validation set consisted of images corresponding to the first half of the sorted list in alphabetical order in set the remaining images in set amounting to were used for the test the intensity of pixels was centered and scaled using the mean and standard deviation calculated from the training set the timit corpus is benchmark database for evaluating speech recognition performance the standard training validation and core datasets were used each set contains sentences sentences and sentences respectively mel spectrum with coefficients was used as feature vector with filter ms window size and ms shift size each input feature was centered and scaled using the mean and standard deviation of 
the training set experimental setup for handwriting recognition the basic architecture was adopted from that proposed in deeper networks were constructed by replacing the top layer with more layers the number of lstm cells in the augmented layer was chosen such that the total number of weights between the different networks was similar the detailed architectures are described in table together with the results for phoneme recognition the deep bidirectional lstm and ctc in was adopted as the basic architecture in addition the memory cell block in which the cells share the gates was applied for efficient information sharing each lstm block was constrained to have memory cells according to the results using large value of bias for gates is beneficial for training deep mdrnns possible explanation is that the activation of neurons is exponentially decayed by gates during the propagation thus setting large bias values for these gates may facilitate the transmission of information through many layers at the beginning of the learning for this reason the biases of the input and output gates were initialized to whereas those of the forget gates and memory cells were initialized to all the other weight parameters of the mdrnn were initialized randomly from uniform distribution in the range the label error rate was used as the metric for performance evaluation together with the average loss of ctc in eq it is defined by the edit distance which sums the total number of insertions deletions and substitutions required to match two given sequences the final performance shown in tables and was evaluated using the weight parameters that gave the best label error rate on the validation set to map output probabilities to label sequence best path decoding was used for handwriting recognition and beam search decoding with beam width of was used for phoneme recognition for phoneme recognition phoneme labels were used during training and decoding and then mapped to classes for calculating the phoneme error rate per for phoneme recognition the regularization method suggested in was used we applied gaussian weight noise of standard deviation together with regularization of strength the network was first trained without noise and then it was initialized to the weights that gave the lowest ctc loss on the validation set then the network was retrained with gaussian weight noise table presents the best result for different values of parameters for hf optimization we followed the basic setup described in but different parameters were utilized tikhonov damping was used together with heuristics the value of the damping parameter was initialized to and adjusted according to the reduction ratio multiplied by if divided by if and unchanged otherwise the initial search direction for each run of cg was set to the cg direction found by the previous hf optimization iteration decayed by to ensure that cg followed the descent direction we continued to perform minimum and maximum of additional cg iterations after it found the first descent direction we terminated cg at iteration before reaching the maximum iteration if the following condition was satisfied xi xi where is the quadratic objective of cg without offset the training data were divided into and for the handwriting and phoneme recognition experiments respectively and used for both the gradient and product calculation the learning was stopped if any of two criteria did not improve for epochs and epochs in handwriting and phoneme recognition respectively for sgd optimization 
the learning rate was chosen from and the momentum from for handwriting recognition the best performance obtained using all the possible combinations of parameters is presented in table for phoneme recognition the best parameters out of nine candidates for each network were selected after training without weight noise based on the ctc loss additionally the backpropagated error in lstm layer was clipped to remain in the range for stable learning the learning was stopped after epochs had been processed and the final performance was evaluated using the weight parameters that showed the best label error rate on the validation set it should be noted that in order to guarantee the convergence we selected conservative criterion as compared to the study where the network converged after epochs in handwriting recognition and after epochs in phoneme recognition results table presents the label error rate on the test set for handwriting recognition in all cases the networks trained using hf optimization outperformed those using sgd the advantage of using hf is more pronounced as the depth increases the improvements resulting from the deeper architecture can be seen with the error rate dropping from to as the depth increases from to table shows the phoneme error rate per on the core set for phoneme recognition the improved performance according to the depth can be observed for both optimization methods the best per for hf optimization is at layers and that for sgd is at layers which are comparable to that reported in where the reported results are per of from network with layers having million weights and per of from network with layers having million weights the benefit of deeper network is obvious in terms of the number of weight parameters although this is not intended to be definitive performance comparison because of the different preprocessing the advantage of hf optimization is not prominent in the result of the experiments using the timit database one explanation is that the networks tend to overfit to relatively small number of the training data samples which removes the advantage of using advanced optimization techniques table experimental results for arabic offline handwriting recognition the label error rate is presented with the different network depths ab denotes stack of layers having hidden lstm cells in each layer epochs is the number of epochs required by the network using hf optimization so that the stopping criteria are fulfilled is the learning rate and is the momentum networks depth weights hf epochs sgd table experimental results for phoneme recognition using the timit corpus per is presented with the different mdrnn architectures depth block is the standard deviation of gaussian weight noise the remaining parameters are the same as in table networks weights hf epochs sgd the results were reported by graves in conclusion optimization as an approach for successful learning of deep mdrnns in conjunction with ctc was presented to apply hf optimization to ctc convex approximation of its objective function was explored in experiments improvements in performance were seen as the depth of the network increased for both hf and sgd hf optimization showed significantly better performance for handwriting recognition than did sgd and comparable performance for speech recognition references alex graves supervised sequence labelling with recurrent neural networks volume springer alex graves marcus liwicki horst bunke schmidhuber and santiago unconstrained handwriting recognition with 
recurrent neural networks in advances in neural information processing systems pages alex graves and schmidhuber offline handwriting recognition with multidimensional recurrent neural networks in advances in neural information processing systems pages alex graves mohamed and geoffrey hinton speech recognition with deep recurrent neural networks in proceedings of icassp pages ieee adriana romero nicolas ballas samira ebrahimi kahou antoine chassang carlo gatta and yoshua bengio fitnets hints for thin deep nets corr url http geoffrey hinton and ruslan salakhutdinov reducing the dimensionality of data with neural networks science christian szegedy wei liu yangqing jia pierre sermanet scott reed dragomir anguelov dumitru erhan vincent vanhoucke and andrew rabinovich going deeper with convolutions in proceedings of the ieee conference on computer vision and pattern recognition pages james martens deep learning via optimization in proceedings of the international conference on machine learning pages james martens and ilya sutskever learning recurrent neural networks with optimization in proceedings of the international conference on machine learning pages sepp hochreiter and schmidhuber long memory neural computation nicol schraudolph fast curvature products for gradient descent neural computation barak pearlmutter fast exact multiplication by the hessian neural computation james martens and ilya sutskever training deep and recurrent networks with optimization in neural networks tricks of the trade pages springer alex graves santiago faustino gomez and schmidhuber connectionist temporal classification labelling unsegmented sequence data with recurrent neural networks in proceedings of the international conference on machine learning pages stephen boyd and lieven vandenberghe editors convex optimization cambridge university press amari natural gradient works efficiently in learning neural computation razvan pascanu and yoshua bengio revisiting natural gradient for deep networks in international conference on learning representations hyeyoung park amari and kenji fukumizu adaptive natural gradient learning algorithms for various stochastic models neural networks christopher bishop editor pattern recognition and machine learning springer mario pechwitz snoussi maddouri volker noureddine ellouze and hamid amiri of handwritten arabic words in proceedings of cifed pages the darpa timit continuous speech corpus timit in speech disc edition alex graves sequence transduction with recurrent neural networks in icml representation learning workshop lee and hon phone recognition using hidden markov models ieee transactions on acoustics speech and signal processing alex graves practical variational inference for neural networks in advances in neural information processing systems pages alex graves rnnlib recurrent neural network library for sequence learning problems 
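As a concrete illustration of the Hessian-free heuristics described in the experimental setup above (the Tikhonov damping adjusted by the reduction ratio, and the early termination of CG based on the quadratic objective without offset), the following Python sketch restates them. The exact thresholds and multiplicative factors used in the experiments were lost from the text, so the standard Levenberg-Marquardt-style values from Martens' HF work are shown instead and should be read as illustrative assumptions; the function names are ours, not the authors'.

```python
def update_damping(lmbda, rho, boost=1.5, drop=2.0 / 3.0, lo=0.25, hi=0.75):
    """Adjust the Tikhonov damping parameter from the reduction ratio rho.
    The thresholds and factors are standard illustrative values, not the
    (unrecoverable) values used in the experiments above."""
    if rho < lo:        # quadratic model over-estimated the improvement
        return lmbda * boost
    if rho > hi:        # quadratic model matched the true reduction well
        return lmbda * drop
    return lmbda        # otherwise leave the damping unchanged


def cg_should_stop(phi, i, k, eps=5e-4):
    """Relative-progress stopping test for CG on the quadratic objective phi
    (without offset): stop once phi is negative and the last k iterations
    improved it by less than a fraction k*eps of its current value."""
    if i <= k or phi[i] >= 0:
        return False
    return (phi[i] - phi[i - k]) / phi[i] < k * eps
```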
probabilistic predictors with and without guarantees of validity vladimir ivan and valentina department of computer science royal holloway university of london uk yandex moscow russia alushaf abstract this paper studies theoretically and empirically method of turning machinelearning algorithms into probabilistic predictors that automatically enjoys property of validity perfect calibration and is computationally efficient the price to pay for perfect calibration is that these probabilistic predictors produce imprecise in practice almost precise for large data sets probabilities when these imprecise probabilities are merged into precise probabilities the resulting predictors while losing the theoretical property of perfect calibration are consistently more accurate than the existing methods in empirical studies introduction prediction algorithms studied in this paper belong to the class of predictors introduced in they are based on the method of isotonic regression and prompted by the observation that when applied in machine learning the method of isotonic regression often produces miscalibrated probability predictions see it has also been reported section that isotonic regression is more prone to overfitting than platt scaling when data is scarce the advantage of predictors is that they are special case of venn predictors chapter and so theorem are always cf proposition below they can be considered to be regularized version of the procedure used by which helps them resist overfitting the main desiderata for venn and related conformal chapter predictors are validity predictive efficiency and computational efficiency this paper introduces two computationally efficient versions of predictors which we refer to as inductive predictors ivaps and predictors cvaps the ways in which they achieve the three desiderata are validity in the form of perfect calibration is satisfied by ivaps automatically and the experimental results reported in this paper suggest that it is inherited by cvaps predictive efficiency is determined by the predictive efficiency of the underlying learning algorithms so that the full arsenal of methods of modern machine learning can be brought to bear on the prediction problem at hand computational efficiency is again determined by the computational efficiency of the underlying algorithm the computational overhead of extracting probabilistic predictions consists of sorting which takes time log where is the number of observations and other computations taking time an advantage of venn prediction over conformal prediction which also enjoys validity guarantees is that venn predictors output probabilities rather than and probabilities in the spirit of bayesian decision theory can be easily combined with utilities to produce optimal decisions in sections and we discuss ivaps and cvaps respectively section is devoted to minimax ways of merging imprecise probabilities into precise probabilities and thus making ivaps and cvaps precise probabilistic predictors in this paper we concentrate on binary classification problems in which the objects to be classified are labelled as or most of machine learning algorithms are scoring algorithms in that they output score for each test object which is then compared with threshold to arrive at categorical prediction or as precise probabilistic predictors ivaps and cvaps are ways of converting the scores for test objects into numbers in the range that can serve as probabilities or calibrating the scores in section we briefly discuss two existing 
calibration methods platt and the method based on isotonic regression section is devoted to experimental comparisons and shows that cvaps consistently outperform the two existing methods more extensive experimental studies can be found in inductive predictors ivaps in this paper we consider data sequences usually loosely referred to as sets consisting of observations each observation consisting of an object and label we only consider binary labels we are given training set whose size will be denoted this section introduces inductive predictors our main concern is how to implement them efficiently but as functions an ivap is defined in terms of scoring algorithm see the last paragraph of the previous section as follows divide the training set of size into two subsets the proper training set of size and the calibration set of size so that train the scoring algorithm on the proper training set find the scores sk of the calibration objects xk when new test object arrives compute its score fit isotonic regression to sk yk obtaining function fit isotonic regression to sk yk obtaining function the multiprobability prediction for the label of is the pair intuitively the prediction is that the probability that is either or notice that the multiprobability prediction output by an ivap always satisfies and so and can be interpreted as the lower and upper probabilities respectively in practice they are close to each other for large training sets first we state formally the property of validity of ivaps adapting the approach of to ivaps random variable taking values in is perfectly calibrated as predictor for random variable taking values in if selector is random variable taking values in as general rule in this paper random variables are denoted by capital letters are random objects and are random labels proposition let be an ivap prediction for based on training sequence xl yl there is selector such that ps is perfectly calibrated for provided the random observations xl yl are our next proposition concerns the computational efficiency of ivaps proposition will be proved later in this section while proposition is proved in proposition given the scores sk of the calibration objects the prediction rule for computing the ivap predictions can be computed in time log and space its application to each test object takes time log given the sorted scores of the calibration objects the prediction rule can be computed in time and space proofs of both statements rely on the geometric representation of isotonic regression as the slope of the gcm greatest convex minorant of the csd cumulative sum diagram see pages especially theorem to make our exposition more we define both gcm and csd below first we explain how to fit isotonic regression to sk yk without necessarily assuming that si are the calibration scores and yi are the calibration labels which will be needed to cover the use of isotonic regression in ivaps we start from sorting all scores sk in the increasing order and removing the duplicates this is the most computationally expensive step in our calibration procedure log in the worst case let be the number of distinct elements among sk the cardinality of the set sk define to be the jth est element of sk so that define wj si to be the number of times occurs among sk finally define yi wj si to be the average label corresponding to si the csd of sk yk is the set of points pi wj wj in particular the gcm is the greatest convex minorant of the csd the value at of the isotonic regression fitted to sk yk is defined 
to be the slope of pi the gcm between wj and wj the values at other are somewhat arbitrary namely the value at can be set to anything between the left and right slopes of the gcm at pi wj but are never needed in this paper unlike in the standard use of isotonic regression in machine learning is the value of the isotonic regression fitted to sequence that already contains proof of proposition set the statement of the proposition even holds conditionally on knowing the values of xm ym and the multiset xl yl this knowledge allows us to compute the scores sk of the calibration objects xl and the test object the only remaining randomness is over the equiprobable permutations of xl yl in particular is drawn randomly from the multiset sk yl it remains to notice that according to the gcm construction the average label of the calibration and test observations corresponding to given value of ps is equal to ps the idea behind computing the pair efficiently is to two vectors and storing and respectively for all possible values of let and be as defined above in the case where sk are the calibration scores and yk are the corresponding labels the vectors and are of length and for all and both is the value of when therefore for all is also the value of when is just to the left of is also the value of when is just to the right of since and can change their values only at the points the vectors and uniquely determine the functions and respectively for details of computing and see remark there are several algorithms for performing isotonic regression on partially rather than linearly ordered set see section although one of the algorithms described in that section the minimax order algorithm was later shown to be defective therefore ivaps and cvaps below can be defined in the situation where scores take values only in partially ordered set moreover proposition will continue to hold for the reader familiar with the notion of venn predictors we could also add that predictors will continue to be venn predictors which follows from the isotonic regression being the average of the original function over certain equivalence classes the importance of partially ordered scores stems from the fact that they enable us to benefit from possible synergy between two or more prediction algorithms suppose that one prediction algorithm outputs scalar scores for the calibration algorithm cvap predictor for training set split the training set into folds tk for ivap tk tk return gm gm gm objects xk and another outputs for the same calibration objects we would like to use both sets of scores we could merge the two sets of scores into composite vector scores si and then classify new object as described earlier using its composite score where and are the scalar scores computed by the two algorithms and the partial order between composite scores is preliminary results reported in in related context suggest that the resulting predictor can outperform predictors based on the individual scalar scores however we will not pursue this idea further in this paper cross predictors cvaps cvap is just combination of ivaps where is the parameter of the algorithm it is described as algorithm where ivap stands for the output of ivap applied to as proper training set as calibration set and as test object and gm stands for geometric mean so that gm is the geometric mean of pk and gm is the geometric mean of the folds should be of approximately equal size and usually the training set is split into folds at random although we choose contiguous folds in 
section to facilitate reproducibility one way to obtain random assignment of the training observations to folds see line is to start from regular array in which the first observations are assigned to fold the following observations are assigned to fold up to the last lk observations which are assigned to fold where for all and then to apply random permutation remember that the procedure andomize in lace section can do the last step in time see the next section for justification of the expression gm gm gm used for merging the ivaps outputs making probability predictions out of multiprobability ones in cvap algorithm we merge the multiprobability predictions output by ivaps in this section we design minimax way for merging them essentially following for the function the result is especially simple gm gm gm let us check that gm gm gm is indeed the minimax expression under log loss suppose the pairs of lower and upper probabilities to be merged are pk and the merged probability is the extra cumulative loss suffered by over the correct members pk of the pairs when the true label is is pk log and the extra cumulative loss of over the correct members of the pairs when the true label is is log pk log equalizing the two expressions we obtain log pk pk which gives the required minimax expression for the merged probability since is decreasing and is increasing in for the computations in the case of the brier loss function see the argument above conditioned on the proper training set is also applicable to ivap in which case we need to set the probability predictor obtained from an ivap by replacing with will be referred to as the ivap and cvap is by definition comparison with other calibration methods the two alternative calibration methods that we consider in this paper are platt and isotonic regression platt method platt method uses sigmoids to calibrate the scores platt uses regularization procedure ensuring that the predictions of his method are always in the range where is the number of calibration observations labelled and is the number of calibration observations labelled it is interesting that the predictions output by the ivap are in the same range except that the are now allowed see isotonic regression there are two standard uses of isotonic regression we can train the scoring algorithm using what we call proper training set and then use the scores of the observations in disjoint calibration also called validation set for calibrating the scores of test objects as in alternatively we can train the scoring algorithm on the full training set and also use the full training set for calibration it appears that this was done in in both cases however we can expect to get an infinite log loss when the test set becomes large enough see the presence of regularization is an advantage of platt method it never suffers an infinite loss when using the log loss function there is no standard method of regularization for isotonic regression and we do not apply one empirical studies the main loss function that we use in our empirical studies is the log loss log if λlog log if where log is binary logarithm is probability prediction and is the true label another popular loss function is the brier loss λbr we choose the coefficient in front of in and the base of the logarithm in in order for the minimax predictor that always predicts to suffer loss an advantage of the brier loss function is that it still makes it possible to compare the quality of prediction in cases when prediction algorithms such as isotonic 
regression give categorical but wrong prediction and so are simply regarded as infinitely bad when using log loss the loss of probability predictor on test set will be measured by the arithmetic average of the losses it suffers on the test set namely by the mean log loss mll and the mean brier loss mbl mll λlog pi yi mbl λbr pi yi where yi are the test labels and pi are the probability predictions for them we will not be checking directly whether various calibration methods produce predictions since it is well known that lack of calibration increases the loss as measured by loss functions such as log loss and brier loss see for the most standard decomposition of the latter into the sum of the calibration error and refinement error in this section we compare ivaps ivaps whose outputs are replaced by probability predictions as explained in section and cvaps with platt method and the standard method based on isotonic regression the latter two will be referred to as platt and isotonic in our tables and figures for both ivaps and cvaps we use the procedure the procedure leads to virtually identical empirical results we use the same underlying algorithms as in namely decision trees abbreviated to decision trees with bagging bagging logistic regression sometimes abbreviated to logistic naive bayes neural networks and support vector machines svm as implemented in weka university of waikato new zealand the underlying algorithms except for svm produce scores in the interval which can be used directly as probability predictions referred to as underlying in our tables and figures or can be calibrated using the methods of or the methods proposed in this paper ivap or cvap in the tables and figures for illustrating our results in this paper we use the adult data set available from the uci repository this is the main data set used in and one of the data sets used in however the picture that we observe is typical for other data sets as well cf we use the original split of the data set into training set of ntrain observations and test set of ntest observations the results of applying the four calibration methods plus the vacuous one corresponding to just using the underlying algorithm to the six underlying algorithms for this data set are shown in figure the six top plots report results for the log loss namely mll as defined in and the six bottom plots for the brier loss namely mbl the underlying algorithms are given in the titles of the plots and the calibration methods are represented by different line styles as explained in the legends the horizontal axis is labelled by the ratio of the size of the proper training set to that of the calibration set except for the label all which will be explained later in particular in the case of cvaps it is labelled by the number one less than the number of folds in the case of cvaps the training set is split into equal or as close to being equal as possible contiguous folds the first dntrain training observations are included in the first fold the next dntrain or bntrain in the second fold etc first and then is used unless ntrain is divisible by in the case of the other calibration methods we used the first ntrain training observation as the proper training set used for training the scoring algorithm and the rest of the training observations are used as the calibration set in the case of log loss isotonic regression often suffers infinite losses which is indicated by the absence of the round marker for isotonic regression only one of the log losses for svm is finite 
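For reference, the two loss functions and their test-set averages discussed above can be written out explicitly. The base-2 logarithm is stated in the text; the coefficient of the Brier loss and the target loss value are reconstructed under the stated convention that the constant prediction 1/2 suffers unit loss, since the numerical constants themselves were lost from the text.

```latex
\lambda_{\log}(p,y)=
\begin{cases}
-\log_2 p & \text{if } y=1,\\[2pt]
-\log_2 (1-p) & \text{if } y=0,
\end{cases}
\qquad
\lambda_{\mathrm{br}}(p,y)=4\,(y-p)^2,
\qquad
\mathrm{MLL}=\frac{1}{k}\sum_{i=1}^{k}\lambda_{\log}(p_i,y_i),
\qquad
\mathrm{MBL}=\frac{1}{k}\sum_{i=1}^{k}\lambda_{\mathrm{br}}(p_i,y_i),
```

where k is the size of the test set, the y_i are the test labels and the p_i the probability predictions for them; with these constants the prediction 1/2 indeed suffers loss 1 under both losses.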
we are not trying to use ad hoc solutions such as clipping predictions to the interval for small since we are also using the bounded brier loss function the cvap lines tend to be at the bottom in all plots experiments with other data sets also confirm this the column all in the plots of figure refers to using the full training set as both the proper training set and calibration set in our official definition of ivap we require that the last two sets be disjoint but in this section we continue to refer to ivaps modified in this way simply as ivaps in such prediction algorithms were referred to as svaps simplified venn abers predictors using the full training set as both the proper training set and calibration set might appear naive and is never used in the extensive empirical study but it often leads to good empirical results on larger data sets however it can also lead to very poor results as in the case of bagging for ivap platt and isotonic the underlying algorithm that achieves the best performance in figure a natural question is whether cvaps perform better than the alternative calibration methods in figure and our other experiments because of applying cross validation in moving from ivap to cvap or because of the extra regularization used in ivaps the first reason is undoubtedly important for both loss functions and the second for the log loss function the second reason plays a smaller role for brier loss for relatively large data sets in the lower half of figure the curves for isotonic and ivap are very close to each other but ivaps are consistently better for smaller data sets even when using brier loss in tables and we apply the four calibration methods and six underlying algorithms to a much smaller training set namely to the first observations of the adult data set as the new training set following the first training observations are used as the proper training set the following training observations as the calibration set and all other observations the remaining training and all test observations are used as the new test set figure the log and brier losses of the four calibration methods applied to the six prediction algorithms on the adult data set the numbers on the horizontal axis are ratios of the size of the proper training set to the size of the calibration set in the case of cvaps they can also be expressed as where is the number of folds therefore column corresponds to the standard choice of folds in the method of missing curves or points on curves mean that the corresponding values either are too big and would squeeze unacceptably the interesting parts of the plot if shown or are infinite such as many results for isotonic regression under log loss the plots themselves are omitted here six panels per loss one per underlying algorithm each showing curves for underlying platt isotonic ivap and cvap the results are shown in tables for log loss and for brier loss they are consistently better for ivap than for ir isotonic regression results for nine very small data sets are given in tables and of where the results for ivap with the full training
set used as both proper training and calibration sets labelled sva in the tables in are consistently in cases out of the using brier loss better usually significantly better than for isotonic regression referred to as dir in the tables in the following information might help the reader in reproducing our results in addition to our code being publicly available for each of the standard prediction algorithms within weka that we use we optimise the parameters by minimising the brier loss on the calibration set apart from the column labelled all we can not use the log loss since it is often infinite in the case of isotonic regression we then use the trained algorithm to generate the scores for the calibration and test sets which allows us to compute probability predictions using platt method isotonic regression ivap and cvap all the scores apart from svm are already in the range and can be used as probability predictions most of the parameters are set to their default values and the only parameters table the log loss for the four calibration methods and six underlying algorithms for small subset of the adult data set algorithm platt ir ivap cvap bagging logistic naive bayes neural networks svm table the analogue of table for the brier loss algorithm bagging logistic naive bayes neural networks svm platt ir ivap cvap that are optimised are pruning confidence for and bagging ridge for logistic regression learning rate and momentum for neural networks multilayerperceptron and complexity constant for svm smo with the linear kernel naive bayes does not involve any parameters notice that none of these parameters are hyperparameters in that they do not control the flexibility of the fitted prediction rule directly this allows us to optimize the parameters on the training set for the all column in the case of cvaps we optimise the parameters by minimising the cumulative brier loss over all folds so that the same parameters are used for all folds to apply platt method to calibrate the scores generated by the underlying algorithms we use logistic regression namely the function mnrfit within matlab statistics toolbox for isotonic regression calibration we use the implementation of the pava in the package fdrtool namely the function monoreg for further experimental results see conclusion this paper introduces two new computationally efficient algorithms for probabilistic prediction ivap which can be regarded as regularised form of the calibration method based on isotonic regression and cvap which is built on top of ivap using the idea of whereas ivaps are automatically perfectly calibrated the advantage of cvaps is in their good empirical performance this paper does not study empirically upper and lower probabilities produced by ivaps and cvaps whereas the distance between them provides information about the reliability of the merged probability prediction finding interesting ways of using this extra information is one of the directions of further research acknowledgments we are grateful to the conference reviewers for numerous helpful comments and observations to vladimir vapnik for sharing his ideas about exploiting synergy between different learning algorithms and to participants in the conference machine learning prospects and applications october berlin and nips canada for their questions and comments the first author has been partially supported by epsrc grant and afosr grant semantic completions the second and third authors are grateful to their home institutions for funding their trips to 
references vladimir vovk and ivan petej predictors in nevin zhang and jin tian editors proceedings of the thirtieth conference on uncertainty in artificial intelligence pages corvallis or auai press miriam ayer daniel brunk george ewing reid and edward silverman an empirical distribution function for sampling with incomplete information annals of mathematical statistics xiaoqian jiang melanie osl jihoon kim and lucila smooth isotonic regression new method to calibrate predictive models amia summits on translational science proceedings antonis lambrou harris papadopoulos ilia nouretdinov and alex gammerman reliable probability estimates based on support vector machines for large multiclass datasets in lazaros iliadis ilias maglogiannis harris papadopoulos kostas karatzas and spyros sioutas editors proceedings of the aiai workshop on conformal prediction and its applications volume of ifip advances in information and communication technology pages berlin springer rich caruana and alexandru an empirical comparison of supervised learning algorithms in proceedings of the twenty third international conference on machine learning pages new york acm john platt probabilities for sv machines in alexander smola peter bartlett bernhard and dale schuurmans editors advances in large margin classifiers pages mit press vladimir vovk alex gammerman and glenn shafer algorithmic learning in random world springer new york bianca zadrozny and charles elkan obtaining calibrated probability estimates from decision trees and naive bayesian classifiers in carla brodley and andrea danyluk editors proceedings of the eighteenth international conference on machine learning pages san francisco ca morgan kaufmann vladimir vovk ivan petej and valentina fedorova probabilistic predictors with and without guarantees of validity technical report archive november full version of this paper richard barlow bartholomew bremner and daniel brunk statistical inference under order restrictions the theory and application of isotonic regression wiley london charles lee the algorithm and isotonic regression annals of statistics gordon murray nonconvergence of the minimax order algorithm biometrika vladimir vapnik intelligent learning similarity control and knowledge transfer talk at the yandex school of data analysis conference machine learning prospects and applications october berlin thomas cormen charles leiserson ronald rivest and clifford stein introduction to algorithms mit press cambridge ma third edition vladimir vovk the fundamental nature of the log loss function in lev beklemishev andreas blass nachum dershowitz berndt finkbeiner and wolfram schulte editors fields of logic and computation ii essays dedicated to yuri gurevich on the occasion of his birthday volume of lecture notes in computer science pages cham springer allan murphy new vector partition of the probability score journal of applied meteorology mark hall eibe frank geoffrey holmes bernhard pfahringer peter reutemann and ian witten the weka data mining software an update sigkdd explorations frank and asuncion uci machine learning repository 
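As a complement to the description of ivaps and the minimax merging rule above, the following Python sketch spells out the procedure directly. It is a naive restatement using scikit-learn's isotonic regression rather than the efficient O(k log k) rule based on the greatest convex minorant described in the paper; the function and variable names are ours, and handling of degenerate probabilities (exact zeros) is left out.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def ivap_predict(cal_scores, cal_labels, test_score):
    """One inductive Venn-Abers prediction (p0, p1) for a single test score.

    Fits isotonic regression to the calibration scores twice: once with the
    test object provisionally labelled 0 (giving g0) and once labelled 1
    (giving g1), then evaluates both fits at the test score."""
    scores = np.append(cal_scores, test_score)
    pair = []
    for provisional_label in (0, 1):
        iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
        iso.fit(scores, np.append(cal_labels, provisional_label))
        pair.append(float(iso.predict([test_score])[0]))
    return pair[0], pair[1]  # lower and upper probabilities


def merge_log_minimax(pairs):
    """Minimax (log-loss) merge of k multiprobability predictions (p0, p1):
    p = GM(p1) / (GM(1 - p0) + GM(p1)), GM being the geometric mean.
    For a single pair this reduces to p1 / (1 - p0 + p1)."""
    p0 = np.array([a for a, _ in pairs])
    p1 = np.array([b for _, b in pairs])
    gm1 = np.exp(np.mean(np.log(p1)))
    gm0 = np.exp(np.mean(np.log(1.0 - p0)))
    return gm1 / (gm0 + gm1)
```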
shepard convolutional neural networks jimmy sj sensetime group limited rensijie li xu sensetime group limited xuli qiong yan sensetime group limited yanqiong wenxiu sun sensetime group limited sunwenxiu abstract deep learning has recently been introduced to the field of computer vision and image processing promising results have been obtained in number of tasks including inpainting deconvolution filtering etc however previously adopted neural network approaches such as convolutional neural networks and sparse are inherently with translation invariant operators we found this property prevents the deep learning approaches from outperforming the if the task itself requires translation variant interpolation tvi in this paper we draw on shepard interpolation and design shepard convolutional neural networks shcnn which efficiently realizes trainable tvi operators in the network we show that by adding only few feature maps in the new shepard layers the network is able to achieve stronger results than much deeper architecture superior performance on both image inpainting and is obtained where our system outperforms previous ones while keeping the running time competitive introduction in the past few years deep learning has been very successful in addressing many aspects of visual perception problems such as image classification object detection face recognition to name few inspired by the breakthrough in computer vision several attempts have been made very recently to apply deep learning methods in vision as well as image processing tasks encouraging results has been obtained in number of tasks including image inpainting denosing image deconvolution dirt removal filtering etc powerful models with multiple layers of nonlinearity such as convolutional neural networks cnn sparse etc were used in the previous studies notwithstanding the rapid progress and promising performance we notice that the building blocks of these models are inherently translation invariant when applying to images the property makes the network architecture less efficient in handling translation variant operators exemplified by the image interpolation operation figure illustrates the problem of image inpainting typical translation variant interpolation tvi task the black region in figure indicates the missing region where the four selected patches with missing parts are visualized in figure the interpolation process for the central pixel in each patch is done by four different weighting functions shown in the bottom of figure this process can not be simply modeled by single kernel due to the inherent spatially varying property in fact the tvi operations are common in many vision applications image which aims to interpolate high resolution image with low resolution observation also suffers from the project page http figure illustration of translation variant interpolation the application of inpainting the black regions indicate the missing part four selected patches the bottom row shows the kernels for interpolating the central pixel of each patch same problem different local patches have different pattern of anchor points we will show that it is thus less optimal to use the traditional convolutional neural network to do the translation variant operations for task in this paper we draw on shepard method and devise novel cnn architecture named shepard convolutional neural networks shcnn which efficiently equips conventional cnn with the ability to learn translation variant operations for irregularly spaced data by adding only few 
feature maps in the new shepard layer and optimizing more powerful tvi procedure in the fashion the network is able to achieve stronger results than much deeper architecture we demonstrate that the resulting system is general enough to benefit number of applications with tvi operations related work deep learning methods have recently been introduced to the area of computer vision and image processing burger et al used simple neural network to directly learn mapping between noisy and clear image patches xie et al adopted sparse and demonstrated its ability to do blind image inpainting cnn was used in to tackle of problem of rain drop and dirt it demonstrated the ability of cnn to blindly handle translation variant problem in real world challenges xu et al advocated the use of generative approaches to guide the design of the cnn for deconvolution tasks in filters can be well approximated using cnn while it is feasible to use the translation invariant operators such as convolution to obtain the translation variant results in deep neural network architecture it is less effective in achieving high quality results for interpolation operations the first attempt using cnn to perform image connected the cnn approach to the sparse coding ones but it failed to beat the super resolution system in this paper we focus on the design of deep neural network layer that better fits the translation variant interpolation tasks we note that tvi is the essential step for wide range of vision applications including inpainting dirt removal noise suppression to name few analysis deep learning approaches without explicit tvi mechanism generated reasonable results in few tasks requiring translation variant property to some extent deep architecture with multiple layers of nonlinearity is expressive to approximate certain tvi operations given sufficient amount of training data it is however to beat based approaches while ensuring the high efficiency and simplicity to see this we experimented with the cnn architecture in and and trained cnn with three convolutional layers by using million synthetic image pairs network and training details as well as the concrete statistics of the data will be covered in the experiment section typical test images are shown in the left column of figure whereas the results of this model are displayed in the column of the same figure we found that visually very similar results as in are obtained namely obvious residues of the text are still left in the images we also experimented with much deeper network by adding more convolutional layers virtually replicating the network in by and times although slight visual differences are found in the results no fundamental improvement in the missing regions is observed namely residue still remains sensible next step is to explicitly inform the network about where the missing pixels are so that the network has the opportunity to figure out more plausible solutions for tvi operations for many applications the underlying mask indicating the processed regions can be detected or be known in advance sample applications include image image matting noise removal etc other applications such as sparse point propagation and super resolution by nature have the masks for unknown regions one way to incorporate the mask into the network is to treat it as an additional channel of the input we tested this idea with the same set of network and experimental settings as the previous trial the results showed that such additional piece of information did bring about 
improvement but still considerably far from satisfactory in removing the residues results are visualized in the column of figure to learn tractable tvi model we devise in the next session novel architecture with an effective mechanism to exploit the information contained in the mask shepard convolutional neural networks we initiate the attempt to leverage the traditional interpolation framework to guide the design of neural network architecture for tvi we turn to the shepard framework which weighs known pixels differently according to their spatial distances to the processed pixel specifically shepard method can be in convolution form jp ip if if mp mp where and are the input and output images respectively indexes the image coordinates is the binary indicator mp indicates the pixel values are unknown is the convolution operation is the kernel function with its weights inversely proportional to the distance between pixel with mp and the pixel to process the division between the convolved image and the convolved mask naturally controls the way how pixel information is propagated across the regions it thus enables the capability to handle interpolation for data and make it possible translation variant the key element in shepard method affecting the interpolation result is the definition of the convolution kernel we thus propose new convolutional layer in the light of shepard method but allow for more flexible kernel design the layer is referred to as the shepard interpolation layer figure comparison between shcnn and cnn in image inpainting input images left results from regular cnn results from regular cnn trained with masks our results right the shepard interpolation layer the pass of the trainable interpolation layer can be mathematically described as the following equation knij fin mn mn ij where is the index of layers the subscript in fin is the index of feature maps in layer in index the feature maps in layer and mn are the input and the mask of the current layer respectively represents all the feature maps in layer kij are the trainable kernels which are shared in both numerator and denominator in computing the fraction concretely same kij is to be convolved with both the activations of the last layer in the numerator and the mask of the current layer mn in the denominator could be the output feature maps of regular layers in cnn such as convolutional layer or pooling layer it could also be previous shepard interpolation layer which is function of both and thus shepard interpolation layers can actually be stacked together to form highly nonlinear interpolation operator is the bias term and is the nonlinearity imposed to the network is smooth and differentiable function therefore standard can be used to train the parameters figure illustrates our neural network architecture with shepard interpolation layers the inputs of the shepard interpolation layer are maps as well as masks indicating where interpolation should occur note that the interpolation layer can be applied repeatedly to construct more complex interpolation functions with multiple layers of nonlinearity the mask is binary map of value one for the known area zero for the missing area same kernel is applied to the image and the mask we note that the mask for layer can be automatically generated by the result of previous convolved mask kn mn by zeroing out insignificant values and thresholding it it is important for tasks with relative large missing areas such as inpainting where sophisticated ways of propagation may be learned 
from data by shepard interpolation layer with nonlinearity this is also flexible way to balance the kernel size and the depth of the network we refer to figure illustration of shcnn architecture for multiple layers of interpolation convolutional neural network with shepard interpolation layers as shepard convolutional neural network shcnn discussion although standard can be used because is function of both ks in the fraction matrix form of the quotient rule for derivatives need to be used in deriving the equations of the interpolation layer to make the implementation efficient we unroll the two convolution operations and into two matrix multiplications denoted and where and are the unrolled versions of and is the rearrangement of the kernels where each kernel is listed in single row is the error function to compute the distance between the network output and the ground truth norm is used to compute this distance we also denote the derivative of the error function with respect to can be computed the same way as in previous cnn papers once this value is computed we show that the derivative of with respect to the kernels connecting th node in th layer to ith node in nth layer can be computed by wij mjm ijm wij ijm mjm δim wij mjm where is the column index in and the denominator of each element in the outer summation in eq is different therefore the numerator of each summation element has to be computed separately while this operation can still be efficiently parallelized by vectorization it requires significantly more memory and computations than the regular cnns though it brings extra workload in training the new interpolation layer only adds fraction of more computation during the test time we can discern this from eq the only added operations are the convolution of the mask with the and the division because the two convolutions shares the same kernel it can be efficiently implemented by convolving with samples with the batch size of it thus keeps the computation of shepard interpolation layer competitive compare to the traditional convolution layer we note that it is also natural to integrate the interpolation layer to any previous cnn architecture this is because the new layer only adds mask input to the convolutional layer keeping all other interfaces the same this layer can also degenerate to fully connected layer because the unrolled version of eq merely contains matrix multiplication in the fraction therefore as long as the tvi operators are necessary in the task no matter where it is needed in the architecture and the type of layer before or after it the interpolation layer can be seamlessly plugged in last but not least the interpolation kernels in the layer is learned from data rather than therefore it is more flexible and could be more powerful than kernels on the other hand it is trainable so that the learned interpolation operators are embedded in the overall optimization objective of the model experiments we conducted experiments on two applications involving tvi the inpainting and the superresolution the training data was generated by randomly sampling million patches from natural images scraped from flickr grayscale patches of size were used for both tasks to facilitate the comparison with previous studies all psnr comparison in the experiment is based on grayscale results our model can be directly extended to process color images inpainting the natural images are contaminated by masks containing text of different sizes and fonts as shown in figure we assume the binary masks 
indicating missing regions are known in advance the shcnn for inpainting consists of five layers two of which are shepard interpolation layers we use the relu function to impose nonlinearity in all our experiments filters were used in the first shepard layer to generate feature maps followed by another shepard interpolation layer with filters the rest of the shcnn is a conventional cnn architecture the filters for the third layer are of size and are used to generate feature maps filters are used in the fourth layer filters are used to carry out the reconstruction of image details visual results are shown in the last column in figure the results of the comparisons are generated using the architecture in more examples are provided in the project webpage figure visual comparison factor upscaling of the butterfly image panels ground truth bicubic ksvd anr srcnn and shcnn with psnr values super resolution the quantitative evaluation of super resolution is conducted using synthetic data where the high resolution images are first downscaled by factor to generate low resolution patches to perform super resolution we upscale the low resolution patches and zero out the pixels in the upscaled images leaving one copy of pixels from the low resolution images in this regard super resolution can be seen as a special form of inpainting with repeated patterns of missing area table psnr comparison on the image set for upscaling of factor and methods compared bicubic anr srcnn our shcnn the per image values in db for baboon barbara bridge coastguard comic face flowers foreman lenna man monarch pepper and zebra and their averages are omitted here we use one shepard interpolation layer at the top with kernel size of and feature map number other configuration of the network is the same as that in our new network for inpainting during training weights were randomly initialized by drawing from gaussian distribution with zero mean and standard deviation of adagrad was used in all experiments with learning rate of and fudge factor of tables show the quantitative results of our shcnn in a widely used data set for upscaling images times times and times respectively we compared our method with several methods including the two current state of the art systems clear improvement over the state of the art systems can be observed visual comparison between our method and the previous methods is illustrated in figure and figure figure visual comparison factor upscaling of the bird image panels ground truth bicubic ksvd anr srcnn and shcnn with psnr values conclusions in this paper we disclosed the limitation of previous cnn architectures in image processing tasks in need of translation variant interpolation a new architecture based on shepard interpolation was proposed and successfully applied to image inpainting and super resolution the effectiveness of
with in cvpr image denoising can plain neural networks xu ren liu jia deep convolutional neural network for image deconvolution in nips eigen krishnan fergus restoring an image taken through window covered with dirt or rain in iccv xu ren yan liao jia deep filters in icml shepard interpolation function for data in acm national conference timofte smet gool adjusted anchored neighborhood regression for fast in accv lecun bottou bengio haffner learning applied to document recognition in proceedings of ieee zeyde elad protter on single image using curves and surfaces bevilacqua roumy guillemot morel based on nonnegative neighbor embedding in bmvc chang yeung xiong through neighbor embedding in cvpr timofte smet gool anchored neighborhood regression for fast examplebased in iccv duchi hazan singer adaptive subgradient methods for online learning and stochastic optimization journal of machine learning research 
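To make the Shepard interpolation layer described above concrete, the following PyTorch sketch implements its forward pass: the same trainable kernels are convolved with the incoming feature maps (numerator) and with the mask (denominator), the two results are divided elementwise, a bias and the ReLU nonlinearity are applied, and the next-layer mask is obtained by thresholding the convolved mask. This is a minimal sketch, not the authors' implementation: zeroing the unknown feature entries via the mask, the epsilon, the threshold, and the weight initialization are our assumptions, and the mask is taken to have the same channel count as the features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShepardLayer(nn.Module):
    """Minimal sketch of a Shepard interpolation layer (forward pass only).

    Dividing the convolved features by the convolved mask lets information
    propagate from known into unknown regions, which is what makes the
    operator translation variant."""

    def __init__(self, in_channels, out_channels, kernel_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(
            0.01 * torch.randn(out_channels, in_channels, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(out_channels))
        self.padding = kernel_size // 2
        self.eps = eps  # avoids division by zero far from any known pixel

    def forward(self, feats, mask):
        # feats: (N, C_in, H, W); mask: (N, C_in, H, W) with 1 = known, 0 = missing
        num = F.conv2d(feats * mask, self.weight, padding=self.padding)
        den = F.conv2d(mask, self.weight, padding=self.padding)
        out = F.relu(num / (den + self.eps) + self.bias.view(1, -1, 1, 1))
        # next-layer mask: threshold the convolved mask, zeroing out
        # locations to which no (or negligibly few) known pixels contributed
        new_mask = (den.abs() > self.eps).float()
        return out, new_mask
```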
matrix manifold optimization for gaussian mixtures reshad hosseini school of ece college of engineering university of tehran tehran iran suvrit sra laboratory for information and decision systems massachusetts institute of technology cambridge ma suvrit abstract we take new look at parameter estimation for gaussian mixture model gmms specifically we advance riemannian manifold optimization on the manifold of positive definite matrices as potential replacement for expectation maximization em which has been the de facto standard for decades an invocation of riemannian optimization however fails spectacularly it obtains the same solution as em but vastly slower building on intuition from geometric convexity we propose simple reformulation that has remarkable consequences it makes riemannian optimization not only match em nontrivial result on its own given the poor record nonlinear programming has had against em but also outperforms it in many settings to bring our ideas to fruition we develop welltuned riemannian lbfgs method that proves superior to known competing methods riemannian conjugate gradient we hope that our results encourage wider consideration of manifold optimization in machine learning and statistics introduction gaussian mixture models gmms are mainstay in variety of areas including machine learning and signal processing quick literature search reveals that for estimating parameters of gmm the expectation maximization em algorithm is still the de facto choice over the decades other numerical approaches have also been considered but methods such as conjugate gradients newton have been noted to be usually inferior to em the key difficulty of applying standard nonlinear programming methods to gmms is the positive definiteness pd constraint on covariances although an open subset of euclidean space this constraint can be difficult to impose especially in when approaching the boundary of the constraint set convergence speed of iterative methods can also get adversely affected partial remedy is to remove the pd constraint by using cholesky decompositions as exploited in semidefinite programming it is believed that in general the nonconvexity of this decomposition adds more stationary points and possibly spurious local another possibility is to formulate the pd constraint via set of smooth convex inequalities and apply methods but such sophisticated methods can be extremely slower on several statistical problems than simpler iterations especially for higher dimensions since the key difficulty arises from the pd constraint an appealing idea is to note that pd matrices form riemannian manifold and to invoke riemannian manifold optimization indeed if we operate on the we implicitly satisfy the pd constraint and may have better chance at focusing on likelihood maximization while attractive this line of thinking also fails an invocation of manifold optimization is also vastly inferior to em thus we need new approach to challenge the hegemony of em we outline one such new approach below remarkably using cholesky with the reformulation in does not add spurious local minima to gmms equivalently on the interior of the constraint set as is done by interior point methods their nonconvex versions though these turn out to be slow too as they are second order methods key idea intuitively the mismatch is in the geometry for gmms the of em is euclidean convex optimization problem whereas the gmm is not manifold even for single gaussian if we could reformulate the likelihood so that the single 
component maximization task which is the analog of the of em for gmms becomes manifold convex it might have substantial empirical impact this intuition supplies the missing link and finally makes riemannian manifold optimization not only match em but often also greatly outperform it to summarize the key contributions of our paper are the following introduction of riemannian manifold optimization for gmm parameter estimation for which we show how reformulation based on geodesic convexity is crucial to empirical success development of riemannian lbfgs solver here our main contribution is the implementation of powerful procedure which ensures convergence and makes lbfgs outperform both em and manifold conjugate gradients this solver may be of independent interest we provide substantive experimental evidence on both synthetic and we compare manifold optimization em and unconstrained euclidean optimization that reformulates the problem using cholesky factorization of inverse covariance matrices our results show that manifold optimization performs well across wide range of parameter values and problem sizes it is much less sensitive to overlapping data than em and displays much less variability in running times these results are very encouraging and we believe that manifold optimization could open new algorithmic avenues for mixture models and perhaps for other statistical estimation problems note to aid reproducibility of our results atlab implementations of our methods are available as part of the ixest toolbox developed by our group the manifold cg method that we use is directly based on the excellent toolkit anopt related work summarizing published work on em is clearly impossible so let us briefly mention few lines of related work xu and jordan examine several aspects of em for gmms and counter the claims of redner and walker who claim em to be inferior to generic nonlinear programming techniques however it is now that em can attain good likelihood values rapidly and scale to much larger problems than amenable to methods local convergence analysis of em is available in with more refined results in who show that for data with low overlap em can converge locally superlinearly our paper develops riemannian lbfgs which can also achieve local superlinear convergence for gmms some innovative methods have also been suggested where the pd constraint is handled via cholesky decomposition of covariance matrices however these works report results only for problems and near spherical covariances our idea of using manifold optimization for gmms is new though manifold optimization by itself is subject classic reference is more recent work is and even atlab toolbox exists in machine learning manifold optimization has witnessed increasing for optimization or optimization based on geodesic convexity background and problem setup the key object in this paper is the gaussian mixture model gmm whose probability density is xk αj pn µj σj rd and where pn is multivariate gaussian with mean rd and covariance that is pn det exp given samples xn we wish to estimate rd and weights the probability simplex this leads to the gmm optimization problem αj pn xi µj σj max log µj σj that is convex along geodesics on the pd manifold manifold optimization should not be confused with manifold learning separate problem altogether solving problem can in general require exponential time however our focus is more pragmatic similar to em we also seek to efficiently compute local solutions our methods are set in the framework of 
manifold optimization so let us now recall some material on manifolds manifolds and geodesic convexity smooth manifold is space that locally resembles euclidean space for optimization it is more convenient to consider riemannian manifolds smooth manifolds equipped with an inner product on the tangent space at each point these manifolds possess structure that allows one to extend the usual nonlinear optimization algorithms to them algorithms on manifolds often rely on geodesics curves that join points along shortest paths geodesics help generalize euclidean convexity to geodesic convexity in particular say is riemmanian manifold and also let be geodesic joining to such that γxy γxy γxy then set is geodesically convex if for all there is geodesic γxy contained within further function is geodesically convex if for all the composition γxy is convex in the usual sense the manifold of interest to us is pd the set of symmetric positive definite matrices at any point pd the tangent space is isomorphic to the set of symmetric matrices and the riemannian metric at is given by tr dς this metric induces the geodesic ch thus function is geodesically convex on set if it satisfies tf such functions can be nonconvex in the euclidean sense but are globally optimizable due to geodesic convexity this property has been important in some matrix theoretic applications and has gained more extensive coverage in several recent works we emphasize that even though the mixture cost is not geodesically convex for gmm optimization geodesic convexity seems to play crucial role and it has huge impact on convergence speed this behavior is partially expected and analogous to em where convex makes the overall method much more practical this intuition guides us to elicit geodesic convexity below problem reformulation we begin with parameter estimation for single gaussian although this has solution which ultimately benefits em it requires more subtle handling when using manifold optimization consider the following maximum likelihood parameter estimation for single gaussian xn max log pn xi although is euclidean convex problem it is not geodesically convex on its domain rd pd which makes it geometrically handicapped when applying manifold optimization to overcome this problem we invoke simple that has impact more precisely we augment the sample vectors xi to instead consider yit xti therewith turns into xn max log qn yi where qn yi exp pn yi proposition states the key property of proposition the map where is as in is geodesically convex we omit the proof due to space limits see for details alternatively see for more general results on geodesic convexity theorem shows that the solution to yields the solution to the original problem too though under very strong assumptions it has polynomial smoothed complexity this reparametrization in itself is probably folklore its role in gmm optimization is what is crucial here average average lbfgs reformulated mvn cg reformulated mvn lbfgs usual mvn cg usual mvn time seconds lbfgs reformulated mvn cg reformulated mvn lbfgs original mvn cg original mvn single gaussian time seconds mixtures of seven gaussians figure the effect of reformulation in convergence speed of manifold cg and manifold lbfgs methods note that the time is on logarithmic scale for theorem if maximize and if maximizes then proof we express by new variables and by writing sttt stt st the objective function in terms of the new parameters becomes log log log det xn xi xi optimizing lb over we see that must hold hence the 
objective reduces to ddimensional gaussian for which clearly and theorem shows that reformulation is faithful as it leaves the optimum unchanged theorem proves local version of this result for gmms theorem local maximum of the reparameterized gmm xk xn log αj qn yi sj is local maximum of the original xn µj σj log αj pn xi σj the proof can be found in theorem shows that we can replace problem by one whose local maxima agree with those of and whose individual components are geodesically convex figure shows the true import of our reformulation the dramatic impact on the empirical performance of riemmanian conjugategradient cg and riemannian lbfgs for gmms is unmistakable the final technical piece is to replace the simplex constraint to make the unconstrained we do this via commonly used change of variables ηk log ααkk for assuming ηk is constant the final gmm optimization problem is max sj ηj ηj log exp ηj qn yi sj pk exp ηk we view as manifold problem specifically it is an optimization problem on the qk product manifold the next section presents method for solving it manifold optimization in unconstrained euclidean optimization typically one iteratively finds descent direction and ii performs to obtain sufficient decrease and ensure convergence on riemannian manifold the descent direction is computed on the tangent space this space varies smoothly as one moves along the manifold at point the tangent space tx is the approximating vector space see fig given descent direction ξx tx is performed along smooth curve on the manifold red curve in fig the derivative of this curve at equals the descent direction ξx we refer the reader to for an in depth introduction to manifold optimization successful euclidean methods such as and lbfgs combine gradients at the current point with gradients and descent directions from previous points to obtain new descent direction to adapt such algorithms to manifolds in addition to defining gradients on manifolds we also need to define how to transport vectors in tangent space at one point to vectors in different tangent space at another point tx on riemannian manifolds the gradient is simply direction on the tangent space where the of the gradient with another direction in the tangent space gives the directional derivative of the function formally if gx figure visualization of on manifold is point on the manifold tx defines the inner product in the tangent space tx then is the tangent space at the point ξx is df gx gradf for tx descent direction at the red curve is the given descent direction in the tangent space the curve curve along which is performed along which we perform can be geodesic map that takes the direction and step length to obtain corresponding point on the geodesic is called an exponential map riemannian manifolds are also equipped with natural way of transporting vectors on geodesics which is called parallel transport intuitively parallel transport is differential map with zero derivative along the geodesics using the above ideas algorithm sketches generic manifold optimization algorithm algorithm sketch of an optimization algorithm cg lbfgs to minimize on manifold given riemannian manifold with riemannian metric parallel transport on exponential map initial value smooth function for do obtain descent direction based on stored information and gradf xk using metric and transport use to find such that it satisfies appropriate descent conditions calculate the retraction update rxk αξk based on the memory and need of algorithm store xk gradf xk and αξk 
end for return estimated minimum xk note that cartesian products of riemannian manifolds are again riemannian with the exponential map gradient and parallel transport defined as the cartesian product of individual expressions the inner product is defined as the sum of inner product of the components in their respective manifolds different variants of riemannian lbfgs can be obtained depending where to perform the vector definition tangent space metric between two tangent vectors at gradient at if euclidean gradient is exponential map at point in direction parallel transport of tangent vector from to expression for psd matrices space of symmetric matrices gς tr gradf rς exp eξe table summary of key riemannian objects for the pd matrix manifold transport we found that the version developed in gives the best performance once we combine it with algorithm satisfying wolfe conditions we present the crucial details below algorithm satisfying wolfe conditions to ensure riemannian lbfgs always produces descent direction it is necessary to ensure that the algorithm satisfies wolfe conditions these conditions are given by rxk αξk xk αdf xk ξk df rxk αξk txk rxk αξk ξk df xk ξk where note that df xk ξk gxk gradf xk ξk the derivative of xk in the direction ξk is the inner product of descent direction and gradient of the function practical algorithms implement stronger wolfe version of that enforces rxk αξk txk rxk αξk ξk df xk ξk similar to the euclidean case our algorithm is also divided into two phases bracketing and zooming during bracketing we compute an interval such that point satisfying wolfe conditions can be found in this interval in the zooming phase we obtain such point in the determined interval the function and its gradient used by the are rxk αξk df rxk αξk txk rxk αξk ξk the algorithm is essentially the same as the in the euclidean space the reader can also see its manifold incarnation in theory behind how this algorithm is guaranteed to find steplength satisfying strong wolfe conditions can be found in good choice of initial can greatly speed up the we propose the following choice that turns out to be quite effective in our experiments xk df xk ξk equation is obtained by finding that minimizes quadratic approximation of the function along the geodesic through the previous point based on xk and df xk df then assuming that change will be the same as in the previous step we write df df xk ξk combining and we obtain our estimate expressed in nocedal and wright suggest using either of for the initial or using where is set to be the obtained in the in the previous point we observed that if one instead uses instead one obtains substantially better performance than the other two approaches experimental results we have performed numerous experiments to examine effectiveness of our method below we report performance comparisons on both real and simulated data in all experiments we initialize the mixture parameters for all methods using all methods also use the same termination criteria they stop either when the difference of average between consecutive iterations falls below or when the number of iterations exceeds more extensive empirical results can be found in the longer version of this paper simulated data em performance is to depend on the degree of separation of the mixture components to assess the impact of this separation on our methods we generate data as proposed in the distributions are chosen so their means satisfy the following inequality kmi mj max tr σi tr σj where models the 
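degree of separation (presumably the standard c-separation requirement \(\|\mu_i-\mu_j\| \ge c\,\max\{\operatorname{tr}\Sigma_i,\operatorname{tr}\Sigma_j\}^{1/2}\), with larger \(c\) giving better-separated components).

Before turning to these experiments, the Riemannian objects summarized in the table above, namely the metric \(g_\Sigma(\xi,\eta)=\operatorname{tr}(\Sigma^{-1}\xi\,\Sigma^{-1}\eta)\), the Riemannian gradient \(\Sigma\,\nabla f(\Sigma)\,\Sigma\) (for a symmetric Euclidean gradient), and the exponential map \(\operatorname{Exp}_\Sigma(\xi)=\Sigma^{1/2}\exp(\Sigma^{-1/2}\xi\,\Sigma^{-1/2})\Sigma^{1/2}\), can be made concrete with a short sketch. The Python snippet below is our own illustration and not the authors' MATLAB code; the toy objective f(Sigma) = log det(Sigma) + tr(Sigma^{-1} A), which is geodesically convex and minimized at Sigma = A, and every name in the snippet is our own choice.

import numpy as np

def sym_expm(M):
    # matrix exponential of a symmetric matrix via its eigendecomposition
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.exp(w)) @ V.T

def spd_sqrt(Sigma):
    # square root and inverse square root of a symmetric positive definite matrix
    w, V = np.linalg.eigh(Sigma)
    return V @ np.diag(np.sqrt(w)) @ V.T, V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def riem_grad(Sigma, egrad):
    # Riemannian gradient for the metric g_Sigma(xi, eta) = tr(Sigma^-1 xi Sigma^-1 eta)
    return Sigma @ egrad @ Sigma

def exp_map(Sigma, xi):
    # Exp_Sigma(xi) = Sigma^(1/2) expm(Sigma^(-1/2) xi Sigma^(-1/2)) Sigma^(1/2),
    # equivalently Sigma expm(Sigma^-1 xi); the iterate stays positive definite
    s, s_inv = spd_sqrt(Sigma)
    return s @ sym_expm(s_inv @ xi @ s_inv) @ s

# toy geodesically convex objective: f(Sigma) = log det Sigma + tr(Sigma^-1 A)
A = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma = np.eye(2)
for _ in range(50):
    Sigma_inv = np.linalg.inv(Sigma)
    egrad = Sigma_inv - Sigma_inv @ A @ Sigma_inv   # Euclidean gradient of f
    rgrad = riem_grad(Sigma, egrad)                 # here this equals Sigma - A
    Sigma = exp_map(Sigma, -0.5 * rgrad)            # fixed step in place of a line search
print(np.round(Sigma, 3))                           # converges to A

A full solver would replace the fixed step by the Wolfe line search described above and use an LBFGS or CG direction, running over the product manifold of the augmented covariances and the unconstrained weights. Returning to the simulated-data experiments and the role of the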
degree of separation since mixtures with high eccentricity the ratio of the largest eigenvalue of the covariance matrix to its smallest eigenvalue have smaller overlap in em original time all lbfgs reformulated time all cg reformulated time all cg original time all table speed and average all comparisons for each row reports values averaged over runs over different datasets so the all values are not comparable to each other em original time all lbfgs reformulated time all cg reformulated time all cg original time all table speed and all comparisons for cg cholesky original time all time all cg cholesky reformulated time all time all table speed and all for applying cg on problems with addition to high eccentricity we also test the spherical case where we test three levels of separation low medium and high we test two different numbers of mixture components and we consider experiments with larger values of in our experiments on real data for the results for data with dimensionality are given in table the results are obtained after running with different random choices of parameters for each configuration it is apparent that the performance of em and riemannian optimization with our reformulation is very similar the variance of computation time shown by riemmanian optimization is however notably smaller manifold optimization on the problem last column performs the worst in another set of simulated data experiments we apply different algorithms to spherical data the results are shown in table the interesting instance here is the case of low separation where the condition number of the hessian becomes large as predicted by theory the em converges very slowly in such case table confirms this claim it is known that in this case the performance of powerful optimization approaches like cg and lbfgs also degrades but both cg and lbfgs suffer less than em while lbfgs performs noticeably better than cg cholesky decomposition is commonly suggested idea for dealing with pd constraint so we also compare against unconstrained optimization using euclidean cg where the inverse covariance matrices are cholesky factorized the results for the same data as in tables and are reported in table although the problem proves to be much inferior to both em and the manifold methods our reformulation seems to also help it in several problem instances real data we now present performance evaluation on natural image dataset where mixtures of gaussians were reported to be good fit to the data we extracted image patches of size from images and subtracted the dc component leaving us with vectors performance of different algorithms are reported in table similar to the simulated results performance of em and em algorithm time all lbfgs reformulated time all cg reformulated time all cg original time all cg cholesky reformulated time all table speed and all comparisons for natural image data em usual mvn lbfgs reparameterized mvn cg reparameterized mvn all em original mvn lbfgs reformulated mvn cg reformulated mvn all all em original mvn lbfgs reformulated mvn cg reformulated mvn function and gradient evaluations function and gradient evaluations function and gradient evaluations figure best all minus current all values with number of function and gradient evaluations left magic telescope middle year predict right natural images manifold cg on the reformulated parameter space is similar manifold lbfgs converges notably faster except for than both em and cg without our reformulation performance of the manifold methods degrades 
substantially note that for and cg without reformulation stops prematurely because it hits the bound of maximum iterations and therefore its all is smaller than the other two methods the table also shows results of the and reformulated problem it is more than times slower than manifold optimization optimizing the problem is the slowest not shown and it always reaches the maximum number of iterations before finding the local minimum fig depicts the typical behavior of our manifold optimization methods versus em the is the number of and gradient evaluations or the number of and in em fig and fig are the results of fitting gmms to the magic telescope and year prediction fig is the result for the natural image data of table apparently in the initial few iterations em is faster but manifold optimization methods match em in few iterations this is remarkable given that manifold optimization methods need to perform conclusions and future work we introduced riemannian manifold optimization as an alternative to em for fitting gaussian mixture models we demonstrated that for making manifold optimization succeed to either match or outperform em it is necessary to represent the parameters in different space and reformulate the cost function accordingly extensive experimentation with both experimental and real datasets yielded quite encouraging results suggesting that manifold optimization could have the potential to open new algorithmic avenues for mixture modeling several strands of practical importance are immediate and are part of our ongoing work extension to gmms through stochastic optimization ii use of richer classes of priors with gmms than the usual inverse wishart priors which are typically also used as they make the convenient which is actually just one instance of geodesically convex prior that our methods can handle iii incorporation of penalties for avoiding tiny clusters an idea that fits easily in our framework but not so easily in the em framework finally beyond gmms extension to other mixture models will be fruitful acknowledgments ss was partially supported by nsf grant available at uci machine learning dataset repository via https references absil mahony and sepulchre optimization algorithms on matrix manifolds princeton university press arthur and vassilvitskii the advantages of careful seeding in proceedings of the eighteenth annual symposium on discrete algorithms soda pages bhatia positive definite matrices princeton university press bishop pattern recognition and machine learning springer bonnabel stochastic gradient descent on riemannian manifolds automatic control ieee transactions on boumal mishra absil and sepulchre manopt matlab toolbox for optimization on manifolds the journal of machine learning research burer monteiro and zhang solving semidefinite programs via nonlinear programming part transformations and derivatives technical report rice university houston tx dasgupta learning mixtures of gaussians in foundations of computer science annual symposium on pages ieee dempster laird and rubin maximum likelihood from incomplete data via the em algorithm journal of the royal statistical society series duda hart and stork pattern classification john wiley sons edition ge huang and kakade learning mixtures of gaussians in high dimensions hosseini and mash al mixest an estimation toolbox for mixture models arxiv preprint hosseini and sra differential geometric optimization for gaussian mixture models jordan and jacobs hierarchical mixtures of experts and the em algorithm 
neural computation bach absil and sepulchre optimization on the cone of positive semidefinite matrices siam journal on optimization keener theoretical statistics springer texts in statistics springer lee introduction to smooth manifolds number in gtm springer ma xu and jordan asymptotic convergence rate of the em algorithm for gaussian mixtures neural computation mclachlan and peel finite mixture models john wiley and sons new jersey moitra and valiant settling the polynomial learnability of mixtures of gaussians in foundations of computer science focs annual ieee symposium on pages ieee murphy machine learning probabilistic perspective mit press naim and gildea convergence of the em algorithm for gaussian mixtures with unbalanced mixing coefficients in pages nocedal and wright numerical optimization springer redner and walker mixture densities maximum likelihood and the em algorithm siam review ring and wirth optimization methods on riemannian manifolds and their application to shape space siam journal on optimization salakhutdinov roweis and ghahramani optimization with em and in pages sra and hosseini geometric optimisation on positive definite matrices for elliptically contoured distributions in advances in neural information processing systems pages sra and hosseini conic geometric optimization on the manifold of positive definite matrices siam journal on optimization convex functions and optimization methods on riemannian manifolds kluwer vanderbei and benson on formulating semidefinite programming problems as smooth convex nonlinear optimization problems technical report vandereycken matrix completion by riemannian optimization siam journal on optimization verbeek vlassis and efficient greedy learning of gaussian mixture models neural computation wiesel geodesic convexity and covariance estimation ieee transactions on signal processing xu and jordan on convergence properties of the em algorithm for gaussian mixtures neural computation zoran and weiss natural images gaussian mixtures and dead leaves in advances in neural information processing systems pages 
convolutional neural networks for text categorization via region embedding rie johnson rj research consulting tarrytown ny usa riejohnson tong baidu beijing china rutgers university piscataway nj usa tzhang abstract this paper presents new framework with convolutional neural networks cnns for text categorization unlike the previous approaches that rely on word embeddings our method learns embeddings of small text regions from unlabeled data for integration into supervised cnn the proposed scheme for embedding learning is based on the idea of learning which is intended to be useful for the task of interest even though the training is done on unlabeled data our models achieve better results than previous approaches on sentiment classification and topic classification tasks introduction convolutional neural networks cnns are neural networks that can make use of the internal structure of data such as the structure of image data through convolution layers where each computation unit responds to small region of input data small square of large image on text cnn has been gaining attention used in systems for tagging entity search sentence modeling and so on to make use of the structure word order of text data since cnn was originally developed for image data which is and dense without modification it can not be applied to text documents which are highdimensional and sparse if represented by sequences of vectors in many of the cnn studies on text therefore words in sentences are first converted to word vectors the word vectors are often obtained by some other method from an additional large corpus which is typically done in fashion similar to language modeling though there are many variations use of word vectors obtained this way is form of learning and leaves us with the following questions how effective is cnn on text in purely supervised setting without the aid of unlabeled data can we use unlabeled data with cnn more effectively than using general word vector learning methods our recent study addressed on text categorization and showed that cnn without word vector layer is not only feasible but also beneficial when not aided by unlabeled data here we address also on text categorization building on we propose new framework that learns embeddings of small text regions instead of words from unlabeled data for use in supervised cnn the essence of cnn as described later is to convert small regions of data love it in document to feature vectors for use in the upper layers in other words through training convolution layer learns an embedding of small regions of data here we use the term embedding loosely to mean function in particular function that generates features that preserve the predictive structure applies cnn directly to vectors which leads to directly learning an embedding of small text regions regions of size tong zhang would like to acknowledge nsf nsf and nih for supporting his research like phrases or regions of size like sentences eliminating the extra layer for word vector conversion this direct learning of region embedding was noted to have the merit of higher accuracy with simpler system no need to tune for word vectors than supervised word cnn in which word vectors are randomly initialized and trained as part of cnn training moreover the performance of best cnn rivaled or exceeded the previous best results on the benchmark datasets motivated by this finding we seek effective use of unlabeled data for text categorization through direct learning of embeddings of text regions our new 
framework learns region embedding from unlabeled data and uses it to produce additional input additional to vectors to supervised cnn where region embedding is trained with labeled data specifically from unlabeled data we learn tv stands for defined later of text region through the task of predicting its surrounding context according to our theoretical finding has desirable properties under ideal conditions on the relations between two views and the labels while in reality the ideal conditions may not be perfectly met we consider them as guidance in designing the tasks for learning we consider several types of learning task trained on unlabeled data one task is to predict the presence of the concepts relevant to the intended task desire to recommend the product in the context and we indirectly use labeled data to set up this task thus we seek to learn useful specifically for the task of interest this is in contrast to the previous word learning methods which typically produce word embedding for general purposes so that all aspects either syntactic or semantic of words are captured in sense the goal of our region embedding learning is to map text regions to concepts relevant to the task this can not be done by word embedding learning since individual words in isolation are too primitive to correspond to concepts for example easy to use conveys positive sentiment but use in isolation does not we show that our models with outperform the previous best results on sentiment classification and topic classification moreover more direct comparison confirms that our region provide more compact and effective representations of regions for the task of interest than what can be obtained by manipulation of word embedding preliminary cnn for text categorization cnn is network equipped with convolution layers interleaved with pooling layers convolution layer consists of computation units each of which responds to small region of input small square of an image and the small regions collectively cover the entire data computation unit associated with the region of input computes where is the input region vector that represents the region weight matrix and bias vector rm are shared by all the units in the same layer and they are learned through training in input is document represented by vectors figure therefore we call cnn cnn can be either concatenation of vectors vector bow or vector for region love it it love it love it love concatenation bow the bow representation loses word order within the region but is more robust to data sparsity enables large region size such as and speeds up training by having fewer parameters this is what we mainly use for embedding learning from unlabeled data cnn with is called and cnn with the region size and stride distance between the region centers are note that we used tiny vocabulary for the vector examples above to save space but vocabulary of typical applications could be much larger in is componentwise function applying max to each vector component thus each computation unit generates an vector where is the number of weight vectors rows or neurons in other words convolution layer embodies an embedding of text regions which produces an vector for each text region in essence region embedding uses and absence of words in region as input to produce predictive features if presence of easy output good good acting output fun plot plot top layer convolution layer positive top layer pooling layer convolution layer size really love it input vectors figure cnn example region size 
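To make the seq and bow region representations described above concrete, here is a small self-contained sketch; the tiny vocabulary, the example region, and all names are our own illustration rather than the paper's data.

import numpy as np

vocab = ["dont", "hate", "i", "it", "love"]            # illustrative vocabulary
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

def seq_region_vector(words):
    # seq-CNN style: concatenate one-hot vectors, preserving word order within the region
    return np.concatenate([one_hot(w) for w in words])

def bow_region_vector(words):
    # bow-CNN style: bag-of-words of the region; order is lost but the size stays |V|
    return sum(one_hot(w) for w in words)

region = ["love", "it"]
print(seq_region_vector(region))    # length 2 * |V| = 10
print(bow_region_vector(region))    # length |V| = 5
# a convolution unit then computes sigma(W @ r + b) on each region vector r,
# which is the region embedding consumed by pooling and the top classifier

The bow variant is the one mainly used for embedding learning from unlabeled data, as noted above, since it is more robust to data sparsity and allows larger regions.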
stride good acting fun plot input figure learning by training to predict adjacent regions to use with absence of not is predictive indicator it can be turned into large feature value by having negative weight on not to penalize its presence and positive weights on the other three words in one row of more formal argument can be found in the supplementary material the vectors from all the text regions of each document are aggregated by the pooling layer by either maximum or average and used by the top layer linear classifier as features for classification here we focused on the convolution layer for other details should be consulted cnn with for text categorization it was shown in that cnn is effective on text categorization where the essence is direct learning of an embedding of text regions aided by new options of input region vector representation we go further along this line and propose learning framework that learns an embedding of text regions from unlabeled data and then integrates the learned embedding in supervised training the first step is to learn an embedding with the following property definition function is of if there exists function such that for any tv stands for of view by definition preserves everything required to predict another view and it can be trained on unlabeled data the motivation of tvembedding is our theoretical finding formalized in the appendix that essentially feature vector is as useful as for the purpose of classification under ideal conditions the conditions essentially state that there exists set of hidden concepts such that two views and labels of the classification task are related to each other only through the concepts in the concepts in might be for example pricey handy hard to use and so on for sentiment classification of product reviews while in reality the ideal conditions may not be completely met we consider them as guidance and design learning accordingly learning is related to feature learning and aso which learn linear embedding from unlabeled data through tasks such as predicting word or predicted labels from the features associated with its surrounding words these studies were however limited to linear embedding related method in learns word embedding so that left context and right context maximally correlate in terms of canonical correlation analysis while we share with these studies the general idea of using the relations of two views we focus on nonlinear learning of region embeddings useful for the task of interest and the resulting methods are very different an important difference of learning from is that it does not involve label guessing thus avoiding risk of label contamination used stacked denoising to extract features invariant across domains for sentiment classification from unlabeled data it is for neural networks which underperformed cnns in now let be the base cnn model for the task of interest and assume that has one convolution layer with region size note however that the restriction of having only one convolution layer is merely for simplifying the description we propose framework with the following two steps learning train neural network to predict the context from each region of size so that convolution layer generates feature vectors for each text region of size for use in the classifier in the top layer it is this convolution layer which embodies the that we transfer to the supervised learning model in the next step note that differs from cnn in that each small region is associated with its own final supervised 
learning integrate the learned the convolution layer of into so that the regions the output of convolution layer are used as an additional input to convolution layer train this final model with labeled data these two steps are described in more detail in the next two sections learning from unlabeled data we create task on unlabeled data to predict the context adjacent text regions from each region of size defined in convolution layer to see the correspondence to the definition of it helps to consider that assigns label to each text region fun plot instead of the ultimate task of categorizing the entire document this is sensible because cnn makes predictions by building up from these small regions in document good acting fun plot as in figure the clues for predicting label of fun plot are fun plot itself and its context good acting and is trained to predict from to approximate by as in definition and functions and are embodied by the convolution layer and the top layer respectively given document for each text region indexed by convolution layer computes which is the same as except for the superscript to indicate that these entities belong to the top layer linear model for classification uses as features for prediction and and the parameters are learned through training the input region vector representation can be either sequential bow or independent of in the goal here is to learn an embedding of text regions shared with all the text regions at every location context is used only in learning as prediction target not transferred to the final model thus the representation of context should be determined to optimize the final outcome without worrying about the cost at prediction time our guidance is the conditions on the relationships between the two views mentioned above ideally the two views should be related to each other only through the relevant concepts we consider the following two types of representation unsupervised target straightforward vector encoding of is bow vectors of the text regions on the left and right to if we distinguish the left and right the target vector is with vocabulary and if not one potential problem of this encoding is that adjacent regions often have syntactic relations the is often followed by an adjective or noun which are typically irrelevant to the task to identify sentiment and therefore undesirable simple remedy we found effective is vocabulary control of context to remove function words or if available from and only from the target vocabulary target another context representation that we consider is partially supervised in the sense that it uses labeled data first we train cnn with the labeled data for the intended task and apply it to the unlabeled data then we discard the predictions and only retain the internal output of the convolution layer which is an vector for each text region where is the number of neurons we use these vectors to represent the context has shown by examples that each dimension of these vectors roughly represents concepts relevant to the task desire to recommend the product report of faulty product and so on therefore an advantage of this representation is that there is no obvious noise between and since context is represented only by the concepts relevant to the task disadvantage is that it is only as good as the supervised cnn that produced it which is not perfect and in particular some relevant concepts would be missed if they did not appear in the labeled data final supervised learning integration of into supervised cnn we 
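first recap the tv-embedding training step with a short sketch and then describe the integration. The Python snippet below illustrates the prediction task described above, learning a region embedding by predicting the bag-of-words of adjacent regions from the bag-of-words of the current region; the dimensions, the weighting scheme, and all names are our own choices and this is not the authors' implementation.

import numpy as np

V, m = 1000, 100                           # illustrative target vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((m, V))     # region-embedding weights: the part that is reused later
b = np.zeros(m)
U = 0.01 * rng.standard_normal((V, m))     # top (prediction) layer: discarded after training
c = np.zeros(V)

def tv_embed(x):
    # u(x): rectifier unit applied to the bow vector of a small text region
    return np.maximum(0.0, W @ x + b)

def sgd_step(x, z, alpha, lr=0.1):
    # one SGD step on a weighted square loss sum_j alpha_j * (p_j - z_j)^2,
    # where p = U @ u(x) + c predicts the context bow z
    global W, b, U, c
    u = tv_embed(x)
    p = U @ u + c
    g_p = 2.0 * alpha * (p - z)            # gradient of the loss with respect to p
    g_u = (U.T @ g_p) * (u > 0)            # backpropagate through the rectifier
    U -= lr * np.outer(g_p, u); c -= lr * g_p
    W -= lr * np.outer(g_u, x); b -= lr * g_u

x = np.zeros(V); x[[3, 17, 42]] = 1.0      # bow of the current region
z = np.zeros(V); z[[5, 99]] = 1.0          # bow of its left/right context (the target view)
alpha = np.full(V, 0.01); alpha[z > 0] = 1.0   # crude presence/absence balancing, our choice
sgd_step(x, z, alpha)

After training on unlabeled data, only W and b (the convolution layer, that is, the tv-embedding) are kept. With the learned tv-embedding in hand, we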
use the obtained from unlabeled data to produce additional input to convolution layer by replacing with where is defined by is the output of the applied to the region we train this model with the labeled data of the task that is we update the weights bias and the parameters so that the designated loss function is minimized on the labeled training data and can be either fixed or updated for and in this work we fix them for simplicity note that while takes region as input itself is also an embedding of text regions let us call it and also supervised embedding as it is trained with labeled data to distinguish it from that is we use to improve the supervised embedding note that can be naturally extended to accommodate multiple by so that for example two types of obtained with the unsupervised target and the target can be used at once which can lead to performance improvement as they complement each other as shown later experiments our code and the experimental settings are available at data we used the three datasets used in imdb elec and as summarized in table imdb movie reviews comes with an unlabeled set to facilitate comparison with previous studies we used union of this set and the training set as unlabeled data elec consists of amazon reviews of electronics products to use as unlabeled data we chose reviews from the same data source so that they are disjoint from the training and test sets and that the reviewed products are disjoint from the test set on the classification of the topics on news unlabeled data was chosen to be disjoint from the training and test sets on the categorization of topics on since the official split for this task divides the entire corpus into training set and test set we used the entire test set as unlabeled data the transductive learning setting imdb elec train test unlabeled words words words words class single multi output sentiment topic table datasets is used only in table implementation we used the cnn models found to be effective in as our base models namely on and on training minimized weighted square loss αi zi pi where goes through the regions represents the target regions and is the model output the weights αi were set to balance the loss originating from the presence and absence of words or concepts in case of the target and to speed up training by eliminating some negative examples similar to negative sampling of to experiment with the unsupervised target we set to be bow vectors of adjacent regions on the left and right while only retaining the most frequent words with vocabulary control on sentiment classification function words were removed and on topic classification numbers and provided by were removed note that these words were removed from and only from the target vocabulary to produce the target we first trained the supervised cnn models with neurons and applied the trained convolution layer to unlabeled data to generate vectors for each region the rest of implementation follows supervised models minimized square loss with regularization and optional dropout and were the rectifier response normalization was performed optimization was done by sgd model selection on all the tested methods tuning of was done by testing the models on the portion of the training data and then the models were with the chosen using the entire training data performance results overview after confirming the effectiveness of our new models in comparison with the supervised cnn we report the performances of cnn which relies on word vectors with very large corpus table 
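Two formulas from the description above are worth restating explicitly (our reconstruction, consistent with the text but not necessarily the original notation). With an integrated tv-embedding, the computation unit of the supervised convolution layer becomes
\[
\sigma\big(W\cdot r_\ell(x) + V\cdot u_\ell(x) + b\big),
\]
where \(r_\ell(x)\) is the region vector, \(u_\ell(x)\) is the output of the tv-embedding applied to the \(\ell\)-th region, and \(V\) is an additional weight matrix trained with labeled data; and the tv-embedding training objective is a weighted square loss of the form
\[
\sum_{i}\sum_{j} \alpha_{i,j}\,\big(z_i[j]-p_i[j]\big)^2,
\]
with \(z_i\) the target (context) vector for region \(i\), \(p_i\) the model output, and the weights \(\alpha_{i,j}\) balancing the presence and absence terms.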
besides comparing the performance of approaches as whole it is also of interest to compare the usefulness of what was learned from unlabeled data therefore we show how it performs if we integrate the word vectors into our base model cnns figure in these experiments we also test word vectors trained by on our unlabeled data figure we then compare our models with two standard methods transductive svm tsvm and table and with the previous best results in the literature tables in all comparisons our models outperform the others in particular our region tvembeddings are shown to be more compact and effective than region embeddings obtained by simple manipulation of word embeddings which supports our approach of using region embedding instead of word embedding names in table bow vector bow vector vector target of training bow vector output of supervised embedding bow vector table tested linear svm with linear tsvm with cnn cnn simple cnn simple best our cnn all three imdb elec table error rates for comparison all the cnn models were constrained to have neurons the parentheses around the error rates indicate that were tuned on test data our cnn with we tested three types of as summarized in table the first thing to note is that all of our cnns table row outperform their supervised counterpart in row this confirms the effectiveness of the framework we propose in table for meaningful comparison all the cnns are constrained to have exactly one convolution layer except for cnn with neurons the supervised cnns within these constraints row are region size on imdb and elec and region size on they also served as our base models with region size parameterized on more complex supervised cnns from will be reviewed later on sentiment classification imdb and elec the region size chosen by model selection for our models was larger than for the supervised cnn this indicates that unlabeled data enabled effective use of larger regions which are more predictive but might suffer from data sparsity in supervised settings rows uses vector to initially represent each region thus retains word order partially within the region when used individually did not outperform the other which use bow instead rows but we found that it contributed to error reduction when combined with the others not shown in the table this implies that it learned from unlabeled data predictive information that the other two embeddings missed the best performances row were obtained by using all the three types of at once according to by doing so the error rates were improved by nearly imdb and elec and compared with the supervised cnn row as result of the three with different strengths complementing each other the error rate on in row slightly differs from because here we did not use the stopword list tors integrated into our base models better than cnn table row imdb additional dim elec error rate figure gn word error rate avg error rate imdb elec concat additional dim additional dim supervised concat average figure region word embeddings trained on our unlabeled data dimensionality of the additional input to supervised region embedding region word cnn it was shown in that cnn that uses the google news word vectors as input is competitive on number of sentence classification tasks these vectors were trained by the authors of on very large google news gn corpus billion words times larger than our unlabeled data argued that these vectors can be useful for various tasks serving as universal feature extractors we tested cnn which is equipped with three 
convolution layers with different region sizes and and using the gn vectors as input although used only neurons for each layer we changed it to and to match the other models which use neurons our models clearly outperform these models table row with relatively large differences comparison of embeddings besides comparing the performance of the approaches as whole it is also of interest to compare the usefulness of what was learned from unlabeled data for this purpose we experimented with integration of word embedding into our base models using two methods one takes the concatenation and the other takes the average of word vectors for the words in the region these provide additional input to the supervised embedding of regions in place of in that is for comparison we produce region embedding from word embedding to replace region we show the results with two types of word embeddings the gn word embedding above figure and word embeddings that we trained with the software on our unlabeled data the same data as used for learning and all others figure note that figure plots error rates in relation to the dimensionality of the produced additional input smaller dimensionality has an advantage of faster on the results first the region is more useful for these tasks than the tested word embeddings since the models with clearly outperform all the models with word embedding word vector concatenations of much higher dimensionality than those shown in the figure still underperformed region second since our region takes the form of with being bow vector the columns of correspond to words and therefore is the sum of columns whose corresponding words are in the region based on that one might wonder why we should not simply use the sum or average of word vectors obtained by an existing tool such as instead the suboptimal performances of average figure tells us that this is bad idea we attribute it to the fact that region embeddings learn predictiveness of and absence of words in region region embedding can be more expressive than averaging of word vectors thus an effective and compact region embedding can not be trivially obtained from word embedding in particular effectiveness of the combination of three in figure stands out additionally our mechanism of using information from unlabeled data is more effective than cnn since our cnns with gn figure outperform cnns with gn table row this is because in our model vectors the original features compensate for potential information loss in the embedding learned from unlabeled data this as well as embedding is major difference between our model and model standard methods many of the standard methods are not applicable to cnn as they require bow vectors as input we tested tsvm with vectors using svmlight tsvm underperformed the supervised on two of the three datasets note that for feasibility we only used the most frequent in the tsvm experiments thus showing the svm results also with vocabulary for comparison though on some datasets svm performance can be improved by use of all the million on imdb this is because the computational cost of tsvm turned out to be high taking several days even with vocabulary best cnn paragraph vectors ensemble of models our best svm dense nn best cnn our best table imdb previous error rates models svm three table elec previous error rates extra resource unlabeled data table and on the task topics with the split table rows since is it can be used with cnn random split of vocabulary and split into the first and last half of each 
document were tested to reduce the computational burden we report the best and unrealistic performances obtained by optimizing the including when to stop on the test data even with this unfair advantage to cotraining table row clearly underperformed our models the results demonstrate the difficulty of effectively using unlabeled data on these tasks given that the size of the labeled data is relatively large comparison with the previous best results we compare our models with the previous best results on imdb table our best model with three outperforms the previous best results by nearly all of our models with single table row also perform better than the previous results since elec is relatively new dataset we are not aware of any previous results our performance is better than best supervised cnn which has complex network architecture of three pairs in parallel table to compare with the benchmark results in we tested our model on the task with the split on in which more than one out of categories can be assigned to each document our model outperforms the best svm of and the best supervised cnn of table conclusion this paper proposed new cnn framework for text categorization that learns embeddings of text regions with unlabeled data and then labeled data as discussed in section region embedding is trained to learn the predictiveness of and absence of words in region in contrast word embedding is trained to only represent individual words in isolation thus region embedding can be more expressive than simple averaging of word vectors in spite of their seeming similarity our comparison of embeddings confirmed its advantage our region tvembeddings which are trained specifically for the task of interest are more effective than the tested word embeddings using our new models we were able to achieve higher performances than the previous studies on sentiment classification and topic classification appendix theory of suppose that we observe two views of the input and target label of interest where and are finite discrete sets assumption assume that there exists set of hidden states such that and are conditionally independent given in and that the rank of matrix is theorem consider of under assumption there exists function such that further consider of then under assumption there exists function such that the proof can be found in the supplementary material references rie ando and tong zhang framework for learning predictive structures from multiple tasks and unlabeled data journal of machine learning research rie ando and tong zhang feature generation model for learning in proceedings of icml yoshua bengio ducharme pascal vincent and christian jauvin neural probabilistic language model journal of marchine learning research ronan collobert and jason weston unified architecture for natural language processing deep neural networks with multitask learning in proceedings of icml ronan collobert jason weston bottou michael karlen koray kavukcuoglu and pavel kuksa natural language processing almost from scratch journal of machine learning research paramveer dhillon dean foster and lyle ungar learning of word embeddings via cca in proceedings of nips jianfeng gao patric pantel michael gamon xiaodong he and li dent modeling interestingness with deep neural networks in proceedings of emnlp xavier glorot antoine bordes and yoshua bengio domain adaptation for sentiment classification deep learning approach in proceedings of icml geoffrey hinton nitish srivastava alex krizhevsky ilya sutskever and ruslan 
salakhutdinov improving neural networks by preventing of feature detectors thorsten joachims transductive inference for text classification using support vector machines in proceedings of icml rie johnson and tong zhang effective use of word order for text categorization with convolutional neural networks in proceedings of naacl hlt nal kalchbrenner edward grefenstette and phil blunsom convolutional neural network for modeling sentences in proceedings of acl pages yoon kim convolutional neural networks for sentence classification in proceedings of emnlp pages quoc le and tomas mikolov distributed representations of sentences and documents in proceedings of icml yann lecun bottou yoshua bengio and patrick haffner learning applied to document recognition in proceedings of the ieee david lewis yiming yang tony rose and fan li new benchmark collection for text categorization research journal of marchine learning research andrew maas raymond daly peter pham dan huang andrew ng and christopher potts learning word vectors for sentiment analysis in proceedings of acl mesnil tomas mikolov marc aurelio ranzato and yoshua bengio ensemble of generative and discriminative techniques for sentiment analysis of movie reviews feb version tomas mikolov ilya sutskever kai chen greg corrado and jeffrey dean distributed representations of words and phrases and their compositionality in proceedings of nips andriy mnih and geoffrey hinton scalable hierarchical distributed language model in nips yelong shen xiaodong he jianfeng gao li deng and mensnil latent semantic model with structure for information retrieval in proceedings of cikm duyu tang furu wei nan yang ming zhou ting liu and bing qin learning word embedding for twitter sentiment classification in proceedings of acl pages joseph turian lev rainov and yoshua bengio word representations simple and general method for learning in proceedings of acl pages jason weston sumit chopra and keith adams tagspace semantic embeddings from hashtags in proceedings of emnlp pages liheng xu kang liu siwei lai and jun zhao product feature mining semantic clues versus syntactic constituents in proceedings of acl pages puyang xu and ruhi sarikaya convolutional neural network based triangular crf for joint intent detection and slot filling in asru 
parallel recursive search for exact map inference in graphical models akihiro kishimoto ibm research ireland radu marinescu ibm research ireland adi botea ibm research ireland akihirok adibotea abstract the paper presents and evaluates the power of parallel search for exact map inference in graphical models we introduce new parallel recursive search algorithm called sprbfaoo that explores the search space in manner while operating with restricted memory our experiments show that sprbfaoo is often superior to the current sequential search approaches leading to considerable up to with threads especially on hard problem instances introduction graphical models provide powerful framework for reasoning with probabilistic information these models use graphs to capture conditional independencies between variables allowing concise knowledge representation and efficient query processing algorithms combinatorial maximization or maximum posteriori map tasks arise in many applications and often can be efficiently solved by search schemes especially in the context of search spaces that are sensitive to the underlying problem structure recursive search rbfaoo is recent yet very powerful scheme for exact map inference that was shown to outperform current and methods by several orders of magnitude on variety of benchmarks rbfaoo explores the context minimal search graph associated with graphical model in manner even with nonmonotonic heuristics while running within restricted memory rbfaoo extends recursive bestfirst search rbfs to graphical models and thus uses threshold controlling technique to drive the search in like manner while using the available memory for caching up to now map solvers were developed primarily as sequential search algorithms however parallel processing can be powerful approach to boosting the performance of problem solver now that computing systems are ubiquitous one way to extract substantial from the hardware is to resort to parallel processing parallel search has been successfully employed in variety of ai areas including planning satisfiability and game playing however little research has been devoted to solving graphical models in parallel the only parallel search scheme for map inference in graphical models that we are aware of is the distributed branch and bound algorithm daoopt this assumes however large and distributed computational grid environment with hundreds of independent and loosely connected computing systems without access to shared memory space for caching and reusing partial results contribution in this paper we take radically different approach and explore the potential of parallel search for map tasks in environment which to our knowledge has not been attempted before we introduce sprbfaoo new parallelization of rbfaoo in sharedmemory environments sprbfaoo maintains single cache table shared among the threads in this way each thread can effectively reuse the search effort performed by others since all threads start from the root of the search graph using the same search strategy an effective load balancing is primal graph pseudo tree context minimal search graph figure simple graphical model and its associated search graph obtained without using sophisticated schemes as done in previous work an extensive empirical evaluation shows that our new parallel recursive search scheme improves considerably over current sequential search approaches in many cases leading to considerable up to using threads especially on hard problem instances background graphical models 
bayesian networks or markov random fields capture the factorization structure of distribution over set of variables graphical model is tuple hx fi where xi is set of variables indexed by set and di is the set of their finite domains of values ψα is set of discrete positive realvalued local functions defined on subsets of variables where is set of variable subsets we use and xα to indicate the scope of function ψα xα var ψα xi the function scopes yield primal graph whose vertices are the variables and whose edges connect any two variables that appear in the scope of the same graphical model defines factorized probability distribution on as follows ψα xα where the partition function normalizes the probability an important inference task which appears in many real world applications is maximum posteriori map sometimes called maximum probable explanation or mpe finds complete assignment to the variables that has the highest probability mode of the joint probability namely argmaxx ψα xα the task is to solve in general in this paper we focus on solving map as minimization problem by taking the negative logarithm of the local functions to avoid numerical issues namely argminx log ψα xα significant improvements for map inference have been achieved by using search spaces which often capture problem structure far better than standard or search methods pseudo tree of the primal graph captures the problem decomposition and is used to define the search space pseudo tree of an undirected graph is directed rooted tree such that every arc of not included in is in namely it connects node in to an ancestor in the arcs in may not all be included in given graphical model hx fi with primal graph and pseudo tree of the search tree st has alternating levels of or nodes corresponding to the variables and and nodes corresponding to the values of the or parent variable with edges weighted according to we denote the weight on the edge from or node to and node by identical subproblems identified by their context the partial instantiation that separates the from the rest of the problem graph can be merged yielding an search graph merging all nodes yields the context minimal search graph denoted by ct the size of ct is exponential in the induced width of along traversal of solution tree of ct is subtree such that it contains the root node of ct if an internal and node is in then all its children are in if an internal or node is in then exactly one of its children is in every tip node in nodes with no children is terminal node the cost of solution tree is the sum of the weights associated with its edges each node in ct is associated with value capturing the optimal solution cost of the conditioned rooted at it was shown that can be computed recursively based on the values of children or nodes by minimization and nodes by summation see also example figure shows the primal graph of simple graphical model with variables and binary functions figure displays the context minimal search graph based on the pseudo tree from figure the contexts are shown next to the pseudo tree nodes solution tree corresponding to the assignment is shown in red current sequential search methods for exact map inference perform either depthfirst or search prominent methods studied and evaluated extensively are the branch and bound aobb and search aobf more recently recursive search rbfaoo has emerged as the best performing algorithm for exact map inference rbfaoo belongs to the class of rbfs algorithms and employs local threshold controlling 
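mechanism.

In explicit form (our reconstruction, consistent with the definitions above), the factorized distribution and the MAP task read
\[
P(x) \;=\; \frac{1}{Z}\prod_{\alpha\in F}\psi_\alpha(x_\alpha),
\qquad
Z \;=\; \sum_{x}\prod_{\alpha\in F}\psi_\alpha(x_\alpha),
\]
\[
x^{*} \;=\; \arg\max_{x}\prod_{\alpha\in F}\psi_\alpha(x_\alpha)
\;=\; \arg\min_{x}\sum_{\alpha\in F}\big(-\log\psi_\alpha(x_\alpha)\big),
\]
a task that is NP-hard in general. The algorithm uses this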
mechanism to explore the search graph in like manner rbfaoo maintains at each node called on during search rbfaoo improves and caches in fixed size table which is calculated by propagating back the of children rbfaoo stops when at the root or it proves that there is no solution namely our parallel algorithm algorithm sprbfaoo for all from to nr cpu cores do launch trbfs root on separate thread wait for threads to finish their work return optimal cost as root in the cache we now describe sprbfaoo parallelization of rbfaoo in environments sprbfaoo threads start from the root and run in parallel as shown in algorithm threads share one cache table allowing them to reuse the results of each other an entry in the cache table corresponding to node is tuple with fields being lower bound on the optimal cost of node flag indicating whether is solved optimally virtual vq defined later in this section best known solution cost bs for node the number of threads currently working on and lock when accessing cache entry threads lock it temporarily for other threads the method ctxt identifies the context of which is further used to access the corresponding cache entry besides the cache shared among threads each thread will use two threshold values and for each node these are separated from one thread to another algorithm shows the procedure invoked on each thread when thread examines node it first increments in the cache the number of threads working on node line then it increases vq by an increment and stores the new value in the cache line the virtual vq is initially set to as more threads work on solving vq grows due to the repeated increases by in effect vq reflects both the estimated cost of node through its component and the number of threads working on by computing vq this way our goal is to dynamically control the degree to which threads overlap when exploring the search space when given area of the search space is more promising than others more than one thread are encouraged to work together within that area on the other hand when several areas are roughly equally promising threads should diverge and work on different areas indeed in algorithm the tests on lines and prevent thread from working on node if vq other conditions in these tests are discussed later large vq which increases the likelihood that vq may reflect less promising node large or many threads working on or both thus our strategy is an automated and dynamic way of tuning the number of threads working on solving node as function of how promising that node is we call this the thread coordination mechanism lines address the case of nodes with no children which are either terminal nodes or deadends in both cases method evaluate sets the solved flag to true the is set to for terminal algorithm method trbfs handling locks skipped for clarity require node incrementnrthreadsincache ctxt increasevqincache ctxt if has no children then solved evaluate saveincache ctxt solved decrementnrthreadsincache ctxt return generatechildren if is an or node then loop cbest vq bs bestchild min bs if vq then break cbest min cbest cbest cbest trbsf cbest continued from previous column if is an and node then loop vq bs sum min bs if vq then break cbest qcbest vqcbest unsolvedchild cbest vq vqcbest cbest qcbest trbsf cbest if nrthreadscache ctxt then vq decrementnrthreadsincache ctxt saveincache ctxt vq bs nodes and to otherwise method saveincache takes as argument the context of the node and four values to be stored in order in these fields of the 
corresponding cache entry solved vq and bs lines and show respectively the cases when the current node is an or node or an and node both these follow similar sequence of steps update vq and bs for from the children values lines also update lines an upper bound for the best solution cost known for so far methods bestchild and sum are shown in algorithm in these child node information is either retrieved from the cache if available or initialized with an admissible heuristic function perform the backtracking test lines and the thread backtracks to parent if at least one of the following conditions hold th vq discussed earlier solution containing can not possibly beat the best known solution we call this the suboptimality test or the node is solved the solved flag is true iff the node cost has been proven to be optimal or the node was proven not to have any solution otherwise select successor cbest to continue with lines at or nodes cbest is the child with the smallest vq among all children not solved yet see method bestchild at and nodes any unsolved child can be chosen then update the thresholds of cbest lines and and recursively process cbest lines the threshold is updated in similar way to rbfaoo including the overestimation parameter see however there are two key differences first we use vq instead of to obtain the thread coordination mechanism presented earlier secondly we use two thresholds th and thub instead of just th with thub being used to implement the suboptimality test when thread backtracks to parent if either solved flag is set or no other thread currently examines the thread sets vq to lines in algorithm in this way sprbfaoo reduces the frequency of the scenarios where is considered to be less promising finally the thread decrements in the cache the number of threads working on line and saves in the cache the recalculated vq bs and the solved flag line theorem with an admissible heuristic in use sprbfaoo returns optimal solutions proof sketch sprbfaoo bs at the root is computed from solution tree therefore bs additionally sprbfaoo determines solution optimality by using not vq but saved in the cache table by an discussion similar to theorem in holds for any saved in the cache table with admissible which indicates when sprbfaoo returns solution bs therefore bs we conjecture that sprbfaoo is also complete and leave more analysis as future work algorithm methods bestchild left and sum right require node require node stands for false stands for true initialize vq bs to initialize vq bs to for all ci child of do for all ci child of do if ctxt ci in cache then if ctxt ci in cache then qci sci vqci bsci fromcache ctxt ci qci sci vqci bsci fromcache ctxt ci else else qci sci vqci bsci ci ci qci sci vqci bsci ci ci qci ci qci qci vqci ci vqci vq vq vqci bs min bs ci bsci bs bs bsci if qci qci then sci sci qci return vq bs if vqci vq then vq vq vqci cbest ci else if vqci then vqci return cbest vq bs experiments we evaluate empirically our parallel sprbfaoo and compare it against sequential rbfaoo and aobb we also considered parallel aobb denote by spaobb which uses master thread to explore centrally the search graph up to certain depth and solves the remaining conditioned in parallel using set of worker threads the cache table is shared among the workers so that some workers may reuse partial search results recorded by others in our implementation the search space explored by the master corresponds to the first variables in the pseudo tree the performance of spaobb was very poor 
across all benchmarks, due to a noticeable search overhead as well as poor load balancing, and therefore its results are omitted hereafter. All competing algorithms (SPRBFAOO, RBFAOO and AOBB) use the mini-bucket elimination heuristic to guide the search. The heuristic is controlled by a parameter called the i-bound, which allows a trade-off between accuracy and computational requirements: higher values of the i-bound yield a more accurate heuristic but take more time and space to compute. The search algorithms were also restricted to a static variable ordering obtained as a depth-first traversal of the pseudo tree. Our benchmarks include three sets of instances: networks from genetic linkage analysis (denoted pedigree), grid networks, and protein interaction networks (denoted protein); we evaluated pedigree, grid, and protein instances from these sets. The experiments were run on a single multi-core machine. Following the original RBFAOO setup, RBFAOO ran with a fixed number of cache table entries and a fixed overestimation parameter; SPRBFAOO allocated fewer entries within the same amount of memory, because each entry stores extra information such as the virtual q-value. We fixed the increment δ throughout the experiments, except in those where we vary it, and imposed a time limit measured in hours. We also record typical ranges of problem-specific parameters, such as the number of variables, the maximum domain size, the induced width, and the depth of the pseudo tree, shown in the first table below.

Table: ranges of the benchmark problem parameters for the grid, pedigree, and protein benchmarks (numeric entries lost in extraction).

Table: number of unsolved problem instances vs. cores for RBFAOO and SPRBFAOO on grid, pedigree, and protein (numeric entries lost in extraction).

The primary performance measures reported are the run time and the number of node expansions during search. When the run time of a solver is discussed, the total CPU time, reported in seconds, is one metric used to show overall performance; the total CPU time consists of the heuristic compilation time and the search time.

Table: total CPU time (sec) and nodes expanded on grid and pedigree instances for MBE, AOBB, RBFAOO, and SPRBFAOO at several i-bounds, under the stated time limit (numeric entries lost in extraction).

Figure: total CPU time (sec) for RBFAOO vs. SPRBFAOO with a smaller (top) and a larger (bottom) i-bound, under the stated time limits, for the grid, pedigree, and protein benchmarks.

SPRBFAOO does not reduce the heuristic compilation time, which is calculated sequentially; note that parallelizing the heuristic compilation is an important extension left as future work.

Parallel versus sequential search. The table above shows detailed results, as total CPU time in seconds and nodes expanded, for solving grid and pedigree instances using parallel and sequential search. The columns are indexed by the i-bound, and for each problem instance we also record the heuristic time (denoted MBE) corresponding to each i-bound. SPRBFAOO ran multi-threaded. We can see that SPRBFAOO improves considerably over RBFAOO across all reported i-bounds. The benefit of parallel search is more clearly observed at smaller i-bounds, which correspond to relatively weak heuristics: in this case the heuristic is less likely to guide the search towards the more promising regions of the search space, and therefore diversifying the search via multiple parallel threads is key to achieving significant speedups. For example, on one grid instance SPRBFAOO is markedly faster than RBFAOO, and similarly SPRBFAOO solves an instance on which RBFAOO runs out of time. This is important since on very hard problem instances it may only be possible
to compute rather weak heuristics given limited resources. Notice also that the MBE time increases with the i-bound.

Figure: total search time (sec) and average speedup as a function of the parameter δ, under the stated time limits, for the grid, pedigree, and protein benchmarks.

The second table above shows the number of unsolved problems in each domain; note that SPRBFAOO solved all instances solved by RBFAOO. The figure above plots the total CPU time obtained by RBFAOO and SPRBFAOO using a smaller (resp. larger) i-bound, corresponding to relatively weak (resp. strong) heuristics; specific smaller and larger i-bounds were selected for grid, pedigree, and protein. The smaller i-bounds were the smallest for which SPRBFAOO could solve at least two thirds of the instances within the time limit, while the larger i-bounds were the largest for which we could compile the heuristics without running out of memory on all instances. The data points shown in green correspond to problem instances that were solved only by SPRBFAOO. As before, we notice the benefit of parallel search when using relatively weak heuristics; the largest speedup is obtained on the pdbilk protein instance at the smaller i-bound. As the i-bound increases and the heuristics become more accurate, the difference between RBFAOO and SPRBFAOO decreases, because both algorithms are guided more effectively towards the subspace containing the optimal solution. In addition, the overhead associated with a larger i-bound, which is calculated sequentially, considerably offsets the speedup obtained by SPRBFAOO over RBFAOO; see for example the plot for the protein instances at the larger i-bound. We also observed that SPRBFAOO's speedup over RBFAOO increases sublinearly as more threads are used (we experimented with several thread counts). In addition to the search overhead, synchronization overhead is another cause of the sublinear speedups. The synchronization overhead can be estimated by checking the node expansion rate per thread: with the largest thread count, the per-thread node expansion rate of SPRBFAOO slows down considerably relative to RBFAOO on grid, pedigree, and protein, whereas the slowdown is much milder with fewer threads; this implies that the overhead related to locks is large, and the slowdown becomes more severe with more threads. We hypothesize that, due to the nature of the virtual q-value, SPRBFAOO threads tend to follow the same path from the root until their search directions are diversified, and they frequently access the cache table entries of the internal nodes located on that path, which is where lock contention occurs (an illustrative sketch of the virtual q-value bookkeeping appears after the references). Finally, SPRBFAOO's load balance is quite stable in all domains, especially once all threads have been invoked and perform search for a while; its load balance stays within a stable range for grid, pedigree, and protein, especially on those instances where SPRBFAOO expands at least a million nodes with the full set of threads.

Impact of the parameter δ. In the figure above we analyze the performance of multi-threaded SPRBFAOO as a function of the parameter δ, which controls the way different threads are encouraged or discouraged to start exploring a specific subproblem (see also the algorithm section). For this purpose, and to better understand SPRBFAOO's scaling behavior, we ignore the heuristic compilation time; we therefore show the total search time in seconds over the instances that all parallel versions solve, and the average speedup based on the instances where RBFAOO needs at least a second to solve. We obtained these numbers for several values of δ. We see that all of these values lead to improved speedups. This is important because, unlike the approach taken in related work, which involves a sophisticated scheme, ours is considerably simpler yet extremely efficient and only
requires tuning single parameter of the three values while sprbfaoo with spends the largest total search time it yields the best this indicates about selecting since the instances used to calculate values are solved by rbfaoo they contain relatively easy instances table total cpu time sec and node expansions for hard pedigree instances sprbfaoo ran with threads and largefam time limit hours instance mbe time rbfaoo time nodes sprbfaoo time nodes on the other hand several difficult instances solved by sprbfaoo with threads are included in calculating the total search time in case of because of increased search overhead sprbfaoo needs more search time to solve these difficult instances there is also one protein instance unsolved with but solved with and this phenomenon can be explained as follows with large sprbfaoo searches in more diversified directions which could reduce lock contentions resulting in improved values however due to larger diversification when sprbfaoo with solves difficult instances it might focus on less promising portions of the search space resulting in increased total search time summary of the experiments in terms of our parallel sharedmemory method sprbfaoo improved considerably over its sequential counterpart rbfaoo by up to times using threads at relatively larger their corresponding computational overhead typically outweighed the gains obtained by parallel search still parallel search had an advantage of solving additional instances unsolved by serial search finally in table we report the results obtained on very hard pedigree instances from mbe records the heuristic compilation time we see again that sprbfaoo improved over rbfaoo on all instances while achieving of on two of them and related work the distributed aobb algorithm daoopt which builds on the notion of parallel tree search explores centrally the search tree up to certain depth and solves the remaining conditioned in parallel using large grid of distributed processing units without shared cache in parallel evidence propagation the notion of pointer jumping has been used for exact probabilistic inference for example pennock performs theoretical analysis xia and prasanna split junction tree into chains where evidence propagation is performed in parallel using distributedmemory environment and the results are merged later on search pns in spaces and its parallel variants have been shown to be effective in games as pns is suboptimal it can not be applied as is to exact map inference kaneko presents parallel search with virtual proof and disproof numbers vpdn these combine proof and disproof numbers and the number of threads examining node thus our vq is closely related to vpdn however vpdn has an problem which we avoid due to the way we dynamically update vq saito et al uses threads that probabilistically avoid the strategy hoki et al adds small random values the proof and disproof numbers of each thread without sharing any cache table conclusion we presented sprbfaoo new parallel recursive search scheme in graphical models using the virtual shared in single cache table sprbfaoo enables threads to work on promising regions of the search space with effective reuse of the search effort performed by others homogeneous search mechanism across the threads achieves an effective load balancing without resorting to sophisticated schemes used in related work we prove the correctness of the algorithm in experiments sprbfaoo improves considerably over current sequential search approaches in many cases leading to 
considerable up to using threads especially on hard problem instances ongoing and future research directions include proving the completeness conjecture extending sprbfaoo to distributed memory environments and parallelizing the heuristic for shared and distributed memory references marinescu and dechter search for combinatorial optimization in graphical models artificial intelligence kishimoto and marinescu recursive search for optimization in graphical models in international conference on uncertainty in artificial intelligence uai pages korf search artificial intelligence kishimoto fukunaga and botea evaluation of simple scalable parallel search strategy artificial intelligence wahid chrabakh and rich wolski gradsat parallel sat solver for the grid technical report university of california at santa barbara campbell joseph hoane and hsu deep blue artificial intelligence enzenberger arneson and segal fuego an framework for board games and go engine based on tree search ieee transactions on computational intelligence and ai in games otten and dechter case study in complexity estimation towards parallel over graphical models in uncertainty in artificial intelligence uai pages pearl probabilistic reasoning in intelligent systems morgan kaufmann lauritzen graphical models clarendon press dechter and mateescu search spaces for graphical models artificial intelligence marinescu and dechter memory intensive search for combinatorial optimization in graphical models artificial intelligence nagai algorithm for searching trees and its applications phd thesis the university of tokyo fishelson and geiger exact genetic linkage computations for general pedigrees bioinformatics yanover and weiss minimizing and learning energy functions for prediction journal of computational biology grama and kumar state of the art in parallel search techniques for discrete optimization problems ieee transactions on knowledge and data engineering pennock logarithmic time parallel bayesian inference in uncertainty in artificial intelligence uai pages xia and prasanna junction tree decomposition for parallel exact inference in ieee international symposium on parallel and distributed processing ipdps allis van der meulen and van den herik search artificial intelligence kishimoto winands and saito search using proof numbers the first twenty years icga journal vol no kaneko parallel depth first proof number search in aaai conference on artificial intelligence pages saito winands and van den herik randomized parallel search in advances in computers games conference acg volume of lecture notes in computer science pages springer hoki kaneko kishimoto and ito parallel dovetailing and its application to depthfirst search icga journal 
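The following minimal Python sketch, added for illustration and not part of the paper above, shows one way the shared cache entry and the virtual q-value bookkeeping described for SPRBFAOO could be organized. All identifiers (CacheEntry, enter_node, should_backtrack, leave_node), the initialization of vq, and the simplified locking are assumptions of the sketch, not the authors' implementation.

import threading
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    q: float                       # lower bound on the optimal cost of the node
    solved: bool = False           # true once the cost is proven optimal or no solution exists
    bs: float = float("inf")       # best known solution cost below this node
    nr_threads: int = 0            # number of threads currently working on the node
    vq: float = None               # virtual q-value, initialized to q
    lock: threading.Lock = field(default_factory=threading.Lock)

    def __post_init__(self):
        if self.vq is None:
            self.vq = self.q

def enter_node(entry, delta):
    # a thread examining the node increments the thread count and bumps vq by delta,
    # so vq reflects both the estimated cost q and how many threads already work here
    with entry.lock:
        entry.nr_threads += 1
        entry.vq += delta
        return entry.vq

def should_backtrack(entry, th, th_ub):
    # backtrack if the node looks unpromising (vq >= threshold), if it is solved,
    # or if no solution through it can beat the best known one (suboptimality test)
    with entry.lock:
        return entry.vq >= th or entry.solved or entry.q >= th_ub

def leave_node(entry, q, vq, bs, solved):
    # on backtracking, decrement the thread count and store the recomputed values;
    # if the node is solved or no other thread examines it, reset vq to q so the
    # node is not treated as less promising than it actually is
    with entry.lock:
        entry.nr_threads -= 1
        if solved or entry.nr_threads == 0:
            vq = q
        entry.q, entry.vq, entry.bs, entry.solved = q, vq, bs, solved

In the algorithm itself the entry is keyed by the node's AND/OR context and shared by all threads; the sketch omits the cache table, the propagation of child values, and the threshold updates.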
Convolutional Neural Networks with Recurrent Connections for Scene Labeling
Ming Liang, Xiaolin Hu, Bo Zhang
Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Computer Science and Technology, Center for Brain-Inspired Computing Research (CBICR), Tsinghua University, Beijing, China

Abstract. Scene labeling is a challenging computer vision task. It requires the use of both local discriminative features and global context information. We adopt a deep recurrent convolutional neural network (RCNN) for this task, which was originally proposed for object recognition. Different from traditional convolutional neural networks (CNN), this model has recurrent connections in its convolutional layers, so that each convolutional layer becomes a recurrent neural network: the units receive constant feed-forward inputs from the previous layer and recurrent inputs from their neighborhoods. As the recurrent iterations proceed, the region of context captured by each unit expands. In this way, feature extraction and context modulation are seamlessly integrated, which is different from typical methods that entail separate modules for the two steps. To further utilize the context, a multi-scale RCNN is proposed. Over two benchmark datasets, Stanford Background and SIFT Flow, the model outperforms many state-of-the-art models in accuracy and efficiency.

Introduction. Scene labeling, or scene parsing, is an important step towards image interpretation. It aims at fully parsing the input image by labeling the semantic category of each pixel. Compared with image classification, scene labeling is more challenging, as it simultaneously solves both segmentation and recognition. The typical approach for scene labeling consists of two steps: first, extract local handcrafted features; second, integrate context information using probabilistic graphical models or other techniques. In recent years, motivated by the success of deep neural networks in learning visual representations, the CNN has been incorporated into this framework for feature extraction. However, since the CNN does not have an explicit mechanism to modulate its features with context, to achieve better results other methods, such as the conditional random field (CRF) and the recursive parsing tree, are still needed to integrate the context information. It would be interesting to have a neural network capable of performing scene labeling in an end-to-end manner.

A natural way to incorporate context modulation in neural networks is to introduce recurrent connections. This has been extensively studied in sequence learning tasks such as online handwriting recognition, speech recognition, and machine translation. Sequential data has strong correlations along the time axis, and recurrent neural networks (RNN) are suitable for these tasks because the context information can be captured by a fixed number of recurrent weights. Treating scene labeling as a variant of sequence learning, RNNs can also be applied, but such studies are relatively scarce. Recently, a recurrent CNN (RCNN), in which the output of the top layer of a CNN is integrated with the input at the bottom, was successfully applied to scene labeling.

Figure: training and testing processes of the RCNN for scene labeling (training: extract patch and resize, valid convolutions, concatenation, softmax classification with cross-entropy loss against the label; testing: same convolutions, concatenation, classification, and upsampling). Solid lines denote feed-forward connections and dotted lines denote recurrent connections.

Without the aid of extra preprocessing or post-processing techniques, it achieves competitive results. This type of recurrent connection captures both local and global information
for labeling pixel but it achieves this goal indirectly as it does not model the relationship between pixels or the corresponding units in the hidden layers of cnn in the space explicitly to achieve the goal directly recurrent connections are required to be between units within layers this type of rcnn has been proposed in but there it is used for object recognition it is unknown if it is useful for scene labeling more challenging task this motivates the present work prominent structural property of rcnn is that and recurrent connections in multiple layers this property enables the seamless integration of feature extraction and context modulation in multiple levels of representation in other words an rcnn can be seen as deep rnn which is able to encode the context dependency therefore we expect rcnn to be competent for scene labeling is another technique for capturing both local and global information for scene labeling therefore we adopt rcnn an rcnn is used for each scale see figure for its overall architecture the networks in different scales have exactly the same structure and weights the outputs of all networks are concatenated and input to softmax layer the model operates in an fashion and does not need any preprocessing or techniques related work many models either or parametric have been proposed for scene labeling comprehensive review is beyond the scope of this paper below we briefly review the neural network models for scene labeling in cnn is used to extract local features for scene labeling the weights are shared among the cnns for all scales to keep the number of parameters small however the scheme alone has no explicit mechanism to ensure the consistency of neighboring pixels labels some techniques such as superpixels and crf are shown to significantly improve the performance of cnn in cnn features are combined with fully connected crf for more accurate segmentations in both models cnn and crf are trained in separated stages in crf is reformulated and implemented as an rnn which can be jointly trained with cnn by bp algorithm in recursive neural network is used to learn mapping from visual features to the semantic space which is then used to determine the labels of pixels in recursive context propagation network rcpn is proposed to better make use of the global context information the rcpn is fed superpixel representation of cnn features through parsing tree the rcpn recursively aggregates context information from all superpixels and then disseminates it to each superpixel although recursive neural network is related to rnn as they both use weight sharing between different layers they have significant structural difference the former has single path from the input layer to the output layer while the latter has multiple paths as will be shown in section this difference has great influence on the performance in scene labeling to the best of our knowledge the first neural network model for scene labeling refers to the deep cnn proposed in the model is trained by supervised greedy learning strategy in another model is proposed recurrent connections are incorporated into cnn to capture context information in the first recurrent iteration the cnn receives raw patch and outputs predicted label map downsampled due to pooling in other iterations the cnn receives both downsampled patch and the label map predicted in the previous iteration and then outputs new predicted label map compared with the models in this approach is simple and elegant but its performance is not the best on some 
benchmark datasets. It is noted that both of these models are called RCNN in the literature; for convenience, in what follows, unless otherwise specified, RCNN refers to the model with recurrent connections inside the convolutional layers.

Model
RCNN. The key module of the RCNN is the recurrent convolutional layer (RCL). A generic RNN with input u(t), internal state x(t) and parameters θ can be described by x(t) = F(u(t), x(t-1), θ), where F is the function describing the dynamic behavior of the RNN. The RCL introduces recurrent connections into a convolutional layer (see the figure below for an illustration). It can be regarded as a special RNN whose feed-forward and recurrent computations both take the form of convolution:

z_ijk(t) = (w_k^f)^T u^(i,j)(t) + (w_k^r)^T x^(i,j)(t-1) + b_k,

where u^(i,j)(t) and x^(i,j)(t-1) are vectorized square patches centered at (i, j) of the feature maps of the previous layer and the current layer, w_k^f and w_k^r are the weights of the feed-forward and recurrent connections for the kth feature map, and b_k is the kth element of the bias. The activation function used in this paper is composed of two functions,

x_ijk(t) = g(f(z_ijk(t))),

where f is the widely used rectified linear function, f(z_ijk) = max(z_ijk, 0), and g is the local response normalization (LRN),

g(f_ijk) = f_ijk / (1 + (α/N) Σ_{k'=max(0, k-N/2)}^{min(K, k+N/2)} (f_ijk')^2)^β,

where K is the number of feature maps and α, β are constants controlling the amplitude of normalization. The LRN forces the units at the same location to compete for high activities, which mimics lateral inhibition in the cortex. In our experiments LRN is found to consistently improve the accuracy, though slightly. Following prior work, α, β and N are set to fixed values (a minimal numerical sketch of this recurrent update, added for illustration, appears after this paper's references).

During the training or testing phase, an RCL is unfolded for T time steps into a feed-forward subnetwork, where T is predetermined (see the figure below for an example). The receptive field (RF) of each unit expands with larger T, so that more context information is captured, and the depth of the subnetwork also increases with larger T; in the meantime, the number of parameters is kept constant due to weight sharing. Let u(0) denote the static input (an image). The input to the RCL, denoted u(t), can take this constant value for all t, but here we adopt a more general form that scales the feed-forward input by a discount factor γ after the first iteration,

Figure: illustration of the RCL and RCNN used in this paper (an RCL unit; unfolding an RCL; multiplicatively unfolding two RCLs; additively unfolding two RCLs; the RCNN with pooling layers). Solid arrows denote feed-forward connections and dotted arrows denote recurrent connections.

where γ determines the trade-off between the feed-forward component and the recurrent component. When γ = 0, the feed-forward component is totally discarded after the first iteration; in this case the network behaves like the recursive convolutional network in which several convolutional layers have tied weights, and there is only one path from input to output. When γ > 0, the network is a typical RNN and there are multiple paths from input to output (see the figure above).

The RCNN is composed of a stack of RCLs; between neighboring RCLs there are only feed-forward connections, and max pooling layers are optionally interleaved between the RCLs. The total number of recurrent iterations is set to the same value for all RCLs. There are two approaches to unfolding an RCNN. In the first, the RCLs are unfolded one by one, and each RCL is unfolded for T time steps before feeding into the next RCL (see the figure above); this approach multiplicatively increases the depth of the network, and the largest depth grows with the product of the number of RCLs and T. In the second approach, at each time step the states of all RCLs are updated successively (see the figure above); the unfolded network has a grid structure, with one axis being the time step and the other the level of the layer, so this approach additively increases the depth of the network, and the largest depth grows with the sum of the number of RCLs and T. We adopt the first unfolding approach due to the following advantages: first, it leads to a larger effective RF and depth, which are important for the performance of the model; second, the second approach is more computationally intensive
since the inputs need to be updated at each time step however in the first approach the input of each rcl needs to be computed for only once rcnn in natural scenes objects appear in various sizes to capture this variability the model should be scale invariant in cnn is proposed to extract features for scene labeling in which several cnns with shared weights are used to process images of different scales this approach is adopted to construct the rcnn see figure the original image corresponds to the finest scale images of coarser scales are obtained simply by max pooling the original image the outputs of all rcnns are concatenated to form the final representation for pixel its probability falling into the cth semantic category is given by softmax layer exp wc yc exp where denotes the concatenated feature vector of pixel and wc denotes the weight for the cth category the loss function is the cross entropy between the predicted probability ycp and the true hard label xx log ycp where if pixel is labeld as and otherwise the model is trained by backpropagation through time bptt that is unfolding all the rcnns to networks and apply the bp algorithm training and testing most neural network models for scene labeling are trained by the approach the training samples are randomly cropped image patches whose labels correspond to the categories of their center pixels valid convolutions are used in both and recurrent computation the patch is set to proper size so that the last feature map has exactly the size of in training an image is input to the model and the output has exactly the same size as the image the loss is the average of all pixels loss we have conducted experiments with both training methods and found that training seriously suffered from possible reason is that the pixels in an image have too strong correlations so training is used in all our experiments in it is suggested that and training are equally effective and the former is faster to converge but their model is obtained by finetuning the vgg model pretrained on imagenet this conclusion may not hold for models trained from scratch in the testing phase the approach is time consuming because the patches corresponding to all pixels need to be processed we therefore use testing there are two testing approaches to obtain dense label maps the first is the approach when the predicted label map is downsampled by factor of the original image will be shifted and processed for times at each time the image is shifted by pixels to the right and down both and take their value from and the shifted image is padded in their left and top borders with zero the outputs for all shifted images are interleaved so that each pixel has corresponding prediction approach needs to process the image for times although it produces the exact prediction as the testing the second approach inputs the entire image to the network and obtains downsampled label map then simply upsample the map to the same resolution as the input image using bilinear or other interpolation methods see figure bottom this approach may suffer from the loss of accuracy but is very efficient the deconvolutional layer proposed in is adopted for upsampling which is the backpropagation counterpart of the convolutional layer the deconvolutional weights are set to simulates the bilinear interpolation both of the testing methods are used in our experiments experiments experimental settings experiments are performed over two benchmark datasets for scene labeling sift flow and stanford background the 
sift flow dataset contains color images all of which have the size of pixels among them images are training data and the remaining images are testing data there are semantic categories and the class frequency is highly unbalanced the stanford background dataset contains color images most of them have the size of pixels following cross validation is used over this dataset in each fold there are training images and testing images the pixels have semantic categories and the class frequency is more balanced than the sift flow dataset in most of our experiments rcnn has three parameterized layers figure the first parameterized layer is convolutional layer followed by max pooling layer this is to reduce the size of feature maps and thus save the computing cost and memory the other two parameterized layers are rcls another max pooling layer is placed between the two rcls the numbers of feature maps in these layers are and the filter size in the first convolutional layer is and the and recurrent filters in rcls are all three scales of images are used and neighboring scales differed by factor of in each side of the image the models are implemented using caffe they are trained using stochastic gradient descent algorithm for the sift flow dataset the are determined on separate validation set the same set of is then used for the stanford background dataset dropout and weight decay are used to prevent two dropout layers are used one after the second pooling layer and the other before the concatenation of different scales the dropout ratio is and weight decay coefficient is the base learning rate is which is reduced to when the training error enters plateau overall about ten millions patches have been input to the model during training data augmentation is used in many models for scene labeling to prevent it is technique to distort the training data with set of transformations so that additional data is generated to improve the generalization ability of the models this technique is only used in section for the sake of fairness in comparison with other models augmentation includes horizontal reflection and resizing model analysis we empirically analyze the performance of rcnn models for scene labeling on the sift flow dataset the results are shown in table two metrics the accuracy pa and the average accuracy ca are used pa is the ratio of correctly classified pixels to the total pixels in testing images ca is the average of all accuracies the following results are obtained using the testing and without any data augmentation note that all models have architecture model rcnn rcnn rcnn rcnn rcnn rcnn rcnn rcnn rcnn rcnn no share patch size no param pa ca table model analysis over the sift flow dataset we limit the maximum size of input patch to which is the size of the image in the sift flow dataset this is achieved by replacing the first few valid convolutions by same convolutions first the influence of in is investigated the patch sizes of images for different models are set such that the size of the last feature map is we mainly investigate two specific values and with different iteration number several other values of are tested with see table for details for rcnn with the performance monotonously increase with more time steps this is not the case for rcnn with with which the network tends to be with more iterations to further investigate this issue larger model denoted as is tested it has four rcls and has more parameters and larger depth with it achieves better performance than rcnn however the with 
one setting of γ obtains worse performance than RCNN, while other values of γ sometimes appear slightly better, but the difference is small.

Second, the influence of weight sharing in the recurrent connections is investigated. Another RCNN with the same architecture is tested whose recurrent weights in different iterations are no longer shared, which leads to more parameters than the shared setting; this setting, however, leads to worse accuracy, both in PA and in CA. A possible reason is that more parameters make the model more prone to overfitting.

Third, two CNNs are constructed for comparison. The first is constructed by removing all recurrent connections from the RCNN and then increasing the number of feature maps in each layer. The second is constructed by removing the recurrent connections and adding two extra convolutional layers, so that it has five convolutional layers in total. With these settings the two models have approximately the same number of parameters as the RCNN, for the sake of fair comparison. The two CNNs are outperformed by the RCNNs by a significant margin. Compared with the RCNN, the topmost units in these two CNNs cover much smaller regions (see the patch size column in the model analysis table). Note that all convolutions in these models are performed in valid mode; this mode decreases the size of the feature maps and, as a consequence, together with max pooling, increases the RF size of the top units. Since the CNNs have fewer convolutional layers than the RCNNs, the RF sizes of their top units are smaller.

Figure: examples of scene labeling results from the Stanford Background dataset ("mntn" denotes mountains and "object" denotes foreground objects).

Table: comparison with state-of-the-art models over the SIFT Flow dataset. Rows include Liu et al.; Tighe and Lazebnik; Eigen and Fergus; Singh and Kosecka; Tighe and Lazebnik; CNN with cover and its balanced variant; the RCNN of prior work; CNN with RCPN and its balanced variant; our RCNN and its balanced variant; and an FCN initialized from the VGG model. Columns report the number of parameters, PA, CA, and per-image time on CPU or GPU (numeric entries lost in extraction).

Comparison with the state-of-the-art models. Next we compare the results of the RCNN with the state-of-the-art models. The RCNN with the settings chosen above is used for comparison, and the results are obtained using the upsampling testing approach for efficiency. Data augmentation is employed in training because it is used by many other models. The images are only preprocessed by removing the average RGB values computed over the training images.

Table: comparison with state-of-the-art models over the Stanford Background dataset. Rows include Gould et al.; Tighe and Lazebnik; Socher et al.; Eigen and Fergus; Singh and Kosecka; Lempitsky et al.; multiscale CNN with CRF; the RCNN of prior work; CNN with RCPN; multiscale CNN with RCPN; and our RCNN. Columns report the number of parameters, PA, CA, and per-image time on CPU or GPU (numeric entries lost in extraction).

The results over the SIFT Flow dataset are shown in the first table above. Besides PA and CA, the time for processing an image is also presented, and for the neural network models the number of parameters is shown. When extra training data from other datasets is not used, the RCNN outperforms all other models in terms of the PA metric by a significant margin. The RCNN also has fewer parameters than most of the other neural network models, except the RCNN of Pinheiro and Collobert. A small RCNN is then constructed by reducing the number of feature maps, so that its total number of parameters becomes much smaller; the PA and CA of the small RCNN are still significantly higher than those of the RCNN of Pinheiro and Collobert. Note that a better result over this dataset has been achieved by the fully convolutional network (FCN); however, the FCN is fine-tuned from the VGG net, trained over the more than one million images of ImageNet, and
has approximately million parameters being trained over images rcnn is only outperformed by percent on pa this gap can be further reduced by using larger rcnn models for example the in table achieves pa of with data augmentation the class distribution in the sift flow dataset is highly unbalanced which is harmful to the ca performance in frequency balance is used so that patches in different classes appear in the same frequency this operation greatly enhance the ca value for better comparison we also test an rcnn with weighted sampling balanced so that the rarer classes apprear more frequently in this case the rcnn achieves much higher ca than other methods including fcn while still keeping good pa the results over the stanford background dataset are shown in table the set of used for the sift flow dataset is adopted without further tuning frequency balance is not used the rcnn again achieves the best pa score although ca is not the best some typical results of rcnn are shown in figure on gtx titan black gpu it takes about second for the rcnn and second for the to process an image compared with other models the efficiency of rcnn is mainly attributed to its property for example the rcpn model takes much time in obtaining the superpixels conclusion recurrent convolutional neural network is used for scene labeling the model is able to perform local feature extraction and context integration simultaneously in each parameterized layer therefore particularly fits this application because both local and global information are critical for determining the label of pixel in an image this is an approach and can be simply trained by the bptt algorithm experimental results over two benchmark datasets demonstrate the effectiveness and efficiency of the model acknowledgements we are grateful to the anonymous reviewers for their valuable comments this work was supported in part by the national basic research program program of china under grant and grant in part by the national natural science foundation of china under grant grant and grant in part by the natural science foundation of beijing under grant references chen papandreou kokkinos murphy and yuille semantic image segmentation with deep convolutional nets and fully connected crfs in iclr deng dong socher li li and imagenet hierarchical image database in cvpr pages eigen and fergus nonparametric image parsing using adaptive neighbor sets in cvpr pages eigen rolfe fergus and lecun understanding deep architectures using recursive convolutional network in iclr farabet couprie najman and lecun learning hierarchical features for scene labeling ieee transactions on pattern analysis and machine intelligence pami gould fulton and koller decomposing scene into geometric and semantically consistent regions in iccv pages grangier bottou and collobert deep convolutional networks for scene parsing in icml deep learning workshop volume graves liwicki bertolami bunke and schmidhuber novel connectionist system for unconstrained handwriting recognition ieee transactions on pattern analysis and machine intelligence pami graves mohamed and hinton speech recognition with deep recurrent neural networks in icassp pages jia shelhamer donahue karayev long girshick guadarrama and darrell caffe convolutional architecture for fast feature embedding in proceedings of the acm international conference on multimedia pages krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips pages lecun boser denker henderson howard hubbard and 
jackel backpropagation applied to handwritten zip code recognition neural computation lempitsky vedaldi and zisserman pylon model for semantic segmentation in nips pages liang and hu recurrent convolutional neural network for object recognition in cvpr pages liu yuen and torralba nonparametric scene parsing via label transfer ieee transactions on pattern analysis and machine intelligence pami long shelhamer and darrell fully convolutional networks for semantic segmentation in cvpr mostajabi yadollahpour and shakhnarovich feedforward semantic segmentation with zoomout features in cvpr mottaghi chen liu cho lee fidler urtasun and yuille the role of context for object detection and semantic segmentation in the wild in cvpr pages pinheiro and collobert recurrent convolutional neural networks for scene parsing in icml sermanet eigen zhang mathieu fergus and lecun overfeat integrated recognition localization and detection using convolutional networks in iclr sharma tuzel and liu recursive context propagation network for semantic scene labeling in nips pages simonyan and zisserman very deep convolutional networks for image recognition corr singh and kosecka nonparametric scene parsing with adaptive feature relevance and semantic context in cvpr pages socher lin manning and ng parsing natural scenes and natural language with recursive neural networks in icml pages sutskever vinyals and le sequence to sequence learning with neural networks in nips pages tighe and lazebnik finding things image parsing with regions and detectors in cvpr pages tighe and lazebnik superparsing scalable nonparametric image parsing with superpixels international journal of computer vision ijcv werbos backpropagation through time what it does and how to do it proceedings of the ieee zheng jayasumana vineet su du huang and torr conditional random fields as recurrent neural networks in iccv 
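The following minimal NumPy sketch, added for illustration and not taken from the paper above, unrolls a single recurrent convolutional layer (RCL) for T time steps: a feed-forward convolution of the layer input plus a recurrent convolution of the previous state, followed by ReLU and local response normalization. The helper names, the hyper-parameter values, and the discounting of the feed-forward input by gamma after the first step are assumptions of the sketch.

import numpy as np
from scipy.signal import correlate2d

def lrn(x, alpha=1e-3, beta=0.75, n=None):
    # local response normalization across feature maps: units at the same spatial
    # location compete for high activity (hyper-parameters assumed, not the paper's)
    K = x.shape[0]
    n = n or K
    out = np.empty_like(x)
    for k in range(K):
        lo, hi = max(0, k - n // 2), min(K, k + n // 2 + 1)
        norm = (1.0 + alpha / n * np.sum(x[lo:hi] ** 2, axis=0)) ** beta
        out[k] = x[k] / norm
    return out

def rcl_forward(u, w_f, w_r, b, T=3, gamma=1.0):
    # u: (K_in, H, W) layer input; w_f: (K, K_in, kh, kw) feed-forward filters;
    # w_r: (K, K, kh, kw) recurrent filters; unrolled for T steps with tied weights
    K, H, W = w_f.shape[0], u.shape[1], u.shape[2]
    x = np.zeros((K, H, W))
    for t in range(T + 1):
        u_t = u if t == 0 else gamma * u          # discounted feed-forward input (assumption)
        z = np.zeros((K, H, W))
        for k in range(K):
            for c in range(u.shape[0]):           # feed-forward convolution
                z[k] += correlate2d(u_t[c], w_f[k, c], mode="same")
            if t > 0:
                for c in range(K):                # recurrent convolution on the previous state
                    z[k] += correlate2d(x[c], w_r[k, c], mode="same")
            z[k] += b[k]
        x = lrn(np.maximum(z, 0.0))               # ReLU followed by LRN
    return x

The sketch processes one image's feature maps; the full model stacks several such layers with max pooling in between, shares the stack across image scales, and concatenates the per-scale outputs before the softmax classifier.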
bounding the cost of lifted inference vibhav gogate university of texas at dallas campbell rd richardson tx david smith university of texas at dallas campbell rd richardson tx abstract recently there has been growing interest in systematic and importance lifted inference algorithms for statistical relational models srms these lifted algorithms achieve significant complexity reductions over their propositional counterparts by using lifting rules that leverage symmetries in the relational representation one drawback of these algorithms is that they use an representation of the search space which makes it difficult to efficiently tight upper bounds on the exact cost of inference without running the algorithm to completion in this paper we present principled approach to address this problem we introduce lifted analogue of the propositional search space framework which we call lifted schematic given representation of an srm we show how to efficiently compute tight upper bound on the time and space cost of exact inference from current assignment and the remaining schematic we show how our bounding method can be used within lifted importance sampling algorithm in order to perform effective and demonstrate experimentally that the version of the algorithm yields more accurate estimates on several datasets introduction myriad of probabilistic logic languages have been proposed in recent years these languages can express elaborate models with compact specification unfortunately performing efficient inference in these models remains challenge researchers have attacked this problem by lifting propositional inference techniques lifted algorithms identify indistinguishable random variables and treat them as single block at inference time which can yield significant reductions in complexity since the original proposal by poole variety of lifted inference algorithms have emerged one promising approach is the class of algorithms which lift propositional weighted model counting to the level by transforming the propositional search space into smaller lifted search space in general exact lifted inference remains intractable as result there has been growing interest in developing approximate algorithms that take advantage of symmetries in this paper we focus on class of such algorithms called lifted sampling methods and in particular on the lifted importance sampling lis algorithm lis can be understood as sampling analogue of an exact lifted search algorithm called probabilistic theorem proving ptp ptp accepts srm as input as markov logic network mln decides upon lifted inference rule to apply conditioning decomposition partial grounding etc constructs set of reduced mlns recursively calls itself on each reduced mln in this set and combines the returned values in an appropriate manner drawback of ptp is that the mln representation of the search space is inference unaware at any step in ptp the cost of inference over the remaining model is unknown this is problematic because unlike propositional importance sampling algorithms for graphical models which can be in principled manner by sampling variables until the treewidth of the remaining model is bounded by small constant called sampling it is currently not possible to lis in principled manner to address these limitations we make the following contributions we propose an alternate representation of the lifted search space that allows efficient computation of the cost of inference at any step of the ptp algorithm our approach is based on the search space 
perspective propositional search associates compact representation of search space with graphical model called pseudotree and then uses this representation to guide weighted model counting algorithm over the full search space we extend this notion to lifted search spaces we associate with each srm schematic which describes the associated lifted search space in terms of lifted or nodes which represent branching on counting assignments to groups of indistinguishable variables and lifted and nodes which represent decompositions over independent and possibly identical subproblems our formal specification of lifted search spaces offers an intermediate representation of srms that bridges the gap between probabilistic logics such as markov logic and the search space representation that must be explored at inference time we use the intermediate specification to characterize the size of the search space associated with an srm without actually exploring it providing tight upper bounds on the complexity of ptp this allows us in principle to develop advanced approximate lifted inference algorithms that take advantage of exact lifted inference whenever they encounter tractable subproblems we demonstrate the utility of our lifted schematic and tight upper bounds by developing lifted importance sampling algorithm enabling the user to systematically explore the accuracy versus complexity we demonstrate experimentally that it vastly improves the accuracy of estimation on several datasets background and terminology search spaces the search space model is general perspective for searching over graphical models including both probabilistic networks and constraint networks search spaces allow for many familiar graph notions to be used to characterize algorithmic complexity given graphical model xg φy where xv ey is graph and is set of features or potentials and rooted tree that spans in such manner that the edges of that are not in are all is pseudo tree the corresponding search space denoted st prq contains alternating levels of and nodes and or nodes or nodes are labeled with xi where xi varspφq and nodes are labeled with xi and correspond to assignments to xi the root of the search tree is an or node corresponding to the root of intuitively the pseudo tree can be viewed as schematic for the structure of an search space associated with graphical model which denotes the conditioning order on the set varspφq and the locations along this ordering at which the model decomposes into independent subproblems given pseudotree we can generate the corresponding search tree via straightforward algorithm that adds conditioning branches to the pseudo tree representation during dfs walk over the structure adding cache that stores the value of each subproblem keyed by an assignment to its context allows each subproblem to be computed just once and converts the search tree into search graph thus the cost of inference is encoded in the pseudo tree in section we define lifted analogue to the backbone pseudo tree called lifted schematic and in section we use the definition to prove cost of inference bounds for probabilistic logic models first order logic an entity or constant is an object in the model about which we would like to reason each entity has an associated type the set of all unique types forms the set of base types for the model domain is set of entities of the same type we assume that each domain is finite and is disjoint from every other domain in the model variable denoted by letter is symbolic placeholder that 
specifies where substitution may take place each variable is associated with type valid substitution requires that variable be replaced by an object either an entity or another variable with the same type we denote the domain associated with variable by we define predicate denoted by tk τk to be functor that maps typed entities to random variables also called parameterized random variable substitution is an expression of the form tk xk where ti are variables of type τi and xi are either entities or variables of type τi given predicate and substitution tk xk the application of to yields another functor functor with each ti replaced by xi called an atom if all the xi are entities the application yields random variable in this case we refer to as grounding of and rθ as ground atom we adopt the notation θi to refer to the assignment of θi xi statistical relational models combine logic and probabilistic graphical models popular srm is markov logic networks mlns an mln is set of weighted logic clauses given entities the mln defines markov network over all the ground atoms in its herbrand base cf with feature corresponding to each ground clause in the herbrand base we assume herbrand interpretations throughout this paper the weight of each feature is the weight of the corresponding clause the probability distribution associated with the markov network is given by pxq expp wi ni pxqq where wi is řthe weight of the ith clause and ni pxq is its number of true groundings in and expp wi ni pxqq is the partition function in this paper we focus on computing it is known that many inference problems over mlns can be reduced to computing probabilistic theorem proving ptp is an algorithm for computing in mlns it lifts the two main steps in propositional inference conditioning or nodes and decomposition and nodes in lifted conditioning the set of truth assignments to ground atoms of predicate are partitioned into multiple parts such that in each part all truth assignment have the same number of true atoms and the mlns obtained by applying the truth assignments are identical thus if has ground atoms the lifted search procedure will search over opn new mlns while the propositional search procedure will search over mlns an exponential reduction in complexity in lifted decomposition the mln is partitioned into set of mlns that are not only identical up to renaming but also disjoint in the sense that they do not share any ground atoms thus unlike the propositional procedure which creates disjoint mlns and searches over each the lifted procedure searches over just one of the mlns since they are identical unfortunately lifted decomposition and lifted conditioning can not always be applied and in such cases ptp resorts to propositional conditioning and decomposition drawback of ptp is that unlike propositional search which has tight complexity guarantees exponential in the treewidth and pseudotree height there are no tight formal guarantees on the complexity of we address this limitation in the next two sections lifted schematics our goal in this section is to define lifted analogue the pseudotree notion employed un un un by the propositional framework the structure must encode all infor mation contained in propositional pseu dotree conditioning order conditional un un un independence assumptions as well as additional information needed by the ptp algorithm in order to exploit the figure possible schematics for rpxq spxq rpxq tries of the lifted model since the yq and rpxq rpyq spx yq stands for unknown circles and 
diamonds represent tries that can be exploited highly depend lifted or and and nodes respectively on the amount of evidence we encode the srm after evidence is instantiated via process called shattering thus while pseudotree encodes graphical model schematic encodes an srm evidence set pair definition lifted or node is vertex labeled by tuple xr ty where is predicate is set of valid substitutions for ku represents the counting argument for the predicate tk τk and specifies domain τα to be counted over is an identifier of the block of the partition being counted over is the number of entities in block and tt rue alse nknownu is the truth value of the set of entities in block definition lifted and node is vertex labeled by possibly empty set of formulas where formula is pair ptpo bqu wq in which is lifted or node xr ty tt rue alseu and formulas are assumed to be in clausal form definition lifted schematic xvs es vr is rooted tree comprised of lifted or nodes and lifted and nodes must obey the following properties every lifted or node vs has single child node vs every lifted and node vs has possibly empty set of children nn vs although complexity bounds exist for related inference algorithms such as decomposition trees they are not as tight as the ones presented in this paper for each pair of lifted or nodes vs with respective labels xr ty pr iq pairs pr iq uniquely identify lifted or nodes for every lifted or node vs with label xr ty either or where has appeared as the decomposer label of some edge in paths po vr for each formula fi ptpo bqu wq appearing at lifted and node tpo bqu paths pvr aq we call the set of edges tpo aq formulaspaqu the back edges of each edge between lifted or node and its child node is unlabeled each edge between lifted and node and its child node may be unlabeled or labeled with pair px cq where is set of variables called decomposer set and is the the number of equivalent entities in the block of represented by the subtree below if it is labeled with decomposer set then for every substitution set labeling lifted or node appearing in the subtree rooted at di θi and decomposer sets labeling edges in the subtree rooted at the lifted schematic is general structure for specifying the inference procedure in srms it can encode models specified in many formats such as markov logic and prv models given model and evidence set constructing schematic conversion into canonical form is achieved via shattering whereby exchangeable variables are grouped together inference only requires information on the size of these groups so the representation omits information on the specific variables in given group figure shows schematics for three mlns algorithm function evalnode and algorithm function evalnode or input schematic with and root node counting store cs output real number root for formula do wˆ calculateweightpf csq for child of do sumoutdoneatomspcs if pn has label xv cb then if expv bq ccy cs then xpv bq xtu tptu cb qyy xp getcc cc for for assignment pai ki do its own entry in cs pai wˆevalnodepn qki else wˆevalnodepn csq return input schematic with or node root counting store cs output real number if pxroot cs wq cache then return xr root child xr tv θα vu xp txai ki yuy getcc if tt rue alseu then updatecc xp tv evalnode else assigns vn vi ki uu for vn assigns do updatecc xp vn evalnodept vi insertcache xr return lifted node evaluation describe the inference procedure in algorithms and we require the notion of counting store in order to track counting assignments over the 
variables in the model counting store is set of pairs xpv iq ccy where is set of variables that are counted over together is block identifier and cc is counting context counting context introduced in is pair xp where is list of predicates and tt rue alseum is map from truth assignments to to integer denoting the count of the number of entities in the block of the partition of each that take that assignment we initialize the algorithm by call to algorithm with an appropriate schematic and empty counting store the lifted and node function algorithm first computes the weight of any completely conditioned formulas it then makes set of evalnode calls for each of its children if pa oq has decomposer label it makes call for each assignment in each block of the partition of otherwise it makes single call to the algorithm takes the product of the resulting terms along with the product of the weights and returns the result the lifted or node function algorithm retrieves the set of all assignments previously made to its counting argument variable set it then makes an evalnode call to its child for each completion to its assignment set that is consistent with its labeled truth value and takes their weighted sum where the weight is the number of symmetric assignments represented by each assignment completion the overall complexity of depends on the number of entries in the counting store at each step of inference note that algorithm reduces the size of the store by summing out over atoms that leave context algorithm increases the size of the store at atoms with unknown truth value by splitting the current assignment into true and false blocks its atom predicate atoms with known truth value leave the size of the store unchanged complexity analysis algorithms and describe traversal of the lifted search space associated with as our notion of complexity we are interested in specifying the maximum number of times any node vs is replicated during instantiation of the search space we describe this quantity as ssn psq our goal in this section is to define the function ssn psq which we refer to as the induced lifted width of computing the induced lifted width of the propositional framework the inference cost of pseudotree is determined by dr the tree decomposition of the graph xn odespt backedgespt qy induced by the variable ordering attained by traversing along any dfs ordering from root to leaves inference is opexppwqq where is the size of the largest cluster in dr the analogous procedure in lifted requires additional information be stored at each cluster lifted tree decompositions are identical to their propositional counterparts with two exceptions first each cluster ci requires the ordering of its nodes induced by the original order of second each cluster ci that contains node which occurs after decomposer label requires the inclusion of the decomposer label formally definition the tree sequence ts associated with schematic is partially ordered set such that odespsq ts pa with label edgespsq pa lq ts and sq ts definition the path sequence associated with tree sequence ts of schematic is any totally ordered subsequence of ts definition given schematic and its tree sequence ts the lifted tree decomposition of ts denoted ds is pair pc in which is set of path sequences and is tree whose nodes are the members of satisfying the following properties po aq backedgespp di ci ck atht pci cj ci cj ck ts ci ci given the partial ordering of nodes defined by each schematic induces unique lifted tree decomposition ds 
computing ssn psq requires computing maxci pc ssc pci there exists total ordering over the nodes in each ci hence the lifted structure in each ci constitutes path we take the lifted search space generated by each cluster to be tree hence computing the maximum node replication is equivalent to computing the number of leaves in ssc in order to calculate the induced lifted width of given path we must first determine which or nodes are counted over dependently let vc tv xr ty θα vu be the set of variables that are counted over by an or node in cluster let vc be partition of vc into its dependent variable counting sets define the binary relation cs dxr ty vs dθ θα then tv pv cs where cs is the transitive closure of cs let vc tvj vj ðñ cs variables that appear in set vj vc refer to the same set of objects thus all have the same type τj and they all share the same partition of the entities of tj let pj denote the partition of the entities of tj variable set vj then each block pij pj is counted over independently we refer to each pij as dependent counting path thus we can calculate the total leaves corresponding to cluster by taking the product of the leaves of each pij block ssc pcq vj pvc pij ppj ssp ppij analysis of lifted or nodes that count over the same block pij depends on the structure of the decomposers sets over the structure first we consider the case in which contains no decomposers lifted or nodes with no orc vj the sequence of nodes in that perform conditioning over the block of the partition of the variables in vj the nodes in orc vj count over the same set of entities conditioning assignment at assigns ct cu entities to rue and cf ct entities to alse its predicate breaking the symmetry over the elements in the block each orp vj that occurs after must perform counting over two sets of size ct and cf separately the number of assignments for block tvj iu grows exponentially with the number of ancestors counting over tvj iu whose truth value is unknown formally let cij be the size of the block of the partition of vj and let nij orc vj xr for an initial domain size cij and predicate count nij we must compute the number of possible ways to represent cij as sum of integers define kij we can count the number of leaf nodes generated by counting the number of weak compositions of cij into kij parts thus the number of search space leaves corresponding to pij generated byij ssp ppij pcij kij cijk ij example consider the example in figure there is single path from the root to leaf the set of variables appearing on the path txu and hence the partition of into variables that are counted over together yields ttxuu thus nq and so we can count the leaves of the model by the expression lifted or nodes with to algorithm function countpathleaves determine the size of the search tree induced by input subsequence path subsequence that contains decomposers we output pxq where is domain size and pxq is the number of search space leaves generated by must consider whether the counting argument represent the recursive polynomial as triple pa of each or node is decomposed on lifted or nodes with decomposers as non counting arguments where and are either weak compositions base case or triples of this type recursive case type wcp wc int wcd int wcp wcp constructs the polynomial function make oly wc nq pt sq wc wc pn qq return wcd we first consider the case when orc vj contains decomposer variables as function make oly wcd qq pt sqq arguments for each edge return wcdpa makepoly pt sq makepoly label algorithm 
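As a concrete illustration of the counting argument for the decomposer-free case, the number of weak compositions of a block of size c into k parts is C(c + k - 1, k - 1), and the cluster-level bound multiplies these counts over the dependent counting blocks. The sketch below evaluates such a bound for illustrative block sizes; how a schematic gives rise to the (c, k) pairs is not reproduced here, so the pairs are simply passed in.

```python
from math import comb

def leaves_per_block(c, k):
    # Number of weak compositions of c (the block size) into k parts.
    return comb(c + k - 1, k - 1)

def cluster_leaves(blocks):
    # blocks: iterable of (block_size, num_parts) pairs for one cluster of the
    # lifted tree decomposition; the bound multiplies over independent blocks.
    total = 1
    for c, k in blocks:
        total *= leaves_per_block(c, k)
    return total

print(cluster_leaves([(10, 2), (5, 3)]))  # C(11, 1) * C(7, 2) = 11 * 21 = 231
```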
generates child for sqq each assignment in the counting store divides out the or nodes with containing the decomposer variable if path counting variables that are decomposers subsequence over variable of initial domain function apply ec wc return wc pa qq has or nodes of which occur below the function apply ec wcd composer label then we can compute the return wcd applydec applydec ber of assignments in the counting store at each creates function that takes domain and computes the differences of the decomposer as further we can compute constituent weak compositions the number of leaves generated by each function eval oly wcd assignment can be computed as the difference return evalpoly evalpoly function eval wc in leaves from the model over or nodes and return the model over nodes hence the totalornodes ing model has dv ornodeswithdecomposercountingargument poly wc leaves this procedure can be repeated by for of do sively applying the rule to split each weak if pa xv cyq then poly makepoly poly ornodesabove ornodesbetween position into difference of weak compositions for each decomposer label present in the else quence under consideration algorithm the return evalpoly applydec dv poly final result is polynomial in which when given domain size returns the number of leaves generated by the path subsequence example consider the example in figure again there is single path from the root to leaf the set of variables appearing on the path is tx yu the partition of into variables that are counted over together yields ttx returns the polynomial pxq px px so the search space contains leaves lifted or nodes with decomposers as counting arguments the procedure is similar for the case when contains or nodes that count over variables that have been decomposed one addition or nodes that count over variable that has previously appeared as the decomposer label of an ancestor in the path have domain size of and hence always spawn children instead of px children if there are or nodes in that count over decomposed variables we must divide the term of each weak composition in our polynomial by lines of algorithm perform this operation example consider the example shown in figure again there is one path from the root to leaf with tx yu partitioning into sets of variables that are counted over together yields ttxu tyuu thus and similarly and algorithm returns the constant functions pxq pxq px equation indicates that we take the product of these functions so the search space contains leaves regardless of the domain sizes of and overall analysis as well as proof of correctness of algorithm is given in the supplemental material section here we give general complexity results theorem given lifted schematic with associated tree decomposition ds pc the overall time and space complexity of inference in is opmaxci pc ssc pci qq an application importance sampling is technique which combines exact inference with sampling the idea is to partition the ground atoms into two sets set of atoms say that will be sampled and set of atoms that will be summed out analytically using exact inference techniques say typically the accuracy variance decreases improves as the cardinality of is increased however so does the cost of exact inference which in turn decreases the accuracy because fewer samples are generated thus there is algorithm function makeraofunction input schematic output pxq cs find the clusters of pc findtreedecomposition sizef tu for ci of do dependentcountingpaths ci cf tu for pvj pj of do fj countpathleaves pj cf xvj 
fj sizef cf return sizef is particularly useful in lifted sampling schemes because subproblems over large algorithm function evalraofunction sets of random variables are often tractable input counting store cs list of list of size functions sf subproblems containing assignments can often output the cost of exact inference be summed out in opnq time via lifted clustercosts tu for cf of sf do ing or in time via lifted decomposition the clustercost approach presented in section is ideal for this for xvj fj of cf do assigns getccpvj task because algorithm returns function that for sk of assigns do is specified at the schematic level rather than the clustercost clustercost fj psk search space level computing the size of the clustercost return maxpclustercostsq maining search space requires just the evaluation of set of polynomials in this section we introduce our sampling scheme which adds to lifted importance sampling lis as detailed in technically lis is minor modification of ptp in which instead of searching over all possible truth assignments to ground atoms via lifted conditioning the algorithm generates random truth assignment lifted sampling and weighs it appropriately to yield an unbiased estimate of the partition function computing the size bounding schematic xvs es vr to sample we introduce preprocessing step that constructs size evaluation function for each vs algorithm details the process of creating the function for one node it takes as input the schematic rooted at it first finds the tree decomposition of the algorithm then finds the dependent paths in each cluster finally it applies algorithm to each dependent path and wraps the resulting function with the variable dependency it returns list of list of variable function pairs importance sampling at lifted or sampling at lifted or nodes is similar to its propositional analogue each lifted or node is now specified by an xr sf in which is the proposal distribution for pr iq and sf is the output of algorithm the sampling algorithm takes an additional input cb specifying the complexity bound for given an or node where unknown we first compute the cost of exact inference algorithm describes the procedure it takes as input the list of lists sf output by algorithm and the counting store detailing the counting assignments already made by the current sample for each sublist in the input list the algorithm evaluates each variable function pair by retrieving the list of current assignments from the counting store evaluating the function for the domain size of each assignment and computing the product of the results each of these values represents bound on the cost of inference for single cluster algorithm returns the maximum of this list if cb we call evaln odepsq otherwise we sample assignment from with probability qi update the counting store with assignment and call samplen odeps where is the child schematic yielding estimate of the partition function of we then return as the estimate of the partition function at importance sampling at lifted and sampling at lifted and nodes differs from its propositional counterpart in that decomposer labeled edge pa represents distributions that are not only independent but also identical let be lifted and node that we wish to sample with children sk with corresponding decomposer labels dk for each edge with no decomposer label take di then the estimator for the partition function at is ku di δti experiments time vs log sample variance log sample variance we ran our importance sampler on three benchmark 
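The sample-versus-solve decision just described can be illustrated generically: if the evaluated size functions report that exact inference at a lifted OR node fits within the complexity bound, the branches are summed out exactly; otherwise a single branch is drawn from the proposal and reweighted so that the estimate of the node's contribution to the partition function remains unbiased. The branch weights, proposal and bound below are placeholders rather than the paper's data structures.

```python
import random

def estimate_or_node(branch_weights, proposal, exact_cost, complexity_bound, rng=random):
    if exact_cost <= complexity_bound:
        # Cheap enough: sum out all branches exactly (lifted conditioning).
        return sum(branch_weights)
    # Too expensive: draw one branch i with probability proposal[i] and return
    # branch_weights[i] / proposal[i], an unbiased estimate of the exact sum.
    i = rng.choices(range(len(branch_weights)), weights=proposal, k=1)[0]
    return branch_weights[i] / proposal[i]

weights = [4.0, 1.0, 0.5]
q = [1.0 / 3.0] * 3  # uniform proposal, as used in the paper's experiments
print(estimate_or_node(weights, q, exact_cost=10, complexity_bound=100))     # exact: 5.5
print(estimate_or_node(weights, q, exact_cost=10**9, complexity_bound=100))  # sampled
```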
srms and datasets the friends smokers and asthma mln and dataset described in the webkb mln for collective classification and the protein mln in which the task is to infer protein interactions from biological data all models are available from time friends and smokers asthma objects evidence time vs log sample variance log sample variance results figure shows the sample variance of the estimators as function of time we see that the samplers typically have smaller variance than lis however increasing the complexity bound typically does not improve the variance as function of time but the variance does improve as function of number of samples our results indicate that the structure of the model plays role in determining the most efficient complexity bound for sampling in general models with large decomposers especially near the bottom of the schematic will benefit from larger complexity bound because it is often more efficient to perform exact inference over decomposer node time webkb objects evidence time vs log sample variance log sample variance setup for each model we set randomly selected ground atoms as evidence and designated them to have rue value we then estimated the partition function via our sampler with complexity bounds bound of yields the lis algorithm we used the uniform distribution as our proposal we ran each sampler times and computed the sample variance of the estimates conclusions and future work time protein objects evidence figure log variance as function of time in this work we have presented an representation of srms based on the framework using this framework we have proposed an accurate and efficient method for bounding the cost of inference for the family of lifted conditioning based algorithms such as probabilistic theorem proving given shattered srm we have shown how the method can be used to quickly identify tractable subproblems of the model we have presented one immediate application of the scheme by developing lifted importance sampling algorithm which uses our bounding scheme as variance reducer acknowledgments we gratefully acknowledge the support of the defense advanced research projects agency darpa probabilistic programming for advanced machine learning program under air force research laboratory afrl prime contract no any opinions findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the view of darpa afrl or the us government references bidyuk and dechter cutset sampling for bayesian networks journal of artificial intelligence research braz eyal amir and dan roth lifted probabilistic inference in proceedings of the international joint conference on artificial intelligence pages citeseer george casella and christian robert of sampling schemes biometrika chavira and darwiche on probabilistic inference by weighted model counting artificial intelligence luc de raedt and kristian kersting probabilistic inductive logic programming springer rina dechter and robert mateescu search spaces for graphical models artificial intelligence michael genesereth and eric kao introduction to logic second edition morgan claypool publishers vibhav gogate and pedro domingos exploiting logical structure in lifted probabilistic inference in statistical relational artificial intelligence vibhav gogate and pedro domingos probabilistic theorem proving in proceedings of the conference annual conference on uncertainty in artificial intelligence pages corvallis oregon auai press vibhav gogate abhay kumar jha and 
deepak venugopal advances in lifted importance sampling in aaai abhay jha vibhav gogate alexandra meliou and dan suciu lifted inference seen from the other side the tractable features in advances in neural information processing systems pages brian milch bhaskara marthi stuart russell david sontag daniel ong and andrey kolobov blog probabilistic models with unknown objects statistical relational learning page niepert lifted probabilistic inference an mcmc perspective in uai workshop on statistical relational artificial intelligence niepert marginal density estimation in aaai conference on artificial intelligence pages david poole probabilistic inference in ijcai volume pages citeseer david poole fahiem bacchus and jacek kisynski towards completely lifted probabilistic inference arxiv preprint matthew richardson and pedro domingos markov logic networks machine learning sang beame and kautz solving bayesian networks by weighted model counting in proceedings of the twentieth national conference on artificial intelligence pages dan suciu abhay jha vibhav gogate and alexandra meliou lifted inference seen from the other side the tractable features in nips nima taghipour jesse davis and hendrik blockeel decomposition trees in advances in neural information processing systems pages guy van den broeck nima taghipour wannes meert jesse davis and luc de raedt lifted probabilistic inference by knowledge compilation in proceedings of the twentysecond international joint conference on artificial volume three pages aaai press deepak venugopal and vibhav gogate on lifting the gibbs sampling algorithm in advances in neural information processing systems pages 
Hamiltonian Monte Carlo with Efficient Kernel Exponential Families

Heiko, Dino, Samuel Livingstone, Zoltan, Arthur
Gatsby Unit, University College London; Department of Statistics, University of Oxford; School of Mathematics, University of Bristol

Abstract
We propose Kernel Hamiltonian Monte Carlo (KMC), an adaptive MCMC algorithm based on Hamiltonian Monte Carlo (HMC), for target densities where classical HMC is not an option due to intractable gradients. KMC adaptively learns the target's gradient structure by fitting an exponential family model in a reproducing kernel Hilbert space. Computational costs are reduced by two novel efficient approximations to this gradient. While being asymptotically exact, KMC mimics HMC in terms of sampling efficiency, and offers substantial mixing improvements over gradient-free samplers. We support our claims with experimental studies on both toy and real-world applications, including approximate Bayesian computation and pseudo-marginal MCMC.

Introduction
Estimating expectations using Markov chain Monte Carlo (MCMC) is a fundamental approximate inference technique in Bayesian statistics. MCMC itself can be computationally demanding, and the expected estimation error depends directly on the correlation between successive points in the Markov chain. Efficiency can therefore be gained by taking large steps with high probability. Hamiltonian Monte Carlo (HMC) is an MCMC algorithm that improves efficiency by exploiting gradient information: it simulates particle movement along the contour lines of a dynamical system constructed from the target density. Projections of these trajectories cover wide parts of the target's support, and the probability of accepting a move along a trajectory is often close to one. Remarkably, this property is largely invariant to growing dimensionality, where HMC is often superior to random-walk methods, which need to decrease their step size at a much faster rate.

Unfortunately, for a large class of problems, gradient information is not available. For example, in pseudo-marginal MCMC the posterior does not have an analytic expression but can only be estimated at any given point, as in Bayesian Gaussian process classification. A related setting is MCMC for approximate Bayesian computation (ABC), where the posterior is approximated through repeated simulation from a likelihood model. In both cases HMC cannot be applied, leaving random-walk methods as the only mature alternative. There have been efforts to mimic HMC behaviour using stochastic gradients from data subsampling in the big-data setting, or stochastic finite differences in ABC; stochastic-gradient-based HMC methods, however, often suffer from low acceptance rates or from additional bias that is hard to quantify. Random-walk methods can be tuned by matching the scaling of the steps to the target: adaptive Metropolis-Hastings (AMH), for example, is based on learning the global scaling of the target from the history of the Markov chain. Yet for densities with nonlinear support this approach does not work very well. The recently introduced kernel adaptive Metropolis-Hastings (KAMH) algorithm, whose proposals are locally aligned to the target by adaptively learning the target covariance in a reproducing kernel Hilbert space (RKHS), achieves improved sampling efficiency.

In this paper we extend the idea of using kernel methods to learn efficient proposal distributions. Rather than locally smoothing the target density, however, we estimate its gradients globally. More precisely, we fit an infinite-dimensional exponential family model in an RKHS via score matching. This is a method of modelling the log unnormalised target density as an RKHS function, and it has been shown to approximate a rich class of density functions
arbitrarily well more importantly the method has been empirically observed to be relatively robust to increasing dimensionality in sharp contrast to classical kernel density estimation sec gaussian processes gp were also used in as an emulator of the target density in order to speed up hmc however this requires access to the target in closed form to provide training points for the gp we require our adaptive kmc algorithm to be computationally efficient as it deals with highdimensional mcmc chains of growing length we develop two novel approximations to the infinite dimensional exponential family model the first approximation score matching lite is based on computing the solution in terms of lower dimensional yet growing subspace in the rkhs kmc with score matching lite kmc lite is geometrically ergodic on the same class of targets as standard random walks the second approximation uses finite dimensional feature space kmc finite combined with random fourier features kmc finite is an efficient online estimator that allows to use all of the markov chain history at the cost of decreased efficiency in unexplored regions choice between kmc lite and kmc finite ultimately depends on the ability to initialise the sampler within regions of the target alternatively the two approaches could be combined experiments show that kmc inherits the efficiency of hmc and therefore mixes significantly better than adaptive samplers on number of target densities including on synthetic examples and when used in and all code can be found at https background and previous work let the domain of interest be subset of rd and denote the unnormalised target density on by we are interested in constructing markov chain such that xt by running the markov chain for long time we can consistently approximate any expectation markov chains are constructed using the algorithm which at the current state xt draws point from proposal mechanism and sets with probability min xt xt and xt otherwise we assume that is that we can neither evaluate log for any but can only estimate it unbiasedly via replacing with results in which asymptotically remains exact inference kernel adaptive in the absence of log the usual choice of is random walk σt popular choice of the scaling is σt when the scale of the target density is not uniform across dimensions or if there are strong correlations the amh algorithm improves mixing by adaptively learning global covariance structure of from the history of the markov chain for cases where the local scaling does not match the global covariance of the support of the target is nonlinear kamh improves mixing by learning the target covariance in rkhs kamh proposals are gaussian with covariance that matches the local covariance of around the current state xt without requiring access to log hamiltonian monte carlo hamiltonian monte carlo hmc uses deterministic measurepreserving maps to generate efficient markov transitions starting from the negative log target referred to as the potential energy log we introduce an auxiliary momentum variable exp with the joint distribution of is then proportional to exp where is called the hamiltonian defines hamiltonian flow parametrised by trajectory length which is map φh for which this allows constructing markov chains for chain at state xt repeatedly exp and then ii apply the hamiltonian flow the compactness restriction is imposed to satisfy the assumptions in is analytically intractable as opposed to computationally expensive in the big data context throughout the paper 
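The Metropolis-Hastings step referred to above, in its pseudo-marginal variant, can be sketched as follows: the unnormalised target can only be estimated unbiasedly, a fresh estimate is drawn at the proposed point, and the estimate at the current point is recycled between iterations so that the chain still targets the exact posterior. The proposal, the noisy density estimator and all names below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def pseudo_marginal_mh_step(x, pi_hat_x, estimate_density, propose, rng):
    x_new = propose(x, rng)               # a symmetric proposal is assumed here
    pi_hat_new = estimate_density(x_new)  # fresh unbiased estimate at the proposal
    if rng.uniform() < min(1.0, pi_hat_new / pi_hat_x):
        return x_new, pi_hat_new          # accept: keep the new estimate
    return x, pi_hat_x                    # reject: recycle the current estimate

rng = np.random.default_rng(0)
propose = lambda x, rng: x + 0.5 * rng.standard_normal(x.shape)
# Noisy but unbiased estimate of an unnormalised standard Gaussian density.
estimate_density = lambda x: np.exp(-0.5 * x @ x) * rng.lognormal(-0.05, np.sqrt(0.1))
x = np.zeros(2)
pi_hat = estimate_density(x)
for _ in range(100):
    x, pi_hat = pseudo_marginal_mh_step(x, pi_hat, estimate_density, propose, rng)
print(x)
```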
denotes the gradient operator to for time giving φh the flow can be generated by the hamiltonian operator in practice is usually unavailable and we need to resort to approximations here we limit ourselves to the integrator see for details to correct for discretisation error metropolis acceptance procedure can be applied starting from the of the approximate trajectory is accepted with probability min exp hmc is often able to propose distant uncorrelated moves with high acceptance probability intractable densities in many cases the gradient of log can not be written in closed form leaving based methods as the we aim to overcome behaviour so as to obtain significantly more efficient sampling kernel induced hamiltonian dynamics kmc replaces the potential energy in by kernel induced surrogate computed from the history of the markov chain this surrogate does not require gradients of the density the surrogate induces kernel hamiltonian flow which can be numerically simulated using standard integration as with the discretisation error in hmc any deviation of the kernel induced flow from the true flow is corrected via metropolis acceptance procedure this here also contains the estimation noise from and previous values of table consequently the stationary distribution of the chain remains correct given that we take care when adapting the surrogate infinite dimensional exponential families in rkhs we construct kernel induced potential energy surrogate whose gradients approximate the gradients of the true potential energy in without accessing or directly but only using the history of the markov chain to that end we model the unnormalised target density with an infinite dimensional exponential family model of the form const exp hf ih which in particular implies log here is rkhs of real valued functions on the rkhs has uniquely associated symmetric positive definite kernel which satisfies hf ih for any the canonical feature map here takes role of the sufficient statistics while are the natural parameters and log exp hf ih dx is the cumulant generating function eq defines broad class of densities when universal kernels are used the family is dense in the space of continuous densities on compact domains with respect to total variation and kl section it is possible to consistently fit an unnormalised version of by directly minimising the expected gradient mismatch between the model and the true target density observed through the markov chain history this is achieved by generalising the score matching approach to infinite dimensional parameter spaces the technique avoids the problem of dealing with the intractable and reduces the problem to solving linear system more importantly the approach is observed to be relatively robust to increasing dimensions we return to estimation in section where we develop two efficient approximations for now assume access to an fˆ such that log kernel induced hamiltonian flow we define kernel induced hamiltonian operator by ing in the potential energy part in by our kernel surrogate uk it is clear that depending on uk the resulting kernel induced hamiltonian flow differs from the original one that said any bias on the resulting markov chain in addition to discretisation error from the integrator is naturally corrected for in the metropolis step we accept an φh of trajectory starting at along the kernel induced flow with probability min exp φh hk where φh corresponds to the true hamiltonian at φt here in the pseudomarginal context we replace both terms in the ratio in by 
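A minimal sketch of the leapfrog integration that generates kernel-induced trajectories: relative to standard HMC, the only change is that the gradient of the potential energy is replaced by a surrogate (in KMC, minus the gradient of the fitted log-density model; here a placeholder). Step size, number of steps and the example target are illustrative.

```python
import numpy as np

def leapfrog(x, p, grad_potential, step_size, num_steps):
    x, p = x.copy(), p.copy()
    p = p - 0.5 * step_size * grad_potential(x)   # initial half step for momentum
    for _ in range(num_steps - 1):
        x = x + step_size * p                     # full step for position
        p = p - step_size * grad_potential(x)     # full step for momentum
    x = x + step_size * p
    p = p - 0.5 * step_size * grad_potential(x)   # final half step for momentum
    return x, -p                                  # negate momentum for reversibility

# Surrogate gradient of the potential for a standard Gaussian target, U(x) = ||x||^2 / 2;
# in KMC this would be minus the gradient of the fitted log-density model.
grad_surrogate = lambda x: x
rng = np.random.default_rng(1)
x0, p0 = np.array([1.0, -0.5]), rng.standard_normal(2)
x_prop, p_prop = leapfrog(x0, p0, grad_surrogate, step_size=0.1, num_steps=20)
print(x_prop)
```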
unbiased estimates we replace acceptance prob hmc acceptance prob kmc steps steps figure hamiltonian trajectories on standard gaussian end points of such trajectories red stars to blue stars form the proposal of algorithms left plain hamiltonian trajectories oscillate on stable orbit and acceptance probability is close to one right kernel induced trajectories and acceptance probabilities on an estimated energy function within with an unbiased estimator note that this also involves recycling the estimates of from previous iterations to ensure anyymptotic correctness table any deviations of the kernel induced flow from the true flow result in decreased acceptance probability we therefore need to control the approximation quality of the kernel induced potential energy to maintain high acceptance probability in practice see figure for an illustrative example two efficient estimators for exponential families in rkhs we now address estimating the infinite dimensional exponential family model from data the original estimator in has large computational cost this is problematic in the adaptive mcmc context where the model has to be updated on regular basis we propose two efficient approximations each with its strengths and weaknesses both are based on score matching score matching following we model an unnormalised log probability density log with parametric model log log log where is collection of parameters of yet unspecified dimension natural parameters of and is an unknown normalising constant we aim to find fˆ from set of xi such that fˆ const from eq the criterion being optimised is the expected squared distance between gradients of the log density score functions log log dx where we note that the normalising constants vanish from taking the gradient as shown in theorem it is possible to compute an empirical version without accessing or log other than through observed samples xx log log our approximations of the original model are based on minimising using approximate scores infinite dimensional exponential families lite the original estimator of in takes dual form in rkhs spanned by nd kernel derivatives thm the update of the proposal at the iteration of mcmc requires inversion of td td matrix this is clearly prohibitive if we are to run even moderate number of iterations of markov chain following we take simple approach to avoid prohibitive computational costs in we form proposal using random of fixed size from the markov chain history zi xi in order to avoid excessive computation when is large we replace the full dual solution with solution in terms of span zi which covers the support of the true density by construction and grows with increasing that is we assume that the model takes the light form we assume fixed sample set here but will use both the full chain history xi or later αi zi where rn are real valued parameters that are obtained by minimising the empirical score matching objective this representation is of form similar to section the main differences being that the basis functions are chosen randomly the basis set grows with and we will require an additional regularising term the estimator is summarised in the following proposition which is proved in appendix pn αi zi for the proposition given set of samples zi and assuming gaussian kernel of the form exp kx and the unique minimiser of the λkf empirical score matching objective is given by λi where rn and are given by ks ds kx and dx kdx kdx dx with products and dx diag the estimator costs computation for computing and for 
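For any model that is linear in its parameters, the empirical score-matching objective can be written purely in terms of first and second derivatives of the model at the observed samples, after the usual integration-by-parts step. The sketch below evaluates that objective for a toy one-dimensional feature map; the feature map and data are illustrative, and no claim is made that this reproduces the paper's closed-form solutions.

```python
import numpy as np

def empirical_score_matching(theta, X, grad_phi, hess_diag_phi):
    # grad_phi(x): (m, d) Jacobian of the feature map; hess_diag_phi(x): (m, d) matrix
    # of diagonal second derivatives.  Returns the empirical objective J(theta).
    J = 0.0
    for x in X:
        df = grad_phi(x).T @ theta        # (d,) gradient of the model at x
        d2f = hess_diag_phi(x).T @ theta  # (d,) diagonal of the model Hessian at x
        J += np.sum(d2f + 0.5 * df ** 2)
    return J / len(X)

# Illustrative 1-d example with phi(x) = (x, x^2): for data from N(0, 1) the
# minimiser is theta = (0, -0.5), i.e. the Gaussian log-density up to a constant.
grad_phi = lambda x: np.array([[1.0], [2.0 * x[0]]])
hess_diag_phi = lambda x: np.array([[0.0], [2.0]])
X = np.random.default_rng(2).standard_normal((2000, 1))
print(empirical_score_matching(np.array([0.0, -0.5]), X, grad_phi, hess_diag_phi))
```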
inverting and storage for fixed random chain history size this can be further reduced via approximations to the kernel matrix and conjugate gradient methods which are derived in appendix pn gradients of the model are given as αi xi they simply require to evaluate gradients of the kernel function evaluation and storage of both cost dn exponential families in finite feature spaces instead of fitting an model on subset of the available data the second estimator is based on fitting finite dimensional approximation using all available data xi in primal form as we will see updating the estimator when new data point arrives can be done online define an approximate feature space hm rm and denote by φx hm the embedding of point rd into hm rm assume that the embedding approximates the kernel function as finite rank expansion φy the log unnormalised density of the infinite model can be approximated by assuming the model in takes the form hθ φx ihm φx to fit we again minimise the score matching objective as proved in appendix proposition given set of samples xi and assuming φx for finite dimensional feature embedding φx rm and the unique minimiser of the regularised empirical score matching objective is given by λi where xx rm with φx and an example feature features and standard gaussian based on random fourier kernel is φx cos cos ωm um with ωi and ui uniform the estimator has cost of computation and storage given that we have computed solution based on the markov chain history xi however it is straightforward to update and the solution online after new point arrives this is achieved by storing running averages and performing updates of matrix inversions and costs computation and storage independent of further details are given in appendix gradients of the model are they require the evaluation of the gradient of the feature space embedding costing md computation and and storage algorithm kernel hamiltonian monte carlo input target possibly noisy estimator adaptation schedule at hmc parameters size of basis or size at iteration current state xt history xi perform with probability at kmc lite kmc finite update xi update to from prop perform update to solve λi αi zi update λi from prop propose with kernel induced hamiltonian flow using perform metropolis step using accept and reject xt otherwise if is noisy and was accepted store above for evaluating in the next iteration kernel hamiltonian monte carlo constructing kernel induced hamiltonian flow as in section from the gradients of the infinite dimensional exponential family model and approximate estimators we arrive at gradient free adaptive mcmc algorithm kernel hamiltonian monte carlo algorithm computational efficiency geometric ergodicity and kmc finite using allows for online updates using the full markov chain history and therefore is more elegant solution than kmc lite which has greater computational cost and requires the chain history due to the parametric nature of kmc finite however the tails of the estimator are not guaranteed to decay for example the random fourier feature embedding described below proposition contains periodic cosine functions and therefore oscillates in the tails of resulting in reduced acceptance probability as we will demonstrate in the experiments this problem does not appear when kmc finite is initialised in regions nor after in situations where information about the target density support is unknown and during we suggest to use the lite estimator whose gradients decay outside of the training data as result kmc lite is 
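The random Fourier feature construction for the Gaussian kernel, and the gradient of the resulting finite-dimensional model f(x) = <theta, phi_x>, can be sketched as follows; the bandwidth, feature count and class name are illustrative. The inner product of two feature vectors approximates the Gaussian kernel, and the model gradient is the quantity KMC-finite would feed to the leapfrog integrator.

```python
import numpy as np

class RFFModel:
    def __init__(self, dim, num_features, bandwidth, rng):
        # omega_i ~ N(0, bandwidth^-2 I) and u_i ~ Uniform[0, 2*pi] for the Gaussian kernel.
        self.omega = rng.standard_normal((num_features, dim)) / bandwidth
        self.u = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
        self.scale = np.sqrt(2.0 / num_features)
        self.theta = np.zeros(num_features)

    def features(self, x):
        return self.scale * np.cos(self.omega @ x + self.u)

    def grad(self, x):
        # Gradient of f(x) = <theta, phi_x> with respect to x.
        return -(self.scale * np.sin(self.omega @ x + self.u) * self.theta) @ self.omega

rng = np.random.default_rng(3)
model = RFFModel(dim=2, num_features=500, bandwidth=1.0, rng=rng)
x, y = rng.standard_normal(2), rng.standard_normal(2)
# phi_x . phi_y approximates the Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2)).
print(model.features(x) @ model.features(y), np.exp(-0.5 * np.sum((x - y) ** 2)))
model.theta = 0.1 * rng.standard_normal(500)
print(model.grad(x))  # what KMC-finite would feed to the leapfrog integrator
```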
guaranteed to fall back to random walk metropolis in unexplored regions inheriting its convergence properties and smoothly transitions to proposals as the mcmc chain grows proof of the proposition below can be found in appendix proposition assume has tails the regularity conditions of thm implying and smallness of compact sets that mcmc adaptation stops after fixed time and fixed number of steps if lim and then kmc lite is geometrically ergodic from any starting point vanishing adaptation mcmc algorithms that use the history of the markov chain for constructing proposals might not be asymptotically correct we follow sec and the idea of ing adaptation to avoid biases let at be schedule of decaying probabilities such that at and at we update the density gradient estimate according to this schedule in algorithm intuitively adaptation becomes less likely as the mcmc chain progresses but never fully stops while sharing asymptotic convergence with adaptation that stops at fixed point theorem note that proposition is stronger statement about the convergence rate free parameters kmc has two free parameters the gaussian kernel bandwidth and the regularisation parameter as kmc performance depends on the quality of the approximate infinite dimensional exponential family model in or principled approach is to use the score matching objective function in to choose pairs via using blackbox optimisation earlier adaptive mcmc methods did not address parameter choice experiments we start by quantifying performance of kmc finite on synthetic targets we emphasise that these results can be reproduced with the lite version hmc kmc median kmc kmc figure hypothetical acceptance probability of kmc finite on challening target in growing dimensions left as function of and slices through left plot with error bars for fixed and as function of left and for fixed as function of right acc rate minimum ess hmc kmc rw kamh figure results for the synthetic banana as the amout of observed data increases kmc performance approaches hmc outperforming kamh and rw error bars over runs kmc finite stability of trajectories in high dimensions in order to quantify efficiency in growing dimensions we study hypothetical acceptance rates along trajectories on the kernel induced hamiltonian flow no mcmc yet on challenging gaussian target we sample the diagonal entries of the covariance matrix from gamma distribution and rotate with uniformly sampled random orthogonal matrix the resulting target is challenging to estimate due to its smoothness substantially differing across its principal components as single gaussian kernel is not able to effeciently represent such scaling families we use rational quadratic kernel for the gradient estimation whose random features are straightforward to compute figure shows the average acceptance over independent trials as function of the number of ground truth samples and basis functions which are set to be equal and of dimension in low to moderate dimensions gradients of the finite estimator lead to acceptance rates comparable to plain hmc on targets with more regular smoothness the estimator performs well in up to with less variance see appendix for details kmc finite mixing on synthetic example we next show that kmc performance approaches that of hmc as it sees more data we compare kmc hmc an isotropic random walk rw and kamh on the nonlinear target see appendix we here only quantify mixing after sufficient speed is included in next example we quantify performance on estimating the target mean which is 
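The vanishing-adaptation rule amounts to a single decision per iteration: refit the gradient surrogate with probability a_t, where a_t decays to zero but its sum diverges, so adaptation never fully stops. The schedule a_t = 1/sqrt(t + 1) and the function names below are illustrative choices, not the paper's.

```python
import numpy as np

def maybe_adapt(t, history, surrogate, refit_surrogate, rng):
    a_t = 1.0 / np.sqrt(t + 1.0)          # decays to zero, but its sum diverges
    if rng.uniform() < a_t:
        return refit_surrogate(history)    # e.g. re-solve the score-matching system
    return surrogate                        # otherwise keep the current gradient model

rng = np.random.default_rng(4)
surrogate = lambda x: x                     # stand-in gradient model
refit = lambda history: surrogate           # stand-in refit routine
for t in range(100):
    surrogate = maybe_adapt(t, history=None, surrogate=surrogate, refit_surrogate=refit, rng=rng)
```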
exactly we tuned the scaling of kamh and rw to achieve acceptance we set hmc parameters to achieve acceptance and then used the same parameters for kmc we ran all samplers for iterations from random start point discarded the and computed acceptance rates the norm of the empirical mean and the minimum effective sample size ess across dimensions for kamh and kmc we repeated the experiment for an increasing number of samples and basis functions figure shows the results as function of kmc clearly outperforms rw and kamh and eventually achieves performance close to hmc as grows kmc lite mcmc for gp classification on real world data we next apply kmc to sample from the marginal posterior over of gaussian process classification gpc model on the uci glass dataset classical hmc can not be used for this problem due to the intractability of the marginal data likelihood our experimental protocol mostly follows section see appendix but uses only mcmc iterations without discarding we study how fast kmc initially explores the target we compare convergence in terms of all mixed moments of order up to to set of benchmark samples mmd lower is better kmc randomly uses between and leapfrog steps of size chosen uniformly in autocorrelation mmd from ground truth kmc kamh rw kmc rw habc iterations lag figure left results for marginal posterior over length scales of gpc model applied to the uci glass dataset the plots shows convergence no discarded of all mixed moments up to order lower mmd is better and marginal posterior for skew normal likelihood while kmc mixes as well as habc it does not suffer from any bias overlaps with rw while habc is significantly different and requires fewer simulations per proposal standard gaussian momentum and kernel tuned by see appendix we did not extensively tune the hmc parameters of kmc as the described settings were sufficient both kmc and kamh used samples from the chain history figure left shows that kmc burnin contains short exploration phase where produced estimates are bad due to it falling back to random walk in unexplored regions proposition from around iterations however kmc clearly outperforms both rw and the earlier kamh these results are backed by the minimum ess not plotted which is around for kmc and is around and for kamh and rw respectively note that all samplers effectively stop improving from iterations indicating bias all samplers took time with most time spent estimating the marginal likelihood kmc lite reduced simulations and no additional bias in abc we now apply kmc in the context of approximate bayesian computation abc which often is employed when the data likelihood is intractable but can be obtained by simulation see targets an approximate posterior by constructing an unbiased monte carlo estimator of the approximate likelihood as each such evaluation requires expensive simulations from the likelihood the goal of all abc methods is to reduce the number of such simulations accordingly hamiltonian abc was recently proposed combining the synthetic likelihood approach with gradients based on stochastic finite differences we remark that this requires to simulate from the likelihood in every leapfrog step and that the additional bias from the gaussian likelihood approximation can be problematic in contrast kmc does not require simulations to construct proposal but rather invests simulations into an step that ensures convergence to the original abc target figure right compares performance of rw habc sticky random numbers and spas sec and kmc on distribution 
hα yi with kmc mixes as well as habc but habc suffers from severe bias kmc also reduces the number of simulations per proposal by factor see appendix for details discussion we have introduced kmc gradient free adaptive mcmc algorithm that mimics hmc behaviour by estimating target gradients in an rkhs in experiments kmc outperforms random walk based sampling methods in up to dimensions including the recent kernelbased kamh kmc is particularly useful when gradients of the target density are unavailable as in or where classical hmc can not be used we have proposed two efficient empirical estimators for the target gradients each with different strengths and weaknesses and have given experimental evidence for the robustness of both future work includes establishing theoretical consistency and uniform convergence rates for the empirical estimators for example via using recent analysis of random fourier features with tight bounds and thorough experimental study in the context where we see lot of potential for kmc it might also be possible to use kmc as precomputing strategy to speed up classical hmc as in for code see https references neal mcmc using hamiltonian dynamics handbook of markov chain monte carlo beaumont estimation of population growth or decline in genetically monitored populations genetics andrieu and roberts the approach for efficient monte carlo computations the annals of statistics april filippone and girolami bayesian inference for gaussian processes ieee transactions on pattern analysis and machine intelligence marjoram molitor plagnol and markov chain monte carlo without likelihoods proceedings of the national academy of sciences sisson and fan markov chain monte carlo handbook of markov chain monte carlo chen fox and guestrin stochastic gradient hamiltonian monte carlo in icml pages meeds leenders and welling hamiltonian abc in uai betancourt the fundamental incompatibility of hamiltonian monte carlo and data subsampling arxiv preprint haario saksman and tamminen adaptive proposal distribution for random walk metropolis algorithm computational statistics andrieu and thoms tutorial on adaptive mcmc statistics and computing december sejdinovic strathmann lomeli andrieu and gretton kernel adaptive in icml sriperumbudur fukumizu kumar gretton and density estimation in infinite dimensional exponential families arxiv preprint estimation of statistical models by score matching jmlr larry wasserman all of nonparametric statistics springer rasmussen gaussian processes to speed up hybrid monte carlo for expensive bayesian integrals bayesian statistics pages rahimi and recht random features for kernel machines in nips pages betancourt byrne and girolami optimizing the integrator step size for hamiltonian monte carlo arxiv preprint berlinet and reproducing kernel hilbert spaces in probability and statistics kluwer some extensions of score matching computational statistics data analysis sriperumbudur and optimal rates for random fourier features in nips roberts and tweedie geometric convergence and central limit theorems for multidimensional hastings and metropolis algorithms biometrika roberts and rosenthal coupling and ergodicity of adaptive markov chain monte carlo algorithms journal of applied probability bache and lichman uci machine learning repository gretton borgwardt smola and rasch kernel test jmlr wood statistical inference for noisy nonlinear ecological dynamic systems nature zhang shahbaba and zhao hamiltonian monte carlo acceleration using neural network surrogate functions 
arXiv preprint. Shawe-Taylor and Cristianini. Kernel methods for pattern analysis. Cambridge University Press. Le and Smola. Kernel expansions in loglinear time. In ICML. Mengersen and Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics.
Linear Allocation with Feedback

Koby Crammer (Department of Electrical Engineering, The Technion, Israel), Tor Lattimore (Department of Computing Science, University of Alberta, Canada), Csaba Szepesvári (Department of Computing Science, University of Alberta, Canada)

Abstract
We study an idealised sequential resource allocation problem. In each time step the learner chooses an allocation of several resource types between a number of tasks. Assigning more resources to a task increases the probability that it is completed. The problem is challenging because the alignment of the tasks to the resource types is unknown and the feedback is noisy. Our main contribution is the new setting and an algorithm with a regret analysis. Along the way we draw connections to the problem of minimising regret for stochastic linear bandits with heteroscedastic noise. We also present some new results for stochastic linear bandits on the hypercube that significantly improve on existing work, especially in the sparse case.

Introduction
The economist Thomas Sowell remarked that "the first lesson of economics is scarcity: there is never enough of anything to fully satisfy all those who want it." (He went on to add that the first lesson of politics is to disregard the first lesson of economics.) The optimal allocation of resources is an enduring problem in economics, operations research and daily life. The problem is challenging not only because one is compelled to make difficult choices, but also because the expected outcome of a particular allocation may be unknown and the feedback noisy.

We focus on an idealised resource allocation problem in which the economist plays a repeated resource allocation game with multiple resource types and multiple tasks to which these resources can be assigned. Specifically, we consider a nearly linear model with several resource types and K tasks. In each time step the economist chooses an allocation of resources M_t, where the k-th column M_{t,k} ∈ R^D (D being the number of resource types) represents the amount of each resource type assigned to the k-th task. We assume that the k-th task is completed successfully with probability min{1, ⟨M_{t,k}, ν_k⟩}, where ν_k ∈ R^D is an unknown vector that determines how the success rate of a given task depends on the quantity and type of resources assigned to it. Naturally, we limit the availability of resources by demanding that M_t satisfies Σ_k M_{t,dk} ≤ 1 for every resource type d. At the end of each time step the economist observes which tasks were successful. The objective is to maximise the number of successful tasks up to some time horizon n that is known in advance. This model is a natural generalisation of the one used by Lattimore et al., where it was assumed that there was a single resource type only.

An example application might be the problem of allocating computing resources on a server between a number of virtual private servers (VPSs). In each time step (some fixed interval) the controller chooses how much to allocate to each VPS. A VPS is said to fail in a given round if it fails to respond to requests in a timely fashion. The requirements of each VPS are unknown in advance, but do not change greatly with time. The controller should learn which VPSs benefit the most from which resource types and allocate accordingly.

The main contribution of this paper, besides the new setting, is an algorithm designed for this problem, along with theoretical guarantees on its performance in terms of the regret. Along the way we present some additional results for the related problem of minimising regret for stochastic linear bandits on the hypercube, and we prove new concentration results for weighted least-squares estimation, which may be independently interesting.
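The following small sketch spells out the quantities just defined: the expected number of completed tasks under an allocation matrix, the per-resource feasibility constraint, and the pseudo-regret relative to a fixed comparator allocation. The matrices are illustrative, and the comparator is simply a feasible allocation rather than a certified optimum.

```python
import numpy as np

def expected_successes(M, nu):
    # M, nu: (D, K) matrices; returns sum_k min(1, <M[:, k], nu[:, k]>).
    return np.sum(np.minimum(1.0, np.einsum("dk,dk->k", M, nu)))

def is_feasible(M):
    # Each resource type d may hand out at most one unit in total.
    return np.all(M >= 0.0) and np.all(M.sum(axis=1) <= 1.0 + 1e-12)

def pseudo_regret(allocations, M_comparator, nu):
    best = expected_successes(M_comparator, nu)
    return sum(best - expected_successes(M_t, nu) for M_t in allocations)

nu = np.array([[0.9, 0.3], [0.2, 0.8]])        # 2 resource types, 2 tasks
M_comp = np.array([[1.0, 0.0], [0.0, 1.0]])    # a feasible comparator allocation
M_uniform = np.full((2, 2), 0.5)               # split every resource evenly
assert is_feasible(M_comp) and is_feasible(M_uniform)
print(pseudo_regret([M_uniform] * 3, M_comp, nu))
```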
The generalisation of the work of Lattimore et al. to multiple resources turns out to be fairly non-trivial. Those with knowledge of the theory of stochastic linear bandits will recognise some similarity: in particular, once the nonlinearity of the objective is removed, the problem is equivalent to playing K linear bandits in parallel, but where the limited resources constrain the actions of the learner and, correspondingly, the returns for each task. Stochastic linear bandits have recently been generating a significant body of research (Auer, Dani et al., Rusmevichientong and Tsitsiklis, Agrawal and Goyal, and many others). A related problem is that of online combinatorial optimisation. This has an extensive literature, but most results are only applicable for discrete action sets, are in the adversarial setting, and cannot exploit the additional structure of our problem; nevertheless, we refer the interested reader to, say, the recent work by Kveton et al. and references therein. Also worth mentioning is that the resource allocation problem at hand is quite different from the linear variant proposed and analysed by Krishnamurthy et al., where the action set is also finite; the setting is different in many other ways besides.

Given its similarity, it is tempting to apply the techniques of linear bandits to our problem. When doing so, two main difficulties arise. The first is that our payoffs are nonlinear: the expected reward is a linear function only up to a point, after which it is clipped. In the resource allocation problem this has a natural interpretation, namely that allocating resources beyond a certain point is fruitless. Fortunately, one can avoid this difficulty rather easily by ensuring that, with high probability, resources are never over-allocated. The second problem concerns achieving good regret regardless of the task specifics. In particular, when the number of tasks is large and resources are at a premium, the allocation problem behaves more like a finite-armed bandit, where the economist must choose the few tasks that can be completed successfully. For this kind of problem the worst-case regret should grow much more slowly with the number of tasks (Auer et al.; Bubeck et al.); the standard linear bandits approach, on the other hand, would lead to a bound on the regret that depends linearly on the number of tasks. To remedy this situation, we exploit the fact that if the number of tasks is large and resources are scarce, then many tasks will necessarily be starved of resources and will fail with high probability. Since the noise model is Bernoulli, the variance of the noise for these tasks is extremely low; by using weighted estimators we are able to exploit this and thereby obtain an improved regret. An added benefit is that when resources are plentiful, all tasks succeed with high probability under the optimal allocation, and in this case the variance is also low. This leads to improved regret in the case where the optimal allocation fully allocates every task.

Preliminaries
If A is some event, then A^c is its complement, i.e. the event that A does not occur. If G is positive definite and x is a vector, then ||x||_G stands for the weighted norm sqrt(x^T G x). We write |x| for the vector of absolute values of x. We let N be the matrix with columns ν_1, ..., ν_K; all entries in N are non-negative, but otherwise we make no global assumptions on N. At each time step the learner chooses an allocation matrix M_t with Σ_k M_{t,dk} ≤ 1 for every resource type d. The assumption that each resource type has a bound of 1 is without loss of generality, since the units of any resource can be changed to accommodate this assumption. We write M_{t,k} for the k-th column of M_t. The reward at time step t is Σ_k Y_{t,k}, where Y_{t,k} is sampled from a Bernoulli distribution with parameter min{1, ⟨M_{t,k}, ν_k⟩}. The economist observes all of the Y_{t,k}, however, not just their sum. (The confidence sets used by the optimistic algorithm below are built from weighted least-squares estimates; a brief sketch follows.)
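As a preview of the estimation machinery used by the optimistic algorithm later in this section, the sketch below forms the weighted regularised least-squares estimate G = alpha*I + sum_s gamma_s m_s m_s^T and theta_hat = G^{-1} sum_s gamma_s m_s y_s from simulated allocation/outcome pairs for a single task, and checks membership of a candidate parameter in the ellipsoid {nu : ||nu - theta_hat||_G <= beta}. The data, the unit weights and the value of beta are illustrative, not the constants from the theorems.

```python
import numpy as np

def weighted_least_squares(ms, ys, gammas, alpha):
    # ms: (n, d) allocations, ys: (n,) Bernoulli outcomes, gammas: (n,) weights.
    d = ms.shape[1]
    G = alpha * np.eye(d) + (ms * gammas[:, None]).T @ ms
    theta = np.linalg.solve(G, (ms * (gammas * ys)[:, None]).sum(axis=0))
    return theta, G

def in_confidence_set(nu, theta, G, beta):
    diff = nu - theta
    return float(np.sqrt(diff @ G @ diff)) <= beta

rng = np.random.default_rng(6)
nu_true = np.array([0.6, 0.3])
ms = rng.uniform(0.0, 1.0, size=(200, 2))
ys = rng.binomial(1, np.clip(ms @ nu_true, 0.0, 1.0))
gammas = np.ones(200)  # unit weights for this toy example; the algorithm chooses its own
theta_hat, G = weighted_least_squares(ms, ys, gammas, alpha=1.0)
print(theta_hat, in_confidence_set(nu_true, theta_hat, G, beta=3.0))
```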
optimal allocation is denoted by and defined by arg max hmk νk we are primarily concerned with designing an allocation algorithm that minimises the expected pseudo regret of this problem which is defined by xx rn hmk νk hmtk νk where the expectation is taken over both the actions of the algorithm and the observed reward optimal allocations if is known then the optimal allocation can be computed by constructing an appropriate linear program somewhat surprisingly it may also be computed exactly in log log time using algorithm below the optimal allocation is not so as simply allocating resources to the incomplete task for which the corresponding is largest in some dimension for example for tasks and resource types we see that even though is the largest algorithm eter the optimal allocation assigns only half of the input second resource to this task the right apm and rd proach is to allocate resources to incomplete tasks while hmk νk and bd do using the ratios as prescribed by algorithm the hmk νk and bd intuition for allocating in this way is that resources νdk arg max min should be allocated as efficiently as possible and νdi ficiency is determined by the ratio of the expected hmk νk success due to the allocation of resource and the mdk min bd νdk amount of resources allocated end while theorem algorithm returns return the proof of theorem and an implementation of algorithm may be found in the supplementary material we are interested primarily in the case when is unknown so algorithm will not be directly applicable nevertheless the algorithm is useful as module in the implementation of subsequent algorithm that estimates from data optimistic allocation algorithm we follow the optimism in the face of uncertainty principle in each time step the algorithm constructs an estimator for each νk and corresponding confidence set ctk for which νk ctk holds with high probability the algorithm then takes the optimistic action subject to the assumption that νk does indeed lie in ctk for all the main difficulty is the construction of the confidence sets like other authors dani et rusmevichientong and tsitsiklis et we define our confidence sets to be ellipses but the use of weighted estimator means that our ellipses may be significantly smaller than the sets that would be available by using these previous works in straightforward way the algorithm accepts as input the number of tasks and resource types the horizon and constants and where constant is defined by max kνk so that nk αb log log note that must be known bound on maxk kνk which might seem like serious restriction until one realizes that it is easy to add an initialisation phase where estimates are quickly made while incurring minimal additional regret as was also done by lattimore et al the value of determines the level of regularisation in the least squares estimation and will be tuned later to optimise the regret algorithm optimistic allocation algorithm input for do compute confidence sets for all tasks gtk αi γτ mτ mτ γτ mτ yτ tk kgtk ctk kgtk and ctk compute optimistic allocation mt arg maxmt hmtk observe success indicators ytk for all tasks ytk bernoulli hmtk νk compute weights for all tasks γtk arg hmtk hmtk end for computational efficiency we could not find an efficient implementation of algorithm because solving the bilinear optimisation problem in line is likely to be bennett and mangasarian and also petrik and zilberstein in our experiments we used simple algorithm based on optimising for and in alternative steps combined with random 
restarts but for large and this would likely not be efficient in the supplementary material we present an alternative algorithm that is efficient but relies on the assumption that kνk for all in this regime it is impossible to resources and this fact can be exploited to obtain an efficient and practical algorithm with strong guarantees along the way we are able to construct an elegant algorithm for linear bandits on the hypercube that enjoys optimal regret and adapts to sparsity computing the weights γtk line is somewhat surprisingly define and ptk hmtk kmtk hmtk kmtk tk tk then the weights can be computed by γtk if ptk ptk if ptk otherwise curious reader might wonder why the weights are computed by optimising within confidence set ctk which has double the radius of ctk the reason is rather technical but essentially if the true parameter νk were to lie on the boundary of the confidence set then the corresponding weight could become infinite for the analysis to work we rely on controlling the size of the weights it is not clear whether or not this trick is really necessary regret for algorithm we now analyse the regret of algorithm first we offer bound on the regret that depends on the like we then turn our attention to the case where the optimal allocation satisfies νk for all in this instance we show that the dependence on the horizon is only which would normally be unexpected when the is continuous the improvement comes from the weighted estimation that exploits the fact that the variance of the noise under the optimal allocation vanishes theorem suppose algorithm is run with bound maxk kνk then rn max kνk log choosing log and assuming that maxk kνk then rn nk max kνk log log the proof of theorem will follow by carefully analysing the width of the confidence sets as the algorithm makes allocations we start by proving the validity of the confidence sets and then prove the theorem weighted least squares estimation for this we focus on the problem of estimating single unknown νk let mn be sequence of allocations to task with mt rd let ft be filtration with ft containing information available at the end of round which means that mt is measurable let γn be the sequence of weights chosen by algorithm the sequence of outcomes is yn for which yt hmt νi the weighted regularised gram matrix is gt αi γτ mτ mτ and the corresponding weighted least squares estimator is γt mτ yτ theorem if and is chosen as in eq then νkgt for all with probability at least nk similar results exist in the literature for unweighted estimators for example dani et al rusmevichientong and tsitsiklis et al in our case however gt is the weighted gram matrix which may be significantly larger than an unweighted version when the weights become large the proof of theorem is unfortunately too long to include in the main text but it may be found in the supplementary material analysing the regret we start with some technical lemmas let be the failure event that νk kgtk for some and lemma et al let xn be an arbitrary sequence of with pn kxt and let gt xs xs then min kxt log corollary if does not hold then γtk min kmtk log tk the proof is omitted but follows rather easily by showing that γtk can be moved inside the minimum at price of increasing the loss at most by factor of four and then applying lemma see the supplementary material for the formal proof lemma suppose does not hold then γtk max kνk proof we exploit the fact that γtk is an estimate of the variance which is small whenever kmtk is small γtk arg max hmtk hmtk arg max hmtk 
hmtk νi arg max hmtk νi kmtk kνk kmtk kνk kmtk tk kmtk kmtk kνk where follows from ctk since gtk and basic pand the fact that νp linear algebra since kmtk kmtk kmtk the result is completed pk since the resource constraints implies that kmtk proof of theorem by theorem we have that holds with probability at most nk if does not hold then by the definition of the confidence set we have νk ctk for all and therefore rn νk hmtk νk mtk νk note that we were able to replace hmtk νk hmtk νk since if does not hold then mtk will never be chosen in such way that resources are we will now assume that does not hold and bound the argument in the expectation by the optimism principle we have mtk νk min hmtk νk min kmtk νk kgtk tk min kmtk tk min kmtk tk γtk γtk min kmtk tk nd max kνk γtk min kmtk tk max kνk log where follows from the assumption that νk ctk for all and and since mt is chosen optimistically by the inequality by the definition of which lies inside ctk by jensen inequality by again follows from lemma finally follows from corollary regret in case we now show that if there are enough resources such that the optimal strategy can complete every task with certainty then the regret of algorithm is in contrast to otherwise as before we exploit the low variance but now the variance is small because hmtk νk is close to while in the previous section we argued that this could not happen too often there is no contradiction as the quantity maxk kνk appeared in the previous bound pk theorem if νk then rn log proof we start by showing that the weights are large γtk hmtk νi hmtk νi hmtk νi hmtk νi kmtk νkgtk kmtk tk tk applying the optimism principle and using the bound above combined with corollary gives the result ern min hmtk νk min kmtk tk op min γtk γtk kmtk tk min γtk kmtk tk log experiments we present two experiments to demonstrate the behaviour of algorithm all code and data is available in the supplementary material error bars indicate confidence intervals but sometimes they are too small to see the algorithm is quite conservative so the variance is very low we used for all experiments the first experiment demonstrates the improvements obtained by using weighted estimator over an unweighted one and also serves to give some idea of the rate of learning for this experiment we used and and νk and where the kth column is the for the kth task we ran two versions of the algorithm the first exactly as given in algorithm and the second identical except that the weights were fixed to γtk for all and this value is chosen because it corresponds to the minimum inverse variance for bernoulli variable the data was produced by taking the average regret over runs the results are given in fig in fig we plot γtk the results show that γtk is increasing linearly with this is congruent with what we might expect because in this regime the estimation error should drop with and the estimated variance is proportional to the estimation error note that the estimation error for the algorithm with γtk will be for the second experiment we show the algorithm adapting to the environment we fix and for we define and νk να the unusual profile of the regret as varies can be attributed to two factors first if is small then the algorithm quickly identifies that resources should be allocated first to the first task however in the early stages of learning the algorithm is conservative in allocating to the first task to avoid overallocation since the remaining resources are given to the second task the regret is larger for small because 
the gain from allocating to the second task is small on the other hand if is close to then the algorithm suffers the opposite problem namely it can not identify which task the resources should be assigned to of course if then the algorithm must simply learn that all resources can be allocated safely and so the regret is smallest here an important point is that the algorithm never allocates all its resources at the start of the process because this risks so even in easy problems the regret will not vanish figure weighted vs unweighted estimation figure gap dependence figure weights weighted estimator unweighted estimator regret regret conclusions and summary we introduced the stochastic allocation problem and developed new algorithm that enjoys regret the main drawback of the new algorithm is that its computation time is exponential in the dimension parameters which makes practical implementations challenging unless both and are relatively small despite this challenge we were able to implement that algorithm using relatively brutish approach to solving the optimisation problem and this was sufficient to present experimental results on synthetic data showing that the algorithm is behaving as the theory predicts and that the use of the weighted estimation is leading to real improvement despite the computational issues we think this is reasonable first step towards more practical algorithm as well as solid theoretical understanding of the structure of the problem as consolation and on their own merits we include some other results an efficient both in terms of regret and computation algorithm for the case where overallocation is impossible an algorithm for linear bandits on the hypercube that enjoys optimal regret bounds and adapts to sparsity theoretical analysis of weighted estimators which may have other applications linear bandits with heteroscedastic noise there are many directions for future research the most natural is to improve the practicality of the algorithm we envisage such an algorithm might be obtained by following the program below generalise the thompson sampling analysis for linear bandits by agrawal and goyal this is highly step since it is no longer to show that such an algorithm is optimistic with high probability instead it will be necessary to make do with some kind of local optimism for each task the method of estimation depends heavily on the algorithm its resources only with extremely low probability but this significantly slows learning in the initial phases when the confidence sets are large and the algorithm is acting conservatively ideally we would use method of estimation that depended on the real structure of the problem but existing techniques that might lead to theoretical guarantees empirical process theory do not seem promising if small constants are expected it is not hard to think up extensions or modifications to the setting for example it would be interesting to look at an adversarial setting even defining it is not so easy or move towards model for the likelihood of success given an allocation references yasin csaba and david tax improved algorithms for linear stochastic bandits in advances in neural information processing systems pages yasin david pal and csaba szepesvari conversions and application to sparse stochastic bandits in aistats volume pages shipra agrawal and navin goyal thompson sampling for contextual bandits with linear payoffs arxiv preprint peter auer using confidence bounds for the journal of machine learning research peter auer and 
Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning.
Kristin Bennett and Olvi Mangasarian. Bilinear separation of two sets. Computational Optimization and Applications.
Sebastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated.
Varsha Dani, Thomas Hayes, and Sham Kakade. Stochastic linear optimization under bandit feedback. In COLT.
Akshay Krishnamurthy, Alekh Agarwal, and Miroslav Dudik. Efficient contextual semi-bandit learning. arXiv preprint.
Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. arXiv preprint.
Tor Lattimore, Koby Crammer, and Csaba Szepesvari. Optimal resource allocation with semi-bandit feedback. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).
Marek Petrik and Shlomo Zilberstein. Robust approximate bilinear programming for value function approximation. The Journal of Machine Learning Research.
Paat Rusmevichientong and John Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research.
Thomas Sowell. Is Reality Optional? and Other Essays. Hoover Institution Press.
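To make the allocation setting described above concrete, the following is a minimal Python sketch of the environment, in which task k succeeds with probability min(1, <m_k, nu_k>) subject to per-resource budgets, together with a greedy allocation heuristic based on the efficiency-ratio idea behind Algorithm 1. The function names, the incremental step size, and the greedy loop itself are illustrative simplifications and assumptions of this sketch, not the paper's Algorithm 1, which computes the optimal allocation exactly.

```python
import numpy as np

TOL = 1e-9

def sample_successes(M, nu, rng):
    # One round of feedback: task k succeeds with probability min(1, <m_k, nu_k>),
    # where m_k and nu_k are the k-th columns of M and nu.
    p = np.clip(np.einsum("dk,dk->k", M, nu), 0.0, 1.0)
    return (rng.random(p.shape) < p).astype(float)

def greedy_allocation(nu, budget, step=1e-2):
    # Repeatedly assign a small amount of the (resource, task) pair with the
    # highest marginal success per unit of resource, skipping tasks whose
    # success probability has reached 1, exhausted resources, and entries
    # already at the per-task cap of 1.
    D, K = nu.shape
    M = np.zeros((D, K))
    remaining = np.asarray(budget, dtype=float).copy()
    while True:
        success = np.clip(np.einsum("dk,dk->k", M, nu), 0.0, 1.0)
        eff = nu.copy()
        eff[:, success >= 1.0 - TOL] = -np.inf   # task already complete
        eff[remaining <= TOL, :] = -np.inf       # resource exhausted
        eff[M >= 1.0 - TOL] = -np.inf            # per-task cap reached
        d, k = np.unravel_index(np.argmax(eff), eff.shape)
        if not np.isfinite(eff[d, k]):
            break
        delta = min(step, remaining[d], 1.0 - M[d, k])
        M[d, k] += delta
        remaining[d] -= delta
    return M

rng = np.random.default_rng(0)
nu = rng.uniform(0.0, 1.5, size=(2, 3))   # D = 2 resource types, K = 3 tasks
M = greedy_allocation(nu, budget=[1.0, 1.0])
print(M, sample_successes(M, nu, rng))
```

The greedy variant is used here only because it makes the efficiency-ratio intuition explicit; it allocates in small increments, whereas the exact algorithm described above assigns resources in closed form.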
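The weighted least-squares estimation at the heart of the confidence sets can also be sketched directly: the weighted Gram matrix is G_t = alpha*I + sum_s gamma_s m_s m_s^T, the estimate is nu_hat = G_t^{-1} sum_s gamma_s m_s y_s, and the weight gamma is an inverse-variance term 1/(p(1-p)) for the Bernoulli observation. In the paper the weight is obtained by optimising a pessimistic success probability inside an enlarged confidence set; the plug-in estimate and the clipping floor p_floor below are simplifying assumptions of this sketch.

```python
import numpy as np

class WeightedLSEstimator:
    """Weighted regularised least-squares estimator for a single task."""

    def __init__(self, dim, alpha=1.0):
        self.G = alpha * np.eye(dim)   # G_t = alpha*I + sum_s gamma_s m_s m_s^T
        self.b = np.zeros(dim)         # sum_s gamma_s m_s y_s

    def estimate(self):
        return np.linalg.solve(self.G, self.b)

    def weight(self, m, p_floor=1e-3):
        # Plug-in success probability, kept away from 0 and 1, then inverted
        # Bernoulli variance (always >= 4, the value used as the fixed weight
        # in the unweighted baseline).
        p = float(np.clip(m @ self.estimate(), p_floor, 1.0 - p_floor))
        return 1.0 / (p * (1.0 - p))

    def update(self, m, y):
        gamma = self.weight(m)
        self.G += gamma * np.outer(m, m)
        self.b += gamma * m * y
        return gamma

    def confidence_width(self, m, beta):
        # Width of the confidence interval for <m, nu>: beta * ||m||_{G^{-1}}.
        return beta * float(np.sqrt(m @ np.linalg.solve(self.G, m)))

# Usage on synthetic Bernoulli feedback with a clipped linear mean.
rng = np.random.default_rng(1)
nu = np.array([0.6, 0.3])
est = WeightedLSEstimator(dim=2)
for _ in range(200):
    m = rng.uniform(0.0, 1.0, size=2)
    y = float(rng.random() < min(1.0, m @ nu))
    est.update(m, y)
print(est.estimate())
```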
Unsupervised learning by program synthesis
Kevin Ellis, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology (ellisk); Armando Solar-Lezama, MIT CSAIL, Massachusetts Institute of Technology (asolar); Joshua Tenenbaum, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology (jbt)
Abstract. We introduce an unsupervised learning algorithm that combines probabilistic modeling with techniques for program synthesis. We apply our techniques to both a visual learning domain and a language learning problem, showing that our algorithm can learn many visual concepts from only a few examples and that it can recover some English inflectional morphology. Taken together, these results give both a new approach to unsupervised learning of symbolic, compositional structures and a technique for applying program synthesis tools to noisy data.
Introduction. Unsupervised learning seeks to induce good latent representations of a data set. Nonparametric statistical approaches, such as deep autoencoder networks, density estimators, or nonlinear manifold learning algorithms, have been very successful at learning representations of perceptual input. However, it is unclear how they would represent more abstract structures, such as spatial relations in vision (inside of, or all in a line) or morphological rules in language (the different inflections of verbs). Here we give an unsupervised learning algorithm that synthesizes programs from data with the goal of learning such concepts. Our approach generalizes from small amounts of data and produces interpretable, symbolic representations parameterized by a programming language. Programs, deterministic or probabilistic, are a natural knowledge representation for many domains, and the idea that inductive learning should be thought of as probabilistic inference over programs is decades old. Recent work in learning programs has focused on supervised learning from noiseless input-output pairs or from formal specifications; our goal here is to learn programs from noisy observations, without explicit examples. A central idea in unsupervised learning is compression: finding data representations that require the fewest bits to write down. We realize this by treating observed data as the output of an unknown program applied to unknown inputs. By doing joint inference over the program and the inputs, we recover compressive encodings of the observed data: the induced program gives a generative model for the data, and the induced inputs give an embedding for each data point. Although a completely domain-general method for program synthesis would be desirable, we believe this will remain intractable for the foreseeable future. Accordingly, our approach factors out the domain-specific components of problems in the form of a grammar for program hypotheses, and we show how this allows the same tools to be used for unsupervised program synthesis in two very different domains. In a domain of visual concepts designed to be natural for humans but difficult for machines to learn, we show that our methods can synthesize simple graphics programs representing these visual concepts from only a few example images; these programs outperform both previous baselines and several new baselines we introduce. We also study the domain of learning morphological rules in language, treating rules as programs and inflected verb forms as outputs. We show how to encode prior linguistic knowledge as a grammar over programs and recover linguistic rules useful both for simple stemming tasks and for predicting the phonological form of new words. The unsupervised program synthesis
algorithm the space of all programs is vast and often unamenable to the optimization methods used in much of machine learning we extend two ideas from the program synthesis community to make search over programs tractable sketching in the sketching approach to program synthesis one manually provides sketch of the program to be induced which specifies rough outline of its structure our sketches take the form of probabilistic grammar and make explicit the domain specific prior knowledge symbolic search much progress has been made in the engineering of symbolic solvers for satisfiability modulo theories smt problems we show how to translate our sketches into smt problems program synthesis is then reduced to solving an smt problem these are intractable in general but often solved efficiently in practice due to the highly constrained nature of program synthesis which these solvers can exploit prior work on symbolic search from sketches has not had to cope with noisy observations or probabilities over the space of programs and inputs demonstrating how to do this efficiently is our main technical contribution formalization as probabilistic inference we formalize unsupervised program synthesis as bayesian inference within the following generative model draw program from description length prior over programs which depends upon the sketch draw inputs ii to the program from description length prior pi these inputs are passed to the program to yield zi with zi ii zi defined as ii last we compute the observed data xi by drawing from noise model our objective is to estimate the unobserved and ii from the observed dataset xi we use this probabilistic model to define the description length below which we seek to minimize log pf log xi ii program length data reconstruction error log pi ii data encoding length defining program space we sketch space of allowed programs by writing down context free grammar and write to mean the set of all programs generated by placing uniform production probabilities over each symbol in gives pcfg that serves as prior over programs the pf of eq for example grammar over arithmetic expressions might contain rules that say expressions are either the sum of two expressions or real number or an input variable which we write as having specified space of programs we define the meaning of program in terms of smt primitives which can include objects like tuples real numbers conditionals booleans etc we write to mean the set of expressions built of smt primitives formally we assume comes equipped with denotation for each rule which we write as the denotation of rule in is always written as function of the denotations of that rule children for example denotation for the grammar in eq is where is program input jr rk jxk defining the denotations for grammar is straightforward and analogous to writing wrapper library around the core primitives of the smt solver our formalization factors out the grammar and the denotation but they are tightly coupled and in other synthesis tools written down together the denotation shows how to construct an smt expression from single program in and we use it to build an smt expression that represents the space of all programs such that its solution tells which program in the space solves the synthesis problem the smt solver then solves jointly for the program and its inputs subject to an upper bound upon the total description length this builds upon prior work in program synthesis such as but departs in the quantitative aspect of the constraints and in not 
knowing the program inputs due to space constraints we only briefly describe the synthesis algorithm leaving detailed discussion to the supplement we use algorithm to generate an smt formula that defines the space of programs computes the description length of program and computes the output of program on given input in algorithm the returned description length corresponds to the log pf term of eq while the returned evaluator gives us the ii terms the returned constraints ensure that the program computed by is member of the smt formula generated by algorithm algorithm smt encoding of programs generated by must be supplemented with constraints production of grammar that compute the data reconstruction erfunction generate ror and data encoding length of eq input grammar denotation we handle infinitely recursive grammars output description length by bounding the depth of recursive calls evaluator assertions to the generate procedure as in choices smt solvers are not designed to minimize loss functions but to verify the satisfiabilfor to do ity of set of constraints we minimize let prk choices eq by first asking the solver for any for to do solution then adding constraint saying lrj frj ajr generate prj its solution must have smaller description end for length than the one found previously etc lr lrj until it can find no better solution denotation is function of child denotations let gr be that function for choices qk are arguments to constructor experiments let gr jqk jk qk visual concept learning fr gr fr frk end for humans quickly learn new visual indicator variables specifying which rule is used cepts often from only few examples fresh variables unused in any existing formula in this section we present cn freshbooleanvariable idence that an unsupervised program thesis approach can also learn visual cepts from small number of examples cjk ar our approach is as follows given set of log if if example images we automatically parse if if them into symbolic form then we return synthesize program that maximally compresses these parses intuitively this program encodes the common structure needed to draw each of the example images we take our visual concepts from the synthetic visual reasoning test svrt set of visual classification problems which are easily parsed into distinct shapes fig shows three examples of svrt concepts fig diagrams the parsing procedure for another visual concept two arbitrary shapes bordering each other we defined space of simple graphics programs that control turtle and whose primitives include rotations forward movement rescaling of shapes etc see table both the learner observations and the graphics program outputs are image parses which have three sections list of shapes each shape is tuple of unique id scale from to and coordinates hid scale yi list of containment relations contains where range from one to the number of shapes in the parse list of reflexive borders relations borders where range from one to the number of shapes in the parse the algorithm in section describes purely functional programs programs without state but the grammar in table contains imperative commands that modify turtle state we can think of imperative programs as syntactic sugar for purely functional programs that pass around state variable as is common in the programming languages literature the grammar of table leaves unspecified the number of program inputs when synthesizing program from example images we perform grid search over the number of inputs given images with shapes and maximum 
shape id the grid search considers input shapes to input positions to input lengths and angles and to input scales we set the number of imperative draw commands resp borders contains to resp number of topological relations we now define noise model that specifies how program output produces parse by defining procedure for sampling given first the and coordinates of each shape are perturbed by additive noise drawn uniformly from to in our experiments we put then optional borders and contains relations see table are erased with probability last because the order of the shapes is unidentifiable both the list of shapes and the indices of the relations are randomly permuted the supplement has the smt encoding of the noise model and priors over program inputs which are uniform teleport position initialorientation draw shape scale move distance draw shape scale scale move distance draw shape scale scale figure left pairs of examples of three svrt concepts taken from right the program we synthesize from the leftmost pair this is turtle program capable of drawing this pair of pictures and is parameterized by set of latent variables shape distance scale initial position initial orientation to encourage translational and rotational invariance the first turtle command is constrained to always be teleport to new location and the initial orientation of the turtle which we write as is made an input to the synthesized graphics program we are introducing an unsupervised learning algorithm but the svrt consists of supervised binary classification problems so we chose to evaluate our visual concept learner by having it solve these classification problems given test image and set of examples resp from class resp we use the decision rule rc or equivalently shape id scale shape id scale borders px px rc px px each term in this decision rule is written as marginal probability and we approximate each marginal by lower bounding it by the largest term in its corresponding sum this gives figure the parser segments shapes and identifies their topological relations contains borders emmitting their coordinates topological relations and scales px px px px grammar rule english description teleport move flipx jitter draw contains contains borders borders alternate containment relations borders relations move turtle to new location reset orientation to rotate by angle go forward by distance flip turtle over axis small perturbation to turtle position draw shape at scale scale is either no rescaling or program input zj angle is either or program input θj positions are program inputs rj shapes are program inputs sj lengths are program inputs containment between integer indices into drawn shapes optional containment between integer indices into drawn shapes bordering between integer indices into drawn shapes optional bordering between integer indices into drawn shapes table grammar for the vision domain the is the start symbol for the grammar the token indicates sequencing of imperative commands optional holds with probability see the supplement for denotations of each grammar rule where is min ie log pf log pi ie log ee ie so we induce programs that maximally compress different set of image parses the maximally compressive program is found by minimizing eq putting the observations xi as the image parses putting the inputs ie as the parameters of the graphics program and generating the program by passing the grammar of table to algorithm we evaluated the classification accuracy across each of the svrt problems by sampling 
three positive and negative examples from each class and then evaluating the accuracy on held out test example such estimates were made for each problem we compare with three baselines as shown in fig to control for the effect of our parser we consider how well discriminative classification on the image parses performs for each image parse we extracted the following features number of distinct shapes number of rescaled shapes and number of relations for integer valued features following we used adaboost with decision stumps on these parse features we trained two convolutional network architectures for each svrt problem and found that variant of did best we report those results here the supplement has the network parameters and results for both architectures in several discriminative baselines are introduced these models are trained on image features we compare with their bestperforming model which fed examples to adaboost with decision stumps unsupervised program synthesis does best in terms of average classification accuracy number of svrt problems solved at and correlation with the human data we do not claim to have solved the svrt for example our representation does not model some geometric transformations needed for some of the concepts such as rotations of shapes additionally our parsing procedure occasionally makes mistakes which accounts for the many tasks we solve at accuracies between and morphological rule learning how might language learner discover the rules that inflect verbs we focus on english inflectional morphology system with long history of computational modeling viewed as an unsupervised learning problem our objective is to find compressive representation of english verbs humans learn the task after seven consecutive correct classifications seven correct classifications are likely to occur when classification accuracy is figure comparing human performance on the svrt with classification accuracy for machine learning approaches human accuracy is the fraction of humans that learned the concept is chance level machine accuracy is the fraction of correctly classified held out examples is chance level area of circles is proportional to the number of observations at that point dashed line is average accuracy program synthesis this work trained on examples convnet variant of trained on examples parse image features discriminative learners on features of parse pixels trained on examples humans given an average of examples and solve an average of problems we make the following simplification our learner is presented with triples of hlexeme tense this ignores many of the difficulties involved in language acquisition but see for unsupervised approach to extracting similar information from corpora we can think of these triples as the entries of matrix whose columns correspond to different tenses and whose rows correspond to different lexemes see table we regard each row of this matrix as an observation the xi of eq and identify stems with the inputs to the program we are to synthesize the ii of eq thus our objective is to synthesize program that maps stem to tuple of inflections we put description length prior over the stem and detail its smt encoding in the the supplement we represent words as sequences of phonemes and define space of programs that operate upon words given in table english inflectional verb morphology has set of regular rules that apply for almost all words as well as small set of words whose inflections do not follow regular rule the irregular forms we roll these 
irregular forms into the noise model with some small probability an inflected form is produced not by applying rule to the stem but by drawing sequence of phonemes from description length prior in our experiments we put this corresponds to simple rules plus lexicon model of morphology which is oversimplified in many respects but has been proposed in the past as crude approximation to the actual system of english morphology see the supplement for the smt encoding of our noise model in conclusion the learning problem is as follows given triples of hlexeme tense wordi jointly infer the regular rules the stems and which words are irregular exceptions we took five inflected forms of the top lexemes as measured by token frequency in the celex lexical inventory we split this in half to give lexemes for training and testing and trained our model using random sample consensus ransac concretely we sampled many subsets of the data each with or lexemes thus or words and synthesized the program for each subset minimizing eq we then took the program whose likelihood on the training set was highest fig plots the likelihood on the testing set as function of the number of subsets ransac iterations and the size of the subsets of lexemes fig shows the program that assigned the highest likelihood to the training data it also had the highest likelihood on the testing data with lexemes the learner consistently recovers the regular linguistic rule but with less data it recovers rules that are almost as good degrading more as it receives less data most prior work on morphological rule learning falls into two regimes supervised learning of the phonological form of morphological rules and unsupervised learning of morphemes from corpora because we learn from the lexicon our model is intermediate in terms of supervision we compare with representative systems from both regimes as follows the lexeme is the meaning of the stem or root for example run ran runs all share the same lexeme grammar rule english description hc ci else stem vpms programs are tuples of conditionals one for each tense conditionals have return value guard else condition return values append suffix to stem guards condition upon voicing manner place sibilancy voicing specifies of voice or doesn care voicing options place specifies place of articulation or doesn care place of articulation features manner specifies manner of articulation or doesn care manner of articulation features sibilancy specifies sibilancy or doesn care sibilancy is binary feature table grammar for the morphology domain the is the start symbol for the grammar each guard conditions on phonological properties of the end of the stem voicing place manner and sibilancy sequences of phonemes are encoded as tuples of hlength see the supplement for denotations of each grammar rule lexeme style run subscribe rack present stail bskraib ræk past staild ræn bskraibd rækt sing pres stailz bskraibz ræks past part staild bskraibd rækt prog stailin bskraibin rækin table example input to the morphological rule learner the morfessor system induces morphemes from corpora which it then uses for segmentation we used morfessor to segment phonetic forms of the inflections of our lexemes compared to the ground truth inflection transforms provided by celex it has an error rate of our model segments the same verbs with an error rate of this experiment is best seen as sanity check because our system knows priori to expect only suffixes and knows which words must share the same stem we expect better 
performance due to our restricted hypothesis space to be clear we are not claiming that we have introduced stemmer that exceeds or even meets the in albright and hayes introduce supervised morphological rule learner that induces phonological rules from examples of stem being transformed into its inflected form because our model learns joint distribution over all of the inflected forms of lexeme we can use it to predict inflections conditioned upon their present tense our model recovers the regular inflections but does not recover the islands of reliability modeled in our model predicts that the past tense of the nonce word glee is gleed but does not predict that plausible alternative past tense is gled which the model of albright and hayes does this deficiency is because the space of programs in table lacks the ability to express this class of rules discussion related work inductive programming systems have long and rich history often these systems use stochastic search algorithms such as genetic programming or mcmc others sufficiently constrain the hypothesis space to enable fast exact inference the inductive logic programming community has had some success inducing prolog programs using heuristic search our work is motivated by the recent successes of systems that put program synthesis in probabilistic framework the program synthesis community introduced methods for learning programs and our work builds upon their techniques present past coronal stop id voiced stem else stem prog in sibilant iz voiced stem else stem figure learning curves for our morphology model trained using ransac at each iteration we sample or lexemes from the training data fit model using their inflections and keep the model if it has higher likelihood on the training data than other models found so far each line was run on different permutation of the samples figure program synthesized by morphology learner past participle program was the same as past tense program there is vast literature on computational models of morphology these include systems that learn the phonological form of morphological rules systems that induce morphemes from corpora and systems that learn the productivity of different rules in using general framework our model is similar in spirit to the early connectionist accounts but our use of symbolic representations is more in line with accounts proposed by linguists like our model of visual concept learning is similar to inverse graphics but the emphasis upon synthesizing programs is more closely aligned with acknowledge that convolutional networks are engineered to solve classification problems qualitatively different from the svrt and that one could design better neural network architectures for these problems for example it would be interesting to see how the very recent draw network performs on the svrt limitation of the approach large datasets synthesizing programs from large datasets is difficult and complete symbolic solvers often do not degrade gracefully as the problem size increases our morphology learner uses ransac to sidestep this limitation but we anticipate domains for which this technique will be insufficient prior work in program synthesis introduced counter example guided inductive synthesis cegis for learning from large or possibly infinite family of examples but it can not accomodate noise in the data we suspect that hypothetical hybrid would scale to large noisy training sets future work the two key ideas in this work are the encoding of soft probabilistic constraints as 
hard constraints for symbolic search and crafting domain specific grammar that serves both to guide the symbolic search and to provide good inductive bias without strong inductive bias one can not possibly generalize from small number of examples yet humans can and ai systems should learn over time what constitutes good prior hypothesis space or sketch learning good inductive bias as done in and then providing that inductive bias to solver may be way of advancing program synthesis as technology for artificial intelligence acknowledgments we are grateful for discussions with timothy donnell on morphological rule learners for advice from brendan lake and tejas kulkarni on the convolutional network baselines and for the suggestions of our anonymous reviewers this material is based upon work supported by funding from nsf award from the center for minds brains and machines cbmm funded by nsf stc award and from aro muri contract references adam albright and bruce hayes rules analogy in english past tenses study cognition brenden lake ruslan salakhutdinov and josh tenenbaum learning by inverting compositional causal process in advances in neural information processing systems pages noah goodman vikash mansinghka daniel roy keith bonawitz and joshua tenenbaum church language for generative models in uai pages sumit gulwani jose emanuel kitzelmann stephen muggleton ute schmid and ben zorn inductive programming meets the real world commun acm fleuret ting li charles dubout emma wampler steven yantis and donald geman comparing machines and humans on visual categorization test pnas ray solomonoff formal theory of inductive inference information and control armando solar lezama program synthesis by sketching phd thesis eecs department university of california berkeley dec leonardo de moura and nikolaj bjørner an efficient smt solver in tools and algorithms for the construction and analysis of systems pages springer emina torlak and rastislav bodik growing languages with rosette in proceedings of the acm international symposium on new ideas new paradigms and reflections on programming software pages acm stanislas dehaene izard pierre pica and elizabeth spelke core knowledge of geometry in an amazonian indigene group science david thornburg friends of the turtle compute march lecun bottou bengio and haffner learning applied to document recognition proceedings of the ieee november mark seidenberg and david plaut quasiregularity and its discontents the legacy of the past tense debate cognitive science erwin chan and constantine lignos investigating the relationship between linguistic representation and computation through an unsupervised model of human morphology learning research on language and computation piepenbrock baayen and gulikers philadelphia linguistic data consortium web download martin fischler and robert bolles random sample consensus paradigm for model fitting with applications to image analysis and automated cartography commun acm june sami virpioja peter smit grnroos and mikko kurimo morfessor python implementation and extensions for morfessor baseline technical report aalto university helsinki john koza genetic programming on the programming of computers by means of natural selection complex adaptive systems mit press eric schkufza rahul sharma and alex aiken stochastic superoptimization in acm sigarch computer architecture news volume pages acm sumit gulwani automating string processing in spreadsheets using examples in popl pages new york ny usa acm yarden katz noah goodman kristian 
Kersting, Charles Kemp, and Joshua Tenenbaum. Modeling semantic cognition as logical dimensionality reduction. In CogSci.
Percy Liang, Michael Jordan, and Dan Klein. Learning programs: a hierarchical Bayesian approach. In Johannes Furnkranz and Thorsten Joachims, editors, ICML. Omnipress.
Sumit Gulwani, Susmit Jha, Ashish Tiwari, and Ramarathnam Venkatesan. Synthesis of loop-free programs. In PLDI. New York, NY, USA: ACM.
Rumelhart and McClelland. On learning the past tenses of English verbs. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Bradford Press.
John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, June.
Timothy O'Donnell. Productivity and Reuse in Language: A Theory of Linguistic Computation and Storage. The MIT Press.
Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: a recurrent neural network for image generation. CoRR.
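To illustrate the description-length prior over programs used throughout the synthesis sections above, here is a small Python sketch for the toy arithmetic grammar (an expression is a sum of two expressions, a real constant, or the input x). With uniform production probabilities, each expansion of a nonterminal with n alternatives costs log2(n) bits, and -log P_F(f) is the sum of these costs over the program's derivation; in the full model the encoding of constant values, the data reconstruction error, and the input encoding length would be added on top. The nested-tuple program representation is an assumption of this sketch.

```python
import math

# A toy grammar in the style of the arithmetic example: an expression is
# either the sum of two expressions, a real-valued constant, or the input x.
GRAMMAR = {
    "E": [("add", ["E", "E"]), ("const", []), ("input", [])],
}

def description_length(program, nonterminal="E", grammar=GRAMMAR):
    """-log2 P_F(program) under uniform production probabilities.

    A program is a nested tuple (rule_name, [child_programs]); each expansion
    of a nonterminal with n alternatives contributes log2(n) bits."""
    rules = grammar[nonterminal]
    names = [name for name, _ in rules]
    rule_name, children = program
    i = names.index(rule_name)
    bits = math.log2(len(rules))
    for child_nonterminal, child in zip(rules[i][1], children):
        bits += description_length(child, child_nonterminal, grammar)
    return bits

# f(x) = x + c, written as add(input, const): three expansions of "E",
# each chosen from three alternatives, so 3 * log2(3) bits.
f = ("add", [("input", []), ("const", [])])
print(description_length(f))
```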
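The noise model for the vision domain (coordinate jitter, dropped relations, and a permuted, unidentifiable shape order) can likewise be sketched as a sampling procedure from a program output z to an observed parse x. The parameter values coord_noise and drop_prob are placeholders rather than the paper's noise levels, and the dictionary-based, zero-indexed parse representation is an assumption of the sketch.

```python
import random

def sample_parse(z, coord_noise=0.05, drop_prob=0.1, rng=None):
    """Sketch of x ~ P(x | z) for the vision domain.

    z = {"shapes": [(shape_id, scale, x, y), ...],
         "contains": [(i, j), ...], "borders": [(i, j), ...]}"""
    rng = rng or random.Random(0)
    n = len(z["shapes"])
    perm = list(range(n))
    rng.shuffle(perm)                 # shape i moves to position perm[i]
    shapes = [None] * n
    for i, (sid, scale, x, y) in enumerate(z["shapes"]):
        # Additive uniform jitter on the coordinates of each shape.
        shapes[perm[i]] = (sid, scale,
                           x + rng.uniform(-coord_noise, coord_noise),
                           y + rng.uniform(-coord_noise, coord_noise))

    def noisy_relations(relations):
        # Drop each relation with some probability, remap indices under the
        # shape permutation, and shuffle the relation order.
        kept = [(perm[i], perm[j]) for (i, j) in relations
                if rng.random() > drop_prob]
        rng.shuffle(kept)
        return kept

    return {"shapes": shapes,
            "contains": noisy_relations(z["contains"]),
            "borders": noisy_relations(z["borders"])}
```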
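The guarded suffix rules recovered by the morphology learner, for example the regular past tense shown in the figure (a stem-final coronal stop triggers /id/, any other voiced final segment triggers /d/, and the default is /t/), can be written as a short program of conditionals. The ASCII phoneme notation and the small feature sets below are illustrative stand-ins, not the paper's phonological inventory.

```python
# Minimal sketch of one guarded suffix rule of the kind the learner recovers.
CORONAL_STOPS = {"t", "d"}
VOICELESS = {"p", "t", "k", "f", "s", "sh", "ch", "th"}

def past_tense(stem):
    """stem is a list of phoneme symbols; returns the inflected form."""
    final = stem[-1]
    if final in CORONAL_STOPS:           # guard: coronal stop -> append /id/
        return stem + ["I", "d"]
    if final not in VOICELESS:           # guard: voiced final  -> append /d/
        return stem + ["d"]
    return stem + ["t"]                  # default              -> append /t/

print(past_tense(["r", "ae", "k"]))       # rack  -> rack + /t/
print(past_tense(["s", "t", "ai", "l"]))  # stail -> stail + /d/ (nonce word)
print(past_tense(["n", "ii", "d"]))       # need  -> need + /id/
```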
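Finally, the RANSAC-style training procedure described above (synthesize a program from each of many small random subsets of lexemes, then keep the program with the highest likelihood on the full training set) amounts to the loop below. The synthesize and log_likelihood callables, the subset size, and the iteration count stand in for the SMT-based synthesis call and the model's likelihood; they are assumptions of this sketch.

```python
import random

def ransac_fit(training_lexemes, synthesize, log_likelihood,
               subset_size=4, iterations=20, rng=None):
    """RANSAC-style fitting: synthesis on small subsets keeps each solver
    call cheap, and scoring on the full training set selects a program that
    generalizes beyond its subset."""
    rng = rng or random.Random(0)
    best_program, best_score = None, float("-inf")
    for _ in range(iterations):
        subset = rng.sample(training_lexemes, subset_size)
        program = synthesize(subset)                    # solver call on the subset
        score = log_likelihood(program, training_lexemes)
        if score > best_score:
            best_program, best_score = program, score
    return best_program
```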
enforcing balance allows local supervised learning in spiking recurrent networks sophie deneve group for neural theory ens paris rue dulm paris france ralph bourdoukan group for neural theory ens paris rue dulm paris france abstract to predict sensory inputs or control motor trajectories the brain must constantly learn temporal dynamics based on error feedback however it remains unclear how such supervised learning is implemented in biological neural networks learning in recurrent spiking networks is notoriously difficult because local changes in connectivity may have an unpredictable effect on the global dynamics the most commonly used learning rules such as temporal are not local and thus not biologically plausible furthermore reproducing the statistics of neural responses requires the use of networks with balanced excitation and inhibition such balance is easily destroyed during learning using approach we show how networks of neurons can learn arbitrary linear dynamical systems by feeding back their error as input the network uses two types of recurrent connections fast and slow the fast connections learn to balance excitation and inhibition using plasticity rule the slow connections are trained to minimize the error feedback using hebbian learning rule importantly the balance maintained by fast connections is crucial to ensure that global error signals are available locally in each neuron in turn resulting in local learning rule for the slow connections this demonstrates that spiking networks can learn complex dynamics using purely local learning rules using balance as the key rather than an additional constraint the resulting network implements given function within the predictive coding scheme with minimal dimensions and activity the brain constantly predicts relevant sensory inputs or motor trajectories for example there is evidence that neural circuits mimic the dynamics of motor effectors using internal models if the dynamics of the predicted sensory and motor variables change in time these models may become false and therefore need to be readjusted through learning based on error feedback from modeling perspective supervised learning in recurrent networks faces many challenges earlier models have succeeded in learning useful functions at the cost of non local learning rules that are biologically implausible more recent models based on reservoir computing transfer the learning from the recurrent network with now random fixed weights to the readout weights using this simple scheme the network can learn to generate complex patterns however the majority of these models use abstract rate units and are yet to be translated into more realistic spiking networks moreover to provide sufficiently large reservoir the recurrent network needs to be large balanced and have rich and high dimensional dynamics this typically generates far more activity than strictly required redundancy that can be seen as inefficient on the other hand supervised learning models involving spiking neurons have essentially concentrated on the learning of precise spike sequences with some exceptions these models use architectures in balanced recurrent network with asynchronous irregular and highly variable spike trains such as those found in cortex the activity has been shown to be chaotic this leads to spike timing being intrinsically unreliable rendering representation of the trajectory by precise spike sequences problematic moreover many configurations of spike times may achieve the same goal here we derive two 
local learning rules that drive network of leaky lif neurons into implementing desired linear dynamical system the network is trained to minimize the objective kx where is the output of the network decoded from the spikes is the desired output and is cost associated with firing penalizing unnecessary activity and thus enforcing efficiency the dynamical system is linear ax with being constant matrix and time varying command signal we first study the learning of an autoencoder network where the desired output is fed to the network as feedforward input the autoencoder learns to represent its inputs as precisely as possible in an unsupervised fashion after learning each unit represents the encoding error made by the entire network we then show that the network can learn more complex computations if slower recurrent connections are added to the autoencoder thus it receives the command along with an error signal and learns to generate the output with the desired temporal dynamics despite the nature of the representation and of the plasticity rules the learning does not enforce precise spike timing trajectories but on the contrary enforces irregular and highly variable spike trains learning balance global becomes local using predictive coding strategy we build network that learns to accurately represent its inputs while expending the least amount of spikes to introduce the learning rules and explain how they work we start by describing the optimized network after learning let us first consider set of unconnected neurons receiving shared input signals xi through feedforward connections fji we assume that the network performs predictive coding it subtracts from each of thesep input signals an estimate obtained by decoding the output spike trains fig specifically dij rj where dij are the decoding weights and rj are the filtered spike trains which obey oj with oj tkj being the spike train of neuron and tkj are the times of its spikes note that such an autoencoder automatically maintains an accurate representation because it responds to any encoding error larger than the firing threshold by increasing its response and in turn decreasing the error it is also efficient because neurons respond only when input and decoded signals differ the autoencoder can be equivalently implemented by lateral connections rather than feedback targeting the inputs fig these lateral connections combine the feedforward connections and the decoding weights and they subtract from the feedforward inputs received by each neuron the membrane potential dynamics in this recurrent network are described by fs wo where is the vector of the membrane potentials of the population λx is the effective input to the population is the connectivity matrix and is the population vector of the spikes neuron has threshold ti kfi when input channels are independent and the weights are distributed uniformly on sphere then the optimal decoding weights are equal to the encoding weights and hence the optimal recurrent connectivity in the following we assume that this is always the case and we choose the feedforward weights accordingly in this scheme having precise representation of the inputs is equivalent to maintaining precise balance between excitation and inhibition in fact the membrane potential of neuron is the projection of the global error of the network on the neurons feedforward weight vi fi if the output of the network matches the input the recurrent term in the membrane potential fi should precisely cancel the feedforward term fi therefore 
in order to learn the connectivity matrix we tackle the problem through balance which is its physiological characterization the learning rule that we derive achieves efficient coding by enforcing precise balance at single neuron level it makes the network converge to state where each presynaptic spike cancels the recent charge that was accumulated by the postsynaptic neuron fig this accumulation of charge is naturally represented by the postsynaptic membrane potential vi which jumps upon the arrival of presynaptic spike by magnitude given by the recurrent weight wij due after neuron index before fd balanced unbalanced fi fi vpost wi ms ew fi wi time figure network preforming predictive coding top panel set of unconnected leaky neurons receiving the error between signal and their own decoded spike trains bottom panel the previous architecture is equivalent to the recurrent network with lateral connections equal to the product of the encoding and the decoding weights illustration of the learning of an inhibitory weight the trace of the membrane potential of postsynaptic neuron is shown in blue and red the blue lines correspond to changes due to the integration of the feedforward input and the red to changes caused by the integration of spikes from neurons in the population the black line represents the resting potential of the neuron in the left panel the presynaptic spike perfectly cancels the accumulated feedforward current during cycle and therefore there is no learning in the right panel the inhibitory weight is too strong and thus creates imbalance in the membrane potential therefore it is depressed by learning learning in network top panels the two dimensions of the input blue lines and the output red lines before left and after right learning bottom panels raster plots of the spikes in the population left panel after learning each neuron receives local estimate of the output of the network through lateral connections red arrows right panel scatter plot of the output of the network projected on the feedforward weights of the neurons versus the recurrent input they receive the evolution of the mean error between the recurrent weights of the network and the optimal recurrent weights using the rule defined by equation black line and the rule in gray line note that our rule is different from because it operates on finer and reaches the optimal balanced state with more than one order of magnitude faster this is important because as we will see below some computations require very fast restoration of this balance to the instantaneous nature of recurrent synapses because the two charges should cancel each other the greedy learning rule is proportional to the sum of both quantities δwij vi βwij where vi is the membrane potential of the postsynaptic neuron wij is the recurrent weight from neuron to neuron and the factor controls the overall magnitude of lateral weights more importantly regularizes the cost penalizing the total spike count in the population ri where is the effective linear cost the example of an inhibitory synapse wij is illustrated in figure if neuron is too hyperpolarized upon the arrival of presynaptic spike from neuron if the inhibitory weight wij is smaller than the absolute weight of the synapse the amplitude of the ipsp is decreased the opposite occurs if the membrane is too depolarized the synaptic weights thus converge when the two quantities balance each other on average wij itj where tj are the spike times of the presynaptic neuron fig shows the learning in network 
receiving random input signals for illustration purposes the weights are initialized with very small values before learning the lack of lateral connectivity causes neurons to fire synchronously and regularly after learning spike trains are sparse irregular and asynchronous despite the quasi absence of noise in the network even though the firing rates decrease globally the quality of the input representation drastically improves over the course of learning moreover the convergence of recurrent weights to their optimal values is typically quick and monotonic fig by enforcing balance the learning rule establishes an efficient and reliable communication between neurons because fx fft every neuron has access through its recurrent input to the network global coding error projected on its feedforward weight fig this local representation of the network global performance is crucial in the supervised learning scheme we describe in the following sections generating temporal dynamics within the network while in the previous section we presented novel rule that drives spiking network into efficiently representing its inputs we are generally interested in networks that perform more complex computations it has been shown already that network having two synaptic time scales can implement an arbitrary linear dynamical system we briefly summarize this approach in this section ax ft wf ft ws figure the construction of recurrent network that implements linear dynamical system in the autoencoder presented above the effective input to the network is λx fig we assume that follows linear dynamics ax where is constant matrix and is time varying command thus the input can be expanded to ax λx λi fig because the output of the network approximates very precisely they can be interchanged according to this argument the external input term λi is replaced by λi which only depends on the activity of the network fig this replacement amounts to including global loop that adds the term λi to the source input fig as in the autoencoder this can be achieved using recurrent connections in the form of λi ft fig note that this recurrent input is the filtered spike train not the raw spikes as result these new connections have slower dynamics than the connections presented in the first section this motivates us to characterize connections as fast and slow depending on their underlying dynamics the dynamics of the membrane potentials are now described by fc ws wf where λv is the leak in the membrane potential which is different from the leak in the decoder it is clear from the previous construction that the slow connectivity ws λi ft is involved in generating the temporal dynamics of owing to the slow connections the network is able to generate autonomously the temporal dynamics of the output and thus only needs the command as an external input for example if the network implements pure integrator ws λfft compensates for the leak in the decoder by generating positive feedback term that prevents the activity form decaying on the other hand the fast connectivity matrix wf trained with the unsupervised rule presented previously plays the same role as in the autoencoder it insures that the global output and the global coding error of the network are available locally to each neuron teaching the network to implement desired dynamical system our aim is to develop supervised learning scheme where network learns to generate desired output with an error feedback as well as local learning rule the learning rule targets the slow recurrent 
connections responsible for the generation of the temporal dynamics in the output as seen in the previous section instead of deriving directly the learning rule for the recurrent connections we first derive learning rule for the matrix of the linear dynamical system using simple results from control theory and then we translate the learning to the recurrent network learning linear dynamical system online consider the linear dynamical system where is matrix we derive an online learning rule for the coefficients of the matrix such that the output becomes after learning equal to the desired output the latter undergoes the dynamics ax therefore we define as the error vector between the actual and the desired output this error is fed to the mistuned system in order to correct and guide its behavior fig thus the dynamics of the system with this feedback are where is scalar implementing the gain of the loop the previous equation can be rewritten in the following form ki kx where is the identity matrix if we assume that the spectra of the signals are bounded it is straightforward to show via laplace transform that when the larger the gain of the feedback the smaller the error intuitively if is large very small errors are immediately detected and therefore corrected by the system nevertheless our aim is not to correct the dynamical system forever but to teach it to generate the desired output itself without the error feedback thus the matrix needs to be modified over time to derive the learning rule for the matrix we operate gradient descent on the loss function et kx with respect to the components of the matrix the component mij is updated proportionally to the gradient of δmij to evaluate the term we solve the equation for the simple case were inputs are constant if we assume that is much larger than the eigenvalues of the gradient is approximated by eij where eij is matrix of zeros except for component ij which is one this leads to the very simple learning rule δmij ei which we can write in matrix form as δm the learning rule is simply the outer product of the output and the error to derive the learning rule we assume constant or slowly varying input in practice however learning can be achieved also using fast varying inputs fig learning rule for the slow connections in the previous section we derived simple learning rule for the state matrix of linear dynamical system we translate this learning scheme into the recurrent network described in section to this end we need to determine two things first we have to define the form of the error feedback in the recurrent network case second we need to adapt the learning rule of the matrix of the underlying dynamical system to the slow weights of the recurrent neural network in the previous learning scheme the error is fed into the dynamical system as an additional input since the input weight vector of neuron fi defines the direction that is relevant for its action space the neuron should only receive the errors in that direction thus the error vector is projected on the feedforward weights vector of neuron before being fed to it accordingly equation becomes fc ws wf kfe in the autoencoder the membrane potential of neuron represents the error between the input and the output of the entire network along the neuron feedforward weight with the addition of the dynamic error feedback and the slow connections the membrane potentials now represent the error between the actual and the desired network output trajectories to translate the learning rule of the 
dynamical system into rule for the recurrent network we assume that any modification of the recurrent weights directly reflects modification in the underlying dynamical system this is achieved if the updates δws of the slow connectivity matrix are in the form of δm ft this ensures that the network always implements linear dynamical system and guarantees that the analysis is consistent the learning rule of the slow connections ws is obtained by replacing δm by its expression according to equation in δm ft δws fe according to this learning rule the weight update between two neurons δwijs is proportional to the error feedback fi received as current by the postsynaptic neuron and to fj the output of the network projected on the feedforward weight of the presynaptic neuron the latter quantity is available to the presynaptic neuron through its inward fast recurrent connections as shown for the autoencoder in fig one might object that the previous learning rule is not biologically plausible because it involves currents present separately in the and neurons indeed the presynaptic term may not be available to the synapse however as shown in the supplementary information of the filtered spike train rj of the presynaptic neuron is approximately proportional to bfj rectified version of the presynaptic term in the previous learning rule by replacing fj by rj in the equation we obtain the following biologically plausible learning rule δwijs ei rj where ei fi is the total error current received by the postsynaptic neuron learning the underlying dynamical system while maintaining balance for the previous analysis to hold the fast connectivity wf should be learned simultaneously with the slow connections using the learning rule defined by equation as shown in the first section the learning of the fast connections establishes detailed balance on the level of the neuron and guarantees that the output of the network is available to each neuron through the term fj the latter is the presynaptic term in the learning rule of equation despite not being involved in the dynamics per se these fast connections are crucial in order to learn any temporal dynamics in other words learning detailed balance is to learn dynamics with local plasticity rules in spiking network the plasticity of the fast connections remediate very quickly any perturbation to the balance caused by the learning of the slow connections simulation as toy example we simulated network learning oscillator using feedback gain the network is initialized with weak fast connections and weak slow connections the learning is driven by smoothed gaussian noise as the command note that in the initial state there are no fast recurrent connection and the output of the network does not depend linearly on the input because membrane potentials are too hyperpolarized fig the network output is quickly linearized through the learning of the fast connections equation by enforcing balance on the membrane potential fig initial membrane potentials exhibit large fluctuations ke time ke error ms learned wf learned predicted neuron index ms figure learning temporal dynamics in recurrent network top panel the linear dynamical system characterized by the state matrix receives feedback signaling the difference between its actual output and desired output bottom panel recurrent network displaying slow and fast connections is equivalent to the top architecture if the error feedback is fed into the network through the feedforward matrix neuron network learns using equations and 
left panel the evolution of the error between the desired and the actual output during learning the black and grey arrows represent instances where the time course of the membrane potential is shown in the next plot right panel the time course of the membrane potential of one neuron at two different instances during learning the gray line corresponds to the initial state while the black line is few iterations after scatter plots of the learned versus the predicted weights at the end of learning for fast top panel and slow bottom panel connections top panels the output of the network red and the desired output blue before left and after right learning the black solid line on the top shows the impulse command that drives the network bottom panels raster plots before and after learning in the left raster plot there is no spiking activity after the first that wane drastically after few iterations fig on slower time scale the slow connections learn to minimize the prediction error using the learning rule of equation the error between the output of the network and the desired output decreases drastically fig to compute this error different instances of the connectivity matrices were sampled during learning the network was then using the same instances while fixing to mesure the performance in the absence of feedback at the end of learning both slow and fast connections converge to their predicted values ws λi ft and wf fig the presence of feedback is no longer required for the network to have the right dynamics if we set we still obtain the desired output see figs and the output of the network is very accurate representing the state with precision of the order of the contribution of single spike parsimonious no unnecessary spikes are emitted to represent the dynamical state at this level of accuracy and the spike trains are asynchronous and irregular note that because the slow connections are weak at the initial state spiking activity decays quickly once the command impulse is turned off due to the absence of slow recurrent excitation fig simulation parameters figure learning rate figure λv learning rate of the fast connections learning rate of the slow connections discussion using approach we derived pair of and plasticity rules that enable precise supervised learning in recurrent network of lif neurons the essence of this approach is that every neuron is precise computational unit that represents the network error in subspace of dimension in the output space the precise and distributed nature of this code allows the derivation of local learning rules from global objectives to compute collectively the neurons need to communicate to each other about their contributions to the output of the network the fast connections are trained in an unsupervised fashion using spikebased rule to optimize this communication it establishes this efficient communication by enforcing detailed balance between excitation and inhibition the slow connections however are trained to minimize the error between the actual output of the network and target dynamical system they produce currents with long temporal correlations implementing the temporal dynamics of the underlying linear dynamical system the plasticity rule for the slow connections is simply proportional to an error feedback injected as current in the postsynaptic neuron and to quantity akin to the firing rate of the presynaptic neuron to guide the behavior of the network during learning the error feedback must be strong and specific such strength and 
specialization is in agreement with data on climbing fibers in the cerebellum which are believed to bring information about errors during motor learning however in this model the specificity of the error signals are defined by weight matrix through which the errors are fed to the neurons learning these weights is still under investigation we believe that they could be learned using rule our approach is substantially different from usual supervised learning paradigms in spiking networks since it does not target the spike times explicitly however observing spike times may be misleading since there are many combinations that can produce the same output thus in this framework variability in spiking is not lack of precision but is the consequence of the redundancy in the representation neurons having similar decoding weights may have their spike times interchanged while the global representation is conserved what is important is the cooperation between the neurons and the precise spike timing relative to the population for example using independent poisson neurons with instantaneous firing rates identical to the predictive coding network drastically degrades the quality of the representaion our approach is also different from liquid computing in the sense that the network is small structured and fires only when needed in addition in these studies the feedback error used in the learning rule has no clear physiological correlate while here it is concretely injected as current in the neurons this current is used simultaneously to drive the learning rule and to guide the dynamics of the neuron in the short term however it is still unclear what the mechanisms are that could implement such current dependent learning rule in biological neurons an obvious limitation of our framework is that it is currently restricted to linear dynamical systems one possibility to overcome this limitation would be to introduce in the decoder which would translate into specific and structures in the dendrites similar strategy has been employed recently to combine the approach of predictive coding and force learning using two compartment lif neurons we are currently exploring less constraining forms of synaptic with the ultimate goal of being able to learn arbitrary dynamics in spiking networks using purely local plasticity rules acknowledgments this work was supported by iec psl erc grant and the james mcdonnell foundation award human cognition refrences kawato internal models for motor control and trajectory planning current opinion in neurobiology lackner dizio gravitoinertial force background level affects adaptation to coriolis force perturbations of reaching movements journal of neurophysiology rumelhart hinton williams learning representations by backpropagating errors cognitive modeling williams zipser learning algorithm for continually running fully recurrent neural networks neural computation jaeger the echo state approach to analysing and training recurrent neural networkswith an erratum note bonn germany german national research center for information technology gmd technical report maass natschlger markram computing without stable states new framework for neural computation based on perturbations neural computation sussillo abbott generating coherent patterns of activity from chaotic neural networks neuron legenstein naeger maass what can neuron learn with plasticity neural computation pfister toyoizumi barber gerstner optimal plasticity for precise action potential firing in supervised learning neural 
computation ponulak kasinski supervised learning in spiking neural networks with resume sequence learning classification and spike shifting neural computation memmesheimer rubin lveczky sompolinsky learning precisely timed spikes neuron sompolinsky the tempotron neuron that learns spike timingbased decisions nature neuroscience van vreeswijk sompolinsky chaos in neuronal networks with balanced excitatory and inhibitory activity science brunel dynamics of networks of randomly connected excitatory and inhibitory spiking neurons journal of boerlin machens predictive coding of dynamical variables in balanced spiking networks plos computational biology bourdoukan barrett machens learning optimal spikebased representations in advances in neural information processing systems pp vertechi brendel machens unsupervised learning of an efficient memory network in advances in neural information processing systems pp watanabe kano climbing fiber synapse elimination in cerebellar purkinje cells european journal of neuroscience chen kano abeliovich chen bao kim tonegawa impaired motor coordination correlates with persistent multiple climbing fiber innervation in pkc mutant mice cell eccles llinas sasaki the excitatory synaptic action of climbing fibres on the purkinje cells of the cerebellum the journal of physiology knudsen supervised learning in the brain the journal of neuroscience thalmeier uhlmann kappen memmesheimer learning universal computations with spikes arxiv preprint 
fast and guaranteed tensor decomposition via sketching yining wang tung alex smola machine learning department carnegie mellon university pittsburgh pa yiningwa htung alex anima anandkumar department of eecs university of california irvine irvine ca abstract tensor cp decomposition has wide applications in statistical learning of latent variable models and in data mining in this paper we propose fast and randomized tensor cp decomposition algorithms based on sketching we build on the idea of count sketches but introduce many novel ideas which are unique to tensors we develop novel methods for randomized computation of tensor contractions via ffts without explicitly forming the tensors such tensor contractions are encountered in decomposition methods such as tensor power iterations and alternating least squares we also design novel colliding hashes for symmetric tensors to further save time in computing the sketches we then combine these sketching ideas with existing whitening and tensor power iterative techniques to obtain the fastest algorithm on both sparse and dense tensors the quality of approximation under our method does not depend on properties such as sparsity uniformity of elements etc we apply the method for topic modeling and obtain competitive results keywords tensor cp decomposition count sketch randomized methods spectral methods topic modeling introduction in many domains such as computer vision neuroscience and social networks consisting of and data tensors have emerged as powerful paradigm for handling the data deluge an important operation with tensor data is its decomposition where the input tensor is decomposed into succinct form one of the popular decomposition methods is the cp decomposition also known as canonical polyadic decomposition where the input tensor is decomposed into succinct sum of components the cp decomposition has found numerous applications in data mining computational neuroscience and recently in statistical learning for latent variable models for latent variable modeling these methods yield consistent estimates under mild conditions such as and require only polynomial sample and computational complexity given the importance of tensor methods for machine learning there has been an increasing interest in scaling up tensor decomposition algorithms to handle gigantic data tensors however the previous works fall short in many ways as described subsequently in this paper we design and analyze efficient randomized tensor methods using ideas from sketching the idea is to maintain sketch of an input tensor and then perform implicit tensor decomposition using existing methods such as tensor power updates alternating least squares or online tensor updates we obtain the fastest decomposition methods for both sparse and dense tensors our framework can easily handle modern machine learning applications with billions of training instances and at the same time comes with attractive theoretical guarantees our main contributions are as follows efficient tensor sketch construction we propose efficient construction of tensor sketches when the input tensor is available in factored forms such as in the case of empirical moment tensors where the factor components correspond to tensors over individual data samples we construct the tensor sketch via efficient fft operations on the component vectors sketching each component takes log operations where is the tensor dimension and is the sketch length this is much faster than the np complexity for brute force computations of 
tensor since empirical moment tensors are available in the factored form with components where is the number of samples it takes log operations to compute the sketch implicit tensor contraction computations almost all tensor manipulations can be expressed in terms of tensor contractions which involves multilinear combinations of different tensor fibres for example tensor decomposition methods such as tensor power iterations alternating least squares als whitening and online tensor methods all involve tensor contractions we propose highly efficient method to directly compute the tensor contractions without forming the input tensor explicitly in particular given the sketch of tensor each tensor contraction can be computed in log operations regardless of order of the source and destination tensors this significantly accelerates the implementation that requires np complexity for tensor contraction in addition in many applications the input tensor is not directly available and needs to be computed from samples such as the case of empirical moment tensors for spectral learning of latent variable models in such cases our method results in huge savings by combining implicit tensor contraction computation with efficient tensor sketch construction novel colliding hashes for symmetric tensors when the input tensor is symmetric which is the case for empirical moment tensors that arise in spectral learning applications we propose novel colliding hash design by replacing the boolean ring with the complex ring to handle multiplicities as result it makes the sketch building process much faster and avoids repetitive fft operations though the computational complexity remains the same the proposed colliding hash design results in significant in practice by reducing the actual number of computations theoretical and empirical guarantees we show that the quality of the tensor sketch does not depend on sparseness uniform entry distribution or any other properties of the input tensor on the other hand previous works assume specific settings such as sparse tensors or tensors having entries with similar magnitude such assumptions are unrealistic and in practice we may have both dense and spiky tensors for example unordered word trigrams in natural language processing we prove that our proposed randomized method for tensor decomposition does not lead to any significant degradation of accuracy experiments on synthetic and datasets show highly competitive results we demonstrate to over exact methods for decomposing dense tensors for topic modeling we show significant reduction in computational time over existing spectral lda implementations with small performance loss in addition our proposed algorithm outperforms collapsed gibbs sampling when running time is constrained we also show that if gibbs sampler is initialized with our output topics it converges within several iterations and outperforms randomly initialized gibbs sampler run for much more iterations since our proposed method is efficient and avoids local optima it can be used to accelerate the slow phase in gibbs sampling related works there have been many works on deploying efficient tensor decomposition methods most of these works except implement the alternating least squares als algorithm however this is extremely expensive since the als method is run in the input space which requires operations to execute one least squares step on an dense tensor thus they are only suited for extremely sparse tensors an alternative method is to first reduce the 
dimension of the input tensor through procedures such as whitening to dimension where is the tensor rank and then carry out als in the dimensionreduced space on tensor this results in significant reduction of computational complexity when the rank is small nonetheless in practice such complexity is still prohibitively high as could be several thousands in many settings to make matters even worse when the tensor corresponds to empirical moments computed from samples such as in spectral learning of latent variable models it is actually much slower to construct the reduced dimension table summary of notations see also appendix variables cn cn cn operator cn cn meaning product convolution tensor product variables cn operator cn meaning product mode expansion tensor from training data than to decompose it since the number of training samples is typically very large another alternative is to carry out online tensor decomposition as opposed to batch operations in the above works such methods are extremely fast but can suffer from high variance the sketching ideas developed in this paper will improve our ability to handle larger sizes of and therefore result in reduced variance in online tensor methods another alternative method is to consider randomized sampling of the input tensor in each iteration of tensor decomposition however such methods can be expensive due to calls and are sensitive to the sampling distribution in particular employs uniform sampling which is incapable of handling tensors with spiky elements though sampling is adopted in it requires an additional pass over the training data to compute the sampling distribution in contrast our sketch based method takes only one pass of the data preliminaries tensor tensor product and tensor decomposition order tensor of dimension has entries each entry can be represented as tijk for for an tensor and vector rn we define two forms of tensor products contractions as follows tn uj uk ti ui uj uk uj uk note that and rn for two complex tensors of the same order and dimension its inner product is defined as ha bi al bl wherepl ranges over all tuples that index the tensors the frobenius norm of tensor is simply kakf ha ai the cp decomposition of tensor involves scalars λi and vectors ai bi ci such that the residual kt pk λi ai bi ci kf is minimized here is order tensor defined as rijk ai bj ck additional notations are defined in table and appendix robust tensor power method the method was proposed in and was shown to provably succeed if the input tensor is noisy perturbation of the sum of tensors whose base vectors are orthogonal fix an input tensor the basic idea is to randomly generate initial vectors and perform power update steps the vector that results in the largest eigenvalue is then kept and subsequent eigenvectors can be obtained via deflation if implemented naively the algorithm takes lt time to run requiring storage in addition in certain cases when moment matrix is available the tensor power method can be carried out on whitened tensor thus improving the time complexity by avoiding dependence on the ambient dimension apart from the tensor power method other algorithms such as alternating least squares als and stochastic gradient descent sgd have also been applied to tensor cp decomposition tensor sketch tensor sketch was proposed in as generalization of count sketch for tensor of dimension np random hash functions hp with prhj hj for every and binary rademacher variables ξp the sketch st of tensor is defined as st ξp ip ip ip though we 
mainly focus on third-order tensors in this work, the extension to higher-order tensors is straightforward. The sketch length b is typically chosen to grow linearly, and the number of independent sketches logarithmically, with the relevant problem parameters; the composite hash is H(i_1, ..., i_p) = (h_1(i_1) + ... + h_p(i_p)) mod b. The corresponding recovery rule is T̂_{i_1...i_p} = ξ_1(i_1) ... ξ_p(i_p) · s_T(H(i_1, ..., i_p)). For accurate recovery the composite hash H needs to be sufficiently independent, which is achieved by independently selecting the h_j from an appropriately independent hash family. Finally, the estimation can be made more robust by the standard approach of taking B independent sketches of the same tensor and then reporting the median of the B estimates.

Fast tensor decomposition via sketching. In this section we first introduce an efficient procedure for computing sketches of factored or empirical moment tensors, which appear in a wide variety of applications such as parameter estimation of latent variable models. We then show how to run the tensor power method directly on the sketch, with reduced computational complexity. In addition, when the input tensor is symmetric (T_ijk is the same for all permutations of i, j, k), we propose a novel colliding hash design which speeds up the sketch-building process. Due to space limits we only consider the robust tensor power method in the main text; methods and experiments for sketching-based ALS are presented in the appendix. To avoid confusion, we emphasize that n is used to denote the dimension of the tensor to be decomposed, which is not necessarily the same as the dimension of the original data tensor: indeed, once whitening is applied, n could be as small as the intrinsic dimension (rank) of the original data tensor.

Efficient sketching of empirical moment tensors. Sketching a dense tensor directly from the definition takes O(n^p) operations, which in general cannot be improved because the input size is of order n^p. In practice, however, data tensors are usually structured. One notable example is empirical moment tensors, which arise naturally in parameter estimation problems for latent variable models; more specifically, an empirical moment tensor can be expressed as T = (1/N) Σ_i x_i ⊗ x_i ⊗ x_i, where N is the total number of training data points and x_i is the i-th data point. In this section we show that computing sketches of such tensors can be made significantly more efficient than a direct implementation of the definition. The main idea is to sketch the rank-one components of T efficiently via an FFT trick, inspired by previous efforts on sketching-based matrix multiplication and kernel learning. We consider the more general case in which an input tensor can be written as a weighted sum of N known rank-one components, T = Σ_i a_i · u_i ⊗ v_i ⊗ w_i, where the a_i are scalars and u_i, v_i, w_i are known vectors. The key observation is that the sketch of each component T_i = u_i ⊗ v_i ⊗ w_i can be efficiently computed by FFT; in particular, s_{T_i} = s¹_{u_i} * s²_{v_i} * s³_{w_i} = F⁻¹( F(s¹_{u_i}) ∘ F(s²_{v_i}) ∘ F(s³_{w_i}) ), where * denotes circular convolution, ∘ stands for the element-wise vector product, s¹_{u_i} is the count sketch of u_i under the first mode's hash (s²_{v_i} and s³_{w_i} are defined analogously), and F, F⁻¹ denote the fast Fourier transform and its inverse. By applying the FFT we reduce the convolution computation to a product evaluation in Fourier space; therefore s_{T_i} can be computed using O(n + b log b) operations, where the b log b term arises from the FFT evaluations. Finally, because the sketching operator is linear, s(Σ_i a_i T_i) = Σ_i a_i s(T_i), so s_T can be computed in O(N(n + b log b)) operations, which is much cheaper than explicitly forming each rank-one component and sketching it entry by entry.

Fast robust tensor power method. We are now ready to present the fast robust tensor power method, the main algorithm of this paper. The computational bottleneck of the original robust tensor power method is the computation of the two tensor contractions T(u, u, u) and T(I, u, u); a naive implementation requires on the order of n³ operations per evaluation. In this section we show how to speed up the computation of these products; a minimal numerical illustration of the FFT-based sketching construction described above is given first.
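The following self-contained example is a minimal sketch of the construction above, written in plain numpy with illustrative sizes (it is not the authors' implementation). It builds the count sketch of a factored tensor T = Σ_i a_i u_i ⊗ u_i ⊗ u_i via FFTs, without ever forming the n × n × n array, and then approximates the contraction T(x, x, x) = ⟨T, x ⊗ x ⊗ x⟩ by an inner product of sketches. A single sketch is used, so the estimate is unbiased but noisy; the actual algorithm reports the median over B independent sketches.

```python
import numpy as np

rng = np.random.default_rng(1)
n, b = 50, 4096                      # tensor dimension and sketch length (illustrative)

# One independent (hash, sign) pair per tensor mode, fully random for simplicity.
h = [rng.integers(0, b, size=n) for _ in range(3)]
xi = [rng.choice([-1.0, 1.0], size=n) for _ in range(3)]

def count_sketch(x, mode):
    """Length-b count sketch of vector x under the hash of the given mode."""
    s = np.zeros(b)
    np.add.at(s, h[mode], xi[mode] * x)
    return s

def sketch_rank1(u, v, w):
    """Sketch of the rank-one tensor u (x) v (x) w: the circular convolution of
    the three mode sketches, evaluated as an element-wise product in Fourier space."""
    f = np.fft.rfft(count_sketch(u, 0)) * np.fft.rfft(count_sketch(v, 1)) * np.fft.rfft(count_sketch(w, 2))
    return np.fft.irfft(f, n=b)

# A factored tensor T = sum_i a_i * u_i (x) u_i (x) u_i, sketched by linearity
# without ever forming the n x n x n array.
N = 20
a = 1.0 / np.arange(1, N + 1)
U = rng.standard_normal((N, n))
U /= np.linalg.norm(U, axis=1, keepdims=True)
sT = sum(a[i] * sketch_rank1(U[i], U[i], U[i]) for i in range(N))

# Approximate the contraction T(x, x, x) by an inner product of sketches; a
# single sketch is a noisy but unbiased estimate (the paper uses a median over B).
x = U[0]
approx = float(np.dot(sT, sketch_rank1(x, x, x)))
exact = float(np.sum(a * (U @ x) ** 3))
print(f"exact T(x,x,x) = {exact:.4f}, sketched estimate = {approx:.4f}")
```

We now show that, given only the sketch of an input tensor, one can approximately compute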
both and in log steps where is the hash length before going into details we explain the key idea behind our fast tensor product computation for any two tensors its inner product ha bi can be approximated by ha bi hsa sb denotes the real part of complex number med denotes the median all approximations will be theoretically justified in section and appendix algorithm fast robust tensor power method input noisy symmetric tensor target rank number of initializations number of iterations hash length number of independent sketches initialization hj ξj for and compute sketches cb for to do draw uniformly at random from unit sphere for to do for each compute the sketch of using hj ξj via eq compute as follows first evaluate set as for every set med update ut selection compute λτ ut ut ut using for and evaluate λτ med λτ λτ and argmaxτ λτ set λτ and ut deflation for each compute sketch for the tensor output the pair and sketches of the deflated tensor table computational complexity of sketched and plain tensor power method is the tensor dimension is the intrinsic tensor rank is the sketch length time complexity is shown preprocessing general tensors preprocessing factored tensors with components per tensor contraction time lain ketch lain hitening ketch hitening log nk nk log log log eq immediately results in fast approximation procedure of because ht xi where is rank one tensor whose sketch can be built in log time by eq consequently the product can be approximately computed using log operations if the tensor sketch of is available for tensor product of the form the ith coordinate in the result can be expressed as ht yi where yi ei ei is the ith indicator vector we can then apply eq to approximately compute ht yi efficiently however this method is not completely satisfactory because it requires sketching tensors through yn which results in fft evaluations by eq below we present proposition that allows us to use only ffts to approximate proposition hst ei hf st ei proposition is proved in appendix the main idea is to shift all terms not depending on to the left side of the inner product and eliminate the inverse fft operation on the right side so that sei contains only one nonzero entry as result we can compute st once and read off each entry of in constant time in addition the technique can be further extended to symmetric tensor sketches with details deferred to appendix due to space limits when operating on an tensor the algorithm requires klt bb log running time excluding the time for building and bb memory which significantly improves the lt time and space complexity over the brute force tensor power method here are algorithm parameters for robust tensor power method previous analysis shows that log and poly where poly is some low order polynomial function finally table summarizes computational complexity of sketched and plain tensor power method colliding hash and symmetric tensor sketch for symmetric input tensors it is possible to design new style of tensor sketch that can be built more efficiently the idea is to design hash functions that deliberately collide symmetric entries etc consequently we only need to consider entries tijk with when building tensor sketches an intuitive idea is to use the same hash function and rademacher random variable for each order that is and in this way all permutations of will collide with each other however such design has an issue with repeated entries because can only take values consider and as an example with probability even if on the other hand we 
need for any pair of distinct and to address the issue we extend the rademacher random variables to the complex domain and consider all roots of that is ωj where ωj ei suppose is rademacher random variable with pr ωi by elementary algebra whenever is relative prime to or can be divided by therefore by setting we avoid collisions of repeated entries in order tensor more specifically the symmetric tensor sketch of symmetric tensor can be defined as ti where mod to recover an entry we use where if if or or otherwise for higher order tensors the coefficients can be computed via the young tableaux which characterizes symmetries under the permutation group compared to asymmetric tensor sketches the hash function needs to satisfy stronger independence conditions because we are using the same hash function for each order in our case needs to be independent to make independent the fact is due to the following proposition which is proved in appendix proposition fix and for define symmetric mapping as ip ip if is pq independent then is independent the symmetric tensor sketch described above can significantly speed up sketch building processes for general tensor with nonzero entries to build one only needs to consider roughly entries those tijk with for tensor only one fft is needed to build in contrast to compute eq one needs at least fft evaluations finally in appendix we give details on how to seamlessly combine symmetric hashing and techniques in previous sections to efficiently construct and decompose tensor error analysis in this section we provide theoretical analysis on approximation error of both tensor sketch and the fast sketched robust tensor power method we mainly focus on symmetric tensor sketches while extension to asymmetric settings is trivial due to space limits all proofs are placed in the appendix tensor sketch concentration bounds theorem bounds the approximation error of symmetric tensor sketches when computing and its proof is deferred to appendix theorem fix symmetric real tensor and real vector rn with suppose and rn are estimation errors of and using independent symmetric tensor sketches that is and if log then with probability the following error bounds hold ktkf ktkf in addition for any fixed rn with probability we have hw analysis of the fast tensor power method we present theorem analyzing robust tensor power method with tensor sketch approximations more detailed theorem statement along with its proof can be found in appendix pk theorem suppose where with an λi mal basis λk and kek let be the table squared residual norm on top recovered eigenvectors of tensors and running time excluding and sketch building time for plain exact and sketched robust tensor power methods two vectors are considered mismatch wrong if kv extended version is shown as table in appendix residual norm exact no of wrong vectors running time min table negative and running time min on the large wikipedia dataset for and topics spectral gibbs hybrid like time iters like time iters pairs obtained by algorithm suppose log log maxi λi λi and grows linearly with assume the randomness of the tensor sketch is independent among tensor product evaluations if log and satisfies max where mini λi and maxi λi then with probability there exists permutation over such that and kt pk kv λi for some constant theorem shows that the sketch length can be set as to provably approximately decompose tensor with dimension theorem together with time complexity comparison in table shows that the sketching based fast tensor 
decomposition algorithm has better computational complexity over implementation one potential drawback of our analysis is the assumption that sketches are independently built for each tensor product contraction evaluation this is an artifact of our analysis and we conjecture that it can be removed by incorporating recent development of differentially private adaptive query framework experiments we demonstrate the effectiveness and efficiency of our proposed sketch based tensor power method on both synthetic tensors and topic modeling problems experimental results involving the fast als method are presented in appendix all methods are implemented in and tested on single machine with intel cpus and memory for synthetic tensor decomposition we use only single thread for fast spectral lda to threads are used synthetic tensors in table we compare our proposed algorithms with exact decomposition methods on synthetic tensors let be the dimension of the input tensor we first generate random orthonormal pn basis and then set the input tensor as normalize λi where the eigenvalues λi satisfy λi the normalization step makes before imposing noise the gaussian noise matrix is symmetric with eijk for and level due to time constraints we only compare the recovery error and running time on the top recovered eigenvectors of the input tensor both and are set to table shows that our proposed algorithms achieve reasonable approximation error within few minutes which is much faster then exact methods complete version table is deferred to appendix topic modeling we implement fast spectral inference algorithm for latent dirichlet allocation lda by combining tensor sketching with existing whitening technique for dimensionality reduction negative exact exact exact gibbs sampling iterations mins log hash length figure left negative for fast and exact tensor power method on wikipedia dataset right negative for collapsed gibbs sampling fast lda and gibbs sampling using fast lda as initialization tation details are provided in appendix we compare our proposed fast spectral lda algorithm with baseline spectral methods and collapsed gibbs sampling using implementation on two datasets wikipedia and enron dataset details are presented in only the most frequent words are kept and the vocabulary size is set to for the robust tensor power method the parameters are set to and for als we iterate until convergence or maximum number of iterations is reached is set to and is set to obtained topic models rv are evaluated on dataset consisting of documents randomly picked out from training datasets for each testing document we fit topic mixing vector rk by solving the following optimization problem kwd where wd is the empirical word distribution of document the is pnd pk then defined as ld ln wdi where wdi φwdi finally the average ld over all testing documents is reported figure left shows the negative for fast spectral lda under different hash lengths we can see that as increases the performance approaches the exact tensor power method because sketching approximation becomes more accurate on the other hand table shows that fast spectral lda runs much faster than exact tensor decomposition methods while achieving comparable performance on both datasets figure right compares the convergence of collapsed gibbs sampling with different number of iterations and fast spectral lda with different hash lengths on wikipedia dataset for collapsed gibbs sampling we set and following as shown in the figure fast spectral lda achieves comparable 
likelihood while running faster than collapsed gibbs sampling we further take the dictionary output by fast spectral lda and use it as initializations for collapsed gibbs sampling the word topic assignments are obtained by gibbs sampling with the dictionary fixed the resulting gibbs sampler converges much faster with only iterations it already performs much better than randomly initialized gibbs sampler run for iterations which takes more running time we also report performance of fast spectral lda and collapsed gibbs sampling on larger dataset in table the dataset was built by crawling random wikipedia pages and evaluation set was built by randomly picking out documents from the dataset number of topics is set to or and after getting topic dictionary from fast spectral lda we use gibbs sampling to obtain word topic assignments table shows that the hybrid method collapsed gibbs sampling initialized by spectral lda achieves the best likelihood performance in much shorter time compared to randomly initialized gibbs sampler conclusion in this work we proposed sketching based approach to efficiently compute tensor cp decomposition with provable guarantees we apply our proposed algorithm on learning latent topics of unlabeled document collections and achieve significant compared to vanilla spectral and collapsed gibbs sampling methods some interesting future directions include further improving the sample complexity analysis and applying the framework to broader class of graphical models acknowledgement anima anandkumar is supported in part by the microsoft faculty fellowship and the sloan foundation alex smola is supported in part by google faculty research grant references anandkumar ge hsu kakade and telgarsky tensor decompositions for learning latent variable models journal of machine learning research bhojanapalli and sanghavi new sampling technique for tensors blei ng and jordan latent dirichlet allocation journal of machine learning research carlson betteridge kisiel settles hruschka jr and mitchell toward an architecture for language learning in aaai carroll and chang analysis of individual differences in multidimensional scaling via an generalization of decomposition psychometrika chaganty and liang estimating graphical models using moments and likelihoods in icml charikar chen and finding frequent items in data streams theoretical computer science choi and vishwanathan dfacto distributed factorization of tensors in nips dwork feldman hardt pitassi reingold and roth preserving statistical validity in adaptive data analysis in stoc field and graupe topographic component parallel factor analysis of multichannel evoked potentials practical issues in trilinear spatiotemporal decomposition brain topography griffiths and steyvers finding scientific topics proceedings of the national academy of sciences suppl harshman foundations of the parafac procedure models and conditions for an explanatory factor analysis ucla working papers in phonetics huang matusevych anandkumar karampatziakis and mineiro distributed latent dirichlet allocation via tensor factorization in nips optimization workshop huang niranjan hakeem and anandkumar fast detection of overlapping communities via online tensor methods jain fundamentals of digital image processing kang papalexakis harpale and faloutsos gigatensor scaling tensor analysis up by times algorithms and discoveries in kdd klimt and yang introducing the enron corpus in ceas kolda and bader the tophits model for web link analysis in workshop on link analysis 
counterterrorism and security kolda and bader tensor decompositions and applications siam review kolda and sun scalable tensor decompositions for data mining in icdm mørup hansen herrmann parnas and arnfred parallel factor analysis as an exploratory tool for wavelet transformed eeg neuroimage pagh compressed matrix multiplication in itcs pham and pagh fast and scalable polynomial kernels via explicit feature maps in kdd phan tichavsky and cichocki fast alternating ls algorithms for high order tensor factorizations ieee transactions on signal processing phan and nguyen implementation of latent dirichlet allocation lda and thorup the power of simple tabulation hashing journal of the acm tsourakakis mach fast randomized tensor decompositions in sdm tung and smola spectral methods for indian buffet process inference in nips wang liu song and han scalable inference for latent dirichlet allocation in wang and zhu spectral methods for supervised topic models in nips 
differentially private subspace clustering yining wang wang and aarti singh machine learning department carnegie mellon universty pittsburgh usa yiningwa yuxiangw aarti abstract subspace clustering is an unsupervised learning problem that aims at grouping data points into multiple clusters so that data points in single cluster lie approximately on linear subspace it is originally motivated by motion segmentation in computer vision but has recently been generically applied to wide range of statistical machine learning problems which often involves sensitive datasets about human subjects this raises dire concern for data privacy in this work we build on the framework of differential privacy and present two provably private subspace clustering algorithms we demonstrate via both theory and experiments that one of the presented methods enjoys formal privacy and utility guarantees the other one asymptotically preserves differential privacy while having good performance in practice along the course of the proof we also obtain two new provable guarantees for the agnostic subspace clustering and the graph connectivity problem which might be of independent interests introduction subspace clustering was originally proposed to solve very specific computer vision problems having structure in the data motion segmentation under an affine camera model or face clustering under lambertian illumination models as it gains increasing attention in the statistics and machine learning community people start to use it as an agnostic learning tool in social network movie recommendation and biological datasets the growing applicability of subspace clustering in these new domains inevitably raises the concern of data privacy as many such applications involve dealing with sensitive information for example applies subspace clustering to identify diseases from personalized medical data and in fact uses subspace clustering as effective tool to conduct linkage attacks on individuals in movie rating datasets nevertheless privacy issues in subspace clustering have been less explored in the past literature with the only exception of brief analysis and discussion in however the algorithms and analysis presented in have several notable deficiencies for example data points are assumed to be incoherent and it only protects the differential privacy of any feature of user rather than the entire user profile in the database the latter means it is possible for an attacker to infer with high confidence whether particular user is in the database given sufficient side information it is perhaps reasonable why there is little work focusing on private subspace clustering which is by all means challenging task for example negative result in shows that if utility is measured in terms of exact clustering then no private subspace clustering algorithm exists when neighboring databases are allowed to differ on an entire user profile in addition subspace clustering methods like sparse subspace clustering ssc lack complete analysis of its clustering output thanks to the notorious graph connectivity problem finally clustering could have high global sensitivity even if only cluster centers are released as depicted in figure as result general private data releasing schemes like output perturbation do not apply in this work we present systematic and principled treatment of differentially private subspace clustering to circumvent the negative result in we use the perturbation of recovered dimensional subspace from the ground truth as the utility 
measure our contributions are first we analyze two efficient algorithms based on the framework and established formal privacy and utility guarantees when data are generated from some stochastic model or satisfy certain deterministic separation conditions new results on subspace clustering are obtained along our analysis including fully agnostic subspace clustering on datasets using stability arguments and exact clustering guarantee for subspace clustering tsc in the noisy setting in addition we employ the exponential mechanism and propose novel gibbs sampler for sampling from this distribution which involves novel tweak in sampling from matrix bingham distribution the method works well in practice and we show it is closely related to the mixtures of probabilistic pca model related work subspace clustering can be thought as generalization of pca and clustering the former aims at finding single subspace and the latter uses zerodimensional subspaces as cluster centers there has been extensive research on private pca and perhaps the most similar work to ours is applies the framework to clustering and employs the exponential mechanism to recover private principal vectors in this paper we give generalization of both work to the private subspace clustering setting preliminaries notations for vector rd its is defined as kxkp xpi if is not explicitly specified then the is used for matrix we use σn to denote its singular values assuming without loss of generality that we use kξ to denote matrix norms with the spectral norm and the frobenious norm that pmatrix pn is and kakf σi for subspace we associate with basis where the columns in are orthonormal and range we use sdq to denote the set of all subspaces in rd given rd and rd the distance is defined as inf kx if is subspace associated with basis then we have kx ps kx uu where ps denotes the projection operator onto subspace for two subspaces of dimension the distance is defined as the frobenious norm of the sin matrix of principal angles sin kf kuu kf where are orthonormal basis associated with and respectively subspace clustering given data points xn rd the task of subspace clustering is to cluster the data points into clusters so that data points within subspace lie approximately on subspace without loss of generality we assume kxi for all we also use xn to denote the dataset and to denote the data matrix by stacking all data points in columnwise order subspace clustering seeks to find subspaces cˆ so as to minimize the wasserstein distance or distance squared defined as min sπ where are taken over all permutations on and are the subspaces in model based approach is fixed and data points xi are generated either deterministically or stochastically from one of the subspaces in with noise corruption for completely agnostic setting is defined as the minimizer of the subspace clustering objective sk cost sk min xi sj to simplify notations we use cost to denote cost of the optimal solution algorithm the framework input xi rd number of subsets privacy parameters dm initialize ln ln subsampling select random subsets of size of independently and uniformly at random without replacement repeat this step until no single data point appears in more than of the sets mark the subsampled subsets xsm separate queries compute si where si xsi aggregation compute where argminm ri with here ri denotes the distance dm between si and the nearest neighbor to si in noise calibration compute maxk where is the mean of the top values in rm output where is standard gaussian 
random vector differential privacy definition differential privacy randomized algorithm is private if for all satisfying and all sets of possible outputs the following holds pr eε pr in addition if then the algorithm is private in our setting the distance between two datasets and is defined as the number of different columns in and differential privacy ensures the output distribution is obfuscated to the point that every user has plausible deniability about being in the dataset and in addition any inferences about individual user will have nearly the same confidence before and after the private release based private subspace clustering in this section we first summarize the framework introduced in and argue why it should be preferred to conventional output perturbation mechanisms for subspace clustering we then analyze two efficient algorithms based on the framework and prove formal privacy and utility guarantees we also prove new results in our analysis regarding the stability of subspace clustering lem and graph connectivity consistency of noisy subspace clustering tsc under stochastic model lem smooth local sensitivity and the framework most existing privacy frameworks are based on the idea of global sensitivity which is defined as the maximum output perturbation kf kξ where maximum is over all neighboring databases and or unfortunately global sensitivity of clustering problems is usually high even if only cluster centers are released for example figure shows that the global sensitivity of subspace clustering could be as high as figure illustration of instability of subspace clustering solutions which ruins the algorithm utility blue dots represent evenly spaced data points to circumvent the on the unit circle blue crosses indicate an addilenges nissim et al introduces the tional data point red lines are optimal solutions framework based on the concept of smooth version of local sensitivity unlike global sensitivity local sensitivity measures the maximum perturbation kf kξ over all databases neighboring to the input database the proposed framework pseudocode in alg enjoys local sensitivity and comes with the following guarantee theorem theorem let rd be an efficiently computable function where is the collection of all databases and is the output dimension let dm be semimetric on the outer space of set and the algorithm in algorithm is an efficient private algorithm furthermore if and are chosen such that the norm of the output of is bounded by and pr dm xs xs for some rd and then the standard deviation of gaussian noise added is upper bounded by λε in addition when satisfies with high probability each coordinate of is upper bounded by where depending on satisfies dm let be any subspace clustering solver that outputs estimated subspaces and dm be the wasserstein distance as defined in eq theorem provides privacy guarantee for an efficient with any in addition utility guarantee holds with some more assumptions on input dataset in following sections we establish utility guarantees the main idea is to prove stability results as outlined in eq for particular subspace clustering solvers and then apply theorem the agnostic setting we first consider the setting when data points xi are arbitrarily placed under such agnostic setting the optimal solution is defined as the one that minimizes the cost as in eq the solver is taken to be any of optimal subspace clustering that cost efficient is always outputs subspaces cˆ satisfying cost based approximation algorithms exist for example in the key task 
of this section is to identify assumptions under which the stability condition outlined above holds with respect to an approximate solver. The example in the earlier figure also suggests that an identifiability issue arises when the input data itself cannot be well clustered: for example, no two straight lines can approximate well data distributed uniformly on a circle. To circumvent this difficulty we impose the following condition on the input data.

Definition (well-separation condition for subspace clustering). A dataset X is well-separated if there exist constants, all between 0 and 1, bounding three cost-based quantities: the cost of clustering X with only k-1 subspaces, the cost of clustering with subspaces of dimension q-1, and the cost reduction obtained by allowing subspaces of dimension q+1 (the precise inequalities, expressed through the corresponding clustering costs, are omitted here). The first condition constrains the input dataset so that it cannot be well clustered using k-1 instead of k clusters; it was introduced in earlier work to analyze the stability of clustering solutions. We need another two conditions regarding the intrinsic dimension of each subspace: the second asserts that replacing a q-dimensional subspace with a (q-1)-dimensional one is not sufficient, while the third means that an additional subspace dimension does not help much with clustering.

The following lemma is our main stability result for subspace clustering on well-separated datasets. It states that when a candidate clustering Ĉ is close to the optimal clustering C* in terms of clustering cost, the two are also close in terms of the Wasserstein distance defined earlier.

Lemma (stability of agnostic subspace clustering). Assume X is well-separated. Suppose a candidate clustering Ĉ of k subspaces in S_q^d satisfies cost(Ĉ; X) ≤ (1 + a) · cost(C*; X) for some constant a. Then d_W(Ĉ, C*) is bounded by a quantity that scales with a and with the optimal cost; the exact constants are omitted here.

The following theorem is then a simple corollary, with a complete proof in the appendix. (Two remarks: the semimetric d_M used in the sample-and-aggregate framework is taken to satisfy the usual non-negativity and triangle-type properties for all arguments, and a above is an approximation constant of the solver, not related to the privacy parameter.)

Algorithm: threshold-based subspace clustering (TSC), simplified version.
Input: data points x_1, ..., x_n; number of clusters k; number of neighbors m.
Thresholding: construct a graph G by connecting each x_i to the m other data points with the largest absolute inner products |<x_i, x_j>|; complete G so that it is undirected.
Clustering: let the connected components of G define the clusters; construct each subspace by sampling q points from the corresponding cluster uniformly at random without replacement.
Output: subspaces Ĉ, where each subspace is the span of the q points sampled from its cluster.

Theorem. Fix a well-separated dataset X with n data points and suppose X_S is a subset of X sampled uniformly at random without replacement. Let Ĉ be an approximately optimal subspace clustering computed on X_S. If the size of X_S is large enough (the required size scales with k, q, d and logarithmic factors, up to separation-dependent quantities), then with high probability d_W(Ĉ, C*) is small, where C* is the optimal clustering on X, that is, the minimizer of cost(.; X). Consequently, applying this theorem together with the sample-and-aggregate framework, we obtain a weakly private algorithm for agnostic subspace clustering, with the additional amount of Gaussian noise bounded accordingly. Our bound is comparable to the one obtained previously for private k-means clustering, except for an extra term which characterizes the separation of the dataset under the subspace clustering scenario.

The stochastic setting. We further consider the case when data points are stochastically generated from some underlying true subspace set C*. Such settings were extensively investigated in the previous development of subspace clustering algorithms. Below we give a precise definition of the stochastic subspace clustering model considered here.

The stochastic model: for every cluster associated with a true subspace S*, a data point x_i in R^d belonging to that cluster can be written as x_i = y_i + ε_i, where y_i is sampled uniformly at random from the unit sphere of S* and ε_i is isotropic Gaussian noise whose scale is set by a noise parameter σ.

Under the stochastic setting we consider the solver A to be the threshold-based subspace clustering (TSC) algorithm; a simplified version of TSC is presented in the algorithm box above. An alternative idea is to apply the results of the previous section, since the stochastic model implies a well-separated dataset when the noise level σ is small. A minimal implementation sketch of the TSC solver follows.
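As a concrete illustration of the algorithm box above, here is a minimal sketch of the simplified TSC solver in plain numpy (illustrative parameters; not the authors' implementation). The subspace dimension q is passed explicitly, each recovered component is assumed to contain at least q points, and each output subspace is taken as the span of q points sampled from its component, as in the algorithm box.

```python
import numpy as np

rng = np.random.default_rng(3)

def tsc(X, k, m, q):
    """Simplified threshold-based subspace clustering (TSC).

    X : (d, n) data matrix with unit-norm columns; k : number of clusters;
    m : number of neighbours per point; q : subspace dimension.
    Returns a list of orthonormal bases, one per recovered connected component.
    """
    n = X.shape[1]
    G = np.abs(X.T @ X)                         # absolute inner products
    np.fill_diagonal(G, -np.inf)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):                          # thresholding: keep m largest similarities
        adj[i, np.argpartition(G[i], -m)[-m:]] = True
    adj |= adj.T                                # complete the graph so it is undirected

    labels = -np.ones(n, dtype=int)             # connected components via depth-first search
    comp = 0
    for s in range(n):
        if labels[s] >= 0:
            continue
        stack = [s]
        labels[s] = comp
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(adj[u] & (labels < 0)):
                labels[v] = comp
                stack.append(v)
        comp += 1

    bases = []
    for c in range(min(comp, k)):               # span of q points sampled from each cluster
        members = np.flatnonzero(labels == c)   # assumed to contain at least q points
        idx = rng.choice(members, size=q, replace=False)
        bases.append(np.linalg.qr(X[:, idx])[0])
    return bases

# Toy usage: points drawn without noise from k random q-dimensional subspaces of R^d.
d, q, k, per = 30, 3, 3, 60
subspaces = [np.linalg.qr(rng.standard_normal((d, q)))[0] for _ in range(k)]
X = np.hstack([U @ rng.standard_normal((q, per)) for U in subspaces])
X /= np.linalg.norm(X, axis=0)
print("recovered", len(tsc(X, k=k, m=8, q=q)), "subspaces")
```

However, the running time of TSC is only quadratic in the number of data points, which is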
much more efficient than the approximation-algorithm based solvers used in the agnostic setting. TSC is provably correct in the sense that the similarity graph G has no false connections and is connected within each cluster, as shown in the following lemma.

Lemma (connectivity of TSC). Fix a failure probability and assume the maximum affinity between the true subspaces and the minimum signal level satisfy the required bounds. If, for every cluster, the number of data points is large enough and the noise level σ is small enough (the precise conditions, which involve the principal angles between subspaces and several logarithmic factors, are omitted here), then with high probability the connected components in G correspond exactly to the underlying subspaces.

The conditions in the lemma characterize the interaction between sample complexity, noise level and signal level (the minimum separation between the true subspaces). The following theorem is then a simple corollary of the lemma; complete proofs are deferred to the appendix.

Theorem (stability of TSC on stochastic data). Assume the conditions of the connectivity lemma hold with respect to the true subspaces C*. Assume in addition mild conditions on the per-cluster sample sizes, and that the failure probability is sufficiently small. Then, for every fixed tolerance, the probability that d_W between the TSC output computed on a random subsample X_S and C* exceeds that tolerance tends to zero.

Compared to the theorem for the agnostic model, this result shows that one can achieve consistent estimation of the underlying subspaces under the stochastic model. It is an interesting question to derive finite-sample bounds for the differentially private TSC algorithm.

Discussion. It is worth noting that the sample-and-aggregate framework is a differentially private mechanism for any computational subroutine A; however, the utility claim (the bound on each coordinate of the added noise) requires the stability of the particular subroutine, as outlined earlier. It is unfortunately hard to argue theoretically for the stability of subspace clustering methods such as sparse subspace clustering (SSC), due to the notorious graph connectivity problem. Nevertheless, we observe satisfactory performance of SSC-based algorithms in simulations (see the experiments section). It remains an open question to derive a utility guarantee for user-level differentially private SSC.

Private subspace clustering via the exponential mechanism. In the previous section we analyzed two algorithms with provable privacy and utility guarantees for subspace clustering, based on the sample-and-aggregate framework. However, empirical evidence shows that sample-and-aggregate based private clustering suffers from poor utility in practice. In this section we propose a practical private subspace clustering algorithm based on the exponential mechanism. In particular, given the dataset X with n data points, we propose to sample the parameters ({S_j}, {z_i}), where each S_j is a q-dimensional subspace of R^d and each z_i lies in {1, ..., k}, from the distribution p({S_j}, {z_i}; X) ∝ exp( −(ε/2) · Σ_i d²(x_i, S_{z_i}) ), where ε > 0 is the privacy parameter. The following proposition shows that exact sampling from this distribution results in a provably differentially private algorithm; its proof is straightforward and is deferred to the appendix. Note that, unlike sample-and-aggregate based methods, the exponential mechanism can privately release the clustering assignment. This does not violate the negative result mentioned in the introduction, because the released clustering assignment is not guaranteed to be exactly correct.

Proposition. The randomized algorithm that outputs one sample from the distribution defined above is ε-differentially private.

Gibbs sampling implementation. It is hard in general to sample parameters from distributions as complicated as the one above. We present a Gibbs sampler that iteratively samples the subspaces {S_j} and the cluster assignments {z_i} from their conditional distributions.

Update of z_i: when {S_j} and the other assignments are fixed, the conditional distribution of z_i is p(z_i = j) ∝ exp( −(ε/2) · d²(x_i, S_j) ). Since d²(x_i, S_j) can be efficiently computed given an orthonormal basis of S_j, the update of z_i can be easily done by sampling from a categorical distribution; a short numerical sketch of this step is given below.

Update of S_j: let X̃_j = {x_i : z_i = j} denote the data points currently assigned to cluster j, and let A_j be the matrix whose columns are these points. The distribution over S_j conditioned on everything else can then be written as p(S_j = range(U)) ∝ exp( (ε/2) · tr(U^T Σ_j U) ) over orthonormal U, where Σ_j = A_j A_j^T is the unnormalized sample covariance matrix of cluster j.
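The assignment step of the Gibbs sweep is simple enough to show directly. The sketch below (an illustration with assumed toy data, not the authors' code) samples each z_i from its categorical conditional given fixed orthonormal bases, using d²(x, S_j) = ‖x‖² − ‖U_j^T x‖²; the harder subspace step, which requires sampling from the matrix Bingham conditional, is deliberately not shown.

```python
import numpy as np

rng = np.random.default_rng(2)

def assignment_sweep(X, bases, eps):
    """One Gibbs sweep over cluster assignments z_i.

    X     : (d, n) data matrix with unit-norm columns
    bases : list of (d, q) orthonormal bases U_j, one per candidate subspace
    eps   : privacy parameter of the exponential mechanism
    """
    n = X.shape[1]
    z = np.empty(n, dtype=int)
    for i in range(n):
        x = X[:, i]
        # Squared distance to each subspace: ||x||^2 - ||U_j^T x||^2.
        d2 = np.array([x @ x - np.sum((U.T @ x) ** 2) for U in bases])
        logits = -0.5 * eps * d2
        p = np.exp(logits - logits.max())     # stabilised categorical weights
        p /= p.sum()
        z[i] = rng.choice(len(bases), p=p)
    return z

# Toy usage: two random 2-dimensional subspaces in R^10 and points drawn from them.
d, q, n = 10, 2, 200
U1 = np.linalg.qr(rng.standard_normal((d, q)))[0]
U2 = np.linalg.qr(rng.standard_normal((d, q)))[0]
pts = np.hstack([U1 @ rng.standard_normal((q, n // 2)), U2 @ rng.standard_normal((q, n // 2))])
pts /= np.linalg.norm(pts, axis=0)
z = assignment_sweep(pts, [U1, U2], eps=50.0)
print("fraction assigned to subspace 1 among its own points:", np.mean(z[: n // 2] == 0))
# Sampling the subspaces themselves from their matrix-Bingham conditionals is
# the harder step and is not implemented in this sketch.
```

With the assignments fixed, the conditional over each subspace basis has the trace-exponential form given above; a distribution of this form over orthonormal matrices is a special case of the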
matrix bingham distribution which admits gibbs sampler we give implementation details in appendix with modifications so that the resulting gibbs sampler is empirically more efficient for wide range of parameter settings recently established full clustering guarantee for ssc however under strong assumptions discussion the proposed gibbs sampler resembles the algorithm for subspace clustering it is in fact probabilistic version of since sampling is performed at each iteration rather than deterministic updates furthermore the proposed gibbs sampler could be viewed as posterior sampling for the following generative model first sample uniformly at random from sdq for each subspace afterwards cluster assignments zi are sampled such that pr zi wi where is sampled uniformly at random from the qand xi is set as xi dimensional unit ball and wi id connection between the generative model and gibbs sampler is formally justified in appendix the generative model is strikingly similar to the mixtures of probabilistic pca mppca model by setting variance parameters in mppca to the only difference is that are sampled uniformly at random from unit ball and noise wi is constrained to the complement space of note that this is closely related to earlier observation that posterior sampling is private but different in that we constructed model from private procedure rather than the other way round as the privacy parameter no privacy guarantee we arrive immediately at the exact algorithm and the posterior distribution concentrates around the optimal solution this behavior is similar to what asymptotic analysis on mppca models reveals on the other hand the proposed gibbs sampler is significantly different from previous bayesian probabilisitic pca formulation in that the subspaces are sampled from matrix bingham distribution finally we remark that the proposed gibbs sampler is only asymptotically private because proposition requires exact or nearly exact sampling from eq numerical results we provide numerical results of both the and gibbs sampling algorithms on synthetic and datasets we also compare with baseline method implemented based on the algorithm with perturbed sample covariance matrix via the sulq framework details presented in appendix three solvers are considered for the framework subspace clustering tsc which has provable utility guarantee with sampleaggregation on stochastic models along with sparse subspace clustering ssc and representation lrr the two methods for subspace clustering for gibbs sampling we use ssc and lrr solutions as initialization for the gibbs sampler all methods are implemented using matlab for synthetic datasets we first generate random linear subspaces each subspace is generated by first sampling random gaussian matrix and then recording its column space data points are then assigned to one of the subspaces clusters uniformly at random to generate data point xi assigned with subspace we first sample rq with ky uniformly at random from the unit sphere afterwards xi is set as xi wi where is an orthonormal basis associated with and wi id is noise vector and the wasserfigure compares the utility measured in terms of objective cost of sample aggregation gibbs sampling and sulq subspace clustering stein distance dw as shown in the plots algorithms have poor utility unless the privacy parameter is truly large which means very little privacy protection on the other hand both gibbs sampling and sulq subspace clustering give reasonably good performance figure also shows that sulq scales 
poorly with the ambient dimension this is because sulq subspace clustering requires calibrating noise to sample covariance matrix which induces much error when is large gibbs sampling seems to be robust to various settings we also experiment on datasets the right two plots in figure report utility on subset of the extended yale face dataset for face clustering random individuals are picked forming subset of the original dataset with data points images the dataset is preprocessed by projecting each individual onto affine subspace via pca such preprocessing step was adopted in and was theoretically justified in afterwards ambient dimension of the entire dataset is reduced to by random gaussian projection the plots show that gibbs sampling significantly outperforms the other algorithms in mppca latent variables are sampled from normal distribution iq ssc tsc lrr ssc exp lrr ssc tsc lrr ssc exp lrr wasserstein distance wasserstein distance wasserstein distance cost cost cost ssc tsc lrr ssc exp lrr ssc tsc lrr ssc exp lrr ssc tsc lrr ssc exp lrr ssc tsc lrr ssc exp lrr figure utility under fixed privacy budget top row shows cost and bottom row shows from left to right synthetic dataset the wasserstein distance dw extended yale face dataset subset is set to ln for algorithms stands for smooth sensitivity and stands for exponential mechanism and stand for the sulq framework performing and iterations gibbs sampling is run for iterations and the mean of the last samples is reported test statistic cost wasserstein distance iterations iterations iterations of trials of the gibbs sampler under different figure test statistics cost and dw privacy settings synthetic dataset setting in figure we investigate the mixing behavior of proposed gibbs sampler we plot for multiple trials of gibbs sampling the objective wasserstein distance and test statistic kq pk pt where is basis sample of at the tth iteration the test statistic has mean zero under distribution in eq and similar statistic was used in as diagnostic of the mixing behavior of another gibbs sampler figure shows that under various privacy parameter settings the proposed gibbs sampler mixes quite well after iterations conclusion in this paper we consider subspace clustering subject to formal differential privacy constraints we analyzed two based algorithms with provable utility guarantees under agnostic and stochastic data models we also propose gibbs sampling subspace clustering algorithm based on the exponential mechanism that works well in practice some interesting future directions include utility bounds for subspace clustering algorithms like ssc or lrr acknowledgement this research is supported in part by grant nsf career nsf award and grant by singapore national research foundation under its international research centre singapore funding initiative administered by the idm programme office references basri and jacobs lambertian reflectance and linear subspaces ieee transactions on pattern analysis and machine intelligence blum dwork mcsherry and nissim practical privacy the sulq framework in pods bradley and mangasarian clustering journal of global optimization chaudhuri sarwate and sinha algorithms for differentially private principal components in nips chen jalali sanghavi and xu clustering partially observed graphs via convex optimization the journal of machine learning research dimitrakakis nelson mitrokotsa and rubinstein robust and private bayesian inference in algorithmic learning theory pages springer dwork kenthapadi mcsherry 
mironov and naor our data ourselves privacy via distributed noise generation in eurocrypt dwork mcsherry nissim and smith calibrating noise to sensitivity in private data analysis in tcc dwork and roth the algorithmic foundations of differential privacy foundations and trends in theoretical computer science dwork talwar thakurta and zhang analyze gauss optimal bounds for principal component analysis in stoc elhamifar and vidal sparse subspace clustering algorithm theory and applications ieee transactions on pattern analysis and machine intelligence feldman schmidt and sohler turning big data into tiny data coresets for pca and projective clustering in soda georghiades belhumeur and kriegman from few to many illumination cone models for face recognition under variable lighting and pose ieee transactions on pattern analysis and machine intelligence heckel and robust subspace clustering via thresholding ho yang lim lee and kriegman clustering appearances of objects under varying illumination conditions in cvpr hoff simulation of the matrix distribution with applications to multivariate and relational data journal of computational and graphical statistics liu lin yan sun ma and yu robust recovery of subspace structures by representation ieee transactions on pattern analysis and machine intelligence mcsherry and talwar mechanism design via differential privacy in focs mcwilliams and montana subspace clustering of data predictive approach data mining and knowledge discovery mir differential privacy an exploration of the landscape phd thesis rutgers university nasihatkon and hartley graph connectivity in sparse subspace clustering in cvpr nissim raskhodnikova and smith smooth sensitivity and sampling in private data analysis in stoc ostrovksy rabani schulman and swamy the effectiveness of methods for the problem in focs soltanolkotabi candes et al geometric analysis of subspace clustering with outliers the annals of statistics soltanolkotabi elhamifa and candes robust subspace clustering the annals of statistics su cao li bertino and jin differentially private clustering arxiv tipping and bishop mixtures of probabilistic principle component anlyzers neural computation wang wang and singh clustering consistent sparse subspace clustering arxiv wang wang and singh deterministic analysis of noisy sparse subspace clustering for data in icml wang and zhu bayesian nonparametric subspace clustering with asymptotic analysis in icml wang fienberg and smola privacy for free posterior sampling and stochastic gradient monte carlo in icml wang and xu noisy sparse subspace clustering in icml pages zhang fawaz ioannidis and montanari guess who rated this movie identifying users through subspace clustering arxiv zhang chan kwok and yeung bayesian inference on principal component analysis using reversible jump markov chain monte carlo in aaai 
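As a concrete illustration of the exponential-mechanism Gibbs sampler described in the preceding sections, the following is a minimal sketch of its cluster-assignment step only. It assumes the conditional of each assignment z_i is proportional to exp(-beta * dist^2(x_i, S_j)) with beta a temperature fixed by the privacy parameter (for example eps/2 under the exponential mechanism); the exact calibration of beta and the subspace update via the matrix Bingham distribution follow the paper's appendix and are not reproduced here. The function name and interface are ours, not the authors'.

```python
# Minimal sketch (not the authors' code) of the z_i update in the Gibbs sampler.
# Assumption: p(z_i = j | rest) is proportional to exp(-beta * ||x_i - P_{S_j} x_i||^2).
import numpy as np

def resample_assignments(X, bases, beta, rng=None):
    """X: (n, d) data matrix, rows assumed on the unit sphere.
    bases: list of k orthonormal bases, each of shape (d, q).
    beta: inverse temperature, e.g. eps / 2 under the exponential mechanism.
    Returns an (n,) array of sampled cluster assignments."""
    rng = np.random.default_rng() if rng is None else rng
    n, _ = X.shape
    k = len(bases)
    # Squared residual of each point to each subspace: ||x||^2 - ||U_j^T x||^2.
    sq_norms = np.sum(X ** 2, axis=1, keepdims=True)                         # (n, 1)
    proj = np.stack([np.sum((X @ U) ** 2, axis=1) for U in bases], axis=1)   # (n, k)
    residuals = sq_norms - proj                                              # (n, k)
    # Categorical sampling with log-sum-exp stabilisation.
    logits = -beta * residuals
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return np.array([rng.choice(k, p=probs[i]) for i in range(n)])
```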
predtron family of online algorithms for general prediction problems prateek jain microsoft research india prajain nagarajan natarajan university of texas at austin usa ambuj tewari university of michigan ann arbor usa tewaria abstract modern prediction problems arising in multilabel learning and learning to rank pose unique challenges to the classical theory of supervised learning these problems have large prediction and label spaces of combinatorial nature and involve sophisticated loss functions we offer general framework to derive mistake driven online algorithms and associated loss bounds the key ingredients in our framework are general loss function general vector space representation of predictions and notion of margin with respect to general norm our general algorithm predtron yields the perceptron algorithm and its variants when instantiated on classic problems such as binary classification multiclass classification ordinal regression and multilabel classification for multilabel ranking and subset ranking we derive novel algorithms notions of margins and loss bounds simulation study confirms the behavior predicted by our bounds and demonstrates the flexibility of the design choices in our framework introduction classical supervised learning problems such as binary and multiclass classification share number of characteristics the prediction space the space in which the learner makes predictions is often the same as the label space the space from which the learner receives supervision because directly learning discrete valued prediction functions is hard one learns or functions these functions generate continuous predictions that are converted into discrete ones via simple mappings via the sign function binary classification or the argmax function multiclass classification also the most commonly used loss function is simple viz the loss in contrast modern prediction problems such as multilabel learning multilabel ranking and subset ranking do not share these characteristics in order to handle these problems we need more general framework that offers more flexibility first it should allow for the possibility of having different label space and prediction space second it should allow practitioners to use creative new ways to map continuous predictions to discrete ones third it should permit the use of general loss functions extensions of the theory of classical supervised learning to modern predictions problems have begun for example the work on calibration dimension can be viewed as extending one aspect of the theory viz that of calibrated surrogates and consistent algorithms based on convex optimization this paper deals with the extension of another interesting part of classical supervised learning mistake driven algorithms such as perceptron resp winnow and their analyses in terms of resp margins section we make number of contributions first we provide general framework section whose ingredients include an arbitrary loss function and an arbitrary representation of discrete predictions in continuous space the framework is abstract enough to be of general applicability but it offers enough mathematical structure so that we can derive general online algorithm predtron algorithm along with an associated loss bound theorem under an abstract margin condition section second we show that our framework unifies several algorithms for classical problems such as binary classification multiclass classification ordinal regression and multilabel classification section even for these classical 
problems we get some new results for example when the loss function treats labels asymmetrically or when there exists reject option in classification third we apply our framework to two modern prediction problems subset ranking section and multilabel ranking section in both of these problems the prediction space rankings is different from the supervision space set of labels or vector of relevance scores for these two problems we propose interesting novel notions of correct prediction with margin and derive mistake bounds under loss derived from ndcg ranking measure that pays more attention to the performance at the top of ranked list fourth our techniques based on online convex optimization oco can effortlessly incorporate notions of margins norms such as norm group norm and trace norm such flexibility is important in modern prediction problems where the learned parameter can be high dimensional vector or large matrix with low group or trace norm finally we test our theory in simulation study section dealing with the subset ranking problem showing how our framework can be adapted to specific prediction problem we investigate different margin notions as we vary two key design choices in our abstract framework the map used to convert continuous predictions into discrete ones and the choice of the norm used in the definition of margin related work our general algorithm is related to the perceptron and online gradient descent algorithms used in structured prediction but to the best of knowledge our emphasis on keeping label and prediction spaces possibly distinct our use of general representation of predictions and our investigation of generalized notions of margins are all novel the use of simplex coding in multiclass problems inspired the use of maximum distance decoding to obtain discrete predictions from continuous ones our proofs use results about online gradient descent and online mirror descent from the online convex optimization literature framework and main result the key ingredients in classic supervised learning are an input space an output space and loss function in this paper the input space rp will always be some subset of finite dimensional euclidean space our algorithms maintain prediction functions as linear combination of the seen inputs as result they easily kernelize and the theory extends in straightforward way to the case when the input space is possibly infinite dimensional reproducing kernel hilbert space rkhs labels prediction and scores we will distinguish between the label space and the prediction space the former is the space where the training labels come from whereas the latter is the space where the learning algorithm has to make predictions in both spaces will be assumed to be finite therefore without any loss of generality we can identify the label space with and the prediction space with where are positive but perhaps very large integers given loss function maps prediction and label to loss the loss can equivalently be thought of as matrix with loss values as entries define the set of correct predictions for label as we assume that for every label the set is that is every column of the loss matrix has zero entry also let cl minl and cl max be the minimum and maximum entries in the loss matrix in an online setting the learner will see stream of examples learner will predict scores using linear predictor however the predicted scores will be in rd not in the prediction space so we need function pred rd to convert scores into actual predictions we will assume that 
there is unique representation rep rd of each prediction such that rep for all given this natural transformation of scores into prediction is given by the following maximum similarity decoding pred argmax hrep ti where ties in the argmax can be broken arbitrarily there are some nice consequences of the definition of pred above first because rep maximum similarity decoding is equivalent to nearest neighbor decoding pred argmin rep second we have homogeneity property pred ct pred if third rep serves as an inverse of pred in the following sense we have pred rep for all moreover rep pred is more similar to than the representation of any other prediction rd hrep pred ti hrep ti in view of these facts we will use pred and rep interchangeably using pred the loss function can be extended to function defined on rd as pred with little abuse of notation we will continue to denote this new function also by margins we say that score is compatible with label if the set of that achieve the maximum in the definition of pred is exactly that is argmax pred hence for any we have pred pred the notion of margin makes this requirement stronger we say that score has margin on label iff is compatible with and pred pred note that margin scales with if has margin on then ct has margin on for any positive if we are using linear predictions we say that has margin on iff has margin on we say that has margin on dataset xn yn iff has margin on for all finally dataset xn yn is said to be linearly separable with margin if there is unit such that has margin on xn yn algorithm just like the classic perceptron algorithm our generalized perceptron algorithm algorithm is mistake driven that is it only updates on round when mistake loss is incurred on mistake round it makes update of the form wp where rd rp therefore always has representation of the form xi the prediction on fresh input is given by gi hxi xi which means the algorithm just like the original perceptron can be kernelized we will give loss bound for the algorithm using tools from online convex optimization oco define the function rd as max pred pred where is an arbitrary member of for any is maximum of linear functions and hence convex also is choose to lower bound the maximum the inner product part vanishes and the loss vanishes too because given the definition of algorithm can be described succinctly as follows at round if then otherwise here we mean that the frobenius norm kw kf equals of course the notion of margin can be generalized to any norm including the norm kw and the norm kw ks also called the nuclear or trace norm see appendix algorithm predtron extension of the perceptron algorithm to general prediction problems for do receive rp predict pred receive label if then argmax pred pred pred pred else end if end for theorem suppose the dataset xn yn is linearly separable with margin then the sequence generated by algorithm with cl satisfies the loss bound where for all cl note that the bound above assumes perfect linear separability however just the classic perceptron the bound will degrade gracefully when the best linear predictor does not have enough margin on the data set the predtron algorithm has some interesting variants two of which we consider in the appendix loss driven version enjoys loss bound that gets rid of the cl factor in the bound above version that uses link functions to deal with margins defined with respect to norms is also considered relationship to existing results it is useful to discuss few concrete applications of the abstract 
framework introduced in the last section several existing loss bounds can be readily derived by applying our bound for the generalized perceptron algorithm in theorem in some cases our framework yields different algorithm than existing counterparts yet admitting identical loss bounds up to constants binary classification we begin with the classical perceptron algorithm for binary classification if or otherwise letting rep be for the positive class and for the negative class predictor vector and thus pred sign algorithm reduces to the original perceptron algorithm theorem yields identical mistake bound on linearly separable dataset with margin if the classical margin is ours works out to be pn we can also easily incorporate asymmetric losses let if and otherwise we then have the following result corollary consider the perceptron with weighted loss assume without loss of generality then the sequence generated by algorithm satisfies the weighted mistake bound we are not aware of such results for weighted loss previous work studies perceptrons with pn uneven margins and the loss bound there only implies bound on the unweighted loss in technical note and kivinen provide mistake bound of the pn form without proof and for any but for the specific choice of weights another interesting extension is obtained by allowing the predictions to have eject option define lr ej eject and lr ej otherwise assume without loss of generality choosing the standardpbasis vectors in to be rep for the positive and the pn negative classes and rep eject rep we obtain lr ej see appendix multiclass classification each instance is assigned exactly one of classes extending binary classification we choose the standard basis vectors in rm to be rep for the classes the learner predicts score rm using the predictor so pred argmaxi ti let wj denote the jth row of corresponding to label the definition of margin becomes hwy xi max hwj xi which is identical to the multiclass margin studied earlier for the multiclass loss we recover their bound up to moreover our surrogate for max max ty matches the multiclass extension of the hinge loss studied by finally note that it is straightforward to obtain loss bounds for multiclass perceptron with eject option by naturally extending the definitions of rep and lr ej for the binary case ordinal regression the goal is to assign ordinal classes such as ratings to set of objects described by their features xi rp in many cases precise rating information may not be available but only their relative ranks the observations consist of pairs where is with relation which in turn induces partial ordering on the objects xj is preferred to xj if yj yj xj and xj are not comparable if yj yp the prank perceptron algorithm enjoys the for the ranking loss bound where is certain rank margin by reduction to classification with classes algorithm achieves the loss bound albeit for different margin multilabel classification this setting generalizes multiclass classification in that instances are assigned subsets of classes rather than unique classes the loss function of interest may dictate the choice of rep and in turn pred for example consider the following subset losses that treat labels as well as predictions as subsets subset loss liserr if or otherwise ii hamming loss lham and ii error set size lerrsetsize natural choice of rep then is the subset indicator vector expressed as in where log which can be rep where are the standard basis vectors in the learner predicts score rm using matrix note that pred sign 
where sign is applied the number of predictions is but we show in appendix that the surrogate and its gradient can be efficiently computed for all of the above losses subset ranking in subset ranking the task is to learn to rank number of documents in order of their relevance to query we will assume for simplicity that the number of documents per query is constant that we denote by the input space is subset of that we can identify with rp for each row of an input matrix corresponds to feature vector derived jointly using the query perceptron algorithm in is based on slightly different loss defined as lerrset if tr ty or otherwise where this loss upper bounds because of the way ties are handled there can be rounds when is but lerrset is and one of the documents associated with it the predictions are all permutations of degree the most natural but by no means the only one representation of permutations is to set rep where is the position of the document in the predicted ranking and the normalization ensures that rep is unit vector note that the dimension of this representation is equal to the minus sign in this representation ensures that pred outputs permutation that corresponds to sorting the entries of in decreasing order common convention in existing work more general representation is obtained by setting rep where is decreasing papstrictly ensures real valued function that is applied to the normalization that rep to convert an input matrix rp into score vector rm it seems that we need to learn matrix however natural permutation invariance requirement if the documents associated are presented in permuted fashion the output scores should also get permuted in the same way reduces the dimensionality of to see for more details thus given vector we get the score vector as xw the label space consists of relevance score vectors ymax where ymax is typically between and yielding to grades of relevance note that the prediction space of size is different from the label space of size ymax variety of loss functions have been used in subset ranking for multigraded relevance judgments pm very popular choice is ndcg which is defined as dcg where is normalization constant ensuring ndcg stays bounded by to convert it into loss we define lndcg dcg note that any permutation that sorts in decreasing order gets zero lndcg one might worry that the computation of the surrogate defined in and its gradient might require an enumeration of permutations the next lemma allays such concern lemma when lndcg and rep is chosen as above the computation of the surrogate as well as its gradient can be reduced to solving linear assignment problem and hence can be done in time we now give result explaining what it means for score vector to have margin on when we use representation of the form described above without loss of generality we may assume that is sorted in decreasing order of relevance judgements lemma ppm suppose rep for strictly decreasing function and let be relevance judgement vector sorted in decreasing order suppose in are the positions where the relevance drops by grade or more ij ij then has margin on iff is compatible with and for tij ij ij ij where we define in to handle boundary cases note that if we choose then ij ij for large ij in that case the margin condition above requires less separation between documents with different relevance scores down the list when viewed in decreasing order of relevance scores than at the top of the list we end this section with loss bound for lndcg under margin condition 
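The computational reduction noted in the lemma above can be illustrated concretely before stating the bound. The sketch below computes the permutation attaining the maximum in the surrogate as a linear assignment problem; the gain 2^y - 1, the logarithmic discount, and the decreasing position map f are illustrative choices, not necessarily the exact constants used in the paper, and the function name is ours.

```python
# Minimal sketch: the surrogate argmax over permutations decomposes over
# (document, position) pairs, so it reduces to a linear assignment problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

def surrogate_argmax(t, y, f=lambda pos: 1.0 / np.log2(pos + 2.0)):
    """t: (m,) score vector, y: (m,) relevance grades.
    Returns sigma with sigma[j] = position (0-indexed) assigned to document j,
    maximising the NDCG-loss term plus <rep(sigma), t> up to constants."""
    t = np.asarray(t, dtype=float)
    m = len(t)
    gains = 2.0 ** np.asarray(y, dtype=float) - 1.0
    discounts = np.log2(np.arange(m) + 2.0)
    Z = np.sum(np.sort(gains)[::-1] / discounts)      # ideal DCG (normalisation)
    fvals = f(np.arange(m))                           # rep value of each position
    # reward[j, i]: contribution of placing document j at position i.
    reward = -np.outer(gains, 1.0 / discounts) / Z + np.outer(t, fvals)
    rows, cols = linear_sum_assignment(-reward)       # maximise total reward
    sigma = np.empty(m, dtype=int)
    sigma[rows] = cols
    return sigma
```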
corollary suppose lndcg and rep is as in lemma then assuming the dataset is linearly separable with margin the sequence generated by algorithm with line replaced by satisfies pred where kop lndcg pred note that the result above uses the standard based notion of margin imagine subset ranking problem where only small number of features are relevant it is therefore natural to consider notion of margin where the weight vector that ranks everything perfectly has low group norm instead of low norm the margin also appears in the analysis of adaboost definition we can use special case of more general algorithm given in the appendix appendix algorithm specifically we replace line with the step where we set log log the mapping and its inverse can both be easily computed see corollary suppose lndcg and rep is as in lemma then assuming the dataset is linearly separable with margin by unit norm kw the sequence generated by algorithm with chosen as above and line modified as in corollary satisfies lndcg log where po and denotes the jth column of multilabel ranking as discussed in section in multilabel classification both prediction space and label space are with sizes in multilabel ranking however the learner has to output rankings as predictions so as in the previous section we have since the prediction can be any one of permutations of the labels as before we choose rep and hence however unlike the previous section the input is no longer matrix but vector rp prediction rd is obtained as where note the contrast with the last section there inputs are matrices and weight vector is learned here inputs are vectors and weight matrix is learned since we output rankings it is reasonable to use loss that takes positions of labels into account we can use lndcg algorithm now immediately applies lemma already showed that is efficiently implementable we have the following straightforward corollary corollary suppose lndcg and rep is as in lemma then assuming the dataset is linearly separable with margin the sequence generated by algorithm satisfies lndcg where the bound above matches the corresponding bound up to loss specific constants for the multiclass multilabel perceptron mmp algorithm studied by the definition of margin by for mmp is different from ours since their algorithms are designed specifically for multilabel ranking just like them we can also consider other losses precision at top positions another perceptron style algorithm for multilabel ranking adopts pairwise approach of comparing two labels at time however no loss bounds are derived the result above uses the standard frobenius norm based margin imagine multilabel problem where only small number of features are relevant across all labels then it is natural to consider notion of margin where the matrix that ranksp everything perfectly has low group norm instead of low frobenius norm where kw kwj wj denotes column of we again use special case of algorithm appendix specifically we replace line with the step where kw recall that the group is the norm of the norm of the columns of we set log log the mapping and its inverse can both be easily computed see eq corollary suppose lndcg and rep is as in lemma then assuming the dataset is linearly separable with margin by unit group norm kw the sequence generated by algorithm with chosen as above satisfies lndcg log where no of training points subset ranking no of documents in each instance loss on test points loss on test points loss on test points subset ranking vs data dimensionality figure subset ranking 
ndcg loss for different pred choices with varying plot and plot as predicted by lemmas and pred is more accurate than vs margin lndcg for two different predtron algorithms based on and margin data is generated using margin notion but with varying sparsity of the optimal scoring function experiments we now present simulation results to demonstrate the application of our proposed predtron framework to subset ranking we also demonstrate that empirical results match the trend predicted by our error bounds hence hinting at tightness of our upper bounds due to lack of space we focus only on the subset ranking problem also we would like to stress that we do not claim that the basic version of predtron itself with provides ranker instead we wish to demonstrate the applicability and flexibility of our framework in controlled setting we generated data points using gaussian distribution with independent rows the ith row of represents document and is sampled from spherical gaussian centered at µi we selected and also set of thresholds to generate relevance scores we set and and we set relevance score of the ith document in the th as iff that is we measure performance of given method using the ndcg loss lndcg defined in section note that lndcg is less sensitive to errors in predictions for the less relevant documents in the list on the other hand our selection of thresholds implies that the gap between scores of lowerranked documents is very small compared to the ones and hence chances of making mistakes lower down the list is higher figure shows lndcg on test set for our predtron algorithm see section but with different pred functions for pred is monotonically increasing with on the other hand for pred is monotonically decreasing with lemma shows that the mistake bound in terms of lndcg of predtron is better when pred function is selected to be as well as for instead of clearly figure empirically validates this mistake bound with lndcg going to almost for and with just training points while based predtron has large loss even with training points next we fix the number of training instances to be and vary the number of documents as the gap between decreases for larger increasing implies reducing the margin naturally predtron with the above mentioned inverse functions has monotonically increasing loss see figure however and provide solutions for larger when compared to finally we conduct an experiment to show that by selecting appropriate notion of margin predtron can obtain more accurate solutions to this end we generate data from select sparse now predtron with notion standard gradient descent has dependency in the error bounds while the see corollary has only log dependence this error dependency is also revealed by figure where increasing with fixed leads to minor increase in the loss for predtron but leads to significantly higher loss for predtron acknowledgments tewari acknowledges the support of nsf under grant references harish ramaswamy and shivani agarwal classification calibration dimension for general multiclass losses in advances in neural information processing systems pages mehryar mohri afshin rostamizadeh and ameet talwalkar foundations of machine learning mit press michael collins discriminative training methods for hidden markov models theory and experiments with perceptron algorithms in proceedings of the conference on empirical methods in natural language pages nathan ratliff andrew bagnell and martin zinkevich approximate subgradient methods for structured prediction in 
international conference on artificial intelligence and statistics pages youssef mroueh tomaso poggio lorenzo rosasco and slotine multiclass learning with simplex coding in advances in neural information processing systems pages shai online learning and online convex optimization foundations and trends in machine learning albert novikoff on convergence proofs on perceptrons in proceedings of the symposium on the mathematical theory of automata volume pages yaoyong li hugo zaragoza ralf herbrich john and jaz kandola the perceptron algorithm with uneven margins in proceedings of the nineteenth international conference on machine learning pages gunnar ratsch and jyrki kivinen extended classification with modified perceptron presented at the nips workshop beyond classification and regression learning rankings preferences equality predicates and other structures abstract available at http koby crammer and yoram singer ultraconservative online algorithms for multiclass problems the journal of machine learning research koby crammer and yoram singer on the algorithmic implementation of multiclass vector machines the journal of machine learning research koby crammer and yoram singer pranking with ranking advances in neural information procession systems david cossock and tong zhang statistical analysis of bayes optimal subset ranking ieee transactions on information theory ambuj tewari and sougata chaudhuri generalization error bounds for learning to rank does the length of document lists matter in proceedings of the international conference on machine learning volume of jmlr workshop and conference proceedings koby crammer and yoram singer family of additive online algorithms for category ranking the journal of machine learning research eneldo loza and johannes furnkranz pairwise learning of multilabel classifications with perceptrons in ieee international joint conference on neural networks pages sham kakade shai and ambuj tewari regularization techniques for learning with matrices journal of machine learning research 
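For reference, here is a minimal sketch of a Predtron-style mistake-driven learner instantiated for multiclass classification with the standard-basis representation rep(z) = e_z and maximum-similarity decoding pred(t) = argmax_i t_i. The unit step size and the form of the surrogate maximisation, max_z [loss(z, y) + <rep(z) - rep(y), t>], reflect our reading of the framework and Algorithm 1 above; the class name and interface are ours, and Algorithm 1 in the paper remains the authoritative statement.

```python
# Sketch of a Predtron-style update for multiclass classification.
import numpy as np

class MulticlassPredtron:
    def __init__(self, n_classes, n_features, loss=None, step=1.0):
        self.W = np.zeros((n_classes, n_features))     # one weight row per class
        self.loss = loss if loss is not None else (lambda z, y: float(z != y))
        self.step = step

    def predict(self, x):
        # pred(t) = argmax_z <rep(z), t> with rep(z) = e_z.
        return int(np.argmax(self.W @ x))

    def update(self, x, y):
        t = self.W @ x
        z_hat = int(np.argmax(t))
        if self.loss(z_hat, y) == 0:                   # mistake-driven: no update
            return z_hat
        # Most violating prediction under the surrogate.
        scores = np.array([self.loss(z, y) + t[z] - t[y] for z in range(len(t))])
        z_star = int(np.argmax(scores))
        # Subgradient step: W <- W - step * (rep(z*) - rep(y)) x^T.
        self.W[y] += self.step * x
        self.W[z_star] -= self.step * x
        return z_hat
```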
Weighted Theta Functions and Embeddings with Applications to Clustering and Summarization

Fredrik Johansson, Computer Science and Engineering, Chalmers University of Technology, Sweden (frejohk). Ankani Chattoraj, Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA (achattor). Chiranjib Bhattacharyya, Computer Science and Automation, Indian Institute of Science, Bangalore, Karnataka, India (chiru). Devdatt Dubhashi, Computer Science and Engineering, Chalmers University of Technology, Sweden (dubhashi). (This work was performed when the author was affiliated with CSE at Chalmers University of Technology.)

Abstract. We introduce a unifying generalization of the Lovász theta function, and the associated geometric embedding, for graphs with weights on both nodes and edges. We show how it can be computed exactly by semidefinite programming, and how to approximate it using SVM computations. We show how the theta function can be interpreted as a measure of diversity in graphs, and use this idea, and the graph embedding, in algorithms for correlation clustering and document summarization, all of which are well represented as problems on weighted graphs.

Introduction. Embedding structured data, such as graphs, in geometric spaces is a central problem in machine learning. In many applications, graphs are attributed with weights on the nodes and edges, information that needs to be well represented by the embedding. Lovász introduced a graph embedding, together with the famous theta function, in the seminal paper giving his celebrated solution to the problem of computing the Shannon capacity of the pentagon. Indeed, the Lovász embedding is a very elegant and powerful representation of unweighted graphs that has come to play a central role in information theory, graph theory, and combinatorial optimization. However, despite there being at least eight different formulations of the theta function for unweighted graphs, there does not appear to be a version that applies to graphs with weights on the edges. This is surprising, as it has a natural interpretation in the information-theoretic problem of the original definition. A version of the number for weighted graphs, and a corresponding geometrical representation, could open the way to new approaches to learning problems on data represented as similarity matrices. Here we propose such a generalization, for graphs with weights on both nodes and edges, by combining a few key observations. Recently, Jethava et al. discovered an interesting connection between the original theta function and a central problem in machine learning, namely the one-class support vector machine (SVM) formulation. This kernel-based method gives yet another equivalent characterization of the number. Crucially, it is easily modified to yield an equivalent characterization of the closely related Delsarte version of the number, introduced by Schrijver, which is more flexible and often more convenient to work with. Using this kernel characterization of the Delsarte version of the number, we define a theta function and embedding of weighted graphs suitable for learning with data represented as similarity matrices. The original theta function is limited to applications on small graphs because of its formulation as a semidefinite program (SDP). Jethava et al. showed that their kernel characterization can be used to compute a number and an embedding of a graph that are often good approximations to the theta function and embedding, and that can be computed fast, scaling to very large graphs. Here we give the analogous approximate method for weighted graphs. We use this approximation to solve the weighted maximum cut problem faster than the
classical sdp relaxation finally we show that our theta function has natural interpretation as measure of diversity in graphs we use this intuition to define correlation clustering algorithm that automatically chooses the number of clusters and initializes the centroids we also show how to use the support vectors computed in the kernel characterization with both node and edge weights to perform extractive document summarization to summarize our main contributions we introduce unifying generalization of the famous number applicable to graphs with weights on both nodes and edges we show that via our characterization we can compute good approximation to our weighted theta function and the corresponding embeddings using svm computations we show that the weighted version of the number can be interpreted as measure of diversity in graphs and we use this to define correlation clustering algorithm dubbed that automatically chooses the number of clusters and initializes centroids we apply the embeddings corresponding to the weighted numbers to solve weighted maximum cut problems faster than the classical sdp methods with similar accuracy we apply the weighted kernel characterization of the theta function to document summarization exploiting both node and edge weights extensions of and delsarte numbers for weighted graphs background consider embeddings of undirected graphs introduced an elegant embedding implicit in the definition of his celebrated theta function famously an upper bound on the shannon capacity and sandwiched between the independence number and the chromatic number of the complement graph min max uj kui kck ui ui the vectors ui are orthonormal representations or labellings the dimension of which is determined by the optimization we refer to both ui and the matrix un as an embedding and use the two notations interchangeably jethava et al introduced characterization of the function that established close connection with the support vector machine they showed that for an unweighted graph min where kii kij max αi kij αi αj is the dual formulation of the svm problem see note that the conditions on only refer to the of in the sequel and always refer to the definitions in new weighted versions of key observation in proving is that the set of valid orthonormal representations is equivalent to the set of kernels this equivalence can be preserved in natural way when generalizing the definition to weighted graphs any constraint on the inner product uti uj may be represented as constraints on the elements kij of the kernel matrix to define weighted extensions of the theta function we need to first pass to the closely related delsarte version of the number introduced by schrijver in the delsarte version the orthogonality constraint for is relaxed to uti uj with reference to the formulation it is easy to observe that the delsarte version is given by min where kii kij in other words the number corresponds to orthogonal labellings of with orthogonal vectors on the unit sphere assigned to nodes whereas the delsarte version corresponds to obtuse labellings the vectors corresponding to nodes are vectors on the unit sphere meeting at obtuse angles in both cases the corresponding number is essentially the of the smallest spherical cap containing all the vectors assigned to the nodes comparing and it follows that in the sequel we will use the delsarte version and obtuse labellings to define weighted generalizations of the theta function we observe in passing that for any and for any independent set in the 
graph taking αi if and otherwise αi αi αj kij αi αi αj kij αi since for each term in the second sum either is an edge in which case either αi or αj is zero or is in which case kij thus like the delsarte version is also an upper bound on the stability or independence number kernel characterization of theta functions on graphs number has classical extension to graphs with node weights σn see for example the generalization in the delsarte version note the inequality constraint is the following σi min max uj kui kck ui ui by passing to the dual of see section and we may as for unweighted graphs characterize by minimization over the set of kernels kii kij and just like in the unweighted case when σi this reduces to the unweighted case we also note that for any and for any independent set in the graph taking αi σi if and otherwise αi αi αj kij σi σi αi αj kij σi since kij thus is an upper bound on the weight of the independent set extension to graphs the kernel characterization of allows one to define natural extension to data given as similarity matrices represented in the form of weighted graph here is similarity function on unordered node pairs and with representing complete similarity and complete dissimilarity the obtuse labellings corresponding to the delsarte version are somewhat more flexible even for unweighted graphs but is particularly well suited for weighted graphs we define min where kii kij sij in the case of an unweighted graph where sij this reduces exactly to table characterizations of weighted theta functions in the first row are characterizations following the original definition in the second are kernel characterizations the bottom row are versions of the in all cases kui kck refers to the adjacency matrix of unweighted min min max ui σi min min max ui ui ui kls min min max ui uj uj kg kii kij kg kii kij kls ui uj sij kg kii kij sij σmax kls unifying weighted generalization we may now combine both node and edge weights to form fully general extension to the delsarte version of the number sij min kii kij σi σi σj it is easy to see that for unweighted graphs sij σi the definition reduces to the delsarte version of the theta function in is hence strict generalization of all the proposed weighted extensions are defined by the same objective the only difference is the set specialized in various ways over which the minimum is computed it also is important to note that with the generalization of the theta function comes an implicit generalization of the geometric representation of specifically for any feasible in there is an embedding un such that with the properties uj σi σj sij kui σi which can be retrieved using matrix decomposition note that uj σi σj is exactly the cosine similarity between ui and uj which is very natural choice when sij the original definition of the delsarte theta function and its extensions as well as their kernel characterizations can be seen in table we can prove the equivalence of the embedding top and kernel characterizations middle using the following result proposition for any embedding with and in the following holds min max max αi ui proof the result is given as part of the proof of theorem in jethava et al see also as we have already established in section that any set of geometric embeddings have characterization as set of kernel matrices it follows that the minimizing the lhs in over constrained set of orthogonal representations ui is equivalent to minimizing the rhs over kernel set computation and approximation the weighted generalization of the 
theta function defined in the previous section may be computed as semidefinite program in fact for the solution to the following problem for details see with the set of symmetric matrices maximize subject to xi xii xij sij σi σj while polynomial in time complexity solving the sdp is too slow in many cases to address this jethava et al introduced fast approximation to the unweighted dubbed svmtheta they showed that in some cases the minimization over in can be replaced by fixed choice of while causing just error specifically for unweighted graphs with adjacency matrix jethava et al defined the so called kls and showed that for large families of graphs kls γϑ for constant we extend the to weighted graphs for graphs with edge weights represented by similarity matrix the original definition may be used with substituted for for node weighted graphs we also must satisfy the constraint kii see natural choice still ensuring positive semidefiniteness is kls diag σmax where diag is the diagonal matrix with elements σii and σmax σi both weighted versions of the are presented in table the fully generalized labelling for graphs with weights on both nodes and edges kls can be obtained by substituting for in as with the exact characterization we note that kls reduces to kls for the uniform case sij σi for all versions of the of as with the exact characterization geometric embedding may be obtained from kls using matrix decompotion computational complexity solving the full problem in the kernel characterization is not faster than the computing the sdp characterization however for fixed the svm can be solved in time retrieving the embedding may be done using cholesky or singular value decomposition svd in general algorithms for these problems have complexity however in many cases rank approximation to the decomposition is sufficient see for example thin or truncated svd corresponding to the top singular values may be computed in time for the remaining issue is the computation of the complexity of computing the discussed in the previous section is dominated by the computation of the minimum eigenvalue λn this can be done approximately in time where is the number of edges of the graph overall the complexity of computing both the embedding and is the theta function as diversity in graphs clustering in section we defined extensions of the delsarte version of the number and the associated geometric embedding for weighted graphs now we wish to show how both and the geometric embedding are useful for solving common machine learning tasks we build on an intuition of as measure of diversity in graphs illustrated here by few simple examples for complete graphs kn it is well known that kn and for empty graphs we may interpret these graphs as having and clusters respectively graphs with several disjoint clusters make natural for graph that is union of disjoint cliques now consider the analogue of for graphs with edge weights sij for any and for any subset of nodes let αi if and otherwise then since kij sij αi αi αj kij αi αi αj kij sij ij maximizing this expression may be viewed as the of finding subset of nodes that is both large and diverse the objective function is the size of the set subjected to penalty for in general support vector machines support values αi correspond to support vectors defining the decision boundary as result nodes with high values αi may be interpreted as an important and diverse set of nodes clustering common problem related to diversity in graphs is correlation clustering in correlation 
clustering the task is to cluster set of items based on their similarity or correlation algorithm clustering input graph with weight matrix and node weights compute kernel arg maxαi as in sort alphas according to ji such that αjn let where either initialize labels zi arg jk kij output result of kernel with kernel and as initial labels or compute with columns ui and let uji output result of with and as initial cluster centroids without specifying the number of clusters beforehand this is naturally posed as problem of clustering the nodes of an graph in variant called overlapping correlation clustering items may belong to several overlapping clusters the usual formulation of correlation clustering is an integer linear program making use of geometric embeddings we may convert the graph clustering problem to the more standard problem of clustering set of points ui allowing the use of an arsenal of established techniques such as clustering however we remind ourselves of two common problems with existing clustering algorithms problem number of clusters many clustering algorithms relies on the user making good choice of the number of clusters as this choice can have dramatic effect on both the accuracy and speed of the algorithm heuristics for choosing such as pham et al have been proposed problem initialization popular clustering algorithms such as lloyd or expectationmaximization for gaussian mixture models require an initial guess of the parameters as result these algorithms are often run repeatedly with different random initializations we propose solutions to both problems based on to solve problem we choose this is motivated by being measure of diversity for problem we propose initializing parameters based on the observation that the αi are support vectors specifically we let the initial clusters by represented by the set of nodes with the largest αi in clustering this corresponds to letting the initial centroids be ui we summarize these ideas in algorithm comprising both and kernel clustering in section we showed that computating the approximate weighted theta function and embedding can be done in time for rank approximation to the svd as is lloyd algorithm has very high complexity and will dominate the overall complexity experiments weighted maximum cut the maximum cut problem fundamental problem in graph algorithms with applications in machine learning has famously been solved using geometric embeddings defined by semidefinite programs here given graph we compute an embedding the labelling in using the kls to reduce complexity while preserving accuracy we use rank truncated svd see section we apply the goemanswilliamson random hyperplane rounding to partition the embedding into two sets of points representing the cut the rounding was repeated times and the maximum cut is reported helmberg rendl constructed set of graphs of which are weighted that has since often been used as benchmarks for we use the six of the weighted graphs for which there are multiple published results our approach is closest to that of the which table weighted maximum cut is the weight of the produced cut graph sdp time time best known time table clustering of the mini newsgroup dataset average and std deviation over splits is the average number of clusters predicted the true number is vote ivot est irst means rand means init means rand means time has time complexity mn in comparison our method takes time see section the results are presented in table for all graphs the svm approximation is comparable to or better 
than the sdp solution and considerably faster than the best known method correlation clustering we evaluate several different versions of algorithm in the task of correlation clustering see section we consider the full version means one with but random initialization of centroids means rand one with initialization but choosing according to pham et al means init and according to and random initialization means rand for the randomly initialized versions we use restarts of in all versions we cluster the points of the embedding defined by the fixed kernel kls elsner schudy constructed five affinity matrices for subset of the classical dataset each matrix corresponding to different split of the data represents the similarity between messages in different newsgroups the task is to cluster the messages by their respective newsgroup we run algorithm on every split and compute the reporting the average and standard deviation over all splits as well as the predicted number of clusters we compare our results to several greedy methods described by elsner schudy see table we only compare to their logarithmic weighting schema as the difference to using additive weights was negligible the results are presented in table we observe that the full method achieves the highest followed by the version with random initialization instead of using embeddings of nodes with highest αi see algorithm we note also that choosing by the method of pham et al consistently results in too few clusters and with the greedy search methods far too many overlapping correlation clustering bonchi et al constructed benchmark for overlapping correlation clustering based on two datasets for classification yeast and emotion the datasets consist of and items belonging to one or more of and overlapping clusters respectively each set can be represented as an binary matrix where is the number of clusters and is the number of items note that the timing results for the sdp method are from the original paper published in table clustering of the yeast and emotion datasets the total time for finding the best solution the times for sect for single was and respectively sect no prec emotion rec time prec yeast rec time such that lic iff item belongs to cluster from weight matrix is defined such that sij is the jaccard coefficient between rows and of is often sparse as many of the pairs do not share single cluster the correlation clustering task is to reconstruct from here we use only the centroids ujk produced by algorithm without running kmeans we let each centroid represent cluster and assign node to that cluster iff uti ujc we compute the precision and recall following bonchi et al for comparison with bonchi et al we run their algorithm called sect with the parameter bounding the number of clusters in the interval and select the one resulting in lowest cost the results are presented in table for emotion and yeast estimated the number of clusters to be the correct number and respectively for the with the lowest cost were and we note that while very similar in performance the algorithms is considerably faster than sect especially when is unknown document summarization finally we briefly examine the idea of using αi to select both relevant and diverse set of items in very natural application of the weighted theta function extractive summarization in extractive summarization the goal is to automatically summarize text by picking out small set of sentences that best represents the whole text we may view the sentences of text as the nodes of graph 
with edge weights sij the similarity between sentences and node weights σi representing the relevance of the sentence to the text as whole the between brevity and relevance described above can then be viewed as finding set of nodes that has both high total weight and high diversity this is naturally accomplished using our framework by computing arg maxαi for fixed kls and picking the sentences with the highest we apply this method to the summarization task of let sij be the sentence similarity described by lin bilmes and let σi sij systems for summarization achieve around in recall and score our method achieves score of on both measures which is about the same as the basic version of this is likely possible to improve by tuning the between relevance and diversity such as making more sophisticated choice of and however we leave this to future work conclusions we have introduced unifying generalization of theta function and the corresponding geometric embedding to graphs with node and edge weights characterized as minimization over constrained set of kernel matrices this allows an extension of fast approximation of the number to weighted graphs defined by an svm problem for fixed kernel matrix we have shown that the theta function has natural interpretation as measure of diversity in graphs useful function in several machine learning problems exploiting these results we have defined algorithms for weighted maximum cut correlation clustering and document summarization acknowledgments this work is supported in part by the swedish foundation for strategic research ssf http references arora hazan and kale fast algorithms for approximate semidefinite programming using the multiplicative weights update method in foundations of computer science focs annual ieee symposium on pages ieee arora hazan and kale the multiplicative weights update method and applications theory of computing bansal blum and chawla correlation clustering machine learning bonchi gionis and ukkonen overlapping correlation clustering knowledge and information systems brand fast modifications of the thin singular value decomposition linear algebra and its applications burer and monteiro projected gradient algorithm for solving the maxcut sdp relaxation optimization methods and software elsner and schudy bounding and comparing methods for correlation clustering beyond ilp in proceedings of the workshop on integer linear programming for natural langauge processing pages association for computational linguistics goemans semidefinite programming in combinatorial optimization math goemans and williamson improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming journal of the acm jacm and schrijver geometric algorithms and combinatorial optimization volume of algorithms and combinatorics springer helmberg and rendl spectral bundle method for semidefinite programming siam journal on optimization hush kelly scovel and steinwart qp algorithms with guaranteed accuracy and run time for support vector machines journal of machine learning research iyengar phillips and stein approximating semidefinite packing programs siam journal on optimization jethava martinsson bhattacharyya and dubhashi function svms and finding dense subgraphs the journal of machine learning research jethava sznajdman bhattacharyya and dubhashi lovasz svms and applications in information theory workshop itw ieee pages ieee johanson chattoraj bhattacharyya and dubhashi supplementary material knuth the sandwich theorem 
electr lin and bilmes class of submodular functions for document summarization in proc of the annual meeting of the association for computational linguistics human language technologiesvolume pages association for computational linguistics on the shannon capacity of graph ieee transactions on information theory and vesztergombi geometric representations of graphs paul erdos and his mathematics duarte and laguna advanced scatter search for the problem informs journal on computing pham dimov and nguyen selection of in clustering proceedings of the institution of mechanical engineers part journal of mechanical engineering science platt smola and williamson estimating the support of distribution neural computation schrijver comparison of the delsarte and bounds information theory ieee transactions on wang jebara and chang learning using greedy the journal of machine learning research 
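To make the clustering procedure of Algorithm 1 above concrete, the following is a minimal sketch of its k-means variant: the number of clusters is taken as the ceiling of the (approximate) theta value, and the initial centroids are the embedding vectors of the nodes with the largest α_i. It assumes the embedding matrix U (one row u_i per node) and the SVM weights α have already been produced by the kernel and low-rank SVD steps described earlier (not reproduced here); identifying theta with the sum of the α's at the optimum is an assumption carried over from the SVM formulation referenced in the text, and all names and defaults are illustrative rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def theta_guided_kmeans(U, alpha, theta=None):
    """Sketch of the k-means branch of Algorithm 1: pick k from the
    (approximate) theta value and initialize centroids from the nodes
    with the largest support-vector weights alpha.

    U     : (n, d) array of embedding vectors u_i (assumed precomputed)
    alpha : (n,) array of SVM weights (assumed precomputed)
    theta : approximate theta value; if None, the sum of the alphas is
            used as a stand-in (an assumption, not the paper's exact recipe)
    """
    theta = float(np.sum(alpha)) if theta is None else float(theta)
    k = max(1, int(np.ceil(theta)))          # number of clusters

    # Initial centroids: embeddings of the k nodes with largest alpha_i.
    top = np.argsort(alpha)[::-1][:k]
    init_centroids = U[top]

    km = KMeans(n_clusters=k, init=init_centroids, n_init=1)
    labels = km.fit_predict(U)
    return labels, k

# Example with random data standing in for a real embedding:
# U = np.random.randn(100, 10); alpha = np.abs(np.random.randn(100))
# labels, k = theta_guided_kmeans(U, alpha)
```

The same α-based initialization can be fed to the kernel k-means branch instead; only the centroid-based variant is sketched here.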
sgd algorithms based on incomplete minimization of empirical risk guillaume papa ltci cnrs paristech paris france bellet magnet team inria lille nord europe villeneuve ascq france abstract in many learning problems ranging from clustering to ranking through metric learning empirical estimates of the risk functional consist of an average over tuples pairs or triplets of observations rather than over individual observations in this paper we focus on how to best implement stochastic approximation approach to solve such risk minimization problems we argue that in the largescale setting gradient estimates should be obtained by sampling tuples of data points with replacement incomplete instead of sampling data points without replacement complete based on subsamples we develop theoretical framework accounting for the substantial impact of this strategy on the generalization ability of the prediction model returned by the stochastic gradient descent sgd algorithm it reveals that the method we promote achieves much better between statistical accuracy and computational cost beyond the rate bound analysis experiments on auc maximization and metric learning provide strong empirical evidence of the superiority of the proposed approach introduction in many machine learning problems the statistical risk functional is an expectation over of observations rather than over individual points this is the case in supervised metric learning where one seeks to optimize distance function such that it assigns smaller values to pairs of points with the same label than to those with different labels other popular examples include bipartite ranking see for instance where the goal is to maximize the number of concordant pairs auc maximization and more generally ranking cf as well as pairwise clustering see given data sample the most natural empirical risk estimate which is known to have minimal variance among all unbiased estimates is obtained by averaging over all tuples of observations and thus takes the form of an average of dependent variables generalizing the means see the empirical risk minimization erm principle one of the main paradigms of statistical learning theory has been extended to the case where the empirical risk of prediction rule is using concentration properties of collections of the computation of the empirical risk is however numerically unfeasible in large and even moderate scale situations due to the exploding number of possible tuples in practice the minimization of such empirical risk functionals is generally performed by means of stochastic optimization techniques such as stochastic gradient descent sgd where at each iteration only small number of randomly selected terms are used to compute an estimate of the gradient see for instance drawback of the original sgd learning method introduced in the case where empirical risk functionals are computed by summing over independent observations sample mean statistics is its slow convergence due to the variance of the gradient estimates see this has recently motivated the development of wide variety of sgd variants implementing variance reduction method in order to improve convergence variance reduction is achieved by occasionally computing the exact gradient see sag svrg miso and saga among others or by means of nonuniform sampling schemes see for instance however such ideas can hardly be applied to the case under study here due to the overwhelming number of possible tuples computing even single exact gradient or maintaining probability distribution 
over the set of all tuples is computationally unfeasible in general in this paper we leverage the specific structure and statistical properties of the empirical risk functional when it is of the form of to design an efficient implementation of the sgd learning method we study the performance of the following sampling scheme for the gradient estimation step involved in the sgd algorithm drawing with replacement set of tuples directly in order to build an incomplete gradient estimate rather than drawing subset of observations without replacement and forming all possible tuples based on these the corresponding gradient estimate is then complete based on subsample while has investigated maximal deviations between and their incomplete approximations the performance analysis carried out in the present paper is inspired from and involves both the optimization error of the sgd algorithm and the estimation error induced by the statistical setting we first provide rate bounds and asymptotic convergence rates for the sgd procedure applied to the empirical minimization of these results shed light on the impact of the conditional variance of the gradient estimators on the speed of convergence of sgd we then derive novel generalization bound which depends on the variance of the sampling strategies this bound establishes the indisputable superiority of the incomplete estimation approach over the complete variant in terms of the between statistical accuracy and computational cost our experimental results on auc maximization and metric learning tasks on datasets are consistent with our theoretical findings and show that the use of the proposed sampling strategy can provide spectacular performance gains in practice we conclude this paper with promising lines for future research in particular regarding the involved in possible implementation of nonuniform sampling strategies to further improve convergence the rest of this paper is organized as follows in section we briefly review the theory of statistics and their approximations together with elementary notions of stochastic approximation section provides detailed description of the sgd implementation we propose along with performance analysis conditional upon the data sample in section based on these results we derive generalization bound based on decomposition into optimization and estimation errors section presents our numerical experiments and we conclude in section technical proofs are sketched in the appendix and further details can be found in the supplementary material background and problem setup here and throughout the indicator function of any event is denoted by and the variance of any square integrable by definition and examples generalized are extensions of standard sample mean statistics as defined below definition let and dk let nk xnk be independent samples of sizes nk dk and composed of random variables taking their values in some measurable space xk with distribution fk dx respectively let dk xk be measurable function square integrable with respect to the probability distribution fk assume in addition without loss of generality that is symmetric within each block of arguments valued in xkdk the generalized or of degrees dk with kernel is then defined as un qk xik dk ik where nk the symbol ik refers to summation over all elements of qk the set of the ndkk index vectors ik ik being set of dk indexes idk nk and xik xid for in the above definition standard mean statistics correspond to the case where more generally when un is an average over 
all of observations finally corresponds to the situation where dk is used for each sample the key property of the statistic is that it has minimum variance among all unbiased estimates of xdk un one may refer to for further results on the theory of in machine learning generalized are used as performance criteria in various problems such as those listed below clustering given distance the quality of partition of with respect to the clustering of an sample xn drawn from dx can be assessed through the within cluster point scatter cn xi xj xi xj it is one sample of degree with kernel hp ranking suppose that independent samples xnk with nk and on rp have been observed the accuracy of scoring function with respect to the ranking is empirically estimated by the rate of concordant sometimes referred to as the volume under the roc surface nk xik vus nk the quantity above is with degrees dk and kernel xk xk metric learning based on an sample of labeled data xn yn on rp the empirical pairwise classification performance of distance can be evaluated by bn xi xj xi xk yi yj yk which is one sample of degree three with kernel minimization of let rq with be some parameter space and consider the risk minimization problem with xdk qk where xkdk is convex loss function the xdk are independent random variables with distribution dx on xkdk respectively so that is square integrable for any based on independent samples xnk with un we the empirical version of the risk function is denote by the gradient operator many learning algorithms are based on gradient descent following the iterations θt with an arbitrary initial value and learning rate step size γt such that γt and γt here we place ourselves in setting where the sample sizes nk of the training datasets are such that computing the empirical gradient nk def gbn ln xik dk ik qk at each iteration is intractable due to the number ndkk of terms to be averaged instead stochastic approximation suggests the use of an unbiased estimate of that is cheap to compute sgd implementation based on incomplete possible approach consists in replacing by complete computed from subsamples of reduced sizes nk say drawn uniformly at rank dom without replacement among the original samples leading to the following gradient estimator qk dk xik ik where ik refers to summation over all ndkk subsets ik xid related to set ik of dk indexes idk and although this approach is very natural one can obtain better estimate for the same computational cost as shall be seen below estimation of the empirical gradient from practical perspective the alternative strategy we propose is of disarming simplicity it is based on sampling scheme that consists in drawing independently with replacement among the set of index vectors yielding gradient estimator in the form of incomplete see xik ik where db is built by sampling times with replacement in the set we point out that the conditional expectation of given the observed data samples is equal to gbn the parameter corresponding to the number of terms to be averaged controls the computational complexity of the sgd implementation observe incidentally that an incomplete is not in general hence as an unbiased estimator of the gradient of the statistical risk is of course less accurate than the full empirical gradient it has larger variance but this slight increase in variance leads to large reduction in computational cost in our subsequent analysis we will qk show that for the same computational cost taking ndkk implementing sgd with rather than leads to much 
more accurate results we will rely on the fact that has smaller variance to except in the case where as shown in the proposition below qk proposition set ndkk there exists universal constant such that we have and nk dk xdk explicit but lengthy expressions of the for all with variances are given in remark the results of this paper can be extended to other sampling schemes to approximate such as bernoulli sampling or sampling without replacement in following the proposal of for clarity we focus on sampling with replacement which is computationally more efficient conditional performance analysis as first go we investigate and compare the performance of the sgd methods described above conditionally upon the observed data samples for simplicity we denote by pn the conditional probability measure given the data and by enp the pn given matrix we denote by the transpose of and km khs its norm we assume that the loss function is in its gradient is with we also restrict is convex for some deterministic constant ourselves to the case where and we denote by its unique minimizer we point out that the present analysis can be extended to the smooth but convex case see classical argument based on convex analysis and stochastic optimization see for instance shows precisely how the conditional variance of the gradient estimator impacts the empirical performance of the solution produced by the corresponding sgd method and thus strongly advocates the use of the sgd variant proposed in section θt and proposition consider the recursion θt γt θt where en θt denote by σn the conditional variance of for step size γt the following holds if then en if and then σn θn exp en proposition illustrates the fact that the convergence rate of sgd is dominated by the variance term and thus one needs to focus on reducing this term to improve its performance we are also interested in the asymptotic behavior of the algorithm when under the following assumptions is twice differentiable on neighborhood of the function is bounded the function we establish the following result refer to the supplementary material let us set for detailed proof theorem let the covariance matrix be the unique solution of the lyapunov equation σn where σn θn en θn θn and if if not then under assumptions we have θt where iq in addition in the case we have theorem reveals that the conditional variance term again plays key role in the asymptotic performance of the algorithm in particular it is the dominating term in the precision of the solution in the next section we build on these results to derive generalization bound in the spirit of which explicitly depend on the true variance of the gradient estimator generalization bounds let be the minimizer of the true risk as proposed in the mean excess risk can be decomposed as follows θt θt sup beyond the optimization error the second term on the right hand side of the analysis of the generalization ability of the learning method previously described requires to control the estimation error the first term this can be achieved by means of the result stated below which extends corollary in to the situation qk proposition let be collection of bounded symmetric kernels on xkdk such that mh sup suppose also that is vc major class of functions with finite dimension let min bnk then for any log sup mh we are now ready to derive our main result theorem let θt be the sequence generated by sgd using the incomplete statistic gradient qk mator with ndkk terms for some assume that is vc major class class of finite vc 
dimension mθ sup dk xk qk and nθ if the step size satisfies the condition of proposition we have log cn θt btβ for any we also have with probability at least dβ log log log cnθ θt btβ tβ for some constants and dβ depending on the parameters the generalization bound provided by theorem shows the advantage of using an incomplete statistic as the gradient estimator in particular we can obtain results of the same form as qk pk rem for the complete estimator but ndkk is then replaced by following proposition leading to greatly damaged bounds using an incomplete we thus achieve better performance on the test set while reducing the number of iterations and therefore the numbers of gradient computations required to converge to accurate solution to the best of our knowledge this is the first result of this type for empirical minimization of in the next section we provide experiments showing that these gains are very significant in practice numerical experiments in this section we provide numerical experiments to compare the incomplete and complete statistic gradient estimators and in sgd when they rely on the same number of terms the datasets we use are available in all experiments we randomly split the data into training set and test set and sample pairs from the test set to estimate the test performance we used step size of the form γt and the results below are with respect to the number of sgd iterations computational time comparisons can be found in the supplementary material auc optimization we address the problem of learning binary classifier by optimizing the area under the curve which corresponds to the vus criterion eq when given sequence of observations zi xi yi where xi rp and yi we denote by xi yi xi yi and as done in we take linear scoring rule sθ θt where rp is the parameter to learn and use the logistic loss as smooth convex function upper bounding the heaviside function leading to the following erm problem minp log exp sθ sθ xi xj http covtype batch size covtype batch size batch size batch size figure average over runs of the risk estimate with the number of iterations solid lines standard deviation dashed lines we use two datasets examples features and covtype examples features we try different values for the initial step size and the batch size some results averaged over runs of sgd are displayed in figure as predicted by our theoretical findings we found that the incomplete estimator always outperforms its complete variant the performance gap between the two strategies can be small for instance when is very large or is unnecessarily small but for values of the parameters that are relevant in practical scenarios reasonably small and ensuring significant decrease in the objective function the difference can be substantial we also observe smaller variance between sgd runs with the incomplete version metric learning we now turn to metric learning formulation where we are given sample of observations zi xi yi where xi rp and yi following the existing literature we focus on pseudo distances of the form dm where is symmetric positive matrix we again use the logistic loss to obtain convex and smooth surrogate for the erm problem is as follows yi yj yk log exp dm xi xj dm xi xk min we use the binary classification dataset susy examples features figure shows that the performance gap between the two strategies is much larger on this problem this is consistent with the theory one can see from proposition that the variance gap between the incomplete and the complete approximations is much 
wider for of degree metric learning than for of degree auc optimization conclusion and perspectives in this paper we have studied specific implementation of the sgd algorithm when the natural empirical estimates of the objective function are of the form of generalized this situation susy batch size susy batch size figure average over runs of the error test with the number of iterations solid lines their standard deviation dashed lines covers wide variety of statistical learning problems such as ranking pairwise clustering and metric learning the gradient estimator we propose in this context is based on an incomplete obtained by sampling tuples with replacement our main result is thorough analysis of the generalization ability of the predictive rules produced by this algorithm involving both the optimization and the estimation error in the spirit of this analysis shows that the sgd variant we propose far surpasses more naive implementation of same computational cost based on subsampling the data points without replacement furthermore we have shown that these performance gains are very significant in practice when dealing with datasets in future work we plan to investigate how one may extend the nonuniform sampling strategies proposed in to our setting in order to further improve convergence this is challenging goal since we can not hope to maintain distribution over the set of all possible tuples of data points tractable solution could involve approximating the distribution in order to achieve good between statistical performance and costs appendix sketch of technical proofs note that the detailed proofs can be found in the supplementary material sketch proof of proposition set at en and following observe that the sequence at satisfies the recursion at γt standard stochastic approximation argument yields kθ see an upper bound for at cf which combined with for instance give the desired result sketch of proof of theorem the on stochastic approximation arguments see we first show that proof relies θt then we apply the second order to derive the asymptotic behavior of the objective function eq is obtained by standard algebra sketch of proof of theorem combining and proposition leads to the first part of the result to derive sharp probability bounds we apply the union bound on to deal with we use concentration results for while we adapt the proof of proposition to control the are recentered to make martingale increments appear and finally we apply azuma and hoeffding inequalities acknowledgements this work was supported by the chair machine learning for big data of paristech and was conducted when bellet was affiliated with paristech references bach and moulines analysis of stochastic approximation algorithms for machine learning in nips bellet habrard and sebban survey on metric learning for feature vectors and structured data technical report june bellet habrard and sebban metric learning morgan claypool publishers bottou and bousquet the tradeoffs of large scale learning in nips lugosi and vayatis ranking and empirical risk minimization of ann robbiano and tressou maximal deviations of incomplete with applications to empirical risk sampling in sdm on and clustering performance in nips pages bertail and chautru scaling up via sampling designs the horvitzthompson stochastic gradient descent in ieee big data defazio bach and saga fast incremental gradient method with support for convex composite objectives in nips delyon stochastic approximation with decreasing gain convergence and asymptotic 
theory fort central limit theorems for stochastic approximation with controlled markov chain esaimps and vanderlooy binary decomposition methods for multipartite ranking in pages herschtal and raskutti optimising area under the roc curve using gradient descent in icml page janson the asymptotic distributions of incomplete wahrsch verw gebiete johnson and zhang accelerating stochastic gradient descent using predictive variance reduction in nips pages kar sriperumbudur jain and karnick on the generalization ability of online learning algorithms for pairwise loss functions in icml kushner and yin stochastic approximation and recursive algorithms and applications volume springer science business media le roux schmidt and bach stochastic gradient method with an exponential convergence rate for finite training sets in nips lee theory and practice mairal incremental optimization with application to machine learning technical report needell ward and srebro stochastic gradient descent weighted sampling and the randomized kaczmarz algorithm in nips pages nemirovski juditsky lan and shapiro robust stochastic approximation approach to stochastic programming siam journal on optimization nesterov introductory lectures on convex optimization volume springer norouzi fleet and salakhutdinov hamming distance metric learning in nips pages pelletier weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing ann qian jin yi zhang and zhu efficient distance metric learning by adaptive sampling and stochastic gradient descent sgd machine learning zhao hoi jin and yang auc maximization in icml pages zhao and zhang stochastic optimization with importance sampling for regularized loss minimization in icml 
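As a concrete illustration of the gradient estimator promoted in the paper above, here is a minimal sketch of SGD with an incomplete U-statistic gradient for the AUC-maximization example: at each iteration, B pairs (one positive and one negative example each) are drawn independently with replacement, and the gradient of the pairwise logistic loss is averaged over them. The linear scoring rule, the 1/t step size and all constants are illustrative assumptions, not the exact experimental settings used by the authors.

```python
import numpy as np

def sgd_incomplete_pairs(X_pos, X_neg, B=100, T=1000, gamma0=1.0, seed=0):
    """SGD with an *incomplete* U-statistic gradient estimate for a
    pairwise (AUC-type) logistic loss: B pairs are sampled WITH
    replacement at every step, instead of forming all pairs of a small
    subsample (the "complete" strategy).

    Scoring rule: s_theta(x) = theta^T x.
    Pair loss:    log(1 + exp(-(s(x_i^+) - s(x_j^-)))).
    """
    d = X_pos.shape[1]
    theta = np.zeros(d)
    rng = np.random.default_rng(seed)
    for t in range(1, T + 1):
        i = rng.integers(0, len(X_pos), size=B)   # sampled with replacement
        j = rng.integers(0, len(X_neg), size=B)
        diff = X_pos[i] - X_neg[j]                # (B, d)
        margins = np.clip(diff @ theta, -30.0, 30.0)
        # d/dtheta log(1 + exp(-m)) = -(1 / (1 + exp(m))) * dm/dtheta
        coef = -1.0 / (1.0 + np.exp(margins))
        grad = (coef[:, None] * diff).mean(axis=0)
        theta -= (gamma0 / t) * grad              # illustrative 1/t step size
    return theta
```

Replacing the two `rng.integers` calls with a small subsample drawn without replacement, followed by enumeration of all its pairs, gives the complete-U-statistic baseline the paper compares against.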
on selection in bandits and hidden bipartite graphs wei jian yufei zhize tsinghua university chinese university of hong kong mails mail mails taoyf abstract this paper discusses how to efficiently choose from unknown distributions the ones whose means are the greatest by certain metric up to small relative error we study the topic under two standard bandits and hidden bipartite differ in the nature of the input distributions in the former setting each distribution can be sampled in the manner an arbitrary number of times whereas in the latter each distribution is defined on population of finite size and hence is fully revealed after samples for both settings we prove lower bounds on the total number of samples needed and propose optimal algorithms whose sample complexities match those lower bounds introduction this paper studies class of problems that share common objective from number of probabilistic distributions find the ones whose means are the greatest by certain metric crowdsourcing crowdsourcing algorithm see recent works and the references therein summons certain number say of individuals called workers to collaboratively accomplish complex task typically the algorithm breaks the task into potentially very large number of each of which makes binary decision yes or no by taking the majority vote from the participating workers each worker is given an often monetary reward for every that participates in it is therefore crucial to identify the most reliable workers that have the highest rates of making correct decisions because of this crowdsourcing algorithm should ideally be preceded by an exploration phase which selects the best workers from candidates by series of control questions every must be paid for in the same way as the challenge is to find the best workers with the least amount of money frequent pattern discovery let and be two relations given join predicate the joining power of tuple equals the number of tuples such that and satisfy returns the tuples in with the greatest joining power this type of semijoins is notoriously difficult to process when the evaluation of is complicated and thus unfriendly to optimization example from graph databases is the discovery of frequent patterns where is set of graph patterns set of data graphs and decides if pattern is subgraph of data graph in this case essentially returns the set of graph patterns most frequently found in the data graphs given black box for resolving subgraph isomorphism the challenge is to minimize the number of calls to the black box we refer to the reader to for more examples of difficult of this sort problem formulation the paper studies four problems that capture the essence of the above applications bandit we consider standard setting of stochastic bandit selection specifically there is bandit with set of arms where the arm is associated with bernoulli distribution with an unknown mean θi in each round we choose an arm pull it and then collect reward which is an sample from the arm reward distribution given subset of arms we denote by ai the arm with the largest mean in and pk by θi the mean of ai define θavg θi namely the average of the means of the arms in our first two problems aim to identify arms whose means are the greatest either individually or aggregatively problem arm selection given parameters and we want to select subset of such that with probability at least it holds that θi θi we further study variation of where we change the multiplicative guarantee θi θi to an additive guarantee θi θi we refer 
to the modified problem as topkadd arm selection kadd due to the space constraint we present all the details of kadd in appendix problem arm selection kavg given the same parameters as in we want to select subset of such that with probability at least it holds that θavg θavg for both problems the cost of an algorithm is the total number of arms pulled or equivalently the total number of samples drawn from the arms distributions for this reason we refer to the cost as the algorithm sample complexity it is easy to see that is more stringent than kavg hence feasible solution to the former is also feasible solution to the latter but not the vice versa hidden bipartite graph the second main focus of the paper is the exploration of hidden bipartite graphs let be bipartite graph where the nodes in are colored black and those in colored white set and the edge set is hidden in the sense that an algorithm does not see any edge at the beginning to find out whether an edge exists between black vertex and white vertex the algorithm must perform probe operation the cost of the algorithm equals the number of such operations performed if an edge exists between and we say that there is solid edge between them otherwise we say that they have an empty edge let deg be the degree of black vertex namely the number of solid edges of given subset of black vertices we denote by bi the black vertex with largest degree in and by degi the degree of bi furthermore define pk degavg degi we now state the other two problems studied in this work which aim to identify black vertices whose degrees are the greatest either individually or aggregatively problem connected vertex given parameters and we want to select subset of such that with probability at least it holds that degi degi problem kavg connected vertex kavg given the same parameters as in we want to select subset of such that with probability at least it holds that degavg degavg feasible solution to is also feasible for kavg but not the vice versa we will refer to the cost of an algorithm also as its sample complexity by regarding probe operation as sampling the edge probed for any deterministic algorithm the adversary can force the algorithm to always probe mn edges hence we only consider randomized algorithms can be reduced to given hidden bipartite graph we can treat every black vertex as an arm associated with bernoulli reward distribution the reward is with probability deg recall and with probability deg any algorithm for can be deployed to solve as follows whenever samples from arm we randomly choose white vertex and probe the edge between and reward of is returned to if and only if the edge exists and differ however in the size of the population that reward distribution is defined on for the reward of each arm is sampled from population of an indefinite size which can even be infinite consequently nicely models situations such as the crowdsourcing application mentioned earlier for the reward distribution of each arm black vertex is defined on population of size the edges of this has three implications first is better modeling of applications like where an edge exists between and if and only if is true second the problem admits an obvious algorithm with cost nm recall simply probe all the hidden edges third an algorithm never needs to probe the same edge between and probed whether the edge is solid or empty is perpetually revealed we refer to the last implication as the property the above discussion on and also applies to kavg and kavg for each of above 
problems we refer to an algorithm which achieves the precision and failure requirements prescribed by and as an algorithm previous results problem sheng et al presented an that solves with expected cost θk log nδ no lower bound is known on the sample complexity of the closest work is due to kalyanakrishnan et al they considered the problem where the goal is to return set of arms such that with probability at least the mean of each arm in is at least θk they showed an algorithm with sample complexity log kδ in expectation and establish matching lower bound note that ensures an guarantee which is weaker than the individually guarantee of therefore the same lower bound also applies to the readers may be tempted to set θk to derive lower bound of θk log for this however is clearly wrong because when θk typical case in practice this lower bound may be even higher than the upper bound of mentioned earlier the cause of the error lies in that the hard instance constructed in requires θk problem the θk log nδ upper bound of on carries over to kavg which as mentioned before can be solved by any algorithm zhou et al considered an pt mai problem whose goal is to find subset such that θavg θavg holds with probability at least note once again that this is an guarantee as opposed to the guarantee of kavg for pt mai zhou et al presented an algorithm with sample complexity log in expectation observe that if θavg is available magically in advance we can immediately apply the pt mai algorithm of to settle kavg by setting log θavg the expected cost of the algorithm becomes θavg which is suboptimal see the table no lower bound is known on the sample complexity of kavg for pt mai zhou et al which directly applies to kavg due to its stronger proved lower bound of log quality guarantee problems and both problems can be trivially solved with cost nm furthermore as explained in section and kavg can be reduced to and kavg respectively indeed the best existing and kavg algorithms surveyed in the above serve as the state of the art for and kavg respectively prior to this work no lower bound results were known for and kavg note that none of the lower bounds for or kavg is applicable to or kavg resp because there is no reduction from the former problem to the latter our results we obtain tight upper and lower bounds for all of the problems defined in section our main results are summarized in table all bounds are in expectation next we explain several highlights and provide an overview into our techniques the algorithm was designed for but it can be adapted to as well table comparison of our and previous results all bounds are in expectation problem sample complexity source upper bound upper bound lower bound lower bound θk log log log θavg log kavg θk log kδ log θk log kδ θk log nδ log θavg lower bound upper bound min degm log nδ nm min degm log kδ nm deg log if degk log nδ nm degkn new new new new new new min degm log nδ nm upper min deg log nm bound avg kavg min deg log nm new avg log degavg lower new if degavg log nδ bound nm if degavg our algorithm improves the log factor of to log in practice thereby achieving the optimal sample complexity theorem our analysis for is inspired by in particular the median elimination technique in however the details are very different and more involved than the previous ones the application of median elimination of was in much simpler context where the analysis was considerably easier on the lower bound side our argument is similar to that of but we need to get rid of the θk 
assumption as explained in section which requires several changes in the analysis theorem kavg our algorithm improves both existing solutions in significantly noticing that both θk and θavg are never larger but can be far smaller than θavg this improvement results from an enhanced version of median elimination and once again requires analysis specific to our context theorem our lower bound is established with novel reduction from the problem theorem it is worth nothing that the reduction can be used to simplify the proof of the lower bound in theorem and kavg the stated upper bounds for and kavg in table can be obtained directly from our and kavg algorithms in contrast all the arguments for and kavg crucially rely on the samples being down for the two mcv problems due to the property explained in section for we remedy the issue by when degk is large reduction from and ii when degk is small reduction from sampling lower bound for distinguishing two extremely similar distributions theorem analogous ideas are deployed for kavg theorem note that for small range of degk degk log nδ we do not have the optimal lower bounds yet for and kavg closing the gap is left as an interesting open problem algorithm input for do me ai ai us if ak then return ak algorithm median elimination me input and while do sample every arm for log times for each arm do its empirical value the average of the samples from the arms sorted in order of their empirical values and return algorithm uniform sampling us input µs sample every arm for log times for each arm do its value the average of the samples from the arms sorted in order of their values return us ak us ak arm selection in this section we describe new algorithm for the problem we present the detailed analysis in appendix our algorithm consists of three components median elimination me and uniform sampling us as shown in algorithms and respectively given parameters as in problem takes guess line on the value of θk and then applies me line to prune down to set of at most arms then at line us is invoked to process at line as will be clear shortly the value of ak is what thinks should be the value of θk thus the algorithm performs quality check to see whether ak is larger than but close to if the check fails halves its guess line and repeats the above steps otherwise the output of us from line is returned as the final result me runs in rounds round is controlled by parameters and their values for round are given at line in general is the set of arms from which we still want to sample for each arm me takes line samples from and calculates its empirical value lines and me drops at lines and half of the arms in with the smallest empirical values and then at line sets the parameters of the next round me terminates by returning as soon as is at most lines and us simply takes samples from each arm line and calculates its value lines and finally us returns the arms in with the largest values lines and remark if we ignore line of algorithm and simply set then degenerates into the algorithm in theorem solves the problem with expected cost θk log we extends the proof in and establish the lower bound for as shown in theorem theorem for any and given any algorithm there is an instance of the problem on which the algorithm must entail θk log kδ cost in expectation connected vertex this section is devoted to the problem problem we will focus on lower bounds because our algorithm in the previous section also settles with the cost claimed in table by applying the reduction 
described in section we establish matching lower bounds below theorem for any and the following statements are true about any algorithm when degk log nδ there is an instance on which the algorithm must probe degm log kδ edges in expectation when degk there is an instance on which the algorithm must probe nm edges in expectation for large degk in theorem we utilize an instance for to construct random hidden bipartite graph and fed it to any algorithm solves by doing this we reduce to kmcv and thus establish our first lower bound for small degk we define the problem where the goal is to distinguish two extremely distributions we prove the lower bound of problem and reduce it to thus we establish our second lower bound the details are presented in appendix arm selection our kavg algorithm is similar to described in section except that the parameters are adjusted appropriately as shown in algorithm respectively we present the details in appendix theorem solves the kavg problem with expected cost log we establish the lower bound for kavg as shown in theorem theorem for any and given any algorithm there an instance of the kavg problem on which the algorithm must entail θavg log cost in expectation we show that the lower bound of kavg is the maximum of log and θavg our proof of the first lower bound is based on novel reduction from we stress that our reduction can be used to simplify the proof of the lower bound in theorem kavg connected vertex our kavg algorithm combined with the reduction described in section already settles kavg mcv with the sample complexity given in table we establish the following lower bound and prove it in appendix theorem for any and the following statements are true about any kavg algorithm when degavg log nδ there is an instance on which the algorithm must probe log degavg edges in expectation when degk there is an instance on which the algorithm must probe nm edges in expectation algorithm input for do qe us ai us us if then return ak algorithm quartile elimination qe input and while do sample every arm for log times for each arm do its empirical value the average of the samples from the arms sorted in order of their empirical values and return algorithm uniform sampling us input µs sample every arm for log times for each arm do its value the average of the samples from the arms sorted in order of their values us return ak pk ai experiment evaluation due to the space constraint we show only the experiments that compare and amcv for problem additional experiments can be found in appendix we use two synthetic data sets and one real world data set to evaluate the algorithms each dataset is represented as bipartite graph with for the synthetic data the degrees of the black vertices follow power law distribution for each black vertex its degree equals with probability where is the parameter to be set and is the normalizing factor furthermore for each black vertex with degree we connected it to randomly selected white vertices thus we build two bipartite graphs by setting the proper parameters in order to control the average degrees of the black vertices to be and respectively for the real world data we crawl active users from twitter with their corresponding relationships we construct bipartite graph where each of and represents all the users and represents the relationships we say two users and have relationship if they share at least one common friend as the theoretical analysis is rather pessimistic due to the extensive usage of the union bound to make fair comparison 
we adopt the same strategy as in to divide the sample cost in theory by heuristic constant we use the same parameter for amcv as in for we first take for each round of the median elimination step and then we use the previous sample cost dividing as the samples of the uniform sampling step notice that it does not conflict the theoretical sample complexity since the median elimination step dominates the sample complexity of the algorithm we fix the parameters and enumerate from to we then calculate the actual failure probability by counting the successful runs in repeats recall that due to the heuristic nature the algorithm may not achieve the theoretical guarantees prescribed by whenever this happens we label the percentage of actual error it achieves according to the failure probability for example means the algorithm actually achieves an error with failure probability the experiment result is shown in fig amcv sample cost amcv sample cost sample cost power law with deg power law with deg amcv figure performance comparison for as we can see outperforms amcv in both sample complexity and the actual error in all data sets we stress that in the worst case it seems only shows difference when however for the most of the real world data the degrees of the vertices usually follow power law distribution or gaussian distribution for such cases our algorithm only needs to take few samples in each round of the elimination step and drops half of vertices with high confidence therefore the experimental result shows that the sample cost of is much less than amcv related work bandit problems are classical decision problems with tradeoffs and have been extensively studied for several decades dating back to in this line of research and kavg fit into the pure exploration category which has attracted significant attentions in recent years due to its abundant applications such as online advertisement placement channel allocation for mobile communications crowdsourcing etc we mention some closely related work below and refer the interested readers to recent survey et al proposed an optimal algorithm for selecting single arm which approximates the best arm with an additive error at most matching lower bound was established by mannor et al kalyanakrishnan et al considered the problem which we mentioned in section they provided an algorithm with the sample complexity log kδ similarly zhou et al studied the pt mai problem which again as mentioned in section is the version of kavg audibert et al and bubeck et al investigated the fixed budget setting where given fixed number of samples we want to minimize the misidentification probability informally the probability that the solution is not optimal buckeck et al also showed the links between the simple regret the gap between the arm we obtain and the best arm and the cumulative regret the gap between the reward we obtained and the expected reward of the best arm gabillon et al provide unified approach ugape for in both the fixed budget and the fixed confidence settings they derived the algorithms based on lower and upper confidence bound lucb where the time complexity depends on the gap between θk and the other arms note that each time lucb samples the two arms that are most difficult to distinguish since our problem ensures an individually guarantee it is unclear whether only sampling the most arms would be enough we leave it as an intriguing direction for future work chen et al studied how to select the best arms under various combinatorial constraints 
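For concreteness, the following is a simplified sketch of the median-elimination (ME) and uniform-sampling (US) scheme of Algorithms 1-3 above: surviving arms are repeatedly sampled, the empirically worse half is dropped, and a final uniform-sampling round returns the k empirically best survivors. The `pull(i, m)` oracle, assumed to return the empirical mean of m Bernoulli samples from arm i, the constants in the sample sizes, and the omission of the outer guessing-and-halving loop and quality check are all simplifications made for illustration; they do not reproduce the exact constants required by the formal analysis.

```python
import numpy as np

def median_elimination_topk(pull, n, k, eps, delta, seed=0):
    """Simplified sketch of ME + US for top-k arm selection.

    pull(i, m) -> float : assumed oracle returning the mean of m
                          Bernoulli samples from arm i (an assumption,
                          not part of the original pseudocode).
    """
    rng = np.random.default_rng(seed)
    S = list(range(n))
    eps_r, delta_r = eps / 4.0, delta / 2.0
    # Median-elimination rounds: halve the surviving set each round.
    while len(S) > max(2 * k, 4):
        m = int(np.ceil((2.0 / eps_r**2) * np.log(3.0 * k / delta_r)))
        means = np.array([pull(i, m) for i in S])
        order = np.argsort(means)[::-1]
        S = [S[i] for i in order[: max(len(S) // 2, k)]]   # keep better half
        eps_r, delta_r = 0.75 * eps_r, delta_r / 2.0
    # Final uniform-sampling round on the survivors.
    m = int(np.ceil((2.0 / eps**2) * np.log(3.0 * len(S) / delta)))
    means = np.array([pull(i, m) for i in S])
    top = np.argsort(means)[::-1][:k]
    return [S[i] for i in top]
```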
acknowledgements jian li wei cao zhize li were supported in part by the national basic research program of china grants and the national nsfc grants yufei tao was supported in part by projects grf and grf from hkrgc references amsterdamer davidson milo novgorodov and somech oassis query driven crowd mining in sigmod pages audibert bubeck et al best arm identification in bandits colt the complexity of massive data set computations phd thesis university of california bubeck et al regret analysis of stochastic and nonstochastic bandit problems foundations and trends in machine learning bubeck munos and stoltz pure exploration in and bandits theoretical computer science chen lin king lyu and chen combinatorial pure exploration of multiarmed bandits in advances in neural information processing systems pages dubhashi and panconesi concentration of measure for the analysis of randomized algorithms cambridge university press mannor and mansour action elimination and stopping conditions for the bandit and reinforcement learning problems the journal of machine learning research gabillon ghavamzadeh and lazaric best arm identification unified approach to fixed budget and fixed confidence in advances in neural information processing systems pages kalyanakrishnan and stone efficient selection of multiple bandit arms theory and practice in icml pages kalyanakrishnan tewari auer and stone pac subset selection in stochastic multiarmed bandits in icml pages mannor and tsitsiklis the sample complexity of exploration in the bandit problem the journal of machine learning research parameswaran boyd gupta polyzotis and widom optimal rating and filtering algorithms pvldb sheng tao and li exact and approximate algorithms for the most connected vertex problem tods wang lo and yiu identifying the most connected vertices in hidden bipartite graphs using group testing tkde zhou chen and li optimal pac multiple arm identification with applications to crowdsourcing in icml pages zhu papadias zhang and lee spatial joins tkde 
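The reduction from the most-connected-vertex problems to the arm-selection problems described earlier is equally easy to sketch: pulling the arm associated with a black vertex b means probing the edge between b and a uniformly chosen white vertex and returning reward 1 if and only if the edge is solid. The `probe(b, w)` oracle for the hidden bipartite graph is an assumed interface, and, as discussed in the text, this reduction samples edges with replacement and therefore does not exploit the property that a probed edge never needs to be probed again.

```python
import numpy as np

def make_pull_from_hidden_graph(probe, m, seed=0):
    """Wrap a hidden-bipartite-graph probe oracle as a bandit pull oracle.

    probe(b, w) -> bool : assumed oracle telling whether the edge between
                          black vertex b and white vertex w (0..m-1) is solid.
    Returns pull(b, num_samples) with the interface used by the top-k
    arm-selection sketch given earlier.
    """
    rng = np.random.default_rng(seed)

    def pull(b, num_samples):
        ws = rng.integers(0, m, size=num_samples)     # white vertices, with replacement
        return float(np.mean([1.0 if probe(b, w) else 0.0 for w in ws]))

    return pull
```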
the brain uses reliability of stimulus information when making perceptual decisions sebastian stefan department of psychology technische dresden dresden germany abstract in simple perceptual decisions the brain has to identify stimulus based on noisy sensory samples from the stimulus basic statistical considerations state that the reliability of the stimulus information the amount of noise in the samples should be taken into account when the decision is made however for perceptual decision making experiments it has been questioned whether the brain indeed uses the reliability for making decisions when confronted with unpredictable changes in stimulus reliability we here show that even the basic drift diffusion model which has frequently been used to explain experimental findings in perceptual decision making implicitly relies on estimates of stimulus reliability we then show that only those variants of the drift diffusion model which allow stimulusspecific reliabilities are consistent with neurophysiological findings our analysis suggests that the brain estimates the reliability of the stimulus on short time scale of at most few hundred milliseconds introduction in perceptual decision making participants have to identify noisy stimulus in typical experiments only two possibilities are considered the amount of noise on the stimulus is usually varied to manipulate task difficulty with higher noise participants decisions are slower and less accurate early psychology research established that biased random walk models explain the response distributions choice and reaction time of perceptual decision making experiments these models describe decision making as an accumulation of noisy evidence until bound is reached and correspond in discrete time to sequential analysis as developed in statistics more recently electrophysiological experiments provided additional support for such bounded accumulation models see for review there appears to be general consensus that the brain implements the mechanisms required for bounded accumulation although different models were proposed for how exactly this accumulation is employed by the brain an important assumption of all these models is that the brain provides the input to the accumulation the evidence but the most established models actually do not define how this evidence is computed by the brain in this contribution we will show that addressing this question offers new perspective on how exactly perceptual decision making may be performed by the brain probabilistic models provide precise definition of evidence evidence is the likelihood of decision alternative under noisy measurement where the likelihood is defined through generative model of the measurements under the hypothesis that the considered decision alternative is true in particular this generative model implements assumptions about the expected distribution of measurements therefore the likelihood of measurement is large when measurements are assumed by the decision maker to be reliable and small otherwise for modelling perceptual decision making experiments the evidence input which is assumed to be by the brain should similarly depend on the reliability of measurements as estimated by the brain however this has been disputed before the argument is that typical experimental setups make the reliability of each trial unpredictable for the participant therefore it was argued the brain can have no correct estimate of the reliability this issue has been addressed in neurally inspired probabilistic 
model based on probabilistic population codes ppcs the authors have shown that ppcs can implement perceptual decision making without having to explicitly represent reliability in the decision process this remarkable result has been obtained by making the comprehensible assumption that reliability has multiplicative effect on the tuning curves of the neurons in the current stimulus reliability therefore was implicitly represented in the tuning curves of model neurons and still affected decisions in this paper we will investigate on conceptual level whether the brain estimates measurement reliability even within trials while we will not consider the details of its neural representation we will show that even simple widely used bounded accumulation model the drift diffusion model is based on some estimate of measurement reliability using this result we will analyse the results of perceptual decision making experiment and will show that the recorded behaviour together with neurophysiological findings strongly favours the hypothesis that the brain weights evidence using current estimate of measurement reliability even when reliability changes unpredictably across trials this paper is organised as follows we first introduce the notions of measurement evidence and likelihood in the context of the experimentally random dot motion rdm stimulus we define these quantities formally by resorting to simple probabilistic model which has been shown to be equivalent to the drift diffusion model this in turn allows us to formulate three competing variants of the drift diffusion model that either do not use reliability variant const or do use reliability of measurements during decision making variants ddm and depc see below for definitions finally using data of we show that only variants ddm and depc which use reliability are consistent with previous findings about perceptual decision making in the brain measurement evidence and likelihood in the random dot motion stimulus the widely used random dot motion rdm stimulus consists of set of randomly located dots shown within an invisible circle on screen from one video frame to the next some of the dots move into one direction which is fixed within trial of an experiment subset of the dots moves coherently in one direction all other dots are randomly replaced within the circle although there are many variants of how exactly to present the dots the main idea is that the coherently moving dots indicate motion direction which participants have to decide upon by varying the proportion of dots which move coherently also called the coherence of the stimulus the difficulty of the task can be varied effectively we will now consider what kind of evidence the brain can in principle extract from the rdm stimulus in short time window for example from one video frame to the next within trial for simplicity we call this time window time point from here on the idea being that evidence is accumulated over different time points as postulated by bounded accumulation models in perceptual decision making at single time point the brain can measure motion directions from the dots in the rdm display by construction proportion of measurable motion directions will be into one specific direction but through the random relocation of other dots the rdm display will also contain motion in random directions therefore the brain observes distribution of motion directions at each time point this distribution can be considered measurement of the rdm stimulus made by the brain due to the 
randomness of each time frame this distribution varies across time points and the variation in the distribution reduces for increasing coherences we have illustrated this using rose histograms in fig for three different coherence levels note that the precise effect on tuning curves may depend on the particular distribution of measurements and its encoding by the neural population time point time point figure illustration of possible motion direction distributions that the brain can measure from an rdm stimulus rows are different time points columns are different coherences the true underlying motion direction was left for low coherence the measured distribution is very variable across time points and may indicate the presence of many different motion directions at any given time point as coherence increases from to the true underlying motion direction will increasingly dominate measured motion directions simultaneously leading to decreased variation of the measured distribution across time points to compute the evidence for the decision whether the rdm stimulus contains predominantly motion to one of the two considered directions left and right the brain must check how strongly these directions are represented in the measured distribution by estimating the proportion of motion towards left and right we call these proportions evidence for left eleft and evidence for right eright as the measured distribution over motion directions may vary strongly across time points the computed evidences for each single time point may be unreliable probabilistic approaches weight evidence by its reliability such that unreliable evidence is not the question is does the brain perform this computation as well more formally for given coherence does the brain weight evidence by an estimate of reliability that depends on and which we call likelihood or does it ignore changing reliabilities and use weighting unrelated to coherence bounded accumulation models bounded accumulation models postulate that decisions are made based on decision variable in particular this decision variable is driven towards the correct alternative and is perturbed by noise decision is made when the decision variable reaches specific value in the drift diffusion model these three components are represented by drift diffusion and bound we will now relate the typical drift diffusion formalism to our notions of measurement evidence and likelihood by linking the drift diffusion model to probabilistic formulations in the drift diffusion model the decision variable evolves according to simple wiener process with drift in discrete time the change in the decision variable can be written as δy yt vδt for convenience we use imprecise denominations here as will become clear below is in our case gaussian hence the linear weighting of evidence by reliability where is the drift is gaussian noise and controls the amount of diffusion this equation bears an interesting link to how the brain may compute the evidence for example it has been stated in the context of an experiment with rdm stimuli with two decision alternatives that the change in often called momentary evidence is thought to be difference in firing rates of direction selective neurons with opposite direction supp fig formally δy ρleft ρright where ρleft is the firing rate of the population selective to motion towards left at time point because the firing rates depend on the considered decision alternative they represent form of evidence extracted from the stimulus measurement instead of the 
stimulus measurement itself see our definitions in the previous section it is unclear however whether the firing rates just represent the evidence or whether they represent the likelihood the evidence weighted by reliability to clarify the relation between firing rates evidence and likelihood we consider probabilistic models of perceptual decision making several variants have been suggested and related to other forms of decision making for its simplicity which is sufficient for our argument we here consider the model presented in for which direct transformation from probabilistic model to the drift diffusion model has already been shown this model defines two gaussian generative models of measurements which are derived from the stimulus xt xt where represents the variability of measurements expected by the brain similarly it is assumed that the measurements xt are sampled from gaussian with variance which captures variance both from the stimulus and due to other noise sources in the brain xt δtσ where the mean is for left stimulus and for right stimulus evidence for decision is computed in this model by calculating the likelihood of measurement xt under the hypothesised generative models to be precise we consider the which is xt xt lleft log log right we note three important points the first term on the right hand side means that for decreasing the likelihood increases when the measurement xt is close to the means and this contribution however cancels when the difference between the likelihoods for left and right is computed the likelihood is large for measurement xt when xt is close to the corresponding mean the contribution of the stimulus is weighted by the assumed reliability this model of the rdm stimulus is simple but captures the most important properties of the stimulus in particular high coherence rdm stimulus has large proportion of motion in the correct direction with very low variability of measurements whereas low coherence rdm stimulus tends to have lower proportions of motion in the correct direction with high variability cf fig the gaussian model captures these properties by adjusting the noise variance such that high coherence corresponds to low noise and low coherence to high noise under high noise the values xt will vary strongly and tend to be rather distant from and whereas for low noise the values xt will be close to or with low variability hence as expected the model produces large for low noise and small for high noise this intuitive relation between stimulus and probabilistic model is the basis for us to proceed to show that the reliability of the stimulus connected to the coherence level appears at prominent position in the drift diffusion model crucially the drift diffusion model can be derived as the sum of ratios across time in particular discrete time drift diffusion process can be derived by subtracting the likelihoods of eq xt xt δt consequently the change in within trial in which the true stimulus is constant is gaussian δy this replicates the model described in supp fig where the parameterisation of the model however more directly followed that of the gaussian distribution δy lright lleft and did not explicitly take time into account δy kc where and are free parameters and is coherence of the rdm stimulus by analogy to the probabilistic model we therefore see that the model in implicitly assumes that reliability depends on coherence more generally the parameters of the drift diffusion model of eq and that of the probabilistic model can be expressed as 
functions of each other δt δt these equations state that both drift and diffusion depend on the assumed reliability of the measurements does the brain use and necessarily compute this reliability which depends on coherence in the following section we answer this question by comparing how well three variants of the drift diffusion model that implement different assumptions about conform to experimental findings use of reliability in perceptual decision making experimental evidence we first show that different assumptions about the reliability translate to variants of the drift diffusion model we then fit all variants to behavioural data performances and mean reaction times of an experiment for which neurophysiological data has also been reported and demonstrate that only those variants which allow reliability to depend on coherence level lead to accumulation mechanisms which are consistent with the neurophysiological findings drift diffusion model variants for the drift diffusion model of eq the accuracy and mean decision time predicted by the model can be determined analytically exp vb tanh where is the bound these equations highlight an important caveat of the drift diffusion model only two of the three parameters can be determined uniquely from behavioural data for fitting the model one of the parameters needs to be fixed in most cases the diffusion is set to arbitrarily or is fit with constant value across stimulus strengths we call this standard variant of the drift diffusion model the ddm if is constant across stimulus strengths the other two parameters of the model must explain differences in behaviour between stimulus strengths by taking on values that depend on stimulus strength indeed it has been found that primarily drift explains such differences see also below eq states that drift depends on estimated reliability so if drift varies across stimulus strengths this strongly suggests that must vary across stimulus strengths that must depend on coherence however the drift diffusion formalism allows for two other obvious variants of parameterisation one in which the bound is constant across stimulus strengths and conversely one in which drift is constant across stimulus strengths eq we call these variants depc and const respectively for their property to weight evidence by reliability that either depends on coherence or not experimental data in the following we will analyse the data presented in this data set has two major advantages for our purposes reported accuracies and mean reaction times fig are averages based on trials in total therefore noise in this data set is minimal cf small error bars in fig such that any potential effects of overfitting on found parameter values will be small especially in relation to the effect induced by different stimulus strengths the behavioural data is accompanied by recordings of neurons which have been implicated in the decision making process we can therefore compare the accumulation mechanisms resulting from the fit to behaviour with the actual neurophysiological recordings furthermore the structure of the experiments was such that the stimulus in subsequent trials had random strength the brain could not have estimated stimulus strength of trial before the trial started in the experiment of that we consider here two monkeys performed forced choice task based on the rdm stimulus data for eight different coherences were reported to avoid ceiling effects which prevent the unique identification of parameter values in the drift diffusion model we 
exclude those coherences which lead to an accuracy of random choices or to an accuracy of perfect choices the behavioural data of the remaining six coherence levels are presented in table table behavioural data of used in our analysis rt reaction time coherence accuracy fraction mean rt ms the analysis of revealed nondecision time component of the reaction time that is unrelated to the decision process cf of ca using this estimate we determined the mean decision time by subtracting from the mean reaction times shown in table the main findings for the neural recordings which replicated previous findings were that firing rates at the end of decisions were similar and particularly showed no significant relation to coherence fig whereas ii the buildup rate of neural firing within trial had an approximately linear relation to coherence fig fits of drift diffusion model variants to behaviour we can easily fit the model variants ddm depc and const to accuracy and mean decision time using eqs and in accordance with previous approaches we selected values for the respective redundant parameters since the redundant parameter value or its inverse simply scales the fitted parameter values cf eqs and the exact value is irrelevant and we fix in each model variant the redundant parameter to ddm depc coherence coherence const coherence figure fitting results values of the free parameters that replicate the accuracy and mean rt recorded in the experiment table in relation to coherence the remaining parameter was fixed to for each variant left the ddm variant with free parameters drift green and bound purple middle the depc variant with free parameters and diffusion orange right the const variant with free parameters and fig shows the inferred parameter values in congruence with previous findings the ddm variant explained variation in behaviour due to an increasing coherence mostly with an increasing drift green in fig specifically drift and coherence appear to have straightforward linear relation the same finding holds for the depc variant in contrast to the ddm variant however which also exhibited slight increase in the bound purple in fig with increasing coherence the depc variant explained the corresponding differences in behaviour by decreasing diffusion orange in fig as the drift was fixed in const this variant explained behaviour with large and almost identical changes in both diffusion and bound such that large parameter values occurred for small coherences and the relation between parameters and coherence appeared to be quadratic ddm depc const time from start ms time from start ms time from start ms mean of time from start ms time from start ms dt time from end ms dt time from end ms time from start ms dt time from end ms figure properties of fitted model variants top row example trajectories of for different model variants with fitted parameters for blue and yellow coherence trajectories end when they reach the bound for the first time which corresponds to the decision time in that simulated trial notice that the same random samples of were used across variants and coherences bottom row trajectories of averaged over trials in which the first alternative top bound was chosen for the three model variants format of the plots follows that of supp fig left panels show the buildup of from the start of decision making for the different coherences right panels show the averaged drift diffusion trajectories when aligned to the time that decision was made we further investigated the properties of the model 
variants with the fitted parameter values the top row of fig shows example drift diffusion trajectories in eq simulated at resolution of for two coherences following we interpret as the decision variables represented by the firing rates of neurons in monkey area lip these plots exemplify that the ddm and depc variants lead to qualitatively very similar predictions of neural responses whereas the trajectories produced by the const variant stand out because the neural responses to large coherences are predicted to be smaller than those to small coherences we have summarised predicted neural responses to all coherences in the bottom row of fig where we show averages of across trials either aligned to the start of decision making left panels or aligned to the decision time right panels these plots illustrate that the ddm and depc variants replicate the main neurophysiological findings of neural responses at the end of the decision were similar and independent of coherence for the depc variant this was built into the model because the bound was fixed for the ddm variant the bound shows small dependence on coherence but the neural responses aligned to decision time were still very similar across coherences the ddm and depc variants further replicate the finding that the buildup of neural firing depends approximately linear on coherence normalised mean square error of corresponding linear model was and respectively in contrast the const variant exhibited an inverse relation between coherence and buildup of predicted neural response buildup was larger for small coherences furthermore neural responses at decision time strongly depended on coherence therefore the const variant as the only variant which does not use reliability is also the only variant which is clearly inconsistent with the neurophysiological findings discussion we have investigated whether the brain uses online estimates of stimulus reliability when making simple perceptual decisions from probabilistic perspective fundamental considerations suggest that using accurate estimates of stimulus reliability lead to better decisions but in the field of perceptual decision making it has been questioned that the brain estimates stimulus reliability on the very short time scale of few hundred milliseconds by using probabilistic formulation of the most widely accepted model we were able to show that only those variants of the model which assume online reliability estimation are consistent with reported experimental findings our argument is based on strict distinction between measurements evidence and likelihood which may be briefly summarised as follows measurements are raw stimulus features that do not relate to the decision evidence is transformation of measurements into decision relevant space reflecting the decision alternatives and likelihood is evidence scaled by current estimate of measurement reliabilities it is easy to overlook this distinction at the level of bounded accumulation models such as the drift diffusion model because these models assume form of evidence as input however this evidence has to be computed by the brain as we have demonstrated based on the example of the rdm stimulus and using behavioural data we chose one particular simple probabilistic model because this model has direct equivalence with the drift diffusion model which was used to explain the data of before other models may have not allowed conclusions about reliability estimates in the brain in particular introduced an alternative model that also leads to 
equivalence with the drift diffusion model but explains differences in behaviour by different mean measurements and their representations in the generative model instead of varying reliability across coherences this model would vary the difference of means in the second summand of eq directly without leading to any difference on the drift diffusion trajectories represented by of eq when compared to those of the probabilistic model chosen here the interpretation of the alternative model of however is far removed from basic assumptions about the rdm stimulus whereas the alternative model assumes that the reliability of the stimulus is fixed across coherences the noise in the rdm stimulus clearly depends on coherence we therefore discarded the alternative model here as slight caveat the neurophysiological findings on which we based our conclusion could have been the result of search for neurons that exhibit the properties of the conventional drift diffusion model the ddm variant we can not exclude this possibility completely but given the wide range and persistence of consistent evidence for the standard bounded accumulation theory of decision making we find it rather unlikely that the results in and were purely found by chance even if our conclusion about the rapid estimation of reliability by the brain does not endure our formal contribution holds we clarified that the drift diffusion model in its most common variant ddm is consistent with and even implicitly relies on estimates of measurement reliability in the experiment of coherences of the rdm stimulus were chosen randomly for each trial consequently participants could not predict the reliability of the rdm stimulus for the upcoming trial the participants brains could not have had good estimate of stimulus reliability at the start of trial yet our analysis strongly suggests that reliabilities were used during decision making the brain therefore must had adapted reliability within trials even on the short timescale of few hundred milliseconds on the level of analysis dictated by the drift diffusion model we can not observe this adaptation it only manifests itself as change in mean drift that is assumed to be constant within trial first models of simultaneous decision making and reliability estimation have been suggested but clearly more work in this direction is needed to elucidate the underlying mechanism used by the brain references joshua gold and michael shadlen the neural basis of decision making annu rev neurosci john statistical decision theory of simple reaction time australian journal of psychology duncan luce response times their role in inferring elementary mental organization number in oxford psychology series oxford university press abraham wald sequential analysis wiley new york wang probabilistic decision making by slow reverberation in cortical circuits neuron dec rajesh rao bayesian computation in recurrent neural circuits neural comput jan jeffrey beck wei ji ma roozbeh kiani tim hanks anne churchland jamie roitman michael shadlen peter latham and alexandre pouget probabilistic population codes for bayesian decision making neuron december anne churchland kiani chaudhuri wang alexandre pouget and shadlen variance as signature of neural computations during decision making neuron feb rafal bogacz eric brown jeff moehlis philip holmes and jonathan cohen the physics of optimal decision making formal analysis of models of performance in tasks psychol rev october michael shadlen roozbeh kiani timothy hanks and anne churchland 
neurobiology of decision making an intentional framework in christoph engel and wolf singer editors better than conscious decision making the humand mind and implications for institutions mit press anne churchland roozbeh kiani and michael shadlen with multiple alternatives nat neurosci jun peter dayan and nathaniel daw decision theory reinforcement learning and the brain cogn affect behav neurosci dec sebastian bitzer hame park felix blankenburg and stefan kiebel perceptual decision making model is equivalent to bayesian model frontiers in human neuroscience newsome and selective impairment of motion perception following lesions of the middle temporal visual area mt neurosci june praveen pilly and aaron seitz what difference parameter makes psychophysical comparison of random dot motion algorithms vision res jun angela yu and peter dayan inference attention and decision in bayesian neural architecture in lawrence saul yair weiss and bottou editors advances in neural information processing systems pages mit press cambridge ma alec solway and matthew botvinick decision making as probabilistic inference computational framework and potential neural correlates psychol rev january yanping huang abram friesen timothy hanks mike shadlen and rajesh rao how prior probability influences decision making unifying probabilistic model in bartlett pereira burges bottou and weinberger editors advances in neural information processing systems pages jamie roitman and michael shadlen response of neurons in the lateral intraparietal area during combined visual discrimination reaction time task neurosci nov timothy hanks charles kopec bingni brunton chunyu duan jeffrey erlich and carlos brody distinct relationships of parietal and prefrontal cortices to evidence accumulation nature jan sophie making decisions with unknown sensory reliability front neurosci 
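As a concrete illustration of the three model variants compared above, the following minimal Python sketch (not the authors' code) computes the analytic accuracy and mean decision time of a drift diffusion process and inverts these two equations for a single behavioural condition while holding one parameter fixed, which is the essence of the fitting procedure: fixing the diffusion gives the DDM variant, fixing the bound the DepC variant, and fixing the drift the Const variant. The symmetric-bound parameterisation, the function names, and the example accuracy and decision-time values are assumptions made purely for illustration and are not taken from the paper.

import numpy as np

def ddm_predictions(v, s, b):
    # Analytic accuracy and mean decision time of a drift diffusion model with
    # drift v, diffusion s, symmetric bounds at +/- b and an unbiased start.
    # (Standard Wiener first-passage results; the paper's exact parameterisation
    # may differ slightly.)
    theta = v * b / s**2
    acc = 1.0 / (1.0 + np.exp(-2.0 * theta))
    mdt = (b / v) * np.tanh(theta)
    return acc, mdt

def fit_ddm_variant(acc, mdt, fixed, value=1.0):
    # Invert the two analytic equations for one behavioural condition while one
    # parameter is held fixed: 's' -> DDM, 'b' -> DepC, 'v' -> Const.
    theta = 0.5 * np.log(acc / (1.0 - acc))   # theta = v*b/s^2
    ratio = mdt / np.tanh(theta)              # ratio = b/v
    if fixed == 's':        # DDM: diffusion fixed, fit drift and bound
        s = value
        b = np.sqrt(theta * s**2 * ratio)
        v = theta * s**2 / b
    elif fixed == 'b':      # DepC: bound fixed, fit drift and diffusion
        b = value
        v = b / ratio
        s = np.sqrt(v * b / theta)
    elif fixed == 'v':      # Const: drift fixed, fit bound and diffusion
        v = value
        b = v * ratio
        s = np.sqrt(v * b / theta)
    return v, s, b

# hypothetical accuracy / mean decision time pairs (not values from the paper)
for acc, mdt in [(0.75, 0.55), (0.95, 0.35)]:
    for variant in ['s', 'b', 'v']:
        v, s, b = fit_ddm_variant(acc, mdt, fixed=variant)
        print(variant, np.round([v, s, b], 3), np.round(ddm_predictions(v, s, b), 3))

Inverting the equations per condition in this way makes explicit why only two of the three parameters are identifiable from behaviour alone, and why a different choice of the fixed parameter redistributes the coherence dependence across drift, diffusion, and bound.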
fast classification rates for gaussian generative models tianyang li adarsh prasad department of computer science ut austin lty adarsh pradeepr pradeep ravikumar abstract we consider the problem of binary classification when the covariates conditioned on the each of the response values follow multivariate gaussian distributions we focus on the setting where the covariance matrices for the two conditional distributions are the same the corresponding generative model classifier derived via the bayes rule also called linear discriminant analysis has been shown to behave poorly in settings we present novel analysis of the classification error of any linear discriminant approach given conditional gaussian models this allows us to compare the generative model classifier other recently proposed discriminative approaches that directly learn the discriminant function and then finally logistic regression which is another classical discriminative model classifier as we show under natural sparsity assumption and letting denote the sparsity of the bayes classifier the number of covariates and the number of samples the simple logistic regression classifier achieves the fast misclassification error rates of log which is much better than the other approaches which are either under settings inconsistent log or achieve slower rate of introduction we consider the problem of classification of binary response given covariates popular class of approaches are statistical given classification evaluation metric they then optimize surrogate evaluation metric that is computationally tractable and yet have strong guarantees on sample complexity namely number of observations required for some bound on the expected classification evaluation metric these guarantees and methods have been developed largely for the evaluation metric and extending these to general evaluation metrics is an area of active research another class of classification methods are relatively evaluation metric agnostic which is an important desideratum in modern settings where the evaluation metric for an application is typically less clear these are based on learning statistical models over the response and covariates and can be categorized into two classes the first are generative models where we specify conditional distributions of the covariates conditioned on the response and then use the bayes rule to derive the conditional distribution of the response given the covariates the second are the socalled discriminative models where we directly specify the conditional distribution of the response given the covariates in the classical fixed setting we have now have good understanding of the performance of the classification approaches above for generative and discriminative modeling based approaches consider the specific case of naive bayes generative models and logistic regression discriminative models which form ng and jordan provided in such pair the discriminative model has the same form as that of the conditional distribution of the response given the covariates specified by the bayes rule given the generative model itative consistency analyses and showed that under small sample settings the generative model classifiers converge at faster rate to their population error rate compared to the discriminative model classifiers though the population error rate of the discriminative model classifiers could be potentially lower than that of the generative model classifiers due to weaker model assumptions but if the generative model assumption holds then 
generative model classifiers seem preferable to discriminative model classifiers in this paper we investigate whether this conventional wisdom holds even under settings we focus on the simple generative model where the response is binary and the covariates conditioned on each of the response values follows conditional multivariate gaussian distribution we also assume that the two covariance matrices of the two conditional gaussian distributions are the same the corresponding generative model classifier derived via the bayes rule is known in the statistics literature as the linear discriminant analysis lda classifier under classical settings where the misclassification error rate of this classifier has been shown to converge to that of the bayes classifier however in setting where the number of covariates could scale with the number of samples this performance of the lda classifier breaks down in particular bickel and levina show that when then the lda classifier could converge to an error rate of that of random chance what should one then do when we are even allowed this generative model assumption and when bickel and levina suggest the use of naive bayes or conditional independence assumption which in the conditional gaussian context assumes the covariance matrices to be diagonal as they showed the corresponding naive bayes lda classifier does have misclassification error rate that is better than chance but it is asymptotically biased it converges to an error rate that is strictly larger than that of the bayes classifier when the naive bayes conditional independence assumption does not hold bickel and levina also considered weakening of the naive bayes rule by assuming that the covariance matrix is weakly sparse and an ellipsoidal constraint on the means showed that an estimator that leverages these structural constraints converges to the bayes risk at rate of log where depends on the mean and covariance structural assumptions caveat is that these covariance sparsity assumptions might not hold in practice similar caveats apply to the related works on feature annealed independence rules nearest shrunken centroids as well as those moreover even when the assumptions hold they do not yield the fast rates of an alternative approach is to directly impose sparsity on the linear discriminant which is weaker than the covariance sparsity assumptions though impose these in addition then proposed new estimators that leveraged these assumptions but while they were able to show log convergence to the bayes risk they were only able to show slower rate of it is instructive at this juncture to look at recent results on classification error rates from the machine learning community key notion of importance here is whether the two classes are separable which can be understood as requiring that the classification error of the bayes classifier is classical learning theory gives rate of for any classifier when two classes are and it shown that this is also minimax with the note that this is relatively distribution agnostic since it assumes very little on the underlying distributions when the two classes are only rates slower than are known another key notion is condition under which certain classifiers can be shown to attain rate faster than albeit not at the rate unless the two classes are separable specifically let denote constant such that tα holds when this is said to be assumption since as the two classes start becoming separable that is the bayes zero under this assumption known rates for excess risk is 
note that this is always slower than when there has been surge of recent results on statistical statistical analyses of estimators these however are largely focused on parameter error bounds empirical and population and sparsistency in this paper however we are interested in analyzing the classification error under sampling regimes one could stitch these recent results to obtain some error bounds use bounds on the excess and use forms from to convert excess bounds to get bounds on classification error however the resulting bounds are very loose and in particular do not yield the fast rates that we seek in this paper we leverage the closed form expression for the classification error for our generative model and directly analyse it to give faster rates for any linear discriminant method our analyses show that assuming sparse linear discriminant in addition simple log logistic regression classifier achieves near optimal fast rates of even without requiring that the two classes be separable problem setup we consider the problem of high dimensional binary classification under the following generative model let denote binary response variable and let xp rp denote set of covariates for technical simplicity we assume pr pr however our analysis easily extends to the more general case when pr pr for some constant we assume that µy σy conditioned on response the covariate follows multivariate gaussian distribution we assume we are given training samples drawn from the conditional gaussian model above for any classifier rp the risk or simply the classification error is given by ex where is the loss it can also be simply written as pr the classifier attaining the lowest classification error is known as the bayes classifier which we will denote by under the generative model assumption above the bayes classifier can be derived simply as log pr pr so that given sample it would be classified as if the error of the bayes classifier pr pr and as otherwise we denote when log pr pr and we denote this quantity as where so that the bayes classifier can be written as for any trained classifier we are interested in bounding the excess risk defined as the generative approach to training classifier is to estimate estimate and from data and then plug the estimates into equation to construct the classifier this classifier is known as the linear discriminant analysis lda classifier whose theoretical properties have been in classical fixed setting the discriminative approach to training is to estimate pr pr directly from samples assumptions we assume that mean is bounded rp bµ where bµ is constant which doesn scale with we assume that the covariance matrix is all eigenvalues of are in bλmin bλmax additionally we assume bs which gives lower bound on the bayes classifier classification error bs note that this assumption is different from the definition of separable classes in and the low noise condition in and the two classes are still not separable because sparsity assumption motivated by we assume that is sparse and there at most entries cai and liu extensively discuss and show that such sparsity assumption is much weaker than assuming either and to be individually sparse we refer the reader to for an elaborate discussion generative classifiers generative techniques work by estimating and from data and plugging them into equation in simple estimation techniques do not perform well when the sample estimate for the covariance matrix is singular using the generalized inverse of the sample covariance matrix makes the 
estimator highly biased and unstable numerous alternative approaches have been proposed by imposing structural conditions on or and to ensure that they can be estimated consistently some early work based on nearest shrunken centroids feature annealed independence rules and naive bayes imposed independence assumptions on which are often violated in applications impose more complex structural assumptions on the covariance matrix and suggest more complicated thresholding techniques most commonly and are assumed to be sparse and then some thresholding techniques are used to estimate them consistently discriminative classifiers recently more direct techniques have been proposed to solve the sparse lda problem let and µˆd be consistent estimators of and fan et al proposed the regularized optimal affine discriminant road approach which minimizes wt σw with wt restricted to be constant value and an of wroad argmin wt wt kolar and liu provided theoretical insights into the road estimator by analysing its consistency for variable selection cai and liu proposed another variant called linear programming discriminant lpd which tries to make close to the bayes rules linear term in the norm this can be cast as linear programming problem related to the dantzig wlpd argmin λn mai et al proposed another version of the sparse linear discriminant analysis based on an equivalent least square formulation of the lda where they solve an least squares problem to produce consistent classifier all the techniques above do not have finite sample convergence rates or the risk converged either log at slow rate of in this paper we first provide an analysis of classification error rates for any classifier with linear discriminant function and then follow this analysis by investigating the performance of generative and discriminative classifiers for conditional gaussian model classifiers with sparse linear discriminants we first analyze any classifier with linear discriminant function of the form wt we first note that the classification error of any such classifier is available in as wt wt σw wt σw which can be shown by noting that wt is univariate normal random variable when conditioned on the label next we relate the classifiction error above to that of the bayes classifier recall the earlier notation of the bayes classifier as xt the following theorem is key result of the paper that shows that for any linear discriminant classifier whose linear discriminant parameters are close to that of the bayes classifier the excess risk is bounded only by second order terms of the difference note that this theorem will enable fast classification rates if we obtain fast rates for the parameter error theorem let and then we have proof denote the quantity then we have using and the taylor series expansion of around µt we have µt σw σw σw σw wt σw where are constants because the first and second order derivatives of are bounded first note that wt σw because is bounded denote we have by the binomial taylor series expansion wt σw note that and is lower bouned we have wt σw µt σw next we bound wt σw wt σw wt σw where we use the fact that and are bounded similarly σw combing the above bounds we get the desired result logistic regression classifier in this section we show that the simple regularized logistic regression classifier attains fast classification error rates specifically we are interested in the below arg min log exp which maximizes the penalized of the logistic regression model which also corresponds to the conditional 
probability of the response given the covariates for the conditional gaussian model note that here we penalize the intercept term as well although the intercept term usually is not penalized some packages penalize the intercept term our analysis show that penalizing the intercept term does not degrade the performance of the classifier in it is shown that minimizing the expected risk of the logistic loss also minimizes the classification error for the corresponding linear classifier regularized logistic regression is popular classification method in many settings several commonly used packages have been developed for regularized logistic regression and recent works have been on scaling regularized logistic regression to dimensions and large number of samples analysis we first show that regularized logistic regression estimator above converges to the bayes classifier parameters using techniques next we use the theorem from the previous section to argue that since estimated parameter is close to the bayes classifier parameter the excess risk of the classifier using estimated parameter is tightly bounded as well for the first step we first show restricted eigenvalue condition for where are our covariates that comes from mixture of two gaussian distributions note that is not zero centered which is different from existing scenarios etc that assume covariates are zero centered and we denote and lemma with probability kv for some constants we have kx log where is the gaussian width of in the special case when kv we have log proof first note that is with bounded parameter and where and note that aς at notice that aat and and we can see that the singular values of and are lower bounded by and upper bounded by let be the minimum eigenvalue of and the corresponding eigenvector from the expression aσat so we know that the minimum eigenvalue of is lower bounded similarly the largest eigenvalue of is upper bounded then the desired result follows the proof of theorem in although the proof of theorem in is for random variables the proof remains valid for non random variables when kv gives log having established restricted eigenvalue result in lemma next we use the result in for parameter recovery in generalized linear models glms to show that regularized logistic regression can recover the bayes classifier parameters lemma when the number of samples log and choose logn for some constant then we have log with probability at least where are constants proof following the proof of lemma we see that the conditions and in are satisfied following the proof of proposition and corollary in we have the desired result although the proof of proposition and corollary in is for random variables the proof remains valid for non random variables combining lemma and theorem we have the following theorem which gives fast rate for the excess risk of classifier trained using regularized logistic regression theorem with probability at least where are constants when we set logn for some constant the lasso estimate in satisfies log proof this follows from lemma and theorem other linear discriminant classifiers in this section we provide convergence results for the risk for other linear discriminant classifiers discussed in section naive bayes we compare the discriminative approach using logistic regression to the generative for using naive bayes illustration purposes we conside the case where ip and where are unknown but bounded and using naive bayes we esconstants in this case timate where and thus with high probability we have 
np using theorem we get slower rate than the bound given in theorem for discriminative classification using regularized logistic regression lpd lpd uses linear programming similar to the dantzig selector lemma cai and liu theorem let λn log with being sufficiently large constant let log let for some constant and let wlpd be obtained as in equation then with probability greater than we have wlpd log slda slda uses thresholded estimate for and we state simpler version slda lemma theorem assume that and are sparse then we have log log max with high probability where is the number of entries in and are constants road road minimizes wt σw with wt restricted to be constant value and an of lemma fan et al theorem assume that with high probability logn and logn and let wroad be obtained as in equation then with high probability we have wroad log experiments in this section we describe experiments which illustrate the rates for excess risk given in theorem in our experiments we use glmnet where we set the option to penalize the intercept term along with all other parameters glmnet is popular package for regularized logistic regression using coordinate descent methods for illustration purposes in all simulations we use ip to illustrate our bound in theorem we consider three different scenarios in figure classification error classification error excess risk only varying only varying dependence of excess risk on figure simulations for different gaussian classification problems showing the dependence of classification error on different quantities all experiments plotted the average of trials in all experiments we set the regularization parameter log we vary while keeping constant figure shows for different how the classification error changes with increasing in figure we show the relationship between the classification error and the quantity logn this figure agrees with our result on excess risk dependence on in figure we vary while keeping constant figure shows for different how the classification error changes with increasing in figure we show the relationship between the classification error and the quantity ns this figure agrees with our result on excess risk dependence on in figure we show how changes with respect to in one instance gaussian classification we can see that the excess risk achieves the fast rate and agrees with our bound acknowledgements we acknowledge the support of aro via and nsf via and and nih via as part of the joint initiative to support research at the interface of the biological and mathematical sciences references arindam banerjee sheng chen farideh fazayeli and vidyashankar sivakumar estimation with norm regularization in advances in neural information processing systems pages peter bartlett michael jordan and jon mcauliffe convexity classification and risk bounds journal of the american statistical association peter bickel and elizaveta levina some theory for fisher linear discriminant function naive bayes and some alternatives when there are many more variables than observations bernoulli pages peter bickel and elizaveta levina covariance regularization by thresholding the annals of statistics pages bishop pattern recognition and machine learning information science and statistics springer isbn peter and sara van de geer statistics for data methods theory and applications springer science business media tony cai and weidong liu direct estimation approach to sparse linear discriminant analysis journal of the american statistical association emmanuel candes and terence 
tao the dantzig selector statistical estimation when is much larger than the annals of statistics pages venkat chandrasekaran benjamin recht pablo parrilo and alan willsky the convex geometry of linear inverse problems foundations of computational mathematics weizhu chen zhenghao wang and jingren zhou using mapreduce in advances in neural information processing systems pages devroye and lugosi probabilistic theory of pattern recognition springer new york luc devroye probabilistic theory of pattern recognition volume springer science business media david donoho and jiashun jin higher criticism thresholding optimal feature selection when useful features are rare and weak proceedings of the national academy of sciences jianqing fan and yingying fan high dimensional classification using features annealed independence rules annals of statistics jianqing fan yang feng and xin tong road to classification in high dimensional space the regularized optimal affine discriminant journal of the royal statistical society series statistical methodology fan chang hsieh wang and lin liblinear library for large linear classification the journal of machine learning research yingying fan jiashun jin zhigang yao et al optimal classification in sparse gaussian graphic model the annals of statistics manuel eva cernadas barro and dinani amorim do we need hundreds of classifiers to solve real world classification problems the journal of machine learning research jerome friedman trevor hastie and rob tibshirani regularization paths for generalized linear models via coordinate descent journal of statistical software siddharth gopal and yiming yang distributed training of logistic models in proceedings of the international conference on machine learning pages hastie tibshirani and friedman the elements of statistical learning data mining inference and prediction springer mladen kolar and han liu feature selection in classification in proceedings of the international conference on machine learning pages vladimir koltchinskii oracle inequalities in empirical risk minimization and sparse recovery problems ecole de de volume springer science business media qing mai hui zou and ming yuan direct approach to sparse discriminant analysis in dimensions biometrika page enno mammen alexandre tsybakov et al smooth discrimination analysis the annals of statistics sahand negahban bin yu martin wainwright and pradeep ravikumar unified framework for analysis of with decomposable regularizers in advances in neural information processing systems pages andrew ng and michael jordan on discriminative generative classifiers comparison of logistic regression and naive bayes in advances in neural information processing systems nips jun shao yazhen wang xinwei deng sijian wang et al sparse linear discriminant analysis by thresholding for high dimensional data the annals of statistics robert tibshirani trevor hastie balasubramanian narasimhan and gilbert chu diagnosis of multiple cancer types by shrunken centroids of gene expression proceedings of the national academy of sciences sijian wang and ji zhu improved centroids estimation for the nearest shrunken centroid classifier bioinformatics tong zhang statistical behavior and consistency of classification methods based on convex risk minimization annals of statistics pages 
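As a complement to the simulations described in the preceding section, which used the glmnet package, the following minimal Python sketch (scikit-learn rather than glmnet) reproduces the basic experimental setup: covariates drawn from the conditional Gaussian model with identity covariance and a sparse mean difference, an l1-regularized logistic regression with a regularization level on the order of sqrt(log p / n), and the resulting classification error compared against the Bayes error. The particular values of n, p, and s, the mean vector, and the handling of the intercept (which different packages regularize differently) are illustrative assumptions, not the paper's settings.

import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p, s = 500, 1000, 10            # samples, dimension, sparsity (illustrative values)

mu = np.zeros(p)
mu[:s] = 0.5                        # sparse mean difference => sparse Bayes discriminant

def sample(m):
    # conditional Gaussian model with equal class priors and identity covariance
    y = rng.choice([-1, 1], size=m)
    x = y[:, None] * mu + rng.standard_normal((m, p))   # x | y ~ N(y*mu, I_p)
    return x, y

x_tr, y_tr = sample(n)
x_te, y_te = sample(20000)

# l1-regularized logistic regression with lambda ~ sqrt(log p / n);
# C is the inverse regularization strength used by scikit-learn
lam = np.sqrt(np.log(p) / n)
clf = LogisticRegression(penalty='l1', C=1.0 / (n * lam), solver='liblinear')
clf.fit(x_tr, y_tr)

test_err = np.mean(clf.predict(x_te) != y_te)
bayes_err = norm.cdf(-np.linalg.norm(mu))   # Bayes error for identity covariance
print(f"test error {test_err:.3f}  Bayes error {bayes_err:.3f}  excess {test_err - bayes_err:.3f}")

Repeating such a simulation over a grid of n while holding s and p fixed (or over s while holding n and p fixed) and plotting the excess risk against s log(p)/n is the kind of check the experiments section uses to illustrate the fast rate.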
Fast distributed k-center clustering with outliers on massive data. Gustavo Malkomes, Matt Kusner, Wenlin Chen (Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO; luizgustavo, mkusner, wenlinchen), Kilian Weinberger (Department of Computer Science, Cornell University, Ithaca, NY), Benjamin Moseley (Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO; bmoseley). Abstract. Clustering large data is a fundamental problem with a vast number of applications. Due to the increasing size of data, practitioners interested in clustering have turned to distributed computation methods. In this work we consider the widely used k-center clustering problem and its variant used to handle noisy data, k-center with outliers. In the noise-free setting we demonstrate that a previously proposed distributed method is in fact a constant-factor approximation algorithm, which accurately explains its strong empirical performance. Additionally, in the noisy setting we develop a novel distributed algorithm that is also a constant-factor approximation. These algorithms are highly parallel and lend themselves to virtually any distributed computing framework. We compare each empirically against the best known sequential clustering methods and show that both distributed algorithms are consistently close to their sequential versions. The algorithms are all one can hope for in distributed settings: they are fast, memory efficient, and they match the quality of their sequential counterparts. Introduction. Clustering is a fundamental machine learning problem with widespread applications. Example applications include grouping documents or webpages by their similarity for search engines, or grouping web users by their demographics for targeted advertising. In a clustering problem one is given as input a set of data points characterized by a set of features and is asked to cluster (partition) the points so that points within a cluster are similar by some measure. Clustering is a well understood task on modestly sized data sets; today, however, practitioners seek to cluster datasets of massive size. Once data becomes too voluminous, sequential algorithms become ineffective due to their running time and insufficient memory to store the data, and practitioners have turned to distributed methods, in particular MapReduce, to efficiently process massive data sets. One of the most fundamental clustering problems is the k-center problem. Here it is assumed that for any two input points a distance can be computed that reflects their dissimilarity; typically these distances arise from a metric space. The objective is to choose a subset of k points, called centers, that gives rise to a clustering of the input set into k clusters: each input point is assigned to the cluster defined by its closest center among the k center points. The objective selects these centers to minimize the farthest distance of any point to its cluster center. The k-center problem has been studied for over three decades and is a fundamental task used for exemplar-based clustering. It is known to be NP-hard, and further no algorithm can achieve a (2 - eps)-approximation for any eps > 0 unless P = NP. In the sequential setting there are algorithms which match this bound, achieving a 2-approximation. The k-center problem is popular for clustering datasets which are not subject to noise, since the objective is sensitive to error in the data: the worst-case (maximum) distance of a point to the centers is used as the objective. For the case where data can be noisy, previous work has considered the k-center with outliers problem. In this problem the objective is the same, but additionally one may discard a set of z points from the input; these points are the outliers and are ignored in the objective. Here the best known algorithm is a 3-approximation. Once datasets
become large the known algorithms for these two problems become ineffective due to this previous work on clustering has resorted to alternative algorithmics there have been several works on streaming algorithms others have focused on distributed computing the work in the distributed setting has focused on algorithms which are implementable in mapreduce but are also inherently parallel and work in virtually any distributed computing framework the work of was the first to consider clustering in the distributed setting their work gave an mapreduce algorithm their algorithm is sampling based mapreduce algorithm which can be used for variety of clustering objectives unfortunately as the authors point out in their paper the algorithm does not always perform well empirically for the objective since the objective function is very sensitive to missing data points and the sampling can cause large errors in the solution the work of kumar et al gave algorithm for submodular function maximization subject to cardinality constraint in the mapreduce setting however their algorithm requires number of mapreduce rounds whereas mirzasoleiman et al recently extended in gave two mapreduce rounds algorithm but their approximation ratio is not constant it is known that an exact algorithm for submodular maximization subject to cardinality constraint gives an exact algorithm for the problem unfortunately both problems are nphard and the reduction is not approximation preserving therefore their theoretical results do not imply nontrivial approximation for the problem for these problems the following questions loom what can be achieved for clustering with or without outliers in the distributed setting what underlying algorithmic ideas are needed for the with outliers problem to be solved in the distributed setting the with outliers problem has not been studied in the distributed setting given the complexity of the sequential algorithm it is not clear what such an algorithm would look like contributions in this work we consider the and with outliers problems in the distributed computing setting although the algorithms are highly parallel and work in virtually any distributed computing framework they are particularly well suited for the mapreduce as they require only small amounts of communication and very little memory on each machine we therefore state our results for the mapreduce framework we will assume throughout the paper that our algorithm is given some number of machines to process the data we first begin by considering natural interpretation of the algorithm of mirzasoleiman et al on submodular optimization for the problem the algorithm we introduce runs in two mapreduce rounds and achieves small constant approximation theorem there is two round mapreduce algorithm which achieves for the problem which communicates km amount of data assuming the data is already partitioned across the machines the algorithm uses max mk memory on each machine next we consider the with outliers problem this problem is far more challenging and previous distributed techniques do not lend themselves to this problem here we combine the algorithm developed for the problem without outliers with the sequential algorithm for with outliers we show two round mapreduce algorithm that achieves an theorem there is two round mapreduce algorithm which achieves for the with outliers problem which communicates km log amount of data assuming the data is already partitioned across the machines the algorithm uses max log memory on each machine 
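To make the two-round scheme behind these theorems concrete, here is a minimal Python sketch (not the authors' implementation) that simulates it on a single machine with Euclidean distances: the data is partitioned into m parts, a greedy k-center procedure is run on each part, the m*k local centers are collected on one machine, and the greedy procedure is run once more on that small set. All function names, data, and parameter values are illustrative.

import numpy as np

def greedy_k_center(points, k, first=0):
    # Gonzalez-style greedy: repeatedly add the point farthest from the
    # current centers (the classical sequential 2-approximation).
    centers = [points[first]]
    dist = np.linalg.norm(points - centers[0], axis=1)
    while len(centers) < k:
        far = int(np.argmax(dist))
        centers.append(points[far])
        dist = np.minimum(dist, np.linalg.norm(points - points[far], axis=1))
    return np.array(centers)

def distributed_k_center(points, k, m, seed=0):
    # Round 1 (in parallel): each of the m partitions computes k local centers.
    # Round 2: one machine runs greedy again on the m*k collected centers.
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(points), m)
    local = np.vstack([greedy_k_center(part, k) for part in parts])
    return greedy_k_center(local, k)

def k_center_cost(points, centers):
    d = np.min(np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2), axis=1)
    return d.max()

# small synthetic illustration (arbitrary data)
X = np.random.default_rng(1).standard_normal((5000, 2))
C = distributed_k_center(X, k=8, m=10)
print("distributed cost:", k_center_cost(X, C))
print("sequential cost :", k_center_cost(X, greedy_k_center(X, 8)))

Only the m*k local centers are communicated between the rounds, which is what gives the communication and memory bounds stated in the theorems.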
Finally, we perform experiments with both algorithms on real-world datasets. For k-center we observe that the quality of the solutions is effectively the same as that of the sequential algorithm for all values of k, the best one could hope for. For the k-center with outliers problem, our algorithm matches the sequential algorithm as the values of k and z vary, and it significantly outperforms the k-center algorithm that does not explicitly consider outliers. Somewhat surprisingly, our algorithm achieves an order-of-magnitude speedup over the sequential algorithm even when it is run sequentially. Preliminaries. We consider algorithms in the distributed setting where our algorithms are given m machines. We define our algorithms in a general distributed manner, but they are particularly suited to the MapReduce model, which has become widely used both in theory and in applied machine learning. In the MapReduce setting algorithms run in rounds. In each round the machines are allowed to run a sequential computation without inter-machine communication; between rounds, data is redistributed amongst the machines in preparation for the next computation. The goal is to design an algorithm which runs in a small number of rounds, since the main running-time bottleneck is distributing the data amongst the machines between rounds. Generally it is assumed that each of the machines uses sublinear memory. The motivation is that since MapReduce is used to process large data sets, the memory on the machines should be much smaller than the input size; it is additionally assumed that there is enough memory to store the entire dataset across all machines. Our algorithms fall into this category, and the memory required on each machine scales inversely with m. k-center and k-center with outliers. In the problems considered there is a universe U of points; between each pair of points u, v there is a distance d(u, v) specifying their dissimilarity. The points are assumed to lie in a metric space, which implies that for all u, v, w we have d(u, u) = 0, d(u, v) = d(v, u), and the triangle inequality d(u, w) <= d(u, v) + d(v, w). For a set of points C we let d(x, C) denote the minimum distance of a point x to any point in C. In the k-center problem the goal is to choose a set C of k centers such that the maximum of d(x, C) over all x in U is minimized; d(x, C) is the distance between x and its cluster center, and we would like to minimize the largest such distance across all points. In the k-center with outliers problem the goal is to choose a set C of k centers and a set Z of z points such that the maximum of d(x, C) over all x in U excluding Z is minimized. Note that in this problem the algorithm only needs to choose the set C, as the optimal set Z of z points is then well defined: it is the set of z points in U farthest from the centers. Sequential algorithms. The most widely used sequential k-center (without outliers) algorithm is the simple greedy procedure GREEDY summarized in Algorithm 1. The algorithm initializes C with an arbitrary point of U and then iteratively adds points from U to C until |C| = k; at each step the algorithm greedily selects the point of U farthest from the current set C and adds this point to the updated set C. Algorithm 1 GREEDY sequential k-center: add an arbitrary point of U to C; while |C| < k do: set x to the point of U maximizing d(x, C); set C to C together with x; end while; return C. This algorithm is natural and efficient, and is known to give a 2-approximation for the k-center problem. However, it is also inherently sequential and does not lend itself to the distributed setting except for very small k: a MapReduce implementation can be obtained by finding the element maximizing d(x, C) in a distributed fashion (the maximization step in Algorithm 1), but this requires k rounds of MapReduce, each of which must touch the entire dataset, and it is therefore unsuitably inefficient for many real-world problems. The sequential algorithm for k-center with outliers is more complicated, due to the increased difficulty of the problem (it is described in the k-center with outliers section below). This algorithm is even more fundamentally sequential
than algorithm in mapreduce in this section we consider the problem where no outliers are allowed as mentioned before similar variant of this problem has been previously studied in mirzasoleiman et al in the distributed setting the work of mirzasoleiman et al considers submodular maximization and showed min where is the number of machines their algorithm was shown to perform extremely well in practice in slightly modified clustering setup the problem can be mapped to submodular maximization but the standard reduction is not approximation preserving and their result does not imply approximation for in this section we give natural interpretation of their algorithm without submodular maximization algorithm summarizes distributed approach for solving the problem first the data points of are partitioned across all machines then each machine runs the reedy algorithm on the partition they are given to compute set ci of points these points are assigned to single machine which runs reedy again to compute the final solution the algorithm runs in two mapreduce rounds and the only information communicated is ci for each if the data is already assigned to machines thus we have the following proposition proposition the algorithm runs in two mapreduce rounds and communicates km amount of data assuming the data is originally partitioned across the machines the algorithm uses max mk memory on each machine we aim to bound the approximation ratio of algorithm distributed let opt denote the optimal solution value for the problem the partition into equal sized sets um previous proposition and following lemma where machine receives ui give theorem machine assigns ci reedy ui lemma the algorithm is all sets ci are assigned to machine algorithm machine sets reedy ci proof we first show for any that dci output for any ui indeed say that this is not the case for sake of contradiction for some then for some ui dci which implies is distance greater than from all points in ci by definition of reedy for any pair of points ci it must be the case that dci otherwise would have been included in ci thus in the set ci there are points all of distance greater than from each other however then two of these points ci must be assigned to the same center in the optimal solution using the triangle inequality and the definition of opt it must be the case that contradiction thus for all points ui it must be that dci let denote the output solution by we can show similar result for points in ci when compared to that is we show that dx for any ci indeed say that this is not the case for sake of contradiction then for some ci dx which implies is distance greater than from all points in by definition of reedy for any pair of points ci it must be that dx thus in the set there are points all of distance greater than from each other however then two of these points must be assigned to the same center in the optimal solution using the triangle inequality and the definition of opt it must be the case that contradiction thus for all points ci it must be that dx now we put these together to get consider any point if is in ci for any it must be the case that dx by the above argument otherwise is not in ci for any let uj be the partition which belongs to we know that is within distance to some point cj and further we know that is within distance from from the above arguments thus using the triangle inequality dx dx with outliers in this section we consider the with outliers problem and give the first mapreduce algorithm for the problem the 
problem is more challenging than the version without outliers because one has to also determine which points to discard which can drastically change which centers should be chosen intuitively the right algorithmic strategy is to choose centers such that there are many points around them given that they are surrounded by many points this is strong indicator that these points are not outliers this idea was formalized in the algorithm of charikar et al and influential algorithm for this problem in the single machine setting algorithm summarizes the approach of charikar et al it takes as input the set of points the desired number centers and parameter the parameter is guess of the optimal solution value the algorithm performance is best when opt where opt denotes the optimal objective after discarding points the number of outliers to be discarded is not parameter of the algorithm and is communicated implicitly through the value of can be determined by doing binary search on possible values of the minimum and maximum distances of any two points for each point the set bu contains algorithm sequential with outliers points within distance of the utliers rithm adds the point to the solution set which covers the largest number of points while do with the idea here is to add points which let bu du have many points nearby and thus large bv let then the algorithm removes all points from set the universe which are within distance compute from and continues until points are sen to be in the set recall that in the end while liers problem choosing the centers is well defined solution and the outliers are simply the farthest points from the centers further it can be shown that when op after selecting the centers there are at most outliers remaining in it is known that this algorithm gives it is not efficient on large or even medium sized datasets due to the computation of the sets bu within each iteration for instance it can take hours on data set with points we now give distributed approach algorithm for clustering with outliers this algorithm is naturally parallel yet it is significantly faster even if run sequentially on single machine it uses algorithm which is generalization of utliers the algorithm first partitions the points across the machines set ui goes algorithm distributed with outliers to machine each machine runs the utliers reedy algorithm on ui but selects points rather than this results in set partition into equal sized sets um ci for each ci we assign weight where machine receives ui wc that is the number of points in ui that machines sets ci reedy ui have as their closest point in ci if for each point ci machine set wc ci defines an intermediate clustering of ui ui dci then wc is the number of points in the all sets ci are assigned to machine with the cluster the algorithm then runs variweights of the points in ci ation of utliers called luster machine sets luster ci scribed in algorithm on only the points output in ci the main differences are that luster represents each point by the number of points wc closest to it and that it uses and for the radii in bu and the total communication required for utliers is algorithm clustering subroutine that needed to send each of the sets luster ci to machine along with their weights each weight can have size at while do most so it only requires log compute bu du space to encode the weight this let gives the following proposition set proposition utliers compute runs in two mapreduce rounds and communicates log end while amount of data assuming the 
data output is originally partitioned across the machines the algorithm uses max log memory on each machine our goal is to show that utliers is an algorithm theorem we first present intermediate lemmas and give proof sketches leaving intermediate proofs to the supplementary material we overload notation and let opt denote fixed optimal solution as well as the optimal objective to the problem we will assume throughout the proof that opt as we can perform binary search to find opt for arbitrarily small when running luster on single machine we first claim that any point in ui is not too far from any point in ci lemma for every point ui it is the case that dci for all given the above lemma let ok denote the clusters in the optimal solution cluster in opt is defined as subset of the points in not including outliers identified by opt that are closest to some fixed center chosen by opt the high level idea of our proof is similar to that used in our goal is to show that when our algorithm choses each center the set of points discarded from in luster can be mapped to some cluster in the optimal solution at the end of luster there should be at most points in which are the outliers in the optimal solution knowing that we only discard points from close to centers we choose this will imply the approximation bound for every point which must fall into some ui we let denote the closest point in ci to is the closest intermediate cluster center found by reedy to consider the output of luster xk ordered by how elements were added to we will say that an optimal cluster oi is marked at luster iteration if there is point oi such that just before xj is added to essentially if cluster is marked we can make no guarantee about covering it within some radius of xj which will then be discarded figure shows examples where oi is and is not marked we begin by noting that when xj is added to that the weight of the points removed from is at least as large as the maximum number of points in an unmarked cluster in the optimal solution lemma when xj is added then for any unmarked cluster oi given this result the following lemma considers point that is in some cluster oi if is within the ball bxj for xj added to then intuitively this means that we cover all of the points in oi with another way to say this is that after we remove the ball no points in oi contribute weight to any point in lemma consider that xj is to be added to say that bxj for some point oi for some then for every point oi either or has already been removed from see the supplementary material for the proof the final lemma below states that the weight of the points in is at least as large as the number of points in oi further we know that oi since opt has outliers viewing the points in as being assigned to xi in the algorithm solution then this shows that the number of points covered is at least as large as the number of points that the optimal solution covers hence there can not be more than points uncovered by our algorithm pk lemma wu finally we are ready to complete the proof of theorem oi marked oi deleted from oi unmarked oi figure examples in which oi not marked proof of theorem lemma implies that the sum of the weights of the points which are in are at least we know that every point contributes to the weight of some point which is in ci for and by lemma we map every point to xi such that by definition of and lemma it is the case xi by the triangle inequality thus we have mapped points to some point in within distance hence our algorithm discards at 
most points and achieves with proposition we have shown theorem experiments we evaluate the performance of the above clustering algorithms on seven clustering datasets described in table we compare all methods using the with outliers objective in which outliers may be discarded we begin with brief description of the clustering methods we table the clustering datasets and their descriptions used for evaluation name parkinsons yahoo description patients with parkinson disease census household information samples from face images ranking dataset features are gbrt outputs forest cover dataset with cartographic features household electric power readings particle detector measurements the seven features dim compare we then show how the distributed algorithms compare with their sequential counterparts on datasets small enough to run the sequential methods for variety of settings finally in the setting we compare all distributed methods for different settings of methods we implemented the sequential reedy and utliers and distributed and utliers we also implemented two baseline methods machines randomly select points then single machine randomly selects points out of the previously selected points utliers machines randomly select points then utliers algorithm is run over the points previously selected all methods were implemented in matlabtm and conducted on an intel xeon ghz machine covertype power greedy outliers random random random outliers number of clusters number of outliers log objective value census parkinson number of machines figure the performance of sequential and distributed methods we plot the objective value of four small datasets for varying and sequential vs distributed our first set of experiments evaluate how close the proposed distributed methods are to their sequential counterparts to this end we vary all parameters number of centers number of outliers and the number of machines we consider datasets for which computing the sequential methods is practical parkinsons census and two random subsamples inputs each of covertype and power we show the results in figure each column contains the results for single dataset and each row for single varying parameter or along with standard errors over runs when parameter is not varied we fix and as expected the objective value for all methods generally decreases as increases as the distance of any point to its cluster center must shrink with more clusters and utliers usually perform worse than for small save covertype and utliers https objective value skin yahoo covertype power higgs random random random outliers number of clusters figure the objective value of five datasets for varying sometimes matches it for large for all values of tested utliers outperforms all other distributed methods furthermore it matches or slightly outperforms which we attribute to randomness the sequential utliers method in all settings as increases the two random methods improve beyond in some cases similar to the first plot utliers outperforms all other distributed methods while matching the sequential clustering method for very small settings of utliers and perform slightly worse than sequential utliers and reedy however for practical settings of utliers matches utliers and matches reedy in terms of speed on the largest of these datasets census utliers run sequentially is more than faster than utliers see table this large speedup is due to the fact that we can not store the full distance matrix for census thus all distances need to be computed on demand our 
second set of experiments table the speedup of the distributed algofocus on the performance of the distributed rithms run sequentially over their sequential methods on five datasets shown counterparts on the small datasets in figure we vary between and dataset outliers and fix and note that for certain datasets clustering while covertype ing into account outliers produces power able reduction in objective value on yaparkinson hoo the method is even outpercensus formed by utliers that considers outliers similar to the small dataset results utliers outperforms nearly all distributed methods save for small on covertype even on datasets where there appear to be few outliers utliers has excellent performance finally utliers is extremely fast clustering on higgs took less than minutes conclusion in this work we described algorithms for the and with outliers problems in the distributed setting for both problems we studied two round mapreduce algorithms which achieve an and demonstrated that they perform almost identically to their sequential counterparts on real data further number of our experiments validate that using clustering on noisy data degrades the quality of the solution we hope these techniques lead to the discovery of fast and efficient distributed algorithms for other clustering problems in particular what can be shown for the or with outliers problems are exciting open questions acknowledgments gm was supported by mjk and kqw were supported by the nsf grants and bm was supported by the google and yahoo research awards references agarwal and phillips an efficient algorithm for euclidean with outliers in esa pages aggarwal wolf and yu method for targeted advertising on the web based on accumulated data clustering users and semantic node graph techniques march us patent ailon jaiswal and monteleoni streaming approximation in nips pages andoni nikolov onak and yaroslavtsev parallel algorithms for geometric graph problems in stoc pages bahmani kumar and vassilvitskii densest subgraph in streaming and mapreduce pvldb bahman bahmani benjamin moseley andrea vattani ravi kumar and sergei vassilvitskii scalable pvldb balcan ehrlich and liang distributed and clustering on general communication topologies in nips pages rafael barbosa alina ene huy nguyen and justin ward the power of randomization distributed submodular maximization on massive datasets in icml pages broder pueyo josifovski vassilvitskii and venkatesan scalable by ranked retrieval in wsdm pages charikar khuller mount and narasimhan algorithms for facility location problems with outliers in soda pages chen weinberger chapelle kedem and xu classifier cascade for minimizing feature evaluation cost in aistats pages chierichetti kumar and tomkins in in www pages dean and ghemawat mapreduce simplified data processing on large clusters in osdi pages ene im and moseley fast clustering using mapreduce in kdd pages feldman muthukrishnan sidiropoulos stein and svitkina on distributing symmetric streaming computations in soda pages gonzalez clustering to minimize the maximum intercluster distance theoretical computer science issn guha meyerson mishra motwani and callaghan clustering data streams theory and practice ieee trans knowl data sudipto guha rajeev rastogi and kyuseok shim techniques for clustering massive data sets in clustering and information retrieval volume of network theory and applications pages springer us isbn hassani and seidl ediskco energy efficient distributed clustering with outliers in pages hochbaum and shmoys best 
possible heuristic for the problem mathematics of operations research karloff suri and vassilvitskii model of computation for mapreduce in soda pages kaufman and rousseeuw finding groups in data an introduction to cluster analysis wileyinterscience edition march isbn ravi kumar benjamin moseley sergei vassilvitskii and andrea vattani fast greedy algorithms in mapreduce and streaming in spaa pages mccutchen and khuller streaming algorithms for clustering with outliers and with anonymity in pages mirzasoleiman karbasi sarkar and krause distributed submodular maximization identifying representative elements in massive data in nips pages shindler wong and meyerson fast and accurate for large datasets in nips pages suri and vassilvitskii counting triangles and the curse of the last reducer in www pages tsanas little mcsharry and ramig enhanced classical dysphonia measures and sparse regression for telemonitoring of parkinson disease progression in icassp pages ieee tyree weinberger agrawal and paykin parallel boosted regression trees for web search ranking in www pages acm zamir etzioni madani and karp fast and intuitive clustering of web documents in kdd volume pages zhao wang butt khan kumar and marathe sahad subgraph analysis in massive networks using hadoop in ipdps pages may 
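Before moving on, here is a compact, single-process simulation of the two-round distributed procedure for k-center with outliers described above: each of the m partitions runs the greedy selection for k + z points and weights each selected point by the number of local points closest to it, and a single machine then runs a weighted, Charikar-style ball greedy on the selected points. This is only a sketch under stated assumptions: the ball radii r_inner and r_remove are left as parameters (the specific multiples of the guess G used in the paper's analysis are not reproduced here), the guess G itself would be set by binary search over pairwise distances, and all names are illustrative.

```python
import numpy as np

def greedy_select(points, num):
    # per-machine GREEDY on one partition, selecting `num` representatives
    centers = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    while len(centers) < min(num, len(points)):
        far = int(np.argmax(dist))
        centers.append(far)
        dist = np.minimum(dist, np.linalg.norm(points - points[far], axis=1))
    return centers

def weighted_ball_greedy(points, weights, k, guess, r_inner, r_remove):
    # pick the point whose ball of radius r_inner*guess covers the most remaining
    # weight, then discard all weight within r_remove*guess of it; repeat k times
    remaining = weights.astype(float).copy()
    pair = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    chosen = []
    for _ in range(k):
        covered = (pair <= r_inner * guess) @ remaining
        best = int(np.argmax(covered))
        chosen.append(best)
        remaining[pair[best] <= r_remove * guess] = 0.0
    return chosen

def distributed_k_center_outliers(points, k, z, m, guess, r_inner=1.0, r_remove=3.0):
    parts = np.array_split(np.random.permutation(len(points)), m)
    reps, wts = [], []
    for idx in parts:                                  # round 1: one iteration per machine
        local = points[idx]
        sel = greedy_select(local, k + z)
        d = np.linalg.norm(local[:, None, :] - local[sel][None, :, :], axis=2)
        nearest = np.argmin(d, axis=1)                 # intermediate clustering of the partition
        for j, s in enumerate(sel):
            reps.append(local[s])
            wts.append(int(np.sum(nearest == j)))      # weight = local points closest to this representative
    reps, wts = np.asarray(reps), np.asarray(wts)
    # round 2: one machine clusters the m(k+z) weighted representatives
    return reps[weighted_ball_greedy(reps, wts, k, guess, r_inner, r_remove)]
```

Keeping only m(k + z) weighted representatives is what allows the second round to run on a single machine with small memory and communication.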
human memory search as emitting random walk xiaojin timothy wisconsin institute for discovery department of computer sciences department of psychology university of kjun jerryzhu ttrogers ming yuan department of statistics university of myuan zhuoran yang department of mathematical sciences tsinghua university abstract imagine random walk that outputs state only when visiting it for the first time the observed output is therefore version of the underlying walk and consists of permutation of the states or prefix of it we call this model emitting random walk invite prior work has shown that the random walks with such mechanism explain well human behavior in memory search tasks which is of great interest in both the study of human cognition and various clinical applications however parameter estimation in invite is challenging because naive likelihood computation by marginalizing over infinitely many hidden random walk trajectories is intractable in this paper we propose the first efficient maximum likelihood estimate mle for invite by decomposing the censored output into series of absorbing random walks we also prove theoretical properties of the mle including identifiability and consistency we show that invite outperforms several existing methods on human response data from memory search tasks human memory search as random walk key goal for cognitive science has been to understand the mental structures and processes that underlie human semantic memory search semantic fluency has provided the central paradigm for this work given category label as cue animals vehicles etc participants must generate as many example words as possible in seconds without repetition the task is useful because while exceedingly easy to administer it yields rich information about human semantic memory participants do not generate responses in random order but produce bursts of related items beginning with the highly frequent and prototypical then moving to subclusters of related items this ordinal structure sheds light on associative structures in memory retrieval of given item promotes retrieval of related item and so on so that the temporal proximity of items in generated lists reflects the degree to which the two items are related in memory the task also places demands on other important cognitive contributors to memory search for instance participants must retain mental trace of items and use it to refrain from repetition so that the task draws upon working memory and cognitive control in addition to semantic processes for these reasons the task is central tool in all metrics for diagnosing cognitive dysfunction see performance is generally sensitive to variety of neurological disorders but different syndromes also give rise to different patterns of impairment making it useful for diagnosis for these reasons the task has been widely employed both in basic science and applied health research nevertheless the representations and processes that support category fluency remain poorly understood beyond the general observation that responses tend to be clustered by semantic relatedness it is not clear what ordinal structure in produced responses reveals about the structure of human semantic memory in either healthy or disordered populations in the past few years researchers in cognitive science have begun to fill this gap by considering how search models from other domains of science might explain patterns of responses observed in fluency tasks we review related works in section in the current work we build on these 
advances by considering not how search might operate on semantic representation but rather how the representation itself can be learned from data semantic fluency lists given specified model of the process specifically we model search as random walk on set of states words where the transition probability indicates the strength of association in memory and with the further constraint that node labels are only generated when the node is first visited thus repeated visits are censored in the output we refer to this generative process as the emitting invite random walk the mechanism of invite was first employed in abbott et al however their work did not provide tractable method to compute the likelihood nor to estimate the transition matrix from the fluency responses the problem of estimating the underlying markov chain from the lists so produced is nontrivial because once the first two items in list have been produced there may exist infinitely many pathways that lead to production of the next item for instance consider the produced sequence dog cat goat where the underlying graph is fully connected suppose random walk visits dog then cat the walk can then visit dog and cat arbitrarily many times before visiting goat there exist infinitely many walks that outputs the given sequence how can the transition probabilities of the underlying random walk be learned solution to this problem would represent significant advance from prior works that estimate parameters from separate source such as standard text corpus first one reason for verbal fluency enduring appeal has been that the task appears to reveal important semantic structure that may not be discoverable by other means it is not clear that methods for estimating semantic structure based on another corpus do very good job at modelling the structure of human semantic representations generally or that they would reveal the same structures that govern behavior specifically in this fluency task second the representational structures employed can vary depending upon the fluency category for instance the probability of producing chicken after goat will differ depending on whether the task involves listing animals mammals or farm animals simply estimating single structure from the same corpus will not capture these effects third special populations including neurological patients and developing children may generate lists from quite different underlying mental representations which can not be independently estimated from standard corpus in this work we make two important contributions on the invite random walk first we propose tractable way to compute the invite likelihood our key insight in computing the likelihood is to turn invite into series of absorbing random walks this formulation allows us to leverage the fundamental matrix and compute the likelihood in polynomial time second we show that the mle of invite is consistent which is given that the convergence of the log likelihood function is not uniform we formally define invite and present the two main contributions as well as an efficient optimization method to estimate the parameters in section in section we apply invite to both toy data and fluency data on toy data our experiments empirically confirm the consistency result on actual human responses from verbal fluency invite outperforms baselines the results suggest that invite may provide useful tool for investigating human cognitive functions the invite random walk invite is probabilistic model with the following generative story 
consider random walk on set of states with an initial distribution and an arbitrary transition matrix where pij is the probability of jumping from state to surfer starts from random initial state drawn from she outputs state if it is the first time she visits that state upon arriving at an already visited state however she does not output the state the random walk continues indefinitely therefore the output consists of states in the order of their the underlying entire walk trajectory is hidden we further assume that the time step of each output is unobserved for example consider the random walk over four states in figure if the underlying random walk takes the trajectory the observation is log likelihood figure example markov chains example nonconvexity of the invite log likelihood we say that the observation produced by invite is censored list since visits are censored it is easy to see that censored list is permutation of the states or prefix thereof more on this later we denote censored list by am where censored list is not markovian since the probability of transition in censored list depends on the whole history rather than just the current state it is worth noting that invite is distinct from broder algorithm for generating random spanning trees or the random walk or cascade models of infection we discuss the technical difference to related works in section we characterize the type of output invite is capable of producing given that the underlying uncensored random walk continues indefinitely state is said to be transient if random walk starting from has nonzero probability of not returning to itself in finite time and recurrent if such probability is zero set of states is closed if walk can not exit if and then random walk from can not reach set of states is irreducible if there exists path between every pair of states in if then random walk from can reach define we use as shorthand for am theorem states that finite state markov chain can be uniquely decomposed into disjoint sets and theorem states what censored list should look like all proofs are in the supplementary material theorem if the state space is finite then can be written as disjoint union wk where is set of transient states that is possibly empty and each wk is nonempty closed irreducible set of recurrent states theorem consider markov chain with the decomposition wk as in theorem censored list generated by invite on has zero or more transient states followed by all states in one and only one closed irreducible set that is and wk for some as an example when the graph is fully connected invite is capable of producing all permutations of the states as the censored lists as another example in figure and both chains have two transient states and two recurrent states has no path that visits both and and thus every censored list must be prefix of permutation however has path that visits both and thus can generate full permutation in general each invite run generates permutation of states or prefix of permutation let sym be the symmetric group on then the data space of censored lists is sym computing the invite likelihood learning and inference under the invite model is challenging due to its likelihood function naive method to compute the probability of censored list given and is to sum over all uncensored random walk trajectories which produces produces this naive computation is intractable since the summation can be over an infinite number of trajectories that might have produced the censored list for example consider the 
censored list generated from figure there are infinite uncensored trajectories to produce by visiting states and arbitrarily many times before visiting state and later state the likelihood of and on censored list is qm if can not be extended otherwise note we assign zero probability to censored list that is not completed yet since the underlying random walk must run forever we say censored list is valid invalid under and if we first review the fundamental matrix in the absorbing random walk state that transits to itself with probability is called an absorbing state given markov chain with absorbing states we can rearrange the states into where is the transition between the nonabsorbing states is the transition from the nonabsorbing states to absorbing states and the rest trivially represent the absorbing states theorem presents the fundamental matrix the essential tool for the tractable computation of the invite likelihood theorem the fundamental matrix of the markov chain is nij is the expected number of times that chain visits state before absorption when starting from furthermore define then bik is the probability of chain starting from being absorbed by in other words is the absorption distribution of chain starting from as tractable way to compute the likelihood we propose novel formulation that turns an invite random walk into series of absorbing random walks although invite itself is not an absorbing random walk each segment that produces the next item in the censored list can be modeled as one that is for each consider the segment of the uncensored random walk starting from the previous output ak until the next output for this segment we construct an absorbing random walk by keeping nonabsorbing and turning the rest into the absorbing states random walk starting from ak is eventually absorbed by state in the probability of being absorbed by is exactly the probability of outputting after outputting in invite formally we construct an absorbing random walk where the states are ordered as corollary summarizes our computation of the invite likelihood corollary the step invite likelihood for is if exists otherwise suppose we observe independent realizations of invite dm amm where mi is the length of the censored list pm then the invite log likelihood is dm log consistency of the mle identifiability is an essential property for model to be consistent theorem shows that allowing in cause invite to be unidentifiable then theorem presents remedy the proof for both theorems are presented in our supplementary material let diag be diagonal matrix whose diagonal entry is qi theorem let be an transition matrix without any pii and define diag diag scaled transition matrix with probabilities then for every censored list for example consider censored listpa where using the fundamental matrix this implies that multiplying constant to for all and renormalizing the first row to sum to does not change the likelihood theorem assume the initial distribution elementwise in the space of transition matrices without invite is identifiable let rn pi pi be the probability simplex for brevity we pack the parameters of invite into one vector as follows pii let be the true model given set of censored lists dm generated from the average log likelihood function and its pointwise limit are bm log and log for brevity we assume that the true model is strongly connected the analysis can be easily extended to remove it under the assumption theorem states the consistency result assumption let be the true model has no zero 
entries furthermore is strongly connected bm is consistent theorem assume the mle of invite θbm we provide sketch here the proof relies on lemma and lemma that are presented in our supplementary material since is compact the sequence θbm has convergent subsequence bm bm θbm θbmj let θbmj since bm lim bm θbm lim where the last equality is due to lemma by lemma is the unique maximizer of which implies note that the subsequence was chosen arbitrarily since every convergent subsequence converges to θbm converges to parameter estimation via regularized maximum likelihood we present regularized mle regmle of invite we first extend the censored lists that we consider now we allow the underlying walk to terminate after finite steps because in applications the observed censored lists are often truncated that is the underlying random walk can be stopped before exhausting every state the walk could visit for example in verbal fluency participants have limited time to produce list consequently we use the prefix likelihood we find the regmle by maximizing the prefix log likelihood plus regularization term on note that and can be separately optimized for we place dirichlet prior and find the by maximum posteriori map estimator bj cπ directly computing the regmle of requires solving constrained optimization problem because the transition matrix must be row stochastic we which leads to more convenient unconstrained optimization problem let we exponentiate and it pn βij to derive pij we fix the diagonal entries of to to disallow selftransitions we place squared norm regularizer on to prevent overfitting the unconstrained optimization problem is pm pmi min log cβ βij where cβ is regularization parameter we provide the derivative of the prefix log likelihood in our supplementary material we point out that the objective function of is not convex in in general let and suppose we observe two censored lists and we found with random starts two different local optima and of we plot the prefix log likelihood of λβ where in figure nonconvexity of this slice implies nonconvexity of the prefix log likelihood surface in general efficient optimization using averaged stochastic gradient descent given censored list of length computing the derivative of takes time for matrix inversion there are entries in so the time complexity per item is this computation needs to be done for in list and for censored lists which makes the overall time complexity mm in the worst case is as large as which makes it even the batch optimization method such as lbfgs takes very long time to find the solution for moderate problem size such as for faster computation of the regmle we turn to averaged stochastic gradient descent asgd asgd processes the lists sequentially by updating the parameters after every list the objective function for on the list is cβ βij log ring the number of lists invite rw fe grid error error star error invite rw fe invite rw fe the number of lists the number of lists figure toy experiment results where the error is measured with the frobenius norm we randomly initialize at round we update the solution βt with βt ηt and the average estimate with βt let ηt at we use cβ and following and pick by running the algorithm on small subsample of the train set we run asgd for fixed number of epochs and take the final as the solution experiments we compare invite against two popular estimators of naive random walk rw and fe rw is the regularized mle of the naive random pretending the censored lists are the rw mi underlying 
uncensored walk trajectory prc crw though simple and popular rw is biased estimator due to the model mismatch fe was proposed in for graph structure fe uses only the first two items in each in cascade model because the first transition in censored censored list prc list is always the same as the first transition in its underlying trajectory fe is consistent estimator of assuming has no zero entries in fact fe is equivalent to the regmle of the length two prefix likelihood of the invite model however we expect fe to waste information since it discards the rest of the censored lists furthermore fe can not estimate the transition probabilities from an item that does not appear as the first item in the lists which is common in data toy experiments here we compare the three estimators invite rw and fe on toy datasets where the observations are indeed generated by an emitting random walk we construct three undirected unweighted graphs of nodes each ring ring graph ii star nodes each connected to hub node and iii grid lattice the initial distribution is uniform and the transition matrix at each node has an equal transition probability to its neighbors for each graph we generate datasets with censored lists each censored list has length we note that in the star graph censored list contains many apparent transitions between leaf nodes although such transitions are not allowed in its underlying uncensored random walk this will mislead rw this effect is less severe in the grid graph and the ring graph for each estimator we perform cross validation cv for finding the best smoothing parameters cβ crw cf on the grid respectively with which we compute each and the true estimator then we evaluate the three estimators using the frobenius norm between qp transition matrix error pbij note the error must approach as inij creases for consistent estimators we repeat the same experiment times where each time we draw new set of censored lists changes as the number of censored lists increases the error bars figure shows how error are confidence bounds we make three observations invite tends towards error this is expected given the consistency of invite in theorem rw is biased in all three plots rw tends towards some positive number unlike invite and fe this is because rw has the wrong model on the censored lists invite outperforms fe on the ring and grid graphs invite dominates fe for every training set size on the star graph fe is better than invite with small but invite eventually achieves lower error this reflects the fact that although fe is unbiased it discards most of the censored lists and therefore has higher variance compared to invite length min max mean median animal food animal food test set mean neg loglik table verbal fluency test set log likelihood table statistics of the verbal fluency data model invite rw fe invite rw fe verbal fluency we now turn to the fluency data where we compare invite with the baseline models since we do not have the ground truth parameter and we compare test set log likelihood of various models confirming the empirical performance of invite sheds light on using it for practical applications such as the dignosis and classification of the patient data the data used to assess human memory search consists of two verbal fluency datasets from the wisconsin longitudinal survey wls the wls is longitudinal assessment of many sociodemographic and health factors that has been administered to large cohort of wisconsin residents every five years since the verbal fluency for two semantic 
categories animals and foods was administered in the last two testing rounds and yielding total of lists for animals and lists for foods collected from total of participants ranging in age from their to the raw lists included in the wls were preprocessed by expanding abbreviations lab labrador removing inflections cats cat correcting spelling errors and removing response errors like unintelligible items though instructed to not repeat some human participants did occasionally produce repeated words we removed the repetitions from the data which consist of of the word token responses finally the data exhibits zipfian behavior with many idiosyncratic low count words we removed words appearing in less than lists in total the process resulted in removing of the total number of word token responses the statistics of the data after preprocessing is summarized in table procedure we randomly subsample of the lists as the test set and use the rest as the training set we perform cv on the training set for each estimator to find the best smoothing parameter cβ crw cf respectively where the validation measure is the prefix log likelihood for invite and the standard random walk likelihood for rw for the validation measure of fe we use the invite prefix log likelihood since fe is equivalent to the length two prefix likelihood of invite then we train the final estimator on the whole training set using the fitted regularization parameter result the experiment result is summarized in table for each estimator we measure the average negative prefix log likelihood on the test set for invite and fe and the standard random walk negative log likelihood for rw the number in the parenthesis is the confidence interval boldfaced numbers mean that the corresponding estimator is the best and the difference from the others is statistically significant under paired at significance level in both animal and food verbal fluency tasks the result indicates that fluency lists are better explained by invite than by either rw or fe furthermore rw outperforms fe we believe that fe performs poorly despite being consistent because the number of lists is too small compared to the number of states for fe to reach good estimate related work though behavior in semantic fluency tasks has been studied for many years few computationally explicit models of the task have been advanced influential models in the psychological literature such as the clustering and switching model of troyer et al have been articulated only verbally efforts to estimate the structure of semantic memory from fluency lists have mainly focused on decomposing the structure apparent in distance matrices that reflect the mean ordinal distances across many fluency lists without an account of the processes that generate list structure it is not clear how the results of such studies are best interpreted more recently researchers in cognitive science have begun to focus on explicit model of the processes by which fluency lists are generated in these works the structure of semantic memory is first modelled either as graph or as continuous multidimensional space estimated from word statistics in large corpora of natural language researchers then assess whether structure in fluency data can be understood as resulting from particular search process operating over the specified semantic structure models explored in this vein include simple random walk over semantic network with repeated nodes omitted from the sequence produced the pagerank algorithm employed for network search 
by google and foraging algorithms designed to explain the behavior of animals searching for food each example reports aspects of human behavior that are by the respective search process given accompanying assumptions about the nature of the underlying semantic structure however these works do not learn their model directly from the fluency lists which is the key difference from our study broder algorithm generate for generating random spanning tree is similar to invite generative process given an undirected graph the algorithm runs random walk and outputs each transition to an unvisited node upon transiting to an already visited node however it does not output the transition the random walk stops after visiting every node in the graph in the end we observe an ordered list of transitions for example in figure if the random walk trajectory is then the output is note that if we take the starting node of the first transition and the arriving nodes of each transition then the output list reduces to censored list generated from invite with the same underlying random walk despite the similarity to the best of our knowledge the censored list derived from the output of the algorithm generate has not been studied and there has been no parameter estimation task discussed in prior works random walk or random walk performs random walk while avoiding already visited node for example in figure if random walk starts from state then visits then it can only visit states or since is already visited in not visiting the same node twice walk is similar to invite however key difference is that walk can not produce transition if pij in contrast invite can appear to have such transitions in the censored list such behavior is core property that allows invite to switch clusters in modeling human memory search invite resembles cascade models in many aspects in cascade model the information or disease spreads out from seed node to the whole graph by infections that occur from an infected node to its neighbors formulates graph learning problem where an observation is list or trace that contains infected nodes along with their infection time although not discussed in the present paper it is trivial for invite to produce time stamps for each item in its censored list too however there is fundamental difference in how the infection occurs cascade model typically allows multiple infected nodes to infect their neighbors in parallel so that infection can happen simultaneously in many parts of the graph on the other hand invite contains single surfer that is responsible for all the infection via random walk therefore infection in invite is necessarily sequential this results in invite exhibiting clustering behaviors in the censored lists which is in human memory search tasks discussion there are numerous directions to extend invite first more theoretical investigation is needed for example although we know the mle of invite is consistent the convergence rate is unknown second one can improve the invite estimate when data is sparse by assuming certain cluster structures in the transition matrix thereby reducing the degrees of freedom for instance it is known that verbal fluency tends to exhibit runs of semantically related words one can assume stochastic block model with parameter sharing at the block level where the blocks represent semantic clusters of words one then estimates the block structure and the shared parameters at the same time third invite can be extended to allow repetitions in list the basic idea is as follows 
in the segment we previously used an absorbing random walk to compute where were the nonabsorbing states for each nonabsorbing state ai add dongle twin absorbing state attached only to ai allow small transition probability from ai to if the walk is absorbed by we output ai in the censored list which becomes repeated item in the censored list note that the likelihood computation in this augmented model is still polynomial such model with reluctant repetitions will be an interesting interpolation between no repetitions and repetitions as in standard random acknowledgments the authors are thankful to the anonymous reviewers for their comments this work is supported in part by nsf grants and nih big data to knowledge nsf grant and nih grant references abbott austerweil and griffiths human memory search as random walk in semantic network in nips pp abrahao chierichetti kleinberg and panconesi trace complexity of network corr vol bottou stochastic gradient tricks in neural networks tricks of the trade reloaded ser lecture notes in computer science lncs montavon orr and eds springer pp broder generating random spanning trees in focs ieee computer society pp chan butters paulsen salmon swenson and maloney an assessment of the semantic network in patients with alzheimer journal of cognitive neuroscience vol no pp cockrell and folstein state principles and practice of geriatric psychiatry pp doyle and snell random walks and electric networks matical association of america washington dc durrett essentials of stochastic processes ser springer texts in statistics york springer flory principles of polymer chemistry new cornell university press glenberg and mehta optimal foraging in semantic memory italian journal of linguistics gomez rodriguez leskovec and krause inferring networks of diffusion and influence new york ny usa acm press july pp goi arrondo sepulcre martincorena de mendizbal bejarano peraita wall and villoslada the semantic organization of the animal category evidence from semantic verbal fluency and network cognitive processing vol no pp griffiths steyvers and firl google and the mind predicting fluency with pagerank psychological science vol no pp henley psychological study of the semantics of animal journal of verbal learning and verbal behavior vol no pp apr hills todd and jones optimal foraging in semantic memory psychological review pp kempe kleinberg and tardos maximizing the spread of influence through social network in proceedings of the ninth acm sigkdd international conference on knowledge discovery and data mining ser kdd new york ny usa acm pp pasquier lebert grymonprez and petit verbal fluency in dementia of frontal lobe type and dementia of alzheimer journal of neurology vol no pp polyak and juditsky acceleration of stochastic approximation by averaging siam control vol no pp july rogers ivanoiu patterson and hodges semantic memory in alzheimer disease and the frontotemporal dementias longitudinal study of neuropsychology vol no pp ruppert efficient estimations from slowly convergent process cornell university operations research and industrial engineering tech troyer moscovitch winocur alexander and stuss clustering and switching on verbal fluency the effects of focal and lesions neuropsychologia vol no 
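As a concrete companion to the model above, here is a minimal sketch of the two computations at the heart of INVITE: simulating a censored list (a state is emitted only on its first visit) and evaluating the likelihood segment by segment with the fundamental matrix N = (I - Q)^{-1} and absorption matrix B = NR of the induced absorbing chain. Function names are illustrative; what is computed is essentially the prefix likelihood, and the paper's full likelihood additionally requires that a complete censored list cannot be extended.

```python
import numpy as np

def simulate_invite(pi, P, max_steps=100_000, seed=0):
    """Run the underlying walk and emit each state only on its first visit."""
    rng = np.random.default_rng(seed)
    state = int(rng.choice(len(pi), p=pi))
    emitted, seen = [state], {state}
    for _ in range(max_steps):                 # cap the (in principle endless) walk
        state = int(rng.choice(len(pi), p=P[state]))
        if state not in seen:
            seen.add(state)
            emitted.append(state)
        if len(seen) == len(pi):               # nothing left to emit
            break
    return emitted                             # the censored list (repeat visits hidden)

def invite_prefix_loglik(a, pi, P):
    """log pi(a1) + sum_j log Pr(a_{j+1} | a_{1:j}); each factor is an absorption
    probability: already-emitted states are non-absorbing, all others absorb."""
    n = len(pi)
    ll = np.log(pi[a[0]])
    for j in range(len(a) - 1):
        visited = list(a[:j + 1])
        others = [s for s in range(n) if s not in visited]
        Q = P[np.ix_(visited, visited)]            # transitions among visited states
        R = P[np.ix_(visited, others)]             # transitions out to unvisited states
        N = np.linalg.inv(np.eye(len(visited)) - Q)   # fundamental matrix
        B = N @ R                                     # absorption probabilities
        ll += np.log(B[visited.index(a[j]), others.index(a[j + 1])])
    return ll
```

On a fully connected chain every permutation of the states receives positive probability under this computation, matching the characterization of censored lists given above.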
Statistical optimization for sparse tensor graphical model. Wei Sun (Yahoo Labs, Sunnyvale, CA), Zhaoran Wang (Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ), Han Liu (Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ), Guang Cheng (Department of Statistics, Purdue University, West Lafayette, IN). Abstract. We consider the estimation of sparse graphical models that characterize the dependency structure of high-dimensional tensor-valued data. To facilitate the estimation of the precision matrix corresponding to each way of the tensor, we assume the data follow a tensor normal distribution whose covariance has a Kronecker product structure. The penalized maximum likelihood estimation of this model involves minimizing a non-convex objective function. In spite of the non-convexity of this estimation problem, we prove that an alternating minimization algorithm, which iteratively estimates each sparse precision matrix while fixing the others, attains an estimator with the optimal statistical rate of convergence as well as consistent graph recovery. Notably, such an estimator achieves estimation consistency with only one tensor sample, which is unobserved in previous work. Our theoretical results are backed by thorough numerical studies. Introduction. Tensor data are prevalent in many fields such as personalized recommendation systems and brain imaging research. Traditional recommendation systems are mainly based on the user-item matrix, whose entry denotes each user's preference for a particular item; to incorporate additional information into the analysis, such as the temporal behavior of users, we need to consider a higher-order tensor. For another example, functional magnetic resonance imaging (fMRI) data can be viewed as a three-way tensor, since it contains brain measurements taken at different locations over time for various experimental conditions. Also, in the example of a microarray study for aging, thousands of gene expression measurements are recorded on tissue types of mice with varying ages, which forms a four-way tensor. In this paper we study the estimation of the conditional independence structure within tensor data; for example, in the microarray study for aging we are interested in the dependency structure across different genes, tissues, ages, and even mice. Assuming the data are drawn from a tensor normal distribution, a straightforward way to estimate this structure is to vectorize the tensor and estimate the underlying Gaussian graphical model associated with the vector. Such an approach ignores the tensor structure and requires estimating a rather high-dimensional precision matrix with an insufficient sample size; for instance, in the aforementioned fMRI application the sample size is one if we aim to estimate the dependency structure across different locations, time, and experimental conditions. To address such a problem, a popular approach is to assume the covariance matrix of the tensor normal distribution is separable, in the sense that it is the Kronecker product of small covariance matrices, each of which corresponds to one way of the tensor. Under this assumption our goal is to estimate the precision matrix corresponding to each way of the tensor (see the cited survey for a detailed account of previous work). Despite the fact that the assumption of the Kronecker product structure of the covariance makes the statistical model much more parsimonious, it poses significant challenges: in particular, the penalized negative log-likelihood function is non-convex with respect to the unknown sparse precision matrices. A small numerical illustration of the separable covariance structure is given below.
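The sketch below, under arbitrary illustrative dimensions and helper names, samples a three-way tensor by multiplying white noise along each mode by the matrix square root of the corresponding covariance, and then checks numerically that the covariance of the column-major vectorization equals the Kronecker product of the three small covariances, while the free-parameter count drops from (m1 m2 m3)^2 to roughly m1^2 + m2^2 + m3^2.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)          # a well-conditioned SPD covariance

def sym_sqrt(S):
    w, V = np.linalg.eigh(S)                # symmetric matrix square root
    return (V * np.sqrt(w)) @ V.T

dims = (2, 3, 4)
Sig = [random_spd(d) for d in dims]
root = [sym_sqrt(S) for S in Sig]

def sample_tensor():
    # T = Z x_1 Sig1^{1/2} x_2 Sig2^{1/2} x_3 Sig3^{1/2}, with Z i.i.d. standard normal
    Z = rng.standard_normal(dims)
    return np.einsum('abc,ia,jb,kc->ijk', Z, root[0], root[1], root[2])

vecs = np.stack([sample_tensor().flatten(order='F') for _ in range(100_000)])
emp = vecs.T @ vecs / len(vecs)                      # empirical Cov(vec(T))
kron = np.kron(np.kron(Sig[2], Sig[1]), Sig[0])      # Sig3 (x) Sig2 (x) Sig1
print(np.abs(emp - kron).max())                      # small; shrinks as samples grow
print(kron.shape, sum(d * d for d in dims))          # (24, 24) versus 2^2 + 3^2 + 4^2 = 29
```

As noted, despite this parsimony the resulting penalized estimation problem is non-convex, and consequently there exists a gap between computational and statistical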
theory more specifically as we will show in existing literature mostly focuses on establishing the existence of local optimum that has desired statistical guarantees rather than offering efficient algorithmic procedures that provably achieve the desired local optima in contrast we analyze an alternating minimization algorithm which iteratively minimizes the objective function with respect to each individual precision matrix while fixing the others the established theoretical guarantees of the proposed algorithm are as follows suppose that we have observations from order tensor normal distribution we denote by mk sk dk the dimension sparsity and max number of entries in each row of the precision matrix corresponding to the way of the qk tensor besides we define mk the precision matrix estimator from our alternating minimization algorithm achieves mk mk sk log mk nm statistical rate of convergence in frobenius norm which is since this is the best rate one can obtain even when the rest true precision pmatrices are known furthermore under an extra irrepresentability condition pwe establish mk log mk nm rate of convergence in max norm which is also optimal and dk mk log mk nm rate of convergence in spectral norm these estimation consistency results and sufficiently large signal strength condition further imply the model selection consistency of recovering all the edges notable implication of these results is that when our alternating minimization algorithm can achieve estimation consistency in frobenius norm even if we only have access to one tensor sample which is often the case in practice this phenomenon is unobserved in previous work finally we conduct extensive experiments to evaluate the numerical performance of the proposed alternating minimization method under the guidance of theory we propose way to significantly accelerate the algorithm without sacrificing the statistical accuracy related work and our contribution special case of our sparse tensor graphical model when is the sparse matrix graphical model which is studied by in particular and only establish the existence of local optima with desired statistical guarantees meanwhile considers an algorithm that is similar to ours however the statistical rates of convergence obtained by are much slower than ours when see remark in for detailed comparison for our statistical rate of convergence in frobenius norm recovers the result of in other words our theory confirms that the desired local optimum studied by not only exists but is also attainable by an efficient algorithm in addition for matrix graphical model establishes the statistical rates of convergence in spectral and frobenius norms for the estimator attained by similar algorithm their results achieve estimation consistency in spectral norm with only one matrix observation however their rate is slower than ours with see remark in for detailed discussion furthermore we allow to increase and establish estimation consistency even in frobenius norm for most importantly all these results focus on matrix graphical model and can not handle the aforementioned motivating applications such as the tensor dataset in the context of sparse tensor graphical model with general shows the existence of local optimum with desired rates but does not prove whether there exists an efficient algorithm that provably attains such local optimum in contrast we prove that our alternating minimization algorithm achieves an estimator with desired statistical rates to achieve it we apply novel theoretical 
framework to separately consider the population and sample optimizers and then establish the onestep convergence for the population optimizer theorem and the optimal rate of convergence for the sample optimizer theorem new concentration result lemma is developed for this purpose which is also of independent interest moreover we establish additional theoretical guarantees including the optimal rate of convergence in max norm the estimation consistency in spectral norm and the graph recovery consistency of the proposed sparse precision matrix estimator in addition to the literature on graphical models our work is also closely related to recent line of research on alternating minimization for optimization problems these existing results mostly focus on problems such as dictionary learning phase retrieval and matrix decomposition hence our statistical model and analysis are completely different from theirs also our paper is related to recent line of work on tensor decomposition see and the references therein compared with them our work focuses on the graphical model structure within data notation for matrix ai we denotep kakf as its max spectral and frobenius norm respectively we define off as its norm and maxi as the maximum absolute row sum denote vec as the vectorization of which stacks the columns of let tr be the trace of for an index set we define as the matrix whose entry indexed by is equal to ai and zero otherwise we denote as the identity matrix with dimension throughout this paper we use to denote generic absolute constants whose values may vary from line to line sparse tensor graphical model preliminary we employ the tensor notations used by throughout this paper higher order tensors are denoted by boldface euler script letters we consider order tensor when it reduces to vector and when it reduces to matrix the ik element of the tensor is denoted to be ik meanwhile we define the vectorization of as vec mk mk rm with mk in addition we define the frobenius norm of tensor as kt kf ik ik for tensors fiber refers to the higher order analogue of the row and column of matrices fiber is obtained by fixing all but one of the indices of the tensor the fiber of is given by ik matricization also known as unfolding is the process to transform tensor into matrix we denote as the matricization of tensor which arranges the fibers to be the columns of the resulting matrix another useful operation in tensors is the product the product of tensor with matrix is denoted as and is of the sizep mk mk its entry is defined as ik ik ik ik aj ik in addition for list of matrices ak mk with ak we define ak ak model tensor follows the tensor normal distribution with zero mean and covariance matrices denoted as tn if its probability density function is exp kt qk where mk and when this tensor normal distribution reduces to the vector normal distribution with zero mean and covariance according to it can be shown that tn if and only if vec vec where vec rm and is the matrix kronecker product we consider the parameter estimation for the tensor normal model assume that we observe independently and identically distributed tensor samples tn from tn we aim to estimate the true covariance matrices and their corresponding true precision matrices where to address the identifiability issue in the parameterization of the tensor normal distribution we assume that kf for this renormalization assumption does not change the graph structure of the original precision matrix standard approach to estimate is to use the maximum 
likelihood method via up to constant the negative function of the tensor normal distribution pk pn is tr log where vec ti vec ti to encourage the sparsity of each precision matrix in the scenario we consider penalized estimator which is obtained by minimizing qn tr log mk where is penalty function indexed by the tuning parameter in this paper we focus on the lasso penalty off this estimation procedure applies similarly to broad family of other penalty functions we name the penalized model from as the sparse tensor graphical model it reduces to the sparse vector graphical model when and the sparse matrix graphical model when our framework generalizes them to fulfill the demand of capturing the graphical structure of higher order data estimation this section introduces the estimation procedure for the sparse tensor graphical model computationally efficient algorithm is provided to estimate the precision matrix for each way of the tensor recall that in qn is jointly with respect to nevertheless qn is problem since qn is convex in when the rest precision matrices are fixed the property plays critical role in our algorithm construction and its theoretical analysis in according to its property we propose to solve this problem by alternatively update one precision matrix with other matrices fixed note that for any minimizing with respect to while fixing the rest precision matrices is equivalent to minimizing tr sk log off mk mk mk pn here sk nm vi vi where vi ti with the tensor product operation and the matricization operation defined in the result in can be shown by noting that vik ti according to the properties of matricization shown by hereafter we drop the superscript of vik if there is no confusion note that minimizing corresponds to estimating gaussian graphical model and can be solved efficiently via the glasso algorithm algorithm solve sparse tensor graphical model via tensor lasso tlasso input tensor samples tn tuning parameters max number of iterations initialize randomly as symmetric and positive definite matrices and set repeat for given solve for via glasso normalize such that kf end for until output the details of our tensor lasso tlasso algorithm are shown in algorithm it starts with random initialization and then alternatively updates each precision matrix until it converges in we will illustrate that the statistical properties of the obtained estimator are insensitive to the choice of the initialization see the discussion following theorem theory of statistical optimization we first prove the estimation errors in frobenius norm max norm and spectral norm and then provide the model selection consistency of our tlasso estimator we defer all the proofs to the appendix estimation error in frobenius norm based on the penalized in we define the population function as tr vec vec log mk by minimizing with respect to we obtain the population minimization function with the parameter mk argmin theorem for any if satisfies tr then the population minimization function in satisfies mk mk tr theorem shows surprising phenomenon that the population minimization function recovers the true precision matrix up to constant in only one iteration if then mk otherwise after normalization such that kmk kf the normalized population minimization function still fully recovers this observation suggests that setting in algorithm is sufficient such suggestion will be further supported by our numeric results in practice when is unknown we can approximate it via its sample version qn defined in which gives rise 
to the statistical error in the estimation procedure analogously to we define the minimization function with parameter as ck argmin qn in order to prove the estimation error it remains to quantify the statistical error induced from finite samples the following two regularity conditions are assumed for this purpose condition bounded eigenvalues for any there is constant such that min max where min and max refer to the minimal and maximal eigenvalue of respectively condition requires the uniform boundedness of the eigenvalues of true covariance matrices it has been commonly assumed in the graphical model literature condition tuning for any kpand some constant the tuning parameter satisfies log mk nmmk log mk nmmk condition specifies the choice of the tuning parameters in practice tuning procedure can be performed to approximate the optimal choice of the tuning parameters before characterizing the statistical error we define sparsity parameter for let sk denote the sparsity parameter sk mk which is the number of nonzero entries in the component of for each we define as the set containing and its neighborhood for some sufficiently large constant radius rmk kf theorem assume conditions and hold for any the statistical error of the minimization function defined in satisfies that for any fixed log ck mk op nm ck are defined in and and qk mk where mk and ck for arbitrary theorem establishes the statistical error associated with with in comparison previous work on the existence of local solution with desired statistical property only establishes theorems similar to theorem for with the extension to an arbitrary involves technical barriers particularly we first establish the rate of convergence of the difference between quadratic form with its expectation lemma via concentration of lipschitz functions of gaussian random variables this result is also of independent interest we then carefully characterize the rate of convergence of sk defined in lemma finally we develop using the results for graphical models developed by according to theorem and theorem we obtain the rate of convergence of the tlasso estimator in terms of frobenius norm which is our main result theorem assume that conditions and hold for any if the initialization from algorithm with satisfies satisfies for any then the estimator where qk bk op mk mk sk log mk nm mk and is defined in theorem suggests that as long as the initialization is within constant distance to the truth our tlasso algorithm attains consistent estimator after only one iteration this initialization condition trivially holds since for any that is positive definite and has unit frobenius norm we have kf by noting that kf for the identifiability of the tensor normal distribution in literature shows that there exists local minimizer of whose convergence rate can achieve however it is unknown if their algorithm can find such minimizer since there could be many other local minimizers notable implication of theorem is that when the estimator from our tlasso algorithm can achieve estimation consistency even if we only have access to one observation which is often the case in practice to see it suppose that and when the dimensions and are of the same order of magnitude and sk mk for all the three error rates corresponding to in converge to zero this result indicates that the estimation of the precision matrix takes advantage of the information from the way of the tensor data consider simple case that and one precision matrix is known in this scenario the rows of the matrix 
data are independent and hence the effective sample size for estimating is in fact the optimality presult for the graphical model implies that the optimal rate for estimating is log which matches our result in therefore the rate in obtained by our tlasso estimator is since it is the best rate one can obtain even when are known as far as we know this phenomenon has not been discovered by any previous work in tensor graphical model remark for our tensor graphical model reduces to matrix graphical model with krob necker product covariance structure in this case the rate of convergence of in reduces to log nm which is much faster than log log tablished by and log max established by in literature shows that there exists local minimizer of the objective function whose estimation errors match ours however it is unknown if their estimator can achieve such convergence rate on the other hand our theorem confirms that our algorithm is able to find such estimator with optimal rate of convergence estimation error in max norm and spectral norm we next show the estimation error in max norm and spectral norm trivially these estimation errors are bounded by that in frobenius norm shown in theorem to develop improved rates of convergence in max and spectral norms we need to impose stronger conditions on true parameters we first introduce some important notations denote dk as the maximum number of in any row of the true precision matrices that is dk max mk mk with the cardinality of the inside set for each covariance matrix we define denote the hessian matrix rmk whose entry corresponds to the second order partial derivative of the objective function with respect to and we define its indexed by the index set sk as sk sk sk sk which is the matrix with rows and columns of indexed by sk and sk respectively moreover we define sk sk in order to establish the rate of convergence in max norm we need to impose an irrepresentability condition on the hessian matrix condition irrepresentability for each there exists some such that maxc sk sk sk condition controls the influence of the terms in sck on the connected edges in sk this condition has been widely applied in lasso penalized models condition bounded complexity for each the parameters are bounded and the parameter dk in satisfies dk mk log mk theorem suppose conditions and hold assume sk mk for and assume are in the same order mk for each if the initialization from algorithm with satisfies satisfies for any then the estimator log bk op nm is subset of the true edge set of that is supp supp in addition the edge set of theorem shows that our tlasso estimator achieves the optimal rate of convergence in max norm here we consider the estimator obtained after two iterations since we require new concentration inequality lemma for the sample covariance matrix which is built upon the estimator in theorem direct consequence from theorem is the estimation error in spectral norm corollary suppose the conditions of theorem hold for any we have mk log mk dk nm remark now we compare our obtained rate of convergence in spectral norm for with that established in the sparse matrix graphical model literature in particular establishes the rate of op mk sk log nmk for therefore when sk which holds for example in the bounded degree graphs our obtained rate is faster however our faster rate comes at the price of assuming the irrepresentability condition using recent advance in nonconvex regularization we can eliminate the irrepresentability condition we leave this to future work model 
selection consistency theorem ensures that the estimated precision matrix correctly excludes all edges and includes all the true edges with mk log mk nm for some constant therefore in order to achieve the model selection consistency sufficient condition is to assume that for each the minimal signal min is not too small theorem under the conditions of theorem if mk log mk nm for some constant then for any sign sign with high probability theorem indicates that our tlasso estimator is able to correctly recover the graphical structure of each way of the tensor data to the best of our knowledge this is the first model selection consistency result in high dimensional tensor graphical model
simulations
we compare the proposed tlasso estimator with two alternatives the first one is the direct graphical lasso glasso approach which applies the glasso to the vectorized tensor data to estimate directly the second alternative method is the iterative penalized maximum likelihood method proposed by whose termination condition is set to be pk for simplicity in our tlasso algorithm we set the initialization of precision matrix as for each and the total iteration the tuning parameter is set as log mk nmmk for fair comparison the same tuning parameter is applied in the method in the direct glasso approach its tuning parameter is chosen by via huge package we consider two simulations with third order tensor in simulation we construct triangle graph while in simulation we construct four nearest neighbor graph for each precision matrix an illustration of the generated graphs is shown in figure in each simulation we consider three scenarios and and and we repeat each example times and compute the averaged computational time the averaged estimation error of the kronecker product of precision matrices the true positive rate tpr and the true negative rate tnr more specifically we denote ai as the entry of and define tpr ai ai ai and tnr ai as shown in figure our tlasso is dramatically faster than both alternative methods in scenario tlasso takes about five seconds for each replicate the takes about seconds while the direct glasso method takes more than one hour and is omitted in the plot tlasso algorithm is not only computationally efficient but also enjoys superior estimation accuracy in all examples the direct glasso method has significantly larger errors than tlasso due to ignoring the tensor graphical structure tlasso outperforms in scenarios and and is comparable to it in scenario
figure left two plots illustrations of the generated graphs middle two plots computational time right two plots estimation errors in each group of two plots the left right is for simulation
table shows the variable selection performance our tlasso identifies almost all edges in these six examples while the glasso and method miss several true edges on the other hand tlasso tends to include more edges than other methods
table comparison of variable selection performance here tpr and tnr denote the true positive rate and true negative rate
acknowledgement
we would like to thank the anonymous reviewers for their helpful comments han liu is grateful for the support of nsf career award nsf nsf nih nih and nih guang cheng research is sponsored by nsf career award nsf simons fellowship in mathematics onr and grant from indiana clinical and translational
sciences institute references rendle and pairwise interaction tensor factorization for personalized tag recommendation in international conference on web search and data mining allen sparse principal components analysis in international conference on artificial intelligence and statistics zahn poosala owen ingram et al agemap gene expression database for aging in mice plos genetics cai liu and zhou estimating sparse precision matrix optimal rates of convergence and adaptive estimation annals of statistics leng and tang sparse matrix graphical models journal of the american statistical association yin and li model selection and estimation in the matrix normal graphical model journal of multivariate analysis tsiligkaridis hero and zhou on convergence of kronecker graphical lasso algorithms ieee transactions on signal processing zhou gemini graph estimation with matrix variate normal instances annals of statistics he yin li and wang graphical model selection and estimation for high dimensional tensor data journal of multivariate analysis jain netrapalli and sanghavi matrix completion using alternating minimization in symposium on theory of computing pages netrapalli jain and sanghavi phase retrieval using alternating minimization in advances in neural information processing systems pages sun qu and wright complete dictionary recovery over the sphere arora ge ma and moitra simple efficient and neural algorithms for sparse coding anandkumar ge hsu kakade and telgarsky tensor decompositions for learning latent variable models journal of machine learning research sun lu liu and cheng provable sparse tensor decomposition zhe xu chu qi and park scalable nonparametric multiway data analysis in international conference on artificial intelligence and statistics zhe xu qi and yu sparse bayesian multiview learning for simultaneous association discovery and diagnosis of alzheimer disease in aaai conference on artificial intelligence kolda and bader tensor decompositions and applications siam review tibshirani regression shrinkage and selection via the lasso journal of the royal statistical society series yuan and lin model selection and estimation in the gaussian graphical model biometrika friedman hastie and tibshirani sparse inverse covariance estimation with the graphical lasso biostatistics rothman bickel levina and zhu sparse permutation invariant covariance estimation electronic journal of statistics sun wang and fang consistent selection of tuning parameters via variable selection stability journal of machine learning research ledoux and talagrand probability in banach spaces isoperimetry and processes springer fan feng and wu network exploration via the adaptive lasso and scad penalties annals of statistics zhao and yu on model selection consistency of lasso journal of machine learning research ravikumar wainwright raskutti and yu covariance estimation by minimizing divergence electronic journal of statistics wang liu and zhang optimal computational and statistical rates of convergence for sparse nonconvex learning problems annals of statistics zhao liu roeder lafferty and wasserman the huge package for undirected graph estimation in journal of machine learning research gupta and nagar matrix variate distributions chapman and press hoff separable covariance arrays via the tucker product with applications to multivariate relational data bayesian analysis dawid some distribution theory notational considerations and bayesian application biometrika negahban and wainwright estimation of near matrices 
with noise and high dimensional scaling annals of statistics
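
The alternating update at the heart of the tlasso algorithm described above (form the surrogate covariance S_k for one mode from the other precision matrices, run a graphical lasso on it, renormalize to unit Frobenius norm, move to the next mode) can be sketched in a few lines. The following is a minimal illustration under our own assumptions, not the authors' code: scikit-learn's graphical_lasso routine stands in for the glasso solver referenced in the paper, the mode-product, unfolding, and random positive-definite initialization helpers are ours, and the tuning parameters lams are left as user inputs rather than set by the theory-guided rule.

import numpy as np
from sklearn.covariance import graphical_lasso

def mode_product(T, A, mode):
    # multiply tensor T by matrix A along the given mode
    return np.moveaxis(np.tensordot(A, T, axes=(1, mode)), 0, mode)

def mode_unfold(T, mode):
    # mode-k matricization: fibers along `mode` become rows of a matrix
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def matrix_sqrt(S):
    # symmetric square root via eigendecomposition
    w, U = np.linalg.eigh(S)
    return U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T

def tlasso(samples, lams, n_iter=1):
    # alternating minimization: for each mode k, build the surrogate
    # covariance S_k from the current estimates of the other precision
    # matrices, run a graphical lasso on S_k, then renormalize
    n = len(samples)
    dims = samples[0].shape
    K = len(dims)
    m = int(np.prod(dims))
    rng = np.random.default_rng(0)
    Omegas = []
    for mk in dims:
        B = rng.normal(size=(mk, mk))
        O = B @ B.T + mk * np.eye(mk)          # random symmetric positive definite start
        Omegas.append(O / np.linalg.norm(O, 'fro'))
    for _ in range(n_iter):
        for k in range(K):
            S_k = np.zeros((dims[k], dims[k]))
            for T in samples:
                W = T
                for j in range(K):
                    if j != k:
                        W = mode_product(W, matrix_sqrt(Omegas[j]), j)
                V = mode_unfold(W, k)
                S_k += V @ V.T
            S_k *= dims[k] / (n * m)
            _, Omegas[k] = graphical_lasso(S_k, alpha=lams[k])
            Omegas[k] /= np.linalg.norm(Omegas[k], 'fro')
    return Omegas

# toy usage on synthetic data of our own making: 10 samples of a 5 x 5 x 5 tensor
samples = [np.random.default_rng(i).normal(size=(5, 5, 5)) for i in range(10)]
Omega_hat = tlasso(samples, lams=[0.01, 0.01, 0.01], n_iter=1)

Running a single sweep over the modes (n_iter set to one) mirrors the paper's observation, discussed above, that one iteration already suffices at the population level.
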
convergence rates of active learning for maximum likelihood estimation kamalika chaudhuri sham kakade praneeth netrapalli sujay sanghavi abstract an active learner is given class of models large set of unlabeled examples and the ability to interactively query labels of subset of these examples the goal of the learner is to learn model in the class that fits the data well previous theoretical work has rigorously characterized label complexity of active learning but most of this work has focused on the pac or the agnostic pac model in this paper we shift our attention to more general setting maximum likelihood estimation provided certain conditions hold on the model class we provide active learning algorithm for this problem the conditions we require are fairly general and cover the widely popular class of generalized linear models which in turn include models for binary and classification regression and conditional random fields we provide an upper bound on the label requirement of our algorithm and lower bound that matches it up to lower order terms our analysis shows that unlike binary classification in the realizable case just single extra round of interaction is sufficient to achieve performance in maximum likelihood estimation on the empirical side the recent work in and on active linear and logistic regression shows the promise of this approach introduction in active learning we are given sample space label space class of models that map to and large set of unlabelled samples the goal of the learner is to learn model in the class with small target error while interactively querying the labels of as few of the unlabelled samples as possible most theoretical work on active learning has focussed on the pac or the agnostic pac model where the goal is to learn binary classifiers that belong to particular hypothesis class and there has been only handful of exceptions in this paper we shift our attention to more general setting maximum likelihood estimation mle where pr is described by model belonging to model class we show that when data is generated by model in this class we can do active learning provided the model class has the following simple property the fisher information matrix for any model at any depends only on and this condition is satisfied in number of widely applicable model classes such as linear regression and generalized linear models glms which in turn includes models for multiclass classification and conditional random fields consequently we can provide active learning algorithms for maximum likelihood estimation in all these model classes the standard solution to active mle estimation in the statistics literature is to select samples for label query by optimizing class of summary statistics of the asymptotic covariance matrix of the dept of cs university of california at san diego email kamalika dept of cs and of statistics university of washington email sham microsoft research new england email praneeth dept of ece the university of texas at austin email sanghavi estimator the literature however does not provide any guidance towards which summary statistic should be used or any analysis of the solution quality when finite number of labels or samples are available there has also been some recent work in the machine learning community on this problem but these works focus on simple special cases such as linear regression or logistic regression and only involves consistency and finite sample analysis in this work we consider the problem in its full generality with the 
goal of minimizing the expected error over the unlabelled data we provide active learning algorithm for this problem in the first stage our algorithm queries the labels of small number of random samples from the data distribution in order to construct crude estimate of the optimal parameter in the second stage we select set of samples for label query by optimizing summary statistic of the covariance matrix of the estimator at however unlike the experimental design work our choice of statistic is directly motivated by our goal of minimizing the expected error which guides us towards the right objective we provide finite sample analysis of our algorithm when some regularity conditions hold and when the negative log likelihood function is convex our analysis is still fairly general and applies to generalized linear models for example we match our upper bound with corresponding lower bound which shows that the convergence rate of our algorithm is optimal except for lower order terms the finite sample convergence rate of any algorithm that uses perhaps multiple rounds of sample selection and maximum likelihood estimation is either the same or higher than that of our algorithm this implies that unlike what is observed in learning binary classifiers single round of interaction is sufficient to achieve log likelihood error for ml estimation related work previous theoretical work on active learning has focussed on learning classifier belonging to hypothesis class in the pac model both the realizable and cases have been considered in the realizable case line of work has looked at generalization of binary search while their algorithms enjoy low label complexity this style of algorithms is inconsistent in the presence of noise the two main styles of algorithms for the case are disagreementbased active learning and margin or active learning while active learning in the realizable case has been shown to achieve an exponential improvement in label complexity over passive learning in the agnostic case the gains are more modest sometimes constant factor moreover lower bounds show that the label requirement of any agnostic active learning algorithm is always at least where is the error of the best hypothesis in the class and is the target error in contrast our setting is much more general than binary classification and includes regression classification and certain kinds of conditional random fields that are not covered by previous work provides an active learning algorithm for linear regression problem under model mismatch their algorithm attempts to learn the location of the mismatch by fitting increasingly refined partitions of the domain and then uses this information to reweight the examples if the partition is highly refined then the computational complexity of the resulting algorithm may be exponential in the dimension of the data domain in contrast our algorithm applies to more general setting and while we do not address model mismatch our algorithm has polynomial time complexity provides an active learning algorithm for generalized linear models in an online selective sampling setting however unlike ours their input is stream of unlabelled examples and at each step they need to decide whether the label of the current example should be queried our work is also related to the classical statistical work on optimal experiment design which mostly considers maximum likelihood estimation for estimation they suggest selecting samples to maximize the fisher information which corresponds to minimizing the 
variance of the regression coefficient when is the fisher information is matrix in this case there are multiple notions of optimal design which correspond to maximizing different parameters of the fisher information matrix for example maximizes the determinant and maximizes the trace of the fisher information in contrast with this work we directly optimize the expected over the unlabelled data which guides us to the appropriate objective function moreover we provide consistency and finite sample guarantees finally on the empirical side and derive algorithms similar to ours for logistic and linear regression based on projected gradient descent notably these works provide promising empirical evidence for this approach to active learning however no consistency guarantees or convergence rates are provided the rates presented in these works are not stated in terms of the sample size in contrast our algorithm applies more generally and we provide consistency guarantees and convergence rates moreover unlike our logistic regression algorithm uses single extra round of interaction and our results illustrate that single round is sufficient to achieve convergence rate that is optimal except for lower order terms the model we begin with some notation we are given pool xn of unlabelled examples drawn from some instance space and the ability to interactively query labels belonging to label space of of these examples in addition we are given family of models parameterized by rd we assume that there exists an unknown parameter such that querying the label of an xi generates yi drawn from the distribution we also abuse notation and use to denote the uniform distribution over the examples in we consider the or transductive setting where our goal is to minimize the error on the fixed set of points for any and we define the negative function as log where our goal is to find to minimize lu lu by interactively querying labels for subset of of size where we allow label queries with replacement the label of an example may be queried multiple times an additional quantity of interest to us is the fisher information matrix or the hessian of the negative function which determines the convergence rate for our active learning procedure to work correctly we require the following condition condition for any the fisher information and and does not depend on is function of only condition is satisfied by number of models of practical interest examples include linear regression and generalized linear models section provides brief derivation of condition for generalized linear models for any and we use to denote the hessian observe that by assumption this is just function of and let be any distribution over the unlabelled samples in for any we use algorithm the main idea behind our algorithm is to sample xi from distribution over query the labels of these samples and perform ml estimation over them to ensure good performance should be chosen carefully and our choice of is motivated by lemma suppose the labels yi are generated according to yi lemma states that the expected loglikelihood error of the ml estimate with respect to samples from in this case is essentially tr iu this suggests selecting as the distribution that minimizes tr iu unfortunately we can not do this as is unknown we resolve this problem through two stage algorithm in the first stage we use small number of samples to construct coarse estimate of steps in the second stage we calculate distribution which minimizes tr iu and draw samples from slight 
modification of this distribution for finer estimation of steps algorithm activesetselect input samples xi for draw samples from and query their labels to get use to solve the mle problem yi xi yi solve the following sdp refer lemma argmina tr iu draw examples using probability ai xi ai ai where the distribution and query their labels to get use to solve the mle problem xi yi yi output the distribution is modified slightly to in step to ensure that is well conditioned with respect to iu the algorithm is formally presented in algorithm finally note that steps are necessary because iu and are functions of in certain special cases such as linear regression iu and are independent of in those cases steps are unnecessary and we may skip directly to step performance guarantees the following regularity conditions are essentially quantified version of the standard local asymptotic normality lan conditions for studying maximum likelihood estimation see assumption regularity conditions for lan smoothness the first three derivatives of exist in all interior points of rd compactness is compact and is an interior point of pn strong convexity iu xi is positive definite with smallest singular value min lipschitz continuity there exists neighborhood of and constant such that for all is in this neighborhood iu iu kiu for every concentration at for any and we have with probability one krl kiu and iu iu boundedness max in addition to the above we need one extra condition which is essentially pointwise self concordance this condition is satisfied by vast class of models including the generalized linear models assumption self concordance definition optimal sampling distribution we define the optimal sampling pdistribution over the points in as the distribution for which and tr iu is as small as possible definition is motivated by lemma which indicates that under some mild regularity conditions ml estimate calculated on samples drawn from will provide the best convergence rates including the right constant factor for the expected error we now present the main result of our paper the proof of the following theorem and all the supporting lemmas will be presented in appendix theorem suppose the regularity conditions in assumptions and hold and the number of samples used in step be max trdiameter tr iu then with iu probability the expected log likelihood error of the estimate of algorithm is bounded as lu lu tr iu where is the optimal sampling distribution in definition and log moreover for any sampling distribution satisfying ciu and label constraint of we have the following lower bound on the expected log likelihood error for ml estimate lu lu tr iu def where remark restricting to maximum likelihood estimation our restriction to maximum likelihood estimators is minor as this is close to minimax optimal see minor improvements with certain kinds of estimators such as the estimator are possible discussions several remarks about theorem are in order the high probability bound in theorem is with respect to the samples drawn in provided these samples are representative which happens with probability the output of algorithm will satisfy additionally theorem assumes that the labels are sampled with replacement in other words we can query the label of point xi multiple times removing this assumption is an avenue for future work second the highest order term in both and is tr iu the terms involving and are lower order as both and are moreover if then the term involving in is of lowerp order as well observe that also 
measures the tradeoff between and and as long as is also of lower order than thus provided is and the convergence rate of our algorithm is optimal except for lower order terms finally the lower bound applies to distributions for which ciu where occurs in the lower order terms of the bound this constraint is not very restrictive and does not affect the asymptotic rate observe that iu is full rank if is not full rank then the expected log likelihood error of the ml estimate with respect to will not be consistent and thus such will never achieve the optimal rate if is full rank then there always exists for which ciu thus essentially states that for distributions where is close to being the asymptotic convergence rate of tr iu is achieved at larger values of proof outline our main result relies on the following three steps bounding the error first we characterize the log likelihood error wrt of the empirical risk minimizer erm estimate obtained using sampling distribution concretely let be distribution on let be the erm estimate using the distribution yi where xi and yi of our analysis is lemma which shows precise estimate of the log likelihood error lu lu lemma suppose satisfies the regularity conditions in assumptions and let be distribution on and be the erm estimate using labeled examples suppose further that for some constant then for any and large enough such that def log we have def where tr lu iu lu approximating lemma sampling from the optimal sampling distribution that minimizes tr iu however this quantity depends on which we do not know to resolve this issue our algorithm first queries the labels of small fraction of points and solves ml estimation problem to obtain coarse estimate of how close should be to our analysis indicates that it is sufficient for to be close enough that for any is constant factor spectral approximation to the number of samples needed to achieve this is analyzed in lemma lemma suppose satisfies the regularity conditions in assumptions and if the number of samples used in the first step diameter aa max tr iu min tr iu then we have with probability greater than computing third we are left with the task of obtaining distribution we now pose this optimization problem as an sdp that minimizes the log likelihood error ai from lemmas and it is clear that we should aim to obtain sampling distribution minimizing tr iu let iu vj vj be the singular value pd sition svd of iu since tr iu vj vj this is equivalent to solving min cj ai xi vj vj cj pai ai among the above constraints the vj vj cj seems problematic however schur cj vj complement formula tells us that and vj vj cj in our case vj we know that since it is sum of positive semi definite matrices the above argument proves the following lemma lemma the following two optimization programs are equivalent pd mina cj mina tr iu ai xi ai xi cj vj vj pai pai where iu vj vj denotes the svd of iu illustrative examples we next present some examples that illustrate theorem we begin by showing that condition is satisfied by the popular class of generalized linear models derivations for generalized linear models generalized linear model is specified by three parameters linear model sufficient statistic and member of the exponential family let be linear model then in generalized linear model glm is drawn from an exponential family distribution with pa rameter specifically where is the sufficient statistic and is the function from properties of the exponential family the is written as log if we take and take the derivative with 
respect to we have log xt taking derivatives again gives us log xx which is independent of specific examples we next present three illustrative examples of problems that our algorithm may be applied to linear regression our first example is linear regression in this case rd and are generated according to the distribution where is noise variable drawn from in this case the negative loglikelihood function is and the corresponding fisher information matrix is given as xx observe that in this very special case the fisher information matrix does not depend on as result we can eliminate the first two steps of the algorithm and proceed directly to step if xi xi is the covariance matrix of then theorem tells us that we need to query labels from distribution with covariance matrix such that tr is minimized we illustrate the advantages of active learning through simple example suppose is the unlabelled distribution xi ej for where ej is the standard unit vector in the th direction the covariance matrix of is diagonal matrix with and for for passive learning over we query labels tr of examples drawn from which gives us convergence rate of on the other hand active learning chooses to sample examples from the distribution such that xi ej for where indicates that the probabilities hold upto this has diagonal covariance matrix tr and convergence rate of such that and for which does not grow with logistic regression our second example is logistic regression for binary classification in this case rd and the negative function is log and the corresponding fisher information is given as xx for illustration suppose and are bounded by constant and the covariance matrix is sandwiched between two multiples of identity in the psd ordering dc for some constants and then the regularity assumptions and are satisfied for constant values of and in this case theorem states that choosing to be tr iu gives us the optimal convergence rate of tr iu multinomial logistic regression our third example is multinomial logistic regression for multiclass classification in this case rd and the parameter matrix pk the negative function is written as log pk if and log otherwise the corresponding fisher information matrix is matrix which is obtained as follows let be the matrix with fii fij then xx similar to the example in the logistic regression case suppose and are bounded by constant and the covariance matrix satisfies dc for some constants and since diag pi where pi the boundedness of and for some constants depending on this means that implies that ci ci and ce cc di and so the regularity assumptions and are satisfied with and being constants theorem again tells us that using samples in the first step gives us the optimal convergence rate of maximum likelihood error conclusion in this paper we provide an active learning algorithm for maximum likelihood estimation which provably achieves the optimal convergence rate upto lower order terms and uses only two rounds of interaction our algorithm applies in very general setting which includes generalized linear models there are several avenues of future work our algorithm involves solving an sdp which is computationally expensive an open question is whether there is more efficient perhaps greedy algorithm that achieves the same rate second open question is whether it is possible to remove the with replacement sampling assumption final question is what happens if iu has high condition number in this case our algorithm will require large number of samples in the first stage an open 
question is whether we can use more sophisticated procedure in the first stage to reduce the label requirement acknowledgements kc thanks nsf under iis for research support references agarwal selective sampling algorithms for multiclass prediction in proceedings of the international conference on machine learning icml atlanta ga usa june pages balcan beygelzimer and langford agnostic active learning comput syst balcan and long active and passive learning of linear separators under logconcave distributions in colt beygelzimer hsu langford and zhang agnostic active learning without constraints in nips cam and yang asymptotics in statistics some basic concepts springer series in statistics springer new york cornell experiments with mixtures designs models and the analysis of mixture data third wiley dasgupta coarse sample complexity bounds for active learning in nips dasgupta two faces of active learning theor comput dasgupta and hsu hierarchical sampling for active learning in icml dasgupta hsu and monteleoni general agnostic active learning algorithm in nips frostig ge kakade and sidford competing with the empirical risk minimizer in single pass arxiv preprint gu zhang ding and han selective labeling via error bound minimization in in proc of advances in neural information processing systems nips lake tahoe nevada united states gu zhang and han active learning via error bound minimization in conference on uncertainty in artificial intelligence uai hanneke bound on the label complexity of agnostic active learning in icml active learning in the case in alt le cam asymptotic methods in statistical decision theory springer lehmann and casella theory of point estimation volume springer science business media nowak the geometry of generalized binary search ieee transactions on information theory sabato and munos active regression through stratification in nips urner wulff and plal active learning in colt van der vaart asymptotic statistics cambridge series in statistical and probabilistic mathematics cambridge university press zhang and chaudhuri beyond agnostic active learning in proc of neural information processing systems 
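
To make the sampling step of the two-stage procedure above concrete, the sketch below computes label-query weights for the linear regression special case discussed earlier, where the Fisher information x x^T does not depend on the unknown parameter and the coarse first stage can be skipped. This is a minimal sketch under our own simplifications rather than the paper's procedure: the paper solves this step as an SDP via the Schur-complement reformulation, whereas here the same objective tr(Sigma_U Sigma_a^{-1}) is minimized over a softmax parameterization of the weights with a generic optimizer, a small ridge stands in for the paper's conditioning adjustment of the sampling distribution, and all variable names and the toy pool are ours.

import numpy as np
from scipy.optimize import minimize

def optimal_sampling_weights(X):
    # choose label-query probabilities a over the unlabelled pool X (n x d)
    # by minimizing tr(Sigma_U @ inv(Sigma_a)), the leading term of the
    # expected log-likelihood error for linear regression
    n, d = X.shape
    Sigma_U = X.T @ X / n                      # covariance of the uniform pool distribution

    def objective(z):
        a = np.exp(z - z.max())
        a /= a.sum()                           # softmax keeps the weights on the simplex
        Sigma_a = (X * a[:, None]).T @ X       # sum_i a_i x_i x_i^T
        Sigma_a += 1e-8 * np.eye(d)            # tiny ridge, a stand-in for the conditioning tweak
        return np.trace(np.linalg.solve(Sigma_a, Sigma_U))

    res = minimize(objective, np.zeros(n), method="L-BFGS-B")
    a = np.exp(res.x - res.x.max())
    return a / a.sum()

# toy usage: an anisotropic pool of our own making; query labels with
# replacement according to the weights, then fit the MLE (here, least squares)
rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])
weights = optimal_sampling_weights(X_pool)
query_idx = rng.choice(len(X_pool), size=50, replace=True, p=weights)

Sampling the queried examples with replacement from these weights and running ordinary least squares on the resulting pairs is then the maximum likelihood step of the second stage.
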
disentangling with recurrent transformations for view synthesis jimei scott honglak university of california merced mhyang university of michigan ann arbor reedscot honglak abstract an important problem for both graphics and vision is to synthesize novel views of object from single image this is particularly challenging due to the partial observability inherent in projecting object onto the image space and the of inferring object shape and pose however we can train neural network to address the problem if we restrict our attention to specific object categories in our case faces and chairs for which we can gather ample training data in this paper we propose novel recurrent convolutional network that is trained on the task of rendering rotated objects starting from single image the recurrent structure allows our model to capture dependencies along sequence of transformations we demonstrate the quality of its predictions for human faces on the dataset and for dataset of chair models and also show its ability to disentangle latent factors of variation identity and pose without using full supervision introduction numerous graphics algorithms have been established to synthesize photorealistic images from models and environmental variables lighting and viewpoints commonly known as rendering at the same time recent advances in vision algorithms enable computers to gain some form of understanding of objects contained in images such as classification detection segmentation and caption generation to name few these approaches typically aim to deduce abstract representations from raw image pixels however it has been problem for both graphics and vision to automatically synthesize novel images by applying intrinsic transformations rotation and deformation to the subject of an input image from an artificial intelligence perspective this can be viewed as answering questions about object appearance when the view angle or illumination is changed or some action is taken these synthesized images may then be perceived by humans in photo editing or evaluated by other machine vision systems such as the game playing agent with reinforcement learning in this paper we consider the problem of predicting transformed appearances of an object when it is rotated in from single image in general this is an problem due to the loss of information inherent in projecting object into the image space classic approaches either recover object model from multiple related images stereo and or register single image of known object category to its prior model faces the resulting mesh can be used to the scene from novel viewpoints however having meshes as intermediate representations these methods are limited to particular object categories vulnerable to image alignment mistakes and easy to generate artifacts during unseen texture synthesis to overcome these limitations we propose approach without explicit model recovery having observed rotations of similar objects faces chairs household objects the trained model can both better infer the true pose shape and texture of the object and make plausible assumptions about potentially ambiguous aspects of appearance in novel viewpoints thus the learning algorithm relies on mappings between euclidean image space and underlying nonlinear manifold in particular view synthesis can be cast as pose manifold traversal where desired rotation can be decomposed into sequence of small steps major challenge arises due to the dependency among multiple rotation steps the key identifying information shape 
texture from the original input must be remembered along the entire trajectory furthermore the local rotation at each step must generate the correct result on the data manifold or subsequent steps will also fail closely related to the image generation task considered in this paper is the problem of invariant recognition which involves comparing object images from different viewpoints or poses with dramatic changes of appearance shepard and metzler in their mental rotation experiments found that the time taken for humans to match objects from two different views increased proportionally with the angular rotational difference between them it was as if the humans were rotating their mental images at steady rate inspired by this mental rotation phenomenon we propose recurrent convolutional network with action units to model the process of pose manifold traversal the network consists of four components deep convolutional encoder shared identity units recurrent pose units with rotation action inputs and deep convolutional decoder rather than training the network to model specific rotation sequence we provide control signals at each time step instructing the model how to move locally along the pose manifold the rotation sequences can be of varying length to improve the ease of training we employed curriculum learning similar to that used in other sequence prediction problems intuitively the model should learn how to make rotation before learning how to make series of such rotations the main contributions of this work are summarized as follows first novel recurrent convolutional network is developed for learning to apply rotations to human faces and chair models second the learned model can generate realistic rotation trajectories with control signal supplied at each step by the user third despite only being trained to synthesize images our model learns discriminative features without using class labels this disentangling is especially notable with prediction related work the transforming autoencoder introduces the notion of capsules in deep networks which tracks both the presence and position of visual features in the input image these models can apply affine transformations and rotations to images we address similar task of rendering object appearance undergoing rotations but we use convolutional network architecture in lieu of capsules and incorporate action inputs and recurrent structure to handle repeated rotation steps the predictive gating pyramid is developed for prediction and can learn image transformations including shifts and rotation over multiple time steps our task is related to this prediction but our formulation includes control signal uses disentangled latent features and uses convolutional encoder and decoder networks to model detailed images ding and taylor proposed gating network to directly model mental rotation by optimizing transforming distance instead of extracting invariant recognition features in one shot their model learns to perform recognition by exploring space of relevant transformations similarly our model can explore the space of rotation about an object image by setting the control signal at each time step of our recurrent network the problem of training neural networks that generate images is studied in dosovitskiy et al proposed convolutional network mapping shape pose and transformation labels to images for generating chairs it is able to control these factors of variation and generate renderings we also generate chair renderings in this paper but our model 
adds several additional features deep encoder network so that we can generalize to novel images rather than only decode distributed representations for appearance and pose and recurrent structure for prediction contemporary to our work the inverse graphics network ign also adds an encoding function to learn graphics codes of images along with decoder similar to that in the chair generating network as in our model ign uses deep convolutional encoder to extract image representations apply modifications to these and then our model differs in that we train recurrent network to perform trajectories of multiple transformations we add control signal input at each step and we use deterministic training rather than the variational vae framework although our approach could be extended to vae version related line of work to ours is disentangling the latent factors of variation that generate natural images bilinear models for separating style and content are developed in and are shown to figure deep convolutional network for learning rotation be capable of separating handwriting style and character identity and also separating face identity and pose the disentangling boltzmann machine disbm applies this idea to augment the restricted boltzmann machine by partitioning its hidden state into distinct factors of variation and modeling their interaction the perceptron employs stochastic feedforward network to disentangle the identity and pose factors of face images in order to achieve recognition the encoder network for ign is also trained to learn disentangled representation of images by extracting graphics code for each factor in the potentially unknown latent factors of variation are both discovered and disentangled using novel hidden unit regularizer our work is also loosely related to the deepstereo algorithm that synthesizes novel views of scenes from multiple images using deep convolutional networks recurrent convolutional network in this section we describe our model formulation given an image of object our goal is to synthesize its rotated views inspired by recent success of convolutional networks cnns in mapping images to abstract representations and synthesizing images from graphics codes we base our model on deep convolutional networks one example network structure is shown in figure the encoder network used layers with stride and padding so that the dimension is halved at each convolution layer followed by two fullyconnected layers in the bottleneck layer we define group of units to represent the pose pose units where the desired transformations can be applied the other group of units represent what does not change during transformations named as identity units the decoder network is symmetric to the encoder to increase dimensionality we use fixed upsampling as in we found that fixed convolution and upsampling worked better than and unpooling with switches because when applying transformations the encoder pooling switches would not in general match the switches produced by the target image the desired transformations are reflected by the action units we used encoding in which encoded clockwise rotation encoded noop and encoded rotation the triangle indicates tensor product taking as input the pose units and action units and producing the transformed pose units equivalently the action unit selects the matrix that transforms the input pose units to the output pose units the action units introduce small linear increment to the pose units which essentially model the local transformations in the 
nonlinear pose manifold however in order to achieve longer rotation trajectories if we simply accumulate the linear increments from the action units for clockwise rotation the pose units will fall off the manifold resulting in bad predictions to overcome this problem we generalize the model to recurrent neural network which have been shown to capture dependencies for wide variety of sequence modeling problems in essence we use recurrent pose units to model the pose manifold traversals the identity units are shared across all time steps since we assume that all training sequences preserve the identity while only changing the pose figure shows the unrolled version of our rnn model we only perform encoding at the first time step and all transformations are carried out in the latent space the model predictions at time step are not fed into the next time step input the training objective is based on prediction over all time steps for training sequences lrnn fpose fid where is the sequence of actions fid produces the identity features invariant to all the time steps fpose produces the transformed pose features at time step is the image decoder producing an image given the output of fid and fpose is the image is the training image target at step figure unrolled recurrent convolutional network for learning to rotate objects the convolutional encoder and decoder have been abstracted out represented here as vertical rectangles curriculum training we trained the network parameters using backpropagation through time and the adam optimization method to effectively train our recurrent network we found it beneficial to use curriculum learning in which we gradually increase the difficulty of training by increasing the trajectory length this appears to be useful for sequence prediction with recurrent networks in other domains as well in section we show that increasing the training sequence length improves both the model image prediction performance as well as the recognition performance of identity features also longer training sequences force the identity units to better disentangle themselves from the pose if the same identity units need to be used to predict both and image during training these units can not pick up information in this way our model can learn disentangled features identity units can do invariant identity recognition but are not informative of pose and vice versa without explicitly regularizing to achieve this effect we did not find it necessary to use gradient clipping experiments we carry out experiments to achieve the following objectives first we examine the ability of our model to synthesize images of both face and complex objects chairs in wide range of rotational angles second we evaluate the discriminative performance of disentangled identity units through object recognition third we demonstrate the ability to generate and rotate novel object classes by interpolating identity units of query objects datasets the dataset consists of face images from people the images are captured from viewpoints under illumination conditions in different sessions to evaluate our model for rotating faces we select subset of that covers viewpoints evenly from to under neutral illumination each face image is aligned through manually annotated landmarks on eyes nose and mouth corners and then cropped to pixels we use the images of first people for training and the remaining people for testing chairs this dataset contains chair cad models made publicly available by aubry et al each chair model is 
Experiments. We carry out experiments to achieve the following objectives. First, we examine the ability of our model to synthesize images of both faces and complex objects (chairs) across a wide range of rotational angles. Second, we evaluate the discriminative performance of the disentangled identity units through cross-view object recognition. Third, we demonstrate the ability to generate and rotate novel object classes by interpolating the identity units of query objects.
Datasets. Faces: the dataset consists of face images of many people, captured from a set of viewpoints under several illumination conditions in different sessions. To evaluate our model for rotating faces we select a subset that covers the viewpoints evenly under neutral illumination. Each face image is aligned through manually annotated landmarks on the eyes, nose, and mouth corners, and then cropped to a fixed size in pixels. We use the images of the first group of people for training and the remaining people for testing. Chairs: this dataset contains chair CAD models made publicly available by Aubry et al. Each chair model is rendered from azimuth angles taken in regular steps and from a small set of elevation angles, at a fixed distance to the virtual camera. We use a subset of the chair models in our experiments, selected by Dosovitskiy et al. in order to remove near-duplicate models (e.g., models differing only in color) or low-quality models. We crop the rendered images to have a small border and resize them to a common size in pixels; we also prepare their binary masks by subtracting the white background. We use the images of the first group of models as the training set and the remaining models as the test set.
Network architectures and training details. Faces: the encoder network used two convolution layers with stride and padding, followed by one fully-connected layer; the bottleneck is split into a group of identity units and a group of pose units. The decoder network is symmetric to the encoder. The curriculum training procedure starts with the one-step rotation model.
Figure: view synthesis on faces; for each panel the first row shows the ground-truth views, and the second and third rows show renderings of clockwise and counter-clockwise rotation from an input image (red box).
Figure: comparing face pose normalization results with a morphable model.
We prepare the training samples by pairing face images of the same person captured in the same session with adjacent camera viewpoints: each image is mapped to its clockwise neighbor with the clockwise action, to itself with the no-op action, and to its counter-clockwise neighbor with the counter-clockwise action. For face images at the two ending viewpoints, only one rotation direction is feasible. We train the network using the Adam optimizer with a fixed learning rate. Since there is a fixed number of viewpoints per person per session, we schedule the curriculum training in stages of increasing trajectory length. To sample training sequences with a fixed length we allow both clockwise and counter-clockwise rotations; for example, one input image can be mapped to a sequence that first rotates one way and then the other, with the corresponding action inputs. In each stage we initialize the network parameters with those of the previous stage and fine-tune the network with a fixed learning rate for additional epochs.
Chairs: the encoder network for chairs used three convolution layers with stride and padding, followed by two fully-connected layers. The decoder network is symmetric, except that after the fully-connected layers it branches into image and mask prediction layers; the mask prediction indicates whether a pixel belongs to the foreground or the background. We adopted this idea from the generative CNN of Dosovitskiy et al. and found it beneficial to the trade-off between training efficiency and image synthesis quality; a weighting parameter is applied to the mask prediction loss. We train the network parameters with a fixed learning rate for a number of epochs and schedule the curriculum training with progressively longer trajectories; the curriculum stops at the longest stage because we reached the limit of GPU memory. Since the images of each chair model are rendered from viewpoints evenly sampled around the full circle, we can easily prepare training sequences of clockwise or counter-clockwise rotations around the circle. As for faces, the network parameters of the current stage are initialized with those of the previous stage and fine-tuned with the same learning rate for additional epochs.
View synthesis of novel objects. We first examine the quality of our RNN models for novel object instances that were not seen during training. On the face dataset, given one input image from the test set, the encoder produces identity units and pose units, and the decoder then renders images progressively with fixed identity units and recurrent pose units. Examples of the longest rotation trajectories are shown in the figure, rotating clockwise from one ending viewpoint. We carry out experiments using Caffe on NVIDIA
and titan gpus to and from to with renderings are generated with smooth transformations between adjacent views the characteristics of faces such as gender expression eyes nose and glasses are also preserved during rotation we also compare our rnn model with morphable model for face pose normalization in figure it can be observed that our rnn model produces stable renderings while morphable model is sensitive to facial landmark localization one of the advantages of morphable model is that it preserves facial textures well on the chair dataset we use to synthesize rotated views of novel chairs in the test set given chair image of certain view we define two action sequences one for progressive clockwise rotation and another for rotation it is more challenging task compared to rotating faces due to the complex shapes of chairs and the large rotation angles more than after rotations since no previous methods tackle the exact same chair problem we use knn method for baseline comparisons the knn baseline is implemented as follows we first extract the cnn features from net for all the chair images for each test chair image we find its neighbors in the training set by comparing their features the retrieved images are expected to be similar to the query in terms of both style and pose given desired rotation angle we synthesize rotated views of the test image by averaging the corresponding rotated views of the retrieved images in the training set at the pixel level we tune the value in namely and to achieve the best performance two examples are shown in figure in our rnn model the shapes are well preserved with clear boundaries for all the rotated views from different input and the appearance changes smoothly between adjacent views with consistent style rnn knn rnn knn rnn knn rnn knn input figure view synthesis of rotations on chairs in each panel we compare synthesis results of the model top and of the baseline bottom the first two panels belong to the same chair of different starting views while the last two panels are from another chair of two starting views input images are marked with red boxes note that conceptually the learned network parameters during different stages of curriculum training can be used to process an arbitrary number of rotation steps the model the first row in figure works well in the first rotation step but it produces degenerate results from the second step the the second row trained with rotations generates reasonable results in the third step progressively the and seem to generalize well on chairs with longer predictions for and for we measure the quantitative performance of knn and our rnn by the mean squared error mse in in figure as result the best knn with retrievals obtains mse which is comparable to our model but our model significantly outperforms mse with relative improvement object recognition in this experiment we examine and compare the discriminative performance of disentangled representations through object recognition reconstruction error gt model of rotation steps figure comparing chair synthesis results from rnn at different curriculum stages figure comparing reconstruction mean squared errors mse on chairs with rnns and knns we create splits from the test set in each split the face images of the same view are collected as gallery and the rest of other views as probes we extract features from the identity units of rnns for all the test images so that the probes are matched to the gallery by their cosine distance it is considered as success if the matched 
gallery image has the same identity as the probe. We also categorize the probes in each split by measuring their angle offsets from the gallery; the recognition difficulty increases with the angle offset. To demonstrate the discriminative performance of our learned representations, we also implement a convolutional network classifier. The CNN architecture is set up by connecting our encoder and identity units with a softmax output layer, and its parameters are learned on the training set with ground-truth class labels; the features extracted from the layer before the softmax layer are then used to perform object recognition as above. The left panel of the recognition figure compares the average success rates of the RNNs and the CNN, with their standard deviations over the splits, for each angle offset. The success rates of the one-step model drop sharply from the smallest to the largest angle offset; the success rates improve in general with curriculum training of the RNNs, and the best results are achieved with the longest curriculum, for which the performance gap between the smallest and largest offsets narrows considerably. This phenomenon demonstrates that our RNN model gradually learns pose-invariant representations for face recognition without using any class labels, and achieves competitive results against the CNN.
Chairs: the experimental setup is similar to that for faces. There is in total a fixed number of azimuth views per chair instance; for each view we create a split, giving one split per view. We extract features from the identity units of the longest-curriculum RNN, and the probes for each split are sorted by their angle offsets from the gallery images. Note that this experiment is particularly challenging, because chair matching is an instance-level recognition task and chair appearances change significantly with rotation. We also compare our model against a CNN, but instead of training a CNN from scratch we use a pre-trained network to extract features for chair matching. The success rates are shown in the right panel of the recognition figure. The performance drops quickly as the angle offset grows, but the curriculum-trained RNN significantly improves the overall success rates, especially for large angle offsets. We notice that the standard deviations are large around intermediate angle offsets; this is because some views contain more information about the chair shapes than others, so we see performance variations. Interestingly, the performance of the pre-trained network surpasses our RNN model at the largest angle offsets. We hypothesize that this results from the symmetric structure of most of the chairs: the pre-trained network was trained with mirroring data augmentation to achieve a certain symmetric invariance, while our RNN model does not exploit this structure.
To further demonstrate the disentangling property of our RNN model, we use the pose units extracted from the input images to repeat the above recognition experiments. The mean success rates are shown in the accompanying table. It turns out that the better the identity units perform, the worse the pose units perform: when the identity units achieve their best recognition rate, the pose units obtain a mean success rate close to random guessing over the classes.
Figure: recognition success rates versus angle offset for faces (left) and chairs (right).
Table: comparing mean recognition success rates obtained with identity units and with pose units, alongside the CNN baselines.
Class interpolation and view synthesis. In this experiment we demonstrate the ability of our RNN model to generate novel chairs by interpolating between two existing ones. Given two chair images of the same view from different instances, the encoder network is used to compute their identity
units zid zid and pose units zpose zpose respectively the interpolation is computed by zid βzid zid and zpose βzpose zpose where the interpolated zid and zpose are then fed into the recurrent decoder network to render its rotated views example interpolations between four chair instances are shown in figure the interpolated chairs present smooth stylistic transformations between any pair of input classes each row in figure and their unique stylistic characteristics are also well preserved among its rotated views each column in figure input figure chair style interpolation and view synthesis given four chair images of the same view first row from test set each row presents renderings of style manifold traversal with fixed view while each column presents the renderings of pose manifold traversal with fixed interpolated identity conclusion in this paper we develop recurrent convolutional network and demonstrate its effectiveness for synthesizing views of unseen object instances on the dataset and database of chair cad models the model predicts accurate renderings across trajectories of repeated rotations the proposed curriculum training by gradually increasing trajectory length of training sequences yields both better image appearance and more discriminative features for poseinvariant recognition we also show that trained model could interpolate across the identity manifold of chairs at fixed pose and traverse the pose manifold while fixing the identity this generative disentangling of chair identity and pose emerged from our recurrent rotation prediction objective even though we do not explicitly regularize the hidden units to be disentangled our future work includes introducing more actions into the proposed model other than rotation handling objects embedded in complex scenes and handling mappings for which transformation yields distribution over future states in the trajectory acknowledgments this work was supported in part by onr nsf career and nsf we thank nvidia for donating tesla gpu references aubry and russell understanding deep features with imagery in iccv aubry maturana efros russell and sivic seeing chairs exemplar alignment using large dataset of cad models in cvpr ba and kingma adam method for stochastic optimization in iclr bengio louradour collobert and weston curriculum learning in icml blanz and vetter morphable model for the synthesis of faces in siggraph cheung livezey bansal and olshausen discovering hidden factors of variation in deep networks in iclr ding and taylor mental rotation by optimizing transforming distance in nips deep learning and representation learning workshop dosovitskiy springenberg and brox learning to generate chairs with convolutional neural networks in cvpr flynn neulander philbin and snavely deepstereo learning to predict new views from the world imagery arxiv preprint girshick fast in iccv gross matthews cohn kanade and baker image and vision computing may hinton krizhevsky and wang transforming in icann jia shelhamer donahue karayev long girshick guadarrama and darrell caffe convolutional architecture for fast feature embedding arxiv preprint kholgade simon efros and sheikh object manipulation in single photograph using stock models in siggraph kingma and welling variational bayes in iclr krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips kulkarni whitney kohli and tenenbaum deep convolutional inverse graphics network in nips long shelhamer and darrell fully convolutional networks for semantic 
segmentation in cvpr michalski memisevic and konda modeling deep temporal dependencies with recurrent grammar cells in nips mnih kavukcuoglu silver graves antonoglou wierstra and riedmiller playing atari with deep reinforcement learning in nips deep learning workshop oh guo lee lewis and singh video prediction using deep networks in atari games in nips reed sohn zhang and lee learning to disentangle factors of variation with manifold interaction in icml shepard and metzler mental rotation of three dimensional objects science simonyan and zisserman very deep convolutional networks for image recognition in iclr tenenbaum and freeman separating style and content with bilinear models neural computation tieleman optimizing neural networks that generate images phd thesis university of toronto vinyals toshev bengio and erhan show and tell neural image caption generator in cvpr zaremba and sutskever learning to execute arxiv preprint zhu lei yan yi and li pose and expression normalization for face recognition in the wild in cvpr zhu luo wang and tang perceptron deep model for learning face identity and view representations in nips 
efficient exact gradient update for training deep networks with very large sparse targets pascal alexandre de brébisson xavier bouthillier département informatique et de recherche opérationnelle université de montréal montréal québec canada and cifar abstract an important class of problems involves training deep neural networks with sparse prediction targets of very high dimension these occur naturally in neural language models or the learning of often posed as predicting the probability of next words among vocabulary of size computing the equally large but typically output vector from last hidden layer of reasonable dimension incurs prohibitive dd computational cost for each example as does updating the output weight matrix and computing the gradient needed for backpropagation to previous layers while efficient handling of large sparse network inputs is trivial the case of large sparse targets is not and has thus so far been sidestepped with approximate alternatives such as hierarchical softmax or approximations during training in this work we develop an original algorithmic approach which for family of loss functions that includes squared error and spherical softmax can compute the exact loss gradient update for the output weights and gradient for backpropagation all in per example instead of dd remarkably without ever computing the output the proposed algorithm yields speedup of two orders of magnitude for typical sizes for that critical part of the computations that often dominates the training time in this kind of network architecture introduction many modern applications of neural networks have to deal with data represented or representable as very large sparse vectors such representations arise in natural language related tasks where the dimension of that vector is typically multiple of the size of the vocabulary and also in the sparse matrices of applications it is trivial to handle very large sparse inputs to neural network in computationally efficient manner the forward propagation and update to the input weight matrix after backpropagation are correspondingly sparse by contrast training with very large sparse prediction targets is problematic even if the target is sparse the computation of the equally large network output and the corresponding gradient update to the huge output weight matrix are not sparse and thus computationally prohibitive this has been practical problem ever since bengio et al first proposed using neural network for learning language model in which case the computed output vector represents the probability of the next word and is the size of the considered vocabulary which is becoming increasingly large in modern applications several approaches have been proposed to attempt to address this difficulty essentially by sidestepping it they fall in two categories sampling or selection based approximations consider and compute only tiny fraction of the output dimensions sampled at random or heuristically chosen the reconstruction sampling of dauphin et al the efficient use of biased importance sampling in jean et al the use of noise contrastive estimation in mnih and kavukcuoglu and mikolov et al all fall under this category as does the more recent use of approximate maximum inner product search based on locality sensitive hashing techniques to select good candidate subset hierarchical softmax imposes heuristically defined hierarchical tree structure for the computation of the normalized probability of the target class compared to the initial problem of 
considering all output dimensions both kinds of approaches are crude approximations in the present work we will instead investigate way to actually perform the exact gradient update that corresponds to considering all outputs but do so implicitly in computationally efficient manner without actually computing the outputs this approach works for relatively restricted class of loss functions the simplest of which is linear output with squared error natural choice for sparse regression targets the most common choice for multiclass classification the softmax loss is not part of that family but we may use an alternative spherical softmax which will also yield normalized class probabilities for simplicity and clarity our presentation will focus on squared error and on an online setting we will briefly discuss its extension to minibatches and to the class of possible loss functions in sections and the problem problem definition and setup we are concerned with based training of deep neural network with target vectors of very high dimension but that are sparse comparatively small number at most of the elements of the target vector are such ksparse vector will typically be stored and represented compactly as numbers corresponding to pairs index value network to be trained with such targets will naturally have an equally large output layer of dimension we can also optionally allow the input to the network to be similarly high dimensional sparse vector of dimension din between the large sparse target output and optionally large sparse input we suppose the network intermediate hidden layers to be of smaller more typically manageable dimension mathematical notation vectors are denoted using letters and are considered corresponding row vectors are denoted with transpose ht matrices are denoted using letters with the transpose of the ith column of is denoted wi and its ith row both viewed as column vector denotes the transpose of the inverse of square matrix id is the identity matrix network architecture we consider standard feed forward neural network architecture as depicted in figure an input vector rdin is linearly transformed into linear activation through din input weight matrix and an optional bias vector rd this is typically followed by transformation to yield the representation of the first hidden layer this first hidden layer representation is then similarly transformed through number of subsequent layers that can be of any usual kind amenable to backpropagation with until we obtain last hidden layer representation we then obtain the final network output as where is output weight matrix which will be our main focus in this work finally the network output is compared to the target vector associated with input using squared error yielding loss ko training procedure this architecture is typical possibly deep feed forward neural network architecture with linear output layer and squared error loss its parameters weight matrices and bias vectors will be trained by gradient descent using gradient backpropagation to efficiently compute the gradients the procedure is shown in figure given an example from the training set as an input target pair pass of forward propagation proceeds as outlined above computing the hidden representation of each hidden layer in turn based on the previous one and finally the network predicted output and associated loss pass of gradient backpropagation then works in the opposite direction starting from and our approach does not impose any restriction on the architecture nor 
size of the hidden layers, as long as they are amenable to the usual gradient backpropagation.
Figure: the computational problem posed by very large sparse targets. Dealing with sparse inputs efficiently is trivial, with both the forward and backward propagation phases easily achieved in O(Kd). However, this is not the case with large sparse targets: they incur a prohibitive computational cost of O(Dd) at the output layer, as forward propagation, gradient backpropagation, and the weight update each require accessing all D x d elements of the large output weight matrix.
Backpropagating the gradients upstream through the network, and computing the corresponding gradient contributions on the parameters (weights and biases) collected along the way, is straightforward once we have the gradient with respect to each layer's activations. Parameters are then updated through a gradient descent step with a positive learning rate $\eta$; the same holds for the output layer, which will be our main focus here.
The easy part: input layer forward propagation and weight update. It is easy and straightforward to efficiently compute the forward propagation and the backpropagation and weight-update part for the input layer when we have a very large $D_{in}$ but a $K$-sparse input vector, provided an appropriate sparse representation is used. Specifically, we suppose that $x$ is represented as a pair of vectors $u, v$ of length (at most) $K$, where $u$ contains the integer indexes and $v$ the associated real values of the non-zero elements of $x$, such that $x_i = 0$ if $i \notin u$, and $x_{u_k} = v_k$.
Forward propagation through the input layer: the sparse representation of $x$ as the positions of its $K$ non-zero elements together with their values makes it cheap to compute the first linear activation. Even though the input weight matrix $W^{(in)}$ may be a huge full $D_{in} \times d$ matrix, only $K$ of its rows (those corresponding to the non-zero entries of $x$) need to be visited and summed. Precisely, with our sparse representation of $x$, this operation can be written as $\sum_{k=1}^{K} v_k\, W^{(in)}_{u_k}$, where each $W^{(in)}_{u_k}$ is a $d$-dimensional row vector, making this an $O(Kd)$ operation rather than $O(D_{in} d)$.
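As a concrete illustration of the sparse input-layer operations just described, the following numpy sketch computes the first-layer pre-activation and the corresponding weight update while touching only the K rows selected by the sparse input; the sizes and array names are illustrative assumptions.

```python
import numpy as np

# Sparse input layer: the input is stored as (indexes u, values v) with K entries.
# Only the K corresponding rows of the input weight matrix are read or written,
# so both operations cost O(Kd) instead of O(D_in * d). Sizes are illustrative.

rng = np.random.default_rng(0)
D_in, d, K, lr = 50_000, 100, 5, 0.01

W_in = rng.normal(scale=0.01, size=(D_in, d))   # large input weight matrix
u = rng.choice(D_in, size=K, replace=False)     # indexes of non-zero inputs
v = rng.random(K)                               # their values

# Forward propagation: a = sum_k v_k * W_in[u_k]  (only K rows visited)
a = v @ W_in[u]                                 # shape (d,)

# Weight update: with grad_a = dL/da from upper layers, dL/dW_in = x grad_a^T
# is non-zero only on the K rows indexed by u.
grad_a = rng.normal(size=d)                     # stand-in for the backpropagated gradient
W_in[u] -= lr * np.outer(v, grad_a)             # O(Kd) update, rest of W_in untouched

print(a.shape, W_in[u].shape)
```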
Gradient and update through the input layer: let us for now suppose that we were able to get, through backpropagation, the gradient with respect to the first hidden layer activations in the form of a gradient vector $\nabla_a L \in \mathbb{R}^d$. The corresponding update to the input layer weights is simply the rank-one update $-\eta\, x\, (\nabla_a L)^T$. Here again only the rows of $W^{(in)}$ associated with the (at most) $K$ non-zero entries of $x$ need to be modified; with our sparse representation this amounts to adding $-\eta\, v_k (\nabla_a L)^T$ to row $u_k$ for each $k$, again an $O(Kd)$ operation rather than $O(D_{in} d)$.
The hard part: output layer propagation and weight update. Given some network input $x$, we suppose we can compute without difficulty, through forward propagation, the associated last hidden layer representation $h \in \mathbb{R}^d$. From then on, computing the final output $o = Wh$ incurs a prohibitive computational cost of $O(Dd)$, since $W$ is a full $D \times d$ matrix. Note that there is no reason for the representation $h$ to be sparse (e.g., with a sigmoid non-linearity), but even if it were, this would not fundamentally change the problem, since it is $D$ that is extremely large, and we supposed $d$ reasonably sized already. Computing the residual $o - y$ and the associated squared error loss $\|o - y\|^2$ incurs an additional $O(D)$ cost. The gradient on $h$ that we need to backpropagate to lower layers is $\nabla_h L = 2\,W^T(o - y)$, which is another $O(Dd)$ matrix-vector product. Finally, when performing the corresponding output weight update $W \leftarrow W - \eta\,(o - y)\,h^T$, we see that it is a rank-one update that touches all $Dd$ elements of $W$, which again incurs a prohibitive $O(Dd)$ computational cost for very large $D$. All three of these $O(Dd)$ operations are prohibitive, and the fact that $y$ is sparse, seen from this perspective, doesn't help, since neither $o$ nor $o - y$ will be sparse.
A computationally efficient algorithm for performing the exact online gradient update. Previously proposed workarounds are approximate or use stochastic sampling. We propose a different approach that results in the exact same, yet efficient, gradient update, remarkably without ever having to compute the large output $o$.
Computing the squared error loss and the gradient with respect to $h$ efficiently. Suppose that we have, for a network input example $x$, computed the last hidden representation $h \in \mathbb{R}^d$ through forward propagation. The network's $D$-dimensional output $o = Wh$ is then in principle compared to the high-dimensional target $y \in \mathbb{R}^D$; the corresponding squared error loss is $L = \|Wh - y\|^2$. As we saw above, computing it in the direct naive way has a prohibitive computational complexity of $O(Dd)$, because computing the output $Wh$ with a full $D \times d$ matrix $W$ and a typically non-sparse $h$ is $O(Dd)$. Similarly, to backpropagate the gradient through the network, we need the gradient of the loss with respect to the last hidden layer representation, $\nabla_h L = 2\,W^T(Wh - y)$; computed directly in this manner, it is again a prohibitive $O(Dd)$. Provided we have maintained an up-to-date matrix $Q = W^T W$, which is a reasonably sized $d \times d$ matrix and can be cheaply maintained as we will see below, we can rewrite these two operations so as to perform them in $O(d^2)$:
loss computation: $L = \|Wh - y\|^2 = h^T Q h - 2\, h^T (W^T y) + y^T y$, with cost $O(d^2)$ for $h^T(Qh)$, $O(Kd)$ for $W^T y$, and $O(K)$ for $y^T y$;
gradient on $h$: $\nabla_h L = 2\,W^T(Wh - y) = 2\,(Qh - W^T y)$, with cost $O(d^2) + O(Kd)$.
The terms in $O(Kd)$ and $O(K)$ are due to leveraging the $K$-sparse representation of the target vector $y$. With $K \ll D$ and $d \ll D$, we get altogether a computational cost of $O(d^2 + Kd)$, which can be several orders of magnitude cheaper than the prohibitive $O(Dd)$ of the direct approach.
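A small numpy sketch of the two rewritten operations above: the loss and the gradient on h are computed from Q = WᵀW and the K-sparse target without ever forming the D-dimensional output. The sizes and names are illustrative, and the full W is materialized here only to check the result against the naive computation.

```python
import numpy as np

# Loss and gradient w.r.t. h using Q = W^T W and a K-sparse target,
# without computing the D-dimensional output o = W h.
# W is materialized only so we can verify the result against the naive path.

rng = np.random.default_rng(0)
D, d, K = 20_000, 100, 5

W = rng.normal(scale=0.01, size=(D, d))
Q = W.T @ W                                   # maintained d x d bookkeeping matrix
h = rng.normal(size=d)

idx = rng.choice(D, size=K, replace=False)    # sparse target: indexes and values
val = rng.random(K)

Wty = val @ W[idx]                            # W^T y, touching only K rows: O(Kd)
Qh = Q @ h                                    # O(d^2)

loss = h @ Qh - 2.0 * (h @ Wty) + val @ val   # h^T Q h - 2 h^T W^T y + y^T y
grad_h = 2.0 * (Qh - Wty)                     # gradient to backpropagate

# Check against the naive O(Dd) computation.
y = np.zeros(D); y[idx] = val
o = W @ h
assert np.allclose(loss, np.sum((o - y) ** 2))
assert np.allclose(grad_h, 2.0 * W.T @ (o - y))
print(loss)
```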
Efficient gradient update of $W$. The gradient of the squared error loss with respect to the output weight matrix $W$ is $(Wh - y)\,h^T$ (up to a constant factor that we absorb into the learning rate), and the corresponding gradient descent update would be $W^{new} = W - \eta\,(Wh - y)\,h^T$, where $\eta$ is a positive learning rate. Again, computed in this manner, this induces a prohibitive $O(Dd)$ computational complexity, both to compute the output and residual $Wh - y$, and then to update all the $Dd$ elements of $W$: since generally neither $h$ nor $Wh - y$ will be sparse, all elements of $W$ must be accessed during this update. On the surface this seems hopeless, but we will now see how we can achieve the exact same update on $W$ in $O(d^2 + Kd)$. The trick is to represent $W$ implicitly as the factorization $W = UV$, with $U$ of size $D \times d$ and $V$ of size $d \times d$, and to update $U$ and $V$ instead:
(a) $V^{new} = V - \eta\,(Vh)\,h^T$
(b) $U^{new} = U + \eta\, y\,\big((V^{new})^{-T} h\big)^T$
This results in implicitly updating $W$ as we did explicitly in the naive approach, as we now prove:
$U^{new} V^{new} = \big(U + \eta\, y\, h^T (V^{new})^{-1}\big)\,V^{new} = U V^{new} + \eta\, y\, h^T = U V (I - \eta\, h h^T) + \eta\, y\, h^T = W - \eta\,(Wh)\,h^T + \eta\, y\, h^T = W - \eta\,(Wh - y)\,h^T = W^{new}$.
We see that the update of $V$ in eq. (a) is a simple $O(d^2)$ operation. Following this simple update to $V$, we can use the Sherman-Morrison formula to derive the corresponding update to $V^{-T}$, which will also be $O(d^2)$:
$(V^{new})^{-T} = V^{-T} + \frac{\eta}{1 - \eta\,\|h\|^2}\,(V^{-T} h)\,h^T$,
an operation needed in eq. (b). It is then easy to compute $(V^{new})^{-T}h$ and the ensuing update of $U$ in eq. (b), which, thanks to the $K$-sparsity of $y$, is only $O(Kd)$: only the $K$ rows of $U$ associated with non-zero elements of $y$ are accessed and updated, instead of all $D$ rows of $W$ we had to modify in the naive update. Note that with the factored representation of $W$ as $UV$ we only have $W$ implicitly, so the term $W^T y$ that entered the computation of the loss and of $\nabla_h L$ in the previous paragraph needs to be adapted slightly, as $V^T(U^T y)$, which becomes $O(Kd + d^2)$ rather than $O(Kd)$; this doesn't change the overall complexity of these computations.
Bookkeeping: keeping an up-to-date $V^{-T}$ and $Q$. We have already seen above how we can cheaply maintain an up-to-date $V^{-T}$ following our update of $V$. Similarly, following our updates to $U$ and $V$, we need to keep an up-to-date $Q = W^T W$, which is needed to efficiently compute the loss and the gradient on $h$. We have shown that the updates to $U$ and $V$ in equations (a) and (b) are equivalent to implicitly updating $W$ as $W^{new} = W - \eta\,(Wh - y)\,h^T$, and this translates into the following update to $Q$:
$Q^{new} = Q - \eta\,\big(q\, h^T + h\, q^T\big) + \eta^2\, L\, h h^T$, where $q = Qh - W^T y$ and $L = \|Wh - y\|^2$.
The proof is straightforward, but due to space constraints we put it in the supplementary material. One can see that this last bookkeeping operation also has an $O(d^2)$ computational complexity.
Putting it all together: detailed algorithm and expected benefits. We have seen that we can efficiently compute the loss, the gradient with respect to $h$ (to be backpropagated further), the updates to $U$ and $V$, and the bookkeeping for $V^{-T}$ and $Q$. The algorithm below lists the detailed steps, with the computational complexity of each, put together from the equations derived above.
Algorithm: efficient computation of the loss $L$, the gradient on $h$, and the parameter updates, for one example with a $K$-sparse target $y$:
1. $\hat y = U^T y$, then $\bar w = V^T \hat y$ ($= W^T y$) — $O(Kd + d^2)$
2. $\hat q = Q h$ — $O(d^2)$
3. $L = h^T \hat q - 2\, h^T \bar w + y^T y$ — $O(d + K)$
4. $\nabla_h L = 2\,(\hat q - \bar w)$ — $O(d)$
5. $V^{new} = V - \eta\,(Vh)\,h^T$ — $O(d^2)$
6. $(V^{new})^{-T} = V^{-T} + \frac{\eta}{1 - \eta\|h\|^2}\,(V^{-T}h)\,h^T$ — $O(d^2)$
7. $U^{new} = U + \eta\, y\,\big((V^{new})^{-T} h\big)^T$ — $O(Kd + d^2)$
8. $Q^{new} = Q - \eta\,\big((\hat q - \bar w)\,h^T + h\,(\hat q - \bar w)^T\big) + \eta^2\, L\, h h^T$ — $O(d^2)$
Altogether, counting elementary operations, the proposed algorithm requires $O(d^2 + Kd)$ operations, whereas the standard approach requires $O(Dd)$ operations; taking $K$ of the same order as $d$, the expected speedup is thus on the order of $D/d$. Note that the advantage is not only in computational complexity but also in memory access: for each example, the standard approach needs to access and change all $Dd$ elements of matrix $W$, whereas the proposed approach only accesses the much smaller number $Kd$ of elements of $U$, as well as the three small $d \times d$ matrices $V$, $V^{-T}$ and $Q$. So overall we have a substantially faster algorithm which, while doing so implicitly, will nevertheless perform the exact same gradient update as the standard approach. We want to emphasize here that our approach is completely different from simply chaining two linear layers $U$ and $V$ and performing ordinary gradient descent updates on them: this would result in the same prohibitive computational complexity as the standard approach, and such ordinary separate gradient updates to $U$ and $V$ would not be equivalent to the ordinary gradient update to $W = UV$.
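The following numpy sketch plays through one online step of the factored update and bookkeeping described above and checks that it reproduces the naive update of W; the dimensions, learning rate, and variable names are illustrative assumptions, and the explicit W is formed only for verification.

```python
import numpy as np

# One online step of the factored update: W is kept implicitly as W = U V,
# with Q = W^T W and V^{-T} maintained as bookkeeping. The naive W is formed
# here only to verify that the implicit update matches the explicit one.

rng = np.random.default_rng(1)
D, d, K, lr = 5_000, 50, 4, 0.05

U = rng.normal(scale=0.01, size=(D, d))
V = np.eye(d)                     # start from W = U, so V = I and V^{-T} = I
Vinv_T = np.eye(d)
Q = U.T @ U                       # Q = W^T W since V = I

h = rng.normal(size=d)
idx = rng.choice(D, size=K, replace=False)   # K-sparse target y
val = rng.random(K)

# Loss and gradient pieces, using only small matrices and K rows of U.
Wty = V.T @ (val @ U[idx])        # W^T y = V^T (U^T y)
Qh = Q @ h
q = Qh - Wty
loss = h @ Qh - 2 * (h @ Wty) + val @ val

# Parameter updates (eqs. a and b) and bookkeeping.
V_new = V - lr * np.outer(V @ h, h)
Vinv_T_new = Vinv_T + (lr / (1 - lr * (h @ h))) * np.outer(Vinv_T @ h, h)
U_new = U.copy()
U_new[idx] += lr * np.outer(val, Vinv_T_new @ h)          # only K rows touched
Q_new = Q - lr * (np.outer(q, h) + np.outer(h, q)) + lr**2 * loss * np.outer(h, h)

# Verification against the explicit (naive) update of W.
W = U @ V
y = np.zeros(D); y[idx] = val
W_naive = W - lr * np.outer(W @ h - y, h)
assert np.allclose(U_new @ V_new, W_naive)
assert np.allclose(Q_new, W_naive.T @ W_naive)
print("implicit update matches naive update")
```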
Controlling numerical stability, and extension to the minibatch case. The update of $V$ in equation (a) may over time lead $V$ to become ill-conditioned. To prevent this, we regularly monitor its condition number; if either the smallest or largest singular value moves outside an acceptable range, we bring it back by performing an appropriate joint update to $U$ and $V$, which costs $O(Dd)$ operations but is only done rarely. Our algorithm can also be straightforwardly extended to the minibatch case (the derivations are given in the supplementary material) and yields the same theoretical speedup factor with respect to the standard naive approach, but one needs to be careful in order to keep the computation of $V^{-T}$ reasonably efficient: depending on the size of the minibatch, it may be more efficient to solve the corresponding linear system for each minibatch from scratch, rather than updating $V^{-T}$ with the Woodbury identity, which generalizes the Sherman-Morrison formula to rank-$m$ updates. More details on our numerical stabilization procedure can be found in the supplementary material.
Generalization to a broader class of loss functions. The approach that we just detailed for a linear output and squared error can be extended to a broader, though restricted, family of loss functions. We call it the spherical family of loss functions, because it includes the spherical alternative to the softmax. Basically, it contains any loss function that can be expressed as a function of only the few $o_c$ associated with the non-zero target components $y_c$ and of $\|o\|^2 = \sum_j o_j^2$, the squared norm of the whole output vector, which we can compute cheaply irrespective of $D$, just as we did above. (Loss functions in this family are also allowed to depend on the plain sum $\sum_j o_j$, which we can likewise compute cheaply without computing $o$, by tracking the vector of column sums of $W$ and taking its dot product with $h$.) This family does not include the standard softmax loss $-\log\frac{\exp(o_c)}{\sum_j \exp(o_j)}$, but it does include the spherical softmax $-\log\frac{o_c^2}{\sum_j o_j^2}$, where $c$ is the correct class label; in practice we add a small positive constant to the spherical interpretation for numerical stability, to guarantee we never divide by zero nor take the log of zero. Due to space constraints we will not detail this extension here, and only give a sketch of how it can be obtained. Deriving it may not appear obvious at first, but it is relatively straightforward once we realize that the gain in computing the squared error loss comes from being able to very cheaply compute the sum of squared activations, a scalar quantity; the same gain thus applies equally well to other losses that can be expressed based on that quantity, like the spherical softmax. Generalizing our gradient update trick to such losses follows naturally from gradient backpropagation: the gradient is first backpropagated from the final loss to the scalar sum of squared activations, and from there on follows the same path and update procedure as for the squared error loss.
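To illustrate the point about the spherical family, the sketch below computes a spherical-softmax loss for a single correct class from only the class score o_c and the squared norm hᵀQh, never forming the D-dimensional output; the epsilon stabilizer and all sizes are illustrative assumptions.

```python
import numpy as np

# Spherical-softmax loss for one correct class c, computed from the class score
# o_c = W[c] @ h and the squared output norm ||o||^2 = h^T Q h, without forming o.
# The epsilon stabilizer and all sizes are illustrative.

rng = np.random.default_rng(2)
D, d, eps = 10_000, 64, 1e-8

W = rng.normal(scale=0.01, size=(D, d))
Q = W.T @ W
h = rng.normal(size=d)
c = 123                                   # correct class label

o_c = W[c] @ h                            # O(d): one row of W
sq_norm = h @ (Q @ h)                     # O(d^2): ||W h||^2 via Q
loss = -np.log((o_c**2 + eps) / (sq_norm + eps))

# Check against the naive O(Dd) computation.
o = W @ h
assert np.allclose(loss, -np.log((o[c]**2 + eps) / (np.sum(o**2) + eps)))
print(loss)
```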
Experimental validation. We implemented both a CPU version using BLAS and a parallel GPU (CUDA) version using cuBLAS of the proposed algorithm. We evaluated the GPU and CPU implementations by training word embeddings with simple neural language models, in which a probability map of the next word given its preceding words is learned by a neural network. We used an NVIDIA Titan Black GPU and a multi-core CPU, and ran experiments on the One Billion Word dataset, which is composed of on the order of a billion words belonging to a vocabulary of on the order of a million words. We evaluated the resulting word embeddings with a recently introduced score which measures the similarity between words. We also compared our approach to unfactorised versions and to a hierarchical softmax. The figures below illustrate the practical speedup of our approach for the output layer only (left), and show that our large-sparse-target (LST) models are much faster to train than the softmax models and converge to only slightly lower scores (right). The table summarizes the speedups for the different output layers we tried, both on CPU and GPU. We also empirically verified that our proposed factored algorithm learns the same model weights as the corresponding naive unfactored algorithm, as it theoretically should, and follows the same learning curves as a function of the number of iterations (not time).
Conclusion and future work. We introduced a new algorithmic approach to efficiently compute the exact gradient updates for training deep networks with very large sparse targets. Remarkably, the complexity of the algorithm is independent of the target size $D$, which allows tackling very large problems. Our CPU and GPU implementations yield speedups similar to the theoretical one and can thus be used in practical applications, which could be explored in further work. In particular, neural language models seem good candidates, but it remains unclear how using a loss function other than the usual softmax might affect the quality of the resulting word embeddings, so further research needs to be carried out in this direction. This includes empirically investigating natural extensions of the approach we described to other possible losses in the spherical family.
Acknowledgements. We wish to thank Yves Grandvalet for stimulating discussions, Gülçehre for helpful pointers, the developers of Theano and Blocks for making these libraries available to build on, and NSERC and Ubisoft for their financial support. Open source code for our implementation is available online.
Table: speedups with respect to the baseline naive model on CPU, for a fixed minibatch size and the whole vocabulary of words, using a model with two hidden layers. Rows: CPU unfactorised (naive), GPU unfactorised (naive), GPU hierarchical softmax, CPU factorised, GPU factorised. Columns: output-layer-only speedup, whole-model speedup.
Figure: timing of the different algorithms. Time taken by the forward and backward propagations in the output layer (including the weight update) on a minibatch, for different sizes of vocabulary, on both CPU and GPU; the input size is fixed. The timing of a two-layer hierarchical softmax with an efficient GPU implementation is also provided for comparison; the right plot is in log scale. As expected, the timings of the factorised versions are independent of the vocabulary size.
Figure: (left) practical and theoretical speedups for different sizes of vocabulary and a fixed input size; the practical unfactorised/factorised speedup is similar to the theoretical one. (Right) evolution of the similarity score obtained with the different models as a function of training time (CPU softmax times were extrapolated from fewer iterations). The softmax models are zero-hidden-layer models, while our large-sparse-target (LST) models have two hidden layers; these were the best architectures retained in both cases (surprisingly, the softmax models with hidden layers performed no better on this task; the extra layers in LST may help compensate for the lack of a softmax). LST models converge to slightly lower scores at a similar speed as the hierarchical softmax model, but significantly faster than the regular
softmax models references bengio ducharme and vincent neural probabilistic language model in advances in neural information processing systems nips pages collobert weston bottou karlen kavukcuoglu and kuksa natural language processing almost from scratch journal of machine learning research dauphin glorot and bengio learning of embeddings with reconstruction sampling in proceedings of the international conference on machine learning icml jean cho memisevic and bengio on using very large target vocabulary for neural machine translation in gutmann and hyvarinen estimation new estimation principle for unnormalized statistical models in proceedings of the thirteenth international conference on artificial intelligence and statistics aistats mnih and kavukcuoglu learning word embeddings efficiently with estimation in advances in neural information processing systems pages mikolov sutskever chen corrado and dean distributed representations of words and phrases and their compositionality in nips pages shrivastava and li asymmetric lsh alsh for sublinear time maximum inner product search mips in advances in neural information processing systems pages vijayanarasimhan shlens monga and yagnik deep networks with large output spaces morin and bengio hierarchical probabilistic neural network language model in proceedings of the tenth international workshop on artificial intelligence and statistics pages rumelhart hinton and williams learning representations by errors nature lecun une procédure apprentissage pour réseau seuil assymétrique in cognitiva la frontière de intelligence artificielle des sciences de la connaissance et des neurosciences pages lecun learning processes in an asymmetric threshold network in disordered systems and biological organization pages les houches ollivier riemannian metrics for neural networks corr chelba mikolov schuster ge brants koehn and robinson one billion word benchmark for measuring progress in statistical language modeling in interspeech annual conference of the international speech communication association singapore september pages hill reichart and korhonen evaluating semantic models with genuine similarity estimation corr bergstra breuleux bastien lamblin pascanu desjardins turian and bengio theano cpu and gpu math expression compiler in proceedings of the python for scientific computing conference scipy oral presentation bastien lamblin pascanu bergstra goodfellow bergeron bouchard and bengio theano new features and speed improvements deep learning and unsupervised feature learning nips workshop van merriënboer bahdanau dumoulin serdyuk chorowski and bengio blocks and fuel frameworks for deep learning arxiv june 
backpropagation for neuromorphic computing steve esser ibm harry road san jose ca sesser rathinakumar appuswamy ibm harry road san jose ca rappusw paul merolla ibm harry road san jose ca pameroll john arthur ibm harry road san jose ca arthurjo dharmendra modha ibm harry road san jose ca dmodha abstract solving real world problems with embedded neural networks requires both training algorithms that achieve high performance and compatible hardware that runs in real time while remaining energy efficient for the former deep learning using backpropagation has recently achieved string of successes across many domains and datasets for the latter neuromorphic chips that run spiking neural networks have recently achieved unprecedented energy efficiency to bring these two advances together we must first resolve the incompatibility between backpropagation which uses neurons and synaptic weights and neuromorphic designs which employ spiking neurons and discrete synapses our approach is to treat spikes and discrete synapses as continuous probabilities which allows training the network using standard backpropagation the trained network naturally maps to neuromorphic hardware by sampling the probabilities to create one or more networks which are merged using ensemble averaging to demonstrate we trained sparsely connected network that runs on the truenorth chip using the mnist dataset with high performance network ensemble of we achieve accuracy at µj per image and with high efficiency network ensemble of we achieve accuracy at µj per image introduction neural networks today are achieving performance in competitions across range of fields such success raises hope that we can now begin to move these networks out of the lab and into embedded systems that can tackle real world problems this necessitates shift darpa approved for public release distribution unlimited in thinking to system design where both neural network and hardware substrate must collectively meet performance power space and speed requirements on basis the most efficient substrates for neural network operation today are dedicated neuromorphic designs to achieve high efficiency neuromorphic architectures can use spikes to provide event based computation and communication that consumes energy only when necessary can use low precision synapses to colocate memory with computation keeping data movement local and allowing for parallel distributed operation and can use constrained connectivity to implement neuron efficiently thus dramatically reducing network traffic however such design choices introduce an apparent incompatibility with the backpropagation algorithm used for training today most successful deep networks which uses neurons and synapses and typically operates with no limits on the number of inputs per neuron how then can we build systems that take advantage of algorithmic insights from deep learning and the operational efficiency of neuromorphic hardware as our main contribution here we demonstrate learning rule and network topology that reconciles the apparent incompatibility between backpropagation and neuromorphic hardware the essence of the learning rule is to train network offline with hardware supported connectivity as well as continuous valued input neuron output and synaptic weights but values constrained to the range we further impose that such constrained values represent probabilities either of spike occurring or of particular synapse being on such network can be trained using backpropagation but also has direct 
representation in the spiking, low-synaptic-precision deployment system, thereby bridging these two worlds. The network topology uses a progressive mixing approach, where each neuron has access to a limited set of inputs from the previous layer, but sources are chosen such that neurons in successive layers have access to progressively more of the network input. Previous efforts have shown success with subsets of the elements we bring together here: backpropagation has been used to train networks with spiking neurons but high-precision weights, and the converse, networks with trinary synapses but continuous-output neurons; other probabilistic backpropagation approaches have been demonstrated for networks with binary neurons and binary or trinary synapses, but full connectivity. The work presented here is novel in that (i) we demonstrate for the first time an offline training methodology using backpropagation to create a network that employs spiking neurons, synapses requiring fewer bits of precision than even trinary weights, and constrained connectivity; (ii) we achieve the best accuracy to date on MNIST when compared to networks that use spiking neurons (even with high-precision synapses) as well as networks that use binary synapses and neurons; and (iii) we demonstrate the network running in real time on the TrueNorth chip, achieving by far the best published power efficiency for digit recognition, requiring only microjoules per classification at high accuracy, compared to other low-power approaches requiring millijoules per classification.
Deployment hardware. We use the TrueNorth neurosynaptic chip as our example deployment system, though the approach here could be generalized to other neuromorphic hardware. The TrueNorth chip consists of many cores, with each core containing input axons, a synapse crossbar, and spiking neurons. Information flows via spikes, from a neuron to one axon (between any two cores), and from that axon to potentially all neurons on its core, gated by the binary synapses in the crossbar. Neurons can be configured to take on a variety of dynamics, including those described below. Each axon is assigned one of a small number of axon types, which is used as an index into a lookup table (unique to each neuron) that provides a signed integer synaptic strength to the corresponding synapse. This approach requires only one bit per synapse for the on/off state, plus a small additional number of bits per synapse for the lookup-table scheme.
Network training. In our approach we employ two types of multilayer networks. The deployment network runs on a platform supporting spiking neurons, discrete synapses with low precision, and limited connectivity. The training network is used to learn the binary synaptic connectivity states and the biases; it shares the same topology as the deployment network, but represents input data, neuron outputs, and synaptic connections using continuous values constrained to the range [0, 1]. An overview is provided in the figure and table below. These values correspond to probabilities, either of a spike occurring or of a synapse being on, providing a means of mapping the training network to the deployment network while providing a continuous and differentiable space for backpropagation. Below we describe the deployment network, our training methodology, and our procedure for mapping the training network to the deployment network.
Deployment network. Our deployment network follows a feedforward methodology, where neurons are sequentially updated from the first to the last layer. Input to the network is represented using stochastically generated spikes, where the value of each input unit is 0 or 1 with some probability.
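A small numpy sketch of the crossbar arithmetic just described: incoming spikes arrive on axons, each axon has a type, and each neuron looks up a signed strength for that type, so a 1-bit on/off crossbar plus a small per-neuron table determines the summed input. The core size, number of types, strength values, and the simple zero threshold are illustrative assumptions, not TrueNorth's actual parameters.

```python
import numpy as np

# Crossbar arithmetic on a TrueNorth-like core: a binary synapse gates each
# (axon, neuron) pair, the axon's type indexes a small per-neuron lookup table
# of signed strengths, and each neuron sums its gated, weighted inputs.
# Core size, number of types, strengths, and the threshold are illustrative.

rng = np.random.default_rng(0)
N_AXONS, N_NEURONS, N_TYPES = 16, 8, 4

spikes = rng.integers(0, 2, size=N_AXONS)                 # incoming spikes (0/1)
crossbar = rng.integers(0, 2, size=(N_AXONS, N_NEURONS))  # 1 bit per synapse (on/off)
axon_type = rng.integers(0, N_TYPES, size=N_AXONS)        # one type per axon
strength_lut = rng.integers(-2, 3, size=(N_NEURONS, N_TYPES))  # per-neuron signed table
bias = rng.integers(-1, 2, size=N_NEURONS)                # per-neuron bias / leak

# Effective weight of synapse (i, j) = crossbar[i, j] * strength_lut[j, axon_type[i]]
effective_w = crossbar * strength_lut[:, axon_type].T      # shape (N_AXONS, N_NEURONS)
summed_input = spikes @ effective_w + bias                 # integer input to each neuron
out_spikes = (summed_input > 0).astype(int)                # illustrative threshold dynamics

print(summed_input, out_spikes)
```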
We write this as $x_i \sim \mathrm{Bernoulli}(\hat{x}_i)$, where $x_i$ is the spike state of input unit $i$ and $\hat{x}_i$ is a continuous value in the range $[0, 1]$ derived from the input data (pixels). This scheme allows representation of the data using binary spikes while preserving the data's precision in the expectation.
Figure: diagram showing the input, synapses, and output for one neuron in the deployment network (input spikes, connected synapses, synapse strengths, neuron spikes) and in the training network (input spike probabilities, synaptic connection probabilities, synapse strengths, spike probabilities); for simplicity only three synapses are depicted.
The summed neuron input is computed as
$I_j = \sum_i x_i\, c_{ij}\, s_{ij} + b_j$,
where $j$ is the target neuron index, $c_{ij}$ is a binary indicator variable representing whether the synapse is on, $s_{ij}$ is the synaptic strength, and $b_j$ is the bias term. This is identical to common practice in neural networks, except that we have factored the synaptic weight into $c_{ij}$ and $s_{ij}$, such that we can focus our learning efforts on the former, for reasons described below. The neuron activation function follows a thresholding equation: $n_j = 1$ if $I_j$ exceeds the firing threshold, and $n_j = 0$ otherwise. These dynamics are implemented in TrueNorth by setting each neuron's leak equal to the learned bias term (dropping any fractional portion), fixing its threshold and its membrane potential floor, and setting its synapse parameters using the scheme described below. We represent each class label using multiple output neurons in the last layer of the network, which we found improves prediction performance; the network prediction for a class is simply the average of the output of all neurons assigned to that class.
Table: network components. Deployment network: input $x_i \in \{0, 1\}$, synaptic connection $c_{ij} \in \{0, 1\}$, synaptic strength $s_{ij}$ a signed integer from a small set, neuron output $n_j \in \{0, 1\}$. Training network: input probability $\hat{x}_i \in [0, 1]$, connection probability $\hat{c}_{ij} \in [0, 1]$, the same strengths $s_{ij}$, and output spike probability $\hat{n}_j \in [0, 1]$.
Training network. Training follows the backpropagation methodology, by iteratively (i) running a forward pass from the first layer to the last layer, (ii) comparing the network output to the desired output using a loss function, (iii) propagating the loss backwards through the network to determine the loss gradient at each synapse and bias term, and (iv) using this gradient to update the network parameters. The training network forward pass is a probabilistic representation of the deployment network forward pass. Synaptic connections are represented as probabilities $\hat{c}_{ij} = P(c_{ij} = 1)$, while synaptic strength is represented using $s_{ij}$, as in the deployment network. It is assumed that $s_{ij}$ can be drawn from a limited set of values, and we consider the additional constraint that it is set in blocks, such that multiple synapses share the same value, as done in TrueNorth for efficiency. While it is conceivable to learn optimal values for $s_{ij}$ under such conditions, this requires stepwise changes between allowed values and an optimization that is not local to each synapse; we take a simpler approach here, which is to learn the biases and synapse connection probabilities, and to intelligently fix the synapse strengths using the approach described in the network initialization section. Input to the training network is represented using $\hat{x}_i$, which is the probability of an input spike occurring in the deployment network.
For neurons, we note that the summed-input equation above is a summation of weighted Bernoulli variables plus a bias term. If we assume independence of these inputs and have sufficient numbers of them, then we can approximate the probability distribution of this summation as a Gaussian with mean
$\mu_j = b_j + \sum_i \hat{x}_i\, \hat{c}_{ij}\, s_{ij}$
and variance
$\sigma_j^2 = \sum_i \hat{x}_i\, \hat{c}_{ij}\,(1 - \hat{x}_i \hat{c}_{ij})\, s_{ij}^2$.
We can then derive the probability of such a neuron firing using the complementary cumulative distribution function of a Gaussian,
$\hat{n}_j = \tfrac{1}{2}\big[1 + \mathrm{erf}\big(\mu_j / (\sqrt{2}\,\sigma_j)\big)\big]$,
where erf is the error function.
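A minimal numpy sketch of the correspondence just described: the training-network spike probability computed with the Gaussian CCDF is compared against the empirical firing rate of the sampled deployment neuron. The sizes, the ±1 strengths, and the zero firing threshold are illustrative assumptions.

```python
import numpy as np
from math import erf, sqrt

# One neuron: compare the training-network spike probability (Gaussian CCDF over
# the summed Bernoulli inputs) with the empirical firing rate of the sampled
# deployment neuron. Sizes, strengths, and the zero threshold are illustrative.

rng = np.random.default_rng(3)
n_inputs, n_trials = 256, 20_000

x_hat = rng.random(n_inputs)              # input spike probabilities
c_hat = rng.random(n_inputs)              # synaptic connection probabilities
s = rng.choice([-1, 1], size=n_inputs)    # fixed signed strengths (illustrative)
b = 0.5                                   # bias / leak term

# Training network: Gaussian approximation of the summed input.
p = x_hat * c_hat                          # P(x_i * c_ij = 1)
mu = b + np.sum(p * s)
sigma = sqrt(np.sum(p * (1.0 - p) * s**2))
n_hat = 0.5 * (1.0 + erf(mu / (sqrt(2.0) * sigma)))   # P(I_j > 0)

# Deployment network: sample spikes and synapse states, threshold the sum.
x = rng.random((n_trials, n_inputs)) < x_hat
c = rng.random((n_trials, n_inputs)) < c_hat
I = (x & c).astype(int) @ s + b
firing_rate = np.mean(I > 0)

print(f"predicted {n_hat:.3f} vs sampled {firing_rate:.3f}")
```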
For layers after the first, the input probabilities $\hat{x}_i$ are replaced by the outputs $\hat{n}_i$ of the previous layer, which represent the probability that each neuron produces a spike.
A variety of loss functions are suitable for our approach, but we found that training converged the fastest when using the log loss
$E = -\sum_k \big[\, y_k \log p_k + (1 - y_k)\log(1 - p_k)\,\big]$,
where for each class $k$, $y_k$ is a binary class label (that is, 1 if the class is present and 0 otherwise) and $p_k$ is the probability that the average spike count for the class exceeds a fixed threshold. Conveniently, we can use the Gaussian approximation above for this, with the mean and variance terms set by the averaging process.
The training network backward pass is an adaptation of backpropagation using the neuron and synapse equations above. To get the gradient at each synapse we use the chain rule to compute $\frac{\partial E}{\partial \hat{c}_{ij}} = \frac{\partial E}{\partial \hat{n}_j}\,\frac{\partial \hat{n}_j}{\partial \hat{c}_{ij}}$; for the bias, a similar computation is made by replacing $\hat{c}_{ij}$ in the above equation with $b_j$. We can then differentiate the firing-probability equation to produce $\frac{\partial \hat{n}_j}{\partial \hat{c}_{ij}}$, which contains one term with a $\sigma_j$ denominator and a second term, involving $\mu_j$ and the derivative of $\sigma_j$, with a $\sigma_j^3$ denominator. As described below, we will assume that the synapse strengths to each neuron are balanced between positive and negative values and that each neuron receives many inputs, so we can expect $\mu_j$ to be close to zero and each individual input term to be much smaller than the total variance; therefore the term containing the $\sigma_j^3$ denominator can be expected to be much smaller than the term containing the $\sigma_j$ denominator. Under these conditions, for computational efficiency, we can approximate the derivative by dropping the smaller term and factoring out the remainder as
$\frac{\partial \hat{n}_j}{\partial \hat{c}_{ij}} \approx G_j\, \hat{x}_i\, s_{ij}$, where $G_j = \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\!\big(-\mu_j^2 / (2\sigma_j^2)\big)$.
A similar treatment can be used to show that the corresponding factor for the gradient with respect to the bias term equals one.
The network is updated using the loss gradient at each synapse and bias term. For each iteration, the synaptic connection probability changes according to $\hat{c}_{ij} \leftarrow \hat{c}_{ij} - \eta\,\frac{\partial E}{\partial \hat{c}_{ij}}$, where $\eta$ is the learning rate; any synaptic connection probabilities that fall outside of the range $[0, 1]$ as a result of the update rule are snapped to the nearest valid value. Changes to the bias term are handled in a similar fashion, with values clipped to fall in the largest range supported by the TrueNorth neuron parameters. The training procedure described here is amenable to the methods and heuristics applied in standard backpropagation; for the results shown below we used a fixed mini-batch size, momentum, dropout, learning rate decay on a fixed schedule across training iterations (starting from an initial rate and multiplying by a fixed factor every few epochs), and random transformations of the training data for each iteration (rotation, shift by a few pixels, and rescaling).
Mapping the training network to the deployment network. Training is performed offline, and the resulting network is mapped to the deployment network for hardware operation. For deployment, depending on system requirements, we can utilize an ensemble of one or more samplings of the training network to increase overall output performance; unlike other ensemble methods, we train only once and then sample the training network for each member. The system output for each class is determined by averaging across all neurons, in all member networks, assigned to that class. Synaptic connection states are set on or off according to $\hat{c}_{ij}$, using independent random number draws for each synapse in each ensemble member; data is converted into a spiking representation for input using $\hat{x}_i$, using independent random number draws for each input to each member of the ensemble.
Network initialization. The approach for network initialization described here allows us to optimize for efficient neuromorphic hardware that employs very few bits per synapse. In our approach, each synaptic connection probability is initialized from a uniform random distribution over the allowed range.
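The following numpy sketch illustrates the mapping step just described: a trained probabilistic network is sampled into one or more deployment networks, and their class outputs are averaged. The tiny single-layer network, the class assignment, and all sizes are illustrative assumptions.

```python
import numpy as np

# Map a trained probabilistic network to an ensemble of deployment networks:
# sample binary synapse states once per member, sample input spikes per member,
# and average the class outputs over all neurons in all members.
# The single-layer toy network and class assignment are illustrative.

rng = np.random.default_rng(4)
n_in, n_out, n_classes, n_members = 64, 20, 10, 4

x_hat = rng.random(n_in)                        # input spike probabilities
c_hat = rng.random((n_in, n_out))               # learned connection probabilities
s = rng.choice([-1, 1], size=(n_in, n_out))     # fixed strengths
b = rng.normal(scale=0.1, size=n_out)           # learned biases
neuron_class = np.arange(n_out) % n_classes     # output neurons assigned to classes

votes = np.zeros(n_classes)
for _ in range(n_members):
    c = rng.random(c_hat.shape) < c_hat         # sample synapse states for this member
    x = rng.random(n_in) < x_hat                # sample input spikes for this member
    spikes = ((x[:, None] * c * s).sum(axis=0) + b) > 0
    for k in range(n_classes):
        votes[k] += spikes[neuron_class == k].mean()

prediction = int(np.argmax(votes / n_members))
print(votes / n_members, prediction)
```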
initialize synapse strength values we begin from the principle that each core should maximize information transfer by maximizing information per neuron and minimizing redundancy between neurons such methods have been explored in detail in approaches such as infomax while the first of these goals is data dependent we can pursue the second at initialization time by tuning the space of possible weights for core represented by the matrix of synapse strength values in our approach we wish to minimize redundancy between neurons on core by attempting to induce product distribution on the outputs for every pair of neurons to simplify the problem we note that the summed weighted inputs to pair of neurons is wellapproximated by gaussian distribution thus forcing the covariance between the summed weighted inputs to zero guarantees that the inputs are independent furthermore since functions of pairwise independent random variables remain independent the neuron outputs are guaranteed to be independent axons type axons type axons type axons type synapse strength neurons neurons neurons the summed weighted input to neuron is given by equation it is desirable for the purposes of maintaining balance in neuron dynamics to configure its weights using mix of positive and negative values that sum to zero thus for all figure synapse strength values dex sij picted as axons rows neurons columns array the learning procedure fixes these values when the network is initialized and which implies that ij assuming inputs and learns the probability that each synapse is synaptic connection states are both decorrelated and in transmitting state the blocky appearthe bias term is near this simplifies the covariance ance of the strength matrix is the result of between the inputs to any two neurons on core to the shared synaptic strength approach used by truenorth to reduce memory footprint ij ir xi cij sij xq cqr sqr rearranging terms we get ij ir cij sij cqr sqr cij sij cqr sqr xi xq next we note from the equation for covariance that xi xq xi xq xi xq under the assumption that inputs have equal mean and variance then for any where xi xq xi xq is constant further assuming that covariance between xi and xq where is the same for all inputs then xi xq where xi xq xi xq is constant using this and equation equation becomes ij ir hcj sj cr sr cij sij sir hcj sj cr sr hcj sj cr sr hcj sj cr sr so minimizing the absolute value of the inner product between columns of forces ij and ir to be maximally uncorrelated under the constraints inspired by this observation we apriori without any knowledge of the input data choose the strength values such that the absolute value of the inner product between columns of the effective weight matrix is minimized and the sum of effective weights to each neuron is zero practically this is achieved by assigning half of each neuron to and the other half to balancing the possible permutations of such assignments so they occur as equally as possible across neurons on core and evenly distributing the four possible axon types amongst the axons on core the resulting matrix of synaptic strength values can be seen in figure this configuration thus provides an optimal weight subspace given the constraints in which backpropagation can operate in datadriven fashion to find desirable synaptic states cm one core neurons input window stride neurons layer input input core network core network figure two network configurations used for the results described here core network designed to minimize core count and core 
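A simplified sketch of this initialization is given below. It keeps the two stated constraints, strengths for each neuron balanced between a positive and a negative value so that they sum to zero, and columns chosen so that their pairwise inner products stay small, but it ignores the TrueNorth-specific axon-type sharing; the values +1/-1 and the greedy candidate search are our own illustrative choices.

import numpy as np

def init_strengths(n_inputs, n_neurons, n_candidates=200, seed=0):
    # Build the (n_inputs x n_neurons) synapse-strength matrix: each column is a
    # balanced +/-1 vector (strengths to each neuron sum to zero), chosen greedily
    # to keep the worst-case |inner product| with earlier columns small, which
    # decorrelates the summed inputs of neurons on a core.
    rng = np.random.default_rng(seed)
    assert n_inputs % 2 == 0
    base = np.array([1.0] * (n_inputs // 2) + [-1.0] * (n_inputs // 2))
    cols = []
    for _ in range(n_neurons):
        best, best_score = None, np.inf
        for _ in range(n_candidates):
            cand = rng.permutation(base)
            score = 0.0 if not cols else np.max(np.abs(np.array(cols) @ cand))
            if score < best_score:
                best, best_score = cand, score
        cols.append(best)
    return np.stack(cols, axis=1)

Backpropagation then operates on the connection probabilities within this fixed weight subspace.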
network designed to maximize accuracy board with socketed truenorth chip used to run the deployment networks the chip is runs in real time ms neuron updates and consumes mw running benchmark network that uses all of its million neuron measured accuracy and measured energy for the two network configurations running on the chip ensemble size is shown to the right of each data point network topology the network topology is designed to support neurons with responses to local regional or global features while respecting the connectivity of the truenorth architecture namely that all neurons on core share access to the same set of inputs and that the number of such inputs is limited the network uses multilayer feedforward scheme where the first layer consists of input elements in rows columns channels array such as an image and the remaining layers consist of truenorth cores connections between layers are made using sliding window approach input to each core in layer is drawn from an input window figure where represents the row and column dimensions and represents the feature dimension for input from the first layer rows and columns are in units of input elements and features are input channels while for input from the remaining layers rows and columns are in units of cores and features are neurons the first core in given target layer locates its input window in the upper left corner of its source layer and the next core in the target layer shifts its input window to the right by stride of successive cores slide the window over by until the edge of the source layer is reached then the window is returned to the left and shifted down by and the process is repeated features are randomly with the constraint that each neuron can only be selected by one target core we allow input elements to be selected multiple times this scheme is similar in some respects to that used by convolution network but we employ independent synapses for each location the specific networks employed here and associated parameters are shown in figure results we applied the training method described above to the mnist dataset examining accuracy energy tradeoffs using two networks running on the truenorth chip figure the first network is the smallest multilayer truenorth network possible for the number of pixels present in the dataset consisting of cores distributed in layers corresponding to neurons the second network was built with primary goal of maximizing accuracy and is composed of cores distributed in layers figure corresponding to neurons networks are configured with first layer using and in both networks and in the core network and in the core network while all subsequent layers in both networks use and these parameters result in pyramid shape where all cores from layer to the final layer draw input from source cores and neurons in each of those sources each core employs neurons per core it targets up to maximum of neurons we tested each network in an ensemble of or members running on truenorth chip in each image was encoded using single time step ms with different spike sampling used for each input line targeted by pixel the instrumentation available measures active power for the network in operation and leakage power for the entire chip which consists of cores we report energy numbers as active power plus the fraction of leakage power for the cores in use the highest overall performance we observed of was achieved with core trained network using member ensemble for total of cores that was measured using µj per 
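The sliding-window wiring between layers can be made concrete with a short helper that enumerates the input-window origin for each core in a target layer; the window size and stride are left as parameters here because the specific per-layer settings are given in the figure rather than recoverable from the text.

def window_origins(src_rows, src_cols, window, stride):
    # Upper-left corner of each successive core's input window: sweep left to
    # right by the stride, then return to the left edge and shift down, until
    # the source layer is covered.
    origins = []
    row = 0
    while row + window <= src_rows:
        col = 0
        while col + window <= src_cols:
            origins.append((row, col))
            col += stride
        row += stride
    return origins

Each origin corresponds to one target-layer core, whose axons are wired to the neurons (or input elements) inside that window.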
classification the lowest energy was achieved by the core network operating in an ensemble of that was measured using µj per classification while achieving accuracy results are plotted showing accuracy energy in figure both networks classified images per second discussion our results show that backpropagation operating in probabilistic domain can be used to train networks that naturally map to neuromorphic hardware with spiking neurons and extremely lowprecision synapses our approach can be succinctly summarized as where we first constrain our network to provide direct representation of our deployment system and then train within those constraints this can be contrasted with approach where network agnostic to the final deployment system is first trained and following training is constrained through normalization and discretization methods to provide spiking representation or low precision weights while requiring customized training rule the approach offers the advantage that decrease in training error has direct correspondence to decrease in error for the deployment network conversely the approach allows use of off the shelf training methods but unconstrained training is not guaranteed to produce reduction in error after hardware constraints are applied looking forward we see several avenues for expanding this approach to more complex datasets first deep convolution networks have seen great deal of success by using backpropagation to learn the weights of convolutional filters the learning method introduced here is independent of the specific network structure beyond the given sparsity constraint and could certainly be adapted for use in convolution networks second biology provides number of examples such as the retina or cochlea for mapping sensory data into binary spiking representation drawing inspiration from such approaches may improve performance beyond the linear mapping scheme used in this work third this approach may also be adaptable to other gradient based learning methods or to methods with existing probabilistic components such as contrastive divergence further while we describe the use of this approach with truenorth to provide concrete use case we see no reason why this training approach can not be used with other spiking neuromorphic hardware we believe this work is particularly timely as in recent years backpropagation has achieved high level of performance on number tasks reflecting real world tasks including object detection in complex scenes pedestrian detection and speech recognition wide range of sensors are found in mobile devices ranging from phones to automobiles and platforms like truenorth provide low power substrate for processing that sensory data by bridging backpropagation and energy efficient neuromorphic computing we hope that the work here provides an important step towards building scalable systems with real world applicability acknowledgments this research was sponsored by the defense advanced research projects agency under contracts no and no the views opinions findings contained in this paper are those of the authors and should not be interpreted as representing the official views or policies of the department of defense or the government references russakovsky deng su krause satheesh ma huang karpathy khosla bernstein berg and imagenet large scale visual recognition challenge international journal of computer vision ouyang and wang joint deep learning for pedestrian detection in international conference on computer vision pp hannun case casper 
catanzaro diamos elsen prenger satheesh sengupta coates et deepspeech scaling up speech recognition arxiv preprint benjamin gao mcquinn choudhary chandrasekaran bussat arthur merolla and boahen neurogrid multichip system for largescale neural simulations proceedings of the ieee vol no pp painkras plana garside temple galluppi patterson lester brown and furber spinnaker for neural network simulation ieee journal of circuits vol no pp pfeil jeltsch petrovici schmuker schemmel and meier six networks on universal neuromorphic computing substrate frontiers in neuroscience vol merolla arthur cassidy sawada akopyan jackson imam guo nakamura et million integrated circuit with scalable communication network and interface science vol no pp rumelhart hinton and williams learning representations by errors nature vol no pp moerland and fiesler neural network adaptations to hardware implementations in handbook of neural computation fiesler and beale eds new york institute of physics publishing and oxford university publishing fiesler choudry and caulfield weight discretization paradigm for optical neural networks in the hague april pp international society for optics and photonics cao chen and khosla spiking deep convolutional neural networks for object recognition international journal of computer vision vol no pp diehl neil binas cook liu and pfeiffer spiking deep networks through weight and threshold balancing in international joint conference on neural networks in press muller and indiveri rounding methods for neural networks with low resolution synaptic weights arxiv preprint zhao and van daalen learning in stochastic bit stream neural networks neural networks vol no pp cheng soudry mao and lan training binary multilayer neural networks for image classification using expectation backpropgation arxiv preprint soudry hubara and meir expectation backpropagation training of multilayer neural networks with continuous or discrete weights in advances in neural information processing systems pp stromatias neil galluppi pfeiffer liu and furber scalable lowlatency implementations of spiking deep belief networks on spinnaker in international joint conference on neural networks ieee in press cassidy merolla arthur esser jackson datta sawada wong feldman amir rubin akopyan mcquinn risk and modha cognitive computing building block versatile and efficient digital neuron model for neurosynaptic cores in international joint conference on neural networks hinton srivastava krizhevsky sutskever and salakhutdinov improving neural networks by preventing of feature detectors arxiv preprint bell and sejnowski an approach to blind separation and blind deconvolution neural computation vol no pp lecun bottou bengio and haffner learning applied to document recognition proceedings of the ieee vol no pp hinton and salakhutdinov reducing the dimensionality of data with neural networks science vol no pp 
learning both weights and connections for efficient neural networks jeff pool nvidia jpool song han stanford university songhan william dally stanford university nvidia dally john tran nvidia johntran abstract neural networks are both computationally intensive and memory intensive making them difficult to deploy on embedded systems also conventional networks fix the architecture before training starts as result training can not improve the architecture to address these limitations we describe method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy by learning only the important connections our method prunes redundant connections using method first we train the network to learn which connections are important next we prune the unimportant connections finally we retrain the network to fine tune the weights of the remaining connections on the imagenet dataset our method reduced the number of parameters of alexnet by factor of from million to million without incurring accuracy loss similar experiments with found that the total number of parameters can be reduced by from million to million again with no loss of accuracy introduction neural networks have become ubiquitous in applications ranging from computer vision to speech recognition and natural language processing we consider convolutional neural networks used for computer vision tasks which have grown over time in lecun et al designed cnn model with less than parameters to classify handwritten digits while in krizhevsky et al won the imagenet competition with parameters deepface classified human faces with parameters and coates et al scaled up network to parameters while these large neural networks are very powerful their size consumes considerable storage memory bandwidth and computational resources for embedded mobile applications these resource demands become prohibitive figure shows the energy cost of basic arithmetic and memory operations in cmos process from this data we see the energy per connection is dominated by memory access and ranges from for bit coefficients in sram to for coefficients in dram large networks do not fit in storage and hence require the more costly dram accesses running billion connection neural network for example at would require just for dram access well beyond the power envelope of typical mobile device our goal in pruning networks is to reduce the energy required to run such large networks so they can run in real time on mobile devices the model size reduction from pruning also facilitates storage and transmission of mobile applications incorporating dnns relative energy cost operation energy pj relative cost bit int add bit float add bit register file bit int mult bit float mult bit sram cache bit dram memory figure energy table for cmos process memory access is orders of magnitude more energy expensive than simple arithmetic to achieve this goal we present method to prune network connections in manner that preserves the original accuracy after an initial training phase we remove all connections whose weight is lower than threshold this pruning converts dense layer to sparse layer this first phase learns the topology of the networks learning which connections are important and removing the unimportant connections we then retrain the sparse network so the remaining connections can compensate for the connections that have been removed the phases of pruning and retraining may be repeated iteratively to further reduce network 
complexity in effect this training process learns the network connectivity in addition to the weights much as in the mammalian brain where synapses are created in the first few months of child development followed by gradual pruning of connections falling to typical adult values related work neural networks are typically and there is significant redundancy for deep learning models this results in waste of both computation and memory there have been various proposals to remove the redundancy vanhoucke et al explored implementation with integer vs floating point activations denton et al exploited the linear structure of the neural network by finding an appropriate approximation of the parameters and keeping the accuracy within of the original model with similar accuracy loss gong et al compressed deep convnets using vector quantization these approximation and quantization techniques are orthogonal to network pruning and they can be used together to obtain further gains there have been other attempts to reduce the number of parameters of neural networks by replacing the fully connected layer with global average pooling the network in network architecture and googlenet achieves results on several benchmarks by adopting this idea however transfer learning reusing features learned on the imagenet dataset and applying them to new tasks by only the fully connected layers is more difficult with this approach this problem is noted by szegedy et al and motivates them to add linear layer on the top of their networks to enable transfer learning network pruning has been used both to reduce network complexity and to reduce an early approach to pruning was biased weight decay optimal brain damage and optimal brain surgeon prune networks to reduce the number of connections based on the hessian of the loss function and suggest that such pruning is more accurate than pruning such as weight decay however second order derivative needs additional computation hashednets is recent technique to reduce model sizes by using hash function to randomly group connection weights into hash buckets so that all connections within the same hash bucket share single parameter value this technique may benefit from pruning as pointed out in shi et al and weinberger et al sparsity will minimize hash collision making feature hashing even more effective hashednets may be used together with pruning to give even better parameter savings before pruning train connectivity after pruning pruning synapses prune connections pruning neurons train weights figure training pipeline figure synapses and neurons before and after pruning learning connections in addition to weights our pruning method employs process as illustrated in figure which begins by learning the connectivity via normal network training unlike conventional training however we are not learning the final values of the weights but rather we are learning which connections are important the second step is to prune the connections all connections with weights below threshold are removed from the network converting dense network into sparse network as shown in figure the final step retrains the network to learn the final weights for the remaining sparse connections this step is critical if the pruned network is used without retraining accuracy is significantly impacted regularization choosing the correct regularization impacts the performance of pruning and retraining regularization penalizes parameters resulting in more parameters near zero this gives better accuracy after pruning 
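An illustrative sketch of this pipeline for a single weight tensor is given below (NumPy; the training and gradient computations are placeholders, and the threshold rule, a quality parameter times the layer's weight standard deviation, is taken from the experiments section further on).

import numpy as np

def prune_layer(weights, quality):
    # Step 2: remove connections whose magnitude falls below the threshold;
    # the binary mask records which connections survive.
    threshold = quality * weights.std()
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

def masked_update(weights, grad, mask, lr):
    # Step 3: retrain only the surviving connections, starting from the
    # retained weights; pruned positions are held at exactly zero.
    return (weights - lr * grad) * mask

Pruning and retraining with masked updates can then be repeated for several rounds, as discussed under iterative pruning below.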
but before retraining however the remaining connections are not as good as with regularization resulting in lower accuracy after retraining overall regularization gives the best pruning results this is further discussed in experiment section dropout ratio adjustment dropout is widely used to prevent and this also applies to retraining during retraining however the dropout ratio must be adjusted to account for the change in model capacity in dropout each parameter is probabilistically dropped during training but will come back during inference in pruning parameters are dropped forever after pruning and have no chance to come back during both training and inference as the parameters get sparse the classifier will select the most informative predictors and thus have much less prediction variance which reduces as pruning already reduced model capacity the retraining dropout ratio should be smaller quantitatively let ci be the number of connections in layer cio for the original network cir for the network after retraining ni be the number of neurons in layer since dropout works on neurons and ci varies quadratically with ni according to equation thus the dropout ratio after pruning the parameters should follow equation where do represent the original dropout rate dr represent the dropout rate during retraining cir dr do ci ni cio local pruning and parameter during retraining it is better to retain the weights from the initial training phase for the connections that survived pruning than it is to the pruned layers cnns contain fragile features gradient descent is able to find good solution when the network is initially trained but not after some layers and retraining them so when we retrain the pruned layers we should keep the surviving parameters instead of them table network pruning can save to parameters with no drop in predictive performance network error error parameters ref pruned ref pruned alexnet ref alexnet pruned ref pruned compression rate retraining the pruned layers starting with retained weights requires less computation because we don have to back propagate through the entire network also neural networks are prone to suffer the vanishing gradient problem as the networks get deeper which makes pruning errors harder to recover for deep networks to prevent this we fix the parameters for conv layers and only retrain the fc layers after pruning the fc layers and vice versa iterative pruning learning the right connections is an iterative process pruning followed by retraining is one iteration after many such iterations the minimum number connections could be found without loss of accuracy this method can boost pruning rate from to on alexnet compared with aggressive pruning each iteration is greedy search in that we find the best connections we also experimented with probabilistically pruning parameters based on their absolute value but this gave worse results pruning neurons after pruning connections neurons with zero input connections or zero output connections may be safely pruned this pruning is furthered by removing all connections to or from pruned neuron the retraining phase automatically arrives at the result where dead neurons will have both zero input connections and zero output connections this occurs due to gradient descent and regularization neuron that has zero input connections or zero output connections will have no contribution to the final loss leading the gradient to be zero for its output connection or input connection respectively only the regularization term will 
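The adjustment can be written compactly. The square-root form below is our reconstruction of the garbled equation, based on the stated facts that C_i varies quadratically with N_i and that dropout acts on neurons rather than connections.

import numpy as np

def retrain_dropout_rate(d_orig, c_orig, c_retrain):
    # d_orig    : original dropout rate D_o
    # c_orig    : connections in the layer before pruning, C_io
    # c_retrain : connections surviving after pruning, C_ir
    # Since C_i ~ N_i * N_{i-1}, the effective neuron count scales like the
    # square root of the connection count, so the neuron-level dropout rate is
    # scaled by sqrt(C_ir / C_io).
    return d_orig * np.sqrt(c_retrain / c_orig)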
push the weights to zero thus the dead neurons will be automatically removed during retraining experiments we implemented network pruning in caffe caffe was modified to add mask which disregards pruned parameters during network operation for each weight tensor the pruning threshold is chosen as quality parameter multiplied by the standard deviation of layer weights we carried out the experiments on nvidia titanx and gpus we pruned four representative networks and on mnist together with alexnet and on imagenet the network parameters and accuracy before and after pruning are shown in table lenet on mnist we first experimented on mnist dataset with the and networks is fully connected network with two hidden layers with and neurons each which achieves error rate on mnist is convolutional network that has two convolutional layers and two fully connected layers which achieves error rate on mnist after pruning the network is retrained with of the original network original learning rate table shows reference model is from caffe model zoo accuracy is measured without data augmentation table for pruning reduces the number of weights by and computation by layer total weights flop act weights flop table for pruning reduces the number of weights by and computation by layer total weights flop act weights flop figure visualization of the first fc layer sparsity pattern of it has banded structure repeated times which correspond to the parameters in the center of the images since the digits are written in the center pruning saves parameters on these networks for each layer of the network the table shows left to right the original number of weights the number of floating point operations to compute that layer activations the average percentage of activations that are the percentage of weights after pruning and the percentage of actually required floating point operations an interesting byproduct is that network pruning detects visual attention regions figure shows the sparsity pattern of the first fully connected layer of the matrix size is it has bands each band width corresponding to the input pixels the colored regions of the figure indicating parameters correspond to the center of the image because digits are written in the center of the image these are the important parameters the graph is sparse on the left and right corresponding to the less important regions on the top and bottom of the image after pruning the neural network finds the center of the image more important and the connections to the peripheral regions are more heavily pruned alexnet on imagenet we further examine the performance of pruning on the imagenet dataset which has training examples and validation examples we use the alexnet caffe model as the reference model which has million parameters across convolutional layers and fully connected layers the alexnet caffe model achieved accuracy of and accuracy of the original alexnet took hours to train on nvidia titan gpu after pruning the whole network is retrained with of the original network initial learning rate it took hours to retrain the pruned alexnet pruning is not used when iteratively prototyping the model but rather used for model reduction when the model is ready for deployment thus the retraining time is less concern table shows that alexnet can be pruned to of its original size without impacting accuracy and the amount of computation can be reduced by table for alexnet pruning reduces the number of weights by and computation by remaining parameters pruned parameters fc to ta 
fc fc co co flop weights co act flop co weights co layer total table for pruning reduces the number of weights by and computation by layer total weights flop act weights flop on imagenet with promising results on alexnet we also looked at larger more recent network on the same dataset has far more convolutional layers but still only three layers following similar methodology we aggressively pruned both convolutional and layers to realize significant reduction in the number of weights shown in table we used five iterations of pruning an retraining the results are like those for alexnet very promising the network as whole has been reduced to of its original size smaller in particular note that the two largest layers can each be pruned to less than of their original size this reduction is critical for real time image processing where there is little reuse of fully connected layers across images unlike batch processing during training discussion the curve between accuracy and number of parameters is shown in figure the more parameters pruned away the less the accuracy we experimented with and regularization with and without retraining together with iterative pruning to give five trade off lines comparing solid and dashed lines the importance of retraining is clear without retraining accuracy begins dropping much sooner with of the original connections rather than with of the original connections it interesting to see that we have the free lunch of reducing the connections without losing accuracy even without retraining while with retraining we are ably to reduce connections by regularization retrain regularization retrain regularization iterative prune and retrain regularization retrain regularization retrain accuracy loss parametes pruned away figure curve for parameter reduction and loss in accuracy regularization performs better than at learning the connections without retraining while regularization performs better than at retraining iterative pruning gives the best result accuracy loss accuracy loss parameters parameters figure pruning sensitivity for conv layer left and fc layer right of alexnet regularization gives better accuracy than directly after pruning dotted blue and purple lines since it pushes more parameters closer to zero however comparing the yellow and green lines shows that outperforms after retraining since there is no benefit to further pushing values towards zero one extension is to use regularization for pruning and then for retraining but this did not beat simply using for both phases parameters from one mode do not adapt well to the other the biggest gain comes from iterative pruning solid red line with solid circles here we take the pruned and retrained network solid green line with circles and prune and retrain it again the leftmost dot on this curve corresponds to the point on the green line at pruning pruned to there no accuracy loss at not until does the accuracy begin to drop sharply two green points achieve slightly better accuracy than the original model we believe this accuracy improvement is due to pruning finding the right capacity of the network and hence reducing overfitting both conv and fc layers can be pruned but with different sensitivity figure shows the sensitivity of each layer to network pruning the figure shows how accuracy drops as parameters are pruned on basis the conv layers on the left are more sensitive to pruning than the fully connected layers on the right the first convolutional layer which interacts with the input image directly is 
most sensitive to pruning we suspect this sensitivity is due to the input layer having only channels and thus less redundancy than the other convolutional layers we used the sensitivity results to find each layer threshold for example the smallest threshold was applied to the most sensitive layer which is the first convolutional layer storing the pruned layers as sparse matrices has storage overhead of only storing relative rather than absolute indices reduces the space taken by the fc layer indices to bits similarly conv layer indices can be represented with only bits table comparison with other model reduction methods on alexnet pruning saved only parameters with much loss of accuracy deep fried convnets worked on fully connected layers only and reduced the parameters by less than reduced the parameters by with inferior accuracy naively cutting the layer size saves parameters but suffers from loss of accuracy exploited the linear structure of convnets and compressed each layer individually where model compression on single layer incurred accuracy penalty with biclustering svd network error error parameters baseline caffemodel pruning collins kohli naive cut svd network pruning weight distribution before pruning weight distribution after pruning and retraining count count compression rate weight value weight value figure weight distribution before and after parameter pruning the right figure has smaller scale after pruning the storage requirements of alexnet and vggnet are are small enough that all weights can be stored on chip instead of dram which takes orders of magnitude more energy to access table we are targeting our pruning method for hardware specialized for sparse dnn given the limitation of general purpose hardware on sparse computation figure shows histograms of weight distribution before left and after right pruning the weight is from the first fully connected layer of alexnet the two panels have different scales the original distribution of weights is centered on zero with tails dropping off quickly almost all parameters are between after pruning the large center region is removed the network parameters adjust themselves during the retraining phase the result is that the parameters form bimodal distribution and become more spread across the between conclusion we have presented method to improve the energy efficiency and storage of neural networks without affecting accuracy by finding the right connections our method motivated in part by how learning works in the mammalian brain operates by learning which connections are important pruning the unimportant connections and then retraining the remaining sparse network we highlight our experiments on alexnet and vggnet on imagenet showing that both fully connected layer and convolutional layer can be pruned reducing the number of connections by to without loss of accuracy this leads to smaller memory capacity and bandwidth requirements for image processing making it easier to be deployed on mobile systems references alex krizhevsky ilya sutskever and geoffrey hinton imagenet classification with deep convolutional neural networks in advances in neural information processing systems pages alex graves and schmidhuber framewise phoneme classification with bidirectional lstm and other neural network architectures neural networks ronan collobert jason weston bottou michael karlen koray kavukcuoglu and pavel kuksa natural language processing almost from scratch jmlr yann lecun leon bottou yoshua bengio and patrick haffner learning applied 
to document recognition proceedings of the ieee yaniv taigman ming yang marc aurelio ranzato and lior wolf deepface closing the gap to performance in face verification in cvpr pages ieee adam coates brody huval tao wang david wu bryan catanzaro and ng andrew deep learning with cots hpc systems in icml pages mark horowitz energy table for process stanford vlsi wiki jp rauschecker neuronal mechanisms of developmental plasticity in the cat visual system human neurobiology christopher walsh peter huttenlocher nature misha denil babak shakibi laurent dinh nando de freitas et al predicting parameters in deep learning in advances in neural information processing systems pages vincent vanhoucke andrew senior and mark mao improving the speed of neural networks on cpus in proc deep learning and unsupervised feature learning nips workshop emily denton wojciech zaremba joan bruna yann lecun and rob fergus exploiting linear structure within convolutional networks for efficient evaluation in nips pages yunchao gong liu liu ming yang and lubomir bourdev compressing deep convolutional networks using vector quantization arxiv preprint song han huizi mao and william dally deep compression compressing deep neural network with pruning trained quantization and huffman coding arxiv preprint min lin qiang chen and shuicheng yan network in network arxiv preprint christian szegedy wei liu yangqing jia pierre sermanet scott reed dragomir anguelov dumitru erhan vincent vanhoucke and andrew rabinovich going deeper with convolutions arxiv preprint stephen hanson and lorien pratt comparing biases for minimal network construction with in advances in neural information processing systems pages yann le cun john denker and sara solla optimal brain damage in advances in neural information processing systems pages morgan kaufmann babak hassibi david stork et al second order derivatives for network pruning optimal brain surgeon advances in neural information processing systems pages wenlin chen james wilson stephen tyree kilian weinberger and yixin chen compressing neural networks with the hashing trick arxiv preprint qinfeng shi james petterson gideon dror john langford alex smola and svn vishwanathan hash kernels for structured data the journal of machine learning research kilian weinberger anirban dasgupta john langford alex smola and josh attenberg feature hashing for large scale multitask learning in icml pages acm nitish srivastava geoffrey hinton alex krizhevsky ilya sutskever and ruslan salakhutdinov dropout simple way to prevent neural networks from overfitting jmlr jason yosinski jeff clune yoshua bengio and hod lipson how transferable are features in deep neural networks in advances in neural information processing systems pages yoshua bengio patrice simard and paolo frasconi learning dependencies with gradient descent is difficult neural networks ieee transactions on yangqing jia et al caffe convolutional architecture for fast feature embedding arxiv preprint karen simonyan and andrew zisserman very deep convolutional networks for image recognition corr suraj srinivas and venkatesh babu parameter pruning for deep neural networks arxiv preprint zichao yang marcin moczulski misha denil nando de freitas alex smola le song and ziyu wang deep fried convnets arxiv preprint maxwell collins and pushmeet kohli memory bounded deep convolutional networks arxiv preprint 
optimal rates for random fourier features bharath department of statistics pennsylvania state university university park pa usa gatsby unit csml ucl sainsbury wellcome centre howland street london uk abstract kernel methods represent one of the most powerful tools in machine learning to tackle problems expressed in terms of function values and derivatives due to their capability to represent and model complex relations while these methods show good versatility they are computationally intensive and have poor scalability to large data as they require operations on gram matrices in order to mitigate this serious computational limitation recently randomized constructions have been proposed in the literature which allow the application of fast linear algorithms random fourier features rff are among the most popular and widely applied constructions they provide an easily computable feature representation for kernels despite the popularity of rffs very little is understood theoretically about their approximation quality in this paper we provide detailed theoretical analysis about the approximation quality of rffs by establishing optimal in terms of the rff dimension and growing set size performance guarantees in uniform norm and ii presenting guarantees in lr norms we also propose an rff approximation to derivatives of kernel with theoretical study on its approximation quality introduction kernel methods have enjoyed tremendous success in solving several fundamental problems of machine learning ranging from classification regression feature extraction dependency estimation causal discovery bayesian inference and hypothesis testing such success owes to their capability to represent and model complex relations by mapping points into high possibly infinite dimensional feature spaces at the heart of all these techniques is the kernel trick which allows to implicitly compute inner products between these high dimensional feature maps via kernel function hλ however this flexibility and richness of kernels has price by resorting to implicit computations these methods operate on the gram matrix of the data which raises serious computational challenges while dealing with data in order to resolve this bottleneck numerous solutions have been proposed such as matrix approximations explicit feature maps designed for additive kernels hashing and random fourier features rff constructed for kernels the focus of the current paper rffs implement an extremely simple yet efficient idea instead of relying on the implicit feature map associated with the kernel by appealing to bochner theorem bounded continuous kernel is the fourier transform of probability proposed an explicit random fourier feature map obtained by empirically approximating the fourier integral so that hφ the advantage of this explicit feature representation is that the kernel machine can be efficiently solved in the primal form through fast linear solvers thereby enabling to handle data through numerical experiments it has also been demonstrated that kernel algorithms constructed using the approximate kernel do not contributed equally suffer from significant performance degradation another advantage with the rff approach is that unlike low rank matrix approximation approach which also speeds up kernel machines it approximates the entire kernel function and not just the kernel matrix this property is particularly useful while dealing with data and also in online learning applications the rff technique has found wide applicability in several areas such 
as fast regression differential privacy preserving and causal discovery despite the success of the rff method surprisingly very little is known about its performance guarantees to the best of our knowledge the only paper in the machine learning literature providing certain theoretical insight into the accuracy of kernel approximationpvia rff is it shows that am sup hφ op log for any compact set rd where is the number of random fourier features however since the approximation proposed by the rff method involves empirically approximating the fourier integral the rff estimator can be thought of as an empirical characteristic function ecf in the probability literature the systematic study of was initiated by and followed up by while shows the almost sure convergence of am to zero theorems and and theorems and show that the optimal rate is in addition shows that almost sure convergence can not be attained over the entire space rd if the characteristic function decays to zero at infinity due to this study the convergence behavior of am when the diameter of grows with and show that almost sure convergence of am is guaranteed as long as the diameter of is eo unfortunately all these results to the best of our knowledge are asymptotic in nature and the only known guarantee by is in this paper see section we present probabilistic bound for am that holds for any and provides the optimal rate of for any compact set along with guaranteeing the almost sure convergence of am as long as the diameter of is eo since convergence in uniform norm might sometimes be too strong requirement and may not be suitable to attain correct rates in the generalization bounds associated with learning algorithms involving we also study the behavior of hφ in lr and obtain an optimal rate of the rff approach to approximate kernel can be seen as special of the problem of approximating function in the barycenter of family say of functions which was considered in however the approximation guarantees in theorem do not directly apply to rff as the assumptions on are not satisfied by the cosine function which is the family of functions that is used to approximate the kernel in the rff approach while careful modification of the proof of theorem could yield rate of approximation for any compact set this result would still be by providing linear dependence on similar to the theorems in in contrast to the optimal logarithmic dependence on that is guaranteed by our results traditionally kernel based algorithms involve computing the value of the kernel recently kernel algorithms involving the derivatives of the kernel the gram matrix consists of derivatives of the kernel computed at training samples have been used to address numerous machine learning tasks or hermite learning with gradient information nonlinear variable selection gradient learning and fitting of distributions in an exponential family given the importance of these derivative based kernel algorithms similar to in section we propose finite dimensional random feature map approximation to kernel derivatives which can be used to speed up the above mentioned derivative based kernel algorithms we present bound that quantifies the quality of approximation in uniform and lr and show the rate of convergence to be in both these cases summary of our contributions are as follows we provide the first detailed performance analysis of rffs for approximating kernels and their derivatives prove uniform and lr convergence on fixed compacts sets with optimal rate in terms of the rff dimension 
give sufficient conditions for the growth rate of compact sets while preserving convergence uniformly and in lr specializing our result we match the best attainable asymptotic growth rate derived tighter constants compared to and also considered different rff implementations for example in applications like kernel ridge regression based on rff it is more appropriate to consider the approximation guarantee in norm than in the uniform norm various notations and definitions that are used throughout the paper are provided in section along with brief review of rff approximation proposed by the missing proofs of the results in sections and are provided in the supplementary material notations preliminaries in this section we introduce notations that are used throughout the paper and then present preliminaries on kernel approximation through random feature maps as introduced by definitions notation for topological space resp cb denotes the space of all continuous resp bounded continuous functions on for cb kf kx is the supremum norm of mb and is the set of all finite borel and probability measures on respectively for mb lr denotes the banach space of functions for rd we will use lr for lr if is lebesgue measure denotes the lr of for on for lr kf klr dµ and we write it as if and is the lebesgue measure for any where pm we define pf dp and pm xi where xi pm δxi is the empirical measure and δx is dirac measure supported on supp denotes the support of pm denotes the product measure qp for vd rd vi the diameter of where is metric space is defined as sup if rd with we denote the diameter of as if is compact the volume of rd is defined as vol dx for rd we define conv is the convex hull of for function defined on open set rd rd qd qd where nd are pj and define vp vj an for positive sequences an bn an bn if bn xn op rn resp in probability resp almost surely dx rn denotes that rn is the gamma function and tγ random feature maps let rd rd be bounded continuous positive definite kernel there exists positive definite function rd such that rd where cb rd by bochner theorem theorem can be represented as the fourier transform of finite borel measure on rd cos dλ dλ rd rd where follows from the fact that is and symmetric since rd dp where rd therefore we assume throughout the paper that and so rd based on proposed an approximation to by replacing with its empirical measure λm constructed from ωi so that resultant approximation can be written as the euclidean inner product of finite dimensional random feature maps cos ωit hφ where cos cos ωm and holds based on sin sin ωm the basic trigonometric identity cos cos cos sin this elegant approximation to is particularly useful in speeding up algorithms as the random feature map can be used to solve these algorithms in the primal thereby offering better computational complexity than by solving them in the dual while at the same time not lacking in performance apart from these practical advantages claim and similarly prop provides theoretical guarantee that as for any compact set rd formally claim showed that is slightly different but more precise than the one in the statement of claim in any cd λm ωi when the where dλ and cd condition implies that and therefore is twice differentiable from it is clear that the probability has polynomial tails if small and gaussian tails if large and can be equivalently written as log log kk λm ωi where cd for sufficiently large it follows from that op log while shows that is consistent estimator of in the topology of compact pconvergence 
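For reference, the random Fourier feature map described above takes only a few lines to implement. The Gaussian kernel, whose spectral measure Λ is itself Gaussian, is used here purely as an illustrative choice; any kernel whose spectral measure can be sampled works identically.

import numpy as np

def rff_features(X, m, bandwidth=1.0, seed=0):
    # Random Fourier features: phi(X) has shape (n, 2m) and phi(x) . phi(y)
    # approximates k(x, y) = exp(-||x - y||^2 / (2 * bandwidth**2)).
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, m))   # omega_j drawn from Lambda
    proj = X @ W
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

Because cos(a)cos(b) + sin(a)sin(b) = cos(a - b), the inner product of these features is exactly the empirical average (1/m) sum_j cos(omega_j^T (x - y)), and the n x n Gram matrix of phi has rank at most 2m, which is what allows kernel machines to be solved with fast linear solvers in the primal.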
convergences to uniformly over compact sets the rate of convergence of log is not optimal in addition the order of dependence on is not optimal while faster rate in fact an optimal rate of convergence is rates in can lead to better convergence rates for the excess error of the kernel machine constructed using the order of dependence on is also important as it determines the the number of rff features that are needed to achieve given approximation accuracy in fact the order of dependence on controls the rate at which can be grown as function of when see remark ii for detailed discussion about the significance of growing in the following section we present an analogue of theorem provides optimal rates and has correct dependence on main results approximation of as discussed in sections and while the random feature map approximation of introduced by has many practical advantages it does not seem to be theoretically the existing theoretical results on the quality of approximation do not provide complete picture owing to their in this section we first present our main result see theorem that improves upon and provides rate of with logarithm dependence on we then discuss the consequences of theorem along with its optimality in remark next in corollary and theorem we discuss the lr of to over compact subsets of rd theorem definite and where cb is positive kωk dλ then for any and compact set rd ωi where log log log proof sketch note that supx where gx cos which means the object of interest is the suprema of an empirical process indexed by instead of bounding by using hoeffding inequality on cover of and then applying union bound as carried out in we use the refined technique of applying concentration via mcdiarmid inequality followed by symmetrization and bound the rademacher average by dudley entropy bound the result is obtained by carefully bounding the λm number of the details are provided in section of the supplementary material remark theorem shows that is consistent estimator of in the topology of compact convergence as with the rate of convergence being log almost sure convergence is guaranteed by the first lemma in comparison to it is clear that theorem provides improved rates with better constants and logarithmic dependence on instead of linear dependence the logarithmic dependence on ensures that we need log random features instead of log random features significantly fewer features to achieve the same approximation accuracy of ii growing diameter while theorem provides almost sure convergence uniformly over compact sets one might wonder whether it is possible to achieve uniform convergence over rd section showed that such result is possible if is discrete measure but not possible for that is absolutely continuous the lebesgue measure if has density since uniform convergence of to over rd is not possible for many interesting gaussian kernel it is of interest to study the convergence on whose diameter grows with therefore as mentioned in section the order of dependence of rates on is critical suppose as we write instead of to show the explicit dependence on then theorem shows that is consistent estimator of in the topology of compact convergence if log as eo in contrast to the result in which requires log in other words theorem ensures consistency even when grows exponentially in whereas ensures consistency only if does not grow faster than log iii optimality note that is the characteristic function of rd since is the fourier transform of by bochner theorem therefore the object of interest 
is the function pmuniform norm of the difference between and the empirical characteristic cos hωi when both are restricted to compact set the question of the conm vergence behavior of is not new and has been studied in great detail in the probability and statistics literature see for and for where the characteristic function is not just symmetric function like but is hermitian theorems and show that the optimal rate of convergence of is when which matches with our result in theorem also theorems and in show that the logarithmic dependence on is optimal asymptotically in particular theorem matches with the growing diameter result in remark ii while theorem shows that if is absolutely continuous the lebesgue measure and if lim log then there exists positive such that lim λm ψksm this means the rate eo is not only the best possible in general for almost sure convergence but if faster sequence is considered then even stochastic convergence can not be retained for any characteristic function vanishing at infinity along at least one path while these previous results match with that of theorem and its consequences we would like to highlight the fact that all these previous results are asymptotic in nature whereas theorem provides probabilistic inequality that holds for any we are not aware of any such result except for the one in using theorem one can obtain probabilistic inequality for the lr of over any compact set rd as given by the following result corollary suppose satisfies the assumptions in theorem then for any and compact set rd λm ωi kklr where kklr kklr proof note that dx dy kklr the result follows byocombining theorem and the fact that vol vol where which follows from corollary and vol rd corollary shows that kklr log and therefore if as then consistency of in lr sm is achieved as long as log as this means in comparison to the uniform normr in theoremr where can grow exponential in mδ can not grow faster than log to achieve consistency in lr instead of using theorem to obtain bound on kklr this bound may be weak as kklr for any better bound for can be obtained by directly bounding kklr as shown in the following result theorem suppose rd where cb rd is positive definite then for any and compact set rd cr λm ωi kklr where is the khintchine constant given by for and for proof sketch as in theorem we show that kk satisfies the bounded difference property hence by the mcdiarmid inequality it concentrates around its expectation ekk by symmetrization we then show that ekk is upper bounded in terms of pm eε εi cos hωi klr where εi are rademacher random variables by exploiting the fact that is banach space of type min the result follows the details are provided in section of the supplementary material remark theorem shows an improved dependence on without the extra log factor given in corollary and therefore provides better rate for when the diameter of grows kklr sm if as however for theorem provides slower rate than corollary and therefore it is appropriate to use the bound in corollary while one might wonder why we only considered the convergence of kklr and not kklr rd it is important to note that the latter is not because lr rd even if lr rd approximation of kernel derivatives in the previous section we focused on the approximation of the kernel function where we presented uniform and lr convergence guarantees on compact sets for the random fourier feature approximation and discussed how fast the diameter of these sets can grow to preserve uniform and lr convergence almost surely in this 
section we propose an approximation to derivatives of the kernel and analyze the uniform and lr convergence behavior of the proposed approximation as motivated in section the question of approximating the derivatives of the kernel through finite dimensional random feature map is also important as it enables to speed up several interesting machine learning tasks that involve the derivatives of the kernel see for example the recent infinite dimensional exponential family fitting technique which implements this idea to this end we consider as in and define ha cos πa in other words cos sin cos sin and mod for assuming dλ it follows from the dominated convergence theorem that dλ zr dλ rd so that can be approximated by replacing with λm resulting in sp ωj ωjt hφp φq where φp ωm ωm ωm ωm and ωj now the goal is to understand the behavior of ksp and ks kklr for obtain analogues of theorems and as in the proof sketch of theorem while ksp can be analyzed as the suprema of an empirical process indexed by suitable function class say some technical issues arise because is not uniformly bounded this means mcdiarmid or talagrand inequality can not be applied to achieve concentration and bounding rademacher average by dudley entropy bound may not be reasonable while these issues can be tackled by resorting to more technical and refined methods in this paper we generalize see theorem which is proved in section of the supplement theorem to derivatives under the restrictive assumption that supp is bounded note that many popular kernels including the gaussian do not satisfy this assumption we also present another result see theorem by generalizing the proof of to unbounded functions where the boundedness assumption of supp is relaxed but at the expense of worse rate compared to theorem theorem let nd tp cp and assume that suppose supp is bounded if and then for any and compact set rd tp ωi where log log remark note that theorem reduces to theorem if in which case tp if or then the boundedness of supp implies that tp and ii growth of by the same reasoning as in remark ii and corollary it follows that sp ksm if eo and sp klr sm if log for as an exact analogue of theorem can be obtained but with different constants under the assumption that supp is bounded and it can be shown that for sp klr sm if the following result relaxes the boundedness of supp by imposing certain moment conditions on but at the expense of worse rate the proof relies on applying bernstein inequality at the elements of net which exists by the compactness of combined with union bound and extending the approximation error from the anchors by probabilistic lipschitz argument theorem let nd be continuously differentiable be continuous rd be any compact set dp and ep assume that ep suppose such that lm we also correct some technical issues in the proof of claim where argument was pm applied to the invariant kernel estimator cos ωj bj cos ωj bj leading to cos ωj cos ωj ii the convexity of was not imposed possibly undefined lipschitz constant and iii the randomness of arg was not taken into account thus the upper bound on the expectation of the squared lipschitz constant does not hold where define fd then λm ωi sp ǫl dp ep fd remark the compactness of implies that of hence by the continuity one gets dp holds if and if supp is bounded then the boundedness of is guaranteed see section in the supplement ii in the special case when our requirement boils down to the continuously differentiability of and iii note that is similar to and therefore based on 
the discussion in section one has sp log but the advantage with theorem over claim and prop is that it can handle unbounded functions in comparison to theorem we obtain worse rates and it will be of interest to improve the rates of theorem while handling unbounded functions discussion in this paper we presented the first detailed theoretical analysis about the approximation quality of random fourier features rff that was proposed by in the context of improving the computational complexity of kernel machines while provided probabilistic bound on the uniform approximation over compact subsets of rd of kernel by random features the result is not optimal we improved this result by providing bound with optimal rate of convergence and also analyzed the quality of approximation in lr we also proposed an rff approximation for derivatives of kernel and provided theoretical guarantees on the quality of approximation in uniform and lr over compact subsets of rd while all the results in this paper and also in the literature dealt with the approximation quality of rff over only compact subsets of rd it is of interest to understand its behavior over entire rd however as discussed in remark ii and in the paragraph following theorem rff can not approximate the kernel uniformly or in lr over rd by truncating the taylor series expansion of the exponential function proposed finite dimensional representation to approximate the gaussian kernel which also enjoys the computational advantages of rff however this representation also does not approximate the gaussian kernel uniformly over rd therefore the question remains whether it is possible to approximate kernel uniformly or in lr over rd but still retaining the computational advantages associated with rff acknowledgments wishes to thank the gatsby charitable foundation for its generous support references alaoui and mahoney fast randomized kernel ridge regression with statistical guarantees in nips chaudhuri monteleoni and sarwate differentially private empirical risk minimization journal of machine learning research cotter keshet and srebro explicit approximations of the gaussian kernel technical report http multivariate empirical characteristic functions zeitschrift wahrscheinlichkeitstheorie und verwandte gebiete and totik on how long interval is the empirical characteristic function uniformly consistent acta scientiarum mathematicarum fd is monotonically decreasing in drineas and mahoney on the method for approximating gram matrix for improved learning journal of machine learning research feuerverger and mureika the empirical characteristic function and its applications annals of statistics folland real analysis modern techniques and their applications kulis and grauman kernelized hashing ieee transactions on pattern analysis and machine intelligence muandet and tolstikhin towards learning theory of inference jmlr cp icml pages maji berg and malik efficient classification for additive kernel svms ieee transactions on pattern analysis and machine intelligence oliva neiswanger xing and schneider fast function to function regression jmlr cp aistats pages rahimi and recht random features for kernel machines in nips pages rahimi and recht uniform approximation of functions with random bases in allerton pages rosasco santoro mosci verri and villa regularization approach to nonlinear variable selection jmlr cp aistats rosasco villa mosci santoro and verri nonparametric sparsity and regularization journal of machine learning research and smola learning with kernels 
Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Shi, Guo, and Zhou. Hermite learning with gradient data. Journal of Computational and Applied Mathematics.
Shi, Petterson, Dror, Langford, Smola, Strehl, and Vishwanathan. Hash kernels. In AISTATS.
Sriperumbudur, Fukumizu, Gretton, and Kumar. Density estimation in infinite dimensional exponential families. Technical report.
Strathmann, Sejdinovic, Livingstone, and Gretton. Hamiltonian Monte Carlo with efficient kernel exponential families. In NIPS.
Sutherland and Schneider. On the error of random Fourier features. In UAI.
Vedaldi and Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Wendland. Scattered Data Approximation. Cambridge University Press.
Williams and Seeger. Using the Nyström method to speed up kernel machines. In NIPS.
Ying, Wu, and Campbell. Learning the coordinate gradients. Advances in Computational Mathematics.
Yukich. Some limit theorems for the empirical process indexed by functions. Probability Theory and Related Fields.
Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics.
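To make the random Fourier feature construction discussed in the preceding sections concrete, the following is a minimal numerical sketch of approximating a shift-invariant kernel and its derivative on a compact set. It is only an illustration of the standard cos/sin feature map, not the estimators analysed above in full generality: the Gaussian kernel, the bandwidth sigma, the number of features m, the set [-2, 2]^2, the grid resolution, and the single derivative check are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); by Bochner's theorem its
# spectral measure is N(0, I / sigma^2), so we draw the random frequencies from that Gaussian.
sigma = 1.0
d, m = 2, 2000                       # input dimension, number of random features
omega = rng.normal(scale=1.0 / sigma, size=(m, d))

def phi(x):
    """Random Fourier feature map phi(x) in R^{2m}, with k(x, y) ~ phi(x) @ phi(y)."""
    proj = x @ omega.T               # shape (n, m)
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

def grad_phi(x):
    """Derivative of the feature map, so that d/dx k(x, y) ~ grad_phi(x) @ phi(y)."""
    proj = x @ omega.T
    # d/dx cos(omega.x) = -sin(omega.x) * omega,   d/dx sin(omega.x) = cos(omega.x) * omega
    dcos = -np.sin(proj)[:, :, None] * omega[None, :, :]
    dsin = np.cos(proj)[:, :, None] * omega[None, :, :]
    return np.concatenate([dcos, dsin], axis=1) / np.sqrt(m)   # shape (n, 2m, d)

# Empirical sup-error of the kernel approximation over a grid in the compact set S = [-2, 2]^2.
grid = np.stack(np.meshgrid(np.linspace(-2, 2, 25), np.linspace(-2, 2, 25)), -1).reshape(-1, d)
K_true = np.exp(-((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
K_rff = phi(grid) @ phi(grid).T
print("max |k - k_m| on the grid:", np.abs(K_true - K_rff).max())

# Derivative check at a single pair (x, y), against the analytic gradient of the Gaussian kernel.
x, y = grid[:1], grid[1:2]
grad_true = -(x - y) / sigma ** 2 * np.exp(-((x - y) ** 2).sum() / (2 * sigma ** 2))
grad_rff = np.einsum('imd,jm->ijd', grad_phi(x), phi(y))[0, 0]
print("kernel gradient (analytic vs. RFF):", grad_true.ravel(), grad_rff)
```

Increasing m should shrink the printed sup-error at roughly the m^{-1/2} rate discussed above, up to logarithmic factors in the diameter of the set.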
The Population Posterior and Bayesian Modeling on Streams. James McInerney (Columbia University), Rajesh Ranganath (Princeton University), David Blei (Columbia University). Abstract: Many modern data analysis problems involve inferences from streaming data. However, streaming data is not easily amenable to the standard probabilistic modeling approaches, which require conditioning on finite data. We develop population variational Bayes, a new approach for using Bayesian modeling to analyze streams of data. It approximates a new type of distribution, the population posterior, which combines the notion of a population distribution of the data with Bayesian inference in a probabilistic model. We develop the population posterior for latent Dirichlet allocation and Dirichlet process mixtures, and we study our method with several data sets. Introduction. Probabilistic modeling has emerged as a powerful tool for data analysis. It is an intuitive language for describing assumptions about data, and it provides efficient algorithms for analyzing real data under those assumptions. The main idea comes from Bayesian statistics: we encode our assumptions about the data in a structured probability model of hidden and observed variables, we condition on a data set to reveal the posterior distribution of the hidden variables, and we use the resulting posterior as needed, for example to form predictions through the posterior predictive distribution or to explore the data through the posterior expectations of the hidden variables. Many modern data analysis problems involve inferences from streaming data. Examples include exploring the content of massive social media streams (Twitter, Facebook), analyzing live video streams, estimating the preferences of users on an online platform for recommending new items, and predicting human mobility patterns for anticipatory computing. Such problems, however, cannot easily take advantage of the standard approach to probabilistic modeling, which requires that we condition on a finite data set. This might be surprising to some readers; after all, one of the tenets of the Bayesian paradigm is that we can update our posterior when given new information: yesterday's posterior is today's prior. But there are two problems with using Bayesian updating on data streams. The first problem is that Bayesian inference computes posterior uncertainty under the assumption that the model is correct. In theory this is sensible, but only in the impossible scenario where the data truly came from the proposed model. In practice, all models provide approximations to the true distribution, and when the model is incorrect, the uncertainty that maximizes predictive likelihood may be larger or smaller than the Bayesian posterior variance. This problem is exacerbated in potentially infinite streams: after seeing only a few data points, uncertainty is high, but eventually the model becomes overconfident. The second problem is that the data stream might change over time. This is an issue because, frequently, our goal in applying probabilistic models to streams is not to characterize how they change but rather to accommodate it; that is, we would like for our current estimate of the latent variables to be accurate to the current state of the stream and to adapt to how the stream might slowly change. (This is in contrast, for example, to time series modeling.) Traditional Bayesian updating cannot handle this: either we explicitly model the time series and pay a heavy inferential cost, or we tacitly assume that the data are exchangeable, i.e., that the underlying distribution does not change. In this paper, we
develop new ideas for analyzing data streams with probabilistic models our approach combines the frequentist notion of the population distribution with probabilistic models and bayesian inference main idea the population posterior consider latent variable model of data points this is unconventional notation we will describe why we use it below following we define the model to have two kinds of hidden variables global hidden variables contain latent structure that potentially governs any data point local hidden variables zi contain latent structure that only governs the ith data point such models are defined by the joint xi zi where and traditional bayesian statistics conditions on fixed data set to obtain the posterior distribution of the hidden variables as we discussed this framework can not accommodate data streams we need different way to use the model we define new distribution the population posterior which enables us to consider bayesian modeling of streams suppose we observe data points independently from the underlying population distribution fα this induces posterior which is function of the random data the population posterior is the expected value of this distribution efα efα notice that this distribution is not function of observed data it is function of the population distribution and the data size the data size is hyperparameter that can be set it effectively controls the variance of the population posterior how to best set it depends on how close the model is to the true data distribution we have defined new problem given an endless stream of data points coming from and value for our goal is to approximate the corresponding population posterior in this paper we will approximate it through an algorithm based on variational inference and stochastic optimization as we will show our algorithm justifies applying variant of stochastic variational inference to data stream we used our method to analyze several data streams with two modern probabilistic models latent dirichlet allocation and dirichlet process mixtures with likelihood as measure of model fitness we found our method to give better models of the data than approaches based on full bayesian inference or bayesian updating related work researchers have proposed several methods for inference on streams of data refs propose extending markov chain monte carlo methods for streaming data however approaches do not scale to massive datasets the variational approximation enables more scalable inference in variational inference ref propose online variational inference by exponentially forgetting the variational parameters associated with old data stochastic variational inference svi also decay parameters derived from old data but interprets this in the context of stochastic optimization neither of these methods applies to streaming data both implicitly rely on the data being of known size even when subsampling data to obtain noisy gradients to apply the variational approximation to streaming data ref and ref both propose bayesian updating of the approximating family ref adapts this framework to nonparametric mixture models here we take different approach changing the variational objective to incorporate population distribution and then following stochastic gradients of this new objective in section we show that this generally performs better than bayesian updating independently ref applied svi to streaming data by accumulating new data points into growing window and then uniformly sampling from this window to update the variational 
parameters our method justifies that approach further they propose updating parameters along trust region instead of following natural gradients as way of mitigating local optima this innovation can be incorporated into our method variational inference for the population posterior we develop population variational bayes method for approximating the population posterior in eq our method is based on variational inference and stochastic optimization the the idea behind variational inference is to approximate distributions through optimization we introduce an approximating family of distributions over the latent variables and try to find the member of that minimizes the kl divergence to the target distribution population variational bayes vb uses variational inference to approximate the population posterior in eq it aims to minimize the kl divergence from an approximating family arg min kl as for the population posterior this objective is function of the population distribution of data points fα notice the difference to classical vb in classical vb we optimize the kl divergence between and posterior kl its objective is function of fixed data set in contrast the objective in eq is function of the population distribution fα we will use the variational family where each latent variable is independent and governed by free parameter zi φi the free variational parameters are the global parameters and local parameters φi though we focus on the family extensions could consider structured families where there is dependence between variables in classical vb where we approximate the usual posterior we can not compute the kl thus we optimize proxy objective called the elbo evidence lower bound that is equal to the negative kl up to an additive constant maximizing the elbo is equivalent to minimizing the kl divergence to the posterior in population vb we also optimize proxy objective the the is an expectation of the elbo under the population distribution of the data fα efα eq log log log xi zi log zi the is lower bound on the population evidence log efα and lower bound on the negative kl to the population posterior see appendix the inner expectation is over the latent variables and and is function of the variational distribution the outer expectation is over the random data points and is function of the population distribution fα the is thus function of both the variational distribution and the population distribution as we mentioned classical vb maximizes the classical elbo which is equivalent to minimizing the kl the in contrast is only bound on the negative kl to the population posterior thus maximizing the is suggestive but is not guaranteed to minimize the kl that said our studies show that this is good quantity to optimize and in appendix we show that the does minimize efα kl the population kl conditionally conjugate models in the next section we will develop stochastic optimization algorithm to maximize eq first we describe the class of models that we will work with following we focus on conditionally conjugate models conditionally conjugate model is one where each complete conditional distribution of latent variable given all the other latent variables and the in the exponential family this class includes many models in modern machine learning such as mixture models topic models many bayesian nonparametric models and some hierarchical regression models using conditionally conjugate models simplifies many calculations in variational inference under the joint in eq we can write conditionally 
conjugate model with two exponential families zi xi zi xi exp zi xi exp we overload notation for base measures sufficient statistics and log normalizers note that is the hyperparameter and that in conditionally conjugate models each complete conditional is in an exponential family and we use these families as the factors in the variational distribution in eq thus indexes the same family as and φi indexes the same family as zi xi for example in latent dirichlet allocation the complete conditional of the topics is dirichlet the complete conditional of the topic mixture is dirichlet and the complete conditional of the topic assignment is categorical see for details population variational bayes we have described the ingredients of our problem we are given conditionally conjugate model described in eqs and parameterized variational family in eq and stream of data from an unknown population distribution our goal is to optimize the in eq with respect to the variational parameters the is function of the population distribution which is an unknown quantity to overcome this hurdle we will use the stream of data from to form noisy gradients of the we then update the variational parameters with stochastic optimization technique to find local optimum by following noisy unbiased gradients before describing the algorithm however we acknowledge one technical detail mirroring we optimize an that is only function of the global variational parameters the population vi objective is lfα maxφ lfα this implicitly optimizes the local parameter as function of the global parameter and allows us to convert the potentially optimization problem in eq to finite one the resulting objective is identical to eq but with replaced by details are in appendix the next step is to form noisy gradient of the so that we can use stochastic optimization to maximize it stochastic optimization maximizes an objective by following noisy and unbiased gradients we will write the gradient of the as an expectation with respect to fα and then use monte carlo estimates to form noisy gradients we compute the gradient of the by bringing the gradient operator inside the expectations of eq this results in population expectation of the classical vb gradient with data points we take the natural gradient which has simple form in completely conjugate models specifically the natural gradient of the is fα efα eφi xi zi we approximate this expression using monte carlo to compute noisy unbiased natural gradients at to form the monte carlo estimate we collect data points from for each we compute the optimal local parameters φi which is function of the sampled data point and variational parameters we then compute the quantity inside the brackets in eq averaging these results gives the monte carlo estimate of the natural gradient we follow the noisy natural gradient and repeat the algorithm is summarized in algorithm because eq is monte carlo estimate we are free to draw data points from fα where and rescale the sufficient statistics by this makes the natural gradient estimate noisier but faster to calculate as highlighted in this strategy is more computationally efficient because early iterations of the algorithm have inaccurate values of it is wasteful to pass through lot of data before making updates to discussion thus far we have defined the population posterior and showed how to approximate it with population variational inference our derivation justifies using an algorithm like stochastic variational inference svi on stream of data it is nearly 
identical to svi but includes an additional parameter the number of data points in the population posterior for most models of interest this is justified by the dominated convergence theorem algorithm population variational bayes randomly initialize global variational parameter set iteration repeat draw data minibatch fα optimize local variational parameters φb fα see eq calculate natural gradient update global variational parameter with learning rate fα αb update iteration count until forever note we can recover the original svi algorithm as an instance of population vi thus reinterpreting it as minimizing the kl divergence to the population posterior we recover svi by setting equal to the number of data points in the data set and replacing the stream of data with the empirical distribution of the observations the stream in this case comes from sampling with replacement from which results in precisely the original svi we focused on the conditionally conjugate family for convenience the simple gradient in eq we emphasize however that by using recent tools for nonconjugate inference we can adapt the new ideas described population posterior and the of conditionally conjugate models finally we analyze the population posterior distribution under the assumption the only way the stream affects the model is through the data formally this means the unobserved variables in the model and the stream fα are independent given the data the population posterior without the local latent variables which can be marginalized out is efα expanding the expectation gives fα dx showing that the population posterior distribution can be written as fα this can be depicted as graphical model this means first that the population posterior is well defined even when the model does not specify the marginal distribution of the data and second rather than the classical bayesian setting where the posterior is conditioned on finite fixed dataset the population posterior is distributional posterior conditioned on the stream fα empirical evaluation we study the performance of population variational bayes population vb against svi and streaming variational bayes svb with large data we study two models latent dirichlet allocation and bayesian nonparametric mixture models comparing the predictive performance of the algorithms all three methods share the same local variational update which is the dominating computational cost we study the data coming in true ordered stream and in permuted stream to better match the assumptions of svi across data and models population vb usually outperforms the existing approaches models we study two models the first is latent dirichlet allocation lda lda is model of text collections and is frequently used to find its latent topics lda assumes that there are topics βk dir each of which is multinomial distribution over fixed vocabulary documents are drawn by first choosing distribution over topics θd dir and then this derivation of svi is an application of efron principle applied to inference of the population posterior the principle says that we can replace the population with the empirical distribution of the data to make population inferences in our empirical study however we found that population vi often outperforms stochastic vi treating the data in true stream and setting the number of data points different to the true number can improve predictive accuracy held out log likelihood stream new york times twitter science number of documents seen held out log likelihood random stream new york 
times science svi twitter number of documents seen figure held out predictive log likelihood for lda on streamed text corpora populationvb outperforms existing methods for two out of the three settings we use the best settings of drawing each word by choosing topic assignment zdn mult θd and finally choosing word from the corresponding topic wdn βzdn the joint distribution is θd zdi wdi zdi fixing hyperparameters the inference problem is to estimate the conditional distribution of the topics given large collection of documents the second model is dirichlet process dp mixture loosely dp mixtures are mixture models with potentially infinite number of components thus choosing the number of components is part of the posterior inference problem when using variational inference for dp mixtures we take advantage of the stick breaking representation to construct truncated variational approximation the variables are mixture proportions stick mixture components βk for infinite mixture assignments zi mult and observations xi βzi the joint is zi xi xi the likelihood and prior on the components are general to the observations at hand in our study of data we use normal priors and normal likelihoods in our study of text data we use dirichlet priors and multinomial likelihoods for both models we vary usually fixed to the number of data points in traditional analysis datasets with lda we analyze three streamed corpora articles from the new york times spanning years science articles written over years and tweets collected from twitter on feb we processed them all in similar way choosing vocabulary based on the most frequent words in the corpus with stop words removed for the new york times for science and for twitter on twitter each tweet is document and we removed duplicate tweets and tweets that did not contain at least words in the vocabulary for each data stream all algorithms took few hours to process all the examples we collected with dp mixtures we analyze human location behavior data these data allow us to build periodic models of human population mobility with applications to disaster response and urban planning such models account for periodicity by including the hour of the week as one of the dimensions of the stream held out log likelihood ivory coast locations geolife locations new york times number of data points seen held out log likelihood random stream ivory coast locations geolife locations svi new york times number of data points seen figure held out predictive log likelihood for dirichlet process mixture models on streamed location and text data sets note that we apply gaussian likelihoods in the geolife dataset so the reported predictive performance is measured by probability density we chose the best for each curve held out log likelihood sensitivity to for lda new york times science twitter logarithm base of held out log likelihood sensitivity to for ivory coast locations geolife locations new york times logarithm base of figure we show the sensitivity of to hyperparameter based on final log likelihoods in the stream and find that the best setting of often differs from the true number of data points which may not be known in any case in practice data to be modeled the ivory coast location data contains discrete cell tower locations for users recorded over months the microsoft geolife dataset contains gps locations for users over years for both data sets our observations reflect the data to ensure that each individual is seen no more than once every minutes results we compare 
population vb with svi and svb for lda and dp mixtures svb updates the variational approximation of the global parameter using density filtering with exponential families the complexity of the approximation remains fixed as the expected sufficient statistics from minibatches observed in stream are combined with those of the current approximation here we give the final results we include details of how we set and fit hyperparameters below we measure model fitness by evaluating the average predictive log likelihood on data this involves splitting observations that were not involved in the posterior approximation of into two equal halves inferring the local component distribution based on the first half and testing with the second half for we condition on the observed hour of the week and predict the geographic location of the data point in standard offline studies the set is randomly selected from the data with streams however we test on the next documents for new york times science tweets for twitter or locations on geo data this is valid set because the data ahead of the current position in the stream have not yet been seen by the inference algorithms figure shows the performance for lda we looked at two types of streams one in which the data appear in order and the other in which they have been permuted an exchangeable stream the time permuted stream reveals performance when each data minibatch is safely assumed to be an sample from this results in smoother improvements to predictive likelihood on our data we found that population vb outperformed svi and svb on two of the data sets and outperformed svi on all of the data svb performed better than population vb on twitter figure shows similar study for dp mixtures we analyzed the human mobility data and the new york times ref also analyzed the new york times on these data population vb outperformed svb and svi in all hyperparameters unlike traditional bayesian methods the data set size is hyperparameter to population vb it helps control the posterior variance of the population posterior figure reports sensitivity to for all studies for the stream these plots indicate that the optimal setting of is often different from the true number of data points the best performing population posterior variance is not necessarily the one implied by the data the other hyperparameters to our experiments are reported in appendix conclusions and future work we introduced the population posterior distribution over latent variables that combines traditional bayesian inference with the frequentist idea of the population distribution with this idea we derived population variational bayes an efficient algorithm for probabilistic inference on streams on two complex bayesian models and several large data sets we found that population variational bayes usually performs better than existing approaches to streaming inference in this paper we made no assumptions about the structure of the population distribution making assumptions such as the ability to obtain streams conditional on queries can lead to variants of our algorithm that learn which data points to see next during inference finally understanding the theoretical properties of the population posterior is also an avenue of interest acknowledgments we thank allison chaney john cunningham alp kucukelbir stephan mandt peter orbanz theo weber frank wood and the anonymous reviewers for their comments this work is supported by nsf onr darpa ndseg facebook adobe amazon and the siebel scholar and john templeton 
foundations though our purpose is to compare algorithms we make one note about specific data set the predictive accuracy for the ivory coast data set plummets after data points this is because of the data collection policy for privacy reasons the data set provides the cell tower locations of randomly selected cohort of users every weeks the new cohort at data points behaves differently to previous cohorts in way that affects predictive performance however both algorithms steadily improve after this shock references ahmed ho teo eisenstein xing and smola online inference for the infinite model storylines from streaming text in international conference on artificial intelligence and statistics pages amari natural gradient works efficiently in learning neural computation bernardo and smith bayesian theory volume john wiley sons blei jordan et al variational inference for dirichlet process mixtures bayesian analysis blei ng and jordan latent dirichlet allocation the journal of machine learning research blondel esch chan clérot deville huens morlot smoreda and ziemlicki data for development the challenge on mobile phone data arxiv preprint bottou online learning and stochastic approximations online learning in neural networks broderick boyd wibisono wilson and jordan streaming variational bayes in advances in neural information processing systems pages doucet godsill and andrieu on sequential monte carlo sampling methods for bayesian filtering statistics and computing efron and tibshirani an introduction to the bootstrap crc press escobar and west bayesian density estimation and inference using mixtures journal of the american statistical association ghahramani and attias online variational bayesian learning in slides from talk presented at nips workshop on online learning pages hoffman and blei structured stochastic variational inference in international conference on artificial intelligence and statistics pages hoffman blei wang and paisley stochastic variational inference the journal of machine learning research honkela and valpola variational bayesian learning in international symposium on independent component analysis and blind signal separation pages jordan ghahramani jaakkola and saul an introduction to variational methods for graphical models machine learning kingma and welling variational bayes arxiv preprint ranganath gerrish and blei black box variational inference in proceedings of the seventeenth international conference on artificial intelligence and statistics pages robbins and monro stochastic approximation method the annals of mathematical statistics pages saul and jordan exploiting tractable substructures in intractable networks advances in neural information processing systems pages sethuraman constructive definition of dirichlet priors statistica sinica tank foti and fox streaming variational inference for bayesian nonparametric mixture models in international conference on artificial intelligence and statistics theis and hoffman method for stochastic variational inference with applications to streaming data in international conference on machine learning titsias and doubly stochastic variational bayes for inference in proceedings of the international conference on machine learning pages wainwright and jordan graphical models exponential families and variational inference foundations and trends in machine learning wallach murray salakhutdinov and mimno evaluation methods for topic models in international conference on machine learning yao mimno and mccallum efficient 
methods for topic model inference on streaming document collections. In ACM Conference on Knowledge Discovery and Data Mining (KDD).
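To make the streaming natural-gradient update of the population variational Bayes algorithm described above concrete, here is a minimal sketch on a toy conditionally conjugate model: a Gaussian mean with known observation variance and no local latent variables. The toy model, the population size alpha, the minibatch size, the Robbins-Monro step-size schedule, and the synthetic stream are all illustrative assumptions; this is not the LDA or DP-mixture implementation used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy conditionally conjugate model: x ~ Normal(theta, obs_var) with obs_var known,
# prior theta ~ Normal(0, 10). The global variational factor q(theta) is Gaussian,
# parameterised by its natural parameters lam = (mu / var, -1 / (2 * var)).
obs_var = 1.0
prior_nat = np.array([0.0 / 10.0, -1.0 / (2.0 * 10.0)])   # natural parameters of the prior

def suff_stats(x):
    # Per-point contribution of log p(x | theta) to theta's natural parameters.
    return np.stack([x / obs_var, np.full_like(x, -1.0 / (2.0 * obs_var))], axis=-1)

alpha = 10_000        # hypothesised population data size (a free hyperparameter here)
batch = 50            # minibatch size drawn from the stream at each step
lam = prior_nat.copy()

for t in range(1, 2001):
    x = rng.normal(loc=2.0, scale=np.sqrt(obs_var), size=batch)   # stand-in for the stream F_alpha
    # Noisy natural gradient: prior plus rescaled minibatch statistics, minus the current lam.
    nat_grad = prior_nat + (alpha / batch) * suff_stats(x).sum(axis=0) - lam
    rho = (t + 10.0) ** -0.7                                      # Robbins-Monro step size
    lam = lam + rho * nat_grad

post_var = -1.0 / (2.0 * lam[1])
post_mean = lam[0] * post_var
print(f"q(theta): mean={post_mean:.3f}, var={post_var:.2e}  (population size alpha={alpha})")
```

Setting alpha different from the number of points actually seen is the point of the construction: it controls the variance of the population posterior independently of how long the stream has been running.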
Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees. François-Xavier Briol (Department of Statistics, University of Warwick), Chris Oates (School of Mathematical and Physical Sciences, University of Technology Sydney), Mark Girolami (Department of Statistics, University of Warwick; The Alan Turing Institute for Data Science), Michael Osborne (Department of Engineering Science, University of Oxford). Abstract: There is renewed interest in formulating integration as a statistical inference problem, motivated by obtaining a full distribution over the numerical error that can be propagated through subsequent computation. Current methods, such as Bayesian quadrature, demonstrate impressive empirical performance but lack theoretical analysis. An important challenge is therefore to reconcile these probabilistic integrators with rigorous convergence guarantees. In this paper we present the first probabilistic integrator that admits such a theoretical treatment, called Frank-Wolfe Bayesian quadrature (FWBQ). Under FWBQ, convergence to the true value of the integral is shown to be up to exponential, and fast posterior contraction rates are also proven. In simulations, FWBQ is competitive with existing methods and with alternatives based on Frank-Wolfe optimisation. Our approach is applied to successfully quantify numerical error in the solution to a challenging Bayesian model choice problem in cellular biology. Introduction. Computing integrals is a core challenge in machine learning, and numerical methods play a central role in this area. This can be problematic when a numerical integration routine is repeatedly called, maybe millions of times, within a larger computational pipeline. In such situations the cumulative impact of numerical errors can be unclear, especially in cases where the error has a structural component. One solution is to model the numerical error statistically and to propagate this source of uncertainty through subsequent computations. Conversely, an understanding of how errors arise and propagate can enable the efficient focusing of computational resources upon the most challenging numerical integrals in a pipeline. Classical numerical integration schemes do not account for prior information on the integrand and, as a consequence, can require an excessive number of function evaluations to obtain a prescribed level of accuracy. Alternatives such as quasi-Monte Carlo (QMC) can exploit knowledge on the smoothness of the integrand to obtain optimal convergence rates; however, these optimal rates can only hold for certain sets of sample sizes, a consequence of the fact that all function evaluations are weighted equally in the estimator. A modern approach that avoids this problem is to consider arbitrarily weighted combinations of function values, i.e. quadrature rules (also called cubature rules). Whilst quadrature rules with arbitrary weights have received relatively little theoretical attention, it is known that the extra flexibility given by arbitrary weights can lead to extremely accurate approximations in many settings (see, for example, applications to image analysis and to mental simulation in psychology). Probabilistic numerics, introduced in a seminal early paper, aims at treating numerical tasks as inference tasks that are amenable to statistical analysis. Recent developments include probabilistic solvers for linear systems and differential equations. For the task of computing integrals, Bayesian quadrature (BQ) and more recent work provide probabilistic numerics methods that produce a full posterior distribution on the output of numerical schemes. One advantage of this approach is that we can propagate uncertainty through all subsequent computations to explicitly model the impact
of numerical error contrast this with chaining together classical error bounds the result in such cases will typically be weak bound that provides no insight into the error structure at present significant shortcoming of these methods is the absence of theoretical results relating to rates of posterior contraction this is unsatisfying and has likely hindered the adoption of probabilistic approaches to integration since it is not clear that the induced posteriors represent sensible quantification of the numerical error by classical frequentist standards this paper establishes convergence rates for new probabilistic approach to integration our results thus overcome key perceived weakness associated with probabilistic numerics in the quadrature setting our starting point is recent work by who cast the design of quadrature rules as problem in convex optimisation that can be solved using the fw algorithm we propose hybrid approach of with bq taking the form of quadrature rule that carries full probabilistic interpretation ii is amenable to rigorous theoretical analysis and iii converges faster empirically compared with the original approaches in in particular we prove that rates hold for posterior contraction concentration of the posterior probability mass on the true value of the integral showing that the posterior distribution provides sensible and effective quantification of the uncertainty arising from numerical error the methodology is explored in simulations and also applied to challenging model selection problem from cellular biology where numerical error could lead to of expensive resources background quadrature and cubature methods let rd be measurable space such that and consider probability density defined rwith respect to the lebesgue measure on this paper focuses on computing integrals of the form dx for test function where for simplicity we assume is with respect to quadrature rule approximates such integrals as weighted sum of function values at some design points xi dx wi xi viewing integrals pn as projections we write for the side and for the side where wi xi and xi is dirac measure at xi note that may not be probability distribution in fact weights wi do not have to sum to one or be quadrature rules can be extended to multivariate functions rd by taking each component in turn there are many ways of choosing combinations xi wi in the literature for example taking weights to be wi with points xi drawn independently from the probability distribution recovers basic monte carlo integration the case with weights wi but with points chosen with respect to some specific possibly deterministic schemes includes kernel herding and carlo qmc in bayesian quadrature the points xi are chosen to minimise posterior variance with weights wi arising from posterior probability distribution classical error analysis for quadrature rules is naturally couched in terms of minimising the worstcase estimation error let be hilbert space of functions equipped with the inner detailed discussion on probabilistic numerics and an extensive bibliography can be found at http product and associated norm kh we define the maximum mean discrepancy mmd as mmd xi wi sup kf kh the reader can refer to for conditions on that are needed for the existence of the mmd the rate at which the mmd decreases with the number of samples is referred to as the convergence rate of the quadrature rule for monte carlo the mmd decreases with the slow rate of op where the subscript specifies that the convergence is in probability let be 
rkhs with reproducing kernel and denote the corresponding canonical feature map by so that the mean element is given by µp then following mmd xi wi kµp kh this shows that to obtain low integration error in the rkhs one only needs to obtain good approximation of its mean element µp as hf µp ih establishing theoretical results for such quadrature rules is an active area of research bayesian quadrature bayesian quadrature bq was originally introduced in and later revisited by and the main idea is to place functional prior on the integrand then update this prior through bayes theorem by conditioning on both samples xi and function evaluations at those sample points fi where fi xi this induces full posterior distribution over functions and hence over the value of the integral the most common implementation assumes gaussian process gp prior gp useful property motivating the use of gps is that linear projection preserves normality so that the posterior distribution for the integral is also gaussian characterised by its mean and covariance natural estimate of the integral is given by the mean of this posterior distribution which can be compactly written as zt where zi µp xi and kij xi xj notice that this estimator takes the form of quadrature rule with weights wbq zt recently showed how specific choices of kernel and design points for bq can recover classical quadrature rules this begs the question of how to select design points xi particularly natural approach aims to minimise the posterior uncertainty over the integral which was shown in prop to equal vbq xi µp zt xi wibq thus in the rkhs setting minimising the posterior variance corresponds to minimising the worst case error of the quadrature rule below we refer to optimal bq obq as bq coupled with design points xobq chosen to globally minimise we also call sequential bq sbq the algorithm that greedily selects design points to give the greatest decrease in posterior variance at each iteration obq will give improved results over sbq but can not be implemented in general whereas sbq is comparatively to implement there are currently no theoretical results establishing the convergence of either bq obq or sbq remark is independent of observed function values as such no active learning is possible in sbq surprising function values never cause revision of planned sampling schedule this is not always the case for example approximately encodes of into bq which leads to dependence on in the posterior variance in this case sequential selection becomes an active strategy that outperforms batch selection in general deriving quadrature rules via the algorithm despite the elegance of bq its convergence rates have not yet been rigorously established in brief this is because is an orthogonal projection of onto the affine hull of xi rather than the convex hull standard results from the optimisation literature apply to bounded domains but the affine hull is not bounded the bq weights can be arbitrarily large and possibly negative below we describe solution to the problem of computing integrals recently proposed by based on the fw algorithm that restricts attention to the bounded convex hull of xi algorithm the fw and with fwls algorithms require function initial state and for fw only sequence ρi for do compute dj for fwls only line search ρi update gi ρi ρi end for the fw algorithm alg also called the conditional gradient algorithm is convex optimization method introduced in it considers problems of the form where the function is convex and continuously 
differentiable particular case of interest in this paper will be when the domain is compact and convex space of functions as recently investigated in these assumptions imply the existence of solution to the optimization problem at each iteration the fw algorithm computes linearisation of the objective function at the previous state along its gradient dj and selects an atom that minimises the inner product state and dj the new state gi is then convex combination of the previous state and of the atom this convex combination depends on ρi which is and different versions of the algorithm may have different sequences our goal in quadrature is to approximate the mean element µp recently proposed to frame integration as fw optimisation problem here the domain is space of functions and taking the objective function to be µp this gives an approximation of the mean element and takes the form of half the posterior variance or the in this functional approximation setting minimisation of is carried out over the marginal polytope of the rkhs the marginal polytope is defined as the closure of the convex hull of so that in particular µp assuming as in that is uniformly bounded in feature space kφ kh then is closed and bounded set and can be optimised over in order to define the algorithm rigorously in this case we introduce the derivative of denoted dj such that for being the dual space of we have the unique map dj such that for each dj is the function mapping to dj we also introduce the bilinear map which for given by hg ih is the rule giving hh hh ih particular advantage of this method is that it leads to sparse solutions which are linear combinations of the atoms in particular this provides weighted estimate for the mean element gn wifw where by default which leads to all wifw when ρi typical sequence of approximations to the mean element is shown in fig left demonstrating that the approximation quickly converges to the ground truth in black since minimisation of linear function can be fw restricted to extreme points of the domain the atoms will be of the form xfw xi fw for some xi the minimisation in over from step in algorithm therefore becomes minimisation in over and this algorithm therefore provides us design points in practice at each iteration the fw algorithm hence selects design point xfw which induces an atom and gives us an approximation of the mean element µp we denote by this approximation after iterations using the reproducing property we can show that the fw estimate is quadrature rule wifw wifw xfw wifw xfw the total computational cost for fw is an extension known as fw with line search fwls uses method to find the optimal step size ρi at each iteration see alg figure left approximations of the mean element µp using the fwls algorithm based on design points purple blue green red and orange respectively it is not possible to distinguish between approximation and ground truth when right density of mixture of gaussian distributions displaying the first design points chosen by fw red fwls orange and sbq green each method provides design points in regions most fw and fwls design points overlap partly explaining their similar performance in this case once again the approximation obtained by fwls has sparse expression as convex combination of all the previously visited states and we obtain an associated quadrature rule fwls has theoretical convergence rates that can be stronger than standard versions of fw but has computational cost in the authors in provide survey of algorithms and their 
convergence rates under different regularity conditions on the objective function and domain of optimisation remark the fw design points xfw are generally not available in we follow mainstream literature by selecting at each iteration the point that minimises the mmd over finite collection of points drawn from the authors in proved that this approximation adds term to the mmd so that theoretical results on fw convergence continue to apply provided that sufficiently quickly appendix provides full details in practice one may also make use of numerical optimisation scheme in order to select the points hybrid approach bayesian quadrature to combine the advantages of probabilistic integrator with formal convergence theory we pron pose bayesian quadrature fwbq in fwbq we first select design points xfw using the fw algorithm however when computing the quadrature approximation instead of using the usual fw weights wifw we use instead the weights wibq provided by bq we denote this quadrature rule by and also consider which uses fwls in place of fw as we show below these hybrid estimators carry the bayesian interpretation of sec ii permit rigorous theoretical analysis and iii existing fw quadrature rules by orders of magnitude in simulations fwbq is hence ideally suited to probabilistic numerics applications for these theoretical results we assume that belongs to rkhs in line with recent literature we further assume that is compact subset of rd that and that is continuous on under these hypotheses theorem establishes consistency of the posterior mean while theorem establishes contraction for the posterior distribution theorem consistency the posterior mean converges to the true integral at the following rates for fwbq mmd xi wi exp for fwlsbq where the fwbq uses ρi is the diameter of the marginal polytope and gives the radius of the smallest ball of center µp included in note that all the proofs of this paper can be found in appendix an immediate corollary of theorem is that fwlsbq has an asymptotic error which is exponential in and is therefore superior to that of any qmc estimator this is not contradiction recall that qmc restricts attention to uniform weights while fwlsbq is able to propose arbitrary weightings in addition we highlight robustness property even when the assumptions of this section do not hold one still obtains atleast rate op for the posterior mean using either fwbq or fwlsbq remark the choice of kernel affects the convergence of the fwbq method clearly we expect faster convergence if the function we are integrating is close to the space of functions induced by our kernel indeed the kernel specifies the geometry of the marginal polytope that in turn directly influences the rate constant and associated with fw convex optimisation consistency is only stepping stone towards our main contribution which establishes posterior contraction rates for fwbq posterior contraction is important as these results justify for the first time the probabilistic numerics approach to integration that is we show that the full posterior distribution is sensible quantification at least asymptotically of numerical error in the integration routine theorem contraction let be an open neighbourhood of the true integral and let inf then the posterior probability mass on vanishes at rate exp for fwbq πrγ prob exp exp for fwlsbq πγ where the fwbq uses ρi is the diameter of the marginal polytope and gives the radius of the smallest ball of center µp included in the contraction rates are exponential for fwbq and 
for fwlbq and thus the two algorithms enjoy both probabilistic interpretation and rigorous theoretical guarantees notable corollary is that obq enjoys the same rates as fwlsbq resolving conjecture by tony hagan that obq converges exponentially personal communication corollary the consistency and contraction rates obtained for fwlsbq apply also to obq experimental results simulation study to facilitate the experiments in this paper we followed and employed an exponentiatedquadratic eq kernel exp kx this corresponds to an infinitedimensional rkhs not covered by our theory nevertheless we note that all simulations are practically due to rounding at machine precision see appendix for finitedimensional approximation using random fourier features eq kernels are popular in the bq literature as when is mixture of gaussians the mean element µp is analytically tractable see appendix some other pairs that produce analytic mean elements are discussed in for this simulation study we took to be mixture of distributions monte carlo mc is often used for such distributions but has slow convergence rate in op fw and fwls are known to converge more quickly and are in this sense preferable to mc in our simulations fig left both our novel methods fwbq and fwlsbq decreased the mmd much faster than the methods of here the same kernel were employed for all methods to have fair comparison this suggests that the best quadrature rules correspond to elements outside the convex hull of xi examples of those including bq often assign negative weights to features fig right appendix the principle advantage of our proposed methods is that they reconcile theoretical tractability with fully probabilistic interpretation for illustration fig right plots the posterior uncertainty due to numerical error for typical integration problem based on this empirical studies of such posteriors exist already in the literature and the reader is referred to for details beyond these theoretically tractable integrators sbq seems to give even better performance as increases an intuitive explanation is that sbq picks xi to minimise the mmd whereas estimator fwls fwlsbq number of design points figure simulation study left plot of the integration error squared both fwbq and fwlsbq are seen to outperform fw and fwls with sbq performing best overall right integral estimates for fwls and fwlsbq for function fwls converges more slowly and provides only point estimate for given number of design points in contrast fwlsbq converges faster and provides full probability distribution over numerical error shown shaded in orange and credible intervals ground truth corresponds to the dotted black line fwbq and fwlsbq only minimise an approximation of the mmd its linearisation along dj in addition the sbq weights are optimal at each iteration which is not true for fwbq and fwlsbq we conjecture that theorem and provide upper bounds on the rates of sbq this conjecture is partly supported by fig right which shows that sbq selects similar design points to but weights them optimally note also that both fwbq and fwlsbq give very similar result this is not surprising as fwls has no guarantees over fw in rkhs quantifying numerical error in proteomic model selection problem topical bioinformatics application that extends recent work by is presented the objective is to select among set of candidate models mi for protein regulation this choice is based on dataset of protein expression levels in order to determine most plausible biological hypothesis for further 
experimental investigation each mi is specified by vector of kinetic parameters θi full details in appendix bayesian model selection requires that theserparameters are integrated out against prior θi to obtain marginal likelihood terms mi θi dθi our focus here is on obtaining the maximum posteriori map model mj defined as the maximiser of the pm posterior model probability mj mi where we have assumed uniform prior over model space numerical error in the computation of each term mi if unaccounted for could cause us to return model mk that is different from the true map estimate mj and lead to the of valuable experimental resources the problem is quickly exaggerated when the number of models increases as there are more opportunities for one of the mi terms to be too large due to numerical error in the number of models was combinatorial in the number of protein kinases measured in assay currently but in principle up to this led to deploy substantial computing resources to ensure that numerical error in each estimate of mi was individually controlled probabilistic numerics provides more elegant and efficient solution at any given stage we have fully probabilistic quantification of our uncertainty in each of the integrals mi shown to be sensible both theoretically and empirically this induces full posterior distribution over numerical uncertainty in the location of the map estimate bayes all the way down as such we can determine the precise point in the computational pipeline when numerical uncertainty near the map estimate becomes acceptably small and cease further computation the fwbq methodology was applied to one of the model selection tasks in in fig left we display posterior model probabilities for each of candidates models where low number of samples were used for each integral for display clarity only the first models are shown in this regime numerical error introduces second level of uncertainty that we quantify by combining the fwbq error models for all integrals in the computational pipeline this is summarised by box plot rather than single point for each of the models obtained by sampling details in appendix these box plots reveal that our estimated posterior model probabilities are posterior probability posterior probability candidate models candidate models figure quantifying numerical error in model selection problem fwbq was used to model the numerical error of each integral mi explicitly for integration based on design points fwbq tells us that the computational estimate of the model posterior will be dominated by numerical error left when instead design points are used right uncertainty due to numerical error becomes much smaller but not yet small enough to determine the map estimate completely dominated by numerical error in contrast when is increased through and fig right and fig the uncertainty due to numerical error becomes negligible at we can conclude that model is the true map estimate and further computations can be halted correctness of this result was confirmed using the more computationally intensive methods in in appendix we compared the relative performance of fwbq fwlsbq and sbq on this problem fig shows that the bq weights reduced the mmd by orders of magnitude relative to fw and fwls and that sbq converged more quickly than both fwbq and fwlsbq conclusions this paper provides the first theoretical results for probabilistic integration in the form of posterior contraction rates for fwbq and fwlsbq this is an important step in the probabilistic numerics research 
programme as it establishes theoretical justification for using the posterior distribution as model for the numerical integration error which was previously assumed the practical advantages conferred by fully probabilistic error model were demonstrated on model selection problem from proteomics where sensitivity of an evaluation of the map estimate was modelled in terms of the error arising from repeated numerical integration the strengths and weaknesses of bq notably including scalability in the dimension of are and are inherited by our fwbq methodology we do not review these here but refer the reader to for an extended discussion convergence in the classical sense was proven here to occur exponentially quickly for fwlsbq which partially explains the excellent performance of bq and related methods seen in applications as well as resolving an open conjecture as bonus the hybrid quadrature rules that we developed turned out to converge much faster in simulations than those in which originally motivated our work key open problem for kernel methods in probabilistic numerics is to establish protocols for the practical elicitation of kernel this is important as directly affect the scale of the posterior over numerical error that we ultimately aim to interpret note that this problem applies equally to bq as well as related quadrature methods and more generally in probabilistic numerics previous work such as optimised on perapplication basis our ongoing research seeks automatic and general methods for elicitation that provide good frequentist coverage properties for posterior credible intervals but we reserve the details for future publication acknowledgments the authors are grateful for discussions with simon simo arno solin dino sejdinovic tom gunter and mathias fxb was supported by epsrc cjo was supported by epsrc mg was supported by epsrc an epsrc established career fellowship the eu grant and royal society wolfson research merit award references bach on the equivalence between quadrature rules and random features bach and obozinski on the equivalence between herding and conditional gradient algorithms in proceedings of the international conference on machine learning pages chen bornn de freitas eskelin fang and welling herded gibbs sampling journal of machine learning research to appear chen welling and smola from kernel herding in proceedings of the conference on uncertainty in artificial intelligence pages conrad girolami stuart and zygalakis probability measures for numerical solutions of differential equations diaconis bayesian numerical analysis statistical decision theory and related topics iv pages dick and pillichshammer digital nets and sequences discrepancy theory and carlo integration cambridge university press dunn convergence rates for conditional gradient sequences generated by implicit step length rules siam journal on control and optimization frank and wolfe an algorithm for quadratic programming naval research logistics quarterly garber and hazan faster rates for the method over sets in proceedings of the international conference on machine learning pages ghahramani and rasmussen bayesian monte carlo in advances in neural information processing systems pages gunter garnett osborne hennig and roberts sampling for inference in probabilistic models with fast bayesian quadrature in advances in neural information processing systems hamrick and griffiths mental rotation as bayesian quadrature in nips workshop on bayesian optimization in theory and practice hennig probabilistic 
interpretation of linear solvers siam journal on optimization hennig osborne and girolami probabilistic numerics and uncertainty in computations proceedings of the royal society huszar and duvenaud herding is bayesian quadrature in uncertainty in artificial intelligence pages jaggi revisiting sparse convex optimization in proceedings of the international conference on machine learning volume pages lindsten and bach sequential kernel herding optimization for particle filtering in proceedings of the international conference on artificial intelligence and statistics pages oates dondelinger bayani korkola gray and mukherjee causal network inference using biochemical kinetics bioinformatics oates girolami and chopin control functionals for monte carlo integration arxiv hagan monte carlo is fundamentally unsound journal of the royal statistical society series hagan quadrature journal of statistical planning and inference osborne garnett roberts hart aigrain and gibson bayesian quadrature for ratios in proceedings of the international conference on artificial intelligence and statistics pages owen constraint on extensible quadrature rules numerische mathematik pages hartikainen svensson and sandblom on the relation between gaussian process quadratures and methods schober duvenaud and hennig probabilistic ode solvers with means in advances in neural information processing systems pages sriperumbudur gretton fukumizu and lanckriet hilbert space embeddings and metrics on probability measures journal of machine learning research 
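To make the FWBQ recipe described above concrete, the following is a minimal illustrative sketch (not the authors' code): design points are chosen by approximately minimising the MMD over a finite pool of candidates drawn from p, and the usual Frank-Wolfe weights are then replaced by Bayesian quadrature weights obtained from the kernel linear system. It assumes the exponentiated-quadratic kernel and a Gaussian-mixture target, for which the mean element has the standard closed form; all function and variable names (eq_kernel, mean_element, fwbq_design, sample_p, sigma) are ours.

    # Illustrative FWBQ sketch under the assumptions stated above.
    import numpy as np

    def eq_kernel(X, Y, sigma=1.0):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def mean_element(X, weights, means, covs, sigma=1.0):
        # closed form of mu_p(x) = E_{Y~p} k(x, Y) for p = sum_j weights[j] * N(means[j], covs[j])
        d = X.shape[1]
        out = np.zeros(len(X))
        for w, m, C in zip(weights, means, covs):
            S = sigma ** 2 * np.eye(d) + C
            diff = X - m
            quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(S), diff)
            out += w * (sigma ** d / np.sqrt(np.linalg.det(S))) * np.exp(-0.5 * quad)
        return out

    def fwbq_design(n_points, sample_p, mixture, n_candidates=200, sigma=1.0, rng=None):
        # sample_p(n, rng) is assumed to return an (n, d) array of draws from p;
        # mixture = (weights, means, covs) describes the Gaussian-mixture target.
        rng = np.random.default_rng(rng)
        design = []
        for _ in range(n_points):
            cands = sample_p(n_candidates, rng)
            best, best_obj = None, np.inf
            for c in cands:                       # crude stand-in for the FW step:
                pts = np.vstack(design + [c])     # pick the candidate giving the smallest MMD^2
                w = np.full(len(pts), 1.0 / len(pts))
                obj = w @ eq_kernel(pts, pts, sigma) @ w \
                      - 2.0 * w @ mean_element(pts, *mixture, sigma=sigma)
                if obj < best_obj:
                    best, best_obj = c, obj
            design.append(best)
        X = np.vstack(design)
        K = eq_kernel(X, X, sigma) + 1e-10 * np.eye(len(X))
        z = mean_element(X, *mixture, sigma=sigma)
        w_bq = np.linalg.solve(K, z)              # BQ weights replace the FW weights
        return X, w_bq, z

    # The integral of f is then estimated by w_bq @ f(X); the BQ posterior variance
    # is (the zero-point MMD^2 constant) minus z @ w_bq, which supplies the
    # probabilistic error model discussed in the text.

Sequential Bayesian quadrature (SBQ), by contrast, selects each new point to minimise the MMD under the optimal BQ weights rather than under uniform or FW weights, which is consistent with its stronger empirical performance reported above.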
scheduled sampling for sequence prediction with recurrent neural networks samy bengio oriol vinyals navdeep jaitly noam shazeer google research mountain view ca usa bengio vinyals ndjaitly noam abstract recurrent neural networks can be trained to produce sequences of tokens given some input as exemplified by recent results in machine translation and image captioning the current approach to training them consists of maximizing the likelihood of each token in the sequence given the current recurrent state and the previous token at inference the unknown previous token is then replaced by token generated by the model itself this discrepancy between training and inference can yield errors that can accumulate quickly along the generated sequence we propose curriculum learning strategy to gently change the training process from fully guided scheme using the true previous token towards less guided scheme which mostly uses the generated token instead experiments on several sequence prediction tasks show that this approach yields significant improvements moreover it was used succesfully in our winning entry to the mscoco image captioning challenge introduction recurrent neural networks can be used to process sequences either as input output or both while they are known to be hard to train when there are long term dependencies in the data some versions like the long memory lstm are better suited for this in fact they have recently shown impressive performance in several sequence prediction problems including machine translation contextual parsing image captioning and even video description in this paper we consider the set of problems that attempt to generate sequence of tokens of variable size such as the problem of machine translation where the goal is to translate given sentence from source language to target language we also consider problems in which the input is not necessarily sequence like the image captioning problem where the goal is to generate textual description of given image in both cases recurrent neural networks or their variants like lstms are generally trained to maximize the likelihood of generating the target sequence of tokens given the input in practice this is done by maximizing the likelihood of each target token given the current state of the model which summarizes the input and the past output tokens and the previous target token which helps the model learn kind of language model over target tokens however during inference true previous target tokens are unavailable and are thus replaced by tokens generated by the model itself yielding discrepancy between how the model is used at training and inference this discrepancy can be mitigated by the use of beam search heuristic maintaining several generated target sequences but for continuous state space models like recurrent neural networks there is no dynamic programming approach so the effective number of sequences considered remains small even with beam search the main problem is that mistakes made early in the sequence generation process are fed as input to the model and can be quickly amplified because the model might be in part of the state space it has never seen at training time here we propose curriculum learning approach to gently bridge the gap between training and inference for sequence prediction tasks using recurrent neural networks we propose to change the training process in order to gradually force the model to deal with its own mistakes as it would have to during inference doing so the model explores more during 
training and is thus more robust to correct its own mistakes at inference as it has learned to do so during training we will show experimentally that this approach yields better performance on several sequence prediction tasks the paper is organized as follows in section we present our proposed approach to better train sequence prediction tasks with recurrent neural networks this is followed by section which draws links to some related approaches we then present some experimental results in section and conclude in section proposed approach we are considering supervised tasks where the training set is given in terms of pairs where is the input and can be either static like an image or dynamic like sequence while the target output is sequence yti of variable number of tokens that belong to fixed known dictionary model given single pair the log probability can be computed as log log log yt where is sequence of length represented by tokens yt the latter term in the above equation is estimated by recurrent neural network with parameters by introducing state vector ht that is function of the previous state and the previous output token log yt log yt where ht is computed by recurrent neural network as follows if ht otherwise yt is often implemented as linear of the state vector ht into vector of scores one for each token of the output dictionary followed by softmax transformation to ensure the scores are properly normalized positive and sum to is usually function that combines the previous state and the previous output in order to produce the current state this means that the model focuses on learning to output the next token given the current state of the model and the previous token thus the model represents the probability distribution of sequences in the most general form unlike conditional random fields and other models that assume independence between between outputs at different time steps given latent variable states the capacity of the model is only limited by the representational capacity of the recurrent and feedforward layers lstms with their ability to learn long range structure are especially well suited to this task and make it possible to learn rich distributions over sequences in order to learn variable length sequences special token eos that signifies the end of sequence is added to the dictionary and the model during training eos is concatenated to the end of each sequence during inference the model generates tokens until it generates eos although one could also use projection training training recurrent neural networks to solve such tasks is usually accomplished by using stochastic gradient descent to look for set of parameters that maximizes the log likelihood of producing the correct target sequence given the input data for all training pairs arg max log inference during inference the model can generate the full sequence given by generating one token at time and advancing time by one step when an eos token is generated it signifies the end of the sequence for this process at time the model needs as input the output token from the last time step in order to produce yt since we do not have access to the true previous token we can instead either select the most likely one given our model or sample according to it searching for the sequence with the highest probability given is too expensive because of the combinatorial growth in the number of sequences instead we use beam searching procedure to generate best sequences we do this by maintaining heap of best candidate sequences 
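Restated in explicit notation (the symbols here, including the output projection W and bias b, are ours), the factorisation, recurrence, and training objective described above are:

\[
\log P(Y \mid X) \;=\; \sum_{t=1}^{T} \log P\big(y_t \mid y_1^{t-1}, X\big),
\qquad
P\big(y_t \mid y_1^{t-1}, X\big) \;=\; P(y_t \mid h_t) \;=\; \mathrm{softmax}(W h_t + b),
\]
\[
h_t \;=\;
\begin{cases}
f(X) & \text{if } t = 1,\\
f(h_{t-1},\, y_{t-1}) & \text{otherwise,}
\end{cases}
\qquad
\theta^{\ast} \;=\; \arg\max_{\theta} \sum_{(X,Y)} \log P(Y \mid X;\, \theta),
\]

where f is the recurrent transition (for example an LSTM cell) and the softmax ensures the per-token scores are positive and sum to one.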
At each time step, new candidates are generated by extending each candidate by one token and adding them to the heap; at the end of the step the heap is pruned to only keep the best candidates. The beam search is truncated when no new sequences are added, and the best sequences are returned. While beam search is often used for discrete state based models like hidden Markov models, where dynamic programming can be used, it is harder to use efficiently for continuous state based models like recurrent neural networks, since there is no way to factor the followed state paths in continuous space; hence the actual number of candidates that can be kept during beam search decoding is very small. In all these cases, if a wrong decision is taken at time $t$, the model can be in a part of the state space that is very different from those visited from the training distribution and for which it doesn't know what to do; worse, it can easily lead to cumulative bad decisions, a classic problem in sequential Gibbs-sampling-type approaches to sampling, where future samples can have no influence on the past.

Bridging the gap with scheduled sampling. The main difference between training and inference for sequence prediction tasks when predicting token $y_t$ is whether we use the true previous token $y_{t-1}$ or an estimate $\hat{y}_{t-1}$ coming from the model itself. We propose here a sampling mechanism that will randomly decide, during training, whether we use $y_{t-1}$ or $\hat{y}_{t-1}$. Assuming we use a minibatch-based stochastic gradient descent approach, for every token to predict $y_t$ of the $i$-th minibatch of the training algorithm, we propose to flip a coin and use the true previous token with probability $\epsilon_i$, or an estimate coming from the model itself with probability $1-\epsilon_i$. The estimate of the model can be obtained by sampling a token according to the probability distribution modeled by $P(y_{t-1} \mid h_{t-1})$, or can be taken as the argmax. This process is illustrated in the figure. When $\epsilon_i = 1$ the model is trained exactly as before, while when $\epsilon_i = 0$ the model is trained in the same setting as inference. We propose here a curriculum learning strategy to go from one to the other: intuitively, at the beginning of training, sampling from the model would yield a random token since the model is not well trained, which could lead to very slow convergence, so selecting the true previous token more often should help; on the other hand, at the end of training, $\epsilon_i$ should favor sampling from the model more often, as this corresponds to the true inference situation, and one expects the model to already be good enough to handle it and sample reasonable tokens. Note that in the experiments we flipped the coin for every token; we also tried to flip the coin once per sequence, but the results were much worse, most probably because consecutive errors are amplified during the first rounds of training. (Figure: illustration of the scheduled sampling approach, where one flips a coin at every time step to decide whether to use the true previous token or one sampled from the model itself.) (Figure: examples of decay schedules: exponential decay, inverse sigmoid decay, and linear decay.) We thus propose to use a schedule to decrease $\epsilon_i$ as a function of $i$ itself, in a manner similar to how the learning rate is decreased in most modern stochastic gradient descent approaches. Examples of such schedules can be seen in the figure, as follows. Linear decay: $\epsilon_i = \max(\epsilon,\, k - c\,i)$, where $0 \le \epsilon < 1$ is the minimum amount of truth to be given to the model and $k$ and $c$ provide the offset and slope of the decay, which depend on the expected speed of convergence. Exponential decay: $\epsilon_i = k^i$, where $k < 1$ is a constant that depends on the expected speed of convergence. Inverse sigmoid decay: $\epsilon_i = k/(k + \exp(i/k))$, where $k \ge 1$ depends on the expected speed of convergence.
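The schedules and the per-token coin flip above take only a few lines to implement; the sketch below is ours, and the constants k, c, and eps_min are illustrative placeholders rather than the values used in the paper's experiments.

    # Hedged sketch of the decay schedules and the per-token coin flip.
    import math
    import random

    def linear_decay(i, k=1.0, c=1e-4, eps_min=0.05):
        return max(eps_min, k - c * i)

    def exponential_decay(i, k=0.9999):
        return k ** i                       # requires k < 1

    def inverse_sigmoid_decay(i, k=1000.0):
        return k / (k + math.exp(i / k))    # requires k >= 1

    def previous_token(true_prev, model_prev, eps_i):
        # with probability eps_i feed the ground-truth previous token,
        # otherwise feed the model's own prediction (sampled or argmax)
        return true_prev if random.random() < eps_i else model_prev

In training one would compute eps_i once per minibatch i and call previous_token independently at every target position, matching the per-token coin flips described above.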
we call our approach scheduled sampling note that when we sample the previous token from the model itself while training we could the gradient of the losses at times through that decision this was not done in the experiments described in this paper and is left for future work related work the discrepancy between the training and inference distributions has already been noticed in the literature in particular for control and reinforcement learning tasks searn was proposed to tackle problems where supervised training examples might be different from actual test examples when each example is made of sequence of decisions like acting in complex environment where few mistakes of the model early in the sequential decision process might compound and yield very poor global performance their proposed approach involves where at each one trains new model according to the current policy essentially the expected decisions for each situation applies it on test set and modifies the next iteration policy in order to account for the previous decisions and errors the new policy is thus combination of the previous one and the actual behavior of the model in comparison to searn and related ideas our proposed approach is completely online single model is trained and the policy slowly evolves during training instead of batch approach which makes it much faster to furthermore searn has been proposed in the context of reinforcement learning while we consider the supervised learning setting trained using stochastic gradient descent on the overall objective other approaches have considered the problem from ranking perspective in particular for parsing tasks where the target output is tree in this case the authors proposed to use beam search both during training and inference so that both phases are aligned the training beam is used to find in fact in the experiments we report in this paper our proposed approach was not meaningfully slower nor faster to train than the baseline the best current estimate of the model which is compared to the guided solution the truth using ranking loss unfortunately this is not feasible when using model like recurrent neural network which is now the technique in many sequential tasks as the state sequence can not be factored easily because it is continuous state and thus beam search is hard to use efficiently at training time as well as inference time in fact finally proposed an online algorithm for parsing problems that adapts the targets through the use of dynamic oracle that takes into account the decisions of the model the trained model is perceptron and is thus not like recurrent neural network and the probability of choosing the truth is fixed during training experiments we describe in this section experiments on three different tasks in order to show that scheduled sampling can be helpful in different settings we report results on image captioning constituency parsing and speech recognition image captioning image captioning has attracted lot of attention in the past year the task can be formulated as mapping of an image onto sequence of words describing its content in some natural language and most proposed approaches employ some form of recurrent network structure with simple decoding schemes notable exception is the system proposed in which does not directly optimize the log likelihood of the caption given the image and instead proposes pipelined approach since an image can have many valid captions the evaluation of this task is still an open problem some attempts have been 
made to design metrics that positively correlate with human evaluation and common set of tools have been published by the mscoco team we used the mscoco dataset from to train our model we trained on images and report results on separate development set of additional images each image in the corpus has different captions so the training procedure picks one at random creates of examples and optimizes the objective function defined in the image is preprocessed by pretrained convolutional neural network without the last classification layer similar to the one described in and the resulting image embedding is treated as if it was the first word from which the model starts generating language the recurrent neural network generating words is an lstm with one layer of hidden units and the input words are represented by embedding vectors of size the number of words in the dictionary is we used an inverse sigmoid decay schedule for for the scheduled sampling approach table shows the results on various metrics on the development set each of these metrics is variant of estimating the overlap between the obtained sequence of words and the target one since there were target captions per image the best result is always chosen to the best of our knowledge the baseline results are consistent slightly better with the current on that task while dropout helped in terms of log likelihood as expected but not shown it had negative impact on the real metrics on the other hand scheduled sampling successfully trained model more resilient to failures due to training and inference mismatch which likely yielded higher quality captions according to all the metrics ensembling models also yielded better performance both for the baseline and the schedule sampling approach it is also interesting to note that model trained while always sampling from itself hence in regime similar to inference dubbed always sampling in the table yielded very poor performance as expected because the model has hard time learning the task in that case we also trained model with scheduled sampling but instead of sampling from the model we sampled from uniform distribution in order to verify that it was important to build on the current model and that the performance boost was not just simple form of regularization we called this uniform scheduled sampling and the results are better than the baseline but not as good as our proposed approach we also experimented with flipping the coin once per sequence instead of once per token but the results were as poor as the always sampling approach table various metrics the higher the better on the mscoco development set for the image captioning task approach vs metric baseline baseline with dropout always sampling scheduled sampling uniform scheduled sampling baseline ensemble of scheduled sampling ensemble of meteor cider it worth noting that we used our scheduled sampling approach to participate in the mscoco image captioning challenge and ranked first in the final leaderboard constituency parsing another less obvious connection with the paradigm is constituency parsing recent work has proposed an interpretation of parse tree as sequence of linear operations that build up the tree this linearization procedure allowed them to train model that can map sentence onto its parse tree without any modification to the formulation the trained model has one layer of lstm cells and words are represented by embedding vectors of size we used an attention mechanism similar to the one described in which helps when 
considering the next output token to produce yt to focus on part of the input sequence only by applying softmax over the lstm state vectors corresponding to the input sequence the input word dictionary contained around words while the target dictionary contained symbols used to describe the tree we used an inverse sigmoid decay schedule for in the scheduled sampling approach parsing is quite different from image captioning as the function that one has to learn is almost deterministic in contrast to an image having large number of valid captions most sentences have unique parse tree although some very difficult cases exist thus the model operates almost deterministically which can be seen by observing that the train and test perplexities are extremely low compared to image captioning this different operating regime makes for an interesting comparison as one would not expect the baseline algorithm to make many mistakes however and as can be seen in table scheduled sampling has positive effect which is additive to dropout in this table we report the score on the wsj development set we should also emphasize that there are only training instances so overfitting contributes largely to the performance of our system whether the effect of sampling during training helps with regard to overfitting or the mismatch is unclear but the result is positive and additive with dropout once again model trained by always sampling from itself instead of using the groundtruth previous token as input yielded very bad results in fact so bad that the resulting trees were often not valid trees hence the in the corresponding metric table score the higher the better on the validation set of the parsing task approach baseline lstm baseline lstm with dropout always sampling scheduled sampling scheduled sampling with dropout speech recognition for the speech recognition experiments we used slightly different setting from the rest of the paper each training example is an pair where is sequence of input vectors xt and is sequence of tokens yt so each yt is aligned with the corresponding xt here xt are the acoustic features represented by log mel filter bank spectra at frame and yt is the corresponding target the targets used were labels generated from recipe using the kaldi toolkit but could very well have been phoneme labels this setting is different from the other experiments in that the model we used is the following log log log yt log yt where ht is computed by recurrent neural network as follows oh if ht xt otherwise where oh is vector of with same dimensionality as ht and is an extra token added to the dictionary to represent the start of each sequence we generated data for these experiments using the corpus and the kaldi toolkit as described in standard configurations were used for the experiments dimensional log mel filter banks and their first and second order temporal derivatives were used as inputs to each frame dimensional targets were generated for each time frame using forced alignment to transcripts using trained system the training validation and test sets have and sequences respectively and their average length was frames the validation set was used to choose the best epoch in training and the model parameters from that epoch were used to evaluate the test set the trained models had two layers of lstm cells and softmax layer for each of five configurations baseline configuration where the ground truth was always fed to the model configuration always sampling where the model was only fed in its own 
predictions from the last time step and three scheduled sampling configurations scheduled sampling where was ramped linearly from maximum value to minimum value over ten epochs and then kept constant at the final value for each configuration we trained models and report average performance over them training of each model was done over frame targets from the gmm the baseline configurations typically reached the best validation accuracy after approximately epochs whereas the sampling models reached the best accuracy after approximately epochs after which the validation accuracy decreased this is probably because the way we trained our models is not exact it does not account for the gradient of the sampling probabilities from which we sampled our targets future effort at tackling this problem may further improve results testing was done by finding the best sequence from beam search decoding using beam size of beams and computing the error rate over the sequences we also report the next step error rate where the model was fed in the ground truth to predict the class of the next frame for each of the models on the validation set to summarize the performance of the models on the training objective table shows summary of the results it can be seen that the baseline performs better next step prediction than the models that sample the tokens for input this is to be expected since the former has access to the groundtruth however it can be seen that the models that were trained with sampling perform better than the baseline during decoding it can also be seen that for this problem the always sampling model performs quite https well we hypothesize that this has to do with the nature of the dataset the states have lot of correlation the same state appears as the target for several frames and most of the states are constrained only to go to subset of other states next step prediction with groundtruth labels on this task ends up paying disproportionate attention to the structure of the labels and not enough to the acoustics input thus it achieves very good next step prediction error when the groundtruth sequence is fed in with the acoustic information but is not able to exploit the acoustic information sufficiently when the groundtruth sequence is not fed in for this model the testing conditions are too far from the training condition for it to make good predictions the model that is only fed its own prediction always sampling ends up exploiting all the information it can find in the acoustic signal and effectively ignores its own predictions to influence the next step prediction thus at test time it performs just as well as it does during training model such as the attention model of which predicts phone sequences directly instead of the highly redundant hmm state sequences would not suffer from this problem because it would need to exploit both the acoustic signal and the language model sufficiently to make predictions nevertheless even in this setting adding scheduled sampling still helped to improve the decoding frame error rate note that typically speech recognition experiments use hmms to decode predictions from neural networks in hybrid model here we avoid using an hmm altogether and hence we do not have the advantage of the smoothing that results from the hmm architecture and the language models thus the results are not directly comparable to the typical hybrid model results table frame error rate fer on the speech recognition experiments in next step prediction reported on validation set the 
ground truth is fed in to predict the next target like it is done during training in decoding experiments reported on test set beam searching is done to find the best sequence we report results on four different linear schedulings of sampling where was ramped down linearly from to for the baseline the model was only fed in the ground truth see section for an analysis of the results approach always sampling scheduled sampling scheduled sampling scheduled sampling baseline lstm next step fer decoding fer conclusion using recurrent neural networks to predict sequences of tokens has many useful applications like machine translation and image description however the current approach to training them predicting one token at time conditioned on the state and the previous correct token is different from how we actually use them and thus is prone to the accumulation of errors along the decision paths in this paper we proposed curriculum learning approach to slowly change the training objective from an easy task where the previous token is known to realistic one where it is provided by the model itself experiments on several sequence prediction tasks yield performance improvements while not incurring longer training times future work includes the errors through the sampling decisions as well as exploring better sampling strategies including conditioning on some confidence measure from the model itself references bengio simard and frasconi learning long term dependencies is hard ieee transactions on neural networks hochreiter and schmidhuber long memory neural computation sutskever vinyals and le sequence to sequence learning with neural networks in advances in neural information processing systems nips vinyals kaiser koo petrov sutskever and hinton grammar as foreign language in vinyals toshev bengio and erhan show and tell neural image caption generator in ieee conference on computer vision and pattern recognition cvpr donahue hendricks guadarrama rohrbach venugopalan saenko and darrell recurrent convolutional networks for visual recognition and description in ieee conference on computer vision and pattern recognition cvpr bengio louradour collobert and weston curriculum learning in proceedings of the international conference on machine learning icml lafferty mccallum and pereira conditional random fields probabilistic models for segmenting and labeling sequence data in proceedings of the eighteenth international conference on machine learning icml pages san francisco ca usa morgan kaufmann publishers iii langford and marcu structured prediction as classification machine learning journal ross gordon and bagnell reduction of imitation learning and structured prediction to online learning in proceedings of the workshop on artificial intelligence and statistics aistats venkatraman herbert and bagnell improving prediction of learned time series models in aaai conference on artificial intelligence aaai collins and roark incremental parsing with the perceptron algorithm in proceedings of the association for computational linguistics acl goldberg and nivre dynamic oracle for dependency parsing in proceedings of coling mao xu yang wang huang and yuille deep captioning with multimodal recurrent neural networks in international conference on learning representations iclr kiros salakhutdinov and zemel unifying embeddings with multimodal neural language models in tacl karpathy and li deep alignments for generating image descriptions in ieee conference on computer vision and pattern recognition cvpr fang gupta 
iandola srivastava deng dollar gao he mitchell platt zitnick and zweig from captions to visual concepts and back in ieee conference on computer vision and pattern recognition cvpr vedantam zitnick and parikh cider image description evaluation in ieee conference on computer vision and pattern recognition cvpr lin maire belongie hays perona ramanan and zitnick microsoft coco common objects in context ioffe and szegedy batch normalization accelerating deep network training by reducing internal covariate shift in proceedings of the international conference on machine learning icml cui ronchi lin dollr and zitnick http microsoft coco captioning challenge bahdanau cho and bengio neural machine translation by jointly learning to align and translate in international conference on learning representations iclr hovy marcus palmer ramshaw and weischedel ontonotes the solution in proceedings of the human language technology conference of the naacl short papers pages new york city usa june association for computational linguistics povey ghoshal boulianne burget glembek goel hannemann motlicek qian schwarz silovsky stemmer and vesely the kaldi speech recognition toolkit in ieee workshop on automatic speech recognition and understanding ieee signal processing society december ieee catalog no jaitly exploring deep learning methods for discovering features in speech signals phd thesis university of toronto jan chorowski dzmitry bahdanau kyunghyun cho and yoshua bengio continuous speech recognition using recurrent nn first results arxiv preprint 
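Returning to the speech recognition variant used in the experiments above, where each target $y_t$ is aligned with an input frame $x_t$, the per-frame model can be written (in our notation, reconstructing the garbled equations) as

\[
\log P(Y \mid X) \;=\; \sum_{t=1}^{T} \log P(y_t \mid h_t),
\qquad
h_t \;=\;
\begin{cases}
f(\mathbf{0}_h,\, x_1,\, \langle \mathrm{GO} \rangle) & \text{if } t = 1,\\
f(h_{t-1},\, x_t,\, y_{t-1}) & \text{otherwise,}
\end{cases}
\]

where $\mathbf{0}_h$ is a vector of zeros with the same dimensionality as $h_t$ and $\langle \mathrm{GO} \rangle$ is the extra start-of-sequence token added to the dictionary; scheduled sampling then replaces $y_{t-1}$ by the model's own prediction with probability $1-\epsilon_i$, exactly as in the other tasks.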
unified view of matrix completion under general structural constraints suriya gunasekar ut at austin usa suriya arindam banerjee umn twin cities usa banerjee joydeep ghosh ut at austin usa ghosh abstract in this paper we present unified analysis of matrix completion under general structural constraints induced by any norm regularization we consider two estimators for the general problem of structured matrix completion and provide unified upper bounds on the sample complexity and the estimation error our analysis relies on results from generic chaining and we establish two intermediate results of independent interest in characterizing the size or complexity of low dimensional subsets in high dimensional ambient space certain partial complexity measure encountered in the analysis of matrix completion problems is characterized in terms of well understood complexity measure of gaussian widths and it is shown that form of restricted strong convexity holds for matrix completion problems under general norm regularization further we provide several examples of structures included in our framework notably the recently proposed spectral norm introduction the task of completing the missing entries of matrix from an incomplete subset of potentially noisy entries is encountered in many applications including recommendation systems data imputation covariance matrix estimation and sensor localization among others traditionally high dimensional estimation problems where the number of parameters to be estimated is much higher than the number of observations has been extensively studied in the recent literature however matrix completion problems are particularly as the observations are both limited high dimensional and the measurements are extremely localized the observations consist of individual matrix entries the localized measurement model in contrast to random gaussian or measurements poses additional complications in high dimensional estimation for estimation in high dimensional problems including matrix completion it is imperative that low dimensional structural constraints are imposed on the target for matrix completion the special case of constraint has been widely studied several existing work propose tractable estimators with recovery guarantees for approximate matrix completion recent work addresses the extension to structures with decomposable norm regularization however the scope of matrix completion extends for low dimensional structures far beyond simple or decomposable norm structures in this paper we present unified statistical analysis of matrix completion under general set of low dimensional structures that are induced by any suitable norm regularization we provide statistical analysis of two generalized matrix completion estimators the constrained norm minimizer and the generalized matrix dantzig selector section the main results in the paper theorem provide unified upper bounds on the sample complexity and estimation error of these estimators for matrix completion under any norm regularization existing results on matrix completion with low rank or other decomposable structures can be obtained as special cases of our general results our unified analysis of sample complexity is motivated by recent work on high dimensional estimation using global sub gaussian measurements key ingredient in the recovery analysis of high dimensional estimation involves establishing certain variation of restricted isometry property rip of the measurement operator it has been shown that such properties are 
satisfied by gaussian and measurement operators with high probability unfortunately as has been noted before by candes et al owing to highly localized measurements such conditions are not satisfied in the matrix completion problem and the existing results based on global sub gaussian measurements are not directly applicable in fact key question we consider is given the radically limited measurement model in matrix completion by how much would the sample complexity of estimation increase beyond the known sample complexity bounds for global sub gaussian measurements our results upper bounds the sample complexity for matrix completion to within log factor over that for estimation under global sub gaussian measurements while the result was previously known for low rank matrix completion using nuclear norm minimization with careful use of generic chaining we show that the log factor suffices for structures induced by any norm as key intermediate result we show that useful form of restricted strong convexity rsc holds for the localized measurements encountered in matrix completion under general norm regularized structures the result substantially generalizes existing rsc results for matrix completion under the special cases of nuclear norm and decomposable norm regularization for our analysis we use tools from generic chaining to characterize the main results theorem in terms of the gaussian width definition of certain error sets gaussian widths provide powerful geometric characterization for quantifying the complexity of structured low dimensional subset in high dimensional ambient space numerous tools have been developed in the literature for bounding the gaussian width of structured sets unified characterization of results in terms of gaussian width has the advantage that this literature can be readily leveraged to derive new recovery guarantees for matrix completion under suitable structural constraints appendix in addition to the theoretical elegance of such unified framework identifying useful but potentially low dimensional structures is of significant practical interest the broad class of structures enforced through symmetric convex bodies and symmetric atomic sets can be analyzed under this paradigm section such specialized structures can capture the constraints in certain applications better than simple in particular we discuss in detail trivial example of the spectral norm introduced by mcdonald et al to summarize the key contributions of the paper theorem provide unified upper bounds on sample complexity and estimation error for matrix completion estimators using general norm regularization substantial generalization of the existing results on matrix completion under structural constraints theorem is applied to derive statistical results for the special case of matrix completion under spectral norm regularization an intermediate result theorem shows that under any norm regularization variant of restricted strong convexity rsc holds in the matrix completion setting with extremely localized measurements further certain partial measure of complexity of set is encountered in matrix completion analysis another intermediate result theorem provides bounds on the partial complexity measures in terms of better understood complexity measure of gaussian width these intermediate results are of independent interest beyond the scope of the paper notations and preliminaries indexes are typically used to index rows and columns respectively of matrices and index is used to index the observations ei 
ej ek etc denote the standard basis in appropriate notation and are used to denote matrix and vector respectively with independent standard gaussian random variables and denote the probability of an event and the expectation of random variable respectively given an integer let euclidean norm in vector space is denoted as hx xi for matrix ppx with singular values the nuclear norm kxk common norms include the frobenius norm kxkf σi the spectral norm kxkop and the maximum norm maxij also let for brevity we omit the explicit dependence of dimension unless necessary kxkf and kxkf finally given norm defined on vectorspace its dual norm is given by supky hx definition gaussian width gaussian width of set is widely studied measure of complexity of subset in high dimensional ambient space and is given by wg eg sup hx gi where recall that is matrix of independent standard gaussian random variables some key results on gaussian width are discussed in appendix definition random variable the norm of random variable is given by is if equivalently is if one of the following conditions are satisfied for some constants and lemma of or if ex then esx definition restricted strong convexity rsc function is said to satisfy restricted strong convexity rsc at with respect to subset if for some rsc parameter κl κl definition spikiness ratio for measure of its spikiness is given by αsp kxkf definition norm compatibility constant the compatibility constant of norm under closed convex cone is defined as follows ψr sup kxkf structured matrix completion denote the ground truth target matrix as let in the noisy matrix completion observations consists of individual entries of observed through an additive noise channel noise given list of independently sampled standard basis ek eik jk ik jk with potential duplicates observations yk are given by yk ek ξηk for where is the noise vector of independent random variables with ηk and var ηk and is scaled variance of noise per observation further let kηk for constant recall from definition also without loss of generality assume normalization kf uniform sampling assume that the entries in are drawn independently and uniformly ek uniform ei for ek let ek be the standard basis of given define pω as pω hx ek iek structural constraints for matrix completion with low dimensional structural constraints on are necessary for we consider generalized constraint setting wherein for some model space is enforced through surrogate norm regularizer we make no further assumptions on other than it being norm in low spikiness in matrix completion under uniform sampling model further restrictions on beyond low dimensional structure are required to ensure that the most informative entries of the matrix are observed with high probability early work assumed stringent matrix incoherence conditions for completion to preclude such matrices while more recent work relax these assumptions to more intuitive restriction of the spikiness ratio defined in however under this relaxation only an approximate recovery is typically guaranteed in regime as opposed to near exact recovery under incoherence assumptions assumption spikiness ratio there exists such that kf αsp kθ special cases and applications we briefly introduce some interesting examples of structural constraints with practical applications example low rank and decomposable norms is the most common structure used in many matrix estimation problems including collaborative filtering pca spectral clustering etc convex estimators using nuclear norm 
regularization has been widely studied statistically recent work extends the analysis of low rank matrix completion to general decomposable norms example spectral norm and significant example of norm regularization that is not decomposable is the spectral norm recently introduced by mcdonald et al spectral norm is essentially the vector norm applied on the singular values of matrix without loss of generality let be the set of all subsets of cardinality at most and let let gk gk vg vg supp vg the spectral norm is given by nx inf kvg vg gk mcdonald et al showed that spectral norm is special case of cluster norm it was further shown that in learning wherein the tasks columns of are assumed to be clustered into dense groups the cluster norm provides between variance inverse variance and the norm of the task vectors both and demonstrate superior empirical performance of cluster norms and norm over traditional trace norm and spectral elastic net minimization on bench marked matrix completion and learning datasets however statistical analysis of consistent matrix completion using spectral norm regularization has not been previously studied in section we discuss the consequence of our main theorem for this special case example additive decomposition elementwise sparsity is common structure often assumed in estimation problems however in matrix completion elementwise sparsity conflicts with assumption and more traditional incoherence assumptions indeed it is easy to see that with high probability most of the uniformly sampled observations will be zero and an informed prediction is infeasible however elementwise sparse structures can often be modelled within an additive decomposition framework wherein such that each component matrix is in turn structured low used for robust pca in such structures there is no scope for recovering sparse components outside the observed indices and it is assumed that is sparse supp in such cases our results are applicable under additional regularity assumptions that enforces on the superposed matrix candidate norm regularizer for such structures is the weighted infimum convolution of individual structure inducing norms rw inf wk rk example other applications other potential applications including cut matrices structures induced by compact convex sets norms inducing structured sparsity assumptions on the spectrum of etc can also be handled under the paradigm of this paper structured matrix estimator let be the norm surrogate for the structural constraints on and denote its dual norm we propose and analyze two convex estimators for the task of structured matrix completion constrained norm minimizer cn argmin kpω λcn generalized matrix dantzig selector ds argmin pω pω λds where recall that rω is the linear adjoint of pω hpω yi hx note theorem gives consistency results for and respectively under certain conditions on the parameters λcn λds and in particular these conditions assume knowledge of the noise variance and spikiness ratio αsp in practice typically and αsp are unknown and the parameters are tuned by validating on held out data main results we define the following restricted error cone and its subset tr tr cone and er tr cn and ds be the estimates from and respectively if λcn and λds are chosen such that let cn cn belongs to the feasible sets in and respectively then the error matrices ds ds are contained in tr and cn theorem constrained norm minimizer the problem setup in section let be the estimate from with λcn for large enough if wg er log then there exists an 
rsc parameter with log and constants and er log er exp wg such that with probability greater than wg log wg kf max ds theorem matrix dantzig selector under the setup in section let be the estimate from with λds pω for large enough if er log then there exists an rsc parameter with wg er and constant such that with probability greater than wg log wg tr kf max recall gaussian width wg and subspace compatibility constant ψr from and respectively remarks er ψr tr and if and rank then wg log using these bounds in theorem recovers results for low rank matrix completion under spikiness assumption for both estimators upper bound on sample complexity is dominated by the square of gaussian width which is often considered the effective dimension of subset in high dimensional space and plays key role in high dimensional estimation under gaussian measurement ensembles the results show that independent of the upper bound on sample complexity for consistent matrix completion with highly localized measurements is within log factor of the known sample complexity of wg er for estimation from gaussian measurements first term in estimation error bounds in theorem scales with which is the per observation noise variance upto constant the second term is an upper bound on error that arises due to unidentifiability of within certain radius under the spikiness constraints in contrast show exact recovery when using more stringent matrix incoherence conditions cn from theorem is comparable to the result by et al for low rank bound on matrix completion under regime where the first term dominates and those of for high dimensional estimation under gaussian measurements with bound on wg er it is easy to specialize this result for new structural constraints however this bound is potentially loose and asymptotically converges to constant error proportional to the noise variance the estimation error bound in theorem is typically sharper than that in theorem however for specific structures using application of theorem requires additional bounds on and ψr tr besides wg er kpω partial complexity measures recall that for wg hx gi and is standard normal vector definition partial complexity measures given randomly sampled ek and centered random vector the partial measure of is given by wω eω sup hx special cases of being vector of standard gaussian or standard rademacher variables are of particular interest note in the case of symmetric like and wω hx and the later expression will be used interchangeably ignoring the constant term theorem partial gaussian complexity let with interior and let be sampled according to universal constants and such that wω wg eω sup kpω wω wg sup kx also for centered vector constant wω wω note for the second term in is consequence of the localized measurements spectral norm we introduced spectral norm in section the estimators from and for spectral norm can be efficiently solved via proximal methods using the proximal operators derived in we are interested in the statistical guarantees for matrix completion using spectral norm regularization we extend the analysis for upper bounding the gaussian width of the descent cone for the vector norm by to the case of spectral norm wlog let let be the vector of singular values of sorted in order pp let be the unique integer satisfying σi denote and finally for σi and σi σi lemma if rank of is and er is the error set for then kσ wg er proof of the above lemma is provided in the appendix lemma can be combined with theorem to obtain recovery guarantees for matrix 
completion under spectral norm discussions and related work sample complexity for consistent recovery in high dimensional convex estimation it is desirable that the descent cone at the target parameter is small relative to the feasible set enforced by the observations of the estimator thus it is not surprising that the sample complexity and estimation error bounds of an estimator depends on some measure of of the error cone at results in this paper are largely characterized in terms of widely used complexity measure of gaussian width wg and can be compared with the literature on estimation from gaussian measurements error bounds theorem provides estimation error bounds that depends only on the gaussian width of the descent cone in regime this result is comparable to analogous results of constrained norm minimization however this bound is potentially loose owing to mismatched term using squared loss and asymptotically converges to constant error proportional to the noise variance tighter analysis on the estimation error can be obtained for the matrix dantzig selector from theorem however application of theorem requires computing high probability upper bound on the literature on norms of random matrices can be exploited in computing such bounds beside in special cases if then can be used to obtain asymptotically consistent results finally under near the second term in the results of theorem dominates and bounds are weaker than that of owing to the relaxation of stronger incoherence assumption related work and future directions the closest related work is the result on consistency of matrix completion under decomposable norm regularization by results in this paper are strict generalization to general norm regularized not necessarily decomposable matrix completion we provide examples of application where structures enforced by such norms are of interest further in contrast to our results that are based on gaussian width the rsc parameter in depends on modified complexity measure κr see definition in an advantage of results based on gaussian width is that application of theorem for special cases can greatly benefit from the numerous tools in the literature for the computation of wg another closely related line of work is the analysis of high dimensional estimation under random gaussian or measurements however the analysis from this literature rely on variants of rip of the measurement ensemble which is not satisfied by the the extremely localized measurements encountered in matrix completion in an intermediate result we establish form of rsc for matrix completion under general norm regularization result that was previously known only for nuclear norm and decomposable norm regularization in future work it is of interest to derive matching lower bounds on estimation error for matrix completion under general low dimensional structures along the lines of and explore special case applications of the results in the paper we also plan to derive explicit characterization of λds in terms of gaussian width of unit balls by exploiting generic chaining results for general banach spaces proof sketch proofs of the lemmas are provided in the appendix proof of theorem define the following set of matrices for constant from theorem αsp kxkf define log wg case spiky error matrix when the error matrix from or has large spikiness ratio kθk in following bound on error is immediate using cn proposition spiky error for the constant in theorem if αsp then matrix log cn ds an analogous result also holds for ds cn recall 
from that pω ξη case error matrix let where consists of independent random variables with ηk var ηk and kηk for constant restricted strong convexity rsc recall tr and er from the most significant step in the proof of theorem involves showing that over useful subset of tr form of rsc is satisfied by squared loss penalty theorem restricted strong convexity let wg er log for large enough constant there exists rsc parameter with the following holds greater that exp wg er log and constant such that tr kpω proof in appendix combines empirical process tools along with theorem constrained norm minimizer lemma under the conditions of theorem letp be constant such that kηk there exists universal constant such that if λcn then greater than exp ds tr and kpω cn cn βc then using theorem and lemma using λcn in if cn cn kpω matrix dantzig selector proposition λds pω ds tr pω pω ds and triangle inequality also above result follows from optimality of ds pω ds ds ψr tr ds kf kpω where recall norm compatibility constant ψr tr from finally using theorem ds ds ψr tr kpω ds proof of theorem let the entries of ek eik jk be sampled as in recall that is standard normal vector for compact it suffices to prove theorem for dense countable subset of overloading to such countable subset define following random process xω where xω hx hx ek igk we start with key lemma in the proof of theorem proof of this lemma provided in appendix uses tools from the broad topic of generic chaining developed in recent works lemma for compact subset with interior constants such that wg sup kpω wω sup xω lemma there exists constants such that for compact with interior sup kpω sup kx wω theorem follows by combining lemma and lemma and simple algebraic manipulations using ab and triangle inequality see appendix the statement in theorem about partial complexity follows from standard result in empirical process given in lemma in the appendix acknowledgments we thank the anonymous reviewers for helpful comments and suggestions gunasekar and ghosh acknowledge funding from nsf grants and banerjee acknowledges nsf grants and nasa grant references amelunxen lotz mccoy and tropp living on the edge geometric theory of phase transitions in convex optimization inform inference argyriou foygel and srebro sparse prediction with the norm in nips banerjee chen fazayeli and sivakumar estimation with norm regularization in nips banerjee merugu dhillon and ghosh clustering with bregman divergences jmlr cai liang and rakhlin geometrizing local rates of convergence for linear inverse problems arxiv preprint li ma and wright robust principal component analysis acm and plan matrix completion with noise proceedings of the ieee and recht exact matrix completion via convex optimization focm emmanuel candes and terence tao decoding by linear programming information theory ieee transactions on chandrasekaran recht parrilo and willsky the convex geometry of linear inverse problems foundations of computational mathematics davenport plan berg and wootters matrix completion inform inference dudley the sizes of compact subsets of hilbert space and continuity of gaussian processes journal of functional analysis edelman eigenvalues and condition numbers of random matrices journal on matrix analysis and applications fazel hindi and boyd rank minimization heuristic with application to minimum order system approximation in american control conference forster and warmuth relative expected instantaneous loss bounds journal of computer and system sciences gunasekar ravikumar and 
ghosh exponential family matrix completion under structural constraints in icml jacob vert and bach clustered learning convex formulation in nips keshavan montanari and oh matrix completion from few entries ieee trans it keshavan montanari and oh matrix completion from noisy entries jmlr klopp noisy matrix completion with general sampling distribution bernoulli klopp matrix completion by singular value thresholding sharp bounds arxiv preprint arxiv vladimir koltchinskii karim lounici alexandre tsybakov et al penalization and optimal rates for noisy matrix completion the annals of statistics ledoux and talagrand probability in banach spaces isoperimetry and processes springer litvak pajor rudelson and smallest singular value of random matrices and geometry of random polytopes advances in mathematics mcdonald pontil and stamos new perspectives on and cluster norms arxiv preprint negahban and wainwright restricted strong convexity and weighted matrix completion optimal bounds with noise jmlr negahban yu wainwright and ravikumar unified framework for analysis of with decomposable regularizers in nips recht simpler approach to matrix completion jmlr richard obozinski and vert tight convex relaxations for sparse matrix factorization in arxiv srebro and shraibman rank and in learning theory springer talagrand majorizing measures the generic chaining the annals of probability talagrand majorizing measures without measures annals of probability talagrand upper and lower bounds for stochastic processes springer tropp tail bounds for sums of random matrices foundations of computational mathematics tropp convex recovery of structured signal from independent random linear measurements arxiv preprint vershynin introduction to the analysis of random matrices compressed sensing pages vershynin estimation in high dimensions geometric perspective arxiv watson characterization of the subdifferential of some matrix norms linear algebra and its applications yang and ravikumar dirty statistical models in nips 
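The matrix completion discussion above characterizes sample complexity and error bounds through the Gaussian width w_G of the error cone and through a spikiness ratio of the target matrix. The sketch below is a minimal illustration of both quantities and is not code from the paper: spikiness_ratio uses the common normalization sqrt(d1*d2) * ||Theta||_inf / ||Theta||_F, and gaussian_width_nuclear_ball estimates w_G of the nuclear-norm unit ball by Monte Carlo, relying on the standard duality fact that the supremum of <G, X> over that ball equals the spectral norm of G. The function names, constants, and example matrices are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's code) of two quantities used in
# the analysis above: the spikiness ratio of a matrix and a Monte Carlo estimate
# of the Gaussian width of the nuclear-norm unit ball.
import numpy as np

def spikiness_ratio(theta):
    """alpha_sp = sqrt(d1 * d2) * ||theta||_inf / ||theta||_F (common normalization)."""
    d1, d2 = theta.shape
    return np.sqrt(d1 * d2) * np.abs(theta).max() / np.linalg.norm(theta, "fro")

def gaussian_width_nuclear_ball(d1, d2, num_samples=200, seed=0):
    """Estimate w_G({X : ||X||_* <= 1}) = E ||G||_op by Monte Carlo,
    with G a d1 x d2 matrix of i.i.d. standard normal entries."""
    rng = np.random.default_rng(seed)
    widths = [np.linalg.norm(rng.standard_normal((d1, d2)), 2)  # spectral norm
              for _ in range(num_samples)]
    return float(np.mean(widths))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    low_rank = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 40))
    print("spikiness ratio:", spikiness_ratio(low_rank))
    print("estimated Gaussian width of nuclear ball:", gaussian_width_nuclear_ball(50, 40))
```

Estimates of this kind are one way to use the tools for computing w_G that the related-work discussion points to when applying the general bounds to special cases.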
efficient output kernel learning for multiple tasks pratik maksim matthias and bernt saarland university germany max planck institute for informatics germany abstract the paradigm of learning is that one can achieve better generalization by learning tasks jointly and thus exploiting the similarity between the tasks rather than learning them independently of each other while previously the relationship between tasks had to be in the form of an output kernel recent approaches jointly learn the tasks and the output kernel as the output kernel is positive semidefinite matrix the resulting optimization problems are not scalable in the number of tasks as an eigendecomposition is required in each step using the theory of positive semidefinite kernels we show in this paper that for certain class of regularizers on the output kernel the constraint of being positive semidefinite can be dropped as it is automatically satisfied for the relaxed problem this leads to an unconstrained dual problem which can be solved efficiently experiments on several and data sets illustrate the efficacy of our approach in terms of computational efficiency as well as generalization performance introduction learning mtl advocates sharing relevant information among several related tasks during the training stage the advantage of mtl over learning tasks independently has been shown theoretically as well as empirically the focus of this paper is the question how the task relationships can be inferred from the data it has been noted that naively grouping all the tasks together may be detrimental in particular outlier tasks may lead to worse performance hence clustered learning algorithms aim to learn groups of closely related tasks the information is then shared only within these clusters of tasks this corresponds to learning the task covariance matrix which we denote as the output kernel in this paper most of these approaches lead to problems in this work we focus on the problem of directly learning the output kernel in the learning framework the kernel on input and output is assumed to be decoupled as the product of scalar kernel and the output kernel which is positive semidefinite matrix in classical learning algorithms the degree of relatedness between distinct tasks is set to constant and is optimized as hyperparameter however constant similarity between tasks is strong assumption and is unlikely to hold in practice thus recent approaches have tackled the problem of directly learning the output kernel solves formulation in the framework of reproducing kernel hilbert spaces involving squared loss where they penalize the frobenius norm of the output kernel as regularizer they formulate an invex optimization problem that they solve optimally in comparison recently proposed an efficient barrier method to optimize generic convex output kernel learning formulation on the other hand proposes convex formulation to learn low rank output kernel matrix by enforcing trace constraint the above approaches solve the resulting optimization problem via alternate minimization between task parameters and the output kernel each step of the alternate minimization requires an value decomposition of matrix having as size the number of tasks and problem corresponding to learning all tasks independently in this paper we study similar formulation as however we allow arbitrary convex loss functions and employ general for including the frobenius norm as regularizer for the output kernel our problem is jointly convex over the task parameters and 
the output kernel small leads to sparse output kernels which allows for an easier interpretation of the learned task relationships in the output kernel under certain conditions on we show that one can drop the constraint that the output kernel should be positive definite as it is automatically satisfied for the unconstrained problem this significantly simplifies the optimization and our result could also be of interest in other areas where one optimizes over the cone of positive definite matrices the resulting unconstrained dual problem is amenable to efficient optimization methods such as stochastic dual coordinate ascent which scale well to large data sets overall we do not require any eigenvalue decomposition operation at any stage of our algorithm and no alternate minimization is necessary leading to highly efficient methodology furthermore we show that this trick not only applies to but also applies to large class of regularizers for which we provide characterization our contributions are as follows we propose generic regularized output kernel matrix learning formulation which can be extended to large class of regularizers we show that the constraint on the output kernel to be positive definite can be dropped as it is automatically satisfied leading to an unconstrained dual problem we propose an efficient stochastic dual coordinate ascent based method for solving the dual formulation we empirically demonstrate the superiority of our approach in terms of generalization performance as well as significant reduction in training time compared to other methods learning the output kernel the paper is organized as follows we introduce our formulation in section our main technical result is discussed in section the proposed optimization algorithm is described in section in section we report the empirical results all the proofs can be found in the supplementary material the output kernel learning formulation we first introduce the setting considered in this paper we denote the number of tasks by we assume that all tasks have common input space and common positive definite kernel function we denote by the feature map and by hk the reproducing kernel hilbert space rkhs associated with the training data is xi yi ti where xi ti is the task the instance belongs to and yi is the corresponding label moreover we have positive definite matrix on the set of tasks where is the set of symmetric and positive semidefinite matrices if one arranges the predictions of all tasks in vector one can see learning as learning function in rkhs see and references therein however in this paper we use the correspondence between and kernels see in order to limit the technical overhead in this framework we define the joint kernel of input space and the set of tasks as we denote the corresponding rkhs of functions on as hm and by the corresponding norm we formulate the output kernel learning problem for multiple tasks as min yi xi ti kf khm where is the convex loss function convex in the second argument is convex regularizer penalizing the complexity of the output kernel and is the regularization parameter note that kf khm implicitly depends also on in the following we show that can be reformulated into jointly convex problem in the parameters of the prediction function and the output kernel using the standard representer theorem see the supplementary material for fixed output kernel one can show that the optimal solution hm of can be written as γis xi γis xi with the explicit form of the prediction function one can rewrite 
the main problem as min yi γir γjs kij θrs γjs kji θs ti where θrs and kij xi xj unfortunately problem is not jointly convex in and due to the product in the second term similar problem has been analyzed in they could show that for the squared loss and kθkf the corresponding optimization problem is invex and directly optimize it for an invex function every stationary point is globally optimal we follow different path which leads to formulation similar to the one of used for learning an input mapping see also our formulation for the output kernel learning problem is jointly convex in the task kernel and the task parameters we present derivation for the general rkhs hk analogous to the linear case presented in we use the following variable transformation βit θts γis γis resp st it in the last expression has to be understood as the if is not invertible note that this causes no problems as in case is not invertible we can without loss of generality restrict in to the range of the transformation leads to our final problem formulation where the prediction function and its squared norm kf khm can be written as kf khm βit xi xi xj sr is jr we get our final primal optimization problem min yi sr βis βjr kij βjti kji before we analyze the convexity of this problem we want pn to illustrate the connection to the formulations in with the task weight vectors wt βjt xj hk we get predictions as hwt and one can rewrite kf khm xi xj sr is jr sr hws wt this identity is known for rkhs see and references therein when is times pt the identity matrix then kf khm kwκt and thus is learning the tasks independently as mentioned before the convexity of the expression of kf khm is crucial for the convexity of the full problem the following result has been shown in see also lemma let denote the range of and let be the pseudoinverse the extended function defined as pn if sr βis βjr xi xj else is jointly convex the formulation in is similar to uses the constraint trace instead of regularizer enforcing low rank of the output kernel on the other hand employs squared frobenius norm for with squared loss function proposed an efficient algorithm for convex instead we think that sparsity of is better to avoid the emergence of spurious relations between tasks and also leads to output kernels which are easier to interpret thus we propose to use the following regularization functional for the output kernel kθkp for several approaches employ alternate minimization scheme involving costly eigendecompositions of matrix per iteration as in the next section we show that for certain set of values of one can derive an unconstrained dual optimization problem which thus avoids the explicit minimization over the cone the resulting unconstrained dual problem can then be easily optimized by stochastic coordinate ascent having explicit expressions of the primal variables and in terms of the dual variables allows us to get back to the original problem unconstrained dual problem avoiding optimization over the primal formulation is convex output kernel learning problem the next lemma derives the fenchel dual function of this still involves the optimization over the primal variable main contribution of this paper is to show that this optimization problem over the cone can be solved with an analytical solution for certain class of regularizers in the following we denote by αr αi ti the dual variables corresponding to task and by krs the kernel matrix xi xj ti tj corresponding to the dual variables of tasks and lemma let be the conjugate function of 
the loss li yi then rn θrs hαr krs αs max is the dual function of where rn are the dual variables the primal variable in and the prediction function can be expressed in terms of and as βis αi θsti and αj θstj xj respectively where tj is the task of the training example we now focus on the remaining maximization problem in the dual function in max θrs hαr krs αs this is semidefinite program which is computationally expensive to solve and thus prohibits to scale the output kernel learning problem to large number of tasks however we show in the following that this problem has an analytical solution for subset of the regularizers pt for for better readability we defer more general result towards the end of the section the basic idea is to relax the constraint on rt in so that it is equivalent to the computation of the conjugate of if the maximizer of the relaxed problem is positive one has found the solution of the original problem theorem let and max θrs ρrs then with ρrs hαr krs αs we have hα krs αs and the maximizer is given by the positive matrix hαr krs αs plugging the result of the previous theorem into the dual function of lemma we get for and with kθkp the following unconstrained dual of our main problem maxn hα krs αs note that by doing the variable transformation κi αci we effectively have only one hyperparameter in this allows us to more efficiently the range of admissible values for in theorem lies in the interval where we get for the value and as table examples of regularizers together with their generating function and the explicit form of in terms of the dual variables ρrs hαr krs αs the optimal value of is given pt in terms of as max hρ θi ρrs ez cosh rs log if θrs rs rs rs else θrs arcsinh θrs eρrs arcsinh ρrs we have the regularizer for together with the squared loss has been considered in the primal in our analytical expression of the dual is novel and allows us to employ stochastic dual coordinate ascent to solve the involved primal optimization problem please also note that by optimizing the dual we have access to the duality gap and thus stopping criterion this is in contrast to the alternating scheme of for the primal problem which involves costly matrix operations our runtime experiments show that our solver for outperforms the solvers of finally note that even for suboptimal dual variables the corresponding matrix in is positive semidefinite thus we always get feasible set of primal variables characterizing the set of convex regularizers which allow an analytic expression for the dual function the previous theorem raises the question for which class of convex separable regularizers we can get an analytical expression of the dual function by explicitly solving the optimization problem over the positive semidefinite cone key element in the proof of the previous theorem is the characterization of functions which when applied elementwise aij ti to positive semidefinite matrix result in matrix that is this set of functions has been characterized by hiai we denote by aij ti the elementtheorem let and if and only if is analytic wise application of to it holds and ak with ak for all note that in the previous theorem the condition on is only necessary when we require the implication to hold for all if is fixed the set of functions is larger and includes even large fractional powers see we use the stronger formulation as we want that the result holds without any restriction on the number of tasks theorem is the key element used in our following characterization of separable 
regularizers of which allow an analytical expression of the dual function ak theorem let be analytic on and given as where ak pt if is convex then θrs is convex function and max hρ θi ρrs where the global maximizer fulfills if and ak ρkrs table summarizes of functions the corresponding and the maximizer in optimization algorithm the dual problem can be efficiently solved via decomposition based methods like stochastic dual coordinate ascent algorithm sdca sdca enjoys low computational complexity per iteration and has been shown to scale effortlessly to large scale optimization problems algorithm fast input gram matrix label vector regularization parameter and relative duality gap parameter output is computed from using our result in initialize repeat randomly choose dual variable αi solve for in corresponding to αi αi αi until relative duality gap is below our algorithm for learning the output kernel matrix and task parameters is summarized in algorithm refer to the supplementary material for more details at each step of the iteration we optimize the dual objective over randomly chosen αi variable let ti be the task corresponding to αi we apply the update αi αi the optimization problem of solving with respect to is as follows min crr brs crs sz where kii brs tj kij αj csz hαs ksz αz and this convex optimization problem is solved efficiently via newton method the complexity of the proposed algorithm is per iteration the proposed algorithm can also be employed for learning output kernels regularized by generic discussed in the previous section special case for certain loss functions such as the hinge loss the squared loss yields linear or quadratic expression in in such cases problem reduces to finding the roots of cubic equation which has closed form expression hence our algorithm is highly efficient with the above loss functions when is regularized by the squared frobenius norm empirical results in this section we present our results on benchmark data sets comparing our algorithm with existing approaches in terms of generalization accuracy as well as computational efficiency please refer to the supplementary material for additional results and details data sets we begin with the generalization results in setups the data sets are as follows sarcos regression data set aim is to predict degrees of freedom of robotic arm parkinson regression data set aim is to predict the parkinson disease symptom score for patients yale face recognition data with binary classification tasks landmine data set containing binary classifications from different landmines bioinformatics data set having binary classification tasks letter handwritten letters data set with binary classification tasks we compare the following algorithms single task learning stl methods learning the output kernel matrix mtl cmtl mtrl and approaches that learn both input and output kernel matrices mtfl gmtl our proposed formulation is denoted by fmtlp we consider three different values for the and hinge and loss functions were employed for classification and regression problems respectively we follow the experimental described in table reports the performance of the algorithms averaged over ten random splits the proposed fmtlp attains the best generalization accuracy in general it outperforms the baseline mtl as well as mtrl and cmtl which solely learns the output kernel matrix moreover it achieves an overall better performance than gmtl and mtfl the give comparable generalization to case with the additional benefit of learning sparser 
and more interpretable output kernel matrix see figure the performance of stl mtl cmtl and mtfl are reported from table mean generalization performance and the standard deviation over ten splits data set stl mtl cmtl mtfl gmtl mtrl regression data sets explained variance sarcos parkinson classification data sets auc yale landmine letter fmtlp figure plots of matrices rescaled to and averaged over ten splits computed by our solver fmtlp for the landmine data set for different with values the darker regions indicate higher value tasks landmines numbered correspond to highly foliated regions and those numbered correspond to bare earth or desert regions hence we expect two groups of tasks indicated by the red squares we can observe that the learned matrix at depicts much more spurious task relationships than the ones at and thus our sparsifying regularizer improves interpretability table mean accuracy and the standard deviation over five splits data set stl mnist usps gmtl mtrl fmtlp fmtlp data sets the setup is cast as binary classification tasks corresponding to classes in this section we experimented with two loss functions fmtlp the hinge loss employed in svms and fmtlp the squared loss employed in okl in these experiments we also compare our results with feature learning method usps mnist experiments we followed the experimental protocol detailed in results are tabulated in table our approach fmtlp obtains better accuracy than gmtl mtrl and on both data sets mit experiments we report results on the mit benchmark which covers indoor scene categories we use the split images per class provided by the authors fmtlp achieved the accuracy of with note that this is better than the ones reported in and experiments is challenging scene classification benchmark with classes we use images per class for training images per class for testing and report the average accuracy over the standard splits we employed the cnn features extracted with the table mean accuracy and the standard deviation over ten splits on stl mtl fmtlp fmtlp time by baseline time by fmtl convexokl time scale okl number of tasks mit okl okl mit convexokl convexokl figure plot compares the runtime of various algorithms with varying number of tasks on our approach is times faster that okl and times faster than convexokl when the number of tasks is maximum plot showing the factor by which outperforms okl and convexokl over the range on various data sets on we outperform okl and convexokl by factors of and respectively on mit we are better than okl and convexokl by factors of and respectively convolutional neural network cnn using places database the results are tabulated in table the matrices computed by fmtlp are discussed in the supplementary material scaling experiment we compare the runtime of our solver for with the okl solver of and the convexokl solver of on several data sets all the three methods solve the same optimization problem figure shows the result of the scaling experiment where we vary the number of tasks classes the parameters employed are the ones obtained via note that both okl and convexokl algorithms do not have well defined stopping criterion whereas our approach can easily compute the relative duality gap set as we terminate them when they reach the primal objective value achieved by our optimization approach is times and times faster than the alternate minimization based okl and convexokl respectively when the number of tasks is maximal the generic are also considerably faster than okl and convexokl figure 
compares the average runtime of our fmtlp with okl and convexokl on the crossvalidated range of values fmtlp outperform them on both mit and data sets on mnist and usps data sets fmtlp is more than times faster than okl and more than times faster than convexokl additional details of the above experiments are discussed in the supplementary material conclusion we proposed novel formulation for learning the positive output kernel matrix for multiple tasks our main technical contribution is our analysis of certain class of regularizers on the output kernel matrix where one may drop the positive constraint from the optimization problem but still solve the problem optimally this leads to dual formulation that can be efficiently solved using stochastic dual coordinate ascent algorithm results on benchmark and data sets demonstrates the effectiveness of the proposed algorithm in terms of runtime as well as generalization accuracy acknowledgments and acknowledge the support by the cluster of excellence mmci references evgeniou micchelli and pontil learning multiple tasks with kernel methods jmlr argyriou evgeniou and pontil convex feature learning ml lounici pontil tsybakov and van de geer taking advantage of sparsity in learning in colt jalali ravikumar sanghavi and ruan dirty model for learning in nips jawanpuria and nath multiple kernel learning in sdm maurer pontil and sparse coding for multitask and transfer learning in icml jawanpuria nath and ramakrishnan generalized hierarchical kernel learning jmlr caruana multitask learning ml zhang and yeung convex formulation for learning task relationships in learning in uai kang grauman and sha learning with whom to share in feature learning in icml jawanpuria and nath convex feature learning formulation for latent task structure discovery in icml jacob bach and vert clustered learning convex formulation in nips micchelli and pontil kernels for multitask learning in nips caponnetto micchelli pontil and ying universal kernels jmlr rosasco and lawrence kernels for functions review foundations and trends in machine learning evgeniou and pontil regularized learning in kdd dinuzzo ong gehler and pillonetto learning output kernels with block coordinate descent in icml ciliberto mroueh poggio and rosasco convex learning of multiple tasks and their structure in icml and zhang stochastic dual coordinate ascent methods for regularized loss jmlr and smola learning with kernels mit press hein and bousquet kernels associated structures and generalizations technical report max planck institute for biological cybernetics and mond what is invexity austral math soc ser hiai monotonicity for entrywise functions of matrices linear algebra and its applications horn the theory of infinitely divisible matrices and kernels trans amer math lapin schiele and hein scalable multitask representation learning for scene classification in cvpr zhou lapedriza xiao torralba and oliva learning deep features for scene recognition using places database in nips koskela and laaksonen convolutional network features for scene recognition in proceedings of the acm international conference on multimedia xiao hays ehinger oliva and torralba sun database scene recognition from abbey to zoo in cvpr 
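To make the dual computation above concrete, the following sketch (illustrative, not the authors' released solver) forms the output kernel for the squared-Frobenius regularizer from dual variables: each entry is proportional to <alpha_r, K_rs alpha_s>, where K_rs is the block of the input kernel between examples of tasks r and s. Because these entries are inner products of task weight vectors in the RKHS, the resulting matrix is positive semidefinite without imposing the constraint explicitly, which is the feasibility property noted in the paper even for suboptimal dual variables. The constant scale stands in for the regularization-dependent factor, which is not reproduced here.

```python
# Minimal sketch (assumed names and scaling) of the output kernel recovered from
# dual variables under the squared-Frobenius regularizer discussed above.
import numpy as np

def output_kernel_from_duals(K, alphas, task_index, scale=1.0):
    """K: (n, n) input kernel matrix over all training points.
    alphas: length-n vector of dual variables.
    task_index: length-n array giving the task of each example.
    Returns a (T, T) matrix with Theta[r, s] = scale * alpha_r^T K_rs alpha_s."""
    tasks = np.unique(task_index)
    T = len(tasks)
    theta = np.zeros((T, T))
    for r, tr in enumerate(tasks):
        for s, ts in enumerate(tasks):
            idx_r = np.where(task_index == tr)[0]
            idx_s = np.where(task_index == ts)[0]
            K_rs = K[np.ix_(idx_r, idx_s)]
            theta[r, s] = scale * (alphas[idx_r] @ K_rs @ alphas[idx_s])
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((30, 4))
    K = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # RBF input kernel
    alphas = rng.standard_normal(30)
    tasks = rng.integers(0, 3, size=30)
    theta = output_kernel_from_duals(K, alphas, tasks)
    print("min eigenvalue (should be >= 0 up to rounding):", np.linalg.eigvalsh(theta).min())
```

For the other regularizers in the paper's table, only the elementwise map applied to <alpha_r, K_rs alpha_s> would change; the bookkeeping over task blocks stays the same.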
scalable adaptation of state complexity for nonparametric hidden markov models michael hughes william stephenson and erik sudderth department of computer science brown university providence ri mhughes wtstephe sudderth abstract bayesian nonparametric hidden markov models are typically learned via fixed truncations of the infinite state space or local monte carlo proposals that make small changes to the state space we develop an inference algorithm for the sticky hierarchical dirichlet process hidden markov model that scales to big datasets by processing few sequences at time yet allows rapid adaptation of the state space cardinality unlike previous methods our novel variational bound penalizes redundant or irrelevant states and thus enables optimization of the state space our birth proposals use observed data statistics to create useful new states that escape local optima merge and delete proposals remove ineffective states to yield simpler models with more affordable future computations experiments on speaker diarization motion capture and epigenetic chromatin datasets discover models that are more compact more interpretable and better aligned to ground truth segmentations than competitors we have released an python implementation which can parallelize local inference steps across sequences introduction the hidden markov model hmm is widely used to segment sequential data into interpretable discrete states human activity streams might use walking or dancing states while dna transcription might be understood via promotor or repressor states the hierarchical dirichlet process hmm provides an elegant bayesian nonparametric framework for reasoning about possible data segmentations with different numbers of states existing inference algorithms for hmms and have numerous shortcomings they can not efficiently learn from large datasets do not effectively explore segmentations with varying numbers of states and are often trapped at local optima near their initialization stochastic optimization methods are particularly vulnerable to these last two issues since they can not change the number of states instantiated during execution the importance of removing irrelevant states has been long recognized samplers that add or remove states via split and merge moves have been developed for hdp topic models and beta process hmms however these monte carlo proposals use the entire dataset and require all sequences to fit in memory limiting scalability we propose an learning algorithm that reliably transforms an uninformative initialization into an accurate yet compact set of states generalizing previous work on memoized variational inference for dp mixture models and hdp topic models we derive variational bound for the that accounts for sticky state persistence and can be used for effective bayesian model selection our algorithm uses birth proposal moves to create new states and merge and delete moves to remove states with poor predictive power state space adaptations are validated via global variational bound but by caching sufficient statistics our memoized algorithm efficiently processes subsets of sequences at each step extensive experiments demonstrate the reliability and scalability of our approach which can be reproduced via python code we have released http initialization after first lap births after first lap merges accepted merge pairs ground truth labels after laps after second lap accepted birth figure illustration of our new variational algorithm as it learns to segment motion capture sequences 
into common exercise types sec each panel shows segmentations of the same sequences with time on the horizontal axis starting from just one state birth moves at the first sequence create useful states local updates to each sequence in turn can use existing states or birth new ones after all sequences are updated once we perform merge moves to clean up and lap is complete after another complete lap of birth updates at each sequence followed by merges and deletes the segmentation is further refined after many laps our final segmentation aligns well to labels from human annotator with some true states aligning to multiple learned states that capture variability in exercises hierarchical dirichlet process hidden markov models we wish to jointly model sequences where sequence has data xn xntn and observation xnt is vector representing interval or timestep for example xnt rd could be the spectrogram for an instant of audio or human limb positions during interval the explains this data by assigning each observation xnt to single hidden state znt the chosen state comes from countably infinite set of options generated via markovian dynamics with initial state distributions and transition distributions πk znt zn πkℓ we draw data xnt given assigned state znt from an exponential family likelihood log xnt φk sf xdn φk cf φk log φk φtk ch the natural parameter φk for each state has conjugate prior cumulant functions cf ch ensure these distributions are normalized the chosen exponential family is defined by its sufficient statistics sf our experiments consider bernoulli gaussian and gaussian likelihoods hierarchies of dirichlet processes under the prior and posterior the number of states is unbounded it is possible that every observation comes from unique state the hierarchical dirichlet process hdp encourages sharing states over time via latent root probability vector over the infinite set of states see fig the representation of the prior on first uℓ draws independent variables uk beta for each state and then sets βk uk we interpret uk as the conditional probability of choosing state among states in expectation the most common states are first in order we represent their probabilities via the vector βk where βk given this dimensional probability vector the generates transition distributions πk for each state from dirichlet with mean equal to and variance governed by concentration parameter πkk πk dir αβ we draw starting probability vector from similar prior with much smaller variance dir with because few starting states are observed sticky bias in many applications we expect each segment to persist for many timesteps the sticky parameterization of favors by placing extra prior mass on the transition probability πkk in particular πk dir αβk αβ where controls the degree of bias choosing leads to long segment lengths while avoiding the computational cost of alternatives uk πk znt cd sticky lower bound φk xnt alpha cd sticky lower bound num states figure left graphical representation of the hdp hidden markov model variational parameters are shown in red center our surrogate bound for the sticky dirichlet cumulant function cd eq as function of computed with and uniform with active states right surrogate bound with fixed this bound remains tight when our state adaptation moves insert or remove states memoized and stochastic variational inference after observing data our inferential goal is posterior knowledge of conditional probabilities hmm parameters and assignments we refer to as global parameters because 
they generalize to new data sequences in contrast the states zn are local to specific sequence xn factorized variational lower bound we seek distribution over the unobserved variables that is close to the true posterior but lies in the simpler factorized family each factor has exponential family form with free parameters denoted by hats and our inference algorithms update these parameters to minimize the kl divergence kl our chosen factorization for is similar to but includes substantially more accurate approximation to as detailed in sec factor for each sequence we use an independent factor zn with markovian structure ty znt δℓ zn zn free parameter vector defines the joint assignment probabilities zn znt so the entries of sum to one the parameter defines the marginal probability pk znt and equals we can find the expected count of transitions pn ptn from state to across all sequences via the sufficient statistic mkℓ the truncation level limits the total number of states to which data is assigned under our approximate posterior only zn is constrained by this choice no global factors are truncated indeed if data is only assigned to the first states the conditional independence properties of the imply that φk uk are independent of the data their optimal variational posteriors thus match the prior and need not be explicitly computed or stored simple variational algorithms treat as fixed constant but sec develops novel algorithms that fit to data factor for the starting state and each state we define πk as dirichlet distribution πk dir free parameter is vector of positive numbers with one entry for each of the active states and final entry for the aggregate mass of all other states the expected log transition probability between states and pkℓ eq log πkℓ is key sufficient statistic factor emission parameter φk for state has factor φk conjugate to the likelihood the supplement provides details for bernoulli gaussian and we score the approximation via an objective function that assigns scalar value higher is better to each possible input of free parameters data and hyperparameters eq log log ldata lentropy this function provides lower bound on the marginal evidence log improving this bound is equivalent to minimizing kl its four component terms are defined as follows ldata eq log log lentropy log eq log eq log log detailed analytic expansions for each term are available in the supplement tractable posterior inference for global state probabilities previous variational methods for the and for hdp topic models and hdp grammars used point estimate for the state probabilities while this approximation simplifies inference the variational objective no longer bounds the marginal evidence such are unsuitable for model selection and can favor models with redundant states that do not explain any data but nevertheless increase computational and storage costs because we seek to learn compact and interpretable models and automatically adapt the truncation level to each dataset we instead place proper beta distribution on uk uk beta where qk here eq uk eq βk and eq the scalar controls the variance where the point estimate is recovered as the beta factorization in eq complicates evaluation of the marginal likelihood bound in eq pk eq cd eq cd αβ κδk pk pk cd mkℓ αk eq βℓ κδk pkℓ the dirichlet cumulant function cd maps positive parameters to constant for where previous work established the following bound cd αβ log log αβk log log βℓ direct evaluation of eq cd αβ is problematic because the expectations of 
functions have no closed form but the lower bound has simple expectation given beta distributed uk developing similar bound for sticky models with requires novel contribution to begin in the supplement we establish the following bound for any cd αβ κδk log log log αβk log βℓ to handle the intractable term eq log αβk we leverage the concavity of the logarithm log αβk βk log βk log combining eqs and and taking expectations we can evaluate lower bound on eq in closed form and thereby efficiently optimize its parameters as illustrated in fig this rigorous lower bound on the marginal evidence log is quite accurate for practical hyperparameters batch and stochastic variational inference most variational inference algorithms maximize via coordinate ascent optimization where the best value of each parameter is found given fixed values for other variational factors for the hdphmm this leads to the following updates which when iterated converge to some local maximum local update to zn the assignments for each sequence zn can be updated independently via dynamic programming the algorithm takes as input tn matrix of eq log xn φk given the current and log transition probabilities pjk given the current it outputs the optimal marginal state probabilities under objective this step has cost tn for sequence and we can process multiple sequences in parallel for efficiency global update to conjugate priors lead to simple updates sk where ptn sufficient statistic sk summarizes the data assigned to state sk sf xnt global update to for each state the positive vector defining the optimal dirichlet posterior on transition probabilities from state is mkℓ αβℓ κδk statistic mkℓ counts the expected number of transitions from state to across all sequences global update to due to our surrogate objective has no update to instead we employ numerical optimization to update vectors simultaneously arg max subject to for details are in the supplement the update to requires expectations under and vice versa so it can be useful to iteratively optimize and several times given fixed local statistics to handle large datasets we can adapt these updates to perform stochastic variational inference svi stochastic algorithms perform local updates on random subsets of sequences batches and then perturb global parameters by following noisy estimate of the natural gradient which has simple closed form svi has previously been applied to with pointestimated and can be easily adapted to our more principled objective one drawback of svi is the requirement of learning rate schedule which must typically be tuned to each dataset memoized variational inference we now outline memoized algorithm for our sticky variational objective before execution each sequence is randomly assigned to one of batches the algorithm repeatedly visits batches one at time in random order we call each full pass through the complete set of batches lap at each visit to batch we perform local step for all sequences in batch and then global step with batches memoized inference reduces to the standard algorithm while with larger we have more affordable local steps and faster overall convergence with just one lap memoized inference is equivalent to the synchronous version of streaming variational inference presented in alg of broderick et al we focus on regimes where dozens of laps are feasible which we demonstrate dramatically improves performance affordable but exact batch optimization of is possible by exploiting the additivity of statistics for each statistic we track 
quantity and summary after local step at batch yields we update and increment each statistic by adding the new batch summary and subtracting the summary stored in memory from the previous visit and store or memoize the new statistics for future iterations this update cycle makes and consistent with the most recent assignments for all sequences memoization does require bk more storage than svi however this cost does not scale with the number of sequences or length sparsity in transition counts may make storage cheaper at any point during memoized execution we can evaluate exactly for all data seen thus far this is possible because nearly all terms in eq are functions of only global parameters and sufficient statistics the one exception that requires local values is the entropy term lentropy to compute it we track matrix at each batch ptn log log hkℓ ntk where the sums aggregate sequences that belong to batch each entry of is and pk pk given the entropy matrix we have lentropy hkℓ state space adaptation via birth merge and delete proposals reliable nonparametric inference algorithms must quickly identify and create missing states splitmerge samplers for hdp topic models are limited because proposals can only split an existing state into two new states require expensive traversal of all data points to evaluate an acceptance ratio and often have low acceptance rates some variational methods for hdp topic models also dynamically create new topics but do not guarantee improvement of the global objective and can be unstable we instead interleave stochastic birth proposals with delete and merge proposals and use memoization to efficiently verify proposals via the exact objective birth proposals birth moves can create many new states at once while maintaining the monotonic increase of the objective each proposal happens within the local step by trying to improve zn for single sequence given current assignments with truncation the move proposes new assignments that include the existing states and some new states with index if improves under the proposal we accept and use the expanded set of states for all remaining updates in the current lap to compute we require candidate global parameters these are found via global step from candidate summaries which combine hamming dist num topics stoch memo delete merge birth delete merge sticky num pass thru data stoch after laps in min sampler sampler after laps in min num pass thru data delete merge after laps in min figure toy data experiments sec top left data sequences contain points from gaussians with sticky transitions top center trace plots from initialization with redundant states our algorithms reach ideal states and zero hamming distance regardless of whether sticky solid or dashed model is used competitors converge slower especially in the case because methods are more sensitive to hyperparameters bottom segmentations of sequences by svi the gibbs sampler and our method under the model top half shows true state assignments bottom shows aligned estimated states competitors are polluted by extra states black the new batch statistics and memoized statistics of other batches expanded by zeros for states see the supplement for details on handling multiple sequences within batch the proposal for expanding with new states can flexibly take any form from very to very for data with sticky state persistence we recommend randomly choosing one interval of the current sequence to reassign when creating leaving other timesteps fixed we split this interval into 
two contiguous blocks one may be empty each completely assigned to new state in the supplement we detail search that finds the cut point that maximizes the objective ldata other proposals such as splits could be easily incorporated in our variational algorithm but we find this simple proposal to be fast and effective merge proposals merge proposals try to find less redundant but equally expressive model each proposal takes pair of existing states and constructs candidate model where data from state is reassigned to state conceptually this reassignment gives new value but instead statistics can be directly computed and used in global update for candidate parameters si sj mi mi mj mii mjj mji mij while most terms in are linear functions of our cached sufficient statistics the entropy lentropy is not thus for each candidate merge pair we use storage and computation to track column and row hi of the corresponding merged entropy matrix because all terms in the matrix of eq are we can lentropy by summing subset of as detailed in the supplement this allows us to rigorously bound the objective for accepting multiple merges of distinct state pairs because many entries of are this bound is very tight and in practice enables us to scalably merge many redundant state pairs in each lap through the data to identify candidate merge pairs we examine all pairs of states and keep those that satisfy ldata because entropy must decrease after any merge lentropy this test is guaranteed to find all possibly useful merges it is much more efficient than the heuristic correlation score used in prior work on hdp topic models deletes our proposal to delete state begins by dropping row and column from to create and dropping sj from to create using target dataset of sequences with ptn mass on state xn we run global and local parameter updates to reassign observations from former state in way rather than verifying on only the target dataset as in we accept or reject the delete proposal via the bound to control computation we only propose deleting states used in or fewer sequences num pass thru data num pass thru data speedup time sec num states objective stoch memo birth del merge num parallel workers num parallel workers figure segmentation of human epigenome million observations across sequences sec left adaptive runs started at state grow to states within one lap and reach better scores than nonadaptive methods each run takes several days right wallclock times and speedup factors for parallelized local step on of this dataset workers complete local step with states in under one minute experiments we compare our proposed memoized algorithm to memoized with delete and merge moves only and without any moves we further run blocked gibbs sampler that was previously shown to mix faster than slice samplers and our own implementation of svi for objective these baselines maintain fixed number of states though some states may have usage fall to zero we start all methods including the sampler from matched initializations see the supplement for futher discussion and all details needed to reproduce these experiments toy data in fig we study toy data sequences generated from gaussian states with sticky transitions from an abundant initialization with states the sampler and variational methods require hundreds of laps to remove redundant states especially under model in contrast our adaptive methods reach the ideal of zero hamming distance within few dozen laps regardless of stickiness suggesting less sensitivity to hyperparameters 
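The memoized bookkeeping and the birth and merge moves described above reduce to simple operations on cached per-batch summary statistics. The sketch below is a minimal illustration under that reading; the class and method names are ours, not from the released implementation, and it shows only the statistic bookkeeping, whereas the full algorithm also verifies every birth, merge, and delete against the variational objective before accepting it.

```python
# Minimal sketch (illustrative names) of memoized sufficient-statistic updates:
# each batch visit replaces that batch's stored summary, keeping the global
# aggregate consistent with the most recent local step at every batch.
import numpy as np

class MemoizedStats:
    def __init__(self, num_states, num_batches):
        self.global_stat = np.zeros(num_states)                  # aggregate over all batches
        self.batch_stat = np.zeros((num_batches, num_states))    # memoized per-batch summaries

    def update_batch(self, b, new_summary):
        """Add the new batch summary and subtract the one stored from the previous visit."""
        self.global_stat += new_summary - self.batch_stat[b]
        self.batch_stat[b] = new_summary

    def expand_states(self, num_new):
        """Birth-move bookkeeping: pad all statistics with zeros for the new states."""
        self.global_stat = np.concatenate([self.global_stat, np.zeros(num_new)])
        self.batch_stat = np.hstack(
            [self.batch_stat, np.zeros((self.batch_stat.shape[0], num_new))])

    def merge_states(self, j, k):
        """Merge-move bookkeeping: fold state k's mass into state j, then drop state k."""
        self.global_stat[j] += self.global_stat[k]
        self.batch_stat[:, j] += self.batch_stat[:, k]
        self.global_stat = np.delete(self.global_stat, k)
        self.batch_stat = np.delete(self.batch_stat, k, axis=1)

if __name__ == "__main__":
    stats = MemoizedStats(num_states=3, num_batches=5)
    stats.update_batch(0, np.array([2.0, 1.0, 0.5]))
    stats.expand_states(1)    # a birth move created one candidate state
    stats.merge_states(0, 2)  # a merge move combined two redundant states
    print(stats.global_stat)
```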
speaker diarization we study unrelated audio recordings of meetings with an unknown number of speakers from the nist speaker diarization challenge the sticky previously achieved diarization performance using sampler that required hours of computation we ran methods from matched initializations with states and computing hamming distance on segments as in the standard der metric fig shows that within minutes our algorithms consistently find segmentations better aligned to true speaker labels labelled motion capture fox et al introduced sequence dataset with labels for exercise types illustrated in fig each sequence has joint angles wrist knee etc captured at second intervals fig shows that methods struggle even when initialized abundantly with dashed lines or solid states while our adaptive methods reach better values of the objective and cleaner alignment to true exercises large motion capture next we apply scalable methods to the sequence dataset of we lack ground truth here but fig shows deletes and merges making consistent reductions from abundant initializations and births growing from fig also shows estimated segmentations for representative sequences along with skeleton illustrations for the states in this subset these segmentations align well with text descriptions chromatin segmentation finally we study segmenting the human genome by the appearance patterns of regulatory proteins we observe binary signals from at intervals throughout white blood cell line each binary value indicates the presence or absence of an acetylation or methylation that controls gene expression we divide the whole epigenome into sequences one per batch with total size million fig shows our method can grow from state to states and compete favorably with competitors we also demonstrate that our parallelized local step leads to big speedups in processing such large datasets conclusion our new variational algorithms adapt hmm state spaces to find clean segmentations driven by bayesian model selection relative to prior work our contributions include new bound for the sticky births with guaranteed improvement local step parallelization and better merge selection rules our python code is targeted at applications acknowledgments this research supported in part by nsf career award no hughes supported in part by an nsf graduate research fellowship under grant no hamming hamming dist sampler memo delete merge birth delete merge elapsed time sec elapsed time sec hamming dist train objective meeting worst elapsed time sec elapsed time sec hamming dist meeting avg train objective train objective sampler hamming meeting best elapsed time sec elapsed time sec figure method comparison on speaker diarization from common initializations sec left scatterplot of final hamming distance for our adaptive method and the sampler across meetings each with initializations shown as individual dots our method finds segmentations closer to ground truth right traces of objective and hamming distance for meetings representative of good average and poor performance birth laps stoch sampler memo delete merge birth delete merge num pass thru data num pass thru data hamming dist num states train objective num pass thru data laps sampler laps num states train objective figure comparison on motion capture streams sec top our adaptive methods reach better values and lower distance from true exercise labels bottom segmentations from the best runs of left only deletes and merges from initial states middle and the sampler right each sequence shows 
true labels top half and estimates bottom half colored by the true state with highest overlap num pass thru data walk playground jump playground climb playground climb swordplay dance dance dance basketball dribble basketball dribble basketball dribble num pass thru data climb sword arms swing dribble jump balance ballet leap ballet pose figure study of motion capture sequences sec top left objective and state count as more data is seen solid lines have initial states dashed top right final segmentation of select sequences by our method with id numbers and descriptions from the most used states are shown in color the rest with gray bottom skeletons assigned to each highlighted state references rabiner tutorial on hidden markov models and selected applications in speech recognition proc of the ieee zoubin ghahramani an introduction to hidden markov models and bayesian networks international journal of pattern recognition and machine intelligence jason ernst and manolis kellis discovery and characterization of chromatin states for systematic annotation of the human genome nature biotechnology matthew beal zoubin ghahramani and carl rasmussen the infinite hidden markov model in neural information processing systems teh jordan beal and blei hierarchical dirichlet processes journal of the american statistical association emily fox erik sudderth michael jordan and alan willsky sticky with application to speaker diarization annals of applied statistics matthew johnson and alan willsky stochastic variational inference for bayesian time series models in international conference on machine learning nicholas foti jason xu dillon laird and emily fox stochastic variational inference for hidden markov models in neural information processing systems andreas stolcke and stephen omohundro hidden markov model induction by bayesian model merging in neural information processing systems chong wang and david blei mcmc algorithm for the hierarchical dirichlet process arxiv preprint jason chang and john fisher iii parallel sampling of hdps using splits in neural information processing systems emily fox michael hughes erik sudderth and michael jordan joint modeling of multiple time series via the beta process with application to motion capture segmentation annals of applied statistics michael hughes and erik sudderth memoized online variational inference for dirichlet process mixture models in neural information processing systems michael hughes dae il kim and erik sudderth reliable and scalable variational inference for the hierarchical dirichlet process in artificial intelligence and statistics yee whye teh kenichi kurihara and max welling collapsed variational inference for hdp in neural information processing systems michael bryant and erik sudderth truly nonparametric online variational inference for hierarchical dirichlet processes in neural information processing systems percy liang slav petrov michael jordan and dan klein the infinite pcfg using hierarchical dirichlet processes in empirical methods in natural language processing matthew james beal variational algorithms for approximate bayesian inference phd thesis university of london matt hoffman david blei chong wang and john paisley stochastic variational inference journal of machine learning research tamara broderick nicholas boyd andre wibisono ashia wilson and michael jordan streaming variational bayes in neural information processing systems chong wang and david blei online variational inference for bayesian nonparametric models in neural 
information processing systems van gael saatci teh and ghahramani beam sampling for the infinite hidden markov model in international conference on machine learning nist rich transcriptions database http michael hoffman orion buske jie wang zhiping weng jeff bilmes and william noble unsupervised pattern discovery in human chromatin structure through genomic segmentation nature methods 
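The diarization and motion-capture comparisons above score a segmentation by its Hamming distance to the true labels after matching estimated states to true states (the figures color each estimate by the true state with the highest overlap). As an illustration only, and not the paper's own evaluation code, the following minimal Python sketch computes such an aligned Hamming distance for two equal-length integer label sequences, using a one-to-one matching found with the Hungarian algorithm from scipy; the function name aligned_hamming_distance is ours.

import numpy as np
from scipy.optimize import linear_sum_assignment

def aligned_hamming_distance(true_labels, est_labels):
    # Fraction of timesteps mislabeled after optimally matching estimated
    # states to true states (at most one estimated state per true state).
    true_labels = np.asarray(true_labels)
    est_labels = np.asarray(est_labels)
    true_ids = np.unique(true_labels)
    est_ids = np.unique(est_labels)
    # Count overlap between every (true state, estimated state) pair.
    overlap = np.zeros((len(true_ids), len(est_ids)), dtype=int)
    for i, t in enumerate(true_ids):
        for j, e in enumerate(est_ids):
            overlap[i, j] = np.sum((true_labels == t) & (est_labels == e))
    # The Hungarian algorithm maximizes the total matched overlap.
    rows, cols = linear_sum_assignment(-overlap)
    matched = overlap[rows, cols].sum()
    return 1.0 - matched / len(true_labels)

Lower is better: a segmentation that recovers the true states exactly, up to a relabeling of state indices, scores zero.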
variational consensus monte carlo maxim rabinovich elaine angelino and michael jordan computer science division university of california berkeley rabinovich elaine jordan abstract practitioners of bayesian statistics have long depended on markov chain monte carlo mcmc to obtain samples from intractable posterior distributions unfortunately mcmc algorithms are typically serial and do not scale to the large datasets typical of modern machine learning the recently proposed consensus monte carlo algorithm removes this limitation by partitioning the data and drawing samples conditional on each partition in parallel fixed aggregation function then combines these samples yielding approximate posterior samples we introduce variational consensus monte carlo vcmc variational bayes algorithm that optimizes over aggregation functions to obtain samples from distribution that better approximates the target the resulting objective contains an intractable entropy term we therefore derive relaxation of the objective and show that the relaxed problem is blockwise concave under mild conditions we illustrate the advantages of our algorithm on three inference tasks from the literature demonstrating both the superior quality of the posterior approximation and the moderate overhead of the optimization step our algorithm achieves relative error reduction measured against serial mcmc of up to compared to consensus monte carlo on the task of estimating probit regression parameter expectations similarly it achieves an error reduction of on the task of estimating cluster comembership probabilities in gaussian mixture model with components in dimensions furthermore these gains come at moderate cost compared to the runtime of serial speedup in some instances introduction modern statistical inference demands scalability to massive datasets and models innovation in distributed and stochastic optimization has enabled parameter estimation in this setting via stochastic and asynchronous variants of gradient descent achieving similar success in bayesian inference where the target is posterior distribution over parameter values rather than point estimate remains computationally challenging two dominant approaches to bayesian computation are variational bayes and markov chain monte carlo mcmc within the former scalable algorithms like stochastic variational inference and streaming variational bayes have successfully imported ideas from optimization within mcmc adaptive subsampling procedures stochastic gradient langevin dynamics and firefly monte carlo have applied similar ideas achieving computational gains by operating only on data subsets these algorithms are serial however and thus can not take advantage of multicore and architectures this motivates mcmc algorithms such as asynchronous variants of gibbs sampling our work belongs to class of mcmc algorithms these algorithms partition the full dataset into disjoint subsets where xik denotes the data associated with core each core samples from subposterior distribution pk xik and then centralized procedure combines the samples into an approximation of the full posterior due to their efficiency such procedures have recently received substantial attention one of these algorithms consensus monte carlo cmc requires communication only at the start and end of sampling cmc proceeds from the intuition that subposterior samples when aggregated correctly can approximate full posterior samples this is formally backed by the factorization xik pk if one can approximate the subposterior 
densities pk using kernel density estimates for instance it is therefore possible to recombine them into an estimate of the full posterior unfortunately the factorization does not make it immediately clear how to aggregate on the level of samples without first having to obtain an estimate of the densities pk themselves cmc alters to untie the parameters across partitions and plug in deterministic link from the to pk this approximation and an aggregation function motivated by gaussian approximation lie at the core of the cmc algorithm the introduction of cmc raises numerous interesting questions whose answers are essential to its wider application two among these stand out as particularly vital first how should the aggregation function be chosen to achieve the closest possible approximation to the target posterior second when model parameters exhibit structure or must conform to constraints if they are for example positive semidefinite covariance matrices or labeled centers of clusters how can the weighted averaging strategy of scott et al be modified to account for this structure in this paper we propose variational consensus monte carlo vcmc novel class of mcmc algorithms that allow both questions to be addressed by formulating the choice of aggregation function as variational bayes problem vcmc makes it possible to adaptively choose the aggregation function to achieve closer approximation to the true posterior the flexibility of vcmc likewise supports nonlinear aggregation functions including structured aggregation functions applicable to not purely vectorial inference problems an appealing benefit of the vcmc point of view is clarification of the untying step leading to in vcmc the approximate factorization corresponds to variational approximation to the true posterior this approximation can be viewed as the joint distribution of and in an augmented model that assumes conditional independence between the data partitions and posits deterministic mapping from parameters to the single global parameter the added flexibility of this makes it possible to move beyond subposteriors and include alternative forms of within the cmc framework in particular it is possible to define pk xik using partial posteriors in place of subposteriors cf although extensive investigation of this issue is beyond the scope of this paper we provide some evidence in section that partial posteriors are better choice in some circumstances and demonstrate that vcmc can provide substantial gains in both the partial posterior and subposterior settings before proceeding we outline the remainder of this paper below in we review cmc and related mcmc algorithms next we cast cmc as variational bayes problem in we define the variational optimization objective in addressing the challenging entropy term by relaxing it to concave lower bound and give conditions for which this leads to blockwise concave maximization problem in we define several aggregation functions including novel ones that enable aggregation of structured positive semidefinite matrices and mixture model parameters in we evaluate the performance of vcmc and cmc relative to serial mcmc we replicate experiments carried out by scott et al and execute more challenging experiments in higher dimensions and with more data finally in we summarize our approach and discuss several open problems generated by this work related work we focus on mcmc algorithms for bayesian posterior sampling several recent research threads propose schemes in the setting where the posterior 
factors as in in general these parallel strategies are approximate relative to serial procedures and the specific algorithms differ in terms of the approximations employed and amount of communication required at one end of the communication spectrum are algorithms that fit into the mapreduce model first parallel cores sample from subposteriors defined in via any monte carlo sampling procedure the subposterior samples are then aggregated to obtain approximate samples from the full posterior this leads to the challenge of designing proper and efficient aggregation procedures scott et al propose consensus monte carlo cmc which constructs approximate posterior samples via weighted averages of subposterior samples our algorithms are motivated by this work let denote the subposterior sample from core in cmc the aggregation function averages across each set of samples to produce one approximate posterior sample uniform averaging is natural but heuristic that can in fact be improved upon via weighted average wk where in general is vector and wk can be matrix the authors derive weights motivated by the special case of gaussian posterior where each subposterior is consequently also gaussian let be the covariance of the subposterior this suggests weights wk equal to the subposteriors inverse covariances cmc treats arbitrary subpostertiors as gaussians aggregating with computed from the observed subposterior samples weights given by empirical estimates of neiswanger et al propose aggregation at the level of distributions rather than samples here the idea is to form an approximate posterior via product of density estimates fit to each subposterior and then sample from this approximate posterior the accuracy and computational requirements of this approach depend on the complexity of these density estimates wang and dunson develop alternate mcmc methods based on applying weierstrass transform to each subposterior these weierstrass sampling procedures introduce auxiliary variables and additional communication between computational cores consensus monte carlo as variational inference given the distributional form of the cmc framework we would like to choose so that the induced distribution on is as close as possible to the true posterior this is precisely the problem addressed by variational bayes which approximates an intractable posterior by the solution to the constrained optimization problem min dkl subject to where is the family of variational approximations to the distribution usually chosen to make both optimization and evaluation of target expectations tractable we thus view the aggregation problem in cmc as variational inference problem with the variational family given by all distributions qf qf where each is in some function class and defines density qf pk in practice we optimize over using projected stochastic gradient descent sgd the variational optimization problem standard optimization of the variational bayes objective uses the evidence lower bound elbo log log eq eq log log dkl lvb we can therefore recast the variational optimization problem in an equivalent form as max lvb subject to unfortunately the variational bayes objective lvb remains difficult to optimize indeed by writing lvb eq log we see that optimizing lvb requires computing an entropy and its gradients we can deal with this issue by deriving lower bound on the entropy that relaxes the objective further pk concretely suppose that every can be decomposed as fk with each fk differentiable bijection since the come from 
subposteriors conditioning on different segments of the data they are independent the entropy power inequality therefore implies max fk max pk epk log det fk min pk max epk log det fk min pk epk log det fk where denotes the jacobian of the function the proof can be found in the supplement this approach gives an explicit easily computed approximation to the this approximation is lower bound allowing us to interpret it simply as further relaxation of the original inference problem furthermore and crucially it decouples pk and fk thereby making it possible to optimize over fk without estimating the entropy of any pk we note additionally that if we are willing to sacrifice concavity we can use the tighter lower bound on the entropy given by putting everything together we can define our relaxed variational objective as eq log maximizing this function is the variational bayes problem we consider in the remainder of the paper conditions for concavity under certain conditions the problem posed above is blockwise concave to see when this holds we use the language of graphical models and exponential families to derive the result in the greatest possible generality we decompose the variational objective as lvb eq log and prove concavity directly for then treat our choice of relaxed entropy we emphasize that while the entropy relaxation is only defined for decomposed aggregation functions concavity of the partial objective holds for arbitrary aggregation functions all proofs are in the supplement suppose the model distribution is specified via graphical model so that such that each conditional distribution is defined by an exponential family log log hu log au if each of these log conditional density functions is in we can guarantee that the log likelihood is concave in each individually theorem blockwise concavity of the variational suppose that the model distribution is specified by graphical model in which each conditional probability density is exponential family suppose further that the variational aggregation function family satisfies such that we can decompose each aggregation function across nodes via and if each is convex subset of some vector space hu then the variational is concave in each individually assuming that the aggregation function can be decomposed into sum over functions of individual subposterior terms we can also prove concavity of our entropy relaxation qk theorem concavity of the relaxed entropy suppose fk with each function pk decomposing as fk for unique bijective fk fk then the relaxed entropy is concave in as result we derive concavity of the variational objective in broad range of settings corollary concavity of the variational objective under the hypotheses of theorems and the variational bayes objective is concave in each individually variational aggregation function families the performance of our algorithm depends critically on the choice of aggregation function family the family must be sufficiently simple to support efficient optimization expressive to capture the complex transformation from the set of subposteriors to the full posterior and structured to preserve structure in the parameters we now illustrate some aggregation functions that meet these criteria vector aggregation in the simplest case rd is an unconstrained vector then linear aggrepk gation function fw wk makes sense and it is natural to impose constraints to make this sum behave like weighted each wk is positive semidefinite psd matrix pk and wk id for computational reasons it is often desirable to 
restrict to diagonal wk spectral aggregation cases involving structure exhibit more interesting behavior indeed if our applying the vector aggregation function above to the flattened parameter is psd matrix vector form vec of the parameter does not suffice denoting elementwise matrix product as pk we note that this strategy would in general lead to fw wk we therefore introduce more sophisticated aggregation function that preserves psd structure for this given symmetric define and to be orthogonal and diagonal matrices respectively such that impose canonical ordering dd we can then define our spectral aggregation function by spec fw wk assuming wk the output of this function is guaranteed to be psd as required as above we pk restrict the set of wk to the matrix simplex wk wk wk combinatorial aggregation additional complexity arises with unidentifiable latent variables and more generally models with multimodal posteriors since this class encompasses many popular algorithms in machine learning including factor analysis mixtures of gaussians and multinomials and latent dirichlet allocation lda we now show how our framework can accommodate them for concreteness suppose now that our model parameters are given by where denotes the number of global latent variables cluster centers we introduce discrete alignment parameters ak that indicate how latent variables associated with partitions map to global latent variables each ak is thus correspondence with ak denoting the index on worker core of cluster center for fixed we then obtain the variational aggregation function fa wk optimization can then proceed in an alternating manner switching between the alignments ak and the weights wk or in greedy manner fixing the alignments at the start and optimizing the weight matrices in practice we do the latter aligning using simple heuristic objective pk pl where denotes the mean value of cluster center on partition as suggests we set minimizing via the hungarian algorithm leads to good alignments figure probit regression moment approximation error for the uniform and gaussian averaging baselines and vcmc relative to serial mcmc for subposteriors left and partial posteriors right note the different vertical axis scales we assessed three groups of functions first moments with for pure second moments with for and mixed second moments with for for brevity results for pure second moments are relegated to figure in the supplement empirical evaluation we now evaluate vcmc on three inference problems in range of data and dimensionality conditions in the vector parameter case we compare directly to the simple weighting baselines corresponding to previous work on cmc in the other cases we compare to structured analogues of these weighting schemes our experiments demonstrate the advantages of vcmc across the whole range of model dimensionality data quantity and availability of parallel resources baseline weight settings scott et al studied linear aggregation functions with fixed weights wkunif id ˆk wkgauss diag and denotes the corresponding to uniform averaging and gaussian averaging respectively where standard empirical estimate of the covariance these are our baselines for comparison evaluation metrics since the goal of mcmc is usually to estimate event probabilities and function expectations we evaluate algorithm accuracy for such estimates relative to serial mcmc output for each model we consider suite of test functions low degree polynomials cluster comembership indicators and we assess the error of each algorithm 
using the metric emcmc in the body of the paper we report median values of computed within each test function class the supplement expands on this further showing quartiles for the differences in and bayesian probit regression we consider the nonconjugate probit regression model in this case we use linear aggregation functions as our function class for computational efficiency we also limit ourselves to diagonal wk we use gibbs sampling on the following augmented model if zn id zn xn xn yn otherwise this augmentation allows us to implement an efficient and rapidly mixing gibbs sampler where xt we run two experiments the first using data generating distribution from scott et al with data points and dimensions and the second using data points and dimensions as shown in figure and in the figures and vcmc decreases the error of moment estimation compared to the baselines with substantial gains starting at partitions and increasing with we also run the experiment using partial posteriors in place of subposteriors and observe substantially lower errors in this case figure wishart model far left left right moment approximation error for the uniform and gaussian averaging baselines and vcmc relative to serial mcmc letting denote the th largest eigenvalue of we assessed three groups of functions first moments with for pure second moments with for and mixed second moments with for far right graph of error in estimating as function of where wishart model inverse wishart model to compare directly to prior work we consider the wishart xn here we use spectral aggregation rules as our function class restricting to diagonal wk for computational efficiency we run two sets of experiments one using the covariance matrix from scott et al with data points and dimensions and one using covariance matrix designed to have small spectral gap and range of eigenvalues with data points and dimensions in both cases we use form of projected sgd using samples per iteration to estimate the variational gradients and running iterations of optimization we note that because the mean is treated as parameter one could sample exactly using wishart conjugacy as figure vcmc improves both first and second posterior moment estimation as compared to the baselines here the greatest gains from vcmc appear at large numbers of partitions we also note that uniform and gaussian averaging perform similarly because the variances do not differ much across partitions mixture of gaussians substantial portion of bayesian inference focuses on latent variable models and in particular mixture models we therefore evaluate vcmc on mixture of gaussians id zn cat xn id where the mixture weights and the prior and likelihood variances and are assumed known we use the combinatorial aggregation functions defined in section we set and uniform and generate data points in dimensions using the model from nishihara et al the resulting inference problem is therefore all samples were drawn using the pystan implementation of hamiltonian monte carlo hmc as figure shows vcmc drastically improves moment estimation compared to the baseline gaussian averaging to assess how vcmc influences estimates in cluster membership probabilities we generated new test points from the model and analyzed cluster comembership probabilities for all pairs in the test set concretely for each xi and xj in the test data we estimated xi and xj belong to the same cluster figure shows the resulting boost in accuracy when vcmc delivers estimates close to those of serial mcmc across all numbers 
of partitions the errors are larger for unlike previous models uniform averaging here outperforms gaussian averaging and indeed is competitive with vcmc assessing computational efficiency the efficiency of vcmc depends on that of the optimization step which depends on factors including the step size schedule number of samples used per iteration to estimate gradients and size of data minibatches used per iteration extensively assessing the influence of all these factors is beyond the scope of this paper and is an active area of research both in general and specifically in the context of variational inference here we provide due to space constraints we relegate results for to the supplement due to space constraints we compare to the experiment of scott et al in the supplement mixture of gaussians error versus timing and speedup measurements figure expectation approximation error for the uniform and gaussian baselines and vcmc we report the median error relative to serial mcmc for cluster comembership probabilities of pairs of test data points for left and right where we run the vcmc optimization procedure for and iterations respectively when some comembership probabilities are estimated poorly by all methods we therefore only use the of comembership probabilities with the smallest errors across all the methods left vcmc error as function of number of seconds of optimization the cost of optimization is nonnegligible but still moderate compared to serial since our optimization scheme only needs small batches of samples and can therefore operate concurrently with the sampler right error versus speedup relative to serial mcmc for both cmc with gaussian averaging small markers and vcmc large markers an initial assessment of the computational efficiency of vcmc taking the probit regression and gaussian mixture models as our examples using step sizes and sample numbers from above and eschewing minibatching on data points figure shows timing results for both models for the probit regression while the optimization cost is not negligible it is significantly smaller than that of serial sampling which takes over seconds to produce effective across most numbers of partitions approximately to less than seconds of wall clock to give errors close to those at convergence for the mixture on the other hand the computational cost of optimization is minimal compared to serial sampling we can see this in the overall speedup of vcmc relative to serial mcmc for sampling and optimization combined low numbers of partitions achieve speedups close to the ideal value of and large numbers still achieve good speedups of about the cost of the vcmc optimization step is thus when the mcmc step is expensive small enough to preserve the linear speedup of embarrassingly parallel sampling moreover since the serial bottleneck is an optimization we are optimistic that performance both in terms of number of iterations and wall clock time can be significantly increased by using techniques like data minibatching adaptive step sizes or asynchronous updates conclusion and future work the flexibility of variational consensus monte carlo vcmc opens several avenues for further research following previous work on mcmc we used the subposterior factorization our variational framework can accomodate more general factorizations that might be more statistically or computationally efficient the factorization used by broderick et al we also introduced structured sample aggregation and analyzed some concrete instantiations complex latent variable 
models would require more sophisticated aggregation functions ones that account for symmetries in the model or lift the parameter to higher dimensional space before aggregating finally recall that our algorithm again following previous work aggregates in manner cf other aggregation paradigms may be useful in building approximations to multimodal posteriors or in boosting the statistical efficiency of the overall sampler acknowledgments we thank adams altieri broderick giordano johnson and scott for helpful discussions is supported by the miller institute for basic research in science university of california berkeley is supported by hertz foundation fellowship generously endowed by google and an nsf graduate research fellowship support for this project was provided by amazon and by onr under the muri program we ran the sampler for iterations including burnin steps and kept every fifth sample references asuncion smyth and welling asynchronous distributed learning of topic models in advances in neural information processing systems pages bardenet doucet and holmes towards scaling up markov chain monte carlo an adaptive subsampling approach in proceedings of the international conference on machine learning bertsekas nonlinear programming athena scientific belmont ma edition broderick boyd wibisono wilson and jordan streaming variational bayes in advances in neural information processing systems pages campbell and how approximate decentralized bayesian inference in conference on uncertainty in artificial intelligence cover and thomas elements of information theory wiley series in telecommunications and signal processing dean and ghemawat mapreduce simplified data processing on large clusters communications of the acm knowles mohamed and ghahramani large scale nonparametric bayesian inference data parallelisation in the indian buffet process in advances in neural information processing systems pages duchi hazan and singer adaptive subgradient methods for online learning and stochastic optimization journal of machine learning research gelman carlin stern dunson vehtari and rubin bayesian data analysis third edition chapman and hoffman blei wang and paisley stochastic variational inference journal of machine learning research may johnson saunderson and willsky analyzing hogwild parallel gaussian gibbs sampling in advances in neural information processing systems pages johnson and zhang accelerating stochastic gradient descent using predictive variance reduction in advances in neural information processing systems pages korattikara chen and welling austerity in mcmc land cutting the budget in proceedings of the international conference on machine learning kuhn the hungarian method for the assignment problem naval research logistics quarterly maclaurin and adams firefly monte carlo exact mcmc with subsets of data in proceedings of conference on uncertainty in artificial intelligence mandt and blei smoothed gradients for stochastic variational inference in advances in neural information processing systems pages neiswanger wang and xing asymptotically exact embarrassingly parallel mcmc in conference on uncertainty in artificial intelligence nishihara murray and adams parallel mcmc with generalized elliptical slice sampling journal of machine learning research niu recht and wright hogwild approach to parallelizing stochastic gradient descent in advances in neural information processing systems pages ranganath wang blei and xing an adaptive learning rate for stochastic variational inference in 
proceedings of the international conference on machine learning pages scott blocker and bonassi bayes and big data the consensus monte carlo algorithm in bayes strathmann sejdinovic and girolami unbiased bayes for big data paths of partial posteriors wang and dunson parallel mcmc via weierstrass sampler welling and teh bayesian learning via stochastic gradient langevin dynamics in proceedings of the international conference on machine learning 
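As a concrete illustration of the fixed-weight aggregation rules that VCMC improves upon, the following minimal Python sketch implements consensus aggregation over subposterior samples. It assumes each of the K workers returns the same number S of draws and that the s-th draws across workers are combined into one approximate posterior draw; the uniform rule uses W_k = I/K, and the Gaussian rule weights each partition by its empirical inverse covariance, normalized so that the weights sum to the identity. The function name consensus_aggregate and the small ridge term are ours rather than anything prescribed by the papers cited above.

import numpy as np

def consensus_aggregate(subposterior_samples, weighting="gaussian"):
    # subposterior_samples: list of K arrays, each of shape (S, d).
    # Returns an (S, d) array of approximate full-posterior draws.
    K = len(subposterior_samples)
    d = subposterior_samples[0].shape[1]
    if weighting == "uniform":
        weights = [np.eye(d) / K for _ in range(K)]
    else:
        # Gaussian-motivated weights: empirical inverse covariance per partition,
        # with a small ridge added for numerical stability.
        precisions = [np.linalg.inv(np.cov(s, rowvar=False) + 1e-8 * np.eye(d))
                      for s in subposterior_samples]
        norm = np.linalg.inv(sum(precisions))  # normalizer so the weights sum to I
        weights = [norm @ P for P in precisions]
    # Weighted combination of the s-th draw from every partition.
    combined = sum(W @ s.T for W, s in zip(weights, subposterior_samples)).T
    return combined

VCMC instead treats the W_k (or more general structured aggregation functions such as the spectral and combinatorial rules above) as variational parameters and fits them by projected stochastic gradient ascent on the relaxed objective, which is the source of the error reductions reported in the experiments.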
method second order method for glms via stein lemma murat erdogdu department of statistics stanford university erdogdu abstract we consider the problem of efficiently computing the maximum likelihood estimator in generalized linear models glms when the number of observations is much larger than the number of coefficients in this regime optimization algorithms can immensely benefit from approximate second order information we propose an alternative way of constructing the curvature information by formulating it as an estimation problem and applying lemma which allows further improvements through and eigenvalue thresholding our algorithm enjoys fast convergence rates resembling that of second order methods with modest cost we provide its convergence analysis for the case where the rows of the design matrix are samples with bounded support we show that the convergence has two phases quadratic phase followed by linear phase finally we empirically demonstrate that our algorithm achieves the highest performance compared to various algorithms on several datasets introduction generalized linear models glms play crucial role in numerous statistical and machine learning problems glms formulate the natural parameter in exponential families as linear model and provide miscellaneous framework for statistical methodology and supervised learning tasks celebrated examples include linear logistic multinomial regressions and applications to graphical models in this paper we focus on how to solve the maximum likelihood problem efficiently in the glm setting when the number of observations is much larger than the dimension of the coefficient vector glm optimization task is typically expressed as minimization problem where the objective function is the negative that is denoted by where rp is the coefficient vector many optimization algorithms are available for such minimization problems however only few uses the special structure of glms in this paper we consider updates that are specifically designed for glms which are of the from qr where is the step size and is scaling matrix which provides curvature information for the updates of the form eq the performance of the algorithm is mainly determined by the scaling matrix classical newton method nm and natural gradient descent ng are recovered by simply taking to be the inverse hessian and the inverse fisher information at the current iterate respectively second order methods may achieve quadratic convergence rate yet they suffer from excessive cost of computing the scaling matrix at every iteration on the other hand if we take to be the identity matrix we recover the simple gradient descent gd method which has linear convergence rate although gd convergence rate is slow compared to that of second order methods modest cost makes it practical for problems the between the convergence rate and cost has been extensively studied in regime the main objective is to construct scaling matrix that is computational feasible and provides sufficient curvature information for this purpose several methods have been proposed updates given by methods satisfy an equation which is often referred as the relation member of this class of algorithms is the bfgs algorithm in this paper we propose an algorithm that utilizes the structure of glms by relying on lemma it attains fast convergence rate with low cost we call our algorithm method which we abbreviate as newst our contributions are summarized as follows we recast the problem of constructing scaling matrix as an estimation 
problem and apply lemma along with to form computationally feasible newton method cost is replaced by np cost and cost where is the size assuming that the rows of the design matrix are and have bounded support and denoting the iterates of method by ˆt we prove bound of the form ˆt ˆt where is the minimizer and are the convergence coefficients the above bound implies that the convergence starts with quadratic phase and transitions into linear later we demonstrate its performance on four datasets by comparing it to several algorithms the rest of the paper is organized as follows section surveys the related work and section introduces the notations used throughout the paper section briefly discusses the glm framework and its relevant properties in section we introduce method develop its intuition and discuss the computational aspects section covers the theoretical results and in section we discuss how to choose the algorithm parameters finally in section we provide the empirical results where we compare the proposed algorithm with several other methods on four datasets related work there are numerous optimization techniques that can be used to find the maximum likelihood estimator in glms for moderate values of and classical second order methods such as nm ng are commonly used in problems data dimensionality is the main factor while choosing the right optimization method optimization tasks have been extensively studied through online and batch methods online methods use gradient or of single randomly selected observation to update the current iterate their cost is independent of but the convergence rate might be extremely slow there are several extensions of the classical stochastic descent algorithms sgd providing significant improvement stability on the other hand batch algorithms enjoy faster convergence rates though their cost may be prohibitive in particular second order methods attain quadratic rate but constructing the hessian matrix requires excessive computation many algorithms aim at forming an approximate scaling matrix this idea lies at the core of methods another approach to construct an approximate hessian makes use of techniques many contemporary learning methods rely on as it is simple and it provides significant boost over the first order methods further improvements through conjugate gradient methods and krylov are available many hybrid variants of the aforementioned methods are proposed examples include the combinations of and methods sgd and gd ng and nm ng and approximation lastly algorithms that specialize on certain types of glms include coordinate descent methods for the penalized glms and trust region newton methods notation let and denote the size of set by the gradient and the hessian of with respect to are denoted by and respectively the derivative of function is denoted by for vector rp and matrix and denote the and spectral norms respectively pc is the euclidean projection onto set and bp rp is the ball of radius for random variables and denote probability metrics to be explicitly defined later measuring the distance between the distributions of and generalized linear models distribution of random variable belongs to an exponential family with natural parameter if its density can be written of the form exp where is the cumulant generating function and is the carrier density let yn be independent observations such that yi yi for the joint likelihood is yn exp yi yi we consider the problem of learning the maximum likelihood estimator in the above exponential family 
framework where the vector rn is modeled through the linear relation for some design matrix with rows xi rp and coefficient vector rp this formulation is known as generalized linear models glms in canonical form the cumulant generating function determines the class of glms for the ordinary least squares ols and for the logistic regression lr log ez maximum likelihood estimation in the above formulation is equivalent to minimizing the negative function hxi yi hxi where hx is the inner product between the vectors and the relation to ols and lr can be seen much easier by plugging in the corresponding in eq the gradient and the hessian of can be written as hxi xi yi xi hxi xi xti for sequence of scaling matrices qt we consider iterations of the form ˆt tq ˆt where is the step size the above iteration is our main focus but with new approach on how to compute the sequence of matrices qt we formulate the problem of finding scalable qt as an estimation problem and use lemma that provides computationally efficient update method classical update is generally used for training glms however its cost makes it impractical for optimization the main bottleneck is the computation of the hessian matrix that requires flops which is prohibitive when numerous methods have been proposed to achieve nm fast convergence rate while keeping the cost manageable the task of constructing an approximate hessian can be viewed as an estimation problem assuming that the rows of are random vectors the hessian of glms with cumulant generating function has the following form xi xti hxi xxt hx we observe that qt is just sum of matrices hence the true hessian is nothing but sample mean estimator to its expectation another natural estimator would be the hessian method suggested by similarly our goal is to propose an appropriate estimator that is also computationally efficient we use the following lemma to derive an efficient estimator to the expectation of hessian lemma lemma assume that np and rp is constant vector then for any function that is twice weakly differentiable we have xxt hx hx hx algorithm method input set and set of indices uniformly at random and argminrank bs compute while ˆt do pn pn ˆt hxi ˆt ˆt hxi ˆt qt ˆt pb end while output ˆt ˆt ˆt ˆt ˆt ˆt ˆt ˆt qt ˆt the proof of lemma is given in appendix the right hand side of eq is update to the first term hence its inverse can be computed with cost quantities that change at each iteration are the ones that depend on hx and hx and are scalar quantities and can be estimated by their corresponding sample means and explicitly defined at step of algorithm with only np computation to complete the estimation task suggested by eq we need an estimator for the covariance matrix natural estimator is the sample mean where we only use so that the cost is reduced to from based sample mean estimator bs is denoted by xi xi which is widely used in problems we highlight the fact that lemma replaces nm cost with cost of we further use to reduce this cost to in general important curvature information is contained in the largest few spectral features following we take the largest eigenvalues of the covariance estimator setting rest of them to eigenvalue this operation helps denoising and would require only computation step of algorithm performs this procedure inverting the constructed hessian estimator can make use of the structure several times first notice that the updates in eq are based on matrix additions hence we can simply use matrix inversion formula to derive an explicit 
equation see qt in step of algorithm this formulation would impose another inverse operation on the covariance estimator since the covariance estimator is also based on approximation one can utilize the inversion formula again we emphasize that this operation is performed once therefore instead of nm cost of due to inversion method newst requires and cost of assuming that newst and nm converge in and iterations respectively the overall complexity of newst is whereas that of nm is even though proposition assumes that the covariates are multivariate gaussian random vectors in section the only assumption we make on the covariates is that they have bounded support which covers wide class of random variables the left plot of figure shows that the estimation is accurate for various distributions this is consequence of the fact that the proposed estimator in eq relies on the distribution of only through inner products of the form hx vi which in turn results in approximate normal distribution due to the central limit theorem when is sufficiently large we will discuss this phenomenon in detail in section the convergence rate of method has two phases convergence starts quadratically and transitions into linear rate when it gets close to the true minimizer the phase transition behavior can be observed through the right plot in figure this is consequence of the bound provided in eq which is the main result of our theorems stated in section difference between estimated and true hessian convergence rate randomness bernoulli gaussian poisson uniform error estimation error size newst newst dimension iterations figure the left plot demonstrates the accuracy of proposed hessian estimation over different distributions number of observations is set to be log the right plot shows the phase transition in the convergence rate of method newst convergence starts with quadratic rate and transitions into linear plots are obtained using covertype dataset theoretical results we start this section by introducing the terms that will appear in the theorems then we provide our technical results on uniformly bounded covariates the proofs are provided in appendix preliminaries hessian estimation described in the previous section relies on gaussian approximation for theoretical purposes we use the following probability metric to quantify the gap between the distribution of xi and that of normal vector definition given family of functions and random vectors rp and any define dh sup dh where dh many probability metrics can be expressed as above by choosing suitable function class examples include total variation tv kolmogorov and wasserstein metrics based on the second and fourth derivatives of cumulant generating function we define the following classes hx bp hx bp hv hx bp where bp rp is the ball of radius exact calculation of such probability metrics are often difficult the general approach is to upper bound the distance by more intuitive metric in our case we observe that dhj for can be easily upper bounded by dtv up to scaling constant when the covariates have bounded support we will further assume that the covariance matrix follows model ui ui which is commonly encountered in practice this simply means that the first eigenvalues of the covariance matrix are large and the rest are small and equal to each other large eigenvalues of correspond to the signal part and small ones denoted by can be considered as the noise component composite convergence rate we have the following bound for the iterates generated by the method 
when the covariates are supported on ball theorem that the covariates xn are random vectors supported on ball of radius with xi and xi xti where follows the model further assume that the cumulant generating function has bounded derivatives and that is the radius of the projection pbp for ˆt given by the method for define the event ˆt ˆt ˆt ˆt bp for some positive constant and the optimal value if and are sufficiently large then there exist constants and depending on the radii and the bounds on and such that conditioned on the event with probability at least we have ˆt ˆt where the coefficients and are deterministic constants defined as min log log and is defined as for multivariate gaussian random variable with the same mean and covariance as xi the bound in eq holds with high probability and the coefficients and are deterministic constants which will describe the convergence behavior of the method observe that the coefficient is sum of two terms measures how accurate the hessian estimation is and the second term depends on the size and the data dimensions theorem shows that the convergence of method can be upper bounded by compositely converging sequence that is the squared term will dominate at first giving quadratic rate then the convergence will transition into linear phase as the iterate gets close to the optimal value the coefficients and govern the linear and quadratic terms respectively the effect of appears in the coefficient of linear term in theory there is threshold for the subsampling size namely log beyond which further has no effect the transition point between the quadratic and the linear phases is determined by the size and the properties of the data the phase transition can be observed through the right plot in figure using the above theorem we state the following corollary corollary assume that the assumptions of theorem hold for constant ec tolerance satisfying and for an iterate satisfying ˆt the iterates of method will satisfy ˆt ˆt where and are as in theorem the bound stated in the above corollary is an analogue of composite convergence given in eq in expectation note that our results make strong assumptions on the derivatives of the cumulant generating function we emphasize that these assumptions are valid for linear and logistic regressions an example that does not fit in our scheme is poisson regression with ez however we observed empirically that the algorithm still provides significant improvement the following theorem states sufficient condition for the convergence of composite sequence theorem let ˆt be compositely converging sequence with convergence coefficients and as in eq to the minimizer let the starting point satisfy and define then the sequence of converges to further the number of iterations to reach tolerance of can be upper bounded by inf where log log log log above theorem gives an upper bound on the number of iterations until reaching tolerance of the first and second terms on the right hand side of eq stem from the quadratic and linear phases respectively algorithm parameters newst takes three input parameters and for those we suggest choices based on our theoretical results size newst uses subset of indices to approximate the covariance matrix corollary of proves that sample size of is sufficient for covariates and that of log is sufficient for arbitrary distributions supported in some ball to estimate covariance matrix by its sample mean estimator in the regime we consider we suggest to use sample size of log rank many methods have been 
suggested to improve the estimation of covariance matrix and almost all of them rely on the concept of shrinkage eigenvalue thresholding can be considered as shrinkage operation which will retain only the important second order information choosing the rank threshold can be simply done on the sample mean estimator of after obtaining the estimate of the mean one can either plot the spectrum and choose manually or use technique from step size step size choices of newst are quite similar to newton method see the main difference comes from the eigenvalue thresholding if the data follows the model the optimal step size will be close to if there is no however due to fluctuations resulting from we suggest the following step size choice for newst in general this formula yields step size greater than which is due to rank thresholding providing faster convergence see for detailed discussion experiments in this section we validate the performance of newst through extensive numerical studies we experimented on two commonly used glm optimization problems namely logistic regression lr and linear regression ols lr minimizes eq for the logistic function log ez whereas ols minimizes the same equation for in the following we briefly describe the algorithms that are used in the experiments newton method nm uses the inverse hessian evaluated at the current iterate and may achieve quadratic convergence nm steps require computation which makes it impractical for datasets bfgs forms curvature matrix by cultivating the information from the iterates and the gradients at each iteration under certain assumptions the convergence rate is locally and the cost is comparable to that of first order methods limited memory bfgs is similar to bfgs and uses only the recent few iterates to construct the curvature matrix gaining significant performance in terms of memory gradient descent gd update is proportional to the negative of the full gradient evaluated at the current iterate under smoothness assumptions gd achieves linear convergence rate with np cost accelerated gradient descent agd is proposed by nesterov which improves over the gradient descent by using momentum term performance of agd strongly depends of the smoothness of the function for all the algorithms we use constant step size that provides the fastest convergence size rank and the constant step size for newst is selected by following the guidelines in section we experimented over two real two synthetic datasets which are summarized in table synthetic data are generated through multivariate gaussian distribution and data dimensions are chosen so that newton method still does well the experimental results are summarized in figure we observe that newst provides significant improvement over the classical techniques the methods that come closer to newst is newton method for moderate and and bfgs when is large observe that the convergence rate of newst has clear phase transition point as argued earlier this point depends on various factors including size and data dimensions the method newst bfgs lbfgs newton gd agd covertype logistic regression time sec linear regression log error method newst bfgs lbfgs newton gd agd time sec linear regression method newst bfgs lbfgs newton gd agd time sec linear regression method newst bfgs lbfgs newton gd agd log error log error method newst bfgs lbfgs newton gd agd logistic regression method newst bfgs lbfgs newton gd agd log error ct slices logistic regression log error log error method newst bfgs lbfgs newton gd agd time sec 
linear regression method newst bfgs lbfgs newton gd agd log error logistic regression log error dataset time sec time sec time sec time sec figure performance of various optimization methods on different datasets red straight line represents the proposed method newst algorithm parameters including the rank threshold is selected by the guidelines described in section rank threshold and structure of the covariance matrix the prediction of the phase transition point is an interesting line of research which would allow further tuning of algorithm parameters the optimal for newst will typically be larger than which is mainly due to the eigenvalue thresholding operation this feature is desirable if one is able to obtain large that provides convergence in such cases the convergence is likely to be faster yet more unstable compared to the smaller step size choices we observed that similar to other second order algorithms newst is susceptible to the step size selection if the data is not and the size is not sufficiently large algorithm might have poor performance this is mainly because the subsampling operation is performed only once at the beginning therefore it might be good in practice to once in every few iterations dataset ct slices covertype reference uci repo model model table datasets used in the experiments discussion in this paper we proposed an efficient algorithm for training glms we call our algorithm method newst as it takes newton update at each iteration relying on lemma the algorithm requires one time cost to estimate the covariance structure and np cost to form the update equations we observe that the convergence of newst has phase transition from quadratic rate to linear this observation is justified theoretically along with several other guarantees for covariates with bounded support such as bounds conditions for convergence etc parameter selection guidelines of newst are based on our theoretical results our experiments show that newst provides high performance in glm optimization relaxing some of the theoretical constraints is an interesting line of research in particular bounded support assumption as well as strong constraints on the cumulant generating functions might be loosened another interesting direction is to determine when the phase transition point occurs which would provide better understanding of the effects of and rank thresholding acknowledgements the author is grateful to mohsen bayati and andrea montanari for stimulating conversations on the topic of this work the author would like to thank bhaswar bhattacharya and qingyuan zhao for carefully reading this article and providing valuable feedback references amari natural gradient works efficiently in learning neural computation richard byrd gillian chin will neveitt and jorge nocedal on the use of stochastic hessian information in optimization methods for machine learning siam journal on optimization jock blackard and denis dean comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables computers and electronics in agriculture richard byrd sl hansen jorge nocedal and yoram singer stochastic method for optimization arxiv preprint christopher bishop neural networks for pattern recognition oxford university press lèon bottou machine learning with stochastic gradient descent compstat jinho baik and jack silverstein eigenvalues of large sample covariance matrices of spiked population models journal of multivariate analysis no stephen boyd 
and lieven vandenberghe convex optimization cambridge university press cai emmanuel candès and zuowei shen singular value thresholding algorithm for matrix completion siam journal on optimization no louis hy chen larry goldstein and shao normal approximation by method springer science lee dicker and murat erdogdu flexible results for quadratic forms with applications to variance components estimation arxiv preprint david donoho and matan gavish the optimal hard threshold for singular values is david donoho matan gavish and iain johnstone optimal shrinkage of eigenvalues in the spiked covariance model arxiv preprint john duchi elad hazan and yoram singer adaptive subgradient methods for online learning and stochastic optimization mach learn res murat erdogdu and andrea montanari convergence rates of newton methods arxiv preprint jerome friedman trevor hastie and rob tibshirani regularization paths for generalized linear models via coordinate descent journal of statistical software no michael friedlander and mark schmidt hybrid methods for data fitting siam journal on scientific computing no franz graf kriegel matthias schubert sebastian pölsterl and alexander cavallaro image registration in ct images using radial image descriptors miccai springer alison gibbs and francis su on choosing and bounding probability metrics isr daphne koller and nir friedman probabilistic graphical models mit press lichman uci machine learning repository nicolas le roux and andrew fitzgibbon fast natural newton method icml nicolas le roux manzagol and yoshua bengio topmoumoute online natural gradient algorithm nips lin ruby weng and sathiya keerthi trust region newton method for logistic regression jmlr james martens deep learning via optimization icml pp peter mccullagh and john nelder generalized linear models vol chapman and hall yurii nesterov method for unconstrained convex minimization problem with the rate of convergence doklady an sssr vol pp introductory lectures on convex optimization basic course vol springer mark schmidt nicolas le roux and francis bach minimizing finite sums with the stochastic average gradient arxiv preprint charles stein estimation of the mean of multivariate normal distribution annals of statistics roman vershynin introduction to the analysis of random matrices oriol vinyals and daniel povey krylov subspace descent for deep learning aistats 
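To make the structure of the update concrete, here is a deliberately simplified Python sketch of a Newton-Stein-style iteration for logistic regression. It follows the recipe described above: a one-time covariance estimate from a subsample; eigenvalue thresholding that keeps the top r eigenvalues and replaces the rest by their average; and per-iteration rescaling of the gradient by the inverse of mu2 times the thresholded covariance, where mu2 is the average second derivative of the cumulant generating function at the current iterate. It omits the additional rank-one correction that the full estimator obtains from Stein's lemma, so it should be read as an illustrative sketch rather than the paper's algorithm; all function names, default parameters, and the small eigenvalue floor are ours.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def thresholded_covariance(X_sub, rank):
    # Sample second-moment matrix of the subsample with rank-r eigenvalue
    # thresholding: keep the top-r eigenvalues, replace the rest by their average.
    S = X_sub.T @ X_sub / X_sub.shape[0]
    vals, vecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]   # sort descending
    k = min(rank, len(vals))
    sigma2 = vals[k:].mean() if k < len(vals) else 0.0
    vals_thr = np.concatenate([vals[:k], np.full(len(vals) - k, sigma2)])
    return vecs, np.maximum(vals_thr, 1e-8)  # small floor guards against zero eigenvalues

def newton_stein_logistic(X, y, rank=20, sub_size=None, step=1.0, iters=50):
    # Simplified Newton-Stein-style iterations: one-time covariance estimation
    # from a subsample, then cheap scaled-gradient steps.
    n, p = X.shape
    sub_size = sub_size or min(n, 10 * p)
    idx = np.random.choice(n, size=sub_size, replace=False)
    vecs, vals = thresholded_covariance(X[idx], rank)
    beta = np.zeros(p)
    for _ in range(iters):
        z = X @ beta
        grad = X.T @ (sigmoid(z) - y) / n                # gradient of the negative log-likelihood
        mu2 = np.mean(sigmoid(z) * (1.0 - sigmoid(z)))   # average second derivative of the cumulant
        # Scaled direction Q grad with Q = (mu2 * Sigma_r)^{-1}, applied via the eigendecomposition.
        direction = vecs @ ((vecs.T @ grad) / (mu2 * vals))
        beta = beta - step * direction
    return beta

With n much larger than p, the dominant per-iteration cost is the O(n p) gradient computation, matching the motivation above; step sizes near one, or slightly larger as suggested in the step-size discussion, are a reasonable starting point.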
practical and optimal lsh for angular distance alexandr columbia university piotr indyk mit ilya razenshteyn mit thijs laarhoven tu eindhoven ludwig schmidt mit abstract we show the existence of hashing lsh family for the angular distance that yields an approximate near neighbor search algorithm with the asymptotically optimal running time exponent unlike earlier algorithms with this property spherical lsh our algorithm is also practical improving upon the hyperplane lsh in practice we also introduce multiprobe version of this algorithm and conduct an experimental evaluation on real and synthetic data sets we complement the above positive results with lower bound for the quality of any lsh family for angular distance our lower bound implies that the above lsh family exhibits between evaluation time and quality that is close to optimal for natural class of lsh functions introduction nearest neighbor search is key algorithmic problem with applications in several fields including computer vision information retrieval and machine learning given set of points rd the goal is to build data structure that answers nearest neighbor queries efficiently for given query point rd find the point that is closest to under an appropriately chosen distance metric the main algorithmic design goals are usually fast query time small memory footprint the approximate good quality of the returned solution there is wide range of algorithms for nearest neighbor search based on techniques such as space partitioning with indexing as well as dimension reduction or sketching popular method for point sets in spaces is hashing lsh an approach that offers provably query time and space complexity and has been shown to achieve good empirical performance in variety of applications the method relies on the notion of hash functions intuitively hash function is if its probability of collision is higher for nearby points than for points that are far apart more formally two points are nearby if their distance is at most and they are far apart if their distance is at least where quantifies the gap between near and far the quality of hash function is characterized by two key parameters is the collision probability for nearby points and is the collision probability for points that are far apart the gap between and determines how sensitive the hash function is to changes in distance and this property is captured by the parameter log log which can usually be expressed as function of the distance gap the problem of designing good hash functions and efficient nearest neighbor search algorithms has attracted significant attention over the last few years the authors are listed in alphabetical order in this paper we focus on lsh for the euclidean distance on the unit sphere which is an important special case for several reasons first the spherical case is relevant in practice euclidean distance on sphere corresponds to the angular distance or cosine similarity which are commonly used in applications such as comparing image feature vectors speaker representations and data sets moreover on the theoretical side the paper shows reduction from nearest neighbor search in the entire euclidean space to the spherical case these connections lead to natural question what are good lsh families for this special case on the theoretical side the recent work of gives the best known provable guarantees for lshbased nearest neighbor search the euclidean distance on the unit sphere specifically their algorithm has query time of nρ and space complexity of for 
for the approximation factor the algorithm achieves query time of at the heart of the algorithm is an lsh scheme called spherical lsh works vectors its key property is that it can distinguish between distances and with probabilities yielding the formula for the full range of distances is more complex and given in section unfortunately the scheme as described in the paper is not applicable in practice as it is based on rather complex hash functions that are very time consuming to evaluate simply evaluating single hash function from can take more time than linear scan over points since an lsh data structure contains many individual hash functions using their scheme would be slower than simple linear scan over all points in unless the number of points is extremely large on the practical side the hyperplane lsh introduced in the influential work of charikar has worse theoretical guarantees but works well in practice since the hyperplane lsh can be implemented very efficiently it is the standard hash function in practical nearest neighbor and the resulting implementations has been shown to improve over linear scan on real data by multiple orders of magnitude the aforementioned discrepancy between the theory and practice of lsh raises an important question is there hash function with optimal guarantees that also improves over the hyperplane lsh in practice in this paper we show that there is family of hash functions that achieves both objectives specifically the hash functions match the theoretical guarantee of spherical lsh from and when combined with additional techniques give better experimental results than the hyperplane lsh more specifically our contributions are theoretical guarantees for the lsh we show that hash function based on randomly rotated unit balls of the achieves the same parameter as the spherical lsh scheme in assuming data points are unit vectors while the lsh family has been proposed by researchers before we give the first theoretical analysis of its performance lower bound for cosine similarity lsh to highlight the difficulty of obtaining optimal and practical lsh schemes we prove the first lower bound on the between the collision probabilities and so far the optimal lsh upper bound from and from here attain this bound only in the limit as very small and are undesirable since the hash evaluation time is often proportional to our lower bound proves this is unavoidable if we require to be large has to be suboptimal this result has two important implications for designing practical hash functions first it shows that the achieved by the lsh and the scheme of are essentially optimal second the lower bound guides design of future lsh functions if one is to significantly improve upon the lsh one has to design hash function that is computed more efficiently than by explicitly enumerating its range see section for more detailed discussion multiprobe scheme for the lsh the space complexity of an lsh data structure is but even this is often too large strongly in the number of points this running time is known to be essentially optimal for large class of algorithms note that if the data points are binary more efficient lsh schemes exist however in this paper we consider algorithms for general vectors and several methods have been proposed to address this issue empirically the most efficient scheme is multiprobe lsh which leads to significantly reduced memory footprint for the hyperplane lsh in order to make the lsh competitive in practice with the multiprobe hyperplane lsh we propose 
novel multiprobe scheme for the lsh we complement these contributions with an experimental evaluation on both real and synthetic data sift vectors data and random point set in order to make the lsh practical we combine it with fast rotations via the fast hadamard transform and feature hashing to exploit sparsity of data our results show that for data sets with around to points our multiprobe variant of the lsh is up to faster than an efficient implementation of the hyperplane lsh and up to faster than linear scan to the best of our knowledge our combination of techniques provides the first algorithm that empirically improves over the hyperplane lsh in terms of query time for an exact nearest neighbor search related work the lsh functions were originally proposed in however the analysis in that paper was mostly experimental specifically the probabilities and of the proposed lsh functions were estimated empirically using the monte carlo method similar hash functions were later proposed in the latter paper also uses dft to the random matrix multiplication operation both of the aforementioned papers consider only the algorithm there are several works that show lower bounds on the quality of lsh hash functions however those papers provide only lower bound on the parameter for asymptotic values of and as opposed to an actual between these two quantities in this paper we provide such with implications as outlined in the introduction preliminaries we use to denote the euclidean norm on rd we also use to denote the unit sphere in rd centered in the origin the gaussian distribution with mean zero and variance of one is denoted by let be normalized haar measure on that is note that it corresponds to the uniform distribution over we also let be point sampled from uniformly at random for we denote φc pr dt we will be interested in the near neighbor search on the sphere with respect to the euclidean distance note that the angular distance can be expressed via the euclidean distance between normalized vectors so our results apply to the angular distance as well definition given an dataset on the sphere the goal of the near neighbor problem ann is to build data structure that given query with the promise that there exists datapoint with kp qk reports datapoint within distance cr from definition we say that hash family on the sphere is if for every one has pr if kx yk and pr if kx yk it is known that an efficient cr hash family implies data structure for log ann with space dn and query time nρ where log lsh in this section we describe the lsh analyze it and show how to make it practical first we recall the definition of the lsh consider the following hash family for points on unit sphere rd let be random matrix with gaussian entries random rotation to hash point we compute and then find the point closest to from where ei is the standard basis vector of rd we use the closest neighbor as hash of the following theorem bounds the collision probability for two points under the above family theorem suppose that are such that kp qk where then ln ln oτ ln ln pr before we show how to prove this theorem we briefly describe its implications theorem shows that the lsh achieves essentially the same bounds on the collision probabilities as the theoretically optimal lsh for the sphere from see section spherical lsh there in particular substituting the bounds from theorem for the lsh into the standard reduction from near neighbor search to lsh we obtain the following data structure with space and sublinear query time for 
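A minimal sketch of the cross-polytope hash family described above, assuming the random rotation is replaced by a matrix with i.i.d. Gaussian entries (as the text notes one may do) and using an integer encoding of the 2d hash values chosen here for convenience; the hyperplane hash is included only for comparison and is not part of the construction.

```python
import numpy as np

def cross_polytope_hash(x, A):
    """Hash a unit vector x with the cross-polytope LSH sketched above.

    A is a d-by-d matrix with i.i.d. Gaussian entries (a surrogate for a
    random rotation). The hash is the vertex of {+-e_1, ..., +-e_d} closest
    to Ax / ||Ax||, i.e. the index of the largest-magnitude coordinate of Ax
    together with its sign, encoded here as an integer in 0 .. 2d-1.
    """
    y = A @ x
    i = int(np.argmax(np.abs(y)))
    return 2 * i + (0 if y[i] >= 0 else 1)

def hyperplane_hash(x, r):
    """Charikar's hyperplane LSH, shown only for comparison: sign(<r, x>)."""
    return int(np.dot(r, x) >= 0)

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
A = rng.standard_normal((d, d))
print(cross_polytope_hash(x, A), hyperplane_hash(x, A[0]))
```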
near neighbor search on sphere corollary the on unit sphere can be solved in space dn and query time nρ where we now outline the proof of theorem for the full proof see appendix due to the spherical symmetry of gaussians we can assume that and where are such that and then we expand the collision probability pr pr pr pr and βvi and where indeed the first step is due to the spherical symmetry of the hash family the second step follows from the above discussion about replacing random orthogonal matrix with gaussian one and that one can assume that and the last step is due to the independence of the entries of and thus proving theorem reduces to estimating the side of note that the probability pr and is equal to the gaussian area of the planar set shown in figure the latter is heuristically equal to where is the distance from the origin to the complement of which is easy to compute see appendix for the precise statement of this argument using this estimate we compute by taking the outer expectation making the lsh practical as described above the lsh is not quite practical the main bottleneck is sampling storing and applying random rotation in particular to multiply random gaussian matrix with vector we need time proportional to which is infeasible for large rotations to rectify this issue we instead use rotations instead of multiplying an input vector by random gaussian matrix we apply the following linear transformation where is the hadamard transform and di for is random diagonal clearly this is an orthogonal transformation which one can store in space and evaluate in time log using the fast hadamard transform this is similar to rotations used in the context of lsh dimensionality reduction or compressed sensing while we are currently not aware how to prove rigorously that such pseudorandom rotations perform as well as the fully random ones empirical evaluations show that three applications of hdi are exactly equivalent to applying true random rotation when tends to infinity we note that only two applications of hdi are not sufficient figure sensitivity αx βy lsh lower bound αx βy number of parts between and the number of parts for distances and approximation both bounds tend to see discussion in section the set appearing in the analysis of the crosspolytope lsh and feature hashing while we can apply rotation in time log even this can be too slow consider an input vector that is sparse the number of entries of is much smaller than in this case we can evaluate the hyperplane lsh from in time while computing the lsh even with rotations still takes time log to the lsh for sparse vectors we apply feature hashing before performing rotation we reduce the dimension from to by applying linear map sx where is random sparse matrix whose columns have one entry sampled uniformly this way the evaluation time becomes log partial lsh in the above discussion we defined the lsh as hash family that returns the closest neighbor among as hash after random rotation in principle we do not have to consider all basis vectors when computing the closest neighbor by restricting the hash to basis vectors instead theorem still holds for the new hash family with replaced by since the analysis is essentially this slight generalization of the lsh turns out to be useful for experiments see section note that the case corresponds to the hyperplane lsh lower bound let be hash family on for we would like to understand the between and where is the smallest probability of collision under for points at distance at most and the 
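The following sketch illustrates the pseudo-random rotation and the feature-hashing step under stated assumptions: the dimension is a power of two, the Walsh-Hadamard transform is normalized here so that the map is orthogonal, and the sparse map S is a toy version with a single +-1 entry per column; the reduced dimensions used in the paper's experiments are not reproduced.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform, O(d log d); len(x) must be a power of 2.
    Normalized by 1/sqrt(d) so the transform is orthogonal."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(len(x))

def pseudo_rotation(x, signs):
    """Apply the H D_3 H D_2 H D_1 pseudo-random rotation sketched above;
    `signs` holds the three random +-1 diagonals D_i."""
    for d_i in signs:
        x = fwht(d_i * x)
    return x

def feature_hash(x, cols, col_signs, d_reduced):
    """Toy feature hashing: column j of the sparse matrix S has one +-1 entry
    in row cols[j], so evaluation costs O(number of nonzeros of x)."""
    out = np.zeros(d_reduced)
    np.add.at(out, cols, col_signs * x)
    return out

rng = np.random.default_rng(0)
d = 256
x = rng.standard_normal(d)
signs = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]
y = pseudo_rotation(x, signs)          # an orthogonal map: the norm is preserved
cols = rng.integers(0, 64, size=d)
z = feature_hash(x, cols, rng.choice([-1.0, 1.0], size=d), 64)
print(np.linalg.norm(x), np.linalg.norm(y), z.shape)
```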
largest collision for points at distance at least we focus on the case because setting to as tends to infinity allows us to replace with the following quantity that is somewhat easier to handle pr this quantity is at most since unit sphere the distance between two random points is tightly concentrated around so for hash family on unit sphere we would like to understand the upper bound on in terms of and for and we define pr and pr note that one can apply lemma from the arxiv version of to claim such dimension distance between any two points remains sufficiently concentrated for the bounds from theorem to still hold with replaced by we are now ready to formulate the main result of this section theorem let be hash family on such that every function in partitions the sphere into at most parts of measure at most then we have where is such that φc and is quantity that depends on and and tends to as tends to infinity the idea of the proof is first to reason about one part of the partition using the isoperimetric inequality from and then to apply certain averaging argument by proving concavity of function related to using delicate analytic argument for the full proof see appendix we note that the above requirement of all parts induced by having measure at most is only technicality we conjecture that theorem holds without this restriction in any case as we will see below in the interesting range of parameters this restriction is essentially irrelevant one can observe that if every hash function in partitions the sphere into at most parts then indeed is precisely the average sum of squares of measures of the parts this observation combined with theorem leads to the following interesting consequence specifically we can numerically estimate in order to give lower bound on log log for any hash family in which every function induces at most parts of measure at most see figure where we plot this lower bound for together with an upper bound that is given by the for which we use numerical estimates for we can make several conclusions from this plot first the lsh gives an almost optimal between and given that the evaluation time for the lsh is log if one uses rotations we conclude that in order to improve upon the lsh substantially in practice one should design an lsh family with being close to optimal and evaluation time that is sublinear in we note that none of the known lsh families for sphere has been shown to have this property this direction looks especially interesting since the convergence of to the optimal value as tends to infinity is extremely slow for instance according to figure for and we need more than parts to achieve whereas the optimal is multiprobe lsh for the lsh we now describe our multiprobe scheme for the lsh which is method for reducing the number of independent hash tables in an lsh data structure given query point standard lsh data structure considers only single cell in each of the hash tables the cell is given by the hash value hi for in multiprobe lsh we consider candidates from multiple cells in each table the rationale is the following points that are close to but fail to collide with under hash function hi are still likely to hash to value that is close to hi by probing multiple hash locations close to hi in the same table multiprobe lsh achieves given probability of success with smaller number of hash tables than standard lsh multiprobe lsh has been shown to perform well in practice the main ingredient in multiprobe lsh is probing scheme for generating and ranking possible 
modifications of the hash value hi the probing scheme should be computationally efficient and ensure that more likely hash locations are probed first for single hash the order of alternative hash values is straightforward let be the randomly rotated version of query point recall that the main hash value is hi arg then it is easy to see that the second highest probability of collision is achieved for the hash value corresponding to the coordinate with the second largest absolute value etc therefore we consider the indices sorted by their absolute value as our probing sequence or ranking for single the remaining question is how to combine multiple rankings when we have more than one hash function as in the analysis of the lsh see section we consider two points and at distance let be the gaussian matrix of hash the situation is qualitatively similar for other values of more specifically for the partial version from section since should be constant while grows in order to simplify notation we consider slightly modified version of the lsh that maps both the standard basis vector and its opposite to the same hash value it is easy to extend the multiprobe scheme defined here to the full lsh from section function hi and let be the randomly rotated version of point given we are interested in the probability of hashing to certain combination of the individual rankings more formally let rvi be the index of the vi largest element of where specifies the alternative probing location then we would like to compute pr hi rv for all pr arg max rv if we knew this probability for all we could sort the probing locations by their probability we now show how to approximate this probability efficiently for single value of and hence drop the superscripts to simplify notation wlog we permute the rows of so that rv and get pr arg max αx pr arg max id the rhs is the gaussian measure of the set rd arg similar to the analysis of the lsh we approximate the measure of by its distance to the origin then the probability of probing location is proportional to exp where yx is the shortest vector such that arg maxj note that the factor becomes proportionality constant and hence the probing scheme does not require to know the distance for computational performance and simplicity we make further approximation and use yx maxi ev we only consider modifying single coordinate to reach the set once we have estimated the probabilities for each vi we incrementally construct the probing sequence using binary heap similar to the approach in for probing sequence of length the resulting algorithm has running time log log in our experiments we found that the log time taken to sort the probing candidates vi dominated the running time of the hash function evaluation in order to circumvent this issue we use an incremental sorting approach that only sorts the relevant parts of each and gives running time of log experiments we now show that the lsh combined with our multiprobe extension leads to an algorithm that is also efficient in practice and improves over the hyperplane lsh on several data sets the focus of our experiments is the query time for an exact nearest neighbor search since hyperplane lsh has been compared to other algorithms before we limit our attention to the relative compared with hyperplane hashing we evaluate the two hashing schemes on three types of data sets we use synthetic data set of randomly generated points because this allows us to vary single problem parameter while keeping the remaining parameters constant we also 
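A toy version of the probing scheme for a single cross-polytope hash function, using the single-coordinate approximation described above to score alternative hash values; combining the rankings of several hash functions with a heap or incremental sort, as in the paper, is omitted here.

```python
import numpy as np

def probing_candidates(y, num_probes):
    """Rank alternative cross-polytope hash values for a rotated query y.

    The main hash is the largest-magnitude coordinate of y; an alternative
    signed coordinate (i, s) is scored by how far y would have to move for
    s*y[i] to become the maximum, i.e. by (|y[imax]| - s*y[i])**2 / 2, and
    locations are probed in order of increasing score. Hash values are
    encoded as integers 2*i (+e_i) and 2*i+1 (-e_i), as in the sketch above.
    """
    d = len(y)
    imax = int(np.argmax(np.abs(y)))
    best = abs(y[imax])
    scores = []
    for i in range(d):
        for s in (+1.0, -1.0):
            if i == imax and s * np.sign(y[i]) > 0:
                score = 0.0                       # the main hash location itself
            else:
                score = 0.5 * (best - s * y[i]) ** 2
            scores.append((score, 2 * i + (0 if s > 0 else 1)))
    scores.sort()
    return [h for _, h in scores[:num_probes]]

rng = np.random.default_rng(1)
y = rng.standard_normal(16)
print(probing_candidates(y, num_probes=5))
```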
investigate the performance of our algorithm on real data two data sets and set of sift feature vectors we have chosen these data sets in order to illustrate when the lsh gives large improvements over the hyperplane lsh and when the improvements are more modest see appendix for more detailed description of the data sets and our experimental setup implementation details cpu in all experiments we set the algorithm parameters so that the empirical probability of successfully finding the exact nearest neighbor is at least moreover we set the number of lsh tables so that the amount of additional memory occupied by the lsh data structure is comparable to the amount of memory necessary for storing the data set we believe that this is the most interesting regime because significant memory overheads are often impossible for large data sets in order to determine the parameters that are not fixed by the above constraints we perform grid search over the remaining parameter space and report the best combination of parameters for the hash we consider partial in the last of the hash functions in order to get smooth between the various parameters see section multiprobe experiments in order to demonstrate that the multiprobe scheme is critical for making the lsh competitive with hyperplane hashing we compare the performance of data set method query time ms vs hp best number of candidates hashing time ms distances time ms nyt nyt hp cp ms ms pubmed pubmed hp cp ms ms sift sift hp cp ms ms table average running times for single nearest neighbor query with the hyperplane hp and cp algorithms on three real data sets the lsh is faster than the hyperplane lsh on all data sets with significant for the two data sets nyt and pubmed for the lsh the entries for include both the number of individual hash functions per table and in parenthesis the dimension of the last of the standard lsh data structure with our multiprobe variant on an instance of the random data set as can be seen in table appendix the multiprobe variant is about faster in our setting note that in all of the following experiments the of the multiprobe lsh compared to the multiprobe hyperplane lsh is less than hence without our multiprobe addition the lsh would be slower than the hyperplane lsh for which multiprobe scheme is already known experiments on random data next we show that the better time complexity of the crosspolytope lsh already applies for moderate values of in particular we compare the lsh combined with fast rotations section and our multiprobe scheme to lsh on random data we keep the dimension and the distance to the nearest neighbor fixed and vary the size of the data set from to the number of hash tables is set to for points the lsh is already faster than the hyperplane lsh and for the speedup is see table in appendix compared to linear scan the achieved by the lsh ranges from for to about for experiments on real data on the sift data set and the lsh achieves modest of compared to the hyperplane lsh see table on the other hand the is is on the two data sets which is significant improvement considering the relatively small size of the nyt data set one important difference between the data sets is that the typical distance to the nearest neighbor is smaller in the sift data set which can make the nearest neighbor problem easier see appendix since the data sets are very but sparse we use the feature hashing approach described in section in order to reduce the hashing time of the lsh the standard hyperplane lsh already runs in time 
proportional to the sparsity of vector we use and as feature hashing dimensions for nyt and pubmed respectively acknowledgments we thank michael kapralov for many valuable discussions during various stages of this work we also thank stefanie jegelka and rasmus pagh for helpful conversations this work was supported in part by the nsf and the simons foundation work done in part while the first author was at the simons institute for the theory of computing references alexandr andoni piotr indyk huy nguyen and ilya razenshteyn beyond hashing in soda full version at http alexandr andoni and ilya razenshteyn optimal hashing for approximate near neighbors in stoc full version at http moses charikar similarity estimation techniques from rounding algorithms in stoc gregory shakhnarovich trevor darrell and piotr indyk methods in learning and vision theory and practice mit press cambridge ma hanan samet foundations of multidimensional and metric data structures morgan kaufmann sariel piotr indyk and rajeev motwani approximate nearest neighbor towards removing the curse of dimensionality theory of computing hervé jégou matthijs douze and cordelia schmid product quantization for nearest neighbor search ieee transactions on pattern analysis and machine intelligence ludwig schmidt matthew sharifi and ignacio lopez moreno speaker identification in icassp narayanan sundaram aizana turmukhametova nadathur satish todd mostak piotr indyk samuel madden and pradeep dubey streaming similarity search over one billion tweets using parallel localitysensitive hashing in vldb moshe dubiner bucketing coding and information theory for the statistical nearestneighbor problem ieee transactions on information theory alexandr andoni and ilya razenshteyn tight lower bounds for hashing available at http anshumali shrivastava and ping li fast near neighbor search in binary data in machine learning and knowledge discovery in databases pages springer anshumali shrivastava and ping li densifying one permutation hashing via rotation for fast near neighbor search in icml qin lv william josephson zhe wang moses charikar and kai li lsh efficient indexing for similarity search in vldb kengo terasawa and yuzuru tanaka spherical lsh for approximate nearest neighbor search on unit hypersphere in algorithms and data structures pages springer kave eshghi and shyamsundar rajaram locality sensitive hash functions based on concomitant rank order statistics in kdd nir ailon and bernard chazelle the fast transform and approximate nearest neighbors siam journal on computing kilian weinberger anirban dasgupta john langford alexander smola and josh attenberg feature hashing for large scale multitask learning in icml rajeev motwani assaf naor and rina panigrahy lower bounds on locality sensitive hashing siam journal on discrete mathematics ryan donnell yi wu and yuan zhou optimal lower bounds for hashing except when is tiny acm transactions on computation theory anirban dasgupta ravi kumar and tamás sarlós fast hashing in kdd nir ailon and holger rauhut fast and transforms discrete computational geometry uriel feige and gideon schechtman on the optimality of the random hyperplane rounding technique for max cut random structures and algorithms malcolm slaney yury lifshits and junfeng he optimal parameters for hashing proceedings of the ieee moshe lichman uci machine learning repository persi diaconis and david freedman dozen de results in search of theory annales de institut henri poincaré probabilités et statistiques 
learning to linearize under uncertainty ross michael yann dept of computer science courant institute of mathematical science new york ny facebook ai research new york ny goroshin mathieu yann abstract training deep feature hierarchies to solve supervised learning tasks has achieved state of the art performance on many problems in computer vision however principled way in which to train such hierarchies in the unsupervised setting has remained elusive in this work we suggest new architecture and loss for training deep feature hierarchies that linearize the transformations observed in unlabeled natural video sequences this is done by training generative model to predict video frames we also address the problem of inherent uncertainty in prediction by introducing latent variables that are functions of the input into the network architecture introduction the recent success of deep feature learning in the supervised setting has inspired renewed interest in feature learning in weakly supervised and unsupervised settings recent findings in computer vision problems have shown that the representations learned for one task can be readily transferred to others which naturally leads to the question does there exist generically useful feature representation and if so what principles can be exploited to learn it recently there has been flurry of work on learning features from video using varying degrees of supervision temporal coherence in video can be considered as form of weak supervision that can be exploited for feature learning more precisely if we assume that data occupies some low dimensional manifold in high dimensional space then videos can be considered as trajectories on this manifold parametrized by time many unsupervised learning algorithms can be viewed as various parameterizations implicit or explicit of the data manifold for instance sparse coding implicitly assumes locally linear model of the data manifold in this work we assume that deep convolutional networks are good parametric models for natural data parameterizations of the data manifold can be learned by training these networks to linearize short temporal trajectories thereby implicitly learning local parametrization in this work we cast the linearization objective as frame prediction problem as in many other unsupervised learning schemes this necessitates generative model several recent works have also trained deep networks for the task of frame prediction however unlike other works that focus on prediction as final objective in this work prediction is regarded as proxy for learning representations we introduce loss and architecture that addresses two main problems in frame prediction minimizing error between the predicted and actual frame leads to unrealistically blurry predictions which potentially compromises the learned representation and copying the most recent frame to the input seems to be trap of the objective function which results in the network learning little more than the identity function we argue that the source of blur partially stems from the inherent unpredictability of natural data in cases where multiple valid predictions are plausible deterministic network will learn to average between all the plausible predictions to address the first problem we introduce set of latent variables that are equal contribution functions of the input which are used to explain the unpredictable aspects of natural videos the second problem is addressed by introducing an architecture that explicitly formulates the prediction in the 
linearized feature space the paper is organized as follows section reviews relevant prior work section introduces the basic architecture used for learning linearized representations subsection introduces phasepooling operator that facilitates linearization by inducing topology on the feature space subsection introduces latent variable formulation as means of learning to linearize under uncertainty section presents experimental results on relatively simple datasets to illustrate the main ideas of our work finally section offers directions for future research prior work this work was heavily inspired by the philosophy revived by hinton et al which introduced capsule units in that work an equivariant representation is learned by the capsules when the true latent states were provided to the network as implicit targets our work allows us to move to more unsupervised setting in which the true latent states are not only unknown but represent completely arbitrary qualities this was made possible with two assumptions that temporally adjacent samples also correspond to neighbors in the latent space predictions of future samples can be formulated as linear operations in the latent space in theory the representation learned by our method is very similar to the representation learned by the capsules this representation has locally stable what component and locally linear or equivariant where component theoretical properties of linearizing features were studied in several recent works propose schemes for learning representations from video which use varying degrees of supervision for instance assumes that the network from is already available and training consists of learning to mimic this network similarly learns representation by receiving supervision from tracker this work is more closely related to fully unsupervised approaches for learning representations from video such as it is most related to which also trains decoder to explicitly predict video frames our proposed architecture was inspired by those presented in in and learning linearized representations our goal is to obtain representation of each input sequence that varies linearly in time by transforming each frame individually furthermore we assume that this transformation can be learned by deep feed forward network referred to as the encoder denoted by the function fw denote the code for frame xt by fw xt assume that the dataset is parameterized by temporal index so it is described by the sequence xt with corresponding feature sequence produced by the encoder thus our goal is to train fw to produce sequence whose average local curvature is smaller than sequence scale invariant local measure of curvature is the cosine distance between the two vectors formed by three temporally adjacent samples however minimizing the curvature directly can result in the trivial solutions zt ct and zt these solutions are trivial because they are virtually uninformative with respect to the input xt and therefore can not be meaningful representation of the input to avoid this solution we also minimize the prediction error in the input space the predicted frame is generated in two steps linearly extrapolation in code space to obtain predicted code followed by ii decoding with gw which generates the predicted frame gw for example if the predicted code corresponds to constant speed linear extrapolation of and the prediction error is minimized by jointly training the encoder and decoder networks note that minimizing prediction error alone will not necessarily lead 
to low curvature trajectories in since the decoder is unconstrained the decoder may learn many to one mapping which maps different codes to the same output image without forcing them to be equal to prevent this we add an explicit curvature penalty to the loss corresponding to the cosine distance between and the complete loss to minimize is kgw kz kkz video time enc enc enc pool pool pool prediction figure video generated by translating gaussian intensity bump over three pixel array the corresponding manifold parametrized by time in three dimensional space unpool dec cosine distance figure the basic linear prediction architecture with shared weight encoders this feature learning scheme can be implemented using an network with shared encoder weights phase pooling thus far we have assumed generic architecture for fw and gw we now consider custom architectures and operators that are particularly suitable for the task of linearization to motivate the definition of these operators consider video generated by translating gaussian intensity bump over three pixel region at constant speed the video corresponds to one dimensional manifold in three dimensional space curve parameterized by time see figure next assume that some convolutional feature detector fires only when centered on the bump applying the operator to the activations of the detector in this region signifies the presence of the feature somewhere in this region the what applying the argmax operator over the region returns the position the where with respect to some local coordinate frame defined over the pooling region this position variable varies linearly as the bump translates and thus parameterizes the curve in figure these two channels namely the what and the where can also be regarded as generalized magnitude and phase corresponding to factorized representation the magnitude represents the active set of parameters while the phase represents the set of local coordinates in this active set we refer to the operator that outputs both the max and argmax channels as the operator in this example spatial pooling was used to linearize the translation of fixed feature more generally the operator can locally linearize arbitrary transformations if pooling is performed not only spatially but also across features in some topology in order to be able to through we define soft version of the max and argmax operators within each pool group for simplicity assume that the encoder has fully convolutional architecture which outputs set of feature maps possibly of different resolution than the input although we can define an arbitrary topology in feature space for now assume that we have the familiar spatial feature map representation where each activation is function where and correspond to the spatial location and is the feature map index assuming that the feature activations are positive we define our soft operator for the th neighborhood nk as mk eβz max βz nk nk nk where note that the fraction in the sum is softmax operation parametrized by which is positive and sums to one in each pooling region the larger the the closer it is to unimodal distribution and therefore the better mk approximates the max operation on the other hand if equation reduces to finally note that mk is simply the expected value of restricted to nk under the softmax distribution assuming that the activation pattern within each neighborhood is approximately unimodal we can define soft versions of the argmax operator the vector pk approximates the local coordinates in the feature 
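A minimal sketch of the prediction-plus-curvature objective introduced above, for the constant-speed case. The relative weighting and the exact way the curvature penalty is written (here, one minus the cosine of the angle between consecutive code-space steps) are illustrative assumptions, and the toy linear encoder and decoder exist only to make the sketch self-contained.

```python
import numpy as np

def cosine_distance(u, v, eps=1e-8):
    """1 - cos(angle) between u and v; zero when the two code-space steps
    are colinear, i.e. when the code trajectory is locally linear."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

def linearization_loss(f, g, x1, x2, x3, lam=1.0):
    """Prediction error plus curvature penalty for a frame triplet.

    f: encoder, g: decoder; x1, x2, x3 are consecutive frames (flattened).
    The predicted code is the constant-speed extrapolation z3_hat = 2*z2 - z1;
    the loss is the reconstruction error of the third frame plus lam times
    the cosine distance between consecutive code steps.
    """
    z1, z2, z3 = f(x1), f(x2), f(x3)
    z3_hat = 2.0 * z2 - z1
    pred_err = np.sum((g(z3_hat) - x3) ** 2)
    curvature = cosine_distance(z2 - z1, z3 - z2)
    return pred_err + lam * curvature

# Toy linear encoder/decoder just to make the sketch runnable.
rng = np.random.default_rng(0)
E = rng.standard_normal((8, 64)) / 8.0
f = lambda x: E @ x
g = lambda z: E.T @ z
x1, x2, x3 = (rng.standard_normal(64) for _ in range(3))
print(linearization_loss(f, g, x1, x2, x3))
```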
topology at which the max activation value occurred assuming that pooling is done volumetrically that is spatially and across features pk will have three components in general the number of components in pk is equal to the dimension of the topology of our feature space induced by the pooling neighborhood the dimensionality of pk can also be interpreted as the maximal intrinsic dimension of the data if we define local standard coordinate system in each pooling volume to be bounded between and the soft operator is defined by the sum eβz pk arg max βz nk nk nk where the indices take values from to in equal increments over the pooling region again we observe that pk is simply the expected value of under the softmax distribution the operator acts on the output of the encoder therefore it can simply be considered as the last encoding step correspondingly we define an operation as the first step of the decoder which produces reconstructed activation maps by placing the magnitudes at appropriate locations given by the phases because the operator produces both magnitude and phase signals for each of the two input frames it remains to define the predicted magnitude and phase of the third frame in general this linear extrapolation operator can be learned however this operator allows us to place implicit priors on the magnitude and phase channels the predicted magnitude and phase are defined as follows mt predicting the magnitude as the mean of the past imposes an implicit stability prior on the temporal sequence corresponding to the channel should be stable between adjacent frames the linear extrapolation of the phase variable imposes an implicit linear prior on thus such an architecture produces factorized representation composed of locally stable and locally linearly varying when is used curvature regularization is only applied to the variables the full prediction architecture is shown in figure addressing uncertainty natural video can be inherently unpredictable objects enter and leave the field of view and out of plane rotations can also introduce previously invisible content in this case the prediction should correspond to the most likely outcome that can be learned by training on similar video however if multiple outcomes are present in the training set then minimizing the distance to these multiple outcomes induces the network to predict the average outcome in practice this phenomena results in blurry predictions and may lead the encoder to learn less discriminative representation of the input to address this inherent unpredictability we introduce latent variables to the prediction architecture that are not deterministic functions of the input these variables can be adjusted using the target in order to minimize the prediction error the interpretation of these variables is that they explain all aspects of the prediction that are not captured by the encoder for example can be used to switch between multiple equally likely predictions it is important to control the capacity of to prevent it from explaining the entire prediction on its own therefore is restricted to act only as correction term in the code space output by the encoder to further restrict the capacity of we enforce that dim dim more specifically the code is defined as where is trainable matrix of size dim dim and denotes the product during training is inferred using gradient descent for each training sample by minimizing the loss in equation the corresponding adjusted is then used for through and at test time can be selected via 
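The soft max / soft argmax ("what"/"where") pooling and the magnitude/phase prediction described above, sketched for a single one-dimensional pooling group; the local coordinates are taken to span [-1, 1] as in the text.

```python
import numpy as np

def soft_max_pool(z, beta=5.0):
    """Soft 'what' and 'where' of one pooling neighborhood (1-D for clarity).

    z: non-negative activations within a single pool group.
    Returns (m, p): m approximates max(z) and p approximates the position of
    the argmax in local coordinates spanning [-1, 1]. A larger beta makes the
    softmax distribution more peaked, so both approximations get sharper.
    """
    w = np.exp(beta * z)
    w = w / w.sum()
    coords = np.linspace(-1.0, 1.0, len(z))
    m = float(w @ z)          # soft max   ("what" / magnitude channel)
    p = float(w @ coords)     # soft argmax ("where" / phase channel)
    return m, p

def predict_third(m1, p1, m2, p2):
    """Prediction used above: the magnitude is assumed stable (average of
    the past two frames), the phase is linearly extrapolated."""
    return 0.5 * (m1 + m2), 2.0 * p2 - p1

# Toy example: a bump translating by one position per frame.
z1 = np.array([0.0, 1.0, 0.2, 0.0, 0.0])
z2 = np.array([0.0, 0.2, 1.0, 0.2, 0.0])
m1, p1 = soft_max_pool(z1)
m2, p2 = soft_max_pool(z2)
print(predict_third(m1, p1, m2, p2))
```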
sampling assuming its distribution on the training set has been previously estimated min kgw kz kkz the following algorithm details how the above loss is minimized using stochastic gradient descent algorithm minibatch stochastic gradient descent training for prediction with uncertainty the number of descent steps is treated as for number of training epochs do sample of temporal triplets xt set forward propagate xt through the network and obtain the codes and the prediction for to do compute the prediction error back propagate the error through the decoder to compute the gradient update δi δi compute compute gw end for back propagate the full loss from equation using δk and update the weight matrices and end for when phase pooling is used we allow to only affect the phase variables in equation this further encourages the magnitude to be stable and places all the uncertainty in the phase experiments the following experiments evaluate the proposed feature learning architecture and loss in the first set of experiments we train shallow architecture on natural data and visualize the learned features in order gain basic intuition in the second set of experiments we train deep architecture on simulated movies generated from the norb dataset by generating frames from interpolated and extrapolated points in code space we show that linearized representation of the input is learned finally we explore the role of uncertainty by training on only partially predictable sequences we show that our latent variable formulation can account for this uncertainty enabling the encoder to learn linearized representation even in this setting shallow architecture trained on natural data to gain an intuition for the features learned by architecture let us consider an encoder architecture comprised of the following stages convolutional filter bank rectifying nonlinearity and the decoder architecture is comprised of an stage followed by convolutional filter bank this architecture was trained on simulated movie shallow architecture shallow architecture deep architecture deep architecture deep architecture encoder phase pool phase pool stride prediction average mag linear extrap phase average mag linear extrap phase none linear extrapolation reshape phase pool average mag linear extrap phase decoder conv conv reshape spatialpadding spatialpadding conv reshape spatialpadding spatialpadding conv unpool reshape spatialpadding spatialpadding conv table summary of architectures frames taken from youtube videos each frame triplet is generated by transforming still frames with sequence of three rigid transformations translation scale rotation more specifically let be random rigid transformation parameterized by and let denote still image reshaped into column vector the generated triplet of frames is given by aτ aτ aτ two variants of this architecture were trained their full architecture is summarized in the first two lines of table in shallow architecture phase pooling is performed spatially in groups of and across features in topology consisting of groups of four each of the produce code consisting of scalar and pf px py corresponding to two spatial and one feature dimensions thus the encoder architecture produces code of size for each frame the corresponding filters whose activations were pooled together are laid out horizontally in groups of four in figure note that each group learns to exhibit strong ordering corresponding to the linearized variable pf because global rigid transformations can be locally well approximated by 
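A sketch of the inner-loop inference of the latent correction delta from the algorithm above, under the simplifying assumption of a linear decoder so that the gradient is available in closed form; in the paper the same minimization is performed by backpropagation through the deep decoder, and the step size and number of inner steps here are illustrative.

```python
import numpy as np

def infer_delta(G, W, z_hat, x3, steps=20, lr=0.05):
    """Per-sample inference of delta by gradient descent (toy version).

    The adjusted code is z_hat + W @ delta and the decoder is assumed linear,
    g(z) = G @ z, so the gradient of ||G(z_hat + W delta) - x3||^2 with
    respect to delta is 2 W^T G^T (G(z_hat + W delta) - x3). The low
    dimension of delta restricts it to a small correction of the prediction.
    """
    delta = np.zeros(W.shape[1])
    for _ in range(steps):
        residual = G @ (z_hat + W @ delta) - x3
        delta -= lr * 2.0 * W.T @ (G.T @ residual)
    return delta

rng = np.random.default_rng(0)
code_dim, img_dim, delta_dim = 8, 64, 2
G = rng.standard_normal((img_dim, code_dim)) / 8.0
W = rng.standard_normal((code_dim, delta_dim)) / np.sqrt(code_dim)
z_hat = rng.standard_normal(code_dim)        # e.g. the extrapolated code 2*z2 - z1
x3 = rng.standard_normal(img_dim)            # the target frame
print(infer_delta(G, W, z_hat, x3))
```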
translations the features learn to parameterize local translations in effect the network learns to linearize the input by tracking common features in the video sequence unlike the spatial phase variables pf can linearize translations next the architecture described in column of table was trained on natural movie patches with the natural motion present in the real videos the architecture differs in only in that pooling across features is done with overlap groups of stride of the resulting decoder filters are displayed in figure note that pooling with overlap introduces smoother transitions between the pool groups although some groups still capture translations more complex transformations are learned from natural movies deep architecture trained on norb in the next set of experiments we trained deep feature hierarchies that have the capacity to linearize richer class of transformations to evaluate the properties of the learned features in controlled setting the networks were trained on simulated videos generated using the norb dataset rescaled to to reduce training time the simulated videos are generated by tracing constant speed trajectories with random starting points in the latent space of pitch and azimuth rotations in other words the models are trained on triplets of frames ordered by their rotation angles as before presented with two frames as input the models are trained to predict the third frame recall that prediction is merely proxy for learning linearized feature representations one way to evaluate the linearization properties of the learned features is to linearly interpolate or extrapolate shallow architecture shallow architecture figure decoder filters learned by shallow architectures figure test samples input to the network linear interpolation in code space learned by our network new codes and visualize the corresponding images via forward propagation through the decoder this simultaneously tests the encoder capability to linearize the input and the decoder generative capability to synthesize images from the linearized codes in order to perform these tests we must have an explicit code representation which is not always available for instance consider simple scheme in which generic deep network is trained to predict the third frame from the concatenated input of two previous frames such network does not even provide an explicit feature representation for evaluation simple baseline architecture that affords this type of evaluation is siamese encoder followed by decoder this exactly corresponds to our proposed architecture with the linear prediction layer removed such an architecture is equivalent to learning the weights of the linear prediction layer of the model shown in figure in the following experiment we evaluate the effects of fixing learning the linear prediction operator including the phase pooling operation including explicit curvature regularization second term in equation let us first consider deep architecture summarized in table in this architecture siamese encoder produces code of size for each frame the codes corresponding to the two frames are concatenated together and propagated to the decoder in this architecture the first linear layer of the decoder can be interpreted as learned linear prediction layer figure shows three frames from the test set corresponding to temporal indices and respectively figure shows the generated frames corresponding to interpolated codes at temporal indices the images were generated by propagating the corresponding codes through 
the decoder codes corresponding to temporal indices were obtained by linearly interpolating in code space deep architecture differs from deep architecture in that it generates the predicted code via fixed linear extrapolation in code space the extrapolated code is then fed to the decoder that generates the predicted image note that the fully connected stage of the decoder has half as many free parameters compared to the previous architecture this architecture is further restricted by propagating only the predicted code to the decoder for instance unlike in deep architecture the decoder can not copy any of the input frames to the output the generated images corresponding to this architecture are shown in figure these images more closely resemble images from the dataset furthermore deep architecture achieves lower prediction error than deep architecture figure linear interpolation in code space learned by our model no no curvature regularization with phase pooling and curvature regularization interpolation results obtained by minimizing equation and equation trained with only partially predictable simulated video finally deep architecture uses in the encoder and in the decoder this architecture makes use of in feature space arranged on an grid the pooling is done in single group over all the features producing feature vector of dimension compared to in previous architectures nevertheless this architecture achieves the best overall prediction error and generates the most visually realistic images figure in this subsection we compare the representation learned by minimizing the loss in equation to equation uncertainty is simulated by generating triplet sequences where the third frame is skipped randomly with equal probability determined by bernoulli variable for example the sequences corresponding to models with rotation angles and are equally likely minimizing equation with deep architecture results in the images displayed in figure the interpolations are blurred due to the averaging effect discussed in subsection on the other hand minimizing equation figure partially recovers the sharpness of figure for this experiment we used real valued moreover training linear predictor to infer binary variable from after training results in test set accuracy this suggests that does indeed capture the uncertainty in the data discussion in this work we have proposed new loss and architecture for learning locally linearized features from video we have also proposed method that introduces latent variables that are nondeterministic functions of the input for coping with inherent uncertainty in video in future work we will suggest methods for stacking these architectures that will linearize more complex features over longer temporal scales acknowledgments we thank jonathan tompson joan bruna and david eigen for many insightful discussions we also gratefully acknowledge nvidia corporation for the donation of tesla gpu used for this research references yoshua bengio aaron courville and pascal vincent representation learning review and new perspectives technical report university of montreal charles cadieu and bruno olshausen learning representations of form and motion from natural movies neural computation taco cohen and max welling transformation properties of learned visual representations arxiv preprint ross goroshin joan bruna jonathan tompson david eigen and yann lecun unsupervised learning of spatiotemporally coherent metrics arxiv preprint geoffrey hinton alex krizhevsky and sida wang transforming in 
artificial neural networks and machine pages springer christoph kayser wolfgang einhauser olaf dummer peter konig and konrad kding extracting slow subspaces from natural videos leads to complex cells in icann alex krizhevsky ilya sutskever and geoffrey hinton imagenet classification with deep convolutional neural networks in nips volume page hossein mobahi ronana collobert and jason weston deep learning from temporal coherence in video in icml bruno olshausen and david field sparse coding of sensory inputs current opinion in neurobiology maxime oquab leon bottou ivan laptev and josef sivic learning and transferring midlevel image representations using convolutional neural networks in computer vision and pattern recognition cvpr ieee conference on pages ieee ranzato fu jie huang boureau and yann lecun unsupervised learning of invariant feature hierarchies with applications to object recognition in computer vision and pattern recognition cvpr ieee conference on pages ieee marcaurelio ranzato arthur szlam joan bruna michael mathieu ronan collobert and sumit chopra video language modeling baseline for generative models of natural videos arxiv preprint carl vondrick hamed pirsiavash and antonio torralba anticipating the future by watching unlabeled video arxiv preprint xiaolong wang and abhinav gupta unsupervised learning of visual representations using videos arxiv preprint laurenz wiskott and terrence sejnowski slow feature analysis unsupervised learning of invariances neural computation matthew zeiler dilip krishnan graham taylor and robert fergus deconvolutional networks in computer vision and pattern recognition cvpr ieee conference on pages ieee 
analysis of projected langevin monte carlo ronen eldan weizmann institute roneneldan bubeck microsoft research sebubeck joseph lehec lehec abstract we analyze the projected langevin monte carlo lmc algorithm close cousin of projected stochastic gradient descent sgd we show that lmc allows to sample in polynomial time from posterior distribution restricted to convex body and with concave this gives the first markov chain to sample from distribution with oracle as the existing chains with provable guarantees lattice walk ball walk and require zerothorder oracle our proof uses elementary concepts from stochastic calculus which could be useful more generally to understand sgd and its variants introduction fundamental primitive in bayesian learning is the ability to sample from the posterior distribution similarly to the situation in optimization convexity is key property to obtain algorithms with provable guarantees for this task indeed several markov chain monte carlo methods have been analyzed for the case where the posterior distribution is supported on convex set and the negative is convex this is usually referred to as the problem of sampling from distribution in this paper we propose and analyze new markov chain for this problem which could have several advantages over existing chains for machine learning applications we describe formally our contribution in section then in section we explain how this contribution relates to various line of work in different fields such as theoretical computer science statistics stochastic approximation and machine learning main result let rn be convex set such that contains euclidean ball of radius and is contained in euclidean ball of radius denote pk the euclidean projection on pk where denotes the euclidean norm in rn and kk the gauge of defined by kxkk inf tk rn let be and convex function that is is differentiable and satisfies and we are interested in the problem of sampling from the probability measure on rn whose density with respect to the lebesgue measure is given by dµ exp where exp dy dx we denote eµ and kθkk where is uniform on the sphere rn in this paper we study the following markov chain which depends on parameter and where is an sequence of standard gaussian random variables in rn and pk ηξk we call the chain projected langevin monte carlo lmc recall that the total variation distance between two measures is defined as tv supa where the supremum is over all measurable sets with slight abuse of notation we sometimes write tv where is random variable distributed according to respectively means that there exists such that the notation vn vn cun logc un respectively our main result shows that for an appropriately chosen and number of iterations one has convergence in total variation distance of the iterates to the target distribution and theorem let one has tv provided that rl rl nm max βm note that by viewing as numerical constants using and assuming and the bound reads observe also that if is constant that is is the uniform measure on then and one can show that which yields the bound context and related works there is long line of works in theoretical computer science proving results similar to theorem starting with the breakthough result of dyer et al who showed that the lattice walk mixes in steps the current record for the mixing time is obtained by and vempala for the walk these chains as well as other popular chains who show bound of such as the ball walk or the dikin walk see kannan and narayanan and references therein all require 
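A minimal sketch of the projected LMC iteration, assuming the common step-size/noise convention x <- P_K(x - eta * grad f(x) + sqrt(2*eta) * xi); the paper fixes its own constants and the admissible range of eta, and the Euclidean ball below merely stands in for a general convex body K.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto a centered ball (a stand-in for P_K)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

def projected_lmc(grad_f, project, x0, eta, n_steps, rng):
    """Projected Langevin Monte Carlo sketch.

    Update: x <- P_K(x - eta * grad_f(x) + sqrt(2*eta) * xi), xi ~ N(0, I).
    The sqrt(2*eta) noise scaling is a common convention assumed here.
    """
    x = x0.copy()
    for _ in range(n_steps):
        xi = rng.standard_normal(x.shape)
        x = project(x - eta * grad_f(x) + np.sqrt(2.0 * eta) * xi)
    return x

# Example: uniform sampling on the unit ball (constant potential, grad_f = 0).
rng = np.random.default_rng(0)
n = 10
samples = [projected_lmc(lambda x: np.zeros_like(x), project_ball,
                         np.zeros(n), eta=1e-3, n_steps=2000, rng=rng)
           for _ in range(5)]
print([float(np.linalg.norm(s)) for s in samples])
```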
oracle for the potential that is given one can calculate the value on the other hand our proposed chain works with oracle that is given one can calculate the value of the difference between oracle and firstorder oracle has been extensively studied in the optimization literature nemirovski and yudin but it has been largely ignored in the literature on sampling algorithms we also note that and lmc are the only chains which are rapidly mixing from any starting point see and vempala though they have this property for seemingly very different reasons when initialized in corner of the convex body might take long time to take step but once it moves it escapes very far while chain such as the ball walk would only do small step on the other hand lmc keeps moving at every step even when initialized in corner thanks for the projection part of our main motivation to study the chain stems from its connection with the ubiquitous stochastic gradient descent sgd algorithm in general this algorithm takes the form pk xk xk εk where is centered sequence standard results in approximation theory such as robbins and monro show that if the variance of the noise var is of smaller order than the then the iterates xk converge to the minimum of on for decreasing sufficiently fast as function of the number of iterations for the specific noise sequence that we study in the variance is exactly equal to the which is why the chain deviates from its standard and behavior we also note that other regimes where sgd does not converge to the minimum of have been studied in the optimization literature such as the constant case investigated in pflug bach and moulines the chain is also closely related to line of works in bayesian statistics on langevin monte carlo algorithms starting essentially with tweedie and roberts the focus there is on the unconstrained case that is rn in this simpler situation variant of theorem was proven in the recent paper dalalyan the latter result is the starting point of our work straightforward way to extend the analysis of dalalyan to the constrained case is to run the unconstrained chain with an additional potential that diverges quickly as the distance from to increases however it seems much more natural to study directly the chain unfortunately the techniques used in dalalyan can not deal with the singularities in the diffusion process which are introduced by the projection as we explain in section our main contribution is to develop the appropriate machinery to study in the machine learning literature it was recently observed that langevin monte carlo algorithms are particularly for applications because of the close connection to sgd for instance welling and teh suggest to use to compute approximate gradients instead of exact gradients in and they call the resulting algorithm sgld stochastic gradient langevin dynamics it is conceivable that the techniques developed in this paper could be used to analyze sgld and its refinements introduced in ahn et al we leave this as an open problem for future work another interesting direction for future work is to improve the polynomial dependency on the dimension and the inverse accuracy in theorem our main goal here was to provide the simplest analysis contribution and paper organization as we pointed out above dalalyan proves the equivalent of theorem in the unconstrained case his elegant approach is based on viewing lmc as discretization of the diffusion process dxt dwt xt where wt is brownian motion the analysis then proceeds in two steps by deriving 
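For readability, the two diffusions referred to above can be written in display form; the drift and noise normalizations below are the standard ones and may differ from the paper's convention by constant factors.

```latex
% Unconstrained Langevin diffusion, discretized by LMC:
\[
  dX_t = -\nabla f(X_t)\, dt + \sqrt{2}\, dW_t .
\]
% Reflected Brownian motion with drift, discretized by projected LMC:
\[
  dX_t = -\nabla f(X_t)\, dt + \sqrt{2}\, dW_t - \nu_t\, L(dt),
\]
% where L is a measure supported on the times t at which X_t lies on the
% boundary of K (the local time), \nu_t is an outer unit normal of K at X_t,
% and the last term is the Tanaka drift.
```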
first the mixing time of the diffusion process and then showing that the discretized process is close to its continuous version in dalalyan the first step is particularly transparent as he assumes convexity for the potential which in turns directly gives mixing time of order the second step is also simple once one realizes that lmc without projection can be viewed as the diffusion process dx dwt xηb ηt using pinsker inequality and girsanov formula it is then short calculation to show that the total variation distance between and xt is small the constrained case presents several challenges arising from the reflection of the diffusion process on the boundary of and from the lack of curvature in the potential indeed the constant potential case is particularly important for us as it corresponds to being the uniform distribution on rather than simple brownian motion with drift lmc with projection can be viewed as the discretization of reflected brownian motion with drift which is process of the form dxt dwt xt dt νt dt where xt is measure supported on xt and νt is an outer normal unit vector of at xt the term νt dt is referred to as the tanaka drift following dalalyan the analysis is again decomposed in two steps we study the mixing time of the continuous process via simple coupling argument which crucially uses the convexity of and of the potential the main difficulty is in showing that the discretized process is close to the continuous version xt as the tanaka drift prevents us from straightforward application of girsanov formula our approach around this issue is to first use geometric argument to prove that the two processes are close in wasserstein distance and then to show that in fact for reflected brownian motion with drift one can deduce total variation bound from wasserstein bound in this extended abstract we focus on the special case where is constant function that is is uniform on the convex body the generalization to an arbitrary smooth potential can be found in the supplementary material the rest of the paper is organized as follows section contains the main tehcnical arguments we first remind the reader of tanaka construction tanaka of reflected brownian motion in section we present our geometric argument to bound the wasserstein distance between xt and in section and we use our coupling argument to bound the mixing time of xt in section the derivation of total variation bound from the wasserstein bound is discussed in section finally we conclude the paper in section with some preliminary experimental comparison between lmc and the constant potential case in this section we derive the main arguments to prove theorem when is constant function that is for point we say that is an outer unit normal vector at if and hx νi for we say that is an outer unit normal at we define the support function hk of by hk sup hx yi rn note that hk is also the gauge function of the polar body of the skorokhod problem let and rn be piecewise continuous path with we say that rn and rn solve the skorokhod problem for if one has and furthermore is of the form νs ds where νs is an outer unit normal at and is measure on supported on the set the path is called the reflection of at the boundary of and the measure is called the local time of at the boundary of skorokhod showed the existence of such pair in dimension in skorokhod and tanaka extended this result to convex sets in higher dimensions in tanaka furthermore tanaka also showed that the solution is unique and if is continuous then so is and in 
particular the reflected brownian motion in denoted xt is defined as the reflection of the standard brownian motion wt at the boundary of existence follows by continuity of wt observe that by formula for any smooth function on rn xt xs dws xs ds xs νs ds to get sense of what solution typically looks like let us work out the case where is piecewise constant this will also be useful to realize that lmc can be viewed as the solution to skorokhod problem for sequence gn rn and for we consider the path gk kη define xk inductively by and pk xk gk it is easy to verify that the solution to the skorokhod problem for is given by xηb ηt and rt νs ds where the measure is defined by denoting δs for dirac at gk pk xk gk and for kη νs xk gk pk xk gk gk pk xk gk discretization of reflected brownian motion given the discussion above it is clear that when is constant function the chain can be viewed as the reflection of discretized brownian motion wηb ηt at the boundary of more precisely the value of kη coincides with the value of as defined by it is rather clear that the discretized brownian motion is close to the path wt and we would like to carry this to the reflected paths and xt the following lemma extracted from tanaka allows to do exactly that lemma let and be piecewise continuous path and assume that and solve the skorokhod problems for and respectively then for all time we have hw ds ds applying the above lemma to the processes wt and at time yields note that wt hwt νt il dt hwt il dt we claim that the second integral is equal to indeed since the discretized process is constant on the intervals kη the local time is positive combination of dirac point masses at on the other hand wkη kη for all integer hence the claim therefore hwt νt dt using the inequality hx yi kxkk hk we get sup kwt kk hk νt dt taking the square root expectation and using we get sup kwt kk hk νt dt the next two lemmas deal with each term in the right hand side of the above equation and they will show that there exists universal constant such that log we discuss why the above bound implies total variation bound in section lemma we have for all nt hk νs ds proof by formula dwt dt νt dt now observe that by definition of the reflection if is in the support of then hxt νt hx νt in other words hxt νt hk νt therefore hk νs ds hxs dws nt the first term of the side is martingale so using that and taking expectation we get the result lemma there exists universal constant such that sup kwt kk nη log proof note that sup kwt kk max yi where yi kwt wiη kk sup iη observe that the variables yi are identically distributed let and write kp max yi we claim that kp for some constant and for all taking this for granted and choosing log in the previous inequality yields the result recall that so it is enough to prove observe that since wt is martingale the process mt kwt kk is by doob maximal inequality kp sup mt kp kp for every letting γn be the standard gaussian measure on rn and using khintchin inequality we get kmη kp kxkk γn dx pη kxkk γn dx rn rn lastly integrating in polar coordinate it is easily seen that kxkk γn dx rn mixing time estimate for the reflected brownian motion given probability measure supported on we let νpt be the law of xt when has law the following lemma is the key result to estimate the mixing time of the process xt lemma let tv δx pt pt the above result clearly implies that for probability measure on tv pt νpt dx since the uniform measure on is stationary for reflected brownian motion we obtain tv pt in other words starting 
from the mixing time of xt is of order we now turn to the proof of the above lemma proof the proof is based on coupling argument let wt be brownian motion starting from and let xt be reflected brownian motion starting from dxt dwt νt dt where νt and satisfy the appropriate conditions we construct reflected brownian motion starting from as follows let inf xt and for let st be the orthogonal reflection with respect to the hyperplane xt then up to time the process is defined by dt st dwt where is measure supported on and is an outer unit normal at for all such after time we just set xt since st is an orthogonal map is brownian motion and thus is reflected brownian motion starting from therefore tv δx pt pt xt observe that on dwt st dwt dwt ivt where vt xt so xt dwt ivt νt dt dt dbt vt νt dt dt where hvs dws bt on observe that bt is brownian motion formula then gives dg xt xt vt dbt xt νt dt xt ti dt xt vt vt dt for every smooth function on rn now if then xt vt so xt vt xt νt on the support of and xt on the support of moreover xt xt where denotes the orthogonal projection on in particular xt yt vt we obtain on therefore where is the first time the brownian motion bt hits the value now by the reflection principle bt from wasserstein distance to total variation to conclude it remains to derive total variation bound between xt and using the details of this step are deferred to the supplementary material where we consider the case of general logconcave distribution the intuition goes as follows the processes xt and both evolve according to brownian motion until the first time that one process undergoes reflection but if is large enough and is small enough then one can easily get from and the fact that the uniform measure does not put too much mass close to the boundary that xt and are much closer to each other than they are to the boundary of this implies that one can couple them just as in section so that they meet before one of them hits the boundary experiments comparing different markov chain monte carlo algorithms is challenging problem in and of itself here we choose the following simple comparison procedure based on the volume algorithm developed in cousins and vempala this algorithm whose objective is to compute the volume of given convex set procedes in phases in each phase it estimates the mean of certain function under multivariate gaussian restricted to with unrestricted covariance in cousins and vempala provide matlab implementation of the entire algorithm where in each phase the target mean is estimated by sampling from the truncated gaussian using the chain we implemented the same procedure with lmc instead of and we choose the where is the smoothness parameter of the underlying distribution in particular here the intuition for the choice of the is as follows the scaling in inverse smoothness comes from the optimization literature while the scaling in inverse dimension squared comes from the analysis in the unconstrained case in dalalyan time estimated normalized volume box box lmc box and ball box and ball lmc box box lmc box and ball box and ball lmc we ran the volume algorithm with both and lmc on following set of convex bodies referred to as the box and bn referred to as the box and ball where the computed volume normalized by for the box and by for the box and ball as well as the clock time in seconds to terminate are reported in the figure above from these experiments it seems that lmc and roughly compute similar values for the volume with being slightly more accurate and lmc 
is almost always bit faster these results are encouraging but much more extensive experiments are needed to decide if lmc is indeed competitor to in practice references ahn korattikara and welling bayesian posterior sampling via stochastic gradient fisher scoring in icml bach and moulines smooth stochastic approximation with convergence rate in advances in neural information processing systems nips pages cousins and vempala bypassing kls gaussian cooling and an volume algorithm arxiv preprint dalalyan theoretical guarantees for approximate sampling from smooth and densities arxiv preprint dyer frieze and kannan random algorithm for approximating the volume of convex bodies journal of the acm jacm kannan and narayanan random walks on polytopes and an affine interior point method for linear programming mathematics of operations research and vempala from corner siam and vempala the geometry of logconcave functions and sampling algorithms random structures algorithms nemirovski and yudin problem complexity and method efficiency in optimization wiley interscience pflug stochastic minimization with constant asymptotic laws siam control and optimization robbins and monro stochastic approximation method annals of mathematical statistics skorokhod stochastic equations for diffusion processes in bounded region theory of probability its applications tanaka stochastic differential equations with reflecting boundary condition in convex regions hiroshima mathematical journal tweedie and roberts exponential convergence of langevin distributions and their discrete approximations bernoulli welling and teh bayesian learning via stochastic gradient langevin dynamics in icml 
Deep Visual Analogy-Making. Scott Reed, Yi Zhang, Yuting Zhang, Honglak Lee (University of Michigan, Ann Arbor, MI, USA). Abstract: In addition to identifying the content within a single image, relating images and generating related images are critical tasks for image understanding. Recently, deep convolutional networks have yielded breakthroughs in predicting image labels, annotations and captions, but have only just begun to be used for generating images. In this paper we develop a novel deep network trained to perform visual analogy making, which is the task of transforming a query image according to an example pair of related images. Solving this problem requires both accurately recognizing a visual relationship and generating a transformed query image accordingly. Inspired by recent advances in language modeling, we propose to solve visual analogies by learning to map images to a neural embedding in which analogical reasoning is simple, such as by vector subtraction and addition. In experiments, our model effectively models visual analogies on several datasets: shapes, animated video game sprites, and car models. Introduction: Humans are good at considering hypothetical questions about objects in their environment. What if this chair were rotated clockwise? What if I dyed my hair blue? We can easily imagine roughly how objects would look according to various hypothetical questions. However, current generative models of images struggle to perform this kind of task without encoding significant prior knowledge about the environment and restricting the allowed transformations. Often these visual hypothetical questions can be effectively answered by analogical reasoning: having observed many similar objects rotating, one could learn to mentally rotate new objects; having observed objects with different colors or textures, one could learn to mentally re-color or re-texture new objects. Solving the analogy problem requires the ability to identify relationships among images and to transform query images accordingly. (Figure: the visual analogy-making concept. The model infers the relationship from an example pair and transforms the query accordingly; we learn an encoder function mapping images into a space in which analogies can be performed, and a decoder mapping back to the image space.) In this paper we propose to solve the problem by directly training on visual analogy completion, that is, to generate the transformed image output. Note that we do not make any claim about how humans solve the problem, but we show that in many cases thinking by analogy is enough to solve it, without exhaustively encoding first principles into a complex model. We denote a valid analogy as a 4-tuple a : b :: c : d, often spoken as "a is to b as c is to d". Given such an analogy, there are several questions one might ask. What is the common relationship? Are a and b related in the same way that c and d are related? What is the result of applying the transformation a : b to c? (See Bartha for a deeper philosophical discussion of analogical reasoning.) The first two questions can be viewed as discriminative tasks and could be formulated as classification problems. The third question requires generating an appropriate image to make a valid analogy; since a model with this capability would be of practical interest, we focus on this question. Our proposed approach is to learn a deep encoder function f : R^D -> R^K that maps images to an embedding space suitable for reasoning about analogies, and a deep decoder function g : R^K -> R^D that maps from the embedding back to the image space (see figure). Our encoder function is inspired by GloVe and other word-embedding methods that map inputs to a space supporting analogies by vector addition; in those models, an analogy a : b :: c : d could be completed via d = argmax_{w in V} cos(w, b - a + c), where V is the vocabulary and (a, b, c, d) form an analogy tuple such that a : b :: c : d.
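As a small, self-contained illustration of this kind of embedding-space analogical reasoning (the word-embedding case that inspires the model, not the image model itself), the sketch below completes a : b :: c : ? by taking the nearest neighbor of b - a + c in cosine similarity. The toy vectors and the helper name complete_analogy are ours, chosen only to make the arithmetic visible.

    import numpy as np

    def complete_analogy(embeddings, a, b, c):
        # Return the vocabulary item whose vector is closest in cosine similarity
        # to b - a + c, excluding the three query items themselves.
        target = embeddings[b] - embeddings[a] + embeddings[c]
        target = target / np.linalg.norm(target)
        best, best_sim = None, -np.inf
        for word, vec in embeddings.items():
            if word in (a, b, c):
                continue
            sim = float(np.dot(vec, target) / np.linalg.norm(vec))
            if sim > best_sim:
                best, best_sim = word, sim
        return best

    # Toy embedding in which "king" - "man" + "woman" lands on "queen".
    embeddings = {
        "man":   np.array([1.0, 0.0, 0.0]),
        "woman": np.array([0.0, 1.0, 0.0]),
        "king":  np.array([1.0, 0.0, 1.0]),
        "queen": np.array([0.0, 1.0, 1.0]),
    }
    print(complete_analogy(embeddings, "man", "king", "woman"))  # prints "queen"

In the image setting described next, the argmax over a fixed vocabulary is replaced by applying a learned decoder to f(b) - f(a) + f(c), so that the completion is generated rather than retrieved.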
Other variations on this inference, such as a multiplicative version, have been proposed. The vector b - a represents the transformation, which is applied to a query c by vector addition in the embedding space. In the case of images, we can modify this naturally by replacing the cosine similarity and argmax over the vocabulary with the application of a decoder function mapping from the embedding back to the image space. Clearly, this simple vector addition will not accurately model transformations for representations such as raw pixels, and so in this work we seek to learn a representation in which such simple operations are effective. In our experiments we parametrize the encoder and decoder as deep convolutional neural networks (CNNs), but in principle other methods could be used to model them. In addition to vector addition, we also propose more powerful methods of applying the inferred transformations to new images, such as multiplicative interactions and deep (multi-layer) additive interactions. We first demonstrate visual analogy making on a shapes benchmark, with variation in shape, color, rotation, scaling and position, and evaluate the performance on analogy completion. Second, we generate a dataset of animated video game character sprites using graphics assets from the Liberated Pixel Cup; we demonstrate the capability of our model to transfer animations onto novel characters from a single frame and to perform analogies that traverse the manifold induced by an animation. Third, we apply our model to the task of analogy making on car models and show that our model can perform pose transfer and rotation by analogy. Related work: Hertzmann et al. developed a method for applying new textures to images by analogy; this problem is of practical interest for stylizing animations. Our model can also synthesize new images by analogy to examples, but we study global transformations rather than only changing the texture of the image. A manifold-learning method to traverse image manifolds was developed in the work of Rabaud, Belongie and colleagues; we share a similar motivation when analogical reasoning requires walking along a manifold (e.g., pose analogies), but our model leverages a deep encoder and decoder trainable by backprop. Memisevic and Hinton proposed the factored gated Boltzmann machine for learning to represent transformations between pairs of images. This and related models use tensors or their factorization to infer translations, rotations and other transformations from a pair of images and apply the same transformation to a new image. In this work we share a similar goal, but we directly train a deep predictive model for the analogy task, without requiring multiplicative connections, with the intent to scale to bigger images and learn more subtle relationships involving articulated pose, multiple attributes and rotation. Our work is related to several previous works on disentangling factors of variation, a common application of which is modeling faces. As an early example, bilinear models were proposed to separate style and content factors of variation in face images and speech signals. Tang et al. developed the tensor analyzer, which uses a factor loading tensor to model the interaction among latent factor groups, and it was applied to face modeling. Several variants of the Boltzmann machine were developed to tackle the disentangling problem, featuring multiple groups of hidden units with each group corresponding to a single factor. Disentangling was also considered in the discriminative case, in the contractive discriminative analysis model. Our work differs from these in that we train a deep network for generating images by analogy. Recently,
several methods were proposed to generate images using deep networks dosovitskiy et al used cnn to generate chair images with controllable variation in ance shape and pose contemporary to our work kulkarni et al proposed the deep convolutional inverse graphics network which is form of variational autoencoder vae in which the encoder disentangles factors of variation other works have considered extension of the vae incorporating class labels associated to subset of the training images which can control the label units to perform some visual analogies cohen and welling developed generative model of commutative lie groups image rotation translation that produced invariant and disentangled representations in this work is extended to model the rotation group so zhu et al developed the perceptron for modeling face identity and viewpoint and generated high quality faces subject to view changes cheung et al also use convolutional model and develop regularizer to disentangle latent factors of variation from discriminative target analogies have been in the nlp community turney used analogies from sat tests to evaluate the performance of text analogy detection methods in the visual domain hwang et al developed an embedding model that could both detect analogies and as regularizer improve visual recognition performance our work is related to these but we focus mainly on generating images to complete analogies rather than detecting analogies method suppose that is the set of valid analogy tuples in the training set for example implies the statement is to as is to let the input image space for images be rd and the embedding space be rk typically denote the encoder as rd rk and the decoder as rk rd figure illustrates our architectures for visual analogy making making analogies by vector addition neural word representations have been shown to be capable of by addition and subtraction of word embeddings analogy making capability appears to be an emergent property of these embeddings but for images we propose to directly train on the objective of analogy completion concretely we propose the following objective for analogies ladd this objective has the advantage of being very simple to implement and train in addition with modest number of labeled relations large number of training analogies can be mined making analogy transformations dependent on the query context in some cases purely additive model of applying transformations may not be ideal for example in the case of rotation the manifold of rotated object is circular and after enough rotation has been applied one returns to the original point in the model we can add the same rotation vector multiple times to query but we will never return to the original point except when the decoder could in principle solve this problem by learning to perform modulus operation but this would make the training significantly more difficult instead we propose to parametrize the transformation increment to as function of both and itself in this way analogies can be applied in way we present two variants of our training objective to solve this problem the first which we will call lmul uses multiplicative interactions between and to generate the increment the second which we call ldeep uses multiple fully connected layers to form perceptron mlp without using multiplicative interactions lmul ldeep for lmul is in practice to reduce the number of weights we used factorized tensor parametrized as wijl wif wjf wlf multiplicative interactions for tensor and vectors rk we define the 
tensor multiplication pk as wijl vi wj encoder network increment function add mul deep decoder network add mul deep figure illustration of the network structure for analogy making the top portion shows the encoder transformation module andn decoder the bottom portion illustrates the transformations used for ladd lmul and ldeep the icon in lmul indicates tensor product we share weights with all three encoder networks shown on the top left were similarly used in bilinear models disentangling boltzmann machines and tensor analyzers note that our multiplicative interaction in lmul is different from in that we use the difference between two encoding vectors to infer about the transformation or relation rather than using interaction tensor product for this inference for ldeep rk is an mlp deep algorithm manifold traversal by analogy network without multiplicative interactions with transformation function eq and denotes concatenation of the given images and steps transformation vector with the query embedding optimizing the above objectives teaches the model for to do to predict analogy completions in image space but in order to traverse image manifolds for rexi peated analogies as in algorithm we also want return generated images xi accurate analogy completions in the embedding space to encourage this property we introduce regularizer to make the predicted transformation increment match the difference of encoder embeddings where lp when using ladd when using lmul when using ldeep the overall training objective is weighted combination of analogy prediction and the above regularizer ldeep we set by cross validation on the shapes data and found it worked well for all models on sprites and cars as well all parameters were trained with backpropagation using stochastic gradient descent sgd with disentangled feature representation visual analogies change some aspects of query image and leave others unchanged for example changing the viewpoint but preserving the shape and texture of an object to exploit this fact we incorporate disentangling into our analogy prediction model disentangled representation is simply concatenation of coordinates along each underlying factor of variation if one can reliably infer these disentangled coordinates subset of analogies can be solved simply by swapping sets of coordinates among reference and query embedding and projecting back into the image space however in general disentangling alone can not solve analogies that require traversing the manifold structure of given factor and by itself does not capture image relationships in this section we show how to incorporate disentangled features into our analogy model the disentangling component makes each group of embedding features encode its respective factor of variation and be invariant to the others the analogy component enables the model to traverse the manifold of given factor or subset of factors identity switches pitch elevation identity pitch elevation algorithm disentangling training update the switches determine which units from and are used to reconstruct image given input images and target given switches figure the encoder learns disentangled representation in this case for pitch elevation and identity of car models in the example above switches would be block vector for learning disentangled representation we require tuples pair from which to extract hidden units and third to act as target for prediction as shown in figure we use vector of switch units that decides which elements from and which from will be 
used to form the hidden representation rk typically will have block structure according to the groups of units associated to each factor of variation once has been extracted it is projected back into the image space via the decoder the key to learning disentangled features is that images should be distinct so that there is no path from any image to itself this way the reconstruction target forces the network to separate the visual concepts shared by and respectively rather than learning the identity mapping concretely the disentangling objective can be written as ldis note that unlike analogy training disentangling only requires dataset of of images along with switch unit vector intuitively describes the sense in which and are related algorithm describes the learning update we used to learn disentangled representation experiments we evaluated our methods using three datasets the first is set of colored shapes which is simple yet nontrivial benchmark for visual analogies the second is set of sprites from the video game project called liberated pixel cup which we chose in order to get controlled variation in large number of character attributes and animations the third is set of car model renderings which allowed us to train model to perform rotation we used caffe to train our encoder and decoder networks with custom matlab wrapper implementing our analogy sampling and training objectives many additional qualitative results of images generated by our model are presented in the supplementary material transforming shapes comparison of analogy models the shapes dataset was used to benchmark performance on rotation scaling and translation analogies specifically we generated images scaled to with four shapes eight colors four scales five row and column positions and rotation angles we compare the performance of our models trained with ladd lmul and ldeep objectives respectively we did not perform disentangling training in this experiment the encoder consisted of fully connected layers with rectified linear nonlinearities relu for intermediate layers the final embedding layer did not use any nonlinearity the decoder architecture mirrors the encoder but did not share weights we trained for steps with size analogy per we used sgd with momentum base learning rate and decayed the learning rate by factor every steps model rotation steps scaling steps translation steps ladd lmul ldeep table comparison of squared pixel prediction error of ladd lmul and ldeep on shape analogies ref gt query ref gt query ref gt query figure analogy predictions made by ldeep for rotation scaling and translation respectively by row ladd and lmul perform as well for scaling and transformation but fail for rotation figure prediction error on repeated application of rotation analogies figure shows repeated predictions from ldeep on rotation scaling and translation test set analogies showing that our model has learned to traverse these manifolds table shows that ladd and lmul perform similarly for scaling and translation but only ldeep can perform accurate rotation analogies further extrapolation results with repeated rotations are shown in figure though both lmul and ldeep are in principle capable of learning the circular pose manifold we suspect that ldeep has much better performance due to the difficulty of training multiplicative models such as lmul generating video game sprites game developers often use what are known as sprites to portray characters and objects in video games more commonly on older systems but still seen 
on phones and indie games this entails significant human effort to draw each frame of each common animation for each in this section we show how animations can be transferred to new characters by analogy our dataset consists of color images of sprites scaled to with attributes body type sex hair type armor type arm type greaves type and weapon type with total unique characters for each character there are animations each from viewpoints spellcast thrust walk slash and shoot each animation has between and frames we split the data by characters training validation and for testing we conducted experiments using the ladd and ldeep variants of our objective with and without disentangled features we also experimented with disentangled feature version in which the identity units are taken to be the character attribute vector from which the pose is disentangled in this case the encoder for identity units acts as multiple softmax classifiers one for each attribute hence we refer to this objective in experiments as the encoder network consisted of two layers of convolution with stride and relu followed by two and relu layers followed by projection onto the embedding the decoder mirrors the encoder to increase the spatial dimension we use simple upsampling in which we copy each input cell value to the corner of its corresponding output for ldis we used units for identity and for pose for we used categorical units for identity which is the attribute vector and the remaining for pose during training for we did not backpropagate reconstruction error through the identity units we only used the attribute classification objective for those units when ldeep is used the internal layers of the transformation function see figure had dimension and were each followed by relu we trained the models using sgd with momentum and learning rate decayed by factor every steps training was conducted for steps with size figure demonstrates the task of animation transfer with predictions from model trained on ladd table provides quantitative comparison of ladd ldis and we found that the disentangling and additive analogy models perform similarly and that using attributes for disentangled identity features provides further gain we conjecture that wins because changes in certain aspects of appearance such as arm color have very small effect in pixel space yielding weak signal for pixel prediction but still provides strong signal to an attribute classifier in some cases the work may be decreased by projecting models to or by other heuristics but in general the work scales with the number of animations and characters figure transferring animations the top row shows the reference and the bottom row shows the transferred animation where the first frame in red is the starting frame of test set character model spellcast thrust walk slash shoot average ladd ldis table pixel error on test analogies by animation from practical perspective the ability to transfer poses accurately to unseen characters could help decrease manual labor of drawing at least of drawing the assets comprising each character in each animation frame however training this model required that each transferred animation already has hundreds of examples ideally the model could be shown small number of examples for new animation and transfer it to the existing character database we call this setting because only small number of the target animations are provided model ladd ldis num reference of examples table error for analogy transfer of the spellcast animation from 
each of viewpoints ldis outperforms ladd and performs the best even with only examples output query prediction figure few shot prediction with examples table provides quantitative comparison and figure provides qualitative comparison of our proposed models in this task we find that provides the best performance by wide margin unlike in table ldis outperforms ladd suggesting that disentangling may allow new animations to be learned in more manner however ldis has an advantage in that it can average the identity features of multiple views of query character which ladd can not do the previous analogies only required us to combine disentangled features from two characters the identity from one and the pose from another and so disentangling was sufficient however our analogy method enables us to perform more challenging analogies by learning the manifold of character animations defined by the sequence of frames in each animation adjacent frames are thus neighbors on the manifold and each animation sequence can be viewed as fiber in this manifold we trained model by forming analogy tuples across animations as depicted in fig using disentangled identity and pose features pose transformations were modeled by deep additive interactions and we used to disentangle pose from identity units figure shows the result of several analogies and their extrapolations including character rotation for which we created animations figure cartoon visualization of the shoot animation manifold for two different characters in different viewpoints the model can learn the structure of the animation manifold by forming analogy tuples during training example tuples are circled in red and blue above ref output query predictions walk thrust rotate figure extrapolating by analogy the model sees the reference output pair and repeatedly applies the inferred transformation to the query this inference requires learning the manifold of animation poses and can not be done by simply combining and decoding disentangled features car analogies in this section we apply our model to on car renderings subject to changes in appearance and rotation angle unlike in the case of shapes this requires the ability of the model to perform rotation and the depicted objects are more complex features pose units id units combined pose auc id auc pose table measuring the disentangling performance on cars pose auc refers to area under the roc curve for pose verification and id auc for car verification on pairs of test set images id gt prediction figure car analogies the column gt denotes ground truth we use the car cad models from for each of the car models we generated color renderings from rotation angles each offset by degrees we split the models into training validation and testing the same convolutional network architecture was used as in the sprites experiments and we used units for identity and for pose ref output query figure repeated rotation analogies in forward and reverse directions starting from frontal pose figure shows test set predictions of our model trained on ldis where images in the fourth column combine pose units from the first column and identity units from the second table shows that the learned features are in fact disentangled and discriminative for identity and pose matching despite not being discriminatively trained figure shows repeated rotation analogies on test set cars using model trained on ldeep demonstrating that our model can perform rotation this type of extrapolation is difficult because the query image shows 
different car from different starting pose we expect that recurrent architecture can further improve the results as shown in conclusions we studied the problem of visual analogy making using deep neural networks and proposed several new models our experiments showed that our proposed models are very general and can learn to make analogies based on appearance rotation pose and various object attributes we provide connection between analogy making and disentangling factors of variation and showed that our proposed analogy representations can overcome certain limitations of disentangled representations acknowledgements this work was supported in part by nsf grfp grant onr grant nsf career grant and nsf grant we thank nvidia for donating tesla gpu references liberated pixel cup http accessed bartha analogy and analogical reasoning in the stanford encyclopedia of philosophy fall edition cole kass mordatch hegarty senn fleischer pesare and breeden stylizing animation by example acm transactions on graphics cheung livezey bansal and olshausen discovering hidden factors of variation in deep networks in iclr workshop cohen and welling learning the irreducible representations of commutative lie groups in icml cohen and welling transformation properties of learned visual representations in iclr desjardins courville and bengio disentangling factors of variation via generative entangling arxiv preprint ding and taylor mental rotation by optimizing transforming distance arxiv preprint rabaud and belongie learning to traverse image manifolds in nips dosovitskiy springenberg and brox learning to generate chairs with convolutional neural networks in cvpr fidler dickinson and urtasun object detection and viewpoint estimation with deformable cuboid model in nips hertzmann jacobs oliver curless and salesin image analogies in siggraph hwang grauman and sha semantic embedding for visual object categorization in nips jia shelhamer donahue karayev long girshick guadarrama and darrell caffe convolutional architecture for fast feature embedding arxiv preprint kingma and welling variational bayes in iclr kingma mohamed rezende and welling learning with deep generative models in nips kulkarni whitney kohli and tenenbaum deep convolutional inverse graphics network in nips levy goldberg and linguistic regularities in sparse and explicit word representations in memisevic and hinton learning to represent spatial transformations with factored boltzmann machines neural computation michalski memisevic and konda modeling deep temporal dependencies with recurrent grammar cells in nips mikolov sutskever chen corrado and dean distributed representations of words and phrases and their compositionality in nips pennington socher and manning glove global vectors for word representation in emnlp reed sohn zhang and lee learning to disentangle factors of variation with manifold interaction in icml rifai bengio courville vincent and mirza disentangling factors of variation for facial expression recognition in eccv susskind memisevic hinton and pollefeys modeling the joint density of two images under variety of transformations in cvpr tang salakhutdinov and hinton tensor analyzers in icml tenenbaum and freeman separating style and content with bilinear models neural computation turney similarity of semantic relations computational linguistics yang reed yang and lee disentangling with recurrent transformations for view synthesis in nips zhu luo wang and tang perceptron deep model for learning face identity and view representations in 
nips 
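To make the analogy-completion step of the preceding model concrete, here is a minimal sketch of the additive variant (the L_add-style inference) and of its repeated application for manifold traversal. The encoder f and decoder g are passed in as callables and would in practice be the trained convolutional networks described above; the function names and the identity-function demo are ours.

    import numpy as np

    def analogy_step(f, g, img_a, img_b, img_c):
        # Additive analogy completion: d_hat = g( f(c) + (f(b) - f(a)) ).
        increment = f(img_b) - f(img_a)
        return g(f(img_c) + increment)

    def traverse_manifold(f, g, img_a, img_b, img_c, n_steps):
        # Repeatedly apply the inferred transformation in embedding space and decode,
        # as in the repeated-analogy (extrapolation) experiments.
        increment = f(img_b) - f(img_a)
        z = f(img_c)
        outputs = []
        for _ in range(n_steps):
            z = z + increment
            outputs.append(g(z))
        return outputs

    # Identity stand-ins make the arithmetic visible; real encoders/decoders are trained CNNs.
    f = lambda x: x
    g = lambda z: z
    a, b, c = np.zeros(3), np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0])
    print(analogy_step(f, g, a, b, c))  # -> [1. 2. 0.]

The multiplicative and deep variants replace the fixed increment f(b) - f(a) with a function of both the increment and the current embedding, which is what allows the rotation analogies above to wrap around a circular pose manifold rather than drifting off it.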
Matrix Completion from Fewer Entries: Spectral Detectability and Rank Estimation. Alaa Saade and Florent Krzakala (Laboratoire de Physique Statistique, CNRS and École Normale Supérieure, Paris, France; Sorbonne Universités, Université Pierre et Marie Curie, Paris, France), Lenka Zdeborová (Institut de Physique Théorique, CEA Saclay and CNRS UMR, France). Abstract: The completion of low-rank matrices from few entries is a task with many practical applications. We consider here two aspects of this problem: detectability, i.e. the ability to estimate the rank r reliably from the fewest possible random entries, and performance in achieving small reconstruction error. We propose a spectral algorithm for these two tasks called MaCBetH (for Matrix Completion with the Bethe Hessian). The rank is estimated as the number of negative eigenvalues of the Bethe Hessian matrix, and the corresponding eigenvectors are used as an initial condition for the minimization of the discrepancy between the estimated matrix and the revealed entries. We analyze the performance in a random matrix setting using results from the statistical mechanics of the Hopfield neural network, and show in particular that MaCBetH efficiently detects the rank r of a large n x m matrix from C(r) r sqrt(nm) entries, where C(r) is a constant close to 1. We also evaluate the corresponding root-mean-square error empirically and show that MaCBetH compares favorably to other existing approaches. Introduction: Matrix completion is the task of inferring the missing entries of a matrix given a subset of known entries. Typically, this is possible because the matrix to be completed has (at least approximately) low rank. This problem has witnessed a burst of activity, motivated by many applications such as collaborative filtering, quantum tomography in physics, or the analysis of a covariance matrix. A commonly studied model for matrix completion assumes the matrix to be exactly low rank, with the known entries chosen uniformly at random and observed without noise. The most widely considered question in this setting is how many entries need to be revealed such that the matrix can be completed exactly in a computationally efficient way. While our present paper assumes the same model, the main questions we investigate are different. The first question we address is detectability: how many random entries do we need to reveal in order to be able to estimate the rank reliably? This is motivated by the more generic problem of detecting structure (in our case, low rank) hidden in partially observed data. It is reasonable to expect the existence of a region where exact completion is hard or even impossible, yet the rank estimation is tractable. The second question we address is: what is the minimum achievable root-mean-square error (RMSE) in estimating the unknown elements of the matrix? In practice, even if exact reconstruction is not possible, having a procedure that provides a very small RMSE might be quite sufficient. In this paper we propose an algorithm called MaCBetH that gives the best known empirical performance for the two tasks above when the rank r is small. The rank in our algorithm is estimated as the number of negative eigenvalues of an associated Bethe Hessian matrix, and the corresponding eigenvectors are used as an initial condition for the local optimization of a cost function commonly considered in matrix completion. In particular, in the random matrix setting we show that MaCBetH detects the rank of a large n x m matrix from C(r) r sqrt(nm) entries, where C(r) is a small constant close to 1. The RMSE is evaluated empirically and, in the regime of few revealed entries, compares very favorably to existing approaches such as OptSpace. This paper is organized as follows: we define the
problem and present generally our approach in the context of existing work in sec in sec we describe our algorithm and motivate its construction via spectral relaxation of the hopfield model of neural network next in sec we show how the performance of the proposed spectral method can be analyzed using in parts results from spin glass theory and phase transitions and rigorous results on the spectral density of large random matrices finally in sec we present numerical simulations that demonstrate the efficiency of macbeth implementations of our algorithms in the julia and matlab programming languages are available at the sphinx webpage http problem definition and relation to other work let mtrue be matrix such that mtrue xy where and are two unknown tall matrices we observe only small fraction of the elements of mtrue chosen uniformly at random we call the subset of observed entries and the sparse matrix supported on whose nonzero elements are the revealed entries of mtrue the aim is to reconstruct the rank matrix mtrue xy given an important parameter which controls the difficulty of the problem is nm in the case of square matrix this is the average number of revealed entries per line or column in our numerical examples and theoretical justifications we shall generate the low rank matrix mtrue xy using tall matrices and with iid gaussian elements we call this the random matrix setting the macbeth algorithm is however and does not use any prior knowledge about and the analysis we perform applies to the limit while and the matrix completion problem was popularized in who proposed nuclear norm minimization as convex relaxation of the problem the algorithmic complexity of the associated semidefinite programming is however low complexity procedure to solve the problem was later proposed by and is based on singular value decomposition svd considerable step towards theoretical understanding of matrix completion from few entries was made in who proved that with the use of trimming pthe performance of matrix completion can be improved and rmse proportional to can be achieved the algorithm of is referred to as optspace and empirically it achieves rmse in the regime of very few revealed entries optspace proceeds in three steps first one trims the observed matrix by setting to zero all rows resp columns with more revealed entries than twice the average number of revealed entries per row resp per column second singular value decompositions is performed on the matrix and only the first components are kept when the rank is unknown it is estimated as the index for which the ratio between two consecutive singular values has minimum third local minimization of the discrepancy between the observed entries and the estimate is performed the initial condition for this minimization is given by the first left and right singular vectors from the second step in this work we improve upon optspace by replacing the first two steps by different spectral procedure that detects the rank and provides better initial condition for the discrepancy minimization our method leverages on recent progress made in the task of detecting communities in the stochastic block model with spectral methods both in community detection and matrix completion traditional spectral methods fail in the very sparse regime due to the existence of spurious large eigenvalues or singular values corresponding to localized eigenvectors the authors of showed that using the matrix or the closely related bethe hessian as basis for the spectral method in 
community detection provides reliable rank estimation and better inference performance the present paper provides an analogous improvement for the matrix completion problem in particular we shall analyze the algorithm using tools from spin glass theory in statistical mechanics and show that there exists phase transition between phase where it is able to detect the rank and phase where it is unable to do so algorithm and motivation the macbeth algorithm standard approach to the completion problem see is to minimize the cost function min mij xy ij ij over and this function is and global optimization is hard one therefore resorts to local optimization technique with careful choice of the initial conditions in our method given the matrix we consider weighted bipartite undirected graph with adjacency matrix mt we will refer to the graph thus defined as we now define the bethe hessian matrix to be the matrix with elements hij sinh βaik δij sinh where is parameter that we will fix to value βsg depending on the data and stands for the neighbors of in the graph expression corresponds to the matrix introduced in applied to the case of graphical model the macbeth algorithm that is the main subject of this paper is then given the matrix which we assume to be centered algorithm macbeth numerically solve for the value of such that where βmij nm build the bethe hessian following eq compute all its negative eigenvalues and corresponding eigenvectors is our estimate for the rank set resp to be the first lines resp the last lines of the matrix perform local optimization of the cost function with rank and initial condition in step is an approximation of the optimal value of for which has maximum number of negative eigenvalues see section instead of this approximation can be chosen in such way as to maximize the number of negative eigenvalues we however observed numerically that the algorithm is robust to some imprecision on the value of in step we could also use the matrix weighted by tanh βmij it was shown in that the spectrum of the bethe hessian and the matrix are closely related in the next section we will motivate and analyze this algorithm in the setting where mtrue was generated from random and and show that in this case macbeth is able to infer the rank whenever fig illustrates the spectral properties of the bethe hessian that justify this algorithm the spectrum is composed of few informative negative eigenvalues well separated from the bulk which remains positive in particular as observed in it avoids the spurious eigenvalues with localized eigenvectors that make trimming necessary in the case of this algorithm is computationally efficient as it is based on the eigenvalue decomposition of sparse symmetric matrix motivation from hopfield model we shall now motivate the construction of the macbeth algorithm from graphical model perspective and spectral relaxation given the observed matrix from the previous section we consider the following graphical model exp mij si tj where the si and tj are binary variables and is parameter controlling the strength of the interactions this model is generalized hebbian hopfield model on bipartite sparse graph and is therefore known to have modes up to symmetries correlated with the lines of and to study it we can use the standard bethe approximation which is widely believed to be exact for such problems on large random graphs in this approximation the means si tj and moments si tj of each variable are approximated by the parameters bi cj and ξij that minimize the 
bethe free energy fbethe bi cj ξij that reads bi si cj tj ξij si tj fbethe bi cj ξij mij ξij bi si cj di dj where ln and di dj are the degrees of nodes and in the graph neural network models such as eq have been extensively studied over the last decades see and references therein and the phenomenology that we shall review briefly here is well known in particular for small enough the global minimum of the bethe free energy corresponds to the paramagnetic state bi cj ξij tanh βmij as we increase above certain value βr the model enters retrieval phase where the free energy has local minima correlated with the factors and there are local minima called retrieval states bli clj ξij indexed by such that in the large limit xi bli yj clj these retrieval states are therefore convenient initial conditions for the local optimization of eq and we expect their number to tell us the correct rank increasing above critical value βsg the system eventually enters spin glass phase marked by the appearance of many spurious minima it would be tempting to continue the bethe approach leading to belief propagation but we shall instead consider simpler spectral relaxation of the problem following the same strategy as used in for graph clustering first we use the fact that the paramagnetic state is always stationary point of the bethe free energy for any value of in order to detect the retrieval states we thus study its stability by looking for negative eigenvalues of the hessian of the bethe free energy evaluated at the paramagnetic state at this point the elements of the hessian involving one derivative with respect to ξij vanish while the block involving two such derivatives is diagonal positive definite matrix the remaining part is the matrix called bethe hessian in which however considers different graphical model than eigenvectors corresponding to its negative eigenvalues are thus expected to give an approximation of the retrieval states the picture exposed in this section is summarized in figure and motivates the macbeth algorithm note that similar approach was used in to detect the retrieval states of hopfield model using the weighted matrix which linearizes the belief propagation equations rather than the bethe free energy resulting in larger matrix the bethe hessian while mathematically closely related is also simpler to handle in practice analysis of performance in detection we now show how the performance of macbeth can be analyzed and the spectral properties of the matrix characterized using both tools from statistical mechanics and rigorous arguments direct diag bp direct diag bp direct diag bp direct diag bp figure spectral density of the bethe hessian for various values of the parameter red dots are the result of the direct diagonalisation of the bethe hessian for rank and matrix with revealed entries per row on average the black curves are the solutions of computed with belief propagation on graph of size we isolated the smallest eigenvalues represented as small bars for convenience and the inset is zoom around these smallest eigenvalues for small enough top plots the bethe hessian is positive definite signaling that the paramagnetic state is local minimum of the bethe free energy as increases the spectrum is shifted towards the negative region and has negative eigenvalues at the approximate value of to be compared to βr for this case evaluated by our algorithm lower left plot these eigenvalues corresponding to the retrieval states become positive and eventually merge in the bulk as is further 
increased lower right plot while the bulk of uninformative eigenvalues remains at all values of in the positive region analysis of the phase transition we start by investigating the phase transition above which our spectral method will detect the correct rank let xp xlp yp ypl be random vectors with the same empirical distribution as the lines of and respectively using the statistical mechanics correspondence between the negative eigenvalues of the bethe hessian and the appearance of phase transitions in model we can compute the values βr and βsg where instabilities towards respectively the retrieval states and the spurious glassy states arise we have repeated the computations of in the case of model using the cavity method we refer the reader interested in the technical details of the statistical mechanics approach to neural networks to following standard computation for locating phase transitions in the bethe approximation see the stability of the paramagnetic state towards these two phases can be monitored in terms of the two following parameters hy lim xlp ypl ypl hy lim tanh xlp ypl tanh ypl where the expectation is over the distribution of the vectors xp yp the parameter controls the sensitivity of the paramagnetic solution to random noise while measures its sensitivity to perturbation in the direction of retrieval state βsg and βr are defined implicitly as βsg and βr the value beyond which the perturbation diverges the existence of retrieval phase is equivalent to the condition βsg βr so that there exists range of values of where the retrieval states exist but not the spurious ones if this condition is met by setting βsg in our algorithm we ensure the presence of meaningful negative eigenvalues of the bethe hessian we define the critical value of such that βsg βr if and only if in general there is no formula for this critical value which is defined implicitly in terms of the functions and we thus computed numerically using population dynamics algorithm and the results for are presented on figure quite remarkably with the definition nm the critical value does not depend on the ratio only on the rank in the limit of large and it is possible to obtain simple formula in this case the observed entries of the matrix become jointly gaussian distributed and uncorrelated and therefore independent expression then simplifies to xl note that the macbeth algorithm uses an empirical estimator of this quantity to compute an approximation of βsg purely from the revealed entries in the large regime both βsg βr decay to so that we can further approximate figure location of the critical value as function of the rank macbeth is able to estimate the correct rank from nm known entries we used population dynamics algorithm βsg with population of size to compute the funcp tions and from the dotted line is fit βr suggesting that so that we reach the simple asymptotic expression in the large limit that or equivalently interestingly this result was obtained as the detectability threshold in completion of rank matrices from entries in the bayes optimal setting in notice however that exact completion in the setting of is only possible for nm clearly detection and exact completion are different phenomena the previous analysis can be extended beyond the random setting assumption as long as the empirical distribution of the entries is well defined and the lines of resp are approximately orthogonal and centered this condition is related to the standard incoherence property computation of the spectral density in 
this section we show how the spectral density of the bethe hessian can be computed analytically on graphs such as those generated by picking uniformly at random the observed entries of the matrix xy this further motivates our algorithm and in particular our choice of independently of section the spectral density is defined as λi lim where the λi are the eigenvalues of the bethe hessian using again the cavity method it can be shown that the spectral density in which potential delta peaks have been removed is given by lim where the are complex variables living on the vertices of the graph which are given by βaik where is the set of neighbors of the are the linearly stable solution of the following belief propagation recursion βaik rank rank mean inferred rank transition ϵc figure mean inferred rank as function of for different sizes averaged over samples of xy matrices the entries of are drawn from gaussian distribution of mean and variance the theoretical transition is computed with population dynamics algorithm see section the finite size effects are considerable but consistent with the asymptotic prediction this formula can be derived by turning the computation of the spectral density into marginalization problem for graphical model on the graph and then solving it using loopy belief propagation quite remarkably this approach leads to an asymptotically exact and rigorous description of the spectral density on random graphs solving equation numerically we obtain the results shown on fig the bulk of the spectrum in particular is always positive we now demonstrate that for any value of βsg there exists an open set around where the spectral density vanishes this justifies independently or choice for the parameter the proof follows and begins by noticing that βaij is fixed point of the recursion for since this fixed point is real the corresponding spectral density is now consider small perturbation δij of this solution such that βaij βaij δij the linearized version of writes βail the linear operator thus defined is weighted version of the matrix of its spectral radius is given by where is defined in in particular for βsg so that straightforward application of the implicit function theorem allows to show that there exists neighborhood of such that for any there exists real linearly stable fixed point of yielding spectral density equal to at the informative eigenvalues those outside of the bulk are therefore exactly the negative ones which motivates independently our algorithm numerical tests figure illustrates the ability of the bethe hessian to infer the rank above the critical value in the limit of large size see section in figure we demonstrate the suitability of the eigenvectors of the bethe hessian as starting point for the minimization of the cost function we compare the final rmse achieved on the reconstructed matrix xy with other initializations of the optimization including the largest singular vectors of the trimmed matrix macbeth systematically outperforms all the other choices of initial conditions providing better initial condition for the optimization of remarkably the performance achieved by macbeth with the inferred rank is essentially the same as the one achieved with an oracle rank by contrast estimating the correct rank from the trimmed svd is more challenging we note that for the choice of parameters we consider trimming had negligible effect along the same lines optspace uses different minimization procedure but from our tests we could not see any difference in performance 
due to that when using alternating least squares as optimization method we also obtained similar improvement in reconstruction by using the eigenvectors of the bethe hessian instead of the singular vectors of as initial condition rmse rank macbeth or or random or macbeth ir ir rmse rank figure rmse as function of the number of revealed entries per row comparison between different initializations for the optimization of the cost function the top row shows the probability that the achieved rmse is smaller than while the bottom row shows the probability that the final rmse is smaller than the probabilities were estimated as the frequency of success over samples of matrices xy of size with the entries of drawn from gaussian distribution of mean and variance all methods optimize the cost function using low storage bfgs algorithm part of nlopt starting from different initial conditions the maximum number of iterations was set to the initial conditions compared are macbeth with oracle rank macbeth or or inferred rank macbeth ir svd of the observed matrix after trimming with oracle rank or or inferred rank ir note that this is equivalent to optspace in this regime and random initial conditions with oracle rank random or for the ir method we inferred the rank from the svd by looking for an index for which the ratio between two consecutive eigenvalues is minimized as suggested in conclusion in this paper we have presented macbeth an algorithm for matrix completion that is efficient for two distinct complementary tasks it has the ability to estimate finite rank reliably from fewer random entries than other existing approaches and ii it gives lower reconstruction errors than its competitors the algorithm is built around the bethe hessian matrix and leverages both on recent progresses in the construction of efficient spectral methods for clustering of sparse networks and on the optspace approach for matrix completion the method presented here offers number of possible future directions including replacing the minimization of the cost function by type algorithm the use of different neural network models or more theoretical direction involving the computation of information theoretically optimal transitions for detectability acknowledgment our research has received funding from the european research council under the european union framework programme grant agreement references candès and recht exact matrix completion via convex optimization foundations of computational mathematics vol no pp candès and tao the power of convex relaxation matrix completion information theory ieee transactions on vol no pp keshavan montanari and oh matrix completion from few entries information theory ieee transactions on vol no pp gross liu flammia becker and eisert quantum state tomography via compressed sensing physical review letters vol no saade krzakala and zdeborová spectral clustering of graphs with the bethe hessian in advances in neural information processing systems pp saade krzakala lelarge and zdeborová spectral detection in the censored block model ieee international symposium on information theory to appear cai candès and shen singular value thresholding algorithm for matrix completion siam journal on optimization vol no pp krzakala moore mossel neeman sly zdeborová and zhang spectral redemption in clustering sparse networks proc natl acad vol no pp bordenave lelarge and massoulié spectrum of random graphs community detection and ramanujan graphs hopfield neural networks and physical systems with emergent 
collective computational abilities proc nat acad vol no pp yedidia freeman and weiss bethe free energy kikuchi approximations and belief propagation algorithms advances in neural information processing systems vol mezard and montanari information physics and computation oxford university press amit gutfreund and sompolinsky models of neural networks physical review vol no wemmenhove and coolen finite connectivity attractor neural networks journal of physics mathematical and general vol no castillo and skantzos the model on sparse random graph journal of physics mathematical and general vol no zhang nonbacktracking operator for the ising model and its applications in systems with multiple states physical review vol no mooij and kappen validity estimates for loopy belief propagation on binary in advances in neural information processing systems pp the bethe approximation for solving the inverse ising problem comparison with other inference methods stat mech th and zdeborová statistical physics of hard optimization problems acta physica slovaca vol no pp kabashima krzakala mézard sakata and zdeborová phase transitions and sample complexity in matrix factorization rogers castillo kühn and takeda cavity approach to the spectral density of sparse symmetric random matrices phys rev vol no bordenave and lelarge resolvent of large random graphs random structures and algorithms vol no pp jain netrapalli and sanghavi matrix completion using alternating minimization in proceedings of the annual acm symposium on theory of computing acm pp hardt understanding alternating minimization for matrix completion in foundations of computer science focs ieee annual symposium on ieee pp liu and nocedal on the limited memory bfgs method for large scale optimization mathematical programming vol no pp johnson the nlopt package keshavan montanari and oh matrix completion with noisy observations quantitative comparison in annual allerton conference on communication control and computing pp 
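The rank-estimation step described in the section above — build the Bethe Hessian from the revealed entries and count its negative eigenvalues — can be sketched compactly. The following Python fragment is a minimal illustration, not the authors' implementation: it assumes the weighted Bethe-Hessian form H_ii = 1 + Σ_k t_ik²/(1 − t_ik²), H_ij = −t_ij/(1 − t_ij²) with t_ij = tanh(βA_ij) on the bipartite graph of revealed entries (the form used for weighted graphs in the works cited above), and it takes β as a user-supplied input instead of reproducing the β_SG estimator that the text computes from the revealed entries.

```python
import numpy as np
from scipy.sparse import coo_matrix, diags, identity
from scipy.sparse.linalg import eigsh

def bethe_hessian(rows, cols, vals, n, m, beta):
    """Weighted Bethe Hessian on the bipartite graph of revealed entries.

    rows[k], cols[k], vals[k] describe the revealed entries A[rows[k], cols[k]]
    of an n x m matrix.  Off-diagonal entries are -t/(1-t^2) with
    t = tanh(beta * A_ij); the diagonal accumulates 1 + sum of t^2/(1-t^2)
    over incident revealed entries.
    """
    t = np.tanh(beta * np.asarray(vals, dtype=float))
    w = t / (1.0 - t ** 2)        # off-diagonal magnitude per revealed entry
    d = t ** 2 / (1.0 - t ** 2)   # diagonal contribution per revealed entry
    N = n + m                     # rows of A first, then columns of A
    i = np.concatenate([rows, np.asarray(cols) + n])
    j = np.concatenate([np.asarray(cols) + n, rows])
    off = coo_matrix((np.concatenate([-w, -w]), (i, j)), shape=(N, N))
    deg = np.zeros(N)
    np.add.at(deg, rows, d)
    np.add.at(deg, np.asarray(cols) + n, d)
    return (identity(N) + diags(deg) + off).tocsr()

def estimate_rank(H, k=20):
    """Infer the rank as the number of negative eigenvalues of H
    (only the k smallest eigenvalues are examined)."""
    vals = eigsh(H, k=min(k, H.shape[0] - 2), which='SA',
                 return_eigenvectors=False)
    return int(np.sum(vals < 0))
```

In practice the eigenvectors attached to those negative eigenvalues would then serve as the initial condition for the local optimization of the reconstruction cost, as the section above describes; that refinement step is not shown here.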
Online Learning with Adversarial Delays

Kent and Daniel, Department of Computer Science, University of Illinois, Urbana, IL.
(Supported in part by NSF grants and by a grant from Google.)

Abstract

We study the performance of standard online learning algorithms when the feedback is delayed by an adversary. We show that online gradient descent and follow-the-perturbed-leader achieve regret O(√D) in the delayed setting, where D is the sum of delays of each round's feedback. This bound collapses to an optimal O(√T) bound in the usual setting of no delays (where D = T). Our main contribution is to show that standard algorithms for online learning already have simple regret bounds in the most general setting of delayed feedback, making adjustments to the analysis and not to the algorithms themselves. Our results help affirm and clarify the success of recent algorithms in optimization and machine learning that operate in delayed feedback models.

Introduction

Consider the following simple game. Let K be a bounded set, such as the unit ball or a collection of experts. Each round t, we pick a point x_t ∈ K. An adversary then gives us a cost function f_t over K, and we incur the loss f_t(x_t). After T rounds, our total loss is the sum L_T = Σ_t f_t(x_t), which we want to minimize. We cannot hope to "beat" the adversary, so to speak, when the adversary picks the cost function after we select our point. There is margin for optimism, however, if, rather than evaluate our total loss in absolute terms, we compare our strategy to the best fixed point in hindsight. The regret of a strategy x_1, ..., x_T is the additive difference Σ_t f_t(x_t) − min_{x ∈ K} Σ_t f_t(x).

Surprisingly, one can obtain positive results in terms of regret. Kalai and Vempala showed that a simple and randomized follow-the-leader type algorithm achieves O(√T) regret in expectation for linear cost functions. (Here, the O(·) notation assumes that the diameter of K and the f_t are bounded by constants.) If K is convex, then O(√T) regret is achievable even when the cost vectors are replaced by more general convex cost functions: where we incur losses of the form f_t(x_t) with f_t a convex function, Zinkevich showed that online gradient descent achieves O(√T) regret.

There is a large body of theoretical literature about this setting, called online learning; see, for example, the surveys by Blum and Hazan. Online learning is general enough to be applied to a diverse family of problems. For example, Kalai and Vempala's algorithm can be applied to online combinatorial problems such as shortest paths, decision trees, and data structures, in addition to basic machine learning problems with convex loss functions. Zinkevich considers applications to industrial optimization, where the value of goods is not known until after the goods are produced. Other examples of applications of online learning include universal portfolios in finance and online ranking for documents.

The standard setting assumes that the cost vector f_t (or, more generally, the feedback) is given to and processed by the player before making the next decision in round t + 1. Philosophically, this is not how decisions are made in real life: we rush through many different things at the same time, with no pause for careful consideration, and we may not realize our mistakes for a while. Unsurprisingly, the assumption of immediate feedback is too restrictive for many real applications. In online advertising, online learning algorithms try to predict and serve ads that optimize for clicks. The algorithm learns by observing whether or not an ad is clicked, but in production systems a massive number of ads are served between the moment an ad is displayed to a user and the moment the user has decided to either click or ignore that ad. In military applications, online learning algorithms are used by radio jammers
to identify efficient jamming strategies after jammer attempts to disrupt packet between transmitter and receiver it does not know if the jamming attempt succeeded until an acknowledgement packet is sent by the receiver in cloud computing online learning helps devise efficient resource allocation strategies such as finding the right mix of cheaper and inconsistent spot instances and more reliable and expensive instances when renting computers for batch jobs the learning algorithm does not know how well an allocation strategy worked for batch job until the batch job has ended by which time many more batch jobs have already been launched in finance online learning algorithms managing portfolios are subject to information and transaction delays from the market and financial firms invest heavily to minimize these delays one strategy to handle delayed feedback is to pool independent copies of fixed learning algorithm each of which acts as an undelayed learner over subsequence of the rounds each round is delegated to single instance from the pool of learners and the learner is required to wait for and process its feedback before rejoining the pool if there are no learners available new copy is instantiated and added to the pool the size of the pool is proportional to the maximum number of outstanding delays at any point of decision and the overall regret is bounded by the sum of regrets of the individual learners this approach is analyzed for constant delays by weinberger and ordentlich and more sophisticated analysis is given by joulani et al if is the expected maximum number of outstanding feedbacks then joulani et al obtain regret bound on the order of αt in expectation for the setting considered here the blackbox nature of this approach begets simultaneous bounds for other settings such as partial information and stochastic rewards although maintaining copies of learners in proportion to the delay may be prohibitively resource intensive joulani et al provide more efficient variant for the stochastic bandit problem setting not considered here another line of research is dedicated to scaling gradient descent type algorithms to distributed settings where asynchronous processors naturally introduce delays in the learning framework classic reference in this area is the book of bertsekas and tsitskilis if the data is very sparse so that input instances and their gradients are somewhat orthogonal then intuitively we can apply gradients out of order without significant interference across rounds this idea is explored by recht et al who analyze and test parallel algorithm on restricted class of strongly convex loss functions and by duchi et al and mcmahan and streeter who design and analyze distributed variants of adaptive gradient descent perhaps the most closely related work in this area is by langford et who study the algorithm of zinkevich when the delays are bounded by constant number of rounds research in this area has largely moved on from the simplistic models considered here see for more recent developments the impact of delayed feedback in learning algorithms is also explored by riabko under the framework of weak teachers for the sake of concreteness we establish the following notation for the delayed setting for each round let dt be integer delay the feedback from round is delivered at the end of round dt and can be used in round dt in the standard setting with no delays dt for all for each round let ft du be the set of rounds whose pt feedback appears at the end of round we let dt denote 
the sum of all delays in the standard setting with no delays we have in this paper we investigate the implications of delayed feedback when the delays are adversarial arbitrary with no assumptions or restrictions made on the adversary rather than design new algorithms that may generate more involved analysis we study the performance of the classical algorithms and essentially unmodified when the feedback is delayed in the delayed setting we prove that both rithms have simple regret bound of these bounds collapse to match the regret bounds if there are no delays where paper organization in section we analyze the algorithm in the delayed setting giving upper bounds on the regret as function of the sum of delays in section we analyze the in the delayed setting and derive regret bound in terms of due to space constraints extensions to and are deferred to the appendix we conclude and propose future directions in section delayed gradient descent convex optimization in online convex optimization the input domain is convex and each cost function ft is convex for this setting zinkevich proposed simple online algorithm called designed as follows the first point is picked in arbitrarily after picking the tth point xt computes the gradient of the loss function at xt and chooses πk xt in the subsequent round for some parameter here πk is the projection that maps point to its nearest point in discussed further below zinkevich showed that assuming the euclidean diameter of and the euclidean lengths of all gradients are bounded by constants has an optimal regret bound of delayed gradient descent in the delayed setting the loss function ft is not necessarily given by the adversary before we pick the next point or even at all the natural generalization of to this setting is to process the convex loss functions and apply their gradients the moment they are delivered that is we update xt for some fixed parameter and then project πk back into to choose our th point in the setting of zinkevich we have ft for each and this algorithm is exactly note that gradient does not need to be timestamped by the round from which it originates which is required by the pooling strategies of weinberger and ordentlich and joulani et al in order to return the feedback to the appropriate learner theorem let be convex set with diameter let ft be convex functions over with for all and and let be fixed parameter in the presence of adversarial delays selects points xt such that for all ft xt ft ηl where denotes the sum of delays over all rounds for theorem implies regret bound of this choice of requires prior knowledge of the final sum when this sum is not known one can calculate on the fly if there are outstanding undelivered cost functions at round then increases by exactly obviously and so at most doubles we can therefore employ the doubling trick of auer et al to dynamically adjust as grows in the undelayed setting analyzed by zinkevich we have and the regret bound of theorem matches that obtained by zinkevich if each delay dt is bounded by some fixed value theorem implies regret bound of that matches that of langford et al in both of these special cases the regret bound is known to be tight before proving theorem we review basic definitions and facts on convexity function is convex if αy αf if is differentiable then is convex iff for convex but not necessarily differentiable subgradient of at is any vector that can replace in equation the possible empty set of gradients of at is denoted by the gradient descent may occasionally 
update along gradient that takes us out of the constrained domain if is convex then we can simply project the point back into lemma let be closed convex set in normed linear space and point and let be the closest point in to then for any point kx we let πk denote the map taking point to its closest point in the convex set proof of theorem let arg ft be the best point in hindsight at the end of all rounds for by convexity of ft we have ft ft xt xt fix and and by lemma we know that consider the distance between where xt we split the sum of gradients applied in single round and consider them one by one for each ft let ft ft and let xt xt suppose ft is nonempty and fix max ft to be the last index in ft by lemma we have xt kxt xt repeatedly unrolling the first term in this fashion gives kxt xt for each ft by convexity of we have xt xt xs xs xt fs fs xs xs xt by assumption we also have for each ft with respect to the distance between and this gives kxt fs fs xs xs xt solving this inequality for the regret terms over all rounds we have ft xt ft fs xs fs and taking the sum of inequalities fs xs fs kxt xs xt kxt xs xt xs xt the first two terms are familiar from the standard analysis of it remains to analyze the last sum which we call the delay term each summand xs xt in the delay term contributes loss proportional to the distance between the point xs when the gradient is generated and the point xt when the gradient is applied this distance is created by the other gradients that are applied in between and the number of such gradients are intimately tied to the total delay as follows by the delay term is bounded above by xs xt kxs xt kxs xt consider single term kxs xt for fixed and ft intuitively the difference xt is roughly the sum of gradients received between round and when we apply the gradient from round in round more precisely by applying the triangle inequality and lemma we have kxt xs kxt xt kxt xs kxt xt xs for the same reason we have xs xs and unrolling in this fashion we have kxt xs kxt xt xr pt after substituting equation into equation it remains to bound the sum consider single term in the sum this quantity counts for gradient from round delivered just before round the number of other gradients that are applied while is withheld fix two rounds and and consider an intermediate round if then fix fr and if then fix ft the feedback from round is applied in round between round and round we divide our analysis into two scenarios in one case and the gradient from round appears only after as in the following diagram in the other case as in the following diagram for each round let du denote the number of rounds the gradient feedback is delayed so there are at most ds instances of the latter case since must lie in the first case can be charged to dq to bound the first case observe that for fixed the number of indices such that dq ds is at most dq that is all instances of the second case for fixed can pt pt be charged to dq between the two cases we have dt and the delay term is bounded by xs xt dt with respect to the overall regret this gives xt dt as desired pt remark the delay term xs xt is natural point of entry for sharper analysis based on strong sparseness assumptions the distance xs xt is measured by its projection against the gradient and the preceding proof assumes the worst case and bounds the dot product with the inequality if for example we assume that gradients are pairwise orthogonal and analyze in the unconstrained setting then the dot product xs xt is and the delay term vanishes 
altogether delaying the perturbed leader discrete online linear optimization in discrete online linear optimization the input domain rn is possibly discrete set with bounded diameter and each cost function ft is of the form ft ct for cost vector ct the previous algorithm does not apply here because is not convex natural algorithm for this problem is each round let yt arg be the optimum choice over the first cost vectors the algorithm picking yt in round is called and can be shown to have zero regret of course is infeasible since the cost vector ct is revealed after picking yt tries the next best thing picking in round unfortunately this strategy can have linear regret largely because it is deterministic algorithm that can be manipulated by an adversary kalai and vempala gave simple and elegant correction called let be parameter to be fixed later and let be the cube of length each round randomly picks vector by the uniform distribution and then selects xt arg to optimize over the previous costs plus the random perturbation with the diameter of and the lengths kct of each cost vector held constant kalai and vempala showed that has regret in expectation following the delayed and perturbed leader more generally optimizes over all information available to the algorithm plus some additional noise to smoothen the analysis if the cost vectors are delayed we naturally interpret to optimize over all cost vectors ct delivered in time for round when picking its point xt that is the tth leader becomes the best choice with respect to all cost vectors delivered in the first rounds ytd arg min cr we use the superscript to emphasize the delayed setting the tth perturbed leader optimizes over all cost vectors delivered through the first rounds in addition to the random perturbation arg min cr in the delayed setting chooses xt in round we claim that has direct and simple regret bound in terms of the sum of delays that collapses to kalai and vempala regret bound in the undelayed setting theorem let rn be set with ct rn with kct for all and in the presence of adversarial delays picks points xt such that for all ct xt ct for theorem implies regret bound of when is not known priori the doubling trick can be used to adjust dynamically see the discussion following theorem to analyze in the presence of delays we introduce the notion of prophet who is sort of omniscient leader who sees the feedback immediately formally the tth prophet is the best point with respect to all the cost vectors over the first rounds zt arg min ct the tth perturbed prophet is the best point with respect to all the cost vectors over the first rounds in addition to perturbation arg min ct the prophets and perturbed prophets behave exactly as the leaders and perturbed leaders in the setting of kalai and vempala with no delays in particular we can apply the regret bound of kalai and vempala to the infeasible strategy of following the perturbed prophet lemma let rn be set with let ct rn be cost vectors bounded by kct for all and let if are chosen per equation pt pt then ct ct for all the analysis by kalai and vempala observes that when there are no delays two consecutive perturbed leaders and are distributed similarly over the random noise lemma instead we will show that and are distributed in proportion to delays we first require technical lemma that is implicit in lemma let be set with and let rn be vectors let rn be random vectors defined by arg and arg where qn is chosen uniformly at random from for some fixed length then for any vector kv 
proof let and and write arg and arg where and are chosen uniformly at random then subtracting from both terms on the right we have by symmetry and we have by assumption has so ky and by hölder inequality we have it remains to bound if kv we have kv vi ui vol vol ui vol otherwise if ku then vol vol kv in either case we have kv vol vol vol vol and the claim follows lemma could also have been proven geometrically in similar fashion to kalai and vempala lemma vectors pt where is the sum of delays of all cost ct ct pt proof let ut ct be the sum of all costs through the first rounds and vt ct be the sum of cost vectors actually delivered through the first rounds then the perturbed prophet optimizes over and optimizes over by lemma for each we have kct ds ct ct summed over all rounds we have ct ct ds pt the sum ds charges each cost vector cs once for every round it is delayed pt and therefore equals thus ct ct as desired now we complete the proof of theorem proof of theorem by lemma and lemma we have ct ct ct arg min as desired conclusion we prove regret bounds for and in the delayed setting directly extending the regret bounds known in the undelayed setting more importantly by deriving simple bound as function of the delays without any restriction on the delays we establish simple and intuitive model for measuring delayed learning this work suggests natural relationships between the regret bounds of online learning algorithms and delays in the feedback beyond analyzing existing algorithms we hope that optimizing over the regret as function of may inspire different and hopefully simple algorithms that readily model real world applications and scale nicely to distributed environments acknowledgements we thank avrim blum for introducing us to the area of online learning and helping us with several valuable discussions we thank the reviewers for their careful and insightful reviews finding errors referencing relevant works and suggesting connection to mirror descent references zinkevich online convex programming and generalized infinitesimal gradient ascent in proc int conf mach learning icml pages kalai and vempala efficient algorithms for online decision problems comput sys extended abstract in proc ann conf comp learning theory colt blum algorithms in machine learning in fiat and woeginger editors online algorithms volume of lncs chapter pages springer berlin heidelberg online learning and online convex optimization found trends mach hazan introduction to online convex optimization internet draft available at http takimoto and warmuth path kernels and multiplicative updates mach learn research helmbold and schapire predicting nearly as well as the best pruning of decision tree mach learn blum chawla and kalai static optimality and dynamic search optimality in lists and trees algorithmica cover universal portfolios math finance crammer and singer family of additive online algorithms for category ranking mach learn research he pan jin xu liu xu shi atallah herbrich bowers and quiñonero candela practical lessons from predicting clicks on ads at facebook in proc acm conf knowl disc and data mining kdd pages acm amuru and buehrer optimal jamming using delayed learning in ieee military comm conf milcom pages ieee menache shamir and jain spot or both dynamic resource allocation for executing batch jobs in the cloud in int conf on autonomic comput icac weinberger and ordentlich on delayed prediction of individual sequences ieee trans inf theory joulani györgy and szepesvári online learning under delayed 
feedback in proc int conf mach learning icml volume bertsekas and tsitsiklis parallel and distributed computation numerical methods prenticehall recht re wright and niu hogwild approach to parallelizing stochastic gradient descent in adv neural info proc sys nips pages duchi jordan and mcmahan estimation optimization and parallelism when data is sparse in adv neural info proc sys nips pages mcmahan and streeter algorithms for asynchronous distributed online learning in adv neural info proc sys nips pages duchi hazan and singer adaptive subgradient methods for online learning and stochastic optimization mach learn research july langford smola and zinkevich slow learners are fast in adv neural info proc sys nips pages liu wright ré bittorf and sridhar an asynchronous parallel stochastic coordiante descent algorithm mach learn research duchi chaturapruek and ré asynchronous stochastic convex optimization to appear in adv neural info proc sys nips corr wright coordinate descent algorithms math riabko on the flexibility of theoretical models for pattern recognition phd thesis university of london april freund haussler helmbold schapire and warmuth how to use expert advice assoc comput 
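To make the delayed update rule analyzed above concrete — apply every gradient the moment it is delivered, with no reordering and no time-stamping — here is a minimal Python sketch. The oracle interface, the ℓ2-ball standing in for the feasible set K, and the fixed step size eta are illustrative assumptions rather than part of the original text; the analysis above fixes the step size in terms of the diameter of K, the gradient bound, and the sum of delays D.

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x|| <= radius}, an example convex set K."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def delayed_ogd(grad_oracle, delays, T, dim, eta, radius=1.0):
    """Online gradient descent with adversarially delayed feedback.

    grad_oracle(t, x_t) returns a (sub)gradient of f_t at the point played in
    round t (a placeholder for the adversary's feedback); delays[t] is the
    delay d_t, so feedback from round t is delivered at the end of round
    t + d_t.  Gradients are applied unmodified, in whatever order they arrive.
    """
    x = np.zeros(dim)                      # arbitrary first point in K
    pending = {}                           # delivery round -> list of gradients
    plays = []
    for t in range(T):
        plays.append(x.copy())
        g = grad_oracle(t, x)              # feedback for round t exists, but ...
        pending.setdefault(t + delays[t], []).append(g)  # ... arrives later
        for grad in pending.pop(t, []):    # apply everything delivered this round
            x = project_l2_ball(x - eta * grad, radius)
    return plays
```

With delays[t] = 0 for all t this reduces to the usual projected online gradient descent update, which is the sense in which the O(√D) bound above collapses to O(√T) when there are no delays.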
Feature Reduction for Tree Structured Group Lasso via Hierarchical Projection

Jie, Jieping
Computational Medicine and Bioinformatics / Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI
jwangumi, jpye

Abstract

Tree structured group lasso (TGL) is a powerful technique for uncovering tree structured sparsity over the features, where each node encodes a group of features. It has been applied successfully in many applications. However, with extremely large feature dimensions, solving TGL remains a significant challenge due to its highly complicated regularizer. In this paper, we propose a novel multi-layer feature reduction method (MLFre) to quickly identify the inactive nodes, i.e., the groups of features with zero coefficients in the solution, hierarchically in a top-down fashion; the detected nodes are guaranteed to be irrelevant to the response. Thus, we can remove them from the optimization without sacrificing accuracy. The major challenge in developing such testing rules is due to the overlaps between parents and their children nodes. By a novel hierarchical projection algorithm, MLFre is able to test the nodes independently from any of their ancestor nodes. Moreover, MLFre has low computational cost and can be integrated with any existing solvers. Experiments on both synthetic and real data sets demonstrate that the speedup gained by MLFre can be orders of magnitude.

Introduction

Tree structured group lasso (TGL) is a powerful regression technique for uncovering hierarchical sparse patterns among the features. The key of TGL, the tree guided regularization, is based on a tree structure and the group lasso penalty, where each node represents a group of features. In recent years, TGL has achieved great success in many applications such as brain image analysis, gene data analysis, natural language processing, and face recognition. Many algorithms have been proposed to improve the efficiency of TGL. However, the application of TGL to large-scale problems remains a challenge due to its highly complicated regularizer.

As an emerging and promising technique for scaling large problems, screening has received much attention in the past few years. Screening aims to identify the zero coefficients in the sparse solutions by simple testing rules, such that the corresponding features can be removed from the optimization. Thus, the size of the data matrix can be significantly reduced, leading to substantial savings in computational cost and memory usage. Typical examples include TLFre, FLAMS, EDPP, Sasvi, DOME, SAFE, and strong rules. We note that strong rules are inexact, in the sense that features with nonzero coefficients may be mistakenly discarded, while the others are exact. Another important direction of screening is to detect the non-support vectors for support vector machine (SVM) and least absolute deviation (LAD). Empirical studies have shown that the speedup gained by screening methods can be several orders of magnitude. Moreover, the exact screening methods improve the efficiency without sacrificing optimality. However, to the best of our knowledge, existing screening methods are only applicable to sparse models with simple structures such as lasso, group lasso, and sparse group lasso.

In this paper, we propose a novel feature reduction method, called MLFre, for TGL. MLFre is exact, and it tests the nodes hierarchically from the top level to the bottom level to quickly identify the inactive nodes, i.e., the groups of features with zero coefficients in the solution vector, which are guaranteed to be absent from the sparse representation. To the best of our knowledge, MLFre is the first screening method that is
applicable to tgl with the highly complicated tree guided regularization the major technical challenges in developing mlfre for tgl lie in two folds the first is that most existing exact screening methods are based on evaluating the norm of the subgradients of the regularizers with respect to the variables or groups of variables of interests however for tgl we only have access to mixture of the subgradients due to the overlaps between parents and their children nodes therefore our first major technical contribution is novel hierarchical projection algorithm that is able to exactly and efficiently recover the subgradients with respect to every node from the mixture sections and the second technical challenge is that most existing exact screening methods need to estimate an upper bound involving the dual optimum this turns out to be complicated nonconvex optimization problem for tgl thus our second major technical contribution is to show that this highly nontrivial nonconvex optimization problem admits closed form solutions section experiments on both synthetic and real data sets demonstrate that the speedup gained by mlfre can be orders of magnitude section please see supplements for detailed proofs of the results in the main text notation let be the norm for positive integer and for rp let ui be its ith component for we denote ug vi ui if vi otherwise and hg rp if and we emphasize that for set let int ri bd and rbd be its interior relative interior boundary and relative boundary respectively if is closed and convex the projection operator is pc kz uk and its indicator function is ic which is on and elsewhere let rp be the class of proper closed convex functions on rp for rp let be its subdifferential and dom we denote by max basics we briefly review some basics of tgl first we introduce the index tree definition for an index tree of depth we denote the node of depth by ti gini where gij and ni we assume that and different nodes of the same depth do not overlap ii if gij is parent node of gij then when the tree structure is available see supplement for an example the tgl problem is xd xni min ky wji kβgij tgl where is the response vector is the data matrix βgij and wji are the coefficients vector and positive weight corresponding to node gij respectively and is the regularization parameter we derive the lagrangian dual problem of tgl as follows pd pni theorem for the tgl problem let wji kβgij the following hold let φij kβgij and bji hgij kζk wji we can write as xd xni xd xni wji bji ii let the lagrangian dual of tgl is sup yλ iii let and be the optimal solution of problems tgl and respectively then xβ xd xni xt wji the dual problem of tgl in is equivalent to projection problem pf this geometric property plays fundamentally important role in developing mlfre see section testing dual feasibility via hierarchical projection although the dual problem in has nice geometric properties it is challenging to determine the feasibility of given due to the complex dual feasible set an alternative approach is to test if xt xt although is very complicated we show that admits closed form solution by hierarchically splitting into sum of projection operators with respect to collection of simpler sets we first introduce some notations for an index tree let nx aij bkt gtk gij ni nx cji bkt gtk gij ni for node gij the set aij is the sum of bkt corresponding to all its descendant nodes and itself and the set cji the sum excluding itself therefore by the definitions of aij bji and cji we have aij bji cji node 
gij aij bji leaf node gij which implies that this motivates the first pillar of this paper lemma which splits into the sum of two projections onto and respectively lemma let hg kuk with hg nonempty closed convex set and an arbitrary point in hg then the following hold pb min if otherwise pb ii ib pc pc ib iii pc pb pc by part iii of lemma we can split xt in the following form xt xt xt xt as admits closed form solution by part of lemma we can compute if we have xt computed by eq and eq for node gij we note that cji gij where ic gj gk gj inspired by we have the following result lemma let be setpof nonoverlapping index sets hg be set of nonempty closed convex sets and then pc pc zg for rp remark for lemma if all are balls centered at then pc admits closed form solution by lemma and eq we can further splits xt in eq in the following form xt xt where ic consider the right hand side of eq if is leaf node eq implies that and thus admits closed form solution by part of lemma otherwise we continue to split by lemmas and this procedure will always terminate as we reach the leaf nodes see the last equality in eq therefore by repeated application of lemmas and the following algorithm computes the closed form solution of algorithm hierarchical projection input rp the index tree as in definition and positive weights wji for all nodes gij in output vi for set ui rp vi rp for to do hierarchical projection for to ni do vg pb zgi uigi vg gi end for end for the time complexity of algorithm is similar to that of solving its proximal operator pd pni where is the number of features contained in the node gij as pni by definition the time complexity of algorithm is pd and thus log for balanced tree where log the next result shows that returned by algorithm is the projection of onto indeed we have more general results as follows theorem the following hold ugi paij zgij ni pcji zgij for any node gij ii gi mlfre inspired by the kkt conditions and hierarchical projection in this section we motivate mlfre via the kkt condition in eq and the hierarchical projection in algorithm note that for any node gij we have hgij kζk wji if gij wj wji gij gij otherwise moreover the kkt condition in eq implies that ξji wji ni such that xt xd xn ξji thus if kξji wji we can see that gij however we do not have direct access to ξji even if is known because xt is mixture sum of all ξji as shown in eq indeed algorithm turns out to be much more useful than testing the feasibility of given it is able to split all ξji wji from xt this will serve as cornerstone in developing mlfre theorem rigorously shows this property of algorithm theorem let vi be the output of algorithm with input xt and ξji ni be the set of vectors that satisfy eq then the following hold if gij and glr for all glr gij then paij xt gij gt ξkt ii if gij is node and gij then pcji xt gij gt ξkt iii vg wj ni combining eq and part iii of theorem we can see that kvg wj gi by plugging eq and part ii of theorem into we have gij if pbji xt gij pcji xt gij wji if gij is node pbji xt gij wji if gij is leaf node moreover the definition of pbji implies that we can simplify and to the following form gij pcji xt gij wji gij if gij is node if gij is leaf node gij wji gij however and are not applicable to detect inactive nodes as they involve inspired by safe we first estimate set containing let xt gij xt gij and sij zgij pcji zgij then we can relax and as supζ sij ζgij ξij xt gij wji gij if gij is node if gij is leaf node supζ ζgij ζgij xt gij wji gij in view of and we sketch the 
procedure to develop mlfre in the following three steps step we estimate set that contains step we solve for the supreme values in and respectively step we develop mlfre by plugging the supreme values obtained in step to and the effective interval of the regularization parameter the geometric property of the dual problem in pf implies that if moreover for the root node leads to if is an interior point of indeed the following theorem presents stronger results theorem for tgl let λmax max and be defined by eq then λmax xt ii yλ λmax yλ for more discussions on λmax please refer to section in the supplements the proposed feature reduction method for tgl we follow the three steps in section to develop mlfre specifically we first present an accurate estimation of the dual optimum in section then we solve for the supreme values in and in section and finally we present the proposed mlfre in section estimation of the dual optimum we estimate the dual optimum by the geometric properties of projection operators recall that pf we first introduce useful tool to characterize the projection operators definition for closed convex set and point the normal cone to at is nc hζ theorem implies that is known with λmax thus we can estimate in terms of known this leads to theorem that bounds the dual optimum by small ball theorem for tgl suppose that is known with λmax for we define if λmax λmax if λmax hr kn then the following hold nf ii theorem indicates that lies inside the ball of radius centered at solving the nonconvex optimization problems in and we solve for the supreme values in and for notational convenience let kθ ξij hgij kζ gij kr kkxgj ξij theorem implies that and thus gij for all nodes mlfre by and we need to solve the following optimization problems sij supζ ksij ξij if gij is node sij supζ kζk ξij if gij is leaf node gij to develop before we solve problems and we first introduce some notations definition for node gij of an index tree let ic gij gij if for gj gij gk we define virtual child node of gj by gj gj gij where is the number of virtual nodes of depth we set the weights wji for all virtual nodes gij another useful concept is the unique path between the nodes in the tree lemma for any node gij we can find unique path from gij to the root let the nodes on this path be glrl where and ri then the following hold gij glrl gij glr rl ni solving problem we consider the following equivalent problem of sj supζ ksij ξij if gij is node although both the objective function and feasible set of problem are convex it is nonconvex as we need to find the supreme value we derive the closed form solutions of and as follows theorem let xt gij output of algorithm with input xt kr kkxgj and vi be the suppose that cji then sij kvg ii suppose that node gij has virtual child node then for any cji sij iii suppose that node gij has no virtual child node then the following hold if rbd cji then sij if ri cji then for any node gtk gij where and nt let the nodes on the path from gtk to gij be glrl where ri and rt and xt wrl kvg gk then sij min gtk gk solving problem we can solve problem by the inequality theorem for problem we have sij xt gij kkxgij the screening rule in applications the optimal parameter values are usually unknown commonly used approaches to determine an appropriate parameter value such as cross validation and stability selection solve tgl many times along grid of parameter values this process can be very time consuming motivated by this challenge we present mlfre in the following theorem by plugging 
the supreme values found by theorems and into and respectively theorem for the tgl problem suppose that we are given sequence of parameter values λmax λk for each integer we compute λk from given λk via eq then for mlfre takes the form of sij λk wji gij ni mlfre remark we apply mlfre to identify inactive nodes hierarchically in fashion note that we do not need to apply mlfre to node gij if one of its ancestor nodes passes the rule remark to simplify notations we consider tgl with single tree in the proof however all major results are directly applicable to tgl with multiple trees as they are independent from each other we note that many sparse models such as lasso group lasso and sparse group lasso are special cases of tgl with multiple trees synthetic synthetic synthetic synthetic synthetic synthetic figure rejection ratios of mlfre on two synthetic data sets with different feature dimensions experiments we evaluate mlfre on both synthetic and real data sets by two measurements the first measure is the rejection ratios of mlfre for each level of the tree let be the number of zero coefficients in the solution vector and be the index set of the inactive nodes withpdepth identified by mlfre the rejection ratio of the ith layer of mlfre is defined by ri where is the number of features contained in node gik the second measure is speedup namely the ratio of the running time of the solver without screening to the running time of solver with mlfre for each data set we run the solver combined with mlfre along sequence of parameter values equally spaced on the logarithmic scale of from to the solver for tgl is from the slep package it also provides an efficient routine to compute λmax simulation studies we perform experiments on two synthetic data table running time in seconds for solving sets named synthetic and synthetic which tgl along sequence of tuning parameare commonly used in the literature ter values of equally spaced on the logarithmic the true model is xβ scale of from to by the solver for each of the data set we fix without screening see the third column and select we the solver with mlfre see the fifth column create tree with height the dataset solver mlfre speedup average sizes of the nodes with depth and are and respectively thus if synthetic we have roughly and for ic the entries of the data matrix are synthetic standard gaussian with zero tion corr xi xj for the ith and th columns of with for synthetic the entries of are drawn from standard sian with correlation corr xi xj to construct we first randomly select of the nodes with depth and then randomly select of the children nodes with depth of the remaining nodes with depth the components of corresponding to the remaining nodes are populated from standard gaussian and the remaining ones are set to zero figure rejection ratios of mlfre on adni data set with grey matter volume gmv white mater volume wmv and whole brain volume wbv as response vectors respectively fig shows the rejection ratios of all three layers of mlfre we can see that mlfre identifies almost all of the inactive nodes ri and the first layer contributes the most moreover fig also indicates that as the feature dimension and the number of nodes in each level increases mlfre identifies more inactive nodes ri thus we can expect more significant capability of mlfre in identifying inactive nodes on data sets with higher dimensions table shows the running time of the solver with and without mlfre we can observe significant speedups gained by mlfre which are up to times take 
synthetic with for example the solver without mlfre takes about minutes to solve tgl at parameter values combined with mlfre the solver only needs less than one minute for the same task table also shows that the computational cost of mlfre is very is negligible compared to that of the solver without mlfre moreover as mlfre identifies more inactive nodes with increasing feature dimensions table shows that the speedup gained by mlfre becomes more significant as well experiments on adni data set we perform experiments on the alzheimers disease neuroimaging initiative adni data set http the data set consists of patients with single nucleotide polymorphisms snps we create the index tree such that and fig presents the rejection ratios of mlfre on the adni data set with grey matter volume gmv white matter volume wmv and whole brain volume wbv as response respectively we can see that mlfre identifies almost all inactive nodes ri as result we observe significant speedups gained by are about table specifically with gmv as response the solver without mlfre takes about six hours to solve tgl at parameter values however combined with mlfre the solver only needs about eight minutes for the same task moreover table also indicates that the computational cost of mlfre is very is negligible compared to that of the solver without mlfre conclusion in this paper we propose novel feature reduction mlfre method for tgl our major technical contributions lie in two folds the first is the novel hierarchical projection algorithm that is able to exactly and efficiently recover the subgradients of the regularizer with respect to each node from their mixture the second is that we show highly nontrivial nonconvex problem admits closed form solution to the best of our knowledge mlfre is the first screening method that is applicable to tgl an appealing feature of mlfre is that it is exact in the sense that the identified inactive nodes are guaranteed to be absent from the sparse representations experiments on both synthetic and real data sets demonstrate that mlfre is very effective in identifying inactive nodes leading to substantial savings in computational cost and memory usage without sacrificing accuracy moreover the capability of mlfre in identifying inactive nodes on higher dimensional data sets is more significant we plan to generalize mlfre to more general and complicated sparse models group lasso with logistic loss in addition we plan to apply mlfre to other applications brain image analysis and natural language processing acknowledgments this work is supported in part by research grants from nih and nsf references bach jenatton mairal and obozinski optimization with penalties foundations and trends in machine learning bauschke and combettes convex analysis and monotone operator theory in hilbert spaces springer bazaraa sherali and shetty nonlinear programming theory and algorithms wileyinterscience borwein and lewis convex analysis and nonlinear optimization second edition canadian mathematical society boyd and vandenberghe convex optimization cambridge university press chen lin kim carbonell and xing smoothing proximal gradient method for general structured sparse regression annals of applied statistics pages deng yin and zhang group sparse optimization by alternating direction method technical report rice caam report el ghaoui viallon and rabbani safe feature elimination in sparse supervised learning pacific journal of optimization from convex optimization to nonconvex optimization necessary and sufficient 
conditions for global optimality in nonsmooth optimization and related topics springer jenatton gramfort michel obozinski eger bach and thirion multiscale mining of fmri data with hierarchical structured sparsity siam journal on imaging science pages jenatton mairal obozinski and bach proximal methods for hierarchical sparse coding journal of machine learning research jia chan and ma robust and practical face recognition via structured sparsity in european conference on computer vision kim and xing group lasso for regression with structured sparsity in international conference on machine learning kim and xing group lasso for regression with structured sparsity with an application to eqtl mapping the annals of applied statistics liu ji and ye slep sparse learning with efficient projections arizona state university liu and ye regularization for grouped tree structure learning in advances in neural information processing systems liu zhao wang and ye safe screening with variational inequalities and its application to lasso in international conference on machine learning liu zhang yap and shen sparse coding for brain disease classification in medical image computing and intervention ogawa suzuki and takeuchi safe screening of vectors in pathwise svm computation in icml nonlinear optimization princeton university press tibshirani bien friedman hastie simon taylor and tibshirani strong rules for discarding predictors in problems journal of the royal statistical society series wang fan and ye fused lasso screening rules via the monotonicity of subdifferentials ieee transactions on pattern analysis and machine intelligence pp wang wonka and ye scaling svm and least absolute deviations via exact data reduction in international conference on machine learning wang wonka and ye lasso screening rules via dual polytope projection journal of machine learning research wang and ye feature reduction for lasso via decomposition of convex sets advances in neural information processing systems xiang xu and ramadge learning sparse representation of high dimensional data on large scale dictionaries in nips yogatama faruqui dyer and smith learning word representations with hierarchical sparse coding in international conference on machine learning yogatama and smith linguistic structured sparsity in text categorization in proceedings of the annual meeting of the association for computational linguistics yuan and lin model selection and estimation in regression with grouped variables journal of the royal statistical society series zhao rocha and yu the composite absolute penalties family for grouped and hierarchical variable selection annals of statistics zou and hastie regularization and variable selection via the elastic net journal of the royal statistical society series 
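The hierarchical projection routine that the section above builds on (Algorithm 1) sweeps the index tree from the deepest level up and splits a vector into per-node components by projecting each node's residual onto an ℓ2 ball. The Python sketch below is a schematic rendering, not the paper's implementation: it assumes the tree properties stated above (same-depth nodes disjoint, each node contained in its parent), the node radii are placeholders for the weighted radii used in the text, and the additional bookkeeping the paper uses to recover the per-node subgradients exactly is not reproduced.

```python
import numpy as np

def ball_projection(v, radius):
    """Projection of v onto the l2 ball of the given radius centered at 0."""
    nrm = np.linalg.norm(v)
    return v if nrm <= radius else v * (radius / nrm)

def hierarchical_projection(z, levels, radii):
    """Bottom-up splitting of z into per-node components.

    levels: list of levels, deepest level first; each level is a list of nodes,
            each node an integer index array (a parent contains its children,
            nodes of the same depth do not overlap).
    radii:  dict mapping (level_idx, node_idx) to that node's ball radius.

    Returns the per-node components v and the leftover residual, so that
    z = residual + sum of the components.
    """
    u = np.zeros_like(z)                  # accumulated components of descendants
    v = {}
    for li, level in enumerate(levels):   # deepest level first
        for ni, idx in enumerate(level):
            comp = np.zeros_like(z)
            # project what is left of z on this node after removing
            # the components already attributed to its descendants
            comp[idx] = ball_projection(z[idx] - u[idx], radii[(li, ni)])
            v[(li, ni)] = comp
            u = u + comp
    return v, z - u
```

In the screening context above, z would be X^T θ for a dual feasible point θ, and comparing the norm of each node's component against its (weighted) radius is what drives the per-node inactivity tests.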
Minimum Weight Perfect Matching via Blossom Belief Propagation

Sungsoo, Sejun, Michael, Jinwoo
School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Korea; Theoretical Division and Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, USA
jinwoos, chertkov

Abstract

Belief propagation (BP) is a popular algorithm for computing a MAP assignment over a distribution represented by a graphical model (GM). It has been shown that BP can solve a number of combinatorial optimization problems, including minimum weight matching, shortest path, network flow, and vertex cover, under the following common assumption: the respective linear programming (LP) relaxation is tight, i.e., no integrality gap is present. However, when the LP shows an integrality gap, no model has been known which can be solved systematically via sequential applications of BP. In this paper, we develop the first such algorithm, coined Blossom-BP, for solving the minimum weight matching problem over arbitrary graphs. Each step of the sequential algorithm requires applying BP over a modified graph constructed by contractions and expansions of blossoms, i.e., odd sets of vertices. Our scheme guarantees termination in O(n²) BP runs, where n is the number of vertices in the original graph. In essence, Blossom-BP offers a distributed version of the celebrated Edmonds' blossom algorithm by jumping at once over many sub-steps with a single BP. Moreover, our result provides an interpretation of the Edmonds' algorithm as a sequence of LPs.

Introduction

Graphical models (GMs) provide a useful representation for reasoning in a number of scientific disciplines. Such models use a graph structure to encode the joint probability distribution, where vertices correspond to random variables and edges specify conditional dependencies. An important inference task in many applications involving GMs is to find the most likely assignment to the variables in a GM, i.e., the maximum a posteriori (MAP) assignment. Belief propagation (BP) is a popular algorithm for approximately solving the MAP inference problem; it is an iterative, message-passing one that is exact on tree-structured GMs. BP often shows remarkably strong heuristic performance beyond trees, i.e., over loopy GMs. Furthermore, BP is of particular relevance to large-scale problems due to its potential for parallelization and its ease of programming within modern programming models for parallel computing, e.g., GraphLab, GraphChi, and OpenMP.

The convergence and correctness of BP was recently established for a certain class of loopy GM formulations of several classical combinatorial optimization problems, including matching, perfect matching, shortest path, independent set, network flow, and vertex cover. The important common feature of these models is that BP converges to a correct assignment when the linear programming (LP) relaxation of the combinatorial optimization is tight, i.e., when it shows no integrality gap. The LP tightness is an inevitable condition to guarantee the performance of BP, and no combinatorial optimization instance has been known where BP would be used to solve problems without the LP tightness. On the other hand, in the LP literature, it has been extensively studied how to enforce the LP tightness via solving multiple intermediate LPs that are systematically designed, e.g., via the cutting-plane method. Motivated by these studies, we pose a similar question for BP: how to enforce correctness of BP, possibly by solving multiple intermediate BPs. In this paper, we show how to resolve this question for the minimum weight (or cost) perfect matching problem over arbitrary graphs.

Contribution. We develop an algorithm, coined Blossom-BP, for solving the minimum weight matching problem over an
arbitrary graph our algorithm solves multiple intermediate bps until the final bp outputs the solution the algorithm is sequential where each step includes running bp over contracted graph derived from the original graph by contractions and infrequent expansions of blossoms odd sets of vertices to build such scheme we first design an algorithm coined solving multiple intermediate lps second we show that each lp is solvable by bp using the recent framework that establishes generic connection between bp and lp for the first part methods solving multiple intermediate lps for the minimum weight matching problem have been discussed by several authors over the past decades and provably scheme was recently suggested however lps in were quite complex to solve by bp to address the issue we design much simpler intermediate lps that allow utilizing the framework of we prove that and guarantee to terminate in of bp and lp runs respectively where is the number of vertices in the graph to establish the polynomial complexity we show that intermediate outputs of and are equivalent to those of variation of the algorithm which is the latest implementation of the blossom algorithm due to kolmogorov the main difference is that updates parameters by maintaining disjoint tree graphs while and implicitly achieve this by maintaining disjoint cycles claws and tree graphs notice however that these combinatorial structures are auxiliary as required for proofs and they do not appear explicitly in the algorithm descriptions therefore they are much easier to implement than that maintains complex data structures priority queues to the best of our knowledge and are the simplest possible algorithms available for solving the problem in polynomial time our proof implies that in essence offers distributed version of the edmonds blossom algorithm jumping at once over many of with single bp the subject of solving convex optimizations other than lp via bp was discussed in the literature however we are not aware of any similar attempts to solve integer programming via sequential application of bp we believe that the approach developed in this paper is of broader interest as it promises to advance the challenge of designing map solvers for broader class of gms furthermore stands alone as providing an interpretation for the edmonds algorithm in terms of sequence of tractable lps the edmonds original lp formulation contains exponentially many constraints thus naturally suggesting to seek for sequence of lps each with subset of constraints gradually reducing the integrality gap to zero in polynomial number of steps however it remained illusive for decades even when the bipartite lp relaxation of the problem has an integral optimal solution the standard edmonds algorithm keeps contracting and expanding sequence of blossoms as we mentioned earlier we resolve the challenge by showing that is implicitly equivalent to variant of the edmonds algorithm with three major modifications via maintaining cycles claws and trees addition of small random corrections to weights and initialization using the bipartite lp relaxation organization in section we provide backgrounds on the minimum weight perfect matching problem and the bp algorithm section describes our main result and algorithms where the proof is given in section preliminaries minimum weight perfect matching given an undirected graph matching of is set of edges where perfect matching additionally requires to cover every vertices of given integer edge weights or costs we the minimum 
weight (or cost) perfect matching problem consists in computing a perfect matching which minimizes the sum of its associated edge weights. The problem is formulated as the following integer program (IP):

minimize w·x subject to Σ_{e∈δ(v)} x_e = 1 for all v ∈ V, and x_e ∈ {0, 1} for all e ∈ E.

Without loss of generality one can assume that the weights are strictly positive: if some edges have negative weights, one can add the same positive constant to all edge weights, and this does not alter the solution of the IP. Furthermore, we assume that the IP is feasible, i.e., there exists at least one perfect matching in G. One can naturally relax the integer constraints to x_e ∈ [0, 1] to obtain a linear program (LP), which is called the bipartite relaxation. The integrality of the bipartite LP relaxation is not guaranteed; however, it can be enforced by adding the blossom inequalities:

minimize w·x subject to Σ_{e∈δ(v)} x_e = 1 for all v ∈ V, Σ_{e∈δ(S)} x_e ≥ 1 for all S ∈ L, and x_e ∈ [0, 1],

where L is a collection of odd cycles in G, called blossoms, and δ(S) is the set of edges between S and V \ S. It is known that if L is the collection of all the odd cycles in G, then this LP always has an integral solution. However, the number of odd cycles is exponential in the number of vertices, and thus solving the LP with all blossom inequalities is computationally intractable. To overcome this complication we look for a tractable subset of blossoms, of polynomial size, which still guarantees integrality. Our algorithm searches for such a tractable subset iteratively: at each iteration it adds or subtracts a blossom.

Belief propagation for linear programming. A joint distribution of n binary random variables Z = [Z_i] is called a graphical model (GM) if it factorizes as follows: for z = [z_i] ∈ Ω^n,

Pr[Z = z] ∝ Π_i ψ_i(z_i) Π_{α∈F} ψ_α(z_α),

where {ψ_i, ψ_α} are given functions, the factor set F is a collection of subsets {α_1, ..., α_k} of {1, ..., n}, and z_α is the projection of z onto the dimensions included in α (for example, if α = {1, 3}, then z_α = [z_1, z_3]); in particular, ψ_i is called a variable factor. An assignment z* is called a MAP solution if z* = argmax_z Pr[Z = z]. Computing a MAP solution is typically computationally intractable unless the induced bipartite graph of factors and variables (the factor graph) has bounded treewidth. The belief propagation (BP) algorithm is a popular and simple heuristic for approximating the MAP solution of a GM, where it iterates messages over the factor graph. BP computes the MAP solution exactly after a sufficient number of iterations if the factor graph is a tree and the MAP solution is unique; however, if the graph contains loops, BP is not guaranteed to converge to the MAP solution in general. Due to the space limitation, we provide detailed background on BP in the supplemental material.

Consider the following GM for x = [x_i] ∈ {0, 1}^n and w = [w_i] ∈ R^n:

Pr[X = x] ∝ Π_i e^{−w_i x_i} Π_{α∈F} ψ_α(x_α),

where F is the set of factors and the factor function ψ_α for α ∈ F is defined as ψ_α(x_α) = 1 if A_α x_α ≥ b_α and C_α x_α = d_α, and ψ_α(x_α) = 0 otherwise, for some matrices A_α, C_α and vectors b_α, d_α. Now we consider the linear program corresponding to this GM:

minimize w·x subject to ψ_α(x_α) = 1 for all α ∈ F, and x_i ∈ [0, 1].

One observes that the MAP solution of the GM corresponds to the optimal solution of this LP if the LP has an integral solution. Furthermore, the following sufficient conditions relating BP to LP are known.

Theorem. The BP applied to the GM above converges to the solution of the corresponding LP if the following conditions hold: (C1) the LP has a unique integral solution x*, i.e., it is tight; (C2) for every i, the number of factors associated with x_i is at most two; (C3) for every factor ψ_α, every x_α ∈ {0, 1}^{|α|} with ψ_α(x_α) = 1, and every i ∈ α with x_i ≠ x*_i, there exists j ∈ α such that ψ_α(x'_α) = 1, where x'_k = x*_k if k ∈ {i, j} and x'_k = x_k otherwise, and ψ_α(x''_α) = 1, where x''_k = x_k if k ∈ {i, j} and x''_k = x*_k otherwise.

Main result: blossom belief propagation. In this section we introduce our main result, an iterative algorithm coined Blossom-BP for solving the minimum weight perfect matching problem over an arbitrary graph, where the algorithm uses BP as a subroutine. We first describe the algorithm using an LP solver instead of BP, in which case we call it Blossom-LP; its BP implementation is explained afterwards.
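To make the role of the blossom inequalities concrete, the following small script (our illustration, not part of the paper; it assumes numpy and scipy are available) solves the bipartite relaxation of the perfect-matching LP on two triangles joined by one expensive edge, where the relaxation is half-integral, and then adds a single blossom inequality, which recovers the integral perfect matching.

# Illustration (not from the paper): the bipartite relaxation is fractional on two
# triangles joined by a bridge edge, and one blossom inequality restores integrality.
import numpy as np
from scipy.optimize import linprog

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
w = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.0])   # the bridge (2, 3) is expensive

# Degree constraints: for every vertex v, sum_{e in delta(v)} x_e = 1.
A_eq = np.zeros((6, len(edges)))
for j, (u, v) in enumerate(edges):
    A_eq[u, j] = 1.0
    A_eq[v, j] = 1.0
b_eq = np.ones(6)
bounds = [(0.0, 1.0)] * len(edges)

relaxed = linprog(w, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("bipartite relaxation:", np.round(relaxed.x, 2), "cost", relaxed.fun)
# -> x = 1/2 on the six triangle edges and 0 on the bridge; fractional, cost 3.

# Blossom inequality for the odd set S = {0, 1, 2}: sum_{e in delta(S)} x_e >= 1.
# Only the bridge edge crosses S, so the constraint is x_bridge >= 1,
# written as -x_bridge <= -1 for linprog.
A_ub = np.zeros((1, len(edges)))
A_ub[0, edges.index((2, 3))] = -1.0
b_ub = np.array([-1.0])

tightened = linprog(w, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("with one blossom inequality:", np.round(tightened.x, 2), "cost", tightened.fun)
# -> the integral perfect matching {(0,1), (2,3), (4,5)} with cost 12.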
The Blossom-LP algorithm. Let us modify the weights as w_e ← w_e + n_e, where n_e is a random number chosen independently for each edge e from a small interval. Note that the solution of the minimum weight perfect matching problem remains the same after this modification because the overall added noise does not exceed the gap between objective values of distinct perfect matchings under the original integer weights. The Blossom-LP algorithm updates the following parameters iteratively: a laminar collection L of odd cycles in G, and y = [y_v, y_S : v ∈ V, S ∈ L]. In the above, L is called laminar if for every S, T ∈ L, either S ∩ T = ∅, S ⊂ T, or T ⊂ S; we call S ∈ L an outer blossom if there exists no T ∈ L such that S ⊂ T. Initially, L = ∅ and y_v = 0 for all v ∈ V. The algorithm iterates between Step 1 and Step 2 and terminates at Step 3.

Step 1 (solving an LP on a contracted graph). First construct an auxiliary contracted graph G' by contracting every outer blossom in L to a single vertex, where the modified weights are defined as w'_e = w_e − Σ y_v − Σ y_S, with the first sum over endpoints of e lying inside contracted blossoms and the second over blossoms S ∈ L with e ∈ δ(S). We call a vertex of G' arising from a contracted blossom a blossom vertex, call G' the contracted graph, and solve the following LP:

minimize Σ_e w'_e x_e subject to Σ_{e∈δ(v)} x_e = 1 if v is a vertex of the original graph, Σ_{e∈δ(v)} x_e ≤ 1 if v is a blossom vertex, and x_e ∈ [0, 1].

Step 2 (updating parameters). After we obtain a solution x = [x_e] of the LP in Step 1, the parameters are updated as follows. (a) If x is integral and every blossom vertex is matched by x, then proceed to the termination step. (b) Else, if there exists a blossom vertex S left unmatched by x, then we choose one such blossom, remove it from L, and update the variables y_v for v ∈ S accordingly; we call this step a blossom expansion. (c) Else, if there exists an odd cycle C in G' such that x_e = 1/2 for every edge in it, we choose one of them, add the corresponding blossom to L, and update y_v for v ∈ V(C) via an alternating sum of the modified weights of the edges of C, where V(C) and E(C) are the sets of vertices and edges of C respectively and the sign of edge e in the sum for v is determined by d(v, e), the graph distance from vertex v to edge e in the odd cycle. The algorithm also remembers the odd cycle corresponding to every blossom. If (b) or (c) occurs, go to Step 1.

Step 3 (termination). The algorithm iteratively expands the blossoms in L to obtain the minimum weight perfect matching, as follows: (i) let M be the set of edges in the original graph G whose corresponding edge in the contracted graph has x_e = 1, where x is the last solution of the LP in Step 1; (ii) if L = ∅, output M; (iii) otherwise, choose an outer blossom S ∈ L, remove it from L, and update the contracted graph by expanding S; (iv) let v be the vertex in S covered by M and let M_S be a matching covering S \ {v} using the edges of the odd cycle of S; update M ← M ∪ M_S and go to step (ii). An example of the evolution of the algorithm is described in the supplementary material. We provide the following running time guarantee for this algorithm, which is proven in the next section.

Theorem. Blossom-LP outputs the minimum weight perfect matching in O(|V|²) iterations.

Blossom-BP: the BP implementation. In this section we show that the Blossom-LP algorithm can be implemented using BP. The result is derived in two steps, where the first one consists in the following theorem, proven in the supplementary material due to the space limitation.

Theorem. The LP of Step 1 always has an optimal solution x with x_e ∈ {0, 1/2, 1} such that the collection of its half-valued edges forms disjoint odd cycles.

Next, let us design a BP for obtaining the solution of the LP in Step 1. First we duplicate each edge of the contracted graph into two parallel copies, each inheriting the weight of its original edge, and define a new graph on the same vertex set with the duplicated edge set. Then we build the following equivalent LP over the duplicated graph:

minimize Σ_e w'_e x_e subject to Σ_{e∈δ(v)} x_e = 2 if v is a vertex of the original graph, Σ_{e∈δ(v)} x_e ≤ 2 if v is a blossom vertex, and x_e ∈ [0, 1].

One can easily observe that solving this LP is equivalent to solving the LP of Step 1 due to our construction of the duplicated graph, and that it always has an integral optimal solution due to the half-integrality theorem above. Now construct the following GM for this LP:

Pr[X = x] ∝ Π_e e^{−w'_e x_e} Π_v ψ_v(x_{δ(v)}),

where the factor function ψ_v is defined as ψ_v(x_{δ(v)}) = 1 if v is a vertex of the original graph and Σ_{e∈δ(v)} x_e = 2, or if v is a blossom vertex and Σ_{e∈δ(v)} x_e ≤ 2, and ψ_v(x_{δ(v)}) = 0 otherwise. For this GM we derive the following corollary of the BP–LP theorem of the previous section, proven in the supplementary material due to the space limitation.

Corollary. If the LP over the duplicated graph has a unique solution, then the BP applied to the GM above converges to it.

The uniqueness condition stated in the corollary above is easy to guarantee by adding small random noise to the edge weights. The corollary shows that BP can compute the solution of the LP in Step 1. Proof of the main theorem. First, it is relatively easy
to prove the correctness of as stated in the following lemma lemma if terminates it outputs the minimum weight perfect matching denote the parameter values at proof we let the termination of then the strong duality theorem and the complementary slackness condition imply that where be dual solution of here observe that and cover inside and outside of respectively hence one can naturally define to cover all yv ys for all if we define for the output matching of as if and otherwise then and satisfy the following complementary slackness condition we where is the last set of blossoms at the termination of in the above the first equality is from and the definition of and the second equality is because the construction of in is designed to enforce this proves that is the optimal solution of lp and is the minimum weight perfect matching thus completing the proof of lemma to guarantee the termination of in polynomial time we use the following notions definition claw is subset of edges such that every edge in it shares common vertex called center with all other edges the claw forms star graph definition given graph set of odd cycles set of claws and matching is called decomposition of if all sets in are disjoint and each vertex is covered by exactly one set among them to analyze the running time of we construct an iterative auxiliary algorithm that outputs the minimum weight perfect matching in bounded number of iterations the auxiliary algorithm outputs decomposition at each iteration and it terminates when the decomposition corresponds to perfect matching we will prove later that the auxiliary algorithm and are equivalent and therefore conclude that the iteration of is also bounded to design the auxiliary algorithm we consider the following dual of lp yv minimize subject to yv yu yv next we introduce an auxiliary iterative algorithm which updates iteratively the blossom set and also the set of variables yv ys for we call edge tight if we yu yv ys now we are ready to describe the auxiliary algorithm having the following parameters and yv ys for decomposition of tree graph consisting of and vertices initially set and in addition set yv ys by an optimal solution of lp with and by the decomposition of consisting of tight edges with respect to yv ys the parameters are updated iteratively as follows the auxiliary algorithm iterate the following steps until becomes perfect matching choose vertex from the following rule expansion if choose claw of center blossom vertex and choose vertex in remove the blossom corresponding to from and update by expanding it find matching covering all vertices in and except for and update contraction otherwise choose cycle add and remove it from and respectively in addition is also updated by contracting and choose the contracted vertex in and set yr set tree graph having as vertex and no edge continuously increase yv of every vertex in and decrease yv of vertex in by the same amount until one of the following events occur grow if tight edge exists where is vertex of and is covered by find tight edge add edges to and remove from where becomes vertices of respectively matching if tight edge exists where is vertex of and is covered by find matching that covers update and remove from cycle if tight edge exists where are vertices of find cycle and matching that covers update and add to claw if blossom vertex with yv exists find claw of center and matching covering update and add to if grow occurs resume the step otherwise go to the step note that the auxiliary algorithm updates 
parameters in such way that the number of vertices in every claw in the decomposition is since every vertex has degree hence there exists unique matching in the expansion step furthermore the existence of decomposition at the initialization can be guaranteed using the complementary slackness condition and the of lp we establish the following lemma for the running time of the auxiliary algorithm where its proof is given in the supplemental material due to the space limitation lemma the auxiliary algorithm terminates in iterations now we are ready to prove the equivalence between the auxiliary algorithm and the prove that the numbers of iterations of and the auxiliary algorithm are equal to this end given decomposition observe that one can choose the corresponding xe that satisfies constraints of lp if is an edge in or xe if is an edge in otherwise similarly given xe that satisfies constraints of lp one can find the corresponding decomposition furthermore one can also define weight in for the auxiliary algorithm as does we yv ys in the auxiliary algorithm is tight if and only if under these equivalences in parameters between and the auxiliary algorithm we will use the induction to show that decompositions maintained by both algorithms are equal at every iteration as stated in the following lemma whose proof is given in the supplemental material due to the space limitation lemma define the following notation yv and yv ys and are parts of which involves and does not involve in respectively then the and the auxiliary algorithm update parameters equivalently and output the same of at each iteration the above lemma implies that also terminates in iterations due to lemma this completes the proof of theorem the equivalence between the solution of lp in and the decomposition in the auxiliary algorithm implies that lp is always has solution and hence one of steps or always occurs conclusion the bp algorithm has been popular for approximating inference solutions arising in graphical models where its distributed implementation associated ease of programming and strong parallelization potential are the main reasons for its growing popularity this paper aims for designing scheme solving the maximum weigh perfect matching problem we believe that our approach is of broader interest to advance the challenge of designing map solvers in more general gms as well as distributed and parallel solvers for ips acknowledgement this work was supported by institute for information communications technology promotion iitp grant funded by the korea government msip content visual browsing technology in the online and offline environments the work at lanl was carried out under the auspices of the national nuclear security administration of the department of energy under contract no references yedidia freeman and weiss constructing approximations and generalized belief propagation algorithms ieee transactions on information theory vol no pp richardson and urbanke modern coding theory cambridge university press mezard and montanari information physics and computation ser oxford graduate texts oxford oxford univ press wainwright and jordan graphical models exponential families and variational inference foundations and trends in machine learning vol no pp gonzalez low and guestrin residual splash for optimally parallelizing belief propagation in international conference on artificial intelligence and statistics low gonzalez kyrola bickson guestrin and hellerstein graphlab new parallel framework for machine learning in 
conference on uncertainty in artificial intelligence uai kyrola blelloch and guestrin graphchi graph computation on just pc in operating systems design and implementation osdi chandra menon dagum kohr maydan and mcdonald parallel programming in openmp morgan kaufmann isbn bayati shah and sharma for maximum weight matching convergence correctness and lp duality ieee transactions on information theory vol no pp sanghavi malioutov and willsky linear programming analysis of loopy belief propagation for weighted matching in neural information processing systems nips huang and jebara loopy belief propagation for bipartite maximum weight bmatching in artificial intelligence and statistics aistats bayati borgs chayes zecchina for weighted on arbitrary graphs and its relation to linear programs with integer solutions siam journal in discrete math vol pp ruozzi nicholas and tatikonda st paths using the algorithm in annual allerton conference on communication control and computing sanghavi shah and willsky for independent set in neural information processing systems nips gamarnik shah and wei belief propagation for network flow convergence correctness in soda pp park and shin belief propagation for linear programming applications to combinatorial optimization in conference on uncertainty in artificial intelligence uai trick networks with additional structured constraints phd thesis georgia institute of technology padberg and rao odd minimum and in mathematics of operations research vol no pp and holland solving matching problems with linear programming in mathematical programming vol no pp fischetti and lodi optimizing over the first closure in mathematical programming vol no pp chandrasekaran vegh and vempala the cutting plane method is polynomial for perfect matchings in foundations of computer science focs kolmogorov blossom new implementation of minimum cost perfect matching algorithm mathematical programming computation vol no pp edmonds paths trees and flowers canadian journal of mathematics vol pp malioutov johnson and willsky and belief propagation in gaussian graphical models mach learn vol pp weiss yanover and meltzer map estimation linear programming and belief propagation with convex free energies in conference on uncertainty in artificial intelligence uai moallemi and roy convergence of message passing for convex optimization in allerton conference on communication control and computing 
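To connect the preceding discussion of BP for matching to something executable, below is a small sketch (ours, not the authors' code) of the classical min-sum BP updates for the bipartite minimum weight perfect matching (assignment) problem, the LP-tight case in which plain BP is known to converge and which Blossom-BP extends to arbitrary graphs by contracting blossoms. The scalar message form, the synchronous schedule, and the cross-check against SciPy's Hungarian solver are our choices; damping or more iterations may be needed when the optimum is nearly tied.

# Sketch (ours): min-sum BP for the bipartite minimum weight perfect matching problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 6
W = rng.uniform(0.0, 10.0, size=(n, n))   # W[i, j]: cost of matching row i to column j

def min_excluding_own(M, axis):
    # For each entry, the minimum over its row (axis=1) or column (axis=0) excluding itself.
    order = np.sort(M, axis=axis)
    smallest = np.expand_dims(np.take(order, 0, axis=axis), axis)
    second = np.expand_dims(np.take(order, 1, axis=axis), axis)
    return np.where(M == smallest, second, smallest)

alpha = np.zeros((n, n))   # messages passed through the row (degree-one) constraints
beta = np.zeros((n, n))    # messages passed through the column constraints
for _ in range(300):       # a fixed budget is enough for this toy instance
    alpha = W - min_excluding_own(beta, axis=1)
    beta = W - min_excluding_own(alpha, axis=0)

belief = alpha + beta - W                  # min-sum belief that edge (i, j) is used
bp_match = belief.argmin(axis=1)

rows, cols = linear_sum_assignment(W)      # exact solver for verification
print("BP matching       :", bp_match)
print("Hungarian matching:", cols)
print("agree:", np.array_equal(bp_match, cols))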
Efficient Thompson Sampling for Online Recommendation. Jaya Kawale, Hung Bui, Branislav Kveton (Adobe Research, San Jose, CA), Long Tran-Thanh (University of Southampton, Southampton, UK), Sanjay Chawla (Qatar Computing Research Institute, Qatar; University of Sydney, Australia).

Abstract. Matrix factorization (MF) collaborative filtering is an effective and widely used method in recommendation systems. However, the problem of finding an optimal trade-off between exploration and exploitation (otherwise known as the bandit problem), which is crucial in collaborative filtering, has not been previously addressed in this setting. In this paper we present a novel algorithm for online MF recommendation that automatically combines finding the most relevant items with exploring new or less-explored items. Our approach, called Particle Thompson Sampling for MF (PTS), is based on the general Thompson sampling framework, but augmented with a novel, efficient online Bayesian probabilistic matrix factorization method based on the particle filter. Extensive experiments in collaborative filtering using several datasets demonstrate that PTS significantly outperforms the current state of the art.

Introduction. Matrix factorization (MF) techniques have emerged as a powerful tool to perform collaborative filtering in large datasets. These algorithms decompose a matrix R ∈ R^{N×M} into a product of two smaller matrices, U ∈ R^{N×K} and V ∈ R^{M×K}, such that R ≈ UV^T. A variety of methods have been proposed in the literature and have been successfully applied to various domains. Despite their promise, one of the challenges faced by these methods is making recommendations when a new user or item arrives in the system, also known as the cold-start problem. Another challenge is recommending items in an online setting and quickly adapting to user feedback, as required by many real-world applications including online advertising, serving personalized content, link prediction, and product recommendations. In this paper we address these two challenges in the problem of online matrix completion by combining matrix completion with bandit algorithms. This setting was introduced in previous work, but our work is the first satisfactory solution to this problem. In the bandit setting, we can model the problem as a repeated game where the environment chooses a row i of R and the learning agent chooses a column j; the value R_ij is revealed, and the goal of the learning agent is to minimize the cumulative regret with respect to the optimal solution, the highest entry in each row of R. The key design principle in the bandit setting is to balance exploration and exploitation, which solves the cold-start problem naturally. For example, in online advertising, exploration implies presenting new ads about which little is known and observing subsequent feedback, while exploitation entails serving ads which are known to attract a high click-through rate.

While many solutions have been proposed for bandit problems, in the last five years or so there has been renewed interest in the use of Thompson sampling (TS), which was originally proposed by Thompson. In addition to having competitive empirical performance, TS is attractive due to its conceptual simplicity. An agent has to choose an action a (a column) from a set of available actions so as to maximize the reward, but it does not know with certainty which action is optimal. Following TS, the agent selects a with the probability that a is the best action. Let θ denote the unknown parameter governing the reward structure and H the history of observations currently available to the agent. The agent chooses a with probability

∫ 1[ E(r | a, θ) = max_{a'} E(r | a', θ) ] Pr(θ | H) dθ,

which can be implemented by simply sampling θ̃ from the posterior Pr(θ | H) and letting a = argmax_{a'} E(r | a', θ̃).
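As a concrete toy instance of this selection rule (our illustration, not from the paper), the sketch below runs Thompson sampling for a Bernoulli bandit, where the conjugate Beta posterior makes the "sample θ̃, act greedily with respect to θ̃" step exact and cheap; the matrix-factorization setting developed below replaces this conjugate posterior with a particle approximation. The arm probabilities and horizon are arbitrary choices for the demo.

# Toy illustration (not from the paper): Thompson sampling for a Bernoulli bandit.
import numpy as np

rng = np.random.default_rng(1)
true_theta = np.array([0.15, 0.40, 0.55])      # unknown click probabilities (demo values)
n_arms = len(true_theta)
successes = np.ones(n_arms)                    # Beta(1, 1) prior on each arm
failures = np.ones(n_arms)
regret = 0.0

for t in range(5000):
    theta_sample = rng.beta(successes, failures)   # one posterior sample per arm
    arm = int(theta_sample.argmax())               # act greedily w.r.t. the sample
    reward = rng.random() < true_theta[arm]        # environment feedback
    successes[arm] += reward
    failures[arm] += 1 - reward
    regret += true_theta.max() - true_theta[arm]

print("cumulative regret after 5000 steps:", round(regret, 1))
print("posterior means:", np.round(successes / (successes + failures), 3))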
However, for many realistic scenarios, including matrix completion, sampling from Pr(θ | H) is not computationally efficient, and thus recourse to approximate methods is required to make TS practical. We propose an algorithm for solving our problem which we call Particle Thompson Sampling for matrix factorization (PTS). PTS is a combination of particle filtering for online Bayesian parameter estimation and TS. In the case when the posterior does not have a closed form, particle filtering uses a set of weighted samples (particles) to estimate the posterior density. In order to overcome the problem of the huge parameter space, we utilize Rao-Blackwellization and design a suitable Monte Carlo kernel to come up with a computationally and statistically efficient way to update the set of particles as new data arrives in an online fashion. Unlike the prior work, which approximates the posterior of the latent item features by a single point estimate, our approach can maintain a much better approximation of the posterior of the latent features by a diverse set of particles. Our results on five different real datasets show a substantial improvement in the cumulative regret over other online methods.

Probabilistic matrix factorization. We first review the probabilistic matrix factorization approach to the matrix completion problem. In matrix completion, a portion R^o of the matrix R = (r_ij) is observed and the goal is to infer the unobserved entries of R. In probabilistic matrix factorization (PMF), R is assumed to be a noisy perturbation of a matrix R̄ = UV^T, where U = (u_1, ..., u_N)^T and V = (v_1, ..., v_M)^T are termed the user and item latent features and the latent dimension K is typically small. The full generative model of PMF is

u_i ~ N(0, σ_u² I_K),  v_j ~ N(0, σ_v² I_K),  r_ij | u_i, v_j ~ N(u_i^T v_j, σ²),

where the variances (σ², σ_u², σ_v²) are the parameters of the model. (Figure: graphical model of the probabilistic matrix factorization model.) We also consider a full Bayesian treatment where the variances σ_u² and σ_v² are drawn from an inverse-Gamma prior while σ² is held fixed, i.e., the precisions λ_u = σ_u^{-2} and λ_v = σ_v^{-2} are Gamma distributed. This is a special case of Bayesian PMF, where we only consider isotropic Gaussians (Bayesian PMF considers the full covariance structure, but its authors also noted that isotropic Gaussians are effective enough). Given this generative model, from the observed ratings R^o we would like to estimate the parameters U and V, which will allow us to complete the matrix R. PMF is a MAP method which finds (U, V) to maximize Pr(U, V | R^o, σ², σ_u², σ_v²) via stochastic gradient ascent (alternating least squares can also be used). Bayesian PMF attempts to approximate the full posterior Pr(U, V | R^o, σ², σ_u², σ_v²).

The joint posterior of U and V is intractable; however, the structure of the graphical model (see the figure) can be exploited to derive an efficient Gibbs sampler. We now provide the expressions for the conditional probabilities of interest. Suppose that V, σ², and σ_u² are known; then the vectors u_i are independent for each user i. Let rts(i) = {j : (i, j) ∈ R^o} be the set of items rated by user i, and observe that the ratings r_ij, j ∈ rts(i), are generated from u_i following a simple conditional linear Gaussian model. Thus the posterior of u_i has the closed form

Pr(u_i | R^o, V, σ², σ_u²) = N(u_i | μ_i^u, (Λ_i^u)^{-1}),  with
Λ_i^u = (1/σ²) Σ_{j ∈ rts(i)} v_j v_j^T + (1/σ_u²) I_K,  μ_i^u = (Λ_i^u)^{-1} ζ_i^u,  ζ_i^u = (1/σ²) Σ_{j ∈ rts(i)} r_ij v_j.

The conditional posterior Pr(V | R^o, U, σ², σ_v²) is similarly factorized into a product over j = 1, ..., M of Gaussians N(v_j | μ_j^v, (Λ_j^v)^{-1}), where the mean and precision are defined analogously. The posterior of the precision λ_u = σ_u^{-2} given U (and similarly for λ_v given V) is obtained from the conjugacy of the Gamma prior and the isotropic Gaussian: Pr(λ_u | U) is Gamma with shape increased by NK/2 and rate increased by ||U||_F²/2. Although not required for Bayesian PMF, we give the predictive likelihood expression, obtained by integrating out u_i:

Pr(r_ij | R^o, V, σ², σ_u²) = N(r_ij | (μ_i^u)^T v_j, σ² + v_j^T (Λ_i^u)^{-1} v_j).

The advantage of the Bayesian approach is that the uncertainty of the estimates of U and V is available, which is crucial for exploration in the bandit setting. However, the bandit setting requires maintaining online
estimates of the posterior as the ratings arrive over time which makes it rather awkward for mcmc in this paper we instead employ sequential smc method for online bayesian inference similar to the gibbs sampler we exploit the above closed form updates to design an efficient particle filter for maintaining the posterior over time recommendation bandit in typical deployed recommendation system users and observed ratings also called rewards arrive over time and the task of the system is to recommend item for each user so as to maximize the accumulated expected rewards the bandit setting arises from the fact that the system needs to learn over time what items have the best ratings for given user to recommend and at the same time sufficiently explore all the items we formulate the matrix factorization bandit as follows we assume that ratings are generated following eq with fixed but unknown latent features at time the environment chooses user it and the system learning agent needs to recommend an item jt the user then rates the recommended item with rating rit jt and the agent receives this rating as reward we abbreviate this as rto rit jt the system recommends item jt using policy that takes into account the history of the observed ratings prior to time where ik jk rk the highest expected reward the system can earn at time is maxj ui vj and this is achieved if the optimal item arg maxj is recommended since are unknown the optimal item is also not known priori the quality of the recommendation system is measured by its expected cumulative regret cr rt rit it rt max uit vj where the expectation is taken with respect to the choice of the user at time and also the randomness in the choice of the recommended items by the algorithm particle thompson sampling for matrix factorization bandit while it is difficult to optimize the cumulative regret directly ts has been shown to work well in practice for contextual linear bandit to use ts for matrix factorization bandit the main difficulty is to incrementally update the posterior of the latent features which control the reward structure in this subsection we describe an efficient particle filter rbpf designed to exploit the specific structure of the probabilistic matrix factorization model let be the control parameters and let posterior at time be pt pr σu σv the standard algorithm particle thompson sampling for matrix factorization pts global control params σu σv for bayesian version initializeparticles ro for do current user sample if sample pr ui sample new ui due to arg maxj recommend for user and observe rating rto updateposterior end for procedure pdate osterior has the structure particles where particles σu σv rt λui λui ζiu ζiu see eq wd reweighting see eq wd pr rij σu see eq resampling for all do move λui λui vj vj ζiu ζiu rvj see eq pr ui if update the norm of λvj λvj ζjv ζiu pr vj if pr σu end for return end procedure see eq particle filter would sample all of the parameters σu σv unfortunately in our experiments degeneracy is highly problematic for such vanilla particle filter pf even when σu σv are assumed known see fig our rbpf algorithm maintains the posterior distribution pt as follows each of the particle conceptually represents at σu and σv are integrated pd out analytically whenever possible thus pt σu is approximated by σu where is the number of particles crucially since the particle filter needs to estimate set of parameters having an effective and efficient move kt σu σu stationary pt is essential our design of the move kernel kt 
are based on two observations first we can make use of and σv as auxiliary variables effectively sampling σv σu pt σv σu and then σu σv pt σu σv however this move would be highly inefficient due to the number of variables that need to be sampled at each update our second observation is the key to an efficient implementation note that latent features for all users except the current user are independent of the current observed rating rto pt σu σu therefore at time we only have to resample uit as there is no need to resample furthermore it suffices to resample the latent feature of the current item vjt this leads to an efficient implementation of the rbpf where each particle in fact σu σv where σv are auxiliary variables and for the kernel move kt we sample uit σu then σv and σu the pts algorithm is given in algo at each time the complexity is where and are the maximum number of users who have rated the same item and the maximum when there are fewer users than items similar strategy can be derived to integrate out and σv instead this is not inconsistent with our previous statement that conceptually particle represents only pointmass distribution δv σu number of items rated by the same user respectively the dependency on arises from having to invert the precision matrix but this is not concern since the rank is typically small line can be replaced by an incremental update with caching after line we can incrementally update λvj and ζjv for all item previously rated by the current user this reduces the complexity to potentially significant improvement in real recommendation systems where each user tends to rate small number of items analysis we believe that the regret of pts can be bounded however the existing work on ts and bandits does not provide sufficient tools for proper analysis of our algorithm in particular while existing techniques can provide log or for regret bounds for our problem these bounds are typically linear in the number of entries of the observation matrix or at least linear in the number of users which is typically very large compared to thus an ideal regret bound in our setting is the one that has dependency or no dependency at all on the number of users key obstacle of achieving this is that while the conditional posteriors of and are gaussians neither their marginal and joint posteriors belong to well behaved classes conjugate posteriors or having closed forms thus novel tools that can handle generic posteriors are needed for efficient analysis moreover in the general setting the correlation between ro and the latent features and are see for more details as existing techniques are typically designed for efficiently learning linear regressions they are not suitable for our problem nevertheless we show how to bound the regret of ts in very specific case of matrices and we leave the generalization of these results for future work in particular we analyze the regret of pts in the setting of gopalan et al we model our problem as follows the parameter space is θu θv where θu and θv are discretizations of the parameter spaces of factors and for some integer for the sake of theoretical analysis we assume that pts can sample from the full posterior we also assume that ri for some θu and θu note that in this setting the item in expectation is the same for all users we denote this item by arg max and assume that it is uniquely optimal for any we leverage these properties in our analysis the random variable xt at time is pair of random rating matrix rt ri and random row it the 
action at at time is column jt the observation is yt it rit jt we bound the regret of pts as follows theorem for any and there exists such that pts on θu θv mends items in steps at most log times with probability of at least where is constant independent of proof by theorem of gopalan et al the number of recommendations is bounded by log where is constant independent of now we bound log by counting the number of times that pts selects models that can not be distinguished from after observing yt under the optimal action let θj θu θv ui vj vj vk be the set of such models where action is optimal suppose that our algorithm chooses model θj then the kl divergence between the distributions of ratings ri under models and is bounded from below as ui vj for any the last inequality follows from the fact that ui vj ui vj because is uniquely optimal in we know that ui vj because the granularity of our discretization is let in be any row indices then the kl divergence between the distributions of ratings in positions in under models and pn is dkl uit vj by theorem of gopalan et al the models θj pn are unlikely to be chosen by pts in steps when dkl uit vj log this happens dkl ui vj after at most log selections of θj now we apply the same argument to all θj in total and sum up the corresponding regrets remarks note that theorem implies at log regret bound that holds with high probability here plays the role of gap the smallest possible difference between the expected ratings of item in any row in this sense our result is log and is of similar magnitude as the results in gopalan et al while we restrict in the proof this does not affect the algorithm in fact the proof only focuses on high probability events where the samples from the posterior are concentrated around the true parameters and thus are within as well extending our proof to the general setting is not trivial in particular moving from discretized parameters to continuous space introduces the abovementioned ill behaved posteriors while increasing the value of will violate the fact that the best item will be the same for all users which allowed us to eliminate from the regret bound experiments and results the goal of our experimental evaluation is twofold evaluate the pts algorithm for making online recommendations with respect to various baseline algorithms on several datasets and ii understand the qualitative performance and intuition of pts dataset description we use synthetic dataset and five real world datasets to evaluate our approach the synthetic dataset is generated as follows at first we generate the user and item latent features and of rank by drawing from gaussian distribution and respectively the true rating matrix is then we generate the observed rating matrix from by adding gaussian noise to the true ratings we use five real world datasets as follows movielens movielens yahoo book and eachmovie as shown in table users items ratings movielens movielens yahoo music book crossing table characteristics of the datasets used in our study eachmovie baseline measures there are no current approaches available that simultaneously learn both the user and item factors by sampling from the posterior in bandit setting from the currently available algorithms we choose two kinds of baseline methods one that sequentially updates the the posterior of the user features only while fixing the item features to point estimate icf and another that updates the map estimates of user and item features via stochastic gradient descent key challenge in online 
algorithms is unbiased offline evaluation one problem in the offline setting is the partial information available about user feedback we only have information about the items that the user rated in our experiment we restrict the recommendation space of all the algorithms to recommend among the items that the user rated in the entire dataset which makes it possible to empirically measure regret at every interaction the baseline measures are as follows random at each iteration we recommend random movie to the user most popular at each iteration we recommend the most popular movie restricted to the movies rated by the user on the dataset note that this is an unrealistically optimistic baseline for an online algorithm as it is not possible to know the global popularity of the items beforehand icf the icf algorithm proceeds by first estimating the user and item latent factors and on initial training period and then for every interaction thereafter only updates the user features assuming the item features as fixed we run two scenarios for the icf algorithm one in which we use and of the data as the training period respectively during this period of training we randomly recommend movie to the user to compute the regret we use the pmf implementation by for estimating the and we learn the latent factors using an online variant of the pmf algorithm we use the stochastic gradient descent to update the latent factors with size of in order to make recommendation we use the strategy and recommend the highest ui with probability and make random recommendations otherwise is set as in our experiments http http results on synthetic dataset iterations iterations iterations cummulative regret cummulative regret cummulative regret cummulative regret cummulative regret we generated the synthetic dataset as mentioned earlier and run the pts algorithm with particles for recommendations we simulate the setting as mentioned in section and assume that at time random user it arrives and the system recommends an item jt the user rates the recommended item rit jt and we evaluate the performance of the model by computing the expected cumulative regret defined in eq fig shows the cumulative regret of the algorithm on the synthetic data averaged over runs using different size of the matrix and latent features the cumulative regret increases with the number of interactions and this gives us confidence that our approach works well on the synthetic dataset iterations iterations figure cumulative regret on different sizes of the synthetic data and averaged over runs results on real datasets pts random popular cummulative regret cummulative regret pts random popular pts random popular cummulative regret iterations iterations movielens pts random popular iterations yahoo music cummulative regret cummulative regret movielens pts random popular iterations book crossing iterations eachmovie figure comparison with baseline methods on five datasets next we evaluate our algorithms on five real datasets and compare them to the various baseline algorithms we subtract the mean ratings from the data to centre it at zero to simulate an extreme scenario we start from an empty set of user and rating we then iterate over the datasets and assume that random user it has arrived at time and the system recommends an item jt constrained to the items rated by this user in the dataset we use for all the algorithms and use particles for our approach for pts we set the value of and for bayesian version see algo for more details we set and the initial 
shape parameters of the gamma distribution as and for both and we set and fig shows the cumulative regret of all the algorithms on the five our approach performs significantly better as compared to the baseline algorithms on this diverse set of datasets with no parameter tuning performs slightly better than pts and achieves the best regret it is important to note that both pts and performs comparable to or even better than the most popular baseline despite not knowing the global popularity in advance note that icf is very sensitive to the length of the initial training period it is not clear how to set this apriori fails to run on the bookcrossing dataset as the data is too sparse for the pmf implementation rb no rb test error pmf mse mse iterations movielens iterations rb particle filter movie feature vector figure shows mse on movielens dataset the red line is the mse using the pmf algorithm shows performance of rbpf blue line as compared to vanilla pf red line on synthetic dataset and shows movie feature vectors for movie with ratings the red dot is the feature vector from the algorithm using ratings is the feature vector at of the data green dots and at blue dots we also evaluate the performance of our model in an offline setting as follows we divide the datasets into training and test set and iterate over the training data triplets it jt rt by pretending that jt is the movie recommended by our approach and update the latent factors according to rbpf we compute the recovered matrix as the average prediction from the particles at each time step and compute the mean squared error mse on the test dataset at each iteration unlike the batch method such as pmf which takes multiple passes over the data our method was designed to have bounded update complexity at each iteration we ran the algorithm using data for training and the rest for testing and computed the mse by averaging the results over runs fig shows the average mse on the movielens dataset our mse is comparable to the pmf mse as shown by the red line this demonstrates that the rbpf is performing reasonably well for matrix factorization in addition fig shows that on the synthetic dataset the vanilla pf suffers from degeneration as seen by the high variance to understand the intuition why fixing the latent item features as done in the icf does not work we perform an experiment as follows we run the icf algorithm on the movielens dataset in which we use of the data for training at this point the icf algorithm fixes the item features and only updates the user features next we run our algorithm and obtain the latent features we examined the features for one selected movie from the particles at two time intervals one when the icf algorithm fixes them at and another one in the end as shown in the fig it shows that movie features have evolved into different location and hence fixing them early is not good idea related work probabilistic matrix completion in bandit setting setting was introduced in the previous work by zhao et al the icf algorithm in approximates the posterior of the latent item features by single point estimate several other bandit algorithms for recommendations have been proposed valko et al proposed bandit algorithm for recommendations in this approach the features of the items are extracted from similarity graph over the items which is known in advance the preferences of each user for the features are learned independently by regressing the ratings of the items from their features the key difference in our approach is 
that we also learn the features of the items in other words we learn both the user and item factors and while learn only kocak et al combine the spectral bandit algorithm in with ts gentile et al propose bandit algorithm for recommendations that clusters users in an online fashion based on the similarity of their preferences the preferences are learned by regressing the ratings of the items from their features the features of the items are the input of the learning algorithm and they only learn maillard et al study bandit problem where the arms are partitioned into unknown clusters unlike our work which is more general conclusion we have proposed an efficient method for carrying out matrix factorization in bandit setting the key novelty of our approach is the combined use of particle filtering and thompson sampling pts in matrix factorization recommendation this allows us to simultaneously update the posterior probability of and in an online manner while minimizing the cumulative regret the state of the art till now was to either use point estimates of and or use point estimate of one of the factor and update the posterior probability of the other pts results in substantially better performance on wide variety of real world data sets references yehuda koren robert bell and chris volinsky matrix factorization techniques for recommender systems computer xiaoxue zhao weinan zhang and jun wang interactive collaborative filtering in proceedings of the acm international conference on conference on information knowledge management pages acm olivier chapelle and lihong li an empirical evaluation of thompson sampling in nips pages shipra agrawal and navin goyal thompson sampling for contextual bandits with linear payoffs in icml pages ruslan salakhutdinov and andriy mnih probabilistic matrix factorization in nips volume pages ruslan salakhutdinov and andriy mnih bayesian probabilistic matrix factorization using markov chain monte carlo in icml pages nicolas chopin sequential particle filter method for static models biometrika pierre del moral arnaud doucet and ajay jasra sequential monte carlo samplers journal of the royal statistical society series statistical methodology arnaud doucet nando de freitas kevin murphy and stuart russell particle filtering for dynamic bayesian networks in proceedings of the sixteenth conference on uncertainty in artificial intelligence pages morgan kaufmann publishers gelman and meng note on bivariate distributions that are conditionally normal amer arnold castillo sarabia and multiple modes in densities with normal conditionals statist probab arnold castillo and sarabia conditionally specified distributions an introduction statistical science aditya gopalan shie mannor and yishay mansour thompson sampling for complex online problems in proceedings of the international conference on machine learning pages michal valko munos branislav kveton and spectral bandits for smooth graph functions in international conference on machine learning michal valko munos and shipra agrawal spectral thompson sampling in proceedings of the aaai conference on artificial intelligence claudio gentile shuai li and giovanni zappella online clustering of bandits arxiv preprint maillard and shie mannor latent bandits in proceedings of the international conference on machine learning icml beijing china june pages 
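To make the particle Thompson sampling loop described above concrete, here is a deliberately simplified sketch (ours, not the paper's PTS/RBPF implementation): each particle stores a full (U, V) pair, the variances are assumed known, and the reweighting uses the plug-in Gaussian likelihood rather than the Rao-Blackwellized predictive density used by the authors; the move step redraws only the current user's and item's latent vectors from the closed-form conditional posteriors reviewed in the PMF section. All sizes and hyperparameters are arbitrary demo choices.

# Simplified sketch (ours) of particle Thompson sampling for the MF bandit.
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 20, 30, 2
sigma, sigma_u, sigma_v = 0.5, 1.0, 1.0
P = 20                                        # number of particles

U_true = rng.normal(0.0, sigma_u, (N, K))     # ground truth, used only to simulate feedback
V_true = rng.normal(0.0, sigma_v, (M, K))

particles = [{"U": rng.normal(0.0, sigma_u, (N, K)),
              "V": rng.normal(0.0, sigma_v, (M, K))} for _ in range(P)]
history = []                                  # observed (user, item, rating) triples

def gaussian_posterior(pairs, prior_var):
    # Conditional posterior N(mu, Lambda^{-1}) of one latent vector given the
    # ratings in `pairs` = [(other_vector, rating), ...]; mirrors the PMF formulas above.
    Lam = np.eye(K) / prior_var
    zeta = np.zeros(K)
    for vec, r in pairs:
        Lam += np.outer(vec, vec) / sigma ** 2
        zeta += r * vec / sigma ** 2
    cov = np.linalg.inv(Lam)
    return cov @ zeta, cov

regret = 0.0
for t in range(1000):
    user = int(rng.integers(N))
    # Thompson step: pick one particle uniformly (weights are uniform after resampling)
    # and recommend greedily with respect to it.
    p = particles[int(rng.integers(P))]
    item = int((p["V"] @ p["U"][user]).argmax())
    rating = float(U_true[user] @ V_true[item] + rng.normal(0.0, sigma))
    regret += float((V_true @ U_true[user]).max() - U_true[user] @ V_true[item])
    history.append((user, item, rating))

    # Reweight by the (plug-in) likelihood of the new rating, then resample.
    w = np.array([np.exp(-0.5 * (rating - q["U"][user] @ q["V"][item]) ** 2 / sigma ** 2)
                  for q in particles])
    w = w / w.sum() if w.sum() > 0 else np.full(P, 1.0 / P)
    idx = rng.choice(P, size=P, p=w)
    particles = [{"U": particles[i]["U"].copy(), "V": particles[i]["V"].copy()} for i in idx]

    # Move step: redraw the current user's and item's vectors from their conditional posteriors.
    for q in particles:
        u_pairs = [(q["V"][j], r) for (i, j, r) in history if i == user]
        mu, cov = gaussian_posterior(u_pairs, sigma_u ** 2)
        q["U"][user] = rng.multivariate_normal(mu, cov)
        v_pairs = [(q["U"][i], r) for (i, j, r) in history if j == item]
        mu, cov = gaussian_posterior(v_pairs, sigma_v ** 2)
        q["V"][item] = rng.multivariate_normal(mu, cov)

print("cumulative regret over 1000 interactions:", round(regret, 1))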
Improved Iteration Complexity Bounds of Cyclic Block Coordinate Descent for Convex Problems. Ruoyu Sun (Department of Management Science and Engineering, Stanford University, Stanford, CA) and Mingyi Hong (Department of Industrial and Manufacturing Systems Engineering and Department of Electrical and Computer Engineering, Iowa State University, Ames, IA). The authors contribute equally to this work.

Abstract. The iteration complexity of the block coordinate descent (BCD) type algorithm has been under extensive investigation. It was recently shown that for convex problems the classical cyclic BCGD (block coordinate gradient descent) achieves an O(1/r) complexity, where r is the number of passes through all blocks. However, such bounds depend at least linearly on the number of variable blocks K and are at least K times worse than those of the gradient descent (GD) and proximal gradient (PG) methods. In this paper we close such a theoretical performance gap between cyclic BCD and GD/PG. First, we show that for a family of quadratic nonsmooth problems the complexity bounds for cyclic block coordinate proximal gradient (BCPG), a popular variant of BCD, can match those of GD/PG in terms of the dependency on K, up to a logarithmic factor. Second, we establish an improved complexity bound for coordinate gradient descent (CGD) for general convex problems, which can match that of GD in certain scenarios. Our bounds are sharper than the known bounds, which are always at least K times worse than GD. Our analyses do not depend on the update order of the block variables inside each cycle; thus our results also apply to BCD methods with random permutation (random sampling without replacement), another popular variant.

Introduction. Consider the following convex optimization problem:

min f(x) := g(x_1, ..., x_K) + Σ_{k=1}^K h_k(x_k),  x ∈ X,

where g is a convex smooth function, each h_k is a convex, lower semicontinuous, possibly nonsmooth function, and x_k ∈ X_k ⊆ R^{n_k} is the k-th block variable. A very popular method for solving this problem is the block coordinate descent (BCD) method, where each time a single block variable is optimized while the rest of the variables remain fixed. Using the classical cyclic block selection rule, the BCD method can be described below.

Algorithm 1 (the cyclic block coordinate descent, BCD). At each iteration r+1, update the variable blocks by

x_k^{r+1} = argmin_{x_k ∈ X_k} g(x_1^{r+1}, ..., x_{k-1}^{r+1}, x_k, x_{k+1}^r, ..., x_K^r) + h_k(x_k),  k = 1, ..., K,

where we have used the notation w_k^{r+1} := (x_1^{r+1}, ..., x_{k-1}^{r+1}, x_k^r, ..., x_K^r).

The convergence analysis of the BCD has been extensively studied in the literature (see, for example, the references below). It is known that for smooth problems (g is continuously differentiable but possibly nonconvex and h_k ≡ 0), if each subproblem has a unique solution and f is nonincreasing in the interval between the current iterate and the minimizer of the subproblem (one special case is strict convexity), then every limit point of the iterates is a stationary point. Grippo and Sciandrone have derived relaxed conditions on the convergence of BCD; in particular, when the problem is convex and the level sets are compact, the convergence of the BCD is guaranteed without requiring the subproblems to have unique solutions. Recently, Razaviyayn et al. have shown that the BCD converges if each subproblem is solved inexactly, by way of optimizing certain surrogate functions. Luo and Tseng have shown that when the problem satisfies certain additional assumptions, such as having a smooth composite objective and a polyhedral feasible set, BCD converges linearly without requiring the objective to be strongly convex. There are many recent works showing iteration complexity for randomized BCGD (block coordinate gradient descent); see the references therein. However, the results on the classical cyclic BCD are rather scant. Saha and Tewari show that the cyclic BCD achieves sublinear convergence for a family of special Lasso problems.
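To fix ideas, here is a small self-contained sketch (ours, not the authors') of cyclic exact coordinate minimization for the Lasso, the special case just mentioned: each block is a single coordinate and the per-block subproblem of Algorithm 1 has the closed-form soft-thresholding solution. The problem sizes and the regularization weight are arbitrary demo choices.

# Sketch (ours): cyclic exact coordinate minimization for the Lasso,
#   minimize_x 0.5 * ||A x - b||^2 + lam * ||x||_1,
# where each coordinate subproblem is solved in closed form by soft-thresholding.
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_features = 60, 20
A = rng.normal(size=(n_samples, n_features))
x_true = np.zeros(n_features)
x_true[:3] = [2.0, -1.5, 1.0]                        # sparse ground truth for the demo
b = A @ x_true + 0.01 * rng.normal(size=n_samples)
lam = 1.0

def soft_threshold(c, t):
    return np.sign(c) * max(abs(c) - t, 0.0)

x = np.zeros(n_features)
residual = b - A @ x
col_sq_norms = (A ** 2).sum(axis=0)

for cycle in range(100):                              # one cycle = one pass over all blocks
    for k in range(n_features):                       # classical cyclic order
        a_k = A[:, k]
        c = a_k @ residual + col_sq_norms[k] * x[k]   # correlation with x_k's own term removed
        x_new = soft_threshold(c, lam) / col_sq_norms[k]
        residual += a_k * (x[k] - x_new)              # keep the residual consistent
        x[k] = x_new

obj = 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
print("objective:", round(obj, 4))
print("recovered support:", np.nonzero(np.abs(x) > 1e-6)[0])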
Nutini et al. show that when the problem is strongly convex, unconstrained, and smooth, BCGD with the Gauss-Southwell block selection rule can be faster than the randomized rule. Recently, Beck and Tetruashvili showed that cyclic BCGD converges sublinearly if the objective is smooth. Subsequently, Hong et al. showed that such a sublinear rate not only can be extended to problems with a nonsmooth objective, but is true for a large family of BCD-type algorithms, with or without exact per-block minimization, which includes BCGD as a special case. When each block is minimized exactly and when there is no strong convexity, Beck proves sublinear convergence for a certain convex problem with only one block having a Lipschitzian gradient. It is worth mentioning that all the above results on cyclic BCD can be used to prove the complexity for the popular randomly permuted BCD, in which the blocks are randomly sampled without replacement.

To illustrate the rates developed for the cyclic BCD algorithm, let us define X* to be the optimal solution set for the problem and define the constant

R := max_{x: f(x) ≤ f(x^0)} max_{x* ∈ X*} ||x − x*||.

Let us assume that h_k ≡ 0 and X_k = R^{n_k} for now, and assume that f has Lipschitz continuous gradient with constant L. Also assume that f has Lipschitz continuous gradient with respect to each block:

||∇_k f(x + v_k) − ∇_k f(x)|| ≤ L_k ||v_k||  for all v_k ∈ R^{n_k},

and let L_max = max_k L_k and L_min = min_k L_k. It is known that the cyclic BCPG has the following iteration complexity:

Δ_BCD^r := f(x^r) − f* ≤ C K L_max R² / (r + 1),

where C is some constant independent of the problem dimension. Similar bounds are provided for the cyclic BCD. In contrast, it is well known that when applying the classical gradient descent (GD) method to the problem with the constant stepsize 1/L, we have the following rate estimate:

Δ_GD^r := f(x^r) − f* ≤ C L R² / (r + 1).

(Note that the assumptions made in these two lines of analysis are slightly different, but the rates derived in both cases have a similar dependency on the problem dimension.) Note that, unlike the BCD bound, here the constant in front of the 1/(r+1) term is independent of the problem dimension. In fact, the ratio of the two bounds is of the order C K L_max / L, which is at least in the order of K. For big-data related problems with over millions of variables, a multiplicative constant in the order of K can be a serious issue. In a recent work by Saha and Tewari, the authors show that for the Lasso problem with a special data matrix, the rate of cyclic BCD with a special initialization is indeed independent of K. Unfortunately, such a result has not yet been extended to any other convex problems. An open question posed by a few authors is: is such a factor-of-K gap intrinsic to the cyclic BCD, or merely an artifact of the existing analysis?

Improved bounds of cyclic BCPG for the nonsmooth quadratic problem. In this section we consider the following nonsmooth quadratic problem:

min f(x) := (1/2) || Σ_{k=1}^K A_k x_k − b ||² + Σ_{k=1}^K h_k(x_k),  x_k ∈ X_k,

where A_k ∈ R^{m×n}, x_k ∈ R^n is the k-th block coordinate, and h_k is the same as before; define A := [A_1, ..., A_K]. (For simplicity of presentation we have assumed that all the blocks have the same size.) This problem includes, for example, Lasso and group Lasso as special cases. We consider the following cyclic BCPG algorithm.

Algorithm 2 (the cyclic block coordinate proximal gradient, BCPG). At each iteration r+1, update the variable blocks by

x_k^{r+1} = argmin_{x_k ∈ X_k} ⟨∇_k g(w_k^{r+1}), x_k − x_k^r⟩ + (p_k/2) ||x_k − x_k^r||² + h_k(x_k),  k = 1, ..., K.

Here p_k is the inverse of the stepsize for x_k, which satisfies p_k ≥ λ_max(A_k^T A_k) = L_k; define p_max = max_k p_k and p_min = min_k p_k. Note that for the least squares problem (smooth quadratic minimization, h_k ≡ 0), BCPG reduces to the widely used BCGD method. The optimality condition for the k-th subproblem is given by

h_k(x_k) + ⟨∇_k g(w_k^{r+1}) + p_k (x_k^{r+1} − x_k^r), x_k − x_k^{r+1}⟩ ≥ h_k(x_k^{r+1})  for all x_k ∈ X_k.

In what follows, we show that the cyclic BCPG for this problem achieves a complexity bound that only
dependents on and apart from such log factor it is at least times better than those known in the literature our analysis consists of the following three main steps estimate the descent of the objective after each bcpg iteration estimate the cost yet to be minimized after each bcpg iteration combine the above two estimates to obtain the final bound first we show that the bcpg achieves the sufficient descent lemma we have the following estimate of the descent when using the bcpg pk xk proof we have the following series of inequalities wk wk hk xk wk xk xk xk wk xk hk xk hk xk xk xk wk pk xk where the second inequality uses the optimality condition given below which have dimension and to proceed let us introduce two matrices and respectively pk ak by utilizing the definition of pk in we have the following inequalities the second inequality comes from lemma where in ka at in is the identity matrix and the notation denotes the kronecker product next let us estimate the lemma we have the following estimate of the optimality gap when using the bcpg in log pmin pmax our third step combines the previous two steps and characterizes the iteration complexity this is the main result of this section theorem the iteration complexity of using bcpg to solve is given below when the stepsizes are chosen conservatively as pk we have max when the stepsizes are chosen as pk λmax atk ak lk then we have max lmax lmin in particular if the problem is smooth and unconstrained when and xk rn then we have max log lmax lmin we comment on the bounds derived in the above theorem the bound for bcpg with uniform conservative stepsize has the same order as the gd method except for the factor cf in corollary it is shown that the bcgd with the same conservative stepsize achieves sublinear rate with constant of which is about times worse than our bound further our bound has the same dependency on as the one derived in for bcpg with conservative stepsize to solve an penalized quadratic problem with special data matrix but our bound holds true for much larger class of problems all quadratic nonsmooth problem in the form of however in practice such conservative stepsize is slow compared with bcpg with pk lk for all hence is rarely used the rest of the bounds derived in theorem is again at least times better than existing bounds of cyclic bcpg for example when the problem is smooth and unconstrained the ratio between our bound and the bound is given by lminlmax lmax clmax kl where in the last inequality we have used the fact that lmax for unconstrained smooth problems let us compare the bound derived in the second part of theorem stepsize pk lk with that of the gd if klk for all problem badly conditioned our bound is about times worse than that of the gd this indicates phenomenon by choosing conservative stepsize pk the iteration complexity of bcgd is times better compared with choosing more aggressive stepzise pk lk it also indicates that the factor may hide an additional factor of iteration complexity for general convex problems in this section we consider improved iteration complexity bounds of bcd for general unconstrained smooth convex problems we prove general iteration complexity result which includes result of beck et al as special case our analysis for the general case also applies to smooth quadratic problems but is very different from the analysis in previous sections for quadratic problems for simplicity we only consider the case scalar blocks the generalization to the case is left as future work let us assume that the 
smooth objective has second order derivatives hij when block is just coordinate we assume lij then li lii and lij li lj for unconstrained smooth convex problems with scalar block variables the bcpg iteration reduces to the following coordinate gradient descent cgd iteration means that is linear combination of wk and where dk wk and wk dk ek ek is the block unit vector in the following theorem we provide an iteration complexity bound for the general convex problem the proof framework follows the standard approach that combines sufficient descent and estimate nevertheless the analysis of the sufficient descent is very different from the methods used in the previous sections the intuition is that cgd can be viewed as an inexact gradient descent method thus the amount of descent can be bounded in terms of the norm of the full gradient it would be difficult to further tighten this bound if the goal is to obtain sufficient descent based on the norm of the full gradient having established the sufficient descent in terms of the full gradient we can easily prove the iteration complexity result following the standard analysis of gd see theorem theorem for cgd with pk lmax we have min lk pmax pmin proof since and wkr only differ by the block and is lipschitz continuous with lipschitz constant lk we have wkr wkr wkr lk wkr wkr wkr lk wkr wkr where the last inequality is due to pk lk the amount of decrease can be estimated as xr wkr wkr since wkr xr by the theorem there must exist ξk such that xr wkr ξk xr wkr ξk ξk ξk hk ξk dk where hij is the second order derivative of then xr xr wkr wkr ξk hk ξk dk dk ξk hk ξk pk dk vkt where we have defined vk ξk hk ξk pk let vk dk pk ξk ξk ξk hk ξk pk stronger bound is wkr wkr where pk but since pk lk the improvement ratio of using this stronger bound is no more than factor of therefore we have xr wkr xr pk combining with we get xr wkr xr let diag pk and let be defined as ξk ξk ξk then which implies hk ξk pmax pmin plugging into we obtain pmax pmin from the fact that hkj ξk is scalar bounded above by ξk lkj lk lj thus ξk lk lj lk we provide the second bound of below let hk denote the row of then therefore we have combining this bound and we obtain that min lk denote pmax then becomes min this relation also implies thus by the definition of in we have by the convexity of and the inequality we have combining with we obtain let we obtain then we have summarizing the inequalities we get which leads to pmax pmin where min lk this completes the proof let us compare this bound with the bound derived in theorem replacing the denominator by which is pmax xr pmax pmin pmin we in our new bound besides reducing the coefficient from to and removing the factor ppmax min improve kl to min kl lk neither of the two bounds kl and lk implies the other when lk the new bound lk is times larger when klk or lk the new bound is times smaller in fact when klk our new bound is times better than the bound in for either pk lk or pk for example bound is lr which matches gd when pk the bound in becomes kl while our listed in table below another advantage of the new bound lk is that it does not increase if we add an artificial block and perform cgd for function in contrast the existing bound will increase to even though the algorithm does not change at all we have demonstrated that our bound can match gd in some cases but can possibly be times worse than gd an interesting question is for general convex problems can we obtain an lr bound for cyclic bcgd matching the bound of gd removing the in 
will lead to an lr bound for conservative stepsize pk no matter how large lk and are we conjecture that an lr bound for cyclic bcgd can not be achieved for general convex problems that being said we point out that the iteration complexity of cyclic bcgd may depend on other intrinsic parameters of the problem such as lk and possibly third order derivatives of thus the question of finding the best iteration complexity bound of the form lr where is function of may not be the right question to ask for bcd type algorithms conclusion in this paper we provide new analysis and improved complexity bounds for cyclic methods for convex quadratic problems we show that the bounds are lr which is independent of except for mild factor and is about lmax times worse than those for by simple example we show that it is not possible to obtain an iteration complexity kr for cyclic bcpg for illustration the main results of this paper in several simple settings are summarized in the table below note that different ratios of over lk can lead to quite different comparison table comparison of various iteration complexity results gd random bcgd cyclic bcgd cyclic cgd cor cyclic bcgd qp diagonal hessian li pi full hessian li large stepsize pi kr full hessian li small stepsize pi references angelos cowen and narayan triangular truncation and finding the norm of hadamard multiplier linear algebra and its applications beck on the convergence of alternating minimization with applications to iteratively reweighted least squares and decomposition schemes siam journal on optimization beck pauwels and sabach the cyclic block coordinate gradient method for convex optimization problems preprint available on beck and tetruashvili on the convergence of block coordinate descent type methods siam journal on optimization bertsekas nonlinear programming ed athena scientific belmont ma grippo and sciandrone on the convergence of the block nonlinear method under convex constraints operations research letters hong wang razaviyayn and luo iteration complexity analysis of block coordinate descent methods preprint available online lu and xiao on the complexity analysis of randomized descent methods accepted by mathematical programming lu and xiao randomized block coordinate gradient method for class of nonlinear programming preprint luo and tseng on the convergence of the coordinate descent method for convex differentiable minimization journal of optimization theory and application nesterov introductory lectures on convex optimization basic course springer nesterov efficiency of coordiate descent methods on optimization problems siam journal on optimization nutini schmidt laradji friedlander and koepke coordinate descent converges faster with the rule than random selection in the proceeding of the international conference on machine learning icml powell on search directions for minimization algorithms mathematical programming razaviyayn hong and luo unified convergence analysis of block successive minimization methods for nonsmooth optimization siam journal on optimization razaviyayn hong luo and pang parallel successive convex approximation for nonsmooth nonconvex optimization in the proceedings of the neural information processing nips and iteration complexity of randomized descent methods for minimizing composite function mathematical programming saha and tewari on the nonasymptotic convergence of cyclic coordinate descent method siam journal on optimization tseng convergence of block coordinate descent method for nondifferentiable 
minimization. Journal of Optimization Theory and Applications. Xu and Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences.
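For readers who want a concrete handle on the cyclic coordinate gradient descent iteration analyzed above, the following minimal Python sketch runs cyclic CGD on a randomly generated least-squares problem and compares the aggressive per-coordinate stepsize 1/L_k (roughly p_k = L_k in the notation above) with a conservative uniform stepsize 1/L, where L is the largest eigenvalue of the Hessian. The problem sizes, random data, and function names are illustrative assumptions, not part of the paper.

import numpy as np

rng = np.random.default_rng(0)
n, N = 200, 50                          # n samples, N scalar coordinate blocks
A = rng.standard_normal((n, N))
b = rng.standard_normal(n)
H = A.T @ A                             # Hessian of f(x) = 0.5 * ||Ax - b||^2
Atb = A.T @ b
L_k = np.diag(H).copy()                 # per-coordinate Lipschitz constants L_k = H_kk
L = np.linalg.eigvalsh(H).max()         # global Lipschitz constant of the gradient

def cyclic_cgd(steps, num_cycles=200):
    # One cycle sweeps the coordinates in order, using the most recent iterate.
    x = np.zeros(N)
    for _ in range(num_cycles):
        for k in range(N):
            g_k = H[k] @ x - Atb[k]     # k-th partial derivative of f at x
            x[k] -= steps[k] * g_k
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

print("f after 200 cycles, stepsize 1/L_k:", cyclic_cgd(1.0 / L_k))
print("f after 200 cycles, stepsize 1/L  :", cyclic_cgd(np.full(N, 1.0 / L)))

The two runs illustrate the stepsize trade-off discussed in the theorem: the aggressive choice makes more progress per cycle on well-conditioned coordinates, while the conservative choice is the one for which the L/r-type bound with the smaller constant applies.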
lifted symmetry detection and breaking for map inference tim kopp university of rochester rochester ny parag singla delhi hauz khas new delhi henry kautz university of rochester rochester ny tkopp parags kautz abstract symmetry breaking is technique for speeding up propositional satisfiability testing by adding constraints to the theory that restrict the search space while preserving satisfiability in this work we extend symmetry breaking to the problem of model finding in weighted and unweighted relational theories class of problems that includes map inference in markov logic and similar languages we introduce term symmetries which are induced by an evidence set and extend to symmetries over relational theory we provide the important special case of term equivalent symmetries showing that such symmetries can be found in polynomial time we show how to break an exponential number of these symmetries with added constraints whose number is linear in the size of the domain we demonstrate the effectiveness of these techniques through experiments in two relational domains we also discuss the connections between relational symmetry breaking and work on lifted inference in reasoning introduction is an approach to speeding up satisfiability testing by adding constraints called predicates sbps to theory symmetries in the theory define partitioning over the space of truth assignments where the assignments in partition either all satisfy or all fail to satisfy the theory the added sbps rule out some but not all of the truth assignments in the partitions thus reducing the size of the search space while preserving satisfiability we extend the notion of to in relational theories relational theory is specified by set of axioms over finite domains optional weights on the axioms or predicates of the theory and set of ground literals representing evidence by model finding we mean satisfiability testing unweighted theories weighted maxsat weights on axioms or maximum weighted model finding weights on predicates the weighted versions of model finding encompass map inference in markov logic and similar languages we introduce methods for finding symmetries in relational theory that do not depend upon solving graph isomorphism over its full propositional grounding we show how graph isomorphism can be applied to just the evidence portion of relational theory in order to find the set of what we call term symmetries we go on to define the important subclass of term equivalent symmetries and show that they can be found in nm log time where is the number of constants and is the size of the evidence next we provide the formulation for breaking term and term equivalent symmetries an inherent problem in is that propositional theory may have an exponential number of symmetries so breaking them individually would increase the size of the theory exponentially this is typically handled by breaking only portion of the symmetries we show that term equivalent symmetries provide compact representation of exponentially many symmetries and an tially large subset of these can be broken by small linear number of sbps we demonstrate these ideas on two relational domains and compare our approach to other methods for map inference in markov logic background symmetry breaking for sat for satisfiability testing introduced by crawford et al is based upon concepts from group theory permutation is mapping from set to itself permutation group is set of permutations that is closed under composition and contains the identity and unique inverse 
for every element literal is an atom or its negation clause is disjunction over literals cnf theory is set conjunction of clauses let be the set of literals of we consider only permutations that respect negation that is the action of permutation on theory written is the cnf formula created by applying to each literal in we say is symmetry of if it results in the same theory model is truth assignment to the atoms of theory the action of on written is the model where the key property of being symmetry of is that iff the orbit of model under symmetry group is the set of models that can be obtained by applying any of the symmetries in symmetry group divides the space of models into disjoint sets where the models in an orbit either all satisfy or all do not satisfy the theory the idea of is to add clauses to rule out many of the models but are guaranteed to not rule out at least one model in each orbit note that preserves satisfiability of theory symmetries can be found in cnf theories using reduction to graph isomorphism problem that is thought to require time in the worst case but which can often be efficiently solved in practice the added clauses are called predicates sbps if we place fixed order on the atoms of theory then model can be associated with binary number where the digit or specifies the value of the atom false or true sbps rule out models that are not the members of their orbits the formulation below is equivalent to the sbp given by crawford et al in sbp vj vj vi vi where vi is the ith variable in the ordering of variables and is symmetry over variables even though graph isomorphism is relatively fast in practice theory may have exponentially many symmetries therefore breaking all the symmetries is often impractical though partial symmetrybreaking is still useful it is possible to devise new sbps that can break exponentially more symmetries than the standard form described above we do so in section relational theories we define relational theory as tuple where is set of formulas mapping of predicates and negated predicates to strictly positive real numbers weights and is set of evidence we restrict the formulas in to be built from predicates variables quantifiers and logical connectives but no constants or function symbols is set of ground literals that is literals built from predicates and constant symbols the predicate arguments and constant symbols are typed universal and existential quantification is over the set of the theory constants the constants that appear in its evidence any constants not appearing explicitly in the evidence can be incorporated by introducing unary predicate for each constant type and adding the groundings for those constants to the evidence any formula containing constant can be made by introducing new unary predicate for each constant and then including that predicate applied to that constant in the evidence ground theory can be seen as special case of relational theory where each predicate is argument free we define the weight of positive ground literal ck of theory as and the weight of negative ground literal ck as in other words all positive groundings of literal have the same weight as do all negative groundings the weight of model with respect to theory is if fails to satisfy any part of or otherwise it is the product of the weights of the ground atoms that are true in maximum weighted is the task for finding model of maximum weight with respect to relational theory can be taken to define probability distribution over the set of models where 
the probability of model is proportional to its weight maximum weighted thus computes map most probable explanation for given theory ordinary satisfiability corresponds to the case where simply sets the weights of all literals to languages such as markov logic use an alternative representation and specify weights on formulas rather than positive weights on predicates and their negations the map problem can be formulated as the problem finding model maximizing the sum of the weights of satisfied clauses this can be translated to our notation by introducing new predicate for each original formula whose arguments are the free variables in the original formula asserts that the predicate is equivalent to the original formula and asserts that the weight of the new predicate is raised to the weight of the original formula solving weighted maxsat in the alternate representation is thus identical to solving maximum weighted in the translated theory for the rest of the discussion in this paper we will assume that the theory is specified with weights on predicates and their negations related work our work has connections to research in both the machine learning and research communities most research in machine learning has concentrated on modifying or creating novel probabilistic inference algorithms to exploit symmetries as opposed to approach developments include lifted versions of variable elimination message passing and dpll our approach of defining symmetries using group theory and detecting them by graph isomorphism is shared by bui et work on lifted variational inference and apsel et work on cluster signatures bui notes that symmetry groups can be defined on the basis of unobserved constants in the domain while we have developed methods to explicitly find symmetries among constants that do appear in the evidence niepert also gives formalism of symmetries in relational theories applying them to mcmc methods another line of work is to make use of problem transformations knowledge compilation transforms relational problem into form for which map and marginal inference is tractable this is much more extensive and computationally complex transformation than mladenov et al propose an approach to translate linear program over relational domain into an equivalent lifted linear program based on message passing computations which can then be solved using any lp solver recent work on map inference in markov logic has identified special cases where relational formula can be transformed by replacing quantified formula with single grounding of the formula relatively little work in srl has explicitly examined the role of evidence separate from the firstorder part of theory on symmetries one exception is which presents heuristic method for approximating an evidence set in order to increase the number of symmetries it induces bui et al consider the case of theory plus evidence pair where the theory has symmetries that are broken by the evidence they show that if the evidence is soft and consists only of unary predicates then lifting based on the theory followed by incorporation of evidence enables polynomial time inference extending the results of van den broeck and darwiche show that dealing with binary evidence is in general but can be done efficiently if there is corresponding low rank boolean matrix factorization they also propose approximation schemes based on this idea we briefly touch upon the extensive literature that has grown around the use of symmetries in constraint satisfaction symmetry detection 
has been based either on graph isomorphism on propositional theories as in the original work by by crawford et al by interchangeability of row columns in csps specified in matrix form by checking for other special cases of geometric symmetries or by determining that domain elements for variable are exchangeable the last is special case of our term equivalent symmetries researchers have suggested symmetryaware modifications to backtracking csp solvers for variable selection branch pruning and nogood learning recent survey of symmetry breaking for csp described alternatives to the formulation of sbps including one based on gray codes symmetries in relational theories in this section we will formally introduce the notion of symmetries over relational theories and give efficient algorithms to find them symmetries of relational theory can be defined in terms of symmetries over the corresponding ground theory definition let denote relational theory let the denote the theory obtained by grounding the formulas in let denote the set of ground literals in we say that permutation of the set is symmetry of the relational theory if maps the ground theory back to itself we denote the action of on the original theory as straightforward way to find symmetries over relational theory is to first map it to corresponding ground theory and then find symmetries over it using reduction to graph isomorphism the complexity of finding symmetries in this way is the same as that of graph isomorphism which is believed to be further the number of symmetries found is potentially exponential in the number of ground literals this is particularly significant for relational theories since the number of ground literals itself is exponential in the highest predicate arity computing symmetries at the ground level can therefore be prohibitively expensive for theories with high predicate arity and many constants in our work below we exploit the underlying template structure of the relational theory to directly generate symmetries based on the evidence term symmetries we introduce the notion of symmetries defined over terms constants appearing in theory called term symmetries definition let be relational theory let be the set of constants appearing in the theory then permutation over the term set is said to be term symmetry with respect to evidence if application of on the terms appearing in denoted by maps back to itself we will also refer to as an evidence symmetry for the set the problem of finding term symmetries can be reduced to colored graph isomorphism we construct graph as follows for each predicate and it negation has node and unique color is assigned to every such node also has unique color for each type in the domain there is node for every term which takes the color of its type we call an ordered list of terms an argument list given the literal where the argument list is the type of an argument list is simply the of the types of the terms appearing in it has node for every argument list appearing in the evidence which takes the color of its type for every evidence literal there is an edge between the predicate node or its negation and the corresponding argument list node there is also an edge between the argument list node and each of the terms appearing in the list thus for the previous example an edge will be placed between the node for and the node for as well as an edge each between and the nodes for and any automorphism of will map negated predicate nodes to themselves and terms will be mapped in manner that 
their association with the corresponding predicate node in the evidence is preserved hence automorphisms of will correspond to term symmetries in evidence next we will establish relationship between permutation of terms in the evidence to the permutations of literals in the ground theory definition let be relational theory let be the evidence set and let be the set of terms appearing in given permutation of the terms in the set we associate corresponding permutation θt over the ground literals of the form ck in such that θt ck ck and similarly for negated literals we can now associate symmetry over the terms with symmetry of the theory the following lemma is proven in the supplementary material lemma let be relational theory let denote the evidence set let be the set of terms appearing in if is term symmetry for the evidence then the associated theory permutation θt is also symmetry of in order to find the term symmetries we can resort to solving graph isomorphism problem of size where is the number of literals in the evidence directly finding symmetries over the ground literals requires solving problem of size being the set of constants and being the highest predicate arity in the worst case where everything is fully observed but in practice it is much smaller next we present an important subclass of term symmetries called term equivalent symmetries which capture wide subset of all the symmetries present in the theory and can be efficiently detected and broken term equivalent symmetries term equivalent symmetry is set of term symmetries that divides the set of terms into equivalence classes such that any permutation which maps terms within the same equivalence class is symmetry of the evidence set let zm denote partition of the term set into disjoint subsets given partition we say that two terms are term equivalent with respect to if they occur in the same component of we define partition preserving permutation as follows definition given set of terms and its disjoint partition we say that permutation of the terms in is partition preserving permutation of with respect to the partition if ck cj ck implies that zi st cj ck zi in other words is partition preserving if it permutes terms within the same component of the set of all partition preserving permutations with respect to partition forms group we will denote this group by θz it is easy to see that θz divides the set of terms in equivalence classes next we define the notion of term equivalent symmetries definition let be relational theory and denote the evidence set let be the set of terms in and be disjoint partition of terms in then given the partition preserving permutation θz we say that θz is term equivalent symmetry group of if θz is symmetry of we will refer to each symmetry θz as term equivalent symmetry of partition of term set is term equivalent partition if the partition preserving group θz is symmetry group of we refer to each partition element zi as term equivalent subset the term equivalent symmetry group can be thought of as set of symmetry subgroups θzi one for each term subset zi such that zi allows for all possible permutations of terms within the set zi and defines an identity mapping for terms in other subsets note that the size of term equivalent symmetry group is given by πm despite its large size it can be very efficiently represented by simply storing the partition over the term set note that this formulation works equally well for typed as well untyped theories in typed theory no two terms of differing types can 
appear in the same subset next we will look at an efficient algorithm for finding partition which corresponds to term equivalent symmetry group over let the evidence be given by lk where each li is ground literal intuitively two terms are term equivalent if they occur in exactly the same context in the evidence for example if evidence for constant is then the context for term is note that here the positions where occurred in the evidence has been marked by any other term sharing the same context would be term equivalent to to find the set of all the equivalent terms we first compute the context for each term then we sort each context based on some lexicographic order defined over predicate symbols and term symbols once the context has been sorted we can simply hash the context for each term and put those which have the same context in the same equivalence class if the evidence size is given by and number of terms in evidence is then the above procedure will take nm log time the log factor is for sorting the context for single term hashing the sorted term takes constant time this is done for each term hence the factor of see the supplement for more details and an example breaking symmetries over terms in this section we provide the sbp formulation for term as well as term equivalent symmetries breaking term symmetries consider relational theory that has term symmetries θk fix an ordering pr over the predicates an ordering over the predicate positions and an ordering over the terms if is typed fix an ordering over types and an ordering over the terms within each type this induces straightforward ordering over the ground atoms of the theory gn let be term symmetry consider the following sbp based on equation to break sbp gj gj gi gi theorem if is weighted the max model for sbp is max model for and it has the same weight in both theories the proof of this theorem follows from essentially an ordering is placed on the atoms of the theory which induces an ordering on the models the sbp constraints ensure that if set of models are in the same orbit symmetric under then only the first of those models in the ordering is admitted breaking term equivalent symmetries let be the term equivalent partitioning over the terms let θj be the term symmetry that swaps cj and ck the jth and kth constants in an ordering over the term equivalent subset zi and maps everything else to identity we next show how to break exponentially many symmetries that respect the term equivalent partition using sbp consider the following sbp csbp stands for composite sbp csbp sbp θj for each term equivalent symmetry of the form θj we apply the sbp formulation from section to break it the formulation is cleverly choosing which symmetries to explicitly break so that exponentially many symmetries are broken the inner conjunction iterates over all of the terms in term equivalent class an ordering is placed on the terms and the only symmetries that are explicitly broken are those that swap two adjacent terms and map everything else to identity all of the adjacent pairs of terms in the ordering are broken in this way with linear number of calls to sbp as we show below this excludes models from the search space while preserving at least one model in each orbit the outer conjunction iterates over the different term equivalent subsets breaking each one of them in turn the following theorem states this in formal terms see the supplement for proof theorem csbp removes models from the search space while preserving at least one model in each 
orbit induced by the term equivalent symmetry group θz corollary if is weighted the max model for csbp is max model for and it has the same weight in both theories experiments we pitted our term and term equivalent sbp formulations against other inference systems to demonstrate their efficacy for each domain we tested on we compared the following algorithms vanilla running an exact maxsat solver on the grounded instance shatter running an exact maxsat solver on the grounded instance with sbps added by the shatter tool term running an exact maxsat solver on the grounded instance with sbps added by our term symmetry detection algorithm tequiv running an exact maxsat solver on the grounded instance with sbps added by our term equivalent symmetry detection algorithm rockit running rockit on the mln ptp running the ptp algorithm for marginal inference on the mln and mws running the approximate maxwalksat algorithm used by on the grounded instance these algorithms were chosen so that our methods would be compared against variety of other techniques ground techniques including regular maxsat vanilla propositional symmetrybreaking shatter local search mws and lifted techniques including cutting planes rockit and lifted rules ptp it should be noted that all the algorithms except mws and rockit are exact rockit solves an lp relaxation but it always gave us an exact solution whenever it could run therefore we report the solution quality only for mws all the experiments were run on system with xeon processor and of memory the algorithms for term and tequiv were implemented as step before running maxsat solver the mws runs were allowed varying number of flips and up to five restarts for mws the reported results are with equal probability of making random or greedy move other settings were tried and the results were similar both ptp and rockit were run using the default setting of their parameters we experimented on two relational domains the classic pigeonhole problem php with two different variants and an advisor domain details below the publicly available implementation of ptp only supports marginal inference though marginal inference is somewhat harder problem than map ptp failed to run even on the smallest of instances https for ground instances that required an exact maxsat solver we experimented with three different solvers minimaxsat and this was done because different solvers and heuristics tend to work better on different application domains we found that for php variant worked best for php variant was the best and for advisor minimaxsat worked the best we used the best maxsat solver for each domain php in the php problem we are given pigeons and holes the goal is to fit each pigeon in unique hole the problem is unsatisfiable we tried two variants of this problem in the first variant there are hard constraints that ensure that every hole has at most one pigeon and every pigeon is in at most one hole there is soft constraint that states that every pigeon is in every hole the goal is to find of the problem the comparison with various algorithms is given in table vanilla times out except for the smallest instance ptp times out on all instances our algorithms consistently outperform rockit but are not able to outperform shatter except for smaller instances where they are marginally better mws was run on the largest instance and results are shown in table mws is unable to find the optimal solution after million flips in the second php variant there are hard constraints that ensure that every pigeon 
is in precisely one hole there is soft constraint that each hole has at most one pigeon the comparison with various algorithms is given in in table vanilla times out except for the two smallest instances and ptp times out for all the instances our systems consistently outperform rockit and outperform shatter for the larger instances tequiv outperforms term because the detection step is much faster mws table is able to find the optimal within flips and is significantly faster than all other algorithms though mws does better on this variant of php it fails badly on the first one further there is no way for mws to know if it has reached the optimal unlike our exact approaches advisor domain the domain has three types student prof and research area there are predicates to indicate interest in an area and an advisor relationship between students and profs the theory has hard constraints that ensure that each student has precisely one advisor students and their advisors share an interest it also has small soft constraint saying that prof advises two different students the interests of all students and profs are fully observed students have one interest profs have one or more chosen randomly there is at least one prof interested in each area the results for this domain are given in table gives the comparisons vanilla is able to run only on the two smallest instances ptp times out on every instance as before our algorithms outpeform shatter but is outperformed by rockit mws was run on the two larger sized instances see table and it is able to outperform our system in order to make sure that poor performance of ptp is not due to evidence we modified the php formulation to work with no evidence formulation this did not help ptp either based on these results there is no clear winner among the algorithms which performs the best on all of these domains among all the algorithms term tequiv and shatter are the only ones which could run to completion on all the instances that we tried among these our algorithms won on php variant large instances and the advisor domain tequiv outperformed term on php variant large instances and they performed similarly in the advisor domain in the php variants all of the pigeons are in one term equivalent symmetry group and all of the holes are in another the php is of special interest because pigeonhole structure was shown to be present in hard random sat formulas with very high probability in the advisor domain there is term equivalent symmetry group of students for each area and term equivalent symmetry group of professors for each set of areas conclusion and future work in this work we have provided the theoretical foundation for using techniques from satisfiability testing for map inference in weighted relational theories we have presented the class of term symmetries and subclass of term equivalent symmetries both of which are found in the evidence given with the theory we have presented ways to detect and break both these classes for the class of term equivalent symmetries the detection method runs in time and the number of sbps required to break the symmetries is linear in the domain size the algorithms presented and compared them against other systems on set of relational domains future work includes carefully characterizing the cases where our algorithms perform better than other systems experimenting on additional domains comparing with more lifted inference approaches and the application of similar techniques to the problem of marginal inference table 
experiments with lifted symmetries for map inference soft pigeon hole domain variant vanilla to to to to to to pigeon shatter tequiv term rockit ptp to to to to to to to soft pigeon hole domain variant pigeon vanilla to to to to shatter tequiv term rockit to ptp to to to to to to advisor domain area vanilla to to to shatter tequiv term rockit ptp to to to to to maxwalksat results problem size optimal advisor advisor best time best time best time all times are given in seconds to indicates the program timed out indicates the program failed without timeout instance too large the vanilla column is the time to find the max model of the ground theory the first three subtables give results for exact algorithms the shatter term and tequiv columns are the same with propositional term and term equivalent symmetries broken respectively these columns are given as where is the time to detect and break the symmetries and is the time to find the max model the rockit column is the time it takes rockit to find the max model the lifted ptp algorithm was run on every instance and timed out on each instance the last table gives the results for the approximate algorithm maxwalksat giving the time it takes to perform certain number of iterations as well as the comparison between the best model found on that run and the optimal model references fadi aloul igor markov and karem sakallah shatter efficient for boolean satisfiability in proc of dac pages udi apsel kristian kersting and martin mladenov lifting relational using cluster signatures in proc of aaai pages gilles audemard belaid benhamou and laurent henocque predicting and detecting symmetries in fol finite model search automated reasoning hung bui tuyen huynh and rodrigo de salvo braz exact lifted inference with distinct soft evidence on every object in proc of aaai hung hai bui tuyen huynh and sebastian riedel automorphism groups of graphical models and lifted variational inference in proc of uai pages vasek and endre many hard examples for resolution acm james crawford matthew ginsberg eugene luks and amitabha roy predicates for search problems in proc of kr pages rodrigo de salvo braz eyal amir and dan roth lifted probabilistic inference in proc of ijcai pages guy van den broeck and adnan darwiche on the complexity and approximation of binary evidence in lifted inference in proc of nips pages guy van den broeck and jesse davis conditioning in knowledge compilation and lifted probabilistic inference in proc of aaai guy van den broeck nima taghipour wannes meert jesse davis and luc de raedt lifted probabilistic inference by knowledge compilation in proc of ijcai pages pedro domingos and daniel lowd markov logic an interface layer for artificial intelligence synthesis lectures on artificial intelligence and machine learning morgan claypool publishers pierre flener justin pearson meinolf sellmann pascal van hentenryck and magnus agren dynamic structural symmetry breaking for constraint satisfaction problems constraints vibhav gogate and pedro domingos probabilistic theorem proving in proc of uai pages federico heras javier larrosa and albert oliveras minimaxsat an efficient weighted solver artificial intelligence research hadi katebi karem sakallah and igor markov symmetry and satisfiability an update in theory and applications of satisfiability testing volume pages daniel le berre and anne parrain the library release jsat luks isomorphism of graphs of bounded valence can be tested in polynomial time of computer system science ruben martins vasco 
manquinho and lynce modular maxsat solver in theory and applications of satisfiability testing volume of lecture notes in computer science pedro meseguer and carme torras exploiting symmetries within constraint satisfaction search artificial intelligence happy mittal prasoon goyal vibhav gogate and parag singla new rules for domain independent lifted map inference in proc of nips pages martin mladenov babak ahmadi and kristian kersting lifted linear programming in proc of aistats pages martin mladenov amir globerson and kristian kersting lifted message passing as reparametrization of graphical models in proc of uai pages mathias niepert markov chains on orbits of permutation groups in proc of uai pages mathias niepert marginal density estimation in proc of aaai jan noessner mathias niepert and heiner stuckenschmidt rockit exploiting parallelism and symmetry for map inference in statistical relational models in proc of aaai david poole probabilistic inference in proc of ijcai pages meinolf sellmann and pascal van hentenryck structural symmetry breaking in proc of ijcai pages parag singla and pedro domingos lifted belief propagation in proc of aaai pages parag singla aniruddh nath and pedro domingos approximate lifting techniques for belief propagation in proc of aaai pages deepak venugopal and vibhav gogate clustering for scalable inference in markov logic in proc of pages toby walsh symmetry breaking constraints recent results in proc of aaai 
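As a concrete illustration of the term-equivalence detection described above (compute the context of each constant in the evidence, sort it, and hash it), the following minimal Python sketch partitions the constants of a ground evidence set into candidate term-equivalent classes. The encoding of evidence as (sign, predicate, argument-tuple) triples and all identifiers are assumptions made for illustration; identical sorted contexts are a sufficient (conservative) condition for two constants to be term equivalent.

from collections import defaultdict

def term_equivalent_classes(evidence):
    # evidence: iterable of (sign, predicate, args) triples, e.g.
    #           (True, 'Friends', ('a', 'b')) for a positive ground literal.
    # Returns a list of sets of constants sharing the same evidence context.
    contexts = defaultdict(list)
    for sign, pred, args in evidence:
        for const in set(args):
            # mask every occurrence of this constant in the literal
            masked = tuple('*' if a == const else a for a in args)
            contexts[const].append((sign, pred, masked))

    classes = defaultdict(set)
    for const, ctx in contexts.items():
        classes[tuple(sorted(ctx))].add(const)   # same sorted context, same class
    return list(classes.values())

# Toy evidence: b and c occur in identical contexts, so swapping them maps the
# evidence back to itself and they land in the same term-equivalent subset.
evidence = [
    (True, 'Friends', ('a', 'b')),
    (True, 'Friends', ('a', 'c')),
    (True, 'Smokes', ('a',)),
]
print(term_equivalent_classes(evidence))   # e.g. [{'a'}, {'b', 'c'}]

Sorting each constant's context and hashing it gives the nm log m cost quoted above, with m the evidence size and n the number of constants.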
evaluating the statistical significance of biclusters jason lee yuekai sun and jonathan taylor institute of computational and mathematical engineering stanford university stanford ca yuekai abstract biclustering also known as submatrix localization is problem of high practical relevance in exploratory analysis of data we develop framework for performing statistical inference on biclusters found by algorithms since the bicluster was selected in data dependent manner by biclustering or localization algorithm this is form of selective inference our framework gives exact confidence intervals and for the significance of the selected biclusters introduction given matrix biclustering or submatrix localization is the problem of identifying subset of the rows and columns of such that the bicluster or submatrix consisting of the selected rows and columns are significant compared to the rest of an important application of biclustering is the identification of significant associations in the unsupervised analysis of gene expression data the data is usually represented by an expression matrix whose rows correspond to genes and columns correspond to samples thus associations correspond to salient submatrices of the location and significance of such biclusters in conjunction with relevant clinical information give preliminary results on the genetic underpinnings of the phenotypes being studied more generally given matrix whose rows correspond to variables and columns correspond to samples biclustering seeks associations in the form of salient submatrices without loss of generality we consider square matrices of the form zij the components of ei are given by ei otherwise for our theoretical results we assume the size of the embedded submatrix and the noise variance is known the biclustering problem due to its practical relevance has attracted considerable attention most previous work focuses on finding significant submatrices large class of algorithms for biclustering are they search for submatrices that maximize some score function that measures the significance of submatrix in this paper we focus on evaluating the significance of submatrices found by algorithms for biclustering more precisely let be random pair output by biclustering algorithm we seek to test whether the localized submatrix xi contains any signal test the hypothesis mij since the hypothesis depends on the random output of thep biclustering algorithm this is form of selective inference the distribution of the test statistic xij depends on the specific algorithm and is extremely difficult to derive for many heuristic biclustering algorithms our main contribution is to test whether biclustering algorithm has found statistically significant bicluster the tests and confidence intervals we construct are exact meaning that in finite samples the type error is exactly this paper is organized as follows first we review recent work on biclustering and related problems then in section we describe our framework for performing inference in the context of simple biclustering algorithm based on scan statistic we show the framework gives exact unif under and the can be inverted to form confidence intervals for the amount of signal in xi under the minimax ratio snr regime logk the test has full asymptotic power in section we show the framework handles more computationally tractable biclustering algorithms including greedy algorithm originally proposed by shabalin et al in the supplementary materials we discuss the problem in the more general setting 
where there are multiple emnbedded submatrices finally we present experimental validation of the various tests and biclustering algorithms related work slightly easier problem is submatrix detection test whether matrix has an embedded submatrix with nonzero mean this problem was recently studied by ma and wu who characerized the minimum signal strength for any test and any computationally tractable test to reliably detect an embedded submatrix we emphasize that the problem we consider is not the submatrix detection problem but complementary problem submatrix detection asks whether there are any hidden associations in matrix we ask whether submatrix selected by biclustering algorithm captures the hidden association in practice given matrix practitioner might perform in order submatrix detection check for hidden submatrix with elevated mean submatrix localization attempt to find the hidden submatrix selective inference check whether the selected submatrix captures any signal we focus on the third step in the pipeline results on evaluating the significance of selected submatrices are scarce the only result we know of is by bhamidi dey and nobel who characterized the asymptotic distribution of the largest average submatrix in gaussian random matrices their result may be used to form an asymptotic test of the submatrix localization problem due to its practical relevance has attracted considerable attention most prior work focuses on finding significant submatrices broadly speaking submatrix localization procedures fall into one of two types search procedures and spectral algorithms the main idea behind the approach to submatrix localization is significant submatrices should maximize some score that measures the significance of submatrix the average of its entries or the of anova model since there are exponentially many submatrices many search procedure use heuristics to reduce the search space such heuristics are not guaranteed to succeed but often perform well in practice one of the purposes of our work is to test whether heuristic algorithm has identified significant submatrix the submatrix localization problem exhibits statistical and computational that was first studied by balakrishnan et al they compare the snr required by several computationally efficient algorithms to the minimax snr recently chen and xu study the when there are several embedded submatrices in this more general setting they show the snr required by convex relaxation is smaller than the snr required by thresholding thus the power of convex relaxation is in separating not in identifying one framework for evaluating the significance of submatrix our main contribution is framework for evaluating significance of submatrix selected by biclustering algorithm the framework allows us to perform exact inference on the selected submatrix in this section we develop the framework on very simple algorithm that simply outputs the largest average submatrix at high level our framework consists of characterizing the selection event and applying the key distributional result in to obtain pivotal quantity the significance of the largest average submatrix to begin we consider performing inference on output of the simple algorithm that simply returns the submatrix with largest sum let be the set of indices of all submatrices of the largest average submatrix las algorithm returns pair ilas jlas ilas jlas arg max eti xej the optimal value tr ejlas etilas is distributed like the maxima of nk correlated normal random variables although results on 
the asymptotic distribution fixed growing of under are known theorem in we are not aware of any results that characterizes the finite sample distribution of the optimal value to avoid this pickle we condition on the selection event elas ilas jlas and work with the distribution of ilas jlas we begin by making key observation the selection event given by is equivalent to satisfying set of linear inequalities given by tr ej eti tr ej for any thus the selection event is equivalent to falling in the polyhedral set clas tr ej eti tr ej for any thus ilas jlas clas is constrained gaussian random variable recall our goal was to perform inference on the amount of signal in the selected submatrix xilas jlas this task is akin to performing inference on the mean of constrained gaussian random variable namely clas we apply the selective inference framework by lee et al to accomplish the task before we delve into the details of how we perform inference on the mean parameter of constrained gaussian random variable we review the key distribution result in concerning constrained gaussian random variables theorem consider gaussian random variable rn with mean rn and covariance constrained to polyhedral set rp ay for some rm the mean parameter is the mean of the gaussian prior to truncation let rn represent linear function of define ηaση ση and sup bj ay αj αj inf bj ay αj αj αj inf bj ay αj the expression ση is pivotal quantity with unif distribution ση ay unif remark the truncation limits and and depend on and the polyhedral set we omit the dependence to keep our notation manageable recall elas is constrained gaussian random variable constrained to the polyhedral set clas given by by theorem and the characterization of the selection event elas the random variable tr ej eti elas where and and are evaluated on the set clas is uniformly distributed on the unit interval the mean parameter tr ej eti is the amount of signal captured by xi tr ej eti what are and let ei for any for convenience we index the constraints by the pairs the term αi is given by αi since αi is negative for any sn and the upper truncation limit is the lower truncation limit simplifies to tr tr ei αi we summarize the developments thus far in corollary corollary we have tr ej eti elas unif tr ei ei tr ei αi under the hypothesis tr ejlas etilas we expect elas unif thus is for the hypothesis under the alternative we expect the selected submatrix to be stochastically larger than under the null thus rejecting when the is smaller than is an exact level test for reject elas since the test controls type error at for all possible selection events all possible outcomes of the las algorithm the test also controls type error unconditionally reject reject elas elas elas thus the test is an exact test of we summarize the result in theorem theorem the test that rejects when tr eilas jlas ei tr eitlas jlas αi ilas jlas is valid test for mij to obtain confidence intervals for the amount of signal in the selected submatrix we invert the pivotal quantity given by by corollary the interval αo is an exact confidence interval for mij when ilas jlas is confidence interval for like the test given by lemma the confidence intervals given by are also valid unconditionally power under minimax ratio in section we derived an exact valid test for the hypothesis in this section we study the power of the test before we delve into the details we review some relevant results to place our result in the correct context balakrishnan et al show must be at least log for any 
algorithm to succeed find the embedded submatrix with high probability they also show the las algorithm is minimax rate optimal the las algorithm finds the embedded submatrix with probability when we show that the test given by theorem has asymptotic full power under log the same signal strength the proof is given in the appendix log theorem let log when max log log and the test given by corollary has power at least pr reject further for any sequence such that when and pr reject general scan statistics although we have elected to present our framework in the context of biclustering the framework readily extends to scan statistics let where has the form zi for some and otherwise the set belongs to collection sn we decide which index set in generated the data by arg zi given we are interested in testing the null hypothesis to perform exact inference for the selected effect we must first characterize the selection event we observe that the selection event is equivalent to satisfying set of linear inequalities given by ets ets for any given the form of the constraints as es es es for any ets es since we have as which implies the term also simplifies sup es es as ets ets sup es es as as let be the largest and second largest scan statistics we have sup es es intuitively the pivot will be large the will be small when ets exceeds the lower truncation limit by large margin since the second largest scan statistic is an upper bound for the lower truncation limit the test will reject when exceeds by large margin theorem the test that rejects when where sups as es is valid test for to our knowledge most precedures for obtaining valid inference on scan statistics require careful characterization of the asymptotic distribution of such results are usually only valid when the components of are independent with identical variances see and can only be used to test the global null our framework not only relaxes the independence and homeoskedastic assumption but also allows us to for confidence intervals for the selected effect size extensions to other approaches returning to the submatrix localization problem we note that the framework described in section also readily handles other approaches as long as the scores are affine functions of the entries the main idea is to partition into regions that corresponding to possible outcomes of the algorithm the event that the algorithm outputs particular submatrix is equivalent to falling in the corresponding region of in this section we show how to perform exact inference on biclusters found by more computationally tractable algorithms greedy search searching over all nk submatrices to find the largest average submatrix is computationally intractable for all but the smallest matrices here we consider family of heuristics based on greedy search algorithm proposed by shabalin et al that looks for local largest average submatrices their approach is widely used to discover associations in gene expression data here the score is simply the sum of the entries in submatrix algorithm greedy search algorithm initialize select repeat the indices of the rows with the largest column sum in the indices of the columns with the largest row sum in until convergence to adapt the framework laid out in section to the greedy search algorithm we must characterize the selection event here the selection event is the path of the greedy search egrs egrs is the event the greedy search selected at the first step at the second step etc in practice to ensure stable performance of the greedy 
algorithm shabalin et al propose to run the greedy search with random initialization times and select the largest local maximum suppose the greedy search outputs the largest local maximum the selection event is arg max eigrs xejgrs where egrs egrs im jm im jm is the event the greedy search selected im jm at the first step im jm at the second step etc an alternative to running the greedy search with random initialization many times and picking the largest local maximum is to initialize the greedy search intelligently let jgreedy be the output of the intelligent initialization the selection event is given by egrs jgreedy where egrs is the event the greedy search selected at the first step at the second step etc the intelligent initialization selects when et xej et xej for any which corresponds to selecting the columns with largest sum thus the selection event is equivalent to falling in the polyhedral set cgrs tr ej et tr ej et for any where cgrs is the constraint set corresponding to the selection event egrs see appendix for an explicit characteriation largest sum test an alternative to running the greedy search is to use test statistic based off choosing the rows and columns with largest sum the largest sum test selects subset of columns when et xej et xej for any which corresponds to selecting the columns with largest sum similarly it selects rows with largest sum thus the selection event for initialization at is equivalent to falling in the polyhedral set tr ej et tr ej et for any tr ei et tr et for any the procedure of selecting the largest was analyzed in they proved that when log the procedure recovers the planted submatrix we show similar result for the test statistic based off the intelligent initialization tr ej under the null of the statistic is uniformly distributed so type error is controlled at level theorem below shows that this computationally tractable test has power tending to for log theorem assume that exp and when let log max log the test given by corollary has power at log least pr reject further for any sequence such that when and pr reject power power power signal strength signal strength signal strength figure random initialization with restarts power power power signal strength signal strength signal strength figure intelligent initialization in practice we have found that initializing the greedy algorithm with the rows and columns identified by the largest sum test stabilizes the performance of the greedy algorithm and preserves power by intersecting the selection events from the largest sum test and the greedy algorithm the test also controls type error let iloc jloc be the pair of indices returned by the greedy algorithm initialized with from the largest sum test the test statistic is given by tr ejloc etiloc where are now computed using the intersection of the greedy and the largest sum selection events this statistic is also uniformly distributed under the null we test the performance of three of the biclustering algorithms algorithm with the intelligent initialization in and algorithm with random restarts we generate data from the model for various values of and we only test the power of each procedure since all of the algorithms discussed provably control type error the results are in figures and the shows power the probability of rejecting and the log is rescaled signal strength the tests were calibrated to control type error at so any power over is nontrivial from the log plot we see that the intelligently initialized greedy procedure outperforms the greedy 
algorithm with single random initialization and the greedy algorithm with random initializations conclusion in this paper we considered the problem of evaluating the statistical significance of the output of several biclustering algorithms by considering the problem as selective inference problem we are able to devise exact significance tests and confidence intervals for the selected bicluster we also show how the framework generalizes to the more practical problem of evaluating the significance of multiple biclusters in this setting our approach gives sequential tests that control error rate in the strong sense references louigi nicolas broutin luc devroye lugosi et al on combinatorial testing problems the annals of statistics brendan pw ames guaranteed clustering and biclustering via semidefinite programming mathematical programming pages brendan pw ames and stephen vavasis convex optimization for the planted problem mathematical programming ery emmanuel candes arnaud durand et al detection of an anomalous cluster in network the annals of statistics sivaraman balakrishnan mladen kolar alessandro rinaldo aarti singh and larry wasserman statistical and computational tradeoffs in biclustering in nips workshop on computational in statistical learning shankar bhamidi partha dey and andrew nobel energy landscape for large average submatrix detection problems in gaussian random matrices arxiv preprint yudong chen and jiaming xu tradeoffs in planted problems and submatrix localization with growing number of clusters and submatrices arxiv preprint yizong cheng and george church biclustering of expression data in ismb volume pages laura lazzeroni and art owen plaid models for gene expression data statistica sinica jason lee dennis sun yuekai sun and jonathan taylor exact inference with the lasso arxiv preprint zongming ma and yihong wu computational barriers in minimax submatrix detection arxiv preprint andrey shabalin victor weigman charles perou and andrew nobel finding large average submatrices in high dimensional data the annals of applied statistics pages 
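To make the selection step of the preceding section concrete, here is a minimal sketch of the largest-sum initialization and the alternating greedy refinement for the planted-submatrix model. The function names are hypothetical, the Gaussian noise model and block size are illustrative assumptions, and the selective-inference p-value computation is omitted entirely; only the row/column selection step is shown.

```python
import numpy as np

def largest_sum_selection(X, k1, k2):
    """Select the k1 rows and k2 columns of X with the largest sums.

    This is the 'largest sum' selection used above as an intelligent
    initialization for the greedy submatrix search.
    """
    rows = np.argsort(X.sum(axis=1))[-k1:]
    cols = np.argsort(X.sum(axis=0))[-k2:]
    return np.sort(rows), np.sort(cols)

def greedy_refine(X, rows, cols, max_iter=100):
    """Alternating greedy updates: fix the columns, pick the best rows, and vice versa.

    Stops at a local maximum of the submatrix sum (a simple variant of the
    large-average-submatrix search of Shabalin et al.).
    """
    k1, k2 = len(rows), len(cols)
    for _ in range(max_iter):
        new_rows = np.sort(np.argsort(X[:, cols].sum(axis=1))[-k1:])
        new_cols = np.sort(np.argsort(X[new_rows, :].sum(axis=0))[-k2:])
        if np.array_equal(new_rows, rows) and np.array_equal(new_cols, cols):
            break
        rows, cols = new_rows, new_cols
    return rows, cols

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k, mu = 60, 8, 1.0                    # matrix size, planted block size, signal strength (illustrative)
    X = rng.standard_normal((n, n))
    X[:k, :k] += mu                          # planted elevated-mean submatrix
    r0, c0 = largest_sum_selection(X, k, k)  # intelligent initialization
    r, c = greedy_refine(X, r0, c0)
    print("recovered rows:", r)
    print("recovered cols:", c)
```

In this sketch the greedy search is run once from the largest-sum initialization, rather than from many random restarts, which is the stabilization strategy advocated above.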
discriminative robust transformation learning jiaji huang qiang qiu guillermo sapiro robert calderbank department of electrical engineering duke university durham nc abstract this paper proposes framework for learning features that are robust to data variation which is particularly important when only limited number of training samples are available the framework makes it possible to tradeoff the discriminative value of learned features against the generalization error of the learning algorithm robustness is achieved by encouraging the transform that maps data to features to be local isometry this geometric property is shown to improve thereby providing theoretical justification for reductions in generalization error observed in experiments the proposed optimization framework is used to train standard learning algorithms such as deep neural networks experimental results obtained on benchmark datasets such as labeled faces in the wild demonstrate the value of being able to balance discrimination and robustness introduction learning features that are able to discriminate is classical problem in data analysis the basic idea is to reduce the variance within class while increasing it between classes one way to implement this is by regularizing certain measure of the variance while assuming some prior knowledge about the data for example linear discriminant analysis lda measures sample covariance and implicitly assumes that each class is gaussian distributed the low rank transform lrt instead uses nuclear norm to measure the variance and assumes that each class is near subspace different approach is to regularize the pairwise distances between data points examples include the seminal work on metric learning and its extensions while great attention has been paid to designing objectives to encourage discrimination less effort has been made in understanding and encouraging robustness to data variation which is especially important when limited number of training samples are available one exception is which promotes robustness by regularizing the traditional metric learning objective using prior knowledge from an auxiliary unlabeled dataset in this paper we develop general framework for balancing discrimination and robustness robustness is achieved by encouraging the learned transform to be locally an isometry within each class we theoretically justify this approach using and give an example of the proposed formulation incorporating it in deep neural networks experiments validate the capability to discrimination against robustness our main contributions are the following prove that locally near isometry leads to robustness propose practical framework that allows to robustify wide class of learned transforms both linear and nonlinear provide an explicit realization of the proposed framework achieving competitive results on difficult face verification tasks the paper is organized as follows section motivates the proposed study and proposes general formulation for learning discriminative robust transform drt section provides theoretical justification for the framework by making an explicit connection to robustness section gives specific example of drt denoted as section provides experimental validation of eucdrt and section presents conclusions problem formulation consider an classification problem the training set is denoted by xi yi where xi rn is the data and yi is the class label we want to learn feature transform fα such that datum becomes more discriminative when it is transformed to feature fα 
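As a rough illustration of this goal (before the formal objective is given below), the sketch assumes a linear map as the transform and measures within-class and between-class scatter of the resulting features; a discriminative transform should shrink the former relative to the latter. The random matrix A is only a placeholder for a learned fα, and the helper name class_scatter is hypothetical.

```python
import numpy as np

def class_scatter(features, labels):
    """Within- and between-class scatter of a set of feature vectors.

    A crude way to quantify how 'discriminative' the transformed data f_alpha(x) is:
    small within-class scatter together with large between-class scatter.
    """
    overall_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        Fc = features[labels == c]
        mu_c = Fc.mean(axis=0)
        within += np.sum((Fc - mu_c) ** 2)
        between += len(Fc) * np.sum((mu_c - overall_mean) ** 2)
    return within, between

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # two classes in R^10, separated along the first coordinate
    X = rng.standard_normal((200, 10))
    y = (rng.random(200) < 0.5).astype(int)
    X[y == 1, 0] += 3.0
    A = rng.standard_normal((3, 10)) / np.sqrt(10)   # placeholder linear transform f_alpha(x) = A x
    print("input   scatter (within, between):", class_scatter(X, y))
    print("feature scatter (within, between):", class_scatter(X @ A.T, y))
```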
the transform fα is parametrized by vector framework that includes linear transforms and neural networks where the entries of are the learned network parameters motivation the transform fα promotes discriminability by reducing variance and enlarging interclass variance this aim is expressed in the design of objective functions or the structure of the transform fα however the robustness of the learned transform is an important issue that is often overlooked when training samples are scarce statistical learning theory predicts overfitting to the training data the result of overfitting is that discrimination achieved on test data will be significantly worse than that on training data our aim in this paper is the design of robust transforms fα for which the degradation is small we formally measure robustness of the learned transform fα in terms of given distance metric learning algorithm is said to be if the input data space can be partitioned into disjoint sets sk such that for all training sets the learned parameter αt determines loss for which the value on pairs of training samples taken from different sets sj and sk is very close to the value of any pair of data samples taken from sj and sk is illustrated in fig where and are both of diameter and fα fα fα fα if the transform fα preserves all distances within and then can not deviate much from figure here fα fα and fα fα the difference can not deviate too much from formulation and discussion motivated above reasoning we now present our proposed framework first we define pair if yi yj label given metric we use the following hinge loss to encourage otherwise high distance and small distance max fα xi fα xj here is the set of all data pairs is function of and similar to metric learning this loss function connects pairwise distance to discrimination however traditional metric learning typically assumes squared euclidean distance and here the metric can be arbitrary for robustness as discussed above we may want fα to be within each small local region in particular we define the set of all local neighborhoods as xi xj note on the notations matrices vectors are denoted in upper lower case bold letters scalars are denoted in plain letters therefore we minimize the following objective function fα xi fα xj xi xj note that we do not need to have the same metric in both the input and the feature space they do not even have in general the same dimension with slight abuse of notation we use the same symbol to denote both metrics to achieve discrimination and robustness simultaneously we formulate the objective function as weighted linear combination of the two extreme cases in and max fα xi fα xj fα xi fα xj xi xj where the formulation balances discrimination and robustness when it seeks discrimination and as decreases it starts to encourage robustness we shall refer to transform that is learned by solving as discriminative robust transform drt the drt framework provides opportunity to select both the distance measure and the transform family theoretical analysis in this section we provide theoretical explanation for robustness in particular we show that if the solution to yields transform fα that is locally near isometry then fα is robust theoretical framework let denote the original data let denote the set of class labels and let the training samples are pairs zi xi yi drawn from some unknown distribution defined on the indicator function is defined as if yi yj and otherwise let fα be transform that maps feature to more discriminative feature fα and 
let denote the space of transformed features for simplicity we consider an arbitrary metric defined on both and the general case of different metrics is straightforward extension and loss function fα xi fα xj that encourages fα xi fα xj to be small big if we shall require the lipschtiz constant of and to be upper bounded by note that the loss function in eq has lipschtiz constant of we abbreviate fα xi fα xj hα zi zj the empirical loss on the training set is function of given by pn remp hα zi zj and the expected loss on the test data is given by hα the algorithm operates on pairs of training samples and finds parameters αt arg min remp that minimize the empirical loss on the training set the difference remp between expected loss on the test data and empirical loss on the training data is the generalization error of the algorithm and covering number we work with the following definition of definition learning algorithm is if can be partitioned into disjoint sets zk such that for all training sets the learned parameter αt determines loss function where the value on pairs of training samples taken from sets zp and zq is very close to the value of any pair of data samples taken from zp and zq formally assume zi zj with zi zp and zj zq if zp and zq then hα zi zj hα remark means that the loss incurred by testing pair in zp zq is very close to the loss incurred by any training pair zi zj in zp zq it is shown in that the generalization error of algorithms is bounded as αt remp αt therefore the smaller the smaller is the generalization error and the more robust is the learning algorithm given metric space the covering number specifies how many balls of given radius are needed to cover the space the more complex the metric space the more balls are needed to cover it covering number is formally defined as follows definition covering number given metric space we say that subset of is of if for every element there exists such that the number of is nγ min is of remark the covering number is measure of the geometric complexity of set with covering number can be partitioned into disjoint subsets such that any two points within the same subset are separated by no more than lemma the metric space can be partitioned into subsets denoted as such that any two points in the same subset satisfy and proof assuming the metric space is compact we can partition into subsets each with diameter at most since is finite set of size we can partition into subsets with the property that two samples in the same subset satisfy and it follows from lemma that we may partition into subsets such that pairs of points from the same subset have the same label and satisfy xi xj before we connect local geometry to robustness we need one more definition we say that learned transform fα is if the metric is distorted by at most definition let be metric spaces with metrics ρa and ρb map is if for any ρb theorem let fα be transform derived via eq and let be cover of as described above if fα is then it is proof sketch consider training samples zi zj and testing samples such that zi zp and zj zq for some then by lemma xi and xj yi and yj and xi xp and xj xq by definition of fαt xi fαt xi and fαt xj fαt xj rearranging the terms gives fαt xi fαt xi and fαt xj fαt xj figure proof without words in order to bound the generalization error we need to bound the difference between fαt xi fαt xj and fαt fαt the details can be found in here we appeal to the proof schematic in fig we need to bound and it can not exceed twice the diameter of local 
region in the transformed domain robustness of the learning algorithm depends on the granularity of the cover and the degree to which the learned transform fα distorts distances between pairs of points in the same covering subset the subsets in the cover constitute regions where the local geometry makes it possible to bound generalization error it now from that the generalization error satisfies αt remp αt the drt proposed here is particular example of local isometry and theorem explains why the generalization error is smaller than that of pure metric learning the transform described in partitions the metric space into exactly subsets one for each class the experiments reported in section demonstrate that the performance improvements derived from working with finer partition can be worth the cost of learning finer grained local regions an illustrative realization of drt having justified robustness we now provide realization of the proposed general drt where the metric is euclidean distance we use gaussian random variables to initialize then on the randomly transformed data we set to be the average pairwise distance in all our experiments the solution satisfied the condition required in eq we calculate the diameter of the local regions indirectly using the neighbors of each training sample to define local neighborhood we leave the question of how best to initialize the indicator and the diameter for future research we denote this particular example as and use gradient descent to solve for denoting the objective by we define yi fα xi δi fα xi fα xj and kxi xj then δi δi sgn kδi kδi kδi kδi in general fα defines neural network when it defines linear transform let be the linear weights at the layer and let be the output of the layer so that yi xi then the gradients are computed as and for algorithm provides summary and we note that the extension to stochastic training using minbatches is straightforward experimental results in this section we report on experiments that confirm robustness of recall that empirical loss is given by eq where is learned as αt from the training set and the generalization error is remp where the expected loss is estimated using large test set toy example this illustrative example is motivated by the discussion in section we first generate dataset consisting of two noisy then use random matrix to embed the data in space we learn linear transform fα that maps the dimensional data to dimensional features and we use nearest neighbors to construct the set we consider representing the most discriminative balanced and more robust scenarios when the transformed training samples are rather discriminative fig but when the transform is applied to testing data the two classes are more mixed fig when the algorithm gradient descent solver for input training pairs xi xj network as linear transform stepsize neighborhood size output randomly initialize compute yi fα xi on the yi compute the average intra and pairwise distances assign to for each training datum find its nearest neighbor and define the set while stable objective not achieved do compute yi fα xi by forward pass compute objective compute as eq for down to do compute as eq end for end while transformed training transformed training transformed trainsamples discriminative case samples balanced case ing samples robust case transformed testing transformed testing transformed testing samples discriminative case samples balanced case samples robust case figure original and transformed samples embedded in space with different 
colors representing different classes transformed training data are more dispersed within each class fig hence less easily separated than when however fig shows that it is easier to separate the two classes on the test data when robustness is preferred to discriminative power as shown in figs and tab quantifies empirical loss remp generalization error and classification performance by for and as decreases remp increases indicating loss of discrimination on the training set however generalization error decreases implying more robustness we conclude that by varying we can balance discrimination and robustness mnist classfication using very small training set the transform fα learned in the previous section was linear and we now apply more sophisticated convolutional neural network to the mnist dataset the network structure is similar to lenet and is table varying on toy dataset remp generalization error accuracy original data table implementation details of the neural network for mnist classification name table classification error on mnist original pixels lenet dml parameters size stride pad size size stride pad size size stride pad made up of alternating convolutional layers and pooling layers with parameters detailed in table we map the original pixel values image to features while results often use the full training set training samples per class here we are interested in small training sets we use only training samples per class and we use nearest neighbors to define local regions in we vary and study empirical error generalization error and classification accuracy we observe in fig that when decreases the empirical error also decreases but that the generalization error actually increases by balancing between these two factors peak classification accuracy is achieved at next we use emp emp accuracy figure mnist test with only training samples per class we vary and assess remp generalization error and classification accuracy peak accuracy is achieved at training samples per class and compare the performance of with lenet and deep metric learning dml dml minimizes hinge loss on the squared euclidean distances it shares the same spirit with our using all methods use the same network structure tab to map to the features for classification lenet uses linear softmax classifier on top of the layer and minimizes the standard loss during training dml and both use classifier on the learned features classification accuracies are reported in tab in tab we see that all the learned features improve upon the original ones dml is very discriminative and achieves higher accuracy than lenet however when the training set is very small robustness becomes more important and significantly outperforms dml face verification on lfw we now present face verification on the more challenging labeled faces in the wild lfw benchmark where our experiments will show that there is an advantage to balancing disciminability and robustness our goal is not to reproduce the success of deep learning in face verification but to stress the importance of robust training and to compare the proposed objective with popular alternatives note also that it is difficult to compare with deep learning methods when training sets are proprietary we adopt the experimental framework used in and train deep network on the wdref dataset where each face is described using high dimensional lbp feature available at that is reduced to feature using pca the wdref dataset is significantly smaller than the proprietary datasets typical of deep learning 
such as the million labeled faces from individuals in or the labeled faces from individuals in it contains subjects with about samples per subject we compare the objective with deepface df and deep metric learning dml two deep learning objectives for fair comparison we employ the same network structure and train on the same input data deepface feeds the output of the last network layer to an to generate probability distribution over classes then minimizes cross entropy loss the feature fα is implemented as fully connected network with tanh as the squash function weight decay conventional frobenius norm regularization is employed in both df and dml and results are only reported for the best weight decay factor after network is trained on wdref it is tested on the lfw benchmark verification simply consists of comparing the cosine distance between given pair of faces to threshold fig displays roc curves and table reports area under the roc curve auc and verification accuracy lbp refers to verification using the initial lbp features deepface df optimizes for classification objective by minimizing softmax loss and it successfully separates samples from different classes however the constraint that assigns similar representations to the same class is weak and this is reflected in the true positive rate displayed in fig in deep metric learning dml this same constraint is strong but robustness is concern when the training set is small the proposed improves upon both df and dml by balancing disciminability and robustness it is less conservative than df for better discriminability and more responsive to local geometry than dml for smaller generalization error face verification accuracy for was obtained by varying the regularization parameter between and as shown in fig then reporting the peak accuracy observed at verification accuracy deepface dml table verification accuracy and aucs on lfw method figure comparison of figure verification accurocs for all methods racy of as varies deepface dml accuracy auc conclusion we have proposed an optimization framework within which it is possible to tradeoff the discriminative value of learned features with robustness of the learning algorithm improvements to generalization error predicted by theory are observed in experiments on benchmark datasets future work will investigate how to initialize and tune the optimization also how the algorithm compares with other methods that reduce generalization error acknowledgement the work of huang and calderbank was supported by afosr under fa and by nga under the work of qiu and sapiro is partially supported by nsf and dod http references bellet and habrard robustness and generalization for metric learning neurocomputing chen cao wang wen and sun bayesian face revisited joint formulation in european conference on computer vision eccv chen cao wen and sun blessing of dimensionality feature and its efficient compression for face verification in ieee conference on computer vision and pattern recognition cvpr fukunaga introduction to statistical pattern recognition san diego academic press globerson and roweis metric learning by collapsing classes in advances in neural information processing systems nips goldberger roweis hinton and salakhutdinov neighbourhood components analysis in advances in neural information processing systems nips hu lu and tan discriminative deep metric learning for face verification in the wild in computer vision and pattern recognition cvpr pages huang ramesh berg and labeled faces in the wild 
database for studying face recognition in unconstrained environments technical report university of massachusetts amherst october huang qiu calderbank and sapiro deep transform in international conference on computer vision sapiro qiu learning transformations for clustering and classification journal of machine learning research jmlr pages sumit hadsell and lecun learning similarity metric discriminatively with application to face verification in ieee conference on computer vision and pattern recognition cvpr volume pages sun chen wang and tang deep learning face representation by joint in advances in neural information processing systems nips pages sun wang and tang deep learning face representation from predicting classes in ieee conference on computer vision and pattern recognition cvpr pages taigman yang ranzato and wolf deepface closing the gap to humanlevel performance in face verification in ieee conference on computer vision and pattern recognition cvpr pages vapnik an overview of statistical learning theory ieee transactions on neural networks weinberger and saul distance metric learning for large margin nearest neighbor classification journal of machine learning research xing ng jordan and russell distance metric learning with application to clustering with in advances in neural information processing systems nips xu and mannor robustness and generalization machine learning zha mei wang wang and hua robust distance metric learning with auxiliary knowledge in international joint conference on artificial intelligence ijcai 
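To make the euc-DRT solver of the preceding sections concrete, the following is a minimal sketch assuming a linear transform, Euclidean metrics on both the input and feature spaces, and illustrative margin values for the hinge term; the exact thresholds, step sizes, and neighborhood construction used in the paper are not reproduced above, so those choices are assumptions. This is not the authors' implementation, and it omits the neural-network case and mini-batch training.

```python
import numpy as np

def pair_grad(A, d, coeff):
    """Gradient of coeff * ||A d|| with respect to A, where d = x_i - x_j."""
    Ad = A @ d
    norm = np.linalg.norm(Ad) + 1e-12
    return coeff * np.outer(Ad, d) / norm, norm

def drt_step(A, X, y, neighbors, lam=0.5, margin_same=1.0, margin_diff=3.0, lr=1e-2):
    """One full-batch gradient step on a euc-DRT-style objective for a linear transform A.

    Discrimination term: hinge losses pulling same-class pairs closer than
    `margin_same` and pushing different-class pairs farther than `margin_diff`
    (illustrative margins, not the paper's).
    Robustness term: | ||A(x_i - x_j)|| - ||x_i - x_j|| | over within-class
    neighbor pairs, encouraging the transform to be a local isometry.
    """
    n = len(X)
    G = np.zeros_like(A)
    for i in range(n):                               # discrimination over all pairs
        for j in range(i + 1, n):
            d = X[i] - X[j]
            g, delta = pair_grad(A, d, 1.0)
            if y[i] == y[j] and delta > margin_same:
                G += lam * g                         # pull same-class pair together
            elif y[i] != y[j] and delta < margin_diff:
                G -= lam * g                         # push different-class pair apart
    for i, j in neighbors:                           # robustness over neighbor pairs
        d = X[i] - X[j]
        g, delta = pair_grad(A, d, 1.0)
        G += (1.0 - lam) * np.sign(delta - np.linalg.norm(d)) * g
    return A - lr * G

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((40, 5))
    y = np.array([0] * 20 + [1] * 20)
    X[y == 1] += 2.0
    # neighbor pairs: each point with its nearest same-class point
    neighbors = []
    for i in range(len(X)):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        j = same[np.argmin(np.linalg.norm(X[same] - X[i], axis=1))]
        neighbors.append((i, j))
    A = rng.standard_normal((3, 5)) / np.sqrt(5)
    for _ in range(50):
        A = drt_step(A, X, y, neighbors, lam=0.7)
    print("learned transform shape:", A.shape)
```

Setting lam close to 1 recovers a purely discriminative (metric-learning-like) update, while smaller lam weights the local-isometry term, trading discrimination for robustness as discussed above.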
bandits with unobserved confounders causal approach andrew department of computer science university of california los angeles forns elias department of computer science purdue university eb judea pearl department of computer science university of california los angeles judea abstract the bandit problem constitutes an archetypal setting for sequential permeating multiple domains including engineering business and medicine one of the hallmarks of bandit setting is the agent capacity to explore its environment through active intervention which contrasts with the ability to collect passive data by estimating associational relationships between actions and payouts the existence of unobserved confounders namely unmeasured variables affecting both the action and the outcome variables implies that these two modes will in general not coincide in this paper we show that formalizing this distinction has conceptual and algorithmic implications to the bandit setting the current generation of bandit algorithms implicitly try to maximize rewards based on estimation of the experimental distribution which we show is not always the best strategy to pursue indeed to achieve low regret in certain realistic classes of bandit problems namely in the face of unobserved confounders both experimental and observational quantities are required by the rational agent after this realization we propose an optimization metric employing both experimental and observational distributions that bandit agents should pursue and illustrate its benefits over traditional algorithms introduction the bandit mab problem is one of the most popular settings encountered in the sequential literature with applications across multiple disciplines the main challenge in prototypical bandit instance is to determine sequence of actions that maximizes payouts given that each arm reward distribution is initially unknown to the agent accordingly the problem revolves around determining the best strategy for learning this distribution exploring while simultaneously using the agent accumulated samples to identify the current best arm so as to maximize profit exploiting different algorithms employ different strategies to balance exploration and exploitation but standard definition for the best arm is the one that has the highest payout rate associated with it we will show that perhaps surprisingly the definition of best arm is more involved when unobserved confounders are present this paper complements the vast literature of mab that encompasses many variants including adversarial bandits in which an omnipotent adversary can dynamically shift the reward distributions to thwart the player best strategies contextual bandits in which the payout the authors contributed equally to this paper and therefore the best choice of action is function of one or more observed environmental variables and many different constraints and assumptions over the underlying generative model and payout structure for recent survey see this work addresses the mab problem when unobserved confounders are present called mabuc for short which is arguably the most sensible assumption in practical applications obviously weaker than assuming the inexistence of confounders to support this claim we should first note that in the experimental design literature fisher very motivation for considering randomizing the treatment assignment was to eliminate the influence of unobserved confounders factors that simultaneously affect the treatment or bandit arm and outcome or bandit payout but 
are not accounted for in the analysis in reality the reason for not accounting for such factors explicitly in the analysis is that many of them are unknown priori by the modeller the study of unobserved confounders is one of the central themes in the modern literature of causal inference to appreciate the challenges posed by these confounders consider the comparison between randomized clinical trial conducted by the food and drug administration fda versus physicians prescribing drugs in their offices key tenet in any fda trial is the use of randomization for the treatment assignment which precisely protects against biases that might be introduced by physicians specifically physicians may prescribe drug for their wealthier patients who have better nutrition than their less wealthy ones when unknown to the doctors the wealthy patients would recover without treatment on the other hand physicians may avoid prescribing the expensive drug to their less privileged patients who again unknown to the doctors tend to suffer less stable immune systems causing negative reactions to the drug if naive estimate of the drug causal effect is computed based on physicians data obtained through random sampling but not random assignment the drug would appear more effective than it is in practice bias that would otherwise be avoided by random assignment confounding biases of variant magnitude appear in almost any application in which the goal is to learn policies instead of statistical associations and the use of randomization of the treatment assignment is one established tool to combat them to the best of our knowledge no method in the bandit literature has studied the issue of unobserved confounding explicitly in spite of its pervasiveness in applications specifically no mab technique makes distinction between experimental exploration through random assignment as required by the fda and observational data as given by random sampling in the doctors offices in this paper we explicitly acknowledge formalize and then exploit these different types of more specifically our contributions are as follow we show that the current bandit algorithms implicitly attempt to maximize rewards by estimating the experimental distribution which does not guarantee an optimal strategy when unobserved confounders are present section based on this observation we translate the mab problem to causal language and then suggest more appropriate metric that bandit players should optimize for when unobserved confounders are present this leads to new exploitation principle that can take advantage of data collected under both observational and experimental modes section we empower thompson sampling with this new principle and run extensive simulations the experiments suggest that the new strategy is stats efficient and consistent sec challenges due to unobserved confounders in this section we discuss the mechanics of how the maximization of rewards is treated based on bandit instance with unobserved confounders consider scenario in which greedy casino decides to demo two new models of slot machines say and for simplicity and wishes to make them as lucrative as possible as such they perform battery of observational studies using random sampling to compare various traits of the casino gamblers to their typical slot machine choices from these studies the casino learns that two factors well predict the gambling habits of players when combined unknown by the players player inebriation and machine conspicuousness say whether or not machine is 
blinking coding both of these traits as binary variables we let denote whether or not machine is blinking and denote whether or not the gambler is drunk as it turns out gambler natural choice of machine can be modelled by the structural equation indicating the index of their chosen arm starting at fx figure performance of different bandit strategies in the greedy casino example left panel no algorithm is able to perform better than random guessing right panel regret grows without bounds moreover the casino learns that every gambler has an equal chance of being intoxicated and each machine has an equal chance of blinking its lights at given time namely and the casino executives decide to take advantage of these propensities by introducing new type of reactive slot machine that will tailor payout rates to whether or not it believes via sensor input assumed to be perfect for this problem gambler is intoxicated suppose also that new gambling law requires that casinos maintain minimum attainable payout rate for slots of cognizant of this new law while still wanting to maximize profits by exploiting gamblers natural arm choices the casino executives modify their new slots with the payout rates depicted in table table payout rates decided by reactive slot machines as function of arm choice sobriety and machine conspicuousness players natural arm choices under are indicated by asterisks payout rates according to the observational and experimental distributions where represents winning shown in the table and otherwise the state blind to the casino payout strategy decides to perform randomized study to verify whether the win rates meet the payout requisite wary that the casino might try to inflate payout rates for the inspectors the state recruits random players from the casino floor pays them to play random slot and then observes the outcome their randomized experiment yields favorable outcome for the casino with win rates meeting precisely the cutoff the data looks like table third column assuming binary payout where represents losing and winning as students of causal inference and still suspicious of the casino ethical standards we decide to go to the casino floor and observe the win rates of players based on their natural arm choices through random sampling we encounter distribution close to table second column which shows that the casino is actually paying ordinary gamblers only of the time in summary the casino is at the same time exploiting the natural predilections of the gamblers arm choices as function of their intoxication and the machine blinking behavior based on eq paying on average less than the legally allowed instead of and fooling state inspectors since the randomized trial payout meets the legal requirement as machine learning researchers we decide to run battery of experiments using the standard bandit algorithms thompson sampling to test the new slot machines on the casino floor we obtain data encoded in figure which shows that the probability of choosing the correct action is no better than random coin flip even after considerable number of steps we note somewhat surprised that the cumulative regret fig shows no signs of abating and that we are apparently unable to learn superior arm we also note that the results obtained by the standard algorithms coincide with the randomized study conducted by the state purple line under the presence of unobserved confounders such as in the casino example however does not seem to capture the information required to maximize payout but rather 
the average payout akin to choosing arms by coin flip specifically the payout given by coin flipping is the same for both machines which means that the arms are statistically indistinguishable in the limit of large sample size further if we consider using the observational data from watching gamblers on the casino floor based on their natural predilections the average payoff will also appear independent of the machine choice albeit with an even lower payout based on these observations we can see why no arm choice is better than the other under either distribution alone which explains the reason any algorithm based on these distributions will certainly fail to learn an optimal policy more fundamentally we should be puzzled by the disagreement between observational and interventional distributions this residual difference may be encoding knowledge about the unobserved confounders which may lead to some indication on how to differentiate the arms this indeed may lead to some indication on how to differentiate the arms as well as sensible strategy to play better than pure chance in the next section we will use causal machinery to realize this idea bandits as causal inference problem we will use the language of structural causal models ch for expressing the bandit datagenerating process and for allowing the explicit manipulation of some key concepts in our analysis confounding observational and experimental distributions and counterfactuals to be defined definition structural causal model ch structural causal model is where is set of background variables also called exogenous that are determined by factors outside of the model is set vn of observable variables also called endogenous that are determined by variables in the model determined by variables in is set of functions fn such that each fi is mapping from the respective domains of ui ai to vi where ui and ai vi and the entire set forms mapping from to in other words each fi in vi fi pai ui assigns value to vi that depends on the values of the select set of variables ui ai and is probability distribution over the exogenous variables each structural model is associated with directed acyclic graph where nodes correspond to endogenous variables and edges represent functional relationships there exists an edge from to whenever appears in the argument of function we define next the mabuc problem within the structural semantics definition bandits with unobserved confounders bandit problem with unobserved confounders is defined as model with reward distribution over where xt xk is an observable variable encoding player arm choice from one of arms decided by nature in the observational case and do xt for strategy in the experimental case when the strategy decides the choice ut represents the unobserved variable encoding the payout rate of arm xt as well as the propensity to choose xt and yt is reward for losing for winning from choosing arm xt under unobserved confounder state ut decided by yt fy xt ut one may surmise that these ties are just contrived examples or perhaps numerical coincidences which do not appear in realistic bandit instances unfortunately that not the case as shown in the other scenarios discussed in the paper this phenomenon is indeed manifestation of the deeper problem arising due to the lack of control for the unobserved confounders figure model for the standard mab sequential decision game model for the mabuc sequential decision game in each model solid nodes denote observed variables and open nodes represent unobserved 
variables square nodes denote the players strategys arm choice at time dashed lines illustrate influences on future time trials that are not pictured first note that this definition also applies to the mab problem without confounding as shown in fig this standard mab instance is defined by constraining the mabuc definition such that ut affects only the outcome variable yt there is no edge from ut to xt def in the unconfounded case it is clear that ch which means that that payouts associated with flipping coin to randomize the treatment or observing through random sampling the player gambling on the casino floor based on their natural predilections will yield the same answer the variable carries the unobserved payout parameters of each arm which is usually the target of analysis fig provides graphical representation of the mabuc problem note that πt represents the system choice policy which is affected by the unobserved factors encoded through the arrow from ut to πt one way to understand this arrow is through the idea of players natural predilections in the example from the previous section the predilection would correspond to the choices arising when the gambler is allowed to play freely on the casino floor drunk players desiring to play on the blinking machines or doctors prescribing drugs based on their gut feeling physicians prescribing the more expensive drug to their wealthier patients these predilections are encoded in the observational distribution on the other hand the experimental distribution encodes the process in which the natural predilections are overridden or ceased by external policies in our example this distribution arises when the government inspectors flip coin and send gamblers to machines based on the coin outcome regardless of their predilections remarkably it is possible to use the information embedded in these distinct modes and their corresponding distributions to understand players predilections and perform better than random guessing in these bandit instances to witness assume there exists an oracle on the casino floor operating by the following protocol the oracle observes the gamblers until they are about to play given machine the oracle intercepts each gambler who is about to pull the arm of machine for example and suggests the player to contemplate whether following his predilection or going against it playing would lead to better outcome the drunk gambler who is clever machine learning student and familiar with fig says that this evaluation can not be computed priori he affirms that despite spending hours on the casino estimating the payoff distribution based on players natural predilections namely it is not feasible to relate this distribution with the hypothetical construction what would have happened had he decided to play differently he also acknowledges that the experimental distribution devoid of the gamblers predilections does not support any clear comparison against his personal strategy the oracle says that this type of reasoning is possible but first one needs to define the concept of counterfactual definition counterfactual pp let and be two subsets of exogenous variables in the counterfactual sentence would be in situation had been is interpreted as the equality with yx with yx being the potential response of to on more fundamental level it is clear that unconfoundedness is implicitly assumed not to hold in the general case otherwise the equality between observational and experimental distributions would imply that no randomization of the 
action needs to be carried out since standard random sampling would recover the same distribution in this case this would imply that many works in the literature are acting in suboptimal way since in general experiments are more expensive to perform than collecting data through random sampling the interventional nature of the mab problem is virtually not discussed in the literature one of the few exceptions is the causal interpretation of thompson sampling established in this definition naturally leads to the judgement suggested by the oracle namely would the agent win had played on machine which can be formally written as we drop the for simplicity assuming that the agent natural predilection is to play machine the oracle suggests an introspection comparing the odds of winning following his intuition or going against it the former statement can be written in counterfactual notation probabilistically as which reads as the expected value of winning had play machine given that am about to play machine which contrasts with the alternative hypothesis which reads as the expected value of winning had play machine given that am about to play machine this is also known in the literature as the effect of the treatment on the treated ett so instead of using decision rule comparing the average payouts across arms namely for action argmax which was shown in the previous section to be insufficient to handle the mabuc we should consider the rule using the comparison between the average payouts obtained by players for choosing in favour or against their intuition respectively argmax where is the player natural predilection and is their final decision we will call this procedure rdc regret decision criterion to emphasize the counterfactual nature of this reasoning step and the idea of following or disobeying the agent intuition which is motivated by the notion of regret remarkably rdc accounts for the agents individuality and the fact that their natural inclination encodes valuable information about the confounders that also affect the payout in the binary case for example assuming that is the player natural choice at some time step if is greater than this would imply that the player should refrain of playing machine to play machine assuming one wants to implement an algorithm based on rdc the natural question that arises is how the quantities entailed by eq can be computed from data for the factors in the form the consistency axiom pp implies that where the is estimable from observational data counterfactuals in the form where can be computed in the binary case through algebraic means pp for the general case however ett is not computable without knowledge of the causal graph here ett will be computed in an alternative fashion based on the idea of randomization the main idea is to randomize groups namely interrupt any reasoning agent before they execute their choice treat this choice as intention delibarte and then act we discuss next about the algorithmic implementation of this randomization applications experiments based on the previous discussion we can revisit the greedy casino example from section apply rdc and use the following inequality to guide agent decisions there are different ways of incorporating this heuristic into traditional bandit algorithms and we describe one such approach taking the thompson sampling algorithm as the basis for simulation source code see https our proposed algorithm causal thompson sampling takes the following steps first accepts an observational distribution as input 
which it then uses to seed estimates of ett quantities for actions and intuition by consistency we may seed knowledge of pobs with large samples from an input set of observations this seeding reduces and possibly eliminates the need to explore the payout rates associated with following intuition leaving only the disobeying intuition payout rates left for the agent to learn as such at each time step our oracle observes the agent predilection and then uses rdc to graphical conditions for identifying ett are orthogonal to the bandit problem studied in this paper since no detailed knowledge about the causal graph as well as infinite samples is assumed here mine their best lastly note that our seeding in immediately improves the accuracy of our comparison between arms viz that superior arm will emerge more quickly than had we not seeded we can exploit this early lead in accuracy by weighting the more favorable arm making it more likely to be chosen earlier in the learning process which empirically improves the convergence rate as shown in the simulations algorithm causal thompson sampling procedure tsc pobs pobs seed distribution for do intuition get intuition for trial estimated payout for estimated payout for intuition initialize weights bias compute weighting strength if then bias else bias choose arm to bias max pull choose arm receive reward update in the next section we provide simulations to support the efficacy of in the mabuc context for simplicity we present two simulation results for the model described in section experiment employs the greedy casino parameterization found in table whereas experiment employs the paradoxical switching parameterization found in table each experiment compares the performance of traditional thompson sampling bandit players versus table paradoxical switching parameterization payout rates decided by reactive slot machines as function of arm choice sobriety and machine conspicuousness players natural arm choices under are indicated by asterisks payout rates associated with the observational and experimental distributions respectively procedure all reported simulations are partitioned into rounds of trials averaged over monte carlo repetitions at each time step in single round values for the unobserved confounders and observed intuitive arm choice are selected by their respective structural equations see section the player observes the value of the player chooses an arm based on their given strategy to maximize reward which may or may not employ and finally the player receives bernoulli reward and records the outcome furthermore at the start of every round players possess knowledge of the problem observational distribution each player begins knowing see table however only causallyempowered strategies will be able to make use of this knowledge since this distribution is not as we ve seen the correct one to maximize candidate algorithms standard thompson sampling attempts to maximize rewards based on ignoring the intuition thompson sampling treats the note that using predilection as criteria for the inequality does not uniquely map to the contextual bandit problem to understand this point note that not all variables are equally legitimate for confounding control in causal settings while the agent predilection is certainly one of such variables in our setup specifically when considering whether variable qualifies as causal context requires much deeper understanding of the data generating model which is usually not available in the general case the notation smk 
fmk means to sample from beta distribution with parameters equal to the successes encountered choosing action on machine mk smk and the failures encountered choosing action on that machine fmk for additional experimental results and parameterizations see appendix figure simulation results for experiments and comparing standard thompson sampling thompson sampling and causal thompson sampling predilection as new context variable and attempts to maximize based on at each round causal thompson sampling as described above employs the ett inequality and input observational distribution evaluation metrics we assessed each algorithms performances with standard bandit evaluation metrics the probability of choosing the optimal arm and cumulative regret as in traditional bandit problems these measures are recorded as function of the time step averaged over all round repetitions note however that traditional definitions of regret are not phrased in terms of unobserved confounders our metrics by contrast compare each algorithm chosen arm to the optimal arm for given instantiation of bt and dt even though these instantiations are never directly available to the players we believe that this is fair operationalization for our evaluation metrics because it allows us to compare regret experienced by our algorithms to truly optimal albeit hypothetical policy that has access to the unobserved confounders experiment greedy the greedy casino parameterization specified in table illustrates the scenario where each arm payout appears to be equivalent under the observational and experimental distributions alone only when we concert the two distributions and condition on player predilection can we obtain the optimal policy simulations for experiment support the efficacy of the causal approach see figure analyses revealed significant difference in the regret experienced by sd compared to sd standard was predictably not competitor experiencing high regret sd experiment paradoxical the paradoxical switching parameterization specified in table illustrates the scenario where one arm appears superior in the observational distribution but the other arm appears superior in the experimental again we must use causal analyses to resolve this ambiguity and obtain the optimal policy simulations for experiment also support the efficacy of the causal approach see figure analyses revealed significant difference in the regret experienced by sd compared to sd standard was again predictably not competitor experiencing high regret sd conclusions in this paper we considered new class of bandit problems with unobserved confounders mabuc that are arguably more realistic than traditional formulations we showed that mabuc instances are not amenable to standard algorithms that rely solely on the experimental distribution more fundamentally this lead to an understanding that in mabuc instances the optimization task is not attainable through the estimation of the experimental distribution but relies on both experimental and observational quantities rooted in counterfactual theory and based on the agents predilections to take advantage of our findings we empowered the thompson sampling algorithm in two different ways we first added new rule capable of improving the efficacy of which arm to explore we then jumpstarted the algorithm by leveraging observational data that is often available but overlooked simulations demonstrated that in general settings these changes lead to more effective with faster convergence and lower regret references 
agrawal and goyal analysis of thompson sampling for the bandit problem corr freund auer and schapire gambling in rigged casino the adversarial bandit problem in foundations of computer science annual symposium on pages oct bubeck and nicolo regret analysis of stochastic and nonstochastic bandit problems foundations and trends in machine learning and fast boosting using adversarial bandits in joachims editor international conference on machine learning icml pages haifa israel june bareinboim forney and pearl bandits with unobserved confounders causal approach technical report http cognitive systems laboratory ucla bubeck and slivkins the best of both worlds stochastic and adversarial bandits corr chapelle and li an empirical evaluation of thompson sampling in shawetaylor zemel bartlett pereira and weinberger editors advances in neural information processing systems pages curran associates miroslav daniel hsu satyen kale nikos karampatziakis john langford lev reyzin and tong zhang efficient optimal learning for contextual bandits corr eyal shie mannor and yishay mansour action elimination and stopping conditions for the bandit and reinforcement learning problems mach learn december fisher the design of experiments oliver and boyd edinburgh edition lai and herbert robbins asymptotically efficient adaptive allocation rules advances in applied mathematics john langford and tong zhang the algorithm for bandits with side information in platt koller singer and roweis editors advances in neural information processing systems pages curran associates ortega and braun minimum relative entropy principle for learning and acting artif int may pearl causality models reasoning and inference cambridge university press new york second herbert robbins some aspects of the sequential design of experiments bull amer math yevgeny seldin peter bartlett koby crammer and yasin prediction with limited advice and multiarmed bandits with paid observations in international conference on machine learning beijing china scott modern bayesian look at the bandit applied stochastic models in business and industry slivkins contextual bandits with similarity information mach learn january shpitser and pearl what counterfactuals can be tested in proceedings of the conference on uncertainty in artificial intelligence uai pages auai press vancouver bc canada 
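A minimal sketch of the causal Thompson sampling idea described above: observational (random-sampling) data seeds the "obey intuition" cells, and at each trial Thompson draws conditioned on the observed intuition implement the regret decision criterion. The payout table and the natural-choice rule below are illustrative placeholders, since the numeric values of the paper's payout tables are not reproduced in the text, and the weighting/bias heuristic of the full algorithm is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative payout rates theta[arm][d, b]; placeholders, not the paper's Table 1.
theta = {0: np.array([[0.1, 0.5], [0.4, 0.2]]),
         1: np.array([[0.5, 0.1], [0.2, 0.4]])}

def natural_choice(d, b):
    """Gambler's natural arm choice x = f_x(d, b); a simple xor-style predilection (assumed)."""
    return int(d == b)

def causal_ts(T=2000, n_obs=1000):
    # Seed the 'obey intuition' cells with observational data: under consistency,
    # P(y | x) estimates E[Y_{X=x} | X=x], the payout of following one's intuition.
    s = np.ones((2, 2))   # successes[intuition, arm]
    f = np.ones((2, 2))   # failures [intuition, arm]
    for _ in range(n_obs):
        d, b = rng.integers(2), rng.integers(2)
        x = natural_choice(d, b)
        y = rng.random() < theta[x][d, b]
        s[x, x] += y
        f[x, x] += 1 - y
    regret = 0.0
    for _ in range(T):
        d, b = rng.integers(2), rng.integers(2)
        i = natural_choice(d, b)                 # observed intuition (intention, before acting)
        draws = rng.beta(s[i], f[i])             # Thompson draws for both arms, given the intuition
        a = int(np.argmax(draws))                # regret decision criterion: obey or go against intuition
        y = rng.random() < theta[a][d, b]
        s[i, a] += y
        f[i, a] += 1 - y
        regret += max(theta[0][d, b], theta[1][d, b]) - theta[a][d, b]
    return regret

print("cumulative regret of causal TS sketch:", causal_ts())
```

The key difference from standard Thompson sampling is that counts are indexed by the intuition-arm pair rather than by the arm alone, so the agent learns the counterfactual payouts of obeying versus disobeying its natural predilection.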
scalable aggregation of classifiers yoav freund uc san diego yfreund akshay balsubramani uc san diego abalsubr abstract we present and empirically evaluate an efficient algorithm that learns to aggregate the predictions of an ensemble of binary classifiers the algorithm uses the structure of the ensemble predictions on unlabeled data to yield significant performance improvements it does this without making assumptions on the structure or origin of the ensemble without parameters and as scalably as linear learning we empirically demonstrate these performance gains with random forests introduction learning is very successful approach to learning classifiers including methods like boosting bagging and random forests the power of these methods has been clearly demonstrated in open learning competitions such as the netflix prize and the imagenet challenge in general these methods train large number of base classifiers and then combine them using possibly weighted majority vote by aggregating over classifiers ensemble methods reduce the variance of the predictions and sometimes also reduce the bias the ensemble methods above rely solely on labeled training set of data in this paper we propose an ensemble method that uses large unlabeled data set in addition to the labeled set our work is therefore at the intersection of learning and ensemble learning this paper is based on recent theoretical results of the authors our main contributions here are to extend and apply those results with new algorithm in the context of random forests and to perform experiments in which we show that when the number of labeled examples is small our algorithm performance is at least that of random forests and often significantly better for the sake of completeness we provide an intuitive introduction to the analysis given in how can unlabeled data help in the context of ensemble learning consider simple example with six equiprobable data points the ensemble consists of six classifiers partitioned into three rules and three rules suppose that the rules each have error and the rules each have error if given only this information we might take the majority vote over the six rules possibly giving lower weights to the rules because they have higher errors suppose however that we are given the unlabeled information in table the columns of this table correspond to the six classifiers and the rows to the six unlabeled examples each entry corresponds to the prediction of the given classifier on the given example as we see the main difference between the rules and the rules is that any two rules disagree with probability whereas the rules always agree for this example it can be seen proved by contradiction that the only possible true labeling of the unlabeled data that is consistent with table and with the errors of the classifiers is that all the examples are labeled consequently we conclude that the majority vote over the rules has zero error performing significantly better than any of the base rules in contrast giving the rules equal weight would we assume that bounds on the errors are with high probability true on the actual distribution such bounds can be derived using large deviation bounds or methods result in an rule with error crucially our reasoning to this point has solely used the structure of the unlabeled examples along with the error rates in table to constrain our search for the true labeling error classifiers classifiers table an example of the utility of unlabeled examples in ensemble learning by such reasoning 
alone we have correctly predicted according to weighted majority vote this example provides some insight into the ways in which unlabeled data can be useful when combining classifiers diversity is important it can be better to combine less accurate rules that disagree with each other than to combine more accurate rules that tend to agree the bounds on the errors of the rules can be seen as set of constraints on the true labeling complementary set of constraints is provided by the unlabeled examples these sets of constraints can be combined to improve the accuracy of the ensemble classifier the above setup was recently introduced and analyzed in that paper characterizes the problem as game between predictor and an adversary it then describes the minimax solution of the game which corresponds to an efficient algorithm for transductive learning in this paper we build on the framework of to devise an efficient and practical semisupervised aggregation algorithm for random forests to achieve this we extend the framework to handle specialists classifiers which only venture to predict on subset of the data and abstain from predicting on the rest specialists can be very useful in targeting regions of the data on which to precisely suggest prediction the idea of our algorithm is to artificially generate new specialists from the ensemble we incorporate these and the targeted information they carry into the framework of the resulting aggregated predictor inherits the advantages of the original framework efficient learning reduces to solving scalable convex optimization and prediction is as efficient and parallelizable as linear prediction no assumptions about the structure or origin of the predictions or labels no introduced parameters the aggregation method is completely safe accuracy guaranteed to be at least that of the best classifier in the ensemble we develop these ideas in the rest of this paper reviewing the core setting of in section and specifying how to incorporate specialists and the resulting learning algorithm in section then we perform an exploratory evaluation of the framework on data in section though the framework of and our extensions can be applied to any ensemble of arbitrary origin in this paper we focus on random forests which have been repeatedly demonstrated to have robust classification performance in wide variety of situations we use random forest as base ensemble whose predictions we aggregate but unlike conventional random forests we do not simply take majority vote over tree predictions instead using aggregation strategy dictated by the framework we employ preliminaries few definitions are required to discuss these issues concretely following write max and all vector inequalities are componentwise we first consider an ensemble hp and unlabeled data xn on which we wish to predict as in the predictions and labels are allowed to be randomized represented by values in instead of just the two values the ensemble predictions on the unlabeled data are denoted by xn hp hp hp xn we use vector notation for the rows and columns of hi hi hi xn and xj xj hp xj the true labels on the test data are represented by zn the labels are hidden from the predictor but we assume the predictor has knowledge of correlation vector such that hi xj zj bi fz these constraints on exactly represent upper bounds on individual classifier error rates which can be estimated from the training set when all the data are drawn in standard way also used by empirical risk minimization erm methods that simply 
predict with the classifier the transductive binary classification game the idea of is to formulate the ensemble aggregation problem as game between predictor and an adversary in this game the predictor is the first player who plays gn randomized label gi for each example xi the adversary then sets the labels under the ensemble classifier error constraints defined by the predictor goal is to minimize the expected classification error on the test data the randomized labelings and which is just this is equivalently viewed as maximizing correlation to summarize concretely we study the following game min max the minimax theorem applies to the game and there is an optimal strategy such that min guaranteeing prediction error on the unlabeled data this optimal strategy is simple function of particular weighting over the hypotheses nonnegative definition slack function let be weight vector over not necessarily distribution the vector of ensemble predictions is xn whose elements magnitudes are the margins the prediction slack function is xj and this is convex in the optimal weight vector is any minimizer arg min the main result of uses these to describe the minimax equilibrium of the game theorem the minimax value of the game is the minimax optimal predictions are defined as follows for all xj gj sgn xj otherwise since is calculated from the training set and deviation bounds we assume the problem feasible interpretation theorem suggests statistical learning algorithm for aggregating the ensemble classifiers predictions estimate from the training labeled set optimize the convex slack function to find and finally predict with gj on each example in the test set the resulting predictions are guaranteed to have low error as measured by in particular it is easy to prove that is at least maxi bi the performance of the best classifier the slack function merits further scrutiny its term depends only on the labeled data and pfirst not the unlabeled set while the second term incorporates only unlabeled information these two terms trade off smoothly as the problem setting becomes fully supervised and unlabeled information is absent the first term dominates and tends to put all its weight on the best single classifier like erm indeed this viewpoint suggests loose interpretation of the second term as an unsupervised regularizer for the otherwise fully supervised optimization of the average error it turns out that change in the regularization factor corresponds to different constraints on the true labels theorem let vα max min for any then vα pn xj so the regularized optimization assumes each zi for this is equivalent to assuming the usual binary labels and then adding uniform random label noise flipping the label on each of the examples independently this encourages clipping of the ensemble predictions to the majority vote predictions as specified by advantages and disadvantages this formulation has several significant merits that would seem to recommend its use in practical situations it is very efficient once is estimated scalable task given the labeled set the slack function is effectively an average over convex functions of unlabeled examples and consequently is amenable to standard convex optimization techniques like stochastic gradient descent sgd and variants these only operate in dimensions independent of which is the slack function is lipschitz and resulting in stable approximate learning moreover prediction is extremely efficient because it only requires the weighting and can be computed on the test 
set using only dot product in rp the form of and its dependence on facilitates interpretation as well as it resembles familiar objects sigmoid link functions for linear classifiers other advantages of this method also bear mention it makes no assumptions on the structure of or is provably robust against the worst case and adds no input parameters that need tuning these benefits are notable because they will be inherited by our extension of the framework in this paper however this algorithm practical performance can still be mediocre on real data which is often easier to predict than an adversarial setup would have us believe as result we seek to add more information in the form of constraints on the adversary to narrow the gap between it and reality learning with specialists to address this issue we examine generalized scenario in which each classifier in the ensemble can abstain on any subset of the examples instead of predicting it is specialist that predicts only over subset of the input and we think of its decision being randomized in the same way as the randomized label on each example in this section we extend the framework of section to arbitrary specialists and discuss the learning algorithm that results in our formulation suppose that for classifier and an example the classifier decides to abstain with probability vi but if the decision is to participate the classifier predicts hi as previously our only assumption on vi vi xn is the reasonable one that vi xj so classifier is not worthless specialist that abstains everywhere the constraint on classifier is now not on its correlation with on the entire test set but on the average correlation with restricted to occasions on which it participates so for some bs vi xj pn hi xj zj bs vi xk define ρi xj pn vji xk distribution over for convenience now redefine our unlabeled data matrix as follows xn xn ρp hp ρp hp ρp xn hp xn then the constraints can be written as sz bs analogous to the initial prediction game to summarize our specialist ensemble aggregation game is stated as vs max min we can immediately solve this game from thm with bs simply taking the place of theorem solution of the specialist aggregation game the awake ensemble prediction weighting on example xi is ρj xi hj xi σj the slack function is now γs sσ the minimax value of this game is vs the minimax optimal predictions are defined as follows for all σs gs sgn σs otherwise in the case the vector ρi is the uniform distribution for any and the problem reduces to the prediction game as in the original prediction game the minimax equilibrium depends on the data only through the ensemble predictions but these are now of different form each example is now weighted proportionally to ρj xi so on any given example xi only hypotheses which participate on it will be counted and those that specialize the most narrowly and participate on the fewest other examples will have more influence on the eventual prediction gi ceteris paribus creating specialists for an algorithm we can now present the main ensemble aggregation method of this paper which creates specialists from the ensemble adding them as additional constraints rows of the algorithm edge lipper is given in fig and instantiates our specialist learning framework with random forest as an initial exploration of the framework here random forests are an appropriate base ensemble because they are known to exhibit performance their wellknown advantages also include scalability robustness to corrupt data and parameter choices and 
interpretability each of these benefits is shared by our aggregation algorithm which consequently inherits them all furthermore decision trees are natural fit as the ensemble classifiers because they are inherently hierarchical intuitively and indeed formally too they act like nn predictors distance that is adaptive to the data so each tree in random forest represents somewhat different nonparametric partition of the data space into regions in which one of the labels dominates each such region corresponds exactly to leaf of the tree the idea of edge lipper is simply to consider each leaf in the forest as specialist which predicts only on the data falling into it by the nn intuition above these specialists can be viewed as predicting on data that is near them where the supervised training of the tree attempts to define the purest possible partitioning of the space pure partitioning results in many specialists with bs each of which contributes to the awake ensemble prediction over its domain to influence it towards the correct label inasmuch as bs is high though the idea is complex in concept for large forest with many arbitrarily overlapping leaves from different trees it fits the specialist framework of the previous sections so the algorithm is still essentially linear learning with convex optimization as we have described algorithm edge lipper input labeled set unlabeled set using grow trees tp regularized see sec using estimate bs on and its leaves using approximately optimize to estimate output the estimated weighting for use at test time figure at left is algorithm edge lipper at right is schematic of how the forest structure is related to the unlabeled data matrix with given example highlighted the two colors in the matrix represent predictions and white cells abstentions discussion trees in random forests have thousands of leaves or more in practice as we are advocating adding so many extra specialists to the ensemble for the optimization it is natural to ask whether this erodes some of the advantages we have claimed earlier computationally it does not when ρj xi classifier abstains deterministically on xi then the value of hj xi is irrelevant so storing in sparse matrix format is natural in our setup with the accompanying performance gain in computing while learning and predicting with it this turns out to be crucial to efficiency each tree induces partitioning of the data so the set of rows corresponding to any tree contains nonzero entries in total this is seen in fig statistically the situation is more complex on one hand there is no danger of overfitting in the traditional sense regardless of how many specialists are added each additional specialist can only shrink the constraint set that the adversary must follow in the game it only adds information about and therefore the performance vs must improve if the game is solved exactly however for learning we are only concerned with approximately optimizing γs and solving the game this presents several statistical challenges standard optimization methods do not converge as well in high ambient dimension even given the structure of our problem in addition random forests practically perform best when each tree is grown to overfit in our case on any sizable test set small leaves would cause some entries of to have large magnitude this can foil an algorithm like edge lipper by causing it to vary wildly during the optimization particularly since those leaves bs values are only roughly estimated from an optimization perspective some of 
these issues can be addressed by pseudo methods whose effect would be interesting to explore in future work our implementation opts for another approach to grow trees constrained to have nontrivial minimum weight per leaf of course there are many other ways to handle this including using the tree structure beyond the leaves we just aim to conduct an exploratory evaluation here as several of these areas remain ripe for future research experimental evaluation we now turn to evaluating edge lipper on publicly available datasets our implementation uses minibatch sgd to optimize runs in python on top of the popular learning package and runs memory taking advantage of the scalability of our formulation the datasets are drawn from as well as data mining sites like kaggle and no further preprocessing was done on the data we refer to base rf as the forest of constrained trees from which our implementation draws its specialists we restrict the training data available to the algorithm using mostly supervised datasets because these far outnumber public datasets unused labeled examples are combined with the test examples and the extra unlabeled set if any is provided to form the set of unlabeled data used by the algorithm further information and discussion on the protocol is in the appendix and noisy sets are included to demonstrate the aforementioned practical advantages of edge lipper therefore auc is an appropriate measure of performance and these results are in table results are averaged over runs each drawing different random subsample of labeled data the best results according to paired are in bold we find that the use of unlabeled data is sufficient to achieve improvements over even traditionally overfitted rfs in many cases notably in most cases there is significant benefit given by unlabeled data in our formulation as compared to the base rf used the methods also perform fairly well as we discuss in the next section figure awake ensemble prediction distributions on susy rows top to bottom labels columns left to right and the base rf the awake ensemble prediction values on the unlabeled set are natural way to visualize and explore the operation of the algorithm on the data in an analogous way to the margin distribution in boosting one representative sample is in fig on susy dataset with many examples roughly evenly split between these plots demonstrate that our algorithm produces much more peaked ensemble prediction distributions than random forests suggesting marginbased learning applications changing alters the aggressiveness of the clipping inducing more or less peaked distribution the other datasets without dramatic label imbalance show very similar qualitative behavior in these respects and these plots help choose in practice see appendix toy datasets with extremely low dimension seem to exhibit little to no significant improvement from our method we believe this is because the distinct feature splits found by the random forest are few in number and it is the diversity in ensemble predictions that enables edge lipper to clip weighted majority vote dramatically and achieve its performance gains on the other hand given large quantity of data our algorithm is able to learn significant structure the minimax structure appears appreciably close to reality as evinced by the results on large datasets related and future work this paper framework and algorithms are superficially reminiscent of boosting another paradigm that uses voting behavior to aggregate an ensemble and has intuition there is 
some work on versions of boosting but it departs from this principled structure and has little in common with our approach classical boosting algorithms like adaboost make no attempt to use unlabeled data it is an interesting open problem to incorporate boosting ideas into our formulation particularly since the two methods acquit themselves well it is possible to make this footprint independent of as well by hashing features not done here dataset training random forest edge lipper adaboost trees mart logistic regression base rf covtype susy epsilon table area under roc curve for edge lipper supervised ensemble algorithms in our results and can pack information parsimoniously into many fewer ensemble classifiers than random forests there is connection between transductive and learning and our method bridges these two settings popular variants on supervised learning such as the transductive svm and or algorithms which dominate the literature have shown promise largely in regimes because they face major scalability challenges our focus on ensemble aggregation instead allows us to keep computationally inexpensive linear formulation and avoid considering the underlying feature space of the data largely unsupervised ensemble methods have been explored especially in applications like crowdsourcing in which the method of gave rise to plethora of bayesian methods under various conditional independence generative assumptions on using tree structure to construct new features has been applied successfully though without guarantees learning with specialists has been studied in an adversarial online setting as in the work of freund et al though that paper setting and focus is different from ours the optimal algorithms it derives also depend on each specialist average error on the examples on which it is awake finally we the generality of our formulation which leaves many interesting questions to be explored the specialists we form are not restricted to being trees there are other ways of dividing the data like clustering methods indeed the ensemble can be heterogeneous and even incorporate other methods our method is complementary to myriad classification algorithms and we hope to stimulate inquiry into the many research avenues this opens acknowledgements the authors acknowledge support from the national science foundation under grant references robert schapire and yoav freund boosting foundations and algorithms the mit press leo breiman bagging predictors machine learning leo breiman random forests machine learning yehuda koren the bellkor solution to the netflix grand prize olga russakovsky jia deng hao su jonathan krause sanjeev satheesh sean ma zhiheng huang andrej karpathy aditya khosla michael bernstein alexander berg and li imagenet large scale visual recognition challenge international journal of computer vision ijcv robert schapire yoav freund peter bartlett and wee sun lee boosting the margin new explanation for the effectiveness of voting methods annals of statistics pages olivier chapelle bernhard and alexander zien learning xiaojin zhu and andrew goldberg introduction to learning synthesis lectures on artificial intelligence and machine learning akshay balsubramani and yoav freund optimally combining classifiers using unlabeled data in conference on learning theory rich caruana nikos karampatziakis and ainur yessenalina an empirical evaluation of supervised learning in high dimensions in proceedings of the international conference on machine learning pages acm yi lin and yongho jeon random 
forests and adaptive nearest neighbors. Journal of the American Statistical Association.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research.
Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the Annual International Conference on Machine Learning, pages. ACM.
Jerome Friedman. Greedy function approximation: gradient boosting machine. Annals of Statistics, pages.
Yoav Freund and Robert Schapire. Game theory, prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages. ACM.
Kumar Mallapragada, Rong Jin, Anil Jain, and Yi Liu. SemiBoost: boosting for learning. Pattern Analysis and Machine Intelligence, IEEE Transactions on.
Yoav Freund and Robert Schapire. A generalization of learning and an application to boosting. Comput. Syst.
Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, pages. Morgan Kaufmann Publishers.
Alexander Philip Dawid and Allan Skene. Maximum likelihood estimation of observer using the EM algorithm. Applied Statistics, pages.
Fabio Parisi, Francesco Strino, Boaz Nadler, and Yuval Kluger. Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences.
Frank Moosmann, Bill Triggs, and Frederic Jurie. Fast discriminative visual codebooks using randomized clustering forests. In Twentieth Annual Conference on Neural Information Processing Systems (NIPS), pages. MIT Press.
Yoav Freund, Robert Schapire, Yoram Singer, and Manfred Warmuth. Using and combining predictors that specialize. In Proceedings of the Annual ACM Symposium on Theory of Computing, pages. ACM.
Predicting biological response. https
Give me some credit. https
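The six-point example given in the introduction of this paper can be made concrete. The sketch below is a hypothetical stand-in, not the paper's actual table: the three prediction vectors and the per-rule error bound of 2/6 are made up, chosen so that any two of the three diverse rules disagree on four of the six points. Brute-forcing all labelings shows that the error bounds plus the unlabeled prediction matrix admit exactly one consistent labeling, on which the majority vote over these rules makes no mistakes.

```python
# Hypothetical reconstruction of the spirit of the six-point example: the
# prediction vectors and the 2/6 error bound below are made-up assumptions.
import itertools
import numpy as np

A_rules = np.array([            # rows = classifiers, columns = unlabeled points
    [+1, +1, +1, +1, -1, -1],
    [-1, -1, +1, +1, +1, +1],
    [+1, +1, -1, -1, +1, +1],
])
error_bound = 2 / 6             # assumed bound on each rule's error rate

consistent = []
for labels in itertools.product([-1, +1], repeat=6):
    y = np.array(labels)
    per_rule_error = np.mean(A_rules != y, axis=1)
    if np.all(per_rule_error <= error_bound + 1e-12):
        consistent.append(y)

print("consistent labelings:", len(consistent))                  # -> 1
print("the unique labeling :", consistent[0])                    # all +1
print("majority vote       :", np.sign(A_rules.sum(axis=0)))     # also all +1
```

Flipping any point's label pushes at least one rule over its error budget, which is the proof-by-contradiction argument used in the text.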
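A second sketch illustrates the aggregation step itself. The form used here, gamma(sigma) = -b.sigma + (1/n) sum_j max(1, |F_j.sigma|) minimized over non-negative sigma, with predictions clipped to [-1, 1], should be read as a plausible rendering consistent with the description above (a first term from labeled data, a second term averaging over unlabeled examples, clipped predictions at test time), not the paper's verbatim objective; F, b, and all dimensions are synthetic.

```python
# A minimal sketch of ensemble aggregation by slack-function minimization,
# under the assumptions stated in the lead-in; F and b are synthetic.
import numpy as np

rng = np.random.default_rng(0)
p, n = 10, 500                                  # ensemble size, unlabeled set size
F = rng.choice([-1.0, 1.0], size=(p, n))        # ensemble predictions on unlabeled data
b = rng.uniform(0.0, 0.3, size=p)               # assumed correlation lower bounds

def slack(sigma):
    margins = F.T @ sigma                       # one "margin" per unlabeled example
    return -b @ sigma + np.mean(np.maximum(1.0, np.abs(margins)))

def subgradient(sigma):
    margins = F.T @ sigma
    active = np.abs(margins) > 1.0              # examples whose margin exceeds 1
    return -b + (F[:, active] @ np.sign(margins[active])) / n

sigma = np.zeros(p)
for t in range(1, 2001):                        # projected subgradient descent
    sigma = np.maximum(sigma - (0.1 / np.sqrt(t)) * subgradient(sigma), 0.0)

margins = F.T @ sigma
g = np.clip(margins, -1.0, 1.0)                 # clipped ensemble predictions
print("final slack value        :", round(slack(sigma), 4))
print("fraction clipped to +/- 1:", np.mean(np.abs(margins) >= 1.0))
```

In the paper's full pipeline the rows of F would also include leaf-specialists drawn from a random forest, reweighted by their participation probabilities, and b would be estimated from the labeled set.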
online learning with gaussian payoffs and side observations yifan dept of computing science university of alberta szepesva csaba dept of electrical and electronic engineering imperial college london abstract we consider sequential learning problem with gaussian payoffs and side observations after selecting an action the learner receives information about the payoff of every action in the form of gaussian observations whose mean is the same as the mean payoff but the variance depends on the pair and may be infinite the setup allows more refined information transfer from one action to another than previous partial monitoring setups including the recently introduced feedback case for the first time in the literature we provide lower bounds on the regret of any algorithm which recover existing asymptotic lower bounds and finitetime minimax lower bounds available in the literature we also provide algorithms that achieve the lower bound up to some universal constant factor or the minimax lower bounds up to logarithmic factors introduction online learning in stochastic environments is sequential decision problem where in each time step learner chooses an action from given finite set observes some random feedback and receives random payoff several feedback models have been considered in the literature the simplest is the full information case where the learner observes the payoff of all possible actions at the end of every round popular setup is the case of bandit feedback where the learner only observes its own payoff and receives no information about the payoff of other actions recently several papers considered more refined setup called feedback that interpolates between the and the bandit case here the feedback structure is described by possibly directed graph and choosing an action reveals the payoff of all actions that are connected to the selected one including the chosen action itself this problem motivated for example by social networks has been studied extensively in both the adversarial and the stochastic cases however most algorithms presented heavily depend on the assumption that is that the payoff of the selected action can be observed removing this assumption leads to the partial monitoring case in the absolutely general partial monitoring setup the learner receives some general feedback that depends on its choice and the environment with some arbitrary but known dependence while the partial monitoring setup covers all other problems its analysis has concentrated on the finite case where both the set of actions and the set of feedback signals are finite which is in contrast to the standard full information and bandit settings where the feedback is typically assumed to be to our knowledge there are only few exceptions to this case in feedback is considered without the assumption while continuous action spaces are considered in and with special feedback structure linear and censored observations in this paper we consider generalization of the feedback model that can also be viewed as general partial monitoring model with feedback we assume that selecting an action the learner can observe random variable xij for each action whose mean is the same as the payoff of but its variance σij depends on the pair for simplicity throughout the paper we assume that all the payoffs and the xij are gaussian while in the feedback case one either has observation on an action or not but the observation always gives the same amount the information can be of information our model is more refined 
depending on the value of σij of different quality for example if σij trying action gives no information about action in general for any σij the value of the information depends on the time horizon of the problem when σij is large relative to and the payoff differences of the actions essentially no information is received while small variance results in useful observations after defining the problem formally in section we provide lower bounds in section which depend on the distribution of the observations through their mean payoffs and variances to our knowledge these are the first such bounds presented for any stochastic partial monitoring problem beyond the setting previous work either presented asymptotic lower bounds or minimax bounds our bounds can recover all previous bounds up to some universal constant factors not depending on the problem in section we present two algorithms with performance guarantees for the case of feedback without the assumption while due to their complicated forms it is hard to compare our upper and lower bounds we show that our first algorithm achieves the asymptotic lower bound up to multiplicative factors regarding the minimax regret the hardness of partial monitoring problems is characterized by their observabilor ity property or in case of the feedback model by their observability property in the same section we present another algorithm that achieves the minimax regret up to logarithmic factors under both strong and weak observability and achieves an regret earlier results for the stochastic feedback problems provided only asymptotic lower bounds and performance bounds that did not match the asymptotic lower bounds or the minimax rate up to constant factors related combinatorial partial monitoring problem with linear feedback was considered in where the presented minimax bound and logarithmic problem depenalgorithm was shown to satisfy both an dent bound however the dependence on the problem structure in that paper is not optimal and in particular the paper does not achieve the minimax bound for easy problems finally we draw conclusions and consider some interesting future directions in section proofs can be found in the long version of this paper problem formulation formally we consider an online learning problem with gaussian payoffs and side observations suppose learner has to choose from actions in every round when choosing an action the learner receives random payoff and also some side observations corresponding to other actions more precisely each action is associated with some parameter θi and the payoff yt to action in round is normally distributed random variable with mean θi and variance σii while the learner observes gaussian random vector xt whose jth coordinate is normal random variable with mean θj and variance σij we assume σij and the coordinates of xt are independent of each other we assume the following the random variables xt yt are independent for all ii the parameter vector is unknown to the learner but the variance matrix σij is known in advance iii for some iv σij for all that is the expected payoff of each action can be observed the goal of the learner is to maximize its payoff or in other words minimize the expected regret rt max θi yt it where it is the action selected by the learner in round note that the problem encompasses several common feedback models considered in online learning modulo the gaussian assumption and makes it possible to examine more delicate observation structures tilde denotes order up to logarithmic 
factors full information σij σj for all bandit σii and σij for all partial monitoring with feedback graphs each action is associated with an observation set si such that σij σj if si and σij otherwise we will call the uniform variance version of these problems when all the finite σij are equal to some some interesting features of the problem can be seen when considering the generalized full information case when all entries of are finite in this case the greedy algorithm which estimates the payoff of each action by the average of the corresponding observed samples and selects the one with the highest average achieves at most constant regret for any time horizon on the other hand the constant can be quite large in particular when the variance of some observations are large relative to the gaps dj maxi θi θj the situation is rather similar to partial monitoring setup for smaller finite time horizon in this paper we are going to analyze this problem and present algorithms and lower bounds that are able to interpolate between these cases and capture the characteristics of the different regimes notation define ctn nk ci ci and let ctn denote the number of plays over all actions taken by some algorithm in rounds also let ct ci ci we will consider environments with different expected payoff vectors but the variance matrix will be fixed therefore an environment can be specified by oftentimes we will explicitly denote the dependence of different quantities on the probability and expectation functionals under environment will be denoted by pr and respectively furthermore let ij be the jth best action ties are broken arbitrarily θik and define θi for any then the expected regret under environment is rt ni di for any action let si σij denote the set of actions whose parameter θj is observable by choosing action throughout the paper log denotes the natural logarithm and denotes the simplex for any positive integer lower bounds the aim of this section is to derive generic lower bounds to the regret which are also able to provide minimax lower bounds the hardness in deriving such bounds is that for any fixed and the dumb algorithm that always selects achieves zero regret obviously the regret of this algorithm is linear for any with so in general it is not possible to give lower bound for single instance when deriving asymptotic lower bounds this is circumvented by only considering consistent algorithms whose regret is for any problem however this asymptotic notion of consistency is not applicable to problems therefore following ideas of for any problem we create family of related problems by perturbing the mean payoffs such that if the regret of an algorithm is too small in one of the problems than it will be large in another one while it still depends on the original problem parameters note that deriving minimax bounds usually only involves perturbing certain special problems as and to show the reader what form of lower bound can be expected first we present an asymptotic lower bound for the version of the problem of partial monitoring with feedback graphs the result presented below is an easy consequence of hence its proof is omitted an algorithm is said to be consistent if rt for every now assume for simplicity that there is unique optimal action in environment that is θi for all and let for all ci cθ ci dj to see this notice that the error of identifying the optimal action decays exponentially with the number of rounds then for any consistent algorithm and for any with lim inf rt inf hc log note that 
the right hand side of is for any generalized full information problem recall that the expected regret is bounded by constant for such problems but it is finite positive number for other problems similar bounds have been provided in for feedback with under assumptions on the payoffs in the following we derive finite time lower bounds that are also able to replicate this result general finite time lower bound first we derive general lower bound for any and define as ha inf such that log ai where ii is the between xt and xt given by ii pk kl xt xt θj clearly is lower bound on rt for any algorithm for which the distribution of is the intuition behind the allowed values of is that we want to be as similar to as the environments and look like for the algorithm through the feedback xt it now define inf sup such that ctr is lower bound of the regret of any algorithm with finally for any define inf hc where cθ ctr here cθ contains all the possible values of that can be achieved by some algorithm whose lower bound on the regret is smaller than these definitions give rise to the following theorem theorem given any for any algorithm such that rt we have for any environment rt remark if is picked as the minimax value of the problem given the observation structure the theorem states that for any minimax optimal algorithm the expected regret for certain is lower bounded by relaxed lower bound now we introduce relaxed but more interpretable version of the lower bound of theorem which can be shown to match the asymptotic lower bound the idea of deriving the lower bound is the following instead of ensuring that the algorithm performs well in the most adversarial environment we consider set of bad environments and make sure that the algorithm performs well on them where each bad environment is the most adversarial one by only perturbing one coordinate θi of however in order to get meaningful lower bounds we need to perturb more carefully than in the case of asymptotic lower bounds the reason for this is that for any action if θi is very close to then ni is not necessarily small for good algorithm for if it is small one can increase θi to obtain an environment where is the best action and the algorithm performs bad otherwise when ni is large we need to decrease θi to make the algorithm perform badly in moreover when perturbing θi to be better than we can not make as in asymptotic arguments because when is small large and not necessarily large ni may also lead to low regret in in the following we make this argument precise to obtain an interpretable lower bound formulation we start with defining subset of ctr that contains the set of reasonable values for for any and let cj for all cθ ctr σji where mi the minimum sample size required to distinguish between θi and its perturbation is defined as follows for if then mi otherwise let mi max di θi mi max log log and let and denote the value of achieving the maximum in mi and mi respectively then define mi if di mi min mi mi if di for then if else the definitions for change by replacing di with and switching the and indices max max log log where and are the maximizers for in the above expressions then define if min if note that and can be expressed in closed form using the lambert function satisfying ew for any di eb min θi ebe di min θi ebe di eb di and similar results hold for as well now we can give the main result of this section simplified version of theorem corollary given for any algorithm such that rt we have for any environment rt hc next we compare this 
bound to existing lower bounds comparison to the asymptotic lower bound now we will show that our finite time lower bound in corollary matches the asymptotic lower bound in up to some constants pick αt for some and for simplicity we only consider which is away from the boundary of so that the minima in are recall that θi achieved by the second terms and has unique optimal action then for it is easy to show that di di by and mi log for large enough then using the fact that log log log log for it follows that limt mi log and similarly we can show that limt log thus cθ log cθ under the assumptions of as this implies that corollary matches the asymptotic lower bound of up to factor of comparison to minimax bounds now we will show that our lower bound reproduces the minimax regret bounds of and except for the generalized full information case the minimax bounds depend on the following notion of observability an action is strongly observable if either si or sj is weakly observable if it is not strongly observable but there exists such that sj note that we already assumed the latter condition for all let be the set of all weakly observable actions is said to be strongly observable if is weakly observable if next we will define two key qualities introduced by and that characterize the hardness of problem instance with feedback structure set is called an independent set if for any si the independence number is defined as the cardinality of the largest independent set for any pair of subsets is said to be dominating if for any there exists such that si the weak domination number is defined as the cardinality of the smallest set that dominates corollary assume that σij for some that is we are not in the generalized full information case then if is strongly observable with ασ for some we have for ii if is weakly observable with σt for some we σt picking for strongly observable have remark in corollary for weakly and observable gives formal minimax lower bounds if is strongly observable for any algorithm we have rt algorithm we have rt for eσ ii if is weakly observable for any σt algorithms in this section we present two algorithms and their analysis for the uniform variance version of our problem where σij is either or the upper bound for the first algorithm matches the asymptotic lower bound in up to constants the second algorithm achieves the minimax lower bounds of corollary up to logarithmic factors as well as regret in the upper bounds of both algorithms we assume that the optimal action is unique that is an asymptotically optimal algorithm let hc note that increasing does not change the value of hc since so we take the minimum value of in this definition let ni sis be the number of observations for action before round and be the empirical estimate of θi based on the first ni observations let ni is be the number of plays for action before round note that this definition of ni is different from that in the previous sections since it excludes round algorithm inputs for observe each action at least once by playing it such that sit set exploration count ne for do if then play it set ne ne else if ni ne then play it such that ni sit else play it such that ni ci log end if set ne ne end if end for our first algorithm is presented in algorithm the main idea coming from is that by forcing exploration over all actions the solution of the linear program can be well approximated while paying constant price this solves the main difficulty that without getting enough observations on each action we may not 
have good enough estimates for and one advantage of our algorithm compared to that of is that we use nondecreasing sublinear exploration schedule instead of constant rate βn this resolves the problem that to achieve asymptotically optimal performance some parameter of the algorithm needs to be chosen according to dmin as in the expected regret of algorithm is upper bounded as follows theorem for any and any that satisfies and for rt dmax exp log ci log ci di where ci sup ci θj for all further specifying and using the continuity of around it immediately follows that algorithm achieves asymptotically optimal performance corollary suppose the conditions of theorem hold assume furthermore that satisfies and exp for any then for any such that is unique lim sup rt log inf hc note that any anb with satisfies the requirements in theorem and corollary also note that the algorithms presented in do not achieve this asymptotic bound minimax optimal algorithm next we present an algorithm achieving the minimax bounds for any let cj ties are broken arbitrarily and cj for any and let as sj and aw as furthermore let gr log where ni is and ni be the empirical estimate of θi based on first ni observations the average of the samples the algorithm is presented in algorithm it follows successive elimination process it explores all possibly optimal actions called good actions later based on some confidence intervals until only one action remains while doing exploration the algorithm first tries to explore the good actions by only using good ones however due to weak observability some good actions might have to be explored by actions that have already been eliminated to control this trade off we use sublinear function to control the exploration of weakly observable actions in the following we present bounds on the performance of the algorithm so with slight abuse of notation rt will denote the regret without expectation in the rest of this section algorithm inputs set for do let αr aw aw define αr if and σαr tr as for all if aw ni ni and ni then and set cr aw else set cr ar asr end if play ir dcr kcr and set tr kir ar if then play the only action in the remaining rounds end if end for theorem for any and any rt σt log with probability at least if is weakly observable while rt log log with probability at least if is strongly observable theorem upper bound for any and any such that the optimal action is unique with probability at least dσ log dσ log rt remark picking gives an upper bound on the expected regret remark note that algortihm is similar to the algorithm of which admits better upper bound although does not achieve it with optimal constants but it does not achieve the minimax bound even under strong observability conclusions and open problems we considered novel setup with gaussian side observations which generalizes the recently introduced setting of feedback allowing finer quantification of the observed information from one action to another we provided lower bounds that imply existing asymptotic and minimax lower bounds up to some constant factors beyond the full information case we also provided an algorithm that achieves the asymptotic lower bound up to some universal constants and another algorithm that achieves the minimax bounds under both weak and strong observability however we think this is just the beginning for example we currently have no algorithm that achieves both the problem dependent and the minimax lower bounds at the same time also our upper bounds only correspond to the feedback case 
it is of great interest to go beyond the observability in characterizing the hardness of the problem and provide algorithms that can adapt to any correspondence between the mean payoffs and the variances the hardness is that one needs to identify suboptimal actions with good acknowledgments this work was supported by the alberta innovates technology futures through the alberta ingenuity centre for machine learning aicml and nserc during this work was with the department of computing science university of alberta references bubeck and regret analysis of stochastic and nonstochastic bandit problems foundations and trends in machine learning shie mannor and ohad shamir from bandits to experts on the value of in advances in neural information processing systems nips pages noga alon nicolo claudio gentile and yishay mansour from bandits to experts tale of domination and independence in advances in neural information processing systems nips pages gergely neu michal valko and munos efficient learning by implicit exploration in bandit problems with side observations in advances in neural information processing systems nips pages noga alon ofer dekel and tomer koren online learning with feedback graphs beyond bandits in proceedings of the conference on learning theory colt pages caron branislav kveton marc lelarge and smriti bhagat leveraging side observations in stochastic bandits in proceedings of the conference on uncertainty in artificial intelligence uai pages swapna buccapatnam atilla eryilmaz and ness shroff stochastic bandits with side observations on networks sigmetrics perform eval june and lugosi prediction learning and games cambridge university press cambridge dean foster alexander rakhlin and csaba partial monitoring classification regret bounds and algorithms mathematics of operations research tian lin bruno abrahao robert kleinberg john lui and wei chen combinatorial partial monitoring game with linear feedback and its applications in proceedings of the international conference on machine learning icml pages tor lattimore and csaba on learning the optimal waiting time in peter auer alexander clark thomas zeugmann and sandra zilles editors algorithmic learning theory volume of lecture notes in computer science pages springer international publishing todd graves and tze leung lai asymptotically efficient adaptive choice of control laws incontrolled markov chains siam journal on control and optimization yifan wu and csaba online learning with gaussian payoffs and side observations arxiv preprint lihong li munos and csaba toward minimax value estimation in proceedings of the eighteenth international conference on artificial intelligence and statistics aistats pages stefan magureanu richard combes and alexandre proutiere lipschitz bandits regret lower bounds and optimal algorithms in proceedings of the conference on learning theory colt pages emilie kaufmann olivier and garivier on the complexity of best arm identification in bandit models the journal of machine learning research to appear richard combes and alexandre proutiere unimodal bandits regret lower bounds and optimal algorithms in proceedings of the international conference on machine learning icml pages 
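To make the feedback model above concrete, here is a minimal simulator of the observation structure: playing action i yields a payoff with mean theta_i and variance sigma_ii^2, plus a Gaussian side observation of every action j with variance sigma_ij^2, where an infinite entry means no information. The theta and sigma values are made up, and uniform play is used only as a naive baseline for the regret.

```python
# A minimal simulator of the Gaussian payoff and side-observation model;
# theta and sigma below are made-up illustration values.
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.5, 0.4, 0.1])                   # unknown mean payoffs
sigma = np.array([[1.0, 2.0, np.inf],
                  [2.0, 1.0, 2.0],
                  [np.inf, 2.0, 1.0]])              # sigma[i, j]: observation std

def step(i):
    """Play action i; return its payoff and the side-observation vector."""
    finite = ~np.isinf(sigma[i])
    x = np.full(len(theta), np.nan)                 # NaN where nothing is revealed
    x[finite] = rng.normal(theta[finite], sigma[i][finite])
    return rng.normal(theta[i], sigma[i, i]), x

T = 10_000
payoffs = [step(rng.integers(len(theta)))[0] for _ in range(T)]   # uniform play
regret = T * theta.max() - np.sum(payoffs)
print("per-round regret of uniform play:", regret / T)
print("its expectation, max - mean     :", theta.max() - theta.mean())
```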
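The minimax-oriented algorithm described above is more involved than space allows here; the sketch below keeps only its core loop, specialized to the uniform-variance feedback-graph case: maintain a set of plausibly optimal actions, play an action that observes the least-sampled one, and eliminate by confidence intervals. It omits the paper's handling of weakly observable actions and uses generic Hoeffding-style radii, so it is illustrative only; the payoffs and feedback graph are made up.

```python
# A simplified successive-elimination sketch with graph-structured feedback.
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([0.7, 0.5, 0.2, 0.0])        # made-up mean payoffs
S = [{0, 1}, {1, 2}, {2, 3}, {3, 0}]          # S[i]: actions observed when playing i
noise_std, T, delta = 1.0, 20_000, 0.01
K = len(theta)
n = np.zeros(K)                               # observation counts
mean = np.zeros(K)                            # running empirical means
good = set(range(K))
regret = 0.0

def play(i):
    """Play action i and fold one noisy observation of each j in S[i] into mean[j]."""
    for j in S[i]:
        x = rng.normal(theta[j], noise_std)
        mean[j] = (n[j] * mean[j] + x) / (n[j] + 1)
        n[j] += 1

for _ in range(T):
    if len(good) == 1:
        i = next(iter(good))
    else:
        target = min(good, key=lambda j: n[j])                  # least-sampled good action
        observers = [a for a in good if target in S[a]] or \
                    [a for a in range(K) if target in S[a]]     # fall back if needed
        i = observers[0]
    play(i)
    regret += theta.max() - theta[i]
    if len(good) > 1 and min(n[j] for j in good) > 0:
        rad = noise_std * np.sqrt(2 * np.log(K * T / delta) / np.maximum(n, 1))
        best_lcb = max(mean[j] - rad[j] for j in good)
        good = {j for j in good if mean[j] + rad[j] >= best_lcb}

print("surviving action set:", good, " cumulative regret:", round(regret, 1))
```

Note how side observations pay off once only actions 0 and 1 remain: playing action 0 keeps refining the estimate of action 1 at no extra regret, which is the kind of information transfer the lower bounds above quantify.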
private graphon estimation for sparse christian borgs jennifer chayes microsoft research new england cambridge ma usa cborgs jchayes adam smith pennsylvania state university university park pa usa asmith abstract we design algorithms for fitting statistical model to large sparse network without revealing sensitive information of individual members given sparse input graph our algorithms output private nonparametric block model approximation by private we mean that our output hides the insertion or removal of vertex and all its adjacent edges if is an instance of the network obtained from generative nonparametric model defined in terms of graphon our model guarantees consistency as the number of vertices tends to infinity the output of our algorithm converges to in an appropriate version of the norm in particular this means we can estimate the sizes of all cuts in our results hold as long as is bounded the average degree of grows at least like the log of the number of vertices and the number of blocks goes to infinity at an appropriate rate we give explicit error bounds in terms of the parameters of the model in several settings our bounds improve on or match known nonprivate results introduction differential privacy social and communication networks have been the subject of intense study over the last few years however while these networks comprise rich source of information for science they also contain highly sensitive private information what kinds of information can we release about these networks while preserving the privacy of their users simple measures such as removing obvious identifiers do not work for example several studies reidentified individuals in the graph of social network even after all vertex and edge attributes were removed such attacks highlight the need for statistical and learning algorithms that provide rigorous privacy guarantees differential privacy provides meaningful guarantees in the presence of arbitrary side information in the context of traditional statistical data sets differential privacy is now by contrast differential privacy in the context of graph data is much less developed there are two main variants of graph differential privacy edge and node differential privacy intuitively edge differential privacy ensures that an algorithm output does not reveal the inclusion or removal of particular edge in the graph while node differential privacy hides the inclusion or removal of node together with all its adjacent edges edge privacy is weaker notion hence easier to achieve and has been studied more extensively several authors designed private algorithms for fitting generative graph models see the full version for further references but these do not appear to generalize to node privacy with meaningful accuracy guarantees the stronger notion node privacy corresponds more closely to what was achieved in the case of traditional data sets and to what one would want to protect an individual data it ensures that no matter what an analyst observing the released information knows ahead of time she learns the same full version of this extended abstract is available at http things about an individual alice regardless of whether alice data are used or not in particular no assumptions are needed on the way the individuals data are generated they need not even be independent node privacy was studied more recently with focus on on the release of descriptive statistics such as the number of triangles in graph unfortunately differential privacy stringency makes the design of 
accurate private algorithms challenging in this work we provide the first algorithms for inference of statistical model that does not admit simple sufficient statistics modeling large graphs via graphons traditionally large graphs have been modeled using various parametric models one of the most popular being the stochastic block model here one postulates that an observed graph was generated by first assigning vertices at random to one of groups and then connecting two vertices with probability that depends on their assigned groups as the number of vertices of the graph in question grows we do not expect the graph to be well described by stochastic block model with fixed number of blocks in this paper we consider nonparametric models where the number of parameters need not be fixed or even finite given in terms of graphon graphon is measurable bounded function such that which for convenience we take to be normalized given graphon we generate graph on vertices by first assigning uniform labels in to the vertices and then connecting vertices with labels with probability ρn where ρn is parameter determining the density of the generated graph gn with ρn kw we call gn graph with target density ρn or simply ρn graph to our knowledge random graph models of the above form were first introduced under the name latent position graphs and are special cases of more general model of inhomogeneous random graphs defined in which is the first place were target densities ρn were considered for both dense graphs whose target density does not depend on the number of vertices and sparse graphs those for which ρn as this model is related to the theory of convergent graph sequences and respectively estimation and identifiability assuming that gn is generated in this way we are then faced with the task of estimating from single observation of graph gn to our knowledge this task was first explicitly considered in which considered graphons describing stochastic block models with fixed number of blocks this was generalized to models with growing number of blocks while the first estimation of the nonparametric model was proposed in most of the literature on estimating the nonparametric model makes additional assumptions on the function the most common one being that after transformation the integral of over one variable is strictly monotone function of the other corresponding to an asymptotically strictly monotone degree distribution of gn this assumption is quite restrictive in particular such results do not apply to graphons that represent block models for our purposes the most relevant works are wolfe and olhede gao et al chatterjee and abbe and sandon as well as recent work done concurrently with this research which provide consistent estimators without monotonicity assumptions see comparison to nonprivate bounds below one issue that makes estimation of graphons challenging is identifiability multiple graphons can lead to the same distribution on gn specifically two graphons and lead to the same distribution on graphs if and only if there are measure preserving maps φe where is defined by hence there is such that no canonical graphon that an estimation procedure can output but rather an equivalence class of graphons some of the literature circumvents identifiability by making strong additional assumptions such as strict monotonicity that imply the existence of canonical equivalent class representatives we make no such assumptions but instead define consistency in terms of metric on these equivalence classes 
rather than on graphons as functions we use variant of the metric inf kw where ranges over bijections our contributions in this paper we construct an algorithm that produces an estimate from single instance gn of graph with target density ρn or simply when is clear from the context we aim for several properties is differentially private is consistent in the sense that in probability as has compact representation in our case as matrix with entries the procedure works for sparse graphs that is when the density is small on input gn can be calculated efficiently here we give an estimation procedure that obeys the first four properties leaving the question of algorithms for future work given an input graph gn and target number of blocks our algorithm produces graphon gn such that is node private the privacy guarantee holds for all inputs independent of modeling assumptions if is an arbitrary graphon normalized so the expected average degree grows at least as fast as log and goes to infinity sufficiently slowly with then when gn is ρw the estimate for is consistent that is both in probability and almost surely we give nonprivate variant of that converges assuming only average degree combined with the general theory of convergent graphs sequences these results in particular give procedure for estimating the edge density of all cuts in ρw graph see section below the main idea of our algorithm is to use the exponential mechanism of to select block model which approximately minimizes the distance to the observed adjacency matrix of under the best possible assignment of nodes to blocks this explicit search over assignments makes the algorithm take exponential time in order to get an algorithm that is accurate on sparse graphs we need several nontrivial extensions of current techniques to achieve privacy we use new variation of the lipschitz extension technique of to reduce the sensitivity of the distance while those works used lipschitz extensions for noise addition we use of lipshitz extensions inside the exponential mechanism to control the sensitivity of the score functions to bound our algorithm error we provide new analysis of the algorithm we show that approximate minimizers are not too far from the actual minimizer stability property both aspects of our work are enabled by restricting the to set of block models whose density in fact norm is not much larger than that of the underlying graph the algorithm is presented in section our most general result proves consistency for arbitrary graphons but does not provides concrete rate of convergence however we provide explicit rates under various assumptions on specifically we relate the error of our estimator to two natural error terms involving the graphon the error of the best approximation to in the norm see below and an error term measuring the between the graphon and the matrix of probabilities hn generating the graph gn see below in terms of these error terms theorem shows log log op ρn provided the average degree ρn grows at least like log along the way we provide novel analysis of straightforward nonprivate estimator that does not require an assumption on the average degree and leads to an error bound with better dependence on log op ρn ρn it follows from the theory of graph convergence that for all graphons we have as and almost surely as by selecting appropriately the nonprivate algorithm converges for any bounded graphon as long as ρn with the private algorithm converges whenever ρn log for constant as proven in the full version we also 
have op though this upper bound is loose in many cases as specific instantiation of these bounds let us consider the case that is exactly described by model in which case and op see full version for proof for log and constant our private estimator has an asymptotic error that is dominated by the unavoidable error of showing that we do not lose anything due to privacy in this special case another special case is when is continuous in which case and op see remark below comparison to previous nonprivate bounds we provide the first consistency bounds for estimation of nonparametric graph model subject to node differential privacy along the way for sparse graphs we provide more general consistency results than were previously known regardless of privacy in particular to the best of our knowledge no prior results give consistent estimator for that works for sparse graphs without any additional assumptions besides boundedness when compared to results for nonprivate algorithms applied to graphons obeying additional assumptions our bounds are often incomparable and in other cases match the existing bounds we start by considering graphons which are themselves step functions with known number of steps in the dense case the nonprivate algorithms of and as well pas our nonprivate algorithm give an asymptotic error that is dominated by the term which is of the same order as our private estimator as long as provided the first convergence results for estimating graphons in the sparse regime assuming that is bounded above and below so it takes values in range where they analyze an inefficient algorithm the mle the bounds of are incomparable to ours though for thepcase of graphons both their bounds and our nonprivate bound are dominated by the term when log and ρn different sequence of works shows how to consistently estimate the underlying block model with fixed number of blocks in polynomial time for very sparse graphs as for our algorithm the only thing which is needed is that nρ we are not aware of concrete bounds on the convergence rate for the case of dense graphons the results of give an error which is dominated by the term op for our nonprivate bound matches this bound while for it is worse considers the sparse case the rate of their estimator is incomparable to that of ours further their analysis requires lower bound on the edge probabilities while ours does not very recently after our paper was submitted both the bounds of as well as our bound were substantially improved leading to an error bound where the root in is replaced by square root at the cost of an extra constant multiplying the oracle error see the full version for more detailed discussion of the previous literature preliminaries notation for graph on we use and to denote the edge set and the adjacency matrix of respectively the edge density is defined as the number of edges divided by finally the degree di of vertex in is the number of edges containing we use thep same notation for weighted graph with nonnegative edge weights βij where now βij and di βij we use gn to denote the set of weighted graphs on vertices with weights in and gn to denote the set of all graphs in gn that have maximal degree at most from matrices to graphons we define graphon to be bounded measurable function such that for all it will be convenient to embed the set of symmetric matrix with nonnegative entries into graphons as follows let pn in be the partition of into adjacent intervals of lengths define to be the step function which equals aij on ii ij if is the 
adjacency matrix of an unweighted graph we use for distances for we define the lp norm of an matrix and borel function by kakp and kf kp dxdy respectively associated with the is scalar product defined as ha bi aij bij for two matrices and and hu dxdy for two square integrable functions note that with this notation the edge density and the norm are related by recalling we define the distance between two matrices or between matrix and graphon by and in addition we will also use the in general larger distances and defined by taking minimum over matrices which are obtained from by relabelling of the indices and kw graphs graph convergence and cuts graphs and stochastic block models given graphon we define random matrix hn hn by choosing positions xn uniformly at random from and then setting hn ij xi xj if kw then hn has entries in and we can form random graph gn gn on by choosing an edge between two vertices with probability hn ij independently for all following we call gn graph and hn random graph we incorporate target density ρn or simply when is clear from the context by normalizing so that and taking to be sample from gn ρw in other words we set hn ρw ρhn and then connect to with probability qij independently for all stochastic block models are specific examples of graph in which is constant on sets of the form ii ij where ik is partition of into intervals of possibly different lengths on the other hand an arbitrary graphon can be well approximated by block model indeed let min kw where the minimum runs over all matrices by straightforward argument see kw wpk as we will take this approximation as benchmark for our approach and consider it the error an oracle could obtain hence the superscript another key term in our algorithm error guarantee is the distance between hn and hn it goes to zero as by the following lemma which follows easily from the results of lemma let be graphon with kw with probability one khn kw and convergence given sequence of graphs with target densities ρn one might wonder whether the graphs gn gn ρn converge to in suitable metric the answer is yes and involves the first introduced in its definition is identical to the definition of the norm except that instead of the it involves the friezekannan kw defined as the sup of over all measurable sets in the metric graphs gn gn ρw then converge to in the sense that gn gn see for the proof estimation of cuts using the results of the convergence of gn in the implies many interesting results for estimating various quantities defined on the graph gn indeed consistent approximation to in the metric is clearly consistent in the weaker metric this distance in turn controls various quantities of interest to computer scientists the size of all cuts implying that consistent estimator for also gives consistent estimators for all cuts see the full version for details differential privacy for graphs the goal of this paper is the development of differentially private algorithm for graphon estimation the privacy guarantees are formulated for inputs we do not assume that is generated from graphon when analyzing privacy this ensures that the guarantee remains meaningful no matter what an analyst knows ahead of time about in this paper we consider node privacy we call two graphs and node neighbors if one can be obtained from the other by removing one node and its adjacent edges definition randomized algorithm is if for all events in the output space of and node neighbors pr exp pr we also need the notion of the of function gn defined as 
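Before turning to the estimator described next, a small sketch of the least-squares block-model fit that the norms and distances above feed into: for a fixed assignment of nodes to blocks, the ℓ2-optimal block matrix is the matrix of block averages of the rescaled adjacency matrix, and the estimator must additionally minimize over assignments, which is exactly what makes it exponential-time. The normalizations are simplified and the equipartition constraint is ignored here; the brute-force search only spells out the objective and is not meant to be run at scale.

```python
import numpy as np
from itertools import product

def block_fit(adj_scaled, labels, k):
    """For a fixed assignment of nodes to k blocks, the L2-optimal block matrix is
    the matrix of block averages of the rescaled adjacency matrix; returns it
    together with the normalized squared error ||A/rho - B_pi||_2^2."""
    n = adj_scaled.shape[0]
    b = np.zeros((k, k))
    for a in range(k):
        for c in range(k):
            mask = (labels[:, None] == a) & (labels[None, :] == c)
            np.fill_diagonal(mask, False)          # the diagonal carries no edges
            b[a, c] = adj_scaled[mask].mean() if mask.any() else 0.0
    b_pi = b[labels][:, labels]
    diff = adj_scaled - b_pi
    np.fill_diagonal(diff, 0.0)
    return b, (diff ** 2).sum() / n ** 2

def least_squares_block_model(adj, rho_hat, k):
    """Brute force over all k^n assignments -- only meant to make the objective
    explicit (the paper additionally restricts to equipartitions); the search
    over assignments is the hard part."""
    adj_scaled = adj / rho_hat
    best_b, best_err = None, np.inf
    for assign in product(range(k), repeat=adj.shape[0]):
        labels = np.array(assign)
        b, err = block_fit(adj_scaled, labels, k)
        if err < best_err:
            best_b, best_err = b, err
    return best_b, best_err
```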
maximum maxg where the maximum goes over the node sensitivity is the lipshitz constant of viewed as map between appropriate metrics differentially private graphon estimation estimation given graph as input generated by an unknown graphon our goal is to recover approximation to the basic nonprivate algorithm we emulate is least squares estimation which outputs the matrix which is closest to the input adjacency matrix in the distance min kbπ where the minimum runs over all equipartitions of into classes over all maps such that all classes have size as close to as possible such that for all and bπ is the with entries bπ xy bπ if is the adjacency matrix of graph we write instead of in the above notation the basic algorithm we would want to emulate is then the algorithm which outputs the least square fit argminb where the argmin runs over all symmetric matrices towards private algorithm our algorithm uses carefully chosen instantiation of the exponential mechanism of mcsherry and talwar the most direct application of their framework would be to output random matrix according to the probability distribution pr exp for some the resulting algorithm is private if we set to be over twice the of the score function here but this value of turns out to be too small to produce an output that is good approximation to the least square estimator indeed for given matrix and equipartition the of kg bπ can be as large as leading to value of which is too small to produce useful results for sparse graphs to address this we first note that we can work with an equivalent score that is much less sensitive given and we subtract off the squared norm of to obtain the following score kg bπ bπ kbπ and score max score where the max ranges over equipartitions for fixed input graph maximizing the score is the same as minimizing the distance argminb argmaxb score the sensitivity of the new score is then bounded by times the maximum degree in since only affects the score via the inner product hg bπ but this is still problematic since priori we have no control over either the size of or the maximal degree of to keep the sensitivity low we make two modifications first we only optimize over matrices whose entries bounded by roughly ρn since good estimator will have entries which are not much larger than kρn which is of order ρn second we restrict the score to be accurate only on graphs whose maximum degree is at most constant times the average degree since this is what one expects for graphs generated from bounded graphon while the first restriction can be directly enforced by the algorithm the second is more delicate since we need to provide privacy for all inputs including graphs with very large maximum degree we employ an idea from we first consider the restriction of score to gn dn where dn will be chosen to be of the order of the average degree of and then extend it back to all graphs while keeping the sensitivity low private estimation algorithm our final algorithm takes as input the privacy parameter the graph number of blocks and constant that will have to be chosen large enough to guarantee consistency of the algorithm algorithm private estimation algorithm input an integer and graph on vertices output block graphon represented as matrix estimating ρw compute an density approximation lap the target maximum degree the target norm for for each and let score denote nondecreasing lipschitz extension from of score from gn to gn such that for all matrices score score and define score max score return sampled from the 
distribution pr exp where score and ranges over matrices in bµ all entries bi are multiples of our main results about the private algorithm are the following lemma and theorem lemma algorithm is private theorem performance of the private algorithm let be normalized graphon let ρλ let gn ρw and be an integer assume that ρn log ρn and min then the algorithm outputs an approximation such that log log op ρn remark while theorem is stated in term of bounds which hold in probability our proofs yield statements which hold almost surely as remark under additional assumptions on the graphon we obtain tighter bounds for example if we assume that is continuous there exist constants and such that cδ whenever then we have that and op remark when considering the best block model approximation to one might want to consider block models with unequal block sizes in similar way one might want to construct private algorithm that outputs block model with unequal size blocks and produces bound in terms of this best block model approximation instead of this can be proved with our methods with the minimal block size taking the role of in all our statements estimation algorithm we also analyze simple algorithm which outputs the argmin of over all matrices whose entries are bounded by λρ independently of our work this algorithm was also proposed and analysed in our bound refers to this restricted least square algorithm and does not require any assumptions on the average degree as in suppress the dependence of the error on to include it one has to multiply the op term in by analysis of the private and algorithm at high level our proof of theorem as well as our new bounds on estimation follow from the fact that for all and the expected score score is equal to the score score combined with concentration argument as consequence the maximizer of score will approximately minimize the which in turn will approximately minimize thus relating the of our estimator to the oracle error defined in our main concentration statement is captured in the following proposition to state it we define for every symmetric matrix with vanishing diagonal to be the distribution over symmetric matrices with zero diagonal such that the entries aij are independent bernouilli random variables with eaij qij proposition let be symmetric matrix with vanishing diagonal and if min eρ and bµ is such that score max score for some then with probability at least min log morally the proposition contains almost all that is needed to establish the bound proving consistency of the algorithm which in fact only involves the case even though there are several additional steps needed to complete the proof the proposition also contains an extra ingredient which is crucial input for the analysis of the private algorithm it states that if instead of an optimal least square estimator we output an estimator whose score is only approximately maximal then the excess error introduced by the approximation is small to apply the proposition we then establish lemma which gives us lower bound on the score of the output in terms of the maximal score and an excess error there are several steps needed to execute this strategy the most important ones involving rigorous control of the error introduced by the lipschitz extension inside the exponential algorithm we defer the details to the full version acknowledgments was supported by nsf award and google faculty award part of this work was done while visiting boston university hariri institute for computation and harvard 
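Returning to the two private primitives the algorithm above relies on, the following is a rough, generic sketch of (i) a Laplace-noised edge-density estimate and (ii) the exponential mechanism over a finite candidate set, sampled with probability proportional to exp(ε·score / 2Δ). The candidate set (in the algorithm, discretized block matrices), the score function (the Lipschitz-extended score), and the sensitivity bound are supplied by the caller; the Laplace scale shown is the standard calibration for a node-level change to the edge density, not necessarily the paper's exact constants.

```python
import numpy as np

def exponential_mechanism(candidates, score_fn, sensitivity, epsilon, rng=None):
    """McSherry--Talwar exponential mechanism: sample a candidate with probability
    proportional to exp(epsilon * score / (2 * sensitivity))."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([score_fn(c) for c in candidates], dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()                       # stabilize the softmax
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

def laplace_density_estimate(adj, epsilon, rng=None):
    """Node-private edge-density estimate: adding or removing one node changes the
    edge count by at most n-1, so the density changes by at most 2/n (a standard
    calibration used here as an assumption, not the paper's exact constant)."""
    rng = np.random.default_rng() if rng is None else rng
    n = adj.shape[0]
    density = adj.sum() / (n * (n - 1))
    return density + rng.laplace(scale=2.0 / (epsilon * n))
```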
university center for research on computation and society references abbe and sandon recovering communities in the general stochastic block model without knowing the parameters abbe and sandon recovering communities in the general stochastic block model without knowing the parameters manuscript abbe bandeira and hall exact recovery in the stochastic block model bickel and chen nonparametric view of network models and and other modularities proceedings of the national academy of sciences of the united states of america bickel chen and levina the method of moments and degree distributions for network models annals of statistics blocki blum datta and sheffet differentially private data analysis of social networks via restricted sensitivity in innovations in theoretical computer science itcs pages bollobas janson and riordan the phase transition in inhomogeneous random graphs random struct algorithms borgs chayes and vesztergombi counting graph homomorphisms in topics in discrete mathematics eds klazar kratochvil loebl matousek thomas pages springer borgs chayes and vesztergombi convergent graph sequences subgraph frequencies metric properties and testing advances in borgs chayes and vesztergombi convergent graph sequences ii multiway cuts and statistical physics ann of borgs chayes cohn and zhao an lp theory of sparse graph convergence limits sparse random graph models and power law distributions borgs chayes cohn and zhao an lp theory of sparse graph convergence ii ld convergence quotients and right convergence chatterjee matrix estimation by universal singular value thresholding annals of statistics chen and zhou recursive mechanism towards node differential privacy and unrestricted joins in acm sigmod international conference on management of data pages choi wolfe and airoldi stochastic blockmodels with growing number of classes biometrika diaconis and janson graph limits and exchangeable random graphs rendiconti di matematica dwork mcsherry nissim and smith calibrating noise to sensitivity in private data analysis in halevi and rabin editors tcc volume pages gao lu and zhou graphon estimation hoff raftery and handcock latent space approaches to social network analysis journal of the american statistical association holland laskey and leinhardt stochastic blockmodels first steps soc netw kasiviswanathan nissim raskhodnikova and smith analyzing graphs with nodedifferential privacy in theory of cryptography conference tcc pages klopp tsybakov and verzelen oracle inequalities for network models and sparse graphon estimation and szegedy limits of dense graph sequences journal of combinatorial theory series lu and miklau exponential random graph estimation under differential privacy in acm sigkdd international conference on knowledge discovery and data mining pages mcsherry and talwar mechanism design via differential privacy in focs pages ieee raskhodnikova and smith lipschitz extensions and analysis of network data rohe chatterjee and yu spectral clustering and the stochastic blockmodel ann wolfe and olhede nonparametric graphon estimation 
submodboxes search for set of diverse object proposals qing sun virginia tech dhruv batra virginia tech sunqing https abstract this paper formulates the search for set of bounding boxes as needed in object proposal generation as monotone submodular maximization problem over the space of all possible bounding boxes in an image since the number of possible bounding boxes in an image is very large even single linear scan to perform the greedy augmentation for submodular maximization is intractable thus we formulate the greedy augmentation step as scheme in order to speed up repeated application of we propose novel generalization of minoux lazy greedy algorithm to the tree theoretically our proposed formulation provides new understanding to the problem and contains classic heuristic approaches such as sliding suppression nms and and efficient subwindow search ess as special cases empirically we show that our approach leads to performance on object proposal generation via novel diversity measure introduction number of problems in computer vision and machine learning involve searching for set of bounding boxes or rectangular windows for instance in object detection the goal is to output set of bounding boxes localizing all instances of particular object category in object proposal generation the goal is to output set of candidate bounding boxes that may potentially contain an object of any category other scenarios include face detection tracking and weakly supervised learning classical approach enumeration diverse subset selection in the context of object detection the classical paradigm for searching for set of bounding boxes used to be sliding window enumeration over all windows in an image with some level of followed by suppression nms picking set of windows by suppressing windows that are too close or overlapping as several previous works have recognized the problem with this approach is inefficiency the number of possible bounding boxes or rectangular subwindows in an image is even image contains more than one billion rectangular windows as result modern object detection pipelines often rely on object proposals as preprocessing step to reduce the number of candidate object locations to few hundreds or thousands rather than billions interestingly this migration to object proposals has simply pushed the problem of searching for set of bounding boxes upstream specifically number of object proposal techniques involve the same enumeration nms approach except they typically use cheaper features to be fast proposal generation step goal the goal of this paper is to formally study the search for set of bounding boxes as an optimization problem clearly enumeration for diversity via nms is one widelyused heuristic approach our goal is to formulate formal optimization objective and propose an efficient algorithm ideally with guarantees on optimization performance challenge the key challenge is the search space the number of possible sets of bounding boxes is pixels assuming figure overview of our formulation submodboxes we formulate the selection of set of boxes as constrained submodular maximization problem the objective and marginal gains consists of two parts relevance and diversity figure shows two candidate windows ya and yb relevance is the sum of edge strength over all edge groups black curves wholly enclosed in the window figure shows the diversity term the marginal gain in diversity due to new window ya or yb is the ability of the new window to cover the reference boxes that are currently not 
with the already chosen set in this case we can see that ya covers new reference box thus the marginal gain in diversity of ya will be larger than yb overview of our formulation submodboxes let denote the set of all possible bounding boxes or rectangular subwindows in an image this is structured output space with the size of this set growing quadratically with the size of the input image we formulate the selection of set of boxes as search problem on the power set specifically given budget of windows we search for set of windows that are both relevant have high likelihood of containing an object and diverse to cover as many objects instances as possible argmax parameter diversity objective relevance budget constraint search over crucially when the objective function is monotone and submodular then simple greedy algorithm that iteratively adds the window with the largest marginal gain achieves approximation factor of unfortunately although conceptually simple this greedy augmentation step requires an enumeration over the space of all windows and thus naïve implementation is intractable in this work we show that for broad class of relevance and diversity functions this greedy augmentation step may be efficiently formulated as step with easily computable this enables an efficient implementation of greedy with significantly fewer evaluations than linear scan over finally in order to speed up repeated application of across iterations of the greedy algorithm we present novel generalization of minoux lazy greedy algorithm to the tree where different branches are explored in lazy manner in each iteration we apply our proposed technique submodboxes to the task of generating object proposals on the pascal voc pascal voc and ms coco datasets our results show that our approach outperforms all baselines contributions this paper makes the following contributions we formulate the search for set of bounding boxes or subwindows as the constrained maximization of monotone submodular function to the best of our knowledge despite the popularity of object recognition and object proposal generation this is the first such formal optimization treatment of the problem our proposed formulation contains existing heuristics as special cases specifically sliding window nms can be viewed as an instantiation of our approach under specific definition of the diversity function our work can be viewed as generalization of the efficient subwindow search ess of lampert et al who proposed scheme for finding the single best bounding box in an image their extension to detecting multiple objects consisted of heuristic for suppressing features extracted from the selected bounding box and the procedure we show that this heuristic is special case of our formulation under specific diversity function thus providing theoretical justification to their intuitive heuristic to the best of our knowledge our work presents the first generalization of minoux lazy greedy algorithm to spaces the space of bounding boxes finally our experimental contribution is novel diversity measure which leads to performance on the task of generating object proposals related work our work is related to few different themes of research in computer vision and machine learning submodular maximization and diversity the task of searching for diverse subset of items from ground set has been in number of application domains and across these domains submodularity has emerged as an fundamental property of set functions for measuring diversity of subset of items most 
previous work has focussed on submodular maximization on unstructured spaces where the ground set is efficiently enumerable our work is closest in spirit to prasad et al who studied submodular maximization on structured output spaces where each item in the ground set is itself structured object such as segmentation of an image unlike our ground set is not exponentially large only quadratically large however enumeration over the ground set for the step is still infeasible and thus we use such structured output spaces and oracles were not explored in bounding box search in object detection and object proposals as we mention in the introduction the search for set of bounding boxes via heuristics such as sliding window nms used to be the dominant paradigm in object recognition modern pipelines have shifted that search step to object proposal algorithms comparison and overview of object proposals may be found in zitnick et al generate candidate bounding boxes via sliding window nms based on an objectness score which is function of the number of contours wholly enclosed by bounding box we use this objectness score as our relevance term thus making submodboxes directly comparable to nms another closely related work is which presents an active search strategy for reranking selective search object proposals based on contextual cues unlike this work our formulation is not restricted to any set of windows we search over the entire power set and may generate any possible set of windows up to convergence tolerance in one key building block of our work is the efficient subwindow search ess scheme et al ess was originally proposed for object detection their extension to detecting multiple objects consisted of heuristic for suppressing features extracted from the selected bounding box and the procedure in this work we extend and generalize ess in multiple ways first we show that relevance objectness scores and diversity functions used in object proposal literature are amenable to and thus optimization we also show that the suppression heuristic used by is special case of our formulation under specific diversity function thus providing theoretical justification to their intuitive heuristic finally also proposed the use of for nms in object detection unfortunately as we explain later in the paper the nms objective is submodular but not monotone and the classical greedy algorithm does not have approximation guarantees in this setting in contrast our work presents general framework for based on monotone submodular maximization submodboxes formulation and approach we begin by establishing the notation used in the paper preliminaries and notation for an input image let yx denote the set of all possible bounding boxes or rectangular subwindows in this image for simplicity we drop the explicit dependance on and just use uppercase letters refer to set functions and lowercase letter refer to functions over individual items set function is submodular if its marginal gains are decreasing for all sets and items the function is called monotone if adding an item to set does not hurt constrained submodular maximization from the classical result of nemhauser it is known that cardinality constrained maximization of monotone submodular can be performed nearoptimally via greedy algorithm we start out with an empty set and iteratively add the next best item with the largest marginal gain over the chosen set where argmax the score of the final solution is within factor of of the optimal solution the computational bottleneck is 
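A minimal sketch of the classic greedy scheme just described, with its 1 − 1/e guarantee, written for an explicitly enumerable ground set; the inner linear scan over the ground set is exactly the step that becomes intractable when the ground set is the space of all boxes and that the branch-and-bound formulation below replaces. The toy weighted-coverage objective at the end is illustrative only.

```python
def greedy_max(ground_set, marginal_gain, budget):
    """Greedy for cardinality-constrained monotone submodular maximization
    (Nemhauser et al.): repeatedly add the item with the largest marginal gain.
    The final value is within a factor 1 - 1/e of the optimum."""
    chosen = []
    for _ in range(budget):
        best_item, best_gain = None, float("-inf")
        for item in ground_set:                  # the intractable linear scan
            gain = marginal_gain(item, chosen)
            if gain > best_gain:
                best_item, best_gain = item, gain
        chosen.append(best_item)
    return chosen

# Toy usage with a coverage objective (monotone and submodular).
if __name__ == "__main__":
    items = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {1, 4}}
    def gain(item, chosen):
        covered = set().union(*(items[i] for i in chosen)) if chosen else set()
        return len(items[item] - covered)
    print(greedy_max(list(items), gain, budget=2))   # -> ['a', 'b']
```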
that in each iteration we must find the item with the largest marginal gain in our case is the space of all rectangular windows in an image and exhaustive enumeration figure priority queue in scheme each vertex in the tree represents set of windows blue rectangles denote the largest and the smallest window in the set gray region denotes the rectangle set yv in each case the priority queue consists of all leaves in the tree ranked by the upper bound uv left shows vertex is split along the right coordinate interval into equal halves and middle the highest priority vertex in is further split along bottom coordinate into and right the highest priority vertex in is split along right coordinate into and this procedure is repeated until the highest priority vertex in the queue is single rectangle is intractable instead of exploring subsampling as is done in sliding window methods we will formulate this greedy augmentation step as an optimization problem solved with sets vs lists for pedagogical reasons our problem setup is motivated with the language of sets and subsets in practice our work falls under submodular list prediction the generalization from sets to lists allows reasoning about an ordering of the items chosen and potentially repeated entries in the list our final solution is an ordered list not an unordered set all guarantees of greedy remain the same in this generalization parameterization of and search in this subsection we briefly recap the efficient subwindow search ess of lampert et al which is used key building block in this work the goal of is to maximize potentially objective function over the space of all rectangular windows rectangular window is parameterized by its top bottom left and right coordinates set of windows is represented by using intervals for each coordinate instead of single integer for example where tlow thigh is range in this parameterization the set of all possible boxes in an image can be written as over ess creates tree where each vertex in the tree is rectangle set yv and an associated on the objective function achievable in this set uv initially this tree consists of single vertex which is the entire search space and typically loose ess proceeds in manner in each iteration the with the highest is chosen for branching and then new are computed on each of the two created in practice this is implemented with priority queue over the that are currently leaves in the tree fig shows an illustration of this procedure the parent rectangle set is split along its largest coordinate interval into two equal halves thus forming disjoint children sets explores the tree in manner till single rectangle is identified with score equal to the at which point we have found global optimum in our experiments we show results with different convergence tolerances objective in our setup the objective at each step is the marginal gain of the window the currently chosen list of windows λd in the following subsections we describe the relevance and diversity terms in detail and show how upper bounds can be efficiently computed over the sets of windows relevance function and upper bound the goal of the relevance function is to quantify the quality or relevance of the windows chosen in in our work we define to be modular function aggregating the quality of all chosen windows thus the marginal gain of window is simply its individual quality regardless of what else has already been chosen in our application of object proposal generation we use the objectness score produced by edgeboxes as 
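A minimal sketch of the ESS-style best-first branch and bound described above, assuming boxes are given by inclusive integer (top, bottom, left, right) coordinates and that the caller supplies an upper bound that dominates the objective on every box set and is exact on singletons; in this paper that bound combines the relevance and diversity-gain bounds derived below.

```python
import heapq

def branch_and_bound_argmax(init_intervals, upper_bound, tol=0):
    """Best-first search over sets of boxes in the style of ESS. A box set is a
    tuple of four inclusive integer intervals (lo, hi) for (top, bottom, left,
    right); upper_bound(box_set) must dominate the objective of every box in the
    set and be exact on single boxes. `tol` plays the role of the convergence
    tolerance mentioned in the text (0 means run down to a single box)."""
    heap = [(-upper_bound(init_intervals), init_intervals)]
    while heap:
        neg_ub, box_set = heapq.heappop(heap)
        widths = [hi - lo for lo, hi in box_set]
        split_dim = max(range(4), key=lambda d: widths[d])
        if widths[split_dim] <= tol:             # effectively a single rectangle
            return box_set, -neg_ub
        lo, hi = box_set[split_dim]
        mid = (lo + hi) // 2
        for half in ((lo, mid), (mid + 1, hi)):  # split the widest coordinate
            child = list(box_set)
            child[split_dim] = half
            heapq.heappush(heap, (-upper_bound(tuple(child)), tuple(child)))
    raise ValueError("empty search space")
```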
our relevance function the main intuition of edgeboxes is that the number of contours or edge groups wholly contained in box is indicative of its objectness score thus it first creates grouping of edge pixels called edge groups each associated with edge strength si abstracting away some of the details edgeboxes essentially defines the score of box as weighted sum of the strengths of edge groups contained in it normalized by the size of the edge group wi si box edgeboxesscore where with slight abuse of notation we use edge group to mean the edge groups contained the rectangle these weights and size normalizations were found to improve performance of edgeboxes in our work we use simplification of the edgeboxesscore which allow for easy computation of upper bounds edge group si we ignore the weights one simple for set of windows yv can be computed by accumulating all possible positive scores and the least necessary negative scores edge group si si edge group si si max ymin where ymax is the largest and ymin is the smallest box in the set yv and is the iverson bracket consistent with the experiments in we found that this simplification indeed hurts performance in the edgeboxes sliding window nms pipeline however interestingly we found that even with this weaker relevance term submodboxes was able to outperform edgeboxes thus the drop in performance due to weaker relevance term was more than compensated for by the ability to perform jointly on the relevance and diversity terms diversity function and upper bound the goal of the diversity function is to encourage in the chosen set of windows and potentially capture different objects in the image before we introduce our own diversity function we show how existing heuristics in object detection and proposal generation can be written as special cases of this formulation under specific diversity functions sliding window nms suppression nms is the most popular heuristic for selecting diverse boxes in computer vision nms is typically explained procedurally select the highest scoring window suppress all windows that overlap with by more than some threshold select the next highest scoring window rinse and repeat this procedure can be explained as special case of our formulation sliding window corresponds to enumeration over with some level of or stride typically with fixed aspect ratio each step in nms is precisely greedy augmentation step under the following marginal gain argmax λdn where if iou else intuitively the nms diversity function imposes an infinite penalty if new window overlaps with previously chosen by more than threshold and offers no reward for diversity beyond that this explains the nms procedure of suppressing overlapping windows and picking the highest scoring one among the unsuppressed ones notice that this diversity function is submodular but not monotone the marginals gains may be negative similar observation was made in for such functions greedy does not have approximation guarantees and different techniques are needed this is an interesting perspective on the classical nms heuristic ess heuristic ess was originally proposed for object detection their extension to detecting multiple instances consisted of heuristic for suppressing the features extracted from the selected bounding box and the procedure since their scoring function was linear in the features this heuristic of suppressing features and rerunning can be expressed as greedy augmentation step under the following marginal gain argmax λdess where dess dn the ess diversity 
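To make the NMS-as-greedy reading concrete, the following sketch implements suppression exactly as a greedy step under the thresholded-overlap diversity term: the marginal gain of a new box is its relevance, minus infinity if it overlaps any already chosen box by more than the threshold. The candidate enumeration and the relevance score are assumed to be supplied by the caller; boxes are (top, bottom, left, right) tuples.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (top, bottom, left, right)."""
    top, bottom = max(a[0], b[0]), min(a[1], b[1])
    left, right = max(a[2], b[2]), min(a[3], b[3])
    inter = max(0, bottom - top) * max(0, right - left)
    area = lambda x: (x[1] - x[0]) * (x[3] - x[2])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms_as_greedy(candidates, relevance, budget, thresh=0.5):
    """Sliding-window + NMS written as greedy selection: the marginal gain of a
    new box is its relevance, minus infinity if it overlaps a chosen box by more
    than `thresh`. The gain can be negative, which is why this diversity term is
    submodular but not monotone and carries no (1 - 1/e) guarantee."""
    chosen = []
    for _ in range(budget):
        def gain(y):
            if any(iou(y, z) > thresh for z in chosen):
                return float("-inf")
            return relevance(y)
        best = max(candidates, key=gain)
        if gain(best) == float("-inf"):
            break                                # everything remaining is suppressed
        chosen.append(best)
    return chosen
```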
function subtracts the score contribution coming from the intersection region if the is it is easy to see that this diversity function is monotone and submodular adding new window never hurts and since the marginal gain is the score contribution of the new regions not covered by previous window it is naturally diminishing thus even though this heuristic not was presented as such the authors of did in fact formulate greedy algorithm for maximizing monotone submodular function unfortunately while is always positive in our experiments this was not the case in the experimental setup of our diversity function instead of an explicit diversity function we use function that implicitly measures diversity in terms of coverage of set of reference set of bounding boxes this reference set of boxes may be uniform of the space of windows as done in sliding window methods or may itself be the output of another object proposal method such as selective search specifically each greedy augmentation step under our formulation given by argmax λdcoverage where dcoverage max δiou δiou max iou iou intuitively speaking the marginal gain of new window under our diversity function is the largest gain in coverage of exactly one of the references boxes we can also formulate this diversity function as maximum bipartite matching problem between the reference proposal boxes and the reference boxes in our experiments we also study performance under matches we show in the supplement that this marginal gain is always and decreasing with larger thus the diversity function is monotone submodular all that remains is to compute an on this marginal gain ignoring constants the key term to bound is iou we can this term by computing the intersection the largest window in the window set ymax and computing max the union to the smallest window ymin iou area area ymin speeding up greedy with minoux lazy greedy in order to speed up repeated application of across iterations of the greedy algorithm we now present an application of minoux lazy greedy algorithm to the tree the key insight of classical lazy greedy is that the marginal gain function is nonincreasing function of due to submodularity of thus at time we can cache the priority queue of marginals gains for all items at time lazy greedy does not recompute all marginal gains rather the item at the front of the priority queue is picked its marginal gain is updated and the item is reinserted into the queue crucially if the item remains at the front of the priority queue lazy greedy can stop and we have found the item with the largest marginal gain interleaving lazy greedy with in our work the priority queue does not contain single items rather sets of windows yv corresponding to the vertices in the tree thus we must interleave the lazy updates with the steps specifically we pick set from the front of the queue and compute the on its marginal gain we reinsert this set into the priority queue once set remains at the front of the priority queue after reinsertion we have found the set with the highest this is when perform step split this set into two children compute the on the children and insert them into the queue figure interleaving lazy greedy with the first few steps update following by finally branching on set some sets such as are never updated or split resulting in fig illustrates how the priority queue and tree are updated in this process suppose at the end of iteration and the beginning of iteration we have the priority queue shown on the left the first few updates involve 
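A minimal sketch of Minoux's lazy greedy over an enumerable ground set, which is the procedure being generalized here: cached marginal gains are re-evaluated only for the item currently at the front of the priority queue, and submodularity (gains can only shrink) guarantees that this still returns the true argmax in each greedy iteration. In the branch-and-bound generalization described above, the queue holds tree vertices (box sets with upper bounds) rather than single items.

```python
import heapq
from itertools import count

def lazy_greedy(items, marginal_gain, budget):
    """Minoux's lazy greedy: pop the front of the queue; if its cached gain was
    computed in the current iteration it is the true best item, otherwise
    recompute the gain and reinsert."""
    chosen, tie = [], count()
    heap = [(-marginal_gain(x, chosen), next(tie), 0, x) for x in items]
    heapq.heapify(heap)
    for t in range(1, budget + 1):
        while heap:
            neg_gain, _, stamp, item = heapq.heappop(heap)
            if stamp == t:                       # cache is fresh -> true best item
                chosen.append(item)
                break
            fresh = marginal_gain(item, chosen)  # stale cache: recompute, reinsert
            heapq.heappush(heap, (-fresh, next(tie), t, item))
    return chosen
```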
recomputing the on the window sets following by branching on because it continues to stay on top of the queue creating new vertices notice that is never explored updated or split resulting in experiments setup we evaluate submodboxes for object proposal generation on three datasets pascal voc pascal voc and ms coco the goal of experiments is to validate our approach by testing the accuracy of generated object proposals and the ability of handling different kinds of reference boxes and observe trends as we vary multiple parameters submodboxes submodboxes ss no proposals pascal voc submodboxes submodboxes ss no proposals pascal voc figure abo no proposals abo abo abo submodboxes submodboxes ss no proposals ms coco evaluation to evaluate the quality of our object proposals we use mean average best overlap mabo score given set of boxes gtc for class abo is calculated by averaging the best iou between each ground truth bounding box and all object proposals aboc max iou mabo is mean abo over all classes weighing the reference boxes recall that the marginal gain of our proposed diversity function rewards covering the reference boxes with the chosen set of boxes instead of weighing all reference boxes equally we found it important to weigh different reference boxes differently the exact form the weighting rule is provided in the supplement in our experiments we present results with and without such weighting to show impact of our proposed scheme accuracy of object proposals in this section we explore the performance of our proposed method in comparison to relevant object proposal generators for the two pascal datasets we perform cross validation on validation images of pascal voc for the best parameter then report accuracies on test images of pascal voc and validation images of pascal voc the ms coco dataset is much larger so we randomly select subset of training images for tuning and test on complete validation dataset with images we use top ranked selective search windows as reference boxes in manner similar to we chose different λm for proposals we compare our approach with several baselines which essentially involves selective search windows by considering their ability to coverage other boxes three variants of edgeboxes at iou and and corresponding three variants without affinities in selective search compute multiple hierarchical segments via grouping superpixels and placing bounding boxes around them use edgeboxesscore to selective search windows fig shows that our approach at and both outperform all baselines at and our approach is and better than selective search and and better than respectively ablation studies we now study the performance of our system under different components and parameter settings effect of and reference boxes we test performance of our approach as function of using reference boxes from different object proposal generators all reported at on pascal voc our reference box generators are selective search mcg cpmc edgeboxes at iou objectness and uniformly sample the bounding box center position square root area and log aspect ratio table shows the performance of submodboxes when used with these different reference box generators our approach shows improvement over corresponding method for all reference boxes our approach outperforms the current state of art mcg by and selective search by this is significantly larger than previous improvements reported in the literature fig shows more behavior as is varied at all methods produce the same highest weighted box times at 
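For reference, the ABO/mABO evaluation defined above amounts to the short computation below: a sketch that assumes the iou() helper from the earlier NMS sketch is in scope and that the proposal list is non-empty.

```python
def average_best_overlap(gt_boxes, proposals):
    """ABO for one class: for each ground-truth box take the best IoU achieved by
    any proposal, then average over the ground-truth boxes."""
    return sum(max(iou(g, p) for p in proposals) for g in gt_boxes) / len(gt_boxes)

def mean_abo(gt_by_class, proposals):
    """mABO: the per-class ABO scores averaged over classes; gt_by_class maps a
    class name to its list of ground-truth boxes."""
    scores = [average_best_overlap(boxes, proposals) for boxes in gt_by_class.values()]
    return sum(scores) / len(scores)
```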
they all perform reranking of the reference set of boxes in nearly all curves there is peak at some intermediate setting of the only exception is edgeboxes which is expected since it is being used in both the relevance and diversity terms effect of no steps we analyze the convergence trends of fig shows that both the optimization objective function value and the mabo increase with the number of iterations mcg eb cpmc objectness weighting without weighting weighting without weighting weighting original method table comparison weighting scheme rows with different reference boxes columns original method row shows performance of directly using object proposals from these proposal generators means we report the best performance from and considering the peak occurs at different for different object proposal generators mcg uniform cpmc mabo ss objectness eb mabo objective values mabo boxes performance with objective and performance performance no of ent reference box generators no of iterations matching boxes figure experiments on different parameter settings effect of no of matching boxes instead of allowing the chosen boxes to cover exactly one reference box we analyze the effect of matching reference boxes fig shows that the performance decreases monotonically bit as more matches are allowed speed up via lazy greedy fig compares the number of iterations required with and without our proposed lazy greedy without lazy eralization averaged over randomly chosen images we can lazy see that lazy greedy significantly reduces the number of iterations required the cost of each evaluation is nearly the same so the iteration is directly proportional to time conclusions to summarize we formally studied the search for set of diverse figure comparison of the bounding boxes as an optimization problem and provided theoretnumber of iterations of our ical justification for greedy and heuristic approaches used in prior lazy greedy generalization and work the key challenge of this problem is the large search space thus we proposed generalization of minoux lazy greedy on independent runs tree to speed up classical greedy we tested our formulation on three datasets of object detection pascal voc pascal and microsoft coco results show that our formulation outperforms all baselines with novel diversity measure acknowledgements this work was partially supported by national science foundation career award an army research office yip award an office of naval research grant an aws in education research grant and gpu support by nvidia the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements either expressed or implied of the government or any sponsor references alexe deselaers and ferrari measuring the objectness of image windows pami nov arbelaez tuset marques and malik multiscale combinatorial grouping in cvpr blaschko branch and bound strategies for suppression in object detection in emmcvpr pages blaschko and lampert learning to localize objects with structured output regression in eccv buchbinder feldman naor and schwartz tight approximation to unconstrained submodular maximization in focs carbonell and goldstein the use of mmr reranking for reordering documents and producing summaries in proceedings of the annual international acm sigir conference on research and development in information retrieval sigir pages carreira and sminchisescu constrained parametric for automatic object segmentation in cvpr cheng 
zhang lin and torr bing binarized normed gradients for objectness estimation at in cvpr dalal and triggs histograms of oriented gradients for human detection in cvpr deselaers alexe and ferrari localizing objects while learning their appearance in eccv dey liu hebert and bagnell contextual sequence prediction with application to control library optimization in robotics science and systems rss and methods survey operations research everingham van gool williams winn and zisserman the pascal visual object classes challenge results http everingham van gool williams winn and zisserman the pascal visual object classes challenge results http feige mirrokni and vondrák maximizing submodular functions in focs felzenszwalb girshick mcallester and ramanan object detection with discriminatively trained part based models pami girshick donahue darrell and malik rich feature hierarchies for accurate object detection and semantic segmentation in cvpr vezhnevets and ferrari an active search strategy for efficient object detection in cvpr he zhang ren and sun spatial pyramid pooling in deep convolutional networks for visual recognition in eccv hosang benenson and schiele how good are detection proposals really in bmvc joachims finley and yu training of structural svms machine learning kempe kleinberg and tardos maximizing the spread of influence through social network in acm sigkdd conference on knowledge discovery and data mining kdd krahenbuhl and koltun learning to propose objects in cvpr krause and golovin submodular function maximization in tractability practical approaches to hard problems to appear cambridge university press krause singh and guestrin sensor placements in gaussian processes theory efficient algorithms and empirical studies mach learn lampert blaschko and hofmann efficient subwindow search branch and bound framework for object localization tpmai lin and bilmes class of submodular functions for document summarization in acl lin maire belongie hays perona ramanan dollár and zitnick microsoft coco common objects in context in eccv minoux accelerated greedy algorithms for maximizing submodular set functions optimization techniques pages nemhauser wolsey and fisher an analysis of approximations for maximizing submodular set functions mathematical programming prasad jegelka and batra submodular meets structured finding diverse subsets in structured item sets in nips ren he girshick and sun faster towards object detection with region proposal networks in nips ross zhou yue dey and bagnell learning policies for contextual submodular prediction in icml sermanet eigen zhang mathieu fergus and lecun overfeat integrated recognition localization and detection using convolutional networks in iclr streeter and golovin an online algorithm for maximizing submodular functions in nips szegedy reed and erhan scalable object detection in cvpr szegedy toshev and erhan deep neural networks for object detection in nips taskar guestrin and koller markov networks in nips uijlings van de sande gevers and smeulders selective search for object recognition ijcv viola and jones robust face detection int comput vision may zitnick and dollar edge boxes locating object proposals from edges in eccv 
Fast Second-Order Stochastic Backpropagation for Variational Inference. Kai Fan (Duke University), Ziteng, Jeffrey Beck (Duke University), Katherine Heller (Duke University), James Kwok (HKUST). Equal contribution to this work; HKUST refers to the Hong Kong University of Science and Technology. Abstract: We propose a second-order (Hessian or Hessian-free) optimization method for variational inference inspired by Gaussian backpropagation, and argue that a quasi-Newton (L-BFGS) optimization method can be developed as well. This is accomplished by generalizing the gradient computation in stochastic backpropagation via the reparametrization trick with lower complexity. As an illustrative example, we apply this approach to the problems of Bayesian logistic regression and the variational auto-encoder (VAE). Additionally, we compute bounds on the estimator variance of intractable expectations for the family of Lipschitz continuous functions. Our method is practical, scalable and model free. We demonstrate our method on several datasets and provide comparisons with other stochastic gradient methods to show substantial enhancement in convergence rates. Introduction: Generative models have become ubiquitous in machine learning and statistics and are now widely used in fields such as bioinformatics, computer vision, and natural language processing. These models benefit from being highly interpretable and easily extended. Unfortunately, inference and learning with generative models is often intractable, especially for models that employ continuous latent variables, and so fast approximate methods are needed. Variational Bayesian (VB) methods deal with this problem by approximating the true posterior with a tractable parametric form and then identifying the set of parameters that maximize a variational lower bound on the marginal likelihood. That is, VB methods turn an inference problem into an optimization problem that can be solved, for example, by gradient ascent. Indeed, efficient stochastic gradient variational Bayesian (SGVB) estimators have been developed, and a number of papers have followed up on this approach. Recent work provided a complementary perspective by using stochastic backpropagation, which is equivalent to SGVB, and applied it to deep latent Gaussian models. Stochastic backpropagation overcomes many limitations of traditional inference methods, such as mean-field algorithms, due to the existence of efficient computations of an unbiased estimate of the gradient of the variational lower bound. The resulting gradients can be used for parameter estimation via stochastic optimization methods such as stochastic gradient descent (SGD) or its adaptive version AdaGrad. Unfortunately, methods such as SGD or AdaGrad converge slowly for some models, such as recurrent neural networks. The common experience is that gradient descent gets stuck near saddle points or local extrema, and meanwhile the learning rate is difficult to tune. A clear explanation has been given of why Newton's method is preferred over gradient descent, which often encounters problems if the function being optimized manifests pathological curvature. Newton's method is invariant to affine transformations, so it can take advantage of curvature information, but it has higher computational cost due to its reliance on the inverse of the Hessian matrix. This issue was partially addressed by work introducing Hessian-free (HF) optimization, which demonstrated its suitability for problems in machine learning. In this paper we continue this line of research into second-order variational inference algorithms. Inspired by the properties of location-scale families, we show how to reduce the computational cost of
the hessian or product thus allowing for order stochastic optimization scheme for variational inference under gaussian approximation in conjunction with the hf optimization we propose an efficient and scalable order stochastic gaussian backpropagation for variational inference called hfsgvi alternately version method merely using the gradient information is natural generalization of order variational inference the most immediate application would be to look at obtaining better optimization algorithms for variational inference as to our knowledge the model currently applying order information is lda where the hessian is easy to compute in general for factor models like factor analysis or the deep latent gaussian models this is not the case indeed to our knowledge there has not been any systematic investigation into the properties of various optimization algorithms and how they might impact the solutions to optimization problem arising from variational approximations the main contributions of this paper are to fill such gap for variational inference by introducing novel order optimization scheme first we describe clever approach to obtain curvature information with low computational cost thus making the newton method both scalable and efficient second we show that the variance of the lower bound estimator can be bounded by constant extending the work of that discussed specific bound for univariate function third we demonstrate the performance of our method for bayesian logistic regression and the vae model in comparison to commonly used algorithms convergence rate is shown to be competitive or faster stochastic backpropagation in this section we extend the bonnet and price theorem to develop order gaussian backpropagation specifically we consider how to optimize an expectation of the form eqθ where and refer to latent variables and observed variables respectively and expectation is taken distribution qθ and is some smooth loss function it can be derived from standard variational lower bound sometimes we abuse notation and refer to by omitting when no ambiguity exists to optimize such expectation gradient decent methods require the derivatives while newton methods require both the gradients and hessian involving order derivatives second order gaussian backpropagation if the distribution is dz gaussian the required partial derivative is easily computed with lower algorithmic cost by using the property of gaussian distribution we can compute the order partial derivative of en as follows µj en en zj en ck en en zj zk zl ck en en zk zl eq proof in supplementary have the nice property that limited number of samples from are sufficient to obtain unbiased gradient estimates however note that eq needs to calculate the third and fourth derivatives of which is highly computationally inefficient to avoid the calculation of high order derivatives we use transformation covariance parameterization for optimization by constructing the linear transformation reparameterization where idz we can generate samples from any gaussian distribution by simulating data from standard normal distribution provided the decomposition rr holds this fact allows us to derive the following theorem indicating that the computation of order derivatives can be scalable and programmed to run in parallel theorem fast derivative if is twice differentiable function and follows gaussian distribution rr where both the mean and depend on parameter θl we have en idz and en idz this then implies en θl en where is kronecker product and 
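A minimal sketch of the Hessian-free update in the algorithm above, assuming the caller provides Monte Carlo estimates grad_fn(θ) and hvp_fn(θ, v) built from reparameterized samples z = μ + Rε; how those estimates are formed is model-specific, and the curvature passed to conjugate gradients is assumed to be positive definite (e.g., a damped approximation), a detail glossed over here. The small fixed cap on CG iterations mirrors the early-stopping strategy discussed in the text.

```python
import numpy as np

def conjugate_gradient(hvp, g, max_iter=10, tol=1e-10):
    """Solve (curvature) p = g using only matrix-vector products hvp(v);
    a small fixed iteration cap stands in for early stopping."""
    p = np.zeros_like(g)
    r = g.copy()                    # residual g - H p with p = 0
    d = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        if rs_old < tol:
            break
        hd = hvp(d)
        alpha = rs_old / (d @ hd)
        p += alpha * d
        r -= alpha * hd
        rs_new = r @ r
        d = r + (rs_new / rs_old) * d
        rs_old = rs_new
    return p

def hf_newton_step(theta, grad_fn, hvp_fn, step_size=1.0, cg_iters=10):
    """One Hessian-free (truncated Newton) ascent step on the stochastic lower
    bound: solve (curvature) p = g with CG, then move along p."""
    g = grad_fn(theta)
    p = conjugate_gradient(lambda v: hvp_fn(theta, v), g, max_iter=cg_iters)
    return theta + step_size * p
```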
gradient hessian are evaluated at in terms of if we consider the mean and covariance matrix as the variational parameters in variational inference the first two results make parallelization possible and reduce computational cost of the multiplication due to the fact that vec vec av if the model has few parameters or large resource budget gpu is allowed theorem launches the foundation for exact order derivative computation in parallel in addition note that the order gradient computation on model parameter only involves or multiplication thus leading to an algorithmic complexity that is for order derivative of which is the same as order gradient the derivative computation at function is up to order avoiding to calculate or order derivatives one practical parametrization assumes diagonal covariance matrix diag this reduces the actual computational cost compared with theorem albeit the same order of the complexity see supplementary material theorem holds for large class of distributions in addition to gaussian distributions such as student if the dimensionality of embedded parameter is large computation of the gradient gθ and hessian hθ differ from above will be linear and quadratic which may be unacceptable therefore in the next section we attempt to reduce the computational complexity apply reparameterization on second order algorithm in standard newton method we need to compute the hessian matrix and its inverse which is intractable for limited computing resources applied hf optimization method in deep learning effectively and efficiently this work largely relied on the technique of fast hessian multiplication we combine reparameterization trick with or quasinewton to circumvent matrix inverse problem unlike methods hf doesn make any approximation on the hessian hf needs to compute hθ where is any vector that has the matched dimension to hθ and then uses conjugate gradient algorithm to solve the linear system hθ for any objective function gives reasonable explanation for hessian free optimization in short unlike method that places the parameters in search region to regularize hf solves issues of pathological curvature in the objective by taking the advantage of rescaling property of newton method by definition hθ indicating that hθ can be numerically computed by using finite differences at however this numerical method is unstable for small in this section we focus on the calculation of hθ by leveraging reparameterization trick specifically we apply an technique for computing the product hθ exactly let en and reparametrize again as sec we do variable substitution γv after gradient eq is obtained and then take derivative on thus we have the following algorithm algorithm on stochastic gaussian variational inference hfsgvi parameters minibatch size number of samples to estimate the expectation as default input observation and if required lower bound function en fl output parameter after having converged for do xb randomly draw datapoints from full data set mb sample times from for each xb define gradient gb fl mb gb define function γv where is vector using conjugate gradient algorithm to solve linear system θt pt θt θt pt end for ical expression for multiplication hθ γv en en eq is appealing since it does not need to store the dense matrix and provides an unbiased hθ estimator with small sample size in order to conduct the order optimization for variational inference if the computation of the gradient for variational lower bound is completed we only need to add one extra step for gradient 
evaluation via eq which has the same computational complexity as eq this leads to variational inference method described in algorithm for the worst case of hf the conjugate gradient cg algorithm requires at most iterations to terminate meaning that it requires evaluations of hθ product however the good news is that cg leads to good convergence after reasonable number of iterations in practice we found that it may not necessary to wait cg to converge in other words even if we set the maximum iteration in cg to small fixed number in our experiments though with thousands of parameters the performance does not deteriorate the early stoping strategy may have the similar effect of wolfe condition to avoid excessive step size in newton method therefore we successfully reduce the complexity of each iteration to whereas is for one sgd iteration limited memory bfgs utilizes the information gleaned from the gradient vector to approximate the hessian matrix without explicit computation and we can readily utilize it within our framework the basic idea of bfgs approximates hessian by an iterative algorithm bt bt bt bt where gt and θt by eq the gradient gt at each iteration can be obtained without any difficulty however even if this low rank approximation to the hessian is easy to invert analytically due to the formula we still need to store the matrix will further implicitly approximate this dense bt or by tracking only few gradient vectors and short history of parameters and therefore has linear memory requirement in general can perform sequence of inner products with the most recent and where is predefined constant or in our experiments due to the space limitations we omit the details here but will present this algorithm in experiments section estimator variance the framework of stochastic backpropagation extensively uses the mean of very few samples often just one to approximate the expectation similarly we approximate the left side of eq by sampling few points from the standard normal distribution however the magnitude of the variance of such an estimator is not seriously discussed simply explored the variance quantitatively for separable functions merely borrowed the variance reduction technique from reinforcement learning by centering the learning signal in expectation and performing variance normalization here we will generalize the treatment of variance to broader family lipschitz continuous function theorem variance bound if is an differentiable function and idz then the proof of theorem see supplementary employs the properties of distributions and the duplication trick that are commonly used in learning theory significantly the result implies variance bound independent the dimensionality of gaussian variable note that from the proof fof we can only obtain the for though this result is enough to illustrate the variance independence of dz we can in fact tighten it to sharper upper bound by constant scalar eλ thus leading to the result of theorem with var if all the results above hold for smooth twice continuous and differentiable functions with lipschitz constant then it holds for all lipschitz functions by standard approximation argument this means the condition can be relaxed to lipschitz continuous function corollary bias bound it is also worth mentioning that the significant corollary of theorem is probabilistic inequality to measure the convergence rate of monte carlo approximation in our setting this tail bound together with variance bound provides the theoretical guarantee for stochastic 
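backpropagation with only a handful of samples.

A quick numerical illustration of the dimension-independence claimed by the variance bound, under an assumption made only for this sketch: we take the 1-Lipschitz test function f(ε) = ||ε||_2 (not a function from the paper) and measure the empirical variance of the single-sample estimator of E_{N(0,I)}[f(ε)] as the dimensionality grows.

```python
import numpy as np

def estimator_variance(f, dim, n_trials=2000, n_samples=1, rng=np.random.default_rng(0)):
    """Empirical variance of the n_samples-average Monte Carlo estimate of E[f(eps)],
    eps ~ N(0, I_dim). For a Lipschitz f the variance should not grow with dim."""
    estimates = [f(rng.standard_normal((n_samples, dim))).mean() for _ in range(n_trials)]
    return np.var(estimates)

# f(x) = ||x||_2 is 1-Lipschitz; the variance of a single draw of ||eps|| tends to 1/2
# as the dimension grows, so the printed values stay essentially flat.
for d in (2, 32, 512, 4096):
    print(d, estimator_variance(lambda x: np.linalg.norm(x, axis=1), d))
```

The flat profile is consistent with a bound that scales with the Lipschitz constant rather than with the latent dimensionality. In this sense the analysis supports stochastic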
backpropagation on gaussian variables and provides an explanation for why unique realization is enough in practice by reparametrization eq can be formulated as the expectation the isotropic gaussian distribution with identity covariance matrix leading to algorithm thus we can rein in the number of samples for monte carlo integration regardless dimensionality of latent variables this seems however we notice that larger may require more samples and lipschitz constants of different models vary greatly application on variational note that our method is model free if the loss function has the form of the expectation of function latent gaussian variables we can directly use algorithm in this section we put the emphasis on standard framework vae model that has been intensively researched in particular the function endows the logarithm form thus bridging the gap between hessian and fisher information matrix by expectation see survey and reference therein model description suppose we have observations rd is data vector that can where take either continuous or discrete values in contrast to standard model constructed by neural network with bottleneck structure vae describes the embedding process from the prospective of gaussian latent variable model specifically each data point follows generative model pψ where this process is actually decoder that is usually constructed by transformation with unknown parameters and prior distribution pψ the encoder or recognition model qφ is used to approximate the true posterior pψ where is similar to the parameter of variational distribution as suggested in perceptron mlp is commonly considered as both the probabilistic encoder and decoder we will later see that this construction is equivalent to variant deep neural networks under the constrain of unique realization for for this model and each datapoint the variational lower bound on the marginal likelihood is log pψ eqφ log pψ dkl qφ kpψ we can actually write the kl divergence into the expectation term and denote as by the previous discussion this means that our objective is to solve the optimization problem arg maxθ of full dataset variational lower bound thus or hf sgvi algorithm can be implemented straightforwardly to estimate the parameters of both generative and recognition models since the first term of reconstruction error appears in eq with an expectation form on latent variable used small finite number samples as monte carlo integration with reparameterization trick to reduce the variance this is in fact drawing samples from the standard normal distribution in addition the second term is the kl divergence between the variational distribution and the prior distribution which acts as regularizer deep neural networks with hybrid hidden layers in the experiments setting can not only achieve excellent performance but also speed up the program in this special case we discuss the relationship between vae and traditional deep pd for binary inputs denote the output as we have log pψ xj log yj xj log yj which is exactly the negative it is also apparent that log pψ is equivalent to negative squared error loss for continuous data this means that maximizing the lower bound is roughly equal to minimizing the loss function of deep neural network see figure in supplementary except for different regularizers in other words the prior in vae only imposes regularizer in encoder or generative model while penalty for all parameters is always considered in deep neural nets from the perspective of deep neural networks with 
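hybrid hidden layers.

As a concrete reference point, here is a minimal single-sample estimate of the per-datapoint variational lower bound for binary data, matching the decomposition above: a Bernoulli reconstruction term (the negative cross-entropy) plus the analytic KL divergence between a diagonal Gaussian encoder and the standard normal prior. encoder and decoder are placeholder callables standing in for the MLPs; no architectural details from the paper are implied.

```python
import numpy as np

def elbo_single(x, encoder, decoder, rng=np.random.default_rng(0)):
    """Single-sample variational lower bound for one binary datapoint x.

    encoder(x) -> (mu, log_var) of q_phi(z | x)      [placeholder callables]
    decoder(z) -> Bernoulli means y in (0, 1) of p_psi(x | z)
    """
    mu, log_var = encoder(x)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps                        # reparameterization
    y = decoder(z)
    log_px_z = np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))  # negative cross-entropy
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)  # KL(q || N(0, I))
    return log_px_z - kl
```

Summing this quantity over the data and ascending its gradient (with SGD, the L-BFGS variant, or the HF scheme of the algorithm above) recovers the full-dataset optimization problem stated earlier.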
Viewed as such a network with hybrid hidden nodes, the model consists of two Bernoulli layers and one Gaussian layer, and the gradient computation simply follows a variant of back-propagation, layer by layer (the derivation is given in the supplementary material). To further justify this setting, we investigate an upper bound on the Lipschitz constant under various activation functions in the next lemma. As the variance theorem implies, the variance of the expectation approximated from finitely many samples depends mainly on the Lipschitz constant rather than on the dimensionality. According to the lemma, imposing a prior or regularization on the parameters controls both the model complexity and the function smoothness; the lemma also yields an upper bound on the Lipschitz constant of the estimators designed in our algorithm. Lemma: for the sigmoid activation function in a deep neural network with one Gaussian layer, the Lipschitz constant of the unit with weights w_i and bias b_i is bounded by a constant multiple of ||w_i||, where w_i is the i-th row of the weight matrix and b_i the i-th bias element; similarly, for the hyperbolic tangent and softplus functions the Lipschitz constant is bounded by a (different) constant multiple of ||w_i||.

Experiments. We apply our second-order stochastic variational inference to two different models. First we consider the simple but widely used Bayesian logistic regression model and compare against the most recent first-order algorithm, doubly stochastic variational inference (DSVI), designed for sparse variable selection with logistic regression. Then we compare the performance of the VAE model trained with our algorithms.

Bayesian logistic regression. Given a dataset {(x_i, y_i)}, where each x_i includes a default constant feature and y_i is the binary label, Bayesian logistic regression models the probability of the outputs conditioned on the features and on coefficients β, with a prior imposed on β. The likelihood and the prior usually take the form of a product of sigmoid terms σ(y_i x_i^T β) and a Gaussian with diagonal covariance, respectively, where σ(·) is the sigmoid function. For simplicity we propose a variational Gaussian distribution to approximate the posterior of the regression parameter; further assuming a diagonal, factorized form q(β) = ∏_j N(β_j | μ_j, σ_j²) is both efficient and practical for inference. Unlike iteratively optimizing the two sets of variational parameters as in variational EM, it was observed that the gradient of the lower bound allows one set to be updated analytically in terms of the other, resulting in a new representation of the lower bound that depends on fewer free parameters (details in the cited work); a small code sketch of this variational objective is given after the table below.

We apply our algorithm to variational logistic regression on three appropriate datasets: duke-breast and leukemia are small in sample size but high-dimensional, suited to sparse logistic regression, and the third dataset is much larger (see the table for dataset descriptions). The figure shows the convergence of the Gaussian variational lower bound for Bayesian logistic regression in terms of running time. It is worth mentioning that the lower bound of HFSGVI converges within a few iterations on the small datasets duke-breast and leukemia; this is because all data points are fed to every algorithm at once, and HFSGVI uses a better approximation of the Hessian matrix to carry out second-order optimization. The L-BFGS variant also takes less time to converge and yields a slightly larger lower bound than DSVI. In addition, as a first-order stochastic algorithm, DSVI is clearly less stable on the small datasets and fluctuates strongly even in the later stages of optimization. On the large dataset we observe that HFSGVI again needs only a few iterations to reach a good lower bound but becomes less stable than the other two algorithms; even so, it performs the best.

Table: comparison of the number of misclassifications (train/test) under DSVI, the L-BFGS variant, and HFSGVI on the duke-breast, leukemia, and large datasets.
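The following sketch spells out the variational objective used in this comparison under illustrative assumptions: a fully factorized Gaussian posterior over the regression coefficients, a standard normal prior (the paper's prior covariance is not recoverable from the text, so N(0, I) is assumed here), labels in {−1, +1}, and a plain reparameterized Monte Carlo estimate rather than the analytically reduced objective used by DSVI.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logreg_elbo(X, y, mu, log_sigma, n_samples=5, rng=np.random.default_rng(0)):
    """Reparameterized Monte Carlo estimate of the variational lower bound for
    Bayesian logistic regression with q(beta) = prod_j N(beta_j | mu_j, sigma_j^2)
    and, for illustration only, a standard normal prior p(beta) = N(0, I)."""
    sigma = np.exp(log_sigma)
    ll = 0.0
    for _ in range(n_samples):
        beta = mu + sigma * rng.standard_normal(mu.shape)    # beta = mu + sigma * eps
        ll += np.sum(np.log(sigmoid(y * (X @ beta))))        # log-likelihood, y in {-1,+1}
    ll /= n_samples
    # Analytic KL(q || N(0, I)) for a diagonal Gaussian posterior.
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * log_sigma)
    return ll - kl
```

Any of the compared optimizers can be run on the negative of this objective; the algorithms differ only in how gradient and curvature information is used.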
Figure: convergence of the lower bound versus running time for Bayesian logistic regression on the duke-breast, leukemia, and large datasets under DSVI, the L-BFGS variant, and HFSGVI (zoom in, or see larger figures in the supplementary material).

Both in terms of convergence rate and the final lower bound, the misclassification results in the table reflect the same advantages of our approach, indicating competitive prediction ability on the various datasets. Finally, it is worth mentioning that all three algorithms learn very sparse regression coefficients on the three datasets (see the supplement for additional visualizations).

Variational auto-encoder. We also apply second-order stochastic variational inference to train the VAE model, using a single sample for the Monte Carlo integration that estimates the expectation, or equivalently the deep neural network with hybrid hidden layers. The datasets we used are images from the Frey Face, Olivetti Face, and MNIST collections. We mainly learned three tasks by maximizing the variational lower bound: parameter estimation, image reconstruction, and image generation. Meanwhile we compared the convergence rate (running time) of the three algorithms, where in this section the compared SGD is the Ada version recommended for the VAE model. The experimental setting is as follows. The initial weights are drawn at random and all bias terms share a common initialization. Because the variational lower bound only introduces regularization on the encoder parameters, we add an L2 regularizer on the decoder parameters with a small shrinkage parameter. The number of hidden nodes is the same for encoder and decoder in every model, which is reasonable and convenient for constructing a symmetric structure; this number is tuned over a small grid. The minibatch size is the same for the L-BFGS variant and Ada, while a larger minibatch is recommended for HF, meaning it should vary with the size of the training set.

The detailed results are shown in the figures. Both HFSGVI and the L-BFGS variant converge faster than Ada in terms of running time, and HFSGVI also performs competitively with respect to generalization on test data; Ada takes at least four times as long to achieve a similar lower bound. Theoretically, Newton's method has a quadratic convergence rate in terms of iterations but cubic algorithmic complexity per iteration; however, we manage to lower the computation in each iteration to linear complexity. Thus, counting the number of evaluated training data points, the second-order algorithm needs far fewer steps than first-order gradient descent (see the visualization for MNIST in the supplementary material). The Hessian also replaces manually tuned learning rates, and the affine-invariance property allows for automatic learning-rate adjustment. Technically, if the program can run in parallel on a GPU, the speed advantage of the second-order algorithm should become even more apparent.

The reconstruction figures show results for input images. From the perspective of a deep neural network, the only difference is the Gaussian-distributed latent variables; by the corollary of the variance theorem we can roughly say that the mean carries the relevant quantity, so this layer acts as a linear transformation with noise, which resembles dropout training. Specifically, Olivetti contains higher-resolution faces of various persons, which means more complicated models or preprocessing (nearest-neighbor interpolation, patch sampling) would ordinarily be needed; however, even when simply learning a very bottlenecked representation, our approach achieves acceptable results. Note that although we tuned the hyperparameters of Ada, its best result is still a collection of mean faces. For manifold learning, Fig. 2 represents how the Frey Face
reconstruction manifold by generative model figure shows how lower bound increases program running time for different algorithms illustrates the reconstruction ability of this model when dz left columns are randomly sampled from dataset is the learned manifold of generative model when dz olivetti face lower bound ada train ada test train test hfsgvi train hfsgvi test time convergence hfsgvi figure shows running time comparison illustrates reconstruction comparison without patch sampling where dz top rows are original faces learned generative model can simulate the images by hfsgvi to visualize the results we choose the latent variable in pψ where the parameter is estimated by the algorithm the two coordinates of take values that were transformed through the inverse cdf of the gaussian distribution from equal distance grid or on the unit square then we merely use the generative model to simulate the images besides these learning tasks denoising imputation and even generalizing to learning are possible application of our approach conclusions and discussion in this paper we proposed scalable order stochastic variational method for generative models with continuous latent variables by developing gaussian backpropagation through reparametrization we introduced an efficient unbiased estimator for higher order gradients information combining with the efficient technique for computing multiplication we derived an efficient inference algorithm hfsgvi that allows for joint optimization of all parameters the algorithmic complexity of each parameter update is quadratic the dimension of latent variables for both and derivatives furthermore the overall computational complexity of our order sgvi is linear the number of parameters in real applications just like sgd or ada however hfsgvi may not behave as fast as ada in some situations when the pixel values of images are sparse due to fast matrix multiplication implementation in most softwares future research will focus on some difficult deep models such as rnns or dynamic sbn because of conditional independent structure by giving sampled latent variables we may construct blocked hessian matrix to optimize such dynamic models another possible area of future work would be reinforcement learning rl many rl problems can be reduced to compute gradients of expectations in policy gradient methods and there has been series of exploration in this area for natural gradients however we would suggest that it might be interesting to consider where stochastic backpropagation fits in our framework and how order computations can help acknolwedgement this research was supported in part by the research grants council of the hong kong special administrative region grant no references matthew james beal variational algorithms for approximate bayesian inference phd thesis david blei andrew ng and michael jordan latent dirichlet allocation journal of machine learning research bonnans jean charles gilbert claude and claudia numerical optimization theoretical and practical aspects springer science business media georges bonnet transformations des signaux travers les non sans annals of telecommunications george dahl tara sainath and geoffrey hinton improving deep neural networks for lvcsr using rectified linear units and dropout in icassp john duchi elad hazan and yoram singer adaptive subgradient methods for online learning and stochastic optimization journal of machine learning research dumitru erhan yoshua bengio aaron courville manzagol pascal vincent and samy bengio why does 
unsupervised help deep learning journal of machine learning research thomas ferguson location and scale parameters in exponential families of distributions annals of mathematical statistics pages zhe gan chunyuan li ricardo henao david carlson and lawrence carin deep temporal sigmoid belief networks for sequence modeling in nips karol gregor ivo danihelka alex graves and daan wierstra draw recurrent neural network for image generation in icml james hensman magnus rattray and neil lawrence fast variational inference in the conjugate exponential family in nips geoffrey hinton peter dayan brendan frey and radford neal the algorithm for unsupervised neural networks science geoffrey hinton and ruslan salakhutdinov reducing the dimensionality of data with neural networks science matthew hoffman david blei chong wang and john paisley stochastic variational inference journal of machine learning research mohammad khan decoupled variational gaussian inference in nips diederik kingma shakir mohamed danilo jimenez rezende and max welling learning with deep generative models in nips diederik kingma and max welling variational bayes in iclr james martens deep learning via optimization in icml andriy mnih and karol gregor neural variational inference and learning in belief networks in icml volodymyr mnih koray kavukcuoglu david silver andrei rusu joel veness marc bellemare alex graves martin riedmiller andreas fidjeland georg ostrovski et al control through deep reinforcement learning nature jiquan ngiam adam coates ahbik lahiri bobby prochnow quoc le and andrew ng on optimization methods for deep learning in icml razvan pascanu and yoshua bengio revisiting natural gradient for deep networks arxiv preprint barak pearlmutter fast exact multiplication by the hessian neural computation robert price useful theorem for nonlinear devices having gaussian inputs information theory ire transactions on danilo jimenez rezende shakir mohamed and daan wierstra stochastic backpropagation and approximate inference in deep generative models in icml tim salimans markov chain monte carlo and variational inference bridging the gap in icml ilya sutskever oriol vinyals and quoc vv le sequence to sequence learning with neural networks in nips michalis titsias and miguel doubly stochastic variational bayes for inference in icml 
randomized block krylov methods for stronger and faster approximate singular value decomposition christopher musco massachusetts institute of technology eecs cambridge ma usa cpmusco cameron musco massachusetts institute of technology eecs cambridge ma usa cnmusco abstract since being analyzed by rokhlin szlam and tygert and popularized by halko martinsson and tropp randomized simultaneous power iteration has become the method of choice for approximate singular value decomposition it is more accurate than simpler sketching algorithms yet still converges quickly for any matrix independently of singular value gaps after iterations it gives approximation within of optimal for spectral norm error we give the first provable runtime improvement on simultaneous iteration randomized block krylov method closely related to the classic block lanczos rithm gives the same guarantees in just iterations and performs substantially better experimentally our analysis is the first of krylov subspace method that does not depend on singular value gaps which are unreliable in practice furthermore while it is simple accuracy benchmark even error for spectral norm approximation does not imply that an algorithm returns high quality principal components major issue for data applications we address this problem for the first time by showing that both block krylov iteration and simultaneous iteration give nearly optimal pca for any matrix this result further justifies their strength over sketching methods introduction any matrix with rank can be written using singular value decomposition svd as uσvt and have orthonormal columns left and right singular vectors and is positive diagonal matrix containing singular values σr rank partial svd algorithm returns just the top left or right singular vectors of these are the first columns of or denoted uk and vk respectively among countless applications the svd is used for optimal approximation and principal component analysis pca specifically for partial svd can be used to construct rank approximation ak such that both ka ak kf and ka ak are as small as possible we simply set ak uk utk that is ak is projected onto the space spanned by its top singular vectors for principal component analysis top singular vector provides top principal component which describes the direction of greatest variance within the ith singular vector ui provides the ith principal component which is the direction of greatest variance orthogonal to all higher principal components formally denoting ith singular value as σi uti aat ui max xt aat traditional svd algorithms are expensive typically running in time so there has been substantial research on randomized techniques that seek nearly optimal approximation and pca these methods are quickly becoming standard tools in practice and implementations are widely available including in popular learning libraries recent work focuses on algorithms whose runtimes do not depend on properties of in contrast classical literature typically gives runtime bounds that depend on the gaps between singular values and become useless when these gaps are small which is often the case in practice see section this limitation is due to focus on how quickly approximate singular vectors converge to the actual singular vectors of when two singular vectors have nearly identical values they are difficult to distinguish so convergence inherently depends on singular value gaps only recently has shift in approximation goal along with an improved understanding of randomization allowed 
for algorithms that avoid gap dependence and thus run provably fast for any matrix for approximation and pca we only need to find subspace that captures nearly as much variance as top singular vectors distinguishing between two close singular values is overkill prior work the fastest randomized svd algorithms run in nnz are based on sketching methods and return rank matrix with orthonormal columns zk satisfying frobenius norm error ka zzt akf ka ak kf unfortunately as emphasized in prior work frobenius norm error is often hopelessly insufficient especially for data analysis and learning applications when has of singular values which is common for noisy data ka ak can be huge potentially much larger than top singular value this renders meaningless since does not need to align with any large singular vectors to obtain good multiplicative error to address this shortcoming number of papers target spectral norm approximation error spectral norm error ka zzt ka ak which is intuitively stronger when looking for rank approximation top singular vectors are often considered data and the remaining tail is considered noise spectral norm guarantee roughly ensures that zzt recovers up to this noise threshold series of work shows that the decades old simultaneous power iteration also called subspace iteration or orthogonal iteration implemented with random start vectors achieves after iterations hence this method which was popularized by halko martinsson and tropp in has become the randomized svd algorithm of choice for practitioners our results algorithm imultaneous teration algorithm lock rylov teration input error rank input error rank output output log log aat aπ aπ aat aπ aat aπ orthonormalize the columns of to obtain orthonormalize the columns of to obtain compute aa compute qt aat set to the top singular vectors of set to the top singular vectors of return return faster algorithm we show that algorithm randomized relative of the block lanczos algorithm which we call block krylov iteration gives the same guarantees as simultaneous iteration algorithm in just iterations this not only gives the fastest known theoretical runtime for achieving but also yields substantially better performance in practice see section here nnz is the number of entries in and this runtime hides lower order terms even though the algorithm has been discussed and tested for potential improvement over simultaneous iteration theoretical bounds for krylov subspace and lanczos methods are much more limited as highlighted in despite decades of research on lanczos methods the theory for randomized power iteration is more complete and provides strong guarantees of excellent accuracy whether or not there exist any gaps between the singular our work addresses this issue giving the first gap independent bound for krylov subspace method stronger guarantees in addition to runtime improvements we target much stronger notion of approximate svd that is needed for many applications but for which no analysis was known specifically as noted in while intuitively stronger than frobenius norm error spectral norm approximation error does not guarantee any accuracy in for many consider with its top squared singular values all equal to followed by tail of smaller singular values at ka ak but in fact ka zzt for any rank leaving the spectral norm bound useless at the same time ka ak is large so frobenius error is meaningless as well for example any obtains ka zzt ka ak with this scenario in mind it is unsurprising that approximation guarantees fail as 
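an accuracy measure on their own.

For reference, here is a compact NumPy sketch of both procedures as described by the (extraction-damaged) pseudocode above: a random Gaussian start, repeated multiplication by AA^T (all powers collapsed for Simultaneous Iteration, all blocks retained for Block Krylov Iteration), orthonormalization, and the small SVD of Q^T A as post-processing. The iteration count q is left as a parameter since the constants in q = Θ(log d / ε) and q = Θ(log d / √ε) are not recoverable here, and the function names are ours, not the paper's.

```python
import numpy as np

def _postprocess(A, Q, k):
    """Rank-k output: top-k left singular directions of Q^T A, mapped back through Q."""
    Ub, _, _ = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ Ub[:, :k]                      # n x k matrix Z with orthonormal columns

def simultaneous_iteration(A, k, q, rng=np.random.default_rng(0)):
    """Randomized simultaneous (subspace) power iteration; K = (A A^T)^q A Pi."""
    n, d = A.shape
    K = A @ rng.standard_normal((d, k))       # A * Pi
    for _ in range(q):
        K, _ = np.linalg.qr(K)                # re-orthonormalize for stability
        K = A @ (A.T @ K)                     # multiply by A A^T
    Q, _ = np.linalg.qr(K)
    return _postprocess(A, Q, k)

def block_krylov_iteration(A, k, q, rng=np.random.default_rng(0)):
    """Randomized block Krylov iteration; K = [A Pi, (A A^T) A Pi, ..., (A A^T)^q A Pi]."""
    n, d = A.shape
    blocks = [A @ rng.standard_normal((d, k))]
    for _ in range(q):
        B, _ = np.linalg.qr(blocks[-1])       # orthonormalize each block as it is built
        blocks.append(A @ (A.T @ B))
    Q, _ = np.linalg.qr(np.hstack(blocks))    # basis for the Krylov subspace, n x (q+1)k
    return _postprocess(A, Q, k)
```

Both return an n × k matrix Z with orthonormal columns. Before examining how the multiplicative guarantees alone can fail as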
an accuracy measure in practice we ran standard approximate svd algorithm see section on amazon an amazon product dataset and achieved very good approximation error in both norms for ka zzt akf ak kf and ka zzt ak however the approximate principal components given by are of significantly lower quality than true singular vectors see figure we saw similar results for number of other datasets ut aat singular value zti aat zi index figure poor per vector error for amazon returned by approximate svd that gives very good approximation in both spectral and frobenius norm we address this issue by introducing per vector guarantee that requires each approximate singular vector zk to capture nearly as much variance as the corresponding true singular vector per vector error uti aat ui zti aat zi the error bound is very strong in that it depends on which is better then relative error for large singular values while it is reminiscent of the bounds sought in classical numerical analysis we stress that does not require each zi to converge to ui in the presence of small singular value gaps in fact we show that both randomized block krylov iteration and our slightly modified simultaneous iteration algorithm achieve in runtimes main result our contributions are summarized in theorem its detailed proof is relegated to the full version of this paper the runtimes are given in theorems and and the three error bounds shown in theorems and in section we provide sketch of the main ideas behind the result in fact it does not even imply frobenius norm error theorem main theorem with high probability algorithms and find approximate singular vectors zk satisfying guarantees and for approximation and for pca algorithm requires log iterations while algorithm requires log iterations excluding lower order terms both algorithms run in time nnz kq in the full version of this paper we also use our results to give an alternative analysis that does depend on singular value gaps and can offer significantly faster convergence when has decaying singular values it is possible to take further advantage of this result by running algorithms and with that has columns simple modification for accelerating either method in section we test both algorithms on number of large datasets we justify the importance of gap independent bounds for predicting algorithm convergence and we show that block krylov iteration in fact significantly outperforms the more popular simultaneous iteration comparison to classical bounds decades of work has produced variety of gap dependent bounds for krylov methods most relevant to our work are bounds for block krylov methods with block size equal to roughly speaking with randomized initialization these results offerp guarantees equivalent to our strong equation for the top singular directions after log σk iterations this bound is recovered in section of this paper full version when the target accuracy is smaller than the relative singular value gap σk it is tighter than our gap independent results however as discussed in section for high dimensional data problems where is set far above machine precision gap independent bounds more accurately predict required iteration count prior work also attempts to analyze algorithms with block size smaller than while small block algorithms offer runtime advantages it is well understood that with duplicate singular values it is impossible to recover the top singular directions with block of size more generally large singular value clusters slow convergence so any small block 
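algorithm necessarily pays for those gaps.

To make the three error measures used throughout concrete, the helper below evaluates a candidate Z against an exact SVD; it is a direct transcription of the Frobenius-norm, spectral-norm, and per-vector definitions above. It uses a dense SVD, so it is meant only for validating results on small matrices, not as part of either algorithm.

```python
import numpy as np

def approximation_errors(A, Z, k):
    """Frobenius, spectral, and per-vector error ratios for a candidate Z
    (n x k, orthonormal columns) against the exact rank-k SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] * s[:k] @ Vt[:k]                        # best rank-k approximation
    R = A - Z @ (Z.T @ A)                                  # residual after projecting onto Z
    fro_ratio = np.linalg.norm(R, 'fro') / np.linalg.norm(A - A_k, 'fro')
    spec_ratio = np.linalg.norm(R, 2) / s[k]               # ||A - A_k||_2 = sigma_{k+1}
    # Per-vector error: variance missed by each z_i, relative to sigma_{k+1}^2.
    per_vec = np.abs(s[:k]**2 - np.einsum('ij,ij->j', A.T @ Z, A.T @ Z)) / s[k]**2
    return fro_ratio, spec_ratio, per_vec.max()
```

Ratios near 1 for the first two quantities and near 0 for the third correspond to the guarantees holding with small ε. As noted above, any small-block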
algorithm must have runtime dependence on the gaps between each adjacent pair of top singular values analyzing simultaneous iteration before discussing our proof of theorem we review prior work on simultaneous iteration to demonstrate how it can achieve the spectral norm guarantee algorithms for frobenius norm error typically work by sketching into very few dimensions using random projection matrix with poly columns aπ is usually random gaussian or possibly sparse random sign matrix and is computed using the svd of aπ or of projected onto aπ this approach is very efficient the computation of aπ is easily parallelized and regardless in single processor setting furthermore once small compression of is obtained it can be manipulated in fast memory for the final computation of however frobenius norm error seems an inherent limitation of methods the noise from lower singular values corrupts aπ making it impossible to extract good partial svd if the sum of these singular values equal to ka ak is too large in order to achieve spectral norm error simultaneous iteration must reduce this noise down to the scale of ka ak it does this by working with the powered matrix aq by the spectral theorem aq has exactly the same singular vectors as but its singular values are equal to those of raised to the th power powering spreads the values apart and accordingly aq lower singular values are relatively much smaller than its top singular values see example in figure specifically is sufficient to increase any singular value to be significantly poly times larger than any value this effectively denoises our problem if we use sketching method to find good for approximating aq up to frobenius norm error will have to align very well with every singular vector with value it thus provides an accurate basis for approximating up to small spectral norm error for nonsymmetric matrices we work with aat but present the symmetric case here for simplicity spectrum of spectrum of aq xo to singular value σi index an chebyshev polynomial to pushes low values nearly as close to zero as xo singular values compared to those of aq rescaled to match on notice the significantly reduced tail after figure replacing with matrix polynomial facilitates higher accuracy approximation computing aq directly is costly so aq is computed iteratively start with random and repeatedly multiply by on the left since even rough frobenius norm approximation for aq suffices can be chosen to have just columns each iteration thus takes nnz time when analyzing simultaneous iteration uses the following randomized result to find that gives coarse frobenius norm approximation to aq and therefore good spectral norm approximation to the lemma is numbered for consistency with our full paper lemma frobenius norm approximation for any and where the entries of are independent gaussians drawn from if we let be an orthonormal basis for span bπ then with probability at least for some fixed constant kb zzt dkkb bk for analyzing block methods results like lemma can effectively serve as replacement for earlier random initialization analysis that applies to single vector power and krylov methods aq poly σm for any with σm plugging into lemma kaq zzt aq cdk aq cdk aq σm aq poly aq rearranging using pythagorean theorem we have kzzt aq kaq poly that is aq projection onto captures nearly all of its frobenius norm this is only possible if aligns very well with the top singular vectors of aq and hence gives good spectral norm approximation for proof sketch for theorem the 
intuition for beating simultaneous iteration with block krylov iteration matches that of many accelerated iterative methods simply put there are better polynomials than aq for denoising tail singular values in particular we can use lower degree polynomial allowing us to compute fewer powers of and thus leading to an algorithm with fewer iterations for example an appropriately shifted log degree chebyshev polynomial can push the tail of nearly as close to zero as ao log even if the long run growth of the polynomial is much lower see figure specifically we prove the following scalar polynomial lemma in the full version of our paper which can then be applied to effectively denoising singular value tail lemma chebyshev minimizing polynomial for and log there exists degree polynomial such that and for poly for furthermore we can choose the polynomial to only contain monomials with odd powers block krylov iteration takes advantage of such polynomials by working with the krylov subspace aπ aq from which we can construct pq for any polynomial pq of degree since the polynomial from lemma must be scaled and shifted based on the value of we can not easily compute it directly instead we argue that the very best rank approximation to lying in the span of at least matches the approximation achieved by projecting onto the span of pq finding this best approximation will therefore give nearly optimal approximation to unfortunately there catch surprisingly it is not clear how to efficiently compute the best spectral norm error approximation to lying in given subspace span this challenge precludes an analysis of krylov methods parallel to recent work on simultaneous iteration nevertheless since our analysis shows that projecting to captures nearly all the frobenius norm of pq we can show that the best frobenius norm approximation to in the span of gives good enough spectral norm approximation by the following lemma this optimal frobenius norm approximation is given by zzt where is exactly the output of algorithm lemma lemma of given and with orthonormal columns min ka qckf ka qqt kf ka qt kf qt can be obtained using an svd of the matrix qt aat specifically letting be the svd of and then qt zzt stronger per vector error guarantees achieving the per vector guarantee of requires more nuanced understanding of how simultaneous iteration and block krylov iteration denoise the spectrum of the analysis for spectral norm approximation relies on the fact that aq or pq for block krylov iteration blows up any singular value to much larger than any singular value this ensures that our output aligns very well with the singular vectors corresponding to these large singular values if σk then aligns well with all top singular vectors of and we get good frobenius norm error and the per vector guarantee unfortunately when there is small gap between σk and could miss intermediate singular vectors whose values lie between and this is the case where gap dependent guarantees of classical analysis break down however aq or for block krylov iteration some polynomial in our krylov subspace also significantly separates singular values from those thus each column of at least aligns with nearly as well as so even if we miss singular values between and they will be replaced with approximate singular values enough for for frobenius norm approximation we prove that the degree to which falls outside of the span of top singular vectors depends on the number of singular values between and these are the values that could be swapped in for the 
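true top k singular values.

A two-line numerical check of the polynomial intuition behind the speedup, using NumPy's Chebyshev class; the specific degree and gap value are arbitrary choices made for illustration, not constants from the analysis.

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

def amplification(gamma=0.01, degree=40):
    """Compare how strongly a degree-`degree` Chebyshev polynomial and the plain
    monomial x**degree separate a value just above 1 (a singular value at least
    (1 + gamma) sigma_k) from the interval [-1, 1] (the tail), on which both stay bounded."""
    T = Chebyshev.basis(degree)          # T_degree, bounded by 1 on [-1, 1]
    x = 1.0 + gamma
    return T(x), x ** degree             # growth just outside [-1, 1]

# With gamma = 0.01, T_40(1.01) is already in the hundreds while 1.01**40 is about 1.5,
# illustrating why degree ~ 1/sqrt(eps) suffices for Krylov methods versus ~ 1/eps
# for plain power iteration.
print(amplification())
```

A degree growing like 1/√ε therefore achieves the separation that plain powering needs degree roughly 1/ε to reach. Returning to the argument: the intermediate singular directions are exactly the ones that could be swapped in for the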
true top singular values since their weight counts towards tail our total loss compared to optimal is at worst ak implementation and runtimes for both algorithm and can be replaced by random sign matrix or any matrix achieving the guarantee of lemma may also be chosen with columns in our full paper we discuss in detail how this approach can give improved accuracy simultaneous iteration in our implementation we set which is necessary for achieving per vector guarantees for approximate pca however for near optimal approximation we can simply set projecting to is equivalent to projecting to as these matrices have the same column spans since powering spreads its singular values aat aπ could be poorly conditioned to improve stability we orthonormalize after every iteration or every few iterations this does not change column span so it gives an equivalent algorithm in exact arithmetic algorithm in fact only constructs odd powered terms in which is sufficient for our choice of pq theorem simultaneous iteration runtime algorithm runs in time nnz log nk log proof computing requires first multiplying by which takes nnz time computing aπ then takes nnz time to first multiply our aat aπ given aat matrix by at and then by reorthogonalizing after each iteration takes nk time via gramschmidt this gives total runtime of nnz kq nk for computing finding takes nk time computing by multiplying from left to right requires nnz nk time svd then requires time using classical techniques finally multiplying by takes time nk setting log gives the claimed runtime block krylov iteration in the traditional block lanczos algorithm one starts by computing an orthonormal basis for aπ the first block in bases for subsequent blocks are computed from previous blocks using three term recurrence that ensures qt aat is block tridiagonal with sized blocks this technique can be useful if qk is large since it is faster to compute the top singular vectors of block tridiagonal matrix however computing using recurrence can introduce number of stability issues and additional steps may be required to ensure that the matrix remains orthogonal an alternative uesd in and our algorithm is to compute explicitly and then find using qr decomposition this method does not guarantee that qt aat is block tridiagonal but avoids stability issues furthermore if qk is small taking the svd of qt aat will still be fast and typically dominated by the cost of computing as with simultaneous iteration we orthonormalize each block of after it is computed avoiding poorly conditioned blocks and giving an equivalent algorithm in exact arithmetic theorem block krylov iteration runtime algorithm runs in time nnz log nk proof computing including reorthogonalization requires nnz kq nk time the remaining steps are analogous to those in simultaneous iteration except somewhat more costly as we work with rather than dimensional subspace finding takes kq time computing take nnz kq kq time and its svd then requires kq time finally plying by takes time nk kq setting log gives the claimed runtime experiments we close with several experimental results variety of empirical papers not to mention widespread adoption already justify the use of randomized svd algorithms prior work focuses in particular on benchmarking simultaneous iteration and due to its improved accuracy over approaches this algorithm is popular in practice as such we focus on demonstrating that for many data problems block krylov iteration can offer significantly better convergence we implement both 
algorithms in MATLAB, using Gaussian random starting matrices with exactly k columns. We explicitly compute K for both algorithms as described in the previous section, and use re-orthonormalization at each iteration to improve stability. We test the algorithms with varying iteration count on three common datasets (Amazon, Email-Enron, and 20 Newsgroups), computing k column principal components. In all cases we plot error versus iteration count for the three metrics in the figure; for per-vector error we plot the maximum deviation amongst all top-k approximate principal components, normalized as in the per-vector guarantee. Unsurprisingly, both algorithms obtain very accurate Frobenius-norm error ||A − ZZ^T A||_F relative to ||A − A_k||_F within very few iterations; this is our intuitively weakest guarantee, and in the presence of a heavy singular value tail both iterative algorithms outperform the worst-case analysis. On the other hand, for spectral-norm approximation and per-vector error we confirm that Block Krylov Iteration converges much more rapidly than Simultaneous Iteration, as predicted by our theoretical analysis.

Figure: Frobenius-norm, spectral-norm, and per-vector error convergence rates of Block Krylov Iteration and Simultaneous Iteration on Amazon, Email-Enron, and 20 Newsgroups, together with an error-versus-runtime cost plot for 20 Newsgroups.

It is often possible to achieve nearly optimal error within a handful of iterations, whereas reaching a small target error with Simultaneous Iteration can take much longer. The final plot in the figure shows error versus runtime for the high-dimensional 20 Newsgroups dataset; we averaged over several trials and ran the experiments on a commodity laptop. As predicted, because its additional memory overhead and post-processing costs are small compared to the cost of the large matrix multiplication required at each iteration, Block Krylov Iteration outperforms Simultaneous Iteration for small ε. More generally, these results justify the importance of convergence bounds that are independent of singular value gaps. The gap-dependent analysis in the full version of this paper predicts that once ε is small in comparison to the gap, we should see much more rapid convergence, since the iteration count then depends only logarithmically on 1/ε. However, for Simultaneous Iteration we do not see this behavior on Amazon, and it only just begins to emerge for 20 Newsgroups. While all three datasets have rapid singular value decay, a careful look confirms that their singular value gaps at k are actually quite small for Amazon and 20 Newsgroups in comparison to Email-Enron. Accordingly, the frequent claim that singular value gaps can be taken as constant is insufficient, even for small k.

References
Vladimir Rokhlin, Arthur Szlam, and Mark Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications.
Nathan Halko, Per-Gunnar Martinsson, and Joel Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review.
Improved approximation
algorithms for large matrices via random projections in proceedings of the annual ieee symposium on foundations of computer science focs martinsson vladimir rokhlin and mark tygert randomized algorithm for the approximation of matrices technical report yale university kenneth clarkson and david woodruff low rank approximation and regression in input sparsity time in proceedings of the annual acm symposium on theory of computing stoc pages antoine liutkus randomized svd matlab central file exchange daisuke okanohara redsvd randomized svd https david hall et al scalanlp breeze http ibm reseach division skylark team libskylark distributed matrix computations for machine learning ibm corporation armonk ny pedregosa et al machine learning in python jmlr arthur szlam yuval kluger and mark tygert an implementation of randomized algorithm for principal component analysis zohar karnin and edo liberty online pca with spectral bounds in proceedings of the annual conference on computational learning theory colt pages rafi witten and emmanuel randomized algorithms for matrix factorizations sharp performance bounds algorithmica christos boutsidis petros drineas and malik matrix reconstruction siam journal on computing david woodruff sketching as tool for numerical linear algebra found trends in theoretical computer science andrew tulloch fast randomized singular value decomposition http jane cullum and donath block lanczos algorithm for computing the algebraically largest eigenvalues and corresponding eigenspace of large sparse real symmetric matrices in ieee conference on decision and control including the symposium on adaptive processes pages gene golub and richard underwood the block lanczos method for computing eigenvalues mathematical software nathan halko martinsson yoel shkolnisky and mark tygert an algorithm for the principal component analysis of large data sets siam journal on scientific computing nathan halko randomized methods for computing approximations of matrices phd thesis of colorado ming gu subspace iteration randomization and singular value problems timothy davis and yifan hu the university of florida sparse matrix collection acm transactions on mathematical software december jure leskovec lada adamic and bernardo huberman the dynamics of viral marketing acm transactions on the web may saad on the rates of convergence of the lanczos and the methods siam journal on numerical analysis cameron musco and christopher musco randomized block krylov methods for stronger and faster approximate singular value decomposition yousef saad numerical methods for large eigenvalue problems revised edition volume gene golub franklin luk and michael overton block lanczos method for computing the singular values and corresponding singular vectors of matrix acm trans math golub and van loan matrix computations johns hopkins university press edition li and zhang convergence of the block lanczos method for eigenvalue clusters numerische mathematik michael cohen sam elder cameron musco christopher musco and madalina persu dimensionality reduction for clustering and low rank approximation in proceedings of the annual acm symposium on theory of computing stoc friedrich bauer das verfahren der treppeniteration und verwandte verfahren zur algebraischer eigenwertprobleme zeitschrift angewandte mathematik und physik zamp and estimating the largest eigenvalue by the power and lanczos algorithms with random start siam journal on matrix analysis and applications kin cheong sou and anders rantzer on the minimum rank 
of generalized matrix approximation problem in the maximum singular value norm in proceedings of the international symposium on mathematical theory of networks and systems mtns martinsson arthur szlam and mark tygert normalized power iterations for the computation of svd nips workshop on methods for machine learning jure leskovec jon kleinberg and christos faloutsos graphs over time densification laws shrinking diameters and possible explanations in proceedings of the acm sigkdd international conference on knowledge discovery and data mining kdd pages jason rennie newsgroups http may 
matching for data via kernel embeddings of latent distributions yuya nara institute of science and technology nara japan tomoharu iwata ntt communication science laboratories kyoto japan hiroshi sawada ntt service evolution laboratories kanagawa japan takeshi yamada ntt communication science laboratories kyoto japan abstract we propose method for finding matching between instances across different domains such as multilingual documents and images with annotations each instance is assumed to be represented as multiset of features representation for documents the major difficulty in finding relationships is that the similarity between instances in different domains can not be directly measured to overcome this difficulty the proposed method embeds all the features of different domains in shared latent space and regards each instance as distribution of its own features in the shared latent space to represent the distributions efficiently and nonparametrically we employ the framework of the kernel embeddings of distributions the embedding is estimated so as to minimize the difference between distributions of paired instances while keeping unpaired instances apart in our experiments we show that the proposed method can achieve high performance on finding correspondence between wikipedia articles between documents and tags and between images and tags introduction the discovery of matched instances in different domains is an important task which appears in natural language processing information retrieval and data mining tasks such as finding the alignment of sentences attaching tags to images or text documents and matching user identifications in different databases when given an instance in source domain our goal is to find the instance in target domain that is the most closely related to the given instance in this paper we focus on supervised setting where correspondence information between some instances in different domains is given to find matching in single domain find documents relevant to an input document similarity or distance measure between instances can be used on the other hand when trying to find matching between instances in different domains we can not directly measure the distances since they consist of different types of features for example when matching documents in different languages since the documents have different vocabularies we can not directly measure the similarities between documents across different languages without dictionaries the author moved to software technology and artificial intelligence research laboratory stair lab at chiba institute of technology japan figure an example of the proposed method used on multilingual document matching task correspondences between instances in source english and target japanese domains are observed the proposed method assumes that each feature vocabulary term has latent vector in shared latent space and each instance is represented as distribution of the latent vectors of the features associated with the instance then the distribution is mapped as an element in reproducing kernel hilbert space rkhs based on the kernel embeddings of distributions the latent vectors are estimated so that the paired instances are embedded closer together in the rkhs one solution is to map instances in both the source and target domains into shared latent space one such method is canonical correspondence analysis cca which maps instances into latent space by linear projection to maximize the correlation between paired instances in the latent 
space however in practice cca can not solve relationship problems due to its linearity to find correspondence kernel cca can be used it has been reported that kernel cca performs well as regards alignment between different languages when searching for images from text queries and when matching face images note that the performance of kernel cca depends on how appropriately we define the kernel function for measuring the similarity between instances within domain many kernels such as linear polynomial and gaussian kernels can not consider the occurrence of different but semantically similar words in two instances because these kernels use the between the feature vectors representing the instances for example words pc and computer are different but indicate the same meaning nevertheless the kernel value between instances consisting only of pc and consisting only of computer is equal to zero with linear and polynomial kernels even if gaussian kernel is used the kernel value is determined only by the vector length of the instances in this paper we propose matching method that can overcome the problem of kernel cca figure shows an example of the proposed method the proposed method assumes that each feature in source and target domains is associated with latent vector in shared latent space since all the features are mapped into the latent space the proposed method can measure the similarity between features in different domains then each instance is represented as distribution of the latent vectors of features that are contained in the instance to represent the distributions efficiently and nonparametrically we employ the framework of the kernel embeddings of distributions which measures the difference between distributions in reproducing kernel hilbert space rkhs without the need to define parametric distributions the latent vectors are estimated by minimizing the differences between the distributions of paired instances while keeping unpaired instances apart the proposed method can discover unseen matching in test data by using the distributions of the estimated latent vectors we will explain matching between two domains below however the proposed method can be straightforwardly extended to matching between three and more domains by regarding one of the domains as pivot domain in our experiments we demonstrate the effectiveness of our proposed method in tasks that involve finding the correspondence between wikipedia articles between documents and tags and between images and tags by comparison with existing linear and matching methods related work as described above canonical correlation analysis cca and kernel cca have been successfully used for finding various types of matching when we want to match instances represented by such as documents bilingual topic models can also be used the difference between the proposed method and these methods is that since the proposed method represents each instance as set of latent vectors of its own features the proposed method can learn more complex representation of the instance than these existing methods that represent each instance as single latent vector another difference is that the proposed method employs discriminative approach while kernel cca and bilingual topic models employ generative ones to model data deep learning and neural network approaches have been recently proposed unlike such approaches the proposed method performs matching without deciding the number of layers of the networks which largely affects their performances key technique of 
the proposed method is the kernel embeddings of distributions which can represent distribution as an element in an rkhs while preserving the moment information of the distribution such as the mean covariance and moments without density estimation the kernel embeddings of distributions have been successfully used for statistical test of the independence of two sample sets discriminative learning on distribution data anomaly detection for group data density estimation and three variable interaction test most previous studies about the kernel embeddings of distributions consider cases where the distributions are unobserved but the samples generated from the distributions are observed additionally each of the samples is represented as dense vector with the proposed method the kernel embedding technique can not be used to represent the observed multisets of features such as for documents since each of the features is represented as vector whose dimensions are zero except for the dimension indicating that the feature has one in this study we benefit from the kernel embeddings of distributions by representing each feature as dense vector in shared latent space the proposed method is inspired by the use of the kernel embeddings of distributions in data classification and regression their methods can be applied to single domain data and the latent vectors of features are used to measure the similarity between the features in domain unlike these methods the proposed method is used for the matching of two different types of domain data and the latent vectors are used for measuring the similarity between the features in different domains kernel embeddings of distributions in this section we introduce the framework of the kernel embeddings of distributions the kernel embeddings of distributions are used to embed any probability distribution on space into reproducing kernel hilbert space rkhs hk specified by kernel and the distribution is represented as element in the rkhs more precisely when given distribution the kernel embedding of the distribution is defined as follows dp hk where kernel is referred to as embedding kernel it is known that kernel embedding preserves the properties of probability distribution such as the mean covariance and moments by using characteristic kernels gaussian rbf kernel when set of samples xl is drawn from the distribution by interpreting sample set as empirical distribution δxl where δx is the dirac delta function at point empirical kernel embedding is given by xl which can be approximated with an error rate of op unlike kernel density estimation the error rate of the kernel embeddings is independent of the dimensionality of the given distribution measuring difference between distributions by using the kernel embedding representation eq we can measure the difference between two distributions given two sets of samples xl and where xl and belong to the same space we can obtain their kernel embedding representations and then the difference between and is given by intuitively it reflects the difference in the moment information of the distributions the difference is equivalent to the square of maximum mean discrepancy mmd which is used for statistical test of independence of two distributions the difference can be calculated by expanding eq as follows where is an in the rkhs in particular is given by xl xl hk and can also be calculated by eq proposed method suppose that we are given training set consisting of instance pairs dsi dti where di is the ith instance in source 
domain and dti is the ith instance in target domain these instances dsi and dti are represented as multisets of features included in source feature set and target feature set respectively this means that these instances are represented as bow the goal of our task is to determine the unseen relationship between instances across source and target domains in test data the number of instances in the source domain may be different to that in the target domain kernel embeddings of distributions in shared latent space as described in section the difficulty as regards finding instance matching is that the similarity between instances across source and target domains can not be directly measured we have also stated that although we can find latent space that can measure the similarity by using kernel cca standard kernel functions gaussian kernel can not reflect the of different but related features in kernel calculation between instances to overcome them we propose new data representation for finding instance matching the proposed method assumes that each feature in source feature set has latent vector xf rq in shared space likewise each feature in target feature set has latent vector yg rq in the shared space since all the features in the source and target domains are mapped into common shared space the proposed method can capture the relationship between features both in each domain and across different domains we define the sets of latent vectors in the source and target domains as xf and yg respectively the proposed method assumes that each instance is represented by distribution or multiset of the latent vectors of the features that are contained in the instance the ith instance in the source domain dsi is represented by set of latent vectors xi xf and the jth instance in the target domain dtj is represented by set of latent vectors yj yg note that xi and yj lie in the same latent space in section we introduced the kernel embedding representation of distribution and described how to measure the difference between two distributions when samples generated from the distribution are observed in the proposed method we employ the kernel embeddings of distributions to represent the distributions of the latent vectors for the instances the kernel embedding representations for the ith source and the jth target domain instances are given by xi xf yj yg then the difference between the distributions of the latent vectors are measured by using eq that is the difference between the ith source and the jth target domain instances is given by xi yj xi yj model the proposed method assumes that paired instances have similar distributions of latent vectors and unpaired instances have different distributions in accordance with the assumption we define the likelihood of the relationship between the ith source domain instance and the jth target domain instance as follows exp xi yj dtj exp xi yj where is set of for the embedding kernel used in eq eq is in fact the conditional probability with which the jth target domain instance is chosen given the ith source domain instance this formulation is more efficient than we consider bidirectional matching intuitively when distribution xi is more similar to yj than other distributions yj the probability has higher value we define the posterior distribution of latent vectors and placing gaussian priors with precision parameter for and that is exp exp the posterior distribution is given by dti where dsi dti is training set of instance pairs is set of hyperparameters and dxdy 
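As a rough illustration of the matching likelihood just defined (a softmax over negative distribution differences) and of the regularized objective, the sketch below reuses the squared_mmd helper from the earlier snippet; the Gaussian-kernel parameter gamma, the precision rho, and the toy latent vectors are illustrative assumptions, not values from the paper.

```python
import numpy as np
# assumes squared_mmd from the earlier sketch is in scope

def match_log_likelihood(Xs, Ys, gamma=1.0):
    """log p(d_j^t | d_i^s) for all pairs (i, j).

    Xs: list of (n_i, q) arrays, latent vectors of the features in each source instance.
    Ys: list of (m_j, q) arrays, latent vectors of the features in each target instance.
    Row i is the log-softmax over -D(X_i, Y_j), i.e. the conditional probability of
    each target instance given the i-th source instance."""
    D = np.array([[squared_mmd(Xi, Yj, gamma) for Yj in Ys] for Xi in Xs])
    logits = -D
    m = logits.max(axis=1, keepdims=True)
    log_Z = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return logits - log_Z

def neg_log_posterior(Xs, Ys, gamma=1.0, rho=1e-2):
    """Simplified negative log posterior: paired instances (i, i) should get high
    conditional probability; a Gaussian prior with precision rho regularizes the
    latent vectors."""
    ll = np.trace(match_log_likelihood(Xs, Ys, gamma))
    reg = 0.5 * rho * (sum((X**2).sum() for X in Xs) + sum((Y**2).sum() for Y in Ys))
    return -ll + reg

# toy usage with random latent vectors for 3 paired instances
rng = np.random.default_rng(1)
Xs = [rng.standard_normal((5, 2)) for _ in range(3)]
Ys = [X + 0.1 * rng.standard_normal(X.shape) for X in Xs]  # targets close to their sources
print(neg_log_posterior(Xs, Ys))
```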
is marginal probability which is constant with respect to and learning we estimate latent vectors and by maximizing the posterior probability of the latent vectors given by eq instead of eq we consider the following negative logarithm of the posterior probability xi yi log exp xi yj and minimize it with respect to the latent vectors here maximizing eq is equivalent to minimizing eq to minimize eq with respect to and we perform optimization the gradient of eq with respect to each xf is given by xi yi xi yj ρxf eij ci where eij exp xi yj ci exp xi yj and the gradient of the difference between distributions xi and yj with respect to xf is given by xl yg xi yj xl when the distribution xi does not include the latent vector xf the gradient consistently becomes zero vector is the gradient of an embedding kernel this depends on the choice of kernel when the embedding kernel is gaussian kernel the gradient is calculated as with eq in similarly the gradient of eq with respect to each yg is given by xi yi xi yj ρyg eij ci where the gradient of the difference between distributions xi and yj with respect to yg can be calculated as with eq learning is performed by alternately updating using eq and updating using eq until the improvement in the negative log likelihood eq converges matching after the estimation of the latent vectors and the proposed method can reveal the matching between test instances the matching is found by first measuring the difference between given source domain instance and target domain instances using eq and then searching for the instance pair with the smallest difference experiments in this section we report our experimental results for three different types of datasets wikipedia and datasets setup of proposed method throughout these experiments we used gaussian kernel with param eter xf yg exp yg as an embedding kernel the of the proposed method are the dimensionality of shared latent space regularizer parameter for latent vectors and gaussian embedding kernel parameter after we train the proposed method with various and we chose the optimal by using validation data when training the proposed method we initialized latent vectors and by applying principle component analysis pca to matrix concatenating two matrices in the source and target domains then we employed the method with gradients given by eqs to learn the latent vectors comparison methods we compared the proposed method with the neighbor method knn canonical correspondence analysis cca kernel cca kcca bilingual latent dirichlet allocation blda and kernel cca with the kernel embeddings of distributions for test instance in the source domain our knn searches for the nearest neighbor source instances in the training data and outputs target instance in the test data which is located close to the target instances that are paired with the searched for source instances cca and kcca first learn the projection of instances into shared latent space using training data and then they find matching between instances by projecting the test instances into the shared latent space kcca used gaussian kernel for measuring the similarity between instances and chose the optimal gaussian kernel parameter and regularizer parameter by using validation data with blda we first learned the same model as and found matching between instances in the test data by obtaining the topic distributions of these instances from the learned model uses the kernel embeddings of distributions described in section for obtaining the kernel values between the 
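The test-time matching step described above (measure the difference between a source instance and every target instance, then pick the pair with the smallest difference) can be sketched in a few lines; this is an illustrative helper, again assuming the squared_mmd function from the earlier snippet.

```python
import numpy as np
# assumes squared_mmd from the earlier sketch is in scope

def rank_targets(X_test, Ys_test, gamma=1.0):
    """Rank all target test instances for one source test instance by the
    difference between their embedded distributions (smallest difference first).

    X_test: (n, q) latent vectors of the source instance's features.
    Ys_test: list of (m_j, q) latent-vector sets for the target test instances.
    Returns target indices sorted by increasing difference; index 0 is the
    predicted match."""
    d = np.array([squared_mmd(X_test, Yj, gamma) for Yj in Ys_test])
    return np.argsort(d)
```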
instances the vector representations of features were obtained by applying singular value decomposition svd for frequency matrices here we set the dimensionality of the vector representations to then learns kernel cca with the kernel values as with the above kcca with cca kcca blda and we chose the optimal latent dimensionality or number of topics within by using validation data evaluation method throughout the experiments we quantitatively evaluated the matching performance by using the precision with which the true target instance is included in set of candidate instances found by each method more formally the precision is given by precision nte ti si nte where nte is the number of test instances in the target domain ti is the ith true target instance si is candidate instances of the ith source instance and is the binary function that returns if the argument is true and otherwise matching between bilingual documents with wikipedia document dataset we examine whether the proposed method can find the correct matching between documents written in different languages the dataset includes wikipedia documents for each of six languages german de english en finnish fi french fr italian it and japanese ja and documents with the same content are aligned across the languages from the dataset we create bilingual document pairs we regard the first component of the pair as source domain and the other as target domain for each of the bilingual document pairs we randomly create evaluation sets that consist of document pairs as training data document pairs as validation data and document pairs as test data here each document is represented as without stopwords and low frequency words figure shows the matching precision for each of the bilingual pairs of the wikipedia dataset with all the bilingual pairs the proposed method achieves significantly higher precision than the other methods with wide range of table shows examples of predicted matching with the wikipedia dataset compared with kcca which is the second best method the figure precision of matching prediction and its standard deviation on wikipedia datasets table top five english documents matched by the proposed method and kcca given five japanese documents in the wikipedia dataset titles in bold typeface indicate correct matching japanese input title sd カード sd card proposed kcca intel sd card libavcodec mplayer freeware bbc world news sd card morocco phoenix hours of le mans japanese input title 炭疽症 anthrax proposed kcca psittacosis anthrax dehydration isopoda cataract dehydration psittacosis cataract hypergeometric distribution long island iced tea japanese input title ドップラー効果 doppler effect proposed kcca lu deconmposition redshift doppler effect phenylalanine dehydration long island iced tea opportunity cost cataract hypergeometric distribution intel japanese input title メキシコ料理 mexican cuisine proposed kcca mexican cuisine long island iced tea phoenix baldr china radio international taoism chariot anthrax digital millennium copyright act alexis de tocqueville japanese input title フリーウェア freeware proposed kcca bbc world news opportunity cost freeware nfs intel digital millennium copyright act china radio international hypergeometric distribution taoism chariot proposed method can find both the correct document and many related documents for example in table the correct document title is sd card the proposed method outputs the sd card document and documents related to computer technology such as intel and mplayer this is because the proposed 
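For completeness, a small sketch of the evaluation measure used in these experiments (the fraction of source test instances whose true target appears among the top-h candidates); the argument layout is an assumption made for illustration.

```python
import numpy as np

def precision_at_h(rankings, true_targets, h):
    """Matching precision: fraction of source test instances whose true target
    is contained in the set of h candidate instances returned for them.

    rankings: for each source test instance i, the target indices sorted by
              increasing distribution difference (e.g. output of rank_targets).
    true_targets: the correct target index for each source test instance."""
    hits = [true_targets[i] in rankings[i][:h] for i in range(len(true_targets))]
    return float(np.mean(hits))
```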
method can capture the relationship between words and reflect the difference between documents across different domains by learning the latent vectors of the words matching between documents and tags and between images and tags we performed experiments matching documents and tailgates and matching images and tailgates with the datasets used in when matching documents and tailgates we use datasets obtained from two social bookmarking sites and and patent dataset the delicious and the hatena datasets include pairs consisting of web page and tag list labeled by users and the patent dataset includes pairs consisting of patent description and tag list representing the category of the patent each web page and each patent description are represented https http figure precision of matching prediction and its standard deviation on delicious hatena patent and flickr datasets figure two examples of input tag lists and the top five images matched by the proposed method on the flickr dataset as as with the experiments using the wikipedia dataset and the tag list is represented as set of tags with the matching of images and tag lists we use the flickr dataset which consists of pairs of images and tag lists each image is represented as which is obtained by first extracting features using sift and then applying clustering with components to the sift features for all the datasets the numbers of training test and validation pairs are and respectively figure shows the precision of the matching prediction of the proposed and comparison methods for the delicious hatena patent and flickr datasets the precision of the comparison methods with these datasets was much the same as the precision of random prediction nevertheless the proposed method achieved very high precision particularly for the delicious hatena and patent datasets figure shows examples of input tag lists and the top five images matched by the proposed method with the flickr dataset in the examples the proposed method found the correct images and similar related images from given tag lists conclusion we have proposed novel method for addressing instance matching tasks with data the proposed method represents each feature in all the domains as latent vector in shared latent space to capture the relationship between features each instance is represented by distribution of the latent vectors of features associated with the instance which can be regarded as samples from the unknown distribution corresponding to the instance to calculate difference between the distributions efficiently and nonparametrically we employ the framework of kernel embeddings of distributions and we learn the latent vectors so as to minimize the difference between the distributions of paired instances in reproducing kernel hilbert space experiments on various types of datasets confirmed that the proposed method significantly outperforms the existing methods for matching acknowledgments this work was supported by jsps for jsps fellows references zhang liu and zhao cross lingual entity linking with bilingual topic model in proceedings of the international joint conference on artificial intelligence yunchao gong qifa ke michael isard and svetlana lazebnik embedding space for modeling internet images tags and their semantics international journal of computer vision oct tomoharu iwata yamada and ueda modeling social annotation data with content relevance using topic model in advances in neural information processing systems citeseer bin li qiang yang and xiangyang xue transfer 
learning for collaborative filtering via ratingmatrix generative model in proceedings of the annual international conference on machine learning hotelling relations between two sets of variants biometrika akaho kernel method for canonical correlation analysis in proceedings of international meeting on psychometric society number alexei vinokourov john and nello cristianini inferring semantic representation of text via correlation analysis in advances in neural information processing systems yaoyong li and john using kcca for information retrieval and document classification journal of intelligent information systems sep nikhil rasiwasia jose costa pereira emanuele coviello gabriel doyle gert lanckriet roger levy and nuno vasconcelos new approach to multimedia retrieval in proceedings of the international conference on multimedia patrik kamencay robert hudec miroslav benco and martina zachariasov face recognition method based on modified algorithm international journal of advanced robotic systems tomoharu iwata shinji watanabe and hiroshi sawada fashion coordinates recommender system using photographs from fashion magazines in proceedings of the international joint conference on artificial intelligence aaai press jul jiquan ngiam aditya khosla mingyu kim juhan nam honglak lee and andrew ng multimodal deep learning in proceedings of the international conference on machine learning pages galen andrew raman arora jeff bilmes and karen livescu deep canonical correlation analysis in proceedings of the international conference on machine learning pages alex smola arthur gretton le song and bernhard hilbert space embedding for distributions in algorithmic learning theory gretton fukumizu teo song and smola kernel statistical test of independence in advances in neural information processing systems krikamol muandet kenji fukumizu francesco dinuzzo and bernhard learning from distributions via support measure machines in advances in neural information processing systems krikamol muandet and bernhard support measure machines for group anomaly detection in proceedings of the conference on uncertainty in artificial intelligence dudik phillips and schapire maximum entropy density estimation with generalized regularization and an application to species distribution modeling journal of machine learning research dino sejdinovic arthur gretton and wicher bergsma kernel test for interactions in advances in neural information processing systems yuya yoshikawa tomoharu iwata and hiroshi sawada latent support measure machines for data classification in advances in neural information processing systems yuya yoshikawa tomoharu iwata and hiroshi sawada regression for data via gaussian process latent variable set model in proceedings of the aaai conference on artificial intelligence bharath sriperumbudur arthur gretton kenji fukumizu bernhard and gert lanckriet hilbert space embeddings and metrics on probability measures the journal of machine learning research dong liu and jorge nocedal on the limited memory bfgs method for large scale optimization mathematical programming aug 
scalable inference for gaussian process models with likelihoods edwin bonilla the university of new south wales amir dezfouli the university of new south wales akdezfuli abstract we propose sparse method for scalable automated variational inference avi in large class of models with gaussian process gp priors multiple latent functions multiple outputs and likelihoods our approach maintains the statistical efficiency property of the original avi method requiring only expectations over univariate gaussian distributions to approximate the posterior with mixture of gaussians experiments on small datasets for various problems including regression classification log gaussian cox processes and warped gps show that our method can perform as well as the full method under high sparsity levels on larger experiments using the mnist and the sarcos datasets we show that our method can provide superior performance to previously published scalable approaches that have been handcrafted to specific likelihood models introduction developing automated yet practical approaches to bayesian inference is problem that has attracted considerable attention within the probabilisitic machine learning community see in the case of models with gaussian process gp priors the main challenge is that of dealing with large number of latent variables although promising directions within the sampling community such as elliptical slice sampling ess have been proposed they have been shown to be particularly slow compared to variational methods in particular showed that their automated variational inference avi method can provide posterior distributions that are practically indistinguishable from those obtained by ess while running orders of magnitude faster one of the fundamental properties of the method proposed in is its statistical efficiency which means that in order to approximate posterior distribution via the maximization of the evidence lower bound elbo it only requires expectations over univariate gaussian distributions regardless of the likelihood model remarkably this property holds for large class of models involving multiple latent functions and multiple outputs however this method is still impractical for large datasets as it inherits the cubic computational cost of gp models on the number of observations while there have been several approaches to large scale inference in gp models these have been focused on regression and classification problems the main obstacle to apply these approaches to inference with general likelihood models is that it is unclear how they can be extended to frameworks such as those in while maintaining that desirable property of statistical efficiency in this paper we build upon the approach underpinning most sparse approximations to gps in order to scale up the automated inference method of in particular for models with multiple latent functions multiple outputs and likelihoods such as in classification and gaussian process regression networks we propose sparse approximation whose computational complexity is in time where is the number of inducing points this approximation maintains the statistical efficiency property of the original avi method as the resulting elbo decomposes over the training data points our method can scale up to very large number of observations and is amenable to stochastic optimization and parallel computation moreover it can in principle approximate arbitrary posterior distributions as it uses mog as the family of approximate posteriors we refer to our method as 
savigp which stands for scalable automated variational inference for gaussian process models our experiments on small datasets for problems including regression classification log gaussian cox processes and warped gps show that savigp can perform as well as the full method under high levels of sparsity on larger experiment on the mnist dataset our approach outperforms the distributed variational inference method in who used density modeling approach our method unlike uses single discriminative framework finally we use savigp to do inference for the gaussian process regression network model on the sar cos dataset concerning an inverse robot dynamics problem we show that we can outperform previously published scalable approaches that used inference algorithms related work there has been interest in the gp community to overcome the cubic scaling of inference in standard gp models however none of these approaches actually dealt with the harder tasks of developing scalable inference methods for problems and general likelihood models the former multiple output problem has been addressed notably by and using the convolution process formalism nevertheless such approaches were specific to regression problems the latter problem general likelihood models has been tackled from sampling perspective and within an optimization framework using variational inference in particular the work of proposes an efficient full gaussian posterior approximation for gp models with iid observations our work pushes this breakthrough further by allowing multiple latent functions multiple outputs and more importantly scalability to large datasets related area of research is that of modeling complex data with deep belief networks based on gaussian process mappings unlike our approach these models target the unsupervised problem of discovering structure in data do not deal with likelihoods and focus on applications finally very recent developments in probabilistic kernel machines show that the types of problems we are addressing here are highly relevant to the machine learning community in particular has proposed efficient inference methods for large scale gp classification and has developed distributed variational approach for gp models with focus on regression and classification problems our work unlike these approaches allows practitioners and researchers to investigate new models with gp priors and complex likelihoods for which currently there is no machinery that can scale to very large datasets gaussian process priors and nonlinear likelihoods we are given dataset xn yn where xn is input vector and yn is output our goal is to learn the mapping from inputs to outputs which can be established via underlying latent functions fj sensible modeling approach to the above problem is to assume that the latent functions fj are uncorrelated priori and that they are drawn from gaussian processes kj where is the set of all latent function values fj xn denotes the values of latent function and kj is the covariance matrix induced by the covariance function κj evaluated at every pair of inputs along with the prior in equation we can also assume that our multidimensional observations yn are iid given the corresponding set of latent functions fn yn where is the set of all output observations yn is the nth output observation and fj xn is the set of latent function values which yn depends upon in short we are ested in models for which the following criteria are satisfied factorization of the prior over the latent functions and ii 
factorization of the conditional likelihood over the observations given the latent functions interestingly large class of problems can be well modeled with the above assumptions binary classification warped gps log gaussian cox processes classification and regression all belong to this family of models automated variational inference one of the key inference challenges in the above models is that of computing the posterior distribution over the latent functions ideally we would like an efficient method that does not need to know the details of the likelihood in order to carry out posterior inference this is exactly the main result in which approximates the posterior with within variational inference framework this entails the optimization of an evidence lower bound which decomposes as term and an expected log likelihood ell term as the term is relatively straightforward to deal with we focus on their main result regarding the ell term th the expected log likelihood and its gradients can be approximated using samples from univariate gaussian distributions more generally we say that the ell term and its gradients can be estimated using expectations over univariate gaussian distributions we refer to this result as that of statistical efficiency one of the main limitations of this method is its poor scalability to large datasets as it has cubic time complexity on the number of data points in the next section we describe our inference method that scales up to large datasets while maintaining the statistical efficiency property of the original model scalable inference in order to make inference scalable we redefine our prior to be sparse by conditioning the latent processes on set of inducing variables which lie in the same space as and are drawn from the same gp priors as before we assume factorization of the prior across the latent functions hence the resulting sparse prior is given by zj zj zj zj zj κj aj zj with aj zj zj zj where are the inducing variables for latent process is the set of all the inducing variables zj are all the inducing inputs locations for latent process is the matrix of all input locations xi and is the covariance matrix induced by evaluating the covariance function κj at all pairwise vectors of matrices and we note that while each of the inducing variables in lies in the same space as the elements in each of the inducing inputs in zj lies in the same space as each input data point xn given the latent function values the conditional likelihood factorizes across data points and is given by equation approximate posterior we will approximate the posterior using variational inference motivated by the fact that the true joint posterior is given by our approximate posterior has the form where is the conditional prior given in equation and is our approximate variational posterior this decomposition has proved effective in problems with single latent process and single output see our variational distribution is mixture of gaussians mog πk qk sk πk mkj skj where πk mkj skj are the variational parameters the mixture proportions πk the posterior means mkj and posterior covariances skj of the inducing variables corresponding to mixture component and latent function we also note that each of the mixture components qk sk is gaussian with mean mk and covariance sk posterior approximation via optimization of the evidence lower bound following variational inference principles the log marginal likelihood log or evidence is lower bounded by the variational objective log lelbo log df du kp 
lkl lell where the evidence lower bound lelbo decomposes as the sum of an expected log likelihood term lell and term lkl our goal is to estimate our posterior distribution via maximization of lelbo we consider first the lell term as it is the most difficult to deal with since we do not know the details of the implementation of the conditional likelihood expected log likelihood term here we need to compute the expectation of the log conditional likelihood log over the joint approximate posterior given in equation our goal is to obtain expressions for the lell term and its gradients wrt the variational parameters while maintaining the statistical efficiency property of needing only expectations from univariate gaussians for this we first introduce an intermediate distribution that is obtained by integrating out from the joint approximate posterior lell log df du log du df given our approximate posterior in equation can be obtained analytically πk qk bkj aj mkj πk bkj σkj with aj skj at σkj and aj are given in equation now we can rewrite equation as where lell πk eqk log πk eqk log where eq denotes the expectation of function over the distribution here we have used the mixture decomposition of in equation and the factorization of the likelihood over the data points in equation now we are ready to state formally our main result theorem for the sparse gp model with prior defined in equations to and likelihood defined in equation the expected log likelihood over the variational distribution in equation and its gradients can be estimated using expectations over univariate gaussian distributions given the result in equation the proof is trivial for the computation of lell as we only need to realize that qk bk σk given in equation has covariance structure consequently qk is gaussian with diagonal covariance for the gradients of lell wrt the variational parameters we use the following identity eqk log yn eqk log qk log yn for λk mk sk and the result for πk is straightforward explicit computation of lell we now provide explicit expressions for the computation of lell we know that qk is gaussian with qk bk σk where σk is diagonal matrix the jth element of the mean and the th entry of the covariance are given by bk aj mkj aj skj at σk where and denote the nth row and nth column of matrix respectively hence we can compute lell as follows os bk σk xx lbell πk log the gradients of lell wrt variational parameters are given in the supplementary material term we turn now our attention to the term which can be decomposed as follows kp eq log eq log lent lcross where the entropy term lent can be lower bounded using jensen inequality lent πk log def mk sk the negative term lcross can be computed exactly lcross πk log log zj zj mtkj zj zj mkj tr zj zj skj the gradients of the above terms wrt the variational parameters are given in the supplementary material hyperparameter learning and scalability to large datasets for simplicity in the notation we have omitted the parameters of the covariance functions and the likelihood parameters from the elbo however in our experiments we optimize these along with the variational parameters in alternating optimization framework the gradients of the elbo wrt these parameters are given in the supplementary material the original framework of is completely unfeasible for large datasets as its complexity is dominated by the inversion of the gram matrix on all the training data which is an operation where is the number of training points our sparse framework makes automated 
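The statistical-efficiency property discussed above (the expected log likelihood and its gradients only need expectations over univariate Gaussians, with the likelihood treated as a black box) can be illustrated with a small Monte Carlo sketch; this is not the authors' implementation, and the softmax example likelihood, sample count, and shapes are illustrative assumptions.

```python
import numpy as np

def mc_expected_log_lik(log_lik, b, s_diag, weights, n_samples=1000, seed=0):
    """Monte Carlo estimate of the expected log likelihood term for one data point:
        sum_k pi_k * E_{N(f; b_k, diag(s_k))}[ log p(y_n | f) ].
    Only univariate Gaussian samples are needed because each mixture component's
    marginal posterior over the latent function values at x_n is a diagonal Gaussian.

    log_lik: black-box function f -> log p(y_n | f); never inspected by the sketch.
    b: (K, Q) posterior means, s_diag: (K, Q) posterior variances, weights: (K,) mixture weights."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for pi_k, b_k, s_k in zip(weights, b, s_diag):
        f = b_k + np.sqrt(s_k) * rng.standard_normal((n_samples, len(b_k)))
        total += pi_k * np.mean([log_lik(fi) for fi in f])
    return total

# example with a softmax (multi-class) likelihood, one of the models this framework supports
def softmax_log_lik(f, y=2):
    m = f.max()
    return f[y] - m - np.log(np.sum(np.exp(f - m)))

print(mc_expected_log_lik(softmax_log_lik, b=np.zeros((1, 5)),
                          s_diag=np.ones((1, 5)), weights=np.array([1.0])))
```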
variational inference practical for large datasets as its complexity is dominated by inversions of the kernel matrix on the inducing points which is an operation where is the number of inducing points per latent process furthermore as the lell and its gradients decompose over the training points and the lkl term decomposes over the number of latent process our method is amenable to stochastic optimization and or parallel computation which makes it scalable to very large number of input observations output dimensions and latent processes in our experiments in section we show that our sparse framework can achieve similar performance to the full method on small datasets under high levels of sparsity moreover we carried out experiments on larger datasets for which is practically impossible to apply the full method fg fg nlpd sse sf figure the sse and nlpd for warped gps on the abalone dataset where lower values on both measures are better three approximate posteriors are used fg full gaussian mog diagonal gaussian and mog mixture of two diagonal gaussians along with various sparsity factors sf the smaller the sf the sparser the model with corresponding to no sparsity experiments our experiments first consider the same six benchmarks with various likelihood models analyzed by the number of training points on these benchmarks ranges from to and their input dimensionality ranges from to the goal of this first set of experiments is to show that savigp can attain as good performance as the full method under high sparsity levels we also carried out experiments at larger scale using the mnist dataset and the sarcos dataset the application of the original automated variational inference framework on these datasets is unfeasible we refer the reader to the supplementary material for the details of our experimental we used two performance measures in each experiment the standardized squared error sse and the negative log predictive density nlpd for problems and the error rate and the negative log probability nlp for problems we use three versions of savigp fg and corresponding to full gaussian diagonal gaussian and mixture of diagonal gaussians with components respectively we refer to the ratio of the number of inducing points over the number of training points as sparsity factor experiments in this section we describe the results on three out of six benchmarks used by and analyze the performance of savigp the other three benchmarks are described in the supplementary material warped gaussian processes wgp abalone dataset yn yn yn for this task we used the same transformation as in and the results for the abalone dataset are shown in figure we see that the performance of savigp is practically indistinguishable across all sparsity factors for sse and nlpd here we note that showed that automated variational inference performed competitively when compared to methods for warped gps λyn exp log gaussian cox process lgcp disasters dataset yn yn here we used the lgcp for modeling the number of disasters between years to we note that reported that automated variational inference the focus of this paper produced practically indistinguishable distributions but run order of magnitude faster when compared to sampling methods such as elliptical slice sampling the results for our sparse models are shown in figure where we see that both models fg and mog remain mostly unaffected when using high levels of sparsity we also confirm the findings in that the mog model underestimates the variance of the predictions binary 
classification wisconsin breast cancer dataset yn exp classification error rates and the negative log probability nlp on the wisconsin breast cancer dataset are shown in figure we see that the error rates are comparable across all models and sparsity factors interestingly sparser models achieved lower nlp values suggesting overconfident predictions by the less sparse models especially for the mixtures of diagonal gaussians sf sf sf intensity sf fg event counts time figure left the disasters data right the posteriors for log gaussian cox process on these data when using full gaussian fg and diagonal gaussian mog for various sparsity factors sf the smaller the sf the sparser the model with corresponding to no sparsity the solid line is the posterior mean and the shading area includes confidence interval fg fg nlp error rate sf figure error rates and nlp for binary classification on the wisconsin breast cancer dataset three approximate posteriors are used fg full gaussian mog diagonal gaussian and mog mixture of two diagonal gaussians along with various sparsity factors sf the smaller the sf the sparser the model with corresponding to the original model without sparsity error bars on the left plot indicate confidence interval around the mean experiments in this section we show the results of the experiments carried out on larger datasets with likelihoods classification on the mnist dataset we first considered classification task on the mnist dataset using the softmax likelihood this dataset has been extensively used by the machine learning community and contains examples for training for validation and for testing with input vectors unlike most previous approaches we did not tune additional parameters using the validation set instead we used our variational framework for learning all the model parameters using all the training and validation data this setting most likely provides lower bound on test accuracy but our goal here is simply to show that we can achieve competitive performance with models as our inference algorithm does not know the details of the conditional likelihood figure left and middle shows error rates and nlps where we see that although the performance decreases with sparsity the method is able to attain an accuracy of while using only around inducing points sf to the best of our knowledge we are the first to train gaussian process classifier using single discriminative probabilistic framework on all classes on mnist for example used approach and focused on the binary classification task of distinguishing the odd digits from the even digits finally trained one model for each digit and used it as density model achieving an error rate of our experiments show that by having single discriminative probabilistic framework even without exploiting the details of the conditional likelihood we can bring this error rate down to as reference previous literature reports about error rate by linear classifiers and less than error rate by convolutional nets fg sf smse nlp error rate fg sf sf output figure left and middle classification error rates and negative log probabilities nlp for the problem on mnist here we used the fg full gaussian approximation with various sparsity factors sf the smaller the sf the sparser the model right the smse for gaussian process regression network model on the sarcos dataset when learning the and torques output and output with fg full gaussian approximation and sparsity factor our results show that our method while solving the harder problem of full 
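To illustrate what "the inference algorithm does not know the details of the conditional likelihood" means for a multi-output model such as the Gaussian process regression network used below, here is a hedged sketch of a GPRN-style log likelihood written as a black box over the latent function values; the shapes, noise variance, and packing of node functions and weights into a single vector are assumptions made only for this example.

```python
import numpy as np

def gprn_log_lik(f, y, P=2, Q=3, noise_var=0.1):
    """Black-box log likelihood for one input location of a GPRN-style model.
    The first Q entries of f are the node (latent) functions; the remaining P*Q
    entries are the input-dependent mixing weights; the P-dimensional output is
    modeled as y ~ N(W f_node, noise_var * I)."""
    f_node = f[:Q]
    W = f[Q:].reshape(P, Q)
    resid = y - W @ f_node
    return -0.5 * (P * np.log(2 * np.pi * noise_var) + resid @ resid / noise_var)

# such a black-box log likelihood is all that a statistically efficient AVI framework
# needs to evaluate (e.g. inside mc_expected_log_lik from the earlier sketch)
y = np.array([0.3, -1.2])
f = np.random.default_rng(0).standard_normal(3 + 2 * 3)
print(gprn_log_lik(f, y))
```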
posterior estimation can reduce the gap between gps and deep nets gaussian process regression networks on the sarcos dataset here we apply our savigp inference method to the gaussian process regression networks gprns model of using the sarcos dataset as test bed gprns are very flexible regression approach where outputs are linear combination of latent gaussian processes with the weights of the linear combination also drawn from gaussian processes this yields multiple output likelihood model where the correlations between the outputs can be spatially adaptive input dependent the sarcos dataset concerns an inverse dynamics problem of anthropomorphic robot arm the data consists of training examples mapping from input space joint positions joint velocities joint accelerations to the corresponding joint torques similarly to the work in we consider joint learning for the and torques which we refer to as output and output respectively and make predictions on test points per output figure right shows the standardized mean square error smse with the full gaussian approximation fg using less than inducing points the results are considerably better than those reported by and for each output respectively although their setting was much sparser than ours on the first output this also corroborates previous findings that on this problem having more data does help to the best of our knowledge we are the first to perform inference in gprns on problems at this scale conclusion we have presented scalable approximate inference method for models with gaussian process gp priors multiple outputs and nonlinear likelihoods one of the key properties of this method is its statistical efficiency in that it requires only expectations over univariate gaussian distributions to approximate the posterior with mixture of gaussians extensive experimental evaluation shows that our approach can attain excellent performance under high sparsity levels and that it can outperform previous inference methods that have been handcrafted to specific likelihood models overall this work makes substantial contribution towards the goal of developing generic yet scalable bayesian inference methods for models based on gaussian processes acknowledgments this work has been partially supported by unsw faculty of engineering research grant program project and an aws in education research grant award ad was also supported by grant from the australian research council references pedro domingos stanley kok hoifung poon matthew richardson and parag singla unifying logical and statistical ai in aaai noah goodman vikash mansinghka daniel roy keith bonawitz and joshua tenenbaum church language for generative models in uai matthew hoffman and andrew gelman the sampler adaptively setting path lengths in hamiltonian monte carlo jmlr rajesh ranganath sean gerrish and david blei black box variational inference in aistats iain murray ryan prescott adams and david mackay elliptical slice sampling in aistats trung nguyen and edwin bonilla automated variational inference for gaussian process models in nips hannes nickisch and carl edward rasmussen approximations for binary gaussian process classification jmlr james hensman nicolo fusi and neil lawrence gaussian processes for big data in uai yarin gal mark van der wilk and carl rasmussen distributed variational inference in sparse gaussian process regression and latent variable models in nips trung nguyen and edwin bonilla collaborative gaussian processes in uai trung nguyen and edwin bonilla fast allocation of 
gaussian process experts in icml joaquin and carl edward rasmussen unifying view of sparse approximate gaussian process regression jmlr michalis titsias variational learning of inducing variables in sparse gaussian processes in aistats andrew wilson david knowles and zoubin ghahramani gaussian process regression networks in icml edward snelson carl edward rasmussen and zoubin ghahramani warped gaussian processes in nips sethu vijayakumar and stefan schaal locally weighted projection regression an algorithm for incremental real time learning in high dimensional space in icml neil lawrence matthias seeger and ralf herbrich fast sparse gaussian process methods the informative vector machine in nips ed snelson and zoubin ghahramani sparse gaussian processes using in nips mauricio and neil lawrence computationally efficient convolved multiple output gaussian processes jmlr mauricio david luengo michalis titsias and neil lawrence efficient multioutput gaussian processes through variational inducing kernels in aistats manfred opper and archambeau the variational gaussian approximation revisited neural computation andreas damianou and neil lawrence deep gaussian processes in aistats james hensman alexander matthews and zoubin ghahramani scalable variational gaussian process classification in aistats zichao yang andrew gordon wilson alexander smola and le song la carte learning fast kernels in aistats carl edward rasmussen and christopher williams gaussian processes for machine learning the mit press christopher williams and david barber bayesian classification with gaussian processes pattern analysis and machine intelligence ieee transactions on jesper møller anne randi syversveen and rasmus plenge waagepetersen log gaussian cox processes scandinavian journal of statistics bache and lichman uci machine learning repository jarrett note on the intervals between disasters biometrika 
Fast Bidirectional Probability Estimation in Markov Models. Siddhartha Banerjee (sbanerjee), Peter Lofgren (plofgren). Abstract: We develop a new bidirectional algorithm for estimating Markov chain multi-step transition probabilities: given a Markov chain, we want to estimate the probability of hitting a given target state in $\ell$ steps after starting from a given source distribution. Given the target state, we use a reverse local power iteration to construct an expanded target distribution, which has the same mean as the quantity we want to estimate but smaller variance; this can then be sampled efficiently by a Monte Carlo algorithm. Our method extends to any Markov chain on a discrete (finite or countable) state space, and can be extended to compute functions of multi-step transition probabilities such as PageRank, graph diffusions, hitting times, etc. Our main result is that in sparse Markov chains, wherein the number of transitions between states is comparable to the number of states, the running time of our algorithm for a typical target node is significantly smaller than that of Monte Carlo and power-iteration-based algorithms; in particular, our method can estimate a probability of order $\delta$ with running time sublinear in $1/\delta$. Introduction: Markov chains are one of the workhorses of stochastic modeling, finding use across a variety of applications: MCMC algorithms for simulation and statistical inference, computing network centrality metrics for data mining applications, statistical physics, operations management models for reliability, inventory, and supply chains, etc. In this paper we consider a fundamental problem associated with Markov chains, which we refer to as the multi-step transition probability (MSTP) estimation problem: given a Markov chain on state space $S$ with transition matrix $P$, an initial source distribution $\sigma$ over $S$, a target state $t$, and a fixed length $\ell$, we are interested in computing the $\ell$-step transition probability from $\sigma$ to $t$. Formally, we want to estimate $p^{\ell}[\sigma, t] := \langle \sigma P^{\ell}, e_t \rangle = \sigma P^{\ell} e_t^{\top}$, where $e_t$ is the indicator vector of state $t$. A natural parametrization for the complexity of MSTP estimation is in terms of the minimum transition probability we want to detect: given a desired minimum detection threshold $\delta$, we want algorithms that give estimates which guarantee small relative error for any $(\sigma, t, \ell)$ such that $p^{\ell}[\sigma, t] > \delta$. Parametrizing in terms of the minimum detection threshold can be thought of as benchmarking against the standard Monte Carlo algorithm, which estimates $p^{\ell}[\sigma, t]$ by sampling independent length-$\ell$ paths starting from states sampled from $\sigma$. An alternate technique for MSTP estimation is based on linear-algebraic iterations, in particular the (local) power iteration; we discuss these in more detail below. Crucially, however, both these techniques have running time that scales linearly in $1/\delta$ for testing whether $p^{\ell}[\sigma, t] > \delta$ (cf. the discussion of existing approaches below). (Siddhartha Banerjee is an assistant professor at the School of Operations Research and Information Engineering at Cornell, http. Peter Lofgren is a graduate student in the Computer Science Department at Stanford, http.) Our results: to the best of our knowledge, our work gives the first bidirectional algorithm for MSTP estimation that works for general discrete Markov chains. The algorithm we develop is very simple, both in terms of implementation and analysis; moreover, we prove that in many settings it is faster than existing techniques. Our algorithm consists of two distinct forward and reverse components, which are executed sequentially. In brief, the two components proceed as follows: starting from the target node $t$, we perform a sequence of reverse local power iterations, in particular using the REVERSE-PUSH operation defined below; we next sample a number of random walks of length $\ell$, starting from $\sigma$ and transitioning according to $P$, and return the sum of residues along the walk as an estimate of $p^{\ell}[\sigma, t]$. This full algorithm, which
we refer to as the estimator is formalized in algorithm it works for all markov chains giving the following accuracy result theorem for details refer section given any markov chain source distribution terminal state length threshold and relative error algorithm for which with high probability satisfies returns an unbiased estimate max since we dynamically adjust the number of operations to ensure that all residues are small the proof of the above theorem follows from straightforward concentration bounds since combines local power iteration and monte carlo techniques natural question is when the algorithm is faster than both it is easy to to construct scenarios where the runtime of is comparable to its two constituent algorithms for example if has more than surprisingly however we show that in sparse markov chains and for typical target states is faster theorem for details refer section given any markov chain source distribution length threshold and desired accuracy then for uniform random choice of the where is the average algorithm has running time of number of neighbors of nodes in thus for typical targets we can estimate transition probabilities of order in time only note that we do not need for every state that the number of neighboring states is small but rather that they are small on average for example this is true in networks where some nodes have very high degree but the average degree is small the proof of this result is based on modification of an argument in refer section for details estimating transition probabilities to target state is one of the fundamental primitives in markov chain models hence we believe that our algorithm can prove useful in variety of application domains in section we briefly describe how to adapt our method for some of these applications estimating times and stationary probabilities extensions to markov chains in particular for estimating graph diffusions and heat kernels connections to local algorithms and expansion testing in addition our could be useful in several other applications estimating ruin probabilities in reliability models buffer overflows in queueing systems in statistical physics simulations etc existing approaches for there are two main techniques used for the first is natural monte carlo algorithm we estimate by sampling independent paths each starting from random state sampled from simple concentration argument shows that for given value of we need samples to get an accurate estimate of irrespective of the choice of and the structure bidirectional estimators have been developed before for reversible markov chains our method however is not only more general but conceptually and operationally simpler than these techniques cf section of note that this algorithm is agnostic of the terminal state it gives an accurate estimate for any such that on the other hand the problem also admits natural linear algebraic solution using the standard power iteration starting with or the reverse power iteration starting with et which is obtained by equation as et when the state space is large performing direct power iteration is infeasible however there are localized versions of the power iteration that are still efficient such algorithms have been developed among other applications for pagerank estimation and for heat kernel estimation although slow in the worst case such local update algorithms are often fast in practice as unlike monte carlo methods they exploit the local structure of the chain however even in sparse markov chains and for large 
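The standard Monte Carlo baseline described above is simple enough to state directly in code; the following is an illustrative numpy sketch (not the authors' implementation), with a hypothetical toy three-state chain used only as an example.

```python
import numpy as np

def monte_carlo_mstp(P, sigma, t, ell, n_samples=10000, seed=0):
    """Baseline Monte Carlo estimate of the l-step transition probability to t:
    sample independent length-l trajectories started from sigma and return the
    fraction that end in state t.  Reliably detecting probabilities of size ~delta
    needs on the order of 1/delta samples, regardless of the target t."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    hits = 0
    for _ in range(n_samples):
        v = rng.choice(n_states, p=sigma)
        for _ in range(ell):
            v = rng.choice(n_states, p=P[v])
        hits += (v == t)
    return hits / n_samples

# toy 3-state chain (illustrative only)
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
sigma = np.array([1.0, 0.0, 0.0])
print(monte_carlo_mstp(P, sigma, t=2, ell=4))
```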
fraction of target states their running time can be for example consider random walk on random dregular graph and let then for logd verifying es is equivalent to uncovering the entire logd neighborhood of since large random graph is whp an expander this neighborhood has distinct nodes finally note that as with monte carlo power iterations can be adapted to either the source or terminal state but not both for reversible markov chains one can get bidirectional algorithms for estimating es based on colliding random walks for example consider the problem of estimating random walk transition probabilities in regular undirected graph on vertices the main idea is that to test if random walk goes from to in steps with probability we can generate two independent random walks of length starting from and respectively and detect if they terminate at the same intermediate node suppose pw qw are the probabilities that walk from andpt respectively terminate at node then from the reversibility of the chain we have that pw qw this is also the collision probability the critical observation is that if we generate walks from and then we get potential collisions which is sufficient to detect if this argument forms the basis of the and similar techniques used in variety of estimation problems see showing concentration for this estimator is tricky as the samples are not independent moreover to control the variance of the samples the algorithms often need to separately deal with heavy intermediate nodes where pw or qw are much larger than our proposed approach is much simpler both in terms of algorithm and analysis and more significantly it extends beyond reversible chains to any general discrete markov chain the most similar approach to ours is the recent algorithm of lofgren et al for pagerank estimation our algorithm borrows several ideas and techniques from that work however the algorithm relies heavily on the structure of pagerank in particular the fact that the pagerank walk has geometric length and hence can be stopped and restarted due to the memoryless property our work provides an elegant and powerful generalization of the algorithm extending the approach to general markov chains the bidirectional algorithm algorithm as described in section given target state our bidirectional mstp algorithm keeps track of pair of vectors the estimate vector qkt rn and the residual vector rkt rn for each length the vectors are initially all set to the vector except which is initialized as et moreover they are updated using reverse push operation defined as algorithm inputs transition matrix estimate vector qit residual vectors rit return new estimate vectors qit and reit computed as eit qit hrit ev iev rit rit hrit ev iev hrit ev ev in particular local power iterations are slow if state has very large for the forward iteration or for the reverse update the main observation behind our algorithm is that we can in terms of qkt rkt as an expectation over random of the markov chain as follows cf equation hσ evk vk in other words given vectors qkt rkt we can get an unbiased estimator for by sampling random trajectory of the markov chain starting at random state sampled from the source distribution and then adding the residuals along the trajectory as in equation we formalize this bidirectional mstp algorithm in algorithm algorithm max inputs transition matrix source distribution target state maximum steps max minimum probability threshold relative error bound failure probability pf set accuracy parameter based on and pf and 
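The reverse-push update just described (move the residual at a state and level into the estimate, and propagate it one level further to the in-neighbours, weighted by their transition probabilities into that state) can be sketched as follows; this is a simplified dense-matrix illustration under assumed data structures, not the paper's implementation.

```python
import numpy as np

def reverse_push(P, q, r, v, i):
    """One REVERSE-PUSH(v, i) operation.
    q[i], r[i] are length-n arrays: the estimate and residual vectors at level i.
    The residual held at (v, i) is moved into q[i][v] and propagated to level i+1
    on the in-neighbours u of v, weighted by P[u, v]."""
    mass = r[i][v]
    q[i][v] += mass
    r[i][v] = 0.0
    for u in np.nonzero(P[:, v])[0]:
        r[i + 1][u] += P[u, v] * mass

def reverse_stage(P, t, ell, delta_r):
    """Run reverse pushes from the target t until every residual is below delta_r."""
    n = P.shape[0]
    q = [np.zeros(n) for _ in range(ell + 2)]
    r = [np.zeros(n) for _ in range(ell + 2)]
    r[0][t] = 1.0
    for i in range(ell + 1):
        while r[i].max() > delta_r:
            v = int(np.argmax(r[i]))
            reverse_push(P, q, r, v, i)
    return q, r
```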
set reverse threshold δr cf theorems and in our experiments we use and δr initialize estimate vectors qkt residual vectors et and rkt for max do while rit δr do execute end while end for set number of sample paths nf max δr see theorem for details for index nf do sample starting node generate sample path ti vi max of length max starting from for max sample nif orm and compute st rt vik we reinterpret the sum over in equation as an expectation and sample rather sum over for computational speed end for pnf hσ return max where st some intuition behind our approach before formally analyzing the performance of our algorithm we first build some intuition as to why it works in particular it is useful to interpret the estimates and residues in terms in figure we have considered simple markov chain on three states solid hollow and checkered henceforth on the right side we have illustrated an intermediate stage of reverse work using as the target after performing the operations and in that order each push at level uncovers collection figure visualizing sequence of operations given the markov chain on the left with as the target we perform operations of paths terminating at for example in the figure we have uncovered all length and paths and several length paths the crucial observation is that each uncovered path of length starting from node is accounted for in either qiv or riv in particular in figure all paths starting at solid nodes are stored in the estimates of the corresponding states while those starting at blurred nodes are stored in the residue now we can use this set of paths to boost the estimate returned by monte carlo trajectories generated starting from the source distribution the dotted line in the figure represents the current frontier it separates the fully uncovered neighborhood of from the remaining states in sense what the operation does is construct sequence of importancesampling weights which can then be used for monte carlo an important novelty here is that the weights are adapted to the target state and ii dynamically adjusted to ensure the monte carlo estimates have low variance viewed in this light it is easy to see how the algorithm can be modified to applications beyond basic for example to nonhomogenous markov chains or for estimating the probability of hitting target state for the first time in steps cf section essentially we only need an appropriate programming update for the quantity of interest with associated invariant as in equation performance analysis we first formalize the critical invariant introduced in equation lemma given terminal state suppose we initialize et and qkt rkt then for any source distribution and length after any arbitrary sequence of reversepush operations the vectors qkt rkt satisfy the invariant hσ hσp rt the proof follows the outline of similar result in andersen et al for pagerank estimation due to lack of space we defer it to our full version using this result we can now characterize the accuracy of the algorithm theorem we are given any markov chain source distribution terminal state maximum length max and also parameters pf and the desired threshold failure probability and relative error suppose we choose reverse threshold δr and set the number of sample paths nf cδr where max ln ln max then for any length max with probability at least pf the estimate returned by satisfies max proof given any markov chain and terminal state note first that for given length max is an unbiased estimator now for any equation shows that the estimate tk we 
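Continuing the previous sketch, the forward (sampling) stage below combines the residues encountered along sampled walks with the estimate term, following the invariant stated above; for clarity it sums the residuals over all steps of each walk rather than sampling a single random step per walk as the algorithm in the text does, so it is an illustrative variant, not the paper's exact procedure.

```python
import numpy as np
# assumes q, r come from reverse_stage in the previous sketch

def forward_stage(P, sigma, q, r, ell, n_walks=1000, seed=0):
    """Forward stage of the bidirectional estimator.  Uses the invariant
        p^ell[sigma, t] = <sigma, q_t^ell> + E[ sum_{k=0}^{ell} r_t^{ell-k}(V_k) ],
    where V_k is the position of a random walk from sigma after k steps."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    estimate_part = float(np.dot(sigma, q[ell]))
    total = 0.0
    for _ in range(n_walks):
        v = rng.choice(n_states, p=sigma)
        walk_sum = r[ell][v]                 # k = 0 term
        for k in range(1, ell + 1):
            v = rng.choice(n_states, p=P[v])
            walk_sum += r[ell - k][v]
        total += walk_sum
    return estimate_part + total / n_walks
```

Combined with reverse_stage and the toy chain from the earlier Monte Carlo sketch, this yields a complete (if unoptimized) bidirectional estimate.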
Performance analysis. We first formalize the critical invariant introduced in equation (1).

Lemma 1. Given a terminal state t, suppose we initialize r^0_t = e_t and q^k_t = 0, r^k_t = 0 for all other k. Then, for any source distribution σ and length ℓ, after any arbitrary sequence of reverse-push operations, the vectors {q^k_t, r^k_t} satisfy the invariant
p^ℓ_σ(t) = ⟨σ, q^ℓ_t⟩ + Σ_{k=0}^{ℓ} ⟨σ P^k, r^{ℓ−k}_t⟩.     (1)
The proof follows the outline of a similar result in Andersen et al. for PageRank estimation; due to lack of space, we defer it to our full version. Using this result, we can now characterize the accuracy of the algorithm.

Theorem 1. We are given any Markov chain P, source distribution σ, terminal state t, maximum length ℓ_max, and parameters δ, p_fail, and ε (the desired threshold, failure probability, and relative error). Suppose we choose any reverse threshold δ_r and set the number of sample paths n_f = c ℓ_max δ_r, where c is chosen on the order of ln(ℓ_max / p_fail) / (ε² δ). Then, for any length ℓ ≤ ℓ_max, with probability at least 1 − p_fail, the estimate returned by Bidirectional-MSTP satisfies |p̂^ℓ_σ(t) − p^ℓ_σ(t)| ≤ ε · max{p^ℓ_σ(t), δ}.

Proof. Note first that, for a given length ℓ ≤ ℓ_max, equation (1) shows that ⟨σ, q^ℓ_t⟩ + (1/n_f) Σ_i S^ℓ_t(i) is an unbiased estimator of p^ℓ_σ(t). Moreover, each score obeys (i) S^ℓ_t(i) ≥ 0 and (ii) S^ℓ_t(i) ≤ ℓ_max δ_r: the first inequality again follows from equation (1), while the second follows from the fact that we executed reverse-push operations until all residual values were less than δ_r. Now consider the rescaled random variables X_i = S^ℓ_t(i) / (ℓ_max δ_r) ∈ [0, 1] and X = Σ_{i=1}^{n_f} X_i; then E[X] = (n_f / (ℓ_max δ_r)) (p^ℓ_σ(t) − ⟨σ, q^ℓ_t⟩) = c (p^ℓ_σ(t) − ⟨σ, q^ℓ_t⟩). Standard Chernoff bounds (cf. Dubhashi and Panconesi) give both a multiplicative and an additive concentration bound for X around its mean. We consider two cases. First, if p^ℓ_σ(t) − ⟨σ, q^ℓ_t⟩ ≥ δ, then E[X] ≥ cδ, and the multiplicative Chernoff bound gives P[ |p̂^ℓ_σ(t) − p^ℓ_σ(t)| > ε p^ℓ_σ(t) ] ≤ exp(−Ω(ε² cδ)), where we use that n_f = c ℓ_max δ_r (cf. Algorithm 2); by a union bound over the ℓ_max lengths, the total failure probability is at most p_fail as long as cδ is on the order of ln(ℓ_max / p_fail) / ε². Second, if p^ℓ_σ(t) − ⟨σ, q^ℓ_t⟩ < δ, then E[X] < cδ; the additive form of the concentration bound, which applies since the X_i are bounded in [0, 1], shows that the sampling error is at most εδ, again with failure probability at most p_fail after a union bound over ℓ. Combining the two cases, we see that for c on the order of ln(ℓ_max / p_fail) / (ε² δ) we have max_{ℓ ≤ ℓ_max} P[ |p̂^ℓ_σ(t) − p^ℓ_σ(t)| > ε max{p^ℓ_σ(t), δ} ] ≤ p_fail.

One aspect that is not obvious from the intuition above or from the accuracy analysis is whether using a bidirectional method actually improves the running time of MSTP estimation. This is addressed by the following result, which shows that for typical targets our algorithm achieves a significant speedup.

Theorem 2. Let any Markov chain P, source distribution σ, maximum length ℓ_max, and parameters δ, p_fail, and ε be given. Suppose we set the reverse threshold δ_r so as to balance the forward and reverse running times (as in the proof below). Then, for a uniformly random choice of target t, the algorithm has expected running time Õ( ℓ_max^{3/2} √(d̄ / δ) / ε ), where d̄ = m/n is the average degree and Õ hides the logarithmic dependence on ℓ_max and p_fail.

Proof. The runtime of Algorithm 2 consists of two parts. For generating trajectories, we generate n_f = c ℓ_max δ_r sample trajectories, each of length ℓ_max; hence, for any Markov chain, source distribution, and target, the forward running time is T_f = O(c ℓ_max² δ_r). Substituting for c from Theorem 1, T_f = O( ℓ_max² δ_r ln(ℓ_max / p_fail) / (ε² δ) ). For the reverse-push operations, let T_r denote the runtime; since each REVERSE-PUSH(v, k) involves updating the residuals of d_in(v) states, for a uniformly random choice of t we have E[T_r] = (1/n) Σ_t Σ_k Σ_{pushed v} d_in(v). Now, for a given t and k, note that the only states on which we execute REVERSE-PUSH are those with residual r^k_t[v] > δ_r; consequently these states end up with q^k_t[v] ≥ δ_r, and hence, by equation (1) applied with σ = e_v, we have p^k_{v→t} ≥ δ_r. Since Σ_t p^k_{v→t} = 1 for every v and k, a straightforward counting argument gives, for any k, Σ_t Σ_v d_in(v) · 1{ p^k_{v→t} ≥ δ_r } ≤ Σ_v d_in(v) / δ_r = m / δ_r. Thus E[T_r] ≤ (1/n) ℓ_max m / δ_r = ℓ_max d̄ / δ_r. Finally, we choose δ_r to balance T_f and E[T_r], which gives δ_r = ε √( d̄ δ / (ℓ_max ln(ℓ_max / p_fail)) ), and we get the result.

Applications of MSTP estimation. Estimating the stationary distribution and hitting probabilities: MSTP estimation can be used in two ways to estimate stationary probabilities. First, if we know the mixing time τ_mix of the chain, we can directly use Algorithm 2 to approximate π(t) by setting ℓ_max = τ_mix and using any source distribution σ; Theorem 2 then guarantees that we can estimate a stationary probability of order δ in time Õ(τ_mix^{3/2} √(d̄ / δ)), whereas Monte Carlo has Õ(τ_mix / δ) runtime. We note that in practice we usually do not know the mixing time; in such a setting, our algorithm can be used to compute an estimate of p^ℓ_σ(t) for all values of ℓ ≤ ℓ_max. An alternative is to modify Algorithm 2 to estimate the truncated hitting probabilities p̂^ℓ_hit(t), the probability of hitting t starting from σ for the first time in exactly ℓ steps. By setting σ = e_t, we get an estimate for the expected truncated return time; now, using the fact that π(t) = 1/E_t[T_t], we can get a bound for π(t) which converges to π(t) as ℓ_max → ∞. We note also that the truncated hitting time has been shown to be useful in other applications, such as identifying similar documents on a graph. To estimate the
truncated hitting time we modify algorithm as follows at each stage eit qit max note not instead of we update rit and do not push back rit to the of in the th stage the rit set remaining algorithm remains the same it is easy to see from the discussion in section that the resulting quantity pb hit is an unbiased estimate of hitting time of we omit formal proof due to lack of space exact stationary probabilities in strong doeblin chains strong doeblin chain is obtained by mixing markov chain and distribution as follows at each transition the process proceeds according to with probability else samples state from doeblin chains are widely used in ml applications special cases include the celebrated pagerank metric variants such as hits and salsa and other algorithms for applications such as ranking and structured prediction an important property of these chains is that if we sample starting node from and sample trajectory of length geometric starting from then the terminal node is an unbiased sample from the stationary distribution there are two ways in which our algorithm can be used for this purpose one is to replace the algorithm with corresponding local update algorithm for the strong doeblin chain similar to the one in andersen et al for pagerank and then sample random trajectories of length geometric more direct technique is to choose some max estimate max and then max directly compute the stationary distribution as graph diffusions if we assign weight αi to random walks of on weighted graph the resulting scoring functions αi are known as graph diffusions and are used in variety of applications the case where αi corresponds to pagerank if instead the length is drawn according to poisson distribution αi αi then the resulting function is called the this too has several applications including finding communities clusters in large networks note max max that for any function as defined above the truncated sum obeys αi pσ max max αi thus guarantee on an estimate for the truncated sum directly translates to guarantee on the estimate for the diffusion we can use to efficiently estimate these truncated sums we perform numerical experiments on heat kernel estimation in the next section conductance testing in graphs is an essential primitive for conductance testing in large markov chains in particular in regular undirected graphs kale et al develop sublinear bidirectional estimator based on counting collisions between walks in order to identify weak nodes those which belong to sets with small conductance our algorithm can be used to extend this process to any graph including weighted and directed graphs local algorithms there is lot of interest recently on local algorithms those which perform computations given only small neighborhood of source node in this regard we note that gives natural local algorithm for mstp estimation and thus for the applications mentioned above given neighborhood around the source and target we can perform with max set to the proof of this follows from the fact that the invariant in equation holds after any sequence of operations figure estimating heat kernels bidirectional monte carlo forward push to compare runtimes we choose parameters such that the mean relative error of all algorithms is around notice that is times faster than the other algorithms experiments to demonstrate the efficiency of our algorithm on large markov chains we use heat kernel estimation cf section as an example application the heat kernel is markov chain defined as the probability of stopping at 
the target on random walk from the source where the walk length is sampled from oisson distribution in graphs value between pair of nodes has been shown to be good indicator of an underlying community relationship this suggests that it can serve as metric for personalized search on social networks for example if social network user wants to view list of users attending some event then sorting these users by heat kernel values will result in the most similar users to appearing on top is ideal for such personalized search applications as the set of users filtered by search query is typically much smaller than the set of nodes on the network in figure we compare the runtime of different algorithms for heat kernel computation on four graphs ranging from millions to billions of edges for each graph for random source target pairs we compute the heat kernel using as well as two benchmark algorithms monte carlo and the forward push algorithm as presented in all three algorithms have parameters which allow them to trade off speed and accuracy for fair comparison we choose parameters such that the empirical mean relative error each algorithm is all three algorithms were implemented in scala for the forward push algorithm our implementation follows the code linked from we set average since longer walks will mix into the stationary distribution and set the maximum length to the probability of walk being longer than this is which is negligible for reproducibility our source code is available on our website cf figure shows that across all graphs is faster than the two benchmark algorithms for example on the twitter graph it can estimate heat kernel score is seconds while the the other algorithms take more than minutes we note though that monte carlo and forward push can return scores from the source to all targets rather than just one target thus is most useful when we want the score for small set of targets acknowledgments research supported by the darpa graphs program via grant and by nsf grant peter lofgren was supported by an npsc fellowship thanks to ashish goel and other members of the social algorithms lab at stanford for many helpful discussions pokec live journal and orkut datasets are from the snap was downloaded from the laboratory for web algorithmics refer to our full version for details references oded goldreich and dana ron on testing expansion in graphs in studies in complexity and cryptography miscellanea on the interplay between randomness and computation springer peter lofgren siddhartha banerjee ashish goel and seshadhri scaling personalized pagerank estimation for large graphs in acm sigkdd reid andersen fan chung and kevin lang local graph partitioning using pagerank vectors in ieee focs reid andersen christian borgs jennifer chayes john hopcraft vahab mirrokni and teng local computation of pagerank contributions in algorithms and models for the springer kyle kloster and david gleich heat kernel based community detection in acm sigkdd satyen kale yuval peres and seshadhri noise tolerance of expanders and sublinear expander reconstruction in ieee focs rajeev motwani rina panigrahy and ying xu estimating sum by weighted sampling in automata languages and programming pages springer siddhartha banerjee and peter lofgren fast bidirectional probability estimation in markov models technical report http devdatt dubhashi and alessandro panconesi concentration of measure for the analysis of randomized algorithms cambridge university press purnamrita sarkar andrew moore and amit prakash fast 
incremental proximity search in large graphs in proceedings of the international conference on machine learning pages acm wolfgang doeblin elements une theorie generale des chaines simples constantes de markoff in annales scientifiques de ecole normale volume pages de france lawrence page sergey brin rajeev motwani and terry winograd the pagerank citation ranking bringing order to the web ronny lempel and shlomo moran the stochastic approach for analysis salsa and the tkc effect computer networks sahand negahban sewoong oh and devavrat shah iterative ranking from comparisons in advances in neural information processing systems pages jacob steinhardt and percy liang learning models for structured prediction in icml krishna athreya and stenflo perfect sampling for doeblin chains the indian journal of statistics pages fan chung the heat kernel as the pagerank of graph proceedings of the national academy of sciences christina lee asuman ozdaglar and devavrat shah computing the stationary distribution locally in advances in neural information processing systems pages lubos takac and michal zabovsky data analysis in public social networks in international scientific conf workshop present day trends of innovations alan mislove massimiliano marcon krishna gummadi peter druschel and bobby bhattacharjee measurement and analysis of online social networks in proceedings of the internet measurement conference imc san diego ca october stanford network analysis platform snap http accessed paolo boldi marco rosa massimo santini and sebastiano vigna layered label propagation multi resolution ordering for compressing social networks in acm www laboratory for web algorithmics http accessed 
probabilistic variational bounds for graphical models qiang liu computer science dartmouth college qliu john fisher iii csail mit fisher alexander ihler computer science univ of california irvine ihler abstract variational algorithms such as belief propagation can provide deterministic bounds on the partition function but are often loose and difficult to use in an fashion expending more computation for tighter bounds on the other hand monte carlo estimators such as importance sampling have excellent behavior but depend critically on the proposal distribution we propose simple monte carlo based inference method that augments convex variational bounds by adding importance sampling is we argue that convex variational methods naturally provide good is proposals that cover the target probability and reinterpret the variational optimization as designing proposal to minimize an upper bound on the variance of our is estimator this both provides an accurate estimator and enables construction of probabilistic bounds that improve quickly and directly on variational bounds and provide certificates of accuracy given enough samples relative to the error in the initial bound introduction graphical models such as bayesian networks markov random fields and deep generative models provide powerful framework for reasoning about complex dependency structures over many variables see fundamental task is to calculate the partition function or normalization constant this task is in the worst case but in many practical cases it is possible to find good deterministic or monte carlo approximations the most useful approximations should give not only accurate estimates but some form of confidence interval so that for easy problems one has certificate of accuracy while harder problems are identified as such broadly speaking approximations fall into two classes variational optimization and monte carlo sampling variational inference provides spectrum of deterministic estimates and upper and lower bounds on the partition function these include loopy belief propagation bp which is often quite accurate its convex variants such as tree reweighted bp which give upper bounds on the partition function and mean field type methods that give lower bounds unfortunately these methods often lack useful accuracy assessments although in principle pair of upper and lower bounds such as and mean field taken together give an interval containing the true solution the gap is often too large to be practically useful also improving these bounds typically means using larger regions which quickly runs into memory constraints monte carlo methods often based on some form of importance sampling is can also be used to estimate the partition function in principle is provides unbiased estimates with the potential for probabilistic bound bound which holds with some probability sampling estimates can also easily trade time for increased accuracy without using more memory unfortunately choosing the proposal distribution in is is often both crucial and difficult if poorly chosen not only is the estimator but the samples empirical variance estimate is also misleading resulting in both poor accuracy and poor confidence estimates see we propose simple algorithm that combines the advantages of variational and monte carlo methods our result is based on an observation that convex variational methods including and its generalizations naturally provide good importance sampling proposals that cover the probability of the target distribution the simplest example is 
mixture of spanning trees constructed by we show that the importance weights of this proposal are uniformly bounded by the convex upper bound itself which admits bound on the variance of the estimator and more importantly allows the use of exponential concentration inequalities such as the empirical bernstein inequality to provide explicit confidence intervals our method provides several important advantages first the upper bounds resulting from our sampling approach improve directly on the initial variational upper bound this allows our bound to start at value and be quickly and easily improved in an memory efficient way additionally using concentration bound provides certificate of accuracy which improves over time at an easily analyzed rate our upper bound is significantly better than existing probabilistic upper bounds while our corresponding lower bound is typically worse with few samples but eventually outperforms probabilistic bounds our approach also results in improved estimates of the partition function as in previous work applying importance sampling serves as bias correction to variational approximations here we interpret the variational bound optimization as equivalent to minimizing an upper bound on the is estimator variance empirically this translates into estimates that can be significantly more accurate than is using other variational proposals such as mean field or belief propagation related work importance sampling and related approaches have been widely explored in the bayesian network literature in which the partition function corresponds to the probability of observed evidence see and references therein dagum and luby derive sample size to ensure probabilistic bound with given relative accuracy however they use the normalized bayes net distribution as proposal leading to prohibitively large numbers of samples when the partition function is small and making it inapplicable to markov random fields cheng refines this result including bound on the importance weights but leaves the choice of proposal unspecified some connections between is and variational methods are also explored in yuan and druzdzel wexler and geiger gogate and dechter in which proposals are constructed based on loopy bp or mean field methods while straightforward in principle we are not aware of any prior work which uses variational upper bounds to construct proposal or more importantly analyzes their properties an alternative probabilistic upper bound can be constructed using perturb and map methods combined with recent concentration results however in our experiments the resulting bounds were quite loose although not directly related to our work there are also methods that connect variational inference with mcmc our work is orthogonal to the line of research on adaptive importance sampling which refines the proposal as more samples are drawn we focus on developing good fixed proposal based on variational ideas and leave adaptive improvement as possible future direction outline we introduce background on graphical models in section our main result is presented in section where we construct tree reweighted is proposal discuss its properties and propose our probabilistic bounds based on it we give simple extension of our method to higher order cliques based on the weighted framework in section we then show experimental comparisons in section and conclude with section background undirected probabilistic graphical models def let xp be discrete random vector taking values in xp probabilistic graphical model 
on x, in exponential family form, is
p(x) = f(x) / Z,  with  f(x) = exp( Σ_{α ∈ I} θ_α(x_α) ),
where I is a set of subsets of the variable indices and the θ_α(x_α) are functions of x_α. We denote by θ = {θ_α(x_α) : α ∈ I, x_α} the vector formed by the elements of the θ_α, called the natural parameters. Our goal is to calculate the partition function Z that normalizes the distribution; we often drop the dependence on θ and simply write Z for convenience. The factorization of p(x) can be represented by an undirected graph G = (V, E), called its Markov graph, where each vertex k ∈ V is associated with a variable x_k, and nodes k and l are connected, (kl) ∈ E, iff there exists some α ∈ I that contains both k and l; then I is a set of cliques of G. A simple special case is the pairwise model, in which
p(x) = (1/Z) exp( Σ_{k ∈ V} θ_k(x_k) + Σ_{(kl) ∈ E} θ_kl(x_k, x_l) ).

Monte Carlo estimation via importance sampling. Importance sampling (IS) is at the core of many Monte Carlo methods for estimating the partition function. The idea is to take a tractable, normalized distribution q(x), called the proposal, and estimate Z using samples {x^i} drawn i.i.d. from q:
Ẑ = (1/n) Σ_{i=1}^n w(x^i),  with  w(x) = f(x) / q(x),
where w(x) is called the importance weight. It is easy to show that Ẑ is an unbiased estimator of Z, in that E_q[Ẑ] = Z if q(x) > 0 whenever f(x) > 0, and that Ẑ has MSE (1/n) var_q(w). Unfortunately, the IS estimator often has very high variance if the proposal distribution is very different from the target, especially when the proposal is more peaked or has thinner tails than the target. In these cases there exist configurations x such that q(x) is far smaller than p(x), giving importance weights w(x) with extremely large values but very small probabilities; due to the low probability of seeing these large weights, a typical run of IS often underestimates Z in practice (that is, Ẑ ≤ Z with high probability), despite being unbiased. Similarly, the empirical variance of the weights {w(x^i)} can also severely underestimate the true variance var_q(w), and so fail to capture the true uncertainty of the estimator. For this reason, concentration inequalities that make use of the empirical variance (see below) also require that w, or its variance, be bounded. It is thus desirable to construct proposals that are similar to, and less peaked than, the target distribution. The key observation of this work is that tree reweighted BP and its generalizations provide an easy way to construct such good proposals.

Tree reweighted belief propagation. Next we describe the tree reweighted (TRW) upper bound on the partition function, restricting to pairwise models for notational ease; below we give an extension that covers both more general factor graphs and more general convex upper bounds. Let T be a set of spanning trees T = (V, E_T) of G that together cover G, that is, ∪_T E_T = E. We assign a set of nonnegative weights {ρ_T} on T such that Σ_T ρ_T = 1. Let {θ^T} be a set of natural parameters that satisfies Σ_T ρ_T θ^T = θ, where each θ^T respects the structure of T, so that θ^T_kl(x_k, x_l) = 0 for (kl) ∉ E_T. Define
p^T(x) = (1/Z^T) exp( Σ_k θ^T_k(x_k) + Σ_{(kl) ∈ E_T} θ^T_kl(x_k, x_l) );
then p^T(x) is a tree-structured graphical model with Markov graph T. Wainwright et al. use the fact that log Z(θ) is a convex function of θ to upper bound log Z by
log Z_trw := Σ_T ρ_T log Z(θ^T) ≥ log Z(θ),
via Jensen's inequality, and they find the tightest such bound via the convex optimization
log Z_trw = min_{{θ^T}, {ρ_T}} Σ_T ρ_T log Z(θ^T).
Wainwright et al. solve this optimization by a tree reweighted belief propagation (TRW-BP) algorithm, and note that the optimality condition is equivalent to enforcing a marginal consistency condition on the trees: {θ^T} is optimal if and only if there exists a set of common singleton and pairwise pseudo-marginals {b_k(x_k), b_kl(x_k, x_l)}, corresponding to the fixed point of TRW-BP, such that p^T(x_k, x_l) = b_kl(x_k, x_l) for (kl) ∈ E_T and p^T(x_k) = b_k(x_k), where p^T(x_k) and p^T(x_k, x_l) are the marginals of p^T. Thus, after running TRW-BP, we can calculate each p^T via
p^T(x) = Π_k b_k(x_k) Π_{(kl) ∈ E_T} [ b_kl(x_k, x_l) / (b_k(x_k) b_l(x_l)) ].
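To make the importance sampling estimator concrete, the sketch below is our own illustration with hypothetical function names, not the authors' code. It computes the IS estimate of Z for a generic normalized proposal q, together with the empirical Bernstein confidence interval used later in this paper; it assumes a known deterministic bound B on the importance weights, and the next section shows that the TRW mixture proposal supplies exactly such a bound, with B = Z_trw.

```python
import numpy as np

def is_estimate_with_bernstein(log_f, sample_q, log_q, n, B, delta, rng=None):
    """IS estimate of Z with an empirical Bernstein interval (illustrative sketch).

    log_f    : function x -> log f(x), the unnormalized log-density.
    sample_q : function rng -> one sample x from the proposal q.
    log_q    : function x -> log q(x), the normalized proposal log-density.
    B        : deterministic upper bound on the weights w = f/q
               (B = Z_trw for the TRW mixture proposal of the next section).
    delta    : failure probability for each side of the interval.
    Returns (Z_hat, lower, upper).
    """
    rng = np.random.default_rng() if rng is None else rng
    samples = [sample_q(rng) for _ in range(n)]
    w = np.array([np.exp(log_f(x) - log_q(x)) for x in samples])
    z_hat = w.mean()

    # Maurer-Pontil empirical Bernstein bound for variables in [0, 1],
    # applied to the rescaled weights w / B and scaled back up by B.
    v_hat = np.var(w / B, ddof=1)
    dev = B * (np.sqrt(2.0 * v_hat * np.log(2.0 / delta) / n)
               + 7.0 * np.log(2.0 / delta) / (3.0 * (n - 1)))
    return z_hat, max(z_hat - dev, 0.0), z_hat + dev
```

When the lower endpoint is negative, the sketch simply clips it at zero; as discussed below, it can instead be replaced by a deterministic or Markov-inequality lower bound.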
Because TRW provides a convex upper bound, it is often preferred in the inner loop of learning algorithms; however, it is often far less accurate than its counterpart, loopy BP. In some sense, this can be viewed as the cost of being a bound. In the next section we show that our importance sampling procedure can improve on the TRW bound to produce an estimator that significantly outperforms loopy BP; in addition, due to the nice properties of our proposal, we can use an empirical Bernstein inequality to construct a confidence interval for our estimator, turning the deterministic TRW bound into a much tighter probabilistic bound.

Tree reweighted importance sampling. We propose to use the collection of trees {p^T} and weights {ρ_T} obtained by TRW to form an importance sampling proposal,
q(x) = Σ_T ρ_T p^T(x),
which defines an estimator Ẑ = (1/n) Σ_i w(x^i), with the x^i drawn from q. Our observation is that this proposal is good due to the special convex construction of TRW. To see this, we note that the reparameterization constraint Σ_T ρ_T θ^T = θ can be rewritten as
f(x) = Z_trw Π_T p^T(x)^{ρ_T},
that is, f(x)/Z_trw is the weighted geometric mean of the p^T(x). On the other hand, by its definition, q(x) is the corresponding arithmetic mean of the p^T(x), and hence is always at least as large as the geometric mean by the AM-GM inequality, guaranteeing good coverage of the target probability. To be specific, we have q(x) ≥ f(x)/Z_trw, so q(x) is always no smaller than f(x)/Z_trw, and hence the importance weight w(x) = f(x)/q(x) is always upper bounded by Z_trw. Note that this immediately implies that q(x) > 0 whenever f(x) > 0. We summarize our result as follows.

Proposition 1. (i) If Σ_T ρ_T θ^T = θ with ρ_T ≥ 0 and Σ_T ρ_T = 1, then the importance weight w(x) = f(x)/q(x), with q(x) defined above, satisfies w(x) ≤ Z_trw; that is, the importance weights of q are always bounded by the TRW upper bound. This reinterprets the TRW optimization as finding the mixture proposal that has the smallest upper bound on its importance weights. (ii) As a result, we have max{var_q(w), var̂(w)} ≤ Z_trw² / 4, where var̂(w) is the empirical variance of the weights; this implies that E[(Ẑ − Z)²] ≤ Z_trw² / (4n).

Proof. (i) Directly apply the AM-GM inequality to the expressions for f(x) and q(x) above. (ii) Note that 0 ≤ w(x) ≤ Z_trw and E_q[w] = Z, and hence var_q(w) ≤ Z Z_trw − Z² ≤ Z_trw² / 4; the same bound holds for the empirical variance, since the weights lie in [0, Z_trw].

Note that the TRW reparameterization is key to establishing our results. Its advantage is twofold: first, it provides a simple upper bound on w(x) that holds for arbitrary x (establishing such an upper bound directly may require a difficult combinatorial optimization over x); second, it enables that bound to be optimized over, resulting in a good proposal.

Empirical Bernstein confidence bound. The upper bound on w(x) in Proposition 1 allows us to use exponential concentration inequalities to construct tight confidence bounds. Based on the empirical Bernstein inequality of Maurer and Pontil, we have:

Corollary 1 (Maurer and Pontil). Let Ẑ be the IS estimator based on the proposal q above, and define
Δ = Z_trw ( √( 2 var̂(w/Z_trw) log(2/δ) / n ) + 7 log(2/δ) / (3(n − 1)) ),
where var̂(w/Z_trw) is the empirical variance of the rescaled weights. Then Ẑ + Δ and Ẑ − Δ are upper and lower bounds of Z, each holding with probability at least 1 − δ:
Pr(Z ≤ Ẑ + Δ) ≥ 1 − δ  and  Pr(Z ≥ Ẑ − Δ) ≥ 1 − δ.

The quantity Δ is quite intuitive: the first term is proportional to the empirical standard deviation and decays at the classic n^{-1/2} rate, while the second term captures the possibility that the empirical variance is itself inaccurate; it depends on the boundedness of w and decays at rate n^{-1}. The second term typically dominates for small n, and the first term for large n.

The lower bound Ẑ − Δ may be negative; this is most common when n is small and Z_trw is much larger than Z. In this case we may replace it with any deterministic lower bound, or with δẐ, which is a probabilistic lower bound obtained from the Markov inequality (see Gogate and Dechter for more Markov-inequality-based lower bounds). However, once n is large enough, we expect Ẑ − Δ to be much tighter than bounds based on the Markov inequality, since it also leverages the boundedness and the variance of the weights. On the other hand, the Bernstein upper bound Ẑ + Δ readily gives a good upper bound, and is usually much tighter than Z_trw even with relatively small n; for
example if ztrw the trw bound is not tight our upper bound improves rapidly on ztrw at rate and passes ztrw when log for example for used in our experiments we have ztrw by meanwhile one can show that the lower bound must be if ztrw log during sampling we can roughly estimate the point at which it will become by finding such that more rigorously one can apply stopping criterion on to guarantee relative error with probability at least using the bound on roughly the expected number of samples will depend on ztrw the relative accuracy of the variational bound weighted importance sampling we have so far presented our results for tree reweighted bp on pairwise models which approximates the model using combinations of trees in this section we give an extension of our results to general higher order models and approximations based on combinations of graphs our extension is based on the weighted framework but extensions based on other higher order generalizations of trw such as globerson and jaakkola are also possible we only sketch the main idea in this section we start by rewriting the distribution using the chain rule along some order xp xk the markov lower bounds by gogate and dechter have the undesirable property that they may not become tighter with increasing and may even decrease where pa called the induced parent set of is the set of variables adjacent to xk when it is eliminated along order the largest parent size is called the induced width of along order and the computational complexity of exact variable elimination along order is exp which is intractable when is large weighted is an approximation method that avoids the exp complexity by splitting each pa into several smaller pa such that pa pa where the size of the pa is controlled by predefined number ibound so that the ibound trades off the computational complexity with approximation quality we associate each pa with nonnegative weight ρk such that ρk the weighted algorithm in liu then frames convex optimization to output an upper bound zwmb together with set of conditional distributions bk xk such that yy zwmb bk xk ρk which intuitively speaking can be treated as approximating each conditional distribution xk with geometric mean of the bk xk while we omit the details of weighted for space what is most important for our purpose is the representation similarly to with trw we define proposal distribution by replacing the geometric mean with an arithmetic mean yx ρk bk xk we can again use the inequality to obtain bound on that zwmb proposition let where and satisfy and with ρk ρk then zwmb proof use the inequality bk xk ρk ρk bk xk for each note that the formp of makes it convenient to sample by sequentially drawing each variable xk from the mixture ρk bk xk along the reverse order xp the proposal also can be viewed as mixture of large number of models with induced width controlled by ibound this can be seen by expanding the form in where ρk bk xk experiments we demonstrate our algorithm using synthetic ising models and models from recent uai inference challenges we show that our trw proposal can provide better estimates than other proposals constructed from mean field or loopy bp particularly when it underestimates the partition function in this case the proposal may be too peaked and fail to approach the true value even for extremely large sample sizes using the empirical bernstein inequality our trw proposal also provides strong probabilistic upper and lower bounds when the model is relatively easy or is large our upper and lower 
bounds are close demonstrating the estimate has high confidence mrfs on grids we illustrate our method using pairwise markov random fields on grid we start with simple ising model with θk xk σs xk and θkl xk xl σp xk xl xk where σs represents the external field and σp the correlation we fix σs and vary σp from strong negative correlation to strong positive correlation different σp lead to different inference hardness inference is easy when the correlation is either very strong large or very weak small but difficult for an intermediate range of values corresponding to phase transition log log interval is trw bernstein is trw is mf is lbp loopy bp pairwise strength σp fixed trw lbp mf is trw markov trw pairwise strength σp fixed pairwise strength σp fixed sample size fixed σp figure experiments on ising models with interaction strength σp ranging from strong negative to strong positive we first run the standard variational algorithms including loopy bp lbp tree reweighted bp trw and mean field mf we then calculate importance sampling estimators based on each of the three algorithms the trw trees are chosen by adding random spanning trees until their union covers the grid we assign uniform probability ρt to each tree the lbp proposal follows gogate constructing randomly selected tree structured proposal based on the lbp pseudoq marginals the mf proposal is qk xk where the qk xk are the mean field beliefs figure shows the result of the is estimates based on relatively small number of importance samples in this case the trw proposal outperforms both the mf and lbp proposals all the methods degrade when σp corresponding to inherently more difficult inference however the trw proposal converges to the correct values when the correlation is strong while the mf and lbp proposals underestimate the true value indicating that the mf and lbp proposals are too peaked and miss significant amount of probability mass of the target examining the deterministic estimates we note that the lbp approximation which can be shown to be lower bound on these models is also significantly worse than is with the trw proposal and slightly worse than is based on the lbp proposal the trw and mf bounds of course are far less accurate compared to either lbp or the is methods and are shown separately in figure this suggests it is often beneficial to follow the variational procedure with an importance sampling process and use the corresponding is estimators instead of the variational approximations to estimate the partition function figure compares the confidence interval of the is based on the trw proposal filled with red with the interval formed by the trw upper bound and the mf lower bound filled with green we can see that the bernstein upper bound is much tighter than the trw upper bound although at the cost of turning deterministic bound into probabilistic bound on the other hand the bernstein interval fails to report meaningful lower bound when the model is difficult σp because is small relative to the difficulty of the model as shown in figure our method eventually produces both tight upper and lower bounds as sample size increases in addition to the simple ising model we also tested grid models with normally distributed parameters θk xk and θkl xk xl figure shows the results when σs and we vary σp in this case lbp tends to overestimate the partition function and is with the lbp proposal performs quite well similarly to our trw is but with the previous example this illustrates that it is hard to know whether bp will 
result in or proposal on this model mean field is is significantly worse and is not shown in the figure log log figure shows the bernstein bound as we increase on fixed model with σp which is relatively difficult according to figure of the methods our is estimator becomes the most accurate by around samples we also show the markov lower bound as suggested by gogate it provides lower bounds for all sample sizes but does not converge to the true value even with in fact it converges to zδ is trw bernstein loopy bp is bp pairwise strength σp figure mrf with mixed interactions log log wmb is wmb markov wmb wmb is wmb markov wmb sample size sample size bn ibound bn ibound log log figure the bernstein interval on bn and bn using ibound and different sample sizes these problems are relatively easy for variational approximations we illustrate that our method gives tight bounds despite using no more memory than the original model gbp gbp is wmb is wmb sample size sample size ibound ibound figure results on harder instance at ibound and different uai instances we test the weighted wmb version of our algorithm on instances from past uai approximate inference challenges for space reasons we only report few instances for illustration bn instances figure shows two bayes net instances bn true log and bn true log these examples are very easy for loopy bp which estimates log nearly exactly but of course gives no accuracy guarantees for comparison we run our wmb is estimator using ibound cliques equal to the original factors we find that we get tight confidence intervals by around samples for comparison the method of dagum and luby using the normalized distribution as proposal would require samples proportional to approximately and respectively pedigree instances we next show results for our method on log induced width and various ibounds figure shows the results for ibound and for comparision we also evaluate gbp defined on junction graph with cliques found in the same way as wmb and complexity controlled by the same ibound again lbp and gbp generally give accurate estimates the absolute error of lbp not shown is about reducing to and at ibound and respectively the initial wmb bounds overestimate by and at ibound and and are much less accurate however our method surpasses gbp accuracy with modest number of samples for example with ibound figure our is estimator is more accurate than gbp with fewer than samples and our bernstein confidence interval passes gbp at roughly samples conclusion we propose simple approximate inference method that augments convex variational bounds by adding importance sampling our formulation allows us to frame the variational optimization as designing proposal that minimizes an upper bound on our estimator variance providing guarantees on the goodness of the resulting proposal more importantly this enables the construction of anytime probabilistic bounds that improve quickly and directly on variational bounds and provide certificates of accuracy given enough samples relative to the error in the initial bound one potential future direction is whether one can adaptively improve the proposal during sampling acknowledgement this work is supported in part by vitalite under the aro muri program award number nsf grants and and by the united states air force under contract no under the darpa ppaml program references bengtsson bickel and li revisited collapse of the particle filter in very large scale systems in probability and statistics essays in honor of david freedman pages institute 
of mathematical statistics cheng sampling algorithms for estimating the mean of bounded random variables computational statistics cheng and druzdzel an adaptive importance sampling algorithm for evidential reasoning in large bayesian networks journal of artificial intelligence research dagum and luby an optimal approximation algorithm for bayesian inference artificial intelligence dagum karp luby and ross an optimal algorithm for monte carlo estimation siam journal on computing de freitas jordan and russell variational mcmc in uai dechter and rish general scheme for bounded inference journal of the acm fung and chang weighing and integrating evidence for stochastic simulation in bayesian networks in uai globerson and jaakkola approximate inference using conditional entropy decompositions in uai pages gogate sampling algorithms for probabilistic graphical models with determinism phd thesis uc irvine gogate and dechter lower bounds for counting queries intelligenza artificiale hazan and jaakkola on the partition function and random maximum perturbations in icml koller and friedman probabilistic graphical models principles and techniques mit press lauritzen graphical models oxford university press liu monte carlo strategies in scientific computing springer science business media liu reasoning and decisions in probabilistic graphical unified framework phd thesis uc irvine liu and ihler bounding the partition function using inequality in icml mateescu kask gogate and dechter propagation algorithms jair maurer and pontil empirical bernstein bounds and penalization in colt pages mnih and audibert empirical bernstein stopping in icml oh and berger adaptive importance sampling in monte carlo integration stat comput orabona hazan sarwate and jaakkola on measure concentration of random maximum perturbations in icml papandreou and yuille random fields using discrete optimization to learn and sample from energy models in iccv ruozzi the bethe partition function of graphical models in nips salimans kingma and welling markov chain monte carlo and variational inference bridging the gap in icml shachter and peot simulation approaches to general probabilistic inference on belief networks in uai sudderth wainwright and willsky loop series and bethe variational bounds in attractive graphical models in nips pages wainwright estimating the wrong graphical model benefits in the setting jmlr wainwright and jordan graphical models exponential families and variational inference foundations and trends in machine learning wainwright jaakkola and willsky new class of upper bounds on the log partition function ieee trans information theory wexler and geiger importance sampling via variational optimization in uai yuan and druzdzel an importance sampling algorithm based on evidence in uai pages yuan and druzdzel importance sampling algorithms for bayesian networks principles and performance mathematical and computer modeling yuan and druzdzel generalized evidence importance sampling for hybrid bayesian networks in aaai volume pages yuan and druzdzel theoretical analysis and practical insights on importance sampling in bayesian networks international journal of approximate reasoning 
linear response methods for accurate covariance estimates from mean field variational bayes ryan giordano uc berkeley rgiordano tamara broderick mit tbroderick michael jordan uc berkeley jordan abstract mean ﬁeld variational bayes mfvb is popular posterior approximation method due to its fast runtime on data sets however well known major failing of mfvb is that it underestimates the uncertainty of model variables sometimes severely and provides no information about model variable covariance we generalize linear response methods from statistical physics to deliver accurate uncertainty estimates for model for individual variables and coherently across variables we call our method linear response variational bayes lrvb when the mfvb posterior approximation is in the exponential family lrvb has simple analytic form even for models indeed we make no assumptions about the form of the true posterior we demonstrate the accuracy and scalability of our method on range of models for both simulated and real data introduction with increasingly efﬁcient data collection methods scientists are interested in quickly analyzing ever larger data sets in particular the promise of these large data sets is not simply to ﬁt old models but instead to learn more nuanced patterns from data than has been possible in the past in theory the bayesian paradigm yields exactly these desiderata hierarchical modeling allows practitioners to capture complex relationships between variables of interest moreover bayesian analysis allows practitioners to quantify the uncertainty in any model to do so coherently across all of the model variables mean ﬁeld variational bayes mfvb method for approximating bayesian posterior distribution has grown in popularity due to its fast runtime on data sets but well known major failing of mfvb is that it gives underestimates of the uncertainty of model variables that can be arbitrarily bad even when approximating simple multivariate gaussian distribution also mfvb provides no information about how the uncertainties in different model variables interact by generalizing linear response methods from statistical physics to exponential family variational posteriors we develop methodology that augments mfvb to deliver accurate uncertainty estimates for model for individual variables and coherently across variables in particular as we elaborate in section when the approximating posterior in mfvb is in the exponential family mfvb deﬁnes equation in the means of the approximating posterior and our approach yields covariance estimate by perturbing this ﬁxed point we call our method linear response variational bayes lrvb we provide simple intuitive formula for calculating the linear response correction by solving linear system based on the mfvb solution section we show how the sparsity of this system for many common statistical models may be exploited for scalable computation section we demonstrate the wide applicability of lrvb by working through diverse set of models to show that the lrvb covariance estimates are nearly identical to those produced by markov chain monte carlo mcmc sampler even when mfvb variance is dramatically underestimated section finally we focus in more depth on models for ﬁnite mixtures of multivariate gaussians section which have historically been sticking point for mfvb covariance estimates we show that lrvb can give accurate covariance estimates orders of magnitude faster than mcmc section we demonstrate both theoretically and empirically that for this gaussian mixture model 
LRVB scales linearly in the number of data points and approximately cubically in the dimension of the parameter space (see the scaling analysis below).

Previous work. Linear response methods originated in the statistical physics literature. These methods have been applied to find new learning algorithms for Boltzmann machines, to obtain covariance estimates for discrete factor graphs, and to independent component analysis. One line of work states that linear response methods could be applied to general exponential family models, but works out the details only for Boltzmann machines. The work closest in spirit to ours derives general linear response corrections to variational approximations; indeed, those authors go further and formulate linear response as the first term in a functional Taylor expansion to calculate full pairwise joint marginals. However, it may not be obvious to the practitioner how to apply those general formulas. Our contributions in the present work are (1) the provision of concrete, straightforward formulas for a covariance correction that are fast and easy to compute, (2) demonstrations of the success of our method on a wide range of new models, and (3) an accompanying suite of code.

Linear response covariance estimation.

Variational inference. Suppose we observe N data points, denoted by the N-long column vector x, and denote our unobserved model parameters by θ. Here, θ is a column vector residing in some space Θ; it has J subgroups and total dimension K. Our model is specified by a distribution of the observed data given the model parameters (the likelihood) p(x|θ) and a prior distributional belief on the model parameters p(θ); Bayes' theorem yields the posterior p(θ|x).

Mean-field variational Bayes (MFVB) approximates p(θ|x) by a factorized distribution of the form q(θ) = Π_{j=1}^J q(θ_j). q is chosen so that the KL divergence KL(q||p) between q and p(θ|x) is minimized; equivalently, q is chosen so that E := E_q[log p(θ, x)] + S, for E_q[log p(θ, x)] the expected log posterior and S := −E_q[log q(θ)] the entropy of the variational distribution, is maximized:
q* = argmin_q KL(q||p) = argmin_q E_q[ log q(θ) − log p(θ|x) ] = argmax_q E.
Up to a constant in θ, the objective E is sometimes called the evidence lower bound, or the ELBO. In what follows, we further assume that our variational distribution is in the exponential family with natural parameter η and log partition function A,
log q(θ|η) = η^T θ − A(η),
expressed with respect to some base measure in θ; we assume that p(θ|x) is expressed with respect to the same base measure in θ as q. Below we will make only mild regularity assumptions about the true posterior and no assumptions about its form.

If we assume additionally that the parameters η* at the optimum are in the interior of the feasible space, then q may instead be described by the mean parameterization m := E_q[θ], with m* := E_{q*}[θ]. Thus the objective E can be expressed as a function of m, and the first-order condition for the optimality of q becomes the fixed-point equation
∂E/∂m = 0   ⟺   M(m*) = m*,  where  M(m) := m + ∂E/∂m.

Linear response. Let V denote the covariance matrix of θ under the variational distribution q*, and let Σ denote the covariance matrix of θ under the true posterior: V := cov_{q*}(θ) and Σ := cov_p(θ|x). In MFVB, V may be a poor estimator of Σ, even when the marginal estimated means match well, i.e., E_{q*}[θ] ≈ E_p[θ|x]. Our goal is to use the MFVB solution q* and linear response methods to construct an improved estimator for Σ. We will focus on the covariance of the natural sufficient statistic θ, though the covariance of functions of θ can be estimated similarly (see the appendix).

The essential idea of linear response is to perturb the first-order condition around its optimum. In particular, define the distribution p_t(θ|x) as a perturbation of the posterior,
log p_t(θ|x) = log p(θ|x) + t^T θ − C(t),
where C(t) is a constant in θ. We assume that p_t(θ|x) is a well-defined distribution for any t in an open ball around 0. Since C(t) normalizes p_t, it is in fact the cumulant-generating function of p(θ|x), so the derivatives of C(t) evaluated at t = 0 give the cumulants of p(θ|x). To see why this perturbation may be useful, recall that the second cumulant of a distribution is its covariance matrix, our desired estimand:
Σ = cov_p(θ|x) = d²C(t) / (dt dt^T) |_{t=0}.

The practical success of MFVB relies on the fact that its estimates of the mean are often good in practice, so we assume that E_{p_t}[θ] ≈ m_t, where m_t is the mean parameter characterizing q*_t, the MFVB approximation to p_t; we examine this assumption further below. Taking derivatives with respect to t on both sides of this mean approximation and setting t = 0 yields
Σ = cov_p(θ|x) = d E_{p_t}[θ] / dt^T |_{t=0} ≈ d m_t / dt^T |_{t=0} =: Σ̂,
where we call Σ̂ the linear response variational Bayes (LRVB) estimate of the posterior covariance of θ. We next show that there exists a simple formula for Σ̂. Recalling the form of the KL divergence, we have E_t = E + t^T m, and so, by the fixed-point equation, m_t = M_t(m_t) for M_t(m) := m + ∂E_t/∂m = M(m) + t. It then follows from the chain rule that
d m_t / dt^T = (∂M_t/∂m) (d m_t / dt^T) + ∂M_t/∂t = (∂M_t/∂m) (d m_t / dt^T) + I,
where I is the identity matrix. If we assume that we are at a strict local optimum, so that the Hessian of E is invertible, then evaluating at t = 0 yields
Σ̂ = d m_t / dt^T |_{t=0} = (I − ∂M/∂m)^{-1} = − (∂²E / ∂m ∂m^T)^{-1},
so the LRVB estimator is the negative inverse Hessian of the optimization objective as a function of the mean parameters. It follows that Σ̂ is both symmetric and positive definite when the variational distribution q* is at least a local maximum of E. We can further simplify this formula by using the exponential family form of the variational approximating distribution. For q in exponential family form as above, the negative entropy −S is dual to the log partition function A(η), so −S = η^T m − A(η); hence ∂S/∂m = −η and ∂²S/(∂m ∂m^T) = −∂η/∂m = −V^{-1}, where we recall that V = ∂m/∂η^T for exponential families. Writing H := ∂²E_q[log p(θ, x)] / (∂m ∂m^T), we have ∂²E/(∂m ∂m^T) = H − V^{-1}, and the LRVB estimate becomes
Σ̂ = (V^{-1} − H)^{-1} = (I − V H)^{-1} V.
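As a concrete check of the formula just derived, the following sketch (a toy construction of our own, not the authors' code) applies the LRVB correction to a bivariate Gaussian posterior. Here MFVB recovers the means exactly but underestimates the marginal variances and reports zero covariance; as noted in the next paragraph, LRVB is exact in this case. The ordering of the sufficient statistics and the explicit V and H blocks below are our own illustrative choices.

```python
import numpy as np

# LRVB check on a bivariate Gaussian posterior with precision Lambda and mean mu.
# Sufficient statistics are ordered as (theta_1, theta_1^2, theta_2, theta_2^2).

Lambda = np.array([[2.0, 1.2],
                   [1.2, 1.5]])          # posterior precision
mu = np.array([0.3, -0.7])               # posterior mean
Sigma_true = np.linalg.inv(Lambda)

# MFVB fixed point for a factorized Gaussian q: exact means, variances 1/Lambda_jj.
m = mu.copy()
v = 1.0 / np.diag(Lambda)

def moment_block(mj, vj):
    # Covariance of (theta, theta^2) under a Gaussian with mean mj and variance vj.
    return np.array([[vj, 2.0 * mj * vj],
                     [2.0 * mj * vj, 2.0 * vj**2 + 4.0 * mj**2 * vj]])

# V: covariance of the sufficient statistics under q (block diagonal across factors).
V = np.zeros((4, 4))
V[:2, :2] = moment_block(m[0], v[0])
V[2:, 2:] = moment_block(m[1], v[1])

# H: Hessian of E_q[log p] in the mean parameters.  For a Gaussian target, the
# only term that is nonlinear in the mean parameters is -Lambda_12 * m_1 * m_2.
H = np.zeros((4, 4))
H[0, 2] = H[2, 0] = -Lambda[0, 1]

Sigma_hat = np.linalg.solve(np.eye(4) - V @ H, V)   # LRVB estimate (I - VH)^{-1} V
lrvb_cov = Sigma_hat[np.ix_([0, 2], [0, 2])]        # block for (theta_1, theta_2)

print("MFVB covariance:\n", np.diag(v))
print("LRVB covariance:\n", lrvb_cov)
print("True covariance:\n", Sigma_true)
```

The printed LRVB block matches the true covariance up to numerical precision, while the MFVB variances are too small and its off-diagonal entries are zero.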
When the true posterior p(θ|x) is in the exponential family and E_q[log p(θ, x)] contains no products of the variational moment parameters, the mean field assumption is correct; in this case the LRVB and MFVB covariances coincide at the true posterior covariance. Furthermore, even when the variational assumptions fail, as long as certain mean parameters are estimated exactly, this formula is also exact for covariances. Notably, MFVB is known to provide arbitrarily bad estimates of the covariance of a multivariate normal posterior, but since MFVB estimates the means exactly in that case, LRVB estimates the covariance exactly (see the appendix).

Scaling the matrix inverse. The LRVB formula requires the inverse of a matrix as large as the parameter dimension of the posterior, which may be computationally prohibitive. Suppose we are interested in the covariance of a parameter sub-vector α, and let z denote the remaining parameters. We can partition
Σ̂ = [ Σ̂_α, Σ̂_αz ; Σ̂_zα, Σ̂_z ],
and similar partitions exist for V and H. If we assume the factorization q(α, z) = q(α) q(z), then V_αz = 0 (the variational distributions may factor further as well). Calculating the Schur complement of Σ̂ with respect to its z component, we find that
Σ̂_α = ( I_α − V_α H_α − V_α H_αz (I_z − V_z H_z)^{-1} V_z H_zα )^{-1} V_α,
where I_α and I_z refer to identity matrices of the appropriate dimensions. In cases where (I_z − V_z H_z) can be efficiently inverted (as in all the experiments below; see the figure in the appendix), this requires only a small additional inverse.

Experiments. We compare the covariance estimates from LRVB and MFVB in a range of models, including models both with and without conjugacy. We demonstrate the superiority of the LRVB estimate over MFVB in all models before focusing in more depth on Gaussian mixture models for a more detailed scalability analysis. For each model we simulate datasets with a range of parameters; in the graphs, each point represents the outcome from a single simulation, and the horizontal axis is always the result from an MCMC procedure, which we take as the ground truth. (For a comparison of the LRVB formula with the frequentist "supplemented EM" procedure, see the appendix; all of our code is available on our GitHub repository.) As discussed above, the accuracy of the LRVB covariance for a sufficient statistic depends on the
approximation ept in the models to follow we focus on regimes of moderate dependence where this is reasonable assumption for most of the parameters see section for an exception except where explicitly mentioned the mfvb means of the parameters of interest coincided well with the mcmc means so our key assumption in the lrvb derivations of section appears to hold model model first consider poisson generalized linear mixed model exhibiting we observe poisson draws yn and design vector xn for implicitly below we will everywhere condition on the xn which we consider to be ﬁxed design matrix the generative model is indep indep zn zn yn poisson yn exp zn βτ for mfvb we factorize zn inspection reveals that the optimal will be gaussian and the optimal will be gamma see appendix since the optimal zn does not take standard exponential family form we restrict further to gaussian zn there are product terms in for example the term eq eq eq zn so and the mean ﬁeld approximation does not hold we expect lrvb to improve on the mfvb covariance estimate detailed description of how to calculate the lrvb estimate can be found in appendix results we simulated datasets each with data points and randomly chosen value for and we drew the design matrix from normal distribution and held it ﬁxed throughout we set prior hyperparameters ατ and βτ to get the ground truth covariance matrix we took draws from the posterior with the mcmcglmm package which used combination of gibbs and metropolis hastings sampling our lrvb estimates used the autodifferentiation software jump results are shown in fig since is high in many of the simulations and are correlated and mfvb underestimates the standard deviation of and lrvb matches the mcmc standard deviation for all and matches for in all but the most correlated simulations when gets very high the mfvb assumption starts to bias the point estimates of and the lrvb standard deviations start to differ from mcmc even in that case however the lrvb standard deviations are much more accurate than the mfvb estimates which underestimate the uncertainty dramatically the ﬁnal plot shows that lrvb estimates the covariances of with and log reasonably well while mfvb considers them independent figure posterior mean and covariance estimates on simulation data linear random effects model next we consider simple random slope linear model with full details in appendix we observe scalars yn and rn and vector xn for implicitly below we will everywhere condition on all the xn and rn which we consider to be ﬁxed design matrices in general each random effect may appear in multiple observations and the index indicates which random effect zk affects which observation yn the full generative model is indep iid yn yn xn rn zk zk zk βτ we assume the factorization zn since this is conjugate model the optimal will be in the exponential family with no additional assumptions σβ βν results we simulated datasets of datapoints each and distinct random effects we set prior hyperparameters to αν βν ατ βτ and σβ our xn was as in section we implemented the variational solution using the autodifferentiation software jump the mcmc ﬁt was performed with using mcmcglmm intuitively when the random effect explanatory variables rn are highly correlated with the ﬁxed effects xn then the posteriors for and will also be correlated leading to violation of the mean ﬁeld assumption and an underestimated mfvb covariance in our simulation we used rn so that rn is correlated with but not the result as seen in fig is that is 
underestimated by mfvb but is not the parameter in contrast is not wellestimated by the mfvb approximation in many of the simulations since the lrvb depends on the approximation ept its lrvb covariance is not accurate either fig however lrvb still improves on the mfvb standard deviation figure posterior mean and covariance estimates on linear random effects simulation data mixture of normals model mixture models constitute some of the most popular models for mfvb application and are often used as an example of where mfvb covariance estimates may go awry thus we will consider in detail gaussian mixture model gmm consisting of mixture of multivariate normals with unknown component means covariances and weights in what follows the weight πk is the probability of the kth component µk is the mean of the kth component and λk is the precision matrix of the kth component so is the covariance parameter is the number of data points and xn is the nth observed data point we employ the standard trick of augmenting the data generating process with the latent indicator variables znk for and such that znk implies xn µk so the generative model is znk xn znk πk we used diffuse conditionally conjugate priors see appendix for details we make the variational assumption µk λk πk zn we compare the accuracy and speed of our estimates to gibbs sampling on the augmented model eq using the function rnmixgibbs from the package bayesm we implemented lrvb in making extensive use of rcppeigen we evaluate our results both on simulated data and on the mnist data set results for simulations we generated data points from multivariate normal components in dimensions mfvb is expected to underestimate the marginal variance of and log when the components overlap since that induces correlation in the posteriors due to the uncertain classiﬁcation of points between the clusters we check the covariances estimated with eq against gibbs sampler which we treat as the ground we performed simulations each of which had at least effective gibbs samples in each with the tool eﬀectivesize from the coda package the ﬁrst three plots show the diagonal standard deviations and the third plot shows the covariances note that the covariance plot excludes the mfvb estimates since most of the values are zero fig shows that the raw mfvb covariance estimates are often quite different from the gibbs sampler results while the lrvb estimates match the gibbs sampler closely for example we ﬁt gmm to the instances of handwritten and in the mnist data set we used pca to reduce the pixel intensities to dimensions full details are provided in appendix in this mnist analysis the standard deviations were by mfvb but correctly estimated by lrvb fig the other parameter standard deviations were estimated correctly by both and are not shown figure posterior mean and covariance estimates on gmm simulation and mnist data scaling experiments we here explore the computational scaling of lrvb in more depth for the ﬁnite gaussian mixture model section in the terms of section includes the sufﬁcient statistics from and and grows as kp the sufﬁcient statistics for the variational posterior of contain the vectors µk for each and the products in the covariance matrix µk µtk similarly for each the variational posterior of involves the sufﬁcient statistics in the symmetric matrix λk as well as the term log the sufﬁcient statistics for the posterior of πk are the terms log πk so minimally eq will require the inverse of matrix of size the likelihood described in section is 
symmetric under relabeling when the component locations and shapes have interpretation the researcher is generally interested in the uncertainty of and for particular labeling not the marginal uncertainty over all possible this poses problem for standard mcmc methods and we restrict our simulations to regimes where label switching did not occur in our gibbs sampler the mfvb solution conveniently avoids this problem since the mean ﬁeld assumption prevents it from representing more than one mode of the joint posterior since πk using sufﬁcient statistics involves one redundant parameter however this does not violate any of the necessary assumptions for eq and it considerably simpliﬁes the calculations note that though the perturbation argument of section requires the parameters of to be in the interior of the feasible space it does not require that the parameters of be interior kp the sufﬁcient statistics for have dimension though the number of parameters thus grows with the number of data points hz for the multivariate normal see appendix so we can apply eq to replace the inverse of an kn matrix with multiplication by the same matrix since matrix inverse is cubic in the size of the matrix the scaling for lrvb is then in in and in in our simulations fig we can see that in practice lrvb scales in and approximately cubically in across the dimensions the scaling is presumably better than the theoretical worst case of due to extra efﬁciency in the numerical linear algebra note that the vertical axis of the leftmost plot is on the log scale at all the values of and considered here lrvb was at least as fast as gibbs sampling and often orders of magnitude faster figure scaling of lrvb and gibbs on simulation data in both log and linear scales before taking logs the line in the two lefthand graphs is and in the righthand graph it is conclusion the lack of accurate covariance estimates from the widely used variational bayes mfvb methodology has been longstanding shortcoming of mfvb we have demonstrated that in sparse models our method linear response variational bayes lrvb can correct mfvb to deliver these covariance estimates in time that scales linearly with the number of data points furthermore we provide an formula for applying lrvb to wide range of inference problems our experiments on diverse set of models have demonstrated the efﬁcacy of lrvb and our detailed study of scaling of mixtures of multivariate gaussians shows that lrvb can be considerably faster than traditional mcmc methods we hope that in future work our results can be extended to more complex models including bayesian nonparametric models where mfvb has proven its practical success acknowledgments the authors thank alex blocker for helpful comments giordano and broderick were funded by berkeley fellowships the gibbs sampling time was linearly rescaled to the amount of time necessary to achieve effective samples in the component of any parameter interestingly this rescaling leads to increasing efﬁciency in the gibbs sampling at low due to improved mixing though the beneﬁts cease to accrue at moderate dimensions for numeric stability we started the optimization procedures for mfvb at the true values so the time to compute the optimum in our simulations was very fast and not representative of practice on real data the optimization time will depend on the quality of the starting point consequently the times shown for lrvb are only the times to compute the lrvb estimate the optimization times were on the same order references blei ng 
and Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research.
Blei and Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis.
Hoffman, Blei, Wang, and Paisley. Stochastic variational inference. Journal of Machine Learning Research.
MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
Bishop. Pattern Recognition and Machine Learning. Springer, New York.
Turner and Sahani. Two problems with variational expectation maximisation for time-series models. In Barber, Cemgil, and Chiappa, editors, Bayesian Time Series Models.
Wang and Titterington. Inadequacy of interval estimates corresponding to variational Bayesian approximations. In Workshop on Artificial Intelligence and Statistics.
Rue, Martino, and Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society, Series B (Statistical Methodology).
Parisi. Statistical Field Theory. New York.
Opper and Winther. Variational linear response. In Advances in Neural Information Processing Systems.
Opper and Saad. Advanced Mean Field Methods: Theory and Practice. MIT Press.
Tanaka. Information geometry of mean-field approximation. Neural Computation.
Kappen and Rodriguez. Efficient learning in Boltzmann machines using linear response theory. Neural Computation.
Welling and Teh. Linear response algorithms for approximate inference in graphical models. Neural Computation.
Winther and Hansen. Mean-field approaches to independent component analysis. Neural Computation.
Tanaka. Mean-field theory of Boltzmann machine learning. Physical Review.
Wainwright and Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning.
Hadfield. MCMC methods for generalized linear mixed models: the MCMCglmm package. Journal of Statistical Software.
Lubin and Dunning. Computing in operations research using Julia. INFORMS Journal on Computing.
Bates and Eddelbuettel. Fast and elegant numerical linear algebra using the RcppEigen package. Journal of Statistical Software.
LeCun, Bottou, Bengio, and Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE.
Plummer, Best, Cowles, and Vines. CODA: Convergence diagnosis and output analysis for MCMC. R News.
Meng and Rubin. Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. Journal of the American Statistical Association.
Wächter and Biegler. On the implementation of an interior-point filter line-search algorithm for nonlinear programming. Mathematical Programming.
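To make the linear-response covariance correction evaluated in the preceding experiments concrete, here is a minimal NumPy sketch on a toy multivariate-Gaussian target, where mean-field VB (MFVB) has a closed form. The correction formula Sigma_hat = (I - V H)^{-1} V, the choice of target precision, and the "interaction" Hessian H (cross-parameter curvature only, since each factor's own variance parameter already captures the diagonal curvature in this Gaussian case) are assumptions of this toy illustration rather than quantities taken from the text above; the paper computes the required model-specific Hessians with autodifferentiation.

```python
import numpy as np

# Toy illustration of a linear-response covariance correction on a
# multivariate Gaussian target, where mean-field VB has a closed form.
Lambda = np.array([[2.0, 1.2, 0.4],
                   [1.2, 3.0, 0.9],
                   [0.4, 0.9, 1.5]])        # target precision matrix (assumed)
Sigma_true = np.linalg.inv(Lambda)          # exact posterior covariance

# MFVB covariance: independent factors, each with variance 1/Lambda_ii, so
# cross-covariances are (wrongly) zero and marginals are too narrow.
V_mfvb = np.diag(1.0 / np.diag(Lambda))

# Cross-parameter curvature of log p; the diagonal curvature is already
# captured by each factor's own variance parameter in this toy case.
H = -(Lambda - np.diag(np.diag(Lambda)))

# Linear-response correction: Sigma_hat = (I - V H)^{-1} V.
Sigma_lrvb = np.linalg.solve(np.eye(3) - V_mfvb @ H, V_mfvb)

print("MFVB std devs :", np.sqrt(np.diag(V_mfvb)))      # underestimated
print("LRVB std devs :", np.sqrt(np.diag(Sigma_lrvb)))  # matches truth here
print("true std devs :", np.sqrt(np.diag(Sigma_true)))
print("max |LRVB - true| :", np.abs(Sigma_lrvb - Sigma_true).max())
```

In this Gaussian toy case the corrected covariance is exact, which mirrors the pattern in the figures above: the raw MFVB marginals are too narrow while the corrected estimates track the ground truth.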
Combinatorial Cascading Bandits. Branislav Kveton (Adobe Research, San Jose, CA; kveton), Zheng Wen (Yahoo Labs, Sunnyvale, CA; zhengwen), Azin Ashkan (Technicolor Research, Los Altos, CA), Csaba Szepesvári (Department of Computing Science, University of Alberta; szepesva).

Abstract. We propose combinatorial cascading bandits, a class of partial monitoring problems where at each step a learning agent chooses a tuple of ground items subject to constraints and receives a reward if and only if the weights of all chosen items are one. The weights of the items are binary, stochastic, and drawn independently of each other. The agent observes the index of the first chosen item whose weight is zero. This observation model arises in network routing, for instance, where the learning agent may only observe the first link in the routing path which is down and blocks the path. We propose an algorithm for solving our problems, CombCascade, and prove gap-dependent and gap-free upper bounds on its regret. Our proofs build on recent work in stochastic combinatorial semi-bandits, but also address two novel challenges of our setting: a non-linear reward function and partial observability. We evaluate CombCascade on two problems and show that it performs well even when our modeling assumptions are violated. We also demonstrate that our setting requires a new learning algorithm.

Introduction. Combinatorial optimization has many applications. In this work, we study a class of combinatorial optimization problems with a binary objective function that returns one if and only if the weights of all chosen items are one. The weights of the items are binary, stochastic, and drawn independently of each other. Many popular optimization problems can be formulated in our setting. Network routing is the problem of choosing a routing path in a computer network that maximizes the probability that all links in the chosen path are up. Recommendation is the problem of choosing a list of items that minimizes the probability that none of the recommended items are attractive. Both of these problems are closely related and can be solved using similar techniques, as we discuss later. Combinatorial cascading bandits are a novel framework for online learning of the aforementioned problems, where the distribution over the weights of the items is unknown. Our goal is to maximize the expected cumulative reward of the learning agent in n steps. Our learning problem is challenging for two main reasons. First, the reward function is non-linear in the weights of the chosen items. Second, we only observe the index of the first chosen item with zero weight. This kind of feedback arises frequently in network routing, for instance, where the learning agent may only observe the first link in the routing path which is down and blocks the path. This feedback model was recently proposed in cascading bandits; the main difference in our work is that the feasible set can be arbitrary, whereas the feasible set in cascading bandits is a uniform matroid. Stochastic online learning with combinatorial actions has been previously studied with semi-bandit feedback and a linear reward function, as well as its monotone transformations. Established algorithms for multi-armed bandits, such as UCB1 and Thompson sampling, can usually be easily adapted to stochastic combinatorial semi-bandits. However, it is non-trivial to show that the adapted algorithms are statistically efficient, in the sense that their regret matches some lower bound. Kveton et al. recently showed this for one such algorithm. Our analysis builds on this recent advance, but also addresses two novel challenges of our problem: a non-linear reward function and partial observability. These challenges cannot be addressed straightforwardly based on Kveton et al. We make multiple contributions. First, we
define the online learning problem of combinatorial cascading bandits and propose combcascade variant of for solving it combcascade is computationally efficient on any feasible set where linear function can be optimized efficiently improvement to the upper confidence bound which exploits the fact that the expected weights of items are bounded by one is necessary in our analysis in section we derive and upper bounds on the regret of combcascade and discuss the tightness of these bounds in section we evaluate combcascade on two practical problems and show that the algorithm performs well even when our modeling assumptions are violated we also show that can not solve some instances of our problem which highlights the need for new learning algorithm combinatorial cascading bandits this section introduces our learning problem its applications and also our proposed algorithm we discuss the computational complexity of the algorithm and then introduce the disjunctive variant of our problem we denote random variables by boldface letters the cardinality of set is and we assume that min the binary and operation is denoted by and the binary or is setting we model our online learning problem as combinatorial cascading bandit combinatorial cascading bandit is tuple where is finite set of ground items is probability distribution over binary hypercube and ak ak ai aj for any is the set of all tuples of distinct items from we refer to as the feasible set and to as feasible solution we abuse our notation and also treat as the set of items in solution without loss of generality we assume that the feasible set covers the ground set let wt be an sequence of weights drawn from distribution where wt at time the learning agent chooses solution at based on its past observations and then receives binary reward rt min wt wt as response to this choice the reward is one if and only if the weights of all items in at are one the key step in our solution and its analysis is that the reward can be expressed as rt at wt where is reward function which is defined as at the end of time the agent observes the index of the first item in at whose weight is zero and if such an item does not exist we denote this feedback by ot and define it as ot min wt atk note that ot fully determines the weights of the first min ot items in at in particular wt atk ot min ot accordingly we say that item is observed at time if atk for some min ot note that the order of items in at affects the feedback ot but not the reward rt this differentiates our problem from combinatorial the goal of our learning agent is to maximize its expected cumulative reward this is equivalent to minimizing the expected cumulative regret in steps pn at wt where at wt wt at wt is the instantaneous stochastic regret of the agent at time and arg max is the optimal solution in hindsight of knowing for simplicity of exposition we assume that as set is unique major simplifying assumption which simplifies our optimization problem and its learning is that the distribution is factored pe where pe is bernoulli distribution with mean we borrow this assumption from the work of kveton et al and it is critical to our results we would face computational difficulties without it under this assumption the expected reward of solution the probability that the weight of each item in is one can be written as and depends only on the expected weights of individual items in it follows that arg max in section we experiment with two problems that violate our independence assumption we also 
discuss implications of this violation several interesting online learning problems can be formulated as combinatorial cascading bandits consider the problem of learning routing paths in simple mail transfer protocol smtp that maximize the probability of delivery the ground set in this problem are all links in the network and the feasible set are all routing paths at time the learning agent chooses routing path at and observes if the is delivered if the is not delivered the agent observes the first link in the routing path which is down this kind of information is available in smtp the weight of item at time is an indicator of link being up at time the independence assumption in requires that all links fail independently this assumption is common in the existing network routing models we return to the problem of network routing in section combcascade algorithm our proposed algorithm combcascade is described in algorithm this algorithm belongs to the family of ucb algorithms at time combcascade operates in three stages first it computes the upper confidence bounds ucbs ut on the expected weights of all items in the ucb of item at time is defined as ut min ct tt where is the average of observed weights of item tt is the number of times that item is observed in steps and ct log is the radius of confidence interval around after steps such that ct ct holds with high probability after the ucbs are computed combcascade chooses the optimal solution with respect to these ucbs at arg max ut finally combcascade observes ot and updates its estimates of the expected weights based on the weights of the observed items in for all items atk such that ot for simplicity of exposition we assume that combcascade is initialized by one sample if is unavailable we can formulate the problem of obtaining as an optimization problem on with linear objective the initialization procedure of kveton et al tracks observed items and adaptively chooses solutions with the maximum number of unobserved items this approach is computationally efficient on any feasible set where linear function can be optimized efficiently combcascade has two attractive properties first the algorithm is computationally efficient in the sense that at arg max log ut is the problem of maximizing linear function on algorithm combcascade for combinatorial cascading bandits initialization observe for all do compute ucbs ut min ct tt solve the optimization problem and get feedback at arg max ut observe ot update statistics tt tt for all min ot do atk tt tt tt ot tt this problem can be solved efficiently for various feasible sets such as matroids matchings and paths second combcascade is sample efficient because the ucb of solution ut is product of the ucbs of all items in which are estimated separately the regret of combcascade does not depend on and is polynomial in all other quantities of interest disjunctive objective our reward model is conjuctive the reward is one if and only if the weights of all chosen items are one natural alternative is disjunctive model rt wt wt the reward is one if the weight of any item in at is one this model arises in recommender systems where the recommender is rewarded when the user is satisfied with any recommended item the feedback ot is the index of the first item in at whose weight is one as in cascading bandits let be reward function which is defined as then under the independence assumption in and arg max arg min arg min therefore can be learned by variant of combcascade where the observations are each ucb ut is 
substituted with lower confidence bound lcb on lt max ct tt wt and let at wt at wt wt be the instantaneous stochastic regret at time then we can bound the regret of combcascade as in theorems and the only difference is that min and are redefined as min analysis we prove and upper bounds on the regret of combcascade in section we discuss these bounds in section upper bounds we define the suboptimality gap of solution as and the probability that all items in are observed as pa ak for convenience we define shorthands and let be the set of suboptimal items the items that are not in then the minimum gap associated with suboptimal item is min let max be the maximum number of items in any solution and then the regret of combcascade is bounded as follows theorem the regret of combcascade is bounded as log min proof the proof is in appendix the main idea is to reduce our analysis to that of in stochastic combinatorial this reduction is challenging for two reasons first our reward function is in the weights of chosen items second we only observe some of the chosen items our analysis can be trivially reduced to by conditioning on the event of observing all items in particular let ht at ot at be the history of combcascade up to choosing solution at the first observations and actions then we can express the expected regret at time conditioned on ht as at wt ht at at ot ht and analyze our problem under the assumption that all items in at are observed this reduction is problematic because the probability pat can be low and as result we get loose regret bound we address this issue by formalizing the following insight into our problem when combcascade can distinguish from without learning the expected weights of all items in in particular combcascade acts implicitly on the prefixes of suboptimal solutions and we choose them in our analysis such that the probability of observing all items in the prefixes is close to and the gaps are close to those of the original solutions lemma let be feasible solution and bk ak be prefix of items of then can be set such that bk and pbk then we count the number of times that the prefixes can be chosen instead of when all items in the prefixes are observed the last remaining issue is that ut is in the confidence radii of the items in therefore we bound it from above based on the following lemma lemma let pk and uk then qk qk pk min pk uk pk uk this bound is tight when pk and uk the rest of our analysis is along the lines of theorem in kveton et al we can achieve linear dependency on in exchange for multiplicative factor of in our upper bound we also prove the following bound theorem the regret of combcascade is bounded as kln log proof the proof is in appendix the key idea is to decompose the regret of combcascade into two parts where the gaps at are at most and larger than we analyze each part separately and then set to get the desired result discussion in section we prove two upper bounds on the regret of combcascade theorem kl log theorem kl log where min these bounds do not depend on the total number of feasible solutions and are polynomial in any other quantity of interest the bounds match up to factors combcascade step regret regret regret step step figure the regret of combcascade and in the synthetic experiment section the results are averaged over runs the upper bounds of in stochastic combinatorial since combcascade receives less feedback than this is rather surprising and unexpected the upper bounds of kveton et al are known to be tight up to polylogarithmic 
factors we believe that our upper bounds are also tight in the setting similar to kveton et al where the expected weight of each item is close to and likely to be observed the assumption that is large is often reasonable in network routing the optimal routing path is likely to be reliable in recommender systems the optimal recommended list often does not satisfy reasonably large fraction of users experiments we evaluate combcascade in three experiments in section we compare it to algorithm for stochastic combinatorial with linear reward function this experiment shows that can not solve all instances of our problem which highlights the need for new learning algorithm it also shows the limitations of combcascade we evaluate combcascade on two problems in sections and synthetic in the first experiment we compare combcascade to on synthetic problem this problem is combinatorial cascading bandit with items and is popular algorithm for stochastic combinatorial with linear reward function we approximate by this approximation is motivated by the fact that as we update the estimates of in as in combcascade based on the weights of the observed items in we experiment with three different settings of and report our results in figure the settings of are reported in our plots we assume that wt are distributed independently except for the last plot where wt wt our plots represent three common scenarios that we encountered in our experiments in the first plot arg max arg min in this case both combcascade and can learn the regret of combcascade is slightly lower than that of in the second plot arg max arg min in this case can not learn and therefore suffers linear regret in the third plot we violate our modeling assumptions perhaps surprisingly combcascade can still learn the optimal solution although it suffers higher regret than network routing in the second experiment we evaluate combcascade on problem of network routing we experiment with six networks from the rocketfuel dataset which are described in figure our learning problem is formulated as follows the ground set are the links in the network the feasible set are all paths in the network at time we generate random pair of starting and end nodes and the learning agent chooses routing path between these nodes the goal of the agent is to maximizes the probability that all links in the path are up the feedback is the index of the first link in the path which is down the weight of link at time wt is an indicator of link being regret regret network nodes links step step figure the description of six networks from our network routing experiment section the regret of combcascade in these networks the results are averaged over runs up at time we model wt as an independent bernoulli random variable wt with mean local where local is an indicator of link being local we say that the link is local when its expected latency is at most millisecond about half of the links in our networks are local to summarize the local links are up with probability and are more reliable than the global links which are up only with probability our results are reported in figure we observe that the regret of combcascade flattens as time increases this means that combcascade learns policies in all networks diverse recommendations in our last experiment we evaluate combcascade on problem of diverse recommendations this problem is motivated by media streaming services like netflix which often recommend groups of movies such as popular on netflix and dramas we experiment with the 
movielens dataset from march the dataset contains people who assigned ratings to movies between january and march our learning problem is formulated as follows the ground set are movies from our dataset most rated animated movies random animated movies most rated movies and random movies the feasible set are all of where movies are animated the weight of item at time wt indicates that item attracts the user at time we assume that wt if and only if the user rated item in our dataset this indicates that the user watched movie at some point in time perhaps because the movie was attractive the user at time is drawn randomly from our pool of users the goal of the learning agent is to learn list of items arg max that maximizes the probability that at least one item is attractive the feedback is the index of the first attractive item in the list section we would like to point out that our modeling assumptions are violated in this experiment in particular wt are correlated across items because the users do not rate movies independently the result is that arg max it is to compute however is submodular and monotone in and therefore approximation to can be computed greedily we denote this approximation by and show it for in figure our results are reported in figure similarly to figure the regret of combcascade is concave function of time for all studied this indicates that combcascade solutions improve over time we note that the regret does not flatten as in figure the reason is that combcascade does not learn nevertheless it performs well and we expect comparably good performance in other domains where our modeling assumptions are not satisfied our current theory can not explain this behavior and we leave it for future work related work our work generalizes cascading bandits of kveton et al to arbitrary combinatorial constraints the feasible set in cascading bandits is uniform matroid any list of items out of is feasible our generalization significantly expands the applicability of the original model and we demonstrate this on two novel problems section our work also extends stochastic combinatorial with linear reward function to the cascade model of feedback similar model to cascading bandits was recently studied by combes et al regret movie title animation pulp fiction no forrest gump no independence day no shawshank redemption no toy story yes shrek yes who framed roger rabbit yes aladdin yes step figure the optimal list of movies in the diverse recommendations experiment section the regret of combcascade in this experiment the results are averaged over runs our generalization is significant for two reasons first combcascade is novel learning algorithm chooses solutions with the largest sum of the ucbs chooses items out of with the largest ucbs combcascade chooses solutions with the largest product of the ucbs all three algorithms can find the optimal solution in cascading bandits however when the feasible set is not matroid it is critical to maximize the product of the ucbs may learn suboptimal solution in this setting and we illustrate this in section second our analysis is novel the proof of theorem is different from those of theorems and in kveton et al these proofs are based on counting the number of times that each suboptimal item is chosen instead of any optimal item they can be only applied to special feasible sets such matroid because they require that the items in the feasible solutions are exchangeable we build on the recent work of kveton et al to achieve linear dependency on in 
theorem the rest of our analysis is novel our problem is partial monitoring problem where some of the chosen items may be unobserved agrawal et al and bartok et al studied partial monitoring problems and proposed learning algorithms for solving them these algorithms are impractical in our setting as an example if we formulate our problem as in bartok et al we get actions and unobserved outcomes and the learning algorithm reasons over pairs of actions and requires space lin et al also studied combinatorial partial monitoring their feedback is linear function of the weights of chosen items our feedback is function of the weights our reward function is in unknown parameters chen et al studied stochastic combinatorial with reward function which is known monotone function of an unknown linear function the feedback in chen et al is which is more informative than in our work le et al studied network optimization problem where the reward function is function of observations conclusions we propose combinatorial cascading bandits class of stochastic partial monitoring problems that can model many practical problems such as learning of routing path in an unreliable communication network that maximizes the probability of packet delivery and learning to recommend list of attractive items we propose practical algorithm for our problems combcascade and prove upper bounds on its regret we evaluate combcascade on two problems and show that it performs well even when our modeling assumptions are violated our results and analysis apply to any combinatorial action set and therefore are quite general the strongest assumption in our work is that the weights of items are distributed independently of each other this assumption is critical and hard to eliminate section nevertheless it can be easily relaxed to conditional independence given the features of items along the lines of wen et al we leave this for future work from the theoretical point of view we want to derive lower bound on the regret in combinatorial cascading bandits and show that the factor of in theorems and is intrinsic references rajeev agrawal demosthenis teneketzis and venkatachalam anantharam asymptotically efficient adaptive allocation schemes for controlled processes finite parameter space ieee transactions on automatic control shipra agrawal and navin goyal analysis of thompson sampling for the bandit problem in proceeding of the annual conference on learning theory pages peter auer nicolo and paul fischer analysis of the multiarmed bandit problem machine learning gabor bartok navid zolghadr and csaba szepesvari an adaptive algorithm for finite stochastic partial monitoring in proceedings of the international conference on machine learning wei chen yajun wang and yang yuan combinatorial bandit general framework results and applications in proceedings of the international conference on machine learning pages choi sue moon zhang konstantina papagiannaki and christophe diot analysis of packet delay in an operational network in proceedings of the annual joint conference of the ieee computer and communications societies richard combes stefan magureanu alexandre proutiere and cyrille laroche learning to rank regret lower bounds and efficient algorithms in proceedings of the acm sigmetrics international conference on measurement and modeling of computer systems yi gai bhaskar krishnamachari and rahul jain combinatorial network optimization with unknown variables bandits with linear rewards and individual observations transactions on networking 
aurelien garivier and olivier cappe the algorithm for bounded stochastic bandits and beyond in proceeding of the annual conference on learning theory pages branislav kveton csaba szepesvari zheng wen and azin ashkan cascading bandits learning to rank in the cascade model in proceedings of the international conference on machine learning branislav kveton zheng wen azin ashkan hoda eydgahi and brian eriksson matroid bandits fast combinatorial optimization with learning in proceedings of the conference on uncertainty in artificial intelligence pages branislav kveton zheng wen azin ashkan and csaba szepesvari tight regret bounds for stochastic combinatorial in proceedings of the international conference on artificial intelligence and statistics shyong lam and jon herlocker movielens dataset http thanh le csaba szepesvari and rong zheng sequential learning for wireless network monitoring with channel switching costs ieee transactions on signal processing tian lin bruno abrahao robert kleinberg john lui and wei chen combinatorial partial monitoring game with linear feedback and its applications in proceedings of the international conference on machine learning pages christos papadimitriou and kenneth steiglitz combinatorial optimization dover publications mineola ny neil spring ratul mahajan and david wetherall measuring isp topologies with rocketfuel ieee acm transactions on networking william thompson on the likelihood that one unknown probability exceeds another in view of the evidence of two samples biometrika zheng wen branislav kveton and azin ashkan efficient learning in combinatorial in proceedings of the international conference on machine learning 
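As a usage illustration of the CombCascade loop described in the preceding section, the sketch below simulates it on a small synthetic instance. The item means, the explicit list of feasible tuples, the one-sample initialization, and the brute-force argmax over the feasible set are illustrative assumptions; a practical implementation would instead call a linear-optimization oracle on the sum of log-UCBs, as discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance: ground items with unknown Bernoulli means, and a
# small explicit feasible set of tuples (standing in for, e.g., routing paths).
w_bar = np.array([0.9, 0.8, 0.75, 0.6, 0.95, 0.5])
feasible = [(0, 1), (2, 3), (4, 5), (0, 3, 4), (1, 2, 5)]

def expected_reward(A):
    return np.prod(w_bar[list(A)])

f_star = max(expected_reward(A) for A in feasible)

n = 20_000
counts = np.ones(len(w_bar))                  # one initial observation per item
means = rng.binomial(1, w_bar).astype(float)  # crude initialization (assumed)
regret = 0.0

for t in range(2, n + 2):
    # UCBs, clipped at 1 because the weights are Bernoulli.
    ucb = np.minimum(means + np.sqrt(1.5 * np.log(t) / counts), 1.0)
    # Choose the feasible tuple maximizing the product of its items' UCBs.
    A_t = max(feasible, key=lambda A: np.prod(ucb[list(A)]))
    regret += f_star - expected_reward(A_t)

    # Cascade feedback: items are observed up to and including the first zero.
    w_t = rng.binomial(1, w_bar[list(A_t)])
    first_zero = np.argmin(w_t) if 0 in w_t else len(A_t) - 1
    for k in range(first_zero + 1):
        e = A_t[k]
        counts[e] += 1
        means[e] += (w_t[k] - means[e]) / counts[e]

print("cumulative (expected) regret after", n, "steps:", round(regret, 1))
```

Because the reward and the feedback depend only on the items in the chosen tuple and the position of the first zero-weight item, the same loop applies to routing paths or recommendation lists by swapping in the appropriate feasible set and oracle.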
mixing time estimation in reversible markov chains from single sample path daniel hsu columbia university aryeh kontorovich university csaba university of alberta djhsu karyeh szepesva abstract this article provides the first procedure for computing fully interval that traps the mixing time tmix of finite reversible ergodic markov chain at prescribed confidence level the interval is computed from single sample path from the markov chain and does not require the knowledge of any parameters of the chain this stands in contrast to previous approaches which either only provide point estimates or require reset mechanism or additional prior knowledge the interval is constructed around the relaxation time trelax which is strongly related to the mixing time and the width of the interval converges to zero roughly at rate where is the length of the sample path upper and lower bounds are given on the number of samples required to achieve multiplicative accuracy the lower bounds indicate that unless further restrictions are placed on the chain no procedure can achieve this accuracy level before seeing each state at least trelax times on the average finally future directions of research are identified introduction this work tackles the challenge of constructing fully empirical bounds on the mixing time of markov chains based on single sample path let xt be an irreducible aperiodic timehomogeneous markov chain on finite state space with transition matrix under this assumption the chain converges to its unique stationary distribution πi regardless of the initial state distribution lim prq xt lim qp πi for each the mixing time tmix of the markov chain is the number of time steps required for the chain to be within fixed threshold of its stationary distribution tmix min sup max xt here πi is the probability assigned to set by and the supremum is over all possible initial distributions the problem studied in this work is the construction of confidence interval cn cn xn based only on the observed sample path xn and that succeeds with probability in trapping the value of the mixing time tmix this problem is motivated by the numerous scientificpapplications and machine learning tasks in which the quantity of interest is the mean πi for some function of the states of markov chain this is the setting of the celebrated markov chain monte carlo mcmc paradigm but the problem also arises in performance prediction involving data as is common in reinforcement learning observable bounds on mixing times are useful in the design and diagnostics of these methods they yield effective approaches to assessing the estimation quality even when priori knowledge of the mixing time or correlation structure is unavailable main results we develop the first procedure for constructing and fully empirical confidence intervals for markov mixing time consider reversible ergodic markov chain on states with absolute spectral gap and stationary distribution minorized by as is theorems and trelax ln tmix trelax ln where trelax is the relaxation time hence it suffices to estimate and our main results are summarized as follows in section we show that in some problems log observations are necessary for any procedure to guarantee constant multiplicative accuracy in estimating theorems and essentially in some problems every state may need to be visited about log times on average before an accurate estimate of the mixing time can be provided regardless of the actual estimation procedure used in section we give for and prove in theorem that it 
achieves multiplicative accuracy from single sample path of length we also provide for that requires sample path of length this establishes the feasibility of estimating the mixing time in this setting however the valid confidence intervals suggested by theorem depend on the unknown quantities and we also discuss the importance of reversibility and some possible extensions to nonreversible chains in section the construction of valid fully empirical confidence intervals for and are considered first the difficulty of the task is explained why the standard approach of turning the finite time confidence intervals of theorem into fully empirical one fails combining several results from perturbation theory in novel fashion we propose new procedure and prove that it avoids slow convergence theorem we also explain how to combine the empirical confidence intervals from algorithm with the bounds from theorem to produce valid empirical confidence intervals we prove in theorem that the width of these new intervals converge to zero asymptotically at least as fast as those from either theorem and theorem related work there is vast statistical literature on estimation in markov chains for instance it is known that under the assumptions pn on xt from above the law of large numbers guarantees that the sample mean xt converges almost surely to while the central limit theorem tells us that as the distribution of the deviation will be normal with mean zero and asymptotic variance var although these asymptotic results help us understand the limiting behavior of the sample mean over markov chain they say little about the behavior which is often needed for the prudent evaluation of method or even its algorithmic design to address this need numerous works have developed bounds on pr thus providing valuable tools for probabilistic analysis these probability bounds are larger than corresponding bounds for independent and identically distributed iid data due to the temporal dependence intuitively for the markov chain to yield fresh draw that behaves as if it was independent of xt one must wait tmix time steps note that the bounds generally depend on properties of the markov chain tmix which are often unknown priori in practice consequently much effort has been put towards estimating these unknown quantities especially in the context of mcmc diagnostics in order to provide datadependent assessments of estimation accuracy however these approaches generally only provide asymptotic guarantees and hence fall short of our goal of empirical bounds that are valid with any sample path learning with dependent data is another main motivation to our work many results from statistical learning and empirical process theory have been extended to sufficiently fast mixing dependent the notation suppresses logarithmic factors data providing learnability assurances generalization error bounds these results are often given in terms of mixing coefficients which can be consistently estimated in some cases however the convergence rates of the estimates from which are needed to derive confidence bounds are given in terms of unknown mixing coefficients when the data comes from markov chain these mixing coefficients can often be bounded in terms of mixing times and hence our main results provide way to make them fully empirical at least in the limited setting we study it is possible to eliminate many of the difficulties presented above when allowed more flexible access to the markov chain for example given sampling oracle that generates 
independent transitions from any given state akin to reset device the mixing time becomes an efficiently testable property in the sense studied in on the other hand when one only has description of the transition probabilities of markov chain over an state space there are barriers for many mcmc diagnostic problems preliminaries notations we denote the set of positive integers by and the set of the first positive integers by the part of real number is max and max we use ln for natural logarithm and log for logarithm with an arbitrary constant base boldface symbols are used for vectors and matrices and their entries are referenced by subindexing vi mi for vector kvk denotes its euclidean norm for matrix km denotes its spectral norm we use diag to denote the diagonal matrix whose entry is vi pd the probability simplex is denoted by pi and we regard vectors in as row vectors setting let be matrix for an ergodic irreducible and aperiodic markov chain this implies there is unique stationary distribution with πi for all corollary we also assume that is reversible with respect to πi pi πj pj the minimum stationary probability is denoted by πi define the matrices diag and diag diag the th entry of the matrix mi contains the doublet probabilities associated with mi πi pi is the probability of seeing state followed by state when the chain is started from its stationary distribution the matrix is symmetric on account of the reversibility of and hence it follows that is also symmetric we will strongly exploit the symmetry in our results further diag diag hence and are similar and thus their eigenvalue systems are identical ergodicity and reversibility imply that the eigenvalues of are contained in the interval and that is an eigenvalue of with multiplicity lemmas and denote and order the eigenvalues of as λd let max and define the absolute spectral gap to be which is strictly positive on account of ergodicity let xt be markov chain whose transition probabilities are governed by for each let denote the marginal distribution of xt so note that the initial distribution is arbitrary and need not be the stationary distribution the goal is to estimate and from the length sample path xt and also to construct fully empirical confidence intervals that and with high probability in particular the construction of the intervals should not depend on any unobservable quantities including and themselves as mentioned in the introduction it is that the mixing time of the markov chain tmix defined in eq is bounded in terms of and as shown in eq moreover convergence rates for empirical processes on markov chain sequences are also often given in terms of mixing coefficients that can ultimately be bounded in terms of and as we will show in the proof of our first result therefore valid confidence intervals for and can be used to make these rates fully observable point estimation in this section we present lower and upper bounds on achievable rates for estimating the spectral gap as function of the length of the sample path lower bounds the purpose of this section is to show lower bounds on the number of observations necessary to achieve fixed multiplicative or even just additive accuracy in estimating the spectral gap by eq the multiplicative accuracy lower bound for gives the same lower bound for estimating the mixing time our first result holds even for two state markov chains and shows that sequence length of is necessary to achieve even constant additive accuracy in estimating theorem pick any consider any estimator that 
takes as input random sample path of length from markov chain starting from any desired initial state distribution there exists ergodic and reversible markov chain distribution with spectral gap and minimum stationary probability such that pr next considering state chains we show that sequence of length log is required to estimate up to constant multiplicative accuracy essentially the sequence may have to visit all states at least log times each on average this holds even if is within factor of two of the largest possible value of that it can take when is nearly uniform theorem there is an absolute constant such that the following holds pick any positive integer and any consider any estimator that takes as input random sample path of length cd log from reversible markov chain starting from any desired initial state distribution there is an ergodic and reversible markov chain distribution with spectral gap and minimum stationary probability such that pr the proofs of theorems and are given in appendix based point estimator and its accuracy let us now consider the problem of estimating for this we construct natural estimator along the way we also provide an estimator for the minimum stationary probability allowing one to use the bounds from eq to trap the mixing time and random vector by define the random matrix ci xt xt furthermore define sym full version of this paper with appendices is available on arxiv to be the symmetrized version of the possibly matrix diag diag our estimator of the minimum stalet be the eigenvalues of sym tionary probability is and our estimator of the spectral gap is max these estimators have the following accuracy guarantees theorem there exists an absolute constant such that the following holds assume the estimators and described above are formed from sample path of length from an ergodic and reversible markov chain let denote the spectral gap and the minimum stationary probability for any with probability at least log log and log log log theorem implies that the sequence lengths required and to within constant estimate multiplicative factors are respectively and by eq the second of these is also bound on the required sequence length to estimate tmix and the proof of theorem is based on analyzing the convergence of the sample averages to their expectation and then using perturbation bounds for eigenvalues to derive bound on the error of however since these averages are formed using single sample path from possibly markov chain we can not use standard large deviation bounds moreover applying would result in significantly worse bounds for markov chains to each entry of sequence length requirement roughly factor of larger instead we adapt probability tail bounds for sums of independent random matrices to our setting by directly applying blocking technique of as described in the article of due to ergodicity the convergence rate can be bounded without any dependence on the initial state distribution the proof of theorem is given in appendix note that because the eigenvalues of are the same as that of the transition probability matrix we could have instead opted to estimate say using simple frequency estimates obtained from the sample path and then computing the second largest eigenvalue of this empirical estimate in fact this approach is way to extend to chains as we would no longer rely on the symmetry of or the difficulty with this approach is that lacks the structure required by certain strong eigenvalue perturbation results one could instead invoke the theorem cf 
theorem on page of which bounds the matching distance between the is expected eigenvalues of matrix and its perturbation by since kp to be of size this approach will give confidence interval for whose width shrinks at rate of exponential compared to the rate from theorem as demonstrated through an example from the dependence on the root of the norm of the perturbation can not be avoided in general our approach based on estimating symmetric matrix affords us the use of perturbation results that exploit more structure returning to the question of obtaining fully empirical confidence interval for and we notice that unfortunately theorem falls short of being directly suitable for this at least without further assumptions this is because the deviation terms themselves depend inversely both on and and hence can never rule out or an arbitrarily small positive value as possibility for or in effect the fact that the markov chain could be slow mixing and the frequency of some using theorem it is possible to trap in the union of two empirical confidence around and the other around zero both of which shrink in width as the sequence length increases algorithm empirical confidence intervals input sample path xn confidence parameter compute state visit counts and smoothed transition probability estimates ni xt ni xt ni pbi ni be the group inverse of let let be the unique stationary distribution for diag diag compute eigenvalues of sym where spectral gap estimate max empirical bounds for for τn inf and bi τn cτn cτn pi τn ni ni relative sensitivity of min max empirical bounds for πi and max πi bi max max empirical bounds for states could be small makes it difficult to be confident in the estimates provided by and this suggests that in order to obtain fully empirical confidence intervals we need an estimator that is not subject to such pursue this in section theorem thus primarily serves as point of comparison for what is achievable in terms of estimation accuracy when one does not need to provide empirical confidence bounds fully empirical confidence intervals in this section we address the shortcoming of theorem and give fully empirical confidence intervals for the stationary probabilities and the spectral gap the main idea is to use the markov property to eliminate the dependence of the confidence intervals on the unknown quantities including and specifically we estimate the transition probabilities from the sample path using simple frequency estimates as consequence of the markov property for each state the frequency estimates converge at rate that depends only on the number of visits to the state and in particular the rate given the visit count of the state is independent of the mixing time of the chain as discussed in section it is possible to form confidence interval for based on the eigenvalues of an estimated transition probability matrix by appealing to the theorem however as explained earlier this would lead to slow rate we avoid this slow rate by using an estimate of the symmetric matrix so that we can use stronger perturbation result namely weyl inequality as in the proof of theorem available for symmetric matrices to form an estimate of based on an estimate of the transition probabilities one possibility is to estimate using estimate for as was done in section and appeal to the relation diag diag to form estimate however as noted in section confidence intervals for the entries of formed this way may depend on the mixing time indeed such an estimate of does not exploit the markov property we 
adopt different strategy for estimating which leads to our construction of empirical confib using smoothed frequency estimates dence intervals detailed in algorithm we form the matrix step followed by of step then compute the group inverse of finding the unique stationary distribution of step this way decoupling the bound on the of is uniquely defined and if accuracy of from the mixing time the group inverse can defines an ergodic chain which is the case here due to the use of the smoothed estimates be computed at the cost of inverting an matrix theorem further once the unique stationary distribution of can be read out from the last row of given theorem the group inverse is also be used to compute the sensitivity of based on and we construct the estimate of and use the eigenvalues of its symmetrization to form the estimate of the spectral gap steps and in the remaining steps we use perturbation analyses and also to relate and viewing as to relate and viewing as the perturbation of both analyses give error bounds entirely in terms of observable quantities perturbation of sym tracing back to empirical error bounds for the smoothed frequency estimates of the most computationally expensive step in algorithm is the computation of the group inverse which as noted reduces to matrix inversion thus with standard implementation of matrix inversion the algorithm time complexity is while its space complexity is to state our main theorem concerning algorithm we first define to be analogous to from replaced by the group inverse of the result is as follows step with theorem suppose algorithm is given as input sample path of length from an ergodic and reversible markov chain and confidence parameter let denote the spectral gap the unique stationary distribution and the minimum stationary probability then on an event of probability at least πi for all moreover and almost surely satisfy as pi log log max πi and log log log log the proof of theorem is given in appendix as mentioned above the obstacle encountered in we establish fully observable upper and theorem is avoided by exploiting the markov property lower bounds on the entries of that converge at log log rate using standard martingale tail inequalities this justifies the validity of the bounds from step properties of the group inverse and eigenvalue perturbation theory are used to validate the empirical bounds on πi and developed in the remaining steps of the algorithm the first part of theorem provides valid empirical confidence intervals for each πi and for which are simultaneously valid at confidence level the second part of theorem shows that the the group inverse of square matrix special case of the drazin inverse is the unique matrix satisfying aa aa and aa in theorems and our use of notation is as follows for random sequence yn and nonrandom positive sequence εθ parameterized by we say yn εθ holds almost surely as if there is some universal constant such that for all lim yn holds almost surely width of the intervals decrease length increases we show in appendix that as the sequence pi log log log log and hence max πi it is easy to combine theorems and to yield intervals whose widths shrink at least as fast as both the intervals from theorem and the empirical intervals from theorem specifically determine lower bounds on and using algorithm then these lower bounds for and in the deviation bounds in eq from theorem this yields new interval centered around the estimate of from theorem and it no longer depends on unknown quantities the interval is 
valid probability confidence interval for and for sufficiently large the width shrinks at the rate given in eq we can similarly construct an empirical confidence interval for using eq which is valid on the same probability finally we can take the intersection of these new intervals with the corresponding intervals from algorithm this is summarized in the following theorem which we prove in appendix theorem the following holds under the same conditions as theorem for any and vb described above for and respectively satisfy and the confidence intervals with probability at leastr furthermore the widths of these intervals almost surely log log min where is satisfy as the width from algorithm discussion the construction used in theorem applies more generally given confidence interval of the form in in for some confidence level and fully empirical confidence set en for for the same level en in is valid fully empirical confidence interval whose asymptotic width matches that of in up to lower order terms under reasonable assumptions on en and in in particular this suggests that future work should focus on closing the gap between the lower and upper bounds on the accuracy of another interesting direction is to reduce the computation cost the current cubic cost in the number of states can be too high even when the number of states is only moderately large perhaps more important however is to extend our results to large state space markov chains in most practical applications the state space is continuous or is exponentially large in some natural parameters as follows from our lower bounds without further assumptions the problem of fully data dependent estimation of the mixing time is intractable for information theoretical reasons interesting directions for future work thus must consider markov chains with specific structure parametric classes of markov chains including but not limited to markov chains with factored transition kernels with few factors are promising candidate for such future investigations the results presented here are first step in the ambitious research agenda outlined above and we hope that they will serve as point of departure for further insights in the area of fully empirical estimation of markov chain parameters based on single sample path references liu monte carlo strategies in scientific computing springer series in statistics sutton and barto reinforcement learning an introduction adaptive computation and machine learning bradford book levin peres and wilmer markov chains and mixing times ams meyn and tweedie markov chains and stochastic stability springer kipnis and varadhan central limit theorem for additive functionals of reversible markov processes and applications to simple exclusions comm math for the interval we only lower bounds on and only where these quantities appear as and in eq it is then possible to solve for observable bounds on see appendix for details kontoyiannis and meyn exponential bounds and stopping rules for mcmc and general markov chains in valuetools page balcan beygelzimer and langford agnostic active learning in icml pages mnih cs and audibert empirical bernstein stopping in icml pages maurer and pontil empirical bernstein bounds and penalization in colt li littman walsh and strehl knows what it knows framework for learning machine learning flegal and jones implementing mcmc estimating with confidence in handbook of markov chain monte carlo pages chapman gyori and paulin confidence intervals for mcmc in practice swaminathan and joachims 
counterfactual risk minimization learning from logged bandit feedback in icml gillman chernoff bound for random walks on expander graphs siam journal on computing and perron optimal hoeffding bounds for discrete reversible markov chains annals of applied probability pages paulin concentration inequalities for markov chains by marton couplings and spectral methods electronic journal of probability garren and smith estimating the second largest eigenvalue of markov transition matrix bernoulli jones and hobert honest exploration of intractable probability distributions via markov chain monte carlo statist markov chain monte carlo confidence intervals bernoulli to appear yu rates of convergence for empirical processes of stationary mixing sequences the annals of probability january karandikar and vidyasagar rates of uniform convergence of empirical means with mixing processes statistics and probability letters gamarnik extension of the pac framework to finite and countable markov chains ieee transactions on information theory mohri and rostamizadeh stability bounds for processes in nips mohri and rostamizadeh rademacher complexity bounds for processes in nips steinwart and christmann fast learning from observations in nips steinwart hush and scovel learning from dependent observations journal of multivariate analysis mcdonald shalizi and schervish estimating coefficients in aistats pages batu fortnow rubinfeld smith and white testing that distributions are close in focs pages ieee batu fortnow rubinfeld smith and white testing closeness of discrete distributions journal of the acm jacm bhatnagar bogdanov and mossel the computational complexity of estimating mcmc convergence time in random pages springer hsu kontorovich and mixing time estimation in reversible markov chains from single sample path corr tropp an introduction to matrix concentration inequalities foundations and trends in machine learning bernstein sur extension du theoreme limite du calcul des probabilites aux sommes de quantites dependantes mathematische annalen stewart and sun matrix perturbation theory academic press boston meyer jr the role of the group generalized inverse in the theory of finite markov chains siam review cho and meyer comparison of perturbation bounds for the stationary distribution of markov chain linear algebra and its applications 
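To make the plug-in portion of the estimation procedure described above concrete, the following is a minimal numpy sketch, not the algorithm verbatim: it forms the smoothed transition-frequency matrix, recovers the stationary distribution of the smoothed chain, and reads a spectral-gap estimate off the symmetrized, similarity-transformed matrix. The function name, the smoothing parameter alpha, and the use of a direct eigendecomposition in place of the group-inverse computation are illustrative assumptions; the empirical confidence intervals, which are the main contribution of the algorithm, are omitted.

```python
import numpy as np

def estimate_pi_and_gap(path, d, alpha=1.0):
    # Smoothed transition counts; smoothing keeps the estimated chain ergodic,
    # so its stationary distribution (and group inverse) is well defined.
    counts = np.full((d, d), alpha)
    for s, t in zip(path[:-1], path[1:]):
        counts[s, t] += 1.0
    P_hat = counts / counts.sum(axis=1, keepdims=True)

    # Stationary distribution of the smoothed chain: leading left eigenvector
    # (the group-inverse route described above yields the same vector).
    evals, evecs = np.linalg.eig(P_hat.T)
    pi_hat = np.real(evecs[:, np.argmax(np.real(evals))])
    pi_hat = np.abs(pi_hat) / np.abs(pi_hat).sum()

    # L = D^{1/2} P D^{-1/2}; for a reversible chain L is symmetric and its
    # second-largest eigenvalue gives the spectral gap 1 - lambda_2.
    d_sqrt = np.sqrt(pi_hat)
    L_hat = (d_sqrt[:, None] * P_hat) / d_sqrt[None, :]
    sym_eigs = np.sort(np.linalg.eigvalsh(0.5 * (L_hat + L_hat.T)))[::-1]
    return pi_hat, 1.0 - sym_eigs[1]
```

For a path of length n over d states, the cost of this sketch is dominated by the dense eigendecompositions, i.e. cubic time and quadratic space in d, consistent with the complexity noted for the full algorithm.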
policy gradient for coherent risk measures yinlam chow stanford university ychow aviv tamar uc berkeley avivt mohammad ghavamzadeh adobe research inria shie mannor technion shie abstract several authors have recently developed policy gradient methods that augment the standard expected cost minimization problem with measure of variability in cost these studies have focused on specific such as the variance or conditional value at risk cvar in this work we extend the policy gradient method to the whole class of coherent risk measures which is widely accepted in finance and operations research among other fields we consider both static and dynamic risk measures for static risk measures our approach is in the spirit of policy gradient algorithms and combines standard sampling approach with convex programming for dynamic risk measures our approach is style and involves explicit approximation of value function most importantly our contribution presents unified approach to reinforcement learning that generalizes and extends previous results introduction optimization considers problems in which the objective involves risk measure of the random cost in contrast to the typical expected cost objective such problems are important when the wishes to manage the variability of the cost in addition to its expected outcome and are standard in various applications of finance and operations research in reinforcement learning rl objectives have gained popularity as means to regularize the variability of the total discounted in markov decision process mdp many risk objectives have been investigated in the literature and applied to rl such as the celebrated markowitz model var and conditional value at risk cvar the view taken in this paper is that the preference of one risk measure over another is and depends on factors such as the cost distribution sensitivity to rare events ease of estimation from data and computational tractability of the optimization problem however the highly influential paper of artzner et al identified set of natural properties that are desirable for risk measure to satisfy risk measures that satisfy these properties are termed coherent and have obtained widespread acceptance in financial applications among others we focus on such coherent measures of risk in this work for sequential decision problems such as mdps another desirable property of risk measure is time consistency risk measure satisfies dynamic programming style property if strategy is for an problem then the component of the policy from the time until the end where is also see principle of optimality in the recently proposed class of dynamic markov coherent risk measures satisfies both the coherence and time consistency properties in this work we present policy gradient algorithms for rl with coherent risk objective our approach applies to the whole class of coherent risk measures thereby generalizing and unifying previous approaches that have focused on individual risk measures we consider both static coherent risk of the total discounted return from an mdp and dynamic markov coherent risk our main contribution is formulating the under the framework more specifically we provide new formula for the gradient of static coherent risk that is convenient for approximation using sampling an algorithm for the gradient of general static coherent risk that involves sampling with convex programming and corresponding consistency result new policy gradient theorem for markov coherent risk relating the gradient to suitable value function 
and corresponding algorithm several previous results are special cases of the results presented here our approach allows to rederive them in greater generality and simplicity related work optimization in rl for specific risk functions has been studied recently by several authors studied exponential utility functions studied meanvariance models studied cvar in the static setting and studied dynamic coherent risk for systems with linear dynamics our paper presents general method for the whole class of coherent risk measures both static and dynamic and is not limited to specific choice within that class nor to particular system dynamics reference showed that an mdp with dynamic coherent risk objective is essentially robust mdp the planning for large scale mdps was considered in using an approximation of the value function for many problems approximation in the policy space is more suitable see our approach is suitable for approximations both in the policy and value function and to large or continuous mdps we do however make use of technique of in part of our method optimization of coherent risk measures was thoroughly investigated by ruszczynski and shapiro see also for the stochastic programming case in which the policy parameters do not affect the distribution of the stochastic system the mdp trajectory but only the reward function and thus this approach is not suitable for most rl problems for the case of mdps and dynamic risk proposed dynamic programming approach this approach does not to large mdps due to the curse of dimensionality for further motivation of policy gradient methods we refer the reader to preliminaries consider probability space pθ where is the set of outcomes sample space is over representing the set of events we are interested in and pθ where is the set of probability distributions is probability measure over parameterized by some tunable parameter rk in the following we suppress the notation of in quantities to ease the technical exposition in this paper we restrict our attention to finite probability spaces has finite number of elements our results can be extended to the lp spaces without loss of generality but the details are omitted for brevity denote by the space of random variables defined over the probability space pθ in this paper random variable is interpreted as cost the smaller the realization of the better for we denote by the partial order for all we denote by eξ pθ expectation of an mdp is tuple where and are the state and action spaces cmax is bounded deterministic and cost is the transition probability distribution is discount factor and is the initial actions are chosen according to stationary policy µθ we denote by xt at trajectory of length drawn by following the policy µθ in the mdp our results may easily be extended to random costs dependent costs and random initial states for markov coherent risk the class of optimal policies is stationary markov while this is not necessarily true for static risk our results can be extended to policies or stationary markov coherent risk measures risk measure is function that maps an uncertain outcome to the extended line the expectation or the conditional cvar risk measure is called coherent if it satisfies the following conditions for all convexity λz λρ monotonicity if then translation invariance positive homogeneity if then λz λρ intuitively these condition ensure the rationality of risk assessments ensures that diversifying an investment will reduce its risk guarantees that an asset with higher cost for every 
possible scenario is indeed riskier also known as cash invariance means that the deterministic part of an investment portfolio does not contribute to its risk the intuition behind is that doubling position in an asset doubles its risk we further refer the reader to for more detailed motivation of coherent risk the following representation theorem shows an important property of coherent risk measures that is fundamental to our approach theorem risk measure is coherent if and only if there exists convex bounded and closed set such max ξpθ pθ eξ the result essentially states that any coherent risk measure is an expectation density function ξpθ of pθ by chosen adversarially from suitable set of test density functions pθ referred to as risk envelope moreover coherent risk measure is uniquely represented by its risk envelope in the sequel we shall interchangeably refer to coherent risk measures either by their explicit functional representation or by their corresponding in this paper we assume that the risk envelope pθ is given in canonical convex programming formulation and satisfies the following conditions assumption the general form of risk envelope for each given policy parameter rk the risk envelope of coherent risk measure can be written as pθ ξpθ ge pθ fi pθ pθ where each constraint ge pθ is an affine function in each constraint fi pθ is convex function in and there exists strictly feasible point and here denote the sets of equality and inequality constraints respectively furthermore for any given fi and ge are twice differentiable in and there exists such that dfi max dge max max dp dp assumption implies that the risk envelope pθ is known in an explicit form from theorem of in the case of finite probability space is coherent risk if and only if pθ is convex and compact set this justifies the affine assumption of ge and the convex assumption of fi moreover the additional assumption on the smoothness of the constraints holds for many popular coherent risk measures such as the cvar the and spectral risk measures dynamic risk measures the risk measures defined above do not take into account any temporal structure that the random variable might have such as when it is associated with the return of trajectory in the case of mdps in this sense such risk measures are called static dynamic risk measures on the other hand policies on state space augmented with accumulated cost the latter has shown to be sufficient for optimizing the cvar risk when we study risk in mdps the risk envelope pθ in eq also depends on the state explicitly take into account the temporal nature of the stochastic outcome primary motivation for considering such measures is the issue of time consistency usually defined as follows if certain outcome is considered less risky in all states of the world at stage then it should also be considered less risky at stage example in shows the importance of time consistency in the evaluation of risk in dynamic setting it illustrates that for optimizing static measure can lead to behavior similar paradoxical results could be obtained with other risk metrics we refer the readers to and for further insights markov coherent risk measures markov risk measures were introduced in and constitute useful class of dynamic risk measures that are important to our study of risk in mdps for horizon and mdp the markov coherent risk measure ρt is ρt γρ γρ xt γρ xt where is static coherent risk measure that satisfies assumption and xt is trajectory drawn from the mdp under policy µθ it is important to 
note that in each static coherent risk at state is induced by the transition probability pθ µθ we also define limt ρt which is since and the cost is bounded we further assume that in is markov risk measure the evaluation of each static coherent risk measure is not allowed to depend on the whole past problem formulation in this paper we are interested in solving two optimization problems given random variable and static coherent risk measure as defined in section the static risk problem srp is given by min for example in an rl setting may correspond to the cumulative discounted cost γc xt of trajectory induced by an mdp with policy parameterized by for an mdp and dynamic markov coherent risk measure ρt as defined by eq the dynamic risk problem drp is given by min except for very limited cases there is no reason to hope that neither the srp in nor the drp in should be tractable problems since the dependence of the risk measure on may be complex and in this work we aim towards more modest goal and search for locally optimal thus the main problem that we are trying to solve in this paper is how to calculate the gradients of the srp and drp objective functions and we are interested in cases in which the gradients can not be calculated analytically in the static case this would correspond to dependence of on for dynamic risk we also consider cases where the state space is too large for tractable computation our approach for dealing with such difficult cases is through sampling we assume that in the static case we may obtain samples of the random variable for the dynamic case we assume that for each state and action of the mdp we may obtain samples of the next state we show that sampling may indeed be used in both cases to devise suitable estimators for the gradients to finally solve the srp and drp problems gradient estimate may be plugged into standard stochastic gradient descent sgd algorithm for learning locally optimal solution to and from the structure of the dynamic risk in eq one may think that gradient estimator for may help us to estimate the gradient indeed we follow this idea and begin with estimating the gradient in the static risk case gradient formula for static risk in this section we consider static coherent risk measure and propose estimators for we make the following assumption on the policy parametrization which is standard in the policy gradient literature assumption the likelihood ratio log is and bounded for all moreover our approach implicitly assumes that given some log may be easily calculated this is also standard requirement for policy gradient algorithms and is satisfied in various applications such as queueing systems inventory management and financial engineering see the survey by fu using theorem and assumption for each we have that is the solution to the convex optimization problem for that value of the lagrangian function of denoted by lθ λp λe λi may be written as lθ pθ pθ λe ge pθ fi pθ the convexity of and its strict feasibility due to assumption implies that lθ λp λe λi has set of saddle points the next theorem presents formula for the gradient as we shall subsequently show this formula is particularly convenient for devising sampling based estimators for theorem let assumptions and hold for any saddle point λθ λθ of we have log λθ fi pθ ge ξθ pθ the proof of this theorem given in the supplementary material involves an application of the envelope theorem and standard trick we now demonstrate the utility of theorem with several examples in which we show that 
it generalizes previously known results and also enables deriving new useful gradient formulas example cvar the cvar at level of random variable denoted by ρcvar is very popular coherent risk measure defined as ρcvar inf when is continuous ρcvar is to be the mean of the distribution of qα where qα is of thus selecting small makes cvar particularly sensitive to rare but very high costs the risk envelope for cvar is known to be ξpθ furthermore show that the saddle points of satisfy when and when where is any quantile of plugging this result into theorem we can easily show that ρcvar log qα qα this formula was recently proved in for the case of continuous distributions by an explicit calculation of the conditional expectation and under several additional smoothness assumptions here we show that it holds regardless of these assumptions and in the discrete case as well our proof is also considerably simpler example the of random variable is defined as sd the captures the variation of the cost only above its mean and is an appealing alternative to the standard deviation which does not distinguish between the variability of upside and downside deviations for some the risk measure is defined as ρmsd αsd and is coherent risk measure we have the following result proposition under assumption with log we have αe log ρmsd sd this proposition can be used to devise sampling based estimator for ρmsd by replacing all the expectations with sample averages the algorithm along with the proof of the proposition are in the supplementary material in section we provide numerical illustration of optimization with objective general gradient estimation algorithm in the two previous examples we obtained gradient formula by analytically calculating the lagrangian saddle point and plugging it into the formula of theorem we now consider general coherent risk for which in contrast to the cvar and cases the lagrangian is not known analytically we only assume that we know the structure of the as given by we show that in this case may be estimated using sample average approximation saa of the formula in theorem assume that we are given samples ωi pθ and let pθ ωi denote the corresponding empirical distribution also let the sample risk enn velope pθ be defined according to eq with pθ replaced by pθ consider the following saa version of the optimization in eq ρn max pθ ωi ωi ωi ξpθ pθ note that defines convex optimization problem with variables and constraints in the following we assume that solution to may be computed efficiently using standard denote solution to and vex programming tools such as interior point methods let ξθ λθ λθ λθ denote the corresponding kkt multipliers which can be obtained from the convex programming algorithm we propose the following estimator for the on theorem pθ ωi ξθ ωi log ωi ωi ge ξθ pθ fi ξθ pθ thus our gradient estimation algorithm is procedure involving both sampling and convex programming in the following we show that under some conditions on the set pθ is consistent estimator of the proof has been reported in the supplementary material proposition let assumptions and hold suppose there exists compact set cξ such that the set of lagrangian saddle points is and bounded ii the functions fe pθ for all and fi pθ for all are and continuous in on cξ iii for large enough the set sn is and sn further assume that iv if ξn pθ pθ and ξn converges to point then ξpθ pθ we then have that limn ρn and limn the set of assumptions for proposition is large but rather mild note that is implied by the slater 
condition of assumption for satisfying iii we need that the risk be for every empirical distribution which is natural requirement since pθ always converges to pθ uniformly on iv essentially requires smoothness of the constraints we remark that in particular constraints to iv are satisfied for the popular cvar and spectral risk it is interesting to compare the performance of the saa estimator with the based estimator as in sections and in the supplementary material we report an empirical comparison between the two approaches for the case of cvar risk which showed that the two approaches performed very similarly this is since in general both saa based estimators obey variance bound of order to summarize this section we have seen that by exploiting the special structure of coherent risk measures in theorem and by the style result of theorem we are able to derive style algorithms for estimating the policy gradient of coherent static risk measures the gradient estimation algorithms developed here for static risk measures will be used as in our subsequent treatment of dynamic risk measures gradient formula for dynamic risk in this section we derive new formula for the gradient of the markov coherent dynamic risk measure our approach is based on combining the static gradient formula of theorem with decomposition of the for an mdp under the policy is defined as vθ where with slight abuse of notation denotes the markovcoherent dynamic risk in when the initial state is it is shown in that due to the structure of the markov dynamic risk the value function is the unique solution to the bellman equation vθ max eξ vθ ξpθ pθ where the expectation is taken over the next state transition note that by definition we have vθ and thus vθ we now develop formula for vθ this formula extends the policy gradient theorem developed for the expected return to dynamic risk measures we make standard assumption analogous to assumption of the static case assumption the likelihood ratio log µθ is and bounded for all and for each state let ξθ λθ λθ denote saddle point of corresponding to the state with pθ replacing pθ in and vθ replacing the next theorem presents formula for vθ the proof is in the supplementary material theorem under assumptions and we have log µθ at hθ xt at where denotes the expectation trajectories generated by the markov chain with transition probabilities pθ ξθ and the cost function hθ is defined as hθ ξθ γvθ dfi ξθ dge ξθ λθ λθ dp dp theorem may be used to develop an style algorithm for solving the drp problem composed of two interleaved procedures critic for given policy calculate the value function vθ and actor using the critic vθ and theorem estimate and update space limitation restricts us from specifying the full details of our algorithm and its analysis in the following we highlight only the key ideas and results for the full details we refer the reader to the full paper version provided in the supplementary material for the critic the main challenge is calculating the value function when the state space is large and dynamic programming can not be applied due to the curse of dimensionality to overcome this we exploit the fact that vθ is equivalent to the value function in robust mdp and modify recent algorithm in to estimate it using function approximation for the actor the main challenge is that in order to estimate the gradient using thm we need to sample from an mdp with transitions also hθ involves an expectation for each and therefore we propose sampling procedure to estimate in which we 
first use the critic estimate of vθ to derive and sample trajectory from an mdp with transitions for each state in the trajectory we then sample several next states to estimate hθ the convergence analysis of the algorithm and the gradient error incurred from function approximation of vθ are reported in the supplementary material we remark that our algorithm requires simulator for sampling multiple from each state extending our approach to work with single trajectory is an interesting direction for future research numerical illustration in this section we illustrate our approach with numerical example the purpose of this illustration is to emphasize the importance of flexibility in designing risk criteria for selecting an appropriate such that suits both the user risk preference and the properties we consider trading agent that can invest in one of three assets see figure for their distributions the returns of the first two assets and are normally distributed and figure numerical illustration selection between assets probability density of asset return bar plots of the probability of selecting each asset training iterations for policies and respectively at each iteration samples were used for gradient estimation with the return of the third asset has pareto distribution the mean of the return from is and its variance is infinite such distributions are widely used in financial modeling the agent selects an action randomly with probability ai exp θi where is the policy parameter we trained three different policies and policy is maxθ and it was trained using standard policy gradient policy is and had objective maxθ sd and was trained using the algorithm in section policy is also with objective as proposed in maxθ var and was trained using the algorithm of for each of these policies figure shows the probability of selecting each asset training iterations although has the highest mean return the policy chooses since it has lower downside as expected however because of the heavy of policy opted to choose instead this is as rational investor should not avert high returns in fact in this case stochastically dominates conclusion we presented algorithms for estimating the gradient of both static and dynamic coherent risk measures using two new policy gradient style formulas that combine sampling with convex programming thereby our approach extends rl to the whole class of coherent risk measures and generalizes several recent studies that focused on specific risk measures on the technical side an important future direction is to improve the convergence rate of gradient estimates using importance sampling methods this is especially important for risk criteria that are sensitive to rare events such as the cvar from more conceptual point of view the framework explored in this work provides the decision maker with flexibility in designing risk preference as our numerical example shows such flexibility is important for selecting appropriate risk measures for managing the cost variability however we believe that our approach has much more potential than that in almost every application uncertainty emanates from stochastic dynamics but also and perhaps more importantly from modeling errors model uncertainty prudent policy should protect against both types of uncertainties the representation duality of theorem naturally relates the risk to model uncertainty in similar connection was made between in mdps and dynamic markov coherent risk we believe that by carefully shaping the the decision maker may be able to 
take uncertainty into account in broad sense designing principled procedure for such is not trivial and is beyond the scope of this paper however we believe that there is much potential to risk shaping as it may be the key for handling model misspecification in dynamic decision making acknowledgments the research leading to these results has received funding from the european research council under the european unions seventh framework program erc grant agreement yinlam chow is partially supported by croucher foundation doctoral scholarship references acerbi spectral measures of risk coherent representation of subjective risk aversion journal of banking finance artzner delbaen eber and heath coherent measures of risk mathematical finance bardou frikha and computing var and cvar using stochastic approximation and adaptive unconstrained importance sampling monte carlo methods and applications and ott markov decision processes with criteria mathematical methods of operations research bertsekas dynamic programming and optimal control athena scientific edition borkar sensitivity formula for cost and the algorithm systems control letters boyd and vandenberghe convex optimization cambridge university press chow and ghavamzadeh algorithms for cvar optimization in mdps in nips chow and pavone unifying framework for model predictive control theory and algorithms in american control conference delage and mannor percentile optimization for markov decision processes with parameter uncertainty operations research fu gradient estimation in simulation volume of handbooks in operations research and management science pages elsevier hadar and russell rules for ordering uncertain prospects the american economic review pages iancu petrik and subramanian tight approximations of dynamic risk measures konda and tsitsiklis algorithms in nips marbach and tsitsiklis optimization of markov reward processes ieee transactions on automatic control markowitz portfolio selection efficient diversification of investment john wiley and sons milgrom and segal envelope theorems for arbitrary choice sets econometrica moody and saffell learning to trade via direct reinforcement neural networks ieee transactions on osogami robustness and in markov decision processes in nips petrik and subramanian an approximate solution method for large markov decision processes in uai prashanth and ghavamzadeh algorithms for mdps in nips rachev and mittnik stable paretian models in finance john willey sons new york rockafellar and uryasev optimization of conditional journal of risk dynamic programming for markov decision processes mathematical programming and shapiro optimization of convex risk functions math or shapiro dentcheva and lectures on stochastic programming chapter pages siam sutton and barto reinforcement learning an introduction cambridge univ press sutton mcallester singh and mansour policy gradient methods for reinforcement learning with function approximation in nips tamar di castro and mannor policy gradients with variance related risk criteria in international conference on machine learning tamar glassner and mannor optimizing the cvar via sampling in aaai tamar mannor and xu scaling up robust mdps using function approximation in international conference on machine learning 
fast rates for empirical risk minimization kfir levy technion haifa israel kfiryl tomer koren technion haifa israel tomerk abstract we consider empirical risk minimization erm in the context of stochastic optimization with and smooth general optimization framework that captures several important learning problems including linear and logistic regression learning svms with the squared portfolio selection and more in this setting we establish the first evidence that erm is able to attain fast generalization rates and show that the expected loss of the erm solution in dimensions converges to the optimal expected loss in rate of this rate matches existing lower bounds up to constants and improves by log factor upon the which is only known to be attained by an conversion of computationally expensive online algorithms introduction statistical learning and stochastic optimization with loss functions captures several fundamental problems in statistical machine learning which include linear regression logistic regression learning machines svms with the squared hinge loss and portfolio selection amongst others functions constitute rich class of convex functions which is substantially richer than its more familiar subclass of strongly convex functions similarly to their counterparts it is that loss functions are amenable to fast generalization rates specifically standard conversion of either the online newton step algorithm or exponential weighting schemes in dimensions gives rise to convergence rate of as opposed to the standard rate of generic lipschitz stochastic convex optimization unfortunately the latter online methods are highly inefficient the runtime complexity of the online newton step algorithm scales as with the dimension of the problem even in very simple optimization scenarios an alternative and learning paradigm is that of empirical risk minimization erm which is often regarded as the strategy of choice due to its generality and its statistical efficiency in this scheme sample of training instances is drawn from the underlying data distribution and the minimizer of the sample average or the regularized sample average is computed as opposed to methods based on conversions the erm approach enables the use of any optimization procedure of choice and does not restrict one to use specific online algorithm furthermore the erm solution often enjoys several generalization bounds in conjunction and thus is able to obliviously adapt to the properties of the underlying data distribution in the context of functions however nothing is known about the generalization abilities of erm besides the standard convergence rate that applies to any convex losses surprisingly it appears that even in the specific and case of linear regression with the squared loss the state of affairs remains unsettled this important case was recently addressed by shamir who proved lower bound on the convergence rate of any algorithm and conjectured that the rate of an erm approach should match this lower bound in this paper we explore the convergence rate of erm for stochastic optimization we show that when the loss functions are also smooth erm approach yields convergence rate of which matches the lower bound of shamir up to constants in fact our result shows for erm generalization rate tighter than the obtained by the online newton step algorithm improving upon the latter by log factor even in the specific case of linear regression with the squared loss our result improves by log factor upon the best known fast rates 
provided by the algorithm our results open an avenue for potential improvements to the runtime complexity of stochastic optimization by permitting the use of accelerated methods for regularized loss minimization the latter has been the topic of an extensive research effort in recent years and numerous methods have been developed see johnson and zhang shalevshwartz and zhang and the references therein on the technical side our convergence analysis relies on stability arguments introduced by bousquet and elisseeff we prove that the expected loss of the regularized erm solution does not change significantly when single instance picked uniformly at random from the training sample is discarded then the technique of bousquet and elisseeff allows us to translate this average stability property into generalization guarantee we remark that in all previous stability analyses that we are aware of stability was shown to hold uniformly over all discarded training intances either with probability one or in expectation in contrast in the case of functions it is crucial to look at the average stability in order to bound the average stability of erm we make use of localized notion of strong convexity defined with respect to local norm at certain point in the optimization domain roughly speaking we show that when looking at the right norm which is determined by the local properties of the empirical risk at the right point the minimizer of the empirical risk becomes stable this part of our analysis is inspired by recent analysis techniques of online learning algorithms that use local norms to study the regret performance of online linear optimization algorithms related work the study of loss functions was initiated in the online learning community by kivinen and warmuth who considered the problem of prediction with expert advice with losses later hazan et al considered more general framework that allows for continuous decision set and proposed the online newton step ons algorithm that attains regret bound that grows logarithmically with the number of optimization rounds mahdavi et al considered the ons algorithm in the statistical setting and showed how it can be used to establish generalization bounds that hold with high probability while still keeping the fast rate fast convergence rates in stochastic optimization are known to be achievable under various conditions bousquet and elisseeff and et al have shown via uniform stability argument that erm guarantees convergence rate of for strongly convex functions sridharan et al proved similar result albeit using the notion of localized rademacher complexity for the case of smooth and losses srebro et al established rate in conditions when the expected loss of the best hypothesis is of order for further discussion of fast rates in stochastic optimization and learning see and the references therein setup and main results we consider the problem of minimizing stochastic objective over closed and convex domain in euclidean space here the expectation is taken with respect to random variable distributed according to an unknown distribution over parameter space given budget of samples zn of the random variable we are required to produce an estimate whose expected excess loss defined by is small here the expectation is with respect the randomization of the training set zn used to produce we make the following assumptions over the loss function first we assume that for any fixed parameter the function is over the domain for some namely that the function exp is concave 
over we will also assume that is over with respect to euclidean norm which means that its gradient is with respect to the same norm βkw in particular this property implies that is differentiable for simplicity and without loss of generality we assume finally we assume that is bounded over in the sense that for all for some in this paper we analyze regularized empirical risk minimization erm procedure for optimizing the stochastic objective in eq that based on the sample zn computes arg min fb where fb zi the function serves as regularizer which is assumed to be with respect to the euclidean norm for instance one can simply choose the strong convexity of implies in particular that fb is also strongly convex which ensures that the optimizer is unique for our bounds we will assume that for all for some constant our main result which we now present establishes fast convergence rate for the expected excess loss of the erm estimate given in eq theorem let be loss function defined over closed and convex domain rd which is and with respect to its first argument let be and regularization function then for the regularized erm estimate defined in eqs and based on an sample zn the expected excess loss is bounded as min αn in other words the theorem states that for ensuring an expected excess loss of at most sample of size suffices this result improves upon the best known fast convergence rates for functions by log factor and matches the lower bound of shamir for the special case where the loss function is the squared loss for this particular case our result affirms the conjecture of shamir regarding the sample complexity of erm for the squared loss see section below for details it is important to note that theorem establishes fast convergence rate with respect to the actual expected loss itself and not for regularized version thereof and in particular not with respect to the expectation of fb notably the magnitude of the regularization we use is only as opposed to the regularization used in standard regularized loss minimization methods that can only give rise to traditional rate results for the squared loss in this section we focus on the important special case where the loss function is the squared loss namely where rd is an instance vector and is target value this case that was extensively studied in the past was recently addressed by shamir who gave lower bounds on the sample complexity of any learning algorithm under mild assumptions shamir analyzed learning with the squared loss in setting where the domain is rd for some constant and the parameters distribution is supported over rd it is not hard to verify that in this setup for the squared loss we can take and furthermore if we choose the standard regularizer we have for all as consequence theorem implies that the expected excess loss of the regularized erm estimator we defined in eq is bounded by on the other hand standard uniform convergence results for generalized linear functions show that under the same conditions erm also enjoys an upper bound of over its expected excess risk overall we conclude corollary for the squared loss over the domain rd with rd the regularized erm estimator defined in eqs and based on an sample of instances has min min this result slightly improves by log factor upon the bound conjectured by shamir for the erm estimator and matches the lower bound proved therein up to previous results for erm that we are aware of either included excess log factors or were proven under additional distributional assumptions see also 
the discussion in we remark that shamir conjectures this bound for erm without any regularization for the specific case of the squared loss it is indeed possible to obtain the same rates without regularizing we defer details to the full version of the paper however in practice regularization has several additional benefits it renders the erm optimization problem ensures that the underlying matrix that needs to be inverted is and guarantees it has unique minimizer proof of theorem our proof of theorem proceeds as follows first we relate the expected excess risk of the erm estimator to its average stability then we bound this stability in terms of certain local properties of the empirical risk at the point to introduce the average stability notion we study we first define for each the following empirical risk zj fbi namely fbi is the regularized empirical risk corresponding to the sample obtained by discarding the instance zi then for each we let bi arg fbi be the erm estimator bi the average stability of corresponding to is then defined as the quantity pn intuitively the average stability serves as an unbiased estimator of the amount of change in the expected loss of the erm estimator when one of the instances zn chosen uniformly at random is removed from the training sample we note that looking at the average is crucial for us and the stronger condition of expected uniform stability does not hold for expconcave functions for further discussion of the various stability notions refer to bousquet and elisseeff our main step in proving theorem involves bounding the average stability of defined in eq which is the purpose of the next theorem theorem average stability for any zn and for bn and as defined above we have bi zi zi αn we remark that shamir result assumes two different bounds over the magnitude of the predictors and the target values while here we assume both are bounded by the same constant we did not attempt to capture this refined dependence on the two different parameters before proving this theorem we first show how it can be used to obtain our main theorem the proof follows arguments similar to those of bousquet and elisseeff and et al proof of theorem to obtain the stated result it is enough to upper bound the expected excess loss of bn which is the minimizer of the regularized empirical risk over the sample to this end fix an arbitrary we first write fb fb which holds true since is the minimizer of fb over hence bn bn fb next notice that the random variables bn have exactly the same distribution each is the output of regularized erm on an sample of examples also notice that bi which is the minimizer of the sample obtained by discarding the th example is independent of zi thus we have bn bi bi zi furthermore we can write zi fb plugging these expressions into eq gives bound over the expected excess loss of bn in terms of the average stability bn bi zi zi using theorem for bounding average stability term on the side and our assumption that supw to bound the second term we obtain the stated bound over the expected excess loss of bn the remainder of the section is devoted to the proof of theorem before we begin with the proof of the theorem itself we first present useful tool for analyzing the stability of minimizers of convex functions which we later apply to the empirical risks local strong convexity and stability our stability analysis for functions is inspired by recent analysis techniques of online learning algorithms that make use of strong convexity with respect to local 
norms the crucial property is summarized in the following definition definition local strong convexity we say that function is locally over domain rd at with respect to norm if ky in words function is locally at if it can be lower bounded globally over its entire domain by quadratic tangent to the function at the nature of the quadratic term in this lower bound is determined by choice of local norm which is typically adapted to the local properties of the function at the point with the above definition we can now prove the following stability result for optima of convex functions that underlies our stability analysis for functions lemma let be two convex functions defined over closed and convex domain rd and let arg and arg assume that is locally at with respect to norm then for we have furthermore if is convex then proof the local strong convexity of at implies notice that since is minimizer of also since is minimizer of optimality conditions imply that whence combining the observations yields where we have used inequality in the last inequality this gives the first claim of the lemma to obtain the second claim we first observe that where we used the fact that is the minimizer of for the first inequality and the fact that is the minimizer of for the second this establishes the lower bound for the upper bound we use the assumed convexity of to write where the second inequality follows from inequality and the final one from the first claim of the lemma average stability analysis with lemma at hand we now turn to prove theorem first few definitions are needed for brevity we henceforth denote fi zi for all we let hi be thep gradient of fi at the point defined in eq and let id hi hti and hi id hj htj for all where min finally we will use to denote the norm induced by positive definite matrix kxkm xt in this case the dual norm induced by simply equals kxkm xt in order to obtain an upper bound over the average stability we first bound each of the individual stability expressions fi bi in terms of certain norm of the gradient hi of the corresponding function fi as the proof below reveals this norm is the local norm at with respect to which the risk fbi is locally strongly convex lemma for all it holds that fi bi fi khi notice that the expression on the side might be quite large for particular function fi indeed uniform stability does not hold in our case however as we show below the average of these expressions is small the proof of lemma relies on lemma above and the following property of functions established by hazan et al lemma hazan et al lemma let be an function over convex domain rd such that for any then for any min it holds that proof of lemma we apply lemma with fb and fbi so that fi we should first verify that fbi is indeed at with respect to the norm since each fi is lemma shows that for all fi fi hi with our choice of min also the strong convexity of the regularizer implies that kw wk summing eq over all with eq and dividing through by gives fbi fbi hi kw wk kw wk fbi which establishes the strong convexity now applying lemma gives khi on the other hand since fi is convex we have kw bi wk hi fi bi fi bi bi bi bi bi the first term can be bounded using inequality and eq as khi also since fi is with respect to the euclidean norm we can bound the second term in eq as follows bi bi bi bi wk βkw bi wk bi bi hi bi khi bi wk hi and since hi id we can further bound using eq khi combining the bounds and simplifying using our assumption gives the lemma kw bi wk σkw bi wk next we bound sum 
involving the terms introduced in lemma lemma let khi then and we have khi proof denote ai hti hi for all first we claim that ai for all and being for the sum of the ai the fact that ai is evident from ai we write ai hti hi tr hi hti tr tr id where we have used the linearity of the trace and the fact that pn hi hti now our claim that is evident if khi for more than terms then the sum hi must be larger than which is contradiction to eq to prove ai hi our second claim we first write hi hi hti and use the identity to obtain hi hti hi hti hti hi for all note that for we have hti hi so that the identity applies and the inverse on the side is well defined we therefore have ht hi hti hi hti hi khi ai ai hi hi where the inequality follows from the fact that ai ai for summing this inequality over and recalling that the ai are nonnegative we obtain khi ai ai which concludes the proof theorem is now obtained as an immediate consequence of our lemmas above proof of theorem as consequence of lemmas and we have fi bi fi and fi bi fi khi σn σn summing the inequalities and using max gives the result conclusions and open problems we have proved the first fast convergence rate for regularized erm procedure for loss functions our bounds match the existing lower bounds in the specific case of the squared loss up to constants and improve by logarithmic factor upon the best known upper bounds achieved by online methods our stability analysis required us to assume smoothness of the loss functions in addition to their we note however that the online newton step algorithm of hazan et al for online optimization does not require such an assumption even though most of the popular loss functions are also smooth it would be interesting to understand whether smoothness is indeed required for the convergence of the erm estimator we study in the present paper or whether it is simply limitation of our analysis another interesting issue left open in our work is how to obtain bounds on the excess risk of erm that hold with high probability and not only in expectation since the excess risk is one can always apply markov inequality to obtain bound that holds with probability but scales linearly with also using standard concentration inequalities or success amplification techniques we may also obtain high probability bounds that scale with log losing the fast rate we leave the problem of obtaining bounds that depends both linearly on and logarithmically on for future work references abernethy hazan and rakhlin methods for and bandit online learning information theory ieee transactions on anthony and bartlett neural network learning theoretical foundations cambridge university press azoury and warmuth relative loss bounds for density estimation with the exponential family of distributions machine learning bousquet and elisseeff stability and generalization the journal of machine learning research and lugosi prediction learning and games cambridge university press conconi and gentile on the generalization ability of learning algorithms ieee transactions on information theory golub and van loan matrix computations volume jhu press hazan agarwal and kale logarithmic regret algorithms for online convex optimization machine learning hsu kakade and zhang random design analysis of ridge regression foundations of computational mathematics johnson and zhang accelerating stochastic gradient descent using predictive variance reduction in advances in neural information processing systems pages kakade sridharan and tewari on the complexity 
of linear prediction risk bounds margin bounds and regularization in advances in neural information processing systems pages kivinen and warmuth averaging expert predictions in computational learning theory pages springer koren open problem fast stochastic optimization in conference on learning theory pages and mendelson performance of empirical risk minimization in linear aggregation arxiv preprint mahdavi zhang and jin lower and upper bounds on the generalization of stochastic exponentially concave optimization in proceedings of the conference on learning theory and zhang stochastic dual coordinate ascent methods for regularized loss the journal of machine learning research and zhang accelerated proximal stochastic dual coordinate ascent for regularized loss minimization mathematical programming pages shamir srebro and sridharan learnability stability and uniform convergence the journal of machine learning research shamir the sample complexity of learning linear predictors with the squared loss arxiv preprint srebro sridharan and tewari smoothness low noise and fast rates in advances in neural information processing systems pages sridharan and srebro fast rates for regularized objectives in advances in neural information processing systems pages vovk competitive statistics international statistical review 
deep generative image models using laplacian pyramid of adversarial networks emily dept of computer science courant institute new york university soumith arthur szlam facebook ai research new york rob fergus abstract in this paper we introduce generative parametric model capable of producing high quality samples of natural images our approach uses cascade of convolutional networks within laplacian pyramid framework to generate images in fashion at each level of the pyramid separate generative convnet model is trained using the generative adversarial nets gan approach samples drawn from our model are of significantly higher quality than alternate approaches in quantitative assessment by human evaluators our samples were mistaken for real images around of the time compared to for samples drawn from gan baseline model we also show samples from models trained on the higher resolution images of the lsun scene dataset introduction building good generative model of natural images has been fundamental problem within computer vision however images are complex and high dimensional making them hard to model well despite extensive efforts given the difficulties of modeling entire scene at most existing approaches instead generate image patches in contrast we propose an approach that is able to generate plausible looking scenes at and to do this we exploit the multiscale structure of natural images building series of generative models each of which captures image structure at particular scale of laplacian pyramid this strategy breaks the original problem into sequence of more manageable stages at each scale we train convolutional networkbased generative model using the generative adversarial networks gan approach of goodfellow et al samples are drawn in fashion commencing with residual image the second stage samples the structure at the next level conditioned on the sampled residual subsequent levels continue this process always conditioning on the output from the previous scale until the final level is reached thus drawing samples is an efficient and straightforward procedure taking random vectors as input and running forward through cascade of deep convolutional networks convnets to produce an image deep learning approaches have proven highly effective at discriminative tasks in vision such as object classification however the same level of success has not been obtained for generative tasks despite numerous efforts against this background our proposed approach makes significant advance in that it is straightforward to train and sample from with the resulting samples showing surprising level of visual fidelity related work generative image models are well studied falling into two main approaches and parametric the former copy patches from training images to perform for example texture synthesis or more ambitiously entire portions of an image can be given sufficiently large training dataset early parametric models addressed the easier problem of denotes equal contribution ture synthesis with portilla simoncelli making use of steerable pyramid wavelet representation similar to our use of laplacian pyramid for image processing tasks models based on marginal distributions of image gradients are effective but are only designed for image restoration rather than being true density models so can not sample an actual image very large gaussian mixture models and sparse coding models of image patches can also be used but suffer the same problem wide variety of deep learning approaches involve generative 
parametric models restricted boltzmann machines deep boltzmann machines denoising all have generative decoder that reconstructs the image from the latent representation variational provide probabilistic interpretation which facilitates sampling however for all these methods convincing samples have only been shown on simple datasets such as mnist and norb possibly due to training complexities which limit their applicability to larger and more realistic images several recent papers have proposed novel generative models dosovitskiy et al showed how convnet can draw chairs with different shapes and viewpoints while our model also makes use of convnets it is able to sample general scenes and objects the draw model of gregor et al used an attentional mechanism with an rnn to generate images via trajectory of patches showing samples of mnist and images et al use process for deep unsupervised learning and the resulting model is able to produce reasonable samples theis and bethge employ lstms to capture spatial dependencies and show convincing inpainting results of natural textures our work builds on the gan approach of goodfellow et al which works well for smaller images mnist but can not directly handle large ones unlike our method most relevant to our approach is the preliminary work of mirza and osindero and gauthier who both propose conditional versions of the gan model the former shows mnist samples while the latter focuses solely on frontal face images our approach also uses several forms of conditional gan model but is much more ambitious in its scope approach the basic building block of our approach is the generative adversarial network gan of goodfellow et al after reviewing this we introduce our lapgan model which integrates conditional form of gan model into the framework of laplacian pyramid generative adversarial networks the gan approach is framework for training generative models which we briefly explain in the context of image data the method pits two networks against one another generative model that captures the data distribution and discriminative model that distinguishes between samples drawn from and images drawn from the training data in our approach both and are convolutional networks the former takes as input noise vector drawn from distribution pnoise and outputs an image the discriminative network takes an image as input stochastically chosen with equal probability to be either as generated from or real image drawn from the training data pdata outputs scalar probability which is trained to be high if the input was real and low if generated from minimax objective is used to train both models together min max log log this encourages to fit pdata so as to fool with its generated samples both and are trained by backpropagating the loss in eqn through both models to update the parameters the conditional generative adversarial net cgan is an extension of the gan where both networks and receive an additional vector of information as input this might contain say information about the class of the training example the loss function thus becomes min max eh log log where pl is for example the prior distribution over classes this model allows the output of the generative model to be controlled by the conditioning variable mirza and osindero and gauthier both explore this model with experiments on mnist and faces using as class indicator in our approach will be another image generated from another cgan model laplacian pyramid the laplacian pyramid is linear invertible image 
Laplacian pyramid. The Laplacian pyramid is a linear, invertible image representation consisting of a set of band-pass images spaced an octave apart, plus a low-frequency residual. Formally, let $d(\cdot)$ be a downsampling operation which blurs and decimates a $j \times j$ image $I$, so that $d(I)$ is a new image of size $j/2 \times j/2$. Also let $u(\cdot)$ be an upsampling operator which smooths and expands $I$ to be twice the size, so $u(I)$ is a new image of size $2j \times 2j$. We first build a Gaussian pyramid $\mathcal{G}(I) = [I_0, I_1, \dots, I_K]$, where $I_0 = I$ and $I_k$ is $k$ repeated applications of $d(\cdot)$ to $I$; $K$ is the number of levels in the pyramid, selected so that the final level has very small spatial extent (a few pixels). The coefficients $h_k$ at each level $k$ of the Laplacian pyramid are constructed by taking the difference between adjacent levels in the Gaussian pyramid, upsampling the smaller one with $u(\cdot)$ so that the sizes are compatible:
$$h_k = \mathcal{L}_k(I) = \mathcal{G}_k(I) - u\big(\mathcal{G}_{k+1}(I)\big) = I_k - u(I_{k+1}).$$
Intuitively, each level captures image structure present at a particular scale. The final level of the Laplacian pyramid, $h_K$, is not a difference image but a low-frequency residual equal to the final Gaussian pyramid level, $h_K = I_K$. Reconstruction from the Laplacian pyramid coefficients $[h_1, \dots, h_K]$ is performed using the backward recurrence
$$I_k = u(I_{k+1}) + h_k,$$
which is started with $I_K = h_K$, the reconstructed image being $I = I_0$. In other words, starting at the coarsest level, we repeatedly upsample and add the difference image at the next finer level until we get back to the full-resolution image.

Laplacian generative adversarial networks (LAPGAN). Our proposed approach combines the conditional GAN model with a Laplacian pyramid representation. The model is best explained by first considering the sampling procedure. Following training (explained below), we have a set of generative convnet models $\{G_0, \dots, G_K\}$, each of which captures the distribution of coefficients $h_k$ for natural images at a different level of the Laplacian pyramid. Sampling an image is akin to the reconstruction procedure above, except that the generative models are used to produce the $\tilde{h}_k$:
$$\tilde{I}_k = u(\tilde{I}_{k+1}) + \tilde{h}_k = u(\tilde{I}_{k+1}) + G_k\big(z_k, u(\tilde{I}_{k+1})\big).$$
The recurrence starts by setting $\tilde{I}_{K+1} = 0$ and using the model at the final level, $G_K$, to generate a residual image $\tilde{I}_K$ using a noise vector $z_K$: $\tilde{I}_K = G_K(z_K)$. Note that the models at all levels except the final one are conditional generative models that take an upsampled version of the current image $\tilde{I}_{k+1}$ as a conditioning variable, in addition to the noise vector $z_k$. The sampling figure below shows this procedure in action, using the generative models to sample an image.

Figure: The sampling procedure for our LAPGAN model. We start with a noise sample (right side) and use a generative model to generate the coarsest image. This is upsampled (green arrow) and then used as the conditioning variable (orange arrow) for the generative model at the next level which, together with another noise sample, generates a difference image; this is added to the upsampled image to create the next finer image. The process repeats across the subsequent levels to yield a final full-resolution sample.
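The recurrences above are short enough to state in code. The following numpy/scipy sketch (an illustration, not the released Torch code) builds and inverts a Laplacian pyramid with simple blur-based d(.) and u(.) operators and shows the LAPGAN-style sampling chain with placeholder generators; the kernel width, pyramid depth, and image size are arbitrary choices.

import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def d(img):                      # blur + decimate: j x j -> j/2 x j/2
    return gaussian_filter(img, sigma=1.0)[::2, ::2]

def u(img):                      # smooth + expand: j x j -> 2j x 2j
    return gaussian_filter(zoom(img, 2, order=1), sigma=1.0)

def build_laplacian_pyramid(I, K):
    """h_k = I_k - u(I_{k+1}) for k < K, with the residual h_K = I_K."""
    gauss = [I]
    for _ in range(K):
        gauss.append(d(gauss[-1]))
    coeffs = [gauss[k] - u(gauss[k + 1]) for k in range(K)]
    coeffs.append(gauss[K])                      # low-frequency residual
    return coeffs

def reconstruct(coeffs):
    """Backward recurrence I_k = u(I_{k+1}) + h_k, started from I_K = h_K."""
    I = coeffs[-1]
    for h in reversed(coeffs[:-1]):
        I = u(I) + h
    return I

def lapgan_sample(generators, z_dims):
    """Sampling chain: I_K = G_K(z_K), then I_k = u(I_{k+1}) + G_k(z_k, u(I_{k+1})).
    `generators[k]` is a stand-in for a trained (conditional) generative model."""
    I = generators[-1](np.random.randn(z_dims[-1]))          # coarsest residual
    for k in reversed(range(len(generators) - 1)):
        up = u(I)
        I = up + generators[k](np.random.randn(z_dims[k]), up)
    return I

# Exact invertibility holds by construction, whatever d(.) and u(.) are:
I = np.random.rand(64, 64)
assert np.allclose(reconstruct(build_laplacian_pyramid(I, K=3)), I)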
The generative models $\{G_0, \dots, G_K\}$ are trained using the CGAN approach at each level of the pyramid. Specifically, we construct a Laplacian pyramid from each training image $I$. At each level we make a stochastic choice (with equal probability) to either (i) construct the coefficients $h_k$ using the standard procedure above, or (ii) generate them using $G_k$: $\tilde{h}_k = G_k\big(z_k, u(I_{k+1})\big)$.

Figure: The training procedure for our LAPGAN model. Starting with an input image $I$ from our training set (top left): (i) we blur and downsample it by a factor of two (red arrow) to produce a coarser image; (ii) we upsample that by a factor of two (green arrow), giving a low-pass version $l$ of $I$; (iii) with equal probability we use these to create either a real or a generated example for the discriminative model $D$. In the real case (blue arrows) we compute the high-pass coefficients $h = I - l$, which are input to $D$, which computes the probability of the input being real vs. generated. In the generated case (magenta arrows) the generative network $G$ receives as input a random noise vector $z$ and the low-pass image $l$, and it outputs a generated high-pass image $\tilde{h}$, which is input to $D$. In both cases $D$ also receives $l$ (orange arrow). Optimizing the CGAN objective, $G$ thus learns to generate realistic high-frequency structure consistent with the low-pass image. The same procedure is repeated at the remaining scales. Note that the models at each level are trained independently; at the coarsest level the image is simple enough to be modeled directly with a standard GAN.

Note that $G_k$ is a convnet which uses a coarse-scale version of the image, $l_k = u(I_{k+1})$, as an input, as well as the noise vector $z_k$. $D_k$ takes as input $h_k$ or $\tilde{h}_k$, along with the low-pass image $l_k$ (which is explicitly added to $h_k$ or $\tilde{h}_k$ before the first convolution layer), and predicts if the image was real or generated. At the final scale of the pyramid, the low-frequency residual is sufficiently small that it can be directly modeled with a standard GAN: $\tilde{h}_K = G_K(z_K)$, and $D_K$ only has $h_K$ or $\tilde{h}_K$ as input. The framework is illustrated in the training figure above.

Breaking the generation into successive refinements is the key idea in this work. Note that we give up any global notion of fidelity: we never make any attempt to train a network to discriminate between the output of the full cascade and a real image, and instead focus on making each step plausible. Furthermore, the independent training of each pyramid level has the advantage that it is far more difficult for the model to memorize training examples, a hazard when high-capacity deep networks are used.

As described, our model is trained in an unsupervised manner. However, we also explore variants that utilize class labels. This is done by adding a one-hot vector $c$ indicating class identity as another conditioning variable for $G_k$ and $D_k$.

Model architecture and training. We apply our approach to three datasets: (i) a dataset of small color images of different object classes, with tight crops of the objects; (ii) a dataset of larger color images, of which we use the unlabeled portion (referred to as STL below); and (iii) LSUN, a large collection of images of different natural scene types, downsampled. For each dataset we explored a variety of architectures for $G_k$ and $D_k$. Model selection was performed using a combination of visual inspection and a heuristic based on $\ell_2$ error in pixel space. The heuristic computes the error for a given validation image $I$ at level $k$ in the pyramid as
$$E_k(I) = \min_{z_j} \big\| G_k\big(z_j, u(I_{k+1})\big) - h_k \big\|_2,$$
where $\{z_j\}$ is a large set of noise vectors drawn from $p_{\text{noise}}$. In other words, the heuristic asks: are any of the generated residual images close to the ground truth? Torch training and evaluation code, along with model specification files, can be found at http. For all models, the noise vector $z_k$ is drawn from a uniform distribution.

Initial scale: this operates at the coarsest resolution, using densely connected nets for both $G_K$ and $D_K$ with hidden layers and ReLU nonlinearities; $D_K$ uses dropout and has a different number of hidden units than $G_K$; $z_K$ is a small vector. Subsequent scales: for the first dataset we boost the training set size by taking four crops from the original images, so the two subsequent levels of the pyramid operate at progressively higher resolutions; for STL we have more levels. For both datasets $G_k$ and $D_k$ are convnets with a few layers each (see the supplementary material). The noise input $z_k$ to $G_k$ is presented as an extra color plane concatenated to $l_k$, hence its dimensionality varies with the pyramid level. For the first dataset we also explore a class-conditional version of the model, where a one-hot vector $c$ encodes the label; this is integrated into $G_k$ and $D_k$ by passing it through a linear layer whose output is reshaped into a single-plane feature map, which is then concatenated with the other layer maps. The CGAN loss is trained using SGD with an initial learning rate that is decreased by a fixed factor at each epoch; momentum starts at a moderate value and is increased during training up to a maximum. Training time depends on the model's size and pyramid level, with smaller models taking
hours to train and larger models taking up to day lsun the larger size of this dataset allows us to train separate lapgan model for each of the scene classes the four subsequent scales use common architecture for gk dk at each level gk is convnet with feature maps and linear output layer filters relus batch normalization and dropout are used at each hidden layer dk has hidden layers with maps plus sigmoid output see for full details note that gk and dk are substantially larger than those used for and stl as afforded by the larger training set experiments we evaluate our approach using different methods computation of on held out image set ii drawing sample images from the model and iii human subject experiment that compares our samples those of baseline methods and real images evaluation of like goodfellow et al we are compelled to use gaussian parzen window estimator to compute since there no direct way of computing it using our model table compares the on validation set for our lapgan model and standard gan using samples for each model the gaussian width was also tuned on the validation set our approach shows marginal gain over gan however we can improve the underlying estimation technique by leveraging the structure of the lapgan model this new approach computes probability at each scale of the laplacian pyramid and combines them to give an overall image probability see appendix in supplementary material for details our parzen estimate shown in table produces big gain over the traditional estimator the shortcomings of both estimators are readily apparent when compared to simple gaussian fit to the training set even with added noise the resulting model can obtain far higher loglikelihood than either the gan or lapgan models or other published models more generally is problematic as performance measure due to its sensitivity to the exact representation used small variations in the scaling noise and resolution of the image much less changing from rgb to yuv or more substantive changes in input representation results in wildly different scores making fair comparisons to other methods difficult model gan parzen window estimate lapgan parzen window estimate lapgan parzen window estimate table estimates for standard gan and our proposed lapgan model on and datasets the mean and std dev are given in units of rows and use approach at while row uses our estimator model samples we show samples from models trained on and lsun datasets additional samples can be found in the supplementary material fig shows samples from our models trained on samples from the class conditional lapgan are organized by class our reimplementation of the standard gan model produces slightly sharper images than those shown in the original paper we attribute this improvement to the introduction of data augmentation the lapgan samples improve upon the standard gan samples they appear more and have more clearly defined edges conditioning on class label improves the generations as evidenced by the clear object structure in the conditional lapgan samples the quality of these samples compares favorably with those from the draw model of gregor et al and also et al the rightmost column of each image shows the nearest training example to the neighboring sample in this demonstrates that our model is not simply copying the input examples fig shows samples from our lapgan model trained on here we lose clear object shape but the samples remain sharp fig shows the generation chain for random samples fig shows samples from lapgan models 
trained on three lsun categories tower bedroom church front to the best of our knowledge no other generative model is been able to produce samples of this complexity the substantial gain in quality over the and samples is likely due to the much larger training lsun training set which allows us to train bigger and deeper models in supplemental material we show additional experiments probing the models drawing multiple samples using the same fixed image which illustrates the variation captured by the lapgan models human evaluation of samples to obtain quantitative measure of quality of our samples we asked volunteers to participate in an experiment to see if they could distinguish our samples from real images the subjects were presented with the user interface shown in fig right and shown at random four different types of image samples drawn from three different gan models trained on lapgan ii class conditional lapgan and iii standard gan and also real images after being presented with the image the subject clicked the appropriate button to indicate if they believed the image was real or generated since accuracy is function of viewing time we also randomly pick the presentation time from one of durations ranging from to after which gray mask image is displayed before the experiment commenced they were shown examples of real images from after collecting samples from the volunteers we plot in fig the fraction of images believed to be real for the four different data sources as function of presentation time the curves show our models produce samples that are more realistic than those from standard gan discussion by modifying the approach in to better respect the structure of images we have proposed conceptually simple generative model that is able to produce sample images that are qualitatively better than other deep generative modeling approaches while they exhibit reasonable diversity we can not be sure that they cover the full data distribution hence our models could potentially be assigning low probability to parts of the manifold on natural images quantifying this is difficult but could potentially be done via another human subject experiment key point in our work is giving up any global notion of fidelity and instead breaking the generation into plausible successive refinements we note that many other signal modalities have multiscale structure that may benefit from similar approach acknowledgements we would like to thank the anonymous reviewers for their insightful and constructive comments we also thank andrew tulloch wojciech zaremba and the fair infrastructure team for useful discussions and support emily denton was supported by an nserc fellowship airplane automobile bird cat deer dog frog horse ship truck lapgan gan figure samples our class conditional model our lapgan model and the standard gan model of goodfellow the yellow column shows the training set nearest neighbors of the samples in the adjacent column figure samples random samples from our lapgan model generation chain figure samples from three different lsun lapgan models top tower middle bedroom bottom church front real lapgan gan classified real presentation time ms figure left human evaluation of real images red and samples from goodfellow et al magenta our lapgan blue and class conditional lapgan green the error bars show of the variability around of the samples generated by our class conditional lapgan model are realistic enough to fool human into thinking they are real images this compares with of images from the 
standard gan model but is still lot lower than the rate for real images right the presented to the subjects references burt edward and adelson the laplacian pyramid as compact image code ieee transactions on communications coates lee and ng an analysis of single layer networks in unsupervised feature learning in aistats de bonet multiresolution sampling procedure for analysis and synthesis of texture images in proceedings of the annual conference on computer graphics and interactive techniques pages acm publishing deng dong socher li li and imagenet hierarchical image database in cvpr pages ieee denton chintala szlam and fergus deep generative image models using laplacian pyramid of adversarial networks supplementary material http dosovitskiy springenberg and brox learning to generate chairs with convolutional neural networks arxiv preprint efros and leung texture synthesis by sampling in iccv volume pages ieee eslami heess williams and winn the shape boltzmann machine strong model of object shape international journal of computer vision freeman jones and pasztor computer graphics and applications ieee gauthier conditional generative adversarial nets for convolutional face generation class project for stanford convolutional neural networks for visual recognition winter semester goodfellow mirza xu ozair courville and bengio generative adversarial nets in nips pages gregor danihelka graves and wierstra draw recurrent neural network for image generation corr hays and efros scene completion using millions of photographs acm transactions on graphics tog hinton and salakhutdinov reducing the dimensionality of data with neural networks science ioffe and szegedy batch normalization accelerating deep network training by reducing internal covariate shift arxiv preprint kingma and welling variational bayes iclr krizhevsky learning multiple layers of features from tiny images masters thesis deptartment of computer science university of toronto krizhevsky hinton et al factored restricted boltzmann machines for modeling natural images in aistats pages mirza and osindero conditional generative adversarial nets corr olshausen and field sparse coding with an overcomplete basis set strategy employed by vision research osindero and hinton modeling image patches with directed hierarchy of markov random fields in platt koller singer and roweis editors nips pages portilla and simoncelli parametric texture model based on joint statistics of complex wavelet coefficients international journal of computer vision ranzato mnih susskind and hinton modeling natural images using gated mrfs ieee transactions on pattern analysis machine intelligence rezende mohamed and wierstra stochastic backpropagation and variational inference in deep latent gaussian models arxiv preprint roth and black fields of experts framework for learning image priors in in cvpr pages salakhutdinov and hinton deep boltzmann machines in aistats pages simoncelli freeman adelson and heeger shiftable multiscale transforms information theory ieee transactions on weiss maheswaranathan and ganguli deep unsupervised learning using nonequilibrium thermodynamics corr theis and bethge generative image modeling using spatial lstms dec vincent larochelle bengio and manzagol extracting and composing robust features with denoising autoencoders in icml pages wright ma mairal sapiro huang and yan sparse representation for computer vision and pattern recognition proceedings of the ieee zhang yu song xu seff and xiao scene understanding challenge in cvpr workshop 
zhu wu and mumford filters random fields and maximum entropy frame towards unified theory for texture modeling international journal of computer vision zoran and weiss from learning models of natural image patches to whole image restoration in iccv 
decoupled deep neural network for semantic segmentation seunghoon hyeonwoo bohyung han dept of computer science and engineering postech pohang korea hyeonwoonoh bhhan abstract we propose novel deep neural network architecture for semantic segmentation using heterogeneous annotations contrary to existing approaches posing semantic segmentation as single task of classification our algorithm decouples classification and segmentation and learns separate network for each task in this architecture labels associated with an image are identified by classification network and binary segmentation is subsequently performed for each identified label in segmentation network the decoupled architecture enables us to learn classification and segmentation networks separately based on the training data with and class labels respectively it facilitates to reduce search space for segmentation effectively by exploiting activation maps obtained from bridging layers our algorithm shows outstanding performance compared to other approaches with much less training images with strong annotations in pascal voc dataset introduction semantic segmentation is technique to assign structured semantic object class individual pixels in images this problem has been studied extensively over decades yet remains challenging since object appearances involve significant variations that are potentially originated from pose variations scale changes occlusion background clutter etc however in spite of such challenges the techniques based on deep neural network dnn demonstrate impressive performance in the standard benchmark datasets such as pascal voc most approaches pose semantic segmentation as classification problem although these approaches have achieved good performance compared to previous methods training dnn requires large number of segmentation which result from tremendous annotation efforts and costs for this reason reliable segmentation annotations are typically available only for small number of classes and images which makes supervised dnns difficult to be applied to semantic segmentation tasks involving various kinds of objects or learning approaches alleviate the problem in lack of training data by exploiting weak label annotations per bounding box or image they often assume that large set of weak annotations is available during training while training examples with strong annotations are missing or limited this is reasonable assumption because weak annotations such as class labels for bounding boxes and images require only fraction of efforts compared to strong annotations segmentations the standard approach in this setting is to update the model of supervised dnn by iteratively inferring and refining hypothetical segmentation labels using weakly annotated images such iterative techniques often work well in practice but training methods rely on procedures and there is no guarantee of convergence implementation may be tricky and the algorithm may not be straightforward to reproduce both authors have equal contribution on this paper we propose novel decoupled architecture of dnn appropriate for semantic segmentation which exploits heterogeneous annotations with small number of strong full segmentation well as large number of weak class labels per image our algorithm stands out from the traditional techniques because the architecture is composed of two separate networks one is for classification and the other is for segmentation in the proposed network object labels associated with an input image are identified by 
classification network while segmentation of each identified label is subsequently obtained by segmentation network additionally there are bridging layers which deliver information from classification to segmentation network and enable segmentation network to focus on the single label identified by classification network at time training is performed on each network separately where networks for classification and segmentation are trained with and annotations respectively training does not require iterative procedure and algorithm is easy to reproduce more importantly decoupling classification and segmentation reduces search space for segmentation significantly which makes it feasible to train the segmentation network with handful number of segmentation annotations inference in our network is also simple and does not involve any extensive experiments show that our network substantially outperforms existing techniques based on dnns even with much smaller segmentation annotations or strong annotations per class the rest of the paper is organized as follows we briefly review related work and introduce overall algorithm in section and respectively the detailed configuration of the proposed network is described in section and training algorithm is presented in section section presents experimental results on challenging benchmark dataset related work recent breakthrough in semantic segmentation are mainly driven by supervised approaches relying on convolutional neural network cnn based on cnns developed for image classification they train networks to assign semantic labels to local regions within images such as pixels or superpixels notably long et al propose an system for semantic segmentation by transforming standard cnn for classification into fully convolutional network later approaches improve segmentation accuracy through based on crf another branch of semantic segmentation is to learn deconvolution network which also provides complete pipeline however training these networks requires large number of segmentation but the collection of such dataset is difficult task due to excessive annotation efforts to mitigate heavy requirement of training data learning approaches start to draw attention recently in setting the models for semantic segmentation have been trained with only labels or bounding box class labels given weakly annotated training images they infer latent segmentation masks based on multiple instance learning mil or em framework based on the cnns for supervised semantic segmentation however performance of weakly supervised learning approaches except is substantially lower than supervised methods mainly because there is no direct supervision for segmentation during training note that requires bounding box annotations as weak supervision which are already pretty strong and significantly more expensive to acquire than labels learning is an alternative to bridge the gap between and learning approaches in the standard learning framework given only small number of training images with strong annotations one needs to infer the full segmentation labels for the rest of the data however it is not plausible to learn huge number of parameters in deep networks reliably in this scenario instead train the models based on heterogeneous large number of weak annotations as well as small number strong annotations this approach is motivated from the facts that the weak annotations object labels per bounding box or image is much more easily accessible than the strong ones and that the availability of 
the weak annotations is useful to learn deep network by mining additional training examples with full segmentation masks based on supervised cnn architectures they iteratively infer and refine segmentation labels of weakly annotated images with guidance of strongly annotated images where labels and bounding box annotations are employed as weak annotations they figure the architecture of the proposed network while classification and segmentation networks are decoupled bridging layers deliver critical information from classification network to segmentation network claim that exploiting few strong annotations substantially improves the accuracy of semantic segmentation while it reduces annotations efforts for supervision significantly however they rely on iterative training procedures which are often and heuristic and increase complexity to reproduce results in general also these approaches still need fairly large number of strong annotations to achieve reliable performance algorithm overview figure presents the overall architecture of the proposed network our network is composed of three parts classification network segmentation network and bridging layers connecting the two networks in this model semantic segmentation is performed by separate but successive operations of classification and segmentation given an input image classification network identifies labels associated with the image and segmentation network produces segmentation corresponding to each identified label this formulation may suffer from loose connection between classification and segmentation but we figure out this challenge by adding bridging layers between the two networks and delivering information from classification network to segmentation network then it is possible to optimize the two networks using separate objective functions while the two decoupled tasks collaborate effectively to accomplish the final goal training our network is very straightforward we assume that large number of annotations are available while there are only few training images with segmentation annotations given these heterogeneous and unbalanced training data we first learn the classification network using rich annotations then with the classification network fixed we jointly optimize the bridging layers and the segmentation network using small number of training examples with strong annotations there are only small number of strongly annotated training data but we alleviate this challenging situation by generating many artificial training examples through data augmentation the contributions and characteristics of the proposed algorithm are summarized below we propose novel dnn architecture for semantic segmentation using heterogeneous annotations the new architecture decouples classification and segmentation tasks which enables us to employ models for classification network and train only segmentation network and bridging layers using few strongly annotated data the bridging layers construct activation maps which are delivered from classification network to segmentation network these maps provide strong priors for segmentation and reduce search space dramatically for training and inference overall training procedure is very simple since two networks are to be trained separately our algorithm does not infer segmentation labels of weakly annotated images through iterative which are common in learning techniques the proposed algorithm provides concept to make up for the lack of strongly annotated training data using large number of weakly 
annotated data. This concept is interesting because the assumption about the availability of training data is desirable for real situations. (Due to this property, our framework is different from standard semi-supervised learning but close to few-shot learning with heterogeneous annotations; nonetheless, we refer to it as semi-supervised learning in this paper, since we have only a fraction of strongly annotated data in our training dataset but complete annotations of weak labels. Note that our level of supervision is similar to the semi-supervised learning case in the literature.) We estimate segmentation maps only for the relevant classes identified by the classification network, which improves scalability of the algorithm in terms of the number of classes. Finally, our algorithm outperforms the comparable semi-supervised learning method by substantial margins in various settings.

Architecture. This section describes the detailed configurations of the proposed network, including the classification network, the segmentation network, and the bridging layers between the two networks.

Classification network. The classification network takes an image $x$ as its input and outputs a normalized score vector $f(x; \theta_c) \in \mathbb{R}^L$, representing a set of relevance scores of the input for the $L$ predefined categories, based on the trained classification model $\theta_c$. The objective of the classification network is to minimize the error between the ground-truth and estimated class labels, and is formally written as
$$\min_{\theta_c} \sum_i e_C\big(y_i, f(x_i; \theta_c)\big),$$
where $y_i$ denotes the ground-truth label vector of the $i$-th example and $e_C\big(y_i, f(x_i; \theta_c)\big)$ is the classification loss of $f(x_i; \theta_c)$ with respect to $y_i$. We employ the VGG net as the base architecture for our classification network; it consists of convolutional layers followed by rectification and optional pooling layers, and fully connected layers for classification. A sigmoid loss function is employed for $e_C$, which is a typical choice in classification tasks. Given the output scores $f(x_i; \theta_c)$, our classification network identifies a set of labels $L_i$ associated with the input image $x_i$; the region in $x_i$ corresponding to each label $l \in L_i$ is predicted by the segmentation network discussed next.

Segmentation network. The segmentation network takes a class-specific activation map $g_i^l$ of input image $x_i$, which is obtained from the bridging layers, and produces a two-channel class-specific segmentation map $f(g_i^l; \theta_s)$ after applying a softmax function, where $\theta_s$ denotes the parameters of the segmentation network. Note that $f(g_i^l; \theta_s)$ has foreground and background channels, which are denoted by $M_f(g_i^l; \theta_s)$ and $M_b(g_i^l; \theta_s)$, respectively. The segmentation task is formulated as regression to the ground-truth segmentation, which minimizes
$$\min_{\theta_s} \sum_i \sum_{l \in L_i} e_S\big(z_i^l, f(g_i^l; \theta_s)\big),$$
where $z_i^l$ denotes the binary ground-truth segmentation mask for category $l$ of the $i$-th image $x_i$, and $e_S\big(z_i^l, f(g_i^l; \theta_s)\big)$ is the segmentation loss of $M_f(g_i^l; \theta_s)$ (equivalently, the segmentation loss of $M_b(g_i^l; \theta_s)$) with respect to $z_i^l$. The recently proposed deconvolution network is adopted for our segmentation network: given an input activation map $g_i^l$ corresponding to input image $x_i$, the segmentation network generates a segmentation mask of the same size as $x_i$ by multiple series of operations of unpooling, deconvolution, and rectification. Unpooling is implemented by importing the switch variables from every pooling layer in the classification network, and the number of deconvolutional and unpooling layers is identical to the number of convolutional and pooling layers in the classification network. We employ the softmax loss function to measure $e_S$. Note that this objective corresponds to pixel-wise binary classification: it infers whether each pixel belongs to the given class or not. This is the major difference from the existing networks for semantic segmentation, which aim to classify each pixel to one of the predefined classes.
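As a concrete illustration of the two decoupled objectives, the following PyTorch sketch (not the authors' Caffe implementation) pairs a multi-label sigmoid classification loss e_C with a two-channel softmax segmentation loss e_S for a single identified class; the tensor shapes, the number of classes L, and the toy batch are illustrative assumptions.

import torch
import torch.nn.functional as F

L = 20                                   # number of candidate classes (illustrative)

def classification_loss(scores, y):
    """e_C: scores f(x; theta_c) in R^L vs. a multi-hot image-level label vector y."""
    return F.binary_cross_entropy_with_logits(scores, y)

def segmentation_loss(seg_logits, z):
    """e_S: seg_logits has two channels per pixel [M_b, M_f]; z is the binary
    ground-truth mask z_i^l for the single class l being segmented."""
    return F.cross_entropy(seg_logits, z.long())

# toy batch: multi-label classification, then binary segmentation of one label
scores = torch.randn(4, L)                       # f(x_i; theta_c)
y      = torch.randint(0, 2, (4, L)).float()     # image-level labels y_i
cls_loss = classification_loss(scores, y)

seg_logits = torch.randn(4, 2, 320, 320)         # [background, foreground] channels
z          = torch.randint(0, 2, (4, 320, 320))  # binary mask z_i^l
seg_loss = segmentation_loss(seg_logits, z)

# At inference (as described later in the text), per-pixel labels come from the
# maximum over the per-class foreground maps and the background map.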
By decoupling classification from segmentation and posing the objective of the segmentation network as binary classification, our algorithm reduces the number of parameters in the segmentation network significantly. Specifically, this is because we identify the relevant labels using the classification network and perform binary segmentation for each of the labels, so the number of output channels in the segmentation network is fixed to two (foreground and background) regardless of the total number of candidate classes. This property is especially advantageous in our challenging scenario, where only a few strong annotations (typically a handful per class) are available for training the segmentation network.

Figure: Examples of activation maps (outputs of the bridging layers); we show the most representative channel for visualization. Despite significant variations in the input images, the class-specific activation maps share similar properties.

Bridging layers. To enable the segmentation network described above to produce the segmentation mask of a specific class, the input to the segmentation network should involve class-specific information as well as the spatial information required for shape generation. To this end, we place additional layers underneath the segmentation network, referred to as bridging layers, to construct the class-specific activation map $g_i^l$ for each identified label $l \in L_i$.

To encode the spatial configuration of objects present in the image, we exploit outputs from an intermediate layer of the classification network. We take the outputs of the last pooling layer, since the activation patterns of convolution and pooling layers often preserve spatial information effectively, while the activations in the higher layers tend to capture more abstract and global information. We denote this activation map by $f_{\text{spat}}$.

Although the activations in $f_{\text{spat}}$ maintain useful information for shape generation, they contain mixed information of all relevant labels in $x_i$, so we should additionally identify class-specific activations within $f_{\text{spat}}$. For this purpose, we compute class-specific saliency maps using a back-propagation-based technique. Let $f^{(m)}$ denote the output of the $m$-th layer in the classification network. The relevance of the activations in $f^{(m)}$ with respect to a specific class $l$ is computed by the chain rule of partial derivatives, similar to error back-propagation in optimization:
$$f_{\text{cls}}^{\,l} = \frac{\partial S_l}{\partial f^{(m)}},$$
where $f_{\text{cls}}^{\,l}$ denotes the class-specific saliency map and $S_l$ is the classification score of class $l$. Intuitively, this means that the values in $f_{\text{cls}}^{\,l}$ depend on how much the corresponding activations are relevant to class $l$, measured by the partial derivative of the class score $S_l$ with respect to the activations; we back-propagate this information down to the last pooling layer.

The activation map $g_i^l$ is obtained by combining both $f_{\text{spat}}$ and $f_{\text{cls}}^{\,l}$: we first concatenate $f_{\text{spat}}$ and $f_{\text{cls}}^{\,l}$ along their channel direction and pass the result through the bridging layers, which discover the optimal combination of $f_{\text{spat}}$ and $f_{\text{cls}}^{\,l}$ using their trained weights. The resultant activation map $g_i^l$, which contains both spatial and class-specific information, is given to the segmentation network to produce a class-specific segmentation map. Note that the changes in $g_i^l$ across classes depend only on $f_{\text{cls}}^{\,l}$, since $f_{\text{spat}}$ is fixed for all classes in an input image.

Figure: An input image (left) and its class-specific segmentation maps (right) for the individual classes person, table, and plant, i.e., $M_f(g_i^{\text{person}})$, $M_f(g_i^{\text{table}})$, and $M_f(g_i^{\text{plant}})$.

The activation-map figure above visualizes examples of $g_i^l$ obtained from several validation images. The activations from images of the same class share similar patterns despite substantial appearance variations, which shows that the outputs of the bridging layers capture class-specific information
effectively this property makes it possible to obtain segmentation maps for individual relevant classes in segmentation network more importantly it reduces the variations of input distributions for segmentation network which allows to achieve good generalization performance in segmentation even with small number of training examples for inference we compute activation map gil for each identified label li and obtain segmentation maps gil θs in addition we obtain θs where is the activation map from the bridging layers for all identified labels the final label estimation is given by identifying the label with the maximum score in each pixel out of mf gil θs and mb θs figure illustrates the output segmentation map of each gil for xi where each map identifies high response area given gil successfully training in our learning scenario we have mixed training examples with weak and strong annotations let nw and ns denote the index sets of images with imagelevel and class labels respectively where nw ns we first train the classification network using the images in by optimizing the loss function in eq then fixing the weights in the classification network we jointly train the bridging layers and the segmentation network using images in by optimizing eq for training segmentation network we need to obtain classspecific activation map gil from bridging layers using class labels associated with xi note that we can reduce complexity in training by optimizing the two networks separately although the proposed algorithm has several advantages in training segmentation network with few training images it would still be better to have more training examples with strong annotations hence we propose an effective data augmentation strategy combinatorial cropping let denotes set of labels associated with image xi we enumerate all possible combinations of labels in where denotes the powerset of for each except empty set we construct binary segmentation mask zp by setting the pixels corresponding to every label as foreground and the rests as background then we generate np enclosing the foreground areas based on region proposal method and random pling through this simple data augmentation technique we have nt ns training examples with strong annotations effectively where nt ns experiments implementation details dataset we employ pascal voc dataset for training and testing of the proposed deep network the dataset with extended annotations from which contains images with class labels is used in our experiment to simulate learning scenario we divide the training images into two with weak annotations only and with strong annotations as well there are images with class labels which are used to train our classification network we also construct training datasets with strongly annotated images table evaluation results on pascal voc validation set of strongs full classes classes classes decouplednet fov deconvnet table evaluation results on pascal voc test set models bkg areo bike bird boat bottle bus car cat chair cow table dog horse mbk prsn plnt sheep sofa train tv mean the number of images with segmentation labels per class is controlled to evaluate the impact of supervision level in our experiment three different or training images with strong annotations per tested to show the effectiveness of our framework we evaluate the performance of the proposed algorithm on validation images data augmentation we employ different strategies to augment training examples in the two datasets with weak and strong annotations for the 
images with weak annotations simple data augmentation techniques such as random cropping and horizontal flipping are employed as suggested in we perform combinatorial cropping proposed in section for the images with strong annotations where edgebox is adopted to generate region proposals and the np are generated for each label combination optimization the proposed network is implemented based on caffe library the standard stochastic gradient descent sgd with momentum is employed for optimization where all parameters are identical to we initialize the weights of the classification network using vgg net on ilsvrc dataset when we train the deep network with full annotations the network converges after approximately and sgd iterations with of examples in training classification and segmentation networks respectively training takes days day for classification network and days for segmentation network in single nvidia gtx titan gpu with memory note that training segmentation network is much faster in our setting while there is no change in training time of classification network results on pascal voc dataset our algorithm denoted by decouplednet is compared with two variations in wssl which is another algorithm based on learning with heterogeneous annotations we also test the performance of and deconvnet which only utilize examples with strong annotations to analyze the benefit of weak annotations all learned models in our experiment are based only on the training set not including the validation set in pascal voc dataset all algorithms except wssl report the results without crf segmentation accuracy is measured by intersection over union iou between and predicted segmentation and the mean iou over semantic categories is employed for the final performance evaluation table summarizes quantitative results on pascal voc validation set given the same amount of supervision decouplednet presents substantially better performance even without any than wssl which is directly comparable method in particular our algorithm has great advantage over wssl when the number of strong annotations is extremely small we believe that this is because decouplednet reduces search space for segmentation effectively by employing the bridging layers and the deep network can be trained with smaller number of images with strong annotations consequently our results are even more meaningful since training procedure of decouplednet is very straightforward compared to wssl and does not involve heuristic iterative procedures which are common in learning methods when there are only small number of strongly annotated training data our algorithm obviously outperforms and deconvnet by exploiting the rich information of weakly this is identical to decouplednet except that its classification and segmentation networks are trained with the same images where weak annotations are generated from segmentation annotations figure semantic segmentation results of several pascal voc validation images based on the models trained on different number of segmentation annotations notated images it is interesting that is clearly better than deconvnet especially when the number of training examples is small for reference the best accuracy of the algorithm based only on the examples with labels is which is much lower than our result with five strongly annotated images per class even though requires significant efforts for heuristic these results show that even little strong supervision can improve semantic segmentation performance dramatically table 
presents more comprehensive results of our algorithm in pascal voc test set our algorithm works well in general and approaches to the empirical fast with small number of strongly annotated images drawback of our algorithm is that it does not achieve the performance when the full supervision is provided in pascal voc dataset this is probably because our method optimizes classification and segmentation networks separately although joint optimization of two objectives is more desirable however note that our strategy is more appropriate for learning scenario as shown in our experiment figure presents several qualitative results from our algorithm note that our model trained only with five strong annotations per class already shows good generalization performance and that more training examples with strong annotations improve segmentation accuracy and reduce label confusions substantially refer to our project for more comprehensive qualitative evaluation conclusion we proposed novel deep neural network architecture for semantic segmentation with heterogeneous annotations where classification and segmentation networks are decoupled for both training and inference the decoupled network is conceptually appropriate for exploiting heterogeneous and unbalanced training data with class labels segmentation annotations and simplifies training procedure dramatically by discarding complex iterative procedures for intermediate label inferences bridging layers play critical role to reduce output space of segmentation and facilitate to learn segmentation network using handful number of segmentation annotations experimental results validate the effectiveness of our decoupled network which outperforms existing and approaches with substantial margins acknowledgement this work was partly supported by the ict program of ml center deepview and samsung electronics we did not include the validation set for training and have less training examples than the competitors http references mark everingham luc van gool christopher ki williams john winn and andrew zisserman the pascal visual object classes voc challenge ijcv jonathan long evan shelhamer and trevor darrell fully convolutional networks for semantic segmentation in cvpr chen george papandreou iasonas kokkinos kevin murphy and alan yuille semantic image segmentation with deep convolutional nets and fully connected crfs in iclr bharath hariharan pablo ross girshick and jitendra malik hypercolumns for object segmentation and localization in cvpr bharath hariharan pablo ross girshick and jitendra malik simultaneous detection and segmentation in eccv mohammadreza mostajabi payman yadollahpour and gregory shakhnarovich feedforward semantic segmentation with features cvpr pedro pinheiro and ronan collobert weakly supervised semantic segmentation with convolutional networks in cvpr george papandreou chen kevin murphy and alan yuille learning of dcnn for semantic image segmentation in iccv deepak pathak evan shelhamer jonathan long and trevor darrell fully convolutional multiple instance learning in iclr jifeng dai kaiming he and jian sun boxsup exploiting bounding boxes to supervise convolutional networks for semantic segmentation in iccv shuai zheng sadeep jayasumana bernardino vibhav vineet zhizhong su dalong du chang huang and philip torr conditional random fields as recurrent neural networks in iccv hyeonwoo noh seunghoon hong and bohyung han learning deconvolution network for semantic segmentation in iccv karen simonyan and andrew zisserman very deep convolutional 
networks for image recognition in iclr karen simonyan andrea vedaldi and andrew zisserman deep inside convolutional networks visualising image classification models and saliency maps in iclr workshop lawrence zitnick and piotr edge boxes locating object proposals from edges in eccv bharath hariharan pablo lubomir bourdev subhransu maji and jitendra malik semantic contours from inverse detectors in iccv yangqing jia evan shelhamer jeff donahue sergey karayev jonathan long ross girshick sergio guadarrama and trevor darrell caffe convolutional architecture for fast feature embedding arxiv preprint jia deng wei dong richard socher li kai li and li imagenet hierarchical image database in cvpr 
equilibrated adaptive learning rates for optimization harm de de devries yann de dauphiya yoshua bengio de abstract adaptive learning rate methods are computationally efficient ways to reduce the problems encountered when training large deep networks following recent work that strongly suggests that most of the critical points encountered when training such networks are saddle points we find how considering the presence of negative eigenvalues of the hessian could help us design better suited adaptive learning rate schemes we show that the popular jacobi preconditioner has undesirable behavior in the presence of both positive and negative curvature and present theoretical and empirical evidence that the socalled equilibration preconditioner is comparatively better suited to problems we introduce novel adaptive learning rate scheme called esgd based on the equilibration preconditioner our experiments show that esgd performs as well or better than rmsprop in terms of convergence speed always clearly improving over plain stochastic gradient descent introduction one of the challenging aspects of deep learning is the optimization of the training criterion over millions of parameters the difficulty comes from both the size of these neural networks and because the training objective is in the parameters stochastic gradient descent sgd has remained the method of choice for most practitioners of neural networks since the in spite of rich literature in numerical optimization although it is that methods considerably slow down when the objective function is it remains unclear how to best exploit structure when training deep networks because of the large number of parameters storing the full hessian or even approximation is not practical making parameter specific learning rates diagonal preconditioners one of the viable alternatives one of the open questions is how to set the learning rate for sgd adaptively both over time and for different parameters and several methods have been proposed see schaul et al and references therein on the other hand recent work dauphin et choromanska et has brought theoretical and empirical evidence suggesting that local minima are with high probability not the main obstacle to optimizing large and deep neural networks contrary to what was previously believed instead saddle points are the most prevalent critical points on the optimization path except when we approach the value of the global minimum these saddle points can considerably slow down training mostly because the objective function tends to be in the neighborhood of denotes first authors original preconditioned figure contour lines of saddle point black point problem for original function and transformed function by equilibration preconditioner gradient descent slowly escapes the saddle point in because it oscillates along the high positive curvature direction for the better conditioned function these oscillations are reduced and gradient descent makes faster progress these saddle points this raises the question can we take advantage of the saddle structure to design good and computationally efficient preconditioners in this paper we bring these threads together we first study diagonal preconditioners for saddle point problems and find that the popular jacobi preconditioner has unsuitable behavior in the presence of both positive and negative curvature instead we propose to use the equilibration preconditioner and provide new theoretical justifications for its use in section we provide specific arguments why 
equilibration is better suited to optimization problems than the jacobi preconditioner and empirically demonstrate this for small neural networks in section using this new insight we propose new adaptive learning rate schedule for sgd called esgd that is based on the equilibration preconditioner in section we evaluate the proposed method on two deep autoencoder benchmarks the results presented in section confirm that esgd performs as well or better than rmsprop in addition we empirically find that the update direction of rmsprop is very similar to equilibrated update directions which might explain its success in training deep neural networks preconditioning it is that gradient descent makes slow progress when the curvature of the loss function is very different in separate directions the negative gradient will be mostly pointing in directions of high curvature and small enough learning rate have to be chosen in order to avoid divergence in the largest positive curvature direction as consequence the gradient step makes very little progress in small curvature directions leading to the slow convergence often observed with methods preconditioning can be thought of as geometric solution to the problem of pathological curvature it aims to locally transform the optimization landscape so that its curvature is equal in all directions this is illustrated in figure for saddle point problem using the equilibration preconditioner section gradient descent method slowly escapes the saddle point due to the typical oscillations along the high positive curvature direction by transforming the function to be more equally curved it is possible for gradient descent to move much faster more formally we are interested in minimizing function with parameters rn we introduce preconditioning by linear change of variables with matrix we use this change of variables to define new function fˆ parameterized by that is equivalent to the original function fˆ the gradient and the hessian of this new function are by the chain rule fˆ with gradient descent iteration for the transformed function corresponds to θt for the original parameter in other words by the original gradient with positive definite matrix we effectively apply gradient descent to the problem after change of variables the curvature of this transformed function is given by the hessian and we aim to seek preconditioning matrix such that the new hessian has equal curvature in all directions one way to assess the success of in doing so is to compute the relative difference between the biggest and smallest curvature direction which is measured by the condition number of the hessian σmax σmin where σmax σmin denote respectively the biggest and smallest singular values of which are the absolute value of the eigenvalues it is important to stress that the condition number is defined for both definite and indefinite matrices the famous newton step corresponds to change of variables which makes the new hessian perfectly conditioned however change of variables only when the hessian is positive this is problem for loss surfaces where the hessian might be indefinite in fact recent studies dauphin et choromanska et has shown that saddle points are dominating the optimization landscape of deep neural networks implying that the hessian is most likely indefinite in such setting not valid preconditioner and applying newton step without modification would make you move towards the saddle point nevertheless it is important to realize that the concept of preconditioning extends to 
non-convex problems, and reducing ill-conditioning around a saddle point will often speed up gradient descent. At this point it is natural to ask whether there exists a valid preconditioning matrix that always perfectly conditions the new Hessian. The answer is yes, and the corresponding preconditioning matrix is the inverse of the absolute Hessian $|H|$, which is obtained by an eigendecomposition of $H$ and taking the absolute values of the eigenvalues (see the proposition in the appendix for a proof). It is the only (up to scaling) symmetric positive definite preconditioning matrix that perfectly reduces the condition number. Practically, there are several computational drawbacks to using $|H|^{-1}$ as a preconditioner: neural networks typically have millions of parameters, rendering it infeasible to store the Hessian, perform an eigendecomposition, and invert the matrix. Except for the eigendecomposition, other full-rank preconditioners face the same computational issues. We therefore look for more computationally affordable preconditioners that maintain efficiency in reducing the condition number of indefinite matrices. In this paper we focus on diagonal preconditioners, which can be stored, inverted, and multiplied by a vector in linear time. When diagonal preconditioners are applied in an online optimization setting in conjunction with SGD, they are often referred to as adaptive learning rates in the neural network literature.

Related work. The Jacobi preconditioner is one of the most widely used preconditioners. It is given by the absolute value of the diagonal of the Hessian, $D^J = |\mathrm{diag}(H)|$, where $|\cdot|$ denotes the element-wise absolute value. LeCun et al. propose an efficient approximation of the Jacobi preconditioner using the Gauss-Newton matrix; it has been shown to approximate the Hessian under certain conditions (Pascanu and Bengio). The merit of this approach is that it is efficient, but it is not clear what is lost by the approximation. What is more, the Jacobi preconditioner has not been found to be competitive for indefinite matrices (Bradley and Murray); this will be further explored for neural networks in a later section. (Footnotes: a real square root only exists when the matrix is positive semi-definite; a constant scaling can be incorporated into the learning rate.)

A recent revival of interest in adaptive learning rates was started by AdaGrad (Duchi et al.). AdaGrad collects information from the gradients across several parameter updates to tune the learning rate. This gives the diagonal preconditioning matrix $D^A$ with $D^A_{ii} = \sqrt{\sum_{\tau \le t} \big(\nabla f(\theta_\tau)\big)_i^2}$, which relies on the sum of squared gradients over the timesteps so far. Duchi et al. rely strongly on convexity to justify this method, which makes the application to neural networks difficult from a theoretical perspective. RMSProp (Tieleman and Hinton) and AdaDelta (Zeiler) were introduced as practical adaptive learning-rate methods for training large neural networks. Although RMSProp has been shown to work very well (Schaul et al.), there is not much understanding of its success in practice. Preconditioning might be a good framework to get a better understanding of such adaptive learning rate methods.

Equilibration. Equilibration is a preconditioning technique developed in the numerical mathematics community (van der Sluis). When solving a linear system $Ax = b$ with Gaussian elimination, significant round-off errors can be introduced when small numbers are added to big numbers (Datta). To circumvent this issue, it is advised to properly scale the rows of the matrix before starting the elimination process. This step is often referred to as row equilibration, which formally scales the rows of $A$ to unit magnitude in some norm; throughout the following we consider the $\ell_2$ norm. Row equilibration is equivalent to multiplying $A$ from the left by the diagonal matrix $D$ with $D_{ii} = \|A_{i,\cdot}\|_2^{-1}$. Instead of solving the original system, we now solve the equivalent left-preconditioned system $DAx = Db$.
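A few lines of numpy make the effect of row equilibration on a badly scaled system visible; this is only an illustrative toy under assumed settings (the matrix construction, size, and scaling are arbitrary), not an experiment from the paper.

import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) * np.logspace(0, 6, n)[:, None]  # rows of wildly different scale
b = rng.standard_normal(n)

D = np.diag(1.0 / np.linalg.norm(A, axis=1))   # row-equilibration matrix D_ii = 1/||A_i||_2
A_eq, b_eq = D @ A, D @ b                      # equivalent system  D A x = D b

print("cond(A)  =", np.linalg.cond(A))
print("cond(DA) =", np.linalg.cond(A_eq))      # typically orders of magnitude smaller

x = np.linalg.solve(A_eq, b_eq)                # same solution x as the original system
print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))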
equivalent left preconditioned system with and in this paper we apply the equilibration preconditioner in the context of large scale optimization however it is not straightforward how to apply the preconditioner by choosing the preconditioning matrix deii khi the hessian of the transformed function de de see section does not have equilibrated rows nevertheless its spectrum eigenvalues is equal to the spectrum of the row equilibrated hessian de and column equilibrated hessian de consequently if row equilibration succesfully reduces the condition number then the condition number of the formed hessian de de will be reduced by the same amount the proof is given by proposition from the above observation it seems more natural to seek for diagonal preconditioning matrix such that is row and column equilibrated in bradley murray an iterative stochastic procedure is proposed for finding such matrix however we did not find it to work very well in an online optimization setting and therefore stick to the original equilibration matrix de although the original motivation for row equilibration is to prevent errors our interest is in how well it is able to reduce the condition number intuitively can be result of matrix elements that are of completely different order scaling the rows to have equal norm could therefore significantly reduce the condition number although we are not aware of any proofs that row equilibration improves the condition number there are theoretical results that motivates its use in sluis it is shown that the condition number of row equilibrated matrix is at most factor worse than the diagonal preconditioning matrix that optimally reduces the condition number note that the bound grows sublinear in the dimension of the matrix and can be quite loose for the extremely large matrices we consider in this paper we provide an alternative justification using the following upper bound on the condition number from guggenheimer et al khkf the proof in guggenheimer et al provides useful insight when we expect tight upper bound to be tight if all singular values except for the smallest are roughly equal we prove by proposition that row equilibration improves this upper bound by factor det de khk it is easy see that the bound is more reduced when the norms of the rows convex figure histogram of the condition number reduction lower is better for random hessians in convex and setting equilibration clearly outperforms the other methods in the case are more varied note that the proof can be easily extended to column equilibration and row and column equilibration in contrast we can not prove that the jacobi preconditioner improves the upper bound which provides another justification for using the equilibration preconditioner deterministic implementation to calculate the of all matrix rows needs to access all matrix elements this is prohibitive for very large hessian that can not even be stored we therefore resort to estimator of the equilibration matrix that only uses matrix vector multiplications of the form hv where the square is and vi as shown by bradley murray this estimator is unbiased khi hv since multiplying the hessian by vector can be efficiently done without ever computing the hessian this method can be efficiently used in the context of neural networks using the schraudolph the computation only uses computations and costs about the same as two backpropagations equilibrated learning rates are well suited to problems in this section we demonstrate that equilibrated learning rates are well 
suited to optimization particularly compared to the jacobi preconditioner first the diagonal equilibration matrix can be seen as an approximation to diagonal of the absolute hessian reformulating the equilibration matrix as deii khi diag reveals an interesting connection changing the order of the square root and diagonal would give us the diagonal of in other words the equilibration preconditioner can be thought of as the jacobi preconditioner of the absolute hessian recall that the inverse of the absolute hessian is the only symmetric positive definite matrix that reduces the condition number to the proof of which can be be found in proposition in the appendix it can be considered as the gold standard if we do not take computational costs into account for indefinite matrices the diagonal of the hessian and the diagonal of the absolute hessian will be very different and therefore the behavior of the jacobi and equilibration preconditioner will also be very different in fact we argue that the jacobi preconditioner can cause divergence because it underestimates curvature we can measure the amount of curvature in given direction with the raleigh quotient vt hv vt any random variable vi with zero mean and unit variance can be used algorithm equilibrated gradient descent require function to minimize learning rate and damping factor for do hv end for this quotient is large when there is lot of curvature in the direction the raleigh quopn tient can be decomposed into λj qj qj where λj and qj are the eigenvalues and eigenvectors of it is easy to show that each element of the jacobi matrix is given by pn djii λj an element djii is the inverse of the sum of the eigenvalues λj negative eigenvalues will reduce the total sum and make the step much larger than it should specifically imagine diagonal element where there are large positive and negative curvature eigendirections the contributions of these directions will cancel each other and large step will be taken in that direction however the function will probably also change fast in that direction because of the high curvature and the step is too large for the local quadratic approximation we have considered equilibration methods never diverge this way because they will not underestimate curvature in equilibration the curvature information is given by the raleigh quotient of the squared hessian deii note that all the elements are positive and so will not cancel jensen inequality then gives us an upper bound deii ii which ensures that equilibrated adaptive learning rate will in fact be more conservative than the jacobi preconditioner of the absolute hessian see proposition for proof this strengthens the links between equilibration and the absolute hessian and may explain why equilibration has been found to work well for indefinite matrices bradley murray we have verified this claim experimentally for random neural networks the neural networks have hidden layer of sigmoid units with zero mean gaussian distributed inputs weights and biases the output layer is softmax with the target generated randomly we also give results for similarly sampled logistic regressions we compare reductions of the condition number between the methods figure gives the histograms of the condition number reductions we obtained these graphs by sampling hundred networks and computing the ratio of the condition number before and after preconditioning on the left we have the convex case and on the right the case we clearly observe that the jacobi and equilibration method are 
closely matched for the convex case however in the case equilibration significantly outperforms the other methods note that the poor performance of the diagonal only means that its success in optimization is not due to preconditioning as we will see in section these results extend to practical highdimensional problems implementation we propose to build scalable algorithm for preconditioning neural networks using equilibration this method will estimate the same curvature information diag with the unbiased estimator described in equation it is prohibitive to compute the full expectation at each learning step instead we will simply update our running average at each learning step much like rmsprop the is given in algorithm the additional costs are one product with the hessian which is roughly the cost of two additional gradient calculations and the sampling random gaussian vector in practice we greatly amortize the cost by only performing the update every iterations this brings the cost of equilibration very close to that of regular sgd the only added is the damping we find that good setting for that is and it is robust over the tasks we considered mnist curves figure learning curves for deep on mnist and curves comparing the different preconditioned sgd methods in the interest of comparison we will evaluate sgd preconditioned with the jacobi preconditioner this will allow us to verify the claims that the equilibration preconditioner is better suited for nonconvex problems bekas et al show that the diagonal of matrix can be recovered by the expression diag hv where are random vectors with entries and is the product we use this estimator to precondition sgd in thepsame fashion as that described in algorithm the variance of this estimator for an element is hji while the method in martens et al has therefore the optimal method depends on the situation the computational complexity is the same as esgd experimental setup we consider the challenging optimization benchmark of training very deep neural networks following martens sutskever et al vinyals povey we train deep which have to reconstruct their input under the constraint that one layer is very the networks have up to layers of sigmoidal hidden units and have on the order of million parameters we use the standard network architectures described in martens for the mnist and curves dataset both of these datasets have input dimensions and and examples respectively we tune the of the optimization methods with random search we have sampled the learning rate from logarithmic scale between for stochastic gradient descent sgd and equilibrated sgd esgd the learning rate for rmsprop and the jacobi preconditioner are sampled from the damping factor used before dividing the gradient is taken from either while the exponential decay rate of rmsprop is taken from either the networks are initialized using the sparse initialization described in martens the minibatch size for all methods in we do not make use of momentum in these experiments in order to evaluate the strength of each preconditioning method on its own similarly we do not use any regularization because we are only concerned with optimization performance for these reasons we report training error in our graphs the networks and algorithms were implemented using theano bastien et al simplifying the use of the in jacobi and equilibrated sgd all experiments were run on gpu results comparison of preconditioned sgd methods we compare the different adaptive learning rates for training deep in figure we 
don use momentum to better isolate the performance of each method we believe this is important because rmsprop has been found not to mix well with momentum tieleman hinton thus the results presented are not but they do reach state of the art when momentum is used mnist curves figure cosine distance between the diagonals estimated by each method during the training of deep trained on mnist and curves we can see that rmsprop estimates quantity close to the equilibration matrix our results on mnist show that the proposed esgd method significantly outperforms both rmsprop and jacobi sgd the difference in performance becomes especially notable after epochs sutskever et al reported performance of of training mse for sgd without momentum and we can see all adaptive learning rates improve on this result with equilibration reaching we observe convergence speed that is approximately three times faster then our baseline sgd esgd also performs best for curves although the difference with rmsprop and jacobi sgd is not as significant as for mnist we show in the next section that the smaller gap in performance is due to the different preconditioners behaving the same way on this dataset measuring the similarity of the methods we train pdeep autoencoders with rmsprop and measure every epochs the equilibration matrix de diag and jacobi matrix dj diag using samples of the unbiased estimators described in equations respectively we then measure the pairwise differences between these quantities in terms of the cosine distance cosine kukkvk which measures the angle between two vectors and ignores their norms figure shows the resulting cosine distances over training on mnist and curves for the latter dataset we observe that rmsprop remains remarkably close around to equilibration while it is significantly different from jacobi in the order of the same order of difference is observed when we compare equilibration and jacobi confirming the observations of section that both quantities are rather different in practice for the mnist dataset we see that rmsprop fairly well estimates diag in the beginning of training but then quickly diverges after epochs this difference has exceeded the difference between jacobi and equilibration and rmsprop no longer matches equilibration interestingly at the same time that rmsprop starts diverging we observe in figure that also the performance of the optimizer drops in comparison to esgd this may suggests that the success of rmsprop as optimizer is tied to its similarity to the equilibration matrix conclusion we have studied diagonal preconditioners for saddle point problems indefinite matrices we have shown by theoretical and empirical arguments that the equilibration preconditioner is comparatively better suited to this kind of problems than the jacobi preconditioner using this insight we have proposed novel adaptive learning rate schedule for optimization problems called esgd which empirically outperformed rmsprop on two competitive deep autoencoder benchmark interestingly we have found that the update direction of rmsprop was in practice very similar to the equilibrated update direction which might provide more insight into why rmsprop has been so successfull in training deep neural networks more research is required to confirm these results however we hope that our findings will contribute to better understanding of sgd adaptive learning rate schedule for large scale optimization problems references bastien lamblin pascal pascanu razvan bergstra james goodfellow ian bergeron 
Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS Workshop.
Bekas, Costas, Kokiopoulou, Effrosyni, and Saad, Yousef. An estimator for the diagonal of a matrix. Applied Numerical Mathematics.
Bradley, Andrew and Murray, Walter. Matrix-free approximate equilibration. arXiv preprint.
Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, Gérard Ben, and LeCun, Yann. The loss surface of multilayer networks.
Datta, Biswa Nath. Numerical Linear Algebra and Applications, second edition. SIAM.
Dauphin, Yann, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS.
Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.
Guggenheimer, Heinrich, Edelman, Alan, and Johnson, Charles. A simple estimate of the condition number of a linear system. The College Mathematics Journal.
LeCun, Yann, Bottou, Léon, Orr, Genevieve, and Müller, Klaus-Robert. Efficient BackProp. In Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science (LNCS). Springer Verlag.
Martens, James. Deep learning via Hessian-free optimization. In ICML.
Martens, James, Sutskever, Ilya, and Swersky, Kevin. Estimating the Hessian by back-propagating curvature. arXiv preprint.
Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. In International Conference on Learning Representations (conference track).
Schaul, Tom, Antonoglou, Ioannis, and Silver, David. Unit tests for stochastic optimization. arXiv preprint.
Schraudolph, Nicol. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation.
Sluis, A. van der. Condition numbers and equilibration of matrices. Numerische Mathematik.
Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and momentum in deep learning. In ICML.
Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5 - RMSprop: divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning.
Vinyals, Oriol and Povey, Daniel. Krylov subspace descent for deep learning. arXiv preprint.
Zeiler, Matthew. ADADELTA: an adaptive learning rate method. Technical report, arXiv.
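To make the preconditioning and estimation steps discussed above concrete, here is a minimal, self-contained numpy sketch (not the authors' code): it builds a small indefinite "Hessian", checks that preconditioning by the absolute Hessian |H| = Q|Λ|Qᵀ equalizes curvature, estimates the equilibration diagonal ||H_i||₂ from Hessian-vector products via the unbiased identity E[(Hv)_i²] = ||H_i||₂² for v ~ N(0, I), and runs an ESGD-style update on a toy quadratic. The toy functions, learning rate, damping constant, and sample counts are arbitrary assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- a small indefinite "Hessian" ----------------------------------------
H = np.diag([10.0, 1.0, -0.1])

def cond(M):
    s = np.linalg.svd(M, compute_uv=False)
    return s.max() / s.min()

# absolute Hessian |H|: eigendecompose and take absolute values of the eigenvalues
w, Q = np.linalg.eigh(H)
D_abs_inv_sqrt = Q @ np.diag(1.0 / np.sqrt(np.abs(w))) @ Q.T
print(cond(H))                                    # 100: ill-conditioned
print(cond(D_abs_inv_sqrt @ H @ D_abs_inv_sqrt))  # 1: perfectly conditioned

# --- matrix-free estimate of the equilibration diagonal ------------------
def hvp(v):
    # stand-in for an R-operator / double-backprop Hessian-vector product
    return H @ v

samples = 5000
acc = np.zeros(H.shape[0])
for _ in range(samples):
    v = rng.normal(size=H.shape[0])
    acc += hvp(v) ** 2
row_norm_est = np.sqrt(acc / samples)             # approximates the row norms ||H_i||_2
print(row_norm_est, np.linalg.norm(H, axis=1))

# --- ESGD-style loop on a toy (convex but badly conditioned) quadratic ---
A = np.diag([100.0, 1.0, 0.01])                   # curvature spans four orders of magnitude
grad = lambda th: A @ th
hvp_A = lambda v: A @ v

theta = np.ones(3)
D = np.zeros(3)                                   # running sum of (Hv)^2
lr, damping = 0.1, 1e-2
for t in range(1, 501):
    v = rng.normal(size=3)
    D += hvp_A(v) ** 2
    theta -= lr * grad(theta) / (np.sqrt(D / t) + damping)
print(theta)                                      # all coordinates shrink at comparable rates
```

In the algorithm described above, the curvature estimate is kept as a running average and refreshed only every few iterations, which amortizes the cost of the extra Hessian-vector product; the sketch updates it every step purely for brevity.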
back hift learning causal cyclic graphs from unknown shift interventions dominik seminar statistik eth switzerland rothenhaeusler jonas peters max planck institute for intelligent systems germany christina seminar statistik eth switzerland heinze nicolai meinshausen seminar statistik eth switzerland meinshausen abstract we propose simple method to learn linear causal cyclic models in the presence of latent variables the method relies on equilibrium data of the model recorded under specific kind of interventions shift interventions the location and strength of these interventions do not have to be known and can be estimated from the data our method called back hift only uses second moments of the data and performs simple joint matrix diagonalization applied to differences between covariance matrices we give sufficient and necessary condition for identifiability of the system which is fulfilled almost surely under some quite general assumptions if and only if there are at least three distinct experimental settings one of which can be pure observational data we demonstrate the performance on some simulated data and applications in flow cytometry and financial time series introduction discovering causal effects is fundamentally important yet very challenging task in various disciplines from public health research and sociological studies economics to many applications in the life sciences there has been much progress on learning acyclic graphs in the context of structural equation models including methods that learn from observational data alone under faithfulness assumption exploiting of the data or feedbacks are prevalent in most applications and we are interested in the setting of where we observe the equilibrium data of model that is characterized by set of linear relations bx where is random vector and is the connectivity matrix with zeros on the diagonal no allowing for would lead to an identifiability problem independent of the method see section in the appendix for more details on this setting the graph corresponding to has nodes and an edge from node to node if and only if bi the error terms are random variables with mean and positive covariance matrix eet we do not assume that is diagonal matrix which allows the existence of latent variables the solutions to can be thought of as the deterministic equilibrium solutions conditional on the noise term of dynamic model governed by difference equations with matrix in the authors contributed equally sense of for equilibrium solutions of we need that is invertible usually we also want to converge to an equilibrium when iterating as new bx old or in other words limm bm this condition is equivalent to the spectral radius of being strictly smaller than one we will make an assumption on cyclic graphs that restricts the strength of the feedback specifically let cycle of length be given by and mk for we define the cp of matrix to be the maximum over cycles of all lengths of the cp max mk cycle the cp is clearly zero for acyclic graphs we will assume the to be strictly smaller than one for identifiability results see assumption below the most interesting graphs are those for which cp and for which the spectral radius of is strictly smaller than one note that these two conditions are identical as long as the cycles in the graph do not intersect there is no node that is part of two cycles for example if there is at most one cycle in the graph if cycles do intersect we can have models for which either cp but the spectral radius is larger than one or 
ii cp but the spectral radius is strictly smaller than one models in situation ii are not stable in the sense that the iterations will not converge under interventions we can for example block all but one cycle if this one single unblocked cycle has larger than and there is such cycle in the graph if cp then the solutions of the iteration are not models in situation are not stable either even in the absence of interventions we can still in theory obtain the now instable equilibrium solutions to as and the theory below applies to these instable equilibrium solutions however such instable equilibrium solutions are arguably of little practical interest in summary all interesting feedback models that are stable under interventions satisfy both cp and have spectral radius strictly smaller than one we will just assume cp for the following results it is impossible to learn the structure of this model from observational data alone without making further assumptions the ingam approach has been extended in to cyclic models exploiting possible of the data using both experimental and interventional data could show identifiability of the connectivity matrix under learning mechanism that relies on data under surgical or perfect interventions in their framework variable becomes independent of all its parents if it is being intervened on and all incoming contributions are thus effectively removed under the intervention also called in the classical sense of the learning mechanism makes then use of the knowledge where these surgical interventions occurred also allow for changing the incoming arrows for variables that are intervened on but again requires the location of the interventions while we do not assume such knowledge consider target variable and allow for arbitrary interventions on all other nodes they neither permit hidden variables nor cycles here we are interested in setting where we have either no or just very limited knowledge about the exact location and strength of the interventions as is often the case for data observed under different environments see the example on financial time series further below or for biological data these interventions have been called or uncertain interventions while assume acyclicity and model the structure explicitly in bayesian setting we assume that the data in environment are equilibrium observations of the model xj bxj cj ej where the random intervention shift cj has mean and covariance the location of these interventions or simply the intervened variables are those components of cj that are not zero with probability one given these locations the interventions simply shift the variables by value determined by cj they are therefore not surgical but can be seen as special case of what is called an imperfect parametric or dependent intervention or mechanism change the matrix and the error distribution of ej are assumed to be identical in all environments in contrast to the covariance matrix for the noise term ej we do assume that is diagonal the blocking of all but one cycle can be achieved by on appropriate variables under the following condition for every pair of cycles in the graph the variables in one cycle can not be subset of the variables in the other cycle otherwise the blocking could be achieved by deletion of appropriate edges matrix which is equivalent to demanding that interventions at different variables are uncorrelated this is key assumption necessary to identify the model using experimental data furthermore we will discuss in section how violation 
of the model assumption can be detected and used to estimate the location of the interventions in section we show how to leverage observations under different environments with different interventional distributions to learn the structure of the connectivity matrix in model the method rests on simple joint matrix diagonalization we will prove necessary and sufficient conditions for identifiability in section numerical results for simulated data and applications in flow cytometry and financial data are shown in section method grouping of data let be the set of experimental conditions under which we observe equilibrium data from model these different experimental conditions can arise in two ways controlled experiment was conducted where the external input or the external imperfect interventions have been deliberately changed from one member of to the next an example are the flow cytometry data discussed in section the data are recorded over time it is assumed that the external input is changing over time but not in an explicitly controlled way the data are grouped into consecutive blocks of observations see section for an example notation assume we have nj observations in each setting let xj be the nj of observations from model for general random variables aj rp the population covariance matrix in setting is called cov aj where the covariance is under the setting furthermore the covariance on all settings except setting is defined as an average over all environments except for the environment the population be the empirical gram matrix is defined as ga aj aj let the covariance matrix of the observations aj rnj of variable aj in setting more precisely nj the let be the version of aj then empirical gram matrix is denoted by nj aj aj assumptions the main assumptions have been stated already but we give summary below the data are observations of the equilibrium observations of model the matrix is invertible and the solutions to are thus well defined the cp is strictly smaller than one the diagonal entries of are zero the distribution of the noise ej which includes the influence of latent variables and the connectivity matrix are identical across all settings in each setting the intervention shift cj and the noise ej are uncorrelated interventions at different variables in the same setting are uncorrelated that is is an unknown diagonal matrix for all we will discuss stricter version of in section in the appendix that allows the use of gram matrices instead of covariance matrices the conditions above imply that the environments are characterized by different interventions strength as measured by the variance of the shift in each setting we aim to reconstruct both the connectivity matrix from observations in different environments and also aim to reconstruct the unknown intervention strength and location in each environment additionally we will show examples where we can detect violations of the model assumptions and use these to reconstruct the location of interventions population method the main idea is very simple looking at the model we can rewrite xj cj ej the population covariance of the transformed observations are then for all settings given by the last term is constant across all settings but not necessarily diagonal as we allow hidden variables any change of the matrix on the side thus stems from shift in the covariance matrix of the interventions let us define the difference between the covariance of and in setting as and assumption together with implies that using assumption the random 
intervention shifts at different variables are uncorrelated and the side in is thus diagonal matrix for all let be the set of all invertible matrices we also define more restricted space dcp which only includes those members of that have entries all equal to one on the diagonal and have less than one invertible dcp and diag and cp under assumption dcp motivated by we now consider the minimizer where is the loss for any matrix and defined as the sum of the squared elements in section we present necessary and sufficient conditions on the interventions under which is the unique minimizer of in this case exact joint diagonalization is possible so that dt for all environments we discuss an alternative that replaces covariance with gram matrices throughout in section in the appendix we now give version estimate of the connectivity matrix in practice we estimate by minimizing algorithm back hift the empirical counterpart of in two input xj steps first the solution of the tion is only constrained to matrices in compute subsequently we enforce the constraint on ffd iag the solution to be member of dcp the permuteandscale back hift algorithm is presented in algorithm and we describe the important output steps in more detail below steps first we minimize the following empirical less constrained variant of argmin where the population differences between covariance matrices are replaced with their empirical counterparts and the only constraint on the solution is that it is invertible for the optimization we use the joint approximate matrix diagonalization algorithm ffd iag step the constraint on the cycle product and the diagonal elements of is enforced by permuting and scaling the rows of part simply scales the rows so that the diagonal elements of the resulting matrix are all equal to one the more challenging first step consists of finding permutation such that under this permutation the scaled matrix from part will have cycle product as small as possible as follows from theorem at most one permutation can lead to cycle product less than one this optimization problem seems computationally challenging at first but we show that it can be solved by variant of the linear assignment problem lap see as proven in theorem in the appendix as last step we check whether the cycle product of is less than one in which case we have found the solution otherwise no solution satisfying the model assumptions exists and we return warning that the model assumptions are not met see appendix for more details computational cost the computational complexity of back hift is as computing the covariance matrices costs ffd iag has computational cost of and both the linear assignment problem and computing the cycle product can be solved in time for instance this complexity is achieved when using the hungarian algorithm for the linear assignment problem see and the cycle product can be computed with simple dynamic programming approach estimating the intervention variances one additional benefit of back hift is that the location and strength of the interventions can be estimated from the data the empirical version of eq is given by so the element kk is an estimate for the difference between the variance of the intervention at variable in environment namely kk and the average in all other environments kk from these differences we can compute the intervention variance for all environments up to an offset by convention we set the minimal intervention variance across all environments equal to zero alternatively one can let 
observational data if available serve as baseline against which the intervention variances are measured identifiability let for simplicity of notation kk be the variance of the random intervention shifts cj at node in environment as per the definition of in we then have the following identifiability result the proof is provided in appendix theorem under assumptions and the solution to is unique if and only if for all there exist such that if none of the intervention variances vanishes the uniqueness condition is equivalent to demanding that the ratio between the intervention variances for two variables must not stay identical across all environments that is there exist such that which requires that the ratio of the variance of the intervention shifts at two nodes is not identical across all settings this leads to the following corollary corollary the identifiability condition can not be satisfied if since then for all and we need at least three different environments for identifiability ii the identifiability condition is satisfied for all almost surely if the variances of the intervention cj are chosen independently over all variables and environments from distribution that is absolutely continuous with respect to lebesgue measure condition ii can be relaxed but shows that we can already achieve full identifiability with very generic setting for three or more different environments numerical results in this section we present empirical results for both synthetic and real data sets in addition to estimating the connectivity matrix we demonstrate various ways to estimate properties of the interventions besides computing the point estimate for back hift we use stability selection to assess the stability of retrieved edges we attach with which all simulations and analyses can be an called backshift is available from cran interv strength sample size hidden vars false true method backshift ling precision recall figure simulated data true network scheme for data generation performance metrics for the settings considered in section for back hift precision and recall values for settings and coincide back hift setting no hidden vars mi setting no hidden vars mi ing shd shd shd shd shd setting no hidden vars mi shd shd shd setting no hidden vars mi setting hidden vars mi shd shd figure point estimates of back hift and ing for synthetic data we threshold the point estimate of at to exclude those entries which are close to zero we then threshold the estimate of ing so that the two estimates have the same number of edges in setting we threshold ing at as back hift returns the empty graph in setting it is not possible to achieve the same number of edges as all remaining coefficients in the point estimate of ing are equal to one in absolute value the transparency of the edges illustrates the relative magnitude of the estimated coefficients we report the structural hamming distance shd for each graph precision and recall values are shown in figure back hift synthetic data we compare the point estimate of back hift against ing generalization of ingam to the cyclic case for purely observational data we consider the cyclic graph shown in figure and generate data under different scenarios the data generating mechanism is sketched in figure specifically we generate ten distinct environments with noise in each environment the random intervention variable is generated as cj kj ikj where pj are drawn from exp mi and ipj are independent standard normal random variables the intervention shift thus acts on all 
observed random variables the parameter mi regulates the strength of the intervention if hidden variables exist the noise term ej of variable in environment is equal to where the weights are sampled once from and the random variable has laplace distribution if no hidden variables are present then ej is sampled laplace in this set of experiments we consider five different settings described below in which the sample size the intervention strength mi as well as the existence of hidden variables varies we allow for hidden variables in only one out of five settings as ing assumes causal sufficiency and can thus in theory not cope with hidden variables if no hidden variables are present the pooled data can be interpreted as coming from model whose error variables follow mixture distribution but if one of the error variables comes from the second mixture component for example the other plcg plcg erk mek erk raf raf akt jnk jnk pka pkc raf akt jnk pka pkc mek erk raf akt pka plcg mek erk akt plcg mek jnk pka pkc pkc figure flow cytometry data union of the consensus network according to the reconstruction by and the best acyclic reconstruction by the edge thickness and intensity reflect in how many of these three sources that particular edge is present one of the cyclic reconstructions by the edge thickness and intensity reflect the probability of selecting that particular edge in the stability selection procedure for more details see back hift point estimate thresholded at the edge intensity reflects the relative magnitude of the coefficients and the coloring is comparison to the union of the graphs shown in panels and blue edges were also found in and purple edges are reversed and green edges were not previously found in or back hift stability selection result with parameters and the edge thickness illustrates how often an edge was selected in the stability selection procedure error variables come from the second mixture component too in this sense the data points are not independent anymore this poses challenge for ing which assumes an sample we also cover case for mi in which all assumptions of ing are satisfied scenario figure shows the estimated connectivity matrices for five different settings and figure shows the obtained precision and recall values in setting mi and there are no hidden variables in setting is increased to while the other parameters do not change we observe that back hift retrieves the correct adjacency matrix in both cases while ing estimate is not very accurate it improves slightly when increasing the sample size in setting we do include hidden variables which violates the causal sufficiency assumption required for ing indeed the estimate is worse than in setting but somewhat better than in setting back hift retrieves two false positives in this case setting is not feasible for back hift as the distribution of the variables is identical in all environments since mi in step of the algorithm ffd iag does not converge and therefore the empty graph is returned so the recall value is zero while precision is not defined for ing all assumptions are satisfied and the estimate is more accurate than in the settings lastly setting shows that when increasing the intervention strength to back hift returns few false positives its performance is then similar to ing which returns its most accurate estimate in this scenario the stability selection results for back hift are provided in figure in appendix in short these results suggest that the back hift point estimates are close to the 
true graph if the interventions are sufficiently strong hidden variables make the estimation problem more difficult but the true graph is recovered if the strength of the intervention is increased when increasing mi to in setting back hift obtains shd of zero in contrast ing is unable to cope with hidden variables but also has worse accuracy in the absence of hidden variables under these shift interventions flow cytometry data the data published in is an instance of data set where the external interventions differ between the environments in and might act on several compounds simultaneously there are nine different experimental conditions with each containing roughly observations which correspond to measurements of the concentration of biochemical agents in single cells the first setting corresponds to purely observational data in addition to the original work by the data set has been described and analyzed in and we compare against the results of and the consensus according to shown in figures and figure shows the thresholded back hift point estimate most of the retrieved edges were also found in at least one of the previous studies five edges are reversed in our estimate and three edges were not discovered previously figure shows the corresponding stability selection result with the expected number of falsely selected variables nasdaq time prices logarithmic est intervention variance nasdaq est intervention variance dax log price dax dax nasdaq time time daily back hift dax nasdaq time ing figure financial time series with three stock indices nasdaq blue technology index green american equities and dax red german equities prices of the three indices between may and end of on logarithmic scale the scaled daily change in of the three instruments are shown three periods of increased volatility are visible starting with the bust on the left to the financial crisis in and the august downturn the scaled estimated intervention variance with the estimated back hift network the three are clearly separated as originating in technology american and european equities in contrast the analogous ing estimated intervention variances have peak in american equities intervention variance during the european debt crisis in this estimate is sparser in comparison to the other ones as it bounds the number of false discoveries notably the feedback loops between plcg and pkc jnk were also found in it is also noteworthy that we can check the model assumptions of shift interventions which is important for these data as they can be thought of as changing the mechanism or activity of biochemical agent rather than regulate the biomarker directly if the shift interventions are not appropriate we are in general not able to diagonalize the differences in the covariance matrices large elements in the estimate of the in indicate mechanism change that is not just explained by shift intervention as in in four of the seven interventions environments with known intervention targets the largest mechanism violation happens directly at the presumed intervention target see appendix for details it is worth noting again that the presumed intervention target had not been used in reconstructing the network and mechanism violations financial time series finally we present an application in financial time series where the environment is clearly changing over time we consider daily data from three stock indices nasdaq and dax for period between and group the data into overlapping blocks of consecutive days each we take as shown in panel 
of figure and estimate the connectivity matrix which is fully connected in this case and perhaps of not so much interest in itself it allows us however to estimate the intervention strength at each of the indices according to shown in panel the intervention variances separate very well the origins of the three major of the markets on the period technology is correctly estimated by back hift to be at the epicenter of the crash in nasdaq as proxy american equities during the financial crisis in proxy is and european instruments dax as best proxy during the august downturn conclusion we have shown that cyclic causal networks can be estimated if we obtain covariance matrices of the variables under unknown shift interventions in different environments back hift leverages solutions to the linear assignment problem and joint matrix diagonalization and the part of the computational cost that depends on the number of variables is at worst cubic we have shown sufficient and necessary conditions under which the network is fully identifiable which require observations from at least three different environments the strength and location of interventions can also be reconstructed references bollen structural equations with latent variables john wiley sons new york usa spirtes glymour and scheines causation prediction and search mit press cambridge usa edition chickering optimal structure identification with greedy search journal of machine learning research maathuis kalisch and estimating intervention effects from observational data annals of statistics hauser and characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs journal of machine learning research hoyer janzing mooij peters and nonlinear causal discovery with additive noise models in advances in neural information processing systems nips pages shimizu inazumi sogawa kawahara washio hoyer and bollen directlingam direct method for learning linear structural equation model journal of machine learning research mooij janzing heskes and on causal discovery with cyclic additive noise models in advances in neural information processing systems nips pages hyttinen eberhardt and hoyer learning linear cyclic causal models with latent variables journal of machine learning research lauritzen and richardson chain graph models and their causal interpretations journal of the royal statistical society series lacerda spirtes ramsey and hoyer discovering cyclic causal models by independent components analysis in proceedings of the conference on uncertainty in artificial intelligence uai pages scheines eberhardt and hoyer combining experiments to discover linear cyclic models with latent variables in international conference on artificial intelligence and statistics aistats pages pearl causality models reasoning and inference cambridge university press new york usa edition eberhardt hoyer and scheines combining experiments to discover linear cyclic models with latent variables in international conference on artificial intelligence and statistics aistats pages peters and meinshausen causal inference using invariant prediction identification and confidence intervals journal of the royal statistical society series to jackson bartz schelter kobayashi burchard mao li cavet and linsley expression profiling reveals gene regulation by rnai nature biotechnology kulkarni booker silver friedman hong perrimon and evidence of effects associated with long dsrnas in drosophila melanogaster assays nature methods eaton and murphy 
exact bayesian structure learning from uncertain interventions in international conference on artificial intelligence and statistics aistats pages eberhardt and scheines interventions and causal inference philosophy of science korb hope nicholson and axnick varieties of causal intervention in proceedings of the pacific rim conference on ai pages tian and pearl causal discovery from changes in proceedings of the conference annual conference on uncertainty in artificial intelligence uai pages sachs perez pe er lauffenburger and nolan causal networks derived from multiparameter data science ziehe laskov nolte and fast algorithm for joint diagonalization with nonorthogonal transformations and its application to blind source separation journal of machine learning research burkard quadratic assignment problems in pardalos du and graham editors handbook of combinatorial optimization pages springer new york edition meinshausen and stability selection journal of the royal statistical society series mooij and heskes cyclic causal discovery from continuous equilibrium data in proceedings of the annual conference on uncertainty in artificial intelligence uai pages 
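To illustrate the mechanics of the method above, here is a small, self-contained numpy simulation (a sketch under assumed parameters, not the authors' backShift implementation): data are generated from the equilibrium of X = BX + c + e in three environments with uncorrelated shift interventions, the differences between the covariance in environment j and the average covariance over the other environments are computed, and the key population identity that (I − B) (Σ_j − Σ_{−j}) (I − B)ᵀ is approximately diagonal is verified using the true B; the diagonal then recovers the differences of intervention variances. Estimating B itself would additionally require the joint matrix diagonalization (e.g. FFDIAG) and the permute-and-scale step described above. The graph, intervention strengths, and sample sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_env, n = 4, 3, 200_000

# connectivity matrix with a cycle 0 -> 1 -> 2 -> 0; cycle product 0.5^3 < 1
B = np.zeros((p, p))
B[1, 0], B[2, 1], B[0, 2], B[3, 1] = 0.5, 0.5, 0.5, 0.4
I_minus_B = np.eye(p) - B

# shift-intervention variances per environment (uncorrelated across variables)
shift_var = np.array([[0.0, 0.0, 0.0, 0.0],      # environment 0: purely observational
                      [2.0, 0.0, 1.0, 0.0],
                      [0.0, 3.0, 0.0, 0.5]])

def simulate(j):
    e = rng.laplace(size=(n, p))                         # noise, same law in all environments
    c = rng.normal(size=(n, p)) * np.sqrt(shift_var[j])  # random shift interventions
    # equilibrium of X = BX + c + e  =>  X = (I - B)^{-1} (c + e)
    return (c + e) @ np.linalg.inv(I_minus_B).T

covs = [np.cov(simulate(j), rowvar=False) for j in range(n_env)]

for j in range(n_env):
    sigma_minus_j = np.mean([covs[k] for k in range(n_env) if k != j], axis=0)
    delta = covs[j] - sigma_minus_j
    M = I_minus_B @ delta @ I_minus_B.T              # should be (approximately) diagonal
    off_diag = M - np.diag(np.diag(M))
    # diag(M) estimates the intervention variance in environment j minus the
    # average intervention variance in the other environments, variable by variable
    print(j, np.round(np.diag(M), 2), np.abs(off_diag).max())
```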
and robust cvar optimization approach yinlam chow stanford university ychow aviv tamar uc berkeley avivt shie mannor technion shie marco pavone stanford university pavone abstract in this paper we address the problem of decision making within markov decision process mdp framework where risk and modeling errors are taken into account our approach is to minimize cvar objective as opposed to standard expectation we refer to such problem as cvar mdp our first contribution is to show that cvar objective besides capturing risk sensitivity has an alternative interpretation as expected cost under modeling errors for given error budget this result which is of independent interest motivates cvar mdps as unifying framework for and robust decision making our second contribution is to present an approximate algorithm for cvar mdps and analyze its convergence rate to our knowledge this is the first solution algorithm for cvar mdps that enjoys error guarantees finally we present results from numerical experiments that corroborate our theoretical findings and show the practicality of our approach introduction decision making within the markov decision process mdp framework typically involves the minimization of performance objective namely the expected total discounted cost this approach while very popular natural and attractive from computational standpoint neither takes into account the variability of the cost fluctuations around the mean nor its sensitivity to modeling errors which may significantly affect overall performance mdps address the first aspect by replacing the expectation with of the total discounted cost such as variance var or cvar robust mdps on the other hand address the second aspect by defining set of plausible mdp parameters and optimize decision with respect to the expected cost under parameters in this work we consider mdps with cvar objective referred to as cvar mdps cvar is that is rapidly gaining popularity in various engineering applications finance due to its favorable computational properties and superior ability to safeguard decision maker from the outcomes that hurt the most in this paper by relating risk to robustness we derive novel result that further motivates the usage of cvar objective in context specifically we show that the cvar of discounted cost in an mdp is equivalent to the expected value of the same discounted cost in presence of perturbations of the mdp parameters specifically transition probabilities provided that such perturbations are within certain error budget this result suggests cvar mdp as method for decision making under both cost variability and model uncertainty motivating it as unified framework for planning under uncertainty literature review mdps have been studied for over four decades with earlier efforts focusing on exponential utility and percentile risk criteria recently for the reasons explained above several authors have investigated cvar mdps specifically in the authors propose dynamic programming algorithm for mdps where risk is measured according to cvar the algorithm is proven to asymptotically converge to an optimal policy however the algorithm involves computing integrals over continuous variables algorithm in and in general its implementation appears particularly difficult in the authors investigate the structure of cvar optimal policies and show that markov policy is optimal on an augmented state space where the additional continuous state variable is represented by the running cost in the authors leverage such result to design an 
algorithm for cvar mdps that relies on discretizing occupation measures in the mdp this approach however involves solving program via sequence of approximations which can only shown to converge asymptotically different approach is taken by and which consider finite dimensional parameterization of control policies and show that cvar mdp can be optimized to local optimum using stochastic gradient descent policy gradient recent result by pflug and pichler showed that cvar mdps admit dynamic programming formulation by using procedure different from the one in the augmented state is also continuous making the design of solution algorithm challenging contributions the contribution of this paper is twofold first as discussed above we provide novel interpretation for cvar mdps in terms of robustness to modeling errors this result is of independent interest and further motivates the usage of cvar mdps for decision making under uncertainty second we provide new optimization algorithm for cvar mdps which leverages the state augmentation procedure introduced by pflug and pichler we overcome the aforementioned computational challenges due to the continuous augmented state by designing an algorithm that merges approximate value iteration with linear interpolation remarkably we are able to provide explicit error bounds and convergence rates based on arguments in contrast to the algorithms in given the explicit mdp model our approach leads to error guarantees with respect to the globally optimal policy in addition our algorithm is significantly simpler than previous methods and calculates the optimal policy for all cvar confidence intervals and initial states simultaneously the practicality of our approach is demonstrated in numerical experiments involving planning path on grid with thousand of states to the best of our knowledge this is the first algorithm to approximate policies for cvar mdps whose error depends on the resolution of interpolation organization this paper is structured as follows in section we provide background on cvar and mdps we state the problem we wish to solve cvar mdps and motivate the cvar mdp formulation by establishing novel relation between cvar and model perturbations section provides the basis for our solution algorithm based on equation for the cvar then in section we present our algorithm and correctness analysis in section we evaluate our approach via numerical experiments finally in section we draw some conclusions and discuss directions for future work preliminaries problem formulation and motivation conditional let be random variable on probability space with cumulative distribution function in this paper we interpret as cost the var at confidence level is the quantile of varα min the conditional cvar at confidence level is defined as cvarα min where max represents the positive part of if there is no probability atom at varα it is well known from theorem in that cvarα varα therefore cvarα may be interpreted as the expected value of conditioned on the of the tail distribution it is well known that cvarα is decreasing in equals to and cvarα tends to max as during the last decade the cvar has gained popularity in financial applications among others it is especially useful for controlling rare but potentially disastrous events which occur above the quantile and are neglected by the var furthermore cvar enjoys desirable axiomatic properties such as coherence we refer to for further motivation about cvar and comparison with other risk measures such as var useful property of 
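As a small numerical illustration of the VaR and CVaR definitions above (the cost distribution and confidence level are hypothetical, chosen only for the demo): for a cost sample, CVaR_α can be computed either as the average cost over the worst α-fraction of outcomes, or via the variational formula CVaR_α(Z) = min_w { w + (1/α) E[(Z − w)⁺] }, whose minimum is attained at w = VaR_α.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # heavy-tailed "cost" sample
alpha = 0.05                                           # confidence level: worst 5% of outcomes

# VaR_alpha: the (1 - alpha)-quantile of the cost
var_alpha = np.quantile(z, 1.0 - alpha)

# CVaR via the tail average: mean cost over the worst alpha-fraction of outcomes
cvar_tail = z[z >= var_alpha].mean()

# CVaR via the variational formula CVaR_alpha(Z) = min_w { w + E[(Z - w)^+] / alpha };
# the minimum is attained at w = VaR_alpha, so we evaluate the objective there
objective = lambda w: w + np.maximum(z - w, 0.0).mean() / alpha
cvar_formula = objective(var_alpha)

# sweep over w to confirm numerically that VaR_alpha (approximately) minimizes the objective
ws = np.linspace(0.5 * var_alpha, 2.0 * var_alpha, 101)
w_best = ws[np.argmin([objective(w) for w in ws])]

print(var_alpha, cvar_tail, cvar_formula, w_best)      # the two CVaR values agree closely
```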
cvar which we exploit in this paper is its alternative dual representation cvarα max eξ where eξ denotes the expectation of risk envelope ucvar is given by ucvar dω thus the cvar of random variable may be interpreted as the expectation of under perturbed distribution ξp in this paper we are interested in the cvar of the total discounted cost in sequential decisionmaking setting as discussed next markov decision processes an mdp is tuple where and are finite state and action spaces cmax is bounded deterministic cost is the transition probability distribution is the discounting factor and is the initial state our results easily generalize to random initial states and random costs let the space of admissible histories up to time be ht for and generic element ht ht is of the form ht xt let πh be the set of all policies with the property that at each time the randomized control action is function of ht in other words πh µt ht hj for all hj hj we also let πh πh be the set of all history dependent policies problem formulation let xt at denote the costs observed along trajectory in the mdp pt model and let xt at denote the total discounted cost up to time the risksensitive problem we wish to address is as follows min cvarα lim where is the policy sequence with actions at µt ht for we refer to problem as cvar mdp one may also consider related formulation combining mean and cvar the details of which are presented in the supplementary material the problem formulation in directly addresses the aspect of risk sensitivity as demonstrated by the numerous applications of cvar optimization in finance see and the recent approaches for cvar optimization in mdps in the following we show new result providing additional motivation for cvar mdps from the point of view of robustness to modeling errors motivation robustness to modeling errors we show new result relating the cvar objective in to the expected in presence of perturbations of the mdp parameters where the perturbations are budgeted according to the number of things that can go wrong thus by minimizing cvar the decision maker also guarantees robustness of the policy consider trajectory xt in mdp problem with transitions pt xt we explicitly denote the time index of the transition matrices for reasons that will become clear shortly the total probability of the trajectory is xt pt xt at and we let xt denote its discounted cost as defined above we consider an adversarial setting where an adversary is allowed to change the transition probabilities at each stage under some budget constraints we will show that for specific budget and perturbation structure the expected cost under the perturbation is equivalent to the cvar of the cost thus we shall establish that in this perspective being risk sensitive is equivalent to being robust against model perturbations for each stage consider perturbed transition matrix pt where δt rx is multiplicative probability perturbation and is the hadamard product under the condition that is stochastic matrix let denote the set of perturbation matrices that satisfy this condition and let the set of all possible perturbations to the trajectory distribution we now impose budget constraint on the perturbations as follows for some budget we consider the constraint δt xt at xt at essentially the product in eq states that with small budget the worst can not happen at each time instead the perturbation budget has to be split multiplicatively along the trajectory we note that eq is in fact constraint on the perturbation matrices and we 
denote by the set of perturbations that satisfy this constraint with budget the following result shows an equivalence between the cvar and the expected loss proposition interpretation of cvar as robustness measure it holds cvar xt sup δt xt where denotes expectation with respect to markov chain with transitions the proof of proposition is in the supplementary material it is instructive to compare proposition with the dual representation of cvar in where both results convert the cvar risk into robustness measure note in particular that the perturbation budget in proposition has temporal structure which constrains the adversary from choosing the worst perturbation at each time step remark an equivalence between robustness and was previously suggested by osogami in that study the iterated dynamic coherent risk was shown to be equivalent to robust mdp with rectangular uncertainty set the iterated risk and correspondingly the rectangular uncertainty set is very conservative in the sense that the worst can happen at each time step in contrast the perturbations considered here are much less conservative in general solving robust mdps without the rectangularity assumption is nevertheless mannor et al showed that for cases where the number of perturbations to the parameters along trajectory is upper bounded perturbation the corresponding robust mdp problem is tractable analogous to the constraint set in the perturbation set in proposition limits the total number of along trajectory accordingly we shall later see that optimizing problem with perturbation structure is indeed also tractable next section provides the fundamental theoretical ideas behind our approach to the solution of bellman equation for cvar in this section by leveraging recent result from we present dynamic programming dp formulation for the cvar mdp problem in as we shall see the value function in this formulation depends on both the state and the cvar confidence level we then establish important properties of such dp formulation which will later enable us to derive an efficient approximate solution algorithm and provide correctness guarantees on the approximation error all proofs are presented in the supplementary material our starting point is recursive decomposition of cvar whose proof is detailed in theorem of theorem cvar decomposition theorem in for any denote by the cost sequence from time onwards the conditional cvar under policy cvarα ht obeys the following decomposition cvarα ht max at cvarαξ ht where at is the action induced by policy µt ht and the expectation is with respect to theorem concerns fixed policy we now extend it to general dp formulation note that in the recursive decomposition in theorem the side involves cvar terms with different confidence levels than that in the side accordingly we augment the state space with an additional continuous state which corresponds to the confidence level for any and the for the augmented state is defined as min cvary lim similar to standard dp it is convenient to work with operators defined on the space of value functions in our case theorem leads to the following definition of cvar bellman operator min max yξ we now establish several useful properties for the bellman operator lemma properties of cvar bellman operator the bellman operator has the following properties contraction kt where kf concavity preserving in for any suppose yv is concave in then the maximization problem in is concave furthermore yt is concave in the first property in lemma is similar to standard dp and 
is instrumental to the design of converging approach the second property is nonstandard and specific to our approach it will be used to show that the computation of updates involves concave and therefore tractable optimization problems furthermore it will be used to show that linearinterpolation of in the augmented state has bounded error equipped with the results in theorem and lemma we can now show that the fixed point solution of is unique and equals to the solution of the cvar mdp problem with and theorem optimality condition for any and the solution to is unique and equals to cvary limt next we show that the optimal value of the cvar mdp problem can be attained by stationary markov policy defined as greedy policy with respect to the value function thus while the original problem is defined over the intractable space of policies stationary markov policy over the augmented state space is optimal and can be readily derived from furthermore an optimal policy can be readily obtained from an augmented optimal markov policy according to the following theorem theorem optimal policies let πh πh be policy recursively defined as µk hk xk yk with initial conditions and and state transitions xk yk xk where the stationary markovian policy and risk factor ξx are solution to the max optimization problem in the cvar bellman operator then πh is an optimal policy for problem with initial state and cvar confidence level theorems and suggest that dp method can be used to solve the cvar mdp problem let an initial guess be chosen arbitrarily value iteration proceeds recursively as follows vk specifically by combining the contraction property in lemma and uniqueness result of fixed point solutions from theorem one concludes that vk by selecting and one immediately obtains cvarα limt furthermore an optimal policy may be derived from according to the policy construction procedure in theorem unfortunately while value iteration is conceptually appealing its direct implementation in our setting is generally impractical since the state is continuous in the following we pursue an approximation to the value iteration algorithm based on linear interpolation scheme for algorithm cvar value iteration with linear interpolation given interpolation points yn for every with yi and yn initial value function that satisfies assumption for for each and each yi update the value function estimate as follows vt yi ti yi set the converged value iteration estimate as vb yi for any and yi value iteration with linear interpolation in this section we present an approximate dp algorithm for solving cvar mdps based on the theoretical results of section the value iteration algorithm in eq presents two main implementation challenges the first is due to the fact that the augmented state is continuous we handle this challenge by using interpolation and exploit the concavity of yv to bound the error introduced by this procedure the second challenge stems from the the fact that applying involves maximizing over our strategy is to exploit the concavity of the maximization problem to guarantee that such optimization can indeed be performed effectively as discussed our approach relies on the fact that the bellman operator preserves concavity as established in lemma accordingly we require the following assumption for the initial guess assumption the guess for the initial value function satisfies the following properties is concave in and is continuous in for any assumption may easily be satisfied for example by choosing cvary where is any 
arbitrary bounded random variable as stated earlier key difficulty in applying value iteration is that for each state the bellman operator has to be calculated for each and is continuous as an approximation we propose to calculate the bellman operator only for finite set of values and interpolate the value function in between such interpolation points formally let denote the number of interpolation points for every denote by yn the set of interpolation points we denote by ix the linear interpolation of the function yv on these points yi yi ix yi yi yi yi where yi max and is the closest interpolation point such that yi min the interpolation of yv instead of is key to our approach the motivation is twofold first it can be shown that for discrete random variable ycvary is piecewise linear in second one can show that the lipschitzness of is preserved during value iteration and exploit this fact to bound the linear interpolation error we now define the interpolated bellman operator ti as follows yξ ti min max remark notice that by hospital rule one has ix yξ this implies that at the interpolated bellman operator is equivalent to the original bellman ator ti algorithm presents cvar value iteration with linear interpolation the only difference between this algorithm and standard value iteration is the linear interpolation procedure described above in the following we show that algorithm converges and bound the error due to interpolation we begin by showing that the useful properties established in lemma for the bellman operator extend to the interpolated bellman operator ti lemma properties of interpolated bellman operator ti has the same properties of as in lemma namely contraction and concavity preservation lemma implies several important consequences for algorithm the first one is that the maximization problem in is concave and thus may be solved efficiently at each step this guarantees that the algorithm is tractable second the contraction property in lemma guarantees that algorithm converges there exists value function vb such that tni yi vb yi in addition the convergence rate is geometric and equals to the following theorem provides an error bound between approximate value iteration and exact value iteration in terms of the interpolation resolution theorem convergence and error bound suppose the initial value function satisfies assumption and let be an error tolerance parameter for any state and step choose such that vt vt and update the interpolation points according to the logarithmic rule θyi with uniform constant then algorithm has the following error bound vb min cvarα lim and the following finite time convergence error bound ti min cvarα lim theorem shows that the value function is conservative estimate for the optimal solution to problem the interpolation procedure is consistent when the number of interpolation points is arbitrarily large specifically and the approximation error tends to zero and the approximation error bound is where log is the of the interpolation points log log log yi for the condition vt vt may be satisfied by simple adaptive procedure for selecting the interpolation points at each iteration after calculating vt yi in algorithm at each state in which the condition does not hold add new interpolation point and additional points between and such that the condition log log log yi is maintained since all the additional points belong to the segment the linearly interpolated vt yi remains unchanged and algorithm proceeds as is for bounded costs and the number of 
additional points required is bounded the full proof of theorem is detailed in the supplementary material we highlight the main ideas and challenges involved in the first part of the proof we bound for all the lipschitz constant of yvt in the key to this result is to show that the bellman operator preserves the lipschitz property for yvt using the lipschitz bound and the concavity of yvt we then bound the error ix vyt vt for all the condition on is required for this bound to hold when finally we use this result to bound kti vt vt the results of theorem follow from contraction arguments similar to approximate dynamic programming experiments we validate algorithm on rectangular grid world where states represent grid points on terrain map an agent robotic vehicle starts in safe region and its objective is to travel to given destination at each time step the agent can move to any of its four neighboring states due to sensing and control noise however with probability move to random neighboring state occurs the cost of each move until reaching the destination is to account for fuel usage in between the starting point and the destination there are number of obstacles that the agent should avoid hitting an obstacle costs and terminates the mission the objective is to compute safe path that is fuel efficient for our experiments we choose see figure for total of states the destination is at position and there are obstacles plotted in yellow by leveraging theorem we use interpolation points for algorithm in order to achieve small value function error we choose and discount factor for an effective horizon of steps furthermore we set the penalty cost equal to choice trades off high penalty for collisions and computational complexity that increases as increases for the figure simulation left three plots show the value functions and corresponding paths for different cvar confidence levels the rightmost plot shows cost histogram for monte carlo trials for policy and cvar policy with confidence level interpolation parameters discussed in theorem we set and in order to have logarithmically distributed grid points for the cvar confidence parameter in in figure we plot the value function for three different values of the cvar confidence parameter and the corresponding paths starting from the initial position the first three figures in figure show how by decreasing the confidence parameter the average travel distance and hence fuel consumption slightly increases but the collision probability decreases as expected we next discuss robustness to modeling errors we conducted simulations in which with probability each obstacle position is perturbed in random direction to one of the neighboring grid cells this emulates for example measurement errors in the terrain map we then trained both the riskaverse and policies on the nominal unperturbed terrain map and evaluated them on perturbed scenarios perturbed maps with monte carlo evaluations each while the policy finds shorter route with average cost equal to on successful runs it is vulnerable to perturbations and fails more often with over failed runs in contrast the policy chooses slightly longer routes with average cost equal to on successful runs but is much more robust to model perturbations with only failed runs for the computation of algorithm we represented the concave piecewise linear maximization problem in as linear program and concatenated several problems to reduce repeated overhead stemming from the initialization of the cplex linear programming solver 
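The paragraph above notes that the inner maximization in the interpolated Bellman operator is a concave piecewise-linear problem that was cast as a linear program (solved with CPLEX in the authors' MATLAB code). Below is a minimal Python sketch of the standard epigraph reformulation of such a problem, not the authors' implementation; the transition probabilities, the interpolation grid, and the placeholder values standing in for y*V(x', y) are illustrative assumptions, and the exact scaling used in the paper's operator may differ.

import numpy as np
from scipy.optimize import linprog

def inner_cvar_lp(p, y, grid, vals):
    """Maximize  sum_j p[j] * I_j(y * xi[j])  over perturbations xi with
    0 <= xi[j] <= 1/y and sum_j p[j] * xi[j] = 1, where I_j is the concave
    piecewise-linear interpolation of the points (grid, vals[j]).
    Returns the optimal value and the maximizing xi."""
    n, m = len(p), len(grid) - 1            # successor states, linear segments
    # variables: [xi_1..xi_n, t_1..t_n]; t_j is an epigraph variable for I_j(y * xi_j)
    c = np.concatenate([np.zeros(n), -np.asarray(p, dtype=float)])   # maximize sum_j p_j t_j
    A_ub, b_ub = [], []
    for j in range(n):
        for k in range(m):
            slope = (vals[j][k + 1] - vals[j][k]) / (grid[k + 1] - grid[k])
            intercept = vals[j][k] - slope * grid[k]
            # for a concave piecewise-linear I_j, I_j(z) <= slope*z + intercept for every
            # segment line, with equality on the segment itself, so imposing
            # t_j <= slope*(y*xi_j) + intercept for all segments makes t_j = I_j(y*xi_j)
            # at the optimum
            row = np.zeros(2 * n)
            row[j] = -slope * y
            row[n + j] = 1.0
            A_ub.append(row)
            b_ub.append(intercept)
    A_eq = np.concatenate([np.asarray(p, dtype=float), np.zeros(n)]).reshape(1, -1)
    bounds = [(0.0, 1.0 / y)] * n + [(None, None)] * n
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return -res.fun, res.x[:n]

# toy usage: 3 successor states, a coarse grid of confidence levels on [0, 1],
# and placeholder concave values standing in for y * V(x', y)
p = [0.5, 0.3, 0.2]
grid = np.array([0.0, 0.05, 0.1, 0.2, 0.4, 1.0])
vals = [np.sqrt(grid) * (j + 1.0) for j in range(3)]
value, xi = inner_cvar_lp(p, y=0.3, grid=grid, vals=vals)

Because each interpolated function is concave and piecewise linear, it equals the pointwise minimum of its segment lines, which is what makes the epigraph variables t_j exact at the optimum and the whole inner problem a linear program.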
this resulted in computation time on the order of two hours we believe there is ample room for improvement for example by leveraging parallelization and methods overall we believe our proposed approach is currently the most practical method available for solving cvar mdps as comparison the recently proposed method in involves infinite dimensional optimization the matlab code used for the experiments is provided in the supplementary material conclusion in this paper we presented an algorithm for cvar mdps based on approximate on an augmented state space we established convergence of our algorithm and derived error bounds these bounds are useful to stop the algorithm at desired error threshold in addition we uncovered an interesting relationship between the cvar of the total cost and the expected cost under adversarial model perturbations in this formulation the perturbations are correlated in time and lead to robustness framework significantly less conservative than the popular framework where the uncertainty is temporally independent collectively our work suggests cvar mdps as unifying and practical framework for computing control policies that are robust with respect to both stochasticity and model perturbations future work should address extensions to large we conjecture that approximate dp approach should be feasible since as proven in this paper the cvar bellman equation is contracting as required by approximate dp methods acknowledgement the authors would like to thank mohammad ghavamzadeh for helpful comments on the technical details and daniel vainsencher for practical optimization advice chow and pavone are partially supported by the croucher foundation doctoral scholarship and the office of naval research science of autonomy program under contract funding for shie mannor and aviv tamar were partially provided by the european community seventh framework programme under grant agreement suprel references artzner delbaen eber and heath coherent measures of risk mathematical finance and ott markov decision processes with criteria mathematical methods of operations research bertsekas dynamic programming and optimal control vol ii athena scientific edition borkar and jain markov decision processes ieee transaction of automatic control chow and ghavamzadeh algorithms for cvar optimization in mdps in advances in neural information processing systems pages dowd measuring market risk john wiley sons filar krass and ross percentile performance criteria for limiting average markov decision processes automatic control ieee transactions on haskell and jain convex analytic approach to markov decision processes siam journal of control and optimization howard and matheson markov decision processes management science iyengar robust dynamic programming mathematics of operations research iyengar and ma fast gradient descent method for optimization annals of operations research mannor simester sun and tsitsiklis bias and variance approximation in value function estimates management science mannor mebel and xu lightning does not strike twice robust mdps with coupled uncertainty in international conference on machine learning pages milgrom and segal envelope theorems for arbitrary choice sets econometrica nilim and el ghaoui robust control of markov decision processes with uncertain transition matrices operations research osogami robustness and in markov decision processes in advances in neural information processing systems pages pflug and pichler time consistent decisions and temporal decomposition of 
coherent risk functionals optimization online phillips interpolation and approximation by polynomials volume springer science business media prashanth policy gradients for mdps in algorithmic learning theory pages springer rockafellar and uryasev optimization of conditional journal of risk rockafellar uryasev and zabarankin master funds in portfolio analysis with general deviation measures journal of banking finance serraino and uryasev conditional cvar in encyclopedia of operations research and management science pages springer shapiro dentcheva and lectures on stochastic programming siam sobel the variance of discounted markov decision processes journal of applied probability pages tamar glassner and mannor optimizing the cvar via sampling in aaai uryasev sarykalin serraino and kalinchenko var vs cvar in risk management and optimization in carisma conference xu and mannor the tradeoff in markov decision processes in advances in neural information processing systems pages 
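As a small numerical complement to the CVaR machinery used throughout the paper above, the sketch below checks on an empirical sample that the Rockafellar-Uryasev minimization form of CVaR agrees with the dual "perturbed distribution" form exploited in the problem formulation; the sample size and the exponential cost distribution are arbitrary illustrative choices.

import numpy as np

def cvar_primal(z, alpha):
    # Rockafellar-Uryasev form: CVaR_a(Z) = min_w { w + E[(Z - w)_+] / a };
    # the minimum is attained at the (1 - a)-quantile of Z (the value-at-risk)
    w = np.quantile(z, 1.0 - alpha)
    return w + np.maximum(z - w, 0.0).mean() / alpha

def cvar_dual(z, alpha):
    # dual form: max of E[xi * Z] over densities xi with 0 <= xi <= 1/a and E[xi] = 1;
    # on an empirical sample the maximizer piles density 1/a on the largest costs
    n = len(z)
    xi = np.zeros(n)
    budget = 1.0                              # total probability mass to distribute
    for i in np.argsort(z)[::-1]:             # worst (largest) costs first
        take = min(1.0 / (alpha * n), budget)
        xi[i] = take * n                      # density relative to the uniform weights 1/n
        budget -= take
        if budget <= 0.0:
            break
    return float(np.mean(xi * z))

rng = np.random.default_rng(0)
costs = rng.exponential(size=10_000)
for a in (0.05, 0.25, 1.0):
    print(a, cvar_primal(costs, a), cvar_dual(costs, a))   # should agree up to sampling grid effects

For a discrete sample the dual maximizer simply concentrates density 1/alpha on the worst alpha-fraction of outcomes, which is the sense in which CVaR averages over the tail of the cost distribution.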
Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care

Sorathan Chaturapruek, John C. Duchi, Christopher Ré
Departments of Computer Science, Electrical Engineering, and Statistics, Stanford University, Stanford, CA
sorathan, jduchi, chrismre

Abstract. We show that asymptotically, completely asynchronous stochastic gradient procedures achieve optimal (even to constant factors) convergence rates for the solution of convex optimization problems under nearly the same conditions required for asymptotic optimality of standard stochastic gradient procedures. Roughly, the noise inherent to the stochastic approximation scheme dominates any noise from asynchrony. We also give empirical evidence demonstrating the strong performance of asynchronous, parallel stochastic optimization schemes, demonstrating that the robustness inherent to stochastic approximation problems allows substantially faster parallel and asynchronous solution methods. In short, we show that for many stochastic approximation problems, as Freddie Mercury sings in Queen's Bohemian Rhapsody, nothing really matters.

Introduction. We study a natural asynchronous stochastic gradient method for the solution of minimization problems of the form

minimize_x f(x) := E_P[F(x; W)] = ∫ F(x; w) dP(w),

where F(·; w) is convex for each sample w, P is a probability distribution on the sample space, and the vector x lies in R^d. Stochastic gradient techniques for the solution of this problem have a long history in optimization, starting from the early work of Robbins and Monro and continuing on through Ermoliev, Polyak and Juditsky, and Nemirovski et al. The latter two show how certain long stepsizes and averaging techniques yield more robust and asymptotically optimal optimization schemes, and we show how their results extend to practical parallel and asynchronous settings.

We consider an extension of previous stochastic gradient methods to a natural family of asynchronous gradient methods, where multiple processors can draw samples from the distribution P and asynchronously perform updates to a centralized (shared) decision vector x. Our iterative scheme is based on the Hogwild! algorithm of Niu et al., which is designed to asynchronously solve certain stochastic optimization problems in multi-core environments, though our analysis and iterations are different. In particular, we study the following procedure, where each processor runs asynchronously and independently of the others, though they maintain a shared integer iteration counter k; each processor asynchronously performs the following steps:

(i) the processor reads the current problem data x;
(ii) the processor draws a random sample W ∼ P, computes g = ∇F(x; W), and increments the centralized counter k;
(iii) the processor updates x sequentially for each coordinate j by incrementing x_j ← x_j − α_k g_j,

where the scalars α_k form a stepsize sequence. We assume that scalar addition is atomic: the addition of −α_k g_j in step (iii) propagates eventually, but maybe out of order.

Our main results show that, because of the noise inherent to the sampling process for W, the errors introduced by asynchrony in iterations (i)-(iii) are asymptotically negligible: they do not matter. Even more, we can efficiently construct an estimator from the asynchronous process possessing optimal convergence rate and asymptotic variance. This has consequences for solving stochastic optimization problems on multi-core and multi-processor systems: we can leverage parallel computing without performing any synchronization, so that given a machine with many processors, we can read data and perform updates proportionally more quickly than with a single processor, and the error from reading stale information on x becomes asymptotically negligible. In the sequel, we state our main convergence theorems about the asynchronous iteration (i)-(iii) for solving this problem. Our main result,
theorem gives explicit conditions under which our results hold and we give applications to specific stochastic optimization problems as well as general result for asynchronous solution of operator equations roughly all we require for our optimal convergence results is that the hessian of be positive definite near argminx and that the gradients be smooth several researchers have provided and analyzed asynchronous algorithms for optimization bertsekas and tsitsiklis provide comprehensive study both of models of asynchronous computation and analyses of asynchronous numerical algorithms more recent work has studied asynchronous gradient procedures though it often imposes strong conditions on gradient sparsity conditioning of the hessian of or allowable types of asynchrony as we show none are essential niu et al propose ogwild and show that under sparsity and smoothness assumptions essentially that the gradients have vanishing fraction of entries that is strongly convex and is lipschitz for all convergence guarantees similar to the synchronous case are possible agarwal and duchi showed under restrictive ordering assumptions that some delayed gradient calculations have negligible asymptotic effect and duchi et al extended niu et results to dual averaging algorithm that works for non problems so long as certain gradient sparsity assumptions hold researchers have also investigated parallel coordinate descent solvers and and liu et al show how certain nearseparability properties of an objective function govern convergence rate of parallel coordinate descent methods the latter focusing on asynchronous schemes as we show stochastic optimization renders many of these problem assumptions unnecessary in addition to theoretical results in section we give empirical results on the power of parallelism and asynchrony in the implementation of stochastic approximation procedures our experiments demonstrate two results first even in settings asynchrony introduces little degradation in solution quality regardless of data sparsity common assumption in previous analyses that is estimates are statistically efficient second we show that there is some subtlety in implementation of these procedures in real hardware while increases in parallelism lead to concomitant linear improvements in the speed with which we compute solutions to problem in some cases we require strategies to reduce hardware resource competition between processors to achieve the full benefits of asynchrony notation sequence of random variables or vectors xn converges in distribution to denoted xn if xn for all bounded continuous functions we let xn denote convergence in probability meaning that limn kxn zk for any the notation denotes the multivariate gaussian with mean and covariance main results our main results repose on few standard assumptions often used for the analysis of stochastic optimization procedures which we now detail along with few necessary definitions we let denote the iteration counter used throughout the asynchronous gradient procedure given that we compute with counter value in the iterations iii we let xk denote the possibly inconsistent particular used to compute and likewise say that gk noting that the update to is then performed using αk in addition throughout paper we assume there is some finite bound such that no processor reads information more than steps out of date asynchronous convex optimization we now present our main theoretical results for solving the stochastic convex problem giving the necessary assumptions on 
and for our results our first assumption roughly states that has quadratic expansion near the unique optimal point and is smooth assumption the function has unique minimizer and is twice continuously differentiable in the neighborhood of with positive definite hessian and there is covariance matrix such that additionally there exists constant such that the gradients satisfy kx for all rd lastly has continuous gradient kx yk for all rd assumption guarantees the uniqueness of the vector minimizing over rd and ensures that is enough for our asynchronous iteration procedure to introduce negligible noise over procedure in addition to assumption we make one of two additional assumptions in the first case we assume that is strongly convex assumption the function is convex over all of rd for some that is xi kx yk for rd our alternate assumption is lipschitz assumption on itself made by virtue of second moment bound on assumption there exists constant such that for all rd with our assumptions in place we state our main theorem theorem let the iterates xk be generated by the asynchronous process ii iii with stepsize choice αk αk where and let assumption and either of assumptions or hold then xk σh before moving to example applications of theorem we note that its convergence guarantee is generally unimprovable even by numerical constants indeed for classical statistical problems the covariance σh is the inverse fisher information and by the le local minimax theorems and results on bahadur efficiency chapter this is the optimal covariance matrix and the best possible rate is as for function values using the delta method theorem we can show the optimal convergence rate of on function values pn corollary let the conditions of theorem hold then xk where denotes random variable with degree of freedom and tr and examples we now give two classical statistical optimization problems to illustrate theorem we verify that the conditions of assumptions and or are not overly restrictive linear regression standard linear regression problems satisfies the conditions of assumption in this case the data rd and the objective ha xi if we have moment bounds and we have and the assumptions of theorem are certainly satisfied standard modeling assumptions yield more concrete guarantees for example if ha where is independent noise with the minimizer of is we have ha and ha ha in particular the asynchronous iterates satisfy xk which is the minimax optimal asymptotic covariance of the ordinary least squares estimate of logistic regression as long as the data has finite second moment logistic regression problems satisfy all the conditions of assumption in theorem we have rd and instantaneous objective log exp ha xi for fixed this function is lipschitz continuous and has gradient and hessian ebha xi ba and exp ha xi ebha xi where is lipschitz continuous as so long as and is positive definite theorem applies to logistic regression extension to nonlinear problems we prove theorem by way of more general result on finding the zeros of residual operator rd rd where we only observe noisy views of and there is unique such that such situations arise for example in the solution of stochastic monotone operator problems cf juditsky nemirovski and tauvel in this more general setting we consider the following asynchronous iterative process which extends that for the convex case outlined previously each processor performs the following asynchronously and independently processor reads current problem data ii processor receives vector where is 
random conditionally noise vector and increments centralized counter iii processor updates αk sequentially for each coordinate by incrementing αk as in the convex case we associate vectors xk and gk with the update performed using αk and we let ξk denote the noise vector used to construct gk these iterates and assignment of indices imply that xk has the form αi ki gi xk ki where is diagonal matrix whose jth diagonal entry captures that coordinate of the ith gradient has been incorporated into iterate xk we define the an increasing sequence of fk by fk ξk ij that is the noise variables ξk are adapted to the filtration fk and these are the smallest containing both the noise and all index updates that have occurred and that will occur to compute thus we have fk and our assumption on the noise is ξk we base our analysis on polyak and juditsky study of stochastic approximation procedures so we enumerate few more on our results on convergence of the asynchronous iterations for solving the nonlinear equality we assume there is lyapunov function satisfying kxk for all rd kx yk for all that and this implies kxk kxk kxk and kxk we make the following assumptions on the residual assumption there exists matrix with parameter constant and such that if satisfies kx kr kx assumption essentially requires that is differentiable at with derivative matrix we also make few assumptions on the noise process specifically we assume implicitly depends on rd so that we may write ξk xk and that the following assumption holds assumption the noise vector decomposes as where is process satisfying ξk ξk for matrix supk kξk with probability and kζk kx for constant and all rd as in the convex case we make one of two additional assumptions which should be compared with assumptions and the first is that gives globally strong information about assumption strongly convex residuals there exists constant such that for all rd alternatively we may make an assumption on the boundedness of which we shall see suffices for proving our main results assumption bounded residuals there exist and such that and inf inf in addition there exists such that kr and kξk for all and with these assumptions in place we obtain the following more general version of theorem indeed we show that theorem is consequence of this result theorem let be function satisfying inequality and let assumptions and hold let let one of assumptions or hold then the stepsizes αk αk where xk σh we may compare this result to polyak and juditsky theorem which gives identical asymptotic convergence guarantees but with somewhat weaker conditions on the function and stepsize sequence αk our stronger assumptions however allow our result to apply even in fully asynchronous settings proof sketch we provide rigorous proofs in the long version of this paper providing an amputated sketch here first to show that theorem follows from theorem we set and kxk we can then show that assumption which guarantees taylor expansion implies assumption with and moreover assumption or implies assumption respectively while to see that assumption holds we set taking and and applying inequality of assumption to satisfy assumption with the vector the proof of theorem is somewhat more involved roughly we show the asymptotic equivalence of the sequence xk from expression to the easier to analyze sequence ek αi gi asymptotically we obtain kxk ek while the iterates ek spite of their incorrect gradient close enough to correct stochastic gradient iterate that they possess optimal asymptotic normality 
properties this close enough follows by virtue of the squared error bounds for in assumption which guarantee that ξk essentially behaves like an sequence asymptotically after application of the martingale convergence theorem which we then average to obtain experimental results we provide empirical results studying the performance of asynchronous stochastic approximation schemes on several simulated and datasets our theoretical results suggest that asynchrony should introduce little degradation in solution quality which we would like to verify we also investigate the engineering techniques necessary to truly leverage the power of asynchronous stochastic procedures in our experiments we focus on linear and logistic regression the examples given in section that is we have data ai bi rd for linear regression or ai bi rd for logistic regression for and objectives hai xi bi and log exp hai xi we perform each of our experiments using intel xeon machine with terabyte of ram and have put code and binaries to replicate our experiments on codalab the xeon architecture puts each core onto one of four sockets where each socket has its own memory to limit the impact of communication overhead in our experiments we limit all experiments to at most cores all on the same socket within an on the empirical expectations iterate in epochs meaning that our stochastic gradient procedure repeatedly loops through all examples each exactly within an epoch we use fixed stepsize decreasing the stepsize by factor of between each epoch this matches the experimental protocol of niu et al within each epoch we choose examples in randomly permuted order where the order changes from epoch to epoch cf to address issues of hardware resource contention see section for more on this in some cases we use strategy abstractly in the formulation of the basic problem this means that in each calculation of stochastic gradient we draw samples wb according to then set wb the strategy does not change the asymptotic convergence guarantees of asynchronous stochastic gradient descent as the covariance matrix satisfies while the total iteration count is reduced by the factor lastly we measure the performance of optimization schemes via speedup defined as speedup average epoch runtime on single core using ogwild average epoch runtime on cores in our experiments as increasing the number of cores does not change the gap in optimality xk after each epoch speedup is equivalent to the ratio of the time required to obtain an solution using single to that required to obtain solution using efficiency and sparsity for our first set of experiments we study the effect that data sparsity has on the convergence behavior of asynchronous has been an essential part of the analysis of many asynchronous and parallel optimization schemes while our theoretical results suggest it should be the linear regression objective we generate synthetic linear regression problems with examples in dimensions via the following procedure let be the desired fraction of gradient entries and let πρ be random projection operator that zeros out all but fraction of the elements of its argument meaning that for rd πρ uniformly at random chooses ρd elements of leaves them identical and zeroes the remaining elements we generate data for our linear regression drawing random vector then constructing bi hai εi where εi ai πρ ai ai and πρ ai denotes an independent random sparse projection of ai to measure optimality gap we directly compute at at where an rn in figure we plot the results of 
simulations using densities and size showing the gap xk as function of the number of epochs for each of the given sparsity levels we give results using and processor cores increasing degrees of asynchrony and from the plots we see that regardless of the number of cores the convergence strictly speaking this violates the stochastic gradient assumption but it allows direct comparison with the original ogwild code and implementation behavior is nearly identical with very minor degradations in performance for the sparsest data we plot the gaps xk on logarithmic axis moreover as the data becomes denser the more asynchronous number of performance essentially identical to the fully synchronous method in terms of convergence versus number of epochs in figure we plot the speedup achieved using different numbers of cores we also include speedup achieved using multiple cores with explicit synchronization locking of the updates meaning that instead of allowing asynchronous updates each of the cores globally locks the decision vector when it reads unlocks and performs gradient computations and locks the vector again when it updates the vector we can see that the performance curve is much worse than than the withoutlocking performance curve across all densities that the locking strategy also gains some speedup when the density is higher is likely due to longer computation of the gradients however the lockingstrategy performance is still not competitive with that of the strategy cores cores cores core xk cores cores cores core cores cores cores core epochs epochs cores cores cores core epochs epochs figure exponential backoff stepsizes optimality gaps for synthetic linear regression experiments showing effects of data sparsity and asynchrony on xk fraction of each vector ai rd is linear speedup without locking with locking linear speedup without locking with locking linear speedup without locking with locking cores cores cores linear speedup without locking with locking cores figure exponential backoff stepsizes speedups for synthetic linear regression experiments showing effects of data sparsity on speedup fraction of each vector ai rd is hardware issues and cache locality we detail small set of experiments investigating hardware issues that arise even in implementation of asynchronous gradient methods the intel architecture as with essentially every processor architecture organizes memory in hierarchy going from to level to level caches of increasing sizes an important aspect of the speed of different optimization schemes is the relative fraction of memory hits meaning accesses to memory that is cached locally in order of decreasing speed or cache in table we show the proportion of cache misses at each level of the memory hierarchy for our synthetic regression experiment with fully dense data over the execution of epochs averaged over different experiments we compare memory contention when the batch size used to compute the local asynchronous gradients is and we see that the proportion of misses for the fastest two and the cache for increase significantly with the number of cores while increasing the batch size to substantially mitigates cache incoherency in particular we maintain near linear increases in iteration speed with little degradation in solution quality the gap output by each of the procedures with and without batching is identical to within cf figure number of cores fraction of misses fraction of misses fraction of misses epoch average time speedup number of cores fraction of misses 
fraction of misses fraction of misses epoch average time speedup no batching batch size table memory traffic for batched updates versus updates for dense linear regression problem in dimensions with sample of size cache misses are substantially higher with real datasets we perform experiments using three different datasets the reuters corpus the higgs detection dataset and the forest cover dataset each represents binary classification problem which we formulate using logistic regression we briefly detail statistics for each reuters dataset consists of data vectors documents ai with dimensions each vector has sparsity approximately our task is to classify each document as being about corporate industrial topics ccat or not ai with we the higgs detection dataset consists of data vectors quantize each coordinate into bins containing equal fraction of the coordinate values and encode each vector ai as vector ai whose entries correspond to quantiles into which coordinates fall the task is to detect simulated emissions from linear accelerator the forest cover dataset consists of data vectors ai with and the task is to predict forest growth types xk cores cores cores core cores cores cores core cores cores cores core figure exponential backoff stepsizes optimality gaps xk on the higgs and forest cover datasets epochs epochs linear speedup without locking higgs linear speedup without locking cores figure exponential backoff stepsizes logistic regression experiments showing speedup on the higgs and forest cover datasets linear speedup without locking epochs forest cores higgs cores forest in figure we plot the gap xk as function of epochs giving standard error intervals over runs for each experiment there is essentially no degradation in objective value for the different numbers of processors and in figure we plot speedup achieved using and cores with batch sizes asynchronous gradient methods achieve speedup of between and on each of the datasets using cores references agarwal and duchi distributed delayed stochastic optimization in advances in neural information processing systems baldi sadowski and whiteson searching for exotic particles in physics with deep learning nature communications july bertsekas and tsitsiklis parallel and distributed computation numerical methods duchi jordan and mcmahan estimation optimization and parallelism when data is sparse in advances in neural information processing systems duchi chaturapruek and asynchronous stochastic convex optimization duchi chaturapruek and asynchronous stochastic convex optimization url https code for reproducing experiments ermoliev on the stochastic method and stochastic sequences kibernetika juditsky nemirovski and tauvel solving variational inequalities with the stochastic algorithm stochastic systems le cam and yang asymptotics in statistics some basic concepts springer lehmann and casella theory of point estimation second edition springer lewis yang rose and li new benchmark collection for text categorization research journal of machine learning research lichman uci machine learning repository url http liu wright bittorf and sridhar an asynchronous parallel stochastic coordinate descent algorithm in proceedings of the international conference on machine learning nemirovski juditsky lan and shapiro robust stochastic approximation approach to stochastic programming siam journal on optimization niu recht re and wright hogwild approach to parallelizing stochastic gradient descent in advances in neural information processing systems polyak 
and juditsky acceleration of stochastic approximation by averaging siam journal on control and optimization recht and beneath the valley of the noncommutative mean inequality conjectures and consequences in proceedings of the twenty fifth annual conference on computational learning theory and parallel coordinate descent methods for big data optimization mathematical programming page online first url http robbins and monro stochastic approximation method annals of mathematical statistics robbins and siegmund convergence theorem for almost supermartingales and some applications in optimizing methods in statistics pages academic press new york van der vaart asymptotic statistics cambridge series in statistical and probabilistic mathematics cambridge university press isbn 
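To make the asynchronous iteration (i)-(iii) analyzed in the paper above concrete, here is a minimal Python sketch for a least-squares instance. It is not the authors' Hogwild!-style implementation: Python threads share memory but, because of the global interpreter lock, give no real parallel speedup, and the counter and vector updates are only approximately atomic; the dimension, stepsize constants, and iteration count are assumed for illustration.

import threading
import numpy as np

d, n_iter, n_workers = 10, 200_000, 4
rng = np.random.default_rng(0)
x_star = rng.normal(size=d)            # ground-truth parameter for the synthetic problem

x = np.zeros(d)                        # shared decision vector, read and written without locks
x_sum = np.zeros(d)                    # running sum of iterates for the averaged estimator
counter = [0]                          # shared iteration counter k (assumed atomic in the paper)

def worker():
    local_rng = np.random.default_rng()
    while counter[0] < n_iter:
        a = local_rng.normal(size=d)                  # (ii) draw a sample W = (a, b) from P
        b = a @ x_star + local_rng.normal()
        x_read = x.copy()                             # (i) read the current, possibly stale, x
        g = (x_read @ a - b) * a                      # stochastic gradient of 0.5 * (a'x - b)^2
        counter[0] += 1                               # increment the centralized counter
        k = counter[0]
        alpha_k = 0.1 * k ** (-0.6)                   # stepsize alpha * k^(-beta), beta in (1/2, 1)
        x[:] -= alpha_k * g                           # (iii) coordinate-wise in-place update
        x_sum[:] += x

threads = [threading.Thread(target=worker) for _ in range(n_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()

x_bar = x_sum / counter[0]             # averaged iterate
print("error of averaged iterate:", np.linalg.norm(x_bar - x_star))

The quantity reported at the end is the error of the averaged iterate, which is the estimator whose optimal asymptotic covariance the paper establishes.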
lifelong learning with tasks christoph lampert ist austria klosterneuburg austria chl anastasia pentina ist austria klosterneuburg austria apentina abstract in this work we aim at extending the theoretical foundations of lifelong learning previous work analyzing this scenario is based on the assumption that learning tasks are sampled from task environment or limited to strongly constrained data distributions instead we study two scenarios when lifelong learning is possible even though the observed tasks do not form an sample first when they are sampled from the same environment but possibly with dependencies and second when the task environment is allowed to change over time in consistent way in the first case we prove theorem that can be seen as direct generalization of the analogous previous result for the case for the second scenario we propose to learn an inductive bias in form of transfer procedure we present generalization bound and show on toy example how it can be used to identify beneficial transfer algorithm introduction despite the tremendous growth of available data over the past decade the lack of fully annotated data which is an essential part of success of any traditional supervised learning algorithm demands methods that allow good generalization from limited amounts of training data one way to approach this is provided by the lifelong learning or learning to learn paradigm which is based on the idea of accumulating knowledge over the course of learning multiple tasks in order to improve the performance on future tasks in order for this scenario to make sense one has to define what kind of relations connect the observed tasks with the future ones the first formal model of lifelong learning was proposed by baxter in he introduced the notion of task environment set of all tasks that may need to be solved together with probability distribution over them in baxter model the lifelong learning system observes tasks that are sampled from the task environment this allows proving bounds in the pac framework that guarantee that hypothesis set or inductive bias that works well on the observed tasks will also work well on future tasks from the same environment baxter results were later extended using algorithmic stability task similarity measures and analysis specific cases that were studied include feature learning and sparse coding all these works however assume that the observed tasks are independently and identically distributed as the original work by baxter did this assumption allows making predictions about the future of the learning process but it limits the applicability of the results in practice to our knowledge only the recent has studied lifelong learning without an assumption however the considered framework is limited to binary classification with linearly separable classes and isotropic data distributions in this work we use the framework to study two possible relaxations of the assumption without restricting the class of possible data distributions first we study the case in which tasks can have dependencies between them but are still sampled from fixed task environment an illustrative example would be when task are to predict the outcome of chess games whenever player plays multiple games the corresponding tasks are not be independent in this setting we retain many concepts of and learn an inductive bias in the form of probability distribution we prove bound relating the expected error when relying on the learned bias for future tasks to its empirical error over the 
observed tasks it has the same form as for the situation except for slowdown of convergence proportional to parameter capturing the amount of dependence between tasks second we introduce new and more flexible lifelong learning setting in which the learner observes sequence of tasks from different task environments this could be classification tasks of increasing difficulty in this setting one can not expect that transferring an inductive bias from observed tasks to future tasks will be beneficial because the task environment is not stationary instead we aim at learning an effective transfer algorithm procedure that solves task taking information from previous task into account we bound the expected performance of such algorithms when applied to future tasks based on their performance on the observed tasks preliminaries following baxter model we assume that all tasks that may need to be solved share the same input space and output space the lifelong learning system observes tasks tn in form is set of points sampled of training sets sn where each si xim ym from the corresponding unknown data distribution di over in contrast to previous works on lifelong learning we omit the assumption that the observed tasks are independently and identically distributed in order to theoretically analyze lifelong learning in the case of tasks we use techniques from theory we assume that the learner uses the same hypothesis set and the same loss function for solving all tasks theory studies the performance of randomized gibbs predictors formally for any probability distribution over the hypothesis set the corresponding gibbs predictor for every point randomly samples and returns the expected loss of such gibbs predictor on task corresponding to data distribution is given by er and its empirical counterpart based on training set sampled from er xi yi is given by theory allows us to obtain upper bounds on the difference between these two quantities of the following form theorem let be any distribution over fixed before observing the sample then for any the following holds uniformly for all distributions over with probability at least log er er kl where kl denotes the divergence the distribution should be chosen before observing any data and therefore is usually referred as prior distribution in contrast the bound holds uniformly with respect to the distributions whenever it consists only of computable quantities it can be used to choose that minimizes the right hand side of the inequality and thus provides gibbs predictor with expected error bounded by hopefully low value suchwise is usually referred as posterior distribution note that besides explicit bounds such as in the case of one can also derive implicit bound that can be tighter in some regimes instead of the error difference er these bound their kl erk er where kl qkp denotes the between two bernoulli random variables with success probabilities and in this work we prefer explicit bounds as they are more intuitive and allow for more freedom in the choice of different loss functions they also allow us to combine several inequalities in an additive way which we make use of in sections and dependent tasks the first extension of baxter model that we study is the case when the observed tasks are sampled from the same task environment but with some interdependencies in other words in this case the tasks are identically but not independently distributed since the task environment is assumed to be constant we can build on ideas from the situation of tasks in we 
assume that for all tasks the learner uses the same deterministic learning algorithm that produces posterior distribution based on prior distribution and sample set we also assume that there is set of possible prior distributions and some distribution over it the goal of the learner is to find distribution over this set such that when the prior is sampled according to the expected loss on the next yet unobserved task is minimized er ep st st the empirical counterpart of the above quantity is given by er ep si xij yji in order to bound the difference between these two quantities we adopt the procedure used in first we bound the difference between the empirical error er and the corresponding expected risk given by er ep si then we continue with bounding the difference between er and er since conditioned on the observed tasks the corresponding training samples are independent we can reuse the following results from in order to perform the first step of the proof theorem with probability at least uniformly for all log kl ep kl qi si er er to bound the difference between er and er however the results from can not be used because they rely on the assumption that the observed tasks are independent instead we adopt ideas from chromatic bounds that rely on the properties of dependency graph built with respect to the dependencies within the observed tasks definition dependency graph the dependency graph of set of random variables tn is such that the set of vertices equals there is no edge between and if and only if ti and tj are independent definition exact fractional cover let be an undirected graph with set cj wj where cj and wj for all is proper exact fractional cover if for every all vertices in cj are independent cj for every pk the sum of the weights wj pk wj is the chromatic weight of and is the size of then the following holds theorem for any fixed distribution any proper exact fractional cover of the dependency graph tn of size and any the following holds with probability at least uniformly for all distributions log log er er kl proof by variational formula er er wj st ert qt eri qi ep wj kl log ep exp st ert qt eri qi λj since the tasks within every cj are independent for every fixed prior st ert qt eri qi are and take values in where st ert qt therefore by hoeffding lemma ti si exp st ert qt eri qi exp therefore by markov inequality with probability at least δj it holds that log δj log ep exp st ert qt eri qi consequently we obtain with probability at least pk δj wj wj λj wj kl log δj λj λj by setting λk and δj we obtain the statement of the theorem er er by combining theorems and we obtain the main result of this section theorem for any fixed distribution any proper exact fractional cover of the dependency graph tn of size and any the following holds with probability at least uniformly for all distributions mn er er kl ep kl qi si log log log theorem shows that even in the case of tasks bound very similar to that in can be obtained in particular it contains two types of complexity terms kl corresponds to the level of the task environment and kl qi corresponds specifically to the task similarly to the case when the learner has access to unlimited amount of data but for finitely many observed tasks the complexity terms of the second type converge to as while the first one does not as there is still uncertainty on the task environment level in the opposite situation when the learner has access to infinitely many tasks but with only finitely many samples per task the first complexity term 
converges to as up to logarithmic terms thus there is worsening comparing to the case proportional to which represents the amount of dependence among the tasks if the tasks are actually the dependency graph contains no edges so we can form cover of size with chromatic weight thus we recover the result from as special case of theorem for general dependence graph fastest convergence is obtained by using cover with minimal chromatic weight it is known that the minimal chromatic weight satisfies the following inequality where is the order of the largest clique in and is the maximum degree of vertex in in some situations even the bound obtainable from theorem by plugging in cover with the minimal chromatic weight can be improved theorem also holds for any subset ts of the observed tasks with the induced dependency subgraph γs therefore it might provide tighter bound if γs is smaller than however this is not guaranteed since the empirical error er computed on ts might become larger as well as the second part of the bound which decreases with and does not depend on the chromatic weight of the cover note also that such subset needs to be chosen before observing the data since the bound of theorem holds with probability only for fixed set of tasks and fixed cover another important aspect of theorem as bound is that the right hand side of inequality consists only of computable quantities therefore it can be seen as quality measure of and by minimizing it one could obtain distribution that is adjusted to particular task environment the resulting minimizer can be expected to work well even on new yet unobserved tasks because the guarantees of theorem still hold due to the uniformity of the bound to do so one can use the same techniques as in because theorem differs from the bound provided there only by constant factors changing task environments in this section we study situation when the task environment is gradually changing every next task is sampled from distribution over the tasks that can depend on the history of the process due to the change of task environment the previous idea of learning one prior for all tasks does not seem reasonable anymore in contrast we propose to learn transfer algorithm that produces solution for the current task based on the corresponding sample set and the sample set from the previous task formally we assume that there is set of learning algorithms that produce posterior distribution for task based on the training samples si and the goal of the learner is to identify an algorithm in this set that leads to good performance when applied to new yet unobserved task while using the last observed training sample sn for each task ti and each algorithm we define the expected and empirical error of applying this algorithm as follows eri er xij yji where qi si goal of the learner is to find that minimizes given the history of the observed tasks however if the task environment would change arbitrarily from step to step the observed tasks would not contain any relevant information for new task to overcome this difficulty we make the assumption that the expected performance of the algorithms in does not change over time formally we assume for each there exists value er such that for every with ei ti ti si ei eri er in words the quality of transfer algorithm does not depend on when during the task sequence it is applied provided that it is always applied to the subsequent sample sets note that this is natural assumption for lifelong learning without it the quality of transfer 
algorithms could change over time so an algorithm that works well for all observed tasks might not work anymore for future tasks the goal of the learner can be reformulated as identifying with minimal er which can be seen as the expected value of the expected risk of applying algorithm on the next yet unobserved task since er is unknown we derive an upper bound based on the observed data that holds uniformly for all algorithms and therefore can be used to guide the learner to do so we again use note that this setup includes the possibility of model selection such as predictors using different feature representations or hyper parameter values the construction with and from the previous section formally let be prior distribution over the set of possible algorithms that is fixed before any data arrives and let be possibly the quality of the and its empirical counterpart are given by the following quantities er er er er similarly to the previous section we first bound the difference between er and expected error given by er eri even though theorem is not directly applicable here more careful modification of it allows to obtain the following result see supplementary material for detailed proof theorem for any fixed distribution with probability at least the following holds uniformly for all distributions log kl er er where pn are some reference prior distributions that do not depend on the training sets of subsequent tasks possible choices include using just one prior distribution fixed before observing any data or using the posterior distributions obtained from the previous task pi to complete the proof we need to bound the difference between er and er we use techniques from in combination of those from resulting in the following lemma lemma for any fixed algorithm and any the following holds en exp er eri exp proof first define xi ei for and xi eri and er then exp er eri exp xi xi even odd exp xi exp xi even odd note that both the set of xi corresponding to even and the set of xi corresponding to odd form martingale difference sequence therefore by using lemma from the supplementary material or similarly lemma in and hoeffding lemma we obtain xi exp en exp even and the same for the odd together with inequality it gives the statement of the lemma now we can prove the following statement theorem for any distribution and any with probability at least the following inequality holds uniformly for all log er er kl proof by applying variational formula one obtains that er er kl log exp er eri figure illustration of three learning tasks sampled from environment shaded areas illustrate the data distribution and indicate positive and negative training examples between subsequent tasks the data distribution changes by rotation transfer algorithm with access to two subsequent tasks can compensate for this by rotating the previous data into the new position thereby obtaining more data samples to train on for fixed algorithm we obtain from lemma en exp er eri exp since does not depend on the process by markov inequality with probability at least we obtain eri exp exp er the statement of the theorem follows by setting by combining theorems and we obtain the main result of this section theorem for any distribution and any with probability at least the following holds uniformly for all er er kl qi kpi kl qkp log log where pn are some reference prior distributions that should not depend on the data of subsequent tasks similarly to theorem the above bound contains two types of complexity terms one corresponding 
to the level of the changes in the task environment and terms the first complexity term converges to like when the number of the observed tasks increases indicating that more observed tasks allow for better estimation of the behavior of the transfer algorithms the taskspecific complexity terms vanish only when the amount of observed data per tasks grows in addition since the right hand side of the inequality consists only of computable quantities and at the same time holds uniformly for all one can obtain posterior distribution by minimizing it over the transfer algorithms that is adjusted to particularly changing task environments we illustrate this process by discussing toy example figure suppose that and that the learner uses linear classifiers signhw xi and for solving every task for simplicity we assume that every task environment contains only one task or equivalently every ti is delta peak and that the change in the environment between two steps is due to constant rotation by of the feature space for the set we use family of transfer algorithms aα for given sample sets sprev and scur any algorithm aα first rotates sprev by the angle and then trains linear support vector machine on the union of both sets clearly the quality of each transfer algorithm depends on the chosen angle and an elementary calculation shows that condition is fulfilled we can therefore use the bound as criterion to determine beneficial for that we set qi wi unit variance gaussian distributions with means wi similarly we choose all reference prior distributions as unit variance gaussian with zero mean pi analogously we set the to be zero mean normal distribution with enlarged variance in order to make all reasonable rotations lie within one standard deviation from the mean as we choose and the goal of the learning is to identify the best in order to obtain the objective function from equation we first compute the complexity terms and approximate all expectations with respect to by the values at its mean kwi kl qi kpi the empirical error of the gibbs classifiers in the case of and gaussian distributions is given by the following expression we again approximate the expectation by the value at yji hwi xij er kxij where erf and erf is the gauss error function the resulting objective function that we obtain for identifying beneficial angle is the following hw kw kxij kl numeric experiments confirm that by optimizing with respect to one can obtain an advantageous angle using tasks each with samples we obtain an average test error of for the th task as can be expected this lies in between the error for the same setting without transfer which was and the error when always rotating by which was conclusion in this work we present analysis of lifelong learning under two types of relaxations of the assumption on the tasks our results show that accumulating knowledge over the course of learning multiple tasks can be beneficial for the future even if these tasks are not in particular for the situation when the observed tasks are sampled from the same task environment but with possible dependencies we prove theorem that generalizes the existing bound for the case as second setting we further relax the assumption and allow the task environment to change over time our bound shows that it is possible to estimate the performance of applying transfer algorithm on future tasks based on its performance on the observed ones furthermore our result can be used to identify beneficial algorithm based on the given data and we illustrate 
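The toy example can be rendered concretely as follows. This is a hypothetical sketch rather than the authors' code: task sizes, the drift angle, and the use of a direct test-error comparison (instead of minimizing the bound over the rotation angle) are illustrative assumptions, and the environment change is modelled by rotating the labelling direction, which for rotation-invariant Gaussian inputs is equivalent to rotating the feature space as described above.

```python
# Hedged sketch of the rotating-environment toy example: a transfer algorithm
# A_alpha rotates the previous task's sample by angle alpha and trains a
# linear SVM on the union of the rotated previous data and the current data.
import numpy as np
from sklearn.svm import LinearSVC

def rotate(X, alpha):
    """Rotate 2-d points by alpha radians."""
    c, s = np.cos(alpha), np.sin(alpha)
    return X @ np.array([[c, -s], [s, c]]).T

def make_task(w, n=20, rng=None):
    """Linearly separable 2-d task whose true separator is the unit vector w."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = rng.normal(size=(n, 2))
    y = np.sign(X @ w)
    return X, y

def transfer_error(alpha, drift=np.pi / 8, n_tasks=10, rng=None):
    """Average test error of A_alpha over tasks whose separator rotates by
    `drift` between consecutive tasks (assumed drift and sizes)."""
    if rng is None:
        rng = np.random.default_rng(1)
    w = np.array([1.0, 0.0])
    X_prev, y_prev = make_task(w, rng=rng)
    errs = []
    for _ in range(n_tasks):
        w = rotate(w[None, :], drift)[0]                 # environment changes
        X_cur, y_cur = make_task(w, rng=rng)
        X_train = np.vstack([rotate(X_prev, alpha), X_cur])
        y_train = np.concatenate([y_prev, y_cur])
        clf = LinearSVC(C=1.0).fit(X_train, y_train)
        X_test, y_test = make_task(w, n=500, rng=rng)
        errs.append(np.mean(clf.predict(X_test) != y_test))
        X_prev, y_prev = X_cur, y_cur
    return np.mean(errs)

# Rotating the old sample by the true drift should help compared with pooling
# the raw samples (alpha = 0), illustrating why a beneficial alpha exists.
print(transfer_error(alpha=0.0), transfer_error(alpha=np.pi / 8))
```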
this process on toy example for future work we plan to expand on this aspect essentially any existing domain adaptation algorithm can be used as transfer method in our setting however the success of domain adaptation techniques is often caused by asymmetry between the source and the target such algorithms usually rely on availability of extensive amounts of data from the source and only limited amounts from the target in contrast in lifelong learning setting all the tasks are assumed to be equipped with limited training data therefore we are particularly interested in identifying how far the constant quality assumption can be carried over to existing domain adaptation techniques and lifelong learning situations acknowledgments this work was in parts funded by the european research council under the european union seventh framework programme grant agreement no note that theorem provides an upper bound for the expected error of stochastic gibbs classifiers and not deterministic ones that are preferable in practice however for the error of gibbs classifier is bounded from below by half the error of the corresponding majority vote predictor and therefore twice the right hand side of provides bound for deterministic classifiers references sebastian thrun and tom mitchell lifelong robot learning technical report robotics and autonomous systems jonathan baxter model of inductive bias learning journal of artificial intelligence research leslie valiant theory of the learnable communications of the acm vladimir vapnik estimation of dependencies based on empirical data springer andreas maurer algorithmic stability and journal of machine learning research jmlr gilles blanchard gyemin lee and clayton scott generalizing from several related classification tasks to new unlabeled sample in conference on neural information processing systems nips anastasia pentina and christoph lampert bound for lifelong learning in international conference on machine learing icml andreas maurer transfer bounds for linear feature learning machine learning andreas maurer massimiliano pontil and bernardino sparse coding for multitask and transfer learning in international conference on machine learing icml balcan avrim blum and santosh vempala efficient representations for lifelong learning and autoencoding in workshop on computational learning theory colt david mcallester some theorems machine learning matthias seeger generalisation error bounds for gaussian process classification journal of machine learning research jmlr liva ralaivola marie szafranski and guillaume stempfel chromatic bounds for data applications to ranking and stationary processes journal of machine learning research jmlr daniel ullman and edward scheinerman fractional graph theory rational approach to the theory of graphs wiley interscience series in discrete mathematics monroe donsker and srinivasa varadhan asymptotic evaluation of certain markov process expectations for large time communications on pure and applied mathematics wassily hoeffding probability inequalities for sums of bounded random variables journal of the american statistical association yevgeny seldin franois laviolette nicol john and peter auer inequalities for martingales ieee transactions on information theory david mcallester simplified margin bounds in workshop on computational learning theory colt laviolette and mario marchand risk bounds for stochastic averages and majority votes of classifiers journal of machine learning research jmlr john langford and john and margins in 
conference on neural information processing systems nips pascal germain alexandre lacasse francois laviolette and mario marchand pac bayesian learning of linear classifiers in international conference on machine learning icml
optimal linear estimation under unknown nonlinear transform xinyang yi the university of texas at austin yixy zhaoran wang princeton university zhaoran constantine caramanis the university of texas at austin constantine han liu princeton university hanliu abstract linear regression studies the problem of estimating model parameter rp from observations yi xi from linear model yi hxi we consider significant generalization in which the relationship between hxi and yi is noisy quantized to single bit potentially nonlinear noninvertible as well as unknown this model is known as the model in statistics and among other things it represents significant generalization of compressed sensing we propose novel estimation procedure and show that we can recover in settings classes of link function where previous algorithms fail in general our algorithm requires only very mild restrictions on the unknown functional relationship between yi and hxi we also consider the high dimensional setting where is sparse and introduce nonconvex framework that addresses estimation challenges in high dimensional regimes where for broad class of link functions between hxi and yi we establish minimax lower bounds that demonstrate the optimality of our estimators in both the classical and high dimensional regimes introduction we consider generalization of the quantized regression problem where we seek to recover the regression coefficient rp from measurements specifically suppose that is random vector in rp and is binary random variable taking values in we assume the conditional distribution of given takes the form hx where is called the link function we aim to estimate from observations yi xi of the pair in particular we assume the link function is unknown without any loss of generality we take to be on the unit sphere since its magnitude can always be incorporated into the link function the model in is simple but general under specific choices of the link function immediately leads to many practical models in machine learning and signal processing including logistic regression and compressed sensing in the settings where the link function is assumed to be known popular estimation procedure is to calculate an estimator that minimizes certain loss function however for particular link functions this approach involves minimizing nonconvex objective function for which the global minimizer is in general intractable to obtain furthermore it is difficult or even impossible to know the link function in practice and poor choice of link function may result in inaccurate parameter estimation and high prediction error we take more general approach and in particular target the setting where is unknown we propose an algorithm that can estimate the parameter in the absence of prior knowledge on the link function as our results make precise our algorithm succeeds as long as the function satisfies single moment condition as we demonstrate this moment condition is only mild restriction on in particular our methods and theory are widely applicable even to the settings where is sign or noninvertible sin in particular as we show in our restrictions on are sufficiently flexible so that our results provide unified framework that encompasses broad range of problems including logistic regression compressed sensing phase retrieval as well as their robust extensions we use these important examples to illustrate our results and discuss them at several points throughout the paper main contributions the key conceptual contribution of this work is novel 
use of the method of moments rather than considering moments of the covariate and the response variable we look at moments of differences of covariates and differences of response variables such simple yet critical observation enables everything that follows and leads to our procedure we also make two theoretical contributions first we simultaneously establish the statistical and computational rates of convergence of the proposed spectral algorithm we consider both the low dimensional setting where the number of samples exceeds the dimension and the high dimensional setting where the dimensionality may greatly exceed the number of samples in both these settings our proposed algorithm achieves the same statistical rate of convergence as that of linear regression applied on data generated by the linear model without quantization second we provide minimax lower bounds for the statistical rate of convergence and thereby establish the optimality of our procedure within broad model class in the low dimensional setting our results obtain the optimal rate with the optimal sample complexity in the high dimensional setting our algorithm requires estimating sparse eigenvector and thus our sample complexity coincides with what is believed to be the best achievable via polynomial time methods the error rate itself however is informationtheoretically optimal we discuss this further in related works our model in is close to the model sim in statistics in the sim we assume that the pair is determined by hx with unknown link function and noise our setting is special case of this as we restrict to be binary random variable the single index model is classical topic and therefore there is extensive literature too much to exhaustively review it we therefore outline the pieces of work most relevant to our setting and our results for estimating in feasible approach is in which the unknown link function is jointly estimated using nonparametric estimators although these have been shown to be consistent they are not computationally efficient since they involve solving nonconvex optimization problem another approach to estimate is named the average derivative estimator ade further improvements of ade are considered in ade and its related methods require that the link function is at least differentiable and thus excludes important models such as compressed sensing with sign beyond estimating the works in focus on iteratively estimating function and vector that are good for prediction and they attempt to control the generalization error their algorithms are based on isotonic regression and are therefore only applicable when the link function is monotonic and satisfies lipschitz constraints the work discussed above focuses on the low dimensional setting where another related line of works is sufficient dimension reduction where the goal is to find subspace of the input space such that the response only depends on the projection model and our problem can be regarded as special cases of this problem as we are primarily interested in recovering subspace due to space limit we refer readers to the long version of this paper for detailed survey in the high dimensional regime with and has some structure for us this means sparsity we note there exists some recent progress on estimating via pac bayesian methods in the special case when is linear function sparse linear regression has attracted extensive study over the years the recent work by plan et al is closest to our setting they consider the setting of normal covariates ip 
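To make the three example models concrete, the sketch below generates one-bit observations under each link function described above; the noise level, flip probability, and the choice of the quantization threshold as the median of |⟨x, β*⟩| are illustrative assumptions rather than values prescribed in the text.

```python
# Illustrative data generators for the three example one-bit models.
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 1000
beta = rng.normal(size=p)
beta /= np.linalg.norm(beta)            # beta* lies on the unit sphere
X = rng.normal(size=(n, p))
z = X @ beta

# Flipped logistic regression: logistic labels, flipped with probability p_e.
p_e = 0.1
prob = 1.0 / (1.0 + np.exp(-z))
y_lr = np.where(rng.random(n) < prob, 1, -1)
flip = rng.random(n) < p_e
y_lr[flip] *= -1

# One-bit compressed sensing (robust version): y = sign(<x, beta> + noise).
sigma = 0.3
y_cs = np.sign(z + sigma * rng.normal(size=n))

# One-bit phase retrieval: y = sign(|<x, beta>| + noise - theta).
theta = np.median(np.abs(z))            # an assumed, natural threshold choice
y_pr = np.sign(np.abs(z) + sigma * rng.normal(size=n) - theta)
```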
and they propose marginal regression estimator for estimating that like our approach requires no prior knowledge about their proposed algorithm relies on the assumption that zf and hence can not work for link functions that are even as we will describe below our algorithm is based on novel estimator and avoids requiring such condition thus allowing us to handle even link functions under very mild moment restriction which we describe in detail below generally the work in requires different conditions and thus beyond the discussion above is not directly comparable to the work here in cases where both approaches apply the results are minimax optimal example models in this section we discuss several popular and important models in machine learning and signal processing that fall into our general model under specific link functions variants of these models have been studied extensively in the recent literature these examples trace through the paper and we use them to illustrate the details of our algorithms and results logistic regression in logistic regression lr we assume that exp where is the intercept the link function corresponds to exp one robust variant of lr is called flipped logistic regression where we assume that the labels generated from standard lr model are flipped with probability pe pe hx this reduces to the standard lr model when pe for flipped lr the link function can be written as exp exp exp exp flipped lr has been studied by in both papers estimating is based on minimizing some surrogate loss function involving certain tuning parameter connected to pe however pe is unknown in practice in contrast to their approaches our method does not hinge on the unknown parameter pe our approach has the same formulation for both standard and flipped lr thus unifies the two models compressed sensing compressed sensing cs aims at recovering sparse signals from quantized linear measurements see in detail we define rp supp as the set of sparse vectors in rp with at most nonzero elements we assume rp satisfies sign hx where in this paper we also consider its robust version with noise sign hx assuming the link function of robust cs thus corresponds to du note that also corresponds to the probit regression model without the sparse constraint on throughout the paper we do not distinguish between the two model names model is referred to as compressed sensing even in the case where is not sparse phase retrieval the goal of phase retrieval is to recover signals based on linear measurements with phase information erased pair rp is determined by equation analogous to compressed sensing we consider new model named phase retrieval where the linear measurement with phase information erased is quantized to one bit in detail pair rp is linked through sign where is the quantization threshold compared with compressed sensing this problem is more difficult because only depends on through the magnitude of hx instead of the value of hx also it is more difficult than the original phase retrieval problem due to the additional quantization using our general model the link function thus corresponds to sign it is worth noting that unlike previous models here is neither odd nor monotonic main results we now turn to our algorithms for estimating in both low and high dimensional settings we first introduce second moment estimator based on pairwise differences we prove that the eigenstructure of the constructed second moment estimator encodes the information of we then propose algorithms to estimate based upon this 
second moment estimator in the high dimensional setting where is sparse computing the top eigenvector of our matrix reduces to computing sparse eigenvector beyond algorithms we discuss minimax lower bound in we present simulation results in conditions for success we now introduce several key quantities which allow us to state precisely the conditions required for the success of our algorithm definition for any unknown link function define the quantity as follows where and are given by µk where as we discuss in detail below the key condition for success of our algorithm is as we show below this is relatively mild condition and in particular it is satisfied by the three examples introduced in for odd and monotonic unless for all in which case no algorithm is able to recover for even we have thus if and only if second moment estimator we describe novel moment estimator that enables our algorithm let yi xi be the observations of assuming without loss of generality that is even we consider the following key transformation for our procedure is based on the following second moment the intuition behind this second moment is as follows by the variation of along the direction has the largest impact on the variation of hx thus the variation of directly depends on the variation of along consequently encodes the information of such dependency relationship in the following we make this intuition more rigorous by analyzing the eigenstructure of and its relationship with lemma for we assume that rp satisfies for ip we have ip where and are defined in and lemma proves that is the leading eigenvector of as long as the eigengap is positive if instead we have we can use related moment estimator which has analogous properties to this end define in parallel to lemma we have similar result for as stated below corollary under the setting of lemma ip corollary therefore shows that when we can construct another second moment estimator such that is the leading eigenvector of as discussed above this is precisely the setting for phase retrieval when the quantization threshold in satisfies θm for simplicity of the discussion hereafter we assume that and focus on the second moment estimator defined in natural question to ask is whether holds for specific models the following lemma demonstrates exactly this for the example models introduced in lemma consider the flipped logistic regression where is given in by setting the intercept to be we have for robust compressed sensing where is given in we have min for phase retrieval where is given in for we let θm be the median of θm we have θm exp and sign sign θm we thus obtain for θm low dimensional recovery we consider estimating in the classical low dimensional setting where based on the second moment estimator defined in estimating amounts to solving noisy eigenvalue problem we solve this by simple iterative algorithm provided an initial vector which may be chosen at random we perform power iterations as shown in algorithm theorem we assume ip and follows let yi xi be samples of response input pair for any link function in with defined in and and we let γφ and there exist constant ci such that when for algorithm we have that with probability at least exp for tmax optimization error statistical error here βb where βb is the first leading eigenvector of note that by we have thus the optimization error term in decreases at geometric rate to zero as increases for tmax sufficiently large such perror that the statistical and optimization error terms in are of the same order we 
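The sketch below illustrates the overall spectral procedure: a pairwise-difference second-moment matrix, plain power iterations for the low-dimensional setting, and the truncation operator used in the sparse-recovery loop. The exact normalization of the second-moment matrix, and the initialization step (which in the text comes from a convex relaxation solved by ADMM), are assumptions of this sketch.

```python
# Hedged sketch of the spectral estimation procedure; what matters is that
# the leading eigenvector of the pairwise-difference matrix aligns with
# beta* when the eigengap condition phi > 0 holds.
import numpy as np

def second_moment(X, y):
    """Second-moment matrix built from differences of consecutive sample pairs
    (normalization assumed)."""
    n = (len(y) // 2) * 2
    dX = X[1:n:2] - X[0:n:2]            # x_{2i} - x_{2i-1}
    dy = y[1:n:2] - y[0:n:2]            # y_{2i} - y_{2i-1}
    return (dX * (dy ** 2)[:, None]).T @ dX / (n // 2)

def power_iteration(M, t_max=100, rng=None):
    """Low-dimensional recovery: plain power iterations from a random start."""
    if rng is None:
        rng = np.random.default_rng(0)
    b = rng.normal(size=M.shape[0])
    b /= np.linalg.norm(b)
    for _ in range(t_max):
        b = M @ b
        b /= np.linalg.norm(b)
    return b

def trunc(b, s):
    """Keep the s largest-magnitude coordinates, zero out the rest."""
    out = np.zeros_like(b)
    idx = np.argsort(-np.abs(b))[:s]
    out[idx] = b[idx]
    return out

def truncated_power_iteration(M, s, b0, t_max=100):
    """Sparse recovery core loop: power step followed by truncation; b0 is
    assumed to come from a good initializer (e.g. a convex relaxation)."""
    b = trunc(b0, s)
    b /= np.linalg.norm(b)
    for _ in range(t_max):
        b = trunc(M @ b, s)
        b /= np.linalg.norm(b)
    return b

# Usage with the one-bit compressed sensing data from the earlier sketch
# (assumed in scope):  M = second_moment(X, y_cs); beta_hat = power_iteration(M)
```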
have tmax this statistical rate of convergence matches the rate of estimating vector in linear regression without any quantization and will later be shown to be optimal this result shows that the lack of prior knowledge on the link function and the information loss from quantization do not keep our procedure from obtaining the optimal statistical rate high dimensional recovery next we consider the high dimensional setting where and is sparse with being support size although this high dimensional estimation problem is closely recall that we have an analogous treatment and thus results for related to the sparse pca problem the existing works on sparse pca do not provide direct solution to our problem in particular they either lack statistical guarantees on the convergence rate of the obtained estimator or rely on the properties of the sample covariance matrix of gaussian data which are violated by the second moment estimator defined in for the sample covariance matrixp of data prove that the convex relaxation proposed by achieves suboptimal log rate of convergence yuan and zhang propose the truncated power method and show that it attains the optimal log rate algorithm low dimensional recovery algorithm sparse recovery input yi xi number of iterations tmax input yi xi number of iterations tmax second moment estimation construct regularization parameter sparsity level sb from samples according to second moment estimation construct from samples according to initialization choose random vector initialization argmin πi for tmax do tr end for first leading eigenvector of output tmax trunc sb locally that is it exhibits this rate of convergence for max do only in neighborhood of the true solution where trunc sb hβ where is some constant it is well understood that for random initialization end for on such condition fails with probability output tmax going to one as instead we propose procedure for estimating in our setting in the first stage we adapt the convex relaxation proposed by and use it as an initialization step in order to obtain good enough initial point satisfying the condition hβ the convex optimization problem can be easily solved by the alternating direction method of multipliers admm algorithm see for details then we adapt the truncated power method this procedure is illustrated in algorithm in particular we define truncation operator trunc as trunc βj where is the index set corresponding to the top largest the initialization phase of our algorithm requires log samples see below for more precise details to succeed as work in suggests it is unlikely that polynomial time algorithm can avoid such dependence however once we are near the solution as we show this procedure achieves the optimal error rate of log theorem let and the minimum sample size be nmin log min suppose log with sufficiently large constant where and are specified in and meanwhile assume the sparsity parameter sb in algorithm is set to be max for nmin with nmin defined in we have log min kβ optimization error statistical error with high probability here is defined in the first term on the side of is the statistical error while the second term gives the optimization error note that the optimization error decays at geometric rate since for tmax sufficiently large we have max log in the sequel we show that the side gives the optimal statistical rate of convergence for broad model class under the high dimensional setting with minimax lower bound we establish the minimax lower bound for estimating in the model defined in in 
the sequel we define the family of link functions that are lipschitz continuous and are bounded away from formally for any and we define for all let xfn yi xi be the realizations of where follows ip and satisfies with link function correspondingly we denote the estimator of to be where is the domain of we define the minimax risk for estimating as inf inf sup but also all in the above definition we not only take the infimum over all possible estimators possible link functions in for fixed our formulation recovers the standard definition of minimax risk by taking the infimum over all link functions our formulation characterizes the minimax lower bound under the least challenging in in the sequel we prove that our procedure attains such minimax lower bound for the least challenging given any unknown link function in that is to say even when is unknown our estimation procedure is as accurate as in the setting where we are provided the least challenging and the achieved accuracy is not improvable due to the limit the following theorem establishes the minimax lower bound in the high dimensional setting theorem let we assume that cs log for any the minimax risk defined in satisfies log here and are absolute constants while and are defined in theorem establishes the minimax optimality of the statistical rate attained by our procedure for and in particular for arbitrary the estimator attained by algorithm is in the sense that its log rate of convergence is not improvable even when the information on the link function is available for general rp one can show the best possible convergence rate is by setting in theorem it is worth to note that our lower bound becomes trivial for there exists some such that one example is the noiseless compressedpsensing for which we have sign in fact for noiseless compressed sensing the log rate is not optimal for example the jacques et al provide an algorithm with exponential running time that achieves rate log understanding such rate transition phenomenon for link functions with zero margin in is an interesting future direction numerical results we now turn to the numerical results that support our theory for the three models introduced in we apply algorithm and algorithm to do parameter estimation in the classic and high dimensional regimes our simulations are based on synthetic data for classic recovery is randomly chosen from for sparse recovery we set for all where is random index subset of with size in figure as predicted by theorem we observe that the same leads to nearly identical estimation error figure demonstrates similar results for the predicted rate log of sparse recovery and thus validates theorem flipped logistic regression estimation error estimation error estimation error compressed sensing phase retrieval figure estimation error of low dimensional recovery pe log flipped logistic regression estimation error estimation error estimation error log compressed sensing log phase retrieval figure estimation error of sparse recovery pe discussion sample complexity in high dimensional regime while our algorithm achieves optimal convergence rate the sample complexity we need is log the natural question is whether it can be reduced to log we note that breaking the barrier log is challenging consider simpler problem sparse phase retrieval where yi with fairly extensive body of literature the efficient algorithms with polynomial running time for recovering sparse requires sample complexity log it remains open to show whether it possible to do consistent sparse 
recovery with log samples by any polynomial time algorithms acknowledgment xy and cc would like to acknowledge nsf grants and this research was also partially supported by the department of transportation through the transportation operations and planning tier university transportation center hl is grateful for the support of nsf career award nsf nsf nih nih and nih zw was partially supported by msr phd fellowship while this work was done references and sparse model journal of machine learning research and complexity theoretic lower bounds for sparse principal component detection in conference on learning theory at and distributed optimization and statistical learning via the alternating direction method of multipliers foundations and in machine learning trends and sparse pca optimal rates and adaptive estimation annals of statistics and phase retrieval via matrix completion siam journal on imaging sciences and optimal solutions for sparse principal component analysis journal of machine learning research and direct formulation for sparse pca using semidefinite programming siam review ta and at optimal smoothing in semiparametric index approximation of regression functions tech interdisciplinary research project quantification and simulation of economic processes ta and at on semiparametric in regression journal of statistical planning and inference and phase retrieval stability and recovery guarantees applied and computational harmonic analysis pa and compressed sensing provable support and vector recovery in international conference on machine learning and optimal smoothing in models annals of statistics ta and direct estimation of the index coefficient in model annals of statistics pp and robust compressive sensing via binary stable embeddings of sparse vectors arxiv preprint and efficient learning of generalized linear and single index models with isotonic regression in advances in neural information processing systems and the isotron algorithm isotonic regression in conference on learning theory sparse principal component analysis and iterative thresholding the annals of statistics and concentration inequalities and model selection vol springer ata av and wa learning with noisy labels in advances in neural information processing systems and compressed sensing by linear programming communications on pure and applied mathematics and estimation with geometric constraints arxiv preprint and semiparametric estimation of index coefficients econometrica pp and sparse principal component analysis via regularized low rank matrix approximation journal of multivariate analysis consistent estimation of scaled coefficients econometrica pp and robust logistic regression using shift parameters arxiv preprint introduction to the analysis of random matrices arxiv preprint and fantope projection and selection convex relaxation of sparse pca in advances in neural information processing systems and penalized matrix decomposition with applications to sparse principal components and canonical correlation analysis biostatistics and optimal linear estimation under unknown nonlinear transform arxiv preprint assouad fano and le cam in festschrift for lucien le cam springer and truncated power method for sparse eigenvalue problems journal of machine learning research and sparse principal component analysis journal of computational and graphical statistics 
learning with group invariant features kernel perspective youssef mroueh ibm watson group mroueh stephen cbmm mit voinea author tomaso poggio cbmm mit tp abstract we analyze in this paper random feature map based on theory of invariance introduced in more specifically group invariant signal signature is obtained through cumulative distributions of random projections our analysis bridges invariant feature learning with kernel methods as we show that this feature map defines an expected kernel that is invariant to the specified group action we show how this random feature map approximates this group invariant kernel uniformly on set of points moreover we show that it defines function space that is dense in the equivalent invariant reproducing kernel hilbert space finally we quantify error rates of the convergence of the empirical risk minimization as well as the reduction in the sample complexity of learning algorithm using such an invariant representation for signal classification in classical supervised learning setting introduction encoding signals or building similarity kernels that are invariant to the action of group is key problem in unsupervised learning as it reduces the complexity of the learning task and mimics how our brain represents information invariantly to symmetries and various nuisance factors change in lighting in image classification and pitch variation in speech recognition convolutional neural networks achieve state of the art performance in many computer vision and speech recognition tasks but require large amount of labeled examples as well as augmented data where we reflect symmetries of the world through virtual examples obtained by applying identitypreserving transformations such as shearing rotation translation to the training data in this work we adopt the approach of where the representation of the signal is designed to reflect the invariant properties and model the world symmetries with group actions the ultimate aim is to bridge unsupervised learning of invariant representations with invariant kernel methods where we can use tools from classical supervised learning to easily address the statistical consistency and sample complexity questions indeed many invariant kernel methods and related invariant kernel networks have been proposed we refer the reader to the related work section for review section and we start by showing how to accomplish this invariance through haarintegration kernels and then show how random features derived from theory of invariances introduced in approximate such kernel group invariant kernels we start by reviewing kernels introduced in and their use in binary classification problem this section highlights the conceptual advantages of such kernels as well as their practical inconvenience putting into perspective the advantage of approximating them with explicit and invariant random feature maps invariant kernels we consider subset of the hypersphere in dimensions let ρx be measure on consider kernel on such as radial basis function kernel let be group acting on with normalized haar measure is assumed to be compact and unitary group define an invariant kernel between through as follows gx dµ dµ as we are integrating over the entire group it is easy to see that gz hence the kernel is invariant to the group action the symmetry of is obvious moreover if is positive definite kernel it follows that is positive definite as well one can see the kernel framework as another form of data augmentation since we have to produce points in order to 
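As a concrete instance of the Haar-integration construction, the following sketch averages a base RBF kernel over a finite unitary group (cyclic shifts); the base kernel and the choice of group are illustrative assumptions, and for a compact continuous group the sums would be replaced by integrals against the normalized Haar measure.

```python
# Minimal sketch of a group-averaged (Haar-integration style) kernel.
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def orbit(x):
    """All cyclic shifts of x: the orbit of x under the (finite) shift group."""
    return [np.roll(x, g) for g in range(len(x))]

def invariant_kernel(x, z, gamma=1.0):
    """K_bar(x, z) = average over g, g' of K(g x, g' z); invariant by construction."""
    Ox, Oz = orbit(x), orbit(z)
    return np.mean([[rbf(gx, gz, gamma) for gz in Oz] for gx in Ox])

x = np.array([1.0, 0.0, 0.0, 0.0])
z = np.roll(x, 2)                        # z is a shifted copy of x
print(invariant_kernel(x, x), invariant_kernel(x, z))   # identical values
```

The example at the end illustrates the invariance: because both arguments are averaged over the whole orbit, the kernel value does not change when either point is replaced by a transformed copy.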
compute the kernel invariant decision boundary turning now to binary classification problem we assume that we are given labeled training set xi yi xi yi in order to learn decision function we minimize the following empirical risk induced by an pn convex loss function with minf yi xi where we restrict to belong to hypothesis class induced by the invariant kernel the so called reproducing kernel hilbert space hk the representer theorem shows that the solution of such problem pn xi since the has the following form fn or the optimal decision boundary fn pn pn gx αi gx xi αi xi kernel is it follows that fn fn hence the the decision boundary is as well and we have fn gx fn reduced sample complexity we have shown that kernel induces groupinvariant decision boundary but how does this translate to the sample complexity of the learning algorithm to answer this question we will assume that the input set has the following structure gx where is the identity group element this structure implies that for function in the invariant rkhs hk we have such that gx and let ρy be the label posteriors we assume that ρy gx ρy this is natural assumption since the label is unchanged given the group action assume that the set is endowed with measure ρx that is also let be the decision function and consider the expected risk induced by the loss ev defined as follows ev yf ρy ρx dx ev is proxy to the misclassification risk using the invariant properties of the function class and the data distribution we have by invariance of ρy and ev yf ρy ρx dx yf ρy ρx dz dµ dµ yf ρy ρx dx by invariance of ρy and yf gx ρy gx ρx dx yf ρy ρx dx hence given an invariant kernel to group action that is identity preserving it is sufficient to minimize the empirical risk on the core set and it generalizes to samples in let us imagine that is finite with cardinality the cardinality of the core set is small fraction of the cardinality of where hence when we sample training points from the maximum size of the training set is yielding reduction in the sample complexity contributions we have just reviewed the kernel in summary kernel implies the existence of decision function that is invariant to the group action as well as reduction in the sample complexity due to sampling training points from reduced set the core set kernel methods with kernels come at very expensive computational price at both training and test time computing the kernel is computationally cumbersome as we have to integrate over the group and produce virtual examples by transforming points explicitly through the group action moreover the training complexity of kernel methods scales cubicly in the sample size those practical considerations make the usefulness of such kernels very limited the contributions of this paper are on three folds we first show that random feature map rd derived from memorybased theory of invariances introduced in induces an expected haarintegration kernel for fixed points we have hφ where satisfies gx we show type result that holds uniformly on set of points that assess the concentration of this random feature map around its expected induced kernel for sufficiently large we have hφ uniformly on an points set we show that with linear model an invariant decision function can be learned in this random feature space by sampling points from the core set fn and generalizes to unseen points in reducing the sample complexity moreover we show that those features define function space that approximates dense subset of the invariant rkhs and assess the error 
rates of the empirical risk minimization using such random features we demonstrate the validity of these claims on three datasets text artificial vision mnist and speech tidigits from group invariant kernels to feature maps in this paper we show that random feature map based on rd approximates kernel having the form given in equation hφ we start with some notation that will be useful for defining the feature map denote the cumulative distribution function of random variable by fx fix let be random variable drawn according to the normalized haar measure and let be random template whose distribution will be defined later for define the following truncated cumulative distribution function cdf of the dot product hx gti pg hx gti fhx gti let we consider the following gaussian vectors sampling with rejection for the templates id if else the reason behind this sampling is to keep the range of hx gti under control the squared norm will be bounded by with high probability by classical concentration result see proof of theorem for more details the group being unitary and we know that hx gti for remark we can also consider templates drawn uniformly on the unit sphere uniform templates on the sphere can be drawn as follows id since the norm of gaussian vector is highly concentrated around its mean we can use the gaussian sampling with rejection results proved for gaussian templates with rejection will hold true for templates drawn at uniform on the sphere with different constants define the following kernel function ks et dτ where will be fixed throughout the paper to be since the gaussian sampling with rejection controls the dot product to be in that range let dµ as the group is closed we have dµ and hence gx for all it is clear hgx now that is kernel in order to approximate we sample elements uniformly and independently from the group gi and define the normalized empirical cdf we discretize the continuous threshold as follows sk ns we sample templates independently according to the gaussian sampling with rejection tj we are now ready to define the random feature map sk tj it is easy to see that lim et hφ ir lim et sk sk tj tj ks in section we study the geometric information captured by this kernel by stating explicitly the similarity it computes remark efficiency of the representation the main advantage of such feature map as outlined in is that we store transformed templates in order to compute while if we wanted to compute an invariant kernel of type equation we would need to explicitly transform the points the latter is computationally expensive storing transformed templates and computing the signature is much more efficient it falls in the category of learning and is biologically plausible as get large enough the feature map approximates kernel as we will see in next section an equivalent expected kernel and uniform concentration result in this section we present our main results with proofs given in the supplementary material theorem shows that the random feature map defined in the previous section corresponds in expectation to kernel ks moreover ks computes the average pairwise distance between all points in the orbits of and where the orbit is defined as the collection of all of given point ox gx theorem expectation let and define the distance dg between the orbits ox and oz dg kgx dµ dµ and the expected kernel ks lim et hφ ir et dτ the following inequality holds with probability ks dg where and for any as the dimension we have and and we have asymptotically ks dg dg ks is symmetric and ks is 
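A minimal rendering of this feature map is sketched below: each feature is the empirical CDF of ⟨x, g t_j⟩ over sampled group elements g, evaluated at a grid of thresholds s_k. The cyclic-shift group, unit-sphere templates (one of the sampling options mentioned above), and the threshold grid on [-1, 1] are illustrative assumptions.

```python
# Hedged sketch of the CDF-pooled, group-invariant random feature map.
import numpy as np

def make_templates(m, d, rng):
    """Random unit-norm templates (Gaussian directions projected to the sphere)."""
    T = rng.normal(size=(m, d))
    return T / np.linalg.norm(T, axis=1, keepdims=True)

def transform_templates(T, n_group):
    """Pre-compute g t_j for all sampled group elements (cyclic shifts here);
    only the transformed templates are stored, never transformed data."""
    return np.stack([np.roll(T, g, axis=1) for g in range(n_group)])   # (|G|, m, d)

def cdf_features(X, GT, thresholds):
    """phi(x)_{j,k} = (1/|G|) * sum_g 1[<x, g t_j> <= s_k]."""
    proj = np.einsum('nd,gmd->ngm', X, GT)                  # (n, |G|, m)
    feats = (proj[..., None] <= thresholds).mean(axis=1)    # average over group
    return feats.reshape(len(X), -1)

rng = np.random.default_rng(0)
d, m, K = 16, 32, 10
X = rng.normal(size=(5, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
GT = transform_templates(make_templates(m, d, rng), n_group=d)
s = np.linspace(-1.0, 1.0, K)
phi = cdf_features(X, GT, s)
print(phi.shape)                                            # (5, m * K)
# With the full shift group, shifted inputs get identical features:
print(np.allclose(phi, cdf_features(np.roll(X, 3, axis=1), GT, s)))
```

Because the full shift group is used here, the features of x and of any shifted copy of x coincide exactly; with only a sampled subset of group elements they coincide up to the sampling error that the concentration results of this section quantify.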
positive remark and are not errors due to results holding with high probability but are due to the truncation and are technical artifact of the proof local invariance can be defined by restricting the sampling of the group elements to subset assuming that for each the equivalent kernel has asymptotically the following form ks kgx dµ dµ the constraint can be relaxed let hence we can set and ks dg where re and theorem is in sense an invariant type result where we show that the dot product defined by the random feature map hφ is concentrated around the invariant expected kernel uniformly on data set of points given sufficiently large number of templates large number of sampled group elements and large bin number the error naturally decomposes to numerical error and statistical errors due to the sampling of the templates and the group elements respectively theorem type point set let xi xi be finite dataset fix for number of bins templates log and group elements log where are universal numeric constants we have xi xj ks xi xj with probability putting together theorems and the following corollary shows how the random feature map captures the invariant distance between points uniformly on dataset of points corollary invariant features maps and distances between orbits let xi xi be finite dataset fix for number of bins templates log and group elements log nδm where are universal numeric constants we have hφ xi xj dg xi xj with probability remark assuming that the templates are unitary and drawn form general distribution the equivalent kernel has the following form ks dµ dµ max hx gti hz ti dt indeed when we use the gaussian sampling with rejection for the templates the integral max hx gti hz ti dt is asymptotically proportional to it is interesting to consider different distributions that are for the templates and assess the number of the templates needed to approximate such kernels it is also interesting to find the optimal templates that achieve the minimum distortion in equation in data dependent way but we will address these points in future work learning with group invariant random features in this section we show that learning linear model in the invariant random feature space on training set sampled from the reduced core set has low expected risk and generalizes to unseen test points generated from the distribution on the architecture of the proof follows ideas from and recall that given an convex loss function our aim is to minimize the expected risk given in equation denote the cdf by hgt xi and the empirical cdf by let be the distribution of templates rs the rkhs defined by the invariant kernel ks ks dtdτ denoted hks is the completion of the set of all finite linear combinations of the form αi ks xi xi αi similarly to we define the following function space fp dtdτ sup lemma fp is dense in hks for fp we have ev yf ρy dρx where is the reduced core set since fp is dense in hks we canh learn an invariant decision function in the space fp instead sk of learning in hks let tj and are equivalent up to constants we will approximate the set fp as follows sk hw wj tj tj hence we learn the invariant decision function via empirical risk minimization where we restrict the function to belong to and the sampling in the training set is restricted to the core set note that with this function space we are regularizing for convenience the norm infinity of the weights but this can be relaxed in practice to classical tikhonov regularization theorem learning with group invariant features let xi yi xi yi 
training set sampled from the core set let fn arg minf pn yi xi then log lc ev fn min ev log log with probability at least on the training set and the choice of templates and group elements the proof of theorem is given in appendix theorem shows that learning linear model in the invariant random feature space defined by or equivalently has low expected risk more importantly this risk is arbitrarily close to the optimal risk achieved in an infinitedimensional class of functions namely fp the training set is sampled from the reduced core set and invariant learning generalizes to unseen test points generated from the distribution on hence the reduction in the sample complexity recall that fp is dense in the rkhs of the invariant kernel and so the expected risk achieved by linear model in the invariant random feature space is not far from the one attainable in the invariant rkhs note that the error decomposes into two terms the first is statistical and it depends on the training sample complexity the other is governed by the approximation error of functions fp with functions in and depends on the number of templates number of elements sampled the number of bins and has the following form log relation to previous work we now put our contributions in perspective by outlining some of the previous work on invariant kernels and approximating kernels with random features approximating kernels several schemes have been proposed for approximating kernel with an explicit feature map in conjunction with linear methods such as the method or random sampling techniques in the fourier domain for kernels our features fall under the random sampling techniques where unlike previous work we sample both projections and group elements to induce invariance with an integral representation we note that the relation between random features and quadrature rules has been thoroughly studied in where sharper bounds and error rates are derived and can apply to our setting invariant kernels we focused in this paper on kernels since they have an integral representation and hence can be represented with random features other invariant kernels have been proposed in authors introduce transformation invariant kernels but unlike our general setting the analysis is concerned with dilation invariance in multilayer arccosine kernels are built by composing kernels that have an integral representation but does not explicitly induce invariance more closely related to our work is where kernel descriptors are built for visual recognition by introducing kernel view of histogram of gradients that corresponds in our case to the cumulative distribution on the group variable explicit feature maps are obtained via kernel pca while our features are obtained via random sampling finally the convolutional kernel network of builds sequence of multilayer kernels that have an integral representation by convolution considering spatial neighborhoods in an image our future work will consider the composition of kernels where the convolution is applied not only to the spatial variable but to the group variable akin to numerical evaluation in this paper and specifically in theorems and we showed that the random feature map captures the invariant distance between points and that learning linear model trained in the invariant random feature space will generalize well to unseen test points in this section we validate these claims through three experiments for the claims of theorem we will use nearest neighbor classifier while for theorem we will rely on the 
regularized least squares rls classifier one of the simplest algorithms for supervised learning while our proofs focus on regularization rls corresponds to tikhonov regularization with square loss specifically for performing classification on batch of training points in rd summarized in the data matrix and label matrix rn rls will perform the optimization minw where is the frobenius norm is the regularization parameter and is the feature map which for the representation described in this paper will be cdf pooling of the data projected onto random templates all rls experiments in this paper were completed with the gurls toolbox the three datasets we explore are xperm figure an artificial dataset consisting of all sequences of length whose elements come from an alphabet of characters we want to learn function which assigns positive value to any sequence that contains target set of characters in our case two of them regardless of their position thus the function label is globally invariant to permutation and so we project our data onto all permuted versions of our random template sequences mnist figure we seek local invariance to translation and rotation and so all random templates are translated by up to pixels in all directions and rotated between and degrees tidigits figure we use subset of tidigits consisting of speakers men women children reading the digits in isolation and so each datapoint is waveform of single word we seek local invariance to pitch and speaking rate and so all random templates are pitch shifted up and down by cents and warped to play at half and double speed the task is classification with one see for more detail acknowledgements stephen voinea acknowledges the support of nuance foundation grant this work was also supported in part by the center for brains minds and machines cbmm funded by nsf stc award ccf xperm sample complexity rls raw haar cdf xperm sample complexity nn cdf cdf raw cdf accuracy number of training points per class number of training points per class figure classification accuracy as function of training set size averaged over random training samples at each size cdf refers to random feature map with bins and templates with templates the random feature map outperforms the raw features and representation also invariant to permutation and even approaches an rls classifier with kernel error bars were removed from the rls plot for clarity see supplement mnist accuracy rls points per class bins mnist sample complexity rls raw cdf accuracy number of templates number of training points per class figure left plot mean classification accuracy as function of number of bins and templates averaged over random sets of templates right plot classification accuracy as function of training set size averaged over random samples of the training set at each size at examples per class we achieve an accuracy of tidigits gender rls tidigits speaker rls bins bins accuracy number of templates number of templates figure mean classification accuracy as function of number of bins and templates averaged over random sets of templates in the speaker dataset we test on unseen speakers and in the gender dataset we test on new gender giving us an extreme mismatch references anselmi leibo rosasco mutch tacchetti and poggio unsupervised learning of invariant representations in hierarchical corr vol bruna and mallat invariant scattering convolution networks corr vol hinton krizhevsky and wang transforming auto encoders bengio courville and vincent representation learning review and new 
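For completeness, the RLS step described above has a closed form; the following is a minimal sketch (a one-hot label matrix Y and the value of the regularization parameter are assumptions), not the GURLS implementation used in the experiments.

```python
# Minimal ridge / RLS sketch on top of the invariant features phi.
import numpy as np

def rls_fit(Phi, Y, lam=1e-3):
    """Closed-form minimizer of ||Y - Phi W||_F^2 + lam * ||W||_F^2."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y)

def rls_predict(Phi, W):
    return np.argmax(Phi @ W, axis=1)       # predicted class index

# Example usage (names assumed): W = rls_fit(phi_train, Y_onehot)
#                                preds = rls_predict(phi_test, W)
```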
perspectives ieee trans pattern anal mach vol no pp lecun bottou bengio and haffner learning applied to document recognition in proceedings of the ieee vol pp krizhevsky sutskever and hinton imagenet classification with deep convolutional neural in nips pp niyogi girosi and poggio incorporating prior information in machine learning by creating virtual examples in proceedings of the ieee pp mostafa learning from hints in neural networks journal of complexity vol pp june vapnik statistical learning theory publication steinwart and christmann support vector machines information science and statistics new york springer haasdonk vossen and burkhardt invariance in kernel methods by in scia springer bartlett jordan and mcauliffe convexity classification and risk bounds journal of the american statistical association vol no pp wahba spline models for observational data vol of regional conference series in applied mathematics philadelphia pa siam johnson and lindenstrauss extensions of lipschitz mappings into hilbert conference in modern analysis and probability rahimi and recht weighted sums of random kitchen sinks replacing minimization with randomization in in nips rahimi and recht uniform approximation of functions with random bases in proceedings of the annual allerton conference williams and seeger using the nystrm method to speed up kernel machines in nips bach on the equivalence between quadrature rules and random features corr vol walder and chapelle learning with transformation invariant kernels in nips cho and saul kernel methods for deep learning in nips pp bo ren and fox kernel descriptors for visual recognition in mairal koniusz harchaoui and schmid convolutional kernel networks in nips tacchetti mallapragada santoro and rosasco gurls least squares library for supervised learning corr vol voinea zhang evangelopoulos rosasco and poggio invariant representations from acoustic waveforms vol pp september benzeghiba de mori deroo dupont erbes jouvet fissore laface mertins ris rose tyagi and wellekens automatic speech recognition and speech variability review speech communication vol pp 
regularized em algorithms unified framework and statistical guarantees constantine caramanis dept of electrical and computer engineering the university of texas at austin constantine xinyang yi dept of electrical and computer engineering the university of texas at austin yixy abstract latent models are fundamental modeling tool in machine learning applications but they present significant computational and analytical challenges the popular em algorithm and its variants is much used algorithmic tool yet our rigorous understanding of its performance is highly incomplete recently work in has demonstrated that for an important class of problems em exhibits linear local convergence in the setting however the may not be well defined we address precisely this setting through unified treatment using regularization while regularization for problems is by now well understood the iterative em algorithm requires careful balancing of making progress towards the solution while identifying the right structure sparsity or in particular regularizing the using the highdimensional prescriptions la is not guaranteed to provide this balance our algorithm and analysis are linked in way that reveals the balance between optimization and statistical errors we specialize our general framework to sparse gaussian mixture models mixed regression and regression with missing variables obtaining statistical guarantees for each of these examples introduction we give general conditions for the convergence of the em method for estimation we specialize these conditions to several problems of interest including sparse and mixed regression sparse gaussian mixture models and regression with missing covariates as we explain below the key problem in the setting is the natural idea is to modify this step via appropriate regularization yet choosing the appropriate sequence of regularizers is critical problem as we know from the theory of regularized the regularizer should be chosen proportional to the target estimation error for em however the target estimation error changes at each step the main contribution of our work is technical we show how to perform this iterative regularization we show that the regularization sequence must be chosen so that it converges to quantity controlled by the ultimate estimation error in existing work the estimation error is given by the relationship between the population and empirical operators but this too is not well defined in the highdimensional setting thus key step related both to our algorithm and its convergence analysis is obtaining different characterization of statistical error for the setting background and related work em is general algorithmic approach for handling latent variable models including mixtures popular largely because it is typically computationally highly scalable and easy to implement on the flip side despite fairly long history of studying em in theory very little has been understood about general statistical guarantees until recently very recent work in establishes general local convergence theorem assuming initialization lies in local region around true parameter and statistical guarantees for em which is then specialized to obtain rates for several specific problems in the sense of the classical statistical setting where the samples outnumber the dimension central challenge in extending em and as corollary the analysis in to the regime is the on the algorithm side the will not be stable or even in some cases in the setting to make matters worse any analysis that 
relies on showing that the finite-sample M-step is somehow close to the M-step performed with infinite data simply cannot apply in the high-dimensional regime. Recent work treats high-dimensional EM using a truncated M-step; this works in some settings, but it requires specialized treatment for every different setting, precisely because of the difficulty with the M-step. In contrast to that line of work, we pursue an extension via regularization. The central challenge, as mentioned above, is picking the sequence of regularization coefficients, since this must control both the optimization error related to the special structure of $\beta^*$ and the statistical error. Finally, we note that for finite mixture regression, earlier work considers an $\ell_1$-regularized EM algorithm and develops some asymptotic analysis and an oracle inequality; however, that work does not establish the theoretical properties of local optima arising from regularized EM. Our work addresses this issue from a local convergence perspective by using a novel choice of regularization.

Classical EM and challenges in high dimensions. The EM algorithm is an iterative algorithm designed to combat the non-concavity of maximum likelihood caused by latent variables. For space concerns we omit the standard derivation and only give the definitions we need in the sequel. Let $(Y, Z)$ be random variables taking values in $\mathcal{Y} \times \mathcal{Z}$, with joint distribution $f_\beta(y, z)$ depending on a model parameter $\beta \in \Omega \subseteq \mathbb{R}^p$. We observe samples of $Y$ but not of the latent variable $Z$. EM seeks to maximize a lower bound on the maximum likelihood function for $\beta$. Letting $\kappa_\beta(z \mid y)$ denote the conditional distribution of $Z$ given $Y = y$, letting $g_\beta(y)$ denote the marginal distribution of $Y$, and defining the function
$$Q_n(\beta' \mid \beta) \;=\; \frac{1}{n} \sum_{i=1}^{n} \int_{\mathcal{Z}} \kappa_\beta(z \mid y_i)\, \log f_{\beta'}(y_i, z)\, dz,$$
one iteration of the EM algorithm, mapping $\beta^{(t)}$ to $\beta^{(t+1)}$, consists of the following two steps: (E-step) compute the function $Q_n(\cdot \mid \beta^{(t)})$; (M-step) set $\beta^{(t+1)} = M_n(\beta^{(t)}) = \arg\max_{\beta'} Q_n(\beta' \mid \beta^{(t)})$. We can define the population (infinite-sample) versions of $Q_n$ and $M_n$ in the natural manner:
$$Q(\beta' \mid \beta) = \int_{\mathcal{Y}} g_{\beta^*}(y) \int_{\mathcal{Z}} \kappa_\beta(z \mid y)\, \log f_{\beta'}(y, z)\, dz\, dy, \qquad M(\beta) = \arg\max_{\beta'} Q(\beta' \mid \beta).$$
This paper is about the high-dimensional setting, where the number of samples $n$ may be far less than the dimensionality $p$ of the parameter, but where $\beta^*$ exhibits some special structure: it may be a sparse vector or a low-rank matrix. In such a setting the M-step of the EM algorithm may be highly problematic. In many settings, for example sparse high-dimensional mixed regression, the M-step may not even be well defined. More generally, when $n \ll p$, $M_n$ may be far from the population version $M$, and in particular the estimation error $\|M_n(\beta^*) - \beta^*\|$ can be much larger than the signal strength $\|\beta^*\|$. This quantity is used in prior work as the measure of statistical error; in the high-dimensional setting something else is needed.

Algorithm. The basis of our algorithm is the well-understood concept of regularized high-dimensional estimators, where the regularization is tuned to the underlying structure of $\beta^*$, thus defining the regularized M-step via
$$M_n^r(\beta) \;=\; \arg\max_{\beta'} \; Q_n(\beta' \mid \beta) - \lambda_n \,\mathcal{R}(\beta'),$$
where $\mathcal{R}$ denotes an appropriate regularizer chosen to match the structure of $\beta^*$. The key challenge is how to choose the sequence of regularizers $\lambda_n^{(t)}$ in the iterative process so as to control both optimization and statistical error. As detailed in Algorithm 1, our sequence of regularizers attempts to match the target estimation error at each step of the EM iteration. For an intuition of what this might look like, consider the estimation error at step $t+1$, namely $\|M_n^r(\beta^{(t)}) - \beta^*\|$. By the triangle inequality we can bound this by the sum of two terms, the optimization error and the final estimation error:
$$\|M_n^r(\beta^{(t)}) - \beta^*\| \;\le\; \|M_n^r(\beta^{(t)}) - M_n^r(\beta^*)\| + \|M_n^r(\beta^*) - \beta^*\|.$$
Since we expect, and show, linear convergence of the optimization, it is natural to update the regularization parameter via a recursion of the form $\lambda_n^{(t)} = \kappa \lambda_n^{(t-1)} + \Delta$, where the first term represents the optimization error and $\Delta$ represents the final statistical error. The last term above is the key: a central part of our analysis shows that this error, and hence $\Delta$, is
controlled by which in turn can be bounded uniformly for variety of important applications of em including the three discussed in this paper see section while technical point it is this key insight that enables the right choice of algorithm and its analysis in the cases we consider we obtain optimal rates of convergence demonstrating that no algorithm let alone another variant of em can perform better algorithm regularized em algorithm input samples yi regularizer number of iterations initial parameter initial regu larization parameter λn estimated statistical error contractive factor for do regularization parameter update κλn compute function qn regularized according to mrn arg max qn end for output statistical guarantees we now turn to the theoretical analysis of regularized em algorithm we first set up general analytical framework for regularized em where the key ingredients are decomposable regularizer and several technical conditions on the population based and the sample based qn in section we provide our main result theorem that characterizes both computational and statistical performance of the proposed variant of regularized em algorithm decomposable regularizers decomposable regularizers have been shown to be useful both empirically and theoretically for high dimensional structural estimation and they also play an important role in our analytical framework recall that for rp norm and pair of subspaces in rp such that we have the following definition definition decomposability regularizer rp is decomposable with respect to if for any typically the structure of model parameter can be characterized by specifying subspace such that the common use of regularizer is thus to penalize the compositions of solution that live outside we are interested in bounding the estimation error in some norm the following quantity is critical in connecting to definition subspace compatibility constant for any subspace rp given regularizer and some norm the subspace compatibility constant of with respect to is given by sup kuk as is standard the dual norm of is defined as supr to simplify notation we let kukr and conditions on and qn next we review three technical conditions originally proposed by on the population level function and then we give two important conditions that the empirial function qn must satisfy including one that characterizes the statistical error it is well known that performance of em algorithm is sensitive to initialization following the lowdimensional development in our results are local and apply to an region around ku we first require that is self consistent as stated below this is satisfied in particular when maximizes the population log likelihood function as happens in most settings of interest condition self consistency function is self consistent namely arg max we also require that the function satisfies certain strong concavity condition and is smooth over condition strong concavity and smoothness is concave over for any is over the next condition is key in guaranteeing the curvature of is similar to that of when is close to it has also been called first order stability in condition gradient stability for any we have kβ the above condition only requires that the gradient be stable at one point this is sufficient for our analysis in fact for many concrete examples one can verify stronger version of condition that is kβ next we require two conditions on the empirical function qn which is computed from finite number of samples according to our first condition parallel to 
condition imposes curvature constraint on qn in order to guarantee that the estimation error kβ in step of the em algorithm is well controlled we would like qn to be strongly concave at however in the setting where there might exist directions along which qn is flat as in mixed linear regression and missing covariate regression in contrast with condition we only require qn to be strongly concave over particular set that is defined in terms of the subspace pair and regularizer this set is defined as follows πs πs where the projection operator πs rp rp is defined as πs arg kv uk the restricted strong concavity rsc condition is as follows condition rsc γn tδ for any fixed with probability at least we have that for all γn qn qn kβ the above condition states that qn is strongly concave in directions that belong to it is instructive to compare condition with related condition proposed by for analyzing high dimensional they require the loss function to be strongly convex over the cone rp kπs kr kπs kr therefore our restrictive set is similar to the cone but has the additional term kuk the main purpose of the term kuk is to allow the regularization parameter λn to jointly control optimization and statistical error we note that while condition is stronger than the usual rsc condition in in typical settings the difference is immaterial this is because πs is within constant factor of and hence checking rsc over amounts to checking it over kπs kr kuk which is indeed what is typically also done in the setting finally we establish the condition that characterizes the achievable statistical error condition statistical error for any fixed with probability at least we have this quantity replaces the term kmn which appears in and and which presents problems in the high dimensional regime main results in this section we provide the theoretical guarantees for resampled version of our regularized em algorithm we split the whole dataset into pieces and use fresh piece of data in each iteration of regularized em as in resampling makes it possible to check that conditions are satisfied without requiring them to hold uniformly for all with high probability our empirical results indicate that it is not in fact required and is an artifact of the analysis we refer to this resampled version as algorithm in the sequel we let to denote the sample complexity in each iteration we let where is the dual norm of for algorithm our main result is as follows the proof is deferred to the supplemental material theorem assume the model parameter and regularizer is decomposable with respect to where rp assume is such that further assume function defined in is self consistent and satisfies conditions with parameters and given samples and iterations let assume qm computed from any samples according to satisfies conditions with parameters γm αµτ and let γγ and assume and define rγm and assume is such that consider algorithm with initialization and with regularization parameters given by κt γm kβ for any then with probability at least we have that for any κt kβ κt kβ γm the estimation error is bounded by term decaying linearly with number of iterations which we can think of as the optimization error and second term that characterizes the ultimate estimation error of our algorithm with log and suitable choice of such that we bound the ultimate estimation error as kβ we note that overestimating the initial error kβ is not important as it may slightly increase the overall number of iterations but will not impact the ultimate estimation 
error the constraint rγm ensures that is contained in for all this constraint is quite mild in the sense that if rγm is decent estimator with estimation error that already matches our expectation examples applying the theory now we introduce three well known latent variable models for each model we first review the standard em algorithm formulations and discuss the extensions to the high dimensional setting then we apply theorem to obtain the statistical guarantee of the regularized em with data splitting algorithm the key ingredient underlying these results is to check the technical conditions in section hold for each model we postpone these tedious details to the supplemental material gaussian mixture model we consider the balanced isotropic gaussian mixture model gmm with two components where the distribution of random variables rp is characterized as pr ip pr pr here we use to denote the probability density function of in this example is the latent variable that indicates the cluster id of each sample given samples yi function qn defined in corresponds to qgm yi kyi yi kyi where exp exp exp we assume rp supp naturally we choose the regularizer to be the norm we define the ratio snr kβ corollary sparse recovery in gmm there exist constants such that if snr kβ log kβ then with probability at least algorithm with parameters kβ log any and regularization generates that has estimation error kβ log kβ kβ for all note that by setting log log the order of final estimation error turns out to be kβ log log log the minimax rate for estimating vector in single gaussian cluster is log thereby the rate is optimal on up to log factor mixed linear regression mixed linear regression mlr as considered in some recent work is the problem of recovering two or more linear vectors from mixed linear measurements in the case of mixed linear regression with two symmetric and balanced components the pair rp is linked through hx where is the noise term and is the latent variable that has rademacher distribution over we assume ip in this setting with samples yi xi of pair function qn then corresponds to lr qm yi xi yi hxi yi xi yi hxi where exp βi exp βi exp βi we consider two kinds of structure on sparse recovery assume then let be the norm as in the previous section we define snr kβ corollary sparse recovery in mlr there exist constant such that if snr kβ log kβ then with probability at least algorithm with parameters kβ log kβ any and regularization generates that has estimation error kβ log kβ kβ for all performing log log iterations gives us estimation rate kβ log log log which is on the dependence on kβ which also appears in the analysis of em in the classical low dimensional setting arises from fundamental limits of em removing such dependence for mlr is possible by convex relaxation it is interesting to study how to remove it in the high dimensional setting low rank recovery second we consider the setting where the model parameter is matrix with rank min we further assume is an gaussian matrix entries of are independent random variables with distribution we apply nuclear norm regularization to serve the low rank structure where si is the ith singular value of similarly we let snr kf corollary low rank recovery in mlr there exist constant such that if snr kf kf kf thenpwith probability at least exp algorithm with parameters kf kf any and nuclear norm regularization generates that has estimation error kγ kf kγ kf kf for all the standard low rank matrix recovery with single component including other sensing 
matrix designs beyond Gaussianity, has been studied extensively; to the best of our knowledge, the theoretical study of mixed low-rank matrix recovery has not been considered previously.

Missing covariate regression. As our last example we consider the missing covariate regression (MCR) problem. The setup parallels standard linear regression, $y_i = \langle x_i, \beta^* \rangle + \epsilon_i$; however, we assume each entry of $x_i$ is missing independently with probability $\rho$, so that an entry of the observed covariate vector $\widetilde{x}_i$ equals the corresponding entry of $x_i$ with probability $1-\rho$ and is marked missing otherwise. We assume the model is under Gaussian design, $x_i \sim N(0, I_p)$. We refer the reader to our supplementary material for the specific $Q_n$ function. In the high-dimensional case we assume $\beta^*$ is sparse. We define $\|\beta^*\|/\sigma$ to be the SNR and $\omega$ to be the relative contractivity radius.

Corollary (sparse recovery in MCR). There exist constants such that if $\omega$, $\rho$, and the SNR satisfy the stated bounds, then with high probability Algorithm 1, with regularization parameter proportional to $\sigma\sqrt{\log p / n}$, generates iterates $\beta^{(t)}$ whose estimation error is bounded, for all $t$, by a geometrically decaying optimization term plus a statistical term scaling as $\sqrt{\log p / n}$ up to problem-dependent factors. Unlike the previous two models, we require an upper bound on the signal-to-noise ratio; this unusual constraint is in fact unavoidable. Optimizing the number of iterations yields this final estimation error up to logarithmic factors.

Simulations. We now provide simulation results to back up our theory. Note that while Theorem 1 requires resampling, we believe that in practice this is unnecessary; this is validated by our results, in which we apply Algorithm 1 (without data splitting) to the four latent variable models discussed above.

Convergence rate. We first evaluate the convergence of Algorithm 1, assuming only that the initialization is a bounded distance from $\beta^*$: for a given relative error $\omega$, the initial parameter $\beta^{(0)}$ is picked randomly from the sphere centered at $\beta^*$ with radius $\omega\|\beta^*\|$. We run Algorithm 1 with the regularization-parameter recursion from Theorem 1; the choice of the remaining parameters is given in the supplementary material. For every trial we record the estimation error $\|\beta^{(t)} - \beta^*\|$ and the optimization error $\|\beta^{(t)} - \widehat{\beta}\|$ at every iteration, and plot the log of both errors against the iteration count in Figure 1.

[Figure 1: Convergence of the regularized EM algorithm. Four panels (GMM; MLR, sparse; MLR, low rank; MCR) plot log estimation error and log optimization error against the number of iterations; each curve is a single independent trial.]

Statistical rate. We next evaluate the statistical rate. We compute the estimation error of the final estimate $\widehat{\beta}$ and, in Figure 2, plot $\|\widehat{\beta} - \beta^*\|$ against the normalized sample complexity, $s \log p / n$ for sparse parameters and the corresponding rank-based quantity for low-rank parameters; we refer the reader to the additional figures for the other settings. We observe that the same normalized sample complexity leads to almost identical estimation error in practice, which supports the corresponding statistical rates established in Section 4.

[Figure 2: Statistical rates for GMM, MLR (sparse), MLR (low rank), and MCR: estimation error plotted against normalized sample complexity. Each point is an average over independent trials.]
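As an illustration of the procedure evaluated in these simulations, the sketch below implements the regularized EM iteration for the simplest case above, the sparse two-component Gaussian mixture: an E-step computing responsibilities, a closed-form M-step, soft-thresholding as the $\ell_1$-regularized M-step, and the geometric regularization-parameter recursion. This is a minimal sketch under stated assumptions (balanced mixture, isotropic noise with known level `sigma`, illustrative default parameters), not the authors' implementation.

```python
import numpy as np

def soft_threshold(v, lam):
    """Closed-form l1-regularized maximizer, used as the regularized M-step."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def gmm_regularized_mstep(Y, beta, lam, sigma):
    """One E-step plus regularized M-step for the balanced two-component GMM
    y ~ 0.5 N(beta*, sigma^2 I) + 0.5 N(-beta*, sigma^2 I).
    The unregularized M-step is a responsibility-weighted average of the
    samples; with an l1 penalty it soft-thresholds that average."""
    w = 1.0 / (1.0 + np.exp(-2.0 * (Y @ beta) / sigma**2))   # E-step responsibilities
    m = ((2.0 * w - 1.0)[:, None] * Y).mean(axis=0)          # unregularized M-step
    return soft_threshold(m, lam * sigma**2)                 # l1-regularized M-step

def regularized_em(Y, beta0, lam0, delta, sigma=1.0, kappa=0.7, iters=30):
    """Regularized EM with the recursion lam_t = kappa * lam_{t-1} + delta,
    so the penalty decays geometrically toward the statistical-error level."""
    beta, lam = beta0.copy(), lam0
    iterates = []
    for _ in range(iters):
        lam = kappa * lam + delta
        beta = gmm_regularized_mstep(Y, beta, lam, sigma)
        iterates.append(beta.copy())
    return beta, iterates

# Example: a 5-sparse signal in p = 200 dimensions recovered from n = 150 samples.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, n, sigma = 200, 150, 1.0
    beta_star = np.zeros(p)
    beta_star[:5] = 2.0
    signs = rng.choice([-1.0, 1.0], size=n)
    Y = signs[:, None] * beta_star + sigma * rng.standard_normal((n, p))
    beta0 = beta_star + 0.5 * np.linalg.norm(beta_star) * rng.standard_normal(p) / np.sqrt(p)
    beta_hat, _ = regularized_em(Y, beta0, lam0=1.0, delta=sigma * np.sqrt(np.log(p) / n))
    print("estimation error:", np.linalg.norm(beta_hat - beta_star))
```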
Acknowledgments. The authors would like to acknowledge NSF grants, and support from the Department of Transportation through the Transportation Operations and Planning Tier 1 University Transportation Center.

References
Sivaraman Balakrishnan, Martin Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint.
Tony Cai and Anru Zhang. ROP: Matrix recovery via rank-one projections. The Annals of Statistics.
Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics.
Emmanuel Candes and Yaniv Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory.
Arun Tejasvi Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. arXiv preprint.
Yudong Chen, Sujay Sanghavi, and Huan Xu. Improved graph clustering. IEEE Transactions on Information Theory.
Yudong Chen, Xinyang Yi, and Constantine Caramanis. A convex formulation for mixed regression with two components: Minimax optimal rates. In Conference on Learning Theory.
Arthur Dempster, Nan Laird, and Donald Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological).
Po-Ling Loh and Martin Wainwright. Regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems.
Po-Ling Loh and Martin Wainwright. Corrupted and missing predictors: Minimax bounds for high-dimensional linear regression. In IEEE International Symposium on Information Theory (ISIT). IEEE.
Jinwen Ma and Lei Xu. Asymptotic convergence properties of the EM algorithm with respect to the overlap in the mixture. Neurocomputing.
Geoffrey McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. John Wiley & Sons.
Sahand Negahban, Martin Wainwright, et al. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics.
Sahand Negahban, Bin Yu, Martin Wainwright, and Pradeep Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems.
Benjamin Recht, Maryam Fazel, and Pablo Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review.
Nicolas Städler, Peter Bühlmann, and Sara van de Geer. l1-penalization for mixture regression models. TEST.
Paul Tseng. An analysis of the EM algorithm and entropy-like proximal point methods. Mathematics of Operations Research.
Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint.
Martin Wainwright. Structured regularizers for high-dimensional problems: Statistical and computational issues. Annual Review of Statistics and Its Application.
Zhaoran Wang, Quanquan Gu, Yang Ning, and Han Liu. High-dimensional EM algorithm: Statistical optimization and asymptotic normality. arXiv preprint.
C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics.
Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed linear regression. arXiv preprint.
adaptive stochastic optimization from sets to paths zhan wei lim david hsu wee sun lee department of computer science national university of singapore limzhanw dyhsu leews abstract adaptive stochastic optimization aso optimizes an objective function adaptively under uncertainty it plays crucial role in planning and learning under uncertainty but is unfortunately computationally intractable in general this paper introduces two conditions on the objective function the marginal likelihood rate bound and the marginal likelihood bound which together with pointwise submodularity enable efficient approximate solution of aso several interesting classes of functions satisfy these conditions naturally the version space reduction function for hypothesis learning we describe recursive adaptive coverage new aso algorithm that exploits these conditions and apply the algorithm to two robot planning tasks under uncertainty in contrast to the earlier submodular optimization approach our algorithm applies to aso over both sets and paths introduction hallmark of an intelligent agent is to learn new information as the world unfolds and to improvise by fusing the new information with prior knowledge consider an autonomous unmanned aerial vehicle uav searching for victim lost in jungle the uav acquires new information on the victim location by scanning the environment with noisy onboard sensors how can the uav plan and adapt its search strategy in order to find the victim as fast as possible this is an example of stochastic optimization in which an agent chooses sequence of actions under uncertainty in order to optimize an objective function in adaptive stochastic optimization aso the agent action choices are conditioned on the outcomes of earlier choices aso plays crucial role in planning and learning under uncertainty but it is unfortunately computationally intractable in general adaptive submodular optimization provides powerful tool for approximate solution of aso and has several important applications such as sensor placement active learning etc however it has been so far restricted to optimization over set domain the agent chooses subset out of finite set of items this is inadequate for the uav search as the agent consecutive choices are constrained to form path our work applies to aso over both sets and paths our work aims to identify subclasses of aso and provide conditions that enable efficient nearoptimal solution we introduce two conditions on the objective function the marginal likelihood rate bound mlrb and the marginal likelihood bound mlb they enable efficient approximation of aso with pointwise submodular objective functions functions that satisfy diminishing return property mlrb is different from adaptive submodularity we prove that adaptive submodularity does not imply mlrb and vice versa while there exist functions that do not satisfy either the adaptive submodular or the mlrb condition all pointwise submodular functions satisfy the mlb condition albeit with different constants we propose recursive adaptive coverage rac approximation algorithm that guarantees solution of aso over either set or path domain if the objective function satisfies the mlrb or the mlb condition and is pointwise monotone submodular since mlrb differs from adaptive submodularity the new algorithm expands the set of problems that admit efficient approximate solutions even for aso over set domain we have evaluated rac in simulation on two robot planning tasks under uncertainty and show that rac performs well against 
several commonly used heuristic algorithms including greedy algorithms that optimize information gain related work submodular set function optimization encompasses many hard combinatorial optimization problems in operation research and decision making submodularity implies diminishing return effect where adding an item to smaller set is more beneficial than adding the same item to bigger set for example adding new temperature sensor when there are few sensors helps more in mapping temperature in building than when there are already many sensors submodular functions can be efficiently approximated using greedy heuristic recent works have incorporated stochasticity to submodular optimization and generalized the problem from sets optimization to path optimization our work builds on progress in submodular optimization on paths to solve the adaptive stochastic optimization problem on paths our rac algorithm share similar structure and analysis as the raid algorithm in that is used to solve adaptive informative path planning ipp problems without noise in fact noiseless adaptive ipp is special case of adaptive stochastic optimization problems on paths that satisfies the marginal likelihood rate bound condition we can derive the same approximation bound using the results in section both works are inspired by the algorithm in used to solve the adaptive traveling salesperson atsp problem in the atsp problem salesperson has to service subset of locations with demand that is not known in advance however the salesperson knows the prior probabilities of the demand at each location possibly correlated and the goal is to find an adaptive policy to service all locations with demand adaptive submodularity generalizes submodularity to stochastic settings and gives logarithmic approximation bounds using greedy heuristic it was also shown that no polynomial time algorithm can compute approximatep solution of adaptive stochastic optimization problems within factor of unless that is the hierarchy collapses to its second level many bayesian active learning problems can be modeled by suitable adaptive submodular objective functions however recently proposed new stochastic set function for active learning with general loss function that is not adaptive monotone submodular this new objective function satisfies the marginal likelihood bound with nontrivial constant adaptive stochastic optimization is special case of the partially observable markov decision process pomdp mathematical principled framework for reasoning under uncertainty despite recent tremendous progress in offline and online solvers most partially observable planning problems remain hard to solve preliminaries we now describe the adaptive stochastic optimization problem and use the uav search and rescue task to illustrate our definitions let be the set of actions and let be the set of observations the agent operates in world whose events are determined by static state called the scenario denoted as when the agent takes an action it receives an observation that is determined by an initially unknown scenario we denote random scenario as and use prior distribution over the scenarios to represent our prior knowledge of the world for in the uav task the actions are flying to various locations observations are the possible sensors readings and scenario is victim position when the uav flies to particular location it observes its sensors readings that depends on actual victim position prior knowledge about the victim position can be encoded as probability 
distribution over the possible victim positions after taking actions and receiving observations after each action the agent has history we say that scenario is consistent with history when the actions and corresponding observations of the history never contradict with the for all we denote this by we can also say that history is consistent with another history if dom dom and for all dom where dom is the set of actions taken in for example victim position has not been ruled out given the sensors readings at various locations when an agent goal can be characterized by stochastic set function ox which measures progress toward the goal given the actions taken and the true scenario in this paper we assume that is pointwise monotone on finite domain for any and for all an agent achieves its goal and covers when has maximum value after taking actions and given it is in scenario for example the objective function can be the sum of prior probabilities of impossible victim positions given history the uav finds the victim when all except the true victim position are impossible an agent strategy for adaptively taking actions is policy that maps history to its next action policy terminates when there is no next action to take for given history we say that policy covers the function when the agent executing always achieves its goal upon termination that is dom for all scenarios where is the history when the agent executes for example policy tells the uav where to fly to next given the locations visited and whether it has positive sensor at those locations or not and it covers the objective function when the uav executing it always find the victim formally an adaptive stochastic optimization problem on paths consists of the tuple the set of actions is the set of locations the agent can visit is the starting location of the agent and is metric that gives the distance between any pair of locations the cost of the policy is the length of the path starting from location traversed by the agent until the policy terminates when presented with scenario the distance traveled by uav executing policy for particular true victim position we want to find policy that minimizes the cost of traveling to cover the function we formally state the problem problem given an adaptive stochastic optimization problem on paths compute an adaptive policy that minimizes the expected cost subject to dom for all where is the history encountered when executing on adaptive stochastic optimization problems on sets can be formally defined by tuple the set of actions is set of items that an agent may select instead of distance metric the cost of selecting an item is defined by cost function and the cost of policy where is the subset of items selected by when presented with scenario classes of functions this section introduces the classes of objective functions for adaptive stochastic optimization problems and gives the relationship between them given finite set and function on subsets of the function is submodular if for all let be stochastic set function if is submodular for each fixed scenario ox then is pointwise submodular adaptive submodularity and monotonicity generalize submodularity and monotonicity to stochastic settings where we receive random observations after selecting each item we define the expected marginal value of an item given history as dom dom function ox is adaptive monotone with respect to prior distribution if for all such that and all it holds that the expected marginal value of any fixed item is nonnegative 
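To make the preceding definitions concrete, here is a minimal sketch (with illustrative names, not taken from the paper) of how scenarios, histories, consistency, and the "prior mass of ruled-out scenarios" objective mentioned above for the UAV example could be represented.

```python
def consistent(scenario, history):
    """A scenario (mapping: action -> observation it would produce) is
    consistent with a history (mapping: action taken -> observation
    received) if it contradicts none of the received observations."""
    return all(scenario[a] == o for a, o in history.items())

def ruled_out_mass(prior, scenarios, history):
    """Objective from the text: total prior probability of the scenarios
    that the history has ruled out (e.g. victim positions shown to be
    impossible). A policy 'covers' this objective when every scenario
    except the true one has been ruled out."""
    return sum(p for sid, p in prior.items()
               if not consistent(scenarios[sid], history))

# Toy UAV example: three candidate victim positions; the sensor reading at
# location "x1" is positive only when the victim is at position 0.
scenarios = {0: {"x1": 1}, 1: {"x1": 0}, 2: {"x1": 0}}
prior = {0: 0.5, 1: 0.25, 2: 0.25}
history = {"x1": 1}                                  # flew to x1, saw a positive reading
print(ruled_out_mass(prior, scenarios, history))     # 0.5: scenarios 1 and 2 ruled out
```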
function is adaptive submodular with respect to prior distribution if for all and such that and for all it holds that the expected marginal value of any fixed item does not increase as more items are selected function can be adaptive submodular with respect to certain distribution but not be pointwise submodular however it must be pointwise submodular if it is adaptive submodular with respect to all distributions we denote fˆ min as the worst case value of given history and as the marginal likelihood of history the marginal likelihood rate bound mlrb condition requires function such that for all if then fˆ dom fˆ dom except for scenarios already covered where and max is constant upper bound for the maximum value of for all scenarios intuitively this condition means that the worst case remaining objective value decreases by constant fraction whenever the marginal likelihood of history decreases by at least half example the version space reduction function with arbitrary prior is adaptive submodular and monotone furthermore it satisfies the mlrb the version space reduction function is defined as for all scenario and gives the history of visiting locations in when the scenario is the version space reduction function is often used for active learning where the true hypothesis is identified once all the scenarios are covered we present the proof that the version space reduction function satisfies the mlrb condition and all other proofs in the supplementary material proposition the version space function satisfies the mlrb with constants and the following proposition teases apart the relationship between the mlrb condition and adaptive submodularity proposition adaptive submodularity does not imply the mlrb condition and vice versa the marginal likelihood bound mlb condition requires that there exists some constant such that for all fˆ dom in other words the worst remaining objective value must be less than the marginal likelihood of its history multiplied by some constant our quality of solution depends on the constant the smaller the constant the better the approximation bound we can make any adaptive stochastic optimization problem satisfy the mlb with large enough constant to trivially ensure the bound of mlb we set where min hence unless we have visited all locations and covered the function by definition example the version space reduction function can be interpreted as the expected loss of random scenario differing from true scenario the loss is counted as one whenever for example pair of scenarios that differ in observation at one location has the same loss of as another pair that differs in all observations thus it can be useful to assign different loss to different pair of scenarios with general loss function the generalized version space reduction function is defined as fl where is an indicator function and ox ox is general loss function that satisfies and if the generalized version space reduction function is not adaptive submodular and does not satisfy the mlrb condition however it satisfies condition mlb with constant proposition the generalized version space reduction function fl satisfies mlb with max algorithm adaptive planning is computationally hard due to the need to consider every possible observation after each action rac assumes that it always receive the most likely observation to simplify adaptive planning rac is recursive algorithm that partially covers the function in each step and repeats on the residual function until the entire function is covered in each recursive 
step rac uses the mostly like observation assumption to transform adaptive stochastic optimization problem into submodular orienteering problem to generate tour and traverse it if the assumption is true throughout the tour then rac achieves the required partial coverage otherwise rac receives some observation that has probability less than half since only the most likely observation has probability at least half the marginal likelihood of history decreases by at least half and the mlrb and mlb conditions ensures that substantial progress is made towards covering the function submodular orienteering takes submodular function and metric on and gives the minimum cost path that covers function such that we now describe the submodular orienteering problem used in each recursive step given the current history we construct restricted set of pairs is the most likely observation at given using ideas from we construct submodular function to be used in the submodular orienteering problem upon completion of the recursive step we would like the function to be either covered or have value at least for all scenarios consistent with where is the selected subset of we first restrict to subset of scenarios that are consistent with to simplify we transform the function so that its maximum value for all is at least by defining whenever and otherwise for we now define dom if is consistent with and otherwise finally we construct the submodular function min the constructions have the following properties that guarantees the effectiveness of the recursive steps of rac proposition let be pointwise monotone submodular function then is pointwise monotone submodular and is monotone submodular in addition if and only if is either covered or have value at least for all scenarios consistent with we can replace by simpler function if satisfies minimal dependency property where the value of function depends only on the history dom dom for all we define new submodular set function proposition when satisfies minimal dependency implies rac needs to guard against committing to costly plan made under the most likely observation assumption which is bound to be wrong eventually rac uses two different mechanisms for hedging for mlrb instead of requiring complete coverage we solve partial coverage using submodular path optimization problem for all consistent scenarios under so that the most likely observation assumption in each recursive step for mlb we solve submodular orienteering for complete coverage of gq but also solve for the version space reduction function with as the target as hedge against by the first tour when the function is not well aligned with the probability of observations the cheaper tour is then traversed by rac in each recursive step we define the informative observation set for every location rac traverses the tour and adaptively terminates when it encounters an informative observation subsequent recursive calls work on the residual function and normalized prior let be the history encountered so far just before the recursive call for any set dom dom we assume that function is the recursive step is repeated until the residual value we give the pseudocode of rac in algorithm we give details of ubmodular path procedure and prove its approximation bound in supplementary material algorithm rac procedure recurse rac if max then return en our xecute lan min for all recurse rac procedure xecute lan repeat visit next location in and observe until or end of tour move to location xt return history encountered procedure 
en our if satisfies mlb then ubmodular path gq if max then ubmodular path arg else else ubmodular path return where xt and xt analysis we give the performance guarantees for applying rac to adaptive stochastic optimization problem on paths that satisfy mlrb and mlb theorem assume that is an pointwise submodular monotone function if satisfies mlrb condition then for any constant and an instance of adaptive stochastic optimization problem on path optimizing rac computes policy in polynomial time such that log logk where and are constants that satisfies equation theorem assume that prior probability distribution is represented as integers with and is an pointwise submodular monotone function if satisfies mlb then for any constant and an instance of adaptive stochastic optimization problem on path optimizing rac computes policy for in polynomial time such that log log log where max for adaptive stochastic optimization problems on subsets we achieve tighter approximation bounds by replacing the bound of submodular orienteering with greedy submodular set cover theorem assume is an pointwise submodular and monotone function if satisfies mlrb condition then for an instance of adaptive stochastic optimization problem on subsets optimizing rac computes policy in polynomial time such that ln logk where and are constants that satisfies equation theorem assume is an pointwise submodular and monotone function and min if satisfies mlb condition then for an instance of adaptive stochastic optimization problem on subsets optimizing rac computes policy in polynomial time such that ln ln log where max application noisy informative path planning in this section we apply rac to solve adaptive informative path planning ipp problems with noisy observations we reduce an adaptive noisy ipp problem to an equivalence class determination ecd problem and apply rac to solve it using an objective function that satisfies mlrb condition we evaluate this approach on two ipp tasks with noisy observations in an informative path planning ipp problem an agent seeks path to sense and gather information from its environment an ipp problem is specified as tuple ph zh the definitions for are the same as adaptive stochastic optimization problem on path in addition there is finite set of hypotheses and prior probability over them we also have set of probabilistic observation functions zh zx with one observation function zx for each location the goal of ipp problem is to identify the true hypothesis equivalence class determination problem an equivalence class determination ecd problem consists of set of hypotheses and set of equivalence classes hm that partitions its goal is to identify which equivalence class the true hypothesis lies in by moving to locations and making observations with the minimum expected movement cost ecd problem has been applied to noisy bayesian active learning to achieve performance noisy adaptive ipp problem can also be reduced to an ecd instance when it is always possible to identify the true hypothesis in ipp problem to differentiate between the equivalence classes we use the gibbs error objective function called the function in the idea is consider the ambiguities between pairs of hypotheses in different equivalence class and to visit locations and make observations to disambiguate between them the set of pairs of hypotheses in different classes is hi hj we disambiguate pair when we make an observation at location and either or is inconsistent with the observation or the set of pairs disambiguated by 
visiting a location $x$ when hypothesis $h$ is true consists of the pairs for which at least one member is inconsistent with the observation produced at $x$ under $h$.

[Figure: Grasping task — grasp the cup with the handle on top; side view (left) and top view (right).]
[Figure: UAV search and rescue — the long-range sensor detects the survivor anywhere in a surrounding area, the short-range sensor only in the current grid cell; the map marks the safe zone, the true target location, and the starting location.]

We define a weight function over these pairs and then define the Gibbs error objective function $f_{GE}$, whose argument is the set of locations visited: it measures the total weight of the pairs disambiguated so far.

Proposition. The Gibbs error function $f_{GE}$ is pointwise submodular and monotone. In addition, it satisfies condition MLRB, with constants determined by the total weight of ambiguous pairs of hypotheses (pairs lying in different equivalence classes).

The first step in reducing a noisy adaptive IPP instance to an ECD instance is to create a noiseless IPP problem from the noisy instance by creating a hypothesis for every possible observation vector: each hypothesis is an observation vector, and these vectors form the new hypothesis space. Next, for each original hypothesis $h_i$ we create an equivalence class consisting of all observation vectors that are possible under $h_i$. When we can always identify the true underlying hypothesis, the equivalence classes form a partition of the new hypothesis space.

Experiments. We evaluate RAC in simulation on two noisy IPP tasks modified from prior work; we highlight the modifications here and give the full description in the supplementary material. In our variant of the UAV search-and-rescue task (see the figure above) there is a safe zone, marked grey in the figure: the victim is deemed safe if we know that he is inside it, but otherwise we need to know his exact location. The equivalence classes for this task are therefore the safe zone and every individual location outside of it. Furthermore, the long-range sensor may report a wrong reading with a fixed probability.

In the noisy variant of the grasping task, the laser range finder detects the correct discretized value only with some probability, and otherwise reports one of several error values, each with a fixed probability. The robot gripper is fairly robust to estimation error in the cup-handle orientation, so for each cup we partition the handle orientation into regions of equal angular width; we only need to know the region that contains the handle, and the equivalence classes here are these regions. However, it is not always possible to identify the true region due to observation noise; we can still reduce to an ECD problem by associating each observation vector with its most likely equivalence class.

We now describe our baseline algorithms. Define the information gain to be the reduction in Shannon entropy of the distribution over equivalence classes. The information gain (IG) algorithm greedily picks the location that maximizes the expected information gain, where the expectation is taken over all possible observations at that location. To account for movement cost, the information-gain-per-cost variant greedily picks the location that maximizes the expected information gain per unit movement cost. Both IG and its per-cost variant do not reason over the long term, but achieve limited adaptivity by replanning in each step. One further baseline algorithm follows the description in prior work. We evaluate IG, the per-cost variant, and RAC with the version space reduction and Gibbs error objectives. RAC with the Gibbs error objective has theoretical performance guarantees for the noisy adaptive IPP problem under the MLRB condition; the version space variant can also be shown to have a similar performance bound. However, the Gibbs error variant optimizes the target function directly, and we expect that optimizing the target function directly would usually give better performance in practice. Even though the version space reduction function and the Gibbs error function are adaptive submodular, the greedy policy of adaptive submodular optimization is not applicable here, as the movement cost per step depends on the path and is not fixed.
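The following sketch spells out the pair-disambiguation bookkeeping behind the Gibbs error objective discussed above. It assumes the ECD reduction just described (each hypothesis is a full observation vector), and it assumes the weight of a cross-class pair is the product of the two prior probabilities; the source does not state the exact weight formula, so treat that choice, and all names, as illustrative.

```python
from itertools import combinations

def cross_class_weights(prior, cls):
    """Weight every pair of hypotheses lying in different equivalence
    classes; here the weight is taken to be prior[h] * prior[g]."""
    return {(h, g): prior[h] * prior[g]
            for h, g in combinations(sorted(prior), 2)
            if cls[h] != cls[g]}

def disambiguated_by(obs_vec, x, o, pair):
    """A pair (h, g) is disambiguated by observing o at location x if at
    least one of the two hypotheses would have produced a different
    observation there (ECD reduction: hypotheses are observation vectors)."""
    h, g = pair
    return obs_vec[h][x] != o or obs_vec[g][x] != o

def gibbs_error_value(weights, obs_vec, history):
    """Total weight of pairs disambiguated by a history of
    (location, observation) visits; the objective is covered once every
    cross-class pair has been disambiguated."""
    return sum(w for pair, w in weights.items()
               if any(disambiguated_by(obs_vec, x, o, pair) for x, o in history))

# Toy instance: three hypotheses (observation vectors over locations a, b),
# two equivalence classes {h0} and {h1, h2}.
obs_vec = {"h0": {"a": 1, "b": 0}, "h1": {"a": 0, "b": 0}, "h2": {"a": 0, "b": 1}}
cls = {"h0": 0, "h1": 1, "h2": 1}
prior = {"h0": 0.5, "h1": 0.3, "h2": 0.2}
weights = cross_class_weights(prior, cls)                # pairs (h0,h1) and (h0,h2)
print(gibbs_error_value(weights, obs_vec, [("a", 1)]))   # 0.25: both pairs disambiguated
```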
If we ignore movement cost, the greedy policy on the version space reduction function is equivalent to generalized binary search, which in turn is equivalent to IG for the UAV task, where the prior is uniform and there are two possible observations. We set all algorithms to terminate when the Gibbs error of the equivalence classes falls below a fixed threshold; the Gibbs error corresponds to the exponentiated Rényi entropy and also to the prediction error of a Gibbs classifier that predicts by sampling a hypothesis from the prior. We run repeated trials with the true hypothesis sampled randomly from the prior, using more trials for the grasping task because its variance is higher. Where sampling is required, we set the number of samples to three times the number of hypotheses.

For the performance comparison we pick a sequence of thresholds on the Gibbs error of the equivalence classes, doubling at each step, and compute the average cost incurred by each algorithm to reduce the Gibbs error below each threshold level. We plot the average cost with confidence intervals for the two IPP tasks in the figures below. For the grasping task we omit trials where the minimum achievable Gibbs error is greater than a given threshold when computing the average cost at that threshold. For readability we omit results for IG from the plots when it is worse than the other algorithms by a large margin, which is the case for all of IG in the grasping task. In our experiments, RAC with the Gibbs error objective has the lowest average cost for both tasks at almost every threshold; the version space variant has very close results, while the other baselines and IG do not perform as well on either the UAV search or the grasping task.

[Figure: Grasping — average cost versus Gibbs error threshold for each algorithm.]
[Figure: UAV search and rescue — average cost versus Gibbs error threshold for each algorithm.]

Conclusion. We study approximation algorithms for adaptive stochastic optimization over both sets and paths. We give two conditions on pointwise monotone submodular functions that are useful for understanding the performance of approximation algorithms on these problems, the MLB and the MLRB. Our algorithm, RAC, runs in polynomial time with an approximation ratio that depends on the constants characterizing these two conditions. The results extend known results for adaptive stochastic optimization problems from sets to paths and enlarge the class of functions known to be efficiently approximable for both problems. We apply the algorithm to two adaptive informative path planning applications with promising results.

Acknowledgement. This work is supported in part by an NUS AcRF grant, by the National Research Foundation Singapore through the SMART phase pilot program, and by the US Air Force Research Laboratory.

References
Arash Asadpour, Hamid Nazerzadeh, and Amin Saberi. Stochastic submodular maximization. In Internet and Network Economics.
Gruia Calinescu and Alexander Zelikovsky. The polymatroid Steiner problems. Journal of Combinatorial Optimization.
Nguyen Viet Cuong, Wee Sun Lee, and Nan Ye. Adaptive active learning with general loss. In Proc. Uncertainty in Artificial Intelligence.
Nguyen Viet Cuong, Wee Sun Lee, Nan Ye, Kian Ming Chai, and Hai Leong Chieu. Active learning for probabilistic hypotheses using the maximum Gibbs error criterion. In Advances in Neural Information Processing Systems (NIPS).
Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research.
Daniel Golovin, Andreas Krause, and Debajyoti Ray. Bayesian active learning with noisy observations. In Advances in Neural Information Processing Systems (NIPS).
Andrew Guillory and Jeff Bilmes. Interactive submodular set
cover in international conference on machine learning icml haifa israel anupam gupta viswanath nagarajan and ravi approximation algorithms for optimal decision trees and adaptive tsp problems in samson abramsky cyril gavoille claude kirchner friedhelm meyer auf der heide and paul spirakis editors automata languages and programming number in lecture notes in computer science pages springer berlin heidelberg january leslie pack kaelbling michael littman and anthony cassandra planning and acting in partially observable stochastic domains artificial intelligence january zhan wei lim david hsu and wee sun lee adaptive informative path planning in metric spaces in workshop on the algorithmic foundations of robotics george nemhauser laurence wolsey and marshall fisher an analysis of approximations for maximizing submodular set mathematical programming sylvie ong shao wei png david hsu and wee sun lee planning under uncertainty for robotic tasks with mixed observability int robotics research david silver and joel veness planning in large pomdps advances in neural information processing systems nips adhiraj somani nan ye david hsu and wee sun lee despot online pomdp planning with regularization in advances in neural information processing systems nips pages alice zheng irina rish and alina beygelzimer efficient test selection in active diagnosis via entropy approximation proc uncertainty in artificial intelligence 
beyond convexity stochastic optimization elad hazan princeton university kfir levy technion shai the hebrew university ehazan kfiryl shais abstract stochastic convex optimization is basic and well studied primitive in machine learning it is well known that convex and lipschitz functions can be minimized efficiently using stochastic gradient descent sgd the normalized gradient descent ngd algorithm is an adaptation of gradient descent which updates according to the direction of the gradients rather than the gradients themselves in this paper we analyze stochastic version of ngd and prove its convergence to global minimum for wider class of functions we require the functions to be and broadens the concept of unimodality to multidimensions and allows for certain types of saddle points which are known hurdle for optimization methods such as gradient descent functions are only required to be lipschitz in small region around the optimum this assumption circumvents gradient explosion which is another known hurdle for gradient descent variants interestingly unlike the vanilla sgd algorithm the stochastic normalized gradient descent algorithm provably requires minimal minibatch size introduction the benefits of using the stochastic gradient descent sgd scheme for learning could not be stressed enough for convex and lipschitz objectives sgd is guaranteed to find an solution within iterations and requires only an unbiased estimator for the gradient which is obtained with only one or few data samples however when applied to problems several drawbacks are revealed in particular sgd is widely used for deep learning one of the most interesting fields where stochastic optimization problems arise often the objective in these kind of problems demonstrates two extreme phenomena on the one hand with vanishing gradients and on the other hand high gradients as expected applying sgd to such problems is often reported to yield unsatisfactory results in this paper we analyze stochastic version of the normalized gradient descent ngd algorithm which we denote by sngd each iteration of sngd is as simple and efficient as sgd but is much more appropriate for optimization problems overcoming some of the pitfalls that sgd may encounter particularly we define family of and functions and prove that sngd is suitable for optimizing such objectives is generalization of unimodal functions to multidimensions which includes and convex functions as subclass functions allow for certain types of plateaus and saddle points which are difficult for sgd and other gradient descent variants is generalization of lipschitz functions that only assumes lipschitzness in small region around the minima whereas farther away the gradients may be unbounded gradient explosion is thus another difficulty that is successfully tackled by sngd and poses difficulties for other stochastic gradient descent variants our contributions we introduce property that extends and captures unimodal functions which are not we prove that ngd finds an minimum for such functions within iterations as special case we show that the above rate can be attained for functions that are lipschitz in an around the optimum gradients may be unbounded outside this region for objectives that are also smooth in an around the optimum we prove faster rate of we introduce new setup stochastic optimization of functions and show that this setup captures generalized linear models glm regression for this setup we devise stochastic version of ngd sngd and show that it converges within 
iterations to an minimum the above positive result requires that at each iteration of sngd the gradient should be estimated using minibatch of minimal size we provide negative result showing that if the minibatch size is too small then the algorithm might indeed diverge we report experimental results supporting our theoretical guarantees and demonstrate an accelerated convergence attained by sngd related work optimization problems arise in numerous fields spanning economics industrial organization and computer vision it is well known that optimization tasks can be solved by series of convex feasibility problems however generally solving such feasibility problems may be very costly there exists rich literature concerning optimization in the offline case pioneering paper by was the first to suggest an efficient algorithm namely normalized gradient descent and prove that this algorithm attains solution within iterations given differentiable objective this work was later extended by establishing the same rate for upper objectives in faster rates for optimization are attained but they assume to know the optimal value of the objective an assumption that generally does not hold in practice among the deep learning community there have been several attempts to tackle ideas spanning smart initialization and more have shown to improve training in practice yet non of these works provides theoretical analysis showing better convergence guarantees to the best of our knowledge there are no previous results on stochastic versions of ngd neither results regarding functions plateaus and cliffs difficulties for gd gradient descent with fixed step sizes including krf its stochastic variants is known to perform poorly when the gradients are krf too small in plateau area of the function or alternatively when the other extreme happens gradient explosions these two phenomena have been figure function with plateaus reported in certain types of and cliffs optimization such as training of deep networks figure depicts family of functions for which gd behaves provably poorly with large gd will hit the cliffs and then oscillate between the two boundaries alternatively with small step size the low gradients will cause gd to miss the middle valley which has constant size of on the other hand this exact function is and and hence the ngd algorithm provably converges to the optimum quickly definitions and notations we use to denote the euclidean norm bd denotes the dimensional euclidean ball of radius centered around and bd bd denotes the set for simplicity throughout the paper we always assume that functions are differentiable but if not stated explicitly we do not assume any bound on the norm of the gradients definition and let rd function is called if for every bd we have gkx yk similarly the function is if for every bd we have kx next we define functions definition we say that function rd is if rd such that it follows that xi we further say that is if it is and its gradients vanish only at the global minima informally the above characterization states that the opposite gradient of function directs us in global descent direction following is an equivalent more common definition definition we say that function rd is if any of is convex the set lα is convex the equivalence between the above definitions can be found in during this paper we denote the of at by sf does not fully capture the notion of unimodality in several dimension as an example let and consider the function it is natural to consider as unimodal since it 
acquires no local minima but for the unique global minima at however is not consider the points log log log log which belong to the set their average does not belong to the same since functions always enable us to explore meaning that the gradient always directs us in global descent direction intuitively from an optimization point of view we only need such direction whenever we do not exploit whenever we are not approximately optimal in what follows we define property that enables us to either this property captures wider class of unimodal function such as above rather than mere quasiconvexity later we justify this definition by showing that it captures generalized linear models glm regression see definition let rd we say that rd is slqc in if at least one of the following applies and for every it holds that xi note that if is and function then rd it holds that is in recalling the function that appears in equation then it can be shown that then this function is in where generalized linear models glm the idealized glm in this setup we have collection of samples xi yi bd and an activation function we are guaranteed to have rd such that yi xi we denote φhw xi hw xi the performance of predictor rd is measured by the average square error over all samples ec rrm yi φhw xi in it is shown that the perceptron problem with is private case of glm regression the sigmoid function is popular activation function in the field of deep learning the next lemma states that in the idealized glm problem with sigmoid activation then the error function is slqc but not as we will see in section this implies that algorithm finds an minima of ec rrm within poly iterations lemma consider the idealized glm problem with the sigmoid activation and assume that then the error function appearing in equation is ew in bd but it is not generally the noisy glm in the noisy glm setup see we may draw samples xi yi bd from an unknown distribution we assume that there exists predictor rd such that xi where is an activation function given rd we define its expected error as follows φhw xi and it can be shown that is global minima of we are interested in schemes that obtain an minima to within poly samples and optimization steps given samples from their empirical error ec rrm is defined as in equation the following lemma states that in this setup letting then ec rrm is slqc with high probability this property will enable us to apply algorithm to obtain an minima to within poly samples from and poly optimization steps lemma let consider the noisy glm problem with the sigmoid activation and assume that given fixed point then after log samples the empirical error function appearing in equation is ew in note that if we had required the slqc to hold then we would need the number of samples to depend on the dimension which we would like to avoid instead we require slqc to hold for fixed this satisfies the conditions of algorithm enabling us to find an solution with sample complexity that is independent of the dimension ngd for optimization here we present the ngd algorithm and prove the convergence rate of this algorithm for slqc objectives our analysis is simple enabling us to extend the convergence rate presented in beyond functions we then show that and objective are slqc implying that ngd converges even if the gradients are unbounded outside small region algorithm normalized gradient descent ngd input iterations rd learning rate for do update gt xt where gt xt kgt end for return arg min xt xt around the minima for and objectives we 
show that ngd attains faster convergence rate ngd is presented in algorithm ngd is similar to gd except we normalize the gradients it is intuitively clear that to obtain robustness to plateaus where the gradient can be arbitrarily small and to exploding gradients where the gradient can be arbitrarily large one must ignore the size of the gradient it is more surprising that the information in the direction of the gradient suffices to guarantee convergence following is the main theorem of this section theorem fix let rd and arg given that is slqc in every rd then running the ngd algorithm with and we have that theorem states that functions admit poly convergence rate using ngd the intuition behind this lies in definition which asserts that at point either the opposite gradient points out global optimization direction or we are already note that the requirement of in any is not restrictive as we have seen in section there are interesting examples of functions that admit this property and for any for simplicity we have presented ngd for unconstrained problems using projections we can easily extend the algorithm and and its analysis for constrained optimization over convex sets this will enable to achieve convergence of for the objective presented in equation and the idealized glm problem presented in section we are now ready to prove theorem proof of theorem first note that if the gradient of vanishes at xt then by the slqc assumption we must have that xt assume next that we perform iterations and the gradient of at xt never vanishes in these iterations consider the update rule of ngd algorithm then by standard algebra we get kxt xt assume that we have xt take and observe that ky the assumption implies that xt and therefore xt xt setting the above implies kxt kxt thus after iterations for which xt we get kxt therefore we must have optimization it can be shown that and of implies that is rd and arg therefore the following is direct corollary of theorem algorithm stochastic normalized gradient descent sngd input iterations rd learning rate minibatch size for do sample ψi db and define ft ψi update xt where gt xt end for return arg min xt ft xt gt kgt corollary fix let rd and arg given that is strictly and then running the ngd algorithm with and we have that in case is also we state an even faster rate theorem fix let rd and arg given that is strictly and the ngd algorithm with then running and we have that remark the above corollary resp theorem impliesp that could have arbitrarily large gradients and second derivatives outside resp yet ngd is still ensured to output an point within resp iterations we are not familiar with similar guarantee for gd even in the convex case sngd for stochastic slqc optimization here we describe the setting of stochastic slqc optimization then we describe our sngd algorithm which is ensured to yield an solution within poly queries we also show that the noisy glm problem described in section is an instance of stochastic slqc optimization allowing us to provably solve this problem within poly samples and optimization steps using sngd the stochastic slqc optimization setup consider the problem of minimizing function rd and assume there exists distribution over functions such that we assume that we may access by randomly sampling minibatches of size and querying the gradients of these minibatches thus upon querying point xt rd random minibatch pb ψi db is sampled and we receive xt where ft ψi we make the following assumption regarding the minibatch averages assumption 
let arg there exists and function that for then and the minibatch average ft pb ψi is in xt moreover we assume note that we assume that poly log justification of assumption noisy glm regression see section is an interesting instance of stochastic optimization problem where assumption holds indeed according to lemma given then for log samples the average minibatch function is in xt of minibatch averages is plausible assumption when we optimize an expected sum of functions that share common global minima or when the different global minima are close by as seen from the examples presented in equation and in sections this sum is generally not but is more often note that in the general case when the objective is sum of functions the number of local minima of such objective may grow exponentially with the dimension see this might imply that general setup where each is may be generally hard main results sngd is presented in algorithm sngd is similar to sgd except we normalize the gradients the normalization is crucial in order to take advantage of the slqc assumption and in order to overcome the hurdles of plateaus and cliffs following is our main theorem theorem fix suppose we run sngd with iterations and max log assume that for then and the function ft defined in the algorithm is and is also in xt then with probability of at least we have that we prove of theorem at the end of this section remark since and are equivalent to slqc the theorem implies that could have arbitrarily large gradients outside yet sngd is still ensured to output an point within iterations we are not familiar with similar guarantee for sgd even in the convex case remark theorem requires the minibatch size to be in the context of learning the number of functions corresponds to the number of training examples by standard sample complexity bounds should also be order of therefore one may wonder if the size of the minibatch should be order of this is not true since the required training set size is times the vc dimension of the hypothesis class in many practical cases the vc dimension is more significant than and therefore will be much larger than the required minibatch size the reason our analysis requires minibatch of size without the vc dimension factor is because we are just validating and not learning in sgd and for the case of convex functions even minibatch of size suffices for guaranteed convergence in contrast for sngd we require minibatch of size the theorem below shows that the requirement for large minibatch is not an artifact of our analysis but is truly required theorem let there exists distribution over convex functions such that running sngd with minibatch size of with high probability it never reaches an solution the gap between the upper bound of and the lower bound of remains as an open question we now provide sketch for the proof of theorem proof of theorem theorem is consequence of the following two lemmas in the first we show that whenever all ft are slqc there exists some such that ft xt ft in the second lemma we show that for large enough minibatch size then for any we have xt ft xt and ft combining these two lemmas we conclude that lemma let suppose we run sngd for iterations and assume that all ft are in xt whenever then we must have some for which ft xt ft lemma is proved similarly to theorem we omit the proof due to space constraints the second lemma relates ft xt ft to bound on xt lemma suppose log xt ft xt then and for every and also ft msgd nesterov sngd msgd nesterov sngd objective error 
objective iteration iteration iteration figure comparison between optimizations schemes left test error middle objective value on training set on the right we compare the objective of sngd for different minibatch sizes lemma is direct consequence of hoeffding bound using the definition of alg together with lemma gives ft xt ft combining the latter with lemma establishes theorem experiments better understanding of how to train deep neural networks is one of the greatest challenges in current machine learning and optimization since learning nn neural network architectures essentially requires to solve hard program we have decided to focus our empirical study on this type of tasks as test case we train neural network with single hidden layer of units over the mnist data set we use relu activation function and minimize the square loss we employ regularization over weights with parameter of at first we were interested in comparing the performance of sngd to msgd minibatch stochastic gradient descent and to stochastic variant of nesterov accelerated gradient method which is considered to be for msgd and nesterov method we used step size rule of the form ηt γt with and for sngd we used the constant step size of in nesterov method we used momentum of the comparison appears in figures as expected msgd converges relatively slowly conversely the performance of sngd is comparable with nesterov method all methods employed minibatch size of later we were interested in examining the effect of minibatch size on the performance of sngd we employed sngd with different minibatch sizes as seen in figure the performance improves significantly with the increase of minibatch size discussion we have presented the first provable algorithm for stochastic optimization this is first attempt at generalizing the machinery of stochastic convex optimization to the challenging problems facing machine learning and better characterizing the border between optimization and tractable cases such as the ones studied herein amongst the numerous challenging questions that remain we note that there is gap between the upper and lower bound of the minibatch size sufficient for sngd to provably converge acknowledgments the research leading to these results has received funding from the european union seventh framework programme under grant agreement shai is supported by isf and by intel references peter auer mark herbster and manfred warmuth exponentially many local minima for single neurons advances in neural information processing systems pages yoshua bengio learning deep architectures for ai foundations and trends in machine learning yoshua bengio patrice simard and paolo frasconi learning dependencies with gradient descent is difficult neural networks ieee transactions on stephen boyd and lieven vandenberghe convex optimization cambridge university press kenji doya bifurcations of recurrent neural networks in gradient descent learning ieee transactions on neural networks goffin luo and yinyu ye complexity analysis of an interior cutting plane method for convex feasibility problems siam journal on optimization adam tauman kalai and ravi sastry the isotron algorithm isotonic regression in colt qifa ke and takeo kanade quasiconvex optimization for robust geometric reconstruction pattern analysis and machine intelligence ieee transactions on rustem khabibullin method to find point of convex set issled prik krzysztof kiwiel convergence and efficiency of subgradient methods for quasiconvex minimization mathematical programming igor 
Konnov. On convergence properties of a subgradient method. Optimization Methods and Software.
Laffont and David Martimort. The Theory of Incentives: The Principal-Agent Model. Princeton University Press.
James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the International Conference on Machine Learning.
McCullagh and J. A. Nelder. Generalised Linear Models. London: Chapman and Hall.
Yu. Nesterov. Minimization methods for nonsmooth convex and quasiconvex functions. Matekon.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the International Conference on Machine Learning.
Boris Polyak. A general method of solving extremum problems. Dokl. Akademii Nauk SSSR.
Jarosław Sikorski. Quasi-subgradient algorithms for calculating surrogate constraints. In Analysis and Algorithms of Optimization Problems. Springer.
Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning.
Hal Varian. Price discrimination and social welfare. The American Economic Review.
Elmar Wolfstetter. Topics in Microeconomics: Industrial Organization, Auctions, and Incentives. Cambridge University Press.
Yaroslav Ivanovich Zabotin, A. I. Korablev, and Rustem Khabibullin. The minimization of quasicomplex functionals. Izv. Vyssh. Uchebn. Zaved.
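To make the SNGD procedure described above concrete, the following is a minimal runnable sketch, in Python/NumPy, of the normalized minibatch update, applied to a toy idealized-GLM instance with sigmoid activation of the kind analyzed above. The helper names (sngd, sample_batch, loss, grad) and all numerical settings (step size, minibatch size, number of iterations, problem size) are illustrative assumptions and are not taken from the paper; in the analysis above these quantities are set as functions of the target accuracy and the SLQC parameters.

```python
import numpy as np

def sngd(loss, grad, sample_batch, x0, T=2000, eta=0.05, b=200, seed=0):
    """Sketch of Stochastic Normalized Gradient Descent (SNGD).

    Each round draws a minibatch, forms the minibatch-average objective f_t,
    and steps a fixed distance eta in the direction -grad f_t / ||grad f_t||.
    The returned point is the iterate with the smallest observed minibatch objective.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    best_x, best_val = x.copy(), np.inf
    for _ in range(T):
        batch = sample_batch(b, rng)
        g = grad(x, batch)
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            # vanishing minibatch gradient: under (local) quasi-convexity this
            # iterate is already (near-)optimal for f_t, so stop here
            return x
        x = x - eta * g / g_norm   # normalized step: only the direction is used
        val = loss(x, batch)
        if val < best_val:
            best_x, best_val = x.copy(), val
    return best_x

# Toy idealized-GLM instance with sigmoid activation: y_i = sigmoid(<w*, x_i>).
# The squared error of this model is the kind of locally-quasi-convex objective
# discussed above; sizes and constants below are arbitrary illustrative choices.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, n = 20, 10000
rng0 = np.random.default_rng(1)
w_star = rng0.normal(size=d) / np.sqrt(d)
X = rng0.normal(size=(n, d)) / np.sqrt(d)
y = sigmoid(X @ w_star)

def sample_batch(b, rng):
    idx = rng.integers(0, n, size=b)
    return X[idx], y[idx]

def loss(w, batch):
    Xb, yb = batch
    return np.mean((yb - sigmoid(Xb @ w)) ** 2)

def grad(w, batch):
    Xb, yb = batch
    p = sigmoid(Xb @ w)
    # gradient of mean (y - sigmoid(<w, x>))^2 with respect to w
    return (-2.0 / len(yb)) * Xb.T @ ((yb - p) * p * (1.0 - p))

w_hat = sngd(loss, grad, sample_batch, x0=np.zeros(d))
print("squared error over all samples:", loss(w_hat, (X, y)))
```

The only difference from plain minibatch SGD is that the gradient is divided by its norm, so only its direction is used; this is what gives robustness to plateaus and cliffs, and it is also why a minimal minibatch size is needed, since the direction of a single noisy gradient can be arbitrarily misleading even when its expectation points the right way.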
tractable approximation to optimal point process filtering application to neural encoding yuval harel ron meir department of electrical engineering technion israel institute of technology technion city haifa israel yharel tx rmeir ee manfred opper department of artificial intelligence technical university berlin berlin germany opperm abstract the process of dynamic state estimation filtering based on point process observations is in general intractable numerical sampling techniques are often practically useful but lead to limited conceptual insight about optimal strategies which are of significant relevance to computational neuroscience we develop an analytically tractable bayesian approximation to optimal filtering based on point process observations which allows us to introduce distributional assumptions about sensory cell properties that greatly facilitate the analysis of optimal encoding in situations deviating from common assumptions of uniform coding the analytic framework leads to insights which are difficult to obtain from numerical algorithms and is consistent with experiments about the distribution of tuning curve centers interestingly we find that the information gained from the absence of spikes may be crucial to performance introduction the task of inferring hidden dynamic state based on partial noisy observations plays an important role within both applied and natural domains widely studied problem is that of online inference of the hidden state at given time based on observations up to to that time referred to as filtering for the linear setting with gaussian noise and quadratic cost the solution is well known since the early both for discrete and continuous times leading to the celebrated kalman and the filters respectively in these cases the exact posterior distribution is gaussian resulting in closed form recursive update equations for the mean and variance of this distribution implying filters however beyond some very specific settings the optimal filter is and impossible to compute in closed form requiring either approximate analytic techniques the extended kalman filter the unscented filter or numerical procedures particle filters the latter usually require time discretization and finite number of particles resulting in loss of precision for many practical tasks queuing and optical communication and biologically motivated problems natural observation process is given by point process observer leading to nonlinear optimal filter except in specific settings finite state spaces we consider and multivariate hidden markov process observed through set of sensory elements characterized by unimodal tuning functions representing the elements average firing rate the tuning function parameters are characterized by distribution allowing much flexibility the actual firing of each cell is random and is given by poisson process with rate determined by the input and by the cell tuning function inferring the hidden state under such circumstances has been widely studied within the computational neuroscience literature mostly for static stimuli in the more challenging and practically important dynamic setting much work has been devoted to the development of numerical sampling techniques for fast and effective approximation of the posterior distribution in this work we are less concerned with algorithmic issues and more with establishing analytic expressions for an approximately optimal filter see for previous work in related but more restrictive settings and using these to characterize 
the nature of encoders namely determining the structure of the tuning functions for optimal state inference significant advantage of the closed form expressions over purely numerical techniques is the insight and intuition that is gained from them about qualitative aspects of the system moreover the leverage gained by the analytic computation contributes to reducing the variance inherent to monte carlo approaches technically given the intractable nature of the posterior distribution we use projection method replacing the full posterior at each point in time by projection onto simple family of distributions gaussian in our case this approach originally developed in the filtering literature and termed assumed density filtering adf has been successfully used more recently in machine learning as far as we are aware this is the first application of this methodology to point process filtering the main contributions of the paper are the following derivation of closed form recursive expressions for the continuous time posterior mean and variance within the adf approximation allowing for the incorporation of distributional assumptions over sensory variables ii characterization of the optimal tuning curves encoders for sensory cells in more general setting than hitherto considered specifically we study the optimal shift of tuning curve centers providing an explanation for observed experimental phenomena iii demonstration that absence of spikes is informative and that depending on the relationship between the tuning curve distribution and the dynamic process the prior may significantly improve the inference this issue has not been emphasized in previous studies focusing on homogeneous populations we note that most previous work in the field of neural has dealt with static observations and was based on the fisher information which often leads to misleading qualitative results our results address the full dynamic setting in continuous time and provide results for the posterior variance which is shown to yield an excellent approximation of the posterior mean square error mse previous work addressing distributions over tuning curve parameters used static univariate observations and was based on fisher information rather than the mse itself problem formulation dense gaussian neural code we consider dynamical system with state xt rn observed through an observation process describing the firing patterns of sensory neurons in response to the process the observed process is diffusion process obeying the stochastic differential equation sde dxt xt dt xt dwt where are arbitrary functions and wt is standard brownian motion the initial condition is assumed to have continuous distribution with known density the observation process is marked point process defined on rm meaning that each point representing the firing of neuron is identified by its time and mark rm in this work the mark is interpreted as parameter of the firing neuron which we refer to as the neuron preferred stimulus specifically neuron with parameter is taken to have firing rate exp khx tc in response to state where and σtc are fixed matrices and the notation kykm denotes the choice of gaussian form for facilitates analytic tractability the inclusion of the matrix allows using models where only some dimensions are observed for example when the full state includes velocities but only locations are directly observable we also define nt rm nt is the total number of points up to time regardless of their location and denote by nt the sequence of points 
up to time formally the process restricted to rm following we use the notation dt dθ ti θi ti θi for rm and any function where ti θi are respectively the time and mark of the point of the process consider network with sensory neurons having random preferred stimuli θi that are drawn independently from common distribution with probability density which we refer to as the population density positing distribution for the preferred stimuli allows us to obtain simple closed form solutions and to optimize over distribution parameters rather than over the higherdimensional space of all the θi the total rate of spikes in set rm with preferred stimuli given xt is then λa θi exp khx θi averaging over tc we have the expected rate λa eλa hm exp khx dθ we tc now obtain an infinite neural network by considering the limit while holding hm fixed in the limit we have λa λa so that the process has density λt xt exp khxt tc meaning that the expected number of points in small rectangle dt θi θi dθi conditioned on the history nt is λt xt dt dθi dt finite network can be obtained as special case by taking to be sum of delta functions for analytic tractability we assume that is gaussian with center and covariance σpop namely σpop we refer to as the population center previous work considered the case where neurons preferred stimuli uniformly cover the space obtained by removing the factor from then the total firing rate λt dθ is independent of which simplifies the analysis and leads to gaussian posterior see we refer to the assumption that λt dθ is independent of as uniform uniform coding case may be obtained from our model by taking the limit pop with det σpop held constant optimal encoding and decoding we consider the question of optimal encoding and decoding under the above model the process of neural decoding is assumed to compute exactly or approximately the full posterior distribution of xt given nt the problem of neural encoding is then to choose the parameters σpop σtc which govern the statistics of the observation process given specific decoding scheme to quantify the performance of the system we summarize the result of decoding using single estimator nt and define the mean square error mse as trace xt xt we seek and that solve minφ minφ the inner minimization problem in this equation is solved by the decoder which is the posterior mean µt xt the posterior mean may be computed from the full posterior obtained by decoding the outer minimization problem is solved by the optimal encoder in principle the problem can be solved for any value of in order to assess performance it is convenient to consider the limit for the encoding problem below we find closed form approximate solution to the decoding problem for any using adf we then explore the problem of choosing the optimal encoding parameters using monte carlo simulations note that if decoding is exact the problem of optimal encoding becomes that of minimizing the expected posterior variance neural decoding exact filtering equations let denote the posterior density of xt given nt and etp the posterior expectation given nt the prior density is assumed to be known the problem of filtering diffusion process from doubly stochastic poisson process driven by is formally solved in the result is extended to marked point processes in where the authors derive stochastic pde for the posterior λt dp dt dt dθ dθ dt where the integral with respect to is interpreted as in is the state infinitesimal generator kolmogorov backward operator defined as lf is adjoint operator 
kolmogorov forward operator and etp λt xt λt dx the stochastic pde is usually intractable in the authors consider linear dynamics with uniform coding and gaussian prior in this case the posterior is gaussian and leads to closed form odes for its moments when the uniform coding assumption is violated the posterior is no longer gaussian still we can obtain exact equations for the posterior moments as follows let µt etp xt xt µt σt etp using and the known results for for diffusion processes see supplementary material the first two posterior moments can be shown to obey the following equations between spikes see for the finite population case dµt etp xt etp xt λt xt dθ dt dσt etp xt etp xt etp xt xt dt λt xt dθ adf approximation while equations are exact they are not practical since they require computation of etp we now proceed to find an approximate closed form for here we present the main ideas of the derivation the formulation presented here assumes for simplicity an setting where the system is passively observed it can be readily extended to setting and is presented in this more general framework in the supplementary material including full details to bring to closed form we use adf with an assumed gaussian density see for details conceptually this may be envisioned as integrating while replacing the distribution by its approximating gaussian at each time step assuming the moments are known exactly the gaussian is obtained by matching the first two moments of note that the solution of the resulting equations does not in general match the first two moments of the exact solution though it may approximate it abusing notation in the sequel we use µt σt to refer to the adf approximation rather than to the exact values substituting the normal distribution µt σt for to compute the expectations involving λt in and using and the gaussian form of results in computable gaussian integrals other terms may also be computed in closed form if the function can be expanded as power series this computation yields approximate equations for µt σt between spikes the updates at spike times can similarly be computed in closed form either from or directly from bayesian update of the posterior see supplementary material or the model considered in assumes linear dynamics and uniform coding meaning that the total rate of nt namely λt xt dθ is independent of xt however these assumption are only relevant to establish other proposition in that paper the proof of equation still holds as is in our more general setting population density firing rate for xt figure left changes to the posterior moments between spikes as function of the current terior mean estimate for static state the parameters are σpop σtc σt the bottom plot shows the density of preferred stimuli and tuning curve for neuron with preferred stimulus right an example of filtering linear process each dot correspond to spike with the vertical location indicating the preferred stimulus the curves to the right of the graph show the preferred stimulus density black and tuning curve centered at gray the tuning curve and preferred stimulus density are normalized to the same height for visualization the bottom graph shows the posterior variance with the vertical lines showing spike times parameters are note the decrease σtc σpop of the posterior variance following even though no spikes are observed for simplicity we assume that the dynamics are linear dxt axt dt dwt resulting in the filtering equations dµt aµt dt gt σt st hµt dt dt dθ dσt aσt σt at ddt dt gt σt st 
st hµt hµt st hσt dt dnt st σtc σpop hσt and where sttc σtc hσt gt dθ etp xt dθ det σtc st exp khµt ckst is the posterior expected total firing rate expressions including are to be interpreted as left limits which are necessary since the solution is discontinuous at spike times the last term in is to be interpreted as in it contributes an instantaneous jump in µt at the time of spike with preferred stimulus moving hµt closer to similarly the last term in contributes an instantaneous jump in σt at each spike time which is the same regardless of spike location all other terms describe the evolution of the posterior between spikes the first few terms in are the same as in the dynamics of the prior as in whereas the terms involving gt correspond to information from the absence of spikes note that the latter scale with gt the expected total firing rate lack of spikes becomes more informative the higher the expected rate of spikes it is illustrative to consider these equations in the scalar case with letting σt σtc σtc σpop σpop yields µt dt dt dθ dµt aµt dt gt σt σtc σtc pop µt dσt gt dt dnt σt σtc σtc σtc pop pop uniform coding filter adf accumulated mse for adf and filter true state adf uniform coding filter adf uniform mse figure left illustration of information gained between spikes static state xt shown in dotted line is observed and filtered twice with the correct value σpop adf solid blue line and with σpop uniform coding filter dashed line the curves to the right of the graph show the preferred stimulus density black and tuning curve centered at gray both filters are initialized with right comparison of mse for the adf filter and the uniform coding filter the vertical axis shows the integral of the square error integrated over the time interval averaged over trials shaded areas indicate estimated errors computed as the sample standard deviation divided by the square root of the number of trials parameters in both plots are σpop σtc where gt pop figure left shows how µt σt change betc tween spikes for static state in this case all terms in the filtering equations drop out except those involving gt the term involving gt in dµt pushes µt away from in the absence of spikes this effect weakens as grows due to the factor gt consistent with the idea that far from the lack of spikes is less surprising hence less informative the term involving gt in increases the variance when µt is near otherwise decreases it information from lack of spikes an interesting aspect of the filtering equations is that the dynamics of the posterior density between spikes differ from the prior dynamics this is in contrast to previous models which assumed uniform coding the exact filtering equations appearing in and have the same form as except that they do not include the correction terms involving gt so that between spikes the dynamics are identical to the prior dynamics this reflects the fact that lack of spikes in time interval is an indication that the total firing rate is low in the uniform coding case this is not informative since the total firing rate is independent of the state figure left illustrates the information gained from lack of spikes static scalar state is observed by process with rate and filtered twice once with the correct value of σpop and once with σpop as in the uniform coding filter of between spikes the adf estimate moves away from the population center whereas the uniform coding estimate remains fixed the size of this effect decreases with time as the posterior variance estimate not shown 
decreases the reduction in filtering errors gained from the additional terms in is illustrated in figure right despite the approximation involved the full filter significantly outperforms the uniform coding filter the difference disappears as σpop increases and the population becomes uniform special cases to gain additional insight into the filtering equations we consider their behavior in several limits as σpop spikes become rare as the density approaches for any the total expected rate of spikes gt also approaches and the terms corresponding to information from lack of spikes vanish other terms in the equations are unaffected ii in the limit σtc each neuron fires as poisson process with constant rate independent of the observed state the total expected firing rate gt saturates at its maximum therefore the preferred stimuli of spiking neurons provide no information nor does the presence or absence of spikes accordingly all terms other than those related to the prior dynamics vanish iii the uniform coding case is obtained as special case in the limit σpop with constant in this limit the terms involving gt drop out recovering the exact filtering equations in optimal neural encoding we model the problem of optimal neural encoding as choosing the parameters σpop σtc of the population and tuning curves so as to minimize the mse as noted above when the estimate is exactly the posterior mean this is equivalent to minimizing the expected posterior variance the posterior variance has the advantage of being less noisy than the square error itself since by definition it is the mean of the square error of the posterior mean under conditioning by nt we explore the question of optimal neural encoding by measuring the variance through monte carlo simulations of the system dynamics and the filtering equations since the posterior mean and variance computed by adf are approximate we verified numerically that the variance closely matches the mse in the steady state when averaged across many trials see supplementary material suggesting that asymptotically the error in estimating µt and σt is small optimal population center we now consider the question of the optimal value for the population center intuitively if the prior distribution of the process is unimodal with mode the optimal population center is at to produce the most spikes on the other hand the terms involving gt in the filtering equation suggest that the lack of spikes is also informative moreover as seen in figure left the posterior variance is reduced between spikes only when the current estimate is far enough from these considerations suggest that there is between maximizing the frequency of spikes and maximizing the information obtained from lack of spikes yielding an optimal value for that differs from we simulated simple process to determine the optimal value of which minimizes the approximate posterior variance σt figure left shows the posterior variance for varying values of the population center and base firing rate for each firing rate we note the value of minimizing the posterior variance the optimal population center as well as the value of cm argminc dσt which maximizes the reduction in the posterior variance when the current state estimate µt is at the process equilibrium consistent with the discussion above the optimal value lies between where spikes are most abundant and cm where lack of spikes is most informative as could be expected the optimal center is closer to the higher the base firing rate similarly wide tuning curves which 
render the spikes less informative lead to an optimal center farther from figure right shift of the population center relative to the prior mode has been observed physiologically in encoding of time differences for localization of sound sources in this phenomenon was explained in finite population model based on maximization of fisher information this is in contrast to the results of which consider heterogeneous population where the tuning curve width scales roughly inversely with neuron density in this case the population density maximizing the fisher information is shown to be monotonic with the prior more neurons should be assigned to more probable states this apparent discrepancy may be due to the scaling of tuning curve widths in which produces roughly constant total firing rate uniform coding this demonstrates that total firing rate which renders lack of spikes informative may be necessary to explain the physiologically observed shift phenomenon optimization of population distribution next we consider the optimization of the population distribution namely the simultaneous optimization of the population center and the population variance σpop in the case of static scalar state previous work using finite neuron population and fisher criterion has shown that the optimal distribution of preferred stimuli depends on the prior variance when it is small relative to the tuning curve width optimal encoding is achieved by placing all preferred stimuli at fixed distance from the prior mean on the other hand when the prior variance is large relative to the tuning curve width optimal encoding is uniform see figure in similar results are obtained with our model as shown in figure here static scalar state drawn from is filtered by population with tuning curve width σtc and preferred stimulus density σpop in figure left the prior distribution is narrow relative to the tuning curve width leading to an optimal population with narrow population distribution far from the origin in argminc argminc posterior stdev prior stdev posterior stdev prior stdev variance wide prior post stdev prior stdev variance narrow prior post stdev prior stdev figure optimal population center location for filtering linear process both graphs show the ratio of posterior standard deviation to the prior standard deviation of the process along with the value of minimizing the posterior variance blue line and minimizing the reduction of posterior variance when µt yellow line the process is initialized from its distribution the posterior variance is estimated by averaging over the time interval and across trials for each data point parameters for both graphs on the right σpop in the graph on the left σtc figure optimal population distribution depends on prior variance relative to tuning curve width static scalar state drawn from is filtered with tuning curve σtc and preferred stimulus both graphs show the posterior standard deviation relative to the prior standard density σpop deviation σp in the left graph the prior distribution is narrow whereas on the right it is wide in both cases the filter is initialized with the correct prior and the square error is averaged over the time interval and across trials for each data point figure right the prior is wide relative to the tuning curve width leading to an optimal population with variance that roughly matches the prior variance when both the tuning curves and the population density are narrow relative to the prior so that spikes are rare low values of σpop in figure right the adf 
approximation becomes poor, resulting in MSEs larger than the prior variance.

Conclusions. We have introduced an analytically tractable Bayesian approximation to point process filtering, allowing us to gain insight into the generally intractable filtering problem. The approach enables the derivation of encoding schemes going beyond previously studied uniform coding assumptions. The framework is presented in continuous time, circumventing temporal discretization errors and the numerical imprecision of discretized methods, applies to fully dynamic setups, and directly estimates the MSE rather than lower bounds to it. It successfully explains observed experimental results and opens the door to many future predictions. Future work will include the development of previously successful mean-field approaches within our more general framework, leading to further analytic insight; moreover, the proposed strategy may lead to practically useful decoding of spike trains.

References
Anderson and Moore. Optimal Filtering. Dover.
Kalman and Bucy. New results in linear filtering and prediction theory. J. of Basic Engineering, Trans. ASME, Series D.
Kalman. A new approach to linear filtering and prediction problems. J. of Basic Engineering, Trans. ASME, Series D.
Daum. Nonlinear filters: beyond the Kalman filter. Aerospace and Electronic Systems Magazine, IEEE.
Julier, Uhlmann, and Durrant-Whyte. A new method for the nonlinear transformation of means and covariances in filters and estimators. IEEE Trans. Autom. Control.
Doucet and Johansen. A tutorial on particle filtering and smoothing: fifteen years later. In Crisan and Rozovskii, editors, Handbook of Nonlinear Filtering. Oxford, UK: Oxford University Press.
Brémaud. Point Processes and Queues: Martingale Dynamics. Springer, New York.
Snyder and Miller. Random Point Processes in Time and Space. Springer, second edition.
Dayan and Abbott. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press.
Bobrowski, Meir, and Eldar. Bayesian filtering in spiking neural networks: noise, adaptation, and multisensory integration. Neural Comput., May.
Ahmadian, Pillow, and Paninski. Efficient Markov chain Monte Carlo methods for decoding neural spike trains. Neural Comput., Jan.
Susemihl, Meir, and Opper. Analytical results for the error in filtering of Gaussian processes. In Zemel, Bartlett, Pereira, and Weinberger, editors, Advances in Neural Information Processing Systems.
Susemihl, Meir, and Opper. Dynamic state estimation based on Poisson spike trains: towards a theory of optimal encoding. Journal of Statistical Mechanics: Theory and Experiment.
Maybeck. Stochastic Models, Estimation, and Control. Academic Press.
Brigo, Hanzon, and Le Gland. A differential geometric approach to nonlinear filtering: the projection filter. Automatic Control, IEEE Transactions on.
Opper. A Bayesian approach to online learning. In Saad, editor, On-line Learning in Neural Networks. Cambridge University Press.
Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers.
Harper and McAlpine. Optimal neural population coding of an auditory spatial cue. Nature, Aug.
Bethge, Rotermund, and Pawelzik. Optimal population coding when Fisher information fails. Neural Comput., Oct.
Yaeli and Meir. Analysis of optimal tuning functions explains phenomena observed in sensory neurons. Front. Comput. Neurosci.
Ganguli and Simoncelli. Efficient sensory encoding and Bayesian inference with heterogeneous neural populations. Neural Comput.
Rhodes and Snyder. Estimation and control performance for space-time point-process observations. IEEE Transactions on Automatic Control.
Susemihl, Meir, and Opper. Optimal neural codes for control and estimation. Advances in Neural Information Processing Systems.
Snyder. Filtering and detection for doubly stochastic Poisson processes. IEEE Transactions on Information Theory, January.
Brand, Behrend, Marquardt, McAlpine, and Grothe. Precise inhibition is essential for microsecond interaural time difference coding. Nature.
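The observation model above is straightforward to simulate, and for a scalar state the exact posterior can be computed by brute force on a grid, which is a useful reference point for the closed-form ADF filter derived above, since the ADF equations approximate exactly this posterior by a Gaussian. The sketch below is not the paper's filter: it assumes a static scalar state, a finite population of neurons with Gaussian tuning curves and Gaussian-distributed preferred stimuli, and a discrete-time Bernoulli approximation of Poisson spiking; all names and parameter values are illustrative assumptions.

```python
import numpy as np

# Simulate the Gaussian tuning-curve / Gaussian population observation model
# for a static scalar state and compute the exact posterior on a grid.

rng = np.random.default_rng(0)

sigma_prior = 1.0            # prior std of the static state x ~ N(0, sigma_prior^2)
c, sigma_pop = 0.0, 0.5      # population centre and spread of preferred stimuli
sigma_tc = 0.3               # tuning-curve width
h = 10.0                     # peak firing rate of a single neuron
n_neurons = 50
T, dt = 2.0, 1e-3            # observation window and simulation step

x_true = rng.normal(0.0, sigma_prior)              # hidden static state
theta = rng.normal(c, sigma_pop, size=n_neurons)   # preferred stimuli

def rates(x):
    """Per-neuron firing rate at state x (Gaussian tuning curves)."""
    return h * np.exp(-0.5 * (x - theta) ** 2 / sigma_tc ** 2)

grid = np.linspace(-4.0, 4.0, 801)
dx = grid[1] - grid[0]
log_post = -0.5 * grid ** 2 / sigma_prior ** 2      # log prior, up to a constant
grid_rates = np.stack([rates(x) for x in grid])     # shape (n_grid, n_neurons)

for step in range(int(T / dt)):
    # Bernoulli approximation of Poisson spiking in a bin of length dt
    spikes = rng.random(n_neurons) < rates(x_true) * dt
    # "no spike" factor exp(-sum_i rate_i(x) dt): this term makes the absence of
    # spikes informative, because the total rate depends on x whenever the
    # population of preferred stimuli is not uniform
    log_post -= grid_rates.sum(axis=1) * dt
    if spikes.any():
        # likelihood factor (rate_i(x) dt) for each neuron that spiked in this bin
        log_post += np.log(grid_rates[:, spikes] * dt).sum(axis=1)
    log_post -= log_post.max()

post = np.exp(log_post)
post /= post.sum() * dx
post_mean = (grid * post).sum() * dx
print(f"true state {x_true:+.3f}, posterior mean {post_mean:+.3f}")
```

In the uniform-coding limit discussed above (population spread taken to infinity with the overall intensity held fixed), the summed-rate term becomes independent of x and cancels after normalization, so between spikes the posterior would simply follow the prior; with the localized population used here it instead drifts away from the population centre between spikes, which is the effect illustrated in the figures described above.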
lower bounds for sparse pca tengyu and avi department of computer science princeton university school of mathematics institute for advanced study abstract this paper establishes statistical versus computational for solving basic machine learning problem via basic convex relaxation method specifically we consider the sparse principal component analysis sparse pca problem and the family of sos aka convex relaxations it was well known that in large dimension planted unit vector can be in principle detected using only log gaussian or bernoulli samples but all efficient polynomial time algorithms known require samples it was also known that this quadratic gap can not be improved by the the most basic sdp aka spectral relaxation equivalent to sos algorithms here we prove that also sos algorithms can not improve this quadratic gap this lower bound adds to the small collection of hardness results in machine learning for this powerful family of convex relaxation algorithms moreover our design of moments or for this lower bound is quite different than previous lower bounds establishing lower bounds for higher degree sos algorithms for remains challenging problem introduction we start with general discussion of the tension between sample size and computational efficiency in statistical and learning problems we then describe the concrete model and problem at hand algorithms and the problem all are broad topics studied from different viewpoints and the given references provide more information statistical computational modern machine learning and statistical inference problems are often high dimensional and it is highly desirable to solve them using far less samples than the ambient dimension luckily we often know or assume some underlying structure of the objects sought which allows such savings in principle typical such assumption is that the number of real degrees of freedom is far smaller than the dimension examples include sparsity constraints for vectors and low rank for matrices and tensors the main difficulty that occurs in nearly all these problems is that while information theoretically the sought answer is present with high probability in small number of samples actually computing or even approximating it from these many samples is computationally hard problem it is often expressed as optimization program which is in the worst case and seemingly hard even on random instances given this state of affairs relaxed formulations of such programs were proposed which can be solved efficiently but sometimes to achieve accurate results seem to require far more samples supported in part by simons award for graduate students in theoretical computer science supported in part by nsf grant than existential bounds provide this phenomenon has been coined the statistical versus computational by chandrasekaran and jordan who motivate and formalize one framework to study it in which efficient algorithms come from the family of convex relaxations which we shall presently discuss they further give detailed study of this for the basic problem in various settings some exhibiting the and others that do not this was observed in other practical machine learning problems in particular for the sparse pca problem that will be our focus by berthet and rigollet as it turns out the study of the same phenomenon was proposed even earlier in computational complexity primarily from theoretical motivations decatur goldreich and ron initiate the study of computational sample complexity to study statistical versus computation in 
samplesize in their framework efficient algorithms are arbitrary polynomial time ones not restricted to any particular structure like convex relaxations they point out for example that in the framework of and valiant there is often no such the reason is that the number of samples is essentially determined up to logarithmic factors which we will mostly ignore here by the of the given concept class learned and moreover an occam algorithm computing any consistent hypothesis suffices for classification from these many samples so in the many cases where efficiently finding hypothesis consistent with the data is possible enough samples to learn are enough to do so efficiently this paper also provide examples where this is not the case in pac learning and then turns to an extensive study of possible for learning various concept classes under the uniform distribution this direction was further developed by servedio the fast growth of big data research the variety of problems successfully attacked by various heuristics and the attempts to find efficient algorithms with provable guarantees is growing area of interaction between statisticians and machine learning researchers on the one hand and optimization and computer scientists on the other the between sample size and computational complexity which seems to be present for many such problems reflects curious conflict between these fields as in the first more data is good news as it allows more accurate inference and prediction whereas in the second it is bad news as larger input size is source of increased complexity and inefficiency more importantly understanding this phenomenon can serve as guide to the design of better algorithms from both statistical and computational viewpoints especially for problems in which data acquisition itself is costly and not just computation basic question is thus for which problems is such inherent and to establish the limits of what is achievable by efficient methods establishing has two parts one has to prove an existential information theoretic upper bound on the number of samples needed when efficiency is not an issue and then prove computational lower bound on the number of samples for the class of efficient algorithms at hand needless to say it is desirable that the lower bounds hold for as wide class of algorithms as possible and that it will match the best known upper bound achieved by algorithms from this class the most general one the computational complexity framework of allows all algorithms here one can not hope for unconditional lower bounds and so existing lower bounds rely on computational assumptions cryptographic assumptions that factoring integers has no polynomial time algorithm or other average case assumptions for example hardness of refuting random was used for establishing the tradeoff for learning halfspaces and hardness of finding planted clique in random graphs was used for tradeoff in sparse pca on the other hand in frameworks such as where the class of efficient algorithms is more restricted family of convex relaxations one can hope to prove unconditional lower bounds which are called integrality gaps in the optimization and algorithms literature our main result is of this nature adding to the small number of such lower bounds for machine learning problems we now describe and motivate sos convex relaxations algorithms and the sparse pca problem convex relaxations algorithms sometimes called the lasserre hierarchy encompasses perhaps the strongest known algorithmic technique for diverse 
set of optimization problems it is family of convex relaxations introduced independently around the year by lasserre parillo and in the equivalent context of proof systems by grigoriev these papers followed better and better understanding in real algebraic geometry of david hilbert famous problem on certifying the of polynomial by writing it as sum of squares which explains the name of this method we only briefly describe this important class of algorithms far more can be found in the book and the excellent extensive survey the sos method provides principled way of adding constraints to linear or convex program in way that obtains tighter and tighter convex sets containing all solutions of the original problem this family of algorithms is parametrized by their degree sometimes called the number of rounds as gets larger the approximation becomes better but the running time becomes slower specifically no thus in practice one hopes that small degree ideally constant would provide sufficiently good approximation so that the algorithm would run in polynomial time this method extends the standard relaxation sdp sometimes called spectral that is captured already by sos algorithms moreover it is more powerful than two earlier families of relaxations the and hierarchies the introduction of these algorithms has made huge splash in the optimization community and numerous applications of it to problems in diverse fields were found that greatly improve solution quality and time performance over all past methods for large classes of problems they are considered the strongest algorithmic technique known relevant to us is the very recent growing set of applications of sos algorithms to machine learning problems such as the survey contains some of these exciting developments section contains some selfcontained material about the general framework sos algorithms as well given their power it was natural to consider proving lower bounds on what sos algorithms can do there has been an impressive progress on sos degree lower bounds via beautiful techniques for variety of combinatorial optimization problems however for machine learning problems relatively few such lower bounds above sdp level are known and follow via reductions to the above bounds so it is interesting to enrich the set of techniques for proving such limits on the power of sos for ml the lower bound we prove indeed seem to follow different route than previous such proofs sparse pca sparse principal component analysis the version of the classical pca problem which assumes that the direction of variance of the data has sparse structure is by now central problem of highdiminsional statistical analysis in this paper we focus on the covariance model introduced by johnstone one observes samples from gaussian distribution with covariance λvv where the planted vector is assumed to be sparse vector with at most entries and represents the strength of the signal the task is to find or estimate the sparse vector more general versions of the problem allow several sparse and general covariance matrix sparse pca and its variants have wide variety of applications ranging from signal processing to biology see the hardness of sparse pca at least in the worst case can be seen through its connection to the clique problem in graphs note that if is adjacency matrix of graph with on the diagonal then it has eigenvector with eigenvalue if and only if the graph has this connection between these two problems is actually deeper and will appear again below for our real 
average case version above from theoretical point of view sparse pca is one of the simplest examples where we observe gap between the number of samples needed information theoretically and the number of samples needed for polynomial time estimator it has been well understood that information theoretically given log one can estimate up to constant error in euclidean norm using therefore not polynomial time optimization algorithm on the other hand all the existing provable polynomial time algorithms which use either diagonal thresholding for the single spiked model or semidefinite programming for general covariance first introduced for this problem in need at least quadratically many samples to solve the problem namely moreover krauthgamer nadler and vilenchik and berthet and rigollet have shown that for programs sdp this bound is tight specifically the natural sdp can not even solve the detection problem to distinguish the data from covariance we treat as constant so that we omit the dependence on it for simplicity throughout the introduction section λvv from the null hypothesis in which no sparse vector is planted namely the samples are drawn from the gaussian distribution with covariance matrix recall that the natural sdp for this problem and many others is just the first level of the sos hierarchy namely given the importance of the sparse pca it is an intriguing question whether one can solve it efficiently with far fewer samples by allowing sos algorithms with larger very interesting conditional negative answer was suggested by berthet and rigollet they gave an efficient reduction from planted problem to sparse pca which shows in particular that sos algorithms for sparse pca will imply similar ones for planted clique gao ma and zhou strengthen the result by establishing the hardness of the gaussian singlespiked covariance model which is an interesting subset of models considered by these are useful as nontrivial sos lower bounds for planted clique were recently proved by see there for the precise description history and motivation for planted clique as argue strong yet believed bounds if true would imply that the quadratic gap is tight for any constant before the submission of this paper the known lower bounds above for planted clique were not strong enough yet to yield any lower bound for sparse pca beyond the minimax sample complexity we also note that the recent progress that show the tight lower bounds for planted clique together with the reductions of also imply the tight lower bounds for sparse pca as shown in this paper our contribution we give direct unconditional lower bound proof for computing sparse pca using sos samples to solve the detection problem theoalgorithms showing that they too require rem which is tight up to polylogarithmic factors when the strength of the signal is constant indeed the theorem gives lower bound for every strength which becomes weaker as gets larger our proof proceeds by constructing the necessary for the sos program that achieve too high an objective value in the jargon of optimization we prove an integrality gap for these programs as usual in such proofs there is tension between having the satisfy the constraints of the program and keeping them positive semidefinite psd differing from past lower bound proofs we construct two different psd moments each approximately satisfying one sets of constraints in the program and is negligible on the rest thus their sum give psd moments which approximately satisfy all constraints we then perturb these 
moments to satisfy constraints exactly and show that with high probability over the random data this perturbation leaves the moments psd we note several features of our lower bound proof which makes the result particularly strong and general first it applies not only for the gaussian distribution but also for bernoulli and other distributions indeed we give set of natural pseudorandomness conditions on the sampled data vectors under which the sos algorithm is fooled and show that these conditions are satisfied with high probability under many similar distributions possessing strong concentration of measure next our lower bound holds even if the hidden sparse vector is discrete namely its entries come from the set we also extend the lower bound for the detection problem to apply also to the estimation problem in the regime when the ambient dimension is linear in the number of samples namely bn for constant organization section provides more backgrounds of sparse pca and sos algorithms we state our main results in section complete paper is available as supplementary material or on arxiv formal description of the model and problem notation we will assume that are all sufficiently and that throughout this paper by with high probability some event happens we mean the failure probability is bounded by for every constant as tends to infinity sparse pca estimation and detection problems we will consider the simplest setting of sparse pca which is called covariance model in literature note that restricting to an average case version of the clique problem in which the input is random graph in which much larger than expected clique is planted or we assume that they go to infinity as typically done in statistics special case makes our lower bound hold in all generalizations of this simple model in this model the task is to recover single sparse vector from noisy samples as follows the hidden data is an unknown vector rp with and kvk to make the task easier and so the lower bound stronger we even assume that has discrete entries namely that vi for all we observe noisy rp that are generated as follows each is independently drawn as λg from distribution which generalizes both gaussian and bernoulli noise to namely the are real random variable with mean and variance and are random vectors which have independent entries with mean zero and variance therefore under this model the covariance of is equal to λvv moreover we assume that and entries of are with variance proxy given these samples the estimation problem is to approximate the unknown sparse vector up to sign flip it is also interesting to also consider the sparse component detection problem which is the decision problem of distinguishing from random samples the following two distributions data is purely random hv data λg contains hidden sparse signal with strength rigollet observed that polynomial time algorithm for estimation version of sparse pca with constant error implies that an algorithm for the detection problem with twice number of the samples thus for polynomial time lower bounds it suffices consider the detection problem we will use as shorthand for the matrix we denote the rows of as xpt therefore xi are column vectors the empirical covariance matrix is defined as xx statistically optimal it is well known that the following program achieves optimal statistical minimax rate for the estimation problem and the optimal sample complexity for the detection problem note that we scale the variables up by factor of for simplicity the hidden vector 
now has entries from max subject to λkmax xxt proposition informally stated the program statistically optimally solves the sparse pca problem when log for some sufficiently large namely the following hold with high probability if is generated from hv then optimal solution xopt of program satisfies xopt xtopt vv and the objective value λkmax is at least on the other hand if is generated from null hypothesis then λmax is at most therefore for the detection problem once can simply use the test λkmax to distinguish the case of and hv with samples however this test is highly inefficient as the best known ways for computing λkmax take exponential time we now turn to consider efficient ways of solving this problem sum of squares lasserre relaxations here we will only briefly introduce the basic ideas of lasserre relaxation that will be used for this paper we refer readers to the extensive for detailed discussions of sum of squares algorithms and proofs and their applications to algorithm design let denote the set of all real polynomials of degree at most with variables xn we start by defining the notion of sometimes called the intuition is that these behave like the actual first moments of real probability distribution real random variable is subgaussian with variance proxy if it has similar tail behavior as gaussian distribution with variance more formally if for any exp tx exp definition is linear operator that maps to and satisfies and for all real polynomials of degree at most for we use xs to denote the monomial xi since is linear operator it can be clearly described by all the values of on the monomial of degree that is all the values of xs for of size at most uniquely determines moreover the nonnegativity constraint is equivalent to the positive semidefiniteness of the as defined below and therefore the set of all is convex definition for an even integer and any we define the of as the trivial way of viewing all the values of on monomials as matrix we use mat to denote the matrix that is indexed by of with size at most and mat xs xt given polynomials and qm of degree at most and polynomial program maximize subject to qi we can write sum of squares based relaxation in the following way instead of searching over rn we search over all the possible of hypothetical distribution over solutions that satisfy the constraints above the key of the relaxation is to consider only moments up to degree concretely we have the following semidefinite program in roughly nd variables variables xs maximize deg qi subject to qi mat note that is valid relaxation because for any solution of if we define xs to be xs then satisfies all the constraints and the objective value is therefore it is guaranteed that the optimal value of is always larger than that of finally the key point is that this program can be solved efficiently in polynomial time in its size namely in time no as grows the constraints added make the defined by the moments closer and closer to an actual distribution thus providing tighter relaxation at the cost of larger running time to solve it in the next section we apply this relaxation to the sparse pca problem and state our results main results to exploit the sum of squares relaxation framework as described in section we first convert the statistically optimal into the polynomial program version below xxt subject and xi the sparsity constraint is replaced by the polynomial constraint which ensures that any solution vector has entries in and so together with the constraint guarantees that it has 
precisely entries the constraint implies other natural constraints that one may add to the program in order to make it stronger for example the upper bound on each entry xi the lower bound on the entries of xi and the constraint which is used as surrogate for vectors in note that we also added an sparsity constraint which is convex as is often used in practice and makes our lower bound even stronger of course it is formally implied by the other constraints but not in sos now we are ready to apply the relaxation scheme described in section to the polynomial program above as for relaxation we obtain the following semidefinite program which we view as an algorithm for both detection and estimation problems note that the same objective function with only the three constraints gives the relaxation which is precisely the standard sdp relaxation of sparse pca studied in so clearly subsumes the sdp relaxation algorithm sum of squares relaxation solve the following sdp and obtain optimal objective value and maximizer variables for all of size at most max xi xj obj subject to and xj xi xj and xi xj xi xj km xi xj xi xj xs xt and output for detection problem output hv if otherwise for estimation problem output xi xj before stating the lower bounds for both detection and estimation in the next two subsections we comment on the choices made for the outputs of the algorithm in both as clearly other choices can be made that would be interesting to investigate for detection we pick the natural threshold from the statistically optimal detection algorithm of section our lower bound of the objective under is actually large constant multiple of λk so we could have taken higher threshold to analyze even higher ones would require analyzing the behavior of under the planted alternative distribution hv for estimation we output the maximizer of the objective function and prove that it is not too correlated with the matrix vv in the planted distribution hv this suggest but does not prove that the leading eigenvector of which is natural estimator for is not too correlated with we finally note that rigollet efficient reduction from detection to estimation is not in the sos framework and so our detection lower bound does not automatically imply the one for estimation for the detection problem we prove that gives large objective on null hypothesis theorem there exists absolute constant and such that for min and any cλn logr the following holds when the data is drawn from the null hypothesis then with high probability the objective value of sum of squares relaxation is at least consequently algorithm can solve the detection problem to parse the theorem and to understand its consequence consider first the case when is constant which is also arguably the most interesting regime then the theorem says that when we have only samples sos relaxation still overfits heavily to the randomness of the data under the null hypothesis therefore using or even as threshold will fail with high probability to distinguish and hv we note that for constant our result is essentially tight in terms of the dependencies between the condition is necessary since otherwise when even without the sum of squares relaxation the objective value is controlled by since has maximum is eigenvalue in this regime furthermore as mentioned in the introduction also necessary up to factors since when simple diagonal thresholding algorithm works for this simple model when is not considered as constant the the lower bound on is not optimal but close ideally one 
could expect that as long as and λn the objective value on the null hypothesis is at least λk tightening the slack and possibly extending the range of are left to future study finally we note that he result can be extended to lower bound for the estimation problem which is presented in the supplementary material references venkat chandrasekaran and michael jordan computational and statistical tradeoffs via convex relaxation proceedings of the national academy of sciences im johnstone function estimation and gaussian sequence models unpublished manuscript donoho by ieee trans inf may david donoho and iain johnstone minimax estimation via wavelet shrinkage ann quentin berthet and philippe rigollet complexity theoretic lower bounds for sparse principal component detection in colt the annual conference on learning theory june princeton university nj usa pages scott decatur oded goldreich and dana ron computational sample complexity in proceedings of the tenth annual conference on computational learning theory colt pages new york ny usa acm rocco servedio computational sample complexity and learning journal of computer and system sciences amit daniely nati linial and shai more data speeds up training time in learning halfspaces over sparse vectors in christopher burges bottou zoubin ghahramani and kilian weinberger editors advances in neural information processing systems annual conference on neural information processing systems proceedings of meeting held december lake tahoe nevada united pages gao ma and zhou sparse cca adaptive estimation and computational barriers arxiv september jean lasserre global optimization with polynomials and the problem of moments siam journal on optimization pablo parrilo structured semidefinite programs and semialgebraic geometry methods in robustness and optimization phd thesis california institute of technology dima grigoriev linear lower bound on degrees of positivstellensatz calculus proofs for the parity theoretical computer science emil artin die zerlegung definiter funktionen in quadrate in abhandlungen aus dem mathematischen seminar der hamburg volume pages springer krivine anneaux journal analyse gilbert stengle nullstellensatz and positivstellensatz in semialgebraic geometry mathematische annalen shor an approach to obtaining global extremums in polynomial mathematical programming problems cybernetics konrad problem for compact sets mathematische annalen mihai putinar positive polynomials on compact sets indiana university mathematics journal yurii nesterov squared functional systems and optimization problems in hans frenk kees roos tams terlaky and shuzhong zhang editors high performance optimization volume of applied optimization pages springer us jean bernard lasserre an introduction to polynomial and optimization cambridge texts in applied mathematics cambridge cambridge university press monique laurent sums of squares moment matrices and optimization over polynomials in mihai putinar and seth sullivant editors emerging applications of algebraic geometry volume of the ima volumes in mathematics and its applications pages springer new york hanif sherali and warren adams hierarchy of relaxations between the continuous and convex hull representations for programming problems siam journal on discrete mathematics and schrijver cones of matrices and and optimization siam journal on optimization boaz barak jonathan kelner and david steurer dictionary learning and tensor decomposition via the method in proceedings of the annual acm symposium on theory of 
computing stoc boaz barak jonathan kelner and david steurer rounding relaxations in stoc pages boaz barak and ankur moitra tensor prediction rademacher complexity and random corr boaz barak and david steurer proofs and the quest toward optimal algorithms in proceedings of international congress of mathematicians icm to appear grigoriev complexity of positivstellensatz proofs for the knapsack computational complexity grant schoenebeck linear level lasserre lower bounds for certain in proceedings of the annual ieee symposium on foundations of computer science focs pages washington dc usa ieee computer society raghu meka aaron potechin and avi wigderson lower bounds for planted clique corr wang gu and liu statistical limits of convex relaxations arxiv march iain johnstone on the distribution of the largest eigenvalue in principal components analysis ann zongming ma sparse principal component analysis and iterative thresholding ann vincent vu and jing lei minimax sparse principal subspace estimation in high dimensions ann alon barkai notterman gish ybarra mack and levine broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays proceedings of the national academy of sciences iain johnstone and arthur yu lu on consistency and sparsity for principal components analysis in high dimensions journal of the american statistical association pp xi chen adaptive sparse principal component analysis for pathway association testing statistical applications in genetics and molecular biology rodolphe jenatton guillaume obozinski and francis bach structured sparse principal component analysis in proceedings of the thirteenth international conference on artificial intelligence and statistics aistats chia laguna resort sardinia italy may pages vincent vu and jing lei minimax rates of estimation for sparse pca in high dimensions in proceedings of the fifteenth international conference on artificial intelligence and statistics aistats la palma canary islands april pages debashis paul and iain johnstone augmented sparse principal component analysis for high dimensional data arxiv preprint quentin berthet and philippe rigollet optimal detection of sparse principal components in high dimension the annals of statistics arash amini and martin wainwright analysis of semidefinite relaxations for sparse principal components ann yash deshpande and andrea montanari sparse pca via covariance thresholding in advances in neural information processing systems annual conference on neural information processing systems december montreal quebec canada pages alexandre aspremont laurent el ghaoui michael jordan and gert lanckriet direct formulation for sparse pca using semidefinite programming siam review robert krauthgamer boaz nadler and dan vilenchik do semidefinite relaxations solve sparse pca up to the information limit the annals of statistics deshpande and montanari improved lower bounds for hidden clique and hidden submatrix problems arxiv february prasad raghavendra and tselil schramm tight lower bounds for planted clique in the sos program corr samuel hopkins pravesh kothari and aaron potechin sos and planted clique tight analysis of mpw moments at all degrees and an optimal lower bound at degree four corr tengyu ma and philippe rigollet personal communication 
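Before moving on, a small illustrative sketch of the single-spiked covariance model and the detection statistics discussed in the sparse PCA section above may be helpful. This is a minimal Python sketch under simplifying assumptions (Gaussian noise, a k-sparse vector v with entries ±1/√k, signal strength lam); the function names are illustrative, and the brute-force statistic below is the statistically optimal but exponential-time test, not the SOS relaxation analyzed in the paper.

```python
import itertools
import numpy as np

def sample_spiked(n, p, k, lam, planted=True, rng=None):
    """Draw n samples x_i = sqrt(lam) * g_i * v + noise_i from the single-spiked
    covariance model (covariance I + lam * v v^T); with planted=False, v = 0 (null)."""
    rng = np.random.default_rng() if rng is None else rng
    v = np.zeros(p)
    if planted:
        support = rng.choice(p, size=k, replace=False)
        v[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    g = rng.standard_normal(n)            # scalar signal coefficients
    noise = rng.standard_normal((n, p))   # isotropic Gaussian noise
    return np.sqrt(lam) * np.outer(g, v) + noise, v

def diagonal_thresholding_stat(X, k):
    """Sum of the k largest diagonal entries of the empirical covariance;
    on the support these entries are inflated by roughly lam/k under the planted model."""
    n = X.shape[0]
    diag = np.einsum('ij,ij->j', X, X) / n
    return np.sort(diag)[-k:].sum()

def sparse_eigenvalue_bruteforce(X, k):
    """Exponential-time statistic: the largest eigenvalue over all k x k principal
    submatrices of the empirical covariance (the statistically optimal test)."""
    n, p = X.shape
    S = X.T @ X / n
    best = -np.inf
    for subset in itertools.combinations(range(p), k):
        sub = S[np.ix_(subset, subset)]
        best = max(best, np.linalg.eigvalsh(sub)[-1])
    return best
```

Comparing `diagonal_thresholding_stat` on planted versus null draws for growing n illustrates why this simple polynomial-time test needs roughly quadratically many samples in k, in line with the discussion above.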
majority voting for learning from crowds tian tian jun zhu department of computer science technology center for computing research tsinghua national lab for information science technology state key lab of intelligent technology systems tsinghua university beijing china dcszj abstract aims to design proper aggregation strategies to infer the unknown true labels from the noisy labels provided by ordinary web workers this paper presents majority voting to improve the discriminative ability of majority voting and further presents bayesian generalization to incorporate the flexibility of generative methods on modeling noisy observations with worker confusion matrices we formulate the joint learning as regularized bayesian inference problem where the posterior regularization is derived by maximizing the margin between the aggregated score of potential true label and that of any alternative label our bayesian model naturally covers the estimator and empirical results demonstrate that our methods are competitive often achieving better results than estimators introduction many learning tasks require labeling large datasets though reliable it is often too expensive and to collect labels from domain experts or workers recently online crowdsourcing platforms have dramatically decreased the labeling cost by dividing the workload into small parts then distributing to crowd of ordinary web workers however the labeling accuracy of web workers could be lower than expected due to their various backgrounds or lack of knowledge to improve the accuracy it is usually suggested to label every task multiple times by different workers then the redundant labels can provide hints on resolving the true labels much progress has been made in designing effective aggregation mechanisms to infer the true labels from noisy observations from modeling perspective existing work includes both generative approaches and discriminative approaches generative method builds flexible probabilistic model for generating the noisy observations conditioned on the unknown true labels and some behavior assumptions with examples of the ds estimator the minimax entropy entropy and their variants in contrast discriminative approach does not model the observations it directly identifies the true labels via some aggregation rules examples include majority voting and the weighted majority voting that takes worker reliability into consideration in this paper we present formulation of the most popular majority voting estimator to improve its discriminative ability and further present bayesian generalization that conjoins the advantages of both generative and discriminative approaches the majority voting directly maximizes the margin between the aggregated score of potential true label and that of any alternative label and the bayesian model consists of flexible probabilistic model to generate the noisy observations by conditioning on the unknown true labels we adopt the same approach as the maximum entropy estimator can be understood as dual of the mle of probabilistic model classical estimator to build the probabilistic model by considering worker confusion matrices though many other generative models are also possible then we strongly couple the generative model and by formulating joint learning problem under the regularized bayesian inference regbayes framework where the posterior regularization enforces large margin between the potential true label and any alternative label naturally our bayesian model covers both the estimator and as special 
cases by setting the regularization parameter to its extreme values or we investigate two choices on defining the posterior regularization an averaging model with variational inference algorithm and gibbs model with gibbs sampler under data augmentation formulation the averaging version can be seen as an extension to the mle learner of model experiments on real datasets suggest that learning can significantly improve the accuracy of majority voting and that our bayesian estimators are competitive often achieving better results than estimators on true label estimation tasks preliminary we consider the label aggregation problem with dataset consisting of items pictures or paragraphs each item has an unknown true label yi where the task ti is to label item in crowdsourcing we have workers assigning labels to these items each worker may only label part of the dataset let ii denote the workers who have done task ti we use xij to denote the label of ti provided by worker xi to denote the labels provided to task ti and is the collection of these worker labels which is an incomplete matrix the goal of is to estimate the true labels of items from the noisy observations majority voting estimator majority voting mv is arguably the simplest method it posits that for every task the true label is always most commonly given thus it selects the most frequent label for each task as its true label by solving the problem xij argmax where is an indicator function it equals to whenever the predicate is true otherwise it equals to previous work has extended this method to weighted majority voting wmv by putting different weights on workers to measure worker reliability estimator the method of dawid and skene is generative approach by considering worker confusability it posits that the performance of worker is consistent across different tasks as measured by confusion matrix whose diagonal entries denote the probability of assigning correct labels while offdiagonal entries denote the probability of making specific mistakes to label items in one category as another formally let φj be the confusion matrix of worker then φjkd denotes the probability that worker assigns label to an item whose true label is under the basic assumption that workers finish each task independently the likelihood of observed labels can be expressed as φjkd njkd φjkd njkd nijkd where xij yi and njkd but being labeled to by worker pm njkd is the number of tasks with true label the unknown labels and parameters can be estimated by estimation mle argmaxy log via an em algorithm that iteratively updates the true labels and the parameters the learning procedure is often initialized by majority voting to avoid bad local optima if we assume some structure of the confusion matrix various variants of the ds estimator have been studied including the homogenous ds model and the ds model we can also put prior over worker confusion matrices and transform the inference into standard inference problem in graphical models recently spectral methods have also been applied to better initialize the ds model majority voting majority voting is discriminative model that directly finds the most likely label for each item in this section we present majority voting novel extension of weighted majority voting with new notion of margin named crowdsourcing margin geometric interpretation of crowdsourcing margin let xi be vector with the element equaling to ii xij then the estimation of the vanilla majority voting in eq can be formulated as finding solutions yi that 
satisfy the following constraints xi yi xi where is the vector and is the aggregated score of the potential true label for task ti by using the vector the aggregated score has an intuitive interpretation it denotes the number of figure geometric interpretation of the crowdworkers who have labeled ti as class sourcing margin apparently the vector treats all workers equally which may be unrealistic in practice due to the various backgrounds of the workers by simply choosing what the majority of workers agree on the vanilla mv is prone to errors when many workers give low quality labels one way to tackle this problem is to take worker reliability into consideration let denote the worker weights when these values are known we can get the aggregated score xi of weighted majority voting wmv and estimate the true labels by the rule xi thus reliable workers contribute more to the decisions geometrically xi is point in the space for each task ti the aggregated score xi measures the distance up to constant scaling from this point to the hyperplane so the mv estimator actually finds point that has the largest distance to that hyperplane for each task and the decision boundary of majority voting is another hyperplane which separates the point xi from the other points xi by introducing the worker weights we relax the constraint of the vector to allow for more flexible decision boundaries all the possible decision boundaries with the same orientation are equivalent inspired by the generalized notion of margin in svm we define the crowdsourcing margin as the minimal difference between the aggregated score of the potential true label and the aggregated scores of other alternative labels then one reasonable choice of the best hyperplane is the one that represents the largest margin between the potential true label and other alternatives fig provides an illustration of the crowdsourcing margin for wmv with and where each axis represents the label of worker assume that both workers provide labels and to item then the vectors xi are three points in the plane given the worker weights the estimated label should be since xi has the largest distance to line line and line are two boundaries that separate xi and other points the margin is the distance between them in this case xi and xi are support vectors that decide the margin majority voting estimator let be the minimum margin between the potential true label and all other alternatives we define the majority voting as solving the constrained optimization problem to estimate the true labels and weights inf where gi xi yi xi and yi and in practice the worker labels are often linearly inseparable by single hyperplane therefore we relax the hard constraints the offset is canceled out in the margin constraints by introducing slack variables ξi one for each task and define the majority voting as inf ξi ξi ξi where is positive regularization parameter and ξi is the for task ti the value of ξi reflects the difficulty of task ti small ξi suggests large discriminant margin indicating that the task is easy with rare chance to make mistakes while large ξi suggests that the task is hard with higher chance to make mistakes note that our majority voting is significantly different from the unsupervised svms or clustering which aims to assign cluster labels to the data points by maximizing some different notion of margin with balance constraints to avoid trivial solutions our does not need such balance constraints albeit not jointly convex problem can be solved by iteratively 
updating and to find local pm pd optimum for the solution can be derived as ωi gi by the fact that the subproblem is convex the parameters are obtained by solving the dual problem xx sup ωid which is exactly the qp dual problem in standard svm so it can be efficiently solved by welldeveloped svm solvers like libsvm for updating we define max and then it is weighted majority voting with margin gap constraint argmax max yi overall the algorithm is iterative weighted majority voting comparing with the iterative weighted majority voting iwmv which tends to maximize the expected gap of the aggregated scores under the homogenous ds model our directly maximizes the data specified margin without further assumption on data model empirically as we shall see our could have more powerful discriminative ability with better accuracy than iwmv bayesian estimator with the intuitive and simple principle we now present more sophisticated bayesian estimator which conjoins the discriminative ability of and the flexibility of the generative ds estimator though slightly more complicated in learning and inference the bayesian models retain the intuitive simplicity of and the flexibility of ds as explained below model definition we adopt the same ds model to generate observations conditioned on confusion matrices with the full likelihood in eq we further impose prior for bayesian inference assuming that the true labels are given we aim to get the target posterior which can be obtained by solving an optimization problem inf where kl eq log measures the kl divergence between desired posterior and the original bayesian posterior and is the prior often factorized as as we shall see this bayesian ds estimator often leads to better performance than the vanilla ds then we explore the ideas of regularized bayesian inference regbayes to incorporate majority voting constraints as posterior regularization on problem and define the bayesian estimator denoted by crowdsvm as solving inf ξi ξi eq ξi where is the probabilistic simplex and we take expectation over to define the margin constraints such posterior constraints will influence the estimates of and to get better aggregation as we shall see we use dirichlet prior on worker confusion matrices φmk dir and spherical gaussian prior on vi by absorbing the slack variables crowdsvm solves the equivalent unconstrained problem inf rm where rm pm gi is the posterior regularization remark from the above definition we can see that both the bayesian ds estimator and the maxmargin majority voting are special cases of crowdsvm specifically when it is equivalent to the ds model if we set for some positive parameter then when crowdsvm reduces to the majority voting variational inference since it is intractable to directly solve problem or we introduce the structured assumption on the posterior and solve the problem by alternating minimization as outlined in alg the algorithm iteratively performs the following steps until local optimum is reached algorithm the crowdsvm algorithm initialize by majority voting while not converge do for each worker and category φjk dir njk solve the dual problem for each item argmaxyi yi xi end infer fixing the distribution and the true labels the problem in eq turns to standard bayesian inference problem with the solution since the prior is dirichlet distribution the inferred distribution is also dirichlet φjk dir njk where njk is vector with element being njkd infer and solve for fixing the distribution and the true labels we optimize eq over which is can 
derive the optimal solution also convex we exp ωi gi where ωi are lagrange multipliers with the normal prior vi the posterior is normal distribution vi whose mean pm pd is ωid then the parameters are obtained by solving the dual problem xx ωid sup which is same as the problem in majority voting infer fixing the distributions of and at their optimum we find by solving problem to make the prediction more efficient we approximate the distribution by dirac delta mass where is the mean of then since all tasks are independent we can derive the discriminant function of yi as yi xi log xi yi max gi where is the mean of then we can make predictions by maximize this function apparently the discriminant function represents strong coupling between the generative model and the discriminative margin constraints therefore crowdsvm jointly considers these two factors when estimating true labels we also note that the estimation rule used here reduces to the rule of by simply setting gibbs crowdsvm estimator crowdsvm adopts an averaging model to define the posterior constraints in problem here we further provide an alternative strategy which leads to full bayesian model with gibbs sampler the resulting does not need to make the assumption model definition suppose the target posterior is given we perform the majority voting by drawing random sample this leads to the crowdsourcing max gi which is function of since are random we define the overall as the expectation over that is eq due to the convexity of max function the expected loss is in fact an upper bound of the average loss rm differing from crowdsvm we also treat the hidden true labels as random variables with uniform prior then we define as solving the problem inf eq ζisi where ζid gi si ζid and the factor is introduced for simplicity data augmentation in order to build an efficient gibbs sampler for this problem we derive the posterior distribution with the data augmentation for the regularization term we let yi exp ζisi to represent the regularizer according to the equality yi yi λi dλi where yi λi exp λi cζisi is unnormalized joint distribution of yi and the augmented variable λi the posterior of gibbscrowdsvm can be expressed as the marginal of higher dimensional distribution dλ where xi yi yi λi putting the last two terms together we can view as standard bayesian posterior but with the unnormalized likelihood pe xi λi yi xi yi yi λi which jointly considers the noisy observations and the large margin discrimination between the potential true labels and alternatives posterior inference with the augmented representation we can do gibbs sampling to infer the posterior distribution and thus by discarding the conditional distributions for are derived in appendix note that when sample from the inverse gaussian distribution fast sampling algorithm can be applied with time complexity and for the hidden variables we initially set them as the results of majority voting after removing samples we use their most frequent values of as the final outputs experiments we now present experimental results to demonstrate the strong discriminative ability of majority voting and the promise of our bayesian models by comparing with various strong competitors on multiple real datasets datasets and setups we use four real world crowd labeling datasets as summarized in table web search workers are asked to rate set of pairs on relevance rating scale from to each task is labeled by workers on average in total labels are collected age it consists of labels of age estimations 
for face images. Each image was labeled by workers, and workers were involved in these tasks in total; the final estimations are discretized into bins. Bluebirds: it consists of bluebird pictures; there are breeds among all the images, and each image is labeled by all workers. Flowers: it contains binary labels for a dataset of flower pictures; each worker is asked to answer whether the flower in a picture is a peach flower, and workers participate in these tasks. [Table: datasets overview — the number of labels, items, and workers for the Web Search, Age, Bluebirds, and Flowers datasets.]

We compare max-margin majority voting as well as its Bayesian extensions, CrowdSVM and Gibbs-CrowdSVM, with various baselines, including majority voting (MV), iterative weighted majority voting (IWMV), the DS estimator, and the minimax entropy estimator. For the entropy estimator we use the implementation provided by the authors and report the performance of both its multiclass version and its ordinal version. All estimators that require iterative updating are initialized by majority voting to avoid bad local minima. All experiments were conducted on a PC with an Intel Core CPU and RAM.

Model selection. Due to the special property of crowdsourcing, we cannot simply split the training data into multiple folds to tune the hyperparameters using accuracy as the selection criterion, which may bias the selection. Instead, we adopt the likelihood as the criterion to select parameters, which is indirectly related to our evaluation criterion (accuracy). Specifically, we test multiple candidate values and select the one that produces the model with the maximal likelihood on the given dataset. This method allows us to select a model without any prior knowledge of the true labels. For the special case of max-margin majority voting, we fix the learned true labels after training the model with certain parameters and learn the confusion matrices that optimize the full DS likelihood. Note that this strategy is not suitable for CrowdSVM, because it uses the marginal likelihood to select the model and ignores the label information through which the effect of the constraints is passed; if we used it on CrowdSVM, it would tend to optimize the generative component without considering the discriminative constraints, resulting in a trivial solution for model selection.

Experimental results. We first test our estimators on the task of estimating true labels. For CrowdSVM we fix the hyperparameters in all experiments, since we find that the results are insensitive to them; for CrowdSVM and Gibbs-CrowdSVM, the regularization parameters are selected from candidate sets by the method described above. As for Gibbs-CrowdSVM, we generate samples in each run and discard the first portion as burn-in, which is sufficiently large to reach convergence of the likelihood; the reported error rate is the average over runs. The table below presents the error rates of the various estimators. [Table: error rates of the different estimators on the four datasets.] We group the comparisons into three parts. (i) MV, IWMV, and max-margin majority voting are purely discriminative estimators; ours produces consistently lower error rates on all four datasets compared with vanilla MV and IWMV, which shows the effectiveness of the max-margin principle for crowdsourcing. (ii) This part analyzes the effects of the prior and regularization on improving the DS model. The DS estimator with a Dirichlet prior is better than the vanilla DS model on the two larger datasets; furthermore, CrowdSVM consistently improves its performance by considering the constraints, again demonstrating the effectiveness of max-margin learning. (iii) This part compares our estimators to the minimax entropy estimators. Gibbs-CrowdSVM performs better than CrowdSVM on the Age and Flowers datasets, while worse on the small Bluebirds dataset, and it is comparable to the minimax entropy estimators, sometimes better, with faster running speed, as shown in the figure and explained below. Note that we only test the ordinal entropy estimator on the two ordinal datasets, since this method is specifically designed for ordinal labels, though it is not always effective.

The figure summarizes the training time and error rates after each iteration for all estimators on the largest dataset. [Figure: error rates per iteration of various estimators on the Web Search dataset.] It shows that the discriminative methods IWMV and max-margin majority voting run fast but converge to high error rates. Compared to the minimax entropy estimator, CrowdSVM is computationally more efficient and also converges to a lower error rate. The Gibbs variant runs slower than CrowdSVM since it needs to compute matrix inversions. The performance of the DS estimator seems mediocre: its estimation error rate is large and slowly increases when it runs longer, perhaps partly because the DS estimator cannot make good use of the initial knowledge provided by majority voting.

We further investigate the effectiveness of the generative component and the discriminative component of CrowdSVM, again on the largest dataset. For the generative part, we compare CrowdSVM with DS and max-margin majority voting, and compare the negative log-likelihoods (NLL) of these models. [Figure: NLLs and error rates when separately testing the generative and discriminative components.] For max-margin majority voting we fix its estimated true labels and find the confusion matrices that optimize the likelihood. The results show that CrowdSVM achieves a lower NLL than DS. This suggests that, by considering the constraints, CrowdSVM finds a better solution for the true labels as well as the confusion matrices than that found by the original EM algorithm. For the discriminative part, we use the mean of the worker weights to estimate the true labels by weighted majority voting and report the error rates. Apparently the weights learned by CrowdSVM are also better than those learned by the other MV estimators. Overall, these results suggest that CrowdSVM achieves a good balance between generative modeling and discriminative prediction.

Conclusions and future work. We present a simple and intuitive max-margin majority voting estimator for learning from crowds, as well as its Bayesian extension, which conjoins generative modeling and discriminative prediction by formulating the joint learning as a regularized Bayesian inference problem. Our methods naturally cover the classical DS estimator, and empirical results demonstrate their effectiveness. Our model is flexible enough to fit specific, complicated application scenarios. One appealing feature of Bayesian methods is sequential updating: we can extend our Bayesian estimators to the online setting where the crowdsourcing labels are collected in a stream and more tasks are distributed; we have some preliminary results, as shown in the appendix. It would also be interesting to investigate active learning, such as selecting reliable workers to reduce costs.

Acknowledgments. The work was supported by the National Basic Research Program of China, the National NSF of China, the Tsinghua National Laboratory for Information Science and Technology Big Data Initiative, and the Tsinghua Initiative Scientific Research Program.

References: Carlson, Betteridge, Kisiel, Settles, Hruschka Jr., and Mitchell. Toward an architecture for language learning. In AAAI. Chang and
lin libsvm library for support vector machines acm transactions on intelligent systems and technology chen zhu and zhang robust bayesian clustering in nips crammer and singer on the algorithmic implementation of multiclass vector machines jmlr dawid and skene maximum likelihood estimation of observer using the em algorithm applied statistics pages phillips and schapire maximum entropy density estimation with generalized regularization and an application to species distribution modeling jmlr ganchev gillenwater and taskar posterior regularization for structured latent variable models jmlr otto liu han and jain demographic estimation from face images human machine performance ieee trans on pami jagabathula subramanian and venkataraman worker filtering in crowdsourcing in nips karger oh and shah iterative learning for reliable crowdsourcing systems in nips li and yu error rate bounds and iterative weighted majority voting for crowdsourcing arxiv preprint liu peng and ihler variational inference for crowdsourcing in nips michael schucany and haas generating random variates using transformations with multiple roots the american statistician polson and scott data augmentation for support vector machines bayesian analysis raykar yu zhao valadez florin bogoni and moy learning from crowds jmlr shi and zhu online bayesian learning in icml snow connor jurafsky and ng cheap and is it good evaluating annotations for natural language tasks in emnlp tian and zhu uncovering the latent structures of crowd labeling in pakdd welinder branson perona and belongie the multidimensional wisdom of crowds in nips whitehill wu bergsma movellan and ruvolo whose vote should count more optimal integration of labels from labelers of unknown expertise in nips xu and schuurmans unsupervised and support vector machines in aaai zaidan and crowdsourcing translation professional quality from in acl zhang chen zhou and jordan spectral methods meet em provably optimal algorithm for crowdsourcing in nips zhou basu mao and platt learning from the wisdom of crowds by minimax entropy in nips zhou liu platt and meek aggregating ordinal labels from crowds by minimax conditional entropy in icml zhu chen perkins and zhang gibbs topic models with data augmentation jmlr zhu chen and xing bayesian inference with posterior regularization and applications to infinite latent svms jmlr 
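As a concrete companion to the crowdsourcing estimators above, here is a minimal Python sketch of the two classical baselines: majority voting and the Dawid–Skene EM estimator with per-worker confusion matrices, initialized from majority voting as recommended in the paper. This is an illustrative re-implementation under simplifying assumptions (a dense item-by-worker label matrix with -1 marking missing entries, additive smoothing of the counts), not the authors' code, and it omits the max-margin and Bayesian extensions.

```python
import numpy as np

def majority_vote(L, n_classes):
    """L: (n_items, n_workers) labels in {0, ..., K-1}, with -1 for missing.
    Returns the most frequent label per item (ties broken by the smallest label)."""
    counts = np.stack([(L == d).sum(axis=1) for d in range(n_classes)], axis=1)
    return counts.argmax(axis=1)

def dawid_skene_em(L, n_classes, n_iters=50, smoothing=0.01):
    """EM for the Dawid-Skene model: phi[j, k, d] is the probability that worker j
    answers d when the true label is k; q[i, k] is the posterior over item i's label."""
    n_items, n_workers = L.shape
    # initialize the soft labels from majority voting to avoid bad local optima
    q = np.full((n_items, n_classes), smoothing)
    q[np.arange(n_items), majority_vote(L, n_classes)] = 1.0
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # M-step: class prior and worker confusion matrices (smoothed counts)
        prior = q.mean(axis=0) + 1e-12
        phi = np.full((n_workers, n_classes, n_classes), smoothing)
        for j in range(n_workers):
            for d in range(n_classes):
                phi[j, :, d] += q[L[:, j] == d].sum(axis=0)
        phi /= phi.sum(axis=2, keepdims=True)
        # E-step: posterior over true labels given the observed worker answers
        log_q = np.tile(np.log(prior), (n_items, 1))
        for j in range(n_workers):
            obs = L[:, j] >= 0
            log_q[obs] += np.log(phi[j][:, L[obs, j]]).T
        q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
    return q.argmax(axis=1), q, phi
```

Initializing from majority voting and smoothing the counts mirrors the practice described above; replacing the smoothing constant by a fixed Dirichlet prior over the confusion matrices gives the "DS with prior" variant compared in the experiments.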
learning with incremental iterative regularization lorenzo rosasco dibris univ genova italy lcsl iit mit usa lrosasco silvia villa lcsl iit mit usa abstract within statistical learning setting we propose and study an iterative regularization algorithm for least squares defined by an incremental gradient method in particular we show that if all other parameters are fixed priori the number of passes over the data epochs acts as regularization parameter and prove strong universal consistency almost sure convergence of the risk as well as sharp finite sample bounds for the iterates our results are step towards understanding the effect of multiple epochs in stochastic gradient techniques in machine learning and rely on integrating statistical and optimization results introduction machine learning applications often require efficient statistical procedures to process potentially massive amount of high dimensional data motivated by such applications the broad objective of our study is designing learning procedures with optimal statistical properties and at the same time computational complexities proportional to the generalization properties allowed by the data rather than their raw amount we focus on iterative regularization as viable approach towards this goal the key observation behind these techniques is that iterative optimization schemes applied to scattered noisy data exhibit property in the sense that early termination earlystop of the iterative process has regularizing effect indeed iterative regularization algorithms are classical in inverse problems and have been recently considered in machine learning where they have been proved to achieve optimal learning bounds matching those of variational regularization schemes such as tikhonov in this paper we consider an iterative regularization algorithm for the square loss based on recursive procedure processing one training set point at each iteration methods of the latter form often broadly referred to as online learning algorithms have become standard in the processing of large because of their low iteration cost and good practical performance theoretical studies for this class of algorithms have been developed within different frameworks in composite and stochastic optimization in online learning sequential prediction and finally in statistical learning the latter is the setting of interest in this paper where we aim at developing an analysis keeping into account simultaneously both statistical and computational aspects to place our contribution in context it is useful to emphasize the role of regularization and different ways in which it can be incorporated in online learning algorithms the key idea of regularization is that controlling the complexity of solution can help avoiding overfitting and ensure stability and generalization classically regularization is achieved penalizing the objective function with some suitable functional or minimizing the risk on restricted space of possible solutions model selection is then performed to determine the amount of regularization suitable for the data at hand more recently there has been an interest in alternative possibly more efficient ways to incorporate regularization we mention in particular where there is no explicit regularization by penalization and the of an iterative procedure is shown to act as regularization parameter here for each fixed each data point is processed once but multiple passes are typically needed to perform model selection that is to pick the best we also mention where 
an interesting adaptive approach is proposed which seemingly avoid model selection under certain assumptions in this paper we consider different regularization strategy widely used in practice namely we consider no explicit penalization fix the step size priori and analyze the effect of the number of passes over the data which becomes the only free parameter to avoid overfitting regularize the associated regularization strategy that we dub incremental iterative regularization is hence based on early stopping the latter is well known trick for example in training large neural networks and is known to perform very well in practice interestingly early stopping with the square loss has been shown to be related to boosting see also our goal here is to provide theoretical understanding of the generalization property of the above heuristic for techniques towards this end we analyze the behavior of both the excess risk and the iterates themselves for the latter we obtain sharp finite sample bounds matching those for tikhonov regularization in the same setting universal consistency and finite sample bounds for the excess risk can then be easily derived albeit possibly suboptimal our results are developed in capacity independent setting that is under no conditions on the covering or entropy numbers in this sense our analysis is worst case and dimension free to the best of our knowledge the analysis in the paper is the first theoretical study of regularization by early stopping in algorithms and thus first step towards understanding the effect of multiple passes of stochastic gradient for risk minimization the rest of the paper is organized as follows in section we describe the setting and the main assumptions and in section we state the main results discuss them and provide the main elements of the proof which is deferred to the supplementary material in section we present some experimental results on real and synthetic datasets notation we denote by and given normed space and linear operators ai ai for every their composition am qi will be denoted as ai by convention if we set ai where is the identity of the operator pmnorm will be denoted by and the norm by khs also if we set ai setting and assumptions we first describe the setting we consider and then introduce and discuss the main assumptions that will hold throughout the paper we build on ideas proposed in and further developed in series of follow up works unlike these papers where reproducing kernel hilbert space rkhs setting is considered here we consider formulation within an abstract hilbert space as discussed in the appendix results in rkhs can be recovered as special case the formulation we consider is close to the setting of functional regression and reduces to standard linear regression if is finite dimensional see appendix let be separable hilbert space with inner product and norm denoted by and let be pair of random variables on probability space with values in and respectively denote by the distribution of by the marginal measure on and by the conditional measure on given considering the square loss function the problem under study is the minimizazion of the risk inf hw xih provided the distribution is fixed but known only through training set xn yn that is realization of independent identical copies of in the following we measure the quality of an approximate solution an estimator considering the excess risk inf if the set of solutions of problem is non empty that is argminh we also consider where argmin kwkh more precisely we are 
interested in deriving almost sure convergence results and finite sample bounds on the above error measures this requires making some assumptions that we discuss next we make throughout the following basic assumption assumption there exist and such that surely and surely the above assumption is fairly standard the boundness assumption on the output is satisfied in classification see appendix and can be easily relaxed see the boundness assumption on the input can also be relaxed but the resulting analysis is more involved we omit these developments for the sake of clarity it is well known that see under assumption the risk is convex and continuous functional on the space of functions with norm kf the minimizer of the risk on is the regression function for every by considering problem we are restricting the search for solution to linear functions note that since is in general infinite dimensional the minimum in might not be achieved indeed bounds on the error measures in and depend on if and how well the regression function can be linearly approximated the following assumption quantifies in precise way such requirement assumption consider the space with hw xi and let be its closure in moreover consider the operator lf hx define let and assume that such that lr the above assumption is standard in the context of rkhs since its statement is somewhat technical and we provide formulation in hilbert space with respect to the usual rkhs setting we further comment on its interpretation we begin noting that is the space of linear functions indexed by and is proper subspace of if assumption holds moreover under the same assumption it is easy to see that the operator is linear positive definite and trace class hence compact so that its fractional power in is well defined most importantly the following equality which is analogous to mercer theorem can be shown fairly easily this last observation allows to provide an interpretation of condition indeed given for condition states that belongs to rather than its closure in this case problem has at least one solution and the set in is not empty vice versa if then and is if the condition is stronger than for for the subspaces of lr are nested subspaces of for increasing iterative incremental regularized learning the learning algorithm we consider is defined by the following iteration let and consider the sequence generated through the following procedure given define where is obtained at the end of one cycle namely as the last step of the recursion xi yi xi if then the regression function does not have best linear approximation since and in particular for we are making no assumption intuitively for the condition quantifies how far is from that is to be well approximated by linear function each cycle called an epoch corresponds to one pass over data the above iteration can be seen as the incremental gradient method for the minimization of the empirical risk corresponding to that is the functional hw xi ih yi see also section indeed there is vast literature on how the iterations can be used to minimize the empirical risk unlike these studies in this paper we are interested in how the iterations can be used to approximately minimize the risk the key idea is that while is close to minimizer of the empirical risk when is sufficiently large good approximate solution of problem can be found by terminating the iterations earlier early stopping the analysis in the next few sections grounds theoretically this latter intuition remark representer theorem let be rkhs of 
functions from to defined by kernel letp then the iteration after epochs of the algorithm in can be written as kxk for suitable coefficients rn updated as follows cnt cit cit cit xi xj cit yi early stopping for incremental iterative regularization in this section we present and discuss the main results of the paper together with sketch of the proof the complete proofs can be found in appendix we first present convergence results and then finite sample bounds for the quantities in and theorem in the setting of section let assumption hold let then the following hold if we choose stopping rule such that log inf surely lim and then lim lim ii suppose additionally that the set of minimizers of is nonempty and let be defined as in if we choose stopping rule satisfying the conditions in then kh surely the above result shows that for an priori fixed consistency is achieved computing suitable number of iterations of algorithm given points the number of required iterations tends to infinity as the number of available training points increases condition can be interpreted as an early stopping rule since it requires the number of epochs not to grow too fast in particular this excludes the choice for all namely considering only one pass over the data in the following remark we show that if we let to depend on the length of one epoch convergence is recovered also for one pass remark recovering stochastic gradient descent algorithm in for is stochastic gradient descent one pass over sequence of data with stepsize choosing with in algorithm we can derive almost sure convergence of inf as relying on similar proof to that of theorem to derive finite sample bounds further assumptions are needed indeed we will see that the behavior of the bias of the estimator depends on the smoothness assumption we are in position to state our main result giving finite sample bound theorem finite sample bounds in in the setting of section let for every suppose that assumption is satisfied for some then the set of minimizers of is nonempty and in is well defined moreover the following hold there exists such that for every with probability greater than kh log ii for the stopping rule with probability greater than kh log the dependence on suggests that big which corresponds to small helps in decreasing the sample error but increases the approximation error next we present the result for the excess risk we consider only the attainable case that is the case in assumption the case is deferred to appendix since both the proof and the statement are conceptually similar to the attainable case theorem finite sample for the risk attainable case in the setting of section let assumptions holds and let let assumption be satisfied for some then the following hold for every with probability greater than log inf ii for the stopping rule with probability greater than inf log equations and arise from form of decomposition of the error choosing the number of epochs that optimize the bounds in and we derive priori stopping rules and corresponding bounds and again these results confirm that the number of epochs acts as regularization parameter and the best choices following from equations and suggest multiple passes over the data to be beneficial in both cases the stopping rule depends on the smoothness parameter which is typically unknown and cross validation is often used in practice following it is possible to show that this procedure allows to adaptively achieve the same convergence rate as in discussion in theorem the obtained bound can be 
compared to known lower bounds as well as to previous results for least squares algorithms obtained under assumption minimax lower bounds and individual lower bounds suggest that for is the optimal bound for the in this sense theorem provides sharp bounds on the iterates bounds can be improved only under stronger assumptions on the covering numbers or on the eigenvalues of this question is left for future work the lower bounds for the excess risk are of the form and in this case the results in theorems and are not sharp our results can be contrasted with online learning algorithms that use in recent manuscript it has been proved that this is indeed the minimax lower bound blanchard personal communication as regularization parameter optimal capacity independent bounds are obtained in see also and indeed such results can be further improved considering capacity assumptions see and references therein interestingly our results can also be contrasted with non incremental iterative regularization approaches our results show that incremental iterative regularization with distribution independent behaves as batch gradient descent at least in terms of iterates convergence proving advantages of incremental regularization over the batch one is an interesting future research direction finally we note that optimal capacity independent and dependent bounds are known for several least squares algorithms including tikhonov regularization see and spectral filtering methods these algorithms are essentially equivalent from statistical perspective but different from computational perspective elements of the proof the proofs of the main results are based on suitable decomposition of the error to be estimated as the sum of two quantities that can be interpreted as sample and an approximation error respectively bounds on these two terms are then provided the main technical contribution of the paper is the sample error bound the difficulty in proving this result is due to multiple passes over the data which induce statistical dependencies in the iterates error decomposition we consider an auxiliary iteration wt which is the expectation of the iterations and starting from with more explicitly the considered iteration generates according to unt where unt is given by uit uit wt huit xih if we let be the linear map hw which is bounded by assumption then it is that inf ks ks swt kswt wt wt inf under in this paper we refer to the gap between the empirical and the expected iterates wt kh as the sample error and to wt inf as the approximation error similarly if as defined in exists using the triangle inequality we obtain kh wt kh kwt kh proof main steps in the setting of section we summarize the key steps to derive general bound for the sample error the proof of the behavior of the approximation error is more standard the bound on the sample error is derived through many technical lemmas and uses concentration inequalities applied to martingales the crucial point is the inequality in step below its complete derivation is reported in appendix we introduce the additional linear operators and for every sx sx hw xi and tx tx sx pn moreover set txi we are now ready to state the main steps of the proof sample error bound step to step see proposition find equivalent formulations for the sequences and wt yj wt awt where tx tx tx tx tx yj step see lemma use the formulation obtained in step to derive the following recursive inequality wt with wk wk pn yi step see lemmas and initialize prove that ki and derive from step that wt kh 
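the displayed error decomposition above is garbled by extraction; one plausible reconstruction, writing \(\hat w_t\) for the empirical iterate, \(w_t\) for the auxiliary (expected) iterate and \(S\) for the map \(w \mapsto \langle w, \cdot\rangle\), is

$$
\Big(\mathcal{E}(\hat w_t)-\inf_{\mathcal F}\mathcal{E}\Big)^{1/2}
\;\le\; \|S\hat w_t - S w_t\| \;+\; \Big(\mathcal{E}(w_t)-\inf_{\mathcal F}\mathcal{E}\Big)^{1/2},
\qquad
\|\hat w_t - w^{\dagger}\|_{\mathcal H} \;\le\; \|\hat w_t - w_t\|_{\mathcal H} + \|w_t - w^{\dagger}\|_{\mathcal H},
$$

so that the first terms on the right are the sample errors controlled by the recursive inequality and concentration steps, and the second terms are the approximation errors controlled in the final step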
kt ak kwk kh yi kb step see lemma let assumption hold for some and prove that max if kwt kh if step see lemma and proposition prove that with probability greater than following inequalities hold akhs log hs log bkh log yi xi the log step approximation error bound see theorem prove that if assumption holds for some then wt inf moreover if assumption holds with then kwt kh and if assumption holds for some then kwt kh step plug the sample and approximation error bounds obtained in step and step in and respectively experiments synthetic data we consider scalar linear regression problem with random design the input points xi are uniformly distributed in and the output points are obtained as yi xi ni where ni is gaussian noise with zero mean and standard deviation and is dictionary of functions whose element is cos in figure we plot the test error for with in and in the plots show that the number of the epochs acts as regularization parameter and that early stopping is beneficial to achieve better test error moreover according to the theory the experiments suggest that the number of performed epochs increases if the number of available training points increases real data we tested the kernelized version of our algorithm see remark and appendix on the adult and breast cancer wisconsin diagnostic available at http adult and breast cancer wisconsin diagnostic uci repository test error test error iterations iterations figure test error as function of the number of iterations in and total number of iterations of iir is corresponding to epochs in and the total number of epochs is the best test error is obtained for epochs in and for epochs in datasets we considered subset of adult with the results are shown in figure comparison of the test errors obtained with the kernelized version of the method proposed in this paper kernel incremental iterative regularization kiir kernel iterative regularization kir that is the kernelized version of gradient descent and kernel ridge regression krr is reported in table the results show that the test error of kiir is comparable to that of kir and krr validation error training error error iterations figure training orange and validation blue classification errors obtained by kiir on the breast cancer dataset as function of the number of iterations the test error increases after certain number of iterations while the training error is decreasing with the number of iterations table test error comparison on real datasets median values over trials dataset cpusmall adult breast cancer ntr error measure rmse class err class err kiir krr kir acknowledgments this material is based upon work supported by cbmm funded by nsf stc award and by the miur firb project villa is member of gnampa of the istituto nazionale di alta matematica indam references bach and dieuleveut stochastic approximation with large step sizes bartlett and traskin adaboost is consistent mach learn bauer pereverzev and rosasco on regularization algorithms in learning theory complexity bertsekas new class of incremental gradient methods for least squares problems siam blanchard and optimal learning rates for kernel conjugate gradient regression in advances in neural inf proc systems nips pages bottou and bousquet the tradeoffs of large scale learning in suvrit sra sebastian nowozin and stephen wright editors optimization for machine learning pages mit press buhlmann and yu boosting with the loss regression and classification amer stat caponnetto and de vito optimal rates for regularized algorithm found comput 
caponnetto and yao based adaptation for regularization operators in learning theory anal conconi and gentile on the generalization ability of learning algorithms ieee trans information theory and lugosi prediction learning and games cambridge university press cucker and zhou learning theory an approximation theory viewpoint cambridge university press de vito rosasco caponnetto de giovannini and odone learning from examples as an inverse problem learn de vito rosasco caponnetto piana and verri some properties of regularized kernel methods journal of machine learning research engl hanke and neubauer regularization of inverse problems kluwer huang avron sainath sindhwani and ramabhadran kernel methods match deep neural networks on timit in ieee icassp jiang process consistency for adaboost ann lecun bottou orr and muller efficient backprop in orr and muller editors neural networks tricks of the trade springer nedic and bertsekas incremental subgradient methods for nondifferentiable optimization siam journal on optimization nemirovski juditsky lan and shapiro robust stochastic approximation approach to stochastic programming siam nemirovskii the regularization properties of adjoint gradient method in problems ussr computational mathematics and mathematical physics orabona simultaneous model selection and optimization through stochastic learning nips proceedings pinelis optimum bounds for the distributions of martingales in banach spaces ann polyak introduction to optimization optimization software new york ramsay and silverman functional data analysis new york raskutti wainwright and yu early stopping for regression an optimal datadependent stopping rule in in annual allerton conference pages ieee smale and zhou shannon sampling ii connections to learning theory appl comput harmon november smale and zhou learning theory estimates via integral operators and their approximations constr srebro sridharan and tewari optimistic rates for learning with smooth loss steinwart and christmann support vector machines springer steinwart hush and scovel optimal rates for regularized least squares regression in colt and yao online learning as stochastic approximation of regularization paths optimality and convergence ieee trans inform theory vapnik statistical learning theory wiley new york yao rosasco and caponnetto on early stopping in gradient descent learning constr ying and pontil online gradient descent learning algorithms found comput zhang and yu boosting with early stopping convergence and consistency annals of statistics pages 
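before moving on, here is a minimal sketch of the kernelized variant (KIIR) evaluated in the experiments above, maintaining the representer-theorem coefficients of the remark and selecting the number of epochs on a hold-out set; the Gaussian kernel, the step size and the specific coefficient update are my reading of the garbled recursion and should be treated as illustrative assumptions

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kiir_fit(X, y, X_val, y_val, n_epochs=50, step_size=0.1, sigma=1.0):
    """Kernel incremental iterative regularization with hold-out early stopping.
    After each epoch the estimator is f(x) = sum_j c[j] * k(x_j, x)."""
    K = gaussian_kernel(X, X, sigma)
    K_val = gaussian_kernel(X_val, X, sigma)
    n = len(y)
    c = np.zeros(n)
    best_c, best_err = c.copy(), np.inf
    for _ in range(n_epochs):
        for i in range(n):
            c[i] -= step_size * (K[i] @ c - y[i])   # incremental step on point i
        err = np.mean((K_val @ c - y_val) ** 2)     # hold-out error after this epoch
        if err < best_err:
            best_err, best_c = err, c.copy()
    return best_c                                   # coefficients at the early-stopping epoch
```

tracking the hold-out error per epoch mirrors the role of the number of epochs as a regularization parameter discussed above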
halting in random walk kernels karsten borgwardt eth basel switzerland mahito sugiyama isir osaka university japan jst presto mahito abstract random walk kernels measure graph similarity by counting matching walks in two graphs in their most popular form of geometric random walk kernels longer walks of length are downweighted by factor of λk to ensure convergence of the corresponding geometric series we know from the field of link prediction that this downweighting often leads to phenomenon referred to as halting longer walks are downweighted so much that the similarity score is completely dominated by the comparison of walks of length this is kernel between edges and vertices we theoretically show that halting may occur in geometric random walk kernels we also empirically quantify its impact in simulated datasets and popular graph classification benchmark datasets our findings promise to be instrumental in future graph kernel development and applications of random walk kernels introduction over the last decade graph kernels have become popular approach to graph comparison which is at the heart of many machine learning applications in bioinformatics imaging and analysis the first and instance of this family of kernels are random walk kernels which count matching walks in two graphs to quantify their similarity in particular the geometric random walk kernel is often used in applications as baseline comparison method on graph benchmark datasets when developing new graph kernels these geometric random walk kernels assign weight λk to walks of length where is set to be small enough to ensure convergence of the corresponding geometric series related similarity measures have also been employed in link prediction as similarity score between vertices however there is one caveat regarding these approaches similarity scores with exponentially decaying weights tend to suffer from problem referred to as halting they may downweight walks of lengths and more so much so that the similarity score is ultimately completely dominated by walks of length in other words they are almost identical to simple comparison of edges and vertices which ignores any topological information in the graph beyond single edges such simple similarity measure could be computed more efficiently outside the random walk framework therefore halting may affect both the expressivity and efficiency of these similarity scores halting has been conjectured to occur in random walk kernels but its existence in graph kernels has never been theoretically proven or empirically demonstrated our goal in this study is to answer the open question if and when halting occurs in random walk graph kernels we theoretically show that halting may occur in graph kernels and that its extent depends on properties of the graphs being compared section we empirically demonstrate in which simulated datasets and popular graph classification benchmark datasets halting is concern section we conclude by summarizing when halting occurs in practice and how it can be avoided section we believe that our findings will be instrumental in future applications of random walk kernels and the development of novel graph kernels theoretical analysis of halting we theoretically analyze the phenomenon of halting in random walk graph kernels first we review the definition of graph kernels in section we then present our key theoretical result regarding halting in section and clarify the connection to linear kernels on vertex and edge label histograms in section random walk kernels 
let be labeled graph where is the vertex set is the edge set and is mapping with the range of vertex and edge labels for an edge we identify and if is undirected the degree of vertex is denoted by the direct tensor product of two graphs and is defined as follows and and all labels are inherited or and we denote by the adjacency matrix of and denote by and the minimum and maximum degrees of respectively to measure the similarity between graphs and random walk kernels count all pairs of matching walks on and if we assume uniform distribution for the starting and stopping probabilities over the vertices of and the number of matching walks is obtained through the adjacency matrix of the product graph for each the random walk kernel between two graphs and is defined as λl ij with sequence of positive weights λk assuming that the identity matrix its limit is simply called the random walk kernel can be directly computed if weights are the geometric series or λl λl resulting interestingly in the geometric random walk kernel ij kgr ij in the above equation let for some value of then and for any if converges to as is invertible since becomes therefore λl from the equation it is that the geometric series of matrices often called the neumann series converges only if the maximum eigenvalue of denoted by max is strictly smaller than therefore the geometric random walk kernel kgr is only if max there is relationship for the minimum and maximum degrees and of max where is the average of the vertex degrees of or in practice it is sufficient to set the parameter in the inductive learning setting since we do not know priori target graphs that learner will receive in the future should be small enough so max for any pair of unseen graphs otherwise we need to the full kernel matrix and the learner in the transductive setting we are given collection of graphs beforehand we can explicitly compute the upper bound of which is maxg max with the maximum of the maximum eigenvalues over all pairs of graphs halting the geometric random walk kernel kgr is one of the most popular graph kernels as it can take walks of any length into account however the fact that it weights walks of length by the kth power of together with the condition that max immediately tells us that the contribution of longer walks is significantly lowered in kgr if the contribution of walks of length and more to the kernel value is even completely dominated by the contribution of walks of length we would speak of halting it is as if the random walks halt after one step here we analyze under which conditions this halting phenomenon may occur in geometric random walk kernels we obtain the following key theoretical statement by comparing kgr to the random walk kernel theorem let and in the random walk kernel for pair of graphs and kgr where and monotonically converges to as proof let be the degree of vertex in and be the set of neighboring vertices of that is since is the adjacency matrix of the following relationships hold ij ij ij ij from the assumption that we have kgr ij ij it is clear that monotonically goes to when moreover we can normalize by dividing kgr by corollary let and in the random walk kernel for pair of graphs and kgr where and is the average of vertex degrees of proof since we have it follows that theorem can be easily generalized to any random walk kernel corollary let for pair of graphs and we have kgr our results imply that in the geometric random walk kernel kgr the contribution of walks of length longer than diminishes for very 
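as a concrete reference for the quantities above, the following sketch computes the geometric random walk kernel through the direct product graph; vertex labels restrict which pairs of vertices are matched, edge labels are ignored for simplicity, and the function names are chosen here for illustration

```python
import numpy as np

def direct_product_adjacency(A1, A2, labels1=None, labels2=None):
    """Adjacency matrix W_x of the direct (tensor) product of two graphs.
    With vertex labels, only pairs of vertices with matching labels are kept;
    edge labels are ignored here for simplicity."""
    W = np.kron(A1, A2)
    if labels1 is not None and labels2 is not None:
        match = (np.asarray(labels1)[:, None] == np.asarray(labels2)[None, :]).ravel()
        W = W[np.ix_(match, match)]
    return W

def geometric_random_walk_kernel(A1, A2, lam, labels1=None, labels2=None):
    """k_GR(G1, G2) = sum_ij [(I - lam * W_x)^(-1)]_ij with uniform start/stop
    probabilities; the Neumann series converges only if lam * mu_max(W_x) < 1."""
    W = direct_product_adjacency(A1, A2, labels1, labels2).astype(float)
    mu_max = np.max(np.linalg.eigvalsh(W))
    if lam * mu_max >= 1.0:
        raise ValueError("lam too large: geometric series does not converge")
    M = np.linalg.inv(np.eye(W.shape[0]) - lam * W)
    return M.sum()
```

the explicit convergence check makes visible why lam has to shrink as the maximum degree of the product graph grows, which is exactly the regime in which halting sets in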
small choices of this can easily happen in graph data as is by the inverse of the maximum degree of the product graph relationships to linear kernels on label histograms next we clarify the relationship between kgr and basic linear kernels on vertex and edge label histograms we show that halting kgr leads to the convergence of it to such linear kernels given pair of graphs and let us introduce two linear kernels on vertex and edge histograms assume that the range of labels without loss of generality the vertex label histogram of graph is vector fs such that fi for each let and be the vertex label histograms of graphs and respectively the vertex label histogram kernel kvh is then defined as the linear kernel between and kvh fi similarly the edge label histogram is vector gs such that gi for each the edge label histogram kernel keh is defined as the linear kernel between and for respective histograms keh gi finally we introduce the label histogram let hsss be histogram vector such that hijk for each the label histogram kernel kveh is defined as the linear kernel between and for the respective histograms of and kveh hijk notice that kveh keh if vertices are not labeled from the definition of the direct product of graphs we can confirm the following relationships between histogram kernels and the random walk kernel lemma for pair of graphs and their direct product we have kvh kveh ij proof the first equation kvh can be proven from the following kvh we can prove the second equation in similar fashion kveh ij finally let us define new kernel kh kvh λkveh with parameter from lemma since kh holds if and in the random walk kernel we have the following relationship from theorem corollary for pair of graphs and we have kh kgr kh where is given in theorem to summarize our results show that if the parameter of the geometric random walk kernel kgr is small enough random walks halt and kgr reduces to kh which finally converges to kvh this is based on vertex histograms only and completely ignores the topological structure of the graphs experiments we empirically examine the halting phenomenon of the geometric random walk kernel on popular graph benchmark datasets and graph data experimental setup environment we used amazon linux ami release and ran all experiments on single core of ghz intel xeon cpu and gb of memory all kernels were implemented in with eigen library and compiled with gcc datasets we collected five graph classification benchmark enzymes mutag and which are popular in the literature enzymes and are proteins and and mutag are chemical compounds statistics of these datasets are summarized in table in which we also show the maximum of maximum degrees of product graphs maxg for each dataset we consistently used λmax maxg as the upper bound of in geometric random walk kernels in which the gap was less than one order as the lower bound of the average degree of the product graph the lower bound of were and for enzymes mutag and dd respectively kernels we employed the following graph kernels in our experiments we used linear kernels on vertex label histograms kvh edge label histograms keh label histograms kveh and the combination kh introduced in equation we also included gaussian rbf kernel between label histograms denoted as kveh from the family of random walk kernels we used the geometric random walk kernel kgr and the random walk kernel only the and λk was fixed to for all we used number of steps were treated as parameter in iterations section for efficient computation of kgr moreover we 
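the baseline label histogram kernels that k_GR degenerates to under halting can be written in a few lines; a sketch assuming vertex labels are given as a list per graph and edges as (label(u), label(e), label(v)) triples, with names chosen here for illustration

```python
from collections import Counter

def vertex_hist_kernel(vlabels1, vlabels2):
    """k_VH: linear kernel on vertex label histograms."""
    h1, h2 = Counter(vlabels1), Counter(vlabels2)
    return sum(h1[s] * h2[s] for s in h1)

def vertex_edge_hist_kernel(triples1, triples2):
    """k_VEH: linear kernel on histograms of (label(u), label(e), label(v)) triples,
    one triple per edge (both orientations for undirected graphs)."""
    h1, h2 = Counter(triples1), Counter(triples2)
    return sum(h1[t] * h2[t] for t in h1)

def k_h(vlabels1, vlabels2, triples1, triples2, lam):
    """k_H = k_VH + lam * k_VEH, the kernel that k_GR approaches as lam gets small."""
    return (vertex_hist_kernel(vlabels1, vlabels2)
            + lam * vertex_edge_hist_kernel(triples1, triples2))
```

comparing k_GR against k_H computed this way is the empirical test for halting used in the experiments that follow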
employed the subtree kernel denoted as kwl as the graph kernel which has parameter of the number of iterations results on datasets we first compared the geometric random walk kernel kgr to other kernels in graph classification the classification accuracy of each graph kernel was examined by cross validation with multiclass vector classification was used in which the parameter for csvc and parameter if one exists of each kernel were chosen by internal cross validation cv on only the training dataset we repeated the whole experiment times and reported average the code and all datasets are available at http http table statistics of graph datasets and denote the number of vertex and edge labels dataset size classes enzymes mutag comparison of various graph kernels ii kvh keh kveh kh kveh kgr kxk kwl kxk accuracy iii kgr kh accuracy accuracy comparison of kgr with kh label histogram random walk parameter number of steps enzymes comparison of kgr with kh iii kgr kh kvh keh kveh kh kveh kgr kxk kwl kxk accuracy accuracy accuracy comparison of various graph kernels ii label histogram random walk parameter number of steps comparison of kgr with kh kgr kh kvh keh kveh kh kveh kgr kxk kwl label histogram random walk iii kxk accuracy accuracy accuracy comparison of various graph kernels ii parameter number of steps figure classification accuracy on datasets means sd classification accuracies with their standard errors the list of parameters optimized by the internal cv is as follows for the width in the rbf kernel kveh the number of steps in the number of iterations in kwl and λmax in kh and kgr where λmax maxg results are summarized in the left column of figure for enzymes and we present results on mutag and in the supplementary notes as different graph kernels do not give significantly different results overall we could observe two trends first the subtree kernel kwl was the most accurate which confirms results in enzymes percentage percentage percentage figure distribution of where is defined in corollary in datasets accuracy kgr kh kvh kgr kh kvh kgr kh kvh number of added edges accuracy accuracy number of added edges number of added edges figure classification accuracy on datasets means sd second the two random walk kernels kgr and show greater accuracy than linear kernels on edge and vertex histograms which indicates that halting is not occurring in these datasets it is also noteworthy that employing gaussian rbf kernel on histograms leads to clear improvement over linear kernels on all three datasets on enzymes the gaussian kernel is even on par with the random walks in terms of accuracy to investigate the effect of halting in more detail we show the accuracy of kgr and kh in the center column of figure for various choices of from to its upper bound we can clearly see that halting occurs for small which greatly affects the performance of kgr more specifically if it is chosen to be very small smaller than in our datasets the accuracies are close to the baseline kh that ignores the topological structure of graphs however accuracies are much closer to that reached by the kernel if is close to its theoretical maximum of course the theoretical maximum of depends on unseen test data in reality therefore we often have to set conservatively so that we can apply the trained model to any unseen graph data moreover we also investigated the accuracy of the random walk kernel as function of the number of steps of the random walk kernel results are shown in the right column of figure in all datasets accuracy 
improves with each step up to four to five steps the optimal number of steps in and the maximum give similar accuracy levels we also confirmed theorem that conservative choices of or less give the same accuracy as random walk in addition figure shows histograms of where is given in corollary for max for all pairs of graphs in the respective datasets the value can be viewed as the deviation of kgr from kh in percentages although is small on average about percent in enzymes and nci datasets we confirmed the existence of relatively large in the plot more than percent which might cause the difference between kgr and kh results on datasets to empirically study halting we generated graphs from our three benchmark datasets enzymes and and compared the three kernels kgr kh and kvh in each dataset we artificially generated denser graphs by randomly adding edges in which the number of new edges per graph was determined from normal distribution with the mean and the distribution of edge labels was unchanged note that the accuracy of the vertex histogram kernel kvh stays always the same as we only added edges results are plotted in figure there are two key observations first by adding new false edges to the graphs the accuracy levels drop for both the random walk kernel and the histogram kernel however even after adding new false edges per graph they are both still better than classifier that assigns all graphs to the same class accuracy of percent on enzymes and approximately percent on and second the geometric random walk kernel quickly approaches the accuracy level of kh when new edges are added this is strong indicator that halting occurs as graphs become denser the upper bound for gets smaller and the accuracy of the geometric random walk kernel kgr rapidly drops and converges to kh this result confirms corollary which says that both kgr and kh converge to kvh as goes to discussion in this work we show when and where the phenomenon of halting occurs in random walk kernels halting refers to the fact that similarity measures based on counting walks of potentially infinite length often downweight longer walks so much that the similarity score is completely dominated by walks of length degenerating the random walk kernel to simple kernel between edges and vertices while it had been conjectured that this problem may arise in graph kernels we provide the first theoretical proof and empirical demonstration of the occurrence and extent of halting in geometric random walk kernels we show that the difference between geometric random walk kernels and simple edge kernels depends on the maximum degree of the graphs being compared with increasing maximum degree the difference converges to zero we empirically demonstrate on simulated graphs that the comparison of graphs with high maximum degrees suffers from halting on real graph data from popular graph classification benchmark datasets the maximum degree is so low that halting can be avoided if the decaying weight is set close to its theoretical maximum still if is set conservatively to low value to ensure convergence halting can clearly be observed even on unseen test graphs with unknown maximum degrees there is an interesting connection between halting and tottering section weakness of random walk kernels described more than decade ago tottering is the phenomenon that walk of infinite length may go back and forth along the same edge thereby creating an artificially inflated similarity score if two graphs share common edge halting and tottering seem to be 
opposing effects if halting occurs the effect of tottering is reduced and vice versa halting downweights these tottering walks and counteracts the inflation of the similarity scores an interesting point is that the strategies proposed to remove tottering from walk kernels did not lead to clear improvement in classification accuracy while we observed strong negative effect of halting on the classification accuracy in our experiments section this finding stresses the importance of studying halting our theoretical and empirical results have important implications for future applications of random walk kernels first if the geometric random walk kernel is used on graph dataset with known maximum degree should be close to the theoretical maximum second simple baseline kernels based on vertex and edge label histograms should be employed to check empirically if the random walk kernel gives better accuracy results than these baselines third particularly in datasets with high maximum degree we advise using random walk kernel rather than geometric random walk kernel optimizing the length by cross validation on the training dataset led to competitive or superior results compared to the geometric random walk kernel in all of our experiments based on these results and the fact that by definition the kernel does not suffer from halting we recommend using the random walk kernel as comparison method in future studies on novel graph kernels acknowledgments this work was supported by jsps kakenhi grant number ms the alfried krupp von bohlen und kb the snsf starting grant significant pattern mining kb and the marie curie initial training network grant no kb references borgwardt graph kernels phd thesis munich borgwardt ong vishwanathan smola and kriegel protein function prediction via graph kernels bioinformatics suppl brualdi the mutually beneficial relationship of graphs and matrices ams costa and grave fast neighborhood subgraph pairwise distance kernel in proceedings of the international conference on machine learning icml flach and wrobel on graph kernels hardness results and efficient alternatives in learning theory and kernel machines lncs girvan and newman community structure in social and biological networks proceedings of the national academy of sciences pnas kashima tsuda and inokuchi marginalized kernels between labeled graphs in proceedings of the international conference on machine learning icml katz new status index derived from sociometric analysis psychometrika kriege neumann kersting and mutzel explicit versus implicit graph feature maps computational phase transition for walk kernels in proceedings of ieee international conference on data mining icdm and kleinberg the problem for social networks journal of the american society for information science and technology ueda akutsu perret and vert extensions of marginalized graph kernels in proceedings of the international conference on machine learning icml shervashidze and borgwardt fast subtree kernels on graphs in advances in neural information processing systems nips shervashidze schweitzer van leeuwen mehlhorn and borgwardt graph kernels journal of machine learning research vishwanathan schraudolph kondor and borgwardt graph kernels journal of machine learning research 
mcmc for variationally sparse gaussian processes james hensman chicas lancaster university maurizio filippone eurecom alexander de matthews university of cambridge zoubin ghahramani university of cambridge zoubin abstract gaussian process gp models form core part of probabilistic machine learning considerable research effort has been made into attacking three issues with gp models how to compute efficiently when the number of data is large how to approximate the posterior when the likelihood is not gaussian and how to estimate covariance function parameter posteriors this paper simultaneously addresses these using variational approximation to the posterior which is sparse in support of the function but otherwise the result is hybrid sampling scheme which allows for approximation over the function values and covariance parameters simultaneously with efficient computations based on sparse gps code to replicate each experiment in this paper is available at introduction gaussian process models are attractive for machine learning because of their flexible nonparametric nature by combining gp prior with different likelihoods multitude of machine learning tasks can be tackled in probabilistic fashion there are three things to consider when using gp model approximation of the posterior function especially if the likelihood is computation storage and inversion of the covariance matrix which scales poorly in the number of data and estimation or marginalization of the covariance function parameters multitude of approximation schemes have been proposed for efficient computation when the number of data is large early strategies were based on retaining of the data snelson and ghahramani introduced an inducing point approach where the model is augmented with additional variables and titsias used these ideas in variational approach other authors have introduced approximations based on the spectrum of the gp or which exploit specific structures within the covariance matrix or by making unbiased stochastic estimates of key computations in this work we extend the variational inducing point framework which we prefer for general applicability no specific requirements are made of the data or covariance function and because the variational inducing point approach can be shown to minimize the kl divergence to the posterior process to approximate the posterior function and covariance parameters markov chain mcmc approaches provide asymptotically exact approximations murray and adams and filippone et al examine schemes which iteratively sample the function values and covariance parameters such sampling schemes require computation and inversion of the full covariance matrix at each iteration making them unsuitable for large problems computation may be reduced somewhat by considering variational methods approximating the posterior using some fixed family of distributions though many covariance matrix inversions are generally required recent works have proposed inducing point schemes which can reduce the table existing variational approaches reference williams barber also titsias chai nguyen and bonilla hensman et al this work gaussian softmax any factorized probit any factorized sparse posterior hyperparam gaussian assumed gaussian optimal gaussian assumed mixture of gaussians gaussian assumed point estimate point estimate point estimate point estimate point estimate computation required substantially though the posterior is assumed gaussian and the covariance parameters are estimated by approximate maximum 
likelihood table places our work in the context of existing variational methods for gps this paper presents general inference scheme with the only concession to approximation being the variational inducing point assumption posteriors are permitted through mcmc with the computational benefits of the inducing point framework the scheme jointly samples the representation of the function with the covariance function parameters with sufficient inducing points our method approaches full bayesian inference over gp values and the covariance parameters we show empirically that the number of required inducing points is substantially smaller than the dataset size for several real problems stochastic process posteriors the model is set up as follows we are presented with some data inputs xn and responses yn latent function is assumed drawn from gp with zero mean and covariance function with parameters consistency of the gp means that only those points with data are considered the latent vector represents the values of the function at the observed points xn and has conditional distribution kf where kf is matrix composed of evaluating the covariance function at all pairs of points in the data likelihood depends on the latent function values to make prediction for latent function value test points the posterior function values and parameters are integrated zz dθ df in order to make use of the computational savings offered by the variational inducing point framework we introduce additional input points to the function and collect the responses of the function at that point into the vector um zm with some variational posterior new points are predicted similarly to the exact solution zz dθ du this makes clear that the approximation is stochastic process in the same fashion as the true posterior the length of the predictions vector is potentially unbounded covering the whole domain to obtain variational objective first consider the support of under the true posterior and for under the approximation in the above these points are subsumed into the prediction vector from here we shall be more explicit letting be the points of the process at be the points of the process at and be large vector containing all other points of all of the free parameters of the model are then and using variational framework we aim to minimize the divergence between the approximate and true posteriors kl log the vector here is considered finite but large enough to contain any point of interest for prediction the infinite case follows matthews et al is omitted here for brevity and results in the same solution where the conditional distributions for have been expanded to make clear that they are the same under the true and approximate posteriors and and have been omitted for clarity straightforward identities simplify the expression log log log resulting in the variational objective investigated by titsias aside from the inclusion of this can be rearranged to give the following informative expression exp ep log log log kl here is an intractable constant which normalizes the distribution and is independent of minimizing the kl divergence on the right hand side reveals that the optimal variational distribution is log ep log log log log for general likelihoods since the optimal distribution does not take any particular form we intend to sample from it using mcmc thus combining the benefits of gaussian processes with posterior sampling is feasible using standard methods since log is computable up to constant using computations after 
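the expression for the optimal variational distribution above is garbled; based on the surrounding derivation, a plausible reconstruction, up to the intractable normalizing constant C, is

$$
\log \hat q(\mathbf u,\theta)\;=\;\mathbb E_{p(\mathbf f\mid \mathbf u,\theta)}\!\big[\log p(\mathbf y\mid \mathbf f)\big]\;+\;\log p(\mathbf u\mid\theta)\;+\;\log p(\theta)\;-\;\log \mathcal C ,
$$

which is the quantity the proposed scheme samples from jointly in the inducing values and the covariance parameters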
completing this work it was brought to our attention that similar suggestion had been made in though the idea was dismissed because prediction in sparse gp models typically involves some additional approximations our presentation of the approximation consisting of the entire stochastic process makes clear that no additional approximations are required to sample effectively the following are proposed whitening the prior noting that the problem appears similar to standard gp for albeit with an interesting likelihood we make use of an ancillary augmentation rv with rr kuu this results in the optimal variational distribution log ep log log log log previously this parameterization has been used with schemes which alternate between sampling the latent function values represented by or and the parameters our scheme uses hmc across and jointly whose effectiveness is examined throughout the experiment section quadrature the first term in is the expected in the case of factorization across the pairs this results in integrals for gaussian or poisson likelihood these integrals are tractable otherwise they can be approximated by quadrature given the current sample the expectations are computed fn µn γn with diag kf kuf rr kuu where the kernel matrices kuf kuu are computed similarly to kf but over the pairs in respectively from here one can compute the expected likelihood and it is subsequently straightforward to compute derivatives in terms of kuf diag kf and reverse mode differentiation of cholesky to compute derivatives with respect to and we use differentiation backpropagation of the derivative through the cholesky matrix decomposition transforming log into log and then log this is discussed by smith and results in operation an efficient cython implementation is provided in the supplement treatment of inducing point positions inference strategy natural question is what strategy should be used to select the inducing points in the original inducing point formulation the positions were treated as parameters to be optimized one could interpret them as parameters of the approximate prior covariance the variational formulation treats them as parameters of the variational approximation thus protecting from as they form part of the variational posterior in this work since we propose bayesian treatment of the model we question whether it is feasible to treat in bayesian fashion since and are auxiliary parameters the form of their distribution does not affect the marginals of the model the term has been defined by the consistency with the gp in order to preserve the interpretation above should be points on the gp but we are free to choose omitting dependence on for clarity and choosing the bound on the marginal likelihood similarly to is given by ep log the bound can be maximized by noting that the term only appears inside negative kl divergence log substituting the optimal reduces to eq ep log which can now be optimized since no entropy term appears for the bound is maximized when the distribution becomes dirac delta in summary since we are free to choose prior for which maximizes the amount of information captured by the optimal distribution becomes this formally motivates optimizing the inducing points derivatives for for completeness we also include the derivative of the free form objective with respect to the inducing point positions substituting the optimal distribution into to give and then differentiating we obtain log ep log since we aim to draw samples from evaluating this free form inducing point 
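putting the whitening and the quadrature together, the unnormalized log density targeted over the whitened variables v (with u = Rv and RR^T = K_uu) can be evaluated roughly as follows; the kernel and likelihood callables, the jitter, the number of quadrature nodes and all names are assumptions made here for illustration, and the covariance parameters are held fixed for brevity

```python
import numpy as np
from scipy.stats import norm

def log_q_unnormalised(v, Z, X, y, kernel, log_lik, n_quad=20, jitter=1e-6):
    """Unnormalised log of the optimal q over the whitened inducing values v:

        sum_n E_{N(f_n | mu_n, gamma_n)}[log p(y_n | f_n)] + log N(v | 0, I) + const
    """
    Kuu = kernel(Z, Z) + jitter * np.eye(len(Z))
    Kuf = kernel(Z, X)
    Kff_diag = np.array([kernel(x[None, :], x[None, :])[0, 0] for x in X])
    R = np.linalg.cholesky(Kuu)
    A = np.linalg.solve(R, Kuf)                          # R^{-1} K_uf
    mu = A.T @ v                                         # conditional means mu_n
    gamma = np.maximum(Kff_diag - np.sum(A * A, axis=0), 0.0)   # conditional variances

    # Gauss-Hermite quadrature for E_{N(mu_n, gamma_n)}[log p(y_n | f_n)]
    gh_x, gh_w = np.polynomial.hermite_e.hermegauss(n_quad)
    f_nodes = mu[:, None] + np.sqrt(gamma)[:, None] * gh_x[None, :]
    exp_loglik = (log_lik(y[:, None], f_nodes) * gh_w[None, :]).sum(axis=1) / gh_w.sum()

    return exp_loglik.sum() - 0.5 * v @ v                # whitened prior N(0, I), up to a constant

# example likelihood: binary labels y in {-1, +1} with a probit link
probit_loglik = lambda y, f: norm.logcdf(y * f)
```

gradients of this quantity with respect to v (and, in the full scheme, the covariance parameters) are what HMC needs at each leapfrog step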
gradient using samples seems plausible but challenging instead we use the following strategy fit gaussian approximation to the posterior we follow in fitting gaussian approximation to the posterior the positions of the inducing points are initialized using clustering of the data the values of the latent function are represented by mean vector initialized randomly and matrix forms the approximate posterior covariance as ll for large problems such as the mnist experiment stochastic optimization using adadelta is used otherwise lbfgs is used after few hundred iterations with the inducing points positions fixed they are optimized in alongside the variational parameters and covariance function parameters initialize the model using the approximation having found satisfactory approximation the hmc strategy takes the optimized inducing point positions from the gaussian approximation the initial value of is drawn from the gaussian approximation and the covariance parameters are initialized at the approximate map value tuning hmc the hmc algorithm has two free parameters to tune the number of leapfrog steps and the we follow strategy inspired by wang et al where the number of leapfrog steps is drawn randomly from to lmax and bayesian optimization is used to maximize the expected square jump distance esjd penalized by lmax rather than allow an adaptive but convergent scheme as we run the optimization for iterations of samples each and use the best parameters for long run of hmc run tuned hmc to obtain predictions having tuned the hmc it is run for several thousand iterations to obtain good approximation to the samples are used to estimate the integral in equation the following section investigates the effectiveness of the proposed sampling scheme experiments efficient sampling using hamiltonian monte carlo this section illustrates the effectiveness of hamiltonian monte carlo in sampling from as already pointed out the form assumed by the optimal variational distribution in equation resembles the joint distribution in gp model with likelihood for fixed sampling is relatively straightforward and this can be done efficiently using hmc or elliptical slice sampling well tuned hmc has been reported to be extremely efficient in sampling the latent variables and this motivates our effort into trying to extend this efficiency to the sampling of as well this is also particularly appealing due to the convenience offered by the proposed representation of the model the problem of drawing samples from the posterior distribution over has been investigated in detail in in these works it has been advocated to alternate between the sampling of and in gibbs sampling fashion and condition the sampling of on suitably chosen transformation of the latent variables for each likelihood model we compare efficiency and convergence speed of the proposed hmc sampler with gibbs sampler where is sampled using hmc and is sampled using the algorithm to make the comparison fair we imposed the mass matrix in hmc and the covariance in mh to be isotropic and any parameters of the proposal were tuned using bayesian optimization unlike in the proposed hmc sampler for the gibbs sampler we did not penalize the objective function of the bayesian optimization for large numbers of leapfrog steps as in this case hmc proposals on the latent variables are computationally cheaper than those on the we report efficiency in sampling from using effective sample size ess and time normalized tn in the supplement we include convergence plots based on the 
potential scale reduction factor psrf computed based on ten parallel chains in these each chain is initialized from the vb solution and individually tuned using bayesian optimization binary classification we first use the image dataset to investigate the benefits of the approach over gaussian approximation and to investigate the effect of changing the number of inducing points as well as optimizing the inducing points under the gaussian approximation the data are dimensional we investigated the effect of our approximation using both ard one lengthscale per dimension and an isotropic rbf kernel the data were split randomly into sets the log predictive density over ten random splits is shown in figure following the strategy outlined above we fitted gaussian approximation to the posterior with initialized with figure investigates the difference in performance when is optimized using the gaussian approximation compared to just using for whilst our strategy is not guaranteed to find the global optimum it is clear that it improves the performance the second part of figure shows the performance improvement of our sampling approach over the gaussian approximation we drew samples discarding the first we see consistent improvement in performance once is large enough for small the gaussian approximation appears to work very well the supplement contains similar figure for the case where single lengthscale is shared there the improvement of the mcmc method over the gaussian approximation is smaller but consistent we speculate that the larger gains for ard are due to posterior uncertainty in the lengthscales which is poorly represented by point in the approximation the ess and are comparable between hmc and the gibbs sampler in particular for inducing points and the rbf covariance ess and for hmc are and and for the gibbs sampler are and for the ard covariance ess and for hmc are and and for the gibbs sampler are and convergence however seems to be faster for hmc especially for the ard covariance see the supplement log gaussian cox processes we apply our methods to log gaussian cox processes doubly stochastic models where the rate of an inhomogeneous poisson process is given by gaussian process the main difficulty for inference lies in that the likelihood of the gp requires an integral over the domain which is typically intractable for low dimensional problems this integral can be approximated on grid assuming that the gp is constant over the width of the grid leads to factorizing poisson likelihood for each of the grid points whilst some recent approaches allow for approach these usually require concessions in the model such as an alternative link function and do not approach full bayesian inference over the covariance function parameters log mcmc log gauss log mcmc zoptimized number of inducing points zoptimized number of inducing points mcmc rate mcmc figure performance of the method on the image dataset with one lengthscale per dimension left show performance for varying numbers of inducing points and strategies optimizing using the gaussian approximation offers significant improvement over the strategy right improvement of the performance of the gaussian approximation method with the same inducing points the method offers consistent performance gains when the number of inducing points is larger the supplement contains similar figure with only single lengthscale time years lengthscale variance figure the posterior of the rates for the coal mining disaster data left posterior rates using our 
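the grid-factorized Poisson likelihood used above for the log Gaussian Cox process examples is simple to write down; a minimal one-dimensional sketch, with names chosen here for illustration and the constant log(count!) term dropped

```python
import numpy as np

def lgcp_grid_loglik(f, event_times, grid_edges):
    """Factorised Poisson log-likelihood for a 1-D log Gaussian Cox process,
    with the GP assumed constant over each grid cell.

    f           : latent GP values, one per cell
    event_times : observed event locations
    grid_edges  : cell boundaries, len(f) + 1 values"""
    counts, _ = np.histogram(event_times, bins=grid_edges)
    log_mean = f + np.log(np.diff(grid_edges))            # log(exp(f_g) * cell width)
    return np.sum(counts * log_mean - np.exp(log_mean))   # log(counts!) constant dropped
```

the same construction extends to two dimensions by binning events on a spatial grid, which is how the pine sapling example below is handled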
variational mcmc method and gaussian approximation data are shown as vertical bars right posterior samples for the covariance function parameters using mcmc the gaussian approximation estimated the parameters as coal mining disasters on the disaster data we held out of the data at random and using grid of points with evenly spaced inducing points fitted both gaussian approximation to the posterior process with an approximate map estimate for the covariance function parameters variance and lengthscale of an rbf kernel with gamma priors on the covariance parameters we ran our sampling scheme using hmc drawing samples the resulting posterior approximations are shown in figure alongside the true posterior using sampling scheme similar to ours but without the inducing point approximation the variational approximation matches the true posterior closely whilst the gaussian approximation misses important detail the approximate and true posteriors over covariance function parameters are shown in the right hand part of figure there is minimal discrepancy in the distributions over random splits of the data the average was for the gaussian approximation and for the mcmc variant the average difference was and the mcmc variant was always better than the gaussian approximation we attribute this improved performance to marginalization of the covariance function parameters efficiency of hmc is greater than for the gibbs sampler ess and for hmc are and and for the gibbs sampler are and also chains converge within few thousand iterations for both methods although convergence for hmc is faster see the supplement figure pine sapling data from left to right reported locations of pine saplings posterior mean intensity on grid using full mcmc posterior mean intensity on grid with sparsity using inducing points posterior mean intensity on grid using inducing points pine saplings the advantages of the proposed approximation are prominent as the number of grid points become higher an effect emphasized with increasing dimension of the domain we fitted similar model to the above to the pine sapling data we compared the sampling solution obtained using inducing points on grid to the gold standard full mcmc run with the same prior and grid size figure shows that the agreement between the variational sampling and full sampling is very close however the variational method was considerably faster using single core on desktop computer required seconds to obtain effective sample for well tuned variational method whereas it took seconds for well tuned full mcmc this effect becomes even larger as we increase the resolution of the grid to which gives better approximation to the underlying smooth function as can be seen in figure it took seconds to obtain one effective sample for the variational method but now gold standard mcmc comparison was computationally extremely challenging to run for even single hmc step this is because it requires linear algebra operations using flops with classification to do classification with gaussian processes one latent function is defined for each of the classes the functions are defined independent but covary posteriori because of the likelihood chai studies sparse variational approximation to the softmax likelihood restricted to gaussian approximation here following we use likelihood given vector fn containing latent functions evaluated at the point xn the probability that the label takes the integer value yn is if yn argmax fn and otherwise as girolami and rogers discuss the soft behaviour is 
recovered by adding diagonal nugget to the covariance function in this work was fixed to though it would also be possible to treat this as parameter for inference the expected is ep fn log yn fn log log where is the probability that the labelled function is largest which is computable using quadrature an efficient cython implementation is contained in the supplement toy example to investigate the proposed posterior approximation for the multivariate classification case we turn to the toy data shown in figure we drew data points from three gaussian distributions the synthetic data was chosen to include decision boundaries and ambiguous decision areas figure shows that there are differences between the variational and sampling solutions with the sampling solution being more conservative in general the contours of confidence are smaller as one would expect at the decision boundary there are strong correlations between the functions which could not be captured by the gaussian approximation we are using note the movement of inducing points away from and towards the decision boundaries efficiency of hmc and the gibbs sampler is comparable in the rbf case ess and for hmc are and and for the gibbs sampler are and in the ard case ess and for hmc are and and for the gibbs sampler are and for both cases the gibbs sampler struggles to reach convergence even though the average acceptance rates are similar to those recommended for the two samplers individually mnist the mnist dataset is well studied benchmark with defined split we used inducing points initialized from the training data using gaussian approximation figure toy multiclass problem left the gaussian approximation colored points show the simulated data lines show posterior probability contours at inducing points positions shows as black points middle the free form solution with posterior samples the solution is more conservative the contours are smaller right posterior samples for at the same position but across different latent functions the posterior exhibits strong correlations and edges figure left three centers used to initialize the inducing point positions center the positions of the same inducing points after optimization right difference was optimized using optimization over the means and variances of as well as the inducing points and covariance function parameters the accuracy on the data was significantly improving on previous approaches to classify these digits using gp models for binary classification hensman et al reported that their gaussian approximation resulted in movement of the inducing point positions toward the decision boundary the same effect appears in the multivariate case as shown in figure which shows three of the inducing points used in the mnist problem the three examples were initialized close to the many six digits and after optimization have moved close to other digits five and four the last example still appears to be six but has moved to more unusual six shape supporting the function at another extremity similar effects are observed for all digits having optimized the inducing point positions with the approximate and estimate for we used these optimal inducing points to draw samples from and this did not result in an increase in accuracy but did improve the on the test set from to evaluating the gradients for the sampler took approximately seconds on desktop machine and we were easily able to draw samples this dataset size has generally be viewed as challenging in the gp community and consequently there are 
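the expected log-likelihood above only requires the probability that the labelled function is the largest, which can be obtained with one-dimensional quadrature; a sketch assuming the K latent functions are independent Gaussians under the approximate posterior, with the nugget-style parameter eps and all names as illustrative choices

```python
import numpy as np
from scipy.stats import norm

def prob_label_is_max(mu, var, y, n_quad=30):
    """P(f_y > f_j for all j != y) when the K latent functions are
    independent Gaussians N(mu_k, var_k); 1-D Gauss-Hermite quadrature."""
    gh_x, gh_w = np.polynomial.hermite_e.hermegauss(n_quad)
    f = mu[y] + np.sqrt(var[y]) * gh_x                    # quadrature nodes for f_y
    cdf_prod = np.ones_like(f)
    for j in range(len(mu)):
        if j != y:
            cdf_prod *= norm.cdf((f - mu[j]) / np.sqrt(var[j]))
    return np.sum(gh_w * cdf_prod) / np.sum(gh_w)

def expected_loglik(mu, var, y, eps, K):
    """E[log p(y | f)] = p * log(1 - eps) + (1 - p) * log(eps / (K - 1))."""
    p = prob_label_is_max(mu, var, y)
    return p * np.log(1 - eps) + (1 - p) * np.log(eps / (K - 1))
```

this per-point quantity is summed over the data and combined with the prior term exactly as in the binary case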
not many published results to compare with one recent work reports accuracy using variational inference and gp latent variable model discussion we have presented an inference scheme for general gp models the scheme significantly reduces the computational cost whilst approaching exact bayesian inference making minimal assumptions about the form of the posterior the improvements in accuracy in comparison with the gaussian approximation of previous works has been demonstrated as has the quality of the approximation to the distribution our mcmc scheme was shown to be effective for several likelihoods and we note that the automatic tuning of the sampling parameters worked well over hundreds of experiments this paper shows that mcmc methods are feasible for inference in large gp problems addressing the unfair sterotype of slow mcmc acknowledgments jh was funded by an mrc fellowship am and zg by epsrc grant and google focussed research award references nguyen and bonilla automated variational inference for gaussian process models in nips pages and opper sparse gaussian processes neural snelson and ghahramani sparse gaussian processes using in nips pages titsias variational learning of inducing variables in sparse gaussian processes in aistats pages rasmussen and sparse spectrum gaussian process regression jmlr solin and hilbert space methods for gaussian process regression arxiv preprint wilson gilboa nehorai and cunningham fast kernel learning for multidimensional pattern extrapolation in nips pages bayesian filtering and smoothing volume cambridge university press filippone and engler enabling scalable stochastic inference for gaussian processes by employing the unbiased linear system solver ulisse icml matthews hensman turner and ghahramani on sparse variational methods and the kl divergence between stochastic processes arxiv preprint murray and adams slice sampling covariance hyperparameters of latent gaussian models in nips pages filippone zhong and girolami comparative evaluation of inference methods for gaussian process models mach gibbs and mackay variational gaussian process classifiers ieee trans neural opper and archambeau the variational gaussian approximation revisited neural kuss and rasmussen assessing approximate inference for binary gaussian process classification jmlr nickisch and rasmussen approximations for binary gaussian process classification jmlr khan mohamed and murphy fast bayesian inference for gaussian process regression in nips pages chai variational multinomial logit gaussian process jmlr june lloyd gunter osborne and roberts variational inference for gaussian process modulated poisson processes icml hensman matthews and ghahramani scalable variational gaussian process classification in aistats pages williams and barber bayesian classification with gaussian processes ieee trans pattern anal mach michalis titsias neil lawrence and magnus rattray markov chain monte carlo algorithms for gaussian processes in barber chiappa and cemgil editors bayesian time series models smith differentiation of the cholesky algorithm comp graph and rasmussen unifying view of sparse approximate gaussian process regression jmlr wang mohamed and de freitas adaptive hamiltonian and riemann manifold monte carlo in icml volume pages vanhatalo and vehtari sparse log gaussian processes via mcmc for spatial epidemiology in gaussian processes in practice volume pages christensen roberts and rosenthal scaling limits for the transient phase of local metropolishastings algorithms jrss murray adams 
and MacKay. Elliptical slice sampling. In AISTATS.
Rätsch, Onoda, and Müller. Soft margins for AdaBoost. Machine Learning.
Møller, Syversveen, and Waagepetersen. Log Gaussian Cox processes. Scandinavian Journal of Statistics.
Girolami and Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation.
Kim and Ghahramani. Bayesian Gaussian process classification with the EM-EP algorithm. IEEE TPAMI.
and Dupont. Robust Gaussian process classification. In NIPS.
Gal, van der Wilk, and Rasmussen. Distributed variational inference in sparse Gaussian process regression and latent variable models. In NIPS.
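To make the quadrature computation mentioned in the multiclass GP discussion above concrete: when the posterior over the latent function values at a data point is approximated by independent Gaussians, the "probability that the labelled function is largest" reduces to a one-dimensional integral that standard Gauss-Hermite quadrature handles well. The sketch below is our own illustration in plain NumPy/SciPy (it is not the Cython implementation referenced in the text); the function name and the factorized-Gaussian assumption are ours.

```python
import numpy as np
from scipy.stats import norm

def prob_labelled_largest(mu, var, y, n_quad=20):
    """P(f_y > f_k for all k != y) when the latent values are independent
    Gaussians f_k ~ N(mu[k], var[k]); evaluated by 1-D Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)   # nodes/weights for int exp(-t^2) g(t) dt
    f_y = mu[y] + np.sqrt(2.0 * var[y]) * x          # map nodes to samples of f_y ~ N(mu_y, var_y)
    others = [k for k in range(len(mu)) if k != y]
    # For each node, the probability that every other latent function is smaller.
    prod_cdf = np.prod([norm.cdf((f_y - mu[k]) / np.sqrt(var[k])) for k in others], axis=0)
    return np.sum(w * prod_cdf) / np.sqrt(np.pi)

# Example: three latent functions, label y = 0.
# prob_labelled_largest(np.array([1.0, -0.5, 0.2]), np.array([0.3, 0.4, 0.2]), y=0)
```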
Less is More: Nyström Computational Regularization

Alessandro Rudi, Raffaello Camoriano, Lorenzo Rosasco. Università degli Studi di Genova (DIBRIS), Via Dodecaneso, Genova, Italy; Istituto Italiano di Tecnologia, iCub Facility, Via Morego, Genova, Italy; Massachusetts Institute of Technology and Istituto Italiano di Tecnologia, Laboratory for Computational and Statistical Learning, Cambridge, MA, USA.

Abstract. We study Nyström-type subsampling approaches to large-scale kernel methods and prove learning bounds in the statistical learning setting, where random sampling and high-probability estimates are considered. In particular, we prove that these approaches can achieve optimal learning bounds, provided the subsampling level is suitably chosen. These results suggest a simple incremental variant of Nyström kernel regularized least squares, where the subsampling level implements a form of computational regularization, in the sense that it controls at the same time regularization and computations. Extensive experimental analysis shows that the considered approach achieves state-of-the-art performances on benchmark large-scale datasets.

Introduction. Kernel methods provide an elegant and effective framework to develop nonparametric statistical approaches to learning. However, memory requirements make these methods unfeasible when dealing with large datasets. Indeed, this observation has motivated a variety of computational strategies to develop large-scale kernel methods. In this paper we study subsampling methods, which we broadly refer to as Nyström approaches. These methods replace the empirical kernel matrix, needed by standard kernel methods, with a smaller matrix obtained by column subsampling. Such procedures are shown to often dramatically reduce memory and computational requirements while preserving good practical performances. The goal of our study is twofold. First and foremost, we aim at providing a theoretical characterization of the generalization properties of such learning schemes in a statistical learning setting. Second, we wish to understand the role played by the subsampling level, both from a statistical and a computational point of view. As discussed in the following, this latter question leads to a natural variant of kernel regularized least squares (KRLS), where the subsampling level controls both regularization and computations.

From a theoretical perspective, the effect of Nyström approaches has been primarily characterized by considering the discrepancy between a given empirical kernel matrix and its subsampled version. While interesting in their own right, these latter results do not directly yield information on the generalization properties of the obtained algorithm. Results in this direction, albeit suboptimal, were first derived in earlier work and more recently sharpened; in these latter papers, sharp error analyses in expectation are derived in a fixed-design regression setting for a form of Nyström kernel regularized least squares. In particular, a basic uniform sampling approach is studied in one line of work, while a subsampling scheme based on the notion of leverage scores is considered in another. The main technical contribution of our study is an extension of these latter results to the statistical learning setting, where the design is random and high-probability estimates are considered. The more general setting makes the analysis considerably more complex. Our main result gives optimal finite-sample bounds for both uniform and leverage-score-based subsampling strategies. These methods are shown to achieve the same optimal learning error as kernel regularized least squares, recovered as a special case, while allowing substantial computational gains. Our analysis highlights the interplay
between the regularization and subsampling parameters suggesting that the latter can be used to control simultaneously regularization and computations this strategy implements form of computational regularization in the sense that the computational resources are tailored to the generalization properties in the data this idea is developed considering an incremental strategy to efficiently compute learning solutions for different subsampling levels the procedure thus obtained which is simple variant of classical kernel regularized least squares with uniform sampling allows for efficient model selection and achieves state of the art results on variety of benchmark large scale datasets the rest of the paper is organized as follows in section we introduce the setting and algorithms we consider in section we present our main theoretical contributions in section we discuss computational aspects and experimental results supervised learning with krls and approaches let be probability space with distribution where we view and as the input and output spaces respectively let ρx denote the marginal distribution of on and the conditional distribution on given given hypothesis space of measurable functions from to the goal is to minimize the expected risk min dρ provided is known only through training set of xi yi sampled identically and independently according to basic example of the above setting is random design regression with the squared loss in which case yi xi with fixed regression function sequence of random variables seen as noise and xn random inputs in the following we consider kernel methods based on choosing hypothesis space which is separable reproducing kernel hilbert space the latter is hilbert space of functions with inner product such that there exists function with the following two properties for all kx belongs to and the so called reproducing property holds hf kx ih for all the function called reproducing kernel is easily shown to be symmetric and positive definite that is the kernel matrix kn xi xj is positive semidefinite for all xn classical way to derive an empirical solution to problem is to consider tikhonov regularization approach based on the minimization of the penalized empirical functional xi yi λkf min the above approach is referred to as kernel regularized least squares krls or kernel ridge regression krr it is easy to see that solution fˆλ to problem exists it is unique and the representer theorem shows that it can be written as fˆλ xi with kn λni where xn are the training set points yn and kn is the empirical kernel matrix note that this result implies that we can restrict the minimization in to the space hn αi xi αn storing the kernel matrix kn and solving the linear system in can become computationally unfeasible as increases in the following we consider strategies to find more efficient solutions based on the idea of replacing hn with hm αi rm where and is subset of the input points in the training set the solution fˆλ of the corresponding minimization problem can now be written as fˆλ with knm knm λnkmm knm where denotes the pseudoinverse of matrix and knm ij xi kmm kj with and the above approach is related to methods and different approximation strategies correspond to different ways to select the inputs subset while our framework applies to broader class of strategies see section in the following we primarily consider two techniques plain the points are sampled uniformly at random without replacement from the training set approximate leverage scores als recall 
that the leverage scores associated to the training set points xn are li li kn kn tni ii for any where kn ij xi xj in practice leverage scores are onerous to compute and approximations ˆli can be considered in particular in the following we are interested in suitable approximations defined as follows definition leverage scores let li be the leverage scores associated to the training set for given let and we say that li are leverage scores with confidence when with probability at least li li li given leverage scores for are sampled from the training set inp dependently with replacement and with probability to be selected given by pt ˆli ˆlj in the next section we state and discuss our main result showing that the krls formulation based on plain or approximate leverage scores provides optimal empirical solutions to problem theoretical analysis in this section we state and discuss our main results we need several assumptions the first basic assumption is that problem admits at least solution assumption there exists an fh such that fh min note that while the minimizer might not be unique our results apply to the case in which fh is the unique minimizer with minimal norm also note that the above condition is weaker than assuming the regression function in to belong to finally we note that the study of the paper can be adapted to the case in which minimizers do not exist but the analysis is considerably more involved and left to longer version of the paper the second assumption is basic condition on the probability distribution assumption let zx be the random variable zx fh with and distributed according to then there exists such that for any almost everywhere on the above assumption is needed to control random quantities and is related to noise assumption in the regression model it is clearly weaker than the often considered bounded output assumption and trivially verified in classification the last two assumptions describe the capacity roughly speaking the size of the hypothesis space induced by with respect to and the regularity of fh with respect to and to discuss them we first need the following definition definition covariance operator and effective dimensions we define the covariance operator as hf cgih dρx moreover for we define the random variable nx kx λi kx with distributed according to ρx and let nx sup nx we add several comments note that corresponds to the second moment operator but we refer to it as the covariance operator with an abuse of terminology moreover note that tr λi see this latter quantity called effective dimension or degrees of freedom can be seen as measure of the capacity of the hypothesis space the quantity can be seen to provide uniform bound on the leverage scores in eq clearly for all assumption the kernel is measurable is bounded moreover for all and qλ measurability of and boundedness of are minimal conditions to ensure that the covariance operator is well defined linear continuous positive operator condition is satisfied if the kernel is bounded indeed in this case for all conversely it can be seen that condition together with boundedness of imply that the kernel is bounded indeed kck boundedness of the kernel implies in particular that the operator is trace class and allows to use tools from spectral theory condition quantifies the capacity assumption and is related to number conditions see for further details in particular it is known that condition is ensured if the eigenvalues σi of satisfy polynomial decaying condition σi note that since the operator is 
trace class condition always holds for here for space constraints and in the interest of clarity we restrict to such polynomial condition but the analysis directly applies to other conditions including exponential decay or finite rank conditions finally we have the following regularity assumption assumption there exists such that kc fh kh the above condition is fairly standard and can be equivalently formulated in terms of classical concepts in approximation theory such as interpolation spaces intuitively it quantifies the degree to which fh can be well approximated by functions in the rkhs and allows to control the error of learning solution for it is always satisfied for larger we are assuming fh to belong to subspaces of that are the images of the fractional compact operators such spaces contain functions which expanded on basis of eigenfunctions of have larger coefficients in correspondence to large eigenvalues such an assumption is natural in view of using techniques such as which can be seen as form of spectral filtering that estimate stable solutions by discarding the contribution of small eigenvalues in the next section we are going to quantify the quality of empirical solutions of problem obtained by schemes of the form in terms of the quantities in assumptions if is finite then kck kcki kx kkx therefore kck main results in this section we state and discuss our main results starting with optimal finite sample error bounds for regularized least squares based on plain and approximate leverage score based subsampling theorem under assumptions and let min and assume log log kck kckδ then the following inequality holds with probability at least qσ fˆλ fh with log kck kck with fˆλ as in and for plain log λδ for als and leverage scores with subsampling probabilities pλ log and log we add several comments first the above results can be shown to be optimal in minimax sense indeed minimax lower bounds proved in show that the learning rate in is optimal under the considered assumptions see thm of for discussion on minimax lower bounds see sec of second the obtained bounds can be compared to those obtained for other regularized learning techniques techniques known to achieve optimal error rates include tikhonov regularization iterative regularization by early stopping spectral regularization principal component regression or truncated svd as well as regularized stochastic gradient methods all these techniques are essentially equivalent from statistical point of view and differ only in the required computations for example iterative methods allow for computation of solutions corresponding to different regularization levels which is more efficient than tikhonov or svd based approaches the key observation is that all these methods have the same memory requirement in this view our results show that randomized subsampling methods can break such memory barrier and consequently achieve much better time complexity while preserving optimal learning guarantees finally we can compare our results with previous analysis of randomized kernel methods as already mentioned results close to those in theorem are given in in fixed design setting our results extend and generalize the conclusions of these papers to general statistical learning setting relevant results are given in for different approach based on averaging krls solutions obtained splitting the data in groups divide and conquer rls the analysis in is only in expectation but considers random design and shows that the proposed method is indeed optimal 
provided the number of splits is chosen depending on the effective dimension this is the only other work we are aware of establishing optimal learning rates for randomized kernel approaches in statistical learning setting in comparison with computational regularization the main disadvantage of the divide and conquer approach is computational and in the model selection phase where solutions corresponding to different regularization parameters and number of splits usually need to be computed the proof of theorem is fairly technical and lengthy it incorporates ideas from and techniques developed to study spectral filtering regularization in the next section we briefly sketch some main ideas and discuss how they suggest an interesting perspective on regularization techniques including subsampling proof sketch and computational regularization perspective key step in the proof of theorem is an error decomposition and corresponding bound for any fixed and indeed it is proved in theorem and proposition that for with probability rmse classification error rmse figure validation errors associated to grids of values for axis and axis on left breast cancer center and cpusmall right at least fλ fh log rc the first and last term in the right hand side of the above inequality can be seen as forms of sample and approximation errors and are studied in lemma and theorem the mid term can be seen as computational error and depends on the considered subsampling scheme indeed it is shown in proposition that can be taken as cpl min log tδ for the plain approach and cals min log kck log for the approximate leverage scores approach the bounds in theorem follow by minimizing in the sum of the first and third term choosing so that the computational error is of the same order of the other terms computational resources and regularization are then tailored to the generalization properties of the data at hand we add few comments first note that the error bound in holds for large class of subsampling schemes as discussed in section in the appendix then specific error bounds can be derived developing computational error estimates second the error bounds in theorem and proposition and hence in theorem easily generalize to larger class of regularization schemes beyond tikhonov approaches namely spectral filtering for space constraints these extensions are deferred to longer version of the paper third we note that in practice optimal data driven parameter choices based on estimates can be used to adaptively achieve optimal learning bounds finally we observe that different perspective is derived starting from inequality and noting that the role played by and can also be exchanged letting play the role of regularization parameter can be set as function of and tuned adaptively for example in the case of plain approach if we set log and log then the obtained learning solution achieves the error bound in eq as above the subsampling level can also be chosen by interestingly in this case by tuning we naturally control computational resources and regularization an advantage of this latter parameterization is that as described in the following the solution corresponding to different subsampling levels is easy to update using cholesky update formulas as discussed in the next section in practice joint tuning over and can be done starting from small and appears to be advantageous both for error and computational performances incremental updates and experimental analysis in this section we first describe an incremental strategy to 
efficiently explore different subsampling levels and then perform extensive empirical tests aimed in particular at investigating the statistical and computational benefits of considering varying subsampling levels and compare the incremental nyström batch nyström time input dataset xi yi subsampling regularization parameter output krls estimators compute for do compute at ut rt cholup rt ut rt rt cholup rt vt at end for algorithm incremental krls figure model selection time on the cpusmall dataset and repetitions performance of the algorithm with respect to state of the art solutions on several large scale benchmark datasets throughout this section we only consider plain approach deferring to future work the analysis of leverage scores based sampling techniques interestingly we will see that such basic approach can often provide state of the art performances efficient incremental updates algorithm efficiently compute solutions corresponding to different subsampling levels by exploiting cholesky updates the proposed procedure allows to efficiently compute whole regularization path of solutions and hence perform fast model see sect in algorithm the function cholup is the cholesky update formula available in many linear algebra libraries the total cost of the algorithm is time to compute while naive algorithm would require with is the number of analyzed subsampling levels the following are some quantities needed by the algorithm and at at for any moreover for any gt γt and ut ct gt gt at xn at λnbt vt ct gt bt at λnk experimental analysis we empirically study the properties of algorithm considering gaussian kernel of width the selected datasets are already divided in training and test we randomly split the training part in training set and validation set and of the training points respectively for parameter tuning via the subsampled points for approximation are selected uniformly at random from the training set we report the performance of the selected model on the fixed test set repeating the process for several trials interplay between and we begin with set of results showing that incrementally exploring different subsampling levels can yield very good performance while substantially reducing the computational requirements we consider the the breast cancer and the cpusmall in figure we report the validation errors associated to grid of values for and the values are logarithmically spaced while the values are linearly spaced the ranges and kernel bandwidths chosen according to on the data are for for breast cancer and for cpusmall the main observation that can be derived from this first series of tests is that small is sufficient to obtain the same results achieved with the largest for example for it is sufficient to choose and to obtain an average test rmse of over trials which is the same as the one obtained using and with speedup of the joint training and validation phase also it is interesting to observe that for given values of large values of can decrease the performance this observation is consistent with the results in section showing that can play the the code for algorithm is available at in the following we denote by the total number of points and by the number of dimensions and table test rmse comparison for exact and approximated kernel methods the results for krls batch rf and fastfood are the ones reported in ntr is the size of the training set dataset ntr incremental rbf krls rbf batch rbf rf rbf fastfood rbf fastfood fft krls matern fastfood matern insurance company cpu ct 
slices axial year prediction msd forest na na na na na na role of regularization parameter similar results are obtained for breast cancer where for and we obtain average classification error on the test set over trials while for and we obtain for cpusmall with and the average test rmse over trials is while for and it is only slightly higher but computing its associated solution requires less than half of the time and approximately half of the memory regularization path computation if the subsampling level is used as regularization parameter the computation of regularization path corresponding to different subsampling levels becomes crucial during the model selection phase naive approach that consists in recomputing the solutions of eq for each subsampling level would require nt lt computational time where is the number of solutions with different subsampling levels to be evaluated and is the number of tikhonov regularization parameters on the other hand by using the incremental algorithm the model selection time complexity is for the whole regularization path we experimentally verify this speedup on cpusmall with repetitions setting and the model selection times measured on server with intelr xeonr cpus and gb of ram are reported in figure the result clearly confirms the beneficial effects of incremental model selection on the computational time predictive performance comparison finally we consider the performance of the algorithm on several large scale benchmark datasets considered in see table has been chosen on the basis of preliminary data analysis and have been chosen by starting from small subsampling values up to mmax and considering after model selection we retrain the best model on the entire training set and compute the rmse on the test set we consider trials reporting the performance mean and standard deviation the results in table compare computational regularization with the following methods as in kernel regularized least squares krls not compatible with large datasets random fourier features rf as in with number of random features fastfood rbf fft and matern kernel as in with random features batch method with uniform sampling and the above results show that the proposed incremental approach behaves really well matching state of the art predictive performances acknowledgments the work described in this paper is supported by the center for brains minds and machines cbmm funded by nsf stc award and by firb project funded by the italian ministry of education university and research references bernhard and alexander smola learning with kernels support vector machines regularization optimization and beyond adaptive computation and machine learning mit press alex smola and bernhard sparse greedy matrix approximation for machine learning in icml pages morgan kaufmann williams and seeger using the method to speed up kernel machines in nips pages mit press ali rahimi and benjamin recht random features for kernel machines in nips pages curran associates yang sindhwani avron and mahoney carlo feature maps for shiftinvariant kernels in icml volume of jmlr proceedings pages quoc le and alexander smola fastfood computing hilbert space expansions in loglinear time in icml volume of jmlr proceedings pages si si hsieh and inderjit dhillon memory efficient kernel approximation in icml volume of jmlr proceedings pages yuchen zhang john duchi and martin wainwright divide and conquer kernel ridge regression in colt volume of jmlr proceedings pages kumar mohri and talwalkar ensemble nystrom method 
in nips pages mu li james kwok and lu making approximation possible in icml pages omnipress kai zhang ivor tsang and james kwok improved approximation and error analysis icml pages acm bo dai bo xie niao he yingyu liang anant raj balcan and le song scalable kernel methods via doubly stochastic gradients in nips pages petros drineas and michael mahoney on the method for approximating gram matrix for improved learning jmlr december gittens and mahoney revisiting the nystrom method for improved machine learning shusen wang and zhihua zhang improving cur matrix decomposition and the approximation via adaptive sampling jmlr petros drineas malik michael mahoney and david woodruff fast approximation of matrix coherence and statistical leverage jmlr michael cohen yin tat lee cameron musco christopher musco richard peng and aaron sidford uniform sampling for matrix approximation in itcs pages acm shusen wang and zhihua zhang efficient algorithms and error analysis for the modified nystrom method in aistats volume of jmlr proceedings pages kumar mohri and talwalkar sampling methods for the method jmlr corinna cortes mehryar mohri and ameet talwalkar on the impact of kernel approximation on learning accuracy in aistats volume of jmlr proceedings pages jin yang mahdavi li and zhou improved bounds for the method with application to kernel classification information theory ieee transactions on oct tianbao yang li mehrdad mahdavi rong jin and zhou method vs random fourier features theoretical and empirical comparison in nips pages francis bach sharp analysis of kernel matrix approximations in colt volume alaoui and mahoney fast randomized kernel methods with statistical guarantees arxiv steinwart and christmann support vector machines springer new york andrea caponnetto and ernesto de vito optimal rates for the regularized algorithm foundations of computational mathematics lo gerfo lorenzo rosasco francesca odone ernesto de vito and alessandro verri spectral algorithms for supervised learning neural computation steinwart hush and scovel optimal rates for regularized least squares regression in colt mendelson and neeman regularization in kernel learning the annals of statistics bauer pereverzev and rosasco on regularization algorithms in learning theory journal of complexity caponnetto and yuan yao adaptive rates for regularization operators in learning theory analysis and applications ying and pontil online gradient descent learning algorithms foundations of computational mathematics alessandro rudi guillermo canas and lorenzo rosasco on the sample complexity of subspace learning in nips pages gene golub and charles van loan matrix computations volume jhu press 
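For reference, the plain Nyström KRLS estimator analysed in the paper above amounts to sampling m centres uniformly without replacement and solving (Knm^T Knm + λ n Kmm) α = Knm^T y, using the pseudoinverse when the system is singular; predictions are then f(x) = Σ_i α_i K(x, x̃_i). The sketch below is a minimal batch (non-incremental) illustration of exactly this computation; the Gaussian-kernel helper and parameter names are ours, and the Cholesky-update incremental variant of the paper is not reproduced here.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def nystrom_krls(X, y, m, lam, sigma, seed=0):
    """Plain Nystrom KRLS: uniform subsampling of m centres, then solve
    (Knm^T Knm + lam * n * Kmm) alpha = Knm^T y via a least-squares (pseudoinverse) solve."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centres = X[rng.choice(n, size=m, replace=False)]
    Knm = gaussian_kernel(X, centres, sigma)
    Kmm = gaussian_kernel(centres, centres, sigma)
    A = Knm.T @ Knm + lam * n * Kmm
    alpha = np.linalg.lstsq(A, Knm.T @ y, rcond=None)[0]
    return lambda Xnew: gaussian_kernel(Xnew, centres, sigma) @ alpha

# Usage: f = nystrom_krls(X_train, y_train, m=200, lam=1e-6, sigma=1.0); y_pred = f(X_test)
```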
infinite factorial dynamical model isabel max planck institute for software systems ivalera francisco department of computer science columbia university fernando universidad carlos iii de madrid and bell labs fernandop lennart svensson department of signals and systems chalmers university of technology abstract we propose the infinite factorial dynamic model ifdm general bayesian nonparametric model for source separation our model builds on the markov indian buffet process to consider potentially unbounded number of hidden markov chains sources that evolve independently according to some dynamics in which the state space can be either discrete or continuous for posterior inference we develop an algorithm based on particle gibbs with ancestor sampling that can be efficiently applied to wide range of source separation problems we evaluate the performance of our ifdm on four applications multitarget tracking cocktail party power disaggregation and multiuser detection our experimental results show that our approach for source separation does not only outperform previous approaches but it can also handle problems that were computationally intractable for existing approaches introduction the central idea behind bayesian nonparametrics bnps is the replacement of classical finitedimensional prior distributions with general stochastic processes allowing for an number of degrees of freedom in model they constitute an approach to model selection and adaptation in which the model complexity is allowed to grow with data size in the literature bnp priors have been applied for time series modeling for example the infinite hidden markov model considers potentially infinite cardinality of the state space and the bnp construction of switching linear dynamical systems lds considers an unbounded number of dynamical systems with transitions among them occurring at any time during the observation period in the context of signal processing the source separation problem has captured the attention of the research community for decades due to its wide range of applications the bnp literature for source separation includes in which the authors introduce the nonparametric counterpart of independent component analysis ica referred as infinite ica iica and where the authors present the markov indian buffet process mibp which places prior over an infinite number of parallel markov chains and is used to build the infinite factorial hidden markov model ifhmm and the ica ifhmm these approaches can effectively adapt the number of hidden sources to fit the available data however they suffer from several limitations the ifhmm is restricted to binary hidden states which may lead to hidden chains that do not match the actual hidden causes and it is not able to deal with states and ii both the iica and the ica ifhmm make independence assumptions between consecutive values of active hidden states which significantly restricts their ability to capture the underlying dynamical models as result we find that existing approaches are not applicable to many source separation both authors contributed equally problems such as multitarget tracking in which each target can be modeled as markov chain with states describing the target trajectory or multiuser detection in which the high cardinality of the hidden states makes this problem computationally intractable for the nonbinary extension of the ifhmm hence there is lack of both general bnp model for source separation and an efficient inference algorithm to address these limitations in 
this paper we provide general bnp framework for source separation that can handle wide range of dynamics and likelihood models we assume potentially infinite number of sources that are modeled as markov chains that evolve according to some dynamical system model we assume that only the active sources contribute to the observations and the states of the markov chains are not restricted to be discrete but they can also be moreover we let the observations depend on both the current state of the hidden sequences and on some previous states this system memory is needed when dealing with applications in which the individual source signals propagate through the air and may thus suffer from some phenomena such as reverberation echo or multipath propagation our approach results in general and flexible dynamic model that we refer to as infinite factorial dynamical model ifdm and that can be particularized to recover other models previously proposed in the literature the binary ifhmm as for most bnp models one of the main challenges of our ifdm is posterior inference in discrete time series models including the ifhmm an approximate inference algorithm based on forwardfiltering ffbs sweeps is typically used however the exact ffbs algorithm has exponential computational complexity with respect to the memory length the ffbs algorithm also becomes computationally intractable when dealing with hidden states that are when active in order to overcome these limitations we develop suitable inference algorithm for our ifdm by building markov chain monte carlo mcmc kernel using particle gibbs with ancestor sampling pgas this algorithm presents quadratic complexity with respect to the memory length and can easily handle broad range of dynamical models the versatility and efficiency of our approach is shown through comprehensive experimental validation in which we tackle four source separation problems multitarget tracking cocktail party power disaggregation and multiuser detection our results show that our ifdm provides meaningful estimations of the number of sources and their corresponding individual signal traces even in applications that previous approaches can not handle it also outperforms in terms of accuracy the ifhmm extended to account for the actual state space cardinality combined with inference in the cocktail party and power disaggregation problems infinite factorial dynamical model in this section we detail our proposed ifdm we assume that there is potentially infinite number of sources contributing to the observed sequence yt and each source is modeled by an underlying dynamic system model in which the state of the source at time denoted by xtm evolves over time as markov chain here the state space can be either discrete or continuous in addition we introduce the auxiliary binary variables stm to indicate whether the source is active at time such that the observations only depend on the active sources we assume that the variables stm follow markov chain and let the states xtm evolve according to xtm the dynamic system model may depend on whether the source is active or inactive we assume dummy states stm for as an example in the cocktail party problem yt denotes sample of the recorded audio signal which depends on the individual voice signals of the active speakers the latent states xtm in this example are and the transition model xtm describes the dynamics of the voice signal in many real applications the individual signals propagate though the air until they are mixed and gathered by the 
receiver in such propagation different phenomena refraction or reflexion of the signal in the walls may occur leading to multipath propagation of the signals and therefore to different delayed copies of the individual signals at the receiver in order to account for this memory effect we consider that the state of the source at time xtm influences not only the observation yt but also the future observations therefore the likelihood of yt depends on the last states of all the markov chains yielding yt yt xtm stm code for these applications can be found at https am bm st xt yt am bm st yt figure graphical representation of the ifdm with memory length the dashed lines represent the memory equivalent representation using extended states where and are matrices containing all the states xtm and stm respectively we remark that the likelihood of yt can not depend on any hidden state xτ if sτ in order to be able to deal with an infinite number of sources we place bnp prior over the binary matrix that contains all variables stm in particular we assume that mibp is distributed as mibp with parameters and the mibp places prior distribution over binary matrices with finite number of rows and an infinite number of columns in which each row represents time instant and each column represents markov chain the mibp ensures that for any finite value of only finite number of columns in are active almost surely whereas the rest of them remain in the state and do not influence the observations we make use of the construction of the mibp which is particularly useful to develop many practical inference algorithms under the construction two hidden variables for each markov chain are introduced representing the transition probabilities between the active and inactive states in particular we define am stm as the transition probability from inactive to active and bm stm as the selftransition probability of the active state of the chain in the representation the columns of are ordered according to their values of am such that and the probability distribution over variables am is given by beta and am am am being the indicator function finally we place beta distribution over the transition probabilities bm of the form bm beta the resulting ifdm model particularized for is shown in figure note that this model can be equivalently represented as shown in figure using the extended states stm with stm xtm stm this extended representation allows for an inference algorithm however the exponential complexity of the ffbs with the memory parameter and with hidden states xtm makes the algorithm intractable in many real scenarios hence we maintain the representation in figure because it allows us to derive an efficient inference algorithm the proposed ifdm in figure can be particularized to resemble some other models that have been proposed in the literature in particular we recover the ifhmm in by choosing the state space xtm stm and ii the ica ifhmm in if we set and assume that xtm xtm is gaussian distribution and iii bnp counterpart of the lds with states by assuming and and letting the variables xtm be gaussian distributed with linear relationships among them inference algorithm we develop an inference algorithm for the proposed ifdm that can handle different dynamic and likelihood models our approach relies on blocked gibbs sampling algorithm that alternates between sampling the number of considered chains and the global variables conditioned on the current value of matrices and and sampling matrices and conditioned on the 
current value of the remaining variables in particular the algorithm proceeds iteratively as follows step add mnew new inactive chains using an auxiliary slice variable and slice sampling method in this step the number of considered chains is increased from its initial value to mnew is not updated because stm for all for the new chains at at example of the connection of particles in pgas we represent particles xiτ for the index aiτ denotes the ancestor particle of xτ it can be seen that the trajectories and only differ at time instant algorithm particle gibbs with ancestor sampling input reference particle for and global variables output sample xout from the pgas markov kernel draw for eq set xp compute the weights for eq for do resampling and ancestor sampling draw ait categorical wtp for compute eti for eq draw ap etp categorical particle propagation ai draw xit rt xt set xp for eq ai set xt for eq weighting compute the weights wti wt for eq draw categorical wt wt return xout pgas algorithm figure particle gibbs with ancestor sampling furthermore they can switch on and off start or stop transmitting at any are allowed to switch on at any position we generate synthetic data in which th step jointly sample xtm and stm of all the considered chains compact move within region of metres where sensors are located on re the representation by removing inactive in the entire observation those chains that remain the state of each target consists of its position an tm period consequently updating tm tm tm tm dimensional plane and we assume linear gaussian dynamic model such that step sample the global variables in the model which include the transition probabilities evolves according to and the emission parameters from their posterior distribution the indian scheme for inference in bnp models based on in step we follow the slice sampling ts buffet process ibp effectively transforms the model into finite factorial model xtm with mnew parallel chains step consists elements of mthe ut ces and given the current value of the global variables here we propose to use pgas algorithm recently developed for inference in models and latent able models each iteration of this algorithm presents quadratic complexity with respect to where ts is the sampling period and ut is vector that mod the memory length avoiding the exponential complexity of the standard ffbs algorithm when noise for each considered target we sample the initial position uniformly in applied over the equivalent model with states in figure details on the pgas approach space and assume thethat initial velocity is gaussian pgas are given in section after running we remove thosethat chains remain inactive in the distributed with zero similarly to we assume the observation of sensor at time is gi whole observation period in step we sample the transition probabilities pand as as signal strength rss needed ytjto dmjt ntj where ntj other variables such variables evaluate the observation stm yt further details on the inference algorithm can be found in the supplementary material term dmjt is the distance between target and sensor at time is the and metres and are respectively the reference distance and the particle gibbs with sampling account for the radio we apply our inference algorithm pgas is method within the frameworkwhich of particle mcmc that propagation combines themodel main ideas period of length in our inference algorithm as well as the strengths of sequential monte carlo and mcmc techniques in contrast to other we sample the noise var 
invgamma as algorithm its prior distribution methods particle gibbs with backward simulation this can also be conveniently applied to latent variable models that are not expressed on models in figure we show the true and inferred trajectories of the targets and the tem form the pgas algorithm is an mcmc and thus generates new sample of the hiddentargets state in way that the position kernel the position error we have sorted the inferred matrices given an initial sample which is the output of the previous iteration of the algorithm is able to detect in this figure we observe that the proposed model and pgas extended to account for mnew new inactive chains the machinery inside the pgas their trajectories with an average position error of around metres we do not co algorithm resembles an ordinary particle filter with two main differences the particles is are not multitarget trackin algorithm because to the best ofone ourofknowledge there deterministically set to the reference input sample and the ancestor of each particle is randomly literature that can deal with targets that may start and stop transmitting at any tim chosen and stored during the algorithm execution we briefly describe the pgas approach below but we refer to for rigorous analysis of the algorithm properties cocktail party we now address blind speech separation task also known specifically we each record multiple people who are simultaneousl in the proposed pgas we assume set of pproblem particles more for each time instant representing the microphones given the recorded signal the goal is to separate out th set of states xtm stm we denote by the vector the state of the particle at time we also speakers may the startparticle speaking becomethe silent at any given time si introduce the ancestor indexes ait signals in order to denote thatorprecedes collect from speakers fromofthe particle at time that is ait corresponds to thedata index of several the ancestor particle xit let also chime speech separati website thethat voice signal for each speaker consists of sentences be the ancestral path of particle xt challenge the particle trajectory is recursively defined as with random pauses in between each sentence we artificially mix the data ti at xt figure an example to clarify the notation http the algorithm is summarized in figure for each time instant we first generate the ancestor indexes for the first particles according to the importance weights given these ant for cestors the particles are then propagated across time according to distribution rt xt simplicity and dropping the global variables from the notation for conciseness we assume that rt xt xt xtm xa sa stm particles are propagated as in figure using simple bootstrap proposal kernel xtm stm the particle is instead deterministically set to the reference particle xp xt whereas the ancestor indexes at are sampled according to some weights indeed this is crucial step that vastly improves the mixing properties of the mcmc kernel we now focus on the computation on the importance weights wti and the ancestor weights for the former the particles are weighted according to wt wt where wt yt rt xt eq implies that in order to obtain the importance being the set of observations yt weights it suffices to evaluate the likelihood at time the weights are given by yτ note that for memoryless models eq can be simplified since the product in the last term is not present and therefore for the computation of the weights in for has computational time complexity scaling as since this 
computation needs to be performed for each time instant and this is the most expensive calculation the resulting algorithm complexity scales as experiments we now evaluate the proposed model and inference algorithm on four different applications which are detailed below and summarized in table for the pgas kernel we use particles in all our experiments additional details on the experiments are given in the supplementary material multitarget tracking in the multitarget tracking problem we aim at locating the position of several moving targets based on noisy observations under general setup varying number of indistinguishable targets are moving around in region appearing at random in space and time multitarget tracking plays an important role in many areas of engineering such as surveillance computer vision and signal processing here we focus on simple synthetic example to show that our proposed ifdm can handle hidden states we place three moving targets within region of metres where sensors are located on square grid the state xtm xtm xtm vtm vtm of each target consists of its position and velocity in two dimensional plane and we assume linear gaussian dynamic model such that while active xtm evolves according to xtm gx gu ut ts where gx ts ts gu ts ts is the sampling period and ut is vector that models the acceleration noise for each considered target we sample the initial position uniformly in the sensor network space and assume that the initial velocity is gaussian distributed with zero mean and covariance following we generate observations based on the received signal strength rss where the measurep ntj here ntj ment of sensor at time is given by ytj stm dmjt is the noise term dmjt is the distance between target and sensor at time is the transmitted power and metres and are respectively the reference distance and the path loss exponent which account for the radio propagation model in our inference algorithm we sample the noise variance by placing an invgamma distribution as its prior here we compare application model multitarget tracking cocktail party power dissagregation multiuser detection infinite factorial lds ica ifhmm ifhmm xtm xtm gu xtm σx am jk xtm table applications of the ifdm target target target inferred target inferred target inferred target sensors target target target error target target target average target trajectories ifdm model average position error time position error figure results for the multitarget tracking problem the performance of the ifdm with finite factorial model with perfect knowledge of the number of targets and noise variance in figures and we show the true and inferred trajectories of the targets and the temporal evolution of the position error of the ifdm additionally figure shows the average position error in absolute value for our ifdm and the method in these figures we observe that the proposed model and algorithm is able to detect the three targets and their trajectories providing similar performance to the method in particular both approaches provide average position errors of around metres which is thrice the noise variance cocktail party we now address blind speech separation task also known as the cocktail party problem given the recorded audio signals from set of microphones the goal is to separate out the individual speech signals of multiple people who are speaking simultaneously speakers may start speaking or become silent at any time similarly to we collect data from several speakers from the pascal chime speech separation and 
recognition challenge the voice signal for each speaker consists of sentences which we append with random pauses in between each sentence we artificially mix the data times corresponding to microphones with mixing weights sampled from uniform such that each microphone receives linear combination of all the considered signals corrupted by gaussian noise with standard deviation we consider two scenarios with and speakers and subsample the data so that we learn from and datapoints respectively following our model assumes xtm xtm and xtm whenever stm we also model yt as linear combination of all the voice signals under gaussian noise yt wm xtm nt where nt is the noise term wm is the weighting vector associated to the speaker and invgamma we compare our ifdm with the ica ifhmm in using ffbs sweeps for inference with xtm xtm denoted as and ii xtm laplace xtm denoted as for the scenario with speakers we show the true and the inferred after iteration number of speakers in figures and along with their activities during the observation period in order to quantitatively evaluate the performance of the different algorithms we show in figure top the activity detection error rate ader which is computed as the probability of detecting activity inactivity of speaker while that speaker is actually inactive active as the algorithms are unsupervised we sort the estimated chains so that the ader is minimized if the inferred number of speakers is smaller larger than the true number of speakers we consider some extra inferred inactive chains additional speakers the approach outperforms the two methods because it can jointly sample the states of all chains speakers for each time instant whereas the ffbs requires sampling each chain conditioned on the current states of the other chains leading to poor mixing as discussed in as consequence the ffbs tends to overestimate the number of speakers as shown in figure bottom http method ader pgas pgas ground truth pgas of speakers ader inferred figure results for the cocktail party problem algorithm pgas ffbs redd stands for house algorithm pgas ffbs day day amp table accuracy for the power disaggregation problem power disaggregation given the aggregate power consumption signal the power disaggregation problem consists in estimating both the number of active devices in the house and the power draw of each individual device we validate the performance of the ifdm on two different real databases the reference energy disaggregation data set redd and the almanac of minutely power dataset amp for the amp database we consider two segments and devices for the redd database we consider segment across houses and devices our model assumes that each device can take different states one inactive state and three active states with different power consumption xtm with xtm if stm we place symmetric dirichlet prior over the transition probability vectors of the form am dirichlet where each element ajk xtm when xtm the power consumption of device at time is zero and when xtm its average power consumption is given by pxmtm thus the total power consumption is given by yt pxtm nt where nt represents the additive gaussian noise for we assume prior power consumption pqm in this case the proposed model for the ifdm resembles ifhmm and therefore we can also apply the ffbs algorithm to infer the power consumption draws of each device in order to evaluate the performance of the different algorithms we compute the mean accuracy of the estimated consumption of each device higher is better acc pt pm 
pt where xt and pxmtm are respectively the true and the estimated power consumption by device at time in order to compute the accuracy we assign each estimated chain to device so that the accuracy is maximized if the inferred number of devices is smaller than the true number of devices we use for the undetected devices if is larger than the true number unk of devices we group all the extra chains as an unknown device and use xt in table we show the results provided by both algorithms the pgas approach outperforms the ffbs algorithm in the five houses of the redd database and the two selected days of the amp database this occurs because the pgas can simultaneously sample the hidden states of all devices for each time instant whereas the ffbs requires conditioning on the current states of all but one device multiuser detection we now consider digital communication system in which users are allowed to enter or leave the system at any time and several receivers cooperate to estimate the number of users the digital symbols they transmit and the propagation channels they face multipath propagation affects the radio signal thus causing interference to capture this phenomenon in our model we use in this application we consider multiuser communication system and we use ray tracing algorithm wise software to design realistic indoor wireless system in an office located at bell labs crawford hill we place receivers and transmitters across the office in the positions respectively marked with circles and crosses in figure all transmitters and receivers are placed at height of metres transmitted symbols belong to quadrature such that while active the transmitted symbols shift keying qpsk constellation are independent and uniformly distributed in xtm figure plane of the office building at bell labs crawford hill model ifdm ifhmm recovered transmitters inferred model ifdm ifhmm mse of the channel coefficients table results for the multiuser detection problem the observations of all the receivers are weighted replicas of the transmitted symbols under noise pl yt nt where xtm for the inactive states and the chanm nel coefficients and noise variance are provided by wise software for inference we assume channels and therefore we place circularly symmetric complex gaussian prior distribution over the channel coefficients hm cn and over the noise term nt cn we place an inverse gamma prior over with mean and standard deviation the choice of this particular prior is based on the assumption that the channel coefficients hm are priori expected to decay with the memory index since the radio signal suffers more attenuation as it propagates through the walls or bounces off them we use an observation period and vary from to five channel taps correspond to the radio signal travelling distance of which should be enough given the dimensions of this office space we compare our ifdm with ifhmm model with state space cardinality using ffbs sweeps for inference we do not run the ffbs algorithm for due to its computational complexity we show in table the number of recovered transmitters the number of transmitters for which we recover all the transmitted symbols with no error found after running the inference algorithms together with the inferred value of we see that the ifhmm tends to overestimate the number of transmitters which deteriorates the overall symbol estimates and as consequence not all the transmitted symbols are recovered we additionally report in table the mse of the first channel bm bm tap being the inferred 
channel coefficients we sort the transmitters so that the mse is minimized and ignore the extra inferred transmitters in general the ifdm outperforms the ifhmm approach as discussed above under our ifdm the mse decreases as we consider larger value of since the model better fits the actual radio propagation model conclusions we have proposed general bnp approach to solve source separation problems in which the number of sources is unknown our model builds on the mibp to consider potentially unbounded number of hidden markov chains that evolve independently according to some dynamics in which the state space can be either discrete or continuous for posterior inference we have developed an algorithm based on pgas that solves the intractable complexity that the ffbs presents in many scenarios enabling the application of our ifdm in problems such as multitarget tracking or multiuser detection in addition we have shown empirically that our pgas approach outperforms the algorithm in terms of accuracy in the cocktail party and power disaggregation problems since the ffbs gets more easily trapped in local modes of the posterior in which several markov chains correspond to single hidden source acknowledgments valera is currently supported by the humboldt research fellowship for postdoctoral researchers program and acknowledges the support of plan of comunidad de madrid ruiz is supported by an fpu fellowship from the spanish ministry of education this work is also partially supported by ministerio de of spain projects comprehension id and alcit id by comunidad de madrid project id by the office of naval research onr and by the european union framework programme through the marie curie initial training network machine learning for personalized medicine grant no references andrieu doucet and holenstein particle markov chain monte carlo methods journal of the royal statistical society series beal ghahramani and rasmussen the infinite hidden markov model in advances in neural information processing systems volume fortune gay kernighan landron valenzuela and wright wise design of indoor wireless systems practical computation and optimization ieee computing in science engineering march fox sudderth jordan and willsky bayesian nonparametric methods for learning markov switching processes ieee signal processing magazine fox sudderth jordan and willsky sticky with application to speaker diarization annals of applied statistics jiang singh and yıldırım bayesian tracking and parameter learning for multiple target tracking models arxiv preprint johnson and willsky bayesian nonparametric hidden models journal of machine learning research february jordan hierarchical models nested models and completely random measures springer new york ny kalman new approach to linear filtering and prediction problems asme journal of basic engineering series knowles and ghahramani nonparametric bayesian sparse factor models with application to gene expression modeling the annals of applied statistics june kolter and jaakkola approximate inference in additive factorial hmms with application to energy disaggregation in international conference on artificial intelligence and statistics pages lim and chong multitarget tracking by particle filtering based on rss measurement in wireless sensor networks international journal of distributed sensor networks march lindsten jordan and particle gibbs with ancestor sampling journal of machine learning research lindsten and backward simulation methods for monte carlo statistical inference 
foundations and trends in machine learning
makonin popowich bartram gill and bajic ampds public dataset for load disaggregation and research in proceedings of the ieee electrical power and energy conference epec
oh russell and sastry markov chain monte carlo data association for general tracking problems in ieee conference on decision and control volume pages dec
orbanz and teh bayesian nonparametric models in encyclopedia of machine learning springer
vehtari and lampinen particle filter for multiple target tracking information fusion
teh and ghahramani construction for the indian buffet process in proceedings of the international conference on artificial intelligence and statistics volume
teh jordan beal and blei hierarchical dirichlet processes journal of the american statistical association
thouin nannuru and coates tracking for measurement models with additive contributions in proceedings of the international conference on information fusion fusion pages july
titsias and yau hamming ball auxiliary sampling for factorial hidden markov models in advances in neural information processing systems
van gael teh and ghahramani the infinite factorial hidden markov model in advances in neural information processing systems volume
and user activity tracking in systems ieee transactions on vehicular technology
whiteley andrieu and doucet efficient bayesian inference for switching models using particle markov chain monte carlo methods technical report bristol statistics research report
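As a concrete companion to the multiuser detection setup above, the following minimal sketch (Python/NumPy, not the authors' code) simulates only the forward model: QPSK symbols that are zero while a transmitter is inactive, and a received signal formed as weighted, delayed replicas of the transmitted symbols under circularly symmetric complex Gaussian noise. It uses a single receiver for brevity (the experiment above uses several cooperating receivers), and the activity probability, the number of channel taps, and all helper names are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def qpsk_symbols(T, p_active=0.9):
    """Unit-energy QPSK symbols; time steps where the transmitter is
    inactive carry the symbol 0, as in the model above."""
    phases = rng.integers(0, 4, size=T) * (np.pi / 2) + np.pi / 4
    x = np.exp(1j * phases)
    x[rng.random(T) > p_active] = 0.0
    return x

def receive(X, H, noise_var):
    """Single-receiver observations as weighted, delayed replicas of the
    transmitted symbols under complex Gaussian noise.

    X: (M, T) symbols of the M transmitters
    H: (M, L) complex channel taps per transmitter (L = multipath memory)
    returns y of shape (T,) with y[t] = sum_m sum_l H[m, l] * X[m, t - l] + n[t]
    """
    M, T = X.shape
    L = H.shape[1]
    y = np.zeros(T, dtype=complex)
    for m in range(M):
        for l in range(L):
            y[l:] += H[m, l] * X[m, :T - l]
    noise = rng.standard_normal(T) + 1j * rng.standard_normal(T)
    return y + np.sqrt(noise_var / 2.0) * noise

# toy usage: 3 transmitters, 5 channel taps, 200 observation steps
X = np.stack([qpsk_symbols(200) for _ in range(3)])
H = (rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))) * 0.5 ** np.arange(5)
y = receive(X, H, noise_var=0.1)
```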
regularization path of error lower bounds atsushi shibagaki yoshiki suzuki masayuki karasuyama and ichiro takeuchi nagoya institute of technology nagoya japan karasuyama abstract careful tuning of regularization parameter is indispensable in many machine learning tasks because it has significant impact on generalization performances nevertheless current practice of regularization parameter tuning is more of an art than science it is hard to tell how many would be needed in cv for obtaining solution with sufficiently small cv error in this paper we propose novel framework for computing lower bound of the cv errors as function of the regularization parameter which we call regularization path of cv error lower bounds the proposed framework can be used for providing theoretical approximation guarantee on set of solutions in the sense that how far the cv error of the current best solution could be away from best possible cv error in the entire range of the regularization parameters our numerical experiments demonstrate that theoretically guaranteed choice of regularization parameter in the above sense is possible with reasonable computational costs introduction many machine learning tasks involve careful tuning of regularization parameter that controls the balance between an empirical loss term and regularization term regularization parameter is usually selected by comparing the cv errors at several different regularization parameters although its choice has significant impact on the generalization performances the current practice is still more of an art than science for example in commonly used it is hard to tell how many grid points we should search over for obtaining sufficiently small cv error in this paper we introduce novel framework for class of regularized binary classification problems that can compute regularization path of cv error lower bounds for an we define εapproximate regularization parameters to be set of regularization parameters such that the cv error of the solution at the regularization parameter is guaranteed to be no greater by than the best possible cv error in the entire range of regularization parameters given set of solutions obtained for example by the proposed framework allows us to provide theoretical guarantee of the current best solution by explicitly quantifying its approximation level in the above sense furthermore when desired approximation level is specified the proposed framework can be used for efficiently finding one of the regularization parameters the proposed framework is built on novel cv error lower bound represented as function of the regularization parameter and this is why we call it as regularization path of cv error lower bounds our cv error lower bound can be computed by only using finite number of solutions obtained by arbitrary algorithms it is thus easy to apply our framework to common regularization parameter tuning strategies such as or bayesian optimization furthermore the proposed framework can be used not only with exact optimal solutions but also with sufficiently good approximate figure an illustration of the proposed framework one of our algorithms presented in automatically selected regularization parameter values in and an upper bound of the validation error for each of them is obtained by solving an optimization problem approximately among those values the one with the smallest validation error upper bound indicated as at is guaranteed to be approximate regularization parameter in the sense that the validation error for the 
regularization parameter is no greater by than the smallest possible validation error in the whole interval see for the setup see also figure for the results with other options tions which is computationally advantageous because completely solving an optimization problem is often much more costly than obtaining reasonably good approximate solution our main contribution in this paper is to show that theoretically guaranteed choice of regularization parameter in the above sense is possible with reasonable computational costs to the best of our knowledge there is no other existing methods for providing such theoretical guarantee on cv error that can be used as generally as ours figure illustrates the behavior of the algorithm for obtaining approximate regularization parameter see for the setup related works optimal regularization parameter can be found if its exact regularization path can be computed exact regularization path has been intensively studied but they are known to be numerically unstable and do not scale well furthermore exact regularization path can be computed only for limited class of problems whose solutions are written as functions of the regularization parameter our framework is much more efficient and can be applied to wider classes of problems whose exact regularization path can not be computed this work was motivated by recent studies on approximate regularization path these approximate regularization paths have property that the objective function value at each regularization parameter value is no greater by than the optimal objective function value in the entire range of regularization parameters although these algorithms are much more stable and efficient than exact ones for the task of tuning regularization parameter our interest is not in objective function values but in cv errors our approach is more suitable for regularization parameter tuning tasks in the sense that the approximation quality is guaranteed in terms of cv error as illustrated in figure we only compute finite number of solutions but still provide approximation guarantee in the whole interval of the regularization parameter to ensure such property we need novel cv error lower bound that is sufficiently tight and represented as monotonic function of the regularization parameter although several cv error bounds mostly for cv of svm and other similar learning frameworks exist none of them satisfy the above required properties the idea of our cv error bound is inspired from recent studies on safe screening see appendix for the detail furthermore we emphasize that our contribution is not in presenting new generalization error bound but in introducing practical framework for providing theoretical guarantee on the choice of regularization parameter although generalization error bounds such as structural risk minimization might be used for rough tuning of regularization parameter they are known to be too loose to use as an alternative to cv see in we also note that our contribution is not in presenting new method for regularization parameter tuning such as bayesian optimization random search and search as we demonstrate in experiments our approach can provide theoretical approximation guarantee of the regularization parameter selected by these existing methods problem setup we consider linear binary classification problems let xi yi rd be the training set where is the size of the training set is the input dimension and an independent validation set with size is denoted similarly as rd linear decision function 
is written as where rd is vector of coefficients and represents the transpose we assume the availability of validation set only for simplifying the exposition all the proposed methods presented in this paper can be straightforwardly adapted to setup furthermore the proposed methods can be kernelized if the loss function satisfies certain condition in this paper we focus on the following class of regularized convex loss minimization problems wc arg min yi xi where is the regularization parameter and is the euclidean norm the loss function is denoted as we assume that is convex and subdifferentiable in the argument examples of such loss functions include logistic loss hinge loss loss etc for notational convenience we denote the individual loss as ℓi yi xi for all the optimal solution for the regularization parameter is explicitly denoted as wc we assume that the regularization parameter is defined in finite interval cℓ cu cℓ and cu as we did in the experiments for solution rd the validation is defined as ev where is the indicator function in this paper we consider two problem setups the first lem setup is given set of either optimal or approximate solutions wc wc at different regularization parameters ct cℓ cu to compute the approximation level such that min ev wc ct ct where min ev wc cl cu by which we can find how accurate our search typically is in sense of the deviation of the achieved validation error from the true minimum in the range the second problem setup is given the approximation level to find an regularization parameter within an interval cl cu which is defined as an element of the following set cl cu ev wc our goal in this second setup is to derive an efficient exploration procedure which achieves the specified validation approximation level these two problem setups are both common scenarios in practical data analysis and can be solved by using our proposed framework for computing path of validation error lower bounds validation error lower bounds as function of regularization parameter in this section we derive validation error lower bound which is represented as function of the regularization parameter our basic idea is to compute lower and an upper bound of the inner product score wc xi for each validation input as function of the regularization eter for computing the bounds of wc xi we use solution either optimal or approximate for different regularization parameter score bounds we first describe how to obtain lower and an upper bound of inner product score wc xi based on an approximate solution at different regularization parameter lemma let be an approximate solution of the problem for regularization parameter value and ξi be subgradient of ℓi at such that subgradient of the objective function is ξi for simplicity we regard validation instance whose score is exactly zero is correctly classified in hereafter we assume that there are no validation instances whose input vector is completely because those instances are always correctly classified according to the definition in then for any the score wc xi satisfies if wc xi lb wc xi if if wc xi wc xi if where xi xi the proof is presented in appendix lemma tells that we have lower and an upper bound of the score wc xi for each validation instance that linearly change with the regularization parameter when is optimal it can be shown that see proposition in there exists subgradient such that meaning that the bounds are tight because corollary when the score xi for the regularization parameter value itself satisfies xi xi xi xi 
xi xi the results in corollary are obtained by simply substituting into and validation error bounds given lower and an upper bound of the score of each validation instance lower bound of the validation error can be computed by simply using the following facts and wc xi lb wc xi and furthermore since the bounds in lemma linearly change with the regularization parameter we can identify the interval of within which the validation instance is guaranteed to be lemma for validation instance with if or then the validation instance is similarly for validation instance with if or xi xi then the validation instance is this lemma can be easily shown by applying to as direct consequence of lemma the lower bound of the validation error is represented as function of the regularization parameter in the following form theorem using an approximate solution for regularization parameter the validation error ev wc for any satisfies ev wc lb ev wc xi xi xi xi yi yi algorithm computing the approximation level from the given set of solutions input xi yi cl cu evbest ev lb cl cu lb ev output evbest lb the lower bound is staircase function of the regularization parameter remark we note that our validation error lower bound is inspired from recent studies on safe screening which identifies sparsity of the optimal solutions before solving the optimization problem key technique used in those studies is to bound lagrange multipliers at the optimal and we utilize this technique to prove lemma which is core of our framework by setting we can obtain lower and an upper bound of the validation error for the regularization parameter itself which are used in the algorithm as stopping criteria for obtaining an approximate solution corollary given an approximate solution the validation error ev satisfies ev lb ev xi xi xi xi yi yi ev ev xi xi xi xi yi yi algorithm in this section we present two algorithms for each of the two problems discussed in due to the space limitation we roughly describe the most fundamental forms of these algorithms details and several extensions of the algorithms are presented in supplementary appendices and problem setup computing the approximation level from given set of solutions given set of either optimal or approximate solutions obtained by ordinary our first problem is to provide theoretical approximation level in the sense of this problem can be solved easily by using the validation error lower bounds developed in the algorithm is presented in algorithm where we compute the current best validation error evbest in line and lower bound of the best possible validation error cℓ cu ev wc in line then the approximation level can be simply obtained by subtracting the latter from the former we note that lb the lower bound of can be easily computed by using evaluation error lower bounds lb ev wc because they are staircase functions of problem setup finding an regularization parameter given desired approximation level such as our second problem is to find an εapproximate regularization parameter to this end we develop an algorithm that produces set of optimal or approximate solutions such that if we apply algorithm to this sequence then approximation level would be smaller than or equal to algorithm is the of this algorithm it computes approximate solutions for an increasing sequence of regularization parameters in the main loop lines when we only have approximate solutions eq is slightly incorrect the first term of the of should be ev let us now consider tth iteration in the main loop where we have 
already computed apalgorithm finding an approximate regularproximate solutions for ization parameter with approximate solutions input xi yi cl cu at this point cl best cl evbest best arg min ev while cu do approximately for is the best in regularization compute ev by eter obtained so far and it is guaranteed to be an regularization parameter in the if ev evbest then interval cl in the sense that the validation best ev ev error best evbest min ev end if set by is shown to be at most greater by than the end while smallest possible validation error in the interoutput best val cl however we are not sure whether best can still keep property for thus in line we approximately solve the optimization problem at and obtain an approximate solution note that the approximate solution must be sufficiently good enough in the sense that ev lb ev is sufficiently smaller than typically if the upτ per bound of the validation error ev is smaller than evbest we update evbest and best lines our next task is to find in such way that best is an regularization parameter in the interval cl using the validation error lower bound in theorem the task is to find the smallest that violates evbest lb ev wc cu in order to formulate such let us define xi lb xi furthermore let and denote the th element of as th for any natural number then the smallest that violates is given as lb ev th experiments in this section we present experiments for illustrating the proposed methods table summarizes the datasets used in the experiments they are taken from libsvm dataset repository all the input features except and were standardized to for illustrative results the instances were randomly divided into training and validation sets in roughly equal sizes for quantitative results we used cv we used huber hinge loss which is convex and subdifferentiable with respect to the second argument the proposed methods are free from the choice of optimization solvers in the experiments we used an optimization solver described in which is also implemented in liblinear software our slightly modified code we use and as they are for exploiting sparsity ionosphere australian figure illustrations of algorithm on three benchmark datasets the plots indicate how the approximation level improves as the number of solutions increases in red bayesian optimization blue and our own method green see the main text without tricks without tricks with tricks and figure illustrations of algorithm on ionosphere dataset for with with and with respectively figure also shows the result for with for adaptation to huber hinge loss is provided as supplementary material and is also available on https whenever possible we used warmstart approach when we trained new solution we used the closest solutions trained so far either approximate or optimal ones as the initial starting point of the optimizer all the computations were conducted by using single core of an hp workstation xeon cpu mem in all the experiments we set cℓ and cu results on problem we applied algorithm in to set of solutions obtained by gridsearch bayesian optimization bo with expected improvement acquisition function and adaptive search with our framework which sequentially computes solution whose validation lower bound is smallest based on the information obtained so far figure illustrates the results on three datasets where we see how the approximation level in the vertical axis changes as the number of solutions in our notation increases in as we increase the grid points the approximation level tends to be 
improved since bo tends to focus on small region of the regularization parameter it was difficult to tightly bound the approximation level we see that the adaptive search using our framework straightforwardly seems to offer slight improvement from results on problem we applied algorithm to benchmark datasets for demonstrating theoretically guaranteed choice of regularization parameter is possible with reasonable computational costs besides the algorithm presented in we also tested variant described in supplementary appendix specifically we have three algorithm options in the first option we used timal solutions for computing cv error lower bounds in the second option we instead used approximate solutions in the last option we additionally used tricks described in supplementary appendix we considered four different choices of note that indicates the task of finding the exactly optimal table computational costs for each of the three options and the number of optimization problems solved denoted as and the total computational costs denoted as time are listed note that for there are no results for using time sec using time sec using tricks time sec using time sec using time sec using tricks time sec ev ev ev ev ev ev ization parameter in some datasets the smallest validation errors are less than or in which cases we do not report the results indicated as ev in we initially computed solutions at four different regularization parameter values evenly allocated in in the logarithmic scale in the next regularization parameter was set by replacing in with see supplementary appendix for the purpose of illustration we plot examples of validation error curves in several setups figure shows the validation error curves of ionosphere dataset for several options and table shows the number of optimization problems solved in the algorithm denoted as and the total computation time in cv setups the computational costs mostly depend on which gets smaller as increases two tricks in supplementary appendix was effective in most cases for reducing in addition we see the advantage of using approximate solutions by comparing the computation times of and though this strategy is only for overall the results suggest that the proposed algorithm allows us to find theoretically guaranteed approximate regularization parameters with reasonable costs except for cases for example the algorithm found an approximate regularization parameter within minute in cv for dataset with more than instances see the results on for with and in table table benchmark datasets used in the experiments dataset name heart ionosphere australian diabetes sample size input dimension dataset name sample size input dimension conclusions and future works we presented novel algorithmic framework for computing cv error lower bounds as function of the regularization parameter the proposed framework can be used for theoretically guaranteed choice of regularization parameter additional advantage of this framework is that we only need to compute set of sufficiently good approximate solutions for obtaining such theoretical guarantee which is computationally advantageous as demonstrated in the experiments our algorithm is practical in the sense that the computational cost is reasonable as long as the approximation quality is not too close to an important future work is to extend the approach to multiple tuning setups references efron hastie johnstone and tibshirani least angle regression annals of statistics hastie rosset tibshirani and zhu the entire 
regularization path for the support vector machine journal of machine learning research
rosset and zhu piecewise linear regularized solution paths annals of statistics
giesen mueller laue and swiercy approximating concavely parameterized optimization problems in advances in neural information processing systems
giesen jaggi and laue approximating parameterized convex optimization problems acm transactions on algorithms
giesen laue and wieschollek robust and efficient kernel hyperparameter paths with guarantees in international conference on machine learning
mairal and yu complexity analysis of the lasso regularization path in international conference on machine learning
vapnik and chapelle bounds on error expectation for support vector machines neural computation
joachims estimating the generalization performance of svm efficiently in international conference on machine learning
chung kao sun wang and lin radius margin bounds for support vector machines with the rbf kernel neural computation
lee keerthi ong and decoste an efficient method for computing error in support vector machines with gaussian kernels ieee transactions on neural networks
el ghaoui viallon and rabbani safe feature elimination in sparse supervised learning pacific journal of optimization
xiang xu and ramadge learning sparse representations of high dimensional data on large scale dictionaries in advances in neural information processing systems
ogawa suzuki and takeuchi safe screening of vectors in pathwise svm computation in international conference on machine learning
liu zhao wang and ye safe screening with variational inequalities and its application to lasso in international conference on machine learning volume
wang zhou liu wonka and ye safe screening rule for sparse logistic regression in advances in neural information processing systems
vapnik the nature of statistical learning theory springer
and understanding machine learning cambridge university press
snoek larochelle and adams practical bayesian optimization of machine learning algorithms in advances in neural information processing systems
bergstra and bengio random search for optimization journal of machine learning research
chapelle vapnik bousquet and mukherjee choosing multiple parameters for support vector machines machine learning
bertsekas nonlinear programming athena scientific
chang and lin libsvm library for support vector machines acm transactions on intelligent systems and technology
chapelle training support vector machine in the primal neural computation
lin weng and keerthi trust region newton method for logistic regression the journal of machine learning research
fan chang and hsieh liblinear library for large linear classification the journal of machine learning research
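Because the score bounds of the lemma are garbled above, the sketch below (Python/NumPy, not the authors' implementation) abstracts them behind callables and only illustrates the logic of the first algorithm: take the smallest validation error upper bound achieved by any of the available solutions, take the pointwise maximum of the per-solution lower bound functions over the interval [cl, cu], and report the difference as the approximation level. The grid, the function names, and the calling convention are assumptions of the sketch.

```python
import numpy as np

def approximation_level(C_grid, lb_funcs, ub_values):
    """Sketch of the first algorithm above: given solutions obtained at a few
    regularization parameters, report how far the best of them can be from
    the best possible validation error over the whole interval.

    C_grid:    grid covering [C_l, C_u]; the true lower bounds are staircase
               functions of C, so a fine grid is only a convenience here
    lb_funcs:  one callable LB_t(C) per available solution, each a valid
               lower bound on the validation error E_v(w_C) for every C
    ub_values: validation-error upper bounds of the solutions, evaluated at
               their own regularization parameters
    """
    E_best = min(ub_values)                    # current best validation error
    # pointwise best (largest) lower bound over the available solutions
    lb_path = np.max([[lb(C) for C in C_grid] for lb in lb_funcs], axis=0)
    LB_star = float(lb_path.min())             # lower bound on the best possible error
    return E_best, LB_star, E_best - LB_star   # the last entry is epsilon
```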
attractor network dynamics enable preplay and rapid path planning in environments wulfram gerstner laboratory of computational neuroscience polytechnique de lausanne lausanne switzerland dane corneil laboratory of computational neuroscience polytechnique de lausanne lausanne switzerland abstract rodents navigating in environment can rapidly learn and revisit observed reward locations often after single trial while the mechanism for rapid path planning is unknown the region in the hippocampus plays an important role and emerging evidence suggests that place cell activity during hippocampal preplay periods may trace out future trajectories here we show how particular mapping of space allows for the immediate generation of trajectories between arbitrary start and goal locations in an environment based only on the mapped representation of the goal we show that this representation can be implemented in neural attractor network model resulting in activity profiles resembling those of the region of hippocampus neurons tend to locally excite neurons with similar place field centers while inhibiting other neurons with distant place field centers such that stable bumps of activity can form at arbitrary locations in the environment the network is initialized to represent point in the environment then weakly stimulated with an input corresponding to an arbitrary goal location we show that the resulting activity can be interpreted as gradient ascent on the value function induced by reward at the goal location indeed in networks with large place fields we show that the network properties cause the bump to move smoothly from its initial location to the goal around obstacles or walls our results illustrate that an attractor network with attributes may be important for rapid path planning introduction while early human case studies revealed the importance of the hippocampus in episodic memory the discovery of place cells in rats established its role for spatial representation recent results have further suggested that along with these functions the hippocampus is involved in active spatial planning experiments in learning have revealed the critical role of the region and the intermediate hippocampus in returning to goal locations that the animal has seen only once this poses the question of whether and how hippocampal dynamics could support representation of the current location representation of goal and the relation between the two in this article we propose that model of as bump attractor can be be used for path planning the attractor map represents not only locations within the environment but also the spatial relationship between locations in particular broad activity profiles like those found in intermediate and ventral hippocampus can be viewed as condensed map of particular environment the planned path presents as rapid sequential activity from the current position to the goal location similar to the preplay observed experimentally in hippocampal activity during navigation tasks including paths that require navigating around obstacles in the model the activity is produced by supplying input to the network consistent with the sensory input that would be provided at the goal site unlike other recent models of rapid goal learning and path planning there is no backwards diffusion of value signal from the goal to the current state during the learning or planning process instead the sequential activity results from the representation of space in the attractor network even in the presence of obstacles the 
recurrent structure in our model is derived from the successor representation which represents space according to the number and length of paths connecting different locations the resulting network can be interpreted as an attractor manifold in space where the dimensions correspond to weighted version of the most relevant eigenvectors of the environment transition matrix such functions have recently found support as viable basis for place cell activity we show that when the attractor network operates in this basis and is stimulated with goal location the network activity traces out path to that goal thus the bump attractor network can act as spatial path planning system as well as spatial memory system the successor representation and key problem in reinforcement learning is assessing the value of particular state given the expected returns from that state in both the immediate and distant future several algorithms exist for solving this task but they are slow to adjust when the reward landscape is rapidly changing the successor representation proposed by dayan addresses this issue given markov chain described by the transition matrix where each element gives the probability of transitioning from state to state in single time step reward vector where each element gives the expected immediate returns from state and discount factor the expected returns from each state can be described by γpr γp lr the successor representation provides an efficient means of representing the state space according to the expected discounted future occupancy of each state given that the chain is initialized from state an agent employing policy described by the matrix can immediately update the value function when the reward landscape changes without any further exploration the successor representation is particularly useful for representing many reward landscapes in the same state space here we consider the set of reward functions where returns are confined to single state where denotes the kronecker delta function and the index denotes particular goal state from eq we see that the value function is then given by the column of the matrix indeed when we consider only single goal we can see the elements of as we will use this property to generate spatial mapping that allows for rapid approximation of the shortest path between any two points in an environment representing space using the successor representation in the spatial navigation problems considered here we assume that the animal has explored the environment sufficiently to learn its natural topology we represent the relationship between locations with gaussian affinity metric given states and in the plane their affinity is exp where is the length of the shortest traversable path between and respecting walls and obstacles we define to be small enough that the metric is localized fig such that resembles small bump in space truncated by walls normalizing the affinity metric gives the normalized metric can be interpreted as transition probability for an agent exploring the environment randomly in this case spectral analysis of the successor representation gives γλl ψl ψl where ψl are the right eigenvectors of the transition matrix are the eigenvalues and denotes the occupancy of state resulting from although the affinity metric is defined locally features of the environment are represented in the eigenvectors associated with the largest eigenvalues fig we now express the position in the space using set of successor coordinates such that γλq ψq ξq where ξl γλl 
ψl this is similar to the diffusion map framework by coifman and lafon with the useful property that if the value of given state when considering given goal is proportional to the scalar product of their respective mappings we will use this property to show how network operating in the successor coordinate space can rapidly generate prospective trajectories between arbitrary locations note that the mapping can also be defined using the eigenvectors φl of related measure of the space the normalized graph laplacian the eigenvectors φl serve as the objective functions for slow feature analysis and approximations have been extracted through hierarchical slow feature analysis on visual data where they have been used to generate place behaviour using the successor coordinate mapping successor coordinates provide means of mapping set of locations in environment to new space based on the topology of the environment in the new representation the value landscape is particularly simple to move from location towards goal position we can consider constrained gradient ascent procedure on the value landscape arg min arg min where has been absorbed into the parameter at each time step the state closest to an incremental ascent of the value gradient is selected amongst all states in the environment in the following we will consider how the step can be approximated by neural attractor network acting in successor coordinate space due to the properties of the transition matrix is constant across the state space and does not contribute to the value gradient in eq as such we substituted free parameter for the coefficient which controlled the overall level of activity in the network simulations encoding successor coordinates in an attractor network the bump attractor network is common model of place cell activity in the hippocampus neurons in the attractor network strongly excite other neurons with similar place field centers and weakly inhibit the neurons within the network with distant place field centers as result the network allows stable bump of activity to form at an arbitrary location within the environment figure left rat explores environment and passively learns its topology we assume process such as hierarchical slow feature analysis that preliminarily extracts slowly changing functions in the environment here the vectors ξq the vector for the maze is shown in the top left in practice we extracted the vectors directly from localized gaussian transition function bottom center for an arbitrary location right this basis can be used to generate value map approximation over the environment for given reward goal position and discount factor inset due to the walls the function is highly discontinuous in the xy spatial dimensions the goal position is circled in white in the scatter plot the same array of states and value function are shown in the first two successor coordinate dimensions in this space the value function is proportional to the scalar product between the states and the goal location the grey and black dots show corresponding states between the inset and the scatter plot such networks typically represent periodic toroidal environment using local excitatory weight profile that falls off exponentially here we show how the spatial mapping of eq can be used to represent bounded environments with arbitrary obstacles the resulting recurrent weights induce stable firing fields that decrease with distance from the place field center around walls and obstacles in manner consistent with experimental 
observations in addition the network dynamics can be used to perform rapid path planning in the environment we will use the techniques introduced in the attractor network models by eliasmith and anderson to generalize the bump attractor we first consider purely network composed of population of neurons with place field centers scattered randomly throughout the environment we assume that the input is highly preprocessed potentially by several layers of neuronal processing in fig and given directly by units whose activities ξk represent the input in the successor coordinate dimensions introduced above the activity ai of neuron in response to the inputs can be described by ff dai in wik dt ff where is gain factor represents rectified linear function and wik are the weights each neuron is particularly responsive to bump in the environment given by its encoding vector ei the normalized successor coordinates of particular point in space which corresponds to its place field center the input to neuron in the network is then given by ff wik ei in wik ei neuron is therefore maximally active when the input coordinates are nearly parallel to its encoding vector although we assume the input is given directly in the basis vectors ξl for convenience neural encoding using an over complete basis based on linear combination of the eigenvectors ψl or φl is also possible given corresponding transformation in the weights figure left the attractor network structure for the environment in fig the inputs give approximation of the successor coordinates of point in space the network is composed of neurons with encoding vectors representing states scattered randomly throughout the environment each neuron activation is proportional to the scalar product of its encoding vector and the input resulting in large bump of activity recurrent weights are generated using error decoding of the successor coordinates from the neural activities projected back on to the neural encoding vectors right the generated recurrent weights for the network the plot shows the incoming weights from each neuron to the unit at the circled position where neurons are plotted according to their place field centers if the input represents location in the environment bump of activity forms in the network fig these activities give encoding of the input given the response properties of the neurons we can find set of linear decoding weights dj that recovers an approximation of the input given to the network from the neural activities dj aj these decoding weights dj were derived by minimizing the estimation error of set of example inputs from their resulting activities where the example inputs correspond to the successor coordinates of points evenly spaced throughout the environment the minimization can be performed by taking the pseudoinverse of the matrix of neural activities in response to the example inputs with singular values below certain tolerance removed to avoid overfitting the vector dj therefore gives the contribution of aj to linear population code for the input location rec we now introduce the recurrent weights wij to allow the network to maintain memory of past input in persistent activity the recurrent weights are determined by projecting the decoded location back on to the neuron encoding vectors such that rec wij ei dj rec wij aj ei here the factor determines the timescale on which the network activity fades since the encoding and decoding vectors for the same neuron tend to be similar recurrent weights are highest between neurons 
representing similar successor coordinates and the weight profile decreases with the distance between place field centers fig the full description is given by dai in rec wij aj wik dt rec in ei where the parameter corresponds to the input strength if we consider the estimate of recovered from decoding the activities of the network we arrive at the update equation dt given location as an initial input the recovered representation approximates the input and reinforces it allowing persistent bump of activity to form when then changes to new goal location the input and recovered coordinates conflict by eq the recovered location moves in the direction of the new input giving us an approximation of the initial gradient ascent step in eq with the addition of decay controlled by as we will show the attractor dynamics typically cause the network activity to manifest as movement of the bump towards the goal location through locations intermediate to the starting position and the goal as observed in experiments after short stimulation period the network activity can be decoded to give state nearby the starting position that is closer to the goal note that with no decay the network activity will tend to grow over time to induce stable activity when the network representation matches the goal position we balanced the decay and input strength in the following we consider networks where the successor coordinate representation was truncated to the first dimensions where this was done because the network is composed of limited number of neurons representing only the portion of the successor coordinate space corresponding to actual locations in the environment in very space the network can rapidly move into regime far from any actual locations and the integration accuracy suffers in effect the weight profiles and activation profile become very narrow and as result the bump of activity simply disappears from the original position and reappears at the goal conversely dimensional representations tend to result in broad excitatory weight profiles and activity profiles fig the high degree of excitatory overlap across the network causes the activity profile to move smoothly between distant points as we will show results we generated attractor networks according to the layout of multiple environments containing walls and obstacles and stimulated them successively with arbitrary startpoints and goals we used neurons to represent each environment with place field centers selected randomly throughout the environment the successor coordinates were generated using we adjusted to control the dimensionality of the representation the network activity resembles bump across portion of the environment fig representations low produced large activity bumps across significant portions of the environment when weak stimulus was provided at the goal the overall activity decreased while the center of the bump moved towards the goal through the intervening areas of the environment with representation activity bumps became more localized and shifted discontinuously to the goal fig bottom row for several networks representing different environments we initialized the activity at points evenly spaced throughout the environment and provided weak stimulation corresponding to fixed goal location fig after short delay we decoded the successor coordinates from the network activity to determine the closest state eq the shifts in the network representation are shown by the arrows in fig for two networks we show the effect of different stimuli 
representing different goal locations the movement of the activity profile was similar to the shortest path towards the goal fig bottom left including reversals at equidistant points center bottom of the maze irregularities were still present however particularly near the edges of the environment and in the immediate vicinity of the goal where components play larger role in determining the value gradient discussion we have presented spatial bump attractor model generalized to represent environments with arbitrary obstacles and shown how with large activity profiles relative to the size of the environment the network dynamics can be used for this provides possible correlate for figure attractor network activities illustrated over time for different inputs and networks in multiples of the membrane time constant purple boxes indicate the most active unit at each point in time first row activities are shown for network representing environment in space the network was initially stimulated with bump of activation representing the successor coordinates of the state at the black circle recurrent connections maintain similar yet fading profile over time second row for the same network and initial conditions weak constant stimulus was provided representing the successor coordinates at the grey circle the activities transiently decrease and the center of the profile shifts over time through the environment third row two positions black and grey circles were sequentially activated in network representing second environment in space bottom row for representation the activity profile fades rapidly and reappears at the stimulated position activity observed in the hippocampus and an hypothesis for the role that the hippocampus and the region play in rapid navigation as complement to an additional system enabling incremental goal learning in unfamiliar environments recent theoretical work has linked the firing behaviour of place cells to an encoding of the environment based on its natural topology including obstacles and specifically to the successor representation as well recent work has proposed that place cell behaviour can be learned by processing visual data using hierarchical slow feature analysis process which can extract the lowest frequency eigenvectors of the graph laplacian generated by the environment and therefore provide potential input for successor activity we provide the first link between these theoretical analyses and models of slow feature analysis has been proposed as natural outcome of plasticity rule based on plasticity stdp albeit on the timescale of standard postsynaptic figure attractor network activities can be decoded to determine local trajectories to goals arrows show the initial change in the location of the activity profile by determining the state closest to the decoded network activity at after weakly stimulating with the successor coordinates at the black dot pixels show the place field centers of the neurons representing each environment coloured according to their activity at the stimulated goal site top left change in position of the activity profile in like environment with activity compared to bottom left the true shortest path towards the goal at each point in the environment additional plots various environments and stimulated goal sites using successor coordinate representations tential rather than the behavioural timescale we consider here however stdp can be extended to behavioural timescales when combined with sustained firing and slowly decaying potentials of 
the type observed on the level in the input pathway to or as result of network effects within the attractor network learning could potentially be addressed by rule that trains recurrent synapses to reproduce inputs during exploration our model assigns key role to neurons with large place fields in generating directed trajectories this suggests that such trajectories in dorsal hippocampus where place fields are much smaller must be inherited from dynamics in ventral or intermediate hippocampus the model predicts that ablating the hippocampus will result in significant reduction in preplay activity in the remaining dorsal region in an intact hippocampus the model predicts that preplay in the dorsal hippocampus is preceded by preplay tracing similar path in intermediate hippocampus however these networks lack the specificity to consistently generate useful trajectories in the immediate vicinity of the goal therefore dorsal representations may prove useful in generating trajectories close to the goal location or alternative methods of navigation may become more important if an assembly of neurons projecting to the attractor network is active while the animal searches the environment hebbian plasticity provides potential mechanism for reactivating goal location in particular the presence of neuromodulator could allow for potentiation between the assembly and the attractor network neurons active when the animal receives reward at particular location activating the assembly would then provide stimulation to the goal location in the network the same mechanism could allow an arbitrary number of assemblies to become selective for different goal locations in the same environment unlike traditional free methods of learning which generate static value map this would give highly configurable means of navigating the environment visiting different goal locations based on thirst hunger needs providing link between spatial navigation and higher cognitive functioning acknowledgements this research was supported by the swiss national science foundation grant agreement no we thank laureline logiaco and johanni brea for valuable discussions references william beecher scoville and brenda milner loss of recent memory after bilateral hippocampal lesions journal of neurology neurosurgery and psychiatry howard eichenbaum memory amnesia and the hippocampal system mit press john keefe and jonathan dostrovsky the hippocampus as spatial map preliminary evidence from unit activity in the rat brain research kazu nakazawa linus sun michael quirk laure matthew wilson and susumu tonegawa hippocampal nmda receptors are crucial for memory acquisition of experience neuron toshiaki nakashiba jennie young thomas mchugh derek buhl and susumu tonegawa transgenic inhibition of synaptic transmission reveals role of output in hippocampal learning science tobias bast iain wilson menno witter and richard gm morris from rapid place learning to behavioral performance key role for the intermediate hippocampus plos biology alexei samsonovich and bruce mcnaughton path integration and cognitive mapping in continuous attractor neural network model the journal of neuroscience kirsten brun kjelstrup trygve solstad vegard heimly brun torkel hafting stefan leutgeb menno witter edvard moser and moser finite scale of spatial representation in the hippocampus science brad pfeiffer and david foster hippocampal sequences depict future paths to remembered goals nature andrew wikenheiser and david redish hippocampal theta sequences reflect current goals 
nature neuroscience
martinet denis sheynikhovich karim benchenane and angelo arleo spatial learning and action planning in prefrontal cortical network model plos computational biology
filip ponulak and john hopfield rapid parallel path planning by propagating wavefronts of spiking neural activity frontiers in computational neuroscience
peter dayan improving generalization for temporal difference learning the successor representation neural computation
kimberly stachenfeld matthew botvinick and samuel gershman design principles of the hippocampal cognitive map in ghahramani welling cortes lawrence and weinberger editors advances in neural information processing systems pages curran associates
mathias franzius henning sprekeler and laurenz wiskott slowness and sparseness lead to place and cells plos computational biology
fabian schoenfeld and laurenz wiskott modeling place field activity with hierarchical slow feature analysis frontiers in computational neuroscience
richard sutton and andrew barto introduction to reinforcement learning mit press
ronald coifman and lafon diffusion maps applied and computational harmonic analysis
sridhar mahadevan learning representation and control in markov decision processes volume now publishers inc
henning sprekeler on the relation of slow feature analysis and laplacian eigenmaps neural computation
john conklin and chris eliasmith controlled attractor network model of path integration in the rat journal of computational neuroscience
nicholas gustafson and nathaniel daw grid cells place cells and geodesic generalization for spatial reinforcement learning plos computational biology
chris eliasmith and charles anderson neural engineering computation representation and dynamics in neurobiological systems mit press
henning sprekeler christian michaelis and laurenz wiskott slowness an objective for plasticity plos computational biology
patrick drew and lf abbott extending the effects of plasticity to behavioral timescales proceedings of the national academy of sciences
phillip larimer and ben strowbridge representing information in cell assemblies persistent activity mediated by semilunar granule cells nature neuroscience
robert urbanczik and walter senn learning by the dendritic prediction of somatic spiking neuron
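To make the successor coordinate construction above concrete, here is a minimal gridworld sketch (Python/NumPy, not the authors' attractor network model). It builds the localized Gaussian affinity over shortest traversable path lengths, row-normalises it into a random walk transition matrix, reads the value function off the goal column of the successor matrix, and plans by greedily stepping to the neighbouring state of highest value, i.e. the idealised computation that the bump attractor is argued to approximate. It works with the full successor matrix rather than the truncated eigenvector coordinates used in the paper, and sigma, gamma, and all helper names are assumptions of the sketch.

```python
import numpy as np
from collections import deque

def successor_value(walls, goal, sigma=1.0, gamma=0.95):
    """Value function induced by a single goal in a gridworld with obstacles.

    walls: 2-D boolean array, True where movement is blocked
    goal:  (row, col) of the rewarded state
    Builds the Gaussian affinity over shortest traversable path lengths,
    row-normalises it into a transition matrix T, and reads the value
    function off the goal column of M = (I - gamma*T)^(-1).
    """
    H, W = walls.shape
    states = [(r, c) for r in range(H) for c in range(W) if not walls[r, c]]
    index = {s: i for i, s in enumerate(states)}
    n = len(states)

    def bfs(src):
        # shortest path lengths from src, respecting walls
        dist = {src: 0}
        queue = deque([src])
        while queue:
            r, c = queue.popleft()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nxt = (r + dr, c + dc)
                if nxt in index and nxt not in dist:
                    dist[nxt] = dist[(r, c)] + 1
                    queue.append(nxt)
        return dist

    A = np.zeros((n, n))
    for s in states:
        for t, d in bfs(s).items():
            A[index[s], index[t]] = np.exp(-d ** 2 / (2.0 * sigma ** 2))
    T = A / A.sum(axis=1, keepdims=True)       # random-walk transition matrix
    M = np.linalg.inv(np.eye(n) - gamma * T)   # successor representation
    return states, index, M[:, index[goal]]    # value = goal column of M

def greedy_path(states, index, V, start, max_steps=500):
    """Plan by repeatedly moving to the neighbouring state of highest value
    (the discrete analogue of the constrained gradient ascent step above)."""
    path = [start]
    for _ in range(max_steps):
        r, c = path[-1]
        nbrs = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if (r + dr, c + dc) in index]
        if not nbrs:
            break
        best = max(nbrs, key=lambda s: V[index[s]])
        if V[index[best]] <= V[index[path[-1]]]:
            break                              # local maximum (the goal) reached
        path.append(best)
    return path
```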
teaching machines to read and comprehend karl moritz edward lasse will mustafa phil google deepmind university of oxford kmh tkocisky etg lespeholt wkay mustafasul pblunsom abstract teaching machines to read natural language documents remains an elusive challenge machine reading systems can be tested on their ability to answer questions posed on the contents of documents that they have seen but until now large scale training and test datasets have been missing for this type of evaluation in this work we define new methodology that resolves this bottleneck and provides large scale supervised reading comprehension data this allows us to develop class of attention based deep neural networks that learn to read real documents and answer complex questions with minimal prior knowledge of language structure introduction progress on the path from shallow information retrieval algorithms to machines capable of reading and understanding documents has been slow traditional approaches to machine reading and comprehension have been based on either hand engineered grammars or information extraction methods of detecting predicate argument triples that can later be queried as relational database supervised machine learning approaches have largely been absent from this space due to both the lack of large scale training datasets and the difficulty in structuring statistical models flexible enough to learn to exploit document structure while obtaining supervised natural language reading comprehension data has proved difficult some researchers have explored generating synthetic narratives and queries such approaches allow the generation of almost unlimited amounts of supervised data and enable researchers to isolate the performance of their algorithms on individual simulated phenomena work on such data has shown that neural network based models hold promise for modelling reading comprehension something that we will build upon here historically however many similar approaches in computational linguistics have failed to manage the transition from synthetic data to real environments as such closed worlds inevitably fail to capture the complexity richness and noise of natural language in this work we seek to directly address the lack of real natural language training data by introducing novel approach to building supervised reading comprehension data set we observe that summary and paraphrase sentences with their associated documents can be readily converted to triples using simple entity detection and anonymisation algorithms using this approach we have collected two new corpora of roughly million news stories with associated queries from the cnn and daily mail websites we demonstrate the efficacy of our new corpora by building novel deep learning models for reading comprehension these models draw on recent developments for incorporating attention mechanisms into recurrent neural network architectures this allows model to focus on the aspects of document that it believes will help it answer question and also allows us to visualises its inference process we compare these neural models to range of baselines and heuristic benchmarks based upon traditional frame semantic analysis provided by natural language processing cnn train valid daily mail test train valid top cnn daily mail test months documents queries max entities avg entities avg tokens vocab size cumulative table percentage of time that the correct answer is contained in table corpus statistics articles were collected starting in the top most frequent 
entities april for cnn and june for the daily mail both until in given document the end of april validation data is from march test data from april articles of over tokens and queries whose answer entity did not appear in the context were filtered out nlp pipeline our results indicate that the neural models achieve higher accuracy and do so without any specific encoding of the document or query structure supervised training data for reading comprehension the reading comprehension task naturally lends itself to formulation as supervised learning problem specifically we seek to estimate the conditional probability where is context document query relating to that document and the answer to that query for focused evaluation we wish to be able to exclude additional information such as world knowledge gained from statistics in order to test model core capability to detect and understand the linguistic relationships between entities in the context document such an approach requires large training corpus of triples and until now such corpora have been limited to hundreds of examples and thus mostly of use only for testing this limitation has meant that most work in this area has taken the form of unsupervised approaches which use templates or analysers to extract relation tuples from the document to form knowledge graph that can be queried here we propose methodology for creating large scale supervised training data for learning reading comprehension models inspired by work in summarisation we create two machine reading corpora by exploiting online newspaper articles and their matching summaries we have collected articles from the and articles from the daily websites both news providers supplement their articles with number of bullet points summarising aspects of the information contained in the article of key importance is that these summary points are abstractive and do not simply copy sentences from the documents we construct corpus of answer triples by turning these bullet points into cloze style questions by replacing one entity at time with placeholder this results in combined corpus of roughly data points table code to replicate our to apply this method to other available entity replacement and permutation note that the focus of this paper is to provide corpus for evaluating model ability to read and comprehend single document not world knowledge or to understand that distinction consider for instance the following cloze form queries created from headlines in the daily mail validation set the bra that helps you beat breast could saccharin help beat can fish oils help fight prostate an ngram language model trained on the daily mail would easily correctly predict that cancer regardless of the contents of the context document simply because this is very frequently cured entity in the daily mail corpus http original version anonymised version context the bbc producer allegedly struck by jeremy clarkson will not press charges against the top gear host his lawyer said friday clarkson who hosted one of the television shows in the world was dropped by the bbc wednesday after an internal investigation by the british broadcaster found he had subjected producer oisin tymon to an unprovoked physical and verbal the producer allegedly struck by will not press charges against the host his lawyer said friday who hosted one of the most watched television shows in the world was dropped by the wednesday after an internal investigation by the broadcaster found he had subjected producer to an unprovoked 
physical and verbal attack query producer will not press charges against jeremy clarkson his lawyer says producer will not press charges against his lawyer says answer oisin tymon table original and anonymised version of data point from the daily mail validation set the anonymised entity markers are constantly permuted during training and testing to prevent such degenerate solutions and create focused task we anonymise and randomise our corpora with the following procedure use coreference system to establish coreferents in each data point replace all entities with abstract entity markers according to coreference randomly permute these entity markers whenever data point is loaded compare the original and anonymised version of the example in table clearly human reader can answer both queries correctly however in the anonymised setup the context document is required for answering the query whereas the original version could also be answered by someone with the requisite background knowledge therefore following this procedure the only remaining strategy for answering questions is to do so by exploiting the context presented with each question thus performance on our two corpora truly measures reading comprehension capability naturally production system would benefit from using all available information sources such as clues through language and statistics table gives an indication of the difficulty of the task showing how frequent the correct answer is contained in the top entity markers in given document note that our models don distinguish between entity markers and regular words this makes the task harder and the models more general models so far we have motivated the need for better datasets and tasks to evaluate the capabilities of machine reading models we proceed by describing number of baselines benchmarks and new models to evaluate against this paradigm we define two simple baselines the majority baseline maximum frequency picks the entity most frequently observed in the context document whereas the exclusive majority exclusive frequency chooses the entity most frequently observed in the context but not observed in the query the idea behind this exclusion is that the placeholder is unlikely to be mentioned twice in single cloze form query symbolic matching models traditionally pipeline of nlp models has been used for attempting question answering that is models that make heavy use of linguistic annotation structured world knowledge and semantic parsing and similar nlp pipeline outputs building on these approaches we define number of models for our machine reading task parsing parsing attempts to identify predicates and their arguments allowing models access to information about who did what to whom naturally this kind of annotation lends itself to being exploited for question answering we develop benchmark that makes use of annotations which we obtained by parsing our model with parser as the parser makes extensive use of linguistic information we run these benchmarks on the unanonymised version of our corpora there is no significant advantage in this as the approach used here does not possess the capability to generalise through language model beyond exploiting one during the parsing phase thus the key objective of evaluating machine comprehension abilities is maintained extracting denoted as both the query and context document we attempt to resolve queries using number of rules with an increasing as follows table strategy exact match match correct frame permuted frame matching 
entity strategy pattern pattern example cloze context loves suse kim loves suse is president mike is president won oscar tom won academy award met suse suse met tom likes candy tom loves candy pick the most frequent entity from the context that doesn appear in the query table resolution strategies using propbank triples denotes the entity proposed as answer is fully qualified propbank frame strategies are ordered by precedence and answers determined accordingly this heuristic algorithm was iteratively tuned on the validation data set for reasons of clarity we pretend that all propbank triples are of the form in practice we take the argument numberings of the parser into account and only compare like with like except in cases such as the permuted frame rule where ordering is relaxed in the case of multiple possible answers from single rule we randomly choose one word distance benchmark we consider another baseline that relies on word distance measurements here we align the placeholder of the cloze form question with each possible entity in the context document and calculate distance measure between the question and the context around the aligned entity this score is calculated by summing the distances of every word in to their nearest aligned word in where alignment is defined by matching words either directly or as aligned by the coreference system we tune the maximum penalty per word on the validation data neural network models neural networks have successfully been applied to range of tasks in nlp this includes classification tasks such as sentiment analysis or pos tagging as well as generative problems such as language modelling or machine translation we propose three neural models for estimating the probability of word type from document answering query exp where is the vocabulary and indexes row of weight matrix and through slight abuse of notation word types double as indexes note that we do not privilege entities or variables the model must learn to differentiate these in the input sequence the function returns vector embedding of document and query pair the deep lstm reader long memory lstm networks have recently seen considerable success in tasks such as machine translation and language modelling when used for translation deep lstms have shown remarkable ability to embed long sequences into vector representation which contains enough information to generate full translation in another language our first neural model for reading comprehension tests the ability of deep lstm encoders to handle significantly longer sequences we feed our documents one word at time into deep lstm encoder after delimiter we then also feed the query into the encoder alternatively we also experiment with processing the query then the document the result is that this model processes each document query pair as single long sequence given the embedded document and query the network predicts which token in the document answers the query the vocabulary includes all the word types in the documents questions the entity maskers and the question unknown entity marker mary went to england visited england mary attentive reader went to england visited england impatient reader visited england mary went to england two layer deep lstm reader with the question encoded before the document figure document and query embedding models we employ deep lstm cell with skip connections from each input to every hidden layer and from every hidden layer to the output wkxi wkhi wkci bki wkxf wkhf wkcf bkf tanh wkxc wkhc wkxo wkho tanh 
bkc wkco bko wky bky where indicates vector concatenation is the hidden state for layer at time and are the input forget and output gates respectively thus our deep lstm reader is defined by lstm with input the concatenation of and separated by the delimiter the attentive reader the deep lstm reader must propagate dependencies over long distances in order to connect queries to their answers the fixed width hidden vector forms bottleneck for this information flow that we propose to circumvent using an attention mechanism inspired by recent results in translation and image recognition this attention model first encodes the document and the query using separate bidirectional single layer lstms we denote the outputs of the forward and backward lstms as and respectively the encoding of query of length is formed by the concatenation of the final forward and backward outputs yq yq for the document the composite output for each token at position is the representation of the document is formed by weighted sum of these output vectors these weights are interpreted as the degree to which the network attends to particular token in the document when answering the query tanh wym yd wum exp wms yd where we are interpreting yd as matrix with each column being the composite representation yd of document token the variable is the normalised attention at token given this attention score the embedding of the document is computed as the weighted sum of the token embeddings the model is completed with the definition of the joint document and query embedding via nonlinear combination ar tanh wrg wug the attentive reader can be viewed as generalisation of the application of memory networks to question answering that model employs an attention mechanism at the sentence level where each sentence is represented by bag of embeddings the attentive reader employs finer grained token level attention mechanism where the tokens are embedded given their entire future and past context in the input document the impatient reader the attentive reader is able to focus on the passages of context document that are most likely to inform the answer to the query we can go further by equipping the model with the ability to reread from the document as each query token is read at each token of the query the model computes document representation vector using the bidirectional embedding yq yq yq tanh wdm yd wrm wqm yq exp wms tanh wrr the result is an attention mechanism that allows the model to recurrently accumulate information from the document as it sees each query token ultimately outputting final joint document query representation for the answer prediction ir tanh wrg wqg empirical evaluation having described number of models in the previous section we next evaluate these models on our reading comprehension corpora our hypothesis is that neural models should in principle be well suited for this task however we argued that simple recurrent models such as the lstm probably have insufficient expressive power for solving tasks that require complex inference we expect that the models would therefore outperform the pure approaches considering the second dimension of our investigation the comparison of traditional versus neural approaches to nlp we do not have strong prior favouring one approach over the other while numerous publications in the past few years have demonstrated neural models outperforming classical methods it remains unclear how much of that is of the language modelling capabilities intrinsic to any neural model for nlp 
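To make the attention step of the attentive reader concrete, the following NumPy sketch implements the equations above: per-token match features m(t) = tanh(W_ym y_d(t) + W_um u), normalised attention s(t), the attention-weighted document vector r, and the joint embedding g = tanh(W_rg r + W_ug u). It assumes the bidirectional LSTM encodings y_d and u have already been computed; the weight names follow the text, but the shapes and the random toy inputs are purely illustrative, so this is a sketch rather than the authors' implementation.

```python
# Minimal NumPy sketch of the Attentive Reader's attention/combination step.
# Assumes y_d (token encodings) and u (query encoding) come from bidirectional
# LSTM encoders that are not implemented here; weights are random placeholders.
import numpy as np

def attentive_combine(y_d, u, W_ym, W_um, w_ms, W_rg, W_ug):
    """y_d: (T, 2h) document token encodings; u: (2h,) query encoding.
    Returns the joint document/query embedding g and attention weights s."""
    m = np.tanh(y_d @ W_ym.T + u @ W_um.T)        # (T, k) per-token match features
    scores = m @ w_ms                              # (T,) unnormalised attention
    s = np.exp(scores - scores.max())
    s /= s.sum()                                   # normalised attention over document tokens
    r = s @ y_d                                    # (2h,) attention-weighted document vector
    g = np.tanh(r @ W_rg.T + u @ W_ug.T)           # joint embedding used for answer scoring
    return g, s

# toy usage with random encodings and weights
rng = np.random.default_rng(0)
T, h, k = 12, 8, 10
y_d = rng.normal(size=(T, 2 * h))
u = rng.normal(size=(2 * h,))
g, s = attentive_combine(
    y_d, u,
    W_ym=rng.normal(size=(k, 2 * h)), W_um=rng.normal(size=(k, 2 * h)),
    w_ms=rng.normal(size=(k,)),
    W_rg=rng.normal(size=(2 * h, 2 * h)), W_ug=rng.normal(size=(2 * h, 2 * h)),
)
print(g.shape, s.sum())  # (16,) and the attention weights sum to 1
```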
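For comparison, here is a deliberately simplified reading of the word-distance benchmark described earlier: the placeholder is aligned with each candidate entity mention, and every query word contributes the distance to its nearest exact match in the context, capped by a maximum per-word penalty. The real benchmark also uses coreference-based alignment and tunes the penalty on validation data; this sketch uses exact token matches only and the scoring rule is one plausible simplification, not the authors' code.

```python
# Simplified word-distance benchmark: pick the candidate entity whose mention
# minimises the summed (capped) distances from query words to their nearest
# exact matches in the document. Coreference alignment is omitted.
from collections import defaultdict

def word_distance_answer(doc_tokens, query_tokens, candidates,
                         placeholder="@placeholder", max_penalty=8):
    positions = defaultdict(list)
    for i, w in enumerate(doc_tokens):
        positions[w].append(i)

    def score(entity):
        best = float("inf")
        for p in positions.get(entity, []):      # align the placeholder with this mention
            total = 0
            for w in query_tokens:
                if w in (placeholder, entity):
                    continue
                occ = positions.get(w)
                d = min(abs(j - p) for j in occ) if occ else max_penalty
                total += min(d, max_penalty)     # capped distance to the nearest match
            best = min(best, total)
        return best

    return min(candidates, key=score)            # lowest summed distance wins

doc = "ent1 met ent2 in london before ent2 visited paris".split()
query = "@placeholder visited paris".split()
print(word_distance_answer(doc, query, ["ent1", "ent2"]))  # -> ent2
```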
the entity anonymisation and permutation aspect of the task presented here may end up levelling the playing field in that regard favouring models capable of dealing with syntax rather than just semantics with these considerations in mind the experimental part of this paper is designed with threefold aim first we want to establish the difficulty of our machine reading task by applying wide range of models to it second we compare the performance of methods versus that of neural models third within the group of neural models examined we want to determine what each component contributes to the end performance that is we want to analyse the extent to which an lstm can solve this task and to what extent various attention mechanisms impact performance all model hyperparameters were tuned on the respective validation sets of the two our experimental results are in table with the attentive and impatient readers performing best across both datasets for the deep lstm reader we consider hidden layer sizes depths initial learning rates batch sizes and dropout we evaluate two types of feeds in the cqa setup we feed first the context document and subsequently the question into the encoder while the qca model starts by feeding in the question followed by the context document we report results on the best model underlined hyperparameters qca setup for the attention models we consider hidden layer sizes single layer initial learning rates batch sizes and dropout for all models we used asynchronous rmsprop with momentum of and decay of see appendix for more details of the experimental setup cnn valid daily mail test valid test maximum frequency exclusive frequency model word distance model deep lstm reader uniform reader attentive reader impatient reader table accuracy of all the models and benchmarks on the cnn and daily mail datasets the uniform reader baseline sets all of the parameters to be equal figure precision recall for the attention models on the cnn validation data benchmark while the one model proposed in this paper is clearly simplification of what could be achieved with annotations from an nlp pipeline it does highlight the difficulty of the task when approached from symbolic nlp perspective two issues stand out when analysing the results in detail first the pipeline has poor degree of coverage with many relations not being picked up by our propbank parser as they do not adhere to the default structure this effect is exacerbated by the type of language used in the highlights that form the basis of our datasets the second issue is that the approach does not trivially scale to situations where several sentences and thus frames are required to answer query this was true for the majority of queries in the dataset word distance benchmark more surprising perhaps is the relatively strong performance of the word distance benchmark particularly relative to the benchmark which we had expected to perform better here again the nature of the datasets used can explain aspects of this result where the model suffered due to the language used in the highlights the word distance model benefited particularly in the case of the daily mail dataset highlights frequently have significant lexical overlap with passages in the accompanying article which makes it easy for the word distance benchmark for instance the query tom hanks is friends with manager scooter brown has the phrase turns out he is good friends with scooter brown manager for carly rae jepson in the context the word distance benchmark correctly aligns 
these two while the approach fails to pickup the friendship or management relations when parsing the query we expect that on other types of machine reading data where questions rather than cloze queries are used this particular model would perform significantly worse neural models within the group of neural models explored here the results paint clear picture with the impatient and the attentive readers outperforming all other models this is consistent with our hypothesis that attention is key ingredient for machine reading and question answering due to the need to propagate information over long distances the deep lstm reader performs surprisingly well once again demonstrating that this simple sequential architecture can do reasonable job of learning to abstract long sequences even when they are up to two thousand tokens in length however this model does fail to match the performance of the attention based models even though these only use single layer the poor results of the uniform reader support our hypothesis of the significance of the attention mechanism in the attentive model performance as the only difference between these models is that the attention variables are ignored in the uniform reader the precision recall statistics in figure again highlight the strength of the attentive approach we can visualise the attention mechanism as heatmap over context document to gain further insight into the models performance the highlighted words show which tokens in the document were attended to by the model in addition we must also take into account that the vectors at each memory constraints prevented us from experimenting with deeper attentive readers figure attention heat maps from the attentive reader for two correctly answered validation set queries the correct answers are and respectively both examples require significant lexical generalisation and resolution in order to be answered correctly by given model token integrate long range contextual information via the bidirectional lstm encoders figure depicts heat maps for two queries that were correctly answered by the attentive in both cases confidently arriving at the correct answer requires the model to perform both significant lexical generalsiation killed deceased and or anaphora resolution was killed he was however it is also clear that the model is able to integrate these signals with rough heuristic indicators such as the proximity of query words to the candidate answer conclusion the supervised paradigm for training machine reading and comprehension models provides promising avenue for making progress on the path to building full natural language understanding systems we have demonstrated methodology for obtaining large number of triples and shown that recurrent and attention based neural networks provide an effective modelling framework for this task our analysis indicates that the attentive and impatient readers are able to propagate and integrate semantic information over long distances in particular we believe that the incorporation of an attention mechanism is the key contributor to these results the attention mechanism that we have employed is just one instantiation of very general idea which can be further exploited however the incorporation of world knowledge and queries will also require the development of attention and embedding mechanisms whose complexity to query does not scale linearly with the data set size there are still many queries requiring complex inference and long range reference resolution that our models 
are not yet able to answer as such our data provides scalable challenge that should support nlp research into the future further significantly bigger training data sets can be acquired using the techniques we have described undoubtedly allowing us to train more expressive and accurate models note that these examples were chosen as they were short the average cnn validation document contained tokens and entities thus most instances were significantly harder to answer than these examples references ellen riloff and michael thelen question answering system for reading comprehension tests in proceedings of the workshop on reading comprehension tests as evaluation for language understanding sytems hoifung poon janara christensen pedro domingos oren etzioni raphael hoffmann chloe kiddon thomas lin xiao ling mausam alan ritter stefan schoenmackers stephen soderland dan weld fei wu and congle zhang machine reading at the university of washington in proceedings of the naacl hlt first international workshop on formalisms and methodology for learning by reading jason weston sumit chopra and antoine bordes memory networks corr sainbayar sukhbaatar arthur szlam jason weston and rob fergus memory networks corr terry winograd understanding natural language academic press orlando fl usa dzmitry bahdanau kyunghyun cho and yoshua bengio neural machine translation by jointly learning to align and translate corr volodymyr mnih nicolas heess alex graves and koray kavukcuoglu recurrent models of visual attention in advances in neural information processing systems karol gregor ivo danihelka alex graves and daan wierstra draw recurrent neural network for image generation corr matthew richardson christopher burges and erin renshaw mctest challenge dataset for the machine comprehension of text in proceedings of emnlp krysta svore lucy vanderwende and christopher burges enhancing summarization by combining ranknet and sources in proceedings of kristian woodsend and mirella lapata automatic generation of story highlights in proceedings of acl wilson taylor cloze procedure new tool for measuring readability journalism quarterly dipanjan das desai chen martins nathan schneider and noah smith framesemantic parsing computational linguistics karl moritz hermann dipanjan das jason weston and kuzman ganchev semantic frame identification with distributed word representations in proceedings of acl june nal kalchbrenner edward grefenstette and phil blunsom convolutional neural network for modelling sentences in proceedings of acl ronan collobert jason weston bottou michael karlen koray kavukcuoglu and pavel kuksa natural language processing almost from scratch journal of machine learning research november ilya sutskever oriol vinyals and quoc le sequence to sequence learning with neural networks in advances in neural information processing systems sepp hochreiter and schmidhuber long memory neural computation november alex graves supervised sequence labelling with recurrent neural networks volume of studies in computational intelligence springer tieleman and hinton lecture divide the gradient by running average of its recent magnitude coursera neural networks for machine learning 
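As a compact illustration of the corpus construction described above, the sketch below turns a highlight into cloze queries (one entity replaced by a placeholder at a time) and anonymises both context and query with abstract entity markers that are freshly permuted for every data point. It assumes entity mentions and their coreference chains are supplied by an external pipeline; the surface-string-to-chain mapping used here is a stand-in for that step, and the single-token entities are purely for brevity.

```python
# Sketch of cloze construction with entity anonymisation and per-data-point
# marker permutation. `entity_to_chain` maps an entity surface token to a
# coreference chain id and stands in for a real coreference/NER system.
import random

def make_cloze_points(document, highlight, entity_to_chain, seed=0):
    """Yield (anonymised context, cloze query, answer marker) triples,
    one per entity mention in the highlight."""
    rng = random.Random(seed)
    chains = sorted(set(entity_to_chain.values()))
    for i, tok in enumerate(highlight):
        if tok not in entity_to_chain:
            continue
        perm = rng.sample(range(len(chains)), len(chains))         # fresh permutation per data point
        marker = {c: f"@entity{perm[j]}" for j, c in enumerate(chains)}
        sub = lambda t: marker[entity_to_chain[t]] if t in entity_to_chain else t
        context = [sub(t) for t in document]                        # same markers in context and query
        query = [sub(t) if j != i else "@placeholder"
                 for j, t in enumerate(highlight)]
        yield context, query, marker[entity_to_chain[tok]]

doc = "jeremy_clarkson was dropped by the bbc after an investigation".split()
hl = "bbc drops jeremy_clarkson".split()
ents = {"jeremy_clarkson": 0, "bbc": 1}
for ctx, q, ans in make_cloze_points(doc, hl, ents):
    print(q, "->", ans)
```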
principal differences analysis interpretable characterization of differences between distributions jonas mueller csail mit jonasmueller tommi jaakkola csail mit tommi abstract we introduce principal differences analysis pda for analyzing differences between distributions the method operates by finding the projection that maximizes the wasserstein divergence between the resulting univariate populations relying on the device it requires no assumptions about the form of the underlying distributions nor the nature of their differences sparse variant of the method is introduced to identify features responsible for the differences we provide algorithms for both the original minimax formulation as well as its semidefinite relaxation in addition to deriving some convergence results we illustrate how the approach may be applied to identify differences between cell populations in the somatosensory cortex and hippocampus as manifested by single cell our broader framework extends beyond the specific choice of wasserstein divergence introduction understanding differences between populations is common task across disciplines from biomedical data analysis to demographic or textual analysis for example in biomedical analysis set of variables features such as genes may be profiled under different conditions cell types disease variants resulting in two or more populations to compare the hope of this analysis is to answer whether or not the populations differ and if so which variables or relationships contribute most to this difference in many cases of interest the comparison may be challenging primarily for three reasons the number of variables profiled may be large populations are represented by finite unpaired sets of samples and information may be lacking about the nature of possible differences exploratory analysis we will focus on the comparison of two high dimensional populations therefore given two unpaired sets of samples xpnq xpnq px and ypmq pmq py the goal is to answer the following two questions about the underlying multivariate random variables rd is px py if not what is the minimal subset of features du such that the marginal distributions differ pxs pys while pxsc pysc for the complement finer version of may additionally be posed which asks how much each feature contributes to the overall difference between the two probability distributions with respect to the given scale on which the variables are measured many analyses have focused on characterizing limited differences such as mean shifts more general differences beyond the mean of each feature remain of interest however including of demographic statistics such as income it is also undesirable to restrict the analysis to specific parametric differences especially in exploratory analysis where the nature of the underlying distributions may be unknown in the univariate case number of nonparametric tests of equality of distributions are available with accompanying concentration results popular examples of such divergences also referred to as probability metrics include hellinger etc the kolmogorov distance or the wasserstein metric unfortunately this simplicity vanishes as the dimensionality grows and complex have been designed to address some of the difficulties that appear in settings in this work we propose the principal differences analysis pda framework which circumvents the curse of dimensionality through explicit reduction back to the univariate case given statistical divergence which measures the difference between univariate 
probability distributions pda seeks to find projection which maximizes dp subject to the constraints to avoid underspecification this reduction is justified by the device which ensures that px py if and only if there exists direction along which the univariate linearly projected distributions differ assuming is positive definite divergence meaning it is nonzero between any two distinct univariate distributions the projection vector produced by pda can thus capture arbitrary types of differences between px and py furthermore the approach can be straightforwardly modified to address by introducing sparsity penalty on and examining the features with nonzero weight in the resulting optimal projection the resulting comparison pertains to marginal distributions up to the sparsity level we refer to this approach as sparse differences analysis or sparda related work the problem of characterizing differences between populations including feature selection has received great deal of study we limit our discussion to methods which as family of methods are closest to our approach for multivariate data the most widely adopted methods include sparse linear discriminant analysis lda and the logistic lasso while interpretable these methods seek specific differences average differences or operate under stringent assumptions model in contrast sparda with divergence aims to find features that characterize priori unspecified differences between general multivariate distributions perhaps most similar to our general approach is diproperm procedure of wei et al in which the data is first projected along the normal to the separating hyperplane found using linear svm distance weighted discrimination or the centroid method followed by univariate test on the projected data the projections could also be chosen at random in contrast to our approach the choice of the projection in such methods is not optimized for the test statistics we note that by restricting the divergence measure in our technique methods such as the sparse linear support vector machine could be viewed as special cases the divergence in this case would measure the margin between projected univariate distributions while suitable for finding projected populations it may fail to uncover more general differences between possibly projected populations general framework for principal differences analysis for given divergence measure between two univariate random variables we find the projection that solves pnq yp pmq max dp pb where is the feasible set is the sparsity constraint pnq denotes the observed random variable that follows the empirical distribution of samand ples of instead of imposing hard cardinality constraint we may instead penalize by adding penalty or its natural relaxation the shrinkage used in lasso sparse lda and sparse pca sparsity in our setting explicitly restricts the comparison to the marginal distributions over features with coefficients we can evaluate the null hypothesis px py or its sparse variant over marginals using permutation testing cf with pnq pt yp pmq statistic dp pt in practice shrinkage parameter or explicit cardinality constraint may be chosen via by maximizing the divergence between samples the divergence plays key role in our analysis if is defined in terms of density functions as in one can use univariate kernel density estimation to approximate projected pdfs with additional tuning of the bandwidth hyperparameter for suitably chosen kernel gaussian the unregularized pda objective without shrinkage is smooth 
function of and thus amenable to the projected gradient method or its accelerated variants in contrast when is defined over the cdfs along the projected direction the kolmogorov or wasserstein distance that we focus on in this paper the objective is nondifferentiable due to the discrete jumps in the empirical cdf we specifically address the combinatorial problem implied by the wasserstein distance moreover since the divergence assesses general differences between distributions equation is typically optimization to this end we develop relaxation for use with the wasserstein distance pda using the wasserstein distance in the remainder of the paper we focus on the squared wasserstein distance kantorovich mallows dudley or distance defined as dpx min epxy px pxy px py pxy where the minimization is over all joint distributions over px with given marginals px and py intuitively interpreted as the amount of work required to transform one distribution into the other provides natural dissimilarity measure between populations that integrates both the fraction of individuals which are different and the magnitude of these differences while component analysis based on the wasserstein distance has been limited to this divergence has been successfully used in many other applications in the univariate case may be analytically expressed as the distance between quantile functions we can thus efficiently compute empirical projected wasserstein distances by sorting and samples along the projection direction to obtain quantile estimates using the wasserstein distance the empirical objective in equation between unpaired sampled populations xpnq and ty pmq can be shown to be max min xpiq pjq mij max min wm pb pm pb pm where is the set of all nonnegative matching matrices with fixed row sums and column sums see for details wm rzij zij smij and zij xpiq pjq if we omitted fixed the inner minimization over the matching matrices and set the solution of would be simply the largest eigenvector of wm similarly for the sparse variant without minizing over the problem would be solvable as sparse pca the actual maxmin problem in is more complex and with respect to we propose procedure similar to tighten after relax framework used to attain rates in sparse pca first we first solve convex relaxation of the problem and subsequently run steepest ascent method initialized at the global optimum of the relaxation to greedily improve the current solution with respect to the original nonconvex problem whenever the relaxation is not tight finally we emphasize that pda and sparda not only computationally resembles sparse pca but the latter is actually special case of the former in the gaussian setting this connection is made explicit by considering the problem with paired samples pxpiq piq where follow two multivariate gaussian distributions here the largest principal component of the uncentered differences xpiq piq is in fact equivalent to the direction which maximizes the projected wasserstein difference between the distribution of and delta distribution at semidefinite relaxation the sparda problem may be expressed in terms of symmetric matrices as max min tr pwm bq pm subject to trpbq rankpbq where the correspondence between and comes from writing note that any solution of will have unit norm when we impose no sparsity constraint as in pda we can relax by simply dropping the the objective is then supremum of linear functions of and the resulting semidefinite problem is concave over convex set and may be written as max min tr pwm 
bq bpbr pm where br is the convex set of positive semidefinite matrices with trace if rdˆd denotes the global optimum of this relaxation and rankpb then the best projection for pda is simply the dominant eigenvector of and the relaxation is tight otherwise we can truncate as in treating the dominant eigenvector as an approximate solution to the original problem to obtain relaxation for the sparse version where sparda we follow closely because implies we obtain an equivalent cardinality constrained problem by incorporating this nonconvex constraint into since trpbq and convex relaxation of the squared constraint is given by by selecting as the optimal lagrange multiplier for this constraint we can obtain an equivalent penalized reformulation parameterized by rather than the sparse semidefinite relaxation is thus the following concave problem max min tr pwm bq bpbr pm while the relaxation bears strong resemblance to dspca relaxation for sparse pca the inner maximization over matchings prevents direct application of general semidefinite programming solvers let pbq denote the matching that minimizes tr pwm bq for given standard projected subgradient ascent could be applied to solve where at the tth iterate the subgradient is wm pb ptq however this approach requires solving optimal transport problems with large matrices at each iteration instead we turn to dual form of assuming cf ÿÿ maxn trprz bz vj ij ij bpbr upr vprm is simply maximization over br rn and rm which no longer requires matching matrices nor their cumbersome constraints while dual variables and can be solved in closed form for each fixed via sorting we describe simple approach that works better in practice relax algorithm solves the dualized semidefinite relaxation of sparda returns the largest eigenvector of the solution to as the desired projection direction for sparda input data xpnq and pmq with parameters controls the amount of regularization is the used for updates is the used for updates of dual variables and is the maximum number of iterations without improvement in cost after which algorithm terminates initialize br while the number of iterations since last improvement in objective function is less than bu ns rn bv ms rm bb for nu mu zij xpiq pjq ptq ptq if trprzij zij sb ptq ui vj bui bui bvj bvj bb bb zij zij end for upt uptq bu and pt ptq bv pt projection ptq bb output prelax rd defined as the largest eigenvector based on corresponding eigenvalue tude of the matrix pt which attained the best objective value over all iterations projection algorithm projects matrix onto positive semidefinite cone of matrices br the feasible set in our relaxation step applies proximal operator for sparsity input rdˆd parameters controls the amount of regularization is the actual used in the eigendecomposition of arg min quadratic program qt if for br output rr signpb rr rr the relax algorithm boxed is projected subgradient method with supergradients computed in steps for scaling to large samples one may alternatively employ incremental supergradient directions where step would be replaced by drawing random pi jq pairs after each subgradient step projection back into the feasible set br is done via quadratic program involving the current solution eigenvalues in sparda sparsity is encouraged via the proximal map corresponding to the penalty the overall form of our iterations matches updates in by the convergence analysis in of the relax algorithm as well as its incremental variant is guaranteed to approach the optimal solution of the dual 
which also solves provided we employ sufficiently large and small in practice fast and accurate convergence is attained by renormalizing the step to ensure balanced updates of the constrained using diminishing learning rates which are initially set larger for the unconstrained dual variables or even taking multiple subgradient steps in the dual variables per each update of tightening after relaxation it is unreasonable to expect that our semidefinite relaxation is always tight therefore we can sometimes further refine the projection prelax obtained by the relax algorithm by using it as starting point in the original optimization we introduce sparsity constrained tightening procedure for applying projected gradient ascent for the original nonconvex objective jp minm pm wm where is now forced to lie in bxsk and sk rd ku the sparsity level is fixed based on the relaxed solution prelax after initializing prelax rd the tightening procedure iterates steps in the gradient direction of followed by straightforward projections into the unit and the set sk accomplished by greedily truncating all entries of to zero besides the largest in magnitude let again denote the matching matrix chosen in response to fails to be differentiable at the where rq is not unique this occurs if two samples have identical projections under while this situation becomes increasingly likely as interestingly becomes smoother overall assuming the distributions admit density functions for all other where lies in small neighborhood around and admits gradient in practice we find that the tightening always approaches local optimum of with diminishing stepsize we note that for given projection we can efficiently calculate gradients without recourse to matrices or wm by sorting ptq ptq xpnq and ptq ptq pmq the gradient is directly derivable from expression where the nonzero mij are determined by appropriately matching empirical quantiles represented by sorted indices since the univariate wasserstein distance is simply the distance between quantile functions additional computation can be saved by employing insertion sort which runs in nearly linear time for almost sorted points in iteration the points have been sorted along the direction and their sorting in direction ptq is likely similar under small thus the tightening procedure is much more efficient than the relax algorithm respective runtimes are opdn log nq per iteration we require the combined steps for good performance the projection found by the tightening algorithm heavily depends on the starting point finding only the closest local optimum as in figure it is thus important that is already good solution as can be produced by our relax algorithm additionally we note that as methods both the relax and tightening algorithms are amendable to number of sub schemes momentum techniques adaptive learning rates or fista and other variants of nesterov method properties of semidefinite relaxation we conclude the algorithmic discussion by highlighting basic conditions under which our pda relaxation is tight assuming each of iii implies that the which maximizes is nearly rank one or equivalently see supplementary information for intuition thus the tightening procedure initialized at will produce global maximum of the pda objective there exists direction in which the projected wasserstein distance between and is nearly as large as the overall wasserstein distance in rd this occurs for example if ery is large while both and are small the distributions need not be gaussian ii pµx and pµy 
with µx µy and iii pµx and pµy with µx µy where the underlying covariance structure is such that arg maxbpbr pb is nearly rank for example if the primary difference between covariances is shift in the marginal variance of some features where is diagonal matrix theoretical results in this section we characterize statistical properties of an empirical projecp pnq yp pnq although we note that the algorithms may not succeed tion arg max dp pb in finding such global maximum for severely nonconvex problems throughout denotes the squared wasserstein distance between univariate distributions represents universal constants that change from line to line all proofs are relegated to the supplementary information we make the following simplifying assumptions admit continuous density functions are compactly supported with nonzero density in the euclidean ball of radius our theory can be generalized beyond to obtain similar but complex statements through careful treatment of the distributions tails and regions where cdfs are flat theorem suppose there exists direction pnq pt yp pnq dp pt such that dp then with probability greater than exp theorem gives basic concentration results for the projections used in empirical applications our method to relate distributional differences between in the ambient space with their estimated divergence along the univariate linear representation chosen by pda we turn to theorems and finally theorem provides sparsistency guarantees for sparda in the case where exhibit large differences over certain feature subset of known cardinality pnq pt yp pnq theorem if and are identically distributed in rd then dp pt with probability greater than exp to measure the difference between the untransformed random variables rd we define the following metric between distributions on rd which is parameterized by cf ta px aq in addition to we assume the following for the next two theorems has subgaussian tails meaning cdf fy satisfies fy pyq cy erxs ery note that mean differences can trivially be captured by linear projections so these are not the differences of interest in the following theorems var for theorem suppose ta px pgp qq where pgp qq mint with exp pa dq pgp dq gp gp and supy with pyq defined as the density of the projection of in the direction then pnq pt yp pnq dp pt with probability greater than exp theorem define as in suppose there exists feature subset du pxs ys pg cqq and remaining marginal distributions xs ys are identical then ppkq arg max tdp pnq yp pnq ku satisfies ppkq and experiments ppkq pb with probability greater than exp figure illustrates the cost function of pda pertaining to two distributions see details in supplementary information in this example the point of convergence of the tightening method after random initialization in green is significantly inferior to the solution produced by the relax algorithm in red it is therefore important to use relax before tightening as we advise the synthetic madelon dataset used in the nips feature selection challenge consists of points which have features scattered on the vertices of fivedimensional hypercube so that interactions between features must be considered in order to distinguish the two classes features that are noisy linear combinations of the original five and useless features while the focus of the challenge was on extracting features useful to classifiers we direct our attention toward more interpretable models figure demonstrates how well sparda red the top sparse principal component black sparse lda green 
and the logistic lasso blue are able to identify the relevant features over different settings of their respective regularization parameters which determine the cardinality of the vector returned by each method the red asterisk indicates the sparda result with automatically selected via our crossvalidation procedure without information of the underlying features importance and the black asterisk indicates the best reported result in the challenge two sample testing cardinality value relevant features madelon data dimension figure example where pda is nonconvex sparda other feature selection methods power of various tests for problems with differences the restrictive assumptions in logistic regression and linear discriminant analysis are not satisfied in this complex dataset resulting in poor performance despite being pca was successfully utilized by numerous challenge participants and we find that the sparse pca performs on par with logistic regression and lda although the lasso fairly efficiently picks out relevant features it struggles to identify the rest due to severe similarly the challengewinning bayesian svm with automatic relevance determination only selects of the relevant features in many applications the goal is to thoroughly characterize the set of differences rather than select subset of features that maintains predictive accuracy sparda is better suited for this alternative objective many settings of return of the relevant features with zero false positives if is chosen automatically through the projection returned by sparda contains nonzero elements of which correspond to relevant features figure depicts average produced by sparda red pda purple the overall wasserstein distance in rd black maximum mean discrepancy green and diproperm blue in synthetically controlled problems where px py and the underlying differences have varying degrees of sparsity here indicates the overall number of features included of which only the first are relevant see supplementary information for details as we evaluate the significance of each method statistic via permutation testing all the tests are guaranteed to exactly control type error and we thus only compare their respective power in determining px py setting the figure demonstrates clear superiority of sparda which leverages the underlying sparsity to maintain high power even with the increasing overall dimensionality even when all the features differ when sparda matches the power of methods that consider the full space despite only selecting single direction which can not be based on as there are none in this controlled data this experiment also demonstrate that the unregularized pda retains greater power than diproperm similar method recent technological advances allow complete transcriptome profiling in thousands of individual cells with the goal of fine molecular characterization of cell populations beyond the crude expression measure that is currently standard we apply sparda to expression measurements of genes profiled in single cells from the somatosensory cortex and hippocampus cells sampled from the brains of juvenile mice the resulting identifies many previously characterized genes and is in many respects more informative than the results of standard differential expression methods see supplementary information for details finally we also apply sparda to normalized data with marginals in order to explicitly restrict our search to genes whose relationship with other genes expression is different between hippocampus and cortex 
cells this analysis reveals many genes known to be heavily involved in signaling regulating important processes and other forms of functional interaction between genes see supplementary information for details these types of important changes can not be detected by standard differential expression analyses which consider each gene in isolation or require to be explicitly identified as features conclusion this paper introduces the overall principal differences methodology and demonstrates its numerous practical benefits of this approach while we focused on algorithms for pda sparda tailored to the wasserstein distance different divergences may be better suited for certain applications further theoretical investigation of the sparda framework is of interest particularly in the highdimensional opnq setting here rich theory has been derived for compressed sensing and sparse pca by leveraging ideas such as restricted isometry or spiked covariance natural question is then which analogous properties of px py theoretically guarantee the strong empirical performance of sparda observed in our applications finally we also envision extensions of the methods presented here which employ multiple projections in succession or adapt the approach to comparison of multiple populations acknowledgements this research was supported by nih grant references lopes jacob wainwright more powerful test in high dimensions using random projection nips clemmensen hastie witten ersbø ll sparse discriminant analysis technometrics van der vaart aw wellner ja weak convergence and empirical processes springer gibbs al su fe on choosing and bounding probability metrics international statistical review wei lee wichers marron js for high dimensional hypothesis tests journal of computational and graphical statistics rosenbaum pr an exact test comparing two multivariate distributions based on adjacency journal of the royal statistical society series szekely rizzo testing for equal distributions in high dimension interstat gretton borgwardt km rasch mj scholkopf smola kernel test the journal of machine learning research cramer wold some theorems on distribution functions journal of the london mathematical society ja fraiman ransford sharp form of the theorem journal of theoretical probability jirak on the maximum of covariance estimators journal of multivariate analysis tibshirani regression shrinkage and selection via the lasso journal of the royal statistical society series bradley ps mangasarian ol feature selection via concave minimization and support vector machines icml aspremont el ghaoui jordan mi lanckriet gr direct formulation for sparse pca using semidefinite programming siam review amini aa wainwright mj analysis of semidefinite relaxations for sparse principal components the annals of statistics good permutation tests practical guide to resampling methods for testing hypotheses duchi hazan singer adaptive subgradient methods for online learning and stochastic optimization journal of machine learning research wright sj optimization algorithms in machine learning nips tutorial sandler lindenbaum nonnegative matrix factorization with earth mover distance metric for image analysis ieee transactions on pattern analysis and machine intelligence levina bickel the earth mover distance is the mallows distance some insights from statistics iccv wang lu liu tighten after relax sparse pca in polynomial time nips bertsekas dp network optimization continuous and discrete models athena scientific bertsekas dp eckstein dual 
coordinate step methods for linear network flow problems mathematical programming bertsekas dp incremental gradient subgradient and proximal methods for convex optimization survey in optimization for machine learning mit press pp beck teboulle fast iterative algorithm for linear inverse problems siam journal on imaging sciences guyon gunn nikravesh zadeh la feature extraction foundations and applications secaucus nj usa zou hastie tibshirani sparse principal component analysis journal of computational and graphical statistics ka bauer cr li ziv gresham et al the details in the distributions why and how to study phenotypic variability current opinion in biotechnology zeisel ab codeluppi lonnerberg la manno et al cell types in the mouse cortex and hippocampus revealed by science 
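Since the univariate squared Wasserstein distance reduces to the L2 distance between quantile functions, the projected divergence that PDA maximises can be estimated simply by sorting the projected samples. The sketch below does this on a fixed quantile grid, which is a convenient device for unequal sample sizes; it is an illustrative implementation under those assumptions, not the authors' code.

```python
# Projected squared 2-Wasserstein distance between two samples X and Y along a
# direction beta, estimated from empirical quantile functions.
import numpy as np

def projected_w2(beta, X, Y, n_grid=200):
    """Squared Wasserstein-2 distance between the projections X @ beta and Y @ beta."""
    px = np.sort(X @ beta)
    py = np.sort(Y @ beta)
    q = (np.arange(n_grid) + 0.5) / n_grid                # common quantile grid
    return np.mean((np.quantile(px, q) - np.quantile(py, q)) ** 2)

# toy example: two Gaussians differing only in the variance of the first coordinate,
# a difference that mean-based methods would miss but the projected W2 picks up
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
Y = rng.normal(size=(500, 5)); Y[:, 0] *= 2.0
print(projected_w2(np.eye(5)[0], X, Y))   # large along the first coordinate
print(projected_w2(np.eye(5)[1], X, Y))   # near zero along an unchanged coordinate
```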
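The permutation test of the null hypothesis P_X = P_Y can then be sketched as below, reusing the same sorted-quantile computation. For brevity the test statistic here maximises the projected divergence over a fixed set of random unit directions, whereas the paper would re-run the (sparse) PDA optimisation on every permuted split; the random-direction statistic is named explicitly as a stand-in and is not the proposed method.

```python
# Permutation test with a cheap stand-in statistic: the maximum projected W2
# over a fixed set of random unit directions (the paper would re-optimise the
# PDA/SPARDA projection on every permuted split instead).
import numpy as np

def w2_1d(a, b, n_grid=200):
    q = (np.arange(n_grid) + 0.5) / n_grid
    return np.mean((np.quantile(np.sort(a), q) - np.quantile(np.sort(b), q)) ** 2)

def statistic(X, Y, dirs):
    return max(w2_1d(X @ d, Y @ d) for d in dirs)

def permutation_pvalue(X, Y, n_perm=200, n_dirs=50, seed=0):
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random unit projections
    observed = statistic(X, Y, dirs)
    pooled = np.vstack([X, Y])
    n, hits = len(X), 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))                # relabel the pooled samples
        if statistic(pooled[idx[:n]], pooled[idx[n:]], dirs) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)                      # standard permutation p-value

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
Y = rng.normal(size=(200, 10)); Y[:, 0] += 1.0
print(permutation_pvalue(X, Y))   # small p-value: the null P_X = P_Y is rejected
```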
when are restless bandits indexable christopher dance and tomi silander xerox research centre europe chemin de maupertuis meylan france dance silander abstract we study the restless bandit associated with an extremely simple scalar kalman filter model in discrete time under certain assumptions we prove that the problem is indexable in the sense that the whittle index is function of the relevant belief state in spite of the long history of this problem this appears to be the first such proof we use results about and mechanical words which are particular binary strings intimately related to palindromes introduction we study the problem of monitoring several time series so as to maintain precise belief while minimising the cost of sensing such problems can be viewed as pomdps with rewards and their applications include active sensing attention mechanisms for tracking as well as online summarisation of massive data from specifically we discuss the restless bandit associated with the kalman filter restless bandits generalise bandit problems to situations where the state of each arm project site or target continues to change even if the arm is not played as with bandit problems the states of the arms evolve independently given the actions taken suggesting that there might be efficient algorithms for settings based on calculating an index for each arm which is real number associated with the state of that arm alone however while bandits always have an optimal index policy select the arm with the largest index it is known that no index policy can be optimal for some restless bandits and such problems are in general even to approximate to any factor further in this paper we address restless bandits with rather than discrete states on the other hand whittle proposed natural index policy for restless bandits but this policy only makes sense when the restless bandit is indexable section briefly restless bandit is said to be indexable when an optimal solution to relaxed version of the problem consists in playing all arms whose indices exceed given threshold the relaxed version of the problem relaxes the constraint on the number of arms pulled per turn to constraint on the average number of arms pulled per turn under certain conditions indexability implies form of asymptotic optimality of whittle policy for the original problem restless bandits associated with scalar kalman filters in continuous time were recently shown to be indexable and the corresponding problem has attracted considerable attention over long period however that attention has produced no satisfactory proof of indexability even for scalar and even if we assume that there is monotone optimal policy for the problem which is policy that plays the arm if and only if the relevant exceeds some threshold here the relevant is posterior variance theorem of this paper addresses that gap after formalising the problem section we describe the concepts and intuition section behind the main result section the main tools are mechanical words which are not sufficiently and schur convexity as these tools are associated with rather general theorems we believe that future work section should enable substantial generalisation of our results problem and index we consider the problem of tracking which we call arms in discrete time the state zi of arm at time evolves as random walk independent of everything but its immediate past and all include zero the action space is action ut makes an expensive observation yi of arm which is about zi with precision bi and 
we receive cheap observations yj of each other arm with precision aj where aj bj and aj means no observation at all let zt yt ht ft be the state observation history and observed history so that zt zn yt yn ht zt ut yt and ft ut yt then we formalise the above as is the indicator function zi ht zi yi zt ut zi ai bi note that this setting is readily generalised to zi zi by change of variables thus the posterior belief is given by the kalman filter as zi ft xi where the posterior mean is and the error variance xi satisfies and xi xi where ai ai bi bi problem let be policy so that ut let xπi be the error variance under the problem is to choose so as to minimise the following objective for discount factor the objective consists of weighted sum of error variances xπi with weights wi plus observation costs hi for xx hi wi xi hi wi xπi where the equality follows as is deterministic mapping and assuming is deterministic problem and whittle index now fix an arm and write xπt instead of xπt say there are now two actions ut corresponding to cheap and expensive observations respectively and the expensive observation now costs where the singlearm problem is to choose policy which here is an action sequence so as to minimise ut wxπt where let be the optimal in this problem if the first action must be and let be an optimal policy so that wx βv φα for any fixed the value of for which actions and are both optimal is known as the whittle index λw assuming it exists and is unique in other words the whittle index λw is the solution to let us consider policy which takes action then acts optimally producing actions ut and error variances then gives λw wx λw wxt solving this linear equation for the index λw gives xt xt ut ut whittle recognised that for his index policy play the arm with the largest λw to make sense any arm which receives an expensive observation for added cost must also receive an expensive observation for added cost such problems are said to be indexable the question resolved by this paper is whether problem is indexable equivalently is λw in xt xt figure orbit traces the path abcde for the word orbit xt traces the path ghij for the word word is palindrome main result key concepts and intuition we make the following intuitive assumption about threshold monotone policies for some depending on the policy ut is optimal for problem note that under definition means the policy ut is also optimal so we can choose if if and otherwise otherwise if if ut and xt otherwise otherwise where we refer to xt xt as the orbits figure we are now ready to state our main result theorem suppose threshold policy is optimal for the problem then problem is indexable specifically for any let ax bx and for any and let xt xt ut ut in which action sequences ut and error variance sequences xt xt are given in terms of by then is continuous and function of we are now ready to describe the key concepts underlying this result words in this paper word is string on with th letter wk and wi wi wj the empty word is the concatenation of words is uv the word that is the repetition of is wn the infinite repetition of is wω and is the reverse of so means is palindrome the length of is and is the number of times that word appears in overlaps included christoffel sturmian and mechanical words it turns out that the action sequences in are given by such words so the following definitions are central to this paper figure part of the christoffel tree the christoffel tree figure is an infinite complete binary tree in which each node is labelled 
with pair of words the root is and the children of are uv and uv the christoffel words are the words and the concatenations uv for all in that tree the fractions form the tree which contains each positive rational number exactly once also infinite paths in the tree converge to the positive irrational numbers analogously sturmian words could be thought of as christoffel words alternatively among many known characterisations the christoffel words can be defined as the words and the words where and ac bnac for any relatively prime natural numbers and and for the sturmian words are then the infinite words where for and ac bnac we use the notation for sturmian words although they are infinite the set of mechanical words is the union of the christoffel and sturmian words note that the mechanical words are sometimes defined in terms of infinite repetitions of the christoffel words majorisation as in let rm and let and be their elements sorted in ascending order we say is weakly supermajorised by and write if for all if this is an equality for we say is majorised by and write it turns out that for with equality for where are the sequences sorted in descending order for rm we have xi yi for all convex functions more generally function defined on subset of rm is said to be on if implies that transformations let µa denote the transformation µa where transformations such as are closed under composition so for any word we define φw and intuition here is the intuition behind our main result for any the orbits in correspond to particular mechanical word or depending on the value of figure specifically for any word let yu be the fixed point of the mapping φu on so that φu yu yu and yu then the word corresponding to is for for and for in passing we note that these fixed points are sorted in ascending order by the ratio of counts of to counts of as and figure lower fixed points of christoffel words black dots majorisation points for those words black circles and the tree of φw blue illustrated by figure interestingly it turns out that ratio is yet continuous function of reminiscent of the cantor function also composition of transformations is homeomorphic to matrix multiplication so that for any µa µb µab thus the index can be written in terms of the orbits of linear system given by or further if and det then the gradient of the corresponding transformation is the convex function dµa dx so the gradient of the index is the difference of the sums of convex function of the orbits however such sums are functions and it follows that the index is increasing because one orbit weakly supermajorises the other as we now show for the case noting that the proof is easier for words as is mechanical word is palindrome further if is palindrome it turns out that the difference between the orbits increases with so we might define the majorisation point for as the for which one orbit majorises the other quite remarkably if is palindrome then the majorisation point is φw proposition indeed the black circles and blue dots of figure coincide finally φw is less than or equal to which is the least for which the orbits correspond to the word indeed the blue dots of figure are below the corresponding black dots thus one orbit does indeed supermajorise the other proof of main result mechanical words the transformations of satisfy the following assumption for we prove that the fixed point yw of word the solution to φw on is unique in the supplementary material assumption functions where is an interval of are increasing and so for all 
and for we have φk φk φk φk and increasing furthermore the fixed points of on satisfy hence the following two propositions supplementary material apply to of on proposition suppose holds and is word then φw φw yw yw and φw φw yw yw for given in the notation of we call the shortest word such that the word proposition generalises recent result about words in setting where are linear proposition suppose holds and is mechanical word then is the word also if with and then the and words are and we also use the following very interesting fact proposition on of proposition suppose is mechanical word then is palindrome properties of the orbits and prefix sums definition assume that and consider the matrices and so that the transformations µf µg are the functions of and gf given any word we define the matrix product where and where is the identity and the prefix sum as the matrix polynomial where the matrix for any let tr be the trace of let aij ij be the entries of and let indicate that all entries of are remark clearly det det so that det for any word also corresponds to the partial sums of the orbits as hinted in the previous section the following proposition captures the role of palindromes proof in the supplementary material proposition suppose is word is palindrome and then for some tr tr if then λk for some if is prefix of then we now demonstrate surprisingly simple relation between and proposition suppose is palindrome then and furthermore if then for all proof let us write we prove by induction on in the base case for for for some for the inductive step in accordance with claim of proposition assume for some word satisfying for some for gm and gm calculating the corresponding matrix products and sums gives bh bf bh bf bh bf as claimed for the claim also holds as this completes the proof of furthermore part let and gf gf then gf by definition of by claim of proposition and we know that for some substituting these expressions and the definitions of into the definitions of and then into for directly gives although this calculation is long now consider the case claim of proposition says tr tr and clearly det det thus we can diagonalise as du diag for some so that au bv λk so if then and we already showed that otherwise so implies which gives thus for any we have λk gf dv majorisation the following is straightforward consequence of results in proved in the supplementary material we emphasize that the notation has nothing to do with the notion of as word symmetric function that is convex and proposition suppose rm and is pm pm decreasing on then and for any and any fixed word define the sequences for and σx σy where and proposition suppose is palindrome and φw then σx sequences on and σx σy for any and σy are ascending proof clearly φw so and hence so for any word and letter we have uc as thus xk and yk in conclusion σx and σy are ascending sequences on now φw thus av φw am for any so φw φw for by claim of proposition so all but the first term of the sum tm φw is where tj thus φw φw tm φw but tm φw where the last step follows from so tj φw for yet claims and of pj tj so for proposition give dx φw we have tj for which means that σx σy indexability theorem the index λw of is continuous and for proof as weight is and cost is constant we only need to prove the result for λw and we can use to denote word by proposition for some mechanical word cases are clarified in the supplementary material let us show that the hypotheses of proposition are satisfied by and firstly is palindrome by proposition 
secondly and as φw is monotonically increasing it follows that φw φw equivalently φw so that φw by proposition hence φw thus proposition applies showing that the sequences σx and σy with elements and as defined in are sequences on with σx σy also is symmetric function that is convex and decreasing on therefore proposition applies giving for any where also proposition shows that the orbits are and where and so the denominator of is mk note that dx for any eh then gives dλ βm dx but is continuous for as shown in the supplementary material therefore we conclude that is for further work one might attempt to prove that assumption holds using general results about monotone optimal policies for mdps based on submodularity or multimodularity however we find to the required submodularity condition rather we are optimistic that the ideas of this paper themselves offer an alternative approach to proving it would then be natural to extend our results to settings where the underlying state evolves as ht mzt for some multiplier and to cost functions other than the variance finally the question of the indexability of the kalman filter in multiple dimensions remains open references altman gaujal and hordijk multimodularity convexity and optimization properties mathematics of operations research altman and stidham optimality of monotonic policies for markovian decision processes with applications to control of queues with delayed information queueing systems araya buffet thomas and charpillet pomdp extension with rewards in neural information processing systems pages badanidiyuru mirzasoleiman karbasi and krause streaming submodular maximization massive data summarization on the fly in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages berstel lauve reutenauer and saliola combinatorics on words christoffel words and repetitions in words crm monograph series bubeck and regret analysis of stochastic and nonstochastic bandit problems foundation and trends in machine learning vol now chen shioi montesinos koh wich and krause active detection via adaptive submodularity in proceedings of the international conference on machine learning pages gittins glazebrook and weber bandit allocation indices john wiley sons graham knuth and patashnik concrete mathematics foundation for computer science guha munagala and shi approximation algorithms for restless bandit problems journal of the acm la scala and moran optimal target tracking with restless bandits digital signal processing le ny feron and dahleh scheduling kalman filters ieee trans automatic control lothaire algebraic combinatorics on words cambridge university press marshall olkin and arnold inequalities theory of majorization and its applications springer science business media meier peschon and dressler optimal control of measurement subsystems ieee trans automatic control and villar multitarget tracking via restless bandit marginal productivity indices and kalman filter in discrete time in proceedings of the ieee conference on decision and control pages ortner ryabko auer and munos regret bounds for restless markov bandits in algorithmic learning theory pages springer rajpathak pillai and bandyopadhyay analysis of stable periodic orbits in the one dimensional linear discontinuous map chaos thiele sur la compensation de quelques erreurs par la des moindres ca reitzel verloop asymptotic optimal control of restless bandits cnrs technical report villar restless bandit index policies for dynamic sensor scheduling 
optimization. PhD thesis, Statistics Department, Universidad Carlos III de Madrid. Vul, Alvarez, Tenenbaum, and Black. Explaining human multiple object tracking as approximate inference in a dynamic probabilistic model. In Neural Information Processing Systems. Weber and Weiss. On an index policy for restless bandits. Journal of Applied Probability. Whittle. Restless bandits: activity allocation in a changing world. Journal of Applied Probability.
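To make the constructions in the preceding sections concrete, the following minimal Python sketch builds the Christoffel tree exactly as described (root labelled (a, b), children of (u, v) labelled (u, uv) and (uv, v)) and approximates the fixed point y_w of a composed map phi_w by iteration. The concrete single-letter maps used here (a scalar random-walk state with unit process noise observed at precision prec) are an illustrative assumption, not the paper's exact parameterisation; only the tree construction and the fixed-point idea are taken from the text.

from fractions import Fraction

def christoffel_tree(depth):
    """Enumerate Christoffel words from the Christoffel tree.

    The root node is labelled ('a', 'b') and the children of (u, v) are
    (u, u + v) and (u + v, v); the Christoffel words are 'a', 'b' and the
    concatenations u + v over all nodes, as described in the text.
    """
    level = [("a", "b")]
    words = {"a", "b"}
    for _ in range(depth):
        nxt = []
        for u, v in level:
            words.add(u + v)
            nxt.append((u, u + v))
            nxt.append((u + v, v))
        level = nxt
    return sorted(words, key=lambda w: (len(w), w))

def slope(word):
    # Ratio of counts of 'b' to counts of 'a' (undefined for the word 'b').
    return Fraction(word.count("b"), word.count("a"))

def phi(prec):
    # Hypothetical single-letter map: posterior error variance of a scalar
    # random-walk state with unit process noise, observed at precision `prec`.
    # This concrete form is an assumption made only for illustration.
    return lambda x: 1.0 / (1.0 / (x + 1.0) + prec)

def fixed_point(word, prec_a, prec_b, x0=1.0, iters=200):
    """Approximate the fixed point y_w of phi_w by repeated iteration."""
    maps = {"a": phi(prec_a), "b": phi(prec_b)}
    x = x0
    for _ in range(iters):
        for letter in word:
            x = maps[letter](x)
    return x

if __name__ == "__main__":
    for w in christoffel_tree(3):
        if "a" in w:  # skip the word 'b', whose slope is infinite
            print(w, slope(w), round(fixed_point(w, prec_a=0.1, prec_b=2.0), 4))

Tabulating y_w against the ratio of counts of b to counts of a for the generated words gives a quick empirical view of the ordering of fixed points by slope discussed in the main-result section.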
segregated graphs and marginals of chain graph models ilya shpitser department of computer science johns hopkins university ilyas abstract bayesian networks are popular representation of asymmetric for example causal relationships between random variables markov random fields mrfs are complementary model of symmetric relationships used in computer vision spatial modeling and social and gene expression networks chain graph model under the interpretation hereafter chain graph model generalizes both bayesian networks and mrfs and can represent asymmetric and symmetric relationships together as in other graphical models the set of marginals from distributions in chain graph model induced by the presence of hidden variables forms complex model one recent approach to the study of marginal graphical models is to consider supermodel such supermodel of marginals of bayesian networks defined only by conditional independences and termed the ordinary markov model was studied at length in in this paper we show that special mixed graphs which we call segregated graphs can be associated via markov property with supermodels of marginals of chain graphs defined only by conditional independences special features of segregated graphs imply the existence of very natural factorization for these supermodels and imply many existing results on the chain graph model and the ordinary markov model carry over our results suggest that segregated graphs define an analogue of the ordinary markov model for marginals of chain graph models we illustrate the utility of segregated graphs for analyzing outcome interference in causal inference via simulated datasets introduction graphical models are flexible and widely used tool for modeling and inference in high dimensional settings directed acyclic graph dag models also known as bayesian networks are often used to model relationships with an inherent asymmetry perhaps induced by temporal order on variables or relationships models represented by undirected graphs ugs such as markov random fields mrfs are used to model symmetric relationships for instance proximity in social graphs expression in gene networks coinciding magnetization of neighboring atoms or similar colors of neighboring pixels in an image some graphical models can represent both symmetric and asymmetric relationships together one such model is the chain graph model under the interpretation which we will shorten to the chain graph we will not consider the chain graph model under the amp interpretation or other chain graph models discussed in in this paper just as the dag models and mrfs the chain graph model has set of equivalent under some assumptions definitions via set of markov properties and factorization modeling and inference in multivariate settings is complicated by the presence of hidden yet relevant variables their presence motivates the study of marginal graphical models marginal dag models are complicated objects inducing not only conditional independence constraints but also more general equality constraints such as the verma constraint and inequality constraints such as the instrumental variable inequality and the bell inequality in quantum mechanics one approach to studying marginal dag models has therefore been to consider tractable supermodels defined by some easily characterized set of constraints and represented by mixed graph one such supermodel defined only by conditional independence constraints induced by the underlying hidden variable dag on the observed margin is the ordinary markov 
model studied in depth in another supermodel defined by generalized independence constraints including the verma constraint as special case is the nested markov model there is rich literature on markov properties of mixed graphs and corresponding independence models see for instance in this paper we adapt similar approach to the study of marginal chain graph models specifically we consider supermodel defined only by conditional independences on observed variables of hidden variable chain graph and ignore generalized equality constraints and inequalities we show that we can associate this supermodel with special mixed graphs which we call segregated graphs via global markov property special features of segregated graphs imply the existence of convenient factorization which we show is equivalent to the markov property for positive distributions this equivalence along with properties of the factorization implies many existing results on the chain graph model and the ordinary markov model carry over the paper is organized as follows section describes motivating example from causal inference for the use of hidden variable chain graphs with details deferred until section in section we introduce the necessary background on graphs and probability theory define segregated graphs sgs and an associated global markov property and show that the global markov properties for dag models chain graph models and the ordinary markov model induced by hidden variable dags are special cases in section we define the model of conditional independence induced by hidden variable chain graphs and show it can always be represented by sg via an appropriate global markov property in section we define segregated factorization and show that under positivity the global markov property in section and segregated factorization are equivalent in section we introduce causal inference and interference analysis as an application domain for hidden variable chain graph models and thus for sgs and discuss simulation study that illustrates our results and shows how parameters of the model represented by sg can directly encode parameters representing outcome interference in the underlying hidden variable chain graph section contains our conclusions we will provide outlines of arguments for our claims below but will generally defer detailed proofs to the supplementary material motivating example interference in causal inference consider dataset obtained from vaccination trial described in consisting of pairs where the children were vaccinated against pertussis we suspect that though mothers were not vaccinated directly the fact that children were vaccinated and each mother will generally only contract pertussis from her child the child vaccine may have protective effect on the mother at the same time if only the mothers but not children were vaccinated we would expect the same protective effect to operate in reverse this is an example of interference an effect of treatment on experimental units other than those to which the treatment was administered the relationship between the outcomes of mother and child due to interference in this case has some features of causal relationship but is symmetric we model this study by chain graph shown in fig see section for justification of this model here is the vaccine or placebo given to children and is the children outcomes is the treatment given to mothers in our case no treatment and is the mothers outcomes directed edges represent the direct causal effect of treatment on unit and the 
undirected edge represents the interference relationship among the outcome pair in this model mother and child treatment are assigned independently and mother outcome is independent of child treatment if we know child outcome and mother treatment and vice versa since treatments in this study were randomly assigned there are no unobserved confounders figure chain graph representing the vaccination example in more complex vaccination example with followup booster shot naive generalization of the latent projection idea applied to where and edges meet segregated graph preserving conditional independences in not involving consider however more complex example where both mother and child are given the initial vaccine but possibly based on results of followup visit children are given booster and we consider the child and the mother outcomes where the same kind of interference relationship is operating we model the child unobserved health status which influences both and by possibly very high dimensional hidden variable the result is hidden variable chain graph in fig since is unobserved and possibly very complex modeling it directly may lead to model misspecification an alternative explored for instance in is to consider model defined by conditional independences induced by the hidden variable model in fig on observed variables simple approach that directly generalizes what had been done in dag models is to encode conditional independences via path separation criterion on mixed graph constructed from hidden variable chain graph via latent projection operation the difficulty with this approach is that simple generalizations of latent projections to the chain graph case may yield graphs where and edges met as happens in fig this is an undesirable feature of graphical representation since existing factorization and parameterization results for chain graphs or ordinary markov models which decompose the joint distribution into pieces corresponding to sets connected by or edges do not generalize in the remainder of the paper we show that for any hidden variable chain graph it is always possible to construct not necessarily unique mixed graph called segregated graph sg where and edges do not meet and which preserves all conditional independences on the observed variables one sg for our example is shown in fig conditional independences implied by this graph are and properties of sgs imply existing results on chain graphs and the ordinary markov model carry over with little change for example we may directly apply the parameterization in and the fitting algorithm in to the model corresponding to fig if the state spaces are discrete as we illustrate in section the construction we give for sgs may replace undirected edges by directed edges in way that may break the symmetry of the underlying interference relationship thus directed edges in sg do not have straightforward causal interpretation background and preliminaries we will consider mixed graphs with three types of edges undirected directed and directed where pair of vertices is connected either by single edge or pair of edges one of which is directed and one bidirected we will denote an edge as an ordered pair of vertices with subscript indicating the type of edge for example ab we will suppress the subscript if edge orientation is not important an alternating sequence of nodes and edges of the form ai ai where we allow ai aj if is called walk in some references also route we will denote walks by lowercase greek letters walk with edges is called trail 
trial with vertices is called path directed cycle is trail of the form ai ai partially directed cycle is trail with edges and at least one edge where there exists way to orient edges to create directed cycle we will sometimes write path from to where intermediate vertices are not important but edge orientation is as for example mixed graph with no and edges and no directed cycles is called directed acyclic graph dag mixed graph with no edges and no directed cycles is called an acyclic directed mixed graph admg mixed graph with no edges and no partially directed cycles is called chain graph cg segregated graph sg is mixed graph with no partially directed cycles where no path of the form ai ai aj aj aj ak ak exists dags are special cases of admgs and cgs which are special cases of sgs we consider sets of distributions over set defined by independence constraints linked to above types of graphs via global markov properties we will refer to as either vertices in graph or random variables in distribution it will be clear from context what we mean markov model of graph defined via global markov property has the general form where the consequent means is independent of conditional on in and the antecedent means is separated from given according to certain walk separation property in since dags admgs and cgs are special cases of sgs we will define the appropriate path separation property for sgs which will recover known separation properties in dags admgs and cgs as special cases walk contained in another walk is called subwalk of maximal subwalk in where all edges are undirected is called section of section may consist of single node and no edges we say section of walk is collider section if edges in immediately preceding and following contain arrowheads into otherwise is section walk from to is said to be by set in sg if there exists collider section that does not contain an element of or section that does such section is called blocked is said to be from given in sg if every walk from vertex in to vertex in is by and is given otherwise lemma the markov properties defined by superactive routes walks in cgs mseparation in admgs and in dags are special cases of the markov property defined by in sgs segregated graph representation of cg independence models for sg and define the model to be the set of distributions where all conditional independences in implied by hold that is may equal even if and are distinct if is empty simply reduces to the markov model defined by on the entire graph we are going to show that there is always sg that represents the conditional independences that define using special type of vertex we call sensitive vertex in an sg is sensitive if for any other vertex if exists in then exists in we first show that if is sensitive we can orient all undirected edges away from and this results in new sg that gives the same set of conditional independence via this is lemma next we show that for any with child with adjacent undirected edges if is not sensitive we can make it sensitive by adding appropriate edges and this results in new sg that preserves all conditional independences that do not involve this is lemma given above for any vertex in sg we can construct new sg that preserves all conditional independences in that do not involve and where no children of have adjacent undirected edges this is lemma we then project out to get another sg that preserves all conditional independences not involving in this is theorem we are then done corollary states that there is always not 
necessarily unique sg for the conditional independence structure of marginal of cg lemma for sensitive in sg let hv be the graph be obtained from by replacing all edges adjacent to by edges pointing away from then hv is an sg and hv the intuition here is that directed edges differ from undirected edges due to collider bias induced by the former that is dependence between parents of block is created by conditioning on variables in the block but sensitive vertex in block is already dependent on all the parents in the block so orienting undirected edges away from such vertex and making it block parent does not change the set of advertised independences lemma let be an sg and graph obtained from adding an edge for two nonadjacent vertices where exists in then is an sg lemma for any in an sg let be obtained from by adding whenever exists in then is an sg and this lemma establishes that two graphs one an edge supergraph of the other agree on the conditional independences not involving certainly the subgraph advertises at least as many constraints as the supergraph to see the converse note that definition of coupled with our inability to condition on can always be used to create dependence between and the vertices joined by an edge in the supergraph explicitly this dependence can be created regardless of the conditioning set either via the path or via the walk path it can thus be shown that adding these edges does not remove any independences lemma let be vertex in sg with at least two vertices then there exists an sg where does not exist and proof this follows by an inductive application of lemmas and note that lemma does not guarantee that the graph is unique in fact depending on the order in which we apply the induction we may obtain different sgs with the required property theorem if is an sg with at least vertices and there exists an sg with vertices such that this theorem exploits previous results to construct graph which agrees with on all independences not involving and which does not contain children of that are part of the block with size greater than two given graph with this structure we can adapt the latent projection construction to yield sg that preserves all independences corollary let be an sg with vertices then for any there exists an sg with vertices such that segregated factorization we now show that for positive distributions the markov property we defined and certain factorization for sgs give the same model set of vertices that form connected component in graph obtained from by dropping all edges except and where no vertex is adjacent to edge in is called district in block is set of vertices forming connected component of size two or more in graph obtained from by dropping all edges except we denote the set of districts and blocks in by and respectively it is trivial to show that in sg with vertices and partition for vertex set in define pasg is in and pasg for in let ga be the subgraph of containing only vertices in and edges between them the anterior of set denoted by antg is the set of vertices with partially directed path into node in set is called anterial in if whenever ant we denote the set of anterial subsets of in by let ga clique in an ug is maximal connected component the set of cliques in an ug will be denoted by vertex ordering is topological for sg if whenever antg for vertex in and topological define preg figure simple causal dag model causal dag models for interference causal dag representing markov chain with an equilibrium distribution in the chain graph 
model in fig given sg define the augmented graph to be an undirected graph with the same vertex set as where share an undirected edge in if are connected by walk consisting exclusively of collider sections in note that this trivially includes all that share an edge we say satisfies the augmented global markov property with respect to sg if for any satisfies the ug global markov property with respect to ga we denote model that is set of satisfying this property with respect to as by analogy with the ordinary markov model and the chain graph model we say that obeys segregated factorization with respect to sg if there exists set of kers da such that for every nels pa pasg and for every fs pasg ga ga φc where φc is mapping from values of to reals lemma if factorizes withq respect to then fs pasg pasg for every and fs pag preg antg for every da and any topological ordering on theorem if factorizes with respect to sg then lemma if there exists walk in between with all sections not intersecting and all collider sections in antg then there exist such that and are given in theorem theorem for sg if and is positive then factorizes with respect to corollary for any sg if is positive then if and only if factorizes with respect to causal inference and interference analysis in this section we briefly describe interference analysis in causal inference as motivation for the use of sgs causal inference is concerned with using observational data to infer cause effect relationships as encoded by interventions setting variable valus from outside the causal dags are often used as tool where directed arrows represent causal relationships not just statistical relevance see for an extensive discussion of causal inference much of recent work on interference in causal inference see for instance has generalized causal dag models to settings where an intervention given to subjects affects other subjects classic example is herd immunity in epidemiology vaccinating subset of subjects can render all subjects even those who were not vaccinated immune interference is typically encoded by having vertices in causal diagram represent not response variability in population but responses of individual units or appropriately defined groups of units where interference only occurs between groups not within figure density with degrees of freedom red and histogram of observed deviances of ordinary markov models of fig fitted with data sampled from randomly sampled model of fig axis values of parameters red and green in the fitted nested markov model of fig axis value of the interaction parameter and in the underlying chain graph model for fig same plot with yellow and blue group for example the dag in fig represents generalization of the model in fig to setting with unit pairs where assigning vaccine to one unit may also influence another unit as was the case in the example in section furthermore we may consider more involved examples of interference if we record responses over time as is shown in fig extensive discussions on this type of modeling approach can be found in we consider an alternative approach to encoding interference between responses using chain graph models we give two justifications for the use of chain graphs first we may assume that interference arises as dependence between responses and in equilibrium of markov chain where transition probabilities represent the causal influence of on and vice versa at multiple points in time before equilibrium is reached under certain assumptions it can be shown that such an 
equilibrium distribution obeys the markov property of chain graph for example the dag shown in fig encodes transition probabilities for particular values for suitably chosen conditional distributions these transition probabilities lead to an equilibrium distribution that lies in the model corresponding to the chain graph in fig second we may consider certain independence assumptions in our problem as reasonable and sometimes such assumptions lead naturally to chain graph model for example we may study the effect of marketing intervention in social network and consider it reasonable that we can predict the response of any person only knowing the treatment for that person and responses of all friends of this person in social network in other words the treatments on everyone else are irrelevant given this information these assumptions result in response model that is chain graph with directed arrows from treatment to every person response and undirected edges between friends only an example of interference analysis using segregated graph models given ubiquity of unobserved confounding variables in causal inference and our our choice of chain graphs for modeling interference we use models represented by sgs to avoid having to deal with hidden variable chain graph model directly due to the possibility of misspecifying the likely high dimensional hidden variables involved we briefly describe simulation we performed to illustrate how sgs may be used for interference analysis as running example we used model shown in fig with binary and we first considered the following family of parameterizations in all members of this family was assigned via fair coin was logistic model with no interactions was randomly assigned via fair coin given no complications otherwise was heavily weighted probability towards treatment assignment the distribution was obtained from jointp distribution in model of an undirected graph kxc of the form exp λc where ranges over all cliques in is the λc are interactions parameters and is normalizing constant in our case was an undirected graph over where edges from to and were missing and all other edges were present parameters λc were generated from it is not difficult to show that all elements in our family lie in the chain graph model in fig since all observed variables in our example are binary the saturated model has parameters and the model corresponding to fig is missing of them are missing because does not depend on and are missing because does not depend on if our results on sgs are correct we would expect the ordinary markov model of graph in fig to be good fit for the data generated from our hidden variable chain graph family where we omit the values of in particular we would expect the observed deviances of our models fitted to data generated from our family to closely follow distribution with degrees of freedom we generated members of our family described above used each member to generate samples and fitted the ordinary markov model using an approach described in the resulting deviances plotting against the appropriate distribution are shown in fig which looks as we expect we did not vary the parameters for this is because models for fig and fig will induce the same marginal model for by construction in addition we wanted to illustrate that we can encode interaction parameters directly via parameters in sg to this end we generated set of distributions via the binary model as described above where all λc parameters were fixed except we constrained to equal and varied 
from to these parameters represent interaction of and and interaction of and and thus directly encode the strength of the interference relationship between responses since the sg in fig breaks the symmetry by replacing the undirected edge between and by directed edge the strength of interaction is represented by the degree of dependence of and conditional on as can be seen in fig we obtain independence precisely when and in the underlying hidden variable chain graph model is as expected our simulations did not require the modification of the fitting procedure in since fig is an admg in general sg will have undirected blocks however the special property of sgs allows for trivial modification of the fitting procedure since the likelihood decomposes into pieces corresponding to districts and blocks of the sg we can simply fit each district piece using the approach in and each block piece using any of the existing fitting procedures for discrete chain graph models discussion and conclusions in this paper we considered graphical representation of the ordinary markov chain graph model the set of distributions defined by conditional independences implied by marginal of chain graph model we show that this model can be represented by segregated graphs via global markov property which generalizes markov properties in chain graphs dags and mixed graphs representing marginals of dag models segregated graphs have the property that bidirected and undirected edges are never adjacent under positivity this global markov property is equivalent to segregated factorization which decomposes the joint distribution into pieces that correspond either to sections of the graph containing bidirected edges or sections of the graph containing undirected edges but never both together the convenient form of this factorization implies many existing results on chain graph and ordinary markov models in particular parameterizations and fitting algorithms carry over we illustrated the utility of segregated graphs for interference analysis in causal inference via simulated datasets acknowledgements the author would like to thank thomas richardson for suggesting mixed graphs where and edges do not meet as interesting objects to think about and elizabeth ogburn and eric tchetgen tchetgen for clarifying discussions of interference this work was supported in part by an nih grant references andersson madigan and perlman characterization of markov equivalence classes for acyclic digraphs annals of statistics bell on the einstein podolsky rosen paradox physics cai kuroki pearl and tian bounds on direct effects in the presence of confounded intermediate variables biometrics drton discrete chain graph models bernoulli evans and richardson maximum likelihood fitting of acyclic directed mixed graphs to binary data in proceedings of the twenty sixth conference on uncertainty in artificial intelligence volume evans and richardson markovian acyclic directed mixed graphs for discrete data annals of statistics pages koster marginalizing and conditioning in graphical models bernoulli lauritzen graphical models oxford clarendon lauritzen and richardson chain graph models and their causal interpretations with discussion journal of the royal statistical society series ogburn and vanderweele causal diagrams for interference statistical science pearl probabilistic reasoning in intelligent systems morgan and kaufmann san mateo pearl causality models reasoning and inference cambridge university press edition richardson and spirtes ancestral graph 
markov models annals of statistics richardson markov properties for acyclic directed mixed graphs scandinavial journal of statistics sadeghi and lauritzen markov properties for mixed graphs bernoulli shpitser evans richardson and robins introduction to nested markov models behaviormetrika studeny bayesian networks from the point of view of chain graphs in proceedings of the fourteenth conference on uncertainty in artificial intelligence pages morgan kaufmann san francisco ca van der laan causal inference for networks working paper van der laan causal inference for population of causally connected units journal of causal inference vanderweele tchetgen and halloran components of the indirect effect in vaccine trials identification of contagion and infectiousness effects epidemiology verma and pearl equivalence and synthesis of causal models technical report department of computer science university of california los angeles wermuth probability distributions with summary graph structure bernoulli 
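As a concrete companion to the interference simulations above, the following minimal Python sketch writes the response block of the chain graph a1 -> y1 - y2 <- a2 (the vaccination example) as a log-linear conditional over block cliques, in the spirit of the exp(sum over cliques of lambda_C terms) family used in the simulation section, and numerically checks the chain-graph independence of y2 and a1 given (y1, a2). The clique parameter values and function names are illustrative assumptions, not the paper's implementation.

import itertools
import math

# Illustrative clique parameters for the response block {y1, y2} of the chain
# graph a1 -> y1 - y2 <- a2; the values are arbitrary and only for this sketch.
lam = {"y1": 0.3, "y2": -0.2, "y1y2": 1.1, "y1a1": -1.5, "y2a2": -0.9}

def block_conditional(a1, a2):
    """p(y1, y2 | a1, a2) under a log-linear (LWF-style) block factorization."""
    scores = {}
    for y1, y2 in itertools.product([0, 1], repeat=2):
        s = (lam["y1"] * y1 + lam["y2"] * y2 + lam["y1y2"] * y1 * y2
             + lam["y1a1"] * y1 * a1 + lam["y2a2"] * y2 * a2)
        scores[(y1, y2)] = math.exp(s)
    z = sum(scores.values())
    return {yy: v / z for yy, v in scores.items()}

def p_y2_given(y1, a1, a2):
    # p(y2 = 1 | y1, a1, a2), obtained by normalizing within the block.
    p = block_conditional(a1, a2)
    return p[(y1, 1)] / (p[(y1, 0)] + p[(y1, 1)])

# The chain-graph Markov property asserts that y2 is independent of a1 given
# (y1, a2): the mother's outcome is independent of the child's treatment once
# we know the child's outcome and the mother's treatment.  Flipping a1 should
# therefore not change the conditional below (up to floating-point error).
for y1, a2 in itertools.product([0, 1], repeat=2):
    assert abs(p_y2_given(y1, 0, a2) - p_y2_given(y1, 1, a2)) < 1e-12

The same enumeration-and-normalisation pattern extends to small hidden-variable versions of the example by summing the latent variable out of the joint before checking the independences advertised by the segregated graph.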
efficient optimization of decision trees mohammad maxwell matthew david fleet pushmeet kohli department of computer science university of toronto department of computer science university of microsoft research abstract decision trees and randomized forests are widely used in computer vision and machine learning standard algorithms for decision tree induction optimize the split functions one node at time according to some splitting criteria this greedy procedure often leads to suboptimal trees in this paper we present an algorithm for optimizing the split functions at all levels of the tree jointly with the leaf parameters based on global objective we show that the problem of finding optimal oblique splits for decision trees is related to structured prediction with latent variables and we formulate upper bound on the tree empirical loss computing the gradient of the proposed surrogate objective with respect to each training exemplar is where is the tree depth and thus training deep trees is feasible the use of stochastic gradient descent for optimization enables effective training with large datasets experiments on several classification benchmarks demonstrate that the resulting decision trees outperform greedy decision tree baselines introduction decision trees and forests have long and rich history in machine learning recent years have seen an increase in their popularity owing to their computational efficiency and applicability to classification and regression tasks case in point is microsoft kinect where decision trees are trained on millions of exemplars to enable human pose estimation from depth images conventional algorithms for decision tree induction are greedy they grow tree one node at time following procedures laid out decades ago by frameworks such as and cart while recent work has proposed new objective functions to guide greedy algorithms it continues to be the case that decision tree applications utilize the same dated methods of tree induction greedy decision tree induction builds binary tree via recursive procedure as follows beginning with single node indexed by split function si is optimized based on corresponding subset of the training data di such that di is split into two subsets which in turn define the training data for the two children of the node the intrinsic limitation of this procedure is that the optimization of si is solely conditioned on di there is no ability to the split function si based on the results of training at lower levels of the tree this paper proposes general framework for learning of the split parameters for methods that addresses this limitation we focus on binary trees while extension to trees is possible we show that our joint optimization of the split functions at different levels of the tree under global objective not only promotes cooperation between the split nodes to create more compact trees but also leads to better generalization performance part of this work was done while norouzi and collins were at microsoft research cambridge one of the key contributions of this work is establishing link between the decision tree optimization problem and the problem of structured prediction with latent variables we present novel formulation of the decision tree learning that associates binary latent decision variable with each split node in the tree and uses such latent variables to formulate the tree empirical loss inspired by advances in structured prediction we propose upper bound on the empirical loss this bound acts as surrogate objective 
that is optimized using stochastic gradient descent sgd to find locally optimal configuration of the split functions one complication introduced by this particular formulation is that the number of latent decision variables grows exponentially with the tree depth as consequence each gradient update will have complexity of for inputs one of our technical contributions is showing how this complexity can be reduced to by modifying the surrogate objective thereby enabling efficient learning of deep trees related work finding optimal split functions at different levels of decision tree according to some global objective such as regularized empirical risk is due to the discrete and sequential nature of the decisions in tree thus finding an efficient alternative to the greedy approach has remained difficult objective despite many prior attempts bennett proposes programming based approach for global tree optimization and shows that the method produces trees that have higher classification accuracy than standard greedy trees however their method is limited to binary classification with loss and has high computation complexity making it only applicable to trees with few nodes the work in proposes means for training decision forests in an online setting by incrementally extending the trees as new data points are added as opposed to naive incremental growing of the trees this work models the decision trees with mondrian processes the hierarchical mixture of experts model uses soft splits rather than hard binary decisions to capture situations where the transition from low to high response is gradual the use of soft splits at internal nodes of the tree yields probabilistic model in which the is smooth function of the unknown parameters hence training based on is amenable to numerical optimization via methods such as expectation maximization em that said the soft splits necessitate the evaluation of all or most of the experts for each data point so much of the computational advantage of the decision tree are lost murthy and salzburg argue that tree learning methods that work by looking ahead are unnecessary and sometimes harmful this is understandable since their methods work by minimizing the empirical loss without any regularization which is prone to overfitting to avoid this problem it is common practice see breiman or criminisi and shotton for an overview to limit the tree depth and introduce limits on the number of training instances below which tree branch is not extended or to force diverse ensemble of trees decision forest through the use of bagging bennett and blue describe different way to overcome overfitting by using framework and the support vector machines svm at the split nodes of the tree subsequently bennett et al show how enlarging the margin of decision tree classifiers results in better generalization performance our formulation for decision tree induction improves on prior art in number of ways not only does our latent variable formulation of decision trees enable efficient learning it can handle any general loss function while not sacrificing the dp complexity of inference imparted by the tree structure further our surrogate objective provides natural way to regularize the joint optimization of tree parameters to discourage overfitting problem formulation for ease of exposition this paper focuses on binary classification trees with internal split nodes and leaf terminal nodes note that in binary tree the number of leaves is always one more than the number of internal nodes an input 
rp is directed from the root of the tree down through internal nodes to leaf node each leaf node specifies distribution over class labels each internal node indexed by performs binary test by evaluating θt θt figure the binary split decisions in decision tree with internal nodes can be thought as binary vector tree navigation to reach leaf can be expressed in terms of function the selected leaf parameters can be expressed by θt specific split function si rp if si evaluates to then is directed to the left child of node otherwise is directed to the right child and so on down the tree each split function si by parameterized weight vector wi is assumed to be linear threshold function si sgn wi we incorporate an offset parameter to obtain split functions of the form sgn wi bi by appending constant to the input feature vector each leaf node indexed by specifies conditional probability distribution over class labels denoted leaf distributions are parametrized with vector of unnormalized predictive denoted rk and softmax function exp pk exp where denotes the αth element of vector the parameters of the tree comprise the internal weight vectors wi and the vectors of unnormalized one for each leaf node we pack these parameters into two matrices and whose rows comprise weight vectors and leaf parameters wm and given dataset of pairs xz yz where yz is the ground truth class label associated with input xz rp we wish to find joint configuration of oblique splits and leaf parameters that minimize some measure of misclassification loss on the training dataset joint optimization of the split functions and leaf parameters according to global objective is known to be extremely challenging due to the discrete and sequential nature of the splitting decisions within the tree one can evaluate all of the split functions for every internal node of the tree on input by computing sgn where sgn is the sign function one key idea that helps linking decision tree learning to latent structured prediction is to think of an vector of potential split decisions sgn as latent variable such latent variable determines the leaf to which data point is directed and then classified using the leaf parameters to formulate the loss for we introduce tree navigation function hm that maps an sequence of split decisions hm to an indicator vector that specifies encoding such an indicator vector is only at the index of the selected leaf fig illustrates the tree navigation function for tree with internal nodes using the notation developed above θt sgn represents the parameters corresponding to the leaf to which is directed by the split functions in generic loss function of the form measures the discrepancy between the model prediction based on and an output for the softmax model given by natural loss is the negative log probability of the correct label referred to as log loss log log exp for regression tasks when rq and the value of rq is directly emitted as the model prediction natural choice of is squared loss sqr kθ one can adopt other forms of loss within our decision tree learning framework as well the goal of learning is to find and that minimize empirical loss for given training set that is θt sgn direct global optimization of empirical loss with respect to is challenging it is discontinuous and function of furthermore given an input the navigation function yields leaf parameter vector based on sequence of binary tests where the results of the initial tests determine which subsequent tests are performed it is not clear how this 
dependence of binary tests should be formulated decision trees and structured prediction to overcome the intractability in the optimization of we develop piecewise smooth upper bound on empirical loss our upper bound is inspired by the formulation of structured prediction with latent variables key observation that links decision tree learning to structured prediction is that one can sgn in terms of latent variable that is sgn argmax ht in this form decision tree split functions implicitly map an input to binary vector by maximizing score function ht the inner product of and one can the score function in terms of more familiar form of joint feature space on and as wt where vec hxt and vec previously norouzi and fleet used the same reformulation of linear threshold functions to learn binary similarity preserving hash functions given we empirical loss as θt where argmax ht this objective resembles the objective functions used in structured prediction and since we do not have priori access to the ground truth split decisions this problem is form of structured prediction with latent variables upper bound on empirical loss we develop an upper bound on loss for an pair which takes the form θt sgn maxm gt θt maxm ht to validate the bound first note that the second term on the rhs is maximized by sgn second when it is clear that the lhs equals the rhs finally for all other values of the rhs can only get larger than when because of the max operator hence the inequality holds an algebraic proof of is presented in the supplementary material in the context of structured prediction the first term of the upper bound the maximization over is called inference as it augments the inference problem the maximization over with loss term fortunately the inference for our decision tree learning formulation can be solved exactly as discussed below it is also notable that the loss term on the lhs of is invariant to the scale of but the upper bound on the right side of is not as consequence as with binary svm and formulations of structural svm we introduce regularizer on the norm of when optimizing the bound to justify the regularizer we discuss the effect of the scale of on the bound proposition the upper bound on the loss becomes tighter as constant multiple of increases for maxm agt θt maxm aht maxm bg θt maxm bht proof please refer to the supplementary material for the proof in the limit as the scale of approach the loss term θt becomes negligible compared to the score term gt thus the solutions to inference and inference problems become almost identical except when an element of is very close to thus even though larger kw yields tighter bound it makes the bound approach the loss itself and therefore becomes nearly which is hard to optimize based on proposition one easy way to decrease the upper bound is to increase the norm of which does not affect the loss our experiments indicate that lower value of the loss can be achieved when the norm of is regularized we therefore constrain the norm of to obtain an objective with better generalization since each row of acts independently in decision tree in the split functions it is reasonable to constrain the norm of each row independently summing over the bounds for different training pairs and constraining the norm of rows of we obtain the following optimization problem called the surrogate objective minimize max maxm kwi for all where is regularization parameter and wi is the ith row of for all values of we have instead of using the typical lagrange form for 
regularization we employ hard constraints to enable sparse gradient updates of the rows of since the gradients for most rows of are zero at each step in training optimizing the surrogate objective even though minimizing the surrogate objective of entails optimization is much better behaved than empirical loss in is piecewise linear and in and the constraints on define convex set inference to evaluate and use the surrogate objective in for optimization we must solve inference problem to find the binary code that maximizes the sum of the score and loss terms argmax gt θt an observation that makes this optimization tractable is that can only take on distinct values which correspond to terminating at one of the leaves of the tree and selecting leaf parameter from fortunately for any leaf index we can solve argmax gt efficiently note that if then θt equals the th row of to solve we need to set all of the binary bits in corresponding to the path from the root to the leaf to be consistent with the path direction toward the leaf however bits of that do not appear on this path have no effect on the output of and all such bits should be set based on sgn wi to obtain maximum gt accordingly we can essentially ignore the bits by subtracting sgn from to obtain argmax gt argmax sgn algorithm stochastic gradient descent sgd algorithm for decision tree learning initialize and using greedy procedure for to do sample pair uniformly at random from sgn gt θt bxt hx tmp for to do tmp tmp min end for θt end for note that sgn is constant in and this subtraction zeros out all bits in that are not on the path to the leaf so to solve we only need to consider the bits on the path to the leaf for which sgn wi is not consistent with the path direction using single search on the decision tree we can solve for every and among those we pick the one that maximizes the algorithm described above is mp where is the tree depth and we require multiple of for computing the inner product wi at each internal node this algorithm is not efficient for deep trees especially as we need to perform inference once for every stochastic gradient computation in what follows we develop an alternative more efficient formulation and algorithm with time complexity of fast inference to motivate the fast inference algorithm we formulate slightly different upper bound on the loss θt sgn max sgn gt θt maxm ht where sgn denotes the hamming ball of radius around sgn sgn hm kg sgn kh hence sgn implies that and sgn differ in at most one bit the proof of is identical to the proof of the key benefit of this new formulation is that inference with the new bound is computationally efficient since and sgn differ in at most one bit then can only take distinct values thus we need to evaluate for at most values of requiring running time of stochastic gradient descent sgd one reasonable approach to minimizing uses stochastic gradient descent sgd the steps of which are outlined in alg here denotes the learning rate and is the number of optimization steps line corresponds to gradient update in which is supported by the fact that ht hxt line performs projection back to the feasible region of and line updates based on the gradient of loss our implementation modifies alg by adopting common sgd tricks including the use of momentum and stable sgd ssgd even though alg achieves good training and test accuracy relatively quickly we observe that after several gradient updates some of the leaves may end up not being assigned to any data points and hence the full tree 
capacity is not exploited we call such leaves inactive as opposed to active leaves that are assigned to at least one training data point an inactive leaf may become active again but this rarely happens given the form of gradient updates to discourage abrupt changes in the number of inactive leaves we introduce variant of sgd in which the assignments of data points to leaves are fixed for number of gradient updates thus the bound is optimized with respect to set of data point leaf assignment constraints when the improvement in the bound becomes negligible the leaf assignment variables are updated followed by another round of optimization of the bound we call this algorithm stable sgd ssgd because it changes the assignment of data points to leaves more conservatively than sgd let denote the encoding of the leaf to which data point should be assigned to then each iteration of ssgd test accuracy sensit protein mnist training accuracy depth depth depth depth depth depth random depth depth figure test and training accuracy of single tree as function of tree depth for different methods trees achieve better test accuracy throughout different depths exhibit less vulnerability to overfitting with fast inference relies on the following upper bound on loss θt sgn max gt θt max sgn ht one can easily verify that the rhs of is larger than the rhs of hence the inequality computational complexity to analyze the computational complexity of each sgd step we note sgn is bounded above by the defined in and that hamming distance between corresponding to the path to selected depth of the tree this is because only those elements of xt needed for line of alg leaf can differ from sgn thus for sgd the expression differ accordingly lines and can be computed in dp if we know which bits of and can be performed in dp the computational bottleneck is the loss augmented inference in line when fast inference is performed in time the total time complexity of gradient update for both sgd and ssgd becomes where is the number of labels experiments experiments are conducted on several benchmark datasets from libsvm for classification namely sensit protein and mnist we use the provided train validation test sets when available if such splits are not provided we use random split of the training data for train validation and random split for train validation test sets we compare our method for learning of oblique trees with several greedy baselines including conventional trees based on information gain oblique trees that use coordinate descent for optimization of the splits and random oblique trees that select the best split function from set of randomly generated hyperplanes based on information gain we also compare with the results of which is special case of our upper bound approach applied greedily to trees of depth one node at time any base algorithm for learning decision trees can be augmented by pruning or building ensembles with bagging or boosting however the key differences between trees and baseline greedy trees become most apparent when analyzing individual trees for single tree the major determinant of accuracy is the size of the tree which we control by changing the maximum tree depth fig depicts test and training accuracy for trees and four other baselines as function of tree depth we evaluate trees of depth up to at depth intervals of the for each method are tuned for each depth independently while the absolute accuracy of our trees varies between datasets few key observations hold for all cases first we observe 
that num active leaves regularization parameter log tree depth regularization parameter log regularization parameter log tree depth tree depth figure the effect of on the structure of the trees trained by mnist small value of prunes the tree to use far fewer leaves than an baseline used for initialization dotted line greedy trees achieve the best test performance across tree depths across multiple datasets secondly trees trained using our approach seem to be less susceptible to overfitting and achieve better generalization performance at various tree depths as described below we think that the norm regularization provides principled way to tune the tightness of the tree fit to the training data finally the comparison between and trees concentrates on the of the algorithm as it compares our method with its simpler variant which is applied greedily one node at time we find that in most cases the optimization helps by improving upon the results of training time sec key of our method is the regularization constant in which controls the tightness of the upper bound with small the norm constraints force the method to choose with large margin at each internal node the choice of is therefore closely related to the generalization of the learned trees as shown in fig also implicitly controls the degree of pruning of the leaves of the tree during training we train multiple trees for different values of and we pick the value of that produces the tree with minimum validation error we also tune the choice of the sgd learning rate in this step this and are used to build tree using the union of both the training and validation sets which is evaluated on the test set inf fast inf depth figure total time to execute epochs of sgd on the dataset using inference and its fast varient to build trees we initially build an tree with split functions that threshold single feature optimized using conventional procedures that maximize information gain the axisaligned split is used to initialize greedy variant of the tree training procedure called this provides initial values for and for the procedure fig shows an empirical comparison of training time for sgd with inference and fast inference as expected of inference exhibits exponential growth with deep trees whereas its fast variant is much more scalable we expect to see much larger speedup factors for larger datasets only has training points conclusion we present method for learning decision trees using stochastic gradient descent to optimize an upper bound on the empirical loss of the tree predictions on the training set our model poses the global training of decision trees in optimization framework this makes it simpler to pose extensions that could be considered in future work efficiency gains could be achieved by learning sparse split functions via regularization on further the core optimization problem permits applying the kernel trick to the linear split parameters making our overall model applicable to learning split functions or training decision trees on examples in arbitrary reproducing kernel hilbert spaces acknowledgment mn was financially supported in part by google fellowship df was financially supported in part by nserc canada and the ncap program of the cifar references bennett global tree optimization decision tree algorithm computing science and statistics pages bennett and blue support vector machine approach to decision trees in department of mathematical sciences math report no rensselaer polytechnic institute pages bennett cristianini and wu 
Enlarging the margins in perceptron decision trees. Machine Learning.
Breiman. Random forests. Machine Learning.
Breiman, Friedman, Olshen, and Stone. Classification and Regression Trees. Chapman.
Chang and Lin. LIBSVM: a library for support vector machines.
Criminisi and Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Springer.
Jerome Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics.
Gall, Yao, Razavi, Van Gool, and Lempitsky. Hough forests for object detection, tracking, and action recognition. IEEE Trans. PAMI.
Hastie, Tibshirani, and Friedman. The Elements of Statistical Learning. Springer.
Hyafil and Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters.
Jancsary, Nowozin, and Rother. Training of image restoration models: a new state of the art. ECCV.
Jordan and Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation.
Konukoglu, Glocker, Zikic, and Criminisi. Neighbourhood approximation forests. In Medical Image Computing and Computer-Assisted Intervention. Springer.
Lakshminarayanan, Roy, and Teh. Mondrian forests: efficient online random forests. In Advances in Neural Information Processing Systems.
Mingers. An empirical comparison of pruning methods for decision tree induction. Machine Learning.
Murthy and Salzberg. On Growing Better Decision Trees from Data. PhD thesis, Johns Hopkins University.
Norouzi, Collins, Fleet, and Kohli. Forest: improved random forest by continuous optimization of oblique splits.
Norouzi and Fleet. Minimal loss hashing for compact binary codes. ICML.
Nowozin. Improved information gain estimates for decision tree induction. ICML.
Quinlan. Induction of decision trees. Machine Learning.
Shotton, Girshick, Fitzgibbon, Sharp, Cook, Finocchio, Moore, Kohli, Criminisi, Kipman, et al. Efficient human pose estimation from single depth images. IEEE Trans. PAMI.
Taskar, Guestrin, and Koller. Max-margin Markov networks. NIPS.
Tsochantaridis, Hofmann, Joachims, and Altun. Support vector machine learning for interdependent and structured output spaces. ICML.
Yu and Joachims. Learning structural SVMs with latent variables. ICML.
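To make the non-greedy tree-training loop described above concrete, the following is a minimal sketch, not the authors' released code, of the two ingredients that dominate each stochastic gradient step: exact loss-augmented inference by enumerating the leaves of an oblique tree, and the resulting update of the split-parameter matrix W with its row-norm projection. The heap indexing of internal nodes, the squared-error leaf loss, the learning rate eta, and the norm bound nu are all illustrative assumptions standing in for quantities the text leaves unspecified.

```python
import numpy as np

def leaf_path(leaf, depth):
    """(internal node, direction) pairs on the root-to-leaf path of a complete
    binary tree with heap indexing; direction is -1 for left, +1 for right."""
    path, node = [], 0
    for k in reversed(range(depth)):
        d = (leaf >> k) & 1                    # 0 -> go left, 1 -> go right
        path.append((node, 2 * d - 1))
        node = 2 * node + 1 + d
    return path

def loss_aug_inference(W, theta, x, y, depth):
    """argmax over valid leaf encodings g of  g^T W x + loss(theta_g, y),
    enumerating all 2^depth leaves; off-path bits of g are set to sgn(Wx)."""
    s = W @ x                                  # responses of the m internal nodes
    base = np.where(s >= 0, 1.0, -1.0)
    best_g, best_val = None, -np.inf
    for leaf in range(2 ** depth):
        g = base.copy()
        for node, d in leaf_path(leaf, depth):
            g[node] = d
        val = g @ s + np.sum((theta[leaf] - y) ** 2)   # assumed squared-error leaf loss
        if val > best_val:
            best_g, best_val = g, val
    return best_g

def route(W, x, depth):
    """Leaf index reached by thresholding w_i^T x at each internal node."""
    node, leaf = 0, 0
    for _ in range(depth):
        d = 1 if W[node] @ x > 0 else 0
        leaf = (leaf << 1) | d
        node = 2 * node + 1 + d
    return leaf

def sgd_step(W, theta, x, y, depth, eta=0.1, nu=1.0):
    """One stochastic update of the surrogate bound for a single (x, y) pair."""
    s = W @ x
    g_hat = loss_aug_inference(W, theta, x, y, depth)
    h_hat = np.where(s >= 0, 1.0, -1.0)        # maximizer of h^T W x over leaf encodings
    W = W - eta * np.outer(g_hat - h_hat, x)   # subgradient of the upper bound w.r.t. W
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W = W * np.minimum(1.0, nu / np.maximum(norms, 1e-12))  # project rows onto the norm ball
    leaf = route(W, x, depth)                  # update the predictor of the leaf x lands in
    theta[leaf] = theta[leaf] - eta * 2.0 * (theta[leaf] - y)
    return W, theta
```

With depth 3, for instance, W would have shape (7, p) and theta shape (8, c) for c-dimensional targets. The enumeration over 2^depth leaves inside loss_aug_inference is the exact inference the text describes as impractical for deep trees; the Hamming-ball restriction discussed above is what would replace that loop in the fast variant.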
probabilistic curve learning coulomb repulsion and the electrostatic gaussian process david dunson department of statistics duke university durham nc usa dunson ye wang department of statistics duke university durham nc usa abstract learning of low dimensional structure in multidimensional data is canonical problem in machine learning one common approach is to suppose that the observed data are close to smooth manifold there are rich variety of manifold learning methods available which allow mapping of data points to the manifold however there is clear lack of probabilistic methods that allow learning of the manifold along with the generative distribution of the observed data the best attempt is the gaussian process latent variable model but identifiability issues lead to poor performance we solve these issues by proposing novel coulomb repulsive process corp for locations of points on the manifold inspired by physical models of electrostatic interactions among particles combining this process with gp prior for the mapping function yields novel electrostatic gp electrogp process focusing on the simple case of manifold we develop efficient inference algorithms and illustrate substantially improved performance in variety of experiments including filling in missing frames in video introduction there is broad interest in learning and exploiting structure in data canonical case is when the low dimensional structure corresponds to smooth riemannian manifold embedded in the ambient space of the observed data assuming that the observed data are close to it becomes of substantial interest to learn along with the mapping from this allows better data visualization and for one to exploit the structure to combat the curse of dimensionality in developing efficient machine learning algorithms for variety of tasks the current literature on manifold learning focuses on estimating the coordinates corresponding to by optimization finding on the manifold that preserve distances between the corresponding in there are many such methods including isomap embedding and laplacian eigenmaps such methods have seen broad use but have some clear limitations relative to probabilistic manifold learning approaches which allow explicit learning of the mapping and the distribution of there has been some considerable focus on probabilistic models which would seem to allow learning of and two notable examples are mixtures of factor analyzers mfa and gaussian process latent variable models bayesian is bayesian formulation of which automatically learns the intrinsic dimension and handles missing data such approaches are useful in exploiting structure in estimating the distribution of but unfortunately have critical problems in terms of reliable estimation of the manifold and mapping function mfa is not smooth in approximating the manifold with collage of lower dimensional and hence we focus further discussion on bayesian similar problems occur for mfa and other probabilistic manifold learning methods xi with assigned in general form for the ith data vector bayesian lets µpx gaussian process prior generated from gaussian or uniform distribution over space and the residual drawn from gaussian centered on zero with diagonal or spherical covariance while this model seems appropriate to manifold learning identifiability problems lead to extremely poor performance in estimating and to give an intuition for the root cause of the problem consider the case in which are drawn independently from uniform distribution over the model is so 
flexible that we could fit the training data for just as well if we did not use the entire hypercube but just placed all the values in small subset of the uniform prior will not discourage this tendency to not spread out the latent coordinates which unfortunately has disasterous consequences illustrated in our experiments the structure of the model is just too flexible and further constraints are needed replacing the uniform with standard gaussian does not solve the problem constrained likelihood methods mitigate the issue to some extent but do not correspond to proper bayesian generative model to make the problem more tractable we focus on the case in which is smooth compact manifold assume pxi with gaussian noise and þñ smooth mapping such that µj for where pxq pxq µd pxqq we focus on finding good estimate of and hence the manifold via probabilistic learning framework we refer to this problem as probabilistic curve learning pcl motivated by the principal curve literature pcl differs substantially from the principal curve learning problem which seeks to estimate curve through the data which may be very different from the true manifold our proposed approach builds on in particular our primary innovation is to generate the latent coordinates from novel repulsive process there is an interesting literature on repulsive point process modeling ranging from various matern processes to the determinantal point process dpp in our very different context these processes lead to unnecessary complexity computationally and otherwise and we propose new coulomb repulsive process corp motivated by coulomb law of electrostatic interaction between electrically charged particles using corp for the latent positions has the effect of strongly favoring spread out locations on the manifold effectively solving the identifiability problem mentioned above for the we refer to the gp with corp on the latent positions as an electrostatic gp electrogp the remainder of the paper is organized as follows the coulomb repulsive process is proposed in and the electrogp is presented in with comparison between electrogp and demonstrated via simulations the performance is further evaluated via real world datasets in discussion is reported in coulomb repulsive process formulation definition univariate process is coulomb repulsive process corp if and only if for every finite set of indices tk in the index set ppxti πxti πxtj sin where is the repulsive parameter the process is denoted as xt corpprq the process is named by its analogy in electrostatic physics where by coulomb law two electrostatic positive charges will repel each other by force proportional to the reciprocal of their square distance letting dpx yq sin the above conditional probability of xti given xtj is proportional to pxti xtj shrinking the probability exponentially fast as two states get closer to each other note that the periodicity of the sine function eliminates the edges of making the electrostatic energy field homogeneous everywhere on several observations related to kolmogorov extension theorem can be made immediately ensuring corp to be well defined firstly the conditional density defined in is positive and integrable figure each facet consists of rows with each row representing an scatterplot of random realization of corp under certain and since xt are constrained in compact interval and is positive and bounded hence the finite distributions are well defined secondly the joint finite for xtk can be derived as xtk πxti πxtj as can be easily seen any 
permutation of tk will result in the same joint finite distribution hence this finite distribution is exchangeable thirdly it can be easily checked that for any finite set of indices tk xtk xtk xtk xtk qdxtk dxtk by observing that xtk xtk xtk xtk qπm ppxtk xtk properties assuming xt is realization from corp then the following lemmas hold lemma for any any and any we have ppxn bpxi where bpxi tx dpx xi lemma for any the of xn due to the exchangeability we can assume xn without loss of generality is maximized when and only when dpxi sin for all according to lemma and lemma corp will nudge the to be spread out within and penalizes the case when two get too close figure presents some simulations from corp this nudge becomes stronger as the sample size grows or as the repulsive parameter grows the properties of corp makes it ideal for strongly favoring spread out latent positions across the manifold avoiding the gaps and clustering in small regions that plague methods the proofs for the lemmas and simulation algorithm based on rejection sampling can be found in the supplement multivariate corp definition multivariate process is coulomb repulsive process if and only if for every finite set of indices tk in the index set xm for ti ppx pym ti ym tj where the spherical coordinates have been converted into the pp cartesian coordinates yp yp the multivariate corp maps the through spherical coordinate system to unit in the repulsion is then defined as the reciprocal of the square euclidean distances between these mapped points in based on this construction of multivariate corp straightfoward generalization of the electrogp model to manifold could be made where electrostatic gaussian process formulation and model fitting in this section we propose the electrostatic gaussian process electrogp model assuming ddimensional data vectors are observed the model is given by yi µj pxi xi corpprq µj where yi for and denotes gaussian process prior with covariance function px yq φj exp αj px letting αd φd denote the model hyperparameters model could be fitted by maximizing the joint posterior distribution of and arg max ppx rq where the repulsive parameter is fixed and can be tuned using cross validation based on our experience setting always yields good results and hence is used as default across this paper for the simplicity of notations is excluded in the remainder the above optimization problem can be rewritten as arg max pyy θq log πpx xq where denotes the log likelihood function and denotes the finite dimensional pdf of corp hence the corp prior can also be viewed as repulsive constraint in the optimization problem it can be easily checked that log πpxi xj for any and starting at initial values the optimizer will converge to local solution that maintains the same order as the initial we refer to this as the property we find that conditionally on the starting order the optimization algorithm converges rapidly and yields stable results although the are not identifiable since the target function is invariant under rotation unique solution does exist conditionally on the specified order raises the necessity of finding good initial values or at least good initial ordering for fortunately in our experience simply applying any standard manifold learning algorithm to estimate in manner that preserves distances in yields good performance we find very similar results using lle isomap and eigenmap but focus on lle in all our implementations our algorithm can be summarized as follows learn the one dimensional coordinate 
by your favorite manifold learning algorithm and rescale into figure visualization of three simulation experiments where the data triangles are simulated from bivariate gaussian left rotated parabola with gaussian noises middle and spiral with gaussian noises right the dotted shading denotes the posterior predictive uncertainty band of under electrogp the black curve denotes the posterior mean curve under electrogp and the red curve denotes the the three dashed curves denote three realizations from the middle panel shows region and the full figure is shown in the embedded box rq using scaled conjugate gradient descent scg solve arg maxθ ppyy and using scg setting and to be the initial values solve posterior mean curve and uncertainty bands in this subsection we describe how to obtain point estimate of the curve and how to characterize its uncertainty under electrogp such point and interval estimation is as of yet unsolved in the literature and is of critical importance in particular it is difficult to interpret single point estimate without some quantification of how uncertain that estimate is we use the posterior mean epµ as the bayes optimal estimator under squared error loss as curve curve has infinite dimensions hence in order to store and visualize it we discretize to obtain nµ grid points xµi for nµ using basic multivariate gaussian theory the following expectation is easy to compute µpxµnµ pxµnµ nµ is approximated by linear interpolation using xµi µpxµi then for ease of notation we use to denote this interpolated piecewise linear curve later on examples can be found in figure where all the mean curves black solid were obtained using the above method estimating an uncertainty region including data points with probability is much more challenging we addressed this problem by the following heuristic algorithm step draw from unif independently for step sample the corresponding from the posterior predictive distribution conditional on these latent coordinates ppyy step repeat steps times collecting all samples and find the step find the shortest distances from these to the posterior mean curve of these distances denoted by the envelope of the moving trace step moving ball through the entire curve defines the uncertainty band is piecewise linear curve examples can be found in note that step can be easily solved since figure where the uncertainty bands dotted shading were found using the above algorithm figure the of the spiral case left and the corresponding coordinate function pxq of electrogp middle and right the gray shading denotes the heatmap of the posterior distribution of px and the black curve denotes the posterior mean simulation in this subsection we compare the performance of electrogp with and principal curves pcurve in simple simulation experiments data points were sampled from each of the following three distributions gaussian distribution rotated parabola with gaussian noises and spiral with gaussian noises electrogp and were fitted using the same initial values obtained from lle and the was fitted using the princurve package in the performance of the three methods is compared in figure the dotted shading represents posterior predictive uncertainty band for new data point under the electrogp model this illustrates that electrogp obtains an excellent fit to the data provides good characterization of uncertainty and accurately captures the concentration near manifold embedded in two dimensions the is plotted in red the extremely poor representation of is as expected based on 
our experience in fitting principal curve in wide variety of cases the behavior is highly unstable in the first two cases the corresponds to smooth curve through the center of the data but for the more complex manifold in the third case the is an extremely poor representation this tendency to cut across large regions of near zero data density for highly curved manifolds is common for for we show three random realizations dashed from the posterior in each case it is clear the results are completely unreliable with the tendency being to place part of the curve through where the data have high density while also erratically adding extra outside the range of the data the model does not appropriately penalize such extra parts and the very poor performance shown in the top right of figure is not unusual we find that electrogp in general performs dramatically better than competitors more simulation results can be found in the supplement to better illustrate the results for the spiral case we zoom in and present some further comparisons of and electrogp in figure as can be seen the right panel optimizing without any constraint results in holes on the trajectories of the gaussian process over these holes will become arbitrary as illustrated by the three realizations this arbitrariness will be further projected into the input space resulting in the erratic curve observed in the left panel failing to have well spread out over not only causes trouble in learning the curve but also makes the posterior predictive distribution of overly diffuse near these holes the large gray shading area in the right panel the middle panel shows that electrogp fills in these holes by softly constraining the latent coordinates to spread out while still allowing the flexibility of moving them around to find smooth curve snaking through them prediction broad prediction problems can be formulated as the following missing data problem assume new data for are partially observed and the missing entries are to be filled in letting zo denote the observed data vector and denote the missing part the conditional distribution of original observed electrogp figure left panel three randomly selected reconstructions using electrogp compared with those using bayesian right panel another three reconstructions from electrogp with the first row presenting the original images the second row presenting the observed images and the third row presenting the reconstructions the missing data is given by ppzz dxzm ppzz xzm where xzi is the corresponding latent coordinate of for however dealing with xzm jointly is intractable due to the high of the gaussian process which motivates the following approximation πm ppxi the approximation assumes xzm to be conditionally independent this assumption is more is well spread out on as is favored by accurate if xo though still intractable is much easier to deal with the univariate distribution ppxzi depending on the purpose of the application either metropolis hasting algorithm could be adopted to sample from the predictive distribution or optimization method could be used to find the map of xz the details of both algorithms can be found in the supplement experiments consecutive frames of size with rgb color were collected from video of teapot rotating clearly these images roughly lie on curve of the frames were assumed to be fully observed in the natural time order of the video while the other frames were given without any ordering information moreover half of the pixels of these frames were missing the 
electrogp was fitted based on the other frames and was used to reconstruct the broken frames and impute the reconstructed frames into the whole frame series with the correct order the reconstruction results are presented in figure as can be seen the reconstructed images are almost indistinguishable from the original ones note that these frames were also correctly imputed into the video with respect to their latent position electrogp was compared with bayesian with the latent dimension set to the reconstruction mean square error mse using electrogp is compared to using the comparison is also presented in figure it can be seen that electrogp outperforms bayesian in highresolution precision how well they reconstructed the handle of the teapot since it obtains much tighter and more precise estimate of the manifold denoising consecutive frames of size with gray color were collected from video of shrinking shockwave frame to were assumed completely missing and the other frames were observed with the original time order with strong white noises the shockwave is homogeneous in all directions from the center hence the frames roughly lie on curve the electrogp was applied for two tasks frame denoising improving resolution by interpolating frames in between the existing frames note that the second task is hard since there are original noisy electrogp nlm isd electrogp li figure row from left to right are the original frame its noisy observation its denoised result by electrogp nlm and isd row from left to right are the original frame its regeneration by electrogp the residual image times of the absolute error between the imputation and the original of electrogp and li the blank area denotes its missing observation consecutive frames missing and they can be interpolated only if the electrogp correctly learns the underlying manifold the denoising performance was compared with mean filter nlm and isotropic diffusion isd the interpolation performance was compared with linear interpolation li the comparison is presented in figure as can be clearly seen electrogp greatly outperforms other methods since it correctly learned this manifold to be specific the denoising mse using electrogp is only comparing to using nlm and using isd the mse of reconstructing the entirely missing frame using electrogp is compared to using li an online video of the result using electrogp can be found in this the frame per second fps of the generated video under electrogp was tripled compared to the original one though over two thirds of the frames are pure generations from electrogp this new video flows quite smoothly another noticeable thing is that the missing frames were perfectly regenerated by electrogp discussion manifold learning has dramatic importance in many applications where data are collected with unknown low dimensional manifold structure while most of the methods focus on finding lower dimensional summaries or characterizing the joint distribution of the data there is to our knowledge no reliable method for probabilistic learning of the manifold this turns out to be daunting problem due to major issues with identifiability leading to unstable and generally poor performance for current probabilistic dimensionality reduction methods it is not obvious how to incorporate appropriate geometric constraints to ensure identifiability of the manifold without also enforcing assumptions about its form we tackled this problem in the manifold curve case and built novel electrostatic gaussian process model based on the general 
framework of by introducing novel coulomb repulsive process both simulations and real world data experiments showed excellent performance of the proposed model in accurately estimating the manifold while characterizing uncertainty indeed performance gains relative to competitors were dramatic the proposed electrogp is shown to be applicable to many learning problems including and there are many interesting areas for future study including the development of efficient algorithms for applying the model for multidimensional manifolds while learning the dimension https this online video contains no information regarding the authors references tenenbaum de silva and langford global geometric framework for nonlinear dimensionality reduction science roweis and saul nonlinear dimensionality reduction by locally linear embedding science belkin and niyogi laplacian eigenmaps and spectral techniques for embedding and clustering in nips volume pages chen silva paisley wang dunson and carin compressive sensing on manifolds using nonparametric mixture of factor analyzers algorithm and performance bounds signal processing ieee transactions on wang canale and dunson scalable multiscale density estimation arxiv preprint lawrence probabilistic principal component analysis with gaussian process latent variable models the journal of machine learning research titsias and lawrence bayesian gaussian process latent variable model the journal of machine learning research neil lawrence and joaquin local distance preservation in the through back constraints in proceedings of the international conference on machine learning pages acm raquel urtasun david fleet andreas geiger jovan trevor darrell and neil lawrence latent variable models in proceedings of the international conference on machine learning pages acm hastie and stuetzle principal curves journal of the american statistical association rao adams and dunson bayesian inference for repulsive processes arxiv preprint hough krishnapur peres et al zeros of gaussian analytic functions and determinantal point processes volume american mathematical weinberger and saul an introduction to nonlinear dimensionality reduction by maximum variance unfolding in aaai volume pages buades coll and morel algorithm for image denoising in computer vision and pattern recognition cvpr ieee computer society conference on volume pages ieee perona and malik and edge detection using anisotropic diffusion pattern analysis and machine intelligence ieee transactions on 
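As a concrete illustration of the repulsive prior at the heart of the electroGP model above, the sketch below implements a rejection sampler for the univariate CORP of the kind the text defers to the supplement, together with the pairwise log-repulsion term that acts as the soft constraint in the MAP objective. It is a minimal sketch under stated assumptions, not the authors' code: the index set is taken to be 1..n, and the proposal is uniform on [0, 1], which is valid because the unnormalized conditional density, the product of |sin(pi*(x - x_j))|^(2r) over previously drawn points, never exceeds one.

```python
import numpy as np

def corp_sample(n, r=1.0, rng=None):
    """Sequentially draw x_1..x_n from the univariate Coulomb repulsive process.
    Each new point is drawn by rejection: propose x ~ Uniform(0, 1) and accept
    with probability prod_j |sin(pi*(x - x_j))|**(2r), which is at most 1."""
    rng = rng if rng is not None else np.random.default_rng()
    xs = [rng.uniform()]                       # the first point is uniform on [0, 1]
    while len(xs) < n:
        x = rng.uniform()
        accept = np.prod(np.abs(np.sin(np.pi * (x - np.array(xs)))) ** (2.0 * r))
        if rng.uniform() < accept:
            xs.append(x)
    return np.array(xs)

def corp_log_density(xs, r=1.0):
    """Unnormalized log prior log pi(x): the pairwise repulsion term added to the
    GP log likelihood when maximizing the electroGP posterior. It diverges to
    -inf whenever two latent coordinates coincide, which is what forbids holes
    being filled by collapsing points on top of each other."""
    xs = np.asarray(xs)
    diff = xs[:, None] - xs[None, :]
    pair = np.abs(np.sin(np.pi * diff))
    i, j = np.triu_indices(len(xs), k=1)
    return 2.0 * r * np.sum(np.log(pair[i, j] + 1e-300))
```

Plotting corp_sample(50) against an i.i.d. uniform draw of the same size makes the spreading-out behaviour described above easy to see: the CORP draw leaves no large gaps and no tight clusters, which is exactly the property exploited when these locations are used as latent coordinates.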
inverse reinforcement learning with locally consistent reward functions quoc phong kian hsiang and patrick dept of computer science national university of singapore republic of dept of electrical engineering and computer science massachusetts institute of technology qphong lowkh jaillet abstract existing inverse reinforcement learning irl algorithms have assumed each expert demonstrated trajectory to be produced by only single reward function this paper presents novel generalization of the irl problem that allows each trajectory to be generated by multiple locally consistent reward functions hence catering to more realistic and complex experts behaviors solving our generalized irl problem thus involves not only learning these reward functions but also the stochastic transitions between them at any state including unvisited states by representing our irl problem with probabilistic graphical model an em algorithm can be devised to iteratively learn the different reward functions and the stochastic transitions between them in order to jointly improve the likelihood of the expert demonstrated trajectories as result the most likely partition of trajectory into segments that are generated from different locally consistent reward functions selected by em can be derived empirical evaluation on synthetic and datasets shows that our irl algorithm outperforms the em clustering with maximum likelihood irl which is interestingly reduced variant of our approach introduction the reinforcement learning problem in markov decision processes mdps involves an agent using its observed rewards to learn an optimal policy that maximizes its expected total reward for given task however such observed rewards or the reward function defining them are often not available nor known in many tasks the agent can therefore learn its reward function from an expert associated with the given task by observing the expert behavior or demonstration and this approach constitutes the inverse reinforcement learning irl problem unfortunately the irl problem is because infinitely many reward functions are consistent with the expert observed behavior to resolve this issue existing irl algorithms have proposed alternative choices of the agent reward function that minimize different dissimilarity measures defined using various forms of abstractions of the agent generated optimal behavior the expert observed behavior as briefly discussed below see for detailed review the projection algorithm selects reward function that minimizes the squared euclidean distance between the feature expectations obtained by following the agent generated optimal policy and the empirical feature expectations observed from the expert demonstrated trajectories the multiplicative weights algorithm for apprentice learning adopts robust minimax approach to deriving the agent behavior which is guaranteed to perform no worse than the expert and is equivalent to choosing reward function that minimizes the difference between the expected average reward under the agent generated optimal policy and the expert empirical average reward approximated using the agent reward weights the linear programming apprentice learning algorithm picks its reward function by minimizing the same dissimilarity measure but incurs much less time empirically the policy matching algorithm aims to match the agent generated optimal behavior to the expert observed behavior by choosing reward function that minimizes the sum of squared euclidean distances between the agent generated optimal policy 
and the expert estimated policy from its demonstrated trajectories over every possible state weighted by its empirical state visitation frequency the maximum entropy irl and maximum likelihood irl mlirl algorithms select reward functions that minimize an empirical approximation of the kullbackleibler divergence between the distributions of the agent and expert generated trajectories which is equivalent to maximizing the average of the expert demonstrated trajectories the formulations of the maximum entropy irl and mlirl algorithms differ in the use of smoothing at the trajectory and action levels respectively as result the former or dissimilarity measure does not utilize the agent generated optimal policy which is consequently questioned by as to whether it is considered an irl algorithm bayesian irl extends irl to the bayesian setting by maintaining distribution over all possible reward functions and updating it using bayes rule given the expert demonstrated trajectories the work of extends the projection algorithm to handle partially observable environments given the expert policy represented as finite state controller or trajectories all the irl algorithms described above have assumed that the expert demonstrated trajectories are only generated by single reward function to relax this restrictive assumption the recent works of have respectively generalized mlirl combining it with em clustering and bayesian irl integrating it with dirichlet process mixture model to handle trajectories generated by multiple reward functions due to many intentions in observable environments but each trajectory is assumed to be produced by single reward function in this paper we propose new generalization of the irl problem in observable environments which is inspired by an open question posed in the seminal works of irl if behavior is strongly inconsistent with optimality can we identify locally consistent reward functions for specific regions in state space such question implies that no single reward function is globally consistent with the expert behavior hence invalidating the use of all the irl algorithms more importantly multiple reward functions may be locally consistent with the expert behavior in different segments along its trajectory and the expert has to between these locally consistent reward functions during its demonstration this can be observed in the following example where every possible intention of the expert is uniquely represented by different reward function driver intends to take the highway to food center for lunch an electronic toll coming into effect on the highway may change his intention to switch to another route learning of the driver intentions to use different routes and his transitions between them allows the transport authority to analyze understand and predict the traffic route patterns and behavior for regulating the toll collection this example among others commuters intentions to use different transport modes tourists intentions to visit different attractions section motivate the practical need to formalize and solve our proposed generalized irl problem this paper presents novel generalization of the irl problem that in particular allows each expert trajectory to be generated by multiple locally consistent reward functions hence catering to more realistic and complex experts behaviors than that afforded by existing variants of the irl problem which all assume that each trajectory is produced by single reward function discussed earlier at first glance one may 
straightaway perceive our generalization as an irl problem in partially observable environment by representing the choice of locally consistent reward function in segment as latent state component however the observation model can not be easily specified nor learned from the expert trajectories which invalidates the use of irl for pomdp instead we develop probabilistic graphical model for representing our generalized irl problem section from which an em algorithm can be devised to iteratively select the locally consistent reward functions as well as learn the stochastic transitions between them in order to jointly improve the likelihood of the expert demonstrated trajectories section as result the most likely partition of an expert demonstrated trajectory into segments that are generated from different locally consistent reward functions selected by em can be derived section thus enabling practitioners to identify states in which the expert transitions between locally consistent reward functions and investigate the resulting causes to extend such partitioning to work for trajectories traversing through any possibly unvisited region of the state space we propose using generalized linear model to represent and predict the stochastic transitions between reward functions at any state including states not visited in the expert demonstrated trajectories by exploiting features that influence these transitions section finally our proposed irl algorithm is empirically evaluated using both synthetic and datasets section problem formulation markov decision process mdp for an agent is defined as tuple consisting of finite set of its possible states such that each state is associated with column vector of realized feature measurements finite set of its possible actions state transition function denoting the probability of moving to state by performing action in state reward function mapping each state to its reward where is column vector of reward weights and constant factor discounting its future rewards when is known the agent can compute its policy specifying the probability of performing action in state however is not known in irl and to be learned from an expert section let denote finite set of locally consistent reward functions of the agent and be reward function chosen arbitrarily from prior to learning define transition function for switching between these reward functions as the probability of switching from reward function to reward function in state where the set contains column vectors of transition weights for all and if the features influencing the stochastic transitions between reward functions can be additionally observed by the agent during the expert demonstration and otherwise in our generalized irl problem is not known and to be learned from the expert section specifically in the former case we propose using generalized linear model to represent exp exp if exp otherwise where is column vector of random feature measurements influencing the stochastic transitions between reward functions in state remark different from whose feature measurements are typically assumed in irl algorithms to be to the agent for all and remain static over time the feature measurements of are in practice often not known to the agent priori and can only be observed when the expert agent visits the corresponding state during its demonstration execution and may vary over time according to some unknown distribution as motivated by the examples given in section without prior observation of the feature 
measurements of for all or knowledge of their distributions necessary for computing the agent can not consider exploiting for switching between reward functions within mdp or pomdp planning even after learning its weights this eliminates the possibility of reducing our generalized irl problem to an equivalent conventional irl problem section with only single reward function comprising mixture of locally consistent reward functions furthermore the observation model can not be easily specified nor learned from the expert trajectories of states actions and which invalidates the use of irl for pomdp instead of exploiting within planning during the agent execution when it visits some state and observes the feature measurements of it can then use and compute for state to switch between reward functions each of which has generated separate mdp policy prior to execution as illustrated in simple example in fig below remark using generalized linear model to represent allows learning of the stochastic transitions between reward functions specifically by learning section to be generalized across different states after learning can then be exploited for predicting the stochastic transitions between reward functions at any state including states not visited in the expert demonstrated figure transition functrajectories consequently the agent can choose to traverse tion of an agent in state tory through any region possibly not visited by the expert of the for switching between two state space during its execution and the most likely partition of its reward functions and jectory into segments that are generated from different locally with their respective politent reward functions selected by em can still be derived section cies and generated in contrast if the feature measurements of can not be observed by prior to execution the agent during the expert demonstration as defined above then such generalization is not possible only the transition probabilities of switching between reward functions at states visited in the expert demonstrated trajectories can be estimated section in practice since the number of visited states is expected to be much larger than the length of any feature vector the number of transition probabilities to be estimated is bigger than in so observing offers further advantage of reducing the number of parameters to be learned fig shows the probabilistic graphical model for representing our generalized irl problem to describe our model some notations are necessary let be the number of the expert demonstrated trajectories and tn be the length number of time steps of its trajectory for let ant and snt denote its reward function action and state at time step in its trajectory respectively let ant and stn be random variables corresponding to their respective realizations ant and snt where is latent variable and ant and stn are observable variables define an ant and sn snt as sequences of all its reward functions actions and states in its trajectory respectively finally define an and as tuples of all its reward function sequences action sequences and state sequences in its trajectories respectively antn stnn figure probabilistic graphical model of the expert demonstrated trajectory encoding its stochastic transitions between reward functions with solid edges snt for tn state transitions with dashed edges snt ant ant for tn and policy with dotted edges snt ant it can be observed from fig that our probabilistic an for ical model of the expert demonstrated trajectory encodes its stochastic 
transitions between reward functions state transitions and policy through our model the viterbi algorithm can be applied to derive the most likely partition of the expert trajectory into segments that are generated from different locally consistent reward functions selected by em as shown in section given the state transition function and the number of reward functions our model allows tractable learning of the unknown parameters using em section which include the reward weights vector for all reward functions transition function for switching between reward functions initial state probabilities for all and initial reward function probabilities for all em algorithm for parameter learning straightforward approach to learning the unknown parameters is to select the value of that directly maximizes the of the expert demonstrated trajectories computationally such an approach is prohibitively expensive due to large joint parameter space to be searched for the optimal value of to ease this computational burden our key idea is to devise an em algorithm that iteratively refines the estimate for to improve the expected loglikelihood instead which is guaranteed to improve the original by at least as much expectation step log maximization step where denotes an estimate for at iteration the function of em can be reduced to the following sum of five terms as shown in appendix pn pn log an log pn ptn an log snt ant pn ptn st log st pn ptn log snt ant interestingly each of the first four terms in and contains unique unknown parameter type respectively and and can therefore be maximized separately in the step to be discussed below as result the parameter space to be searched can be greatly reduced note that the third term generalizes the in mlirl assuming all trajectories to be produced by single reward function to that allowing each expert trajectory to be generated by multiple locally consistent reward functions the last term which contains the known state transition function is independent of unknown parameters if the state transition function is unknown then it can be learned by optimizing the last term learning initial state probabilities to maximize the first term in the function of em we use the method of lagrange multipliers with the constraint to obtain the estimate pn for all where is an indicator variable of value if and otherwise since can be computed directly from the expert demonstrated trajectories in time it does not have to be refined learning initial reward function probabilities to maximize the second term in function of em we utilize the method of lagrange multipliers with the constraint to derive pn an for all where denotes an estimate for at iteration denotes an estimate for at pn iteration and an in this case can be computed in tn time using procedure inspired by algorithm as shown in appendix learning reward functions the third term in the function of em is maximized using gradient ascent and its gradient with respect to is derived to be tn an snt ant snt ant for all for snt ant to be differentiable in we define the function of mdp using an operator that blends the values via boltzmann exploration where qp such that exp exp is defined as boltzmann exploration policy and is temperature parameter then we update where is the learning step size we use backtracking line search method to improve the performance of gradient ascent similar to mlirl the time incurred in each iteration of gradient ascent depends mostly on that of value iteration which increases with the size of the mdp state 
and action space learning transition function for switching between reward functions to maximize the fourth term in the function of em if the feature measurements of can not be observed by the agent during the expert demonstration then we utilize the method of lagrange multipliers with the constraints for all and to obtain pn ptn pn tn for and where is the set of states visited by the expert is an estimate for at iteration and stn an can be computed efficiently by exploiting the intermediate results from evaluating an described previously as detailed in appendix on the other hand if the feature measurements of can be observed by the agent during the expert demonstration then recall that we use generalized linear model to represent section and is the unknown parameter to be estimated similar to learning the reward weights vector for reward function we maximize the fourth term in the function of em by using gradient ascent and its gradient with respect to is derived to be tn sn st for all let ri denote an estimate for at iteration then it is updated using ri ri where is the learning step size backtracking line search method is also used to improve the performance of gradient ascent here in both cases the time incurred in each iteration is proportional to the number of to be computed which is pn time viterbi algorithm for partitioning trajectory into segments with different locally consistent for the unknown pab reward functions given the final estimate rameters produced by em the most likely partition of the expert demonstrated trajectory into segments generated by different locally consistent reward functions is argmaxr which can be derived using the viterbi algorithm specifically define for tn as the probability of the most likely reward function sequence from time steps to ending with reward function at time step that produce state and action sequences snt and ant max snt ant snt ant snt snt ant snt sn sn an maxr sn vr maxr sn an then snt for tn and tn the above viterbi algorithm can be applied in the tn same way to partition an agent trajectory traversing through any region possibly not visited by the expert of the state space during its execution in time experiments and discussion this section evaluates the empirical performance of our irl algorithm using datasets featuring experts demonstrated trajectories in two simulated grid worlds and taxi trajectories the average of the expert demonstrated trajectories is used as the performance metric because it inherently accounts for the fidelity of our irl algorithm in learning the locally consistent reward functions and the stochastic transitions between them pntot log sn an where ntot is the total number of the expert demonstrated trajectories available in the dataset as proven in maximizing with respect to is equivalent to minimizing an empirical approximation of the divergence between the distributions of the agent and expert produced by em section generated trajectories note that when the final estimate is plugged into the resulting in can be computed efficiently using procedure similar to that in section as detailed in appendix to avoid local maxima in gradient ascent we initialize our em algorithm with random values and report the best result based on the value of em section to demonstrate the importance of modeling and learning stochastic transitions between locally consistent reward functions the performance of our irl algorithm is compared with that of its reduced variant assuming no of reward function within each trajectory which is 
implemented by initializing for all and and deactivating the learning of in fact it can be shown appendix that such reduction interestingly is equivalent to em clustering with mlirl so our irl algorithm generalizes em clustering with mlirl the latter of which has been empirically demonstrated in to outperform many existing irl algorithms as discussed in section figure grid worlds states and are respectively examples of water land and obstacle and state is an example of barrier and denote origin and destination simulated grid world the environment fig is modeled as grid of states each of which is either land water water and destination or obstacle associated with the respective feature vectors and the expert starts at origin and any of its actions can achieve the desired state with probability it has two possible reward functions one of which prefers land to water and going to destination and the other of which prefers water to land and going to destination the expert will only consider switching its reward function at states and from to with probability and from to with probability its reward function remains unchanged at all other states the feature measurements of can not be observed by the agent during the expert demonstration so and is estimated using we set to and the number of reward functions of the agent to fig shows results of the average achieved by our irl algorithm em clustering with mlirl and the expert averaged over random instances with varying number of expert demonstrated trajectories it can be observed that our irl algorithm significantly outperforms em clustering with mlirl and achieves performance close to that of the expert especially when increases this can be explained by its modeling of and its high fidelity in learning and predicting while our irl algorithm allows switching of reward function within each trajectory em clustering with mlirl does not we also observe that the accuracy of estimating the transition probabilities using depends on the frequency and distribution of trajectories demonstrated by the expert with its reward our irl algorithm our irl algorithm em clustering with mlirl em clustering with mlirl function expert expert at time step and its state st no of demonstrated trajectories no of demonstrated trajectories at time step which is expected those transition probabilities that figure graphs of average achieved by our are poorly estimated due to few irl algorithm em clustering with mlirl and the expert evant expert demonstrated number of expert demonstrated trajectories in simulated tories however do not hurt the grid worlds ntot and ntot performance of our irl algorithm by much because such trajectories tend to have very low probability of being demonstrated by the expert in any case this issue can be mitigated by using the generalized linear model to represent and observing the feature measurements of necessary for learning and computing as shown next average average simulated grid world the environment fig is also modeled as grid of states each of which is either the origin destination or land associated with the respective feature vectors and the expert starts at origin and any of its actions can achieve the desired state with probability it has two possible reward functions one of which prefers going to destination and the other of which prefers returning to origin while moving to the destination the expert will encounter barriers at some states with corresponding feature vectors and no barriers at all other states with the second component of 
is used as an offset value in the generalized linear model the expert behavior of switching between reward functions is governed by generalized linear model with and transition weights and as result it will for example consider switching its reward function at states with barriers from to with probability we estimate using and set to and the number of reward functions of the agent to to assess the fidelity of learning and predicting the stochastic transitions between reward functions at unvisited states we intentionally remove all demonstrated trajectories that visit state with barrier fig shows results of performance achieved by our irl algorithm em clustering with mlirl and the expert averaged over random instances with varying it can again be observed that our irl algorithm outperforms em clustering with mlirl and achieves an performance comparable to that of the expert due to its modeling of and its high fidelity in learning and predicting while our irl algorithm allows switching of reward function within each trajectory em clustering with mlirl does not besides the estimated transition function using is very close to that of the expert even at unvisited state so unlike using the learning of with can be generalized well across different states thus allowing to be predicted accurately at any state hence we will model with and learn it using in the next experiment taxi trajectories the comfort taxi company in singapore has provided gps traces of taxis with the same origin and destination that are onto network comprising highway arterials slip roads etc of road segments states each road is specified by feature vector each of the first six components of is an indicator describing whether it belongs to alexandra road ar ayer rajah expressway aye depot road dr henderson road hr jalan bukit merah jbm or lower delta road ldr while the last component of is the normalized shortest path distance from the road segment to destination we assume that the trajectories are demonstrated by taxi drivers with common set of reward functions and the same transition function for switching between reward functions the latter of which is influenced by the normalized taxi speed constituting the first component of feature vector the second component of is used as an offset of value in the generalized linear model the number of reward functions is set to because when we experiment with two of the learned reward functions are similar every driver can deterministically move its taxi from its current road segment to the desired adjacent road segment rb rb probability average fig shows results of performance achieved by our irl algorithm and em clustering with mlirl averaged over random instances with varying our irl algorithm outperforms em clustering with mlirl due to its modeling of and its high fidelity in learning and predicting our irl algorithm em clustering with mlirl no of demonstrated trajectories normalized taxi speed figure graphs of average achieved by our irl algorithm and em clustering with mlirl no of taxi trajectories ntot and transition probabilities of switching between reward functions taxi speed to see this our irl algorithm is able to learn that taxi driver is likely to switch between reward functions representing different intentions within its demonstrated trajectory reward function denotes his intention of driving directly to the destination fig due to huge penalty reward weight on being far from destination and large reward reward weight for taking the shortest path from origin to 
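To make the generalized linear switching model concrete, here is a minimal sketch for the two-reward-function case: the probability of switching at a state is a logistic function of its feature vector, whose second component serves as the constant offset term. The weight values below are illustrative assumptions, not the values learned or used in the paper.

```python
import numpy as np

def switch_probability(nu, phi_s):
    """Probability of switching away from the current reward function at a
    state with feature vector phi_s, under a logistic (generalized linear)
    model with transition weights nu. The second component of phi_s is a
    constant offset (bias) term."""
    return 1.0 / (1.0 + np.exp(-np.dot(nu, phi_s)))

# illustrative transition weights: positive weight on the barrier indicator,
# negative offset so that switching is rare when no barrier is present
nu = np.array([3.0, -2.0])
phi_barrier    = np.array([1.0, 1.0])   # [barrier indicator, offset]
phi_no_barrier = np.array([0.0, 1.0])

print(switch_probability(nu, phi_barrier))     # about 0.73: likely to switch
print(switch_probability(nu, phi_no_barrier))  # about 0.12: unlikely to switch
```

With more than two reward functions, the same idea extends naturally to a softmax over the candidate reward functions.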
destination which is via jbm while denotes his intention of detouring to dr or jbm fig due to large rewards for traveling on them respectively reward weights and as an example fig shows the most likely partition of demonstrated trajectory into segments generated from locally consistent reward functions and which is derived using our viterbi algorithm section it can be observed that the driver is initially in on the slip road exiting aye switches from to upon turning into ar to detour to dr and remains in while driving along dr hr and jbm to destination on the other hand the reward functions learned by em clustering with mlirl are both associated with his intention of driving directly to destination similar to it is not able to learn his intention of detouring to dr or jbm fig shows the influence of normalized taxi speed first component of on the estimated transition function using it can be observed that when the driver is in driving directly to destination he is very unlikely to change his intention regardless of taxi speed but when he is in detouring to dr or jbm he is likely unlikely to remain in this intention if taxi speed is low high the demonstrated trajectory in fig in fact supports this observation the driver initially remains in on the upslope slip road exiting aye which causes the low taxi speed upon turning into ar to detour to dr he switches from to because he can drive at relatively high speed on flat terrain ar jbm aye hr dr ldr ar jbm aye hr dr ldr jbm ar aye dr hr conclusion ldr figure reward and for each this paper describes an irl road segment with gorithm that can learn the multiple reward functions being locally consistent in and such that more red road segments ent segments along trajectory as well as the give higher rewards most likely partition of stochastic transitions between them it gendemonstrated trajectory from origin to destinaeralizes with mlirl and has tion into red and green segments generated by been empirically demonstrated to outperform and respectively it on both synthetic and datasets for our future work we plan to extend our irl algorithm to cater to an unknown number of reward functions nonlinear reward functions modeled by gaussian processes other dissimilarity measures described in section mdps active learning with gaussian processes and interactions with agents acknowledgments this work was partially supported by alliance for research and technology subaward agreement no references abbeel and ng apprenticeship learning via inverse reinforcement learning in proc icml marivate subramanian and littman apprenticeship learning about multiple intentions in proc icml pages bilmes gentle tutorial of the em algorithm and its application to parameter estimation for gaussian mixture and hidden markov models technical report university of california berkeley chen cao low ouyang tan and jaillet parallel gaussian process regression with covariance matrix approximations in proc uai pages choi and kim inverse reinforcement learning in partially observable environments jmlr choi and kim nonparametric bayesian inverse reinforcement learning for multiple reward functions in proc nips pages dvijotham and todorov inverse optimal control with mdps in proc icml pages hoang hoang and low unifying framework of anytime sparse gaussian process regression models with stochastic variational inference for big data in proc icml pages hoang and low general framework for interacting with agents using arbitrary parametric model and model prior in proc ijcai pages hoang and low 
interactive pomdp lite towards practical planning to predict and exploit intentions for interacting with agents in proc ijcai pages hoang low jaillet and kankanhalli nonmyopic active learning of gaussian processes in proc icml pages levine and koltun nonlinear inverse reinforcement learning with gaussian processes in proc nips pages low chen hoang xu and jaillet recent advances in scaling up gaussian process predictive models for large spatiotemporal data in ravela and sandu editors proc dynamic environmental systems science conference dydess lncs springer low xu chen lim and generalized online sparse gaussian processes with application to persistent mobile robot localization in proc nectar track low yu chen and jaillet parallel gaussian process regression for big data representation meets markov approximation in proc aaai pages neu and apprenticeship learning using inverse reinforcement learning and gradient methods in proc uai pages neu and training parsers by inverse reinforcement learning machine learning newson and krumm hidden markov map matching through noise and sparseness in proc acm sigspatial international conference on advances in geographic information systems pages ng and russell algorithms for inverse reinforcement learning in proc icml rabiner tutorial on hidden markov models and selected applications in speech recognition proc ieee ramachandran and amir bayesian inverse reinforcement learning in proc ijcai pages russell learning agents for uncertain environments in proc colt pages syed bowling and schapire apprenticeship learning using linear programming in proc icml pages syed and schapire approach to apprenticeship learning in proc nips pages xu low chen lim and persistent mobile robot localization using online sparse gaussian process observation model in proc aaai pages yu low oran and jaillet hierarchical bayesian nonparametric approach to modeling and learning the wisdom of crowds of urban traffic route planning agents in proc iat pages ziebart maas bagnell and dey maximum entropy inverse reinforcement learning in proc aaai pages 
communication complexity of distributed convex learning and optimization ohad shamir weizmann institute of science rehovot israel yossi arjevani weizmann institute of science rehovot israel abstract we study the fundamental limits to distributed methods for convex learning and optimization under different assumptions on the information available to individual machines and the types of functions considered we identify cases where existing algorithms are already optimal as well as cases where room for further improvement is still possible among other things our results indicate that without similarity between the local objective functions due to statistical data similarity or otherwise many communication rounds may be required even if the machines have unbounded computational power introduction we consider the problem of distributed convex learning and optimization where set of machines each with access to different local convex function fi rd and convex domain rd attempt to solve the optimization problem min where fi prominent application is empirical risk minimization where the goal is to minimize the average loss over some dataset where each machine has access to different subset of the data letting zn be the dataset composed of examples and assuming the loss function pn is convex in then the empirical risk minimization problem zi can be written as in eq where fi is the average loss over machine examples the main challenge in solving such problems is that communication between the different machines is usually slow and constrained at least compared to the speed of local processing on the other hand the datasets involved in distributed learning are usually large and therefore machines can not simply communicate their entire data to each other and the question is how well can we solve problems such as eq using as little communication as possible as datasets continue to increase in size and parallel computing platforms becoming more and more common from multiple cores on single cpu to and geographically distributed computing grids distributed learning and optimization methods have been the focus of much research in recent years with just few examples including most of this work studied algorithms for this problem which provide upper bounds on the required time and communication complexity in this paper we take the opposite direction and study what are the fundamental performance limitations in solving eq under several different sets of assumptions we identify cases where existing algorithms are already optimal at least in the as well as cases where room for further improvement is still possible since major constraint in distributed learning is communication we focus on studying the amount of communication required to optimize eq up to some desired accuracy more precisely we consider the number of communication rounds that are required where in each communication round the machines can generally broadcast to each other information linear in the problem dimension point in or gradient this applies to virtually all algorithms for learning we are aware of where sending vectors and gradients is feasible but computing and sending larger objects such as hessians matrices is not our results pertain to several possible settings see sec for precise definitions first we distinguish between the local functions being merely convex or and whether they are smooth or not these distinctions are standard in studying optimization algorithms for learning and capture important properties such as the 
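As a concrete illustration of this setup, the sketch below splits a synthetic least-squares dataset across m machines, defines each local objective F_i as the average loss over machine i's shard, and assembles the exact gradient of the global objective with a single communication round in which every machine broadcasts its local gradient. All sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 4, 250, 20                      # machines, samples per machine, dimension

# synthetic least-squares data, split evenly across the m machines
A = rng.normal(size=(m * n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=m * n)
shards = [(A[i::m], b[i::m]) for i in range(m)]

def local_objective(w, shard):            # F_i(w): average loss on machine i's data
    Ai, bi = shard
    return 0.5 * np.mean((Ai @ w - bi) ** 2)

def local_gradient(w, shard):             # gradient of F_i at w
    Ai, bi = shard
    return Ai.T @ (Ai @ w - bi) / len(bi)

def global_gradient(w):
    # one communication round: each machine broadcasts a d-dimensional gradient,
    # and their average is the exact gradient of F(w) = (1/m) sum_i F_i(w)
    return np.mean([local_gradient(w, s) for s in shards], axis=0)

w0 = np.zeros(d)
print(np.mean([local_objective(w0, s) for s in shards]),
      np.linalg.norm(global_gradient(w0)))
```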
regularization and the type of loss function used second we distinguish between setting where the local functions are related because they reflect statistical similarities in the data residing at different machines and setting where no relationship is assumed for example in the extreme case where data was split uniformly at random between machines one can show that quantities such as the values gradients and hessians of the local tions differ only by where is the sample size per machine due to concentration of measure effects such similarities can be used to speed up the process as was done in both the and the unrelated setting can be considered in unified way by letting be parameter and studying the attainable lower bounds as function of our results can be summarized as follows first we define mild structural assumption on the algorithm which is satisfied by reasonable approaches we are aware of which allows us to provide the lower bounds described below on the number of communication rounds required to reach given suboptimality when the local functions can be unrelated we prove lower bound of log for smooth and convex functions and for smooth convex functions these lower bounds are matched by straightforward distributed implementation of accelerated gradient descent in particular the results imply that many communication rounds may be required to get solution and moreover that no algorithm satisfying our structural assumption would be better even if we endow the local machines with unbounded computational power for functions we show lower bound of for convex functions and for general convex functions although we leave full derivation to future work it seems these lower bounds can be matched in our framework by an algorithm combining acceleration and moreau proximal smoothing of the local functions when the local functions are related as quantified by the parameter we prove communicap tion round lower bound of log for smooth and convex functions for quadratics this bound is matched by up to constants and logarithmic factors by the recentlyproposed disco algorithm however getting an optimal algorithm for general strongly convex and smooth functions in the setting let alone for or convex functions remains open we also study the attainable performance without posing any structural assumptions on the algorithm but in the more restricted case where only single round of communication is allowed we prove that in broad regime the performance of any distributed algorithm may be no better than trivial algorithm which returns the minimizer of one of the local functions as long as the number of bits communicated is less than therefore in our setting no distributed algorithm can provide performance in the worst case related work there have been several previous works which considered lower bounds in the context of distributed learning and optimization but to the best of our knowledge none of them provide similar type of results perhaps the most paper is which studied the communication complexity of distributed optimization and showed that log bits of communication are necessary between the machines for convex problems however in our setting this does not lead to any lower bound on the number of communication rounds indeed just specifying vector up to accuracy required log bits more recently considered lower bounds for certain types of distributed learning problems but not convex ones in an agnostic framework in the context of lower bounds for algorithms the results of imply that bits of communication 
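The statistical-similarity point made above, namely that under a uniformly random split the local gradients agree up to roughly 1/sqrt(n) with n the per-machine sample size, is easy to check numerically; the quadratic loss and problem sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 4
w = rng.normal(size=d)                      # an arbitrary query point

for n in (100, 1000, 10000):                # samples per machine
    A = rng.normal(size=(m * n, d))
    b = A @ np.ones(d) + rng.normal(size=m * n)
    perm = rng.permutation(m * n)           # uniformly random split into m shards
    grads = []
    for i in range(m):
        idx = perm[i * n:(i + 1) * n]
        Ai, bi = A[idx], b[idx]
        grads.append(Ai.T @ (Ai @ w - bi) / n)   # gradient of the local average loss
    spread = max(np.linalg.norm(g - np.mean(grads, axis=0)) for g in grads)
    print(n, spread)                        # shrinks roughly like 1/sqrt(n)
```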
are required to solve linear regression in one round of communication however that paper assumes different model than ours where the function to be optimized is not split among the machines as in eq where each fi is convex moreover issues such as strong convexity and smoothness are not considered proves an impossibility result for distributed learning scheme even when the local functions are not merely related but actually result from splitting data uniformly at random between machines on the flip side that result is for particular algorithm and doesn apply to any possible method finally we emphasize that distributed learning and optimization can be studied under many settings including ones different than those studied here for example one can consider distributed learning on stream of data or settings where the computing architecture is different where the machines have shared memory or the function to be optimized is not split as in eq studying lower bounds in such settings is an interesting topic for future work notation and framework the only vector and matrix norms used in this paper are the euclidean norm and the spectral norm respectively ej denotes the standard unit vector we let and denote the gradient and hessians of function at if they exist is smooth with parameter if it is differentiable and the gradient is in particular if arg then kw is strongly convex with parameter if for any hg wi wk where is subgradient of at in particular if arg then kw any convex function is also with special case of smooth convex functions are quadratics where aw for some positive semidefinite matrix vector and scalar in this case and correspond to the smallest and largest eigenvalues of we model the distributed learning algorithm as an iterative process where in each round the machines may perform some local computations followed by communication round where each machine broadcasts message to all other machines we make no assumptions on the computational complexity of the local computations after all communication rounds are completed designated machine provides the algorithm output possibly after additional local computation clearly without any assumptions on the number of bits communicated the problem can be trivially solved in one round of communication each machine communicates the function fi to the designated machine which then solves eq however in practical scenarios this is and the size of each message measured by the number of bits is typically on the order of enough to send such as points in the optimization domain or gradients but not larger objects such as hessians in this model our main question is the following how many rounds of communication are necessary in order to solve problems such as eq to some given accuracy as discussed in the introduction we first need to distinguish between different assumptions on the possible relation between the local functions one natural situation is when no significant relationship can be assumed for instance when the data is arbitrarily split or is gathered by each machine from statistically dissimilar sources we denote this as the unrelated setting however this assumption is often unnecessarily pessimistic often the data allocation process is more random or we can assume that the different data sources for each machine have statistical similarities to give simple example consider learning from users activity across geographically distributed computing grid each servicing its own local population we will capture such similarities in the context of 
quadratic functions using the following definition definition we say that set of quadratic functions fi ai bi ci ai bi rd ci the hides constants and factors logarithmic in the required accuracy of the solution the idea is that we can represent real numbers up to some arbitrarily high machine precision enough so that issues are not problem are if for any it holds that kai aj kbi bj cj for example in the context of linear regression with the squared loss over bounded subset of rd and assuming mn data points with bounded norm are randomly and equally split among machines it can be shown that the conditions above hold with the choice of provides us with spectrum of learning problems ranked by difficulty this generally corresponds to the unrelated setting discussed earlier when we get the situation typical of randomly partitioned data when then all the local functions have essentially the same minimizers in which case eq can be trivially solved with zero communication just by letting one machine optimize its own local function we note that although definition can be generalized to functions we do not need it for the results presented here we end this section with an important remark in this paper we prove lower bounds for the setting which includes as special case the setting of randomly partitioned data in which case however our bounds do not apply for random partitioning since they use constructions which do not correspond to randomly partitioned data in fact very recent work has cleverly shown that for randomly partitioned data and for certain reasonable regimes of strong convexity and smoothness it is actually possible to get better performance than what is indicated by our lower bounds however this encouraging result crucially relies on the random partition property and in parameter regimes which limit how much each data point needs to be touched hence preserving key statistical independence properties we suspect that it may be difficult to improve on our lower bounds under substantially weaker assumptions lower bounds using structural assumption in this section we present lower bounds on the number of communication rounds where we impose certain mild structural assumption on the operations performed by the algorithm roughly speaking our lower bounds pertain to very large class of algorithms which are based on linear operations involving points gradients and vector products with local hessians and their inverses as well as solving local optimization problems involving such quantities at each communication round the machines can share any of the vectors they have computed so far formally we consider algorithms which satisfy the assumption stated below for convenience we state it for smooth functions which are differentiable and discuss the case of functions in sec assumption for each machine define set wj rd initially wj between communication rounds each machine iteratively computes and adds to wj some finite number of points each satisfying γw span fj fj wj diagonal fj exists fj exists for some such that after every communication round let wj wi for all the algorithm final output provided by the designated machine is point in the span of wj this assumption requires several remarks note that wj is not an explicit part of the algorithm it simply includes all points computed by machine so far or communicated to it by other machines and is used to define the set of new points which the machine is allowed to compute the assumption bears some resemblance but is far weaker than standard 
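The norm conditions in the definition stated earlier in this passage were lost in extraction; a plausible reconstruction, consistent with the surrounding discussion (the exact normalization the authors use may differ), is that quadratics f_i(w) = \tfrac{1}{2} w^\top A_i w + b_i^\top w + c_i are called δ-related if
\[
\|A_i - A_j\| \le \delta, \qquad \|b_i - b_j\| \le \delta, \qquad |c_i - c_j| \le \delta \qquad \text{for all } i, j,
\]
where the matrix norm is the spectral norm, matching the notation fixed earlier in this section.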
assumptions used to provide lower bounds for iterative optimization algorithms for example common assumption see is that each computed point must lie in the span of the previous gradients this corresponds to special case of assumption where and the span is only over gradients of previously computed points moreover it also allows for instance exact optimization of each local function which is subroutine in some distributed algorithms by setting and computing point satisfying γw by allowing the span to include previous gradients we also incorporate algorithms which perform optimization of the local function plus terms involving previous gradients and points such as as well as algorithms which rely on local hessian information and preconditioning such as in summary the assumption is satisfied by most techniques for convex optimization that we are aware of finally we emphasize that we do not restrict the number or computational complexity of the operations performed between communication rounds the requirement that is to exclude algorithms which solve local optimization problems of the form minw fj kwk with which are unreasonable in practice and can sometimes break our lower bounds the assumption that wj is initially namely that the algorithm starts from the origin is purely for convenience and our results can be easily adapted to any other starting point by shifting all functions accordingly the techniques we employ in this section are inspired by lower bounds on the iteration complexity of methods for standard optimization see for example these are based on the construction of hard functions where each gradient or subgradient computation can only provide small improvement in the objective value in our setting the dynamics are roughly similar but the necessity of many gradient computations is replaced by many communication rounds this is achieved by constructing suitable local functions where at any time point no individual machine can progress on its own without information from other machines smooth local functions we begin by presenting lower bound when the local functions fi are and smooth theorem for any even number of machines any distributed algorithm which satisfies assumption and for any there exist local quadratic functions over rd where is sufficiently large which are convex and such that if arg then the number of communication rounds required to obtain satisfying for any is at least log log if and at least if the assumption of being even is purely for technical convenience and can be discarded at the cost of making the proof slightly more complex also note that does not appear explicitly in the bound but may appear implicitly via for example in statistical setting may depend on the number of data points per machine and may be larger if the same dataset is divided to more machines let us contrast our lower bound with some existing algorithms and guarantees in the literature first regardless of whether the local functions are similar or not we can always simulate any gradientbased method designed for single machine by iteratively computing gradients of the local functions and performing communication round pmto compute their average clearly this will be gradient of the objective function fi which can be fed into any method such as gradient descent or accelerated gradient descent the resulting number of required communication rounds is then equal to the number of iterations in particular using accelerated for smooth and convex functions yields round complexity gradient descent of log 
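The round-efficient baseline just described, namely simulating a single-machine accelerated gradient method with one communication round per iteration to average the local gradients, can be sketched as follows for smooth and strongly convex local quadratics. The step size 1/L and the momentum constant are the textbook choices for Nesterov's method; the problem instance and all names are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 4, 30
As, bs = [], []
for _ in range(m):                       # local quadratics f_i(w) = 0.5 w'A_i w - b_i'w
    M = rng.normal(size=(d, d))
    As.append(np.eye(d) + 0.1 * M @ M.T)
    bs.append(rng.normal(size=d))

def averaged_gradient(w):
    # one communication round: every machine broadcasts A_i w - b_i; the average
    # is the exact gradient of F(w) = (1/m) sum_i f_i(w)
    return np.mean([A @ w - b for A, b in zip(As, bs)], axis=0)

A_avg, b_avg = np.mean(As, axis=0), np.mean(bs, axis=0)
w_star = np.linalg.solve(A_avg, b_avg)   # minimizer of F, for reference only
eigs = np.linalg.eigvalsh(A_avg)
mu, L = eigs[0], eigs[-1]                # strong convexity and smoothness of F
beta = (np.sqrt(L / mu) - 1.0) / (np.sqrt(L / mu) + 1.0)

w = y = np.zeros(d)
for _ in range(60):                      # 60 communication rounds in total
    w_next = y - averaged_gradient(y) / L     # gradient step at the extrapolated point
    y = w_next + beta * (w_next - w)          # Nesterov momentum, no extra communication
    w = w_next
print(np.linalg.norm(w - w_star))        # tiny after O(sqrt(L/mu) log(1/eps)) rounds
```

Each iteration costs exactly one communication round, so the iteration complexity of accelerated gradient descent translates directly into the round complexity discussed in the text.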
and for smooth convex functions this matches our lower bound up to constants and log factors when the local functions are unrelated when the functions are related however the upper bounds above are highly even if the local functions are completely identical and the number of communication rounds will remain the same as when to utilize function similarity while guaranteeing arbitrary small the two most relevant algorithms are dane and the more recent disco for smooth and convex functions which are either quadratic or satisfy certain condition disco achieves round complexity which matches our lower bound in terms of dependence on however for losses the round complexity bounds are somewhat worse and there are no guarantees for strongly convex and smooth functions which are not thus the question of the optimal round complexity for such functions remains open the full proof of thm appears in the supplementary material and is based on the following idea for simplicity suppose we have two machines with local functions defined as follows kwk kwk where it is easy to verify that for both and are and convex as well as moreover the optimum of their average is point with entries at all coordinates however since each local functions has quadratic term it can be shown that for any algorithm satisfying assumption after communication rounds the points computed by the two machines can only have the first coordinates no machine will be able to further progress on its own and cause additional coordinates to become without another communication round this leads to lower bound on the optimization error which depends on resulting in the theorem statement after few computations local functions remaining in the framework of algorithms satisfying assumption we now turn to discuss the situation where the local functions are not necessarily smooth or differentiable for simplicity our formal results here will be in the unrelated setting and we only informally discuss their extension to setting in sense relevant to functions formally defining functions is possible but not altogether trivial and is therefore left to future work we adapt assumption to the case by allowing gradients to be replaced by arbitrary subgradients at the same points namely we replace eq by the requirement that for some and γw νg span fj fj wj diagonal fj exists fj exists the lower bound for this setting is stated in the following theorem theorem for any even number of machines any distributed optimization algorithm which satisfies assumption and for any there exist convex continuous convex local functions and over the unit euclidean ball in rd where is sufficiently large such that if arg minw the number of communication rounds required to obtain satisfying for any sufficiently small is for and for as in thm we note that the assumption of even is for technical convenience this theorem together with thm implies that both strong convexity and smoothness are necessary for the number of communication rounds to scale logarithmically with the required accuracy we emphasize that this is true even if we allow the machines unbounded computational power to perform arbitrarily many operations satisfying assumption moreover preliminary analysis indicates that performing accelerated gradient descent on smoothed versions of the local functions using moreau proximal smoothing can match these lower bounds up to log we leave full formal derivation which has some subtleties to future work the full proof of thm appears in the supplementary material the proof idea 
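The mechanism behind the smooth-case construction sketched above can be seen in a toy instance in the spirit of the proof; the paper's exact functions and constants are not recoverable from this text, so the split below (one machine holds the quadratic couplings between one alternating set of adjacent coordinate pairs plus the only linear term, the other machine holds the complementary couplings) is an assumption. Starting from the origin, each communication round can extend the set of nonzero coordinates by at most one, which is the source of the round lower bound.

```python
import numpy as np

d, delta, lam = 12, 1.0, 0.1   # dimension, coupling strength, strong-convexity term

def grad_f1(w):
    # f1 holds the couplings (w[0]-w[1])^2, (w[2]-w[3])^2, ... and the only linear term
    g = lam * w.copy()
    g[0] -= delta
    for j in range(0, d - 1, 2):
        g[j] += delta * (w[j] - w[j + 1])
        g[j + 1] -= delta * (w[j] - w[j + 1])
    return g

def grad_f2(w):
    # f2 holds the remaining couplings (w[1]-w[2])^2, (w[3]-w[4])^2, ...
    g = lam * w.copy()
    for j in range(1, d - 1, 2):
        g[j] += delta * (w[j] - w[j + 1])
        g[j + 1] -= delta * (w[j] - w[j + 1])
    return g

w = np.zeros(d)
for r in range(1, 9):
    # one communication round: exchange local gradients, step on their average
    w = w - 0.5 * (grad_f1(w) + grad_f2(w))
    print(f"round {r}: nonzero coordinates = {np.count_nonzero(np.abs(w) > 1e-12)}")
# the optimum of the averaged objective has every coordinate nonzero, yet the
# support of the iterate grows by at most one coordinate per communication round
```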
relies on the following construction assume that we fix the number of communication rounds to be and for simplicity that is even and the number of machines is then we use local functions of the form wt kwk wt kwk where is suitably chosen parameter it is easy to verify that both local functions are convex and continuous over the unit euclidean ball similar to the smooth case we argue that after communication rounds the resulting points computed by machine will be only on the first coordinates and the points computed by machine will be only on the first coordinates as in the smooth case these functions allow us to control the progress of any algorithm which satisfies assumption finally although the result is in the unrelated setting it is straightforward to have similar construction in setting by multiplying and by the resulting two functions have their gradients and subgradients at most from each other and the construction above leads to lower bound of for convex lipschitz functions and for convex lipschitz functions in terms of upper bounds we are actually unaware of any relevant algorithm in the literature adapted to such setting and the question of attainable performance here remains wide open one round of communication in this section we study what lower bounds are attainable without any kind of structural assumption such as assumption this is more challenging setting and the result we present will be limited to algorithms using single round of communication round we note that this still captures realistic distributed computing scenario where we want each machine to broadcast single message and designated machine is then required to produce an output in the context of distributed optimization natural example is averaging algorithm where each machine optimizes its own local data and the resulting points are averaged intuitively with only single round of communication getting an arbitrarily small error may be infeasible the following theorem establishes lower bound on the attainable error depending on the strong convexity parameter and the similarity measure between the local functions and compares this with trivial algorithm which just returns the optimum of single local function theorem for any even number of machines any dimension larger than some numerical constant any and any possibly randomized algorithm which communicates at most bits in single round of communication there exist quadratic functions over rd which are convex and for which the following hold for some positive numerical constants the point returned by the algorithm satisfies min in expectation over the algorithm randomness roughly speaking for any this smoothing creates function which is to the original function plugging these into the guarantees of accelerated gradient descent and tuning yields our lower bounds note that in order to execute this algorithm each machine must be sufficiently powerful to obtain the gradient of the moreau envelope of its local function which is indeed the case in our framework for any machine if arg fj then the theorem shows that unless the communication budget is extremely large quadratic in the dimension there are functions which can not be optimized to accuracy in one round of communication in the sense that the same accuracy up to universal constant can be obtained with trivial solution where we just return the optimum of single local function this complements an earlier result in which showed that particular algorithm is no better than returning the optimum of local function under the 
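For reference, the Moreau envelope mentioned in the interleaved footnote above is defined as follows; these are standard facts about the envelope, and the paper's particular choice of the smoothing parameter is not recoverable from this text:
\[
f_\mu(w) = \min_{v}\Big[ f(v) + \tfrac{1}{2\mu}\,\|w-v\|^2 \Big], \qquad
\nabla f_\mu(w) = \tfrac{1}{\mu}\big(w - \mathrm{prox}_{\mu f}(w)\big),
\]
so f_\mu is (1/\mu)-smooth, and if f is B-Lipschitz then 0 \le f(w) - f_\mu(w) \le \mu B^2 / 2. Running accelerated gradient descent on the smoothed local functions and tuning \mu therefore trades this approximation error against the improved smoothness, which is the upper-bound strategy alluded to in the text.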
stronger assumption that the local functions are not merely but are actually the average loss over some randomly partitioned data the full proof appears in the supplementary material but we sketch the main ideas below as before focusing on the case of two machines and assuming machine is responsible for providing the output we use kwk δej where is essentially randomly chosen symmetric matrix with spectral norm at most and is suitable constant these functions can be shown to be as well as convex moreover the optimum of equals ej thus we see that the optimal point depends on the column of intuitively the machines need to approximate this column and this is the source of hardness in this setting machine knows but not yet needs to communicate to machine enough information to construct its column however given communication budget much smaller than the size of which is it is difficult to convey enough information on the column without knowing what is carefully formalizing this intuition and using some tools allows us to prove the first part of thm proving the second part of thm is straightforward using few computations summary and open questions in this paper we studied lower bounds on the number of communication rounds needed to solve distributed convex learning and optimization problems under several different settings our results indicate that when the local functions are unrelated then regardless of the local machines computational power many communication rounds may be necessary scaling polynomially with or and that the optimal algorithm at least for smooth functions is just straightforward distributed implementation of accelerated gradient descent when the functions are related we show that the optimal performance is achieved by the algorithm of for quadratic and strongly convex functions but designing optimal algorithms for more general functions remains open beside these results which required certain mild structural assumption on the algorithm employed we also provided an lower bound for algorithms which implies that even for strongly convex quadratic functions such algorithms can sometimes only provide trivial performance besides the question of designing optimal algorithms for the remaining settings several additional questions remain open first it would be interesting to get lower bounds for algorithms with multiple rounds of communication second our work focused on communication complexity but in practice the computational complexity of the local computations is no less important thus it would be interesting to understand what is the attainable performance with simple algorithms finally it would be interesting to study lower bounds for other distributed learning and optimization scenarios acknowledgments this research is supported in part by an marie curie cig grant the intel institute and israel science foundation grant we thank nati srebro for several helpful discussions and insights references agarwal chapelle and langford reliable effective terascale linear learning system corr balcan blum fine and mansour distributed learning communication complexity and privacy in colt balcan kanchanapally liang and woodruff improved distributed principal component analysis in nips bekkerman bilenko and langford scaling up machine learning parallel and distributed approaches cambridge university press boyd parikh chu peleato and eckstein distributed optimization and statistical learning via admm foundations and trends in machine learning clarkson and woodruff numerical linear algebra in the 
streaming model in stoc cotter shamir srebro and sridharan better algorithms via accelerated gradient methods in nips dekel shamir and xiao optimal distributed online prediction using journal of machine learning research duchi agarwal and wainwright dual averaging for distributed optimization convergence analysis and network scaling ieee trans automat frostig ge kakade and sidford competing with the empirical risk minimizer in single pass arxiv preprint jaggi smith terhorst krishnan hofmann and jordan distributed dual coordinate ascent in nips lee ma and lin distributed stochastic variance reduced gradient methods corr mahajan keerthy sundararajan and bottou parallel sgd method with strong convergence corr nesterov introductory lectures on convex optimization basic course springer nesterov smooth minimization of functions mathematical programming recht wright and niu hogwild approach to parallelizing stochastic gradient descent in nips and distributed coordinate descent method for learning with big data corr shamir fundamental limits of online and distributed algorithms for statistical learning and estimation in nips shamir and srebro on distributed stochastic optimization and learning in allerton conference on communication control and computing shamir srebro and zhang distributed optimization using an approximate method in icml tao topics in random matrix theory volume american mathematical tsitsiklis and luo communication complexity of convex optimization complexity yang trading computation for communication distributed sdca in nips yu better approximation and faster algorithm using proximal average in nips zhang duchi and wainwright algorithms for statistical optimization journal of machine learning research zhang and xiao distributed optimization of empirical loss in icml zinkevich weimer smola and li parallelized stochastic gradient descent in nips 
learning of lda by back propagation over deep architecture jianshu ji yelong lin xiaodong jianfeng xinying and li microsoft research redmond wa usa jianshuc yeshen xiaohe jfgao xinson deng department of electrical engineering university of washington seattle wa usa jvking abstract we develop fully discriminative learning approach for supervised latent dirichlet allocation lda model using back propagation which maximizes the posterior probability of the prediction variable given the input document different from traditional variational learning or gibbs sampling approaches the proposed learning method applies the mirror descent algorithm for maximum posterior inference and ii back propagation over deep architecture together with stochastic descent for model parameter estimation leading to scalable and discriminative learning of the model as byproduct we also apply this technique to develop new learning method for the traditional unsupervised lda model experimental results on three regression and classification tasks show that the proposed methods significantly outperform the previous supervised topic models neural networks and is on par with deep neural networks introduction latent dirichlet allocation lda among various forms of topic models is an important probabilistic generative model for analyzing large collections of text corpora in lda each document is modeled as collection of words where each word is assumed to be generated from certain topic drawn from topic distribution the topic distribution can be viewed as latent representation of the document which can be used as feature for prediction purpose sentiment analysis in particular the inferred topic distribution is fed into separate classifier or regression model logistic regression or linear regression to perform prediction such separate learning structure usually significantly restricts the performance of the algorithm for this purpose various supervised topic models have been proposed to model the documents jointly with the label information in variational methods was applied to learn supervised lda slda model by maximizing the lower bound of the joint probability of the input data and the labels the disclda method developed in learns the transformation matrix from the latent topic representation to the output in discriminative manner while learning the topic to word distribution in generative manner similar to the standard lda in max margin supervised topic models are developed for classification and regression which are trained by optimizing the sum of the variational bound for the log marginal likelihood and an additional term that characterizes the prediction margin these methods successfully incorporate the information from both the input data and the labels and showed better performance in prediction compared to the vanilla lda model one challenge in lda is that the exact inference is intractable the posterior distribution of the topics given the input document can not be evaluated explicitly for this reason various approximate zd yd wd figure graphical representation of the supervised lda model shaded nodes are observables inference methods are proposed such as variational learning and gibbs sampling for computing the approximate posterior distribution of the topics in this paper we will show that although the full posterior probability of the topic distribution is difficult its maximum posteriori map inference as simplified problem is convex optimization problem when the dirichlet parameter satisfies certain conditions 
which can be solved efficiently by the mirror descent algorithm mda indeed sontag and roy pointed out that the map inference problem of lda in this situation is and can be solved by an exponentiated gradient method which shares same form as our algorithm with constant nevertheless different from which studied the inference problem alone our focus in this paper is to integrate back propagation with algorithm to perform fully discriminative training of supervised topic models as we proceed to explain below among the aforementioned methods one training objective of the supervised lda model is to maximize the joint likelihood of the input and the output variables another variant is to maximize the sum of the log likelihood or its variable bound and prediction margin moreover the disclda optimizes part of the model parameters by maximizing the marginal likelihood of the input variables and optimizes the other part of the model parameters by maximizing the conditional likelihood for this reason disclda is not fully discriminative training of all the model parameters in this paper we propose fully discriminative training of all the model parameters by maximizing the posterior probability of the output given the input document we will show that the discriminative training can be performed in principled manner by naturally integrating the backpropagation with the exact map inference to our best knowledge this paper is the first work to perform fully discriminative training of supervised topic models discriminative training of generative model is widely used and usually outperforms standard generative training in prediction tasks as pointed out in discriminative training increases the robustness against the mismatch between the generative model and the real data experimental results on three tasks also show the superior performance of discriminative training in addition to the aforementioned related studies on topic models there have been another stream of work that applied empirical risk minimization to graphical models such as markov random field and nonnegative matrix factorization specifically in an approximate inference algorithm belief propagation is used to compute the belief of the output variables which is further fed into decoder to produce the prediction the approximate inference and the decoder are treated as an entire decision rule which is tuned jointly via back propagation our work is different from the above studies in that we use an map inference based on optimization theory to motivate the discriminative training from principled probabilistic framework smoothed supervised lda model we consider the smoothed supervised lda model in figure let be the number of topics be the number of words in each document be the vocabulary size and be the number of documents in the corpus the generative process of the model in figure can be described as for each document choose the topic proportions according to dirichlet distribution dir where is vector consisting of nonnegative components draw each column of matrix independently from an exchangeable dirichlet distribution dir where is the smoothing parameter to generate each word wd choose topic zd zd multinomial choose word wd wd multinomial zd choose the response vector yd yd in regression yd where is matrix consisting of regression coefficients in classification yd multinomial softmax where xc the softmax function is defined as softmax pc therefore the entire model can be described by the following joint probability yd yd where and denotes all 
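A small sampler makes the generative process above concrete. Phi is the V x K topic matrix with columns on the simplex, theta_d is the per-document topic proportion, and the response is generated from theta_d (the coupling the paper adopts, as discussed immediately below); all dimensions, hyperparameter values, and the Gaussian noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N, alpha, beta = 5, 1000, 80, 1.1, 0.01   # illustrative sizes and hyperparameters

Phi = rng.dirichlet(beta * np.ones(V), size=K).T   # V x K, each column drawn from Dir(beta)
U = rng.normal(size=K)                             # regression coefficients

def generate_document():
    theta = rng.dirichlet(alpha * np.ones(K))      # topic proportions ~ Dir(alpha)
    z = rng.choice(K, size=N, p=theta)             # a topic for every word position
    words = np.array([rng.choice(V, p=Phi[:, k]) for k in z])
    y = U @ theta + 0.1 * rng.normal()             # regression response from theta
    # for classification, one would instead draw y ~ Multinomial(softmax(U_c @ theta))
    return words, y

words, y = generate_document()
x_d = np.bincount(words, minlength=V)              # bag-of-words count vector x_d
print(x_d.sum(), y)
```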
the words and the associated topics respectively in the document note that the model in figure is slightly different from the one proposed in where the response variable yd in figure is coupled with instead of as in blei and mcauliffe also pointed out this choice as an alternative in this modification will lead to differentiable cost trainable by back propagation with superior prediction performance to develop fully discriminative training method for the model parameters and we follow the argument in which states that the discriminative training is also equivalent to maximizing the joint likelihood of new model family with an additional set of parameters arg max yd where is obtained by marginalizing yd in and replace with the above problem decouples into arg max ln ln yd arg max ln ln which are the discriminative learning problem of supervised lda eq and the unsupervised learning problem of lda eq respectively we will show that both problems can be solved in unified manner using new map inference and back propagation maximum posterior map inference we first consider the inference problem in the smoothed lda model for the supervised case the main objective is to infer yd given the words in each document computing yd yd where the probability yd is known multinomial or gaussian for classification and regression problems see section the main challenge is to evaluate infer the topic proportion given each document which is also the important inference problem in the unsupervised lda model however it is well known that the exact evaluation of the posterior probability is intractable for this reason various approximate inference methods such as variational inference and gibbs sampling we will represent all the multinomial variables by vector that has single component equal to one at the position determined by the multinomial variable and all other components being zero have been proposed to compute the approximate posterior probability in this paper we take an alternative approach for inference given each document we only seek point map estimate of instead of its full approximate posterior probability the major motivation is that although the full posterior probability of is difficult its map estimate as simplified problem is more tractable and it is convex problem under certain conditions furthermore with the map estimate of we can infer the prediction variable yd according to the following approximation from yd yd yd arg max where denotes the conditional expectation with respect to given and the expectation is sampled by the map estimate of given defined as the approximation gets more precise when becomes more concentrated around experimental results on several real datasets section show that the approximation provides excellent prediction performance using the bayesian rule and the fact that is independent of we obtain the equivalent form of as arg max ln ln pk where pk denotes the probability simplex is the dirichlet distribution and can be computed by integrating qn wd zd over which leads to derived in section of the supplementary material vj xd where xd denotes the term frequency of the word in vocabulary inside the document and xd denotes the bow vector of the document note that depends on only via the bow vector xd which is the sufficient statistics therefore we use xd and interchangeably from now on substituting the expression of dirichlet distribution and into we get arg max xtd ln ln arg min xtd ln ln where we dropped the terms independent of and denotes an vector note that when the 
Dirichlet parameter satisfies α ≥ 1, the optimization problem above is convex (strictly convex when α > 1), and it is non-convex otherwise.

Mirror descent algorithm for MAP inference. An efficient approach to solving the constrained optimization problem above is the mirror descent algorithm (MDA) with the Bregman divergence chosen to be a generalized KL-divergence. Specifically, let f(θ) = −x_d^T ln(Φθ) − (α − 1)^T ln θ denote the cost function above; then MDA updates the MAP estimate of θ_d iteratively according to

  θ_{d,ℓ} = argmin_{θ ∈ P_K} [ f(θ_{d,ℓ−1}) + ∇f(θ_{d,ℓ−1})^T (θ − θ_{d,ℓ−1}) + (1/T_{d,ℓ}) Ψ(θ, θ_{d,ℓ−1}) ],

where θ_{d,ℓ} denotes the estimate of θ_d at the ℓ-th iteration, T_{d,ℓ} denotes the step size of MDA, and Ψ(x, y) is the Bregman divergence, chosen to be the generalized KL-divergence Ψ(x, y) = Σ_j x_j ln(x_j / y_j) − Σ_j x_j + Σ_j y_j. The argmin in this update can be solved in closed form (see the supplementary material) as

  θ_{d,ℓ} = (1/C) · θ_{d,ℓ−1} ⊙ exp{ T_{d,ℓ} [ Φ^T (x_d ⊘ (Φθ_{d,ℓ−1})) + (α − 1) ⊘ θ_{d,ℓ−1} ] },

where C is a normalization factor such that θ_{d,ℓ} adds up to one, ⊙ denotes the Hadamard product, ⊘ denotes element-wise division, and L is the number of MDA iterations.

[Figure: layered deep architecture for computing p(y_d | x_d, Φ, U, α), built by stacking mirror-descent cells that implement the recursion above, followed by a normalization step and the output layer; ⊘ denotes element-wise division, ⊙ the Hadamard product, and exp the element-wise exponential.]

Note that the recursion naturally enforces each θ_{d,ℓ} to lie on the probability simplex. The MDA step size T_{d,ℓ} can be either a constant or adaptive over iterations and samples (determined by line search; see the supplementary material). The computation complexity of the update is low, since most computations are sparse matrix operations: for example, although Φθ_{d,ℓ−1} is by itself a dense matrix-vector multiplication, we only need to evaluate its elements at the positions where the corresponding elements of x_d are nonzero, because all other elements of x_d are known to be zero. Overall, the computation complexity of each iteration is on the order of the number of unique tokens in the document, n_tok, times the number of topics. In practice we only run a small number L of iterations and use θ_{d,L} to approximate the MAP estimate, so that the prediction rule above becomes p(y_d | x_d, Φ, U, α) ≈ p(y_d | θ_{d,L}, U). In summary, the inference of θ_d and y_d can be implemented by the layered architecture in the figure, where the top layer infers y_d using θ_{d,L} and the MDA layers infer θ_{d,L} iteratively via the recursion above. The figure also implies that the MDA layers act as a feature extractor by generating the MAP estimate θ_{d,L} for the output layer. Our learning strategy, developed in the next section, jointly learns the model parameter U at the output layer and the model parameter Φ at the feature-extractor layers so as to maximize the posterior probability of the prediction variable given the input document.

Learning by back propagation. We now consider the supervised learning problem and the unsupervised learning problem, respectively, using the developed MAP inference. For the supervised case, with the approximation above, the discriminative learning problem can be approximated by

  arg min_{U, Φ} − (1/D) Σ_d ln p(y_d | θ_{d,L}(Φ, α), U),

which can be solved by stochastic mirror descent (SMD). Note that the cost function depends on U explicitly through p(y_d | θ_{d,L}, U), which can be computed directly from its definition in the model section; on the other hand, it depends on Φ implicitly through θ_{d,L}. From the figure we observe that θ_{d,L} not only depends on Φ explicitly (as indicated in the MDA block at that layer) but also depends on Φ implicitly via θ_{d,L−1}, which in turn depends on Φ both explicitly and implicitly (through θ_{d,L−2}), and so on; that is, the dependency of the cost function on Φ is layered. Therefore, we devise a back-propagation procedure to efficiently compute its gradient with respect to Φ according to this graph, which back-propagates the error signal through the MDA blocks at the different layers. The gradient formulas and the implementation details of the learning algorithm can be found in the supplementary material. For the unsupervised learning problem, the gradient of ln p(x_d | Φ, α) with respect to Φ assumes the same layered form and is obtained by the same back propagation through the MDA blocks with a different error signal at the top layer; moreover, it can be shown that the gradient of ln p(x_d | Φ, α) with
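The unrolled mirror-descent inference just described can be sketched as follows: each layer applies the closed-form exponentiated-gradient step and renormalizes onto the simplex, and the final theta feeds the softmax output layer. The constant step size, the dimensions, and the random Phi and U below are illustrative assumptions; in the paper, Phi and U are learned by back-propagating the prediction error through exactly this recursion.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def map_inference(x_d, Phi, alpha, L=20, T=0.05):
    """L unrolled mirror-descent (exponentiated-gradient) steps for the MAP
    topic proportions of one document. x_d is the bag-of-words count vector
    and Phi is V x K with columns on the simplex."""
    K = Phi.shape[1]
    theta = np.full(K, 1.0 / K)                    # uniform initialization
    for _ in range(L):
        step = T * (Phi.T @ (x_d / (Phi @ theta)) + (alpha - 1.0) / theta)
        # subtracting step.max() rescales every component by a common factor,
        # which the renormalization cancels; it only guards exp against overflow
        theta = theta * np.exp(step - step.max())
        theta = theta / theta.sum()                # stays on the probability simplex
    return theta

rng = np.random.default_rng(0)
V, K, alpha = 50, 5, 1.1
Phi = rng.dirichlet(np.ones(V), size=K).T          # illustrative topic matrix
U = rng.normal(size=(3, K))                        # output weights for 3 classes

x_d = rng.integers(0, 3, size=V).astype(float)     # toy word counts
theta_hat = map_inference(x_d, Phi, alpha)
print(softmax(U @ theta_hat))                      # p(y_d | x_d) from the top layer
```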
respect can be expressed as see section of the supplementary material ln ln xd ln xd where xd assumes the same form as except is replaced by the expectation is evaluated with respect to the posterior probability and is sampled by the map estimate of in step is an approximation of computed via and figure experiments description of datasets and baselines we evaluated our proposed supervised learning denoted as and unsupervised learning denoted as methods on three datasets the first dataset we use is dataset built on amazon movie reviews amr the data set consists of million movie reviews billion words from amazon written by users on total of movies for text preprocessing we removed punctuations and lowercasing capital letters vocabulary of size is built by selecting the most frequent words in another setup we keep the full vocabulary of same as we shifted the review scores so that they have zero mean the task is formulated as regression problem where we seek to predict the rating score using the text of the review second we consider sentiment multisent classification task which contains total reviews on types of products such as apparel electronics kitchen and housewares the task is formulated as binary classification problem to predict the polarity positive or negative of each review likewise we preprocessed the text by removing punctuations and lowercasing capital letters and built vocabulary of size from the most frequent words in addition we also conducted second binary text classification experiment on proprietary dataset for applications documents and vocabulary size of the baseline algorithms we considered include gibbs sampling regression on slda and medlda which are implemented either in or java and our proposed algorithms are implemented in for and we first train the models in an unsupervised manner and then generate topic proportion as their features in the inference steps on top of which we train linear logistic regression model on the regression classification tasks prediction performance we first evaluate the prediction performance of our models and compare them with the traditional supervised topic models since the training of the baseline topic models takes much longer time than and see figure we compare their performance on two smaller datasets namely subset documents of amr randomly sampled from the million reviews and the multisent dataset documents which are all evaluated with cross validation for amr regression use the predictive we ro measure othe prediction performance defined as pr yd yd yd where yd denotes the label of the document in the heldout set during the cross validation is the mean of all ydo in the heldout set and yd is the predicted value the scores of different models with varying number of topics are shown in figure note that the model outperforms the other baselines with large margin moreover the unsupervised model outperforms the unsupervised lda model trained by gibbs sampling second on the multisent binary classification task we use the auc of the operating curve of probability of correct positive versus probability of false positive as our performance metric which are shown in figure it also shows that outperforms other methods and that outperforms the model next we compare our model with other strong discriminative models such as neural networks by conducting two experiments regression task on amr full dataset documents and ii binary classification task on the proprietary dataset documents for the amr regression we can see that improves significantly 
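The two evaluation metrics used in this section can be computed as below; this is a generic implementation of the predictive pR^2 defined above and of AUC via the rank (Mann-Whitney) statistic, not the authors' evaluation script, and ties in the scores are ignored for brevity.

```python
import numpy as np

def predictive_r2(y_true, y_pred):
    """pR^2 = 1 - sum((y - yhat)^2) / sum((y - mean(y))^2) on the held-out set."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def auc(labels, scores):
    """Area under the ROC curve from the Mann-Whitney rank statistic."""
    labels, scores = np.asarray(labels), np.asarray(scores, float)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

print(predictive_r2([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))   # 0.97
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))          # 0.75
```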
compared code is available online at https logistic regression logistic regression medlda slda auc linear medlda slda auc amr regression task number of topics number of topics multisent classification task number of topics multisent task zoom in figure prediction performance on amr regression task measured in and multisent classification task measured in auc higher score is better for both with perfect value being one table in percentage on full amr data documents the standard deviations in the parentheses are obtained from cross validation number of topics linear regression neural network linear regression to the best results on the dataset shown in figure and also significantly outperform the neural network models with same number of model parameters moreover the best deep neural network in hidden layers gives of which is worse than of in addition also significantly outperforms and the hybrid method initialized with whose scores reported in are between and for topics and deteriorate when further increasing the topic number the results therein are obtained under same setting as this paper to further demonstrate the superior performance of on the large vocabulary scenario we trained on full vocabulary amr and show the results in table which are even better than the vocabulary case finally for the binary text classification task on the proprietary dataset the aucs are given in table where topics achieves and relative improvements over logistic regression and neural network respectively moreover on this task is also on par with the best dnn larger model consisting of hidden units with dropout which achieves an auc of analysis and discussion we now analyze the influence of different hyper parameters on the prediction performance note from figure that when we increase the number of topics the score of first improves and then slightly deteriorates after it goes beyond topics this is most likely to be caused by overfitting on the small dataset documents because the models trained on the full dataset produce much higher scores table than that on the dataset and keep improving as the model size number of topics increases to understand the influence of the mirror descent steps on the prediction performance we plot in figure the scores of on the amr dataset for different values of steps when increases for small models and the score remains the same and for larger model the score first improves and then remain the same one explanation for this phenomena is that larger implies that the inference problem becomes an optimization problem of higher dimension which requires more mirror descent iterations moreover the mirrordescent back propagation as an training of the prediction output would compensate the imperfection caused by the limited number of inference steps which makes the performance insensitive to once it is large enough in figure we plot the percentage of the dominant table auc in percentage on the proprietary data documents vocabulary the standard deviations in the parentheses are obtained from five random initializations topics topics topics number of topics number of mirror descent iterations layers influence of mda iterations negative percentage of dominant topics number of topics logistic regression neural network number of topics sparsity of the topic distribution figure analysis of the behaviors of and models topics which add up to probability on amr which shows that learns sparse topic distribution even when and obtains sparser topic distribution with smaller and in figure we evaluate 
the of the unsupervised models on amr dataset using the method in the of with is worse than the case of and for although its prediction performance is better this suggests the importance of the dirichlet prior in text modeling and potential tradeoff between the text modeling performance and the prediction performance efficiency in computation time to compare the efficiency of the algorithms we show the training time of different models on the amr dataset and in figure which shows that our algorithm scales well with respect to increasing model size number of topics and increasing number of data samples conclusion training time in hours slda medlda we have developed novel learning approaches for supervised lda models using map inference and back propagation which leads to an discriminative training we evaluate the number of topics prediction performance of the model on three realworld regression and classification tasks the results show that the discriminative training figure training time on the amr dataset cantly improves the performance of the supervised tested on intel xeon lda model relative to previous learning methods future works include exploring faster algorithms for the map inference accelerated mirror descent ii developing learning of lda using the framework from and iii learning from data finally also note that the layered architecture in figure could be viewed as deep feedforward neural network with structures designed from the topic model in figure this opens up new direction of combining the strength of both generative models and neural networks to develop new deep learning models that are scalable interpretable and having high prediction performance for text understanding and information retrieval references asuncion welling smyth and teh on smoothing and inference for topic models in proc uai pages beck and teboulle mirror descent and nonlinear projected subgradient methods for convex optimization operations research letters bishop and lasserre generative or discriminative getting the best of both worlds bayesian statistics blei and mcauliffe supervised topic models in proc nips pages blei ng and jordan latent dirichlet allocation jmlr blitzer dredze and pereira biographies bollywood and blenders domain adaptation for sentiment classification in proc acl volume pages bouchard and triggs the tradeoff between generative and discriminative classifiers in proc compstat pages duchi hazan and singer adaptive subgradient methods for online learning and stochastic optimization journal of machine learning research jul griffiths and steyvers finding scientific topics proc of the national academy of sciences pages hershey roux and weninger deep unfolding inspiration of novel deep architectures hinton deng yu dahl mohamed jaitly senior vanhoucke nguyen sainath and kingsbury deep neural networks for acoustic modeling in speech recognition the shared views of four research groups ieee signal process holub and perona discriminative framework for modelling object classes in proc ieee cvpr volume pages huang he gao deng acero and heck learning deep structured semantic models for web search using clickthrough data in proc cikm pages kapadia discriminative training of hidden markov models phd thesis university of cambridge sha and jordan disclda discriminative learning for dimensionality reduction and classification in proc nips pages mcauley and leskovec from amateurs to connoisseurs modeling the evolution of user expertise through online reviews in proc www pages andrew kachites mccallum 
http mallet machine learning for language toolkit nemirovsky yudin problem complexity and method efficiency in optimization wiley new york sontag and roy complexity of inference in latent dirichlet allocation in proc nips pages stoyanov ropson and eisner empirical risk minimization of graphical model parameters given approximate inference decoding and model structure in proc aistats pages tseng on accelerated proximal gradient methods for optimization siam journal on optimization wallach mimno and mccallum rethinking lda why priors matter in proc nips pages wallach murray salakhutdinov and mimno evaluation methods for topic models in proc icml pages wang and zhu spectral methods for supervised topic models in proc nips pages oksana yakhnenko adrian silvescu and vasant honavar discriminatively trained markov model for sequence classification in proc ieee icdm zhu ahmed and xing medlda maximum margin supervised topic models jmlr zhu chen perkins and zhang gibbs topic models with data augmentation jmlr 
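As a concrete reference for the evaluation metric used in the experiments above, the following minimal NumPy sketch computes the predictive R-squared (pR^2) score reported for the AMR regression task, i.e. one minus the ratio of the residual sum of squares to the total sum of squares on the held-out fold. The function name and the cross-validation usage shown in the trailing comment are illustrative assumptions, not part of the paper's released code.

```python
import numpy as np

def predictive_r2(y_true, y_pred):
    """Predictive R^2 (pR^2): 1 - SS_residual / SS_total on a held-out set.

    A perfect predictor gives pR^2 = 1; predicting the held-out mean gives 0.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the held-out mean
    return 1.0 - ss_res / ss_tot

# Hypothetical cross-validation usage (fold indices and model are placeholders):
# scores = [predictive_r2(y[test], model.fit(X[train], y[train]).predict(X[test]))
#           for train, test in folds]
```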
subset selection by pareto optimization chao qian yang yu zhou national key laboratory for novel software technology nanjing university collaborative innovation center of novel software technology and industrialization nanjing china qianc yuy zhouzh abstract selecting the optimal subset from large set of variables is fundamental problem in various learning tasks such as feature selection sparse regression dictionary learning etc in this paper we propose the poss approach which employs evolutionary pareto optimization to find subset with good performance we prove that for sparse regression poss is able to achieve the theoretically guaranteed approximation performance efficiently particularly for the exponential decay subclass poss is proven to achieve an optimal solution empirical study verifies the theoretical results and exhibits the superior performance of poss to greedy and convex relaxation methods introduction subset selection is to select subset of size from total set of variables for optimizing some criterion this problem arises in many applications feature selection sparse learning and compressed sensing the subset selection problem is however generally previous employed techniques can be mainly categorized into two branches greedy algorithms and convex relaxation methods greedy algorithms iteratively select or abandon one variable that makes the criterion currently optimized which are however limited due to its greedy behavior convex relaxation methods usually replace the set size constraint the with convex constraints the constraint and the elastic net penalty then find the optimal solutions to the relaxed problem which however could be distant to the true optimum pareto optimization solves problem by reformulating it as optimization problem and employing evolutionary algorithm which has significantly developed recently in theoretical foundation and applications this paper proposes the poss pareto optimization for subset selection method which treats subset selection as optimization problem that optimizes some given criterion and the subset size simultaneously to investigate the performance of poss we study representative example of subset selection the sparse regression the subset selection problem in sparse regression is to best estimate predictor variable by linear regression where the quality of estimation is usually measured by the mean squared error or equivalently the squared multiple correlation gilbert et al studied the approach with orthogonal matching pursuit omp and proved the multiplicative approximation guarantee µk for the mean squared error when the coherence the maximum correlation between any pair of observation variables is this approximation bound was later improved by under the same small coherence condition das and kempe analyzed the forward regression fr algorithm and obtained an approximation guarantee µk for these results however will break down when by introducing the submodularity ratio das and kempe proved the approximation guarantee on by the fr algorithm this guarantee is considered to be the strongest since it can be applied with any coherence note that sparse regression is similar to the problem of sparse recovery but they are for different purposes assuming that the predictor variable has sparse representation sparse recovery is to recover the exact coefficients of the truly sparse solution we theoretically prove that for sparse regression poss using polynomial time achieves multiplicative approximation guarantee for squared multiple correlation 
the guarantee obtained by the fr algorithm for the exponential decay subclass which has clear applications in sensor networks poss can provably find an optimal solution while fr can not the experimental results verify the theoretical results and exhibit the superior performance of poss we start the rest of the paper by introducing the subset selection problem we then present in three subsequent sections the poss method its theoretical analysis for sparse regression and the empirical studies the final section concludes this paper subset selection the subset selection problem originally aims at selecting few columns from matrix so that the matrix is most represented by the selected columns in this paper we present the generalized subset selection problem that can be applied to arbitrary criterion evaluating the selection the general problem given set of observation variables xn criterion and positive integer the subset selection problem is to select subset such that is optimized with the constraint where denotes the size of set for notational convenience we will not distinguish between and its index set is xi subset selection is formally stated as follows definition subset selection given all variables xn criterion and positive integer the subset selection problem is to find the solution of the optimization problem arg the subset selection problem is in general except for some extremely simple criteria in this paper we take sparse regression as the representative case sparse regression sparse regression finds sparse approximation solution to the regression problem where the solution vector can only have few elements definition sparse regression given all observation variables xn predictor variable and positive integer define the mean squared error of subset as sez αi xi sparse regression is to find set of at most variables minimizing the mean squared error arg sez for the ease of theoretical treatment the squared multiple correlation rz ar sez ar is used to replace sez so that the sparse regression is equivalently arg rz sparse regression is representative example of subset selection note that we will study eq in this paper without loss of generality we assume that all random variables are normalized to have expectation and variance thus rz is simplified to be sez for sparse regression das and kempe proved that the forward regression fr algorithm sented in algorithm can produce solution with and rz op where op denotes the optimal function value of eq which is the best currently known approximation guarantee the fr algorithm is greedy approach which iteratively selects variable with the largest improvement algorithm forward regression input all variables xn predictor variable and an integer parameter output subset of with variables process let and st repeat let be variable maximizing rz arg rz let st and until return sk the poss method the subset selection in eq can be separated into two objectives one optimizes the criterion meanwhile the other keeps the size small max usually the two objectives are conflicting that is subset with better criterion value could have larger size the poss method solves the two objectives simultaneously which is described as follows let us use the binary vector representation for subsets membership indication represents subset of by assigning si if the element of is in and si otherwise we assign two properties for solution is the criterion value and is the sparsity or otherwise where the set of to is to exclude trivial or overly bad solutions we further introduce 
the isolation function as in which determines if two solutions are allowed to be compared they are comparable only if they have the same isolation function value the implementation of is left as parameter of the method while its effect will be clear in the analysis as will be introduced later we need to compare solutions for solutions and we first judge if they have the same isolation function value if not we say that they are incomparable if they have the same isolation function value is worse than if has smaller or equal value on both the properties is strictly worse if has strictly smaller value in one property and meanwhile has smaller or equal value in the other property but if both is not worse than and is not worse than we still say that they are incomparable poss is described in algorithm starting from the solution representing an empty set and the archive containing only the empty set line poss generates new solutions by randomly flipping bits of an archived solution in the binary vector representation as lines and newly generated solutions are compared with the previously archived solutions line if the newly generated solution is not strictly worse than any previously archived solution it will be archived before archiving the newly generated solution in line the archive set is cleaned by removing solutions in which are previously archived solutions but are worse than the newly generated solution the iteration of poss repeats for times note that is parameter which could depend on the available resource of the user we will analyze the relationship between the solution quality and in later sections and will use the theoretically derived value in the experiments after the iterations we select the final solution from the archived solutions according to eq select the solution with the smallest value while the constraint on the set size is kept line poss for sparse regression in this section we examine the theoretical performance of the poss method for sparse regression for sparse regression the criterion is implemented as note that minimizing is equivalent to the original objective that maximizes rz in eq we need some notations for the analysis let cov be the covariance between two random variables be the covariance matrix between all observation variables ci cov xi xj algorithm poss input all variables xn given criterion and an integer parameter parameter the number of iterations and an isolation function output subset of with at most variables process let and let while do select from uniformly at random generate from by flipping each bit of with probability if such that and or then end if end while return arg and be the covariance vector between and observation variables bi cov xi let cs denote the submatrix of with row and columnpset and bs denote the subvector of containing elements bi with let res αi xi denote the residual of with respect to where is the least square solution to sez the submodularity ratio presented in definition is measure characterizing how close set function is to submodularity it is easy to see that is submodular iff γu for any and for being the objective function we will use γu shortly in the paper definition submodularity ratio let be set function the submodularity ratio of with respect to set and parameter is γu min on general sparse regression our first result is the theoretical approximation bound of poss for sparse regression in theorem let op denote the optimal function value of eq the expected running time of poss is the average number of objective 
function evaluations the most step which is also the average number of iterations denoted by since it only needs to perform one objective evaluation for the newly generated solution in each iteration theorem for sparse regression poss with and constant tion finds set of variables with and rz op the proof relies on the property of in lemma that for any subset of variables there always exists another variable the inclusion of which can bring an improvement on proportional to the current distance to the optimum lemma is extracted from the proof of theorem in lemma for any there exists one variable such that rz rz op rz proof let be the optimal set of variables of eq rz op let sk and res using lemmas and in we can easily derive that rz rz rz because rz increases with and sk we have rz op thus rz op rz by definition and rz we get rz arg maxx rz rz op rz let then rz op rz op rz let correspond to res thus rz rz op the lemma holds proof of theorem since the isolation function is constant function all solutions are allowed to be compared and we can ignore it let jmax denote the maximum value of such that in the archive set there exists solution with and rz op that is jmax max rz op we then analyze the expected iterations until jmax which implies that there exists one solution in satisfying that and rz op op the initial value of jmax is since poss starts from assume that currently jmax let be corresponding solution with the value and rz op it is easy to see that jmax can not decrease because cleaning from lines and of algorithm implies that is worse than newly generated solution which must have smaller size and larger value by lemma we know that flipping one specific bit of adding specific variable into can generate new solution which satisfies that rz rz op rz then we have rz rz op op since will be included into otherwise from line of algorithm must be strictly worse than one solution in and this implies that jmax has already been larger than which contradicts with the assumption jmax after including jmax let pmax denote the largest size of thus jmax can increase by at least in one iteration with probability at where pmax is lower bound on the probability of selecting in least pmax line of algorithm and is the probability of flipping specific bit of and keeping other bits unchanged in line then it needs at most enpmax expected iterations to increase jmax thus after enpmax expected iterations jmax must have reached by the procedure of poss we know that the solutions maintained in must be incomparable thus each value of one property can correspond to at most one solution in because the solutions with have value on the first property they must be excluded from thus which implies that pmax hence the expected number of iterations for finding the desired solution is at most comparing with the approximation guarantee of fr op it is easy to see that γs from definition thus poss with the simplest configuration of the isolation function can do at least as well as fr on any sparse regression problem and achieves the best previous approximation guarantee we next investigate if poss can be strictly better than fr on the exponential decay subclass our second result is on subclass of sparse regression called exponential decay as in definition in this subclass the observation variables can be ordered in line such that their covariances are decreasing exponentially with the distance definition exponential decay the variables xi are associated with points yn and ci for some constant since we have shown that poss 
with constant isolation function is generally good we prove below that poss with proper isolation function can be even better it is strictly better than fr on the exponential decay subclass as poss finds an optimal solution theorem while fr can not proposition the isolation function min si implies that two solutions are comparable only if they have the same minimum index for bit theorem for the exponential decay subclass of sparse regression poss with log and min si finds an optimal solution the proof of theorem utilizes the dynamic programming property of the problem as in lemma lemma let denote the maximum rz value by choosing variables including necessarily xj from xj xn that is max rz xj xn xj then the following recursive relation holds bj bi where the term in is the value by adding xj into the variable subset corresponding to proof of theorem we divide the optimization process into phases where the phase starts after the phase has finished we define that the phase finishes when for each solution corresponding to there exists one solution in the archive which is better than it here solution is better than is equivalent to that is worse than let ξi denote the iterations since phase has finished until phase is completed starting from the solution the phase has finished then we consider ξi in this phase from lemma we know that solution better than corresponding solution of can be generated by selecting specific one from the solutions better than and flipping its bit which happens with probability at least pmax thus if we have found desired solutions in the phase the probability of finding new desired solution in the next iteration is at least where is the total number of enpmax sired solutions to find in the phase then ξi log npmax therefore the expected number of iterations is kn log npmax until the phase finishes which implies that an optimal solution corresponding to has been found note that pmax because the incomparable property of the maintained solutions by poss ensures that there exists at most one solution in for each possible combination of and thus for finding an optimal solution is log then we analyze fr algorithm for this special class we show below that fr can be blocked from finding an optimal solution by giving simple example example xi ri yi where ri and yi are independent random variables with expectation such that each xi has variance qj for cov xi xj rk then it is easy to verify that example belongs to the pi exponential decay class by letting and yi loga rk for proposition for example with cov cov and cov fr can not find the optimal solution for proof the covariances between xi and are and since xi and have expectation and variance rz can be simply represented as bs cs bs we then calculate the value as follows rz rz rz rz rz rz the optimal solution for is fr first selects since rz is the largest then selects since rz rz thus produces local optimal solution it is also easy to verify that other two previous methods omp and foba can not find the optimal solution for this example due to their greedy nature empirical study we conducted experiments on data in table to compare poss with the following methods fr iteratively adds one variable with the largest improvement on omp iteratively adds one variable that mostly correlates with the predictor variable residual foba is based on omp but deletes one variable adaptively when beneficial set parameter the solution path length is five times as long as the maximum sparsity level and the last active set containing variables is 
used as the final selection rfe iteratively deletes one variable with the smallest weight by linear regression lasso scad and mcp replaces the norm constraint with the norm penalty the smoothly clipped absolute deviation penalty and the mimimax concave penalty respectively for implementing these methods we use the sparsereg toolbox developed in for poss we use since it is generally good and the number of iterations is set to be nc as suggested by theorem to evaluate how far these methods are from the optimum we also compute the optimal subset by exhaustive enumeration denoted as opt the data sets are from http and http some binary classification data are used for regression all variables are normalized to have mean and variance table the data sets data set housing ionosphere inst feat data set sonar triazines mushrooms inst feat data set gisette inst feat table the training value of the compared methods on data sets for in each data set denote respectively that poss is significantly than the corresponding method by the with confidence level means that no results were obtained after running several days data set opt housing ionosphere sonar triazines mushrooms gisette poss poss fr foba omp rfe mcp to assess each method on each data set we repeat the following process times the data set is randomly and evenly split into training set and test set sparse regression is built on the training set and evaluated on the test set we report the average training and test values on optimization performance table lists the training for which reveals the optimization quality of the methods note that the results of lasso scad and mcp are very close and we only report that of mcp due to the page limit by the with significance level poss is shown significantly better than all the compared methods on all data sets we plot the performance curves on two data sets for in figure for sonar opt is calculated only for we can observe that poss tightly follows opt and has clear advantage over the rest methods fr foba and omp have close performances while are much better than mcp scad and lasso the bad performance of lasso is consistent with the previous results in we notice that although the norm constraint is tight convex relaxation of the norm constraint and can have good results in sparse recovery tasks the performance of lasso is not as good as poss and greedy methods on most data sets this is due to that unlike assumed in sparse recovery tasks there may not exist sparse structure in the data sets in this case norm constraint can be bad approximation of norm constraint meanwhile norm constraint also shifts the optimization problem making it hard to well optimize the original criterion considering the running time in the number of objective function evaluations opt does exhaustive search thus needs nk nkk time which could be unacceptable for slightly large data set fr foba and omp are approaches thus are efficient and their running time are all in the order of kn poss finds the solutions closest to those of opt taking time although poss is slower by factor of the difference would be small when is small constant since the time is theoretical upper bound for poss being as good as fr we empirically examine how tight this bound is by selecting fr as the baseline we plot the curve of the value over the running time for poss on the two largest data sets gisette and as shown in figure we do not split the training and test set and the curve for poss is the average of independent runs the is in kn the running time of fr we 
can observe that poss takes about only and of the theoretical time to achieve better performance respectively on the two data sets this implies that poss can be more efficient in practice than in theoretical analysis poss fr foba omp on fr on figure test the larger the better on training set rss on sonar fr rss on running time in kn poss figure performance running time of poss on gisette figure training the larger the better poss running time in kn on sonar lasso scad mcp rfe opt on test set figure sparse regression with regularization on sonar rss the smaller the better on generalization performance when testing sparse regression on the test data it has been known that the sparsity alone may be not good complexity measure since it only restricts the number of variables but the range of the variables is unrestricted thus better optimization does not always lead to better generalization performance we also observe this in figure on test is consistent with training in figure however on sonar better training as in figure leads to worse test as in figure which may be due to the small number of instances making it prone to overfitting as suggested in other regularization terms may be necessary we add the norm regularization into the objective function rssz αi xi the optimization is now arg rssz we then test all the compared methods to solve this optimization problem with on sonar as plotted in figure we can observe that poss still does the best optimization on the training rss and by introducing the norm it leads to the best generalization performance in conclusion in this paper we study the problem of subset selection which has many applications ranging from machine learning to signal processing the general goal is to select subset of size from large set of variables such that given criterion is optimized we propose the poss approach that solves the two objectives of the subset selection problem simultaneously optimizing the criterion and reducing the subset size on sparse regression representative of subset selection we theoretically prove that simple poss using constant isolation function can generally achieve the best previous approximation guarantee using time moreover we prove that with proper isolation function it finds an optimal solution for an important subclass exponential decay using time log while other methods may not find an optimal solution we verify the superior performance of poss by experiments which also show that poss can be more efficient than its theoretical time we will further study pareto optimization from the aspects of using potential heuristic operators and utilizing infeasible solutions and try to apply it to more machine learning tasks acknowledgements we want to thank lijun zhang and jianxin wu for their helpful comments this research was supported by program and nsfc references boutsidis mahoney and drineas an improved approximation algorithm for the column subset selection problem in soda pages new york ny das and kempe algorithms for subset selection in linear regression in stoc pages victoria canada das and kempe submodular meets spectral greedy algorithms for subset selection sparse approximation and dictionary selection in icml pages bellevue wa davis mallat and avellaneda adaptive greedy approximations constructive approximation statistical comparisons of classifiers over multiple data sets journal of machine learning research diekhoff statistics for the social and behavioral sciences univariate bivariate multivariate william brown pub donoho elad and 
temlyakov stable recovery of sparse overcomplete representations in the presence of noise ieee transactions on information theory fan and li variable selection via nonconcave penalized likelihood and its oracle properties journal of the american statistical association gilbert muthukrishnan and strauss approximation of functions over redundant dictionaries using coherence in soda pages baltimore md guyon weston barnhill and vapnik gene selection for cancer classification using support vector machines machine learning johnson and wichern applied multivariate statistical analysis pearson edition miller subset selection in regression chapman and edition natarajan sparse approximate solutions to linear systems siam journal on computing qian yu and zhou an analysis on recombination in evolutionary optimization artificial intelligence qian yu and zhou on constrained boolean pareto optimization in ijcai pages buenos aires argentina qian yu and zhou pareto ensemble pruning in aaai pages austin tx tan tsang and wang matching pursuit lasso part sparse recovery over big dictionary ieee transactions on signal processing tibshirani regression shrinkage and selection via the lasso journal of the royal statistical society series methodological tropp greed is good algorithmic results for sparse approximation ieee transactions on information theory tropp gilbert muthukrishnan and strauss improved sparse approximation over quasiincoherent dictionaries in icip pages barcelona spain xiao and zhang homotopy method for the sparse problem siam journal on optimization yu yao and zhou on the approximation ability of evolutionary optimization with application to minimum set cover artificial intelligence yu and zhou on the usefulness of infeasible solutions in evolutionary search theoretical study in ieee cec pages hong kong china zhang nearly unbiased variable selection under minimax concave penalty the annals of statistics zhang on the consistency of feature selection using greedy least squares regression journal of machine learning research zhang adaptive greedy algorithm for learning sparse representations ieee transactions on information theory zhou matlab sparsereg toolbox version available online zhou armagan and dunson path following and empirical bayes model selection for sparse regression zou and hastie regularization and variable selection via the elastic net journal of the royal statistical society series statistical methodology 
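The POSS procedure described in the algorithm above is compact enough to sketch directly. The sketch below assumes a user-supplied criterion f(s), defined on boolean selection vectors (including the empty subset), that is to be maximized; for sparse regression this would be the squared multiple correlation R^2 of the selected variables. It uses the constant isolation function, so all solutions are mutually comparable, and treats solutions with 2k or more selected variables as infeasible, following the description above. This is an illustrative reimplementation under those assumptions, not the authors' code; the paper's analysis suggests a number of iterations T growing roughly like k^2 * n.

```python
import numpy as np

def poss(f, n, k, T, rng=None):
    """Pareto Optimization for Subset Selection (POSS), minimal sketch.

    f : criterion on boolean vectors of length n, to be maximized.
    n : number of variables; k : sparsity budget; T : number of iterations.
    Returns a boolean selection vector with at most k variables.
    """
    rng = np.random.default_rng() if rng is None else rng

    def value(s):
        # solutions with 2k or more selected variables are treated as infeasible
        return -np.inf if s.sum() >= 2 * k else f(s)

    def strictly_worse(v1, size1, v2, size2):
        # (v1, size1) is strictly worse than (v2, size2): no better in either
        # objective and strictly worse in at least one
        return v1 <= v2 and size1 >= size2 and (v1 < v2 or size1 > size2)

    archive = [np.zeros(n, dtype=bool)]            # start from the empty subset
    scores = [value(archive[0])]

    for _ in range(T):
        s = archive[rng.integers(len(archive))]    # pick an archived solution uniformly
        s_new = np.logical_xor(s, rng.random(n) < 1.0 / n)   # flip each bit w.p. 1/n
        v_new, size_new = value(s_new), int(s_new.sum())

        if not any(strictly_worse(v_new, size_new, v, int(z.sum()))
                   for z, v in zip(archive, scores)):
            # drop archived solutions weakly dominated by the new one, then archive it
            kept = [(z, v) for z, v in zip(archive, scores)
                    if not (v <= v_new and int(z.sum()) >= size_new)]
            archive = [z for z, _ in kept] + [s_new]
            scores = [v for _, v in kept] + [v_new]

    # final selection: best criterion value among solutions with at most k variables
    feasible = [(v, z) for z, v in zip(archive, scores) if z.sum() <= k]
    return max(feasible, key=lambda t: t[0])[1]
```

The archive can contain at most one solution per subset size, which is what keeps both the archive and the per-iteration cost small in practice.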
on the accuracy of models jacob maxim michael jordan dan klein computer science division university of california berkeley jda rabinovich jordan klein abstract calculation of the is major computational obstacle in applications of models with large output spaces the problem of fast normalizer computation has therefore attracted significant attention in the theoretical and applied machine learning literature in this paper we analyze recently proposed technique known as which introduces regularization term in training to penalize log normalizers for deviating from zero this makes it possible to use unnormalized model scores as approximate probabilities empirical evidence suggests that is extremely effective but theoretical understanding of why it should work and how generally it can be applied is largely lacking we prove upper bounds on the loss in accuracy due to describe classes of input distributions that easily and construct explicit examples of input distributions our theoretical results make predictions about the difficulty of fitting models to several classes of distributions and we conclude with empirical validation of these predictions introduction models general class that includes conditional random fields crfs and generalized linear models glms offer flexible yet tractable approach modeling conditional probability distributions when the set of possible values is large however the computational cost of computing normalizing constant for each can be summation with many terms integral or an expensive dynamic program the machine translation community has recently described several procedures for training selfnormalized models the goal of is to choose model parameters that simultaneously yield accurate predictions and produce normalizers clustered around unity model scores can then be used as approximate surrogates for probabilities obviating the computation normalizer computation in particular given model of the form with pη eη log eη we seek setting of such that is close enough to zero with high probability under to be ignored authors contributed equally this paper aims to understand the theoretical properties of empirical results have already demonstrated the efficacy of this discrete models with many output classes it appears that normalizer values can be made nearly constant without sacrificing too much predictive accuracy providing dramatic efficiency increases at minimal performance cost the broad applicability of makes it likely to spread to other applications of models including structured prediction with combinatorially many output classes and regression with continuous output spaces but it is not obvious that we should expect such approaches to be successful the number of inputs if finite can be on the order of millions the geometry of the resulting input vectors highly complex and the class of functions associated with different inputs quite rich to find to find nontrivial parameter setting with roughly constant seems challenging enough to require that the corresponding also lead to good classification results seems too much and yet for many input distributions that arise in practice it appears possible to choose to make nearly constant without having to sacrifice classification accuracy our goal is to bridge the gap between theoretical intuition and practical experience previous work bounds the sample complexity of training procedures for restricted class of models but leaves open the question of how interacts with the predictive power of the learned model this paper seeks 
to answer that question we begin by generalizing the model to much more general class of distributions including distributions with continuous support section next we provide what we believe to be the first characterization of the interaction between and model accuracy section this characterization is given from two perspectives bound on the likelihood gap between and unconstrained models conditional distribution provably hard to represent with model in figure we present empirical evidence that these bounds correctly characterize the difficulty of and in the conclusion we survey set of open problems that we believe merit further investigation problem background the immediate motivation for this work is procedure proposed to speed up decoding in machine translation system with language model the language model used is standard neural network with softmax output layer that turns the network predictions into distribution over the vocabulary where each probability is to its output activation it is observed that with sufficiently large vocabulary it becomes prohibitive to obtain probabilities from this model which must be queried millions of times during decoding to fix this the language model is trained with the following objective max yi log en log en where is the response of output in the neural net with weights given an input from lagrangian perspective the extra penalty term simply confines the to the set of empirically normalizing parameters for which all are close in squared error to the origin for suitable choice of it is observed that the trained network is simultaneously accurate enough to produce good translations and close enough to that the raw scores yi can be used in place of without substantial further degradation in quality we seek to understand the observed success of these models in finding accurate normalizing parameter settings while it is possible to derive bounds of the kind we are interested in for general neural networks in this paper we work with simpler linear parameterization that we believe captures the interesting aspects of this problem it is possible to view model as network with softmax output more usefully all of the results presented here apply directly to trained neural nets in which the last layer only is retrained to related work the approach described at the beginning of this section is closely related to an alternative selfnormalization trick described based on estimation nce nce is an alternative to direct optimization of likelihood instead training classifier to distinguish between true samples from the model and noise samples from some other distribution the structure of the training objective makes it possible to replace explicit computation of each with an estimate in traditional nce these values are treated as part of the parameter space and estimated simultaneously with the model parameters there exist guarantees that the normalizer estimates will eventually converge to their true values it is instead possible to fix all of these estimates to one in this case empirical evidence suggests that the resulting model will also exhibit behavior host of other techniques exist for solving the computational problem posed by the many of these involve approximating the associated sum or integral using quadrature herding or monte carlo methods for the special case of discrete finite output spaces an alternative hierarchical to replace the large sum in the normalizer with series of binary decisions the output classes are arranged in binary tree and the 
probability of generating particular output is the product of probabilities along the edges leading to it this reduces the cost of computing the normalizer from to log while this limits the set of distributions that can be learned and still requires time to compute normalizers it appears to work well in practice it can not however be applied to problems with continuous output spaces distributions we begin by providing slightly more formal characterization of general model definition models given space of inputs space of outputs measure on nonnegative function and function rd that is with respect to its second argument we can define model indexed by parameters rd with the form pη eη where log if then eη dµ pη dµ and pη is probability density over we next formalize our notion of model definition models the model pη is with respect to set if for all in this case we say that is and is an example of normalizable set is shown in figure and we provide additional examples below some readers may be more familiar with generalized linear models which also describe exponential family distributions with linear dependence on input the presentation here is strictly more general and has few notational advantages it makes explicit the dependence of on and but not and lets us avoid tedious bookkeeping involving natural and mean parameterizations set for fixed the solutions to with and the set forms smooth manifold bounded on either side by hyperplanes normal to and sets of approximately normalizing parameters for fixed solutions to with and uniform on for given upper bound on normalizer variance the feasible set of parameters is nonconvex and grows as increases figure data distributions and parameter sets example suppose log log xy log then for either log elog log log and is with respect to it is also easy to choose parameters that do not result in distribution and in fact to construct target distribution which can not be example suppose then there is no such that for all and is constant if and only if as previously motivated downstream uses of these models may be robust to small errors resulting from improper normalization so it would be useful to generalize this definition of normalizable distributions to distributions that are only approximately normalizable exact normalizability of the conditional distribution is deterministic either does or does not exist some that violates the constraint in figure for example it suffices to have single off of the indicated surface to make set approximate normalizability by contrast is inherently probabilistic statement involving distribution over inputs note carefully that we are attempting to represent but have no representation of or control over and that approximate normalizability depends on but not informally if some input violates the constraint by large margin but occurs only very infrequently there is no problem instead we are concerned with expected deviation it is also at this stage that the distinction between penalization of the normalizer becomes important the normalizer is necessarily bounded below by zero so overestimates might appear much worse than underestimates while the is unbounded in both directions for most applications we are concerned with log probabilities and ratios for which an expected normalizer close to zero is just as bad as one close to infinity thus the is the natural choice of quantity to penalize definition approximately models the distribution pη is δapproximately normalized with respect to distribution over if in this case we say 
that is and is selfnormalizing the sets of parameters for fixed input distribution and feature function are depicted in figure unlike sets of inputs and approximately sets of parameters may have complex geometry throughout this paper we will assume that vectors of sufficient statistics have bounded norm at most natural parameter vectors have norm at most that is they are ivanovregularized and that vectors of both kinds lie in rd finally we assume that all input vectors have constant particular that for every with corresponding weight the first question we must answer is whether the problem of training models is feasible at is whether there exist any exactly data distributions or at least distributions for small section already gave an example of an exactly normalizable distribution in fact there are large classes of both exactly and approximately normalizable distributions observation given some fixed consider the set sη any distribution supported on sη is normalizable additionally every distribution is characterized by at least one such this definition provides simple geometric characterization of distributions an example solution set is shown in figure more generally if is discrete and consists of repetitions of fixed feature function as in figure then we can write log eηy provided is convex in for each ηy the level sets of as function of form the boundaries of convex sets in particular exactly normalizable sets are always the boundaries of convex regions as in the simple example figure we do not in general expect datasets to be supported on the precise class of selfnormalizable surfaces nevertheless it is very often observed that data of practical interest lie on other manifolds within their embedding feature spaces thus we can ask whether it is sufficient for target distribution to be by one we begin by constructing an appropriate measurement of the quality of this approximation definition closeness an input distribution is to set if inf sup in other words is to if random sample from is no more than distance from in expectation now we can relate the quality of this approximation to the level of achieved generalizing result from we have proposition suppose is to then is recalling that it will occasionally be instructive to consider the special case where is the boolean hypercube and we will explicitly note where this assumption is made otherwise all results apply to general distributions both continuous and discrete proofs for this section may be found in appendix the intuition here is that data distributions that place most of their mass in feature space close to normalizable sets are approximately normalizable on the same scale normalization and model accuracy so far our discussion has concerned the problem of finding conditional distributions that selfnormalize without any concern for how well they actually perform at modeling the data here the relationship between the approximately distribution and the true distribution which we have so far ignored is essential indeed if we are not concerned with making good model it is always trivial to make normalized take and then scale appropriately we ultimately desire both good and good data likelihood and in this section we characterize the tradeoff between maximizing data likelihood and satisfying constraint we achieve this characterization by measuring the likelihood gap between the classical maximum likelihood estimator and the mle subject to constraint specifically given pairs xn yn let log pη yi then define arg max arg max where xi we 
would like to obtain bound on the likelihood gap which we define as the quantity we claim theorem suppose has finite measure then asymptotically as kl pη unif proofs for this section may be found in appendix this result the likelihood at by explicitly constructing scaled version of that satisfies the constraint specifically if is chosen so that normalizers are penalized for distance from log the logarithm of the number of classes in the finite case then any increase in along the span of the data is guaranteed to increase the penalty from here it is possible to choose an such that satisfies the constraint the likelihood at is necessarily less than and can be used to obtain the desired lower bound thus at one extreme distributions close to uniform can be with little loss of likelihood what about the other as far from uniform as possible with suitable assumptions about the form of we can use the same construction of parameter to achieve an alternative characterization for distributions that are close to deterministic proposition suppose that is subset of the boolean hypercube is finite and is the conjunction of each element of with an indicator on the output class suppose additionally that in every input makes unique best is for each there exists unique such that whenever then for constants and this result is obtained by representing the constrained likelihood with taylor expansion about the true mle all terms in the likelihood gap vanish except for the remainder this can be by the times the largest eigenvalue the feature covariance matrix at which in turn is bounded by the favorable rate we obtain for this case indicates that distributions are also an easy class for together with theorem this suggests that hard distributions must have some mixture of uniform and nonuniform predictions for different inputs this is supported by the results in section the next question is whether there is corresponding lower bound that is whether there exist any conditional distributions for which all nearby distributions are provably hard to the existence of direct analog of theorem remains an open problem but we make progress by developing general framework for analyzing normalizer variance one key issue is that while likelihoods are invariant to certain changes in the natural parameters the log normalizers and therefore their variance is far from invariant we therefore focus on equivalence classes of natural parameters as defined below throughout we will assume fixed distribution on the inputs definition equivalence of parameterizations two natural parameter values and are said to be equivalent with respect to an input distribution denoted if pη we can then define the optimal log normalizer variance for the distribution associated with natural parameter value definition optimal variance we define the optimal log normalizer variance of the model associated with natural parameter value by inf varp we now specialize to the case where is finite with and where rkd satisfies xj this is an important special case that arises for example in logistic regression in this setting we can show that despite the fundamental of the model the variance can still be shown to be high under any parameterization of the distribution theorem let and let the input distribution be uniform on there exists an rkd such that for αη experiments the intuition behind the results in the preceding section can be summarized as follows for predictive distributions that are in expectation or results in relatively small likelihood gap for mixtures 
of and distributions may result in large likelihood gap more generally we expect that an increased tolerance for normalizer variance will be associated with decreased likelihood gap in this section we provide experimental confirmation of these predictions we begin by generating set of random sparse feature vectors and an initial weight vector in order to produce sequence of label distributions that smoothly interpolate between and we introduce temperature parameter and for various settings of draw labels from pτ we then fit selfnormalized model to these training pairs in addition to the synthetic data we compare our results to empirical data from language model figure plots the tradeoff between the likelihood gap and the error in the normalizer under various distributions characterized by their kl from uniform here the tradeoff between and model accuracy can be the normalization constraint is relaxed the likelihood gap decreases lm ekl pη normalization likelihood tradeoff as the normalization constraint is relaxed the likelihood gap decreases lines marked are from synthetic data the line marked lm is from likelihood gap as function of expected divergence from the uniform distribution as predicted by theory the likelihood gap increases then decreases as predictive distributions become more peaked figure experimental results figure shows how the likelihood gap varies as function of the quantity ekl pη as predicted it can be seen that both extremes of this quantity result in small likelihood gaps while intermediate values result in large likelihood gaps conclusions motivated by the empirical success of parameter estimation procedures we have attempted to establish theoretical basis for the understanding of such procedures we have characterized both distributions by constructing provably easy examples and training procedures by bounding the loss of likelihood associated with while we have addressed many of the important theoretical questions around selfnormalization this study of the problem is by no means complete we hope this family of problems will attract further study in the larger machine learning community toward that end we provide the following list of open questions how else can the approximately distributions be characterized the class of approximately normalizable distributions we have described is unlikely to correspond perfectly to data we expect that proposition can be generalized to other parametric classes and relaxed to accommodate spectral or sparsity conditions are the upper bounds in theorem or proposition tight our constructions involve relating the normalization constraint to the norm of but in general some parameters can have very large norm and still give rise to distributions do corresponding lower bounds exist while it is easy to construct of exactly selfnormalizable distributions which suffer no loss of likelihood we have empirical evidence that hard distributions also exist it would be useful to the loss of likelihood in terms of some simple property of the target distribution is the hard distribution in theorem stable this is related to the previous question the existence of distributions is less worrisome if such distributions are fairly rare if the lower bound falls off quickly as the given construction is perturbed then the associated distribution may still be approximately with good rate we have already seen that new theoretical insights in this domain can translate directly into practical applications thus in addition to their inherent theoretical interest 
answers to each of these questions might be applied directly to the training of approximately models in practice we expect that will find increasingly many applications and we hope the results in this paper provide first step toward complete theoretical and empirical understanding of in models acknowledgments the authors would like to thank robert nishihara for useful discussions ja and mr are supported by nsf graduate fellowships and mr is additionally supported by the fannie and john hertz foundation fellowship references lafferty mccallum pereira conditional random fields probabilistic models for segmenting and labeling sequence data pp mccullagh nelder generalized linear models chapman and hall devlin zbib huang lamar schwartz makhoul fast and robust neural network joint models for statistical machine translation proceedings of the annual meeting of the association for computational linguistics vaswani zhao fossum chiang decoding with neural language models improves translation proceedings of the conference on empirical methods in natural language processing andreas klein when and why are models proceedings of the annual meeting of the north american chapter of the association for computational linguistics bartlett ieee transactions on information theory anthony bartlett neural network learning theoretical foundations cambridge university press gutmann estimation new estimation principle for unnormalized statistical models proceedings of the international conference on artificial intelligence and statistics pp hagan journal of statistical planning and inference chen welling smola proceedings of the conference on uncertainty in artificial intelligence doucet de freitas gordon an introduction to sequential monte carlo methods springer morin bengio proceedings of the international conference on artificial intelligence and statistics yang allen liu ravikumar graphical models via generalized linear models advances in neural information processing systems pp 
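To make the training objective discussed above concrete, here is a minimal NumPy sketch of one natural reading of the self-normalizing objective for a plain multiclass log-linear model: the usual log-likelihood with a squared log-normalizer penalty of strength alpha. The weight layout, the function name, and the plain-softmax parameterization are assumptions made for illustration; the resulting loss/gradient pair can be handed to any batch or stochastic gradient optimizer.

```python
import numpy as np

def self_normalized_loss_and_grad(W, X, y, alpha):
    """Penalized negative log-likelihood for a self-normalized log-linear model.

    Per-example objective (to be maximized):
        score[y_i] - log Z_i - alpha * (log Z_i)^2
    i.e. the usual log-likelihood plus a penalty pushing log Z toward zero, so that
    raw scores can later be used as approximate log-probabilities.

    W : (num_classes, num_features), X : (n, num_features), y : (n,) integer labels.
    """
    scores = X @ W.T                                   # (n, K) unnormalized log scores
    log_Z = np.logaddexp.reduce(scores, axis=1)        # (n,) log normalizers
    probs = np.exp(scores - log_Z[:, None])            # softmax probabilities
    n = X.shape[0]

    log_lik = scores[np.arange(n), y] - log_Z
    loss = -(log_lik - alpha * log_Z ** 2).mean()

    # d/d score_j of [score_y - log Z - alpha (log Z)^2]
    #   = 1{j = y} - (1 + 2 * alpha * log Z) * p_j
    d_scores = -probs * (1.0 + 2.0 * alpha * log_Z)[:, None]
    d_scores[np.arange(n), y] += 1.0
    grad = -(d_scores.T @ X) / n                       # gradient of the loss w.r.t. W
    return loss, grad
```

A single gradient step is then W -= lr * grad; alpha trades off likelihood against how tightly the log normalizers are pinned to zero, mirroring the penalty term in the objective above.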
regret lower bound and optimal algorithm in finite stochastic partial monitoring junpei komiyama the university of tokyo junpei junya honda the university of tokyo honda hiroshi nakagawa the university of tokyo nakagawa abstract partial monitoring is general model for sequential learning with limited feedback formalized as game between two players in this game the learner chooses an action and at the same time the opponent chooses an outcome then the learner suffers loss and receives feedback signal the goal of the learner is to minimize the total loss in this paper we study partial monitoring with finite actions and stochastic outcomes we derive logarithmic regret lower bound that defines the hardness of the problem inspired by the dmed algorithm honda and takemura for the bandit problem we propose an algorithm that minimizes the regret significantly outperforms algorithms in numerical experiments to show the optimality of with respect to the regret bound we slightly modify the algorithm by introducing hinge function then we derive an asymptotically optimal regret upper bound of that matches the lower bound introduction partial monitoring is general framework for sequential decision making problems with imperfect feedback many classes of problems including prediction with expert advice the bandit problem dynamic pricing the dark pool problem label efficient prediction and linear and convex optimization with full or bandit feedback can be modeled as an instance of partial monitoring partial monitoring is formalized as repeated game played by two players called learner and an opponent at each round the learner chooses an action and at the same time the opponent chooses an outcome then the learner observes feedback signal from given set of symbols and suffers some loss both of which are deterministic functions of the selected action and outcome the goal of the learner is to find the optimal action that minimizes cumulative loss alternatively we can define the regret as the difference between the cumulative losses of the learner and the single optimal action and minimization of the loss is equivalent to minimization of the regret learner with small regret balances exploration acquisition of information about the strategy of the opponent and exploitation utilization of information the rate of regret indicates how fast the learner adapts to the problem linear regret indicates the inability of the learner to find the optimal action whereas sublinear regret indicates that the learner can approach the optimal action given sufficiently large time steps the study of partial monitoring is classified into two settings with respect to the assumption on the outcomes on one hand in the stochastic setting the opponent chooses an outcome distribution before the game starts and an outcome at each round is an sample from the distribution on the other hand in the adversarial setting the opponent chooses the outcomes to maximize the regret of the learner in this paper we study the former setting related work the paper by piccolboni and schindelhauer is one of the first to study the regret of the finite partial monitoring problem they proposed the algorithm which attains minimax regret on some problems this bound was later improved by et al to who also showed an instance in which the bound is optimal since then most literature on partial monitoring has dealt with the minimax regret which is the regret over all possible opponent strategies et al classified the partial monitoring problems into four categories in 
terms of the minimax regret trivial problem with zero regret an easy problem with hard problem with regret and hopeless problem with regret this shows that the class of the partial monitoring problems is not limited to the bandit sort but also includes larger classes of problems such as dynamic pricing since then several algorithms with regret bound for easy problems have been proposed among them the partial monitoring bpm algorithm is in the sense of empirical performance and minimax regret we focus on the regret that depends on the strategy of the opponent while the minimax regret in partial monitoring has been extensively studied little has been known on regret in partial monitoring to the authors knowledge the only paper focusing on the regret in finite discrete partial monitoring is the one by et al which derived log regret for easy problems in contrast to this situation much more interest in the regret has been shown in the field of bandit problems upper confidence bound ucb the most algorithm for the bandits has regret bound and algorithms that minimize the regret has been shown to perform better than ones that minimize the minimax regret moss even in instances in which the distributions are hard to distinguish scenario in garivier et al therefore in the field of partial monitoring we can expect that an algorithm that minimizes the regret would perform better than the existing ones contribution the contributions of this paper lie in the following three aspects first we derive the regret lower bound in some special classes of partial monitoring bandits an log regret lower bound is known to be achievable in this paper we further extend this lower bound to obtain regret lower bound for general partial monitoring problems second we propose an algorithm called partial monitoring dmed we also introduce slightly modified version of this algorithm and derive its regret bound is the first algorithm with logarithmic regret bound for hard problems moreover for both easy and hard problems it is the first algorithm with the optimal constant factor on the leading logarithmic term third performances of and existing algorithms are compared in numerical experiments here the partial monitoring problems consisted of three specific instances of varying difficulty in all instances significantly outperformed the existing methods when number of rounds is large the regret of on these problems quickly approached the theoretical lower bound problem setup this paper studies the finite stochastic partial monitoring problem with actions outcomes and symbols an instance of the partial monitoring game is defined by loss matrix li rn and feedback matrix hi where at the beginning the learner is informed of and at each round learner selects an action and at the same time an opponent selects an outcome the learner note that ignores polylog factor suffers loss li which can not observe the only information the learner receives is the signal hi we consider stochastic opponent whose strategy for selecting outcomes is governed by the opponent strategy pm where pm is set of probability distributions over an outcome the outcome of each round is an sample from the goal of the learner is to minimize the cumulative loss over rounds let the optimal action be the one that minimizes the loss in expectation that is arg where li is the row of assume that is unique without loss of generality we can assume that let li and ni be the number of rounds before the in which action is selected the performance of the algorithm is measured by 
the pseudo regret regret ni figure cell decomposition of partial monitoring which is the difference between the expected loss of the learner and instance with the optimal action it is easy to see that minimizing the loss is equivalent to minimizing the regret the expectation of the regret measures the performance of an algorithm that the learner uses for each action let ci be the set of opponent strategies for which action is optimal ci pm li lj we call ci the optimality cell of action each optimality cell is convex closed polytope furthermore we call the set of optimality cells cn the cell decomposition as shown in figure let cic pm ci be the set of strategies with which action is not optimal the signal matrix si of action is defined as si hi where if is true and otherwise the signal matrix defined here is slightly different from the one in the previous papers et al in which the number of rows of si is the number of the different symbols in the row of the advantage in using the definition here is that si ra is probability distribution over symbols that the algorithm observes when it selects an action examples of signal matrices are shown in section an instance of partial monitoring is globally observable if for all pairs of actions li lj in this paper we exclusively deal with globally observable instances in view of the minimax regret this includes trivial easy and hard problems regret lower bound good algorithm should work well against any opponent strategy we extend this idea by introducing the notion of strong consistency partial monitoring algorithm is strongly consistent if it satisfies regret for any and pm given and in the context of the bandit problem lai and robbins derived the regret lower bound of strongly consistent algorithm an algorithm must select each arm until its number of draws ni satisfies log ni θi where θi is the kl divergence between the two distributions from which the rewards of action and the optimal action are generated analogously in the partial monitoring problem we can define the minimum number of observations lemma for sufficiently large strongly consistent algorithm satisfies ni log log where si and log is the kl divergence between two discrete distributions in which we define log lemma can be interpreted as follows for each round consistency requires the algorithm to make sure that the possible risk that action is optimal is smaller than large deviation principle states that the probability that an opponent with strategy behaves like is roughly exp pi therefore wec need to continue exploration of the actions until ni pi log holds for any to reduce the risk to exp log the proof of lemma is in appendix in the supplementary material based on the technique used in lai and robbins the proof considers modified game in which another action is optimal the difficulty in proving the lower bound in partial monitoring lies in that the feedback structure can be quite complex for example to confirm the superiority of action over one might need to use the feedback from action still we can derive the lower bound by utilizing the consistency of the algorithm in the original and modified games we next derive lower bound on the based on lemma note that the expectation of the regret can be expressed as regret ni li let rj pi ri ri pi inf cj pj where cl denotes closure moreover let pi inf ri pi ri li lj the optimal solution of which is rj pi ri rj pi ri li lj cj pi the value log is the possible minimum regret for observations such that the minimum divergence of from any is 
larger than log using lemma yields the following regret lower bound theorem the regret of strongly consistent algorithm is lower bounded as regret log log from this theorem we can naturally measure the harshness of the instance by whereas the past studies vanchinathan et al ambiguously define the harshness as the closeness to the boundary of the cells furthermore we show in lemma in the appendix that the regret bound has at most quadratic dependence on which is defined in appendix as the closeness of to the boundary of the optimal cell algorithm in this section we describe the partial monitoring deterministic minimum empirical divergence pmdmed algorithm which is inspired by dmed for solving the bandit problem let be the empirical distribution of the symbols under the selection of action namely the element of is hi we sometimes omit from when it is clear from the context let the empirical divergence of pm be ni the exponential of which can be considered as likelihood that is the opponent strategy the main routine of is in algorithm at each loop the actions in the current list zc are selected once the list for the actions in the next loop zn is determined by the subroutine in algorithm the subroutine checks whether the empirical divergence of each point is larger than log or not eq if it is large enough it exploits the current information by selecting the optimal action based on the estimation that minimizes the empirical divergence otherwise it selects the actions with the number of observations below the minimum requirement for making the empirical divergence of each suboptimal point larger than log unlike the bandit problem in which reward is associated with an action in the partial monitoring problem actions outcomes and feedback signals can be intricately related therefore we need to solve optimization to run later in section we discuss practical implementation of the optimization algorithm main routine of and algorithm subroutine for adding actions to zn without duplication parameter initialization select each action once compute an arbitrary such that zc zr zn while do arg min ni for zc in an arbitrarily fixed order do select and receive feedback and let arg mini zr zr if zr then put into zn add actions to zn in accordance with if there are actions zr such that algorithm ni log algorithm then put them into zn end for if zc zr zn zn ni log end while then compute some and put all actions such that zr and ni log into zn necessity of log exploration tries to observe each action to some extent eq which is necessary for the following reason consider game characterized by and the optimal action here is action which does not yield any useful information by using action one receives three kinds of symbols from which one can estimate and where is the component of from this an algorithm can find that is not very small and thus the expected loss of actions and is larger than that of action since the feedback of actions and are the same one may also use action in the same manner however the loss per observation is and for actions and respectively and thus it is better to use action this difference comes from the fact that since an algorithm does not know beforehand it needs to observe action the only source for distinguishing from yet an optimal algorithm can not select it more than log times because it affects the log factor in the regret in fact log observations of action with some are sufficient to poly be convinced that for this reason with probability selects each action log times experiment 
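Before turning to the experiments, the following is a minimal sketch (in Python with numpy, on a hypothetical toy instance) of the basic quantities manipulated above: the signal distributions S_i p, the KL divergence with the 0 log 0 = 0 convention, the optimality cell an opponent strategy falls into, and the empirical divergence that the algorithm uses to score candidate strategies. The interfaces and toy matrices are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kl(p, q):
    # KL divergence between discrete distributions, with the 0 log 0 = 0 convention
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / np.clip(q[m], 1e-300, None))))

def signal_matrix(H_row, n_symbols):
    # S_i[s, j] = 1 iff selecting action i under outcome j yields symbol s,
    # so S_i @ p is the distribution of observed symbols under opponent strategy p
    S = np.zeros((n_symbols, len(H_row)))
    for j, s in enumerate(H_row):
        S[s, j] = 1.0
    return S

def optimal_action(L, p):
    # the action whose optimality cell contains p, i.e. the minimiser of L_i . p
    return int(np.argmin(L @ p))

def empirical_divergence(q, S, nu_hat, counts):
    # D(q) = sum_i N_i * KL(nu_hat_i || S_i q): how implausible strategy q is, given the
    # empirical symbol distributions nu_hat_i and the observation counts N_i of each action
    return sum(N * kl(nu, S_i @ q) for S_i, nu, N in zip(S, nu_hat, counts))

# hypothetical toy instance with 2 actions, 2 outcomes and symbols {0, 1}
L = np.array([[1.0, 0.0],
              [0.0, 1.0]])
H = np.array([[0, 1],
              [0, 0]])                      # the second action yields no information
S = [signal_matrix(row, n_symbols=2) for row in H]
p = np.array([0.3, 0.7])
print(optimal_action(L, p))                 # -> 0, since L_0 . p = 0.3 < L_1 . p = 0.7
```

A strategy q with large empirical divergence is unlikely given the observations, which is exactly the criterion the subroutine uses when deciding whether to exploit or to keep exploring.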
following et al we compared the performances of algorithms in three different games the game section game and dynamic pricing experiments on the armed bandit game was also done and the result is shown in appendix the game which is classified as easy in terms of the minimax regret is characterized by and the signal matrices of this game are and random cbp lb regret round random cbp lb round round dynamic pricing benign regret harsh random cbp lb random cbp lb round round dynamic pricing intermediate round intermediate regret regret benign random cbp lb random cbp lb regret regret regret dynamic pricing harsh random cbp lb round figure semilog plots of algorithms the regrets are averaged over runs lb is the asymptotic regret lower bound of theorem dynamic pricing which is classified as hard in terms of the minimax regret is game that models repeated auction between seller learner and buyer opponent at each round the seller sets price for product and at the same time the buyer secretly sets maximum price he is willing to pay the signal is buy or and the seller loss is either given constant or the difference between the buyer and the seller prices buy the loss and feedback matrices are and where signals and correspond to and buy the signal matrix of action is si following et al we set and in our experiments with the game and dynamic pricing we tested three settings regarding the harshness of the opponent at the beginning of simulation we sampled points uniformly at random from pm then sorted them by we chose the top and harshest ones as the opponent strategy in the harsh intermediate and benign settings respectively we compared random cbp with and with random is naive algorithm that selects an action uniformly random requires matrix such that and thus one can not apply it to the game cbp is an algorithm of logarithmic regret for easy games the parameters and of cbp were set in accordance with theorem in their paper is bayesian algorithm with regret for easy games and is heuristic of performance the priors of two bpms were set to be uninformative to avoid misspecification as recommended in their paper algorithm subroutine for adding actions to zn without duplication parameters for log log for compute arbitrary which satisfies arg min and let arg mini if zr then put into zn if ni ni or there exists an action such that ni then put all actions zr into zn if there are actions such that ni log then put the actions not in zr into zn if ni log ni then compute some ni and put all actions such that zr and ni log into zn if such is infeasible then put all action zr into zn the computation of in and the evaluation of the condition in involve convex optimizations which were done with ipopt moreover obtaining in is classified as linear programming lsip problem linear programming lp with finitely many variables and infinitely many constraints following the optimization of we resorted to finite sample approximation and used the gurobi lp solver in computing at each round we sampled points from pm and relaxed the constraints on the samples to speed up the computation we skipped these optimizations in most rounds with large and used the result of the last computation the computation of the coefficient of the regret lower bound theorem is also an lsip which was approximated by sample points from the experimental results are shown in figure in the game and the other two games with an easy or intermediate opponent outperforms the other algorithms when the number of rounds is large in particular in the dynamic 
pricing game with an intermediate opponent the regret of at is ten times smaller than those of the other algorithms even in the harsh setting in which the minimax regret matters has some advantage over all algorithms except for with sufficiently large the slope of an optimal algorithm should converge to lb in all games and settings the slope of converges to lb which is empirical evidence of the optimality of theoretical analysis section shows that the empirical performance of is very close to the regret lower bound in theorem although the authors conjecture that is optimal it is hard to analyze the technically hardest part arises from the case in which the divergence of each action is small but not yet fully converged to circumvent this difficulty we can introduce discount factor let rj pi δi ri inf ri pi δi cj pj where max note that rj pi δi in is natural generalization of rj pi in section in the sense that rj pi rj pi event ni log δi means that the number of observations ni is enough to ensure that the δi empirical divergence of each is larger than log analogous to rj pi δi we define pi δi inf ri pi δi ri lj li and its optimal solution by rj pi δi ri rj pi δi ri lj li cj pi δi we also define ci pm the optimal region of action with margin shares the main routine of algorithm with and lists the next actions by algorithm unlike it discounts ni from the empirical divergence moreover ii when is close to the cell boundary it encourages more exploration to identify the cell it belongs to by eq theorem assume that the following regularity conditions hold for pi δi is unique at pi si δi moreover for sδ it holds that cl int sδ cl cl sδ for all in some neighborhood of where cl and int denote the closure and the interior respectively then regret log log we prove this theorem in appendix recall that δi is the set of optimal solutions of an lsip in this problem kkt conditions and the duality theorem apply as in the case of finite constraints thus we can check whether condition holds or not for each see ito et al and references therein condition holds in most cases and an example of an exceptional case is shown in appendix theorem states that has regret upper bound that matches the lower bound of theorem corollary optimality in the bandit problem in the bernoulli bandit problem the regularity conditions in theorem always hold moreover the coefficient of the leading logarithmic term in the regret bound of the partial monitoring problem is equal to the bound given in lai and robbins namely µi where log log is the between bernoulli distributions corollary which is proven in appendix states that attains the optimal regret of the bandit if we run it on an bandit game represented as partial monitoring asymptotic analysis it is theorem where we lose the property this theorem shows the continuity of the optimal solution set pi δi of pj which does not mention how close pi δi is to if max maxi maxi δi for given to obtain an explicit bound we need sensitivity analysis the theory of the robustness of the optimal value and the solution for small deviations of its parameters see fiacco in particular the optimal solution of partial monitoring involves an infinite number of constraints which makes the analysis quite hard for this reason we will not perform analysis note that the bandit problem is special instance in which we can avoid solving the above optimization and optimal bound is known necessity of the discount factor we are not sure whether discount factor in is necessary or not we also empirically tested 
although it is better than the other algorithms in many settings such as dynamic pricing with an intermediate opponent it is far worse than we found that our implementation which uses the ipopt nonlinear optimization solver was sometimes inaccurate at optimizing there were some cases in which the true satisfies ni while the solution we obtained had hinge values in this case the algorithm lists all actions from which degrades performance determining whether the discount factor is essential or not is our future work acknowledgements the authors gratefully acknowledge the advice of kentaro minami and sincerely thank the anonymous reviewers for their useful comments this work was supported in part by jsps kakenhi grant number and references nick littlestone and manfred warmuth the weighted majority algorithm inf february lai and herbert robbins asymptotically efficient adaptive allocation rules advances in applied mathematics robert kleinberg and frank thomson leighton the value of knowing demand curve bounds on regret for online auctions in focs pages alekh agarwal peter bartlett and max dama optimal allocation strategies for the dark pool problem in aistats pages lugosi and gilles stoltz minimizing regret with label efficient prediction ieee transactions on information theory martin zinkevich online convex programming and generalized infinitesimal gradient ascent in icml pages varsha dani thomas hayes and sham kakade stochastic linear optimization under bandit feedback in colt pages antonio piccolboni and christian schindelhauer discrete prediction games with arbitrary feedback and loss in colt pages lugosi and gilles stoltz regret minimization under partial monitoring math oper and csaba minimax regret of finite games in stochastic environments in colt pages navid zolghadr and csaba an adaptive algorithm for finite stochastic partial monitoring in icml algorithm for finite games against adversarial opponents in colt pages hastagiri vanchinathan and andreas krause efficient partial monitoring with prior information in nips pages peter auer and paul fischer analysis of the multiarmed bandit problem machine learning garivier and olivier the algorithm for bounded stochastic bandits and beyond in colt pages amir dembo and ofer zeitouni large deviations techniques and applications applications of mathematics springer new york berlin heidelberg junya honda and akimichi takemura an asymptotically optimal bandit algorithm for bounded support models in colt pages andreas and carl laird interior point optimizer ipopt gurobi optimization gurobi optimizer ito liu and teo dual parametrization method for convex programming annals of operations research anthony fiacco introduction to sensitivity and stability analysis in nonlinear programming academic press new york 
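As an implementation-oriented addendum to the partial monitoring paper above: its experiments section describes approximating the linear semi-infinite program by sampling points from P_M, relaxing the constraints to those samples, and solving the resulting LP (there with Gurobi). The sketch below illustrates that finite-sample relaxation using numpy and scipy's linprog instead. The constraint set written here is one reading of the lower-bound construction, and the uniform simplex sampler, sample size, and helper names are assumptions for illustration rather than the paper's exact program.

```python
import numpy as np
from scipy.optimize import linprog

def kl(p, q):
    # KL divergence between discrete distributions, with the 0 log 0 = 0 convention
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / np.clip(q[m], 1e-300, None))))

def sample_simplex(rng, dim, n):
    # uniform samples from the probability simplex, used as a finite relaxation of the constraints
    x = rng.exponential(size=(n, dim))
    return x / x.sum(axis=1, keepdims=True)

def lower_bound_coefficient(L, S, p_star, n_samples=2000, seed=0):
    """Finite-sample LP approximation of a lower-bound coefficient of the form

        minimise   sum_i r_i * (L_i - L_opt) . p*
        subject to sum_i r_i * KL(S_i p* || S_i q) >= 1
                   for every sampled q whose optimal action differs from that of p*,
                   r_i >= 0.
    """
    n_actions = L.shape[0]
    opt = int(np.argmin(L @ p_star))
    c = (L - L[opt]) @ p_star                       # per-action expected excess loss
    rng = np.random.default_rng(seed)
    rows = []
    for q in sample_simplex(rng, L.shape[1], n_samples):
        if int(np.argmin(L @ q)) != opt:            # q lies in some suboptimal cell
            rows.append([kl(S[i] @ p_star, S[i] @ q) for i in range(n_actions)])
    res = linprog(c, A_ub=-np.array(rows), b_ub=-np.ones(len(rows)),
                  bounds=[(0, None)] * n_actions, method="highs")
    return res.fun, res.x
```

Increasing the number of sampled points tightens the relaxation at the cost of a larger LP, which matches the paper's remark that the optimizations are skipped in most rounds once the estimates have stabilised.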
is approval voting optimal given approval votes nisarg shah computer science department carnegie mellon university nkshah ariel procaccia computer science department carnegie mellon university arielpro abstract some crowdsourcing platforms ask workers to express their opinions by approving set of good alternatives it seems that the only reasonable way to aggregate these votes is the approval voting rule which simply counts the number of times each alternative was approved we challenge this assertion by proposing probabilistic framework of noisy voting and asking whether approval voting yields an alternative that is most likely to be the best alternative given votes while the answer is generally positive our theoretical and empirical results call attention to situations where approval voting is suboptimal introduction it is surely no surprise to the reader that modern machine learning algorithms thrive on large amounts of data preferably labeled online labor markets such as amazon mechanical turk have become popular way to obtain labeled data as they harness the power of large number of human workers and offer significantly lower costs compared to expert opinions but this data may require compromising quality the workers are often unqualified or unwilling to make an effort leading to high level of noise in their submitted labels to overcome this issue it is common to hire multiple workers for the same task and aggregate their noisy opinions to find more accurate labels for example turkit is toolkit for creating and managing crowdsourcing tasks on mechanical turk for our purposes its most important aspect is that it implements plurality voting among available alternatives possible labels workers report the best alternative in their opinion and the alternative that receives the most votes is selected more generally workers may be asked to report the best alternatives in their opinion such vote is known as vote this has an advantage over plurality in noisy situations where worker may not be able to pinpoint the best alternative accurately but can recognize that it is among the top alternatives at the same time votes even for are much easier to elicit than say rankings of the alternatives not to mention full utility functions for example eterna citizen science game whose goal is to design rna molecules that fold into stable structures uses voting on submitted designs that is each player approves up to favorite designs the designs that received the largest number of approval votes are selected for synthesis in the lab so the elicitation of votes is common practice and has significant advantages and it may seem that the only reasonable way to aggregate these votes once collected is via the approval voting rule that is tally the number of approvals for each alternative and select the most approved but is it in other words do the votes contain useful information that can lead to is also used for picking winners various cities in the us such as san francisco chicago and new york use it in their participatory budgeting process there is subtle distinction which we will not belabor between voting which is the focus of this paper and approval voting which allows voters to approve as many alternatives as they wish the latter significantly better outcomes and is ignored by approval voting or is approval voting an almost optimal method for aggregating votes our approach we study the foregoing questions within the maximum likelihood estimation mle framework of social choice theory which posits the existence 
of an underlying ground truth that provides an objective comparison of the alternatives from this viewpoint the votes are noisy estimates of the ground truth the optimal rule then selects the alternative that is most likely to be the best alternative given the votes this framework has recently received attention from the machine learning community in part due to its applications to crowdsourcing domains where indeed there is ground truth and individual votes are objective in more detail in our model there exists ground truth ranking over the alternatives and each voter holds an opinion which is another ranking that is noisy estimate of the ground truth ranking the opinions are drawn from the popular mallows model which is parametrized by the ground truth ranking noise parameter and distance metric over the space of rankings we use five distance metrics the kendall tau kt distance the spearman footrule distance the maximum displacement distance the cayley distance and the hamming distance when required to submit vote voter simply approves the top alternatives in his opinion given the votes an alternative is the maximum likelihood estimate mle for the best alternative if the votes are most likely generated by ranking that puts first we can now reformulate our question in slightly more technical terms is approval voting almost maximum likelihood estimator for the best alternative given votes drawn from the mallows model how does the answer depend on the noise parameter and the distance metric our results our first result theorem shows that under the mallows model the set of winners according to approval voting coincides with the set of mle best alternatives under the kendall tau distance but under the other four distances there may exist approval winners that are not mle best alternatives our next result theorem confirms the intuition that the suboptimality of approval voting stems from the information that is being discarded when only single alternative is approved or disapproved in each vote approval voting which now utilizes all the information that can be gleaned from the anonymous votes is optimal under mild conditions going back to the general case of votes we show theorem that even under the four distances for which approval voting is suboptimal weaker statement holds in cases with very high or very low noise every mle best alternative is an approval winner but some approval winners may not be mle best alternatives and our experiments using real data show that the accuracy of approval voting is usually quite close to that of the mle in pinpointing the best alternative we conclude that approval voting is good way of aggregating votes in most situations but our work demonstrates that perhaps surprisingly approval voting may be suboptimal and in situations where high degree of accuracy is required exact computation of the mle best alternative is an option worth considering we discuss our conclusions in more detail in section model let denote the set of alternatives by and let we use to denote the set of rankings total orders of the alternatives in for ranking let denote the alternative occupying position in and let denote the rank position of alternative in with slight abuse of notation let we use to denote the ranking obtained by swapping the positions of alternatives and in we assume that there exists an unknown true ranking of the alternatives the ground truth denoted we also make the standard assumption of uniform prior over the true ranking framework of approval voting has been 
studied extensively both from the axiomatic point of view and the point of view however even under this framework it is standard assumption that votes are tallied by counting the number of times each alternative is approved which is why we simply refer to the aggregation rule under consideration as approval voting let denote the set of voters each voter has an opinion denoted πi which is noisy estimate of the true ranking the collection of opinions the opinion profile is denoted fix vote is collection of alternatives approved by voter when asked to submit vote voter simply submits the vote vi πi which is the set of alternatives at the top positions in his opinion the collection of all votes is called the vote profile and denoted vi for ranking and vote we say that is generated from denoted or when the value of is clear from the context if more generally for an opinion profile and vote profile we say or if πi vi for every let ak ak denote the set of all subsets of of size voting rule operating on votes is function ak that returns winning alternative given the in particular let us define the approval score of an alternative denoted sc pp as the number of voters that approve then approval voting simply chooses an alternative with the greatest approval score note that we do not break ties instead we talk about the set of approval winners following the standard social choice literature we model the opinion of each voter as being drawn from an underlying noise model noise model describes the probability of drawing an opinion given the true ranking denoted pr we say that noise model is neutral if the labels of the alternatives do not matter renaming alternatives in the true ranking and in the opinion in the same fashion keeps pr intact popular noise model is the mallows model under which pr ϕd here is distance metric over the space of rankings parameter governs the noise level implies that the true ranking is generated with probability and implies the uniform distribution zϕm is the normalization constant which is independent of the true ranking given that distance is neutral renaming alternatives in the same fashion in two rankings does not change the distance between them below we review five popular distances used in the social choice literature they are all neutral the kendall tau kt distance denoted dkt measures the number of pairs of alternatives over which two rankings disagree equivalently it is the number of swaps required by bubble sort to convert one ranking into another the spearman footrule fr distance denoted dfr measures the total displacement absolute difference between positions of all alternatives in two rankings the maximum displacement md distance denoted dmd measures the maximum of the displacements of all alternatives between two rankings the cayley cy distance denoted dcy measures the minimum number of swaps not necessarily of adjacent alternatives required to convert one ranking into another the hamming hm distance denoted dhm measures the number of positions in which two rankings place different alternatives since opinions the probability of approfile given the true ranking is qnare drawn independently pr pr πi where πi once we fix the noise model for fixed we can derive the probability of observing given vote pr pr then the probability of drawing given vote profile is pr pr vi alternatively this can also be expressed as pr pr hereinafter we omit the domains for and for when they are clear from the context finally given the vote profile the likelihood of an alternative 
beingpthe best alternative in the true ranking is proportional to via bayes rule pr pr using the two expressions derived earlier for pr and ignoring the normalization constant zϕm from the probabilities we define the likelihood function of given votes as πi πi πi the maximum likelihood estimate mle for the best alternative is given by arg again we do not break ties we study the set of mle best alternatives technically this is social choice function social welfare function returns ranking of the alternatives optimal voting rules at first glance it seems natural to use approval voting that is returning the alternative that is approved by the largest number of voters given votes however consider the following example with alternatives and voters providing votes notice that alternatives and receive approvals each while alternative receives only single approval approval voting may return any alternative other than alternative but is that always optimal in particular while alternatives and are symmetric alternative is qualitatively different due to different alternatives being approved along with this indicates that under certain conditions it is possible that not all three alternatives are mle for the best alternative our first result shows that this is indeed the case under three of the distance functions listed above and similar example works for fourth however surprisingly under the kendall tau distance the mle best alternatives are exactly the approval winners and hence are computable which stands in sharp contrast to the of computing them given rankings theorem the following statements hold for aggregating votes using approval voting under the mallows model with fixed distance dmd dcy dhm dfr there exist vote profile with at most six votes over at most five alternatives and choice for the mallows parameter such that not all approval winners are mle best alternatives under the mallows model with the distance dkt the set of mle best alternatives coincides with the set of approval winners for all vote profiles and all values of the mallows parameter proof for the mallows model with dmd dcy dhm and any the profile from equation is counterexample alternatives and are mle best alternatives but is not for the mallows model with dfr we could not find counter example with alternatives simulations generated the following counterexample with alternatives that works for any and here alternatives and have the highest approval score of however alternative has strictly lower likelihood of being the best alternative than alternative and hence is not an mle best alternative the calculation verifying these counterexamples is presented in the online appendix specifically appendix in contrast for the kendall tau distance we show that all approval winners are mle best alternatives and we begin by simplifying the likelihood function from equation for the special case of the mallows model with the kendall tau distance in this case it is well known that qm the normalization constant satisfies zϕm tϕj where tϕj ϕi consider ranking πi such that πi vi we can decompose dkt πi into three types of pairwise mismatches πi the mismatches over pairs where vi and vi or ii πi the mismatches over pairs where vi and iii πi the mismatches over pairs where vi note that every ranking πi that satisfies πi vi has identical mismatches of type let us denote the number of such mismatches by dkt vi also notice that πi dkt πi where denotes the ranking of alternatives in dictated by similarly πi dkt πi now in the expression for the 
likelihood function ϕdkt vi πi πi πi πi dkt vi dkt vi ϕdkt vi dkt zϕk ϕdkt the second equality follows because every ranking πi that satisfies πi can be generated by picking rankings vi and vi and concatenating them the third equality follows from thep definition of the normalization constant in the mallows model finally we denote dkt dkt vi it follows that maximizing amounts to maximizing note that dkt counts the number of times alternative is approved while alternative is not for all with that is let nv vi vi then dkt nv also note that for alternatives we have sc pp sc pp nv nv next we show that is monotonically increasing function of sc pp equivalently if and only if sc pp sc pp fix consider the bijection between the sets of rankings placing and first which simply swaps and then ϕdkt ϕdkt note that let equivalently in now and fix such that denote the set of alternatives positioned between and in have identical disagreements with on pair of alternatives unless one of and belongs to and ii the other belongs to thus the difference of disagreements of and with on such pairs is dkt dkt nv nv nv nv sc pp sc pp thus sc pp sc pp implies dkt dkt and thus pp pp and sc sc implies dkt dkt and thus suboptimality of approval voting for distances other than the kt distance stems from the fact that in counting the number of approvals for given alternative one discards information regarding other alternatives approved along with the given alternative in various votes however no such information is discarded when only one alternative is approved or not approved in each vote that is given plurality or veto votes approval voting should be optimal not only for the mallows model but for any reasonable noise model the next result formalizes this intuition theorem under neutral noise model the set of mle best alternatives coincides with the set of approval winners given plurality votes if pi where pi is the probability of the alternative in position in the true ranking appearing in the first position in sample or given veto votes if qi where qi is the probability of the alternative in position in the true ranking appearing in the last position in sample proof we show the proof for plurality votes the case of veto votes is symmetric in every vote instead of single approved alternative we have single alternative that is not approved note that the probability pi is independent of the true ranking due to the neutrality of the noise model consider plurality vote profile and an alternative let the likelihood function for is given by pr under every the qn contribution of the sc pp plurality votes for to the product pr pr vi is pp sc note that the alternatives in are distributed among positions in in all possible ways by the rankings in let ib denote the position of alternative then sc pp sc pp pi ib ib pi app the second transition holds because sc pp sc pp our assumption in the theorem statement implies pib for ib now it can be checked that pp pp for we have pi sc thus sc pp sc pp if and only if as required note that the conditions of theorem are very mild in particular the condition for plurality votes is satisfied under the mallows model with all five distances we consider and the condition for veto votes is satisfied under the mallows model with the kendall tau the footrule and the maximum displacement distances this is presented as theorem in the online appendix appendix high noise and low noise while theorem shows that there are situations where at least some of the approval winners may not be mle best 
alternatives it does not paint the complete picture in particular in both profiles used as counterexamples in the proof of theorem it holds that every mle best alternative is an approval winner that is the optimal rule choosing an mle best alternative works as if scheme is imposed on top of approval voting does this hold true for all profiles part of theorem gives positive answer for the kendall tau distance in this section we answer the foregoing question largely in the positive under the other four distance functions with respect to the two ends of the mallows spectrum the case of low noise and the case of high noise the case of high noise is especially compelling because that is when it becomes hard to pinpoint the ground truth but both extreme cases have received special attention in the literature in contrast to previous results which have almost always yielded different answers in the two cases we show that every mle best alternative is an approval winner in both cases in almost every situation we begin with the likelihood function for alternative ϕd when maximizing requires minimizing the minimum exponent ties if any are broken using the number of terms achieving the minimum exponent then the second smallest exponent and so on at the other extreme let with using the approximation maximizing requires minimizing the sum of over all with and ties are broken using approximations let min min we are interested in minimizing and this leads to novel combinatorial problems that require detailed analysis we are now ready for the main result of this section theorem the following statements hold for using approval voting to aggregate votes drawn from the mallows model under the mallows model with dfr dcy dhm and and under the mallows model with dfr dcy dhm dmd and it holds that for every and every profile with votes every mle best alternative is an approval winner under the mallows model with dmd and there exists profile with seven votes over alternatives such that no mle best alternative is an approval winner before we proceed to the proof we remark that in part of the theorem by and we mean that there exist such that the result holds for all and respectively in part of the theorem we mean that for every there exists for which the negative result holds due to space constraints we only present the proof for the mallows model with dfr and the full proof appears in the online appendix appendix proof of theorem only for dfr let in the mallows model with the footrule distance to analyze we first analyze minπ dfr for fixed then we minimize it over and show that the set of alternatives that appear first in the minimizers the set of alternatives minimizing is exactly the set of approval winners since every mle best alternative in the case must minimize the result follows fix imagine boundary between positions and in all rankings between the approved and the alternatives now given profile such that we first apply the following operation repeatedly for let an alternative be in positions and in and πi respectively if and are on the same side of the boundary either both are at most or both are greater than and then swap alternatives πi and πi in πi note that this decreases the displacement of in πi with respect to by and increases the displacement of πi by at most hence the operation can not increase dfr let denote the profile that we converge to note that satisfies because we only swap alternatives on the same side of the boundary dfr dfr and the following condition condition for every alternative that 
is on the same side of the boundary in and is in the same position in both rankings because we started from an arbitrary profile subject to it follows that it is sufficient to minimize dfr over all with satisfying condition however we show that subject to and condition dfr is actually constant note that for every alternative that is in different positions in and must be on different sides of the boundary in the two rankings it is easy to see that in every there is an equal number of alternatives on both sides of the boundary that are not in the same position as they are in now we can divide the total footrule distance dfr into four parts let and such that let and then the displacement of is broken into two parts and ii let and such that let and then the displacement of is broken into two parts and ii because the number of alternatives of type and is equal for every we can see that the total displacements of types and ii are equal and so are the total displacements of types ii and by observing that there are exactly sc pp instances of type for given value of and sc pp instances of type for given value of we conclude that pp pp sc dfr sc pm minimizing this over reduces to minimizing sc pp by the rearrangement inequality this is minimized when alternatives are ordered in order of their approval scores note that exactly the set of approval winners appear first in such rankings theorem shows that under the mallows model with dfr dcy dhm every mle best alternative is an approval winner for both and we believe that the same statement holds for all values of as we were unable to find counterexample despite extensive simulations conjecture under the mallows model with distance dfr dcy dhm every mle best alternative is an approval winner for every experiments we perform experiments with two datasets dots and puzzle to compare the performance of approval voting against that of the rule that is mle for the empirically observed distribution of votes and not for the mallows model mao et al collected these datasets by asking workers on amazon mechanical turk to rank either four images by the number of dots they contain dots or four states of an by their distance to the goal state puzzle hence these datasets contain ranked votes over alternatives in setting where true ranking of the alternatives indeed exists each dataset has four different noise levels higher noise was created by increasing the task difficulty for dots ranking images with smaller difference in the number of dots leads to high noise and for puzzle ranking states farther away from the goal state leads to high noise each noise level of each dataset contains profiles with approximately votes each in our experiments we extract votes from the ranked votes by taking the top alternatives in each vote given these votes approval voting returns an alternative with the largest number of approvals to apply the mle rule however we need to learn the underlying distribution of votes to that end we partition the set of profiles in each noise level of each dataset into training and test sets we use high fraction of the profiles for training in order to examine the maximum advantage that the mle rule may have over approval voting given the training profiles which approval voting simply ignores the mle rule learns the probabilities of observing each of the possible of the alternatives given fixed true on the test data the mle rule first computes the likelihood of each ranking given the votes then it computes the likelihood of each alternative being the best 
by adding the likelihoods of all rankings that put the alternative first; it finally returns an alternative with the highest likelihood.

We measure the accuracy of both methods by their frequency of pinpointing the correct best alternative. For each noise level in each dataset, the accuracy is averaged over simulations with random partitionings of the profiles into training and test sets.

[Figure: accuracy of the MLE rule (trained on a high fraction of the profiles) and of approval voting for k-approval votes, at each noise level, on the Dots and Puzzle datasets.]

The figures show that, in general, the MLE rule does achieve greater accuracy than approval voting; however, the increase is modest, and may not be significant in some contexts.

Discussion. Our main conclusion from the theoretical and empirical results is that approval voting is typically close to optimal for aggregating k-approval votes; however, the situation is much subtler than it appears at first glance. Moreover, our theoretical analysis is restricted by the assumption that the votes are drawn from the Mallows model. A recent line of work in social choice theory has focused on designing voting rules that perform well simultaneously under a wide variety of noise models. It seems intuitive that approval voting would work well for aggregating k-approval votes under any reasonable noise model; an analysis extending to a wide family of realistic noise models would provide stronger theoretical justification for using approval voting.

On the practical front, it should be emphasized that approval voting is not always optimal; when maximum accuracy matters, one may wish to switch to the MLE rule. However, learning and applying the MLE rule is much more demanding: in our experiments we learn the entire distribution over votes given the true ranking, and while for k-approval votes over the four alternatives in our datasets we only need to learn a handful of probability values, in general, for k-approval votes over m alternatives, the number of probability values to learn grows combinatorially with m and k, and the training data may not be sufficient for this purpose. This calls for the design of estimators for the best alternative that achieve greater statistical efficiency by avoiding the need to learn the entire underlying distribution over votes. (Technically, we learn a neutral noise model, where the probability of a subset of alternatives being observed only depends on the positions of those alternatives in the true ranking.)

References: simple characterization of approval voting social choice and welfare azari soufiani chen parkes and xia generalized for rank aggregation in proc of nips pages azari soufiani parkes and xia random utility theory for social choice in proc of nips pages azari soufiani parkes and xia computing parametric ranking models via rankbreaking in proc of icml pages bartholdi tovey and trick voting schemes for which it can be difficult to tell who won the election social choice and welfare baumeister hemaspaandra hemaspaandra and rothe computational aspects of approval voting in handbook on approval voting pages springer brams mathematics and democracy designing better voting and procedures princeton university press brams and fishburn approval voting springer edition caragiannis procaccia and shah when do noisy votes reveal the truth in proc of ec pages caragiannis procaccia and shah modal ranking uniquely robust voting rule in proc of aaai pages elkind and shah electing the most probable without eliminating the irrational voting over intransitive domains in proc of uai pages nowak and rothe approval voting fully resists constructive control and broadly resists destructive control math log fishburn
axioms for approval voting direct proof journal of economic theory fishburn and brams approval voting condorcet principle and runoff elections public choice goel krishnaswamy sakshuwong and aitamurto knapsack voting in proc of collective intelligence lee kladwang lee cantu azizyan kim limpaecher yoon treuille and das rna design rules from massive open laboratory proceedings of the national academy of sciences little chilton goldman and miller turkit human computation algorithms on mechanical turk in proc of uist pages lu and boutilier learning mallows models with pairwise preferences in proc of icml pages mallows ranking models biometrika mao procaccia and chen better human computation through principled voting in proc of aaai pages procaccia reddi and shah maximum likelihood approach for selecting sets of alternatives in proc of uai pages sertel characterizing approval voting journal of economic theory shah zhou and peres approval voting and incentives in crowdsourcing in proc of icml pages young condorcet theory of voting the american political science review 
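As an addendum to the approval voting paper above, the following is a brute-force sketch of the MLE rule it analyses: under a Mallows model with a chosen distance, it enumerates all rankings, computes the likelihood of each alternative being first given the k-approval votes, and compares the result to the set of approval winners. The enumeration is only feasible for very small m, and the example votes, φ, and function names are hypothetical illustrations rather than the paper's code.

```python
import numpy as np
from itertools import permutations

def kendall_tau(r1, r2):
    # number of pairs of alternatives on which the two rankings disagree
    pos = {a: i for i, a in enumerate(r2)}
    return sum(pos[r1[i]] > pos[r1[j]]
               for i in range(len(r1)) for j in range(i + 1, len(r1)))

def approval_winners(votes, m):
    scores = np.zeros(m)
    for v in votes:
        for a in v:
            scores[a] += 1
    return set(np.flatnonzero(scores == scores.max()))

def mle_best_alternatives(votes, m, k, phi, dist=kendall_tau):
    # brute-force MLE for the best alternative under a Mallows model;
    # only tractable for very small m since it enumerates all m! rankings
    rankings = list(permutations(range(m)))
    def vote_weight(vote, sigma):
        # proportional to Pr[vote | sigma]: total Mallows weight of rankings whose top k equals the vote
        # (the normalization constant is independent of sigma and can be dropped)
        return sum(phi ** dist(pi, sigma) for pi in rankings if set(pi[:k]) == vote)
    like = np.zeros(m)
    for sigma in rankings:
        like[sigma[0]] += np.prod([vote_weight(v, sigma) for v in votes])
    return set(np.flatnonzero(np.isclose(like, like.max())))

# hypothetical 2-approval votes over m = 4 alternatives with noise parameter phi = 0.5
votes = [{0, 1}, {0, 1}, {2, 3}, {0, 2}]
print(approval_winners(votes, m=4))                      # approval scores are 3, 2, 2, 1
print(mle_best_alternatives(votes, m=4, k=2, phi=0.5))
```

Swapping `dist` for another ranking distance (footrule, Cayley, Hamming, maximum displacement) makes it easy to search for small profiles where the two rules disagree, in the spirit of the counterexamples discussed in the paper.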
Regressive Virtual Metric Learning. Michaël Perrot and Amaury Habrard, Université de Lyon, Université Jean Monnet de Saint-Étienne, Laboratoire Hubert Curien CNRS, France.

Abstract. We are interested in supervised metric learning of Mahalanobis-like distances. Existing approaches mainly focus on learning a new distance using similarity and dissimilarity constraints between examples. In this paper, instead of bringing closer examples of the same class and pushing far away examples of different classes, we propose to move the examples with respect to virtual points: each example is brought closer to an a priori defined virtual point, reducing the number of constraints to satisfy. We show that our approach admits a closed-form solution which can be kernelized. We provide a theoretical analysis showing the consistency of the approach and establishing some links with other classical metric learning methods. Furthermore, we propose an efficient solution to the difficult problem of selecting virtual points, based in part on recent works in optimal transport. Lastly, we evaluate our approach on several state-of-the-art datasets.

Introduction. The goal of a metric learning algorithm is to capture the idiosyncrasies in the data, mainly by defining a new space of representation where some semantic constraints between examples are fulfilled. In recent years the main focus of metric learning algorithms has been to learn Mahalanobis-like distances of the form d_M(x, x') = sqrt((x - x')^T M (x - x')), where M is a positive semi-definite (PSD) matrix (when M is the identity matrix, d_M corresponds to the Euclidean distance). Using a Cholesky decomposition M = L L^T, one can see that this is equivalent to learning a linear transformation of the input space. Most of the existing approaches in metric learning use constraints of type must-link and cannot-link between learning examples: for example, in a supervised classification task, the goal is to bring closer examples of the same class and to push far away examples of different classes. The idea is that the learned metric should assign a high value to dissimilar examples and a low value to similar examples; this new distance can then be used in a classification algorithm like a nearest neighbor classifier. Note that in this case the set of constraints is quadratic in the number of examples, which can be prohibitive when the number of examples increases. One heuristic is then to select only a subset of the constraints, but selecting such a subset is not trivial.

In this paper we propose to consider a new kind of constraints where each example is associated with an a priori defined virtual point. It allows us to consider the metric learning problem as a simple regression where we try to minimize the differences between learning examples and virtual points. The figure illustrates the differences between our approach and a classical metric learning approach; it can be noticed that our algorithm only uses a linear number of constraints. However, defining these constraints by hand can be tedious and difficult. To overcome this problem we present two approaches to automatically define them: the first one is based on some recent advances in the field of optimal transport, while the second one uses a representation space.

Figure caption: (a) the classical must-link / cannot-link approach; (b) our virtual-point regression formulation. Arrows denote the constraints used by each approach for one particular example in a binary classification task. The classical metric learning approach in (a) uses constraints bringing closer examples of the same class and pushing far away examples of different classes; on the contrary, our approach, presented in (b), moves the examples to the neighborhood of their corresponding virtual
point in black using only constraints best viewed in color moreover thanks to its formulation our approach can be easily kernelized allowing us to deal efficiently with non linear transformations which is nice advantage in comparison to some metric learning methods we also provide theoretical analysis showing the consistency of our approach and establishing some relationships with classical metric learning formulation this paper is organized as follows in section we identify several related works then in section we present our approach provide some theoretical results and give two solutions to generate the virtual points section is dedicated to an empirical evaluation of our method on several widely used datasets finally we conclude in section related work for surveys on metric learning see and in this section we focus on algorithms which are more closely related to our approach first of all one of the most famous approach in metric learning is lmnn where the authors propose to learn psd matrix to improve the algorithm in their work instead of considering pairs of examples they use triplets xi xj xk where xj and xk are in the neighborhood of xi and such that xi and xj are of the same class and xk is of different class the idea is then to bring closer xi and xj while pushing xk far away hence if the number of constraints seems to be cubic the authors propose to only consider triplets of examples which are already close to each other in contrast the idea presented in is to collapse all the examples of the same class in single point and to push infinitely far away examples of different classes the authors define measure to estimate the probability of having an example xj given an example xi with respect to learned psd matrix then they minimize the kl divergence between this measure and the best case where the probability is if the two examples are of the same class and otherwise it can be seen as collapsing all the examples of the same class on an implicit virtual point in this paper we use several explicit virtual points and we collapse the examples on these points with respect to their classes and their distances to them recurring issue in mahalanobis like metric learning is to fulfill the psd constraint on the learned metric indeed projecting matrix on the psd cone is not trivial and generally requires costly eigenvalues decomposition to address this problem in itml the authors propose to use logdet divergence as the regularization term the idea is to learn matrix which is close to an priori defined psd matrix the authors then show that if the divergence is finite then the learned matrix is guaranteed to be psd another approach as proposed in is to learn matrix such that llt instead of learning the metric the authors propose to learn the projection the main drawback is the fact that most of the time the resulting optimization problem is not convex and is thus harder to optimize in this paper we are also interested in learning directly however because we are using constraints between examples and virtual points we obtain convex problem with closed form solution allowing us to learn the metric in an efficient way the problem of learning metric such that the induced space is not linearly dependent of the input space has been addressed in several works before first it is possible to directly learn an intrinsically non linear metric as in where the authors propose to learn distance rather than mahalanobis distance this distance is particularly relevant for histograms comparisons note that this 
kind of approaches is close to the kernel learning problem which is beyond the scope of this work second another solution used by local metric learning methods is to split the input space in several regions and to learn metric in each region to introduce some non linearity as in mmlmnn similarly in the authors propose to locally refine the metric learned by lmnn by successively splitting the input space third kind of approach tries to project the learning examples in new space which is non linearly dependent of the input space it can be done in two ways either by projecting priori the learning examples in new space with kpca or by rewriting the optimization problem in kernelized form the first approach allows one to include non linearity in most of the metric learning algorithms but imposes to select the interesting features beforehand the second method can be difficult to use as rewriting the optimization problem is most of the times non trivial indeed if one wants to use the kernel trick it implies that the access to the learning examples should only be done through dot products which is difficult when working with pairs of examples as it is the case in metric learning in this paper we show that using virtual points chosen in given target space allows us to kernelize our approach easily and thus to work in very high dimensional space without using an explicit projection thanks to the kernel trick our method is based on regression and can thus be linked in its kernelized form to several approaches in kernelized regression for structured output the idea behind these approaches is to minimize the difference between input examples and output examples using kernels working in high dimensional space in our case the learning examples can be seen as input examples and the virtual points as output examples however we only project the learning examples in high dimensional space the virtual points already belong to the output space hence we do not have the problem furthermore our goal is not to predict virtual point but to learn metric between examples and thus after the learning step the virtual points are discarded contributions the main idea behind our algorithm is to bring closer the learning examples to set of virtual points we present this idea in three subsections first we assume that we have access to set of learning pairs where is learning example and is virtual point associated to and we present both the linear and kernelized formulations of our approach called rvml it boils down to solve regression in closed form the main originality being the introduction of virtual points in the second subsection we show that it is possible to theoretically link our approach to classical metric learning one based on in the last subsection we propose two automatic methods to generate the virtual points and to associate them with the learning examples regressive virtual metric learning rvml given probability distribution defined over where rd and is finite label set let xi yi be set of examples drawn from let fv where rd be the function which associates each example to virtual point we consider the learning set sv xi vi where vi fv xi yi for the sake of simplicity denote by xn and vn the matrices containing respectively one example and the associated virtual point on each line in this section we consider that the function fv is known we come back to its definition in section let kf be the frobenius norm and be the vector norm our goal is to learn matrix such that llt and for this purpose we consider the 
following optimisation problem min min kxl the idea is to learn new space of representation where each example is close to its associated virtual point note that is matrix and if we also perform dimensionality reduction theorem the optimal solution of problem can be found in closed form furthermore we can derive two equivalent solutions xt λni xt xxt λni proof the proof of this theorem can be found in the supplementary material from eq we deduce the matrix llt xt λni xt vvt xt λni note that is psd by construction xt mx xt llt klt so far we have focused on the linear setting we now present kernelized version showing that it is possible to learn metric in very high dimensional space without an explicit projection let be projection function and be its associated kernel for the sake of readability let kx where xn given the solution matrix presented in eq we have xt xxt λni vvt xxt λni then mk the kernelized version of the matrix is defined such that mk kx λni vvt kx λni the squared mahalanobis distance can be written as xt mx thus we can obtain mk mk mk the kernelized version by considering that mk kx λni vvt kx λni kx kx λni vvt kx λni kx where kx xn is the similarity vector to the examples note that it is also possible to obtain kernelized version of lk kx λni this result is close to previous one already derived in in structured output setting the main difference is the fact that we do not use kernel on the output the virtual points here hence it is possible to compute the projection of an example of dimension in new space of dimension lk kx λni kx kx λni recall that in this work we are interested in learning distance between examples and not in the prediction of the virtual points which only serve as way to bring closer similar examples and push far away dissimilar examples from complexity standpoint we can see that assuming the kernel function as easy to calculate the main bottleneck when computing the solution in closed form is the inversion of matrix theoretical analysis in this section we propose to theoretically show the interest of our approach by proving that it is consistent and ii that it is possible to link it to more classical metric learning formulation consistency let kxt vt be our loss and let dv be the probability distribution over such that pdv pd fv showing the consistency boils down to bound with high probability the true risk denoted by by the empirical risk denoted by such that kxl and the empirical risk corresponds to the error of the learned matrix on the learning set sv the true risk is the error of on the unknown distribution dv the consistency property ensures that with sufficient number of examples low empirical risk implies low true risk with high probability to show that our approach is consistent we use the uniform stability framework theorem let cv for any and cx for any with probability for any matrix optimal solution of problem we have ln cx λn proof the proof of this theorem can be found in the supplementary material we obtain rate of convergence in which is standard with this kind of bounds link with classical metric learning formulation in this section we show that it is possible to bound the true risk of classical metric learning approach with the empirical risk of our formulation most of the classical metric learning approaches make use of notion of margin between similar and dissimilar examples hence similar examples have to be close to each other at distance smaller than margin and dissimilar examples have to be far from each other at distance greater 
than margin let xi yi and xj yj be two examples from using this notion of margin we consider the following loss xi yi xj yj yij lt xi lt xj γyij where yij if yi yj and otherwise max is the hinge loss and γyij is the desired margin between examples as introduced before we consider that γyij takes big value when the examples are dissimilar when yij and small value when the examples are similar when yij in the following we show that relating the notion of margin to the distances between virtual points it is possible to bound the true risk associated with this loss by the empirical risk of our approach with respect to constant theorem let be distribution over let rd be finite set of virtual points and fv is defined as fv xi yi vi vi let cv for any and cx for any let maxxk xl ykl vk vl and minxk xl ykl vk vl we have xi yi xj yj yij lt xi lt xj γyij ln cx cx cx λn proof the proof of this theorem can be found in the supplementary material in theorem we can notice that the margins are related to the distances between virtual points and correspond to the ideal margins the margins that we would like to achieve after the learning step aside this remark we can define and the observed margins obtained after the learning step all the similar examples are in sphere centered in their corresponding virtual point and of diameter max xt vt similarly the distance between hyperspheres of dissimilar examples is minv kv as consequence even if we do not use can not constraints our algorithm is able to push reasonably far away dissimilar examples in the next subsection we present two different methods to select the virtual points virtual points selection previously we assumed to have access to the function fv in this subsection we present two methods for generating automatically the set of virtual points and the mapping fv using optimal transport on the learning set in this first approach we propose to generate the virtual points by using recent variation of the optimal transport ot problem allowing one to transport some examples to new points corresponding to linear combination of set of known instances these new points will actually correspond to our virtual points our approach works as follows we begin by extracting set of landmarks from the training set for this purpose we use an adaptation of the landmark selection method proposed in allowing us to take into account some diversity among the landmarks to avoid to fix the number of landmarks in advance we have just replaced it by simple heuristic saying that the number of landmarks must be greater than the number of classes and that the maximum distance between an example and landmark must be lower than the mean of all pairwise algorithm selecting from set of examples input xi yi set of examples the label set output subset of begin mean of distances between all the examples of xmax arg maxkx xmax kx while or do xmax arg max kx xmax kx distances from the training set us to have fully automatic procedure it is summarized in algorithm then we compute an optimal transport from the training set to the landmark set for this purpose we create real matrix of size giving the cost to transport one training instance to landmark such that kxi with xi and the optimal transport is found by learning matrix able to minimize the cost of moving training examples to the landmark points let be the matrix of landmark points one per line the transport of any training instance xi yi gives new virtual point such that fv xi yi designing the ith line of note that this new virtual point 
is linear combination of the landmark instances to which the example is transported the set of virtual points is then defined by the virtual points are thus not defined priori but are automatically learned by solving problem of optimal transport note that this transportation mode is potentially non linear since there is no guarantee that there exists matrix such that xt our metric learning approach can in this case be seen as an approximation of the result given by the optimal transport to learn we use the following optimization problem proposed in arg min hγ cif xx kγ yi kpq where log is the entropy of that allows to solve the transportation problem efficiently with the algorithm the second regularization term where yi corresponds to the lines of the th column of where the class of the input is has been introduced in the goal of this term is to prevent input examples of different classes to move toward the same output examples by promoting group sparsity in the matrix thanks to kpq corresponding to lq to the power of used here with and using representation space for this second approach we propose to define virtual points as the unit vectors of space of dimension let ej be such unit vector vector where all the attributes are except for one attribute which is set to to which we associate class label from then for any learning example xi yi we define fv xi yi yi where yi if ej is mapped with the class yi thus we have exactly virtual points each one corresponding to unit vector and class label we call this approach the representation space method if the number of classes is smaller than the number of dimensions used to represent the learning examples then our method will perform dimensionality reduction for free furthermore our approach will try to project all the examples of one class on the same axis while examples of other classes will tend to be projected on different axes the underlying intuition behind the new space defined by is to make each attribute discriminant for one class table comparison of our approach with several baselines in the linear setting base amazon breast caltech dslr ionosphere isolet letters pima scale splice wine webcam baselines lmnn scml mean our approach experimental results in this section we evaluate our approach on different datasets coming from either the uci repository or used in recent works in metric learning for isolet splice and we have access to standard partition for the other datasets we use test partition we perform the experiments on different splits and we average the result we normalize the examples with respect to the training set by subtracting for each attribute its mean and dividing by times its standard deviation we set our regularization parameter with cross validation after the metric learning step we use neighbor classifier to assess the performance of the metric and report the accuracy obtained we perform two series of experiments first we consider our linear formulation used with the two virtual points selection methods presented in this paper based on optimal transport section and using the representation space method section we compare them to neighbor classifier without metric learning and with two state of the art linear metric learning methods lmnn and scml in second series we consider the kernelized versions of rvml namely and based respectively on optimal transport and representation space methods with rbf kernel with the parameter fixed as the mean of all pairwise training set euclidean distances we compare them to non linear 
methods using kpca with rbf as neighbor classifier without metric learning and lmnnkpca corresponding to lmnn in the the number of dimensions is fixed as the one of the original space for high dimensional datasets more than attributes to times the original dimension when the dimension is smaller between and attributes and to times the original dimension for the lowest dimensional datasets less than attributes we also consider some local metric learning methods gblmnn non linear version of lmnn and scmllocal the local version of scml for all these methods we use the implementations available online letting them handle tuning the results for linear methods are presented in table while table gives the results obtained with the non linear approaches in each table the best result on each line is highlighted with bold font while the second best result is underlined star indicates either that the best baseline is significantly better than our best result or that our best result is significantly better than the best baseline according to classical significance tests the being fixed at we can make the following remarks in the linear setting our approaches are very competitive to the state of the art and tends to be the best on average even though it must be noticed that scml is very competitive on some datasets the average difference is not significant performs slightly less on average considering now the non linear methods our approaches improve their performance and are significantly better than the others on average has the best average behavior in this setting these experiments show that our regressive formulation with the parameter fixed as previously to the mean of all pairwise training set euclidean distances table comparison of our approach with several baselines in the case base amazon breast caltech dslr ionosphere isolet letter pima scale splice wine webcam mean baselines gblmnn scmllocal our approach is very competitive and is even able to improve state of the art performances in non linear setting and consequently that our virtual point selection methods automatically select correct instances considering the virtual point selection we can observe that the ot formulation performs better than the representation space one in the linear case while it is the opposite in the case we think that this can be explained by the fact that the ot approach generates more virtual points in potentially non linear way which brings more expressiveness for the linear case on the other hand in the non linear one the relative small number of virtual points used by the method seems to induce better regularization in section of the supplementary material we provide additional experiments showing the interest of using explicit virtual points and the need of careful association between examples and virtual points we also provide some graphics showing projections of the space learned by and on the isolet dataset illustrating the capability of these approaches to learn discriminative attributes in terms of computational cost our approach in closed form is competitive with classical methods but does not yield to significant improvements indeed in practice classical approaches only consider small number of constraints times the number of examples where is small constant in the case of scml thus the practical computational complexity of both our approach and classical methods is linearly dependant on the number of examples conclusion we present new metric learning approach based on regression and aiming at bringing 
closer the learning examples to some priori defined virtual points the number of constraints has the advantage to grow linearly with the size of the learning set in opposition to the quadratic grow of standard can not approaches moreover our method can be solved in closed form and can be easily kernelized allowing us to deal with non linear problems additionally we propose two methods to define the virtual points one making use of recent advances in the field of optimal transport and one based on unit vectors of representation space allowing one to perform directly some dimensionality reduction theoretically we show that our approach is consistent and we are able to link our empirical risk to the true risk of classical metric learning formulation finally we empirically show that our approach is competitive with the state of the art in the linear case and outperforms some classical approaches in the one we think that this work opens the door to design new metric learning formulations in particular the definition of the virtual points can bring way to control some particular properties of the metric rank locality discriminative power as consequence this aspect opens new issues which are in part related to landmark selection problems but also to the ability to embed expressive semantic constraints to satisfy by means of the virtual points other perspectives include the development of specific solver of online versions the use of low norms or the conception of new local metric learning methods another direction would be to study similarity learning extensions to perform linear classification such as in references jason davis brian kulis prateek jain suvrit sra and inderjit dhillon informationtheoretic metric learning in proc of icml pages jacob goldberger sam roweis geoffrey hinton and ruslan salakhutdinov neighbourhood components analysis in proc of nips pages bellet amaury habrard and marc sebban metric learning synthesis lectures on artificial intelligence and machine learning morgan claypool publishers brian kulis metric learning survey foundations and trends in machine learning kilian weinberger john blitzer and lawrence saul distance metric learning for large margin nearest neighbor classification in proc of nips pages amir globerson and sam roweis metric learning by collapsing classes in proc of nips pages kilian weinberger and lawrence saul distance metric learning for large margin nearest neighbor classification jmlr dor kedem stephen tyree kilian weinberger fei sha and gert lanckriet nonlinear metric learning in proc of nips pages bernhard alex smola and kernel principal component analysis in proc of icann pages jason weston olivier chapelle elisseeff bernhard and vladimir vapnik kernel dependency estimation in proc of nips pages corinna cortes mehryar mohri and jason weston general regression technique for learning transductions in proc of icml pages hachem kadri mohammad ghavamzadeh and philippe preux generalized kernel approach to structured output learning in proc of icml pages rong jin shijun wang and yang zhou regularized distance metric learning theory and algorithm in proc of nips pages olivier bousquet and elisseeff stability and generalization jmlr villani optimal transport old and new volume springer science business media purushottam kar and prateek jain learning via data driven embeddings in proc of nips pages nicolas courty flamary and devis tuia domain adaptation with regularized optimal transport in proc of pages marco cuturi sinkhorn distances lightspeed computation 
of optimal transport. In Proc. of NIPS, pages. Lichman. UCI machine learning repository. Yuan Shi, Bellet, and Fei Sha. Sparse compositional metric learning. In Proc. of AAAI Conference on Artificial Intelligence, pages. Bellet, Amaury Habrard, and Marc Sebban. Similarity learning for provably accurate sparse linear classification. In Proc. of ICML. (The implementation of RVML is freely available on the authors' website.) Balcan, Avrim Blum, and Nathan Srebro. Improved guarantees for learning via similarity functions. In Proc. of COLT, pages.
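To make the closed-form construction of the preceding paper concrete, the following is a minimal sketch of linear RVML with representation-space virtual points. It assumes a ridge-regression objective whose minimizer matches the closed form L = (X^T X + lambda*n*I)^{-1} X^T V and M = L L^T stated in the theorem above; all function and variable names are illustrative, and this is not the authors' released implementation.

```python
# Minimal sketch of linear RVML (regressive virtual metric learning), assuming the
# regularized least-squares formulation whose closed-form solution is
#   L = (X^T X + lambda * n * I)^{-1} X^T V,   M = L L^T  (PSD by construction),
# together with the "representation space" choice of virtual points (one unit vector
# per class).  Names are illustrative; this is not the authors' reference code.
import numpy as np

def representation_space_virtual_points(y, n_classes):
    """Map each example to the unit vector of its class (v_i = e_{y_i})."""
    V = np.zeros((len(y), n_classes))
    V[np.arange(len(y)), y] = 1.0
    return V

def fit_rvml_linear(X, V, lam=1e-2):
    """Closed form L = (X^T X + lam * n * I)^{-1} X^T V, and M = L L^T."""
    n, d = X.shape
    L = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ V)
    M = L @ L.T  # defines the Mahalanobis-like distance d_M(x, x') = (x - x')^T M (x - x')
    return L, M

def predict_1nn(L, X_train, y_train, X_test):
    """1-NN in the learned space: distances between projections X L equal d_M."""
    Z_train, Z_test = X_train @ L, X_test @ L
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(-1)
    return y_train[d2.argmin(axis=1)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)
    V = representation_space_virtual_points(y, n_classes=2)
    L, M = fit_rvml_linear(X, V, lam=1e-2)
    print(predict_1nn(L, X, y, X[:5]))
```

With the representation-space construction, L has as many columns as there are classes, so the projection X L already performs the dimensionality reduction discussed above. A kernelized variant would follow the same pattern with the Gram matrix in place of X^T X, as in the kernelized formulas of the paper.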
analysis of robust pca via local incoherence huishuai zhang department of eecs syracuse university syracuse ny yi zhou department of eecs syracuse university syracuse ny yingbin liang department of eecs syracuse university syracuse ny abstract we investigate the robust pca problem of decomposing an observed matrix into the sum of and sparse error matrices via convex programming principal component pursuit pcp in contrast to previous studies that assume the support of the error matrix is generated by uniform bernoulli sampling we allow sampling entries of the matrix are corrupted by errors with unequal probabilities we characterize conditions on error corruption of each individual entry based on the local incoherence of the matrix under which correct matrix decomposition by pcp is guaranteed such refined analysis of robust pca captures how robust each entry of the low rank matrix combats error corruption in order to deal with error corruption our technical proof introduces new weighted norm and the concentration properties that such norm satisfies introduction we consider the problem of robust principal component analysis pca suppose data matrix can be decomposed into matrix and sparse matrix as robust pca aims to find and with given this problem has been extensively studied recently in principal component pursuit pcp has been proposed to solve the robust pca problem via the following convex programming pcp minimize subject to where denotes the nuclear norm the sum of singular values and denotes the norm the sum of absolute values of all entries it was shown in that pcp successfully recovers and if the two matrices are distinguishable from each other in properties is not sparse and is not one important quantity that determines similarity of to sparse matrix is the incoherence of which measures how column and row spaces of are aligned with canonical basis and between themselves namely suppose that is matrix with svd where is diagonal matrix with singular values as its diagonal entries is matrix with columns as the left singular vectors of is matrix with columns as the right singular vectors of and denotes the transpose of the incoherence of is measured in this paper we focus on square matrices for simplicity our results can be extended to rectangular matrices in standard way by max where and are defined as ku ei kv ej ku for all previous studies suggest that the incoherence crucially determines conditions on sparsity of in order for pcp to succeed for example theorem in explicitly shows that the matrix with larger can tolerate only smaller error density to guarantee correct matrix decomposition by pcp in all previous work on robust pca the incoherence is defined to be the maximum over all column and row spaces of as in and which can be viewed as the global parameter for the entire matrix and consequently characterization of error density is based on such global and in fact the worst case incoherence in fact each entry of the low rank matrix can be associated with local incoherence parameter µij which is less than or equal to the global parameter and then the allowable error density can be potentially higher than that characterized based on the global incoherence thus the total number of errors that the matrix can tolerate in robust pca can be much higher than that characterized based on the global incoherence when errors are distributed accordingly motivated by such an observation this paper aims to characterize conditions on error corruption of each entry of the low rank matrix based on the 
corresponding local incoherence parameter which guarantee success of pcp such conditions imply how robust each individual entry of to resist error corruption naturally the error corruption probability is allowed to be over the matrix locations of entries in are sampled we note that the notion of local incoherence was first introduced in for studying the matrix completion problem in which local incoherence determines the local sampling density in order to guarantee correct matrix completion here local incoherence plays similar role and determines the maximum allowable error density at each entry to guarantee correct matrix decomposition the difference lies in that local incoherence here depends on both localized and rather than only on localized in matrix completion due to further difficulty of robust pca in which locations of error corrupted entries are unknown as pointed out in our contribution in this paper we investigate more general robust pca problem in which entries of the low rank matrix are corrupted by distributed bernoulli errors we characterize the conditions that guarantee correct matrix decomposition by pcp our result identifies the local incoherence defined by localized and for each entry of the low rank matrix to determine the condition that each local bernoulli error corruption parameter should satisfy our results provide the following useful understanding of the robust pca problem our characterization provides localized and hence more refined view of robust pca and determines how robust each entry of the low rank matrix combats error corruption our results suggest that the total number of errors that the matrix can tolerate depends on how errors are distributed over the matrix via cluster problems our results provide an evidence that is necessary in characterizing conditions for robust pca in order to deal with error corruption our technical proof introduces new weighted norm denoted by lw which involves the information of both localized and and is hence different from the weighted norms introduced in for matrix completion thus our proof necessarily involves new technical developments associated with such new norm related work closely related but different problem from robust pca is matrix completion in which matrix is partially observed and is to be completed such problem has been previously studied in and it was shown that matrix can be provably recoverable by convex optimization with as few as max nr observed entries later on it was shown in that does not affect sample complexity for matrix completion and hence nr observed entries are sufficient for guaranteeing correct matrix completion it was further shown in that coherent matrix with large can be recovered with means for some positive nr observations as long as the sampling probability is proportional to the leverage score localized our problem can be viewed as its counterpart in robust pca where the difference lies in the local incoherence in our problem depends on both localized and robust pca aims to decompose an observed matrix into the sum of matrix and sparse matrix in robust pca with fixed error matrix was studied and it was shown that the maximum number of errors in any row or column should be bounded from above in order to guarantee correct decomposition by pcp robust pca with random error matrix was investigated in number of studies it has been shown in that such decomposition can be exact with high probability if the percentage of corrupted entries is small enough under the assumptions that the matrix is 
incoherent and the support set of the sparse matrix is uniformly distributed it was further shown in that if signs of nonzero entries in the sparse matrix are randomly chosen then an adjusted convex optimization can produce exact decomposition even when the percentage of corrupted entries goes to one error is dense the problem was further studied in for the case with the matrix only partially observed our work provides more refined view of robust pca with random error matrix aiming at understanding how local incoherence affects susceptibility of each matrix entry to error corruption model and main result problem statement we consider the robust pca problem introduced in section namely suppose an matrix can be decomposed into two parts where is low rank matrix and is sparse error matrix we assume that the rank of is and the support of is selected randomly but more specifically let denote the support of and then where denotes the set the event is independent across different pairs and where represents the probability that the of is corrupted by error hence is determined by bernoulli sampling with probabilities we study both the random sign and fixed sign models for for the fixed sign model we assume signs of nonzero entries in are arbitrary and fixed whereas for the random sign model we assume that signs of nonzero entries in are independently distributed bernoulli variables randomly taking values or with probability as follows with prob sgn ij with prob with prob in this paper our goal is to characterize conditions on that guarantees correct recovery of and with observation of we provide some notations that are used throughout this paper matrix is associated with five norms kxkf denotes the frobenius norm denotes the nuclear norm the sum of singular values kxk denotes the spectral norm the largest singular value and and represent respectively the and norms of the long vector stacked by the inner product between two matrices is defined as hx trace for linear operator that acts on the space of matrices kak denotes the operator norm given by kak sup kxkf kaxkf main theorems we adopt the pcp to solve the robust pca problem we define the following local incoherence parameters which play an important role in our characterization of conditions on ku ei kv ej µij max ij it is clear that and for all we note that although maxi µij some µij might take values as small as zero we first consider the robust pca problem under the random sign model as introduced in section the following theorem characterizes the condition that guarantees correct recovery by pcp theorem consider the robust pca problem under the random sign model if µij max log for some sufficiently large constant and for all then cp yields correct matrix recovery with with probability at least cn for some constant we note that the term is introduced to justify dual certificate conditions in thepproof see appendix we further note that satisfying the condition in theorem implies log which is an essential bound required in our proof and coincides with the conditions in previous studies although we set for the sake of proof in practice is often determined via cross validation the above theorem suggests that the local incoherence parameter µij is closely related to how robust each entry of to error corruption in matrix recovery an entry corresponding to smaller µij tolerates larger error density this is consistent with the result in for matrix completion in which smaller local incoherence parameter requires lower local sampling rate the 
difference lies in that here both and play roles in µij whereas only matters in matrix completion the necessity of for robust pca is further demonstrated in section via an example theorem also provides more refined view for robust pca in the dense error regime in which the error corruption probability approaches one such an interesting regime was previously studied in in it is argued that pcp with adaptive yields exact recovery even when the error corruption probability approaches one if errors take random signs and the dimension is sufficiently large in it is further shown that pcp with fixed also yields exact recovery and the scaling behavior of the error corruption probability is characterized the above theorem further provides the scaling behavior of the local error corruption probability as it approaches one and captures how such scaling behavior depends on local incoherence parameters µij such result implies that robustness of pcp depends not only on the error density but also on how errors are distributed over the matrix with regard to µij we next consider the robust pca problem under the fixed sign model as introduced in section in this case entries of the error matrix can take arbitrary and fixed values and only locations of entries are random theorem consider the robust pca problem under the fixed sign model if µij max log for some sufficient large constant and for all then pcp yields correct recovery with with probability at least cn for some constant theorem follows from theorem by adapting the elimination and derandomization arguments section as follows let be the matrix with each being if pcp yields exact recovery with certain probability for the random sign model with the parameter then it also yields exact recovery with at least the same probability for the fixed sign model with locations of entries sampled using bernoulli model with the parameter we now compare theorem for robust pca with error corruption to theorem in for robust pca with uniform error corruption it is clear that if we set for all then the two models are the same it can then be easily checked that conditions log and in theorem of implies the conditions in theorem thus theorem provides more relaxed condition than theorem in such benefit of condition relaxation should be attributed to the new golfing scheme introduced in and this paper provides more refined view of robust pca by further taking advantage of such new golfing scheme to analyze local conditions more importantly theorem characterizes relationship between local incoherence parameters and local error corruption probabilities which implies that different areas of the matrix have different levels of ability to resist errors more incoherent area with smaller µij can tolerate more errors thus theorem illustrates the following interesting fact whether pcp yields correct recovery depends not only on the total number of errors but also on how errors are distributed if more errors are distributed to more incoherent areas with smaller µij then more errors in total can be tolerated however if errors are distributed in an opposite manner then only smaller number of errors can be tolerated implication on cluster matrix in this subsection we further illustrate our result when the low rank matrix is cluster matrix although robust pca and even more sophisticated approaches have been applied to solve clustering problems our perspective here is to demonstrate how local incoherence affects entrywise robustness to error corruption which has not been illustrated in 
previous studies suppose there are elements to be clustered we use cluster matrix to represent the clustering relationship of these elements with lij if elements and are in the same cluster and lij otherwise thus with appropriate ordering of the elements is block diagonal matrix with all diagonal blocks containing all and blocks containing all hence the rank of equals the number of clusters which is typically small compared to suppose these entries are corrupted by errors that flip entries from one to zero or from zero to one this can be thought of as adding possibly sparse error matrix to so that the observed matrix is then pcp can be applied to recover the cluster matrix error error ρd we first consider an example with clusters having equal size we set and four clusters we apply errors to entries and entries respectively with the probabilities and in fig we plot recovery accuracy of pcp for each pairs of it is clear from the figure that failure occurs for larger than which thus implies that blocks are more robust to errors than diagonal blocks this can be explained by theorem as follows for cluster matrix with equal cluster size the local incoherence parameters are given by is in diagonal blocks for all and is in blocks and thus is in diagonal blocks µij max is in blocks error ρod error error error vulnerability with respect to cluster sizes error with equal cluster sizes figure error vulnerability on different parts for cluster matrix in both cases for each probability pair we generate trials of independent random error matrices and count the number of successes of pcp we declare trial to be successful if the recovered satisfies lkf color from white to black represents the number of successful trials changes from to based on theorem it is clear that entries are more locally coherent and hence are more vulnerable to errors whereas entries are more locally incoherent and hence are more robust to errors moreover this example also demonstrates the necessity of in the robust pca problem showed that is not necessary for matrix completion and argued informally that is necessary for robust pca by connecting the robust pca problem to hardness of finding small clique in large random graph here the above example provides an evidence for such fact in the example are the same over the entire matrix and hence it is that differentiates incoherence between diagonal blocks and blocks and thus differentiates their robustness to errors we then consider the case with two clusters that have different sizes size versus size hence we apply errors to block diagonal entries corresponding to clusters and respectively with the probabilities and in fig we plot the recovery accuracy of pcp for each pair of it is clear from the figure that failure occurs for larger than which thus implies that entries corresponding to the larger cluster are more robust to errors than entries corresponding to smaller clusters this can be explained by theorem because the local incoherence of block diagonal entry is given by µij rk where is the corresponding cluster size and hence the error corruption probability should satisfy kn log for correct recovery thus larger cluster can resist denser errors this also coincides with the results on graph clustering in outline of the proof of theorem the proof of theorem follows the idea established in and further developed in our main technical development lies in analysis of error corruption based on local incoherence parameters for which we introduce new weighted norm lw and establish 
concentration properties and bounds associated with this norm as generalization of matrix infinity norm lw incorporates both and and is hence different from the weighted norms lµ and lµ in by its role in the analysis for the robust pca problem we next outline the proof here and the detailed proofs are provided in appendix we first introduce some notations we define the subspace where are left and right singular matrix of then induces projection operator pt given by pt moreover the complement subspace to induces an orthogonal projection operator pt with pt we further define two operators associated with bernoulli sampling let denote pa generic subset of we define corresponding projection operator as ij hm ei iei where is the indicator function if is random set generated by bernoulli sampling with tij with tij for all we further define linear operator as ij hm ei iei we further note that throughout this paper with high probability means with probability at least cn where the constant may be different in various contexts our proof includes two main steps establishing that existence of certain dual certificate is sufficient to guarantee correct recovery and constructing such dual certificate for the first step we establish the following proposition µij proposition if max log pcp yields unique solution which agrees with the correct with high probability if there exists dual certificate obeying ky kpt sgn kpt sgn where kf log the proof of the above proposition adapts the idea in for uniform errors to errors in particular the proof exploits the properties of associated with errors which are presented as lemma established in and lemma in appendix proposition suggests that it suffices to prove theorem if we find dual certificate that satisfies the dual certificate conditions thus the second step is to construct via the golfing scheme although we adapt the steps in to construct the dual certificate our analysis requires new technical development based on local incoherence parameters recall the following definitions in section and pij where and pij consider the golfing scheme with nonuniform sizes as suggested in to establish bounds with fewer log factors let where are independent random sets given by pij qij for pij thus if qij the two sampling strategies are equivalent due to the overlap pij between we have qij we set log and construct dual certificate in the following iterative way pt sgn zk pt pt pt zk zk for it is then sufficient to show that such constructed satisfies the dual certificate conditions condition is due to the construction of condition can be shown by concentration property of each iteration step with kf characterized in lemma in appendix in order to showqthat satisfies conditions and we introduce the following weighted norm let µij and wij max where is the smallest nonzero here is introduced to avoid singularity then for any matrix define kzkw max wij it is easy to verify kw is well defined norm we can then show that each iteration step with and kw norms satisfies two concentration properties characterized respectively in lemmas and which are essential to prove conditions and numerical experiments in this section we provide numerical experiments to demonstrate our theoretical results in these experiments pwe adopt an augmented lagrange multiplier algorithm in to solve the pcp we set log trial of pcp for given realization of error locations is declared to be successful if recovered by pcp satisfies lkf we apply the following three models to construct the low rank matrix bernoulli 
xx where is matrix with entries independently taking values model and equally likely gaussian model xx where is matrix with entries independently sampled from gaussian distribution cluster model is block diagonal matrix with blocks containing all in order to demonstrate that the local incoherence parameter affects local robustness to error corruptions we study the following two types of error corruption models uniform error corruption sgn sij is generated as with for all and sgn adaptive error corruption sgn sij is generated as with for all ij and sgn it is clear in both cases the error matrix has the same average error corruption percentage but in adaptive error corruption the local error corruption probability is adaptive to the local incoherence our first experiment demonstrates that robustness of pcp to error corruption not only depends on the number of errors but also depends on how errors are distributed over the matrix for all three failure frequency failure frequency uniform error adaptive error uniform error adaptive error failure frequency error percentage uniform noise adaptive noise bernoulli model error percentage error percentage gaussian model cluster model figure recovery failure of pcp versus error corruption percentage low rank matrix models we set and rank for each low rank matrix model we apply the uniform and adaptive error matrices and plot the failure frequency of pcp versus the error corruption percentage in fig for each value of we perform trials of independent error corruption and count the number of failures of pcp each plot of fig compares robustness of pcp to uniform error corruption the red square line and adaptive error corruption the blue circle line we observe that pcp can tolerate more errors in the adaptive case this is because the adaptive error matrix is distributed based on the local incoherence parameter where error density is higher in areas where matrices can tolerate more errors furthermore comparison among the three plots in fig illustrates that the gap between uniform and adaptive error matrices is the smallest for bernoulli model and the largest for cluster model our theoretic results suggest that the gap is due to the variation of the local incoherence parameter across the matrix which can be measured by the variance of µij larger variance of µij should yield larger gap our numerical calculation of the variances for three models yield var µbernoulli var µgaussian and var µcluster which confirms our explanation uniform error adaptive error rank bernoulli model uniform error adaptive error error percentage uniform error adative error error percentage error percentage rank gaussian model rank cluster model figure largest allowable error corruption percentage versus rank of so that pcp yields correct recovery we next study the phase transition in rank and error corruption probability for the three matrix models we set in fig we plot the error corruption percentage versus the rank of for both uniform and adaptive error corruption models each point on the curve records the maximum allowable error corruption percentage under the corresponding rank such that pcp yields correction recovery we count pair to be successful if nine trials out of ten are successful we first observe that in each plot of fig pcp is more robust in adaptive error corruption due to the same reason explained above we further observe that the gap between the uniform and adaptive error corruption changes as the rank changes in the regime the gap is largely determined by the variance 
of incoherence parameter µij as we argued before as the rank increases the gap is more dominated by the rank and less affected by the local incoherence eventually for large enough rank no error can be tolerated no matter how errors are distributed conclusion we characterize refined conditions under which pcp succeeds to solve the robust pca problem our result shows that the ability of pcp to correctly recover matrix from errors is related not only to the total number of corrupted entries but also to locations of corrupted entries more essentially to the local incoherence of the low rank matrix such result is well supported by our numerical experiments moreover our result has rich implication when the low rank matrix is cluster matrix and our result coincides with studies on clustering problems via low rank cluster matrix our result may motivate the development of weighted pcp to improve recovery performance similar to the weighted algorithms developed for matrix completion in references li ma and wright robust principal component analysis journal of the acm jacm chandrasekaran sanghavi parrilo and willsky incoherence for matrix decomposition siam journal on optimization chen jalali sanghavi and caramanis matrix recovery from errors and erasures ieee transactions on information theory chen matrix completion may ieee transactions on information theory and recht exact matrix completion via convex optimization foundations of computational mathematics and tao the power of convex relaxation matrix completion ieee transactions on information theory gross recovering matrices from few coefficients in any basis ieee transactions on information theory recht fazel and parrilo guaranteed solutions of linear matrix equations via nuclear norm minimization siam review chen bhojanapalli sanghavi and ward completing any matrix provably arxiv preprint hsu kakade and zhang robust matrix decomposition with sparse corruptions ieee transactions on information theory ganesh wright li candes and ma dense error correction for matrices via principal component pursuit in ieee international symposium on information theory isit pages austin tx us june li compressed sensing and matrix completion with constant proportion of corruptions constructive approximation oymak and hassibi finding dense clusters via low sparse decomposition arxiv preprint chen sanghavi and xu clustering sparse graphs in advances in neural information processing systems nips pages lake tahoe nevada us december chen sanghavi and xu improved graph clustering ieee transactions on information theory oct chen jalali sanghavi and xu clustering partially observed graphs via convex optimization journal of machine learning research lin chen and ma the augmented lagrange multiplier method for exact recovery of corrupted matrices arxiv preprint srebro and salakhutdinov collaborative filtering in world learning with the weighted trace norm in advances in neural information processing systems nips pages hyatt regency vancouver canada december vershynin introduction to the analysis of random matrices arxiv preprint tropp tail bounds for sums of random matrices foundations of computational mathematics 
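To illustrate the cluster-matrix experiment discussed above, here is a minimal sketch that builds a block-diagonal cluster matrix, corrupts diagonal-block and off-diagonal-block entries with different Bernoulli flip probabilities, and runs PCP via a standard inexact augmented Lagrange multiplier scheme. The ALM step sizes and the default lambda = 1/sqrt(n) are common choices and are assumptions here rather than the paper's exact settings, and all names are illustrative.

```python
# Sketch of the cluster-matrix experiment: a block-diagonal 0/1 cluster matrix L is
# corrupted by Bernoulli errors with probability rho_d inside diagonal blocks and
# rho_od in off-diagonal blocks, and PCP is solved by inexact ALM.  The parameters
# (mu, rho, lam = 1/sqrt(n)) are standard defaults and are assumptions, not the
# paper's exact configuration.
import numpy as np

def cluster_matrix(cluster_sizes):
    """Block-diagonal 0/1 matrix: L_ij = 1 iff i and j belong to the same cluster."""
    n = sum(cluster_sizes)
    L = np.zeros((n, n))
    start = 0
    for k in cluster_sizes:
        L[start:start + k, start:start + k] = 1.0
        start += k
    return L

def bernoulli_flip_errors(L, rho_d, rho_od, rng):
    """Flip entries 0<->1 with probability rho_d inside diagonal blocks, rho_od outside."""
    probs = np.where(L == 1.0, rho_d, rho_od)
    mask = rng.random(L.shape) < probs
    return np.where(mask, 1.0 - L, L)  # observed matrix M = L + S with S in {-1, 0, 1}

def shrink(A, tau):
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svd_shrink(A, tau):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def pcp_alm(M, lam=None, rho=1.5, tol=1e-7, max_iter=500):
    """Inexact ALM for  min ||L||_* + lam ||S||_1  s.t.  L + S = M."""
    n1, n2 = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(n1, n2))
    mu = 0.25 * n1 * n2 / np.abs(M).sum()
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(max_iter):
        L = svd_shrink(M - S + Y / mu, 1.0 / mu)       # singular-value thresholding
        S = shrink(M - L + Y / mu, lam / mu)           # entrywise soft thresholding
        R = M - L - S
        Y += mu * R
        mu = min(mu * rho, 1e7)
        if np.linalg.norm(R) <= tol * np.linalg.norm(M):
            break
    return L, S

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L_true = cluster_matrix([25, 25, 25, 25])   # four equal clusters, rank 4
    M = bernoulli_flip_errors(L_true, rho_d=0.1, rho_od=0.2, rng=rng)
    L_hat, S_hat = pcp_alm(M)
    rel_err = np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true)
    print("success" if rel_err < 1e-3 else "failure", rel_err)
```

As in the experiments above, a trial is declared successful when the relative Frobenius error of the recovered low-rank matrix is below a small threshold. Sweeping rho_d and rho_od with this sketch reproduces the qualitative observation that the more locally incoherent off-diagonal blocks tolerate denser errors than the diagonal blocks.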
learning to transduce with unbounded memory edward grefenstette google deepmind etg karl moritz hermann google deepmind kmh mustafa suleyman google deepmind mustafasul phil blunsom google deepmind and oxford university pblunsom abstract recently strong results have been demonstrated by deep recurrent neural networks on natural language transduction problems in this paper we explore the representational power of these models using synthetic grammars designed to exhibit phenomena similar to those found in real transduction problems such as machine translation these experiments lead us to propose new recurrent networks that implement continuously differentiable analogues of traditional data structures such as stacks queues and deques we show that these architectures exhibit superior generalisation performance to deep rnns and are often able to learn the underlying generating algorithms in our transduction experiments introduction recurrent neural networks rnns offer compelling tool for processing natural language input in straightforward sequential manner many natural language processing nlp tasks can be viewed as transduction problems that is learning to convert one string into another machine translation is prototypical example of transduction and recent results indicate that deep rnns have the ability to encode long source strings and produce coherent translations while elegant the application of rnns to transduction tasks requires hidden layers large enough to store representations of the longest strings likely to be encountered implying wastage on shorter strings and strong dependency between the number of parameters in the model and its memory in this paper we use number of synthetic transduction tasks to explore the ability of rnns to learn reorderings and substitutions further inspired by prior work on neural network implementations of stack data structures we propose and evaluate transduction models based on neural stacks queues and deques double ended queues stack algorithms are to processing the hierarchical structures observed in natural language and we hypothesise that their neural analogues will provide an effective and learnable transduction tool our models provide middle ground between simple rnns and the recently proposed neural turing machine ntm which implements powerful random access memory with read and write operations neural stacks queues and deques also provide logically unbounded memory while permitting efficient constant time push and pop operations our results indicate that the models proposed in this work and in particular the neural deque are able to consistently learn range of challenging transductions while deep rnns based on long memory lstm cells can learn some transductions when tested on inputs of the same length as seen in training they fail to consistently generalise to longer strings in contrast our sequential algorithms are able to learn to reproduce the generating transduction algorithms often generalising perfectly to inputs well beyond those encountered in training related work string transduction is central to many applications in nlp from name transliteration and spelling correction to inflectional morphology and machine translation the most common approach leverages symbolic finite state transducers with approaches based on context free representations also being popular rnns offer an attractive alternative to symbolic transducers due to their simple algorithms and expressive representations however as we show in this work such models are limited 
in their ability to generalise beyond their training data and have memory capacity that scales with the number of their trainable parameters previous work has touched on the topic of rendering discrete data structures such as stacks continuous especially within the context of modelling pushdown automata with neural networks we were inspired by the continuous pop and push operations of these architectures and the idea of an rnn controlling the data structure when developing our own models the key difference is that our work adapts these operations to work within recurrent continuous structure the dynamics of which are fully decoupled from those of the rnn controlling it in our models the backwards dynamics are easily analysable in order to obtain the exact partial derivatives for use in error propagation rather than having to approximate them as done in previous work in parallel effort to ours researchers are exploring the addition of memory to recurrent networks the ntm and memory networks provide powerful random access memory operations whereas we focus on more efficient and restricted class of models which we believe are sufficient for natural language transduction tasks more closely related to our work have sought to develop continuous stack controlled by an rnn note that this the work proposed discrete push and pop operations continuous by mixing information across levels of the stack at each time step according to scalar action values this means the model ends up compressing information in the stack thereby limiting its use as it effectively loses the unbounded memory nature of traditional symbolic models models in this section we present an extensible memory enhancement to recurrent layers which can be set up to act as continuous version of classical stack queue or deque queue we begin by describing the operations and dynamics of neural stack before showing how to modify it to act as queue and extend it to act as deque neural stack let neural stack be differentiable structure onto and from which continuous vectors are pushed and popped inspired by the neural pushdown automaton of we render these traditionally discrete operations continuous by letting push and pop operations be real values in the interval intuitively we can interpret these values as the degree of certainty with which some controller wishes to push vector onto the stack or pop the top of the stack vt st rt vt vt if if max st note that vt vi for all max ut st dt min st max st vt if if formally neural stack fully parametrised by an embedding size is described at some timestep by value matrix vt and strength vector st rt these form the core of recurrent layer which is acted upon by controller by receiving from the controller value vt rm pop signal ut and push signal dt it outputs read vector rt rm the recurrence of this layer comes from the fact that it will receive as previous state of the stack the pair vt st and produce as next state the pair vt st following the dynamics described below here vt represents the ith row an vector of vt and st represents the ith value of st equation shows the update of the value component of the recurrent layer state represented as matrix the number of rows of which grows with time maintaining record of the values pushed to the stack at each timestep whether or not they are still logically on the stack values are appended to the bottom of the matrix top of the stack and never changed equation shows the effect of the push and pop signal in updating the strength vector st to produce st first 
the pop operation removes objects from the stack we can think of the pop value ut as the initial deletion quantity for the operation we traverse the strength vector st from the highest index to the lowest if the next strength scalar is less than the remaining deletion quantity it is subtracted from the remaining quantity and its value is set to if the remaining deletion quantity is less than the next strength scalar the remaining deletion quantity is subtracted from that scalar and deletion stops next the push value is set as the strength for the value added in the current timestep equation shows the dynamics of the read operation which are similar to the pop operation fixed initial read quantity of is set at the top of temporary copy of the strength vector st which is traversed from the highest index to the lowest if the next strength scalar is smaller than the remaining read quantity its value is preserved for this operation and subtracted from the remaining read quantity if not it is temporarily set to the remaining read quantity and the strength scalars of all lower indices are temporarily set to the output rt of the read operation is the weighted sum of the rows of vt scaled by the temporary scalar values created during the traversal an example of the stack read calculations across three timesteps after pushes and pops as described above is illustrated in figure the third step shows how setting the strength to for logically removes from the stack and how it is ignored during the read this completes the description of the forward dynamics of neural stack cast as recurrent layer as illustrated in figure all operations described in this section are the equations describing the backwards dynamics are provided in appendix of the supplementary materials stack grows upwards row row row removed from stack example operation of continuous neural stack prev values next values vt next state previous state previous state prev strengths push dt input pop ut neural stack value vt output rt next strengths st input it split it vt ht next ot ot dt ut neural stack vt st st ht rt vt join neural stack as recurrent layer state output ot rnn controlling stack figure illustrating neural stack operations recurrent structure and control neural queue neural queue operates the same way as neural stack with the exception that the pop operation reads the lowest index of the strength vector st rather than the highest this represents popping and the max and min functions are technically not differentiable for following the work on rectified linear units we arbitrarily take the partial differentiation of the left argument in these cases reading from the front of the queue rather than the top of the stack these operations are described in equations st max st max ut dt rt if if min st max st st vt neural deque neural deque operates likes neural stack except it takes push pop and value as input for both ends of the structure which we call top and bot and outputs read for both ends we write utop and ubot instead of ut vttop and vtbot instead of vt and so on the state vt and st are now matrix and vector respectively at each timestep pop from the top is followed by pop from the bottom of the deque followed by the pushes and reads the dynamics of deque which unlike neural stack or queue grows in two directions are described in equations below equations decompose the strength vector update into three steps purely for notational clarity vtbot vt vtop vt stop max st if if if max utop st sboth max stop both st dbot st ttop dt 
rtop stop if if if min st max rbot max ubot if min st max if st vt st vt to summarise neural deque acts like two neural stacks operated on in tandem except that the pushes and pops from one end may eventually affect pops and reads on the other and vice versa interaction with controller while the three memory modules described can be seen as recurrent layers with the operations being used to produce the next state and output from the input and previous state being fully differentiable they contain no tunable parameters to optimise during training as such they need to be attached to controller in order to be used for any practical purposes in exchange they offer an extensible memory the logical size of which is unbounded and decoupled from both the nature and parameters of the controller and from the size of the problem they are applied to here we describe how any rnn controller may be enhanced by neural stack queue or deque we begin by giving the case where the memory is neural stack as illustrated in figure here we wish to replicate the overall interface of recurrent seen from outside the dotted takes the previous recurrent state ht and an input vector it and transforms them to return the next recurrent state ht and an output vector ot in our setup the previous state ht of the recurrent layer will be the tuple ht rt vt st where ht is the previous state of the rnn rt is the previous stack read and vt st is the previous state of the stack as described above with the exception of which is initialised randomly and optimised during training all other initial states and are set to and not updated during training the overall input it is concatenated with previous read rt and passed to the rnn controller as input along with the previous controller state ht the controller outputs its next state ht and controller output from which we obtain the push and pop scalars dt and ut and the value vector vt which are passed to the stack as well as the network output ot dt sigmoid wd bd ut sigmoid wu bu vt tanh wv bv ot tanh wo bo where wd and wu are projection matrices and bd and bu are their scalar biases wv and wo are projections and bd and bu are their vector biases all randomly intialised and then tuned during training along with the previous stack state vt st the stack operations dt and ut and the value vt are passed to the neural stack to obtain the next read rt and next stack state vt st which are packed into tuple with the controller state ht to form the next state ht of the overall recurrent layer the output vector ot serves as the overall output of the recurrent layer the structure described here can be adapted to control neural queue instead of stack by substituting one memory module for the other the only additional trainable parameters in either configuration relative to rnn are the projections for the input concatenated with the previous read into the rnn controller and the projections from the controller output into the various inputs described above in the case of deque both the top read rtop and bottom read rbot must be preserved in the overall state they are both concatenated with the input to form the input to the rnn controller the output of the controller must have additional projections to output operations and values for the bottom of the deque this roughly doubles the number of additional tunable parameters wrapping the rnn controller compared to the case experiments in every experiment source and target sequence pairs are presented to the candidate model as batch of single joint 
sequences the joint sequence starts with sos symbol and ends with an eos symbol with separator symbol separating the source and target sequences symbols are converted to embeddings via an embedding matrix which is randomly initialised and tuned during training separate mappings are used for source and target vocabularies separate embedding matrices are used to encode input and output predicted embeddings synthetic transduction tasks the aim of each of the following tasks is to read an input sequence and generate as target sequence transformed version of the source sequence followed by an eos symbol source sequences are randomly generated from vocabulary of meaningless symbols the length of each training source sequence is uniformly sampled from unif and each symbol in the sequence is drawn with replacement from uniform distribution over the source vocabulary ignoring sos and separator deterministic transformation described for each task below is applied to the source sequence to yield the target sequence as the training sequences are entirely determined by the source sequence there are close to training sequences for each task and training examples are sampled from this space due to the random generation of source sequences the following steps are followed before each training and test sequence are presented to the models the sos symbol hsi is prepended to the source sequence which is concatenated with separator symbol and the target sequences to which the eos symbol is appended sequence copying the source sequence is copied to form the target sequence sequences have the form ak ak sequence reversal the source sequence is deterministically reversed to produce the target sequence sequences have the form ak bigram flipping the source side is restricted to sequences the target is produced by swapping for all odd source sequence indices odd the ith symbol with the th symbol sequences have the form ak ak ak ak itg transduction tasks the following tasks examine how well models can approach sequence transduction problems where the source and target sequence are jointly generated by inversion transduction grammars itg subclass of synchronous grammars often used in machine translation we present two simple datasets with interesting linguistic properties and their underlying grammars we show these grammars in table in appendix of the supplementary materials for each synchronised an expansion is chosen according to the probability distribution specified by the rule probability at the beginning of each rule for each grammar is always the root of the itg tree we tuned the generative probabilities for recursive rules by hand so that the grammars generate left and right sequences of lengths to with relatively uniform distribution we generate training data by rejecting samples that are outside of the range and testing data by rejecting samples outside of the range for terminal rules we balance the classes so that for symbols in the grammar each generates vocabulary of approximately and each each vocabulary word under that class is equiprobable these design choices were made to maximise the similarity between the experimental settings of the itg tasks described here and the synthetic tasks described above to persistent challenge in machine translation is to learn to faithfully reproduce syntactic divergences between languages for instance when translating an english sentence with verb into german transducer must locate and move the verb over the object to the final position we simulate this phenomena with 
synchronous grammar which generates strings exhibiting verb movements to add an extra challenge we also simulate simple relative clause embeddings to test the models ability to transduce in the presence of unbounded recursive structures sample output of the grammar is presented here with spaces between words being included for stylistic purposes and where and indicate subject object and verb terminals respectively and mark input and output and rp indicates relative pronoun rpi rpo genderless to gendered grammar we design small grammar to simulate translations from language with articles to one with definite and indefinite articles real world example of such translation would be from english the to german the grammar simulates sentences in or form where every noun phrase can become an infinite sequence of nouns joined by conjunction each noun in the source language has neutral definite or indefinite article the matching word in the target language then needs to be preceeded by its appropriate article sample output of the grammar is presented here with spaces between words being included for stylistic purposes the and the das und der evaluation for each task test data is generated through the same procedure as training data with the key difference that the length of the source sequence is sampled from unif as result of this change we not only are assured that the models can not observe any test sequences during training but are also measuring how well the sequence transduction capabilities of the evaluated models generalise beyond the sequence lengths observed during training to control for generalisation ability we also report accuracy scores on sequences separately sampled from the training set which given the size of the sample space are unlikely to have ever been observed during actual model training for each round of testing we sample sequences from the appropriate test set for each sequence the model reads in the source sequence and separator symbol and begins generating the next symbol by taking the maximally likely symbol from the softmax distribution over target symbols produced by the model at each step based on this process we give each model coarse accuracy score corresponding to the proportion of test sequences correctly predicted from beginning until end eos symbol without error as well as fine accuracy score corresponding to the average proportion of each sequence correctly generated before the first error formally we have coarse correct seqs ine seqs correcti seqs where correct and seqs are the number of correctly predicted sequences and the total number of sequences in the test batch in this experiment respectively correcti is the number of correctly predicted symbols before the first error in the ith sequence of the test batch and is the length of the target segment that sequence including eos symbol models compared and experimental setup for each task we use as benchmarks the deep lstms described in with and layers against these benchmarks we evaluate neural and lstms when running experiments we trained and tested version of each model where all lstms in each model have hidden layer size of and one for hidden layer size of the embedding size was arbitrarily set to half the maximum hidden size the number of parameters for each model are reported for each architecture in table of the appendix concretely the neural and lstms have the same number of trainable parameters as deep lstm these all come from the extra connections to and from the memory module which itself has no 
trainable parameters regardless of its logical size models are trained with minibatch rmsprop with batch size of we learning rates across the set we used gradient clipping clipping all gradients above average training perplexity was calculated every batches training and test set accuracies were recorded every batches results and discussion because of the impossibility of overfitting the datasets we let the models train an unbounded number of steps and report results at convergence we present in figure the and accuracies for each task of the best model of each architecture described in this paper alongside the best performing deep lstm benchmark the best models were automatically selected based on average training perplexity the lstm benchmarks performed similarly across the range of random initialisations so the effect of this procedure is primarily to try and select the better performing lstm in most cases this procedure does not yield the actual bestperforming model and in practice more sophisticated procedure such as ensembling should produce better results for all experiments the neural stack or queue outperforms the deep lstm benchmarks often by significant margin for most experiments if neural or lstm learns to partially or consistently solve the problem then so does the neural deque for experiments where the enhanced lstms solve the problem completely consistent accuracy of in training the accuracy persists in longer sequences in the test set whereas benchmark accuracies drop for training experiment testing model lstm coarse fine coarse fine sequence reversal lstm bigram flipping lstm svo to sov lstm gender conjugation lstm sequence copying comparing enhanced lstms to best benchmarks comparison of model convergence during training figure results on the transduction tasks and convergence properties all experiments except the svo to sov and gender conjugation itg transduction tasks across all tasks which the enhanced lstms solve the convergence on the top accuracy happens orders of magnitude earlier for enhanced lstms than for benchmark lstms as exemplified in figure the results for the sequence inversion and copying tasks serve as unit tests for our models as the controller mainly needs to learn to push the appropriate number of times and then pop continuously nonetheless the failure of deep lstms to learn such regular pattern and generalise is itself indicative of the limitations of the benchmarks presented here and of the relative expressive power of our models their ability to generalise perfectly to sequences up to twice as long as those attested during training is also notable and also attested in the other experiments finally this pair of experiments illustrates how while the neural queue solves copying and the stack solves reversal simple lstm controller can learn to operate deque as either structure and solve both tasks the results of the bigram flipping task for all models are consistent with the failure to consistently correctly generate the last two symbols of the sequence we hypothesise that both deep lstms and our models economically learn to pairwise flip the sequence tokens and attempt to do so half the time when reaching the eos token for the two itg tasks the success of deep lstm benchmarks relative to their performance in other tasks can be explained by their ability to exploit short local dependencies dominating the longer dependencies in these particular grammars overall the rapid convergence where possible on general solution to transduction problem in manner 
which propagates to longer sequences without loss of accuracy is indicative that an unbounded controller can learn to solve these problems procedurally rather than memorising the underlying distribution of the data conclusions the experiments performed in this paper demonstrate that lstms enhanced by an unbounded differentiable memory capable of acting in the limit like classical stack queue or deque are capable of solving transduction tasks for which deep lstms falter even in tasks for which benchmarks obtain high accuracies the lstms converge earlier and to higher accuracies while requiring considerably fewer parameters than all but the simplest of deep lstms we therefore believe these constitute crucial addition to our neural network toolbox and that more complex linguistic transduction tasks such as machine translation or parsing will be rendered more tractable by their inclusion references ilya sutskever oriol vinyals and quoc le sequence to sequence learning with neural networks in ghahramani welling cortes lawrence and weinberger editors advances in neural information processing systems pages curran associates kyunghyun cho bart van merrienboer caglar gulcehre fethi bougares holger schwenk and yoshua bengio learning phrase representations using rnn for statistical machine translation arxiv preprint gz sun lee giles hh chen and yc lee the neural network pushdown automaton model stack and learning simulations alex graves greg wayne and ivo danihelka neural turing machines corr alex graves supervised sequence labelling with recurrent neural networks volume of studies in computational intelligence springer markus dreyer jason smith and jason eisner modeling of string transductions with methods in proceedings of the conference on empirical methods in natural language processing emnlp pages stroudsburg pa usa association for computational linguistics cyril allauzen michael riley johan schalkwyk wojciech skut and mehryar mohri openfst general and efficient weighted transducer library in implementation and application of automata volume of lecture notes in computer science pages springer berlin heidelberg dekai wu stochastic inversion transduction grammars and bilingual parsing of parallel corpora computational linguistics alex graves sequence transduction with recurrent neural networks in representation learning worksop icml sreerupa das lee giles and sun learning grammars capabilities and limitations of recurrent neural network with an external stack memory in proceedings of the fourteenth annual conference of cognitive science society indiana university sreerupa das lee giles and sun using prior knowledge in nnpda to learn languages advances in neural information processing systems sainbayar sukhbaatar arthur szlam jason weston and rob fergus weakly supervised memory networks corr wojciech zaremba and ilya sutskever reinforcement learning neural turing machines arxiv preprint armand joulin and tomas mikolov inferring algorithmic patterns with recurrent nets arxiv preprint vinod nair and geoffrey hinton rectified linear units improve restricted boltzmann machines in proceedings of the international conference on machine learning pages alfred aho and jeffrey ullman the theory of parsing translation and compiling dekai wu and hongsing wong machine translation with stochastic grammatical channel in proceedings of the international conference on computational pages association for computational linguistics tijmen tieleman and geoffrey hinton lecture divide the gradient by running average of 
its recent magnitude coursera neural networks for machine learning razvan pascanu tomas mikolov and yoshua bengio understanding the exploding gradient problem computing research repository corr zhou jianxin wu and wei tang ensembling neural networks many could be better than all artificial intelligence 
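As an illustration of the continuous stack described in the models section above, the following numpy sketch reconstructs the forward dynamics of a single timestep: values are appended to a value matrix, the strength vector is updated by the pop signal u_t and then the push signal d_t, and the read traverses at most one unit of strength from the top of the stack downwards. This is a reconstruction from the prose description, not the authors' code; it covers only the forward read-side dynamics and omits the controller and the backward pass.

import numpy as np

def stack_step(V_prev, s_prev, v_t, d_t, u_t):
    # one forward step of the continuous neural stack
    # V_prev: (t-1, m) previously pushed values; s_prev: (t-1,) their strengths
    # v_t: (m,) value to push; d_t, u_t: push/pop signals in [0, 1]
    V_t = np.vstack([V_prev, v_t[None, :]]) if V_prev.size else v_t[None, :]

    # pop: traverse strengths from the top (highest index) down, removing up to u_t in total
    s_t = np.empty(len(s_prev) + 1)
    remaining = u_t
    for i in reversed(range(len(s_prev))):
        removed = min(remaining, s_prev[i])
        s_t[i] = s_prev[i] - removed
        remaining -= removed
    s_t[-1] = d_t                                # push the new value with strength d_t

    # read: take at most one unit of strength from the top down and
    # return the correspondingly weighted sum of value rows
    r_t = np.zeros(V_t.shape[1])
    remaining = 1.0
    for i in reversed(range(len(s_t))):
        w = min(remaining, s_t[i])
        r_t += w * V_t[i]
        remaining -= w
    return V_t, s_t, r_t

# tiny usage example: push a, push b, then pop before reading
V, s = np.zeros((0, 2)), np.zeros(0)
V, s, r = stack_step(V, s, np.array([1.0, 0.0]), d_t=1.0, u_t=0.0)  # push a
V, s, r = stack_step(V, s, np.array([0.0, 1.0]), d_t=1.0, u_t=0.0)  # push b, read returns b
V, s, r = stack_step(V, s, np.array([0.0, 0.0]), d_t=0.0, u_t=1.0)  # pop b, read returns a
print(r)   # approximately [1, 0]: once b is popped, the read recovers a

The neural queue of the paper is obtained by traversing the strength vector from the lowest index instead of the highest in the pop and read loops; the deque runs two such traversals, one from each end, on a matrix that grows in both directions.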
deep generative models chongxuan jun tianlin bo dept of comp sci state key lab of intell tech tnlist lab center for computing research tsinghua university beijing china dept of comp stanford university stanford ca usa dcszj dcszb abstract deep generative models dgms are effective on learning multilayered representations of complex data and performing inference of input data by exploring the generative ability however little work has been done on examining or empowering the discriminative ability of dgms on making accurate predictions this paper presents deep generative models mmdgms which explore the strongly discriminative principle of learning to improve the discriminative power of dgms while retaining the generative capability we develop an efficient doubly stochastic subgradient algorithm for the piecewise linear objective empirical results on mnist and svhn datasets demonstrate that maxmargin learning can significantly improve the prediction performance of dgms and meanwhile retain the generative ability and mmdgms are competitive to the fully discriminative networks by employing deep convolutional neural networks cnns as both recognition and generative models introduction learning has been effective on learning discriminative models with many examples such as support vector machines svms and markov networks or structured svms however the size of complex data makes it hard to construct such fully discriminative model which has only single layer of adjustable weights due to the facts that the manually constructed features may not well capture the underlying statistics and fully discriminative approach can not reconstruct the input data when noise or missing values are present to address the first challenge previous work has considered incorporating latent variables into model including partially observed maximum entropy discrimination markov networks structured latent svms and models all this work has primarily focused on shallow structure of latent variables to improve the flexibility learning svms with deep latent structure has been presented in however these methods do not address the second challenge which requires generative model to describe the inputs the recent work on learning generative models includes harmoniums maxmargin topic models and nonparametric bayesian latent svms which can infer the dimension of latent features from data however these methods only consider the shallow structure of latent variables which may not be flexible enough to describe complex data much work has been done on learning generative models with deep structure of nonlinear hidden variables including deep belief networks autoregressive models and stochastic variations of neural networks for such models inference is challenging problem but fortunately there exists much recent progress on stochastic variational inference algorithms however the primary focus of deep generative models dgms has been on unsupervised learning with the goals of learning latent representations and generating input samples though the latent representations can be used with downstream classifier to make predictions it is often beneficial to learn joint model that considers both input and response variables one recent attempt is the conditional generative models which treat labels as conditions of dgm to describe input data this conditional dgm is learned in setting which is not exclusive to ours in this paper we revisit the principle and present deep generative model mmdgm which learns representations that are good for both 
classification and input inference our mmdgm conjoins the flexibility of dgms on describing input data and the strong discriminative ability of learning on making accurate predictions we formulate mmdgm as solving variational inference problem of dgm regularized by set of posterior constraints which bias the model to learn representations that are good for prediction we define the posterior constraints as linear functional of the target variational distribution of the latent presentations then we develop doubly stochastic subgradient descent algorithm which generalizes the pagesos algorithm to consider nontrivial latent variables for the variational distribution we build recognition model to capture the nonlinearity similar as in we consider two types of networks used as our recognition and generative models multiple layer perceptrons mlps as in and convolutional neural networks cnns though cnns have shown promising results in various domains especially for image classification little work has been done to take advantage of cnn to generate images the recent work presents type of cnn to map manual features including class labels to rbg chair images by applying unpooling convolution and rectification sequentially but it is deterministic mapping and there is no random generation generative adversarial nets employs single such layer together with mlps in minimax game framework with primary goal of generating images we propose to stack this structure to form highly deep generative network to generate images from latent variables learned automatically by recognition model using standard cnn we present the detailed network structures in experiments part empirical results on mnist and svhn datasets demonstrate that mmdgm can significantly improve the prediction performance which is competitive to the methods while retaining the capability of generating input samples and completing their missing values basics of deep generative models we start from general setting where we have data xn deep generative model dgm assumes that each xn rd is generated from vector of latent variables zn rk which itself follows some distribution the joint probability of dgm is as follows zn xn where zn is the prior of the latent variables and xn is the likelihood model for generating observations for notation simplicity we define depending on the structure of various dgms have been developed such as the deep belief networks deep sigmoid networks deep latent gaussian models and deep autoregressive models in this paper we focus on the directed dgms which can be easily sampled from via an ancestral sampler however in most cases learning dgms is challenging due to the intractability of posterior inference the methods resort to stochastic variational methods under the maximum likelihood estimation mle framework argmaxθ log specifically let be the variational distribution that approximates the true posterior variational upper bound of the per sample negative nll log xn is zn xn kl zn zn eq zn log xn where kl is the kl divergence between distributions and then zn xn upper bounds the full negative log it is important to notice that if we do not make restricting assumption on the variational distribution the lower bound is tight by simply setting that is the mle is equivalent to solving the variational problem minθ however since the true posterior is intractable except handful of special cases we must resort to approximation methods one common assumption is that the variational distribution is of some parametric form qφ and then 
we optimize the variational bound the variational parameters for dgms another challenge arises that the variational bound is often intractable to compute analytically to address this challenge the early work further bounds the intractable parts with tractable ones by introducing more variational parameters however this technique increases the gap between the bound being optimized and the potentially resulting in poorer estimates much recent progress has been made on hybrid monte carlo and variational methods which approximates the intractable expectations and their gradients over the parameters via some unbiased monte carlo estimates furthermore to handle datasets stochastic optimization of the variational objective can be used with suitable learning rate annealing scheme it is important to notice that variance reduction is key part of these methods in order to have fast and stable convergence most work on directed dgms has been focusing on the generative capability on inferring the observations such as filling in missing values while little work has been done on investigating the predictive power except the dgms which builds dgm conditioned on the class labels and learns the parameters via mle below we present deep generative models which explore the discriminative principle to improve the predictive ability of the latent representations while retaining the generative capability deep generative models we consider supervised learning where the training data is pair with input features rd and the ground truth label without loss of generality we consider the classification where deep generative model mmdgm consists of two components deep generative model to describe input features and classifier to consider supervision for the generative model we can in theory adopt any dgm that defines joint distribution over as in eq for the classifier instead of fitting the input features into conventional svm we define the linear classifier on the latent representations whose learning will be regularized by the supervision signal as we shall see specifically if the latent representation is given we define the latent discriminant function where is an vector that concatenates subvectors with the yth being and all others being zero and is the corresponding weight vector we consider the case that is random vector following some prior distribution then our goal is to infer the posterior distribution which is typically approximated by variational distribution for computational tractability notice that this posterior is different from the one in the vanilla dgm we expect that the supervision information will bias the learned representations to be more powerful on predicting the labels at testing to account for the uncertainty of we take the expectation and define the discriminant function eq and the final prediction rule that maps inputs to outputs is argmax note that different from the conditional dgm which puts the class labels upstream the above classifier is downstream model in the sense that the supervision signal is determined by conditioning on the latent representations the learning problem we want to jointly learn the parameters and infer the posterior distribution based on the equivalent variational formulation of mle we define the joint learning problem as solving min ξn eq ξn ξn where yn zn zn is the difference of the feature vectors is the loss function that measures the cost to predict if the true label is yn and is nonnegative regularization parameter balancing the two components in the objective 
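Using the definitions just given, one consistent way to write the joint learning problem sketched above (reconstructed from the surrounding text, not a verbatim copy of the paper's equation) is

\[
\min_{\theta,\,\phi,\,q(\boldsymbol\eta),\,\xi\ge 0}\;
\mathcal{L}(\theta,\phi;\mathbf{X}) \;+\; C\sum_{n=1}^{N}\xi_n
\quad\text{s.t.}\quad
\mathbb{E}_{q}\!\left[\boldsymbol\eta^{\top}\Delta f(y,\mathbf{z}_n;\mathbf{x}_n)\right]\;\ge\;\ell_{\Delta}(y,y_n)-\xi_n,\;\;\forall n,\;\forall y,
\]

where \(\Delta f(y,\mathbf{z}_n;\mathbf{x}_n)=f(y_n,\mathbf{z}_n)-f(y,\mathbf{z}_n)\) is the difference of the feature vectors, \(\ell_{\Delta}\) is the loss, \(C\) is the regularization parameter, and \(\mathcal{L}\) is the variational bound defined in the next sentence.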
the variational bound is defined as kl eq log and the margin constraints are from the classifier if we ignore the constraints setting at the solution of will be exactly the bayesian posterior and the problem is equivalent to do mle for by absorbing the slack variables we can rewrite the problem in an unconstrained form min cr pn where the hinge loss is eq due to the convexity of max function it is easy to verify that the hinge loss is an upper bound of the training error of classifier that is furthermore the hinge loss is convex functional over the variational distribution because of the linearity of the expectation operator these properties render the hinge loss as good surrogate to optimize over previous work has explored this idea to learn discriminative topic models but with restriction on the shallow structure of hidden variables our work presents significant extension to learn deep generative models which pose new challenges on the learning and inference the doubly stochastic subgradient algorithm the variational formulation of problem naturally suggests that we can develop variational algorithm to address the intractability of the true posterior we now present new algorithm to solve problem our method is doubly stochastic generalization of the pegasos primal estimated solver for svm algorithm for the classic svms with fully observed input features with the new extension of dealing with highly nontrivial structure of latent variables first we make the structured smf assumption that qφ under the assumption we have the discriminant function as eq eq eqφ moreover we can solve for the optimal solution of in some analytical form in fact by the calculus of variations we can show that given the other parts the solution is exp ωn eqφ where are the lagrange multipliers see for tails if the prior is normal we have the normal posterior where ωny eqφ therefore even though we did not make parametric form assumption of the above results show that the optimal posterior distribution of is gaussian since we only use the expectation in the optimization problem and in prediction we can directly solve for the mean parameter instead of further in this case we can verify that kl and then the equivalent objective function in terms of can be written as min cr pn where xn is the total hinge loss and the is xn eqφ below we present doubly stochastic subgradient descent algorithm to solve this problem the first stochasticity arises from stochastic estimate of the objective by random specifically the batch learning needs to scan the full dataset to compute subgradients which is often too expensive to deal with datasets one effective technique is to do stochastic subgradient descent where at each iteration we randomly draw of the training data and then do the variational updates over the small formally given mini batch of size we get an unbiased estimate of the objective nc xn xn the second stochasticity arises from stochastic estimate of the variational bound and its subgradient whose intractability calls for another monte carlo estimator formally let zln qφ yn be set of samples from the variational distribution where we explicitly put the conditions then an estimate of the variational bound and the is xn zl xn log xn zln qφ zln where zln yn zln zln note that is an unbiased estimate of while is biased estimate of nevertheless we can still show that is an upper bound estimate of under expectation furthermore this biasedness does not affect our estimate of the gradient in fact by using the equality qφ qφ log qφ we 
can construct an unbiased monte carlo estimate of xn xn as gφ log zln xn log qφ zln cλ zln log qφ zln where the last term roots from the hinge loss with the prediction argmaxy zln for and the estimates of the gradient xn and the subgradient xn are easier which are gθ log xn zln gλ zln yn zln notice that the sampling and the gradient not the underlying model log qφ zln only depend on the variational distribution the above estimates consider the algorithm doubly stochastic subgradient algorithm initialize and eral case where the variational bound is repeat intractable in some cases we can comdraw random of data points pute the term analytidraw random samples from noise distribution cally when the prior and the variational distribution are both gaussian compute subgradient xm in such cases we only need to estimate update parameters using subgradient the rest intractable part by sampling until converge which often reduces the variance return and similarly we could use the expectation of the features directly if it can be computed analytically in the computation of subgradients gθ and gλ instead of sampling which again can lead to variance reduction with the above estimates of subgradients we can use stochastic optimization methods such as sgd and adam to update the parameters as outlined in alg overall our algorithm is doubly stochastic generalization of pegasos to deal with the highly nontrivial latent variables now the remaining question is how to define an appropriate variational distribution qφ to obtain robust estimate of the subgradients as well as the objective two types of methods have been developed for unsupervised dgms namely variance reduction and variational bayes avb though both methods can be used for our models we focus on the avb approach for continuous variables under certain mild conditions we can reparameterize the variational distribution qφ using some simple variables specifically we can draw samples from some simple distribution and do the transformation gφ to get the sample of the distribution we refer the readers to for more details in our experiments we consider the special gaussian case where we assume that the variational distribution is multivariate gaussian with diagonal covariance matrix qφ whose mean and variance are functions of the input data this defines our recognition model then the reparameterization trick is as follows we first draw standard normal variables and then do the transformation zln xn yn xn yn to get sample for simplicity we assume that both the mean and variance are function of only however it is worth to emphasize that although the recognition model is unsupervised the parameters are learned in supervised manner because the subgradient depends on the hinge loss further details of the experimental settings are presented in sec experiments we now present experimental results on the widely adopted mnist and svhn datasets though mmdgms are applicable to any dgms that define joint distribution of and we concentrate on the variational va which is unsupervised we denote our mmdgm with va by mmva in our experiments we consider two types of recognition models multiple layer perceptrons mlps and convolutional neural networks cnns we implement all experiments based on theano architectures and settings in the mlp case we follow the settings in to compare both generative and discriminative capacity of va and mmva in the cnn case we use standard convolutional nets with convolution and operation as the recognition model to obtain more competitive 
classification results for the generative model we use unconvnets with symmetric structure as the recognition model to reconstruct the input images approximately more specifically the generative model has the same structure as the recognition model but replacing with unpooling operation and applies unpooling convolution and rectification in order the total number of parameters in the convolutional network is comparable with previous work for simplicity we do not involve mlpconv layers and contrast normalization layers in our recognition model but they are not exclusive to our model we illustrate details of the network architectures in appendix in both settings the mean and variance of the latent are transformed from the last layer of the recognition model through linear operation it should be noticed that we could use not only the expectation of but also the activation of any layer in the recognition model as features the only theoretical difference is from where we add hinge loss regularization to the gradient and backpropagate it to previous layers in all of the experiments the mean of has the same nonlinearity but typically much lower dimension than the activation of the last layer in the recognition model and hence often leads to worse performance in the mlp case we concatenate the activations of layers as the features used in the supervised tasks in the cnn case we use the activations of the last layer as the features we use adam to optimize parameters in all of the models although it is an adaptive optimization method we decay the global learning rate by factor three periodically after sufficient number of epochs to ensure stable convergence we denote our mmdgm with mlps by mmva to perform classification using va we first learn the feature representations by va and then build linear svm classifier on these features using the pegasos stochastic subgradient algorithm this baseline will be denoted by the corresponding models with cnns are denoted by cmmva and respectively results on the mnist dataset we present both the prediction performance and the results on generating samples of mmva and with both kinds of recognition models on the mnist dataset which consists of images of different classes to of size with training samples validating samples and testing samples table error rates on mnist dataset odel rror ate predictive performance in the mlp case we only use ing data and the parameters for classification are mmva optimized according to the validation set we cmmva choose for mmva and initialize it with stochastic pooling an unsupervised procedure in classinetwork in network fication first three rows in table compare maxout network and mmva dsn where refers to the best fully supervised model in our model outperforms the baseline significantly we further use the algorithm to embed the features learned by va and mmva on plane which again demonstrates the stronger discriminative ability of mmva see appendix for details in the cnn case we use training data table shows the effect of on classification error rate and variational lower bound typically as gets lager cmmva learns more discriminative features and leads to worse estimation of data likelihood however if is too small the supervision is not enough to lead to predictive features nevertheless is quite good the source code is available at https va mmva cva cmmva figure randomly generated images by va and mmva epochs randomly generated images by cva and cmmva epochs between the classification performance and generative performance and 
this is the default setting of cmmva on mnist throughout this paper in this setting the classification performance of our cmmva model is comparable to the recent fully discriminative networks without data augmentation shown in the last four rows of table table effects of on mnist dataset generative performance with cnn recognition model we further investigate the generative capability of mmva rror ate ower ound on generating samples fig illustrates the images domly sampled from va and mmva models where we output the expectation of the gray value at each pixel to get smooth visualization we do not our model in all settings when generating data to prove that mmva cmmva remains the generative capability of dgms results on the svhn street view house numbers dataset svhn is large dataset consisting of color images of size the task is to recognize center digits in natural scene images which is significantly harder than classification of digits we follow the work to split the dataset into training data validating data and testing data and preprocess the data by local contrast normalization lcn we only consider the cnn recognition model here the network structure is similar to that in mnist we set for our cmmva model on svhn by default table shows the predictive performance in this more challenging problem we observe larger improvement by cmmva as compared to suggesting that dgms benefit lot from learning on image classification we also compare cmmva with results to the best of our knowledge there is no competitive generative models to classify digits on svhn dataset with full labels table error rates on svhn dataset odel rror ate cmmva cnn stochastic pooling maxout network network in network dsn we further compare the generative capability of cmmva and cva to examine the benefits from jointly training of dgms and classifiers though cva gives tighter lower bound of data likelihood and reconstructs data more elaborately it fails to learn the pattern of digits in complex scenario and could not generate meaningful images visualization of random samples from cva and cmmva is shown in fig in this scenario the hinge loss regularization on recognition model is useful for generating main objects to be classified in images missing data imputation and classification finally we test all models on the task of missing data imputation for mnist we consider two types of missing values each pixel is missing randomly with probability and rect rectangle located at the center of the image is missing given the perturbed images we uniformly initialize the missing values between and and then iteratively do the following steps using the recognition model to sample the hidden variables predicting the missing values to generate images and using the refined images as the input of the next round for svhn we do the same procedure as in mnist but initialize the missing values with guassian training data cva cmmva cmmva figure training data after lcn preprocessing random samples from cva random samples from cmmva when and respectively random variables as the input distribution changes visualization results on mnist and svhn are presented in appendix and appendix respectively intuitively generative models with cnns table mse on mnist data with missing values in could be more powerful on learning patthe testing procedure terns and structures while va mmva cva cmmva generative models with mlps lean more oise ype and rop to reconstruct the pixels in detail this and rop conforms to the mse results shown in and rop table cva and 
cmmva outperform and rop va and mmva with missing ect gle while va and mmva outperform ect cva and cmmva with random ect ing values compared with the baseline ect mmdgms also make more accurate completion when large patches are missing all of the models infer missing values for iterations we also compare the classification performance of cva cnn and cmmva with rect missing values in testing procedure in appendix cmmva outperforms both cva and cnn overall mmdgms have comparable capability of inferring missing values and prefer to learn highlevel patterns instead of local details conclusions we propose deep generative models mmdgms which conjoin the predictive power of principle and the generative ability of deep generative models we develop doubly stochastic subgradient algorithm to learn all parameters jointly and consider two types of recognition models with mlps and cnns respectively in both cases we present extensive results to demonstrate that mmdgms can significantly improve the prediction performance of deep generative models while retaining the strong generative ability on generating input samples as well as completing missing values in fact by employing cnns in both recognition and generative models we achieve low error rates on mnist and svhn datasets which are competitive to the fully discriminative networks acknowledgments the work was supported by the national basic research program program of china nos national nsf of china nos tsinghua tnlist lab big data initiative and tsinghua initiative scientific research program nos references altun tsochantaridis and hofmann hidden markov support vector machines in icml bastien lamblin pascanu bergstra goodfellow bergeron bouchard wardefarley and bengio theano new features and speed improvements in deep learning and unsupervised feature learning nips workshop bengio laufer alain and yosinski deep generative stochastic networks trainable by backprop in icml chen zhu sun and xing predictive latent subspace learning for data analysis ieee trans on pami cortes and vapnik networks journal of machine learning dosovitskiy springenberg and brox learning to generate chairs with convolutional neural networks goodfellow abadie mirza xu farley courville and bengio generative adversarial nets in nips goodfellow mirza courville and bengio maxout networks in icml gregor danihelka mnih blundell and wierstra deep autoregressive networks in icml kingma and ba adam method for stochastic optimization in iclr kingma rezende mohamed and welling learning with deep generative models in nips kingma and welling variational bayes in iclr larochelle and murray the neural autoregressive distribution estimator in aistats lecun bottou bengio and haffner learning applied to document recognition in proceedings of the ieee lee xie gallagher zhang and tu nets in aistats lee grosse ranganath and ng convolutional deep belief networks for scalable unsupervised learning of hierarchical representations in icml lin chen and yan network in network in iclr little and rubin statistical analysis with missing data jmlr matten and hinton visualizing data using jmlr miller kumar packer goodman and koller models in aistats mnih and gregor neural variational inference and learning in belief networks in icml netzer wang coates bissacco wu and ng reading digits in natural images with unsupervised feature learning nips workshop on deep learning and unsupervised feature learning ranzato susskind mnih and hinton on deep generative models with applications to recognition in cvpr rezende 
mohamed and wierstra stochastic backpropagation and approximate inference in deep generative models in icml salakhutdinov and hinton deep boltzmann machines in aistats saul jaakkola and jordan mean field theory for sigmoid belief networks journal of ai research sermanet chintala and lecun convolutional neural networks applied to house numbers digit classification in icpr singer srebro and cotter pegasos primal estimated solver for svm mathematical programming series tang deep learning using linear support vector machines in challenges on representation learning workshop icml taskar guestrin and koller markov networks in nips tsochantaridis hofmann joachims and altun support vector machine learning for interdependent and structured output spaces in icml yu and joachims learning structural svms with latent variables in icml zeiler and fergus stochastic pooling for regularization of deep convolutional neural networks in iclr zhu ahmed and xing medlda maximum margin supervised topic models jmlr zhu chen perkins and zhang gibbs topic models with data augmentation jmlr zhu chen and xing bayesian inference with posterior regularization and applications to infinite latent svms jmlr zhu xing and zhang partially observed maximum entropy discrimination markov networks in nips 
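As a supplement to the doubly stochastic subgradient algorithm described above, the following numpy sketch shows the two sources of stochasticity in a single update of the classifier component: a random mini-batch, and a reparameterised noise draw z = mu_phi(x) + sigma_phi(x) * eps, followed by a Pegasos-style multiclass hinge subgradient step on the sampled latent features. The recognition weights, step size and sizes are illustrative placeholders (the paper uses an MLP or CNN recognition model), the l2 term stands in for the Gaussian prior on eta, and the gradients with respect to theta and phi that the full algorithm also takes are omitted.

import numpy as np

rng = np.random.default_rng(0)
D, K, n_classes, B = 20, 8, 10, 32            # toy sizes: input dim, latent dim, classes, mini-batch

# stand-in for the recognition network q_phi(z|x): linear maps to (mu, log variance)
W_mu  = rng.standard_normal((D, K)) * 0.1
W_sig = rng.standard_normal((D, K)) * 0.01

def hinge_subgradient(eta, Z, y, C=1.0):
    # subgradient of  0.5*||eta||^2 + (C/B) * sum_n max_y [ l(y, y_n) + eta_y . z_n - eta_{y_n} . z_n ]
    Bsz = Z.shape[0]
    scores = Z @ eta.T                                    # (B, n_classes)
    aug = scores + 1.0                                    # cost-augmented scores, l(y, y_n) = 1 for y != y_n
    aug[np.arange(Bsz), y] = scores[np.arange(Bsz), y]    # l(y_n, y_n) = 0
    y_hat = aug.argmax(axis=1)                            # most violating label per example
    g = np.zeros_like(eta)
    for b in range(Bsz):
        if y_hat[b] != y[b]:                              # hinge is active for this example
            g[y_hat[b]] += Z[b]
            g[y[b]]    -= Z[b]
    return eta + C * g / Bsz                              # l2 term standing in for the prior on eta

# --- one doubly stochastic update ---
X = rng.standard_normal((B, D))                           # random mini-batch        (stochasticity 1)
y = rng.integers(0, n_classes, size=B)
eps = rng.standard_normal((B, K))                         # noise draw               (stochasticity 2)
mu, sigma = X @ W_mu, np.exp(0.5 * (X @ W_sig))
Z = mu + sigma * eps                                      # reparameterised samples z ~ q_phi(z|x)

eta = np.zeros((n_classes, K))
eta -= 0.01 * hinge_subgradient(eta, Z, y)                # classifier step; the full algorithm also
                                                          # backpropagates hinge + bound into theta, phi
print(eta.shape)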
spherical random features for polynomial kernels jeffrey pennington felix yu sanjiv kumar google research jpennin felixyu sanjivk abstract compact explicit feature maps provide practical framework to scale kernel methods to learning but deriving such maps for many types of kernels remains challenging open problem among the commonly used kernels for nonlinear classification are polynomial kernels for which low approximation error has thus far necessitated explicit feature maps of large dimensionality especially for polynomials meanwhile because polynomial kernels are unbounded they are frequently applied to data that has been normalized to unit norm the question we address in this work is if we know priori that data is normalized can we devise more compact map we show that putative affirmative answer to this question based on random fourier features is impossible in this setting and introduce new approximation paradigm spherical random fourier srf features which circumvents these issues and delivers compact approximation to polynomial kernels for data on the unit sphere compared to prior work srf features are less more compact and achieve better kernel approximation especially for polynomials the resulting predictions have lower variance and typically yield better classification accuracy introduction kernel methods such as nonlinear support vector machines svms provide powerful framework for nonlinear learning but they often come with significant computational cost their training complexity varies from to which becomes prohibitive when the number of training examples grows to the millions testing also tends to be slow with an nd complexity for vectors explicit kernel maps provide practical alternative for applications since they rely on properties of linear methods which can be trained in time and applied in time independent of the idea is to determine an explicit nonlinear map rd rd such that hz and to perform linear learning in the resulting feature space this procedure can utilize the fast training and testing of linear methods while still preserving much of the expressive power of the nonlinear methods following this reasoning rahimi and recht proposed procedure for generating such nonlinear map derived from the monte carlo integration of an inverse fourier transform arising from bochner theorem explicit nonlinear random feature maps have also been proposed for other types of kernels such as intersection kernels generalized rbf kernels skewed multiplicative histogram kernels additive kernels and semigroup kernels another type of kernel that is used widely in many application domains is the polynomial kernel defined by hx yi where is the bias and is the degree of the polynomial approximating polynomial kernels with explicit nonlinear maps is challenging problem but substantial progress has been made in this area recently kar and karnick catalyzed this line of researchq by introducing random maclaurin rm technique which approximates hx yip by the qthe product hwi xi hwi yi where wi is vector consisting of bernoulli random variables another technique tensor sketch offers further improvement by instead writing hx yip as hx where is the tensor product of and then estimating this tensor product with convolution of count sketches although these methods are applicable to any input data in practice polynomial kernels are commonly used on input data because they are otherwise unbounded moreover much of the theoretical analysis developed in former work is based on normalized vectors and it has 
been shown that utilizing norm information improves the estimates of random projections therefore natural question to ask is if we know priori that data is can we come up with better nonlinear map answering this question is the main focus of this work and will lead us to the development of new form of kernel approximation restricting the input domain to the unit sphere implies that hx yi so that polynomial kernel can be viewed as kernel in this restricted domain as such one might expect the random feature maps developed in to be applicable in this case unfortunately this expectation turns out to be false because bochner theorem can not be applied in this setting the obstruction is an inherent limitation of polynomial kernels and is examined extensively in section in section we propose an alternative formulation that overcomes these limitations by approximating the fourier transform of the kernel function as the positive projection of an indefinite combination of gaussians we provide bound on the approximation error of these spherical random fourier srf features in section and study their performance on variety of standard datasets including experiment on imagenet in section and in the supplementary material compared to prior work the srf method is able to achieve lower kernel approximation error with compact nonlinear maps especially for polynomials the variance in kernel approximation error is much lower than that of existing techniques leading to more stable predictions in addition it does not suffer from the rank deficiency problem seen in other methods before describing the srf method in detail we begin by reviewing the method of random fourier features background random fourier features in method for the explicit construction of compact nonlinear randomized feature maps was presented the technique relies on two important properties of the kernel the kernel is shiftinvariant where and ii the function is positive on rd definite ihw zi property ii guarantees that the fourier transform of admits an interpretation as probability distribution this fact follows from bochner celebrated characterization of positive definite functions theorem bochner function rd is positive definite on rd if and only if it is the fourier transform of finite borel measure on rd consequence of bochner theorem is that the inverse fourier transform of can be interpreted as the computation of an expectation dd zi cos hw xi cos hw yi where and is the uniform distribution on if the above expectation is approximated using monte carlo with random samples wi then hz with cos cos wd bd this identification is we are not claiming total generality of this setting nevertheless in cases where the vector length carries useful information and should be preserved it could be added as an additional feature before normalization made possible by property which guarantees that the functional dependence on and factorizes multiplicatively in frequency space such random fourier features have been used to approximate different types of kernels including the gaussian kernel the laplacian kernel and the cauchy kernel however they have not yet been applied to polynomial kernels because this class of kernels does not satisfy the prerequisite for the application of bochner theorem this statement may seem given the known result that polynomial kernels are positive definite kernels the subtlety is that this statement does not necessarily imply that the associated single variable functions are positive definite on rd for all we will prove this 
fact in the next section along with the construction of an efficient and effective modification of the random fourier method that can be applied to polynomial kernels defined on the unit sphere polynomial kernels on the unit sphere in this section we consider approximating the polynomial kernel defined on hx yi with we will restrict our attention to the kernel is radial function of the single variable which with slight abuse of notation we write as with in section we show that the fourier transform of is not function so straightforward application of bochner theorem to produce random fourier features as in is impossible in this case nevertheless in section we propose fast and accurate approximation of by surrogate positive definite function which enables us to construct compact fourier features obstructions to random fourier features because cos the behavior of for is undefined and arbitrary since it does not affect the original kernel function in eqn on the other hand it should be specified in order to perform the fourier transform which requires an integration over all values of we first consider the natural choice of for before showing that all other choices lead to the same conclusion lemma the fourier transform of is not function of for any values of and proof see the supplementary material for details direct calculation gives where jν is the bessel function of the first kind expanding for large yields cos πw which takes negative values for some for all and so monte carlo approximation of as in eqn is impossible in this case however there is still the possibility of defining the behavior of for differently and in such way that the fourier transform is positive and integrable on rd the latter condition should hold for all since the vector dimensionality can vary arbitrarily depending on input data we now show that such function can not exist to this end we first recall theorem due to schoenberg regarding completely monotone functions we also follow this practice in frequency space if is radial we also write definition function is said to be completely monotone on an interval if it is continuous on the closed interval infinitely differentiable in its interior and theorem schoenberg function is completely monotone on if and only if is positive definite and radial on rd for all together with theorem theorem shows that must be completely monotone if is to be interpreted as probability distribution we now establish that can not be completely monotone and simultaneously satisfy for proposition the function is completely monotone on proof from the definition of is continuous on infinitely differenφ tiable on and its derivatives vanish for they obey where the inequality follows since therefore is completely monotone on theorem suppose is completely monotone polynomial of degree on the interval with then there is no completely monotone function on that agrees with on for any nonzero proof let be function that agrees with on and let we show that for all integers there exists point χm satisfying χm such that χm for the point obeys by the definition of now suppose there is point χm such that χm and χm the mean value theorem then guarantees the existence of point such that χm χm and χχmm where we have utilized the fact that and the induction hypothesis noting that for all this result implies that χm for all therefore can not be completely monotone corollary there does not exist finite borel measure on rd whose fourier transform agrees with on spherical random fourier features from the section above 
we see that the bochner theorem can not be directly applied to the polynomial kernel in addition it is impossible to construct positive integrable whose inverse fourier transform equals exactly on despite this result it is nevertheless possible to find that is good approximation of on which is all that is necessary given that we will be approximating by monte carlo integration anyway we present our method of spherical random fourier srf features in this section we recall characterization of radial functions that are positive definite on rd for all due to schoenberg theorem schoenberg continuous function is positive definite and radial on rd for all if and only if it is of the form dµ where is finite borel measure on this characterization motivates an approximation for as sum of gaussians pn to increase the accuracy of the approximation we allow the ci to take negative ci ues doing so enables its fourier transform which is also sum of gaussians to become negative we circumvent this problem by mapping those negative values to zero max ci and simply defining as its inverse fourier transform owing to the max in eqn it is not possible to calculate an analytical expression for thankfully this isn necessary since we original approx original approx figure its approximation and the corresponding pdf for for polynomial orders and polynomials are approximated better see eqn algorithm spherical random fourier srf features input polynomial kernel with bias order input dimensionality and feature dimensionality output randomized feature map rd rd such that hz solve dz for where is the inverse fourier transform of whose form is given in eqn let draw iid samples wd from draw iid samples bd from the uniform distribution on bd cos cos wd can evaluate it numerically by performing one dimensional numerical integral dw wz which is using grid in and and can be computed via single matrix multiplication we then optimize the following cost function which is just the mse between and our approximation of it dz which defines an optimal probability distribution through eqn and the relation we can then follow the random fourier feature method to generate the nonlinear maps the entire srf process is summarized in algorithm note that for any given of kernel parameters can be independently of the data approximation error the total mse comes from two sources error approximating the function from eqn and error from monte carlo sampling the expected mse of converges at rate of and bound on the supremum of the absolute error was given in therefore we focus on analyzing the first type of error we describe simple method to obtain an upper bound on consider the which is special case of eqn obtained by setting ci and the mse between and this function thus provides an upper bound to our approximation error dz dz dz exp exp erf rm ts srf rm ts srf log dimensionality rm ts srf rm ts srf rm ts srf log dimensionality rm ts srf rm ts srf mse rm ts srf log dimensionality rm ts srf log dimensionality log dimensionality log dimensionality rm ts srf mse log dimensionality mse log dimensionality mse log dimensionality log dimensionality mse mse log dimensionality mse rm ts srf mse rm ts srf mse mse mse mse log dimensionality figure comparison of mse of kernel approximation on different datasets with various polynomial orders and feature map dimensionalities the first to third rows show results of usps gisette adult respectively srf gives better kernel approximation especially for large in the first line we have used the fact that integrand 
is positive and the three terms on the second line are integrated using the standard integral definitions of the error function beta function and kummer confluent hypergeometric function respectively to expose the functional dependence of this result more clearly we perform an expansion for large we use the asymptotic expansions of the error function and the gamma function erf log log bk log where bk are bernoulli numbers for the third term we write the series representation of expand each term for large and sum the result all together we obtain the following bound which decays at rate of and becomes negligible for polynomials this is remarkable as the approximation error of previous methods increases as function of figure shows two kernel functions their approximations and the corresponding pdfs experiments we compare the srf method with random maclaurin rm and tensor sketch ts the other polynomial kernel approximation approaches throughout the experiments we choose the number of gaussians to equal though the specific number had negligible effect on the results the bias term is set as other choices such as yield similar performance results with variety of parameter settings can be found in the supplementary material the error bars and standard deviations are obtained by conducting experiments times across the entire dataset dataset method rm ts srf rm ts srf rm ts srf rm ts srf rm ts srf rm ts srf rm ts srf rm ts srf usps usps usps usps gisette gisette gisette gisette exact table comparison of classification accuracy in on different datasets for different polynomial orders and varying feature map dimensionality the exact column refers to the accuracy of exact polynomial kernel trained with libsvm more results are given in the supplementary material rm ts srf craft rm craft ts craft srf rm ts srf craft rm craft ts craft srf rm ts srf craft rm craft ts craft srf eigenvalue rank accuracy mse log eigenvalue ratio log dimensionality log dimensionality figure comparison of craft features on usps dataset with polynomial order and feature maps of dimension logarithm of ratio of eigenvalue of the approximate kernel to that of the exact kernel constructed using points craft features are projected from dimensional maps mean squared error classification accuracy kernel approximation the main focus of this work is to improve the quality of kernel approximation which we measure by computing the mean squared error mse between the exact kernel and its approximation across the entire dataset figure shows mse as function of the dimensionality of the nonlinear maps srf provides lower mse than other methods especially for higher order polynomials this observation is consistent with our theoretical analysis in section as corollary srf provides more compact maps with the same kernel approximation error furthermore srf is stable in terms of the mse whereas ts and rm have relatively large variance classification with linear svm we train linear classifiers with liblinear and evaluate classification accuracy on various datasets two of which are summarized in table additional results are available in the supplementary material as expected accuracy improves with nonlinear maps and polynomials it is important to note that better kernel approximation does not necessarily lead to better classification performance because the original kernel might not be optimal for the task nevertheless we observe that srf features tend to yield better classification performance in most cases hamid et al observe that rm and ts produce 
nonlinear features that are rank deficient their approximation quality can be improved by first mapping the input to higher dimensional feature space and then randomly projecting it to lower dimensional space this method is known as craft figure shows the logarithm of the ratio of the ith eigenvalue rm ts srf time sec time sec test error rm ts srf dimensionality dimensionality figure computational time to generate randomized feature map for random samples on fixed hardware with srf rff training examples figure doubly stochastic gradient learning curves with rff and srf features on imagenet of the various approximate kernel matrices to that of the exact kernel for accurate approximation this value should be constant and equal to zero which is close to the case for srf rm and ts deviate from zero significantly demonstrating their figures and show the effect of the craft method on mse and classification accuracy craft improves rm and ts but it has no or even negative effect on srf these observations all indicate that the srf is less than rm and ts computational efficiency both rm and srf have computational complexity ndd whereas ts scales as np log where is the number of nonlinear maps is the number of samples is the original feature dimension and is the polynomial order therefore the scalability of ts is better than srf when is of the same order as log however the computational cost of srf does not depend on making srf more efficient for polynomials moreover there is little computational overhead involved in the srf method which enables it to outperform for practical values of even though it is asymptotically inferior as shown in figure even for the case srf is more efficient than ts for fixed in figure where srf is still more efficient than ts up to learning we investigate the scalability of the srf method on the imagenet dataset which consists of million color images from classes we employ the doubly stochastic gradient method of dai et al which utilizes two stochastic approximations one from random training points and the other from random features associated with the kernel we use the same architecture and parameter settings as including the fixed convolutional neural network parameters except we replace the rff kernel layer with an normalization step and an srf kernel layer with parameters and the learning curves in figure suggest that srf features may perform better than rff features on this dataset we also evaluate the model with testing in which is performed on transformations of the test set we obtain test error of which is comparable to the error reported in these results demonstrate that the unit norm restriction does not have negative impact on performance in this case and that polynomial kernels can be successfully scaled to large datasets using the srf method conclusion we have described novel technique to generate compact nonlinear features for polynomial kernels applied to data on the unit sphere it approximates the fourier transform of kernel functions as the positive projection of an indefinite combination of gaussians it achieves more compact maps compared to the previous approaches especially for polynomials srf also shows less feature redundancy leading to lower kernel approximation error performance of srf is also more stable than the previous approaches due to reduced variance moreover the proposed approach could easily extend beyond polynomial kernels the same techniques would apply equally well to any radial kernel function positive definite or not in the future we 
would also like to explore adaptive sampling procedures tuned to the training data distribution in order to further improve the kernel approximation accuracy especially when is large when the error is low and the kernel approximation error dominates acknowledgments we thank the anonymous reviewers for their valuable feedback and bo xie for facilitating experiments with the doubly stochastic gradient method references corinna cortes and vladimir vapnik networks machine learning joachims training linear svms in linear time in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages acm fan chang hsieh wang and lin liblinear library for large linear classification the journal of machine learning research shai yoram singer nathan srebro and andrew cotter pegasos primal estimated subgradient solver for svm mathematical programming ali rahimi and benjamin recht random features for kernel machines in advances in neural information processing systems pages salomon bochner harmonic analysis and the theory of probability dover publications subhransu maji and alexander berg additive classifiers for detection in international conference on computer vision pages ieee sreekanth andrea vedaldi andrew zisserman and jawahar generalized rbf feature maps for efficient detection in british machine vision conference fuxin li catalin ionescu and cristian sminchisescu random fourier approximations for skewed multiplicative histogram kernels in pattern recognition pages springer vedaldi and zisserman efficient additive kernels via explicit feature maps ieee transactions on pattern analysis and machine intelligence jiyan yang vikas sindhwani quanfu fan haim avron and michael mahoney random laplace feature maps for semigroup kernels on histograms in computer vision and pattern recognition cvpr pages ieee hideki isozaki and hideto kazawa efficient support vector classifiers for named entity recognition in proceedings of the international conference on computational pages association for computational linguistics kwang in kim keechul jung and hang joon kim face recognition using kernel principal component analysis signal processing letters ieee purushottam kar and harish karnick random feature maps for dot product kernels in international conference on artificial intelligence and statistics pages ninh pham and rasmus pagh fast and scalable polynomial kernels via explicit feature maps in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages acm raffay hamid ying xiao alex gittens and dennis decoste compact random feature maps in proceedings of the international conference on machine learning pages ping li trevor hastie and kenneth church improving random projections using marginal information in learning theory pages springer isaac schoenberg metric spaces and completely monotone functions annals of mathematics pages ee kummer de integralibus quibusdam definitis et seriebus infinitis journal die reine und angewandte mathematik felix yu sanjiv kumar henry rowley and chang compact nonlinear maps and circulant extensions arxiv preprint dmitry storcheus mehryar mohri and afshin rostamizadeh foundations of coupled nonlinear dimensionality reduction arxiv preprint bo dai bo xie niao he yingyu liang anant raj balcan and le song scalable kernel methods via doubly stochastic gradients in advances in neural information processing systems pages 
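Before moving on to the next paper, a minimal NumPy sketch may help summarize the random Fourier feature construction that the preceding sections build on: draw frequencies from a spectral density p(w), draw uniform phases, and form z(x) = sqrt(2/D) cos(Wx + b) so that <z(x), z(y)> approximates the kernel. The sketch below uses the Gaussian kernel only as a stand-in sampler for p(w); in the SRF method the frequencies would instead be drawn from the normalized positive part of the Fourier transform of the fitted sum of Gaussians. All function names and parameter choices here are our own illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a random Fourier feature map z(x) for a shift-invariant
# kernel k(x - y), as reviewed in the background section above. SRF differs
# only in how the spectral density p(w) is obtained (positive projection of
# a Gaussian-sum fit to the polynomial kernel profile on the unit sphere).
import numpy as np


def sample_frequencies_gaussian(d, D, sigma=1.0, seed=None):
    """Spectral samples for the Gaussian kernel exp(-||x-y||^2 / (2 sigma^2)).

    For SRF one would instead sample frequencies from the normalized positive
    part of the Fourier transform of the fitted Gaussian-sum approximation.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(D, d))   # rows w_i ~ p(w)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)        # phases b_i ~ U[0, 2*pi]
    return W, b


def feature_map(X, W, b):
    """z(x) = sqrt(2/D) cos(W x + b), applied row-wise to X of shape (n, d)."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))
    X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm data, as assumed by SRF
    W, b = sample_frequencies_gaussian(d=16, D=2048, seed=1)
    Z = feature_map(X, W, b)
    approx = Z @ Z.T                                   # approximate kernel matrix
    exact = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))
    print("max abs approximation error:", np.abs(approx - exact).max())
```

Linear learning on Z then stands in for kernel learning with k, which is the point of the explicit map: training scales with the number of samples rather than with the kernel matrix.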
rectified factor networks clevert andreas mayr thomas unterthiner and sepp hochreiter institute of bioinformatics johannes kepler university linz austria okko mayr unterthiner hochreit abstract we propose rectified factor networks rfns to efficiently construct very sparse representations of the input rfn models identify rare and small events in the input have low interference between code units have small reconstruction error and explain the data covariance structure rfn learning is generalized alternating minimization algorithm derived from the posterior regularization method which enforces and normalized posterior means we proof convergence and correctness of the rfn learning algorithm on benchmarks rfns are compared to other unsupervised methods like autoencoders rbms factor analysis ica and pca in contrast to previous sparse coding methods rfns yield sparser codes capture the data covariance structure more precisely and have significantly smaller reconstruction error we test rfns as pretraining technique for deep networks on different vision datasets where rfns were superior to rbms and autoencoders on gene expression data from two pharmaceutical drug discovery studies rfns detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods rfn package for is available at http introduction the success of deep learning is to large part based on advanced and efficient input representations these representations are sparse and hierarchical sparse representations of the input are in general obtained by rectified linear units relu and dropout the key advantage of sparse representations is that dependencies between coding units are easy to model and to interpret most importantly distinct concepts are much less likely to interfere in sparse representations using sparse representations similarities of samples often break down to of features in these samples in bioinformatics sparse codes excelled in biclustering of gene expression data and in finding dna sharing patterns between humans and neanderthals representations learned by relus are not only sparse but also representations do not code the degree of absence of events or objects in the input as the vast majority of events is supposed to be absent to code for their degree of absence would introduce high level of random fluctuations we also aim for input representations to stack models for constructing hierarchical representations finally the representations are supposed to have large number of coding units to allow coding of rare and small events in the input rare events are only observed in few samples like seldom side effects in drug design rare genotypes in genetics or small customer groups in small events affect only few input components like pathways with few genes in biology few relevant mutations in oncology or pattern of few products in in summary our goal is to construct input representations that are sparse are are use many code units and model structures in the input data see next paragraph current unsupervised deep learning approaches like autoencoders or restricted boltzmann machines rbms do encode all peculiarities in the data including noise generative models can be design to model specific structures in the data but their codes can not be enforced to be sparse and nonnegative the input representation of generative model is its posterior mean median or mode which depends on the data therefore sparseness and can not be guaranteed independent of the 
data for example generative models with rectified priors like rectified factor analysis have zero posterior probability for negative values therefore their means are positive and not sparse sparse priors like laplacian and jeffrey do not guarantee sparse posteriors see experiments in tab to address the data dependence of the code we employ the posterior regularization method this method separates model characteristics from data dependent characteristics that are enforced by constraints on the model posterior we aim at representations that are feasible for many code units and massive datasets therefore the computational complexity of generating code is essential in our approach for priors the computation of the posterior mean of new input requires either to numerically solve an integral or to iteratively update variational parameters in contrast for gaussian priors the posterior mean is the product between the input and matrix that is independent of the input still the posterior regularization method leads to quadratic in the number of coding units constrained optimization problem in each see eq below to speed up computation we do not solve the quadratic problem but perform gradient step to allow for stochastic gradients and fast gpu implementations also the is gradient step these and modifications of the posterior regularization method result in generalized alternating minimization gam algorithm we will show that the gam algorithm used for rfn learning converges and ii is correct correctness means that the rfn codes are sparse have low reconstruction error and explain the covariance structure of the data rectified factor network our goal is to construct representations of the input that are sparse are are use many code units and model structures in the input structures in the input are identified by generative model where the model assumptions determine which input structures to explain by the model we want to model the covariance structure of the input therefore we choose maximum likelihood factor analysis as model the constraints on the input representation are enforced by the posterior regularization method constraints lead to sparse and codes while normalization constraints scale the signal part of each hidden code unit normalizing constraints avoid that generative models explain away rare and small signals by noise explaining away becomes serious problem for models with many coding units since their capacities are not utilized normalizing ensures that all hidden units are used but at the cost of coding also random and spurious signals spurious and true signals must be separated in subsequent step either by supervised techniques by evaluating coding units via additional data or by domain experts generative model with hidden units and data is defined by its prior and its likelihood the full model distribution can be expressed by the model posterior and its evidence marginal likelihood the representation of input is the posterior mean median or mode the posterior regularization method introduces variational distribution from family which approximates the posterior we choose to constrain the posterior means to be and normalized the full model distribution contains all model assumptions and thereby defines which structures of the data are modeled contains data dependent constraints on the posterior therefore on the code for data vn the posterior regularization method maximizes the objective log vi dkl hi vi hi vi hi vi log vi hi dhi dkl hi vi hi where dkl is the distance maximizing 
achieves two goals simultaneously extracting desired structures and information from the data as imposed by the generative model and ensuring desired code properties via the factor analysis model extracts the ance structure of the data the prior of the hidden units factors and the noise of visible units observations rm are independent the model parameters are the weight loading matrix and the noise covariance matrix we assume agonal to explain correlations between input components by the hidden units and not by correlated noise the factor analysis model is depicted in fig given the data vn the posterior hi vi is gaussian with mean vector µp and covariance matrix σp figure factor analysis model µp vi hidden units factors visible σp units weight matrix noise rectified factor network rfn consists of single or stacked factor analysis model with constraints on the posterior to incorporate the posterior constraints into the factor analysis model we use the posterior regularization method that maximizes the objective given in eq like the em algorithm the posterior regularization method alternates between an and an minimizing the first dkl of eq with respect to leads to constrained optimization problem for gaussian distributions the solution with µp and σp from eq is hi vi µi with σp and the quadratic problem min µi µi µp µi µp µi ij where is this is constraint quadratic optimization problem in the number of hidden units which is too complex to be solved in each em iteration therefore we perform step of the gradient projection algorithm which performs first gradient step and then projects the result to the feasible set we start by step of the projected newton method then we try the gradient projection algorithm thereafter the scaled gradient projection algorithm with reduced matrix see also if these methods fail to decrease the objective in eq we use the generalized reduced method it solves each equality constraint for one variable and inserts it into the objective while ensuring convex constraints alternatively we use rosen gradient projection method or its improvement these methods guarantee decrease of the objective since the projection by eq is very fast the projected newton and projected gradient update is very fast too projected newton step requires nl steps see eq and defined in theorem projected gradient step requires min nlm steps and scaled gradient projection step requires steps the rfn complexity per iteration is see alg in contrast quadratic program solver typically requires for the nl variables the means of the hidden units for all samples steps to find the minimum we exemplify these values on our benchmark datasets mnist and cifar the speedup with projected newton or projected gradient in contrast to quadratic solver is which gives speedup ratios of for mnist and for cifar these speedup ratios show that efficient updates are essential for rfn learning furthermore on our computers ram restrictions limited quadratic program solvers to problems with nl running times of rfns with the newton step and quadratic program solver are given in the supplementary section the decreases the expected reconstruction error hi vi log vi hi dhi rl log log tr tr tr from eq with respect to the model parameters and definitions of and are given in alg the performs gradient step in the newton direction since we want to algorithm rectified factor network vi vi while do for all do µp vi end for σp projected newton projected gradient scaled gradient projection generalized reduced method rosen gradient project 
pn vi µi µi µi for all do ψkk ψkk ekk ψkk end for if stopping criterion is met end while complexity objective min nlm min nlm projected newton nl projected gradient min nlm scaled gradient projection nl ml overall complexity with projected newton gradient for allow stochastic gradients fast gpu implementation and dropout regularization the newton step is derived in the supplementary which gives further details too also in the rfn learning performs gradient step using projected newton or gradient projection methods these projection methods require the euclidean projection of the posterior means µp onto the feasible set min µi µi µp µi µp µi ij the following theorem gives the euclidean projection as solution to eq theorem euclidean projection if at least one µp ij is positive for then the solution to optimization problem eq is for µp ij µij µp µp ij for µp ij ij if all ij are for then the optimization problem eq has the solution µij for arg µp and µij otherwise proof see supplementary material using the projection defined in eq the updates for the posterior means µi are old µold µold µold µnew µp µi where we set for the projected newton method σp thus and for the projected gradient method for the scaled gradient projection algorithm with reduced matrix the set for consists of all with µij the reduced matrix is the hessian with columns and rows fixed to unit vectors ej the resulting algorithm is posterior regularization method with gradient based and leading to generalized alternating minimization gam algorithm the rfn learning algorithm is given in alg dropout regularization can be included before by randomly setting code units µij to zero with predefined dropout rate note that convergence results will no longer hold convergence and correctness of rfn learning convergence of rfn learning theorem states that alg converges to maximum of theorem rfn convergence the rectified factor network rfn learning algorithm given in alg is generalized alternating minimization gam algorithm and converges to solution that maximizes the objective proof we present sketch of the proof which is given in detail in the supplement for convergence we show that alg is gam algorithm which convergences according to proposition in alg ensures to decrease the objective which is convex in and the update with leads to the minimum of the objective convexity of the objective guarantees decrease in the for if not in minimum alg ensures to decrease the objective by using gradient projection methods all other requirements for gam convergence are also fulfilled proposition in is based on zangwill generalized convergence theorem thus updates of the rfn algorithm are viewed as mappings therefore the numerical precision the choice of the methods in the and gpu implementations are covered by the proof correctness of rfn learning the goal of the rfn algorithm is to explain the data and its covariance structure the expected approximation error is defined in line of alg theorem states that the rfn algorithm is correct that is it explains the data low reconstruction error and captures the covariance structure as good as possible theorem rfn correctness the fixed point of alg minimizes tr given µi and by ridge regression with tr where vi µi the model explains the data covariance matrix by wt up to an error which is quadratic in for the reconstruction error is quadratic in for pn proof the fixed point equation for the update is pn pn using the definition of and we have vi µti µi µi is the ridge regression solution of kvi µi tr pn where tr 
is the trace after multiplying out all in we obtain pn for the fixed point of the update rule gives diag diag σw thus minimizes tr given µi and multiplying the woodbury identity for wwt from left and right by gives σw inserting this into the expression for diag and taking the trace gives tr tr tr tr therefore for the error is quadratic in sw follows from fixed point equation using this and eq eq is wt using the trace norm nuclear norm or on matrices eq states that the left hand side of eq is quadratic in for the trace norm of positive matrix is its trace and bounds the frobenius norm thus for the covariance is approximated up to quadratic error in according to eq the diagonal is exactly modeled since the minimization of the expected reconstruction error tr is based on µi the quality of reconstruction depends on the correlation between µi and vi we ensure maximal information in µi on vi by the the minimal distance of the posterior onto the family of rectified and normalized gaussian distributions experiments rfns other unsupervised methods we assess the performance of rectified factor networks rfns as unsupervised methods for data representation we compare rfn rectified factor networks rfnn rfns without normalization dae denoising autoencoders with relus rbm restricted boltzmann machines with gaussian visible units fasp factor analysis with jeffrey prior on the hidden units which is sparser than laplace prior falap factor analysis with laplace prior on the hidden units ica independent component analysis by fastica sfa sparse factor analysis with laplace prior on the parameters fa standard factor analysis pca principal component analysis the number of components are fixed to and for each method we generated nine different benchmark datasets to where each dataset consists of instances each instance has samples and features resulting in matrix into these matrices biclusters are implanted bicluster is pattern of particular features which is found in particular samples like pathway activated in some samples an optimal representation will only code the biclusters that are present in sample the datasets have different noise levels and different bicluster sizes large biclusters have samples and features while small biclusters samples and features the pattern signal strength in particular sample was randomly chosen according to the gaussian finally to each matrix gaussian background noise was added with standard deviation or the datasets are characterized by with background noise number of large biclusters and the number of small biclusters we evaluated the methods according to the sparseness of the components the input reconstruction error from the code and the covariance reconstruction error for generative models for rfns sparseness is the percentage of the components that are exactly while for others methods it is the percentage of components with an absolute value smaller than the reconstruction error is the sum of the squared errors across samples the covariance reconstruction error is the frobenius norm of the difference between model and data covariance see supplement for more details on the data and for information on hyperparameter selection for the different methods tab gives averaged results for models with undercomplete complete and overcomplete coding units results are the mean of instances consisting of instances for each dataset to in the supplement we separately tabulate the results for to and confirm them with different noise levels falap did not yield sparse codes since the 
variational parameter did not table comparison of rfn with other unsupervised methods where the upper part contains methods that yielded sparse codes criteria sparseness of the code sp reconstruction error er difference between data and model covariance co the panels give the results for models with and coding units results are the mean of instances instances for each dataset to maximal value rfns had the sparsest code the lowest reconstruction error and the lowest covariance approximation error of all methods that yielded sparse representations sp undercomplete code units complete code units overcomplete code units rfn rfnn dae rbm fasp sp er co sp er co sp er co falap ica sfa fa pca mnist digits mnist digits with random image background mnist digits with random noise background convex and concave shapes tall and wide rectangular rectangular images on background images images best viewed in color norb images figure randomly selected filters trained on image datasets using an rfn with hidden units rfns learned stroke local and global blob detectors rfns are robust to background noise push the absolute representations below the threshold of the variational approximation to the laplacian is gaussian rfns had the sparsest code the lowest reconstruction error and the lowest covariance approximation error of all methods yielding sparse representations sp rfn pretraining for deep nets we assess the performance of rectified factor networks rfns if used for pretraining of deep networks stacked rfns are obtained by first training single layer rfn and then passing on the resulting representation as input for training the next rfn the deep network architectures use rfn pretrained first layer or stacks of rfns giving layer network the classification performance of deep networks with rfn pretrained layers was compared to support vector machines ii deep networks pretrained by stacking denoising autoencoders sdae iii stacking regular autoencoders sae iv restricted boltzmann machines rbm and stacking restricted boltzmann machines dbn the benchmark datasets and results are taken from previous publications and contain mnist original mnist ii basic smaller subset of mnist for training iii mnist with random noise background iv mnist with random image background rect tall or wide rectangles vi tall or wide rectangular images with random background images vii convex convex or concave shapes viii color images in classes and ix norb stereo image pairs of categories for each dataset its size of training validation and test set is given in the second column of tab as preprocessing we only performed median centering model selection is based on the validation set the rfns hyperparameters are the number of units per layer from and ii the dropout rate from the learning rate was fixed to default value for supervised with stochastic gradient descent we selected the learning rate from the masking noise from and the number of layers from was stopped based on the validation set see fig shows learned filters test error rates and the table results of deep networks pretrained by rfns and other models taken from the test error rate is reported together with the confidence interval the best performing method is given in bold as well as those for which confidence intervals overlap the first column gives the dataset the second the size of training validation and test set the last column indicates the number of hidden layers of the selected deep network in only one case rfn pretraining was significantly worse than the best method 
but still the second best in six out of the nine experiments rfn pretraining performed best where in four cases it was significantly the best dataset mnist basic rect convex norb cifar svm rbm dbn sae sdae rfn micronuclei figure examples of small and rare events identified by rfn in two drug design studies which were missed by previous methods panel and first row gives the coding unit while the other rows display expression values of genes for controls red active drugs green and inactive drugs black drugs green in panel strongly downregulate the expression of tubulin genes which hints at genotoxic effect by the formation of micronuclei the micronuclei were confirmed by microscopic analysis drugs green in panel show transcriptional effect on genes with negative feedback to the mapk signaling pathway and therefore are potential cancer drugs confidence interval computed according to for deep network pretraining by rfns and other methods are given in tab best results and those with overlapping confidence intervals are given in bold rfns were only once significantly worse than the best method but still the second best in six out of the nine experiments rfns performed best where in four cases it was significantly the best supplementary section shows results of rfn pretraining for convolutional networks where rfn pretraining decreased the test error rates to for and to for rfns in drug discovery using rfns we analyzed gene expression datasets of two projects in the lead optimization phase of big pharmaceutical company the first project aimed at finding novel antipsychotics that target the second project was an oncology study that focused on compounds inhibiting the fgf receptor in both projects the expression data was summarized by farms and standardized rfns were trained with hidden units no masking noise and learning rate of the identified transcriptional modules are shown in fig panels and illustrate that rfns found rare and small events in the input in panel only few drugs are genotoxic rare event by downregulating the expression of small number of tubulin genes small event the genotoxic effect stems from the formation of micronuclei panel and since the mitotic spindle apparatus is impaired also in panel rfn identified rare and small event which is transcriptional module that has negative feedback to the mapk signaling pathway rare events are unexpectedly inactive drugs black dots which do not inhibit the fgf receptor both findings were not detected by other unsupervised methods while they were highly relevant and supported in both projects conclusion we have introduced rectified factor networks rfns for constructing very sparse and input representations with many coding units in generative framework like factor analysis rfn learning explains the data variance by its model parameters the rfn learning algorithm is posterior regularization method which enforces and normalized posterior means we have shown that rfn learning is generalized alternating minimization method which can be proved to converge and to be correct rfns had the sparsest code the lowest reconstruction error and the lowest covariance approximation error of all methods that yielded sparse representations sp rfns have shown that they improve performance if used for pretraining of deep networks in two pharmaceutical drug discovery studies rfns detected small and rare gene modules that were so far missed by other unsupervised methods these gene modules were highly relevant and supported the in both studies rfns are geared to large 
datasets sparse coding and many representational units therefore they have high potential as unsupervised deep learning techniques acknowledgment the tesla used for this research was donated by the nvidia corporation references hinton and salakhutdinov reducing the dimensionality of data with neural networks science bengio lamblin popovici and larochelle greedy training of deep networks in platt and hoffman editors nips pages mit press schmidhuber deep learning in neural networks an overview neural networks lecun bengio and hinton deep learning nature nair and hinton rectified linear units improve restricted boltzmann machines in icml pages omnipress isbn glorot bordes and bengio deep sparse rectifier neural networks in aistats volume pages srivastava hinton krizhevsky sutskever and salakhutdinov dropout simple way to prevent neural networks from overfitting journal of machine learning research hochreiter bodenhofer et al fabia factor analysis for bicluster acquisition bioinformatics hochreiter hapfabia identification of very short segments of identity by descent characterized by rare variants in large sequencing data nucleic acids frey and hinton variational learning in nonlinear gaussian belief networks neural computation harva and kaban variational learning for rectified factor analysis signal processing ganchev graca gillenwater and taskar posterior regularization for structured latent variable models journal of machine learning research palmer wipf and rao variational em algorithms for latent variable models in nips volume pages bertsekas on the gradient projection method ieee trans automat control kelley iterative methods for optimization society for industrial and applied mathematics siam philadelphia bertsekas projected newton methods for optimization problems with simple constraints siam control abadie and carpentier optimization chapter generalization of the wolfe reduced gradient method to the case of nonlinear constraints academic press rosen the gradient projection method for nonlinear programming part ii nonlinear constraints journal of the society for industrial and applied mathematics haug and arora applied optimal design wiley sons new york and nemirovski interior point polynomial time methods for linear programming conic quadratic programming and semidefinite programming chapter pages society for industrial and applied mathematics gunawardana and byrne convergence theorems for generalized alternating minimization procedures journal of machine learning research zangwill nonlinear programming unified approach prentice hall englewood cliffs srebro learning with matrix factorizations phd thesis department of electrical engineering and computer science massachusetts institute of technology and oja fast algorithm for independent component analysis neural lecun huang and bottou learning methods for generic object recognition with invariance to pose and lighting in proceedings of the ieee conference on computer vision and pattern recognition cvpr ieee press vincent larochelle et al stacked denoising autoencoders learning useful representations in deep network with local denoising criterion jmlr larochelle erhan et al an empirical evaluation of deep architectures on problems with many factors of variation in icml pages krizhevsky learning multiple layers of features from tiny images master thesis deptartment of computer science university of toronto verbist klambauer et al using transcriptomics to guide lead optimization in drug discovery projects lessons learned from the qstar 
project. Drug Discovery Today. Hochreiter, Clevert, and Obermayer. A new summarization method for Affymetrix probe level data. Bioinformatics.
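To make the E-step of RFN learning more concrete, here is a minimal NumPy sketch of the rectifying and normalizing projection of the posterior means described in the projection theorem above. The constraint set assumed here, nonnegativity together with a unit mean squared activation per hidden unit over the batch, and the fallback for the all-nonpositive case are our reading of that theorem, not the authors' implementation; the function name project_posterior_means is likewise our own.

```python
# Sketch of the RFN E-step projection (cf. the projection theorem above):
# posterior means are rectified (set to zero where negative) and each hidden
# unit is rescaled so its mean squared activation over the batch equals one.
# This is a reading of the stated constraints, not the reference code.
import numpy as np


def project_posterior_means(mu_p, eps=1e-12):
    """Project posterior means mu_p of shape (n samples, l hidden units).

    Returns mu with mu >= 0 and mean(mu[:, j] ** 2) == 1 for every unit j
    that has at least one positive entry.
    """
    mu = np.maximum(mu_p, 0.0)                       # rectification
    n = mu.shape[0]
    for j in range(mu.shape[1]):
        scale = np.sqrt(np.mean(mu[:, j] ** 2))
        if scale > eps:
            mu[:, j] /= scale                        # per-unit normalization
        else:
            # all entries of unit j were nonpositive: activate only the sample
            # with the largest posterior mean (a simple fallback for the
            # degenerate case handled separately in the theorem)
            i = np.argmax(mu_p[:, j])
            mu[:, j] = 0.0
            mu[i, j] = np.sqrt(n)                    # mean of squares is then 1
    return mu


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu_p = rng.normal(size=(100, 8))
    mu = project_posterior_means(mu_p)
    print(np.all(mu >= 0), np.mean(mu ** 2, axis=0))  # True, approximately all ones
```

In the full algorithm this projection replaces the exact solution of the constrained quadratic problem in each iteration, which is what turns the posterior regularization method into the generalized alternating minimization scheme analyzed above.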
Learning Bayesian Networks with Thousands of Variables. Mauro Scanagatta (IDSIA, Lugano, Switzerland), Cassio de Campos (Queen's University Belfast, Northern Ireland, UK), Giorgio Corani (IDSIA, Lugano, Switzerland), Marco Zaffalon (IDSIA, Lugano, Switzerland). IDSIA is the Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, affiliated with the Scuola Universitaria Professionale della Svizzera Italiana (SUPSI) and the Università della Svizzera italiana (USI).

Abstract. We present a method for learning Bayesian networks from data sets containing thousands of variables, without the need for structure constraints. Our approach is made of two parts. The first is a novel algorithm that effectively explores the space of possible parent sets of a node; it guides the exploration towards the most promising parent sets on the basis of an approximated score function that is computed in constant time. The second part is an improvement of an existing algorithm for structure optimization; the new algorithm provably achieves a higher score compared to its original formulation. Our novel approach consistently outperforms the state of the art on very large data sets.

Introduction. Learning the structure of a Bayesian network from data is NP-hard. We focus on score-based learning, namely finding the structure which maximizes a score that depends on the data. Several exact algorithms have been developed, based on dynamic programming, branch and bound, and linear and integer programming, as well as heuristic approaches. Usually structural learning is accomplished in two steps: parent set identification and structure optimization. Parent set identification produces a list of suitable candidate parent sets for each variable; structure optimization assigns a parent set to each node, maximizing the score of the resulting structure without introducing cycles. The problem of parent set identification is unlikely to admit an algorithm with a good quality guarantee; this motivates the development of effective search heuristics. Usually, however, one decides the maximum number of parents per node (the maximum in-degree) and then simply computes the score of all parent sets within that limit; at that point one performs structural optimization. An exception is an earlier greedy-search algorithm, which has however been superseded by the more modern approaches mentioned above. A higher maximum in-degree implies a larger search space and allows a higher score to be achieved, but it also requires more computational time; when choosing the maximum in-degree, the user therefore makes a trade-off between these two objectives. However, when the number of variables is large, the maximum in-degree is generally set to a small value to keep the optimization feasible; even in the largest examples of structural learning reported in the literature, including those carried out with GOBNILP, the maximum in-degree is kept small. In this paper we propose an algorithm that performs approximated structure learning with thousands of variables without constraints on the maximum in-degree. It consists of a novel approach for parent set identification and a novel approach for structure optimization. As for parent set identification, we propose an anytime algorithm that effectively explores the space of possible parent sets; it guides the exploration towards the most promising parent sets, exploiting an approximated score function that is computed in constant time. As for structure optimization, we extend an existing ordering-based algorithm, which provides an effective approach for model selection with reduced computational cost; our extension is guaranteed to find a solution better than or equal to that of the original algorithm. We test our approach on data sets containing up to ten thousand variables; as
performance indicator we consider the score of the network found our parent set identification approach outperforms consistently the usual approach of setting the maximum and then computing the score of all parent sets our structure optimization approach outperforms gobnilp when learning with more than nodes all the software and data sets used in the experiments are available online structure learning of bayesian networks consider the problem of learning the structure of bayesian network from complete data set of instances dn the set of categorical random variables is xn the goal is to find the best dag where is the collection of nodes and is the collection of arcs can be defined as the set of parents πn of each variable different scores can be used to assess the fit of dag we adopt the bic which asymptotically approximates the posterior probability of the dag the bic score is decomposable namely it is constituted by the sum of the scores of the individual variables bic xn xn bic xi πi nx log log where is the maximum likelihood estimate of the conditional probability xi and nx represents the number of times πi appears in the data set and indicates the size of the cartesian product space of the variables given as arguments instead of the number of variables such that is the number of states of xi and exploiting decomposability we first identify independently for each variable list of candidate parent sets parent set identification then by structure optimization we select for each node the parent set that yields the highest score without introducing cycles parent set identification for parent set identification usually one explores all the possible parent sets whose number however increases as nk where denotes the maximum pruning rules do not considerably reduce the size of this space usually the parent sets are explored in sequential order first all the parent size of size one then all the parent sets of size two and so on up to size we refer to this approach as sequential ordering if the solver adopted for structural optimization is exact this strategy allows to find the globally optimum graph given the chosen value of in order to deal with large number of variables it is however necessary setting low for instance adopts when dealing with the largest data set diabetes which contains variables in gobnilp is used for structural learning with variables again setting higher value of would make the structural http http learning not feasible yet low implies dropping all the parent sets with size larger than some of them possibly have high score in it is proposed to adopt the subset πcorr of the most correlated variables with the children variable then consider only parent sets which are subsets of πcorr however this approach is not commonly adopted possibly because it requires specifying the size of πcorr indeed acknowledges the need for further innovative approaches in order to effectively explore the space of the parent sets we propose two anytime algorithms to address this problem the first is the simplest we call it greedy selection it starts by exploring all the parent sets of size one and adding them to list then it repeats the following until time is expired pops the best scoring parent set from the list explores all the supersets obtained by adding one variable to and adds them to the list note that in general the parent sets chosen at two adjoining step are not related to each other the second approach independence selection adopts more sophisticated strategy as explained in the 
following parent set identification by independence selection independence selection uses an approximation of the actual bic score of parent set which we denote as to guide the exploration of the space of the parent sets the of parent set constituted by the union of two parent sets and is defined as follows bic bic inter with and inter if we already know bic and bic from previous calculations and we know bic then can be computed in constant time with respect to data accesses we thus exploit to quickly estimate the score of large number of candidate parent sets and to decide the order to explore them we provide bound for the difference between and bic to this end we denote by ii the interaction information ii namely the difference between the mutual information of and conditional on and the unconditional mutual information of and theorem let be node of and be parent set for with and then bic ii where ii is the interaction information estimated from data proof bic bic bic bic inter nx log nx log log nx log log nx log log log log ii where denotes the conditional mutual information estimated from data corollary let be node of and be parent set of such that and then min proof theorem states that bic ii we now devise bounds for interaction information recalling that mutual information and conditional mutual information are always and achieve their maximum value at the smallest entropy of their argument ii the theorem is proven by simply permuting the values in the ii of such equation since ii the bounds for ii are valid we know that log for any set of nodes hence the result of corollary could be further manipulated to achieve bound for the difference between bic and of at most log min however corollary is stronger and can still be computed efficiently as follows when computing we assumed that bic and bic had been precomputed as such we can also have precomputed the values and at the same time as the bic scores were computed without any significant increase of complexity when computing bic for given just use the same loop over the data to compute corollary let be node of and be parent set for that node with and if then bic if then bic if the interaction information ii then bic proof it follows from theorem considering that mutual information if and are independent while if and are conditionally independent we now devise novel pruning strategy for bic based on the bounds of corollaries and theorem let be node of and be parent set for that node with and let if min then and its supersets are not optimal and can be ignored proof min implies bic and theorem of prunes and all its supersets thus we can efficiently check whether large parts of the search space can be discarded based on these results we note that corollary and hence theorem are very generic in the choice of and even though usually one of them is taken as singleton independence selection algorithm we now describe the algorithm that exploits the score in order to effectively explore the space of the parent sets it uses two lists open list for the parent sets to be explored ordered by their score closed list of already explored parent sets along with their actual bic score the algorithm starts with the bic of the empty set computed first it explores all the parent sets of size one and saves their bic score in the closed list then it adds to the open list every parent set of size two computing their scores in constant time on the basis of the scores available from the closed list it then proceeds as follows until all elements in open have 
been processed or the time is expired it extracts from open the parent set with the best score it computes its bic score and adds it to the closed list it then looks for all the possible expansions of obtained by adding single variable such that is not present in open or closed it adds them to open with their scores eventually it also considers all the explored subsets of it safely prunes if any of its subsets yields higher bic score than the algorithm returns the content of the closed list pruned and ordered by the bic score such list becomes the content of the cache of scores for the procedure is repeated for every variable and can be easily parallelized figure compares sequential ordering and independence selection it shows that independence selection is more effective than sequential ordering because it biases the search towards the highestscoring parent sets structure optimization the goal of structure optimization is to choose the overall highest scoring parent sets measured by the sum of the local scores without introducing directed cycles in the graph we start from the approach proposed in which we call search or obs which exploits the fact bic bic iteration iteration sequential ordering indep selection ordering figure exploration of the parent sets space for given variable performed by sequential ordering and independence selection each point refers to distinct parent set pn that the optimal network can be found in time ck where ci and ci is the number of elements in the cache of scores of xi if an ordering over the variables is is needed to check whether all the variables in parent set for come before in the ordering simple array can be used as data structure for this checking this implies working on the search space of the possible orderings which is convenient as it is smaller than the space of network structures multiple orderings are sampled and evaluated different techniques can be used for guiding the sampling for each sampled total ordering over variables xn the network is consistent with the order if πi xi network consistent with given ordering automatically satisfies the acyclicity constraint this allows us to choose independently the best parent set of each node moreover for given total ordering vn of the variables the algorithm tries to improve the network by greedy search swapping procedure if there is pair vj such that the swapped ordering with vj in place of and vice versa yields better score for the network then these nodes are swapped and the search continues one advantage of this swapping over extra random orderings is that searching for it and updating the network if good swap is found only takes time cj kn which can be sped up as cj only is inspected for parents sets containing and is only processed if has vj as parent in the current network while new sampled ordering would take ck the swapping approach is usually favourable if ci is which is plausible assumption we emphasize that the use of here is sole with the purpose of analyzing the complexity of the methods since our parent set identification approach does not rely on fixed value for however the consistency rule of obs is quite restricting while it surely refuses all cyclic structures it also rules out some acyclic ones which could be captured by interpreting the ordering in slightly different manner we propose novel consistency rule for given ordering which processes the nodes in vn from vn to obs can do it in any order as the local parent sets can be chosen independently and we define the parent set of 
vj such that it does not introduce cycle in the current partial network this allows in the ordering from node vj to its successors as long as this does not introduce cycle we call this idea acyclic selection obs or simply asobs because we need to check for cycles at each step of constructing the network for given ordering at first glance the algorithm seems to be slower time complexity of cn against ck for obs note this difference is only relevant as we intend to work with large values surprisingly we can implement it in the same overall time complexity of ck as follows build and keep boolean square matrix to mark which are the descendants of nodes tells whether is descendant of start it all false for each node vj in the order with go through the parent sets and pick the best scoring one for which all contained parents are not descendants of vj this takes time ci if parent sets are kept as lists build todo list with the descendants of vj from the matrix representation and associate an empty todo list to all ancestors of vj start the todo lists of the parents of vj with the descendants of vj for each ancestor of vj ancestors will be iteratively visited by following depthfirst graph search procedure using the network built so far we process node after and shall be understood as usual asymptotic notation functions its children with todo lists have been already processed the search stops when all ancestors are visited for each element in the todo list of if is true then ignore and move on otherwise set to true and add to the todo of parents of let us analyze the complexity of the method step takes overall time ck already considering the outer loop step takes overall time already considering the outer loop steps and will be analyzed based on the number of elements on the todo lists and the time to process them in an amortized way note that the time complexity is directly related to the number of elements that are processed from the todo lists we can simply look to the moment that they leave list as their inclusion in the lists will be in equal number we will now count the number of times we process an element from todo list this number is overall bounded over all external loop cycles by the number of times we can make cell of matrix turn from false to true which is plus the number of times we ignore an element because the matrix cell was already set to true which is at most per each vj as this is the maximum number of descendants of vj and each of them can fall into this category only once so again there are times in total in other words each element being removed from todo list is either ignored matrix already set to true or an entry in the matrix of descendants is changed from false to true and this can only happen times hence the total time complexity is ck which is ck for any greater than very plausible scenario as each local cache of variable usually has more than elements moreover we have the following interesting properties of this new method theorem for given ordering the network obtained by asobs has score equal than or greater to that obtained by obs proof it follows immediately from the fact that the consistency rule of asobs generalizes that of obs that is for each node vj with asobs allows all parent sets allowed by obs and also others containing theorem for given ordering defined by vn and current graph consistent with if obs consistency rule allows the swapping of vj and leads to improving the score of then the consistency rule of asobs allows the same swapping and achieves the 
same improvement in score proof it follows immediately from the fact that the consistency rule of asobs generalizes that of obs so from given graph if swapping is possible under obs rules then it is also possible under asobs rules experiments we compare three different approaches for parent set identification sequential greedy selection and independence selection and three different approaches gobnilp obs and asobs for structure optimization this yields nine different approaches for structural learning obtained by combining all the methods for parent set identification and structure optimization note that obs has been shown in to outperform other search over structures such as greedy and methods we allow one minute per variable to each approach for parent set identification we set the maximum to high value that allows learning even complex structures notice that our novel approach does not need maximum we set maximum to put our approach and its competitors on the same ground once computed the scores of the parent sets we run each solver gobnilp obs asobs for hours for given data set the computation is performed on the same machine the explicit goal of each approach for both parent set identification and structure optimization is to maximize the bic score we then measure the bic score of the bayesian networks eventually obtained as performance indicator the difference in the bic score between two alternative networks is an asymptotic approximation of the logarithm of the bayes factor the bayes factor is the ratio of the posterior probabilities of two competing models let us denote by the difference between the bic score of network and network positive values of imply data set data set data set data set audio jester netflix accidents retail dna kosarek msweb book eachmovie webkb bbc ad table data sets sorted according to the number of variables evidence in favor of network the evidence in favor of network is respectively weak positive strong very strong if is between and and and beyond learning from datasets we consider data sets already used in the literature of structure learning firstly introduced in and we randomly split each data set into three subsets of instances this yields data sets the approaches for parent set identification are compared in table for each fixed structure optimization approach we learn the network starting from the list of parent sets computed by independence selection is greedy selection gs and sequential selection sq in turn we analyze gs and sq positive means that independence selection yields network with higher bic score than the network obtained using an alternative approach for parent set identification vice versa for negative values of in most cases see table implying very strong support for the network learned using independence selection we further analyze the results through the null hypothesis of the test is that the bic score of the network learned under independence selection is smaller than or equivalent to the bic score of the network learned using the alternative approach greedy selection or sequential selection depending on the case if data set yields which is very negative strongly negative negative neutral it supports the null hypothesis if data sets yields bic score which is positive strongly positive extremely positive it supports the alternative hypothesis under any fixed structure solver the sign test rejects the null hypothesis providing significant evidence in favor of independence selection in the following when we further cite the sign 
test we refer to same type of analysis the sign test analyzes the counts of the which are in favor and against given method as for structure optimization asobs achieves higher bic score than obs in all the data sets under every chosen approach for parent set identification these results confirm the improvement of asobs over obs theoretically proven in section in most cases the in favor of asobs is larger than the difference in favor of asobs is significant sign test under every chosen approach for parent set identification we now compare asobs and gobnilp on the smaller data sets data sets with gobnilp significantly outperforms sign test asobs under every chosen approach for parent set identification on most of such data sets the in favor of the network learned by gobnilp is larger than this outcome is expected as gobnilp is an exact solver and those data gobnilp gs sq asobs gs sq gs sq very positive strongly positive positive neutral negative strongly negative very negative structure solver parent identification is vs obs table comparison of the approaches for parent set identification on data sets given any fixed solver for structural optimization is results in significantly higher bic scores than both gs and sq parent identification structure solver as vs independence sel gp ob forward sel gp ob sequential sel gp ob very positive strongly positive positive neutral negative strongly negative very negative table comparison between the structure optimization approaches on the data sets with asobs as outperforms both gobnilp gb and obs ob under any chosen approach for parent set identification sets imply relatively reduced search space however the focus of this paper is on large data sets on the data sets with asobs outperforms gobnilp sign test under every chosen approach for parent set identification table learning from data sets sampled from known networks in the next experiments we create data sets by sampling from known networks we take the largest networks available in the literature andes diabetes pigs link munin additionally we randomly generate other networks five networks of size five networks of size five networks of size each variable has number of states randomly drawn from to and number of parents randomly drawn from to overall we consider networks from each network we sample data set of instances we perform experiments and analysis as in the previous section for the sake of brevity we do not add further tables of results as for parent set identification independence selection outperforms both greedy selection and sequential selection the difference in favor of independence selection is significant sign test under every chosen structure optimization approach the of the learned network is in most cases take for instance gobnilp for structure optimization then independence selection yields in cases when compared to gs and in cases when compared to sq similar results are obtained using the other solvers for structure optimization strong results support also asobs against obs and gobnilp under every approach for parent set identification is obtained in cases when comparing asobs and obs the number of cases in which asobs obtains when compared against gobnilp ranges between and depending on the approach adopted for parent set selection the superiority of asobs over both obs and gobnilp is significant sign test under every approach for parent set identification moreover we measured the hamming distance between the moralized true structure and the learned structure on the data sets 
with asobs outperforms gobnilp and obs and is outperforms gs and sq sign test the novel framework is thus superior in terms of both score and correctness of the retrieved structure conclusion and future work our novel approximated approach for structural learning of bayesian networks scales up to thousands of nodes without constraints on the maximum the current results refer to the bic score but in future the methodology could be extended to other scoring functions acknowledgments work partially supported by the swiss nsf grant http references bartlett and cussens integer linear programming for the bayesian network structure learning problem artificial intelligence in press chickering meek and heckerman learning of bayesian networks is hard in proceedings of the conference on uncertainty in artificial intelligence pages morgan kaufmann cooper and herskovits bayesian method for the induction of probabilistic networks from data machine learning cussens bayesian network learning with cutting planes in proceedings of the conference annual conference on uncertainty in artificial intelligence pages auai press cussens malone and yuan ijcai tutorial on optimal algorithms for learning bayesian networks https de campos and ji efficient structure learning of bayesian networks using constraints journal of machine learning research de campos zeng and ji structure learning of bayesian networks using constraints in proceedings of the annual international conference on machine learning pages haaren and davis markov network structure learning randomized feature generation approach in proceedings of the aaai conference on artificial intelligence heckerman geiger and chickering learning bayesian networks the combination of knowledge and statistical data machine learning jaakkola sontag globerson and meila learning bayesian network structure using lp relaxations in proceedings of the international conference on artificial intelligence and statistics pages koivisto parent assignment is hard for the mdl aic and nml costs in proceedings of the annual conference on learning theory pages koivisto and sood exact bayesian structure discovery in bayesian networks journal of machine learning research lowd and davis learning markov network structure with decision trees in geoffrey webb bing liu chengqi zhang dimitrios gunopulos and xindong wu editors proceedings of the int conference on data mining pages mcgill multivariate information transmission psychometrika moore and wong optimal reinsertion new search operator for accelerated and more accurate bayesian network structure learning in fawcett and mishra editors proceedings of the international conference on machine learning pages menlo park california august aaai press raftery bayesian model selection in social research sociological methodology silander and myllymaki simple approach for finding the globally optimal bayesian network structure in proceedings of the conference on uncertainty in artificial intelligence pages teyssier and koller search simple and effective algorithm for learning bayesian networks in proceedings of the conference on uncertainty in artificial intelligence pages yuan and malone an improved admissible heuristic for learning optimal bayesian networks in proceedings of the conference on uncertainty in artificial intelligence yuan and malone learning optimal bayesian networks shortest path perspective journal of artificial intelligence research 
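To make the parent set identification machinery described above concrete, here is a minimal Python sketch of the decomposable BIC score and the greedy selection procedure, assuming the data set is given as a list of dicts of categorical values; the function and parameter names (bic, greedy_selection, n_states, max_pops) are illustrative and this is a sketch under those assumptions, not the authors' implementation.

# minimal sketch: decomposable BIC score and anytime greedy parent set selection
# rows: list of dicts mapping variable name -> categorical value
# n_states: dict mapping variable name -> number of states
import heapq
import math
from collections import Counter

def bic(rows, child, parents, n_states):
    # BIC(child | parents) = sum_{pi,x} N_{x,pi} * log(N_{x,pi} / N_pi)
    #                        - log(N)/2 * (|child states| - 1) * prod(|parent states|)
    n = len(rows)
    parents = tuple(parents)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)   # N_{x,pi}
    marg = Counter(tuple(r[p] for p in parents) for r in rows)                # N_{pi}
    loglik = sum(nxp * math.log(nxp / marg[pi]) for (pi, _), nxp in joint.items())
    penalty = 0.5 * math.log(n) * (n_states[child] - 1) * math.prod(n_states[p] for p in parents)
    return loglik - penalty

def greedy_selection(rows, child, variables, n_states, max_pops=1000):
    # anytime greedy selection: score all parent sets of size one, then repeatedly
    # pop the best scoring set and score all of its one-variable supersets
    candidates = [v for v in variables if v != child]
    scored = {}                                  # frozenset of parents -> BIC score
    heap = []                                    # max-heap via negated scores
    for v in candidates:
        s = bic(rows, child, (v,), n_states)
        scored[frozenset((v,))] = s
        heapq.heappush(heap, (-s, (v,)))
    for _ in range(max_pops):
        if not heap:
            break
        _, best = heapq.heappop(heap)
        for v in candidates:
            if v in best:
                continue
            superset = frozenset(best) | {v}
            if superset in scored:
                continue
            s = bic(rows, child, tuple(sorted(superset)), n_states)
            scored[superset] = s
            heapq.heappush(heap, (-s, tuple(sorted(superset))))
    return scored                                # candidate parent sets with their scores

In a full pipeline the scored parent sets returned here would be pruned and handed to the structure optimization step (GOBNILP, OBS or ASOBS); independence selection replaces the pop-the-best rule with the approximate score discussed above.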
matrix completion under monotonic single index models ravi ganti wisconsin institutes for discovery gantimahapat laura balzano electrical engineering and computer sciences university of michigan ann arbor girasole rebecca willett department of electrical and computer engineering rmwillett abstract most recent results in matrix completion assume that the matrix under consideration is or that the columns are in union of subspaces in settings however the linear structure underlying these models is distorted by typically unknown nonlinear transformation this paper addresses the challenge of matrix completion in the face of such nonlinearities given few observations of matrix that are obtained by applying lipschitz monotonic function to low rank matrix our task is to estimate the remaining unobserved entries we propose novel matrix completion method that alternates between lowrank matrix estimation and monotonic function estimation to estimate the missing matrix elements mean squared error bounds provide insight into how well the matrix can be estimated based on the size rank of the matrix and properties of the nonlinear transformation empirical results on synthetic and datasets demonstrate the competitiveness of the proposed approach introduction in matrix completion one has access to matrix with only few observed entries and the task is to estimate the entire matrix using the observed entries this problem has plethora of applications such as collaborative filtering recommender systems and sensor networks matrix completion has been well studied in machine learning and we now know how to recover certain matrices given few observed entries of the matrix when it is assumed to be low rank typical work in matrix completion assumes that the matrix to be recovered is incoherent low rank and entries are sampled uniformly at random while recent work has focused on relaxing the incoherence and sampling conditions under which matrix completion succeeds there has been little work for matrix completion when the underlying matrix is of high rank more specifically we shall assume that the matrix that we need to complete is obtained by applying some unknown function to each element of an unknown matrix because of the application of transformation the resulting ratings matrix tends to have large rank to understand the effect of the application of transformation on matrix we shall consider the following pm simple experiment given an matrix let σi ui vi be its svd the rank of the matrix is the number of singular values given an define the effective rank of as follows pm σj pm min σj effective rank figure the plot shows the defined in equation obtained by applying function to each element of where is matrix of rank the effective rank of tells us the rank of the lowest rank approximator that satisfies in figure we show the effect of applying monotonic function to the elements of matrix as increases both the rank of and its effective rank grow rapidly with rendering traditional matrix completion methods ineffective even in the presence of mild nonlinearities our model and contributions in this paper we consider the matrix completion problem where the data generating process is as follows there is some unknown matrix with and of rank monotonic lipschitz function is applied to each element of the matrix to get another matrix noisy version of which we call is observed on subset of indices denoted by mi zi xω the function is called the transfer function we shall assume that and the entries of are we shall also assume 
that the index set is generated uniformly at random with replacement from the set our task is to reliably estimate the entire matrix given observations of on we shall call the above model as monotonic matrix completion mmc to illustrate our framework we shall consider the following two simple examples in recommender systems users are required to provide discrete ratings of various objects for example in the netflix problem users are required to rate movies on scale of these discrete scores can be thought of as obtained by applying rounding function to some ideal real valued score matrix given by the users this score matrix may be well modeled by matrix but the application of the rounding function increases the rank of the original matrix another important example is that of completion of gaussian kernel matrices gaussian kernel matrices are used in kernel based learning methods the gaussian kernel matrix of set of points is an matrix obtained by applying the gaussian function on an underlying euclidean distance matrix the euclidean distance matrix is matrix however in many cases one can not measure all distances between objects resulting in an incomplete euclidean distance matrix and hence an incomplete kernel matrix completing the kernel matrix can then be viewed as completing matrix of large rank in this paper we study this matrix completion problem and provide algorithms with provable error guarantees our contributions are as follows in section we propose an optimization formulation to estimate matrices in the above described context in order to do this we introduce two formulations one using squared by we denote the set this is typical of many other recommender engines such as and technically the rounding function is not lipschitz function but can be well approximated by lipschitz function loss which we call mmc ls and another using calibrated loss function which we call as mmc for both these formulations we minimize and this calibrated loss function has the property that the minimizer of the calibrated loss satisfies equation we propose alternating minimization algorithms to solve our optimization problem our proposed algorithms called mmc and alternate between solving quadratic program to estimate and performing projected gradient descent updates to estimate the matrix mmc outputs the matrix where in section we analyze the mean squared error mse of the matrix returned by one step of the mmc algorithm the upper bound on the mse of the matrix output by mmc depends only on the rank of the matrix and not on the rank of matrix this property makes our analysis useful because the matrix could be potentially high rank and our results imply reliable estimation of high rank matrix with error guarantees that depend on the rank of the matrix we compare our proposed algorithms to implementations of low rank matrix completion on both synthetic and real datasets section related work classical matrix completion with and without noise has been investigated by several authors the recovery techniques proposed in these papers solve convex optimization problem that minimizes the nuclear norm of the matrix subject to convex constraints progress has also been made on designing efficient algorithms to solve the ensuing convex optimization problem recovery techniques based on nuclear norm minimization guarantee matrix recovery under the condition that the matrix is low rank the matrix is incoherent or not very spiky and the entries are observed uniformly at random literature on high rank matrix completion is 
relatively sparse when columns or rows of the matrix belong to union of subspaces then the matrix tends to be of high rank for such high rank matrix completion problems algorithms have been proposed that exploit the fact that multiple subspaces can be learned by clustering the columns or rows and learning subspaces from each of the clusters while eriksson et al suggested looking at the neighbourhood of each incomplete point for completion used combination of spectral clustering techniques as done in along with learning sparse representations via convex optimization to estimate the incomplete matrix singh et al consider certain specific class of matrices that are obtained from in the authors consider model similar to ours but instead of learning single monotonic function they learn multiple monotonic functions one for each row of the matrix however unlike in this paper their focus is on ranking problem and their proposed algorithms lack theoretical guarantees davenport et al studied the matrix completion problem their model is special case of the matrix completion model considered in this paper in the matrix completion problem we assume that is known and is the cdf of an appropriate probability distribution and the matrix is boolean matrix where each entry takes the value with probability mi and with probability mi since is known the focus in matrix completion problems is accurate estimation of to the best of our knowledge the mmc model considered in this paper has not been investigated before the mmc model is inspired by the model sim that has been studied both in statistics and econometrics for regression problems our mmc model can be thought of as an extension of sim to matrix completion problems algorithms for matrix completion our goal is to estimate and from the model in equations we approach this problem via mathematical optimization before we discuss our algorithms we mention in brief an algorithm for the problem of learning lipschitz monotonic functions in dimension this algorithm will be used for learning the link function in mmc the lpav algorithm suppose we are given data pn yn where pn def and yn are real numbers let is and monotonic the pn lpav algorithm introduced in outputs the best function in that minimizes pi yi in order to do this the lpav first solves the following optimization problem arg minn kz zj zi pj pi if pi pj def where pi this gives us the value of on discrete set of points pn to get everywhere else on the real line we simply perform linear interpolation as follows if if pn if µpi squared loss minimization natural approach to the monotonic matrix completion problem is to learn via squared loss minimization in order to do this we need to solve the following optimization problem min zi xi is and monotonic rank the problem is optimization problem individually in parameters reasonable approach to solve this optimization problem would be to perform optimization each variable while keeping the other variable fixed for instance in iteration while estimating one would keep fixed to say and then perform projected gradient descent this leads to the following updates for zi zi zi xi zi pr where is used in our projected gradient descent procedure and pr is projection on the rank cone the above update involves both the function and its derivative since our link function is monotonic one can use the lpav algorithm to estimate this link function furthermore since lpav estimates as linear function the function has everywhere and the can be obtained very cheaply hence the 
projected gradient update shown in equation along with the lpav algorithm can be iteratively used to learn estimates for and we shall call this algorithm as incorrect estimation of will also lead to incorrect estimation of the derivative hence we would expect ls to be less accurate than learning algorithm that does not have to estimate we next outline an approach that provides principled way to derive updates for and that does not require us to estimate derivatives of the transfer function as in ls minimization of calibrated loss function and the mmc algorithm let be differentiable function that satisfies furthermore since is monotonic function will be convex loss function now suppose and hence is known consider the following function of ex zi xi zi the above loss function is convex in since is convex differentiating the expression on the of equation and setting it to we get zi exi lpav stands for lipschitz pool adjacent violator the mmc model shown in equation satisfies equation and is therefore minimizer of the loss function hence the loss function is calibrated for the mmc model that we are interested in the idea of using calibrated loss functions was first introduced for learning single index models when the transfer function is identity is quadratic function and we get the squared loss approach that we discussed in section the above discussion assumes that is known however in the mmc model this is not the case to get around this problem we consider the following optimization problem min min ex zi xi zi where is convex function with and is matrix since we know that is lipschitz monotonic function we shall solve constrained optimization problem that enforces lipschitz constraints on and low rank constraints on we consider the sample version of the optimization problem shown in equation min min zi xi zi rank the of our algorithm mmc that solves the above optimization problem is shown in algorithm mmc optimizes for and alternatively where we fix one variable and update another at the start of iteration we have at our disposal iterates and to update our estimate of we perform gradient descent with fixed such that notice that the objective in equation is convex this is in contrast to the least squares objective where the objective in equation is the gradient of is xi gradient descent updates on using the above gradient calculation leads to an update of the form xi pr equation projects matrix onto cone of matrices of rank this entails performing svd on and retaining the top singular vectors and singular values while discarding the rest this is done in steps of algorithm as can be seen from the above equation we do not need to estimate derivative of this along with the convexity of the optimization problem in equation for given are two of the key advantages of using calibrated loss function over the previously proposed squared loss minimization formulation optimization over in round of algorithm we have after performing steps differentiating the objective function in equation we get that the optimal function should satisfy xi def where this provides us with strategy to calculate let then solving the optimization problem in equation is equivalent to solving the following optimization problem min xi subject to if where is the lipschitz constant of we shall assume that is known and does not need to be estimated the gradient of the objective function in equation when set to zero is the same as equation the constraints enforce monotonicity of and the lipschitz property of the above optimization 
routine is exactly the lpav algorithm the solution obtained from solving the lpav problem can be used to define on xω these two steps are repeated for iterations after iterations we have defined on in order to define everywhere else on the real line we perform linear interpolation as shown in equation algorithm monotonic matrix completion mmc input parameters data xω output initialize mn xω where xω is the matrix with zeros filled in at the unobserved locations initialize mn for do xi pr solve the optimization problem in to get set for all end for obtain on the entire real line using linear interpolation shown in equation def let us now explain our initialization procedure define xω where each is boolean mask with zeros everywhere and at an index corresponding to the index of an observed entry is the hadamard product product of matrices we have such boolean masks each corresponding to an observed entry we initialize to mn mn because each observed index is assumed to be sampled uniformly at xω random with replacement our initialization is guaranteed to be an unbiased estimate of mse analysis of mmc we shall analyze our algorithm mmc for the case of under the modeling assumption shown in equations and additionally we will assume that the matrices and are bounded in absolute value by when the mmc algorithm estimates and as follows mnxω pr is obtained by solving the lpav problem from equation with shown in equation this allows us to define define the mean squared error mse of our estimate as mi se mn denote by the spectral norm of matrix we need the following additional technical assumptions kz with probability at least where hides terms logarithmic in has entries bounded in absolute value by this means that in the worst case mn assumption requires that the spectral norm of is not very large assumption is weak assumption on the decay of the spectrum of by assumption applying weyl inequality we get since is noise matrix with independent bounded entries is matrix with entries this means that with hence assumption can be interpreted as imposing the condition this means that while could be full rank the th singular value of can not be too large def def theorem let let then under assumptions and the mse of the estimator output by mmc with is given by mn log mn se rα rmn where notation hides universal constants and the lipschitz constant of we would like to mention that the result derived for can be made to hold true for by an additional large deviation argument interpretation of our results our upper bounds on the mse of mmc depends on the quantity and since matrix has independent entries which are bounded in absolute value by is matrix with independent entries for such matrices see theorem in with these settings we can simplify the expression in equation to mn log rα rmn mn se remarkable fact about our sample complexity results is that the sample complexity is independent of the rank of matrix which could be large instead it depends on the rank of matrix which we assume to be small the dependence on is via the term from equation it is evident that the best error guarantees are obtained when for such values of equation reduces to mn log mn mn rmn se this result can be converted into sample complexity bound as follows if we are given pr pr mn it is important to note that the floor of the mse is then se which depends on the rank of and not on rank which can be much larger than experimental results we compare the performance of mmc mmc ls and nuclear norm based matrix completion lrmc on various 
synthetic and real world datasets the objective metric that we use to compare different algorithms is the root mean squared error rmse of the algorithms on unobserved test indices of the incomplete matrix synthetic experiments for our synthetic experiments we generated random matrix of rank by taking the product of two random gaussian matrices of size and with the matrix was generated using the function mi exp where by increasing we increase the lipschitz constant of the function making the matrix completion task harder for large enough mi sgn zi we consider the noiseless version of the problem where each entry in the matrix was sampled with probability and the sampled entries are observed this makes mnp for our implementations we assume that is unknown rmse on test data rmse on test data lrmc lrmc rmse on test data lrmc figure rmse of different methods at different values of and estimate it either via the use of dedicated validation set in the case of mmc or ii adaptively where we progressively increase the estimate of our rank until sufficient decrease in error over the training set is achieved for an implementation of the lrmc algorithm we used standard implementation from tfocs in order to speed up the run time of mmc we also keep track of the training set error and terminate iterations if the relative residual on the training set goes below certain threshold in the supplement we provide plot that demonstrates that for mmc the rmse on the training dataset has decreasing trend and reaches the required threshold in at most iterations hence we set figure show the rmse of each method for different values of as one can see from figure the rmse of all the methods improves for any given as increases this is expected since as increases pmn also increases as increases becomes steeper increasing the effective rank of this makes matrix completion task hard for small such as mmc is competitive with mmc and and is often the best in fact for small irrespective of the value of lrmc is far inferior to other methods for larger mmc works the best achieving smaller rmse over other methods experiments on real datasets we performed experimental comparisons on four real world datasets paper recommendation cameraman all of the above datasets except the cameraman dataset are ratings datasets where users have rated few of the several different items for the dataset we used randomly chosen ratings for each user for training randomly chosen rating for validation and the remaining for testing comes with its own training and testing dataset we used of the training data for validation for the cameraman and the paper recommendation datasets of the data was used for training for validation and the rest for testing the baseline algorithm chosen for low rank matrix completion is for each of the datasets we report the rmse of mmc mmc and on the test sets we excluded from these experiments because in all of our datasets the number of observed entries is very small fraction of the total number of entries and from our results on synthetic datasets we know that ls is not the best performing algorithm in such cases table shows the rmse over the test set of the different matrix completion methods as we see the rmse of mmc is the smallest of all the methods surpassing by large margin table rmse of different methods on real datasets dataset paperreco cameraman dimensions mmc mmc conclusions and future work we have investigated new framework and for high rank matrix completion problems called monotonic matrix completion and 
proposed new algorithms in the future we would like to investigate if one could relax improve the theoretical results for our experiments this threshold is set to http the parameter in the lmafit algorithm was set to effective rank and we used est for references prem melville and vikas sindhwani recommender systems in encyclopedia of machine learning springer mihai cucuringu graph realization and matrix completion phd thesis princeton university benjamin recht simpler approach to matrix completion jmlr emmanuel and benjamin recht exact matrix completion via convex optimization focm emmanuel candes and yaniv plan matrix completion with noise proceedings of the ieee sahand negahban and martin wainwright restricted strong convexity and weighted matrix completion optimal bounds with noise the journal of machine learning research raghunandan keshavan andrea montanari and sewoong oh matrix completion from few entries information theory ieee transactions on david gross recovering matrices from few coefficients in any basis information theory ieee transactions on jon dattorro convex optimization euclidean distance geometry lulu com bart vandereycken matrix completion by riemannian optimization siam journal on optimization mingkui tan ivor tsang li wang bart vandereycken and sinno pan riemannian pursuit for big matrix recovery in icml pages zheng wang lai zhaosong lu wei fan hasan davulcu and jieping ye matrix pursuit for matrix completion in icml pages zaiwen wen wotao yin and yin zhang solving factorization model for matrix completion by nonlinear successive algorithm mathematical programming computation brian eriksson laura balzano and robert nowak matrix completion in aistats congyuan yang daniel robinson and rene vidal sparse subspace clustering with missing entries in icml mahdi soltanolkotabi emmanuel candes et al geometric analysis of subspace clustering with outliers the annals of statistics ehsan elhamifar and rene vidal sparse subspace clustering algorithm theory and applications tpami aarti singh akshay krishnamurthy sivaraman balakrishnan and min xu completion of ultrametric matrices using selective entries in spcom pages ieee oluwasanmi koyejo sreangsu acharyya and joydeep ghosh retargeted matrix factorization for collaborative filtering in proceedings of the acm conference on recommender systems pages acm mark davenport yaniv plan ewout van den berg and mary wootters matrix completion information and inference sham kakade varun kanade ohad shamir and adam kalai efficient learning of generalized linear and single index models with isotonic regression in nips adam tauman kalai and ravi sastry the isotron algorithm isotonic regression in colt hidehiko ichimura semiparametric least squares sls and weighted sls estimation of models journal of econometrics joel horowitz and wolfgang direct semiparametric estimation of models with discrete covariates journal of the american statistical association alekh agarwal sham kakade nikos karampatziakis le song and gregory valiant least squares revisited scalable approaches for prediction in icml pages roman vershynin introduction to the analysis of random matrices arxiv preprint stephen becker candes and grant tfocs flexible methods for rank minimization in matrix optimization symposium siam conference on optimization 
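As a minimal sketch of the alternating MMC updates described above, the following numpy code uses a plain pool-adjacent-violators fit in place of LPAV (it enforces monotonicity but not the explicit Lipschitz constraint); the step size, iteration count and names (rank_r_project, fit_link, mmc_step) are illustrative assumptions rather than the authors' implementation.

# minimal numpy sketch of alternating monotonic matrix completion updates
import numpy as np

def pav(y):
    # pool adjacent violators: least-squares non-decreasing fit to y in the given order
    blocks = []                                   # each block is [value, weight, count]
    for v in y:
        blocks.append([float(v), 1.0, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            w = w1 + w2
            blocks.append([(v1 * w1 + v2 * w2) / w, w, c1 + c2])
    out = []
    for v, _, c in blocks:
        out.extend([v] * c)
    return np.array(out)

def rank_r_project(Z, r):
    # project Z onto the set of matrices of rank at most r via truncated SVD
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def fit_link(z_obs, m_obs):
    # estimate the monotone link g on observed pairs (z_i, m_i) and return a
    # callable defined everywhere by linear interpolation (cf. the LPAV step)
    order = np.argsort(z_obs)
    zs, gs = z_obs[order], pav(m_obs[order])
    return lambda t: np.interp(t, zs, gs)

def mmc_step(Z, M, mask, r, eta, g):
    # one alternating round: projected gradient step on Z (the calibrated loss
    # needs no derivative of g), then re-fit g on the observed entries
    Z = rank_r_project(Z - eta * mask * (g(Z) - M), r)
    return Z, fit_link(Z[mask], M[mask])

# usage sketch, with M holding observed entries (zeros elsewhere) and mask boolean:
# Z = (M.size / mask.sum()) * M                  # unbiased initialization used by MMC
# g = fit_link(Z[mask], M[mask])
# for _ in range(20):
#     Z, g = mmc_step(Z, M, mask, r=5, eta=0.1, g=g)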
visalogy answering visual analogy questions lawrence zitnick microsoft research larryz fereshteh sadeghi university of washington fsadeghi ali farhadi university of washington the allen institute for ai ali abstract in this paper we study the problem of answering visual analogy questions these questions take the form of image is to image as image is to what answering these questions entails discovering the mapping from image to image and then extending the mapping to image and searching for the image such that the relation from to holds for to we pose this problem as learning an embedding that encourages pairs of analogous images with similar transformations to be close together using convolutional neural networks with quadruple siamese architecture we introduce dataset of visual analogy questions in natural images and show first results of its kind on solving analogy questions on natural images introduction analogy is the task of mapping information from source to target analogical thinking is crucial component in problem solving and has been regarded as core component of cognition analogies have been extensively explored in cognitive sciences and explained by several theories and models shared structure shared abstraction identity of relation hidden deduction etc the two common components among most theories are the discovery of form of relation or mapping in the source and extension of the relation to the target such process is very similar to the tasks in analogy questions in standardized tests such as the scholastic aptitude test sat is to as is to what in this paper we introduce visalogy to address the problem of solving visual analogy questions three images ia ib and ic are provided as input and fourth image id must be selected such that ia is to ib as ic is to id this involves discovering an extendable mapping from ia to ib and then applying it to ic to find id estimating such mapping for natural images using current feature spaces would require careful alignment complex reasoning and potentially expensive training data instead we learn an embedding space where reasoning about analogies can be performed by simple vector transformations this is in fact aligned with the traditional logical understanding of analogy as an arrow or homomorphism from source to the target our goal is to learn representation that given set of training analogies can generalize to unseen analogies across various categories and attributes figure shows an example visual analogy question answering this question entails discovering the mapping from the brown bear to the white bear in this case color change applying the same mapping to the brown dog and then searching among set of images the middle row in figure to find an example that respects the discovered mapping from the brown dog best such mapping should ideally prefer white dogs the bottom row shows ranking imposed by visalogy we propose learning an embedding that encourages pairs of analogous images with similar mappings to be close together specifically we learn convolutional neural network cnn with siamese quadruple architecture figure to obtain an embedding space where analogical reasoning can be done with simple vector transformations doing so involves fine tuning the last layers of our network so that the difference in the unit normalized activations between analogue images is similar for image pairs with similar mapping and dissimilar for those that are not figure visual analogy question asks for missing image id given three images ia ib ic in the analogy quadruple solving visual analogy question entails discovering the mapping from ia to ib and applying it to ic and search among set of images the middle row to find the best image for which the mapping holds the bottom row shows an ordering of the images imposed by visalogy based on how likely they can be the answer to the analogy question its panels show the analogy question the test set of correct answers mixed with distractor negative images and the answer given as the top ranked selections by our method we also evaluate visalogy on generalization to unseen analogies to show the benefits of the proposed method we compare visalogy against competitive baselines that use standard cnns trained for classification our experiments are conducted on datasets containing natural images as well as synthesized images and the results include quantitative evaluations of visalogy across different sizes of distractor sets the performance in solving analogy questions is directly affected by the size of the set from which the candidate images are selected in this paper we study the problem of visual analogies for natural images and show the first results of its kind on solving visual analogy questions for natural images our proposed method learns an embedding where similarities are transferable across pairs of analogous images using siamese network architecture we introduce visual analogy question answering vaqa dataset of natural images that can be used to generate analogies across different objects attributes and actions of animals we also compile large set of analogy questions using the chair dataset containing analogies across viewpoint and style our experimental evaluations show promising results on solving visual analogy questions we explore different kinds of analogies with various numbers of distractors and show generalization to unseen analogies related work the problem of solving analogy questions has been explored in nlp using connectives supervised learning distributional similarities word vector representations and linguistic regularities and learning by reading solving analogy questions for diagrams and sketches has been extensively explored in ai these papers either assume simple forms of drawings require an abstract representation of diagrams or spatial reasoning in an framework is proposed to learn image filters between pair of images to create an analogous filtered result on third image related to analogies is learning how to separate category and style properties in images which has been studied using bilinear models in this paper we study the problem of visual analogies for natural images possessing different semantic properties where obtaining abstract representations is extremely challenging our work is also related to metric learning using deep neural networks in convolutional network is learned in siamese architecture for the task of face verification attributes have been shown to be effective representations for semantic image understanding in the relative attributes are introduced to learn ranking function per attribute while these methods provide an efficient feature representation to group similar objects and map similar images nearby each other in an embedding space they do not offer semantic space that can capture mapping and can not be directly used for analogical inference in the relationships between multiple pairs of classes are modeled via analogies which is shown to improve recognition as well as gre textual analogy tests in our work we learn analogies without explicitly considering categories and no textual data is
provided in our analogy questions learning representations using both textual and visual information has also been explored using deep architectures these representations show promising results for learning mapping between visual data the same way that it was shown for text we differ from these methods as our objective is to directly optimized for analogy questions and our method does not use textual information different forms of visual reasoning has been explored in the domain recently the visual question answering problem has been studied in several papers in method is introduced for answering several types of textual questions grounded with images while proposes the task of visual question answering in another recent approach knowledge extracted from web visual data is used to answer questions while these works all use visual reasoning to answer questions none have considered solving analogy questions our approach we pose answering visual analogy question as the problem of discovering the mapping from image to image and searching for an image that has the same relation to image as to specifically we find function parametrized by that maps each pair of images to vector the goal is to solve for parameters such that for positive image analogies as we describe below is computed using the differences in convnet output features between images quadruple siamese network positive training example for our network is an analogical quadruple of images where the transformation from to is the same as that of to to be able to solve the visual analogy problem our learned parameters should map these two transformations to similar location to formalize this we use contrastive loss function to measure how well is capable of placing similar transformations nearby in the embedding space and pushing dissimilar transformations apart given feature vector for each pair of input images the contrastive loss is defined as lm max where and refer to the embedding feature vector for and respectively label is if the input quadruple is correct analogy or otherwise also is the margin parameter that pushes and close to each other in the embedding space if and forces the distance between and in wrong analogy pairs be bigger than in the embedding space we train our network with both correct and wrong analogy quadruples and the error is back propagated through stochastic gradient descent to adjust the network weights the overview of our network architecture is shown in figure to compute the embedding vectors we use the quadruple siamese architecture shown in figure using this architecture each image in the analogy quadruple is fed through convnet alexnet with shared parameters the label shows whether the input quadruple is correct analogy or false analogy example to capture the transformation between image pairs and the outputs of the last fully connected layer are subtracted we normalize our embedding vectors to have unit length which results in the euclidean distance being the same as the cosine distance if xi are the outputs of the last fully connected layer in the convnet for image ii xij xi xj is computed by xi xj xi xj xj using the loss function defined in equation may lead to the network overfitting positive analogy pairs in the training set can get pushed too close together in the embedding space during training to overcome this problem we consider margin mp for positive analogy quadruples in this case and in the positive analogy pairs will be pushed close to each other only if the distance between them is bigger 
than mp it is clear that mp mn should hold between the two margins lmp mn max mp max mn single margin embedding space margin shared double margin embedding space margin shared an ve inst ces shared loss one analogy instance margin figure visalogy network has quadruple siamese architecture with shared parameters the network is trained with correct analogy quadruples of images along with wrong analogy quadruples as negative samples the contrastive loss function pushes and of correct analogies close to each other in the embedding space while forcing the distance between and in negative samples to be more than margin building analogy questions for creating dataset of visual analogy questions we assume each training image has information where denotes its category and denotes its property example properties include color actions and object orientation valid analogy quadruple should have the form ci ci co co where the two input images and have the same category ci but their properties are different that is has the property while has the property similarly the output images and share the same category co where ci co also has the property while has the property and generating positive quadruples given set of labeled images we construct our set of analogy types we select two distinct categories and two distinct properties which are shared between and using these selections we can build different analogy types either or can be considered as ci and co and similarly for and for each analogy type ci ci co co we can generate set of positive analogy samples by combining corresponding images this procedure provides large number of positive analogy pairs generating negative quadruples using only positive samples for training the network leads to degenerate models since the loss can be made zero by simply mapping each input image to constant vector therefore we also generate quadraples that violate the analogy rules as negative samples during training to generate negative quadruples we take two approaches in the first approach we randomly select images from the whole set of training images and each time check that the generated quadruple is not valid analogy in the second approach we first generate positive analogy quadruple then we randomly replace either of or with an improper image to break the analogy suppose we select for replacement then we can either randomly select an image with category co and property where and or we can randomly select an image with property but with category where co the second approach generates set of hard negatives to help improve training during the training we randomly sample from the whole set of possible negatives experiments testing scenario and evaluation metric to evaluate the performance of our method for solving visual analogy questions we create set of correct analogy quadruples using the labels of images given set of images which contain both positive and distracter images we would like to rank each image ii in based on how well it completes the analogy we compute the corresponding feature embeddings for each of the input images as well as xi for each image in and we rank based on ours alexnet ft alexnet chance ours alexnet ft alexnet chance retrieval ours alexnet ft alexnet chance recall ours alexnet ft alexnet chance retrieval retrieval retrieval figure quantitative evaluation log scale on chairs dataset recall as function of the number of images returned recall at for each question the recall at is either or and is averaged over questions the size of the 
distractor set is varied alexnet alexnet alexnet ft alexnet on chairs dataset for categorizing ii ii where is the embedding obtained from our network as explained in section we consider the images with the same category as of and the same property as of to be correct retrieval and thus positive image and the rest of the images in as negative images we compute the recall at to measure whether or not an image with an appropriate label has appeared in the top retrieved images ranki baseline it has been shown that the output of the layer in alexnet produces high quality image descriptors in each of our experiments we compare the performance of solving visual analogy problems using the image embedding obtained from our network with the image representation of alexnet in practice we pass each test image through alexnet and our network and extract the output from the last fully connected layer using both networks note that for solving general analogy questions the set of properties and categories are not known at the test time accordingly our proposed network does not use any labels during training and is aimed to generalize the transformations without explictily using the label of categories and properties dataset to evaluate the capability of our trained network for solving analogy questions in the test scenarios explained above we use large dataset of chairs as well as novel dataset of natural images vaqa that we collected for solving analogy questions on natural images implementation details in all the experiments we use stochastic gradient descent sgd to train our network for initializing the weights of our network we use the alexnet network for the task of object recognition provided by the bvlc caffe website we the last two fully connected layers and the last convolutional layer unless stated otherwise we have also used the double margin loss function introduced in equation with mp mn which we empirically found to give the best results in validatation set the effect of using single margin double margin loss function is also investigated in section analogy question answering using chairs we use large collection of models of chairs with different styles introduced in to make the dataset the cad models are download from warehouse and each chair style is rendered on white background from different view points for making analogy quadruples we use different view points of each chair style which results in synthesized images in this dataset we treat different styles as different categories and different view points as different properties of the images according to the explanations given in section we randomly select styles and view points for training and keep the rest for testing we use the rest of classes of chairs with view points which are completely unseen during the training to build unseen analogy questions that test the generalization capability of our network at test time to construct an analogy question we randomly select two different styles and two different view points the first part of the analogy quadruple contains two images with the same style and with two different view points the images from the second half of the analogy quadruple have another style and has the same viewpoint as and has the same view point as together and build an analogy question where is the correct answer using approach the total number of positive analogies that could be used during training is ours analogy question baseline figure left several examples of analogy questions from the chairs dataset in each 
question the first and second chair have the same style while their view points change the third image has the same view point as the first image but in different style the correct answer to each question is retrieved from set with distractors and should have the same style as the third image while its view point should be similar to the second image middle retrievals using the features obtained from our method right retrievals using alexnet features all retrievals are sorted from left to right to train our network we uniformly sampled quadruples of positive and negative analogies and initialized the weights with the alexnet network and its parameters figure shows several samples of the analogy questions left column used at test time and the images retrieved by our method middle column compared with the baseline right column we see that our proposed approach can retrieve images with style similar to that of the third image and with similar to the second image while the baseline approach is biased towards retrieving chairs with style similar to that of the first and the second image to quantitatively compare the performance of our method with the baseline we randomly generated analogy questions using the test images and report the average recall at retrieval while varying the number of irrelevant images in the distractor set note that since there is only one image corresponding to each style there is only one positive answer image for each question the performance of chance at the retrieval is nk where is the size of the images of this dataset are synthesized and do not follow natural image statistics therefore to be fair at comparing the results obtained from our network with that of the baseline alexnet we all layers of the alexnet via loss for categorization of different and using the set of images seen during training we then use the features obtained from the last fully connected layer of this network to solve analogy questions as shown in figure all layers of alexnet the violet curve referred to as alexnet ft in the diagram helps improve the performance of the baseline however the recall of our network still outperforms it with large margin analogy question answering using vaqa dataset as explained in section to construct natural image analogy dataset we need to have images of numerous object categories with distinguishable properties we also need to have these properties be shared amongst object categories so that we can make valid analogy quadruples using the labels in natural images we consider the property of an object to be either the action that it is doing for animate objects or its attribute for both animate and objects unfortunately we found that current datasets have sparse number of object properties per class which restricts the number of possible analogy questions for instance many action datasets are human centric and do not have analogous actions for animals as result we collected our own dataset vaqa for solving visual analogy questions data collection we considered list of attributes and actions along with list of common objects and paired them to make list of labels for collecting images out of this list we removed combinations that are not common in the real world horse blue is not common in the real world though there might be synthesized images of blue horse in the web we used the remaining list of labels to query google image search with phrases made from concatenation of word and and downloaded images for each phrase the images are manually verified to contain 
the concept of interest however we did not pose any restriction about the of the objects after the pruning step there exist around images per category with total of images the vaqa dataset consists of images corresponding to phrases which are made out of different categories and properties using the shared properties amongst categories we can build types of analogies in our experiments we used over analogy questions for training our network figure quantitative evaluation log scale on the vaqa dataset using attribute and action analogy questions panels seen attribute analogies seen action analogies unseen attribute analogies unseen action analogies curves ours alexnet features chance recall as function of the number of images returned recall at for each question the recall at is averaged over questions the size of the distractor set is fixed at in all experiments results for analogy types seen in training are shown in the left two plots and for analogy types not seen in training in the two right plots attribute analogy following the procedure explained in section we build positive and negative quadruples to train our network to be able to test the generalization of the learned embeddings for solving analogy question types that are not seen during training we randomly select attribute analogy types and remove samples of them from the training set of analogies using the remaining analogy types we sampled total of quadruples positive and negative that are used to train the network action analogy similarly we trained our network to learn action analogies for the generalization test we remove randomly selected analogy types and make the training quadruples using the remaining types we sampled quadruples positive and negative to train the network evaluation on vaqa using the unseen images during the training we make analogy quadruples to test the trained networks for the attribute and action analogies for evaluating the specification and generalization of our trained network we generate analogy quadruples in two scenarios of seen and unseen analogies using the analogy types seen during training and the ones in the withheld sets respectively in each of these scenarios we generated analogy questions and report the average recall at for each question images that have property equal to that of and category equal to are considered as correct answers the result is around positive images for each question and we fix the distractor set to have negative images for each question given the small size of our distractor set we report the average recall at the obtained results in different scenarios are summarized in figure in all the cases our method outperforms the baseline other than training separate networks for attribute and action analogies we trained and tested our network with combined set of analogy questions and obtained promising results with gap of compared to our baseline on the retrievals of the seen analogy questions note that our current dataset only has one property label per image either for attribute or action thus negative analogy for one property may be positive for the other more thorough analysis would require data which we leave for future work qualitative analysis figure shows examples of attribute analogy questions that are used for evaluating our network along with the top five retrieved images obtained from our method and the baseline method
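To make the retrieval protocol above concrete, here is a minimal sketch, in plain numpy and under assumed names, of scoring one analogy question x1:x2::x3:? with the additive transform in embedding space and recall at the top-k retrievals. The embedding source, the normalization, and the helper names are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def analogy_recall_at_k(f1, f2, f3, candidates, positive_mask, k=10):
    """Score one analogy question x1:x2 :: x3:?.

    f1, f2, f3    : unit-normalized embeddings of the three query images, shape (d,)
    candidates    : (n, d) unit-normalized embeddings of the retrieval set
                    (correct answers plus distractors)
    positive_mask : (n,) boolean array marking images whose category matches x3
                    and whose property matches x2 (the "correct" retrievals)
    Returns 1.0 if any positive image appears in the top-k retrievals, else 0.0.
    """
    # additive analogy transform in embedding space
    query = f3 + (f2 - f1)
    query = query / (np.linalg.norm(query) + 1e-12)

    # rank candidates by cosine similarity to the transformed query
    scores = candidates @ query
    topk = np.argsort(-scores)[:k]
    return float(positive_mask[topk].any())

def mean_recall_at_k(questions, k=10):
    """Average recall@k over a list of (f1, f2, f3, candidates, positive_mask)."""
    return np.mean([analogy_recall_at_k(*q, k=k) for q in questions])
```

The same scoring applies to both the learned embedding and the AlexNet baseline features; only the source of the vectors changes.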
as explained above during the data collection we only prune out images that do not contain the of interest also we do not pose any restriction for generating positive quadruples such as restricting the objects to have similar pose or having the same number of objects of interest in the quadruples however as can be seen in figure our network had been able to implicitly learn to generalize the count of objects for example in the first row of figure an image pair is dog swimming dog standing and the second part of the analogy has an image of multiple horses swimming given this analogy question as input our network has retrieved images with multiple standing horses in the top five retrievals ablation study in this section we investigate the effect of training the network with double margins mp mn for positive and negative analogy quadruples compared with only using one single margin for negative quadruples we perform an ablation experiment where we compare the performance of the network at retrieval while being trained using either of the loss functions explained in section also in two different scenarios we either only the top fully connected layers and attribute action analogy question baseline ours figure left samples of test analogy questions from vaqa dataset middle retrievals using the features obtained from our method right retrievals using alexnet features recall testing with seen analogy types ours ft mp mn ours ft mp mn ours ft mn ours ft mn alexnet features chance ours ft mp mn ours ft mn ours ft mn alexnet features chance testing with unseen analogy types ours ft mp mn top top retrieval figure quantitative comparison for the effect of using double margin single margin for training the visalogy network ferred to as ft in figure or the top fully connected layers plus the last convolutional layer referred to as ft in figure we use fixed training sample set consisting of quadruples generated from the vaqa dataset in this experiment in each case we test the trained network using samples coming from the set of analogy questions whose types are during the training as can be seen from figure using double margins mp mn in the loss function has resulted in better performance in both testing scenarios while using double margins results in small increase in the seen analogy types testing scenario it has considerably increased the recall when the network was tested with unseen analogy types this demonstrates that the use of double margins helps generalization conclusion in this work we introduce the new task of solving visual analogy questions for exploring the task of visual analogy questions we provide new dataset of natural images called vaqa we answer the questions using siamese convnet architecture that provides an image embedding that maps together pairs of images that share similar property differences we have demonstrated the performance of our proposed network using two datasets and have shown that our network can provide an effective feature representation for solving analogy problems compared to image representations acknowledgments this work was in part supported by onr nsf nsf and allen distinguished investigator award references gentner holyoak kokinov the analogical mind perspectives from cognitive science mit press shelley multiple analogies in science and philosophy john benjamins publishing juthe argument by analogy argumentation aubry maturana efros russell sivic seeing chairs exemplar alignment using large dataset of cad models in cvpr turney similarity of semantic relations 
comput linguist turney littman learning of analogies and semantic relations corr baroni lenci distributional memory general framework for semantics comput linguist jurgens turney mohammad holyoak task measuring degrees of relational similarity acl turney pantel from frequency to meaning vector space models of semantics artif int res levy goldberg linguistic regularities in sparse and explicit word representations in conll acl barbella forbus analogical dialogue acts supporting learning by reading analogies in instructional texts in aaai chang forbus using analogy to cluster sketches for educational software ai magazine forbus usher tomai analogical learning of relationships in sketches in aaai forbus usher lovett lockwood wetzel cogsketch sketch understanding for cognitive science research and for education topics in cognitive science chang wetzel forbus spatial reasoning in comparative analyses of physics diagrams in spatial cognition ix hertzmann jacobs oliver curless salesin image analogies in siggraph acm tenenbaum freeman separating style and content with bilinear models neural computation chopra hadsell lecun learning similarity metric discriminatively with application to face verification in cvpr farhadi endres hoiem forsyth describing objects by their attributes in cvpr parikh grauman relative attributes in iccv hwang grauman sha semantic embedding for visual object categorization in icml kiros salakhutdinov zemel unifying embeddings with multimodal neural language models arxiv preprint mikolov yih zweig linguistic regularities in continuous space word representations in geman geman hallonquist younes visual turing test for computer vision systems pnas malinowski fritz approach to question answering about scenes based on uncertain input in nips sadeghi kumar divvala farhadi viske visual knowledge extraction and question answering by visual verification of relation phrases in cvpr antol agrawal lu mitchell batra zitnick parikh vqa visual question answering in iccv yu park berg berg visual madlibs fill in the blank description generation and question answering in iccv malinowski rohrbach fritz ask your neurons approach to answering questions about images in iccv krizhevsky sutskever hinton imagenet classification with deep convolutional neural networks in nips jia shelhamer donahue karayev long girshick guadarrama darrell caffe convolutional architecture for fast feature embedding arxiv preprint 
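As a concrete reference for the single- versus double-margin objective ablated in the experiments above, the sketch below gives one plausible form of a contrastive loss over analogy quadruples: the difference vectors of the two image pairs are pulled within a positive margin mp for positive analogies and pushed beyond a negative margin mn for negative ones. This is an assumed reconstruction (forward pass only, in numpy); the exact functional form and margin values used in the paper are not reproduced here.

```python
import numpy as np

def double_margin_analogy_loss(f1, f2, f3, f4, is_positive, mp=0.2, mn=0.8):
    """Contrastive loss on analogy quadruples (x1, x2, x3, x4).

    f1..f4      : (b, d) embeddings of the four images in each quadruple
    is_positive : (b,) array, 1 for positive analogies and 0 for negative ones
    mp, mn      : positive and negative margins with mp < mn; the single-margin
                  variant corresponds to pulling positives all the way to 0 (mp = 0)
    """
    # transformation vectors for the two halves of each quadruple
    t12 = f2 - f1
    t34 = f4 - f3
    d = np.linalg.norm(t12 - t34, axis=1)      # distance between the two transforms

    pos = np.maximum(d - mp, 0.0) ** 2         # pull positive analogies within mp
    neg = np.maximum(mn - d, 0.0) ** 2         # push negative analogies beyond mn
    return np.mean(is_positive * pos + (1 - is_positive) * neg)
```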
mcmc inference for normalized random measure mixture models juho lee and seungjin choi department of computer science and engineering pohang university of science and technology pohang korea stonecold seungjin abstract normalized random measures nrms provide broad class of discrete random measures that are often used as priors for bayesian nonparametric models dirichlet process is example of nrms most of posterior inference methods for nrm mixture models rely on mcmc methods since they are easy to implement and their convergence is well studied however mcmc often suffers from slow convergence when the acceptance rate is low inference is an alternative deterministic posterior inference method where bayesian hierarchical clustering bhc or incremental bayesian hierarchical clustering ibhc have been developed for dp or nrm mixture nrmm models respectively although ibhc is promising method for posterior inference for nrmm models due to its efficiency and applicability to online inference its convergence is not guaranteed since it uses heuristics that simply selects the best solution after multiple trials are made in this paper we present hybrid inference algorithm for nrmm models which combines the merits of both mcmc and ibhc trees built by ibhc outlines partitions of data which guides procedure to employ appropriate proposals inheriting the nature of mcmc our mcmc tgmcmc is guaranteed to converge and enjoys the fast convergence thanks to the effective proposals guided by trees experiments on both synthetic and realworld datasets demonstrate the benefit of our method introduction normalized random measures nrms form broad class of discrete random measures including dirichlet proccess dp normalized inverse gaussian process and normalized generalized gamma process nrm mixture nrmm model is representative example where nrm is used as prior for mixture models recently nrms were extended to dependent nrms dnrms to model data where exchangeability fails the posterior analysis for nrm mixture nrmm models has been developed yielding simple mcmc methods as in dp mixture dpm models there are two paradigms in the mcmc algorithms for nrmm models marginal samplers and slice samplers the marginal samplers simulate the posterior distributions of partitions and cluster parameters given data or just partitions given data provided that conjugate priors are assumed by marginalizing out the random measures the marginal samplers include the gibbs sampler and the sampler although it was not formally extended to nrmm models the slice sampler maintains random measures and explicitly samples the weights and atoms of the random measures the term slice comes from the auxiliary slice variables used to control the number of atoms to be used the slice sampler is known to mix faster than the marginal gibbs sampler when applied to complicated dnrm mixture models where the evaluation of marginal distribution is costly the main drawback of mcmc methods for nrmm models is their poor scalability due to the nature of mcmc methods moreover since the marginal gibbs sampler and slice sampler iteratively sample the cluster assignment variable for single data point at time they easily get stuck in local optima sampler may resolve the local optima problem to some extent but is still problematic for datasets since the samples proposed by split or merge procedures are rarely accepted recently deterministic alternative to mcmc algorithms for nrm or dnrm mixture models were proposed extending bayesian hierarchical clustering bhc which was 
developed as inference for dp mixture models the algorithm referred to as incremental bhc ibhc builds binary trees that reflects the hierarchical cluster structures of datasets by evaluating the approximate marginal likelihood of nrmm models and is well suited for the incremental inferences for or streaming datasets the key idea of ibhc is to consider only exponentially many posterior samples which are represented as binary trees instead of drawing indefinite number of samples as in mcmc methods however ibhc depends on the heuristics that chooses the best trees after the multiple trials and thus is not guaranteed to converge to the true posterior distributions in this paper we propose novel mcmc algorithm that elegantly combines ibhc and mcmc methods for nrmm models our algorithm called the mcmc utilizes the trees built from ibhc to proposes good quality posterior samples efficiently the trees contain useful information such as dissimilarities between clusters so the errors in cluster assignments may be detected and corrected with less efforts moreover designed as mcmc methods our algorithm is guaranteed to converge to the true posterior which was not possible for ibhc we demonstrate the efficiency and accuracy of our algorithm by comparing it to existing mcmc algorithms background throughout this paper we use the following notations denote by set of indices and by xi dataset partition of is set of disjoint nonempty subsets of whose union is cluster is an entry of data points in cluster is denoted by xc xi for πn for the sake of simplicity we often use to represent singleton for in this section we briefly review nrmm models existing posterior inference methods such as mcmc and ibhc normalized random measure mixture models let be homogeneous completely random measure crm on measure space with intensity and base measure written as crm ρh we also assume that dw dw so that has infinitely many atoms and the total mass is finite wj wj nrm is then formed by normalizing by its total mass for each index we draw the corresponding atoms from nrm θi since is discrete the set θi naturally form partition of with respect to the assigned atoms we write the partition as set of sets whose elements are and subsets of and the union of the elements is we index the elements clusters of with the symbol and denote the unique atom assigned to as θc summarizing the set θi as θc the posterior random measure is written as follows theorem let θc be samples drawn from where crm ρh with an auxiliary variable gamma the posterior random measure is written as wc δθc where ρu dw dw crm ρu dwc ρu dwc moreover the marginal distribution is written as dθc du du κρ dθc where ψρ dw κρ ρu dw using the predictive distribution for the novel atom is written as κρ θi κρ dθ δθc dθ κρ the most general crm may be used is the generalized gamma with intensity dw ασ dw in nrmm models the observed dataset is assusmed to be generated from likelihood with parameters θi drawn from nrm we focus qon the conjugate case where is conjugate to so that the integral dxc dθ dxi is tractable mcmc inference for nrmm models the goal of posterior inference for nrmm models is to compute the posterior dθc with the marginal likelihood dx marginal gibbs sampler marginal gibbs sampler is basesd on the predictive distribution at each iteration cluster assignments for each data point is sampled where xi may join an existκ ing cluster with probability proportional to ρκρ dxi or create novel cluster with probability proportional to κρ dxi slice sampler instead of 
marginalizing out slice sampler explicitly sample the atoms and weights wj of since maintaining infinitely many atoms is infeasible slice variables si are introduced for each data point and atoms with masses larger than threshold usually set as si are kept and remaining atoms are added on the fly as the threshold changes at each iteration xi is assigned to the jth atom with probability si wj dxi sampler both marginal gibbs and slice sampler alter single cluster assignment at time so are prone to the local optima sampler originally developed for dpm is marginal sampler that is based on at each iteration instead of changing individual cluster assignments sampler splits or merges clusters to propose new partition the split or merged partition is proposed by procedure called the restricted gibbs sampling which is gibbs sampling restricted to the clusters to split or merge the proposed partitions are accepted or rejected according to schemes samplers are reported to mix better than marginal gibbs sampler ibhc inference for nrmm models bayesian hierarchical clustering bhc is probabilistic agglomerative clustering where the marginal likelihood of dpm is evaluated to measure the dissimilarity between nodes like the traditional agglomerative clustering algorithms bhc repeatedly merges the pair of nodes with the smallest dissimilarities and builds binary trees embedding the hierarchical cluster structure of datasets bhc defines the generative probability of binary trees which is maximized during the construction of the tree and the generative probability provides lower bound on the marginal likelihood of dpm for this reason bhc is considered to be posterior inference algorithm for dpm incremental bhc ibhc is an extension of bhc to dependent nrmm models like bhc is deterministic posterior inference algorithm for dpm ibhc serves as deterministic posterior inference algorithms for nrmm models unlike the original bhc that greedily builds trees ibhc sequentially insert data points into trees yielding scalable algorithm that is well suited for online inference we first explain the generative model of trees and then explain the sequential algorithm of ibhc xi case case case figure left in ibhc new data point is inserted into one of the trees or create novel tree middle three possible cases in seqinsert right after the insertion the potential funcitons for the nodes in the blue bold path should be updated if updated the tree is split at that level ibhc aims to maximize the joint probability of the data and the auxiliary variable dx du du κρ dxc let tc be binary tree whose leaf nodes consist of the indices in let and denote the left and right child of the set in tree and thus the corresponding trees are denoted by tl and tr the generative probability of trees is described with the potential function which is the unnormalized reformulation of the original definition the potential function of the data xc given the tree tc is recursively defined as follows xc κρ dxc xc xc xl xr here hc is the hypothesis that xc was generated from single cluster the first therm xc is proportional to the probability that hc is true and came from the term inside the product of the second term is proportional to the probability that xc was generated from more than two clusters embedded in the subtrees tl and tr the posterior probability of hc is then computed as hc tc xl xr where xc is defined to be the dissimilarity between and in the greedy construction the pair of nodes with smallest are merged at each iteration when the minimum 
dissimilarity exceeds one hc tc hc is concluded to be false and the construction stops this is an important mechanism of bhc and ibhc that naturally selects the proper number of clusters in the perspective of the posterior inference this stopping corresponds to selecting the map partition that maximizes if the tree is built and the potential function is computed for the entire dataset lower bound on the joint likelihood is obtained du dx du now we explain the sequential tree construction of ibhc ibhc constructs tree in an incremental manner by inserting new data point into an appropriate position of the existing tree without computing dissimilarities between every pair of nodes the procedure which comprises three steps is elucidated in fig step left given suppose that trees are built by ibhc yielding to partition when new data point xi arrives this step assigns xi to tree tbc which has the smallest distance arg or create new tree ti if step middle suppose that the tree chosen in step is tc then step determines an appropriate position of xi when it is inserted into the tree tc and this is done by the procedure seqinsert seqinsert chooses the position of among three cases fig case elucidates an option where xi is placed on the top of the tree tc case and show options where xi is added as sibling of the subtree tl or tr respectively among these three cases the one with the highest potential function is selected which can easily be done by comparing and if is the smallest then case is selected and the insertion terminates otherwise if is the smallest xi is inserted into tl and seqinsert is recursively executed the same procedure is applied to the case where is smallest samplesub samplesub merge split stocinsert stocinsert split merge figure global moves of tgmcmc top row explains the way of proposing split partition from partition and explains the way to retain from bottom row shows the same things for merge case step right after step and are applied the potential functions of should be computed again starting from the subtree of tc to which xi is inserted to the root during this procedure updated values may exceed in such case we split the tree at the level where and all the split nodes after having inserted all the data points in the auxiliary variable and hyperparameters for dw are resampled and the tree is reconstructed this procedure is repeated several times and the trees with the highest potential functions are chosen as an output main results mcmc procedure ibhc should reconstruct trees from the ground whenever and hyperparameters are resampled and this is obviously time consuming and more importantly converge is not guaranteed instead of completely reconstructing trees we propose to refine the parts of existing trees with mcmc our algorithm called mcmc tgmcmc is combination of deterministic inference and mcmc where the trees constructed via ibhc guides mcmc to propose samples tgmcmc initialize chain with single run of ibhc given current partition and trees tc tgmcmc proposes novel partition by global and local moves global moves split or merges clusters to propose and local moves alters cluster assignments of individual data points via gibbs sampling we first explain the two key operations used to modify tree structures and then explain global and local moves more details on the algorithm can be found in the supplementary material key operations samplesub given tree tc draw subtree with probability is added for leaf nodes whose and set to the maximum among all subtrees of tc the drawn 
subtree is likely to contain errors to be corrected by splitting the probability of drawing is multiplied to where is usually set to transition probabilities stochastic version of ibhc may be inserted to via seqinsert with probability or may just be put into create new cluster stocinsert in with probability if is inserted via seqinsert the potential functions are updated accordingly but the trees are not split even if the update dissimilarities exceed as in samplesub the probability is multiplied to global moves the global moves of tgmcmc are analogy to sampling in sampling pair of data points are randomly selected and split partition is proposed if they belong to the same cluster or merged partition is proposed otherwise instead tgmcmc finds the clusters that are highly likely to be split or merged using the dissimilarities between trees which goes as follows in detail first we randomly pick tree tc in uniform then we compute for and put in set with probability the probability of merging and the transition probability up to this step is the set contains candidate clusters to merge with if is empty which means that there are no candidates to merge with we propose by splitting otherwise we propose by merging and clusters in split case we start splitting by drawing subtree tc by samplesub then we split to destroy all the parents of tc and collect the split trees into set fig top then we reconstruct the tree by stocinsert for all after the reconstruction has at least two clusters since we split before insertion the split partition to propose is the reverse transition probability is computed as follows to obtain from we must merge the clusters in to for this we should pick cluster and put other clusters in into since we can pick any at first the reverse transition probability is computed as sum of all those possibilities merge case suppose that we have cm the merged partition to propose is given as where cm we construct the corresponding binary tree as cascading tree where we put cm on top of in order fig bottom to compute the reverse transition probability we should compute the probability of splitting back into cm for this we should first choose and put nothing into the set to provoke splitting up to this step is then we should sample the parent of the subtree connecting and via samplesub and this would result in and cm finally we insert ci into via stocinsert ci for where we select each to create new cluster in corresponding update to by stocinsert is ci cj ci once we ve proposed and computed both and is accepted with probability min where dx du dx du ergodicity of the global moves to show that the global moves are ergodic it is enough to show that we can move an arbitrary point from its current cluster to any other cluster in finite step this can easily be done by single split and merge moves so the global moves are ergodic time complexity of the global moves the time complexity of stocinsert is where is height of the tree to insert the total time complexity of split proposal is mainly determined by the time to execute stocinsert this procedure is usually efficient especially when the trees are well balanced the time complexity to propose merged partition is local moves in local moves we resample cluster assignments of individual data points via gibbs sampling if leaf node is moved from to we detach from tc and run seqinsert here instead of running gibbs sampling for all data points we run gibbs sampling for subset of data points which is formed as follows for each we draw subtree by 
samplesub then we here we restrict samplesub to sample nodes since leaf nodes can not be split we assume that clusters are given their own indices such as hash values so that they can be ordered we do not split even if the update dissimilarity exceed one as in stocinsert gibbs sm tgmcmc gibbs sm tgmcmc time sec max gibbs sm tgmcmc time sec log ess time iter average average average average iteration gibbs sm tgmcmc max iteration ess log time iter figure experimental results on toy dataset top row scatter plot of toy dataset of three samplers with dp with nggp of tgmcmc with varying and varying bottom row the statistics of three samplers with dp and the statistics of three samplers with nggp average gibbs sm gibbs sm tgmcmc gibbs sub sm sub tgmcmc time sec max ess log time iter figure average plot and the statistics of the samplers for dataset draw subtree of again by samplesub we repeat this subsampling for times and put the leaf nodes of the final subtree into smaller would result in more data points to resample so we can control the tradeoff between iteration time and mixing rates cycling at each iteration of tgmcmc we cycle the global moves and local moves as in sampling we first run the global moves for times and run single sweep of local moves setting and were the moderate choice for all data we ve tested experiments in this section we compare marginal gibbs sampler gibbs sampler sm and tgmcmc on synthetic and real datasets toy dataset we first compared the samplers on simple toy dataset that has points with clusters sampled from the mixture of gaussians with predefined means and covariances since the partition found by ibhc is almost perfect for this simple data instead of initializing with ibhc we initialized the binary tree and partition as follows as in ibhc we sequentially inserted data points into existing trees with random order however instead of inserting them via seqinsert we just put data points on top of existing trees so that no splitting would occur tgmcmc was initialized with the tree constructed from this procedure and gibbs and sm were initialized with corresponding partition we assumed the and base measure dµ dλ rλ where is the dimensionality is the sample mean and is the sample covariance we compared the samplers using both dp and nggp priors for tgmcmc we fixed the number of global moves and the parameter for local moves except for the cases where we controlled them explicitly all the samplers were run for seconds and repeated times we compared the joint log dx du of samples and the effective sample size ess of the number of clusters found for sm and tgmcmc we compared the average log value of the acceptance ratio the results are summarized in fig as shown in average gibbs sm gibbs sm tgmcmc gibbs sub sm sub tgmcmc time sec max ess log time iter figure average plot and the statistics of the samplers for nips corpus the trace plot tgmcmc quickly converged to the ground truth solution for both dp and nggp cases also tgmcmc mixed better than other two samplers in terms of ess comparing the average log values of sm and tgmcmc we can see that the partitions proposed by tgmcmc is more often accepted we also controlled the parameter and as expected higher resulted in faster convergence however smaller more data points involved in local moves did not necessarily mean faster convergence synthetic dataset we also compared the three samplers on larger dataset containing points which we will call as dataset generated from mixture of gaussians with labels drawn from py we used 
the same base measure and initialization with those of the toy datasets and used the nggp prior we ran the samplers for seconds and repeated times gibbs and sm were too slow so the number of samples produced in seconds were too small hence we also compared gibbs sub and sm sub where we uniformly sampled the subset of data points and ran gibbs sweep only for those sampled points we controlled the subset size to make their running time similar to that of tgmcmc the results are summarized in fig again tgmcmc outperformed other samplers both in terms of the and ess interestingly sm was even worse than gibbs since most of the samples proposed by split or merge proposal were rejected gibbs sub and sm sub were better than gibbs and sm but still failed to reach the best state found by tgmcmc nips corpus we also compared the samplers on nips containing documents with words we used the multinomial likelihood and symmetric dirichlet base measure dir used nggp prior and initialized the samplers with normal ibhc as for the dataset we compared gibbs sub and sm sub along we ran the samplers for seconds and repeated times the results are summarized in fig tgmcmc outperformed other samplers in terms of the loglikelihood all the other samplers were trapped in local optima and failed to reach the states found by tgmcmc however ess for tgmcmc were the lowest meaning the poor mixing rates we still argue that tgmcmc is better option for this dataset since we think that finding the better states is more important than mixing rates conclusion in this paper we have presented novel inference algorithm for nrmm models our sampler called tgmcmc utilized the binary trees constructed by ibhc to propose good quality samples tgmcmc explored the space of partitions via global and local moves which were guided by the potential functions of trees tgmcmc was demonstrated to be outperform existing samplers in both synthetic and real world datasets acknowledgments this work was supported by the it program of machine learning center national research foundation nrf of korea and creative research project https references ferguson bayesian analysis of some nonparametric problems the annals of statistics lijoi mena and hierarchical mixture modeling with normalized inversegaussian priors journal of the american statistical association brix generalized gamma measures and cox processes advances in applied probability lijoi mena and controlling the reinforcement in bayesian nonparametric mixture models journal of the royal statistical society regazzini lijoi and distriubtional results for means of normalized random measures with independent increments the annals of statistics chen ding and buntine dependent hierarchical normalized random measures for dynamic topic modeling in proceedings of the international conference on machine learning icml edinburgh uk chen rao buntine and teh dependent normalized random measures in proceedings of the international conference on machine learning icml atlanta georgia usa james bayesian poisson process partition calculus with an application to bayesian moving averages the annals of statistics james lijoi and posterior analysis for normalized random measures with independent increments scandinavian journal of statistics favaro and teh mcmc for normalized random measure mixture models statistical science antoniak mixtures of dirichlet processes with applications to bayesian nonparametric problems the annals of statistics jain and neal markov chain monte carlo procedure for the dirichlet process 
mixture model journal of computational and graphical statistics griffin and walkera posterior simulation of normalized random measure mixtures journal of computational and graphical statistics lee and choi incremental inference with dependent normalized random measures in proceedings of the international conference on artificial intelligence and statistics aistats reykjavik iceland heller and ghahrahmani bayesian hierarchical clustering in proceedings of the international conference on machine learning icml bonn germany 
streaming hypergraph partitioning jennifer carnegie mellon university pittsburgh pa jiglesia dan alistarh microsoft research cambridge united kingdom milan vojnovic microsoft research cambridge united kingdom milanv abstract in many applications the data is of rich structure that can be represented by hypergraph where the data items are represented by vertices and the associations among items are represented by hyperedges equivalently we are given an input bipartite graph with two types of vertices items and associations which we refer to as topics we consider the problem of partitioning the set of items into given number of components such that the maximum number of topics covered by component is minimized this is clustering problem with various applications partitioning of set of information objects such as documents images and videos and load balancing in the context of modern computation platforms in this paper we focus on the streaming computation model for this problem in which items arrive online one at time and each item must be assigned irrevocably to component at its arrival time motivated by scalability requirements we focus on the class of streaming computation algorithms with memory limited to be at most linear in the number of components we show that greedy assignment strategy is able to recover hidden of items under natural set of recovery conditions we also report results of an extensive empirical evaluation which demonstrate that this greedy strategy yields superior performance when compared with alternative approaches introduction in variety of applications one needs to process data of rich structure that can be conveniently represented by hypergraph where associations of the data items represented by vertices are represented by hyperedges subsets of items such data structure can be equivalently represented by bipartite graph that has two types of vertices vertices that represent items and vertices that represent associations among items which we refer to as topics in this bipartite graph each item is connected to one or more topics the input can be seen as graph with vertices belonging to overlapping communities there has been significant work on partitioning set of items into disjoint components such that similar items are assigned to the same component see for survey this problem arises in the context of clustering of information objects such as documents images or videos for example the goal may be to partition given collection of documents into disjoint such that the maximum number of distinct topics covered by each is minimized resulting in work performed in part while an intern with microsoft research figure an example of hidden coclustering with five hidden clusters figure simple example of set of items with overlapping associations to topics parsimonious summary the same fundamental problem also arises in processing of complex data workloads including enterprise emails online social networks graph data processing and machine learning computation platforms and load balancing in modern streaming query processing platforms in this context the goal is to partition set of data items over given number of servers to balance the load according to some given criteria problem definition we consider the hypergraph partitioning problem defined as follows the input to the problem is set of items set of topics number of components to partition the set of items and demand matrix that specifies which particular subset of topics is associated with each individual item given 
partitioning of the set of items the cost of component is defined as the number of distinct topics that are associated with items of the given component the cost of given partition is the maximum cost of component in other words given an input hypergraph and partition of the set of vertices into given number of disjoints components the cost of component is defined to be the number of hyperedges that have at least one vertex assigned to this component for example for the simple input graph in figure partition of the set of items into two components and amounts to the cost of the components each of value thus the cost of the partition is of value the cost of component is submodular function as the distinct topics associated with items of the component correspond to neighborhood set in the input bipartite graph in the streaming computation model that we consider items arrive sequentially one at time and each item needs to be assigned irrevocably to one component at its arrival time this streaming computation model allows for limited memory to be used at any time during the execution whose size is restricted to be at most linear in the number of the components both these assumptions arise as part of system requirements for deployment in services the hypergraph partition problem is np hard the streaming computation problem is even more difficult as less information is available to the algorithm when an item must be assigned contribution in this paper we consider the streaming hypergraph partitioning problem we identify greedy item placement strategy which outperforms all alternative approaches considered on datasets and can be proven to have recovery property it recovers hidden of items in probabilistic inputs subject to recovery condition specifically we show that given set of hidden to be placed onto components the greedy strategy will tend to place items from the same hidden cluster onto the same component with high probability in turn this property implies that greedy will provide constant factor approximation of the optimal partition on inputs satisfying the recovery property the probabilistic input model we consider is defined as follows the set of topics is assumed to be partitioned into given number of disjoint hidden clusters each item is connected to topics according to mixture probability distribution defined as follows each item first selects one of the hidden clusters as home hidden cluster by drawing an independent sample from uniform distribution over the hidden clusters then it connects to each topic from its home hidden cluster independently with probability and it connects to each topic from each other hidden cluster with probability this defines hidden of the input bipartite graph see figure for an example this model is similar in spirit to the popular stochastic block model of an undirected graph and it corresponds to hidden model of an undirected bipartite graph we consider asymptotically accurate recovery of this hidden hidden cluster is said to be asymptotically recovered if the portion of items from the given hidden cluster assigned to the same partition goes to one asymptotically as the number of items observed grows large an algorithm guarantees balanced asymptotic recovery if additionally it ensures that the cost of the most loaded partition is within constant of the average partition load our main analytical result is showing that simple greedy strategy provides balanced asymptotic recovery of hidden clusters theorem we prove that sufficient condition for the 
recovery of hidden clusters is that the number of hidden clusters is at least log where is the number of components and that the gap between the probability parameters and is sufficiently large log kr log where is the number of topics in hidden cluster roughly speaking this means that if the mean number of topics to which an item is associated with in its home hidden cluster of topics is at least twice as large as the mean number of topics to which an item is associated with from other hidden clusters of topics then the simple greedy online algorithm guarantees asymptotic recovery the proof is based on coupling argument where we first show that assigning an item to partition based on the number of topics it has in common with each partition is similar to making the assignment proportionally to the number of items corresponding to the same hidden cluster present on each partition in turn this allows us to couple the assignment strategy with polya urn process with dynamics which implies that the policy converges to assigning each item from hidden cluster to the same partition additionally this phenomenon occurs in parallel for each cluster this recovery property will imply that this strategy will ensure constant factor approximation of the optimum assignment further we provide experimental evidence that this greedy online algorithm exhibits good performance for several input bipartite graphs outperforming more complex assignment strategies and even some offline approaches problem definition and basic results in this section we provide formal problem definition and present some basic results on the computational hardness and lower bounds input the input is defined by set of items set of topics and given number of components dependencies between items and topics are given by demand matrix di where di indicates that item needs topic and di alternatively we can represent the input as bipartite graph where there is an edge if and only if item needs topic or as hypergraph where hyperedge consists of all items that use the same topic the problem an assignment of items to components is given by where xi if item is assigned to component and xi otherwise given an assignment of items to components the cost of component is defined to be equal to the minimum number of distinct topics that are needed by this component to cover all the items assigned to it min cj di xi as defined the cost of each component is submodular function of the items assigned to it we consider the hypergraph partitioning problem defined as follows minimize max ck subject to xi we note that this problem is an instance of the submodular load balancing as defined in the framework allows for natural generalization to allow for demands in this paper we focus on demands basic results this problem is by reduction from the subset sum problem proposition the hypergraph partitioning problem is we now give lower bound on the optimal value of the problem using the observation that each topic needs to be made available on at least one component proposition for every partition of the set of items in components the maximum cost of component is larger than or equal to where is the number of topics we next analyze the performance of an algorithm which simply assigns each item independently to component chosen uniformly at random from the set of all components upon its arrival although this is popular strategy commonly deployed in practice for load balancing in computation platforms the following result shows that it does not yield good solution for 
the hypergraph partitioning problem proposition the expected maximum load of component under random assignment is at least nj where nj is the number of items associated with topic for instance if we assume that nj for each topic we obtain that the expected maximum load is of at least this suggests that the performance of random assignment is poor on an input where topics form disjoint clusters and each item subscribes to single cluster the optimal solution has cost whereas by the above claim random assignment has approximate cost yielding competitive ratio that is linear in balanced recovery of hidden we relax the input requirements by defining family of hidden inputs our model is generalization of the stochastic block model of graph to the case of hypergraphs we consider set of topics partitioned into clusters cℓ each of which contains topics given these hidden clusters each item is associated with topics as follows each item is first assigned home cluster ch chosen uniformly at random among the hidden clusters the item then connects to topics inside its home cluster by picking each topic independently with fixed probability further the item connects to topics from fixed arbitrary noise set qh of size at most outside its home cluster ch where the item is connected to each topic in qh uniformly at random with fixed probability sampling outside topics from the set of all possible topics would in the limit lead to every partition to contain all possible topics which renders the problem trivial we do not impose this limitation in the experimental validation definition hidden bipartite graph is in hc if it is constructed using the above process with items and clusters with topics per cluster where each item subscribes to topics inside its randomly chosen home cluster with probability and to topics from the noise set with probability at each time step new item is presented in the input stream of items and is immediately assigned to one of the components sk according to some algorithm algorithms do not know the number of hidden clusters or their size but can examine previous assignments definition asymptotic balanced given hidden hc we say an algorithm asymptotically recovers the hidden clusters cℓ if there exists recovery time tr during its execution after which for each hidden cluster ci there exists component sj such that each item with home cluster ci is assigned to component sj with probability that goes to as the number of items grows large moreover the recovery is balanced if the ratio between the maximum cost of component and the average cost over components is upper bounded by constant streaming algorithm and the recovery guarantee recall that we consider the online problem where we receive one item at time together with all its corresponding topics the item must be immediately and irrevocably assigned to some component in the following we describe the greedy strategy specified in algorithm data hypergraph received one item vertex at time partitions capacity bound result partition of into parts set initial partitions sk to be empty sets while there are incoming items do receive the next item and its topics minj components not exceeding capacity compute ri size of topic intersection arg ri if tied choose least loaded component sj sj item and its topics are assigned to sj return sk algorithm the greedy algorithm this strategy places each incoming item onto the component whose incremental cost after adding the item and its topics is minimized the immediate goal is not balancing but rather 
clustering similar items this could in theory lead to large imbalances to prevent this we add balancing constraint specifying the maximum load imbalance if adding the item to the first candidate component would violate the balancing constraint then the item is assigned to the first valid component in decreasing order of the intersection size the recovery theorem in this section we present our main theoretical result which provides sufficient condition for the greedy strategy to guarantee balanced asymptotic recovery of hidden clusters theorem the recovery theorem for random input consisting of hidden graph in hc to be partitioned across components if the number of clusters is log and the probabilities and satisfy log and log rk then the greedy algorithm ensures balanced asymptotic recovery of the hidden clusters remarks specifically we prove that under the given conditions recovery occurs for each hidden cluster by the time log cluster items have been observed with probability where is constant moreover clusters are randomly distributed among the components together these results can be used to bound the maximum cost of partition to be at most constant factor away the lower bound of given by lemma the extra cost comes from incorrect assignments before the recovery time and from the imperfect balancing of clusters over the components corollary the expected maximum load of component is at most proof overview we now provide an overview of the main ideas of the proof which is available in the full version of the paper preliminaries we say that two random processes are coupled if their random choices are the same we say that an event occurs with high probability if it occurs with probability at least where is constant we make use of polya urn process which is defined as follows we start each of urns with one ball and at each step observe new ball we assign the new ball to urn with probability proportional to bi where is fixed real constant and bi is the number of balls in urn at time we use the following classic result lemma polya urn convergence consider finite polya urn process with exponent and let xti be the fraction of balls in urn at time then almost surely the limit xi xti exists for each moreover we have that there exists an urn such that xj and that xi for all step recovering single cluster we first prove that in the case of single home cluster for all items and two components with no balance constraints the greedy algorithm with no balance constraints converges to monopoly eventually assigns all the items from dataset book ratings facebook app data retail data zune podcast data items readers users customers listeners topics books apps items bought podcasts of items of topics edges figure table showing the data sets and information about the items and topics this cluster onto the same component formally there exists some convergence time tr and some component si such that after time tr all future items will be assigned to component si with probability at least our strategy will be to couple greedy assignment with polya urn process with exponent showing that the dynamics of the two processes are the same there is one significant technical challenge that one needs to address while the polya process assigns new balls based on the ball counts of urns greedy assigns items and their respective topics based on the number of topic intersections between the item and the partition we resolve this issue by taking approach roughly we first prove that we can couple the number of items in component 
with the number of unique topics assigned to the same component we then prove that this is enough to couple the greedy assignment with polya urn process with exponent this will imply that greedy converges to monopoly by lemma we then extend this argument to single cluster and components but with no load balancing constraints the crux of the extension is that we can apply the argument to pairs of components to yield that some component achieves monopoly lemma given single cluster instance in hc with log and to be partitioned in components the greedy algorithm with no balancing constraints will eventually place every item in the cluster onto the same component second step the general case we complete the proof of theorem by considering the general case with clusters and we proceed in three steps we first show the recovery claim for general number of clusters but and no balance constraints this follows since for the algorithm choices with respect to clusters and their respective topics are independent hence clusters are assigned to components uniformly at random second we extend the proof for any value log rk by showing that the existence of noise edges under this threshold only affects the algorithm choices with very low probability finally we prove that the balance constraints are practically never violated for this type of input as clusters are distributed uniformly at random we obtain the following lemma for hidden input the greedy algorithm with and without capacity constraints can be coupled with version of the algorithm with log rk and constant capacity constraint final argument putting together lemmas and we obtain that greedy ensures balanced recovery for general inputs in hc for parameter values log log and log rk experimental results datasets and evaluation we first consider set of bipartite graph instances with summary provided in table all these datasets are available online except for zune podcast subscriptions we chose the consumer to be the item and the resource to be the topic we provide an experimental validation of the analysis on synthetic inputs in the full version of our paper in our experiments we considered partitioning of items onto components for range of values going from two to ten components we report the maximum number of topics in component normalized by the cost of perfectly balanced solution where is the total number of topics figure the normalized maximum load for various online assignment algorithms all on one proportional greedy decreasing order balance big prefer big random greedy random order under different input bipartite graphs book ratings facebook app data retail data and zune podcast data versus the numbers of components
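Before listing the competing strategies below, the quantities involved in this comparison can be made concrete in one short sketch: a generator for the hidden co-cluster input model, the greedy streaming assignment with a balance constraint, and the normalized maximum load reported in the figure. All parameter names and values here (number of components, clusters and topics per cluster, probabilities p and q, and the slack form of the capacity bound) are illustrative assumptions, not the paper's evaluation code.

```python
import random

def hidden_cocluster_stream(n_items, n_clusters, topics_per_cluster, p, q, noise_size):
    """Yield items of a hidden co-cluster bipartite graph, one topic set at a time.

    Each item picks a home cluster uniformly at random, subscribes to each topic
    of its home cluster with probability p, and to each topic in a fixed noise
    set drawn from the other clusters with probability q.
    """
    clusters = [list(range(c * topics_per_cluster, (c + 1) * topics_per_cluster))
                for c in range(n_clusters)]
    outside = {c: [t for cc in range(n_clusters) if cc != c
                   for t in clusters[cc]][:noise_size] for c in range(n_clusters)}
    for _ in range(n_items):
        home = random.randrange(n_clusters)
        topics = {t for t in clusters[home] if random.random() < p}
        topics |= {t for t in outside[home] if random.random() < q}
        yield topics or {random.choice(clusters[home])}   # avoid empty items

def greedy_assign(stream, k, slack=1.2):
    """Greedy streaming assignment: place each item on the component with the
    largest topic intersection, restricted to components within a capacity bound
    (a simplified form of the balance constraint; ties go to the least loaded)."""
    components = [set() for _ in range(k)]   # topics held by each component
    loads = [0] * k                          # number of items per component
    for topics in stream:
        cap = slack * (sum(loads) + 1) / k
        allowed = [j for j in range(k) if loads[j] + 1 <= cap] or list(range(k))
        best = max(allowed, key=lambda j: (len(components[j] & topics), -loads[j]))
        components[best] |= topics
        loads[best] += 1
    return components

def normalized_max_load(components):
    """Maximum number of distinct topics on a component, divided by m/k,
    the cost of a perfectly balanced split of the m distinct topics."""
    m = len(set().union(*components))
    k = len(components)
    return max(len(c) for c in components) / (m / k)

if __name__ == "__main__":
    stream = list(hidden_cocluster_stream(n_items=2000, n_clusters=20,
                                          topics_per_cluster=50, p=0.5, q=0.05,
                                          noise_size=50))
    parts = greedy_assign(stream, k=4)
    print("greedy normalized max load:", round(normalized_max_load(parts), 3))
```

On such clustered synthetic inputs this mirrors the analysis above: random assignment tends toward a normalized maximum load that grows with the number of components, while the greedy assignment stays within a constant factor of the balanced cost when the recovery conditions hold.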
small items according to greedy an item is considered large if it subscribes to more than topics and small otherwise prefer big inspect the items in random order and keep buffer of up to small items when receiving large item put it on the least loaded component when the buffer is full place all the small items according to greedy greedy assign the items to the component they have the most topics in common with we consider two variants items arrive in random order and items arrive in decreasing order of the number of topics we allow slack parameter of up to topics proportional allocation inspect the items in decreasing order of the number of topics the probability an item is assigned to component is proportional to the number of common topics results greedy generally outperforms other online heuristics see figure also its performance is improved if items arrive in decreasing order of number of topics intuitively items with larger number of topics provide more information about the underlying structure of the bipartite graph than the items with smaller number of topics interestingly adding randomness to the greedy assignment made it perform far worse most times proportional assignment approached the worst case scenario random assignment outperformed proportional assignment and regularly outperformed prefer big and balance big item assignment strategies offline methods we also tested the streaming algorithm for wide range of synthetic input bipartite graphs according to the model defined in this paper and several offline approaches for the problem including hmetis label propagation basic spectral methods and parsa we found that label propagation and spectral methods are extremely time and memory intensive on our inputs due to the large number of topics and edges hmetis returns within seconds however the assignments were not competitive however hmetis provides balanced hypergraph cuts which are not necessarily good solution to our problem compared to parsa on bipartite graph inputs greedy provides assignments with up to higher max partition load on social graphs the performance difference can be as high as this discrepancy is natural since parsa has the advantage of performing multiple passes through the input related work the related problem of graph cut problem originally introduced in is defined as follows given an input graph the objective is to component the set of vertices such that the maximum number of edges adjacent to component is minimized similar problem was recently studied with respect to expansion defined as the ratio of the sum of weights of edges adjacent to component and the minimum between the sum of the weights of vertices within and outside the given component the balanced graph partition problem is optimization problem where the goal is to find balanced partition of the set of vertices that minimizes the total number of edges cut the best known approximation ratio for this problem is in the number of vertices the balanced graph partition problem was also considered for the set of edges of graph the related problem of community detection in an input graph data has been commonly studied for the planted partition model also well known as stochastic block model tight conditions for recovery of hidden clusters are known from the recent work in and as well as various approximation algorithms see some variants of hypergraph partition problems were studied by the machine learning research community including balanced cuts studied by using relaxations based on the concept of total 
variation and the maximum likelihood identification of hidden clusters the difference is that we consider the cut problem for hypergraph in the streaming computation model parsa considers the same problem in an offline model where the entire input is initially available to the algorithm and provides an efficient distributed algorithm for optimizing multiple criteria key component of parsa is procedure for optimizing the order of examining vertices by contrast we focus on performance under arbitrary arrival order and provide analytic guarantees under stochastic input model streaming computation with limited memory was considered for various canonical problems such as principal component analysis community detection balanced graph partition and query placement for the class of hyper graph partition problems most of the work is restricted to studying various streaming heuristics using empirical evaluations with few notable exceptions first theoretical analysis of streaming algorithms for balanced graph partitioning was presented in using the framework similar to the one deployed in this paper the paper gives sufficient conditions for greedy streaming strategy to recover clusters of vertices for the input graph according to stochastic block model which makes irrevocable assignments of vertices as they are observed in the input stream and uses memory limited to grow linearly with the number of clusters as in our case the argument uses reduction to polya urn processes the two main differences with our work is that we consider different problem hypergraph partition and this requires novel proof technique based on reduction to polya urn processes streaming algorithms for the recovery of clusters in stochastic block model were also studied in under weaker computation model which does not require irrevocable assignments of vertices at instances they are presented in the input stream and allows for memory polynomial in the number of vertices conclusion we studied the hypergraph partitioning problem in the streaming computation model with the size of memory limited to be at most linear in the number of the components of the partition we established first approximation guarantees for inputs according to random bipartite graph with hidden and evaluated performance on several input graphs there are several interesting open questions for future work it is of interest to study the tightness of the given recovery condition and in general better understand the between the memory size and the accuracy of the recovery it is also of interest to consider the recovery problem for wider set of random bipartite graph models another question of interest is to consider dynamic graph inputs with addition and deletion of items and topics references bansal feige krauthgamer makarychev nagarajan seffinaor and schwartz graph partitioning and small set expansion siam on computing bourse lelarge and vojnovic balanced graph edge partition in proc of acm kdd chen sanghavi and xu clustering sparse graphs in proc of nips cheng and church biclustering of expression data in ismb volume pages chung handjani and jungreis generalizations of polya urn problem annals of combinatorics dhillon documents and words using bipartite spectral graph partitioning in proc of acm kdd dhillon mallela and modha in proc of acm kdd fortunato community detection in graphs physics reports hein setzer jost and rangapuram the total variation on hypergraphs learning hypergraphs revisited in proc of nips karagiannis gkantsidis narayanan and rowstron 
hermes clustering users in services in proc of acm socc karypis and kumar multilevel hypergraph partitioning vlsi design krauthgamer naor and schwartz partitioning graphs into balanced components li andersen and smola graph partitioning via parallel submodular approximation to accelerate distributed machine learning arxiv preprint community detection thresholds and the weak ramanujan property in proc of acm stoc mitliagkas caramanis and jain memory limited streaming pca in proc of nips mossel neeman and sly reconstruction and estimation in the planted partition model probability theory and related fields pages connor and feizi biclustering using message passing in proc of nips pujol et al the little engine that could scaling online social networks trans stanton streaming balanced graph partitioning algorithms for random graphs in proc of soda stanton and kliot streaming graph partitioning for large distributed graphs in proc of acm kdd tsourakakis gkantsidis radunovic and vojnovic fennel streaming graph partitioning for massive scale graphs in proc of acm wsdm yun lelarge and proutiere streaming memory limited algorithms for community detection in proc of nips svitkina and tardos multiway cut in jansen khanna rolim and ron editors proc of pages zong gkantsidis and vojnovic herding small streaming queries in proc of acm debs 
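For concreteness, the greedy streaming rule with a balance constraint analyzed and evaluated above can be sketched as follows. This is a minimal sketch under assumptions of ours: items arrive as sets of topic ids, the load of a component is its number of distinct topics, and the balance constraint is an additive slack on that load (as in the experiments); it is an illustration, not the authors' reference implementation.

```python
def greedy_stream_partition(items, k, slack):
    """Greedily assign streaming items (sets of topic ids) to k components.

    Each item goes to the component whose current topic set overlaps it most,
    unless that would push the component's topic load more than `slack` topics
    above the least loaded component; in that case the next-best candidate in
    decreasing order of overlap is used, falling back to the least loaded one.
    """
    component_topics = [set() for _ in range(k)]
    assignment = []
    for item in items:
        item = set(item)
        candidates = sorted(range(k),
                            key=lambda c: len(item & component_topics[c]),
                            reverse=True)
        loads = [len(t) for t in component_topics]
        target = loads.index(min(loads))  # fallback: least loaded component
        for c in candidates:
            # projected load if the item's new topics are added to component c
            if loads[c] + len(item - component_topics[c]) <= min(loads) + slack:
                target = c
                break
        component_topics[target] |= item
        assignment.append(target)
    return assignment, [len(t) for t in component_topics]

# toy usage: two hidden clusters of topics; items subscribe within their cluster
items = [{0, 1, 2}, {10, 11}, {1, 2, 3}, {11, 12}, {2, 3, 4}, {10, 12, 13}]
print(greedy_stream_partition(items, k=2, slack=2))
```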
collaboratively learning preferences from ordinal data sewoong oh kiran thekumparampil university of illinois at swoh jiaming xu the wharton school upenn jiamingx abstract in personalized recommendation systems it is important to predict preferences of user on items that have not been seen by that user yet similarly in revenue management it is important to predict outcomes of comparisons among those items that have never been compared so far the multinomial logit model popular discrete choice model captures the structure of the hidden preferences with matrix in order to predict the preferences we want to learn the underlying model from noisy observations of the matrix collected as revealed preferences in various forms of ordinal data natural approach to learn such model is to solve convex relaxation of nuclear norm minimization we present the convex relaxation approach in two contexts of interest collaborative ranking and bundled choice modeling in both cases we show that the convex relaxation is minimax optimal we prove an upper bound on the resulting error with finite samples and provide matching lower bound introduction in recommendation systems and revenue management it is important to predict preferences on items that have not been seen by user or predict outcomes of comparisons among those that have never been compared predicting such hidden preferences would be hopeless without further assumptions on the structure of the preference motivated by the success of matrix factorization models on collaborative filtering applications we model hidden preferences with matrices to collaboratively learn preference matrices from ordinal data this paper considers the following scenarios collaborative ranking consider an online market that collects each user preference as ranking over subset of items that are seen by the user such data can be obtained by directly asking to compare some items or by indirectly tracking online activities on which items are viewed how much time is spent on the page or how the user rated the items in order to make personalized recommendations we want model which captures how users who prefer similar items are also likely to have similar preferences on unseen items predicts which items user might prefer by learning from such ordinal data bundled choice modeling discrete choice models describe how user makes decisions on what to purchase typical choice models assume the willingness to buy an item is independent of what else the user bought in many cases however we make bundled purchases we buy particular ingredients together for one recipe or we buy two connecting flights one choice the first flight has significant impact on the other the connecting flight in order to optimize the assortment which flight schedules to offer for maximum expected revenue it is crucial to accurately predict the willingness of the consumers to purchase based on past history we consider case where there are two types of products jeans and shirts and want model that captures such interacting preferences for pairs of items one from each category and to predict the consumer choice probabilities on pairs of items by learning such models from past purchase history we use discrete choice model known as multinomial logit mnl model described in section to represent the preferences in collaborative ranking context mnl uses matrix to represent the hidden preferences of the users each row corresponds to user preference over all the items and when presented with subset of items the user provides ranking 
over those items which is noisy version of the hidden true preference the assumption naturally captures the similarities among users and items by representing each on space in bundled choice modeling context the matrix now represents how pairs of items are matched each row corresponds to an item from the first category and each column corresponds to an item from the second category an entry in the matrix represents how much the pair is preferred by randomly chosen user from pool of users notice that in this case we do not model individual preferences but the preference of the whole population the purchase history of the population is the record of which pair was chosen among subsets of items that were presented which is again noisy version of the hidden true preference the assumption captures the similarities and among the items in the same category and the interactions across categories contribution natural approach to learn such model from noisy observations is to solve convex relaxation of nuclear norm minimization described in section since nuclear norm is the tightest convex surrogate for the rank function we present such an approach for learning the mnl model from ordinal data in two contexts collaborative ranking and bundled choice modeling in both cases we analyze the sample complexity of the algorithm and provide an upper bound on the resulting error with finite samples we prove of our approach by providing matching lower bound up to factor technically we utilize the random utility model rum interpretation outlined in section of the mnl model to prove both the upper bound and the fundamental limit which could be of interest to analyzing more general class of rums related work in the context of collaborative ranking mnl models have been proposed to model partial rankings from pool of users recently there has been new algorithms and analyses of those algorithms to learn mnl models from samples in the case when each user provides comparisons proposes solving convex relaxation of maximizing the likelihood over matrices with bounded nuclear norm it is shown that this approach achieves statistically optimal generalization error rate instead of frobenius norm error that we analyze our analysis techniques are inspired by which proposed the convex relaxation for learning mnl but when the users provide only comparisons in this paper we generalize the results of by analyzing more general sampling models beyond pairwise comparisons the remainder of the paper is organized as follows in section we present the mnl model and propose convex relaxation for learning the model in the context of collaborative ranking we provide theoretical guarantees for collaborative ranking in section in section we present the problem statement for bundled choice modeling and analyze similar convex relaxation approach notations we use and to denote the frobenius norm and the norm denote the singular value and σi to denote the nuclear norm where for the spectral norm we use hhu vii ui vi and kuk to denote the inner product and the euclidean norm all ones vector is denoted by and is the indicator function of the event the set of the fist integers are denoted by model and algorithm in this section we present discrete choice modeling for collaborative ranking and propose an inference algorithm for learning the model from ordinal data multinomial logit mnl model for comparative judgment in collaborative ranking we want to model how people who have similar preferences on subset of items are likely to have similar tastes 
on other items as well when users provide ratings as in collaborative filtering applications matrix factorization models are widely used since the structure captures the similarities between users when users provide ordered preferences we use discrete choice model known as multinomial logit mnl model that has similar structure that captures the similarities between users and items let be the dimensional matrix capturing the preference of users on items where the rows and columns correspond to users and items respectively typically is assumed to be having rank that is much smaller than the dimensions however in the following we allow more general setting where might be only approximately low rank when user is presented with set of alternatives si she reveals her preferences as ranked list over those items to simplify the notations we assume all users compare the same number of items but the analysis naturally generalizes to the case when the size might differ from user to user let vi si denote the random best choice of user each user gives ranking independent of other users rankings from vi eθi where with si si vi and si for user the row of represents the underlying preference vector of the user and the more preferred items are more likely to be ranked higher the probabilistic nature of the model captures the noise in the revealed preferences the random utility model rum pioneered by describes the choices of users as manifestations of the underlying utilities the mnl models is special case of rum where each decision maker and each alternative are represented by feature vectors ui and vj respectively such that hhui vj ii resulting in matrix when presented with set of alternatives si the decision maker ranks the alternatives according to their random utility drawn from uij hhui vj ii ξij for item where ξij follow the standard gumbel distribution intuitively this provides justification for the mnl model as modeling the decision makers as rational being seeking to maximize utility technically this rum interpretation plays crucial role in our analysis in proving restricted strong convexity in appendix and also in proving fundamental limit in appendix there are few cases where the maximum likelihood ml estimation for rum is tractable one notable example is the pl model which is special case of the mnl model where is and all users have the same features pl model has been widely applied in econometrics analyzing elections and machine learning efficient inference algorithms has been proposed and the sample complexity has been analyzed for the mle and for the rank centrality although pl is quite restrictive in the sense that it assumes all users share the same features little is known about inference in rums beyond pl recently to overcome such restriction mixed pl models have been studied where is but there are only classes of users and all users in the same class have the same features efficient inference algorithms with provable guarantees have been proposed by applying recent advances in tensor decomposition methods directly clustering the users or using sampling methods however this mixture pl is still restrictive and both clustering and tensor based approaches rely heavily on the fact that the distribution is mixture and require additional incoherence assumptions on for more general models efficient inference algorithms have been proposed but no performance guarantee is known for finite samples although the mle for the general mnl model in is intractable we provide inference algorithm with 
provable guarantees nuclear norm minimization assuming is well approximated by matrix we estimate by solving the following convex relaxation given the observed preference in the form of ranked lists vi arg min nuc where the negative log likelihood function according to is ei etv ii log exp hhθ ei etj ii with si vi and si si vi and appropriately chosen set defined in since nuclear norm is tight convex surrogate for the rank the above optimization searches for solution that maximizes the likelihood nuclear norm minimization has been widely used in rank minimization problems but provable guarantees typically exists only for quadratic loss function our analysis extends such analysis techniques to identify the conditions under which restricted strong convexity is satisfied for convex loss function that is not quadratic collaborative ranking from comparisons we first provide background on the mnl model and then present main results on the performance guarantees notice that the distribution is independent of shifting each row of by constant hence there is an equivalent class of that gives the same distributions for the ranked lists for some since we can only estimate up to this equivalent class we search for the one whose rows sum to zero for all let maxi denote the dynamic range of the underlying such that when items are compared we always have eα eα for all si all si satisfying and all we do not make any assumptions on other than that with respect to and the purpose of defining the dynamic range in this way is that we seek to characterize how the error scales with given this definition we solve the optimization in over ωα and we have aij while in practice we do not require the norm constraint we need it for the analysis for related problem of matrix completion where the loss is quadratic either similar condition on norm is required or different condition on incoherence is required performance guarantee we provide an upper bound on the resulting error of our convex relaxation when of items si presented to user is drawn uniformly at random with replacement precisely for given si ji where ji are independently drawn uniformly at random over the items further if an item is sampled more than once if there exists ji ji for some and then we assume that the user treats these two items as if they are two distinct items with the same mnl weights ji ji resulting preference is therefore always over items with possibly multiple copies of the same item and distributed according to for example if it is possible to have si in which case the resulting ranking can be with probability such sampling with replacement is necessary for the analysis where we require independence in the choice of the items in si in order to apply the symmetrization technique to bound the expectation of the deviation cf appendix similar sampling assumptions have been made in existing analyses on learning models from noisy observations let and let σj denote the singular value of the matrix define log log log theorem under the described sampling model assume min log log log log and with any constant larger than then solving the optimization achieves min σj for any min with probability at least where proof is provided in appendix the above bound shows natural splitting of the error into two terms one corresponding to the estimation error for the component and the second one corresponding to the approximation error for how well one can approximate with matrix this bound holds for all values of and one could potentially optimize over we show 
such results in the following corollaries corollary exact matrices suppose has rank at most under the hypotheses of theorem solving the optimization with the choice of the regularization parameter achieves with probability at least log log log the number of entries is and we rescale the frobenius norm error appropriately by when is matrix then the degrees of freedom in representing is the above theorem shows that the total number of samples which is needs to scale as log log log in order to achieve an arbitrarily small error this is only factor larger than the degrees of freedom in section we provide lower bound on the error directly that matches the upper bound up to logarithmic factor the dependence on the dynamic range however is it is expected that the error increases with since the scales as but the exponential dependence in the bound seems to be weakness of the analysis as seen from numerical experiments in the right panel of figure although the error increase with numerical experiments suggests that it only increases at most linearly however tightening the scaling with respect to is challenging problem and such suboptimal dependence is also present in existing literature for learning even simpler models such as the model or the model which are special cases of the mnl model studied in this paper practical issue in achieving the above rate is the choice of since the dynamic range is not known in advance figure illustrates that the error is not sensitive to the choice of for wide range another issue is that the underlying matrix might not be exactly low rank it is more realistic to assume that it is approximately low rank following we formalize this notion with of matrices defined as bq ρq ρq min when this is set of matrices for this is set of matrices whose singular values decay relatively fast optimizing the choice of in theorem we get the following result corollary approximately matrices suppose bq ρq for some and ρq under the hypotheses of theorem solving the optimization with the choice of the regularization parameter achieves with probability at least ρq log log log this is strict generalization of corollary for and this recovers the exact estimation bound up to factor of two for approximate matrices in an we lose in the error exponent which reduces from one to proof of this corollary is provided in appendix the left panel of figure confirms the scaling of the error rate as predicted by corollary the lines merge top single line when the sample size is rescaled appropriately we make choice of log this choice is independent of and is smaller than proposed in theorem we generate random matrices of dimension where with and entries generated from uniform distribution over then the rmse rmse sample size log figure the rescaled rmse scales as log as expected from corollary for fixed left in the inset the same data is plotted versus rescaled sample size log the rescaled rmse is stable for broad range of and for fixed and right is subtracted form each row and then the whole matrix is scaled such that the largest entry is note that this operation does not increase the rank of the matrix this is because this can be written as and both terms in the operation are of the same column space as which is of rank the root mean squared error rmse is plotted where we implement and solve the convex optimization using proximal rmse gradient descent method as analyzed in the right panelp in figure illustrates that the actual error is insensitive to the choice of for broad range of log log after which 
it increases with lower bound for matrices for algorithm of convex relaxation we gave in the previous section bound on the achievable error we next compare this to the fundamental limit of this problem by giving lower bound on the achievable error by any algorithm efficient or not simple parameter counting argument indicates that it requires the number of samples to scale as the degrees of freedom to estimate dimensional matrix of rank we construct an appropriate packing over the set of matrices with bounded entries in ωα defined as and show that no algorithm can accurately estimate the true matrix with high probability using the generalized fano inequality this provides constructive argument to lower bound the minimax error rate which in turn establishes that the bounds in theorem is sharp up to logarithmic factor and proves no other algorithm can significantly improve over the nuclear norm minimization theorem suppose has rank under the described sampling model for large enough and there is universal numerical constant such that inf sup min αe log where the infimum is taken over all measurable functions over the observed ranked lists vi proof of this theorem is provided in appendix the term of primary interest in this bound is the first one which shows the scaling of the rescaled minimax rate as when and matches the upper bound in it is the dominant term in the bound whenever the number of samples is larger than the degrees of freedom by logarithmic factor log ignoring the dependence on this is typical regime of interest where the sample size is comparable to the latent dimension of the problem in this regime theorem establishes that the upper bound in theorem is up to logarithmic factor in the dimension choice modeling for bundled purchase history in this section we use the mnl model to study another scenario of practical interest choice modeling from bundled purchase history in this setting we assume that we have bundled purchase history data from users precisely there are two categories of interest with and alternatives in each category respectively for example there are tooth pastes to choose from and tooth brushes to choose from for the user subset si of alternatives from the first category is presented along with subset ti of alternatives from the second category we use and to denote the number of alternatives presented to single user and and we assume that the number of alternatives presented to each user is fixed to simplify notations given these sets of alternatives each user makes bundled purchase and we use ui vi to denote the bundled pair of alternatives tooth brush and tooth paste purchased by the user each user makes choice of the best alternative independent of other users choices according to the mnl model as ui vi for all si and ti the distribution is independent of shifting all the values of by constant hence there is an equivalent class of that gives the same distribution for the choices for some sincepwe can only estimate up to this equivalent class we search for the one that sum to zero let θj denote the dynamic range of the underlying such that when alternatives are presented we always have ui vi for all si ti and for all si and ti such that and we do not make any assumptions on other than that with respect to and assuming is well approximate by matrix we solve the following convex relaxation given the observed bundled purchase history ui vi si ti arg where the negative log likelihood function according to is eui etv ii log exp hhθ ii and and compared to 
collaborative ranking rows and columns of correspond to an alternative from the first and second category respectively each sample corresponds to the purchase choice of user which follow the mnl model with each person is presented subsets si and ti of items from each category each sampled data represents the most preferred bundled pair of alternatives performance guarantee we provide an upper bound on the error achieved by our convex relaxation when the of alternatives si from the first category and ti from the second category are drawn uniformly at random with replacement from and respectively precisely for given and we let si and ti where and are independently drawn uniformly at random over the and alternatives respectively similar to the previous section this sampling with replacement is necessary for the analysis define max log theorem under the described sampling model assume min log min max log and with any constant larger than max min then solving the optimization achieves min σj for any min with probability at least where proof is provided in appendix optimizing over gives the following corollaries corollary exact matrices suppose has rank at most under the hypotheses of theorem solving the optimization with the choice of the regularization parameter achieves with probability at least log this corollary shows that the number of samples needs to scale as log in order to achieve an arbitrarily small error this is only logarithmic factor larger than the degrees of freedom we provide fundamental lower bound on the error that matches the upper bound up to logarithmic factor for approximately matrices in an as defined in we show an upper bound on the error whose error exponent reduces from one to corollary approximately matrices suppose bq ρq for some and ρq under the hypotheses of theorem solving the optimization with the choice of the regularization parameter achieves with probability at least log since the proof is almost identical to the proof of corollary in appendix we omit it theorem suppose has rank under the described sampling model there is universal constant such that that the minimax rate where the infimum is taken over all measurable functions over the observed purchase history ui vi si ti is lower bounded by min inf sup log see appendix for the proof the first term is dominant and when the sample size is comparable to the latent dimension of the problem theorem is minimax optimal up to logarithmic factor discussion we presented convex program to learn mnl parameters from ordinal data motivated by two scenarios recommendation systems and bundled purchases we take the first principle approach of identifying the fundamental limits and also developing efficient algorithms matching those fundamental trade offs there are several remaining challenges nuclear norm minimization while is still slow we want methods that are efficient with provable guarantees the main challenge is providing good initialization to start such approaches for simpler models such as the pl model more general sampling over graph has been studied we want analytical results for more general sampling the practical use of the model and the algorithm needs to be tested on real datasets on purchase history and recommendations acknowledgments this research is supported in part by nsf cmmi award and nsf satc award references daniel mcfadden conditional logit analysis of qualitative choice behavior louis thurstone law of comparative judgment psychological review jacob marschak constraints and random utility 
indicators in proceedings of symposium on mathematical methods in the social sciences volume pages luce individual choice behavior wiley new york yu lu and sahand negahban individualized rank aggregation using nuclear norm regularization arxiv preprint dohyung park joe neeman jin zhang sujay sanghavi and inderjit dhillon preference completion collaborative ranking from pairwise comparisons isobel claire gormley and thomas brendan murphy grade of membership model for rank data bayesian analysis liu learning to rank for information retrieval foundations and trends in information retrieval hunter mm algorithms for generalized models annals of statistics pages john guiver and edward snelson bayesian inference for ranking models in proceedings of the annual international conference on machine learning pages acm francois caron and arnaud doucet efficient bayesian inference for generalized models journal of computational and graphical statistics hajek oh and xu inference from partial rankings in advances in neural information processing systems pages negahban oh and shah iterative ranking from comparisons in nips pages oh and shah learning mixed multinomial logit model from ordinal data in advances in neural information processing systems pages ding ishwar and saligrama topic modeling approach to rank aggregation boston university center for info and systems engg technical report http ammar oh shah and voloch what your choice learning the mixed logit model in proceedings of the acm conference on measurement and modeling of computer systems rui wu jiaming xu srikant laurent marc lelarge and bruce hajek clustering and inference from pairwise comparisons arxiv preprint azari soufiani diao lai and parkes generalized random utility models with multiple types in advances in neural information processing systems pages soufiani parkes and xia random utility theory for social choice in nips pages recht fazel and parrilo guaranteed solutions of linear matrix equations via nuclear norm minimization siam review and recht exact matrix completion via convex optimization foundations of computational mathematics negahban and wainwright restricted strong convexity and weighted matrix completion optimal bounds with noise journal of machine learning research boucheron lugosi and pascal massart concentration inequalities nonasymptotic theory of independence oxford university press agarwal negahban and wainwright fast global convergence rates of gradient methods for statistical recovery in in nips pages tropp tail bounds for sums of random matrices foundations of comput van de geer empirical processes in volume cambridge university press ledoux the concentration of measure phenomenon number american mathematical 
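To make the sampling model above concrete, the following sketch draws a ranked list from the MNL model using its random-utility interpretation: each presented item receives its Θ-utility plus independent standard Gumbel noise, and the revealed ranking sorts the presented items by these noisy utilities. The function, the toy low-rank Θ, and all dimensions are illustrative choices of ours; duplicated items in a subset drawn with replacement are treated as distinct copies, as in the analysis.

```python
import numpy as np

def sample_mnl_ranking(theta_row, subset, rng):
    """Sample one ranked list over `subset` from a user's MNL parameters.

    Under the random-utility view, item j gets utility theta_row[j] plus i.i.d.
    standard Gumbel noise, and the revealed ranking sorts the presented items
    (duplicates allowed) by these noisy utilities, best first.
    """
    utilities = theta_row[subset] + rng.gumbel(size=len(subset))
    order = np.argsort(-utilities)
    return [subset[i] for i in order]

# toy usage with a rank-2 preference matrix (dimensions are illustrative)
rng = np.random.default_rng(0)
n_users, n_items, r = 50, 30, 2
theta = rng.normal(size=(n_users, r)) @ rng.normal(size=(r, n_items))
theta -= theta.mean(axis=1, keepdims=True)            # rows sum to zero
subset = list(rng.choice(n_items, size=5, replace=True))  # sampling with replacement
print(sample_mnl_ranking(theta[0], subset, rng))
```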
biologically inspired dynamic textures for probing motion perception andrew isaac meso institut de neurosciences de la timone umr marseille cedex france jonathan vacher cnrs unic and ceremade univ paris cedex france vacher laurent perrinet institut de neurosciences de la timone umr marseille cedex france gabriel cnrs and ceremade univ paris cedex france peyre abstract perception is often described as predictive process based on an optimal inference with respect to generative model we study here the principled construction of generative model specifically crafted to probe motion perception in that context we first provide an axiomatic derivation of the model this model synthesizes random dynamic textures which are defined by stationary gaussian distributions obtained by the random aggregation of warped patterns importantly we show that this model can equivalently be described as stochastic partial differential equation using this characterization of motion in images it allows us to recast models into principled bayesian inference framework finally we apply these textures in order to psychophysically probe speed perception in humans in this framework while the likelihood is derived from the generative model the prior is estimated from the observed results and accounts for the perceptual bias in principled fashion motivation normative explanation for the function of perception is to infer relevant hidden parameters from the sensory input with respect to generative model equipped with some prior knowledge about this representation this corresponds to the bayesian brain hypothesis as has been perfectly illustrated by the particular case of motion perception however the gaussian hypothesis related to the parameterization of knowledge in these models instance in the formalization of the prior and of the likelihood does not always fit with psychophysical results as such major challenge is to refine the definition of generative models so that they conform to the widest variety of results from this observation the estimation problem inherent to perception is linked to the definition of an adequate generative model in particular the simplest generative model to describe visual motion is the luminance conservation equation it states that luminance for is approximately conserved along trajectories defined as integral lines of vector field the corresponding generative model defines random fields as solutions to the stochastic partial differential equation spde hv where denotes the euclidean scalar product in is the spatial gradient of to match the statistics of natural scenes or some category of textures the driving term is usually defined as colored noise corresponding to some average coupling and is parameterized by covariance matrix while the field is usually constant vector accounting for translation with constant speed ultimately the application of this generative model is essential for probing the visual system for instance to understand how observers might detect motion in scene indeed as shown by the negative corresponding to the luminance conservation model and determined by hypothesized speed is proportional to the value of the model where is the whitening filter corresponding to the inverse of and is the convolution operator using some prior knowledge on the distribution of motions for instance preference for slow speeds this indeed leads to bayesian formalization of this inference problem this has been successful in accounting for large class of psychophysical observations as consequence 
such probabilistic frameworks allow one to connect different models from computer vision to neuroscience with unified principled approach however the model defined in is obviously quite simplistic with respect to the complexity of natural scenes it is therefore useful here to relate this problem to solutions proposed by texture synthesis methods in the computer vision community indeed the literature on the subject of static textures synthesis is abundant see and the references therein for applications in computer graphics of particular interest for us is the work of galerne et al which proposes stationary gaussian model restricted to static textures realistic dynamic texture models are however less studied and the most prominent method is the gaussian ar framework of which has been refined in contributions here we seek to engender better understanding of motion perception by improving generative models for dynamic texture synthesis from that perspective we motivate the generation of optimal stimulation within stationary gaussian dynamic texture model we base our model on previously defined heuristic coined motion clouds our first contribution is figure parameterization of the class of motion clouds stimuli the illustration relates the parametric changes in mc with real world top row and observer second row movements orientation changes resulting in scene rotation are parameterized through as shown in the bottom row where horizontal and obliquely oriented mc are compared zoom movements either from scene looming or observer movements in depth are characterised by scale changes reflected by scale or frequency term shown for larger or closer object compared to more distant translational movements in the scene characterised by using the same formulation for static slow and fast moving mc with the variability in these speeds quantified by σv and in the third row are the spatial and temporal frequency scale parameters the development of this formulation is detailed in the text an axiomatic derivation of this model seen as shot noise aggregation of dynamically warped textons this formulation is important to provide clear understanding of the effects of the model parameters manipulated during psychophysical experiments within our generative model they correspond to average translation speed and orientation of the textons and standard deviations of random fluctuations around this average our second contribution proved in the supplementary materials is to demonstrate an explicit equivalence between this model and class of linear stochastic partial differential equations spde this shows that our model is generalization of the luminance conservation equation this spde formulation has two chief advantages it allows for synthesis using an ar recurrence and it allows one to recast the of the model as generalization of the classical motion energy model which in turn is crucial to allow for bayesian modeling of perceptual biases our last contribution is an illustrative application of this model to the psychophysical study of motion perception in humans this application shows how the model allows us to define likelihood which enables simple fitting procedure to determine the prior driving the perceptual bias notations in the following we will denote the variable and the corresponding frequency variables if is function defined on then fˆ denotes its fourier transform for we denote cos sin its polar coordinates for function in we denote in the following we denote with capital letter such as random variable we 
denote realization of we let pa be the corresponding distribution of axiomatic construction of dynamic texture stimulation model solving estimation problem and finding optimal dynamic textures for stimulating an instance of such model can be seen as equivalent mathematical problems in the luminance conservation model the generative model is parameterized by coupling function which is encoded in the covariance of the driving noise and the motion flow this coupling covariance is essential as it quantifies the extent of the spatial integration area as well as the integration dynamics an important issue in neuroscience when considering the implementation of integration mechanisms from the local to the global scale in particular it is important to understand modular sensitivity in the various lower visual areas with different selectivities such as primary visual cortex or ascending the processing hierarchy middle temple area mt for instance by varying the frequency bandwidth of such dynamic textures distinct mechanisms for perception and action have been identified however such textures were based on heuristic and our goal here is to develop principled axiomatic definition from shot noise to motion clouds we propose derivation of general parametric model of dynamic textures this model is defined by aggregation through summation of basic spatial texton template the summation reflects transparency hypothesis which has been adopted for instance in while one could argue that this hypothesis is overly simplistic and does not model occlusions or edges it leads to tractable framework of stationary gaussian textures which has proved useful to model static and dynamic natural phenomena the simplicity of this framework allows for fine tuning of fourier parameterization which is desirable for the interpretation of psychophysical experiments we define random field as def iλ ϕap xp vp where ϕa is planar warping parameterized by finite dimensional vector intuitively this model corresponds to dense mixing of stereotyped static textons as in the originality is first the components of this mixing are derived from the texton by visual transformations ϕap which may correspond to arbitrary transformations such as zooms or rotations illustrated in figure second we explicitly model the motion position xp and speed vp of each individual texton the parameters xp vp ap are independent random vectors they account for the variability in the position of objects or observers and their speed thus mimicking natural motions in an ambient scene the set of translations xp is poisson point process of intensity the following section instantiates this idea and proposes canonical choices for these variabilities the warping parameters ap are distributed according to distribution pa the speed parameters vp are distributed according to distribution pv on the following result shows that the model converges to stationary gaussian field and gives the parameterization of the covariance its proof follows from specialization of theorem to our setting proposition iλ is stationary with bounded second order moments its covariance is where satisfies cg ϕa νt pv pa dνda where cg is the of when it converges in the sense of finite dimensional distributions toward stationary gaussian field of zero mean and covariance definition of motion clouds we detail this model here with warpings as rotations and scalings see figure these account for the characteristic orientations and sizes or spatial scales in scene with respect to the observer def ϕa where 
rθ is the planar rotation of angle we now give some physical and biological motivation underlying our particular choice for the distributions of the parameters we assume that the distributions pz and pθ of spatial scales and orientations respectively see figure are independent and have densities thus considering pa pz pθ the speed vector is assumed to be randomly fluctuating around central speed so that pv in order to obtain optimal responses to the stimulation as advocated by it makes sense to define the texton to be equal to an oriented gabor acting as an atom based on the structure of standard receptive field of each would have scale and central frequency since the orientation and scale of the texton is handled by the parameters we can impose without loss of generality the normalization in the special case where is grating of frequency and the image is dense mixture of drifting gratings whose has closed form expression detailed in proposition its proof can be found in the supplementary materials we call this gaussian field motion cloud mc and it is parameterized by the envelopes pz pθ pv and has central frequency and speed note that it is possible to consider any arbitrary textons which would give rise to more complicated parameterizations for the power spectrum but we decided here to stick to the simple case of gratings proposition when eihx the image defined in proposition is stationary gaussian field of covariance having the pz ξi rπ where the linear transform is such that cos dϕ remark note that the envelope of is shaped along cone in the spatial and temporal domains this is an important and novel contribution when compared to gaussian formulation like classical gabor in particular the bandwidth is then constant around the speed plane or the orientation line with respect to spatial frequency basing the generation of the textures on all possible translations rotations and zooms we thus provide principled approach to show that bandwidth should be proportional to spatial frequency to provide better model of moving textures parameter distributions we now give meaningful specialization for the probability distributions pz pθ which are inspired by some known scaling properties of the visual transformations relevant to dynamic scene perception first small centered linear movements of the observer along the axis of view orthogonal to the plane of the scene generate centered planar zooms of the image from the linear modeling of the observer displacement and the subsequent multiplicative nature of zoom scaling should follow law stating that subjective sensation when quantified is proportional to the logarithm of stimulus intensity thus we choose the scaling drawn from distribution pz defined in the bandwidth σz quantifies the variance in the amplitude of zooms of individual textons relative to the set characteristic scale similarly the texture is perturbed by variation in the global angle of the scene for instance the head of the observer may roll slightly around its normal position the distribution as good approximation of the warped gaussian distribution around the unit circle is an adapted choice for the distribution of with mean and bandwidth σθ see we may similarly consider that the position of the observer is variable in time on first order movements perpendicular to the axis of view dominate generating random perturbations to the global translation of the image at speed these perturbations are for instance described by gaussian random walk take for instance tremors which are constantly 
jittering small deg movements of the eye this justifies the choice of radial distribution for pv this radial distribution is thus selected as function of width σv and we choose here gaussian function for simplicity see note that as detailed in the supplementary slightly different with more complicated expression should be used to obtain an exact equivalence with the spde discretization mentioned in section the distributions of the parameters are thus chosen as ln cos ln pz pθ and remark note that in practice we have parametrized pz by its mode mz argmaxz pz and qr standard deviation dz pz dz see the supplementary material and σθ σz slope σv σz two different projections of in fourier space mc of two different spatial frequencies figure graphical representation of the covariance left the shape of the and an example of synthesized dynamics for and motion clouds right plugging these expressions into the definition of the power spectrum of the motion cloud one obtains parameterization which is very similar to the one originally introduced in the following table gives the speed and frequency central parameters in terms of amplitude and orientation each one being coupled with the relevant dispersion parameters figure and shows graphical display of the influence of these parameters mean dispersion speed σv freq orient σθ freq amplitude σz or mz dz remark note that the final envelope of is in agreement with the formulation that is used in however that previous derivation was based on heuristic which intuitively emerged from long interaction between modelers and psychophysicists herein we justified these different points from first principles remark the mc model can equally be described as stationary solution of stochastic partial differential equation spde this spde formulation is important since we aim to deal with dynamic stimulation which should be described by causal equation which is local in time this is crucial for numerical simulations since this allows us to perform synthesis of stimuli using an time discretization this is significant departure from previous implementation of dynamic stimulation this is also important to simplify the application of mc inside bayesian model of psychophysical experiments see section the derivation of an equivalent spde model exploits spectral formulation of mcs as gaussian random fields the full proof along with the synthesis algorithm can be found in the supplementary material psychophysical study speed discrimination to exploit the useful features of our mc model and provide generalizable proof of concept based on motion perception we consider here the problem of judging the relative speed of moving dynamical textures and the impact of both average spatial frequency and average duration of temporal correlations methods the task was to discriminate the speed of mc stimuli moving with horizontal central speed we assign as independent experimental variable the most represented spatial frequency mz that we denote in the following for easier reading the other parameters are set to the following values σv σθ and dz note that σv is thus dependent of the value of that is computed from mz and dz see remark and the supplementary to ensure that stays constant this parameter controls the temporal frequency bandwidth as illustrated on the middle of figure we used two alternative forced choice paradigm in each trial grey fixation screen with small dark fixation spot was followed by two stimulus intervals of ms each separated by grey ms interval the first stimulus had 
parameters and the second had parameters at the end of the trial grey screen appeared asking the participant to report which one of the two intervals was perceived as moving faster by pressing one of two buttons that is whether or given reference values for each trial and are selected so that vi where vj where or the ordering is randomized across trials and where values are expressed in cycles per degree and values in ten repetitions of each of the possible combinations of these parameters are made per block of trials and at least four such blocks were collected per condition tested the outcome of these experiments are summarized by psychometric curves where for all the value is the empirical probability each averaged over the typically trials that stimulus generated with parameters is moving faster than stimulus with parameters to assess the validity of our model we tested four different scenarios by considering all possible choices among and which corresponds to combinations of speeds and pair of temporal frequency parameters stimuli were generated on mac running os and displayed on viewsonic monitor with resolution at hz routines were written using matlab and psychtoolbox controlled the stimulus display observers sat cm from the screen in dark room three observers with normal or corrected to normal vision took part in these experiments they gave their informed consent and the experiments received ethical approval from the ethics committee in accordance with the declaration of helsinki bayesian modeling to make full use of our mc paradigm in analyzing the obtained results we follow the methodology of the bayesian observer used for instance in we assume the observer makes its decision using maximum posteriori map estimator argmin log pm log pv computed from some internal representation of the observed stimulus for simplicity we assume that the observer estimates from without bias to simplify the numerical analysis we assume that the likelihood is gaussian with variance independent of furthermore we assume that the prior is laplacian as this gives good description of the priori statistics of speeds in natural images pm and pv eaz vmax where vmax is cutoff speed ensuring that pv is well defined density even if az both az and σz are unknown parameters of the model and are obtained from the outcome of the experiments by fitting process we now explain likelihood and prior estimation following for instance the theoretical psychophysical curve obtained by bayesian decision model is def ϕv mv mv where mv is gaussian variable having the distribution pm the following proposition shows that in our special case of gaussian prior and laplacian likelihood it can be computed in closed form its proof follows closely the derivation of appendix and can be found in the supplementary materials proposition in the special case of the estimator with parameterization one has az az ϕv rt where ds is sigmoid function one can fit the experimental psychometric function to compute the perceptual bias term µz and an uncertainty λz such that λz remark note that in practice we perform fit in domain ie we consider where ln with following by comparing the theoretical and experimental psychopysical curves and one thus obtains the following expressions and az az the only remaining unknown is az that can be set as any negative number based on previous work on low speed priors or alternatively estimated in future by performing wiser fitting method psychophysic results the main results are summarized in figure showing the 
parameters µz in figure and the parameters σz in figure spatial frequency has positive effect on perceived speed speed is systematically perceived as faster as spatial frequency is increased moreover this shift can not simply be explained to be the result of an increase in the likelihood width figure at the tested spatial frequency as previously observed for contrast changes therefore the positive effect could be explained by negative effect in prior slopes az as the spatial frequency increases however we do not have any explanation for the observed constant likelihood width as it is not consistent with the speed width of the stimuli σv which is decreasing with spatial frequency discussion we exploited the principled and ecologically motivated parameterization of mc to ask about the effect of scene scaling on speed judgements in the experimental task mc stimuli in which the spatial scale content was systematically varied via frequency manipulations around central frequency of were found to be perceived as slightly faster at higher frequencies slightly slower at lower frequencies the effects were most prominent at the faster speed tested of relative to those at the fitted psychometic functions were compared to those predicted by bayesian model in which the likelihood or the observer sensory representation was characterised by simple gaussian indeed for this small data set intended as proof of concept the model was able to explain subject pse bias µz likehood width σz subject spatial frequency in spatial frequency in figure speed discrimination results task generates psychometric functions which show shifts in the point of subjective equality for the range of test stimuli of lower frequency with respect to the reference intersection of dotted horizontal and vertical lines gives the reference stimulus are perceived as going slower those with greater mean frequency are perceived as going relatively faster this effect is observed under all conditions but is stronger at the highest speed and for subject the estimated σz appear noisy but roughly constant as function of for each subject widths are generally higher for red than blue traces the parameter does not show significant effect across the conditions tested these systematic biases for spatial frequency as shifts in our priori on speed during the perceptual judgements as the likelihood width are constant across tested frequencies but lower at the higher of the tested speeds thus having larger measured bias given the case of the smaller likelihood width faster speed is consistent with key role for the prior in the observed perceptual bias larger data set including more standard spatial frequencies and the use of more observers is needed to disambiguate the models predicted prior function conclusions we have proposed and detailed generative model for the estimation of the motion of images based on formalization of small perturbations from the observer point of view during parameterized rotations zooms and translations we connected these transformations to descriptions of ecologically motivated movements of both observers and the dynamic world the fast synthesis of naturalistic textures optimized to probe motion perception was then demonstrated through fast gpu implementations applying techniques with much potential for future experimentation this extends previous work from by providing an axiomatic formulation finally we used the stimuli in psychophysical task and showed that these textures allow one to further understand the processes 
underlying speed estimation by linking them directly to the standard bayesian formalism we show that the sensory representations of the stimulus the likelihoods in such models can be described directly from the generative mc model in our case we showed this through the influence of spatial frequency on speed estimation we have thus provided just one example of how the optimized motion stimulus and accompanying theoretical work might serve to improve our understanding of inference behind perception the code associated to this work is available at https acknowledgements we thank guillaume masson for useful discussions during the development of the experiments we also thank manon and amfreville for proofreading lup was supported by ec brainscales the work of jv and gp was supported by the european research council erc project aim and lup were supported by speed references adelson and bergen spatiotemporal energy models for the perception of motion journal of optical society of america dong maximizing causal information of natural scenes in motion in ilg and masson editors dynamics of visual motion processing pages springer us doretto chiuso wu and soatto dynamic textures international journal of computer vision field relations between the statistics of natural images and the response properties of cortical cells opt soc am galerne stochastic image models and texture synthesis phd thesis ens de cachan galerne gousseau and morel synthesis by phase randomization image processing on line gregory perceptions as hypotheses philosophical transactions of the royal society biological sciences jogan and stocker signal integration in human visual speed perception the journal of neuroscience nestares fleet and heeger likelihood functions and confidence bounds for problems in ieee conference on computer vision and pattern recognition cvpr volume pages ieee comput soc vanzetta masson and perrinet motion clouds modelbased stimulus synthesis of random textures for the study of motion perception journal of neurophysiology simoncini perrinet montagnini mamassian and masson more is not always better adaptive gain control explains dissociation between perception and action nature neurosci sotiropoulos seitz and contrast dependency and prior expectations in human speed perception vision research stocker and simoncelli noise characteristics and prior expectations in human visual speed perception nature neuroscience unser and tafti an introduction to sparse stochastic processes cambridge university press cambridge uk unser tafti amini and kirshner unified formulation of gaussian versus sparse stochastic processes part ii theory ieee transactions on information theory wei lefebvre kwatra and turk state of the art in texture synthesis in eurographics state of the art report eurographics association wei and stocker efficient coding provides direct link between prior and likelihood in perceptual bayesian inference in bartlett pereira burges bottou and weinberger editors nips pages weiss and fleet velocity likelihoods in biological and machine vision in in probabilistic models of the brain perception and neural function pages weiss simoncelli and adelson motion illusions as optimal percepts nature neuroscience xia ferradans and aujol synthesizing and mixing stationary gaussian texture models siam journal on imaging sciences young and lesperance the gaussian derivative model for vision ii cortical data spatial vision 
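To make the synthesis procedure of the preceding motion clouds paper concrete, the following is a minimal sketch, not the authors' reference implementation, of sampling a motion cloud as a stationary Gaussian random field: a separable Fourier envelope over spatial frequency modulus, orientation, and the speed plane is applied to complex white noise and transformed back into an (x, y, t) movie. All parameter names (v0, theta0, z0, sigma_v, sigma_theta, b_z) and the specific envelope shapes (log-Gaussian in frequency, Gaussian in orientation and in the speed plane, with temporal bandwidth scaling with spatial frequency) are illustrative assumptions rather than the exact distributions or the SPDE-based discretization described in the paper's supplementary material.

import numpy as np

def synthesize_motion_cloud(N=128, T=64, v0=1.0, theta0=0.0, z0=0.15,
                            sigma_v=0.05, sigma_theta=0.2, b_z=0.5, seed=0):
    """Return an (N, N, T) movie sampled from a stationary Gaussian random field
    whose power spectrum is a separable envelope over spatial frequency modulus,
    orientation, and the speed plane (a hedged sketch of motion cloud synthesis)."""
    rng = np.random.default_rng(seed)
    fx, fy, ft = np.meshgrid(np.fft.fftfreq(N), np.fft.fftfreq(N),
                             np.fft.fftfreq(T), indexing="ij")
    z = np.sqrt(fx ** 2 + fy ** 2)
    z[z == 0] = 1e-12                      # avoid log/division issues at DC
    theta = np.arctan2(fy, fx)
    # Radial envelope: Gaussian in log-frequency around the mode z0 (bandwidth b_z).
    env_z = np.exp(-np.log(z / z0) ** 2 / (2 * b_z ** 2))
    # Orientation envelope: Gaussian in the pi-periodic angular distance to theta0.
    dtheta = np.angle(np.exp(2j * (theta - theta0))) / 2
    env_theta = np.exp(-dtheta ** 2 / (2 * sigma_theta ** 2))
    # Speed-plane envelope: energy concentrated near the plane ft = -v0 * fx, with a
    # temporal bandwidth that grows with spatial frequency (sigma_v * z).
    env_v = np.exp(-(ft + v0 * fx) ** 2 / (2 * (sigma_v * z) ** 2))
    envelope = env_z * env_theta * env_v
    # Apply the envelope to complex white noise and return the real space-time movie.
    noise = rng.normal(size=(N, N, T)) + 1j * rng.normal(size=(N, N, T))
    movie = np.real(np.fft.ifftn(envelope * noise))
    return movie / movie.std()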
generative image modeling using spatial lstms matthias bethge university of germany matthias lucas theis university of germany lucas abstract modeling the distribution of natural images is challenging partly because of strong statistical dependencies which can extend over hundreds of pixels recurrent neural networks have been successful in capturing dependencies in number of problems but only recently have found their way into generative image models we here introduce recurrent image model based on multidimensional long memory units which are particularly suited for image modeling due to their spatial structure our model scales to images of arbitrary size and its likelihood is computationally tractable we find that it outperforms the state of the art in quantitative comparisons on several image datasets and produces promising results when used for texture synthesis and inpainting introduction the last few years have seen tremendous progress in learning useful image representations while early successes were often achieved through the use of generative models recent breakthroughs were mainly driven by improvements in supervised techniques yet unsupervised learning has the potential to tap into the much larger source of unlabeled data which may be important for training bigger systems capable of more general scene understanding for example multimodal data is abundant but often unlabeled yet can still greatly benefit unsupervised approaches generative models provide principled approach to unsupervised learning perfect model of natural images would be able to optimally predict parts of an image given other parts of an image and thereby clearly demonstrate form of scene understanding when extended by labels the bayesian framework can be used to perform learning in the generative model while it is less clear how to combine other unsupervised approaches with discriminative learning generative image models are also useful in more traditional applications such as image reconstruction or compression recently there has been renewed strong interest in the development of generative image models most of this work has tried to bring to bear the flexibility of deep neural networks on the problem of modeling the distribution of natural images one challenge in this endeavor is to find the right balance between tractability and flexibility the present article contributes to this line of research by introducing fully tractable yet highly flexible image model our model combines recurrent neural networks with mixtures of experts more specifically the backbone of our model is formed by spatial variant of long memory lstm lstms have been particularly successful in modeling text and speech but have also been used to model the progression of frames in video and very recently to model single images in contrast to earlier work on modeling images here we use lstms which naturally lend themselves to the task of generative image modeling due to their spatial structure and ability to capture correlations ride pixels mcgsm xij ij xij ij slstm units slstm units pixels figure we factorize the distribution of images such that the prediction of pixel black may depend on any pixel in the green region graphical model representation of an mcgsm with causal neighborhood limited to small region visualization of our recurrent image model with two layers of spatial lstms the pixels of the image are represented twice and some arrows are omitted for clarity through feedforward connections the prediction of pixel depends directly on 
its neighborhood green but through recurrent connections it has access to the information in much larger region red to model the distribution of pixels conditioned on the hidden states of the neural network we use mixtures of conditional gaussian scale mixtures mcgsms this class of models can be viewed as generalization of gaussian mixture models but their parametrization makes them much more suitable for natural images by treating images as instances of stationary stochastic process this model allows us to sample and capture the correlations of arbitrarily large images recurrent model of natural images in the following we first review and extend the mcgsm and lstms before explaining how to combine them into recurrent image model section will demonstrate the validity of our approach by evaluating and comparing the model on number of image datasets factorized mixtures of conditional gaussian scale mixtures one successful approach to building flexible yet tractable generative models has been to use fullyvisible belief networks to apply such model to images we have to give the pixels an ordering and specify the distribution of each pixel conditioned on its parent pixels several parametrizations have been suggested for the conditional distributions in the context of natural images we here review and extend the work of theis et al who proposed to use mixtures of conditional gaussian scale mixtures mcgsms let be grayscale image patch and xij be the intensity of the pixel at location ij further let ij designate the set of pixels xmn such that or and figure then xij ij for the distribution of any parametric model with parameters note that this factorization does not make any independence assumptions but is simply an application of the probability chain rule further note that the conditional distributions all share the same set of parameters one way to improve the representational power of model is thus to endow each conditional distribution with its own set of parameters θij xij ij θij applying this trick to mixtures of gaussian scale mixtures mogsms yields the mcgsm untying shared parameters can drastically increase the number of parameters for images it can easily be reduced again by adding assumptions for example we can limit ij to smaller neighborhood surrounding the pixel by making markov assumption we will refer to the resulting set of parents as the pixel causal neighborhood figure another reasonable assumption is stationarity or shift invariance in which case we only have to learn one set of parameters θij which can then be used at every pixel location similar to convolutions in neural networks this allows the model to easily scale to images of arbitrary size while this assumption reintroduces parameter sharing constraints into the model the constraints are different from the ones induced by the joint mixture model the conditional distribution in an mcgsm takes the form of mixture of experts xij ij θij ij θij xij ij θij gate expert where the sum is over mixture component indices corresponding to different covariances and scales corresponding to different variances the gates and experts in an mcgsm are given by ij exp ηcs eαcs ij kc ij xij ij xij ij where kc is positive definite the number of parameters of an mcgsm still grows quadratically with the dimensionality of the causal neighborhood to further reduce the number of parameters we introduce form of the mcgsm with additional parameter sharing by replacing kc with factorized βcn bn bn this factorized mcgsm allows us to use larger 
neighborhoods and more mixture components detailed derivation of more general version which also allows for multivariate pixels is given in supplementary section spatial long memory in the following we briefly describe the spatial lstm slstm special case of the multidimensional lstm first described by graves schmidhuber at the core of the model are memory units cij and hidden units hij for each location ij on grid the operations performed by the spatial lstm are given by gij tanh ij cij gij iij ci fij fij ij ta hi hij tanh cij oij ij fij where is the logistic sigmoid function indicates pointwise product and ta is an affine transformation which depends on the only parameters of the network and the gating units iij and oij determine which memory units are affected by the inputs through gij and which memory states are written to the hidden units hij in contrast to regular lstm defined over time each memory unit of spatial lstm has two preceding states ci and and two corresponding forget gates fij and fij recurrent image density estimator we use grid of slstm units to sequentially read relatively small neighborhoods of pixels from the image producing hidden vector at every pixel the hidden states are then fed into factorized mcgsm to predict the state of the corresponding pixel that is xij ij xij hij importantly the state of the hidden vector only depends on pixels in ij and does not violate the factorization given in equation nevertheless the recurrent network allows this recurrent image density estimator ride to use pixels of much larger region for prediction and to nonlinearly transform the pixels before applying the mcgsm we can further increase the representational power of the model by stacking spatial lstms to obtain deep yet still completely tractable recurrent image model figure related work larochelle murray derived tractable density estimator nade in manner similar to how the mcgsm was derived but using restricted boltzmann machines rbm instead of mixture models as starting point in contrast to the mcgsm nade tries to keep the weight sharing constraints induced by the rbm equation uria et al extended nade to real values and introduced hidden layers to the model gregor et al describe related autoregressive network for binary data which additionally allows for stochastic hidden units gregor et al used lstms to generate images in sequential manner draw because the model was defined over bernoulli variables normalized rgb values had to be treated as probabilities making direct comparison with other image models difficult in contrast to our model the presence of stochastic latent variables in draw means that its likelihood can not be evaluated but has to be approximated ranzato et al and srivastava et al use recurrent neural networks to model videos but recurrency is not used to describe the distribution over individual frames srivastava et al optimize squared error corresponding to gaussian assumption while ranzato et al try to having to model pixel intensities by quantizing image patches in contrast here we also try to solve the problem of modeling pixel intensities by using an mcgsm which is equipped to model as well as distributions experiments ride was trained using stochastic gradient descent with batch size of momentum of and decreasing learning rate varying between and after each pass through the training set the mcgsm of ride was finetuned using for up to iterations before decreasing the learning rate no regularization was used except for early stopping based on validation set 
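For reference, the following is a hedged reconstruction, in our own notation, of the two building blocks that RIDE combines, as described above: the MCGSM conditional with its gates and experts (together with the factorized precision matrices), and the spatial LSTM recurrences. The exact parameterization, for example whether the expert means are indexed by c alone or by both c and s, follows a common MCGSM convention and may differ in detail from the original.

% Hedged reconstruction of the MCGSM conditional (gates and experts):
\[
p(x_{ij} \mid \mathbf{x}_{<ij}) \;=\; \sum_{c,s} p(c, s \mid \mathbf{x}_{<ij})\, p(x_{ij} \mid \mathbf{x}_{<ij}, c, s),
\]
\[
p(c, s \mid \mathbf{x}_{<ij}) \;\propto\; \exp\!\Bigl(\eta_{cs} - \tfrac{1}{2}\, e^{\alpha_{cs}}\, \mathbf{x}_{<ij}^\top K_c\, \mathbf{x}_{<ij}\Bigr),
\qquad
p(x_{ij} \mid \mathbf{x}_{<ij}, c, s) \;=\; \mathcal{N}\!\bigl(x_{ij};\; \mathbf{a}_c^\top \mathbf{x}_{<ij},\; e^{-\alpha_{cs}}\bigr),
\]
% with the factorized form K_c = \sum_n \beta_{cn}^2 \mathbf{b}_n \mathbf{b}_n^\top used to reduce the number of parameters.
% Spatial LSTM recurrences (two preceding states, one per spatial direction):
\[
\begin{pmatrix} \mathbf{g}_{ij} \\ \mathbf{o}_{ij} \\ \mathbf{i}_{ij} \\ \mathbf{f}^{r}_{ij} \\ \mathbf{f}^{c}_{ij} \end{pmatrix}
=
\begin{pmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \\ \sigma \end{pmatrix}
T_{A,\mathbf{b}}\bigl(\mathbf{x}_{<ij}, \mathbf{h}_{i,j-1}, \mathbf{h}_{i-1,j}\bigr),
\quad
\mathbf{c}_{ij} = \mathbf{g}_{ij} \odot \mathbf{i}_{ij} + \mathbf{c}_{i,j-1} \odot \mathbf{f}^{c}_{ij} + \mathbf{c}_{i-1,j} \odot \mathbf{f}^{r}_{ij},
\quad
\mathbf{h}_{ij} = \tanh\bigl(\mathbf{c}_{ij} \odot \mathbf{o}_{ij}\bigr).
\]
% RIDE then models p(x_{ij} | x_{<ij}) \approx p(x_{ij} | h_{ij}) by feeding the hidden state into the factorized MCGSM.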
except where indicated otherwise the recurrent model used pixel wide neighborhood and an mcgsm with components and quadratic features bn in section spatial lstms were implemented using the caffe framework where appropriate we augmented the data by horizontal or vertical flipping of images we found that conditionally whitening the data greatly sped up the training process of both models letting represent pixel and its causal neighborhood conditional whitening replaces these with mx cyx my cyy cyx xx cyx where cyx is the covariance of and and mx is the mean of in addition to speeding up training this variance normalization step helps to make the learning rates less dependent on the training data when evaluating the conditional we compensate for the change in variance by adding the log det note that this preconditioning introduces shortcut connection from the pixel neighborhood to the predicted pixel which is not shown in figure ensembles uria et al found that forming ensembles of their autoregressive model over different pixel orderings significantly improved performance we here consider simple trick to produce an ensemble without the need for training different models or to change training procedures if tk are linear transformations leaving the targeted image distribution invariant or approximately invariant and if is the distribution of pretrained model then we form the ensemble det tk note that this is simply mixture model over images we considered rotating as well as flipping images along the horizontal and vertical axes yielding an ensemble over transformations while it could be argued that most of these transformations do not leave the distribution over natural images invariant we nevertheless observed noticeable boost in performance natural images several recent image models have been evaluated on small image patches sampled from the berkeley segmentation dataset although our model strength lies in its ability to scale to large images and to capture correlations we include results on to make connection to this part of the literature we followed the protocol of uria et al the rgb images were turned to grayscale uniform noise was added to account for the integer discretization and the resulting values were divided by the training set of images was split into images for training and images for validation while the test set contained images we dim nat rnade rnade hl rnade hl eornade layers gmm comp stm comp deep gmm layers mcgsm comp mcgsm comp mcgsm comp mcgsm comp eomcgsm comp ride layer ride layers eoride layers model dim dim dim dim grbm ica gsm isa mogsm comp mcgsm comp ride layer hid ride layer hid ride layers hid ride layers hid eoride layers hid model table average and table average rates for imrates for image patches dc comp and age patches and large images extracted from large images extracted from van hateren dataset extracted by image patches from each set and subtracted the average pixel intensity such that each patch dc component was zero because the resulting image patches live on dimensional subspace the pixel was discarded we used patches for training patches for validation and test patches for evaluation mcgsms have not been evaluated on this dataset and so we first tested mcgsms by training single factorized mcgsm for each pixel conditioned on all previous pixels in fixed ordering we find that already an mcgsm with components and quadratic features outperforms all single models including deep gaussian mixture model table our ensemble of outperforms an ensemble of rnades 
with hidden layers which to our knowledge is currently the best result reported on this dataset training the recurrent image density estimator ride on the dimensional dataset is more cumbersome we tried padding image patches with zeros which was necessary to be able to compute hidden state at every pixel the pixel was ignored during training and evaluation this simple approach led to reduction in performance relative to the mcgsm table possible explanation is that the model can not distinguish between pixel intensities which are zero and zeros in the padded region supplying the model with additional binary indicators as inputs one for each neighborhood pixel did not solve the problem however we found that ride outperforms the mcgsm by large margin when images were treated as instances of stochastic process that is using infinitely large images mcgsms were trained for up to iterations of on pixels and corresponding causal neighborhoods extracted from the training images causal neighborhoods were pixels wide and pixels high ride was trained for epochs on image patches of increasing size ranging from by to by pixels that is gradients were approximated as in backpropagation through time the right column in table shows average rates for both models analogously to the entropy rate we have for the expected rate lim log log xij ij where is an by image patch an average rate can be directly computed for the mcgsm while for ride and ensembles we approximated it by splitting the test images into by patches and evaluating on those to make the two sets of numbers more comparable we transformed nats as commonly reported on the dimensional data into bit per pixel rate using the formula dc ln det ln this takes into account for the missing dc component details on how the ensemble of transformations can be applied despite the missing pixel are given in supplementary section model mcgsm comp mcgsm comp diffusion ride layer ride layer ext ride layers ride layers ride layers eoride layers mcgsm ride neighborhood size table average rates on dead leaf images deep recurrent image model is on par with deep diffusion model using ensembles we are able to further improve the likelihood figure model performance on dead leaves as function of the causal neighborhood width simply increasing the neighborhood size of the mcgsm is not sufficient to improve performance dc and the jacobian of the transformations applied during preprocessing ln det see supplementary section for details the two rates in table are comparable in the sense that their differences express how much better one model would be at losslessly compressing test images than another where models would compress patches of an image independently we highlighted the best result achieved with each model in gray note that most models in this list do not scale as well to large images as the mcgsm or ride gmms in particular and are therefore unlikely to benefit as much from increasing the patch size comparison of the rates reveals that an mcgsm with components applied to large images already captures more correlations than any model applied to small image patches the difference is particularly striking given that the factorized mcgsm has approximately parameters while gmm with components has approximately parameters using an ensemble of rides we are able to further improve this number significantly table another dataset frequently used to test generative image models is the dataset published by van hateren and van der schaaf details of the preprocessing used in this 
paper are given in supplementary section we reevaluated several models for which the likelihood has been reported on this dataset likelihood rates as well as results on by patches are given in table because of the larger patch size ride here already outperforms the mcgsm on patches dead leaves dead leaf images are generated by superimposing disks of random intensity and size on top of each other this simple procedure leads to images which already share many of the statistical properties and challenges of natural images such as occlusions and correlations while leaving out others such as statistics they therefore provide an interesting test case for natural image models we used set of images where each image is by pixels in size we compare the performance of ride to the mcgsm and very recently introduced deep multiscale model based on diffusion process the same images as in previous literature were used for evaluation and we used the remaining images for training we find that the introduction of an slstm with hidden units greatly improves the performance of the mcgsm we also tried an extended version of the slstm which included memory units as additional inputs side of equation this yielded small improvement in performance row in table while adding layers or using more hidden units led to more drastic improvements using layers with hidden units in each layer we find that our recurrent image model is on par with the deep diffusion model by using ensembles we are able to beat all previously published results for this dataset table figure shows that the improved performance of ride is not simply due to an effectively larger causal neighborhood but that the nonlinear transformations performed by the slstm units matter simply increasing the neighborhood size of an mcgsm does not yield the same improvement instead the performance quickly saturates we also find that the performance of ride slightly deteriorates with larger neighborhoods which is likely caused by optimization difficulties figure from top to bottom by pixel crop of texture sample generated by an mcgsm trained on the full texture and sample generated by ride this illustrates that our model can capture variety of different statistical patterns the addition of the recurrent neural network seems particularly helpful where there are strong correlations texture synthesis and inpainting to get an intuition for the kinds of correlations which ride can capture or fails to capture we tried to use it to synthesize textures we used several by pixel textures published by brodatz the textures were split into sixteen by pixel regions of which were used for training and one randomly selected region was kept for testing purposes ride was trained for up to epochs on patches of increasing size ranging from by to by pixels samples generated by an mcgsm and ride are shown in figure both models are able to capture wide range of correlation structures however the mcgsm seems to struggle with textures having bimodal marginal distributions and periodic patterns and ride clearly improves on these textures although it also struggles to faithfully reproduce periodic structure possible explanations include that lstms are not well suited to capture periodicities or that these failures are not penalized strong enough by the likelihood for some textures ride produces samples which are nearly indistinguishable from the real textures and one application of generative image models is inpainting as proof of concept we used our model to inpaint large here by pixels 
region in textures figure missing pixels were replaced by sampling from the posterior of ride unlike the joint distribution the posterior distribution can not be sampled directly and we had to resort to markov chain monte carlo methods we found the following metropolis within gibbs procedure to be efficient enough the missing pixels were initialized via ancestral sampling since ancestral sampling is cheap we generated candidates and used the one with the largest posterior density following initialization we sequentially updated overlapping by pixel regions via metropolis sampling proposals were generated via ancestral sampling and accepted using the acceptance probability xij ij min ij ij where here xij represents by pixel patch and its proposed replacement since evaluating the joint and conditional densities on the entire image is costly we approximated using ride applied to by pixel patch surrounding ij randomly flipping images vertically or horizontally in between the sampling further helped figure shows results after gibbs sampling sweeps conclusion we have introduced ride deep but tractable recurrent image model based on spatial lstms the model exemplifies how recent insights in deep learning can be exploited for generative image figure the center portion of texture left and center was reconstructed by sampling from the posterior distribution of ride right modeling and shows superior performance in quantitative comparisons ride is able to capture many different statistical patterns as demonstrated through its application to textures this is an important property considering that on an intermediate level of abstraction natural images can be viewed as collections of textures we have furthermore introduced factorized version of the mcgsm which allowed us to use more experts and larger causal neighborhoods this model has few parameters is easy to train and already on its own performs very well as an image model it is therefore an ideal building block and may be used to extend other models such as draw or video models deep generative image models have come long way since deep belief networks have first been applied to natural images unlike convolutional neural networks in object recognition however no approach has as of yet proven to be likely solution to the problem of generative image modeling further conceptual work will be necessary to come up with model which can handle both the more abstract as well as the statistics of natural images acknowledgments the authors would like to thank van den oord for insightful discussions and wieland brendel christian behrens and matthias for helpful input on this paper this study was financially supported by the german research foundation dfg priority program be references bell and sejnowski the independent components of natural scenes are edge filters vision research brodatz textures photographic album for artists and designers dover new york url http cover and thomas elements of information theory wiley edition denton chintala szlam and fergus deep generative image models using laplacian pyramid of adversarial networks in advances in neural information processing systems domke karapurkar and aloimonos who killed the directed model in cvpr donahue jia vinyals hoffman zhang tzeng and darrell decaf deep convolutional activation feature for generic visual recognition in icml gerhard theis and bethge modeling natural image statistics in computer and applications wiley vch goodfellow mirza xu ozair courville and bengio generative adversarial nets in 
advances in neural information processing systems graves and schmidhuber offline handwriting recognition with multidimensional recurrent neural networks in advances in neural information processing systems gregor danihelka mnih blundell and wierstra deep autoregressive networks in proceedings of the international conference on machine learning gregor danihelka graves and wierstra draw recurrent neural network for image generation in proceedings of the international conference on machine learning heess williams and hinton learning generative texture models with extended in bmcv hinton osindero and teh fast learning algorithm for deep belief nets neural hochreiter and schmidhuber long memory neural computation hosseini sinz and bethge lower bounds on the redundancy of natural images vis and hoyer emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces neural computation jia shelhamer donahue karayev long girshick guadarrama and darrell caffe convolutional architecture for fast feature embedding kingma and welling variational bayes in iclr kingma rezende mohamed and welling learning with deep generative models in advances in neural information processing systems krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in advances in neural information processing systems larochelle and murray the neural autoregressive distribution estimator in proceedings of the international conference on artificial intelligence and statistics lee mumford and huang occlusion models for natural images statistical study of dead leaves model international journal of computer vision lee grosse ranganath and ng convolutional deep belief networks for scalable unsupervised learning of hierarchical representations in icml li swersky and zemel generative moment matching networks in icml martin fowlkes tal and malik database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics in iccv matheron modele de partition technical report cmm neal connectionist learning of belief networks artificial intelligence ngiam chen koh and ng learning deep energy models in icml osindero and hinton modelling image patches with directed hierarchy of markov random fields in advances in neural information processing systems ranzato susskind mnih and hinton on deep generative models with applications to recognition in ieee conference on computer vision and pattern recognition ranzato szlam bruna mathieu collobert and chopra video language modeling baseline for generative models of natural videos robinson and fallside the utility driven dynamic error propagation network technical report cambridge university roth and black fields of experts international journal of computer vision simonyan and zisserman very deep convolutional networks for image recognition in international conference on learning represenations weiss maheswaranathan and ganguli deep unsupervised learning using nonequilibrium thermodynamics in icml srivastava and salakhutdinov multimodal learning with deep boltzmann machines jmlr srivastava mansimov and salakhutdinov unsupervised learning of video representations using lstms in proceedings of the international conference on machine learning sundermeyer schluter and ney lstm neural networks for language modeling in interspeech sutskever vinyals and le sequence to sequence learning with neural networks in advances in neural information processing systems 
theis gerwinn sinz and bethge in all likelihood deep belief is not enough jmlr theis hosseini and bethge mixtures of conditional gaussian scale mixtures applied to multiscale image representations plos one theis and bethge training sparse natural image models with fast gibbs sampler of an extended state space in advances in neural information processing systems tierney markov chains for exploring posterior distributions the annals of statistics uria murray and larochelle rnade the neural autoregressive in advances in neural information processing systems uria murray and larochelle deep and tractable density estimator in icml van den oord and schrauwen the mixture as natural image patch prior with application to image compression journal of machine learning research van den oord and schrauwen factoring variations in natural images with deep gaussian mixture models in advances in neural information processing systems van hateren and van der schaaf independent component filters of natural images compared with simple cells in primary visual cortex proc of the royal society biological sciences zoran and weiss from learning models of natural image patches to whole image restoration in ieee international conference on computer vision zoran and weiss natural images gaussian mixtures and dead leaves in nips 
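As an illustration of the ensembling trick described in the preceding paper, averaging the model density over transformations that approximately leave the image distribution invariant, here is a minimal sketch. The callable log_density is an assumption standing in for any pretrained model's log-likelihood of a single square image; rotations and flips of the pixel grid are permutations, so |det T_k| = 1 and the determinant factor drops out.

import numpy as np
from scipy.special import logsumexp

def dihedral_transforms(x):
    """Yield the 8 rotations/flips of a square image x (all have |det T_k| = 1)."""
    for k in range(4):
        r = np.rot90(x, k)
        yield r
        yield np.fliplr(r)

def ensemble_log_density(x, log_density):
    """log of p_ens(x) = (1/K) * sum_k p(T_k x), averaged over the K = 8 transforms."""
    logs = [log_density(t) for t in dihedral_transforms(x)]
    return logsumexp(logs) - np.log(len(logs))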
robust pca with compressed data wooseok ha university of chicago haywse rina foygel barber university of chicago rina abstract the robust principal component analysis rpca problem seeks to separate lowrank trends from sparse outliers within data matrix that is to approximate matrix as the sum of matrix and sparse matrix we examine the robust principal component analysis rpca problem under data compression where the data is approximately given by that is sparse data matrix that has been compressed to size with substantially smaller than the original dimension via multiplication with compression matrix we give convex program for recovering the sparse component along with the compressed component along with upper bounds on the error of this reconstruction that scales naturally with the compression dimension and coincides with existing results for the uncompressed setting our results can also handle error introduced through additive noise or through missing data the scaling of dimension compression and signal complexity in our theoretical results is verified empirically through simulations and we also apply our method to data set measuring chlorine concentration across network of sensors to test its performance in practice introduction principal component analysis pca is tool for providing approximation to data matrix with the aim of reducing dimension or capturing the main directions of variation in the data more recently there has been increased focus on more general forms of pca that is more robust to realistic flaws in the data such as outliers the robust pca rpca problem formulates decomposition of the data into component capturing trends across the data matrix and sparse component capturing outlier measurements that may obscure the trends which we seek to separate based only on observing the data matrix depending on the application we may be primarily interested in one or the other component in some settings the sparse component may represent unwanted outliers corrupted may wish to clean the data by removing the outliers and recovering the component in other settings the sparse component may contain the information of instance in image or video data may capture the foreground objects which are of interest while may capture background components which we wish to subtract existing methods to separate the sparse and components include convex and nonconvex methods and can handle extensions or additional challenges such as missing data rather than structure streaming data and different types of structures superimposed with component in this paper we examine the possibility of demixing sparse and low rank structure under the additional challenge of working with data that has been compressed where comprise the approximately and approximately sparse components of the original data matrix while is random or fixed compression matrix in general we think of the compression dimension as being significantly smaller than motivated by several considerations communication constraints if the data matrix consists of measurements taken at remote sensors compression would allow the sensors to transmit information of dimension storage constraints storing matrix with nm many entries instead of nd many entries data privacy if the data is represented as the matrix where features were collected from individuals we can preserve privacy by compressing the data by random linear transformation and allow the access to database only through the compressed data this method has been called matrix masking in the privacy 
literature and studied by in the context of linear regression random projection methods have been shown to be highly useful for reducing dimensionality without much loss of accuracy for numerical tasks such as least squares regression or matrix computations here we use random projections to compress data while preserving the information about the underlying and sparse structure also applied random projection methods to the robust pca problem but their purpose is to accelerate the computational task of approximation which is different from the aim of our work in the compressed robust pca setting we hope to learn about both the and sparse components unlike compressed sensing problems where sparse structure may be reconstructed perfectly with undersampling here we face different type of challenge the sparse component is potentially identifiable from the compressed component using the tools of compressed sensing however the component is not identifiable from its compression specifically if we let pc be the projection operator onto the column span of then the two matrices and pc can not be distinguished after multiplication by therefore our goal will be to recover both the sparse component and the compressed component note that recovering is similar to the goal of recovering the column span of which may be useful interpretation if we think of the columns of the data matrix as data points lying in rn the column span of characterizes subspace of rn that captures the main trends in the data notation we will use the following notation throughout the paper we write for any we write or km to denote the number of nonzero entries in vector or matrix note that this is not in fact norm denotes the ith row of matrix and is treated as column vector we will use the matrix norms km kf frobenius norm km elementwise norm km elementwise norm km spectral norm largest singular value and km nuclear norm also known as the trace norm given by the sum of the singular values of problem and method we begin by formally defining the problem at hand the data which takes the form of matrix is by sum where is and is sparse however we can only access this data through noisy compression our observed data is the matrix where is the compression matrix and discuss specific models for later on absorbs all sources of error and given this model our goal will be to learn about both the and sparse structure in the ordinary robust pca setting the task of separating the and sparse components has been known to be possible when the underlying component satisfies certain conditions incoherence condition in or spikiness condition in in order to successfully decompose the and sparse component in the compressed data we thus need the similar conditions to hold for the compressed component which we define as the product as we will see if satisfies the spikiness condition kl then the compressed component satisfies the similar spikiness condition bound on kp this motivates the possibility to recover both the and sparse components in the case of compressed data as discussed above while we can aim to recover the sparse component there is no hope to recover the original component since is not identifiable in the compressed model therefore we propose natural convex program for recovering the underlying compressed lowrank component and the sparse component note that as discussed in random projection preserves the column span of and so we can recover the column span of via we define our estimators of the sparse component and the product as follows pb 
arg min ky kp note that we impose the spikiness condition kp on in order to guarantee good performance for demixing such two superimposed later section we will see that the same condition holds for this method is parametrized by the triple and natural scalings for these tuning parameters are discussed alongside our theoretical results sources of errors and noise next we give several examples of models and interpretations for the error term in random noise first we may consider model where the signal has an exact sparse decomposition with additive noise added before after the compression step zpre zpost where the entries of the and noise zpre and zpost are subgaussian random variables in this case the noise term in is given by zpre zpost misspecified model next we may consider case where the original data can be closely approximated by sparse decomposition but this decomposition is not exact in this case we could express the original uncompressed data as zmodel where zmodel captures the error of the sparse decomposition then this model misspecification can be absorbed into the noise term zmodel missing data given an original data matrix we might have access only to partial version of this matrix we write to denote the available data where indexes the entries where data is available and ij dij then sparse model for our compressed data is given by zmissing where zmissing in some settings we may first want to adjust before compressing the data for instance by reweighting the observed entries in to ensure closer approximation we have compressed data to denoting the reweighted matrix of partial observations by se zmissing with zmissing and where is the reweighted matrix of then the error from the missing data can be absorbed into the term zmissing combinations finally the observed data may differ from the compressed sparse decomposition due to combination of the factors above in which case we may write zpre zmodel zmissing zpost models for the compression matrix next we consider several scenarios for the compression matrix random compression in some settings the original data naturally lies in but is compressed by the user for some purpose for instance if we have data from individuals with each data point lying in rn we may compress this data for the purpose of providing privacy to the individuals in the data set alternately we may compress data to adhere to constraints on communication bandwidth or on data storage in either case we control the choice of the compression matrix and are free to use simple random model here we consider two models iid gaussian model the entries of are generated as cij orthogonal model where is an orthonormal matrix chosen uniformly at random note that in each case cc id multivariate regression multitask learning in multivariate linear regression we observe matrix of data that follows model where is an observed design matrix is an unknown matrix of coefficients generally the target parameter and is matrix of noise terms often the rows of are thought of as independent samples where each row is multivariate response in this setting the accuracy of the regression can often be improved by leveraging or sparse structure that arises naturally in the matrix of coefficients if is approximately sparse the methodology of this paper can be applied taking the transpose of the multivariate regression model we have compare to our initial model where we replace with and use the compression matrix then if is sparse approximation the multivariate regression can be formulated as 
problem of the form by setting the error term to equal theoretical results in this section we develop theoretical error bounds for the compressed robust pca problem under several of the scenarios described above we first give general deterministic result in section then specialize this result to handle scenarios of and noise and missing data results for multivariate regression are given in the supplementary materials deterministic result we begin by stating version of the restricted eigenvalue property found in the compressed sensing and sparse regression literature definition for matrix and for satisfies the restricted eigenvalue property with constants denoted by rem if log for all rd we now give our main result for the accuracy of the convex program theorem that we will see can be specialized to many of the settings described earlier this theorem gives deterministic result and does not rely on random model for the compression matrix or the error matrix theorem let be any matrix with rank and let be any matrix with at most nonzero entries per row that is maxi let be any compression matrix and define the data and the term as in let as before suppose that satisfies rem where log if parameters satisfy kl cc to the convex program satisfies then deterministically the solution pb kpb ksb sn we now highlight several applications of this theorem to specific settings random compression model with gaussian or subgaussian noise and random compression model with missing data an application to the multivariate linear regression model is given in the supplementary materials results for random compression with subgaussian noise suppose compression matrix is random and that the error term in the model comes from subgaussian noise measurement error that takes place before after the compression zpre zpost our model for this setting is as follows for fixed matrices and where rank and maxi we observe data zpre zpost where the compression matrix is generated under either the gaussian or orthogonal model and where the noise matrices zpre zpost are independent from each other and from with entries iid zpre ij pre iid and zpost ij post for this section we assume without further comment that is the compression should reduce the dimension of the data let max max pre post specializing the result of theorem to this setting we obtain the following probablistic guarantee theorem assume the model suppose that rank maxi and kl then there exist universal constants such that if we define log nd log nd max max and if kpb to the convex program satisfies log nd then the solution pb ksb max max sn log nd with probability at least nd remark if the entries of zpre and zpost are subgaussian rather than gaussian then the same result holds except for change in the constants appearing in the parameters recall that random variable is if etx et for all remark in the case our result matches corollary in agarwal et al exactly except that our result involves multiplicative logarithm factor log nd in the term whereas theirs does this additional log factor arises when we upper bound kl cc which is unavoidable if we want the bound to hold with high probability remark theorem shows the natural scaling the first term is the degree of freedom for compressed rank matrix whereas the term sn log nd is the signal complexity of sparse compod nent which has sn many nonzero entries the multiplicative factor max can be interpreted as the noise variance of the problem amplified by the compression results for random compression with missing data next 
we consider missing data scenario where the original matrix is only partially observed the original complete data is sparse however only subset of entries are are given access to dij for each after reweighting step we compress this data with compression matrix for instance in order to reduce communication storage or computation requirements note that in our paper is equivalent to in since their work defines to be the total number of nonzero entries in while we count entries per row for clarity of presentation we do not include additive noise before or after compression in this section however our theoretical analysis for additive noise theorem and for missing data theorem can be combined in straightforward way to obtain an error bound scaling as sum of the two respective bounds first we specify model for the missing data for each let be the probability that this entry is observed additionally we assume that the sampling scheme is independent across all entries and that the are to proceed we first define reweighted version of the partially observed data matrix and then multiply by the compression matrix where ij dij define also the reweighted versions of the low rank and sparse components ij lij and se ij sij and note that we then have the role of the reweighting step is to ensure that this noise term has where mean zero note that in the reformulation of the model is approximated with compression of where is the original low rank component while is defined above while the original sparse component is not identifiable via the missing data model since we have no information to help us recover entries sij for this new decomposition now has sparse component that is identifiable since by definition preserves the sparsity of but has no nonzero entries in unobserved locations that is ij whenever with this model in place we obtain the following probabilistic guarantee for this setting which is another specialized version of theorem we note that we again have no assumptions on the values of the entries in only on the sparsity there is no bound assumed on ks theorem assume the model suppose that rank maxi and kl if the sampling scheme satisfies for all for some positive constant then there exist universal constants such that if we define log nd log nd nd and if to the convex program satisfies log nd then the solution pb kpb ksb kf log nd sn nd min with probability at least nd experiments in this section we first use simulated data to study the behavior of the convex program for different compression dimensions signal complexities and missing levels which show the close agreement with the scaling predicted by our theory we also apply our method to data set consisting of chlorine measurements across network of sensors for simplicity in all experiments we select which is easier for optimization and generally results in solution that still has low spikiness that is the solution is the same as if we had imposed bound with finite simulated data here we run series of simulations on compressed data to examine the performance of the convex program in all cases we used the compression matrix generated under the orthogonal model we solve the convex program via alternating minimization over and selecting the regularization parameters and that minimizes the squared frobenius error all results are averaged over trials in practice the assumption that are known is not prohibitive for example we might model the row locations of the observed entries are chosen independently see or logistic and column model log in 
either case, fitting such a model using the observed entries is extremely accurate.

Figure: Results for the noisy-data experiment. The total squared error (computed as in the corresponding theorem) is plotted against the compression ratio; note the linear scaling, as predicted by the theory.

Figure: Results for the varying-rank (top row) and varying-sparsity (bottom row) experiments. The total squared error is plotted against the rank or the sparsity proportion; the scaling is nearly linear for most values of the compression dimension.

Simulation: compression ratio. First we examine the role of the compression dimension. We fix the matrix dimensions; the low-rank component is given by the product of two random factor matrices, which fixes its rank, and the sparse component has a fixed fraction of its entries generated at random. The data is then the compression of the sum of these two components plus noise with i.i.d. entries z_ij. The first figure above shows the squared Frobenius error of the estimated compressed low-rank and sparse components plotted against the compression ratio. We see the error scale linearly with the compression ratio, which supports our theoretical results.

Simulation: rank and sparsity. Next we study the role of rank and sparsity. For a fixed matrix size we generate the data as before, but we either vary the rank at fixed sparsity or vary the sparsity at fixed rank. The second figure above shows the squared Frobenius error plotted against either the varying rank or the varying sparsity proportion; we repeat this experiment for several different compression dimensions. We see some deviation from linear scaling for the smallest compression dimension, which may be due to the fact that our theorems give upper bounds rather than tight matching upper and lower bounds, or perhaps because the smallest value does not satisfy the conditions stated in the theorems. For all but the smallest compression dimension, however, the error scales nearly linearly with rank or with sparsity, which is consistent with our theory.

Simulation: missing data. Finally, we perform experiments with missing entries in the data matrix. We fix the dimensions and generate the low-rank and sparse components as before, but do not add noise. To introduce missing entries we use a uniform sampling scheme in which each entry of the data matrix is observed with probability ρ.

Figure: Results for the missing-data experiment. The total squared error (computed as in the corresponding theorem) is plotted against the proportion of observed data, or against 1/ρ, for various values of the compression dimension, based on one trial; note the nearly linear scaling with respect to 1/ρ.

Figure: Results for the chlorine data, averaged over trials, plotting the log of the relative error on the test set for the low-rank-plus-sparse model and the low-rank-only model. The low-rank-plus-sparse model performs better across the range of compression dimensions tested.

The missing-data figure shows the squared Frobenius error of the estimates (see the corresponding theorem for details) across a range of observation probabilities; the squared error scales approximately linearly with 1/ρ, as predicted by our theory.

Chlorine sensor data. To illustrate our method on a specific application, we consider chlorine concentration data from a network of sensors. The data contains a realistic simulation of chlorine concentration measurements from sensors in a hydraulic system over a sequence of time points. We assume the data is well approximated by a low-rank-plus-sparse decomposition. We then compress the data using the orthogonal model and study the performance of our estimators for a range of compression dimensions. To evaluate performance, we use a portion of the entries to fit the model, a validation set for selecting the tuning parameters, and the final
as test set we compare against mab details trix reconstruction equivalent to setting sb and fitting only the component are given in the supplementary materials the results are displayed in figure where we see that the error of the recovery grows smoothly with compression dimension and that the sparse decomposition gives better data reconstruction than the model discussion in this paper we have examined the robust pca problem under data compression where we seek to decompose data matrix into sparse components with access only to partial projection of the data this provides tool for accurate modeling of data with multiple superimposed structures while enabling restrictions on communication privacy or other considerations that may make compression necessary our theoretical results show an intuitive tradeoff between the compression ratio and the error of the fitted sparse decomposition which coincides with existing results in the extreme case of no compression compression ratio future directions for this problem include adapting the method to the streaming data online learning setting data obtained from http references alekh agarwal sahand negahban martin wainwright et al noisy matrix decomposition via convex relaxation optimal rates in high dimensions the annals of statistics peter bickel ya acov ritov and alexandre tsybakov simultaneous analysis of lasso and dantzig selector the annals of statistics pages emmanuel candès xiaodong li yi ma and john wright robust principal component analysis journal of the acm jacm rina foygel ohad shamir nati srebro and ruslan salakhutdinov learning with the weighted under arbitrary sampling distributions in advances in neural information processing systems pages nathan halko martinsson and joel tropp finding structure with randomness probabilistic algorithms for constructing approximate matrix decompositions siam review jun he laura balzano and john lui online robust subspace tracking from partial information arxiv preprint jun he laura balzano and arthur szlam incremental gradient on the grassmannian for online foreground and background separation in subsampled video in ieee conference on computer vision and pattern recognition cvpr pages ieee odalric maillard and rémi munos compressed regression in advances in neural information processing systems pages praneeth netrapalli un niranjan sujay sanghavi animashree anandkumar and prateek jain robust pca in advances in neural information processing systems pages john wright arvind ganesh shankar rao yigang peng and yi ma robust principal component analysis exact recovery of corrupted matrices via convex optimization in advances in neural information processing systems pages huan xu constantine caramanis and sujay sanghavi robust pca via outlier pursuit in advances in neural information processing systems pages shuheng zhou john lafferty and larry wasserman compressed and sparse regression ieee transactions on information theory tianyi zhou and dacheng tao godec randomized sparse matrix decomposition in noisy case in proceedings of the international conference on machine learning pages 
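To make the simulation protocol referenced above concrete, the following is a minimal, self-contained Python sketch of a compressed low-rank-plus-sparse decomposition experiment. It is not the paper's exact estimator or solver: the Gaussian sketching matrix, the proximal-gradient solver, the problem sizes, the iteration count, and the regularization weights lam_L and lam_S are all illustrative assumptions.

```python
import numpy as np

def prox_nuclear(X, tau):
    # Singular value thresholding: proximal operator of tau * nuclear norm.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(X, tau):
    # Elementwise soft thresholding: proximal operator of tau * l1 norm.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def fit_compressed(Y_comp, Phi, lam_L, lam_S, n_iter=300):
    """Proximal-gradient fit of a low-rank + sparse decomposition from the
    compressed observations Phi @ (L + S + noise). Illustrative, not the
    paper's solver."""
    n1 = Phi.shape[1]
    n2 = Y_comp.shape[1]
    L = np.zeros((n1, n2))
    S = np.zeros((n1, n2))
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        R = Phi @ (L + S) - Y_comp   # residual in the compressed space
        G = Phi.T @ R                # shared gradient for both blocks
        L = prox_nuclear(L - step * G, step * lam_L)
        S = soft_threshold(S - step * G, step * lam_S)
    return L, S

rng = np.random.default_rng(0)
n1, n2, rank, spars = 100, 100, 5, 0.05
U = rng.normal(size=(n1, rank))
V = rng.normal(size=(n2, rank))
L_true = U @ V.T / np.sqrt(rank)                                  # low-rank part
S_true = (rng.random((n1, n2)) < spars) * rng.normal(scale=3.0, size=(n1, n2))
Y = L_true + S_true + 0.1 * rng.normal(size=(n1, n2))             # noisy data

for m in (20, 40, 60, 80, 100):                      # compression dimensions
    Phi = rng.normal(size=(m, n1)) / np.sqrt(m)      # Gaussian sketching matrix
    L_hat, S_hat = fit_compressed(Phi @ Y, Phi, lam_L=1.0, lam_S=0.1)
    err = (np.linalg.norm(L_hat - L_true) ** 2 +
           np.linalg.norm(S_hat - S_true) ** 2)
    print(f"compression ratio n1/m = {n1/m:4.1f}   total squared error = {err:8.1f}")
```

In a full experiment the regularization weights would be selected on a validation split, as in the chlorine-data experiment above, and the resulting error could then be plotted against the compression ratio n1/m to check for the near-linear scaling predicted by the theory.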
sampling from probabilistic submodular models alkis gotovos eth zurich hamed hassani eth zurich andreas krause eth zurich alkisg hamed krausea abstract submodular and supermodular functions have found wide applicability in machine learning capturing notions such as diversity and regularity respectively these notions have deep consequences for optimization and the problem of approximately optimizing submodular functions has received much attention however beyond optimization these notions allow specifying expressive probabilistic models that can be used to quantify predictive uncertainty via marginal inference prominent special cases include ising models and determinantal point processes but the general class of and models is much richer and little studied in this paper we investigate the use of markov chain monte carlo sampling to perform approximate inference in general and models in particular we consider simple gibbs sampling procedure and establish two sufficient conditions the first guaranteeing and the second fast log mixing we also evaluate the efficiency of the gibbs sampler on three examples of such models and compare against recently proposed variational approach introduction modeling notions such as coverage representativeness or diversity is an important challenge in many machine learning problems these notions are well captured by submodular set functions analogously supermodular functions capture notions of smoothness regularity or cooperation as result submodularity and supermodularity akin to concavity and convexity have found numerous applications in machine learning the majority of previous work has focused on optimizing such functions including the development and analysis of algorithms for minimization and maximization as well as the investigation of practical applications such as sensor placement active learning influence maximization and document summarization beyond optimization though it is of interest to consider probabilistic models defined via submodular functions that is distributions over finite sets or equivalently binary random vectors defined as exp βf where is submodular or supermodular function equivalently either or is submodular and is scaling parameter finding most likely sets in such models captures classical submodular optimization however going beyond point estimates that is performing general probabilistic marginal inference in them allows us to quantify uncertainty given some observations as well as learn such models from data only few special cases belonging to this class of models have been extensively studied in the past most notably ising models which are in the usual case of attractive ferromagnetic potentials or under repulsive potentials and determinantal point processes which are recently djolonga and krause considered more general treatment of such models and proposed variational approach for performing approximate probabilistic inference for them it is natural to ask to what degree the usual alternative to variational methods namely monte carlo sampling is applicable to these models and how it performs in comparison to this end in this paper we consider simple markov chain monte carlo mcmc algorithm on and logsupermodular models and provide first analysis of its performance we present two theoretical conditions that respectively guarantee and fast log mixing in such models and experimentally compare against the variational approximations on three examples problem setup we start by considering set functions where is finite ground set of 
size without loss of generality if not otherwise stated we will hereafter assume that the marginal gain obtained by adding element to set is defined as intuitively submodularity expresses notion of diminishing returns that is adding an element to larger set provides less benefit than adding it to smaller one more formally is submodular if for any and any it holds that supermodularity is defined analogously by reversing the sign of this inequality in particular if function is submodular then the function is supermodular if function is both submodular and supermodular then it is called modular and may be written in the form mv where and mv for all our main focus in this paper are distributions over the powerset of of the form exp βf for all where is submodular or supermodular the scaling parameter is referred to as inverse temperature and distributions of the above form are called or logsupermodular respectively the constant denominator exp βf serves the purpose of normalizing the distribution and is called the partition function of an alternative and equivalent way of defining distributions of the above form is via binary random vectors if we define xv it is easy to see that the distribution px exp βf over binary vectors is isomorphic to the distribution over sets of with slight abuse of notation we will use to denote and use to refer to both distributions example models the ferromagnetic ising model is an example of model in its simplest form it is defined through an undirected graph and set of pairwise potentialspσv its distribution has the form exp σv and is because σv is supermodular each σv is supermodular and supermodular functions are closed under addition determinantal point processes dpps are examples of models dpp is defined via positive semidefinite matrix and has distribution of the form det ks where ks denotes the square submatrix indexed by since ln det ks is submodular function is another example of models are those defined through facility location functions which have the form wv where wv and are submodular if wv then represents set cover function note that both the facility location model and the ising model use decomposable functions that is functions that can be written as sum of simpler submodular resp supermodular functions marginal inference our goal is to perform marginal inference for the distributions described above concretely for some fixed we would like to compute the probability of sets that contain all elements of but no elements outside of that is more generally we are interested in computing conditional probabilities of the form this computation can be reduced to computing unconditional marginals as follows for any define the contraction of on fc by fc for all also for any define the restriction of to by for all if is submodular then its contractions and restrictions are also submodular and thus fc is submodular finally it is easy to see that exp fc in algorithm gibbs sampler input ground set distribution exp βf random subset of for to niter do unif xt xt padd exp exp unif if padd then xt else xt end for our experiments we consider computing marginals of the form for some which correspond to and sampling and mixing times performing exact inference in models defined by boils down to computing the partition function unfortunately this is generally problem which was shown to be the case even for ising models by jerrum and sinclair however they also proposed fpras for class of ferromagnetic models which gives us hope that it may be possible to efficiently 
perform approximate inference in more general models under suitable conditions mcmc sampling approaches are based on performing randomly selected local moves in state space to approximately compute quantities of interest the visited states form markov chain which under mild conditions converges to stationary distribution crucially the probabilities of transitioning from one state to another are carefully chosen to ensure that the stationary distribution is identical to the distribution of interest in our case the state space is the powerset of equivalently the space of all binary vectors of length and to approximate the marginal probabilities of we construct chain over subsets of that has stationary distribution the gibbs sampler in this paper we focus on one of the simplest and most commonly used chains namely the gibbs sampler also known as the glauber chain we denote by the transition matrix of the chain each element corresponds to the conditional probability of transitioning from state to state that is xt for any and any we also define an adjacency relation on the elements of the state space which denotes that and differ by exactly one element it follows that each has exactly neighbors the gibbs sampler is defined by an iterative procedure as shown in algorithm first it selects an element uniformly at random then it adds or removes to the current state xt according to the conditional probability of the resulting state importantly the conditional probabilities that need to be computed do not depend on the partition function thus the chain can be simulated efficiently even though is unknown and hard to compute moreover it is easy to see that thus the sampler only requires black box for the marginal gains of which are often faster to compute than the values of itself finally it is easy to show that the stationary distribution of the chain constructed this way is mixing times approximating quantities of interest using mcmc methods is based on using time averages to estimate expectations over the desired distribution in particular we estimate the expt pected value of function by ep for example to estimate the marginal for some we would define xv for all the choice of time and number of samples in the above expression presents tradeoff between computational efficiency and approximation accuracy it turns out that the effect of both and is largely dependent on fundamental quantity of the chain called mixing time the mixing time of chain quantifies the number of iterations required for the distribution of xt to be close to the stationary distribution more formally it is defined as tmix min where denotes the over the starting state of the chain total variation distance between the distribution of xt and establishing upper bounds on the ing time of our gibbs sampler is therefore sufficient to guarantee efficient approximate marginal inference see theorem theoretical results in the previous section we mentioned that exact computation of the partition function for the class of models we consider here is in general infeasible only for very few exceptions such as dpps is exact inference possible in polynomial time even worse it has been shown that the partition function of general ising models is hard to approximate in particular there is no fpras for these models unless rp np this implies that the mixing time of any markov chain with such stationary distribution will in general be exponential in it is therefore our aim to derive sufficient conditions that guarantee mixing times for the general class 
of models in some of our results we will use the fact that any submodular function can be written as where is constant that has no effect on distributions defined by is normalized modular function and is normalized monotone submodular function that is it additionally satisfies the monotonicity property for all and all similar decomposition is possible for any supermodular function as well mixing our guarantee for mixing times that are polynomial in depends crucially on the following quantity which is defined for any set function ζf max intuitively ζf quantifies notion of distance to modularity to see this note that function is modular if and only if for all for modular functions therefore we have ζf furthermore function is submodular if and only if for all similarly is supermodular if the above holds with the sign reversed it follows that for submodular and supermodular functions ζf represents the amount by which violates the modular equality it is also important to note that for submodular and supermodular functions ζf depends only on the monotone part of if we decompose according to then it is easy to see that ζf ζf trivial upper bound on ζf therefore is ζf another quantity that has been used in the past to quantify the deviation of submodular function from modularity is the curvature defined as κf although of similar intuitive meaning the multiplicative nature of its definition makes it significantly different from ζf which is defined additively as an example of function class with ζf that do not depend on assume ground set sl pl and consider functions where is bounded concave function for example min φmax functions of this form are submodular and have been used in applications such as document summarization to encourage diversity it is easy to see that for such functions ζf lφmax that is ζf is independent of the following theorem establishes bound on the mixing time of the gibbs sampler run on models of the form the bound is exponential in ζf but polynomial in theorem for any function the mixing time of the gibbs sampler is bounded by tmix exp log where pmin min if is submodular or supermodular then the bound is improved to tmix exp βζf log note that since the factor of two that constitutes the difference between the two statements of the theorem lies in the exponent it can have significant impact on the above bounds the dependence on pmin is related to the starting state of the chain and can be eliminated if we have way to guarantee starting state if is submodular or supermodular this is usually straightforward to accomplish by using one of the standard optimization algorithms as preprocessing step more generally if is bounded by fmax for all then log nβfmax canonical paths our proof of theorem is based on the method of canonical paths the idea of this method is to view the state space as graph and try to construct path between each pair of states that carries certain amount of flow specified by the stationary distribution under consideration depending on the choice of these paths and the resulting load on the edges of the graph we can derive bounds on the mixing time of the markov chain more concretely let us assume that for some set function and corresponding distribution as in we construct the gibbs chain on state space with transition matrix we can view the state space as directed graph that has vertex set and for any contains edge if and only if that is if and only if and differ by exactly one element now assume that for any pair of states we define what is called canonical path 
γab such that all si are edges in the above graph we denote the length of path γab by and define we also denote the set of all pairs of states whose canonical path goes through by css γab the following quantity referred to as the congestion of an edge uses collection of canonical paths to quantify to what amount that edge is overloaded the denominator quantifies the capacity of edge while the sum represents the total flow through that edge according to the choice of canonical paths the congestion of the whole graph is then defined as low congestion implies that there are no bottlenecks in the state space and the chain can move around fast which also suggests rapid mixing the following theorem makes this concrete theorem for any collection of canonical paths with congestion the mixing time of the chain is bounded by tmix log proof outline of theorem to apply theorem to our class of distributions we need to construct set of canonical paths in the corresponding state space and upper bound the resulting congestion first note that to transition from state to state in our case it is enough to remove the elements of and add the elements of each removal and addition corresponds to an edge in the state space graph and the order of these operations identify canonical path in this graph that connects to for our analysis we assume fixed order on the natural order of the elements themselves and perform the operations according to this order having defined the set of canonical paths we proceed to bounding the congestion for any edge the main difficulty in bounding is due to the sum in over all pairs in css to simplify this sum we construct for each edge an injective map ηss css this is combinatorial encoding technique that has been previously used in similar proofs to ours we then prove the following key lemma about these maps lemma for any and any it holds that exp ηss since ηss is injective it follows that ηss furthermore it is clear that each canonical path γab has length since we need to add remove at most elements to get from state to state combining these two facts with the above lemma we get exp if is submodular or supermodular we show that the dependence on ζf in lemma is improved to exp βζf more details can be found in the longer version of the paper fast mixing we now proceed to show that under some stronger conditions we are able to establish even log for any function we denote and define the following quantity γf max tanh which quantifies the maximum total influence of an element on the values of for example if the inclusion of makes no difference with respect to other elements of the ground set we will have γf the following theorem establishes conditions for fast mixing of the gibbs sampler when run on models of the form theorem for any set function if γf then the mixing time of the gibbs sampler is bounded by tmix log log γf if is additionally submodular or supermodular and is decomposed according to then log log tmix γf note that in the second part of the theorem γf depends only on the monotone part of we have seen in section that some commonly used models are based on decomposable functions that can be written in the form we prove the following corollary that provides an easy to check condition for fast mixing of the gibbs sampler when is decomposable submodular function corollary for any submodular function that can be written in the form of with being its monotone also decomposable part according to if we define xp xp θf max and λf max then it holds that γf θf for example applying this 
to the facility location model defined in section we get θf pl maxv wv and λf max wv and obtain fast mixing if θf λf as special case if we consider the class of set cover functions wv such that each covers at most sets and each set is covered by at most elements then θf λf and we obtain fast mixing if note that the corollary can be trivially applied to any submodular function by taking but may in general result in loose bound if used that way coupling our proof of theorem is based on the coupling technique more specifically we use the path coupling method given markov chain xt on state space with transition matrix coupling for zt is new markov chain xt yt on state space such that both xt and yt are by themselves markov chains with transition matrix the idea is to construct the coupling in such way that even when the starting points and are different the chains xt and yt tend to coalesce then it can be shown that the coupling time tcouple min xt yt is closely related to the mixing time of the original chain zt the main difficulty in applying the coupling approach lies in the construction of the coupling itself for which one needs to consider any possible pair of states yt zt the path coupling technique makes this construction easier by utilizing the same graph that we used to define canonical paths in section the core idea is to first define coupling only over adjacent states and then extend it for any pair of states by using metric on the graph more concretely let us denote by the path metric on state space that is for any is the minimum length of any path from to in the state space graph the following theorem establishes fast mixing using this metric as well as the diameter of the state space diam maxx theorem for any markov chain zt if xt yt is coupling such that for some and any with it holds that xt yt then the mixing time of the original chain is bounded by tmix log diam log proof outline of theorem in our case the path metric is the hamming distance between the binary vectors representing the states equivalently the number of elements by which two sets differ we need to construct suitable coupling xt yt for any pair of states consider the two corresponding sets that differ by exactly one element and assume that for some the case for some is completely analogous remember that the gibbs sampler first chooses an element uniformly at random and then adds or removes it according to the conditional probabilities our goal is to make the same updates happen to both and as frequently as possible as first step we couple the candidate element for update to always be the same in both chains then we have to distinguish between the following cases if then the conditionals for both chains are identical therefore we can couple both chains to add with probability padd which will result in new sets or remove with probability padd which will result in new sets either way we will have if we can not always couple the updates of the chains because the conditional probabilities of the updates are different in fact we are forced to have different updates one chain adding the other chain removing with probability equal to the difference of the corresponding conditionals which we denote here by pdif if this is the case we will have otherwise the chains will make the same update and will still differ only by element that is putting together all the above we get the following expected distance after one step γf pdif γf exp our result follows from applying theorem with γf noting that diam experiments we compare 
the Gibbs sampler against the variational approach proposed by Djolonga and Krause for performing inference in models of the form p(S) ∝ exp(βF(S)), and we use the same three models as in their experiments. We briefly review the experimental setup here and refer to their paper for more details. The first is a facility location model with an added modular term that penalizes the number of selected elements, that is, a distribution proportional to exp of a submodular facility location function minus a modular penalty on the set size. The model is constructed by randomly subsampling real data from a problem of sensor placement in a water distribution network. In the experiments we iteratively condition on random observations for each variable in the ground set. The second is a pairwise Markov random field (MRF), a generalized Ising model with varying weights, constructed by first randomly sampling points from a Gaussian mixture model and then introducing a pairwise potential for each pair of points, with weight decreasing in the distance of the pair. In the experiments we iteratively condition on pairs of observations, one from each cluster. The third is a higher-order MRF, which is constructed by first generating a random graph and then creating one potential per node, containing that node and all of its neighbors in the graph. The strength of the potentials is controlled by a parameter that is closely related to the curvature of the functions that define them; in the experiments we vary this parameter from a modular model to a strongly supermodular model. For all three models we constrain the size of the ground set so that we are able to compute and compare against the exact marginals. Furthermore, we run multiple repetitions for each model to account for the randomness of the model instance and the random initialization of the Gibbs sampler.

[Figure: absolute error of the marginals computed by the Gibbs sampler compared to variational inference, for the facility location model (against the number of conditioned elements), the pairwise MRF (against the number of conditioned pairs), and the higher-order MRF (against the potential-strength parameter); the legend compares the variational upper and lower bounds with the Gibbs sampler run for increasing numbers of iterations. A modest number of Gibbs iterations outperforms the variational method for the most part.]

The marginals we compute are of the form p(v ∈ S | observations) for all v in the ground set. We run the Gibbs sampler for several different numbers of iterations on each problem instance. In compliance with recommended MCMC practice, we discard the first half of the obtained samples as burn-in and only use the second half for estimating the marginals. The figure compares the average absolute error of the approximate marginals with respect to the exact ones; the averaging is performed over the elements of the ground set and over the different repetitions of each experiment, and error bars depict two standard errors. The two variational approximations are obtained from factorized distributions associated with modular lower and upper bounds respectively. We notice a similar trend on all three models: in the regimes that correspond to less peaked posterior distributions (a small number of conditioned variables, or a small potential-strength parameter), even a few Gibbs iterations outperform both variational approximations. The latter gain an advantage when the posterior is concentrated around only a few states, which happens after having conditioned on almost all variables in the first two models, or for potential strengths close to the supermodular extreme in the third model. Further related work. In work contemporary to ours, Rebeschini and Karbasi analyzed the mixing times of log-submodular models using a method based on matrix norms, which was previously introduced by Dyer et al. and is closely related to path coupling. They arrive at a sufficient condition for fast mixing that is not directly comparable to the one we presented in our theorem. Iyer and Bilmes recently considered a different class of
probabilistic models called submodular point processes which are also defined through submodular functions and have the form they showed that inference in spps is in general also hard problem and provided approximations and solutions for some subclasses the canonical path method for bounding mixing times has been previously used in applications such as approximating the partition function of ferromagnetic ising models approximating matrix permanents and counting matchings in graphs the most prominent application of methods is counting in graphs other applications include counting independent sets in graphs and approximating the partition function of various subclasses of ising models at high temperatures conclusion we considered the problem of performing marginal inference using mcmc sampling techniques in probabilistic models defined through submodular functions in particular we presented for the first time sufficient conditions to obtain upper bounds on the mixing time of the gibbs sampler in general and models furthermore we demonstrated that in practice the gibbs sampler compares favorably to previously proposed variational approximations at least in regimes of high uncertainty we believe that this is an important step towards unified framework for further analysis and practical application of this rich class of probabilistic submodular models acknowledgments this work was partially supported by erc starting grant references david aldous random walks on finite groups and rapidly mixing markov chains in seminaire de probabilites xvii springer russ bubley and martin dyer path coupling technique for proving rapid mixing in markov chains in symposium on foundations of computer science russ bubley martin dyer and catherine greenhill beating the bound for approximately counting colourings proof of rapid mixing in symposium on discrete algorithms michele conforti and gerard cornuejols submodular set functions matroids and the greedy algorithm tight bounds and some generalizations of the theorem disc app persi diaconis and daniel stroock geometric bounds for eigenvalues of markov chains the annals of applied probability josip djolonga and andreas krause from map to marginals variational inference in bayesian submodular models in neural information processing systems martin dyer leslie ann goldberg and mark jerrum matrix norms and rapid mixing for spin systems annals of applied probability martin dyer and catherine greenhill on markov chains for independent sets of algorithms uriel feige vahab mirrokni and jan vondrak maximizing submodular functions in symposium on foundations of computer science satoru fujishige submodular functions and optimization elsevier science andrew gelman and kenneth shirley innovation and intellectual property rights in handbook of markov chain monte carlo crc press daniel golovin and andreas krause adaptive submodularity theory and applications in active learning and stochastic optimization journal of artificial intelligence research rishabh iyer and jeff bilmes submodular point processes with applications in machine learning in international conference on artificial intelligence and statistics mark jerrum very simple algorithm for estimating the number of of graph random structures and algorithms mark jerrum counting sampling and integrating algorithms and complexity mark jerrum and alistair sinclair approximating the permanent siam journal on computing mark jerrum and alistair sinclair approximation algorithms for the ising model siam journal on computing mark 
jerrum alistair sinclair and eric vigoda approximation algorithm for the permanent of matrix with entries journal of the acm david kempe jon kleinberg and eva tardos maximizing the spread of influence through social network in conference on knowledge discovery and data mining daphne koller and nir friedman probabilistic graphical models principles and techniques the mit press andreas krause carlos guestrin anupam gupta and jon kleinberg sensor placements maximizing information while minimizing communication cost in information processing in sensor networks andreas krause jure leskovec carlos guestrin jeanne vanbriesen and christos faloutsos efficient sensor placement optimization for securing large water distribution networks journal of water resources planning and management alex kulesza and ben taskar determinantal point processes for machine learning foundations and trends in machine learning david levin yuval peres and elizabeth wilmer markov chains and mixing times american mathematical society hui lin and jeff bilmes class of submodular functions for document summarization in human language technologies george nemhauser laurence wolsey and marshall fisher an analysis of approximations for maximizing submodular set functions mathematical programming patrick rebeschini and amin karbasi fast mixing for discrete point processes in conference on learning theory alistair sinclair improved bounds for mixing rates of markov chains and multicommodity flow combinatorics probability and computing 
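As a concrete companion to the Gibbs sampler and the time-average marginal estimator described in the preceding sections, the following is a minimal Python sketch for a log-submodular facility location model p(S) ∝ exp(βF(S)). The weight matrix, the value of β, the chain length, the random initialization, and the burn-in convention are illustrative assumptions, not the exact configuration used in the experiments.

```python
import numpy as np

def facility_location(W, S_mask):
    """F(S) = sum over locations l of max_{v in S} w_{l,v}; the max over an
    empty set is taken to be 0."""
    if not S_mask.any():
        return 0.0
    return W[:, S_mask].max(axis=1).sum()

def gibbs_sampler(W, beta, n_iter, rng):
    """Single-site Gibbs sampler for p(S) proportional to exp(beta * F(S)),
    in the style of the algorithm described above."""
    n = W.shape[1]
    S = rng.random(n) < 0.5                      # random initial subset
    samples = np.zeros((n_iter, n), dtype=bool)
    for t in range(n_iter):
        v = rng.integers(n)                      # element chosen uniformly at random
        S_with, S_without = S.copy(), S.copy()
        S_with[v], S_without[v] = True, False
        # Conditional probability of v being in the set; the partition
        # function cancels, so only the two function values are needed.
        gain = beta * (facility_location(W, S_with)
                       - facility_location(W, S_without))
        p_add = 1.0 / (1.0 + np.exp(-gain))
        S[v] = rng.random() < p_add
        samples[t] = S
    return samples

rng = np.random.default_rng(0)
n, n_loc, beta = 10, 5, 1.0
W = rng.random((n_loc, n))                       # nonnegative facility-location weights

samples = gibbs_sampler(W, beta, n_iter=20000, rng=rng)
burn_in = samples.shape[0] // 2                  # discard the first half as burn-in
marginals = samples[burn_in:].mean(axis=0)       # time-average estimate of p(v in S)
print("estimated marginals:", np.round(marginals, 3))
```

Note that the conditional update only needs the difference F(S ∪ {v}) − F(S \ {v}); for decomposable functions this marginal gain is typically much cheaper to compute than F itself, which is what makes the sampler practical for larger ground sets.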
coevolve joint point process model for information diffusion and network mehrdad yichen manuel shuang li hongyuan zha le georgia institute of technology mpi for software mehrdad manuelgr zha lsong abstract information diffusion in online social networks is affected by the underlying network topology but it also has the power to change it online users are constantly creating new links when exposed to new information sources and in turn these links are alternating the way information spreads however these two highly intertwined stochastic processes information diffusion and network evolution have been predominantly studied separately ignoring their dynamics we propose temporal point process model coevolve for such joint dynamics allowing the intensity of one process to be modulated by that of the other this model allows us to efficiently simulate interleaved diffusion and network events and generate traces obeying common diffusion and network patterns observed in networks furthermore we also develop convex optimization framework to learn the parameters of the model from historical diffusion and network evolution traces we experimented with both synthetic data and data gathered from twitter and show that our model provides good fit to the data as well as more accurate predictions than alternatives introduction online social networks such as twitter or weibo have become large information networks where people share discuss and search for information of personal interest as well as breaking news in this context users often forward to their followers information they are exposed to via their followees triggering the emergence of information cascades that travel through the network and constantly create new links to information sources triggering changes in the network itself over time importantly recent empirical studies with twitter data have shown that both information diffusion and network evolution are coupled and network changes are often triggered by information diffusion while there have been many recent works on modeling information diffusion and network evolution most of them treat these two stochastic processes independently and separately ignoring the influence one may have on the other over time thus to better understand information diffusion and network evolution there is an urgent need for joint probabilistic models of the two processes which are largely inexistent to date in this paper we propose probabilistic generative model coevolve for the joint dynamics of information diffusion and network evolution our model is based on the framework of temporal point processes which explicitly characterize the continuous time interval between events and it consists of two interwoven and interdependent components refer to appendix for an illustration information diffusion process we design an identity revealing multivariate hawkes process to capture the mutual excitation behavior of retweeting events where the intensity of such events in user is boosted by previous events from her set of followees though hawkes processes have been used for information diffusion before the key innovation of our approach is to explicitly model the excitation due to particular source node hence revealing the identity of the source such design reflects the reality that information sources are explicitly acknowledged and it also allows particular information source to acquire new links in rate according to her informativeness ii network evolution process we model link creation as an information driven survival 
process and couple the intensity of this process with retweeting events although survival processes have been used for link creation before the key innovation in our model is to incorporate retweeting events as the driving force for such processes since our model has captured the source identity of each retweeting event new links will be targeted toward the information sources with an intensity proportional to their degree of excitation and each source influence our model is designed in such way that it allows the two processes information diffusion and network evolution unfold simultaneously in the same time scale and excise bidirectional influence on each other allowing sophisticated coevolutionary dynamics to be generated see figure importantly the flexibility of our model does not prevent us from efficiently simulating diffusion and link events from the model and learning its parameters from real world data efficient simulation we design scalable sampling procedure that exploits the sparsity of the generated networks its complexity is nd log where is the number of samples is the number of nodes and is the maximum number of followees per user convex parameters learning we show that the model parameters that maximize the joint likelihood of observed diffusion and link creation events can be found via convex optimization finally we experimentally verify that our model can produce coevolutionary dynamics of information diffusion and network evolution and generate retweet and link events that obey common information diffusion patterns cascade structure size and depth static network patterns node degree and temporal network patterns shrinking diameter described in related literature furthermore we show that by modeling the coevolutionary dynamics our model provide significantly more accurate link and diffusion event predictions than alternatives in large scale twitter dataset backgrounds on temporal point processes temporal point process is random process whose realization consists of list of discrete events localized in time ti with ti and many different types of data produced in online social networks can be represented as temporal point processes such as the times of retweets and link creations temporal point process can be equivalently represented as counting process which records the number of events before time let the history be the list of times of events tn up to but not including time then the number of observed events in small time window dt between is dn ti dt and hence dn where is dirac delta function more generally given function we can define the convolution with respect to dn as dn dn ti ti the point process representation of temporal data is fundamentally different from the discrete time representation typically used in social network analysis it directly models the time interval between events as random variables and avoid the need to pick time window to aggregate events it allows temporal events to be modeled in more fine grained fashion and has remarkably rich theoretical support an important way to characterize temporal point processes is via the conditional intensity function stochastic model for the time of the next event given all the times of previous events formally the conditional intensity function intensity for short is the conditional probability of observing an event in small window dt given the history dt event in dt dn where one typically assumes that only one event can happen in small window of size dt dn then given time we can also characterize the 
conditional probability that no event happens during and the conditional density that an event occurs at time as exp dτ and respectively furthermore we can express the of list of events tn in an observation window as log ti dτ tn this simple will later enable us to learn the parameters of our model from observed data finally the functional form of the intensity is often designed to capture the phenomena of interests some useful functional forms we will use later are poisson process the intensity is assumed to be independent of the history but it can be function ii hawkes process the intensity models mutual excitation between events ακω dn κω ti ti where κω exp is an exponential triggering kernel is baseline intensity independent of the history here the occurrence of each historical event increases the intensity by certain amount determined by the kernel and the weight making the intensity history dependent and stochastic process by itself we will focus on the exponential kernel in this paper however other functional forms for the triggering kernel such as function are possible and our model does not depend on this particular choice and iii survival process there is only one event for an instantiation of the process where becomes if an event already happened before generative model of information diffusion and network in this section we use the above background on temporal point processes to formulate our probabilistic generative model for the joint dynamics of information diffusion and network evolution event representation we model the generation of two types of events events er and link creation events el instead of just the time we record each event as triplet source er or el destination time for retweet event the triplet means that the destination node retweets at time tweet originally posted by source node recording the source node reflects the real world scenario that information sources are explicitly acknowledged note that the occurrence of event er does not mean that is directly retweeting from or is connected to this event can happen when is retweeting message by another node where the original information source is acknowledged node will pass on the same source acknowledgement to its followers agree original tweets posted by node are allowed in this notation in this case the event will simply be er given list of retweet events up to but not including time the history hus of retweets by due to source is hus ei ui si ti and si the entire history of retweet events is denoted as hr hus for link creation event the triplet means that destination node creates at time link to source node from time on node starts following node to ease the exposition we restrict ourselves to the case where links can not be deleted and thus each directed link is created only once however our model can be easily augmented to consider multiple link creations and deletions per node pair as discussed in section we denote the link creation history as hl joint model with two interwoven components given users we use two sets of counting processes to record the generated events one for information diffusion and the other for network evolution more specifically retweet events are recorded using matrix of size for each fixed time point the entry in the matrix nus counts the number of retweets of due to source up to time these counting processes are identity revealing since they keep track of the source node that triggers each retweet this matrix can be dense since nus can be nonzero even when node does not directly 
follow we also let dn dnus ii link events are recorded using an adjacency matrix of size for each fixed time point the entry in the matrix aus indicates whether is directly following that is aus means the directed link has been created before for simplicity of exposition we do not allow the matrix is typically sparse but the number of nonzero entries can change over time we also define da daus then the interwoven information diffusion and network evolution processes can be characterized using their respective intensities dn hr hl dt and da hr hl dt where γus and the sign means that the intensity matrices will depend on the joint history hr hl and hence their evolution will be coupled by this coupling we make the counting processes for link creation to be information driven and ii the evolution of the linking structure to change the information diffusion process refer to appendix for an illustration of our joint model in the next two sections we will specify the details of these two intensity matrices information diffusion process we model the intensity for retweeting events using multivariate hawkes process γus ηu βs auv dnvs where is the indicator function and fu auv is the current set of followees of the term ηu is the intensity of original tweets by user on his own initiative becoming the source of cascade and the term βs κω auv dnvs models the propagation of peer influence over the network where the triggering kernel models the decay of peer influence over time note that the retweet intensity matrix is by itself stochastic process that depends on the timevarying network topology the entries in whose growth is controlled by the network evolution process in section hence the model design captures the influence of the network topology and each source influence βs on the information diffusion process more specifically to compute γus one first finds the current set fu of followees of and then aggregates the retweets of these followees that are due to source note that these followees may or may not directly follow source then the more frequently node is exposed to retweets of tweets originated from source via her followees the more likely she will also retweet tweet originated from source once node retweets due to source the corresponding nus will be incremented and this in turn will increase the likelihood of triggering retweets due to source among the followers of thus the source does not simply broadcast the message to nodes directly following her but her influence propagates through the network even to those nodes that do not directly follow her finally this information diffusion model allows node to repeatedly generate events in cascade and is very different from the independent cascade or linear threshold models which allow at most one event per node per cascade network evolution process we model the intensity for link creation using combination of survival and hawkes process aus µu αu dnus where the term aus effectively ensures link is created only once and after that the corresponding intensity is set to zero the term µu denotes baseline intensity which models when node decides to follow source spontaneously at her own initiative the term αu corresponds to the retweets of node due to tweets originally published by source where the triggering kernel models the decay of interests over time here the higher the corresponding retweet intensity the more likely will find information by source useful and will create direct link to the link creation intensity is also stochastic process by 
itself which depends on the retweet events and is driven by the retweet count increments dnus it captures the influence of retweets on the link creation and closes the loop of mutual influence between information diffusion and network topology note that creating link is more than just adding path or allowing information sources to take shortcuts during diffusion the network evolution makes fundamental changes to the diffusion dynamics and stationary distribution of the diffusion process in section as shown in given fixed network structure the expected retweet intensity µs at time due to source will depend of the network structure in highly nonlinear fashion µs ηs where ηs rm has single nonzero entry with value ηs and is the matrix exponential when the stationary intensity ηs is also nonlinearly related to the network structure thus given two network structures and at two points in time which are different by few edges the effect of these edges on the information diffusion is not just simply an additive relation depending on how these newly created edges modify the of the sparse matrix their effect can be drastic to the information diffusion remark in our model each user is exposed to information through set of neighbors by doing so we couple information diffusion with the network evolution increasing the practical application of our model to datasets the particular definition of exposure retweet neighbor will depend on the type of historical information that is available remarkably the flexibility of our model allows for different types of diffusion events which we can broadly classify into two categories in first category events corresponds to the times when an information cascade hits person for example through retweet from one of her neighbors but she does not explicitly like or forward the associated post in second category the person decides to explicitly like or forward the associated post and events corresponds to the times when she does so intuitively events in the latter category are more prone to trigger new connections but are also less frequent therefore it is mostly suitable to large event dataset for examples those ones generated synthetically in contrast the events in the former category are less likely to inspire new links but found in abundance therefore it is very suitable for sparse data consequently in synthetic experiments we used the latter and in the real one we used the former it noteworthy that eq is written based on the latter category but fig in appendix is drawn based on the former efficient simulation of coevolutionary dynamics we can simulate samples link creations tweets and retweets from our model by adapting ogata thinning algorithm originally designed for multidimensional hawkes processes however naive implementation of ogata algorithm would scale poorly for each sample we would need to and thus to draw samples we would need to perform operations where is the number of nodes we designed sampling procedure that is especially for the structure of our model the algorithm is based on the following key idea if we consider each intensity function in and as separate hawkes process and draw sample from each it is easy to show that the minimum among all these samples is valid sample from the model however by drawing samples from all intensities the computational complexity would not improve however when the network is sparse whenever we sample new node or link event from the model only small number of intensity functions in the local neighborhood of the node or the 
link will change. As a consequence, we can reuse most of the samples from the intensity functions for the next new sample and find which intensity functions we need to change in logarithmic time using a heap. Finally, we exploit the properties of the exponential triggering kernel to update individual intensities in constant time for each new sample: if t_i and t_{i+1} are two consecutive events, the history-dependent part of an intensity at t_{i+1} can be obtained from its value at t_i by multiplying by exp(-ω(t_{i+1} - t_i)), without the need to sum over all previous events. The complete simulation algorithm is summarized in the appendix. By using this algorithm we reduce the complexity of drawing n samples to O(n d log m), where m is the number of nodes and d is the maximum number of followees per node; that means our algorithm scales logarithmically with the number of nodes and linearly with the number of edges present at any point in time during the simulation. We also note that the events for link creations, tweets, and retweets are generated in a temporally intertwined and interleaving fashion by the algorithm; this is because every new retweet event will modify the intensity for link creation, and after each link creation we also need to update the retweet intensities.

[Figure: coevolutionary dynamics for synthetic data — spike trains of link and retweet events, link and retweet intensities, and the cross-covariance of the link and retweet intensities.]

[Figure: degree distributions when the network sparsity level reaches a fixed value, for fixed model parameters, compared against power-law and Poisson fits.]

Efficient parameter estimation from coevolutionary events. Given a collection of retweet events {e_i^r} and link creation events {e_i^l} recorded within a time window [0, T), we can easily estimate the parameters of our model using maximum likelihood estimation. Here we compute the joint log-likelihood of these events in the parameters {η_u}, {β_s}, {µ_u}, {α_u} as

\mathcal{L} = \underbrace{\sum_{e_i \in \mathcal{H}^r} \log \gamma_{u_i s_i}(t_i) - \sum_{u,s} \int_0^T \gamma_{us}(\tau)\, d\tau}_{\text{tweets / retweets}} \; + \; \underbrace{\sum_{e_i \in \mathcal{H}^l} \log \lambda_{u_i s_i}(t_i) - \sum_{u,s} \int_0^T \lambda_{us}(\tau)\, d\tau}_{\text{links}}.

For the terms corresponding to retweets, the log term only sums over the actually observed events, but the integral term sums over all possible combinations of destination and source pairs, even if there is no event between a particular pair of destination and source. For such pairs with no observed events, the corresponding counting processes have essentially survived the observation window, and the term -\int_0^T \gamma_{us}(\tau)\, d\tau simply corresponds to the log survival probability. The terms corresponding to links have a similar structure to those for retweets. Since γ_{us}(t) and λ_{us}(t) are linear in the parameters (η_u, β_s) and (µ_u, α_u) respectively, log γ_{us}(t) and log λ_{us}(t) are concave functions of these parameters, and integrating γ_{us} and λ_{us} still results in linear functions of the parameters. Thus the overall objective is concave, and the global optimum can be found by many algorithms; in our experiments we adapt an efficient algorithm developed in previous work. Furthermore, the optimization problem decomposes into independent problems, one per node, and can be readily parallelized.

Properties of simulated networks and cascades. In this section we perform an empirical investigation of the properties of the networks and information cascades generated by our model. In particular, we show that our model can generate coevolutionary retweet and link dynamics and a wide spectrum of static and temporal network patterns and information cascades. The appendix contains additional simulation results and visualizations, as well as an evaluation of our model estimation method on synthetic data. Retweet and link coevolution. The figures visualize the retweet and link events, aggregated across different sources, and the
corresponding intensities for one node and one realization picked at random here it is already apparent that retweets and link creations are clustered in time and often follow each other further figure shows the of the retweet and link creation intensity computed across multiple realizations for the same node if and are two intensities the is function of the time lag defined as dt it can be seen that the has its peak around retweets and link creations are sparsity diameter diameter sparsity diameter diameter figure diameter for network sparsity panels and show the diameter against sparsity over time for fixed and for fixed respectively percentage percentage others cascade size others others cascade depth figure distribution of cascade structure size and depth for different values and fixed highly correlated and over time for ease of exposition we illustrated using one node however we found consistent results across nodes degree distribution empirical studies have shown that the degree distribution of online social networks and microblogging sites follow power law and argued that it is consequence of the rich get richer phenomena the degree distribution of network is power law if the expected number of nodes md with degree is given by md where intuitively the higher the values of the parameters and the closer the resulting degree distribution follows the lower their values the closer the distribution to an random graph figure confirms this intuition by showing the degree distribution for different values of small shrinking diameter there is empirical evidence that the diameter of online social networks and microblogging sites exhibit relatively small diameter and shrinks or flattens as the network grows figures show the diameter on the largest connected component lcc against the sparsity of the network over time for different values of and although at the beginning there is short increase in the diameter due to the merge of small connected components the diameter decreases as the network evolves here nodes arrive to the network when they follow or are followed by node in the largest connected component cascade patterns our model can produce the most commonly occurring cascades structures as well as cascade size and depth distributions as observed in historical twitter data figure summarizes the results the higher the value the shallower and wider the cascades experiments on real dataset in this section we validate our model using large twitter dataset containing nearly tweet retweet and link events from more than users we will show that our model can capture the dynamics and by doing so it predicts retweet and link creation events more accurately than several alternatives appendix contains detailed information about the dataset and additional experiments retweet and link coevolution figures visualize the retweet and link events aggregated across different sources and the corresponding intensities given by our trained model for one node picked at random here it is already apparent that retweets and link creations are clustered in time and often follow each other and our fitted model intensities successfully track such behavior further figure compares the between the empirical retweet and link creation intensities and between the retweet and link creation intensities given by our trained model computed across multiple realizations for the same node the similarity between both is striking and both has its peak around retweets and link creations are highly correlated and over time for ease of 
exposition as in section we illustrated using one node however we found consistent results across nodes see appendix link prediction we use our model to predict the identity of the source for each test link event given the historical link and retweet events before the time of the prediction and compare its performance with two state of the art methods denoted as trf and weng trf measures implementation codes are available at https intensity link event occurrence time retweet event occurrence time cross covariance link spike trains retweet estimated empirical lag figure coevolutionary dynamics for real data spike trains of link and retweet events estimated link and retweet intensities empirical and estimated cross covariance of link and retweet intensities events coevolve trf weng events coevolve hawkes avgrank coevolve trf weng avgrank events coevolve hawkes events links ar links activity ar activity figure prediction performance in the twitter dataset by means of average rank ar and success probability that the true test events rank among the events the probability of creating link from source at given time by simply computing the proportion of new links created from the source with respect to the total number of links created up to the given time weng considers different link creation strategies and makes prediction by combining them we evaluate the performance by computing the probability of all potential links using different methods and then compute the average rank of all true test events avgrank and ii the success probability sp that the true test events rank among the potential events at each test time we summarize the results in fig where we consider an increasing number of training events our model outperforms trf and weng consistently for example for training events our model achieves sp times larger than trf and weng activity prediction we use our model to predict the identity of the node that is going to generate each test diffusion event given the historical events before the time of the prediction and compare its performance with baseline consisting of hawkes process without network evolution for the hawkes baseline we take snapshot of the network right before the prediction time and use all historical retweeting events to fit the model here we evaluate the performance the via the same two measures as in the link prediction task and summarize the results in figure against an increasing number of training events the results show that by modeling the dynamics our model performs significantly better than the baseline discussion we proposed joint model of information diffusion and network evolution which can capture the coevolutionary dynamics mimics the most common static and temporal network patterns observed in networks and information diffusion data and predicts the network evolution and information diffusion more accurately than previous using point processes to model intertwined events in information networks opens up many interesting future modeling work our current model is just of rich set of possibilities offered by point process framework which have been rarely explored before in large scale social network modeling for example we can generalize our model to support link deletion by introducing an intensity matrix modeling link deletions as survival processes gus aus and then consider the counting process associated with the adjacency matrix to evolve as da hl dt dt we also can consider the number of nodes varying over time furthermore large and diverse range of point 
processes can also be used in the framework without changing the efficiency of the simulation and the convexity of the parameter estimation condition the intensity on additional external features such as node attributes acknowledge the authors would like to thank demetris antoniades and constantine dovrolis for providing them with the dataset the research was supported in part by bigdata onr nsf nsf career references kwak lee park and others what is twitter social network or news media www cheng adamic dow and others can cascades be predicted www antoniades and dovrolis dynamics in social networks case study of twitter myers and leskovec the bursty dynamics of the twitter information network www weng ratkiewicz perra goncalves castillo bonchi schifanella menczer and flammini the role of information diffusion in the evolution of social networks kdd du song and zha scalable influence estimation in diffusion networks nips balduzzi and uncovering the temporal dynamics of diffusion networks icml leskovec krause inferring networks of diffusion and influence kdd chakrabarti zhan and faloutsos recursive model for graph mining computer science department page leskovec chakrabarti kleinberg faloutsos and leskovec kronecker graphs an approach to modeling networks jmlr leskovec backstrom kumar and others microscopic evolution of social networks kdd liniger multivariate hawkes processes phd thesis ethz blundell beck heller modelling reciprocating relationships with hawkes processes nips farajtabar du valera zha and song shaping social activity by incentivizing users nips iwata shah and ghahramani discovering latent influence in online social activities via shared cascade poisson processes kdd linderman and adams discovering latent network structure in point process data icml valera modeling adoption of competing products and conventions in social media icdm zhou zha and song learning social infectivity in sparse networks using multidimensional hawkes processes aistats zhou zha and song learning triggering kernels for hawkes processes icml hunter smyth vu and others dynamic egocentric models for citation networks icml vu hunter smyth and asuncion regression models for longitudinal networks nips leskovec kleinberg and faloutsos graphs over time densification laws shrinking diameters and possible explanations kdd goel watts and goldstein the structure of online diffusion networks ec aalen borgan and gjessing survival and event history analysis process point of view kempe kleinberg and tardos maximizing the spread of influence through social network kdd ogata on lewis simulation method for point processes ieee tit erdos and on the evolution of random graphs hungar acad sci backstrom boldi rosa ugander and vigna four degrees of separation websci granovetter the strength of weak ties american journal of sociology pages romero and kleinberg the directed closure process in hybrid networks with an analysis of link formation on twitter icwsm ugander backstrom and kleinberg subgraph frequencies mapping the empirical and extremal geography of large graph collections www watts and strogatz collective dynamics of networks nature gross and blasius adaptive coevolutionary networks review royal society interface singer wagner and strohmaier factors influencing the of social and content networks in online social media modeling and mining ubiquitous social media pages springer 
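As a concrete illustration of the O(1) intensity update exploited by the simulation algorithm described above, the sketch below updates a single Hawkes-style intensity between two consecutive events using only its previous value, rather than summing over all past events. The parameter names (mu, alpha, omega) and the single exponential kernel are assumptions made for illustration; the full model maintains separate retweet and link-creation intensities per node and source.

```python
import math

def update_intensity(lam_prev, t_prev, t_new, mu, alpha, omega, event_at_prev=True):
    """O(1) update of a Hawkes intensity with an exponential kernel (sketch).

    lam_prev is the intensity evaluated just before the event at t_prev.
    The accumulated excitation decays by exp(-omega * dt) between events;
    if an event fired at t_prev, it first adds a jump of size alpha.
    This avoids re-summing the contribution of every previous event.
    """
    excitation = lam_prev - mu          # self-exciting part just before t_prev
    if event_at_prev:
        excitation += alpha             # jump caused by the event at t_prev
    dt = t_new - t_prev
    return mu + excitation * math.exp(-omega * dt)

# Example: carry the intensity forward from an event at t=2.0 to t=2.7.
lam_next = update_intensity(lam_prev=1.4, t_prev=2.0, t_new=2.7,
                            mu=0.5, alpha=0.8, omega=1.0)
```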
supervised learning for dynamical system learning ahmed hefny carnegie mellon university pittsburgh pa ahefny carlton downey carnegie mellon university pittsburgh pa cmdowney geoffrey gordon carnegie mellon university pittsburgh pa ggordon abstract recently there has been substantial interest in spectral methods for learning dynamical systems these methods are popular since they often offer good tradeoff between computational and statistical efficiency unfortunately they can be difficult to use and extend in practice they can make it difficult to incorporate prior information such as sparsity or structure to address this problem we present new view of dynamical system learning we show how to learn dynamical systems by solving sequence of ordinary supervised learning problems thereby allowing users to incorporate prior knowledge via standard techniques such as regularization many existing spectral methods are special cases of this new framework using linear regression as the supervised learner we demonstrate the effectiveness of our framework by showing examples where nonlinear regression or lasso let us learn better state representations than plain linear regression does the correctness of these instances follows directly from our general analysis introduction approaches to learning dynamical systems such as em and mcmc can be slow and suffer from local optima this difficulty has resulted in the development of spectral algorithms which rely on factorization of matrix of observable moments these algorithms are often fast simple and globally optimal despite these advantages spectral algorithms fall short in one important aspect compared to em and mcmc the latter two methods are or frameworks that offer clear template for developing new instances incorporating various forms of prior knowledge for spectral algorithms by contrast there is no clear template to go from set of probabilistic assumptions to an algorithm in fact researchers often relax model assumptions to make the algorithm design process easier potentially discarding valuable information in the process to address this problem we propose new framework for dynamical system learning using the idea of regression to transform dynamical system learning to sequence of ordinary supervised learning problems this transformation allows us to apply the rich literature on supervised learning to incorporate many types of prior knowledge our new methods subsume variety of existing spectral algorithms as special cases the remainder of this paper is organized as follows first we formulate the new learning framework sec we then provide theoretical guarantees for the proposed methods sec finally we give this material is based upon work funded and supported by the department of defense under contract no with carnegie mellon university for the operation of the software engineering institute federally funded research and development center supported by grant from the pnc center for financial services innovation supported by nih grant mh and onr contract regression 𝑞𝑡 future 𝜓𝑡 history ℎ𝑡 𝑜𝑡 regression shifted future extended future 𝜉𝑡 regression 𝑝𝑡 condition on 𝑜𝑡 filter marginalize 𝑜𝑡 predict figure dynamical system observation ot is determined by latent state st and noise figure learning and applying dynamical system with instrumental regression the predictions from provide training data to at test time we filter or predict using the weights from two examples of how our techniques let us rapidly design new and useful dynamical system learning methods 
by encoding modeling assumptions sec framework for spectral algorithms dynamical system is stochastic process distribution over sequences of observations such that at any time the distribution of future observations is fully determined by vector st called the latent state the process is specified by three distributions the initial state distribution the state transition distribution st and the observation distribution ot st for later use we write the observation ot as function of the state st and random noise as shown in figure given dynamical system one of the fundamental tasks is to perform inference where we predict future observations given history of observations typically this is accomplished by maintaining distribution or belief over states st where denotes the first observations represents both our knowledge and our uncertainty about the true state of the system two core inference tasks are filtering and in filtering given the current belief bt and new observation ot we calculate an updated belief that incorporates ot in prediction we project our belief into the future given belief we estimate for some without incorporating any intervening observations the typical approach for learning dynamical system is to explicitly learn the initial transition and observation distributions by maximum likelihood spectral algorithms offer an alternate approach to learning they instead use the method of moments to set up system of equations that can be solved in closed form to recover estimates of the desired parameters in this process they typically factorize matrix or tensor of observed the name spectral algorithms often but not always avoid explicitly estimating the latent state or the initial transition or observation distributions instead they recover observable operators that can be used to perform filtering and prediction directly to do so they use an observable representation instead of maintaining belief bt over states st they maintain the expected value of sufficient statistic of future observations such representation is often called transformed predictive state in more detail we define qt ψt where ψt ot is vector of future features the features are chosen such that qt determines the distribution of future observations there are other forms of inference in addition to filtering and prediction such as smoothing and likelihood evaluation but they are outside the scope of this paper ot filtering then becomes the process of mapping predictive state qt to conditioned on ot while prediction maps predictive state qt to without intervening observations typical way to derive spectral method is to select set of moments involving ψt work out the expected values of these moments in terms of the observable operators then invert this relationship to get an equation for the observable operators in terms of the moments we can then plug in an empirical estimate of the moments to compute estimates of the observable operators while effective this approach can be statistically inefficient the goal of being able to solve for the observable operators is in conflict with the goal of maximizing statistical efficiency and can make it difficult to incorporate prior information each new source of information leads to new moments and different and possibly harder set of equations to solve to address these problems we show that we can instead learn the observable operators by solving three supervised learning problems the main idea is that just as we can represent belief about latent state st as the conditional 
expectation of vector of observable statistics we can also represent any other distributions needed for prediction and filtering via their own vectors of observable statistics given such representation we can learn to filter and predict by learning how to map these vectors to one another in particular the key intermediate quantity for filtering is the extended and marginalized belief ot equivalently ot we represent this distribution via vector ξt ot of features of the extended future the features are chosen such that the extended state pt ξt determines ot given ot filtering and prediction reduce respectively to conditioning on and marginalizing over ot in many models including hidden markov models hmms and kalman filters the extended state pt is linearly related to the predictive state qt property we exploit for our framework that is pt qt for some linear operator for example in discrete system ψt can be an indicator vector representing the joint assignment of the next observations and ξt can be an indicator vector for the next observations the matrix is then the conditional probability table ot ot our goal therefore is to learn this mapping we might try to use linear regression for this purpose substituting samples of ψt and ξt in place of qt and pt since we can not observe qt or pt directly unfortunately due to the overlap between observation windows the noise terms on ψt and ξt are correlated so linear regression will give biased estimate of to counteract this bias we employ instrumental regression instrumental regression uses instrumental variables that are correlated with the input qt but not with the noise this property provides criterion to denoise the inputs and outputs of the original regression problem we remove that part of the that is not correlated with the instrumental variables in our case since past observations do not overlap with future or extended future windows they are not correlated with the noise as can be seen in figure therefore we can use history features ht as instrumental variables in more detail by taking the expectation of given ht we obtain an moment condition for all pt ht qt ht ξt ht ψt ht ξt ht ψt ht assuming that there are enough independent dimensions in ht that are correlated with qt we maintain the rank of the moment condition when moving from to and we can recover by least squares regression if we can compute ψt ht and ξt ht for sufficiently many examples fortunately conditional expectations such as ψt ht are exactly what supervised learning algorithms are designed to compute so we arrive at our learning framework we first use supervised for convenience we assume that the system is that is the distribution of all future observations is determined by the distribution of the next observations note not by the next observations themselves at the cost of additional notation this restriction could easily be lifted future features ψt spectral algorithm for hmm eot where eot is an indicator vector and spans the range of qt typically the top left singular vectors of the joint probability table ot xt and xt xt where xt ot for matrix that spans the range of qt typically the top left singular vectors of the covariance matrix cov ot ot obtained as above ssid for kalman filters time dependent gain ssid for stable kalman filters constant gain uncontrolled hsepsr evaluation functional ks ot for characteristic kernel ks extended future features ξt eot ffilter yt and yt yt where yt is formed by stacking and ot pt specifies gaussian distribution where conditioning on ot 
is straightforward ot and estimate covariance by solving riccati equation pt together with the covariance specify gaussian distribution where conditioning on ot is straightforward kernel bayes rule ko ot ko ot and ko ot estimate state normalizer from output states table examples of existing spectral algorithms reformulated as instrument regression with linear regression here is vector formed by stacking observations through and denotes the outer product details and derivations can be found in the supplementary material learning to estimate ψt ht and ξt ht effectively denoising the training examples and then use these estimates to compute by finding the least squares solution to in summary learning and inference of dynamical system through instrumental regression can be described as follows model specification pick features of history ht future ψt ot and extended future ξt ot ψt must be sufficient statistic for ot ξt must satisfy fpredict ξt for known function fpredict ffilter ξt ot for known function ffilter stage regression learn possibly regression model to estimate ψt ht the training data for this model are ht ψt across time steps regression learn possibly regression model to estimate ξt ht the training data for this model are ht ξt across time steps regression use the feature expectations estimated in and to train model to predict where is linear operator the training data for this model are estimates of obtained from and across time steps initial state estimation estimate an initial state by averaging across several example realizations of our time inference starting from the initial state we can maintain the predictive state qt ψt through filtering given qt we compute pt ξt qt then given the observation ot we can compute ffilter pt ot or in the absence of ot we can predict the next state fpredict pt finally by definition the predictive state qt is sufficient to compute ot the process of learning and inference is depicted in figure modeling assumptions are reflected in the choice of the statistics and as well as the regression models in stages and table demonstrates that we can recover existing spectral algorithms for dynamical system learning using linear regression in addition to providing unifying view of some successful learning algorithms the new framework also paves the way for extending these algorithms in theoretically justified manner as we demonstrate in the experiments below our bounds assume that the training time steps are sufficiently spaced for the underlying process to mix but in practice the error will only get smaller if we consider all time steps assuming ergodicity we can set the initial state to be the empirical average vector of future features in single long sequence ψt it might seem reasonable to learn fcombined qt ot directly thereby avoiding the need to separately estimate pt and condition on ot unfortunately fcombined is nonlinear for common models such as hmms related work this work extends predictive state learning algorithms for dynamical systems which include spectral algorithms for kalman filters hidden markov models predictive state representations psrs and weighted automata it also extends kernel variants such as which builds on all of the above work effectively uses linear regression or linear ridge regression although not always in an obvious way one common aspect of predictive state learning algorithms is that they exploit the covariance structure between future and past observation sequences to obtain an unbiased observable state representation 
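A minimal sketch of the learning and filtering steps summarized above is given below, assuming finite-dimensional feature vectors, ridge regression for the two stage-1 models, and the discrete case where the extended future feature is an indicator over joint assignments with the current observation as the leading index. The function and variable names (and the choice of ridge regression, which is only one of the admissible stage-1 learners) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def learn_predictive_state_model(H, Psi, Xi, lam=1e-3):
    """Two-stage (instrumental) regression sketch.

    H[t], Psi[t], Xi[t] are feature vectors of history h_t, future psi_t,
    and extended future xi_t.  Stage 1 denoises psi_t and xi_t by
    regressing them on the instrument h_t; stage 2 fits the linear
    operator W mapping denoised predictive states to extended states.
    """
    s1a = Ridge(alpha=lam).fit(H, Psi)      # S1A: h_t -> psi_t
    s1b = Ridge(alpha=lam).fit(H, Xi)       # S1B: h_t -> xi_t
    Q_hat, P_hat = s1a.predict(H), s1b.predict(H)
    # S2: least squares from denoised predictive states to extended states,
    # arranged so that p ~= W @ q.
    W = np.linalg.lstsq(Q_hat, P_hat, rcond=None)[0].T
    q0 = Psi.mean(axis=0)   # initial state: empirical average of future features
    return W, q0

def filter_step(W, q, o, n_obs):
    """Filtering sketch for the discrete case: p = W q is a joint table over
    (o_t, later observations); conditioning on the observed o_t selects the
    corresponding block and renormalizes.  The index ordering and the
    non-negativity guard are assumptions for illustration."""
    p = (W @ q).reshape(n_obs, -1)   # rows indexed by o_t
    row = np.maximum(p[o], 0.0)
    return row / row.sum()
```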
boots and gordon note the connection between this covariance and linear instrumental regression in the context of the we use this connection to build general framework for dynamical system learning where the state space can be identified using arbitrary possibly nonlinear supervised learning methods this generalization lets us incorporate prior knowledge to learn compact or regularized models our experiments demonstrate that this flexibility lets us take better advantage of limited data reducing the problem of learning dynamical systems with latent state to supervised learning bears similarity to langford et sufficient posterior representation spr which encodes the state by the sufficient statistics of the conditional distribution of the next observation and represents system dynamics by three functions that are estimated using supervised learning approaches while spr allows all of these functions to be it involves rather complicated training procedure involving multiple iterations of model refinement and model averaging whereas our framework only requires solving three regression problems in sequence in addition the theoretical analysis of only establishes the consistency of spr learning assuming that all regression steps are solved perfectly our work on the other hand establishes convergence rates based on the performance of regression theoretical analysis in this section we present error bounds for instrumental regression these bounds hold regardless of the particular regression method used assuming that the predictions converge to the true conditional expectations the bounds imply that our overall method is consistent let xt yt zt be triplets of input output and instrumental variables lack of independence will result in slower convergence in proportion to the mixing time of our process let and denote xt zt and yt zt and let and denote xt zt and yt zt as estimated by the and regression steps here and we want to analyze the convergence of the output of is of the weights given by ridge regression between outputs and outputs λix here denotes tensor outer product and is regularization parameter that ensures the invertibility of the estimated covariance before we state our main theorem we need to quantify the quality of regression in way that is independent of the functional form to do so we place bound on the error and assume that this bound converges to zero given the definition below for each fixed limn ηδ definition regression bound for any and the regression bound ηδ is number such that with probability at least for all kx ηδ ky ηδ in many applications and will be finite dimensional real vector spaces rdx rdy and rdz however for generality we state our results in terms of arbitrary reproducing kernel hilbert spaces in this case uses kernel ridge regression leading to methods such as for this purpose let and denote the uncentered covariance operators of and respectively and let denote the closure of the range of with the above assumptions theorem gives generic error bound on regression in terms of regression if and are finite dimensional and has full rank then using ordinary least squares setting will give the same bound but with in the first two terms replaced by the minimum eigenvalue of and the last term dropped theorem assume that almost surely assume is operator and let be as defined in then with probability at least for each xtest kxtest kx the error xtest xtest ky is bounded by log log error from regularization error from finite samples error in regression we defer the proof to 
the supplementary material the supplementary material also provides explicit bounds including expressions for the constants hidden by as well as concrete examples of regression bounds ηδ for practical regression models theorem assumes that xtest is in for dynamical systems all valid states satisfy this property however with finite data estimation errors may cause the estimated state xtest to have component in lemma bounds the effect of such errors it states that in stable system this component gets smaller as regression performs better the main limitation of lemma is the assumption that ffilter is which essentially means that the model estimated probability for ot is bounded below there is no way to guarantee this property in practice so lemma provides suggestive evidence rather than guarantee that our learned dynamical system will predict well lemma for observations let be the estimated state given let be the projection of onto assume ffilter is on pt when evaluated at ot and ffilter pt ot for any pt given the assumptions of theorem and assuming that kx for all the following holds for all with probability at least ηδ kx kx since is bounded the prediction error due to diminishes at the same rate as kx experiments and results we now demonstrate examples of tweaking the regression to gain advantage in the first experiment we show that nonlinear regression can be used to reduce the number of parameters needed in thereby improving statistical performance for learning an hmm in the second experiment we show that we can encode prior knowledge as regularization learning knowledge tracing model in this experiment we attempt to model and predict the performance of students learning from an interactive tutor we use the bayesian knowledge tracing bkt model which is essentially hmm the state st represents whether student has learned knowledge component kc and the observation ot represents the of solving the tth question in sequence of questions that cover this kc figure summarizes the model the events denoted by guessing slipping learning and forgetting typically have relatively low probabilities data description we evaluate the model using the geometry area data available from datashop this data was generated by students learning introductory geometry and contains attempts by correct answer skill known skill known incorrect answer skill unknown skill unknown figure transitions and observations in bkt each node represents possible value of the state or observation solid arrows represent transitions while dashed arrows represent observations students in knowledge components as is typical for bkt we consider student attempt at question to be correct iff the student entered the correct answer on the first try without requesting any hints from the help system each training sequence consists of sequence of first attempts for pair we discard sequences of length less than resulting in total of sequences models and evaluation under the reasonable assumption that the two states have distinct observation probabilities this model is hence we define the predictive state to be the expected next observation which results in the following statistics ψt ot and ξt ot where ot is represented by dimensional indicator vector and denotes the kronecker product given these statistics the extended state pt ξt is joint probability table of ot we compare three models that differ by history features and regression method this baseline uses ht and linear regression making it equivalent to the spectral hmm method of as detailed 
in the supplementary material this baseline represents ht by an indicator vector of the joint assignment of the previous observations we set to and uses linear regression this is essentially spectral hmm it thus incorporates more history information compared to at the expense of increasing the number of parameters by this model represents ht by binary vector of length encoding the previous observations and uses logistic regression as the model thus it uses the same history information as but reduces the number of parameters to at the expense of inductive bias we evaluated the above models using random splits of the sequences into training and testing for each testing observation ot we compute the absolute error between actual and expected value ot we report the mean absolute error for each split the results are displayed in figure we see that while incorporating more history information increases accuracy being able to incorporate the same information using more compact model gives an additional gain in accuracy we also compared the method to an hmm trained using expectation maximization em we found that the model is much faster to train than em while being on par with it in terms of prediction modeling independent subsystems using lasso regression spectral algorithms for kalman filters typically use the left singular vectors of the covariance between history and future features as basis for the state space however this basis hides any sparsity that might be present in our original basis in this experiment we show that we can instead use lasso without dimensionality reduction as our regression algorithm to discover sparsity this is useful for example when the system consists of multiple independent subsystems each of which affects subset of the observation coordinates the differences have similar sign but smaller magnitude if we use rmse instead of mae we used matlab logistic regression and em functions model training time relative to em em figure experimental results each graph compares the performance of two models measured by mean absolute error on splits the black line is points below this line indicate that model is better than model the table shows training time figure left singular vectors of left true linear predictor from to ot ot middle covariance matrix between ot and and right sparse regression weights each column corresponds to singular vector only absolute values are depicted singular vectors are ordered by their mean coordinate interpreting absolute values as probability distribution over coordinates to test this idea we generate sequence of observations from kalman filter observation dimensions through and through are generated from two independent subsystems of state dimension dimensions are generated from white noise each subsystem transition and observation matrices have random gaussian coordinates with the transition matrix scaled to have maximum eigenvalue of states and observations are perturbed by gaussian noise with covariance of and respectively we estimate the state space basis using examples assuming and compare the singular vectors of the past to future regression matrix to those obtained from the lasso regression matrix the result is shown in figure clearly using lasso as stage regression results in basis that better matches the structure of the underlying system conclusion in this work we developed general framework for dynamical system learning using supervised learning methods the framework relies on two key principles first we extend the idea of predictive 
state to include extended state as well allowing us to represent all of inference in terms of predictions of observable features second we use past features as instruments in an instrumental regression denoising state estimates that then serve as training examples to estimate system dynamics we have shown that this framework encompasses and provides unified view of some previous successful dynamical system learning algorithms we have also demostrated that it can be used to extend existing algorithms to incorporate nonlinearity and regularizers resulting in better state estimates as future work we would like to apply this framework to leverage additional techniques such as manifold embedding and transfer learning in stage regression we would also like to extend the framework to controlled processes references leonard baum ted petrie george soules and norman weiss maximization technique occurring in the statistical analysis of probabilistic functions of markov chains the annals of mathematical statistics pp gilks richardson and spiegelhalter markov chain monte carlo in practice chapman and hall london isbn this book thoroughly summarizes the uses of mcmc in bayesian analysis it is core book for bayesian studies daniel hsu sham kakade and tong zhang spectral algorithm for learning hidden markov models in colt judea pearl causality models reasoning and inference cambridge university press new york ny usa stock and watson introduction to econometrics series in economics animashree anandkumar rong ge daniel hsu sham kakade and matus telgarsky tensor decompositions for learning latent variable models the journal of machine learning research matthew rosencrantz and geoff gordon learning low dimensional predictive representations in icml international conference on machine learning pages van overschee and de moor subspace identification for linear systems theory implementation applications kluwer academic publishers byron boots arthur gretton and geoffrey gordon hilbert space embeddings of predictive state representations in proc intl conf on uncertainty in artificial intelligence uai kenji fukumizu le song and arthur gretton kernel bayes rule bayesian inference with positive definite kernels journal of machine learning research byron boots spectral approaches to learning predictive representations phd thesis carnegie mellon university december sajid siddiqi byron boots and geoffrey gordon hidden markov models in proceedings of the thirteenth international conference on artificial intelligence and statistics byron boots sajid siddiqi and geoffrey gordon closing the learning planning loop with predictive state representations in robotic research volume pages byron boots and geoffrey gordon an online spectral learning algorithm for partially observable nonlinear dynamical systems in proceedings of the national conference on artificial intelligence methods of moments for learning stochastic languages unified presentation and empirical comparison in proceedings of the international conference on machine learning pages song boots siddiqi gordon and smola hilbert space embeddings of hidden markov models in proc intl conf on machine learning icml byron boots and geoffrey gordon problems with applications to nonlinear system identification in proc intl conf on machine learning icml john langford ruslan salakhutdinov and tong zhang learning nonlinear dynamic models in proceedings of the annual international conference on machine learning icml montreal quebec canada june pages albert corbett and john 
anderson knowledge tracing modelling the acquisition of procedural knowledge user model kenneth koedinger baker cunningham skogsholm leber and john stamper data repository for the edm community the pslc datashop handbook of educational data mining pages le song jonathan huang alexander smola and kenji fukumizu hilbert space embeddings of conditional distributions with applications to dynamical systems in proceedings of the annual international conference on machine learning icml montreal quebec canada june pages 
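To make the Lasso experiment described above concrete, the following sketch fits the stage-1 regression from history features to future features with an L1 penalty so that independent subsystems show up as a sparse pattern in the regression weights, which can then be compared against plain least squares. The regularization strength and the use of scikit-learn are illustrative assumptions, not the exact setup used in the experiments.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def s1_regression_weights(H, Psi, use_lasso=True, alpha=0.05):
    """Stage-1 regression from history features H to future features Psi.

    With use_lasso=True, an L1 penalty encourages each future coordinate to
    depend on only a few history coordinates, exposing block structure when
    the system decomposes into independent subsystems.  Returns the weight
    matrix of shape (dim(Psi), dim(H)); its sparsity pattern can be
    inspected directly or via the singular vectors of the fitted map.
    """
    model = Lasso(alpha=alpha, max_iter=10000) if use_lasso else LinearRegression()
    model.fit(H, Psi)                 # multi-output regression
    return model.coef_

# Example: compare the support of the sparse and dense fits.
# W_sparse = s1_regression_weights(H, Psi, use_lasso=True)
# W_dense  = s1_regression_weights(H, Psi, use_lasso=False)
# print((np.abs(W_sparse) > 1e-6).mean(), (np.abs(W_dense) > 1e-6).mean())
```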
pruning in games tuomas sandholm computer science department carnegie mellon university pittsburgh pa sandholm noam brown computer science department carnegie mellon university pittsburgh pa noamb abstract counterfactual regret minimization cfr is leading algorithm for finding nash equilibrium in large games cfr is an iterative algorithm that repeatedly traverses the game tree updating regrets at each information set we introduce an improvement to cfr that prunes any path of play in the tree and its descendants that has negative regret it revisits that sequence at the earliest subsequent cfr iteration where the regret could have become positive had that path been explored on every iteration the new algorithm maintains cfr convergence guarantees while making iterations significantly if previously known pruning techniques are used in the comparison this improvement carries over to recent variant of cfr experiments show an order of magnitude speed improvement and the relative speed improvement increases with the size of the game introduction games are general model for strategic interaction the last ten years have witnessed leap of several orders of magnitude in the size of games that can be solved to equilibrium this is the game class that this paper focuses on for small games linear program lp can find solution that is nash equilibrium to the game in polynomial time even in the presence of imperfect information however today leading lp solvers only scale to games with around nodes in the game tree instead iterative algorithms are used to approximate solutions for larger games there are variety of such iterative algorithms that are guaranteed to converge to solution among these counterfactual regret minimization cfr has emerged as the most popular and as the variant thereof cfr begins by exploring the entire game tree though sampling variants exist as well and calculating regret for every hypothetical situation in which the player could be key improvement that makes cfr practical in large games is pruning at high level pruning allows the algorithm to avoid traversing the entire game tree while still maintaining the same convergence guarantees the classic version of pruning which we will refer to as partial pruning allows the algorithm to skip updates for player in sequence if the other player current strategy does not reach the sequence with positive probability this dramatically reduces the cost of each iteration the magnitude of this reduction varies considerably depending on the game but can easily be higher than which improves the convergence speed of the algorithm by factor of moreover the benefit of partial pruning empirically seems to be more significant as the size of the game increases while partial pruning leads to large gain in speed we observe that there is still room for much larger speed improvement partial pruning only skips updates for player if an opponent action in the path leading to that point has zero probability this can fail to prune paths that are actually prunable consider game where the first player to act player has hundreds of actions to choose from and where over several iterations the reward received from many of them is extremely poor intuitively we should be able to spend less time updating the strategy for player following these poor actions and more time on the actions that proved worthwhile so far however here partial pruning will continue to update player strategy following each action in every iteration in this paper we introduce better version of pruning 
pruning rbp in which cfr can avoid traversing path in the game tree if either player takes actions leading to that path with zero probability this pruning needs to be temporary because the probabilities may change later in the cfr iterations so the reach probability may turn positive later on the number of cfr iterations during which sequence can be skipped depends on how poorly the sequence has performed in previous cfr iterations more specifically the number of iterations that an action can be pruned is proportional to how negative the regret is for that action we will detail these topics in this paper rbp can lead to dramatic improvement depending on the game as rough example consider game in which each player has very negative regret for actions leading to of nodes partial pruning which skips updates for player when the opponent does not reach the node would traverse of the game tree per iteration in contrast pruning which skips updates when either player does not reach the node would traverse only of the game tree in general rbp roughly squares the performance gain of partial pruning we test rbp with cfr and experiments show that it leads to more than an order of magnitude speed improvement over partial pruning the benefit increases with the size of the game background in this section we present the notation used in the rest of the paper in an game there is finite set of players is the set of all possible histories nodes in the game tree represented as sequence of actions and includes the empty history is the actions available in history and is the player who acts at that history where denotes chance chance plays an action with fixed probability σc that is known to all players the history reached after an action is taken in is child of represented by while is the parent of more generally is an ancestor of and is descendant of represented by if there exists sequence of actions from to are terminal histories for which no actions are available for each player there is payoff function ui if and the game is we define ui ui and maxi imperfect information is represented by information sets for each player by partition ii of for any information set ii all histories are indistinguishable to player so is the information set where is the player such that ii is the set of actions such that for all and maxi we define to be the maximum payoff reachable from history in and to be the minimum that is hvz up and hvz up we define to be the range of payoffs reachable from history in we similarly define and as the maximum minimum and range of payoffs respectively reachable from history in after taking action we define to be the set of information sets reachable by player after taking action formally if for some history and and strategy σi is probability vector over for player in information set the probability of particular action is denoted by σi since all histories in an information set belonging to player are indistinguishable the strategies in each of them must be identical that is for all σi σi and σi σi we define σi to be probability vector for player over all available strategies σi in the game strategy profile is tuple of strategies one for each player ui σi is the expected payoff for player if all players play according to the strategy profile hσi if series of strategies are played over iterations then σit σp is the joint probability of reaching if all players play according to πiσ is the contribution of player to this probability that is the probability of reaching if all players other than and 
chance always chose actions leading to is the contribution of all players other than and chance is the probability of reaching given that has been reached and if in game ii πi πi in this paper we focus on games therefore for we define πi πi for we define the average strategy for an information set to be σit πi σi σt πi nash equilibrium best response to is strategy such that ui ui nash equilibrium is strategy profile where every player plays best response formally it is strategy we define nash equilibrium strategy profile such that ui ui for player as strategy σi that is part of any nash equilibrium in games if σi and are both nash equilibrium strategies then hσi is nash equilibrium an is strategy profile such that ui ui counterfactual regret minimization counterfactual regret minimization cfr is popular algorithm for extensiveform games our analysis of cfr makes frequent use of counterfactual value informally this is the expected utility of an information set given that player tries to reach it for player at information set given strategy profile this is defined as viσ ui the counterfactual value of an action is viσ ui let be the strategy profile used on iteration the instantaneous regret on iteration for action in information set is rt vpσ vpσ and the regret for action in on iteration is rt rt regret for player additionally max rt and rt maxa in the entire game is rit max ui ui σit σi in cfr player in an information set picks an action among the actions with positive regret in proportion to his positive regret on that action formally on each iteration player selects actions according to probabilities if σi otherwise if player plays according to cfr in every iteration then on iteration rt moreover rit rt rt rt so as ti in games if both players average regret ti their average strategies form thus cfr constitutes an anytime algorithm for finding an equilibrium in games applying best response to sequences in section it was explained that if both players average regret approaches zero then their average strategies approach nash equilibrium cfr provides one way to compute strategies that have bounded regret but it is not the only way is variant of cfr in which one player plays cfr and the other player plays best response to the opponent strategy in every iteration calculating best response to fixed strategy is computationally cheap in games of perfect recall costing only single traversal of the game tree by playing best response in every iteration the is guaranteed to have at most zero regret moreover the cfr player regret is still bounded according to however in practice the cfr player regret in tends to be higher than when both players play vanilla cfr since the opponent is clairvoyantly maximizing the cfr player regret for this reason empirical results show that converges slower than cfr even though the regret is always at most zero we now discuss modification of cfr that will motivate the main contribution of this paper which in turn is described in section the idea is that by applying best response only in certain situations and cfr in others we can lower regret for one player without increasing it for the opponent without loss of generality we discuss how to reduce regret for player specifically consider an information set and action where and any history then for any ancestor history such that we know likewise for any descendant history such that we know thus from we see that player strategy on iteration in any information set following action has no effect on player regret for that 
iteration moreover it also has no effect on player regret for any information set except and information sets that follow action therefore by playing best response only in information sets following action and playing vanilla cfr elsewhere player guarantees zero regret for himself in all information sets following action without the practical cost of increasing his regret in information sets before or of increasing player regret this may increase regret for action itself but if we only do this when we can guarantee even after the iteration similarly player can simultaneously play best response in information sets following an action where for this approach leads to lower regret for both players in situations where both players sequences of reaching an information set have zero probability the strategies chosen have no impact on the regret or average strategy for either player so there is no need to compute what strategies should be played from then on our experiments showed that this technique leads to dramatic improvement over cfr in terms of the number of iterations the theoretical convergence bound remains the same however each iteration touches more actions more quickly become positive and are not skipped with partial thus takes longer it depends on the game whether cfr or this technique is faster overall see experiments in appendix pruning introduced in the next section outperforms both of these approaches significantly pruning rbp in this section we present the main contribution of this paper technique for soundly temporary actions from the tree traversal in order to speed it up significantly in section we proposed variant of cfr where player plays best response in information sets that the player reaches with zero probability in this section we show that these information sets and their descendants need not be traversed in every iteration rather the frequency that they must be traversed is proportional to how negative regret is for the action leading to them this traversal does not hurt the regret bound consider an information set and action where rt and regret for at least one other action in is positive and assume from we see that as described in section the strategy played by player on iteration in any information set following action has no effect on player moreover it has no immediate effect on what player will do in the next iteration other than in information sets following action because we know regret for action will still be at most on iteration since and will continue to not be played so rather than traverse the game tree following action we could procrastinate in deciding what player did on iteration in that branch until after iteration at which point regret for that action may be positive that is we could in principle store player strategy for each iteration between and and on iteration calculate best response to each of them and announce that player played those best responses following action on iterations to and update the regrets to match this obviously this itself would not be an improvement but performance would be identical to the algorithm described in section however rather than have player calculate and play best response for each iteration between and separately we could simply calculate best response against the average strategy that player played in those iterations this can be accomplished in single traversal of the game tree we can then announce that player played this best response on each iteration between and this provides benefits similar to the 
algorithm described in section but allows us to do the work of iterations in single traversal we coin this pruning rbp we now present theorem that guarantees that when we can prune through pruning for iterations theorem consider game let be an action such that on iteration let be an information set for any player such that and let let if when then regardless of what is identical for is played in during proof since viσ and viσ so from we get rt thus for iteration rt clearly the theorem is true for we prove the theorem continues to hold inductively for assume the theorem holds for iteration and consider iteration suppose ip and either or then for any there is no ancestor of in an information set in thus does not depend on the strategy in moreover for any if for some then because since or it similarly holds that then from rt does not depend on the strategy in now suppose ii for consider some and some first suppose that since πiσ so πiσ and contributes nothing to the regret of now suppose then for any if then and does not depend on the strategy in finally suppose and then for any such that we know and therefore does not depend on the strategy in now suppose and we proved rt for so thus for all is identical regardless of what is played in we can improve this approach significantly by not requiring knowledge beforehand of exactly how many iterations can be skipped rather we will decide in light of what happens during the intervenσt ing cfr iterations when an action needs to be revisited from we know that rt moreover vp does not depend on thus we can prune from iteration until iteration so long as σt vpσ vpσ in the worst case this allows us to skip only iterations however in practice it performs significantly better though we can not know on iteration how many iterations it will skip because it depends on what is played in our exploratory experiments showed that in practice performance also improves by replacing with more accurate upper bound on reward in cfr will still converge if is pruned for too many iterations however that hurts convergence speed in the experiments included in this paper we conservatively use as the upper bound best response calculation for pruning in this section we discuss how one can efficiently compute the best responses as called for in regretbased pruning the advantage of theorem is that we can wait until after pruning has is until we revisit an decide what strategies were played in during the intervening iterations we can then calculate single best response to the average strategy that the opponent played and say that that best response was played in in each of the intervening iterations this results in zero regret over those iterations for information sets in we now describe how this best response can be calculated efficiently pt typically when playing cfr one stores πit σit for each information set this allows one to immediately calculate the average strategy defined in in any particular iteration if we start pruning on iteration and revisit on iteration we wish to calculate best response to where πi σi an easy approach would be to store the opponent cumulative strategy before pruning begins and subtract it from the current cumulative strategy when pruning ends in fact we only need to store the opponent strategy in information sets that follow action however this could potentially use memory because the same information set belonging to player may be reached from multiple information sets belonging to player in contrast cfr only requires memory and we want to 
maintain this desirable property we accomplish that as follows to calculate best response against we traverse the game tree and calculate the counterfactual value defined in for every action for every information set belonging to player that does not lead to any further player information sets specifically we calculate for every action in such that since we calculate this only for actions where so does not depend on then starting from the bottom information sets we set the strategy to always play the action with the highest counterfactual value ties can be broken arbitrarily and pass this value up as the payoff for reaching repeating the process up the tree in order to calculate best response to we first store before pruning begins the counterfactual values for player against player average strategy for every action in each information set where when we revisit the action on iteration we calculate best response to except that we set the counterfactual value for every action in information set where to be the latter term was stored and the former term can be calculated from the current average strategy profile as before we set to always play whichever action has the highest counterfactual value and pass this term up slight complication arises when we are pruning an action in information set and wish to start pruning an earlier action from information set such that in this case it is necessary to explore action in order to calculate the best response in however if such traversals happen frequently then this would defeat the purpose of pruning action one way to address this is to only prune an action when the number of iterations guaranteed or estimated to be skipped exceeds some threshold this ensures that the overhead is worthwhile and that we are not frequently traversing an action farther down the tree that is already being pruned another option is to add some upper bound to how long we will prune an action if the lower bound for how long we will prune exceeds the upper bound for how long we will prune then we need not traverse in the best response calculation for because will still be pruned when we are finished with pruning in our experiments we use the former approach experiments to determine good parameter for this are presented in appendix pruning with is variant of cfr where the regret is never allowed to go below formally rt max rt rt for and rt for although this change appears small and does not improve the bound on regret it leads to faster empirical convergence was key advancement that allowed limit texas hold em poker to be essentially solved at first glance it would seem that and rbp are incompatible rbp allows actions to be traversed with decreasing frequency as regret decreases below zero however sets floor for regret at zero nevertheless it is possible to combine the two as we now show we modify the definition of regret in so that it can drop below zero but immediately returns to being positive as soon as regret begins increasing formally we modify the definition of regret in for to be as follows rt rt if rt and rt and rt rt rt otherwise this leads to identical behavior in and also allows regret to drop below zero so actions can be pruned when using rbp with regret does not strictly follow the rules for calls for an action to be played with positive probability whenever instantaneous regret for it is positive in the previous iteration since rbp only checks the regret for an action after potentially several iterations have been skipped there may be delay between the 
iteration when an action would return to play in and the iteration when it returns to play in rbp this does not pose theoretical problem cfr convergence rate still applies however this difference is noticeable when combined with linear averaging linear averaging weighs each iteration in the average strategy by it does not affect regret or influence the selection of strategies on an iteration that is with linear averaging the new definition for average strategy becomes σit tπi σi σt tπi linear averaging still maintains the asymptotic convergence rate of constant averaging where each iteration is weighed equally in empirically it causes to converge to nash equilibrium much faster however in vanilla cfr it results in worse performance and there is no proof guaranteeing convergence since rbp with results in behavior that does not strictly conform to linear averaging results in somewhat noisier convergence this can be mitigated by reporting the strategy profile found so far that is closest to nash equilibrium rather than the current average strategy profile and we do this in the experiments experiments we tested pruning in both cfr and against partial pruning as well as against cfr with no pruning our implementation traverses the game tree once each we tested our algorithm on standard leduc hold em and variant of it featuring more actions leduc hold em is popular benchmark problem for game solving due to its size large enough to be highly nontrivial but small enough to be solvable and strategic complexity in leduc hold em there is deck consisting of six cards two each of jack queen and king there are two rounds in the first round each player places an ante of chip in the pot and receives single private card round of betting then takes place with maximum with player going first public shared card is then dealt face up and another round of betting takes place again player goes first and there is maximum if one of the players has pair with the public card that players wins otherwise the player with the higher card wins in standard leduc hold em the bet size in the first round is chips and chips in the second round in our variant which we call there are bet sizes to choose from in the first round player may bet or chips while in the second round player may bet or chips we measure the quality of strategy profile by its exploitability which is the summed distance of both players from nash equilibrium strategy formally exploitability of strategy profile is we measure exploitability against the number of nodes touched over all cfr traversals as shown in figure rbp leads to substantial improvement over vanilla cfr with partial pruning in leduc hold em increasing the speed of convergence by more than factor of this is partially due to the game tree being traversed twice as fast and partially due to the use of best response in sequences that are pruned the benefit of which was described in section the improvement when added on top of is smaller increasing the speed of convergence by about factor of this matches the reduction in game tree traversal size the benefit from rbp is more substantial in the larger benchmark game rbp increases convergence speed of cfr by factor of and reduces the game tree traversal cost by about factor of in rbp improves the rate of convergence by about an order of magnitude rbp also decreases the number of nodes touched per iteration in by about factor of canonical traverses the game tree twice each iteration updating the regrets for each player in separate traversals this 
difference does not however affect the error measure in the experiments leduc hold em hold em figure top exploitability bottom nodes touched per iteration the results imply that larger games benefit more from rbp than smaller games this is not universally true since it is possible to have large game where every action is part of the nash equilibrium nevertheless there are many games with very large action spaces where the vast majority of those actions are suboptimal but players do not know beforehand which are suboptimal in such games rbp would improve convergence tremendously conclusions and future research in this paper we introduced new method of pruning that allows cfr to avoid traversing highregret actions in every iteration our pruning rbp temporarily ceases their traversal in sound way without compromising the overall convergence rate experiments show an order of magnitude speed improvement over partial pruning and suggest that the benefit of rbp increases with game size thus rbp is particularly useful in large games where many actions are suboptimal but where it is not known beforehand which actions those are in future research it would be worth examining whether similar forms of pruning can be applied to other algorithms as well rbp as presented in this paper is for cfr using regret matching to determine what strategies to use on each iteration based on the regrets rbp does not directly apply to other strategy selection techniques that could be used within cfr such as exponential weights because the latter always puts positive probability on actions also it would be interesting to see whether pruning could be applied to methods for equilibriumfinding the results in this paper suggest that for any algorithm to be efficient in large games effective pruning is essential acknowledgement this material is based on work supported by the national science foundation under grants and as well as xsede computing resources provided by the pittsburgh supercomputing center references michael bowling neil burch michael johanson and oskari tammelin limit holdem poker is solved science noam brown sam ganzfried and tuomas sandholm hierarchical abstraction distributed equilibrium computation and with application to champion texas hold em agent in proceedings of the international conference on autonomous agents and systems international foundation for autonomous agents and multiagent systems andrew gilpin javier and tuomas sandholm algorithm with ln convergence for in games mathematical programming conference version appeared in andrew gilpin and tuomas sandholm lossless abstraction of imperfect information games journal of the acm early version finding equilibria in large sequential games of imperfect information appeared in the proceedings of the acm conference on electronic commerce ec pages samid hoda andrew gilpin javier and tuomas sandholm smoothing techniques for computing nash equilibria of sequential games mathematics of operations research conference version appeared in eric griffin jackson time and space efficient algorithm for approximately solving large imperfect information games in aaai workshop on computer poker and imperfect information michael johanson nolan bard neil burch and michael bowling finding optimal abstract strategies in games in aaai conference on artificial intelligence aaai christian kroer kevin waugh fatma and tuomas sandholm faster firstorder methods for game solving in proceedings of the acm conference on economics and computation ec marc lanctot kevin waugh martin 
Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2009. Pays. An interior point approach to large games of incomplete information. In AAAI Computer Poker Workshop. Tuomas Sandholm. The state of solving large incomplete-information games, and application to poker. AI Magazine, Winter 2010, special issue on Algorithmic Game Theory. Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes' Bluff: opponent modelling in poker. In Proceedings of the Annual Conference on Uncertainty in Artificial Intelligence (UAI), July 2005. Oskari Tammelin. Solving large imperfect information games using CFR+. arXiv preprint, 2014. Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold'em. In IJCAI, 2015. Kevin Waugh, David Schnizlein, Michael Bowling, and Duane Szafron. Abstraction pathologies in extensive games. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2009. Martin Zinkevich, Michael Bowling, Michael Johanson, and Carmelo Piccione. Regret minimization in games with incomplete information. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007.
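To make the strategy-update machinery discussed in the pruning sections above concrete, the following is a minimal sketch of regret matching together with the plain CFR+ regret update and the modified update described earlier (regret may dip below zero, but snaps back to the instantaneous regret as soon as it starts increasing from below zero). All names are ours, and the modified rule is our reading of the garbled description above rather than the authors' implementation.

```python
import numpy as np

def regret_matching(regrets):
    """Strategy proportional to positive regrets; uniform if none are positive."""
    positive = np.maximum(regrets, 0.0)
    total = positive.sum()
    return positive / total if total > 0 else np.full_like(regrets, 1.0 / len(regrets))

def cfr_plus_update(regret, instant_regret):
    """Plain CFR+ update: accumulated regret is floored at zero."""
    return max(regret + instant_regret, 0.0)

def rbp_compatible_update(regret, instant_regret):
    """Modified update (our reading of the rule described above) for combining
    CFR+ with regret-based pruning: regret may go negative, but jumps straight
    back to the instantaneous regret once it starts increasing from below zero."""
    if regret < 0 and instant_regret > 0:
        return instant_regret
    return regret + instant_regret
```

Under regret matching, an action whose accumulated regret is negative receives zero probability, which is exactly what makes it a candidate for pruning in the first place.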
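The decision of how long an action may remain pruned can be tracked with simple bookkeeping: starting from the (negative) regret at the iteration where pruning begins, the regret can grow by at most the gap between an upper bound on the action's counterfactual value and the counterfactual value actually achieved at the information set on each intervening iteration. The sketch below is our rendering of that idea under the condition as we read it from the (garbled) description above; all names are ours, and the true condition in the original paper may differ in details.

```python
class PruneTracker:
    """Tracks whether a pruned action must be revisited.

    regret_at_prune: the action's (negative) regret when pruning began.
    value_upper_bound: an upper bound on the action's counterfactual value
    (the experiments above note a conservative upper bound is used).
    """
    def __init__(self, regret_at_prune, value_upper_bound):
        self.bound = regret_at_prune  # worst-case regret the action could have by now
        self.upper = value_upper_bound

    def observe_iteration(self, infoset_cf_value):
        """Call once per skipped CFR iteration with the counterfactual value the
        information set actually obtained under the strategy that was played."""
        self.bound += self.upper - infoset_cf_value

    def can_keep_pruning(self):
        # While the worst-case regret is still negative, regret matching would
        # still assign the action zero probability, so pruning remains sound.
        return self.bound < 0
```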
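Linear averaging, discussed above in the context of CFR+, weighs iteration t by t when accumulating the average strategy, while constant averaging weighs every iteration equally. A minimal sketch of that accumulation for a single information set follows; the class name and normalization step are ours.

```python
import numpy as np

class LinearAverager:
    """Accumulates a linearly weighted average strategy for one information set."""
    def __init__(self, num_actions):
        self.weighted_sum = np.zeros(num_actions)
        self.weight_total = 0.0

    def update(self, t, reach_prob, strategy):
        """t: iteration index; reach_prob: the player's probability of reaching
        this information set; strategy: the strategy played there on iteration t."""
        w = t * reach_prob  # constant averaging would instead use w = reach_prob
        self.weighted_sum += w * np.asarray(strategy)
        self.weight_total += w

    def average(self):
        if self.weight_total == 0:
            return np.full_like(self.weighted_sum, 1.0 / len(self.weighted_sum))
        return self.weighted_sum / self.weight_total
```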
fast testing with analytic representations of probability measures kacper chwialkowski gatsby computational neuroscience unit ucl dino sejdinovic dept of statistics university of oxford aaditya ramdas dept of eecs and statistics uc berkeley aramdas arthur gretton gatsby computational neuroscience unit ucl abstract we propose class of nonparametric tests with cost linear in the sample size two tests are given both based on an ensemble of distances between analytic functions representing each of the distributions the first test uses smoothed empirical characteristic functions to represent the distributions the second uses distribution embeddings in reproducing kernel hilbert space analyticity implies that differences in the distributions may be detected almost surely at finite number of randomly chosen the new tests are consistent against larger class of alternatives than the previous tests based on the empirical characteristic functions while being much faster than the current or energy distancebased tests experiments on artificial benchmarks and on challenging testing problems demonstrate that our tests give better tradeoff than competing approaches and in some cases better outright power than even the most expensive tests this performance advantage is retained even in high dimensions and in cases where the difference in distributions is not observable with low order statistics introduction testing whether two random variables are identically distributed without imposing any parametric assumptions on their distributions is important in variety of scientific applications these include data integration in bioinformatics benchmarking for steganography and automated model checking such problems are addressed in the statistics literature via tests also known as homogeneity tests traditional approaches to testing are based on distances between representations of the distributions such as density functions cumulative distribution functions characteristic functions or mean embeddings in reproducing kernel hilbert space rkhs these representations are infinite dimensional objects which poses challenges when defining distance between distributions examples of such distances include the classical distance between cumulative distribution functions the maximum mean discrepancy mmd an rkhs norm of the difference between mean embeddings and the also known as energy distance which is an test for particular family of kernels tests may also be based on quantities other than distances an example being the kernel fisher discriminant kfd the estimation of which still requires calculating the rkhs norm of difference of mean embeddings with normalization by an inverse covariance operator in contrast to consistent tests heuristics based on such as the difference between characteristic functions evaluated at single frequency have been studied in the context of tests it was shown that the power of such tests can be maximized against fully specified alternative hypotheses where test power is the probability of correctly rejecting the null hypothesis that the distributions are the same in other words if the class of distributions being distinguished is known in advance then the tests can focus only at those particular frequencies where the characteristic functions differ most this approach was generalized to evaluating the empirical characteristic functions at multiple distinct frequencies by thus improving on tests that need to know the single best frequency in advance the cost remains linear in the sample size 
albeit with larger constant this approach still fails to solve the consistency problem however two distinct characteristic functions can agree on an interval and if the tested frequencies fall in that interval the distributions will be indistinguishable in section of the present work we introduce two novel distances between distributions which both use parsimonious representation of the probability measures the first distance builds on the notion of differences in characteristic functions with the introduction of smooth characteristic functions which can be though of as the analytic analogues of the characteristics functions distance between smooth characteristic functions evaluated at single random frequency is almost surely distance definition formalizes this concept between these two distributions in other words there is no need to calculate the whole infinite dimensional representation it is almost surely sufficient to evaluate it at single random frequency although checking more frequencies will generally result in more powerful tests the second distance is based on analytic mean embeddings of two distributions in characteristic rkhs again it is sufficient to evaluate the distance between mean embeddings at single randomly chosen point to obtain almost surely distance to our knowledge this representation is the first mapping of the space of probability measures into finite dimensional euclidean space in the simplest case the real line that is almost surely an injection and as result almost surely metrization this metrization is very appealing from computational viewpoint since the statistics based on it have linear time complexity in the number of samples and constant memory requirements we construct statistical tests in section based on empirical estimates of differences in the analytic representations of the two distributions our tests have number of theoretical and computational advantages over previous approaches the test based on differences between analytic mean embeddings is consistent for all distributions and the test based on differences between smoothed characteristic functions is consistent for all distributions with integrable characteristic functions contrast with which is only consistent under much more onerous conditions as discussed above this same weakness was used by in justifying test that integrates over the entire frequency domain albeit at cost quadratic in the sample size for which the mmd is generalization compared with such quadratic time tests our tests can be conducted in linear time hence we expect their tradeoff to be superior we provide several experimental benchmarks section for our tests first we compare test power as function of computation time for two testing settings amplitude modulated audio samples and the higgs dataset which are both challenging multivariate testing problems our tests give better tradeoff than the characteristic tests of the previous mmd tests and the mmd test in terms of power when unlimited computation time is available we might expect worse performance for the new tests in line with findings for and tests remarkably such loss of power is not the rule for instance when distinguishing signatures of the higgs boson from background noise higgs dataset we observe that test based on differences in smoothed empirical characteristic functions outperforms the mmd this is in contrast to and tests which by construction are less powerful than the mmd next for challenging artificial data both distributions and distributions for which the 
difference is very subtle our tests again give better tradeoff than competing methods analytic embeddings and distances in this section we consider mappings from the space of probability measures into of real valued analytic functions we will show that evaluating these maps at randomly selected points is almost surely injective for any using this result we obtain simple randomized metrization of the space of probability measures this metrization is used in the next section to construct nonparametric tests to motivate our approach we begin by recalling an integral family of distances between distributions denoted maximum mean discrepancies mmd the mmd is defined as mmd sup dp dq where and are probability measures on and bk is the unit ball in the rkhs hk associated with positive definite kernel popular choice of is the gaussian kernel exp kx with bandwidth parameter it can be shown that the mmd is equal to the rkhs distance between so called mean embeddings mmd kµp µq khk where µp is an embedding of the probability measure to hk µp dp and khk denotes the norm in the rkhs hk when is translation invariant the squared mmd can be written corollary mmd dt rd where denotes the fourier transform is the inverse fourier transform and are the characteristic functions of respectively from theorem kernel is called characteristic when the mmd for hk satisfies mmd iff any bounded continuous kernel whose inverse fourier transform is almost everywhere is characteristic by representation it is clear that the mmd with characteristic kernel is metric pseudometrics based on characteristic functions practical limitation when using the mmd in testing is that an empirical estimate is expensive to compute this being the sum of two and an empirical average with cost quadratic in the sample size lemma we might instead consider finite dimensional approximation to the mmd achieved by estimating the integral with the random variable tj tj where tj are sampled independently from the distribution with density function this type of approximation is applied to various kernel algorithms under the name of random fourier features in the statistical testing literature the quantity predates the mmd by considerable time and was studied in and more recently revisited in our first proposition is that can be poor choice of distance between probability measures as it fails to distinguish large class of measures the following result is proved in the appendix proposition let and let tj be sequence of real valued random variables with distribution which is absolutely continuous with respect to the lebesgue measure for any there exists an uncountable set of mutually distinct probability measures on the real line such that for any we are therefore motivated to find distances of the form that can distinguish larger classes of distributions yet remain efficient to compute these distances are characterized as follows definition random metric random process with values in indexed with pairs from the set of probability measures is said to be random metric if it satisfies all the conditions for metric with qualification almost surely formally for all random variables must satisfy if then if then from the statistical testing point of view the coincidence axiom of metric if and only if is key as it ensures consistency against all alternatives the quantity in violates the coincidence axiom so it is only random pseudometric other axioms are trivially satisfied we remedy this problem by replacing the characteristic functions by smooth 
characteristic functions definition smooth characteristic function of measure is characteristic function of convolved with an analytic smoothing kernel dw rd rd proposition shows that smooth characteristic function can be estimated in linear time the analogue of for smooth characteristic functions is simply tj tj where tj are sampled independently from the absolutely continuous distribution returning to our earlier example this might be if we believe this to be an informative choice the following theorem proved in the appendix demonstrates that the smoothing greatly increases the class of distributions we can distinguish theorem let be an analytic integrable kernel with an inverse fourier transform that is nonzero almost everywhere then for any is random metric on the space of probability measures with integrable characteristic functions and is an analytic function this result is primarily consequence of analyticity of smooth characteristic functions and the fact that analytic functions are well behaved there is an additional practical advantage to smoothing when the variability in the difference of the characteristic functions is high and these differences are local smoothing distributes the difference in cfs more broadly in the frequency domain simple illustration is in fig appendix making it easier to find by measurement at small number of randomly chosen points this accounts for the observed improvements in test power in section over differences in unsmoothed cfs metrics based on mean embeddings the key step which leads us to the construction of random metric is the convolution of the original characteristic functions with an analytic smoothing kernel this idea need not be restricted to the representations of probability measures in the frequency domain we may instead directly convolve the probability measure with positive definite kernel that need not be translation invariant yielding its mean embedding into the associated rkhs µp dp we say that positive definite kernel rd is analytic on its domain if for all rd the feature map is an analytic function on rd by using embeddings with characteristic and analytic kernels we obtain particularly useful representations of distributions as for the smoothed cf case we define µp tj µq tj the following theorem ensures that dµ is also random metric note that this does not imply that realizations of are distances on but it does imply that they are almost surely distances for all arbitrary finite subsets of theorem let be an analytic integrable and characteristic kernel then for any dµ is random metric on the space of probability measures and µp is an analytic function note that this result is stronger than the one presented in theorem since it is not restricted to the class of probability measures with integrable characteristic functions indeed the assumption that the characteristic function is integrable implies the existence and boundedness of density recalling the representation of mmd in we have proved that it is almost always sufficient to measure difference between µp and µq at finite number of points provided our kernel is characteristic and analytic in the next section we will see that metrization of the space of probability measures using random metrics dµ is very appealing from the computational point of view it turns out that the statistical tests that arise from these metrics have linear time complexity in the number of samples and constant memory requirements hypothesis tests based on distances between analytic functions in this 
section we provide two tests first test based on analytic mean embeddings and next test based on smooth characteristic functions we further describe the relation with competing alternatives proofs of all propositions are in appendix difference in analytic functions in the previous section we described the random metric based pj on difference in analytic mean embeddings µp tj µq tj if we replace µp with the empirical mean embedding xi it can be shown that for any sequence of unique tj under the null hypothesis as tj tj converges in distribution to sum of correlated variables even for fixed tj it is very computationally costly to obtain quantiles of this distribution since this requires bootstrap or permutation procedure we will follow different approach based on hotelling the hotelling statistic of normally distributed zero mean gaussian vector wj with covariance matrix is the compelling property of the statistic is that it is distributed as variable with degrees of freedom to see link pj between and equation consider random variable this is also distributed as sum of correlated variables in our case is replaced with difference of normalized empirical mean embeddings and is replaced with the empirical covariance of the difference of mean embeddings formally let zi denote the vector of differences between kernels at tests points tj zi xi yi xi tj yi tj rj pn we define the vector of mean empirical differences wn zi and its covariance matrix zi wn zi wn the test statistic is sn nwn wn the computation of sn requires inversion of matrix but this is fast and numerically stable will typically be small and is less than in our experiments the next proposition demonstrates the use of sn as test statistic proposition asymptotic behavior of sn let and let xi and yi be samples from and respectively if exists for large enough then the statistic sn is asymptotically distributed as variable with degrees of freedom as with fixed if then for any fixed sn as we now apply the above proposition to obtain statistical test test analytic mean embedding calculate sn choose threshold corresponding to the quantile of distribution with degrees of freedom and reject the null hypothesis whenever sn is larger than there are number of valid sampling schemes for the test points tj to evaluate the differences in mean embeddings see section for discussion difference in smooth characteristic functions from the convolution definition of smooth characteristic function it is not immediately obvious how to calculate its estimator in linear time in the next proposition however we show that smooth characteristic function is an expected value of some function with respect to the given measure which can be estimated in linear time proposition let be an integrable kernel and its inverse fourier transr form then the smooth characteristic function of can be written as rd eit dp it is now clear that test based on the smooth characteristic functions is similar to the test based on mean embeddings the main difference is in the definition of the vector of differences zi zi xi sin xi yi sin yi xi cos xi yi cos yi the imaginary and real part of the xi xi to ensure that wn and sn as all quantities yi yi are stacked together in order proposition let and let xi and yi be samples from and respectively then the statistic sn is almost surely asymptotically distributed as variable with degrees of freedom as with fixed if then almost surely for any fixed sn as other tests the test based on empirical characteristic functions was constructed originally 
for one test point and then generalized to many points it is quite similar to our second test but does not perform smoothing it is also based on statistic the block mmd is test which can be trivially linearized by fixing the block size as presented in the appendix finally another alternative is the mmd an inherently quadratic time test we scale mmd to linear time by our data set and choosing only points so that the mmd complexity becomes note however that the true complexity of mmd involves permutation calculation of the null distribution at cost bn where the number of permutations bn grows with see appendix for detailed description of alternative tests experiments in this section we compare tests on both artificial benchmark data and on data we denote the smooth characteristic function test as smooth cf and the test based on the analytic mean embeddings as mean embedding we compare against several alternative testing approaches block mmd block mmd characteristic functions based test cf mmd test mmd and the mmd test mmd experimental setup for all the experiments is the dimensionality of samples in dataset is number of samples in the dataset sample size and is number of test frequencies parameter selection is required for all the tests the table summarizes the main choices of the parameters made for the experiments the first parameter is the test function used to calculate the particular statistic the scalar represents the of the observed data notice that for the kernel tests we recover the standard parameterization exp exp kx the original cf test was proposed without any parameters hence we added to ensure fair comparison for this test varying is equivalent to adjusting the variance of the distribution of frequencies tj for all tests the value of the scaling parameter was chosen so as to minimize estimate on training set details are described in appendix we chose not to optimize the sampling scheme for the mean embeddingp and smooth cf tests since this would give them an unfair advantage over the block mmd mmd and cf tests the block size in the block mmd test and the number of test frequencies in the mean embedding smooth cf and cf tests were always set to the same value not greater than to maintain exactly the same time complexity note that we did not use the popular median heuristic for kernel bandwidth choice mmd and since it gives poor results for the blobs and am audio datasets we do not run mmd test for simulation or amplitude modulated music since the sample size is and too large for test with permutation sampling for the test critical value figure higgs dataset left test power sample size right test power execution time it is important to verify that type error is indeed at the design level set at in this paper this is verified in the appendix figure also shown in the plots is the percent confidence intervals for the results as averaged over runs test mean embedding smooth cf mmd mmd block mmd cf test function exp exp it exp exp exp it sampling scheme tj id tj id not applicable not applicable tj id other parameters no of test frequencies no of test frequencies size no of test frequencies real data higgs dataset varies the first experiment we consider is on the uci higgs dataset described in the task is to distinguish signatures of processes that produce higgs bosons from background processes that do not we consider test on certain extremely features in the dataset kinematic properties measured by the particle detectors the joint distributions of the azimuthal angular momenta for 
four particle jets we denote by the jet distribution of the background process no higgs bosons and by the corresponding distribution for the process that produces higgs bosons both are distributions on as discussed in fig unlike transverse momenta pt carry very little discriminating information for recognizing whether higgs bosons were produced therefore we would like to test the null hypothesis that the distributions of angular momenta no higgs boson observed and higgs boson observed might yet be rejected the results for different algorithms are presented in the figure we observe that the joint distribution of the angular momenta is in fact discriminative sample size varies from to the smooth cf test has significantly higher power than the other tests including the mmd which we could only run on up to samples due to computational limitations the leading performance of the smooth cf test is especially remarkable given it is several orders of magnitude faster than the quadratictime mmd even though we used the fastest mmd implementation where the asymptotic distribution is approximated by gamma density real data amplitude modulated music amplitude modulation is the earliest technique used to transmit voice over the radio in the following experiment observations were one thousand dimensional samples of carrier signals that were modulated with two different input audio signals from the same album song and song further details of these data are described in section to increase the difficulty of the testing problem independent gaussian noise of increasing variance in the range to was added to the signals the results are presented in the figure compared to the other tests the mean embedding and smooth cf tests are more robust to the moderate noise contamination simulation high dimensions varies it has recently been shown in theory and in practice that the problem gets more difficult for an increasing number of dimensions increases on which the distributions do not differ in the following experiment we study the power of the tests as function of dimension of the samples we run twosample tests on two datasets of gaussian random vectors which differ only in the first dimension dataset dataset ii id id id diag figure music test power added noise right four samples from and figure power redundant dimensions comparison for tests on high dimensional data where is vector of zeros id is identity matrix and diag is diagonal matrix with on the diagonal the number of dimensions varies from to dataset and from to dataset ii the power of the different tests is presented in figure the mean embedding test yields best performance for both datasets where the advantage is especially large for differences in variance simulation blobs varies the blobs dataset is grid of two dimensional gaussian distributions see figure which is known to be challenging testing task the difficulty arises from the fact that the difference in distributions is encoded at much smaller lengthscale than the overall data in this experiment both and are four by four grids of gaussians where has unit covariance matrix in each mixture component while each component of has direction of the largest variance rotated by and amplified to it was demonstrated by that good choice of kernel is crucial for this task figure presents the results of tests on the blobs dataset the number of samples varies from to mmd reached test power one with we found that the mmd test has the best power as function of the sample size but the worst tradeoff by contrast 
random distance based tests have the best tradeoff acknowledgment we would like thank bharath sriperumbudur and wittawat jitkrittum for insightful comments figure blobs dataset left test power sample size center test power execution time right illustration of the blob dataset references alba and garcia test for the problem based on empirical characteristic functions computational statistics and data analysis anderson an introduction to multivariate statistical analysis wiley july baldi sadowski and whiteson searching for exotic particles in physics with deep learning nature communications baringhaus and franz on new multivariate test mult anal alain berlinet and christine reproducing kernel hilbert spaces in probability and statistics volume kluwer academic boston borgwardt gretton rasch kriegel and smola integrating structured biological data by kernel maximum mean discrepancy bioinformatics davidson pointwise limits of analytic functions am math mon pages epps and singleton an omnibus test for the problem using the empirical characteristic function journal of statistical computation and gretton borgwardt rasch and smola kernel test jmlr gretton fukumizu harchaoui and sriperumbudur fast consistent kernel test in nips gretton sriperumbudur sejdinovic strathmann balakrishnan pontil and fukumizu optimal kernel choice for tests in nips harchaoui bach and moulines testing for homogeneity with kernel fisher discriminant analysis in nips ce heathcote test of goodness of fit for symmetric random variables aust stat cr heathcote the integrated squared error estimation of parameters biometrika ho and shieh for hypothesis testing scandinavian journal of statistics hotelling the generalization of student ratio ann math le sarlos and smola fastfood computing hilbert space expansions in loglinear time in icml volume pages lichman uci machine learning repository lloyd and ghahramani statistical model criticism using kernel two sample tests technical report and jessica fridrich benchmarking for steganography in information hiding pages springer rahimi and recht random features for kernel machines in nips ramdas reddi singh and wasserman on the decreasing power of and nonparametric hypothesis tests in high dimensions aaai reddi ramdas singh and wasserman on the power of lineartime kernel testing under alternatives aistats walter rudin real and complex analysis tata education sejdinovic sriperumbudur gretton and fukumizu equivalence of and statistics in hypothesis testing annals of statistics sriperumbudur fukumizu and lanckriet universality characteristic kernels and rkhs embedding of measures jmlr sriperumbudur gretton fukumizu lanckriet and hilbert space embeddings and metrics on probability measures jmlr steinwart and christmann support vector machines springer science business media steinwart hush and scovel an explicit description of the reproducing kernel hilbert spaces of gaussian rbf kernels information theory ieee transactions on sun and zhou reproducing kernel hilbert spaces associated with analytic mercer kernels journal of fourier analysis and applications gj the energy of statistical samples technical report zaremba gretton and blaschko low variance kernel test in nips ji zhao and deyu meng fastmmd ensemble of circular discrepancy for efficient test neural computation aa zinger av kakosyan and lb klebanov characterization of distributions by mean values of statistics and certain probabilistic metrics journal of mathematical sciences 
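To make the two linear-time tests described above concrete, here is a small numpy/scipy sketch: per-sample feature differences are evaluated at J random test locations (or frequencies), their mean is turned into a Hotelling-type statistic, and the statistic is compared against a chi-squared quantile with degrees of freedom equal to the feature dimension. The Gaussian kernel, the Gaussian weighting function in the smoothed-characteristic-function features, and all names are our illustrative choices, not necessarily the exact parameterization used in the experiments above.

```python
import numpy as np
from scipy import stats

def hotelling_test(Z, alpha=0.05):
    """Given per-sample feature differences Z (n x D), form
    S_n = n * mean(Z)^T Sigma^{-1} mean(Z) and compare it to the (1 - alpha)
    quantile of a chi-squared distribution with D degrees of freedom."""
    n, D = Z.shape
    w = Z.mean(axis=0)
    Sigma = np.cov(Z, rowvar=False) + 1e-8 * np.eye(D)  # small ridge for stability
    Sn = n * w @ np.linalg.solve(Sigma, w)
    threshold = stats.chi2.ppf(1 - alpha, df=D)
    return Sn, threshold, Sn > threshold

def mean_embedding_features(X, Y, T, gamma=1.0):
    """z_i = (k(x_i, t_1) - k(y_i, t_1), ..., k(x_i, t_J) - k(y_i, t_J)) with a
    Gaussian kernel evaluated at random test locations T (J x d)."""
    def k(A):
        sq = ((A[:, None, :] - T[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * gamma ** 2))
    return k(X) - k(Y)

def smoothed_cf_features(X, Y, T, weight=lambda A: np.exp(-0.5 * (A ** 2).sum(-1))):
    """Stacked sin/cos differences weighted by a smoothing function f, giving a
    2J-dimensional feature per sample; the Gaussian f used here is assumed to be
    an acceptable stand-in for the inverse Fourier transform of an analytic
    smoothing kernel, as the theory above requires."""
    def feats(A):
        proj = A @ T.T                      # (n, J) inner products t_j^T a_i
        f = weight(A)[:, None]
        return np.hstack([f * np.sin(proj), f * np.cos(proj)])
    return feats(X) - feats(Y)

# usage sketch: J random locations/frequencies drawn from a standard normal
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
Y = rng.normal(loc=0.5, size=(500, 2))
T = rng.normal(size=(5, X.shape[1]))
print(hotelling_test(mean_embedding_features(X, Y, T)))
print(hotelling_test(smoothed_cf_features(X, Y, T)))
```

Both tests cost O(n) per test location and constant memory, which is the property emphasized above.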
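For comparison, the block-MMD baseline used in the experiments can also be written down compactly: compute the unbiased squared-MMD estimate on consecutive blocks of fixed size and average across blocks. The sketch below calibrates the test with a simple normal approximation across blocks; this is our rough rendering of the baseline, not the exact protocol used in the experiments above.

```python
import numpy as np
from scipy import stats

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of the squared MMD with a Gaussian kernel (equal sample sizes)."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * gamma ** 2))
    m = len(X)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    np.fill_diagonal(Kxx, 0.0)
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (m * (m - 1)) + Kyy.sum() / (m * (m - 1)) - 2.0 * Kxy.mean()

def block_mmd_test(X, Y, B=32, alpha=0.05, gamma=1.0):
    """Linear-time block-MMD sketch: average the per-block unbiased MMD^2 and
    compare a studentized mean to a normal quantile (rough calibration)."""
    n = min(len(X), len(Y))
    vals = np.array([mmd2_unbiased(X[i:i + B], Y[i:i + B], gamma)
                     for i in range(0, n - B + 1, B)])
    assert len(vals) >= 2, "need at least two blocks"
    stat = vals.mean() / (vals.std(ddof=1) / np.sqrt(len(vals)))
    threshold = stats.norm.ppf(1 - alpha)
    return stat, threshold, stat > threshold
```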
learning to segment object candidates pedro ronan collobert piotr pedro locronan pdollar facebook ai research abstract recent object detection systems rely on two critical steps set of object proposals is predicted as efficiently as possible and this set of candidate proposals is then passed to an object classifier such approaches have been shown they can be fast while achieving the state of the art in detection performance in this paper we propose new way to generate object proposals introducing an approach based on discriminative convolutional network our model is trained jointly with two objectives given an image patch the first part of the system outputs segmentation mask while the second part of the system outputs the likelihood of the patch being centered on full object at test time the model is efficiently applied on the whole test image and generates set of segmentation masks each of them being assigned with corresponding object likelihood score we show that our model yields significant improvements over object proposal algorithms in particular compared to previous approaches our model obtains substantially higher object recall using fewer proposals we also show that our model is able to generalize to unseen categories it has not seen during training unlike all previous approaches for generating object masks we do not rely on edges superpixels or any other form of segmentation introduction object detection is one of the most foundational tasks in computer vision until recently the dominant paradigm in object detection was the sliding window framework classifier is applied at every object location and scale more recently girshick et al proposed approach first rich set of object proposals set of image regions which are likely to contain an object is generated using fast but possibly imprecise algorithm second convolutional neural network classifier is applied on each of the proposals this approach provides notable gain in object detection accuracy compared to classic sliding window approaches since then most object detectors for both the pascal voc and imagenet datasets rely on object proposals as first preprocessing step object proposal algorithms aim to find diverse regions in an image which are likely to contain objects for efficiency and detection performance reasons an ideal proposal method should possess three key characteristics high recall the proposed regions should contain the maximum number of possible objects ii the high recall should be achieved with the minimum number of regions possible and iii the proposed regions should match the objects as accurately as possible in this paper we present an object proposal algorithm based on convolutional networks convnets that satisfies these constraints better than existing approaches convnets are an important class of algorithms which have been shown to be state of the art in many large scale object recognition tasks they can be seen as hierarchy of trainable filters interleaved with pedro pinheiro is with the idiap research institute in martigny switzerland and ecole polytechnique de lausanne epfl in lausanne switzerland this work was done during an internship at fair and pooling convnets saw resurgence after krizhevsky et al demonstrated that they perform very well on the imagenet classification benchmark moreover these models learn sufficiently general image features which can be transferred to many different tasks given an input image patch our algorithm generates mask and an associated score which estimates the likelihood of 
the patch fully containing centered object without any notion of an object category the core of our model is convnet which jointly predicts the mask and the object score large part of the network is shared between those two tasks only the last few network layers are specialized for separately outputting mask and score prediction the model is trained by optimizing cost function that targets both tasks simultaneously we train on ms coco and evaluate the model on two object detection datasets pascal voc and ms coco by leveraging powerful convnet feature representations trained on imagenet and adapted on the large amount of segmented training data available in coco we are able to beat the state of the art in object proposals generation under multiple scenarios our most notable achievement is that our approach beats other methods by large margin while considering smaller number of proposals moreover we demonstrate the generalization capabilities of our model by testing it on object categories not seen during training finally unlike all previous approaches for generating segmentation proposals we do not rely on edges superpixels or any other form of segmentation our approach is the first to learn to generate segmentation proposals directly from raw image data the paper is organized as follows presents related work describes our architecture choices and describes our experiments in different datasets we conclude in related work in recent years convnets have been widely used in the context of object recognition notable systems are alexnet and more recently googlenet and vgg which perform exceptionally well on imagenet in the setting of object detection girshick et al proposed model that beats by large margin models relying on features their approach can be divided into two steps selection of set of salient object proposals followed by convnet classifier currently most object detection approaches rely on this pipeline although they are slightly different in the classification step they all share the first step which consist of choosing rich set of object proposals most object proposal approaches leverage grouping and saliency cues these approaches usually fall into three categories objectness scoring in which proposals are extracted by measuring the objectness score of bounding boxes seed segmentation where models start with multiple seed regions and generate separate segmentation for each seed and superpixel merging where multiple are merged according to various heuristics these models vary in terms of the type of proposal generated bounding boxes or segmentation masks and if the proposals are ranked or not for more complete survey of object proposal methods we recommend the recent survey from hosang et al although our model shares high level similarities with these approaches we generate set of ranked segmentation proposals these results are achieved quite differently all previous approaches for generating segmentation masks including which has learning component rely on segmentations such as superpixels or edges instead we propose discriminative approach based on architecture to obtain our segmentation proposals most closely related to our approach multibox proposed to train convnet model to generate bounding box object proposals their approach similar to ours generates set of ranked proposals however our model generates segmentation proposals instead of the less informative bounding box proposals moreover the model architectures training scheme are quite different between our approach and more 
recently deepbox proposed convnet model that learns to rerank proposals generated by edgebox method for bounding box proposals this system shares some similarities to our scoring network our model however is able to generate the proposals and rank them in one shot from the test image directly from the pixel space finally concurrently with this work ren et al proposed region proposal networks for generating box proposals that shares similarities with our work we emphasize however that unlike all these approaches our method generates segmentation masks instead of bounding boxes conv vgg pool fsegm fscore figure top model architecture the network is split into two branches after the shared feature extraction layers the top branch predicts segmentation mask for the the object located at the center while the bottom branch predicts an object score for the input patch bottom examples of training triplets input patch mask and label green patches contain objects that satisfy the specified constraints and therefore are assigned the label note that masks for negative examples shown in red are not used and are shown for illustrative purposes only deepmask proposals our object proposal method predicts segmentation mask given an input patch and assigns score corresponding to how likely the patch is to contain an object both mask and score predictions are achieved with single convolutional network convnets are flexible models which can be applied to various computer vision tasks and they alleviate the need for manually designed features their flexible nature allows us to design model in which the two tasks mask and score predictions can share most of the layers of the network only the last layers are see figure during training the two tasks are learned jointly compared to model which would have two distinct networks for the two tasks this architecture choice reduces the capacity of the model and increases the speed of full scene inference at test time each sample in the training set is triplet containing the rgb input patch xk the binary mask corresponding to the input patch mk with mij where corresponds to pixel location on the input patch and label yk which specifies whether the patch contains an object specifically patch xk is given label yk if it satisfies the following constraints the patch contains an object roughly centered in the input patch ii the object is fully contained in the patch and in given scale range otherwise yk even if an object is partially present the positional and scale tolerance used in our experiments are given shortly assuming yk the ground truth mask mk has positive values only for the pixels that are part of the single object located in the center of the patch if yk the mask is not used figure bottom shows examples of training triplets figure top illustrates an overall view of our model which we call deepmask the top branch is responsible for predicting high quality object segmentation mask and the bottom branch predicts the likelihood that an object is present and satisfies the above two constraints we next describe in detail each part of the architecture the training procedure and the fast inference procedure network architecture the parameters for the layers shared between the mask prediction and the object score prediction are initialized with network that was to perform classification on the imagenet dataset this model is then for generating object proposals during training we choose the vgga architecture which consists of eight convolutional layers followed by relu 
nonlinearities and five layers and has shown excellent performance as we are interested in inferring segmentation masks the spatial information provided in the convolutional feature maps is important we therefore remove all the final fully connected layers of the model additionally we also discard the last layer the output of the shared layers has downsampling factor of due to the remaining four layers given an input image of dimension the output is feature map of dimensions segmentation the branch of the network dedicated to segmentation is composed of single convolution layer and relu followed by classification layer the classification layer consists of pixel classifiers each responsible for indicating whether given pixel belongs to the object in the center of the patch note that each pixel classifier in the output plane must be able to utilize information contained in the entire feature map and thus have complete view of the object this is critical because unlike in semantic segmentation our network must output mask for single object even when multiple objects are present see the elephants in fig for the classification layer one could use either locally or fully connected pixel classifiers both options have drawbacks in the former each classifier has only partial view of the object while in the latter the classifiers have massive number of redundant parameters instead we opt to decompose the classification layer into two linear layers with no in between this can be viewed as variant of using fully connected linear classifiers such an approach massively reduces the number of network parameters while allowing each pixel classifier to leverage information from the entire feature map its effectiveness is shown in the experiments finally to further reduce model capacity we set the output of the classification layer to be ho with ho and wo and upsample the output to to match the input dimensions scoring the second branch of the network is dedicated to predicting if an image patch satisfies constraints and ii that is if an object is centered in the patch and at the appropriate scale it is composed of layer followed by two fully connected plus relu layers the final output is single objectness score indicating the presence of an object in the center of the input patch and at the appropriate scale joint learning given an input patch xk the model is trained to jointly infer segmentation mask and an object score the loss function is sum of binary logistic regression losses one for each location of the segmentation network and one for the object score over all training triplets xk mk yk ij fsegm xk log fscore xk log ho ij ij here is the set of parameters fsegm xk is the prediction of the segmentation network at location and fscore xk is the predicted object score we alternate between backpropagating through for the scoring branch the data is the segmentation branch and scoring branch and set sampled such that the model is trained with an equal number of positive and negative samples note that the factor multiplying the first term of equation implies that we only backpropagate the error over the segmentation branch if yk an alternative would be to train the segmentation branch using negatives as well setting mij for all pixels if yk however we found that training with positives only was critical for generalizing beyond the object categories seen during training and for achieving high object recall this way during inference the network attempts to generate segmentation mask at every patch even if no 
known object is present full scene inference during full image inference we apply the model densely at multiple locations and scales this is necessary so that for each object in the image we test at least one patch that fully contains the object roughly centered and at the appropriate scale satisfying the two assumptions made during training this procedure gives segmentation mask and object score at each image location figure illustrates the segmentation output when the model is applied densely to an image at single scale the full image inference procedure is efficient since all computations can be computed convolutionally the vgg features can be computed densely in fraction of second given typical input image for the segmentation branch the last fully connected layer can be computed via convolutions applied to the vgg features the scores are likewise computed by convolutions on the vgg features followed by two convolutional layers exact runtimes are given in figure output of segmentation model applied densely to full image with pixel stride at single scale at the central horizontal image region multiple locations give rise to good masks for each of the three monkeys scores not shown note that no monkeys appeared in our training set finally note that the scoring branch of the network has downsampling factor larger than the segmentation branch due to the additional layer given an input test image of size ht wt ht wt the segmentation and object network generate outputs of dimension and respectively in order to achieve mapping between the mask prediction and object score we apply the interleaving trick right before the last layer for the scoring branch to double its output resolution we use exactly the implementation described in implementation details during training an input patch xk is considered to contain canonical positive example if an object is precisely centered in the patch and has maximal dimension equal to exactly pixels however having some tolerance in the position of an object within patch is critical as during full image inference most objects will be observed slightly offset from their canonical position therefore during training we randomly jitter each canonical positive example to increase the robustness of our model specifically we consider translation shift of pixels scale deformation of and also horizontal flip in all cases we apply the same transformation to both the image patch xk and the ground truth mask mk and assign the example positive label yk negative examples yk are any patches at least pixels or in scale from any canonical positive example during full image inference we apply the model densely at multiple locations with stride of pixels and scales scales to with step of this ensures that there is at least one tested image patch that fully contains each object in the image within the tolerances used during training as in the original network our model is fed with rgb input patches of dimension since we removed the fifth pooling layer the common branch outputs feature map of dimensions the score branch of our network is composed of max pooling followed by two fully connected layers with and hidden units respectively both of these layers are followed by relu and dropout procedure with rate of final linear layer then generates the object score the segmentation branch begins with single convolutional layer with units this feature map is then fully connected to low dimensional output of size which is further fully connected to each pixel classifier to generate an 
output of dimension as discussed there is no nonlinearity between these two layers in total our model contains around parameters final bilinear upsampling layer is added to transform the output prediction to the full resolution of the directly predicting the full resolution output would have been much slower we opted for layer as we observed that trainable one simply learned to bilinearly upsample alternatively we tried downsampling the instead of upsampling the network output however we found that doing so slightly reduced accuracy design architecture and were chosen using subset of the ms coco validation data with the data we used for evaluation we considered learning rate of we trained our model using stochastic gradient descent with batch size of examples momentum of and weight decay of aside from the vgg features weights are initialized randomly from uniform distribution our model takes around days to train on nvidia tesla to binarize predicted masks we simply threshold the continuous output using threshold of for pascal and for coco all the experiments were conducted using http figure deepmask proposals with highest iou to the ground truth on selected images from coco missed objects no matching proposals with iou are marked with red outline experimental results in this section we evaluate the performance of our approach on the pascal voc test set and on the first images of the ms coco validation set our model is trained on the coco training set which contains about images and total of nearly segmented objects although our model is trained to generate segmentation proposals it can also be used to provide box proposals by taking the bounding boxes enclosing the segmentation masks figure shows examples of generated proposals with highest iou to the ground truth on coco metrics we measure accuracy using the common intersection over union iou metric iou is the intersection of candidate proposal and annotation divided by the area of their union this metric can be applied to both segmentation and box proposals following hosang et al we evaluate the performance of the proposal methods considering the average recall ar between iou and for fixed number of proposals ar has been shown to correlate extremely well with detector performance recall at single iou threshold is far less predictive methods we compare to the current proposal methods including edgeboxes selectivesearch geodesic rigor and mcg these methods achieve top results on object detection when coupled with and also obtain the best ar results figure compares the performance of our approach deepmask to existing proposal methods on pascal using boxes and coco using both boxes and segmentations shown is the ar of each method as function of the number of generated proposals under all scenarios deepmask and its variants achieves substantially better ar for all numbers of proposals considered ar at selected proposal counts and averaged across all counts auc is reported in tables and for coco and pascal respectively notably deepmask achieves an order of magnitude reduction in the number of proposals necessary to reach given ar under most scenarios for example with segmentation proposals deepmask achieves an ar of on coco while competing methods require nearly segmentation proposals to achieve similar ar deepmask mcg selectivesearch rigor geodesic edgeboxes proposals small objects area recall iou deepmask deepmaskzoom mcg selectivesearch rigor geodesic iou recall with proposals deepmask deepmaskzoom mcg selectivesearch rigor geodesic recall 
figure average recall versus number of box and segmentation proposals on various datasets ar versus number of proposals for different object scales on segmentation proposals in coco recall versus iou threshold for different number of segmentation proposals in coco

table results on the ms coco dataset for both bounding box and segmentation proposals we report ar at different number of proposals and and also auc ar averaged across all proposal counts for segmentation proposals we report overall auc and also auc at different scales objects indicated by superscripts see text for details

scale the coco dataset contains objects in wide range of scales in order to analyze performance in more detail we divided the objects in the validation set into roughly equally sized sets according to object pixel area small medium and large objects figure shows performance at each scale all models perform poorly on small objects to improve accuracy of deepmask we apply it at an additional smaller scale deepmaskzoom this boosts performance especially for small objects but at cost of increased inference time

table results on pascal voc test figure fast results on pascal

localization figure shows the recall each model achieves as the iou varies shown for different number of proposals per image deepmask achieves higher recall in virtually every scenario except at very high iou in which it falls slightly below other models this is likely due to the fact that our method outputs downsampled version of the mask at each location and scale multiscale approach or skip connections could improve localization at very high iou

generalization to see if our approach can generalize to unseen classes we train two additional versions of our model and is trained only with objects belonging to one of the pascal categories subset of the full coco categories is similar except we use the scoring network from the original deepmask results for the two models when evaluated on all coco categories as in all other experiments are shown in table compared to deepmask exhibits drop in ar but still outperforms all previous methods however matches the performance of deepmask this surprising result demonstrates that the drop in accuracy is due to the discriminatively trained scoring branch is inadvertently trained to assign low scores to the other categories the segmentation branch generalizes extremely well even when trained on reduced set of categories

architecture in the segmentation branch the convolutional features are fully connected to layer which is in turn connected to the output with no intermediate see we also experimented with architecture deepmaskfull with over
parameters where each of the outputs was directly connected to the convolutional features as can be seen in table deepmaskfull is slightly inferior to our final model and much slower detection as final validation we evaluate how deepmask performs when coupled with an object detector on pascal voc test we and evaluate the fast using proposals generated by selectivesearch and our method figure shows the mean average precision map for fast with varying number of proposals most notably with just deepmask proposals fast achieves map of and outperforms the best results obtained with selectivesearch proposals map of we emphasize that with fewer proposals deepmask outperforms selectivesearch this is consistent with the ar numbers in table with deepmask proposals fast improves to map after which performance begins to degrade similar effect was observed in speed inference takes an average of per image in the coco dataset on the smaller pascal images our runtime is competitive with the fastest segmentation proposal methods geodesic runs at per pascal image and substantially faster than most mcg takes inference time can further be dropped by by parallelizing all scales in single batch eliminating gpu overhead we do however require use of gpu for efficient inference conclusion in this paper we propose an innovative framework to generate segmentation object proposals directly from image pixels at test time the model is applied densely over the entire image at multiple scales and generates set of ranked segmentation proposals we show that learning features for object proposal generation is not only feasible but effective our approach surpasses the previous state of the art by large margin in both box and segmentation proposal generation in future work we plan on coupling our proposal method more closely with detection approaches acknowledgements we would like to thank ahmad humayun and lin for help with generating experimental results andrew tulloch omry yadan and alexey spiridonov for help with computational infrastructure and rob fergus yuandong tian and soumith chintala for valuable discussions references alexe deselaers and ferrari measuring the objectness of image windows pami chavali agrawal mahendru and batra evaluation protocol is gameable chen papandreou kokkinos murphy and yuille semantic image segmentation with deep convolutional nets and fully connected crfs iclr dalal and triggs histograms of oriented gradients for human detection in cvpr deng dong socher li li and imagenet hierarchical image database in cvpr erhan szegedy toshev and anguelov scalable object detection using deep neural networks in cvpr everingham gool williams winn and zisserman the pascal visual object classes voc challenge ijcv felzenszwalb girshick mcallester and ramanan object detection with discriminatively trained models pami girshick fast girshick donahue darrell and malik rich feature hierarchies for accurate object detection and semantic segmentation in cvpr hariharan girshick and malik hypercolumns for object segmentation and finegrained localization in cvpr he zhang ren and sun spatial pyramid pooling in deep convolutional networks for visual recognition in eccv hosang benenson and schiele what makes for effective detection proposals humayun li and rehg rigor reusing inference in graph cuts for generating object regions in cvpr kaiming xiangyu shaoqing and jian spatial pyramid pooling in deep convolutional networks for visual recognition in eccv and koltun geodesic object proposals in eccv and koltun learning to 
propose objects in cvpr krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips kuo hariharan and malik deepbox learning objectness with convolutional networks in lecun bottou bengio and haffner learning applied to document recognition proceedings of the ieee lin maire belongie bourdev girshick hays perona ramanan zitnick and microsoft coco common objects in context oquab bottou laptev and sivic is object localization for free learning with convolutional neural networks in cvpr pinheiro and collobert recurrent conv neural networks for scene labeling in icml barron marques and malik multiscale combinatorial grouping for image segmentation and object proposal generation in ren he girshick and sun faster towards object detection with region proposal networks in sermanet eigen zhang mathieu fergus and lecun overfeat integrated recognition localization and detection using convolutional networks in iclr simonyan and zisserman very deep convolutional networks for image recognition in iclr srivastava hinton krizhevsky sutskever and salakhutdinov dropout simple way to prevent neural networks from overfitting jmlr szegedy liu jia sermanet reed anguelov erhan vanhoucke and rabinovich going deeper with convolutions in cvpr szegedy reed erhan and anguelov scalable object detection in uijlings van de sande gevers and smeulders selective search for object recog ijcv viola and jones robust face detection ijcv zhu urtasun salakhutdinov and fidler segdeepm exploiting segmentation and context in deep neural networks for object detection in cvpr zitnick and edge boxes locating object proposals from edges in eccv 
gp kernels for analysis kyle ulrich david carlson kafui dzirasa lawrence carin department of electrical and computer engineering duke university department of psychiatry and behavioral sciences duke university department of statistics columbia university lcarin abstract gaussian processes provide convenient framework for problems an illustrative and motivating example of problem is electrophysiological data where experimentalists are interested in both power and phase coherence between channels recently wilson and adams proposed the spectral mixture sm kernel to model the spectral density of single task in gaussian process framework in this paper we develop novel covariance kernel for multiple outputs called the mixture csm kernel this new flexible kernel represents both the power and phase relationship between multiple observation channels we demonstrate the expressive capabilities of the csm kernel through implementation of bayesian hidden markov model where the emission distribution is gaussian process with csm covariance kernel results are presented for measured electrophysiological data introduction gaussian process gp models have become an important component of the machine learning literature they have provided basis for multivariate regression and classification tasks and have enjoyed much success in wide variety of applications gp places prior distribution over latent functions rather than model parameters in the sense that these functions are defined for any number of sample points and sample positions as well as any general functional form gps are nonparametric the properties of the latent functions are defined by positive definite covariance kernel that controls the covariance between the function at any two sample points recently the spectral mixture sm kernel was proposed by wilson and adams to model spectral density with mixture of gaussians this flexible and interpretable class of kernels is capable of recovering any composition of stationary kernels the sm kernel has been used for gp regression of scalar output single function or observation task achieving impressive results in extrapolating atmospheric concentrations image inpainting and feature extraction from electrophysiological signals however the sm kernel is not defined for multiple outputs multiple correlated functions multioutput gps intersect with the field of learning where solving similar problems jointly allows for the transfer of statistical strength between problems improving learning performance when compared to learning all tasks individually in this paper we consider neuroscience applications where hz extracellular potentials are simultaneously recorded from implanted electrodes in multiple brain regions of mouse these signals are known as local field potentials lfps and are often highly correlated between channels inferring and understanding that interdependence is biologically significant gp can be thought of as standard gp all observations are jointly normal where the covariance kernel is function of both the input space and the output space see and references therein for comprehensive review here input space means the points at which the functions are sampled time and the output space may correspond to different brain regions particular positive definite form of this covariance kernel is the sum of separable sos kernels or the linear model of coregionalization lmc in the geostatistics literature where separable kernel is represented by the product of separate kernels for the input and output spaces 
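Before the formal development of the SM and CSM kernels in the sections below, a small numerical sketch may help fix ideas. It implements the spectral Gaussian component k_SG(tau) = exp(-nu tau^2 / 2) cos(mu tau) and a rank-one cross-spectral-mixture-style cross-covariance in which each channel carries its own amplitude and phase shift per component. The parameter values, the rank-one restriction, and the sign convention for the phase shift are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def k_sg(tau, mu, nu):
    """Spectral Gaussian component: exp(-nu * tau^2 / 2) * cos(mu * tau)."""
    return np.exp(-0.5 * nu * tau**2) * np.cos(mu * tau)

def k_csm(tau, c, cp, mu, nu, a, phi):
    """Cross-covariance between channels c and cp at lag tau (rank-one sketch).
    mu, nu : length-Q arrays of peak angular frequencies and bandwidths
    a, phi : (C, Q) arrays of per-channel amplitudes and phase shifts
    k(c, cp, tau) = sum_q a[c,q] a[cp,q] exp(-nu[q] tau^2 / 2)
                    * cos(mu[q] tau + phi[cp,q] - phi[c,q])   (up to sign convention)."""
    k = 0.0
    for q in range(len(mu)):
        k += (a[c, q] * a[cp, q]
              * np.exp(-0.5 * nu[q] * tau**2)
              * np.cos(mu[q] * tau + phi[cp, q] - phi[c, q]))
    return k

# illustrative parameters: two channels, two spectral components
mu  = np.array([2 * np.pi * 5.0, 2 * np.pi * 9.0])   # peak angular frequencies
nu  = np.array([1.0, 2.0])                           # bandwidths
a   = np.array([[1.0, 0.8], [0.9, 1.1]])             # per-channel amplitudes
phi = np.array([[0.0, 0.0], [0.6, -0.3]])            # per-channel phase shifts

taus = np.linspace(-1.0, 1.0, 201)
cross_cov = np.array([k_csm(t, 0, 1, mu, nu, a, phi) for t in taus])
```

When the phase shifts are identical across channels, the cross term reduces to a weighted SG kernel, which is essentially the phase-blind behaviour the paper attributes to the SM-LMC construction; the extra per-channel phases are what let the CSM kernel encode cross-phase structure.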
while extending the sm kernel to the setting via the lmc framework the smlmc kernel provides powerful modeling framework the kernel does not intuitively represent the data specifically the kernel encodes the spectrum square root of the cross power spectral density between every pair of channels but provides no crossphase information together the and spectra form the defined as the fourier transform of the between the pair of channels motivated by the desire to encode the full into the covariance kernel we design novel kernel termed the mixture csm kernel which provides an intuitive representation of the power and phase dependencies between multiple outputs the need for embedding the full into the covariance kernel is illustrated by recent surge in neuroscience research discovering that lfp interdependencies between regions exhibit phase synchrony patterns that are dependent on frequency band the remainder of the paper is organized as follows section provides summary of gp regression models for data and section introduces the sm and novel csm covariance kernels in section the csm kernel is incorporated in bayesian hidden markov model hmm with gp emission distribution as demonstration of its utility in hierarchical modeling section provides details on inverting the bayesian hmm with variational inference as well as details on fast novel gp fitting process that approximates the csm kernel by its representation in the spectral domain section analyzes the performance of this approximation and presents results for the csm kernel in the neuroscience application considering measured lfp data from the brain of mouse we conclude in section by discussing how this novel kernel can trivially be extended to any application where gps and the are of interest review of gaussian process regression regression task estimates samples from output channels ync corresponding to the input point xn the temporal sample an unobserved latent function fc is responsible for generating the observations such that xn where diag ηc is the precision of additive gaussian noise gp prior on the latent function is formalized by gp for arbitrary input where the mean function rc is set to equal without loss of generality and the covariance function cov fc creates dependencies between observations at input points and as observed on channels and in general the input space could be vector valued but for simplicity we here assume it to be scalar consistent with our motivating neuroscience application in which corresponds to time convenient representation for kernel functions is to separate the kernel into the product of kernel for the input space and kernel for the interactions between the outputs this is known as separable kernel sum of separable kernels sos representation is given by bq kq or kq where kq is the input space kernel for component bq is the output interaction kernel and is positive output kernel matrix note that we have discrete set of output spaces where the input space is continuous and discretely sampled arbitrarily in experiments the sos formulation is also known as the linear model of coregionalization lmc and is termed the coregionalization matrix when the lmc reduces to the intrinsic coregionalization model icm and when rank is restricted to equal the lmc reduces to the semiparametric latent factor model slfm any finite number of latent functional evaluations fc at locations xn has multivariate normal distribution such that is formed through the block partitioning kq where each is an matrix and symbolizes the 
kronecker product dataset consists of observations vec rcn at the respective locations xn such that the first elements of are from channel up to the last elements belonging to channel since both the likelihood and distribution over latent functions are gaussian the marginal likelihood is conveniently represented by df where all possible functions have been marginalized out each covariance kernel is defined by set of hyperparameters this conditioning was removed for notational simplicity but will henceforth be included in the notation for example if the squared exponential kernel is used then kse exp defined by single hyperparameter to fit gp to the dataset the hyperparameters are typically chosen to maximize the marginal likelihood in via gradient ascent expressive kernels in the spectral domain this section first introduces the spectral mixture sm kernel as well as extension of the sm kernel within the lmc framework while the model is capable of representing complex spectral relationships between channels it does not intuitively model the spectrum between channels we propose novel kernel known as the mixture csm kernel that provides both the and spectra of observations detailed derivations of each of these kernels is found in the supplemental material the spectral mixture kernel spectral gaussian sg kernel is defined by an amplitude spectrum with single gaussian distribution reflected about the origin ssg where are the kernel hyperparameters represents the peak frequency and the variance is scale parameter that controls the spread of the spectrum around this spectrum is function of angular frequency the fourier transform of results in the stationary positive definite autocovariance function ksg exp ντ cos µτ where stationarity implies dependence on input domain differences with the sg kernel may also be derived by considering latent signal cos with frequency uncertainty and phase offset ωφ the kernel is the function for such that ksg cov when computing the the frequency is marginalized out providing the kernel in that includes all frequencies in the spectral domain with probability weighted linear combination of sg kernels gives the spectral mixture sm kernel ksm aq ksg ssm aq ssg where aq νq µq and has degrees of freedom the sm kernel may be derived as the fourier transform of the spectral density ssm or as the of latent funcpq tions cos ωq φq with uncertainty in angular frequency ωq µq νq amplitude time phase time frequency figure latent functions drawn for two channels blue and red using the csm kernel left and kernel center the functions are comprised of two sg components centered at and hz for the csm kernel we set the phase shift right the purple and green spectra between and are shown for the csm kernel solid and kernel dashed the ability to tune phase relationships is beneficial for kernel design and interpretation the moniker for the sm kernel in reflects the mixture of gaussian components that define the spectral density of the kernel the sm kernel is able to represent any stationary covariance kernel given large enough to name few this includes any combination of squared exponential rational quadratic or periodic kernels the mixture kernel version of the sm kernel uses the sg kernel directly within the lmc framework ksg where sg kernels are shared among the outputs via the coregionalization matrices generalized version of this kernel was proposed in using the gaussian process regression network gprn the marginal distribution for any single channel is simply gaussian process 
with sm covariance kernel while this formulation is capable of providing full spectrum between two channels it contains no information about crossp phase spectrum specifically each channel is merely weighted sum of rq latent functions where rq rank whereas these functions are shared exactly across channels our novel csm kernel shares versions of these latent functions across channels definition the mixture csm kernel takes the form rq arcq exp νq cos µq φrcq pq rq has rq degrees of freedom where νq µq arq φrq and arcq and φrcq respectively represent the amplitude and shift in the input space for latent functions associated with channel in the lmc framework the csm kernel is rq csm re bq ksg bq kcsm ksg exp νq jµq βcq arcq exp where ksg is phasor notation of the sg kernel is βcq are complex scalar coefficients encoding amplitude and phase and ψcq µq φcq is an alternative phase representation we use complex notation where re returns the real component of its argument and represents the complex conjugate of both the csm and kernels force the marginal distribution of data from single channel to be gaussian process with sm covariance kernel the csm kernel is derived in the supplemental material by considering functions represented by sinusoidal signals pq prq iid fc cos ωqr φrcq where each ωqr µq νq computing the function cov fc provides the csm kernel comparison between draws from gaussian processes with csm and kernels is shown in figure the utility of the csm kernel is clearly illustrated by its ability to encode phase information as well as its powerful functional form of the full both amplitude are obtained by repand phase the amplitude function ac and phase function φc cp resenting the in phasor notation γc ssg ac exp jφc interestingly while the csm and kernels have identical marginal amplitude spectra for shared µq νq aq their spectra differ due to the inherent destructive interference of the csm kernel see figure right hmm analysis neuroscientists are interested in examining how the network structure of the brain changes as animals undergo task or various levels of arousal the lfp signal is modality that allows researchers to explore this network structure in the model provided in this section we cluster segments of the lfp signal into discrete brain states each brain state is represented by unique provided by the csm kernel the use of the full to define brain states is supported by previous work discovering that the power spectral density of lfp signals indicate various levels of arousal states in mice and phase synchrony patterns change as animals undergo different conditions in task see figure the observations from channels are segmented into contiguous windows the windows are common across channels such that the data for window are represented by ync at sample location xn given data each window consists of nw temporal samples but the model is defined for any set of sample locations we model the observations as emissions from hidden markov model hmm with hidden discrete states state assignments are represented by latent variables ζw for each window in general is set upper bound of the number of states brain states or clusters but the model can shrink down and infer the number of states needed to fit the data this is achieved by defining the dynamics of the latent states according to bayesian hmm categorical ζw categorical dirichlet where the initial state assignment is drawn from categorical distribution with probability vector and all subsequent states assignments are drawn from the 
transition vector here is the probability of transitioning from state to state the vectors ρl are independently drawn from symmetric dirichlet distributions centered around to impose sparsity on transition probabilities in effect this allows the model to learn the number of states needed for the data fewer than each cluster is assigned gp parameters the latent cluster assignment ζw for window indicates which set of gp parameters control the emission distribution of the hmm yw xn ζw gp ζw where kcsm is the csm kernel and the precision ζw diag ζw generates independent gaussian observation noise in this way each window is modeled as stochastic process with defined by ζw raw lfp data spectrum spectrum bla il cortex delta waves theta waves alpha waves beta waves lag rad potential amplitude time sec frequency hz frequency hz figure short segment of lfp data recorded from the basolateral amygdala and infralimbic cortex is shown on the left the and phase spectra are produced using welch averaged periodogram method for several consecutive second windows of lfp data frequency dependent phase synchrony lags are consistently present in the spectrum motivating the csm kernel this frequency dependency aligns with preconcieved notions of bands or brain waves hz alpha waves inference convenient notation vectorizes all observations within window vec nw where vec is the vectorization of matrix the first nw elements of are observations from channel up to the last nw elements of belonging to channel because samples are obtained on an evenly spaced temporal grid we fix nw and align relative sample locations within window to an oracle xw xn for all the model in section generates the set of observations at aligned sample locations given kernel hyperparameters and model variables ζw the latent variables are inverted using variational inference obtaining an approxiql mate posterior distribution dir the approximate posterior is chosen to minimize the kl divergence to the true posterior distribution using the standard variational em method detailed in chapter of during each iteration of the variational em algorithm the kernel hyperparameters are chosen to pw pl maximize the expected marginal ζw log via gradient ascent where ζw is the marginal posterior probability that window is assigned is the csm kernel matrix for state with the complex form to brain state and re ksg performing gradient ascent requires the derivatives tr where implementation of this gradient requires the inversion of which has complexity and storage requirements since simple method to invert sum of kronecker products does not exist common trick for gps with evenly spaced samples temporal grid is to use the discrete fourier transform dft to approximate the inverse of by viewing this as an approximately circulant matrix these methods can speed up inference because circulant matrices are diagonalizable by the dft coefficient matrix adjusting these methods to the formulation we show how the dft of the marginal covariance matrices retains the information proposition let γζw represent the marginal likelihood of observations in window and denote the concatenation of the dft of each channel as where is the unitary dft matrix then is shown in the supplemental material to have the complex normal distribution cn ζw in where xi for all and diag ssg is approximately diagonal the spectral density ssg ssg ssg ωb is found via at angular frequencies and is row vector of zeros the hyperparameters of the csm kernels may now be optimized from the expected marginal of 
instead of conceptually the only difference during the fitting process is that with the latter derivatives of the covariance kernel are used while with the former derivatives of the power spectral density are used computationally this method improves the complexity of fitting the standard csm kernel to complexity memory requirements are also reduced from to the reason for this improvement is that is now represented as independent blocks reducing the inversion of to inverting permuted matrix experiments section demonstrates the performance of the csm kernel and the accuracy of the dft approximation in section the dft approximation for the csm kernel is used in bayesian hmm framework to cluster lfp data based on the full the hmm states here correspond to states of the brain during lfp recording table the mean and standard deviation of the difference between the aic value of given model and the aic value of the csm model lower values are better kl divergence hz hz hz series length seconds figure data is drawn from gaussian process with known csm covariance kernel where the domain restricted to fixed number of seconds gaussian process is then fitted to this data using the dft approximation the of the fitted marginal likelihood from the true marginal likelihood is shown rank model aic csm csm csm performance and inference analysis the performance of the csm kernel is compared to the kernel and squared exponential kernel each of these models allow and the rank of the coregionalization matrices is varied from to for given rank the csm kernel always obtains the largest marginal likelihood for window of lfp data and the marginal likelihood always increases for increasing rank to penalize the number of kernel parameters csm kernel for channels has free parameters to optimize the akaike information criterion aic is used for model selection for this reason we do not test rank greater than table shows that csm kernel is selected using this criterion followed by csm kernel to show the csm kernel is consistently selected as the preferred model we report means and standard deviations of aic value differences across different randomly selected windows of lfp data next we provide numerical results for the conditions required when using the dft approximation in this allows for one to define details of particular application in order to determine if the dft approximation to the csm kernel is appropriate csm kernel is defined for two outputs with single gaussian component the mean frequency and variance for this component are set to push the limits of the application for example with lfp data low frequency content is of interest namely greater than hz therefore we test values of hz we anticipate variances at these frequencies to be around hz conversion to angular frequency gives and the covariance matrix in is formed using these parameters fixed noise variance and observations on time grid with sampling rate of hz data are drawn from the marginal likelihood with covariance new csm kernel is fit to using the dft approximation providing an estimate the kl divergence of the fitted marginal likelihood from the true marginal likelihood is kl log tr where and tr are the determinant and trace operators respectively computing kl for various values of and provides the results in figure this plot shows that the dft approximation struggles to resolve low frequency components unless the series length is sufficiently long due to the approximation error when using the dft approximation on lfp data we priori filter out 
frequencies below hz and perform analyses with series length of seconds this ensures the dft approximation represents the true covariance matrix the following application of the csm kernel uses these settings including the csm kernel in bayesian hierarchical model we analyze hours of lfp data of mouse transitioning between different stages of sleep observations were recorded simultaneously from channels filtered at hz and subsampled to hz using second windows provides and the hmm was implemented with the number of kernel components and the number of states amplitude phase amplitude basalamy dls phase amplitude phase phase dhipp phase amplitude frequency hz dms amplitude frequency frequency state state state state state state state dzirasa et al csm kernel frequency hz frequency frequency minutes figure subset of results from the bayesian hmm analysis of brain states in the upper left the full crossspectrum for an arbitrary state state is plotted in the upper right the amplitude top and phase bottom functions for the between the dorsomedial striatum dms and hippocampus dhipp are shown for all seven states on the bottom the maximum likelihood state assignments are shown and compared to the state assignments from the same colors between the csm state assignments and the phase and amplitude functions correspond to the same state these colors are alligned to the states but there is no explicit relationship between the colors of the two state sequences this was chosen because sleep staging tasks categorize as many as seven states various levels of rapid eye movement slow wave sleep and wake although rigorous model selection on is necessary to draw scientific conclusions from the results the purpose of this experiment is to illustrate the utility of the csm kernel in this application an illustrative subset of the results are shown in figure the full is shown for single state state and the between the dorsomedial striatum and the dorsal hippocampus are shown for all states furthermore we show the progression of these brain state assignments over hours and compare them to states from the method of where statistics of the hippocampus spectral density were clustered in an ad hoc fashion to the best of our knowledge this method represents the most relevant and accurate results for sleep staging from lfp signals in the neuroscience literature from these results it is apparent that our clusters pick up of for instance states and all appear with high probability when the method from infers state observing the function of reveals striking differences from other states in the theta wave hz and the alpha wave hz this function is nearly identical for states and implying that significant differences in the spectrum may have played role in identifying the difference between these two brain states many more of these interesting details exist due to the expressive nature of the csm kernel as full interpretation of the results is not the focus of this work we contend that the csm kernel has the potential to have tremendous impact in fields such as neuroscience where the dynamics of relationships of lfp signals are of great interest conclusion this work introduces the mixture kernel as an expressive kernel capable of extracting patterns for observations combined with the powerful nonparametric representation of gaussian process the csm kernel expresses functional form for every pairwise between channels this is novel approach that merges gaussian processes in the machine learning community to standard signal 
processing techniques we believe the csm kernel has the potential to impact broad array of disciplines since the kernel can trivially be extended to any application where gaussian processes and the are of interest acknowledgments the research reported here was funded in part by aro darpa doe nga and onr references akaike new look at the statistical model identification ieee transactions on automatic control alvarez rosasco and lawrence kernels for functions review foundations and trends in machine learning beal variational algorithms for approximate bayesian inference phd thesis university college london caruana multitask learning machine learning dietrich and newsam fast and exact simulation of stationary gaussian processes through circulant embedding of the covariance matrix siam journal on scientific computing dzirasa fuentes kumar potes and nicolelis chronic in vivo neurophysiological recordings in mice journal of neuroscience methods dzirasa ribeiro costa santos lin grosmark sotnikova gainetdinov caron and nicolelis dopaminergic control of states the journal of neuroscience gallager principles of digital communication pages and alpaydn multiple kernel learning algorithms jmlr goovaerts geostatistics for natural resources evaluation oxford university press gregoriou gotts zhou and desimone coupling between prefrontal and visual cortex during attention science candela rasmussen and sparse spectrum gaussian process regression jmlr lloyd duvenaud grosse tenenbaum and ghahramani automatic construction and description of nonparametric regression models aaai mackay ensemble learning for hidden markov models technical report pfaff ribeiro matthews and kow concepts and mechanisms of generalized central nervous system arousal anyas rasmussen and williams gaussian processes for machine learning sauseng and klimesch what does phase information of oscillatory brain activity tell us about cognitive processes neuroscience and biobehavioral reviews zaehle voges schmitt buentjen kopitzki esslinger hinrichs heinze knight and corticothalamic phase synchrony and coupling predict human memory formation elife teh seeger and jordan semiparametric latent factor models aistats tucker hirota wamsley lau chaklader and fishbein daytime nap containing solely sleep enhances declarative but not procedural memory neurobiology of learning and memory ulrich carlson lian borg dzirasa and carin analysis of brain states from lfp nips welch the use of fast fourier transform for the estimation of power spectra method based on time averaging over short modified periodograms ieee transactions on audio and electroacoustics wilson covariance kernels for fast automatic pattern discovery and extrapolation with gaussian processes phd thesis university of cambridge wilson and adams gaussian process kernels for pattern discovery and extrapolation icml wilson gilboa nehorai and cunningham fast kernel learning for multidimensional pattern extrapolation nips wilson and knowles gaussian process regression networks icml yang smola song and wilson la carte learning fast kernels aistats 
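As a practical companion to the cross-spectrum quantities used throughout this paper (the amplitude and phase spectra between channel pairs, estimated in the LFP figure above with Welch's averaged periodogram), here is a minimal sketch using scipy. The sampling rate, window length, and the synthetic two-channel signal are illustrative assumptions.

```python
import numpy as np
from scipy.signal import csd

fs = 200.0                        # assumed sampling rate (Hz)
t = np.arange(0, 5.0, 1.0 / fs)   # five seconds of data

# two synthetic channels sharing a 9 Hz rhythm with a fixed phase lag
rng = np.random.default_rng(0)
x = np.cos(2 * np.pi * 9 * t) + 0.5 * rng.standard_normal(t.size)
y = np.cos(2 * np.pi * 9 * t - 0.8) + 0.5 * rng.standard_normal(t.size)

# complex cross power spectral density via Welch's averaged periodogram
f, pxy = csd(x, y, fs=fs, nperseg=256)

amplitude = np.abs(pxy)     # cross-amplitude spectrum
phase = np.angle(pxy)       # cross-phase spectrum (radians)
```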
secure differential privacy peter sewoong pramod department of electrical computer engineering department of industrial enterprise systems engineering university of illinois urbana il usa swoh pramodv abstract we study the problem of interactive function computation by multiple parties each possessing bit in differential privacy setting there remains an uncertainty in any party bit even when given the transcript of interactions and all the other parties bits each party wants to compute function which could differ from party to party and there could be central observer interested in computing separate function performance at each party is measured via the accuracy of the function to be computed we allow for an arbitrary cost metric to measure the distortion between the true and the computed function values our main result is the optimality of simple protocol each party randomizes its bit sufficiently and shares the privatized version with the other parties this optimality result is very general it holds for all types of functions heterogeneous privacy conditions on the parties all types of cost metrics and both average and over the inputs measures of accuracy introduction computation mpc is generic framework where multiple parties share their information in an interactive fashion towards the goal of computing some functions potentially different at each of the parties in many situations of common interest the key challenge is in computing the functions as privately as possible without revealing much about one information to the other potentially colluding parties for instance an interactive voting system aims to compute the majority of say binary opinions of each of the parties with each party being averse to declaring their opinion publicly another example involves banks sharing financial risk exposures the banks need to agree on quantities such as the overnight lending rate which depends on each bank exposure which is quantity the banks are naturally loath to truthfully disclose central learning theory question involves characterizing the fundamental limits of interactive information exchange such that strong and suitably defined adversary only learns as little as possible while still ensuring that the desired functions can be computed as accurately as possible one way to formulate the privacy requirement is to ensure that each party learns nothing more about the others information than can be learned from the output of the function computed this topic is studied under the rubric of secure function evaluation sfe the sfe formulation has been extensively studied with the goal of characterizing which functions can be securely evaluated one drawback of sfe is that depending on what auxiliary information the adversary might have disclosing the exact function output might reveal each party data for example consider computing the average of the data owned by all the parties even if we use sfe party data can be recovered if all the other parties collaborate to ensure protection of the private data under such strong adversary we want to impose stronger privacy guarantee of differential privacy recent breaches of sensitive information about individuals due to linkage attacks prove the vulnerability of existing privatization schemes such as anonymization of the records in linkage attacks an adversary matches up anonymized records containing sensitive information with public records in different dataset such attacks have revealed the medical record of former governor of massachusetts the purchase 
history of amazon users genomic information and movie viewing history of netflix users an alternative formulation is differential privacy relatively recent formulation that has received considerable attention as formal mathematical notion of privacy that provides protection against such strong adversaries recent survey is available at the basic idea is to introduce enough randomness in the communication so that an adversary possessing arbitrary side information and access to the entire transcript of the communication will still have some residual uncertainty in identifying any of the bits of the parties this privacy requirement is strong enough that functions will be computed only with some error thus there is great need for understanding the fundamental tradeoff between privacy and accuracy and for designing privatization mechanisms and communication protocols that achieve the optimal tradeoffs the formulation and study of an optimal framework addressing this tradeoff is the focus of this paper we study the following problem of computation under differential privacy each party possesses single bit of information and the information bits are statistically independent each party is interested in computing function which could differ from party to party and there could be central observer observing the entire transcript of the interactive communication protocol interested in computing separate function performance at each party and the central observer is measured via the accuracy of the function to be computed we allow an arbitrary cost metric to measure the distortion between the true and the computed function values each party imposes differential privacy constraint on its information bit the privacy level could be different from party to party there remains an uncertainty in any specific party bit even to an adversary that has access to the transcript of interactions and all the other parties bits the interactive communication is achieved via broadcast channel that all parties and the central observer can hear this modeling is without loss of generality since the differential privacy constraint protects against an adversary that can listen to the entire transcript the communication between any two parties might as well be revealed to all the others it is useful to distinguish between two types of communication protocols interactive and we say communication protocol is if message broadcasted by one party does not depend on the messages broadcasted by other parties in contrast interactive protocols allow the messages at any stage of the communication to depend on all the previous messages our main result is the exact optimality of simple protocol in terms of maximizing accuracy for any given privacy level when each party possesses one bit each party randomizes sufficiently and publishes its own bit in other words randomized response is exactly optimal each party and the central observer then separately compute their respective decision functions to maximize the appropriate notion of their accuracy measure this optimality result is very general it holds for all types of functions heterogeneous privacy conditions on the parties all types of cost metrics and both average and over the inputs measures of accuracy finally the optimality result is simultaneous in terms of maximizing accuracy at each of the parties and the central observer each party only needs to know its own desired level of privacy its own function to be computed and its measure of accuracy optimal data release and optimal 
decision making is naturally separated the key technical result is geometric understanding of the space of conditional probabilities of given transcript the interactive nature of the communication constrains the space to be tensor special case of equation in and perhaps implicitly used in the analog of this result is in while differential privacy imposes linear constraints on the singular vectors of this tensor we characterize the convex hull of such manifolds of tensors and show that their exactly correspond to the transcripts that arise from randomized response protocol this universal for all functionalities characterization is then used to argue that both and accuracies are maximized by randomized responses technically we prove that randomized response is the optimal solution of the rankconstrained and optimization of the rank constraints on higher order tensors arises from the necessary condition of possibly interactive protocols known as protocol compatibility see section for details to solve this optimization we transform into novel linear program of and the price we pay is the increased dimension the resulting lp is now infinite dimensional the idea is that we introduce new variable for each possible tensor and optimize over all of them formulating utility maximization under differential privacy as linear programs has been previously studied in under the standard model where there is single data publisher and single data analyst these approaches exploit the fact that both the differential privacy constraints and the utilities are linear in the matrix representing privatization mechanism similar technique of transforming optimization problem into an infinite dimensional lp has been successfully applied in where optimal privatization mechanisms under local differential privacy has been studied we generalize these techniques to optimizations further perhaps surprisingly we prove that this infinite dimensional linear program has simple optimal solution which we call randomized response upon receiving the randomized responses each party can compute the best approximation of its respective function the main technical innovation is in proving that the optimal solution of this lp corresponds to corner points of convex hull of particular manifold defined by tensor see lemma in the supplementary material for details and the respective manifold has simple structure such that the corner points correspond to particular protocols that we call randomized responses when the accuracy is measured via average accuracy both the objective and the constraints are linear and it is natural to expect the optimal solution to be at the corner points see equation surprising aspect of our main result is that the optimal solution is still at the corner points even though the accuracy is concave function over the protocol see equation this work focuses on the scenario where each party possesses single bit of information with multiple bits of information at each of the parties the existence of differentially private protocol with fixed accuracy for any functionality implies the existence of protocol with the same level of privacy and same level of accuracy for specific functionality that only depends on one bit of each of the parties as in thus if we can obtain lower bounds on accuracy for functionalities involving only single bit at each of the parties we obtain lower bounds on accuracy for all general functionalities however communication is unlikely to be exactly optimal in this general case where each party 
possesses multiple bits of information and we provide further discussion in section we move detailed discussion of related work section to the supplementary material focusing on the problem formulation next problem formulation consider the setting where we have parties each with its own private binary data xi generated independently the independence assumption here is necessary because without it each party can learn something about others which violates differential privacy even without revealing any information we discuss possible extensions to correlated sources in section differential privacy implicitly imposes independence in setting the goal of the private computation is for each party to compute an arbitrary function fi of interest by interactively broadcasting messages while preserving the privacy of each party there might be central observer who listens to all the messages being broadcasted and wants to compute another arbitrary function the parties are honest in the sense that once they agree on what protocol to follow every party follows the rules at the same time they can be curious and each party needs to ensure other parties can not learn his bit with sufficient confidence the privacy constraints here are similar to the local differential privacy setting studied in in the sense that there are multiple privacy barriers each one separating each individual party and the rest of the world however the main difference is that we consider computation where there are multiple functions to be computed and each node might possess different function to be computed let xk denote the vector of bits and xk is the vector of bits except for the bit the parties agree on an interactive protocol to achieve the goal of computation transcript is the output of the protocol and is random instance of all broadcasted messages until all the communication terminates the probability that transcript is broadcasted via series of interactive communications when the data is is denoted by px for and for then protocol can be represented as matrix denoting the probability distribution over set of transcripts conditioned on px in the end each party makes decision on what the value of function fi is based on its own bit xi and the transcript that was broadcasted decision rule is mapping from transcript and private bit xi to decision represented by function fˆi xi we allow randomized decision rules in which case fˆi xi can be random variable for the central observer decision rule is function of just the transcript denoted by function we consider two notions of accuracy the average accuracy and the accuracy for the party consider an accuracy measure wi or equivalently negative cost function such that wi fi fˆi xi measures the accuracy when the function to be computed is fi and the approximation is fˆi xi then the average accuracy for this party is defined as accave wi fi fˆi efˆi px wi fi fˆi xi where the expectation is taken over the random transcript distribution as and also any randomness in the decision function fˆi we want to emphasize that the input is deterministic we impose no distribution on the input data and the expectation is not over the data sets compared to assuming distribution over the data this is weaker assumption on the data and hence makes our main result stronger for example if the accuracy measure is an indicator such that wi then accave measures the average probability of getting the correct function output for given protocol it takes operations to compute the optimal decision rule fi ave 
xi arg max px wi fi for each the computational cost of for computing the optimal decision rule is unavoidable in general since that is the inherent complexity of the problem describing the distribution of the transcript requires the same cost we will show that the optimal protocol requires set of transcripts of size and the computational complexity of the decision rule for general function is however for fixed protocol this decision rule needs to be computed only once before any message is transmitted further it is also possible to find closed form solution for the decision rule when has simple structure one example is the xor function studied in detail in section where the optimal decision rule is as simple as evaluating the xor of all the received bits which requires operations when there are multiple maximizers we can choose arbitrarily and it follows that there is no gain in randomizing the decision rule for average accuracy similarly the accuracy is defined as accwc wi fi fˆi min efˆi px wi fi fˆi xi for accuracy given protocol the optimal decision rule of the party with bit xi can be computed by solving the following convex program xx xi arg max min px wi fi qτ subject to qτ and the optimal random decision rule fi wc xi is to output given transcript according to xi qτ yi this can be formulated as linear program with variables and constraints again it is possible to find closed form solution for the decision rule when has simple structure for the xor function the optimal decision rule is again evaluating the xor of all the received bits requiring operations for central observer the accuracy measures are defined similarly and the optimal decision rule is now ave arg max px and for accuracy the optimal random decision rule wc is to output given tranp script according to qτ subject to qτ and xx arg max min px qτ where is the measure of accuracy for the central observer privacy is measured by differential privacy since we allow heterogeneous privacy constraints we use εi to denote the desired privacy level of the party we say protocol is εi private for the party if for and all xi and eεi this condition ensures no adversary can infer the private data xi with high enough confidence no matter what auxiliary information he might have and independent of his computational power to lighten notations we let λi eεi and say protocol is λi private for the party if the protocol is λi private for all then we say that the protocol is λi differentially private for all parties necessary condition on the protocols when the bits are generated independent of each other is protocol compatibility conditioned on the transcript of the protocol the input bits stay independent of each other in our setting input bits are deterministic hence independent mathematically protocol is protocol compatible if each column is tensor when reshaped into order tensor where px px precisely there exist vectors such that where denotes the standard ik uik this is crucial in deriving the main results and it is fact in the secure computation literature this follows from the fact that when the bits are generated independently all the bits are still independent conditioned on the transcript xi which follows implicitly from and directly from equation of notice that using the tensor representation of each column of the protocol we have it follows that is λi private if and only if λi randomized response consider the following simple protocol known as the randomized response which is term first coined by and commonly used in many private 
communications including the setting we will show in section that this is the optimal protocol for simultaneously maximizing the accuracy of all the parties each party broadcasts randomized version of its bit denoted by such that λi xi with probability with probability where is the logical complement of xi each transcript can be represented by the output of the protocol which in this case is where is now the set of all broadcasted bits accuracy maximization consider the problem of maximizing the average accuracy for centralized observer with function up to the scaling of in the accuracy can be written as xx ep px wx qτ where denotes the randomized decision up on receiving the transcript in the following we define wx to represent the accuracy measure and qτ fˆ to represent the decision rule focusing on this single central observer for the purpose of illustration we want to design protocols px and decision rules qτ that maximize the above accuracy further this protocol has to be compatible with interactive communication satisfying the rank one condition discussed above and satisfy the differential privacy condition in hence we can formulate the accuracy maximization can be formulated as follows given wx in terms of the function to be computed an accuracy measure and required privacy level λi we solve maximize wx px qτ subject to and are matrices rank xi λi where is defined as order tensor defined from the column of matrix as defined in equation notice that the rank constraint is only necessary condition for protocol to be compatible with interactive communication schemes valid interactive communication protocol implies the condition but not all protocols are valid interactive communication schemes therefore the above is relaxation with larger feasible set of protocols but it turns out that the optimal solution of the above optimization problem is the randomized response which is valid communication protocol hence there is no loss in solving the above relaxation the main challenge in solving this optimization is that it is tensor optimization which is notoriously difficult since the rank constraint is over order tensor array with possibly common approaches of convex relaxation from for matrices which are order tensors does not apply further we want to simultaneously apply similar optimizations to all the parties with different functions to be computed we introduce novel transformation of the above optimization into linear program in and the price we pay is in the increased dimensionality the lp has an infinite dimensional decision variable however combined with the geometric understanding of the the manifold of tensors we can identify the exact optimal solution we show in the next section that given desired level of privacy λi there is single universal protocol that simultaneously maximizes the accuracy for all parties any functions of interest any accuracy measures and both and average case accuracy together with optimal decision rules performed at each of the receiving ends this gives the exact optimal computation scheme main result we show perhaps surprisingly that the simple randomized response presented in is the optimal protocol in very general sense for any desired privacy level λi and arbitrary function fi for any accuracy measure wi and any notion of accuracy either average or worst case we show that the randomized response is universally optimal the proof of the following theorem can be found in the supplementary material theorem let the optimal decision rule be defined as in for 
the average accuracy and for the accuracy then for any λi any function fi and any accuracy measure wi for the randomized response for given λi with the optimal decision function achieves the maximum accuracy for the party among all λi private interactive protocols and all decision rules for the central observer the randomized response with the optimal decision rule defined in and achieves the maximum accuracy among all λi differentially private interactive protocols and all decision rules for any arbitrary function and any measure of accuracy this is strong universal optimality result every party and the central observer can simultaneously achieve the optimal accuracy using universal randomized response each party only needs to know its own desired level of privacy its own function to be computed and its measure of accuracy optimal data release and optimal decision making are naturally separated however it is not immediate at all that scheme such as the randomized response would achieve the maximum accuracy the fact that interaction does not help is and might as well be true only for the binary data scenario we consider in this paper the key technical innovation is the convex geometry in the proof which does not generalize to larger alphabet case once we know interaction does not help we can make an educated guess that the randomized response should dominate over other schemes this intuition follows from the dominance of randomized response in the setting that was proved using powerful operational interpretation of differential privacy first introduced in this intuition can in fact be made rigorous as we show in section of our supplemental material with simple example however we want to emphasize that our main result for computation does not follow from any existing analysis of randomized responses in particular those seemingly similar analyses in the challenge is in proving that interaction does not help which requires the technological innovations presented in this paper xor computation for given function and given accuracy measure analyzing the performance of the optimal protocol provides the exact nature of the tradeoff consider scenario where central observer wants to compute the xor of all the each of which is private in this special case we can apply our main theorem to analyze the accuracy exactly in combinatorial form and we provide proof in section corollary consider computation for xk and the accuracy measure is one if correct and zero if not and for any private protocol and any decision rule fˆ the average and accuracies are bounded by accave accwc and the equality is achieved by the randomized response and optimal decision rules in and the optimal decision for both accuracies is simply to output the xor of the received privatized bits this is strict generalization of similar result in where xor computation was studied but only for setting in the high privacy regime where equivalently eε this implies that accave εk the leading term is due to the fact that we are considering an accuracy measure of boolean function the second term of εk captures the effect that we are essentially observing the xor through consecutive binary symmetric channels with flipping probability hence the accuracy gets exponentially worse in on the other hand if those are allowed to collaborate then they can compute the xor in advance and only transmit the privatized version of the xor achieving accuracy of this is always better than not collaborating which is the bound in corollary discussion in this 
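The XOR corollary above can be checked numerically. The sketch below simulates the randomized response with per-party flip probability 1/(1+lam_i) and the stated decision rule (output the XOR of the received bits), and compares the empirical accuracy with the closed form 1/2 + (1/2) prod_i (lam_i - 1)/(lam_i + 1) implied by independent bit flips. Since the corollary's exact expression is garbled in this copy, that closed form should be treated as a reconstruction from the flip probabilities, not as the paper's statement.

import numpy as np

def xor_accuracy(eps, n_trials=200_000, rng=None):
    """Accuracy of the 'XOR the privatized bits' decision rule under randomized response.

    eps : array of per-party privacy levels eps_i (lam_i = exp(eps_i)).
    Returns the empirical probability that XOR(y_1..y_k) == XOR(x_1..x_k) together
    with the closed form 1/2 + 1/2 * prod((lam-1)/(lam+1)) implied by independent
    bit flips with probability 1/(1+lam_i).
    """
    rng = rng or np.random.default_rng(0)
    lam = np.exp(np.asarray(eps, dtype=float))
    flip_prob = 1.0 / (1.0 + lam)
    # correctness only depends on the parity of the flips, so fix the inputs to 0
    flips = rng.random((n_trials, lam.size)) < flip_prob
    empirical = (flips.sum(axis=1) % 2 == 0).mean()
    closed_form = 0.5 + 0.5 * np.prod((lam - 1.0) / (lam + 1.0))
    return empirical, closed_form

Because the correctness probability does not depend on the true bits, the average-case and worst-case accuracies coincide here, which matches the equality noted in the corollary.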
section we discuss few topics each of which is interesting but to solve in any obvious way our main result is general and sharp but we want to ask how to push it further generalization to multiple bits when each party owns multiple bits it is possible that interactive protocols improve over the randomized response protocol this is discussed with examples in section in the supplementary material approximate differential privacy common generalization of differential privacy known as the approximate differential privacy is to allow small slack of in the privacy condition in the context protocol is εi δi private for the party if for all and all xi and for all subset eεi δi it is natural to ask if the linear programming lp approach presented in this paper can be extended to identify the optimal protocol under εi δi privacy the lp formulations of and heavily rely on the fact that any differentially private protocol can be decomposed as the combination of the matrix and the since the differential privacy constraints are invariant under scaling of pτ one can represent the pattern of the distribution with sτ and the scaling with θτ this is no longer true for εi δi privacy and the analysis technique does not generalize correlated sources when the data xi are correlated each party observe noisy version of the state of the world knowing xi reveals some information on other parties bits in general revealing correlated data requires careful coordination between multiple parties the analysis techniques developed in this paper do not generalize to correlated data since the crucial tensor structure of sτ is no longer present extensions to general utility functions surprising aspect of the main result is that even though the accuracy is concave function over the protocol the maximum is achieved at an extremal point of the manifold of tensors this suggests that there is deeper geometric structure of the problem leading to possible universal optimality of the randomized response for broader class of utility functions it is an interesting task to understand the geometric structure of the problem and to ask what class of utility functions lead to optimality of the randomized response acknowledgement this research is supported in part by nsf cise award nsf satc award nsf cmmi award and nsf eng award references emmanuel abbe amir khandani and andrew lo methods for sharing financial risk exposures the american economic review amos beimel kobbi nissim and eran omri distributed private data analysis simultaneously solving how and what in advances in pages springer michael shafi goldwasser and avi wigderson completeness theorems for distributed computation in proceedings of the twentieth annual acm symposium on theory of computing pages acm blackwell equivalent comparisons of experiments the annals of mathematical statistics blum dwork mcsherry and nissim practical privacy the sulq framework in proceedings of the symposium on principles of database systems pages acm brenner and nissim impossibility of differentially private universally optimal mechanisms in foundations of computer science annual ieee symposium on pages ieee calandrino kilzer narayanan felten and shmatikov you might also like privacy risks of collaborative filtering in security and privacy sp ieee symposium on pages ieee chaudhuri sarwate and sinha differentially private principal components in advances in neural information processing systems pages chaudhuri sarwate and sinha algorithm for principal components journal of machine learning research 
kamalika chaudhuri claire monteleoni and anand sarwate differentially private empirical risk minimization the journal of machine learning research david chaum claude and ivan damgard multiparty unconditionally secure protocols in proceedings of the twentieth annual acm symposium on theory of computing pages acm cover and thomas elements of information theory john wiley sons duchi jordan and wainwright local privacy and statistical minimax rates in foundations of computer science focs ieee annual symposium on pages ieee dwork differential privacy in automata languages and programming pages springer dwork mcsherry nissim and smith calibrating noise to sensitivity in private data analysis in theory of cryptography pages springer cynthia dwork differential privacy survey of results in theory and applications of models of computation pages springer cynthia dwork krishnaram kenthapadi frank mcsherry ilya mironov and moni naor our data ourselves privacy via distributed noise generation in advances in pages springer quan geng and pramod viswanath the optimal mechanism in differential privacy arxiv preprint quan geng and pramod viswanath the optimal mechanism in differential privacy multidimensional setting arxiv preprint ghosh roughgarden and sundararajan universally privacy mechanisms siam journal on computing goldreich micali and wigderson how to play any mental game in proceedings of the nineteenth annual acm symposium on theory of computing stoc pages new york ny usa acm goyal mironov pandey and sahai tradeoffs for differentially private protocols in advances in pages springer mangesh gupte and mukund sundararajan universally optimal privacy mechanisms for minimax agents in proceedings of the acm symposium on principles of database systems pages acm hardt and roth beating randomized response on incoherent matrices in proceedings of the annual acm symposium on theory of computing pages acm homer szelinger redman duggan tembe muehling pearson stephan nelson and craig resolving individuals contributing trace amounts of dna to highly complex mixtures using snp genotyping microarrays plos genetics kairouz oh and viswanath extremal mechanisms for local differential privacy in advances in neural information processing systems kapralov and talwar on differentially private low rank approximation in proceedings of the annual symposium on discrete algorithms pages siam shiva prasad kasiviswanathan homin lee kobbi nissim sofya raskhodnikova and adam smith what can we learn privately siam journal on computing joe kilian more general completeness theorems for secure computation in proceedings of the annual acm symposium on theory of computing pages acm and raub secure computability of functions in the it setting with dishonest majority and applications to security in theory of cryptography pages springer andrew mcgregor ilya mironov toniann pitassi omer reingold kunal talwar and salil vadhan the limits of differential privacy in foundations of computer science focs annual ieee symposium on pages ieee mcsherry and talwar mechanism design via differential privacy in foundations of computer science focs annual ieee symposium on pages ieee narayanan and shmatikov robust of large sparse datasets in security and privacy sp ieee symposium on pages ieee sewoong oh and pramod viswanath the composition theorem for differential privacy arxiv preprint manoj prabhakaran and vinod prabhakaran on secure multiparty sampling for more than two parties in information theory workshop itw ieee pages ieee benjamin recht maryam 
fazel and pablo parrilo guaranteed solutions of linear matrix equations via nuclear norm minimization siam review sweeney simple demographics often identify people uniquely health warner randomized response survey technique for eliminating evasive answer bias journal of the american statistical association andrew yao protocols for secure computations in ieee annual symposium on foundations of computer science pages ieee 
spatial transformer networks max jaderberg karen simonyan andrew zisserman koray kavukcuoglu google deepmind london uk jaderberg simonyan zisserman korayk abstract convolutional neural networks define an exceptionally powerful class of models but are still limited by the lack of ability to be spatially invariant to the input data in computationally and parameter efficient manner in this work we introduce new learnable module the spatial transformer which explicitly allows the spatial manipulation of data within the network this differentiable module can be inserted into existing convolutional architectures giving neural networks the ability to actively spatially transform feature maps conditional on the feature map itself without any extra training supervision or modification to the optimisation process we show that the use of spatial transformers results in models which learn invariance to translation scale rotation and more generic warping resulting in performance on several benchmarks and for number of classes of transformations introduction over recent years the landscape of computer vision has been drastically altered and pushed forward through the adoption of fast scalable learning framework the convolutional neural network cnn though not recent invention we now see cornucopia of models achieving results in classification localisation semantic segmentation and action recognition tasks amongst others desirable property of system which is able to reason about images is to disentangle object pose and part deformation from texture and shape the introduction of local layers in cnns has helped to satisfy this property by allowing network to be somewhat spatially invariant to the position of features however due to the typically small spatial support for pixels this spatial invariance is only realised over deep hierarchy of and convolutions and the intermediate feature maps convolutional layer activations in cnn are not actually invariant to large transformations of the input data this limitation of cnns is due to having only limited pooling mechanism for dealing with variations in the spatial arrangement of data in this work we introduce the spatial transformer module that can be included into standard neural network architecture to provide spatial transformation capabilities the action of the spatial transformer is conditioned on individual data samples with the appropriate behaviour learnt during training for the task in question without extra supervision unlike pooling layers where the receptive fields are fixed and local the spatial transformer module is dynamic mechanism that can actively spatially transform an image or feature map by producing an appropriate transformation for each input sample the transformation is then performed on the entire feature map and can include scaling cropping rotations as well as deformations this allows networks which include spatial transformers to not only select regions of an image that are most relevant attention but also to transform those regions to canonical expected pose to simplify inference in the subsequent layers notably spatial transformers can be trained with standard allowing for training of the models they are injected in figure the result of using spatial transformer as the first layer of network trained for distorted mnist digit classification the input to the spatial transformer network is an image of an mnist digit that is distorted with random translation scale rotation and clutter the localisation network of the spatial transformer 
predicts transformation to apply to the input image the output of the spatial transformer after applying the transformation the classification prediction produced by the subsequent network on the output of the spatial transformer the spatial transformer network cnn including spatial transformer module is trained with only class labels no knowledge of the groundtruth transformations is given to the system spatial transformers can be incorporated into cnns to benefit multifarious tasks for example image classification suppose cnn is trained to perform classification of images according to whether they contain particular digit where the position and size of the digit may vary significantly with each sample and are uncorrelated with the class spatial transformer that crops out and the appropriate region can simplify the subsequent classification task and lead to superior classification performance see fig ii given set of images containing different instances of the same but unknown class spatial transformer can be used to localise them in each image iii spatial attention spatial transformer can be used for tasks requiring an attention mechanism such as in but is more flexible and can be trained purely with backpropagation without reinforcement learning key benefit of using attention is that transformed and so attended lower resolution inputs can be used in favour of higher resolution raw inputs resulting in increased computational efficiency the rest of the paper is organised as follows sect discusses some work related to our own we introduce the formulation and implementation of the spatial transformer in sect and finally give the results of experiments in sect additional experiments and implementation details are given in the supplementary material or can be found in the arxiv version related work in this section we discuss the prior work related to the paper covering the central ideas of modelling transformations with neural networks learning and analysing representations as well as attention and detection mechanisms for feature selection early work by hinton looked at assigning canonical frames of reference to object parts theme which recurred in where affine transformations were modeled to create generative model composed of transformed parts the targets of the generative training scheme are the transformed input images with the transformations between input images and targets given as an additional input to the network the result is generative model which can learn to generate transformed images of objects by composing parts the notion of composition of transformed parts is taken further by tieleman where learnt parts are explicitly with the transform predicted by the network such generative capsule models are able to learn discriminative features for classification from transformation supervision the invariance and equivariance of cnn representations to input image transformations are studied in by estimating the linear relationships between representations of the original and transformed images cohen welling analyse this behaviour in relation to symmetry groups which is also exploited in the architecture proposed by gens domingos resulting in feature maps that are more invariant to symmetry groups other attempts to design transformation invariant representations are scattering networks and cnns that construct filter banks of transformed filters stollenga et al use policy based on network activations to gate the responses of the network filters for subsequent forward pass of the same 
image and so can allow attention to specific features in this work we aim to achieve invariant representations by manipulating the data rather than the feature extractors something that was done for clustering in neural networks with selective attention manipulate the data by taking crops and so are able to learn translation invariance work such as are trained with reinforcement learning to avoid the figure the architecture of spatial grid generator localisation net transformer module the input feature map is passed to localisation network which regresses the transformation parameters the regular spatial grid over is transformed to the sampling grid tθ which is applied to as described in sect producing the warped output feature map the combination of the localisation network and sampling mechanism defines spatial transformer sampler spatial transformer need for differentiable attention mechanism while use differentiable attention mechansim by utilising gaussian kernels in generative model the work by girshick et al uses region proposal algorithm as form of attention and show that it is possible to regress salient regions with cnn the framework we present in this paper can be seen as generalisation of differentiable attention to any spatial transformation spatial transformers in this section we describe the formulation of spatial transformer this is differentiable module which applies spatial transformation to feature map during single forward pass where the transformation is conditioned on the particular input producing single output feature map for inputs the same warping is applied to each channel for simplicity in this section we consider single transforms and single outputs per transformer however we can generalise to multiple transformations as shown in experiments the spatial transformer mechanism is split into three parts shown in fig in order of computation first localisation network sect takes the input feature map and through number of hidden layers outputs the parameters of the spatial transformation that should be applied to the feature map this gives transformation conditional on the input then the predicted transformation parameters are used to create sampling grid which is set of points where the input map should be sampled to produce the transformed output this is done by the grid generator described in sect finally the feature map and the sampling grid are taken as inputs to the sampler producing the output map sampled from the input at the grid points sect the combination of these three components forms spatial transformer and will now be described in more detail in the following sections localisation network the localisation network takes the input feature map with width height and channels and outputs the parameters of the transformation tθ to be applied to the feature map floc the size of can vary depending on the transformation type that is parameterised for an affine transformation is as in the localisation network function floc can take any form such as network or convolutional network but should include final regression layer to produce the transformation parameters parameterised sampling grid to perform warping of the input feature map each output pixel is computed by applying sampling kernel centered at particular location in the input feature map this is described fully in the next section by pixel we refer to an element of generic feature map not necessarily an image in general the output pixels are defined to lie on regular grid gi of pixels gi xti yit forming 
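To make the localisation-network component concrete, here is a minimal fully connected sketch that regresses the six affine parameters from a flattened feature map. The layer sizes and names are purely illustrative and are not the architectures used in the experiments; the zero-weight, identity-bias initialisation of the final regression layer follows the identity-transform initialisation mentioned later in the experimental section.

import numpy as np

def init_localisation_net(in_dim, hidden=32, rng=None):
    """Tiny fully connected localisation net: flattened feature map -> 6 affine parameters.

    The final regression layer starts at zero weights with an identity-transform bias,
    so training begins from the identity warp.
    """
    rng = rng or np.random.default_rng(0)
    return {
        "W1": rng.normal(0.0, 0.01, size=(in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": np.zeros((hidden, 6)),                      # zero weights ...
        "b2": np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]),   # ... identity bias
    }

def localisation_net(feature_map, params):
    h = np.maximum(feature_map.reshape(-1) @ params["W1"] + params["b1"], 0.0)  # ReLU
    theta = h @ params["W2"] + params["b2"]
    return theta.reshape(2, 3)    # 2x3 affine matrix A_theta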
an output feature map rh where and are the height and width of the grid and is the number of channels which is the same in the input and output for clarity of exposition assume for the moment that tθ is affine transformation aθ we will discuss other transformations below in this affine case the pointwise transformation is xi xi xi yit yit yis figure two examples of applying the parameterised sampling grid to an image producing the output the sampling grid is the regular grid ti where is the identity transformation parameters the sampling grid is the result of warping the regular grid with an affine transformation tθ where xti yit are the target coordinates of the regular grid in the output feature map xsi yis are the source coordinates in the input feature map that define the sample points and aθ is the affine transformation matrix we use height and width normalised coordinates such that xti yit when within the spatial bounds of the output and xsi yis when within the spatial bounds of the input and similarly for the coordinates the transformation and sampling is equivalent to the standard texture mapping and coordinates used in graphics the transform defined in allows cropping translation rotation scale and skew to be applied to the input feature map and requires only parameters the elements of aθ to be produced by the localisation network it allows cropping because if the transformation is contraction the determinant of the left has magnitude less than unity then the mapped regular grid will lie in parallelogram of area less than the range of xsi yis the effect of this transformation on the grid compared to the identity transform is shown in fig the class of transformations tθ may be more constrained such as that used for attention tx aθ ty allowing cropping translation and isotropic scaling by varying tx and ty the transformation tθ can also be more general such as plane projective transformation with parameters piecewise affine or thin plate spline indeed the transformation can have any parameterised form provided that it is differentiable with respect to the parameters this crucially allows gradients to be backpropagated through from the sample points tθ gi to the localisation network output if the transformation is parameterised in structured way this reduces the complexity of the task assigned to the localisation network for instance generic class of structured and differentiable transformations which is superset of attention affine projective and thin plate spline transformations is tθ mθ where is target grid representation in is the regular grid in homogeneous coordinates and mθ is matrix parameterised by in this case it is possible to not only learn how to predict for sample but also to learn for the task at hand differentiable image sampling to perform spatial transformation of the input feature map sampler must take the set of sampling points tθ along with the input feature map and produce the sampled output feature map each xsi yis coordinate in tθ defines the spatial location in the input where sampling kernel is applied to get the value at particular pixel in the output this can be written as vic unm xsi φx yis φy where φx and φy are the parameters of generic sampling kernel which defines the image interpolation bilinear unm is the value at location in channel of the input and vic is the output value for pixel at location xti yit in channel note that the sampling is done identically for each channel of the input so every channel is transformed in an identical way this 
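A small sketch of the grid generator for the affine case: it builds the height- and width-normalised target grid in [-1, 1]^2 and maps it through a 2x3 matrix A_theta to source coordinates, following the pointwise transform above. The attention special case with isotropic scale and translation is included as a helper; the function names are illustrative.

import numpy as np

def affine_grid(theta, out_h, out_w):
    """Map a regular target grid through a 2x3 affine matrix A_theta.

    Target coordinates (x_t, y_t) are height/width-normalised to [-1, 1]; the
    returned arrays hold the source sampling coordinates (x_s, y_s) for every
    output pixel, as in [x_s; y_s] = A_theta [x_t; y_t; 1].
    """
    y_t, x_t = np.meshgrid(np.linspace(-1.0, 1.0, out_h),
                           np.linspace(-1.0, 1.0, out_w), indexing="ij")
    grid = np.stack([x_t.ravel(), y_t.ravel(), np.ones(out_h * out_w)])  # 3 x (H'*W')
    src = theta @ grid                                                   # 2 x (H'*W')
    return src[0].reshape(out_h, out_w), src[1].reshape(out_h, out_w)

def attention_theta(s, t_x, t_y):
    """Attention special case: isotropic scale s and translation (t_x, t_y)."""
    return np.array([[s, 0.0, t_x],
                     [0.0, s, t_y]])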
preserves spatial consistency between channels in theory any sampling kernel can be used as long as gradients can be defined with respect to xsi and yis for example using the integer sampling kernel reduces to vic unm bxsi byis where bx rounds to the nearest integer and is the kronecker delta function this sampling kernel equates to just copying the value at the nearest pixel to xsi yis to the output location xti yit alternatively bilinear sampling kernel can be used giving vic unm max max to allow backpropagation of the loss through this sampling mechanism we can define the gradients with respect to and for bilinear sampling the partial derivatives are xx max max if xsi unm max if xsi if xsi and similarly to for this gives us differentiable sampling mechanism allowing loss gradients to flow back not only to the input feature map but also to the sampling grid coordinates and therefore back to the transformation parameters and localisation network since and can be easily derived from for example due to discontinuities in the sampling fuctions must be used this sampling mechanism can be implemented very efficiently on gpu by ignoring the sum over all input locations and instead just looking at the kernel support region for each output pixel spatial transformer networks the combination of the localisation network grid generator and sampler form spatial transformer fig this is module which can be dropped into cnn architecture at any point and in any number giving rise to spatial transformer networks this module is computationally very fast and does not impair the training speed causing very little time overhead when used naively and even potential speedups in attentive models due to subsequent downsampling that can be applied to the output of the transformer placing spatial transformers within cnn allows the network to learn how to actively transform the feature maps to help minimise the overall cost function of the network during training the knowledge of how to transform each training sample is compressed and cached in the weights of the localisation network and also the weights of the layers previous to spatial transformer during training for some tasks it may also be useful to feed the output of the localisation network forward to the rest of the network as it explicitly encodes the transformation and hence the pose of region or object it is also possible to use spatial transformers to downsample or oversample feature map as one can define the output dimensions and to be different to the input dimensions and however with sampling kernels with fixed small spatial support such as the bilinear kernel downsampling with spatial transformer can cause aliasing effects finally it is possible to have multiple spatial transformers in cnn placing multiple spatial transformers at increasing depths of network allow transformations of increasingly abstract representations and also gives the localisation networks potentially more informative representations to base the predicted transformation parameters on one can also use multiple spatial transformers in parallel this can be useful if there are multiple objects or parts of interest in feature map that should be focussed on individually limitation of this architecture in purely network is that the number of parallel spatial transformers limits the number of objects that the network can model model fcn cnn aff proj tps aff proj tps mnist distortion rts rts table left the percentage errors for different models on different distorted mnist datasets the 
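The bilinear sampling kernel above can be evaluated cheaply because it has support on at most four input pixels per output location. The sketch below maps the normalised source coordinates back to pixel units (a convention choice, since the text only fixes the [-1, 1] normalisation) and computes V_i^c = sum_{n,m} U_{nm}^c max(0, 1-|x_i^s - m|) max(0, 1-|y_i^s - n|); it is forward-pass only, with the gradients of the previous paragraph omitted.

import numpy as np

def bilinear_sample(U, x_s, y_s):
    """Sample channel-wise values of U (H x W x C) at normalised source coordinates.

    Coordinates outside the spatial bounds of the input simply receive zero weight
    from every pixel, and each channel is sampled identically.
    """
    H, W, C = U.shape
    x = (x_s + 1.0) * (W - 1) / 2.0          # un-normalise to pixel units
    y = (y_s + 1.0) * (H - 1) / 2.0
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    out = np.zeros(x_s.shape + (C,))
    for dx in (0, 1):                         # the bilinear kernel touches at most
        for dy in (0, 1):                     # four input pixels per output location
            xi, yi = x0 + dx, y0 + dy
            w = np.maximum(0.0, 1.0 - np.abs(x - xi)) * np.maximum(0.0, 1.0 - np.abs(y - yi))
            valid = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
            xi_c, yi_c = np.clip(xi, 0, W - 1), np.clip(yi, 0, H - 1)
            out += (w * valid)[..., None] * U[yi_c, xi_c, :]
    return out

Chaining the localisation network, the grid generator, and this sampler gives the forward pass of a complete spatial transformer module.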
different distorted mnist datasets we test are tc translated and cluttered rotated rts rotated translated and scaled projective distortion elastic distortion all the models used for each experiment have the same number of parameters and same base structure for all experiments right some example test images where spatial transformer network correctly classifies the digit but cnn fails the inputs to the networks the transformations predicted by the spatial transformers visualised by the grid tθ the outputs of the spatial transformers and rts examples use thin plate spline spatial transformers tps while examples use affine spatial transformers aff with the angles of the affine transformations given for videos showing animations of these experiments and more see https experiments in this section we explore the use of spatial transformer networks on number of supervised learning tasks in sect we begin with experiments on distorted versions of the mnist handwriting dataset showing the ability of spatial transformers to improve classification performance through actively transforming the input images in sect we test spatial transformer networks on challenging dataset street view house numbers for number recognition showing results using multiple spatial transformers embedded in the convolutional stack of cnn finally in sect we investigate the use of multiple parallel spatial transformers for classification showing performance on birds dataset by automatically discovering object parts and learning to attend to them further experiments with mnist addition and can be found in the supplementary material distorted mnist in this section we use the mnist handwriting dataset as testbed for exploring the range of transformations to which network can learn invariance to by using spatial transformer we begin with experiments where we train different neural network models to classify mnist data that has been distorted in various ways rotation rotation scale and translation rts projective transformation elastic warping note that elastic warping is destructive and can not be inverted in some cases the full details of the distortions used to generate this data are given in the supplementary material we train baseline fcn and convolutional cnn neural networks as well as networks with spatial transformers acting on the input before the classification network and the spatial transformer networks all use bilinear sampling but variants use different transformation functions an affine transformation aff projective transformation proj and thin plate spline transformation tps with regular grid of control points the cnn models include two layers all networks have approximately the same number of parameters are trained with identical optimisation schemes backpropagation sgd scheduled learning rate decrease with multinomial cross entropy loss and all with three weight layers in the classification network the results of these experiments are shown in table left looking at any particular type of distortion of the data it is clear that spatial transformer enabled network outperforms its counterpart base network for the case of rotation translation and scale distortion rts the achieves and depending on the class of transform used for tθ whereas cnn with two maxpooling layers to provide spatial invariance achieves error this is in fact the same error that the achieves which is without single convolution or layer in its network showing that using spatial transformer is an alternative way to achieve spatial invariance models 
consistently perform better than models due to layers in providing even more spatial invariance and convolutional layers better modelling local structure we also test our models in noisy environment on images with translated mnist digits and size model maxout cnn cnn ours dram single multi st conv st conv st conv st table left the sequence error for svhn recognition on crops of pixels and inflated crops of which include more background the best reported result from uses model averaging and monte carlo averaging whereas the results from other models are from single forward pass of single model right the schematic of the multi model the transformations of each spatial transformer st are applied to the convolutional feature map produced by the previous layer the result of the composition of the affine transformations predicted by the four spatial transformers in multi visualised on the input image background clutter see fig third row for an example an fcn gets error cnn gets error while an gets error and an gets error looking at the results between different classes of transformation the thin plate spline transformation tps is the most powerful being able to reduce error on elastically deformed digits by reshaping the input into prototype instance of the digit reducing the complexity of the task for the classification network and does not over fit on simpler data interestingly the transformation of inputs for all st models leads to standard upright posed digit this is the mean pose found in the training data in table right we show the transformations performed for some test cases where cnn is unable to correctly classify the digit but spatial transformer network can street view house numbers we now test our spatial transformer networks on challenging dataset street view house numbers svhn this dataset contains around real world images of house numbers with the task to recognise the sequence of numbers in each image there are between and digits in each image with large variability in scale and spatial arrangement we follow the experimental setup as in where the data is preprocessed by taking crops around each digit sequence we also use an additional more loosely cropped dataset as in we train baseline character sequence cnn model with hidden layers leading to five independent softmax classifiers each one predicting the digit at particular position in the sequence this is the character sequence model used in where each classifier includes output to model variable length sequences this model matches the results obtained in we extend this baseline cnn to include spatial transformer immediately following the input stcnn single where the localisation network is cnn we also define another extension where before each of the first four convolutional layers of the baseline cnn we insert spatial transformer multi in this case the localisation networks are all fully connected networks with units per layer in the multi model the spatial transformer before the first convolutional layer acts on the input image as with the previous experiments however the subsequent spatial transformers deeper in the network act on the convolutional feature maps predicting transformation from them and transforming these feature maps this is visualised in table right this allows deeper spatial transformers to predict transformation based on richer features rather than the raw image all networks are trained from scratch with sgd and dropout with randomly initialised weights except for the regression layers of spatial 
transformers which are initialised to predict the identity transform affine transformations and bilinear sampling kernels are used for all spatial transformer networks in these experiments the results of this experiment are shown in table left the spatial transformer models obtain results reaching error on images compared to previous of error interestingly on images while other methods degrade in performance an achieves error while the previous state of the art at error is with recurrent attention model that uses an ensemble of models with monte carlo averaging in contrast the stcnn models require only single forward pass of single model this accuracy is achieved due to the fact that the spatial transformers crop and rescale the parts of the feature maps that correspond to the digit focussing resolution and network capacity only on these areas see table right model cimpoi zhang branson lin simon cnn ours table left the accuracy on bird classification dataset spatial transformer networks with two spatial transformers and four spatial transformers in parallel outperform other models resolution images can be used with the without an increase in computational cost due to downsampling to after the transformers right the transformation predicted by the spatial transformers of top row and bottom row on the input image notably for the one of the transformers shown in red learns to detect heads while the other shown in green detects the body and similarly for the for some examples in terms of computation speed the multi model is only slower forward and backward pass than the cnn classification in this section we use spatial transformer network with multiple transformers in parallel to perform bird classification we evaluate our models on the birds dataset containing training images and test images covering species of birds the birds appear at range of scales and orientations are not tightly cropped and require detailed texture and shape analysis to distinguish in our experiments we only use image class labels for training we consider strong baseline cnn model an inception architecture with batch normalisation on imagenet and on cub which by itself achieves accuracy of previous best result is we then train spatial transformer network which contains or parallel spatial transformers parameterised for attention and acting on the input image discriminative image parts captured by the transformers are passed to the part description each of which is also initialised by inception the resulting part representations are concatenated and classified with single softmax layer the whole architecture is trained on image class labels with backpropagation details in supplementary material the results are shown in table left the achieves an accuracy of outperforming the baseline by in the visualisations of the transforms predicted by table right one can see interesting behaviour has been learnt one spatial transformer red has learnt to become head detector while the other green fixates on the central part of the body of bird the resulting output from the spatial transformers for the classification network is somewhat posenormalised representation of bird while previous work such as explicitly define parts of the bird training separate detectors for these parts with supplied keypoint training data the is able to discover and learn part detectors in manner without any additional supervision in addition spatial transformers allows for the use of resolution input images without any impact on performance as the output of 
the transformed images are sampled at before being processed conclusion in this paper we introduced new module for neural networks the spatial transformer this module can be dropped into network and perform explicit spatial transformations of features opening up new ways for neural networks to model data and is learnt in an fashion without making any changes to the loss function while cnns provide an incredibly strong baseline we see gains in accuracy using spatial transformers across multiple tasks resulting in performance furthermore the regressed transformation parameters from the spatial transformer are available as an output and could be used for subsequent tasks while we only explore networks in this work early experiments show spatial transformers to be powerful in recurrent models and useful for tasks requiring the disentangling of object reference frames references ba mnih and kavukcuoglu multiple object recognition with visual attention iclr branson van horn belongie and perona bird species categorization using pose normalized deep convolutional nets bruna and mallat invariant scattering convolution networks ieee pami cimpoi maji and vedaldi deep filter banks for texture recognition and segmentation in cvpr cohen and welling transformation properties of learned visual representations iclr erhan szegedy toshev and anguelov scalable object detection using deep neural networks in cvpr frey and jojic fast clustering in nips gens and domingos deep symmetry networks in nips girshick donahue darrell and malik rich feature hierarchies for accurate object detection and semantic segmentation in cvpr goodfellow bulatov ibarz arnoud and shet number recognition from street view imagery using deep convolutional neural networks gregor danihelka graves and wierstra draw recurrent neural network for image generation icml hinton parallel computation that assigns canonical frames of reference in ijcai hinton krizhevsky and wang transforming in icann hinton srivastava krizhevsky sutskever and salakhutdinov improving neural networks by preventing of feature detectors corr ioffe and szegedy batch normalization accelerating deep network training by reducing internal covariate shift icml jaderberg simonyan vedaldi and zisserman synthetic data and artificial neural networks for natural scene text recognition nips dlw kanazawa sharma and jacobs locally convolutional neural networks in nips lecun bottou bengio and haffner learning applied to document recognition proceedings of the ieee lenc and vedaldi understanding image representations by measuring their equivariance and equivalence cvpr lin roychowdhury and maji bilinear cnn models for visual recognition netzer wang coates bissacco wu and ng reading digits in natural images with unsupervised feature learning in nips dlw russakovsky deng su krause satheesh ma huang karpathy khosla bernstein et al imagenet large scale visual recognition challenge sermanet frome and real attention for categorization simon and rodner neural activation constellations unsupervised part model discovery with convolutional networks sohn and lee learning invariant representations with local transformations stollenga masci gomez and schmidhuber deep networks with internal selective attention through feedback connections in nips tieleman optimizing neural networks that generate images phd thesis university of toronto wah branson welinder perona and belongie the dataset xu ba kiros cho courville salakhutdinov zemel and bengio show attend and tell neural image caption generation with 
visual attention icml zhang zou ming he and sun efficient and accurate approximations of nonlinear convolutional networks 
anytime influence bounds and the explosive behavior of diffusion networks kevin nicolas cmla ens cachan cnrs saclay france paris france scaman lemonnier vayatis abstract the paper studies transition phenomena in information cascades observed along diffusion process over some graph we introduce the laplace hazard matrix and show that its spectral radius fully characterizes the dynamics of the contagion both in terms of influence and of explosion time using this concept we prove tight bounds for the influence of set of nodes and we also provide an analysis of the critical time after which the contagion becomes our contributions include formal definitions and tight lower bounds of critical explosion time we illustrate the relevance of our theoretical results through several examples of information cascades used in epidemiology and viral marketing models finally we provide series of numerical experiments for various types of networks which confirm the tightness of the theoretical bounds introduction diffusion networks capture the underlying mechanism of how events propagate throughout complex network in marketing social graph dynamics have caused large transformations in business models forcing companies to their customers not as mass of isolated economic agents but as customer networks in epidemiology precise understanding of spreading phenomena is heavily needed when trying to break the chain of infection in populations during outbreaks of viral diseases but whether the subject is virus spreading across computer network an innovative product among early adopters or rumor propagating on network of people the questions of interest are the same how many people will it infect how fast will it spread and even more critically for decision makers how can we modify its course in order to meet specific goals several papers tackled these issues by studying the influence maximization problem given known diffusion process on graph it consists in finding the subset of initial seeds with the highest expected number of infected nodes at certain time distance this problem being various heuristics have been proposed in order to obtain scalable suboptimal approximations while the first algorithms focused on models and the special case subsequent papers brought empirical evidences of the key role played by temporal behavior existing models of stochastic processes include multivariate hawkes processes where recent progress in inference methods made available the tools for the study of activity shaping which is closely related to influence maximization however in the most studied case in which each node of the network can only be infected once the most widely used model remains the information cascade ctic model under this framework successful inference as well as influence maximization algorithms have been developed however if recent works provided theoretical foundations for the inference problem assessing the quality of influence maximization remains challenging task as few theoretical results exist for general graphs in the setting studies of the sir diffusion process in epidemiology or percolation for specific graphs provided more accurate understanding of these processes more recently it was shown in that the spectral radius of given hazard matrix played key role in influence of information cascades this allowed the authors to derive closedform tight bounds for the influence in general graphs and characterize epidemic thresholds under which the influence of any set of nodes is at most in this paper we 
extend their approach in order to deal with the problem of anytime influence bounds for information cascades more specifically we define the laplace hazard matrices and show that the influence at time of any set of nodes heavily depends on their spectral radii moreover we reveal the existence and characterize the behavior of critical times at which supercritical processes explode we show that before these times processes will behave and infect at most nodes these results can be used in various ways first they provide way to evaluate influence maximization algorithms without having to test all possible set of influencers which is intractable for large graphs secondly critical times allow decision makers to know how long contagion will remain in its early phase before becoming event in fields where knowing when to act is nearly as important as knowing where to act finally they can be seen as the first formula for anytime influence estimation for information cascades indeed we provide empirical evidence that our bounds are tight for large family of graphs at the beginning and the end of the infection process the rest of the paper is organized as follows in section we recall the definition of information cascades model and introduce useful notations in section we derive theoretical bounds for the influence in section we illustrate our results by applying them on specific cascade models in section we perform experiments in order to show that our bounds are sharp for family of graphs and sets of initial nodes all proof details are provided in the supplementary material information cascades information propagation and influence in diffusion networks we describe here the propagation dynamics introduced in let be directed network of nodes we equip each directed edge with probability distribution pij over pij is thus measure on and define the cascade behavior as follows at time only subset of influencers is infected each node infected at time τi may transmit the infection at time τi τij along its outgoing edge with probability density pij τij and independently of other transmission events the process ends for given for each node we will denote as τv the possibly infinite time at which it is reached by the infection the influence of at time denoted as σa is defined as the expected number of nodes reached by the contagion at time originating from σa τv where the expectation is taken over cascades originating from τv following the percolation literature we will differentiate between cascades whose size is and cascades whose size is proportional to where denotes the size of the network this work focuses on upper bounding the influence σa for any given time and characterizing the critical times at which phase transitions occur between and supercritical behaviors the laplace hazard matrix we extend here the concept of hazard matrix first introduced in different from the homonym notion of which plays key role in the influence of the information cascade definition let be directed graph and pij be integrable edge transmission probr abilities such that pij dt for let lh be the matrix denoted as the laplace hazard matrix whose coefficients are dt ln dt if ij ij ij lhij otherwise where denotes the laplace transform of pij defined for every by pij dt note that the long term behavior of the cascade is retrieved when and coincides with the concept of hazard matrix used in we recall that for any square matrix of size its spectral radius is the maximum of the absolute values of its eigenvalues if is moreover real 
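Because the definition above is partially garbled in this copy, the sketch below implements only the long-term (s -> 0) case, which the remark says coincides with the hazard matrix of earlier work: entries -ln(1 - alpha_ij) for edge transmission probabilities alpha_ij = integral of p_ij, and zero elsewhere. The spectral radius is read off the eigenvalues. The entry form should therefore be taken as an assumption consistent with that remark rather than as the exact Laplace hazard matrix.

import numpy as np

def hazard_matrix(alpha):
    """Hazard matrix for edge transmission probabilities alpha_ij = integral of p_ij.

    Entries are -ln(1 - alpha_ij) on existing edges and 0 elsewhere, i.e. the
    long-term (s -> 0) limit of the Laplace hazard matrix discussed above.
    """
    alpha = np.asarray(alpha, dtype=float)
    H = np.zeros_like(alpha)
    mask = alpha > 0
    H[mask] = -np.log1p(-alpha[mask])
    return H

def spectral_radius(M):
    """Largest absolute value of the eigenvalues of M."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))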
and positive we also have mx existence of critical time of contagion in the following we will derive critical times before which the contagion is and above which the contagion is we now formalize this notion of critical time via limits of contagions on networks theorem let gn be sequence of networks of size and pnij be transmission probability functions along the edges of gn let also σn be the maximum influence in gn at time from single influencer then there exists critical time such that for every sequence of times tn if lim tn then σn tn if σn tn then lim inf tn moreover such critical time is unique in other words the critical time is time before which the regime is and after which no contagion can be the next proposition shows that after the critical time the contagion is proposition if tn is such that lim inf tn then lim inf σn and σn tn the contagion is conversely if tn is such that lim inf then lim tn in order to simplify notations we will omit in the following the dependence in of all the variables whenever stating results holding in the limit theoretical bounds for the influence of set of nodes we now present our upper bounds on the influence at time and derive lower bound on the critical time of contagion upper bounds on the maximum influence at time the next proposition provides an upper bound on the influence at time for any set of influencers such that this result may be valuable for assessing the quality of influence maximization algorithms in given network proposition define lh then for any such that denoting by σa the expected number of nodes reached by the cascade starting from at time σa min est where is the smallest solution in of the following equation exp corollary under the same assumptions σa min st note that the upper bound in is corollary of proposition using when corollary with implies that the regime is for all when the behavior may be and the influence may reach linear values in however at cost growing exponentially with it is always possible to choose such that and retrieve behavior while the exact optimal parameter is in general not explicit two choices of derive relevant results either simplifying est by choosing or keeping by choosing in particular the following corollary shows that the contagion explodes at most as eρ for any corollary let and under the same assumptions σa remark since this section focuses on bounding σa for given all the aforementioned results also hold for ptij pij this is equivalent to integrating everything on rt rt rt instead of lhij ln pij dt pij dt pij dt this choice of lh is particularly useful when some edges are transmitting the contagion with probability see for instance the si epidemic model in section lower bound on the critical time of contagion the previous section presents results about how explosive contagion is these findings suggest that the speed at which contagion explodes is bounded by certain quantity and thus that the process needs certain amount of time to become this intuition is made formal in the following corollary ln if the sequence tn corollary assume ρn and is such that tn lim sup ln then σa tn in other words the regime of the contagion is before lim inf the technical condition ln and ln ln imposes that for large lim ρn ρn ln has the same behavior than ρn this condition verges sufficiently fast to so that is not very restrictive and is met for the different case studies considered in section this result may be valuable for decision makers since it provides safe time region in which the contagion has not 
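The bounds above are compared in the experiments against empirical influence obtained by simulation. For reference, a minimal Monte Carlo estimator of sigma_A(t) for exponential transmission probabilities p_ij(t) = alpha_ij * lambda * exp(-lambda t) is sketched below; it relies on the observation that, once each edge's transmission event and delay have been sampled, a node's infection time equals the length of the shortest transmitting path from the seed set. Seeds are counted in the total, and all names and defaults are illustrative.

import heapq
import numpy as np

def simulate_influence(alpha, lam, seeds, t, n_runs=1000, rng=None):
    """Monte Carlo estimate of sigma_A(t) for exponential transmission probabilities.

    Each run samples, for every edge (i, j), whether it ever transmits (probability
    alpha_ij) and an Exp(lam) delay if it does; infection times are then shortest-path
    distances from the seed set, and nodes reached by time t are counted.
    """
    rng = rng or np.random.default_rng(0)
    n = alpha.shape[0]
    edges = [(i, j) for i in range(n) for j in range(n) if alpha[i, j] > 0]
    total = 0
    for _ in range(n_runs):
        delay = {}
        for i, j in edges:
            if rng.random() < alpha[i, j]:
                delay[(i, j)] = rng.exponential(1.0 / lam)
        tau = {v: 0.0 for v in seeds}
        heap = [(0.0, v) for v in seeds]
        heapq.heapify(heap)
        while heap:
            d, u = heapq.heappop(heap)
            if d > tau.get(u, np.inf) or d > t:   # stale entry, or past the horizon
                continue
            for v in range(n):
                w = delay.get((u, v))
                if w is not None and d + w < tau.get(v, np.inf):
                    tau[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        total += sum(1 for d in tau.values() if d <= t)
    return total / n_runs

Each run costs one Dijkstra pass over the sampled delays, so the estimator scales comfortably to the network sizes used in the experiments.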
reached macroscopic scale it thus provides insights into how long do decision makers have to prepare control measures after the process can explode and immediate action is required application to particular contagion models in this section we provide several examples of cascade models that show that our theoretical bounds are applicable in wide range of scenarios and provide the first results of this type in many areas including two widely used epidemic models fixed transmission pattern when the transmission probabilities are of the form pij αij and αij lhij ln αij ρα and ln where ρα is the hazard matrix defined in in these networks the temporal and structural behaviors are clearly separated while ρα summarizes the structure of the network and how connected the nodes are to one another captures how fast the transmission probabilities are fading through time when ρα the behavior is and the bound on the critical times is given by inverting ln lim inf where exists and is unique since is decreasing from to in general it is not possible to give more explicit version of the critical time of corollary or of the anytime influence bound of proposition however we investigate in the rest of this section specific which lead to explicit results exponential transmission probabilities notable example of fixed transmission pattern is the case of exponential probabilities pij αij for and αij influence maximization algorithms under this specific choice of transmission functions have been for instance developed in in such case we can calculate the spectral radii explicitly ρα ln ij ji where ρα is again the hazard matrix when ρα this leads to critical time lower bounded by lim inf ln ρα the influence bound of corollary can also be reformulated in the following way corollary assume ρα or else λt ρα then the minimum in eq is met for ρα and corollary rewrites σa λρα eλt ρα if ρα and λt ρα the minimum in eq is met for and corollary rewrites ρα σa ρα note that in particular the condition of corollary is always met in the case where ρα moreover we retrieve the behavior when concerning the behavior in the bound matches exactly the bound when is very large in the case however for sufficiently small we obtain greatly improved result with very instructive growth in si and sir epidemic models both epidemic models si and sir are particular cases of exponential transmission probabilities sir model is widely used epidemic model that uses three states to describe the spread of an infection each node of the network can be either susceptible infected or removed at subset of nodes is infected then each node infected at time τi is removed at an time θi of parameter transmission along its outgoing edge occurs at time τi τij with conditional probability density exp given that node has not been removed at that time when the removing events are not observed sir is equivalent to ct ic except that transmission along outgoing edges of one node are positively correlated however our results still hold in case of such correlation as shown in the following result proposition assume the propagation follow sir model of transmission parameter and parameter define pij exp for let ij be the adjacency matrix of the underlying undirected network then results of proposition and subsequent corollaries still hold with given by lh lh ln from this proposition the same analysis than in the independent transmission events case can be derived and the critical time for the sir model is lim inf ln ln βδ proposition consider the sir model with transmission 
rate recovery rate and adjacency matrix an assume lim inf ln βδ an and the sequence tn is such that lim sup ln βδ an tn ln then σa tn this is direct corollary of corollary with ln βδ an the si model is simpler model in which individuals of the network remain infected and contagious through time thus the network is totally infected at the end of the contagion and σa for this reason the previous critical time for the more general sir model is of no use here and more precise analysis is required following the remark of section we can integrate pij on instead of which leads to the following result proposition consider the si model with transmission rate and adjacency matrix an assume lim inf an and the sequence tn is such that lim sup βtn ln an ln an then σa tn in other words the critical time for the si model is lower bounded by ln ln lim inf an if an ln for sparse networks with maximum degree in the critical time ln resumes to tc lim inf however when the graph is denser and an ln then tc lim inf ln an information cascade final example is the contagion in which node infected at time makes unique attempt to infect its neighbors at time this defines the information cascade model the totally connected erdos renyi preferential attachment small world contact network upper bound influence σa influence σa spectral radius ρα spectral radius ρα influence σa influence σa spectral radius ρα spectral radius ρα figure empirical maximum influence the spectral radius ρα defined in section for various network types simulation parameters and diffusion model studied by the first works on influence maximization in this setting pij αij where is the dirac distribution centered at the spectral radii are given by ρα and the influence bound of corollary simplifies to corollary let ρα or else if then σa otherwise σa ρα moreover the critical time is lower bounded by lim inf ln ln ρα notable difference from the exponential transmission probabilities is that is here inversely proportional to ln ρα instead of ρα in eq which implies that for the same influence contagion will explode much slower than one with constant infection rate this is probably due to the existence of very small infection times for contagions with exponential transmission probabilities experimental results this section provides an experimental validation of our bounds by comparing them to the empirical influence simulated on several network types in all our experiments we simulate contagion with exponential transmission probabilities see section on networks of size and generated random networks of different types for more information on the respective random generators see networks preferential attachment networks networks geometric random networks and totally connected networks with fixed weight except for the ingoing and outgoing edges of single node having respectively weight and the reason for simulating on such totally connected networks is that the influence over these networks tend to match our upper bounds more closely and plays the role of best case influence σa influence σa totally connected erdos renyi preferential attachment small world contact network upper bound influence σa number of nodes number of nodes number of nodes figure empirical maximum influence the network size for various network types simulan tion parameters and ρα in such setting ρln note the versus linear behavior and scenario more precisely the transmission probabilities are of the form pij for each edge where and in the formulas of section we first investigate the 
tightness of the upper bound on the maximum influence given in proposition figure presents the empirical influence ρα ln where is the adjacency matrix of the network for large set of network types as well as the upper bound in proposition each point in the figure corresponds to the maximum influence on one network the influence was averaged over cascade simulations and the best influencer whose influence was maximal was found by performing an exhaustive search our bounds are tight for all values of for totally connected networks in the regime ρα for the regime ρα the behavior in is very instructive for we are tight for most network types when ρα is high for the average transmission time for the τij the maximum influence varies lot across different graphs this follows the intuition that this is one of the times where for given final number of infected node the local structure of the networks will play the largest role through precise temporal evolution of the infection because ρα explains quite well the final size of the infection this discrepancy appears on our graphs at ρα fixed while our bound does not seem tight for this particular time the order of magnitude of the explosion time is retrieved and our bounds are close to optimal values as soon as in order to further validate that our bounds give meaningful insights on the critical time of explosion for graphs figure presents the empirical influence with respect to the size of the network for different network types and values of with ρα fixed to ρα in this setting the critical time of corollary is given by ρln we see that our bounds are tight for totally connected networks for all values of moreover the accuracy of critical time estimation is proved by the drastic change of behavior around with phase transitions having occurred for most network types as soon as conclusion in this paper we characterize the phase transition in information cascades between their and behavior we provide for the first time general influence bounds that apply for any time horizon graph and set of influencers we show that the key quantities governing this phenomenon are the spectral radii of given laplace hazard matrices we prove the pertinence of our bounds by deriving the first results of this type in several application fields finally we provide experimental evidence that our bounds are tight for large family of networks acknowledgments this research is part of the sodatech project funded by the french government within the program of investments for the future big data references michael trusov randolph bucklin and koen pauwels effects of versus traditional marketing findings from an internet social networking site journal of marketing david kempe jon kleinberg and tardos maximizing the spread of influence through social network in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages acm wei chen yajun wang and siyu yang efficient influence maximization in social networks in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages acm wei chen chi wang and yajun wang scalable influence maximization for prevalent viral marketing in social networks in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages acm manuel david balduzzi and bernhard uncovering the temporal dynamics of diffusion networks in proceedings of the international conference on machine learning pages nan du le song hyenkyun woo and hongyuan zha uncover 
information diffusion networks in proceedings of the sixteenth international conference on artificial intelligence and statistics pages alan hawkes and david oakes cluster process representation of process journal of applied probability pages ke zhou hongyuan zha and le song learning triggering kernels for hawkes processes in proceedings of the international conference on machine learning pages remi lemonnier and nicolas vayatis nonparametric markovian learning of triggering kernels for mutually exciting and mutually inhibiting multivariate hawkes processes in machine learning and knowledge discovery in databases pages springer mehrdad farajtabar nan du manuel isabel valera hongyuan zha and le song shaping social activity by incentivizing users in advances in neural information processing systems pages manuel and bernhard influence maximization in continuous time diffusion networks in proceedings of the international conference on machine learning pages nan du le song manuel and hongyuan zha scalable influence estimation in diffusion networks in advances in neural information processing systems pages manuel le song hadi daneshmand and schoelkopf estimating diffusion networks recovery conditions sample complexity algorithm journal of machine learning research jean and thibaut horel inferring graphs from cascades sparse recovery framework in proceedings of the international conference on machine learning pages moez draief ayalvadi ganesh and laurent thresholds for virus spread on networks annals of applied probability svante janson and oliver riordan the phase transition in inhomogeneous random graphs random structures algorithms remi lemonnier kevin scaman and nicolas vayatis tight bounds for influence in diffusion networks and application to bond percolation and epidemiology in advances in neural information processing systems pages william kermack and anderson mckendrick contributions to the mathematical theory of epidemics ii the problem of endemicity proceedings of the royal society of london series jure leskovec andreas krause carlos guestrin christos faloutsos jeanne vanbriesen and natalie glance outbreak detection in networks in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages acm mark newman networks an introduction oxford university press new york ny usa mathew penrose random geometric graphs volume oxford university press oxford 
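To make the experimental protocol of the preceding section concrete, the following is a minimal simulation sketch (not the authors' code): it draws a directed random graph with transmission rates α_ij, simulates continuous-time cascades with exponential transmission densities p_ij(t) = α_ij exp(−α_ij t) as in the experiments above, and reports the empirical influence σ_A(t) (the average number of nodes infected by the horizon t) alongside the spectral radius of the rate matrix as a rough structural summary. The network model, rate scale, horizon values, and the use of ρ(A) in place of the paper's exact Laplace hazard matrix are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): continuous-time cascades with
# exponential transmission densities, empirical influence vs. a spectral summary.
import heapq
import numpy as np

rng = np.random.default_rng(0)

def random_rate_matrix(n, p_edge=0.05, rate=1.0):
    """Directed Erdos-Renyi graph with constant transmission rates a_ij."""
    A = (rng.random((n, n)) < p_edge) * rate
    np.fill_diagonal(A, 0.0)
    return A

def simulate_cascade(A, source, horizon):
    """One cascade: edge (i, j) transmits after an Exp(a_ij) delay.
    Returns the number of nodes infected by `horizon` (Dijkstra on sampled delays)."""
    n = A.shape[0]
    t_inf = np.full(n, np.inf)
    t_inf[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        t, i = heapq.heappop(heap)
        if t > t_inf[i]:          # stale heap entry
            continue
        for j in np.nonzero(A[i])[0]:
            t_j = t + rng.exponential(1.0 / A[i, j])
            if t_j < t_inf[j] and t_j <= horizon:
                t_inf[j] = t_j
                heapq.heappush(heap, (t_j, j))
    return int(np.sum(t_inf <= horizon))

def empirical_influence(A, source, horizon, n_runs=200):
    return np.mean([simulate_cascade(A, source, horizon) for _ in range(n_runs)])

A = random_rate_matrix(n=200)
rho = max(abs(np.linalg.eigvals(A)))   # structural summary of connectivity
print("spectral radius of the rate matrix:", rho)
for T in (0.5, 1.0, 2.0, 4.0):
    print("t =", T, "empirical influence:", empirical_influence(A, source=0, horizon=T))
```

In the paper's experiments the influence is additionally maximized over the choice of seed node by exhaustive search; the sketch fixes a single seed for brevity.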
svms from tighter generalization bounds to novel algorithms dogan microsoft research cambridge uk udogan yunwen lei department of mathematics city university of hong kong yunwelei alexander binder istd pillar singapore university of technology and design machine learning group tu berlin alexander binder marius kloft department of computer science humboldt university of berlin kloft abstract this paper studies the generalization performance of classification algorithms for which we the first generalization error bound with logarithmic dependence on the class size substantially improving the linear dependence in the existing generalization analysis the theoretical analysis motivates us to introduce new classification machine based on regularization where the parameter controls the complexity of the corresponding bounds we derive an efficient optimization algorithm based on fenchel duality theory benchmarks on several datasets show that the proposed algorithm can achieve significant accuracy gains over the state of the art introduction typical application domains such as natural language processing information retrieval image annotation and web advertising involve tens or hundreds of thousands of classes and yet these datasets are still growing to handle such learning tasks it is essential to build algorithms that scale favorably with respect to the number of classes over the past years much progress in this respect has been achieved on the algorithmic side including efficient stochastic gradient optimization strategies although also theoretical properties such as consistency and behavior have been studied there still is discrepancy between algorithms and theory in the sense that the corresponding theoretical bounds do often not scale well with respect to the number of classes this discrepancy occurs the most strongly in research on generalization bounds that is bounds that can measure generalization performance of prediction models purely from the training samples and which thus are very appealing in model selection crucial advantage of these bounds is that they can better capture the properties of the distribution that has generated the data which can lead to tighter estimates than conservative bounds to our best knowledge for classification the first error bounds were given by these bounds exhibit quadratic dependence on the class size and were used by and to derive bounds for classification and multiple kernel learning mkl problems respectively more recently improve the quadratic dependence to linear dependence by introducing novel surrogate for the margin that is independent on the true realization of the class label however heavy dependence on the class size such as linear or quadratic implies poor generalization guarantee for classification problems with massive number of classes in this paper we show generalization bounds for classification problems the first sublinear dependence on the number of classes choosing appropriate regularization this dependence can be as mild as logarithmic we achieve these improved bounds via the use of gaussian complexities while previous bounds are based on wellknown structural result on rademacher complexities for classes induced by the maximum operator the proposed proof technique based on gaussian complexities exploits potential coupling among different components of the classifier while this fact is ignored by previous analyses the result shows that the generalization ability is strongly impacted by the employed regularization which motivates us to 
propose new learning machine performing regularization over the components as natural choice we investigate here the application of the proven norm this results in novel support vector machine which contains the classical model by crammer singer as special case for the bounds indicate that the parameter crucially controls the complexity of the resulting prediction models we develop an efficient optimization algorithm for the proposed method based on its fenchel dual representation we empirically evaluate its effectiveness on several standard benchmarks for multiclass classification taken from various domains where the proposed approach significantly outperforms the method of the remainder of this paper is structured as follows section introduces the problem setting and presents the main theoretical results motivated by which we propose new classification model in section and give an efficient optimization algorithm based on fenchel duality theory in section we evaluate the approach for the application of visual image recognition and on several standard benchmark datasets taken from various application domains section concludes theory problem setting and notations this paper considers classification problems with classes let denote the input space and denote the output space assume that we are given sequence of examples xn yn independently drawn according to probability measure defined on the sample space based on the training examples we wish to learn prediction rule from space of hypotheses mapping from to and use the mapping arg to predict ties are broken by favoring classes with lower index for which our loss function defined below always counts an error for any hypothesis the margin ρh of the function at labeled example is ρh the prediction rule makes an error at if ρh and thus the expected risk incurred from using for prediction is any function can be equivalently represented by the function ρh the class hc with hj we denote by of margin functions associated to let be mercer kernel with being the associated feature map hφ for all we denote by the dual norm of hw for convex function we denote by its fenchel conjugate supw hw vi for any wc we define the pc by kwj for any we denote by the dual exponent of satisfying and we require the following definitions definition strong convexity function is said to be convex norm iff and we have kx definition regular loss we call loss if it satisfies the following properties αx αf bounds the loss from above ii is in the sense iii is decreasing and it has zero point some examples of loss functions include the hinge and the margin loss main results our discussion on generalization error bounds is based on the established methodology of rademacher and gaussian complexities definition rademacher and gaussian complexity let be family of functions defined on and zn fixed sample of size with elements in then the empirical rademacher and gaussian complexities of with respect to the sample are defined by rs eσ sup σi zi gs eg sup gi zi where σn are independent random variables with equal probability taking values or and gn are independent random variables note that we have the following comparison inequality relating rademacher and gaussian complexities cf section in πp log nrs rs gs existing work on generalization bounds for classifiers builds on the following structural result on rademacher complexities lemma rs max hc hj hj rs hj where hc are hypothesis sets this result is crucial for the standard generalization analysis of classification since the margin 
ρh involves the maximum operator which is removed by but at the expense of linear dependency on the class size in the following we show that this linear dependency is suboptimal because does not take into account the coupling among different classes for example common regularizer used in learning algorithms is pc khj for which the components hc are correlated via regularizer and the bound ignoring this correlation would not be effective in this case as remedy we here introduce new structural complexity result on function classes induced by general classes via the maximum operator while allowing to preserve the correlations among different components meanwhile instead of considering the rademacher complexity lemma concerns the structural relationship of gaussian complexities since it is based on comparison result among different gaussian processes lemma structural result on gaussian complexity let be class of functions defined on with let gnc be independent distributed random variables then for any sample xn of size we have gs sup hj xi max hc hc eg hc where eg denotes the expectation to the gaussian variables gnc the proof of lemma is given in supplementary material equipped with lemma we are now able to present general generalization bound the proof of the following results theorem theorem and corollary is given in supplementary material theorem generalization bound for classification let rx be hypothesis class with let be loss function and denote sup ρh suppose that the examples xn yn are independently drawn from probability measure defined on then for any with probability at least the following classification generalization bound holds for any log ρh xi yi eg sup hj xi hc where gnc are independent distributed random variables remark under the same condition of theorem derive the following generalization bound cf corollary in log ρh xi yi rs this linear dependence on is due to the of for comparison theorem implies that the puse pc dependence on is governed by the term hj xi an advantage of which is that the components hc are jointly coupled as we will see this allows us to derive an improved result with favorable dependence on when constraint is imposed on hc the following theorem applies the general result in theorem to methods the hypothesis space is defined by imposing constraint with general strongly convex function theorem generalization bound for learning algorithms and suppose that the hypothesis space is defined by hf hw hwc where is convex function norm defined on satisfying let be loss function and denote sup ρh let gnc be independent distributed random variables then for any with probability at least we have log πλ hw ρhw xi yi eg xi we now consider the following specific hypothesis spaces using constraint hp hw hwc corollary generalization bound let be loss function and denote sup ρh then with probability at least for any hw hp the generalization error hw can be upper bounded by log log log if log ρhw xi yi xi xi cp otherwise remark the bounds in corollary enjoy mild dependence on the number of classes the dependence is polynomial with exponent for log and becomes logarithmic if log even in the theoretically unfavorable case of the bounds still exhibit radical dependence on the number of classes which is substantially milder than the quadratic dependence established in and the linear dependence established in our generalization bound is and shows clearly how the margin would affect the generalization performance when is the margin loss large margin would increase the 
empirical error while decrease the model complexity and vice versa comparison of the achieved bounds to the state of the art related work on bounds the large body of theoretical work on learning considers bounds based on the covering number bound of linear operators obtain generalization bound exhibiting linear dependence on the class size which is improved by to radical dependence of the form log ρc under conditions analogous to corollary derive independent generalization guarantee however their bound is based on delicate definition of margin which is why it is commonly not used in the mainstream literature derive the following generalization bound log ep inf log ep sup λn λn where is margin condition scaling factor and regularization parameter eq is independent yet corollary shows superiority in the following aspects first for svms pn margin loss our bound consists of an empirical error ρhw xi yi and complexity term divided by the margin value note that in corollary when the margin is large which is often desirable the last term in the bound given by corollary becomes small the bound is an increasing function of which is undesirable secondly theorem applies to general loss functions expressed through strongly convex function over general hypothesis space while the bound only applies to specific regularization algorithm lastly all the above mentioned results are conservative estimates related work on bounds the techniques used in above mentioned papers do not straightforwardly translate to bounds which is the type of bounds in the focus of the present work the investigation of these was initiated to our best knowledge by with the structural complexity bound for function classes induced via the maximal operator derive margin bound admitting quadratic dependency on the number of classes use these results in to study the generalization performance of where the components hc are coupled with an constraint due to the usage of the suboptimal eq obtain margin bound growing quadratically the number of classes develop new classification algorithm based on natural notion called the margin of kernel also present novel rademacher complexity margin bound based on eq and the bound also depends quadratically on the class size more recently give refined rademacher complexity bound with linear dependence on the class size the key reason for this improvement is the introduction of ρθ bounding margin ρh from below and since the maximum operation in ρθ is applied to the set rather than the subset yi for ρh one needs not to consider the random realization of yi we also use this trick in our proof of theorem however fail to improve this linear dependence to logarithmic dependence as we achieved in corollary due to the use of the suboptimal structural result algorithms motivated by the generalization analysis given in section we now present new learning algorithm based on performing empirical risk minimization in the hypothesis space this corresponds to the following problem primal problem min kwj ti ti hwyi xi maxhwy xi for we recover the seminal algorithm by crammer singer cs which is thus special case of the proposed formulation an advantage of the proposed approach over can be that as shown in corollary the dependence of the generalization performance on the class size becomes milder as decreases to dual problems since the optimization problem is convex we can derive the associated dual problem for the construction of efficient optimization algorithms the derivation of the following dual problem is deferred 
to supplementary material for matrix we denote by αi the row denote by ej the unit vector in rc and the vector in rc with all components being zero problem completely dualized problem for general loss the lagrangian dual of is αiy sup αij xi αij αi yi theorem epresenter theorem for any dual variable the associated primal variable wc minimizing the lagrangian saddle problem can be represented by wj xi αij xi αij xi for the hinge loss we know its conjugate is if αiyi αiyi αiyi if and elsewise now and elsewise hence we have the following dual problem for the hinge loss function problem completely dualized problem for the hinge loss sup αij xi αiyi αi eyi αi optimization algorithms the dual problems and are not quadratic programs for and thus generally not easy to solve to circumvent this difficulty we rewrite problem as the following equivalent problem min kwj ti ti hwyi xi hwy xi yi βj the class weights βc in eq play similar role as the kernel weights in mkl algorithms the equivalence between problem and eq follows directly from lemma in which shows that the optimal βc in eq can be explicitly represented in closed form motivated by the recent work on mkl we propose to solve the problem via alternately optimizing and as we will show given temporarily fixed the optimization of reduces to standard classification problem furthermore the update of given fixed can be achieved via an analytic formula problem partially dualized problem for general loss for fixed the partial dual problem for the problem is sup αiy βj αij xi αij αi yi the primal variable minimizing the associated lagrangian saddle problem is wj βj αij xi we defer the proof to supplementary material analogous to problem we have the following partial dual problem for the hinge loss problem partially dualized problem for the hinge loss sup βj αij xi αiyi αi eyi αi the problems and are quadratic so we can use the dual coordinate ascent algorithm to very efficiently solve them for the case of linear kernels to this end we need to compute the gradient and solve the restricted problem of optimizing only one αi keeping all other dual variables fixed the gradient of can be exactly represented by xi hwj xi suppose the additive change to be applied to the current αi is δαi then from we have αi δαi αn δαij βj xi xi δαij const ij therefore the of optimizing δαi is given by max δαi βj xi xi δαij δαij ij δαi eyi αi δαi we now consider the subproblem of updating class weights with temporarily fixed for which we have the following analytic solution the proof is deferred to the supplementary material proposition solving the subproblem with respect to the class weights given fixed wj the minimal βj optimizing the problem is attained at βj kwj the update of βj based on eq requires calculating kwj which can be easily fulfilled by recalling the representation established in eq the resulting training algorithm for the proposed is given in algorithm the algorithm alternates between solving problem for fixed class weights line and updating the class weights in manner line recall that problem establishes completely dualized problem which can be used as sound stopping criterion for algorithm algorithm training algorithm for input examples xi yi and the kernel initialize βj wj for all while optimality conditions are not satisfied do optimize the classification problem compute kwj for all according to eq update βj for all according to eq end empirical analysis we implemented the proposed algorithm algorithm in and solved the involved problem using dual coordinate 
ascent we experiment on six benchmark datasets the sector dataset studied in the news dataset collected by the dataset collected by the birds birds as part from and the caltech collected by we used features from the bvlc reference caffenet from table gives an information on these datasets we compare with the classical cs in which constitutes strong baseline for these datasets we employ cross validation on the training set to tune the regularization parameter by grid search over the set and from to with equidistant points we repeat the experiments times and report in table on the average accuracy and standard deviations attained on the test set dataset no of classes no of training examples no of test examples no of attributes sector news birds birds caltech table description of datasets used in the experiments method dataset sector news birds birds caltech crammer singer table accuracies achieved by cs and the proposed on the benchmark datasets we observe that the proposed consistently outperforms cs on all considered datasets specifically our method attains accuracy gain on sector accuracy gain on news accuracy gain on accuracy gain on birds accuracy gain on birds and accuracy gain on birds we perform wilcoxon signed rank test between the accuracies of cs and our method on the benchmark datasets and the is which means our method is significantly better than cs at the significance level of these promising results indicate that the proposed could further lift the state of the art in multiclass classification even in applications beyond the ones studied in this paper conclusion motivated by the ever growing size of datasets in applications such as image annotation and web advertising which involve tens or hundreds of thousands of classes we studied the influence of the class size on the generalization behavior of classifiers we focus here on generalization bounds enjoying the ability to capture the properties of the distribution that has generated the data of independent interest for hypothesis classes that are given as maximum over base classes we developed new structural result on gaussian complexities that is able to preserve the coupling among different components while the existing structural results ignore this coupling and may yield suboptimal generalization bounds we applied the new structural result to study learning rates for classifiers and derived for the first time bound with logarithmic dependence on the class size which substantially outperforms the linear dependence in the generalization bounds motivated by the theoretical analysis we proposed novel where the parameter controls the complexity of the corresponding bounds this class of algorithms contains the classical cs as special case for we developed an effective optimization algorithm based on the fenchel dual representation for several standard benchmarks taken from various domains the proposed approach surpassed the method of cs by up to future direction will be to derive bound that is completely independent of the class size even overcoming the mild logarithmic dependence here to this end we will study more powerful structural results than lemma for controlling complexities of function classes induced via the maximum operator as good starting point we will consider covering numbers acknowledgments we thank mehryar mohri for helpful discussions this work was partly funded by the german research foundation dfg award kl references zhang independent generalization analsysis of some discriminative classification in advances 
in neural information processing systems pp hofmann cai and ciaramita learning with taxonomies classifying documents and words in nips workshop on syntax semantics and statistics deng dong socher li li and imagenet hierarchical image database in computer vision and pattern recognition cvpr ieee conference on pp ieee beygelzimer langford lifshits sorkin and strehl conditional probability tree estimation analysis and algorithms in proceedings of uai pp auai press bengio weston and grangier label embedding trees for large tasks in advances in neural information processing systems pp jain and kapoor active learning for large problems in computer vision and pattern recognition cvpr ieee conference on pp ieee dekel and shamir classification with more classes than examples in international conference on artificial intelligence and statistics pp gupta bengio and weston training highly multiclass classifiers the journal of machine learning research vol no pp zhang statistical analysis of some large margin classification methods the journal of machine learning research vol pp tewari and bartlett on the consistency of multiclass classification methods the journal of machine learning research vol pp glasmachers universal consistency of support vector classification in advances in neural information processing systems pp mohri rostamizadeh and talwalkar foundations of machine learning mit press kuznetsov mohri and syed deep boosting in advances in neural information processing systems pp koltchinskii and panchenko empirical margin distributions and bounding the generalization error of combined classifiers annals of statistics pp guermeur combining discriminant models with new svms pattern analysis applications vol no pp oneto anguita ghio and ridella the impact of unlabeled patterns in rademacher complexity theory for kernel classifiers in advances in neural information processing systems pp koltchinskii and panchenko rademacher processes and bounding the risk of function learning in high dimensional probability ii pp springer cortes mohri and rostamizadeh classification with maximum margin multiple kernel in pp kloft brefeld sonnenburg and zien multiple kernel learning the journal of machine learning research vol pp crammer and singer on the algorithmic implementation of multiclass vector machines the journal of machine learning research vol pp bartlett and mendelson rademacher and gaussian complexities risk bounds and structural results mach learn vol pp ledoux and talagrand probability in banach spaces isoperimetry and processes vol berlin springer hill and doucet framework for artif intell res jair vol pp micchelli and pontil learning the kernel function via regularization journal of machine learning research pp keerthi sundararajan chang hsieh and lin sequential dual method for large scale linear svms in acm sigkdd pp acm rennie and rifkin improving multiclass text classification with the support vector machine tech mit lang newsweeder learning to filter netnews in proceedings of the international conference on machine learning pp lewis yang rose and li new benchmark collection for text categorization research the journal of machine learning research vol pp welinder branson mita wah schroff belongie and perona birds tech california institute of technology jia shelhamer donahue karayev long girshick guadarrama and darrell caffe convolutional architecture for fast feature embedding arxiv preprint 
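Because the displayed primal problem above is garbled, the following sketch (an assumption, not the paper's reference implementation) spells out one plausible reading of the proposed objective: a block (2, p)-norm penalty on the per-class weight vectors plus the Crammer–Singer multi-class hinge loss; for p = 2 this reduces to the classical Crammer and Singer objective. The exact placement of constants and exponents in the regularizer, and all names below, are assumptions.

```python
# Minimal sketch of an l_{2,p}-regularized multi-class SVM objective
# (block-norm penalty + Crammer-Singer hinge); exponent/constant placement assumed.
import numpy as np

def group_norm(W, p):
    """||W||_{2,p} = ( sum_j ||w_j||_2^p )^(1/p), one row of W per class."""
    return np.sum(np.linalg.norm(W, axis=1) ** p) ** (1.0 / p)

def multiclass_hinge(W, X, y):
    """Crammer-Singer hinge: max(0, 1 + max_{j != y_i} <w_j, x_i> - <w_{y_i}, x_i>)."""
    scores = X @ W.T                              # shape (n, c)
    n = X.shape[0]
    true_scores = scores[np.arange(n), y]
    scores[np.arange(n), y] = -np.inf             # exclude the true class from the max
    violations = 1.0 + scores.max(axis=1) - true_scores
    return np.maximum(0.0, violations).sum()

def objective(W, X, y, p=1.5, C=1.0):
    return 0.5 * group_norm(W, p) ** 2 + C * multiclass_hinge(W, X, y)

# toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = rng.integers(0, 5, size=50)
W = 0.01 * rng.standard_normal((5, 10))
print(objective(W, X, y, p=1.5))
```

Smaller p pushes the per-class norms toward a sparser, more uneven profile, which is the regime where the bounds above promise the mildest dependence on the class size; the paper's actual solver works on the (partial) Fenchel dual rather than this primal form.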
neural spike train analysis with generalized count linear dynamical systems lars buesing department of statistics columbia university new york ny lars yuanjun gao department of statistics columbia university new york ny krishna shenoy department of electrical engineering stanford university stanford ca shenoy john cunningham department of statistics columbia university new york ny abstract latent factor models have been widely used to analyze simultaneous recordings of spike trains from large heterogeneous neural populations these models assume the signal of interest in the population is latent intensity that evolves over time which is observed in high dimension via noisy observations these techniques have been well used to capture neural correlations across population and to provide smooth denoised and concise representation of spiking data one limitation of many current models is that the observation model is assumed to be poisson which lacks the flexibility to capture and that is common in recorded neural data thereby introducing bias into estimates of covariance here we develop the generalized count linear dynamical system which relaxes the poisson assumption by using more general exponential family for count data in addition to containing poisson bernoulli negative binomial and other common count distributions as special cases we show that this model can be tractably learned by extending recent advances in variational inference techniques we apply our model to data from primate motor cortex and demonstrate performance improvements over methods both in capturing the variance structure of the data and in prediction introduction many studies and theories in neuroscience posit that populations of neural spike trains are noisy observation of some underlying and signal of interest as such over the last decade researchers have developed and used number of methods for jointly analyzing populations of simultaneously recorded spike trains and these techniques have become critical part of the neural data analysis toolkit in the supervised setting generalized linear models glm have used stimuli and spiking history as covariates driving the spiking of the neural population in the unsupervised setting latent variable models have been used to extract hidden structure that captures the variability of the recorded data both temporally and across the population of neurons in both these settings however limitation is that spike trains are typically assumed to be conditionally poisson given the shared signal the poisson assumption while offering algorithmic conveniences in many cases implies the property of equal dispersion the conditional mean and variance are equal this property is particularly troublesome in the analysis of neural spike trains which are commonly observed to be either or variance greater than or less than the mean no doubly stochastic process with poisson observation can capture and while such model can capture it must do so at the cost of erroneously attributing variance to the latent signal rather than the observation process to allow for deviation from the poisson assumption some previous work has instead modeled the data as gaussian or using more general renewal process models the former of which does not match the count nature of the data and has been found inferior and the latter of which requires costly inference that has not been extended to the population setting more general distributions like the negative binomial have been proposed but again these families do not generalize 
to cases of under-dispersion. Furthermore, these more general distributions have not yet been applied to the important setting of latent variable models. Here we employ an exponential-family distribution that addresses these needs and includes much previous work as special cases. We call this distribution the generalized count (GC) distribution, and we offer here four main contributions: (i) we introduce the GC distribution and derive a variety of commonly used distributions that are special cases, using the GLM as a motivating example; (ii) we combine this observation likelihood with a latent linear dynamical systems prior to form the GC linear dynamical system (GCLDS); (iii) we develop a variational learning algorithm by extending current methods to the GCLDS setting; and (iv) we show, in data from the primate motor cortex, that the GCLDS model provides superior predictive performance and in particular captures data covariance better than Poisson-based models.

Generalized count distributions. We define the generalized count distribution as the family of probability distributions

p_GC(k | θ, g) = exp(θ k + g(k)) / (k! · M(θ, g)),   k = 0, 1, 2, ...,

where θ ∈ R, the function g over the nonnegative integers parameterizes the distribution, and M(θ, g) = Σ_{k ≥ 0} exp(θ k + g(k)) / k! is the normalizing constant. The primary virtue of the GC family is that it covers all common count distributions as special cases and naturally parameterizes many common supervised and unsupervised models, as will be shown. For example, the choice g ≡ 0 implies a Poisson distribution with rate parameter exp(θ). Generalizations of the Poisson distribution have a long history in statistics, and the earlier work introducing the GC family proved two additional properties: first, that the expectation of any GC distribution is monotonically increasing in θ for fixed g, and second (perhaps most relevant to this study) that concave and convex g functions imply under- and over-dispersed GC distributions, respectively. Furthermore, often-desired features like zero truncation or zero inflation can be naturally incorporated by modifying the g(0) value. Thus, with θ controlling the log rate of the distribution and g controlling its shape, the GC family provides a rich model class for capturing the spiking statistics of neural data. Other discrete distribution families do exist, such as the Conway–Maxwell–Poisson distribution and ordinal regression models, but the GC family offers a rich exponential family, which makes computation somewhat easier and allows the g functions to be interpreted.

Figure 1 demonstrates the relevance of modeling dispersion in neural data analysis. The left panel shows a scatterplot where each point is an individual neuron in a recorded population of neurons from primate motor cortex (experimental details will be described in the experiments section). Plotted are the mean and variance of the spiking activity of each neuron, with activity binned in time. For reference, the line implied by a homogeneous Poisson process is plotted in red; note further that all doubly stochastic Poisson models would have an implied dispersion above this Poisson line. These data clearly demonstrate meaningful under-dispersion, underscoring the need for the present advance. The right panel demonstrates the appropriateness of the GC model class, showing how concave, linear, and convex choices of g shape the mean–variance relationship the model will produce. Given the left panel, we expect under-dispersed GC distributions to be most relevant here, but many neural datasets also exhibit over-dispersion, highlighting the need for a flexible observation family.

Figure 1: Left panel: mean firing rate and variance of neurons in primate motor cortex during the reaching period of a reaching experiment (see the experiments section). The data exhibit under-dispersion, especially for high-rate neurons; the two marked neurons will be analyzed in
detail in figure right panel the expectation and variance of the gc distribution with different choices of the function to illustrate the generality of the gc family and to lay the foundation for our unsupervised learning approach we consider briefly the case of supervised learning of neural spike train data where generalized linear models glm have been used extensively we define gcglm as that which models single neuron with count data yi and associated covariates xi rp as yi gc xi where xi xi here gc denotes random variable distributed according to rp are the regression coefficients this gcglm model is highly general table shows that many of the commonly used models are special cases of gcglm by restricting the function to have certain parametric form in addition to this convenient generality one benefit of our parametrization of the gc model is that the curvature of directly measures the extent to which the data deviate from the poisson assumption allowing us to meaningfully interrogate the form of note that has no intercept term because it can be absorbed in the function as linear term αk see table unlike previous gc work our parameterization implies that maximum likelihood parameter estimation mle is tractable convex program which can be seen by considering log yi arg max xi yi yi log xi arg max first note that although we have to optimize over function that is defined on all integers we can exploit the empirical support of the distribution to produce finite optimization problem namely for any that is not achieved by any data point yi the count the mle for must be and thus we only need to optimize for that have empirical support in the data thus is finite dimensional vector to avoid the potential overfitting caused by truncation of gi beyond the empirical support of the data we can enforce large finite support and impose quadratic penalty on the second difference of to encourage linearity in which corresponds to poisson distribution second note that we can fix without loss of generality which ensures model identifiability with these constraints the remaining values can be fit as free parameters or as set of linear inequalities on similarly for concave case finally problem convexity is ensured as all terms are either linear or linear within the function leading to fast optimization algorithms generalized count linear dynamical system model with the gc distribution in hand we now turn to the unsupervised setting namely coupling the gc observation model with latent dynamical system our model is generalization table special cases of gcglm for all models the gcglm parametrization for is only associated with the slope βx and the intercept is absorbed into the function in all cases we have outside the stated support of the distribution whenever unspecified the support of the distribution and the domain of the function are integers model name logistic regression typical parameterization exp xβ exp xβ λk exp exp xβ poisson regression adjacent category regression exp αk xβ gcglm parametrization αk αk log negative binomial regression regression pk exp xβ λk λj log αk log exp xβ of linear dynamical systems with poisson likelihoods plds which have been extensively used for analysis of populations of neural spike trains denoting yrti as the observed of neuron at time on experimental trial the plds assumes that the spike activity of neurons is noisy poisson observation of an underlying latent state xrt rp where such that yrti poisson exp xrt di here cn rn is the factor loading matrix mapping the 
latent state xrt to log rate with time and trial invariant baseline log rate rn thus the vector cxrt denotes the vector of log rates for trial and time critically the latent state xrt can be interpreted as the underlying signal of interest that acts as the common input signal to all neurons which is modeled priori as linear gaussian dynamical system to capture temporal correlations xr axrt bt where rp and parameterize the initial state the transition matrix and innovations covariance parameterize the dynamical state update the optional term bt rp allows the model to capture firing rate that is fixed across experimental trials the plds has been widely used and has been shown to outperform other models in terms of predictive performance including in particular the simpler gaussian linear dynamical system the plds model is naturally extended to what we term the generalized count linear dynamical system gclds by modifying equation using gc likelihood yrti gc xrt gi where gi is the function in that models the dispersion for neuron similar to the glm for identifiability the baseline rate parameter is dropped in and we can fix as with the gcglm one can recover preexisting models such as an lds with bernoulli observation as special cases of gclds see table inference and learning in gclds as is common in lds models we use to learn parameters bt gi because the required expectations do not admit closed form as in previous similar work we required an additional approximation step which we implemented via variational lower bound here we briefly outline this algorithm and our novel contributions and we refer the reader to the full details in the supplementary materials first each requires calculating xr for each trial the conditional distribution of the latent trajectories xr xrt given observations yr yrti and parameter for ease of notation below we drop the trial index these posterior distributions are intractable and in the usual way we make normal approximation we identify the optimal by maximizing variational bayesian lower bound the evidence lower bound or elbo over the variational parameters as log eq log log tr eq xt log yti const which is the usual form to be maximized in variational bayesian em vbem algorithm here rpt and rpt are the expectation and variance of given by the lds prior in the first term of is the negative divergence between the variational distribution and prior distribution encouraging the variational distribution to be close to the prior the second term involving the gc likelihood encourages the variational distribution to explain the observations well the integrations in the second term are intractable this is in contrast to the plds case where all integrals can be calculated analytically below we use the ideas of to derive tractable further lower bound here the term eq xt log yti can be reduced to eq xt log yti ηti log pgc gi ηti exp kηti gi yti ηti gi yti log yti log where ηti cti xt denoting νtik kηti gi log kcti xt gi log is reduced to eq νtiyti log exp νtik since νtik is linear transformation of xt under the variational distribution νtik is also normally distributed νtik htik ρtik we have htik kcti mt ρtik cti vt ci where mt vt are the expectation and covariance matrix of xt under variational distribution now we can derive lower bound for the expectation by jensen inequality eq νti νtiyti log exp νtik log exp htik ρtik fti hti ρti combining and we get tractable variational lower bound eq log fti hti ρti for computational convenience we complete the by maximizing the new 
evidence lower bound via its dual full details are derived in the supplementary materials the then requires maximization of over similar to the plds case the set of parameters involving the latent gaussian dynamics bt can be optimized analytically then the parameters involving the gc likelihood gi can be optimized efficiently via convex optimization techniques full details in supplementary material in practice we initialize our vbem algorithm with algorithm and we initialize each in vbem with laplace approximation which empirically gives substantial runtime advantages and always produces sensible optimum with the above steps we have fully specified learning and inference algorithm which we now use to analyze real neural data code can be found at https experimental results we analyze recordings of populations of neurons in the primate motor cortex during reaching experiment details of which have been described previously in brief rhesus macaque monkey executed cued reaches from central target to peripheral targets before the subject was cued to move the go cue it was given preparatory period to plan the upcoming reach each trial was thus separated into two temporal epochs each of which has been suggested to have their own meaningful dynamical structure we separately analyze these two periods the preparatory period period preceding the go cue and the reaching period before to after the movement onset we analyzed data across all reach targets and results were highly similar in the following for simplicity we show results for single reaching target one trial dataset spike trains were simultaneously recorded from electrodes using blackrock array we bin neural activity at to include only units with robust activity we remove all units with mean rates less than spike per second on average resulting in units for the preparatory period and units for the reaching period as we have already shown in figure the reaching period data are strongly even absent conditioning on the latent dynamics implying further in the observation noise data during the preparatory period are particularly interesting due to its clear structure to fully assess the gclds model we analyze four lds models separate function gi is fitted for each neuron ii single function is shared across all neurons up to linear term modulating the baseline firing rate iii truncated linear function gi is fitted which corresponds to observations and iv plds the poisson case is recovered when gi is linear function on all nonnegative integers in all cases we use the learning and inference of we initialize the plds using nuclear norm minimization and initialize the gclds models with the fitted plds for all models we vary the latent dimension from to to demonstrate the generality of the gclds and verify our algorithmic implementation we first considered extensive simulated data with different gclds parameters not shown in all cases gclds model outperformed plds in terms of negative nll on test data with high statistical significance we also compared the algorithms on plds data and found very similar performance between gclds and plds implying that gclds does not significantly overfit despite the additional free parameters and computation due to the functions analysis of the reaching period figure compares the fits of the two neural units highlighted in figure these two neurons are particularly during the reaching period and thus should be most indicative of the differences between the plds and gclds models the left column of figure shows the fitted 
functions for the four LDS models being compared. It is apparent in both the per-neuron and shared-g cases that the fitted function is concave (though it was not constrained to be so), agreeing with the under-dispersion observed in Figure 1. The middle column of Figure 2 shows that all four cases produce models that fit the mean activity of these two neurons very well: the black trace shows the empirical mean of the observed data, and all four fitted lines (highly overlapping and thus not entirely visible) follow that empirical mean closely. This result is confirmatory, in that the GCLDS matches the mean as well as the currently standard PLDS. More importantly, we have noted that the key feature of the GCLDS is matching the dispersion of the data, and thus we expect it to outperform the PLDS in fitting variance. The right column of Figure 2 shows this to be the case. The PLDS significantly overestimates the variance of the data, whereas the model with per-neuron g functions tracks the empirical variance quite closely in both neurons. The GCLDS-linear (truncated) result shows that only adding truncation does not materially improve the estimate of variance and dispersion: the dotted blue trace is quite far from the true data in black and indeed is quite close to the Poisson case. The shared-g model still outperforms the PLDS, but it does not model the dispersion as effectively as the case where each neuron has its own dispersion parameter.

Figure 2: Examples of fitting results for the two selected neurons; each row corresponds to one neuron, as marked in the left panel of Figure 1. Left column: fitted g functions using GCLDS and PLDS. Middle and right columns: fitted mean and variance of PLDS and GCLDS (see text for details).

As Figure 2 suggests, the natural next question is whether this outperformance is confined to these two illustrative neurons or whether it is a population-level effect. Figure 3 shows that indeed the population is much better modeled by the GCLDS than by competing alternatives. The left and middle panels of Figure 3 show the prediction error of the LDS models for each reaching target, with results averaged across all reaching targets.

Figure 3: Results for the monkey data during the reaching period. Left panel: percentage reduction of MSE compared to the baseline homogeneous Poisson process. Middle panel: percentage reduction of predictive negative log likelihood (NLL) compared to the baseline. Right panel: fitted variance of PLDS and GCLDS for all neurons compared to the observed data; each point gives the observed and fitted variance of a single neuron, averaged across time.

Critically, these predictions are made for all neurons in the population. To give informative performance metrics, we define baseline performance as a straightforward homogeneous Poisson process for each neuron and compare the LDS models with this baseline using the percentage reduction of mean squared error (MSE) and of predictive negative log likelihood; thus higher error-reduction numbers imply better performance. The MSE result (left panel) shows that the GCLDS offers a minor further reduction in MSE beyond what is achieved by the PLDS; although the standard error bars suggest an insignificant result, a paired test is indeed significant. Nonetheless, this modest result agrees with the middle column of Figure 2, since predictive MSE is essentially a measurement of the mean. In the middle panel of Figure 3, we see that the GCLDS significantly outperforms alternatives in predictive log
likelihood across the population paired again this largely agrees with the implication of figure as negative log likelihood measures both the accuracy of mean and variance the right panel of figure shows that the gclds fits the variance of the data exceptionally well across the population unlike the plds analysis of the preparatory period to augment the data analysis we also considered the preparatory period of neural activity when we repeated the analyses of figure on this dataset the same results occurred the gclds model produced concave or close to concave functions and outperformed the plds model both in predictive mse minority and negative log likelihood significantly for brevity we do not show this analysis here instead we here compare the temporal which is also common analysis of interest in neural data analysis and as noted is particularly salient in preparatory activity figure shows that gclds model fits both the temporal left panel and variance right panel considerably better than plds which overestimates both quantities covariance recorded data plds fitted variance time lag ms plds observed variance figure for monkey data during the preparatory period left panel temporal averaged over all units during the preparatory period compared to the fitted by plds and right panel fitted variance of plds and for all neurons compared to the observed data averaged across time discussion in this paper we showed that the gc family better captures the conditional variability of neural spiking data and further improves inference of key features of interest in the data we note that it is straightforward to incorporate external stimuli and spike history in the model as covariates as has been done previously in the poisson case beyond the gcglm and gclds the gc family is also extensible to other models that have been used in this setting such as exponential family pca and subspace clustering the cost of this performance compared to the plds is an extra parameterization the gi functions and the corresponding algorithmic complexity while we showed that there seems to be no empirical sacrifice to doing so it is likely that data with few examples and reasonably poisson dispersion may cause gclds to overfit acknowledgments jpc received funding from sloan research fellowship the simons foundation scgb and scgb the grossman center at columbia university and the gatsby charitable trust thanks to byron yu gopal santhanam and stephen ryu for providing the cortical data references cunningham and yu dimensionality reduction for neural recordings nature neuroscience vol no pp paninski maximum likelihood estimation of cascade neural encoding models network computation in neural systems vol no pp truccolo eden fellows donoghue and brown point process framework for relating neural spiking activity to spiking history neural ensemble and extrinsic covariate effects journal of neurophysiology vol no pp pillow shlens paninski sher litke chichilnisky and simoncelli spatiotemporal correlations and visual signalling in complete neuronal population nature vol no pp vidne ahmadian shlens pillow kulkarni litke chichilnisky simoncelli and paninski modeling the impact of common noise inputs on the network activity of retinal ganglion cells journal of computational neuroscience vol no pp kulkarni and paninski models for multiple neural data network computation in neural systems vol no pp yu cunningham santhanam ryu shenoy and sahani factor analysis for analysis of neural population activity in nips pp macke buesing cunningham 
yu shenoy and sahani empirical models of spiking in neural populations in nips pp petreska yu cunningham santhanam ryu shenoy and sahani dynamical segmentation of single trials from population neural data in nips pp pfau pnevmatikakis and paninski robust learning of dynamics from large neural ensembles in nips pp buesing machado cunningham and paninski clustered factor analysis of multineuronal spike data in nips pp churchland yu cunningham sugrue cohen corrado newsome clark hosseini scott et stimulus onset quenches neural variability widespread cortical phenomenon nature neuroscience vol no pp cunningham yu shenoy and maneesh inferring neural firing rates from spike trains using gaussian processes in nips pp adams murray and mackay tractable nonparametric bayesian inference in poisson processes with gaussian process intensities in icml pp acm koyama on the spike train variability characterized by power relationship neural computation goris movshon and simoncelli partitioning neuronal variability nature neuroscience vol no pp scott and pillow fully bayesian inference for neural models with spiking in nips pp linderman adams and pillow inferring structured connectivity from spike trains under generalized linear models cosyne del castillo and overdispersed and underdispersed poisson generalizations journal of statistical planning and inference vol no pp emtiyaz khan aravkin friedlander and seeger fast dual variational inference for nonconjugate latent gaussian models in icml pp rao on discrete distributions arising out of methods of ascertainment the indian journal of statistics series pp lambert poisson regression with an application to defects in manufacturing technometrics vol no pp singh characterization of positive poisson distribution and its statistical application siam journal on applied mathematics vol no pp sellers and shmueli flexible regression model for count data the annals of applied statistics pp ananth and kleinbaum regression models for ordinal responses review of methods and international journal of epidemiology vol no pp paninski pillow and lewi statistical models for neural encoding decoding and optimal stimulus design progress in brain research vol pp boyd and vandenberghe convex optimization cambridge university press buesing macke and sahani learning stable regularised latent models of neural population dynamics network computation in neural systems vol no pp buesing macke and sahani estimating state and parameters in models of spike trains in advanced state space methods for neural and clinical data cambridge univ lawhern wu hatsopoulos and paninski population decoding of motor cortical activity using generalized linear model with hidden states journal of neuroscience methods vol no pp churchland cunningham kaufman foster nuyujukian ryu and shenoy neural population dynamics during reaching nature vol no pp cohen and kohn measuring and interpreting neuronal correlations nature neuroscience vol no pp 
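Before moving on, a brief illustration of the temporal-covariance comparison used in the preparatory-period analysis above: the sketch below (not from the paper) computes the time-lagged autocovariance of binned spike counts averaged across units, the statistic that is compared against the PLDS and GCLDS fits. The array layout, the function name, and the subtraction of the across-trial mean are assumptions made for illustration only.

```python
import numpy as np

def temporal_autocovariance(counts, max_lag):
    """Average temporal autocovariance across units.

    counts : array of shape (n_trials, n_timebins, n_units) of binned spike counts.
    Returns an array of length max_lag + 1 whose k-th entry is the covariance
    between counts at time t and t + k, averaged over trials, time points and units.
    """
    n_trials, n_bins, n_units = counts.shape
    # subtract the across-trial mean (PSTH) so the statistic reflects
    # trial-to-trial (co)variability rather than the stimulus-locked mean
    residual = counts - counts.mean(axis=0, keepdims=True)
    acov = np.zeros(max_lag + 1)
    for lag in range(max_lag + 1):
        a = residual[:, : n_bins - lag, :]
        b = residual[:, lag:, :]
        acov[lag] = np.mean(a * b)
    return acov

# Hypothetical usage: compare the statistic on recorded counts with counts
# simulated from a fitted model (e.g. PLDS or GCLDS), as in the
# preparatory-period comparison described above.
# acov_data  = temporal_autocovariance(recorded_counts, max_lag=20)
# acov_model = temporal_autocovariance(simulated_counts, max_lag=20)
```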
learning with wasserstein loss charlie chiyuan center for brains minds and machines massachusetts institute of technology frogner chiyuan mauricio shell international hossein mobahi csail massachusetts institute of technology hmobahi tomaso poggio center for brains minds and machines massachusetts institute of technology tp abstract learning to predict outputs is challenging but in many problems there is natural metric on the outputs that can be used to improve predictions in this paper we develop loss function for learning based on the wasserstein distance the wasserstein distance provides natural notion of dissimilarity for probability measures although optimizing with respect to the exact wasserstein distance is costly recent work has described regularized approximation that is efficiently computed we describe an efficient learning algorithm based on this regularization as well as novel extension of the wasserstein distance from probability measures to unnormalized measures we also describe statistical learning bound for the loss the wasserstein loss can encourage smoothness of the predictions with respect to chosen metric on the output space we demonstrate this property on tag prediction problem using the yahoo flickr creative commons dataset outperforming baseline that doesn use the metric introduction we consider the problem of learning to predict measure over finite set this problem includes many common machine learning scenarios in multiclass classification for example one often predicts vector of scores or probabilities for the classes and in semantic segmentation one can model the segmentation as being the support of measure defined over the pixel locations many problems in which the output of the learning machine is both and might be cast as predicting measure we specifically focus on problems in which the output space has natural metric or similarity structure which is known or estimated priori in practice many learning problems have such structure in the imagenet large scale visual recognition challenge ilsvrc for example the output dimensions correspond to object categories that have inherent semantic relationships some of which are captured in the wordnet hierarchy that accompanies the categories similarly in the keyword spotting task from the iarpa babel speech recognition project the outputs correspond to keywords that likewise have semantic relationships in what follows we will call the similarity structure on the label space the ground metric or semantic similarity using the ground metric we can measure prediction performance in way that is sensitive to relationships between the different output dimensions for example confusing dogs with cats might authors contributed equally code and data are available at http divergence wasserstein distance distance divergence grid size wasserstein noise figure the wasserstein loss encourages predictions that are similar to ground truth robustly to incorrect labeling of similar classes see appendix shown is euclidean distance between prediction and ground truth left number of classes averaged over different noise levels and right noise level averaged over number of classes baseline is the multiclass logistic loss be more severe an error than confusing breeds of dogs loss function that incorporates this metric might encourage the learning algorithm to favor predictions that are if not completely accurate at least semantically similar to the ground truth in this paper we develop loss function for learning that measures the wasserstein 
distance between prediction and the target label with respect to chosen metric on the output space the wasserstein distance is defined as the cost of the optimal transport plan for moving the mass in the predicted measure to match that in the target and has been applied to wide range of problems including barycenter estimation label propagation and clustering to our knowledge this paper represents the first use of the wasserstein distance as loss for supervised learning siberian husky eskimo dog figure semantically nearequivalent classes in ilsvrc we briefly describe case in which the wasserstein loss improves learning performance the setting is multiclass classification problem in which label noise arises from confusion of semantically categories figure shows such case from the ilsvrc in which the categories siberian husky and eskimo dog are nearly indistinguishable we synthesize toy version of this problem by identifying categories with points in the euclidean plane and randomly switching the training labels to nearby classes the wasserstein loss yields predictions that are closer to the ground truth robustly across all noise levels as shown in figure the standard multiclass logistic loss is the baseline for comparison section in the appendix describes the experiment in more detail the main contributions of this paper are as follows we formulate the problem of learning with prior knowledge of the ground metric and propose the wasserstein loss as an alternative to traditional information loss functions specifically we focus on empirical risk minimization erm with the wasserstein loss and describe an efficient learning algorithm based on entropic regularization of the optimal transport problem we also describe novel extension to unnormalized measures that is similarly efficient to compute we then justify erm with the wasserstein loss by showing statistical learning bound finally we evaluate the proposed loss on both synthetic examples and image annotation problem demonstrating benefits for incorporating an output metric into the loss related work decomposable loss functions like kl divergence and distances are very popular for probabilistic or predictions as each component can be evaluated independently often leading to simple and efficient algorithms the idea of exploiting smoothness in the label space according to prior metric has been explored in many different forms including regularization and with graphical models optimal transport provides natural distance for probability distributions over metric spaces in the optimal transport is used to formulate the wasserstein barycenter as probability distribution with minimum total wasserstein distance to set of given points on the probability simplex propagates histogram values on graph by minimizing dirichlet energy induced by optimal transport the wasserstein distance is also used to formulate metric for comparing clusters in and is applied to image retrieval contour matching and many other problems however to our knowledge this is the first time it is used as loss function in discriminative learning framework the closest work to this paper is theoretical study of an estimator that minimizes the optimal transport cost between the empirical distribution and the estimated distribution in the setting of statistical parameter estimation learning with wasserstein loss problem setup and notation we consider the problem of learning map from rd into the space rk of measures over finite set of size assume possesses metric dk which is called the 
ground metric dk measures semantic similarity between dimensions of the output which correspond to the elements of we perform learning over hypothesis space of predictors parameterized by these might be linear logistic regression models for example in the standard statistical learning setting we get an sequence of training examples xn yn sampled from an unknown joint distribution px given measure of performance risk the goal is to find the predictor that minimizes the expected risk typically is difficult to optimize directly and the joint distribution px is unknown so learning is performed via empirical risk minimization specifically we solve min xi yi with loss function acting as surrogate of optimal transport and the exact wasserstein loss information loss functions are widely used in learning with outputs along with other popular measures like hellinger distance and distance these divergences treat the output dimensions independently ignoring any metric structure on given cost function the optimal transport distance measures the cheapest way to transport the mass in probability measure to match that in wc inf where is the set of joint probability measures on having and as marginals an important case is that in which the cost is given by metric dk or its power dpk with in this case is called wasserstein distance also known as the earth mover distance in this paper we only work with discrete measures in the case of probability measures these are histograms in the simplex when the ground truth and the output of both lie in the simplex we can define wasserstein loss definition exact wasserstein loss for any let be the predicted value at element given input let be the ground truth value for given by the corresponding label then we define the exact wasserstein loss as wpp inf ht where is the distance matrix dpk and the set of valid transport plans is where is the vector wpp is the cost of the optimal plan for transporting the predicted mass distribution to match the target distribution the penalty increases as more mass is transported over longer distances according to the ground metric algorithm gradient of the wasserstein loss given if unnormalized while has not converged do if normalized if unnormalized end while if unnormalized log log if normalized wpp diag kv if unnormalized efficient optimization via entropic regularization to do learning we optimize the empirical risk minimization functional by gradient descent doing so requires evaluating descent direction for the loss with respect to the predictions unfortunately computing subgradient of the exact wasserstein loss is quite costly as follows the exact wasserstein loss is linear program and subgradient of its solution can be computed using lagrange duality the dual lp of is wpp sup cm as is linear program at an optimum the values of the dual and the primal are equal see hence the dual optimal is subgradient of the loss with respect to its first argument computing is costly as it entails solving linear program with contraints with being the dimension of the output space this cost can be prohibitive when optimizing by gradient descent entropic regularization of optimal transport cuturi proposes smoothed transport objective that enables efficient approximation of both the transport matrix in and the subgradient of the loss introduces an entropic regularization term that results in strictly convex problem wpp inf ht log importantly the transport matrix that solves is diagonal scaling of matrix diag kdiag for and where and are the lagrange 
dual variables for identifying such matrix subject to equality constraints on the row and column sums is exactly matrix balancing problem which is in numerical linear algebra and for which efficient iterative algorithms exist and use the algorithm extending smoothed transport to the learning setting when the output vectors and lie in the simplex can be used directly in place of as can approximate the exact wasserstein distance closely for large enough in this case the gradient of the objective can be obtained from the optimal scaling vector as log log uk sinkhorn iteration for the gradient is given in algorithm note that is only defined up to constant shift any upscaling of the vector can be paired with corresponding downscaling of the vector and vice versa without altering the matrix the choice log log ensures that is tangent to the simplex convergence to smoothed approximation port wasserstein of exact convergence of alternating projections figure the relaxed transport problem for unnormalized measures for many learning problems however normalized output assumption is unnatural in image segmentation for example the target shape is not naturally represented as histogram and even when the prediction and the ground truth are constrained to the simplex the observed label can be subject to noise that violates the constraint there is more than one way to generalize optimal transport to unnormalized measures and this is subject of active study we will develop here novel objective that deals effectively with the difference in total mass between and while still being efficient to optimize relaxed transport we propose novel relaxation that extends smoothed transport to unnormalized measures by replacing the equality constraints on the transport marginals in with soft penalties with respect to kl divergence we get an unconstrained approximate transport problem the resulting objective is wkl min ht kl kl wkz log is the generalized kl divergence between where kl rk here represents division as with the previous formulation the optimal transport matrix with respect to is diagonal scaling of the matrix proposition the transport matrix optimizing satisfies diag kdiag where and and the optimal transport matrix is fixed point for iteration proposition diag kdiag optimizing satisfies and ii where kv represents multiplication unlike the previous formulation is unconstrained with respect to the gradient is given by rh wkl the iteration is given in algorithm when restricted to normalized measures the relaxed problem approximates smoothed transport figure shows for normalized and the relative distance between the values of and for large enough converges to as and increase also retains two properties of smoothed transport figure shows that for normalized outputs the relaxed loss converges to the unregularized wasserstein distance as and increase and figure shows that convergence of the iterations in is nearly independent of the dimension of the output space note that although the iteration suggested by proposition is observed empirically to converge see figure for example we have not proven guarantee that it will do so in figures and are generated as described in section in and have dimension in convergence is defined as in shaded regions are intervals the unregularized wasserstein distance was computed using fastemd posterior probability posterior probability norm norm posterior predictions for images of digit posterior predictions for images of digit figure mnist example each curve shows the predicted 
probability for one digit for models trained with different values for the ground metric statistical properties of the wasserstein loss let xn yn be samples and be the empirical risk minimizer argmin wp hx yi further assume ho is the composition of softmax and base hypothesis space ho of functions mapping into rk the softmax layer outputs prediction that lies in the simplex theorem for and any with probability at least it holds that log inf rn with the constant cm rn ho is the rademacher complexity measuring the complexity of the hypothesis space ho the rademacher complexity rn ho for commonly used models like neural networks and kernel machines decays with the training set size this theorem guarantees that the expected wasserstein loss of the empirical risk minimizer approaches the best achievable loss for as an important special case minimizing the empirical risk with wasserstein loss is also good for multiclass classification let be the encoded label vector for the groundtruth class proposition in the multiclass classification setting for and any at least it holds that ex dk inf with probability ke cm rn ho log where the predictor is with being the empirical risk minimizer note that instead of the classification error ex we actually get bound on the expected semantic distance between the prediction and the groundtruth empirical study impact of the ground metric in this section we show that the wasserstein loss encourages smoothness with respect to an artificial metric on the mnist handwritten digit dataset this is classification problem with output dimensions corresponding to the digits and we apply ground metric dp where and this metric encourages the recognized digit to be numerically close to the true one we train model independently for each value of and plot the average predicted probabilities of the different digits on the test set in figure cost cost loss function divergence wasserstein loss function divergence wasserstein wasserstein wasserstein wasserstein wasserstein of proposed tags of proposed tags original flickr tags dataset flickr tags dataset figure cost comparison of the proposed loss wasserstein and the baseline divergence note that as the metric approaches the metric which treats all incorrect digits as being equally unfavorable in this case as can be seen in the figure the predicted probability of the true digit goes to while the probability for all other digits goes to as increases the predictions become more evenly distributed over the neighboring digits converging to uniform distribution as flickr tag prediction we apply the wasserstein loss to real world learning problem using the recently released creative commons dataset our goal is tag prediction we select descriptive tags along with two random sets of images each associated with these tags for training and testing we derive distance metric between tags by using to embed the tags as unit vectors then taking their euclidean distances to extract image features we use matconvnet note that the set of tags is highly redundant and often many semantically equivalent or similar tags can apply to an image the images are also partially tagged as different users may prefer different tags we therefore measure the prediction performance by the cost pk defined as ck minj dk where is the set of groundtruth tags and are the tags with highest predicted probability the standard auc measure is also reported we find that linear combination of the wasserstein loss wpp and the standard multiclass logistic loss kl yields the best 
prediction results specifically we train linear model by minimizing wpp on the training set where controls the relative weight of kl note that kl taken alone is our baseline in these experiments figure shows the cost on the test set for the combined loss and the baseline kl loss we additionally create second dataset by removing redundant labels from the original dataset this simulates the potentially more difficult case in which single user tags each image by selecting one tag to apply from amongst each cluster of applicable semantically similar tags figure shows that performance for both algorithms decreases on the harder dataset while the combined wasserstein loss continues to outperform the baseline in figure we show the effect on performance of varying the weight on the kl loss we observe that the optimum of the cost is achieved when the wasserstein loss is weighted more heavily than at the optimum of the auc this is consistent with semantic smoothing effect of wasserstein which during training will favor mispredictions that are semantically similar to the ground truth sometimes at the cost of lower auc we finally show two selected images from the test set in figure these illustrate cases in which both algorithms make predictions that are semantically relevant despite overlapping very little with the ground truth the image on the left shows errors made by both algorithms more examples can be found in the appendix to avoid numerical issues we scale down the ground metric such that all of the distance values are in the interval the dataset used here is available at http the wasserstein loss can achieve similar by choosing the metric parameter as discussed in section however the relationship between and the smoothing behavior is complex and it can be simpler to implement the by combining with the kl loss wasserstein auc divergence auc cost auc auc cost wasserstein auc divergence auc original flickr tags dataset flickr tags dataset figure between semantic smoothness and maximum likelihood flickr user tags street parade dragon our proposals people protest parade baseline proposals music car band flickr user tags water boat reflection sunshine our proposals water river lake summer baseline proposals river water club nature figure examples of images in the flickr dataset we show the groundtruth tags and as well as tags proposed by our algorithm and the baseline conclusions and future work in this paper we have described loss function for learning to predict measure over finite set based on the wasserstein distance although optimizing with respect to the exact wasserstein loss is computationally costly an approximation based on entropic regularization is efficiently computed we described learning algorithm based on this regularization and we proposed novel extension of the regularized loss to unnormalized measures that preserves its efficiency we also described statistical learning bound for the loss the wasserstein loss can encourage smoothness of the predictions with respect to chosen metric on the output space and we demonstrated this property on tag prediction problem showing improved performance over baseline that doesn incorporate the metric an interesting direction for future work may be to explore the connection between the wasserstein loss and markov random fields as the latter are often used to encourage smoothness of predictions via inference at prediction time references jonathan long evan shelhamer and trevor darrell fully convolutional networks for semantic segmentation cvpr to 
appear olga russakovsky jia deng hao su jonathan krause sanjeev satheesh sean ma zhiheng huang andrej karpathy aditya khosla michael bernstein alexander berg and li imagenet large scale visual recognition challenge international journal of computer vision ijcv marco cuturi and arnaud doucet fast computation of wasserstein barycenters icml justin solomon raif rustamov leonidas guibas and adrian butscher wasserstein propagation for learning in icml pages michael coen hidayath ansari and nathanael fillmore comparing clusterings in space icml pages lorenzo rosasco mauricio alvarez and neil lawrence kernels for functions review foundations and trends in machine learning leonid rudin stanley osher and emad fatemi nonlinear total variation based noise removal algorithms physica nonlinear phenomena chen george papandreou iasonas kokkinos kevin murphy and alan yuille semantic image segmentation with deep convolutional nets and fully connected crfs in iclr marco cuturi gabriel and antoine rolet smoothed dual approach for variational wasserstein problems march yossi rubner carlo tomasi and leonidas guibas the earth mover distance as metric for image retrieval ijcv kristen grauman and trevor darrell fast contour matching using approximate earth mover distance in cvpr shirdhonkar and jacobs approximate earth mover distance in linear time in cvpr herbert edelsbrunner and dmitriy morozov persistent homology theory and practice in proceedings of the european congress of mathematics federico bassetti antonella bodini and eugenio regazzini on minimum kantorovich distance estimators stat probab july villani optimal transport old and new springer berlin heidelberg vladimir bogachev and aleksandr kolesnikov the problem achievements connections and perspectives russian math surveys dimitris bertsimas john tsitsiklis and john tsitsiklis introduction to linear optimization athena scientific boston third printing edition marco cuturi sinkhorn distances lightspeed computation of optimal transport nips philip knight and daniel ruiz fast algorithm for matrix balancing ima journal of numerical analysis october lenaic chizat gabriel bernhard schmitzer and vialard unbalanced optimal transport geometry and kantorovich formulation august ofir pele and michael werman fast and robust earth mover distances iccv pages peter bartlett and shahar mendelson rademacher and gaussian complexities risk bounds and structural results jmlr march bart thomee david shamma gerald friedland benjamin elizalde karl ni douglas poland damian borth and li the new data and new challenges in multimedia research arxiv preprint tomas mikolov ilya sutskever kai chen greg corrado and jeff dean distributed representations of words and phrases and their compositionality in nips vedaldi and lenc matconvnet convolutional neural networks for matlab corr ledoux and talagrand probability in banach spaces isoperimetry and processes classics in mathematics springer berlin heidelberg clark givens and rae michael shortt class of wasserstein metrics for probability distributions michigan math 
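To make the smoothed transport machinery described above concrete, here is a minimal NumPy sketch of the Sinkhorn scaling iteration and the resulting gradient of the entropically regularized Wasserstein loss for the normalized (simplex) case. This is not the authors' released code: the function name is illustrative, strictly positive histograms (e.g. softmax outputs) are assumed, and the log-domain stabilization one would want for large lambda is omitted.

```python
import numpy as np

def sinkhorn_wasserstein_grad(pred, target, M, lam=10.0, n_iter=100):
    """Entropically smoothed Wasserstein loss and its gradient w.r.t. the
    prediction, for two histograms on the simplex.

    pred, target : length-K positive vectors summing to 1.
    M            : K x K ground-metric cost matrix (e.g. d_K(i, j) ** p).
    lam          : regularization strength; larger lam tracks the exact
                   Wasserstein distance more closely but converges more slowly.
    """
    K = np.exp(-lam * M)                  # Gibbs kernel of the ground metric
    u = np.ones_like(pred)
    for _ in range(n_iter):               # Sinkhorn / matrix-scaling iterations
        v = target / (K.T @ u)
        u = pred / (K @ v)
    T = np.diag(u) @ K @ np.diag(v)       # (approximately) optimal transport plan
    loss = np.sum(T * M)                  # transport cost of the smoothed plan
    dim = pred.shape[0]
    # gradient w.r.t. pred, shifted so that it is tangent to the simplex
    grad = np.log(u) / lam - (np.log(u).sum() / (lam * dim)) * np.ones(dim)
    return loss, grad
```

A usage note: in a training loop the gradient returned here would be backpropagated through the layer producing pred (e.g. a softmax), exactly as with any other loss; only the forward/backward evaluation of the loss itself changes.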
marginal regression ping li department of statistics and biostatistics department of computer science rutgers university pingli martin slawski department of statistics and biostatistics department of computer science rutgers university abstract we consider the problem of sparse signal recovery from linear measurements quantized to bits marginal regression is proposed as recovery algorithm we study the question of choosing in the setting of given budget of bits and derive single expression characterizing the between and the choice turns out to be optimal for estimating the unit vector corresponding to the signal for any level of additive gaussian noise before quantization as well as for adversarial noise for we show that quantization constitutes an optimal quantization scheme and that the norm of the signal can be estimated consistently by maximum likelihood by extending introduction consider the common compressed sensing cs model yi hai σεi or equivalently σε yi aij ai aij εi where the aij and the εi are standard gaussian random variables the latter of which will be referred to by the term additive noise and accordingly as noise level and rn is the signal of interest to be recovered given let where be the of the cardinality of its support one of the celebrated results in cs is that accurate recovery of is possible as long as log and can be carried out by several computationally tractable algorithms subsequently the concept of signal recovery from an incomplete set of linear measurements was developed further to settings in which only coarsely quantized versions of such linear measurements are available with the extreme case of measurements more generally one can think of measurements assuming that one is free to choose given fixed budget of bits gives rise to between and an optimal balance of these two quantities minimizes the error in recovering the signal such optimal depends on the quantization scheme the noise level and the recovery algorithm this has been considered in previous cs literature however the analysis therein concerns an recovery algorithm equipped with knowledge of which is not fully realistic in specific variant of iterative hard thresholding for measurements is considered it is shown via numerical experiments that choosing can in fact achieve improvements over at the level of the total number of bits required for approximate signal recovery on the other hand there is no analysis supporting this observation moreover the experiments in only concern noiseless setting another approach is to treat quantization as additive error and to perform signal recovery by means of variations of recovery algorithms for cs in this line of research is assumed to be fixed and discussion of the aforementioned is missing in the present paper we provide an analysis of compressed sensing from measurements using specific approach to signal recovery which we term marginal regression this approach builds on method for compressed sensing proposed in an influential paper by plan and vershynin which has subsequently been refined in several recent works as indicated by the name marginal regression can be seen as quantized version of marginal regression simple yet surprisingly effective approach to support recovery that stands out due to its low computational cost requiring only single multiplication and sorting operation our analysis yields precise characterization of the above involving and in various settings it turns out that the choice is optimal for recovering the normalized signal under additive 
gaussian noise as well as under adversarial noise it is shown that the choice additionally enables one to estimate while being optimal for recovering for hence for the specific recovery algorithm under consideration it does not pay off to take furthermore once the noise level is significantly high marginal regression is empirically shown to perform roughly as good as several alternative recovery algorithms finding suggesting that in settings taking does not pay off in general as an intermediate step in our analysis we prove that quantization constitutes an optimal quantization scheme in the sense that it leads to minimization of an upper bound on the reconstruction error notation we use and for the support of rn xj is the indicator function of expression the symbol means up to positive universal constant supplement proofs and additional experiments can be found in the supplement from marginal regression to marginal regression some background on marginal regression it is common to perform sparse signal recovery by solving an optimization problem of the form min ky where is penalty term encouraging sparse solutions standard choices for are which is computationally not feasible in general its convex relaxation or penalty terms like scad or mcp that are more amenable to optimization than the alternatively can as well be used to enforce constraint by setting ιc where ιc if and otherwise with rn or rn being standard choices note that is equivalent to the optimization problem where min hη xi replacing by recall that the entries of are we obtain min hη xi which tends to be much simpler to solve than as the first two terms are separable in the components of for the choices of mentioned above we obtain closed form solutions bj ηj bj sign ηj ιx bj ηj ιx bj sign ηj for where denotes the positive part and is the sth largest entry in in absolute pn magnitude and min in other words the estimators are hardrespectively versions of ηj which are essentially equal to the univariate or marginal regression coefficients θj in the sense that ηj θj op hence the term marginal regression in the literature it is the estimator in the left half of that is popular albeit as means to infer the support of rather than itself under the performance with respect to signal recovery can still be reasonable in view of the statement below proposition consider model with and the marginal regression estimator defined by bj ηj where then there exists positive constants such that with probability at least log kb in comparison pthe relative of more sophisticated methods like the lasso scales as log which is comparable to once is of the same order of magnitude as marginal regression can also be interpreted as single projected gradient iteration from for problem with ιx taking more than one projected gradient iteration gives rise to popular recovery algorithm known as iterative hard thresholding iht compressed sensing with observations and the method of plan vershynin as generalization of one can consider measurements of the form yi hai σεi for some map without loss generality one may assume that as long as which is assumed in the sequel by defining accordingly plan and vershynin consider the following optimization problem for recovering and develop framework for analysis that covers even more general measurement models than the proposed estimator minimizes min hη xi note that the constraint set contains the authors prefer the former because it is suited for approximately sparse signals as well and second because it is convex however 
the optimization problem with sparsity constraint is easy to solve min hη xi lemma the solution of problem is given by ej ηj while this is elementary we state it as separate lemma as there has been some confusion in the existing literature in the same solution is obtained after unnecessarily convexifying the constraint set which yields the unit ball of the norm in family of concave penalty terms including the scad and mcp is proposed in place of the cardinality constraint however in light of lemma the use of such penalty terms lacks motivation the minimization problem is essentially that of marginal regression with ιx the only difference being that the norm of the solution is fixed to one note that the marginal regression estimator is of for with changes to ab in addition let and define and as the minimizers of the optimization problems min hη xi min hη xi it is not hard to verify that in summary for estimating the direction it does not matter if quadratic term in the objective or an constraint is used moreover estimation of the scale and the direction can be separated adopting the framework in we provide straightforward bound on the of minimizing to this end we define two quantities which will be of central interest in subsequent analysis where is defined by inf ηj log the quantity concerns the deterministic part of the analysis as it quantifies the distortion of the linear measurements under the map while is used to deal with the stochastic part the definition of is based on the usual tail bound for the maximum of centered random variables in fact as long as has bounded range gaussianity of the aij implies that the ηj ηj are accordingly the constant is proportional to the norm of the ηj ηj cf proposition consider model and suppose that and denote by the minimizer of then with probability at least it holds that log kx so far has been assumed to be known if that is not the case can be estimated as follows as proposition in the setting of proposition consider sb log and the minimizer of with replaced by sb then with probability at least no false positive selection moreover if log one has marginal regression quantized measurements directly fit into the observation model here the map represents quantizer that partitions into bins rk in increasing order and given by distinct thresholds tk such that rk tk each bin is assigned distinct representative from µk in increasing order so that is defined by pk sign µk rk expanding model accordingly we obtain pk yi sign hai σεi µk hai σεi rk sign hai εi µk hai xu εi rk where and thus the scale of the signal can be absorbed into the definition of the bins respectively thresholds which should be proportional to we may thus again fix and in turn for the analysis below estimation of separately from will be discussed in an extra section analysis in this section we study in detail the central question of the introduction suppose we have fixed budget of bits available and are free to choose the number of measurements and the number of bits per measurement subject to such that the kb of marginal regression is as small as possible what is the optimal choice of in order to answer this question let us go back to the error bound that bound applies to marginal regression for any choice of and varies with λb and ψb both of which additionally depend on the choice of the thresholds and the representatives it can be shown that the dependence of on the ratio is tight asymptotically as hence it makes sense to compare two different choices and in terms of the ratio of ωb ψb 
and since the bound decays with for measurements to improve over measurements with respect to the total bits used it is then required that ωb the route to be taken is thus as follows we first derive expressions for λb and ψb and then minimize the resulting expression for ωb the free parameters and we are then in position to compare ωb for evaluating λb λb below denotes the multiplication between vectors lemma we have λb hα µi where αk αk rk ek ek rk evaluating ψb ψb exact evaluation proves to be difficult we hence resort to an analytically more tractable approximation which is still sufficiently accurate as confirmed by experiments lemma as and as we have ψb hα µi note that the proportionality constant not depending on in front of the given expression does not need to be known as it cancels out when computing ratios ωb the asymptotics is limiting but still makes sense for growing with recall that we fix optimal choice of and it turns that the optimal choice of minimizing ψb coincides with the solution of an instance of the classical quantization problem stated below let be random variable with finite variance and the quantization map from above pk min min sign µk rk problem can be seen as problem at the population level and it is solved in practice by an alternating scheme similar to that used for for from logconcave distribution gaussian that scheme can be shown to deliver the global optimum theorem consider the minimization problem mint ψb its minimizer equals that of the problem for moreover ωb ψb where denotes the value of λb for evaluated at the choice of minimizing ωb for regarding the choice of the result of theorem may not come as suprise as the entries of are it is less immediate though that this specific choice can also be motivated as the one leading to the minimization of the error bound furthermore theorem implies that the relative performance of and measurements does not depend on as long as the respective optimal choice of is used which requires to be known theorem provides an explicit expression for ωb that is straightforward to compute the following table lists ratios ωb for selected values of and ωb required for these figures suggests that the smaller the better the performance for given budget of bits beyond additive noise additive gaussian noise is perhaps the most studied form of perturbation but one can of course think of numerous other mechanisms whose effect can be analyzed on the basis of the same scheme used for additive noise as long as it is feasible to obtain the corresponding expressions for and we here do so for the following mechanisms acting after quantization random bin flip for with probability yi remains unchanged with probability yi is changed to an element from yi uniformly at random ii adversarial bin flip for write yi qµk for and µk with probability yi remains unchanged with probability yi is changed to note that for and ii coincide sign flip with probability depending on the magnitude of the corresponding value λb may even be negative which is unlike the case of additive noise recall that the error bound requires borrowing terminology from robust statistics we consider min λb as the breakdown point the expected proportion of contaminated observations that can still be tolerated so that continues to hold mechanism ii produces natural counterpart of gross corruptions in the standard setting it can be shown that among all maps applied randomly to the observations with fixed probability ii maximizes the ratio hence the attribute adversarial in figure we 
display ψb for for both and ii the table below lists the corresponding breakdown points for simplicity are not optimized but set to the optimal in the sense of choice in the noiseless case the underlying derivations can be found in the supplement ii figure and the table provide one more argument in favour of measurements as they offer better robustness adversarial corruptions in fact once the fraction of such corruptions reaches performs best on the measurement scale for the milder corruption scheme turns out to the best choice for significant but moderate fraction of bin flips fraction of gross corruptions figure ψb for mechanisms and ii scale estimation in section we have decomposed into product of unit vector and scale parameter we have pointed out that can be estimated by marginal regression separately from since the latter can be absorbed into the definition of the bins rk accordingly bu and ψb estimating and respectively we here consider we may estimate as bu ψb with the maximum likelihood estimator mle for by following which studied the estimation of the scale parameter for the entire family of distributions the work of was motivated by different line of one scan cs algorithm based on designs first we consider the case so that the yi are the likelihood function is yi rk rk tk mk where mk rk and denotes the standard gaussian cdf note that for is constant does not depend on which confirms that for it is impossible to recover for the mle has simple closed form expression given by ψb the following tail bound establishes fast convergence of ψb to proposition let and where denotes the derivative of the standard gaussian pdf with probability at least exp we have the exponent is maximized for and becomes smaller as moves away from while scale estimation from measurements is possible convergence can be slow if is not well chosen for convergence can be faster but the mle is not available in closed form we now turn to the case the mle based on is no longer consistent if is known then the joint likelihood of for is given by ui hai li hai where li ui denotes the interval the observation is contained in before quantization it is not clear to us whether the likelihood is which would ensure that the global optimum can be obtained by convex programming empirically we have not encountered any issue with spurious local minima when using and as the mle from the noiseless case as starting point the only issue with we are aware of concerns the case in which there exists so that hai li ui in this situation the mle for equals zero and the mle for may not be unique however this is rather unlikely scenario as long as there is noticeable noise level as is typically unknown we may follow the principle replacing by an estimator bu experiments we here provide numerical results some of the key points made in the previous sections we also compare marginal regression to alternative recovery algorithms setup our simulations follow model with and regarding the support and its signs are selected uniformly at random while the absolute magnitude of the entries corresponding to the support are drawn from the uniform distribution on where log and log with controlling the signal strength the is then normalized to unit before normalization the norm of the signal lies in by construction which ensures that as increases the signal strength condition is satisfied with increasing probability for we use quantization for variable which is optimal for but not for each possible configuration for and is replicated times due to space 
limits representative subset of the results is shown the rest can be found in the supplement empirical verification of the analysis in section the experiments reveal that what is predicted by the analysis of the comparison of the relative performance of and measurements for estimating closely agrees with what is observed empirically as can be seen in figure estimation of the scale and the noise level figure suggests that the mle for is suitable approach at least as long as is not too small for the mle for appears to have noticeable bias as it tends to instead of for increasing and thus increasing observe that for convergence to the true value is smaller as for required improvement predicted improvement error error required improvement predicted improvement required improvement predicted improvement error error required improvement predicted improvement figure average errors for and on the in dependence of the signal strength the curve predicted improvement of is obtained by scaling the error by the factor predicted by the theory of section likewise the curve required improvement results by scaling the error of by and indicates what would be required by to improve over at the level of total bits estimated noise level estimated norm of figure estimation of here and the curves depict the average of the mle discussed in section while the bars indicate standard deviation while is about for small the above two issues are presumably effect consequence of using bu in place of marginal regression and alternative recovery algorithms we compare the error of marginal regression to several common recovery algorithms compared to apparently more principled methods which try to enforce agreement of and ab the hamming distance or surrogate thereof marginal regression can be seen as crude approach as it is based on maximizing the inner product between and ax one may thus expect that its performance is inferior in summary our experiments confirm that this is true in settings but not so if the noise level is substantial below we briefly present the alternatives that we consider the approach in based on which only differs in that the constraint set results from relaxation as shown in figure the performance is similar though slightly inferior standard iterative hard thresholding based on quadratic loss as pointed out above marginal regression can be seen as version of iterative hard thresholding the variant of iterative hard threshold for binary observations using hinge loss function as proposed in svm linear svm with squared hinge loss and an implemented in liblinear the cost parameter is chosen from log by variant of iterative hard threshold for quantized observations based on specific piecewiese linear loss function this approach is based on solving the following convex optimization problem minx ξi ξi subject to li ξi hai xi ui ξi ξi where li ui is the bin observation is assigned to the essential idea is to enforce consistency of the observed and predicted bin assignments up to slacks ξi while promoting sparsity of the solution via an penalty the parameter is chosen from log by turning to the results as depicted by figure the difference between noiseless and heavily noisy setting is perhaps most striking both iht variants significantly outperform marginal regression by comparing errors for iht can be seen to improve over at the level of the total bits marginal regression is on par with the best performing methods for only achieves moderate reduction in error over while is supposedly affected by convergence 
issues overall the results suggest that setting with substantial noise favours crude approach measurements and conceptually simple recovery algorithms marginal svm error error marginal error error marginal marginal svm figure average errors for several recovery algorithms on the in dependence of the signal strength we contrast conclusion bridging marginal regression and popular approach to cs due to plan vershynin we have considered signal recovery from quantized measurements the main finding is that for marginal regression it is not beneficial to increase beyond compelling argument for is the fact that the norm of the signal can be estimated unlike the case compared to measurements measurements also exhibit strong robustness properties it is of interest if and under what circumstances the conclusion may differ for other recovery algorithms acknowledgement this work is partially supported by and references blumensath and davies iterative hard thresholding for compressed sensing applied and computational harmonic analysis boufounos and baraniuk compressive sensing in information science and systems candes and tao the dantzig selector statistical estimation when is much larger than the annals of statistics chen and banerjee compressed sensing with the norm in aistats donoho compressed sensing ieee transactions on information theory fan chang hsieh wang and lin liblinear library for large linear classification journal of machine learning research genovese jin wasserman and yao comparison of the lasso and marginal regression journal of machine learning research gopi netrapalli jain and nori compressed sensing provable support and vector recovery in icml jacques degraux and de vleeschouwer quantized iterative hard thresholding bridging and quantized compressed sensing jacques hammond and fadili dequantizing compressed sensing when oversampling and constraints combine ieee transactions on information theory jacques laska boufounos and baraniuk robust compressive sensing via binary stable embeddings of sparse vectors ieee transactions on information theory kieffer uniqueness of locally optimal quantizer for density and convex error weighting function ieee transactions on information theory laska and baraniuk regime change versus in compressive sensing laska boufounos davenport and baraniuk democracy in action quantization saturation and compressive sensing applied and computational harmonic analysis li binary and coding for stable random projections li one scan compressed sensing technical report li zhang and zhang compressed counting meets compressed sensing in colt liu and wright robust dequantized compressive sensing applied and computational harmonic analysis lloyd least squares quantization in pcm ieee transactions on information theory max quantizing for minimum distortion ire transactions on information theory needell and tropp cosamp iterative signal recovery from incomplete and inaccurate samples applied and computational harmonic analysis plan and vershynin compressed sensing by linear programming communications on pure and applied mathematics plan and vershynin robust compressed sensing and sparse logistic regression convex programming approach ieee transactions on information theory zhu and gu towards lower sample complexity for robust compressed sensing in icml vershynin in compressed sensing theory and applications chapter introduction to the nonasymptotic analysis of random matrices cambridge university press wainwright sharp thresholds for noisy and recovery of sparsity using 
ℓ1-constrained quadratic programming (lasso) ieee transactions on information theory zhang and zhang a general theory of concave regularization for high-dimensional sparse estimation problems statistical science zhang yi and jin efficient algorithms for robust one-bit compressive sensing in icml zhang adaptive forward-backward greedy algorithm for learning sparse representations ieee transactions on information theory 
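Returning to the recovery algorithm analyzed in this paper, the following is a minimal sketch of marginal regression from one-bit measurements: a single matrix-vector multiplication followed by hard thresholding to the s largest coefficients and normalization, which yields an estimate of the direction x/||x||2. The function name and the synthetic-data usage are illustrative assumptions, and the separate maximum-likelihood estimation of the scale discussed above is omitted.

```python
import numpy as np

def one_bit_marginal_regression(A, y, s):
    """Estimate the direction x / ||x||_2 from quantized measurements by
    marginal regression: one matrix-vector product plus a sort.

    A : (n, d) Gaussian sensing matrix.
    y : length-n vector of quantized measurements, e.g. y_i = sign(<a_i, x> + noise).
    s : assumed sparsity level of the signal.
    """
    n, d = A.shape
    eta = A.T @ y / n                       # marginal regression coefficients
    support = np.argsort(np.abs(eta))[-s:]  # keep the s largest in magnitude
    x_hat = np.zeros(d)
    x_hat[support] = eta[support]
    norm = np.linalg.norm(x_hat)
    return x_hat / norm if norm > 0 else x_hat

# Hypothetical usage with synthetic data (one-bit case):
# n, d, s = 500, 1000, 10
# x = np.zeros(d); x[:s] = 1.0; x /= np.linalg.norm(x)
# A = np.random.randn(n, d)
# y = np.sign(A @ x + 0.1 * np.random.randn(n))
# x_dir = one_bit_marginal_regression(A, y, s)
```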
Natural Neural Networks
Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, Koray Kavukcuoglu (gdesjardins, simonyan, razp, korayk), Google DeepMind, London

Abstract. We introduce natural neural networks, a novel family of algorithms that speed up convergence by adapting their internal representation during training to improve the conditioning of the Fisher matrix. In particular, we show a specific example that employs a simple and efficient reparametrization of the neural network weights by implicitly whitening the representation obtained at each layer, while preserving the computation of the network. Such networks can be trained efficiently via the proposed projected natural gradient descent algorithm (PRONG), which amortizes the cost of these reparametrizations over many parameter updates and is closely related to the mirror descent online learning algorithm. We highlight the benefits of our method on both unsupervised and supervised learning tasks, and showcase its scalability by training on the ImageNet challenge dataset.

Introduction. Deep networks have proven extremely successful across a broad range of applications. While their deep and complex structure affords them rich modeling capacity, it also creates complex dependencies between the parameters, which can make learning difficult via first-order stochastic gradient descent (SGD). As long as SGD remains the workhorse of deep learning, our ability to extract high-level representations from data may be hindered by difficult optimization, as evidenced by the boost in performance offered by batch normalization (BN) on the Inception architecture. Though its adoption remains limited, the natural gradient appears ideally suited to these difficult optimization issues. By following the direction of steepest descent on the probabilistic manifold, the natural gradient can make constant progress over the course of optimization, as measured by the KL divergence between consecutive iterates. Utilizing the proper distance measure ensures that the natural gradient is invariant to the parametrization of the model. Unfortunately, its application has been limited due to its high computational cost: natural gradient descent (NGD) typically requires an estimate of the Fisher information matrix (FIM), which is square in the number of parameters, and, worse, requires computing its inverse. Truncated Newton methods can avoid explicitly forming the FIM in memory, but they require an expensive iterative procedure to compute the inverse. Such computations can be wasteful, as they do not take into account the highly structured nature of deep models. Inspired by recent work on model reparametrizations, our approach starts with a simple question: can we devise a neural network architecture whose Fisher matrix is constrained to be the identity? This is an important question, as SGD and NGD would be equivalent in the resulting model. The main contribution of this paper is in providing a simple, theoretically justified network reparametrization which approximates, via gradient descent, a natural gradient update over layers. Our method is computationally efficient due to the local nature of the reparametrization (based on whitening) and the amortized nature of the algorithm. Our second contribution is in unifying many heuristics commonly used for training neural networks under the roof of the natural gradient, while highlighting an important connection between model reparametrizations and mirror descent. Finally, we showcase the efficiency and the scalability of our method across a range of experiments, scaling our method from standard deep models to large convolutional 
models on imagenet trained across multiple gpus this is to our knowledge the natural gradient algorithm is scaled to problems of this magnitude the natural gradient this section provides the necessary background and derives particular form of the fim whose structure will be key to our efficient approximation while we tailor the development of our method to the classification setting our approach generalizes to regression and density estimation overview we consider the problem of fitting the parameters rn of model to an empirical distribution under the we denote by the observation vector and its associated label concretely this stochastic optimization problem aims to solve log defining the loss as stochastic gradient descent sgd performs the above minimization by iteratively following the direction of steepest descent given by the column vector parameters are updated using the rule where is learning rate an equivalent proximal form of gradient descent reveals the precise nature of ri namely each iterate is the solution to an auxiliary optimization problem where controls the distance between consecutive iterates using an distance in contrast the natural gradient relies on the between iterates more appropriate distance measure for probability distributions its metric is determined by the fisher information matrix log log the covariance of the gradients of the model wrt its parameters the natural gradient direction is then obtained as rn see for recent overview of the topic fisher information matrix for mlps we start by deriving the precise form of the fisher for canonical perceptron mlp composed of layers we consider the following deep network for binary classification though our approach generalizes to an arbitrary number of output classes hl fl wl hl bl the parameters of the mlp denoted wl bl are the weights wi rni connecting layers and and the biases bi rni fi is an function let us define to be the backpropagated gradient through the we ignore the off components of the fisher matrix and focus on the block fwi corresponding to interactions between parameters of layer this block takes the form fwi vec hti vec hti where vec is the vectorization function yielding column vector from the rows of matrix assuming that and activations hi are independent random variables we can write fwi km ln hi hi figure natural neural network illustration of the projections involved in prong where is the element at row and column of matrix and is the element of vector fwi km ln is the entry in the fisher capturing interactions between parameters wi and wj our hypothesis verified experimentally in sec is that we can greatly improve conditioning of the fisher by enforcing that hi hti for all layers of the network despite ignoring possible correlations in the and off block diagonal terms of the fisher projected natural gradient descent this section introduces whitened neural networks wnn which perform approximate whitening of their internal representations we begin by presenting novel whitened neural layer with the assumption that the network statistics µi hi and hi hti are fixed we then show how these layers can be adapted to efficiently track population statistics over the course of training the resulting learning algorithm is referred to as projected natural gradient descent prong we highlight an interesting connection between prong and mirror descent in section whitened neural layer the building block of wnn is the following neural layer hi fi vi ui hi ci di compared to eq we have introduced an explicit 
centering parameter ci rni equal to µi which ensures that the input to the dot product has zero mean in expectation this is analogous to the centering reparametrization for deep boltzmann machines the weight matrix ui rni is matrix whose rows are obtained from an eigendecomposition of diag ui diag the is regularization term controlling the maximal multiplier on the learning rate or equivalently the size of the trust region the parameters vi rni and di rni are analogous to the canonical parameters of neural network as introduced in eq though operate in the space of whitened unit activations ui hi ci this layer can be stacked to form deep neural network having layers with model parameters vl dl and whitening coefficients ul cl as depicted in fig though the above layer might appear at first glance we crucially do not learn the whitening coefficients via loss minimization but instead estimate them directly from the model statistics these coefficients are thus constants from the point of view of the optimizer and simply serve to improve conditioning of the fisher with respect to the parameters denoted indeed using the same derivation that led to eq we can see that the terms of now involve terms ui hi ui hi which equals identity by construction updating the whitening coefficients as the whitened model parameters evolve during training so do the statistics µi and for our model to remain well conditioned the whitening coefficients must be updated at regular intervals algorithm projected natural gradient descent input training set initial parameters reparam frequency number of samples ns regularization term ui ci repeat if mod then amortize cost of lines for all layers do compute canonical parameters wi vi ui bi di wi ci proj estimate µi and using ns samples from update ci from µi and ui from eigen decomp of update update parameters vi wi ui di bi proj end for end if perform sgd update wrt using samples from until convergence while taking care not to interfere with the convergence properties of gradient descent this can be achieved by coupling updates to with corresponding updates to such that the overall function implemented by the mlp remains unchanged by preserving the product vi ui before and after each update to the whitening coefficients with an analoguous constraint on the biases unfortunately while estimating the mean µi and diag could be performed online over minibatch of samples as in the recent batch normalization scheme estimating the full covariance matrix will undoubtedly require larger number of samples while statistics could be accumulated online via an exponential moving average as in rmsprop or the cost of the eigendecomposition required for computing the whitening matrix ui remains cubic in the layer size in the simplest instantiation of our method we exploit the smoothness of gradient descent by simply amortizing the cost of these operations over consecutive updates sgd updates in the whitened model will be closely aligned to ngd immediately following the reparametrization the quality of this approximation will degrade over time until the subsequent reparametrization the resulting algorithm is shown in the of algorithm we can improve upon this basic amortization scheme by updating the whitened parameters using diagonal natural gradient update whose statistics are computed online in our framework this can be implemented via the reparametrization wi vi di ui where di is diagonal matrix updated such that di ui hi for each minibatch updates to di can be compensated for exactly and 
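at negligible cost; the mechanics are spelled out right after the following sketch of the basic amortized loop (Algorithm 1). The sketch is a simplified serial illustration under our own assumptions — a list of per-layer parameter dictionaries, a user-supplied `sgd_step` and `sample_batch`, and tanh activations — not a reference implementation.

```python
import numpy as np

def prong_training_loop(layers, sgd_step, sample_batch, n_iters,
                        reparam_period=100, n_stat_samples=1000, eps=1e-3):
    """Amortized whitening reparametrization (sketch of Algorithm 1).
    Each entry of `layers` is a dict with keys V, d, c, U."""
    for k in range(n_iters):
        if k % reparam_period == 0:              # amortize the expensive steps
            h = sample_batch(n_stat_samples)     # inputs used to estimate statistics
            for layer in layers:
                # 1. fold the whitened parameters back into canonical ones
                W = layer['V'] @ layer['U']
                b = layer['d'] - W @ layer['c']
                # 2. re-estimate the statistics of this layer's input
                c_new = h.mean(axis=0)
                cov = np.cov(h - c_new, rowvar=False)
                eigval, eigvec = np.linalg.eigh(cov)
                U_new = (eigvec / np.sqrt(eigval + eps)).T
                # 3. project the canonical parameters onto the new whitened
                #    basis; the function computed by the network is unchanged
                layer['c'], layer['U'] = c_new, U_new
                layer['V'] = W @ np.linalg.inv(U_new)
                layer['d'] = b + W @ c_new
                # propagate the statistics batch to the next layer
                h = np.tanh((h - c_new) @ U_new.T @ layer['V'].T + layer['d'])
        sgd_step(layers)                          # plain SGD in the whitened model
```

In the enhanced variant, each change to Di is likewise folded back into the model exactly and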
cheaply by scaling the rows of ui and columns of vi accordingly simpler implementation of this idea is to combine prong with which we denote as duality and mirror descent there is an inherent duality between the parameters of our whitened neural layer and the parameters of canonical model indeed there exist linear projections and which map from canonical parameters to whitened parameters and corresponds to line of algorithm while corresponds to line this duality between and reveals close connection between prong and mirror descent mirror descent md is an online learning algorithm which generalizes the proximal form of gradient descent to the class of bregman divergences where and is strictly convex and differentiable function replacing the distance by mirror descent solves the proximal problem of eq by applying updates in dual space and then projecting back onto the primal space defining and with the complex conjugate of the mirror descent updates are given by figure fisher matrix for small mlp before and after the first reparametrization best viewed in colour condition number of the fim during training relative to the initial conditioning all models where initialized such that the initial conditioning was the same and learning rate where adjusted such that they reach roughly the same training error in the given time it is well known that the natural gradient is special case of md where the distance generating function is chosen to be the mirror updates are somewhat unintuitive however why is the gradient applied to the dual space if it has been computed in the space of parameters this is where prong relates to md it is trivial to show that using the function instead of the previously defined enables us to directly update the dual parameters using the gradient computed directly in the dual space indeed the resulting updates can be shown to implement the natural gradient and are thus equivalent to the updates of eq with the appropriate choice of the operators and correspond to the projections and used by prong to map from the canonical neural parameters to those of the whitened layers as illustrated in fig the advantage of this whitened form of md is that one may amortize the cost of the projections over several updates as gradients can be computed directly in the dual parameter space related work this work extends the recent contributions of in formalizing many commonly used heuristics for training mlps the importance of activations and gradients as well as the importance of normalized variances in the forward and backward passes more recently vatanen et al extended their previous work by introducing multiplicative constant to the centered in contrast we introduce full whitening matrix ui and focus on whitening the feedforward network activations instead of normalizing geometric mean over units and gradient variances the recently introduced batch normalization bn scheme quite closely resembles diagonal version of prong the main difference being that bn normalizes the variance of activations before the as opposed to normalizing the latent activations by looking at the full covariance furthermore bn implements normalization by modifying the computations thus requiring the method to backpropagate through the normalization operator diagonal version of prong also bares an interesting resemblance to rmsprop in that both normalization terms involve the square root of the fim an important distinction however is that prong applies this update in the whitened parameter space thus preserving the 
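character of a natural gradient step. Since the mirror descent connection is central to this section, it may help to state the generic proximal form explicitly; the display below is our own paraphrase of the standard definitions (distance-generating function psi and its Bregman divergence), not a verbatim equation from the paper.

```latex
% Mirror descent in proximal form, with Bregman divergence
%   D_psi(theta, theta') = psi(theta) - psi(theta') - <grad psi(theta'), theta - theta'>:
\theta^{k+1} = \arg\min_{\theta}\;
    \langle \nabla \mathcal{L}(\theta^{k}),\, \theta \rangle
    + \tfrac{1}{\eta}\, D_{\psi}\!\left(\theta,\, \theta^{k}\right),
\qquad\text{equivalently}\qquad
\nabla\psi(\theta^{k+1}) = \nabla\psi(\theta^{k}) - \eta\,\nabla\mathcal{L}(\theta^{k}).
```

With psi equal to half the squared Euclidean norm this reduces to plain gradient descent, while a psi whose curvature matches the Fisher recovers, locally, the natural gradient; this is the sense in which the projections used by PRONG map between the canonical (primal) and whitened (dual) parameters. Because the diagonal rescaling acts on parameters that are already whitened, the method keeps its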
natural gradient interpretation as the fisher and thus which we drop for clarity depend on the parameters these should be indexed with time superscript figure optimizing deep on mnist impact of eigenvalue regularization term impact of amortization period showing that initialization with the whitening reparametrization is important for achieving faster learning and better error rate training error vs number of updates training error vs plots show that prong achieves better error rate both in number of updates and wall clock is closely related to prong and was developed concurrently to our method it targets the same of the fisher approximating each block as in eq unlike our method however kfac does not approximate the covariance of backpropagated gradients as the identity and further estimates the required statistics using exponential moving averages unlike our approach based on amortization similar techniques can be found in the preconditioning of the kaldi speech recognition toolkit by modeling the fisher matrix as the covariance of sparsely connected gaussian graphical model fang represents general formalism for exploiting model structure to efficiently compute the natural gradient one application to neural networks is in decorrelating gradients across neighbouring layers similar algorithm to prong was later found in where it appeared simply as thought experiment but with no amortization or recourse for efficiently computing experiments we begin with set of diagnostic experiments which highlight the effectiveness of our method at improving conditioning we also illustrate the impact of the and controlling the frequency of the reparametrization and the size of the trust region section evaluates prong on unsupervised learning problems where models are both deep and fully connected section then moves onto large convolutional models for image classification experimental details such as model architecture or configurations can be found in the supplemental material introspective experiments conditioning to provide better understanding of the approximation made by prong we train small mlp with tanh on downsampled version of mnist the model size was chosen in order for the full fisher to be tractable fig shows the fim of the middle hidden layers before and after whitening the model activations we took the absolute value of the entries to improve visibility fig depicts the evolution of the condition number of the fim during training measured as percentage of its initial value before the first whitening reparametrization in the case of prong we present such curves for sgd rmsprop batch normalization and prong the results clearly show that the reparametrization performed by prong improves conditioning reduction of more than these observations confirm our initial assumption namely that we can improve conditioning of the block diagonal fisher by whitening activations alone sensitivity of figures highlight the effect of the eigenvalue regularization term and the reparametrization interval the experiments were performed on the best figure classification error on and imagenet on prong achieves better test error and converges faster on imagenet achieves comparable validation error while maintaining faster covergence rate performing of section on the mnist dataset figures plot the reconstruction error on the training set for various values of and as determines maximum multiplier on the learning rate learning becomes extremely sensitive when this learning rate is for smaller step sizes however lowering can 
yield significant speedups often converging faster than simply using larger learning rate this confirms the importance of the manifold curvature for optimization lower allows for different directions to be scaled drastically different according to their corresponding curvature fig compares the impact of for models having proper whitened initialization solid lines to models being initialized with standard initialization dashed lines these results are quite surprising in showing the effectiveness of the whitening reparametrization as simple initialization scheme that being said performance can degrade due to ill conditioning when becomes excessively large unsupervised learning following martens we compare prong on the task of minimizing reconstruction error of dense on the mnist dataset reconstruction error with respect to updates and wallclock time are shown in fig we can see that prong significantly outperforms the baseline methods by up to an order of magnitude in number of updates with respect to wallclock our method significantly outperforms the baselines in terms of time taken to reach certain error threshold despite the fact that the runtime per epoch for prong was that of sgd compared to batch normalization sgd and rmsprop sgd note that these timing numbers reflect performance under the optimal choice of which in the case of batch normalization yielded batch size of compared to for all other methods further breaking down the performance of the runtime of prong was spent performing the whitening reparametrization compared to for estimating the per layer means and covariances this confirms that amortization is paramount to the success of our supervised learning we now evaluate our method for training deep supervised convolutional networks for object recognition following we perform whitening across feature maps only that is we treat pixels in given feature map as independent samples this allows us to implement the whitened neural layer as sequence of two convolutions where the first is by whitening filter prong is compared to sgd rmsprop and batch normalization with each algorithm being accelerated via momentum results are presented on and the imagenet challenge datasets in both cases learning rates were decreased using waterfall annealing schedule which divided the learning rate by when the validation error failed to improve after set number of evaluations unstable combinations of learning rates and are omitted for clarity we note that our whitening implementation is not optimized as it does not take advantage of gpu acceleration runtime is therefore expected to improve as we move the to gpu we now evaluate prong on using deep convolutional model inspired by the vgg architecture the model was trained on random crops with random horizontal reflections model selection was performed on validation set of examples results are shown in fig with respect to training error prong and bn seem to offer similar speedups compared to sgd with momentum our hypothesis is that the benefits of prong are more pronounced for densely connected networks where the number of units per layer is typically larger than the number of maps used in convolutional networks interestingly prong generalized better achieving test error for batch normalization this reflects the findings of which showed how ngd can leverage unlabeled data for better generalization the unlabeled data here comes from the extra crops and reflections observed when estimating the whitening matrices imagenet challenge dataset our final set of 
experiments aims to show the scalability of our method we applied our natural gradient algorithm to the dataset images labelled into categories using the inception architecture in order to scale to problems of this size we parallelized our training loop so as to split the processing of single minibatch of size across multiple gpus note that prong can scale well in this setting as the estimation of the mean and covariance parameters of each layer is also embarassingly parallel eight gpus were used for computing gradients and estimating model statistics though the eigen decomposition required for whitening was itself not parallelized in the current implementation given the difficulty of the task we employed the enhanced version of the algorithm as simple periodic whitening of the model proved to be unstable figure shows that batch normalisation and converge to approximately the same validation error vs respectively for similar in comparison sgd achieved validation error of however exhibits much faster convergence initially after updates it obtains around error compared to for bn alone we stress that the imagenet results are somewhat preliminary while our error is higher than reported in we used much less extensive data augmentation pipeline we are only beginning to explore what natural gradient methods may achieve on these large scale optimization problems and are encouraged by these initial findings discussion we began this paper by asking whether convergence speed could be improved by simple model reparametrizations driven by the structure of the fisher matrix from theoretical and experimental perspective we have shown that whitened neural networks can achieve this via simple scalable and efficient whitening reparametrization they are however one of several possible instantiations of the concept of natural neural networks in previous incarnation of the idea we exploited similar reparametrization to include whitening of backpropagated we favor the simpler approach presented in this paper as we generally found the alternative less stable for deep networks this may be due to the difficulty in estimating gradient covariances in lower layers problem which seems to mirror the famous vanishing gradient problem maintaining whitened activations may also offer additional benefits from the point of view of model compression and generalization by virtue of whitening the projection ui hi forms an ordered representation having least and most significant bits the sharp in the eigenspectrum of may explain why deep networks are ammenable to compression similarly one could envision spectral versions of dropout where the dropout probability is function of the eigenvalues alternative ways of orthogonalizing the representation at each layer should also be explored via alternate decompositions of or perhaps by exploiting the connection between linear and pca we also plan on pursuing the connection with mirror descent and further bridging the gap between deep learning and methods from online convex optimization acknowledgments we are extremely grateful to shakir mohamed for invaluable discussions and feedback in the preparation of this manuscript we also thank philip thomas volodymyr mnih raia hadsell sergey ioffe and shane legg for feedback on the paper the weight matrix can be parametrized as wi rit vi ui with ri the whitening matrix for references amari natural gradient works efficiently in learning neural computation jimmy ba and rich caruana do deep nets really need to be deep in nips amir beck and marc 
teboulle mirror descent and nonlinear projected subgradient methods for convex optimization oper res combettes and pesquet proximal splitting methods in signal processing arxiv december john duchi elad hazan and yoram singer adaptive subgradient methods for online learning and stochastic optimization in jmlr xavier glorot and yoshua bengio understanding the difficulty of training deep feedforward neural networks in aistats may sergey ioffe and christian szegedy batch normalization accelerating deep network training by reducing internal covariate shift icml roger grosse james martens optimizing neural networks with approximate curvature in icml june alex krizhevsky learning multiple layers of features from tiny images master thesis university of toronto yann lecun bottou genevieve orr and efficient backprop in neural networks tricks of the trade lecture notes in computer science lncs springer verlag yann lecun lon bottou yoshua bengio and patrick haffner learning applied to document recognition in proceedings of the ieee pages james martens deep learning via optimization in icml june and montavon deep boltzmann machines and the centering trick in montavon and orr editors neural networks tricks of the trade springer yann ollivier riemannian metrics for neural networks arxiv razvan pascanu and yoshua bengio revisiting natural gradient for deep networks in iclr daniel povey xiaohui zhang and sanjeev khudanpur parallel training of deep neural networks with natural gradient and parameter averaging iclr workshop raiko valpola and lecun deep learning made easier by linear transformations in perceptrons in aistats raskutti and mukherjee the information geometry of mirror descent arxiv october ruslan salakhutdinov roger grosse scaling up natural gradient by sparsely factorizing the inverse fisher matrix in icml june olga russakovsky jia deng hao su jonathan krause sanjeev satheesh sean ma zhiheng huang andrej karpathy aditya khosla michael bernstein alexander berg and li imagenet large scale visual recognition challenge international journal of computer vision ijcv nicol schraudolph accelerated gradient descent by decomposition technical report istituto dalle molle di studi sull intelligenza artificiale simonyan and zisserman very deep convolutional networks for image recognition in international conference on learning representations jascha the natural gradient by analogy to signal whitening and recipes and tricks for its use arxiv nitish srivastava geoffrey hinton alex krizhevsky ilya sutskever and ruslan salakhutdinov dropout simple way to prevent neural networks from overfitting journal of machine learning research christian szegedy wei liu yangqing jia pierre sermanet scott reed dragomir anguelov dumitru erhan vincent vanhoucke and andrew rabinovich going deeper with convolutions arxiv philip thomas william dabney stephen giguere and sridhar mahadevan projected natural actorcritic in advances in neural information processing systems tijmen tieleman and geoffrey hinton rmsprop divide the gradient by running average of its recent magnitude coursera neural networks for machine learning tommi vatanen tapani raiko harri valpola and yann lecun pushing stochastic gradient towards methods backpropagation learning with transformations in nonlinearities iconip 
optimization monte carlo efficient and embarrassingly parallel inference max informatics institute university of amsterdam edward meeds informatics institute university of amsterdam tmeeds abstract we describe an embarrassingly parallel anytime monte carlo method for models the algorithm starts with the view that the stochasticity of the generated by the simulator can be controlled externally by vector of random numbers in such way that the outcome knowing is deterministic for each instantiation of we run an optimization procedure to minimize the distance between summary statistics of the simulator and the data after reweighing these samples using the prior and the jacobian accounting for the change of volume in transforming from the space of summary statistics to the space of parameters we show that this weighted ensemble represents monte carlo estimate of the posterior distribution the procedure can be run embarrassingly parallel each node handling one sample and anytime by allocating resources to the worst performing sample the procedure is validated on six experiments introduction computationally demanding simulators are used across the full spectrum of scientific and industrial applications whether one studies embryonic morphogenesis in biology tumor growth in cancer research colliding galaxies in astronomy weather forecasting in meteorology climate changes in the environmental science earthquakes in seismology market movement in economics turbulence in physics brain functioning in neuroscience or fabrication processes in industry approximate bayesian computation abc forms large class algorithms that aims to sample from the posterior distribution over parameters for these simulator based models likelihoodfree inference however is notoriously inefficient in terms of the number of simulation calls per independent sample further like regular bayesian inference algorithms care must be taken so that posterior sampling targets the correct distribution the simplest abc algorithm abc rejection sampling can be fully parallelized by running independent processes with no communication or synchronization requirements it is an embarrassingly parallel algorithm unfortunately as the most inefficient abc algorithm the benefits of this title are limited there has been considerable progress in distributed mcmc algorithms aimed at data problems recently sequential monte carlo smc algorithm called the particle cascade was introduced that emits streams of samples asynchronously with minimal memory management and communication in this paper we present an alternative embarrassingly parallel sampling approach each processor works independently at full capacity and will indefinitely emit independent samples the main trick is to pull random number generation outside of the simulator and treat the simulator as deterministic piece of code we then minimize the difference donald bren school of information and computer sciences university of california irvine and canadian institute for advanced research between observations and the simulator output over its input parameters and weight the final optimized parameter value with the prior and the inverse of the jacobian we show that the resulting weighted ensemble represents monte carlo estimate of the posterior moreover we argue that the error of this procedure is if the optimization gets to the optimal value this optimization monte carlo omc has several advantages it can be run embarrassingly parallel the procedure generates independent samples and the core procedure 
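in summary statistics between the simulator output and the observed data, over the simulator's input parameters. The key prerequisite is being able to rewrite a stochastic simulator as a deterministic function of its parameters and a fixed vector of uniform random numbers u. A minimal sketch for the simplest model used later in the experiments (a normal with unknown mean), with function names of our own choosing, is shown below.

```python
import numpy as np
from scipy.special import erfinv

def simulate_normal_mean(theta, u, sigma=1.0):
    """Deterministic version of a stochastic simulator: draws from
    N(theta, sigma^2) are written as inverse-CDF transforms of the fixed
    uniforms u in (0, 1), so the output depends only on (theta, u)."""
    z = np.sqrt(2.0) * erfinv(2.0 * u - 1.0)   # standard normal variates from u
    y = theta + sigma * z
    return np.array([y.mean()])                # summary statistic: sample mean
```

With the randomness held fixed in this way, each draw of u defines an ordinary deterministic optimization problem: we minimize the discrepancy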
is now optimization rather than mcmc indeed optimization as part of inference procedure has recently been proposed using probabilistic model of the mapping from parameters to differences between observations and simulator outputs they apply bayesian optimization to efficiently perform posterior inference note also that since random numbers have been separated out from the simulator powerful tools such as automatic differentiation are within reach to assist with the optimization in practice we find that omc uses far fewer simulations per sample than alternative abc algorithms the approach of controlling randomness as part of an inference procedure is also found in related class of parameter estimation algorithms called indirect inference connections between abc and indirect inference have been made previously by as novel way of creating summary statistics an indirect inference perspective led to an independently developed version of omc called the reverse sampler in section we briefly introduce abc and present it from novel viewpoint in terms of random numbers in section we derive abc through optimization from geometric point of view then proceed to generalize it to higher dimensions we show in section extensive evidence of the correctness and efficiency of our approach in section we describe the outlook for optimizationbased abc abc sampling algorithms the primary interest in abc is the posterior of simulator parameters given vector of statistics of observations the likelihood is generally not available in abc instead we can use the simulator as generator of that reside in the same space as by treating as auxiliary variables we can continue with the bayesian treatment dx dx dθ of particular importance is the choice of kernel measuring the discrepancy between observations and popular choices for kernels are the gaussian kernel and the uniform the bandwidth parameter which may be vector accounting for relative importance of each statistic plays critical role small produces more accurate posteriors but is more computationally demanding whereas large induces larger error but is cheaper we focus our attention on abc samplers which include rejection sampling importance sampling is sequential monte carlo smc and population monte carlo in rejection sampling we draw parameters from the prior then run simulation at those parameters if the discrepancy then the particle is accepted otherwise it is rejected this is repeated until particles are accepted importance sampling generalizes rejection sampling using proposal distribution qφ instead of the prior and produces samples with weights wi smc extends is to multiple rounds with decreasing adapting their particles after each round such that each new population improves the approximation to the posterior our algorithm has similar qualities to smc since we generate population of weighted particles but differs significantly since our particles are produced by independent optimization procedures making it completely parallel parallel and efficient abc sampling algorithm inherent in our assumptions about the simulator is that internally there are calls to random number generator which produces the stochasticity of the we will assume for the moment that this can be represented by vector of uniform random numbers which if known would make the simulator deterministic more concretely we assume that any simulation output can be represented as deterministic function of parameters and vector of random numbers dθ dy dθ dy figure illustration of omc geometry dashed 
lines indicate contours over for several for three values of their initial and optimal positions are shown solid circles within the grey acceptance region the jacobian indicated by the blue diagonal line describes the relative change in volume induced in from small change in corresponding weights are shown as vertical stems when dθ dy here the change in volume is proportional to the length of the line segment inside the ellipsoid the orange line indicates the projection of the observation onto the contour of in this case identical to the optimal this assumption has been used previously in abc first in coupled abc and also in an application of hamiltonian dynamics to abc we do not make any further assumptions regarding or though for some problems their dimension and distribution may be known priori in these cases it may be worth employing sobol or other sequences to further improve the accuracy of any monte carlo estimates we will first derive dual representation for the abc likelihood function see also dx dxdu du leading to the following monte carlo approximation of the abc posterior du ui ui since is kernel that only accepts arguments and ui that are close to each other for values of that are as small as possible equation tells us that we should first sample values for from and then for each such sample find the value for θio that results in θio in practice we want to drive these values as close to each other as possible through optimization and accept an error if the remaining distance is still note that apart from sampling the values for this procedure is deterministic and can be executed completely in parallel without any communication in the following we will assume single observation vector but the approach is equally applicable to dataset of cases the case dθ dy we will first study the case when the number of parameters is equal to the number of summary statistics to understand the derivation it helps to look at figure which illustrates the derivation for the one dimensional case in the following we use the following abbreviation fi stands for ui the general idea is that we want to write the approximation to the posterior as mixture of small uniform balls or delta peaks in the limit ui wi with wi some weights that we will derive shortly then if we make small enough we can replace any average of sufficiently smooth function this approximate posterior simply by evaluating at some arbitrarily chosen points inside these balls for instance we can take the center of the ball dθ wi to derive this expression we first assume that fi ball of radius is the normalizer which is immaterial because it cancels in the posterior for small enough we claim that we can linearize fi around θio fi θio joi θio ri ri θio θio where joi is the jacobian matrix with columns we take θio to be the end result of our optimization procedure for sample ui using this we thus get fi fi θio joi θio ri we first note that since we assume that our optimization has ended up somewhere inside the ball defined by fi we can assume that fi θio also since we only consider values for that satisfy fi and furthermore assume that the function fi is lipschitz continuous in it follows that θio as well all of this implies that we can safely ignore the remaining term ri which is of order θio if we restrict ourselves to the volume inside the ball the next step is to view the term fi as distribution in with the taylor expansion this results in fi θio θio jo fi θio jot ji θi ji this represents an ellipse in with centroid and volume vi given 
by θio jio fi θio vi det jot ji with constant independent of we can approximate the posterior now as det jot det jot ji ji where in the last step we have send finally we can compute the constant through norot malization the whole procedure is accurate up to errors of the θi det ji ji order and it is assumed that the optimization procedure delivers solution that is located within the epsilon ball if one of the optimizations for certain sample ui did not end up within the epsilon ball there can be two reasons the optimization did not converge to the optimal value for or for this value of there is no solution for which can get within distance from the observation if we interpret as our uncertainty in the observation and we assume that our optimization succeeded in finding the best possible value for then we should simply reject this sample θi however it is hard to detect if our optimization succeeded and we may therefore sometimes reject samples that should not have been rejected thus one should be careful not to create bias against samples ui for which the optimization is difficult this situation is similar to sampler that will not mix to remote local optima in the posterior distribution the case dθ dy this is the overdetermined case and here the situation as depicted in figure is typical the manifold that ui traces out as we vary forms lower dimensional surface in the dy dimensional enveloping space this manifold may or may not intersect with the sphere centered at the observation or ellipsoid for the general case instead of assume that the manifold does intersect the epsilon ball but not since we trust our observation up to distance we may simple choose to pick the closest point to on the manifold which is given by θio fi θi ot ot ji ji ji is the we can now define our ellipse around this point shifting the where center of the ball from to fi which do not coincide in this case the uniform distribution on the ellipse in is now defined in the dθ dimensional manifold and has volume vi det jot so once again we arrive at almost the same equation as before eq but with ji the slightly different definition of the point given by eq crucially since fi and if we assume that our optimization succeeded we will only make mistakes of order the case dθ dy this is the underdetermined case in which it is typical that entire manifolds hyperplanes may be solution to fi in this case we can not approximate the posterior with mixture of point masses and thus the procedure does not apply however the case dθ dy is less interesting than the other ones above as we expect to have more summary statistics than parameters for most problems experiments the goal of these experiments is to demonstrate the correctness of omc and the relative efficiency of omc in relation to two sequential mc algorithms smc aka population mc and adaptive weighted smc to demonstrate correctness we show histograms of weighted samples along with the true posterior when known and for three experiments the exact omc weighted samples when the exact jacobian and optimal is known to demonstrate efficiency we compute the mean simulations per sample ss number of simulations required to reach an and the effective sample size ess defined as additionally we may measure the fraction of effective samples in the population ess is good way of detecting whether the posterior is dominated by few particles how many particles achieve discrepancy less than epsilon there are several algorithmic options for omc the most obvious is to spawn independent processes draw 
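one random-number vector for each process, optimize the parameters against it, and weight the result. A compact serial sketch of that option is given below; the function names, the use of SciPy's Nelder–Mead optimizer, the starting-point generator `theta_init`, and the finite-difference Jacobian step are our own choices, so this should be read as an illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def omc_sample(simulate, log_prior, theta_init, y_obs, n_particles,
               eps=0.1, u_dim=10, fd_step=1e-5):
    """Serial sketch of Optimization Monte Carlo.  `simulate(theta, u)` must be
    deterministic given u and return a NumPy array of summary statistics;
    `theta_init()` returns a starting point (e.g. a prior draw).  Returns the
    accepted parameters and their unnormalized importance weights."""
    thetas, weights = [], []
    for _ in range(n_particles):
        u = np.random.rand(u_dim)                          # fix the randomness
        disc = lambda th: np.linalg.norm(simulate(th, u) - y_obs)
        res = minimize(disc, x0=theta_init(), method='Nelder-Mead')
        theta = np.atleast_1d(res.x)
        if res.fun > eps:                                  # optimum outside the eps-ball: reject
            continue
        f0 = simulate(theta, u)                            # finite-difference Jacobian
        J = np.column_stack([(simulate(theta + fd_step * e, u) - f0) / fd_step
                             for e in np.eye(theta.size)])
        w = np.exp(log_prior(theta)) / np.sqrt(np.linalg.det(J.T @ J))
        thetas.append(theta)
        weights.append(w)
    return np.array(thetas), np.array(weights)
```

In the paper's own wording, then, the most obvious option is to spawn independent processes, draw a u_i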
for each and optimize until is reached or max nbr of simulations run then compute jacobians and particle weights variations could include keeping sorted list of discrepancies and allocating computational resources to the worst particle however to compare omc with smc in this paper we use sequential version of omc that mimics the epsilon rounds of smc each simulator uses different optimization procedures including newton method for smooth simulators and random walk optimization for others jacobians were computed using finite differences to limit computational expense we placed max of simulations per sample per round for all algorithms unless otherwise noted we used and repeated runs times lack of error bars indicate very low deviations across runs we also break some of the notational convention used thus far so that we can specify exactly how the random numbers translate into and the into statistics this is clarified for each example results are explained in figures to normal with unknown mean and known variance the simplest example is the inference of the mean of univariate normal distribution with known variance the prior distribution is normal with mean and variance kσ where is factor relating the dispersions of and the data yn the simulator can generate data according to the normal distribution or deterministically if the random effects rum are known xm xm xm rum where rum erf using the inverse cdf sufficient statistic for this problem is the average rum xm therefore we have where the average of the random effects in our experiment we set and the exact jacobian and θio can be computed for this problem for draw ui ji if is the mean of the observations then by setting θio ui we find θio ui therefore the exact weights are wi θio allowing us to compare directly with an exact posterior based on our dual representation by shown by orange circles in figure we used newton method to optimize each particle figure left inference of unknown mean for omc uses ss uses ss at omc uses ss only ss more and smc jumps to ss for all algorithms and values right inference for mixture of normals similar results for omc at had ss and at has ss the remained at for omc but decreased to at and for both at not only does the ess remain high for omc but it also represents the tails of the distribution well even at low normal mixture standard illustrative abc problem is the inference of the mean of mixture of two normals with θa θb where hyperparameters are θa θb and single observation scalar for this problem so we drop the subscript the true posterior is simply in this problem there are two random numbers and one for selecting the mixture component and the other for the random innovation further the statistic is the identity erf erf erf where erf as with the previous example the jacobian is and θio ui is known exactly this problem is notable for causing performance issues in and its difficulty in targeting the tails of the posterior this is not the case for omc exponential with unknown rate in this example the goal is to infer the rate of an exponential distribution with gamma prior gamma based on draws from exp xm exp xm exp xm ln um rum where rum ln um the pinverse cdf of the exponential sufficient statistic for this problem is the average xm again we have exact expressions for the jacobian and θi using ui ui ji ui and θio ui we used in our experiments linked mean and variance of normal in this example we link together the mean and variance of the data generating function as follows xm erf θrum xm xm figure left 
inference of rate of exponential similar result wrt ss occurs for this experiment at omc had for at ss was omc dropping with below omc drops at to for smc at remains the same right inference of linked normal drops significantly for omc at to and at to while it remains high for smc to this is the result the inability of every ui to achieve whereas for smc the algorithm allows them to drop their random numbers and effectively switch to another this was verified by running an expensive optimization resulting in and optimized particles having under taking this inefficiency into account omc still requires simulations per effective sample for smc ie and where rum erf we put positive constraint on we used statistics the mean and variance of draws from the simulator xm θr xm where and rum the exact jacobian is therefore in our experiments the simplest model explains populations over time controlled by set of stochastic differential equations dt dt where and are the prey and predator population sizes respectively gaussian noise is added at each full lognormal priors are placed over the simulator runs for time steps with constant initial populations of for both prey and predator there is therefore outputs prey and predator populations concatenated which we use as the statistics to run deterministic simulation we draw ui where the dimension of is half of the random variables are used for and the other half for in other words rust erf where for the appropriate population the jacobian is matrix that can be computed using using forward simulations bayesian inference of the queue model bayesian inference of the queuing model is challenging requiring abc algorithms or sophisticated procedures though simple to simulate the output can be quite figure top bottom queue the left plots shows the posterior mean std errors of the posterior predictive distribution sorted for simulations per sample and the posterior of are shown in the other plots for at the ss for omc were for and increased at to however the was lower for omc at it was and down to at whereas stayed around for this is again due to the optimal discrepancy for some being greater than however the samples that remain are independent samples for the results are similar but the is lower than the number of discrepancies satisfying indicating that the volume of the jacobians is having large effect on the variance of the weights future work will explore this further noisy in the queuing model single server processes arriving customers which are then served within random time customer arrives at time wm exp after customer and is served in sm service time both wm and sm are unobserved only the times xm are observed following we write the simulation algorithm in terms of arrival times vm to simplify the updates we keep track of the departure times dm initially and followed by updates for vm wm xm sm max vm dm xm after trying several optimization procedures we found the most reliable optimizer was simply random walk the random sources in the problem used for wm there are and for um there are therefore is dimension typical statistics for this problem are quantiles of the minimum and maximum values in other words the vector is sorted then evenly spaced values for the statistics functions we used quantiles the jacobian is an matrix in our experiments conclusion we have presented optimization monte carlo algorithm that by controlling the simulator randomness transforms traditional abc inference into set of optimization procedures by using omc scientists can focus 
attention on finding useful optimization procedure for their simulator and then use omc in parallel to generate samples independently we have shown that omc can also be very efficient though this will depend on the quality of the optimization procedure applied to each problem in our experiments the simulators were cheap to run allowing jacobian computations using finite differences we note that for input spaces and expensive simulators this may be infeasible solutions include randomized gradient estimates or automatic differentiation ad libraries future work will include incorporating ad improving efficiency using sobol numbers when the size is known incorporating bayesian optimization adding partial communication between processes and inference for expensive simulators using optimization acknowledgments we thank the anonymous reviewers for the many useful comments that improved this manuscript mw acknowledges support from facebook google and yahoo references ahn korattikara liu rajan and welling large scale distributed bayesian matrix factorization using stochastic gradient mcmc in kdd ahn shahbaba and welling distributed stochastic gradient mcmc in proceedings of the international conference on machine learning pages beaumont cornuet marin and robert adaptive approximate bayesian computation biometrika blum and regression models for approximate bayesian computation statistics and computing bonassi and west sequential monte carlo with adaptive weights for approximate bayesian computation bayesian analysis del moral doucet and jasra sequential monte carlo samplers journal of the royal statistical society series statistical methodology drovandi pettitt and faddy approximate bayesian computation using indirect inference journal of the royal statistical society series applied statistics fearnhead and prangle constructing summary statistics for approximate bayesian computation approximate bayesian computation journal of the royal statistical society series statistical methodology forneron and ng the abc of simulation estimation with auxiliary statistics arxiv preprint forneron and ng reverse sampler of the posterior distribution arxiv preprint gourieroux monfort and renault indirect inference journal of applied econometrics gutmann and corander bayesian optimization for inference of statistical models journal of machine learning research preprint in press jones schonlau and welch efficient global optimization of expensive functions journal of global optimization maclaurin and duvenaud autograd meeds leenders and welling hamiltonian abc uncertainty in ai neal efficient bayesian computation for household epidemics statistical computing paige wood doucet and teh asynchronous anytime sequential monte carlo in advances in neural information processing systems pages shestopaloff and neal on bayesian inference for the queue with efficient mcmc sampling technical report dept of statistics university of toronto sisson fan and tanaka sequential monte carlo without likelihoods proceedings of the national academy of sciences sisson fan and tanaka sequential monte carlo without likelihoods errata proceedings of the national academy of sciences snoek larochelle and adams practical bayesian optimization of machine learning algorithms advances in neural information processing systems spall multivariate stochastic approximation using simultaneous perturbation gradient approximation automatic control ieee transactions on 
Adaptive Primal-Dual Splitting Methods for Statistical Learning and Image Processing

Thomas Goldstein (Department of Computer Science, University of Maryland, College Park, MD), Min Li (School of Economics and Management, Southeast University, Nanjing, China), Xiaoming Yuan (Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong)

Abstract. The alternating direction method of multipliers (ADMM) is an important tool for solving complex optimization problems, but it involves minimization sub-steps that are often difficult to solve efficiently. The primal-dual hybrid gradient (PDHG) method is a powerful alternative that often has simpler sub-steps than ADMM, thus producing lower-complexity solvers. Despite the flexibility of this method, PDHG is often impractical because it requires the careful choice of multiple stepsize parameters, and there is often no intuitive way to choose these parameters to maximize efficiency, or even achieve convergence. We propose self-adaptive stepsize rules that automatically tune PDHG parameters for optimal convergence. We rigorously analyze our methods and identify convergence rates. Numerical experiments show that adaptive PDHG has strong advantages over non-adaptive methods in terms of both efficiency and simplicity for the user.

Introduction. Splitting methods such as ADMM have recently become popular for solving problems in distributed computing, statistical regression, and image processing. ADMM allows complex problems to be broken down into sequences of simpler sub-steps, usually involving least-squares minimizations. However, in many cases these least-squares minimizations are difficult to compute directly. In such situations, the primal-dual hybrid gradient method (PDHG), also called the linearized ADMM, enables the solution of complex problems with a simpler sequence of sub-steps that can often be computed in closed form. This flexibility comes at a cost: the PDHG method requires the user to choose multiple stepsize parameters that jointly determine the convergence of the method. Without extensive analytical knowledge about the problem being solved, such as eigenvalues of linear operators, there is no intuitive way to select stepsize parameters to obtain fast convergence, or even to guarantee convergence at all. In this article we introduce and analyze self-adaptive variants of PDHG, i.e., variants that automatically tune stepsize parameters to attain, and guarantee, fast convergence without user input.

Applying adaptivity to splitting methods is a difficult problem. It is known that naive adaptive variants of ADMM are, in general, non-convergent; however, recent results prove convergence when specific mathematical requirements are enforced on the stepsizes. Despite this progress, the requirements for convergence of adaptive PDHG have been unexplored. This is surprising given that stepsize selection is a much bigger issue for PDHG than for ADMM, because it requires multiple stepsize parameters.

The contributions of this paper are as follows. First, we describe applications of PDHG and its advantages over ADMM. We then introduce a new adaptive variant of PDHG. The new algorithm not only tunes parameters for fast convergence, but contains a line search that guarantees convergence when stepsize restrictions are unknown to the user. We analyze the convergence of adaptive PDHG and rigorously prove convergence rate guarantees. Finally, we use numerical experiments to show the advantages of adaptivity for both convergence speed and ease of use.

The primal-dual hybrid gradient method. The PDHG scheme has its roots in the Arrow–Hurwicz method, which was studied by Popov. Research in this direction was reinvigorated by the introduction of PDHG, which converges rapidly for a wider range of stepsizes than the original scheme.
pdhg was first presented in and analyzed for convergence in it was later studied extensively for image segmentation an extensive technical study of the method and its variants is given by he and yuan several extensions of pdhg including simplified iterations for the case that or is differentiable are presented by condat several authors have also derived pdhg as preconditioned form of admm pdhg solves problems of the form min max ax for convex and we will see later that an incredibly wide range of problems can be cast as the steps of pdhg are given by at arg min kx xk arg min ky where and are stepsize parameters steps and of the method update decreasing the energy by first taking gradient descent step with respect to the inner product term in and then taking backward or proximal step involving in steps and the energy is increased by first marching up the gradient of the inner product term with respect to and then backward step is taken with respect to pdhg has been analyzed in the case of constant stepsizes and in particular it is known to converge as long as at however pdhg typically does not converge when stepsizes are used even in the case that at furthermore it is unclear how to select stepsizes when the spectral properties of are unknown in this article we identify the specific stepsize conditions that guarantee convergence in the presence of adaptivity and propose backtracking scheme that can be used when the spectral radius of is unknown applications linear inverse problems many inverse problems and statistical regressions have the form minimize sx ax where the data term is some convex function is convex regularizer such as the and are linear operators and is vector of data recently the alternating direction method of multipliers admm has become popular method for solving such problems the admm relies on the change of variables sx and generates the following sequence of iterates for some stepsize arg minx ax sx ksx arg miny the in requires the solution of potentially large problem involving both and common formulations such as the consensus admm solve these large with direct matrix factorizations however this is often impractical when either the data matrices are extremely large or fast transforms such as fft dct or hadamard can not be used the problem can be put into the form using the fenchel conjugate of the convex function denoted which satisfies the important identity max for all in the domain of replacing in with this expression involving its conjugate yields min max ax sx which is of the form the forward gradient steps of pdhg handle the matrix explicitly allowing linear inverse problems to be solved without any difficult we will see several examples of this below scaled lasso the lasso or scaled lasso is variable selection regression that obtains sparse solutions to systems of linear equations scaled lasso has several advantages over classical lasso it is more robust to noise and it enables setting penalty parameters without cross validation given data matrix and vector the scaled lasso finds sparse solution to the system dx by solving min kdx for some scaling parameter note the term in is not squared as in classical lasso if we write max and kdx max dx we can put in the form min max dx unlike admm pdhg does not require the solution of problems involving minimization form total variation is commonly used to solve problems of the min kax where is array image is the discrete gradient operator is linear operator and contains data if we add dual variable and write rx we obtain 
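a saddle-point reformulation, written out right after the following sketch of the plain PDHG iteration on this model. The sketch instantiates the updates above for 1-D total-variation denoising with the measurement operator equal to the identity; the operator construction, the constant stepsize choice with tau*sigma*||D||^2 < 1, and the NumPy details are our own, so it should be read as an illustration rather than the paper's code.

```python
import numpy as np

def pdhg_tv_denoise(f, mu=10.0, n_iters=500):
    """Plain PDHG sketch for 1-D total-variation denoising:
        min_x (mu/2) * ||x - f||^2 + |D x|_1,
    solved as the saddle-point problem
        min_x max_{|y|_inf <= 1} (mu/2) * ||x - f||^2 + <D x, y>."""
    n = f.size
    D = np.eye(n, k=1) - np.eye(n)       # forward-difference (discrete gradient) operator
    D[-1, :] = 0.0                       # no difference past the boundary
    L = np.linalg.norm(D, 2)             # spectral norm of the linear operator
    tau = sigma = 0.99 / L               # constant stepsizes with tau * sigma * L^2 < 1
    x, y = f.astype(float).copy(), np.zeros(n)
    for _ in range(n_iters):
        # primal: gradient step on <Dx, y>, then proximal step on (mu/2)||x - f||^2
        x_new = (x - tau * (D.T @ y) + tau * mu * f) / (1.0 + tau * mu)
        # dual: gradient ascent at the extrapolated point, then projection onto
        # the l-infinity unit ball (the proximal step for the conjugate term)
        y_hat = y + sigma * (D @ (2.0 * x_new - x))
        y = y_hat / np.maximum(1.0, np.abs(y_hat))
        x = x_new
    return x
```

The stepsize choice mirrors the constant-stepsize convergence condition mentioned above; the adaptive rules of this paper replace it with data-driven updates. Returning to the derivation: adding a dual variable y and writing the TV term through its conjugate turns the model into the problem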
max min kax rx which is clearly of the form the pdhg solver using formulation avoids the inversion of the gradient operator that is required by admm this is useful in many applications for example in compressive sensing the matrix may be orthogonal hadamard wavelet or fourier transform in this case the proximal of pdhg are solvable in closed form using fast transforms because they do not involve the gradient operator the of admm involve both the gradient operator and the matrix simultaneously and thus require inner loops with expensive iterative solvers adaptive formulation the convergence of pdhg can be measured by the size of the residuals or gradients of with respect to the primal and dual variables and these primal and dual gradients are simply at and where and denote the of and the can be directly evaluated from the sequence of pdhg iterates using the optimality condition for rearranging this yields the same method can be applied to to obtain applying these results to yields the closed form residuals at xk when choosing the stepsize for pdhg there is tradeoff between the primal and dual residuals choosing large and small drives down the primal residuals at the cost of large dual residuals choosing small and large results in small dual residuals but large primal errors one would like to choose stepsizes so that the larger of and is as small as possible if we assume the residuals on step change monotonically with then max is minimized when this suggests that we tune to balance the primal and dual residuals to achieve residual balancing we first select parameter that controls the aggressiveness of adaptivity on each iteration we check whether the primal residual is at least twice the dual if so we increase the primal stepsize to and decrease the dual to if the dual residual is at least twice the primal we do the opposite when we modify the stepsize we shrink the adaptivity level to for we will see in section that this adaptivity level decay is necessary to guarantee convergence in our implementation we use in addition to residual balancing we check the following backtracking condition after each iteration xk xk ky where is constant we use is our experiments if condition fails then we shrink and before the next iteration we will see in section that the backtracking condition is sufficient to guarantee convergence the complete scheme is listed in algorithm algorithm adaptive pdhg choose large and and set while kpk kdk tolerance do compute from xk using the pdhg updates check the backtracking condition and if it fails set compute the residuals and use them for the following two adaptive updates if then set and if then set and if no adaptive updates were triggered then and end while convergence theory in this section we analyze algorithm and its rate of convergence in our analysis we consider adaptive variants of pdhg that satisfy the following assumptions we will see later that these assumptions guarantee convergence of pdhg with rate algorithm trivially satisfies assumption the sequence measures the adaptive aggressiveness on iteration and serves the same role as in algorithm the geometric decay of ensures that assumption holds the backtracking rule explicitly guarantees assumption assumptions for adaptive pdhg the sequences and are positive and bounded the sequence is summable where max either or is bounded and there is constant such that for all xk xk ky variational inequality formulation for notational simplicity we define the composite vector uk xk and the matrices at at mk and ax this 
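compact notation will be used throughout the convergence results. Before turning to them, the residual-balancing and backtracking rules of Algorithm 2 can be sketched as a single update function; the residual formulas and the precise form of the backtracking test below are our best reconstruction from the description above, so treat them as illustrative rather than as the paper's reference implementation.

```python
import numpy as np

def adapt_stepsizes(x, x_new, y, y_new, A, tau, sigma, alpha, eta=0.95, gamma=0.9):
    """One residual-balancing / backtracking update (sketch of Algorithm 2).
    Returns updated (tau, sigma, alpha)."""
    dx, dy = x - x_new, y - y_new
    # backtracking: if the stability test fails, shrink both stepsizes
    lhs = gamma * (dx @ dx) / (2.0 * tau) + gamma * (dy @ dy) / (2.0 * sigma)
    rhs = 2.0 * (dy @ (A @ dx))
    if lhs < rhs:
        return 0.5 * tau, 0.5 * sigma, alpha
    # residual balancing: compare primal and dual residual norms
    p = dx / tau - A.T @ dy              # primal residual
    d = dy / sigma - A @ dx              # dual residual
    if np.linalg.norm(p) > 2.0 * np.linalg.norm(d):      # primal residual too large
        tau, sigma, alpha = tau / (1 - alpha), sigma * (1 - alpha), alpha * eta
    elif np.linalg.norm(d) > 2.0 * np.linalg.norm(p):    # dual residual too large
        tau, sigma, alpha = tau * (1 - alpha), sigma / (1 - alpha), alpha * eta
    return tau, sigma, alpha
```

Returning to the analysis, this composite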
notation allows us to formulate the optimality conditions for as variational inequality vi if is solution to then is minimizer of more formally at ax likewise is maximized by and so subtracting from and letting yields the vi formulation where we say is an approximate solution to with vi accuracy if where is unit ball centered at in theorem we prove ergodic convergence of adaptive pdhg using the vi notion of convergence preliminary results we now prove several results about the pdhg iterates that are needed to obtain convergence rate lemma the iterates generated by pdhg satisfy kuk uk the proof of this lemma follows standard techniques and is presented in the supplementary material this next lemma bounds iterates generated by pdhg lemma suppose the stepsizes for pdhg satisfy assumptions and then kuk for some upper bound cu cu the proof of this lemma is given in the supplementary material lemma under assumptions and we have kuk where kuk cu ch ku and ch is constant such that ku ch ku proof using the definition of mk we obtain kuk kuk kxk kx ku ky ky kuk cu ch ku ku cu ch ku where we have used the bound kuk cu from lemma and this final lemma provides vi interpretation of the pdhg iteration lemma the iterates uk xk generated by pdhg satisfy mk uk proof let uk xk be pair of pdhg iterates the minimizers in and of pdhg satisfy the following for all at at yk xk and also for all xk adding these two inequalities and using the notation yields the result convergence rate we now combine the above lemmas into our final convergence result theorem suppose that the stepsizes in pdhg satisfy assumptions and consider the sequence defined by this sequence satisfies the convergence bound ku ku cu thus converges to solution of with rate in the vi sense ch ku proof we begin with the following identity special case of the polar identity for vector spaces ku ku uk kuk we apply this to the vi formulation of the pdhg iteration to get mk uk ku uk kuk ku note that also assumption guarantees that kuk these observations reduce to and so ku and invoke lemma we now sum for to at uk ku because is convex ku ut ku ku ut ku uk ku cu th uk ch ku ku th the left side of therefore satisfies combining and yields the tasty bound ku ut ku cu ch ku applying proves the theorem numerical results we apply the original and adaptive pdhg to the test problems described in section we terminate the algorithms when both the primal and dual residual norms kpk and kdk are smaller than we consider four variants of pdhg the method adapt backtrack denotes adaptive pdhg with backtracking the method adapt refers to the adaptive method without backtracking with at we alsop consider the pdhg with two different stepsize choices the method const refers to the method with both stepsize parameters equal to at the method const refers to the method where the stepsizes are chosen to be the final values of the stepsizes used by adapt this final method is meant to demonstrate the performance of pdhg with stepsize that is customized to the problem at hand but still the specifics of each test problem are described below primal stepsize rof convergence curves ba cktra ck co st co st al ba cktra ck τk energy gap iteration iteration figure left convergence curves for the tv denoising experiment with the displays the difference between the objective at the kth iterate and the optimal objective value right stepsize sequences for both adaptive schemes table iteration counts for each problem with runtime sec in parenthesis problem scaled lasso scaled lasso scaled lasso 
tv tv tv compressive compressive compressive adapt backtrack adapt const const scaled lasso we test our methods on using the synthetic problem suggested in the test problem recovers dimensional vector with nonzero components using gaussian matrix total variation minimization we apply the model with to the cameraman image the image is scaled to the range and noise contaminated with standard deviation the image is denoised with and see table for time trial results note the similar performance of algorithm with and without backtracking indicating that there is no advantage to knowing the constant at we plot convergence curves and show the evolution of in figure note that is large for the first several iterates and then decays over time compressed sensing we reconstruct phantom from hadamard measurements data is generated by applying the hadamard transform to discretization of the phantom and then sampling and of the coefficients are random discussion and conclusion several interesting observations can be made from the results in table first both the backtracking adapt backtrack and adapt methods have similar performance on average for the imaging problems with neither algorithm showing consistently better performance thus there is no cost to using backtracking instead of knowing the ideal stepsize at finally the method const using optimized stepsizes did not always outperform the constant stepsizes this occurs because the true best stepsize choice depends on the active set of the problem and the structure of the remaining error and thus evolves over time this is depicted in figure which shows the time dependence of this show that adaptive methods can achieve superior performance by evolving the stepsize over time acknowledgments this work was supported by the national science foundation the office of naval research and the hong kong research grants council general research fund hkbu the second author was supported in part by the program for new century excellent university talents under grant no and the qing lan project references glowinski and marroco sur approximation par finis ordre un et la par une classe de de dirichlet non rev automat inf recherche roland glowinski and patrick le tallec augmented lagrangian and methods in nonlinear mechanics society for industrial and applied mathematics philadephia pa tom goldstein and stanley osher the split bregman method for regularized problems siam img april ernie esser xiaoqun zhang and tony chan general framework for class of first order algorithms for convex optimization in imaging science siam journal on imaging sciences antonin chambolle and thomas pock algorithm for convex problems with applications to imaging convergence yuyuan ouyang yunmei chen guanghui lan and eduardo pasiliao an accelerated linearized alternating direction method of multipliers arxiv preprint he yang and wang alternating direction method with penalty parameters for monotone variational inequalities journal of optimization theory and applications popov modification of the method for search of saddle points mathematical notes of the academy of sciences of the ussr mingqiang zhu and tony chan an efficient hybrid gradient algorithm for total variation image restoration ucla cam technical report pock cremers bischof and chambolle an algorithm for minimizing the functional in computer vision ieee international conference on pages bingsheng he and xiaoming yuan convergence analysis of algorithms for problem from contraction perspective siam img january laurent condat splitting 
method for convex optimization involving lipschitzian proximable and linear composite terms journal of optimization theory and applications silvia bonettini and valeria ruggiero on the convergence of hybrid gradient algorithms for total variation image restoration journal of mathematical imaging and vision boyd parikh chu peleato and eckstein distributed optimization and statistical learning via the alternating direction method of multipliers foundations and trends in machine learning belloni victor chernozhukov and wang lasso pivotal recovery of sparse signals via conic programming biometrika tingni sun and zhang scaled sparse linear regression biometrika rudin osher and fatemi nonlinear total variation based noise removal algorithms physica tom goldstein lina xu kevin kelly and richard baraniuk the stone transform image enhancement and compressive video preprint available at lustig donoho and pauly sparse mri the application of compressed sensing for rapid mr imaging magnetic resonance in medicine xiaoqun zhang and froment total variation based fourier reconstruction and regularization for computer tomography in nuclear science symposium conference record ieee volume pages oct robert tibshirani regression shrinkage and selection via the lasso journal of the royal statistical society series 
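To make the adaptive PDHG discussion above concrete, the following is a minimal sketch of a plain primal-dual hybrid gradient iteration applied to the ROF/TV denoising model used in the experiments, together with a simple residual-balancing stepsize heuristic in the spirit of the adaptive scheme. This is a sketch under assumptions, not the authors' reference implementation: the update order, the balancing constants `delta` and `eta`, the iteration count, and the function names are illustrative choices; only the overall structure (dual ascent plus projection, closed-form primal step, primal/dual residuals used to rebalance the stepsizes) mirrors the text.

```python
import numpy as np

def grad(x):
    # forward differences with Neumann (replicate) boundary conditions
    gx = np.zeros_like(x); gy = np.zeros_like(x)
    gx[:-1, :] = x[1:, :] - x[:-1, :]
    gy[:, :-1] = x[:, 1:] - x[:, :-1]
    return gx, gy

def div(px, py):
    # negative adjoint of grad, so that <grad(x), (px, py)> = -<x, div(px, py)>
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[0, :] = px[0, :]; dx[1:-1, :] = px[1:-1, :] - px[:-2, :]; dx[-1, :] = -px[-2, :]
    dy[:, 0] = py[:, 0]; dy[:, 1:-1] = py[:, 1:-1] - py[:, :-2]; dy[:, -1] = -py[:, -2]
    return dx + dy

def adaptive_pdhg_rof(f, mu, tau=0.25, sigma=0.25, delta=1.5, eta=0.95, iters=300):
    """Denoise image f by solving min_x 0.5*||x - f||^2 + mu*||grad x||_1."""
    x = f.copy()
    yx = np.zeros_like(f); yy = np.zeros_like(f)
    for _ in range(iters):
        x_old, yx_old, yy_old = x, yx, yy
        # dual step: gradient ascent followed by projection onto {|y| <= mu}
        gx, gy = grad(x)
        yx = np.clip(yx + sigma * gx, -mu, mu)
        yy = np.clip(yy + sigma * gy, -mu, mu)
        # primal step: closed-form proximal update of the quadratic data term
        x = (x + tau * (f + div(yx, yy))) / (1.0 + tau)
        # primal and dual residual norms, used here only to rebalance the stepsizes
        p = np.linalg.norm((x_old - x) / tau + div(yx_old - yx, yy_old - yy))
        gdx, gdy = grad(x_old - x)
        d = np.linalg.norm(np.hstack([((yx_old - yx) / sigma - gdx).ravel(),
                                      ((yy_old - yy) / sigma - gdy).ravel()]))
        if p > delta * d:        # primal residual dominates: enlarge the primal step
            tau, sigma = tau / eta, sigma * eta
        elif d > delta * p:      # dual residual dominates: enlarge the dual step
            tau, sigma = tau * eta, sigma / eta
    return x
```

Note that the heuristic rescales tau and sigma by reciprocal factors, so their product is unchanged from one iteration to the next; starting from tau = sigma = 0.25 the usual stability condition tau * sigma * ||grad||^2 <= 1 therefore continues to hold throughout the run.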
on some provably correct cases of variational inference for topic models andrej risteski department of computer science princeton university princeton nj risteski pranjal awasthi department of computer science rutgers university new brunswick nj abstract variational inference is an efficient popular heuristic used in the context of latent variable models we provide the first analysis of instances where variational inference algorithms converge to the global optimum in the setting of topic models our initializations are natural one of them being used in the most popular implementation of variational inference in addition to providing intuition into why this heuristic might work in practice the multiplicative rather than additive nature of the variational inference updates forces us to use proof arguments which we believe might be of general theoretical interest introduction over the last few years heuristics for optimization have emerged as one of the most fascinating phenomena for theoretical study in machine learning methods like alternating minimization em variational inference and the like enjoy immense popularity among ml practitioners and with good reason they re vastly more efficient than alternate available methods like convex relaxations and are usually easily modified to suite different applications theoretical understanding however is sparse and we know of very few instances where these methods come with formal guarantees among more classical results in this direction are the analyses of lloyd algorithm for which is very closely related to the em algorithm for mixtures of gaussians the recent work of also characterizes global convergence properties of the em algorithm for more general settings another line of recent work has focused on different heuristic called alternating minimization in the context of dictionary learning prove that with appropriate initialization alternating minimization can provably recover the ground truth have proven similar results in the context of phase retreival another popular heuristic which has so far eluded such attempts is known as variational inference we provide the first characterization of global convergence of variational inference based algorithms for topic models we show that under natural assumptions on the matrix and the topic priors along with natural initialization variational inference converges to the parameters of the underlying ground truth model to prove our result we need to overcome number of technical hurdles which are unique to the nature of variational inference firstly the difficulty in analyzing alternating minimization methods for dictionary learning is alleviated by the fact that one can come up with closed form expressions for the updates of the dictionary matrix we do not have this luxury second the norm in which variational inference naturally operates is kl divergence which can be difficult to work with we stress that the focus of this work is not to identify new instances of topic modeling that were previously not known to be efficiently solvable but rather providing understanding about the behaviour of variational inference the defacto method for learning and inference in the context of topic models latent variable models and em we briefly review em and variational methods the setting is latent variable models where the observations xi are generated according to distribution xi zi xi where are parameters of the models and zi is latent variable given the observations xi common task is to find the max likelihood value 
of the parameter argmaxθ log xi the em algorithm is an iterative method to achieve this dating all the way back to and in the in the above framework it can be formulated as the following procedure maintaining estimates θt of the model parameters and the posterior distribution over the hidt den variables in the we xcompute the distribution in the mstep we set argmaxθ log xi zi sometimes even the above two steps may not be computationally feasible in which case they can be relaxed by choosing family of simple distributions and performing the following updates in the variational we compute the distribution min kl θt in the we set log xi zi by picking the family appropriately it often argmaxθ ble to make both steps above run in polynomial time as expected none of the above two families of approximations come with any provable global convergence guarantees with em the problem is ensuring that one does not get stuck in local optimum with variational em additionally we are faced with the issue of in principle not even exploring the entire space of solutions topic models and prior work we focus on particular popular latent variable model topic models the generative model over word documents is the following for each document in the corpus proportion of topics γk is sampled according to prior distribution then for each position in the document we pick topic zp according to multinomial with parameters γk conditioned on zp we pick word from multinomial with parameters βi to put in position the matrix of values βi is known as the matrix the body of work on topic models is vast prior theoretical work relevant in the context of this paper includes the sequence of works by as well as and and assume that the matrix contains anchor words this means that each topic has word which appears in that topic and no other on the other hand work with certain expansion assumption on the graph which says that if one takes subset of topics the number of words in the support of these topics should be at least smax where smax is the maximum support size of any topic neither paper needs any assumption on the topic priors and can handle almost arbitrarily short documents the assumptions we make on the matrix will be related to the ones in the above works but our documents will need to be long so that the empirical counts of the words are close to their expected counts our priors will also be more structured this is expected since we are trying to analyze an existing heuristic rather than develop new algorithmic strategy the case where the documents are short seems significantly more difficult namely in that case there are two issues to consider one is proving the variational approximation to the posterior distribution over topics is not too bad the second is proving that the updates do actually reach the global optimum assuming long documents allows us to focus on the second issue alone which is already challenging on high level the instances we consider will have the following structure the topics will satisfy weighted expansion property for any set of topics of constant size for any topic in this set the probability mass on words which belong to and no other topic in will be large similar to the expansion in but only over constant sized subsets the number of topics per document will be small further the probability of including given topic in document is almost independent of any other topics that might be included in the document already similar properties are satisfied by the dirichlet prior one of the most popular 
priors in topic modeling originally introduced by the documents will also have dominating topic similarly as in for each word and topic it appears in there will be decent proportion of documents that contain topic and no other topic containing these can be viewed as local anchor documents for that topic we state below informally our main result see sections and for more details theorem under the above mentioned assumptions popular variants of variational inference for topic models with suitable initializations provably recover the ground truth model in polynomial time variational relaxation for learning topic models in this section we briefly review the variational relaxation for topic models following closely throughout the paper we will denote by the total number of words and the number of topics we will assume that we are working with sample set of documents we will also denote by the fractional count of word in document count where count is the number of times word appears in the document and nd is the number of words in the document for topic models variational updates are way to approximate the computationally intractable as described in section recall the model parameters for topic models are the topic prior parameters and the matrix the observable is the list of words in the document the latent variables are the topic assignments zj at each position in the document and the topic proportions the variational hence becomes minp kl αt for some family of distributions the family one usually considered is πn qj zj mean field family in it shown that for dirichlet priors the optimal distributions are dirichlet distribution for with some parameter and multinomials for with some parameters φj the variational em updates are shown to have the following form in the one runs to convergence the following updates on pnd the and parameters φd βi eeq log γd αd φd in the one xx updates the and parameters by setting βi φtd wd where φtd is the converged value of φd wd is the word in document position wd is an indicator variable which is if the word in position in document is word the dirichlet parameters do not have closed form expression and are updated via gradient descent simplified updates in the long document limit from the above updates it is difficult to give assign an intuitive meaning to the and parameters indeed it not even clear what one would like them to be ideally at the global optimum we will be however working in the large document limit and this will simplify the updates in particular in the in the large document limit the first term in the update equation for has vanishing pnd contribution in this case we can simplify the as φd βi γd γd φd notice importantly in the second update we now use variables γd instead of which are nork malized such that γd these correspond to the topic proportions given our current estimates βi for the model parameters the will remain as is but we will focus on the only and ignore the updates as the estimates disappeared from the updates βi γd where γd is the converged value of γd in this case the intuitive ing of the and variables is clear they are estimates of the the model parameters and the topic proportions given an estimate of the model parameters respectively the way we derived them these updates appear to be an approximate form of the variational updates in however it is possible to also view them in more principled manner these updates approximate the posterior distribution αt by first approximating this posterior by αt where is the value for given 
our current estimates of and then setting αt to be product distribution it is intuitively clear that in the large document limit this approximation should not be much worse than the one in as the posterior concentrates around the maximum likelihood value and in fact our proofs will work for finite but long documents finally we will rewrite the above equations in slightly more convenient form denoting fd γd βi the can be written as iterate until convergence γd γd fd the becomes βi βi fd γd fd pd pd γd where γd βi and γd is the converged value of γd alternating kl minimization and thresholded updates we will further modify the and update equations we derived above in slightly modified form these updates were used in paper by in the context of matrix factorization pd there the authors proved that under these updates kl fd is one can easily modify their arguments to show that the same property is preserved if the is replaced by step γdt minγdt kl where is the simplex minimizing the kl divergence between the counts and the predicted counts with respect to the variables in fact iterating the updates above is way to solve this convex minimization problem via version of gradient descent which makes multiplicative updates rather than additive updates thus the updates are performing alternating minimization using the kl divergence as the distance measure with the difference that for the variables one essentially just performs single gradient step in this paper we will make modification of the which is very natural intuitively the update for βi goes over all appearances of the word and adds the fractional assignment of the word to topic under our current estimates of the variables in the modified version we will γd only average over those documents where γd the intuitive reason behind this modification is the following the em updates we are studying work with the kl divergence which puts more weight on the larger entries thus for the documents in di the estimates for γd should be better than they might be in the documents di of course since the terms fd involve all the variables γd it is not priori clear that this modification will gain us much but we will prove that it in fact does formally we discuss the three modifications of variational inference specified as algorithm and we call them tem for thresholded em algorithm solve the following convex program for each document minγd γd γd and γd if is not in the support of document let di to be the set of documents γd γd fd log γd set βi βi initializations we will consider two different strategies for initialization first we will consider the case where we initialize with the matrix and the document priors having the correct support the analysis of tem in this case will be the cleanest while the main focus of the paper is tem we ll show that this initialization can actually be done for our case efficiently second we will consider an initialization that is inspired by what the current implementation uses concretely we ll algorithm iterative tem initialize γd uniformly among the topics in the support of document repeat fd γd γd βi until convergence same as above algorithm incomplete tem initialize γd with the values gotten in the previous iteration then perform just one step of same as before assume that the user has some way of finding for each topic seed document in which the proportion of topic is at least cl then when initializing one treats this document as if it were pure namely one sets βi to be the fractional count of word in this document we do 
not attempt to design an algorithm to find these documents case study sparse topic priors support initialization we start with simple case as mentioned all of our results only hold in the long documents regime we will assume for each document the number of sampled words is large enough so that one can approximate the expected frequencies of the words one can find values γd such that γd βi we ll split the rest of the assumptions into those that apply to the topicword matrix and the topic priors let first consider the assumptions on the matrix we will impose conditions that ensure the topics don overlap too much namely we assume words are discriminative each word appears in topics almost disjoint supports if the intersection of the supports of and is βi βi we also need assumptions on the topic priors the documents will be sparse and all topics will be roughly equally likely to appear there will be virtually no dependence between the topics conditioning on the size or presence of certain topic will not influence much the probability of another topic being included these are analogues of distributions that have been analyzed for dictionary learning formally sparse and gapped documents each of the documents in our samples has at most topics furthermore for each document the largest topic argmaxi γd is such that for any other topic γd γd for some arbitrarily small constant dominant topic equidistribution the probability that topic is such that γd γd is weak topic correlations and independent topic distribution for all sets with topics it must be the case that γd is dominating γd is dominating γd furthermore for any set of topics pr γd these assumptions are less smooth version of properties of the dirichlet prior namely it folklore result that dirichlet draws are sparse with high probability for certain reasonable range of parameters this was formally proven by though sparsity there means small number of large coordinates it also well known that dirichlet essentially can not enforce any correlation between different topics we show analogues of the weak topic correlations property and equidistribution in the supplementary material for completeness sake the above assumptions can be viewed as local notion of separability of the model in the following sense first consider particular document for each topic that participates in that document consider the words which only appear in the support of topic in the document in some sense these words are local anchor words for that document these words appear only in one topic of that document because of the almost disjoint supports property there will be decent mass on these words in each document similarly consider particular element βi of the matrix let call dl the set of documents where for all other topics appearing in that document these documents are like local anchor documents for that pair in those documents the word appears as part of only topic it turns out the above properties imply there is decent number of these for any pair are at least poly finally technical condition we will also assume that all nonzero γd βi intuitively this means if topic is present it needs to be reasonably large and similarly for words in topics such assumptions also appear in the context of dictionary learning we will prove the following theorem given an instance of topic modelling satisfying the properties specified above where the number of documents is log if we initialize the supports of the βi and γd variables correctly after log log updates or updates we recover 
the matrix and topic proportions to multiplicative accuracy for any theorem if the number of documents is there is procedure which with probability correctly identifies the supports of the βi and γd variables provable convergence of tem the correctness of the tem updates is proven in steps identifying dominating topic first we prove that if γd is the largest one among all topics in the document topic is actually the largest topic phase getting constant multiplicative factor estimates after initialization after log rounds we will get to variables βi γd which are within constant multiplicative factor from βi γd phase ii alternating minimization lower and upper bound evolution once the and estimates are within constant factor of their true values we show that the lone words and documents have boosting effect they cause the multiplicative upper and lower bounds to improve at each round the updates we are studying are multiplicative not additive in nature and the objective they are optimizing is so the standard techniques do not work the intuition behind our proof in phase ii can be described as follows consider one update for one of the variables say βi we show that βi αβi βi for some constant at time step is something fairly large one should think of it as and comes from the existence of the local anchor documents similar equation holds for the variables in which case the good term comes from the local anchor words furthermore we show that the error in the decreases over time as does the value of so that eventually we can reach βi the analysis bears resemblance to the state evolution and density evolution methods in error decoding algorithm analysis in the sense that we maintain quantity about the evolving system and analyze how it evolves under the specified iterations the quantities we maintain are quite simple upper and lower multiplicative bounds on our estimates at any round initialization recall the goal of this phase is to recover the supports to find out which topics are present in document and identify the support of each topic we will find the topic supports first this uses an idea inspired by in the setting of dictionary learning roughly we devise test which will take as input two documents and will try to determine if the two documents have topic in common or not the test will have no false positives will never say yes if the documents don have topic in common but might say no even if they do we then ensure that with high probability for each topic we find pair of documents intersecting in that topic such that the test says yes the detailed initialization algorithm is included in the supplementary material case study dominating topics seeded initialization next we ll consider an initialization which is essentially what the current implementation of uses namely we will call the following initialization seeded initialization for each topic the user supplies document in which γd cl we treat the document as if it only contains topic and initialize with βi fd we show how to modify the previous analysis to show that with few more assumptions this strategy works as well firstly we will have to assume anchor words that make up decent fraction of the mass of each topic second we also assume that the words have bounded dynamic range the values of word in two different topics are within constant from each other the documents are still gapped but the gap now must be larger finally in roughly fraction of the documents where topic is dominant that topic has proportion for some small but still 
constant similar assumption small fraction of almost pure documents appeared in recent paper by formally we have small dynamic range and large fraction of anchors for each discriminative words if βi and βi furthermore each topic has anchor words such that their total weight is at least gapped documents in each document the largest topic has proportion at least cl and all the other topics are at most cs cl cs log log bcl log cl small fraction of dominant documents among all the documents where topic is nating in fraction of them γd where min log log bcl log cl cl the dependency between the parameters cl is little difficult to parse but if one thinks as for small and logη since log roughly we want that cl in other words the weight we require to have on the anchors depends only logarithmically on the range in the documents where the dominant topic has proportion similar reasoning as the precise statement is as above gives that we want is approximately γd follows theorem given an instance of topic modelling satisfying the properties specified above where the number of documents is log if we initialize with seeded initialization after log log of updates we recover the matrix and topic proportions to multiplicative accuracy if the proof is carried out in few phases phase anchor identification we show that as long as we can identify the dominating topic in each of the documents anchor words will make progress after log number of rounds the values for the estimates will be almost zero for the topics for which word is not an anchor for topic for which word is an anchor we ll have good estimate phase ii discriminative word identification after the anchor words are properly identified in the previous phase if βi βi will keep dropping and quickly reach almost zero the values corresponding to βi will be decently estimated phase iii alternating minimization after phase and ii above we are back to the scenario of the previous section namely there is improvement in each next round during phase and ii the intuition is the following due to our initialization even in the beginning each topic is correlated with the correct values in update we are minimizing kl with respect to the γd variables so we need way to argue that whenever the estimates are not too bad minimizing this quantity provides an estimate about how far the optimal γd variables are from we show the following useful claim lemma if for all topics kl rβ and minγd kl rf after running minimization step with respect to the γd variables we get that kl divergence rf rβ this lemma critically uses the existence of anchor words namely we show intuitively if one thinks of as will be large if is large hence if is not too large whenever is small so is we will be able to maintain rβ and rf small enough throughout the iterations so that we can identify the largest topic in each of the documents on common words we briefly remark on common words words such that βi in this case the proofs above as they are will not work since common words do not have any lone documents however if fraction of the documents where topic is dominant contains topic with proportion and furthermore in each topic the weight on these words is no more than then our proofs still work with either initialization the idea for the argument is simple when the dominating topic is very large we show that fd is very highly correlated with documents behave like anchor documents namely one can show βi βi so these theorem if we additionally have common words satisfying the properties specified above 
after log log updates in case study or any of the tem variants in case study and we use the same initializations as before we recover the matrix and topic proportions to multiplicative accuracy if discussion and open problems in this work we provide the first characterization of sufficient conditions when variational inference leads to optimal parameter estimates for topic models our proofs also suggest possible hard cases for variational inference namely instances with large dynamic range compared to the proportion of anchor words correlated topic priors it not hard to such instances where support initialization performs very badly even with only anchor and common words we made no effort to explore the optimal relationship between the dynamic range and the proportion of anchor words as it not clear what are the worst case instances for this seeded initialization on the other hand empirically works much better we found that when cl and when the proportion of anchor words is as low as variational inference recovers the ground truth even on instances with fairly large dynamic range our current proof methods are too weak to capture this observation in fact even the largest topic is sometimes misidentified in the initial stages so one can not even run tem only the vanilla variational inference updates analyzing the dynamics of variational inference in this regime seems like challenging problem which would require significantly new ideas references agarwal anandkumar jain and netrapalli learning sparsely used overcomplete dictionaries via alternating minimization in proceedings of the conference on learning theory colt anandkumar hsu javanmard and kakade learning latent bayesian networks and topic models under expansion constraints in proceedings of the international conference on machine learning icml we stress we want to analyze whether variational inference will work or not handling common words algorithmically is easy they can be detected and filtered out initially then we can perform the variational inference updates over the rest of the words only this is in fact often done in practice see supplementary material anandkumar kakade foster liu and hsu two svds suffice spectral decompositions for probabilistic topic modeling and latent dirichlet allocation technical report arora ge halpern mimno moitra sontag wu and zhu practical algorithm for topic modeling with provable guarantees in proceedings of the international conference on machine learning icml arora ge kanna and moitra computing nonnegative matrix provably in proceedings of the annual acm symposium on theory of computing pages acm arora ge ma and moitra simple efficient and neural algorithms for sparse coding in proceedings of the conference on learning theory colt arora ge and moitra learning topic models going beyond svd in proceedings of the annual ieee symposium on foundations of computer science focs arora ge and moitra new algorithms for learning incoherent and overcomplete dictionaries in proceedings of the conference on learning theory colt balakrishnan wainwright and yu statistical guarantees for the em algorithm from population to analysis arxiv preprint bansal bhattacharyya and kannan provable algorithm for learning topics in dominant admixture corpus in advances in neural information processing systems nips blei and lafferty topic models text mining classification clustering and applications blei ng and jordan latent dirichlet allocation journal of machine learning research dasgupta and schulman variant of em for 
gaussian mixtures in proceedings of uncertainty in artificial intelligence uai dasgupta and schulman probabilistic analysis of em for mixtures of separated spherical gaussians journal of machine learning research dempster laird and rubin maximum likelihood from incomplete data via the em algorithm journal of the royal statistical society series ding rohban ishwar and saligrama topic discovery through data dependent and random projections arxiv preprint ding rohban ishwar and saligrama efficient distributed topic modeling with provable guarantees in proceedings ot the international conference on artificial intelligence and statistics pages hoffman blei paisley and wan stochastic variational inference journal of machine learning research jordan ghahramani jaakkola and saul an introduction to variational methods for graphical models machine learning kumar and kannan clustering with spectral norm and the algorithm in proceedings of foundations of computer science focs lee and seung algorithms for matrix factorization in advances in neural information processing systems nips netrapalli jain and sanghavi phase retrieval using alternating minimization in advances in neural information processing systems nips sontag and roy complexity of inference in latent dirichlet allocation in advances in neural information processing systems nips sundberg maximum likelihood from incomplete data via the em algorithm scandinavian journal of statistics telgarsky dirichlet draws are sparse with high probability manuscript 
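As a concrete companion to the thresholded EM (tEM) updates described above, here is a minimal sketch, in the long-document limit, of the multiplicative E-step/M-step iteration with the thresholding modification in the M-step. The threshold value, the iteration counts, the normalizations, and the variable names are illustrative assumptions rather than the paper's exact specification; only the structure follows the text: a multiplicative, KL-minimizing update of the topic proportions gamma run to (approximate) convergence, and a beta update that averages only over documents where gamma[d, k] is large.

```python
import numpy as np

def tem_updates(F, beta, gamma, thresh=0.1, outer_iters=50, inner_iters=30, eps=1e-12):
    """F: D x N fractional word counts (each row sums to 1); beta: K x N current
    topic-word estimates (rows sum to 1); gamma: D x K current topic proportions."""
    for _ in range(outer_iters):
        # E-step: multiplicative KL-minimizing updates of the topic proportions,
        # iterated (approximately) to convergence for every document at once.
        for _ in range(inner_iters):
            F_hat = gamma @ beta + eps                 # predicted word frequencies, D x N
            gamma = gamma * ((F / F_hat) @ beta.T)     # gamma[d,k] *= sum_j (F/F_hat)[d,j] * beta[k,j]
            gamma = gamma / (gamma.sum(axis=1, keepdims=True) + eps)  # numerical safeguard
        # Thresholded M-step: document d contributes to topic k only when gamma[d,k]
        # is large, i.e. only the (approximate) local anchor documents are averaged.
        F_hat = gamma @ beta + eps
        weights = gamma * (gamma > thresh)             # D x K, zero out small proportions
        beta = beta * (weights.T @ (F / F_hat))        # multiplicative update, K x N
        beta = beta / (beta.sum(axis=1, keepdims=True) + eps)
    return beta, gamma
```

Because the fractional counts in each row of F sum to one, the multiplicative gamma update keeps gamma on the probability simplex (up to the eps terms); the explicit renormalization lines are only numerical safeguards, not extra projection steps.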
collaborative filtering with graph information consistency and scalable methods nikhil rao yu pradeep ravikumar inderjit dhillon nikhilr rofuyu paradeepr inderjit department of computer science university of texas at austin abstract low rank matrix completion plays fundamental role in collaborative filtering applications the key idea being that the variables lie in smaller subspace than the ambient space often additional information about the variables is known and it is reasonable to assume that incorporating this information will lead to better predictions we tackle the problem of matrix completion when pairwise relationships among variables are known via graph we formulate and derive highly efficient conjugate gradient based alternating minimization scheme that solves optimizations with over million observations up to orders of magnitude faster than stochastic based methods on the theoretical front we show that such methods generalize weighted nuclear norm formulations and derive statistical consistency guarantees we validate our results on both real and synthetic datasets introduction low rank matrix completion approaches are among the most widely used collaborative filtering methods where partially observed matrix is available to the practitioner who needs to impute the missing entries specifically suppose there exists ratings matrix and we only observe subset of the entries yij mn the goal is to estimate yi to this end one typically looks to solve one of the following equivalent programs arg min arg min kw where the nuclear norm given by the sum of singular values is tight convex relaxation of the non convex rank penalty and is equivalent to the regularizer in is the projection operator that only retains those entries of the matrix that lie in the set in many cases however one not only has the partially observed ratings matrix but also has access to additional information about the relationships between the variables involved for example one might have access to social network of users similarly one might have access to attributes of items movies etc the nature of the attributes can be fairly arbitrary but it is reasonable to assume that similar share similar attributes natural question to ask then is if one can take advantage of this additional information to make better predictions in this paper we assume that the row and column variables lie on graphs the graphs may naturally be part of the data social networks product graphs or they can be constructed from available features the idea then is to incorporate this additional structural information into the matrix completion setting we not only require the resulting optimization program to enforce additional constraints on but we also require it to admit efficient optimization algorithms we show in the sections that follow that this in fact is indeed the case we also perform theoretical analysis of our problem when the observed entries of are corrupted by additive white gaussian noise to summarize the contributions of our paper are as follows we provide scalable algorithm for matrix completion graph with structural information our method relies on efficient multiplication schemes and is orders of magnitude faster than stochastic gradient descent based approaches we make connections with other structured matrix factorization frameworks notably we show that our method generalizes the weighted nuclear norm and methods based on gaussian generative models we derive consistency guarantees for graph regularized matrix completion and 
empirically show that our bound is smaller than that of traditional matrix completion where graph information is ignored we empirically validate our claims and show that our method achieves comparable error rates to other methods while being significantly more scalable related work and key differences for convex methods for matrix factorization haeffele et al provided framework to use regularizers with norms other than the euclidean norm in abernethy et al considered kernel based embedding of the data and showed that the resulting problem can be expressed as norm minimization scheme srebro and salakhutdinov introduced weighted nuclear norm and showed that the method enjoys superior performance as compared to standard matrix completion under sampling scheme we show that the graph based framework considered in this paper is in fact generalization of the weighted nuclear norm problem with weight matrices in the context of matrix factorization with graph structural information considered graph regularized nonnegative matrix factorization framework and proposed gradient descent based method to solve the problem in the context of recommendation systems in social networks ma et al modeled the weight of graph explicitly in regularization framework li and yeung considered similar setting to ours but key point of difference between all the aforementioned methods and our paper is that we consider the partially observed ratings case there are some works developing algorithms for the situation with partially observations however none of them provides statistical guarantees weighted norm minimization has been considered before in the context of low rank matrix completion the thrust of these methods has been to show that despite suboptimal conditions correlated data sampling the sample complexity does not change none of these methods use graph information we are interested in complementary question given variables conforming to graph information can we obtain better guarantees under uniform sampling to those achieved by traditional methods matrix factorization assume that the true target matrix can be factorized as and there exist graph whose adjacency matrix encodes the relationships between the rows of and graph for rows of in particular two rows or columns connected by an edge in the graph are close to each other in the euclidean distance in the context of embedding proposed smoothing term of the form wi ij wj tr lap where lap is the graph laplacian for where dw is the diagonal matrix with dii eij adding into the minimization problem encourages solutions where wi wj when eij is large similar argument holds for and the associated graph laplacian lap the authors call this the trust between links in social network we would thus not only want the target matrix to be low rank but also want the variables to be faithful to the underlying graph structure to this end we consider the following problem min tr lap tr lap kw tr lw tr lh where lw lap im and lh is defined similarly note that we subsume the regularization parameters in the definition of lw lh note that kw tr im min the regularizer in encourages solutions that are smooth with respect to the corresponding graphs however the laplacian matrix can be replaced by other positive matrices that encourage structure by different means indeed very general class of laplacian based regularizers was considered in where one can replace lw by function hx lap xi where lap qi qit where qi constitute the of lap and is scalar function of the eigenvalues our case 
corresponds to being the identity function we briefly summarize other schemes that fit neatly into apart from the graph regularizer we consider covariance matrices for variables proposed kernelized probabilistic matrix factorization kpmf which is generative model to incorporate covariance information of the variables into matrix factorization they assumed that each row of is generated according to multivariate gaussian and solving the corresponding map estimation procedure yields exactly with lw cw and lh ch where cw ch are the associated covariance matrices feature matrices for variables assume that there is feature matrix for objects associated rows for such one can construct graph and hence laplacian using various methods such as neighbors neighbors etc moreover one can assume that there exists kernel xi xj that encodes pairwise relations and we can use the kernel gram matrix as laplacian we can thus see that problem is very general scheme and can incorporate information available in many different forms in the sequel we assume the matrices lw lh are given in the theoretical analysis in section for ease of exposition we further assume that the minimum eigenvalues of lw lh are unity general nonzero minimum eigenvalue will merely introduce multiplicative constants in our bounds grals graph regularized alternating least squares in this section we propose efficient algorithms for which is convex with respect to or separately this allows us to employ alternating minimization methods to solve the problem when is fully observed li and yeung propose an alternating minimization scheme using block steepest descent we deal with the partially observed setting and propose to apply conjugate gradient cg which is known to converge faster than steepest descent to solve each subproblem we propose very efficient multiplication routine that results in the algorithm being highly scalable compared to the stochastic gradient descent approaches in we assume that and when optimizing with fixed we obtain the following min tr lh optimizing while fixed is similar and thus we only show the details for solving since lh is nonsingular is strongly we first present our algorithm for the fully observed case since it sets the groundwork for the partially observed setting in fact nonsingular lh can be handled using proximal updates and our algorithm will still apply algorithm for given matrices lh multiplication input vec compute kn kj wi sj wi slh return vec algorithm for given matrices lh initialization multiplication input vec sg lh return vec fully observed case as in among others there may be scenarios where is completely observed and the goal is to find the embeddings that conform to the corresponding graphs in this case the loss term in is simply ky thus setting rf is equivalent to solving the following sylvester equation for an matrix hw lh admits closed form solution however the standard algorithm for the sylvester equation requires transforming both and lh into schur form diagonal in our case where and lh are symmetric by the qr algorithm which is time consuming for large lh thus we consider applying conjugate gradient cg to minimize directly we define the following quadratic function st vec rnk ik lh in it is not hard to show that vec and so we apply cg to minimize the most crucial step in cg is the multiplication using the identity vec vec axb it follows that ik lh vec lh and in vec sw where vec thus the multiplication can be implemented by series of matrix multiplications as follows vec lh where can be and 
stored in space the details are presented in algorithm the time complexity for single cg iteration is nnz lh nk where nnz is the number of non zeros since in most practical applications is generally small the complexity is essentially nnz lh as long as nk nnz lh partially observed case in this case the loss term of becomes yij wit hj where wit is the row of and hj is the column of similar to the fully observed case we can define vec where lp is block diagonal matrix with diagonal blocks bj bj where again we can see vec note that the transpose is used here instead of which is used in the fully observed case for given let sj sn be matrix such that vec and kj kn with kj bj sj then vec note that since can be very large in practice it may not be feasible to compute and store all bj in the beginning alternatively bj sj can be computed in time as follows sj wit sj wi thus can be computed in time and the multiplication can be done in nnz lh time see algorithm for detailed procedure as result each cg iteration for minimizing is also very cheap remark on convergence in it is shown that any local minimizer of is global minimizer of if is larger than the true rank of the underlying from the alternating minimization procedure is guaranteed to globally converge to block of the converged point might not be local minimizer but it still yields good performance in practice most importantly since the updates are cheap to perform our algorithm scales well to large datasets convex connection via generalized weighted nuclear norm we now show that the regularizer in can be cast as generalized version of the weighted nuclear norm the weights in our case will correspond to the scaling factors introduced on the matrices due to the eigenvalues of the shifted graph laplacians lw lh respectively weighted atomic norm from we know that the nuclear norm is the gauge function induced by the atomic set wi hti kwi khi note that all rank one matrices in have unit frobenius norm now assume pm is basis of rm and sp is diagonal matrix with sp ii encoding the preference over the space spanned by pi the more the preference the larger the value similarly consider the basis and the preference sq for rn let sp and qsq and consider the following preferential atomic set wi hti wi aui hi bvi kui kvi clearly each atom in has frobenius norm this atomic set allows for biasing of the solutions towards certain atoms we then define corresponding atomic norm kzka inf ci it is not hard to verify that kzka is norm and kzka is closed and convex equivalence to graph regularization the graph regularization can be shown to be special case of the atomic norm as consequence of the following result theorem for any sp kzka inf qsq ka and corresponding weighted atomic set kb ht we prove this result in appendix theorem immediately leads us to the following equivalence result corollary let lw uw sw uwt and lh uh sh uht be the eigen decomposition for lw and lh we have tr lw ka and tr lh kb where uw sw and uh sh as result km ka with the preference pair uw sw for the column space and the preference pair uh sh for row space is weighted atomic norm equivalent for the graph regularization using lw and lh the results above allow us to obtain the dual weighted atomic norm for matrix kat zb ksw uwt zuh sh the authors actually show this for more general class of regularizers nash equilibrium is used in which is weighted spectral norm an elementary proof of this result can be found in appendix note that we can then write kzka ka zb uw zuh in the authors consider 
norm similar to but with being diagonal matrices in the spirit of their nomenclature we refer to the norm in as the generalized weighted nuclear norm statistical consistency in the presence of noisy measurements in this section we derive theoretical guarantees for the graph regularized low rank matrix estimators we first introduce some additional notation we assume that there is matrix of rank with kz kf and entries of are uniformly and revealed to us we further assume an mapping between the set of observed indices and so that the tth measurement is given by yt yi hei etj mn where denotes the matrix trace inner product and is randomly selected coordinate pair from let are corresponding matrices defined in corollary for the given lw lh we assume that the minimum singular value of both lw and lh is we then define the following graph based complexity measures mn ka zb ka zb tk ka zb ka zb tk where is the norm finally we assume that the true matrix can be expressed as linear combination of atoms from we define au our goal in this section will be to characterize the solution to the following convex program where the constraint set precludes selection of overly complex matrices in the sense of arg min kf kzka where log where is constant depending on quick note on solving since ka is weighted nuclear norm one can resort to proximal point methods or greedy methods developed specifically for atomic norm constrained minimization the latter are particularly attractive since the greedy step reduces to computing the maximum singular vectors which can be efficiently computed using power methods however such methods will first involve computing the eigen decompositions of the graph laplacians and then storing the large dense matrices we refrain from resorting to such methods in section and instead use the efficient framework derived in section we now state our main theoretical result theorem suppose we observe entries of the form from matrix with and which can be represented using at most atoms from let be the minimizer of the convex problem with log max then with high probability we have log where are positive constants see appendix for the detailed proof proof sketch is as follows our results can be generalized to non uniform sampling schemes as well proof sketch there are three major portions of the proof using the fact that has unit frobeniusp norm and can be expressed as combination of at most atoms we can show kz ka appendix using we can derive bound for the dual norm of the gradient of the loss given by krl ksw uwt rl uh sh appendix finally using we define notion of restricted strong convexity rsc that the error matrices lie in the proof follows closely along the lines of the equivalent result in with appropriate modifications to accommodate our generalized weighted nuclear norm appendix comparison to standard matrix completion it is instructive to consider our result in the context of noisy matrix completion with uniform samples in this case one would replace lw lh by identity matrices effectively ignoring graph information available specifically the standard notion of spikiness mn kzk kzkf defined in will apply and the corresponding error bound theorem will have replaced by in general it is hard to quantify the relationship between and and detailed comparison is an interesting topic for future work however we show below using simulations for various scenarios that the former is much smaller than the latter we generate matrices of rank with being random orthonormal matrices and having diagonal 
elements picked from uniform distribution we generate graphs at random using the schemes discussed below and set am with as defined in corollary we then compute for various comparing to most real world graphs exhibit power law degree distribution we generated graphs with the ith node having degree ip with varying negative values figure shows that as from below the gains received from using our norm is clear compared to the standard nuclear norm we also observe that in general the weighted formulation is never worse then unweighted the dotted magenta line is the same applies for random graphs where there is an edge between each with varying probability figure gwnn nn mse αn αg αn αg power law random measurements sample complexity figure ratio of spikiness measures for traditional matrix completion and our formulation sample complexity for the nuclear norm nn and generalized weighted nuclear norm gwnn sample complexity we tested the sample complexity needed to recover matrix generated from power law distributed graph with figure again outlines that the atomic formulation requires fewer examples to get an accurate recovery we average the results over independent runs and we used to solve the atomic norm constrained problem experiments on real datasets comparison to related formulations we compare grals to other methods that incorporate side information for matrix completion the admm method of that regularizes the entire target matrix using known features imc and standard matrix completion mc we use the movielens that has features along with the ratings matrix the dataset contains user features such as age numeric gender binary and occupation which we map http method imc global mean user mean movie mean admm mc grals admm mc grals rmse time figure time comparison of different methods on movielens dataset flixster douban yahoomusic rmse table rmse on the movielens dataset table data statistics users items ratings links rank used into dimensional feature vector per user we then construct neighbor graph using the euclidean distance metric we do the same for the movies except in this case we have an dimensional feature vector per movie for imc we use the feature vectors directly we trained model of rank and chose optimal parameters by cross validation table shows the rmse obtained for the methods considered figure shows that the admm method while obtaining reasonable rmse does not scale well since one has to compute an svd at each iteration scalability of grals we now demonstrate that the proposed grals method is more efficient than other methods for solving the matrix factorization problem we compare grals to the sgd method in and gd als with simple gradient descent we consider three collaborate filtering datasets with graph information see table for we randomly select of ratings as the training set and use the remaining as the test set all the experiments are performed on an intel machine with xeon cpu ivy bridge and enough ram figure shows orders of magnitude improvement in time compared to sgd more experimental results are provided in the supplementary material flixster douban yahoomusic figure comparison of grals gd and sgd the is the computation time in discussion in this paper we have considered the problem of collaborative filtering with graph information for users items and showed that it can be cast as generalized weighted nuclear norm problem we derived statistical consistency guarantees for our method and developed highly scalable alternating minimization method experiments on large 
real world datasets show that our method achieves orders of magnitude speedups over competing approaches acknowledgments this research was supported by nsf grant yu acknowledges support from an intel phd fellowship nr was supported by an ices fellowship see more details in appendix references jacob abernethy francis bach theodoros evgeniou and vert matrix factorization with attributes arxiv preprint francis bach julien mairal and jean ponce convex sparse matrix factorizations corr mikhail belkin and partha niyogi laplacian eigenmaps and spectral techniques for embedding and clustering in nips volume pages mikhail belkin and partha niyogi laplacian eigenmaps for dimensionality reduction and data representation neural computation deng cai xiaofei he jiawei han and thomas huang graph regularized nonnegative matrix factorization for data representation pattern analysis and machine intelligence ieee transactions on cai emmanuel and zuowei shen singular value thresholding algorithm for matrix completion siam journal on optimization venkat chandrasekaran benjamin recht pablo parrilo and alan willsky the convex geometry of linear inverse problems foundations of computational mathematics gideon dror noam koenigstein yehuda koren and markus weimer the yahoo music dataset and in kdd cup pages benjamin haeffele eric young and rene vidal structured matrix factorization optimality algorithm and applications to image processing in proceedings of the international conference on machine learning pages prateek jain and inderjit dhillon provable inductive matrix completion arxiv preprint mohsen jamali and martin ester matrix factorization technique with trust propagation for recommendation in social networks in proceedings of the fourth acm conference on recommender systems recsys pages vassilis kalofolias xavier bresson michael bronstein and pierre vandergheynst matrix completion on graphs li and yeung relation regularized matrix factorization in international joint conference on artificial intelligence hao ma dengyong zhou chao liu michael lyu and irwin king recommender systems with social regularization in proceedings of the fourth acm international conference on web search and data mining wsdm pages hong kong china paolo massa and paolo avesani bootstrapping of recommender systems ecai workshop on recommender systems pages sahand negahban and martin wainwright restricted strong convexity and weighted matrix completion optimal bounds with noise the journal of machine learning research sahand negahban pradeep ravikumar martin wainwright and bin yu unified framework for analysis of with decomposable regularizers statistical science nikhil rao parikshit shah and stephen wright conditional gradient with enhancement and truncation for regularization nips workshop on greedy algorithms benjamin recht simpler approach to matrix completion the journal of machine learning research alexander smola and risi kondor kernels and regularization on graphs in learning theory and kernel machines pages springer nathan srebro and ruslan salakhutdinov collaborative filtering in world learning with the weighted trace norm in advances in neural information processing systems pages ambuj tewari pradeep ravikumar and inderjit dhillon greedy algorithms for structurally constrained high dimensional problems in advances in neural information processing systems pages roman vershynin note on sums of independent random matrices after lecture notes miao xu rong jin and zhou speedup matrix completion with side information application to 
learning. In Advances in Neural Information Processing Systems.
Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences.
Zhou Zhao, Lijun Zhang, Xiaofei He, and Wilfred Ng. Expert finding for question answering via graph regularized matrix completion. Knowledge and Data Engineering, IEEE Transactions on.
Tinghui Zhou, Hanhuai Shan, Arindam Banerjee, and Guillermo Sapiro. Kernelized probabilistic matrix factorization: exploiting graphs and side information. In SDM, SIAM.
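As a concrete illustration of the preprocessing described in the MovieLens experiment above (user attributes such as age, gender, and occupation mapped to a feature vector, followed by a nearest-neighbour graph built under the Euclidean metric), the sketch below constructs such a graph and its combinatorial Laplacian, the kind of graph side information a graph-regularized factorization consumes. The helper name, the neighbourhood size k = 10, and the feature dimensions are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cdist


def knn_graph_laplacian(features, k=10):
    """Symmetric k-nearest-neighbour graph over row-wise feature vectors
    (Euclidean distance), returned with its combinatorial Laplacian L = D - A.
    k = 10 is an illustrative choice, not the value used in the paper."""
    n = features.shape[0]
    dist = cdist(features, features)          # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)            # exclude self-loops
    adj = np.zeros((n, n))
    for i in range(n):
        neighbours = np.argsort(dist[i])[:k]  # indices of the k closest rows
        adj[i, neighbours] = 1.0
    adj = np.maximum(adj, adj.T)              # symmetrise the neighbourhood relation
    laplacian = np.diag(adj.sum(axis=1)) - adj
    return csr_matrix(adj), csr_matrix(laplacian)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical users: age (numeric), gender (binary), one-hot occupation;
    # the number of users and the feature dimension below are placeholders.
    user_features = rng.random((500, 23))
    A, L = knn_graph_laplacian(user_features, k=10)
    print(A.shape, L.nnz)
```

The same construction applies on the movie side; the resulting Laplacians can then serve as the user and item graph side information in the factorization.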
combinatorial bandits revisited richard sadegh alexandre marc france department of automatic control kth stockholm sweden inria ens paris france mstms alepro abstract this paper investigates stochastic and adversarial combinatorial bandit problems in the stochastic setting under feedback we derive regret lower bound and discuss its scaling with the dimension of the decision space we propose escb an algorithm that efficiently exploits the structure of the problem and provide analysis of its regret escb has better performance guarantees than existing algorithms and significantly outperforms these algorithms in practice in the adversarial setting under bandit feedback we propose omb exp an algorithm with the same regret scaling as algorithms but with lower computational complexity for some combinatorial problems introduction bandit mab problems constitute the most fundamental sequential decision problems with an exploration exploitation in such problems the decision maker selects an arm in each round and observes realization of the corresponding unknown reward distribution each decision is based on past decisions and observed rewards the objective is to maximize the expected cumulative reward over some time horizon by balancing exploitation arms with higher observed rewards should be selected often and exploration all arms should be explored to learn their average rewards equivalently the performance of decision rule or algorithm can be measured through its expected regret defined as the gap between the expected reward achieved by the algorithm and that achieved by an oracle algorithm always selecting the best arm mab problems have found applications in many fields including sequential clinical trials communication systems economics see in this paper we investigate generic combinatorial mab problems with linear rewards as introduced in in each round decision maker selects an arm from finite set and pd receives reward mi xi the reward vector is unknown we focus here on the case where all arms consist of the same number of basic actions in the sense that km after selecting an arm in round the decision maker receives some feedback we consider both feedback under which after round for all the component xi of the reward vector is revealed if and only if mi ii bandit feedback under which only the reward is revealed based on the feedback received up to round the decision maker selects an arm for the next round and her objective is to maximize her cumulative reward over given time horizon consisting of rounds the challenge in these problems resides in the very large number of arms in its combinatorial structure the size of could well grow as dm fortunately one may hope to exploit the problem structure to speed up the exploration of arms we consider two instances of combinatorial bandit problems depending on how the sequence of reward vectors is generated we first analyze the case of stochastic rewards where for all algorithm regret llr min cucb log log cucb md log escb theorem md log table regret upper bounds for stochastic combinatorial optimization under feedback xi are with bernoulli distribution of unknown mean the reward sequences are also independent across we then address the problem in the adversarial setting where the sequence of vectors is arbitrary and selected by an adversary at the beginning of the experiment in the stochastic setting we provide sequential arm selection algorithms whose performance exceeds that of existing algorithms whereas in the adversarial setting we devise simple 
algorithms whose regret have the same scaling as that of algorithms but with lower computational complexity contribution and related work stochastic combinatorial bandits under feedback contribution we derive an asymptotic as the time horizon grows large regret lower bound satisfied by any algorithm theorem this lower bound is and tight there exists an algorithm that attains the bound on all problem instances although the algorithm might be computationally expensive to our knowledge such lower bounds have not been proposed in the case of stochastic combinatorial bandits the dependency in and of the lower bound is unfortunately not explicit we further provide simplified lower bound theorem and derive its scaling in in specific examples we propose escb efficient sampling for combinatorial bandits an algorithm whose regret scales at most as min log theorem where denotes the expected reward difference between the best and the arm escb assigns an index to each arm the index of given arm can be interpreted as performing likelihood tests with vanishing risk on its average reward our indexes are the natural extension of and indexes defined for unstructured bandits numerical experiments for some specific combinatorial problems are presented in the supplementary material and show that escb significantly outperforms existing algorithms related work previous contributions on stochastic combinatorial bandits focused on specific combinatorial structures matroids or permutations generic combinatorial problems were investigated in the proposed algorithms llr and cucb are variants of the ucb algorithm and their performance guarantees in table our algorithms improve over llr and cucb by multiplicative factor of adversarial combinatorial problems under bandit feedback contribution we present algorithm omb exp whose regret is log µmin where µmin mi and is the smallest nonzero eigenvalue of the matrix when is uniformly distributed over theorem for most problems of interest dλ and min poly so that omb exp has dt log regret known regret lower bound is dt so the regret gap between omb exp and this lower bound scales at most as up to logarithmic factor related work adversarial combinatorial bandits have been extensively investigated recently see and references therein some papers consider specific instances of these problems routing permutations for generic combinatorial problems known regret lower bounds scale as mdt and dt if in the case of semibandit and bandit feedback respectively in the case of feedback proposes algorithm lower bound om band with ohn xploration omb exp theorem regret dt if dt log dλ dt log dt mdλ log min table regret of various algorithms for adversarial combinatorial bandits with bandit feedback note that for most combinatorial classes of interests dλ and min poly osmd anp algorithm whose regret upper bound matches the lower bound presents an algorithm with dl log regret where is the total reward of the best arm after rounds for problems with bandit feedback proposes om band and derives regret upper bound which depends on the structurep of arm set for most problems of interest the regret under om band is by dt log addresses generic linear optimization with bandit feedback and the proposed algorithm referred to as with ohn xploration has regret scaling at most as dt log in the case of combinatorial structure as we show next for many combinatorial structures of interest matchings spanning trees omb exp yields the same regret as om band and with ohn xploration with lower computational 
complexity for large class of problems table summarises known regret bounds example is the set of all binary vectors with coordinates we have µmin and refer to the supplementary material for details hence when the regret upper bound of omb exp becomes dt log which is the same as that of om band and with ohn xploration example matchings the set of arms is the set of perfect matchings in km and we have µmin and hence the regret upper bound of omb exp is log the same as for om band and with ohn xploration example spanning trees is the set of spanning trees in the complete graph kn in this case and by cayley formula has arms log min for and dλ when the regret upper bound of om band and with ohn xploration becomes log as for omb exp we get the same regret upper bound log models and objectives we consider mab problems where each arm is subset of basic actions taken from for xi denotes the reward of basic action in round in the stochastic setting for each the sequence of rewards xi is with bernoulli distribution with mean θi rewards are assumed to be independent across actions we denote by θd the vector of unknown expected rewards of the various basic actions in the adversarial setting the reward vector xd is arbitrary and the sequence is decided but unknown at the beginning of the experiment the set of arms is an arbitrary subset of such that each of its elements has basic actions arm is identified with binary column vector md and we have km at the beginning of each round policy selects an arm based on the arms chosen rounds and their observed rewards the reward of arm selected in previous in round is mi xi we consider both and bandit feedbacks under feedback and policy at the end of round the outcome of basic actions xi for all are revealed to the decision maker whereas under bandit feedback only can be observed let be the set of all feasible policies the objective is to identify policy in maximizing the cumulative expected reward over finite time horizon the expectation is here taken with respect to possible randomness in the rewards in the stochastic setting and the possible randomization in the policy equivalently we aim at designing policy that minimizes regret where the regret of policy is defined by rπ max finally for the stochastic setting we denote by µm the expected reward of arm and let or for short be any arm with maximum expected reward arg maxm µm in what follows to simplify the presentation we assume that the optimal is unique we further define minm where µm and maxm µm stochastic combinatorial bandits under feedback regret lower bound given define the set of parameters that can not be distinguished from when selecting action and for which arm is suboptimal mi θi λi we define and kl the divergence between bernoulli distributions of respective means and kl log log finally for we define the vector kl kl θi λi we derive regret lower bound valid for any uniformly good algorithm an algorithm is uniformly good iff rπ for all and all parameters the proof of this result relies on general result on controlled markov chains theorem for all for any uniformly good policy lim inf where is the optimal value of the optimization problem inf xm xm rπ log kl observe first that optimization problem is linear program which can be solved for any fixed but its optimal value is difficult to compute explicitly determining how scales as function of the problem dimensions and is not obvious also note that has the following interpretation assume that has unique solution then any uniformly good algorithm 
must select action at least log times over the first rounds from we know that there exists an algorithm which is asymptotically optimal so that its regret matches the lower bound of theorem however this algorithm suffers from two problems it is computationally infeasible for large problems since it involves solving times furthermore the algorithm has no finite time performance guarantees and numerical experiments suggest that its finite time performance on typical problems is rather poor further remark that if is the set of singletons classical bandit theorem reduces to the bound and if is the set of bandit with multiple plays theorem reduces to the lower bound derived in finally theorem can be generalized in straightforward manner for when rewards belong to exponential family of distributions gaussian exponential gamma etc by replacing kl by the appropriate divergence measure simplified lower bound we now study how the regret scales as function of the problem dimensions and to this aim we present simplified regret lower bound given we say that set has property iff for all we have mi mi for all we may now state theorem theorem let be maximal subset of with property define minm then kl θi θj corollary let for some constant and be such that each arm has at most suboptimal basic actions then theorem provides an explicit regret lower bound corollary states that scales at least with the size of for most combinatorial sets is proportional to see supplementary material for some examples which implies that in these cases one can not obtain regret smaller than min log this result is intuitive since is the number of parameters not observed when selecting the optimal arm the algorithms proposed below have regret of log which is acceptable since typically is much smaller than min algorithms next we present escb an algorithm for stochastic combinatorial bandits that relies on arm indexes as in and we derive regret upper bounds for escb that hold even if we assume that km instead of km so that arms may have different numbers of basic actions indexes escb relies on arm indexes in general an index of arm in round say bm should be defined so that bm with high probability then as for and applying the principle of optimism in face of uncertainty natural way to devise algorithms based on indexes is to select in round the arm with the highest index under given algorithm at time we define peach ti mi the number of times basic action has been sampled the empirical mean pn reward of action is then defined as xi mi if ti and otherwise we define the corresponding vectors ti and the indexes we propose are functions of the round and of our first index for arm referred to as bm or bm for short is an extension of index let log log log bm is the optimal value of the following optimization problem max kl where we use the convention that for rd vu vi ui as we show later bm may be computed efficiently using line search procedure similar to that used to determine index our second index cm or cm for short is generalization of the and indexes cm note that in the classical bandit problems with independent arms when bm reduces to the index which yields an asymptotically optimal algorithm and cm reduces to the index the next theorem provides generic properties of our indexes an important consequence of these properties is that the expected number of times where bm or cm underestimate is finite as stated in the corollary below theorem for all and we have bm cm ii there exists cm depending on only such that for all and bm cm log 
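To make the index concrete, the following minimal sketch runs an ESCB-style selection rule with the closed-form index c_M(n) defined above on a toy instance in which the arms are all m-subsets of d Bernoulli basic actions and the learner receives semi-bandit feedback. The exploration rate f(n), the clamping of unvisited counts, and all toy parameters are illustrative assumptions rather than the exact prescriptions of the analysis; the b_M index would instead be computed by the line-search procedure discussed next.

```python
import numpy as np
from itertools import combinations


def c_index(arm, theta_hat, counts, n):
    """Optimistic index c_M(n) = M^T theta_hat(n) + sqrt( f(n)/2 * sum_i M_i / t_i(n) ).
    f(n) = log n + 4 m log log n is used here as a plausible exploration rate;
    take the exact constant from the formal statement, not from this sketch."""
    m = int(arm.sum())
    f_n = np.log(n) + 4.0 * m * np.log(max(np.log(n), 1.0))
    # Counts are clamped to 1 only to avoid dividing by zero in early rounds;
    # a faithful implementation would force exploration of unvisited actions.
    bonus = np.sqrt(0.5 * f_n * float((arm / np.maximum(counts, 1.0)).sum()))
    return float(arm @ theta_hat) + bonus


def run_escb(theta, arms, horizon, rng):
    """ESCB-style loop on a toy semi-bandit instance with Bernoulli basic actions."""
    d = len(theta)
    counts = np.zeros(d)   # t_i(n): times basic action i has been observed
    sums = np.zeros(d)     # running sums of observed rewards per basic action
    total_reward = 0.0
    for n in range(1, horizon + 1):
        theta_hat = sums / np.maximum(counts, 1.0)                  # empirical means (0 if unobserved)
        scores = [c_index(M, theta_hat, counts, n) for M in arms]   # index of every arm
        M = arms[int(np.argmax(scores))]                            # optimism: play the largest index
        x = rng.binomial(1, theta)                                  # latent rewards of all basic actions
        counts += M                                                 # semi-bandit feedback: x_i revealed iff M_i = 1
        sums += M * x
        total_reward += float(M @ x)
    return total_reward


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, m = 6, 2
    theta = np.array([0.9, 0.8, 0.5, 0.4, 0.3, 0.2])   # unknown means of the basic actions
    arms = []
    for idx in combinations(range(d), m):               # all m-subsets as binary incidence vectors
        v = np.zeros(d)
        v[list(idx)] = 1.0
        arms.append(v)
    print(run_escb(theta, arms, horizon=2000, rng=rng))
```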
corollary bm cm log statement in the above theorem is obtained combining pinsker and inequalities the proof of statement ii is based on concentration inequality on sums of empirical kl divergences proven in it enables to control the fluctuations of multivariate empirical distributions for exponential families it should also be observed that indexes bm and cm can be extended in straightforward manner to the case of continuous linear bandit problems where the set of arms is the unit sphere and one wants to maximize the dot product between the arm and an unknown vector bm can also be extended to the case where reward distributions are not bernoulli but lie in an exponential family gaussian exponential gamma etc replacing kl by suitably chosen divergence measure close look at cm reveals that the indexes proposed in pd and are too conservative to be optimal in our setting there the confidence bonus tim pd was replaced by at least tim note that assume that the various basic actions are arbitrarily correlated while we assume independence among basic actions when independence does not hold provides problem instance where the regret is at least log this min log since we have added the does not contradict our regret upper bound scaling as min independence assumption index computation while the index cm is explicit bm is defined as the solution to an optimization problem we show that it may be computed by simple line search for and define λv λv fix and define mi and for define ti kl ti theorem if bm otherwise is strictly increasing and ii define as the unique solution to then bm ti theorem shows that bm can be computed using line search procedure such as bisection as this computation amounts to solving the nonlinear equation where is strictly increasing the proof of theorem follows from kkt conditions and the convexity of the kl divergence the escb algorithm the of escb is presented in algorithm we consider two variants of the algorithm based on the choice of the index ξm when ξm bm and if ξm cm in practice outperforms introducing is however instrumental in the regret analysis of in view of theorem the following theorem provides finite time analysis of our escb algorithms the proof of this theorem borrows some ideas from the proof of theorem theorem the regret under algorithms satisfies for all rπ min cm where cm does not depend on and as consequence rπ min log when algorithm escb for do select arm arg maxm ξm observe the rewards and update ti and end for algorithm omb exp initialization set log min log min and γc with for do mixing let decomposition select distribution over such that sampling select random arm with distribution and incur reward yn xi mi estimation let where has law set yn where is the of update set exp projection set qn to be the projection of onto the set using the kl divergence end for escb with time horizon has complexity of as neither bm nor cm can be written as for some vector rd assuming that the offline static combinatorial problem is solvable in time the complexity of the cucb algorithm in and after rounds is thus if the offline problem is efficiently implementable poly cucb is efficient whereas escb is not since may have exponentially many elements in of the supplement we provide an extension of escb called poch that attains almost the same regret as escb while enjoying much better computational complexity adversarial combinatorial bandits under bandit feedback we now consider adversarial combinatorial bandits with bandit feedback we start with the following observation max 
max with co the convex hull of we embed in the simplex by dividing its elements by let be this scaled version of co inspired by osmd we propose the omb exp algorithm where the kl divergence is the bregman divergence used to project onto projection using the kl divergence is addressed in we denote the kl divergence between distributions and in by kl log the projection of distribution onto closed convex set of distributions is arg kl let be the smallest nonzero eigenvalue of where is uniformly distributed over we define the distribution and let µmin mini mµi is the distribution over basic actions induced by the uniform distribution over the for omb exp is shown in algorithm the kl projection in omb exp ensures that co there exists distribution over such that this guarantees that the system of linear equations in the decomposition step is consistent we propose to perform the projection step the kl projection of onto using methods we provide simpler method in of the supplement the decomposition step can be efficiently implemented using the algorithm of the following theorem provides regret upper bound for omb exp omb exp theorem for all mλ log min log µmin for most classes of we have µmin for these poly and dλ classes omb exp has regret of dt log which is factor log off the lower bound see table it might not be possible to compute the projection step exactly and this step can be solved up to accuracy in round namely we find qn such that kl qn kl proposition shows that for the approximate projection gives the same regret as when the projection is computed exactly theorem gives the computational complexity of omb exp with approximate projection when co is described by polynomially in many linear omb exp is efficiently implementable and its running time scales almost linearly in proposition and theorem easily extend to other algorithms and thus might be of independent interest proposition if the projection we have rc omb exp step omb exp of is log min solved up to accuracy log min theorem assume that co is defined by linear equalities and linear inequalities if the projection step is solved up to accuracy then omb exp has time complexity the time complexity of omb exp can be reduced by exploiting the structure of see page in particular if inequality describing co are box constraints the time complexity of omb exp is log the computational complexity of omb exp is determined by the structure of co and omb exp has log time complexity due to the efficiency of methods in contrast the computational complexity of om band depends on the complexity of sampling from om band may have time complexity that is in see page for instance consider the matching problem described in section we have equality constraints and box constraints so that the time complexity of omb exp is log it is noted that using algorithm the cost of decomposition in this case is on the other hand omb band has time complexity of with function as it requires to approximate permanent requiring operations per round thus omb exp has much lower complexity than om band and achieves the same regret conclusion we have investigated stochastic and adversarial combinatorial bandits for stochastic combinatorial bandits with feedback we have provided tight regret lower bound that in most cases scales at least as min log we proposed escb an algorithm log regret we plan to reduce the gap between this regret guarantee and with min the regret lower bound as well as investigate the performance of poch for adversarial combinatorial bandits with bandit 
feedback we proposed the omb exp algorithm there is gap between the regret of omb exp and the known regret lower bound in this setting and we plan to reduce it as much as possible acknowledgments proutiere research is supported by the erc fsa grant and the ssf project references herbert robbins some aspects of the sequential design of experiments in herbert robbins selected papers pages springer bubeck and regret analysis of stochastic and nonstochastic bandit problems foundations and trends in machine learning and lugosi prediction learning and games volume cambridge university press cambridge and lugosi combinatorial bandits journal of computer and system sciences garivier and olivier the algorithm for bounded stochastic bandits and beyond in proc of colt venkatachalam anantharam pravin varaiya and jean walrand asymptotically efficient allocation rules for the multiarmed bandit problem with multiple iid rewards automatic control ieee transactions on branislav kveton zheng wen azin ashkan hoda eydgahi and brian eriksson matroid bandits fast combinatorial optimization with learning in proc of uai yi gai bhaskar krishnamachari and rahul jain learning multiuser channel allocations in cognitive radio networks combinatorial bandit formulation in proc of ieee dyspan yi gai bhaskar krishnamachari and rahul jain combinatorial network optimization with unknown variables bandits with linear rewards and individual observations trans on networking wei chen yajun wang and yang yuan combinatorial bandit general framework and applications in proc of icml branislav kveton zheng wen azin ashkan and csaba szepesvari tight regret bounds for stochastic combinatorial in proc of aistats zheng wen azin ashkan hoda eydgahi and branislav kveton efficient learning in combinatorial in proc of icml audibert bubeck and lugosi regret in online combinatorial optimization mathematics of operations research linder lugosi and the shortest path problem under partial monitoring journal of machine learning research satyen kale lev reyzin and robert schapire bandit slate problems advances in neural information processing systems pages nir ailon kohei hatano and eiji takimoto bandit online optimization over the permutahedron in algorithmic learning theory pages springer gergely neu regret bounds for combinatorial in proc of colt bubeck and sham kakade towards minimax policies for online linear optimization with bandit feedback proc of colt todd graves and tze leung lai asymptotically efficient adaptive choice of control laws in controlled markov chains siam control and optimization tze leung lai and herbert robbins asymptotically efficient adaptive allocation rules advances in applied mathematics peter auer and paul fischer finite time analysis of the multiarmed bandit problem machine learning stefan magureanu richard combes and alexandre proutiere lipschitz bandits regret lower bounds and optimal algorithms proc of colt and shields information theory and statistics tutorial now publishers inc stephen boyd and lieven vandenberghe convex optimization cambridge university press sherali constructive proof of the representation theorem for polyhedral sets based on fundamental definitions american journal of mathematical and management sciences david helmbold and manfred warmuth learning permutations with exponential weights journal of machine learning research 
variational information maximisation for intrinsically motivated reinforcement learning shakir mohamed and danilo rezende google deepmind london shakir danilor abstract the mutual information is core statistical quantity that has applications in all areas of machine learning whether this is in training of density models over multiple data modalities in maximising the efficiency of noisy transmission channels or when learning behaviour policies for exploration by artificial agents most learning algorithms that involve optimisation of the mutual information rely on the algorithm an enumerative algorithm with exponential complexity that is not suitable for modern machine learning applications this paper provides new approach for scalable optimisation of the mutual information by merging techniques from variational inference and deep learning we develop our approach by focusing on the problem of learning where the mutual information forms the definition of internal drive known as empowerment using variational lower bound on the mutual information combined with convolutional networks for handling visual input streams we develop stochastic optimisation algorithm that allows for scalable information maximisation and reasoning directly from pixels to actions introduction the problem of measuring and harnessing dependence between random variables is an inescapable statistical problem that forms the basis of large number of applications in machine learning including rate distortion theory information bottleneck methods population coding curiositydriven exploration model selection and reinforcement learning in all these problems the core quantity that must be reasoned about is the mutual information in general the mutual information mi is intractable to compute and few existing algorithms are useful for realistic applications the received algorithm for estimating mutual information is the algorithm that effectively solves for the mi by enumeration an approach with exponential complexity that is not suitable for modern machine learning applications by combining the best current practice from variational inference with that of deep learning we bring the generality and scalability seen in other problem domains to information maximisation problems we provide new approach for maximisation of the mutual information that has significantly lower complexity allows for computation with sensory inputs and that allows us to exploit modern computational resources the technique we derive is generally applicable but we shall describe and develop our approach by focussing on one popular and increasingly topical application of the mutual information as measure of empowerment in reinforcement learning reinforcement learning rl has seen number of successes in recent years that has now established it as practical scalable solution for realistic planning and decision making limitation of the standard rl approach is that an agent is only able to learn using external rewards obtained from its environment truly autonomous agents will often exist in environments that lack such external rewards or in environments where rewards are sparsely distributed reinforcement learning attempts to address this shortcoming by equipping an agent with number of internal drives or intrinsic reward signals such as hunger boredom or curiosity that allows the agent to continue to explore learn and act meaningfully in world there are many state representation external environment action decoder observation ak environment agent internal 
environment ak source option kb critic ak state repr state embedding option planner ak figure loop separating environment into internal and external facets figure computational graph for variational information maximisation ways in which to formally define internal drives but what all such definitions have in common is that they in some unsupervised fashion allow an agent to reason about the value of information in the sequences it experiences the mutual information allows for exactly this type of reasoning and forms the basis of one popular intrinsic reward measure known as empowerment our paper begins by describing the framework we use for online and learning section and then describes the general problem associated with mutual information estimation and empowerment section we then make the following contributions we develop stochastic variational information maximisation new algorithm for scalable estimation of the mutual information and channel capacity that is applicable to both discrete and continuous settings we combine variational information optimisation and tools from deep learning to develop scalable algorithm for reinforcement learning demonstrating new application of the variational theory for problems in reinforcement learning and decision making we demonstrate that behaviours obtained using variational information maximisation match those using the exact computation we then apply our algorithms to broad range of problems for which it is not possible to compute the exact solution but for which we are able to act according to empowerment learning directly from pixel information reinforcement learning or learning attempts to address the question of where rewards come from and how they are used by an autonomous agent consider an online learning system that must model and reason about its incoming data streams and interact with its environment this loop is common to many areas such as active learning process control optimisation and reinforcement learning an extended view of this framework was presented by singh et al who describe the environment as factored into external and internal components figure an agent receives observations and takes actions in the external environment importantly the source and nature of any reward signals are not assumed to be provided by an oracle in the external environment but is moved to an internal environment that is part of the agent decisionmaking system the internal environment handles the efficient processing of all input data and the choice and computation of an appropriate internal reward signal there are two important components of this framework the state representation and the critic we are principally interested in systems for which there are no solutions currently developed to achieve this our state representation system is convolutional neural network the critic in figure is responsible for providing intrinsic rewards that allow the agent to act under different types of internal motivations and is where information maximisation enters the learning problem the nature of the critic and in particular the reward signal it provides is the main focus of this paper wide variety of reward functions have been proposed and include missing information or bayesian surprise which uses the kl divergence to measure the change in an agents internal belief after the observation of new data measures based on prediction errors of future states such predicted change predicted mode change or probability gain or salient event prediction and measures based 
on quantities such as predicted information gain pig causal entropic forces or empowerment the paper by oudeyer kaplan currently provides the widest singular discussion of the breadth of intrinsic motivation measures although we have wide choice of intrinsic reward measures none of the available informationtheoretic approaches are efficient to compute or scalable to problems they require either knowledge of the true transition probability or summation over all configurations of the state space which is not tractable for complex environments or when the states are large images mutual information and empowerment the mutual information is core quantity that acts as general measure of dependence between two random variables and defined as ep log where the is joint distribution over the random variables and and are the corresponding marginal distributions and can be many quantities of interest in computational neuroscience they are the sensory inputs and the spiking population code in telecommunications they are the input signal to channel and the received transmission when learning exploration policies in rl they are the current state and the action at some time in the future respectively for intrinsic motivation we use an internal reward measure referred to as empowerment that is obtained by searching for the maximal mutual information conditioned on starting state between sequence of actions and the final state reached max max ep log where ak is sequence of primitive actions ak leading to final state and is the transition probability of the environment is the joint distribution of action sequences and the final state is distribution over action sequences and is the joint probability marginalised over the action sequence equation is the definition of the channel capacity in information theory and is measure of the amount of information contained in the action sequences about the future state this measure is compelling since it provides measure for intrinsic motivation that fits naturally within the framework for intrinsically motivated learning described by figure furthermore empowerment like the or function in reinforcement learning assigns value to each state in an environment an agent that seeks to maximise this value will move towards states from which it can reach the largest number of future states within its planning horizon it is this intuition that has led authors to describe empowerment as measure of agent preparedness or as means by which an agent may quantify the extent to which it can reliably influence its environment motivating an agent to move to states of maximum influence an agent generates an sequence of actions steps into the future this is only used by the agent for its internal planning using when optimised using the distribution becomes an efficient exploration policy that allows for uniform exploration of the state space reachable at horizon and is another compelling aspect of empowerment we provide more intuition for this in appendix but this policy is not what is used by the agent for acting when an agent must act in the world it follows policy obtained by planning algorithm using the empowerment value we expand on this in sect further consequence is that while acting the agent is only curious about parts of its environment that can be reached within its internal planning horizon we shall not explore the effect of the horizon in this work but this has been and we defer to the insights of salge et al scalable information maximisation the mutual information mi as we 
have described it thus far whether it be for problems in empowerment channel capacity or rate distortion hides two difficult statistical problems firstly computing the mi involves expectations over the unknown state transition probability this can be seen by rewriting the mi in terms of the difference between conditional entropies as where log and ep log this computation requires marginalisation over the transition dynamics of the environment which is unknown in general we could estimate this distribution by building generative model of the environment and then use this model to compute the mi since learning accurate generative models remains challenging task solution that avoids this is preferred and we also describe one approach for empowerment in appendix secondly we currently lack an efficient algorithm for mi computation there exists no scalable algorithm for computing the mutual information that allows us to apply empowerment to highdimensional problems and that allow us to easily exploit modern computing systems the current solution is to use the algorithm which essentially enumerates over all states thus being limited to problems and not being applicable to the continuous domain more scalable estimators have been developed these have high memory footprint or require very large number of observations any approximation may not be bound on the mi making reasoning about correctness harder and they can not easily be composed with existing systems that allow us to design unified system in the continuous domain monte carlo integration has been proposed but applications of monte carlo estimators can require large number of draws to obtain accurate solutions and manageable variance we have also explored monte carlo estimators for empowerment and describe an alternative importance estimator for the mi and channel capacity in appendix variational information lower bound the mi can be made more tractable by deriving lower bound to it and maximising this instead here we present the bound derived by barber agakov using the entropy formulation of the mi reveals that bounding the conditional entropy component is sufficient to bound the entire mutual information by using the property of the kl divergence we obtain the bound kl kq ep log log where we have introduced variational distribution with parameters the distribution has parameters this bound becomes exact when is equal to the true action posterior distribution other lower bounds for the mutual information are also possible jaakkola jordan present lower bound by using the convexity bound for the logarithm brunel nadal use gaussian assumption and appeal to the lower bound the bound is highly convenient especially when compared to other bounds since the transition probability appears linearly in the expectation and we never need to evaluate its probability we can thus evaluate the expectation directly by monte carlo using data obtained by interaction with the environment the bound is also intuitive since we operate using the marginal distribution on action sequences which acts as source exploration distribution the transition distribution acts as an encoder transition distribution from to and the variational distribution conveniently acts as decoder planning distribution taking us from to variational information maximisation straightforward optimisation procedure based on is an alternating optimisation for the parameters of the distributions and barber agakov made the connection between this approach and the generalised em algorithm and refer to 
it as the im information maximisation algorithm and we follow the same optimisation principle from an optimisation perspective the maximisation of the bound in can be in gaussian models the variances can diverge we avoid such divergent solutions by adding constraint on the value of the entropy which results in the constrained optimisation problem max max ep ln where is the action sequence performed by the agent when moving from to and temperature which is function of the constraint is an inverse at all times we use very general source and decoder distributions formed by complex functions using deep networks and use stochastic gradient ascent for optimisation we refer to our approach as stochastic variational information maximisation to highlight that we do all our computation on of recent experience from the agent the optimisation for the decoder becomes maximum likelihood problem and the optimisation for the source requires computation of an unnormalised model which we describe next we summmarise the overall procedure in algorithm maximum likelihood decoder the first step of the alternating optimisation is the optimisation of equation the decoder and is supervised maximum likelihood problem given set of data from past interactions with the environment we learn distribution from the start and termination states respectively to the action sequences that have been taken we parameterise the decoder as an distribution over the action sequence ak ak we are free to choose the distributions ak for each action in the sequence which we choose as categorical distributions whose mean parameters are the result of the function with parameters is function that we specify using neural network with activation functions by maximising this we are able to make stochastic updates to the variational parameters of this distribution the neural network models used are expanded upon in appendix estimating the source distribution given current estimate of the decoder the variational solution for the distribution computed under by solving the functional derivative the constraint that is given by exp where ep ln and is normalisation ae term by substituting this optimal distribution into the original objective we find that it can be expressed in terms of the normalisation function only log the distribution is implicitly defined as an unnormalised distribution there are no direct mechanisms for sampling actions or computing the normalising function for such distributions we could use gibbs or importance sampling but these solutions are not satisfactory as they would require several evaluations of the unknown function per decision per state we obtain more convenient problem by approximating the unnormalised distribution by normalised directed distribution this is equivalent to approximating the energy term by function of the of the directed model ln we introduced scalar function into the approximation but since this is not dependent on the action sequence it does not change the approximation and can be verified by substituting into since is normalised distribution this leaves to account for the normalisation term log verified by substituting and into we therefore obtain cheap estimator of empowerment to optimise the parameters of the directed model and the scalar function we can minimise any measure of discrepancy between the two sides of the approximation we minimise the squared error giving the loss function for optimisation as ep ln at convergence of the optimisation we obtain compact function with which to compute 
the empowerment that only requires forward evaluation of the function is parameterised using an distribution similar to with conditional distributions specified by deep networks the scalar function is also parameterised using deep network further details of these networks are provided in appendix behaviour policies using empowerment as an intrinsic reward measure an agent will seek out states of maximal empowerment we can treat the empowerment value as reward and can then utilise any standard planning algorithm policy gradients or monte carlo search we use the simplest planning strategy by using greedy empowerment maximisation this amounts to choosing actions arg maxa where ep this policy does not account for the effect of actions beyond the planning horizon natural enhancement is to use value iteration to allow the agent to take actions by maximising its long term potentially bound mi true mi approximate nats approximate empowerment empowerment nats true mi parameters variational convolutional source while not converged do read current state convnet compute state repr draw action sequence obtain data acting in env convnet compute state repr log log end while empowerment bound mi algorithm stochastic variational information maximisation for empowerment empowerment nats nats truetrue empowerment figure comparing exact vs approximate empowerment heat maps empowerment in environments two rooms cross room scatter plot agreement for discounted empowerment third approach would be to use empowerment as potential function and the difference between the current and previous state empowerment as shaping function with in the planning fourth approach is one where the agent uses the source distribution as its behaviour policy the source distribution has similar properties to the greedy behaviour policy and can also be used but since it effectively acts as an empowered agents internal exploration mechanism it has large variance it is designed to allow uniform exploration of the state space understanding this choice of behaviour policy is an important line of ongoing research algorithm summary and complexity the system we have described is scalable and general purpose algorithm for mutual information maximisation and we summarise the core components using the computational graph in figure and in algorithm the state representation mechanism used throughout is obtained by transforming raw observations to produce the start and final states respectively when the raw observations are pixels from vision the state representation is convolutional neural network while for other observations such as continuous measurements we use neural network we indicate the parameters of these models using since we use unified loss function we can apply gradient descent and backpropagate stochastic gradients through the entire model allowing for joint optimisation of both the information and representation parameters for optimisation we use preconditioned optimisation algorithm such as adagrad the computational complexity of empowerment estimators involves the planning horizon the number of actions and the number of states for the exact computation we must enumerate over the number of states which for is for grids or for binary images is the complexity of using the ba algorithm is for grid worlds or for binary images the ba algorithm even in environments with small number of interacting objects becomes quickly intractable since the state space grows exponentially with the number of possible interactions and is also exponential 
in the planning horizon in contrast our approach deals directly on the image dimensions using visual inputs the convolutional network produces vector of size upon which all subsequent computation is based consisting of an llayer neural network this gives complexity for state representation of lp the autoregressive distributions have complexity of kn where is the size of the hidden layer thus our approach has at most quadratic complexity in the size of the hidden layers used and linear in other quantities and matches the complexity of any currently employed models in addition since we use gradient descent throughout we are able to leverage the power of gpus and distributed gradient computations results we demonstrate the use of empowerment and the effectiveness of variational information maximisation in two types of environments static environments consists of rooms and mazes in different configurations in which there are no objects with which the agent can interact or other moving figure left empowerment landscape for figure empowerment for room environagent and key scenario yellow is the key and ment showing an empty room room with green is the door right agent in corridor an obstacle room with moveable box with flowing lava the agent places bricks to room with row of moveable boxes stem the flow of lava jects the number of states in these settings is equal to the number of locations in the environment so is still manageable for approaches that rely on state enumeration in dynamic environments aspects of the environment change such as flowing lava that causes the agent to reset or predator that chases the agent for the most part we consider discrete action settings in which the agent has five actions up down left right do nothing the agent may have other actions such as picking up key or laying down brick there are no external rewards available and the agent must reason purely using visual pixel information for all these experiments we used horizon of effectiveness of the mi bound we first establish that the use of the variational information lower bound results in the same behaviour as that obtained using the exact mutual information in set of static environments we consider environments that have at most discrete states and compute the true mutual information using the algorithm we compute the variational information bound on the same environment using pixel information on images to compare the two approaches we look at the empowerment landscape obtained by computing the empowerment at every location in the environment and show these as heatmaps for action selection what matters is the location of the maximum empowerment and by comparing the heatmaps in figure we see that the empowerment landscape matches between the exact and the variational solution and hence will lead to the same in each image in figure we show of the empowerment for each location in the environment we then analyze the point of highest empowerment for the large room it is in the centre of the room for the room it is at the centre of the cross and in environment it is located near both doors in addition we show that the empowerment values obtained by our method constitute close approximation to the true empowerment for the environment correlation coeff these results match those by authors such as klyubin et al using empowerment and freer using different measure the causal entropic force the advantage of the variational approach is clear from this discussion we are able to obtain solutions of the same quality as the 
exact computation we have far more favourable computational scaling one that is not exponential in the size of the state space and planning horizon and we are able to plan directly from pixel information dynamic environments having established the usefulness of the bound and some further understanding of empowerment we now examine the empowerment behaviour in environments with dynamic characteristics even in small environments the number of states becomes extremely large if there are objects that can be moved or added and removed from the environment making enumerative algorithms such as ba quickly infeasible since we have an exponential explosion in the number of states we first reproduce an experiment from salge et al that considers the empowered behaviour of an agent in room that is empty has fixed box has moveable box has row of moveable boxes salge et al explore this setup to discuss the choice of the state representation and that not including the existence of the box severely limits the planning ability of the agent in our approach we do not face this problem of choosing the state representation since the agent will reason about all objects that appear within its visual observations obviating the need for state representations figure shows that in an empty room the empowerment is uniform almost everywhere except close to the walls in room with fixed box the fixed box limits the set of future reachable states and as expected empowerment is low around the box in room where the box can be moved the box can now be seen as tool and we have high empowerment near the box similarly when we have four boxes in row the empowerment is highest around the up down left right stay figure predator red and agent blue scenario panels show the simulation other panels show trace of the path that the figure empowerment planning in predator and prey take at points on its trajectory maze environment black panels show the path the shows path history cyan shows the direction to the maximum empowerment taken by the agent boxes these results match those of salge et al and show the effectiveness of reasoning from pixel information directly figure shows how planning with empowerment works in dynamic maze environment where lava flows from source at the bottom that eventually engulfs the maze the only way the agent is able to safeguard itself is to stem the flow of lava by building wall at the entrance to one of the corridors at every point in time the agent decides its next action by computing the expected empowerment after taking one action in this environment we show the planning for all available actions and bar graph with the empowerment values for each resulting state the action that leads to the highest empowerment is taken and is indicated by the black figure left shows separated by door the agent is able to collect key that allows it to open the door before collecting the key the maximum empowerment is in the region around the key once the agent has collected the key the region of maximum empowerment is close to the figure right shows an agent in corridor and must protect itself by building wall of bricks which it is able to do successfully using the same empowerment planning approach described for the maze setting scenario we demonstrate the applicability of our approach to continuous settings by studying simple physics simulation shown in figure here the agent blue is followed by predator red and is randomly reset to new location in the environment if caught by the predator both the agent and the 
predator are represented as spheres in the environment that roll on surface with friction the state is the position velocity and angular momentum of the agent and the predator and the action is force vector as expected the maximum empowerment lies in regions away from the predator which results in the agent learning to escape the conclusion we have developed new approach for scalable estimation of the mutual information by exploiting recent advances in deep learning and variational inference we focussed specifically on intrinsic motivation with reward measure known as empowerment which requires at its core the efficient computation of the mutual information by using variational lower bound on the mutual information we developed scalable model and efficient algorithm that expands the applicability of empowerment to problems with the complexity of our approach being extremely favourable when compared to the complexity of the algorithm that is currently the standard the overall system does not require generative model of the environment to be built learns using only interactions with the environment and allows the agent to learn directly from visual information or in continuous spaces while we chose to develop the algorithm in terms of intrinsic motivation the mutual information has wide applications in other domains all which stand to benefit from scalable algorithm that allows them to exploit the abundance of data and be applied to problems acknowledgements we thank daniel polani for invaluable guidance and feedback video http video http http http videos references barber and agakov the im algorithm variational approach to information maximization in nips volume pp brunel and nadal mutual information fisher information and population coding neural computation buhmann chehreghani frank and streich information theoretic model selection for pattern analysis workshop on unsupervised and transfer learning cover and thomas elements of information theory john wiley sons duchi hazan and singer adaptive subgradient methods for online learning and stochastic optimization the journal of machine learning research gao steeg and galstyan efficient estimation of mutual information for strongly dependent variables gretton herbrich and smola the kernel mutual information in icasp volume pp itti and baldi bayesian surprise attracts human attention in nips pp jaakkola and jordan improving the mean field approximation via the use of mixture distributions in learning in graphical models pp jung polani and stone empowerment for continuous systems adaptive behavior klyubin polani and nehaniv empowerment universal measure of control in ieee congress on evolutionary computation pp schmidhuber and gomez evolving deep unsupervised convolutional networks for reinforcement learning in gecco pp lecun and bengio convolutional networks for images speech and time series the handbook of brain theory and neural networks little and sommer learning and exploration in loops frontiers in neural circuits mnih kavukcuoglu and silver et al control through deep reinforcement learning nature nelson finding useful questions on bayesian diagnosticity probability impact and information gain psychological review ng andrew harada daishi and russell stuart policy invariance under reward transformations theory and application to reward shaping in icml oudeyer and kaplan how can we define intrinsic motivation in international conference on epigenetic robotics rubin shamir and tishby trading value and information in mdps in decision making 
with Imperfect Decision Makers.
Salge, Glackin, and Polani. Changing the environment based on empowerment as intrinsic motivation. Entropy.
Salge, Glackin, and Polani. Empowerment: an introduction. In Guided Self-Organization: Inception.
Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation. IEEE Trans. Autonomous Mental Development.
Singh, Barto, and Chentanez. Intrinsically motivated reinforcement learning. In NIPS.
Still and Precup. An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences.
Sutton and Barto. Introduction to Reinforcement Learning. MIT Press.
Tishby, Pereira, and Bialek. The information bottleneck method. In Allerton Conference on Communication, Control, and Computing.
Todorov, Erez, and Tassa. MuJoCo: a physics engine for model-based control. In Intelligent Robots and Systems (IROS).
Wissner-Gross and Freer. Causal entropic forces. Phys. Rev.
Yeung. The Blahut-Arimoto algorithms. In Information Theory and Network Coding.
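The action-selection rule used in the maze and predator-prey experiments above (compute the expected empowerment of the state reached by each available action, then take the action with the highest value) can be summarized in a short sketch. This is a minimal illustration rather than the authors' implementation: `estimate_empowerment` stands in for the trained variational mutual-information lower bound, and `env.simulate` is an assumed one-step simulator interface.

```python
import numpy as np

def plan_one_step(env, state, actions, estimate_empowerment, n_samples=10):
    """Choose the action whose expected successor-state empowerment is highest."""
    expected_emp = []
    for a in actions:
        # Average over stochastic transitions; use n_samples=1 for a deterministic environment.
        values = [estimate_empowerment(env.simulate(state, a)) for _ in range(n_samples)]
        expected_emp.append(float(np.mean(values)))
    best = int(np.argmax(expected_emp))
    return actions[best], expected_emp
```

In the lava-maze and predator settings described above, this loop would be invoked once per time step, with the per-action values playing the role of the empowerment bar graphs shown in the figures.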
structural smoothing framework for robust vishwanathan department of computer science university of california santa cruz ca usa vishy pinar yanardag department of computer science purdue university west lafayette in usa ypinar abstract in this paper we propose general smoothing framework for graph kernels by taking structural similarity into account and apply it to derive smoothed variants of popular graph kernels our framework is inspired by smoothing techniques used in natural language processing nlp however unlike nlp applications that primarily deal with strings we show how one can apply smoothing to richer class of that naturally arise in graphs moreover we discuss extensions of the process that can be adapted to smooth structured objects thereby leading to novel graph kernels our kernels are able to tackle the diagonal dominance problem while respecting the structural similarity between features experimental evaluation shows that not only our kernels achieve statistically significant improvements over the unsmoothed variants but also outperform several other graph kernels in the literature our kernels are competitive in terms of runtime and offer viable option for practitioners introduction in many applications we are interested in computing similarities between structured objects such as graphs for instance one might aim to classify chemical compounds by predicting whether compound is active in an screen or not kernel function which corresponds to dot product in reproducing kernel hilbert space offers flexible way to solve this problem rconvolution is framework for computing kernels between discrete objects where the key idea is to recursively decompose structured objects into let denote dot product in reproducing kernel hilbert space represent graph and represent vector of frequencies the kernel between two graphs and is computed by hφ ih many existing graph kernels can be viewed as instances of kernels for instance the graphlet kernel decomposes graph into graphlets subtree kernel referred as for the rest of the paper decomposes graph into subtrees and the kernel decomposes graph into however based graph kernels suffer from few drawbacks first the size of the feature space often grows exponentially as size of the space grows the probability that two graphs will contain similar becomes very small therefore graph becomes similar to itself but not to any other graph in the training data this is well known as the diagonal dominance problem where the resulting kernel matrix is close to the identity matrix second lower order tend to be more numerous while vast majority of the occurs rarely in other words few dominate the distribution this exhibits strong behavior and results in underestimation of the true distribution third the used to define graph kernel are often related to each other however an kernel only respects exact matchings this problem is figure graphlets of size larly important when noise is present in the training data and considering partial similarity between might alleviate the noise problem our solution in this paper we propose to tackle the above problems by using general framework to smooth graph kernels that are defined using frequency vector of decomposed structures we use structure information by encoding relationships between lower and higher order in order to derive our method the remainder of this paper is structured as follows in section we review three families of graph kernels for which our smoothing is applicable in section we review smoothing methods for 
multinomial distributions in section we introduce framework for smoothing structured objects in section we propose bayesian variant of our model that is extended from the hierarchical process in section we discuss related work in section we compare smoothed graph kernels to their unsmoothed variants as well as to other graph kernels we report results on classification accuracy on several benchmark datasets as well as their section concludes the paper graph kernels existing graphs kernels based on can be categorized into three major families graph kernels based on subgraphs graph kernels based on subtree patterns and graph kernels based on walks or paths graph kernels based on subgraphs graphlet is dof see figure given two graphs and the kernel is defined as kgk where and are vectors of normalized counts of graphlets that is the component of resp denotes the frequency of graphlet gi occurring as of resp graph kernels based on subtree patterns is popular instance of graph kernels that decompose graph into its subtree patterns it simply iterates over each vertex in graph and compresses the label of the vertex and labels of its neighbors into multiset label the vertex is then relabeled with the compressed label to be used for the next iteration algorithm concludes after running for iterations and the compressed labels are used for constructing frequency vector for each graph formally given and this kernel is defined as kw lg lg where lg contains the frequency of each compressed label occurring in iterations graph kernels based on walks or paths graph kernel is popular instance of this family this kernel simply compares the sorted endpoints and the length of that are common between two graphs formally let pg represent the set of all in graph and pi pg denote triplet ls le nk where nk is the length of the path and ls and le are the labels of the sourcedand sinkevertices respectively the kernel between graphs and is defined as ksp pg pg where component of pg contains the frequency of triplet occurring in graph resp pg smoothing multinomial distributions in this section we briefly review smoothing techniques for multinomial distributions let em represent sequence of discrete events drawn from ground set figure topologically sorted graphlet dag for where nodes are colored based on degree suppose we would like to estimate the probability ei for some it is well known ca that the maximum likelihood estimate mle can be computed as pm le ei pm where ca denotes the number of times the event appears in the observed sequence and cj denotes the total number of observed events however mle of the multinomial distribution is spiky since it assigns zero probability to the events that did not occur in the observed sequence in other words an event with low probability is often estimated to have zero probability mass the general idea behind smoothing is to adjust the mle of the probabilities by pushing the high probabilities downwards and pushing low or zero probabilities upwards in order to produce more accurate distribution on the events interpolated smoothing methods offer flexible solution between the maximum likelihood model and smoothed model or fallback model the way the fallback model is designed is the key to define new smoothing absolute discounting and interpolated are two popular instances of interpolated smoothing methods max ca md pa ei pa ei here is discount factor md ca is the number of events whose counts are larger than while is the fallback distribution absolute discounting defines the fallback 
distribution as the smoothed version of the mle while uses an unusual estimate of the fallback distribution by using number of different contexts that the event follows in the lower order model smoothing structured objects in this section we first propose new interpolated smoothing framework that is applicable to richer set of objects such as graphs by using directed acyclic graph dag we then discuss how to design such dags for various graph kernels structural smoothing the key to designing new smoothing method is to define fallback distribution which not only incorporates domain knowledge but is also easy to estimate recursively suppose we have access to weighted dag where every node at the level represents an event from the ground set moreover let wij denote the weight of the edge connecting event to event and pa resp ca denote the parents resp children of event in the dag we define our structural smoothing for events at level as follows max ca md wja pss ei pss wja the way to understand the above equation is as follows we subtract fixed discounting factor from every observed event which accumulates to total mass of md each event receives some portion of this accumulated probability mass from its parents the proportion of the mass that parent at level transmits to given child depends on the weight wja between the parent and the child normalized by the sum of the weights of the edges from to all its children and the probability mass pss that is assigned to node in other words the portion child event is able to obtain from the total discounted mass depends on how authoritative its parents are and how strong the relationship between the child and its parents see table in for summarization of various smoothing algorithms using this general framework designing the dag in order to construct dag for smoothing structured objects we first construct vocabulary that denotes the set of all unique that are going to be smoothed each item in the vocabulary corresponds to node in the dag can be generated statically or dynamically based on the type of the graph kernel exploits for instance it requires effort to generate the vocabulary of size graphlets for graphlet kernel however one needs to build the vocabulary dynamically in and kernels since the depend on the node labels obtained from the datasets after constructing the vocabulary the relationship between needs to be obtained given of size we apply transformation to find all possible of size that can be reduced into each that is obtained by this transformation is assigned as parent of after obtaining the relationship between the dag is constructed by drawing directed edge from each parent to its children nodes since all descendants of given at depth are at depth this results in topological ordering of the vertices and hence the resulting graph is indeed dag next we discuss how to construct such dags for different graph kernels graphlet kernel we construct the vocabulary for graphlet kernel by enumerating all canonical graphlets of size up to each graphlet is node in the dag we then apply transformation to infer the relationship between graphlets as follows we place directed edge from graphlet to if and only if can be obtained from by deleting node in other words all edges from graphlet of size point to graphlet of size in order to assign weights to the edges given graphlet pair and we count the number of times can be obtained from by deleting node call this number recall that is of size and is of size and therefore can at most be let cg denote the 
set of children of node in the dag and ng then we define the weight of the edge connecting and as the idea here is that the weight encodes the proportion of different ways of extending which results in the graphlet for instance let us consider and its parents see figure for the dag of graphlets with size even if graphlet is not observed in the training data it still gets probability mass proportional to the edge weight from its parents in order to overcome the sparsity problem of unseen data kernel the kernel performs an exact matching between the compressed multiset labels for instance given two labels abcde and abcdf it simply assigns zero value for their similarity even though two labels have partial similarity in order to smooth kernel we first run the original algorithm and obtain the multiset representation of each graph in the dataset we then apply transformation to infer the relationship between compressed labels as follows in each iteration of algorithm and for each multiset label of size in the vocabulary we generate its power set by computing all subsets of size while keeping the root node fixed for instance the parents of multiset label abcde are abcd abce abde acde then we simply construct the dag by drawing directed edge from parent labels to children notice that considering only the set of labels generated from the kernel is not sufficient enough for constructing valid dag for instance it might be the case that none of the possible parents of given label exists in the vocabulary simply due to the sparsity problem of all possible parents of abcde we might only observe abce in the training data thus restricting ourselves to the original vocabulary leaves such labels orphaned in the dag therefore we consider pseudo parents as part of the vocabulary when constructing the dag since the in this kernel are we use uniform weight between parent and its children kernel similar to other graph kernels discussed above graph kernel does not take partial similarities into account for instance given two abcde and abcdf compressed as and respectively it assigns zero for their similarity since their sink labels are different however one can notice that exhibit strong dependency relationship for instance given pij abcde of size one can derive the abcd abc ab of size as result of the optimal property that is one can show that all of are also with we used nauty to obtain isomorphic representations of graphlets figure an illustration of table assignment adapted from in this example labels at the tables are given by black dots indicate the number of occurrences of each label in draws from the process the same source node in order to smooth kernel we first build the vocabulary by computing all for each graph let pij be of size and pij be shortestpath of size that is obtained by removing the sink node of pij let lij be the compressed form of pij that represents the sorted labels of its endpoints and concatenated to its length resp lij then in order to build the dag we draw directed edge from lij of depth to lij of depth if and only if pij is of pij in other words all ascendants of lij consist of the compressed labels obtained from of pij of size similar to kernel we assign uniform weight between parents and children smoothing processes are known to produce distributions novel interpretation of interpolated is proposed by as approximate inference in hierarchical bayesian model consisting of process by following similar spirit we extend our model to adapt process as an alternate smoothing framework 
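Before turning to the Pitman-Yor variant, the interpolated structural-smoothing step above can be made concrete with a short sketch. It assumes the DAG is supplied as a parent map plus edge weights, that `counts` contains an entry (possibly zero) for every vocabulary item at the current level, and that the level below has already been smoothed; the function and variable names are illustrative and not taken from any released code.

```python
from collections import defaultdict

def structural_smoothing(counts, parents, edge_weight, prev_level_probs, discount=0.5):
    """Smooth level-k substructure counts, using DAG parents (level k-1) as the fallback."""
    total = float(sum(counts.values()))
    n_discounted = sum(1 for c in counts.values() if c > 0)   # events that each give up `discount` mass
    reserved = discount * n_discounted / total                # total probability mass redistributed

    out_weight = defaultdict(float)                           # sum of each parent's outgoing edge weights
    for child, pars in parents.items():
        for p in pars:
            out_weight[p] += edge_weight[(p, child)]

    smoothed = {}
    for e, c in counts.items():
        direct = max(c - discount, 0.0) / total
        fallback = sum(edge_weight[(p, e)] / out_weight[p] * prev_level_probs.get(p, 0.0)
                       for p in parents.get(e, []))
        smoothed[e] = direct + reserved * fallback
    return smoothed
```

As in the equation above, an unseen substructure receives only the fallback term, so its mass is proportional to how authoritative its parents are and how strongly it is connected to them in the DAG.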
process on ground set of graphlets is defined via pk where is discount parameter is strength parameter and pk is base distribution the most intuitive way to understand draws from the process is via the chinese restaurant process see figure consider restaurant with an infinite number of tables algorithm insert customer input pk occupied tables counts of customers labels of tables if then append to draw graphlet gi pk insert customer in parent draw gj wij append gj to return gj else with probability max cj cj cj return lj with probability proportional to dt append to draw graphlet gi pk insert customer in parent draw gj wij append gj to return gj end if where customers enter the restaurant one by one the first customer sits at the first table and graphlet is assigned to it by drawing sample from the base distribution since this table is occupied for the first time the label of the first table is the first graphlet drawn from the process subsequent customers when they enter the restaurant decide to sit at an already occupied table with probability proportional to ci where ci represents the number of customers already sitting at table if they sit at an already occupied table then the label of that table denotes the next graphlet drawn from the process on the other hand with probability where is the current number of occupied tables new customer might decide to occupy new table in this case the base distribution is invoked to label this table with graphlet intuitively the reason this process generates behavior is because popular graphlets which are served on tables with large number of customers have higher probability of attracting new customers and hence being generated again similar to rich gets richer phenomenon in hierarchical process the base distribution pk is recursively defined via process pk dk θk in order to label table we need draw from pk which is obtained by inserting customer into the corresponding restaurant however adopting the traditional hierarchical process is not straightforward in our case since the size of the context differs between levels of hierarchy that is child restaurant in the hierarchy can have more than one parent restaurant to request label from in other words is defined over of size while pk is defined over gk of size nk therefore one needs transformation function to transform base distributions of different sizes we incorporate edge weights between parent and child restaurants by using the same weighting scheme in section this changes the chinese restaurant process as follows when we need to label table we will first draw graphlet gi pk by inserting customer into the corresponding restaurant given gi we will draw graphlet gj proportional to wij where wij is obtained from the dag see algorithm for pseudo code of inserting customer deletion of customer is handled similarly see algorithm algorithm delete customer input with probability cl cl cl gj lj if cl then pk delete cl from delete lj from end if return related work survey of most popular graph kernel methods is already given in previous sections several methods proposed in smoothing structured objects our framework is similar to dependency tree kernels since both methods are using the notion of smoothing for structured objects however our method is interested in the problem of smoothing the count of structured objects thus while smoothing is achieved by using dag we discard the dag once the counts are smoothed another related work to ours is propagation kernels that define graph features as counts of 
similar distributions on the respective graphs by using locality sensitive hashing lsh our framework not only considers node label distributions but also explicitly incorporates structural similarity via the dag another similar work to ours is recently proposed framework by which learns the relationship between by using neural language models however their framework do not respect the structural similarity between which is an important property to consider especially in the presence of noise in edges or labels experiments the aim of our experiments is threefold first we want to show that smoothing graph kernels significantly improves the classification accuracy second we want to show that the smoothed kernels are comparable to or outperform graph kernels in terms of classification table comparison of classification accuracy standard deviation of sp wl graphlet gk kernels with their smoothed variants smoothed variants with statistically significant improvements over the base kernels are shown in bold as measured by with value of ramon ram walk and random walk kernels are included for additional comparison where indicates the computation did not finish in hours runtime for constructing the dag and smoothing smth the counts are also reported where indicates seconds and indicates minutes dataset sp moothed wl moothed gk moothed yp am randomwalk andom walk dag mth gk dag mth sp dag mth wl dag mth pyp utag tc nzymes roteins ci ci accuracy while remaining competitive in terms of computational requirements third we want to show that our methods outperform base kernels when edge or label noise is presence datasets we used the following benchmark datasets used in graph kernels mutag ptc enzymes proteins and mutag is dataset of mutagenic aromatic and heteroaromatic nitro compounds with discrete labels ptc is dataset of chemical compounds has discrete labels enzymes is dataset of protein tertiary structures obtained from and has discrete labels proteins is dataset of graphs obtained from having discrete labels and are two balanced datasets of chemical compounds having size and with and labels respectively experimental setup we compare our framework against representative instances of major families of graph kernels in the literature in addition to the base kernels we also compare our smoothed kernels with the random walk kernel the subtree and random walk kernel the random walk random walk and are written in matlab and obtained from all other kernels were coded in python except smoothing which is coded in we used parallel implementation for smoothing the counts of kernel for efficiency all kernels are normalized to have unit length in the feature space moreover we use cross validation with binary vector machine svm where the value for each fold is independently tuned using training data from that fold in order to exclude random effects of the fold assignments this experiment is repeated times and average prediction accuracy of experiments with their standard deviations are results in our first experiment we compare the base kernels with their smoothed variants as can be seen from table smoothing improves the classification accuracy of every base kernel on every dataset with majority of the improvements being statistically significant with we observe that even though smoothing improves the accuracy of graphlet kernels on proteins and the improvements are not statistically significant we believe this is due to the fact that these datasets are not sensitive to structural noise as much as the other 
datasets thus considering the partial similarities we modified the open source implementation of pyp https implementations of original and smoothed versions of the kernels datasets and detailed discussion of parameter selection procedure with the list of parameters used in our experiments can be accessed from http figure classification accuracy noise for base graph kernels dashed lines and their smoothed variants lines do not improve the results significantly moreover pyp smoothed graphlet kernels achieve statistically significant improvements in most of the datasets however they are outperformed by smoothed graphlet kernels introduced in section in our second experiment we picked the best smoothed kernel in terms of classification accuracy for each dataset and compared it against the performance of graph kernels see table smoothed kernels outperform other methods on all datasets and the results are statistically significant on every dataset except ptc in our third experiment we investigated the runtime behavior of our framework with two major costs first one has to compute dag by using the original feature vectors next the constructed dag need to be used to compute smoothed representations of the feature vectors table shows the total wallclock runtime taken by all graphs for constructing the dag and smoothing the counts for each dataset as can be seen from the runtimes our framework adds constant factor to the original runtime for most of the datasets while the dag creation in kernel also adds negligible overhead the cost of smoothing becomes significant if the vocabulary size gets prohibitively large due to the exponential growing nature of the kernel to subtree parameter finally in our fourth experiment we test the performance of graph kernels when edge or label noise is present for edge noise we randomly removed and added of the edges in each graph for label noise we randomly flipped of the node labels in each graph where random labels are selected proportionally to the original of the graph figure shows the performance of smoothed graph kernels under noise as can be seen from the figure smoothed kernels are able to outperform their base variants when noise is present an interesting observation is that even though significant amount of edge noise is added to proteins and nci datasets the performance of base kernels do not change drastically this further supports our observation that these datasets are not sensitive to structural noise as much as the other datasets conclusion and future work we presented novel framework for smoothing graph kernels inspired by smoothing techniques from natural language processing and applied our method to graph kernels our framework is rather general and lends itself to many extensions for instance by defining domainspecific relationships one can construct different dags with different weighting schemes another interesting extension of our smoothing framework would be to apply it to graphs with continuous labels moreover even though we restricted ourselves to graph kernels in this paper our framework is applicable to any kernel that uses based representation such as string kernels acknowledgments we thank to hyokun yun for his tremendous help in implementing processes we also thank to anonymous nips reviewers for their constructive comments and jiasen yang joon hee choi amani abu jabal and parameswaran raman for reviewing early drafts of the paper this work is supported by the national science foundation under grant no references borgwardt and kriegel 
Shortest-path kernels on graphs. In ICML.
Borgwardt, Ong, Vishwanathan, Smola, and Kriegel. Protein function prediction via graph kernels. In ISMB, Detroit, USA.
Chen and Goodman. An empirical study of smoothing techniques for language modeling. In ACL.
Croce, Moschitti, and Basili. Structured lexical similarity via convolution kernels on dependency trees. In Proceedings of EMNLP. Association for Computational Linguistics.
Debnath, Lopez de Compadre, Debnath, Shusterman, and Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds: correlation with molecular orbital energies and hydrophobicity. Med. Chem.
Feragen, Kasenburg, Petersen, de Bruijne, and Borgwardt. Scalable kernels for graphs with continuous attributes. In NIPS.
Gärtner, Flach, and Wrobel. On graph kernels: hardness results and efficient alternatives. In COLT.
Goldwater, Griffiths, and Johnson. Interpolating between types and tokens by estimating power-law generators. NIPS.
Goldwater, Griffiths, and Johnson. Producing power-law distributions and damping word frequencies with two-stage language models. JMLR.
Haussler. Convolution kernels on discrete structures. Technical report, UC Santa Cruz.
Kandola, Graepel, and Shawe-Taylor. Reducing kernel matrix diagonal dominance using semi-definite programming. In COLT, Lecture Notes in Computer Science, Washington, DC.
Kneser and Ney. Improved backing-off for m-gram language modeling. In ICASSP.
McKay. Nauty user's guide. Australian National University.
Neumann, Garnett, Moreno, Patricia, and Kersting. Propagation kernels for partially labeled graphs. In Workshop on Mining and Learning with Graphs, Edinburgh, UK.
Ney, Essen, and Kneser. On structuring probabilistic dependences in stochastic language modeling. Computer Speech and Language.
Pitman and Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability.
Przulj. Biological network comparison using graphlet degree distribution. In ECCB.
Ramon and Gärtner. Expressivity versus efficiency of graph kernels. Technical report, First International Workshop on Mining Graphs, Trees and Sequences (held with ECML/PKDD).
Schölkopf and Smola. Learning with Kernels.
Severyn and Moschitti. Fast support vector machines for convolution tree kernels. Data Mining and Knowledge Discovery.
Shervashidze and Borgwardt. Fast subtree kernels on graphs. In NIPS.
Shervashidze, Vishwanathan, Petri, Mehlhorn, and Borgwardt. Efficient graphlet kernels for large graph comparison. In AISTATS.
Shervashidze, Schweitzer, van Leeuwen, Mehlhorn, and Borgwardt. Weisfeiler-Lehman graph kernels. JMLR.
Smola and Kondor. Kernels and regularization on graphs. In COLT.
Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In ACL.
Toivonen, Srinivasan, King, Kramer, and Helma. Statistical evaluation of the predictive toxicology challenge. Bioinformatics.
Vishwanathan, Schraudolph, Kondor, and Borgwardt. Graph kernels. JMLR.
Wale, Watson, and Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems.
Yanardag and Vishwanathan. Deep graph kernels. In KDD. ACM.
Zhai and Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst.
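As a companion to the smoothing sketch above, the graphlet-DAG edge weights described earlier (the fraction of single-node deletions of a size-(k+1) graphlet that yield a given size-k graphlet) can be computed as follows. This sketch uses networkx isomorphism tests on unlabeled graphs in place of nauty canonical forms, so it is illustrative rather than the implementation used in the experiments; `children` is assumed to list pairwise non-isomorphic size-k graphlets.

```python
import networkx as nx

def child_edge_weights(G, children):
    """Fraction of single-node deletions of G (size k+1) yielding each child graphlet (size k)."""
    deletion_counts = [0] * len(children)
    for v in list(G.nodes()):
        H = G.copy()
        H.remove_node(v)
        for i, g in enumerate(children):
            if nx.is_isomorphic(H, g):            # node labels are ignored: graphlets are unlabeled
                deletion_counts[i] += 1
                break
    total = sum(deletion_counts)
    return [c / total if total else 0.0 for c in deletion_counts]
```

For labeled substructures one could pass a node_match function to is_isomorphic, although, as described above, the Weisfeiler-Lehman and shortest-path DAGs simply use uniform weights between parents and children.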
competitive distribution estimation why is good ananda theertha suresh uc san diego asuresh alon orlitsky uc san diego alon abstract estimating distributions over large alphabets is fundamental tenet yet no method is known to estimate all distributions well for example estimators are nearly optimal but often perform poorly in practice and practical estimators such as absolute discounting and are not known to be near optimal for essentially any distribution we describe the first universally probability estimators for every discrete distribution they are provably nearly the best in the following two competitive ways first they estimate every distribution nearly as well as the best estimator designed with prior knowledge of the distribution up to permutation second they estimate every distribution nearly as well as the best estimator designed with prior knowledge of the exact distribution but as all natural estimators restricted to assign the same probability to all symbols appearing the same number of times specifically for distributions over symbols and samples we show that for both comparisons simple variant of estimator is always within kl divergence of on from the best estimator and that more involved estimator is within min conversely we show that any estimator must have kl divergence at least min over the best estimator for the first comparison and at least min for the second introduction background many learning applications ranging from staples such as speech recognition and machine translation to biological studies in virology and bioinformatics call for estimating large discrete distributions from their samples probability estimation over large alphabets has therefore long been the subject of extensive research both by practitioners deriving practical estimators and by theorists searching for optimal estimators yet even after all this work estimators remain elusive the estimators frequently analyzed by theoreticians are nearly optimal yet perform poorly for many practical distributions while common practical estimators such as absolute discounting and are not well understood and lack provable performance guarantees to understand the terminology and approach solution we need few definitions the performance of an estimator for an underlying distribution is typically evaluated in terms of the leibler kl divergence def px log px qx reflecting the expected increase in the ambiguity about the outcome of when it is approximated by kl divergence is also the increase in the number of bits over the entropy that uses to compress the output of and is also the of estimating by it is therefore of interest to construct estimators that approximate large class of distributions to within small kl divergence we now describe one of the problem simplest formulations loss distribution estimator over support set associates with any observed sample sequence def distribution over given samples xn generated independently according to distribution over the expected kl loss of is rn let be known collection of distributions over discrete set the loss of an estimator over all distributions in is def rn max rn and the lowest loss for achieved by the best estimator is the loss def rn min rn min max rn performance can be viewed as regret relative to an oracle that knows the underlying distribution hence from here on we refer to it as regret the most natural and important collection of distributions and the one we study here is the set of all discrete distributions over an alphabet of some size which without loss 
of generality we assume to be hence the set of all distributions is the simplex in dimensions def pk pi and pi following researchers have studied rn and related quantities for example see we outline some of the results derived estimators the estimator assigns to symbol that appeared times probability proportional to for example if three coin tosses yield one heads and two tails the estimator assigns probability to heads and to tails showed that as for every as an estimator related to is near optimal and achieves rn the more challenging and practical regime is where the sample size is not overwhelmingly larger than the alphabet size for example in english text processing we need to estimate the distribution of words following context but the number of times context appears in corpus may not be much larger than the vocabulary size several results are known for other regimes as well when the sample size is linear in the alphabet size rn can be shown to be constant and showed that as estimators achieve the optimal rn log while estimators are nearly optimal the distributions attaining the regret are near uniform in practice distributions are rarely uniform and instead tend to follow for these distributions estimators the estimators described in the next subsection practical estimators for real applications practitioners tend to use more sophisticated estimators with better empirical performance these include the estimator that the sample to find the best fit for the observed data or the estimators that rather than add positive constant to each count do the opposite and subtract positive constant perhaps the most popular and enduring have been the estimator and some of its def def variations let nx nx xn be the number of times symbol appears in xn and let ϕt ϕt xn be the number of symbols appearing times in xn the basic estimator posits that if nx qx xn ϕt surprisingly relating the probability of an element not just to the number of times it was observed but also to the number other elements appearing as many and one more times it is easy to see that this basic version of the estimator may not work well as for example it assigns any element appearing times probability hence in practice the estimator is modified for example using empirical frequency to elements appearing many times the estimator was published in and quickly adapted for use but for half century no proofs of its performance were known following several papers showed that variants estimate the combined probability of symbols appearing any given number of times with accuracy that does not depend on the alphabet size and showed that different variation of similarly estimates the probabilities of each symbol and all unseen symbols combined however these results do not explain why estimators work well for the actual probability estimation problem that of estimating the probability of each element not of the combination of elements appearing certain number of times to define and derive estimators we take different competitive approach competitive optimality overview to evaluate an estimator we compare its performance to the best possible performance of two estimators designed with some prior knowledge of the underlying distribution the first estimator is designed with knowledge of the underlying distribution up to permutation of the probabilities namely knowledge of the probability multiset but not of the association between probabilities and symbols the second estimator is designed with exact knowledge of the distribution but like all 
natural estimators forced to assign the same probabilities to symbols appearing the same number of times for example upon observing the sample the estimator must assign the same probability to and and the same probability to and these estimators can not be implemented in practice as in reality we do not have prior knowledge of the estimated distribution but the prior information is chosen to allow us to determine the best performance of any estimator designed with that information which in turn is better than the performance of any estimator designed without prior information we then show that certain variations of the estimators designed without any prior knowledge approach the performance of both estimators for every underlying distribution competing with near full information we first define the performance of an estimator designed with some knowledge of the underlying distribution suppose that the estimator is designed with the aid of an oracle that knows the value of for some given function over the class of distributions the function partitions into subsets each corresponding to one possible value of we denote the subsets by and the partition by and as before denote the individual distributions by then the oracle knows the unique partition part such that for example if is the multiset of then each subset corresponds to set of distributions with the same probability multiset and the oracle knows the multiset of probabilities for every partition part an estimator incurs the regret in rn max rn the oracle knowing the unique partition part incurs the least regret rn min rn the competitive regret of over the oracle for all distributions in is rn rn the competitive regret over all partition parts and all distributions in each is def rnp max rn rn and the best possible competitive regret is def rnp min rnp consolidating the intermediate definitions rnp min max max rn rn namely an estimator who knows the partition part incurs regret rn over each part and the competitive regret rnp of estimators is the least overall increase in the regret due to not knowing in appendix we give few examples of such partitions partition refines partition if every part in is partitioned by some parts in for example refines in appendix we show that if refines then for every rnp rnp considering the collection of all distributions over it follows that as we start with partition and keep refining it till the oracle knows the competitive regret of estimators will increase from to rn natural question is therefore how much information can the oracle have and still keep the competitive regret low we show that the oracle can know the distribution exactly up to permutation and still the regret will be very small two distributions and permutation equivalent if for some permutation of pi for all for example and are permutation equivalent permutation equivalence is clearly an equivalence relation and hence partitions the collection of distributions over into equivalence classes let pσ be the corresponding partition we construct estimators that uniformly bound rnpσ thus the same estimator uniformly bounds rnp for any coarser partition of such as partitions into classes of distributions with the same support size or entropy note that the partition pσ corresponds to knowing the underlying distribution up to permutation hence rnpσ is the additional kl loss compared to an estimator designed with knowledge of the underlying distribution up to permutation this notion of competitiveness has appeared in several contexts in data 
compression it is called while in statistics it is often called adaptive or local minmax and recently in property testing it is referred as competitive or subsequent to this work studied competitive estimation in distance however their regret is poly log compared to our competing with natural estimators our second comparison is with an estimator designed with exact knowledge of but forced to be natural namely to assign the same probability to all symbols appearing the same number of times in the sample for example for the observed sample the same probability must be assigned to and and the same probability to and since estimators derive all their knowledge of the distribution from the data we expect them to be natural we compare the regret of estimators to that of natural estimators let qnat be the set of all natural estimators for distribution the lowest regret of natural estimator designed with prior knowledge of is def rnnat minnat rn and the regret of an estimator relative to the is rnnat rn rnnat thus the regret of an estimator over all distributions in is rnnat max rnnat and the best possible competitive regret is rnnat minq rnnat in the next section we state the results showing in particular that rnnat is uniformly bounded in section we outline the proofs and in section we describe experiments comparing the performance of competitive estimators to that of motivated estimators results estimators are often used in conjunction with empirical frequency where estimates low probabilities and empirical frequency estimates large probabilities we first show that even this simple version defined in appendix and denoted is uniformly optimal for all distributions for simplicity we prove the result when the number of samples is poi pσ nat poisson random variable with mean let rpoi and rpoi be the regrets in this sampling process similar result holds with exactly samples but the proof is more involved as the multiplicities are dependent theorem appendix for any and on pσ nat rpoi rpoi furthermore lower bound in shows that this bound is optimal up to logarithmic factors more complex variant of denoted was proposed in we show that its regret diminishes uniformly in both the and formulations theorem section for any and pσ nat rn rn min where and below also hide multiplicative logarithmic factors in lemma in section and lower bound in can be combined to prove matching lower bound on the competitive regret of any estimator for the second formulation rnnat min hence has competitive regret relative to natural estimators fano inequality usually yields lower bounds on kl loss not regret by carefully constructing distribution classes we lower bound the competitive regret relative to the estimators theorem appendix for any and pσ rn min illustration and implications figure demonstrates some of the results the horizontal axis reflects the set of distributions illustrated on one dimension the vertical axis indicates the kl loss or absolute regret for clarity shown for the blue line is the upper bound on the regret which by is very high for this regime log the red line is the regret of the estimator designed with prior knowledge of the probability multiset observe that while for some probability multisets the regret approaches the log upper bound for other probability multisets it is much lower and for some such as uniform over or over symbols where the probability multiset determines the distribution it is even for many practically relevant distributions such as distributions and sparse distributions the 
regret is small compared to log the green line is an upper bound on the absolute regret of the estimator by theorem it is always at most larger than the red line it follows that for many distributions possibly for distributions with more structure such as those occurring in nature the regret of is significantly smaller than the pessimistic bound implies rn log min nk kl loss distributions uniform distribution figure qualitative behavior of the kl loss as function of distributions in different formulations we observe few consequences of these results theorems and establish two estimators and their relative regrets diminish to zero at least as fast as and respectively independent of how large the alphabet size is although the results are for relative regret as shown in figure they lead to estimator with smaller absolute regret namely the expected kl divergence the same regret upper bounds hold for all coarser partitions of where instead of knowing the multiset the oracle knows some property of multiset such as entropy experiments recall that for sequence xn nx denotes the number of times symbol appears and ϕt denotes the number of symbols appearing times for small values of and the estimator proposed in simplifies to combination of and empirical estimators by lemmas and for symbols appearing times if then the estimate is close to the underlying total probability mass otherwise the empirical estimate is closer hence for symbol appearing times if we use the estimator otherwise we use the empirical estimator if nx if qx else ϕt where is normalization factor note that we have replaced in the estimator by to ensure that every symbol is assigned probability expected kl divergence laplace empirical number of samples uniform laplace empirical expected kl divergence zipf with parameter number of samples number of samples number of samples uniform prior dirichlet laplace empirical laplace empirical expected kl divergence zipf with parameter step expected kl divergence number of samples laplace empirical expected kl divergence expected kl divergence laplace empirical number of samples dirichlet prior figure simulation results for support number of samples ranging from to averaged over trials we compare the performance of this estimator to four estimators three popular estimators and the optimal natural estimator an estimator has the form nx where is normalization factor to ensure that the probabilities add up to the laplace estimator βtl minimizes the expected loss when the underlying distribution is generated by uniform prior over the estimator βtkt is asymptotically optimal for the cumulative regret and minimizes the expected loss when the underlying distribution is generated according to prior the estimator βtbs is asymptotically optimal for rn finally as shown in lemma the optimal estimator qx ϕnnx achieves the lowest loss of any natural estimator designed with knowledge of the underlying distribution we compare the performance of the proposed estimator to that of the four estimators above we consider six distributions uniform distribution step distribution with half the symbols having probability and the other half have probability zipf distribution with parameter pi zipf distribution with parameter pi distribution generated by the uniform prior on and distribution generated from prior all distributions have support size ranges from to and the results are averaged over trials figure shows the results observe that the proposed estimator performs similarly to the best natural estimator for all six 
distributions it also significantly outperforms the other estimators for zipf uniform and step distributions the performance of other estimators depends on the underlying distribution for example since laplace is the optimal estimator when the underlying distribution is generated from the uniform prior it performs well in figure however performs poorly on other distributions furthermore even though for distributions generated by dirichlet priors all the estimators have similar looking regrets figures the proposed estimator performs better than estimators which are not designed specifically for that prior proof sketch of theorem the proof consists of two parts we first show that for every estimator rnpσ rnnat and then upper bound rnnat using results on combined probability mass lemma appendix for every estimator rnpσ rnnat the proof of the above lemma relies on showing that the optimal estimator for every class in pσ is natural relation between rnnat and combined probability estimation we now relate the regret in estimating distribution to that of estimating the combined or total probability mass defined as follows recall that ϕt denotes the number of symbols appearing times def for sequence xn let st st xn denote the total probability of symbols appearing times for notational convenience we use st to denote both st xn and st and the usage becomes clear in the context similar to kl divergence between distributions we define kl divergence between and their estimates as st st log since the natural estimator assigns same probability to symbols that appear the same number of times estimating probabilities is same as estimating the total probability of symbols appearing given number of times we formalize it in the next lemma lemma appendix for natural estimator let xn nx qx xn then rnnat in lemma appendix we show that there is natural estimator that achieves rnnat taking maximum over all distributions and minimum over all estimators results in lemma for natural estimator let xn nx qx xn then rnnat max furthermore rnnat min max thus finding the best competitive natural estimator is same as finding the best estimator for the combined probability mass proposed an algorithm for estimating such that for all and for all with probability the result is stated in theorem of one can convert this result to result on expectation easily using the property that their estimator is bounded below by and show that max pn slight modification of their proofs for lemma and theorem in their paper using ϕt pn ϕt shows that their estimator for the combined probability mass satisfies max min the above equation together with lemmas and results in theorem acknowledgements we thank jayadev acharya moein falahatgar paul ginsparg ashkan jafarpour mesrob ohannessian venkatadheeraj pichapati yihong wu and the anonymous reviewers for helpful comments references william gale and geoffrey sampson frequency estimation without tears journal of quantitative linguistics chen and goodman an empirical study of smoothing techniques for language modeling in acl liam paninski variational minimax estimation of discrete distributions under kl loss in nips hermann ney ute essen and reinhard kneser on structuring probabilistic dependences in stochastic language modelling computer speech language fredrick jelinek and robert mercer probability distribution estimation from sparse data ibm tech disclosure irving good the population frequencies of species and the estimation of population parameters biometrika thomas cover and joy thomas elements 
of Information Theory. Wiley.
Krichevsky. Universal Compression and Retrieval. Kluwer, Dordrecht, The Netherlands.
Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On learning distributions from their samples. In COLT.
Dietrich Braess and Thomas Sauer. Bernstein polynomials and learning theory. Journal of Approximation Theory.
David McAllester and Robert Schapire. On the convergence rate of Good-Turing estimators. In COLT.
Evgeny Drukh and Yishay Mansour. Concentration bounds for unigrams language model. In COLT.
Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Optimal probability estimation with applications to prediction and classification. In COLT.
Alon Orlitsky, Narayana Santhanam, and Junan Zhang. Always Good Turing: asymptotically optimal probability estimation. In FOCS.
Boris Yakovlevich Ryabko. Twice-universal coding. Problemy Peredachi Informatsii.
Boris Yakovlevich Ryabko. A fast adaptive coding algorithm. Problemy Peredachi Informatsii.
Dominique Bontemps, Stéphane Boucheron, and Elisabeth Gassiat. About adaptive coding on countable alphabets. IEEE Transactions on Information Theory.
Stéphane Boucheron, Elisabeth Gassiat, and Mesrob Ohannessian. About adaptive coding on countable alphabets: envelope classes. CoRR.
David Donoho and Iain Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika.
Felix Abramovich, Yoav Benjamini, David Donoho, and Iain Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics.
Peter Bickel, Chris Klaassen, Ya'acov Ritov, and Jon Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore.
Andrew Barron, Lucien Birgé, and Pascal Massart. Risk bounds for model selection via penalization. Probability Theory and Related Fields.
Alexandre Tsybakov. Introduction to Nonparametric Estimation. Springer.
Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, and Shengjun Pan. Competitive closeness testing. COLT.
Jayadev Acharya, Hirakendu Das, Ashkan Jafarpour, Alon Orlitsky, Shengjun Pan, and Ananda Theertha Suresh. Competitive classification and closeness testing. In COLT.
Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. A competitive test for uniformity of monotone distributions. In AISTATS.
Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. In FOCS.
Gregory Valiant and Paul Valiant. Instance optimal learning. CoRR.
Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press.
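To make the estimator used in the experiments section above concrete, here is a minimal sketch of the combined Good-Turing / empirical rule: a symbol's count is estimated with a Good-Turing term when many symbols appear one more time than it (with phi_{t+1} replaced by phi_{t+1}+1 so that every symbol receives positive mass), and with its empirical count otherwise. The exact switching threshold and normalization constants were stripped from the text, so the condition phi[t+1] > t below is an assumption, and all names are illustrative.

```python
from collections import Counter

def combined_good_turing(sample, support):
    """Estimate a distribution over `support` from `sample` (a list of observed symbols)."""
    counts = Counter(sample)
    phi = Counter(counts.values())                 # phi[t] = number of symbols appearing exactly t times
    phi[0] = sum(1 for x in support if counts[x] == 0)

    def unnormalized(t):
        if phi[t + 1] > t:                         # assumed threshold: trust the Good-Turing estimate here
            return (phi[t + 1] + 1) / max(phi[t], 1)
        return float(t)                            # otherwise fall back to the empirical count

    raw = {x: unnormalized(counts[x]) for x in support}
    norm = sum(raw.values())
    return {x: v / norm for x, v in raw.items()}
```

A practical implementation would also guard the corner case in which no symbol appears t+1 times, where the Good-Turing term vanishes and unseen symbols would otherwise receive zero mass.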
efficient learning by directed acyclic graph for resource constrained prediction joseph wang department of electrical computer engineering boston university boston ma joewang kirill trapeznikov systems technology research woburn ma venkatesh saligrama department of electrical computer engineering boston university boston ma srv abstract we study the problem of reducing acquisition costs in classification systems our goal is to learn decision rules that adaptively select sensors for each example as necessary to make confident prediction we model our system as directed acyclic graph dag where internal nodes correspond to sensor subsets and decision functions at each node choose whether to acquire new sensor or classify using the available measurements this problem can be posed as an empirical risk minimization over training data rather than jointly optimizing such highly coupled and problem over all decision nodes we propose an efficient algorithm motivated by dynamic programming we learn node policies in the dag by reducing the global objective to series of cost sensitive learning problems our approach is computationally efficient and has proven guarantees of convergence to the optimal system for fixed architecture in addition we present an extension to map other budgeted learning problems with large number of sensors to our dag architecture and demonstrate empirical performance exceeding algorithms for data composed of both few and many sensors introduction many scenarios involve classification systems constrained by measurement acquisition budget in this setting collection of sensor modalities with varying costs are available to the decision system our goal is to learn adaptive decision rules from labeled training data that when presented with an unseen example would select the most informative and acquisition strategy for this example in contrast methods attempt to identify common sparse subset of sensors that can work well for all data our goal is an adaptive method that can classify typical cases using inexpensive sensors while using expensive sensors only for atypical cases we propose an adaptive sensor acquisition system learned using labeled training examples the system modeled as directed acyclic graph dag is composed of internal nodes which contain decision functions and single sink node the only node with no outgoing edges representing the terminal action of stopping and classifying sc at each internal node decision function routes an example along one of the outgoing edges sending an example to another internal node represents acquisition of previously unacquired sensor whereas sending an example to the sink node indicates that the example should be classified using the currently acquired set of sensors the goal is to learn these decision functions such that the expected error of the system is minimized subject to an expected budget constraint first we consider the case where the number of sensors available is small as in though the dimensionality of data acquired by each sensor may be large such as an image taken in different modalities in this scenario we construct dag that allows for sensors to be acquired in any order and classification to occur with any set of sensors in this regime we propose novel algorithm to learn node decisions in the dag by emulating dynamic programming dp in our approach we decouple complex sequential decision problem into series of tractable learning subproblems learning csl generalizes learning by allowing decision costs to be data dependent such 
reduction enables us to employ computationally efficient csl algorithms for iteratively learning node functions in the dag in our theoretical analysis we show that given fixed dag architecture the policy risk learned by our algorithm converges to the bayes risk as the size of the training set grows next we extend our formulation to the case where large number of sensors exist but the number of distinct sensor subsets that are necessary for classification is small as in where the depth of the trees is fixed to for this regime we present an efficient subset selection algorithm based on approximation we treat each sensor subset as new sensor construct dag over unions of these subsets and apply our dp algorithm empirically we show that our approach outperforms methods in both small and large scale settings related work there is an extensive literature on adaptive methods for sensor selection for reducing costs it arguably originated with detection cascades see and references therein popular method in reducing computation cost in object detection for cases with highly skewed class imbalance and generic features computationally cheap features are used at first to filter out negative examples and more expensive features are used in later stages our technical approach is closely related to trapeznikov et al and wang et al like us they formulate an erm problem and generalize detection cascades to classifier cascades and trees and handle balanced scenarios trapeznikov et al propose similar training scheme for the case of cascades however restrict their training to cascades and simple decision functions which require alternating optimization to learn alternatively wang et al attempt to jointly solve the decision learning problem by formulating linear surrogate converting the problem into linear program lp conceptually our work is closely related to xu et al and kusner et al who introduce trees of classifiers cstc and approximately submodular trees of classifiers astc respectively to reducing test time costs like our paper they propose global erm problem they solve for the tree structure internal decision rules and leaf classifiers jointly using alternative minimization techniques recently kusner et al propose approximately submodular trees of classifiers astc variation of cstc which provides robust performance with significantly reduced training time and greedy approximation respectively recently nan et al proposed random forests to efficiently learn budgeted systems using greedy approximation over large data sets figure simple example of sensor selection dag for three sensor system at each state represented by binary vector indicating measured sensors policy chooses between either adding new sensor or stopping and classifying note that the state ssc has been repeated for simplicity the subject of this paper is broadly related to other adaptive methods in the literature generative methods pose the problem as pomdp learn conditional probability models and myopically select features based information gain of unknown features methods encode current observations as state unused features as action space and formulate various reward functions to account for classification error and costs he et al apply imitation learning of greedy policy with single classification step as actions et al and karayev et al apply reinforcement learning to solve this mdp benbouzid et al propose classifier cascades with an additional skip action within an mdp framework nan et al consider nearest neighbor approach to feature 
selection with confidence driven by margin magnitude adaptive sensor acquisition by dag in this section we present our adaptive sensor acquisition dag that during sequentially decides which sensors should be acquired for every new example entering the system before formally describing the system and our learning approach we first provide simple illustration for sensor dag shown in fig the state indicating acquired sensors is represented by binary vector with indicating that sensor measurement has not been acquired and representing an acquisition consider new example that enters the system initially it has state of as do all samples during since no sensors have been acquired it is routed to the policy function which makes decision to measure one of the three sensors or to stop and classify let us assume that the function routes the example to the state indicating that the first sensor is acquired at this node the function has to decide whether to acquire the second sensor acquire the third or classifying using only the first if chooses to stop and classify then this example will be classified using only the first sensor such decision process is performed for every new example the system adaptively collects sensors until the policy chooses to stop and classify we assume that when all sensors have been collected the decision function has no choice but to stop and classify as shown for in fig problem formulation data instance consists of sensor measurements xm and belongs to one of classes indicated by its label each sensor measurement xm is not necessarily scalar but may instead be let the pair be distributed according to an unknown joint distribution additionally associated with each sensor measurement xm is an acquisition cost cm to model the acquisition process we define state space sk ssc the states sk represent subsets of sensors and the state ssc represents the action of stopping and classifying with current subset let xs correspond to the space of sensor measurements in subset we assume that the state space includes all possible for example in fig the system contains all subsets of sensors we also introduce the state transition function that defines set of actions that can be taken from the current state transition from the current sensor subset to new subset corresponds to an acquisition of new sensor measurements transition to the state ssc corresponds to stopping and classifying using the available information this terminal state ssc has access to classifier bank used to predict the label of an example since classification has to operate on any sensor subset there is one classifier for every sk fsk such that fs xs we assume the classifier bank is given and practically the classifiers can be either unique for each subset or missing feature sensor classification system as in we overload notation and use node subset of sensors and path leading up to that subset on the dag interchangeably in particular we let denote the collection of subsets of nodes each subset is associated with node on the dag graph we refer to each node as state since it represents the for an instance at that node we define the loss associated with classifying an pair using the sensors in sj as lsj ck using this convention the loss is the sum of the empirical risk associated with classifier fsj and the cost of the sensors in the subset sj the expected loss over the data is defined ld ex lπ our goal is to find policy which adaptively selects subsets for examples such that their average loss is minimized min ld where 
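To make the expected-loss objective above concrete, the following minimal sketch (illustrative names only, not the authors' code) evaluates the path loss of a fixed policy on a single example: the state is the set of sensors acquired so far, the policy at each state either acquires one new sensor or stops and classifies, and the loss is the 0/1 classification error plus the accumulated acquisition cost along the path.

from typing import Callable, Dict, FrozenSet

def path_loss(x, y,
              policy: Dict[FrozenSet[int], Callable],       # state -> decision function
              classifiers: Dict[FrozenSet[int], Callable],  # one classifier per sensor subset
              costs: Dict[int, float],                      # per-sensor acquisition cost
              n_sensors: int) -> float:
    state: FrozenSet[int] = frozenset()   # start with no sensors acquired
    acquired_cost = 0.0
    while True:
        # When every sensor has been acquired the only action left is to classify.
        action = "classify" if len(state) == n_sensors else policy[state](x, state)
        if action == "classify":
            y_hat = classifiers[state](x, state)       # classify with the current subset
            return float(y_hat != y) + acquired_cost   # 0/1 error + total sensor cost
        state = state | {action}                       # acquire one new sensor
        acquired_cost += costs[action]

# The empirical risk of a policy is the average path loss over the training set, e.g.
# risk = sum(path_loss(x, y, policy, classifiers, costs, M) for x, y in data) / len(data)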
is policy selected from family of policies and is the state selected by the policy for example we denote the quantity ld as the value of when is the family of all measurable functions ld is the bayes cost representing the minimum possible cost for any while enumerating all possible combinations is feasible for small for large this problem becomes intractable we will overcome this limitation in section by applying novel sensor selection algorithm for now we remain in the small regime function given the distribution of data in practice the distribution is unknown and instead we are given training examples xn yn drawn from the problem becomes an empirical risk minimization min lπ xi xi yi recall that our sensor acquisition system is represented as dag each node in graph corresponds to state sensor subset in and the state transition function sj defines the outgoing edges from every node sj we refer to the entire edge set in the dag as in such system the policy is parameterized by the set of decision functions πk at every node in the dag each function πj sj maps an example to new state node from the set specified by outgoing edges rather than directly minimizing the empirical risk in first we define cost associated with all edges sj sk sj sk ct if sk ssc otherwise is either the cost of acquiring new sensors or is the classification error induced by classifying with the current subset if sk ssc using this cost we define the empirical loss of the system path for an example πk sj sj path πk where path πk is the path on the dag induced by the policy functions πk for example the empirical minimization equivalent to for our dag system is sample average over all example specific path losses πk argmin πk xi yi πk next we present reduction to learn the functions πk that minimize the loss in learning policies in dag learning the functions πk that minimize the cost in is highly coupled problem learning decision function πj is dependent on the other functions in two ways πj is dependent on functions at nodes downstream nodes for which path exists from πj as these determine the cost of each action taken by πj on an individual example the and πj is dependent on functions at nodes upstream nodes for which path exists to πj as these determine the distribution of examples that πj acts on consider policy πj at node corresponding to state sj such that all outgoing edges from lead to leaves also we assume all examples pass through this node πj we are ignoring the effect of upstream dependence this yields the following important lemma lemma given the assumptions above the problem of minimizing the risk in single policy function πj is equivalent to solving cost sensitive learning csl proof consider the risk in with πj such that all outgoing edges from lead to leaf ignoring the effect of other policy functions is upstream from the risk πjx sj sk min πj sk sj xi yi πj minimizing the risk over training examples yields the optimization problem on the right hand side this is equivalent to csl problem over the space of labels sj with costs given by the transition costs sj sk in order to learn the policy functions πk we propose algorithm which iteratively learns policy functions using lemma we solve the csl problem by using scheme for learn which constructs tree of binary classifiers each binary classifier can be trained using regularized risk minimization for concreteness we define the learn algorithm as learn xn iltert ree xn where the binary classifiers in the filter tree are trained using an appropriately regularized 
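The reduction just described can be summarized by the following hedged sketch of the leaf-to-root training loop, assuming a generic cost-sensitive learner (for example a filter tree) is available as a black box; node names, the data layout, and the csl_learn interface are hypothetical simplifications of the algorithm described in the text.

import numpy as np

def train_dag_policies(nodes, children, edge_cost, X, csl_learn):
    """
    nodes      : node ids in reverse topological order (leaves first, root last)
    children   : dict node -> list of child nodes (empty for terminal nodes)
    edge_cost  : function (node, child, i) -> cost of sending example i along that edge
                 (sensor acquisition cost, or classification error if the child is the
                  stop-and-classify state)
    X          : (n, d) array of training examples
    csl_learn  : cost-sensitive learner, csl_learn(X, C) -> f with f(x) in {0..K-1},
                 where C is an (n, K) matrix of per-example action costs
    """
    n = X.shape[0]
    cost_to_go = {}          # node -> length-n vector of realized downstream costs
    policies = {}            # node -> learned decision function
    for v in nodes:
        kids = children[v]
        if not kids:                                   # terminal node: nothing to decide
            cost_to_go[v] = np.zeros(n)
            continue
        # Per-example cost of each action = edge cost + the child's cost-to-go,
        # which has already been computed because children are processed first.
        C = np.column_stack([
            np.array([edge_cost(v, c, i) for i in range(n)]) + cost_to_go[c]
            for c in kids
        ])
        policies[v] = csl_learn(X, C)                  # e.g. a filter tree of binary classifiers
        chosen = np.array([policies[v](X[i]) for i in range(n)])
        cost_to_go[v] = C[np.arange(n), chosen]        # realized cost under the learned rule
    return policies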
calibrated convex loss function note that multiple schemes exist that map the csl problem to binary classification we consider the csl problem formulated by beygelzimer et al where an instance of the problem is defined by distribution over inf space of features and associated costs for predicting each of the labels for each realization of features the goal is to learn function which maps each element of to label the expected cost is minimized single iteration of algorithm proalgorithm graph reduce algorithm ceeds as follows node is choinput data xi yi sen whose outgoing edges connect only dag nodes edges costs xi yi to leaf nodes the costs associated csl alg learn xn with each connected leaf node are found while graph is not empty do the policy πj is trained on the entire choose node all children of are set of training data according to these leaf nodes costs by solving csl problem for example do the costs associated with taking the ac construct the weight vector of edge costs tion πj are computed for each example per action and the costs of moving to state are end for updated outgoing edges from node πj learn xn are removed making it leaf node evaluate πj and update edge costs to node and disconnected nodes that were xi yi sn sj ij πj xi xi yi sn sj previously connected to node are re remove all outgoing edges from node in moved the algorithm iterates through remove all disconnected nodes from these steps until all edges have been reend while moved we denote the policy functions output policy functions πk trained on the empirical data using alg as πk analysis our goal is to show that the expected risk of the policy functions πk learned by alg converge to the bayes risk we first state our main result theorem alg is universally consistent that is lim ld πk ld are the policy functions learned using alg which in turn uses learn dewhere πk scribed by eq alg emulates dynamic program applied in an empirical setting policy functions are decoupled and trained from leaf to root conditioned on the output of descendant nodes to adapt to the empirical setting we optimize at each stage over all examples in the training set the key insight is the fact that universally consistent learners output optimal decisions over subsets of the space of data that is they are locally optimal to illustrate this point consider standard classification problem let be the support or region of examples induced by upstream deterministic decisions and bayes optimal classifiers the full space and subset respectively are equal on the reduced support arg min arg min from this insight we decouple learning problems while still training system that converges to the bayes risk this can be achieved by training universally consistent csl algorithms such as filter trees that reduce the problem to binary classification by learning consistent binary classifiers the risk of the function can be shown to converge to the bayes risk proof of theorem is included in the supplementary material computational efficiency alg reduces the problem to solving series of km binary classification problems where is the number of nodes in the dag and is the number of sensors finding each binary classifier is computationally efficient reducing to convex problem with variables in contrast nearly all previous approaches require solving problem and resort to alternating optimization or greedy approximation alternatively convex surrogates proposed for the global problem require solving large convex programs with variables even for simple linear decision 
functions furthermore existing algorithms can not be applied to train these systems often leading to less efficient implementation generalization to other budgeted learning problems although we presented our algorithm in the context of supervised classification and uniform linear sensor acquisition cost structure the above framework holds for wide range of problems in particular any learning problem can be solved using the proposed dag approach by generalizing the cost function sj sk sj sk if sk ssc sj otherwise where sj sk is the cost of acquiring sensors in sk for example given the current state sj and sj is some loss associated with applying sensor subset sj to example this framework allows for significantly more complex budgeted learning problems to be handled for example the sensor acquisition cost sj sk can be object dependent and such as increasing acquisition costs as time increases which can arise in image retrieval problems where users are less likely to wait as time increases the cost sj can include alternative costs such as error in regression precision error in ranking or model error in structured learning as in the supervised learning case the learning functions and example labels do not need to be explicitly known instead the system requires only empirical performance to be provided allowing complex decision systems such as humans to be characterized or systems learned where the classifiers and labels are sensitive information adaptive sensor acquisition in so far we considered the case where the dag system allows for any subset of sensors to be acquired however this is often computationally intractable as the number of nodes in the graph grows exponentially with the number of sensors in practice these complete systems are only feasible for data generated from small set of sensors or less learning sensor subsets although constructing an exhaustive dag for data with large number of sensors is computationally intractable in many cases this is unnecessary motivated by previous methods we assume that the number of active nodes in the exhaustive graph is small that is these nodes are either not visited by any examples or all examples that visit the node acquire the same next sensor equivalently this can be viewed as the system needing only small number of sensor subsets to classify all examples with low acquisition cost figure an example of dag system using the sensor subsets shown on the bottom left the new states are the union of these sensor subsets with the system otherwise constructed in the same fashion as the small scale system rather than attempt to build the entire combinatorially sized graph we instead use this assumption to first find these active subsets of sensors and construct dag to choose between unions of these subsets the step of finding these sensor subsets can be viewed as form of feature clustering with goal of grouping features that are jointly useful for classification by doing so the size of the dag is reduced from exponential in the number of sensors to exponential in much smaller user chosen parameter number of subsets in experimental results we limit which allows for diverse subsets of sensors to be found while preserving computational tractability and efficiency our goal is to learn sensor subsets with high classification performance and low acquisition cost empirically low cost as defined in ideally our goal is to jointly learn the subsets which minimize the empirical risk of the entire system as defined in however this presents computationally 
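As a small illustration of this generalized cost structure (the names below are illustrative assumptions, not the paper's notation), the edge cost can be packaged as a single function that charges an example-dependent acquisition cost between sensor subsets and an arbitrary task loss on the edge into the stop-and-classify state.

def make_edge_cost(acquisition_cost, task_loss, stop_state="SC"):
    def edge_cost(x, s_j, s_k):
        if s_k == stop_state:
            return task_loss(x, s_j)          # e.g. 0/1 error, a regression loss, a ranking loss
        return acquisition_cost(x, s_j, s_k)  # may depend on the example (e.g. elapsed time)
    return edge_cost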
intractable problem due to the exponential search space rather than attempt to solve this difficult problem directly we minimize classification error over collection of sensor subsets σt subject to cost constraint on the total number of sensors used we decouple the problem from the policy learning problem by assuming that each example is classified by the best possible subset for constant sensor cost the problem can be expressed as set constraint problem min σt min xi such that where is the total sensor budget over all sensor subsets and is the cost of single sensor although minimizing this loss is still computationally intractable consider instead the equivalent problem of maximizing the reward the event of correct classification of the subsets defined as max xi max σt ct such that this problem is related to the knapsack problem with objective maximizing the reward in is still computationally intractable problem however the reward function is structured to allow for efficient approximation lemma the objective of the maximization in is with respect to the set of subsets such that adding any new set to the reward yields diminishing returns theorem given that the empirical risk of each classifier fσk is submodular and monotonically decreasing the elements in σk and uniform sensor costs the strategy in alg is an approximation of the optimal reward in proof of these statements is included in the supplementary material and centers on showing that the objective is and therefore applying greedy strategy yields approximation of the optimal strategy algorithm sensor subset selection input number of subsets cost constraint bδ output feature subsets σt initialize σt σi σt pt while cδ do σi σi σi σt end while constructing dag using sensor subsets alg requires computation of the reward for only bδ tm sensor subsets where is the number of sensors to return approximation to the problem given the set of sensor subsets σt we can now construct dag using all possible unions of these subsets where each sensor subset σj is treated as new single sensor and apply the small scale system presented in sec the result is an efficiently learned system with relatively low complexity yet strong additionally this result can be extended to the case of costs where simple extension of the greedy algorithm yields approximation simple case where three subsets are used is shown in fig the three learned subsets of sensors are shown on the bottom left of fig and these three subsets are then used to construct the entire dag in the same fashion as in fig at each stage the state is represented by the union of sensor subsets acquired grouping the sensors in this fashion reduces the size of the graph to nodes as opposed to nodes required if any subset of the sensors can be selected this approach allows us to map adaptive sensor selection problems to small scale dag in sec experimental results to demonstrate the performance of our dag sensor acquisition system we provide experimental results on data sets previously used in budgeted learning three data sets previously used for budget cascades are tested in these data sets examples are composed of small number of sensors under sensors to compare performance we apply the lp approach to learning sensor trees and construct trees containing all subsets of sensors as opposed to fixed order cascades next we examine performance of the dag system using higher dimensional sets of data previously used to compare budgeted learning performance in these cases the dimensionality of the data between 
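The greedy construction can be sketched as follows, under the assumption stated above that the reward (the number of training examples classified correctly by at least one of the current subsets) is monotone and submodular and that sensors have uniform unit cost; the reward oracle and the tie-breaking below are placeholders, not the authors' implementation.

def greedy_subsets(n_sensors, n_subsets, budget, reward):
    """
    reward : function(list_of_subsets) -> float, assumed monotone submodular,
             e.g. the number of training examples correctly classified by the
             best of the per-subset classifiers.
    """
    subsets = [set() for _ in range(n_subsets)]
    spent = 0
    while spent < budget:
        best_gain, best_move = 0.0, None
        base = reward(subsets)
        for k in range(n_subsets):
            for m in range(n_sensors):
                if m in subsets[k]:
                    continue
                candidate = [s | {m} if j == k else set(s) for j, s in enumerate(subsets)]
                gain = reward(candidate) - base
                if gain > best_gain:
                    best_gain, best_move = gain, (k, m)
        if best_move is None:        # no strictly improving addition remains
            break
        k, m = best_move
        subsets[k].add(m)
        spent += 1                   # uniform unit cost per sensor, as in the analysis
    return subsets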
and features makes exhaustive subset construction computationally infeasible we greedily construct sensor subsets using alg then learn dag over all unions of these sensor subsets we compare performance with cstc and astc for all experiments we use cost sensitive filter trees where each binary classifier in the tree is learned using logistic regression homogeneous polynomials are used as decision functions in the filter trees for all experiments uniform sensor cost were were varied in the range achieve systems with different budgets performance between the systems is compared by plotting the average number of features acquired during the average test error small sensor set experiments lp tree dag lp tree dag lp tree dag average test error average test error average test error average features used letter average features used average features used pima satimage figure average number of sensors acquired average test error comparison between lp tree systems and dag systems we compare performance of our trained dag with that of complete tree trained using an lp surrogate on the landsat pima and letter datasets to construct each sensor dag we include all subsets of sensors including the empty set and connect any two nodes differing by single sensor with the edge directed from the smaller sensor subset to the larger sensor subset by including the empty set no initial sensor needs to be selected homogeneous polynomials are used for both the classification and system functions in the lp and dag as seen in fig the systems learned with dag outperform the lp tree systems additionally the performance of both of the systems is significantly better than previously reported performance on these data sets for budget cascades this arises due to both the higher complexity of the classifiers and decision functions as well as the flexibility of sensor acquisition order in the dag and lp tree compared to cascade structures for this setting it appears that the dag approach is superior approach to lp trees for learning budgeted systems large sensor set experiments astc cstc dag astc cstc dag astc cstc dag miniboone forest cifar figure comparison between cstc astc and dag of the average number of acquired features test error next we compare performance of our trained dag with that of cstc and astc for the miniboone forest and cifar datasets we use the validation data to find the homogeneous polynomial that gives the best classification performance using all features miniboone linear forest order cifar order these polynomial functions are then used for all classification and policy functions for each data set alg was used to find subsets with an subset of all features added an exhaustive dag was trained over all unions of these subsets fig shows performance comparing the average cost average error of cstc astc and our dag system the systems learned with dag outperform both cstc and astc on the miniboone and forest data sets with comparable performance on cifar at low budgets and superior performance at higher budgets acknowledgments this material is based upon work supported in part by the national science foundation grant by the department of homeland security science and technology directorate office of university programs under grant award by onr grant and us af contract the views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the social policies either expressed or implied of the dhs onr or af references bartlett jordan and 
mcauliffe convexity classification and risk bounds journal of american statistical association beygelzimer langford and ravikumar multiclass classification with filter trees benbouzid and kégl fast classification using sparse decision dags in proceedings of the international conference on machine learning chen xu weinberger chapelle and kedem classifier cascade tradeoff between accuracy and feature evaluation cost in international conference on artificial intelligence and statistics denoyer preux and gallinari classification sequential approach to sparsity in machine learning and knowledge discovery in databases pages gao and koller active classification based on value of classifier in advances in neural information processing systems volume pages he daume iii and eisner imitation learning by coaching in advances in neural information processing systems pages ji and carin feature acquisition and classification pattern recognition kanani and melville active acquisition for customer targeting in advances in neural information processing systems karayev fritz and darrell dynamic feature selection for classification on budget in international conference on machine learning workshop on prediction with sequential models kusner chen zhou xu weinberger and chen sensitive learning with submodular trees of classifiers in aaai conference on artificial intelligence leskovec krause guestrin faloutsos vanbriesen and glance outbreak detection in networks in international conference on knowledge discovery and data mining maaten chen tyree and weinberger learning with marginalized corrupted features in proceedings of the international conference on machine learning nan wang and saligrama random forest in proceedings of the international conference on machine learning nan wang trapeznikov and saligrama fast classification in international conference on acoustics speech and signal processing nemhauser wolsey and fisher an analysis of approximations for maximizing submodular set mathematical programming sheng and ling feature value acquisition in testing sequential batch test algorithm in proceedings of the international conference on machine learning pages steinwart consistency of support vector machines and other regularized kernel classifiers information theory ieee transactions on trapeznikov and saligrama supervised sequential classification under budget constraints in international conference on artificial intelligence and statistics pages wang bolukbasi trapeznikov and saligrama model selection by linear programming in european conference on computer vision pages wang and saligrama local supervised learning through space partitioning in advances in neural information processing systems pages wang and saligrama learning machines in asian conference on machine learning pages wang trapeznikov and saligrama an lp for sequential learning under budgets in international conference on artificial intelligence and statistics pages xu chapelle and weinberger the greedy miser learning under budgets in proceedings of the international conference on machine learning xu kusner chen and weinberger tree of classifiers in proceedings of the international conference on machine learning pages zhang and zhang survey of recent advances in face detection technical report microsoft research 
hybrid sampler for mixture models gatsby unit university college london mlomeli stefano favaro department of economics and statistics university of torino and collegio carlo alberto yee whye teh department of statistics university of oxford abstract this paper concerns the introduction of new markov chain monte carlo scheme for posterior sampling in bayesian nonparametric mixture models with priors that belong to the general class we present novel compact way of representing the infinite dimensional component of the model such that while explicitly representing this infinite component it has less memory and storage requirements than previous mcmc schemes we describe comparative simulation results demonstrating the efficacy of the proposed mcmc algorithm against existing marginal and conditional mcmc samplers introduction according to ghahramani models that have nonparametric component give us more flexiblity that could lead to better predictive performance this is because their capacity to learn does not saturate hence their predictions should continue to improve as we get more and more data furthermore we are able to fully consider our uncertainty about predictions thanks to the bayesian paradigm however major impediment to the widespread use of bayesian nonparametric models is the problem of inference over the years many mcmc methods have been proposed to perform inference which usually rely on tailored representation of the underlying process this is an active research area since dealing with this infinite dimensional component forbids the direct use of standard methods for posterior inference these methods usually require representation there are two main sampling approaches to facilitate simulation in the case of bayesian nonparametric models random truncation and marginalization these two schemes are known in the literature as conditional and marginal samplers in conditional samplers the prior is replaced by representation chosen according to truncation level in marginal samplers the need to represent the component can be bypassed by marginalising it out marginal samplers have less storage requirements than conditional samplers but could potentially have worst mixing properties however not integrating out the infinite dimensional compnent leads to more comprehensive representation of the random probability measure useful to compute expectations of interest with respect to the posterior in this paper we propose novel mcmc sampler for mixture models very large class of bayesian nonparametric mixture models that encompass all previously explored ones in the literature our approach is based on hybrid scheme that combines the main strengths of both conditional and marginal samplers in the flavour of probabilistic programming we view our contribution as step towards wider usage of flexible bayesian nonparametric models as it allows automated inference in probabilistic programs built out of wide variety of bayesian nonparametric building blocks processes random probability measures rpms have been introduced in pitman as generalization of homogeneous normalized random measures nrms let be complete and separable metric space endowed with the borel bpxq let crmpρ be homogeneous completely random measure crm with measure and base distribution see kingman for good overview about crms and references therein then the corresponding total mass of is µpxq and let it be finite positive almost surely and absolutely continuous with respect to lebesgue measure for any let us consider the conditional 
distribution of given that the total mass dt this distribution is denoted by pkpρ δt it is the distribution of rpm where δt denotes the usual dirac delta function rpms form class of rpms whose distributions are obtained by mixing pkpρ δt over with respect to some distribution on the positive real line specifically rpm has following the hierarchical representation pkpρ δt the rpm is referred to as the rpm with measure base distribution and mixing distribution throughout the paper we denote by pkpρ the distribution of and without loss of generality we will assume that ptqdt where fρ is the density of the total mass under the crm and is function note that when γpdtq fρ ptqdt then the distribution pkpρ fρ coincides with nrmpρ the resulting pk δφk is almost surely discrete and since is homogeneous the atoms pφk of are independent of their masses ppk and form sequence of independent random variables identically distributed according to finally the masses of have distribution governed by the measure and the distribution one nice property is that is almost surely discrete if we obtain sample tyi ui from it there is positive probability of yi yj for each pair of indexes hence it induces random partition on where and are in the same block in if and only if yi yj kingman showed that is exchangeable this property will be one of the main tools for the derivation of our hybrid sampler sampling processes second object induced by rpm is permutation of its atoms specifically order the blocks in by increasing order of the least element in each block and for each let zk be the least element of the kth block zk is the index among pyi of the first appearance of the kth unique value in the sequence let µptyzk uq be the mass of the corresponding atom in then is permutation of the masses of atoms in with larger masses tending to appear earlier in the sequence it is easy to see that and that the sequence can be understood as construction starting with stick of length break off the first piece of length the surplus length of stick is then the second piece with length is broken off etc theorem of perman et al states that the sequence of surplus masses ptk forms markov chain and gives the corresponding initial distribution and transition kernels the corresponding generative process for the sequence pyi is as follows start with drawing the total mass from its distribution pρ pt ptqdt ii the first draw from is pick from the masses of the actual value of is simply while the mass of the corresponding atom in is with conditional distribution fρ pt pρ dtq fρ ptq with surplus mass iii for subsequent draws let be the current number of distinct values among and the unique values atoms in the masses of these first atoms are denoted by řk and the surplus mass is tk for each with probability we set yi with probability tk yi takes on the value of an atom in besides the first atoms the actual value yk is drawn from while its mass is drawn from sk fρ ptk sk pρ dsk dtk ρpdsk tk fρ ptk tk tk by multiplying the above infinitesimal probabilities one obtains the joint distribution of the random elements and pρ pπn pck qkprks dsk for rks dtq fρ pt řk sk qhptqdt sk ρpdsk where pck qkprks denotes particular partition of rns with blocks ck ordered by increasing least element and is the cardinality of block ck the distribution is invariant to the order such joint distribution was first obtained in pitman see also pitman for further details relationship to the usual construction in the generative process above we mentioned that it is 
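A schematic rendering of this generative process is given below; the total-mass and stick conditionals depend on the Lévy density and the tilting function and are therefore passed in as user-supplied samplers, so the sketch captures the size-biased structure of the process (sit at an existing atom with probability proportional to its mass, otherwise break a new atom off the surplus) rather than any concrete prior.

import random

def generate(n, sample_total_mass, sample_new_stick, sample_base):
    """
    sample_total_mass : () -> T, a draw from the (tilted) total-mass density
    sample_new_stick  : (surplus) -> mass of a newly revealed atom, drawn from the
                        appropriate conditional given the current surplus mass
    sample_base       : () -> a fresh atom location phi ~ H
    """
    T = sample_total_mass()
    sticks, atoms = [], []           # masses and locations of the revealed atoms
    surplus = T
    labels = []
    for _ in range(n):
        u = random.uniform(0.0, T)
        acc = 0.0
        chosen = None
        for k, s in enumerate(sticks):       # existing atom k with probability sticks[k] / T
            acc += s
            if u < acc:
                chosen = k
                break
        if chosen is None:                   # with probability surplus / T: reveal a new atom
            new_mass = sample_new_stick(surplus)
            sticks.append(new_mass)
            atoms.append(sample_base())
            surplus -= new_mass
            chosen = len(sticks) - 1
        labels.append(chosen)
    return labels, atoms, sticks, surplus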
reminiscent of the well known stick breaking construction from ishwaran james where you break stick of length one but it is not the same however we can effectively reparameterize the model starting with equation due to two useful identities in distribution pj and vj for ăj ăj indeed using this reparameterization we obtain the corresponding joint in terms of weights tvj uj which correspond to representation note that this joint distribution is for general measure density fρ and it is conditioned on the valued of the random variable we can recover the well known stick breaking representations for the dirichlet and processes for specific choice of and if we integrate out see the supplementary material for further details about the latter however in general these random variables form sequence of dependent random variables with complicated distribution except for the two previously mentioned processes see pitman for details mixture model we are mainly interested in using rpms as building block for an infinite mixture model indeed we can use equation as the top level of the following hierarchical specification pkpρσ δt iid yi ind xi yi yi figure varying table size chinese restaurant representation for observations txi ui where is the likelihood term for each mixture component and our dataset consists of observations pxi qiprns of the corresponding variables pxi qiprns we will assume that is smooth after specifying the model we would like to carry out inference for clustering density estimation tasks we can do it exactly and more efficiently than with known mcmc samplers with our novel approach in the next section we present our main contribution and in the following one we show how it outperforms other samplers hybrid sampler equation joint distribution is written in terms of the first weights in order to obtain complete representation of the rpm we need to sample from it countably infinite number of times succesively devise some way of representing this object exactly in computer with finite memory and storage is needed we introduce the following novel strategy starting from equation we exploit the generative process of section when reassigning observations to clusters in addition to this we reparameřk terize the model in terms of surplus mass random variable and end up with the following joint distribution pρ pπn pck qkprks dsk for rks dv xi dxi for rnsq pv sk sk fρ pvq sk ρpdsk pdxi ipck for this reason while having complete representation of the infinite dimensional part of the model we only need to explicitly represent those weights associated to occupied clusters plus surplus mass term which is associated to the rest of the empty clusters as figure shows the cluster reassignment step can be seen as lazy sampling scheme we explicitly represent and update the weights associated to occupied clusters and create weight only when new cluster appears to make this possible we use the induced partition and we call equation the varying table size chinese restaurant representation because the weights can be thought as the sizes of the tables in our restaurant in the next subsection we compute the complete conditionals of each random variable of interest to implement an overall gibbs sampling mcmc scheme complete conditionals starting from equation we obtain the following complete conditionals for the gibbs sampler pv dv sk fρ pvqh sk dv dsi rest si sk si sk si ρpdsi surpmassi psi qdsi where surpmassi řk jăi sc pdxi txj ujpc if is assigned to existing cluster ppci if is assigned to new cluster pdxi 
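The overall structure of one sweep of the hybrid sampler can be sketched as follows; the individual conditional updates are abstracted as callables, since their densities depend on the chosen Lévy density, tilting function, and likelihood, and housekeeping such as pruning clusters that become empty is omitted for brevity.

def gibbs_sweep(state, slice_sample_surplus, slice_sample_weight,
                reassign_observation, sample_new_weight, update_cluster_params):
    # state: dict with keys
    #   "v"      surplus mass associated with the (infinitely many) empty clusters
    #   "s"      list of weights, one per occupied cluster
    #   "z"      cluster assignment of each observation
    #   "theta"  parameter of each occupied cluster
    state["v"] = slice_sample_surplus(state)                 # p(v | rest)
    state["s"] = [slice_sample_weight(state, k)              # p(s_k | rest), occupied only
                  for k in range(len(state["s"]))]
    for i in range(len(state["z"])):
        k = reassign_observation(state, i)                   # existing or brand-new cluster
        if k == len(state["s"]):                             # a new cluster was created:
            state["s"].append(sample_new_weight(state))      # lazily draw its weight ...
            state["theta"].append(None)                      # ... and give it a parameter slot
        state["z"][i] = k
    state["theta"] = update_cluster_params(state)            # e.g. a Reuse-algorithm step
    return state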
yc according to the rule above the ith observation will be either reassigned to an existing cluster or to one of the new clusters in the reuse algorithm as in favaro teh if it is assigned to new cluster then we need to sample new weight from the following dsk rest pv sk qρpsk qsk vq psk qdsk every time new cluster is created we need to obtain its corresponding weight which could happen times per iteration hence it has significant contribution to the overall computational cost for this reason an independent and identically distributed draw from its corresponding complete conditional is highly desirable in the next subsection we present way to achieve this finally for updating cluster parameters ukprks in the case where is to the likelihood we use an extension of favaro teh reuse algorithm see algorithm in the supplementary material for details the complete conditionals in equation do not have standard form but generic mcmc method can be applied to sample from each within the gibbs sampler we use slice sampling from neal to update the weights and the surplus mass however there is class of priors where the total mass density is intractable so an additional step needs to be introduced to sample the surplus mass in the next subsection we present two alternative ways to overcome this issue example of classes of priors processes for any let fσ ptq γpσj be the density function of positive random variable and sinpπσjq tσj dx this class of rpms is denoted by pkpρσ ht where ρpdxq ρσ pdxq is function that indexes each member of the class for example in the experimental section we picked choices of the function that index the following processes normalized stable and normalized generalized gamma processes this class includes all gibbs type priors with parameter so other choices of are possible see gnedin pitman and de blasi et al for noteworthy account of this class of bayesian nonparametric priors in this case the total mass density is intractable and we propose two ways of dealing with this firstly we used kanter integral representation for the density as in lomeli et al introduce an auxiliary variable and slice sample each variable pv dv exp apzq si dv si pz dz exp apzq dz see algorithm in the supplementary material for details alternatively we can completely bypass the evaluation of the total mass density by updating the surplus mass with step with an independent proposal from stable or from an exponentially tilted stable it is straight forward to obtain draws from these proposals see devroye and hofert for an improved rejection sampling method for the exponentially tilted case this leads to the following acceptance ratio řk řk dv exp pv dv restq fσ pvq exp řk řk pv dv restq fσ pv exp si si dv exp see algorithm in the supplementary material for details finally to sample new weight dsk rest pv sk vq psk qdsk fortunately we can get an draw from the above due to an identity in distribution given by favaro et al for the usual stick breaking weights for any prior in this class such that uv where are coprime integers then we just reparameterize it back to obtain the new weight see algorithm in the supplementary material for details γpa bq γpaqγpbq exp processes let fρ ptq be the density of positive random variable log where betapa bq and ρpxq this class of rpms generalises the gamma process but has similar properties indeed if we take and the density function for is γptq fρ ptq we recover the measure and total mass density function of gamma process finally to sample new weight expps qq vqq dsk rest dsk 
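The Metropolis-Hastings alternative for the surplus mass amounts to a standard independence sampler; the following generic sketch assumes the user supplies the unnormalized log target and an independent proposal (for instance a stable or exponentially tilted stable draw) together with its log density, and is not specific to any particular prior in the class.

import math, random

def mh_surplus_update(v_current, log_target, propose, log_proposal_density):
    v_prop = propose()
    log_alpha = (log_target(v_prop) + log_proposal_density(v_current)
                 - log_target(v_current) - log_proposal_density(v_prop))
    if math.log(random.random()) < min(0.0, log_alpha):
        return v_prop, True        # accepted
    return v_current, False        # rejected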
vq psk if this complete conditional is monotone decreasing unnormalised density with maximum at we can easily get an draw with simple rejection sampler where the rejection constant is bv and the proposal is vq there is no other known sampler for this process relationship to marginal and conditional mcmc samplers starting from equation another strategy would be to reparameterize the model in terms of the usual stick breaking weights next we could choose random truncation level and represent finitely many sticks as in favaro walker alternatively we could integrate out the random probability measure and sample only the partition induced by it as in lomeli et al conditional samplers have large memory requirements as often the number of sticks needed can be very large furthermore the conditional distributions of the stick lengths are quite involved so they tend to have slow running times marginal samplers have less storage requirements than conditional samplers but could potentially have worst mixing properties for example lomeli et al had to introduce number of auxiliary variables which worsen the mixing our novel hybrid sampler exploits marginal and conditional samplers advantages it has less memory requirements since it just represents the weights of occupied as opposed to conditional samplers which represent both empty and occupied clusters also it does not integrate out the weights thus we obtain more comprehensive representation of the rpm performance assesssment we illustrate the performance of our hybrid sampler on range of bayesian nonparametric mixture models obtained by different specifications of and as in equation at the top level of this hierarchical specification different bayesian nonparametric priors were chosen from both classes presented in the examples section we chose the base distribution and the likelihood term for the kth cluster to be śnk xi µk pdµk dµk and dxnk µk are the nk observations assigned to the kth cluster at some iteration denotes where txj uj normal distribution with mean µk and variance common parameter among all clusters the mean prior distribution is normal centered at and with variance although the base distribution is conjugate to the likelihood we treated it as case and sampled the parameters at each iteration rather than integrating them out we used the dataset from roeder to test the algorithmic performance in terms of running time and effective sample size ess as table shows the dataset consists of measurements of velocities in of galaxies from survey of the corona borealis region for the class we compared it against our implementation of favaro walker conditional sampler and against the marginal sampler of lomeli et al we chose to compare our hybrid sampler against these existing approaches which follow the same general purpose paradigm algorithm process hybrid conditional marginal hybrid conditional marginal normalized stable process hybrid conditional marginal hybrid conditional marginal normalized generalized gamma process hybrid conditional marginal hybrid conditional marginal hybrid conditional marginal running time ess na na na na na na na na na na table running times in seconds and ess averaged over chains iterations burn in table shows that different choices of result in differences in the algorithm running times and ess the reason for this is that in the case there are readily available random number generators which do not increase the computational cost in contrast in the case rejection sampler method is needed every time new weight is 
sampled which increases the computational cost see favaro et al for details even so in most cases we outperform both marginal and conditional mcmc schemes in terms of running times and in all cases in terms of ess in the case even thought the ess and running times are competitive we found that the acceptance rate is not optimal we are currently exploring other choices of proposals finally in example our approach is the only one available and it has good running times and ess this qualitative comparison confirms our previous statements about our novel approach discussion our main contribution is our hybrid mcmc sampler as general purpose tool for inference with very large class of infinite mixture models we argue in favour of an approach in which generic algorithm can be applied to very large class of models so that the modeller has lot of flexibility in choosing specific models suitable for problem of interest our method is hybrid approach since it combines the perks of the conditional and marginal schemes indeed our experiments confirm that our hybrid sampler is more efficient since it outperforms both marginal and conditional samplers in running times in most cases and in ess in all cases we introduced new compact way of representing the infinite dimensional component such that it is feasible to perform inference and how to deal with the corresponding intractabilities however there are still various challenges that remain when dealing with these type of models for instance there are some values for which we are unable to perform inference with our novel sampler secondly when step is used there could be other ways to improve the mixing in terms of better proposals finally all bnp mcmc methods can be affected by the dimensionality and size of the dataset when dealing with an infinite mixture model indeed all methods rely on the same way of dealing with the likelihood term when adding new cluster all methods sample its corresponding parameter from the prior distribution in high dimensional scenario it could be very difficult to sample parameter values close to the existing data points we consider these points to be an interesting avenue of future research acknowledgments we thank konstantina palla for her insightful comments is funded by the gatsby charitable foundation stefano favaro is supported by the european research council through stg and yee whye teh is supported by the european research council under the european unions seventh framework programme erc grant agreement no references de blasi favaro lijoi mena ruggiero are priors the most natural generalization of the dirichlet process pages of ieee transactions on pattern analysis machine intelligence vol devroye random variate generation devroye random variate generation for exponentially and polynomially tilted stable distributions acm transactions on modelling and computer simulation escobar estimating normal means with dirichlet process prior journal of the american statistical association escobar west bayesian density estimation and inference using mixtures journal of the american statistical association favaro teh mcmc for normalized random measure mixture models statistical science favaro walker slice sampling mixture models journal of computational and graphical statistics favaro lomeli nipoti teh on the representation of models electronic journal of statistics ghahramani probabilistic machine learning and artificial inteligence nature gnedin pitman exchangeable gibbs partitions and stirling triangles journal of mathematical 
sciences hofert efficiently sampling nested archimedean copulas comput statist data ishwaran james gibbs sampling methods for priors journal of the american statistical association james poisson process partition calculus with applications to exchangeable models and bayesian nonparametrics arxiv kanter stable densities under change of scale and total variation inequalities annals of probability kingman completely random measures pacific journal of mathematics kingman the representation of partition structures journal of the london mathematical society lomeli favaro teh marginal sampler for mixture models journal of computational and graphical statistics to appear neal markov chain sampling methods for dirichlet process mixture models tech rept department of statistics university of toronto neal slice sampling annals of statistics papaspiliopoulos roberts retrospective markov chain monte carlo methods for dirichlet process hierarchical models biometrika perman pitman yor sampling of poisson point processes and excursions probability theory and related fields pitman random discrete distributions invariant under permutation advances in applied probability pitman partitions pages of goldstein ed statistics and science festschrift for terry speed institute of mathematical statistics pitman combinatorial stochastic processes lecture notes in mathematics berlin regazzini lijoi distributional results for means of normalized random measures with independent increments annals of statistics roeder density estimation with confidence sets exemplified by and voids in the galaxies journal of the american statistical association von renesse yor zambotti properties of class of subordinators stochastic processes and their applications walker stephen sampling the dirichlet mixture model with slices communications in statistics simulation and computation 
an active learning framework using codes for sparse polynomials and graph sketching kannan uc berkeley kannanr xiao li uc berkeley xiaoli abstract let be an polynomial consisting of monomials in which only coefficients are the goal is to learn the polynomial by querying the values of we introduce an active learning framework that is associated with low query cost and computational runtime the significant savings are enabled by leveraging sampling strategies based on modern coding theory specifically the design and analysis of codes such as ldpc codes which represent the of modern packet communications more significantly we show how this design perspective leads to exciting and to the best of our knowledge largely unexplored intellectual connections between learning and coding the key is to relax the assumption with an setting where the polynomial is assumed to be drawn uniformly at random from the ensemble of all polynomials of given size and sparsity our framework succeeds with high probability with respect to the polynomial ensemble with sparsity up to for any where is exactly learned using ns queries in time ns log even if the queries are perturbed by gaussian noise we further apply the proposed framework to graph sketching which is the problem of inferring sparse graphs by querying graph cuts by writing the cut function as polynomial and exploiting the graph structure we propose sketching algorithm to learn the an arbitrary unknown graph using only few cut queries which scales almost linearly in the number of edges and in the graph size experiments on real datasets show significant reductions in the runtime and query complexity compared with competitive schemes introduction one of the central problems in computational learning theory is the efficient learning of polynomials the task of learning an polynomial has been studied extensively in the literature often in the context of fourier analysis for functions function defined on set of binary variables many concept classes such as circuits decision trees and disjunctive normative form dnf formulas have been proven very difficult to learn in the with random examples almost all existing efficient algorithms are based on the membership query model which provides arbitrary access to the value of given any this makes richer set of concept classes learnable in polynomial time poly this is form of what is now popularly referred to as active learning which makes queries using different sampling strategies for instance use regular subsampling and use random sampling based on compressed sensing however they remain difficult to scale computationally especially for large and this work was supported by grant nsf ccf eager in this paper we are interested in learning polynomials with for some although this regime is not typically considered in the literature we show that by relaxing the mindset to an setting explained later we can handle this more challenging regime and reduce both the number of queries and the runtime complexity even if the queries are corrupted by gaussian noise in the spirit of active learning we design sampling strategy that makes queries to based on modern coding theory and signal processing the queries are formed by strategically subsampling the input to induce aliasing patterns in the dual domain based on codes then our framework exploits the aliasing pattern code structure to reconstruct by peeling the sparse coefficients with an iterative simple peeling decoder through lens our algorithm achieves low query complexity codes 
and low computational complexity peeling decoding further we apply our proposed framework to graph sketching which is the problem of inferring hidden sparse graphs with nodes by actively querying graph cuts see fig motivated by bioinformatics applications learning hidden graphs from additive or queries edge counts within set or across two sets has gained considerable interest this problem closely pertains to our learning framework because the cut function of any graph can be written as sparse polynomial with respect to the binary variables indicating graph partition for the cut given query access to the cut value for an arbitrary partition of the graph how many cut queries are needed to infer the hidden graph structure what is the runtime for such inference unknown graph cut query inferred graph figure given set of nodes infer the graph structure by querying graph cuts most existing algorithms that achieve the optimal query cost for graph sketching see are nonconstructive except for few algorithms that run in polynomial time in the graph size inspired by our active learning framework we derive sketching algorithm associated with query cost and runtime that are both in the graph size and in the number of edges to the best of our knowledge this is the first constructive sketching scheme with costs in the graph size in the following we introduce the problem setup our learning model and summarize our contributions problem setup our goal is to learn the following polynomial in terms of its coefficients χk where is the index of the χk xi and is the coefficient in this work we consider an setting for learning definition polynomial ensemble the polynomial ensemble is collection of polynomials satisfying the following conditions the vector is with for some the support supp is chosen uniformly at random over each coefficient takes values from some set according to pa for all supp and pa is some probability distribution over the notation is defined as we consider active learning under the membership query model each query to at returns the pair where is some additive noise we propose query frameb of the polynomial work that leads to fast reconstruction algorithm which outputs an estimate coefficients the performance of our is evaluated by the probability of failing to recover the exact coefficients pf pr where is the indicator function and the expectation is taken with respect to the noise the randomized construction of our queries as well as the random polynomial ensemble our approach and contributions particularly relevant to this work are the algorithms on learning decision trees and boolean functions by uncovering the fourier spectrum of recent papers further show that this problem can be formulated and solved as compressed sensing problem using random queries specifically gives an algorithm using queries based on mutual coherence whereas the restricted isometry property rip is used in to give query complexity of however this formulation needs to estimate vector and hence the complexity is exponential in to alleviate the computational burden proposes scheme to reduce the number of unknowns to which shortens the runtime to poly using samples however this method only works with very small due to the exponential scaling under the sparsity regime for some existing algorithms irrespective of using membership queries or random examples do not immediately apply here because this may require samples and large runtime due to the obscured polynomial scaling in in our framework we show that can be learned 
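For concreteness, the object being learned can be written down in a few lines: an n-variate function that is sparse in the parity (Walsh-Hadamard) basis, evaluated here with a dictionary of nonzero coefficients. This is only a toy illustration of the model, not part of the algorithm.

def evaluate_sparse_poly(coeffs, m):
    """coeffs: dict mapping frequency tuples k (0/1 entries) to coefficients alpha_k;
       m: 0/1 tuple of the same length.  Returns the polynomial evaluated at query m,
       where each monomial contributes alpha_k * (-1)^{<k, m>} over F_2."""
    value = 0.0
    for k, alpha in coeffs.items():
        parity = sum(ki & mi for ki, mi in zip(k, m)) % 2   # <k, m> over F_2
        value += alpha * (1 if parity == 0 else -1)
    return value

# Example: f = 3*(-1)^{x1 + x3} - 2*(-1)^{x2} on n = 4 variables
coeffs = {(1, 0, 1, 0): 3.0, (0, 1, 0, 0): -2.0}
print(evaluate_sparse_poly(coeffs, (1, 1, 0, 0)))   # -3 + 2 = -1.0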
exactly in time in and in even when the queries are perturbed by random gaussian noise theorem noisy learning let where is some arbitrarily large but finite set in the presence of noise our algorithm learns exactly in terms of the coefficients which runs in time ns log using ns queries with probability at least the proposed algorithm and proofs are given in the supplementary material further we apply this framework on learning hidden graphs from cut queries we consider an undirected weighted graph with edges and weights rr where is given but the edge set is unknown this generalizes to hypergraphs where an edge can connect at most nodes called the rank of the graph for hypergraph with edges the cut function is function each monomial depending on at most variables where the sparsity is bounded by on the graph sketching problem uses random queries to sketch the sparse temporal changes of hypergraph in polynomial time poly nd however shows that it becomes computationally infeasible for small graphs nodes edges with while the learngraph algorithm runs in time log using log log rd queries although this significantly reduces the runtime compared to the algorithm only tackles very sparse graphs due to the scaling and this implies that the sketching needs to be done on relatively small graphs nodes over fine sketching intervals minutes to suppress the sparsity within the sketching interval in this work we adapt and apply our learning framework to derive an efficient sketching algorithm whose runtime scales as ds log log log by using ds log log queries we use our adapted algorithm on real datasets and find that we can handle much coarser sketching intervals half an hour and much larger hypergraphs nodes learning framework our learning framework consists of query generator and reconstruction engine given the sparsity and the number of variables the query generator strategically constructs queries randomly and the reconstruction engine recovers the vector for notation convenience we replace each boolean variable xi with binary variable for all using the notation in the fourier expansion we have hm ki where hm ki over now the coefficients can be interpreted as the walshhadamard transform wht coefficients of the polynomial for membership query design the building block of our query generator is the basic query set by subsampling and tiny whts subsampling we choose samples indexed selectively by for where is the subsampling matrix and is the subsampling offset wht very small wht is performed over the samples for where each output coefficient can be obtained according to the aliasing property of wht hd ki mt where ji is the observation noise with variance the basic query set implies that each coefficient is the weighted hash output of under the hash function mt from perspective the coefficient for constitutes parity constraint of the coefficients where enters the parity if mt if we can induce set of parity constraints that mimic good codes with respect to the unknown coefficients the coefficients can be recovered iteratively in the spirit of peeling decoding similar to that in ldpc codes now it boils down to the following questions how to choose the subsampling matrix and how to choose the query set size how to recover the coefficients from their aliased observations in the following we illustrate the principle of our learning framework through simple example with boolean variables and sparsity main idea simple example suppose that the coefficients are and we choose and use two patterns and for 
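The basic query set can be sketched as follows for the noiseless real-valued case: the function is queried at the 2^b points M l + d (arithmetic over F_2) and a small Walsh-Hadamard transform is taken, so that each output bin aggregates exactly those coefficients whose frequencies hash to it under M, weighted by the offset signature. The matrix M, offset d, and function handle are assumptions of the sketch.

import itertools
import numpy as np

def query_bins(f, M, d):
    """f: callable taking a 0/1 numpy vector of length n;
       M: (n, b) 0/1 subsampling matrix; d: length-n 0/1 offset vector."""
    n, b = M.shape
    ells = [np.array(l) for l in itertools.product([0, 1], repeat=b)]
    u = np.array([f((M @ l + d) % 2) for l in ells])          # 2^b queries to f
    # Small WHT, normalized by 2^b: U[j] = 2^{-b} * sum_l u[l] * (-1)^{<j, l>}
    U = np.empty(2 ** b)
    for jdx, j in enumerate(ells):
        signs = np.array([(-1) ** (int(j @ l) % 2) for l in ells])
        U[jdx] = (u * signs).mean()
    return U   # U[j] = sum over {k : M^T k = j} of alpha_k * (-1)^{<d, k>} (plus noise)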
subsampling where all queries made using the same pattern mi are called query group in this example by enforcing zero subsampling offset we generate only one set of queries uc under each pattern mc according to for example under pattern the chosen samples are then the observations are obtained by wht coefficients of these chosen samples for illustration we assume the queries are noiseless generally speaking it is impossible to reconstruct the coefficients from these queries however since the coefficients are sparse then the observations are reduced to the observations are captured by bipartite graph which consists of left nodes and right nodes see fig query stage query stage figure example of bipartite graph for the observations decoding we illustrate how to decode the unknown from the bipartite graph in fig with the help of an oracle and then introduce how to get rid of this oracle the right nodes can be categorized as right node is if it is not connected to any left node right node is if it is connected to only one left node we refer to the index and its associated value as the pair right node is if it is connected to more than one left node the oracle informs the decoder exactly which right nodes are as well as the corresponding pair then we can learn the coefficients iteratively as follows step select all edges in the bipartite graph with right degree detect presence of and the pairs informed by the oracle step remove peel off these edges and the left and right end nodes of these edges step remove peel off other edges connected to the left nodes that are removed in step step remove contributions of the left nodes removed in step from the remaining right nodes finally decoding is successful if all edges are removed clearly this simple example is only an illustration in general if there are query groups associated with the subsampling patterns mc and query set size we define the bipartite graph ensemble below and derive the guidelines for choosing them to guarantee successful recovery definition sparse graph ensemble the bipartite graph ensemble mc is collection of bipartite graphs where there are left nodes each associated with distinct coefficient there are groups of right nodes and ηs right nodes per group and each right node is characterized by the observation uc indexed by in each group there exists an edge between left node and right node uc in group if mtc and thus each left node has regular degree using the construction of mc given in the supplemental material the decoding is successful over the ensemble mc if and are chosen appropriately the key idea is to avoid excessive aliasing by exploiting sufficiently large but finite number of groups for diversity and maintaining the query set size on par with the sparsity lemma if we construct our query generator using query groups with ηs for some redundancy parameter satisfying table minimum value for given the number of groups then the decoder learns in peeling iterations with probability getting rid of the oracle now we explain how to detect and obtain the pair without an oracle we ploit the diversity of subsampling offsets from let dc fp be the offset matrix containing subsampling offsets where each row is chosen offset denote by uc the vector of observations called observation bin associated with the offsets at the right node we have the general observation model for each right node in the bipartite graph as follows proposition given the offset matrix we have dc wc mt where wc wc contains noise samples with variance is an 
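The oracle-based peeling step can be summarized in a few lines. The sketch below uses an assumed data layout: each bin holds the running sum of the coefficients still hashed into it, and the oracle's role is played by knowledge of the bin memberships; a singleton bin reveals one (index, value) pair, which is then subtracted from every other bin it touches.

```python
def peel(bins, edges):
    """
    bins  : dict bin_id -> current observation (sum of remaining coefficients)
    edges : dict bin_id -> set of coefficient indices hashed into that bin
    Returns the recovered coefficients {index: value}, assuming noiseless bins
    and nonzero coefficients.
    """
    recovered = {}
    progress = True
    while progress:
        progress = False
        for b, members in edges.items():
            if len(members) == 1:                    # singleton bin
                (k,) = tuple(members)
                value = bins[b]
                recovered[k] = value
                for b2, members2 in edges.items():   # peel k from every bin containing it
                    if k in members2:
                        bins[b2] -= value
                        members2.discard(k)
                progress = True
    return recovered
```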
elementwise exponentiation operator and dc is the offset signature associated with in the same simple example we keep the subsampling matrix and use the set of offsets and such that the observation bin associated with the subsampling pattern is for example observations and are given as with these bin observations one can effectively determine if check node is singleton or for example say satisfies then the index and the value of can be obtained by simple ratio test the above tests are easy to verify for all observations such that the pair is obtained for peeling in fact this detection scheme for obtaining the oracle information is mentioned in the noiseless scenario by using offsets however this procedure fails in the presence of noise in the following we propose the general detection scheme for the noisy scenario while using offsets learning in the presence of noise in this section we propose robust bin detection scheme that identifies the type of each observation bin and estimate the pair of in the presence of noise for convenience we drop the group index and the node index without loss of clarity because the detection scheme is identical for all nodes from all groups the bin detection scheme consists of the detection scheme and the detection scheme as described next detection proposition given with observed in the presence of noise then by collecting the signs of the observations we have dk sgn where contains independent bernoulli variables with probability at most pe and the sign function is defined as sgn if and sgn if note that the vector is received codeword of the message over binary symmetric channel bsc under an unknown flip sgn therefore we can design the offset matrix according to linear block codes the codes should include as valid codeword such that both dk and dk can be decoded correctly and then obtain the correct codeword dk and hence definition let the offset matrix fp constitute generator matrix of some linear code which satisfies minimum distance βp with code rate and pe since there are information bits in the index there exists some linear code with block length that achieves minimum distance of βp where is the rate of the code as long as pe it is obvious that the unknown can be decoded with exponentially decaying probability of error excellent examples include the class of expander codes or ldpc codes which admits linear time decoding algorithm therefore the detection can be performed in time same as the noiseless case and detection the detection scheme works when the underlying bin is indeed however it does not work on isolating from and we address this issue by further introducing extra random offsets fp constitute random matrix consisting of definition let the offset matrix independent identically distributed bernoulli entries with probability we perform the following ep the observations associated with denote by for some verification the bin is if ku ek verification the bin is if ku where are the detection estimates it is shown in the supplemental material that this bin detection scheme works with probability at least together with lemma the learning framework in the presence of noise succeeds with probability at least as detailed in the supplemental material this leads to overall sample complexity of sn and runtime of ns log application in hypergraph sketching consider hypergraph with edges where cut is set of selected vertices denoted by the boolean cube xn over where xi if and xi if the value of specific cut can be written as xi xi letting xi we have hk mi with 
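In the noiseless case the singleton test reduces to the ratio test just mentioned. The sketch below assumes the n + 1 offsets 0, e_1, ..., e_n, so that for a singleton bin holding coefficient alpha_k the ratios U_i / U_0 equal (-1)^{k_i}; any other pattern of ratios indicates a zero-ton or multi-ton bin.

```python
import numpy as np

def ratio_test(bin_obs, tol=1e-9):
    """Noiseless singleton detection for one bin (sketch, assumed offsets).
    bin_obs = [U_0, U_1, ..., U_n] are the bin observations under offsets 0, e_1, ..., e_n.
    Returns (k_bits, alpha) if the bin is a singleton, otherwise None."""
    U0 = bin_obs[0]
    if abs(U0) < tol:
        return None                                   # zero-ton: nothing hashed here
    ratios = np.asarray(bin_obs[1:]) / U0
    if not np.all(np.isclose(np.abs(ratios), 1.0)):
        return None                                   # multi-ton: ratios are not +/-1
    k_bits = ((1 - np.sign(ratios)) / 2).astype(int)  # (-1)^{k_i} = ratio  =>  k_i in {0,1}
    return k_bits, U0
```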
xi for all where the coefficient is scaled wht coefficient clearly if the number of hyperedges is small and the maximum size of each hyperedge is small the coefficients are sparse and the sparsity can be well upper bounded by now we can use our learning framework to compute the sparse coefficients from only few cut queries note that in the graph sketching problem the weight of is bounded by due to the special structure of cut function therefore in the noiseless setting we can leverage the sparsity and use much fewer offsets in the spirit of compressed sensing in the supplemental material we adapt our framework to derive the graphsketch bin detection scheme with even lower query costs and runtime proposition the graphsketch bin detection scheme uses log log offsets and successfully detects and their pairs with probability at least next we provide numerical experiments of our learning algorithm for sketching large random hypergraphs as well as actual hypergraphs formed by real in fig we compare the probability of success in sketching hypergraphs with nodes over trials against the learngraph in by randomly generating to hyperedges with rank the performance is plotted against the number of edges and the query complexity of learning as seen from fig the query complexity of our framework is significantly lower than that of sketching the yahoo messenger user communication pattern dataset we sketch the hypergraphs extracted from yahoo messenger user communication pattern dataset which records communications for days the dataset is recorded as day time transmitter receiver flag where day and time represent the time stamp of each message the transmitter and receiver represent the ids of the sender and the recipient the zipcode is spatial stamp of each message and the flag indicates if the recipient is in the contact list there are unique users and unique zipcodes hidden hypergraph structure is captured as follows we used matlab on macbook pro with an intel core processor at ghz and gb ram we would like to acknowledge and thank the authors for providing their source codes prob of success secs of edges of queries of edges of queries our framework our framework prob of success secs of edges of queries of edges of queries learngraph learngraph figure sketching performance of random hypergraphs with nodes over an interval δt each sender with unique zipcode forms hyperedge and the recipients are the members of the hyperedge by considering consecutive intervals δt over set of δz zipcodes the communication pattern gives rise to hypergraph with only few hyperedges in each interval and each hyperedge contains only few nodes the complete set of nodes in the hypergraph is the set of recipients who are active during the intervals in table we choose the sketching interval δt and consider intervals for each interval we extract the communication hypergraph from the dataset by sketching the communications originating from set of δz by posing queries constructed at random in our framework we average our performance over trial runs and obtain the success probability temporal graph of edges degree pf sec table sketching performance with groups and query sets of size we maintain groups of queries with query sets of size per group throughout all the experiments queries it is also seen that we can sketch the temporal communication hypergraphs from the real dataset over much larger intervals hr than that by learngraph around sec to min also more reliably in terms of success probability conclusions in this paper we 
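For reference, the cut query itself is simple to state: a hyperedge contributes its weight whenever the selected vertex set separates its endpoints. A minimal sketch follows (the data layout is hypothetical); this is the only access to the hidden graph that the sketching algorithm assumes.

```python
def cut_value(hyperedges, weights, x):
    """Cut query for a hypergraph (sketch). x is a 0/1 vector selecting one side of
    the partition; a hyperedge is cut unless all of its vertices fall on the same side."""
    total = 0.0
    for edge, w in zip(hyperedges, weights):
        sides = {x[v] for v in edge}
        if len(sides) > 1:            # vertices on both sides -> edge is cut
            total += w
    return total

# Example: three hyperedges on 4 nodes, one of rank 3.
edges = [(0, 1), (1, 2), (0, 2, 3)]
w = [1.0, 1.0, 2.0]
print(cut_value(edges, w, x=[1, 0, 0, 1]))   # edges (0,1) and (0,2,3) are cut -> 3.0
```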
introduce active learning framework for sparse polynomials under much more challenging sparsity regime the proposed framework effectively lowers the query complexity and especially the computational complexity our framework is useful in sketching large hypergraphs where the queries are obtained by specific graph cuts we further show via experiments that our learning algorithm performs very well over real datasets compared with existing approaches we did now show the performance of learngraph because it fails to work on hypergraphs with the number of hyperedges at this scale with reasonable number of queries as mentioned in references angluin computational learning theory survey and selected bibliography in proceedings of the annual acm symposium on theory of computing pages acm bouvel grebinski and kucherov combinatorial search on graphs motivated by bioinformatics applications brief survey in concepts in computer science pages springer bshouty and mansour simple learning algorithms for decision trees and multivariate polynomials in foundations of computer science annual symposium on pages oct bshouty and mazzawi optimal query complexity for reconstructing hypergraphs in international symposium on theoretical aspects of computer pages choi jung and kim almost tight upper bound for finding fourier coefficients of bounded functions journal of computer and system sciences goldman computational learning theory in algorithms and theory of computation handbook pages chapman jackson an efficient algorithm for learning dnf with respect to the uniform distribution in foundations of computer science annual symposium on pages ieee kearns the computational complexity of machine learning mit press kocaoglu shanmugam dimakis and klivans sparse polynomial learning and graph sketching in advances in neural information processing systems pages kushilevitz and mansour learning decision trees using the fourier spectrum siam journal on computing mansour learning boolean functions via the fourier transform in theoretical advances in neural computation and learning pages springer mansour randomized interpolation and approximation of sparse polynomials siam journal on computing mazzawi reconstructing graphs using edge counting queries phd thesis technionisrael institute of technology faculty of computer science negahban and shah learning sparse boolean polynomials in communication control and computing allerton annual allerton conference on pages ieee richardson and urbanke modern coding theory cambridge university press scheibler haghighatshoar and vetterli fast hadamard transform for signals with sparsity arxiv preprint settles active learning literature survey university of wisconsin madison stobbe and krause learning fourier sparse set functions in international conference on artificial intelligence and statistics pages yahoo yahoo webscope dataset 
local smoothness in variance reduced optimization daniel vainsencher han liu dept of operations research financial engineering princeton university princeton nj tong zhang dept of statistics rutgers university piscataway nj tzhang abstract we propose family of sampling strategies to provably speed up class of stochastic optimization algorithms with linear convergence including stochastic variance reduced gradient svrg and stochastic dual coordinate ascent sdca for large family of penalized empirical risk minimization problems our methods exploit data dependent local smoothness of the loss functions near the optimum while maintaining convergence guarantees our bounds are the first to quantify the advantage gained from local smoothness which are significant for some problems significantly better empirically we provide thorough numerical results to back up our theory additionally we present algorithms exploiting local smoothness in more aggressive ways which perform even better in practice introduction we consider minimization of functions of form φi where the convex φi corresponds to loss of on some data xi is convex regularizer and is strongly convex so that in addition we assume each φi is smooth in general and near flat in some region examples include svm regression with the absolute error or insensitive loss smooth approximations of those and also logistic regression stochastic optimization algorithms consider one loss φi at time chosen at random according to distribution pt which may change over time recent algorithms combine φi with information about previously seen losses to accelerate the process achieving linear convergence rate including stochastic variance reduced gradient svrg stochastic averaged gradient sag and stochastic dual coordinate ascent sdca the expected number of iterations required by these algorithms is of form log where is lipschitz constant of all loss gradients measuring their smoothness difficult problems having condition number much larger than are called ill conditioned and have motivated the development of accelerated algorithms some of these algorithms have been adapted to allow importance sampling where pt is non uniform the effect on convergence bounds is to replace the uniform bound described above by lavg the average over li loss specific lipschitz bounds in practice for an important class of problems large proportion of φi need to be sampled only very few times and others indefinitely as an example we take an instance of smooth svm with and solved via standard sdca in figure we observe the decay of an upper bound on the updates possible for different samples where choosing sample that is white produces no update the large majority of the figure is white indicating wasted effort for of losses the algorithm captured all relevant information after just visits since the non white zone is nearly constant over time detecting and focusing on the few important losses should be possible this represents both success of sdca and significant room for improvement as focusing just half the effort on the active losses would increase effectiveness by factor of similar phenomena occur under the svrg and sag algorithms as well but is the phenomenon specific to single problem or general for what problems can we expect the set of useful losses to be small and near constant figure sdca on smoothed svm dual residuals upper bound the sdca update size white indicates zero hence wasted effort the dual residuals quickly become sparse the support is stable allowing pt to change 
over time the phenomenon described indeed can be exploited figure shows significant speedups obtained by our variants of svrg and sdca comparisons on other datasets are given in section the mechanism by which speed up is obtained is specific to each algorithm but the underlying phenomenon we exploit is the same many problems are much smoother locally than globally first consider single smoothed hinge loss φi as used in smoothed svm with smoothing parameter the of the hinge loss is spread in φi over an interval of length as illustrated in figure and given by φi otherwise svrg solving smoothed hinge loss svm on mnist gradient is lip smooth strong convexity uniform sampling global smoothness sampling local svrg alg empirical affinity svrg alg effective passes over data duality duality the lipschitz constant of da φi is hence it enters into the global estimate of condition number lavg as li kxi hence approximating the hinge loss more precisely with smaller makes the problems strictly more ill conditioned but outside that interval of length φi can be locally approximated as affine having constant gradient into correct expression of local conditioning say on interval in the figure it should contribute nothing so smaller can sometimes make the problem locally better conditioned set of losses having constant gradients over subset of the hypothesis space can be summarized for purposes of optimization by single affine sdca solving smoothed hinge loss svm on mnist gradient is lip smooth strong convexity uniform sampling global smoothness sampling alg empirical sdca alg effective passes over data figure on the left we see variants of svrg with on the right variants of sdca figure loss φi that is near flat hessian vanishes near constant gradient on ball with radius kxi is induced by the euclidean ball of hypotheses wt that we prove includes then the loss φi does not contribute to curvature in the region of interest and an affine model of the sum of such φi on can replace sampling from them we find in algorithms by combining strong convexity with quantities such as duality gap or gradient norm function so sampling from should not be necessary it so happens that sag svrg and sdca naturally do such modeling hence need only light modifications to realize significant gains we provide the details for svrg in section the sag case is similar and for sdca in section other losses while nowhere affine are locally smooth the logistic regression loss has gradients with local lipschitz constants that decay exponentially with distance from hyperplane dependent on xi for such losses we can not forgo sampling any φi permanently but we can still obtain bounds benefitting from local smoothness for an svrg variant next we define formally the relevant geometric properties of the optimization problem and relate them to provable convergence improvements over existing generic bounds we give detailed bounds in the sequel throughout is euclidean ball of radius around definition we shall denote li φi which is also the uniform lipschitz coefficient of that hold at distance at most from remark algorithms will use similar quantities not dependent on knowing such as around known definition we define the average ball smoothness function of problem by li li in theorem we see that algorithm requires fewer stochastic gradient samples to reduce loss suboptimality by constant factor than svrg with importance sampling according to global smoothness once it has certified that the optimum is within of the current iterate it uses times less 
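For concreteness, one standard smoothed hinge with smoothing parameter gamma (the form assumed in this sketch) is flat for margins above 1, affine with slope -1 below 1 - gamma, and quadratic in between, so its derivative is (1/gamma)-Lipschitz globally yet exactly constant outside an interval of length gamma, which is precisely the local flatness discussed here.

```python
import numpy as np

def smoothed_hinge(z, gamma=0.1):
    """Smoothed hinge loss: flat for z >= 1, affine for z <= 1 - gamma,
    quadratic in between; its derivative is (1/gamma)-Lipschitz."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1.0, 0.0,
           np.where(z <= 1.0 - gamma, 1.0 - z - gamma / 2.0,
                    (1.0 - z) ** 2 / (2.0 * gamma)))

def smoothed_hinge_grad(z, gamma=0.1):
    """Derivative: 0 on the flat part, -1 on the affine part, linear in between."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1.0, 0.0,
           np.where(z <= 1.0 - gamma, -1.0, (z - 1.0) / gamma))
```

With phi_i(w) = smoothed_hinge(y_i x_i^T w), the global smoothness constant scales like ||x_i||^2 / gamma, while the local constant is zero wherever the margin stays outside the interval [1 - gamma, 1] over the region of interest.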
stochastic gradient steps the next measure similarly increases when many losses are affine on ball around the optimum definition we define the ball affinity function of problem by li in theorem we see similarly that algorithm requires fewer accesses of φi to reduce the duality gap to any than sdca with importance sampling according to global once it has certified that the optimum is within distance of the current primal iterate it accesses times fewer φi in both cases local smoothness and affinity enable us to focus constant portion of sampling effort on the fewer losses still challenging near the optimum when these are few the ratios and hence algorithmic advantage are large we obtain these provable speedups over already fast algorithms by using that local smoothness which we can certify for non smooth losses such as svm and and absolute loss regression we can similarly ignore irrelevant losses leading to significant practical improvements the current theory for such losses is insufficient to quantify the speed ups as we do for smooth losses we obtain algorithms that are simpler and sometimes much faster by using the more qualitative observation that as iterates tend to an optimum the set of relevant losses is generally stable and shrinking then algorithms can estimate the set of relevant losses directly from quantities observed in performing stochastic iterations sidestepping the looseness of estimating there are two previous works in this general direction the first paper work combining sampling and empirical estimation of loss smoothness is they note excellent empirical performance on variant of sag but without theory ensuring convergence we provide similarly fast and bound free variants of sdca section and svrg section dynamic importance sampling variant of sdca was reported in without relation to local smoothness we discuss the connection in section local smoothness and gradient descent algorithms in this section we describe how svrg in contrast to the classical stochastic gradient descent sgd naturally exposes local smoothness in losses then we present two variants of svrg that realize these gains we begin by considering single loss when close to the optimum and for simplicity assume assume small ball around our current estimate includes around the optimum and is contained in flat region of φi and this holds for large proportion of the losses sgd and its descendent svrg with importance sampling use updates of form wt ηvit pi where ep vi pi wt is an unbiased estimator of the full gradient of the loss term φi xi svrg uses pi vit where is some reference point with the advantage that vit has variance that vanishes as wt we point out in addition that when and xi is constant on the effects of sampling φi cancels out and vit in particular we can set pti with no loss of information more generally when is near constant on small li the difference between the sampled values of in vit is very small and pti can be similarly small we formalize this in the next section where we localize existing theory that applied importance sampling to adapt svrg statically to losses with varied global smoothness the local svrg algorithm halving the suboptimality of solution using svrg has two parts computing an exact gradient at reference point and performing many stochastic gradient descent steps the sampling distribution step size and number of iterations in the latter are determined by smoothness of the losses algorithm replaces the global bounds on gradient change li with local ones li made valid by 
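A minimal sketch of the importance-sampled SVRG inner loop just described (function and variable names are our own): the correction term keeps the gradient estimate unbiased under any sampling distribution p, and its variance shrinks as both the iterate and the reference point approach the optimum.

```python
import numpy as np

def svrg_epoch(grad_phi, full_grad, w_tilde, w0, p, eta, m, rng):
    """One SVRG inner loop with importance sampling (sketch).
    grad_phi(i, w) : gradient of the i-th loss at w
    full_grad      : precomputed (1/n) * sum_i grad_phi(i, w_tilde)
    p              : sampling distribution over the n losses (p[i] > 0, sums to 1)."""
    n = len(p)
    w = w0.copy()
    for _ in range(m):
        i = rng.choice(n, p=p)
        # variance-reduced, importance-weighted gradient estimate (unbiased for any p)
        v = (grad_phi(i, w) - grad_phi(i, w_tilde)) / (n * p[i]) + full_grad
        w = w - eta * v
    return w
```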
restricting iterations to small ball certified to contain the optimum this allows us to leverage previous algorithms and analysis maintaining previous guarantees and improving on them when is large for this section we assume as in the initial version of svrg we may incorporate smooth regularizer though in different way explained later this allows us to apply the existing algorithm and its theory instead of using the proximal operator for fixed regularization we use it to localize by projections the stochastic descent to ball around the reference point see algorithm then the theory developed around importance sampling and global smoothness applies to sharper local smoothness estimates that hold on ignoring φi which are affine on is special case this allows for fewer stochastic iterations and using larger stepsize obtaining speedups that are problem dependent but often large in late stages see figure this is formalized in the following theorem algorithm local svrg is an application of proxsvrg with dependent regularization this portion reduces suboptimality by constant factor apply iteratively to minimize loss compute define ib by strong convexity otherwise for each compute φi define probability distribution pi weighted lipschitz constant maxi npi and step size apply the inner loop of set for choose it ii compute npit iii wt proxηr ηv return wt theorem let be an initial solution such that certifies that algorithm finds with ef pn using time where remark in the difficult case that is ill conditioned even locally so that nµ the term is negligible and the ratio between complexities of algorithm and an svrg using global smoothness approaches proof in the initial pass on the data compute and we then apply single round of algorithm of with the regularizer χb localizing around the reference point then we may apply theorem of with local instead of the global li required there for general proximal operators this allows us to use the corresponding larger stepsize remark the use of projections hence the restriction to smooth regularization is necessary because the local smoothness is restricted to and venturing outside with large step size may compromise convergence entirely while excursions outside are difficult to control in theory in practice skipping the projection entirely does not seem to hurt convergence informally stepping far from requires moving consistently against which is an unlikely event remark the theory requires stochastic steps per exact gradient to guarantee any improvement at all but for ill conditioned problems this is often very pessimistic in practice the first stochastic steps after an exact gradient provide most of the benefit in this heuristic scenario the computational benefit of theorem is through the sampling distribution and the larger step size enlarging the step size without accompanying theory often gains corresponding speed up to certain precision but the risk of non convergence materializes frequently while incorporated smooth by adding it to every loss function this could reduce the smoothness increase inherent in the losses hence reducing the benefits of our approach we instead poses no real propose to add single loss function defined as nr that this is not of form φi difficulty because depends on losses only through their gradients and smoothness the main difficulty with the approach of this section is that in early stages is large in part because is often very small for are common choices leading to loose bounds on in some cases the speed up is only obtained when 
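The localization used here amounts to a Euclidean projection after each stochastic step; a minimal sketch follows, where the radius is assumed to come from a strong-convexity certificate (for instance via the duality gap), rather than restating the exact constant used by the algorithm.

```python
import numpy as np

def project_to_ball(w, center, radius):
    """Euclidean projection onto B(center, radius), used to keep the stochastic
    iterates inside the ball on which the local Lipschitz bounds are valid."""
    d = w - center
    norm = np.linalg.norm(d)
    if norm <= radius:
        return w
    return center + radius * d / norm
```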
the precision is already satisfactory we consider less conservative scheme in the next section the empirical affinity svrg algorithm relies on local smoothness to certify that some are small in contrast empirical affinity svrg algorithm takes to be evidence that loss is active when several times that is evidence of local affinity of the loss hence it can be sampled less often this strategy deemphasizes locally affine losses even when is too large to certify it thereby focuses work on the relevant losses much earlier half of the time we sample proportional to the global bounds li which keeps estimates of current and also bounds the variance when some increases from zero to positive benefit of using is that it is observed at every sample of without additional work pseudo code for the slightly long algorithm is in the supplementary material for space reasons stochastic dual coordinate ascent sdca the sdca algorithm solves through the dual problem where λn xi αi at each iteration sdca chooses at random according to pt and updates the αi corresponding to the loss φi to increase this scheme has been used for particular losses before and was analyzed in obtaining linear rates for general smooth losses uniform sampling and regularization and recently generalized in to other regularizers and general sampling distributions in particular show improved bounds and performance by cally adapting to the global of losses properties using distribution pi nµ smoothness pn log avg iterations to obtain an expected duit suffices to perform avg ality gap of at most while sdca is very different from gradient descent methods it shares the property that when the current state of the algorithm in the form of αi already matches the derivative information for φi the update does not require φi and can be skipped as we ve seen in figure many losses converge αi very quickly we will show that local affinity is sufficient condition the algorithm the algorithmic approach for exploiting locally affine losses in sdca is very different from that for gradient descent style algorithms for some affine losses we certify early that some αi are in their final form see lemma and henceforth ignore them this applies only to locally affine not just smooth losses but unlike does not require modifying the algorithm for explicit localization we use reduction to obtain improved rates while reusing the theory of for the remaining points these results are stated for squared euclidean regularization but hold for strongly convex as in lemma let wt αt and let gi wt in other words φi is affine on which includes then we can compute the optimal value αi proof as stated in section of for each we have then if φi xi is constant singleton on wt containing then in particular that is the lemma enables algorithm to ignore growing proportion of losses the overall convergence this enables is given by the following algorithm adapting to locally affine φi with speedup approximately rn for compute rτ compute wτ for αi iτ iτ pi otherwise li nµ otherwise for choose it pτ ii compute sit αit it iii αj αj otherwise theorem epoch algorithm duality gap is at ετ will achieve expected duality gap if at log iterations where and in at most li remark assuming li for simplicity and recalling we find the number of iterations is reduced by factor of at least compared to using pi li nµ in contrast the cost of the steps to added by algorithm is at most factor of which may be driven towards one by the choice of recent work modified sdca for dynamic importance sampling dependent 
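The certification step that lets SDCA permanently fix some dual variables can be sketched as follows for the smoothed hinge (variable names and the duality-gap-based radius are assumptions of the illustration): if the margin cannot leave one affine piece of the loss anywhere in the ball known to contain the optimum, the loss gradient, and hence the optimal dual coordinate, is already determined and the loss can be dropped from further sampling.

```python
import numpy as np

def certify_affine(X, y, w, gap, mu, gamma):
    """Return a boolean mask of losses certified to be affine around the optimum.
    By mu-strong convexity, the optimum lies in a ball of radius sqrt(2*gap/mu)
    around w; if the margin y_i <x_i, w> stays on one affine piece of the smoothed
    hinge over that whole ball, its gradient there is constant (0 or -1)."""
    r = np.sqrt(2.0 * gap / mu)
    margins = y * (X @ w)
    slack = r * np.linalg.norm(X, axis=1)     # worst-case margin change over the ball
    flat = margins - slack >= 1.0             # gradient is 0 on the whole ball
    affine = margins + slack <= 1.0 - gamma   # gradient is -1 on the whole ball
    return flat | affine
```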
on the so called dual residual κi αi where by we refer to the derivative of φi at which is at they exhibit practical improvement in convergence especially for smooth svm and theoretical speed ups when is sparse for an impractical version of the algorithm but does not tell us when this holds nor the magnitude of the expected benefit in terms of properties of the problem as opposed to algorithm state such as in the context of locally flat losses such as smooth svm we answer these questions through local smoothness lemma shows κi tends to zero for losses that are locally affine on ball around the optimum and the practical algorithm realizes the benefit when this certification comes into play as quantified in terms of the empirical sdca algorithm algorithm uses local affinity and small duality gap to certify the optimality of some αi avoiding calculating that are zero or useless naturally is small enough only late in the process algorithm instead dedicates half of samples in proportion to the magnitude of recent the other half chosen uniformly as figure illustrates this approach leads to significant speed up much earlier than the approach based on duality gap certification of local affinity while we it is not clear that we can prove for algorithm bound that strictly improves on algorithm it is worth noting that except for probably rare updates to and factor of the empirical algorithm should quickly detect all locally affine losses hence obtain at least the speed up of the certifying algorithm in addition it naturally adapts to the expected small updates of locally smooth losses note that is closely related to and might be replacable by but the current algorithm differs significantly from those in in how these quantities are used to guide sampling algorithm empirical sdca rn ati for and pτ where ai for choose it pτ ii compute sit αit it iii aj otherwise it iv αtj αj otherwise empirical evaluation duality sdca solving smoothed hinge loss svm on dorothea gradient is lip smooth strong convexity uniform sampling global smoothness sampling alg empirical sdca alg effective passes over data sdca solving smoothed hinge loss svm on gradient is lip smooth strong convexity uniform sampling global smoothness sampling alg empirical sdca alg effective passes over data duality duality sdca solving smoothed hinge loss svm on mushroom gradient is lip smooth strong convexity uniform sampling global smoothness sampling alg empirical sdca alg effective passes over data duality we applied the same algorithms with the same parameters to additional classification datasets to demonstrate the impact of our algorithm variants more widely the results for sdca are in figure those for svrg in figure in section in the supplementary material for lack of space sdca solving smoothed hinge loss svm on gradient is lip smooth strong convexity uniform sampling global smoothness sampling alg empirical sdca alg effective passes over data figure sdca variant results on four additional datasets the advantages of using local smoothness are significant on the harder datasets references dominik csiba zheng qu and peter stochastic dual coordinate ascent with adaptive probabilities arxiv preprint rie johnson and tong zhang accelerating stochastic gradient descent using predictive variance reduction in advances in neural information processing systems pages on one of the new datasets svrg with ratio of to lavg more aggressive than theory suggests stopped converging hence we changed all runs to use the permissible no other parameters were 
changed or adapted to the dataset.

Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. arXiv preprint.
Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint.
Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming.
Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research.
Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization.
Yuchen Zhang and Lin Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. arXiv preprint.
Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling. arXiv preprint.
Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the International Conference on Machine Learning.
saliency scale and information towards unifying theory neil bruce department of computer science university of manitoba bruce shafin rahman department of computer science university of manitoba abstract in this paper we present definition for visual saliency grounded in information theory this proposal is shown to relate to variety of classic research contributions in theory interest point detection bilateral filtering and to existing models of visual saliency based on the proposed definition of visual saliency we demonstrate results competitive with the art for both prediction of human fixations and segmentation of salient objects we also characterize different properties of this model including robustness to image transformations and extension to wide range of other data types with mesh models serving as an example finally we relate this proposal more generally to the role of saliency computation in visual information processing and draw connections to putative mechanisms for saliency computation in human vision introduction many models of visual saliency have been proposed in the last decade with differences in defining principles and also divergent objectives the motivation for these models is divided among several distinct but related problems including human fixation prediction salient object segmentation and more general measures of objectness models also vary in intent and range from hypotheses for saliency computation in human visual cortex to those motivated exclusively by applications in computer vision at high level the notion of saliency seems relatively straightforward and characterized by patterns that stand out from their context according to unique colors striking patterns discontinuities in structure or more generally figure against ground while this is seemingly simplistic concept the relative importance of defining principles of model and fine grained implementation details in determining output remains obscured given similarities in the motivation for different models there is also value in considering how different definitions of saliency relate to each other while also giving careful consideration to parallels to related concepts in biological and computer vision the characterization sought by models of visual saliency is reminiscent ideas expressed throughout seminal work in computer vision for example early work in theory includes emphasis on the importance of extrema in structure expressed across as an indicator of potentially important image content related efforts grounded in information theory that venture closer to modern notions of saliency include kadir and brady and jagersand analysis of interaction between scale and local entropy in defining relevant image content these concepts have played significant role in techniques for affine invariant keypoint matching but have received less attention in the direct prediction of saliency information theoretic models are found in the literature directly addressing saliency prediction for determining gaze points or prominent example of this is the aim model wherein saliency is based directly on measuring the of image patterns alternative information theoretic definitions have been posed including numerous models based on measures of redundancy or compressibility that are strongly related to information theoretic concepts given common roots in communication theory in this paper we present relatively simple information theoretic definition of saliency that is shown to have strong ties to number of classic concepts in 
the computer vision and visual saliency literature beyond specific model this also serves to establish formalism for characterizing relationships between scale information and saliency this analysis also hints at the relative importance of fine grained implementation details in differentiating performance across models that employ disparate but strongly related definitions of visual salience the balance of the paper is structured as follows in section we outline the principle for visual saliency computation proposed in this paper defined by maxima in information miss in section we demonstrate different characteristics of the proposed metric and performance on standard benchmarks finally section summarizes main points of this paper and includes discussion of broader implications maxima in information miss in the following we present general definition of saliency that is strongly related to prior work discussed in section in short according to our proposal saliency corresponds to maxima in space miss the description of miss follows and is accompanied by more specific discussion of related concepts in computer vision and visual saliency research let us first assume that the saliency of statistics that define local region of an image are function of the rarity likelihood of such statistics we ll further assume without loss of generality that these local statistics correspond to pixel intensities the likelihood of observing pixel at position with intensity ip in an image based on the global statistics is given by the frequency of intensity ip relative to the total number of pixels norp malized histogram lookup this may be expressed as follows ip iq ip with the dirac delta function one may generalize this expression to kernel density estimate ip gσi ip iq where gσi corresponds to kernel function assumed to be gaussian in this case this may be viewed as either smoothing the intensity histogram or applying density estimate that is more robust to low sample in practice the proximity of pixels to one another is also relevant filtering operations applied to images are typically local in their extent and the correlation among pixel values inversely proportional to the spatial distance between them adding local spatial weighting to the likelihood estimate such that nearby pixels have stronger influence the expression is as follows ip gσb gσi iq this constitutes locally weighted likelihood estimate of intensity values based on pixels in the surround having established the expression in equation we shift to discussion of theory in traditional theory the representation is defined by convolution of an image with gaussian kernel such that with the variance of gaussian filter features are often derived from the family of gaussian derivatives defined by lxm yn δxm yn with differential invariants produced by combining gaussian derivatives of different orders in weighted combination an important concept in theory is the notion that scale selection or the size and position of relevant structure in the data is related to the scale at which features normalized derivatives assume maximum value this consideration forms the basis for early definitions of saliency which derive measure of saliency corresponding to the scale at which local entropy is maximal this point is revisited later in this section the representation may also be defined as the solution to therheat equation δi δt ixx iyy which may be rewritten as where gσs iq dq and the local although this example is based on pixels intensities the same analysis 
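The locally weighted likelihood at the heart of the definition can be computed directly. The sketch below (grayscale input and normalization by the total spatial weight are simplifying assumptions) sums, for a given pixel, the product of a spatial Gaussian and a feature-similarity Gaussian over the image; this is the same quantity that appears as the normalization term of a bilateral filter.

```python
import numpy as np

def local_likelihood(img, p, sigma_b, sigma_i):
    """Locally weighted likelihood of the value at pixel p (sketch): every pixel q
    votes for I_p with a weight that decays with spatial distance ||p - q|| and
    with feature distance |I_p - I_q|.  img is a 2-D grayscale array, p = (row, col).
    Returns a value in (0, 1]."""
    rows, cols = img.shape
    r, c = p
    yy, xx = np.mgrid[0:rows, 0:cols]
    spatial = np.exp(-((yy - r) ** 2 + (xx - c) ** 2) / (2.0 * sigma_b ** 2))
    feature = np.exp(-((img - img[r, c]) ** 2) / (2.0 * sigma_i ** 2))
    return (spatial * feature).sum() / spatial.sum()   # normalized sum of bilateral weights
```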
may be applied to statistics of arbitrary dimensionality for higher dimensional feature vectors appropriate sampling is especially important spatial support this expression is the solution to the heat equation when σs this corresponds to diffusion process that is isotropic there are also variety of operations in image analysis and filtering that correspond to more general process of anisotropic diffusion one prominent example is the that proposed by perona and malik that implementsredge preserving smoothing similar process is captured by the yaroslavsky filter gσr iq iq dq bσs with bσs reflecting the spatial range of the filter the difference between these techniques and an isotropic diffusion process is that relative intensity values among local pixels determine the degree of diffusion or weighted local sampling the yaroslavsky filter may be shown to be special case of the more general bilateral filter corresponding to for the pspatial weight factor gσb gσi iq iq with wp gσb gσi iq wp in the same manner that selection of extrema defined by an isotropic diffusion process carries value in characterizing relevant image content and scale we propose to consider extrema that carry relationship to an anisotropic diffusion process note that the normalization term wp appearing in the equation for the bilateral filter is equivalent to the expression appearing in equation in contrast to bilateral filtering we are not interested in producing weighted sample of local intensities but we instead consider the sum of the weights themselves which correspond to robust estimate of the likelihood of ip one may further relate this to an information theoretic quantity of in considering ip the selfinformation associated with the observation of intensity ip with the above terms defined maxima in information are defined as iss ip max σb log gσb gσi iq saliency is therefore equated to the local for the scale at which this quantity has its maximum value for each pixel location in manner akin to scale selection based on normalized gradients or differential invariants this also corresponds to scale and value selection based on maxima in the sum of weights that define local anisotropic diffusion process in what follows we comment further on conceptual connections to related work scale space extrema the definition expressed in equation has strong relationship to the idea of selecting extrema corresponding to normalized gradients in or in space in this case rather than gaussian blurred intensity profile scale extrema are evaluated with respect to local information expressed across scale space kadir and brady in kadir and brady proposal interest points or saliency in general is related to the scale at which entropy is maximal while entropy and are related maxima in local entropy alone are insufficient to define salient content regions are therefore selected on the basis of the product of maximal local entropy and magnitude change of the probability density function in contrast the approach employed by miss relies only on the expression in equation and does not require additional normalization it is worth noting that success in matching keypoints relies on the distinctness of keypoint descriptors which is notion closely related to saliency attention based on information maximization aim the quantity expressed in equation is identical to the definition of saliency assumed by the aim model for specific choice of local features and fixed scale the method proposed in equation considers the maximum selfinformation expressed 
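Putting the pieces together, the per-pixel measure is the self-information of this locally weighted likelihood, maximized over the spatial scales. A minimal single-pixel sketch follows (grayscale input and the particular normalization are again assumptions of the illustration, not the exact implementation evaluated later).

```python
import numpy as np

def miss_saliency(img, p, scales, sigma_i):
    """MISS at pixel p (sketch): -log of the locally weighted likelihood of I_p,
    maximized over the spatial scales sigma_b listed in `scales`."""
    rows, cols = img.shape
    r, c = p
    yy, xx = np.mgrid[0:rows, 0:cols]
    feature = np.exp(-((img - img[r, c]) ** 2) / (2.0 * sigma_i ** 2))
    best = -np.inf
    for sigma_b in scales:
        spatial = np.exp(-((yy - r) ** 2 + (xx - c) ** 2) / (2.0 * sigma_b ** 2))
        p_hat = (spatial * feature).sum() / spatial.sum()
        best = max(best, -np.log(p_hat))
    return best
```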
across scale space for each local observation to determine relative saliency bilateral filtering bilateral filtering produces weighted sample of local intensity values based on proximity in space and feature space the sum of weights in the normalization term provides direct estimate of the likelihood of the intensity or statistics at the kernel center and is directly related to graph based saliency and random walks proposals for visual saliency also include techniques defined by graphs and random walks there is also common ground between this family of approaches and those grounded in information theory specifically random walk or markov process defined on lattice may be seen as process related to anisotropic diffusion where the transition probabilities between nodes define diffusion on the lattice for model such as graph based visual saliency gbvs directed edge from node to node is given weight where is measure of dissimilarity and gaussian profile in the event that the dissimilarity measure is also defined by gaussian function of intensity values at and the edge weight defining transition probability is equivalent to wp and the expression in equation evaluation in this section we present an array of results that demonstrate the utility and generality of the proposed saliency measure this includes typical saliency benchmark results for both fixation prediction and object segmentation based on miss we also consider the relative invariance of this measure to image deformations viewpoint lighting and demonstrate robustness to such deformations this is accompanied by demonstration of the value of miss in more general sense in assessing saliency for broad range of data types with demonstration based on point cloud data finally we also contrast behavior against very recently proposed models of visual saliency that leverage deep learning revealing distinct and important facets of the overall problem the results that are included follow the framework established in section however the intensity value appearing in equations in section is replaced by vector of rgb values corresponding to each pixel denotes the norm and is therefore euclidean distance in the rgb colorspace it is worth noting that the definition of miss may be applied to arbitrary features including normalized gradients differential invariants or alternative features the motivation for choosing pixel color values is to demonstrate that high level of performance may be achieved on standard benchmarks using relatively simple set of features in combination with miss variety of steps are commonplace in evaluating saliency models including topological spatial bias of output or local gaussian blur of the saliency map in some of our results as noted bilateral blurring has been applied to the output saliency map in place of standard gaussian blurring the reasons for this are detailed later on in in this section but it is worth stating that this has shown to be advantageous in comparison to the standard of gaussian blur in our benchmark results benchmark results are provided for both fixation data and salient object segmentation for segmentation based evaluation we apply the methods described by li et al this involves segmentation using mcg with resulting segments weighted based on the saliency map miss versus scale in considering scale space extrema plotting entropy or energy among normalized derivatives across scale is revealing with respect to characteristic scale and regions of interest following this line of analysis in figure we 
demonstrate variation in information values as function of σb expressed in pixels in figure three pixels are labeled corresponding to each of these categories as indicated by colored dots the plot in figure shows the for all of the selected pixels considering wide range of scales object pixels edge pixels and pixels tend to produce different characteristic curves across scale in considering ip center bias via local connectivity center bias has been much discussed in the saliency literature and as such we include results in this section that apply different strategy for considering center bias in particular in the following center bias appears more directly as factor that influences the relative weights assigned to likelihood estimate defined by local pixels this effectively means that pixels closer to the center have more influence in determining estimated likelihoods one can imagine such an operation having more prominent role in foveated vision system wherein centrally located photoreceptors have much greater density than the periphery the first variant of center bias proposed is as follows ip max log where gσb gσi iq gσcb σb note that while the authors originally employed cmpc as segmentation algorithm more recent recommendations from the authors prescribe the use of mcg object pixel object pixel object pixel edge pixel edge pixel edge pixel pixel pixel pixel kernel size in pixels figure sample image with select pixel locations highlighted in color of the corresponding pixel locations as function of scale figure input images in and sample output for raw saliency maps with bilateral blur using bias using bias object segmentation using is the spatial center of the image gσcb is gaussian function which controls the amount of center bias based on σcb the second approach includes the center bias control parameters directly within in the second gaussian function ip max log where σi σb σb is the maximum possible distance from the center pixel to any other pixel salient objects and fixations evaluation results address two distinct and standard problems in saliency prediction these are fixation prediction and salient object prediction respectively the evaluation largely follows the methodology employed by li et al benchmarking metrics considered are common standards in saliency model evaluation and details are found in the supplementary material we have compared our results with several saliency and segmentation algorithms itti aim gbvs dva sun sig aws ft gc sf pcas and across different datasets note that for segmentation based tests comparison among saliency algorithms considers only the reason for this is that this was the highest performing of all of the saliency algorithms considered by li et al in our results we exercise range of parameters to gauge their relative importance the size of gaussian kernel gσb determines the spatial scale different kernel sizes are considered in range from to pixels with the standard deviation σb equal to one third of the kernel width for fixation prediction only subset of smaller scales is sufficient to achieve good performance but the complete set of scales is necessary for segmentation the gaussian kernel that defines color distance gσi is determined by the standard deviation σi we tested values for σi ranging from to for post processing standard bilateral filtering bb kernel size of is used and center bias results are based on fixed σcb for the kernel gσcb for for the second alternative method one gaussian kernel gσi is used with σi all of these settings 
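The center-bias variants multiply an additional image-centered Gaussian into the weights before summing; a minimal sketch of that weighting term follows (how it is combined with the spatial scale in the second variant is not reproduced here, and the names are ours).

```python
import numpy as np

def center_bias_weights(shape, sigma_cb):
    """Center-bias weighting (sketch): pixels closer to the image centre get more
    influence in the likelihood estimate, loosely mimicking the higher central
    receptor density of a foveated system.  Multiply these weights into the
    spatial/feature kernel product before summing."""
    rows, cols = shape
    yy, xx = np.mgrid[0:rows, 0:cols]
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    return np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * sigma_cb ** 2))
```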
have also considered different scaling factors applied to the overall image and and in most cases results corresponding to the resize factor of are best scaling down the image implies shift in the scales spanned in scale space towards lower spatial frequencies table benchmarking results for fixation prediction aws aim sig dva gbvs sun itti bruce cerf judd imgsal pascal miss basic miss bb miss miss table benchmarking results for salient object prediction saliency algorithms ft imgsal pascal aws aim sig dva gbvs sun itti miss basic miss bb miss miss table benchmarking results for salient object prediction segmentation algorithms sf gc pcas ft ft imgsal pascal basic bb in figure we show some qualitative results of output corresponding to miss with different postprocessing variants of center bias weighting for both saliency prediction and object segmentation lighting and viewpoint invariance given the relationship between miss and models that address the problem of invariant keypoint selection it is interesting to consider the relative invariance in saliency output subject to changing viewpoint lighting or other imaging conditions this is especially true given that saliency models have been shown to typically exhibit high degree of sensitivity to imaging conditions this implies that this analysis is relevant not only to interest point selection but also to measuring the relative robustness to small changes in viewpoint lighting or optics in predicting fixations or salient targets to examine affine invariance we have to used image samples from classic benchmark which represent changes in blur lighting and viewpoint in all of these sequence the first image is the reference image and the imaging conditions change gradually throughout the sequence we have applied the miss algorithm without considering any center bias to all of the images in those sequences from the raw saliency output we have selected keypoints based on nonmaxima suppression with radius pixels and threshold for every detected keypoint we assign circular region centered at the keypoint the radius of this circular region is based on the width of the gaussian kernel gσb defining the characteristic scale at which achieves maximum response keypoint regions are compared across images subject to their repeatability repeatability measures the similarity among detected regions across different frames and is standard way of gauging the capability to detect common regions across different types of image deformations we compare our results with several other region detectors including harris hessian mser ibr and ebr figure demonstrates that output corresponding to the proposed saliency measure revealing considerable degree of invariance to affine transformations and changing image characteristics suggesting robustness for applications for gaze prediction and object selection wall sequence repeatebility leuven sequence repeatebility bikes sequence repeatebility repeatebility bark sequence haraff hesaff mseraf ibraff ebraff miss increasing zoom rotation increasing blur increasing light increasing viewpoint angle figure demonstration of invariance to varying image conditions including viewpoint lighting and blur based on standard benchmark beyond images while the discussion in this paper has focused almost exclusively on image input it is worth noting that the proposed definition of saliency is sufficiently general that this may be applied to alternative forms of data including models audio signals or any form of data with locality in space or 
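Keypoint selection from the raw saliency map uses greedy non-maxima suppression; a minimal sketch follows (a square suppression window stands in for the circular radius, and the threshold is assumed positive).

```python
import numpy as np

def select_keypoints(saliency, radius, threshold):
    """Greedy non-maxima suppression (sketch): repeatedly take the strongest
    remaining response above `threshold` and suppress everything within
    `radius` pixels of it.  Returns a list of (row, col) keypoints."""
    s = saliency.astype(float).copy()
    keypoints = []
    while True:
        r, c = np.unravel_index(np.argmax(s), s.shape)
        if s[r, c] < threshold:
            break
        keypoints.append((r, c))
        r0, r1 = max(0, r - radius), min(s.shape[0], r + radius + 1)
        c0, c1 = max(0, c - radius), min(s.shape[1], c + radius + 1)
        s[r0:r1, c0:c1] = -np.inf        # suppress the neighbourhood
    return keypoints
```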
Beyond images. While the discussion in this paper has focused almost exclusively on image input, it is worth noting that the proposed definition of saliency is sufficiently general that it may be applied to alternative forms of data, including 3D mesh models, audio signals, or any form of data with locality in space or time. To demonstrate this, we present saliency output based on the information measure for a mesh model. Given that vertices are sparsely represented in coordinate space, in contrast to the continuous (discretized) grid representation present for images, some differences are necessary in how likelihood estimates are derived. In this case, the spatial support is defined according to the nearest spatial neighbours of each vertex and, instead of colour values, each vertex belonging to the mesh is characterized by a three-dimensional vector defining the surface normal in the x, y and z directions; computation is otherwise identical to the process outlined in the equation above. An example of the output associated with two different neighbourhood sizes is shown in the accompanying figure (saliency for different scales on a mesh model; results correspond to a surround based on a smaller number of nearest neighbours, left, and a larger number of nearest neighbours, right, respectively). For demonstrative purposes the output for two individual spatial scales is shown rather than the maximum across scales; red indicates high levels of saliency and green low levels, and values on the mesh are histogram equalized to equate any contrast differences. It is interesting to note that this saliency metric and output is very similar to proposals in computer graphics for determining local mesh saliency in the service of mesh simplification. Note that this method allows determination of a characteristic scale for vertices on the mesh; in addition to defining saliency, this may also be useful for inferring the relationship between different parts (e.g. hands and fingers). There is considerable generality in that the measure of saliency assumed is agnostic to the features considered, with a few caveats. Given that our results are based on local colour values, this implies a relatively low-dimensional feature space on which likelihoods are estimated. However, one can imagine an analogous scenario wherein each image location is characterized by a feature vector (e.g. the outputs of a bank of filters), resulting in much higher dimensionality in the statistics. As dimensionality increases in feature space, the finite number of samples within a local spatial or temporal window implies an exponential decline in the sample density for likelihood estimation. This consideration can be addressed by applying an approximation based on marginal statistics; such an approximation relies on assumptions such as independence, which may be approximately achieved for arbitrary data sets by first encoding raw feature values via stacked sparse autoencoders or related feature learning strategies. One might also note that saliency values may be assigned to units across different layers of a hierarchical representation based on such a feature representation. Saliency, context and human vision. Solutions to quantifying visual saliency based on deep learning have begun to appear in the literature. This has been made possible in part by efforts to scale up data collection via crowdsourcing, defining tasks that serve as an approximation of traditional gaze-tracking studies. Recent, yet-to-be-published methods of this variety show considerable improvement on some standard benchmarks over traditional models. It is therefore interesting to consider what differences exist between such approaches and more traditional approaches premised on measures of local feature contrast. To this end, we present some examples in the corresponding figure where the output differs significantly between a model based on deep learning (SALICON) and one based on feature contrast (MISS). The importance of this example is in highlighting different aspects of saliency
computation that contribute to the bigger picture it is evident that models capable of detecting specific objects and modeling context are may perform well on saliency benchmarks however it is also evident that there is some deficit in their capacity to represent saliency defined by strong feature contrast or according to factors of importance in human visual search behavior in the same vane in human vision hierarchical feature extraction from edges to complex objects and local measures for gain control normalization and feature contrast play significant role all acting in concert it is therefore natural to entertain the idea that comprehensive solution to the problem involves considering both features of the nature implemented in deep learning models coupled with contrastive saliency akin to miss in practice the role of salience in distributed representation in modulating object and context specific signals presents one promising avenue for addressing this problem it has been argued that normalization is canonical operation in sensory neural information processing under the assumption of generalized gaussian statistics it can be shown that divisive normalization implements an operation equivalent to log likelihood of neural response in reference to cells in the surround the nature of computation assumed by miss therefore finds strong correlate in basic operations that implement feature contrast in human vision and that pairs naturally with the structure of computation associated with representing objects and context figure examples where deep learning model produces counterintuitive results relative to models based on feature contrast top original image middle salicon output bottom miss output discussion in this paper we present generalized information theoretic characterization of saliency based on maxima in information this definition is shown to be related to variety of classic research contributions in theory interest point detection bilateral filtering and existing models of visual saliency based on relatively simplistic definition the proposal is shown to be competitive against contemporary saliency models for both fixation based and object based saliency prediction this also includes demonstration of the relative robustness to image transformations and generalization of the proposal to broad range of data types finally we motivate an important distinction between contextual and contrast related factors in driving saliency and draw connections to associated mechanisms for saliency computation in human vision acknowledgments the authors acknowledge financial support from the nserc canada discovery grants program university of manitoba gets funding and onr grant references koenderink the structure of images biological cybernetics lindeberg theory basic tool for analyzing structures at different scales journal of applied statistics kadir and brady saliency scale and image description ijcv jagersand saliency maps and attention selection in scale and spatial coordinates an information theoretic approach in iccv pages ieee mikolajczyk et al comparison of affine region detectors ijcv bruce and tsotsos saliency based on information maximization nips pages toews and wells for image feature detection and featurebased classification of volumetric brain images in cvpr workshops pages ieee borji sihite and itti quantitative analysis of agreement in visual saliency modeling comparative study ieee tip perona shiota and malik anisotropic diffusion in diffusion in computer vision pages springer 
buades coll and morel algorithm for image denoising in cvpr volume pages ieee paris and durand fast approximation of the bilateral filter using signal processing approach in eccv pages springer lmj florack bm ter haar romeny jan koenderink and max viergever general intensity transformations and differential invariants journal of mathematical imaging and vision mokhtarian and suomela robust image corner detection through curvature scale space ieee pami harel koch and perona visual saliency nips li hou koch rehg and yuille the secrets of salient object segmentation cvpr pages arbelez barron marques and malik multiscale combinatorial grouping cvpr pages carreira and sminchisescu cpmc automatic object segmentation using constrained parametric ieee tpami itti koch and niebur model of visual attention for rapid scene analysis ieee pami hou and zhang dynamic visual attention searching for coding length increments nips pages zhang tong marks shan and cottrell sun bayesian framework for saliency using natural statistics journal of vision hou harel and koch image signature highlighting sparse salient regions ieee tpami lebor and pardo on the relationship between optical variability visual saliency and eye fixations computational approach journal of vision achanta hemamiz estrada and susstrunk salient region detection cvpr workshops pages cheng mitra huang torr and hu global contrast based salient region detection ieee tpami perazzi krahenbuhl pritch and hornung saliency filters contrast based filtering for salient region detection cvpr pages tal margolin makes patch distinct cvpr pages andreopoulos and tsotsos on sensor bias in experimental methods for comparing saliency and recognition algorithms ieee tpami lee varshney and jacobs mesh saliency acm siggraph pages bruce and tsotsos saliency attention and visual search an information theoretic approach journal of vision gao and vasconcelos saliency computational principles biological plausibility and implications for neurophysiology and psychophysics neural computation jiang et al salicon saliency in context cvpr pages 
fighting bandits with new kind of smoothness jacob abernethy university of michigan jabernet chansoo lee university of michigan chansool ambuj tewari university of michigan tewaria abstract we provide new analysis framework for the adversarial bandit problem using the notion of convex smoothing we define novel family of algorithms with minimax optimal regret guarantees first we show that regularizationp via the tsallis entropy which includes as special case matches the minimax regret with smaller constant factor second we show that wide class of perturbation methods achieve regret as low as log as long as the perturbation distribution has bounded hazard function for example the gumbel weibull frechet pareto and gamma distributions all satisfy this key property and lead to algorithms introduction the classic bandit mab problem generally attributed to the early work of robbins poses generic online decision scenario in which an agent must make sequence of choices from fixed set of options after each decision is made the agent receives some feedback in the form of loss or gain associated with her choice but no information is provided on the outcomes of alternative options the agent goal is to minimize the total loss over time and the agent is thus faced with the balancing act of both experimenting with the menu of choices while also utilizing the data gathered in the process to improve her decisions the mab framework is not only mathematically elegant but useful for wide range of applications including medical experiments design gittins automated poker playing strategies van den broeck et and hyperparameter tuning pacula et early mab results relied on stochastic assumptions iid on the loss sequence auer et gittins et lai and robbins as researchers began to establish guarantees for sequential decision problems such as prediction with expert advice littlestone and warmuth natural question arose as to whether similar guarantees were possible for the bandit setting the pioneering work of auer freund and schapire answered this in the affirmative by showing that their algorithm possesses regret bounds with matching lower bounds attention later turned to the bandit version of online linear optimization and several associated guarantees were published the following decade abernethy et dani and hayes dani et flaxman et mcmahan and blum nearly all proposed methods have relied on particular algorithmic blueprint they reduce the bandit problem to the setting while using randomization to make decisions and to estimate the losses family of algorithms for the setting is follow the regularized leader ftrl which optimizes the objective function of the following form arg min where is the decision set is an estimate of the cumulative loss vector and is regularizer convex function with suitable curvature to stabilize the objective the choice of regularizer is critical to the algorithm performance for example the algorithm auer regularizes with the entropy function and achieves nearly optimal regret bound when is the probability simplex for general convex set however other regularizers such as barrier functions abernethy et have tighter regret bounds another class of algorithms for the full information setting is follow the perturbed leader ftpl kalai and vempala whose foundations date back to the earliest work in adversarial online learning hannan here we choose distribution on rn sample random vector and solve the following linear optimization problem arg min ftpl is computationally simpler than ftrl due to the 
linearity of the objective, but it is analytically much more complex due to the randomness: for every different choice of perturbation distribution, an entirely new set of techniques had to be developed (Devroye et al., van Erven et al.). Rakhlin et al. and Abernethy et al. made some progress towards unifying the analysis framework; their techniques, however, are limited to the full-information setting. In this paper we propose a new analysis framework for the bandit problem that unifies the regularization and perturbation algorithms. The key element is a new kind of smoothness property, which we call differential consistency; it allows us to generate a wide class of both optimal and near-optimal algorithms for the adversarial bandit problem. We summarize our main results: (i) we show that regularization via the Tsallis entropy leads to an adversarial MAB algorithm matching the minimax regret rate of Audibert and Bubeck with a tighter constant, and, interestingly, our algorithm fully generalizes Exp3; (ii) we show that a wide array of noise distributions lead to regret bounds matching those of Exp3; furthermore, our analysis reveals a strikingly simple and appealing sufficient condition for achieving such regret: the hazard rate function of the noise distribution must be bounded by a constant. We conjecture that this requirement is in fact both necessary and sufficient. Prediction algorithms for the multi-armed bandit. Let us now introduce the adversarial multi-armed bandit problem. On each round t, a learner must choose a distribution p_t over the set of N available actions; the adversary (nature) chooses a vector g_t of losses; the learner samples i_t ~ p_t and plays action i_t. After selecting this action, the learner observes only the value g_{t,i_t} and receives no information as to the values g_{t,i} for i ≠ i_t. This limited-information feedback is what makes the bandit problem much more challenging than the full-information setting, in which the entire g_t is observed. The learner's goal is to minimize the regret, defined to be the difference between the realized loss and the loss of the best fixed action in hindsight: Regret_T := max_i Σ_{t≤T} g_{t,i} − Σ_{t≤T} g_{t,i_t}. To be precise, we consider the expected regret, where the expectation is taken with respect to the learner's randomization. Loss versus gain: we use the term loss to refer to g, although the maximization in the regret definition would imply that g should be thought of as a gain instead; we use the former term, however, as we impose the assumption that the entries of g_t are non-positive throughout the paper. The algorithmic template. Our results focus on a particular algorithmic template, described in Framework 1, which is a slight variation of the gradient-based prediction algorithm (GBPA) of Abernethy et al. Note that the algorithm (i) maintains an unbiased estimate of the cumulative losses, (ii) updates by adding a single-round estimate that has only one non-zero coordinate, and (iii) uses the gradient of a convex function Φ as the sampling distribution p_t. The choice of Φ is flexible, but it must be a differentiable convex function and its gradient must always be a probability distribution. Framework 1 may appear restrictive, but it has served as the basis for much of the published work on adversarial MAB algorithms (Auer et al., Kujala and Elomaa, Neu and Bartók). First, the GBPA framework essentially encompasses all FTRL and FTPL algorithms (Abernethy et al.), which are the core techniques not only for the full-information settings but also for the bandit settings. Second, the estimation scheme ensures that the cumulative estimate remains an unbiased estimate of the true cumulative loss. Although there is some flexibility, any unbiased estimation scheme requires some form of importance weighting: standard arguments tell us that unbiased estimates of a quantity observed with only probability p must necessarily involve fluctuations that scale as the inverse of p. A minimal code sketch of the GBPA loop is given below, followed by the framework specification.
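The following is a minimal sketch of the GBPA loop under simplifying assumptions: losses are taken in [0, 1] rather than the paper's sign convention, and the sampling distribution is produced by a generic grad_potential callback. As an example callback, the gradient of the Tsallis-entropy-smoothed potential discussed later in this section is included, computed by bisection on the normalizing constant; the parameterization (η, α and the bisection details) is illustrative, not the paper's.

```python
import numpy as np

def tsallis_distribution(G_hat, eta=0.1, alpha=0.5, iters=60):
    """Sampling distribution of FTRL with (negative) Tsallis-entropy regularization:
    p = argmin_{p in simplex} eta*<p, G_hat> - S_alpha(p).  Stationarity gives
    p_i proportional to (G_hat_i - Z)^(-1/(1-alpha)); the normaliser Z < min_i G_hat_i
    is found by bisection so that the p_i sum to one."""
    n = len(G_hat)
    c = eta * (1.0 - alpha) / alpha
    p_of = lambda Z: (c * (G_hat - Z)) ** (-1.0 / (1.0 - alpha))
    lo = G_hat.min() - (n ** (1.0 - alpha)) / c      # sum(p) <= 1 at this endpoint
    hi = G_hat.min() - 1e-12                         # sum(p) diverges at this endpoint
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if p_of(mid).sum() < 1.0 else (lo, mid)
    p = p_of(lo)
    return p / p.sum()

def gbpa(losses, grad_potential=tsallis_distribution, seed=0):
    """GBPA template: sample i_t from grad_potential(G_hat), observe one loss,
    add the importance-weighted one-coordinate estimate to G_hat."""
    rng = np.random.default_rng(seed)
    T, n = losses.shape
    G_hat, total = np.zeros(n), 0.0
    for t in range(T):
        p = grad_potential(G_hat)
        i = rng.choice(n, p=p)                       # play arm i_t
        total += losses[t, i]
        G_hat[i] += losses[t, i] / p[i]              # unbiased single-coordinate estimate
    return total                                     # regret = total - best fixed arm in hindsight
```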
gbpa template for bandit gbpa is differentiable convex function such that and ri for all initialize for to do nature loss vector gt is chosen by the adversary sampling learner chooses it according to the distribution cost learner gains loss gt it gt it estimation learner guesses it update it lemma define maxi gi so that we can write the expected regret of gbpa as pt eregrett gt hr gt then the expected regret of the gbpa can be written as eregrett it ei overestimation penalty underestimation penalty where the expectations are over the sampling of it divergence penalty proof let be valid convex function for the gbpa consider gbpa being run on the loss sequence gt the algorithm produces sequence of estimated losses now consider which is gbpa run with the full information on the deterministic loss sequence there is no estimation step and the learner updates directly the regret of this run can be written as pt hr and gt by the convexity of hence it suffices to show that the has regret at most the righthand side of equation which is fairly result in online learning literature see for example and lugosi theorem or abernethy et section for completeness we included the full proof in appendix new kind of smoothness what has emerged as guiding principle throughout machine learning is that enforcing stability of an algorithm can often lead immediately to performance is small modifications of the input data should not dramatically alter the output in the context of gbpa algorithmic stability is guaranteed as long as the dervative is lipschitz abernethy et al explored set of conditions on that lead to optimal regret guarantees for the setting indeed this work discussed different settings where the regret depends on an upper bound on either the nuclear norm or the operator norm of this hessian in short regret in the full information setting relies on the smoothness of the choice of in the bandit setting however merely uniform bound on the magnitude of is insufficient to guarantee low regret the regret lemma involves terms of the form where the incremental quantity can scale as large as the inverse of the smallest probability of what is needed is stronger notion of the smoothness that bounds in correspondence with and we propose the following definition definition differential consistency for constants we say that convex function is if for all ri we now prove useful bound that emerges from differential consistency and in the following two sections we shall show how this leads to regret guarantees theorem suppose is for constants then divergence penalty at time in lemma can be upper bounded as eit ri proof for the sake of clarity we drop the subscripts we use to denote the cumulative estimate to denote the marginal estimate and to denote the true loss gt note that by definition of algorithm is sparse vector with one and coordinate gt plus it is conditionally independent given for fixed it let reit so that teit eit now we can it write pn eit it dr ds pn ri rei ei dr ds pn rei ri dr ds pn ri ri dr ds pn ri dr ds pn pn ri ri the first inequality is by the supposition and the second inequality is due to the convexity of which guarantees that ri is an increasing function in the coordinate interestingly this part of the proof critically depends on the fact that the we are in the loss setting where is always minimax bandit algorithm via tsallis smoothing the design of bandit algorithm in the adversarial setting proved to be challenging task ignoring the dependence on for the moment we note that the initial 
published work on provided only an guarantee auer et and it was not until the final version of this work auer et that the authors obtained the optimal rate for the more general setting of online linear optimization several rates were achieved dani and hayes flaxman et mcmahan and blum before the desired was obtained abernethy et dani et we can view as an instance of gbpa where the potential function is the fenchel conjugate of the shannon entropy for any the negative shannon entropy is defined as gi in pi log pi and its fenchel conjugate is hp fact we have expression for the supremum log exp by inspecting the gradient of the above expression it is easy to see that chooses the distribution pt rh every round the tighter bound given by auer et al scaled according to log and the authors provided matching lower bound of the form it remained an open question for some time whether there exists minimax optimal algorithm that does not contain the log term until audibert and bubeck proposed the implicitly normalized forecaster inf the inf is implicitly defined via potential function with certain properties it was not immediately clear from this result how to define algorithm using the tools of regularization and bregman divergence more recently audibert et al improved upon audibert and bubeck extending the results to the combinatorial setting and they also discovered that inf can be interpreted in terms of bregman divergences we give here reformulation of inf that leads to very simple analysis in terms of our notion of differential consistency our reformulation can be viewed as variation of where the key modification is to replace the shannon entropy function with the tsallis for parameter this particular function proposed by tsallis possesses number of natural properties the tsallis entropy is in fact generalization of the shannon entropy as one obtains the latter as special case of the former asymptotically that is it is easy to prove the following uniform convergence as we emphasize again that one can easily show that bandit algorithm is indeed identical to inf using the appropriate parameter mapping although our analysis is simpler due to the notion of differential consistency definition theorem let hp gi then the gbpa has regret at most eregret before proving the theorem we note that it immediately recovers the upper bound as special case an easy application of rule shows that as log and choosing log we see that the side of tends to log however the choice is clearly not the optimal choice as we show in the following statement which directly follows from the theorem once we see that corollary for any if we choose then we have nt eregret in particular the choice of gives regret of no more than proof of theorem we will bound each penalty term in lemma since is the underestimation penalty is upper bounded by and the overestimation penalty is at most min the minimum of occurs at hence overestimation penalty more precisely the function we give here is the negative tsallis entropy according to its original definition now it remains to upper bound the divergence penalty with we observe that forward calculus gives let be the indicator function of that is for and for it is clear that is the dual of the function and moreover we observe that is of at following the setup of penot taking advantage of proposition in the latter reference we conclude that is of at hence diag for any what we have stated indeed is that is thus applying theorem gives pi and noting that the and the are dual to each other we can 
apply inequality to any probability distribution pn to obtain so the divergence penalty is at most which completes the proof bandit algorithms via stochastic smoothing let be continuous distribution over an unbounded support with probability density function and cumulative density function consider the gbpa where iid zn max gi zi which is stochastic smoothing of maxi gi function since the max function is convex is also convex by bertsekas we can swap the order of differentiation and expectation iid zn where arg max gi zi even if the function is not differentiable everywhere the swapping is still possible with any subgradient as long as they are bounded hence the ties between coordinates which happen with probability zero anyways can be resolved in an arbitrary manner it is clear that is in the probability simplex and note that zn gi zi gj zj gi pzi zi gi gi where gj zj the unbounded support condition guarantees that this partial derivative is for all given any so satisfies the requirements of algorithm connection to follow the perturbed leader there is straightforward way to efficiently implement the sampling step of the bandit gbpa algorithm with stochastically smoothed function instead of evaluating the expectation of equation we simply take random sample in fact this is equivalent to follow the perturbed leader algorithm ftpl kalai and vempala for bandit settings on the other hand implementing the estimation step is hard because generally there is no expression for to address this issue neu and proposed geometric resampling gr gr uses an iterative resampling process to estimate ri this process gives an unbiased estimate when allowed to run for an unbounded number of iterations even when we truncate the resampling process after iterations the extra regret due to the estimation bias is at most since the em additive term lower bound for the bandit problem is any choice of does not affect the asymptotic regret of the algorithm in summary all our gbpa regret bounds in this section hold for the corresponding ftpl algorithm with an extra additive em term in the bound despite the fact that algorithms provide natural randomized decision strategy they have seen little applications mostly because they are hard to analyze but one should expect general results to be within reach the algorithm for example can be viewed through the lens of perturbations where the noise is distributed according to the gumbel distribution indeed an early result of kujala and elomaa showed that mab strategy comes about through the use of noise and the same perturbation strategy has more recently been utilized in the work of neu and and et al however more general understanding of perturbation methods has remained elusive for example would gaussian noise be sufficient for guarantee what about say the weibull distribution hazard rate analysis in this section we show that the performance of the gbpa can be characterized by the hazard function of the smoothing distribution the hazard rate is standard tool in survival analysis to describe failures due to aging for example an increasing hazard rate models units that deteriorate with age while decreasing hazard rate models units that improve with age counter intuitive but not illogical possibility to the best of our knowledge the connection between hazard rates and design of adversarial bandit algorithms has not been made before definition hazard rate function hazard rate function of distribution is for the rest of the section we assume that is unbounded in the direction of so 
that the hazard function is everywhere this assumption is for the clarity of presentation and can be easily removed appendix theorem the regret of the gbpa on ez maxi gi is at most hd sup hd zn max zi proof we analyze each penalty term in lemma due to the convexity of the underestimation penalty is the overestimation penalty is clearly at most zn maxi zi and lemma proves the sup hd upper bound on the divergence penalty it remains to provide the tuning parameter suppose we scale the perturbation by we add to each coordinate it is easy to see that xi for the divergence penalty let be the cdf of the scaled random variable observe that and thus hence the hazard rate scales by which completes the proof lemma the divergence penalty of the gbpa with maxi gi zi is at most sup hd each round proof recall the gradient expression in equation the diagonal entry of the hessian is rii gi gi gi gi sup gi gi gi gi sup ri where gj zj which is random variable independent of zi we now apply theorem with and sup to complete the proof distribution gumbel frechet weibull pareto xm gamma supx hd as at most at at as maxn zi log log log log log log log param log exponential log exponential table distributions that give log regret ftpl algorithm the parameterization follows wikipedia pages for easy lookup we denote the euler constant by distributions marked with need to be slightly modified using the conditioning trick explained in appendix the maximum of frechet hazard function has to be computed numerically elsayed but elementary calculations show that it is bounded by appendix corollary follow the perturbed leader algorithm with distributions in table restricted to certain range of parameters combined with geometric resampling section with nt has an expected regret of order log table provides the two terms we need to bound we derive the third column of the table in appendix using extreme value theory embrechts et note that our analysis in the proof of lemma is quite tight the only place we have an inequality is when we upper bound the hazard rate it is thus reasonable to pose the following conjecture conjecture if distribution has monotonically increasing hazard rate hd that does not converge as gaussian then there is sequence of losses that will incur at least linear regret the intuition is that if adversary keeps incurring high loss for the arm then with high probability gi will be large so the expectation in equation will be dominated by the hazard function evaluated at large values of gi acknowledgments abernethy acknowledges the support of nsf under career grant tewari acknowledges the support of nsf under career grant references abernethy hazan and rakhlin methods for and bandit online learning ieee transactions on information theory abernethy lee sinha and tewari online linear optimization via smoothing in colt pages audibert and bubeck minimax policies for adversarial and stochastic bandits in colt pages audibert bubeck and lugosi minimax policies for combinatorial prediction games in colt auer using confidence bounds for the journal of machine learning research auer freund and schapire gambling in rigged casino the adversarial bandit problem in focs auer and fischer analysis of the multiarmed bandit problem machine learning auer freund and schapire the nonstochastic multiarmed bandit problem siam journal of computuataion issn bertsekas stochastic optimization problems with nondifferentiable cost functionals journal of optimization theory and applications issn and lugosi prediction learning and games 
cambridge university press dani and hayes robbing the bandit less regret in online geometric optimization against an adaptive adversary in soda pages dani hayes and kakade the price of bandit information for online optimization in nips devroye lugosi and neu prediction by perturbation in conference on learning theory pages elsayed reliability engineering wiley series in systems engineering and management wiley isbn url https embrechts and mikosch modelling extremal events for insurance and finance applications of mathematics springer isbn url https flaxman kalai and mcmahan online convex optimization in the bandit setting gradient descent without gradient in soda pages isbn gittins quantitative methods in the planning of pharmaceutical research drug information journal gittins glazebrook and weber bandit allocation indices john wiley sons hannan approximation to bayes risk in repeated play in dresher tucker and wolfe editors contributions to the theory of games volume iii pages kalai and vempala efficient algorithms for online decision problems journal of computer and system sciences neu valko and munos efficient learning by implicit exploration in bandit problems with side observations in nips pages curran associates kujala and elomaa on following the perturbed leader in the bandit setting in algorithmic learning theory pages springer lai and robbins asymptotically efficient adaptive allocation rules advances in applied mathematics littlestone and warmuth the weighted majority algorithm information and computation issn mcmahan and blum online geometric optimization in the bandit setting against an adaptive adversary in colt pages neu and an efficient algorithm for learning with feedback in algorithmic learning theory pages springer pacula ansel amarasinghe and oreilly hyperparameter tuning in adaptive operator selection in applications of evolutionary computation pages springer penot and conjugation nonlinear analysis theory methods applications url http rakhlin shamir and sridharan relax and randomize from value to algorithms in advances in neural information processing systems pages robbins some aspects of the sequential design of experiments bull amer math tsallis possible generalization of statistics journal of statistical physics van den broeck driessens and ramon tree search in poker using expected reward distributions in advances in machine learning pages springer van erven kotlowski and warmuth follow the leader with dropout perturbations in colt 
beyond measurements structured estimation with designs vidyashankar sivakumar arindam banerjee department of computer science engineering university of minnesota twin cities sivakuma banerjee pradeep ravikumar department of computer science university of texas austin pradeepr abstract we consider the problem of structured estimation with normregularized estimators such as lasso when the design matrix and noise are drawn from distributions existing results only consider designs and noise and both the sample complexity and estimation error have been shown to depend on the gaussian width of suitable sets in contrast for the setting we show that the sample complexity and the estimation error will depend on the exponential width of the corresponding sets and the analysis holds for any norm further using chaining we show that the exponential width for any set will be at most log times the gaussian width of the set yielding gaussian width based results even for the case further for certain popular estimators viz lasso and group lasso using vcdimension based analysis we show that the sample complexity will in fact be the same order as gaussian designs our general analysis and results are the first in the setting and are readily applicable to special families such as and distributions introduction we consider the following problem of high dimensional linear regression where rn is the response vector has independent isotropic random rows rn has entries and the number of covariates is much larger compared to the number of samples given and assuming that is structured usually characterized as having small value according to some norm the problem is to recover close to considerable progress has been made over the past decade on structured estimation using suitable or regression of the form ky λn argmin where is suitable norm and λn is the regularization parameter early work focused on estimation of sparse vectors using the lasso and related estimators where sample complexity of such estimators have been rigorously established based on the rip restricted isometry property and the more general re restricted eigenvalue conditions several subsequent advances have considered structures beyond using more general norms such as overlapping group sparse norms norm nuclear norm and so on in recent years much of the literature has been unified and nonasymptotic estimation error bound analysis techniques have been developed for regularized estimation with any norm in spite of such advances most of the existing literature relies on the assumption that entries in the design matrix are in particular recent unified treatments based on decomposable norms atomic norms or general norms all rely on concentration properties of subgaussian distributions certain estimators such as the dantzig selector and variants consider constrained problem rather than regularized problem as in but the analysis again relies on entries of being for the setting of constrained estimation building on prior work by outlines possible strategy for such analysis which can work for any distribution but works out details only for the case in recent work considered design matrices but with noise and suggested modifying the estimator in via type estimator based on multiple estimates of from in this paper we establish results for the estimation problem as in for any norm under the assumption that elements xij of the design matrix follow subexponential distribution whose tails are dominated by scaled versions of the symmetric exponential distribution 
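For reference, the estimator and the tail condition discussed above can be written out as follows. This is a hedged reconstruction in standard notation; the constants C, c and the exact normalization of the squared-loss term are illustrative, since the displayed equations did not survive extraction.

```latex
\hat{\theta} \;=\; \operatorname*{arg\,min}_{\theta \in \mathbb{R}^p}
  \ \frac{1}{2n}\,\lVert y - X\theta \rVert_2^2 \;+\; \lambda_n\, R(\theta),
\qquad y = X\theta^* + \omega,
\qquad
\mathbb{P}\bigl(\lvert x_{ij} \rvert \ge t\bigr) \;\le\; C \exp(-t/c)
  \quad \text{for all } t \ge 0,
```

for suitable constants C, c > 0; the last display is the standard sub-exponential tail-domination condition on the design entries.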
exp for all and for suitable constants to understand the motivation of our work note that in most of machine learning and statistics unlike in compressed sensing the design matrix can not be chosen but gets determined by the problem in many application domains like finance climate science ecology social network analysis variables with heavier tails than are frequently encountered for example in climate science to understand the relationship between extreme value phenomena like heavy precipitation variables from the distributions are used while high dimensional statistical techniques have been used in practice for such applications currently lacking is the theoretical guarantees on their performance note that the class of distributions have heavier tails compared to but have all moments to the best of our knowledge this is the first paper to analyze regularized estimation problems of the form with design matrices and noise where is in our main result we obtain bounds on the estimation error the optimal structured parameter the sample complexity bounds are log worse compared to the case for example for the norm we obtain sample complexity bound instead of log for the case the analysis depends on two key ingredients which have been discussed in previous work the satisfaction of the re condition on set which is the error set associated with the norm and the design interaction manifested in the form of lower bounds on the regularization parameter specifically the re condition depends on the properties of the design matrix we outline two different approaches for obtaining the sample complexity to satisfy the re condition one based on the exponential width of and another based on the of linear predictors drawn from for two widely used cases lasso and we show that the based analysis leads to sharp bound on the sample complexity which is exactly the same order as that for design matrices in particular for lasso with log samples are sufficient to satisfy the re condition for designs further we show that the bound on the regularization parameter depends on the exponential width we ωr of the unit norm ball ωr rp through careful argument based chaining we show that for any set rp the exponential width we cwg log where wg is the gaussian width of the set and is an absolute constant recent advances on computing or bounding wg for various structured sets can then be used to bound we again for the case of lasso we ωr log the rest of the paper is organized as follows in section we describe various aspects of the problem and highlight our contributions in section we establish key result on the relationship between gaussian and exponential widths of sets which will be used for our subsequent analysis in section we establish results on the regularization parameter λn re constant and the we show some experimental results before concluding in section estimation error background and preliminaries in this section we describe various aspects of the problem introducing notations along the way and highlight our contributions throughout the paper values of constants change from line to line problem setup we consider the problem defined in the goal of this paper is to establish conditions for consisˆ tent estimation and derive bounds on error set under the assumption λn the error vector lies in cone regularization parameter for λn xθ following analysis in restricted eigenvalue re conditions for consistent estimation the design matrix should satisfy the following re condition inf on the error set for some constant the 
re sample complexity is the number of samples required to satisfy the re condition and has been shown to be related to the gaussian width of the error set deterministic recovery bounds if satisfies the re condition on the error set and λn satisfies cψ λn with high probability the assumptions stated earlier show the error bound for some constant where is the norm compatibility constant norm regularization one example for we will consider throughout the paper is the norm regularization in particular we will always consider norms another popular example we consider is the norm let gng denote collection of groups which are blocks of any vector rp for any vector rp let θng denote vector with coordinates θing θi if gng else θing let ng be the maximum size of any group in the group sparse setting for any subset sg ng with cardinality sg we assume that the parameter vector rp satisfies sg such vector is called sg sparse we will focus on png the case when kθ contributions one of our major results is the relationship between the gaussian and exponential width of sets using arguments from generic chaining existing analysis frameworks for our problem for and obtain results in terms of gaussian widths of suitable sets associated with the norm for and this dependency in some cases is replaced by the exponential width of the set by establishing precise relationship between the two quantities we leverage existing results on the computation of gaussian widths for our scenario another contribution is obtaining the same order of the re sample complexity bound as for the case for and norms while this strong result has already been explored in for we adapt it for our analysis framework and also extend it to the setting as for the application of our work the results are applicable to all distributions which by definition are distributions admitting density density of the form eψ with any concave function this covers many practically used distributions including extreme value distributions relationship between gaussian and exponential widths in this section we introduce complexity parameter of set we which we call the exponential width of the set and establish sharp upper bound for it in the gaussian width of the set wg in particular we prove the inequality we wg log for some fixed constant to see the connection with the rest of the paper remember that our subsequent results for λn and are expressed in terms of the gaussian width and exponential width of specific sets associated with the norm with this result we establish precise sample complexity bounds by leveraging body of literature on the computation of gaussian widths for various structured sets we note that while the exponential width has been defined and used earlier see for to the best of our knowledge this is the first result establishing the relation between the gaussian and exponential widths of sets our result relies on generic chaining generic chaining gaussian width and exponential widths consider process xt hh ti indexed by set rp where each element hi has mean it follows from the definition that the process is centered xt we will also assume for convenience that set is finite also for any consider canonical distance metric we are interested in computing the quantity xt now for reasons detailed in the supplement nconsider that we split into sequence of subsets with for and tm for some large let function πn tn defined as πn tn maps each point to some point tn closest according to the set tn and the associated function πn define partition an of the 
set each element of the partition an has some element and all closest to it according to the map πn also the size of the partition an are called admissible sequences in generic chaining note that there are multiple admissible sequences corresponding to multiple ways of defining the sets tm we will denote by an the diameter of the element an distance metric defined as an sups definition given and metric space we define γα inf sup an where the inf is taken over all possible admissible sequences of the set gaussian width let xt hg ti where each element gi is the quantity wg xt is called the gaussian width of the set define the distance metric ks the relation between gaussian width and the is seen from the following result from theorem of stated below wg note that following theorem in any process which satisfies the concentration bound xt exp satisfies the upper bound in exponential width let xt he ti where each element ei is is centered exponential random variable satisfying exp define the distance metrics ks and ks the quantity we xt is called the exponential width of the set by theorem and theorem in for some universal constant we satisfies we note that which satisfies concentration bound xt any process satisfies the upper bound in the above inequality exp min an upper bound for the exponential width in this section we prove the following relationship between the exponential and gaussian widths theorem for any set rp for some constant the following holds we wg log proof the result depends on geometric results lemma and theorem in theorem consider countable set rp and number assume that the gaussian width is bounded then there is decomposition where such that ls lsu ls ls where is some universal constant and is the unit norm ball in rp we first examine the exponential widths of the sets and for the set we su wg wg where the first inequality follows from and the second inequality follows from we will need the following result on bounding the exponential width of an unit ball in dimensions to compute the exponential width of the proof given in the supplement is based on the fact he ti and then using simple union bound argument to bound lemma consider the set rp then for some universal constant we sup he ti log the exponential width of is we we we wg we wg log the first equality follows from as is subset of norm ball the second inequality follows from elementary properties of widths of sets and the last inequality follows from lemma now as stated in theorem in and is any number greater than we choose log and noting that log log for some constant yields we lwg log we lwg log the final step following arguments as theorem is to bound exponential width of set we suphh ti sup hh sup hh we we lwg log this proves theorem recovery bounds if the regularization parameter λn we obtain bounds on the error vector and the re condition is satisfied on the error set with re constant then obtain the following error bound for some constant λn where is the norm compatibility constant given by regularization parameter as discussed earlier for our analysis the regularization parameter should satisfy λn observe that for the linear model is the noise implying that λn with denoting random vector with entries sup sup he ui the first equality follows from the definition of dual norm the second inequality follows from the fact that and are independent of each other also by elementary arguments has entries with norm bounded by khxit the above argument was first proposed for the case in for design and noise the difference 
compared to the case is the dependence on the exponential width instead of the gaussian width of the unit norm ball using known results on the gaussian widths of unit and norms corollaries below are derived using the relationship between gaussian and exponential widths derived in section corollary if is the norm for design matrix and noise xθ log corollary if is the norm for design matrix and noise log ng log xθ the re condition for gaussian and previous work has established rip bounds of the form inf sup in particular rip is satisfied if the number of samples is of the order of square of the gaussian width of the error set which we will call the re sample complexity bound as we move to heavier tails establishing such bounds requires assumptions on the boundedness of the euclidean norm of the rows of on the other hand analysis of only the lower bound requires very few assumptions on in particular being the sum of random quantities the lower bound should be satisfied even with very weak moment assumptions on making these observations develop arguments obtaining re sample complexity bounds when set is the unit sphere even for design matrices having only bounded fourth moments note that with such weak moment assumptions upper bound can not be established our analysis for the re condition essentially follow this premise and arguments from bound based on exponential width we obtain sample complexity bound which depends on the exponential width of the error set the result we state below follows along similar arguments made in which in turn are based on arguments from theorem let have independent isotropic rows let and is constant that depends on the norm let we denote the exponential width of the set then for some with probability atleast exp inf cξ ξτ contrasting the result with previous results for the case the dependence on wg on the is replaced by we thus leading to log worse sample complexity bound the corollary below applies the result for the norm note that results from for norm show rip bounds for the same number of samples corollary for an and norm regularization if then with probability atleast exp and constants depending on and inf bound based on in this section we show stronger re sample complexity result for and regularization the arguments follow along similar lines to theorem let be random matrix with isotropic random rows xi rp let is constant that depends on the norm and define let we denote the exponential width of the set let cξ be with cξ for some suitable constant if then with probability atleast exp cξ inf consider the case of norm consequence of the above result is that the re condition is satisfied on the set for some where is constant that will depend on the re constant when is log the argument follows from the fact that is union of spheres thus the result is obtained by applying theorem to each sphere and using union bound argument the final step involves showing that the re condition is satisfied on the error set if it is satisfied on using maurey empirical approximation argument corollary for set which is the error set for the norm if log for some suitable constant then with probability atleast exp nβ where are constants the following result holds for depending on the constant inf essentially the same arguments for the norm lead to the following result corollary for set which is the error set for the norm if msg where sg log eng then with probability atleast exp nβ ng are constants and depending on constant inf recovery bounds for and norms we combine result with 
results obtained previously for λn and the RE condition for the ℓ1 and group-sparse norms. Corollary: for the ℓ1 norm, when the number of samples is at least a constant times s log p, with high probability the estimation error is bounded as in the general result, with the bound carrying an extra logarithmic factor in p relative to the sub-gaussian case. Corollary: for the group-sparse norm, when the number of samples is at least a constant times the group-sparse threshold involving the maximum group size and s_G log n_G, with high probability the estimation error again carries an extra logarithmic factor. Both bounds are a logarithmic factor worse than the corresponding bounds for the sub-gaussian case; in terms of sample complexity, to get an up-to-constant-order error bound, the sample size should scale with an additional logarithmic factor relative to s log p for the ℓ1 norm and relative to the usual group-sparse rate involving s_G log n_G for the sub-gaussian case. Experiments. We perform experiments on synthetic data to compare estimation errors for Gaussian and sub-exponential design matrices and noise, for both ℓ1 and group-sparse norms. For ℓ1 we run experiments with a fixed dimensionality and sparsity level; for group-sparse norms we run experiments with a fixed dimensionality, maximum group size and number of groups n_G, with groups of equal size. For the design matrix, in the Gaussian case we sample rows randomly from an isotropic Gaussian distribution, while for the sub-exponential design we sample each row randomly from an isotropic sub-exponential distribution. The number of samples is incremented in steps from an initial starting value. The noise is sampled from the Gaussian and sub-exponential distributions, with the same variance in the Gaussian and sub-exponential cases respectively. For each sample size we repeat the procedure a fixed number of times, and all results reported in the plots are average values over the runs. We report two sets of results. The first figure shows the percentage of success vs. sample size for the noiseless case; success in the noiseless case denotes exact recovery, which is possible when the RE condition is satisfied. Hence we expect the sample complexity for recovery to be of the order of the square of the Gaussian width for both Gaussian and sub-exponential distributions, as validated by the plots. Figure (curves: basis pursuit with Gaussian design, basis pursuit with sub-exponential design, group-sparse with Gaussian design, group-sparse with sub-exponential design; x-axis: number of samples; y-axis: probability of success): probability of recovery in the noiseless case with increasing sample size; there is a sharp phase transition, and the curves overlap for Gaussian and sub-exponential designs. The second figure shows average estimation error vs. number of samples for the noisy case; the noise is added only for runs in which exact recovery was possible in the noiseless case (for the smallest sample sizes, for example, we do not have any results, as even noiseless recovery is not possible). For each sample size the estimation errors are average values over the runs. As seen in the figure, the error decay is slower for sub-exponential distributions compared to the Gaussian case. Figure (curves: lasso with Gaussian design and noise, lasso with sub-exponential design and noise, group-sparse lasso with Gaussian design and noise, group-sparse lasso with sub-exponential design and noise; x-axis: number of samples; y-axis: estimation error): estimation error vs. sample size for the ℓ1 (left) and group-sparse (right) norms; the curve for sub-exponential designs and noise decays more slowly than for Gaussian matrices. Conclusions. This paper presents a unified framework for the analysis of estimation error and structured recovery in norm-regularized regression problems when the design matrix and noise are sub-exponential, essentially generalizing the corresponding analysis and results for the sub-gaussian case. The main observation is that the dependence on the Gaussian width is replaced by the exponential width of suitable sets associated with the norm; together with the result on the relationship between exponential and Gaussian widths, previous analysis techniques essentially carry over to the sub-exponential case. We also show that a stronger result exists for the RE condition for the lasso and group-lasso problems. As future work, we will consider extending the stronger result for the RE condition to all norms.
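As a concrete illustration of the synthetic setup described in the experiments section above, the following sketch compares Gaussian and sub-exponential (standard Laplace, rescaled to unit variance) designs and noise for the ℓ1 case. The dimensions, sparsity, regularization choice and use of scikit-learn's Lasso are illustrative assumptions, not the exact experimental settings of the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def run_trial(n, p, s, design="laplace", sigma=0.25, rng=None):
    """One synthetic l1 trial: s-sparse +/-1 signal, isotropic design, additive noise.
    design in {"gaussian", "laplace"}; Laplace entries are scaled to unit variance."""
    rng = rng or np.random.default_rng(0)
    theta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    theta[support] = rng.choice([-1.0, 1.0], size=s)

    if design == "gaussian":
        X = rng.standard_normal((n, p))
        w = rng.standard_normal(n)
    else:
        X = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(n, p))   # unit-variance Laplace
        w = rng.laplace(scale=1.0 / np.sqrt(2.0), size=n)
    y = X @ theta + sigma * w

    lam = 2.0 * sigma * np.sqrt(np.log(p) / n)      # illustrative lambda_n scaling
    est = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y)
    return np.linalg.norm(est.coef_ - theta)

for d in ("gaussian", "laplace"):
    errs = [np.mean([run_trial(n, 600, 10, design=d, rng=np.random.default_rng(t))
                     for t in range(5)]) for n in (100, 200, 400)]
    print(d, np.round(errs, 3))
```

The qualitative behaviour reported above (slower error decay for the heavier-tailed design and noise) can be reproduced with such a setup, though the exact curves depend on the chosen dimensions and regularization.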
acknowledgements this work was supported by nsf grants and by nasa grant references adamczak litvak pajor and restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling constructive approximation banerjee chen fazayeli and sivakumar estimation with norm regularization in nips bickel ritov and tsybakov simultaneous analysis of lasso and dantzig selector annals of statistics candes romberg and tao robust uncertainty principles exact signal reconstruction from highly incomplete frequency information ieee transactions on information theory candes and tao decoding by linear programming ieee transactions on information theory candes and tao the dantzig selector statistical estimation when is much larger than annals of statistics chandrasekaran recht parrilo and willsky the convex geometry of linear inverse problems foundations of computational mathematics chatterjee chen and banerjee generalized dantzig selector application to the norm in nips hsu and sabato regression with generalized in icml koltchinskii and mendelson bounding the smallest singular value of random matrix without concentration and mendelson sparse recovery under weak moment assumptions ledoux and talagrand probability in banach spaces isoperimetry and processes springer berlin meinshausen and yu recovery of sparse representations for data annals of statistics mendelson learning without concentration journal of the acm to appear mendelson and paouris on generic chaining and the smallest singular value of random matrices with heavy tails journal of functional analysis negahban ravikumar wainwright and yu unified framework for highdimensional analysis of with decomposable regularizers statistical science oliveira the lower tail of random quadratic forms with applications to ordinary least squares and restricted eigenvalue properties rudelson and zhou reconstruction from anisotropic random measurements ieee transaction on information theory talagrand the generic chaining springer berlin tropp convex recovery of structured signal from independent random linear measurements in sampling theory renaissance to appear vershynin introduction to the analysis of random matrices in eldar and kutyniok editors compressed sensing pages cambridge university press cambridge wainwright sharp thresholds for and noisy sparsity recovery using quadratic programmming lasso ieee transaction on information theory zhao and yu on model selection consistency of lasso journal of machine learning research 
spectral norm regularization of orthonormal representations for graph transduction rakesh shivanna google mountain view ca usa rakeshshivanna bibaswan chatterjee dept of computer science automation indian institute of science bangalore raman sankaran chiranjib bhattacharyya dept of computer science automation indian institute of science bangalore ramans chiru francis bach inria sierra normale paris france abstract recent literature suggests that embedding graph on an unit sphere leads to better generalization for graph transduction however the choice of optimal embedding and an efficient algorithm to compute the same remains open in this paper we show that orthonormal representations class of graph embeddings are pac learnable existing analysis do not apply as the vc dimension of the function class is infinite we propose an alternative bound which do not depend on the vc dimension of the underlying function class but is related to the famous function the main contribution of the paper is spore spectral regularized orthonormal embedding for graph transduction derived from the pac bound spore is posed as convex function over an elliptope these problems are usually solved as programs sdps with time complexity we present infeasible inexact proximal iip an inexact proximal method which performs subgradient procedure on an approximate projection not necessarily feasible iip is more scalable than sdp has an convergence and is generally applicable whenever suitable approximate projection is available we use iip to compute spore where the approximate projection step is computed by fista an accelerated gradient descent procedure we show that the method has convergence rate of the proposed algorithm easily scales to of vertices while the standard sdp computation does not scale beyond few hundred vertices furthermore the analysis presented here easily extends to the multiple graph setting introduction learning problems on data have received significant attention in recent years we study an instance of graph transduction the problem of learning labels on vertices of simple typical example is webpage classification where very small part of the entire web is manually classified even for simple graphs predicting binary labels of the unlabeled vertices is more formally let be simple graph with unknown labels without loss of generality let the labels of first vertices be observable let simple graph is an unweighted undirected graph with no self loops or multiple edges let ys and be the labels of andp given and ys the goal is to learn soft predictions rn such that yj is small where is any loss function the following formulation has been extensively used min ers where is kernel and is regularizer constant let be the solution to given and proposed the following generalization bound trp infn erv where are dependent on and trp kpii argued that trp should be constant and can be enforced by normalizing the diagonal entries of to be this is an important advice in graph transduction however it is to be noted that the set of normalized kernels is quite large and gives little insight in choosing the optimal kernel normalizing the diagonal entries of can be viewed geometrically as embedding the graph on unit sphere recently studied rich class of unit sphere graph embeddings called orthonormal representations and find that it is statistically consistent for graph transduction however the choice of the optimal orthonormal embedding is not clear we study orthonormal representations for the following equivalent kernel 
learning formulation of with λm ωc ys maxn αi αi αj yi yj kij αi αj from probably approximately correctly pac learning point of view note that the final predictions are given by kij yj where is the optimal solution to contributions we make the following contributions using we show the class of orthonormal representations are efficiently pac learnable over large class of graph families including and random graphs the above analysis suggests that spectral norm regularization could be beneficial in computing the best embedding to this end we pose the problem of spectral norm regularized orthonormal embedding spore for graph transduction namely that of minimizing convex function over an elliptope one could solve such problems as sdps which unfortunately do not scale well beyond few hundred vertices we propose an infeasible inexact proximal iip method novel projected subgradient descent algorithm in which the projection is approximated by an inexact proximal method we suggest novel approximation criteria which approximates the proximal operator for the support function of the feasible set within given precision one could compute an approximation to the projection from the inexact proximal point which may not be feasible hence the name iip we iip converges to the optimal minimum of convex function with rate in iterations the iip algorithm is then applied to the case when the set of interest is composed of the intersection of two convex sets the proximal operator for the support function of the set of interest can be obtained using the fista algorithm once we know the proximal operator for the support functions of the individual sets involved our analysis paves the way for learning labels on multiple graphs by using the embedding by adopting an mkl style approach we present both algorithmic and generalization results notations let kf denote the euclidean and frobenius norm respectively let sn and denote the set of square symmetric and square symmetric positive matrices respectively let be orthant let denote the dimensional simplex let for any sn let λn denote its eigenvalues we denote the adjacency matrix of graph by let denote the complement graph of with the adjacency matrix where is vector of all and is the identity matrix let yb be the label and spaces over given we use hng to denote and hinge loss and respectively the notations will denote standard measures in asymptotic analysis related work analysis was restricted to laplacian matrices and does not give insights in choosing the optimal unit sphere embedding studied graph transduction using pac model however for graph orthonormal embeddings there is no known sample complexity estimate showed that working with orthonormal embeddings leads to consistency however the choice of optimal embedding and an efficient algorithm to compute the same remains an open issue furthermore we show that sample complexity estimate is preliminaries an orthonormal embedding of simple graph is defined by matrix un such that and uj whenever kui lab denote the set of all possible orthonormal embeddings of the graph lab is an orthonormal embedding recently showed an interesting connection to the set of graph kernel matrices kii kij note that is positive semidefinite and hence there exists such that note that kij uj where ui is the column of hence by inspection it is clear that lab using similar argument we can show that for any lab the matrix thus the two sets lab and are equivalent furthermore orthonormal embeddings are associated with an interesting quantity the 
function however computing requires solving an sdp which is impractical generalization bound for graph transduction using orthonormal embeddings in this section we derive generalization bound used in the sequel for pac analysis we derive the following error bound valid for any orthonormal embedding supplementary material section theorem generalization bound let be simple graph with unknown binary labels on the vertices let given and labels of randomly drawn subgraph let ybn be the predictions learnt by ωc ys in then for with probability over the choice of such that hng yi log note that the above is bound in comparison to the expected analysis in also the above result suggests that graph embeddings with low spectral norm and empirical error lead to better generalization analysis in suggests that we should embed graph on unit sphere however does not help to choose the optimal embedding for graph transduction exploiting our analysis from we present spectral norm regularized algorithm in section we would also like to study pac learnability of orthonormal embeddings where pac learnability is defined as follows given does there exist an such that over the generalization error the quantity is termed as labelled sample complexity existing analysis do not apply to orthonormal embeddings as discussed in related work section theorem allows us to derive improved statistical estimates section spore formulation and pac analysis theorem suggests that penalizing the spectral norm of would lead to better generalization to this end we motivate the following formulation ψc ys min where ωc ys max gives an optimal orthonormal embedding the optimal which we will refer to as spore in this section we first study the pac learnability of spore and derive labelled sample complexity estimate next we study efficient computation of spore though spore can be posed as an sdp we show in section that it is possible to exploit the structure and solve efficiently given and ys the function ωc ys is convex in as it is the maximum of affine functions of the spectral norm of is also convex and hence is convex function furthermore is an elliptope convex body which can be described by the intersection of positive and affine constraints it follows that hence is convex usually these formulations are posed as sdps which do not scale beyond few hundred vertices in section we derive an efficient first order method which can solve for of vertices let be the optimal embedding computed from note that once the kernel is fixed the predictions are only dependent on ωc ysp let be the solution to ωc ys as in then the final predictions of is given by kij αj yj at this point we derive an interesting error convergence rate we gather two important results the proof of which appears in the supplementary material section lemma given simple graph lemma given and for any and ωc ys in the standard pac setting there is complete disconnection between the data distribution and target hypothesis however in the presence of unlabeled nodes without any assumption on the data it is impossible to learn labels following existing literature we work with similarity graphs where presence of an edge would mean two nodes are similar and derive the following supplementary material section theorem let be simple graph with unknown binary labels on the vertices given and labels of randomly drawn subgraph let be the predictions learnt by spore for parameters then for and ϑϑ with probability over the choice of such that nϑ log proof sketch let be the kernel learnt by spore 
using theorem and lemma for hng yi log from the primal formulation of using lemma and we get βϑ hng yi ωc ys ψc ys plugging back in choosing such that cm and optimizing for gives us the choice of parameters as stated finally using proves the result in the theorem above is the complement graph of the optimal orthonormal embedding tend to embed vertices to nearby regions if they have connecting edges hence the notion of similarity is implicitly captured in the embedding from for fixed and note that the error converges at faster rate for dense graph is small than for sparse graph is large such connections relating to graph structural properties were previously unavailable we also estimate labelled sample complexity by bounding by to obtain ϑn log this connection helps to reason the intuition that for sparse graph one would need larger number of labelled vertices than for dense graph for constants we obtain fractional labelled sample complexity estimate of which is icant improvement over the recently proposed the use of stronger machinery of rademacher averages supplementary material section instead of and specializing to spore allows us to improve over existing analysis the proposed sample complexity estimate is interesting for such graphs include random graphs and graphs inexact proximal methods for spore in this section we propose an efficient algorithm to solve spore see the optimization problem spore can be posed as an sdp generic sdp solvers have runtime complexity of and often does not scale well for large graphs we study methods such as projected subgradient procedures as an alternative to sdps for minimizing the main computational challenge in developing such procedures is that it is difficult to compute the projection on the elliptope one could potentially use the seminal dykstra algorithm of finding feasible point in the intersection of two convex sets the algorithm asymptotically finds point in the intersection this asymptotic convergence is serious disadvantage in the usage of dykstra algorithm as projection it would be useful to have an algorithm which after finite number of iterations yield an approximate projection and subsequent descent algorithm can yield convergent algorithm motivated by spore we study the problem of minimizing convex functions where the projection onto the feasible set can be computed only approximately recently there has been increasing interest in studying inexact proximal methods in the sequel we design an inexact proximal method which yields an algorithm to solve the algorithm is based on approximating the prox function by an iterative procedure which satisfies suitably designed criterion an infeasible inexact proximal iip algorithm let be convex function with properly defined at every consider the following optimization problem min subgradient projection iteration of the form px xk αk hk hk xk accurate solution by running the iterations number of times of rd on rd if px kv in many is often used to arrive at an where px is the projection situations such as it is not possible to accurately compute the projection in finite amount of time and one may obtain only an approximate projection using the moreau decomposition px proxσx one can compute the projection if one could compute proxσx where σa is the support function of and proxσx refers to the proximal operator for the function at as defined argmin kv we assume that one could compute zx not necessarily in such that pσx zx minn pσx and zx see that zx is an inexact prox and the resultant estimate of the 
projection can be infeasible but hopefully not too far away note that recovers the exact case the next theorem confirms that it is possible to converge to the true optimum for supplementary material section theorem consider the optimization problem starting from any where is solution of for every let us assume that we could obtain yk such that zk yk yk satisfy where yk xk αk hk αk khsk khk kxk then the iterates xk αk hk hk xk more general definition of the proximal operator is yield related work on inexact proximal methods there has been recent interest in deriving inexact proximal methods such as projected gradient descent see for comprehensive list of references to the best of our knowledge composite functions have been analyzed but no one has explored the case that is the results presented here are thus complementary to note the subtlety in using the proper approximation criteria using distance criterion between the true projection and the approximate projection or an approximate optimality criteria on the optimal distance would lead to worse bound using dual approximate optimality criterion here through the proximal operator for the support function is key as noted in and references therein as an immediate consequence of theorem note that suppose we have an algorithm to compute proxσx which guarantees after iterations that pσx zs min pσx for constant particular to the set over which pσx is defined we can initialize that may suggest that one could use iterations to yield ft where in remarks computational efficiency dictates that the number of projection steps kept at minimum to this end we see that number of projection steps need to be at least with the current choice of stepsizes let cp be the cost of one iteration of fista step and be the cost of one outer iteration the total computation cost can be then estimated as cp applying iip to compute spore the problem of computing spore can be posed as minimizing convex function over an intersection of two sets intersection of positive cone and polytope of equality constraints sn mij the algorithm described in theorem readily applies to the new setting if the projection can be computed efficiently the proximal operator for σx can be derived as proxσx argmin pσx vk σa σb this means that even if we do not have an efficient procedure for computing proxσx directly we can devise an algorithm to guarantee the approximation if we can compute proxσa and proxσb efficiently this can be done through the application of the popular fista algorithm for which also guarantees algorithm detailed in the supplementary named iip ist computes the following simple steps followed by the usual fista variable updates at each iteration gradient descent step on and with respect to the smooth term and proximal step with respect to σa and σb using the expressions supplementary material using the tools discussed above we design algorithm to solve the spore formulation using iip the proposed algorithm readily applies to general convex sets however we confine ourselves to specific sets of interest in our problem the following theorem states the convergence rate of the proposed procedure theorem consider the optimization problem with where and are and respectively starting from any the iterates kt in algorithm satisfy min kt proof is an immediate extension of theorem supplementary material section the derivation is presented in supplementary material claim algorithm iip for spore function pprox proj subg compute stepsize initialize for do compute subgradient of at see 
equation vt fista for steps use algorithm supp iip ist vt kt roja kt proxσa kt kt needs to be psd for the next svm call use supp end for end function equating the problem with the spore problem we have ωc βλ the set of subgradients of at the iteration is given by kt yαt αt βvt vt is returned by svm and vt is the eigen vector corresponding to kt where be diagonal matrix such that yii yi for and otherwise the step size is calculated using estimates of and which can be derived as nc for the spore problem check the supplementary material for the derivations multiple graph transduction multiple graph transduction is of recent interest in setting where individual views are expressed by graph this includes many practical problems in bioinformatics spam detection etc we propose an mkl style extension of spore with improved pac bounds formally the problem of multiple graph transduction is stated as let be set of simple graphs defined on common vertex set given and ys as before the goal is to accurately predict following the standard technique of taking convex combination of graph kernels we propose the following formulation ηk ys max φc ys min min ωc similar to theorem we can show the following supplementary material theorem nϑ log where min it immediately follows that combining multiple graphs improves the error convergence rate see and hence the labelled sample complexity also the bound suggests that the presence of at least one good graph is sufficient for to learn accurate predictions this motivates us to use the proposed formulation in the presence of noisy graphs section we can also apply the iip algorithm described in section to solve for supplementary material section experiments we conducted experiments on both real world and synthetic graphs to illustrate our theoretical observations all experiments were run on cpu with xeon processors cache and memory running centos αt ykt yα and vt kt αj table spore comparison dataset ks spore diabetes fourclass heart ionosphere sonar table large scale nodes dataset ks spore graph transduction spore we use two datasets uci and mnist for the uci datasets we use the rbf and threshold with the mean and for the mnist datasets we construct similarity matrix using cosine distance for random sample of nodes and threshold with to obtain unweighted graphs with labelled nodes we compare spore with formulation using graph kernels unnormalized laplacian normalized laplacian and where being diagonal matrix of degrees we choose parameters and by cross validation table summarizes the results averaged over different labelled samples with each entry being accuracy in loss function as expected from section spore significantly outperforms existing methods we also tackle large scale graph transduction problems table shows superior performance of algorithm for random sample of nodes with only outer iterations and inner projections multiple graph transduction we illustrate the effectiveness of combining multiple graphs using mixture of random graphs where we fix and the labels such that yi if otherwise an edge is present with probability if yi yj otherwise present with probability we generate three datasets to simulate homogenous heterogenous and noisy case shown in table table superior performance of graph homo heter noisy table synthetic multiple graphs dataset graph homo heter noisy union intersection majority multiple graphs was compared with individual graphs and with the union intersection and majority we use spore to solve for single graph transduction and the results 
were averaged over random samples of labelled nodes for the comparison metric as before table shows that combining multiple graphs improves classification accuracy furthermore the noisy case illustrates the robustness of the proposed formulation key observation from conclusion we show that the class of orthonormal graph embeddings are efficiently pac learnable our analysis motivates spectral norm regularized formulation spore for graph transduction using inexact proximal method we design an efficient first order method to solve for the proposed formulation the algorithm and analysis presented readily generalize to the multiple graphs setting acknowledgments we acknowledge support from grant from center for applied mathematics ifcam kxi the th entry of an rbf kernel is given by exp majority graph is graph where an edge is present if majority of the graphs have the edge where is set as the mean distance references ando and zhang learning on graph with laplacian regularization in nips balcan and blum an augmented pac model for learning in chapelle and zien editors learning mit press cambridge boyle and dykstra method for finding projections onto the intersection of convex sets in hilbert spaces in advances in order restricted statistical inference volume of lecture notes in statistics pages springer new york cormen leiserson rivest and stein introduction to algorithms volume mit press cambridge laurent and varvitsiotis forbidden minor characterizations for optimal solutions to semidefinite programs over the elliptope comb theory ser erdem and pelillo graph transduction as game neural computation goemans semidefinite programming in combinatorial optimization mathematical programming jethava martinsson bhattacharyya and dubhashi the function svms and finding large dense subgraphs in nips pages johnson and zhang on the effectiveness of laplacian normalization for graph learning jmlr lecun and cortes the mnist database of handwritten digits leordeanu zanfir and sminchisescu learning and optimization for hypergraph matching in iccv pages ieee lichman uci machine learning repository on the shannon capacity of graph information theory ieee transactions on parikh and boyd proximal algorithms foundations and trends in optimization schmidt roux and bach convergence rates of inexact methods for convex optimization in nips pages shivanna and bhattacharyya learning on graphs using orthonormal representation is statistically consistent in nips pages tran application of three graph laplacian based learning methods to protein function prediction problem ijbb villa salzo baldassarre and verri accelerated and inexact algorithms siam journal on optimization zhang and ando analysis of spectral kernel design based learning nips zhou bousquet lal weston and learning with local and global consistency nips zhou and burges spectral clustering and transductive learning with multiple views in icml pages acm 
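As a concrete companion to the formulation above, the following is a minimal NumPy sketch of the inner transductive step used throughout the paper: for a fixed unit-diagonal graph kernel K and labels y_S on the labelled vertex set S, it solves the box-constrained dual ω_C(K, y_S) and produces soft predictions f_j = Σ_{i∈S} K_ji α_i y_i for every vertex. The projected-gradient solver and the helper names fit_transductive_svm and predict_all_vertices are illustrative choices, not the authors' code (the paper invokes an SVM solver for this step inside its outer loop, which is not implemented here); K is assumed to be any positive semidefinite kernel with unit diagonal, for example one of the normalized graph kernels used as baselines in the experiments.

```python
import numpy as np

def fit_transductive_svm(K, y_S, S, C=1.0, n_iter=5000):
    """Projected-gradient solver for the box-constrained dual
        max_{0 <= alpha <= C}  sum_i alpha_i - (1/2) sum_{i,j in S} alpha_i alpha_j y_i y_j K_ij
    over the labelled vertex set S, for a fixed graph kernel K with unit diagonal."""
    K_SS = K[np.ix_(S, S)]
    Q = (y_S[:, None] * y_S[None, :]) * K_SS                    # Q_ij = y_i y_j K_ij (PSD)
    step = 1.0 / max(float(np.linalg.eigvalsh(Q)[-1]), 1e-12)   # 1/L, L = largest eigenvalue of Q
    alpha = np.zeros(len(S))
    for _ in range(n_iter):
        grad = 1.0 - Q @ alpha                                  # gradient of the concave dual objective
        alpha = np.clip(alpha + step * grad, 0.0, C)            # project back onto the box [0, C]^m
    return alpha

def predict_all_vertices(K, y_S, S, alpha):
    """Soft predictions f_j = sum_{i in S} K_ji alpha_i y_i; take the sign for hard labels."""
    return K[:, S] @ (alpha * y_S)
```

Computing the SPORE-optimal kernel itself additionally requires minimizing over the elliptope with the spectral-norm regularizer, which is where the inexact proximal machinery described above comes in; the sketch only covers the per-kernel SVM call.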
Convolutional Networks on Graphs for Learning Molecular Fingerprints

David, Dougal, Jorge, Rafael, Timothy Hirzel, Ryan Adams
Harvard University
(Equal contribution.)

Abstract

We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these features are more interpretable and have better predictive performance on a variety of tasks.

Introduction

Recent work in materials design used neural networks to predict the properties of novel molecules by generalizing from examples. One difficulty with this task is that the input to the predictor, a molecule, can be of arbitrary size and shape, whereas currently most machine learning pipelines can only handle inputs of fixed size. The current state of the art is to use off-the-shelf fingerprint software to compute fixed-dimensional feature vectors, and to use those features as inputs to a deep neural network or another standard machine learning method. This formula was followed by the work above; during training, the molecular fingerprint vectors were treated as fixed.

In this paper, we replace the bottom layer of this stack, the function that computes molecular fingerprint vectors, with a differentiable neural network whose input is a graph representing the original molecule. In this graph, vertices represent individual atoms and edges represent bonds. The lower layers of this network are convolutional in the sense that the same local filter is applied to each atom and its neighborhood. After several such layers, a global pooling step combines features from all the atoms in the molecule.

These neural graph fingerprints offer several advantages over fixed fingerprints:

Predictive performance. By using data to adapt to the task at hand, fingerprints can provide substantially better predictive performance than fixed fingerprints. We show that neural graph fingerprints match or beat the predictive performance of standard fingerprints on solubility, drug efficacy, and organic photovoltaic efficiency datasets.

Parsimony. Fixed fingerprints must be extremely large to encode all possible substructures without overlap; for example, the pipeline cited above used an extremely large fingerprint vector, even after having removed features. Differentiable fingerprints can be optimized to encode only relevant features, reducing downstream computation and regularization requirements.

Interpretability. Standard fingerprints encode each possible fragment completely distinctly, with no notion of similarity between fragments. In contrast, each feature of a neural graph fingerprint can be activated by similar but distinct molecular fragments, making the feature representation more meaningful.

Figure: Left, a visual representation of the computational graph of both standard circular fingerprints and neural graph fingerprints. First, a graph is constructed matching the topology of the molecule being fingerprinted, in which nodes represent atoms and edges represent bonds. At each layer, information flows between neighbors in the graph. Finally, each node in the graph turns on one bit in the fingerprint vector. Right, a more detailed sketch including the bond information used in each operation.

Circular fingerprints

The state of the art in molecular fingerprints are extended-connectivity circular fingerprints (ECFP). Circular fingerprints are a refinement of the Morgan algorithm, designed to encode which substructures are present in a molecule in a way that is invariant to the ordering of the atoms. Circular fingerprints generate each layer's features by applying a
fixed hash function to the concatenated features of the neighborhood in the previous layer the results of these hashes are then treated as integer indices where is written to the fingerprint vector at the index given by the feature vector at each node in the graph figure left shows sketch of this computational architecture ignoring collisions each index of the fingerprint denotes the presence of particular substructure the size of the substructures represented by each index depends on the depth of the network thus the number of layers is referred to as the radius of the fingerprints circular fingerprints are analogous to convolutional networks in that they apply the same operation locally everywhere and combine information in global pooling step creating differentiable fingerprint the space of possible network architectures is large in the spirit of starting from configuration we designed differentiable generalization of circular fingerprints this section describes our replacement of each discrete operation in circular fingerprints with differentiable analog hashing the purpose of the hash functions applied at each layer of circular fingerprints is to combine information about each atom and its neighboring substructures this ensures that any change in fragment no matter how small will lead to different fingerprint index being activated we replace the hash operation with single layer of neural network using smooth function allows the activations to be similar when the local molecular structure varies in unimportant ways indexing circular fingerprints use an indexing operation to combine all the nodes feature vectors into single fingerprint of the whole molecule each node sets single bit of the fingerprint to one at an index determined by the hash of its feature vector this operation converts an graph into vector for small molecules and large fingerprint length the fingerprints are always sparse we use the softmax operation as differentiable analog of indexing in essence each atom is asked to classify itself as belonging to single category the sum of all these classification label vectors produces the final fingerprint this operation is analogous to the pooling operation in standard convolutional neural networks algorithm circular fingerprints input molecule radius fingerprint length initialize fingerprint vector for each atom in molecule ra lookup atom features for to for each layer for each atom in molecule rn neighbors ra rn concatenate ra hash hash function mod ra convert to index fi write at index return binary vector algorithm neural graph fingerprints input molecule radius hidden weights hr output weights wr initialize fingerprint vector for each atom in molecule ra lookup atom features for to for each layer for each atom in molecule rn ra sum ri ra vhln smooth function softmax ra wl sparsify add to fingerprint return vector figure pseudocode of circular fingerprints left and neural graph fingerprints right differences are highlighted in blue every operation is replaced with differentiable analog canonicalization circular fingerprints are identical regardless of the ordering of atoms in each neighborhood this invariance is achieved by sorting the neighboring atoms according to their features and bond features we experimented with this sorting scheme and also with applying the local feature transform on all possible permutations of the local neighborhood an alternative to canonicalization is to apply function such as summation in the interests of simplicity and scalability we chose 
summation circular fingerprints can be interpreted as special case of neural graph fingerprints having large random weights this is because in the limit of large input weights tanh nonlinearities approach step functions which when concatenated form simple hash function also in the limit of large input weights the softmax operator approaches argmax operator which is analogous to an indexing operation algorithms and summarize these two algorithms and highlight their differences given fingerprint length and features at each layer the parameters of neural graph fingerprints consist of separate output weight matrix of size for each layer as well as set of weight matrices of size at each layer one for each possible number of bonds an atom can have up to in organic molecules experiments we ran two experiments to demonstrate that neural fingerprints with large random weights behave similarly to circular fingerprints first we examined whether distances between circular fingerprints were similar to distances between neural distances figure left shows scatterplot of pairwise distances between circular neural fingerprints fingerprints had length and were calculated on pairs of molecules from the solubility dataset distance was measured using continuous generalization of the tanimoto jaccard similarity measure given by distance min xi yi max xi yi there is correlation of between the distances the line of points on the right of the plot shows that for some pairs of molecules binary ecfp fingerprints have exactly zero overlap second we examined the predictive performance of neural fingerprints with large random weights that of circular fingerprints figure right shows average predictive performance on the solubility dataset using linear regression on top of fingerprints the performances of both methods follow similar curves in contrast the performance of neural fingerprints with small random weights follows different curve and is substantially better this suggests that even with random weights the relatively smooth activation of neural fingerprints helps generalization performance rmse log neural fingerprint distances neural vs circular distances circular fingerprints random conv with large parameters random conv with small parameters circular fingerprint distances fingerprint radius figure left comparison of pairwise distances between molecules measured using circular fingerprints and neural graph fingerprints with large random weights right predictive performance of circular fingerprints red neural graph fingerprints with fixed large random weights green and neural graph fingerprints with fixed small random weights blue the performance of neural graph fingerprints with large random weights closely matches the performance of circular fingerprints examining learned features to demonstrate that neural graph fingerprints are interpretable we show substructures which most activate individual features in fingerprint vector each feature of circular fingerprint vector can each only be activated by single fragment of single radius except for accidental collisions in contrast neural graph fingerprint features can be activated by variations of the same structure making them more interpretable and allowing shorter feature vectors solubility features figure shows the fragments that maximally activate the most predictive features of fingerprint the fingerprint network was trained as inputs to linear model predicting solubility as measured in the feature shown in the top row has positive predictive relationship with 
solubility and is most activated by fragments containing hydrophilic group standard indicator of solubility the feature shown in the bottom row strongly predictive of insolubility is activated by repeated ring structures fragments most activated by feature oh nh oh oh fragments most activated by feature figure examining fingerprints optimized for predicting solubility shown here are representative examples of molecular fragments highlighted in blue which most activate different features of the fingerprint top row the feature most predictive of solubility bottom row the feature most predictive of insolubility toxicity features we trained the same model architecture to predict toxicity as measured in two different datasets in figure shows fragments which maximally activate the feature most predictive of toxicity in two separate datasets fragments most activated by toxicity feature on dataset fragments most activated by toxicity feature on dataset figure visualizing fingerprints optimized for predicting toxicity shown here are representative samples of molecular fragments highlighted in red which most activate the feature most predictive of toxicity top row the most predictive feature identifies groups containing sulphur atom attached to an aromatic ring bottom row the most predictive feature identifies fused aromatic rings also known as polycyclic aromatic hydrocarbons carcinogen constructed similar visualizations but in way to determine which toxic fragments activated given neuron they searched over list of toxic substructures and chose the one most correlated with given neuron in contrast our visualizations are generated automatically without the need to restrict the range of possible answers beforehand predictive performance we ran several experiments to compare the predictive performance of neural graph fingerprints to that of the standard setup circular fingerprints fed into neural network experimental setup our pipeline takes as input the smiles string encoding of each molecule which is then converted into graph using rdkit we also used rdkit to produce the extended circular fingerprints used in the baseline hydrogen atoms were treated implicitly in our convolutional networks the initial atom and bond features were chosen to be similar to those used by ecfp initial atom features concatenated encoding of the atom element its degree the number of attached hydrogen atoms and the implicit valence and an aromaticity indicator the bond features were concatenation of whether the bond type was single double triple or aromatic whether the bond was conjugated and whether the bond was part of ring training and architecture training used batch normalization we also experimented with tanh vs relu activation functions for both the neural fingerprint network layers and the fullyconnected network layers relu had slight but consistent performance advantage on the validation set we also experimented with dropconnect variant of dropout in which weights are randomly set to zero instead of hidden units but found that it led to worse validation error in general each experiment optimized for minibatches of size using the adam algorithm variant of rmsprop that includes momentum hyperparameter optimization to optimize hyperparameters we used random search the hyperparameters of all methods were optimized using trials for each fold the following hyperparameters were optimized log learning rate log of the initial weight scale the log penalty fingerprint length fingerprint depth up to and the size of the hidden 
layer in the network additionally the size of the hidden feature vector in the convolutional neural fingerprint networks was optimized dataset units predict mean circular fps linear layer circular fps neural net neural fps linear layer neural fps neural net solubility log drug efficacy in nm photovoltaic efficiency percent table mean predictive accuracy of neural fingerprints compared to standard circular fingerprints datasets we compared the performance of standard circular fingerprints against neural graph fingerprints on variety of domains solubility the aqueous solubility of molecules as measured by drug efficacy the effective concentration in vitro of molecules against strain of falciparum the parasite that causes malaria as measured by organic photovoltaic efficiency the harvard clean energy project uses expensive dft simulations to estimate the photovoltaic efficiency of organic molecules we used subset of molecules from this dataset predictive accuracy we compared the performance of circular fingerprints and neural graph fingerprints under two conditions in the first condition predictions were made by linear layer using the fingerprints as input in the second condition predictions were made by neural network using the fingerprints as input in all settings all differentiable parameters in the composed models were optimized simultaneously results are summarized in table in all experiments the neural graph fingerprints matched or beat the accuracy of circular fingerprints and the methods with neural network on top of the fingerprints typically outperformed the linear layers software automatic differentiation ad software packages such as theano significantly speed up development time by providing gradients automatically but can only handle limited control structures and indexing since we required relatively complex control flow and indexing in order to implement variants of algorithm we used more flexible automatic differentiation package for python called autograd this package handles standard numpy code and can differentiate code containing while loops branches and indexing code for computing neural fingerprints and producing visualizations is available at limitations computational cost neural fingerprints have the same asymptotic complexity in the number of atoms and the depth of the network as circular fingerprints but have additional terms due to the matrix multiplies necessary to transform the feature vector at each step to be precise computing the neural fingerprint of depth fingerprint length of molecule with atoms using molecular convolutional net having features at each layer costs rn rn in practice training neural networks on top of circular fingerprints usually took several minutes while training both the fingerprints and the network on top took on the order of an hour on the larger datasets limited computation at each layer how complicated should we make the function that goes from one layer of the network to the next in this paper we chose the simplest feasible architecture single layer of neural network however it may be fruitful to apply multiple layers of nonlinearities between each step as in or to make information preservation easier by adapting the long memory architecture to pass information upwards limited information propagation across the graph the local architecture developed in this paper scales well in the size of the graph due to the low degree of organic molecules but its ability to propagate information across the graph is limited by the depth of the 
network this may be appropriate for small graphs such as those representing the small organic molecules used in this paper however in the worst case it can take depth network to distinguish between graphs of size to avoid this problem proposed hierarchical clustering of graph substructures network could examine the structure of the entire graph using only log layers but would require learning to parse molecules techniques from natural language processing might be fruitfully adapted to this domain inability to distinguish stereoisomers special bookkeeping is required to distinguish between stereoisomers including enantomers mirror images of molecules and isomers rotation around double bonds most circular fingerprint implementations have the option to make these distinctions neural fingerprints could be extended to be sensitive to stereoisomers but this remains task for future work related work this work is similar in spirit to the neural turing machine in the sense that we take an existing discrete computational architecture and make each part differentiable in order to do optimization neural nets for quantitative relationship qsar the modern standard for predicting properties of novel molecules is to compose circular fingerprints with neural networks or other regression methods used circular fingerprints as inputs to an ensemble of neural networks gaussian processes and random forests used circular fingerprints of depth as inputs to multitask neural network showing that multiple tasks helped performance neural graph fingerprints the most closely related work is who build neural network having inputs their approach is to remove all cycles and build the graph into tree structure choosing one atom to be the root recursive neural network is then run from the leaves to the root to produce representation because graph having nodes has possible roots all possible graphs are constructed the final descriptor is sum of the representations computed by all distinct graphs there are as many distinct graphs as there are atoms in the network the computational cost of this method thus grows as where is the size of the feature vector and is the number of atoms making it less suitable for large molecules convolutional neural networks convolutional neural networks have been used to model images speech and time series however standard convolutional architectures use fixed computational graph making them difficult to apply to objects of varying size or structure such as molecules more recently and others have developed convolutional neural network architecture for modeling sentences of varying length neural networks on fixed graphs introduce convolutional networks on graphs in the regime where the graph structure is fixed and each training example differs only in having different features at the vertices of the same graph in contrast our networks address the situation where each training input is different graph neural networks on graphs propose neural network model for graphs having an interesting training procedure the forward pass consists of running scheme to equilibrium fact which allows the gradient to be computed without storing the entire forward computation they apply their network to predicting mutagenesis of molecular compounds as well as web page rankings also propose neural network model for graphs with learning scheme whose inner loop optimizes not the training loss but rather the correlation between each vector and the training error residual they apply their model to dataset of boiling points 
of molecular compounds our paper builds on these ideas with the following differences our method replaces their complex training algorithms with simple gradientbased optimization generalizes existing circular fingerprint computations and applies these networks in the context of modern qsar pipelines which use neural networks on top of the fingerprints to increase model capacity unrolled inference algorithms and others have noted that iterative inference procedures sometimes resemble the feedforward computation of recurrent neural network one natural extension of these ideas is to parameterize each inference step and train neural network to approximately match the output of exact inference using only small number of iterations the neural fingerprint when viewed in this light resembles an unrolled algorithm on the original graph conclusion we generalized existing molecular features to allow their optimization for diverse tasks by making each operation in the feature pipeline differentiable we can use standard training methods to scalably optimize the parameters of these neural molecular fingerprints we demonstrated the interpretability and predictive performance of these new fingerprints features have already replaced features in speech recognition machine vision and processing carrying out the same task for virtual screening drug design and materials design is natural next step acknowledgments we thank edward jennifer wei and samsung advanced institute of technology for their support this work was partially funded by nsf references bastien pascal lamblin razvan pascanu james bergstra ian goodfellow arnaud bergeron nicolas bouchard and yoshua bengio theano new features and speed improvements deep learning and unsupervised feature learning nips workshop joan bruna wojciech zaremba arthur szlam and yann lecun spectral networks and locally connected networks on graphs arxiv preprint george dahl navdeep jaitly and ruslan salakhutdinov neural networks for qsar predictions arxiv preprint john delaney esol estimating aqueous solubility directly from molecular structure journal of chemical information and computer sciences gamo laura sanz jaume vidal cristina de cozar emilio alvarez lavandera dana vanderwall darren vs green vinod kumar samiul hasan et al thousands of chemical starting points for antimalarial lead identification nature robert glem andreas bender catrin arnby lars carlsson scott boyer and james smith circular fingerprints flexible molecular descriptors with applications from physical chemistry to adme idrugs the investigational drugs journal alex graves greg wayne and ivo danihelka neural turing machines arxiv preprint johannes hachmann roberto sule carlos roel aryeh leslie vogt anna brockway and the harvard clean energy project computational screening and design of organic photovoltaics on the world community grid the journal of physical chemistry letters john hershey jonathan le roux and felix weninger deep unfolding inspiration of novel deep architectures arxiv preprint sepp hochreiter and schmidhuber long memory neural computation sergey ioffe and christian szegedy batch normalization accelerating deep network training by reducing internal covariate shift arxiv preprint nal kalchbrenner edward grefenstette and phil blunsom convolutional neural network for modelling sentences proceedings of the annual meeting of the association for computational linguistics june diederik kingma and jimmy ba adam method for stochastic optimization arxiv preprint yann lecun and yoshua bengio 
Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks.
Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for molecules. Journal of Chemical Information and Modeling.
Alessio Micheli. Neural network for graphs: a contextual constructive approach. Neural Networks, IEEE Transactions on.
Morgan. The generation of a unique machine description for chemical structure. Journal of Chemical Documentation.
Travis Oliphant. Python for scientific computing. Computing in Science & Engineering.
Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. Massively multitask networks for drug discovery.
RDKit: cheminformatics. Accessed.
David Rogers and Mathew Hahn. Fingerprints. Journal of Chemical Information and Modeling.
Scarselli, Gori, Ah Chung Tsoi, Hagenbuchner, and Monfardini. The graph neural network model. Neural Networks, IEEE Transactions on, Jan.
Richard Socher, Eric Huang, Jeffrey Pennin, Christopher Manning, and Andrew Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems.
Richard Socher, Jeffrey Pennington, Eric Huang, Andrew Ng, and Christopher Manning. Recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Kai Sheng Tai, Richard Socher, and Christopher Manning. Improved semantic representations from long memory networks. arXiv preprint.
Challenge. National Center for Advancing Translational Sciences. http. Online; accessed.
Thomas Unterthiner, Andreas Mayr, Klambauer, and Sepp Hochreiter. Toxicity prediction using deep learning. arXiv preprint.
Thomas Unterthiner, Andreas Mayr, Klambauer, Marvin Steijaert, Wenger, Hugo Ceulemans, and Sepp Hochreiter. Deep learning as an opportunity in virtual screening. In Advances in Neural Information Processing Systems.
Li Wan, Matthew Zeiler, Sixin Zhang, Yann Cun, and Rob Fergus. Regularization of neural networks using DropConnect. In International Conference on Machine Learning.
David Weininger. SMILES, a chemical language and information system. Journal of Chemical Information and Computer Sciences.
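To make the neural graph fingerprint forward pass described in the pseudocode above concrete, here is a minimal NumPy sketch. It is a simplification rather than the authors' implementation: it uses a single hidden weight matrix per layer instead of one per possible bond count, ignores bond features, and covers only the forward pass (the paper differentiates through this computation with autograd during training); the function names neural_fingerprint and softmax and the toy molecule at the end are illustrative only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def neural_fingerprint(atom_feats, neighbors, H, W, radius):
    """Forward pass of a (simplified) neural graph fingerprint.

    atom_feats : (n_atoms, d) array of initial atom features (the lookup step)
    neighbors  : list of neighbor-index lists, one per atom (the molecular graph)
    H          : list of (d, d) hidden weight matrices, one per layer
    W          : list of (d, fp_len) output weight matrices, one per layer
    """
    fp_len = W[0].shape[1]
    fingerprint = np.zeros(fp_len)
    r = np.asarray(atom_feats, dtype=float)
    for layer in range(radius):
        new_r = np.zeros_like(r)
        for a, nbrs in enumerate(neighbors):
            v = r[a] + r[nbrs].sum(axis=0)               # sum self + neighbor features
            new_r[a] = np.tanh(v @ H[layer])              # smooth analogue of the hash function
            fingerprint += softmax(new_r[a] @ W[layer])   # soft analogue of "write a 1 at an index"
        r = new_r
    return fingerprint

# Toy usage: a 3-atom path "molecule", 4 atom features, fingerprint length 8.
rng = np.random.default_rng(0)
atoms = rng.random((3, 4))
neighbors = [[1], [0, 2], [1]]
H = [rng.normal(size=(4, 4)) for _ in range(2)]
W = [rng.normal(size=(4, 8)) for _ in range(2)]
print(neural_fingerprint(atoms, neighbors, H, W, radius=2))
```

Setting the weights to large random values and replacing the softmax write by a hard argmax write would recover the hashing and indexing behavior of circular fingerprints, which is the limiting argument made in the text.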
Mixed Submodular Partitioning: Fast Algorithms, Guarantees, and Applications

Kai, Rishabh, Shengjie, Wenruo, Jeff
Department of Electrical Engineering, University of Washington; Department of Computer Science, University of Washington
kaiwei, rkiyer, wangsj, wrbai, bilmes

Abstract

We investigate two novel mixed submodular data partitioning problems that we collectively call submodular partitioning. These problems generalize purely robust instances of the problem, namely submodular fair allocation (SFA) and submodular load balancing (SLB), and also purely average-case instances, that is, the submodular welfare problem (SWP) and submodular multiway partition (SMP). While the robust versions have been studied in the theory community, existing work has focused on tight approximation guarantees, and the resultant algorithms are not generally scalable to large applications. This is in contrast to the average case, where most of the algorithms are scalable. In the present paper we bridge this gap by proposing several new algorithms, including greedy and relaxation algorithms, that not only scale to large datasets but also achieve theoretical approximation guarantees comparable to those of the existing, non-scalable methods. We moreover provide new scalable algorithms that apply to additive combinations of the robust and average-case objectives. We show that these problems have many applications in machine learning (ML), including data partitioning and load balancing for distributed ML, data clustering, and image segmentation. We empirically demonstrate the efficacy of our algorithms on problems involving data partitioning for distributed optimization of convex and deep neural network objectives, and also on purely unsupervised image segmentation.

Introduction

The problem of data partitioning is of great importance to many machine learning (ML) and data science applications, as is evidenced by the wealth of clustering procedures that have been, and continue to be, developed and used. Most data partitioning problems are based on expected, or average-case, utility objectives, where the goal is to optimize a sum of cluster costs; this includes the most ubiquitous clustering procedures. Other algorithms are based on robust objective functions, where the goal is to optimize the worst-case cluster cost. Such robust algorithms are particularly important in mission-critical applications such as parallel and distributed computing, where one single poor partition block can significantly slow down an entire parallel machine, as all compute nodes might need to spin while waiting for a slow node to complete a round of computation. Taking a weighted combination of both robust and average-case objective functions allows one to balance between optimizing worst-case and overall performance. We are unaware, however, of any previous work that allows for mixing between worst-case and average-case objectives in the context of data partitioning. This paper studies two new mixed partitioning problems of the following form:

Problem 1:  max over π ∈ Π of  [ λ min_i f_i(A_i^π) + (1 − λ) (1/m) Σ_j f_j(A_j^π) ],
Problem 2:  min over π ∈ Π of  [ λ max_i f_i(A_i^π) + (1 − λ) (1/m) Σ_j f_j(A_j^π) ],

where the set of sets π = (A_1^π, ..., A_m^π) is a partition of a finite set V (that is, ∪_i A_i^π = V and A_i^π ∩ A_j^π = ∅ for i ≠ j), and Π refers to the set of all partitions of V into m blocks. The parameter λ controls the objective: λ = 0 is the average case, λ = 1 is the robust case, and 0 < λ < 1 is a mixed case. In general, Problems 1 and 2 are hopelessly intractable, even to approximate, but we assume that the f_i are all monotone non-decreasing (f_i(S) ≤ f_i(T) whenever S ⊆ T), normalized (f_i(∅) = 0), and submodular (f_i(S) + f_i(T) ≥ f_i(S ∪ T) + f_i(S ∩ T) for all S, T ⊆ V). These assumptions allow us to develop fast, simple, and scalable algorithms that have approximation guarantees, as is done in this paper. These assumptions moreover allow us to retain the naturalness and applicability of Problems 1 and 2 to a wide variety of practical problems.
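As a small illustration of the two objectives just defined, the sketch below evaluates the λ-mixed worst-case and average-case values of a candidate partition, using the convex-combination form written above. Everything in the example is a hypothetical stand-in rather than material from the paper: the weighted-coverage functions built by make_f (sums of square roots of modular functions, hence monotone, normalized, and submodular), the random weight matrix w, and the three-block partition.

```python
import numpy as np

def mixed_objectives(partition, fs, lam):
    """Evaluate the lambda-mixed objectives of Problems 1 and 2 for a candidate partition.

    partition : list of m disjoint blocks (each a set of item indices) covering V
    fs        : list of m monotone, normalized, submodular functions, fs[i](block) -> float
    lam       : mixing weight; lam = 1 is the robust case, lam = 0 the average case
    """
    vals = np.array([f(block) for f, block in zip(fs, partition)])
    avg = vals.mean()                             # (1/m) * sum_j f_j(A_j)
    p1 = lam * vals.min() + (1.0 - lam) * avg     # Problem 1 objective (to be maximized)
    p2 = lam * vals.max() + (1.0 - lam) * avg     # Problem 2 objective (to be minimized)
    return p1, p2

# Hypothetical monotone submodular functions: weighted coverage
# f(A) = sum_u sqrt(sum_{a in A} w[a, u]), a sum of concave functions of
# modular functions, hence monotone, normalized, and submodular.
rng = np.random.default_rng(0)
w = rng.random((12, 5))                           # 12 items, 5 "features"
def make_f(weights):
    return lambda block: float(np.sqrt(weights[sorted(block)].sum(axis=0)).sum()) if block else 0.0

fs = [make_f(w) for _ in range(3)]                # homogeneous setting: identical f_i
partition = [set(range(0, 4)), set(range(4, 8)), set(range(8, 12))]
print(mixed_objectives(partition, fs, lam=0.5))
```

With λ = 0 both values reduce to the average-case objectives (the SWP and SMP settings discussed below), and with λ = 1 they reduce to the purely robust SFA and SLB objectives.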
submodularity is natural property in many ml applications when minimizing submodularity naturally model notions of interacting costs and complexity while when maximizing it readily models notions of diversity summarization quality and information hence problem asks for partition whose blocks each and that collectively are good say summary of the whole problem on the other hand asks for partition whose blocks each and that collectively are internally homogeneous as is typical in clustering taken together we call problems and submodular partitioning we further categorize these problems depending on if the fi are identical to each other homogeneous or not heterogeneous the heterogeneous case clearly generalizes the homogeneous setting but as we will see the additional homogeneous structure can be exploited to provide more efficient tighter algorithms problem approximation factor in rch atching llipsoid nm log log reedw elfare reed mm min problem approximation factor min ampling log llipsoid log reed elax log mm reed ax omb fa max βα λβ ov round omb lb min mα eneral reed eneral ov hardness hardness hardness hardness table summary of our contributions and existing work on problems and see text for details previous work special cases of problems and have appeared previously problem with is called submodular fair allocation sfa and problem with is called submodular load balancing slb robust optimization problems both of which previously have been studied when fi are all modular slb is called minimum makespan scheduling an lp relaxation algorithm provides for the heterogeneous setting when the objectives are submodular the problem becomes much harder even in the homogeneous setting show that the problem is information theoretically hard to approximate within log they provide balanced partitioning algorithm yielding factor of min under the homogeneous setting they also give algorithm achieving log for the homogeneous setting however the algorithm is not practical and scalable since it involves solving in the log instances of submodular function minimization each of which requires computation where is the cost of function valuation another approach approximates each submodular function by its ellipsoid approximation again and reduces slb to its modular version minimum makespan scheduling leading to an approximation factor of log sfa on the other hand has been studied mostly in the heterogeneous setting when are all modular the tightest algorithm so far is to iteratively round an lp solution achieving approximation whereas the problem is to approximate for any when fi are submodular gives algorithm with factor approximation that performs poorly when proposes binary search algorithm yielding an improved factor of similar to slb applies the same ellipsoid similar have been called the uniform the case in the past results obtained in this paper are marked as methods for only the homogeneous setting are marked as approximation techniques leading to factor of log these approaches are theoretically interesting but they do not scale to large problems problems and when have also been previously studied problem becomes the submodular multiway partition smp for which one can obtain relaxation based in the homogeneous case in the heterogeneous case the guarantee is log similarly propose greedy splitting algorithm for the homogeneous setting problem becomes the submodular welfare for which scalable greedy algorithm achieves approximation unlike the worst case many of the algorithms proposed for these problems are 
scalable the general case of problems and differs from either of these extreme cases since we wish both for robust and average case partitioning and controlling allows one to trade off between the two as we shall see the flexibility of mixture can be more natural in certain applications applications there are number of applications of submodular partitioning in ml as outlined below some of these we evaluate in section submodular functions naturally capture notions of interacting cooperative costs and homogeneity and thus are useful for clustering and image segmentation while the average case instance has been used before more variant problem with is useful to produce balanced clusterings the submodular valuations of all the blocks should be similar to each other problem also addresses problem in image segmentation namely how to use only submodular functions which are instances of functions for image segmentation problem addresses this problem by allowing each segment to have its own submodular function fj and the objective measures the homogeneity fj aπj of segment based on the image pixels aπj assigned to it moreover by combining the average case and the worst case objectives one can achieve tradeoff between the two empirically we evaluate our algorithms on unsupervised image segmentation section and find that it outperforms other clustering methods including spectral clustering and graph cuts submodularity also accurately represents computational costs in distributed systems as shown in in fact considers two separate problems text data partitioning for balancing memory demands and parameter partitioning for balancing communication costs both are treated by solving an instance of slb problem where memory costs are modeled using submodular function and the communication costs are modeled using modular additive function another important ml application evaluated in section is distributed training of statistical models as data set sizes grow the need for statistical training procedures tolerant of the distributed data partitioning becomes more important existing schemes are often developed and performed assuming data samples are distributed in an arbitrary or random fashion as an alternate strategy if the data is intelligently partitioned such that each block of samples can itself lead to good approximate solution consensus amongst the distributed results could be reached more quickly than when under poor partitioning submodular functions can in fact express the value of subset of training data for certain machine learning risk functions using these functions within problem one can expect partitioning by formulating the problem as an instance of problem where each block is good representative of the entire set thereby achieving faster convergence in distributed settings we demonstrate empirically in section that this provides better results on several machine learning tasks including the training of deep neural networks our contributions in contrast to problems and in the average case existing algorithms for the worst case are not scalable this paper closes this gap by proposing three new classes of algorithmic frameworks to solve sfa and slb greedy algorithms algorithms and extension based relaxation algorithm for sfa when we formulate the problem as submodular maximization which can be approximated up to factor of with function evaluations for general we give simple and scalable greedy algorithm reed ax and show factor of in the homogeneous setting improving the factor of under the 
heterogeneous setting for the heterogeneous setting we propose saturate greedy algorithm reed at that iteratively solves instances of submodular welfare problems we show reed at has guarantee of which ensures at least dm blocks receive utility at least op for any for slb we first generalize thephardness result in and show that it is hard to approximate better than for any log even in the homogeneous setting we then give extension based relaxation algorithm ov round yielding tight factor of for the heterogeneous setting as far as we know this is the first algorithm achieving factor of for slb in this setting for both sfa and slb we also obtain more efficient algorithms with bounded approximation factors which we call mm in and mm ax next we show algorithms to handle problems and with general we first give two simple and generic schemes omb fa wp and omb lb mp both of which efficiently combines an algorithm for the problem special case with and an algorithm for the average case special case with to provide guarantee interpolating between the two bounds for problem we generalize reed at leading to eneral reed at whose guarantee smoothly interpolates in terms of between the factor by reed at in the case of and the constant factor of by the greedy algorithm in the case of for problem we generalize ov round to obtain relaxation algorithm eneral ov round that achieves an for general the theoretical contributions and the existing work for problems and are summarized in table lastly we demonstrate the efficacy of problem on unsupervised image segmentation and the success of problem to distributed machine learning including admm and neural network training robust submodular partitioning problems and when notation we define as the gain of in the context of we assume that the ground set is approximation algorithms for sfa problem with we first study approximation algorithms for sfa when the problem becomes where min and is submodular thanks to theorem theorem if and are monotone submodular min is also submodular proofs for all theorems in this paper are given in the simple randomized greedy algorithm therefore approximates sfa with to factor of matching the problem hardness for general we approach sfa from the perspective of the greedy algorithms in this work we introduce two variants of greedy algorithm reed ax alg and reed at alg suited to the homogeneous and heterogeneous settings respectively reed ax the key idea of reed ax see alg is to greedily add an item with the maximum marginal gain to the block whose current solution is minimum initializing ai with the empty sets the greedy flavor also comes from that it incrementally grows the solution by greedily improving the overall objective fi ai until ai forms partition besides its simplicity theorem offers the optimality guarantee theorem reed ax achieves guarantee of under the homogeneous setting by assuming the homogeneity of the fi we obtain very simple algorithm improving upon the factor thanks to the lazy evaluation trick as described in line in alg need not to recompute the marginal gain for every item in each round leading reed ax to scale to large data sets reed at though simple and effective in the homogeneous setting reed ax performs arbitrarily poorly under the heterogeneous setting to this end we provide another algorithm saturate greedy reed at see alg the key idea of reed at is to relax sfa to much simpler problem submodular welfare swp problem with similar pm in flavor to the one proposed in reed at defines an intermediate objective 
fic aπi where fic min fi line the parameter controls the saturation in each block fic satisfies submodularity for each unlike sfa the combinatorial optimization problem line is much easier and is an instance of swp in this work we solve line by the efficient greedy algorithm as described in with factor one can also use more computationally expensive relaxation algorithm as given in to solve line with tight factor setting the input argument as the approximation factor for line the essential idea of reed at is to perform binary search over the parameter to find the largest such that the returned solution for the instance of swp satisfies reed at terminates after mini fi solving log instances of swp theorem gives optimality guarantee theorem given and any reed at finds partition such that at least dm blocks receive utility at least mini fi aπi algorithm reed in algorithm reed ax input let am while do argminj aj aj aj end while output ai algorithm reed at algorithm mm in input fi pm let min fi ai let cmin cmax mini fi while cmax cmin do cmax cmin if αc then cmax else cmin end if end while output input fi partition let repeat for do pick supergradient mi at aπ for fi end for maxi mi aπ until output algorithm mm ax input fi partition let repeat for do pick subgradient hi at aπ for fi end for mini hi aπ until output algorithm ov round input let am while do argminj aj aj aj end while output ai input fi fi solve for via convex relaxation rounding let am for do argmaxi end for output ai for any theorem ensures that the top dm valued blocks in the partition returned by reed at are controls the between the number of top valued blocks to bound and the performance guarantee attained for these blocks the smaller is the more top blocks are bounded but with weaker guarantee we set the input argument or as the performance guarantee for solving swp so that the above theoretical analysis follows however the is often achieved only by very contrived submodular functions for the ones used in practice the greedy algorithm often leads to solution and our own observations setting as the actual performance guarantee for swp often very close to can improve the empirical bound and we in practice typically set to good effect mm ax lastly we introduce another algorithm for the heterogeneous setting called minorizationmaximization mm ax see alg similar to the one proposed in the idea is to iteratively maximize tight lower bounds of the submodular functions submodular functions have tight modular lower bounds which are related to the subdifferential of the submodular set function at set denote subgradient at by hy the extreme points of may be computed via greedy algorithm let be permutation of that assigns the elements in to the first positions if and only if each such permutation defines chain with elements siσ and an extreme point hy of has each entry as hy si defined as above hy forms lower bound of tight at hσy hσy and hσy the idea of mm ax is to consider modular lower bound tight at the set corresponding to each block of partition in other words at itert ation for each block we approximate fi with its modular lower bound tight at aπi and solve modular version of problem line which admits efficient approximation algorithms mm ax is initialized with partition which isp obtained by solving problem where each fi is replaced with simple modular function fi the following bound holds ai where log κf theorem mm ax achieves guarantee of mini is the partition obtained by the algorithm and is the curvature of submodular 
function at approximation algorithms for slb problem with we next investigate slb where existing hardness results are log which is independent of and implicitly assumes that log however applications for slb are often dependent on with we hence offer hardness analysis in terms of in the following theorem for any slb can not be approximated to factor of for any log with polynomial number of queries even under the homogeneous setting for the rest of the paper we assume log for slb unless stated otherwise reed in theorem implies that slb is hard to approximate better than however an arbitrary partition already achieves the best approximation factor of that one can hope for under the homogeneous setting since maxi aπi aπi maxi aπi for any in practice one can still implement greedy style heuristic which we refer to as reed in alg very similar to reed ax reed in only differs in line where the item with the smallest marginal gain is added since the functions are all monotone any additions to block can if anything only increase its value so we choose to add to the minimum valuation block in line to attempt to keep the maximum valuation block from growing further ov round next we consider the heterogeneous setting for which we propose tight algorithm ov round see alg the algorithm proceeds as follows apply the extension of submodular functions to relax slb to convex program which is exactly solved to fractional solution line map the fractional solution to partition using the technique as proposed in line the extension which naturally connects submodular function with its convex relaxation is defined as follows given any we obtain permutation σx by ordering its elements in order and thereby chain of sets snσx with sjσx σx σx for the extension pn σx for is the weighted sum of the ordered entries of σx sjσx given the convexity of the fi slb is relaxed to the following convex program min max fi xi xi for xm denoting the optimal solution for eqn as the step simply maps each item to block such that argmaxi the bound for ov round is as follows theorem ov round achieves approximation factor we remark that to the best of our knowledge ov round is the first algorithm that is tight and that gives an approximation in terms of for the heterogeneous setting mm in similar to mm ax for sfa we propose mm in see alg for slb here we iteratively choose modular upper bounds which are defined via superdifferentials of submodular function at moreover there are specific supergradients that define the following two modular upper bounds when referring to either one we use mfx then and and at iteration for each block mm in replaces fi with choice of its modular upper bound mi tight at aπi and solves modular version of problem line for which there exists an efficient lp relaxation based algorithm similar to mm axp the initial partition is obtained by solving problem where each fi is substituted with fi the following bound holds theorem mm in achieves guarantee of maxi aπm denotes the optimal partition aπ where on with admm on timit on mnist with distributed nn test accuracy test accuracy test accuracy submodular partition random partition submodular partition random partition submodular partition random partition number of iterations number of iterations number of iterations on with admm partition on timit with distributed nn on mnist with distributed nn test accuracy test accuracy test accuracy submodular partition adversarial partition random partition submodular partition random partition submodular partition random 
partition number of iterations number of iterations mnist number of iterations timit figure comparison between submodular and random partitions for distributed ml including admm fig and distributed neural nets fig and fig for the box plots the central mark is the median the box edges are and percentiles and the bars denote the best and worst cases general submodular partitioning problems and when in this section we study problem and problem in the most general case we first propose simple and general extremal combination scheme that works both for problem and it naturally combines an algorithm for solving the problem with an algorithm for solving the average case we use problem as an example but the same scheme easily works for problem denote lg wc as the algorithm for the problem sfa and lg ac as the algorithm for the average case swp the scheme is to first obtain partition by running lg wc on the instance of problem with and second partition by running lg ac with then we output one of and with which the higher valuation for problem is achieved we call this scheme omb fa wp suppose lg wc solves the problem with factor and lg ac for the average case with when applied to problem we refer to this scheme as omb lb mp and the following guarantee holds for both schemes βα theorem for any omb fa wp solves problem with factor max λβ βα in the heterogeneous case and max min λβ in the homogeneous case similarly omb lb mp solves problem with factor min mmα in the heterogeneous case mα and min in the homogeneous case the drawback of omb fa wp and omb lb mp is that they do not explicitly exploit the tradeoff between the and objectives in terms of to obtain more practically interesting algorithms we also give eneral reed at that generalizes reed pms at to solve lem similar to reed at we define an intermediate objective min ai pm fj aj in eneral reed at following the same algorithmic design as in reed at eneral reed at only differs from reed at in line where the submodular welfare problem is defined on the new objective in we show that eneral reed at gives approximation while also yielding guarantee that generalizes theorem in particular eneral reed at recovers the bicriterion guarantee as shown in theorem when in the case of eneral reed at recovers the guarantee of the greedy algorithm for solving the submodular welfare problem the objective moreover an improved guarantee is achieved by eneral reed at as increases details are given in to solve problem we generalize ov round leading to eneral ov round similar to ov round we relax each submodular objective as its convex relaxation using the extension almost the same as ov round eneral ov round only differs in line where problem is relaxed as the following convex program xm maxi xi pm pm λm fj xj xi for following the same rounding procedure as ov round eneral ov round is guaranteed to give an for problem with general details are given in experiments and conclusions we conclude in this section by empirically evaluating the algorithms proposed for problems and on data partitioning applications including distributed admm distributed deep neural network training and lastly unsupervised image segmentation tasks admm we first consider data partitioning for distributed convex optimization the evaluation task is text categorization on the newsgroup data set which consists of articles divided almost evenly across classes we formulate the classification as an regularized logistic regression which is solved by admm implemented as we run instances of random 
partitioning on the training data as baseline in this case we use the feature based function same as the one used in in the homogeneous setting of problem with we use reed ax as the partitioning algorithm in figure we observe that the resulting partitioning performs much better than random partitioning and significantly better than an adversarial partitioning formed by grouping similar items together more details are given in distributed deep neural network dnn training next we evaluate our framework on distributed deep neural network dnn training we test on two tasks handwritten digit recognition on the mnist database which consists of training and test samples phone classification on the timit data which has training and test samples dnn model is applied to the mnist experiment and we train dnn for timit for both experiments the submodular partitioning is obtained by solving the homogeneous case of problem using reed ax on form of clustered facility location as proposed and used in we perform distributed training using an averaging stochastic gradient descent scheme similar to the one in we also run instances of random partitioning as baseline as shown in figure and the submodular partitioning outperforms the random baseline an adversarial partitioning which is formed by grouping items with the same class in either case can not even be trained unsupervised image on original mentation we test the all of grabcut efficacy of problem on ground unsupervised image segtruth mentation over the grabcut data set color images and their ground truth labels by unsupervised we mean that no labeled data spectral clustering at any time in supervised or training graph cut nor any kind of interactive segmentation was used in submodular partitioning forming or optimizing the objective the submodular partitioning for each image figure unsupervised image segmentation right some examples is obtained by solving the homogeneous case of problem using modified variant of reed in on the facility location function we compare our method against the other unsupervised methods spectral clustering and graph cuts given an of an image and its ground truth labels we assign each of the blocks either to the foreground or background label having the larger intersection in fig we show example segmentation results after this mapping on several example images as well as averaged relative to ground truth over the whole data set more details are given in acknowledgments this material is based upon work supported by the national science foundation under grant no the national institutes of health under award and by google microsoft and an intel research award iyer acknowledges support from microsoft research fellowship this work was supported in part by terraswarm one of six centers of starnet semiconductor research corporation program sponsored by marco and darpa references arthur and vassilvitskii the advantages of careful seeding in soda asadpour and saberi an approximation algorithm for fair allocation of indivisible goods in sicomp boyd parikh chu peleato and eckstein distributed optimization and statistical learning via the alternating direction method of multipliers foundations and trends in machine learning buchbinder feldman naor and schwartz tight linear time for unconstrained submodular maximization in focs chekuri and ene approximation algorithms for submodular multiway partition in focs chekuri and ene submodular cost allocation problem and applications in automata languages and programming pages springer ene and wu 
local distribution and the symmetry gap approximability of multiway partitioning problems in soda fisher nemhauser and wolsey an analysis of approximations for maximizing submodular set in polyhedral combinatorics fujishige submodular functions and optimization volume elsevier gordaliza and review of robust clustering methods advances in data analysis and classification goemans harvey iwata and mirrokni approximating submodular functions everywhere in soda golovin fair allocation of indivisible goods technical report iyer jegelka and bilmes monotone closure of relaxed constraints in submodular optimization connections between minimization and maximization extended version iyer jegelka and bilmes fast semidifferential based submodular function optimization in icml jegelka and bilmes submodularity beyond submodular energies coupling edges in graph cuts in cvpr khot and ponnuswami approximation algorithms for the allocation problem in approx kolmogorov and zabin what energy functions can be minimized via graph cuts in tpami krause mcmahan guestrin and gupta robust submodular observation selection in jmlr lenstra shmoys and tardos approximation algorithms for scheduling unrelated parallel machines in mathematical programming li andersen and smola graph partitioning via parallel submodular approximation to accelerate distributed machine learning in arxiv preprint minoux accelerated greedy algorithms for maximizing submodular set functions in optimization techniques narasimhan jojic and bilmes in nips orlin faster strongly polynomial time algorithm for submodular function minimization mathematical programming povey zhang and khudanpur parallel training of deep neural networks with natural gradient and parameter averaging arxiv preprint svitkina and fleischer submodular approximation algorithms and lower bounds in focs optimal approximation for the submodular welfare problem in the value oracle model in stoc wei iyer and bilmes submodularity in data subset selection and active learning in icml wei iyer wang bai and bilmes mixed submodular partitioning fast algorithms guarantees and applications nips extended supplementary zhao nagamochi and ibaraki on generalized greedy splitting algorithms for multiway partition problems discrete applied mathematics 
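As an illustration of the GreedMax rule described in the contributions above (add the unassigned item with the largest marginal gain to the block whose current value is smallest, in the homogeneous setting), the following is a minimal Python sketch under stated assumptions: the submodular function f is a black-box set function, and all names (greed_max, blocks, values) are illustrative rather than the authors' reference implementation.

```python
def greed_max(items, f, m):
    """Greedy worst-case partitioner sketch: repeatedly pick the block with the
    smallest current value and give it the unassigned item with the largest
    marginal gain under the (monotone submodular) set function f."""
    blocks = [set() for _ in range(m)]
    values = [f(b) for b in blocks]            # value of each (initially empty) block
    remaining = set(items)
    while remaining:
        j = min(range(m), key=lambda i: values[i])            # weakest block
        best = max(remaining,                                  # largest marginal gain for block j
                   key=lambda v: f(blocks[j] | {v}) - values[j])
        blocks[j].add(best)
        values[j] = f(blocks[j])
        remaining.remove(best)
    return blocks
```

In practice a lazy-evaluation priority queue over marginal gains avoids re-scoring every remaining item in each round, which is the trick the text credits for GreedMax scaling to large data sets.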
Tractable Learning for Complex Probability Queries

Jessa Bekker, Jesse Davis (KU Leuven, Belgium); Arthur Choi, Adnan Darwiche, Guy Van den Broeck (University of California, Los Angeles)

Abstract

Tractable learning aims to learn probabilistic models where inference is guaranteed to be efficient. However, the particular class of queries that is tractable depends on the model and its underlying representation; usually this class is MPE or conditional probabilities Pr(x | e) for joint assignments. We propose a tractable learner that guarantees efficient inference for a broader class of queries. It simultaneously learns a Markov network and its tractable circuit representation, in order to guarantee and measure tractability. Our approach differs from earlier work by using sentential decision diagrams (SDDs) as the tractable language instead of arithmetic circuits (ACs). SDDs have desirable properties, which more general representations such as ACs lack, that enable basic primitives for Boolean circuit compilation. This allows us to support a broader class of complex probability queries, including counting, threshold, and parity queries, in polytime.

Introduction

Tractable learning is a promising new machine learning paradigm that focuses on learning probability distributions that support efficient querying. It is motivated by the observation that, while classical algorithms for learning Bayesian and Markov networks excel at fitting data, they ignore the cost of reasoning with the learned model. However, many applications and systems require efficient and guaranteed accurate reasoning capabilities; hence, new learning techniques are needed to support applications with these requirements. Initially, tractable learning focused on the first model class recognized to be tractable: low-treewidth graphical models. Recent advances in probabilistic inference exploit other properties of a model, including local structure and exchangeability, which even scale to models that have high treewidth. In particular, the discovery of local structure led to arithmetic circuits (ACs), which are a much more powerful representation of tractable probability distributions. In turn, this led to new tractable learners that targeted ACs to guarantee efficient inference; in this context, ACs with latent variables are sometimes called sum-product networks (SPNs). Other tractable learners target exchangeable models or determinantal point processes.

There is an aspect of tractable learning that is poorly understood and often ignored: tractability is not absolute, and is always relative to the class of queries the user is interested in. Existing approaches define tractability as the ability to efficiently compute most probable explanations (MPE) or conditional probabilities Pr(x | e), where x and e are joint assignments to subsets of the random variables. While these queries are indeed efficient on ACs, many other queries of interest are not. For example, computing a partial MAP remains intractable even on AC models; similarly, various decision, monotonicity, and utility queries remain hard. Perhaps the simplest query beyond the reach of tractable AC learners is for probabilities Pr(q | e), where q is a complex property such as a count, threshold, comparison, or parity of a set of random variables. These properties appear naturally throughout the machine learning literature, for example in neural nets and in exchangeable and statistical relational models; we believe they have not been used to their full potential in the graphical models world due to their intractability. We call these types of queries complex probability queries. This paper pushes the boundaries of tractable learning by supporting more
queries efficiently while we currently lack any representation tractable for partial map we do have all the machinery available to learn tractable models for complex probability queries their tractability is enabled by the weighted model counting wmc encoding of graphical models and recent advances in compilation of boolean functions into sentential decision diagrams sdds sdds can be seen as syntactic subset of acs with more desirable properties including the ability to incrementally compile markov network via conjoin operator dynamically minimize the size and complexity of the representation and efficiently perform complex probability queries our first contribution is tractable learning algorithm for markov networks with compact sdds following the outer loop of the successful acmn learner for acs that uses sdd primitives to modify the circuit during the markov network structure search support for the complex queries listed above also means that these properties can be employed as features in the learned network second we prove that complex symmetric probability queries over variables as well as their extensions run in time polynomial in and linear in the size of the learned sdd tighter complexity bounds are obtained for specific classes of queries finally we illustrate these tractability properties in an empirical evaluation on four data sets and four types of complex queries background markov networks markov network or markov random field compactly represents the joint distribution over set of variables xn markov networks are often represented as models that is an exponentiated weighted sum of features of the state of variables pr wj fj the fj are functions of the state wj is the weight associated exp with fj and is the partition function for discrete models features are often boolean functions typically conjunction of tests of the form xi xi xj xj one is interested in performing certain inference tasks such as computing the posterior marginals or state mpe given observations in general such tasks are intractable and learning markov networks from data require estimating the weights of the features parameter learning and the features themselves structure learning we can learn the parameters by optimizing some convex objective function which is typically the evaluation of this function and its gradient is in general intractable therefore it is common to optimize an approximate objective such as the the classical structure learning approach is greedy search it starts with features over individual variables and greedily searches for new features to add to the model from set of candidate features found by conjoining pairs of existing features other approaches convert local models into global one to prevent overfitting one puts penalty on the complexity of the model number of features tractable circuit representations and tractable learning tractable circuit representations overcome the intractability of inference in markov networks although we are not always guaranteed to find compact tractable representation for every markov network in this paper we will guarantee their existence for the learned models ac arithmetic circuits acs are directed acyclic graphs whose leafs are inputs representing either indicator variables to assign values to random variables parameters weights wj or constants figure shows an example acs encode the partition function computation of markov network the literature typically shows hardness for polytrees results carry over because these have compact acs ia weight 
feature ib parameter variable ib markov network sentential decision diagram arithmetic circuit figure markov network over variables and its tractable sdd and ac representations by setting indicators to and evaluating the ac the value of the partition function is obtained at the root other settings of the indicators encode arbitrary evidence moreover second pass yields all marginal probabilities similar procedures exist for mpe all these algorithms run in time linear in the size of the ac number of edges the tractable learning paradigm for markov networks is best exemplified by acmn which concurrently learns markov network and its ac it employs complexity penalty based on the inference cost moreover acmn efficiently computes the exact as opposed to and its gradient on the ac acmn uses the standard greedy feature search outlined above sdd sentential decision diagrams sdds are tractable representation of sentences in propositional logic the reviews sdds in detail brief summary is next sdds are directed acyclic graphs as depicted in figure circle node represents the disjunction of its children pair of boxes denotes the conjunction of the two boxes and each box can be negated boolean variable or reference to another sdd node the detailed properties of sdds yield two benefits first sdds support an efficient conjoin operator that can incrementally construct new sdds from smaller sdds in linear time second sdds support dynamic minimization which allows us to control the growth of an sdd during incremental construction there is close connection between sdds for logic and acs for graphical models through an intermediate weighted model counting formulation which is reviewed in the supplement given graphical model one can construct logical sentence whose satisfying assignments are in correspondence with the possible worlds of moreover each satisfying assignment of encodes the weights wj that apply to its possible world in for each feature fj of this includes constraint fj pj meaning that weight wj applies when parameter variable pj is true see figure consequence of this correspondence is that given an sdd for we can efficiently construct an ac for the original markov network see figure hence an sdd corresponding to is tractable representation of different from acs sdds have the following properties support for efficient linear conjunction allows us to add new features fj and incrementally learn markov network moreover dynamic minimization lets us systematically search for more compact circuits for the same markov network mitigating the increasing complexity of inference as we learn more features such operations are not available for acs in general learning algorithm we propose learnsdd which employs greedy search that simultaneously learns markov network and its underlying sdd which is used for inference the cost of inference in the learned model is dictated by the size of its sdd conceptually our approach is similar to acmn with the key differences being our use of sdds instead of acs which gives us more tractability and freedom in the types of features that are considered https algorithm learnsdd initialize model with variables as features mbest while number of edges and not timeout best score generatefeatures for each feature in do if score best score best score score mbest mbest learnsdd outlined in algorithm receives as input training set maximum number of edges and parameter to control the relative importance of fitting the data the cost of inference as is typical with approaches to structure 
learning the initial model has one feature xi true for each variable which corresponds to markov network next learnsdd iteratively constructs set of candidate features where each feature is logical formula it scores each feature by compiling it into an sdd conjoining the feature to the current model temporarily and then evaluating the score of the model that includes the feature the supplement shows how features is added to an sdd in each iteration the highest scoring feature is selected and added to the model permanently the process terminates when the maximum number of edges is reached or when it runs out of time inference time is dictated by the size of the learned sdd to control this cost we invoke dynamic sdd minimization each time feature is evaluated and when we permanently add feature to the model performing structure learning with sdds offers advantages over acs first sdds support practical conjoin operation which greatly simplifies the design of structure learning algorithm acmn instead relies on complex ac modification algorithm second sdds support dynamic minimization allowing us to search for smaller sdds as needed the following two sections discuss the score function and feature generation in greater detail score function and weight learning score functions capture between fitting the training data and the preference for simpler models captured by regularization term in tractable learning the regularization term reflects the cost of inference in the model therefore we use the following score function score log pr log pr where is the training data is the model extended with feature is the old model returns the number of edges in the sdd representation and is parameter the first term is the improvement in the model due to incorporating the second term measures the relative growth of the sdd representation after incorporating we use the relative growth because adding feature to larger model adds many more edges than adding feature to smaller model section shows that any query inference complexity depends on the sdd size finally lets us control the between fitting the data and the cost of inference scoring model requires learning the weights associated with each feature because we use sdds we can efficiently compute the exact and its gradient using only two passes over the sdd therefore we learn estimates of the weights generating features in each iteration learnsdd constructs set of candidate features using two different feature generators conjunctive and mutex the conjunctive generator considers each pair of features in the model and proposes four new candidates per pair and the mutex generator automatically identifies mutually exclusive sets of variables in the data and proposes feature to capture this relationship mutual exclusivity arises naturally in data it occurs in tractable learning because existing approaches typically assume boolean data hence any attribute is converted into multiple binary variables for all variable sets xv that have exactly one true value in each training example the exactly one fean ture xi is added to the candidate set when at most one variable is true the wn vn mutual exclusivity feature xi is added to the candidate set complex queries tractable learning focuses on learning models that can efficiently compute the probability of query given some evidence where both the query and evidence are conjunctions of literals however many other important and interesting queries do not conform to this structure including the following consider the task of 
predicting the probability that legislative bill will pass given that some lawmakers have already announced how they will vote answering this query requires estimating the probability that count exceeds given threshold imagine only observing the first couple of sentences of long review and wanting to assess the probability that the entire document has more positive words than negative words in it which could serve as proxy for how positive negative the review is answering this requires comparing two groups in this case positive words and negative words table lists these and other examples of what we call complex queries which are logical functions that can not be written as conjunction of literals unfortunately tractable models based on acs are in general unable to answer these types of queries efficiently we show that using model with an sdd as the target tractable representation can permit efficient exact inference for certain classes of complex queries symmetric queries and their generalizations no known algorithm exists for efficiently answering these types of queries in acs for other classes of complex queries the complexity is never worse than for acs and in many cases sdds will be more efficient note that spns have the same complexity for answering queries as acs since they are interchangeable we first discuss how to answer complex queries using acs and sdds we then discuss some classes of complex queries and when we can guarantee tractable inference in sdds answering complex queries currently it is only known how to solve conjunctive queries in acs therefore we will answer complex queries by asking multiple conjunctive queries we convert the query into dnf format consisting of mutually exclusive clauses now the probability of the cp query is the sum of the probabilities of the clauses pr pr ci in the worst case this construction requires clauses for queries over variables the inference complexity for each clause on the ac is hence the total inference complexity is sdds can answer complex queries without transforming them into mutually exclusive clauses instead the query can directly be conjoined with the weighted model counting formulation of the markov network given an sdd sm for the markov network and an sdd sq for we can efficiently compile an sdd sa for from sa we can compute the partition function of the markov network after asserting which gives us the probability of this computation is performed efficiently on the ac that corresponds to sa cf section the supplement explains the protocol for answering query the size of the sdd sa is at most and inference is linear in the circuit size therefore it is when converting an arbitrary query into sdd the size may grow as large as with the number of variables in the query but often it will be much smaller see section thus the overall complexity is but often much better depending on the query class classes of complex queries first class of tractable queries are symmetric boolean functions these queries do not depend on the exact input values but only on how many of them are true definition boolean function xn is symmetric query precisely when xn xπ xπ for every permutation of the indexes table examples of complex queries with the sdd size and the number of query variables query class symmetric query asymmetric tractable query query type parity hamming distance group comparison inference complexity mn mnk example table lists examples of functions that can always be answered in polytime because they have compact sdd note that the 
features generated by the mutex generator are types of queries where and therefore have compact sdd we have the following result theorem markov networks with compact sdds support tractable querying of symmetric functions more specifically let be markov network with an sdd of size and let be any symmetric function of variables then prm can be computed in time moreover when is parity function querying takes mn time and when is or function querying takes mnk time the proof shows that any sdd can be conjoined with these queries without increasing the sdd size by more than factor polynomial in the proof of theorem is given in the supplement this tractability result can be extended to certain functions for example negating the inputs to symmetric functions still yields tractable complex query this allows queries for the probability that the state is within given hamming distance from desired state moreover boolean combinations of bounded number of tractable function also admit efficient querying this allows queries that compare symmetric properties of different groups of variables we can not guarantee tractability for other classes of complex queries because some queries do not have compact sdd representation an example of such query is the weighed where each literal has corresponding weight and the total weight of true literals must be bigger than some threshold while the complexity of using sdds and acs to answer such queries is the same we show in the supplement that sdds can still be more efficient in practice empirical evaluation the goal of this section is to evaluate the merits of using sdds as target representation in tractable learning for complex queries specifically we want to address the following questions does capturing mutual exclusivity allow learnsdd to learn more accurate models than acmn do sdds produced by learnsdd answer complex queries faster than acs learned by acmn to resolve these questions we run learnsdd and acmn on data and compare their performance our learnsdd implementation builds on the publicly available sdd data table describes the characteristics of each data set table data set characteristics data set traffic temperature voting movies train set size tune set size http test set size num vars mutex features we used the traffic and temperature data sets to evaluate the benefit of detecting mutual exclusivity in the initial version of these data sets each variable had four values which were binarized using encoding complex queries to evaluate complex queries we used voting data from and pang and lee movie review data the voting data contains all votes in the house of representatives from the congress each bill is an example and the variables are the votes of the congressmen which can be yes no or present the movie review data contains positive and negative movie reviews we first applied the porter stemmer and then used the scikit learn which counts all and while omitting the standard scikit learn stop words we selected the most frequent in the training data to serve as the features methodology for all data sets we divided the data into single train tune and test partition all experiments were run on identically configured machines with ram and twelve cores mutex features using the training set we learned models with both learnsdd and acmn for learnsdd we tried setting to and for acmn we did grid search for the penalty ps and the and weights and with ps and for both methods we stopped learning if the circuit exceeded two million edges or the algorithm ran for hours 
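The mutex feature generator described above detects mutually exclusive variable sets automatically. As a minimal sketch under stated assumptions (the candidate groups are assumed given, e.g., the column groups produced by one-of-k binarization of a multi-valued attribute, and all names are illustrative rather than the paper's code), one can verify each candidate group against the training data as follows.

```python
import numpy as np

def mutex_features(data, groups):
    """Check candidate variable groups for mutual exclusivity in binary data.

    data   : (n_examples, n_vars) 0/1 array
    groups : iterable of lists of column indices to test
    Returns (group, kind) pairs, where kind is 'exactly_one' or 'at_most_one'.
    """
    found = []
    for g in groups:
        true_counts = data[:, g].sum(axis=1)   # true variables per example in this group
        if np.all(true_counts == 1):
            found.append((g, 'exactly_one'))
        elif np.all(true_counts <= 1):
            found.append((g, 'at_most_one'))
    return found
```

Groups flagged this way would then yield the exactly-one or at-most-one candidate features that the mutex generator proposes to LearnSDD.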
for each approach we picked the best learned model according to the tuning set we evaluated the quality of the selected model using the on the test set complex queries in this experiment the goal is to compare the time needed to answer query in models learned by learnsdd and acmn in both sdds and acs inference time depends linearly on the number of edges in the circuit therefore to ensure fair comparison the learned models should have approximately the same number of edges hence we first learned an sdd and then used the number of edges in the learned sdd to limit the size of the model learned by acmn in the voting data set we evaluated the threshold query what is the probability that at least of the congressmen vote yes on bill given as evidence that some lawmakers have already announced their vote we vary the percentage of unknown votes from to in intervals of point we evaluated several queries on the movie data set the first two queries mimic an active sensing setting to predict features of the review without reading it entirely the evidence for each query are the features that appear in the first characters of the stemmed review on average the stemmed reviews have approximately characters the first query is pr positive ngrams and second is pr positive ngrams negative ngrams which correspond to threshold query and group comparison query respectively for both queries we varied the size of the positive and negative ngram sets from to ngrams with an increment size of we randomly selected which ngrams are positive and negative as we are only interested in query evaluation time the third query is the probability that parity function over set of features is even we vary the number of variables considered by the parity function from to for each query we report the average per example inference time for each learned model on the test set we impose minute average time limit and minutes individual time limit for each query for completeness the supplement reports run times for queries that are guaranteed to not be tractable for both acs and sdds as well as the conditional of all queries results and discussion mutex features figure shows the test set as function of the size of the learned model in both data sets learnsdd produces smaller models that have the same accuracy as ac this is because it can add mutex features without the need to add other features that are needed as building blocks but are redundant afterwards these results allow us to affirmatively answer complex queries figure shows the inference times for complex queries that are extensions of symmetric queries for all queries we see that learnsdd model results in significantly faster inference times than acmn model in fact acmn model exceeds the ten minute time limit on http http and http learnsdd acmn learnsdd acmn size size temperature traffic figure the size and of the models learned by learnsdd and acmn ideally the model is small with high accuracy upper left corner which is best approached by the learnsdd models out of of the query settings whereas this only happens in settings for learnsdd the sdd can answer all parity questions and positive word queries in less than three hundred milliseconds and the group comparison in less than three seconds it can answer the voting query with up to of the votes unknown in less than ten minutes these results demonstrate learnsdd superior ability to answer complex queries compared to acmn and allow us to positively answer timeout timeout unknown votes sdd ac time time sdd ac threshold query 
voting time sdd ac positive words threshold query movie timeout sdd ac time timeout positive words negative words group comparison movie variables parity movie figure the time for sdds acs to answer complex queries varying the number of query variables sdds need less time in all settings answering nearly all queries acs timeout in more than of the cases conclusions this paper highlighted the fact that tractable learning approaches learn models for only restricted classes of queries primarily focusing on the efficient computation of conditional probabilities we focused on enabling efficient inference for complex queries to achieve this we proposed using sdds as the target representation for tractable learning we provided an algorithm for simultaneously learning markov network and its sdd representation we proved that sdds support polytime inference for complex symmetric probability queries empirically sdds enable significantly faster inference times than acs for multiple complex queries probabilistic sdds are closely related representation they also support complex queries in structured probability spaces but they lack structure learning algorithms subject of future work acknowledgments we thank songbai yan for prior collaborations on related projects jb is supported by iwt jd is partially supported by the research fund ku leuven eu marie curie cig iwt and ac and ad are partially supported by nsf and onr references domingos niepert and lowd in icml workshop on learning tractable probabilistic models bach and jordan thin junction trees in proceedings of nips pages zhang hierarchical latent class models for cluster analysis jmlr narasimhan and bilmes bounded graphical models in proc uai chechetka and guestrin efficient principled learning of thin junction trees in proceedings of nips pages chavira and darwiche on probabilistic inference by weighted model counting aij niepert and van den broeck tractability through exchangeability new perspective on efficient probabilistic inference in proceedings of aaai darwiche differential approach to inference in bayesian networks jacm lowd and rooshenas learning markov networks with arithmetic circuits in proc aistats pages rahman kothalkar and gogate cutset networks simple tractable and scalable approach for improving the accuracy of trees in proceedings of ecml pkdd pages gens and domingos learning the structure of networks in proceedings of iclm pages rooshenas and lowd learning networks with direct and indirect variable interactions in proceedings icml pages niepert and domingos exchangeable variable models in proceedings of icml van haaren van den broeck meert and davis lifted generative learning of markov logic networks machine learning to appear kulesza and taskar determinantal point processes for machine learning foundations and trends in machine learning park map complexity results and approximation methods in proceedings of uai chen choi and darwiche algorithms and applications for the probability jair pages krause guestrin optimal nonmyopic value of information in graphical models efficient algorithms and theoretical limits in proceedings of ijcai van der gaag bodlaender and feelders monotonicity in bayesian networks in proceedings of uai pages de campos and zaffalon on the complexity of solving limited memory influence diagrams with binary variables aij parberry and schnitger relating boltzmann machines to conventional models of computation neural networks buchman and poole representing aggregators in relational probabilistic models in 
proceedings of aaai darwiche sdd new canonical representation of propositional knowledge bases in proceedings of ijcai pages della pietra della pietra and lafferty inducing features of random fields ieee tpami daniel lowd and jesse davis improving markov network structure learning using decision trees the journal of machine learning research kisa van den broeck choi and darwiche probabilistic sentential decision diagrams in kr choi van den broeck and darwiche tractable learning for structured probability spaces case study in learning preference distributions in proceedings of ijcai 
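To make the AC-versus-SDD comparison above concrete: answering a complex query on an AC requires decomposing the query into mutually exclusive conjunctive clauses and summing their probabilities, and the number of clauses can grow exponentially in the number of query variables. The sketch below is a hypothetical illustration of this decomposition for a threshold query; the ac_conditional_prob callback is an assumption standing in for standard AC evaluation, not a real library API.

```python
from itertools import product

def threshold_query_clauses(query_vars, k):
    """Mutually exclusive conjunctive clauses for "at least k of query_vars are true".
    Each clause is a full assignment to the query variables, so the clauses are
    mutually exclusive and their probabilities sum to the query probability."""
    for bits in product([0, 1], repeat=len(query_vars)):
        if sum(bits) >= k:
            yield {v: bool(b) for v, b in zip(query_vars, bits)}

def query_prob_with_ac(ac_conditional_prob, query_vars, k, evidence):
    """Answer the threshold query by summing Pr(clause | evidence) over clauses,
    where ac_conditional_prob is assumed to evaluate a conjunctive query on the AC."""
    return sum(ac_conditional_prob(clause, evidence)
               for clause in threshold_query_clauses(query_vars, k))
```

For a symmetric query over k variables this enumeration grows like 2^k, which is exactly the blow-up that conjoining the query's SDD with the model's SDD avoids.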
stop wasting my gradients practical svrg reza mohamed osama alim mark department of computer science university of british columbia rezababa moahmed schmidtm jakub school of mathematics university of edinburgh scott sallinen department of electrical and computer engineering university of british columbia scotts abstract we present and analyze several strategies for improving the performance of stochastic gradient svrg methods we first show that the convergence rate of these methods can be preserved under decreasing sequence of errors in the control variate and use this to derive variants of svrg that use strategies to reduce the number of gradient calculations required in the early iterations we further show how to exploit support vectors to reduce the number of gradient computations in the later iterations ii prove that the regularized svrg iteration is justified and improves the convergence rate iii consider alternate selection strategies and iv consider the generalization error of the method introduction we consider the problem of optimizing the average of finite but large sum of smooth functions min fi huge proportion of the procedures in machine learning can be mapped to this problem this includes classic models like least squares and logistic regression but also includes more advanced methods like conditional random fields and deep neural network models in the highdimensional setting large the traditional approaches for solving are full gradient fg methods which have linear convergence rates but need to evaluate the gradient fi for all examples on every iteration and stochastic gradient sg methods which make rapid initial progress as they only use single gradient on each iteration but ultimately have slower sublinear convergence rates le roux et al proposed the first general method stochastic average gradient sag that only considers one training example on each iteration but still achieves linear convergence rate other methods have subsequently been shown to have this property but these all require storing previous evaluation of the gradient or the dual variables for each for many objectives this only requires space but for general problems this requires np space making them impractical recently several methods have been proposed with similar convergence rates to sag but without the memory requirements they are known as mixed gradient stochastic gradient svrg and gradient methods we will use svrg we give canonical svrg algorithm in the next section but the salient features of these methods are that they evaluate two gradients on each iteration and occasionally must compute the gradient on all examples svrg methods often dramatically outperform classic fg and sg methods but these extra evaluations mean that svrg is slower than sg methods in the important early iterations they also mean that svrg methods are typically slower than methods like sag in this work we first show that svrg is robust to inexact calculation of the full gradients it requires provided the accuracy increases over time we use this to explore strategies that require fewer gradient evaluations when far from the solution and we propose mixed method that may improve performance in the early iterations we next explore using support vectors to reduce the number of gradients required when close to the solution give justification for the regularized svrg update that is commonly used in practice consider alternative minibatch strategies and finally consider the generalization error of the method notation and svrg algorithm 
svrg assumes is convex each fi is convex and each gradient is lipschitzcontinuous with constant the method begins with an initial estimate sets and then generates sequence of iterates xt using xt xs µs where is the positive step size we set µs xs and it is chosen uniformly from after every steps we set xt for random and we reset with to analyze the convergence rate of svrg we will find it convenient to define the function mµη as it appears repeatedly in our results we will use to indicate the value of when and we will simply use for the special case when johnson zhang show that if and are chosen such that the algorithm achieves linear convergence rate of the form ρe xs where is the optimal solution this convergence rate is very fast for appropriate and while this result relies on constants we may not know in general practical choices with good empirical performance include setting and using xm rather than random iterate unfortunately the svrg algorithm requires gradient evaluations for every iterations of since updating xt requires two gradient evaluations and computing µs require gradient evaluations we can reduce this to if we store the gradients xs but this is not practical in most applications thus svrg requires many more gradient evaluations than classic sg iterations of methods like sag svrg with error we first give result for the svrg method where we assume that µs is equal to xs up to some error es this is in the spirit of the analysis of who analyze fg methods under similar assumptions we assume that kxt for all which has been used in related work and is reasonable because of the coercity implied by proposition if µs xs es and we set and so that then the svrg algorithm with chosen randomly from xm satisfies ρe xs zekes ηekes we give the proof in appendix this result implies that svrg does not need very accurate approximation of xs in the crucial early iterations since the first term in the bound will dominate further this result implies that we can maintain the exact convergence rate of svrg as long as the errors es decrease at an appropriate rate for example we obtain the same convergence rate provided that max ekes ekes for any and some further we still obtain linear convergence rate as long as kes converges to zero with linear convergence rate algorithm batching svrg input initial vector update frequency learning rate for do choose batch size elements sampled without replacement from µs fi xs for do randomly pick it xt fit fit xs µs end for option set xm option ii set xt for random end for sampling xiao zhang show that sampling nus improves the performance of svrg they assume pn each fi is li continuous and sample it with probability li where li the iteration is then changed to fit xt lit it which maintains that the search direction is unbiased in appendix we show that if µs is computed with error for this algorithm and if we set and so that then we have convergence rate of zekes ηekes xs which can be faster since the average may be much smaller than the maximum value svrg with batching there are many ways we could allow an error in the calculation of µs to speed up the algorithm for example if evaluating each involves solving an optimization problem then we could solve this optimization problem inexactly for example if we are fitting graphical model with an iterative approximate inference method we can terminate the iterations early to save time when the fi are simple but is large natural way to approximate µs is with subset or batch of training examples chosen without 
replacement µs fi the batch size controls the error in the approximation and we can drive the error to zero by increasing it to existing svrg methods correspond to the special case where for all algorithm gives for an svrg implementation that uses this strategy if we assume that the sample variance of the norms of the gradients is bounded by for all xs kfi kf xs then we have that chapter ekes so if we want ekes where is constant for some we need ns nγ algorithm mixed svrg and sg method replace in algorithm with the following lines if fit then xt fit fit xs µs else xt ηfit end if if the batch size satisfies the above condition then ηγ max ηγ and the convergence rate of svrg is unchanged compared to using the full batch on all iterations the condition guarantees linear convergence rate under any sequence of batch sizes the strategy suggested by for classic sg methods however tedious calculation shows that has an inflection point at log log corresponding to this was previously observed empirically figure and occurs because we are sampling without replacement this transition means we don need to increase the batch size exponentially mixed sg and svrg method an approximate µs can drastically reduce the computational cost of the svrg algorithm but does not affect the in the gradients required for svrg iterations this factor of is significant in the early iterations since this is when stochastic methods make the most progress and when we typically see the largest reduction in the test error to reduce this factor we can consider mixed strategy if it is in the batch then perform an svrg iteration but if it is not in the current batch then use classic sg iteration we illustrate this modification in algorithm this modification allows the algorithm to take advantage of the rapid initial progress of sg since it predominantly uses sg iterations when far from the solution below we give convergence rate for this mixed strategy proposition let µs xs and we set and so that αl with if we assume then algorithm has zekes ηekes αl ησ we give the proof in appendix the extra term depending on the variance is typically the bottleneck for sg methods classic sg methods require the to converge to zero because of this term however the mixed method can keep the fast progress from using constant since the term depending on converges to zero as converges to one since implies that αl this result implies that when xs is large compared to es and that the mixed method actually converges faster sharing single step size between the sg and svrg iterations in proposition is for example if is close to and then the sg iteration might actually take us far away from the minimizer thus we may want to use decreasing sequence of step sizes for the sg iterations in appendix we show that using for the sg iterations can improve the dependence on the error es and variance using support vectors using batch decreases the number of gradient evaluations required when svrg is far from the solution but its benefit diminishes over time however for certain objectives we can further algorithm heuristic for skipping evaluations of fi at if ski then compute if then psi psi update the number of consecutive times was zero max psi ski skip exponential number of future evaluations if it remains zero else psi this could be support vector do not skip it next time end if return else ski ski in this case we skip the evaluation return end if reduce the number of gradient evaluations by identifying support vectors for example consider minimizing the huberized 
hinge loss hsvm with threshold if if bi ai min if bi ati in terms of we have fi the performance of this loss function is similar to logistic regression and the hinge loss but it has the appealing properties of both it is differentiable like logistic regression meaning we can apply methods like svrg but it has support vectors like the hinge loss meaning that many examples will have fi and we can also construct huberized variants of many losses for regression and classification if we knew the support vectors where fi we could solve the problem faster by ignoring the vectors for example if there are training examples but only support vectors in the optimal solution we could solve the problem times faster while we typically don know the support vectors in this section we outline heuristic that gives large practical improvements by trying to identify them as the algorithm runs our heuristic has two components the first component is maintaining the list of vectors at xs specifically we maintain list of examples where xs when svrg picks an example it that is part of this list we know that xs and thus the iteration only needs one gradient evaluation this modification is not heuristic in that it still applies the exact svrg algorithm however at best it can only cut the runtime in half the heuristic part of our strategy is to skip xs or xt if our evaluation of has been zero more than two consecutive times and skipping it an exponentially larger number of times each time it remains zero specifically for each example we maintain two variables ski for skip and psi for pass whenever we need to evaluate for some xs or xt we run algorithm which may skip the evaluation this strategy can lead to huge computational savings in later iterations if there are few support vectors since many iterations will require no gradient evaluations identifying support vectors to speed up computation has long been an important part of svm solvers and is related to the classic shrinking heuristic while it has previously been explored in the context of dual coordinate ascent methods this is the first work exploring it for stochastic gradient methods regularized svrg we are often interested in the special case where problem has the decomposition min gi common choice of is scaled of the parameter vector this regularizer encourages sparsity in the parameter vector and can be addressed with the proximalsvrg method of xiao zhang alternately if we want an explicit we could set to the indicator function for ball containing in appendix we give variant of proposition that allows errors in the method for settings like this another common choice is the with this regularizer the svrg updates can be equivalently written in the form xt xt xt xs µs pn where gi that is they take an exact gradient step with respect to the regularizer and an svrg step with respect to the gi functions when the are sparse this form of the update allows us to implement the iteration without needing operations related update is used by le roux et al to avoid operations in the sag algorithm in appendix we prove the below convergence rate for this update proposition consider instances of problem that can be written in the form where is lh continuous and each is lg continuous and assume that we set and so that lm with lm max lg lh then the regularized svrg iteration has lm xs since lm and strictly so in the case of this result shows that for regularized problems svrg actually converges faster than the standard analysis would indicate similar result appears in et al 
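To make the ℓ2-regularized update described above concrete, the following is a minimal Python sketch. It assumes a least-squares loss for the g_i terms purely for illustration; the function and variable names, the inner-loop length, and the loss choice are ours, not the paper's.

```python
import numpy as np

def l2_regularized_svrg(A, b, lam, eta, n_outer=20, m=None):
    """Sketch of SVRG for f(x) = (lam/2)*||x||^2 + (1/n)*sum_i g_i(x):
    an exact gradient step is taken on the regularizer while only the
    data terms g_i receive the variance-reduced correction. Here
    g_i(x) = 0.5*(a_i^T x - b_i)^2 is used only for illustration."""
    n, d = A.shape
    m = m or n                                   # inner-loop length
    x = np.zeros(d)
    for _ in range(n_outer):
        x_snap = x.copy()
        mu = A.T @ (A @ x_snap - b) / n          # full gradient of the g_i part
        for _ in range(m):
            i = np.random.randint(n)
            gi_x  = A[i] * (A[i] @ x      - b[i])   # g_i'(x_t)
            gi_sn = A[i] * (A[i] @ x_snap - b[i])   # g_i'(x^s)
            # exact step on the regularizer, SVRG step on the data term
            x = x - eta * (lam * x + gi_x - gi_sn + mu)
    return x
```

In a sparse implementation one would additionally apply the dense µ_s and λx contributions lazily so that the per-iteration cost scales with the number of nonzeros in a_i; the sketch keeps the dense form for readability.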
further this result gives theoretical justification for using the update for other functions where it is not equivalent to the original svrg method strategies et al have also recently considered using batches of data within svrg they consider using in the inner iteration the update of xt to decrease the variance of the method but still use full passes through the data to compute µs this prior work is thus complimentary to the current work in practice both strategies can be used to improve performance in appendix we show that sampling the inner proportional to li achieves convergence rate of ρm xs where is the size of the while ρm mµη and we assume ρm this generalizes the standard rate of svrg and improves on the result of et al in the smooth case this rate can be faster than the rate of the standard svrg method at the cost of more expensive iteration and may be clearly advantageous in settings where parallel computation allows us to compute several gradients simultaneously the regularized svrg form suggests an alternate strategy for problem consider that contains fixed set bf and random set bt without loss of generality assume that we sort the fi based on their li values so that ln for the fixed bf we will always choose the mf values with the largest li bf fmf in contrast we choose the members of the random set bt by sampling from br fmf fn proportional to their pn lipschitz constants pi mlr with li in appendix we show the following convergence rate for this strategy proposition let fi and bf fi if we replace the svrg update with xt xt xt fi li then the convergence rate is xs where and max if and mf nm with then we get faster convergence rate than svrg with of size the scenario where this rate is slower than the existing minibatch svrg strategy is when but we could relax this assumption by dividing each element of the fixed set bf into two functions βfi and fi where then replacing each function fi in bf with βfi and adding fi to the random set br this result may be relevant if we have access to gate array fpga or graphical processing unit gpu that can compute the gradient for fixed subset of the examples very efficiently however our experiments appendix indicate this strategy only gives marginal gains in appendix we also consider constructing by sampling proportional to fi xs or xs these seemed to work as well as lipschitz sampling on all but one of the datasets in our experiments and this strategy is appealing because we have access to these values while we may not know the li values however these strategies diverged on one of the datasets learning efficiency in this section we compare the performance of svrg as learning algorithm compared to fg and sg methods following bottou bousquet we can formulate the generalization error of learning algorithm as the sum of three terms eapp eest eopt where the approximation error eapp measures the effect of using limited class of models the estimation error eest measures the effect of using finite training set and the optimization error eopt measures the effect of inexactly solving problem bottou bousquet study asymptotic performance of various algorithms for fixed approximation error and under certain conditions on the distribution of the data depending on parameters or in appendix we discuss how svrg can be analyzed in their framework the table below includes svrg among their results algorithm fg sg svrg time to reach eopt nκd log dνκ log time to reach log dνκ log κd log previous with log in this table the condition number is in this setting 
stochastic gradient methods can obtain better bounds for problems with better dependence on the dimension and without depending on the noise variance experimental results in this section we present experimental results that evaluate our proposed variations on the svrg method we focus on logistic regression classification given set of training data an bn where ai rd and bi the goal is to find the rd solving argmin log exp ati we consider the datasets used by whose properties are listed in the supplementary material as in their work we add bias variable normalize dense features and set the regularization parameter to we used of and we used which gave good performance across methods and datasets in our first experiment we compared three variants of svrg the original strategy that uses all examples to form µs full growing batch strategy that sets grow and the mixed described by algorithm under this same choice mixed while variety of practical batching methods have been proposed in the literature we did not find that any of these strategies consistently outperformed the doubling used by the simple grow full grow mixed full grow mixed test error objective minus optimum effective passes effective passes full grow sv full sv grow full grow sv full sv grow test error objective minus optimum effective passes effective passes figure comparison of training objective left and test error right on the spam dataset for the logistic regression top and the hsvm bottom losses under different batch strategies for choosing µs full grow and mixed and whether we attempt to identify support vectors sv strategy our second experiment focused on the hsvm on the same datasets and we compared the original svrg algorithm with variants that try to identify the support vectors sv we plot the experimental results for one run of the algorithms on one dataset in figure while appendix reports results on the other datasets over different runs in our results the growing batch strategy grow always had better test error performance than using the full batch while for large datasets it also performed substantially better in terms of the training objective in contrast the mixed strategy sometimes helped performance and sometimes hurt performance utilizing support vectors often improved the training objective often by large margins but its effect on the test objective was smaller discussion as svrg is the only method among the new stochastic methods it represents the natural method to use for huge variety of machine learning problems in this work we show that the convergence rate of the svrg algorithm can be preserved even under an inexact approximation to the full gradient we also showed that using to approximate µs gives natural way to do this explored the use of support vectors to further reduce the number of gradient evaluations gave an analysis of the regularized svrg update and considered several new strategies our theoretical and experimental results indicate that many of these simple modifications should be considered in any practical implementation of svrg acknowledgements we would like to thank the reviewers for their helpful comments this research was supported by the natural sciences and engineering research council of canada rgpin rgpin jakub is supported by google european doctoral fellowship references le roux schmidt and bach stochastic gradient method with an exponential convergence rate for optimization with finite training sets advances in neural information processing systems nips and zhang stochastic dual 
coordinate ascent methods for regularized loss minimization journal of machine learning research vol pp mairal optimization with surrogate functions international conference on machine learning icml defazio bach and saga fast incremental gradient method with support for convex composite objectives advances in neural information processing systems nips mahdavi zhang and jin mixed optimization for smooth functions advances in neural information processing systems nips johnson and zhang accelerating stochastic gradient descent using predictive variance reduction advances in neural information processing systems nips zhang mahdavi and jin linear convergence with condition number independent access of full gradients advances in neural information processing systems nips and gradient descent methods arxiv preprint schmidt le roux and bach convergence rates of inexact methods for convex optimization advances in neural information processing systems nips hu kwok and pan accelerated gradient methods for stochastic optimization and online learning advances in neural information processing systems nips xiao and zhang proximal stochastic gradient method with progressive variance reduction siam journal on optimization vol no pp lohr sampling design and analysis cengage learning friedlander and schmidt hybrid methods for data fitting siam journal of scientific computing vol no pp aravkin friedlander herrmann and van leeuwen robust inversion dimensionality reduction and randomized sampling mathematical programming vol no pp rosset and zhu piecewise linear regularized solution paths the annals of statistics vol no pp joachims making svm learning practical in advances in kernel methods support vector learning burges and smola eds ch pp cambridge ma mit press usunier bordes and bottou guarantees for approximate incremental svms international conference on artificial intelligence and statistics aistats liu and gradient descent in the proximal setting arxiv preprint bottou and bousquet the tradeoffs of large scale learning advances in neural information processing systems nips byrd chin nocedal and wu sample size selection in optimization methods for machine learning mathematical programming vol no pp van den doel and ascher adaptive and stochastic algorithms for eit and dc resistivity problems with piecewise constant solutions and many measurements siam scient comput vol 
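Before turning to the next paper, the growing-batch approximation of µ_s and the mixed SG/SVRG inner iteration described earlier can be summarized in one short sketch. Everything below — the callable grad_i, the uniform sampling, and the doubling schedule for the batch size — is an illustrative choice in the spirit of the batch-size condition, not the authors' exact implementation.

```python
import numpy as np

def growing_batch_svrg(grad_i, n, d, eta, n_outer=15, m=None, batch0=64):
    """Sketch of SVRG where mu_s is estimated from a batch B_s sampled
    without replacement (its size grows across outer iterations) and the
    inner loop mixes SVRG steps (for examples inside B_s) with plain SG
    steps (for examples outside B_s). grad_i(x, i) returns f_i'(x)."""
    m = m or n
    x = np.zeros(d)
    batch_size = batch0
    for _ in range(n_outer):
        x_snap = x.copy()
        batch = np.random.choice(n, size=min(batch_size, n), replace=False)
        mu = np.mean([grad_i(x_snap, i) for i in batch], axis=0)
        in_batch = set(int(i) for i in batch)
        for _ in range(m):
            i = np.random.randint(n)
            if i in in_batch:
                # variance-reduced step using the stored snapshot gradient
                x = x - eta * (grad_i(x, i) - grad_i(x_snap, i) + mu)
            else:
                # classic stochastic-gradient step
                x = x - eta * grad_i(x, i)
        batch_size = min(2 * batch_size, n)      # one simple growth schedule
    return x
```

Once batch_size reaches n the method reduces to standard SVRG, so early iterations retain the cheap, SG-like progress while later iterations recover the behavior of the full-batch estimate.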
mind the gap generative approach to interpretable feature selection and extraction been kim julie shah massachusetts institute of technology cambridge ma beenkim julie shah finale harvard university cambridge ma finale abstract we present the mind the gap model mgm an approach for interpretable feature extraction and selection by placing interpretability criteria directly into the model we allow for the model to both optimize parameters related to interpretability and to directly report global set of distinguishable dimensions to assist with further data exploration and hypothesis generation mgm extracts distinguishing features on datasets of animal features recipes ingredients and disease it also maintains or improves performance when compared to related approaches we perform user study with domain experts to show the mgm ability to help with dataset exploration introduction not only are our data growing in volume and dimensionality but the understanding that we wish to gain from them is increasingly sophisticated for example an educator might wish to know what features characterize different clusters of assignments to provide feedback tailored to each student needs clinical researcher might apply clustering algorithm to his patient cohort and then wish to understand what sets of symptoms distinguish clusters to assist in performing differential diagnosis more broadly researchers often perform clustering as tool for data exploration and hypothesis generation in these situations the domain expert goal is to understand what features characterize cluster and what features distinguish between clusters objectives such as data exploration present unique challenges and opportunities for problems in unsupervised learning while in more typical scenarios the discovered latent structures are simply required for some downstream as features for supervised prediction data exploration the model must provide information to domain expert in form that they can readily interpret it is not sufficient to simply list what observations are part of which cluster one must also be able to explain why the data partition in that particular way these explanations must necessarily be succinct as people are limited in the number of cognitive entities that they can process at one time the standard for summarizing clusters and other latent factor representations is to list the most probable features of each factor for example word lists are the standard for presenting topics from topic models principle component vectors in pca are usually described by list of dimensions with the largest magnitude values for the components with the largest magnitude eigenvalues versions of these models make this goal more explicit by trying to limit the number of values in each factor other works make these descriptions more intuitive by deriving disjunctive normal form dnf expressions for each cluster or learning set of important features and examples that characterizes each cluster while these approaches might effectively characterize each cluster they do not provide information about what distinguishes clusters from each other understanding these differences is important in many when performing differential diagnosis and computing relative risks techniques that combine variable selection and clustering assist in finding dimensions that than simply clusters variable extraction methods such as pca project the data into smaller number of dimensions and perform clustering there in contrast variable selection methods choose small number of 
dimensions to retain within variable selection approaches filter methods first select important dimensions and then cluster based on those wrapper methods iterate between selecting dimensions and clustering to maximize clustering objective embedded methods combine variable selection and clustering into one objective all of these approaches identify small subset of dimensions that can be used to form clustering that is as good as or better than using all the dimensions primary motivation for identifying this small subset is that one can then accurately cluster future data with many fewer measurements per observation however identifying minimal set of distinguishing dimensions is the opposite of what is required in data exploration and hypothesis generation tasks here the researcher desires comprehensive set of distinguishing dimensions to better understand the important patterns in the data in this work we present generative approach for discovering global set of distinguishable dimensions when clustering data our goal is to find comprehensive set of distinguishing dimensions to assist with further data exploration and hypothesis generation rather than few dimensions that will distinguish the clusters we use an embedded approach that incorporates interpretability criteria directly into the model first we use feature extraction technique to consolidate dimensions into groups second we define important groups as ones having parameter is groups that have gap in their parameter values across clusters by building these interpretability criteria directly into the model we can easily report back what an extracted set of features means by its logical formula and what sets of features distinguish one cluster from another without any analysis model we consider wnd with observations and binary dimensions our goal is to decompose these observations into clusters while simulateneously returning comprehensive list of what sets of dimensions are important for distinguishing between the clusters mgm has two core elements which perform interpretable feature extraction and selection at the feature extraction stage features are grouped together by logical formulas which are easily interpreted by people allowing some dimensionality reduction while maintaining interpretability next we select features for which there is large parameter values from personal communication with domain experts across several domains we observed that than simply often as aspect of interest as it provides an unambiguous way to discriminate between clusters we focus on data our feature extraction step involves consolidating dimensions into groups we posit that there an infinite number of groups and multinomial latent variable ld that indicates the group to which dimension belongs each group is characterized by latent variable fg which contains the formula associated with the group in this work we only consider the formulas fg or fg and and constrain each dimension to belong to only one group simple boolean operations like or and and are easy to interpret by people requiring each dimension to be part of only one group avoid having to solve possibly satisfiability problem as part of the generative procedure feature selection is performed through binary latent variable yg which indicates whether each group is important for distinguishing clusters if group is important yg then the probability βgk that group is present in an observation from cluster is drawn from distribution modeled as mixture of beta distributions if the group is 
unimportant yg the the probability βgk is drawn from distribution while distribution with high variance can also produce both low and high values for the probability βgk it will also produce intermediate values however draws from the distribution will have clear gap between low and high values this definition of important distributions is distinct from the criterion in where σp πg γg yg tgk βgk πf πl fg ld αi βi αz πz zn wnd ing mind the gap graphical model cartoon describing emissions from important dimensions in our case we define importance by than simply variance thus we distinguish panel from and while distinguishes between and figure graphical model of mgm cartoon of distinguishing dimensions parameters for important distributions were selected from distribution and parameters for unimportant dimensions were shared across all clusters figure illustrates this difference generative model the graphical model for mgm is shown in figure we assume that there are an infinite number of possible groups each with an associated formula fg each dimension belongs to group as indicated by ld we also posit that there are set of latent clusters each with emission characteristics described below the latent variable βgk corresponds to the probability that group is present in the data and is drawn with or distribution governed by the parameters γg yg tgk each observation belongs to exactly one latent cluster indicated by zn the binary variable ing indicates whether group is present in observation finally the probability of some observation wnd depends on whether its associated group indicated by ld is present in the data indicated by ing and the associated formula fg the complete generative process first involves assigning dimensions to groups choosing the formula fg associated with each group and deciding whether each group is important πl dp αl yg bernoulli πg πf dirichlet αf γg beta ld multinomial πl fg multinomial πf where dp is the dirichlet process thus there are an infinite number of potential groups however given finite number of dimensions only finite number of groups can be present in the data next emission parameters are selected for each cluster if yg else tgk bernoulli γg if tgk else βgk beta αu βu βgk beta αb βb βgk beta αt βt finally observations wnd are generated πz dirichlet αz zn multinomial πz ing bernoulli βgk if ing wnd else wnd formulafg the above equations indicate that if ing that is group is not present in the observation then in that observation all wnd such that ld are also absent wnd if the group is present ing and the group formula fg and then all the dimensions associated with that dimension are present wnd finally if the group is present ing and the group formula fg or then we sample the associated wnd from all possible configurations of wnd such that at least one wnd figure motivating examples with cartoons from three clusters vacation student winter and the distinguishable dimensions discovered by the mgm let yg γg tgk βgk ld fg zn ing be the set of variables in the mgm given set of observations wnd the posterior over factors as yg γg tgk βgk ld fg zn ing wnd yg γg fg tgk βgk yg ld zn ing zn wnd ld most of these terms are to compute given the generative model the likelihood term wnd ld can be expanded as ld ing fg ld ing sat fg ld ing ld wnd ing ld wnd where we use to indicate the vector of measurements associated with observation the function sat wn fg ld indicates whether the associated formula fg is satisfied where fg involves dimensions of that belong to group ld 
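The generative process above can be made concrete with a short forward-sampling sketch. The truncation of the Dirichlet process at G groups, the uniform cluster weights used in place of π_z, the fixed coin flip used in place of γ_g, and all hyperparameter values are simplifying assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, G):
    """Truncated stick-breaking approximation to the Dirichlet process."""
    v = rng.beta(1.0, alpha, size=G)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return pi / pi.sum()

def sample_mgm(N=200, D=16, K=3, G=8, alpha_l=2.0, pi_g=0.5,
               a_u=2, b_u=2, a_b=1, b_b=9, a_t=9, b_t=1):
    """Forward-sample observations W (N x D, binary) from the MGM
    generative process: dimensions are grouped, each group carries an
    'or'/'and' formula, important groups get gapped per-cluster presence
    probabilities, and observations emit whole groups at once."""
    pi_l = stick_breaking(alpha_l, G)
    l = rng.choice(G, size=D, p=pi_l)          # group assignment l_d
    f = rng.choice(['or', 'and'], size=G)      # formula f_g
    y = rng.random(G) < pi_g                   # importance indicator y_g

    beta = np.empty((G, K))                    # P(group g present | cluster k)
    for g in range(G):
        if not y[g]:
            beta[g] = rng.beta(a_u, b_u, size=K)       # unimportant: no gap enforced
        else:
            t = rng.random(K) < 0.5                    # t_gk (gamma_g fixed at 0.5 here)
            beta[g] = np.where(t, rng.beta(a_t, b_t, size=K),
                                  rng.beta(a_b, b_b, size=K))  # gapped high/low values

    z = rng.choice(K, size=N)                  # cluster assignment z_n (uniform here)
    W = np.zeros((N, D), dtype=int)
    for n in range(N):
        for g in range(G):
            dims = np.where(l == g)[0]
            if dims.size == 0:
                continue
            if rng.random() < beta[g, z[n]]:   # i_ng = 1: group present
                if f[g] == 'and':
                    W[n, dims] = 1             # all member dimensions on
                else:                          # 'or': at least one member on
                    on = rng.random(dims.size) < 0.5
                    if not on.any():
                        on[rng.integers(dims.size)] = True
                    W[n, dims] = on.astype(int)
    return W, z, l, f, y, beta
```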
motivating example here we provide an example to illustrate the properties of mgm on synthetic of cartoon faces each cartoon face can be described by eight features earmuffs scarf hat sunglasses pencil silly glasses face color mouth shape see figure the cartoon faces belong to three clusters winter faces tend to have earmuffs and scarves student faces tend to have silly glasses and pencils vacation faces tend to have hats and sunglasses face color does not distinguish between the different clusters the mgm discovers four distinguishing sets of features the vacation cluster has hat or sunglasses the winter cluster has earmuffs or scarfs or smile and the student cluster has silly glasses as well as pencils face color does not appear because it does not distinguish between the groups however we do identify both hats and sunglasses as important even though only one of those two features is important for distinguishing the vacation cluster from the other clusters our model aims to find comprehensive list the distinguishing features for human expert to later review for interesting patterns not minimal subset for classification by consolidating as sunglasses or hat still provide compact summary of the ways in which the clusters can be distinguished inference solving equation is computationally intractable we use variational approach to approximate the true posterior distribution yg γg tgk βgk ld fg zn ing wnd with factored distribution qηg yg bernoulli ηg qλgk tgk bernoulli λgk γg beta qφgk βgk beta qτn dirichlet qcd ld multinomial cd qνn zn multinomial νn qing ing bernoulli ong qeg fg bernoulli eg where in addition we use approximation to the dirichlet process to approximate the distribution over group assignments ld minimizing the divergence between the true posterior wnd and the variational distribution corresponds to maximizing the evidence lower bound the elbo eq log wnd where is the entropy because of the conjugate exponential family terms most of the expressions in the elbo are straightforward to compute the most challenging part is determining how to optimize the variational terms ld ing and fg that are involved in the likelihood in equation here we first relax our generative process of or to have it correspond to independently sampling each wnd with some probability thus equation becomes fg ld fg ld ing wnd fg ld ing wnd fg ld ing wnd fg ld ing wnd ing ld wnd ing ld wnd with this relaxation the expression for the entire evidence lower bound is to compute the full derivations are given in the supplementary materials however the logical formulas in equation still impose hard combinatorial constraints on settings of the variables ing fg ld that are associated with the logical formulas specifically if the values for the formula choice fg and group assignments ld are fixed then the value of ing is also fixed because the feature extraction step is deterministic once ing is fixed however the relationships between all the other variables are conjugate in the exponential family therefore we alternate our inference between the variables ing fg ld and the variables yg γg tgk βgk zn feature extraction we consider only degenerate distributions ing fg ld that put mass on only one setting of the variables note that this is still valid setting for the variational inference as fixing values for ing fg and ld which corresponds to degenerate beta or dirichlet prior only means that we are further limiting our set of variational distributions not fully optimizing lower bound due to this constraint can only 
lower the lower bound we perform an agglomerative procedure for assigning dimensions to groups we begin our search with each dimension assigned to its own formula ld fd or merges of groups are explored using combination of and random proposals in which we also explore changing the formula assignment of the group for the proposals we use an initial run of vanilla clustering algorithm to test whether combining two more groups results in an extracted feature that has high variance at each iteration we compute the elbo for subsets of these proposals and choose the agglomeration with the highest elbo feature selection given particular setting of the extraction variables ing fg ld the remaining variables yg γg tgk βgk zn are all in the exponential family the corresponding posterior distributions yg γg tgk βgk and zn can be optimized via coordinate ascent results we applied our mgm to both standard benchmark and more interesting data sets in all cases we ran restarts of the mgm inference was run for iterations or until the elbo improved by less than relative to the previous iteration twenty possible merges were explored in each iteration faces digits mgm kmeans hfs law dpm hfs cc table mutual information and number of clusters in parentheses for uci benchmarks the mutual information is with respect to the true class labels higher is better performance values for hfs law dpm hfs and cc are taken from figure results on datasets animal dataset left recipe dataset middle and disease dataset right each row represents an important feature lighter boxes indicate that the feature is likely to be present in the cluster while darker boxes are unlikely to be present each merge exploration involved combining two existing groups into new group if we failed to accept our candidate merge proposals more than three times within an iteration we switched to random proposals for the remaining proposals we swept over the number of clusters from to and reported the results with the highest elbo benchmark problems mgm discriminates classes we compared the classification performance of our clustering algorithms on several uci benchmark problems the digits data set consists of grayscale images for each digit the faces data set consists of images of people with images of each person from various angles in both cases we binarized the images setting the value to if the value was less than if the value was greater than these two are chosen as they are discrete and we have the same versions for comparison to results cited in the mutual information between our discovered clusters and the true classes in the data sets is shown in table higher mutual information between our clustering and known labels is one way to objectively show that our clusters correspond to groups that humans find interesting the classification labels mgm is second only to hfs in the faces dataset second only to hfs and the highest scoring model in the digits dataset it always outperforms demonstrating interpretability applications our quantitative results on the benchmark datasets show that the structure recovered by our approach is consistent with classes defined by human labelers better than or at the level of other clustering approaches however the dimensions in the image benchmarks do not have much associated meaning and the our approach was designed for clustering not classification here we demonstrate the qualitative advantages of our approach on three more interesting datasets animals the animals data set consists of biological and ecological 
properties of animals such as has wings or has teeth we are also provided class labels such as insects mammals and birds the result of our mgm is shown in figure each row is distinguishable feature each column is cluster lighter color boxes in figure indicate that the feature is likely to be present in the cluster while darker color boxes indicate that the feature is unlikely to be present in the cluster below each cluster few animals that belong to that cluster are listed we first note that as desired our model selects features that have large variation in their probabilities across the clusters rows in figure thus it is to read what makes each column different from the others the mammals in the third column do not lay eggs the insects in the fifth column are toothless and invertebrates and therefore have no tails they are also rarely predators unlike the land animals many of the water animals in columns one and two do not breathe recipes the recipes data set consists of ingredients from recipes taken from the computer cooking there are recipes with total ingredients the recipes fall into four categories pasta chili brownies or punch we seek to find ingredients and groups of ingredients that can distinguish different types of recipes note the names for each cluster have been filled in after the analysis based on the class label of the majority of the observations that were grouped into that cluster the mgm distills the ingredients into only important features the first extracted feature contains several spices which are present in pasta brownies and chili but not in punch punch is also distinguished from the other clusters by its lack of basic spices such as salt and pepper the second extracted feature the third extracted feature contains number of savory cooking ingredients such as oil garlic and shallots these are common in the pasta and chili clusters but uncommon in the punch and brownie clusters diseases finally we consider data set of patients with autism spectrum disorder asd accumulated over the first years of life asds are complex disease that is often associated with conditions such as seizures and developmental delays as most patients have very few diagnoses we limited our analysis to the patients with at least diagnoses and the diagnoses that occurred in at least of the patients we binarized the count data to values our model reduces these dimensions to important sets of features the extracted features had many more dimensions than in the examples so we only list two features from each group and provide the total number in parenthesis several of the groups of the extracted did not use any auxiliary similar to those from in particular report clusters of patients with epilepsy and cerebral palsy patients with psychiatric disorders and patients with gastrointestinal disorders using our representation we can easily see that there appears to be one group of sick patients cluster for whom all features are likely we can also see what features distinguish clusters and from each other by which ones are unlikely to be present verifying interpretability human subject experiment we conducted pilot study to gather more qualitative evaluation of the mgm we first divided the asd data into three datasets with random disjoint subsets of approximately dimensions each for each of these subsets we prepared the data in three formats raw patient data list of symptoms clustered results centroids from and clustered results with the mgm with distinguishable sets of features both the clustered results 
were presented as figures such as figure and the raw data were presented in spreadsheet three domain experts were then tasked to explore the different data subsets in each format so each participant saw all formats and all data subsets and produce sentence executive summary of each the different conditions serve as reference points for the subjects to give more qualitative feedback about the mgm all subjects reported that the raw with small number of impossible to summarize in minute period subjects also reported that the aggregation of states in the mgm helped them summarize the data faster rather than having to aggregate manually while none of them explicitly indicated they noticed that all the rows of the mgm were relevant they did report that it was easier to find the differences one strictly preferred the mgm over the options while another found the mgm easier for making up narrative but was overall satisfied with both the mgm and the clustering one subject appreciated the succinctness of the mgm but was concerned that it may lose some information this final comment motivates future work on structured priors for on what logical formulas should be allowed or likely future user studies should study the effects of the feature extraction and selection separately finally qualitative review of the summaries produced found similar but slightly more compact organization of notes in the mgm compared to the model computer cooking contest http discussion and related work mgm combines extractive and selective approaches for finding small set of distinguishable dimensions when performing unsupervised learning on data sets rather than rely on criteria that use statistical measures of variation and then performing additional to interpret the results we build interpretable criteria directly into the model our feature extraction step allows us to find natural groupings of dimensions such as backbone or tail or toothless in the animal data and salt or pepper or cream in the recipe data defining an interesting dimension as one whose parameters are drawn from distribution helps us recover groups like pasta and punch providing such comprehensive lists of distinguishing dimensions assists in the data exploration and hypothesis generation process similarly providing lists of dimensions that have been consolidated in one extraction aids the human discovery process of why those dimensions might be meaningful group closest to our work are feature selection approaches such as which also use mixture of to identify feature types in particular uses similar hierarchy of beta and bernoulli priors to identify important dimensions they carefully choose the priors so that some dimensions can be globally important while other dimensions can be locally important the parameters for important dimensions are chosen iid from gaussian distribution while values for all unimportant dimensions come from the same background distribution our approach draws parameters for important dimensions from distributions with multiple while unimportant dimensions are drawn from distribution thus our model is more expressive than approaches in which all unimportant dimension values are drawn from the same distribution it captures the idea that not all variation is important clusters can vary in their emission parameters for particular dimension and that variation still might not be interesting specifically an important dimension is one where there is gap between parameter values our feature extraction step collapses the dimensionality further 
while retaining interpretability more broadly there are many other lines of work that focus on creating latent variable models based on diversity or differences methods for inducing diversity such as determinantal point processes have been used to find diverse solutions on applications ranging from detecting objects in videos topic modeling and variable selection in these cases the goal is to avoid finding multiple very similar optima while the generated solutions are different the model itself does not provide descriptions of what distinguishes one solution from the rest moreover there may be situations in which forcing solutions to be very different might not make sense for example when clustering recipes it may be very sensible for the ingredient salt to be common feature of all clusters likewise when clustering patients from an autism cohort one would expect all patients to have some kind of developmental disorder finally other approaches focus on building models in which factors describe what distinguishes them from some baseline for example builds topic model in which each topic is described by the difference from some baseline distribution contrastive learning focuses on finding the directions that are most distinguish background data from foreground data approaches to topic models try to find topics that can best assist in distinguishing between classes but are not necessarily readily interpretable themselves conclusions and future work we presented mgm an approach for interpretable feature extraction and selection by incorporating criteria directly into the model design we found key dimensions that distinguished clusters of animals recipes and patients while this work focused on the clustering of binary data these ideas could also be applied to mixed and multiple membership models similarly notions of interestingness based on gap could be applied to categorical and continuous data it also would be interesting to consider more expressive extracted features such as more complex logical formulas finally while we learned feature extractions in completely unsupervised fashion our generative approach also allows one to flexibly incorporate domain knowledge about possible group memberships into the priors references miller the magical number seven plus or minus two some limits on our capacity for processing information the psychological review pp march blei ng and jordan latent dirichlet allocation jmlr pp zou hastie and tibshirani sparse principal component analysis journal of computational and graphical statistics vol than and ho fully sparse topic models in pp williamson wang heller and blei the ibp compound dirichlet process and its application to focused topic modeling icml elhamifar and vidal sparse subspace clustering algorithm theory and applications ieee trans pattern anal mach vol no pp agrawal gehrke gunopulos and raghavan automatic subspace clustering of high dimensional data for data mining applications sigmod vol pp june kim rudin and shah the bayesian case model generative approach for reasoning and prototype classification in nips differential diagnosis in primary care jama vol no pp and kf what the relative risk method of correcting the odds ratio in cohort studies of common outcomes jama vol no pp alelyani tang and liu feature selection for clustering data clustering algorithms and applications vol unsupervised variable selection when random rankings sound as in fsdm mitra murthy and pal unsupervised feature selection using feature similarity ieee transactions on 
pattern analysis and machine intelligence vol no pp dash and liu feature selection for clustering in kdd current issues and new applications pp tsuda kawanabe and mller clustering with the fisher score in nips dy and brodley feature selection for unsupervised learning jmlr pp guan dy and jordan unified probabilistic model for global and local unsupervised feature in icml pp fan and bouguila online learning of dirichlet process mixture of generalized dirichlet distributions for simultaneous clustering and localized feature in acml pp yu huang and wang document clustering via dirichlet process mixture model with feature selection in kdd pp acm freitas comprehensible classification models position paper acm sigkdd explorations newsletter de ath and fabricius classification and regression trees powerful yet simple technique for ecological data analysis ecology vol no pp wainwright and jordan graphical models exponential families and variational inference foundations and trends in machine learning vol no pp lichman uci machine learning repository kemp and tenenbaum the discovery of structural form pnas ge and kohane comorbidity clusters in autism spectrum disorders an electronic health record analysis pediatrics vol no pp kulesza learning with determinantal point processes phd thesis university of pennsylvania kulesza and taskar structured determinantal point processes in nips zou and adams priors for diversity in generative latent variable models in nips batmanghelich quon kulesza kellis golland and bornn diversifying sparsity using variational determinantal point processes corr eisenstein ahmed and xing sparse additive generative models of text icml zou hsu parkes and adams contrastive learning using spectral methods in nips zhu ahmed and xing medlda maximum margin supervised topic models for regression and classification in icml pp acm 
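To close the discussion of this paper with something concrete, the toy check below illustrates its central notion of importance — a gap in a group's per-cluster presence probabilities, as opposed to mere variance. It is a heuristic stand-in for intuition only; the MGM expresses this preference through its Beta-mixture prior rather than a hard threshold, and min_gap here is an arbitrary illustrative value.

```python
import numpy as np

def has_gap(probs, min_gap=0.4):
    """Flag a feature as 'distinguishing' only if its sorted per-cluster
    presence probabilities contain a clear jump separating a low group
    from a high group."""
    p = np.sort(np.asarray(probs, dtype=float))
    gaps = np.diff(p)
    return bool(gaps.size) and float(gaps.max()) >= min_gap

print(has_gap([0.20, 0.40, 0.60, 0.80]))   # False: high variance, but no gap
print(has_gap([0.05, 0.10, 0.85, 0.90]))   # True: low clusters vs. high clusters
```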
A Normative Theory of Adaptive Dimensionality Reduction in Neural Networks. Cengiz Pehlevan (Simons Center for Data Analysis, Simons Foundation, New York, NY; cpehlevan) and Dmitri Chklovskii (Simons Center for Data Analysis, Simons Foundation, New York, NY; dchklovskii). Abstract. To make sense of the world, our brains must analyze high-dimensional datasets streamed by our sensory organs. Because such analysis begins with dimensionality reduction, modeling early sensory processing requires biologically plausible online dimensionality reduction algorithms. Recently, we derived such an algorithm, termed similarity matching, from a multidimensional scaling (MDS) objective function. However, in the existing algorithm the number of output dimensions is set a priori by the number of output neurons and cannot be changed. Because the number of informative dimensions in sensory inputs is variable, there is a need for adaptive dimensionality reduction. Here we derive biologically plausible dimensionality reduction algorithms which adapt the number of output dimensions to the eigenspectrum of the input covariance matrix. We formulate three objective functions which, in the offline setting, are optimized by the projections of the input dataset onto its principal subspace scaled by the eigenvalues of the output covariance matrix. In turn, the output eigenvalues are computed as (i) soft-thresholded, (ii) hard-thresholded, and (iii) equalized thresholded eigenvalues of the input covariance matrix. In the online setting, we derive the three corresponding adaptive algorithms and map them onto the dynamics of neuronal activity in networks with biologically plausible local learning rules. Remarkably, in the last two networks, neurons are divided into two classes which we identify with principal neurons and interneurons in biological circuits. Introduction. Our brains analyze high-dimensional datasets streamed by our sensory organs with efficiency and speed rivaling modern computers. At the early stage of such analysis, the dimensionality of sensory inputs is drastically reduced, as evidenced by anatomical measurements: a human retina, for example, conveys signals from roughly a hundred million photoreceptors to the rest of the brain via only about a million ganglion cells, suggesting dimensionality reduction. Therefore, biologically plausible dimensionality reduction algorithms may offer a model of early sensory processing. In a seminal work, Oja proposed that a single neuron may compute the first principal component of activity in upstream neurons. At each time point, Oja's neuron projects a vector composed of the firing rates of upstream neurons onto the vector of synaptic weights by summing up the currents generated by its synapses. In turn, synaptic weights are adjusted according to a Hebbian rule depending on the activities of only the postsynaptic and corresponding presynaptic neurons. Following Oja's work, many multineuron circuits were proposed to extract multiple principal components of the input; see existing reviews. However, most multineuron algorithms did not meet the same level of rigor and biological plausibility as the single-neuron algorithm, which can be derived using a normative approach from a principled objective function and contains only local Hebbian learning rules. Algorithms derived from principled objective functions either did not possess local learning rules or had other biologically implausible features; in other algorithms, local rules were chosen heuristically rather than derived from a principled objective function. There is a notable exception to the above observation, but it has other shortcomings: the two-layer circuit with reciprocal synapses can be derived from the minimization of the representation error. However,
the activity of principal neurons in the circuit is dummy variable without its own dynamics therefore such principal neurons do not integrate their input in time contradicting existing experimental observations other normative approaches use an information theoretical objective to compare theoretical limits with experimentally measured information in single neurons or populations or to calculate optimal synaptic weights in postulated neural network recently novel approach to the problem has been proposed starting with the multidimensional scaling mds strain cost function we derived an algorithm which maps onto neuronal circuit with local learning rules however had major limitations which are shared by vairous other multineuron algorithms the number of output dimensions was determined by the fixed number of output neurons precluding adaptation to the varying number of informative components better solution would be to let the network decide depending on the input statistics how many dimensions to represent the dimensionality of neural activity in such network would be usually less than the maximum set by the number of neurons because output neurons were coupled by synapses which are most naturally implemented by inhibitory synapses if these neurons were to have excitatory outputs as suggested by cortical anatomy they would violate dale law each neuron uses only one fast neurotransmitter here following by we mean synaptic weights that get more negative with correlated activity of and postsynaptic neurons the output had wide dynamic range which is difficult to implement using biological neurons with limited range better solution is to equalize the output variance across neurons in this paper we advance the normative approach of by proposing three new objective functions which allow us to overcome the above limitations we optimize these objective functions by proceeding as follows in section we formulate and solve three optimization problems of the form offline setting arg min here the input to the network xt is an matrix with centered input data samples in rn as its columns and the output of the network yt is matrix with corresponding outputs in rk as its columns we assume and such optimization problems are posed in the offline setting where outputs are computed after seeing all data whereas the optimization problems in the offline setting admit solution such setting is for modeling neural computation on the mechanistic level and must be replaced by the online setting indeed neurons compute an output yt for each data sample presentation xt before the next data sample is presented and past outputs can not be altered in such online setting optimization is performed at every time step on the objective which is function of all inputs and outputs up to time moreover an online algorithm also known as streaming is not capable of storing all previous inputs and outputs and must rely on smaller number of state variables in section we formulate three corresponding online optimization problems with respect to yt while keeping all the previous outputs fixed online setting yt arg min yt then we derive algorithms solving these problems online and map their steps onto the dynamics of neuronal activity and local learning rules for synaptic weights in three neural networks we show that the solutions of the optimization problems and the corresponding online algorithms remove the limitations outlined above by performing the following computational tasks output eig output eig output eig input eig yk principal zl 
xn yk xn synapses hebbian input eig zl yk input eig xn figure functions of the three offline solutions and neural network implementations of the corresponding online algorithms inputoutput functions of covariance eigenvalues equalization after thresholding corresponding network architectures the eigenvalues of the input covariance matrix figure eigenvalues below the threshold are set to zero and the rest are shrunk by the threshold magnitude thus the number of output dimensions is chosen adaptively this algorithm maps onto neural network with the same architecture as in figure but with modified learning rules of input eigenvalues figure eigenvalues below the threshold vanish as before but eigenvalues above the threshold remain unchanged the steps of such algorithm map onto the dynamics of neuronal activity in network which in addition to principal neurons has layer of interneurons reciprocally connected with principal neurons and each other figure equalization of eigenvalues figure the corresponding network architecture figure lacks reciprocal connections among interneurons as before the number of abovethreshold eigenvalues is chosen adaptively and can not exceed the number of principal neurons if the two are equal this network whitens the output in section we demonstrate that the online algorithms perform well on synthetic dataset and in discussion we compare our neural circuits with biological observations dimensionality reduction in the offline setting in this section we introduce and solve in the offline setting three novel optimization problems whose solutions reduce the dimensionality of the input we state our results in three theorems which are proved in the supplementary material of covariance eigenvalues we consider the following optimization problem in the offline setting min αt it where and it is the identity matrix to gain intuition behind this choice of the objective function let us expand the squared norm and keep only the terms arg min αt it arg min tr where the first term matches the similarity of input and output and the second term is nuclear norm of known to be convex relaxation of the matrix rank used for matrix modeling thus objective function enforces similarity matching we show that the optimal output is projection of the input data onto its principal subspace the subspace dimensionality is set by the number of eigenvalues of the data covariance matrix pt xx xt that are greater than or equal to the parameter theorem suppose an of vx λx vx where λx diag λx with λx λt note that has at most nonzero eigenvalues coint ciding with those of then uk stk λx αt vkx are optima of where stk αt diag st αt st λk αt st is the softx thresholding function st max consists of the columns of corresponding to the top eigenvalues vk vk and uk is any orthogonal matrix uk the form uniquely defines all optima of except when λx αt and λx of covariance eigenvalues consider the following minimax problem in the offline setting min max αt it where and we introduced an internal variable which is an matrix zt with zt rl the intuition behind this objective function is again based on similarity matching but rank regularization is applied indirectly via the internal variable theorem suppose an of vx λx vx where λx diag λt with λx λt assume min then uk htk λx αt vkx ul stl min λx αt vlx are optima of where htk αt diag ht αt ht λk αt ht aθ with being the step function and if stl min αt diag st αt st λmin αt vpx vpx and up the form uniquely defines all optima except when either is an eigenvalue of or and 
λx equalizing thresholded covariance eigenvalues consider the following minimax problem in the offline setting min max tr xy yz αt βt where and this objective function follows from after dropping the quartic term theorem suppose of is vx λx vx where λx an diag λx with assume min then uk βt θk λx αt vkx ul oλy vx are optima of where θk αt diag αt λk αt is an rectangular diagonal matrix with top min diagonals are set to arbitrary nonnegative constants and the rest are zero oλy is orthogonal matrix that has two blocks the min dimensional and the bottom block is min dimensional vp xtop blockx is vp and up the form uniquely defines all optima of except when either is an eigenvalue of or and λx remark if then is and equalizing variance across all channels yy βik implying that the output is whitened online dimensionality reduction using neural nets in this section we formulate online versions of the dimensionality reduction optimization problems presented in the previous section derive corresponding online algorithms and map them onto the dynamics of neural networks with biologically plausible local learning rules the order of subsections corresponds to that in the previous section online of eigenvalues consider the following optimization problem in the online setting yt arg min αt it yt by keeping only the terms that depend on yt we get the following objective for xt yt yt yt yt αt im yt kyt kyt in the limit the last two terms can be dropped since the first two terms grow linearly with and dominate the remaining cost is positive definite quadratic form in yt and the optimization problem is convex at its minimum the following equality holds yt yt αt im yt yt xt xt while analytical solution via matrix inversion exists for yt we are interested in biologically plausible algorithms instead we use weighted jacobi iteration where yt is updated according to yt yt wty xt wty yt where is the weight parameter and wty and wty are normalized and outputoutput covariances tp tp yt xt yt yt yy yy yx wt ik wt wt ii tp αt αt yt yt iteration can be implemented by the dynamics of neuronal activity in network figure then wty and wty represent the weights of feedforward xt yt and lateral yt yt synaptic connections respectively remarkably synaptic weights appear in the online solution despite their absence in the optimization problem formulation previously nonnormalized covariances have been used as state variables in an online dictionary learning algorithm to formulate fully online algorithm we rewrite in recursive form this requires introducing representing cumulative activity of neuron up to time dt scalar variable dt tp αt yt then at each data sample presentation after the output yt converges to steady state the following updates are performed dty dt yt yx ij wt ij yt xt yt wt ij yy yy wt ij yt yt yt wt ij hence we arrive at neural network algorithm that solves the optimization problem for streaming data by alternating between two phases after data sample is presented at time in the first phase of the algorithm neuron activities are updated until convergence to fixed point in the second phase of the algorithm synaptic weights are updated for feedforward connections according to local hebbian rule and for lateral connections according to local rule due to the sign in equation interestingly in the limit these updates have the same form as the oja rule except that the learning rate is not free parameter but is determined by the cumulative neuronal activity online of eigenvalues consider the following minimax problem 
in the online setting where we assume yt zt arg min arg max αt it yt zt by keeping only those terms that depend on yt or zt and considering the limit we get the following objective kyt xt yt yt zt αt ik zt yt zt note that this objective is strongly convex in yt and strongly concave in zt the solution of this minimax problem is the of the objective function which is found by setting the gradient of the objective with respect to yt zt to zero yt xt xt zt zt zt zt αt ik zt zt αt yt to obtain neurally plausible algorithm we solve these equations by weighted jacobi iteration yt yt wty xt wty zt zt zt wtzy yt wtzz zt here similarly to wt are normalized covariances that can be updated recursively dty dt dt zt yx yx ij wt ij yt xt αwt ij yz yz ij wt ij yt zt αwt ij zy zy wtzy wt ij zt yt zt wt ij zz zz zz wtzz wt ii wt ij zt zt zt wt ij equations and define an online algorithm that can be naturally implemented by neural network with two populations of neurons principal and interneurons figure again after each data sample presentation the algorithm proceeds in two phases first is iterated until convergence by the dynamics of neuronal activities second synaptic weights are updated according to local for synapses from interneurons and hebbian for all other synapses rules online thresholding and equalization of eigenvalues consider the following minimax problem in the online setting where we assume and yt zt arg min arg max tr xy yz αty βtz yt zt by keeping only those terms that depend on yt or zt and considering the limit we get the following objective αt kyt xt yt yt βt kzt zt zt this objective is strongly convex in yt and strongly concave in zt and its saddle point is given by αt yt yt xt yt zt βt zt zt yt yt to obtain neurally plausible algorithm we solve these equations by weighted jacobi iteration yt yt wty xt wty zt zt zt ηwtzy yt as before wt are normalized covariances which can be updated recursively dty dt dt yx yx yx wt ij wt ij yt xt αwt ij yz yz ij wt ij yt zt αwt ij zy zy wtzy wt ij zt yt βwt ij equations and define an online algorithm that can be naturally implemented by neural network with principal neurons and interneurons as beofre after each data sample presentation at eigenvalue error subspace error eigenvalue error subspace error subspace error eigenvalue error figure performance of the three neural networks equalization after thresholding top eigenvalue error bottom subspace error as function of data presentations solid lines means and shades stds over runs red principal blue dashed lines power laws for metric definitions see text time the algorithm first iterates by the dynamics of neuronal activities until convergence and second updates synaptic weights according to local for synapses from interneurons and hebbian for all other synapses rules while an algorithm similar to but with predetermined learning rates was previously given in it has not been derived from an optimization problem plumbley convergence analysis of his algorithm suggests that at the fixed point of synaptic updates the interneuron activity is also projection onto the principal subspace this result is special case of our offline solution supported by the online numerical simulations next section numerical simulations here we evaluate the performance of the three online algorithms on synthetic dataset which is generated by an dimensional colored gaussian process with specified covariance matrix in this covariance matrix the eigenvalues and the remaining are chosen uniformly from the interval correlations are 
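The phase-one dynamics shared by the two-population circuits just described can be sketched as follows, assuming a fixed Jacobi weight; in the equalizing variant the interneuron-to-interneuron term can simply be omitted (W_zz taken as zero). The synaptic updates then follow the same recursive normalized-covariance scheme as in the single-population sketch above, with anti-Hebbian updates for synapses from interneurons. Function and variable names are illustrative assumptions.

```python
import numpy as np

def principal_interneuron_dynamics(x, W_yx, W_yz, W_zy, W_zz,
                                   eta=0.1, tol=1e-6, max_iter=500):
    """Iterate principal-neuron activities y and interneuron activities z to a
    fixed point for one input sample x; all synaptic weights are held fixed
    during this phase and are only updated afterwards."""
    y = np.zeros(W_yx.shape[0])
    z = np.zeros(W_zy.shape[0])
    ff = W_yx @ x                       # feedforward drive to principal neurons
    for _ in range(max_iter):
        y_new = (1 - eta) * y + eta * (ff - W_yz @ z)        # feedback from interneurons
        z_new = (1 - eta) * z + eta * (W_zy @ y - W_zz @ z)  # driven by principal cells
        if max(np.max(np.abs(y_new - y)), np.max(np.abs(z_new - z))) < tol:
            return y_new, z_new
        y, z = y_new, z_new
    return y, z
```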
introduced in the covariance matrix by generating random orthonormal eigenvectors for all three algorithms we choose and for the equalizing algorithm we choose in all simulated networks the number of principal neurons and for the and the equalizing algorithms the number of interneurons synaptic weight matrices were initialized randomly and synaptic update learning rates and were initialized to network dynamics is run with weight until the relative change in yt and zt in one cycle is to quantify the performance of these algorithms we use two different metrics the first metric eigenvalue error measures the deviation of output covariance eigenvalues from their optimal offline values given in theorems and the eigenvalue error at time is calculated by summing squared differences between the eigenvalues of yy or zz and their optimal offline values at time the second metric subspace error quantifies the deviation of the learned subspace from the true principal subspace to form such metric at each we calculate the linear transformation that maps inputs xt to outputs yt fyt xt and zt fzx xt at the fixed points of the neural dynamics stages of the three algorithms exact expressions for these matrices for all algorithms are given in the supplementary material then at each the deviation is fm um um where fm is an matrix whose columns are the top right singular vectors of ft fm is the projection matrix to the subspace spanned by these singular vectors um is an matrix whose columns are the principal eigenvectors of the input covariance matrix at time ux um is the projection matrix to the principal subspace further numerical simulations comparing the performance of the algorithm with with other neural principal subspace algorithms can be found in discussion and conclusions we developed normative approach for dimensionality reduction by formulating three novel optimization problems the solutions of which project the input onto its principal subspace and rescale the data by ii iii equalization after thresholding of the input eigenvalues remarkably we found that these optimization problems can be solved online using biologically plausible neural circuits the dimensionality of neural activity is the number of either input covariance eigenvalues above the threshold if or output neurons if the former case is ubiquitous in the analysis of experimental recordings for review see interestingly the division of neurons into two populations principal and interneurons in the last two models has natural parallels in biological neural networks in biology principal neurons and interneurons usually are excitatory and inhibitory respectively however we can not make such an assignment in our theory because the signs of neural activities xt and yt and hence the signs of synaptic weights are unconstrained previously interneurons were included into neural circuits outside of the normative approach similarity matching in the offline setting has been used to analyze experimentally recorded neuron activity lending support to our proposal semantically similar stimuli result in similar neural activity patterns in human fmri and monkey electrophysiology it cortices in addition computed similarities among visual stimuli by matching them with the similarity among corresponding retinal activity patterns using an information theoretic metric we see several possible extensions to the algorithms presented here our online objective functions may be optimized by alternative algorithms such as gradient descent which map onto different 
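The two performance metrics used in the simulations can be computed directly from the learned input-to-output map and the input covariance; a small sketch is given below. It assumes F is the linear transformation realized by the network at its fixed point (outputs by inputs), C is the input covariance matrix at the corresponding time, and the output covariance is normalized by the number of samples; these conventions and all names are ours.

```python
import numpy as np

def subspace_error(F, C, m):
    """Squared Frobenius distance between the projector onto the subspace
    spanned by the top-m right singular vectors of F and the projector onto
    the top-m principal subspace of the input covariance C."""
    _, _, Vt = np.linalg.svd(F, full_matrices=False)
    Fm = Vt[:m].T                                     # learned subspace basis (n, m)
    eigvals, eigvecs = np.linalg.eigh(C)
    Um = eigvecs[:, np.argsort(eigvals)[::-1][:m]]    # top-m principal eigenvectors
    return float(np.linalg.norm(Fm @ Fm.T - Um @ Um.T, 'fro') ** 2)

def eigenvalue_error(Y, optimal_eigs):
    """Sum of squared differences between the eigenvalues of the output
    covariance (Y has shape (k, T)) and their optimal offline values."""
    C_yy = Y @ Y.T / Y.shape[1]
    eigs = np.sort(np.linalg.eigvalsh(C_yy))[::-1]
    return float(np.sum((eigs - np.sort(optimal_eigs)[::-1]) ** 2))
```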
circuit architectures and learning rules interestingly gradient on objectives has been previously related to the dynamics of principal and interneurons inputs coming from distribution with covariance matrix can be processed by algorithms derived from the objective functions where contributions from older data points are forgotten or discounted such discounting results in higher learning rates in the corresponding online algorithms even at large giving them the ability to respond to variations in data statistics hence the output dimensionality can track the number of input dimensions whose eigenvalues exceed the threshold in general the output of our algorithms is not decorrelated such decorrelation can be achieved by including term in our objective functions choosing the threshold parameter requires an priori knowledge of input statistics better solution to be presented elsewhere would be to let the network adjust such threshold adaptively by filtering out all the eigenmodes with power below the mean eigenmode power here we focused on dimensionality reduction using only spatial as opposed to the correlation structure we thank greengard sengupta grinshpan wright barnett and pnevmatikakis references david hubel eye brain and vision scientific american american books oja simplified neuron model as principal component analyzer math biol ki diamantaras and sy kung principal component neural networks theory and applications john wiley sons yang projection approximation subspace tracking ieee trans signal hu zj towfic pehlevan genkin and db chklovskii neuron as signal processing device in asilomar conference on signals systems and computers pages ieee oja principal components minor components and linear neural networks neural networks arora cotter livescu and srebro stochastic optimization for pca and pls in allerton conf on communication control and computing pages ieee goes zhang arora and lerman robust stochastic principal component analysis in proc int conf on artificial intelligence and statistics pages todd leen dynamics of learning in recurrent networks nips adaptive network for optimal linear feature extraction in int joint conf on neural networks pages ieee td sanger optimal unsupervised learning in linear feedforward neural network neural networks rubner and tavan network for analysis epl md plumbley network which optimizes information capacity by orthonormalizing the principal subspace in proc int conf on artificial neural networks pages md plumbley subspace network that determines its own output dimension tech md plumbley information processing in negative feedback neural networks neural vertechi brendel and ck machens unsupervised learning of an efficient memory network in nips pages ba olshausen and dj field sparse coding with an overcomplete basis set strategy employed by vision res aa koulakov and rinberg sparse incomplete representations potential role of olfactory granule cells neuron druckmann hu and db chklovskii mechanistic model of early sensory processing based on subtracting sparse representations in nips pages al fairhall gd lewen bialek and rrr van steveninck efficiency and ambiguity in an adaptive neural code nature se palmer marre mj berry and bialek predictive information in sensory population pnas doi jl gauthier gd field shlens et al efficient coding of spatial information in the primate retina neurosci linsker in perceptual network computer pehlevan hu and db chklovskii neural network for linear subspace learning derivation from multidimensional scaling of 
streaming data neural comput young and as householder discussion of set of points in terms of their mutual distances psychometrika ws torgerson multidimensional scaling theory and method psychometrika hg barrow and jml budd automatic gain control by basic neural circuit artificial neural networks ej and recht exact matrix completion via convex optimization found comput math mairal bach ponce and sapiro online learning for matrix factorization and sparse coding jmlr boyd and vandenberghe convex optimization cambridge university press gao and ganguli on simplicity and complexity in the brave new world of neuroscience curr opin neurobiol zhu and cj rozell modeling inhibitory interneurons in efficient sensory coding models plos comput biol pd king zylberberg and mr deweese inhibitory interneurons decorrelate excitatory cells to drive sparse code formation in spiking model of neurosci kriegeskorte mur da ruff kiani et al matching categorical object representations in inferior temporal cortex of man and monkey neuron kiani esteky mirpour and tanaka object category structure in response patterns of neuronal population in monkey inferior temporal cortex neurophysiol segev and schneidman retinal metric stimulus distance measure derived from population neural responses prl hs seung tj richardson jc lagarias and jj hopfield networks nips minimax and hamiltonian dynamics of pehlevan and db chklovskii optimization theory of networks for pca and whitening in allerton conf on communication control and computing 
on the convergence of stochastic gradient mcmc algorithms with integrators changyou nan lawrence dept of electrical and computer engineering duke university durham nc usa google venice ca usa cchangyou dingnan lcarin abstract recent advances in bayesian learning with data have witnessed emergence of stochastic gradient mcmc algorithms such as stochastic gradient langevin dynamics sgld stochastic gradient hamiltonian mcmc sghmc and the stochastic gradient thermostat while convergence properties of the sgld with euler integrator have recently been studied corresponding theory for general has not been explored in this paper we consider general with integrators and develop theory to analyze convergence properties and their asymptotic invariant measures our theoretical results show faster convergence rates and more accurate invariant measures for with integrators for example with the proposed efficient symmetric splitting integrator the mean square error mse of the posterior average for the sghmc achieves an optimal convergence rate of at iterations compared to for the sghmc and sgld with euler integrators furthermore convergence results of are also developed with the same convergence rates as their counterparts for specific decreasing sequence experiments on both synthetic and real datasets verify our theory and show advantages of the proposed method in two real applications introduction in bayesian learning diffusion based sampling methods have become increasingly popular most of these methods are based on diffusions defined as xt xt dt xt dwt here xt represents model states the time index wt is brownian motion functions rn rn and rn not necessarily equal to are assumed to satisfy the usual lipschitz continuity condition in bayesian setting the goal is to design appropriate functions and so that the stationary distribution of the diffusion has marginal distribution that is equal to the posterior distribution of for example langevin dynamics ld correspond to and in with in being the identity matrix langevin dynamics correspond to and for some in here is the unnormalized negative and is known as the momentum based on the equation the stationary distributions of these dynamics exist and their marginals over are equal to exp the posterior distribution we are interested in since diffusions are markov processes exact sampling is in general infeasible as result the following two approximations have been introduced in the machine learning ature to make the sampling numerically feasible and practically scalable instead of analytically integrating infinitesimal increments dt numerical integration over small step is used to approximate the integration of the true dynamics although many numerical schemes have been studied in the sde literature in machine learning only the euler scheme is widely applied during every integration instead of working with the gradient of the full negative version of it is calculated from the minibatch of data important when considering problems with massive data in this paper we call algorithms based on and algorithms to be complete some recently proposed algorithms are briefly reviewed in appendix algorithms often work well in practice however some theoretical concerns about the convergence properties have been raised recently showed that the sgld converges weakly to the true posterior in the author studied the inconsistency of the hamiltonian pde with stochastic gradients but not the sghmc and pointed out its incompatibility with data subsampling however real applications 
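As a point of reference for the analysis that follows, the Euler-integrator SGLD baseline mentioned above amounts to the update θ ← θ − h ∇Ũ(θ) + sqrt(2h) ξ with ξ standard normal, where ∇Ũ is the minibatch estimate of the gradient of the negative log-posterior rescaled by N/n. A minimal sketch, with all function and argument names as illustrative assumptions rather than an interface from the paper:

```python
import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik, data, batch_size, h, rng):
    """One Euler-discretized Langevin step with a stochastic gradient.
    grad_log_lik(theta, x) returns the gradient of log p(x | theta); the
    minibatch gradient is rescaled by N / n to estimate the full-data sum."""
    N = len(data)
    idx = rng.choice(N, size=batch_size, replace=False)
    grad_U = -(grad_log_prior(theta)
               + (N / batch_size) * sum(grad_log_lik(theta, data[i]) for i in idx))
    noise = rng.normal(size=theta.shape)
    return theta - h * grad_U + np.sqrt(2.0 * h) * noise
```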
only require convergence in the weak sense instead of requiring convergence as in only laws of sample paths are of very recently the invariance measure of an sgmcmc with specific stochastic gradient noise was studied in however the technique is not readily applicable to our general setting in this paper we focus on general and study the role of their numerical integrators our main contributions include from theoretical viewpoint we prove weak convergence results for general which are of practical interest specifically for numerical integrator the bias of the expected sample average of an at iteration is upper bounded by with optimal step size and the mse by with optimal this generalizes the results of the sgld with an euler integrator in and is better when ii from practical perspective we introduce numerically efficient integrator based on symmetric splitting schemes when applied to the sghmc it outperforms existing algorithms including the sgld and sghmc with euler integrators considering both synthetic and large real datasets preliminaries two approximation errors in in weak convergence analysis instead of working directly with in we study how the expected value of any suitably smooth statistic of xt evolves in time this motivates the introduction of an infinitesimal generator formally the generator of the diffusion is defined for any compactly supported twice differentiable function rn such that xt lf xt xt xt xt xt where at tr at means approaches zero along the positive real axis is associated with an integrated form via kolmogorov backward xet et where xet denotes the exact solution of the diffusion at time the operator et is called the kolmogorov operator for the diffusion since diffusion is continuous it is generally infeasible to solve analytically so is et in practice local numerical integrator is used for every small step with the corresponding kolmogorov operator ph approximating ehl let xnlh denote the approximate sample path from such numerical integrator similarly we have xnlh ph xn let denote the composition of two operators and is evaluated on the output of for time lh we have the following approximation xet ehl ehl ph ph xnt with compositions where is obtained by decomposing into each for minibatch of data while approximation is manifested by approximating the infeasible ehl with ph from feasible integrator the symmetric splitting integrator proposed later such that for completeness we provide mean properties of the sghmc similar to in appendix more details of the equation are provided in appendix specifically under mild conditions on we can expand the operator ehl up to the such that the remainder terms are bounded by refer to for more details we will assume these conditions to hold for the in this paper xnt is close to the exact expectation xet the latter is the first approximation error introduced in formally to characterize the degree of approximation accuracies for different numerical methods we use the following definition definition an integrator is said to be local integrator if for any smooth and bounded function the corresponding kolmogorov operator ph satisfies the following relation ph ehl the second approximation error is manifested when handling large data specifically the sgld and sghmc use stochastic gradients in the and lds respectively by replacing in and the full negative with scaled from the minibatch we denote the corresponding generators with stochastic gradients as the generator in the minibatch for the sghmc becomes where as result in algorithms we 
use the noisy operator to approximate such that xn lh where xn denotes the numerical with stochastic gradient noise lh xet xn approximations and in are from the stochastic gradient and numerical integrator approximations respectively similarly we say corresponds to local integrator of if in the following sections we focus on which use numerical integrators with stochastic gradients and for the first time analyze how the two introduced errors affect their convergence behaviors for notational simplicity we henceforth use xt to represent the approximate xn convergence analysis this section develops theory to analyze convergence properties of general with both fixed and decreasing step sizes as well as their asymptotic invariant measures error analysis given an diffusion with an invariant measure the posterior average is defined as for some test function of interest for given numerical method pl with generated samples xlh we use the sample average defined as xlh to approximate in the analysis we define functional that solves the following poisson equation lψ xlh xlh or equivalently lψ xlh the solution functional xlh characterizes the difference between xlh and the posterior average for every xlh thus would typically possess unique solution which is at least as smooth as under the elliptic or hypoelliptic settings in the unbounded domain of xlh rn to make the presentation simple we follow and make certain assumptions on the solution functional of the poisson equation which are used in the detailed proofs extensive empirical results have indicated the assumptions to hold in many real applications though extra work is needed for theoretical verifications for different models which is beyond the scope of this paper assumption and its up to derivatives dk are bounded by kdk ψk ck pk for ck pk furthermore the expectation of on xlh is bounded supl ev xlh and is smooth such that max for some see for conditions to ensure is ergodic the existence of such function can be translated into finding lyapunov function for the corresponding sdes an important topic in pde literatures see assumption in and appendix for more details we emphasize that our proof techniques are related to those of the sgld but with significant distinctions in that instead of expanding the function xlh whose parameter xlh does not endow an explicit form in general we start from expanding the kolmogorov backward equation for each minibatch moreover our techniques apply for general instead of for one specific algorithm more specifically given local integrator with the corresponding kolmogorov operator according to definition and the kolmogorov backward equation for the minibatch can be expanded as xlh hk where is the identity map recall that in sghmc by further using the poisson equation to simplify related terms associated with after some algebra shown in appendix the bias can be derived from as lh hk lh all terms in the above equation can be bounded with details provided in appendix this gives us bound for the bias of an algorithm in theorem theorem under assumption let be the operator norm the bias of an with integrator at time hl can be bounded as hk lh note the bound above includes the term measuring the difference between the expectation of stochastic gradients and the true gradient it vanishes when the stochastic gradient is an unbiased estimation of the exact gradient an assumption made in the sgld this on the other hand indicates that if the stochastic gradient is biased might diverge when the growth of is faster than we point 
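In the notation above, with ΔV_l denoting the perturbation of the generator by the l-th stochastic gradient and K the order of the integrator, the bias bound of the theorem can be written as follows. This is a reconstruction consistent with the terms discussed in the surrounding text (the operator norm of the expected generator perturbation, a 1/(Lh) term, and the h^K integrator error), not a verbatim statement.

```latex
\[
\bigl|\mathbb{E}\hat{\phi} - \bar{\phi}\bigr|
  \;=\; O\!\left(\frac{\sum_{l}\bigl\|\mathbb{E}\,\Delta V_l\bigr\|}{L}
        \;+\; \frac{1}{Lh} \;+\; h^{K}\right),
\qquad
\hat{\phi}=\frac{1}{L}\sum_{l=1}^{L}\phi\bigl(x_{lh}\bigr),\quad
\bar{\phi}=\int \phi \, d\rho .
\]
```

Under this form, with unbiased stochastic gradients the first term vanishes, and balancing 1/(Lh) against h^K gives an optimal step size h proportional to L^{-1/(K+1)} and a bias rate of L^{-K/(K+1)}.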
this out to show our result to be more informative than that of the sgld though this case might not happen in real applications by expanding the proof for the bias we are also able to bound the mse of algorithms given in theorem theorem under assumption and assume is an unbiased estimate of ul for smooth test function the mse of an with integrator at time hl is bounded for some independent of as lh compared to the sgld the extra term relates to the variance of noisy gradients as long as the variance is bounded the mse still converges with the same rate specifically when optimizing bounds for the bias and mse the optimal bias decreases at rate of with step size while this is with step size for the these rates decrease faster than those of the sgld when the case of for the sghmc with our proposed symmetric splitting integrator is discussed in section stationary invariant measures the asymptotic invariant measures of correspond to approaching infinity in the above analysis according to the bias and mse above asymptotically the sample average is random variable with mean hk and variance close to the true this section defines distance between measures and studies more formally how the approximation errors affect the invariant measures of algorithms to compare with the standard mcmc convergence rate of the rate needs to be taken square root first we note that under mild conditions the existence of stationary invariant measure for an sgmcmc can be guaranteed by application of the theorem examining the conditions is beyond the scope of this paper for simplicity we follow and assume stationary invariant measures do exist for we denote the corresponding invariant measure as and the true posterior of model as similar to rwe assume our numerical solver is geometric ergodic meaning that for test function we have ex xlh for any from the ergodic theorem where ex denotes the expectation conditional on the geometric ergodicity implies that the integration is independent of the starting point of an algorithm given this we have the following theorem on invariant measures of theorem assume that integrator is geometric ergodic and its invariance measures exist define the the invariant measures and as distance between supφ then any invariant measure of an is close to with an error up to an order of hk there exists some such that chk for integrator with full gradients the corresponding invariant measure has been shown to be bounded by an order of hk as result theorem suggests only orders of numerical approximations but not the stochastic gradient approximation affect the asymptotic invariant measure of an algorithm this is also reflected by experiments presented in section with decreasing step sizes the original sgld was first proposed with sequence instead of fixing step sizes as analyzed in in the authors provide theoretical foundations on its asymptotic convergence properties we demonstrate in this section that for general algorithms decreasing step sizes for each minibatch are also feasible note our techniques here are different from those used for the sgld which interestingly result in similar convergence patterns specifically by adapting the same techniques used in the previous sections we establish conditions on the step size sequence to ensure asymptotic convergence and develop theory on their ergodic error as well to guarantee asymptotic consistency the following conditions on decreasing step size sequences are required assumption the step sizes hl are decreasingk hl and satisfy that hl hl and pl denote 
the finite sum of step sizes as sl hl under assumption we need to modify the sample average defined in section as weighted summation of xlh pl hl sl xlh for simplicity we assume to be an unbiased estimate of such that extending techniques in previous sections we develop the following bounds for the bias and mse theorem under assumptions and for smooth test function the bias and mse of with integrator at time sl are bounded as pl hl bias sl sl pl mse sl as result the asymptotic bias approaches according to the assumptions if further hl the mse also goes to in words the are consistent among the kinds of decreasing step size sequences commonly recognized one is hl for we show in the following corollary that such sequence leads to valid sequence corollary using the step size sequences hl for all the step size assumptions in theorem are satisfied as result the bias and mse approach zero asymptotically the sample average is asymptotically consistent with the posterior average actually the sequence be decreasing we assume it is decreasing for simplicity the assumption of hl satisfies this requirement but is weaker than the original assumption remark theorem indicates the sample average asymptotically converges to the true posterior average it is possible to find out the optimal decreasing rates for the specific decreasing sequence pl hl specifically using the bounds for see the proof of corollary for the two pl terms in the bias in theorem decreases at rate of whereas decreases as the balance between these two terms is achieved when which agrees with theorem on the optimal rate of similarly for the mse the first term decreases as independent of while the second and third terms decrease as and respectively thus the balance is achieved when which also agrees with the optimal rate for the mse in theorem according to theorem one theoretical advantage of over variants is the asymptotically unbiased estimation of posterior averages though the benefit might not be significant in real applications where the asymptotic regime is not reached practical numerical integrators given the theory for with integrators we here propose symmetric splitting integrator for practical use the euler integrator is known as integrator the proof and its detailed applications on the sgld and sghmc are given in appendix the main idea of the symmetric splitting scheme is to split the local generator into several that can be solved unfortunately one can not easily apply splitting scheme with the sgld however for the sghmc it can be readily split into la where la lb lol in these correspond to the following sdes which are all analytically solvable dθ dθ dt dθ dp dt dt based on these the local kolmogorov operator is defined as xlh where la lb ehlol lb la so that the corresponding updates for xlh θlh plh consist of the following steps θlh plh plh plh θlh plh plh θlh θlh plh θlh plh plh where are intermediate variables we denote such splitting method as the aboba scheme from the markovian property of kolmogorov operator it is readily seen that all such symmetric splitting schemes with different orders of and are equivalent lemma below shows the symmetric splitting scheme is local integrator lemma the symmetric splitting scheme is local integrator the corresponding kolmogorov operator satisfies when this integrator is applied to the sghmc the following properties can be obtained remark applying theorem to the sghmc with the symmetric splitting scheme the bias is bounded as lh the optimal bias decreasing rate is compared to for the 
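Written out, one symmetric-splitting step for the SGHMC consists of two half drifts, two half friction steps, and one full stochastic-gradient kick in between. The sketch below assumes a constant scalar friction D and places the injected noise in the middle step together with the gradient, which is one common convention for such splittings rather than necessarily the exact scheme used here; all names are illustrative.

```python
import numpy as np

def sghmc_splitting_step(theta, p, stoch_grad_U, h, D, rng):
    """One symmetric-splitting (ABOBA-style) update for SGHMC.
    stoch_grad_U(theta) returns a minibatch estimate of the gradient of the
    negative log-posterior; D is the friction constant."""
    theta = theta + p * (h / 2.0)                      # A: half drift
    p = np.exp(-D * h / 2.0) * p                       # B: half friction
    p = (p - stoch_grad_U(theta) * h                   # O: full gradient kick
         + np.sqrt(2.0 * D * h) * rng.normal(size=p.shape))  # injected noise
    p = np.exp(-D * h / 2.0) * p                       # B: half friction
    theta = theta + p * (h / 2.0)                      # A: half drift
    return theta, p
```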
sgld similarly the mse is bounded by decreasing optimally as with step size compared lh to the mse of for the sgld this indicates that the sghmc with the splitting integrator converges faster than the sgld and sghmc with euler integrators remark for sghmc based on remark the optimal step size decreasing rate for the bias is and for the mse these agree with their counterparts in remark thus are faster than the with euler integrators this is different from the traditional splitting in sde literatures where instead of is split mse bias iterations iterations figure bias of left and mse of right with different step size rates thick red curves correspond to theoretically optimal rates experiments bias we here verify our theory and compare with related algorithms on both synthetic data and machine learning applications splitting euler synthetic data we consider standard gaussian model where xi data samples xi are generated and every minibatch in the stochastic gradient is of size the test function is defined as with explicit expression for the posterior average to evaluate the expectations in the bias and mse we average over runs with random initializations first we compare the invariant measures with of step size the proposed splitting integrator and euler integrator for the sghmc results of the sgld are omitted since they are not figure comparisons of symmetas competitive figure plots the biases with different step ric splitting and euler integrators sizes it is clear that the euler integrator has larger biases in the invariant measure and quickly explodes when the step size becomes large which does not happen for the splitting integrator in real applications we also find this happen frequently shown in the next section making the euler scheme an unstable integrator next we examine the asymptotically optimal step size rates for the sghmc from the theory we know these are for the bias and for the mse in both sghmc and sghmc for the step sizes we did grid search to select the best prefactors which resulted in for the and for the with different values we plot the traces of the bias for the and the mse for the in figure similar results for the bias of the and the mse of the are plotted in appendix we find that when rates are smaller than the theoretically optimal rates bias and mse the bias and mse tend to decrease faster than the optimal rates at the beginning especially for the but eventually they slow down and are surpassed by the optimal rates consistent with the asymptotic theory this also suggests that if only small number of iterations were feasible setting larger step size than the theoretically optimal one might be beneficial in practice finally we study the relative convergence speed of the sghmc and sgld we test both and versions for experiments the step sizes are set to with chosen according to the theory for sgld and sghmc to provide fair comparison the constants are selected via grid search from to with an interval of for it is then fixed in the other runs with different values the parameter in the sghmc is selected within as well for experiments an initial step size is chosen within with an interval of for different and then it decreases according to their theoretical optimal rates figure shows comparison of the biases for the sghmc and sgld as indicated by both theory and experiments the sghmc with the splitting integrator yields faster convergence speed than the sgld with an euler integrator machine learning applications for real applications we test the sgld with an euler 
integrator the sghmc with the splitting integrator and the sghmc with an using the same initial step size is not fair because the sgld requires much smaller step sizes sgld sghmc sgld sghmc bias bias iterations iterations figure biases for the left and right sghmc and sgld euler integrator first we test them on the latent dirichlet allocation model lda the data used consists of randomly downloaded documents from wikipedia using scripts provided in we randomly select documents for testing and validation respectively as in the vocabulary size is we use the reparametrization trick to sample from the probabilistic simplex the step sizes are chosen from and parameter from the minibatch size is set to with one pass of the whole data in the experiments and therefore we collect posterior samples to calculate test perplexities with standard holdout technique as described in next recently studied sigmoid belief network model sbn is tested which is directed counterpart of the popular rbm model we use one layer model where the bottom layer corresponds to binary observed data which is generated from the hidden layer also binary via sigmoid function as shown in the sbn is readily learned by we test the model on the mnist dataset which consists of hand written digits of size for training and for testing again the step sizes are chosen from from the minibatch is set to with iterations for training like applied for the rbm an advance technique called anneal importance sampler ais is adopted for calculating test likelihoods perplexity we briefly describe the results here more details are provided in appendix for lda with topics the best test perplexities for the and sgld are and spectively while these are and respectively for topics similar to the synthetic experiments we also observed crashed when using large step sizes this is illustrated more step size clearly in figure for the sbn with hidfigure sghmc with topics the euden units we obtain negative test of ler explodes with large step sizes and for the and sgld respectively and these are and for hidden units note the on sbn yields results on test likelihoods compared to which was for hidden units decrease of units in the with ais is considered to be reasonable gain which is approximately equal to the gain from shallow to deep model is more accuracy and robust than due to its splitting integrator conclusion for the first time we develop theory to analyze ergodic errors as well as asymptotic invariant measures of general with integrators our theory applies for both fixed and decreasing step size which are shown to be equivalent in terms of convergence rates and are faster with our proposed integrator than previous with euler integrators experiments on both synthetic and large real datasets validate our theory the theory also indicates that with increasing order of numerical integrators the convergence rate of an is able to theoretically approach the standard mcmc convergence rate given the theoretical convergence results can be used effectively in real applications acknowledgments supported in part by aro darpa doe nga and onr we acknowledge jonathan mattingly and chunyuan li for inspiring discussions david carlson for the ais codes references chen fox and guestrin stochastic gradient hamiltonian monte carlo in icml ding fang babbush chen skeel and neven bayesian sampling using stochastic gradient thermostats in nips risken the equation new york welling and teh bayesian learning via stochastic gradient langevin dynamics in icml teh thiery and vollmer 
consistency and fluctuations for stochastic gradient langevin dynamics technical report university of oxford uk url http vollmer zygalakis and teh asymptotic properties of stochastic gradient langevin dynamics technical report university of oxford uk january url http betancourt the fundamental incompatibility of hamiltonian monte carlo and data subsampling in icml sato and nakagawa approximation analysis of stochastic gradient langevin dynamics by using equation and process in icml leimkuhler and shang adaptive thermostats for noisy gradient systems technical report university of edinburgh uk may url http abdulle vilmart and zygalakis long time accuracy of splitting methods for langevin dynamics siam numer hasminskii stochastic stability of differential equations berlin heidelberg mattingly stuart and tretyakov construction of numerical and stationary measures via poisson equations siam numer giesl construction of global lyapunov functions using radial basis functions springer berlin heidelberg bogoliubov and krylov la theorie generalie de la mesure dans son application etude de systemes dynamiques de la mecanique ann math ii in french leimkuhler and matthews rational construction of stochastic numerical methods for molecular sampling amrx blei ng and jordan latent dirichlet allocation jmlr hoffman blei and bach online learning for latent dirichlet allocation in nips gan chen henao carlson and carin scalable deep poisson factor analysis for topic modeling in icml patterson and teh stochastic gradient riemannian langevin dynamics on the probability simplex in nips gan henao carlson and carin learning deep sigmoid belief networks with data augmentation in aistats salakhutdinov and murray on the quantitative analysis of deep belief networks in icml mnih and gregor neural variational inference and learning in belief networks in icml debussche and faou weak backward error analysis for sdes siam numer kopec weak backward error analysis for overdamped langevin processes ima numer 
Learning Structured Densities via Infinite Dimensional Exponential Families. Mladen Kolar (University of Chicago), Siqi Sun (TTI Chicago), Jinbo Xu (TTI Chicago).

Abstract. Learning the structure of probabilistic graphical models is a well studied problem in the machine learning community, owing to its importance in many applications. Current approaches are mainly focused on learning the structure under restrictive parametric assumptions, which limits the applicability of these methods. In this paper we study the problem of estimating the structure of a probabilistic graphical model without assuming a particular parametric model. We consider densities that are members of an infinite dimensional exponential family, which is parametrized by a reproducing kernel Hilbert space (RKHS) and its kernel. One difficulty in learning nonparametric densities is the evaluation of the normalizing constant; to avoid this issue, our procedure minimizes a penalized score matching objective. We show how to efficiently minimize the proposed objective using existing group lasso solvers, and we prove that our procedure recovers the graph structure with high probability under mild conditions. Simulation studies illustrate the ability of our procedure to recover the true graph structure without knowledge of the data generating process.

Introduction. Undirected graphical models, or Markov random fields, have been extensively studied and applied in fields ranging from computational biology to natural language processing and computer vision. In an undirected graphical model, the conditional independence assumptions underlying a probability distribution are encoded in the graph structure; furthermore, the joint probability density function can be factorized according to the cliques of the graph. One of the fundamental problems in the literature is learning the structure of a graphical model given an i.i.d. sample from an unknown distribution. A lot of work has been done under specific parametric assumptions on the unknown distribution. For example, in Gaussian graphical models the structure of the graph is encoded by the sparsity pattern of the precision matrix. Similarly, in the context of exponential family graphical models, where the conditional distribution of a node given all the other nodes is a member of an exponential family, the structure is described by the non-zero coefficients. Most existing approaches to learning the structure of an undirected graphical model are based on minimizing a penalized loss objective, where the loss is usually a log-likelihood or composite likelihood and the penalty induces sparsity on the resulting parameter vector. In addition to sparsity-inducing penalties, methods that use other structural constraints have been proposed: since many networks are scale-free, several algorithms are designed specifically to learn the structure of such networks, and because graphs tend to have cluster structure, learning the structure and the cluster assignment simultaneously has also been investigated. In this paper we focus on learning the structure of pairwise graphical models without assuming a parametric class of models. The main challenge in estimating nonparametric graphical models is the computation of the log normalizing constant. To get around this problem, we propose to use score matching as a divergence instead of the usual KL divergence, as it does not require evaluation of the log partition function. The probability density function is estimated by minimizing the expected distance between the model score function and the data score function, where the score function is defined as the gradient of the logarithm of the
corresponding probability density functions the advantage of this measure is that the normalization constant is canceled out when computing the distance in order to learn the underlying graph structure we assume that the logarithm of the density is additive in and potentials and use sparsity inducing penalty to select edge potentials as we will prove later our procedure will allow us to consistently estimate the underlying graph structure the rest of paper is organized as follows we first introduce the notations background and related work then we formulate our model establish representer theorem and present group lasso algorithm to optimize the objective next we prove that our estimator is consistent by showing that it can recover the true graph with high probability given sufficient number of samples finally the results for simulated data are presented to demonstrate the correctness of our algorithm empirically notations let denote the set for vector θd rd let kθkp denote its lp norm let column vector vec denote the vectorization of matrix cat denote the concatenation of two vectors and and mat atd the matrix with rows given by atd for rd let lp denote the space of function for which the power of absolute value is integrable and for lp let kf klp kf kp dx denote its lp norm throughout the paper we denote or hi hij as hilbert space and kh as corresponding inner product and norm for any operator we use kck to denote the usual operator norm which is defined as kck inf kcf akf for all and kckhs to denote its norm which is defined as kcei where ei is an orthonormal basis of for an index set also we use to denote operator range space for any and let denote their tensor product background related work learning graphical models in exponential families let xd be random vector from multivariate gaussian distribution it is well known that the conditional independency of two variables given all the others is encoded in the zero pattern of its precision matrix that is xi and xj are conditionally independent given if and only if ωij where is the vector of without xi and xj sparse estimate of can be obtained by joint selection or neighborhood selection optimization with an added penalty given independent realizations of rows of the penalized estimate of the precision matrix can be obtained as arg min tr log det where and controls the sparsity level of estimated graph the method estimates the neighborhood of node by the of the solution to regularized linear model arg min kxs the estimated neighborhood is then θsa another way to specify parametric graphical model is by assuming that each distributions is part of the exponential family specifically the conditional distribution of xs given is assumed to be xs exp θst xs xt xs where is the base measure is the constant and is the neighborhood the node similar to the neighborhood of each node can be estimated by minimizing the negative with penalty on the optimization is tractable when the normalization constant can be easily computed based on the model assumption for example under poisson graphical model assumptions for count data the normalization constant is exp θst xt when using the neighborhood estimation the graph can be estimated as the union of the neighborhoods of each node which leads to consistent graph estimation generalized exponential family and rkhs we say is rkhs associated with kernel if and only if for each the following two conditions are satisfied and it has reproducing properties such that hf ih for all where is symmetric and positive 
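For the parametric Gaussian case reviewed above, the penalized precision-matrix estimate and the resulting edge set can be obtained off the shelf; a minimal sketch using scikit-learn's GraphicalLasso is given below, where the penalty alpha and the zero-threshold tol are placeholders rather than values from the text.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def gaussian_graph_estimate(X, alpha=0.05, tol=1e-8):
    """Estimate a Gaussian graphical model from a data matrix X (n x d):
    fit an l1-penalized precision matrix and read the edge set off its
    off-diagonal sparsity pattern."""
    model = GraphicalLasso(alpha=alpha).fit(X)
    Omega = model.precision_
    d = Omega.shape[0]
    edges = {(i, j) for i in range(d) for j in range(i + 1, d)
             if abs(Omega[i, j]) > tol}
    return Omega, edges
```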
semidefinite function denote the rkhs with kernel as for any there exists set of xi and αi such that αi xi similarly for any βj yj the inner product of and is defined as hf gih qp therefore the norm of simply is kf αi αj xi xj the sumi mation is guaranteed to be larger than or equal to zero because the kernel is positive semidefinite we consider the exponential family in infinite dimensions where pf ef and the function space is defined as log ef dx where is the base measure is generalized normalization constant such that pf is valid probability density function and is rkhs associated with kernel to see it as generalization of the exponential family we show some examples that can generate useful finite dimension exponential families normal xy poisson xy exponential xy for more detailed information please refer to when learning structure of graphical model we will further impose structural conditions on in order ensure that consists of additive functions score matching score matching is convenient procedure that allows for estimating probability density without computing the normalizing constant it is based on minimizing fisher divergence log log dx where log log is the score function observe that for the normalization constant cancels out in the gradient computation which makes the divergence independent of since the score matching objective involves the unknown oracle probability density function it is typically not computable however under some mild conditions which we will discuss in methods section can be rewritten as log log dx after substituting the expectation with an empirical average we get log xa log xa pkp compared to maximum likelihood estimation minimizing pkp is computationally tractable while we will be able to estimate only up to scale factor this will be sufficient for the purpose of graph structure estimation methods model formulation and assumptions we assume that the true probability density function is in furthermore for simplicity we assume that log ij xi xj where ii xi xi is node potential and ij xi xj is an edge potential the set denotes the edge set of the graph extensions to models where potentials are defined over larger cliques are possible we further assume that ij hij kij where hij is rkhs with kernel kij to simplify the notation we use ij or kij to denote ij xi xj and kij xi xj if the context is clear we drop the subscript for norm or inner product define fij hij as set of functions that decompose as sum of bivariate functions on edge note that is also subset of rkhs with the norm kf kfij khij and kernel kij let kf kfij khij for any edge set not necessarily the true edge set we denote ωs fs kfs khs as the norm reduced to similarly denote its dual norm as fs maxωs gs hfs gs under the assumption that the unknown is additive the loss function becomes dx xi xj xi xj hfij ij dx fij ij hfij ij cijij fij ij intuitively can be viewed as matrix and the operator at position ij ij is cij ij for general ij the corresponding operator simply is define css as xi xj xj dx which intuitively can be treated as sub matrix of with rows and columns we will use this notation intensively in the main theorem and its proof following we make the following assumptions each kij is twice differentiable on for any and χj aj bj we assume that lim xi or kij where xi and ai bi could be or this condition ensures that for any for more details see kij khij khij the operator css is compact and the smallest eigenvalue ωmin λmin css cs css where which means there exists such that cγ is the oracle 
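Spelled out in the usual Hyvärinen form, the population score-matching objective and the empirical counterpart referred to above are, up to a constant that does not depend on f (notation ours):

```latex
\[
J(p_0 \,\|\, p_f)
  = \tfrac{1}{2} \int p_0(x)\,
    \bigl\| \nabla_x \log p_0(x) - \nabla_x \log p_f(x) \bigr\|_2^2 \, dx ,
\]
\[
J(p_0 \,\|\, p_f)
  = \int p_0(x) \sum_{i=1}^{d}
    \Bigl[ \partial_i^2 \log p_f(x) + \tfrac{1}{2}\bigl(\partial_i \log p_f(x)\bigr)^2 \Bigr] dx
  \;+\; \text{const},
\]
\[
\hat J(f)
  = \frac{1}{n}\sum_{a=1}^{n}\sum_{i=1}^{d}
    \Bigl[ \partial_i^2 \log p_f(x_a) + \tfrac{1}{2}\bigl(\partial_i \log p_f(x_a)\bigr)^2 \Bigr].
\]
```

Because only derivatives of the log-density appear, the normalizing constant drops out, which is what makes the objective usable for the infinite dimensional exponential family.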
function we will discuss the definition of operator and in section compared with can be interpreted as the dependency condition and the is the incoherence condition which is standard condition for structure learning in high dimensional statistical estimators estimation procedure we estimate by minimizing the following penalized score matching objective kf min fij hij is given in the norm kf where kfij khij is used as sparsity inducing penalty simplified form of is given below that will lead to efficient algorithm for solving the following theorem states that the score matching objective can be written as penalized quadratic function on theorem the score matching objective can be represented as lµ hf kf where dx is trace operator ii given observed data the empirical estimation of lµ is hfij kf const hf kij xai xaj where kij xai xaj xa and ξˆij kij xai xaj otherwise xa if or ξˆij please refer to our supplementary material for detailed proof the above theorem still requires us to minimize over our next results shows that the solution is finite dimensional that is we establish representer theorem for our problem please visit for supplementary material and code theorem the solution to can be represented as xbi xbj xbi xbj fˆij βbij βbji αij ξˆij where ii minimizing is equivalent to minimizing the following quadratic function ab ab βbij βbji αij hij ai bj xx βbij αij kξˆij kf ij βbji hij µx θij dai ij ij ai kij xa xb xb where gab hrb ξij are constant that only depends on ijrs ij cat vec vec is the vector parameter and θij cat αij vec is group of parameters dai and are corresponding constant vectors and matrices based on and the order of parameters then the above problem can be solved by group lasso the first part of theorem states our representer theorem and the second part is obtained by plugging in to see supplementary material for detailed proof theorem provides us with an efficient way to minimize as it reduced the optimization to group lasso problem for which many efficient solvers exist let fˆµ arg minf denote the solution to we can estimate the graph as follows kfˆij that is the graph is encoded in the sparsity pattern of fˆµ statistical guarantees in this section we study statistical properties of the proposed estimator let denote the true edge set and its complement we prove that recovers with high probability when the sample size is sufficiently large denote mat dai dnd we will need the following result on the estimated operator proposition lemma in or theorem in properties of ckhs µl min kc µl where and is diagonal with positive constants the following result gives first order optimality conditions for the optimization problem proposition optimality condition achieves optimality when the following two conditions are satisfied µω fs kfs khs µω with these preliminary results we have the following main results theorem assume that conditions are satisfied the regularization parameter is ηκmin min lected at the order of and satisfies where κmin kfs κmax and κmax then proof idea the theorem above is the main theoretical guarantee for our score matching estimator we use the witness proof framework inspired by let denote the true density function and the probability density function we first construct solution fˆs on true edge set as fˆs min kfij fs and set fˆs as zero using proposition we prove that kfˆs op then we compute the subgradient on and prove that its dual norm is upper bounded by µω by using assumptions and therefore we construct solution that satisfied the optimality condition and 
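Because the reduced problem is a quadratic in the finite-dimensional coefficient vector plus a sum of group norms, any group lasso solver applies. A minimal proximal-gradient sketch for the generic form min_x 0.5 x^T G x + c^T x + lam * sum_g ||x_g||_2 is below; it assumes G is symmetric positive semidefinite, the groups partition the coordinates, and the step size is small enough (for example below 1 / ||G||). After solving, an edge (i, j) is included in the estimated graph exactly when the norm of its coefficient group is nonzero.

```python
import numpy as np

def group_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2 (block soft-thresholding)."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1.0 - t / norm) * v

def group_lasso_quadratic(G, c, groups, lam, step, n_iter=2000):
    """Proximal gradient for min_x 0.5 x^T G x + c^T x + lam * sum_g ||x_g||_2,
    where `groups` is a list of index arrays partitioning the coordinates."""
    x = np.zeros_like(c, dtype=float)
    for _ in range(n_iter):
        z = x - step * (G @ x + c)          # gradient step on the smooth part
        for g in groups:                    # proximal step, one block at a time
            x[g] = group_soft_threshold(z[g], step * lam)
    return x
```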
converges in probability to the true graph refer to supplementary material for detailed proof experiments we illustrate performance of our method on two simulations in our experiments we use the same kernel defined as follows kx exp xt that is the summation of gaussian kernel and polynomial kernel we set and for all the simulations we report the true positive rate vs false positive rate roc curve to measure the performance of different procedures let be the true edge set and let be the estimated graph the true positive and and rate is defined as tprµ and false positive rate is fprµ where is the cardinality of the set the curve is then plotted based on regularization parameters and based on independent runs in the first simulation we apply our algorithm to data sampled from simple chain gaussian model see figure for detail and compare its performance with glasso we use the same sampling method as in to generate the data we set ωs for and its diagonal to constant such that is positive definite we set the dimension to and change the sample size data points except for the low sample size case the performance of our method is comparable with glasso without utilizing the fact that the underlying distribution is of particular parametric form intuitively to capture the graph structure the proposed nonparametric method requires more data because of much weaker assumptions to further show the strength of our algorithm we test it on nonparanormal npn distribution random vector xp has nonparanormal distribution if there exist functions fp such that fd xd when is monotone and differentiable the probability density function is given by exp here the graph structure is still encoded in the sparsity pattern of that is xi if and only if ωij in our experiments we use the symmetric power transformation that is fj zj σj qr zj µj µj σj dt µj sme glasso adjacent matrix truepositiverate truepositiverate falsepositiverate falsepositiverate figure the estimation results for gaussian graphical models left the adjacent matrix of true graph center the roc curve of glasso right the roc curve of score matching estimator sme sme nonparanormal glasso falsepositiverate falsepositiverate truepositiverate truepositiverate truepositiverate falsepositiverate figure the estimated roc curves of nonparanormal graphical models for glasso left npn center and sme right where sign to transform data for comparison with graph lasso we first use truncation method to gaussianize the data and then apply graphical lasso to the transformed data see for details from figure without knowing the underlying data distribution the score matching estimator outperforms glasso and show similar results to nonparanormal when the sample size is large discussion in this paper we have proposed new procedure for learning the structure of nonparametric graphical model our procedure is based on minimizing penalized score matching objective which can be performed efficiently using existing group lasso solvers particularly appealing aspect of our approach is that it does not require computing the normalization constant therefore our procedure can be applied to very broad family of infinite dimensional exponential families we have established that the procedure provably recovers the true underlying graphical structure with highprobability under mild conditions in the future we plan to investigate more efficient algorithms for solving since it is often the case that is well structured and can be efficiently approximated acknowledgments the authors are grateful 
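The kernel and the evaluation metrics used in the simulations can be written compactly as follows. The bandwidth, offset and polynomial degree are placeholders, since only the general form (a Gaussian kernel plus a polynomial kernel) is specified, and the TPR/FPR definitions are the usual ones with edge sets represented as sets of unordered node pairs.

```python
import numpy as np

def kernel(x, y, sigma=1.0, c=1.0, degree=2):
    """Sum of a Gaussian kernel and a polynomial kernel evaluated at (x, y)."""
    return (np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
            + (np.dot(x, y) + c) ** degree)

def tpr_fpr(E_true, E_hat, d):
    """True/false positive rates of an estimated edge set against the truth;
    edges are frozensets of node pairs over nodes 0..d-1."""
    all_pairs = {frozenset((i, j)) for i in range(d) for j in range(i + 1, d)}
    tp = len(E_hat & E_true)
    fp = len(E_hat - E_true)
    tpr = tp / max(len(E_true), 1)
    fpr = fp / max(len(all_pairs - E_true), 1)
    return tpr, fpr
```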
to the financial support from national institutes of health national science foundation career award and ibm corporation faculty research fund at the university of chicago booth school of business this work was completed in part with resources provided by the university of chicago research computing center references albert networks in cell biology journal of cell science ambroise chiquet matias et al inferring sparse gaussian graphical models with latent structure electronic journal of statistics aronszajn theory of reproducing kernels transactions of the american mathematical society pages canu and smola kernel methods and the exponential family neurocomputing defazio and caetano convex formulation for learning networks via submodular relaxation in advances in neural information processing systems pages friedman hastie and tibshirani sparse inverse covariance estimation with the graphical lasso biostatistics friedman hastie and tibshirani note on the group lasso and sparse group lasso arxiv preprint fukumizu bach and gretton statistical consistency of kernel canonical correlation analysis the journal of machine learning research geman and graffigne markov random field image models and their applications to computer vision in proceedings of the international congress of mathematicians volume page estimation of statistical models by score matching in journal of machine learning research pages some extensions of score matching computational statistics data analysis jeon and lin an effective method for anova estimation with application to nonparametric graphical model building statistica sinica kindermann snell et al markov random fields and their applications volume american mathematical society providence ri koller and friedman probabilistic graphical models principles and techniques mit press kourmpetis van dijk bink van ham and ter braak bayesian markov random field analysis for protein function prediction based on network data plos one lafferty mccallum and pereira conditional random fields probabilistic models for segmenting and labeling sequence data li markov random field modeling in image analysis liu lafferty and wasserman the nonparanormal semiparametric estimation of high dimensional undirected graphs the journal of machine learning research liu and ihler learning scale free networks by reweighted regularization in international conference on artificial intelligence and statistics pages manning and foundations of statistical natural language processing mit press meier van de geer and the group lasso for logistic regression journal of the royal statistical society series statistical methodology meinshausen and graphs and variable selection with the lasso the annals of statistics pages ravikumar wainwright and lafferty graphical model selection using logistic regression ravikumar wainwright lafferty et al ising model selection using logistic regression the annals of statistics rockafellar convex analysis number princeton university press sriperumbudur fukumizu kumar gretton and density estimation in infinite dimensional exponential families arxiv preprint sun wang and xu inferring block structure of graphical models in exponential families in proceedings of the eighteenth international conference on artificial intelligence and statistics pages wei and li markov random field model for analysis of genomic data bioinformatics yang allen liu and ravikumar graphical models via generalized linear models in advances in neural information processing systems pages yuan and lin model 
selection and estimation in the gaussian graphical model, biometrika. zhao, liu, roeder, lafferty and wasserman, the huge package for undirected graph estimation, journal of machine learning research.
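To make the evaluation protocol of the experiments described above concrete (the kernel given by the sum of a Gaussian and a polynomial kernel, and the true/false positive rates used for the ROC curves), the following minimal Python sketch shows both pieces. The bandwidth, polynomial degree, and offset are illustrative placeholders, since the exact constants are not recoverable from the text, and the helper names are ours.

```python
import numpy as np

def combined_kernel(x, y, sigma=1.0, degree=2, c=1.0):
    # sum of a Gaussian (RBF) kernel and a polynomial kernel;
    # sigma, degree and c are illustrative values, not the paper's constants
    rbf = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
    poly = (x @ y + c) ** degree
    return rbf + poly

def tpr_fpr(true_edges, estimated_edges, p):
    # edge-recovery true/false positive rates for a graph on p nodes;
    # edges are sets of frozenset({i, j}) pairs with i != j
    all_pairs = {frozenset((i, j)) for i in range(p) for j in range(i + 1, p)}
    non_edges = all_pairs - true_edges
    tpr = len(estimated_edges & true_edges) / max(len(true_edges), 1)
    fpr = len(estimated_edges & non_edges) / max(len(non_edges), 1)
    return tpr, fpr
```

Sweeping the regularization parameter of the estimator and recording the resulting (false positive rate, true positive rate) pairs traces out the ROC curves reported in the figures above.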
are you talking to machine dataset and methods for multilingual image question answering haoyuan junhua baidu research jie zhiheng lei university of california los angeles wei gaohaoyuan mjhustc huangzhiheng abstract in this paper we present the mqa model which is able to answer questions about the content of an image the answer can be sentence phrase or single word our model contains four components long memory lstm to extract the question representation convolutional neural network cnn to extract the visual representation an lstm for storing the linguistic context in an answer and fusing component to combine the information from the first three components and generate the answer we construct freestyle multilingual image question answering dataset to train and evaluate our mqa model it contains over images and freestyle chinese pairs and their english translations the quality of the generated answers of our mqa model on this dataset is evaluated by human judges through turing test specifically we mix the answers provided by humans and our model the human judges need to distinguish our model from the human they will also provide score the larger the better indicating the quality of the answer we propose strategies to monitor the quality of this evaluation process the experiments show that in of cases the human judges can not distinguish our model from humans the average score is for human the details of this work including the dataset can be found on the project page http introduction recently there is increasing interest in the field of multimodal learning for both natural language and vision in particular many studies have made rapid progress on the task of image captioning most of them are built based on deep neural networks deep convolutional neural networks cnn recurrent neural network rnn or long memory lstm the image datasets with sentence annotations play crucial role in this progress despite the success of these methods there are still many issues to be discussed and explored in particular the task of image captioning only requires generic sentence descriptions of an image but in many cases we only care about particular part or object of an image the image captioning task lacks the interaction between the computer and the user as we can not input our preference and interest in this paper we focus on the task of visual question answering in this task the method needs to provide an answer to freestyle question about the content of an image we propose the mqa model to address this task the inputs of the model are an image and question this model has four components see figure the first component is an lstm network that encodes natural language sentence into dense vector representation the second component is deep convolutional neural network that extracted the image representation this component was on imagenet classification task and is fixed during the training the third component is another lstm network that encodes the information of the current word and previous words in the answer into dense representations the fourth component fuses the information from the first three components to predict the next word in the answer we jointly train the first third and fourth components by maximizing the probability of the groundtruth answers in the training set using loss image question what is the color of the bus what is there in yellow what is there on the grass except where is the kitty the person 观察一下说出食物里任意一种蔬菜的 名字 please look carefully and tell me what is the name of the vegetables 
in the plate answer the bus is red bananas sheep 西兰花 broccoli 在椅子上 on the chair figure sample answers to the visual question generated by our model on the newly proposed freestyle multilingual image question answering dataset function to lower down the risk of overfitting we allow the weight sharing of the word embedding layer between the lstms in the first and third components we also adopt the transposed weight sharing scheme as proposed in which allows the weight sharing between word embedding layer and the fully connected softmax layer to train our method we construct freestyle multilingual image question answering see details in section based on the ms coco dataset the current version of the dataset contains images with chinese pairs and their corresponding english to diversify the annotations the annotators are allowed to raise any question related to the content of the image we propose strategies to monitor the quality of the annotations this dataset contains wide range of ai related questions such as action recognition is the man trying to buy vegetables object recognition what is there in yellow positions and interactions among objects in the image where is the kitty and reasoning based on commonsense and visual content why does the bus park here see last column of figure because of the variability of the freestyle pairs it is hard to accurately evaluate the method with automatic metrics we conduct visual turing test using human judges specifically we mix the pairs generated by our model with the same set of questionanswer pairs labeled by annotators the human judges need to determine whether the answer is given by model or human in addition we also ask them to give score of wrong partially correct or correct the results show that our mqa model passes of this test treated as answers of human and the average score is in the discussion we analyze the failure cases of our model and show that combined with the model our model can automatically ask question about an image and answer that question related work recent work has made significant progress using deep neural network models in both the fields of computer vision and natural language for computer vision methods based on convolutional neural network cnn achieve the performance in various tasks such as object classification detection and segmentation for natural language the recurrent neural network rnn and the long memory network lstm are also widely used in machine translation and speech recognition the structure of our mqa model is inspired by the model for the image captioning and retrieval tasks it adopts deep cnn for vision and rnn for language we extend the model to handle the input of question and image pairs and generate answers in the experiments we find that we can learn how to ask good question about an image using the model and this question can be answered by our mqa model there has been recent effort on the visual question answering task however most of them use and restricted set of questions some of these questions are generated from template in addition our dataset is much larger than theirs there are only and images for and respectively we are actively developing and expanding the dataset please find the latest information on the project page http the results reported in this paper are obtained from model trained on the first version of the dataset subset of the current version which contains images and pairs what is the cat doing boa sitting on the umbrella shared embedding shared lstm fusing cnn intermediate 
softmax sitting on the umbrella eoa figure illustration of the mqa model architecture we input an image and question about the image what is the cat doing to the model the model is trained to generate the answer to the question sitting on the umbrella the weight matrix in the word embedding layers of the two lstms one for the question and one for the answer are shared in addition as in this weight matrix is also shared in transposed manner with the weight matrix in the softmax layer different colors in the figure represent different components of the model best viewed in color there are some concurrent and independent works on this topic propose largescale dataset also based on ms coco they also provide some simple baseline methods on this dataset compared to them we propose stronger model for this task and evaluate our method using human judges our dataset also contains two different kinds of language which can be useful for other tasks such as machine translation because we use different set of annotators and different requirements of the annotation our dataset and the can be complementary to each other and lead to some interesting topics such as dataset transferring for visual question answering both and use model containing single lstm and cnn they concatenate the question and the answer for the answer is single word also prefer single word as the answer and then feed them to the lstm different from them we use two separate lstms for questions and answers respectively in consideration of the different properties grammar of questions and answers while allow the sharing of the for the dataset adopt the dataset proposed in which is much smaller than our dataset utilize the annotations in ms coco and synthesize dataset with four types of questions object number color and location they also synthesize the answer with single word their dataset can also be complementary to ours the multimodal qa mqa model we show the architecture of our mqa model in figure the model has four components long memory lstm for extracting semantic representation of question ii deep convolutional neural network cnn for extracting the image representation iii an lstm to extract representation of the current word in the answer and its linguistic context and iv fusing component that incorporates the information from the first three parts together and generates the next word in the answer these four components can be jointly trained together the details of the four model components are described in section the effectiveness of the important components and strategies are analyzed in section the inputs of the model are question and the reference image the model is trained to generate the answer the words in the question and answer are represented by vectors binary vectors with the length of the dictionary size and have only one vector indicating its index in the word dictionary we add hboai sign and heoai sign as two spatial words in the word dictionary at the beginning and the end of the training answers respectively they will be used for generating the answer to the question in the testing stage in the testing stage we input an image and question about the image into the model first to generate the answer we start with the start sign hboai and use the model to calculate the probability distribution of the next word we then use beam search scheme that keeps the best candidates in practice we fix the cnn part because the gradient returned from lstm is very noisy finetuning the cnn takes much longer time than just fixing 
it and does not improve the performance significantly with the maximum probabilities according to the softmax layer we repeat the process until the model generates the end sign of the answer hboai the four components of the mqa model the semantic meaning of the question is extracted by the first component of the model it contains dimensional word embedding layer and an lstm layer with memory cells the function of the word embedding layer is to map the vector of the word into dense semantic space we feed this dense word representation into the lstm layer lstm is recurrent neural network that is designed for solving the gradient explosion or vanishing problem the lstm layer stores the context information in its memory cells and serves as the bridge among the words in sequence question to model the long term dependency in the data more effectively lstm add three gate nodes to the traditional rnn structure the input gate the output gate and the forget gate the input gate and output gate regulate the read and write access to the lstm memory cells the forget gate resets the memory cells when their contents are out of date different from the image representation does not feed into the lstm in this component we believe this is reasonable because questions are just another input source for the model so we should not add images as the supervision for them the information stored in the lstm memory cells of the last word in the question the question mark will be treated as the representation of the sentence ii the second component is deep convolutional neural network cnn that generates the representation of an image in this paper we use the googlenet note that other cnn models such as alexnet and vggnet can also be used as the component in our model we remove the final softmax layer of the deep cnn and connect the remaining top layer to our model iii the third component also contains word embedding layer and an lstm the structure is similar to the first component the activation of the memory cells for the words in the answer as well as the word embeddings will be fed into the fusing component to generate the next words in the answer in they concatenate the training question and answer and use single lstm because of the different properties grammar of question and answer in this paper we use two separate lstms for questions and answers respectively we denote the lstms for the question and the answer as lstm and lstm respectively in the rest of the paper the weight matrix in lstm is not shared with the lstm in the first components note that the semantic meaning of single words should be the same for questions and answers so that we share the parameters in the layer for the first and third component iv finally the fourth component fuses the information from the first three layers specifically the activation of the fusing layer for the tth word in the answer can be calculated as follows vrq rq vi vra ra vw where denotes addition rq stands for the activation of the lstm memory cells of the last word in the question denotes the image representation ra and denotes the activation of the lstm memory cells and the word embedding of the tth word in the answer respectively vrq vi vra and vw are the weight matrices that need to be learned is an function after the fusing layer we build an intermediate layer that maps the dense multimodal representation in the fusing layer back to the dense word representation we then build fully connected softmax layer to predict the probability distribution of the next word in the 
answer this strategy allows the weight sharing between word embedding layer and the fully connected softmax layer as introduced in see details in section similar to we use the sigmoid function as the activation function of the three gates and adopt relu as the function for the lstm memory cells the activation function for the word embedding layer the fusing layer and the intermediate layer is the scaled hyperbolic tangent function tanh the weight sharing strategy as mentioned in section our model adopts different lstms for the question and the answer because of the different grammar properties of questions and answers however the meaning of single words in both questions and answers should be the same therefore we share the weight matrix between the layers of the first component and the third component in addition this weight matrix for the layers is shared with the weight matrix in the fully connected softmax layer in transposed manner intuitively the function of the weight matrix in the layer is to encode the word representation into dense word representation the function of the weight matrix in the softmax layer is to decode the dense word representation into pseudo representation which is the inverse operation of the wordembedding this strategy will reduce nearly half of the parameters in the model and is shown to have better performance in image captioning and novel visual concept learning tasks training details the cnn we used is on the imagenet classification task this component is fixed during the qa training we adopt loss defined on the word sequence of the answer minimizing this loss function is equivalent to maximizing the probability of the model to generate the groundtruth answers in the training set we jointly train the first second and the fourth components using stochastic gradient decent method the initial learning rate is and we decrease it by factor of for every epoch of the data we stop the training when the loss on the validation set does not decrease within three epochs the hyperparameters of the model are selected by for the chinese question answering task we segment the sentences into several word phrases these phrases can be treated equivalently to the english words the freestyle multilingual image question answering dataset our method is trained and evaluated on multilingual visual question answering dataset in section we will describe the process to collect the data and the method to monitor the quality of annotations some statistics and examples of the dataset will be given in section the latest dataset is available on the project page http the data collection we start with the images from the newly released ms coco training validation and testing set as the initial image set the annotations are collected using baidu online crowdsourcing to make the labeled pairs diversified the annotators are free to give any type of questions as long as these questions are related to the content of the image the question should be answered by the visual content and commonsense we are not expecting to get questions such as what is the name of the person in the image the annotators need to give an answer to the question themselves on the one hand the freedom we give to the annotators is beneficial in order to get freestyle interesting and diversified set of questions on the other hand it makes it harder to control the quality of the annotation compared to more detailed instruction to monitor the annotation quality we conduct an initial quality filtering stage specifically we 
randomly sampled images as quality monitoring dataset from the ms coco dataset as an initial set for the annotators they do not know this is test we then sample some annotations and rate their quality after each annotator finishes some labeling on this quality monitoring dataset about pairs per annotator we only select small number of annotators individuals whose annotations are satisfactory the questions are related to the content of the image and the answers are correct we also give preference to the annotators who provide interesting questions that require high level reasoning to give the answer only the selected annotators are permitted to label the rest of the images we pick set of good and bad examples of the annotated pairs from the quality monitoring dataset and show them to the selected annotators as references we also provide reasons for selecting these examples after the annotation of all the images is finished we further refine the dataset and remove small portion of the images with badly labeled questions and answers the statistics of the dataset currently there are images with chinese pairs and their english translations each image has at least two pairs as annotations the average lengths http image gt question what is the boy in green cap doing gt answer he is playing skateboard is there any person in the image yes what is the texture of the sofa in the room cloth is the man trying to buy vegetables yes gt question gt answer is the computer on the right hand what is the color of the frisbee or left hand side of the gentleman on the right hand side yellow how many layers are there for the cake six what are the people doing walking with umbrellas why does the bus park there preparing for repair what does it indicate when the phone mouse and laptop are placed together their owner is tired and sleeping figure sample images in the dataset this dataset contains chinese questionanswer pairs with corresponding english translations of the questions and answers are and respectively measured by chinese words some sample images are shown in figure we randomly sampled pairs and their corresponding images as the test set the questions in this dataset are diversified which requires vast set of ai capabilities in order to answer them they contain some relatively simple image understanding questions of the actions of objects what is the boy in green cap doing the object class is there any person in the image the relative positions and interactions among objects is the computer on the right or left side of the gentleman and the attributes of the objects what is the color of the frisbee in addition the dataset contains some questions that need reasoning with clues from vision language and commonsense for example to answer the question of why does the bus park there we should know that this question is about the parked bus in the image with two men holding tools at the back based on our commonsense we can guess that there might be some problems with the bus and the two men in the image are trying to repair it these questions are hard to answer but we believe they are actually the most interesting part of the questions in the dataset we categorize the questions into types and show the statistics of them on the project page the answers are also diversified the annotators are allowed to give single phrase or single word as the answer yellow or they can give complete sentence the frisbee is yellow experiments for the very recent works for visual question answering they test their method on the 
datasets where the answer of the question is single word or short phrase under this setting it is plausible to use automatic evaluation metrics that measure the single word similarity such as similarity measure wups however for our newly proposed dataset the answers in the dataset are freestyle and can be complete sentences for most of the cases there are numerous choices of answers that are all correct the possible alternatives are bleu score meteor cider or other metrics that are widely used in the image captioning task the problem of these metrics is that there are only few words in an answer that are semantically critical these metrics tend to give equal weights bleu and meteor or different weights according to the frequency term cider of the words in sentence hence can not fully show the importance of the keywords the evaluation of the image captioning task suffers from the same problem not as severe as question answering because it only needs general description to avoid these problems we conduct real visual turing test using human judges for our model which will be described in details in section in addition we rate each generated sentences with score the larger the better in section which gives more evaluation of our method in section we provide the performance comparisons of different variants of our mqa model on the validation set human mqa pass visual turing test fail pass rate human rated scores avg score table the results of our mqa model for our dataset the visual turing test in this visual turing test human judge will be presented with an image question and the answer to the question generated by the testing model or by human annotators he or she need to determine based on the answer whether the answer is given by human pass the test or machine fail the test in practice we use the images and questions from the test set of our dataset we use our mqa model to generate the answer for each question we also implement baseline model of the question answering without visual information the structure of this baseline model is similar to mqa except that we do not feed the image information extracted by the cnn into the fusing layer we denote it as the answers generated by our mqa model the model and the groundtruth answer are mixed together this leads to question answering pairs with the corresponding images which will be randomly assigned to human judges the results are shown in table it shows that of the answers generated by our mqa model are treated as answers provided by human the performs very badly in this task but some of the generated answers pass the test because some of the questions are actually questions it is possible to get correct answer by random guess based on pure linguistic clues to study the variance of the vtt evaluation across different sets of human judges we conduct two additional evaluations with different groups of judges under the same setting the standard deviations of the passing rate are and for human the model and mqa model respectively it shows that vtt is stable and reliable evaluation metric for this task the score of the generated answer the visual turing test only gives rough evaluation of the generated answers we also conduct evaluation with scores of or and mean that the answer is totally wrong and perfectly correct respectively means that the answer is only partially correct the general categories are right but the are wrong and makes sense to the human judges the human judges for this task are not necessarily the same people for the visual 
Turing test. After collecting the results, we find that some human judges also rate an answer leniently if the question is very hard to answer, so that even a human who does not look at the image carefully would possibly make mistakes. We show randomly sampled images and answers, together with the scores given by the human judges, in the figure below. (Figure: random examples of the answers generated by the mQA model, with the score given by the human judges.) The results are shown in the table above. They show that, among the answers that are not perfectly correct (i.e., that do not receive the highest score), over half are partially correct. Similar to the VTT evaluation process, we also conduct two additional groups of this scoring evaluation. The standard deviations of the average scores for human answers and for our mQA model are small across the groups, and in most cases the three groups give the same score, for human answers and for our mQA model respectively.

Performance comparisons of the different mQA variants. In order to show the effectiveness of the different components and strategies of our mQA model, we implement three variants of the mQA model shown in the architecture figure. For the first variant, we replace the first LSTM component of the model (the LSTM that extracts the question embedding) with the average embedding of the words in the question; it is used to show the effectiveness of the LSTM as a question embedding learner and extractor. For the second variant, we use two LSTMs with shared weight matrices to model the question and the answer; it is used to show the effectiveness of the decoupling strategy of the weights of the question LSTM and the answer LSTM in our model. For the third variant, we do not adopt the transposed weight sharing (TWS) strategy; it is used to show the effectiveness of TWS. The word error rates and losses of the three variants and the complete mQA model (mQA-complete) are shown in the table below; all three variants perform worse than our mQA model. (Table: performance comparisons of the different mQA variants, reporting word error and loss for each variant and for mQA-complete.)

Discussion. In this paper we present the mQA model, which is able to give a sentence or a phrase as the answer to a freestyle question about an image. To validate the effectiveness of the method, we construct a freestyle multilingual image question answering dataset containing a large set of question-answer pairs. We evaluate our method using human judges through a real Turing test; it shows that a substantial fraction of the answers given by our mQA model are treated as answers provided by a human. The dataset can be used for other tasks, such as visual machine translation, where the visual information can serve as context that helps to remove the ambiguity of the words in a sentence. We also modified the LSTM in the first component into a multimodal LSTM, as in the image captioning model referenced earlier; this modification allows us to generate a question about the content of an image and provide an answer to that question. We show some sample results in the figure below. (Figure: sample questions generated by our model together with the answers it gives to them.) We show some failure cases of our model in the failure-cases figure that follows. The model sometimes makes mistakes when the commonsense reasoning through background scenes is incorrect: for the image in the first column, our method says that the man is surfing, but the small yellow frisbee in the image indicates that he is actually trying to catch the frisbee. It also makes mistakes when the target object that the question focuses on is
too small or looks very similar to other objects images in the second and fourth column another interesting example is the image and question in the fifth column of figure answering this question is very hard since it needs high level reasoning based on the experience from everyday life our model outputs hoov sign which is special word we use when the model meets word which it has not seen before does not appear in its word dictionary in future work we will try to address these issues by incorporating more visual and linguistic information using object detection or using attention models image question gt answer mqa answer 盘子里有什么水果 what is the handsome boy doing what is there in the image which fruit is there in the plate what is the type of the vehicle why does the bus park there 草原上的马群 苹果和橙子 trying to catch the frisbee horses on the grassland apples and oranges bus preparing for repair 冲浪 香蕉和橙子 oov surfing they are buffalos bananas and oranges train oov do not know figure failure cases of our mqa model on the dataset references antol agrawal lu mitchell batra zitnick and parikh vqa visual question answering arxiv preprint bigham jayant ji little miller miller miller tatarowicz white white et al vizwiz nearly answers to visual questions in acm symposium on user interface software and technology pages chen papandreou kokkinos murphy and yuille semantic image segmentation with deep convolutional nets and fully connected crfs iclr chen and zitnick learning recurrent visual representation for image caption generation in cvpr cho van merrienboer gulcehre bougares schwenk and bengio learning phrase representations using rnn encoderdecoder for statistical machine translation arxiv preprint donahue hendricks guadarrama rohrbach venugopalan saenko and darrell recurrent convolutional networks for visual recognition and description in cvpr elman finding structure in time cognitive science fang gupta iandola srivastava deng gao he mitchell platt et al from captions to visual concepts and back in cvpr geman geman hallonquist and younes visual turing test for computer vision systems pnas girshick donahue darrell and malik rich feature hierarchies for accurate object detection and semantic segmentation in cvpr grubinger clough and deselaers the iapr benchmark new evaluation resource for visual information systems in international workshop ontoimage pages hochreiter and schmidhuber long memory neural computation kalchbrenner and blunsom recurrent continuous translation models in emnlp pages karpathy and deep alignments for generating image descriptions in cvpr kiros salakhutdinov and zemel unifying embeddings with multimodal neural language models tacl klein lev sadeh and wolf fisher vectors derived from hybrid mixture models for image annotation arxiv preprint krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips lavie and agarwal meteor an automatic metric for mt evaluation with high levels of correlation with human judgements in workshop on statistical machine translation pages association for computational linguistics lebret pinheiro and collobert simple image description generator via linear approach arxiv preprint lecun bottou orr and efficient backprop in neural networks tricks of the trade pages lin maire belongie hays perona ramanan and zitnick microsoft coco common objects in context arxiv preprint malinowski and fritz approach to question answering about scenes based on uncertain input in advances in neural information processing systems pages 
malinowski rohrbach and fritz ask your neurons approach to answering questions about images arxiv preprint mao xu yang wang huang and yuille deep captioning with multimodal recurrent neural networks in iclr mao xu yang wang huang and yuille learning like child fast novel visual concept learning from sentence descriptions of images arxiv preprint mao xu yang wang and yuille explain images with multimodal recurrent neural networks nips deeplearning workshop mikolov joulin chopra mathieu and ranzato learning longer memory in recurrent neural networks arxiv preprint mikolov burget and khudanpur recurrent neural network based language model in interspeech pages mikolov sutskever chen corrado and dean distributed representations of words and phrases and their compositionality in nips pages nair and hinton rectified linear units improve restricted boltzmann machines in icml pages papineni roukos ward and zhu bleu method for automatic evaluation of machine translation in acl pages ren kiros and zemel image question answering visual semantic embedding model and new dataset arxiv preprint russakovsky deng su krause satheesh ma huang karpathy khosla bernstein berg and imagenet large scale visual recognition challenge simonyan and zisserman very deep convolutional networks for image recognition in iclr sutskever vinyals and le sequence to sequence learning with neural networks in nips pages szegedy liu jia sermanet reed anguelov erhan vanhoucke and rabinovich going deeper with convolutions arxiv preprint tu meng lee choe and zhu joint video and text parsing for understanding events and answering queries multimedia ieee turing computing machinery and intelligence mind pages vedantam zitnick and parikh cider image description evaluation in cvpr vinyals toshev bengio and erhan show and tell neural image caption generator in cvpr wu and palmer verbs semantics and lexical selection in acl pages xu ba kiros cho courville salakhutdinov zemel and bengio show attend and tell neural image caption generation with visual attention arxiv preprint young lai hodosh and hockenmaier from image descriptions to visual denotations new similarity metrics for semantic inference over event descriptions in acl pages zhu mao and yuille learning from weakly supervised data by the expectation loss svm algorithm in nips pages 
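As a companion to the description of the mQA fusing component above (the layer that combines the question-LSTM state, the CNN image representation, the answer-LSTM state, and the current word embedding through the weight matrices V_rQ, V_I, V_rA, and V_w), here is a minimal numpy sketch of one decoding step. The dimensionalities, the plain tanh in place of the scaled hyperbolic tangent, and the names fuse_step and V_mid for the intermediate layer are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fuse_step(r_Q, I, r_A_t, w_t, params):
    # one decoding step of the fusing component
    # r_Q:  question-LSTM memory activation of the last question word
    # I:    CNN image representation
    # r_A_t, w_t: answer-LSTM memory and word embedding of the t-th answer word
    f_t = np.tanh(params["V_rQ"] @ r_Q + params["V_I"] @ I
                  + params["V_rA"] @ r_A_t + params["V_w"] @ w_t)
    # intermediate layer maps the multimodal representation back to the
    # dense word space
    m_t = np.tanh(params["V_mid"] @ f_t)
    # transposed weight sharing: the softmax weights reuse the word-embedding
    # matrix W_emb (vocab_size x embedding_dim) in transposed fashion
    logits = params["W_emb"] @ m_t
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()  # distribution over the next answer word
```

At generation time one would start from the beginning-of-answer token, feed each chosen word back through the answer LSTM, and keep the highest-probability candidates with beam search until the end-of-answer token is produced, as described above.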
variance reduced stochastic gradient descent with neighbors aurelien lucchi department of computer science eth zurich switzerland thomas hofmann department of computer science eth zurich switzerland simon inria sierra normale paris france brian mcwilliams department of computer science eth zurich switzerland abstract stochastic gradient descent sgd is workhorse in machine learning yet its slow convergence can be computational bottleneck variance reduction techniques such as sag svrg and saga have been proposed to overcome this weakness achieving linear convergence however these methods are either based on computations of full gradients at pivot points or on keeping per data point corrections in memory therefore relative to sgd may need minimal number of epochs in order to materialize this paper investigates algorithms that can exploit neighborhood structure in the training data to share and information about past stochastic gradients across data points which offers advantages in the transient optimization phase as we provide unified convergence analysis for family of variance reduction algorithms which we call memorization algorithms we provide experimental results supporting our theory introduction we consider general problem that is pervasive in machine learning namely optimization of an empirical or regularized convex risk function given convex loss and convex regularizer one aims at finding parameter vector which minimizes the empirical expectation argmin fi fi xi yi we assume throughout that each fi has gradients steepest descent can find the minimizer but requires repeated computations of full gradients which becomes prohibitive for massive data sets stochastic gradient descent sgd is popular alternative in particular in the context of learning sgd updates only involve for an index chosen uniformly at random providing an unbiased gradient estimate since it is surprising recent finding that the finite sum structure of allows for significantly faster convergence in expectation instead of the standard rate of sgd for functions it is possible to obtain linear convergence with geometric rates while sgd requires asymptotically vanishing learning rates often chosen to be these more recent methods introduce corrections that ensure convergence for constant learning rates based on the work mentioned above the contributions of our paper are as follows first we define family of variance reducing sgd algorithms called memorization algorithms which includes saga and svrg as special cases and develop unifying analysis technique for it second we show geometric rates for all step sizes including universal step size choice providing the first convergence proof for svrg third based on the above analysis we present new insights into the between freshness and biasedness of the corrections computed from previous stochastic gradients fourth we propose new class of algorithms that resolves this by computing corrections based on stochastic gradients at neighboring points we experimentally show its benefits in the regime of learning with small number of epochs memorization algorithms algorithms variance reduced sgd given an optimization problem as in we investigate class of stochastic gradient descent algorithms that generates an iterate sequence wt with updates taking the form γgi gi with αi where αj here is the current and the new parameter vector is the step size and is an index selected uniformly at random are variance correction terms such that which guarantees unbiasedness egi the aim is to define 
updates of asymptotically vanishing variance gi as which requires this implies that corrections need to be designed in way to exactly cancel out the stochasticity of at the optimum how the memory αj is updated distinguishes the different algorithms that we consider saga the saga algorithm maintains variance corrections αi by memorizing stochastic gradients the update rule is for the selected and αj for note that these corrections will be used the next time the same index gets sampled setting αi guarantees unbiasedness obviously can be updated incrementally saga reuses the stochastic gradient computed at step to update as well as we also consider method that updates randomly chosen αj variables at each iteration this is convenient reference point to investigate the advantages of fresher corrections note that in saga the corrections will be on average iterations old in this can be controlled to be at the expense of additional gradient computations svrg we reformulate variant of svrg in our framework using randomization argument similar to but simpler than the one suggested in fix and draw in each iteration uniform if complete update is performed otherwise they are left unchanged while updates exactly variables in each iteration svrg occasionally updates all variables by triggering an additional sweep through the data there is an option to not maintain variables explicitly and to save on space by storing only and uniform memorization algorithms motivated by saga and svrg we define class of algorithms which we call uniform memorization algorithms definition uniform algorithm evolves iterates according to eq and selects in each iteration random index set of memory locations to update according to fj if αj αj otherwise such that any has the same probability of of being updated nq note that and the above svrg are special cases for nq if otherwise for svrg otherwise because we need it in section we will also define an algorithm which we call saga which makes use of neighborhood system ni and which selects neighborhoods uniformly ni note that definition requires ni finally note that for generalized linear models where fi depends on xi only through hw xi we get xi the update direction is determined by xi whereas the effective step length depends on the derivative of scalar function ξi as used in this leads to significant memory savings as one only needs to store the scalars as xi is always given when performing an update analysis recurrence of iterates the evolution equation in expectation implies the recurrence by crucially using the unbiasedness condition egi kw ekgi here and in the rest of this paper expectations are always taken only with respect to conditioned on the past we utilize number of bounds see which exploit strong convexity of wherever appears as well as lipschitz continuity of the fi wherever appears hf kw ekgi kfi ekαi hi fi fi hw ekαi eq can be generalized using kxk kyk with however for the sake of simplicity we sacrifice tightness and choose applying all of the above yields lemma for the iterate sequence of any algorithm that evolves solutions according to eq the following holds for single update step in expectation over the choice of kw γµkw ekαi all proofs are deferred to the appendix ideal and approximate variance correction note that in the ideal case of αi we yielding rate of would immediately get condition for contraction by choosing with γµ which is half the inverse of the condition number how can we further bound ekαi in the case of sgd key insight is that for memorization 
algorithms we can apply the smoothness bound in eq kαi wτi wτi where wτi is old note that if we only had approximations βi in the sense that kβi αi see section then we can use kx yk to get the somewhat worse bound kβi αi wτi lyapunov function ideally we would like to show that for suitable choice of each iteration results in contraction kw where however the main challenge arises from the fact that the quantities αi represent stochastic gradients from previous iterations this requires somewhat more complex proof technique adapting the lyapunov function method from we define upper bounds hi kαi such that hi as we start with and conceptually initialize hi and then update hi in sync with αi hi if αi is updated hi otherwise so that we maintain valid bounds kαi hi and ekαi with palways hi the hi are quantities showing up in the analysis but need not be computed we now define family of lyapunov γn lσ kw sσ with and lq this is simplified version of the one appearing in as we assume unconstrained regime in expectation under random update the lyapunov function lσ changes as elσ sσ we can readily apply lemma to bound the first part the second part is due to which mirrors the update of the variables by crucially using the property that any αj has the same probability of being updated in we get the following result lemma for uniform algorithm it holds that note that in expectation the shrinkage does not depend on the location of previous iterates wτ and the new increment is proportional to the of the current iterate technically this is how the possibly complicated dependency on previous iterates is dealt with in an effective manner convergence analysis we first state our main lemma about lyapunov function contractions lemma fix and arbitrarily for any uniform algorithm with sufficiently small step size such that kσ min and nµ we have that elσ lσ with cµγ note that min in the limit by maximizing the bounds in lemma over the choices of and we obtain our main result that provides guaranteed geometric rates for all step sizes up to theorem consider uniform algorithm for any step size with the algorithm converges at geometric rate of at least with if otherwise µγ where nµ we would like to provide more insights into this result corollary in theorem is maximized for we can write as in the big data regime nq whereas in the case the guaranteed rate is bounded by in the regime where the condition number dominates large and by in the opposite regime of large data small note that if we have nq with so for it pays off to increase freshness as it affects the rate proportionally in the regime the influence of vanishes note that for the rate decreases monotonically yet the decrease is only minor with the exception of small neighborhood around the entire range of results in very similar rates underestimating however leads to significant by factor as the optimal choice of depends on we would prefer step sizes that are thus giving rates that adapt to the local curvature see it turns out that by choosing step size that maximizes mink we obtain step size with rate off by at most corollary choosing leads to for all to gain more insights into the for these fixed large universal step sizes the following corollary details the range of rates obtained corollary choosing with yields min in particular we have an for the choice that min nq roughly matching the rate given in for sharing gradient memory analysis as we have seen fresher gradient memory larger choice for affects the guaranteed convergence rate as however as long as one 
step of algorithm is as expensive as steps of algorithm this insight does not lead to practical improvements per se yet it raises the question whether we can accelerate these methods in particular by approximating gradients stored in the αi variables note that we are always using the correct stochastic gradients in the current update and by assuring we will not introduce any bias in the update direction rather we lose the guarantee of asymptotically vanishing variance at however as we will show it is possible to retain geometric rates up to around we will focus on updates for concreteness and investigate an algorithm that mirrors saga with the only difference that it maintains approximations βi to the true αi variables we aim to guarantee ekαi βi and will use eq to modify the of lemma we see that approximation errors are multiplied with which implies that we should aim for small learning rates ideally without compromising the rate from theorem and corollary we can see that we can choose for sufficiently large which indicates that there is hope to dampen the effects of the approximations we now make this argument more precise theorem consider uniform algorithm with that are on average ekαi βi for any step size where is given by corollary in the appendix note that and as we get el wt µγ with ekfi where denote the unconditional expectation over histories in contrast to which is conditional and kµ corollary with min we have with rate min in the relevant case of we thus converge towards some around at similar rate as for the exact method for we reduce the step size significantly to compensate the extra variance and to still converge to an resulting in the slower rate instead of we also note that the geometric convergence of sgd with constant step size to neighborhood of the solution also proven in can arise as special case in our analysis by setting αi in lemma we can take for sgd an approximate algorithm can thus be interpreted as making an algorithmic parameter rather than fixed value as in sgd algorithms sharing gradient memory we now discuss our proposal of using neighborhoods for sharing gradient information between data points thereby we avoid an increase in gradient computations relative to or at the expense of suffering an approximation bias this leads to new tradeoff between freshness and approximation quality which can be resolved in ways depending on the desired final optimization accuracy we distinguish two types of quantities first the gradient memory αi as defined by the reference algorithm second the shared gradient memory state βi which is used in modified update rule in eq βi assume that we select an index for the weight update then we generalize eq as follows fi if ni βi βi βj βj otherwise in the important case of generalized linear models where one has xi we can modify the relevant case in eq by xj this has the advantages of using the correct direction while reducing storage requirements approximation bounds for our analysis we need to control the error kαi βi this obviously requires investigations let us first look at the case of ridge regression fi hxi wi yi and thus xi λw with hxi wi yi considering ni being updated we have δij kwk yi kxj where δij kxi xj note that this can be with the exception of the norm kwk that we only know at the time of an update similarly for regularized logistic regression with we have yi eyi hxi wi with the requirement on neighbors that yi yj we get eδij kwk kxj wi again we can δij and kxj in addition to we can also store hxi wi we can use these 
bounds in two ways. First, assuming that the iterates stay within a bounded region, we can derive a priori upper bounds by taking, for each i, the maximum of the pairwise bounds over j in N_i; these involve the distances δ_ij and the norm of w, and obviously the more compact the neighborhoods are, the smaller the bounds. This is most useful for the analysis. Second, we can specify a target accuracy and then prune neighborhoods dynamically. This approach is more practically relevant, as it allows us to directly control the approximation error; however, a dynamically varying neighborhood violates the definition of a uniform memorization algorithm. We fix this in a sound manner by modifying the memory updates as follows: the memory of a neighbor j in N_i is overwritten with the gradient computed at point i whenever the corresponding bound is below the target accuracy; it is recomputed exactly, at the cost of an additional gradient evaluation, whenever j is in N_i but the bound exceeds the target; and β_j is left unchanged otherwise. This allows us to interpolate between sharing more aggressively (saving computation) and performing more computations in an exact manner: in the limit of a vanishing tolerance every neighbor update is computed exactly, while for a sufficiently large tolerance we recover the first variant mentioned.

Computing neighborhoods. Note that the pairwise Euclidean distances show up in the bounds of the previous subsection; in the classification case we also require y_i = y_j, whereas in the ridge regression case we also want the difference between y_i and y_j to be small. Thus, modulo this filtering, this suggests the use of Euclidean distances as the metric for defining neighborhoods. Standard approximation techniques for finding nearest neighbors can be used; this comes with a computational overhead, yet the additional costs will amortize over multiple runs or multiple data analysis tasks.

Experimental results. Algorithms: we present experimental results on the performance of the different variants of memorization algorithms for variance-reduced SGD as discussed in this paper. SAGA has been uniformly superior to SVRG in our experiments, so we compare SAGA and its neighborhood-sharing variant of the update rule above, alongside SGD as a straw man and, as a point of reference, the variant that refreshes several memory locations per iteration. The parameter settings for these reference variants and for the sharing tolerance were fixed once, and the same setting was used across all data sets and experiments.

Data sets: as special cases for the choice of the loss function and regularizer, we consider two commonly occurring problems in machine learning, namely ridge regression and logistic regression. We apply ridge regression to the Million Song Year regression data set from the UCI repository, and logistic regression to the cov data set and a second binary classification data set obtained from the LIBSVM website. We added an ℓ2 regularizer to ensure the objective is strongly convex.

(Figure: comparison of SAGA, its variants, and SGD with decreasing and constant step size on the three data sets; panels labeled cov and year among others, axes showing suboptimality versus epochs. The top two rows show the suboptimality as a function of the number of gradient evaluations for two different values of the sharing tolerance; the bottom two rows show the suboptimality as a function of the number of data-point evaluations, i.e., the number of stochastic updates, for two different values of the tolerance.)

Experimental protocol: we have run the algorithms in question in an i.i.d. sampling setting and averaged the results over several runs. The figure above shows the evolution of the suboptimality of the objective as a function of two different metrics: the number of update steps performed (data-point evaluations) and the number of gradient computations (gradient evaluations). Note that SGD and SAGA compute one stochastic gradient per update step, unlike the multi-update reference variant, which is included here not as a practically relevant
algorithm but as an indication of potential imq provements that could be achieved by fresher corrections step size µn was used everywhere except for plain sgd note that as in all cases this is close to the optimal value suggested by our analysis moreover using step size of for saga as suggested in previous work did not appear to give better results for plain sgd we used schedule of the form γt with constants optimized coarsely via the is expressed in units of suggestively called epochs saga sgd cst as we can see if we run sgd with the same constant step size as saga it takes several epochs until saga really shows significant gain the constant variant of sgd is faster in the early stages until it converges to neighborhood of the optimum where individual runs start showing very noisy behavior saga outperforms plain saga quite consistently when counting stochastic update steps this establishes optimistic reference curves of what we can expect to achieve with the actual is somewhat data set dependent saga and with sufficiently small can realize much of the possible freshness gains of and performs very similar for few epochs where it traces nicely between the saga and curves we see solid on all three datasets for both and asymptotics it should be clearly stated that running at fixed for longer will not result in good asymptotics on the empirical risk this is because as theory predicts can not drive the suboptimality to zero but rather at point determined by in our experiments the point with saga was typically after epochs note that the gains in the first epochs can be significant though in practice one will either define desired accuracy level and choose accordingly or one will switch to saga for accurate convergence conclusion we have generalized variance reduced sgd methods under the name of memorization algorithms and presented corresponding analysis which commonly applies to all such methods we have investigated in detail the range of safe step sizes with their corresponding geometric rates as guaranteed by our theory this has delivered number of new insights for instance about the between small and large step sizes in different regimes as well as about the role of the freshness of stochastic gradients evaluated at past iterates we have also investigated and quantified the effect of additional errors in the variance correction terms on the convergence behavior dependent on how scales with we have shown that such errors can be tolerated yet for small may have negative effect on the convergence rate as much smaller step sizes are needed to still guarantee convergence to small region we believe this result to be relevant for number of approximation techniques in the context of variance reduced sgd motivated by these insights and results of our analysis we have proposed modification of saga that exploits similarities between training data points by defining neighborhood system approximate versions of point gradients are then computed by sharing information among neighbors this the possibility of in streaming data setting where each data point is only seen once we believe this to be promising direction for future work empirically we have been able to achieve consistent for the initial phase of regularized risk minimization this shows that approximate computations of variance correction terms constitutes promising approach of computation with solution accuracy acknowledgments we would like to thank yannic kilcher martin jaggi leblond and the anonymous reviewers for helpful suggestions and 
corrections references andoni and indyk hashing algorithms for approximate nearest neighbor in high dimensions commun acm bottou machine learning with stochastic gradient descent in compstat pages springer dasgupta and sinha randomized partition trees for nearest neighbor search algorithmica defazio bach and saga fast incremental gradient method with support for convex composite objectives in advances in neural information processing systems pages johnson and zhang accelerating stochastic gradient descent using predictive variance reduction in advances in neural information processing systems pages and gradient descent methods arxiv preprint robbins and monro stochastic approximation method the annals of mathematical statistics pages schmidt convergence rate of stochastic gradient with constant step size ubc technical report schmidt roux and bach minimizing finite sums with the stochastic average gradient arxiv preprint singer srebro and cotter pegasos primal estimated solver for svm mathematical programming and zhang stochastic dual coordinate ascent methods for regularized loss the journal of machine learning research 
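To connect the pieces of the analysis above, the following Python sketch shows one pass of a SAGA-style memorization algorithm with neighborhood sharing of gradient memory, in the spirit of the variance-corrected update rule and the neighbor-sharing memory updates described earlier. The function names n_saga_epoch and grad_fi and the neighbors data structure are hypothetical, and the sketch ignores the tolerance-based pruning of neighborhoods.

```python
import numpy as np

def n_saga_epoch(w, grad_fi, n, gamma, alpha, neighbors):
    # w:        current parameter vector
    # grad_fi:  callable, grad_fi(w, i) returns the stochastic gradient f_i'(w)
    # alpha:    n x d array of gradient memory (the alpha_j / beta_j variables)
    # neighbors[i]: index set N_i (containing i) whose memory is overwritten
    #               with the gradient computed at data point i
    alpha_bar = alpha.mean(axis=0)
    for _ in range(n):
        i = np.random.randint(n)
        g = grad_fi(w, i)
        # unbiased variance-corrected direction, since E[alpha_i] = alpha_bar
        w = w - gamma * (g - alpha[i] + alpha_bar)
        # memorization step: share the fresh gradient with all neighbors of i
        # (an approximation beta_j of alpha_j for j != i)
        for j in neighbors[i]:
            alpha_bar = alpha_bar + (g - alpha[j]) / n  # keep the mean in sync
            alpha[j] = g
    return w, alpha
```

With neighbors[i] = {i} this reduces to plain SAGA; larger neighborhoods trade a bias in the memory variables for fresher corrections without extra gradient evaluations, which is exactly the trade-off quantified in the approximate-correction analysis above.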
sample efficient path integral control under uncertainty yunpeng pan evangelos theodorou and michail kontitsis autonomous control and decision systems laboratory institute for robotics and intelligent machines school of aerospace engineering georgia institute of technology atlanta ga kontitsis abstract we present optimal control framework that is derived using the path integral pi control approach we find iterative control laws analytically without priori policy parameterization based on probabilistic representation of the learned dynamics model the proposed algorithm operates in manner which differentiate it from other methods that perform forward sampling to find optimal controls our method uses significantly less samples to find analytic control laws compared to other approaches within the pi control family that rely on extensive sampling from given dynamics models or trials on physical systems in fashion in addition the learned controllers can be generalized to new tasks without based on the compositionality theory for the optimal control framework we provide experimental results on three different tasks and comparisons with methods to demonstrate the efficiency and generalizability of the proposed framework introduction stochastic optimal control soc is general and powerful framework with applications in many areas of science and engineering however despite the broad applicability solving soc problems remains challenging for systems in continuous state action spaces various function approximation approaches to optimal control are available but usually sensitive to model uncertainty over the last decade soc based on exponential transformation of the value function has demonstrated remarkable applicability in solving real world control and planning problems in control theory the exponential transformation of the value function was introduced in in the recent decade it has been explored in terms of path integral interpretations and theoretical generalizations discrete time formulations and scalable algorithms the resulting stochastic optimal control frameworks are known as path integral pi control for continuous time kullback leibler kl control for discrete time or more generally linearly solvable optimal control one of the most attractive characteristics of pi control is that optimal control problems can be solved with forward sampling of stochastic differential equations sdes while the process of sampling with sdes is more scalable than numerically solving partial differential equations it still suffers from the curse of dimensionality when performed in naive fashion one way to circumvent this problem is to parameterize policies and then perform optimization with sampling however in this case one has to impose the structure of the policy therefore restrict the possible optimal control solutions within the assumed parameterization in addition the optimized policy parameters can not be generalized to new tasks in general pi policy search approaches require large number of samples from trials performed on real physical systems the issue of sample inefficiency further restricts the applicability of pi control methods on physical systems with unknown or partially known dynamics motivated by the aforementioned limitations in this paper we introduce sample efficient modelbased approach to pi control different from existing pi control approaches our method combines the benefits of pi control theory and probabilistic reinforcement learning methodologies the main characteristics of the our 
approach are summarized as follows. (i) It extends PI control theory to the case of uncertain systems: the structural constraint is enforced between the control cost and the uncertainty of the learned dynamics, which can be viewed as a generalization of previous work. (ii) Different from parameterized PI controllers, we find an analytic control law without any policy parameterization. (iii) Rather than keeping a fixed control cost weight or ignoring the constraint between control authority and noise level, in this work the control cost weight is adapted based on the explicit uncertainty of the learned dynamics model. (iv) The algorithm operates in a different manner compared to existing methods that perform forward sampling; more precisely, our method performs successive deterministic approximate inference and backward computation of the optimal control law. (v) The proposed approach is significantly more sample-efficient than sampling-based PI control in an RL setting, and is comparable to state-of-the-art RL methods in terms of sample and computational efficiency. (vi) Thanks to the linearity of the backward PDE, the learned controllers can be generalized to new tasks without re-sampling by constructing composite controllers; in contrast, most policy search and trajectory optimization methods find policy parameters that cannot be generalized.

Iterative path integral control for a class of uncertain systems

Problem formulation. We consider a nonlinear stochastic system described by the differential equation $dx = (f(x) + G(x)u)\,dt + B\,d\omega$, with state $x$, control $u$, and standard Brownian motion noise $\omega$ with variance $\Sigma_\omega$; $f(x)$ is the unknown drift term (passive dynamics), $G(x)$ is the control matrix, and $B$ is the diffusion matrix. Given some previous control $u^{old}$, we seek the optimal control correction term $\delta u$ such that the total control is $u = u^{old} + \delta u$. The original system becomes $dx = (f + G(u^{old} + \delta u))\,dt + B\,d\omega = (f + Gu^{old})\,dt + G\,\delta u\,dt + B\,d\omega$. In this work we assume the dynamics under the previous control can be represented by a Gaussian process (GP), such that $f^{gp}\,dt \approx (f + Gu^{old})\,dt + B\,d\omega$, where $f^{gp}$ is the GP representation of the biased drift term under the previous control. Now the original dynamical system can be represented as follows: $f^{gp} \sim \mathcal{GP}(\mu_f, \Sigma_f)$ and $dx = f^{gp}\,dt + G\,\delta u\,dt$, where $\mu_f, \Sigma_f$ are the predictive mean and covariance functions, respectively. For the GP model we use a prior of zero mean and covariance function $k(x_i, x_j) = \sigma_s^2 \exp\!\big(-\tfrac{1}{2}(x_i - x_j)^\top W (x_i - x_j)\big) + \sigma_\omega^2\,\delta_{ij}$, where $\delta_{ij}$ is the Kronecker symbol that is one iff $i = j$ and zero otherwise. Samples of $f^{gp}$ can be drawn using a vector of standard Gaussian variables $\epsilon$ as $f^{gp} = \mu_f + L_f\,\epsilon$, where $L_f$ is obtained using a Cholesky factorization such that $\Sigma_f = L_f L_f^\top$. Note that in general $\epsilon$ is an infinite-dimensional vector, and we can use the same sample to represent uncertainty during learning. Without loss of generality we assume $\omega$ to be standard Brownian motion. For the rest of the paper we use simplified notation with subscripts indicating the time step: the representation of the system is $x_{t+dt} = x_t + (f^{gp}_t + G_t\,\delta u_t)\,dt$, and the conditional probability of $x_{t+dt}$ given $x_t$ and $\delta u_t$ is Gaussian, $p(x_{t+dt}\mid x_t,\delta u_t) = \mathcal{N}(\tilde\mu_t, \tilde\Sigma_t)$, where $\tilde\mu_t = x_t + (\mu_f + G_t\,\delta u_t)\,dt$ and $\tilde\Sigma_t$ is determined by $\Sigma_f$.
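The following minimal sketch illustrates the GP model just described: the squared-exponential kernel with an additive noise term, a GP posterior over state transitions, and a sample drawn as $\mu_f + L_f\epsilon$ via the Cholesky factor. All data, length-scales and noise levels are toy values assumed for illustration; this is not the implementation used in the paper.

import numpy as np

def kernel(Xi, Xj, sigma_s=1.0, sigma_w=0.1, lengthscale=1.0):
    # k(x_i, x_j) = sigma_s^2 exp(-0.5 ||x_i - x_j||^2 / l^2) + sigma_w^2 delta_ij
    d2 = ((Xi[:, None, :] - Xj[None, :, :]) ** 2).sum(-1)
    K = sigma_s ** 2 * np.exp(-0.5 * d2 / lengthscale ** 2)
    if Xi.shape == Xj.shape and np.allclose(Xi, Xj):
        K += sigma_w ** 2 * np.eye(len(Xi))
    return K

def gp_predict(X_train, dX_train, X_test):
    # posterior mean and covariance of the state transition at the test inputs
    K = kernel(X_train, X_train)
    Ks = kernel(X_test, X_train, sigma_w=0.0)
    Kss = kernel(X_test, X_test, sigma_w=0.0)
    A = np.linalg.solve(K, dX_train)
    mu_f = Ks @ A
    Sigma_f = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mu_f, Sigma_f

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                                 # states x_t (toy)
dX = np.sin(X[:, :1]) + 0.05 * rng.normal(size=(30, 1))      # observed transitions (toy)
mu_f, Sigma_f = gp_predict(X, dX, rng.normal(size=(5, 2)))
L_f = np.linalg.cholesky(Sigma_f + 1e-6 * np.eye(5))         # jitter for numerical stability
f_sample = mu_f[:, 0] + L_f @ rng.normal(size=5)             # f_gp = mu_f + L_f * eps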
In this paper we consider the stochastic optimal control problem $J(x) = \mathbb{E}\big[h(x_T) + \int_{t_0}^{T}\mathcal{L}(x_t,\delta u_t)\,dt\big]$, where the immediate cost is defined as $\mathcal{L}(x_t,\delta u_t) = q(x_t) + \tfrac{1}{2}\,\delta u_t^\top R_t\,\delta u_t$, $q(x_t) = (x_t - x_t^d)^\top Q\,(x_t - x_t^d)$ is a quadratic cost function in which $x_t^d$ is the desired state, and $R_t = R(x_t)$ is a state-dependent positive-definite weight matrix. Next we derive the linearized backward equation for this class of optimal control problems.

Linearized equation for uncertain dynamics. At each iteration the goal is to find the optimal control update $\delta u_t$ that minimizes the value function $V(x_t) = \min_{\delta u}\mathbb{E}\big[\mathcal{L}(x_t,\delta u_t)\,dt + V(x_{t+dt})\big]$, which is the Bellman equation. By approximating the integral for small $dt$ and applying Ito's rule, we obtain the HJB equation (the detailed derivation is skipped):
$-\partial_t V_t = \min_{\delta u_t}\big(q_t + \tfrac{1}{2}\delta u_t^\top R_t\delta u_t + (\mu_f + G_t\delta u_t)^\top\nabla_x V_t + \tfrac{1}{2}\mathrm{tr}(\Sigma_f\nabla_x^2 V_t)\big).$
To find the optimal control update we take the gradient of the expression inside the parentheses with respect to $\delta u_t$ and set it to zero; this yields $\delta u_t^{*} = -R_t^{-1}G_t^\top\nabla_x V_t$. Inserting this expression into the HJB equation yields the following nonlinear, second-order PDE:
$-\partial_t V_t = q_t + (\nabla_x V_t)^\top\mu_f - \tfrac{1}{2}(\nabla_x V_t)^\top G_t R_t^{-1}G_t^\top\nabla_x V_t + \tfrac{1}{2}\mathrm{tr}(\Sigma_f\nabla_x^2 V_t).$
In order to solve the above PDE we use the exponential transformation of the value function, $V_t = -\lambda\log\Psi_t$, where $\Psi_t = \Psi(x_t)$ is called the desirability of $x_t$. The corresponding partial derivatives are $\partial_t V_t = -\lambda\,\partial_t\Psi_t/\Psi_t$, $\nabla_x V_t = -\lambda\,\nabla_x\Psi_t/\Psi_t$, and $\nabla_x^2 V_t = \lambda\,\nabla_x\Psi_t\nabla_x\Psi_t^\top/\Psi_t^2 - \lambda\,\nabla_x^2\Psi_t/\Psi_t$. Inserting these terms into the PDE, the quadratic terms in $\nabla_x\Psi_t$ cancel out under the assumption $\lambda G_t R_t^{-1}G_t^\top = \Sigma_f$. This constraint is different from existing works in path integral control, where the constraint is enforced between the additive noise covariance and the control authority, more precisely $\lambda G_t R_t^{-1}G_t^\top = B\Sigma_\omega B^\top$. The new constraint enables an adaptive update of the control cost weight based on the explicit uncertainty of the learned dynamics; in contrast, most existing works use a fixed control cost weight. This condition also leads to more exploration (more aggressive control) under high uncertainty, and less exploration with more certain dynamics. Given the aforementioned assumption, the above PDE is simplified to the linear backward PDE
$-\partial_t\Psi_t = -\tfrac{1}{\lambda}q_t\Psi_t + \mu_f^\top\nabla_x\Psi_t + \tfrac{1}{2}\mathrm{tr}(\Sigma_f\nabla_x^2\Psi_t),$
subject to the terminal condition $\Psi_T = \exp(-\tfrac{1}{\lambda}q_T)$. The resulting PDE is linear, but in general solving it analytically is intractable for nonlinear systems and cost functions. We apply the Feynman-Kac formula, which gives a probabilistic representation of the solution of the linear PDE,
$\Psi_t = \lim_{dt\to 0}\int p(\tau_t\mid x_t)\,\exp\big(-\tfrac{1}{\lambda}\sum_j q_j\,dt\big)\,\Psi_T\,d\tau_t,$
where $\tau_t$ is the state trajectory from time $t$ to $T$. The optimal control is then obtained as $\delta u_t^{*} = -R_t^{-1}G_t^\top\nabla_x V_t = \lambda R_t^{-1}G_t^\top\,\nabla_x\Psi_t/\Psi_t$ (which, under the structural constraint, can equivalently be expressed in terms of $\Sigma_f$), so that the updated control is $u = u^{old} + \delta u_t^{*}$. Rather than computing $\Psi_t$ and $\nabla_x\Psi_t$ exactly, the optimal control can be approximated based on the path costs of sampled trajectories. Next we briefly review some of the existing approaches.

Related works. According to path integral control theory, the stochastic optimal control problem becomes an approximation problem of a path integral. This problem can be solved by forward sampling of the uncontrolled SDE; the optimal control is approximated based on the path costs of sampled trajectories, so the computation of optimal controls becomes a forward process. More precisely, when the control and noise act in the same subspace, the optimal control can be evaluated as a weighted average of the noise, $u_t^{*}\,dt = \mathbb{E}_{p^{*}}[d\omega]$, where the probability of a trajectory is $p^{*}(\tau_t) \propto \exp(-\tfrac{1}{\lambda}S(\tau_t))$ and $S(\tau_t)$ is the path cost computed by performing forward sampling. However, these approaches require a large amount of samples from a given dynamics model, or extensive trials on physical systems when applied in reinforcement learning settings.
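As a point of reference, the following is a minimal sketch of the standard sampling-based path integral control update reviewed above: simulate K uncontrolled rollouts, weight each trajectory by exp(-S/lambda), and take the control as the weighted average of the injected noise. The dynamics, cost, and all constants are hypothetical toy choices, not the systems studied in this paper.

import numpy as np

rng = np.random.default_rng(0)
dt, T, K, lam, sigma = 0.02, 50, 500, 1.0, 1.0

def f(x):                      # passive dynamics of a toy 1-D system (assumption)
    return -0.5 * x

def q(x):                      # state cost: drive the state towards x_d = 1
    return (x - 1.0) ** 2

x0 = 0.0
noise = sigma * np.sqrt(dt) * rng.normal(size=(K, T))   # Brownian increments
S = np.zeros(K)                                         # path costs
x = np.full(K, x0)
for t in range(T):
    S += q(x) * dt
    x = x + f(x) * dt + noise[:, t]                     # forward sampling of the SDE
S += q(x)                                               # terminal cost

w = np.exp(-(S - S.min()) / lam)                        # importance weights
w /= w.sum()
u0 = (w @ noise[:, 0]) / dt                             # weighted average of the noise
print("path-integral control at t=0:", u0)

This is exactly the kind of estimator whose sample requirements motivate the analytic, model-based alternative developed next.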
In order to improve sample efficiency, a nonparametric approach was developed by representing the desirability $\Psi_t$ in terms of linear operators in a reproducing kernel Hilbert space (RKHS). As a model-free approach it allows sample re-use, but it relies on numerical methods to estimate the gradient of the desirability $\nabla_x\Psi_t$, which can be computationally expensive; on the other hand, computing the analytic expressions of the path-integral embedding is intractable and requires exact knowledge of the system dynamics. Furthermore, the control approximation is based on samples from the uncontrolled dynamics, which is usually not sufficient for highly nonlinear or underactuated systems. Another class of methods is based on policy parameterization; notable approaches include parameterized path-integral policy improvement and the recently developed state-feedback PI controller. The limitations of these methods are that (i) they do not take into account model uncertainty in the passive dynamics, (ii) the imposed policy parameterizations restrict the optimal control solutions, and (iii) the optimized policy parameters cannot be generalized to new tasks. A brief comparison of some of these methods can be found in Table 1: PI, iterative PI, and state-feedback PI all enforce the structural constraint $\lambda G_t R_t^{-1}G_t^\top = B\Sigma_\omega B^\top$ between control authority and additive noise, whereas our method enforces $\lambda G_t R_t^{-1}G_t^\top = \Sigma_f$; our method is the only one that uses a learned GP dynamics model, and, like standard PI, it imposes no policy parameterization, in contrast to the parameterized variants (Table 1: comparison with some notable and recent path-integral-related approaches). Motivated by the challenge of combining sample efficiency and generalizability, we next introduce a probabilistic approach to compute the optimal control analytically.

Proposed approach: an analytic path integral control scheme. In order to derive the proposed framework, firstly we learn the function $f^{gp}$, i.e. the biased drift in $dx = (f(x) + G(x)u^{old})\,dt + B\,d\omega$, from sampled data. Learning the continuous mapping from state to state transition can be viewed as an inference problem with the goal of inferring the state transition $f^{gp}(x_t)$. The kernel function defined above can be interpreted as a similarity measure between random variables; more specifically, if the training inputs $x_i$ and $x_j$ are close to each other in the kernel space, their outputs $dx_i$ and $dx_j$ are highly correlated. Given a sequence of states and the corresponding state transitions, the posterior distribution can be obtained by conditioning the joint prior distribution on the observations. In this work we make the standard assumption of independent outputs (no correlation between output dimensions). To propagate the dynamics over a trajectory with a given time horizon, we employ the moment-matching approach to compute the predictive distribution: given an input distribution over the state $\mathcal{N}(\mu_t, \Sigma_t)$, the predictive distribution over the state at $t + dt$ can be approximated as a Gaussian such that $\mu_{t+dt} = \mu_t + \mu_f$ and $\Sigma_{t+dt} = \Sigma_t + \Sigma_f + \mathrm{cov}(x_t, f^{gp}) + \mathrm{cov}(f^{gp}, x_t)$. The above formulation is used to approximate transition probabilities over the trajectory; details regarding the moment-matching method can be found in the cited work, and all mean and variance terms can be computed analytically. The hyperparameters $\sigma_s, \sigma_\omega$ are learned by maximizing the log-likelihood of the training outputs given the inputs.

Given the approximation of the transition probability, we now introduce a Bayesian nonparametric formulation of path integral control based on the probabilistic representation of the dynamics. Firstly, we perform approximate inference (forward propagation) to obtain the Gaussian belief (predictive mean and covariance) of the state over the trajectory. Since the exponential transformation of the state cost, $\exp(-\tfrac{dt}{\lambda}q(x))$, is an unnormalized Gaussian centered at the desired state $x^d$, we can evaluate the following integral analytically:
$\int\mathcal{N}(x_j;\mu_j,\Sigma_j)\,\exp\big(-\tfrac{dt}{\lambda}q(x_j)\big)\,dx_j = \big|I + \tfrac{2dt}{\lambda}Q\Sigma_j\big|^{-1/2}\exp\big(-\tfrac{dt}{\lambda}(\mu_j - x_j^d)^\top\big(I + \tfrac{2dt}{\lambda}Q\Sigma_j\big)^{-1}Q\,(\mu_j - x_j^d)\big).$
Thus, given the boundary condition $\Psi_T = \exp(-\tfrac{1}{\lambda}q_T)$ and the predictive distribution $\mathcal{N}(\mu_T,\Sigma_T)$ at the final step, we can evaluate the backward desirability analytically using the above expression. More generally, we use the recursive rule
$\Psi_j = \int p(x_{j+1}\mid x_j)\,\exp\big(-\tfrac{dt}{\lambda}q_{j+1}\big)\,\Psi_{j+1}(x_{j+1})\,dx_{j+1}.$
Since we use deterministic approximate inference based on moment matching instead of explicitly sampling from the corresponding SDE, we approximate the conditional distribution $p(x_{j+1}\mid x_j)$ by the Gaussian predictive distribution $\mathcal{N}(\mu_{j+1},\Sigma_{j+1})$. Therefore the path integral $\Psi_{t_0} = \int p(\tau_{t_0}\mid x_{t_0})\,\exp\big(-\tfrac{1}{\lambda}\sum_j q_j\,dt\big)\,\Psi_T\,d\tau_{t_0}$ factorizes into a sequence of one-step Gaussian integrals of the form above, and we evaluate the desirability $\Psi_t$ backward in time by successive computation using the recursive expression.
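The following minimal sketch implements the one-step Gaussian integral used in the backward recursion above, i.e. the expectation of exp(-(dt/lambda) (x - x_d)' Q (x - x_d)) under a Gaussian belief, and checks it against a Monte Carlo estimate. The shapes and numerical values are toy assumptions for illustration only.

import numpy as np

def expected_exp_cost(mu, Sigma, xd, Q, lam, dt):
    n = len(mu)
    M = (2.0 * dt / lam) * Q                    # exp(-q dt/lam) = exp(-0.5 (x-xd)' M (x-xd))
    A = np.eye(n) + M @ Sigma
    d = mu - xd
    return np.linalg.det(A) ** -0.5 * np.exp(-0.5 * d @ np.linalg.solve(A, M @ d))

rng = np.random.default_rng(0)
mu, xd = np.array([0.3, -0.2]), np.zeros(2)
Sigma = np.array([[0.2, 0.05], [0.05, 0.1]])
Q, lam, dt = np.eye(2), 1.0, 0.05

closed_form = expected_exp_cost(mu, Sigma, xd, Q, lam, dt)
xs = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = np.exp(-(dt / lam) * np.einsum('ni,ij,nj->n', xs - xd, Q, xs - xd)).mean()
print(closed_form, mc)   # the two estimates should agree closely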
The optimal control law requires gradients of the desirability function with respect to the state, and these can be computed backward in time as well. For simplicity, we denote the function of $x_j$ given by the one-step Gaussian integral in the recursion by $\Phi_j$. We compute the gradient of the recursive expression by the product rule, $\nabla_{x_t}\Psi_j = (\nabla_{x_t}\Phi_{j+1})\,\Psi_{j+1} + \Phi_{j+1}\,\nabla_{x_t}\Psi_{j+1}$, where the gradient terms are obtained by the chain rule through the predictive moments, $\tfrac{d\Phi_j}{dx_t} = \tfrac{\partial\Phi_j}{\partial\mu_j}\tfrac{d\mu_j}{dx_t} + \tfrac{\partial\Phi_j}{\partial\Sigma_j}\tfrac{d\Sigma_j}{dx_t}$; the partial derivatives with respect to $\mu_j$ involve $\Phi_j$, $(I + \tfrac{2dt}{\lambda}Q\Sigma_j)^{-1}Q$ and $(\mu_j - x_j^d)$, with analogous analytic expressions for the covariance terms. The gradient of the terminal term $\Psi_T$ is computed similarly. The derivatives $\tfrac{d\mu_j}{dx_t}$ and $\tfrac{d\Sigma_j}{dx_t}$ can be computed analytically as in the moment-matching literature, and we compute all gradients using this scheme without any numerical method (e.g., finite differences). Given $\Psi_t$ and $\nabla_x\Psi_t$, the optimal control takes the analytic form given above. Since $\Psi_t$ and $\nabla_x\Psi_t$ are explicit functions of $x_t$, the resulting control law is essentially different from the feedforward control in sampling-based path integral control frameworks, as well as from the parameterized state-feedback PI control policies. Notice that at the current time step we update the control sequence using the presented scheme, but only the first control is applied to the system to move to the next step, while the remaining controls are used for control updates at future steps. The transition sample recorded at each time step is incorporated to update the GP model of the dynamics. A summary of the proposed algorithm is shown in Algorithm 1.

Algorithm 1: sample-efficient path integral control under uncertain dynamics. Initialization: apply random controls to the physical system and record data. Repeat: for each time step, (1) incorporate the transition sample to learn the GP dynamics model; (2) repeat until convergence: approximate inference for the predictive distributions using $u^{old}$; backward computation of the optimal control updates; update the optimal controls $u \leftarrow u^{old} + \delta u^{*}$; (3) apply the optimal control to the system, move one step forward, and record data. End for; until the task is learned.

Generalization to unlearned tasks without sampling. In this section we describe how to generalize the learned controllers to new, unlearned tasks without any interaction with the real system. The proposed approach is based on the compositionality theory of linearly solvable optimal control (LSOC). We use superscripts to denote previously learned task indexes. Firstly, we define a distance measure between the new target $x^d$ and the old targets $x^{d,k}$ using a Gaussian kernel, $\omega^k = \exp\!\big(-\tfrac{1}{2}(x^d - x^{d,k})^\top D\,(x^d - x^{d,k})\big)$, where $D$ is a diagonal matrix (kernel width). The composite terminal cost for the new task becomes $h(x_T) = -\lambda\log\sum_k\bar\omega^k\exp(-\tfrac{1}{\lambda}h^k(x_T))$, where $h^k(x_T)$ is the terminal cost for old task $k$; for conciseness we define the normalized distance measure $\bar\omega^k = \omega^k/\sum_{k'}\omega^{k'}$, which can be interpreted as a probability weight. Based on this, the composite terminal desirability for the new task is a linear combination of the old ones, $\Psi_T = \sum_k\bar\omega^k\Psi_T^k$ with $\Psi_T^k = \exp(-\tfrac{1}{\lambda}h^k(x_T))$. Since each $\Psi_t^k$ is a solution to the linear backward PDE, the linear combination of desirabilities holds everywhere from $t_0$ to $T$ as long as it holds on the boundary (the terminal time step). Therefore we obtain the composite control $\delta u_t^{*} = \lambda R_t^{-1}G_t^\top\,\tfrac{\sum_k\bar\omega^k\nabla_x\Psi_t^k}{\sum_k\bar\omega^k\Psi_t^k}$. The composite control law is essentially different from an interpolating control law: it enables controllers to be constructed from learned controllers for different tasks, a scheme that cannot be adopted in policy search or trajectory optimization methods. Alternatively, generalization can be achieved by imposing task-parameterized policies; however, this approach might restrict the choice of optimal controls given the assumed structure of the control policy.
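A minimal sketch of the composition step described above follows: Gaussian-kernel weights over previously learned targets, and a composite control formed from the weighted desirabilities and their gradients. The 1-D targets and the stand-in desirability functions psi_k are hypothetical; in the actual framework they would come from the GP-based backward recursion.

import numpy as np

def gaussian_weight(xd_new, xd_old, width=1.0):
    return np.exp(-0.5 * np.sum((xd_new - xd_old) ** 2) / width ** 2)

# Toy stand-ins for learned task-specific desirabilities and their gradients.
def psi_k(x, xd_k, lam=1.0):
    return np.exp(-(x - xd_k) ** 2 / lam)

def grad_psi_k(x, xd_k, lam=1.0):
    return -2.0 * (x - xd_k) / lam * psi_k(x, xd_k, lam)

xd_old = np.array([-1.0, 0.5, 2.0])          # targets of previously learned tasks
xd_new = 1.2                                  # new, unlearned target
w = np.array([gaussian_weight(xd_new, xd) for xd in xd_old])
w /= w.sum()                                  # normalized probability weights

def composite_control(x, lam=1.0, R_inv_G=1.0):
    num = sum(wk * grad_psi_k(x, xdk) for wk, xdk in zip(w, xd_old))
    den = sum(wk * psi_k(x, xdk) for wk, xdk in zip(w, xd_old))
    return lam * R_inv_G * num / den          # delta_u* = lam R^-1 G' grad(Psi)/Psi

print(composite_control(0.0))

Note that the weights are fixed by the targets alone, so no new samples from the system are needed to obtain the composite controller.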
Experiments and analysis. We consider three simulated RL tasks: cart-pole (CP) swing-up, double-pendulum-on-a-cart (DPC) swing-up, and robotic-arm reaching. The CP and DPC systems consist of a cart and a pendulum; the tasks are to swing up the pendulum from the initial position (pointing down), and both CP and DPC have only one control acting on the cart. The third system is a robotic arm with several state dimensions and degrees of freedom, with actuators on the joints; the task is to steer the arm to the desired position and orientation.

In order to demonstrate the performance, we compare the proposed control framework with three related methods: iterative path integral control with a known dynamics model, PILCO, and PDDP. Iterative path integral control is a sampling-based stochastic control method based on importance sampling using a controlled diffusion process, rather than the passive dynamics used in standard path integral control; it is used here as a baseline with a given dynamics model. PILCO is a model-based policy search method that features state-of-the-art data efficiency in terms of the number of trials required to learn a task; PILCO requires an extra optimizer (such as BFGS) for policy improvement. PDDP is a Gaussian-belief-space trajectory optimization approach; it performs dynamic programming based on local approximations of the learned dynamics and value function. Both PILCO and PDDP are applied with unknown dynamics. In this work we do not compare our method with sampling-based PI approaches, since these would certainly cost more samples than model-based methods such as PILCO and PDDP; the reason for choosing these two methods for comparison is that our method adopts a similar model-learning scheme, while other methods are based on different model representations.

In the first experiment we demonstrate the sample efficiency of our method using the CP and DPC tasks. For both tasks we fix the horizon and dt, giving a fixed number of time steps per rollout. The iterative PI baseline with the given dynamics model uses many sample rollouts per iteration and several iterations at each time step. We initialize PILCO and the proposed method by collecting a small number of sample rollouts (and the corresponding transition samples) for the two tasks respectively; at each trial on the true dynamics we use one sample rollout for PILCO and our method, while PDDP uses more rollouts for initialization as well as at each trial. The figure shows the results in terms of $\Psi_T$ and computational time for both tasks. Our method shows higher desirability (lower terminal state cost) at each trial, which indicates higher sample efficiency for task learning; this is mainly because our method performs online re-optimization at each time step, whereas the other two methods do not use this scheme. However, we assume partial information of the dynamics (the control matrix $G$) is given, while PILCO and PDDP perform optimization on entirely unknown dynamics; in many robotic systems $G$ corresponds to the inverse of the inertia matrix, which can be identified from data as well. In terms of computational efficiency, our method outperforms PILCO, since we compute the optimal control update analytically while PILCO solves large-scale nonlinear optimization problems to obtain policy parameters. Our method is more computationally expensive than PDDP, because PDDP seeks locally optimal controls that rely on linear approximations while our method is a global optimal control approach; despite the relatively higher computational burden than PDDP, our method offers reasonable efficiency in terms of the time required to reach the baseline performance.

In the second experiment we demonstrate the generalizability of the learned controllers to new tasks using the composite control law, based on the robotic-arm system with a fixed horizon and dt per rollout. First we learn several independent controllers using Algorithm 1; the target postures are shown in the figure. For all tasks we initialize with a few sample rollouts and use one sample rollout at each trial; the blue bars in the figure show the desirabilities $\Psi_T$ after the training trials. Next we use the composite law to construct controllers without re-sampling, using the other controllers learned with Algorithm 1.
For instance, the composite controller for a given task is constructed from the weighted combination $\sum_k\bar\omega^k\Psi^k$ of the desirabilities of the other, independently learned controllers. The performance comparison of the composite controllers with the controllers learned from trials is shown in the figure; it can be seen that the composite controllers give performance close to that of the independently learned controllers. The compositionality theory generally does not apply to policy search methods and trajectory optimizers such as PILCO, PDDP, and other recent methods; our method benefits from the compositionality of control laws, which can be applied for multi-task control without re-sampling.

Figure: comparison in terms of sample efficiency and computational efficiency for the cart-pole and double-pendulum-on-a-cart tasks (iterative PI with the true model, PILCO, PDDP, and ours). The left subfigures show the terminal desirability $\Psi_T$; for PILCO and PDDP, $\Psi_T$ is computed using the terminal state costs at each trial. The right subfigures show computational time (in minutes) at each trial.

Figure: results for the arm-reaching tasks. Left: the tasks tested in this experiment, where each number indicates a corresponding target posture. Right: comparison of the controllers learned independently from trials and the composite controllers obtained without sampling; each composite controller is obtained from the other independent controllers learned from trials.

Conclusion and discussion. We presented an iterative learning control framework that can find optimal controllers under uncertain dynamics using a very small number of samples. This approach is closely related to the family of path integral (PI) control algorithms, but our method is based on an optimization scheme that differs significantly from current approaches. Moreover, it combines the attractive characteristics of probabilistic model-based reinforcement learning and linearly solvable optimal control theory; these characteristics include sample efficiency, optimality, and generalizability. By iteratively updating the control laws based on a probabilistic representation of the learned dynamics, our method demonstrated encouraging performance compared to related methods. In addition, our method showed promising potential for performing multi-task control based on the compositionality of learned controllers. Besides the assumed structural constraint between the control cost weight and the uncertainty of the passive dynamics, the major limitation is that we have not taken into account the uncertainty in the control matrix. Future work will focus on further generalization of this framework and applications to real systems.

Acknowledgments. This research is supported by NSF.

References
Bertsekas and Tsitsiklis. Neuro-Dynamic Programming. Optimization and Neural Computation Series, Athena Scientific.
Barto, Powell, Si, and Wunsch. Handbook of Learning and Approximate Dynamic Programming.
Fleming. Exit probabilities and optimal stochastic control. Applied Mathematics and Optimization.
Fleming and Soner. Controlled Markov Processes and Viscosity Solutions. Applications of Mathematics, Springer, New York.
Kappen. Linear theory for control of nonlinear stochastic systems. Physical Review Letters.
Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment.
Kappen. An introduction to stochastic control theory, path integrals and reinforcement learning. AIP Conference Proceedings.
Thijssen and Kappen. Path integral control and feedback. Physical Review.
Todorov. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences.
Theodorou, Buchli, and Schaal. A generalized path integral control approach to reinforcement learning. The Journal of Machine
learning research stulp and sigaud path integral policy improvement with covariance matrix adaptation in proceedings of the international conference on machine learning icml pages acm rawlik toussaint and vijayakumar path integral control by reproducing kernel hilbert space embedding in proceedings of the international joint conference on artificial intelligence ijcai pages pan and theodorou nonparametric infinite horizon stochastic control in ieee symposium on adaptive dynamic programming and reinforcement learning adprl pages ieee kappen peters and neumann policy search for path integral control in machine learning and knowledge discovery in databases pages springer dvijotham and todorov linearly solvable optimal control reinforcement learning and approximate dynamic programming for feedback control pages deisenroth neumann and peters survey on policy search for robotics foundations and trends in robotics deisenroth fox and rasmussen gaussian processes for learning in robotics and control ieee transsactions on pattern analysis and machine intelligence theodorou and todorov relative entropy and free energy dualities connections to path integral and kl control in ieee conference on decision and control pages pan and theodorou probabilistic differential dynamic programming in advances in neural information processing systems nips pages levine and abbeel learning neural network policies with guided policy search under unknown dynamics in advances in neural information processing systems nips pages levine and koltun learning complex neural network policies with trajectory optimization in proceedings of the international conference on machine learning pages schulman levine moritz jordan and abbeel trust region policy optimization arxiv preprint hennig optimal reinforcement learning for gaussian systems in advances in neural information processing systems nips pages quinonero candela girard larsen and rasmussen propagation of uncertainty in bayesian kernel to ahead forecasting in ieee international conference on acoustics speech and signal processing williams and rasmussen gaussian processes for machine learning mit press todorov compositionality of optimal control laws in advances in neural information processing systems nips pages deisenroth englert peters and fox policy search for robotics in proceedings of ieee international conference on robotics and automation icra 
stochastic expectation propagation yingzhen li university of cambridge cambridge uk miguel harvard university cambridge ma usa jmh richard turner university of cambridge cambridge uk abstract expectation propagation ep is deterministic approximation algorithm that is often used to perform approximate bayesian parameter learning ep approximates the full intractable posterior distribution through set of local approximations that are iteratively refined for each datapoint ep can offer analytic and computational advantages over other approximations such as variational inference vi and is the method of choice for number of models the local nature of ep appears to make it an ideal candidate for performing bayesian learning on large models in dataset settings however ep has crucial limitation in this context the number of approximating factors needs to increase with the number of datapoints which often entails prohibitively large memory overhead this paper presents an extension to ep called stochastic expectation propagation sep that maintains global posterior approximation like vi but updates it in local way like ep experiments on number of canonical learning problems using synthetic and datasets indicate that sep performs almost as well as full ep but reduces the memory consumption by factor of sep is therefore ideally suited to performing approximate bayesian learning in the large model large dataset setting introduction recently number of methods have been developed for applying bayesian learning to large datasets examples include sampling approximations distributional approximations including stochastic variational inference svi and assumed density filtering adf and approaches that mix distributional and sampling approximations one family of approximation method has garnered less attention in this regard expectation propagation ep ep constructs posterior approximation by iterating simple local computations that refine factors which approximate the posterior contribution from each datapoint at first sight it therefore appears well suited to problems the locality of computation make the algorithm simple to parallelise and distribute and good practical performance on range of small data applications suggest that it will be accurate however the elegance of local computation has been bought at the price of prohibitive memory overhead that grows with the number of datapoints since local approximating factors need to be maintained for every datapoint which typically incur the same memory overhead as the global approximation the same pathology exists for the broader class of power ep pep algorithms that includes variational message passing in contrast variational inference vi methods utilise global approximations that are refined directly which prevents memory overheads from scaling with is there ever case for preferring ep or pep to vi methods for large data we believe that there certainly is first ep can provide significantly more accurate approximations it is well known that variational approaches are biased and often severely so and for particular models the variational objective is pathologically such as those with likelihood functions second the fact that ep is truly local to factors in the posterior bution and not just likelihoods means that it affords different opportunities for tractable algorithm design as the updates can be simpler to approximate as ep appears to be the method of choice for some applications researchers have attempted to push it to scale one approach is to swallow the 
large computational burden and simply use large data structures to store the approximating factors (e.g., TrueSkill); this approach can only be pushed so far. A second approach is to use ADF, a simple variant of EP that only requires a global approximation to be maintained in memory; ADF, however, provides poorly calibrated uncertainty estimates, which was one of the main motivating reasons for developing EP in the first place. A third idea, complementary to the one described here, is to use approximating factors that have simpler structure (e.g., low rank); this reduces the memory consumption of Gaussian factors, but does not stop the scaling with the number of datapoints. Another idea uses EP to carve up the dataset, using approximating factors for collections of datapoints; this results in one local update per data partition rather than per datapoint, and other methods must be used to compute them (indeed, the spirit of that work is to extend sampling methods to large datasets, not EP itself).

Can we have the best of both worlds, that is, accurate global approximations that are derived from truly local computation? To address this question we develop an algorithm based upon the standard EP and ADF algorithms that maintains a global approximation which is updated in a local way. We call this class of algorithms stochastic expectation propagation (SEP), since it updates the global approximation with damped stochastic estimates on subsets of the data, in an analogous way to SVI; indeed, the generalisation of the algorithm to the PEP setting directly relates to SVI. Importantly, SEP reduces the memory footprint by a factor of $N$ when compared to EP. We further extend the method to control the granularity of the approximation and to treat models with latent variables without compromising on accuracy or incurring unnecessary memory demands. Finally, we demonstrate the scalability and accuracy of the method on a number of real-world and synthetic datasets.

Expectation propagation and assumed density filtering. We begin by briefly reviewing the EP and ADF algorithms upon which our new method is based. Consider, for simplicity, observing a dataset comprising samples $D = \{x_n\}_{n=1}^{N}$ from a probabilistic model $p(x\mid\theta)$ parametrised by an unknown vector $\theta$ that is drawn from a prior $p_0(\theta)$. Exact Bayesian inference involves computing the (typically intractable) posterior distribution of the parameters given the data,
$p(\theta\mid D) \propto p_0(\theta)\prod_{n=1}^{N}p(x_n\mid\theta) \approx q(\theta) \propto p_0(\theta)\prod_{n=1}^{N}f_n(\theta).$
Here $q(\theta)$ is a simpler, tractable approximating distribution that will be refined by EP. The goal of EP is to refine the approximate factors so that they capture the contribution of each of the likelihood terms to the posterior, $f_n(\theta) \approx p(x_n\mid\theta)$. In this spirit, one approach would be to find each approximating factor $f_n(\theta)$ by minimising the KL divergence between the posterior and the distribution formed by replacing one of the likelihoods by its corresponding approximating factor, $\mathrm{KL}[\,p(\theta\mid D)\,\|\,p(\theta\mid D)\,f_n(\theta)/p(x_n\mid\theta)\,]$. Unfortunately such an update is still intractable, as it involves computing the full posterior. Instead, EP approximates this procedure by replacing the exact leave-one-out posterior $p(\theta\mid D)/p(x_n\mid\theta)$ on both sides of the KL by the approximate leave-one-out posterior, called the cavity distribution, $q_{-n}(\theta) \propto q(\theta)/f_n(\theta)$. Since this couples the updates for the approximating factors, the updates must now be iterated.

In more detail, EP iterates four simple steps. First, the factor selected for update is removed from the approximation to produce the cavity distribution. Second, the corresponding likelihood is included to produce the tilted distribution, $\tilde p_n(\theta) \propto q_{-n}(\theta)\,p(x_n\mid\theta)$. Third, EP updates the approximating factor by minimising $\mathrm{KL}[\,\tilde p_n(\theta)\,\|\,q_{-n}(\theta)f_n(\theta)\,]$; the hope is that the contribution the updated factor makes to the posterior is similar to the effect the same likelihood has on the tilted distribution.
If the approximating distribution is in the exponential family, as is often the case, then the KL minimisation reduces to a moment-matching step that we denote $f_n(\theta) \leftarrow \mathrm{proj}[\tilde p_n(\theta)]/q_{-n}(\theta)$. Finally, having updated the factor, it is included into the approximating distribution. We summarise the update procedure for a single factor in Algorithm 1. Critically, the approximation step of EP involves local computations, since one likelihood term is treated at a time; the assumption is that these local computations, although possibly requiring further approximation, are far simpler to handle compared to the full posterior.

Algorithm 1 (EP): choose a factor $f_n$ to refine; compute the cavity distribution $q_{-n}(\theta) \propto q(\theta)/f_n(\theta)$; compute the tilted distribution $\tilde p_n(\theta) \propto p(x_n\mid\theta)\,q_{-n}(\theta)$; moment matching: $f_n(\theta) \leftarrow \mathrm{proj}[\tilde p_n(\theta)]/q_{-n}(\theta)$; inclusion: $q(\theta) \leftarrow q_{-n}(\theta)\,f_n(\theta)$.
Algorithm 2 (ADF): choose a datapoint $x_n$; compute the cavity distribution $q_{-n}(\theta) = q(\theta)$ (no factor removal); compute the tilted distribution $\tilde p_n(\theta) \propto p(x_n\mid\theta)\,q_{-n}(\theta)$; moment matching: $f_n(\theta) \leftarrow \mathrm{proj}[\tilde p_n(\theta)]/q_{-n}(\theta)$; inclusion: $q(\theta) \leftarrow q_{-n}(\theta)\,f_n(\theta)$.
Algorithm 3 (SEP): choose a datapoint $x_n$; compute the cavity distribution $q_{-1}(\theta) \propto q(\theta)/f(\theta)$; compute the tilted distribution $\tilde p_n(\theta) \propto p(x_n\mid\theta)\,q_{-1}(\theta)$; moment matching: $f_n(\theta) \leftarrow \mathrm{proj}[\tilde p_n(\theta)]/q_{-1}(\theta)$; inclusion: $q(\theta) \leftarrow q_{-1}(\theta)\,f_n(\theta)$; implicit update: $f(\theta) \leftarrow f(\theta)^{1-1/N}f_n(\theta)^{1/N}$.
Figure 1: comparing the expectation propagation (EP), assumed density filtering (ADF), and stochastic expectation propagation (SEP) update steps. Typically the algorithms will be initialised using $q(\theta) = p_0(\theta)$ and, where appropriate, $f_n(\theta) = 1$ or $f(\theta) = 1$.

In practice EP often performs well when the updates are parallelised. Moreover, by using approximating factors for groups of datapoints, and then running additional approximate inference algorithms to perform the EP updates (which could include nesting EP itself), EP carves up the data, making it suitable for distributed approximate inference. There is, however, one wrinkle that complicates deployment of EP at scale: computation of the cavity distribution requires removal of the current approximating factor, which means any implementation of EP must store the factors explicitly, necessitating a memory footprint that scales with $N$. One option is to simply ignore the removal step, replacing the cavity distribution with the full approximation; this results in the ADF algorithm (Algorithm 2), which needs only to maintain a global approximation in memory. But, as the moment-matching step now double-counts the underlying approximating factor (consider the new form of the objective, $\mathrm{KL}[\,q(\theta)\,p(x_n\mid\theta)/Z\,\|\,q(\theta)\,]$), the variance of the approximation shrinks to zero as multiple passes are made through the dataset. Early stopping is therefore required to prevent overfitting, and, generally speaking, ADF does not return uncertainties that are well calibrated to the posterior. In the next section we introduce a new algorithm that sidesteps EP's large memory demands whilst avoiding the pathological behaviour of ADF.

Stochastic expectation propagation. In this section we introduce a new algorithm inspired by EP, called stochastic expectation propagation (SEP), that combines the benefits of local approximation (tractability of updates, distributability and parallelisability) with a global approximation (reduced memory demands). The algorithm can be interpreted as a version of EP in which the approximating factors are tied, or alternatively as a corrected version of ADF that prevents overfitting. The key idea is that, at convergence, the approximating factors in EP can be interpreted as parameterising a global factor, $f(\theta)$, that captures the average effect of a likelihood on the posterior, $f(\theta)^N \propto \prod_n f_n(\theta) \approx \prod_n p(x_n\mid\theta)$. In this spirit, the new algorithm employs direct iterative refinement of a global approximation comprising the prior and $N$ copies of a single approximating factor, $q(\theta) \propto p_0(\theta)f(\theta)^N$. SEP uses updates that are analogous to EP's in order to refine $f(\theta)$ in such a way that it captures the average effect a likelihood function has on the posterior. First, the cavity distribution is formed by removing one of the copies of the factor, $q_{-1}(\theta) \propto q(\theta)/f(\theta)$. Second, the corresponding likelihood is included to produce the tilted distribution, $\tilde p_n(\theta) \propto q_{-1}(\theta)\,p(x_n\mid\theta)$, and, third, SEP finds an intermediate factor approximation by moment matching, $f_n(\theta) \leftarrow \mathrm{proj}[\tilde p_n(\theta)]/q_{-1}(\theta)$. Finally, having updated the factor, it is included into the approximating distribution. It is important here not to make a full update, since $f_n(\theta)$ captures the effect of just a single likelihood function $p(x_n\mid\theta)$; instead, damping should be employed to make a partial update, $f(\theta) \leftarrow f(\theta)^{1-\epsilon}f_n(\theta)^{\epsilon}$. A natural choice uses $\epsilon = 1/N$, which can be interpreted as minimising a KL divergence in the moment update, but other choices of $\epsilon$ may be more appropriate, including decreasing $\epsilon$ according to the Robbins-Monro condition. SEP is summarised in Algorithm 3. Unlike ADF, the cavity is formed by dividing out $f(\theta)$, which captures the average effect of a likelihood and prevents the posterior from collapsing as ADF's does. However, SEP only maintains the global approximation, since the tied factor and the cavity can both be recovered from $q(\theta)$ and the prior; when Gaussian approximating factors are used, for example, SEP reduces the storage requirement of EP by a factor of $N$, which is a substantial saving that enables models with many parameters to be applied to large datasets.
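A minimal sketch of the two update schemes follows, for a toy one-dimensional Bayesian probit model with a Gaussian approximation represented by its natural parameters, and tilted moments computed by grid quadrature. The model, data, grid, and iteration counts are all assumptions for illustration; this is not the code or the models used in the experiments reported later.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 50
x = rng.normal(size=N)
y = np.sign(x * 1.5 + 0.3 * rng.normal(size=N))      # synthetic labels
grid = np.linspace(-6, 6, 2001)                        # quadrature grid over theta

def tilted_moments(lam_cav, eta_cav, xn, yn):
    # moments of  N(theta; m_cav, v_cav) * Phi(yn * xn * theta)  by quadrature
    log_t = -0.5 * lam_cav * grid**2 + eta_cav * grid + norm.logcdf(yn * xn * grid)
    w = np.exp(log_t - log_t.max()); w /= w.sum()
    m = (w * grid).sum(); v = (w * (grid - m) ** 2).sum()
    return 1.0 / v, m / v                              # natural parameters of proj[.]

def ep(sweeps=5):
    lam_f, eta_f = np.zeros(N), np.zeros(N)            # one Gaussian factor per datapoint
    lam_q, eta_q = 1.0, 0.0                            # prior N(0, 1) in natural parameters
    for _ in range(sweeps):
        for n in range(N):
            lam_c, eta_c = lam_q - lam_f[n], eta_q - eta_f[n]   # cavity
            lam_q, eta_q = tilted_moments(lam_c, eta_c, x[n], y[n])
            lam_f[n], eta_f[n] = lam_q - lam_c, eta_q - eta_c   # implied factor
    return lam_q, eta_q

def sep(steps=250, eps=1.0 / N):
    lam_f, eta_f = 0.0, 0.0                            # single tied factor
    for _ in range(steps):
        n = rng.integers(N)
        lam_q, eta_q = 1.0 + N * lam_f, N * eta_f      # q = prior * f^N
        lam_c, eta_c = lam_q - lam_f, eta_q - eta_f    # remove one copy of f
        lam_new, eta_new = tilted_moments(lam_c, eta_c, x[n], y[n])
        lam_n, eta_n = lam_new - lam_c, eta_new - eta_c
        lam_f = (1 - eps) * lam_f + eps * lam_n        # damped (geometric) update of f
        eta_f = (1 - eps) * eta_f + eps * eta_n
    return 1.0 + N * lam_f, N * eta_f

for name, (lam, eta) in [("EP", ep()), ("SEP", sep())]:
    print(name, "posterior mean %.3f  var %.3f" % (eta / lam, 1.0 / lam))

Because the approximation is in the exponential family, the geometric damping of the tied factor corresponds to a convex combination of natural parameters, which is what the two interpolation lines in sep implement; the memory contrast is visible directly in the code, with ep storing arrays of length N and sep storing two scalars.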
Algorithmic extensions to SEP and theoretical results. SEP has been motivated from a practical perspective by the limitations inherent in EP and ADF. In this section we extend SEP in four orthogonal directions and relate SEP to SVI. Many of the algorithms described here are summarised in a figure, and they are detailed in the supplementary material.

Parallel SEP: relating the EP fixed points to SEP. The SEP algorithm outlined above approximates one likelihood at a time, which can be computationally slow. However, it is simple to parallelise the SEP updates by following the same recipe by which EP is parallelised. Consider a minibatch comprising $M$ datapoints (for a full parallel batch update, use $M = N$). First we form the cavity distribution for each likelihood; unlike in EP, these are all identical. Next, in parallel, we compute the intermediate factors $f_m(\theta) \leftarrow \mathrm{proj}[\tilde p_m(\theta)]/q_{-1}(\theta)$. In EP these intermediate factors become the new likelihood approximations and the approximation is updated accordingly; in SEP, the same update is used for the approximating distribution, which becomes $q(\theta) \propto p_0(\theta)\,f_{old}(\theta)^{N-M}\prod_m f_m(\theta)$, and, by implication, the approximating factor is $f_{new}(\theta) = f_{old}(\theta)^{1-M/N}\prod_m f_m(\theta)^{1/N}$. One way of understanding parallel SEP is as a double-loop algorithm: the inner loop produces intermediate approximations $q_m \leftarrow \arg\min_q\mathrm{KL}[\tilde p_m\,\|\,q]$; these are then combined in the outer loop, $q \leftarrow \arg\min_q\sum_m\mathrm{KL}[q_m\,\|\,q] + (N - M)\,\mathrm{KL}[q_{old}\,\|\,q]$. For $M = 1$, parallel SEP reduces to the original SEP algorithm; for $M = N$, parallel SEP is equivalent to the averaged EP (AEP) algorithm, proposed as a theoretical tool to study the convergence properties of normal EP. That work showed that, under fairly restrictive conditions (likelihood functions that are suitably regular and varying slowly as a function of the parameters), AEP converges to the same fixed points as EP in the large-data limit. There is another illuminating connection between SEP and AEP: since SEP's approximating factor converges to the geometric average of the intermediate factors, SEP converges to the same fixed points as AEP if the learning rates satisfy the Robbins-Monro condition, and therefore, under certain conditions, to the same fixed points as EP; but it is still an open question whether there are more direct relationships between EP and SEP.

Stochastic power EP: relationships to variational methods. The relationship between variational inference and stochastic variational inference mirrors the relationship between EP and SEP. Can these relationships be made more formal? If the moment-projection step in EP is replaced by a natural-parameter matching step, then the resulting algorithm is equivalent to the variational message passing (VMP) algorithm (see supplementary material).
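For reference, the contrast between the two projection steps just mentioned is conventionally expressed through the alpha-divergence family that underlies power EP (this is standard background from the divergence-measures literature cited in the references, not a result specific to this paper):
\[
D_\alpha\!\left[p\,\|\,q\right] \;=\; \frac{1}{\alpha(1-\alpha)}\left(1 - \int p(\theta)^{\alpha}\,q(\theta)^{1-\alpha}\,d\theta\right),
\qquad
\lim_{\alpha\to 1} D_\alpha\!\left[p\,\|\,q\right] = \mathrm{KL}\!\left[p\,\|\,q\right],
\qquad
\lim_{\alpha\to 0} D_\alpha\!\left[p\,\|\,q\right] = \mathrm{KL}\!\left[q\,\|\,p\right].
\]
The $\alpha\to 1$ limit is the moment-matching projection used by EP and SEP, while the $\alpha\to 0$ limit is the divergence minimised by VI and by the natural-parameter matching step that yields VMP.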
moreover vmp has the same fixed points as variational inference since minimising the local variational kl divergences is equivalent to minimising the global variational kl these results carry over to the new algorithms with minor modifications specifically vmp can be transformed into svmp by replacing vmp local approximations with the global form employed by sep in the supplementary material we show that this algorithm is an instance of standard svi and that it therefore has the same fixed points as vi when satisfies the condition more generally the procedure can be applied any member of the power ep pep family of algorithms which replace the moment projection step in ep with minimization relationships between ﬁxed points relationships between algorithms vi vmp avmp pep vmp alpha divergence updates avmp ep aep ep sep aep multiple approximating factors same same stochastic methods sep aep averaged ep avmp averaged vmp ep expectation propagation same in large data limit conditions apply parallel minibatch updates pep power ep sep stochastic ep svmp stochastic vmp ep with parallel updates sep with parallel updates vmp with parallel updates vi variational inference vmp variational message passing figure relationships between algorithms note that care needs to be taken when interpreting the as see supplementary material but care has to be taken when taking the limiting cases see supplementary these results lend weight to the view that sep is natural stochastic generalisation of ep distributed sep controlling granularity of the approximation ep uses approximation comprising single factor for each likelihood sep on the other hand uses approximation comprising signal global factor to approximate the average effect of all likelihood terms one might worry that sep approximation is too severe if the dataset contains sets of datapoints that have very different likelihood contributions for handwritten digits classification consider the affect of and on the posterior it might be more sensible in such cases to partition the dataset into disjoint pieces pk dk xn with nk and use an approximating factor for each partition if normal ep updates are performed on the subsets treating dk as single true factor to be approximated we arrive at the distributed ep algorithm but such updates are challenging as multiple likelihood terms must be included during each update necessitating additional approximations mcmc simpler alternative uses inside each partition implying posterior qk approximation of the form fk nk with fk nk approximating dk the limiting cases of this algorithm when and recover sep and ep respectively sep with latent variables many applications of ep involve latent variable models although this is not the main focus of the paper we show that sep is applicable in this case without scaling the memory footprint with consider model containing hidden variables hn associated with each observation xn hn that are drawn from prior hn the goal is to approximate the true posterior over parameters and hidden variables hn hn xn typically ep would approximate the effect of each intractable term as xn hn fn gn hn instead sep ties the approximate parameter factors xn hn gn hn yielding hn gn hn critically as proved in supplementary the local factors gn hn do not need to be maintained in memory this means that all of the advantages of sep carry over to more complex models involving latent variables although this can potentially increase computation time in cases where updates for gn hn are not analytic since then they 
will be initialised from scratch at each update experiments the purpose of the experiments was to evaluate sep on number of datasets synthetic and realworld small and large and on number of models probit regression mixture of gaussians and bayesian neural networks bayesian probit regression the first experiments considered simple bayesian classification problem and investigated the stability and quality of sep in relation to ep and adf as well as the effect of using minibatches and varying the granularity of the approximation the model comprised probit likelihood function yn xn and gaussian prior over the parameter γi the synthetic data comprised datapoints xn yn where xn were dimensional and were either sampled from single gaussian distribution fig or from mixture of gaussians mogs with components fig to investigate the sensitivity of the methods to the homogeneity of the dataset the labels were produced by sampling from the generative model we followed measuring the performance by computing an approximation of kl where was replaced by gaussian that had the same mean and covariance as samples drawn from the posterior using the sampler nuts to quantify the calibration of uncertainty estimations results in fig indicate that ep is the best performing method and that adf collapses towards delta function sep converges to solution which appears to be of similar quality to that obtained by ep for the dataset containing gaussian inputs but slightly worse when the mogs was used variants of sep that used larger fluctuated less but typically took longer to converge although for the small minibatches shown this effect is not clear the utility of finer grained approximations depended on the homogeneity of the data for the second dataset containing mog inputs shown in fig approximations were found to be advantageous if the datapoints from each mixture component are assigned to the same approximating factor generally it was found that there is no advantage to retaining more approximating factors than there were clusters in the dataset to verify whether these conclusions about the granularity of the approximation hold in real datasets we sampled datapoints for each of the digits in mnist and performed classification each digit class was assigned its own global approximating factor we compare the of test set using adf sep full ep and dsep in figure ep and dsep significantly outperform adf dsep is slightly worse than full ep initially however it reduces the memory to of full ep without losing accuracy substantially sep accuracy was still increasing at the end of learning and was slightly better than adf further empirical comparisons are reported in the supplementary and in summary the three ep methods are indistinguishable when likelihood functions have similar contributions to the posterior finally we tested sep performance on six small binary classification datasets from the uci machine learning we did not consider the effect of or the granularity of the approximation using we ran the tests with damping and stopped learning after convergence by monitoring the updates of approximating factors the classification results are summarised in table adf performs reasonably well on the mean classification error metric presumably because it tends to learn good approximation to the posterior mode however the posterior variance is poorly approximated and therefore adf returns poor test scores ep achieves significantly higher test than adf indicating that superior approximation to the posterior variance is attained 
crucially sep performs very similarly to ep implying that sep is an accurate alternative to ep even though it is refining cheaper global posterior approximation mixture of gaussians for clustering the small scale experiments on probit regression indicate that sep performs well for probabilistic models although it is not the main focus of the paper we sought to test the flexibility of the method by applying it to latent variable model specifically mixture of gaussians synthetic mogs dataset containing datapoints was constructed comprising gaussians https figure bayesian logistic regression experiments panels and show synthetic data experiments panel shows the results on mnist see text for full details table average test results all methods on probit regression all methods appear to capture the posterior mode however ep outperforms adf in terms of test on almost all of the datasets with sep performing similarly to ep dataset australian breast crabs ionos pima sonar adf mean error sep ep adf test sep ep the means were sampled from gaussian distribution µj the cluster identity variables were sampled from uniform categorical distribution hn and each mixture component was isotropic xn xn µhn ep adf and sep were performed to approximate the joint posterior over the cluster means µj and cluster identity variables hn the other parameters were assumed known figure visualises the approximate posteriors after iterations all methods return good estimates for the means but adf collapses towards point estimate as expected sep in contrast captures the uncertainty and returns nearly identical approximations to ep the accuracy of the methods is quantified in fig by comparing the approximate posteriors to those obtained from nuts in this case the approximate measure is analytically intractable instead we used the averaged of the difference of the gaussian parameters fitted by nuts and ep methods these measures confirm that sep approximates ep well in this case probabilistic backpropagation the final set of tests consider more complicated models and large datasets specifically we evaluate the methods for probabilistic backpropagation pbp recent method for scalable bayesian learning in neural network models previous implementations of pbp perform several iterations of adf over the training data the moment matching operations required by adf are themselves intractable and they are approximated by first propagating the uncertainty on the synaptic weights forward through the network in sequential way and then computing the gradient of the marginal likelihood by backpropagation adf is used to reduce the large memory cost that would be required by ep when the amount of available data is very large we performed several experiments to assess the accuracy of different implementations of pbp based on adf sep and ep on regression datasets following the same experimental protocol as in see supplementary material we considered neural networks with hidden units except for year and protein which we used table shows the average test rmse and test for each method interestingly sep can outperform ep in this setting possibly because the stochasticity enabled it to find better solutions and typically it performed similarly memory reductions using figure posterior approximation for the mean of the gaussian components visualises posterior approximations over the cluster means confidence level the coloured dots indicate the true label or the inferred cluster assignments the rest in we show the error in of the approximate gaussians 
means top and covariances bottom table average test results for all methods datasets are also from the uci machine learning repository dataset naval power protein wine year adf na rmse sep ep adf na test sep ep sep instead of ep were large for the protein dataset and for the year dataset see supplementary surprisingly adf often outperformed ep although the results presented for adf use number of sweeps and further iterations generally degraded performance adf good performance is most likely due to an interaction with additional moment approximation required in pbp that is more accurate as the number of factors increases conclusions and future work this paper has presented the stochastic expectation propagation method for reducing ep large memory consumption which is prohibitive for large datasets we have connected the new algorithm to number of existing methods including assumed density filtering variational message passing variational inference stochastic variational inference and averaged ep experiments on bayesian logistic regression both synthetic and real world and mixture of gaussians clustering indicated that the new method had an accuracy that was competitive with ep experiments on the probabilistic on large real world regression datasets again showed that sep comparably to ep with vastly reduced memory footprint future experimental work will focus on developing methods to leverage approximations desp that showed promising experimental performance and also updates there is also need for further theoretical understanding of these algorithms and indeed ep itself theoretical work will study the convergence properties of the new algorithms for which we only have limited results at present systematic comparisons of algorithms and variational methods will guide practitioners to choosing the appropriate scheme for their application acknowledgements we thank the reviewers for valuable comments yl thanks the schlumberger foundation faculty for the future fellowship on supporting her phd study jmhl acknowledges support from the rafael del pino foundation ret thanks epsrc grant and references sungjin ahn babak shahbaba and max welling distributed stochastic gradient mcmc in proceedings of the international conference on machine learning pages bardenet arnaud doucet and chris holmes towards scaling up markov chain monte carlo an adaptive subsampling approach in proceedings of the international conference on machine learning pages matthew hoffman david blei chong wang and john william paisley stochastic variational inference journal of machine learning research miguel and ryan adams probabilistic backpropagation for scalable learning of bayesian neural networks andrew gelman aki vehtari pasi jylnki christian robert nicolas chopin and john cunningham expectation propagation as way of life minjie xu balaji lakshminarayanan yee whye teh jun zhu and bo zhang distributed bayesian posterior sampling via moment sharing in nips thomas minka expectation propagation for approximate bayesian inference in uncertainty in artificial intelligence volume pages manfred opper and ole winther expectation consistent approximate inference the journal of machine learning research malte kuss and carl edward rasmussen assessing approximate inference for binary gaussian process classification the journal of machine learning research simon and nicolas chopin expectation propagation for inference journal of the american statistical association john cunningham philipp hennig and simon gaussian probabilities and 
expectation propagation arxiv preprint thomas minka power ep technical report microsoft research cambridge john winn and christopher bishop variational message passing in journal of machine learning research pages michael jordan zoubin ghahramani tommi jaakkola and lawrence saul an introduction to variational methods for graphical models machine learning matthew james beal variational algorithms for approximate bayesian inference phd thesis university of london richard turner and maneesh sahani two problems with variational expectation maximisation for models in barber cemgil and chiappa editors bayesian time series models chapter pages cambridge university press richard turner and maneesh sahani probabilistic amplitude and frequency demodulation in zemel bartlett pereira and weinberger editors advances in neural information processing systems pages ralf herbrich tom minka and thore graepel trueskill bayesian skill rating system in advances in neural information processing systems pages peter maybeck stochastic models estimation and control academic press yuan qi ahmed and thomas minka gaussian processes for general likelihoods in uncertainty and artificial intelligence uai amari and hiroshi nagaoka methods of information geometry volume oxford university press herbert robbins and sutton monro stochastic approximation method the annals of mathematical statistics pages guillaume dehaene and simon expectation propagation in the limit thomas minka divergence measures and message passing technical report microsoft research cambridge matthew hoffman and andrew gelman the sampler adaptively setting path lengths in hamiltonian monte carlo the journal of machine learning research 
exactness of approximate map inference in continuous mrfs nicholas ruozzi department of computer science university of texas at dallas richardson tx abstract computing the map assignment in graphical models is generally intractable as result for discrete graphical models the map problem is often approximated using linear programming relaxations much research has focused on characterizing when these lp relaxations are tight and while they are relatively in the discrete case only few results are known for their continuous analog in this work we use graph covers to provide necessary and sufficient conditions for continuous map relaxations to be tight we use this characterization to give simple proofs that the relaxation is tight for decomposable and logsupermodular decomposable models we conclude by exploring the relationship between these two seemingly distinct classes of functions and providing specific conditions under which the map relaxation can and can not be tight introduction graphical models are popular modeling tool for both discrete and continuous distributions we are commonly interested in one of two inference tasks in graphical models finding the most probable assignment map inference and computing marginal distributions these problems are nphard in general and variety of approximate inference schemes are used in practice in this work we will focus on approximate map inference for discrete state spaces linear programming relaxations of the map problem specifically the map lp are quite common these relaxations replace global marginalization constraints with collection of local marginalization constraints wald and globerson refer to these as local consistency relaxations lcrs the advantage of lcrs is that they are often much easier to specify and to optimize over by using algorithm such as loopy belief propagation lbp however the analogous relaxations for continuous state spaces may not be compactly specified and can lead to an unbounded number of constraints except in certain special cases to overcome this problem further relaxations have been proposed by construction each of these further relaxations can only be tight if the initial lcr was tight as result there are compelling theoretical and algorithmic reasons to investigate when lcrs are tight among the most continuous models are the gaussian graphical models for this class of models it is known that the continuous map relaxation is tight when the corresponding inverse covariance matrix is positive definite and scaled diagonally dominant special case of the decomposable models in addition lbp is known to converge to the correct solution for gaussian graphical models and decomposable models that satisfy scaled diagonal dominance condition while much of the prior work in this domain has focused on graphical models in this work we provide general necessary and sufficient condition for the continuous map relaxation to be tight this condition mirrors the known results for the discrete case and is based on the notion of graph covers the map lp is tight if and only if the optimal solution to the map problem is an upper bound on the map solution over any graph cover appropriately scaled this characterization will allow us to understand when the map relaxation is tight for more general models apart from this characterization theorem the primary goal of this work is to move towards uniform treatment of the discrete and continuous cases they are not as different as they may initially appear to this end we explore the relationship between 
decomposable models and logsupermodular decomposable models introduced here in the continuous case models provide an example of continuous graphical models for which the map relaxation is tight but the objective function is not necessarily these two concepts have analogs in discrete state spaces in particular decomposability is related to closures of discrete functions and decomposability is known condition which guarantees that the map lp is exact in the discrete setting we prove number of results that highlight the similarities and differences between these two concepts as well as general condition under which the map relaxation corresponding to pairwise twice continuously differentiable model can not be tight prerequisites let be function where is the set of possible assignments of each variable function factors with respect to hypergraph if there exist potential functions fi for each and fα for each such that xn fi xi fα xα the hypergraph together with the potential functions and define graphical model we are interested computing in general this map inference task is but in practice local algorithms based on approximations from statistical physics such as lbp produce reasonable estimates in many settings much effort has been invested into understanding when lbp solves the map problem in this section we briefly review approximate map inference in the discrete setting when is finite set for simplicity and consistency we will focus on models as in given vector of sufficient statistics φi xi rk for each and xi and parameter vector θi rk we will assume that fi xi exp hθi φi xi similarly given vector of sufficient statistics φα xα for each and xα and parameter vector θα we will assume that fα xα exp hθα φα xα we will write to represent the concatenation of the individual sufficient statistics and to represent the concatenation of the parameters the objective function can then be expressed as exp hθ the map lp relaxation the map problem can be formulated in terms of mean parameters sup log sup hθ µi rm eτ where is the space of all densities over and is the set of all realizable mean parameters in general is difficult object to compactly describe and to optimize over as result one typically constructs convex outerbounds on that are more manageable in the case that is finite one such outerbound is given by the map lp for each and define φi xi xi similarly for each and define φα xα xα with this choice of sufficient statistics is equivalent to the set of all marginal distributions over the individual variables and elements of that arise from some joint probability distribution the map lp is obtained by replacing with relaxation that only enforces local consistency constraints µα xα µi xi for all xi ml for all xi µi xi the set of constraints ml is known as the local marginal polytope the approximate map problem is then to compute hθ µi hypergraph graph one possible of figure an example of graph cover of factor graph the nodes in the cover are labeled for the node that they copy in the base graph graph covers in this work we are interested in understanding when this relaxation is tight when does hθ µi log for discrete mrfs the map lp is known to be tight in variety of different settings two different theoretical tools are often used to investigate the tightness of the map lp duality and graph covers duality has been particularly useful in the design of convergent and correct schemes that solve the map lp graph covers provide theoretical framework for understanding when and why algorithms such as 
belief propagation fail to solve the map problem definition graph covers graph if there exists graph homomorphism such that for all vertices and all maps the neighborhood of in bijectively to the neighborhood of in if graph covers graph then looks locally the same as in particular local messagepassing algorithms such as lbp have difficulty distinguishing graph and its covers if then we say that is copy of further is said to be an of if every vertex of has exactly copies in this definition can be easily extended to hypergraphs each hypergraph can be represented in factor graph form create node in the factor graph for each vertex called variable nodes and each hyperedge called factor nodes of each factor node is connected via an edge in the factor graph to the variable nodes on which the corresponding hyperedge depends for an example of see figure to any ah of given by the homomorphism we can associate collection of potentials the potential at node is equal to fh the potential at node and for each ah we associate the potential fh in this way we can construct function such that factorizes over we will say that the graphical model is an of the graphical model whenever is an of and is chosen as described th above it will be convenient in the sequel to write xh xm where xm is the copy of variable there is direct correspondence between ml and assignments on graph covers this correspondence is the basis of the following theorem theorem ruozzi and tatikonda sup hθ µi sup sup sup xh log xh where is the set of all of theorem claims that the optimal value of the map lp is equal to the supremum over all map assignments over all graph covers appropriately scaled in particular the proof of this result shows that under mild conditions there exists an of and an assignment xh such that log hθ µi continuous mrfs in this section we will describe how to extend the previous results from discrete to continuous mrfs using graph covers the relaxation that we consider here is the appropriate extension of the map lp where each of the sums are replaced by integrals densities τi τα τα xα τi xi for all xi ml µi eτi φi for all µα eτα φα for all our goal is to understand under what conditions this continuous relaxation is tight wald and globerson have approached this problem by introducing further relaxation of ml which they call the weak local consistency relaxation weak lcr they provide conditions under which the weak lcr and hence the above relaxation is tight in particular they show that weak lcr is tight for the class of decomposable models in this work we take different approach we first prove the analog of theorem in the continuous case and then we show that the known conditions that guarantee tightness of the continuous relaxation are simple consequences of this general theorem theorem sup hθ µi sup sup sup xh log xh where is the set of all of the proof of theorem is conceptually straightforward albeit technical and can be found in appendix the proof approximates the expectations in ml as expectations with respect to simple functions applies the known results for finite spaces and takes the appropriate limit like its discrete counterpart theorem provides necessary and sufficient conditions for the continuous relaxation to be tight in particular for the relaxation to be tight the optimal solution on any cover appropriately scaled can not exceed the value of the optimal solution of the map problem over tightness of the map relaxation theorem provides necessary and sufficient conditions for the tightness of the 
continuous relaxation however checking that the maximum value attained on any is bounded by the maximum value over the base graph to the in and of itself appears to be daunting task in this section we describe two families of graphical models for which this condition is easy to verify the decomposable functions and the decomposable functions decomposability has been studied before particularly in the case of gaussian graphical models with respect to graphical models however appears to have been primarily studied in the discrete case decomposability function rn is if λx for all rn and all if can be written as product of potentials over hypergraph we say that is decomposable over theorem if is decomposable then supx log hθ µi proof by decomposability for any of xm which we obtain by applying the definition of separately to each of the copies of the potential functions for each node and factor of as result xm xm supx the proof of the theorem then follows by applying theorem wald and globerson provide different proof of theorem by exploiting duality and the weak lcr decomposability functions have played an important role in the study of discrete graphical models and arises in number of classical correlations inequalities the fkg inequality for decomposable models the map lp is tight and the map problem can be solved exactly in polynomial time in the continuous case is defined analogously to the discrete case that is rn is if for all rn where is the componentwise maximum of the vectors and and is the componentwise minimum continuous functions are sometimes said to be multivariate totally positive of order two we will say that graphical model is decomposable if can be factorized as product of potentials for any collection of vectors xk rn let xk be the vector whose th component is the ith largest element of xkj for each theorem if is decomposable then supx log hθ µi proof by decomposability for any of xm again this follows by repeatedly applying the definition of separately to each of the copies of the potential functions for each node and factor of as result qm xm xm xm xm the proof of the theorem then follows by applying theorem decomposability decomposability as discussed above decomposable and decomposable models are both examples of continuous graphical models for which the map relaxation is tight these two classes are not equivalent twice continuously differentiable functions are supermodular if and only if all off diagonal elements of the hessian matrix are contrast this with twice continuously differentiable concave functions where the hessian matrix must be negative semidefinite in particular this means that functions can be multimodel in this section we explore the relationship between and gaussian mrfs we begin with the case of gaussian graphical models pairwise graphical models given by ax exp aii xi bi xi exp xi xj for some symmetric positive definite matrix and vector rn here factors over the graph corresponding to the entries of the matrix gaussian graphical models are relatively class of continuous graphical models in fact sufficient conditions for the convergence and correctness of gaussian belief propagation gabp are known for these models specifically gabp converges to the optimal solution if the positive definite matrix is scaled diagonally dominant or decomposable these conditions are known to be equivalent definition is scaled diagonally dominant if rn such that ij in addition the following theorem provides characterization of scaled diagonal dominance and hence 
decomposability in terms of graph covers for these models theorem ruozzi and tatikonda let be symmetric positive definite matrix the following are equivalent is scaled diagonally dominant all covers of are positive definite all of are positive definite the proof of this theorem constructs specific whose covariance matrix has negative eigenvalues whenever the matrix is positive definite but not scaled diagonally dominant the joint distribution corresponding to this is not bounded from above so the optimal value of the map relaxation is as per theorem for gaussian graphical models decomposability and decomposability are related every positive definite decomposable model is decomposable and every positive definite decomposable model is signed version of some positive definite decomposable gaussian graphical model this follows from the following simple lemma lemma symmetric positive definite matrix is scaled diagonally dominant if and only if the matrix such that bii aii for all and bij for all is positive definite if is positive definite and scaled diagonally dominant then the model is decomposable in contrast the model would be decomposable if all of the elements of were negative independent of the diagonal in particular the diagonal could have both positive and negative elements meaning that could be either or neither as quadratic forms do not correspond to normalizable gaussian graphical models the case appears to be less interesting as the map problem is unbounded from above however the situation is entirely different for constrained over some convex set maximization as an example consider box constrained maximization problem where the matrix has all negative entries such model is always decomposable hence the map relaxation is tight but the model is not necessarily pairwise twice differentiable mrfs all of the results from the previous section can be extended to general twice continuously differentiable functions over pairwise graphical models for all in this section unless otherwise specified assume that all models are pairwise theorem if log is strictly concave and twice continuously differentiable the following are equivalent log is scaled diagonally dominant for all log xh is negative definite for every graph cover of and every xh log xh is negative definite for every of and every xh the equivalence of in theorem follows from theorem corollary if log is scaled diagonally dominant for all then the continuous map relaxation is tight corollary if is decomposable over pairwise graphical model and strictly logconcave then log is scaled diagonally dominant for all whether or not decomposability is equivalent to the other conditions listed in the statement of theorem remains an open question though we conjecture that this is the case similar ideas can be extended to general twice continuously differentiable functions theorem suppose log is twice continuously differentiable with maximum at let bij log ij for all and bii log ii if admits pairwise factorization over and has both positive and negative eigenvalues then the continuous map relaxation is not tight proof if has both positive and negative eigenvalues then there exists of such that log has both positive and negative eigenvalues as result the lift of to the is saddle point consequently supxh xh by theorem the continuous map relaxation can not be tight this negative result is quite general if log is positive definite but not scaled diagonally dominant at any global optimum then the map relaxation is not tight in particular this means 
that all decomposable functions that meet the conditions of the theorem must be at their optima algorithmically moallemi and van roy argued that belief propagation converges for models that are decomposable and scaled diagonally dominant it is unknown whether or not similar convergence argument applies to decomposable functions concave closures many of the tightness results in the discrete case can be seen as specific case of the continuous results described above again suppose that is finite set definition the concave closure of function at rn is given by sup equivalently the concave closure of function is the smallest concave function such that for all function and its concave closure must necessarily have the same maximum computing the concave or convex closure of function is in general but it can be efficiently computed for certain special classes of discrete functions in particular when and log is supermodular then its concave closure can be computed in polynomial time as it is equal to the extension of log the extension has number of interesting properties most notably it is linear the extension of sum of functions is equal to sum of the extensions define the closure of to be fˆ exp log as result if is decomposable then fˆ is decomposable theorem if fˆ fˆi fˆα then hθ µi this theorem is direct consequence of theorem for example the tightness results of bayati et al and sanghavi et al and indeed many others can be seen as special case of this theorem even when is not finite the concave closure can be similarly defined and the theorem holds in this case as well given the characterization in the discrete case this suggests that there could be possibly deep connection between closures and decomposability discussion we have demonstrated that the same necessary and sufficient condition based on graph covers for the tightness of the map lp in the discrete case translates seamlessly to the continuous case this characterization allowed us to provide simple proofs of the tightness of the map relaxation for logconcave decomposable and decomposable models while the proof of theorem is nontrivial it provides powerful tool to reason about the tightness of map relaxations we also explored the intricate relationship between and decomposablity in both the discrete and continuous cases which provided intuition about when the map relaxation can or can not be tight for pairwise graphical models proof of theorem the proof of this theorem proceeds in two parts first we will argue that log xh sup hθ µi sup sup sup xh to see this fix an of via the homomorphism and consider any assignment xh construct the mean parameters ml as follows τα xα τi xi τi xi φi xi dxi τα xα φα xα dxα xh xi xh xα here is the dirac delta this implies that log xh hθ sup hθ µi for the other direction fix some ml such that is generated by the vector of densities we will prove the result for locally consistent probability distributions with bounded support the result for arbitrary will then follow by constructing sequences of these distributions that converge in measure to for simplicity we will assume that each potential function is strictly consider the space for some positive integer we will consider local probability distributions that are supported on subsets of this space that is supp τi for each and supp τα for each for fixed positive integer divide the interval into intervals of size and let sk denote the th interval this partitioning divides into disjoint cubes of volume the distribution can be approximated as sequence of 
distributions as follows define vector of approximate densities by setting sr sk τi xi dxi if sk τis otherwise τα xα dxα if kj skj kj τα xα otherwise we have τi xi φi xi dxi for each and dxα for each the continuous map relaxation for local probability distributions of this form can be expressed in terms of discrete variables over to see this define µsi zi sz τis xi dxi for each zi and µsα zα sz ταs xα dxα for each zα the corresponding map lp objective evaluated at µs is then xx xx log fα xα dxα µsi zi log fi xi dxi µsα zα szi zi szα zα this map lp objective corresponds to discrete graphical model that factors over the hypergraph with potential functions corresponding to the above integrals over the partitions indexed by the vector exp log fi xi dxi exp log fα xα dxα szi exp log fi xi dx sz szα exp log fα xα dx sz every assignment selects single cube indexed by the value of the objective is calculated by averaging log over the cube indexed by as result maxz supx and for any of xm as this upper bound holds for any fixed it must also hold for any vector of distributions that can be written as limit of such distributions now by applying theorem for the discrete case hθ hθ µs log xh as desired to finish the proof observe that any riemann supm supxh integrable density can be arbitrarily well approximated by densities of this form as in order to make this precise we would need to use lebesgue integration or take sequence of probability distributions over the space rm that arbitrarily the desired assignment xh the same argument will apply in the general case but each of the local distributions must be contained in the support of the corresponding potential function supp τi supp fi for the integrals to exist references globerson and jaakkola fixing convergent message passing algorithms for map in proc neural information processing systems nips vancouver canada werner linear programming approach to problem review pattern analysis and machine intelligence ieee transactions on ruozzi and tatikonda algorithms reparameterizations and splittings ieee transactions on information theory wald and globerson tightness results for local consistency relaxations in continuous mrfs in proc uncertainty in artifical intelligence uai quebec city quebec canada minka expectation propagation for approximate bayesian inference in proceedings of the seventeenth conference on uncertainty in artificial intelligence uai pages ruozzi and tatikonda algorithms for quadratic minimization journal of machine learning research malioutov johnson and willsky and belief propagation in gaussian graphical models journal of machine learning research moallemi and van roy convergence of message passing for quadratic optimization information theory ieee transactions on may moallemi and van roy convergence of for convex optimization information theory ieee transactions on april wainwright and jordan graphical models exponential families and variational inference in machine learning foundations and trends bayati borgs chayes and zecchina belief propagation for weighted on arbitrary graphs and its relation to linear programs with integer solutions siam journal on discrete mathematics kolmogorov and zabih what energy functions can be minimized via graph cuts in computer visioneccv pages springer sanghavi malioutov and willsky belief propagation and lp relaxation for weighted matching in general graphs information theory ieee transactions on april sanghavi shah and willsky message passing for maximum weight independent set information theory 
ieee transactions on wainwright jaakkola and willsky map estimation via agreement on hyper trees and linear programming information theory ieee transactions on david sontag talya meltzer amir globerson yair weiss and tommi jaakkola tightening lp relaxations for map using in conference in uncertainty in artificial intelligence pages auai press vontobel counting in graph covers combinatorial characterization of the bethe entropy function information theory ieee transactions on vontobel and koetter decoding and analysis of iterative decoding of ldpc codes corr iwata fleischer and fujishige strongly algorithm for minimizing submodular functions journal of the acm schrijver combinatorial algorithm minimizing submodular functions in strongly polynomial time journal of combinatorial theory series karlin and rinott classes of orderings of measures and related correlation inequalities multivariate totally positive distributions journal of multivariate analysis weiss and freeman correctness of belief propagation in gaussian graphical models of arbitrary topology neural malioutov approximate inference in gaussian graphical models thesis eecs mit 
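As a concrete companion to the discrete MAP LP reviewed in the prerequisites of the preceding paper, the following sketch builds the local-consistency (pairwise) relaxation for a tiny binary MRF and compares its value to brute-force MAP; a frustrated three-cycle is a standard example on which the relaxation is loose, in line with the graph-cover characterization. The graph, potentials, and use of SciPy's LP solver are illustrative choices, not taken from the paper.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# Pairwise MAP LP (local consistency relaxation) for a small discrete MRF,
# solved with an off-the-shelf LP solver and compared to brute-force MAP.

k = 2                                   # states per variable
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]        # a 3-cycle
theta_i = {i: np.zeros(k) for i in nodes}
# "frustrated" cycle: every edge prefers its endpoints to disagree
theta_ij = {e: np.array([[0.0, 1.0], [1.0, 0.0]]) for e in edges}

# index the LP variables: node marginals first, then edge marginals
idx, n_var = {}, 0
for i in nodes:
    for s in range(k):
        idx[('n', i, s)] = n_var; n_var += 1
for e in edges:
    for s in range(k):
        for t in range(k):
            idx[('e', e, s, t)] = n_var; n_var += 1

c = np.zeros(n_var)                     # linprog minimizes, so negate the objective
for i in nodes:
    for s in range(k):
        c[idx[('n', i, s)]] = -theta_i[i][s]
for e in edges:
    for s in range(k):
        for t in range(k):
            c[idx[('e', e, s, t)]] = -theta_ij[e][s, t]

A_eq, b_eq = [], []
for i in nodes:                          # normalization: sum_s mu_i(s) = 1
    row = np.zeros(n_var)
    for s in range(k):
        row[idx[('n', i, s)]] = 1.0
    A_eq.append(row); b_eq.append(1.0)
for e in edges:                          # local consistency (marginalization) constraints
    i, j = e
    for s in range(k):                   # sum_t mu_e(s, t) = mu_i(s)
        row = np.zeros(n_var)
        for t in range(k):
            row[idx[('e', e, s, t)]] = 1.0
        row[idx[('n', i, s)]] = -1.0
        A_eq.append(row); b_eq.append(0.0)
    for t in range(k):                   # sum_s mu_e(s, t) = mu_j(t)
        row = np.zeros(n_var)
        for s in range(k):
            row[idx[('e', e, s, t)]] = 1.0
        row[idx[('n', j, t)]] = -1.0
        A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, 1)] * n_var, method='highs')
lp_value = -res.fun

map_value = max(
    sum(theta_i[i][x[i]] for i in nodes) +
    sum(theta_ij[e][x[e[0]], x[e[1]]] for e in edges)
    for x in itertools.product(range(k), repeat=len(nodes)))

print("MAP LP value:", lp_value, " exact MAP value:", map_value)
# For this frustrated cycle the LP value (3.0) exceeds the exact MAP value (2.0):
# the relaxation is not tight, consistent with the graph-cover characterization.
```

Flipping the edge potentials to reward agreement makes the LP and brute-force values coincide, which is the tight (attractive, log-supermodular-style) regime discussed above.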
scale up nonlinear component analysis with doubly stochastic gradients bo yingyu le georgia institute of technology lsong princeton university yingyul abstract nonlinear component analysis such as kernel principle component analysis kpca and kernel canonical correlation analysis kcca are widely used in machine learning statistics and data analysis but they can not scale up to big datasets recent attempts have employed random feature approximations to convert the problem to the primal form for linear computational complexity however to obtain high quality solutions the number of random features should be the same order of magnitude as the number of data points making such approach not directly applicable to the regime with millions of data points we propose simple computationally efficient and memory friendly algorithm based on the doubly stochastic gradients to scale up range of kernel nonlinear component analysis such as kernel pca cca and svd despite the nature of these problems our method enjoys theoretical guarantees that it converges at the rate to the global optimum even for the top eigen subspace unlike many alternatives our algorithm does not require explicit orthogonalization which is infeasible on big datasets we demonstrate the effectiveness and scalability of our algorithm on large scale synthetic and real world datasets introduction scaling up nonlinear component analysis has been challenging due to prohibitive computation and memory requirements recently methods such as randomized component analysis rca are able to scale to larger datasets by leveraging random feature approximation such methods approximate the kernel function by using explicit random feature mappings then perform subsequent steps in the primal form resulting in linear computational complexity nonetheless theoretical analysis shows that in order to get high quality results the number of random features should grow linearly with the number of data points experimentally one often sees that the statistical performance of the algorithm improves as one increases the number of random features another approach to scale up the kernel component analysis is to use stochastic gradient descent and online updates these stochastic methods have also been extended to the kernel case they require much less computation than their batch counterpart converge in rate and are naturally applicable to streaming data setting despite that they share severe drawback all data points used in the updates need to be saved rendering them impractical for large datasets in this paper we propose to use the doubly stochastic gradients for nonlinear component analysis this technique is general framework for scaling up kernel methods for convex problems and has been successfully applied to many popular kernel machines such as kernel svm kernel ridge regressions and gaussian process it uses two types of stochastic approximation simultaneously random data points instead of the whole dataset as in stochastic update rules and random features instead of the true kernel functions as in rca these two approximations lead to the following benefits computation efficiency the key computation is the generation of of random features and the evaluation of them on of data points which is very efficient memory efficiency instead of storing training data points we just keep small program for regenerating the random features and sample previously used random features according to prespecified random seeds this leads to huge savings the memory requirement up to step 
is independent of the dimension of the data adaptibility unlike other approaches that can only work with fixed number of random features beforehand doubly stochastic approach is able to increase the model complexity by using more features when new data points arrive and thus enjoys the advantage of nonparametric methods although on first look our method appears similar to the approach in the two methods are fundamentally different in they address convex problems whereas our problem is highly nonconvex the convergence result in crucially relies on the properties of convex functions which do not translate to our problem instead our analysis centers around the stochastic update of power iterations which uses different set of proof techniques in this paper we make the following contributions general framework we show that the general framework of doubly stochastic updates can be applied in various kernel component analysis tasks including kpca ksvd kcca etc strong theoretical guarantee we prove that the finite time convergence rate of doubly stochastic approach is this is significant result since the global convergence result is nonconvex problem the guarantee is for update rules without explicit orthogonalization previous works require explicit orthogonalization which is impractical for kernel methods on large datasets strong empirical performance our algorithm can scale to datasets with millions of data points moreover the algorithm can often find much better solutions thanks to the ability to use many more random features we demonstrate such benefits on both synthetic and real world datasets since kernel pca is typical task we focus on it in the paper and provide description of other tasks in section although we only state the guarantee for kernel pca the analysis naturally carries over to the other tasks related work many efforts have been devoted to scale up kernel methods the random feature approach approximates the kernel function with explicit random feature mappings and solves the problem in primal form thus circumventing the quadratic computational complexity it has been applied to various kernel methods among which most related to our work is rca one drawback of rca is that their theoretical guarantees are only for kernel matrix approximation it does not say anything about how close the solution obtained from rca is to the true solution in contrast we provide finite time convergence rate of how our solution approaches the true solution in addition even though moderate size of random features can work well for tens of thousands of data points datasets with tens of millions of data points require many more random features our online approach allows the number of random features hence the flexibility of the function class to grow with the number of data points this makes our method suitable for data streaming setting which is not possible for previous approaches online algorithms for pca have long history oja proposed two stochastic update rules for approximating the first eigenvector and provided convergence proof in respectively these rules have been extended to the generalized hebbian update rules that compute the top eigenvectors the subspace case similar ones have also been derived from the perspective of optimization and stochastic gradient descent they are further generalized to the kernel case however online kernel pca needs to store all the training data which is impractical for large datasets our doubly stochastic method avoids this problem by using random features and 
keeping only small program for regenerating previously used random features according to seeds as result it can scale up to tens of millions of data points for finite time convergence rate proved the rate for the top eigenvector in linear pca using oja rule for the same task proposed noise reduced pca with linear convergence rate where the rate is in terms of epochs number of passes over the whole dataset the noisy power method presented in provided linear convergence for subspace although it only converges linearly to constant error level in addition the updates require explicit orthogonalization which algorithm αi algorithm evaluate αi require φω for do sample xi sample ωi with seed hi evaluate xi αj αi ηi φωi xi hi αj αj ηi αj hi hi for end for require φω set rk for do sample ωi with seed φωi αi end for is impractical for kernel methods in comparison our method converges in for subspace without the need for orthogonalization preliminaries kernels and covariance operators kernel is function that is pd for all cn and xn we have pn ci cj xi xj reproducing kernel hilbert space rkhs on is hilbert space of functions from to is an rkhs if and only if there exists such that and hf if given with rkhs the covariance operator is linear operator defined as af ex and furthermore hg af if ex let fk be list of functions in the rkhs and we define notation af afk and af is matrix whose element is hfi afj if the of function defines linear operator vv such that vv hv if let vk be list of functions then the weighted sum of set of linear operators can be denoted using notation as σk pk λi vi vi where σk is diagonal matrix with λi on the entry of the diagonal eigenfunctions and kernel pca function is an eigenfunction of with the corresponding eigenvalue if av λv given set of eigenfunctions vi and associated eigenvalues λi where hvi vj if δij we can write the as σk where is the list of top eigenfunctions σk is diagonal matrix with the corresponding eigenvalues is the list of the rest of the eigenfunctions and is diagonal matrix with the rest of the eigenvalues kernel pca aims to identifypthe the top subspace in the finite data case the empirical covariance operator is xi xi according to the representer theorem we have pn vi αij xj where αi rn are weights for the data points using av λv and the kernel trick we have kαi λi αi where is the gram matrix random feature approximation the random feature approximation for kernels gaussian rbf kernel relies on the identity eiω dp φω φω since the fourier transform of pd function is nonrd negative thus can be considered as scaled probability measure we can therefore approximate the kernel function as an empirical average of samples from the distribution in other words φωi φωi where ωi are samples drawn from for gaussian rbf kernel exp this yields gaussian distribution exp see for more details algorithm in this section we describe an efficient algorithm based on the doubly stochastic gradients to scale up kernel pca kpca is essentially an eigenvalue problem in functional space traditional approaches convert it to the dual form leading to another eigenvalue problem whose size equals the number of training points which is not scalable other approaches solve it in the primal form with stochastic functional gradient descent however these algorithms need to store all the training points seen so far they quickly run into memory issues when working with millions of data points we propose to tackle the problem with doubly stochastic gradients in which we make two unbiased 
stochastic approximations one stochasticity comes from sampling data points as in stochastic gradient descent another source of stochasticity is from random features to approximate the kernel one technical difficulty in designing doubly stochastic kpca is an explicit orthogonalization step required in the update rules which ensures the top eigenfunctions are orthogonal this is infeasible for kernel methods on large dataset since it requires solving an increasingly larger kpca problem in every iteration to solve this problem we formulate the orthogonality constraints into lagrange multipliers which leads to an update rule the new update enjoys small per iteration complexity and converges to the subspace we present the algorithm by first deriving the stochastic functional gradient update without random feature approximations then introducing the doubly stochastic updates for simplicity of presentation the following description uses one data point and one random feature at time but typically of data points and random features are used in each iteration stochastic functional gradient update kernel pca can be formulated as the following optimization problem max tr ag where and is the function gradient descent on the lagrangian leads to gt ηt gt agt using stochastic approximation for at xt xt we have at gt xt gt and at gt gt gt where gt gt xt gt xt therefore the update rule is gt ηt gt gt ηt xt gt this rule can also be derived using stochastic gradient and oja rule doubly stochastic update the update rule must store all the data points it has seen so far which is impractical for large scale datasets to address this issue we use the random feature approximation φωi φωi denote ht the function we get at iteration the update rule becomes ht ηt ht ht ηt φωt xt φωt ht where ht is the evaluation of ht at the current data point ht ht xt hkt xt the specific updates in terms of the coefficients are summarized in algorithms and note that in theory new random features are drawn in each iteration but in practice one can revisit these random features extensions locating individual eigenfunctions the algorithm only finds the eigen subspace but not necessarily individual eigenfunctions modified version called generalized hebbian algorithm gha can be used for this purpose gt ηt at gt ηt gt ut where ut is an operator that sets the lower triangular parts to zero latent variable models and kernel svd recently spectral methods have been proposed to learn latent variable models with provable guarantees in which the key computation is svd our algorithm can be straightforwardly extended to solve kernel svd with two simultaneous update rules the algorithm is summarized in algorithm see the supplementary for derivation details kernel cca and generalized eigenvalue problem given two variables and cca finds two projections such that the correlations between the two projected variables are maximized it is equivalent to generalized eigenvalue problem which can also be solved in our framework we present the updates for coefficients in algorithm and derivation details in the supplementary kernel sliced inverse regression kernel sliced inverse regression aims to do sufficient dimension reduction in which the found low dimension representation preserves the statistical correlation with the targets it also reduces to generalized eigenvalue problem and has been shown to find the same subspace as kcca analysis in this section we provide finite time convergence guarantees for our algorithm as discussed in the previous section 
explicit orthogonalization is not scalable for the kernel case therefore we algorithm algorithm require φω output αi βi for do sample xi sample yi sample ωi with seed ui evaluate xi αj vi evaluate yi βj rk ui vi vi αi ηi φωi xi vi βi ηi φωi yi ui αj αj ηi αj for βj βj ηi βj for end for require φω output αi βi for do sample xi sample yi sample ωi with seed ui evaluate xi αj vi evaluate yi βj rk ui vi vi αi ηi φωi xi vi ui βi ηi φωi yi ui vi end for need to provide guarantees for the updates without orthogonalization this challenge is even more prominent when using random features since it introduces additional variance furthermore our guarantees are the top subspace although the convergence without normalization for top eigenvector has been established before the subspace case is complicated by the fact that there are angles between subspaces and we need to bound the largest angle to the best of our knowledge our result is the first finite time convergence result for subspace without explicit orthogonalization note that even though it appears our algorithm is similar to on the surface the underlying analysis is fundamentally different in the result only applies to convex problems where every local optimum is global optimum while the problems we consider are highly as result many techniques that builds upon are not applicable conditions and assumptions we will focus on the case when good initialization is given in other words we analyze the later stage of the convergence which is typical in the literature the early stage can be analyzed using established techniques in practice one can achieve good initialization by solving small rca problem with thousands of data points and random features throughout the paper we suppose and regard and as constants note that this is true for all the kernels and corresponding random features considered we further regard the eigengap λk as constant which is also true for typical applications and datasets update without random features our guarantee is on the cosine of the principal angle between the computed subspace and the ground kv gt wk truth eigen subspace also called potential function gt minw kg consider the two different update rules one with explicit orthogonalization and another without orth ft ηt at ft gt ηt gt at gt where at is the empirical covariance of our final guarantee for gt is the following theorem assume and suppose the sizes satisfy that for any ka ai λk there exist step sizes ηi such that gt the convergence rate is in the same order as that of computing only the top eigenvector in linear pca the bound requires the size is large enough so that the spectral norm of is approximated up to the order of the eigengap this is because the increase of the potential is in the order of the eigengap similar terms appear in the analysis of the noisy power method which however requires orthogonalization and is not suitable for the kernel case we do not specify the size but by assuming suitable data distributions it is possible to obtain explicit bounds see for example proof sketch we first prove the guarantee for the orthogonalized subspace ft which is more convenient to analyze and then show that the updates for ft and gt are first order equivalent so gt enjoys the same guarantee to do so we will require lemma and below lemma ft let denote ft then key step in proving the lemma is to show the following recurrence λk ka at we will need the size large enough so that ka at is smaller than the another key element in the proof of the theorem is the first 
order equivalence of the two update rules to show this we introduce gt orth gt ηt at gt to denote the subspace by applying the update rule of ft on gt we show that the potentials of and gt are close lemma gt the lemma means that applying the two update rules to the same input will result in two subspaces with similar potentials then by we have gt which leads to our theorem the proof of lemma is based on the observation that λmin comparing the taylor expansions ηt for and gt leads to the lemma doubly stochastic update the ht computed in the doubly stochastic update is no longer in the rkhs so the principal angle is not well defined instead we will compare the evaluation of functions from ht and the true principal subspace respectively on point formally we show that for any function with unit norm kvkf there exists function in ht such that for any err is small with high probability to do so we need to introduce companion update rule ηt xt ηt ht ht resulting in function in the rkhs but the update makes use of function values from ht ht which is outside the rkhs let be the coefficients of projected onto ht and then the error can be decomposed as kv zkf lemma zkf ii lemma kvkf by definition kv so the first error term can be bounded by the guarantee on which can be obtained by similar arguments in theorem for the second term note that is defined in such way that the difference between and is martingale which can be bounded by careful analysis theorem assume and suppose the sizes satisfy that for any ka ai λk and are of order ln δt there exist step sizes ηi such that the following holds if λk for all then for any and any function in the span of with unit norm kvkf we have that with probability at least there exists in the span of ht satisfying ln δt the error scales as with the step besides the condition that ka ai is up to the order of the eigengap we additionally need that the random features approximate the kernel function up to constant accuracy on all the data points up to time which eventually leads to ln δt sizes finally we need to be roughly isotropic is roughly orthonormal intuitively this should be true for the following reasons is orthonormal the update for is close to that for gt which in turn is close to ft that are orthonormal potential function convergence number of data points estimated groundtruth figure convergence of our algorithm on the synthetic dataset it is on par with the rate denoted by the dashed red line recovery of the top three eigenfunctions our algorithm in red matches the dashed blue figure visualization of the molecular space dataset by the first two principal components bluer dots represent lower pce values while redder dots are for higher pce values kernel pca linear pca best viewed in color proof sketch in order to bound term in we show that lemma ln δt this is proved by following similar arguments to get the recurrence except with an additional error term which is caused by the fact that the update rule for is using the evaluation ht xt rather than xt bounding this additional term thus relies on bounding the difference between ht which is also what we need for bounding term ii in for this we show lemma for any and unit vector with probability over ht ln the key to prove this lemma is that our construction of makes sure that the difference between and ht consists of their difference in each time step furthermore the difference forms martingale and thus can be bounded by azuma inequality see the supplementary for the details experiments synthetic dataset with 
analytical solution we first verify the convergence rate of on synthetic dataset with analytical solution of eigenfunctions if the data follow gaussian distribution and we use gaussian kernel then the eigenfunctions are given by the hermite polynomials we generated million such data points and ran with total of random features in each iteration we use data of size and random feature minibatch of size after all random features are generated we revisit and adjust the coefficients of existing random features the kernel bandwidth is set as the true bandwidth the step size is scheduled as ηt where and are two parameters we use small such that in early stages the step size is large enough to arrive at good initial solution figure shows the convergence rate of the proposed algorithm seeking top subspace the potential function is the squared sine of the principle angle we can see the algorithm indeed converges at the rate figure show the recovered top eigenfunctions compared with the the solution coincides with one eigenfunction and deviates only slightly from two others kernel pca visualization on molecular space dataset molecularspace dataset contains million molecular motifs we are interested in visualizing the dataset with kpca the data are represented by sorted coulomb matrices of size each molecule also has an attribute called power conversion efficiency pce we use gaussian kernel with bandwidth chosen by the median trick we ran kernel pca with total of random features with feature size of and data size of we ran iterations with step size ηt figure presents visualization by projecting the data onto the top two principle components compared with linear pca kpca shrinks the distances between the clusters and brings out the important structures in the dataset we can also see higher pce values tend to lie towards the center of the ring structure nonparametric latent variable model we learn latent variable models with using one million data points achieving higher quality solutions compared with two other approaches the dataset consists of two latent components one is gaussian distribution and the other gamma distribution with shape parameter uses total of random features and uses feature of size and data of size we compare with random estimated estimated true true estimated estimated true true estimated estimated true true mean squared error random fourier features nystrom features dsgd random feature dimension figure recovered latent components figure comparison on random features nystrom features kuka dataset fourier features and random nystrom features both of fixed functions figures shows the learned conditional distributions for each component we can see achieves almost perfect recovery while fourier and nystrom random feature methods either confuse high density areas or incorrectly estimate the spread of conditional distributions kcca we compare our algorithm on in the kcca task which consists of million digits and their transformations we divide each image into the left and right halves and learn their correlations the evaluation criteria is the total correlations on the top canonical correlation directions measured on separate test set of size we compare with random fourier and random nystrom features on both total correlation and running time our algorithm uses total of features with feature of size and data of size and with iterations the kernel bandwidth is set using the median trick and is the same for all methods all algorithms are run times and the mean is reported the results are 
presented in table our algorithm achieves the best correlations in comparable run time with random fourier features this is especially significant for random fourier features since the run time would increase by almost four times if double the number of features were used in addition nystrom features generally achieve better results than fourier features since they are data dependent we can also see that for large datasets it is important to use more random features for better performance table kcca results on mnist top largest correlations of feat random features corrs minutes nystrom features corrs minutes corrs minutes linear cca corrs minutes kernel sliced inverse regression on kuka dataset we evaluate our algorithm under the setting of kernel sliced inverse regression way to perform sufficient dimension reduction sdr for high dimension regression after performing sdr we fit linear regression model using the projected input data and evaluate mean squared error mse the dataset records rhythmic motions of kuka arm at various speeds representing realistic settings for robots we use variant that contains million data points generated by the sl simulator the kuka robot has joints and the high dimension regression problem is to predict the torques from positions velocities and accelerations of the joints the input has dimensions while the output is dimensions since there are seven independent joints we set the reduced dimension to be seven we randomly select as test set and out of the remaining training set we randomly choose as validation set to select step sizes the total number of random features is with batch and batch both equal to we run total of iterations using step size ηt figure shows the regression errors for different methods the error decreases with more random features and our algorithm achieves lowest mse by using random features nystrom features do not perform as well in this setting probably because the spectrum decreases slowly there are seven independent joints as nystrom features are known to work well for fast decreasing spectrum acknowledge the research was supported in part by bigdata onr nsf nsf career nsf simons investigator award and simons collaboration grant references anandkumar foster hsu kakade and liu two svds suffice spectral decompositions for probabilistic topic modeling and latent dirichlet allocation corr arora cotter and srebro stochastic optimization of pca with capped msg in advances in neural information processing systems pages balsubramani dasgupta and freund the fast convergence of incremental pca in advances in neural information processing systems pages cai and zhou optimal rates of convergence for sparse covariance matrix estimation the annals of statistics chin and suter incremental kernel principal component analysis ieee transactions on image processing dai xie he liang raj balcan and song scalable kernel methods via doubly stochastic gradients in advances in neural information processing systems pages hardt and price the noisy power method meta algorithm with applications in advances in neural information processing systems pages honeine online kernel principal component analysis model ieee trans pattern anal mach kim franz and iterative kernel principal component analysis for image modeling ieee transactions on pattern analysis and machine intelligence kim and pavlovic covariance operator based dimensionality reduction with extension to semisupervised settings in international conference on artificial intelligence and statistics pages le 
sarlos and smola fastfood computing hilbert space expansions in loglinear time in international conference on machine learning sra smola ghahramani and randomized nonlinear component analysis in international conference on machine learning icml meier hennig and schaal incremental local gaussian regression in ghahramani welling cortes lawrence and weinberger editors advances in neural information processing systems pages curran associates montavon hansen fazli rupp biegler ziehe tkatchenko von lilienfeld and learning invariant representations of molecules for atomization energy prediction in neural information processing systems pages oja simplified neuron model as principal component analyzer math biology oja subspace methods of pattern recognition john wiley and sons new york rahimi and recht random features for kernel machines in platt koller singer and roweis editors advances in neural information processing systems mit press cambridge ma rahimi and recht weighted sums of random kitchen sinks replacing minimization with randomization in learning in neural information processing systems sanger optimal unsupervised learning in linear feedforward network neural networks schraudolph and vishwanathan fast iterative kernel pca in platt and hofmann editors advances in neural information processing systems cambridge ma june mit press shamir stochastic pca algorithm with an exponential convergence rate arxiv preprint song anamdakumar dai and xie nonparametric estimation of latent variable models in international conference on machine learning icml vershynin how close is the sample covariance matrix to the actual covariance matrix journal of theoretical probability williams and seeger the effect of the input density distribution on classifiers in langley editor proc intl conf machine learning pages san francisco california morgan kaufmann publishers 
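To make the doubly stochastic update concrete, here is a minimal sketch of the kernel PCA variant described above for a Gaussian RBF kernel: each iteration draws a mini-batch of data and a fresh block of random Fourier features regenerated from its seed, and only coefficient blocks and seeds are stored. The toy data stream, batch sizes, step-size schedule, and sanity check are our own illustrative choices, not the paper's exact algorithm or constants.

```python
import numpy as np

# Sketch of doubly stochastic kernel PCA with random Fourier features (Gaussian RBF).
# Each iteration uses (i) a random mini-batch of data and (ii) a fresh block of random
# features regenerated from its seed, so only coefficient blocks and seeds are stored.

rng = np.random.default_rng(0)
d, k, sigma = 5, 2, 1.0                 # input dim, subspace dim, kernel bandwidth
n_iter, batch, n_feat = 400, 64, 32     # iterations, data batch size, features per block

def feature_block(X, seed):
    """Random Fourier features phi(x) = sqrt(2/m) cos(W x + b), regenerated from a seed."""
    fr = np.random.default_rng(seed)
    W = fr.normal(0.0, 1.0 / sigma, (n_feat, d))
    b = fr.uniform(0.0, 2.0 * np.pi, n_feat)
    return np.sqrt(2.0 / n_feat) * np.cos(X @ W.T + b)      # shape (len(X), n_feat)

def sample_batch():
    """Toy data stream: a noisy nonlinear curve standing in for streaming samples."""
    t = rng.uniform(-2.0, 2.0, batch)
    X = np.stack([t, np.sin(3.0 * t), t ** 2, np.cos(t), 0.1 * t], axis=1)
    return X + 0.05 * rng.normal(size=(batch, d))

# one (n_feat, k) coefficient block per seed; a small random block serves as initialization
alphas = [(10**6, 0.1 * rng.normal(size=(n_feat, k)))]

def evaluate(X):
    """h(X): evaluate the k learned functions by regenerating every stored feature block."""
    out = np.zeros((X.shape[0], k))
    for seed, a in alphas:
        out += feature_block(X, seed) @ a
    return out

for t in range(n_iter):
    eta = 0.5 / (1.0 + 0.05 * t)
    X = sample_batch()
    H = evaluate(X)                                  # (batch, k), uses old blocks only
    Phi = feature_block(X, seed=t)                   # fresh random features, seed = iteration
    # Oja-style subspace update without explicit orthogonalization:
    #   h <- h + eta * (A_t h - h (h^T A_t h)),  A_t = empirical covariance on the batch
    new_block = (eta / batch) * (Phi.T @ H)          # coefficients on the new feature block
    shrink = (eta / batch) * (H.T @ H)               # estimate of h^T A_t h on the batch
    alphas = [(s, a - a @ shrink) for s, a in alphas]
    alphas.append((t, new_block))

# sanity check: variance captured by the learned functions vs. top kernel eigenvalues
Xs = np.concatenate([sample_batch() for _ in range(8)])
G = evaluate(Xs).T @ evaluate(Xs) / Xs.shape[0]
sq = (Xs ** 2).sum(1)[:, None] + (Xs ** 2).sum(1)[None, :] - 2.0 * Xs @ Xs.T
K = np.exp(-sq / (2.0 * sigma ** 2))
top = np.linalg.eigvalsh(K / Xs.shape[0])[-k:]
print("captured:", np.round(np.linalg.eigvalsh(G), 3),
      "  top kernel eigenvalues:", np.round(top, 3))
```

Only the seeds and the small coefficient blocks are kept; the training points themselves are never stored, which is the memory saving emphasized above.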
Generalization in Adaptive Data Analysis and Holdout Reuse

Cynthia Dwork, Microsoft Research; Toniann Pitassi, University of Toronto; Vitaly Feldman, IBM Almaden Research; Omer Reingold, Samsung Research America; Moritz Hardt, Google Research; Aaron Roth, University of Pennsylvania

Abstract

Overfitting is the bane of data analysts, even when data are plentiful. Formal approaches to understanding this problem focus on statistical inference and the generalization of individual analysis procedures. Yet the practice of data analysis is an inherently interactive and adaptive process: new analyses and hypotheses are proposed after seeing the results of previous ones, parameters are tuned on the basis of obtained results, and datasets are shared and reused. An investigation of this gap has recently been initiated by the authors in earlier work, where we focused on the problem of estimating expectations of adaptively chosen functions. In this paper, we give a simple and practical method for reusing a holdout (or testing) set to validate the accuracy of hypotheses produced by a learning algorithm operating on a training set. Reusing a holdout set adaptively multiple times can easily lead to overfitting to the holdout set itself. We give an algorithm that enables the validation of a large number of adaptively chosen hypotheses while provably avoiding overfitting. We illustrate the advantages of our algorithm over the standard use of the holdout set via a simple synthetic experiment. We also formalize and address the general problem of data reuse in adaptive data analysis. We show how the differential-privacy-based approach given in that work is applicable much more broadly to adaptive data analysis. We then show that a simple approach based on description length can also be used to give guarantees of statistical validity in adaptive settings. Finally, we demonstrate that these incomparable approaches can be unified via the notion of approximate max-information that we introduce. This, in particular, allows the preservation of statistical-validity guarantees even when an analyst adaptively composes algorithms which have guarantees based on either of the two approaches.

Introduction

The goal of machine learning is to produce hypotheses or models that generalize well to unseen instances of the problem. More generally, statistical data analysis is concerned with estimating properties of the underlying data distribution, rather than properties that are specific to the finite dataset at hand. Indeed, a large body of theoretical and empirical research has been developed for ensuring generalization in a variety of settings. In this work, it is commonly assumed that each analysis procedure (such as a learning algorithm) operates on a freshly sampled dataset or, if not, is validated on a freshly sampled holdout (or testing) set. (See the full version of this work for additional details; part of this work was done while visiting the Simons Institute, UC Berkeley.)

Unfortunately, learning and inference can be more difficult in practice, where data samples are often reused. For example, a common practice is to perform feature selection on a dataset and then use the features for some supervised learning task; when these two steps are performed on the same dataset, it is no longer clear that the results obtained from the combined algorithm will generalize. Although not usually understood in these terms, Freedman's paradox is an elegant demonstration of the powerful negative effect of adaptive analysis on the same data. In Freedman's simulation, variables with a significant t-statistic are selected and linear regression is performed on this adaptively chosen subset of variables, with famously misleading results: when
the relationship between the dependent and explanatory variables is the procedure overfits erroneously declaring significant relationships most of machine learning practice does not rely on formal guarantees of generalization for learning algorithms instead dataset is split randomly into two or sometimes more parts the training set and the testing or holdout set the training set is used for learning predictor and then the holdout set is used to estimate the accuracy of the predictor on the true distribution additional averaging over different partitions is used in because the predictor is independent of the holdout dataset such an estimate is valid estimate of the true prediction accuracy formally this allows one to construct confidence interval for the prediction accuracy on the data distribution however in practice the holdout dataset is rarely used only once and as result the predictor may not be independent of the holdout set resulting in overfitting to the holdout set one reason for such dependence is that the holdout data is used to test large number of predictors and only the best one is reported if the set of all tested hypotheses is known and independent of the holdout set then it is easy to account for such multiple testing however such static approaches do not apply if the estimates or hypotheses tested on the holdout are chosen adaptively that is if the choice of hypotheses depends on previous analyses performed on the dataset one prominent example in which holdout set is often adaptively reused is hyperparameter tuning similarly the holdout set in machine learning competition such as the famous imagenet competition is typically reused many times adaptively other examples include using the holdout set for feature selection generation of base learners in aggregation techniques such as boosting and bagging checking stopping condition and decisions see for discussion of several subtle causes of overfitting the concrete practical problem we address is how to ensure that the holdout set can be reused to perform validation in the adaptive setting towards addressing this problem we also ask the more general question of how one can ensure that the final output of adaptive data analysis generalizes to the underlying data distribution this line of research was recently initiated by the authors in where we focused on the case of estimating expectations of functions from samples these are also referred to as statistical queries our results we propose simple and general formulation of the problem of preserving statistical validity in adaptive data analysis we show that the connection between differentially private algorithms and generalization from can be extended to this more general setting and show that similar but sometimes incomparable guarantees can be obtained from algorithms whose outputs can be described by short strings we then define new notion approximate that unifies these two basic techniques and gives new perspective on the problem in particular we give an adaptive composition theorem for which gives simple way to obtain generalization guarantees for analyses in which some of the procedures are differentially private and some have short description length outputs we apply our techniques to the problem of reusing the holdout set for validation in the adaptive setting reusable holdout we describe simple and general method together with two specific instantiations for reusing holdout set for validating results while provably avoiding overfitting to the holdout set the analyst 
can perform any analysis on the training dataset but can only access the holdout set via an algorithm that allows the analyst to validate her hypotheses against the holdout set crucially our algorithm prevents overfitting to the holdout set even when the analyst hypotheses are chosen adaptively on the basis of the previous responses of our algorithm our first algorithm referred to as thresholdout derives its guarantees from differential privacy and the results in for any function given by the analyst thresholdout uses the holdout set to validate that does not overfit to the training set that is it checks that the mean value of evaluated on the training set is close to the mean value of evaluated on the distribution from which the data was sampled the standard approach to such validation would be to compute the mean value of on the holdout set the use of the holdout set in thresholdout differs from the standard use in that it exposes very little information about the mean of on the holdout set if does not overfit to the training set then the analyst receives only the confirmation of closeness that is just single bit on the other hand if overfits then thresholdout returns the mean value of on the training set perturbed by carefully calibrated noise using results from we show that for datasets consisting of samples these modifications provably prevent the analyst from constructing functions that overfit to the holdout set this ensures correctness of thresholdout responses naturally the specific guarantees depend on the number of samples in the holdout set the number of queries that thresholdout can answer is exponential in as long as the number of times that the analyst overfits is at most quadratic in our second algorithm sparsevalidate is based on the idea that if most of the time the procedures generate results that do not overfit then validating them against the holdout set does not reveal much information about the holdout set specifically the generalization guarantees of this method follow from the observation that the transcript of the interaction between data analyst and the holdout set can be described concisely more formally this method allows the analyst to pick any boolean function of dataset described by an algorithm and receive back its value on the holdout set simple example of such function would be whether the accuracy of predictor on the holdout set is at least certain value unlike in the case of thresholdout here there is no need to assume that the function that measures the accuracy has bounded range or even lipschitz making it qualitatively different from the kinds of results achievable subject to differential privacy more involved example of validation would be to run an algorithm on the holdout dataset to select an hypothesis and check if the hypothesis is similar to that obtained on the training set for any desired notion of similarity such validation can be applied to other results of analysis for example one could check if the variables selected on the holdout set have large overlap with those selected on the training set an instantiation of the sparsevalidate algorithm has already been applied to the problem of answering statistical and more general queries in the adaptive setting we describe simple experiment on synthetic data that illustrates the danger of reusing standard holdout set and how this issue can be resolved by our reusable holdout the design of this experiment is inspired by freedman classical experiment which demonstrated the dangers of performing 
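The sparsevalidate idea outlined above fits in a few lines of code. The sketch below is an illustrative implementation, not the paper's: the parameter names (max_queries for the total query budget and max_positives for the budget of answers that signal overfitting) are assumptions, and the analyst-supplied query is any boolean function of a dataset.

```python
class SparseValidate:
    """Illustrative sketch of the SparseValidate mechanism: answer boolean
    validation queries on the holdout exactly, but stop once either the total
    query budget or the budget of positive answers is exhausted, so that the
    interaction transcript stays short."""

    def __init__(self, holdout, max_queries, max_positives):
        self.holdout = holdout              # the holdout dataset S_h
        self.max_queries = max_queries      # budget on the total number of queries
        self.max_positives = max_positives  # budget on queries that return True

    def validate(self, psi):
        """psi is any boolean function of a dataset, e.g. 'is the holdout accuracy
        of my current predictor more than 5% below its training accuracy?'"""
        if self.max_queries <= 0 or self.max_positives <= 0:
            return None                     # budgets exhausted: no more answers
        self.max_queries -= 1
        answer = bool(psi(self.holdout))
        if answer:
            self.max_positives -= 1         # positive answers are the scarce resource
        return answer
```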
variable selection and regression on the same data generalization in adaptive data analysis we view adaptive analysis on the same dataset as an execution of sequence of steps am each step is described by an algorithm ai that takes as input fixed dataset xn drawn from some distribution over which remains unchanged over the course of the analysis each algorithm ai also takes as input the outputs of the previously run algorithms through and produces value in some range yi the dependence on previous outputs represents all the adaptive choices that are made at step of data analysis for example depending on the previous outputs ai can run different types of analysis on we note that at this level of generality the algorithms can represent the choices of the data analyst and need not be explicitly specified we assume that the analyst uses algorithms which individually are known to generalize when executed on fresh dataset sampled independently from distribution we formalize this by assuming that for every fixed value with probability at least βi over the choice of according to distribution the output of ai on inputs and has desired property relative to the data distribution for example has low generalization error note that in this assumption are fixed and independent of the choice of whereas the analyst will execute ai on values where yj aj in other words in the adaptive setup the algorithm ai can depend on the previous outputs which depend on and thus the set given to ai is no longer an independently sampled dataset such dependence invalidates the generalization guarantees of individual procedures potentially leading to overfitting differential privacy first we spell out how the differential privacy based approach from can be applied to this more general setting specifically simple corollary of results in is that for dataset consisting of samples any output of algorithm can be used in subsequent analysis while controlling the risk of overfitting even beyond the setting of statistical queries studied in key property of differential privacy in this context is that it composes adaptively namely if each of the algorithms used by the analyst is differentially private then the whole procedure will be differentially private albeit with worse privacy parameters therefore one way to avoid overfitting in the adaptive setting is to use algorithms that satisfy sufficiently strong guarantees of description length we then show how description length bounds can be applied in the context of guaranteeing generalization in the presence of adaptivity if the total length of the outputs of algorithms can be described with bits then there are at most possible values of the input to ai for each of these individual inputs ai generalizes with probability βi taking union bound over failure probabilities implies generalization with probability at least βi occam razor famously implies that shorter hypotheses have lower generalization error our observation is that shorter hypotheses and the results of analysis more generally are also better in the adaptive setting since they reveal less about the dataset and lead to better generalization of subsequent analyses note that this result makes no assumptions about the data distribution in the full versionwe also show that description analysis suffices for obtaining an algorithm albeit not an efficient one that can answer an exponentially large number of adaptively chosen statistical queries this provides an alternative proof for one of the results in approximate our main 
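To make the description-length argument concrete, the helper below (my own illustrative packaging, not a routine from the paper) combines the union bound over the at most 2^k reachable inputs with a standard Hoeffding bound for functions taking values in [0, 1].

```python
import math

def samples_needed(k_bits, tau, beta):
    """Holdout size n that keeps every empirical mean reachable after k_bits of
    adaptively released output within tau of its true mean, except with
    probability beta, for [0,1]-valued functions."""
    # Per fixed function, Hoeffding gives Pr[|empirical - true| > tau] <= 2*exp(-2*n*tau**2).
    # Union-bounding over the at most 2**k_bits possible inputs to the next algorithm:
    #   2**k_bits * 2 * exp(-2 * n * tau**2) <= beta.
    return math.ceil((k_bits * math.log(2) + math.log(2.0 / beta)) / (2.0 * tau ** 2))

# Example: 100 bits of adaptively released output, tau = 0.05, beta = 0.01
# requires roughly 15,000 holdout samples.
print(samples_needed(100, 0.05, 0.01))
```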
technical contribution is the introduction and analysis of new measure which unifies the generalization arguments that come from both differential privacy and description length and that quantifies how much information has been learned about the data by the analyst formally for jointly distributed random variables the is the maximum of the logarithm of the factor by which uncertainty about is reduced given the value of namely log max where the maximum is taken over all in the support of and in the support approximate is relaxation of in our use denotes dataset drawn randomly from the distribution and denotes the output of possibly randomized algorithm on we prove that approximate has the following properties an upper bound on approximate gives generalization guarantees differentially private algorithms have low for any distribution over datasets stronger bound holds for approximate on datasets these bounds apply only to pure differential privacy the case bounds on the description length of the output of an algorithm give bounds on the approximate of the algorithm for any approximate composes adaptively composition properties of approximate imply that one can easily obtain generalization guarantees for adaptive sequences of algorithms some of which are differentially private and others of which have outputs with short description length these properties also imply that differential privacy can be used to control generalization for any distribution over datasets which extends its generalization guarantees beyond the restriction to datasets drawn from fixed distribution as in we remark that pure differential privacy and description length are otherwise incomparable bounds on or differential privacy of an algorithm can however be translated to bounds on randomized description length for different algorithm with statistically indistinguishable output here we say that randomized algorithm has randomized description length of if for every fixing of the algorithm random bits it has description length of details of these results and additional discussion appear in section and the full version related work this work complements where we initiated the formal study of adaptivity in data analysis the primary focus of is the problem of answering adaptively chosen statistical queries the main technique is strong connection between differential privacy and generalization differential privacy guarantees that the distribution of outputs does not depend too much on any one of the data samples and thus differential privacy gives strong stability guarantee that behaves well under adaptive data analysis the link between generalization and approximate differential privacy made in has been subsequently strengthened both qualitatively by who make the connection for broader range of queries and quantitatively by and who give tighter quantitative bounds these papers among other results give methods for accurately answering exponentially in the dataset size many adaptively chosen queries but the algorithms for this task are not efficient it turns out this is for fundamental reasons hardt and ullman and steinke and ullman prove that under cryptographic assumptions no efficient algorithm can answer more than quadratically many statistical queries chosen adaptively by an adversary who knows the true data distribution the classical approach in theoretical machine learning to ensure that empirical estimates generalize to the underlying distribution is based on the various notions of complexity of the set of functions 
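For small finite joint distributions the quantity just described can be computed exactly; the helper below is an illustrative aid (not from the paper) that measures it in bits, matching the convention used later that log denotes the binary logarithm.

```python
import numpy as np

def max_information_bits(joint):
    """Max-information I_inf(X;Y) in bits for a finite joint pmf joint[x, y]:
    the log2 of the largest ratio P[X=x | Y=y] / P[X=x] over the support."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y
    support = joint > 0
    ratio = joint[support] / (px * py)[support]   # P[x,y] / (P[x] P[y]) = P[x|y] / P[x]
    return float(np.log2(ratio.max()))

# Independent variables carry no max-information; an exact copy of a fair bit carries one bit.
print(max_information_bits([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
print(max_information_bits([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
```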
output by the algorithm most notably the vc dimension if one has sample of data large enough to guarantee generalization for all functions in some class of bounded complexity then it does not matter whether the data analyst chooses functions in this class adaptively or our goal in contrast is to prove generalization bounds without making any assumptions about the class from which the analyst can output functions an important line of work establishes connections between the stability of learning algorithm and its ability to generalize stability is measure of how much the output of learning algorithm is perturbed by changes to its input it is known that certain stability notions are necessary and sufficient for generalization unfortunately the stability notions considered in these prior works do not compose in the sense that running multiple stable algorithms sequentially and adaptively may result in procedure that is not stable the measure we introduce in this work max information like differential privacy has the strength that it enjoys adaptive composition guarantees this makes it amenable to reasoning about the generalization properties of adaptively applied sequences of algorithms while having to analyze only the individual components of these algorithms connections between stability empirical risk minimization and differential privacy in the context of learnability have been recently explored in numerous techniques have been developed by statisticians to address common special cases of adaptive data analysis most of them address single round of adaptivity such as variable selection followed by regression on selected variables or model selection followed by testing and are optimized for specific inference procedures the literature is too vast to adequately cover here see ch in for textbook introduction and for survey of some recent work in contrast our framework addresses multiple stages of adaptive decisions possible lack of predetermined analysis protocol and is not restricted to any specific procedures finally inspired by our work blum and hardt showed how to reuse the holdout set to maintain an accurate leaderboard in machine learning competition that allows the participants to submit adaptively chosen models in the process of the competition such as those organized by kaggle their analysis also relies on the description technique we used to analyze sparsevalidate preliminaries in the discussion below log refers to binary logarithm and ln refers to the natural logarithm for two random variables and over the same domain the of from is defined as xky log is defined as xky log max definition randomized algorithm with domain for is private if for all pairs of datasets that differ in single element ka log the case when is sometimes referred to as pure differential privacy and in this case we may say simply that is private consider two algorithms and that are composed adaptively and assume that for every fixed input generalizes for all but fraction of datasets here we are speaking of generalization informally our definitions will support any property of input and dataset intuitively to preserve generalization of we want to make sure that the output of does not reveal too much information about the dataset we demonstrate that this intuition can be captured via notion of and its relaxation approximate for two random variables and we use to denote the random variable obtained by drawing and independently from their probability distributions definition let and be jointly distributed random 
variables the between and is defined as kx the is defined as kx in our use is going to be joint distribution where is random dataset and is possibly randomized algorithm taking dataset as an input definition we say that an algorithm has of if for every distribution over datasets where is dataset chosen randomly according to we denote this by an immediate corollary of our definition of approximate is that it controls the probability of bad events that can happen as result of the dependence of on theorem let be random dataset in and be an algorithm with range such that for some then for any event in particular we remark that mutual information between and would not suffice for ensuring that bad events happen with tiny probability for example mutual information of allows to be as high as log where approximate satisfies the following adaptive composition property and let lemma let be an algorithm such that be an algorithm such that for every has let be defined such that then bounds on description length gives the following bound on theorem let be randomized algorithm taking as an input an dataset and outputting value in finite set then for every log next we prove simple bound on of differentially private algorithms that applies to all distributions over datasets theorem let be an private algorithm then log finally we prove stronger bound on approximate for datasets consisting of samples using the technique from theorem let be an private algorithm with range for distribution over let be random variable drawn from let denote the random variable output by on input then for any log ln one way to apply bound on is to start with concentration of measure result which ensures that the estimate of predictor accuracy is correct with high probability when the predictor is chosen independently of the samples for example for loss function with range hoeffding bound implies that for dataset consisting of samples the empirical estimate is not within of the true accuracy with probability now given bound of log on information of the algorithm that produces the estimator thm implies that the produced estimate is not within of the true accuracy with probability thm implies that any private algorithm has of at most log for dataset consisting of samples thm implies that private algorithm has of log for reusable holdout we describe two simple algorithms that enable validation of analyst queries in the adaptive setting thresholdout our first algorithm thresholdout follows the approach in where differentially private algorithms are used to answer adaptively chosen statistical queries this approach can also be applied to any functions of the dataset but for simplicity we present the results for statistical queries here we address an easier problem in which the analyst queries only need to be answered when they overfit also unlike in the analyst has full access to the training set and the holdout algorithm only prevents overfitting to holdout dataset as result unlike in the general query answering setting our algorithm can efficiently validate an exponential in number of queries as long as relatively small number of them overfit pn for function and dataset xn let es xi thresholdout is given access to the training dataset st and holdout dataset sh and budget limit it allows any query of the form and its goal is to provide an estimate of to achieve this the algorithm gives an estimate of esh in way that prevents overfitting of functions generated by the analyst to the holdout set in other words responses of thresholdout 
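The symbols in the preliminaries above were lost in extraction; the following is a reconstruction of the standard forms these definitions take, with all logarithms in base 2 as stated above (the constants in the paper's quantitative theorems are not reproduced here). For random variables $Y, Z$ over the same domain,
\[
D(Y \,\|\, Z) \;=\; \mathop{\mathbb{E}}_{y \sim Y}\left[\log \frac{\Pr[Y=y]}{\Pr[Z=y]}\right],
\qquad
D_{\infty}(Y \,\|\, Z) \;=\; \max_{S \subseteq \mathrm{supp}(Y)} \log \frac{\Pr[Y \in S]}{\Pr[Z \in S]},
\]
\[
D^{\beta}_{\infty}(Y \,\|\, Z) \;=\; \max_{S :\, \Pr[Y \in S] > \beta} \log \frac{\Pr[Y \in S] - \beta}{\Pr[Z \in S]}.
\]
A randomized algorithm $M$ is $(\varepsilon,\delta)$-differentially private if for all datasets $x, x'$ differing in a single element and every event $S$, $\Pr[M(x) \in S] \le e^{\varepsilon} \Pr[M(x') \in S] + \delta$. For jointly distributed $(X, Y)$, the max-information and its $\beta$-approximate relaxation are
\[
I_{\infty}(X; Y) \;=\; D_{\infty}\big(p_{(X,Y)} \,\|\, p_X \otimes p_Y\big)
\;=\; \log \max_{x, y} \frac{\Pr[X = x \mid Y = y]}{\Pr[X = x]},
\qquad
I^{\beta}_{\infty}(X; Y) \;=\; D^{\beta}_{\infty}\big(p_{(X,Y)} \,\|\, p_X \otimes p_Y\big),
\]
and an algorithm $A$ has $\beta$-approximate max-information of $k$ if $I^{\beta}_{\infty}(S; A(S)) \le k$ for every distribution from which the dataset $S$ may be drawn. The bad-event bound then reads: for any event $O$,
\[
\Pr[(S, A(S)) \in O] \;\le\; 2^{k}\, \Pr[S \otimes A(S) \in O] \;+\; \beta .
\]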
are designed to ensure that with high probability esh is close to and hence an estimate of esh gives an estimate of the true expectation given function thresholdout first checks if the difference between the average value of on the training set st or est and the average value of on the holdout set sh or esh is below certain threshold here is fixed number such as and is laplace noise variable whose standard deviation needs to be chosen depending on the desired guarantees the laplace distribution is symmetric exponential distribution if the difference is below the threshold then the algorithm returns est if the difference is above the threshold then the algorithm returns esh for another laplacian noise variable each time the difference is above threshold the overfitting budget is reduced by one once it is exhausted thresholdout stops answering queries we provide the pseudocode of thresholdout below input training set st holdout set sh threshold noise rate budget sample lap for each query do if output else sample lap ii if est sample lap lap and output esh iii else output est we now establish the formal generalization guarantees that thresholdout enjoys theorem let and we set and ln let denote holdout dataset of size drawn from distribution and st be any additional dataset over consider an algorithm that is given access to st and adaptively chooses functions φm while interacting with thresholdout which is given datasets st and values for every let ai denote the answer of thresholdout on function φi further for every we define the counter of overfitting zi φj est φj then zi φi whenever ln min ln ln sparsevalidate we now present general algorithm for validation on the holdout set that can validate many arbitrary queries as long as few of them fail the validation more formally our algorithm allows the analyst to pick any boolean function of dataset or even any algorithm that outputs single bit and provides back the value of on the holdout set sh sparsevalidate has budget for the total number of queries that can be asked and budget for the number of queries that returned once either of the budgets is exhausted no additional answers are given we now give general description of the guarantees of sparsevalidate theorem let denote randomly chosen holdout set of size let be an algorithm that is given access to sparsevalidate and outputs queries ψm such that each ψi is in some set ψi of functions from to assume that for every and ψi ψi ψi βi let ψi be the random variable equal to the th query of on then pmin ψi βi where in this general formulation it is the analyst responsibility to use the budgets economically and pick query functions that do not fail validation often at the same time sparsevalidate ensures that for the appropriate values of the parameters the analyst can think of the holdout set as fresh sample for the purposes of validation hence the analyst can pick queries in such way that failing the validation reliably indicates overfitting an example of the application of sparsevalidate for answering statistical and queries that is based on our analysis can be found in the analysis of generalization on the holdout set in and the analysis of the median mechanism we give in the full version also rely on this technique experiments in our experiment the analyst is given labeled data set of size and splits it randomly into training set st and holdout set sh of equal size we denote an element of by tuple where is vector and is the corresponding class label the analyst wishes to select variables to be 
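The pseudocode above can be made executable as follows. This is a sketch rather than the authors' reference implementation: the threshold value and the 1σ/2σ/4σ Laplace noise scales were stripped from the text, so the constants below are assumptions.

```python
import numpy as np

class Thresholdout:
    """Sketch of the Thresholdout mechanism: report the training-set mean of a
    query unless it drifts away from the holdout mean, and answer with a noisy
    holdout mean (spending budget) only in that case."""

    def __init__(self, train, holdout, threshold=0.04, sigma=0.01, budget=1000):
        self.train = list(train)        # training set S_t
        self.holdout = list(holdout)    # holdout set S_h
        self.T = threshold              # closeness threshold (assumed value)
        self.sigma = sigma              # noise rate (assumed value)
        self.budget = budget            # number of tolerated overfitting events
        self.gamma = np.random.laplace(scale=2 * sigma)

    def query(self, phi):
        """phi maps a single sample to a value in [0, 1]; returns an estimate of
        its mean, or None once the overfitting budget is exhausted."""
        if self.budget < 1:
            return None
        train_mean = np.mean([phi(x) for x in self.train])
        holdout_mean = np.mean([phi(x) for x in self.holdout])
        eta = np.random.laplace(scale=4 * self.sigma)
        if abs(holdout_mean - train_mean) > self.T + self.gamma + eta:
            # Possible overfitting: spend budget, refresh gamma, and answer with
            # a noisy holdout estimate.
            self.budget -= 1
            self.gamma = np.random.laplace(scale=2 * self.sigma)
            return holdout_mean + np.random.laplace(scale=self.sigma)
        # Otherwise the training-set estimate is already close enough.
        return train_mean
```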
included in her classifier for various values of the number of variables to select she picks variables with the largest absolute correlations with the label however she verifies the correlations with the label on the holdout set and uses only those variables whose correlation agrees in sign with the correlation on the training set and both correlations are larger than some threshold in absolute value she then creates simple linear threshold classifier on the selected variables using only the signs of the correlations of the selected variables final test evaluates the classification accuracy of the classifier on both the training set and the holdout set in our first experiment each attribute of is drawn independently from the normal distribution and we choose the class label uniformly at random so that there is no correlation between the data point and its label we chose and varied the number of selected variables in this scenario no classifier can achieve true accuracy better than nevertheless reusing standard holdout results in reported accuracy of over for on both the training set and the holdout set the standard deviation of the error is less than the average and standard deviation of results obtained from independent executions of the experiment are plotted above for comparison the plot also includes the accuracy of the classifier on another fresh data set of size drawn from the same distribution we then executed the same algorithm with our reusable holdout thresholdout was invoked with and explaining why the accuracy of the classifier reported by thresholdout is off by up to whenever the accuracy on the holdout set is within of the accuracy on the training set we also used gaussian noise instead of laplacian noise as it has stronger concentration properties thresholdout prevents the algorithm from overfitting to the holdout set and gives valid estimate of classifier accuracy additional experiments and discussion are presented in the full version references raef bassily adam smith thomas steinke and jonathan ullman more general queries and less generalization error in adaptive data analysis corr avrim blum and moritz hardt the ladder reliable leaderboard for machine learning competitions corr olivier bousquet and andré elisseeff stability and generalization jmlr gavin cawley and nicola talbot on in model selection and subsequent selection bias in performance evaluation journal of machine learning research chuong do foo and andrew ng efficient multiple hyperparameter learning for models in nips pages cynthia dwork vitaly feldman moritz hardt toniann pitassi omer reingold and aaron roth generalization in adaptive data analysis and holdout reuse corr extended abstract to appear in nips cynthia dwork vitaly feldman moritz hardt toniann pitassi omer reingold and aaron roth preserving statistical validity in adaptive data analysis corr extended abstract in stoc cynthia dwork krishnaram kenthapadi frank mcsherry ilya mironov and moni naor our data ourselves privacy via distributed noise generation in eurocrypt pages cynthia dwork frank mcsherry kobbi nissim and adam smith calibrating noise to sensitivity in private data analysis in theory of cryptography pages springer david freedman note on screening regression equations the american statistician moritz hardt and jonathan ullman preventing false discovery in interactive data analysis is hard in focs pages trevor hastie robert tibshirani and jerome friedman the elements of statistical learning data mining inference and prediction springer 
Series in Statistics. Springer. John Langford. Clever methods of overfitting. http. Kobbi Nissim and Uri Stemmer. On the generalization properties of differential privacy. CoRR. Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Partha Niyogi. General conditions for predictivity in learning theory. Nature. Bharat Rao and Glenn Fung. On the dangers of cross-validation: an experimental evaluation. In International Conference on Data Mining. SIAM. Juha Reunanen. Overfitting in making comparisons between variable selection methods. Journal of Machine Learning Research. Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research. Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of preventing false discovery. arXiv preprint. Jonathan Taylor and Robert Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences. Yu-Xiang Wang, Jing Lei, and Stephen Fienberg. Learning with differential privacy: stability, learnability and the sufficiency and necessity of ERM principle. CoRR.
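Before moving on, here is a short script reproducing the synthetic holdout-reuse experiment described in the experiments section above. It is a sketch under illustrative settings: the dataset sizes, number of selected variables, and correlation threshold are my choices rather than the paper's; routing the holdout accesses through the Thresholdout sketch given earlier is the fix the paper evaluates.

```python
import numpy as np

def holdout_reuse_experiment(n=2000, d=5000, k=200, thresh=0.0,
                             rng=np.random.default_rng(0)):
    """No-signal data: every variable is N(0,1) noise and the labels are random
    signs, so no classifier can have true accuracy better than 50%."""
    X = rng.standard_normal((n, d))
    y = rng.choice([-1.0, 1.0], size=n)
    X_tr, X_ho = X[: n // 2], X[n // 2:]
    y_tr, y_ho = y[: n // 2], y[n // 2:]

    corr_tr = X_tr.T @ y_tr / len(y_tr)      # per-variable correlation with the label
    corr_ho = X_ho.T @ y_ho / len(y_ho)      # the same, recomputed on the holdout
    # Take the k strongest training correlations, keeping only variables whose sign
    # agrees on the holdout and whose magnitude clears the threshold on both sets.
    candidates = np.argsort(-np.abs(corr_tr))[:k]
    keep = [j for j in candidates
            if np.sign(corr_tr[j]) == np.sign(corr_ho[j])
            and min(abs(corr_tr[j]), abs(corr_ho[j])) > thresh]

    w = np.sign(corr_tr[keep])               # classifier weights are just the signs
    acc = lambda X_, y_: float(np.mean(np.sign(X_[:, keep] @ w) == y_))
    # Because the selection peeked at the holdout, BOTH accuracies come out well above 50%.
    return acc(X_tr, y_tr), acc(X_ho, y_ho)

print(holdout_reuse_experiment())
```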
market scoring rules act as opinion pools for agents mithun chakraborty sanmay das department of computer science and engineering washington university in louis louis mo mithunchakraborty sanmay abstract market scoring rule msr popular tool for designing algorithmic prediction markets is an mechanism for the aggregation of probabilistic beliefs from myopic agents in this paper we add to growing body of research aimed at understanding the precise manner in which the price process induced by msr incorporates private information from agents who deviate from the assumption of we first establish that for myopic trading agent with utility function msr satisfying mild regularity conditions elicits the agent probability conditional on the latest market state rather than her true subjective probability hence we show that msr under these conditions effectively behaves like more traditional method of belief aggregation namely an opinion pool for agents true probabilities in particular the logarithmic market scoring rule acts as logarithmic pool for constant absolute risk aversion utility agents and as linear pool for an atypical budgetconstrained agent utility with decreasing absolute risk aversion we also point out the interpretation of market maker under these conditions as bayesian learner even when agent beliefs are static introduction how should we combine opinions or beliefs about hidden truths or uncertain future events furnished by several individuals with potentially diverse information sets into single group judgment for decision or purposes this has been fundamental question across disciplines for long time surowiecki one simple principled approach towards achieving this end is the opinion pool op which directly solicits inputs from informants in the form of probabilities or distributions and then maps this vector of inputs to single probability or distribution based on certain axioms genest and zidek however this technique abstracts away from the issue of providing proper incentives to agent to reveal her private information honestly financial markets approach the problem differently offering financial incentives for traders to supply their information about valuations and aggregating this information into informative prices prediction market is relatively novel tool that builds upon this idea offering trade in financial security whose final monetary worth is tied to the future revelation of some currently unknown ground truth hanson introduced family of algorithms for designing automated prediction markets called the market scoring rule msr of which the logarithmic market scoring rule lmsr is arguably the most widely used and msr effectively acts as cost market maker always willing to take the other side of trade with any willing buyer or seller and its quoted price after every transaction one of the most attractive properties of msr is its for myopic trader but this also means that every time msr trades with such an agent the updated market price is reset to the subjective probability of that agent the market mechanism itself does not play an active role in unifying pieces of information gleaned from the entire trading history into its current price ostrovsky and iyer et al have shown that with differentially informed bayesian and agents respectively trading repeatedly information gets aggregated in market in perfect bayesian equilibrium however if agent beliefs themselves do not converge can the price process emerging out of their interaction with msr still be viewed as an aggeragator 
of information in some sense intuitively even if an agent does not revise her belief based on her inference about her peers information from market history her conservative attitude towards risk should compel her to trade in such way as to move the market price not all the way to her private belief but to some function of her belief and the most recent price thus the evolving price should always retain some memory of all agents information sequentially injected into the market therefore the assumption of agents may not be indispensable for providing theoretical guarantees on how the market incorporates agent beliefs few attempts in this vein can be found in the literature typically embedded in broader context sethi and vaughan abernethy et al but there have been few general results see section for review in this paper we develop new unified understanding of the information aggregation characteristics of market with agents mediated by msr with no regard to how the agents beliefs are formed in fact we demonstrate an equivalence between such markets and opinion pools we do so by first proving in section that for any msr interacting with myopic traders the revised instantaneous price after every trade equals the latest trader probability conditional on the preceding market state we then show that this price update rule satisfies an axiomatic characterization of opinion pooling functions from the literature establishing the equivalence in sections and we focus on specific msr the commonly used logarithmic variety lmsr we demonstrate that market with agents having constant absolute risk aversion cara utilities is equivalent to logarithmic opinion pool and that market with agents having specific concave utility with decreasing absolute risk aversion is equivalent to linear opinion pool we also demonstrate how the agents utility function parameters acquire additional significance with respect to this pooling operation and that in these two scenarios the market maker can be interpreted as bayesian learning algorithm even if agents never update beliefs our results are reminiscent of similar findings about competitive equilibrium prices in markets with rational agents pennock beygelzimer et al millin et al etc but those models require that agents learn from prices and also abstract away from any consideration of microstructure and the dynamics of actual price formation how the agents would reach the equilibrium is left open by contrast our results do not presuppose any kind of generative model for agent signals and also do not involve an equilibrium analysis hence they can be used as tools to analyze the convergence characteristics of the market price in situations with potentially or irrational agents related work given the plethora of experimental and empirical evidence that prediction markets are at least as effective as more traditional means of belief aggregation wolfers and zitzewitz cowgill and zitzewitz there has been considerable work on understanding how such market formulates its own consensus belief from individual signals an important line of research beygelzimer et al millin et al hu and storkey storkey et al has focused on competitive equilibrium analysis of prediction markets under various trader models and found an equivalence between the market equilibrium price and the outcome of an opinion pool with the same agents seminal work in this field was done by pennock who showed that linear and logarithmic opinion pools arise as special cases of the equilibrium of his intuitive model of 
securities markets when all agents have generalized logarithmic and negative exponential utilities respectively unlike these analyses that abstract away from the microstructure ostrovsky and iyer et al show that certain market structures including market scoring rules satisfying mild conditions perform information aggregation the market belief measure converges in probability to the ground truth for repeatedly trading and learning agents with and utilities respectively our contribution while drawing inspiration from these sources differs in that we delve into the characteristics of the evolution of the price rather than the properties of prices in equilibrium and examine the manner in which the microstructure induces aggregation even if the agents are not bayesian while there has also been significant work on market properties in continuous double auctions or markets mediated by sophisticated algorithms cliff and bruten farmer et al brahma et al and references therein when the agents are zero intelligence or derivatives thereof and therefore definitely not bayesian this line of literature has not looked at market scoring rules in detail and analytical results have been rare in recent years the literature focusing on the market scoring rule or equivalently the cost functionbased market maker family has grown substantially chen and vaughan and frongillo et al have uncovered isomorphisms between this type of market structure and machine learning algorithms we on the other hand are concerned with the similarities between price evolution in markets and opinion pooling methods see garg et al our work comes close to that of sethi and vaughan who show analytically that the price sequence of cost market maker with traders is convergent under general conditions and by simulation that the limiting price of lmsr with but myopic logarithmic utility agents is approximately linear opinion pool of agent beliefs abernethy et al show that exponential utility agent with an exponential family belief distribution updates the state vector of generalization of lmsr that they propose to convex combination of the current market state vector and the natural parameter vector of the agent own belief distribution see their theorem corollary this reduces to logarithmic opinion pool logop for classical lmsr the connection was also noted by pennock and xia in their theorem but with respect to an artificial probability distribution based on an agent observed trade that the authors defined instead of considering traders belief structure or strategies we show how results of this type arise as special cases of more general equivalence that we establish in this paper model and definitions consider or principal interested in the opinions beliefs forecasts of group of agents about an extraneous random binary event expressed in the form of point probabilities πi πi is agent subjective probability pr can represent proposition such as republican will win the next presidential election or the favorite will beat the underdog by more than point spread in game of football or the next avengers movie will hit certain box office target in its opening in this section we briefly describe two approaches towards the aggregation of such private beliefs the opinion pool which disregards the problem of incentivizing truthful reports and focuses simply on unifying multiple probabilistic reports on topic and the market scoring rule an mechanism for extracting honest beliefs from agents opinion pool op this family of methods takes as input the 
vector of probabilistic reports pi submitted by agents also called experts in this context and computes an aggregate or consensus operator pb pn garg et al identified three desiderata for an opinion pool other criteria are also recognized in the literature but the following are the most basic and natural unanimity if all experts agree the aggregate also agrees with them boundedness the aggregate is bounded by the extremes of the inputs monotonicity if one expert changes her opinion in particular direction while all other experts opinions remain unaltered then the aggregate changes in the same direction definition we call pb pn valid opinion pool for probabilistic reports if it possesses properties and listed above it is easy to derive the following result for recursively defined pooling functions that will prove useful for establishing an equivalence between market scoring rules and opinion pools the proof is in section of the supplementary material lemma for scenario if and are valid opinion pools for two probabilistic reports and probabilistic reports respectively then pn pn is also valid opinion pool for reports two popular opinion pooling methods are the linear opinion pool linop and the logarithmic opinion pool logop which are essentially weighted average or convex combination and renormalized weighted geometric mean of the experts probability reports respectively pn linop pn ωilin pi qn qn log ωilog ωilog logop pn pi for scenario where ωilin ωilog pn ωilin pn ωilog market scoring rule msr in general scoring rule is function of two variables where is an agent probabilistic prediction density or mass function about an uncertain event is the realized or revealed outcome of that event after the prediction has been made and the resulting value of is the agent ex post compensation for prediction for binary event scoring rule can just be represented by the pair which is the vector of agent compensations for and respectively being the agent reported probability of which may or may not be equal to her true subjective probability say pr scoring rule is defined to be strictly proper if it is for agent an agent maximizes her subjective expectation of her ex post compensation by reporting her true subjective probability arg in addition scoring rule is regular if sj is except possibly that or is any regular strictly proper scoring rule can written in the following form gneiting and raftery sj is strictly convex function with as which is expect possibly that or is if is differentiable in is simply its derivative classic example of regular strictly proper scoring rule is the logarithmic scoring rule ln ln where is free parameter hanson introduced an extension of scoring rule wherein the principal initiates the process of information elicitation by making baseline report and then elicits publicly declared reports pi sequentially from agents the ex post compensation cx pi received by agent from the principal where is the realized outcome of event is the difference between the scores assigned to the reports made by herself and her predecessor cx pi sx pi sx if each agent acts and myopically as if her current interaction with the principal is her last then the incentive compatibility property of strictly proper score still holds for the sequential version moreover it is easy to show that the principal payout loss is bounded regardless of agent behavior in particular for the logarithmic score the loss bound for is ln can be referred to as the principal loss parameter sequentially shared strictly proper 
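The two pooling operators whose display equations were garbled above are, in code form (standard definitions; the weights are assumed nonnegative and normalized to sum to one by the caller):

```python
import numpy as np

def linop(p, w):
    """Linear opinion pool: the weighted average of the probability reports p."""
    p, w = np.asarray(p, dtype=float), np.asarray(w, dtype=float)
    return float(np.dot(w, p))

def logop(p, w):
    """Logarithmic opinion pool: the renormalized weighted geometric mean of the
    reports and of their complements."""
    p, w = np.asarray(p, dtype=float), np.asarray(w, dtype=float)
    num = np.prod(p ** w)
    return float(num / (num + np.prod((1.0 - p) ** w)))

# Three experts reporting 0.9, 0.6, 0.3 with equal weights:
# linop -> 0.6 exactly, logop -> roughly 0.64.
print(linop([0.9, 0.6, 0.3], [1/3] * 3), logop([0.9, 0.6, 0.3], [1/3] * 3))
```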
scoring rule of the above form can also be interpreted as cost prediction market mechanism offering trade in an valued security written on the event hence the name market scoring rule the cost function is strictly convex function of the total outstanding quantity of the security that determines all execution costs its first derivative the cost per share of buying or the proceeds per share from selling an infinitesimal quantity of the security is called the market instantaneous price and can be interpreted as the market maker current probability chen and pennock for the starting price being equal to the principal baseline report trading occurs in discrete episodes in each of which an agent orders quantity of the security to buy or sell given the market cost function and the publicly displayed instantaneous price since there is correspondence between agent order size and pi the market revised instantaneous price after trading with agent an agent action or trading decision in this setting is identical to making probability report by selecting pi if agent is then pi is by design her subjective probability πi see hanson chen and pennock for further details definition we call market scoring rule if the underlying scoring rule is regular and strictly proper and the associated convex function as in is continuous and thricedifferentiable with and for msr behavior with myopic agents we first present general results on the connection between sequential trading in market with agents and opinion pooling and then give more detailed picture for two representative utility functions without and with budget constraints respectively please refer to section of the supplementary material for detailed proofs of all results in this section suppose that in addition to belief πi pr each agent has continuous utility function of wealth ui where cmin denotes her ex post wealth her net compensation from the market mechanism after the realization of defined in and cmin is her minimum acceptable wealth negative value suggests tolerance of debt ui satisfies the usual criteria of except possibly that and risk aversion except possibly that through out its domain et al in other words ui is strictly increasing and strictly concave additionally we require its first two derivatives to be finite and continuous on cmin except that we tolerate cmin cmin note that by choosing finite lower bound cmin on the agent wealth we can account for any starting wealth or budget constraint that effectively restricts the agent action space lemma if then there exist lower and upper bounds pmin and pmax respectively on the feasible values of the price pi to which agent can drive the market min min regardless of her belief πi where pmin and pmax ci ci since the latest price can be viewed as the market current state from myopic agent perspective the agent final utility depends not only on her own action pi and the extraneously determined outcome but also on the current market state she encounters her rational action being given by pi arg πi ui πi ui pi this leads us to the main result of this section theorem if market scoring rule for an security with starting instantaneous price trades with sequence of myopic agents with subjective probabilities πn and utility functions of wealth un as above then the updated market price pi after every trading episode is equivalent to valid opinion pool for the market initial baseline report and the subjective probabilities πi of all agents who have traded up to and including that episode proof sketch for every 
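A minimal sketch of the cost-function view of LMSR described above, using the standard formulas (the variable names and example numbers are mine). It exposes exactly the interface used in the rest of the paper: a trader observes the instantaneous price and chooses the new price to move it to.

```python
import math

class LMSR:
    """Logarithmic market scoring rule for one binary security, viewed as a
    cost-function market maker with loss parameter b (worst-case loss b*ln 2)."""

    def __init__(self, b, p0=0.5):
        self.b = b          # the market maker's loss parameter
        self.p = p0         # instantaneous price = the market's current probability
        self.q = 0.0        # net outstanding shares of the "event occurs" security

    def trade_to(self, p_new):
        """Trade the quantity of the security that moves the price to p_new.
        Returns (shares_bought, amount_paid); both are negative for a sale."""
        shares = self.b * math.log(p_new * (1 - self.p) / (self.p * (1 - p_new)))
        cost = self.b * math.log((1 - self.p) / (1 - p_new))
        self.q += shares
        self.p = p_new
        return shares, cost

# A risk-neutral trader who believes Pr[event] = 0.7 and faces price 0.5 moves the
# price all the way to 0.7; the risk-averse traders studied here stop short of
# their belief, which is precisely the pooling behaviour analyzed below.
maker = LMSR(b=100.0)
print(maker.trade_to(0.7))   # roughly 84.7 shares bought for a payment of about 51.1
```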
trading epsiode by setting the first derivative of agent expected utility to zero and analyzing the resulting equation we can arrive at the following lemmas lemma under the conditions of theorem if then the revised price pi after agent trades is the unique solution in to the equation pi πi pi πi ui pi πi pi since and πi pi is also confined to by induction lemma the implicit function pi πi described by has the following properties pi πi or if and only if πi min πi pi max πi whenever πi πi for any given resp πi pi is strictly increasing function of πi resp evidently properties and above correspond to axioms of unanimity boundedness and monotonicity respectively defined in section hence pi πi is valid opinion pooling function for πi finally since defines the opinion pool pi recursively in terms of we can invoke lemma to obtain the desired result there are several points worth noting about this result first since the updated market price pi is also equivalent to agent action section the of is agent probability pennock of given her utility function her action and the current market state thus lemma is natural extension of the elicitation properties of msr msrs by design elicit subjective probabilities from agents in an incentive compatible manner we show that in general they elicit probabilities when they interact with agents lemma is also consistent with the observation of pennock that for all belief elicitation schemes based on monetary incentives an external observer can only assess participant probability uniquely she can not discern the participant belief and utility separately second observe that this pooling operation is accomplished by msr even without direct revelation finally notice the presence of the market maker own initial baseline as component in the final aggregate however for the examples we study below the impact of diminishes with the participation of more and more informed agents and we conjecture that this is generic property in general the exact form of this pooling function is determined by the complex interaction between the msr and agent utility and closed form of pi from might not be attainable in many cases however given paticular msr we can venture to identify agent utility functions which give rise to opinion pools hence for the rest of this paper we focus on the logarithmic market scoring rule lmsr one of the most popular tools for implementing prediction markets lmsr as logop for constant absolute risk aversion cara utility theorem if myopic agent having subjective belief πi and utility function satisfying our criteria trades with lmsr market with parameter and current instantaneous price then the market updated price pi is identical to logarithmic opinion pool between the current price and the agent subjective belief αi pi πiαi πi πi αi αi if and only if agent utility function is of the form ui τi exp the aggregation weight being given by αi constant τi τi the proof is in section of the supplementary material note that is standard formulation of the cara or negative exponential utility function with risk tolerance τi smaller the value of τi higher is agent aversion to risk the unbounded domain of ui indicates lack of budget constraints risk aversion comes about from the fact that the range of the function is bounded above by its risk tolerance τi but not bounded below moreover the logop equation can alternatively be expressed as linear update in terms of ratios another popular means of formulating one belief about binary event pi αi πi αi ln for aggregation 
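In code, the CARA case reduces to a one-line odds update. The weight α = τ/(τ + b) below is what the first-order condition yields for the negative-exponential utility and the logarithmic score; since the constant is garbled in the text above, treat it as a reconstruction. The short loop illustrates how a sequence of myopic CARA traders, none of whom ever revises her own belief, still drags the price toward a weighted consensus.

```python
def lmsr_cara_update(p_prev, pi, tau, b):
    """One LMSR trading episode (loss parameter b) with a myopic CARA trader of
    risk tolerance tau and subjective belief pi: the new price is the logarithmic
    pool of the old price and the belief with weight alpha = tau / (tau + b)."""
    alpha = tau / (tau + b)
    odds = (pi / (1 - pi)) ** alpha * (p_prev / (1 - p_prev)) ** (1 - alpha)
    return odds / (1 + odds)

p = 0.5
for belief, tau in [(0.8, 50.0), (0.6, 100.0), (0.9, 25.0)]:
    p = lmsr_cara_update(p, belief, tau, b=100.0)
print(round(p, 3))   # roughly 0.69: a consensus, not the last trader's belief
```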
weight and risk tolerance since αi is an increasing function of an agent risk tolerance relative to the market loss parameter the latter being in way measure of how much risk the market maker is willing to take identity implies that the higher an agent risk tolerance the larger is the contribution of her belief towards the changed market price which agrees with intuition also note the interesting manner in which the market loss parameter effectively scales down an agent risk tolerance enhancing the inertia factor αi of the price process bayesian interpretation the bayesian interpretation of logop in general is bordley we restate it here in is appropriate for our prediction market setting we αi πi πi can recast as pi this shows that over the ith trading episode the agent market environment is equivalent to bayesian learner performing inference on the point estimate of the probability of the forecast event starting with the prior pr and having direct access to πi which corresponds to the observation for the inference problem the likelihood function αi associated with this observation being sequence of traders if all agents in the system have cara utilities with potentially different risk tolerances and trade with lmsr myopically only once each in the order then the final market ratio after these trades on unfolding pn qn the recursion in is given by pn ein πi this is logop where determines the inertia of the market initial price which diminishes as more and more traders interact with the market and ejn quantifies the degree to which an individual trader impacts the final aggregate market qn belief ejn αj αi and enn αn interestingly the weight of an agent belief depends not only on her own risk tolerance but also on those of all agents succeeding her in the trading sequence lower weight for more risk tolerant successor ceteris paribus and is independent of her predecessors utility parameters this is sensible since by the design of msr trader action influences the action of each of rational traders so that the action of each of these successors in turn has role to play in determining the market impact of trader belief in particular if τj then the aggregation weights satisfy the inequalities αjn lmsr assigns progessively higher weights to traders arriving later in the market lifetime when they all exhibit identical constant risk aversion this seems to be reasonable aggregation principle in most scenarios wherein the amount of information in the world improves over time moreover in this situation which indicates that the weight of the market baseline belief in the aggregate may be higher than those of some of the trading agents if the market maker has comparatively high loss parameter this strong effect of the trading sequence on the weights of agents beliefs is significant difference between the trader setting and the market equilibrium setting where each agent weight is independent of the utility function parameters of her peers convergence if agents beliefs are themselves independent samples from the same distribution over πi then by the sum laws of expectation and variance pn pn var pn αin hence using an appropriate concentration inequality boucheron et al and the properties of the ein we can show that as increases the market ratio pn converges to with high probability this convergence guarantee does not require the agents to be bayesian lmsr as linop for an atypical utility with decreasing absolute risk aversion theorem if myopic agent having subjective belief πi and utility function 
satisfying our criteria trades with lmsr market with parameter and current instantaneous price then the market updated price pi is identical to linear opinion pool between the current price and the agent subjective belief pi βi πi βi for some constant βi if and only if agent utility function is of the form ui ln exp bi where bi represents agent budget the aggregation weight being βi exp the proof is in section of the supplementary material the above atypical utility function has its domain bounded below and possesses positive but strictly decreasing absolute risk aversion measure et al ai exp for any bi it shares these characteristics with the logarithmic utility function moreover although this function is approximately linear for large positive values of the wealth it is approximately logarithmic when bi theorem is somewhat surprising since it is logarithmic utility that has traditionally been found to effect linop in market equilibrium pennock beygelzimer et al storkey et al of course in this paper we are not in an equilibrium convergence setting but in light of the above similarities between utility function and logarithmic utility it is perhaps not unreasonable to ask whether the logarithmic connection is still maintained approximately for lmsr price evolution under some conditions we have extensively explored this idea both analytically and by simulations and have found that small agent budget compared to the lmsr loss parameter seems to produce the desired result see section of the supplementary material note that unlike in theorem the equivalence here requires the agent utility function to depend on the market maker loss parameter the scaling factor in the exponential since the microstructure is assumed to be common knowledge as in traditional msr settings the consideration of an agent utility that takes into account the market pricing function is not unreasonable since the domain of utility function is bounded below we can derive πi bounds on βi βi hence βi pmax possible values of pi from lemma pmin min the revised price is linear interpolation equation becomes pi πi pmax pi between the agent price bounds her subjective probability itself acting as the interpolation factor aggregation weight and budget constraint evidently the aggregation weight of agent belief βi exp is an increasing function of her budget normalized with respect to the market loss parameter it is in way measure of her relative risk tolerance thus broad characteristics analogous to the ones in section apply to these aggregation weights as well with the ratio replaced by the actual market price bayesian interpretation under the mild technical assumption that agent belief πi is rational and her budget bi is such that βi is also rational it is possible to obtain positive integers ri ni and positive rational number such that πi ri and βi ni then we can rewrite the linop equation as pi ri which is equivalent to the rior expectation of bayesian inference procedure described as follows the forecast event is modeled as the future final flip of biased coin with an unknown probability of heads in episode the principal or aggregator has prior distribution eta over this probability with thus is the prior mean and the corresponding size parameter agent is and her subjective probability πi accessible to the aggregator is her maximum likelihood estimate associated with the binomial likelihood of observing ri heads out of private sample of ni independent flips of the above coin ni is common knowledge note that ni are measures of 
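The linear-pool case can be written analogously. The weight β = 1 − exp(−b_i/b) below is what the first-order condition yields for the budget-constrained utility of the theorem (the exact constant is garbled above, so treat it as a reconstruction); it is consistent with the interpolation between the price bounds p_min and p_max noted in the text.

```python
import math

def lmsr_budget_update(p_prev, pi, budget, b):
    """One LMSR trading episode with the budget-constrained, decreasing-absolute-
    risk-aversion trader of the theorem: the new price is the linear pool of the
    old price and the belief with weight beta = 1 - exp(-budget / b)."""
    beta = 1.0 - math.exp(-budget / b)
    return beta * pi + (1.0 - beta) * p_prev

# Equivalently (consistent with the Lemma's price bounds): with
# p_min = p_prev * exp(-budget / b) and p_max = 1 - (1 - p_prev) * exp(-budget / b),
# the update is pi * p_max + (1 - pi) * p_min.
print(lmsr_budget_update(0.5, 0.8, budget=50.0, b=100.0))   # about 0.618
```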
certainty of the aggregator and the trading agent respectively and the latter normalized budget bi ln becomes measure of her certainty relative to the aggregator current state in this interpretation sequence of traders and convergence if all agents have utility with potentially different budgets and trade with lmsr myopically once each then the final aggregate market pn qn price is given by pn βein πi which is linop where αi βejn qn βj βi βenn βn again all intuitions about ejn from section carry over to βen moreover if πi then we can proceed exactly as in section to show that as increases pn converges to with high probability discussion and future work we have established the correspondence of securities market microstructure to class of traditional belief aggregation methods and by extension bayesian inference procedures in two important cases an obvious next step is the identification of general conditions under which msr and agent utility combination is equivalent to given pooling operation another research direction is extending our results to sequence of agents who trade repeatedly until convergence taking into account issues such as the order in which agents trade when they return the effects of the updated wealth after the first trade for agents with budgets etc acknowledgments we are grateful for support from nsf iis awards and references jacob abernethy sindhu kutty lahaie and rahul sami information aggregation in exponential family markets in proc acm conference on economics and computation pages acm alina beygelzimer john langford and david pennock learning performance of prediction markets with kelly bettors in proc aamas pages robert bordley multiplicative formula for aggregating probability assessments management science boucheron lugosi and olivier bousquet concentration inequalities in advanced lectures on machine learning pages springer aseem brahma mithun chakraborty sanmay das allen lavoie and malik bayesian market maker in proc acm conference on electronic commerce pages acm yiling chen and david pennock utility framework for market makers in proc yiling chen and jennifer wortman vaughan new understanding of prediction markets via noregret learning in proc acm conference on electronic commerce pages acm dave cliff and janet bruten zero is not enough on the lower limit of agent intelligence for continuous double auction markets technical report laboratories bristol bo cowgill and eric zitzewitz corporate prediction markets evidence from google ford and koch industries technical report working paper doyne farmer paolo patelli and ilija zovko the predictive power of zero intelligence in financial markets pnas rafael frongillo nicolas penna and mark reid interpreting prediction markets stochastic approach in proc nips pages ashutosh garg jayram shivakumar vaithyanathan and huaiyu zhu generalized opinion pooling in proc intl symp on artificial intelligence and mathematics citeseer christian genest and james zidek combining probability distributions critique and an annotated bibliography statistical science pages tilmann gneiting and adrian raftery strictly proper scoring rules prediction and estimation journal of the american statistical association robin hanson combinatorial information market design information systems frontiers jinli hu and amos storkey trading prediction markets with connections to machine learning proc icml krishnamurthy iyer ramesh johari and ciamac moallemi information aggregation and allocative efficiency in smooth markets management science andreu 
michael whinston and jerry green microeconomic theory volume new york oxford university press jono millin krzysztof geras and amos storkey isoelastic agents and wealth updates in machine learning markets in proc icml pages michael ostrovsky information aggregation in dynamic markets with strategic traders econometrica david pennock aggregating probabilistic beliefs market mechanisms and graphical representations phd thesis the university of michigan david pennock and lirong xia price updating in combinatorial prediction markets with bayesian networks in proc uai pages rajiv sethi and jennifer wortman vaughan belief aggregation with automated market makers computational economics forthcoming available at ssrn amos storkey zhanxing zhu and jinli hu aggregation under bias divergence aggregation and its implementation via machine learning markets in machine learning and knowledge discovery in databases pages springer james surowiecki the wisdom of crowds anchor justin wolfers and eric zitzewitz prediction markets econ perspectives 
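Before turning to the next paper, a small numerical sketch may make the sequential LinOP price dynamics described above concrete. The recursion p_i = beta_i * pi_i + (1 - beta_i) * p_{i-1} is the linear-opinion-pool update discussed in the preceding section; how each aggregation weight beta_i depends on the agent's budget and the LMSR loss parameter is deliberately left abstract here (the weights are passed in as inputs), so this is an illustrative sketch rather than the authors' construction, and all names are hypothetical.

    import numpy as np

    def sequential_linop(p0, beliefs, betas):
        # Sequential LinOP update: each myopic trade moves the market price to
        #   p_i = beta_i * pi_i + (1 - beta_i) * p_{i-1}.
        p = p0
        for pi_i, beta_i in zip(beliefs, betas):
            p = beta_i * pi_i + (1.0 - beta_i) * p
        return p

    def unrolled_weights(betas):
        # Unrolling the recursion shows the final price is a convex combination of
        # the initial price and the beliefs, with agent i receiving effective weight
        #   alpha_i = beta_i * prod_{j > i} (1 - beta_j).
        n = len(betas)
        alpha = np.array([betas[i] * np.prod([1.0 - betas[j] for j in range(i + 1, n)])
                          for i in range(n)])
        alpha0 = np.prod([1.0 - b for b in betas])  # residual weight on the initial price
        return alpha0, alpha

    # toy check that the sequential and unrolled views of the final price agree
    beliefs, betas, p0 = [0.8, 0.6, 0.7], [0.3, 0.5, 0.4], 0.5
    alpha0, alpha = unrolled_weights(betas)
    assert np.isclose(sequential_linop(p0, beliefs, betas),
                      alpha0 * p0 + alpha @ np.array(beliefs))

The unrolled form also makes the ordering effect visible: later traders with equal weights beta_i influence the final price more, which is consistent with the discussion of aggregation weights above.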
sparse linear programming via primal and dual augmented coordinate descent ian yen kai zhong hsieh pradeep ravikumar inderjit dhillon university of texas at austin university of california at davis ianyen pradeepr inderjit zhongkai chohsieh abstract over the past decades linear programming lp has been widely used in different areas and considered as one of the mature technologies in numerical optimization however the complexity offered by algorithms method and primal dual simplex methods is still unsatisfactory for problems in machine learning with huge number of variables and constraints in this paper we investigate general lp algorithm based on the combination of augmented lagrangian and coordinate descent giving an iteration complexity of log with nnz cost per iteration where nnz is the number of in the constraint matrix and in practice one can further reduce cost per iteration to the order of in columns rows corresponding to the active primal dual variables through an strategy the algorithm thus yields tractable alternative to standard lp methods for problems of sparse solutions and nnz mn we conduct experiments on lp instances from svm sparse inverse covariance estimation and nonnegative matrix factorization where the proposed approach finds solutions of precision orders of magnitude faster than implementations of and simplex methods introduction linear programming lp has been studied since the early century and has become one of the representative tools of numerical optimization with wide applications in machine learning such as svm map inference nonnegative matrix factorization exemplarbased clustering sparse inverse covariance estimation and markov decision process however as the demand for scalability keeps increasing the scalability of existing lp solvers has become unsatisfactory in particular most algorithms in machine learning targeting data have complexity linear to the data size while the complexity of lp solvers method and primal dual simplex methods is still at least quadratic in the number of variables or constraints the quadratic complexity comes from the need to solve each linear system exactly in both simplex and interior point method in particular the simplex method when traversing from one corner point to another requires solution to linear system that has dimension linear to the number of variables or constraints while in an method finding the newton direction requires solving linear system of similar size while there are sparse variants of lu and cholesky decomposition that can utilize the sparsity pattern of matrix in linear system the complexity for solving such system is at least quadratic to the dimension except for very special cases such as or matrix our solver has been released here http for interior point method ipm one remedy to the high complexity is employing an iterative method such as conjugate gradient cg to solve each linear system inexactly however this can hardly tackle the linear systems produced by ipm when iterates approach boundary of constraints though substantial research has been devoted to the development of preconditioners that can help iterative methods to mitigate the effect of creating preconditioner of tractable size is challenging problem by itself most commercial lp software thus still relies on exact methods to solve the linear system on the other hand some dual or primal stochastic descent methods have cheap cost for each iteration but require iterations to find solution of precision which in practice can even hardly find feasible 
solution satisfying all constraints augmented lagrangian method alm was invented early in and since then there have been several works developed linear program solver based on alm however the challenge of alm is that it produces series of quadratic problems that in the traditional sense are harder to solve than linear system produced by ipm or simplex methods specifically in approach one needs to solve several linear systems via cg to find solution to the quadratic program while there is no guarantee on how many iterations it requires on the other hand projected gradient method pgm despite its guaranteed iteration complexity has very slow convergence in practice more recently admm was proposed as variant of alm that for each iteration only updates one pass or even less blocks of primal variables before each dual update which however requires much smaller step size in the dual update to ensure convergence and thus requires large number of iterations for convergence to moderate precision to our knowledge there is still no report on significant improvement of methods over ipm or simplex method for linear programming in the recent years coordinate descent cd method has demonstrated efficiency in many machine learning problems with bound constraints or other terms and has solid analysis on its iteration complexity in this work we show that cd algorithm can be naturally combined with alm to solve linear program more efficiently than existing methods on problems we provide an log iteration complexity of the augmented lagrangian with coordinate descent algorithm that bounds the total number of cd updates required for an solution and describe an implementation of that has cost nnz for each pass of cd in practice an strategy is introduced to further reduce cost of each iteration to the active size of variables and constraints for and lp respectively where lp has most of variables being zero and lp has few binding constraints at the optimal solution note unlike in ipm the conditioning of each subproblem in alm does not worsen over iterations the framework thus provides an alternative to interior point and simplex methods when it is infeasible to exactly solving an or linear system sparse linear program we are interested in solving linear programs of the form min ct ai bi ae be xj nb where ai is mi by matrix of coefficients and ae is me by without loss of generality we assume constraints are imposed on the first nb variables denoted as xb such that xb xf and cb cf the inequality and equality coefficient matrices can then be partitioned as ai ai ai and ae ae ae the dual problem of then takes the form minm bt atb cb cf yi mi where mi me bi be ab ai ae af ai ae and yi ye in most of lp occur in machine learning and are both at scale in the order for which an algorithm with cost mn or is unacceptable fortunately there are usually various types of sparsity present in the problem that can be utilized to lower the complexity first the constraint matrix ai ae are usually pretty sparse in the sense that nnz mn and one can compute product ax in nnz however in most of current lp solvers not only product but also linear system involving needs to be solved which in general has cost much more than nnz and can be up to min in the worst case in particular the methods when moving from one corner to another requires solving linear system that involves of with columns corresponding to the basic variables while in an interior point method ipm one also needs to solve normal equation system of matrix adt at to obtain the 
newton direction where dt is diagonal matrix that gradually enforces complementary slackness as ipm iteration grows while one remedy to the high complexity is to employ iterative method such as conjugate gradient cg to solve the system inexactly within ipm this approach can hardly handle the occurs when ipm iterates approaches boundary on the other hand the augmented lagrangian approach does not have such asymptotic and thus an iterative method with complexity linear to nnz can be used to produce sufficiently accurate solution for each besides sparsity in the constraint matrix two other types of structures which we termed primal and dual sparsity are also prevalent in the context of machine learning lp refers to an lp with optimal solution comprising only few elements while lp refers to an lp with few binding constraints at optimal which corresponds to the dual variables in the following we give two examples of sparse lp support vector machine the problem of support vector machine min kwm ξi wm ξi wyti xi wm xi em ξi where ei if yi ei otherwise the task is since among all samples and class only those leads to misclassification will become binding constraints the problem is also since it does feature selection through note the constraint matrix in is also sparse since each constraint only involves two weight vectors and the pattern xi can be also sparse sparse inverse covariance estimation the sparse inverse covariance estimation aims to find sparse matrix that approximate the inverse of covariance matrix one of the most popular approach to this solves program of the form min ksω id kmax which is due to the penalty the problem has dense constraint matrix which however has special structure where the coefficient matrix can be decomposed into product of two and possibly sparse by matrices in case is sparse or this decomposition can be utilized to solve the linear program much more efficiently we will discuss on how to utilize such structure in section primal and dual augmented coordinate descent in this section we describe an augmented lagrangian method alm that carefully tackles the sparsity in lp the choice between primal and dual alm depends on the type of sparsity present in the lp in particular primal al method can solve problem of few variables more efficiently while dual alm will be more efficient for problem with few binding constraints in the following we describe the algorithm only from the primal point of view while the dual version can be obtained by exchanging the roles of primal and dual algorithm primal augmented lagrangian method initialization rm and repeat solve to obtain from ai bi update ηt ae be increase ηt by constant factor if necessary until ai xt bi and kae xt be augmented lagrangian method dual proximal method let be the dual objective function that takes if is infeasible the primal al algorithm can be interpreted as dual proximal point algorithm that for each iteration solves argmin ky since is nonsmooth is not easier to solve than the original dual problem however the dual of takes the form yit ηt ai bi min ae be ηt ye xb which is quadratic problem note given as lagrangian multipliers of the corresponding minimizing lagrangian is ηt ai bi ae be yit ye and thus one can solve from and find through the resulting algorithm is sketched in algorithm for problem of medium scale is not easier to solve than linear system due to constraints and thus an alm is not preferred to ipm in the traditional sense however for problem with nnz the alm becomes advantageous since the 
conditioning of does not worsen over iterations and thus allows iterative methods to solve it approximately in time proportional to nnz ii for problem most of primal dual variables become binding at zero as iterates approach to the optimal solution which yields potentially much smaller subproblem solving subproblem via coordinate descent given dual solution yt we employ variant of randomized coordinate descent rcd method to solve subproblem first we note that given the part of variables in can minimized in as bi ai yit where function truncates each element of vector to be as max vi then can be as ηt ai bi yit min ae be ye xb algorithm rcd for subproblem algorithm for subproblem input ηt and satisfying input ηt and satisfying relation relation output xt wt output xt wt repeat repeat pick coordinate uniformly at random identify active variables at compute at and set dt compute find newton direction with cg obtain newton direction dj find step size via projected line search do line search to find step size update xt xt update xt xt maintain relation maintain relation until until denote the objective function as the gradient of can be expressed as ηt ati ηt ate where ai bi yit ae be ye and the generalized hessian of is ηt ati ai ηt ate ae where is an mi by mi diagonal matrix with dii if wi and dii otherwise the rcd algorithm then proceeds as follows in each iteration it picks coordinate from uniformly at random and minimizes the coordinate the minimization is conducted by newton step which first finds the newton direction through minimizing quadratic approximation argmin xt xt xt and then conducted line search to find the smallest satisfying xt ej xt σβ xt for some parameter where ej denotes vector with only jth element equal to and all others equal to note the problem has solution xt xt which in naive implementation takes nnz time due to the computation of and however in clever implementation one can maintain the relation as follows whenever coordinate xj is updated by aj dj ae where aj aij ae denotes the jth column of ai and ae then the gradient and generalized of jth coordinate cj ηt haij ηt hae vi ηt aii wi ae can be computed in nnz aj time similarly for each coordinate update one can evaluate the difference of function value xt ej xt in nnz aj by only computing terms related to the jth variable the overall procedure for solving subproblem is summarized in algorithm in practice random permutation is used instead of uniform sampling to ensure that every coordinate is updated once before proceeding to the next round which can speed up convergence and ease the checking of stopping condition and an strategy is employed to avoid updating variables with we describe details in section convergence analysis in this section we prove the iteration complexity of method existing analysis shows that randomized coordinate descent can be up to times faster than methods in certain conditions however to prove global linear rate of convergence the analysis requires objective function to be strongly convex which is not true for our here we follow the approach in to show global linear convergence of algorithm by utilizing the fact that when restricted to constant subspace is strongly convex all proofs will be included in the appendix theorem linear convergence denote as the optimum of and the iterates of the rcd algorithm has γn where max is an upper bound on second derivative and lg is local constant of function ηt kz yt and is constant of hoffman bound that depends on the polyhedron formed by the set of 
optimal solutions then the following theorem gives bound on the number of iterations required to find an solution in terms of the proximal minimization theorem inner iteration complexity denote as the dual solution corresponding to the primal iterate to guarantee ky with probability it suffices running rcd algorithm for number of iterations log ηt now we prove the overall iteration complexity of note that existing linear convergence analysis of alm on linear program assumes exact solutions of subproblem which is not possible in practice our next theorem extends the linear convergence result to cases when subproblems are solved inexactly and in particular shows the total number of coordinate descent updates required to find an solution theorem iteration complexity denote as the sequence of iterates obtained from inexact dual proximal updates as that generated by exact updates and ys as the projection of to the set of optimal dual solutions to guarantee with probability it suffices to run algorithm for lr log outer iterations with ηt and solve each by running algorithm for lr log log log inner iterations where is constant depending on the polyhedral set of optimal solutions kproxηt and are upper and lower bounds on the initial and optimal function values of subproblem respectively fast asymptotic convergence via projected the rcd algorithm converges to solution of moderate precision efficiently but in some problems higher precision might be required in such case we transfer the subproblem solver from rcd to projected method after iterates are close enough to the optimum note the projected newton method does not have global iteration complexity but has fast convergence for iterates very close to the optimal denote as the objective in each iterate of begins by finding the set of active variables defined as at then the algorithm fixes xt at and solves newton linear system at xt xt to obtain direction for the current active variables let dat denotes vector taking value in for at and taking value for at the algorithm then conducts projected line search to find smallest satisfying xt dat xt σβ xt dat and update by xt xt compared to interior point method one key to the tractability of this approach lies on the conditioning of linear system which does not worsen as outer iteration increases so an iterative conjugate gradient cg method can be used to obtain accurate solution without factorizing the hessian matrix the only operation required within cg is the product xt ηt ati wt ai ate ae at where the operator at takes the with row and column indices belonging to at for primal or lp the product can be evaluated very efficiently since it only involves elements in columns of ai ae belonging to the active set and rows of ai corresponding to the binding constraints for which dii wt the overall cost of the product is only nnz ai dt at nnz ae where dt is the set of current binding constraints considering that the computational bottleneck of is on the cg iterations for solving linear system the efficient computation of product reduces the overall complexity of significantly the whole procedure is summarized in algorithm practical issues precision of subproblem minimization in practice it is unnecessary to solve subproblem to high precision especially for iterations of alm in the beginning in our implementation we employ strategy where in the first phase we limit the cost spent on each to be constant multiple of nnz while in the second phase we dynamically increment the al parameter ηt and inner precision to 
ensure sufficient decrease in the primal and dual infeasibility respectively the strategy is particularly useful for primal or problem where in the latter phase has smaller active set that results in less computation cost even when solved to high precision strategy our implementation of algorithm maintains an active set of variables which initially contains all variables but during the rcd iterates any variable xj binding at with gradient greater than threshold will be excluded from till the end of each subproblem solving will be after each dual proximal update note in the initial phase the cost spent on each subproblem is constant multiple of nnz so if is small one would spend more iterations on the active variables to achieve faster convergence dealing with decomposable constraint matrix when we have by constraint matrix that can be decomposed into product of an matrix and matrix if min or nnz nnz nnz we can the constraint ax as with auxiliary variables rr this new representation reduce the cost of product in algorithm and the cost of each pass of cd in algorithm from nnz to nnz nnz numerical experiments table timing results in sec unless specified on multiclass svm data news sector mnist vehicle nb mi barrier table timing results in sec unless specified on sparse inverse covariance estimation data textmine dorothea nb mi me nf barrier table timing results in sec unless specified for nonnegative matrix factorization data micromass ocr nb mi barrier in this section we compare the algorithm with implementation of interior point and primal dual simplex methods in commercial lp solver cplex which is of top efficiency among many lp solvers as investigated in for all experiments the stopping criteria is set to require both primal and dual infeasibility in the smaller than and set the initial subproblem tolerance and ηt the lp instances are generated from sparse inverse covariance estimation and nonnegative matrix factorization for the sparse inverse covariance estimation problem we use technique introduced in section to decompose the lowrank matrix and since results in independent problems for each column of the estimated matrix we report result on only one of them the data source and statistics are included in the appendix among all experiments we observe that the proposed primal dual methods become particularly advantageous when the matrix is sparse for example for text data set and news in table the matrix is particularly sparse and can be orders of magnitude faster than other approaches by avoiding solving linear system exactly in addition the also dual simplex is more efficient in problem due to the problem strong dual sparsity while the is more efficient on the inverse covariance estimation problem for the nonnegative matrix factorization problem both the dual and primal lp solutions are not particularly sparse due to the choice of matrix approximation tolerance of samples but the approach is still comparably more efficient acknowledgement we acknowledge the support of aro via and the support of nsf via grants and nih via as part of the joint initiative to support research at the interface of the biological and mathematical sciences references zhu rosset hastie and tibshirani support vector machines nips koller and friedman probabilistic graphical models principles and techniques mit press gillis and luce robust nonnegative matrix factorization using linear optimization jmlr nellore and ward recovery guarantees for clustering yen lin zhong ravikumar and dhillon convex approach to 
madbayes dirichlet process mixture models in icml yuan high dimensional inverse covariance matrix estimation via linear programming jmlr bello and riano linear programming solvers for markov decision processes in systems and information engineering design symposium pages joachims training linear svms in linear time in kdd acm hsieh chang lin keerthi and sundararajan dual coordinate descent method for linear svm in icml volume acm yuan hsieh chang and lin comparison of optimization methods and software for largescale linear classification jmlr nocedal and wright numerical optimization springer gondzio interior point methods years later ejor gondzio interior point method computational optimization and applications and finding approximate solutions for large scale linear programs thesis evtushenko yu golikov ai and mollaverdy augmented lagrangian method for linear programming problems optimization methods and software delbos and gilbert global linear convergence of an augmented lagrangian algorithm for solving convex quadratic optimization problems augmented lagrangian algorithms for linear programming journal of optimization theory and applications and toraldo on the solution of large quadratic programming problems with bound constraints siam journal on optimization hong and luo on linear convergence of alternating direction method of multipliers arxiv wang banerjee and luo parallel direction method of multipliers in nips and the direct extension of admm for convex minimization problems is not necessarily convergent mathematical programming and nearest neighbor based greedy coordinate descent in nips yen chang lin lin and lin indexed block coordinate descent for linear classification with limited memory in kdd acm yen lin and lin block minimization framework for learning with limited memory in nips zhong yen dhillon and ravikumar proximal for computationally intensive in nips and iteration complexity of randomized descent methods for minimizing composite function mathematical programming nesterov efficiency of coordinate descent methods on optimization problems siam journal on optimization wang and lin iteration complexity of feasible descent methods for convex optimization the journal of machine learning research yen hsieh ravikumar and dhillon constant nullspace strong convexity and fast convergence of proximal methods under settings in nips meindl and templ analysis of commercial and free and open source solvers for linear optimization problems eurostat and statistics netherlands hoffman on approximate solutions of systems of linear inequalities journal of research of the national bureau of standards rahimi and recht random features for kernel machines in nips yen lin lin ravikumar and dhillon sparse random feature algorithm as coordinate descent in hilbert space in nips 
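To make the core primal scheme of the preceding paper concrete, the following is a minimal dense-matrix sketch of an augmented Lagrangian method whose subproblems are solved by randomized coordinate descent, with the O(nnz(a_j)) residual maintenance described above. It covers only the equality-constrained case min c'x s.t. Ax = b, x >= 0, uses dense NumPy arrays, and omits the inequality constraints, active-set shrinking, stopping tests, and projected Newton refinement of the actual solver; all function and parameter names are invented for illustration.

    import numpy as np

    def alm_cd_lp(c, A, b, eta=1.0, outer_iters=50, inner_passes=10, eta_growth=1.5):
        # Sketch: min c'x  s.t.  Ax = b, x >= 0, via an augmented Lagrangian
        #   L(x, y) = c'x + y'(Ax - b) + (eta/2)||Ax - b||^2
        # whose subproblems are minimized over x >= 0 by coordinate descent.
        m, n = A.shape
        x, y = np.zeros(n), np.zeros(m)
        col_sqnorm = (A * A).sum(axis=0)      # ||a_j||^2 for each column (assumed > 0)
        w = A @ x - b                         # maintained residual w = Ax - b
        for _ in range(outer_iters):
            for _ in range(inner_passes):
                for j in np.random.permutation(n):   # one random-permutation pass
                    g = c[j] + A[:, j] @ (y + eta * w)                   # coordinate gradient
                    x_new = max(0.0, x[j] - g / (eta * col_sqnorm[j]))   # exact 1-D minimizer, projected
                    if x_new != x[j]:
                        w += A[:, j] * (x_new - x[j])                    # O(nnz(a_j)) residual update
                        x[j] = x_new
            y = y + eta * w                   # dual (proximal) update
            eta *= eta_growth                 # optionally increase the AL parameter
        return x, y

Because the augmented Lagrangian is an exact quadratic along each coordinate, each update is a single projected Newton step, and the only per-update work is one column of A, which is what makes each pass cost on the order of nnz(A) rather than requiring an exact linear-system solve.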
training very deep networks rupesh kumar srivastava klaus greff schmidhuber the swiss ai lab idsia usi supsi rupesh klaus juergen abstract theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success however training becomes more difficult as depth increases and training of very deep networks remains an open problem here we introduce new architecture designed to overcome this our highway networks allow unimpeded information flow across many layers on information highways they are inspired by long memory recurrent networks and use adaptive gating units to regulate the information flow even with hundreds of layers highway networks can be trained directly through simple gradient descent this enables the study of extremely deep and efficient architectures introduction previous work many recent empirical breakthroughs in supervised machine learning have been achieved through large and deep neural networks network depth the number of successive computational layers has played perhaps the most important role in these successes for instance within just few years the image classification accuracy on the imagenet dataset has increased from to using deeper networks with rather small receptive fields other results on practical machine learning problems have also underscored the superiority of deeper networks in terms of accuracy performance in fact deep networks can represent certain function classes far more efficiently than shallow ones this is perhaps most obvious for recurrent nets the deepest of them all for example the bit parity problem can in principle be learned by large feedforward net with binary input units output unit and single but large hidden layer but the natural solution for arbitrary is recurrent net with only units and weights reading the input bit string one bit at time making single recurrent hidden unit flip its state whenever new is observed related observations hold for boolean circuits and modern neural networks to deal with the difficulties of training deep networks some researchers have focused on developing better optimizers initialization strategies in particular the normalized initialization for certain activation functions have been widely adopted for training moderately deep networks other similarly motivated strategies have shown promising results in preliminary experiments experiments showed that certain activation functions based on local competition may help to train deeper networks skip connections between layers or to output layers where error is injected have long been used in neural networks more recently with the explicit aim to improve the flow of information related recent technique is based on using soft targets from shallow teacher network to aid in training deeper student networks in multiple stages similar to the neural history compressor for sequences where slowly ticking teacher recurrent net is distilled into quickly ticking student recurrent net by forcing the latter to predict the hidden units of the former finally deep networks can be trained to help in credit assignment but this approach is less attractive compared to direct training very deep network training still faces problems albeit perhaps less fundamental ones than the problem of vanishing gradients in standard recurrent networks the stacking of several transformations in conventional network architectures typically results in poor propagation of activations and gradients hence it remains hard to investigate the benefits of very deep networks 
for a variety of problems. To overcome this, we take inspiration from Long Short Term Memory (LSTM) recurrent networks. We propose to modify the architecture of very deep feedforward networks such that information flow across layers becomes much easier. This is accomplished through an adaptive gating mechanism that allows for computation paths along which information can flow across many layers without attenuation. We call such paths information highways; they yield highway networks, as opposed to traditional, plain networks. Our primary contribution is to show that extremely deep highway networks can be trained directly using stochastic gradient descent (SGD), in contrast to plain networks, which become hard to optimize as depth increases (see the optimization experiments below). Deep networks with limited computational budget, for which the multi-stage training procedure mentioned above was recently proposed, can also be directly trained in a single stage when converted to highway networks. Their ease of training is supported by experimental results demonstrating that highway networks also generalize well to unseen data.

Highway networks

Notation. We use boldface letters for vectors and matrices, and italicized capital letters to denote transformation functions. 0 and 1 denote vectors of zeros and ones respectively, and I denotes an identity matrix. The function σ(x) is defined as σ(x) = 1/(1 + e^(−x)), and the dot operator (·) is used to denote element-wise multiplication.

A plain feedforward neural network typically consists of L layers, where the l-th layer applies a transformation H, parameterized by W_H, on its input x_l to produce its output y_l. Thus x_1 is the input to the network and y_L is the network output. Omitting the layer index and biases for clarity,

    y = H(x, W_H).    (1)

H is usually an affine transform followed by a non-linear activation function, but in general it may take other forms, possibly convolutional or recurrent. For a highway network we additionally define two transforms T(x, W_T) and C(x, W_C), such that

    y = H(x, W_H) · T(x, W_T) + x · C(x, W_C).    (2)

We refer to T as the transform gate and C as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. For simplicity, in this paper we set C = 1 − T, giving

    y = H(x, W_H) · T(x, W_T) + x · (1 − T(x, W_T)).    (3)

The dimensionality of x, y, H(x, W_H) and T(x, W_T) must be the same for equation (3) to be valid. Note that this layer transformation is much more flexible than equation (1). In particular, observe that for particular values of the transform gate,

    y = x             if T(x, W_T) = 0,
    y = H(x, W_H)     if T(x, W_T) = 1.    (4)

Similarly, for the Jacobian of the layer transform,

    dy/dx = I               if T(x, W_T) = 0,
    dy/dx = H'(x, W_H)      if T(x, W_T) = 1.    (5)

Thus, depending on the output of the transform gates, a highway layer can smoothly vary its behavior between that of H and that of a layer which simply passes its inputs through. Just as a plain layer consists of multiple computing units such that the i-th unit computes y_i = H_i(x), a highway network consists of multiple blocks such that the i-th block computes a block state H_i(x) and a transform gate output T_i(x). Finally, it produces the block output y_i = H_i(x) · T_i(x) + x_i · (1 − T_i(x)), which is connected to the next layer. (This paper expands upon a shorter report on highway networks; more recently, a similar LSTM-inspired model was also proposed.)

[Figure: comparison of the optimization of plain networks and highway networks of various depths. Left: training curves for the best hyperparameter settings obtained for each network depth. Right: mean performance of the top hyperparameter settings. Plain networks become much harder to optimize with increasing depth, while very deep highway networks can still be optimized well. Best viewed on screen; a larger version is included in the supplementary material.]

Constructing highway networks. As mentioned earlier, equation (3) requires that the dimensionality of x, y, H(x, W_H) and T(x, W_T) be the same. To change the size of the
intermediate representation one can replace with obtained by suitably or another alternative is to use plain layer without highways to change dimensionality which is the strategy we use in this study convolutional highway layers utilize and local receptive fields for both and transforms we used the same sized receptive fields for both and to ensure that the block state and transform gate feature maps match the input size training deep highway networks we use the transform gate defined as wt bt where wt is the weight matrix and bt the bias vector for the transform gates this suggests simple initialization scheme which is independent of the nature of bt can be initialized with negative value etc such that the network is initially biased towards carry behavior this scheme is strongly inspired by the proposal to initially bias the gates in an lstm network to help bridge temporal dependencies early in learning note that so the conditions in equation can never be met exactly in our experiments we found that negative bias initialization for the transform gates was sufficient for training to proceed in very deep networks for various initial distributions of wh and different activation functions used by in pilot experiments sgd did not stall for networks with more than layers although the initial bias is best treated as hyperparameter as general guideline we suggest values of and for convolutional highway networks of depth approximately and our pilot experiments on training very deep networks were successful with more complex block design closely resembling an lstm block unrolled in time here we report results only for much simplified form network no of parameters test accuracy in highway networks width width maxout dsn table test set classification accuracy for pilot experiments on the mnist dataset network no of layers fitnet results reported by romero et al teacher fitnet fitnet highway networks highway fitnet highway fitnet highway no of parameters accuracy in table test set accuracy of convolutional highway networks architectures tested were based on fitnets trained by romero et al using hint based training highway networks were trained in single stage without hints matching or exceeding the performance of fitnets experiments all networks were trained using sgd with momentum an exponentially decaying learning rate was used in section for the rest of the experiments simpler commonly used strategy was employed where the learning rate starts at value and decays according to fixed schedule by factor and the schedule were selected once based on validation set performance on the dataset and kept fixed for all experiments all convolutional highway networks utilize the rectified linear activation function to compute the block state to provide better estimate of the variability of classification results due to random initialization we report our results in the format best mean based on runs wherever available experiments were conducted using caffe and brainstorm https frameworks source code hyperparameter search results and related scripts are publicly available at http optimization to support the hypothesis that highway networks do not suffer from increasing depth we conducted series of rigorous optimization experiments comparing them to plain networks with normalized initialization we trained both plain and highway networks of varying varying depths on the mnist digit classification dataset all networks are thin each layer has blocks for highway networks and units for plain networks yielding roughly 
identical numbers of parameters per layer in all networks the first layer is fully connected plain layer followed by or fully connected plain or highway layers finally the network output is produced by softmax layer we performed random search of runs for both plain and highway networks to find good settings for the following hyperparameters initial learning rate momentum learning rate exponential decay factor activation function either rectified linear or tanh for highway networks an additional hyperparameter was the initial value for the transform gate bias between and other weights were initialized using the same normalized initialization as plain networks the training curves for the best performing networks for each depth are shown in figure as expected and plain networks exhibit very good performance mean loss which significantly degrades as depth increases even though network capacity increases highway networks do not suffer from an increase in depth and layer highway networks perform similar to layer networks the highway network performed more than orders of magnitude better compared to plain network it was also observed that highway networks consistently converged significantly faster than plain ones network maxout dasnet nin dsn highway network accuracy in accuracy in table test set accuracy of convolutional highway networks on the and object recognition datasets with typical data augmentation for comparison we list the accuracy reported by recent studies in similar experimental settings pilot experiments on mnist digit classification as sanity check for the generalization capability of highway networks we trained convolutional highway networks on mnist using two architectures each with convolutional layers followed by softmax output the number of filter maps width was set to and for all the layers we obtained test set performance competitive with methods with much fewer parameters as show in table experiments on and object recognition comparison to fitnets fitnet training maxout networks can cope much better with increased depth than those with traditional activation functions however romero et al recently reported that training on through plain backpropogation was only possible for maxout networks with depth up to layers when the number of parameters was limited to and the number of multiplications to similar limitations were observed for higher computational budgets training of deeper networks was only possible through the use of training procedure and addition of soft targets produced from shallow teacher network training we found that it was easy to train highway networks with numbers of parameters and operations comparable to those of fitnets in single stage using sgd as shown in table highway and highway which are based on the architectures of fitnet and fitnet respectively obtain similar or higher accuracy on the test set we were also able to train thinner and deeper networks for example highway network consisting of alternating receptive fields of size and with parameters performs better than the earlier teacher network comparison to methods it is possible to obtain high performance on the and datasets by utilizing very large networks and extensive data augmentation this approach was popularized by ciresan et al and recently extended by graham since our aim is only to demonstrate that deeper networks can be trained without sacrificing ease of training or generalization ability we only performed experiments in the more common setting of global contrast normalization small 
translations and mirroring of images following lin et al we replaced the fully connected layer used in the networks in the previous section with convolutional layer with receptive field of size one and global average pooling layer the hyperparameters from the last section were for both and therefore it is quite possible to obtain much better results with better the results are tabulated in table analysis figure illustrates the inner workings of the hidden layer highway networks trained on mnist top row and bottom row the first three columns show obtained via random search over hyperparameters to minimize the best training set error achieved using each configuration figure visualization of best highway networks trained on mnist top row and bottom row the first hidden layer is plain layer which changes the dimensionality of the representation to each of the highway layers consists of blocks the first column shows the transform gate biases which were initialized to and respectively in the second column the mean output of the transform gate over all training examples is depicted the third and fourth columns show the output of the transform gates and the block outputs both networks using tanh for single random training sample best viewed in color the bias the mean activity over all training samples and the activity for single random sample for each transform gate respectively block outputs for the same single sample are displayed in the last column the transform gate biases of the two networks were initialized to and respectively it is interesting to note that contrary to our expectations most biases decreased further during training for the network the biases increase with depth forming gradient curiously this gradient is inversely correlated with the average activity of the transform gates as seen in the second column this indicates that the strong negative biases at low depths are not used to shut down the gates but to make them more selective this behavior is also suggested by the fact that the transform gate activity for single example column is very sparse the effect is more pronounced for the network but can also be observed to lesser extent in the mnist network the last column of figure displays the block outputs and visualizes the concept of information highways most of the outputs stay constant over many layers forming pattern of stripes most of the change in outputs happens in the early layers for mnist and for routing of information one possible advantage of the highway architecture over shortcut connections is that the network can learn to dynamically adjust the routing of the information based on the current input this begs the question does this behaviour manifest itself in trained networks or do they just learn static routing that applies to all inputs similarly partial answer can be found by looking at the mean transform gate activity second column and the single example transform gate outputs third column in figure especially for the case most transform gates are active on average while they show very selective activity for the single example this implies that for each sample only few blocks perform transformation but different blocks are utilized by different samples this routing mechanism is further investigated in figure in each of the columns we show how the average over all samples of one specific class differs from the total average shown in the second column of figure for mnist digits and substantial differences can be seen figure visualization showing the extent to 
which the mean transform gate activity for certain classes differs from the mean activity over all training samples generated using the same highway networks on mnist on as figure best viewed in color within the first layers while for cifar class numbers and the differences are sparser and spread out over all layers in both cases it is clear that the mean activity pattern differs between classes the gating system acts not just as mechanism to ease training but also as an important part of the computation in trained network layer importance since we bias all the transform gates towards being closed in the beginning every layer mostly copies the activations of the previous layer does training indeed change this behaviour or is the final network still essentially equivalent to network with much fewer layers to shed light on this issue we investigated the extent to which lesioning single layer affects the total performance of trained networks from section by lesioning we mean manually setting all the transform gates of layer to forcing it to simply copy its inputs for each layer we evaluated the network on the full training set with the gates of that layer closed the resulting performance as function of the lesioned layer is shown in figure for mnist left it can be seen that the error rises significantly if any one of the early layers is removed but layers seem to have close to no effect on the final performance about of the layers don learn to contribute to the final result likely because mnist is simple dataset that doesn require much depth we see different picture for the dataset right with performance degrading noticeably when removing any of the first layers this suggests that for complex problems highway network can learn to utilize all of its layers while for simpler problems like mnist it will keep many of the unneeded layers idle such behavior is desirable for deep networks in general but appears difficult to obtain using plain networks discussion alternative approaches to counter the difficulties posed by depth mentioned in section often have several limitations learning to route information through neural networks with the help of competitive interactions has helped to scale up their application to challenging problems by improving credit assignment but they still suffer when depth increases beyond even with careful initialization effective initialization methods can be difficult to derive for variety of activation functions deep supervision has been shown to hurt performance of thin deep networks very deep highway networks on the other hand can directly be trained with simple gradient descent methods due to their specific architecture this property does not rely on specific transformations which may be complex convolutional or recurrent transforms and derivation of suitable initialization scheme is not essential the additional parameters required by the gating mechanism help in routing information through the use of multiplicative connections responding differently to different inputs unlike fixed skip connections figure lesioned training set performance of the best highway networks on mnist left and right as function of the lesioned layer evaluated on the full training set while forcefully closing all the transform gates of single layer at time the performance is indicated as dashed line at the bottom possible objection is that many layers might remain unused if the transform gates stay closed our experiments show that this possibility does not affect networks and narrow highway 
networks can the accuracy of wide and shallow maxout networks which would not be possible if layers did not perform useful computations additionally we can exploit the structure of highways to directly evaluate the contribution of each layer as shown in figure for the first time highway networks allow us to examine how much computation depth is needed for given problem which can not be easily done with plain networks acknowledgments we thank nvidia corporation for their donation of gpus and acknowledge funding from the eu project nascence we are grateful to sepp hochreiter and thomas unterthiner for helpful comments and jan for help in conducting experiments references alex krizhevsky ilya sutskever and geoffrey hinton imagenet classification with deep convolutional neural networks in advances in neural information processing systems christian szegedy wei liu yangqing jia pierre sermanet scott reed dragomir anguelov dumitru erhan vincent vanhoucke and andrew rabinovich going deeper with convolutions cs september karen simonyan and andrew zisserman very deep convolutional networks for image recognition cs september dc ciresan ueli meier jonathan masci luca gambardella and schmidhuber flexible high performance convolutional neural networks for image classification in ijcai dan ciresan ueli meier and schmidhuber deep neural networks for image classification in ieee conference on computer vision and pattern recognition dong yu michael seltzer jinyu li huang and frank seide feature learning in deep neural on speech recognition tasks arxiv preprint sepp hochreiter and jurgen schmidhuber bridging long time lags by weight guessing and long shortterm memory spatiotemporal models in biological and artificial systems johan computational limitations of circuits mit press johan and mikael goldmann on the power of threshold circuits computational complexity monica bianchini and franco scarselli on the complexity of neural network classifiers comparison between shallow and deep architectures ieee transactions on neural networks guido montufar razvan pascanu kyunghyun cho and yoshua bengio on the number of linear regions of deep neural networks in advances in neural information processing systems james martens and venkatesh medabalimi on the expressive efficiency of sum product networks cs stat november james martens and ilya sutskever training deep and recurrent networks with optimization neural networks tricks of the trade pages ilya sutskever james martens george dahl and geoffrey hinton on the importance of initialization and momentum in deep learning pages yann dauphin razvan pascanu caglar gulcehre kyunghyun cho surya ganguli and yoshua bengio identifying and attacking the saddle point problem in optimization in advances in neural information processing systems pages xavier glorot and yoshua bengio understanding the difficulty of training deep feedforward neural networks in international conference on artificial intelligence and statistics pages kaiming he xiangyu zhang shaoqing ren and jian sun delving deep into rectifiers surpassing performance on imagenet classification cs february david sussillo and abbott random walk initialization for training very deep feedforward networks cs stat december andrew saxe james mcclelland and surya ganguli exact solutions to the nonlinear dynamics of learning in deep linear neural networks stat december ian goodfellow david mehdi mirza aaron courville and yoshua bengio maxout networks cs stat february rupesh srivastava jonathan masci sohrob kazerounian faustino 
gomez and schmidhuber compete to compute in advances in neural information processing systems pages tapani raiko harri valpola and yann lecun deep learning made easier by linear transformations in perceptrons in international conference on artificial intelligence and statistics pages alex graves generating sequences with recurrent neural networks lee saining xie patrick gallagher zhengyou zhang and zhuowen tu nets pages adriana romero nicolas ballas samira ebrahimi kahou antoine chassang carlo gatta and yoshua bengio fitnets hints for thin deep nets cs december schmidhuber learning complex extended sequences using the principle of history compression neural computation march geoffrey hinton simon osindero and teh fast learning algorithm for deep belief nets neural computation sepp hochreiter untersuchungen zu dynamischen neuronalen netzen masters thesis technische sepp hochreiter and schmidhuber long memory neural computation november felix gers schmidhuber and fred cummins learning to forget continual prediction with lstm in icann volume pages rupesh kumar srivastava klaus greff and schmidhuber highway networks cs may nal kalchbrenner ivo danihelka and alex graves grid long memory cs july yangqing jia evan shelhamer jeff donahue sergey karayev jonathan long ross girshick sergio guadarrama and trevor darrell caffe convolutional architecture for fast feature embedding cs benjamin graham convolutional neural networks september min lin qiang chen and shuicheng yan network in network marijn stollenga jonathan masci faustino gomez and schmidhuber deep networks with internal selective attention through feedback connections in nips jost tobias springenberg alexey dosovitskiy thomas brox and martin riedmiller striving for simplicity the all convolutional net cs december rupesh kumar srivastava jonathan masci faustino gomez and schmidhuber understanding locally competitive networks in international conference on learning representations 
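As a companion to the layer equations in the highway networks section above, the following is a minimal NumPy sketch of a fully connected highway layer (forward pass only). The gating equation y = H(x)·T(x) + x·(1 − T(x)) and the negative transform-gate bias initialization follow the paper; the weight-initialization scale, layer width, and all names are illustrative assumptions, and no training code is included.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    class HighwayLayer:
        # Minimal fully connected highway layer (forward pass only):
        #   y = H(x) * T(x) + x * (1 - T(x)),   T(x) = sigmoid(x W_T + b_T).
        def __init__(self, dim, gate_bias_init=-2.0, seed=0):
            rng = np.random.default_rng(seed)
            scale = np.sqrt(2.0 / dim)                    # assumed initialization scale
            self.W_H = rng.normal(0.0, scale, (dim, dim))
            self.b_H = np.zeros(dim)
            self.W_T = rng.normal(0.0, scale, (dim, dim))
            self.b_T = np.full(dim, gate_bias_init)       # negative bias -> initial carry behavior

        def forward(self, x):
            h = np.maximum(0.0, x @ self.W_H + self.b_H)  # block state H(x) with ReLU
            t = sigmoid(x @ self.W_T + self.b_T)          # transform gate T(x)
            return h * t + x * (1.0 - t)                  # carry gate fixed to 1 - T(x)

    # Stacking layers: input and output dimensionality match, so deep stacks compose directly.
    x = np.random.default_rng(1).normal(size=(4, 32))     # batch of 4, width 32
    for i in range(20):
        x = HighwayLayer(32, gate_bias_init=-2.0, seed=i).forward(x)

With the gate biases initialized to a negative value, each layer initially behaves close to the identity, which is exactly the property that lets gradient signals propagate through very deep stacks at the start of training.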
bayesian active model selection with an application to automated audiometry jacob gardner cs cornell university ithaca ny kilian weinberger cs cornell university ithaca ny gustavo malkomes cse wustl louis mo luizgustavo dennis barbour bme wustl louis mo dbarbour roman garnett cse wustl louis mo garnett john cunningham statistics columbia university new york ny abstract we introduce novel approach for active model selection and demonstrate its effectiveness in application although our method can work with arbitrary models we focus on actively learning the appropriate structure for gaussian process gp models with arbitrary observation likelihoods we then apply this framework to rapid screening for hearing loss nihl widespread and preventible disability if diagnosed early we construct gp model for audiometric responses of patients with nihl using this and previously published model for healthy responses the proposed method is shown to be capable of diagnosing the presence or absence of nihl with drastically fewer samples than existing approaches further the method is extremely fast and enables the diagnosis to be performed in real time introduction personalized medicine has long been critical application area for machine learning in which automated decision making and diagnosis are key components beyond improving quality of life machine learning in diagnostic settings is particularly important because collecting additional medical data often incurs significant financial burden time cost and patient discomfort in machine learning one often considers this problem to be one of active feature selection acquiring each new feature blood test incurs some cost but will with hope better inform diagnosis treatment and prognosis by careful analysis we may optimize this trade off however many diagnostic settings in medicine do not involve feature selection but rather involve querying sample space to discriminate different models describing patient attributes particular clarifying example that motivates this work is hearing loss nihl prevalent disorder affecting million adults in the united states alone and affecting over half of workers in particular occupations such as mining and construction most tragically nihl is entirely preventable with simple solutions earplugs the critical requirement for prevention is effective early diagnosis to be tested for nihl patients must complete audiometric exam that presents series of tones at various frequencies and intensities at each tone the patient indicates whether hears the tone from the responses the clinician infers the patient audible threshold on set of discrete frequencies the audiogram this process requires the delivery of up to hundreds of tones audiologists scan the audiogram for hearing deficit with characteristic notch narrow band that can be anywhere in the frequency domain that is indicative of nihl unfortunately at early stages of the disorder notches can be small enough that they are undetectable in standard audiogram leaving many cases undiagnosed until the condition has become severe increasing audiogram resolution would require higher sample counts more presented tones and thus only lengthen an already burdensome procedure we present here better approach note that the nihl diagnostic challenge is not one of feature selection choosing the next test to run and classifying the result but rather of model selection is this patient hearing better described by normal hearing model or notched nihl model here we propose novel active model selection 
algorithm to make the nihl diagnosis in as few tones as possible which directly reflects the time and personnel resources required to make accurate diagnoses in large populations we note that this is problem in the truest sense diagnosis corresponds to selecting between two or more sets of indexed probability distributions models rather than the misnomer of choosing an index from within model hyperparameter optimization in the nihl case this distinction is critical we are choosing between two models the set of possible nihl hearing functions and the set of normal hearing functions this approach suggests very different and more direct algorithm than first learning the most likely nihl function and then accepting or rejecting it as different from normal the standard approach we make the following contributions first we design completely general method based on maximizing the mutual information between the response to tone and the posterior on the model class critically we develop an analytical approximation of this criterion for gaussian process gp models with arbitrary observation likelihoods enabling active structure learning for gps second we extend the work of gardner et al which uses active learning to speed up audiogram inference to the broader question of identifying which or fits given patient finally we develop novel gp prior mean that parameterizes notched hearing loss for nihl patients to our knowledge this is the first publication with an active approach that does not require updating each model for every candidate point allowing audiometric diagnosis of nihl to be performed in real time finally using patient data from clinical trial we show empirically that our method typically automatically detects simulated hearing loss with fewer than query tones this is vastly fewer than the number required to infer conventional audiogram or even an actively learned audiogram highlighting the importance of both the approach and our focus on model selection bayesian model selection we consider supervised learning problems defined on an input space and an output space suppose we are given set of observed data where represents the design matrix of independent variables xi and the associated vector of dependent variables yi xi let be probabilistic model and let be an element of the parameter space indexing given set of observations we wish to compute the probability of being the correct model to explain compared to other models the key quantity of interest to model selection is the model evidence dθ which represents the probability of having generating the observed data under the model marginalized over to account for all possible members of that model under prior given set of candidate models mi and the computed evidence for each we can apply bayes rule to compute the posterior probability of each model given the data mi mi where represents the prior probability distribution over the models active bayesian model selection suppose that we have mechanism for actively selecting new and observing add to our dataset in order to better distinguish the candidate models mi after making this observation we will form an augmented dataset from which we can recompute new model posterior an approach motivated by information theory is to select the location maximizing the mutual information between the observation value and the unknown model em where indicates differential entropy whereas equation is computationally problematic involving costly model retraining the equivalent expression is typically more 
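To fix notation for what follows (these symbols are choices of this rewrite, since the extraction dropped the originals): write $\mathcal{D}$ for the observed data, $\mathcal{M}_i$ for the candidate models, $\theta$ for a model's parameters, and $(x^*, y^*)$ for a candidate location and its yet-unobserved response. The model evidence and the model posterior referred to above are then

\[
p(\mathcal{D}\mid\mathcal{M}) = \int p(\mathbf{y}\mid X, \theta, \mathcal{M})\, p(\theta\mid\mathcal{M})\,\mathrm{d}\theta,
\qquad
p(\mathcal{M}_i\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\mathcal{M}_i)\, p(\mathcal{M}_i)}{\sum_j p(\mathcal{D}\mid\mathcal{M}_j)\, p(\mathcal{M}_j)},
\]

and the active strategy selects the candidate $x^*$ maximizing the mutual information $I(y^*;\mathcal{M}\mid x^*,\mathcal{D})$ between the response at $x^*$ and the unknown model label.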
tractable has been applied fruitfully in various settings and requires only computing the differential entropy of the predictive distribution mi mi and the predictive distributions mi with all models trained with the currently available data in contrast to this does not involve any retraining cost although computing the entropy in might be problematic we note that this is integral that can easily be resolved with quadrature our proposed approach which we call bayesian active model selection bams is then to compute for each candidate location the mutual information between and the unknown model and query where this is maximized arg max related work although active learning and model selection have been widely investigated active model selection has received comparatively less attention ali et al proposed an active learning model selection method that requires cross validation when evaluating each candidate requiring model updates per iteration where is the total budget kulick et al also considered an approach to active model selection suggesting maximizing the expected cross entropy between the current model posterior and the updated distribution this approach also requires extensive model retraining with model updates per iteration to estimate this expectation for each candidate these approaches become prohibitively expensive for applications with large number of candidates in our audiometric experiments for example we consider candidate points expending seconds per iteration whereas these mentioned techniques would take several hours to selected the next point to query active model selection for gaussian processes in the previous section we proposed general framework for performing sequential active bayesian model selection without making any assumptions about the forms of the models mi here we will discuss specific details of our proposal when these models represent alternative structures for gaussian process priors on latent function we assume that our observations are generated via latent function with known observation model where fi xi standard nonparametric bayesian approach with such models is to place gaussian process gp prior distribution on gp where is mean function and is covariance function or kernel we condition on the observed data to form posterior distribution which is typically an updated gaussian process making approximations if necessary we make predictions at new input via the predictive distribution df where the mean and kernel functions are parameterized by hyperparameters that we concatenate into vector and different choices of these hyperparameters imply that the functions drawn from the gp will have particular frequency amplitude and other properties together and define model parametrized by the hyperparameters much attention is paid to learning these hyperparameters in fixed model class sometimes under the unfortunate term model note however that the structural not hyperparameter choices made in the mean function and covariance function themselves are typically done by selecting often blindly from several solutions see for example though also see and this choice has substantial bearing on the resulting functions we can model indeed in many settings choosing the nature of plausible functions is precisely the problem of model selection for example to decide whether the function has periodic structure exhibits nonstationarity etc our goal is to automatically and actively decide these structural choices during gp modeling through intelligent sampling to connect to 
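As a concrete illustration of the selection step, the sketch below scores every candidate location using the tractable form of the criterion: the differential entropy of the model-averaged predictive distribution minus the posterior-weighted entropies of the per-model predictive distributions. It assumes each candidate model supplies a Gaussian approximation to its predictive distribution (in the audiometric application the responses are actually binary, so this is purely illustrative), handles the mixture entropy by simple quadrature as suggested above, and uses argument names (`log_evidence`, `pred_mean`, `pred_var`) that are this sketch's own rather than the paper's.

```python
import numpy as np

def gaussian_entropy(var):
    # differential entropy of a univariate Gaussian with variance `var`
    return 0.5 * np.log(2.0 * np.pi * np.e * var)

def bams_select(log_evidence, pred_mean, pred_var, n_grid=200):
    """Return the index of the candidate maximizing the mutual information
    between the next observation and the unknown model.

    log_evidence : (M,) approximate log model evidences
    pred_mean, pred_var : (M, C) per-model Gaussian predictive moments
        at each of C candidate locations
    """
    # model posterior from the (approximate) evidences, assuming a uniform model prior
    log_post = log_evidence - np.logaddexp.reduce(log_evidence)
    post = np.exp(log_post)

    M, C = pred_mean.shape
    mi = np.empty(C)
    for c in range(C):
        mu, var = pred_mean[:, c], pred_var[:, c]
        # posterior-weighted entropies of the per-model predictives (closed form)
        cond_ent = np.dot(post, gaussian_entropy(var))
        # entropy of the model-averaged (mixture) predictive, by quadrature
        lo = np.min(mu - 6.0 * np.sqrt(var))
        hi = np.max(mu + 6.0 * np.sqrt(var))
        y = np.linspace(lo, hi, n_grid)
        dens = np.zeros(n_grid)
        for m in range(M):
            dens += post[m] * np.exp(-0.5 * (y - mu[m]) ** 2 / var[m]) \
                    / np.sqrt(2.0 * np.pi * var[m])
        marg_ent = -np.trapz(dens * np.log(dens + 1e-300), y)
        mi[c] = marg_ent - cond_ent
    return int(np.argmax(mi))
```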
our active learning formulation let mi be set of gaussian process models for the latent function each model comprises mean function µi covariance function ki and associated hyperparameters θi our approach outlined in section requires the computation of three quantities that are not typically encountered in gp modeling and inference the hyperparameter posterior the model evidence and the predictive distribution where we have marginalized over in the latter two quantities the approaches to gp inference are maximum mle or maximum map estimation where we maximize the hyperparameter posterior arg max log arg max log log typically predictive distributions and other desired quantities are then reported at the mle map hyperparameters implicitly making the assumption that although computationally convenient choice this does not account for uncertainty in the hyperparameters which can be nontrivial with small datasets furthermore accounting correctly for model parameter uncertainty is crucial to model selection where it naturally introduces penalty we discuss approximations to these quantities below approximating the model evidence and hyperparameter posterior the model evidence and hyperparameter posterior distribution are in general intractable for gps as there is no conjugate prior distribution available instead we will use laplace approximation where we make taylor expansion of log around its mode the result is multivariate gaussian approximation log the laplace approximation also results in an approximation to the model evidence log log log log det log where is the dimension of the laplace approximation to the model evidence can be interpreted as rewarding explaining the data well while penalizing model complexity note that the bayesian information criterion bic commonly used for model selection can be seen as an approximation to the laplace approximation approximating the predictive distribution we next consider the predictive distribution dθ df the posterior in is typically known gaussian distribution derived analytically for gaussian observation likelihoods or approximately using standard approximate gp inference techniques however the integral over in is intractible even with gaussian approximation to the hyperparameter posterior as in garnett et al introduced mechanism for approximately marginalizing gp hyperparameters called the mgp which we will adopt here due to its strong empirical performance the mgp assumes using noninformative prior in the case of maximum likelihood that we have gaussian approximation to the hyperparameter posterior we define the posterior predictive mean and variance functions as var the mgp works by making an expansion of the predictive distribution around the posterior mean hyperparameters the nature of this expansion is chosen so as to match various derivatives of the true predictive distribution see for details the posterior distribution of is approximated by σmgp where σmgp the mgp thus inflates the predictive variance from the the posterior mean hyperparameters by term that is commensurate with the uncertainty in measured by the posterior covariance and the dependence of the latent predictive mean and variance on measured by the gradients and with the gaussian approximation in the integral in now reduces to integrating the observation likelihood against univariate gaussian this integral is often analytic and at worse requires quadrature implementation given the development above we may now efficiently compute an approximation to the bams criterion for active 
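For completeness, the Laplace approximation referred to above can be written out explicitly in its standard form (notation again chosen here): let $\hat{\theta}$ be the mode of the hyperparameter posterior, $d$ its dimension, and $\Sigma^{-1} = -\nabla_\theta^2 \log p(\theta\mid\mathcal{D},\mathcal{M})\big|_{\theta=\hat{\theta}}$. Then

\[
p(\theta\mid\mathcal{D},\mathcal{M}) \approx \mathcal{N}(\theta;\, \hat{\theta},\, \Sigma),
\qquad
\log p(\mathcal{D}\mid\mathcal{M}) \approx \log p(\mathcal{D}\mid\hat{\theta},\mathcal{M}) + \log p(\hat{\theta}\mid\mathcal{M}) + \tfrac{d}{2}\log 2\pi + \tfrac{1}{2}\log\det\Sigma .
\]

The first term rewards explaining the data well, the remaining terms penalize model complexity, and the BIC mentioned in the text arises as a large-sample simplification of this expression.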
gp model selection given currently observed data for each of our candidate models mi we first find the laplace approximation to the hyperparameter posterior and model evidence given the approximations to the model evidence we may compute an approximation to the model posterior suppose we have set of candidate points from which we may select our next for each of our models we compute the mgp approximation to the latent point from posteriors which we use standard techniques to compute the predictive distributions mi finally with the ability to compute the differential entropies of these predictive distributions as well as the marginal predictive distribution we may compute the mutual information of each candidate in parallel see the appendix for explicit formulas for common likelihoods and description of reusable code we will release in conjunction with this manuscript to ease implementation audiometric threshold testing standard audiometric tests are calibrated such that the average human subject has chance of hearing tone at any frequency this empirical unit of intensity is defined as db hl humans give binary reports whether or not tone was heard in response to stimuli and these observations are inherently noisy typical audiometric tests present tones in predefined order on grid in increments of db hl at each of six octaves recently gardner et al demonstrated that bayesian active learning of patient audiometric function significantly improves the in terms of accuracy and number of stimuli required however learning patient entire audiometric function may not always be necessary audiometric testing is frequently performed on otherwise young and healthy patients to detect hearing loss nihl hearing loss occurs when an otherwise healthy individual is habitually subjected to sound this can result in sharp hearing loss in narrow sometimes less than one octave frequency range early detection of nihl is critical to desirable clinical outcomes so screenings of susceptible populations for example factory workers is commonplace hearing loss is difficult to diagnose with standard audiometry because grid must be very fine to ensure that notch is detected the full audiometric test of gardner et al may also be inefficient if the only goal of testing is to determine whether notch is present as would be the case for screening we cast the detection of hearing loss as an active model selection problem we will describe two gaussian process models of audiometric functions baseline model of normal human this is arbitrary and need not be the laplace approximation in so this is slight abuse of notation hearing and model reflecting nihl we then use the bams framework introduced above to as rapidly as possible for given patient determine which model best describes his or her hearing model to model healthy patient audiometric function we use the model described in the gp prior proposed in that work combines constant prior mean µhealthy modeling natural threshold with kernel taken to be the sum of two components linear covariance in intensity and covariance in frequency let represent tone stimulus with representing its intensity and its frequency we define exp where weight each component and is length scale of random deviations from constant hearing threshold this kernel encodes two fundamental properties of human audiologic response first hearing is monotonic in intensity the linear contribution ensures that the posterior probability of detecting fixed frequency will be monotonically increasing after conditioning on 
few tones second human hearing ability is locally smooth in frequency because nearby locations in the cochlea are mechanically coupled the combination of µhealthy with specifies our healthy model mhealthy with parameters θhealthy hearing loss model we extend the model above to create second gp model reflecting localized hearing deficit characteristic of nihl we create novel flexible prior mean function for this purpose the parameters of which specify the exact nature of the hearing loss our proposed notch mean is µnihl where denotes the unnormalized normal probability density function with mean and standard deviation which we scale by depth parameter to reflect the prominence of the hearing loss this contribution results in localized subtractive notch feature with tunable center width and height we retain constant offset to revert to the model outside the vicinity of the localized hearing deficit note that we completely model the effect of nihl on patient responses with this mean notch function the kernel above remains appropriate the combination of µnihl with specifies our nihl model mnihl with in addition to the parameters of our healthy model the additional parameters θnihl results to test bams on our nihl detection task we evaluate our algorithm using audiometric data comparing to several baselines from the results of clinical trial we have examples of audiometric functions inferred for several human patients using the method of gardner et al we may use these to simulate audiometric examinations of healthy patients using different methods to select tone presentations we simulate patients with nihl by adjusting ground truth inferred from nine healthy patients with samples from our notch mean prior recall that audiogram data is extremely scarce we first took thorough audiometric test of each of nine patients from our trial with normal hearing using samples selected using the algorithm in on the domain hz db typical ranges for audiometric testing we inferred the audiometric function over the entire domain from the measured responses using the gp model mhealthy with parameters learned via mle ii inference the observation model was where is the standard normal cdf and approximate gp inference was performed via laplace approximation we then used the approximate gp posterior mhealthy for this patient as for simulating healthy patient responses the posterior probability of tone detection learned from one patient is shown in the background of figure we simulated healthy patient response to given query tone by sampling conditionally independent bernoulli random variable with parameter mhealthy we simulated patient with nihl by then drawing notch parameters the parameters of from an prior adding the corresponding notch to the learned healthy latent mean recomputing the detection probabilities and proceeding as above example nihl detection probabilities generated in this manner are depicted in the background of figure inference was done in domain intensity db hl frequency hz frequency hz normal hearing model ground truth notch model ground truth figure samples selected by bams red and the method of gardner et al white when run on the ground truth and the nihl model ground truth contours denote probability of detection at intervals circles indicate presentations that were heard by the simulated patient exes indicate presentations that were not heard by the simulated patient diagnosing nihl to test our active approach to diagnosing nihl we simulated series of audiometric tests selecting tones using 
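A brief sketch of the two prior mean functions and the covariance described above may help make the models concrete. The exact form of the frequency component of the kernel is not recoverable from the text, so the squared-exponential used here is an assumption, as are all parameter names and default values; the Bernoulli response generator mirrors the probit observation model of the simulation set-up.

```python
import numpy as np
from scipy.stats import norm

def healthy_kernel(x1, x2, w_lin=1.0, w_freq=1.0, ell=1.0):
    """Sum of a linear covariance in intensity and a smooth covariance in
    frequency; x = (intensity, frequency). The squared-exponential form in
    frequency is an assumption of this sketch."""
    i1, f1 = x1[:, None, 0], x1[:, None, 1]
    i2, f2 = x2[None, :, 0], x2[None, :, 1]
    return w_lin * i1 * i2 + w_freq * np.exp(-0.5 * (f1 - f2) ** 2 / ell ** 2)

def mean_healthy(x, c=0.0):
    # constant prior mean modeling a flat natural hearing threshold
    return np.full(len(x), c)

def mean_nihl(x, c=0.0, depth=1.0, center=4000.0, width=500.0):
    """Healthy mean minus an unnormalized Gaussian 'notch' in frequency,
    with tunable center, width and depth (parameter names are illustrative)."""
    f = x[:, 1]
    return c - depth * np.exp(-0.5 * (f - center) ** 2 / width ** 2)

def simulate_response(latent_f, rng):
    # Bernoulli tone-detection response through a probit link
    p_detect = norm.cdf(latent_f)
    return rng.random(len(latent_f)) < p_detect
```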
three alternatives bams the algorithm of and random each algorithm shared candidate set of quasirandom tones generated using scrambled halton set so as to densely cover the search space we simulated nine healthy patients and total of patients exhibiting range of nihl presentations using independent draws from our notch mean prior in the latter case for each audiometric test simulation we initialized with five random tones then allowed each algorithm to actively select maximum of additional tones very small fraction of the hundreds typically used in regular audiometric test we repeated this procedure for each of our nine healthy patients using the model we further simulated for each patient three separate presentations of nihl as described above we plot the posterior probability of the correct model after each iteration for each method in figure in all runs with both models bams was able to rapidly achieve greater than confidence in the correct model without expending the entire budget although all methods correctly inferred high healthy posterior probability for the healthy patient bams wass more confident for the nihl patients neither baseline inferred the correct model whereas bams rarely required more than actively chosen samples to confidently make the correct diagnosis note that when bams was used on nihl patients there was often an initial period during which the healthy model was favored followed by rapid shift towards the correct model this is because our method penalizes the increased complexity of the notch model until sufficient evidence for notch is acquired figure shows the samples selected by bams for typical healthy and nihl patients the fundamental strategy employed by bams in this application is logical it samples in row of relatively highintensity tones the intuition for this design is that failure to recognize normally heard sound is strong evidence of notch deficit once the notch has been found figure bams continues to sample within the notch to confirm its existence and rule out the possibility of the miss tone not heard being due to the stochasticity of the process once satisfied the bams approach then samples on the periphery of the notch to further solidify its belief the bams algorithm sequentially makes observations where the healthy and nihl model disagree the most typically in the of the map notch location the exact intensity at which bams samples is determined by the prior over the parameter when we changed the notch depth prior to support shallower or deeper notches data not shown bams sampled at lower or we also compared with uncertainty sampling and query by committee qbc the performance was comparable to random sampling and is omitted for clarity pr correct model bams random mhealthy iteration iteration notch model ground truth normal hearing model ground truth figure posterior probability of the correct model as function of iteration number higher intensities respectively to continue to maximize model disagreement similarly the spacing between samples is controlled by the prior over the parameter finally it is worth emphasizing the stark difference between the sampling pattern of bams and the audiometric tests of see figure indeed when the goal is learning the patient audiometric function the audiometric testing algorithm proposed in that work typically has very good estimate after samples however when using bams the primary goal is to detect or rule out nihl as result the samples selected by bams reveal little about the nuances of the patient audiometric 
function while being highly informative about the correct model to explain the data this is precisely the tradeoff one seeks in diagnostic setting highlighting the critical importance of focusing on the problem directly conclusion we introduced novel approach for active model selection bayesian active model selection and successfully applied it to rapid screening for hearing loss our method for active model selection does not require model retraining to evaluate candidate points making it more feasible than previous approaches further we provided an effective and efficient analytic approximation to our criterion that can be used for automatically learning the model class of gaussian processes with arbitrary observation likelihoods rich and commonly used class of potential models acknowledgments this material is based upon work supported by the national science foundation nsf under award number additionally jrg and kqw are supported by nsf grants and gm is supported by db acknowledges nih grant as well as the cimit jpc acknowledges the sloan foundation references kononenko machine learning for medical diagnosis history state of the art and perspective artificial intelligence in medicine saria rajani gould koller and penn integration of early physiological responses predicts later illness severity in preterm infants science translational medicine bailey chen mao lu hackmann micek heard faulkner and kollef trial of alert for clinical deterioration in patients hospitalized on general medical wards journal of hospital medicine shargorodsky curhan curhan and eavey change in prevalence of hearing loss in us adolescents journal of the american medical association carhart and jerger preferred method for clinical determination of thresholds journal of speech and hearing disorders hughson and westlake manual for program outline for rehabilitation of aural casualties both military and civilian transactions of the american academy of ophthalmology and otolaryngology supplement don eggermont and brackmann reconstruction of the audiogram using brain stem responses and noise masking annals of otology rhinology and part supplement gardner song weinberger barbour and cunningham psychophysical detection testing with bayesian active learning in uai mackay information theory inference and learning algorithms cambridge university press houlsby huszar ghahramani and collaborative gaussian processes for preference learning in nips pages garnett osborne and hennig active learning of linear embeddings for gaussian processes in uai pages hoffman and ghahramani predictive entropy search for efficient global optimization of functions in nips pages houlsby and ghahramani active learning with robust ordinal matrix factorization in icml pages ali caruana and kapoor active learning with model selection in aaai kulick lieck and toussaint active learning of hyperparameters an expected cross entropy criterion for active model selection corr rasmussen and williams gaussian processes for machine learning mit press duvenaud automatic model construction with gaussian processes phd thesis computational and biological learning laboratory university of cambridge duvenaud lloyd grosse tenenbaum and ghahramani structure discovery in nonparametric regression through compositional kernel search in icml pages wilson gilboa nehorai and cunningham fast kernel learning for multidimensional pattern extrapolation in nips pages williams and rasmussen gaussian processes for regression in nips raftery approximate bayes factors and accounting 
for model uncertainty in generalised linear models biometrika kuha aic and bic comparisons of assumptions and performance sociological methods and research schwarz estimating the dimension of model annals of statistics murphy machine learning probabilistic perspective mit press kuss and rasmussen assessing approximate inference for binary gaussian process classification journal of machine learning research minka expectation propagation for approximate bayesian inference in uai pages mcbride and williams audiometric notch as sign of noise induced hearing loss occupational environmental medicine nelson nelson and fingerhut the global burden of occupational hearing loss american journal of industrial medicine 
particle gibbs for infinite hidden markov models nilesh tripuraneni university of cambridge shixiang gu university of cambridge mpi for intelligent systems hong ge university of cambridge zoubin ghahramani university of cambridge zoubin abstract infinite hidden markov models ihmm are an attractive nonparametric generalization of the classical hidden markov model which can automatically infer the number of hidden states in the system however due to the nature of the transition dynamics performing inference in the ihmm is difficult in this paper we present an particle gibbs pg algorithm to resample state trajectories for the ihmm the proposed algorithm uses an efficient proposal optimized for ihmms and leverages ancestor sampling to improve the mixing of the standard pg algorithm our algorithm demonstrates significant convergence improvements on synthetic and real world data sets introduction hidden markov models hmm are among the most widely adopted models used to model datasets in the statistics and machine learning communities they have also been successfully applied in variety of domains including genomics language and finance where sequential data naturally arises rabiner bishop one possible disadvantage of the space hmm framework is that one must specify the number of latent states standard model selection techniques can be applied to the finite hmm but bear high computational overhead since they require the repetitive training exploration of many hmm of different sizes bayesian nonparametric methods offer an attractive alternative to this problem by adapting their effective model complexity to fit the data in particular beal et al constructed an hmm over countably infinite using hierarchical dirichlet process hdp prior over the rows of the transition matrix various approaches have been taken to perform full posterior inference over the latent states transition emission distributions and hyperparameters since it is impossible to directly apply the algorithm due to the size of the state space the original gibbs sampling approach proposed in teh et al suffered from slow mixing due to the strong correlations between nearby time steps often present in data scott however van gael et al introduced set of auxiliary slice variables to dynamically truncate the state space to be finite referred to as beam sampling allowing them to use dynamic programming to jointly resample the latent states thus circumventing the problem despite the power of the scheme fox et al found that application of the beam sampler to the sticky ihmm resulted in slow mixing relative to an inexact blocked sampler due to the introduction of auxiliary slice variables in the sampler equal contribution the main contributions of this paper are to derive an pg algorithm for the ihmm using the construction for the hdp and constructing an optimal importance proposal to efficiently resample its latent state trajectories the proposed algorithm is compared to existing inference algorithms for ihmms and empirical evidence suggests that the infinitestate pg algorithm consistently outperforms its alternatives furthermore by construction the time complexity of the proposed algorithm is here denotes the length of the sequence denotes the number of particles in the pg sampler and denotes the number of active states in the model despite the simplicity of sampler we find in variety of synthetic and experiments that these particle methods dramatically improve convergence of the sampler while being more scalable we will first define the ihmm 
in section and review the dirichlet process dp and hierarchical dirichlet process hdp in our appendix then we move onto the description of our mcmc sampling scheme in section in section we present our results on variety of synthetic and datasets model and notation infinite hidden markov models we can formally define the ihmm we review the theory of the hdp in our appendix as follows gem iid iid dp φj st cat yt here is the shared dp measure defined on integers here st are the latent states of the ihmm yt are the observed data and φj parametrizes the emission distribution usually and are chosen to be conjugate to simplify the inference can be interpreted as the prior mean for transition probabilities into state with governing the variability of the prior mean across the rows of the transition matrix the controls how concentrated or diffuse the probability mass of will be over the states of the transition matrix to connect the hdp with the ihmm note that given draw from the hdp gk kk we identify with the transition probability from state to state where parametrize the emission distributions note that fixing implies only transitions between the first states of the transition matrix are ever possible leaving us with the finite bayesian hmm if we define finite hierarchical bayesian hmm by drawing dir dir αβ with joint density over the states as pφ st fφ yt then after taking the hierarchical prior in equation approaches the hdp figure graphical model for the sticky setting recovers the prior and emission distribution specification the hyperparameter governs the variability of the prior mean across the rows of the transition matrix and controls how concentrated or diffuse the probability mass of will be over the states of the transition matrix however in the we have each row of the transition matrix is drawn as dp thus the hdp prior doesn differentiate from jumps between different states this can be especially problematic in the setting since state persistence in data can lead to the creation of unnecessary extra states and unrealistically rapid switching dynamics in our model in fox et al this problem is addressed by including bias parameter into the distribution of transitioning probability vector πj αβ κδj dp to incorporate prior beliefs that smooth dynamics are more probable such construction only involves the introduction of one further hyperparameter which controls the stickiness of the transition matrix note similar was explored in beal et al for the standard ihmm most approaches to inference have placed vague gamma on the and which can be resampled efficiently as in teh et al similarly in the sticky ihmm in order to maintain tractable resampling of fox et al chose to place vague gamma priors on and beta prior on in this work we follow teh et al fox et al and place priors gamma aγ bγ gamma as bs and beta aκ bκ on the we consider two conjugate emission models for the output states of the ihmm multinomial emission distribution for discrete data and normal emission distribution for continuous data for discrete data we choose φk dir αφ with φst cat for continuous data we choose φk ig αφ βφ with φst posterior inference for the ihmm let us first recall the collection of variables we need to sample is shared dp base measure is the transition matrix acting on the latent states while φk parametrizes the emission distribution we can then resample the variables of the ihmm in series of gibbs steps step step step step step sample sample sample sample sample due to the strongly correlated nature of data 
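For reference, the generative model sketched above is, in its standard form (with $\beta$ the shared DP measure, $\pi_j$ the transition rows, $\phi_j$ the emission parameters and $H$ the emission-parameter prior):

\[
\beta \mid \gamma \sim \mathrm{GEM}(\gamma), \qquad
\pi_j \mid \alpha, \beta \overset{\text{iid}}{\sim} \mathrm{DP}(\alpha, \beta), \qquad
\phi_j \overset{\text{iid}}{\sim} H,
\]
\[
s_t \mid s_{t-1} \sim \mathrm{Cat}\!\left(\pi_{s_{t-1}}\right), \qquad
y_t \mid s_t \sim F(\phi_{s_t}),
\]

and in the sticky variant of Fox et al. the rows become $\pi_j \sim \mathrm{DP}\!\big(\alpha + \kappa,\ (\alpha\beta + \kappa\delta_j)/(\alpha + \kappa)\big)$, so that self-transitions receive extra prior mass controlled by $\kappa$.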
resampling the latent hidden states in step is often the most difficult since the other variables can be sampled via the gibbs sampler once sample of has been obtained in the following section we describe novel efficient sampler for the latent states of the ihmm and refer the reader to our appendix and teh et al fox et al for detailed discussion on steps for sampling variables infinite state particle gibbs sampler within the particle mcmc framework of andrieu et al sequential monte carlo or particle filtering is used as complex proposal for the algorithm the particle gibbs sampler is conditional smc algorithm resulting from clamping one particle to an apriori fixed trajectory in particular it is transition kernel that has as its stationary distribution the key to constructing generic sampler for the ihmm to resample the latent states is to note that the finite number of particles in the sampler are localized in the latent space to finite subset of the infinite set of possible states moreover they can only transition to finitely many new states as they are propagated through the forward pass thus the infinite measure and infinite transition matrix only need to be instantiated to support the number of active states defined as being in the state space in the particle gibbs algorithm if particle transitions to state outside the active set the objects and can be lazily expanded via the constructions derived for both objects in teh et al and stated in equations and thus due to the properties of both the construction and the pgas kernel this resampling procedure will leave the target distribution invariant below we first describe our particle gibbs algorithm for the ihmm then detail our notation we provide further background on smc in our supplement step for iteration initialize as sample for initialize weights for step for iteration use trajectory from and sample the index cat of the ancestor of particle for ai sample sit qt for if sit then create new state using the construction for the hdp sample new transition probability vector dir αβ ii use construction to iteratively expand as iid beta πk iii expand transition probability vectors to include transitions to state via the hdp construction as πj where beta βk βl πk iv sample new emission parameter compute the ancestor weights and resample an as an recompute and normalize particle weights using ai ai wt sit sit yt sit sit wt sit wt sit wt sit step sample with wti and return in the particle gibbs sampler at each step weighted particle system sit wti serves as an empirical approximation to the distribution with the variables ait denoting the ancestor particles of sit here we have used st to denote the latent transition distribution yt the emission distribution and the prior over the initial state more efficient importance proposal qt in the pg algorithm described above we have choice of the importance sampling density qt to ai ai use at every time step the simplest choice is to sample from the prior qt sit which can lead to satisfactory performance when then observations are not too informative and the dimension of the latent variables are not too large however using the prior as importance proposal in particle mcmc is known to be suboptimal in order to improve the mixing rate of the sampler it ai ai is desirable to sample from the partial posterior qt sit yt whenever possible an an in general sampling from the posterior qt snt yt may be impossible but in the ihmm we can show that it is analytically tractable to see this note that we have lazily 
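The lazy expansion of $\beta$ and of the transition rows described above can be sketched as follows; it follows the stick-breaking constructions cited in the text, but the array layout (last entry of each vector holds the remaining, uninstantiated mass) and all names are choices of this sketch rather than of the paper.

```python
import numpy as np

def expand_state(beta, pi, alpha, gamma, rng):
    """Lazily add one state to the HDP-HMM representation.

    beta : (K+1,) shared DP weights for the K instantiated states,
           with the remaining stick mass as the last entry
    pi   : (K, K+1) transition rows, each ending with its remainder mass
    Returns the expanded (beta, pi).
    """
    # 1. break the new state's weight off the shared stick
    b = rng.beta(1.0, gamma)
    beta_new, beta_rest = b * beta[-1], (1.0 - b) * beta[-1]
    beta = np.concatenate([beta[:-1], [beta_new, beta_rest]])

    # 2. split each existing row's remainder mass between the new state
    #    and the states that remain uninstantiated
    rows = []
    for j in range(pi.shape[0]):
        bj = rng.beta(alpha * beta_new, alpha * beta_rest)
        rows.append(np.concatenate(
            [pi[j, :-1], [bj * pi[j, -1], (1.0 - bj) * pi[j, -1]]]))

    # 3. draw a full transition row for the newly created state from Dir(alpha * beta)
    rows.append(rng.dirichlet(alpha * beta))
    return beta, np.vstack(rows)
```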
represented as finite vector moreover we can easily evaluate the likelihood ytn for all snt however if snt we need to compute ytn ytn dφ if and are conjugate we can analytically compute the marginal likelihood of the state but this can also be approximated by monte carlo sampling for likelihoods see neal for more detailed sion of this argument thus we can compute yt yt φk for each particle snt where we investigate the impact of posterior prior proposals in figure based on the convergence of the number of states and joint we can see that sampling from the posterior improves the mixing of the sampler indeed we see from the prior sampling experiments that increasing the number of particles from to does seem to marginally improve the mixing the sampler but have found particles sufficient to obtain good results however we found no appreciable gain when increasing the number of particles from to when sampling from the posterior and omitted the curves for clarity it is worth noting that the pg sampler with ancestor resampling does still perform reasonably even when sampling from the prior improving mixing via ancestor resampling it has been recognized that the mixing properties of the pg kernel can be poor due to path degeneracy lindsten et variant of pg that is presented in lindsten et al attempts to address this problem for any model with modification resample new value for the variable an in an ancestor sampling step at every time step which can significantly improve the mixing of the pg kernel with little extra computation in the case of markovian systems to understand ancestor sampling for consider the reference trajectory ranging from the current time step to the final time now artificially assign candidate history to this partial path by connecting to one of the other particles history up until that point which can be achieved by simply assigning new value to the variable at to do this we first compute the weights pt then an is sampled according to at remarkably this ancestor sampling step leaves the density invariant as shown in lindsten et al for arbitrary nonmarkovian models however since the infinite hmm is markovian we can show the computation of the ancestor sampling weights simplifies to note that the ancestor sampling step does not change the time complexity of the infinitestate pg sampler resampling and our resampling scheme for and will follow straightforwardly from this scheme in fox et al teh et al we present review of their methods and related work in our appendix for completeness empirical study in the following experiments we explore the performance of the pg sampler on both the ihmm and the sticky ihmm note that throughout this section we have only taken and particles for the pg sampler which has time complexity when sampling from the posterior compared to the time complexity of of the beam sampler for completeness we also compare to the gibbs sampler which has been shown perform worse than the beam sampler van gael et due to strong correlations in the latent states convergence on synthetic data to study the mixing properties of the pg sampler on the ihmm and sticky ihmm we consider two synthetic examples with strongly positively correlated latent states first as in van gael et al jll pg beam truth pgas beam pgas gibbs beam gibbs iteration iteration figure learned latent transition matrices for the pg sampler and beam sampler vs ground truth transition matrix for gibbs sampler omitted for clarity pg correctly recovers strongly correlated matrix while the beam sampler 
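To make the proposal concrete, here is a minimal sketch of the per-particle sampling step and of the Markovian simplification of the ancestor-sampling weights discussed above. The marginal likelihood for a brand-new state is passed in precomputed (analytic under conjugacy, or estimated by Monte Carlo as the text notes), the reduction of the ancestor weights to a single transition probability is the standard result for Markovian models rather than a formula recovered from the text, and all names are illustrative.

```python
import numpy as np

def posterior_proposal(pi_row, log_lik, log_marg_lik, rng):
    """Sample s_t for one particle from q(s_t = k) proportional to
    pi[s_{t-1}, k] * p(y_t | phi_k).

    pi_row       : (K+1,) transition probabilities out of the particle's
                   current state; the last entry is the uninstantiated mass
    log_lik      : (K,) log p(y_t | phi_k) for the K instantiated states
    log_marg_lik : scalar, log of the marginal likelihood of y_t under the
                   base measure, used when a brand-new state is proposed
    Returns the sampled state (index K means "create a new state") and the
    log normalizing constant, which enters the incremental particle weight.
    """
    log_q = np.log(pi_row) + np.append(log_lik, log_marg_lik)
    log_norm = np.logaddexp.reduce(log_q)
    probs = np.exp(log_q - log_norm)
    return int(rng.choice(len(probs), p=probs)), log_norm

def ancestor_log_weights(log_w_prev, pi, s_prev, ref_state):
    """Ancestor-sampling weights for splicing in the reference trajectory:
    for a Markovian model they reduce to w_{t-1}^n * p(s'_t | s_{t-1}^n)."""
    return log_w_prev + np.log(pi[s_prev, ref_state])
```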
supports extra spurious states in the latent space figure comparing the performance of the pg sampler pg sampler on sticky ihmm beam sampler and gibbs sampler on inferring data from state strongly correlated hmm left number of active states iterations right likelihood iterations best viewed in color we generate sequences of length from state hmm with probability of and residual probability mass distributed uniformly over the remaining states where the emission distributions are taken to be normal with fixed standard deviation and emission means of for the states the base distribution for the ihmm is taken to be normal with mean and standard deviation and we initialized the sampler with active states in the case we see in figure that the pg sampler applied to both the ihmm and the sticky ihmm converges to the true value of much quicker than both the beam sampler and gibbs sampler uncovering the model dimensionality and structure of the transition matrix by more rapidly eliminating spurious active states from the space as evidenced in the learned transition matrix plots in figure moreover as evidenced by the joint in figure we see that the pg sampler applied to both the ihmm and the sticky ihmm converges quickly to good mode while the beam sampler has not fully converged within iterations and the gibbs sampler is performing poorly to further explore the mixing of the pg sampler the beam sampler we consider similar inference problem on synthetic data over larger state space we generate data from sequences of length from state hmm with probability of and residual probability mass distributed uniformly over the remaining states and take the emission distributions to be normal with fixed standard deviation and means equally spaced apart between and the base distribution for the ihmm is also taken to be normal with mean and standard deviation the samplers were initialized with and states to explore the convergence and robustness of the pg sampler the beam sampler jll figure comparing the performance of the pg sampler beam sampler on inferring data from state strongly correlated hmm with different initializations left number of active states from different initial iterations right likelihood from different initial iterations jll figure influence of posterior prior proposal and number of particles in pg sampler on ihmm left number of active states from different initial numbers of particles and prior posterior proposal iterations right likelihood from different initial numbers of particles and prior posterior proposal iterations as observed in figure we see that the pg sampler applied to the ihmm and sticky ihmm converges far more quickly from both small and large initialization of and active states to the true value of hidden states as well as converging in jll more quickly indeed as noted in fox et al the introduction of the extra slice variables in the beam sampler can inhibit the mixing of the sampler since for the beam sampler to consider transitions with low prior probability one must also have sampled an unlikely corresponding slice variable so as not to have truncated that state out of the space this can become particularly problematic if one needs to consider several of these transitions in succession we believe this provides evidence that the particle gibbs sampler presented here which does not introduce extra slice variables is mixing better than beam sampling in the ihmm ion channel recordings for our first real dataset we investigate the behavior of the pg sampler and beam sampler on 
an ion channel recording in particular we consider recording from rosenstein et al of single alamethicin channel previously investigated in palla et al we subsample the time series by factor of truncate it to be of length and further log transform and normalize it we ran both the beam and pg sampler on the ihmm for iterations until we observed convergence in the joint due to the large fluctuations in the observed time series the beam sampler infers the number of active hidden states to be while the pg sampler infers the number of active hidden states to be however in figure we see that beam sampler infers solution for the latent states which rapidly oscillates between subset of likely states during temporal regions which intuitively seem to be better explained by single state however the pg sampler has converged to mode which seems to better represent the latent transition dynamics and only seems to infer extra states in the regions of large fluctuation indeed this suggests that the beam sampler is mixing worse with respect to the pg sampler beam latent states pg latent states figure left observations colored by an inferred latent trajectory using beam sampling inference right observations colored by an inferred latent state trajectory using pg inference alice in wonderland data for our next example we consider the task of predicting sequences of letters taken from alice adventures in wonderland we trained an ihmm on the characters from the first chapter of the book and tested on subsequent characters from the same chapter using multinomial emission model for the ihmm once again we see that the pg sampler applied to the ihmm converges quickly in joint to mode where it stably learns value of as evidenced in figure though the performance of the pg and beam samplers appear to be roughly comparable here we would like to highlight two observations firstly the inferred value of obtained by the pg sampler quickly converges independent of the initialization in the rightmost of figure however the beam sampler prediction for the number of active states still appears to be decreasing and more rapidly fluctuating than both the ihmm and sticky ihmm as evidenced by the error bars in the middle plot in addition to being quite sensitive to the initialization as shown in the rightmost plot based on the previous synthetic experiment section and this result we suspect that although both the beam sampler and pg sampler are quickly converging to good solutions as evidenced the training joint the beam sampler is learning transition matrix with unnecessary spurious active states next we calculate the predictive of the alice jll pgas pgas pgas pgas beam iteration iteration figure left comparing the joint iterations for the pg sampler and beam sampler middle comparing the convergence of the active number of states for the ihmm and sticky ihmm for the pg sampler and beam sampler right trace plots of the number of states for different initializations for in wonderland test data averaged over different realizations and find that the pg sampler with particles achieves predictive of while the beam sampler achieves predictive of showing the pg sampler applied to the ihmm and sticky ihmm learns hyperparameter and latent variable values that obtain better predictive performance on the dataset we note that in this experiment as well we have only found it necessary to take particles in the pg sampler achieve good mixing and empirical performance although increasing the number of particles to does improve the convergence of 
the sampler in this instance given that the pg sampler has time complexity of for single pass while the beam sampler and truncated methods have time complexity of for single pass we believe that the pg sampler is competitive alternative to the beam sampler for the ihmm discussions and conclusions in this work we derive new inference algorithm for the ihmm using the particle mcmc framework based on the construction for the hdp we also develop an efficient proposal inside pg optimized for ihmm to efficiently resample the latent state trajectories for ihmm the proposed algorithm is empirically compared to existing inference algorithms for ihmms and shown to be promising because it converges more quickly and robustly to the true number of states in addition to obtaining better predictive performance on several synthetic and realworld datasets moreover we argued that the pg sampler proposed here is competitive alternative to the beam sampler since the time complexity of the particle samplers presented is versus the of the beam sampler another advantage of the proposed method is the simplicity of the pg algorithm which doesn require truncation or the introduction of auxiliary variables also making the algorithm easily adaptable to challenging inference tasks in particular the pg sampler can be directly applied to the sticky with dp emission model considered in fox et al for which no sampler exists we leave this development and application as an avenue for future work references andrieu doucet and holenstein particle markov chain monte carlo methods journal of the royal statistical society series statistical methodology beal ghahramani and rasmussen the infinite hidden markov model in advances in neural information processing systems pages bishop pattern recognition and machine learning volume springer new york fox sudderth jordan and willsky an for systems with state persistence in proceedings of the international conference on machine learning pages acm lindsten jordan and particle gibbs with ancestor sampling the journal of machine learning research neal markov chain sampling methods for dirichlet process mixture models journal of computational and graphical statistics palla knowles and ghahramani reversible infinite hmm using normalised random measures arxiv preprint rabiner tutorial on hidden markov models and selected applications in speech recognition proceedings of the ieee rosenstein ramakrishnan roseman and shepard single ion channel recordings with lipid membranes nano letters scott bayesian methods for hidden markov models journal of the american statistical association teh jordan beal and blei hierarchical dirichlet processes journal of the american statistical association van gael saatci teh and ghahramani beam sampling for the infinite hidden markov model in proceedings of the international conference on machine learning volume 
learning spatiotemporal trajectories from longitudinal data olivier stanley aramis lab inria paris inserm cnrs umr sorbonne upmc univ paris umr institut du cerveau et de la moelle icm paris france cmap ecole polytechnique palaiseau france abstract we propose bayesian model to learn typical scenarios of changes from longitudinal data namely repeated measurements of the same objects or individuals at several points in time the model allows to estimate trajectory in the space of measurements random variations of this trajectory result from spatiotemporal transformations which allow changes in the direction of the trajectory and in the pace at which trajectories are followed the use of the tools of riemannian geometry allows to derive generic algorithm for any kind of data with smooth constraints which lie therefore on riemannian manifold stochastic approximations of the algorithm is used to estimate the model parameters in this highly setting the method is used to estimate model of the progressive impairments of cognitive functions during the onset of alzheimer disease experimental results show that the model correctly put into correspondence the age at which each individual was diagnosed with the disease thus validating the fact that it effectively estimated normative scenario of disease progression random effects provide unique insights into the variations in the ordering and timing of the succession of cognitive impairments across different individuals introduction brain diseases such as parkinson or alzheimer disease ad are complex diseases with multiple effects on the metabolism structure and function of the brain models of disease progression showing the sequence and timing of these effects during the course of the disease remain largely hypothetical large databases have been collected recently in the hope to give experimental evidence of the patterns of disease progression based on the estimation of models these databases are longitudinal in the sense that they contain repeated measurements of several subjects at multiple but which do not necessarily correspond across subjects learning models of disease progression from such databases raises great methodological challenges the main difficulty lies in the fact that the age of given individual gives no information about the stage of disease progression of this individual the onset of clinical symptoms of ad may vary from forty and eighty years of age and the duration of the disease from few years to decades moreover the onset of the disease does not correspond with the onset of the symptoms according to recent studies symptoms are likely to be preceded by silent phase of the disease for which little is known as consequence statistical models based on the regression of measurements with age are inadequate to model disease progression the set of the measurements of given individual at specific belongs to highdimensional space building model of disease progression amounts to estimating continuous trajectories in this space and average those trajectories among group of individuals trajectories need to be registered in space to account for the fact that individuals follow different trajectories and in time to account for the fact that individuals even if they follow the same trajectory may be at different position on this trajectory at the same age the framework of models seems to be well suited to deal with this hierarchical problem models for longitudinal measurements were introduced in the seminal paper of laird and ware and have been 
widely developed since then see for instance however this kind of models suffers from two main drawbacks regarding our problem these models are built on the estimation of the distribution of the measurements at given time point in many situations this reference time is given by the experimental set up date at which treatment begins date of seeding in studies of plant growth etc in studies of ageing using these models would require to register the data of each individual to common stage of disease progression before being compared unfortunately this stage is unknown and such temporal registration is actually what we wish to estimate another limitation of usual models is that they are defined for data lying in euclidean spaces however measurements with smooth constraints usually can not be summed up or scaled such as normalized scores of neurospychological tests positive definite symmetric matrices shapes encoded as images or meshes these data are naturally modeled as points on riemannian manifolds although the development of statistical models for data is blooming topic the construction of statistical models for longitudinal data on manifold remains an open problem the concept of was introduced in to allow for temporal registration of trajectories of shape changes nevertheless the combination of the with the intrinsic variability of shapes across individuals is done at the expense of simplifying approximation the variance of shapes does not depend on time whereas it should adapt with the average scenario of shape changes moreover the estimation of the parameters of the statistical model is made by minimizing sum of squares which results from an uncontrolled likelihood approximation in are used to define metric between curves that are invariant under time reparameterization this invariance by definition prevents the estimation of correspondences across trajectories and therefore the estimation of distribution of trajectories in the spatiotemporal domain in the authors proposed model for longitudinal image data but the model is not built on the inference of statistical model and does not include time reparametrization of the estimated trajectories in this paper we propose generic statistical framework for the definition and estimation of mixedeffects models for longitudinal data using the tools of geometry allows us to derive method that makes little assumptions about the data and problem to deal with modeling choices boil down to the definition of the metric on the manifold this geometrical modeling also allows us to introduce the concept of parallel curves on manifold which is key to uniquely decompose differences seen in the data in spatial and temporal component because of the of the model the estimation of the parameters should be based on an adequate maximization of the observed likelihood to address this issue we propose to use stochastic version of the expectationmaximization algorithm namely the mcmc saem for which theoretical results regarding the convergence have been proved in experimental results on neuropsychological tests scores and estimates of scenarios of ad progression are given in section spatiotemporal model for data riemannian geometry setting the observed data consists in repeated multivariate measurements of individuals for given individual the measurements are obtained at time points ti ni the measurement of the individual is denoted by yi we assume that each observation yi is point on riemannian manifold embedded in rp with and equipped with riemannian metric we 
denote the covariant derivative we assume that the manifold is geodesically complete meaning that geodesics are defined for all time we recall that geodesic is curve drawn on the manifold which has no acceleration for point and vector tp the mapping expp denotes the riemannian exponential namely the point that is reached at time by the geodesic starting at with velocity the parallel transport of vector tγ in the tangent space at point on curve is family of vectors tγ which satisfies and we denote pγ the isometry that maps to in order to describe our model we need to introduce the notion of parallel curves on the manifold definition let be curve on defined for all time and vector tγ one defines the curve called parallel to the curve as expm pγ the idea is illustrated in fig one uses the parallel transport to move the vector from to along at the point new point on is obtained by taking the riemannian exponential of pγ this new point is denoted by as varies one describes curve on which can be understood as parallel to the curve it should be pointed out that even if is geodesic is in general not geodesic of in the euclidean case flat manifold the curve is the translation of the curve figure figure figure figure model description on schematic manifold figure left vector wi is choosen in tγ figure middle the tangent vector wi is transported along the geodesic and point wi is constructed at time by use of the riemannian exponential figure right the curve wi is the parallel resulting from the construction generic spatiotemporal model for longitudinal data our model is built in hierarchical manner data points are seen as samples along individual trajectories and these trajectories derive from trajectory the model writes yi wi ψi ti εi where we assume the trajectory to be geodesic denoted from now on individual trajectories derive from the group average by spatiotemporal transformations they are defined as time of trajectory that is parallel to the wi ψi for the ith individual wi denotes tangent vector in tγ for some specific time point that needs to be estimated and which is orthogonal to the tangent vector for the inner product given by the metric gγ the function ψi is defined as ψi αi τi the parameter αi is an acceleration factor which encodes whether the individual is progressing faster or slower than the average τi is which characterizes the advance or delay of the ith individual with respect to the average and wi is which encodes the variability in the measurements across individuals at the same stage of progression the normal tubular neighborhood theorem ensures that parallel shifting defines spatiotemporal coordinate system as long as the vectors wi are choosen orthogonal and sufficently small the orthogonality condition on the tangent vectors wi is necessary to ensure the identifiability of the model indeed if vector wi was not choosen orthogonal its orthogonal projection would play the same role as the acceleration spatial and temporal transformations commute in the sense that one may the average trajectory before building the parallel curve or vice versa mathematically this writes wi ψi wi ψi this relation also explains the particular form of the affine ψi the geodesic is characterized by the fact that it passes at by point with velocity then ψi is the same trajectory except that it passes by point at time τi with velocity αi the fixed effects of the model are the parameters of the average geodesic the point on the manifold the and the velocity the random effects are the acceleration 
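A minimal sketch of the hierarchical construction just described, in the flat (Euclidean) special case mentioned above, where geodesics are straight lines and the parallel of a curve is simply its translation. The fixed effects (p0, t0, v0) and the random effects used below are illustrative values, not estimates from the method.

```python
import numpy as np

# Flat-manifold sketch of the model: an individual trajectory is the average
# geodesic, time-warped by psi_i and parallel-shifted (here, translated) by w_i.

p0 = np.array([0.3, 0.5])      # position of the average trajectory at time t0
v0 = np.array([0.02, 0.01])    # velocity of the average trajectory at time t0
t0 = 70.0

def gamma0(t):
    """Average trajectory: a geodesic (a straight line in the Euclidean case)."""
    t = np.atleast_1d(t).astype(float)
    return p0 + v0 * (t - t0)[:, None]

def psi(t, alpha_i, tau_i):
    """Affine time reparameterization: acceleration alpha_i and time shift tau_i."""
    return alpha_i * (t - t0 - tau_i) + t0

def individual_trajectory(t, alpha_i, tau_i, w_i):
    """Parallel shift (a translation here) of the time-warped average trajectory."""
    return gamma0(psi(np.atleast_1d(t), alpha_i, tau_i)) + w_i

ages = np.linspace(65, 85, 5)
print(gamma0(ages))                                          # average scenario
print(individual_trajectory(ages, alpha_i=1.3, tau_i=-2.0,   # faster, earlier subject
                            w_i=np.array([0.05, -0.03])))
```

The Riemannian construction generalizes exactly this decomposition, a spatial shift w_i composed with the affine time warp psi_i, to curved manifolds via the exponential map and parallel transport.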
factors αi τi and wi the first two random effects are scalars one assumes the acceleration factors to follow distribution they need to be positive in order not to reverse time and to follow gaussian distribution are vectors of dimension in the hyperplane in tγ in the spirit of independent component analysis we assume that wi result from the superposition of ns statistically independent components this writes wi asi where is ns matrix of rank ns si vector of ns independent sources following heavy tail laplace distribution with fixed parameter and each column cj ns of satisfies the orthogonality condition hcj iγ for the dataset ti yi ni the model may be summarized as yi wi ψi ti εi with ψi αi τi αi exp ξi wi asi and ξi τi εi in si laplace eventually the parameters of the model one needs to estimate are the fixed effects and the variance of the random effects namely σξ στ vec propagation model in product manifold we wish to use these developments to study the temporal progression of family of biomarkers we assume that each component of yi is scalar measurement of given biomarker and belongs to geodesically complete manifold therefore each measurement yi is point in the product manifold which we assume to be equipped with the riemannian product metric we denote the geodesic of the manifold which goes through the point at time with velocity in order to determine relative progression of the biomarkers among themselves we consider parametric family of geodesics of we assume here that all biomarkers have on average the same dynamics but shifted in time this hypothesis allows to model temporal succession of effects during the course of the relative timing in biomarker changes is measured by the vector δn which becomes fixed effect of the model in this setting curve that is parallel to geodesic is given by the following lemma lemma let be geodesic of the product manifold and let if denotes parallel to the geodesic with wn and wn γn we have γn as consequence parallel to the average trajectory has the same form as the geodesic but with randomly perturbed delays the model writes for all wk yi αi ti τi εi where wk denotes the component of the wi and yi the measurement of the biomarker at the time point for the individual multivariate logistic curves model the propagation model given in is now described for normalized biomarkers such as scores of neuropsychological tests in this case we assume the manifold to be and equipped with the riemannian metric given by for tp tp gp ug with the geodesics given by this metric in the riemannian manifold are logistic curves of the form exp and leads to the multivariate logistic curves model in we can notice the quite unusual paramaterization of the logistic curve this parametrization naturally arise because satisfies and in this case the model writes yi exp αi ti τi δk as εi where asi denotes the component of the vector asi note that is not equivalent to linear model on the logit of the observations the logit transform corresponds to the riemannian logarithm at in our framework is not fixed but estimated as parameter of our model even with fixed the model is still due to the multiplication between αi and τi and therefore does not boil down to the usual linear model parameters estimation in this section we explain how to use stochastic version of the em algorithm to produce estimates of the parameters σξ στ vec of the model the algorithm detailed in this section is essentially the same as in its scope of application is not limited to statistical models on product 
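To make the multivariate logistic curves model concrete, the sketch below evaluates the average logistic trajectory of each biomarker and one individual trajectory. The way the space shift w_{i,k} = (A s_i)_k enters (here, as a perturbation of the per-biomarker delay, converted to a time offset by dividing by v0) is an assumption of this sketch rather than the exact published formula, and every numerical value is illustrative.

```python
import numpy as np

# Sketch of the multivariate logistic curves model for normalized scores in (0, 1).
# Each biomarker follows the same logistic geodesic shifted by a delay delta_k;
# individuals perturb it through alpha_i, tau_i and a space shift w_i = A s_i.

def logistic_geodesic(t, p0, t0, v0):
    """Geodesic of (0,1) for the metric g_p(u,u) = u^2 / (p^2 (1-p)^2)."""
    return 1.0 / (1.0 + (1.0 / p0 - 1.0) * np.exp(-v0 * (t - t0) / (p0 * (1.0 - p0))))

def biomarker_value(t, k, p0, t0, v0, delta, alpha_i, tau_i, w_i):
    """Noise-free value of biomarker k for individual i at age t (assumed form)."""
    warped_time = alpha_i * (t - t0 - tau_i) + t0          # psi_i(t)
    perturbed_delay = delta[k] + w_i[k] / v0               # assumed effect of w_{i,k}
    return logistic_geodesic(warped_time - perturbed_delay, p0, t0, v0)

p0, t0, v0 = 0.3, 72.0, 0.05
delta = np.array([0.0, 3.0, 8.0, 9.0])          # relative timing of the four biomarkers
alpha_i, tau_i = np.exp(0.2), -4.0              # alpha_i = exp(xi_i) is always positive
w_i = np.array([0.0, 0.1, -0.1, 0.0])           # one realization of A s_i
ages = np.linspace(60, 95, 8)
average = np.stack([logistic_geodesic(ages - delta[k], p0, t0, v0) for k in range(4)], axis=1)
subject = np.stack([biomarker_value(ages, k, p0, t0, v0, delta, alpha_i, tau_i, w_i)
                    for k in range(4)], axis=1)
print(average.round(3))
print(subject.round(3))
```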
manifolds and the algorithm can actually be used for the inference of very large family of statistical models the random effects ξi τi sj and ns are considered as hidden variables with the observed data yi form the complete data of the model in this context the em algorithm proposed in is very efficient to compute the maximum likelihood estimate of due to the nonlinearity and complexity of the model the step is intractable as consequence we considered stochastic version of the em algorithm namely the markov chain stochastic approximation algorithm based on this algorithm is an algorithm which alternates between three steps simulation stochastic approximation and maximization if denotes the current parameter estimates of the algorithm in the simulation step sample of the missing data is obtained from the transition kernel of an ergodic markov chain whose stationary distribution is the conditional distribution of the missing data knowing and denoted by this simulation step is achieved using within gibbs sampler note that the high complexity of our model prevents us from resorting to sampling methods as in as they would require heavy computations such as the fisher information matrix the stochastic approximation step consists in stochastic approximation on the complete log summarized as follows qt εt log zp where εt is decreasing sequence of positive in which satisfies εt and finally the parameter estimates are updated in the maximization step according to qt the theoretical convergence of the mcmc saem algorithm is proved only if the model belong to the curved exponential family or equivalently if the complete of the model may be written log where is sufficent statistic of the model in this case the stochastic approximation on the complete can be replaced with stochastic approximation on the sufficent statistics of the model note that the multivariate logistic curves model does not belong to the curved exponential family usual workaround consists in regarding the parameters of the model as realizations of independents gaussian random variables where is diagonal matrix with very small diagonal entries and the estimation now targets this yields and for all δk to ensure the orthogonality condition on the columns of we assumed that follows normal distribution on the space cns tγδ hcj iγδ ns equivalently we assume that the matrix writes βk bk where for all βk and ns is an orthonormal basis of obtained by application of the process to basis of the random variables ns are considered as new hidden variables of the model the parameters of the model are ns σξ στ whereas the hidden variables of the model are δk βk ns ξi τi sj the algorithm given below summarizes the saem algorithm for this model the algorithm was tested on synthetic data generated according to the allowed to recover the parameters used to generate the synthetic dataset algorithm overview of the mcmc saem algorithm for the multivariate logistic curves model if sj denotes the vector of hidden variables obtained in the simulation step of the iteration of the mcmc saem let fi fi rn and fi be the component of ηw exp ξi ti τi and wi ns βl bl initialization random εk repeat simulation step sj gibbs sampler fi rk compute the sufficent statistics yi kfi rk pp rp with ni and hi rn τi rp δj hi βj ns stochastic approximation step sj maximization step sj εk sj for δj for all ns σξ στ yi until convergence return for all experiments data we use the neuropsychological assessment test from the adnigo or cohorts of the alzheimer disease neuroimaging 
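The three steps of the MCMC-SAEM described above (simulation, stochastic approximation, maximization) can be summarized by the generic skeleton below. The functions `sample_latents`, `sufficient_stats` and `update_parameters` are model-specific placeholders, not functions given in the text, and the step-size schedule is just one common choice satisfying the stated conditions (sum infinite, sum of squares finite).

```python
import numpy as np

# Generic MCMC-SAEM skeleton: alternate (1) a few MCMC transitions targeting
# p(z | y, theta), (2) a stochastic approximation of the sufficient statistics,
# and (3) a closed-form maximization step using the approximated statistics.

def mcmc_saem(y, theta0, n_iter, n_burn_in,
              sample_latents, sufficient_stats, update_parameters, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    theta, z, S = theta0, None, None
    for k in range(n_iter):
        # 1. Simulation: Metropolis-Hastings-within-Gibbs transition whose
        #    stationary distribution is the conditional p(z | y, theta).
        z = sample_latents(y, z, theta, rng)

        # 2. Stochastic approximation on the sufficient statistics, with step
        #    sizes equal to 1 during burn-in and decreasing afterwards.
        eps = 1.0 if k < n_burn_in else (k - n_burn_in + 1) ** -0.65
        S_new = sufficient_stats(y, z)          # expected to return a list of arrays
        S = S_new if S is None else [s + eps * (sn - s) for s, sn in zip(S, S_new)]

        # 3. Maximization: update theta from the approximated statistics.
        theta = update_parameters(S)
    return theta
```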
initiative adni the consists of questions which allow to test the impairment of several cognitive functions for the purpose of our analysis these items are grouped into four categories memory items language items praxis items and concentration item scores within each category are added and normalized by the maximum possible score consequently each data point consists in four normalized scores which can be seen as point on the manifold we included individuals in the study who were diagnosed with mild cognitive impairment mci at their first visit and whose diagnosis changed to ad before their last visit there is an average of visits per subjects min max with an average duration of or months between consecutive visits the multivariate logistic curves model was used to analyze this longitudinal data experimental results the model was applied with ns or independent sources in each experiment the mcmc saem was run five times with different initial parameter values the experiment which returned the smallest residual variance was kept the maximum number of iterations was arbitrarily set to and the number of iterations was set to iterations the limit of iterations is enough to observe the convergence of the sequences of parameters estimates as result two and three sources allowed to decrease the residual variance better than one source for one source for two sources and for three sources the residual variance resp mean that the model allowed to explain resp of the total variance we implemented our algorithm in matlab without any particular optimization scheme the iterations require approximately one day the number of parameters to be estimated is equal to therefore the number of sources do not dramatically impact the runtime simulation is the most computationally expensive part of our algorithm for each run of the algorithm the proposal distribution is the prior distribution as consequence the acceptation ratio simplifies and one computation of the acceptation ratio requires two computations of the likelihood of the observations conditionally on different vectors of latent variables and the vector of current parameters estimates the runtime could be improved by parallelizing the sampling per individuals for matter of clarity and because the results obtained with three sources were similar to the results with two sources we report here the experimental results obtained with two independent sources the average model of disease progression is plotted in fig the estimated fixed effects are years unit per year and years this means that on average the memory score first coordinate reaches the value at years followed by concentration which reaches the same value at years and then by praxis and language at age and years respectively random effects show the variability of this average trajectory within the studied population the standard deviation of the equals στ years meaning that the disease progression model in fig is shifted by years to account for the variability in the age of disease onset the effects of the variance of the acceleration factors and the two independent components of the are illustrated in fig the acceleration factors shows the variability in the pace of disease progression which ranges between times faster and times slower than the average the first independent component shows variability in the relative timing of the cognitive impairments in one direction memory and concentration are impaired nearly at the same time followed by language and praxis in the other direction memory is 
followed by concentration and then language and praxis are nearly superimposed the second independent component keeps almost fixed the timing of memory and concentration and shows great variability in the relative timing of praxis and language impairment it shows that the ordering of the last two may be inverted in different individuals overall these components show that the onset of cognitive impairment tends to occur by pairs memory concentration followed by language praxis individual estimates of the random effects are obtained from the simulation step of the last iteration of the algorithm and are plotted in fig the figure shows that the estimated individual correspond well to the age at which individuals were diagnosed with ad this means that the value estimated by the model is good threshold to determine diagnosis fact that has occurred by chance and more importantly that the correctly register the dynamics of the individual trajectories so that the normalized age correspond to the same stage of disease progression across individuals this fact is corroborated by fig which shows that the normalized age of conversion to ad is picked at years old with small variance compared to the real distribution of age of conversion figure the four curves represent the estimated average trajectory vertical line is drawn at years old and an horizontal line is drawn at figure in blue resp red histogram of the ages of conversion to ad tdiag resp normalized ages of conversion to ad ψi tdiag with ψi as in acceleration factor 𝛼𝑖 independent component independent component figure variability in disease progression superimposed with the average trajectory dotted lines effects of the acceleration factor with plots of exp first column first and second independent component of with plots of ci for or second and third column respectively figure plots of individual random effects factor ξi log αi against τi color corresponds to the age of conversion to ad discussion and perspectives we proposed generic spatiotemporal model to analyze longitudinal measurements the fixed effects define trajectory which is geodesic on the data manifold random effects are acceleration factor and which provide insightful information about the variations in the direction of the individual trajectories and the relative pace at which they are followed this model was used to estimate normative scenario of alzheimer disease progression from neuropsychological tests we validated the estimates of the spatiotemporal registration between individual trajectories by the fact that they put into correspondence the same event on individual trajectories namely the age at diagnosis alternatives to estimate model of disease progression include the model which estimates the ordering of categorical variables our model may be seen as generalization of this model for continuous variables which do not only estimate the ordering of the events but also the relative timing between them practical solutions to combine spatial and temporal sources of variations in longitudinal data are given in our goal was here to propose theoretical and algorithmic foundations for the systematic treatment of such questions references the alzheimer disease neuroimaging initiative https kuhn construction of bayesian deformable models via stochastic approximation algorithm convergence study bernoulli braak braak staging of alzheimer neurofibrillary changes neurobiology of aging delyon lavielle moulines convergence of stochastic approximation version of the em algorithm annals of 
statistics pp dempster laird rubin maximum likelihood from incomplete data via the em algorithm journal of the royal statistical society series methodological pp diggle heagerty liang zeger analysis of longitudinal data oxford university press donohue le goff thomas raman gamst beckett jack weiner dartigues aisen the alzheimer disease neuroimaging initiative estimating multivariate progression from data alzheimer dementia durrleman pennec braga gerig ayache toward comprehensive framework for the spatiotemporal statistical analysis of longitudinal shape data international journal of computer vision fonteijn modat clarkson barnes lehmann hobbs scahill tabrizi ourselin fox et al an model for disease progression and its application in familial alzheimer disease and huntington disease neuroimage girolami calderhead riemann manifold langevin and hamiltonian monte carlo methods journal of the royal statistical society series statistical methodology hirsch differential topology springer science business media karhunen oja independent component analysis vol john wiley sons jack knopman jagust shaw aisen weiner petersen trojanowski hypothetical model of dynamic biomarkers of the alzheimer pathological cascade the lancet neurology kuhn lavielle maximum likelihood estimation in nonlinear mixed effects models computational statistics data analysis laird ware models for longitudinal data biometrics pp singer willett applied longitudinal data analysis modeling change and event occurrence oxford university press singh hinkle joshi fletcher hierarchical geodesic model for diffeomorphic longitudinal shape analysis in information processing in medical imaging pp springer su kurtek klassen srivastava et al statistical analysis of trajectories on riemannian manifolds bird migration hurricane tracking and video surveillance the annals of applied statistics 
bayesian framework for modeling confidence in perceptual decision making koosha khalvati rajesh rao department of computer science and engineering university of washington seattle wa koosha rao abstract the degree of confidence in one choice or decision is critical aspect of perceptual decision making attempts to quantify decision maker confidence by measuring accuracy in task have yielded limited success because confidence and accuracy are typically not equal in this paper we introduce bayesian framework to model confidence in perceptual decision making we show that this model based on partially observable markov decision processes pomdps is able to predict confidence of decision maker based only on the data available to the experimenter we test our model on two experiments on decision making involving the random dots motion discrimination task in both experiments we show that our model predictions closely match experimental data additionally our model is also consistent with other phenomena such as the effect in perceptual decision making introduction the brain is faced with the persistent challenge of decision making under uncertainty due to noise in the sensory inputs and perceptual ambiguity mechanism for of one decisions is therefore crucial for evaluating the uncertainty in one decisions this kind of decision making called perceptual decision making and the associated called confidence have received considerable attention in decision making experiments in recent years one possible way of estimating the confidence of decision maker is to assume that it is equal to the accuracy or performance on the task however the decision maker belief about the chance of success and accuracy need not be equal because the decision maker may not have access to information that the experimenter has access to for example in the task of random dots motion discrimination on each trial the experimenter knows the difficulty of the task coherence or motion strength of the dots but not the decision maker in this case when the data is binned based on difficulty of the task the accuracy is not equal to decision confidence an alternate way to estimate the subject confidence is to use auxiliary tasks such as wagering or asking the decision maker to estimate confidence explicitly these methods however only provide an indirect window into the subject confidence and are not always applicable in this paper we explain how model of decision making based on partially observable decision making processes pomdps can be used to estimate decision maker confidence based on experimental data pomdps provide unifying bayesian framework for modeling several important aspects of perceptual decision making including evidence accumulation via bayesian updates the role of priors costs and rewards of actions etc one of the advantages of the pomdp model over the other models is that it can incorporate various types of uncertainty in computing the optimal this research was supported by nsf grants and and onr grant decision making strategy and race models are able to handle uncertainty in probability updates but not the costs and rewards of actions furthermore these models originated as descriptive models of observed data where as the pomdp approach is fundamentally normative prescribing the optimal policy for any task requiring decision making under uncertainty in addition the pomdp model can capture the temporal dynamics of task time has been shown to play crucial role in decision making especially in decision confidence pomdps have 
previously been used to model evidence accumulation and understand the role of priors to our knowledge this is the first time that it is being applied to model confidence and explain experimental data on decision making tasks in the following sections we introduce some basic concepts in perceptual decision making and show how pomdp can model decision confidence we then explore the model predictions for two experiments in perceptual decision making involving confidence fixed duration motion discrimination task with wagering and motion discrimination task with confidence report our results show that the predictions of the pomdp model closely match experimental data the model predictions are also consistent with the phenomena in decision making involving in the hard trials and in the easy ones accuracy belief and confidence in perceptual decision making consider perceptual decision making tasks in which the subject has to guess the hidden state of the environment correctly to get reward any guess other than the correct state usually leads to no reward the decision maker has been trained on the task and wants to obtain the maximum possible reward since the state is hidden the decision maker must use one or more observations to estimate the state for example the state could be one of two biased coins one biased toward heads and the other toward tails on each trial the experimenter picks one of these coins randomly and flips it the decision maker only sees the result heads or tails and must guess which coin has been picked if she guesses correctly she gets reward immediately if she fails she gets nothing in this context accuracy is defined as the number of correct guesses divided by the total number of trials in single trial if represents the action or choice of the decision maker and and denote the state and observation respectively then accuracy for the choice with observation is the probability as where as represents the action of decision maker choosing and is the true state this accuracy can be measured by the experimenter however from the decision maker perspective her chance of success in trial is given by the probability of being the correct state given observation we call this probability the decision maker belief after choosing an action for example as the conf idence for this choice is the probability as according to bayes theorem as the goal of our decision maker is to maximize her reward she picks the most probable state this means that on observing she picks where is the most probable state arg max therefore is equal to for and for the rest of the actions as result accuracy is for the most probable state and for the rest also is equal to for the most probable state this means that given observation accuracy is equal to the confidence on the most probable state also this confidence is equal to the belief of the most probable state as confidence can not be defined on actions not performed one could consider confidence on the most probable state only implying that accuracy confidence and belief are all equal given observation as all of the above equalities however depend on the ability of the decision maker to compute according to bayes theorem if the decision maker has the perfect observation model she could compute by estimating and in the case that there are multiple states with maximum probability accuracy is the sum of the confidence values on those states beforehand by counting the total number of occurrences of each state without considering any observations and the total 
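A worked version of the biased-coin example: when the decision maker uses the true observation model, the belief assigned to the most probable state (the confidence) coincides with the accuracy of the corresponding choice. The coin biases used below are illustrative.

```python
import numpy as np

# Two hidden states (a heads-biased and a tails-biased coin), one observed flip.
# The decision maker applies Bayes' rule and picks the most probable state; her
# average confidence then matches the empirical accuracy.

prior = np.array([0.5, 0.5])
p_heads = np.array([0.8, 0.4])          # P(observe heads | state)

def posterior(obs_heads):
    like = p_heads if obs_heads else 1.0 - p_heads
    post = like * prior
    return post / post.sum()

rng = np.random.default_rng(0)
n_trials, correct, chosen_conf = 50_000, 0, []
for _ in range(n_trials):
    s = rng.integers(2)                              # experimenter draws a coin
    o = rng.random() < p_heads[s]                    # one flip is observed
    b = posterior(o)
    a = int(np.argmax(b))                            # decision maker picks the MAP state
    chosen_conf.append(b[a])
    correct += (a == s)

print("empirical accuracy :", correct / n_trials)
print("average confidence :", np.mean(chosen_conf))  # equal to accuracy up to noise
```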
number of occurrences of observation respectively therefore accuracy and confidence are equal if the decision maker has the true model for observations sometimes however the decision maker does not even have access to for example in the motion discrimination task if the data is binned based on difficulty motion strength the decision maker can not estimate because she does not know the difficulty of each trial as result accuracy and confidence are not equal in the general case the decision maker can utilize multiple observations over time and perform an action on each time step for example in the coin toss problem the decision maker could request flip multiple times to gather more information if she requests flip two times and then guesses the state to be the coin biased toward heads her actions would be sample sample choose heads she also has two observations likely to be two heads in the general case the state of the environment can also change after each in this case the relationship between accuracy and confidence at time after sequence history ht of actions and observations ht is at ht st st ht at with the same reasoning as above accuracy and confidence are equal if and only if the decision maker has access to all the observations and has the true model of the task the pomdp model partially observable markov decision processes pomdps provide mathematical framework for decision making under uncertainty in autonomous agents pomdp is formally tuple with the following description is finite set of states of the environment is finite set of possible actions and is finite set of possible observations is transition function defined as which determines the probability of going from state to another state after performing particular action is an observation function defined as which determines the probability of observing after performing an action and ending in particular state is the reward function defined as determining the reward received by performing an action in particular state is the discount factor which is always between and and determines how much rewards in the future are discounted compared to current rewards in pomdp goal is to find sequence of actions to maximize the expected discounted reward est st at the states are not fully observable and the agent must rely on its observations to choose actions at the time we have history of actions and observations ht the belief state at time is the posterior probability over states given this history and the prior probability over states bt st as the system is markovian the belief state captures the sufficient statistics for the history of states and actions and it is possible to obtain using only bt at and at at bt given this definition of belief the goal of the agent is to find sequence of actions to maximize the expected reward bt at the actions are picked based on the belief state and the resulting mapping from belief states to actions is called policy denoted by pwhich is ability distribution over actions bt at the policy which maximizes bt at is called the optimal policy it can be shown that there is always deterministic optimal policy allowing the agent to always choose one action for each bt as result we may use function where is the space of all possible beliefs there has been considerable progress in recent years in fast which find policies for pomdps modeling decision making with pomdps results from experiments and theoretical models indicate that in many perceptual decision making tasks if the previous task state is revealed 
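As a concrete sketch of the belief-state recursion just described, the function below computes b_{t+1} from b_t, the action a_t and the new observation o_{t+1} for a discrete POMDP. The tensor layout (T[a, s, s'] and O[a, s', o]) and the toy numbers are choices made purely for illustration.

```python
import numpy as np

# Minimal discrete-POMDP belief update:
#   b'(s') is proportional to O(s', a, o) * sum_s T(s, a, s') * b(s).

def belief_update(b, a, o, T, O):
    predicted = b @ T[a]                 # P(s' | b, a)
    unnorm = O[a, :, o] * predicted      # weight by the observation likelihood
    return unnorm / unnorm.sum()         # normalizer is P(o | b, a)

# Tiny example: 2 states, one "sample" action, 2 observations, static state.
T = np.array([np.eye(2)])                            # the state does not change
O = np.array([[[0.7, 0.3],                           # P(o | sample, s' = 0)
               [0.4, 0.6]]])                         # P(o | sample, s' = 1)
b = np.array([0.5, 0.5])
for obs in [0, 0, 1]:
    b = belief_update(b, a=0, o=obs, T=T, O=O)
print(b)   # posterior over the two states after three observations
```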
the history beyond this state does not exert noticeable in traditional perceptual decision making tasks such as the random dots task the state does not usually change however our model is equally applicable to this situation influence on decisions suggesting that the markov assumption and the notion of belief state is applicable to perceptual decision making additionally since the pomdp model aims to maximize the expected reward the problem of guessing the correct state in perceptual decision making can be converted to reward maximization problem by simply setting the reward for the correct guess to and the reward for all other actions to the pomdp model also allows other costs in decision making to be taken into account the cost of sampling that the brain may utilize for metabolic or other reasons finally as there is only one correct hidden state in each trial the policy is deterministic choosing the most probable state consistent with the pomdp model all these facts mean that we could model the perceptual decision making with the pomdp framework in the cases where all observations and the true environment model are available to the decision maker the belief state in the pomdp is equal to both accuracy and confidence as discussed above when some information is hidden from the decision maker one can use pomdp with that information to model accuracy and another pomdp without that information to model the confidence if this hidden information is independent of time we can model the difference with the initial belief state we use two similar pomdps to model accuracy and confidence but with different initial belief states in the motion discrimination experiment it is common to bin the data based on the difficulty of the task this difficulty is hidden to the decision maker and also independent of the time as result the confidence can be calculated by the same pomdp that models accuracy but with different initial belief state this case is discussed in the next section experiments and results we investigate the applicability of the pomdp model in the context of two tasks in perceptual decision making the first is motion discrimination task with sure option presented in in this task movie of randomly moving dots is shown to monkey for fixed duration after delay period the monkey must correctly choose the direction of motion left or right of the majority of the dots to obtain reward in half of the trials third choice also becomes available the sure option which always leads to reward though the reward is less than the reward for guessing the direction correctly intuitively if the monkey wants to maximize reward it should go for the sure choice only when it is very uncertain about the direction of the dots the second task is motion discrimination task in humans studied in in this task the subject observes the random dots motion stimuli but must determine the direction of the motion in this case up or down of the majority of the dots as fast and as accurately as possible rather than observing for fixed duration in addition to their decision regarding direction subjects indicated their confidence in their decision on horizontal bar stimulus where pointing nearer to the left end meant less confidence and nearer to the right end meant more confidence in both tasks the difficulty of the task is governed by parameter known as coherence or motion strength defined as the percentage of dots moving in the same direction from frame to frame in given trial in the experiments the coherence value for given trial 
was chosen to be one of the following fixed duration task as pomdp the direction and the coherence of the moving dots comprise the states of the environment in addition the actions which are available to the subject are dependent on the stage of the trial namely random dots display wait period choosing the direction or the sure choice or choosing only the direction as result the stage of the trial is also part of the state of the pomdp as the transition between these stages are dependent on time we incorporate discretized time as part of the state considering the data we define new state for each constant each direction each coherence and each stage when there is intersection between stages we use dummy states to enforce the delay period of waiting and terminal state which indicates termination of the trial direction coherence stage time waiting states terminal the actions are sample wait left right and sure the transition function models the passage of time and stages the observation function models evidence accumulation only in the random dots display stage and with the action sample the observations received in each are governed by the number of dots moving in the same direction we model the observations as normally distributed figure experimental accuracy of the decision maker for each coherence is shown in this plot is from the curves with empty circles and dashed lines are the trials where the sure option was not given to the subject the curves with solid circles and solid lines are the trials where the sure option was shown but waived by the decision maker shows the accuracy curves for the pomdp model fit to the experimental accuracy data from trials where the sure option was not given around mean related to the coherence and the direction as follows display sample µd σd the reward for choosing the correct direction is set to and the other rewards were set relative to this reward the sure option was set to positive reward less than while the cost of sampling and receiving new observation was set to negative reward value to model the unavailability of some actions in some states we set their resultant rewards to large negative number to preclude the decision making agent from picking these actions the discount factor models how much more immediate reward is worth relative to future rewards in the task the subject does not have the option of terminating the trial early to get reward sooner and therefore we used discount factor of for this task predicting the confidence in the fixed duration task as mentioned before confidence and accuracy are equal to each other when the same amount of information is available to the experimenter and the decision maker therefore they can be modeled by the same pomdp however these two are not equal when we look at specific coherence difficulty the data is binned based on coherence because the coherence in each trial is not revealed to the decision maker figure shows the accuracy stimulus duration binned based on coherence the confidence is not equal to the accuracy in this plot however we could predict the decision maker confidence only from accuracy data this time we use two pomdps one for the experimenter and one for the decision maker at time bt of the experimenter pomdp can be related to accuracy and bt of the decision maker to confidence these two pomdps have the same model parameters but different initial belief state this is because the subject knows the environment model but does not have access to the coherence in each trial first we find the 
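A sketch of the evidence-accumulation part of this POMDP: the hidden state is a (direction, coherence) pair, and each sample action returns a Gaussian observation whose mean is proportional to the signed coherence. The coherence grid and the observation parameters k_mu and sigma below are assumptions standing in for the values fitted to the accuracy data, not numbers taken from the text.

```python
import numpy as np

# Belief over (direction, coherence) states after a fixed-duration stimulus,
# accumulated from per-timestep Gaussian evidence samples.

coherences = np.array([0.0, 0.032, 0.064, 0.128, 0.256, 0.512])   # assumed grid
directions = np.array([-1, +1])                                    # left / right
k_mu, sigma = 1.0, 1.0                                             # assumed parameters

states = [(d, c) for d in directions for c in coherences]
mus = np.array([d * k_mu * c for d, c in states])                  # observation mean per state

def accumulate(samples):
    log_b = -np.log(len(states)) * np.ones(len(states))            # uniform prior
    for x in samples:
        log_b += -0.5 * ((x - mus) / sigma) ** 2                   # Gaussian log-likelihood
        log_b -= np.logaddexp.reduce(log_b)                        # renormalize
    return np.exp(log_b)

rng = np.random.default_rng(1)
true_mu = +1 * k_mu * 0.128                                        # rightward, moderate coherence
b = accumulate(rng.normal(true_mu, sigma, size=60))
p_right = sum(bi for bi, (d, _) in zip(b, states) if d > 0)
print("belief that motion is rightward:", round(p_right, 3))
```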
set of parameters for the experimenter pomdp to reproduce the same accuracy curves as in the experiment for each coherence we only use data from the trials where the sure option is not given dashed curves in figure as the data is binned based on the coherence and coherence is observable to the experimenter the initial belief state of the experimenter pomdp for coherence is as following for each of two possible initial states at time and for the rest fitting the pomdp to the accuracy data yields the mean and variance for each observation function and the cost for sampling figure shows the accuracy curves based on the experimenter pomdp now we could apply the parameters obtained from fitting accuracy data the experimenter pomdp to the decision maker pomdp to predict her confidence the decision maker does not know the coherence in each single trial therefore the initial belief state should be uniform distribution over all initial states all coherences not only coherence of that trial also neural data from experiments and wagering experiments suggest that the decision maker does not recognize the existence of true zero coherence state coherence therefore the initial figure the confidence predicted by the pomdp fit to the observed accuracy data in the fixedduration experiment is shown in shows accuracy and confidence in one plot demonstrating that they are not equal for this task curves with solid lines show the confidence same curves as and the ones with dashed lines show the accuracy same as figure figure experimental wagering results plot and the wagering predicted by our model plot plot is from probability of coherence states is set to figure shows the pomdp predictions regarding the subject belief figure confirms that the predicted confidence and accuracy are not equal to test our prediction about the confidence of the decision maker we use experimental data from wagering in this experiment if the reward for the sure option is rsure then the decision maker chooses it if and only if right rright rsure and lef rlef rsure where direction is the sum of the belief states of all states in that direction since rsure can not be obtained from the fit to the accuracy data we choose value for rsure which makes the prediction of the confidence consistent with the wagering data shown in figure we found that if rsure is approximately twothirds the value of the reward for correct direction choice the pomdp model prediction matches experimental data figure possible objection is that the free parameter of rsure was used to fit the data although rsure is needed to fit the exact probabilities we found that any reasonable value for rsure generates the same trend of wagering in general the effect of rsure is to shift the plots vertically the most important phenomena here is the relatively small gap between hard trials and easy trials in figure figure shows what this wagering data would look like if the decision maker knew the coherence in each trial and confidence was equal to the accuracy the difference between these two plots figure and figure and figure which shows the confidence and the accuracy together confirm the pomdp model ability to explain effect wherein the decision maker underestimates easy trials and has overconfidence in the hard ones another way of testing the predictions about the confidence is to verify if the pomdp predicts the correct accuracy in the trials where the decision maker waives the sure option figure shows that the results from the pomdp closely match the experimental data both 
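The sketch below contrasts the two POMDPs for a single difficulty bin: the experimenter's initial belief puts mass only on the trial's true coherence, so its belief tracks accuracy, while the decision maker's initial belief is uniform over all (direction, coherence) states and yields the confidence. It reuses the same assumed Gaussian observation model as in the previous sketch; the decision maker's prior excludes a zero-coherence state, as described above, and none of the numbers are fitted values.

```python
import numpy as np

# Experimenter POMDP (coherence known) vs decision-maker POMDP (coherence hidden):
# same observation model, different initial belief states.

coherences = np.array([0.032, 0.064, 0.128, 0.256, 0.512])
k_mu, sigma, n_samples = 1.0, 1.0, 60
mus = np.concatenate([-k_mu * coherences[::-1], k_mu * coherences])   # left block, right block
n = len(mus)

def posterior(xs, prior):
    log_b = np.log(prior) + (-0.5 * ((xs[:, None] - mus) / sigma) ** 2).sum(axis=0)
    return np.exp(log_b - np.logaddexp.reduce(log_b))

true_c = 0.064                                        # binning variable, hidden to the subject
i_right = len(coherences) + int(np.where(coherences == true_c)[0][0])
prior_exp = np.full(n, 1e-12)                         # tiny mass elsewhere keeps log finite
prior_exp[[i_right, n - 1 - i_right]] = 0.5           # coherence known, direction unknown
prior_dm = np.full(n, 1.0 / n)                        # coherence unknown to the subject

rng = np.random.default_rng(2)
acc, acc_model, conf = [], [], []
for _ in range(3000):                                 # rightward-motion trials at true_c
    xs = rng.normal(k_mu * true_c, sigma, n_samples)
    b_exp, b_dm = posterior(xs, prior_exp), posterior(xs, prior_dm)
    right = b_dm[len(coherences):].sum() >= 0.5       # subject chooses from her own belief
    acc.append(right)                                 # choice is correct iff "right"
    side = slice(len(coherences), n) if right else slice(0, len(coherences))
    acc_model.append(b_exp[side].sum())               # experimenter belief in the chosen side
    conf.append(b_dm[side].sum())                     # decision-maker confidence in the choice

print("empirical accuracy        :", np.mean(acc))
print("experimenter belief       :", round(np.mean(acc_model), 3))   # tracks accuracy
print("decision-maker confidence :", round(np.mean(conf), 3))        # generally different
```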
in wagering and accuracy improvement our methods are presented in more detail in the supplementary document reaction time task the pomdp for the reaction time task is similar to the fixed duration task the most important components of the state are again direction and coherence we also need some dummy states for the figure shows what wagering would look like if the accuracy and the confidence were equal shows the accuracy predicted by the pomdp model in the trials where the sure option is shown but waived solid lines and also in the trials where it is not shown dashed lines for comparison see experimental data in figure waiting period between the decision command from the decision maker and reward delivery however the passage of stages and time are not modeled the absence of time in the state representation does not mean that the time is not modeled in the framework tracking time is very important component of any pomdp especially when the discount factor is less than one the actions for this task are sample wait up and down the latter two indicating choice for the direction of motion the transition model and the observation model are similar to those for the fixed duration task direction coherence waiting states terminal sample µd σd the reward for choosing the correct direction is and the reward for sampling is small negative value adjusted to the reward of the correct choice as the subject controls the termination of the trial the discount factor is less than in this task the subjects have been explicitly advised to terminate the task as soon as they discover the direction therefore there is an incentive for the subject to terminate the trial sooner while sampling cost is constant during the experiment the discount factor makes the decision making strategy dependent on time discount factor less than means that as time passes the effective value of the rewards decreases also in general reaction time task the discount factor connects the trials to each other while models usually assume each single trial is independent of the others trials are actually dependent when the decision maker has control over trial termination specifically the decision maker has motivation to terminate each trial quickly to get the reward and proceed to the next one moreover when one is very uncertain about the outcome of trial it may be prudent to terminate the trial sooner with the expectation that the next trial may be easier predicting the confidence in the reaction time task like the fixed duration task we want to predict the decision maker confidence on specific coherence to achieve this we use the same technique having two pomdps with the same model and different initial belief states the control of the subject over the termination of the trial makes estimating the confidence more difficult in the reaction time task as the subject decides based on her own belief not accuracy the relationship between the accuracy and the reaction time binned based on difficulty is very noisy in comparison to the fixed duration task the plots of this relationship are illustrated in the supplementary materials of therefore we fit the experimenter pomdp to two other plots reaction time motion strength coherence and accuracy motion strength coherence the first subject of the original experiment was picked for this analysis because the behavior of this subject was consistent with the behavior of the majority of subjects figures and show the experimental data from figure and show the results from the pomdp model fit to 
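To illustrate how a per-sample cost together with a discount factor below one pushes the subject to terminate the trial, here is a myopic one-step-lookahead stopping rule: keep sampling only while one more observation is expected to pay for its cost and for the discounting of the eventual reward. This is a deliberately simplified approximation for intuition, not the optimal policy computed by a POMDP solver, and every numerical value is an assumption.

```python
import numpy as np

# Myopic stopping rule for a two-alternative task with Gaussian evidence:
# compare the value of choosing now with the value of taking exactly one more
# sample (paying its cost and discounting the eventual reward) and then choosing.

R, c_sample, gamma = 1.0, 0.005, 0.98     # reward for a correct choice, cost, discount
mu, sigma = 0.1, 1.0                      # observation means +/- mu for up / down

def bayes_step(b, x):
    """Update the belief that the motion is 'up' after observing x."""
    lu = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
    ld = np.exp(-(x + mu) ** 2 / (2 * sigma ** 2))
    return b * lu / (b * lu + (1 - b) * ld)

def keep_sampling(b, rng, n_mc=4000):
    """Myopic test: is (sample once, then choose) better than choosing now?"""
    stop_now = max(b, 1 - b) * R
    up = rng.random(n_mc) < b                                  # marginalize over the state
    xs = np.where(up, rng.normal(mu, sigma, n_mc), rng.normal(-mu, sigma, n_mc))
    b_next = bayes_step(b, xs)
    one_more = -c_sample + gamma * R * np.mean(np.maximum(b_next, 1 - b_next))
    return one_more > stop_now

rng = np.random.default_rng(3)
b, t = 0.5, 0
while keep_sampling(b, rng) and t < 1000:
    b, t = bayes_step(b, rng.normal(mu, sigma)), t + 1         # true direction: up
print(f"terminated after {t} samples with confidence {max(b, 1 - b):.3f}")
```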
experimental data as in the previous task the initial belief state of the pomdp for coherence is for each direction of and for the rest all the free parameters of the pomdp were extracted from this fit again as in the fixed duration task we assume that the decision maker knows the environment model but does not know about the coherence of each trial and existence of coherence figure shows the reported confidence from the experiments and figure shows the prediction of our pomdp model for the belief of the figure and show accuracy motion strength and reaction time motion strength plots from the random dots experiments in and show the results from the pomdp model figure illustrates the reported confidence by the human subject from shows the predicted confidence by the pomdp model decision maker although this report is not in percentile and quantitative comparison is not possible the general trends in these plots are similar the two become almost identical if one maps the report bar to the probability range in both tasks we assume that the decision maker has nearly perfect model of the environment apart from using different coherences instead of the zero coherence state assumed not known this assumption is not necessarily true although the decision maker understands that the difficulty of the trials is not constant she might not know the exact number of coherences for example she may divide trials into three categories easy normal and hard for each direction however these differences do not significantly change the belief because the observations are generated by the true model not the decision maker model we tested this hypothesis in our experiments although using using separate decision maker model makes the predictions closer to the real data we used the true experimenter model to avoid overfitting the data conclusions our results present to our knowledge the first supporting evidence for the utility of bayesian reward optimization framework based on pomdps for modeling confidence judgements in subjects engaged in perceptual decision making we showed that the predictions of the pomdp model are consistent with results on decision confidence in both primate and human decision making tasks encompassing and paradigms unlike traditional descriptive models such as or race models the pomdp model is normative and is derived from bayesian and reward optimization principles additionally unlike the traditional models it allows one to model optimal decision making across trials using the concept of discount factor important directions for future research include leveraging the ability of the pomdp framework to model probabilistic state transitions and exploring predictions of the pomdp model for decision making experiments with more sophisticated functions references karl astrom optimal control of markov decision processes with incomplete state estimation journal of mathematical analysis and applications pages jan drugowitsch ruben anne churchland michael shadlen and alexandre pouget the cost of accumulating evidence in perceptual decision making the journal of neuroscience jan drugowitsch ruben and alexandre pouget relation between belief and performance in perceptual decision making plos one timothy hanks mark mazurek roozbeh kiani elisabeth hopp and michael shadlen elapsed decision time affects the weighting of prior probability in perceptual decision task journal of neuroscience yanping huang abram friesen timothy hanks michael shadlen and rajesh rao how prior probability influences decision making 
unifying probabilistic model in proceedings of the annual conference on neural information processing systems nips pages yanping huang and rajesh rao reward optimization in the primate brain probabilistic model of decision making under uncertainty plos one peter juslin henrik olsson and mats bjorkman brunswikian and thurstonian origins of bias in probability assessment on the interpretation of stochastic components of judgment journal of behavioral decision making leslie pack kaelbling michael littman and anthony cassandra planning and acting in partially observable stochastic domains artificial intelligence adam kepecs and zachary mainen computational framework for the study of confidence in humans and animals philosophical transactions of the royal society biological sciences adam kepecs naoshige uchida hatim zariwala and zachary mainen neural correlates computation and behavioural impact of decision confidence nature koosha khalvati and alan mackworth fast pairwise heuristic for planning under uncertainty in proceedings of the aaai conference on artificial intelligence pages roozbeh kiani leah corthell and michael shadlen choice certainty is informed by both evidence and decision time neuron roozbeh kiani and michael shadlen representation of confidence associated with decision by neurons in the parietal cortex science hanna kurniawati david hsu and wee sun lee sarsop efficient pomdp planning by approximating optimally reachable belief spaces in proceedings of the robotics science and systems iv navindra persaud peter mcleod and alan cowey wagering objectively measures awareness nature neuroscience rajesh rao decision making under uncertainty neural model based on partially observable markov decision processes frontiers in computational neuroscience stphane ross joelle pineau sebastien paquet and brahim online planning algorithms for pomdps journal of artificial intelligence research michael shadlen and william newsome motion perception seeing and deciding proceedings of the national academy of sciences of the united states of america richard smallwood and edward sondik the optimal control of partially observable markov processes over finite horizon operations research edward sondik the optimal control of partially observable markov processes over the infinite horizon discounted costs operations research pp 
Optimization in deep neural networks. Behnam Neyshabur (Toyota Technological Institute at Chicago, bneyshabur), Ruslan Salakhutdinov (Departments of Statistics and Computer Science, University of Toronto, rsalakhu), Nathan Srebro (Toyota Technological Institute at Chicago, nati). Abstract: We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescalings of weights that do not affect the output of the network, and suggest an optimization method which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. The method is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad. Introduction: Training deep networks is a challenging problem, and various heuristics and optimization algorithms have been suggested in order to improve the efficiency of the training. However, training deep architectures is still considerably slow, and the problem has remained open. Many of the current training methods rely on a good initialization and then perform stochastic gradient descent (SGD), sometimes together with an adaptive stepsize or momentum term. Revisiting the choice of gradient descent, we recall that optimization is inherently tied to a choice of geometry, or measure of distance, norm, or divergence. Gradient descent, for example, is tied to the l2 norm, as it is the steepest descent with respect to the l2 norm in the parameter space, while coordinate descent corresponds to steepest descent with respect to the l1 norm, and multiplicative weight updates are tied to an entropic divergence. Moreover, at least when the objective function is convex, convergence behavior is tied to the corresponding norms or potentials; for example, with gradient descent or SGD, convergence speeds depend on the l2 norm of the optimum. The norm or divergence can be viewed as a regularizer for the updates. There is therefore also a strong link between regularization for optimization and regularization for learning: optimization may provide implicit regularization in terms of its corresponding geometry, and for ideal optimization performance the optimization geometry should be aligned with the inductive bias driving the learning. Is the l2 geometry on the weights the appropriate geometry for the space of deep networks, or can we suggest a geometry with more desirable properties that would enable faster optimization and perhaps also better implicit regularization? As suggested above, this question is also linked to the choice of an appropriate regularizer for deep networks. Focusing on networks with ReLU activations, we observe that scaling down the incoming edges to a hidden unit and scaling up the outgoing edges by the same factor yields an equivalent network computing the same function. Figure: evolution of the error function when training a network on MNIST with two hidden layers; the unbalanced initialization (blue curve) is generated by applying a sequence of rescaling functions to the balanced initialization (red curve). Further panels illustrate SGD updates on a simple network with constant input and zero thresholds, showing weight explosion and poor updates in an unbalanced network. Since predictions are invariant to such rescalings, it is
natural to seek geometry and corresponding optimization method that is similarly invariant we consider here geometry inspired by regularization regularizing the maximum norm of incoming weights into any unit which seems to provide better inductive bias compared to the norm weight decay but to achieve rescaling invariance we use not the itself but rather the minimum over all rescalings of the weights we discuss how this measure can be expressed as path regularizer and can be computed efficiently we therefore suggest novel optimization method that is an approximate steepest descent method with respect to path regularization and we demonstrate that gradient descent and adagrad for classifications tasks on several benchmark datasets notations feedforward neural network that computes function rd rc can be represented by directed acyclic graph dag with input nodes vin vin output nodes vout vout weights and an activation function that is applied on the internal nodes hidden units we denote the function computed by this network as fg in this paper we focus on relu rectified linear unit activation function σrelu max we refer to the depth of the network which is the length of the longest directed path in for any we define vini to be the set of vertices with longest path of length to an input unit and vout is defined similarly for paths to output units in layered networks vin vout is the set of hidden units in hidden layer rescaling and unbalanceness one of the special properties of relu activation function is homogeneity that is for any scalar and any we have σrelu σrelu this interesting property allows the network to be rescaled without changing the function computed by the network we define the rescaling function ρc such that given the weights of the network constant and node the rescaling function multiplies the incoming edges and divides the outgoing edges of by that is ρc maps to the weights for the rescaled network where for any otherwise it is easy to see that the rescaled network computes the same function fg σrelu fg ρc σrelu we say that the two networks with weights and are rescaling equivalent denoted by if and only if one of them can be transformed to another by applying sequence of rescaling functions ρc given training set yn xn yn our goal is to minimize the following objective function fw xi yi let be the weights at step of the optimization we consider update step of the following form for example for gradient descent we have where is the in the stochastic setting such as sgd or gradient descent we calculate the gradient on small subset of the training set since rescaling equivalent networks compute the same function it is desirable to have an update rule that is not affected by rescaling we call an optimization method rescaling invariant if the updates of rescaling equivalent networks are rescaling equivalent that is if we start at either one of the two rescaling equivalent weight vectors after applying update steps separately on and they will remain rescaling equivalent and we have unfortunately gradient descent is not rescaling invariant the main problem with the gradient updates is that scaling down the weights of an edge will also scale up the gradient which as we see later is exactly the opposite of what is expected from rescaling invariant update furthermore gradient descent performs very poorly on unbalanced networks we say that network is balanced if the norm of incoming weights to different units are roughly the same or within small range for example figure shows huge gap in 
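A numerical check of the rescaling function rho_c described above for a one-hidden-layer ReLU network: scaling a unit's incoming weights by c and its outgoing weights by 1/c leaves the computed function unchanged, while the gradients, and hence plain SGD updates, are not equivalent. Layer sizes and weights are arbitrary.

```python
import numpy as np

# Rescaling invariance of ReLU networks: rho_c applied to one hidden unit does
# not change the function, but it does change the gradient geometry.

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(5, 10)), rng.normal(size=(10, 3))   # 5 -> 10 -> 3 network

def forward(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2          # ReLU hidden layer, linear output

def rescale_unit(W1, W2, unit, c):
    """rho_c at one hidden unit: incoming weights * c, outgoing weights / c."""
    W1r, W2r = W1.copy(), W2.copy()
    W1r[:, unit] *= c
    W2r[unit, :] /= c
    return W1r, W2r

x = rng.normal(size=(4, 5))
W1r, W2r = rescale_unit(W1, W2, unit=3, c=10.0)
print(np.allclose(forward(x, W1, W2), forward(x, W1r, W2r)))  # True: same function

# The loss gradients with respect to the rescaled incoming and outgoing weights are
# scaled by 1/c and c respectively, so plain SGD treats the two networks differently.
```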
the performance of sgd initialized with randomly generated balanced network when training on mnist compared to network initialized with unbalanced weights here is generated by applying sequence of random rescaling functions on and therefore in an unbalanced network gradient descent updates could blow up the smaller weights while keeping the larger weights almost unchanged this is illustrated in figure if this were the only issue one could scale down all the weights after each update however in an unbalanced network the relative changes in the weights are also very different compared to balanced network for example figure shows how two rescaling equivalent networks could end up computing very different function after only single update measures for deep networks following we consider the grouping of weights going into each node of the network this forms the following generic type regularizer parametrized by µp two simple cases of above are and that correspond to overall regularization and weight decay respectively another form of regularization that is shown to be very effective in relu networks is the regularization which is the maximum over all units of norm of incoming edge to the the correspond to regularization when we set in equation and can be written in the following form µp sup this definition of is bit different than the one used in the context of matrix factorization the later is similar to the minimum upper bound over norm of both outgoing edges from the input units and incoming edges to the output units in two layer network weight decay is probably the most commonly used regularizer on the other hand regularization might not seem ideal as it is very extreme in the sense that the value of regularizer corresponds to the highest value among all nodes however the situation is very different for networks with relu activations and other activation functions with homogeneity property in these cases regularization has shown to be very effective the main reason could be because relu networks can be rebalanced in such way that all hidden units have the same norm hence regularization will not be crude measure anymore since µp is not rescaling invariant and the values of the scale measure are different for rescaling equivalent networks it is desirable to look for the minimum value of regularizer among all rescaling equivalent networks surprisingly for network the minimum regularizer among all rescaling equivalent networks can be efficiently computed by single forward step to see this we consider the vector the path vector where the number of coordinates of is equal to the total number of paths from the input to output units and each coordinate of is the equal to the product of weights along path from an input nodes to an output node the regularizer is then defined as the norm of ek φp kπ kp vin the following lemma establishes that the regularizer corresponds to the minimum over all equivalent networks of the norm lemma φp min µp the definition of the regularizer involves an exponential number of terms but it can be computed efficiently by dynamic programming in single forward step using the following equivalent form as nested sums φp in out vin straightforward consequence of lemma is that the φp is invariant to rescaling for any φp φp an approximate steepest descent motivated by empirical performance of regularization and the fact that is invariant to rescaling we are interested in deriving the steepest descent direction with respect to the path regularizer φp arg min arg min ek we vin arg 
min the steepest descent step is hard to calculate exactly instead we will update each coordinate we independently and synchronously based on that is we arg min we taking the partial derivative with respect to we and setting it to zero we obtain we we vin vout algorithm rule γin in γout for to do γin γin in initialization γout γout end for γin γout we we update rule where vin vout denotes the paths from any input unit to any output unit that includes solving for we gives us the following update rule we γp where γp is given as γp vin vout we call the optimization using the update rule gradient descent when used in stochastic settings we refer to it as now that we know an approximate steepest descent with respect to the we can ask whether or not this makes rescaling invariant optimization method the next theorem proves that indeed rescaling invariant theorem rescaling invariant proof it is sufficient to prove that using the update rule for any and any if ρc then ρc for any edge in the network if is neither incoming nor outgoing edge of the node then and since the gradient is also the same for edge however if is an incoming edge to we have that cw we we have moreover since the outgoing edges of are divided by we get γp therefore γp and cwe γp cwe γp similar argument proves the invariance of rule for outgoing edges of therefore rescaling invariant efficient implementation the update rule in the way it is written needs to consider all the paths which is exponential in the depth of the network however it can be calculated in time that is no more than step on single data point that is in setting with batch size if the backpropagation on the can be done in time bt the running time of the on the will be roughly very moderate runtime increase with typical sizes of hundreds or thousands of points algorithm shows an efficient implementation of the update rule we next compare other optimization methods in both balanced and unbalanced settings table general information on datasets used in the experiments data set mnist svhn dimensionality color color grayscale color classes training set test set experiments in this section we compare two commonly used optimization methods in deep learning sgd and adagrad we conduct our experiments on four common benchmark datasets the standard mnist dataset of handwritten digits and datasets of tiny images of natural scenes and street view house numbers svhn dataset containing color images of house numbers collected by google street view details of the datasets are shown in table in all of our experiments we trained networks with two hidden layers each containing hidden units we used of size and the of where is an integer between and to choose for each dataset we considered the validation errors over the validation set randomly chosen points that are kept out during the initial training and picked the one that reaches the minimum error faster we then trained the network over the entire training set all the networks were trained both with and without dropout when training with dropout at each update step we retained each unit with probability we tried both balanced and unbalanced initializations in balanced initialization incoming weights to each unit are initialized to samples from gaussian distribution with standard deviation in the unbalanced setting we first initialized the weights to be the same as the balanced weights we then picked hidden units randomly with replacement for each unit we multiplied its incoming edge and divided its outgoing edge by where was chosen 
randomly from distribution the optimization results without dropout are shown in figure for each of the four datasets the plots for objective function the training error and the test error are shown from left to right where in each plot the values are reported on different epochs during the optimization although we proved that are the same for balanced and unbalanced initializations to verify that despite numerical issues they are indeed identical we trained both balanced and unbalanced initializations since the curves were exactly the same we only show single curve we can see that as expected the unbalanced initialization considerably hurts the performance of sgd and adagrad in many cases their training and test errors are not even in the range of the plot to be displayed while essentially the same another interesting observation is that even in the balanced settings not only does get to the same value of objective function training and test error faster but also the final generalization error for sometimes considerably lower than sgd and adagrad except where the generalization error for sgd is slightly better compared to the plots for test errors could also imply that implicit regularization due to steepest descent with respect to leads to solution that generalizes better this view is similar to observations in on the role of implicit regularization in deep learning the results for training with dropout are shown in figure where here we suppressed the very poor results on unbalanced initializations we observe that except for mnist much faster than sgd or adagrad it also generalizes better to the test set which again shows the effectiveness of updates the results suggest that sgd and adagrad in two different ways first it can achieve the same accuracy much faster and second the implicit regularization by to local minima that can generalize better even when the training error is zero this can be better analyzed by looking at the plots for more number of epochs which we have provided in the supplementary material we should also point that can be easily combined with adagrad to take training loss training error unbalanced sgd balanced sgd unbalanced adagrad balanced adagrad unbalanced epoch mnist test error svhn epoch epoch figure learning curves using different optimization methods for datasets without dropout left panel displays the objective function middle and right panels show the corresponding values of the training and test errors where the values are reported on different epochs during the course of optimization best viewed in color advantage of the adaptive stepsize or used together with momentum term this could potentially perform even better compare to discussion we revisited the choice of the euclidean geometry on the weights of relu networks suggested an alternative optimization method approximately corresponding to different geometry and showed that using such an alternative geometry can be beneficial in this work we show success and we expect to be beneficial also in training for very deep convolutional networks combining with adagrad with momentum or with other optimization heuristics might further enhance results although we do believe is very good optimization method and is an easy for sgd we hope this work will also inspire others to consider other geometries other regularizers and perhaps better update rules particular property of is its rescaling invariance which we training loss training error test error dropout sgd dropout adagrad dropout epoch mnist svhn epoch epoch 
figure learning curves using different optimization methods for datasets with dropout left panel displays the objective function middle and right panels show the corresponding values of the training and test errors best viewed in color argue is appropriate for relu networks but is certainly not the only rescaling invariant update possible and other invariant geometries might be even better can also be viewed as tractable approximation to natural gradient which ignores the activations the input distribution and dependencies between different paths natural gradient updates are also invariant to rebalancing but are generally computationally intractable finally we choose to use steepest descent because of its simplicity of implementation better choice might be mirror descent with respect to an appropriate potential function but such construction seems particularly challenging considering the of neural networks acknowledgments research was partially funded by nsf award and intel we thank ryota tomioka and hao tang for insightful discussions and leon bottou for pointing out the connection to natural gradient references john duchi elad hazan and yoram singer adaptive subgradient methods for online learning and stochastic optimization the journal of machine learning research xavier glorot and yoshua bengio understanding the difficulty of training deep feedforward neural networks in aistats ian goodfellow david mehdi mirza aaron courville and yoshua bengio maxout networks in proceedings of the international conference on machine learning icml pages kaiming he xiangyu zhang shaoqing ren and jian sun delving deep into rectifiers surpassing performance on imagenet classification arxiv preprint sergey ioffe and christian szegedy batch normalization accelerating deep network training by reducing internal covariate shift in arxiv kingma and ba adam method for stochastic optimization corr alex krizhevsky and geoffrey hinton learning multiple layers of features from tiny images computer science department university of toronto tech rep yann lecun bottou yoshua bengio and patrick haffner learning applied to document recognition proceedings of the ieee james martens and roger grosse optimizing neural networks with approximate curvature in icml yuval netzer tao wang adam coates alessandro bissacco bo wu and andrew ng reading digits in natural images with unsupervised feature learning in nips workshop on deep learning and unsupervised feature learning behnam neyshabur ryota tomioka and nathan srebro in search of the real inductive bias on the role of implicit regularization in deep learning international conference on learning representations iclr workshop track behnam neyshabur ryota tomioka and nathan srebro capacity control in neural networks colt nathan srebro and adi shraibman rank and in learning theory pages springer nathan srebro karthik sridharan and ambuj tewari on the universality of online mirror descent in advances in neural information processing systems pages nitish srivastava geoffrey hinton alex krizhevsky ilya sutskever and ruslan salakhutdinov dropout simple way to prevent neural networks from overfitting the journal of machine learning research sutskever martens george dahl and geoffery hinton on the importance of momentum and initialization in deep learning in icml 
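Before moving to the next paper, a minimal sketch may help make the path-regularized update described above concrete. The following NumPy code is illustrative only: the function names (path_regularizer, path_sgd_step), the restriction to a single hidden layer of ReLU units without biases, the default p = 2, and the small epsilon guard are assumptions added for the example, not details taken from the text. The per-edge scaling follows the gamma-based update rule discussed above, computed from sums of |w|^p rather than by enumerating paths.

import numpy as np

def path_regularizer(W1, W2, p=2):
    # phi_p(w): p-th root of the sum over input-output paths of the product of
    # |w_e|^p along the path, computed in one forward pass instead of enumerating
    # the exponentially many paths.
    return float(np.sum((np.abs(W1) ** p) @ (np.abs(W2) ** p)) ** (1.0 / p))

def path_sgd_step(W1, W2, g1, g2, lr=0.1, p=2, eps=1e-12):
    # Approximate steepest descent w.r.t. the path regularizer: each weight is updated
    # as w_e <- w_e - lr * grad_e / gamma(w, e), where gamma(w, e) multiplies the |w|^p
    # path mass entering the tail node of e by the mass leaving its head node, raised
    # to the power 2/p. With a single hidden layer, the incoming mass of an input unit
    # and the outgoing mass of an output unit are both 1, so one factor survives per layer.
    # Shapes: W1 (d_in, d_hidden), W2 (d_hidden, d_out); g1, g2 are the loss gradients.
    out_mass = np.sum(np.abs(W2) ** p, axis=1)             # mass leaving each hidden unit
    in_mass = np.sum(np.abs(W1) ** p, axis=0)              # mass entering each hidden unit
    gamma1 = (out_mass[np.newaxis, :] + eps) ** (2.0 / p)  # scaling for input-to-hidden edges
    gamma2 = (in_mass[:, np.newaxis] + eps) ** (2.0 / p)   # scaling for hidden-to-output edges
    return W1 - lr * g1 / gamma1, W2 - lr * g2 / gamma2

Because the scaling depends only on row and column sums of |w|^p, the extra cost over plain SGD is a small constant factor, consistent with the efficient dynamic-programming implementation discussed above.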
on the consistency theory of high dimensional variable screening xiangyu wang dept of statistical science duke university usa chenlei leng dept of statistics university of warwick uk david dunson dept of statistical science duke university usa dunson abstract variable screening is fast dimension reduction technique for assisting high dimensional feature selection as preselection method it selects moderate size subset of candidate variables for further refining via feature selection to produce the final model the performance of variable screening depends on both computational efficiency and the ability to dramatically reduce the number of variables without discarding the important ones when the data dimension is substantially larger than the sample size variable screening becomes crucial as faster feature selection algorithms are needed conditions guaranteeing selection consistency might fail to hold this article studies class of linear screening methods and establishes consistency theory for this special class in particular we prove the restricted diagonally dominant rdd condition is necessary and sufficient condition for strong screening consistency as concrete examples we show two screening methods sis and holp are both strong screening consistent subject to additional constraints with large probability if log under random designs in addition we relate the rdd condition to the irrepresentable condition and highlight limitations of sis introduction the rapidly growing data dimension has brought new challenges to statistical variable selection crucial technique for identifying important variables to facilitate interpretation and improve prediction accuracy recent decades have witnessed an explosion of research in variable selection and related fields such as compressed sensing with core focus on regularized methods regularized methods can consistently recover the support of coefficients the signals via optimizing regularized loss functions under certain conditions however in the big data era when far exceeds such regularized methods might fail due to two reasons first the conditions that guarantee variable selection consistency for convex regularized methods such as lasso might fail to hold when second the computational expense of both convex and regularized methods increases dramatically with large bearing these concerns in mind propose the concept of variable screening fast technique that reduces data dimensionality from to size comparable to with all predictors having nonzero coefficients preserved they propose marginal correlation based fast screening technique sure independence screening sis that can preserve signals with large probability however this method relies on strong assumption that the marginal correlations between the response and the important predictors are high which is easily violated in the practice extends the marginal correlation to the spearman rank correlation which is shown to gain certain robustness but is still limited by the same strong assumption and take different approach to attack the screening problem they both adopt variants of forward selection type algorithm that includes one variable at time for constructing candidate variable set for further refining these methods eliminate the strong marginal assumption in and have been shown to achieve better empirical performance however such improvement is limited by the extra computational burden caused by their iterative framework which is reported to be high when is large to ameliorate concerns in both 
screening performance and computational efficiency develop new type of screening method termed ordinary projection holp this new screener relaxes the strong marginal assumption required by sis and can be computed efficiently complexity is thus scalable to dimensionality this article focuses on linear models for tractability as computation is one vital concern for designing good screening method we primarily focus on class of linear screeners that can be efficiently computed and study their theoretical properties the main contributions of this article lie in three aspects we define the notion of strong screening consistency to provide unified framework for analyzing screening methods in particular we show necessary and sufficient condition for screening method to be strong screening consistent is that the screening matrix is restricted diagonally dominant rdd this condition gives insights into the design of screening matrices while providing framework to assess the effectiveness of screening methods we relate rdd to other existing conditions the irrepresentable condition ic is necessary and sufficient for sign consistency of lasso in contrast to ic that is specific to the design matrix rdd involves another ancillary matrix that can be chosen arbitrarily such flexibility allows rdd to hold even when ic fails if the ancillary matrix is carefully chosen as in holp when the ancillary matrix is chosen as the design matrix certain equivalence is shown between rdd and ic revealing the difficulty for sis to achieve screening consistency we also comment on the relationship between rdd and the restricted eigenvalue condition rec which is commonly seen in the high dimensional literature we illustrate via simple example that rdd might not be necessarily stronger than rec we study the behavior of sis under random designs and prove that sample size of ρs log is sufficient for sis and holp to be screening consistent where is the sparsity measures the diversity of signals and evaluates the ratio this is to be compared to the sign consistency results in where the design matrix is fixed and assumed to follow the ic the article is organized as follows in section we set up the basic problem and describe the framework of variable screening in section we provide deterministic necessary and sufficient condition for consistent screening its relationship with the irrepresentable condition is discussed in section in section we prove the consistency of sis and holp under random designs by showing the rdd condition is satisfied with large probability although the requirement on sis is much more restictive linear screening consider the usual linear regression xβ where is the response vector is the design matrix and is the noise the regression task is to learn the coefficient vector in the high dimensional setting where sparsity assumption is often imposed on so that only small portion of the coordinates are such an assumption splits the task of learning into two phases the first is to recover the support of the location of coefficients the second is to estimate the value of these signals this article mainly focuses on the first phase as pointed out in the introduction when the dimensionality is too high using regularization methods methods raises concerns both computationally and theoretically to reduce the dimensionality suggest variable screening framework by finding submodel md is among the largest coordinates of or mγ let and define as the true model with being its cardinarlity the hope is that the submodel size or 
will be smaller or comparable to while md or mγ to achieve this goal two steps are usually involved in the screening analysis the first is to show there exists some such that and the second step is to bound the size of such that to unify these steps for more comprehensive theoretical framework we put forward slightly stronger definition of screening consistency in this article definition strong screening consistency an estimator of is strong screening consistent if it satisfies that min max and sign sign βi remark this definition does not differ much from the usual screening property studied in the literature which requires where max denotes the th largest item the key of strong screening consistency is the property that requires the estimator to preserve consistent ordering of the zero and coefficients it is weaker than variable selection consistency in the requirement in can be seen as relaxation of the sign consistency defined in as no requirement for is needed as shown later such relaxation tremendously reduces the restriction on the design matrix and allows screening methods to work for broader choice of the focus of this article is to study the theoretical properties of special class of screeners that take the linear form as ay for some ancillary matrix examples include sure independence screening sis where and ordinary projection holp where xx we choose to study the class of linear estimators because linear screening is computationally efficient and theoretically tractable we note that the usual ordinary estimator is also special case of linear estimators although it is not well defined for deterministic guarantees in this section we derive the necessary and sufficient condition that guarantees ay to be strong screening consistent the design matrix and the error are treated as fixed in this section and we will investigate random designs later we consider the set of sparse coefficient vectors defined by the set contains vectors having at most coordinates with the ratio of the largest and smallest coordinate bounded by before proceeding to the main result of this section we introduce some terminology that helps to establish the theory definition restricted diagonally dominant matrix symmetric matrix is restricted diagonally dominant with sparsity if for any and φii max φkj φkj where is constant notice this definition implies that for φii φkj φkj which is related to the usual diagonally dominant matrix the restricted diagonally dominant matrix provides necessary and sufficient condition for any linear estimators ay to be strong screening consistent more precisely we have the following result theorem for the noiseless case where linear estimator ay is strong screening consistent for every if and only if the screening matrix ax is restricted diagonally dominant with sparsity and proof assume is restricted diagonally dominant with sparsity and recall φβ suppose is the index set of predictors for any if we let then we have βj βj βj φii φij φii φij φkj φki φkj φki βi βi βi βj φkj φki βj φkj βi φki βi βi βi and βj βj βj φii φij φii φij φkj φki φkj φki βi βi βi βj φkj φki sign βi βi therefore whatever value sign βi is it always holds that and thus to prove the sign consistency for coefficients we notice that for βj φij βi φii φij βj βi φii βi the proof of necessity is left to the supplementary materials the noiseless case is good starting point to analyze intuitively in order to preserve the correct order of the coefficients in axβ one needs ax to be close to diagonally dominant matrix so 
that ms will take advantage of the large diagonal terms in ax to dominate ms that is just linear combinations of terms when noise is considered the condition in theorem needs to be changed slightly to accommodate extra discrepancies in addition the smallest coefficient has to be lower bounded to ensure certain level of ratio thus we augment our previous definition of to have signal strength control bτ min then we can obtain the following modified theorem theorem with noise the linear estimator ay is strong screening consistent for every bτ if ax ip is restricted diagonally dominant with sparsity and the proof of theorem is essentially the same as theorem and is thus left to the supplementary materials the condition in theorem can be further tailored to necessary and sufficient version with extra manipulation on the noise term nevertheless this might not be useful in practice due to the randomness in noise in addition the current version of theorem is already tight in the sense that there exists some noise vector such that the condition in theorem is also necessary for strong screening consistency theorems and establish ground rules for verifying consistency of given screener and provide practical guidance for screening design in section we consider some concrete examples of ancillary matrix and prove that conditions in theorems and are satisfied by the corresponding screeners with large probability under random designs relationship with other conditions for some special cases such sure independence screening sis the restricted diagonally dominant rdd condition is related to the strong irrepresentable condition ic proposed in assume each column of is standardized to have mean zero letting and be given coefficient vector the ic is expressed as kcs cs sign βs for some where ca represents the of with row indices in and column indices in the authors enumerate several scenarios of such that ic is satisfied we verify some of these scenarios for screening matrix corollary if φii and for some as defined in corollary and in then is restricted diagonally dominant matrix with sparsity and if for some as defined in corollary in then is restricted diagonally dominant matrix with sparsity and more explicit but nontrivial relationship between ic and rdd is illustrated below when theorem assume φii and if is restricted diagonally dominant with sparsity and then satisfies kφs sign βs for all on the other hand if satisfies the ic for all for some then is restricted diagonally dominant matrix with sparsity and theorem demonstrates certain equivalence between ic and rdd however it does not mean that rdd is also strong requirement notice that ic is directly imposed on the covariance matrix this makes ic strong assumption that is easily violated for example when the predictors are highly correlated in contrast to ic rdd is imposed on matrix ax where there is flexibility in choosing only when is chose to be rdd is equivalently strong as ic as shown in next theorem for other choices of such as holp defined in next section the estimator satisfies rdd even when predictors are highly correlated therefore rdd is considered as weak requirement for sis the screening matrix coincides with the covariance matrix making rdd and ic effectively equivalent the following theorem formalizes this theorem let and standardize columns of to have sample variance one assume satisfies the sparse riesz condition min λmin xπt xπ for some now if ax is restricted diagonally dominant with sparsity and with then satisfies the ic for any in 
other words under the condition the strong screening consistency of sis for implies the model selection consistency of lasso for theorem illustrates the difficulty of sis the necessary condition that guarantees good screening performance of sis also guarantees the model selection consistency of lasso however such strong necessary condition does not mean that sis should be avoided in practice given its substantial advantages in terms of simplicity and computational efficiency the strong screening consistency defined in this article is stronger than conditions commonly used in justifying screening procedures as in another common assumption in the high dimensional literature is the restricted eigenvalue condition rec compared to rec rdd is not necessarily stronger due to its flexibility in choosing the ancillary matrix prove that the rec is satisfied when the design matrix is however rec might not be guaranteed when the row of follows distribution in contrast as the example shown in next section and in by choosing xx the resulting estimator satisfies rdd even when the rows of follow distributions screening under random designs in this section we consider linear screening under random designs when and are gaussian the theory developed in this section can be easily extended to broader family of distributions for example where follows distribution and follows an elliptical distribution we focus on the gaussian case for conciseness let and we prove the screening consistency of sis and holp by verifying the condition in theorem recall the ancillary matrices for sis and holp are defined respectively as aholp xx asis for simplicity we assume σii for to verify the rdd condition it is essential to quantify the magnitude of the entries of ax and lemma let asis then for any and we have tn σii exp min and tn σij exp min where kx is constant is random variable with one degree of freedom and the norm is defined in lemma states that the screening matrix asis for sis will eventually converge to the covariance matrix in when tends to infinity and log thus the screening performance of sis strongly relies on the structure of in particular the asymptotically necessary and sufficient condition for sis being strong screening consistent is satisfying the rdd condition for the noise term we have the following lemma lemma let asis for any and we have tn σt exp min where is defined the same as in lemma the proof of lemma is essentially the same as the proof of terms in lemma and is thus omitted as indicated before the necessary and sufficient condition for sis to be strong screening consistent is that follows rdd as rdd is usually hard to verify we consider stronger sufficient condition inspired by corollary theorem let if then for any if the sample size satisfies log where is defined in lemma then with probability at least asis kasis ip is restricted diagonally dominant with sparsity and in other words sis is screening consistent for any bτ proof taking union bound on the results from lemma and we have for any and min φii or max or σt exp min in other words for any when log with probability at least we have log log min φii max max log sufficient condition for to be restricted diagonally dominant is that min φii max max plugging in the values we have log log log solving the above inequality notice that and completes the proof the requirement that ρsr or the necessary and sufficient condition that is rdd strictly constrains the correlation structure of causing the difficulty for sis to be strong screening consistent 
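To make the discussion concrete, a linear screener of the form beta_hat = A y takes only a few lines to implement; the NumPy sketch below computes the SIS estimate and the two quantities compared in a crude diagonal-dominance check of a screening matrix. The function names, the 1/n scaling (which does not affect the ranking), and the omission of the exact constant from the theorem (which depends on r and nu) are assumptions made for the illustration.

import numpy as np

def sis_screen(X, y, d):
    # Marginal screening: beta_hat = (1/n) X^T y; keep the d largest |beta_hat_j|.
    beta_hat = X.T @ y / X.shape[0]
    keep = np.argsort(-np.abs(beta_hat))[:d]
    return beta_hat, keep

def diag_dominance_margin(Phi, s):
    # Returns (smallest diagonal entry, s * largest off-diagonal magnitude) of the
    # screening matrix Phi = A X; a diagonally-dominant-style sufficient condition
    # asks the first quantity to exceed a constant multiple of the second.
    off = Phi - np.diag(np.diag(Phi))
    return float(np.min(np.diag(Phi))), float(s * np.max(np.abs(off)))

For SIS the screening matrix is essentially the sample covariance (1/n) X^T X, so the margin above is a condition on the correlation structure of the design, which is exactly why SIS is so sensitive to it.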
for holp we instead have the following result lemma let aholp assume for some then for any there exists some and such that for any and any we have and κt where proof the proof of lemma relies heavily on previous results for the stiefel manifold provided in the supplementary materials we only sketch the basic idea here and leave the complete proof to the supplementary materials defining xx then we have hh and follows the matrix angular central gaussian macg with covariance the diagonal terms of hh can be bounded similarly via the lemma by using the fact that hh σu where is random projection matrix now for terms we decompose the stiefel manifold as where is vector is matrix and is chosen so that and show that follows angular central gaussian acg distribution with covariance σg conditional on it can be shown that hh let hh then is equivalent to and we obtain the desired coupling distribution as hh using the normal representation of acg if xp then acg we can write in terms of normal variables and then bound all terms using concentration inequalities lemma quantifies the entries of the screening matrix for holp as illustrated in the lemma regardless of the covariance diagonal terms of are always np and the terms are pn thus with is likely to satisfy the rdd condition with large probability for the noise vector we have the following result lemma let aholp assume for some then for any there exist the same as in lemma such that for any and κt if the proof is almost identical to lemma and is provided in the supplementary materials the following theorem results after combining lemma and theorem assume for some for any if the sample size satisfies max ρs log where max and are the same constants defined in lemma then with probability at least aholp kaholp ip is restricted diagonally dominant with sparsity and this implies holp is screening consistent for any bτ proof notice that if min max kx xx ij then the proof is because kx xx is already restricted diagonally dominant matrix let the above equation then requires cκsρ cκt cκsρ cκ which implies that ρs ρs where therefore taking union bounds on all matrix entries we have does not hold where the second inequality is due to the fact that and now for any holds with probability at least if log log log which is satisfied provided noticing log now pushing to the limit gives the precise condition we need there are several interesting observations on equation and first ρs appears in both expressions we note that ρs evaluates the sparsity and the diversity of the signal while is closely related to the ratio furthermore holp relaxes the correlation constraint or the covariance constraint is rdd with the conditional number constraint thus for any as long as the sample size is large enough strong screening consistency is assured finally holp provides an example to satisfy the rdd condition in answer to the question raised in section concluding remarks this article studies and establishes necessary and sufficient condition in the form of restricted diagonally dominant screening matrices for strong screening consistency of linear screener we verify the condition for both sis and holp under random designs in addition we show close relationship between rdd and the ic highlighting the difficulty of using sis in screening for arbitrarily correlated predictors for future work it is of interest to see how linear screening can be adapted to compressed sensing and how techniques such as preconditioning can improve the performance of marginal screening and variable selection 
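A corresponding sketch for HOLP replaces the marginal covariance with the high dimensional ordinary least-squares projection; only an n x n linear system has to be solved, so the cost stays modest when n is much smaller than p. The function name and the optional ridge term (a numerical safeguard added for the example, not part of the estimator analyzed above) are assumptions of the illustration.

import numpy as np

def holp_screen(X, y, d, ridge=0.0):
    # beta_hat = X^T (X X^T)^{-1} y; keep the d coordinates with largest magnitude.
    n = X.shape[0]
    G = X @ X.T + ridge * np.eye(n)         # n x n Gram matrix, cheap when n << p
    beta_hat = X.T @ np.linalg.solve(G, y)
    keep = np.argsort(-np.abs(beta_hat))[:d]
    return beta_hat, keep

# Toy usage: p = 2000 independent Gaussian predictors, n = 100 observations,
# 5 nonzero coefficients; screening keeps d = n candidate variables.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2000))
beta = np.zeros(2000)
beta[:5] = 3.0
y = X @ beta + rng.standard_normal(100)
_, kept = holp_screen(X, y, d=100)
print(set(range(5)) <= set(kept.tolist()))  # expected to be True with high probability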
acknowledgments this research was partly supported by an NIH grant from the National Institute of Environmental Health Sciences.

references
David Donoho. Compressed sensing. IEEE Transactions on Information Theory.
Richard Baraniuk. Compressive sensing. IEEE Signal Processing Magazine.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Statistical Methodology).
Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association.
Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics.
Peter Bickel, Ya'acov Ritov, and Alexandre Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics.
Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics.
Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research.
Martin Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity using l1-constrained quadratic programming. IEEE Transactions on Information Theory.
Jason Lee, Yuekai Sun, and Jonathan Taylor. On model selection consistency of M-estimators with geometrically decomposable penalties. Advances in Neural Information Processing Systems.
Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society, Series B (Statistical Methodology).
Gaorong Li, Heng Peng, Jun Zhang, and Lixing Zhu. Robust rank correlation based screening. The Annals of Statistics.
Hansheng Wang. Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association.
Haeran Cho and Piotr Fryzlewicz. High dimensional variable selection via tilting. Journal of the Royal Statistical Society, Series B (Statistical Methodology).
Xiangyu Wang and Chenlei Leng. High-dimensional ordinary least-squares projection for screening variables. https
Cun-Hui Zhang and Jian Huang. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics.
Garvesh Raskutti, Martin Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research.
Shuheng Zhou. Restricted eigenvalue conditions on subgaussian random matrices. arXiv preprint.
Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint.
Lingzhou Xue and Hui Zou. Sure independence screening and compressed random sensing. Biometrika.
Jinzhu Jia and Karl Rohe. Preconditioning to comply with the irrepresentable condition. arXiv preprint.
memory networks sainbayar sukhbaatar dept of computer science courant institute new york university sainbar arthur szlam jason weston rob fergus facebook ai research new york aszlam jase robfergus abstract we introduce neural network with recurrent attention model over possibly large external memory the architecture is form of memory network but unlike the model in that work it is trained and hence requires significantly less supervision during training making it more generally applicable in realistic settings it can also be seen as an extension of rnnsearch to the case where multiple computational steps hops are performed per output symbol the flexibility of the model allows us to apply it to tasks as diverse as synthetic question answering and to language modeling for the former our approach is competitive with memory networks but with less supervision for the latter on the penn treebank and datasets our approach demonstrates comparable performance to rnns and lstms in both cases we show that the key concept of multiple computational hops yields improved results introduction two grand challenges in artificial intelligence research have been to build models that can make multiple computational steps in the service of answering question or completing task and models that can describe long term dependencies in sequential data recently there has been resurgence in models of computation using explicit storage and notion of attention manipulating such storage offers an approach to both of these challenges in the storage is endowed with continuous representation reads from and writes to the storage as well as other processing steps are modeled by the actions of neural networks in this work we present novel recurrent neural network rnn architecture where the recurrence reads from possibly large external memory multiple times before outputting symbol our model can be considered continuous form of the memory network implemented in the model in that work was not easy to train via backpropagation and required supervision at each layer of the network the continuity of the model we present here means that it can be trained from pairs and so is applicable to more tasks tasks where such supervision is not available such as in language modeling or realistically supervised question answering tasks our model can also be seen as version of rnnsearch with multiple computational steps which we term hops per output symbol we will show experimentally that the multiple hops over the memory are crucial to good performance of our model on these tasks and that training the memory representation can be integrated in scalable manner into our neural network model approach our model takes discrete set of inputs xn that are to be stored in the memory query and outputs an answer each of the xi and contains symbols coming from dictionary with words the model writes all to the memory up to fixed buffer size and then finds continuous representation for the and the continuous representation is then processed via multiple hops to output this allows backpropagation of the error signal through multiple memory accesses back to the input during training single layer we start by describing our model in the single layer case which implements single memory hop operation we then show it can be stacked to give multiple hops in memory input memory representation suppose we are given an input set xi to be stored in memory the entire set of xi are converted into memory vectors mi of dimension computed by embedding each xi in continuous 
space, in the simplest case using an embedding matrix A. The query q is also embedded, again in the simplest case via another embedding matrix B with the same dimensions as A, to obtain an internal state u. In the embedding space we compute the match between u and each memory m_i by taking the inner product followed by a softmax: p_i = Softmax(u^T m_i), where Softmax(z_i) = e^{z_i} / sum_j e^{z_j}. Defined in this way, p is a probability vector over the inputs.

Output memory representation: each x_i has a corresponding output vector c_i, given in the simplest case by another embedding matrix C. The response vector o from the memory is then a sum over the transformed inputs c_i, weighted by the probability vector from the input: o = sum_i p_i c_i. Because the function from input to output is smooth, we can easily compute gradients and backpropagate through it; other recently proposed forms of memory or attention take this approach, notably Bahdanau et al. and Graves et al., among others.

Generating the final prediction: in the single-layer case, the sum of the output vector o and the input embedding u is then passed through a final weight matrix W and a softmax to produce the predicted label a_hat = Softmax(W(o + u)). The overall model is shown in the figure below. During training, all three embedding matrices A, B and C, as well as W, are jointly learned by minimizing a standard cross-entropy loss between a_hat and the true label a; training is performed using stochastic gradient descent (see the training details given later).

Figure: a single-layer version of our model (left) and a three-layer version (right). In practice we can constrain several of the embedding matrices to be the same, as discussed next.

Multiple layers: we now extend our model to handle K hop operations. The memory layers are stacked in the following way. The input to layers above the first is the sum of the output o^k and the input u^k from layer k (different ways to combine o^k and u^k are proposed later): u^{k+1} = u^k + o^k. Each layer has its own embedding matrices A^k, C^k used to embed the inputs x_i; however, as discussed below, they are constrained to ease training and reduce the number of parameters. At the top of the network, the input to W also combines the input and the output of the top memory layer: a_hat = Softmax(W u^{K+1}) = Softmax(W(o^K + u^K)). We explore two types of weight tying within the model. Adjacent: the output embedding for one layer is the input embedding for the one above (A^{k+1} = C^k); we also constrain the answer prediction matrix to be the same as the final output embedding (W^T = C^K) and the question embedding to match the input embedding of the first layer (B = A^1). Layer-wise (RNN-like): the input and output embeddings are the same across different layers (A^1 = ... = A^K and C^1 = ... = C^K); here we have found it useful to add a linear mapping H to the update of u between hops, that is u^{k+1} = H u^k + o^k. This mapping is learnt along with the rest of the parameters and used throughout our experiments for layer-wise weight tying. A three-layer version of our memory model is shown in the figure. Overall it is similar to the Memory Network model of Weston et al., except that the hard max operations within each layer have been replaced with a continuous weighting from the softmax. Note that if we use the layer-wise weight tying scheme, our model can be cast as a traditional RNN where we divide the outputs of the RNN into internal and external outputs: emitting an internal output corresponds to considering a memory, and emitting an external output corresponds to predicting a label. From the RNN point of view, u (in the figure and the equations above) is a hidden state, and the model generates an internal output p (the attention weights in the figure) using A. The model then ingests p using C, updates the hidden
state and so here unlike standard rnn we explicitly condition on the outputs stored in memory during the hops and we keep these outputs soft rather than sampling them thus our model makes several computational steps before producing an output meant to be seen by the outside world related work number of recent efforts have explored ways to capture structure within sequences using rnns or models the memory in these models is the state of the network which is latent and inherently unstable over long timescales the models address this through local memory cells which lock in the network state from the past in practice the performance gains over carefully trained rnns are modest see mikolov et al our model differs from these in that it uses global memory with shared read and write functions however with weight tying our model can be viewed as form of rnn which only produces an output after fixed number of time steps corresponding to the number of hops with the intermediary steps involving memory operations that update the internal state some of the very early work on neural networks by steinbuch and piske and taylor considered memory that performed operations on stored input vectors and then fit parametric models to the retrieved sets this has similarities to single layer version of our model subsequent work in the explored other types of memory for example das et al and mozer et al introduced an explicit stack with push and pop operations which has been revisited recently by in the context of an rnn model closely related to our model is the neural turing machine of graves et al which also uses continuous memory representation the ntm memory uses both content and access unlike ours which only explicitly allows the former although the temporal features that we will introduce in section allow kind of access however in part because we always write each memory sequentially our model is somewhat simpler not requiring operations like sharpening furthermore we apply our memory model to textual reasoning tasks which qualitatively differ from the more abstract operations of sorting and recall tackled by the ntm note that in this view the terminology of input and output from fig is flipped when viewed as traditional rnn with this special conditioning of outputs becomes part of the output embedding of the rnn and becomes the input embedding our model is also related to bahdanau et al in that work bidirectional rnn based encoder and gated rnn based decoder were used for machine translation the decoder uses an attention model that finds which hidden states from the encoding are most useful for outputting the next translated word the attention model uses small neural network that takes as input concatenation of the current hidden state of the decoder and each of the encoders hidden states similar attention model is also used in xu et al for generating image captions our memory is analogous to their attention mechanism although is only over single sentence rather than many as in our case furthermore our model makes several hops on the memory before making an output we will see below that this is important for good performance there are also differences in the architecture of the small network used to score the memories compared to our scoring approach we use simple linear layer whereas they use more sophisticated gated architecture we will apply our model to language modeling an extensively studied task goodman showed simple but effective approaches which combine with cache bengio et al ignited interest in 
using neural network based models for the task with rnns and lstms showing clear performance gains over traditional methods indeed the current is held by variants of these models for example very large lstms with dropout or rnns with diagonal constraints on the weight matrix with appropriate weight tying our model can be regarded as modified form of rnn where the recurrence is indexed by memory lookups to the word sequence rather than indexed by the sequence itself synthetic question and answering experiments we perform experiments on the synthetic qa tasks defined in using version of the dataset given qa task consists of set of statements followed by question whose answer is typically single word in few tasks answers are set of words the answer is available to the model at training time but must be predicted at test time there are total of different types of tasks that probe different forms of reasoning and deduction here are samples of three of the tasks sam walks into the kitchen sam picks up an apple sam walks into the bedroom sam drops the apple where is the apple bedroom brian is lion julius is lion julius is white bernhard is green what color is brian white mary journeyed to the den mary went back to the kitchen john journeyed to the bedroom mary discarded the milk where was the milk before the den hallway note that for each question only some subset of the statements contain information needed for the answer and the others are essentially irrelevant distractors the first sentence in the first example in the memory networks of weston et al this supporting subset was explicitly indicated to the model during training and the key difference between that work and this one is that this information is no longer provided hence the model must deduce for itself at training and test time which sentences are relevant and which are not formally for one of the qa tasks we are given example problems each having set of sentences xi where question sentence and answer let the jth word of sentence be xij represented by vector of length where the vocabulary is of size reflecting the simplistic nature of the qa language the same representation is used for the question and answer two versions of the data are used one that has training problems per task and second larger one with per task model details unless otherwise stated all experiments used hops model with the adjacent weight sharing scheme for all tasks that output lists the answers are multiple words we take each possible combination of possible outputs and record them as separate answer vocabulary word sentence representation in our experiments we explore two different representations for the sentences the first is the bow representation that takes the sentence xi xin embeds each word and sums the resulting vectors mi axij and ci cxij the input vector representing the question is also embedded as bag of words bqj this has the drawback that it can not capture the order of the words in the sentence which is important for some tasks we therefore propose second representation that encodes the position of words within the sentence this takes the form mi lj axij where is an multiplication lj is column vector with the structure lkj assuming indexing with being the number of words in the sentence and is the dimension of the embedding this sentence representation which we call position encoding pe means that the order of the words now affects mi the same representation is used for questions memory inputs and memory outputs temporal encoding many of the 
qa tasks require some notion of temporal context in the first example of section the model needs to understand that sam is in the bedroom after he is in the kitchen to enable our model to address them we modify the memory vector so that mi axij ta where ta is the ith row of special matrix ta that encodes temporal information the output embedding is augmented in the same way with matrix tc ci cxij tc both ta and tc are learned during training they are also subject to the same sharing constraints as and note that sentences are indexed in reverse order reflecting their relative distance from the question so that is the last sentence of the story learning time invariance by injecting random noise we have found it helpful to add dummy memories to regularize ta that is at training time we can randomly add of empty memories to the stories we refer to this approach as random noise rn training details of the babi training set was to form validation set which was used to select the optimal model architecture and hyperparameters our models were trained using learning rate of with anneals every epochs by until epochs were reached no momentum or weight decay was used the weights were initialized randomly from gaussian distribution with zero mean and when trained on all tasks simultaneously with training samples training samples epochs epochs were used with learning rate anneals of every epochs epochs all training uses batch size of but cost is not averaged over batch and gradients with an norm larger than are divided by scalar to have norm in some of our experiments we explored commencing training with the softmax in each memory layer removed making the model entirely linear except for the final softmax for answer prediction when the validation loss stopped decreasing the softmax layers were and training recommenced we refer to this as linear start ls training in ls training the initial learning rate is set to the capacity of memory is restricted to the most recent sentences since the number of sentences and the number of words per sentence varied between problems null symbol was used to pad them all to fixed size the embedding of the null symbol was constrained to be zero on some tasks we observed large variance in the performance of our model sometimes failing badly other times not depending on the initialization to remedy this we repeated each training times with different random initializations and picked the one with the lowest training error baselines we compare our abbreviated to to range of alternate models memnn the strongly supervised memory networks approach proposed in this is the best reported approach in that paper it uses max operation rather than softmax at each layer which is trained directly with supporting facts strong supervision it employs modeling nonlinear layers and an adaptive number of hops per query weakly supervised heuristic version of memnn where the supporting sentence labels are not used in training since we are unable to backpropagate through the max operations in each layer we enforce that the first memory hop should share at least one word with the question and that the second memory hop should share at least one word with the first hop and at least one word with the answer all those memories that conform are called valid memories and the goal during training is to rank them higher than invalid memories using the same ranking criteria as during strongly supervised training lstm standard lstm model trained using question answer pairs only also weakly supervised for 
more detail see source code is available at https results we report variety of design choices bow vs position encoding pe sentence representation ii training on all tasks independently vs jointly training joint training used an embedding dimension of while independent training used iii two phase training linear start ls where softmaxes are removed initially vs training with softmaxes from the start iv varying memory hops from to the results across all tasks are given in table for the training set along with the mean performance for training they show number of interesting points the best models are reasonably close to the supervised models for memnn vs for with position encoding linear start random noise jointly trained and for memnn vs for with position encoding linear start random noise although the supervised models are still superior all variants of our proposed model comfortably beat the weakly supervised baseline methods the position encoding pe representation improves over bow as demonstrated by clear improvements on tasks and where word ordering is particularly important the linear start ls to training seems to help avoid local minima see task in table where pe alone gets error while using ls reduces it to jittering the time index with random empty memories rn as described in section gives small but consistent boost in performance especially for the smaller training set joint training on all tasks helps importantly more computational hops give improved performance we give examples of the hops performed via the values of eq over some illustrative examples in fig and in the supplementary material baseline task supporting fact supporting facts supporting facts argument relations argument relations questions counting simple negation indefinite knowledge basic coreference conjunction compound coreference time reasoning basic deduction basic induction positional reasoning size reasoning path finding agent motivation mean error failed tasks err strongly supervised memnn lstm memnn wsh bow on training data mean error failed tasks err pe pe ls pe ls rn hop hops pe ls pe ls joint joint hops pe ls joint pe ls rn joint pe ls lw joint table test error rates on the qa tasks for models using training examples mean test errors for training examples are shown at the bottom key bow representation pe position encoding representation ls linear start training rn random injection of time index noise lw weight tying if not stated adjacent weight tying is used joint joint training on all tasks as opposed to training language modeling experiments the goal in language modeling is to predict the next word in text sequence given the previous words we now explain how our model can easily be applied to this task more detailed results for the training set can be found in the supplementary material following we found adding more solves tasks and see the supplementary material story supporting fact support hop hop daniel went to the bathroom mary travelled to the hallway john went to the bedroom john travelled to the bathroom yes mary went to the office where is john answer bathroom prediction bathroom hop story supporting facts john dropped the milk john took the milk there sandra went back to the bathroom john moved to the hallway mary went back to the bedroom where is the milk answer hallway support hop yes prediction hallway story basic induction support hop hop brian is frog yes lily is gray brian is yellow yes julius is green greg is frog yes what color is greg answer yellow prediction yellow hop story size 
[Table: validation and test perplexities on Penn Treebank and on the Wikipedia-text corpus for RNN, LSTM, and SCRN baselines and for our model with varying numbers of hidden units, memory hops, and memory sizes.] Table caption: the perplexity on the test sets of the Penn Treebank and Wikipedia-text corpora. Note that increasing the number of memory hops improves performance.

Figure caption: average activation weight of memory positions during the memory hops. White color indicates where the model is attending during the k-th hop. For clarity, each row is normalized to have a maximum value of 1. The model is trained on Penn Treebank (left) and on the Wikipedia-text dataset (right).

We now operate on the word level, as opposed to the sentence level. Thus the previous words in the sequence (including the current one) are embedded into memory separately. Each memory cell holds only a single word, so there is no need for the BoW or linear-mapping representations used in the QA tasks. We employ the temporal embedding approach described earlier. Since there is no longer any question, the query vector is fixed to a constant vector, without embedding. The output softmax predicts which word in the vocabulary is next in the sequence, and the model is trained by backpropagating the error through multiple memory layers, in the same manner as in the QA tasks. To aid training, we apply ReLU operations to half of the units in each layer. We use layer-wise weight sharing: the query (input) weights of each layer are the same, and the output weights of each layer are the same. As noted earlier, this makes our architecture closely related to an RNN, which is traditionally used for language modeling; however, here the sequence over which the network is recurrent is not in the text but in the memory hops. Furthermore, the weight tying restricts the number of parameters in the model, helping generalization for the deeper models, which we find to be effective for this task.

We use two different datasets. Penn Treebank consists of a corpus of words over a fixed vocabulary; the same preprocessing as in prior work was used. The second dataset is a version of the first portion of the characters dumped from Wikipedia, split into training, validation, and test character sets; all words occurring fewer than a threshold number of times are replaced with the unk token, giving a fixed vocabulary size.

Training details. The training procedure we use is the same as for the QA tasks, except for the following. For each update, the norm of the whole gradient of all parameters is measured, and if it is larger than a threshold it is scaled down to have that norm; this clipping was crucial for good performance. We use a learning-rate annealing schedule from prior work: if the validation cost has not decreased after one epoch, the learning rate is scaled down by a constant factor, and training terminates when the learning rate drops below a minimum value (after a fixed number of epochs or so). Weights are initialized randomly, and the batch size is fixed. On the Penn Treebank dataset we repeat each training run several times with different random initializations and pick the one with the smallest validation cost; however, we have done only a single training run on the Wikipedia-text dataset due to time constraints.

Results. The table above compares our model to RNN, LSTM, and structurally constrained recurrent net (SCRN)
baselines on the two benchmark datasets note that the baseline architectures were tuned in to give optimal our approach achieves lower perplexity on both datasets vs for on penn and vs for lstm on note that has more parameters than rnns with the same number of hidden units while lstm has more parameters we also vary the number of hops and memory size of our showing the contribution of both to performance note in particular that increasing the number of hops helps in fig we show how operates on memory with multiple hops it shows the average weight of the activation of each memory position over the test set we can see that some hops concentrate only on recent words while other hops have more broad attention over all memory locations which is consistent with the idea that succesful language models consist of smoothed model and cache interestingly it seems that those two types of hops tend to alternate also note that unlike traditional rnn the cache does not decay exponentially it has roughly the same average activation across the entire memory this may be the source of the observed improvement in language modeling conclusions and future work in this work we showed that neural network with an explicit memory and recurrent attention mechanism for reading the memory can be successfully trained via backpropagation on diverse tasks from question answering to language modeling compared to the memory network implementation of there is no supervision of supporting facts and so our model can be used in wider range of settings our model approaches the same performance of that model and is significantly better than other baselines with the same level of supervision on language modeling tasks it slightly outperforms tuned rnns and lstms of comparable complexity on both tasks we can see that increasing the number of memory hops improves performance however there is still much to do our model is still unable to exactly match the performance of the memory networks trained with strong supervision and both fail on several of the qa tasks furthermore smooth lookups may not scale well to the case where larger memory is required for these settings we plan to explore multiscale notions of attention or hashing as proposed in acknowledgments the authors would like to thank armand joulin tomas mikolov antoine bordes and sumit chopra for useful comments and valuable discussions and also the fair infrastructure team for their help and support in the qa tasks the gradient of each weight matrix is measured separately they tuned the on penn treebank and used them on without additional tuning except for the number of hidden units see for more detail references atkeson and schaal neural networks for robot learning neurocomputing bahdanau cho and bengio neural machine translation by jointly learning to align and translate in international conference on learning representations iclr bengio ducharme vincent and janvin neural probabilistic language model mach learn mar chung cho and bengio empirical evaluation of gated recurrent neural networks on sequence modeling arxiv preprint das giles and sun learning grammars capabilities and limitations of recurrent neural network with an external stack memory in in proceedings of the fourteenth annual conference of cognitive science society goodman bit of progress in language modeling corr graves generating sequences with recurrent neural networks arxiv preprint graves wayne and danihelka neural turing machines arxiv preprint gregor danihelka graves and wierstra draw recurrent neural 
network for image generation corr hochreiter and schmidhuber long memory neural computation joulin and mikolov inferring algorithmic patterns with recurrent nets nips greff gomez and schmidhuber clockwork rnn in icml marcus marcinkiewicz and santorini building large annotated corpus of english the penn treebank comput june mikolov statistical language models based on neural networks ph thesis brno university of technology mikolov joulin chopra mathieu and ranzato learning longer memory in recurrent neural networks arxiv preprint mozer and das connectionist symbol manipulator that discovers the structure of languages nips pages peng lu li and wong towards neural reasoning arxiv preprint pollack the induction of dynamical recognizers machine learning steinbuch and piske learning matrices and their applications ieee transactions on electronic computers sundermeyer and ney lstm neural networks for language modeling in interspeech pages taylor pattern recognition by means of automatic analogue apparatus proceedings of the institution of electrical engineers weston bordes chopra and mikolov towards question answering set of prerequisite toy tasks arxiv preprint weston chopra and bordes memory networks in international conference on learning representations iclr xu ba kiros cho courville salakhutdinov zemel and bengio show attend and tell neural image caption generation with visual attention arxiv preprint zaremba sutskever and vinyals recurrent neural network regularization arxiv preprint 
spectral representations for convolutional neural networks oren rippel department of mathematics massachusetts institute of technology jasper snoek twitter and harvard seas jsnoek rippel ryan adams twitter and harvard seas rpa abstract discrete fourier transforms provide significant speedup in the computation of convolutions in deep learning in this work we demonstrate that beyond its advantages for efficient computation the spectral domain also provides powerful representation in which to model and train convolutional neural networks cnns we employ spectral representations to introduce number of innovations to cnn design first we propose spectral pooling which performs dimensionality reduction by truncating the representation in the frequency domain this approach preserves considerably more information per parameter than other pooling strategies and enables flexibility in the choice of pooling output dimensionality this representation also enables new form of stochastic regularization by randomized modification of resolution we show that these methods achieve competitive results on classification and approximation tasks without using any dropout or finally we demonstrate the effectiveness of spectral parameterization of convolutional filters while this leaves the underlying model unchanged it results in representation that greatly facilitates optimization we observe on variety of popular cnn configurations that this leads to significantly faster convergence during training introduction convolutional neural networks cnns lecun et have been used to achieve unparalleled results across variety of benchmark machine learning problems and have been applied successfully throughout science and industry for tasks such as large scale image and video classification krizhevsky et karpathy et one of the primary challenges of cnns however is the computational expense necessary to train them in particular the efficient implementation of convolutional kernels has been key ingredient of any successful use of cnns at scale due to its efficiency and the potential for amortization of cost the discrete fourier transform has long been considered by the deep learning community to be natural approach to fast convolution bengio lecun more recently mathieu et al vasilache et al have demonstrated that convolution can be computed significantly faster using discrete fourier transforms than directly in the spatial domain even for tiny filters this computational gain arises from the convenient property of operator duality between convolution in the spatial domain and multiplication in the frequency domain in this work we argue that the frequency domain offers more than computational trick for convolution it also provides powerful representation for modeling and training cnns frequency decomposition allows studying an input across its various of variation and as such provides natural framework for the analysis of data with spatial coherence we introduce two applications of spectral representations these contributions can be applied independently of each other spectral parametrization we propose the idea of learning the filters of cnns directly in the frequency domain namely we parametrize them as maps of complex numbers whose discrete fourier transforms correspond to the usual filter representations in the spatial domain because this mapping corresponds to unitary transformations of the filters this reparametrization does not alter the underlying model however we argue that the spectral representation provides an 
appropriate domain for parameter optimization as the frequency basis captures typical filter structure well more specifically we show that filters tend to be considerably sparser in their spectral representations thereby reducing the redundancy that appears in spatial domain representations this provides the optimizer with more meaningful directions that can be taken advantage of with standard preconditioning we demonstrate the effectiveness of this reparametrization on number of cnn optimization tasks converging times faster than the standard spatial representation spectral pooling pooling refers to dimensionality reduction used in cnns to impose capacity bottleneck and facilitate computation we introduce new approach to pooling we refer to as spectral pooling it performs dimensionality reduction by projecting onto the frequency basis set and then truncating the representation this approach alleviates number of issues present in existing pooling strategies for example while max pooling is featured in almost every cnn and has had great empirical success one major criticism has been its poor preservation of information hinton this weakness is exhibited in two ways first along with other pooling approaches it implies very sharp dimensionality reduction by at least factor of every time it is applied on inputs moreover while it encourages translational invariance it does not utilize its capacity well to reduce approximation loss the maximum value in each window only reflects very local information and often does not represent well the contents of the window in contrast we show that spectral pooling preserves considerably more information for the same number of parameters it achieves this by exploiting the of typical inputs in their ratio as function of frequency for example natural images are known to have an expected power spectrum that follows an inverse power law power is heavily concentrated in the lower frequencies while higher frequencies tend to encode noise torralba oliva as such the elimination of higher frequencies in spectral pooling not only does minimal damage to the information in the input but can even be viewed as type of denoising in addition spectral pooling allows us to specify any arbitrary output map dimensionality this permits reduction of the map dimensionality in slow and controlled manner as function of network depth also since truncation of the frequency representation exactly corresponds to reduction in resolution we can supplement spectral pooling with stochastic regularization in the form of randomized resolution spectral pooling can be implemented at negligible additional computational cost in convolutional neural networks that employ fft for convolution kernels as it only requires matrix truncation we also note that these two ideas are both compatible with the method of batch normalization ioffe szegedy permitting even better training efficiency the discrete fourier transform the discrete fourier transform dft is powerful way to decompose spatiotemporal signal in this section we provide an introduction to number of components of the dft drawn upon in this work we confine ourselves to the dft although all properties and results presented can be easily extended to other input dimensions given an input cm we address the constraint of real inputs in subsection its dft cm is given by hw mh nw xmn the dft is linear and unitary and so its inverse transform is given by namely the conjugate of the transform itself dft basis functions examples of pairs conjugate symm 
Figure caption: properties of discrete Fourier transforms. (a) All discrete Fourier basis functions of a given map size; note the equivalence of some of these due to conjugate symmetry. (b) Examples of input images and their frequency representations, presented with the frequency maps shifted to center the DC component; rays in the frequency domain correspond to spatial-domain edges aligned perpendicular to them. (c) Conjugate symmetry patterns for inputs with odd (top) and even (bottom) dimensionalities; orange: constraint, blue: no constraint, gray: value fixed by conjugate symmetry.

Intuitively, the DFT coefficients resulting from projections onto the different frequencies can be thought of as measures of correlation of the input with basis functions of varying frequency. See the figure above for a visualization of the DFT basis functions and for examples of input / frequency-map pairs.

The widespread deployment of the DFT can be partially attributed to the development of the fast Fourier transform (FFT), a mainstay of signal processing and a standard component of most math libraries. The FFT is an efficient implementation of the DFT, with time complexity n log n in the number of input elements.

Convolution using the DFT. One powerful property of frequency analysis is the operator duality between convolution in the spatial domain and multiplication in the spectral domain. Namely, given two inputs \(x, f \in \mathbb{R}^{M \times N}\), we may write \(\mathcal{F}(x * f) = \mathcal{F}(x) \odot \mathcal{F}(f)\), where by \(*\) we denote convolution and by \(\odot\) an element-wise product.

Approximation error. The unitarity of the Fourier basis makes it convenient for the analysis of approximation loss. More specifically, Parseval's theorem links the loss between any input \(x\) and its approximation \(\hat{x}\) to the corresponding loss in the frequency domain: \(\|x - \hat{x}\|_2^2 = \|\mathcal{F}(x) - \mathcal{F}(\hat{x})\|_2^2\). An equivalent statement also holds for the inverse DFT operator. This allows us to quickly assess how an input is affected by any distortion we might make to its frequency representation.

Conjugate symmetry constraints. In the following sections of the paper we will propagate signals and their gradients through DFT and inverse DFT layers. In these layers we represent the frequency domain in the complex field; however, for all layers apart from these, we would like to ensure that both the signal and its gradient are constrained to the reals. A necessary and sufficient condition to achieve this is conjugate symmetry in the frequency domain: namely, for any transform \(y = \mathcal{F}(x)\) of some input, it must hold that \(y_{mn} = \overline{y_{(M-m) \bmod M,\ (N-n) \bmod N}}\). Thus, intuitively, given the left half of our frequency map, the diminished number of degrees of freedom allows us to reconstruct the right half. In effect, this allows us to store approximately half the parameters that would otherwise be necessary. Note, however, that this does not reduce the effective dimensionality, since each element consists of real and imaginary components. The conjugate symmetry constraints are visualized in the figure above. Given a real input, its DFT will necessarily meet these constraints; this symmetry can be observed in the frequency representations of the example images. However, since we seek to optimize over parameters embedded directly in the frequency domain, we need to pay close attention to ensure the conjugate symmetry constraints are enforced upon inversion back to the spatial domain (see below).
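As a quick sanity check of the two identities above, the following NumPy snippet (ours, not from the paper) verifies the convolution-multiplication duality for circular convolution and the Parseval identity under the unitary ("ortho") normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 8
x = rng.normal(size=(M, N))
f = rng.normal(size=(M, N))

# Duality: the DFT of a circular convolution equals the element-wise
# product of the DFTs.
conv = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(f)))
direct = np.zeros_like(x)
for m in range(M):
    for n in range(N):
        direct[m, n] = sum(
            x[i, j] * f[(m - i) % M, (n - j) % N]
            for i in range(M) for j in range(N)
        )
print(np.allclose(conv, direct))        # True

# Parseval: with the unitary normalization, l2 distances are preserved
# between the spatial and frequency domains.
x_hat = x + 0.1 * rng.normal(size=(M, N))
spatial = np.linalg.norm(x - x_hat) ** 2
spectral = np.linalg.norm(np.fft.fft2(x, norm="ortho")
                          - np.fft.fft2(x_hat, norm="ortho")) ** 2
print(np.allclose(spatial, spectral))   # True
```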
Differentiation. Here we discuss how to propagate the gradient through a Fourier transform layer; this analysis can be similarly applied to the inverse DFT layer. Define \(x \in \mathbb{R}^{M \times N}\) and \(y = \mathcal{F}(x)\) to be the input and output of a DFT layer respectively, and let a real-valued loss function applied to \(y\) be considered as the remainder of the forward pass. Since the DFT is a linear operator, its gradient is simply the transformation matrix itself. During back-propagation, then, this gradient is conjugated, and this, by DFT unitarity, corresponds to the application of the inverse transform. There is an intricacy that makes matters a bit more complicated: namely, the conjugate symmetry condition discussed above introduces redundancy. Inspecting the conjugate symmetry constraints, we note that they enforce real-valuedness at a handful of special indices (the exact set differs for odd and even dimensionalities); for all other indices they enforce conjugate equality of pairs of distinct elements. These conditions imply that the number of unconstrained parameters is about half of the map in its entirety.

Spectral pooling. The choice of a pooling technique boils down to the selection of an appropriate set of basis functions to project onto, and some truncation of this representation to establish an approximation to the original input. The idea behind spectral pooling stems from the observation that the frequency domain provides an ideal basis for inputs with spatial structure. We first discuss the technical details of this approach, and then its advantages.

Spectral pooling is straightforward to understand and to implement. We assume we are given an input \(x \in \mathbb{R}^{M \times N}\) and some desired output map dimensionality \(H \times W\). First, we compute the discrete Fourier transform of the input into the frequency domain, \(y = \mathcal{F}(x) \in \mathbb{C}^{M \times N}\), and assume that the DC component has been shifted to the center of the domain, as is standard practice. We then crop the frequency representation by maintaining only the central \(H \times W\) submatrix of frequencies, which we denote \(\hat{y}\). Finally, we map this approximation back into the spatial domain by taking its inverse DFT, \(\hat{x} = \mathcal{F}^{-1}(\hat{y})\). These steps are listed in Algorithm 1. Note that some of the conjugate symmetry special cases described above might be broken by this truncation; as such, to ensure that the output is real-valued, we must treat these individually with TreatCornerCases, which can be found in the supplementary material. The figure below demonstrates the effect of this pooling for various choices of the output dimensionality. The backpropagation procedure is quite intuitive and can be found in Algorithm 2; RemoveRedundancy and RecoverMap can be found in the supplementary material. We addressed above the nuances of differentiating through DFT and inverse DFT layers; apart from these, the last component left undiscussed is differentiation through the truncation of the frequency matrix, but this corresponds to a simple zero-padding of the gradient maps to the appropriate dimensions. In practice, the DFTs are the computational bottlenecks of spectral pooling. However, we note that in convolutional neural networks that employ FFTs for convolution computation, spectral pooling can be implemented at negligible additional computational cost, since the DFT is performed regardless. We proceed to discuss a number of properties of spectral pooling, which we then test comprehensively in the experiments section.

Algorithm 1 (spectral pooling). Input: map \(x\), output size \(H \times W\). Output: pooled map \(\hat{x}\).
  1. \(y \leftarrow \mathcal{F}(x)\)
  2. \(\hat{y} \leftarrow\) CropSpectrum\((y, H \times W)\)
  3. \(\hat{y} \leftarrow\) TreatCornerCases\((\hat{y})\)
  4. \(\hat{x} \leftarrow \mathcal{F}^{-1}(\hat{y})\)

Algorithm 2 (spectral pooling back-propagation). Input: gradient with respect to the pooled map. Output: gradient with respect to the input map.
  1. transform the incoming gradient to the frequency domain
  2. RemoveRedundancy
  3. PadSpectrum to \(M \times N\)
  4. RecoverMap
  5. apply the inverse transform to obtain the gradient with respect to the input

Figure caption: approximations for different pooling schemes, for different factors of dimensionality reduction. Spectral pooling projects onto the Fourier basis and truncates it as desired; this retains significantly more information and permits the selection of any arbitrary output map dimensionality.
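A minimal NumPy sketch of the forward pass of Algorithm 1 (ours, not the authors' implementation). For brevity it skips the explicit corner-case treatment and simply discards the small imaginary residual left after truncation; the rescaling constant is our own choice so that the average intensity is preserved.

```python
import numpy as np

def spectral_pool(x, out_h, out_w):
    """Spectral pooling of a 2-D map: DFT, crop central frequencies, inverse DFT."""
    m, n = x.shape
    y = np.fft.fftshift(np.fft.fft2(x))              # DC component moved to the center
    top, left = (m - out_h) // 2, (n - out_w) // 2
    y_crop = y[top:top + out_h, left:left + out_w]   # keep the central out_h x out_w block
    x_pool = np.fft.ifft2(np.fft.ifftshift(y_crop))
    x_pool *= (out_h * out_w) / (m * n)              # preserve average intensity
    return np.real(x_pool)                           # drop the small imaginary residual

# toy usage: pool a 32x32 map down to 12x12 (any output size is allowed)
x = np.random.default_rng(0).normal(size=(32, 32))
print(spectral_pool(x, 12, 12).shape)   # (12, 12)
```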
Information preservation. Spectral pooling can significantly increase the amount of retained information relative to max pooling, in two distinct ways. First, its representation maintains more information for the same number of degrees of freedom: spectral pooling reduces the information capacity by tuning the resolution of the input precisely to match the desired output dimensionality. This operation can also be viewed as linear low-pass filtering, and it exploits the non-uniformity of the spectral density of the data with respect to frequency; that is, the power spectra of inputs with spatial structure, such as natural images, carry most of their mass on lower frequencies. As such, since the amplitudes of the higher frequencies tend to be small, Parseval's theorem informs us that their elimination will result in a representation that minimizes the distortion after reconstruction. Second, spectral pooling does not suffer from the sharp reduction in output dimensionality exhibited by other pooling techniques. More specifically, for pooling strategies such as max pooling, the number of degrees of freedom is reduced by at least a constant factor determined by the stride. In contrast, spectral pooling allows us to specify any arbitrary output dimensionality, and thus allows us to reduce the map size gradually as a function of layer.

Regularization via resolution corruption. We note that the filtering radii, say r_h and r_w, can be chosen to be smaller than the output map dimensionalities. Namely, while we truncate our input frequency map to the output size, we can further zero out all frequencies outside the central r_h x r_w square. While this maintains the output dimensionality after applying the inverse DFT, it effectively reduces the resolution of the output; this can be seen in the pooling figure above. This allows us to introduce regularization in the form of random resolution reduction. We apply this stochastically by assigning a distribution p(r) to the frequency truncation radius; for simplicity we apply the same truncation on both axes, sampling this random radius at each iteration and wiping out all frequencies outside the square of that size. Note that this can be regarded as an application of nested dropout (Rippel et al.) on both dimensions of the frequency decomposition of our input. In practice, we have had success choosing p(r) to be a uniform distribution stretching from some minimum value all the way up to the highest possible resolution.

Spectral parametrization of CNNs. Here we demonstrate how to learn the filters of CNNs directly in their frequency-domain representations. This offers significant advantages over the traditional spatial representation, which we show empirically in the experiments section. Let us assume that for some layer of our convolutional neural network we seek to learn filters of a given spatial size. To do this, we parametrize each filter in our network directly in the frequency domain; to attain its spatial representation, we simply compute its inverse DFT.

Figure caption: learning dynamics of CNNs with spectral parametrization. (a) Progression over several epochs of filters parametrized in the frequency domain; each pair of columns corresponds to the spectral parametrization of a filter and its inverse transform to the spatial domain. Filter representations tend to be more local in the Fourier basis. (b) Sparsity patterns for the different parametrizations: spectral representations tend to be considerably sparser. (c) Distributions of momenta across parameters for CNNs trained with and without spectral parametrization: in the spectral parametrization, considerably fewer parameters are updated. (The histograms have been produced after a number of epochs of training by each method, but are similar throughout.)

From this point on we proceed as we would for any standard CNN, by
computing the convolution of the filter with inputs in our and so on the through the inverse dft is virtually identical to the one of spectral pooling described in section we compute the gradient as outlined in subsection being careful to obey the conjugate symmetry constraints discussed in subsection we emphasize that this approach does not change the underlying cnn model in any way only the way in which it is parametrized hence this only affects the way the solution space is explored by the optimization procedure leveraging filter structure this idea exploits the observation that cnn filters have very characteristic structure that reappears across data sets and problem domains that is cnn weights can typically be captured with small number of degrees of freedom represented in the spatial domain however this results in significant redundancy the frequency domain on the other hand provides an appealing basis for filter representation characteristic filters gabor filters are often very localized in their spectral representations this follows from the observation that filters tend to feature very specific and orientations hence they tend to have nonzero support in narrow set of frequency components this hypothesis can be observed qualitatively in figure and quantitatively in figure empirically in section we observe that spectral representations of filters leads to convergence speedup by times we remark that had we trained our network with standard stochastic gradient descent the linearity of differentiation and parameter update would have resulted in exactly the same filters regardless of whether they were represented in the spatial or frequency domain during training this is true for any invertible linear transformation of the parameter space however as discussed this parametrization corresponds to rotation to more meaningful axis alignment where the number of relevant elements has been significantly reduced since modern optimizers implement update rules that consist of adaptive rescaling they are able to leverage this axis alignment by making large updates to small number of elements this can be seen quantitatively in figure where the optimizer adam kingma ba in this case only touches small number of elements in its updates there exist number of extensions of the above approach we believe would be quite promising in future work we elaborate on these in the discussion experiments we demonstrate the effectiveness of spectral representations in number of different experiments we ran all experiments on code optimized for the xeon phi coprocessor we used spearmint snoek et for bayesian optimization of hyperparameters with concurrent evaluations max pooling spectral pooling method stochastic pooling maxout deeply supervised kf kf spectral pooling fraction of parameters kept approximation loss for the imagenet validation set classification rates figure average information dissipation for the imagenet validation set as function of fraction of parameters kept this is measured in error normalized by the input norm the red horizontal line indicates the best error rate achievable by max pooling test errors on without data augmentation of the optimal spectral pooling architecture as compared to current approaches stochastic pooling zeiler fergus maxout goodfellow et lin et and nets lee et spectral pooling information preservation we test the information retainment properties of spectral pooling on the validation set of imagenet russakovsky et for the different pooling strategies we plot the average 
approximation loss resulting from pooling to different dimensionalities this can be seen in figure we observe the two aspects discussed in subsection first spectral pooling permits significantly better reconstruction for the same number of parameters second for max pooling the only knob controlling the coarseness of approximation is the stride which results in severe quantization and constraining lower bound on preserved information marked in the figure as horizontal red line in contrast spectral pooling permits the selection of any output dimensionality thereby producing smooth curve over all frequency truncation choices classification with convolutional neural networks we test spectral pooling on different classification tasks we hyperparametrize and optimize the following cnn architecture ga softmax here by cf we denote convolutional layer with filters each of size by spectral pooling layer with output dimensionality and ga the global averaging layer described in lin et al we the number of filters per layer as every convolution and pooling layer is followed by relu nonlinearity we let hm be the height of the map of layer hence each spectral pooling layer reduces each output map dimension by factor we assign frequency dropout distribution pr bcm hm hm for layer total layers and with cm for some constants this parametrization can be thought of as some linear parametrization of the dropout rate as function of the layer we perform hyperparameter optimization on the dimensionality decay rate number of layers resolution randomization hyperparameters weight decay rate in momentum in and initial learning rate in we train each model for epochs and anneal the learning rate by factor of at epochs and we intentionally use no dropout nor data augmentation as these introduce number of additional hyperparameters which we want to disambiguate as alternative factors for success perhaps unsurprisingly the optimal hyperparameter configuration assigns the slowest possible layer map decay rate it selects randomized resolution reduction constants of about momentum of about and initial learning rate these settings allow us to attain classification rates of on and on these are competitive results among approaches that do not employ data augmentation comparison to approaches from the literature can be found in table spectral parametrization of cnns we demonstrate the effectiveness of spectral parametrization on number of cnn optimization tasks for different architectures and for different filter sizes we use the notation mpts to denote max pooling layer with size and stride and fcf is layer with filters size spatial spectral size deep generic sp pooling architecture filter size speedup factor deep deep generic generic sp pooling sp pooling speedup factors training curves figure optimization of cnns via spectral parametrization all experiments include data augmentation training curves for the various experiments the remainder of the optimization past the matching point is marked in light blue the red diamonds indicate the relative epochs in which the asymptotic error rate of the spatial approach is achieved speedup factors for different architectures and filter sizes speedup is observed even for tiny filters the first architecture is the generic one used in variety of deep learning papers such as krizhevsky et al snoek et al krizhevsky kingma ba softmax fc the second architecture we consider is the one employed in snoek et al which was shown to attain competitive classification rates it is deeper and more 
complex ga softmax the third architecture considered is the spectral pooling network from equation to increase the difficulty of optimization and reflect real training conditions we supplemented all networks with data augmentation in the form of translations horizontal reflections hsv perturbations and dropout we initialized both spatial and spectral filters in the spatial domain as the same values for the spectral parametrization experiments we then computed the fourier transform of these to attain their frequency representations we optimized all networks using the adam kingma ba update rule variant of rmsprop that we find to be fast and robust optimizer the training curves can be found in figure and the respective factors of convergence speedup in table surprisingly we observe speedup even for tiny filters of size where we did not expect the frequency representation to have much room to exploit spatial structure discussion and remaining open problems in this work we demonstrated that spectral representations provide rich spectrum of applications we introduced spectral pooling which allows pooling to any desired output dimensionality while retaining significantly more information than other pooling approaches in addition we showed that the fourier functions provide suitable basis for filter parametrization as demonstrated by faster convergence of the optimization procedure one possible future line of work is to embed the network in its entirety in the frequency domain in models that employ fourier transforms to compute convolutions at every convolutional layer the input is and the elementwise multiplication output is then inverse these transformations are very computationally intensive and as such it would be desirable to strictly remain in the frequency domain however the reason for these repeated transformations is the application of nonlinearities in the forward domain if one were to propose sensible nonlinearity in the frequency domain this would spare us from the incessant domain switching acknowledgements we would like to thank prabhat michael gelbart and matthew johnson for useful discussions and assistance throughout this project jasper snoek was fellow in the harvard center for research on computation and society this work is supported by the applied mathematics program within the office of science advanced scientific computing research of the department of energy under contract no this work used resources of the national energy research scientific computing center nersc we thank helen he and doug jacobsen for providing us with access to the babbage testbed at nersc references bengio yoshua and lecun yann scaling learning algorithms towards ai in bottou chapelle olivier decoste and weston eds large scale kernel machines mit press goodfellow ian david mirza mehdi courville aaron and bengio yoshua maxout networks corr url http hinton geoffrey what wrong with convolutional nets mit brain and cognitive sciences fall colloquium series dec url http hinton geoffrey ask me anything geoffrey hinton reddit machine learning url https ioffe sergey and szegedy christian batch normalization accelerating deep network training by reducing internal covariate shift corr url http karpathy andrej toderici george shetty sanketh leung thomas sukthankar rahul and li video classification with convolutional neural networks in computer vision and pattern recognition kingma diederik and ba jimmy adam method for stochastic optimization corr url http krizhevsky alex learning multiple layers of features from 
tiny images technical report krizhevsky sutskever ilya and hinton geoffrey imagenet classification with deep convolutional neural networks in advances in neural information processing systems lecun yann boser bernhard denker henderson howard hubbard and handwritten digit recognition with network in advances in neural information processing systems lee xie saining gallagher patrick zhang zhengyou and tu zhuowen nets corr url http lin min chen qiang and yan shuicheng network in network corr url http mathieu henaff mikael and lecun yann fast training of convolutional networks through ffts corr url http rippel oren gelbart michael and adams ryan learning ordered representations with nested dropout in international conference on machine learning russakovsky olga deng jia su hao krause jonathan satheesh sanjeev ma sean huang zhiheng karpathy andrej khosla aditya bernstein michael berg alexander and li imagenet large scale visual recognition challenge international journal of computer vision doi snoek jasper larochelle hugo and adams ryan prescott practical bayesian optimization of machine learning algorithms in neural information processing systems snoek jasper rippel oren swersky kevin kiros ryan satish nadathur sundaram narayanan patwary md mostofa ali prabhat and adams ryan scalable bayesian optimization using deep neural networks in international conference on machine learning torralba antonio and oliva aude statistics of natural image categories network august issn vasilache nicolas johnson jeff mathieu chintala soumith piantino serkan and lecun yann fast convolutional nets with fbfft gpu performance evaluation corr url http zeiler matthew and fergus rob stochastic pooling for regularization of deep convolutional neural networks corr url http 
online gradient boosting alina beygelzimer yahoo labs new york ny beygel elad hazan princeton university princeton nj ehazan satyen kale yahoo labs new york ny satyen haipeng luo princeton university princeton nj haipengl abstract we extend the theory of boosting for regression problems to the online learning setting generalizing from the batch setting for boosting the notion of weak learning algorithm is modeled as an online learning algorithm with linear loss functions that competes with base class of regression functions while strong learning algorithm is an online learning algorithm with smooth convex loss functions that competes with larger class of regression functions our main result is an online gradient boosting algorithm that converts weak online learning algorithm into strong one where the larger class of functions is the linear span of the base class we also give simpler boosting algorithm that converts weak online learning algorithm into strong one where the larger class of functions is the convex hull of the base class and prove its optimality introduction boosting algorithms are ensemble methods that convert learning algorithm for base class of models with weak predictive power such as decision trees into learning algorithm for class of models with stronger predictive power such as weighted majority vote over base models in the case of classification or linear combination of base models in the case of regression boosting methods such as adaboost and gradient boosting have found tremendous practical application especially using decision trees as the base class of models these algorithms were developed in the batch setting where training is done over fixed batch of sample data however with the recent explosion of huge data sets which do not fit in main memory training in the batch setting is infeasible and online learning techniques which train model in one pass over the data have proven extremely useful natural goal therefore is to extend boosting algorithms to the online learning setting indeed there has already been some work on online boosting for classification problems of these the work by chen et al provided the first theoretical study of online boosting for classification which was later generalized by beygelzimer et al to obtain optimal and adaptive online boosting algorithms however extending boosting algorithms for regression to the online setting has been elusive and escaped theoretical guarantees thus far in this paper we rigorously formalize the setting of online boosting for regression and then extend the very commonly used gradient boosting methods to the online setting providing theoretical guarantees on their performance the main result of this paper is an online boosting algorithm that competes with any linear combination the base functions given an online linear learning algorithm over the base class this algorithm is the online analogue of the batch boosting algorithm of zhang and yu and in fact our algorithmic technique when specialized to the batch boosting setting provides exponentially better convergence guarantees we also give an online boosting algorithm that competes with the best convex combination of base functions this is simpler algorithm which is analyzed along the lines of the frankwolfe algorithm while the algorithm has weaker theoretical guarantees it can still be useful in practice we also prove that this algorithm obtains the optimal regret bound up to constant factors for this setting finally we conduct some experiments which show that 
our online boosting algorithms do obtain performance improvements over classes of base learners related work while the theory of boosting for classification in the batch setting is see the theory of boosting for regression is comparatively foundational theory of boosting for regression can be found in the statistics literature where boosting is understood as greedy stagewise algorithm for fitting of additive models the goal is to achieve the performance of linear combinations of base models and to prove convergence to the performance of the best such linear combination while the earliest works on boosting for regression such as do not have such convergence proofs later works such as do have convergence proofs but without bound on the speed of convergence bounds on the speed of convergence have been obtained by and helmbold relying on somewhat strong assumption on the performance of the base learning algorithm approach to boosting for regression was taken by freund and schapire who give an algorithm that reduces the regression problem to classification and then applies adaboost the corresponding proof of convergence relies on an assumption on the induced classification problem which may be hard to satisfy in practice the strongest result is that of zhang and yu who prove convergence to the performance of the best linear combination of base functions along with bound on the rate of convergence making essentially no assumptions on the performance of the base learning algorithm telgarsky proves similar results for logistic or similar loss using slightly simpler boosting algorithm the results in this paper are generalization of the results of zhang and yu to the online setting however we emphasize that this generalization is nontrivial and requires algorithmic ideas and proof techniques indeed we were not able to directly generalize the analysis in by simply adapting the techniques used in recent online boosting work but we made use of the classical algorithm on the other hand while an important part of the convergence analysis for the batch setting is to show statistical consistency of the algorithms in the online setting we only need to study the empirical convergence that is the regret which makes our analysis much more concise setup examples are chosen from feature space and the prediction space is rd let denote some norm in rd in the setting for online regression in each round for an adversary selects an example xt and loss function rd and presents xt to the online learner the online learner outputs prediction yt rd obtains the loss function and incurs loss yt let denote reference class of regression functions rd and let denote class of loss functions rd also let be function we say that the function class is online learnable for losses in with regret if there is an online learning algorithm that for every and every sequence xt for chosen by the adversary generates xt rd such that xt inf xt if the online learning algorithm is randomized we require the above bound to hold with high probability the above definition is simply the online generalization of standard empirical risk minimization erm in the batch setting concrete example is regression the prediction space is for labeled data point the loss for the prediction is given by where is fixed loss function that is convex in the second argument such as squared loss logistic loss etc given batch of labeled data points xt yt and base class of regression functions say the set of bounded norm linear regressors an erm algorithm finds the function 
that minimizes \(\sum_t \ell(y_t, f(x_t))\). In the online setting, the adversary reveals the data in an online fashion, only presenting the true label after the online learner has chosen its prediction; thus, setting the loss function of round t to be the regression loss evaluated against the revealed label, we observe that if the online learning algorithm satisfies the regret bound above, then it makes predictions with total loss almost as small as that of the empirical risk minimizer, up to the regret term. If the base class is the set of all bounded-norm linear regressors, for example, the algorithm could be online gradient descent or the online Newton step.

At a high level, in the batch setting boosting is understood as a procedure that, given a batch of data and access to an ERM algorithm for a base function class (called a weak learner), obtains an approximate ERM algorithm for a richer function class (called a strong learner); generally, the richer class is the set of finite linear combinations of functions in the base class. The efficiency of boosting is measured by how many times the base ERM algorithm needs to be called (the number of boosting steps) to obtain an ERM algorithm for the richer function class within the desired approximation tolerance; convergence rates give bounds on how quickly the approximation error goes to zero as the number of boosting steps grows.

We now extend this notion of boosting to the online setting in the natural manner. To capture the full generality of the techniques, we also specify the class of loss functions that the online learning algorithm can work with. Informally, an online boosting algorithm is a reduction that, given access to an online learning algorithm A for a function class F and a loss function class, with a regret bound and a bound N on the total number of calls made in each iteration to copies of A, obtains an online learning algorithm for a richer function class, a richer loss function class, and a possibly larger regret. The bound N on the total number of calls made to all the copies of A corresponds to the number of boosting stages in the batch setting, and in the online setting it may be viewed as a resource constraint on the algorithm. The efficacy of the reduction is measured by the new regret bound, which is a function of the base regret, N, and certain parameters of the comparator class and loss function class; we desire online boosting algorithms whose average regret goes to zero quickly as N and T grow. We make the notions of richness in the above informal description precise now.

Comparator function classes. A function class F is said to be bounded if for all f in F and all x, the norm of f(x) is at most a fixed constant. Throughout this paper we assume that F is symmetric: if f is in F then so is its negation, and F contains the constant zero function, which (with some abuse of notation) we denote by 0. (There is a slight abuse of notation here: A(x) is not a function but rather the output of the online learning algorithm computed on the given example using its internal state. The symmetry assumption is without loss of generality, as will be seen momentarily: our base assumption only requires an online learning algorithm for F for linear losses. By running the Hedge algorithm on two copies of A, one of which receives the actual loss functions and the other receives their negations, we get an algorithm which competes with negations of functions in F, and with the constant zero function as well. Furthermore, since the loss functions are convex, indeed linear, this can be made into a deterministic reduction by choosing the convex combination of the outputs of the two copies of A with mixing weights given by the Hedge algorithm.)

Given F, we define two richer function classes: the convex hull of F, denoted CH(F), is the set of convex combinations of a finite number of functions in F, and the span of F, denoted span(F), is the set of linear combinations of finitely many functions in F. For any f in span(F), define \(\|f\|_1\) to be (essentially) the smallest total absolute weight \(\sum_g |w_g|\) over all ways of writing \(f = \sum_g w_g\, g\) as a finite linear combination of functions g in F. Since functions in span(F) are not bounded, it is not possible to obtain a uniform regret bound for all functions in span(F); rather, the regret of an online learning algorithm for span(F) is specified in terms of regret bounds \(R_f\) for individual comparator functions \(f \in \text{span}(F)\), viz. \(\sum_{t=1}^{T} \ell_t(y_t) - \sum_{t=1}^{T} \ell_t(f(x_t)) \le R_f\).
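To make the weak-learner interface concrete, the following is a minimal sketch (ours; the class and parameter names are illustrative, not from the paper) of the kind of base online learner mentioned above: projected online gradient descent over bounded-norm linear regressors, fed linear loss functions.

```python
import numpy as np

class OnlineLinearRegressor:
    """Weak online learner: competes with bounded-norm linear functions
    f(x) = w . x under linear loss functions, via projected online
    gradient descent."""

    def __init__(self, dim, radius=1.0, lr=0.1):
        self.w = np.zeros(dim)
        self.radius = radius      # norm bound on the comparator class
        self.lr = lr

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, grad_coeff):
        # Linear loss l_t(y) = grad_coeff * y, so the gradient w.r.t. w is
        # grad_coeff * x.  Take a gradient step and project back onto the ball.
        self.w -= self.lr * grad_coeff * x
        norm = np.linalg.norm(self.w)
        if norm > self.radius:
            self.w *= self.radius / norm
```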
Loss function classes. The base loss function class we consider is the set of all linear loss functions on R^d with Lipschitz constant bounded by 1. A function class that is online learnable with this loss function class is called online linear learnable for short. The richer loss function class we consider is a set of convex loss functions on R^d satisfying some regularity conditions, specified in terms of certain parameters described below. We define a few parameters of the class. For any radius, let B(D) denote the ball of that radius in R^d. The class is said to have Lipschitz constant L on B(D) if, for every loss function in the class and every point in B(D), there is an efficiently computable subgradient with norm at most L. Next, the class is said to be smooth on B(D) if the losses satisfy a standard smoothness condition (a quadratic upper bound in terms of the distance between points) on B(D). Finally, define the projection operator onto B(D) as the map sending any point to its nearest point in the ball.

Online boosting algorithms. The setup is that we are given a reference class F of functions with an online linear learning algorithm A with a given regret bound; for normalization, we also assume that the output of A at any time is bounded in norm by a fixed constant. We further assume that, for every radius, we can compute the Lipschitz constant, the smoothness parameter, and the related parameter of the loss class over the corresponding ball. Furthermore, the online boosting algorithm may make up to N calls per iteration to any copies of A it maintains, for a given budget parameter N.

Given this setup, our main result is an online boosting algorithm (Algorithm 1) competing with span(F). The algorithm maintains N copies of A, denoted A^1, ..., A^N; each copy corresponds to one stage in boosting. When it receives a new example x_t, it passes it to each A^i and obtains their predictions A^i(x_t), which it then combines into a prediction y_t using a linear combination. At the most basic level, this linear combination is simply the sum of all the predictions scaled by a step size parameter. Two tweaks are made to this sum to facilitate the analysis. First, while constructing the sum, each partial sum y_t^i is multiplied by a shrinkage factor; this shrinkage term is tuned using an online gradient descent algorithm, and the goal of the tuning is to induce the partial sums y_t^i to be aligned with a descent direction for the loss functions, as measured by the inner product between the loss gradient at the partial sum and the partial sum itself. Second, the partial sums y_t^i are made to lie in a ball of a suitable radius by using the projection operator; this is done to ensure that the Lipschitz constant and smoothness of the loss function are suitably bounded (it suffices to compute upper bounds on these parameters).

Algorithm 1: online gradient boosting for span(F).
Require: number of weak learners N, step size parameter.
Maintain N copies of the algorithm A, denoted A^1, ..., A^N, and for each i initialize a shrinkage parameter.
For t = 1 to T:
  receive example x_t and set the initial partial sum to zero;
  for i = 1 to N: form y_t^i by shrinking the previous partial sum, adding the step size times A^i(x_t), and projecting onto the ball;
  predict y_t = y_t^N, then obtain the loss function l_t and suffer loss l_t(y_t);
  for i = 1 to N: pass the suitably scaled linear loss function y -> (gradient of l_t at the corresponding partial sum) . y to A^i, and update the shrinkage parameter by a projected online gradient descent step, clipping it to [0, 1].
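The following is a much-simplified sketch (ours, not the paper's pseudocode) of the core loop of such a booster: partial sums of step-size-scaled weak predictions, with each weak learner afterwards fed a linear loss whose coefficient is the loss gradient at its own partial sum. The shrinkage tuning and the projection steps of the full algorithm are omitted, and all names are illustrative.

```python
class OnlineBooster:
    """Simplified online gradient boosting over N weak online learners.

    Each weak learner must expose .predict(x) and .update(x, grad_coeff),
    e.g. the OnlineLinearRegressor sketched earlier.
    """

    def __init__(self, base_learners, eta=0.1):
        self.learners = base_learners
        self.eta = eta
        self._partials = []            # partial sums from the last prediction

    def predict(self, x):
        partial = 0.0
        self._partials = []
        for learner in self.learners:
            self._partials.append(partial)        # sum before this stage
            partial += self.eta * learner.predict(x)
        return partial

    def update(self, x, loss_grad):
        # loss_grad: derivative of the revealed convex loss at a prediction value.
        for learner, p in zip(self.learners, self._partials):
            learner.update(x, loss_grad(p))

# usage sketch with squared loss l(y) = (y - y_true) ** 2:
#   booster = OnlineBooster([OnlineLinearRegressor(d) for _ in range(10)])
#   y_hat = booster.predict(x)                       # prediction first,
#   booster.update(x, lambda y: 2 * (y - y_true))    # label revealed afterwards
```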
Once the boosting algorithm makes the prediction y_t and obtains the loss function l_t, each A^i is updated using a suitably scaled linear approximation to the loss function at the corresponding partial sum, i.e. the linear loss function y -> (gradient of l_t at that partial sum) . y; this forces A^i to produce predictions that are aligned with a descent direction for the loss function. For lack of space we provide the analysis of the algorithm in the supplementary material; the analysis yields the following regret bound.

Theorem 1. With the algorithm's parameters set as described, Algorithm 1 is an online learning algorithm for span(F) and losses in the convex loss class, with a regret bound R_f, for each f in span(F), that depends on the norm of f, the base learner's regret, the number of stages N, the step size, and the smoothness and Lipschitz parameters of the loss class.

The regret bound in this theorem depends on several parameters, such as the smoothness and Lipschitz constants; in applications of the algorithm to regression with commonly used loss functions, however, these parameters are essentially modest constants (see the section on loss-function parameters below). Furthermore, if N is appropriately set, growing with the time horizon, then the average regret clearly converges to zero as N and T go to infinity. While the requirement that N grow may raise concerns about computational efficiency, this is in fact analogous to the guarantee in the batch setting: the algorithms converge only when the number of boosting stages goes to infinity. Moreover, our lower bound (Theorem 2) shows that this is indeed necessary.

We also present a simpler boosting algorithm (Algorithm 2) that competes with CH(F). Algorithm 2 is similar to Algorithm 1 with some simplifications: the final prediction is simply a convex combination of the predictions of the base learners, with no projections or shrinkage necessary. While Algorithm 1 is more general, Algorithm 2 may still be useful in practice when a bound on the norm of the comparator function is known in advance, using the observations in the section on scaling below. Furthermore, its analysis is cleaner and easier to understand for readers who are familiar with the Frank-Wolfe method, and it serves as a foundation for the analysis of Algorithm 1. This algorithm has an optimal (up to constant factors) regret bound, as given in the following theorem, proved in the supplementary material; the upper bound is proved along the lines of the Frank-Wolfe analysis, and the lower bound via a separate argument.

Theorem 2. Algorithm 2 is an online learning algorithm for CH(F) for losses in the convex loss class, with a regret bound combining the base learner's regret with a term that decays as the number of stages N grows. Furthermore, the dependence of this regret bound on N is optimal up to constant factors, and the dependence of the regret bound on the base regret is unimprovable without additional assumptions, since otherwise Algorithm 2 would yield an online linear learning algorithm over F with better regret than assumed.

Algorithm 2: online gradient boosting for CH(F).
Maintain N copies of the algorithm A, denoted A^1, ..., A^N, and fix step sizes in [0, 1].
For t = 1 to T:
  receive example x_t and set the initial partial sum to zero;
  for i = 1 to N: define y_t^i as the convex combination of the previous partial sum and A^i(x_t) given by the i-th step size;
  predict y_t = y_t^N, then obtain the loss function l_t and suffer loss l_t(y_t);
  for i = 1 to N: pass the linear loss function y -> (gradient of l_t at the corresponding partial sum) . y, suitably scaled, to A^i.

Improvements for a deterministic base online linear learning algorithm. If the base online linear learning algorithm is deterministic, then our results can be improved, because our online boosting algorithms are then also deterministic, and using a standard simple reduction we can now allow the loss class to be any set of convex functions, smooth or not, with a computable Lipschitz constant over the relevant ball for any radius. This reduction converts arbitrary convex loss functions into linear functions: viz., if y_t is the output of the online boosting algorithm, then the loss function provided to the boosting algorithm as feedback is the linear function defined by a subgradient of the loss at y_t. This reduction immediately implies that the base online linear learning algorithm, when fed such loss functions, is already an online learning algorithm for CH(F) with losses in this larger class, with a comparable regret bound. As for competing with span(F), since linear loss functions are trivially smooth, we obtain the following easy corollary of Theorem 1.

Corollary 1. With the parameters set appropriately, Algorithm 1 is an online learning algorithm for span(F) for losses in this larger class of convex functions, with a regret bound, for each f in span(F), that depends on the Lipschitz constant, the norm of f, the base regret, and the number of stages N.

The parameters for several basic loss functions. In this section we consider the application of our results to regression, where we assume for normalization that the true labels of the
examples and the predictions of the functions in the class are in in this case denotes the absolute value norm thus in each round the adversary chooses labeled data point xt yt and the loss for the prediction yt is given by yt yt yt where is fixed loss function that is convex in the second argument note that in this setting we give examples of several such loss functions below and compute the parameters lb for every as well as from theorem and linear loss we have lb and loss for some we have lb max and modified least squares max we have lb max and logistic loss ln exp we have lb exp exp and min ln variants of the boosting algorithms our boosting algorithms and the analysis are considerably flexible it is easy to modify the algorithms to work with and perhaps more natural kind of base learner which does greedy fitting or incorporate scaling of the base functions which improves performance also when specialized to the batch setting our algorithms provide better convergence rates than previous work fitting to actual loss functions the choice of an online linear learning algorithm over the base function class in our algorithms was made to ease the analysis in practice it is more common to have an online algorithm which produce predictions with comparable accuracy to the best function in hindsight for the actual sequence of loss functions in particular common heuristic in boosting algorithms such as the original gradient boosting algorithm by friedman or the matching pursuit algorithm of mallat and zhang is to build linear combination of base functions by iteratively augmenting the current linear combination via greedily choosing base function and step size for it that minimizes the loss with respect to the residual label indeed the boosting algorithm of zhang and yu also uses this kind of greedy fitting algorithm as the base learner in the online setting we can model greedy fitting as follows we first fix step size in advance then in each round the base learner receives not only the example xt but also an rd for the prediction and produces prediction xt rd after which it receives the loss function and loss xt the predictions of satisfy xt inf xt where is the regret our algorithms can be made to work with this kind of base learner as well the details can be found in section of the supplementary material improving the regret bound via scaling given an online linear learning algorithm over the function class with regret then for any scaling parameter we trivially obtain an online linear learning algorithm denoted over of viz simply by multiplying the predictions of by the corresponding regret scales by as well it becomes the performance of algorithm can be improved by using such an online linear learning algorithm over for suitably chosen scaling of the function class the regret bound from theorem improves because the of measured with respect to kf max kf is smaller than kf but degrades because the parameter min inf is larger than but as detailed in section of the supplementary material in many situations the improvement due to the former compensates for the degradation due to the latter and overall we can get improved regret bounds using suitable value of improvements for batch boosting our algorithmic technique can be easily specialized and modified to the standard batch setting with fixed batch of training examples and base learning algorithm operating over the batch exactly as in the main compared to the algorithm of is the use of the variables to scale the coefficients of the weak hypotheses 
appropriately. While a seemingly innocuous tweak, this allows us to derive bounds analogous to those of Zhang and Yu on the optimization error, showing that our boosting algorithm converges exponentially faster. A detailed comparison can be found in the supplementary material.

Experimental results
Is it possible to boost in an online fashion in practice, with real base learners? To study this question we implemented and evaluated both algorithms within the Vowpal Wabbit (VW) open source machine learning system. The three online base learners used were: VW's default linear learner (a variant of stochastic gradient descent), sigmoidal neural networks with hidden units, and regression stumps. Regression stumps were implemented by doing stochastic gradient descent on each individual feature and predicting with the valued feature in the current example. All experiments were done on a collection of publicly available regression and classification datasets, described in the supplementary material, using squared loss. The only parameters tuned were the learning rate, the number of weak learners, and, for the algorithm that requires it, the step size parameter. Parameters were tuned based on progressive validation loss on half of each dataset; reported is the progressive validation loss on the remaining half. Progressive validation is a standard online validation technique where each training example is used for testing before it is used for updating the model. The following table reports the average and the median, over the datasets, of the relative improvement in squared loss over the respective base learner; detailed results can be found in the supplementary material.

Base learners: SGD, regression stumps, neural networks. Columns: average relative improvement (both algorithms) and median relative improvement (both algorithms).

Note that both SGD (stochastic gradient descent) and neural networks are already very strong learners; naturally, boosting is much more effective for regression stumps, which is a weak base learner.

Conclusions and future work
In this paper we generalized the theory of boosting for regression problems to the online setting and provided online boosting algorithms with theoretical convergence guarantees. Our algorithmic technique also improves convergence guarantees for batch boosting algorithms. We also provide experimental evidence that our boosting algorithms do improve prediction accuracy over commonly used base learners in practice, with greater improvements for weaker base learners. The main remaining open question is whether the boosting algorithm for competing with the span of the base functions is optimal in any sense, similar to our proof of optimality for the boosting algorithm for competing with the convex hull of the base functions.

References
Peter Bartlett and Mikhail Traskin. AdaBoost is consistent. JMLR. Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and adaptive algorithms for online boosting. In ICML. Avrim Blum, Adam Kalai, and John Langford. Beating the hold-out: bounds for k-fold and progressive cross-validation. In COLT. Chen, Lin, and Lu. An online boosting algorithm with theoretical justifications. In ICML. Chen, Lin, and Lu. Boosting with online binary learners for the multiclass bandit problem. In ICML. Michael Collins, Robert Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. In COLT. Nigel Duffy and David Helmbold. Boosting methods for regression. Machine Learning. Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Res. Logis. Yoav Freund and Robert Schapire. A decision-theoretic generalization of on-line learning and an application to
boosting. JCSS, August. Jerome Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, October. Helmut Grabner and Horst Bischof. Boosting and vision. In CVPR. Helmut Grabner, Christian Leistner, and Horst Bischof. Boosting for robust tracking. In ECCV. Trevor Hastie and Robert Tibshirani. Generalized additive models. Chapman and Hall. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Verlag. Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic optimization. JMLR. Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning. Xiaoming Liu and Ting Yu. Gradient feature selection for online boosting. In ICCV. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, December. Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting algorithms as gradient descent. In NIPS. Nikunj Oza and Stuart Russell. Online bagging and boosting. In AISTATS. Robert Schapire and Yoav Freund. Boosting: foundations and algorithms. MIT Press. Matus Telgarsky. Boosting with the logistic loss is consistent. In COLT. VW. URL https. Tong Zhang and Bin Yu. Boosting with early stopping: convergence and consistency. Annals of Statistics. Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML.
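The greedy residual fitting discussed in the variants section above (a base learner that sees the current partial prediction and fits the residual with a fixed step size) can be made concrete with a small sketch. This is a simplified illustration, not the paper's algorithms: the class names, the toy linear base learner, and the step size eta are all assumptions.

```python
# Minimal sketch of online greedy residual fitting with a pool of base learners.
# Illustrative names throughout; not the paper's exact boosting algorithms.
import numpy as np

class LinearBaseLearner:
    """Toy online base learner: a linear model fit by SGD on squared loss."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, target):
        # one SGD step on (prediction - target)^2
        grad = 2.0 * (self.predict(x) - target) * x
        self.w -= self.lr * grad

class OnlineGreedyBooster:
    """Builds a linear combination of base predictions with a fixed step size eta."""
    def __init__(self, dim, n_learners=10, eta=0.3):
        self.learners = [LinearBaseLearner(dim) for _ in range(n_learners)]
        self.eta = eta

    def predict_and_update(self, x, y):
        partial = 0.0
        for learner in self.learners:
            # each learner greedily fits the residual left by the partial prediction
            pred = learner.predict(x)
            learner.update(x, y - partial)
            partial += self.eta * pred
        return partial

# usage: stream (x_t, y_t) pairs and track squared loss online
rng = np.random.default_rng(0)
booster = OnlineGreedyBooster(dim=5)
loss = 0.0
for t in range(1000):
    x = rng.normal(size=5)
    y = np.sin(x[0]) + 0.1 * rng.normal()
    y_hat = booster.predict_and_update(x, y)
    loss += (y_hat - y) ** 2
print("average squared loss:", loss / 1000)
```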
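Progressive validation, used above to report losses, is simple to state in code: each example is scored before the model is updated on it. A minimal sketch, assuming a hypothetical learner object exposing predict and update methods:

```python
# Minimal sketch of progressive validation for any online learner with
# predict(x) and update(x, y); the method names are illustrative.
def progressive_validation(learner, stream, loss_fn):
    total, count = 0.0, 0
    for x, y in stream:
        total += loss_fn(learner.predict(x), y)  # test on the example first ...
        learner.update(x, y)                     # ... then train on the same example
        count += 1
    return total / max(count, 1)
```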
deep temporal sigmoid belief networks for sequence modeling zhe gan chunyuan li ricardo henao david carlson and lawrence carin department of electrical and computer engineering duke university durham nc lcarin abstract deep dynamic generative models are developed to learn sequential dependencies in data the model is designed by constructing hierarchy of temporal sigmoid belief networks tsbns defined as sequential stack of sigmoid belief networks sbns each sbn has contextual hidden state inherited from the previous sbns in the sequence and is used to regulate its hidden bias scalable learning and inference algorithms are derived by introducing recognition model that yields fast sampling from the variational posterior this recognition model is trained jointly with the generative model by maximizing its variational lower bound on the experimental results on bouncing balls polyphonic music motion capture and text streams show that the proposed approach achieves predictive performance and has the capacity to synthesize various sequences introduction considerable research has been devoted to developing probabilistic models for data such as video and music sequences motion capture data and text streams among them hidden markov models hmms and linear dynamical systems lds have been widely studied but they may be limited in the type of dynamical structures they can model an hmm is mixture model which relies on single multinomial variable to represent the history of to represent bits of information about the history an hmm could require distinct states on the other hand sequential data often contain complex temporal dependencies while lds can only model simple linear dynamics another class of models which are potentially better suited to model complex probability distributions over sequences relies on the use of recurrent neural networks rnns and variants of undirected graphical model called the restricted boltzmann machine rbm one such variant is the temporal restricted boltzmann machine trbm which consists of sequence of rbms where the state of one or more previous rbms determine the biases of the rbm in the current time step learning and inference in the trbm is the approximate procedure used in is heuristic and not derived from principled statistical formalism recently deep directed generative models are becoming popular directed graphical model that is closely related to the rbm is the sigmoid belief network sbn in the work presented here we introduce the temporal sigmoid belief network tsbn which can be viewed as temporal stack of sbns where each sbn has contextual hidden state that is inherited from the previous sbns and is used to adjust its bias based on this we further develop deep dynamic generative model by constructing hierarchy of tsbns this can be considered time time generative model recognition model generative model recognition model figure graphical model for the deep temporal sigmoid belief network generative and recognition model of the tsbn generative and recognition model of deep tsbn as deep sbn with temporal feedback loops on each layer both stochastic and deterministic hidden layers are considered compared with previous work our model can be viewed as generalization of an hmm with distributed hidden state representations and with deep architecture ii can be seen as generalization of lds with complex dynamics iii can be considered as probabilistic construction of the traditionally deterministic rnn iv is closely related to the trbm but it has fully generative process where 
data are readily generated from the model using ancestral sampling can be utilized to model different kinds of data binary and counts the explaining away effect described in makes inference slow if one uses traditional inference methods another important contribution we present here is to develop fast and scalable learning and inference algorithms by introducing recognition model that learns an inverse mapping from observations to hidden variables based on loss function derived from variational principle by utilizing the recognition model and techniques from we achieve fast inference both at training and testing time model formulation sigmoid belief networks deep dynamic generative models are considered based on the sigmoid belief network sbn an sbn is bayesian network that models binary visible vector in terms of binary hidden variables and weights rm with vm wm cm hj bj where vm hj wm cm bj and the logistic function the parameters and characterize all data and the hidden variables are specific to particular visible data the sbn is closely related to the rbm which is markov random field with the same bipartite structure as the sbn the rbm defines distribution over binary vector that is proportional to the exponential of its energy defined as wh the conditional distributions and in the rbm are factorial which makes inference fast while parameter estimation usually relies on an approximation technique known as contrastive divergence cd the energy function of an sbn may be written as log exp wm cm sbns explicitly manifest the generative process to obtain data in which the hidden layer provides directed explanation for patterns generated in the visible layer however the explaining away effect described in makes inference inefficient the latter can be alleviated by exploiting recent advances in variational inference methods temporal sigmoid belief networks the proposed temporal sigmoid belief network tsbn model is sequence of sbns arranged in such way that at any given time step the sbn biases depend on the state of the sbns in the previous time steps specifically assume we have binary visible sequence the tth time step of which is denoted vt the tsbn describes the joint probability as pθ ht vt where vt ht and each ht represents the hidden state corresponding to time step for each conditional distribution in is expressed as hjt bj vmt ht cm where and needed for the prior model and are defined as zero vectors respectively for conciseness the model parameters are specified as rm rm for wij is the transpose of the jth row of wi and cm and bj are bias terms the graphical model for the tsbn is shown in figure by setting and to be zero matrices the tsbn can be viewed as hidden markov model with an exponentially large state space that has compact parameterization of the transition and the emission probabilities specifically each hidden state in the hmm is represented as vector while in the tsbn the hidden states can be any binary vector we note that the transition matrix is highly structured since the number of parameters is only quadratic compared with the trbm our tsbn is fully directed which allows for fast sampling of fantasy data from the inferred model tsbn variants modeling data the model above can be readily extended to model sequence data by substituting with vt µt diag where µmt ht cm log σmt ht are elements of µt and respectively and are of the same size of and µmt and σmt and respectively compared with the gaussian trbm in which σmt is fixed to our formalism uses diagonal matrix to 
parameterize the variance structure of vt modeling count data we also introduce an approach for modeling data with count qm vmt observations by replacing with vt ymt where ymt pm ht cm exp exp this formulation is related to the replicated softmax model rsm described in however our approach uses directed connection from the binary hidden variables to the visible counts while also learning the dynamics in the count sequences furthermore rather than assuming that ht and vt only depend on and in the experiments we also allow for connections from the past time steps of the hidden and visible states to the current states ht and vt sliding window is then used to go through the sequence to obtain frames at each time we refer to as the order of the model deep architecture for sequence modeling with tsbns learning the sequential dependencies with the shallow model in may be restrictive therefore we propose two deep architectures to improve its representational power adding stochastic hidden layers ii adding deterministic hidden layers the graphical model for the deep tsbn is shown in figure specifically we consider deep tsbn with hidden layers ht for and assume layer contains hidden units and denote the visi ble layer vt ht and let ht for convenience in order to obtain proper generative model the top hidden layer contains stochastic binary hidden variables for the middle layers if stochastic hidden layers are utilized the generative process qj is expressed as ht hjt where each conditional distribution is parameterized via logistic function as in if deterministic hidden layers are employed we obtain ht ht where is chosen to be rectified linear function although the differences between these two approaches are minor learning and inference algorithms can be quite different as shown in section scalable learning and inference computation of the exact posterior over the hidden variables in is intractable approximate bayesian inference such as gibbs sampling or variational bayes vb inference can be implemented however gibbs sampling is very inefficient due to the fact that the conditional posterior distribution of the hidden variables does not factorize the vb indeed provides fully factored variational posterior but this technique increases the gap between the bound being optimized and the true potentially resulting in poor fit to the data to allow for tractable and scalable inference and parameter learning without loss of the flexibility of the variational posterior we apply the neural variational inference and learning nvil algorithm described in variational lower bound objective we are interested in training the tsbn model pθ described in with parameters given an observation we introduce distribution qφ with parameters that approximates the true posterior distribution we then follow the variational principle to derive lower bound on the marginal expressed eqφ log pθ log qφ we construct the approximate posterior qφ as recognition model by using this we avoid the need to compute variational parameters per data point instead we compute set of parameters used for all in order to achieve fast inference the recognition model is expressed as qφ ht vt and each conditional distribution is specified as hjt vt vt dj where and for are defined as zero vectors the recognition parameters are specified as for uij is the transpose of the jth row of ui and dj is the bias term the graphical model is shown in figure the recognition model defined in has the same form as in the approximate inference used for the trbm exact 
inference for our model consists of forward and backward pass through the entire sequence that requires the traversing of each possible hidden state our feedforward approximation allows the inference procedure to be fast and implemented in an online fashion parameter learning to optimize we utilize monte carlo methods to approximate expectations and stochastic gradient descent sgd for parameter optimization the gradients can be expressed as eqφ log pθ eqφ log pθ log qφ log qφ this lower bound is equivalent to the marginal if qφ specifically in the tsbn model if we define ht cm and vt dj the gradients for and can be calculated as log pθ vmt hjt log qφ hjt vmt other update equations along with the learning details for the tsbn variants in section are provided in the supplementary section we observe that the gradients in and share many similarities with the algorithm alternates between updating in the wake phase and updating in the sleep phase the update of is based on the samples generated from qφ and is identical to however in contrast to the recognition parameters are estimated from samples generated by the model epθ log qφ this update does not optimize the same objective as in hence the algorithm is not guaranteed to converge inspecting we see that we are using lφ log pθ log qφ as the learning signal for the recognition parameters the expectation of this learning signal is exactly the lower bound which is easy to evaluate however this tractability makes the estimated gradients of the recognition parameters very noisy in order to make the algorithm practical we employ the variance reduction techniques proposed in namely centering the learning signal by subtracting the baseline and the baseline ii variance normalization by dividing the centered learning signal by running estimate of its standard deviation the baseline is implemented using neural network additionally rmsprop form of sgd where the gradients are adaptively rescaled by running average of their recent magnitude were found in practice to be important for fast convergence thus utilized throughout all the experiments the outline of the nvil algorithm is provided in the supplementary section extension to deep models the recognition model corresponding to the deep tsbn is shown in figure two kinds of deep architectures are discussed in section we illustrate the difference of their learning algorithms in two respects the calculation of the lower bound and ii the calculation of the gradients the top hidden layer is stochastic if the middle hidden layers are also stochastic the calculation of the lower bound is more involved compared with the shallow model however the gradient evaluation remain simple as in on the other hand if deterministic middle hidden layers recurrent neural networks are employed the lower bound objective will stay the same as shallow model since the only stochasticity in the generative process lies in the top layer however the gradients have to be calculated recursively through the through time algorithm all details are provided in the supplementary section related work the rbm has been widely used as building block to learn the sequential dependencies in data the models and the temporal rbm to make exact inference possible the recurrent temporal rbm was also proposed and further extended to learn the dependency structure within observations in the work reported here we focus on modeling sequences based on the sbn which recently has been shown to have the potential to build deep generative models our work serves as 
another extension of the sbn that can be utilized to model data similar ideas have also been considered in and however in the authors focus on grammar learning and use approximation of the vb to carry out the inference while in the algorithm was developed we apply the model in different scenario and develop fast and scalable inference algorithm based on the idea of training recognition model by leveraging the stochastic gradient of the variational bound there exist two main methods for the training of recognition models the first one termed stochastic gradient variational bayes sgvb is based on reparameterization trick which can be only employed in models with continuous latent variables the variational top generated from piano midi topic nicaragua topic war of iraq war world war ii bottom generated from nottingham topic the age of american revolution figure left dictionaries learned using the hmsbn for the videos of bouncing balls middle samples generated from the hmsbn trained on the polyphonic music each column is sample vector of notes right time evolving from to for three selected topics learned from the stu dataset plotted values represent normalized probabilities that the topic appears in given year best viewed electronically and all the recent recurrent extensions of it the second one called neural variational inference and learning nvil is based on the trick which is more general and can also be applicable to models with discrete random variables the nvil algorithm has been previously applied to the training of sbn in our approach serves as new application of this algorithm for model experiments we present experimental results on four publicly available datasets the bouncing balls polyphonic music motion capture and to assess the performance of the tsbn model we show sequences generated from the model and report the average that the model assigns to test sequence and the average squared prediction error per frame code is available at https the tsbn model with and is denoted hidden markov sbn hmsbn the deep tsbn with stochastic hidden layer is denoted and the deep tsbn with deterministic hidden layer is denoted model parameters were initialized by sampling randomly from except for the bias parameters that were initialized as the tsbn model is trained using variant of rmsprop with momentum of and constant learning rate of the decay over the root mean squared gradients is set to the maximum number of iterations we use is the gradient estimates were computed using single sample from the recognition model the only regularization we used was weight decay of the baseline was implemented by using neural network with single hidden layer with tanh units for the prediction of vt given we first obtain sample from qφ ii calculate the conditional posterior pθ ht of the current hidden state iii make prediction for vt using pθ vt on the other hand synthesizing samples is conceptually simper sequences can be readily generated from the model using ancestral sampling bouncing balls dataset we conducted the first experiment on synthetic videos of bouncing balls where pixels are binary valued we followed the procedure in and generated videos for training and another videos for testing each video is of length and of resolution the dictionaries learned using the hmsbn are shown in figure left compared with previous work our learned bases are more spatially localized in table we compare the average squared prediction error per frame over the test videos with recurrent temporal rbm rtrbm and structured 
RTRBM (SRTRBM). As can be seen, our approach achieves better performance compared with the baselines in the literature. Furthermore, we observe that a higher-order TSBN reduces the prediction error significantly compared with a lower-order TSBN; this is due to the fact that a higher-order TSBN conveys more information about the past. We also examine the advantage of employing deep models: using a stochastic or deterministic hidden layer improves performance; more results are provided in the supplementary section.

Table: average prediction error for the bouncing balls dataset (baseline numbers taken from prior work). Table: average prediction error obtained for the motion capture dataset, walking and running (baseline numbers taken from prior work).

Motion capture dataset. In this experiment we used the CMU motion capture dataset, which consists of measured joint angles for different motion types. We used the running and walking sequences of a single subject (walking sequences and running sequences). We followed the preprocessing procedure of prior work, after which we were left with the joint angles. We partitioned the sequences into a training and a testing set, the first of which had the walking sequences and the second had the remaining sequences (one walking and another running). We averaged the prediction error over trials, as reported in the table. The TSBN we implemented has the same size in each hidden layer and fixed order. It can be seen that the models improve over the Gaussian RTRBM and the SRTRBM significantly.

Figure: motion trajectories generated from the HMSBN trained on the motion capture dataset; left panel: walking; the middle and right panels show further generated trajectories. Another popular motion capture dataset is the MIT one. To further demonstrate the directed generative nature of our model, we give our trained HMSBN model different initializations and show generated synthetic data and the transitions between different motion styles in the figure. These generated data are readily produced from the model and demonstrate realistic behavior: the smooth trajectories are walking movements while the vibrating ones are running. Corresponding video files (avi) are provided as mocap files in the supplementary material.

Polyphonic music dataset. The third experiment is based on four different polyphonic music sequences of piano: Piano, Nottingham (Nott), MuseData (Muse) and JSB chorales (JSB). Each of these datasets is represented as a collection of binary sequences that span the whole range of piano notes. The samples generated from the trained HMSBN model are shown in the figure (middle). As can be seen, different styles of polyphonic music are synthesized. The corresponding MIDI files are provided as music files in the supplementary material. Our model has the ability to learn basic harmony rules and local temporal coherence; however, structure and musical melody remain elusive. The variational lower bound, along with the estimated log-likelihoods from prior work, are presented in the table. The TSBN we implemented uses a fixed size and order. Empirically, adding layers did not improve performance on this dataset, hence no such results are reported. The results of the RTRBM baselines were obtained by only a few runs of annealed importance sampling, which has the potential to overestimate the true log-likelihood. Our variational lower bound provides a more conservative estimate, though our performance is still better than that of the RNN. Quantitative results on the MIT dataset are provided in the supplementary section.

Table: test log-likelihood for the polyphonic music datasets (Piano, Nott, Muse, JSB). Table: average prediction precision (MP, PP) for STU.

State of the Union dataset. The State of the Union (STU) dataset contains the transcripts of US State of the Union addresses from to . Two
tasks are considered prediction and dynamic topic modeling prediction the prediction task is concerned with estimating the words we employ the setup in after removing stop words and terms that occur fewer than times in one document or less than times overall there are unique words the entire data of the last year is for the documents in the previous years we randomly partition the words of each document into split the model is trained on the portion and the remaining words are used to test the prediction at each year the words in both sets are ranked according to the probability estimated from to evaluate the prediction performance we calculate the precision as in which is given by the fraction of the words predicted by the model that matches the true ranking of the word counts is used two recent works are compared and drfm the results are summarized in table our model is of order the column mp denotes the mean precision over all the years that appear in the training set the column pp denotes the predictive precision for the final year our model achieves significant improvements in both scenarios dynamic topic modeling the setup described in is employed and the number of topics is to understand the temporal dynamic per topic three topics are selected and the normalized probability that topic appears at each year are shown in figure right their associated top words per topic are shown in table the learned trajectory exhibits different temporal patterns across the topics clearly we can identify jumps associated with some key historical events for instance for topic we observe positive jump in related to military and paramilitary activities in and against nicaragua brought by the topic is related with war where the war of world war ii and iraq war all spike up in their corresponding years in topic we observe consistent positive jumps from to when the american revolution was taking place three other interesting topics are also shown in table topic appears to be related to education topic is about iraq and topic is axis and world war ii we note that the words for these topics are explicitly related to these matters table top most probable words associated with the stu topics topic family budget nicaragua free future freedom topic officer civilized warfare enemy whilst gained topic government country public law present citizens topic generations generation recognize brave crime race topic iraqi qaida iraq iraqis ai saddam topic philippines islands axis nazis japanese germans conclusion we have presented the deep temporal sigmoid belief networks an extension of sbn that models the temporal dependencies in sequences to allow for scalable inference and learning an efficient variational optimization algorithm is developed experimental results on several datasets show that the proposed approach obtains superior predictive performance and synthesizes interesting sequences in this work we have investigated the modeling of different types of data individually one interesting future work is to combine them into unified framework for dynamic learning furthermore we can use optimization methods to speed up inference acknowledgements this research was supported in part by aro darpa doe nga and onr references rabiner and juang an introduction to hidden markov models in assp magazine ieee kalman mathematical description of linear dynamical systems in the society for industrial applied mathematics series control hermans and schrauwen training and analysing deep recurrent neural networks in nips martens and 
sutskever learning recurrent neural networks with optimization in icml pascanu mikolov and bengio on the difficulty of training recurrent neural networks in icml graves generating sequences with recurrent neural networks in taylor hinton and roweis modeling human motion using binary latent variables in nips sutskever and hinton learning multilevel distributed representations for sequences in aistats sutskever hinton and taylor the recurrent temporal restricted boltzmann machine in nips bengio and vincent modeling temporal dependencies in highdimensional sequences application to polyphonic music generation and transcription in icml mittelman kuipers savarese and lee structured recurrent temporal restricted boltzmann machines in icml kingma and welling variational bayes in iclr mnih and gregor neural variational inference and learning in belief networks in icml rezende mohamed and wierstra stochastic backpropagation and approximate inference in deep generative models in icml gan henao carlson and carin learning deep sigmoid belief networks with data augmentation in aistats neal connectionist learning of belief networks in artificial intelligence hinton osindero and teh fast learning algorithm for deep belief nets in neural computation hinton training products of experts by minimizing contrastive divergence in neural computation hinton and salakhutdinov replicated softmax an undirected topic model in nips hinton dayan frey and neal the algorithm for unsupervised neural networks in science tieleman and hinton lecture divide the gradient by running average of its recent magnitude in coursera neural networks for machine learning werbos backpropagation through time what it does and how to do it in proc of the ieee taylor and hinton factored conditional restricted boltzmann machines for modeling motion style in icml gan chen henao carlson and carin scalable deep poisson factor analysis for topic modeling in icml henderson and titov incremental sigmoid belief networks for grammar learning in jmlr hinton dayan to and neal the helmholtz machine through time in proc of the icann bayer and osendorfer learning stochastic recurrent networks in fabius van amersfoort and kingma variational recurrent in chung kastner dinh goel courville and bengio recurrent latent variable model for sequential data in nips han du salazar and carin dynamic rank factor model for text streams in nips acharya ghosh and zhou nonparametric bayesian factor analysis for dynamic count matrices in aistats fan wang kwok and heller fast stochastic backpropagation for variational inference in nips 
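The TSBN generative process described in the model section admits straightforward ancestral sampling: each hidden state is drawn conditioned on the previous hidden state and previous observation, and each observation is drawn conditioned on the current hidden state and the previous observation. A minimal sketch for the binary, single-layer case; the weight names W1 through W4, the biases b and c, and the random initialization below are assumptions about the parameterization, and the sizes are purely illustrative.

```python
# Minimal sketch of ancestral sampling from a shallow TSBN with binary visibles.
# Weight/bias names and sizes are illustrative assumptions, not the paper's code.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_tsbn(T, J, M, params, rng):
    """T time steps, J hidden units, M visible units."""
    W1, W2, W3, W4, b, c = params
    h_prev = np.zeros(J)
    v_prev = np.zeros(M)
    visibles = []
    for _ in range(T):
        # hidden state depends on the previous hidden state and previous observation
        h = (rng.random(J) < sigmoid(W1 @ h_prev + W3 @ v_prev + b)).astype(float)
        # observation depends on the current hidden state and previous observation
        v = (rng.random(M) < sigmoid(W2 @ h + W4 @ v_prev + c)).astype(float)
        visibles.append(v)
        h_prev, v_prev = h, v
    return np.stack(visibles)

rng = np.random.default_rng(0)
J, M = 20, 10
params = (0.1 * rng.normal(size=(J, J)), 0.1 * rng.normal(size=(M, J)),
          0.1 * rng.normal(size=(J, M)), 0.1 * rng.normal(size=(M, M)),
          np.zeros(J), np.zeros(M))
print(sample_tsbn(T=5, J=J, M=M, params=params, rng=rng).shape)  # (5, 10)
```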
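The NVIL-style learning signal and its variance reduction (centering with a baseline and dividing by a running estimate of the standard deviation) can likewise be sketched compactly. The running-average scheme, the decay constant, and the floor of one on the standard deviation below are assumptions; the paper additionally uses an input-dependent baseline implemented as a neural network, which is omitted here.

```python
# Minimal sketch of variance reduction on the learning signal
# l = log p_theta(v, h) - log q_phi(h | v); constants are illustrative.
import numpy as np

class LearningSignalNormalizer:
    def __init__(self, alpha=0.9):
        self.alpha = alpha      # decay for the running estimates
        self.mean = 0.0         # running mean acts as a simple baseline
        self.var = 1.0

    def __call__(self, signal):
        self.mean = self.alpha * self.mean + (1 - self.alpha) * signal
        self.var = self.alpha * self.var + (1 - self.alpha) * (signal - self.mean) ** 2
        centered = signal - self.mean
        return centered / max(np.sqrt(self.var), 1.0)

# Inside a training step, with h sampled from q_phi(. | v):
#   l      = log_p - log_q
#   l_hat  = normalizer(l)
#   grad_phi   is approximated by l_hat * d/dphi log q_phi(h | v)   (score-function estimator)
#   grad_theta is approximated by d/dtheta log p_theta(v, h)
```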
recognizing retinal ganglion cells in the dark emile richard stanford university emileric georges goetz stanford university ggoetz chichilnisky stanford university ej abstract many neural circuits are composed of numerous distinct cell types that perform different operations on their inputs and send their outputs to distinct targets therefore key step in understanding neural systems is to reliably distinguish cell types an important example is the retina for which techniques for identifying cell types are accurate but very here we develop automated classifiers for functional identification of retinal ganglion cells the output neurons of the retina based solely on recorded voltage patterns on large scale array we use classifiers based on features extracted from electrophysiological images spatiotemporal voltage waveforms and interspike intervals autocorrelations these classifiers achieve high performance in distinguishing between the major ganglion cell classes of the primate retina but fail in achieving the same accuracy in predicting cell polarities on off we then show how to use indicators of functional coupling within populations of ganglion cells to infer cell polarities with matrix completion algorithm this can result in accurate fully automated methods for cell type classification introduction in the primate and human retina roughly distinct classes of retinal ganglion cells rgcs send distinct visual information to diverse targets in the brain two complementary methods for identification of these rgc types have been pursued extensively anatomical studies have relied on indicators such as dendritic field size and shape and stratification patterns in synaptic connections to distinguish between cell classes functional studies have leveraged differences in responses to stimulation with variety of visual stimuli for the same purpose although successful these methods are difficult and require significant expertise thus they are not suitable for automated analysis of existing physiological recording data furthermore in some clinical settings they are entirely inapplicable at least two specific scientific and engineering goals demand the development of efficient methods for cell type identification discovery of new cell types while morphologically distinct rgc types exist only have been characterized functionally automated means of detecting unknown cell types in electrophysiological recordings would make it possible to process massive amounts of existing physiological data that would take too long to analyze manually in order to search for the poorly understood rgc types developing interfaces of the future in blind patients suffering from retinal degeneration rgcs no longer respond to light advanced retinal prostheses previously demonstrated aim at electrically restoring the correct neural code in each rgc type in diseased retina which requires cell type identification without information about the light response properties of rgcs in the present paper we introduce two novel and efficient computational methods for cell type identification in neural circuit using spatiotemporal voltage signals produced by spiking cells recorded with electrode array we describe the data we used for our study in section and we show how the raw descriptors used by our classifiers are extracted from voltage recordings of primate retina we then introduce classifier that leverages both handspecified and based features of the electrical signatures of unique rgcs as well as large unlabeled data sets to identify cell 
types section we evaluate its performance for distinguishing between midget parasol and small bistratified cells on manually annotated datasets then in section we show how matrix completion techniques can be used to identify populations of unique cell types and assess the accuracy of our algorithm by predicting the polarity on or off of rgcs on datasets where ground truth is available section is devoted to numerical experiments that we designed to test our modeling choices finally we discuss future work in section extracting descriptors from electrical recordings in this section we define the electrical signatures that we will use in cell classification and the algorithms that allow us to perform the statistical inference of cell type are described in the subsequent sections we exploit three electrical signatures of recorded neurons that are well measured in highdensity recordings first the electrical image ei of each cell which is the average spatiotemporal pattern of voltage measured across the entire electrode array during the spiking of cell this measure provides information about the geometric and electrical conduction properties of the cell itself second the interval distribution isi which summarizes the temporal separation between spikes emitted by the cell this measure reflects the specific ion channels in the cell and their distribution across the cell third the function ccf of firing between cells this measure captures the degree and polarity of interactions between cells in generation of spike electrophysiological image calculation alignment and filtering the raw data we used for our numerical experiments consist of extracellular voltage recordings of the electrical activity of retinas from male and female macaque monkeys which were sampled and digitized at khz per channel over channels laid out in µm hexagonal lattice see appendix for ms sample movie of an electrical recording the emission of an action potential by spiking neuron causes transient voltage fluctuations along its anatomical features soma dendritic tree axon by bringing an extracellular matrix of electrodes in contact with neural tissue we capture the projection of these voltage changes onto the plane of the recording electrodes see figure with such dense multielectrode arrays the voltage activity from single cell is usually picked up on multiple electrodes while the literature refers to this footprint as the electrophysiological or electrical image ei of the cell it is an inherently spatiotemporal characteristic of the neuron due to the transient nature of action potentials in essence it is short movie ms of the average electrical activity over the array during the emission of an action potential by spiking neuron which can include the properties of other cells whose firing is correlated with this neuron we calculated the electrical images of each identified rgc in the recording as described in the literature in minute recording we typically detected action potentials per rgc for each cell we averaged the voltages recorded over the entire array in ms window starting ms before the peak negative voltage sample for each action potential we cropped from the electrode array the subset of electrodes that falls within µm radius around the rgc soma see figure in order to represent each ei by matrix time points number of electrodes in µm radius or equivalently dimensional vector we augment the training data by exploiting the symmetries of the approximately hexagonal grid of the electrode array we form the training data eis 
from original eis rotating them by and the reflection of each spatial symmetries in total the characteristic radius µm here used to select the central portion of the ei is of our method which controls the signal to noise ratio in the input data see section figure middle panel in the appendix of this paper we describe families subdivided into of filters we manually built to capture anatomical features of the cell in particular we included filters corresponding to various action potential propagation velocities at level of the the axon and parameter which captures the soma size these quantities are believed to be indicative of cell type time ms distance to soma µm µm time to spike ms distance to soma µm time to spike ms µm time ms ms ms ms ms ms ms figure eis and cell morphology top row multielectrode arrays record projection of action potentials schematically illustrated here for midget left and parasol right rgc midget cells have an asymmetric dendritic field while parasol cells are more isotropic bottom row temporal evolution of the voltage recorded on the electrodes located within µmradius around the electrode where the largest action potential was detected which we use for cell type classification amplitude of the circles materialize signal amplitude red circles positive voltages blue circles negative voltages we filtered the spatiotemporally aligned rgc electrical images with our hand defined filters to create first feature set in separate experiments we also filtered aligned eis with iid gaussian random filters as many as our features in the fashion of see table to compare performances interspike intervals the statistics of the timing of action potential trains are another source of information about functional rgc types interspike intervals isis are an estimate of the probability of emission of two consecutive action potentials within given time difference by spiking neuron we build histograms of the times elapsed between two consecutive action potentials for each cell to form its isi we estimate the interspike intervals over ms with time granularity of ms resulting in dimensional isi vectors isis always begin by refractory period duration over which no action potentials occur following first action potential this period lasts the first isis then increase before decaying back to zero at rates representative of the functional cell type see figure left hand side we describe each isi using the values of time differences where the smoothed isi reaches of its maximum value as well as the slopes of the linear interpolations between each consecutive pair of points functions and electrical coupling of cells there is in the retina high probability of joint emission of action potentials between neighboring ganglion cells of the same type while rgcs of antagonistic polarities on vs off cells tend to exhibit strongly negatively correlated firing patterns in other words the emission of an action potential in the on pathway leads to reduced probability of observing an action potential in the off pathway at the same time the function of two rgcs characterizes the probability of joint emission of action potentials for this pair of cells with given latency and as such holds information about functional coupling between the two cells between different functional rgc types have been studied extensively in the literature previously for example in construction of ccfs follows the same steps as isi computation we obtain the ccf of pairs of cells by building histograms of time differences between their 
consecutive firing times large ccf value near the origin is indicative of positive functional coupling whereas negative coupling corresponds to negative ccf at the origin see figure the three panels on the right parasols δt ms parasols δt ms parasols correlation frequency correlation off parasol on parasol on midget off midget sbc correlation interspike intervals δt ms δt ms figure left panel interspike intervals for the major rgc types of the primate retina right panels functions between parasol cells purple traces single pairwise ccf red line population average green arrow strength of the correlation learning electrical signatures of retinal ganglion cells learning dictionaries from slices of unlabeled data learning descriptors from unlabeled data or dictionary learning has been successfully used for classification tasks in data such as images speech and texts the methodology we used for learning discriminative features given relatively large amount of unlabeled data follows closely the steps described in extracting independent slices from the data the first step in our approach consists of extracting independent as much as possible slices from data points one can think of slice as subset of the descriptors that is nearly independent from other subsets in image processing the analogue object is named patch small in our case we used data slices the isi descriptors form one such slice the others are extracted from eis it is reasonable to assume isi features and ei descriptors are independent quantities after aligning the eis and filtering them with collection of filter banks see appendix for description of our biologically motivated filters we group each set of filtered eis each group of filters reacts to specific patterns in eis rotational motion driven by dendrites radial propagation of the electrical signal along the axon and the direction of propagation constitute behaviors captured by distinct filter banks thereby we treat the response of data to each one of them as unique data slice each slice is then whitened and finally we perform sparse on each slice separately denotes an integer which is parameter of our algorithm that is letting denote slice of data number of data points and dimensionality of the slice and cn denote the set of cluster assignment matrices cn kui we consider the optimization problem min kx uvt cn with warm started nmf in order to solve the optimization problem we propose strategy that consists in relaxing the constraint cn in two steps we initially relax the constraint cn completely and set that is we consider problem where we substitute cn with the larger set and run an alternate minimization for few steps then we replace the clustering constraint cn with nonnegativity constraint while retaining after few steps of nonnegative alternate minimization we activate the constraint cn and finally raise the value of this strategy systematically resulted in lower values of the objective compared to random or initializations building feature vectors for labeled data in order to extract feature vectors from labeled data we first extract slice each data point we extract isi features on the one hand and filter each data point with all filter families each slice is separately whitened and compared to the cluster centers of its slice for this we use the matrices of cluster centroids computed for the all slices letting denote the soft thresholding operator sign xi max we compute for each slice which is the inner products of the corresponding slice of the data point with all 
cluster centroids for the same slice we concatenate the from different slices and use the resulting encoded point to predict cell types the last step is performed either by feeding concatenated vectors together with the corresponding label to logistic regression classifier which handles multiple classes in fashion or to random forest classifier predicting cell polarities by completing the rgc coupling matrix we additionally exploit pairwise spike train to infer rgc polarities on vs off and estimate the polarity vector by using measure of the pairwise functional coupling strength between cells the rationale behind this approach is that neighboring cells of the same polarity will tend to exhibit positive correlations between their action potential spike trains corresponding to positive functional coupling if the cells are of antagonistic polarities functional coupling strength will be negative the coupling of two neighboring cells can therefore be modeled as yi yj where yi yj denote cell polarities because far apart cells do not excite or inhibit each other to avoid incorporating noise in our model we choose to only include estimates of functional coupling strengths between neighboring cells the neighborhood size is hyperparameter of this approach that we study in section if denotes the graph of neighboring cells in recording we only use for spike trains of cells which are connected with an edge in since we can estimate the position of each rgc in the lattice from its ei we therefore can form the graph which is regular geometric graph if is the number of edges in let denote the linear map rq returning the values ci for cells and located within critical distance we use to denote the adjoint transpose operator the complete matrix of pairwise couplings can then be written up to observation noise as yyt where is the vector of cell polarities for on and for off cells therefore the observation can be modeled as yyt with observation noise and the recovery of yyt is then formulated as standard matrix completion problem minimizing the nonconvex loss using newton steps in this section we show how to estimate given the observation of yyt by minimizing the loss kp zzt even though minimizing this degree polynomial loss function is in general we propose newton method with spectral heuristic for approaching the solution see algorithm in similar contexts when the sampling of entries is uniform this type of spectral initialization followed by alternate minimization has been proven to converge to the global minimum of loss analogous to while our sampling graph is not an graph we empirically observed that its regular structure enables us to come up with reliable initial spectral guess that falls within the basin of attraction of the global minimum of in the subsequent newton scheme we iterate using the shifted hessian matrix zzt where ensures positive definiteness whenever computing and is expensive due to potentially large number of cells then replacing by diagonal or scalar approximation reduces per iteration cost while resulting in slower convergence we refer to this method as method for minimizing the nonconvex objective while ista is first order method applied to the convex relaxation of the problem as presented in the appendix see figure middle panel using the same convex relaxation we prove in the appendix that the proposed estimator has classification accuracy of at least with algorithm polarity matrix completion require observed couplings the projection operator let be the leading eigenpair of 
initialize revealed for do zt zt zt zt zt is the hessian or an approximation zt end for input task ei isi our filters ei isi rand filters ei isi rand filters ei only our filters isi only ccf table comparing performance for input data sources and filters cell type identification polarity identification cell type and polarity identification eis cropped within µm from the central electrode numerical experiments in this section we benchmark the performance of the cell type classifiers introduced previously on datasets where the ground truth was available for the rgcs in those datasets experts manually the light response properties of the cells in the manner previously described in the literature our unlabeled data contained spatial symmetries data points the labeled data consists of off midget off parasol on midget on parasol and small bistratified cells assembled from distinct recordings rgc classification from their electrical features our numerical experiment consists in hiding one out of labeled recordings learning cell classifiers on the others and testing the classifier on the hidden recording we chose to test the performance of the classifier against individual recordings for two reasons firstly we wanted to compare the polarity prediction accuracy from electrical features with the prediction made by matrix completion see section and the matrix completion algorithm takes as input pairwise data obtained from single recording only secondly experimental parameters likely to influence the eis and isis such as recording temperature vary from recording to recording but remain consistent within recording since we want the reported scores to reflect expected performance against new recordings not including points from the test distribution gives us more realistic proxy to the true test error in table we report classification accuracies on different classification tasks cell type identification midget parasol small bistratified cells polarity identification on versus off cells cell type and polarity small bistratified each row of the table contains the data used as input the first column represents the results for the method where the dictionary learning step is performed with and eis are recorded within radius of µm from the central electrode electrodes on our array we compare our method with an identical method where we replaced the filters by the random gaussian filters of second column for and third for the performance of random filters opens perspectives for learning deeper predictors using random filters in the first layer the impact of on our filters can be seen in figure panel larger seems to bring further information for polarity prediction but not for cell type classification which leads to an optimal choice in the problem in the and columns we used only part of the features sets at our disposal eis only and isis only respectively these results confirm that the joint use of both eis and isis for cell classification is beneficial globally cell type identification turns out to be an easier task than polarity prediction using per cell descriptors figure middle panel illustrates the impact of ei diameter on classification accuracy while larger recording radius lets us make use of more signal the amount of noise incorporated also increases with the number of electrodes taken into account and we observe in terms of signal to noise ratio on all three tasks an interesting observation is the second jump in the accuracy of prediction around an ei diameter of at which point we attain peak 
performance of we believe this jump takes place when axonal signals start being incorporated in the ei and we believe these signals to be strong indicator of cell type because of accuracy dictionary size electrical image radius µm maximum cell distance µm figure left panel effect of the dictionary size and middle panel eis radius on per cell classification right panel effect of the neighborhood size on polarity prediction using matrix completion cell index cell index loss opt coupling strength loss ut opt observed couplings newton first order convex ista pca iteration sp nmf rnd iteration figure left panel observed coupling matrix middle panel convergence of matrix completion algorithms right panel with our initialization versus other choices known differences in axonal conduction velocities prediction variance is also relatively low for prediction compared to polarity prediction while predicting polarity turns out to be significantly easier on some datasets than others on average the logistic regression classifier we used performed slightly better than random forests on the various tasks and data sets at our disposal matrix completion based polarity prediction matrix completion resulted in accuracy on three out of datasets and in an average of accuracy in the other datasets we report the average performance in table even though it is inferior to the simpler classification approach for two reasons the idea of using matrix completion for this task is new and it has high potential as demonstrated by figure right hand panel on some datasets matrix completion has accuracy however on other datasets either because of issues fragile or of other noise the approach does not do as well in figure right hand side we examine the effect of the neighborhood size on prediction accuracy colors correspond to different datasets for sake of readability we only show the results for out of datasets the best the worse and intermediary the sensitivity to maximum cell distance is clear on this plot bold curves correspond to the prediction resulting after steps of our newton algorithm dashed curves correspond to predictions by the first order nonconvex method stopped after steps and dots are prediction accuracies of the leading singular vector the spectral initialization of our algorithm overall the newton algorithm seems to perform better than its rivals and there appears to be an optimal radius to choose for each dataset which corresponds to the characteristic distance between pairs of cells here only parasols this parameter varies from dataset to dataset and hence requires parameter tuning before extracting ccf data in order to get the best performance out of the algorithm strategy for dictionary learning we refer to figure right hand panel for an illustration of our strategy for minimizing as described in section there we compare dense initialized with our start steps of unconstrained alternate minimization and steps of nonnegative alternate minimization referred to as with single spectral warm start steps unconstrained alternate minimization initialization sp and steps nonnegative alternate minimization nmf as well as with two standard baselines which are random initialization and initializations we postpone theoretical study of this initialization choice to future work note that each step of the alternate minimization involves few products and operations on matrices using nvidia tesla gpu drastically accelerated these steps allowing us to scale up our experiments discussion we developed accurate classifiers 
using unique collection of labeled and unlabeled electrical recordings and employing recent advances in several areas of machine learning the results show strong empirical success of the methodology which is highly scalable and adapted for major applications discussed below matrix completion for binary classification is novel and the two heuristics we used for minimizing our objectives show convincing superiority to existing baselines future work will be dedicated to studying properties of these algorithms recording methods three major aspects of electrical recordings are critical for successful cell type identification from electrical signatures first high spatial resolution is required to detect the fine features of the eis much more widely spaced electrode arrays such as those often used in the cortex may not perform as well second high temporal resolution is required to measure the isi accurately this suggests that optical measurements using sensors would not be as useful as electrical measurements third recordings are required to detect many pairs of cells and estimate their functional interactions electrode arrays with fewer channels may not suffice thus electrophysiological recordings are uniquely well suited to the task of identifying cell types future directions probable source of variability in cell type classification is differences between retinal preparations including eccentricity in the retina variability and experimental variables such as temperature and of the recording in the present data features were defined and assembled across dozen different recordings this motivates transfer learning to account for such variability exploiting the fact that although the features may change somewhat between preparations target domains the underlying cell types and the fundamental differences in electrical signatures are expected to remain we expect future work to result in models that enjoy higher complexity thanks to training on larger datasets thus achieving invariance to ambient conditions eccentricity and temperature automatically the model we used can be interpreted as neural network straightforward development would be to increase the number of layers the relative success of random filters on the first layer is sign that one can hope to get further automated improvement by building richer representations from the data itself and with minimum incorporation of prior knowledge application two major applications are envisioned first an extensive set of highdensity recordings from primate retina can now be mined for information on cell types manual identification of cell types using their light response properties is extremely however the present approach promises to facilitate automated mining second the identification of cell types without light responses is fundamental for the development of highresolution retinal prostheses of the future in such devices it is necessary to identify which electrodes are capable of stimulating which cells and drive spiking in rgcs according to their type in order to deliver meaningful visual signal to the brain for this futuristic interface application our results solve fundamental problem finally it is hoped that these applications in the retina will also be relevant for other brain areas where identification of neural cell types and customized electrical stimulation for neural implants may be equally important in the future acknowledgement we are grateful to montanari and palanker for inspiring discussions and valuable comments and rhoades for 
labeling the data er acknowledges support from grants and we thank the stanford data science initiative for financial support and nvidia corporation for the donation of the tesla gpu we used data collection was supported by national eye institute grants and ejc please contact ejc ej for access to the data references arthur and vassilvitskii the advantages of careful seeding in society for industrial and applied mathematics editors proceedings of the eighteenth annual symposium on discrete algorithms beck teboulle fast iterative algorithm for linear inverse problems siam journal of imaging sciences chichilnisky and rachel kalmar functional asymmetries in on and off ganglion cells of primate retina the journal of neuroscience coates and ng the importance of encoding versus training with sparse coding and vector quantization in international conference in machine learning icml volume coates ng and lee an analysis of networks in unsupervised feature learning in international conference on artificial intelligence and statistics aistats pages dacey the cognitive neurosciences book section origins of perception retinal ganglion cell diversity and the creation of parallel visual pathways pages mit press dacey and packer colour coding in the primate retina diverse cell types and circuitry curr opin neurobiol dacey and petersen dendritic field size and morphology of midget and parasol cells of the human retina pnas steven devries and denis baylor mosaic arrangement of ganglion cell receptive fields in rabbit retina journal of neurophysiology greschner shlens bakolista field gauthier jepson sher litke and chichilnisky correlated firing among major ganglion cell types in primate retina physiol jepson hottowy wiener dabrowski litke and chichilnisky reproduction of spatiotemporal visual signals for retinal prosthesis neuron keshavan montanari and oh matrix completion from few entries ieee transactions on information theory li gauthier schiff sher ahn field greschner callaway litke and chichilnisky anatomical identification of extracellularly recorded cells in multielectrode recordings neurosci litke bezayiff chichilnisky cunningham dabrowski grillo grivich grybos hottowy kachiguine kalmar mathieson petrusca rahman and sher what does the eye tell the brain development of system for the recording of retinal output activity ieee trans on nuclear science mairal bach ponce and sapiro online dictionary learning for sparse coding in international conference on machine learning icml pages mastronarde correlated firing of cat retinal ganglion cells spontaneously active inputs to and neurophysiol rahimi and recht random features for kernel machines in advances in neural information processing systems nips pages silveira and perry the topography of magnocellular projecting ganglion cells cells in the primate retina neuroscience 
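As a supplement to the dictionary-learning strategy discussed in the preceding section (a spectral warm start followed by unconstrained and then nonnegative alternating minimization), the following sketch illustrates that general recipe on a synthetic nonnegative matrix. The rank, iteration counts, ridge term, and data are assumptions; this is not the authors' code, and the nonnegative phase is a simple projected alternating least-squares heuristic.

```python
# Illustrative sketch: spectral (SVD) warm start, a few unconstrained
# alternating least-squares steps, then projected-nonnegative alternation.
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 200, 120, 5
X = np.abs(rng.normal(size=(n, k))) @ np.abs(rng.normal(size=(k, p)))  # synthetic nonnegative data

# Spectral warm start from the leading singular vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
D = U[:, :k] * np.sqrt(s[:k])
W = np.sqrt(s[:k])[:, None] * Vt[:k, :]

def als_step(X, D, W, nonneg=False, ridge=1e-6):
    # Solve for W given D, then for D given W (ridge-regularized least squares);
    # optionally project each factor onto the nonnegative orthant.
    W = np.linalg.solve(D.T @ D + ridge * np.eye(D.shape[1]), D.T @ X)
    if nonneg:
        W = np.maximum(W, 0.0)
    D = np.linalg.solve(W @ W.T + ridge * np.eye(W.shape[0]), W @ X.T).T
    if nonneg:
        D = np.maximum(D, 0.0)
    return D, W

for _ in range(10):                  # unconstrained alternating minimization
    D, W = als_step(X, D, W, nonneg=False)
for _ in range(50):                  # nonnegative alternating minimization
    D, W = als_step(X, D, W, nonneg=True)

print("relative reconstruction error:",
      np.linalg.norm(X - D @ W) / np.linalg.norm(X))
```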
theory of decision making under dynamic context michael shvartsman princeton neuroscience institute princeton university princeton nj vaibhav srivastava department of mechanical and aerospace engineering princeton university princeton nj vaibhavs jonathan cohen princeton neuroscience institute princeton university princeton nj jdc abstract the dynamics of simple decisions are well understood and modeled as class of random walk models however most decisions include influence of additional information we call context in this work we describe computational theory of decision making under dynamically shifting context we show how the model generalizes the dominant existing model of decision making and can be built up from weighted combination of decisions evolving simultaneously we also show how the model generalizes recent work on the control of attention in the flanker task finally we show how the model recovers qualitative data patterns in another task of longstanding psychological interest the ax continuous performance test using the same model parameters introduction in the late wald and colleagues developed sequential test called the sequential probability ratio test sprt this test accumulates evidence in favor of one of two simple hypotheses until log likelihood threshold is crossed and one hypothesis is selected forming random walk to decision bound this test was quickly applied as model of human decision making behavior both in its discrete form and in continuous realization as biased wiener process the diffusion decision model or ddm this work has seen recent revival due to evidence of neurons that appear to reflect ramping behavior consistent with evidence accumulation cortical circuits implementing decision process similar to the sprt in the basal ganglia in rats and the finding correlations between ddm parameters and activity in eeg and fmri bolstered by this revival number of groups investigated extension models some of these models tackle complex hypothesis spaces or greater biological realism others focus on relaxing stationarity assumptions about the task setting whether by investigating integration deadlines or different evidence distribution by trial we engage with the latter literature by providing theory of decision making under dynamically changing context we define context simply as additional information that may bear upon decision whether from perception or memory such theory is important because even simple tasks that use contexts such as prior biases may require inference on the context itself before it can bear on the decision the focus on dynamics is what distinguishes our work from efforts on changes in preferences and internal context updating the admission of evidence from memory distinguishes it from work on multisensory integration we illustrate such decisions with an example consider seeing someone that looks like friend target stimulus and decision to greet or not greet this person context can be external concert hall or internal the memory that the friend went on vacation and therefore this person is likely lookalike the context can strictly constrain the decision greeting friend in the street the middle of film or only bias it guessing whether this is friend or lookalike after retrieving the memory of them on vacation regardless context affects the decision and we assume it needs to be inferred either before or alongside the greeting decision itself we aim to build normative theory of this context processing component of decision making we show that our 
theory generalizes the sprt and therefore wiener process ddm in continuous time and how decisions can be optimally built up from dynamically weighted combination of decisions our theory is general enough to consider range of existing empirical paradigms in the literature including the stroop flanker simon and the we choose to mention these in particular because they reside on the bounds of the task space our theory considers on two different dimensions and describe discretization of task space on those dimensions that accommodates those existing paradigms we show that in spite of the framework generality it can provide wellbehaved predictions across qualitatively different tasks we do this by using our framework to derive notational variant of an existing flanker model and using parameter values from this previous model to simultaneously generate qualitatively accurate predictions in both the flanker and paradigms that is our theory generates plausible behavior in qualitatively different tasks using the same parameters the theoretical framework we assume that dynamic context decision making like fixed context decision making can be understood as sequential bayesian inference process our theory therefore uses sequentially drawn samples from external input internal memory to compute the joint posterior probability over the identity of the true context and decision target over time it maps from this joint probability to response probability using fixed response mapping and uses fixed threshold rule defined over the response probability to stop sampling and respond we make distinction between our theory of decision making and individual task models that can be derived from the theory by picking points in task space that the theory accommodates formally we assume the decider conditions decision based on its best estimate of two pieces of information some unknown true context taking on one of the values ci and some unknown true target taking on one of the values gj this intentionally abstracts from richer views of context ones which assume that the context is qualitatively different from the target or that the relevant context to sample from is unknown we denote by random variables representing the possible draws of context and target and deterministic function from the distribution to distribution over responses we define an abstract context sensor and target sensor selectively tuned to context or target information such that ec is discrete piece of evidence drawn from the context sensor and eg one drawn from the target sensor the goal of the decider is to average over the noise in the sensors to estimate the pair sufficiently to determine the correct response and we assume that this inference is done optimally using bayes rule of we denote by ton ton the time at which the context appears and tc the time at which it disapon of pears and likewise tg tg the time at which the target appears and disappears we also restrict on these times such that ton tg this is the primary distinction between context and target which can otherwise be two arbitrary stimuli the onsets and offsets define one dimension in continuous space of tasks over which our theory can make predictions the form of defines second dimension in the space of possible tasks where our theory makes predictions we use suboptimal but simple threshold heuristic for the decision rule when the posteriori probability of any response crosses some adaptively set threshold sampling ends and the response is made in favor of that response for the 
purposes of this paper we restrict ourselves to two extremes on both of these dimensions for stimulus onset and offset times we consider one setting where the context and target appear on of and disappear together perfect overlap ton tof tg and tc and one where the target of appears some time after the context disappears no overlap tc ton we label the former the external context model because the contextual information is immediately available and the latter the internal context model because the information must be previously encoded and maintained the external context model is like the ongoing film context from the introduction and the internal context is like knowing that the friend is on vacation for the response mapping function we consider one setting where the response is solely conditioned on the perceptual target response and one where the response is is conditioned jointly on the pair response the response is like choosing to greet or not greet the friend at the movie theater and the contextindependent one is like choosing to greet or not greet the friend on the street in the lab classic tasks like the stroop flanker and simon fall into the taxonomy as tasks with response because the response is solely conditioned on the perceptual target on the other side of both dimensions are tasks like the task and the ax continuous performance test in our consideration of these tasks we restrict our attention to the case where there are only two possible context and target hypotheses the sequential inference procedure we use can be performed for other numbers of hypotheses and responses though the analysis we show later in the paper relies on the assumption and on indepednence between the two sensors external context update first we describe the inference procedure in the case of perfect overlap of context and target at the current timestep the decider has available evidence samples from both the context and the target ec and eg and uses bayes rule to compute the posterior probability ec eg pτ pτ the first term is the likelihood of the evidence given the joint hypothesis and the second term is the prior which we take to be the posterior from the previous time step we use the flanker task as concrete example in this task participants are shown central target an or an surrounded on both sides by distractors flankers more or stimuli that are either congruent or incongruent with it participants are told to respond to the target only but show number of indications of influence of the distractor most notably an early period of performance and slowdown or reduced accuracy with incongruent relative to congruent flankers we label the two possible target identities and the possible flanker identities with the underscore representing the position of the target this gives us the two congruent possibilities or sss hhh and the two incongruent possibilities or shs hsh the response mapping function marginalize over context identities at each timestep with probability pc with probability the higher of the two response probabilities is compared to threshold and when this threshold is crossed the model responds what remains is to define the prior and the likelihood function ec eg or its inverse the sample generation function for sample generation we assume that the context and target are represented as two gaussian distributions µc αµ µg σc ασ σg µg αµ µc σg ασ σc here µc and µg are baseline means for the distributions of context and target and are their variances and the scaling factors mix them potentially 
reflecting perceptual overlap in the sensors this formulation is notational variant of an earlier flanker model but we are able to derive it by describing the task in our formalism we describe the exact mapping in the supplementary material moreover we later show how this notational equivalence lets us reproduce both yu and colleagues results and data patterns in another task using the same parameter settings comparison to model we now write the model in terms of likelihood ratio test to facilitate comparison to wald sprt and wiener diffusion models this is complementary to an earlier approach performing dynamical analysis on the problem in probability space first we write the likelihood ratio of the full response posteriors for the two responses since the likelihood ratio and the max posteriori probability are monotonically related thresholding on maps onto the threshold over the probability of the most probable response we described above eg eg ec eg pτ ec eg pτ ec eg pτ ec eg pτ for this analysis we assume that context and target samples are drawn independently from each other that αµ ασ and therefore that ec eg ec eg we also index the evidence samples by time to remove the prior terms pτ and introduce the notation lt tx eg gx and lt cx et cx for the likelihoods with indexing stimuli and tcon tgon indexing evidence samples over time now we can rewrite lt lt lt lt lt lt lt lt lt lt divide both the numerator and the denominator by lt lt lt lt lt lt lt lt lt lt lt lt lt lt separate out the target likelihood product and take logs log log lt lt log lt lt lt lt now the first term is the wald sequential probability ratio test with zgτ log lt lt in the continuum limit it is equal to wiener diffusion process dzg ag dw with ag log lt and bg var log we can relabel the sprt for the target zg log lt and do the same for the context drift that appears on both numerator and denominator of the final term zτc log lltt and zc log then the expression is as follows log zg log ezc zc log in equation comprises two terms the first is the unbiased sprt statistic while the second is nonlinear function of the sprt statistic for the decision on the context the nonlinear term plays the role of bias in the sprt for decision on target this rational dynamic prior bias is an advance over previous heuristic approaches to dynamic biases several limits of if the context and the target are independent then the second are of interest and reduces to the biased sprt for the target if each target term reduces to log is equally likely given context then the nonlinear term in reduces to zero and reduces to the sprt for the target if each context deterministically determines different target then any piece of evidence on the context is equally informative about the target accordingly reduces to the sum of statistic for context and target zgτ zcτ if the magnitude of drift rate for the context is much higher than the magnitude of drift rate for the target or the magnitude of the bias is high then the nonlinear term saturates at faster timescale the decision time in this limit the approximate contribution of the nonlinear term is either log or log finally in the limit of large thresholds or equivalently large decision times will be large will be small and the nonlinear term in can be approximated by linear function of zcτ obtained using the first order taylor series expansion in all these cases can be approximated by sum of two sprts however this approximation may not hold in general and we suspect many interesting cases will 
require us to consider the nonlinear model in in those cases the signal and noise characteristics of context and target will have different and we think distinguishable effects on the rt distributions we measure the update and application to new task recall our promise to explore two extremes on the dimension of context and onset timing and two extremes on the dimension of dependence the flanker task is an external context task with response so we now turn to an internal context task with response this task is the ax continuous performance test task with origins in the psychiatry literature now applied to cognitive control in this task subjects are asked to make response to probe target stimulus by convention labeled or where the response mapping is determined by previously seen cue context stimulus or in our notation unlike the flanker where all stimuli pairs are equally likely in the ax trials are usually the most common appearing of the time or more and by trials least common ay and bx trials appear with equal frequency but have dramatically different conditional probabilities due to the preponderance of ax trials two response mappings are used in the literature an asymmetric one where one response is made on ax trials and the other response otherwise and symmetric variant where one response is made to ax and by trials and the other to ay and bx trials we focus on the symmetric variant since in this case the response is always in the asymmetric variant the response is is on trials we can use the definition of the task to write new form for lef with probability with probability we assume for simplicity that the inference process on the context models the maintenance of context information and retrieval of the response rule though the model could be extended to perceptual encoding of the context as well that is we start the inference machine at tof using the following on update when tof pτ pτ then once the target appears the update becomes pτ pτ for samples after the context disappears we introduce simple decay mechanism wherein the probability with which the context sensor provides sample from the true context decays exponentially of sample is drawn from the true context with probability and drawn uniformly othp ec erwise the update takes this into account such that as grows the ratio approaches ec and the context sensor stops being informative notation omitted for space this means that the unconditional posterior of context can saturate at values other than the remainder of the model is exactly as described above this provides an opportunity to generate predictions of both tasks in shared model something we take up in the final portion of the paper but first as in the flanker model we reduce this model to combination of multiple instances of the ddm relating the internal context model to the drift model we sketch an intuition for how our internal context model can be built up from combination of drift models again assuming sensor independence the derivation uses the same trick of dividing numerator and denominator by the likelihood as the flanker expressions and is included in the supplementary material as is the asymmetric variant we state the final expression for the symmetric case here log log ezc ezc equation combines the sprt statistic associated with the context and the target in nonlinear fashion which is more complicated than in further complicated by the fact that the memory decay turns the context random walk into an process in expectation notation omitted for space but 
follows from the relationship between continuous- and discrete-time AR processes. The reduction of these equations to an SPRT, or to the sum of two SPRTs, is subtle and is valid only in rather contrived settings; for example, if the drift rate for the target is much higher than the drift rate for the context, then in the limit of large thresholds the expression can be approximated by one of two log-ratio terms involving z_c. As before, we think it will be highly instructive to further investigate the cases where these reductions do not apply.

Simulation results for both tasks using the same model and parameters. With the relationship between both tasks established via our theory, we can now simulate behavior in both tasks under nearly the same model parameters. The one difference is in the memory component, governed by the memory decay parameter and the target onset time t_on: longer intervals between context disappearance and target appearance have the same effect as higher values of the decay, in that they make context retrieval poorer. We use a fixed decay and interval, which results in an approximately fixed probability of drawing a correct context sample by the time the target comes on. The effect of both parameters is equivalent in the results we show, since we do not explore variable delays, but it could be explored by varying this duration. For simplicity we assume the sampling distribution for e_c and e_g is identical for both tasks, though this need not hold except for identical stimuli sampled from perception. For flanker simulations we use the model with no spatial uncertainty (the mixing factors alpha_mu and alpha_sigma set so that the sensors do not mix), to best match the earlier model and our analytical connections to the SPRT. We assume the model has a high congruence prior for the flanker model and the correct prior for the AX-CPT, as detailed in the table.

Table: priors for the inference process over (context, target) pairs for the flanker and AX-CPT instantiations of our theory.

The remaining parameters are identical across both task simulations (the sensor means mu_c, mu_g and standard deviations sigma_c, sigma_g). To replicate the flanker results, we followed the earlier flanker model by introducing an error parameter: the probability of making a random response immediately at the first timestep. We simulated many trials for each model. The figure shows results from the simulation of the flanker task, recovering the characteristic early dip in performance on incongruent trials. This simulation supports the assertion that our theory generalizes the earlier flanker model, though we are not sure why our scale in timesteps appears different by a constant factor in spite of using what we think are equivalent parameters. A library for simulating tasks that fit in our framework, and code for generating all simulation figures in this paper, can be found at https

For the AX-CPT behavior, we compare qualitative patterns from our model to a heterogeneous dataset of humans performing this task across different manipulations, with many trials per subject. The manipulations were different variants of proactive-control-inducing manipulations; this is the most apt comparison to our model, since proactive strategies are argued to involve response preparation of the sort that our model reflects in its accumulation over the context before the target appears. The figure shows mean RTs and accuracies produced by our model for the AX-CPT under the same parameters that we used for the flanker model. This model recovers the qualitative pattern of behavior seen in human subjects in this task, with both RT and error proportion by condition showing the same pattern. Moreover, if we examine the conditional RT plot in the figure, we see that the model predicts a region of poor performance early in AY trials but not in other trial types. This effect appears isomorphic to the early congruence effect in the flanker task, in the sense that both are caused by a strong prior biased away from the correct response: on incongruent trials given a high congruence prior in the flanker, and on AY trials given a high AX prior in the AX-CPT.
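Because the simulation machinery just described reduces to a short sampling loop, a minimal sketch of the external-context (flanker-style) update is given below: sequential Bayesian updating over joint (context, target) hypotheses from independent Gaussian samples, with a threshold on the marginal response probability. The means, noise level, congruence prior, and threshold are illustrative assumptions rather than the fitted values used in the paper.

```python
# Minimal sketch (not the authors' implementation) of the external-context model.
import numpy as np

rng = np.random.default_rng(2)

mu_c = np.array([-1.0, 1.0])       # context-sensor means for the two contexts
mu_g = np.array([-1.0, 1.0])       # target-sensor means for the two targets
sigma = 2.0                         # shared sensor noise (assumption)
prior = np.array([[0.4, 0.1],       # P(context i, target j): congruence prior
                  [0.1, 0.4]])
threshold = 0.95

def gauss_lik(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def simulate_trial(true_c, true_g, max_steps=500):
    post = prior.copy()
    for t in range(1, max_steps + 1):
        e_c = rng.normal(mu_c[true_c], sigma)    # context-sensor sample
        e_g = rng.normal(mu_g[true_g], sigma)    # target-sensor sample
        lik = np.outer(gauss_lik(e_c, mu_c, sigma), gauss_lik(e_g, mu_g, sigma))
        post = post * lik
        post /= post.sum()
        p_resp = post.sum(axis=0)                # marginalize over context
        if p_resp.max() >= threshold:
            return t, int(p_resp.argmax())
    return max_steps, int(p_resp.argmax())

# Congruent (context matches target) versus incongruent trials.
for label, (c, g) in [("congruent", (1, 1)), ("incongruent", (0, 1))]:
    results = [simulate_trial(c, g) for _ in range(2000)]
    rts = np.array([r[0] for r in results])
    acc = np.mean([r[1] == g for r in results])
    print(f"{label:12s} mean RT = {rts.mean():6.1f} steps, accuracy = {acc:.3f}")
```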
More generally, the model recovers conditional accuracy curves that look very similar to those in the human data.

Figure: the model recovers the characteristic flanker pattern. Left: performance computed by RT bin for congruent and incongruent trials, showing the early dip in performance. Right: response time distributions for congruent and incongruent trials, showing the same mode but a fatter tail for incongruent relative to congruent trials. Both are signature phenomena in the flanker task previously recovered by the model of Yu and colleagues, consistent with our theory being a generalization of their model.

Figure: the model recovers gross RT and error patterns in human behavior. Left: RT and error rates by trial type (AX, AY, BX, BY) in the model, using the same parameters as the flanker model. Right: RT and error rates by trial type in human participants. Error bars are standard errors; where not visible they are smaller than the dots. Both RT and error patterns are quite similar; note that the mapping between model timesteps and milliseconds need not be exact.

Discussion. In this paper we have provided a theoretical framework for understanding decision making under dynamically shifting context. We used this framework to derive models of two distinct tasks from the cognitive control literature, one a notational equivalent of a previous model and the other a novel model of the AX-CPT task. We showed how we can write these models in terms of combinations of random walks. Most importantly, we showed how two models derived from our theoretical framing can recover RT, error, and conditional accuracy patterns seen in human data without a change of parameters between tasks and task models. Our results are quantitatively robust to small changes in the prior because the update equations are smooth functions of the prior; the early incongruent errors in the flanker are also robust to larger changes, as long as the congruence prior is sufficiently high. The ordering of RTs and error rates for the AX-CPT relies on assuming that participants at least learn the correct ordering of trial frequencies, which we think is an uncontroversial assumption. One natural next step should be to generate direct quantitative predictions of behavior in one task based on a model trained on another task, ideally on an individual-subject level, and in a task

Figure: the model recovers the conditional accuracy pattern in human behavior. Left: accuracy computed by RT bin for the four trial types, using the same parameters as the flanker model. Right: the same plot from human participants (see text for details; bins with too few observations are omitted). Error bars are standard errors; where not visible they are smaller than the dots. Both plots show qualitatively similar patterns. Two discrepancies are of note: first, the model predicts very early AY responses to be more accurate than slightly later responses, and early responses to be close to chance; we think at least part of this is due to the error parameter, but we retained it for consistency with the flanker model. Second, the humans show slightly better BY than BX performance
early on something the model does not recover we think this may have to do with global bias that the model is somehow not capturing note the abscissae are in different units though they correspond surprisingly well that fits in our framework that has not been extensively explored for example an flanker variant or response congruence judgment task the main challenge in pursuing this kind of analysis is our ability to efficiently estimate and explore these models which unlike the models have no analytic expressions or fast approximations we believe that approximations such as those provided for the flanker model can and should be applied within our framework both as way to generate more efficient data fits and as way to apply the tools of dynamical systems analysis to the overall behavior of system of particular interest is whether some points in the task space defined in our framework map onto existing descriptive decision models another natural next step is to seek evidence of our proposed form of integrator in neural data or investigate plausible neural implementations or approximations to it one way of doing so is computing tuning curves of neural populations in different regions to the individual components of the accumulators we propose in equations and another is to find connectivity patterns that perform the computation we hypothesize happens in the integrator third is to look for components correlated with them in eeg data all of these methods have some promise as they have been successfully applied to the fixed context model such neural data would not only test prediction of our theory but also via the brain locations found to be correlated address questions we presently do not address such as whether the dynamic weighting happens at the sampler or further upstream whether unreliable evidence is gated at the sampler or discounted at the integrator second key challenge given our focus on optimal inference is the fact that the fixed threshold decision rule we use is suboptimal for the case of non identically distributed observations while the likelihoods of context and target are independent in our simulations the likelihoods of the two responses are not identically distributed the optimal threshold is generally for this case though the specific form is not known finally while our model recovers accuracies and rt and accuracy patterns it fails to recover the correct pattern of rts that is it predicts much faster errors than corrects on average future work will need to investigate whether this is caused by qualitative or quantitative aspects of the theoretical framework and model references laming information theory of times london academic press ratcliff theory of memory psychological review vol no pp usher and mcclelland the time course of perceptual choice the leaky competing accumulator psychological review vol no pp bogacz brown moehlis holmes and cohen the physics of optimal decision making formal analysis of models of performance in psychological review vol pp yu dayan and cohen dynamics of attentional selection under conflict toward rational bayesian journal of experimental psychology human perception and performance vol pp june cohen and steingard schizophrenic deficits in the processing of context archives of general psychiatry vol wald and wolfowitz optimum character of the sequential probability ratio test the annals of mathematical statistics vol pp kira yang and shadlen neural implementation of wald sequential probability ratio test neuron vol pp bogacz and gurney the 
basal ganglia and cortex implement optimal decision making between alternative neural computation vol pp van vugt simen nystrom holmes and cohen eeg oscillations reveal neural correlates of evidence frontiers in neuroscience vol turner van maanen and forstmann informing cognitive abstractions through neuroimaging the neural drift diffusion psychological review vol no pp norris the bayesian reader explaining word recognition as an optimal bayesian decision psychological review vol pp apr wong and wang recurrent network mechanism of time integration in perceptual the journal of neuroscience the official journal of the society for neuroscience vol no pp frazier and yu sequential hypothesis testing under stochastic deadlines advances in neural information processing systems pp drugowitsch churchland shadlen and pouget the cost of accumulating evidence in perceptual decision the journal of neuroscience vol pp mar srivastava and schrater rational inference of relative preferences in advances in neural information processing systems pp reilly and frank making working memory work computational model of learning in the prefrontal cortex and basal ganglia neural computation vol pp sheppard raposo and churchland dynamic weighting of multisensory stimuli shapes in rats and journal of vision vol no pp stroop studies of interference in serial verbal journal of experimental psychology vol no pp gratton coles sirevaag eriksen and donchin and poststimulus activation of response channels psychophysiological journal of experimental psychology human perception and performance vol no pp simon and wolf choice reaction time as function of angular correspondence and age ergonomics vol pp liu yu and holmes dynamical analysis of bayesian inference models for the eriksen neural computation vol pp june hanks mazurek kiani hopp and shadlen elapsed decision time affects the weighting of prior probability in perceptual decision the journal of neuroscience vol pp apr lositsky wilson shvartsman and cohen drift diffusion model of proactive and reactive control in forced choice task in the conference on reinforcement learning and decision making pp braver the variable nature of cognitive control dual mechanisms trends in cognitive sciences vol pp hanks kopec brunton duan erlich and brody distinct relationships of parietal and prefrontal cortices to evidence accumulation nature liu and blostein optimality of the sequential probability ratio test for nonstationary observations ieee transactions on information theory vol no pp 
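As a companion to the external-context sketch above, the internal-context (AX-CPT-style) variant differs mainly in that context evidence is accumulated before the target appears and becomes unreliable over time. The sketch below implements that decaying-context sampler with the symmetric response mapping; the decay rate, delay, priors, and threshold are assumptions, not the authors' parameter values, and context sampling after target onset is omitted for brevity.

```python
# Companion sketch (illustrative, not the authors' code) of the internal-context model.
import numpy as np

rng = np.random.default_rng(3)

mu = np.array([-1.0, 1.0])           # sensor means for the two stimulus values
sigma = 2.0
decay = 0.01                          # per-step decay of context reliability (assumption)
context_steps = 100                   # delay between context offset and target onset
prior = np.array([[0.70, 0.10],       # P(context, target): AX-dominant design
                  [0.10, 0.10]])
threshold = 0.95
# Symmetric response mapping: one response for AX and BY, the other for AY and BX.
resp_map = np.array([[0, 1],
                     [1, 0]])

def gauss_lik(x):
    """Unnormalized Gaussian likelihood of sample x under each stimulus value."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

def simulate_trial(true_c, true_g, max_steps=500):
    post = prior.copy()
    # Phase 1: context-only evidence; the probability that the context sensor
    # reports the true context decays exponentially, otherwise it samples uniformly.
    for t in range(context_steps):
        src = true_c if rng.random() < np.exp(-decay * t) else rng.integers(0, 2)
        post = post * gauss_lik(rng.normal(mu[src], sigma))[:, None]
        post /= post.sum()
    # Phase 2: target evidence; threshold on the joint (context, target) response map.
    for t in range(max_steps):
        post = post * gauss_lik(rng.normal(mu[true_g], sigma))[None, :]
        post /= post.sum()
        p_resp = np.array([post[resp_map == r].sum() for r in (0, 1)])
        if p_resp.max() >= threshold:
            return t + 1, int(p_resp.argmax())
    return max_steps, int(p_resp.argmax())

for label, (c, g) in [("AX", (0, 0)), ("AY", (0, 1)), ("BX", (1, 0)), ("BY", (1, 1))]:
    res = [simulate_trial(c, g) for _ in range(2000)]
    rts = np.array([r[0] for r in res])
    acc = np.mean([r[1] == resp_map[c, g] for r in res])
    print(f"{label}: mean RT = {rts.mean():6.1f} steps, accuracy = {acc:.3f}")
```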
gaussian process model of quasar spectral energy distributions andrew albert wu school of engineering and applied sciences harvard university acm awu jeffrey regier jon mcauliffe department of statistics university of california berkeley jeff jon dustin lang mcwilliams center for cosmology carnegie mellon university dstn ryan adams school of engineering and applied sciences harvard university rpa prabhat david schlegel lawrence berkeley national laboratory prabhat djschlegel abstract we propose method for combining two sources of astronomical data spectroscopy and photometry that carry information about sources of light stars galaxies and quasars at extremely different spectral resolutions our model treats the spectral energy distribution sed of the radiation from source as latent variable that jointly explains both photometric and spectroscopic observations we place flexible nonparametric prior over the sed of light source that admits physically interpretable decomposition and allows us to tractably perform inference we use our model to predict the distribution of the redshift of quasar from low spectral resolution photometric data the so called photoz problem our method shows that tools from machine learning and bayesian statistics allow us to leverage multiple resolutions of information to make accurate predictions with uncertainties introduction enormous amounts of astronomical data are collected by range of instruments at multiple spectral resolutions providing information about billions of sources of light in the observable universe among these data are measurements of the spectral energy distributions seds of sources of light stars galaxies and quasars the sed describes the distribution of energy radiated by source over the spectrum of wavelengths or photon energy levels seds are of interesting because they convey information about source physical properties including type chemical composition and redshift which will be an estimand of interest in this work the sed can be thought of as latent function of which we can only obtain noisy measurements measurements of seds however are produced by instruments at widely varying spectral resolutions some instruments measure many wavelengths simultaneously spectroscopy while others http http psfflux flux nanomaggies band figure left example of quasar sed with sdss band filters sb overlaid right the same quasar photometrically measured band fluxes spectroscopic measurements include noisy samples at thousands of wavelengths whereas sdss photometric fluxes reflect the weighted response over large range of wavelengths average over large swaths of the energy spectrum and report low dimensional summary photometry spectroscopic data describe source sed in finer detail than broadband photometric data for example the baryonic oscillation spectroscopic survey measures sed samples at over four thousand wavelengths between and in contrast the sloan digital sky survey sdss collects spectral information in only broad spectral bins by using broadband filters called and but at much higher spatial resolution photometric preprocessing models can then aggregate pixel information into five fluxes and their uncertainties reflecting the weighted average response over large range of the wavelength spectrum the two methods of spectral information collection are graphically compared in figure despite carrying less spectral information broadband photometry is more widely available and exists for larger number of sources than spectroscopic measurements this work 
develops method for inferring physical properties sources by jointly modeling spectroscopic and photometric data one use of our model is to measure the redshift of quasars for which we only have photometric observations redshift is phenomenon in which the observed sed of source of light is stretched toward longer redder wavelengths this effect is due to combination of radial velocity with respect to the observer and the expansion of the universe termed cosmological redshift quasars or radio sources are extremely distant and energetic sources of electromagnetic radiation that can exhibit high redshift accurate estimates and uncertainties of redshift measurements from photometry have the potential to guide the use of higher spectral resolution instruments to study sources of interest furthermore accurate photometric models can aid the automation of identifying source types and estimating physical characteristics of faintly observed sources in large photometric surveys to jointly describe both resolutions of data we directly model quasar latent sed and the process by which it generates spectroscopic and photometric observations representing quasar sed as latent random measure we describe bayesian inference procedure to compute the marginal probability distribution of quasar redshift given observed photometric fluxes and their uncertainties the following section provides relevant application and statistical background section describes our probabilistic model of seds and broadband photometric measurements section outlines our inference method for efficiently computing statistics of the posterior distribution section presents redshift and sed predictions from photometric measurements among other model summaries and quantitative comparison between our method and two existing we conclude with discussion of directions for future work background the seds of most stars are roughly approximated by planck law for black body radiators and stellar atmosphere models quasars on the other hand have complicated seds characterized by some salient features such as the forest which is the absorption of light at many wavelengths from neutral hydrogen gas between the earth and the quasar one of the most interesting properties of quasars and galaxies conveyed by the sed is redshift which gives us insight into an object distance and age redshift affects our observation of seds by stretching the wavelengths of the quasar rest frame sed skewing toward longer redder wavelengths denoting the rest rest frame sed of quasar as function fn the effect of redshift with value zn figure spectroscopic measurements of multiple quasars at different redshifts the upper graph depicts the sample spectrograph in the observation frame intuitively thought of as stretched by factor the lower figure depicts the rest frame version of the same quasar spectra the two lines show the corresponding locations of the characteristic peak in each reference frame note that the has been changed to ease the visualization the transformation is much more dramatic the appearance of translation is due to missing data we don observe sed samples outside the range typically between and on the sed is described by the relationship fn obs fn rest zn some observed quasar spectra and their rest frame spectra are depicted in figure model this section describes our probabilistic model of spectroscopic and photometric observations spectroscopic flux model the sed of quasar is function where denotes the range of wavelengths and are real numbers representing flux 
density our model specifies quasar rest frame sed as latent random function quasar seds are highly structured and we model this structure by imposing the assumption that each sed is convex mixture of latent positive basis functions the model assumes there are small number of latent features or characteristics and that each quasar can be described by short vector of mixing weights over these features we place normalized process prior on each of these basis functions described in supplementary material the generative procedure for quasar spectra begins with shared basis iid βk gp kθ bk exp βk exp βk dλ where kθ is the kernel and bk is the exponentiated and normalized version of βk for each quasar wn wk mn mn zn wk where wn mixes over the latent types mn is the apparent brightness zn is the quasar redshift and distributions and are priors to be specified later as each positive sed basis function bk is normalized to integrate to one and each quasar weight vector wn also sums to one the latent normalized sed is then constructed as fn rest wn bk rest rest and we define the unnormalized sed mn fn this parameterization admits the rest interpretation of fn as probability density scaled by mn this interpretation allows us to figure graphical model representation of the joint photometry and spectroscopy model the left shaded variables represent spectroscopically measured samples and their variances the right shaded variables represent photometrically measured fluxes and their variances the upper box represents the latent basis with gp prior parameters and note that nspec nphoto replicates of wn mn and zn are instantiated bk xn wn yn mn τn σn zn nspec nphoto separate out the apparent brightness which is function of distance and overall luminosity from the sed itself which carries information pertinent to the estimand of interest redshift for each quasar with spectroscopic data we observe noisy samples of the redshifted and scaled spectral energy distribution at grid of wavelengths λp for quasar our observation frame samples are conditionally distributed as ind xn wn bk rest σn zn where σn is known measurement variance from the instruments used to make the observations the boss spectra and our rest frame basis are stored in units erg photometric flux model photometric data summarize the amount of energy observed over large swath of the wavelength spectrum roughly photometric flux measures proportionally the number of photons recorded by the instrument over the duration of an exposure filtered by bandspecific sensitivity curve we express flux in nanomaggies photometric fluxes and measurement error derived from broadband imagery have been computed directly from pixels for each quasar sdss photometric data are measured in five bands yielding vector of five flux values and their variances yn and τn each band measures photon observations at each wavelength in proportion to known filter sensitivity sb the filter sensitivities for the sdss ugriz bands are depicted in figure with an example observation frame quasar sed overlaid the obs actual measured fluxes can be computed by integrating the full object spectrum mn fn against the filters for band µb fn rest zn fn obs sb dλ where is conversion factor to go from the units of fn to nanomaggies details of this conversion are available in the supplementary material the function µb takes in rest frame sed redshift and maps it to the observed specific flux the results of this projection onto sdss bands are modeled as independent gaussian random variables with known 
variance ind yn fn rest zn µb fn rest zn τn rest conditioned on the basis bk we can represent fn with vector note rest that fn is function of wn zn mn and see equation so we can think of µb as function of wn zn mn and we overload notation and the conditional likelihood of photometric observations as yn wn zn mn µb wn zn mn τn intuitively what gives us statistical traction in inferring the posterior distribution over zn is the structure learned in the latent basis and weights the features that correspond to distinguishing bumps and dips in the sed note on priors for photometric weight and redshift inference we use flat prior on zn and empirically derived priors for mn and wn from the sample of spectroscopically measured sources choice of priors is described in the supplementary material inference basis estimation for computational tractability we first compute maximum posteriori map estimate of the basis bmap to condition on using the spectroscopic data xn σn zn we compute discretized map estimate of bk by directly optimizing the unnormalized log posterior implied by the likelihood in equation the gp prior over and diffuse priors over wn and mn wn mn bk xn σn zn xn wn mn bk bk wn mn we use gradient descent with momentum and lbfgs directly on the parameters βk ωn and log mn for the nspec spectroscopically measured quasars gradients were automatically computed using autograd following we first resample the observed spectra into common rest frame grid easing computation of the likelihood we note that although our model places full distribution over bk efficiently integrating out those parameters is left for future work sampling wn mn and zn the bayesian task requires that we compute posterior marginal distributions of integrating out and to compute these distributions we construct markov chain over the state space including and that leaves the target posterior distribution invariant we treat the inference problem for each photometrically measured quasar yn independently conditioned on basis bk our goal is to draw posterior samples of wn mn and zn for each the unnormalized posterior can be expressed wn mn zn yn mn zn wn mn zn where the left likelihood term is defined in equation note that due to analytic intractability we obs numerically integrate expressions involving fn dλ and sb because the observation yn can often be well explained by various redshifts and weight settings the resulting marginal posterior zn yn is often with regions of near zero probability between modes intuitively this is due to the information loss in the flux integration step this property is problematic for many standard mcmc techniques single chain mcmc methods have to jump between modes or travel through region of probability resulting in slow mixing to combat this effect we use parallel tempering method that is to constructing markov chains on distributions parallel tempering instantiates independent chains each sampling from the target distribution raised to an inverse temperature given target distribution the constructed chains sample πc where tc controls how hot how close to uniform each chain is at each iteration swaps between chains are proposed and accepted with standard acceptance probability pr accept swap πc xc πc xc within each chain we use slice sampling to generate samples that leave each chain distribution invariant is relatively mcmc method convenient property when sampling from thousands of independent posteriors we found parallel tempering to be essential for convincing posterior simulations mcmc 
diagnostics and comparisons to samplers are available in the supplemental material experiments and results we conduct three experiments to test our model where each experiment measures redshift predictive accuracy for different split of spectroscopically measured quasars from the dataset with confirmed redshifts in the range our experiments split in the following ways randomly ii by fluxes iii by redshift values in split ii we train on the brightest of quasars and test on subset of the remaining split iii takes the lowest of quasars as training data and subset of the brightest as test cases splits ii figure top map estimate of the latent bases bk note the different ranges of the wavelength each basis function distributes its mass across different regions of the spectrum to explain different salient features of quasar spectra in the rest frame bottom model reconstruction of sed and iii are intended to test the method robustness to different training and testing distributions mimicking the discovery of fainter and farther sources for each split we find map estimate of the basis bk and weights wn to use as prior for photometric inference for computational purposes we limit our training sample to random subsample of quasars the following sections outline the resulting model fit and inferred seds and redshifts basis validation we examined multiple choices of using out of sample likelihood on validation set in the following experiments we set which balances generalizability and computational tradeoffs discussion of this validation is provided in the supplementary material sed basis we depict map estimate of bk in figure our basis decomposition enjoys the benefit of physical interpretability due to our formulation of the problem basis places mass on the peak around allowing the model to capture the cooccurrence of more peaked seds with bump around basis captures the emission line at around because of the flexible nonparametric priors on bk our model is able to automatically learn these features from data the positivity of the basis and weights distinguishes our model from methods which sacrifice physical interpretability photometric measurements for each test quasar we construct an parallel tempering sampler and run for iterations and discard the first samples as given posterior samples of zn we take the posterior mean as point estimate figure compares the posterior mean to spectroscopic measurements for three different experiments where the gray lines denote posterior sample quantiles in general there is strong correspondence between spectroscopically measured redshift and our posterior estimate in cases where the posterior mean is off our distribution often covers the spectroscopically confirmed value with probability mass this is clear upon inspection of posterior marginal distributions that exhibit extreme behavior to combat this it is necessary to inject the model with more information to eliminate plausible hypotheses this information could come from another measurement new photometric band or from structured prior knowledge over the relationship between zn wn and mn our method simply fits mixture of gaussians to the spectroscopically measured wn mn sample to formulate prior distribution however incorporating dependencies between zn wn and mn similar to the xdqsoz technique will be incorporated in future work comparisons we compare the performance of our redshift estimator with two recent photometric redshift estimators xdqsoz and neural network the method in is conditional density 
estimator that discretizes the range of one flux band the and fits mixture of gaussians to the joint distribution over the remaining fluxes and redshifts one disadvantage to this approach is there there figure comparison of spectroscopically and photometrically measured redshifts from the sed model for three different data splits the left reflects random selection of quasars from the dataset the right graph reflects selection of test quasars from the upper zcutof where all training was done on lower redshifts the red estimates are posterior means figure left inferred seds from photometric data the black line is smoothed approximation to the true sed using information from the full spectral data the red line is sample from the pos obs terior fn yn which imputes the entire sed from only five flux measurements note that the bottom sample is from the left mode which redshift right corresponding posterior predictive distributions zn yn the black line marks the spectroscopically confirmed redshift the red line marks the posterior mean note the difference in scale of the is no physical significance to the mixture of gaussians and no model of the latent sed furthermore the original method trains and tests the model on range of which is problematic when predicting redshifts on much brighter or dimmer stars the regression approach from employs neural network with two hidden layers and the sdss fluxes as inputs more features more photometric bands can be incorporated into all models but we limit our experiments to the five sdss bands for the sake of comparison further detail on these two methods and broader review of approaches are available in the supplementary material average error and test distribution we compute mean absolute error mae mean absolute percentage error mape and root mean square error rmse to measure predictive performance table compares prediction errors for the three different approaches xd nn spec our experiments show that accurate redshift measurements are attainable even when the distribution of training set is different from test set by directly modeling the sed itself our method dramatically outperforms and in split iii particularly for very high redshift fluxes we also note that our training set is derived from only examples whereas the training set for xdqsoz and the neural network were quasars and quasars respectively this shortcoming can be overcome with more sophisticated inference techniques for the basis despite this the split random all flux all redshift all random flux redshift random flux redshift xd mae nn spec xd mape nn spec xd rmse nn spec table prediction error for three splits random ii iii corresponding to xdqsoz xd the neural network approach nn our model spec the middle and lowest sections correspond to test redshifts in the upper and respectively the xdqsoz and nn models were trained on roughly and example quasars respectively while the spec models were trained on predictions are comparable additionally because we are directly modeling the latent sed our method admits posterior estimate of the entire sed figure displays posterior sed samples and their corresponding redshift marginals for quasars inferred from only sdss photometric measurements discussion we have presented generative model of two sources of information at very different spectral resolutions to form an estimate of the latent spectral energy distribution of quasars we also described an efficient inference algorithm for computing posterior statistics given photometric observations our model 
accurately predicts and characterizes uncertainty about redshifts from only photometric observations and small number of separate spectroscopic examples moreover we showed that we can make reasonable estimates of the unobserved sed itself from which we can make inferences about other physical properties informed by the full sed we see multiple avenues of future work firstly we can extend the model of seds to incorporate more expert knowledge one such augmentation would include fixed collection of features curated by an expert corresponding to physical properties already known about class of sources furthermore we can also extend our model to directly incorporate photometric pixel observations as opposed to preprocessed flux measurements secondly we note that our method is more more computationally burdensome than xdqsoz and the neural network approach another avenue of future work is to find accurate approximations of these posterior distributions that are cheaper to compute lastly we can extend our methodology to galaxies whose seds can be quite complicated galaxy observations have spatial extent complicating their seds the combination of sed and spatial appearance modeling and computationally efficient inference procedures is promising route toward the automatic characterization of millions of sources from the enormous amounts of data available in massive photometric surveys acknowledgments the authors would like to thank matthew hoffman and members of the hips lab for helpful discussions this work is supported by the applied mathematics program within the office of science advanced scientific computing research of the department of energy under contract no this work used resources of the national energy research scientific computing center nersc we would like to thank tina butler tina declerck and yushu yao for their assistance references shadab alam franco albareti carlos allende prieto anders scott anderson brett andrews eric armengaud éric aubourg stephen bailey julian bautista et al the eleventh and twelfth data releases of the sloan digital sky survey final data from arxiv preprint jo bovy adam myers joseph hennawi david hogg richard mcmahon david schiminovich erin sheldon jon brinkmann donald schneider and benjamin weaver photometric redshifts and quasar probabilities from single generative model the astrophysical journal brescia cavuoti abrusco longo and mercurio photometric redshifts for quasars in surveys the astrophysical journal steve brooks andrew gelman galin jones and meng handbook of markov chain monte carlo crc press kyle dawson david schlegel christopher ahn scott anderson éric aubourg stephen bailey robert barkhouser julian bautista alessandra beifiori andreas berlind et al the baryon oscillation spectroscopic survey of the astronomical journal ro gray pw graham and sr hoyt the physical basis of luminosity classification in the late and early stars ii basic parameters of program stars and the role of microturbulence the astronomical journal edward harrison the and laws the astrophysical journal david hogg distance measures in cosmology arxiv preprint dougal maclaurin david duvenaud and ryan adams autograd differentiation of native python icml workshop on automatic machine learning christopher martin james fanson david schiminovich patrick morrissey peter friedman tom barlow tim conrow robert grange patrick jelinksy bruno millard et al the galaxy evolution explorer space ultraviolet survey mission the astrophysical journal letters radford neal slice sampling annals of 
statistics pages jorge nocedal updating matrices with limited storage mathematics of computation isabelle pâris patrick petitjean éric aubourg nicholas ross adam myers alina streblyanska stephen bailey patrick hall michael strauss scott anderson et al the sloan digital sky survey quasar catalog tenth data release astronomy astrophysics jeffrey regier andrew miller jon mcauliffe ryan adams matt hoffman dustin lang david schlegel and prabhat celeste variational inference for generative model of astronomical images in proceedings of the international conference on machine learning sdssiii measures of flux and magnitude https joseph silk and martin rees quasars and galaxy formation astronomy and astrophysics chris stoughton robert lupton mariangela bernardi michael blanton scott burles francisco castander aj connolly daniel eisenstein joshua frieman gs hennessy et al sloan digital sky survey early data release the astronomical journal jakob walcher brent groves tamás budavári and daniel dale fitting the integrated spectral energy distributions of galaxies astrophysics and space science david weinberg romeel dav neal katz and juna kollmeier the forest as cosmological tool proceedings of the annual astrophysica conference in maryland 
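As a side illustration of the positive basis decomposition that the SED model above relies on, the sketch below factors a few synthetic spectra into non-negative bases with scikit-learn's NMF. This is only an analogy for intuition: the paper learns its basis by MAP inference under nonparametric priors rather than by NMF, and every name and value here is made up.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)

# Synthetic "spectra": positive mixtures of two bumpy templates plus noise.
wavelengths = np.linspace(0.0, 1.0, 200)
template_1 = np.exp(-((wavelengths - 0.2) ** 2) / 0.002)   # sharp peak
template_2 = np.exp(-((wavelengths - 0.6) ** 2) / 0.02)    # broad bump
weights = rng.uniform(0.1, 1.0, size=(50, 2))
spectra = weights @ np.vstack([template_1, template_2])
spectra += rng.uniform(0.0, 0.01, size=spectra.shape)      # keep values >= 0

# Non-negative factorization: spectra ~ W @ B with W, B >= 0 and two bases.
model = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
W = model.fit_transform(spectra)     # per-spectrum weights (analogous to w_n)
B = model.components_                # positive basis functions (analogous to B_k)

# Each recovered basis concentrates its mass on one of the two features,
# which is the interpretability property the positivity constraint buys.
print(B.argmax(axis=1))              # peak locations of the learned bases
```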
hidden technical debt in machine learning systems sculley gary holt daniel golovin eugene davydov todd phillips dsculley gholt dgg edavydov toddphillips google dietmar ebner vinay chaudhary michael young crespo dan dennison ebner vchaudhary mwyoung jfcrespo dennison google abstract machine learning offers fantastically powerful toolkit for building useful complex prediction systems quickly this paper argues it is dangerous to think of these quick wins as coming for free using the software engineering framework of technical debt we find it is common to incur massive ongoing maintenance costs in ml systems we explore several risk factors to account for in system design these include boundary erosion entanglement hidden feedback loops undeclared consumers data dependencies configuration issues changes in the external world and variety of introduction as the machine learning ml community continues to accumulate years of experience with live systems and uncomfortable trend has emerged developing and deploying ml systems is relatively fast and cheap but maintaining them over time is difficult and expensive this dichotomy can be understood through the lens of technical debt metaphor introduced by ward cunningham in to help reason about the long term costs incurred by moving quickly in software engineering as with fiscal debt there are often sound strategic reasons to take on technical debt not all debt is bad but all debt needs to be serviced technical debt may be paid down by refactoring code improving unit tests deleting dead code reducing dependencies tightening apis and improving documentation the goal is not to add new functionality but to enable future improvements reduce errors and improve maintainability deferring such payments results in compounding costs hidden debt is dangerous because it compounds silently in this paper we argue that ml systems have special capacity for incurring technical debt because they have all of the maintenance problems of traditional code plus an additional set of issues this debt may be difficult to detect because it exists at the system level rather than the code level traditional abstractions and boundaries may be subtly corrupted or invalidated by the fact that data influences ml system behavior typical methods for paying down code level technical debt are not sufficient to address technical debt at the system level this paper does not offer novel ml algorithms but instead seeks to increase the community awareness of the difficult tradeoffs that must be considered in practice over the long term we focus on interactions and interfaces as an area where ml technical debt may rapidly accumulate at an ml model may silently erode abstraction boundaries the tempting or chaining of input signals may unintentionally couple otherwise disjoint systems ml packages may be treated as black boxes resulting in large masses of glue code or calibration layers that can lock in assumptions changes in the external world may influence system behavior in unintended ways even monitoring ml system behavior may prove difficult without careful design complex models erode boundaries traditional software engineering practice has shown that strong abstraction boundaries using encapsulation and modular design help create maintainable code in which it is easy to make isolated changes and improvements strict abstraction boundaries help express the invariants and logical consistency of the information inputs and outputs from an given component unfortunately it is difficult to enforce 
strict abstraction boundaries for machine learning systems by prescribing specific intended behavior indeed ml is required in exactly those cases when the desired behavior can not be effectively expressed in software logic without dependency on external data the real world does not fit into tidy encapsulation here we examine several ways that the resulting erosion of boundaries may significantly increase technical debt in ml systems entanglement machine learning systems mix signals together entangling them and making isolation of improvements impossible for instance consider system that uses features xn in model if we change the input distribution of values in the importance weights or use of the remaining features may all change this is true whether the model is retrained fully in batch style or allowed to adapt in an online fashion adding new feature can cause similar changes as can removing any feature xj no inputs are ever really independent we refer to this here as the cace principle changing anything changes everything cace applies not only to input signals but also to learning settings sampling methods convergence thresholds data selection and essentially every other possible tweak one possible mitigation strategy is to isolate models and serve ensembles this approach is useful in situations in which decompose naturally such as in disjoint settings like however in many cases ensembles work well because the errors in the component models are uncorrelated relying on the combination creates strong entanglement improving an individual component model may actually make the system accuracy worse if the remaining errors are more strongly correlated with the other components second possible strategy is to focus on detecting changes in prediction behavior as they occur one such method was proposed in in which visualization tool was used to allow researchers to quickly see effects across many dimensions and slicings metrics that operate on basis may also be extremely useful correction cascades there are often situations in which model ma for problem exists but solution for slightly different problem is required in this case it can be tempting to learn model that takes ma as input and learns small correction as fast way to solve the problem however this correction model has created new system dependency on ma making it significantly more expensive to analyze improvements to that model in the future the cost increases when correction models are cascaded with model for problem learned on top of and so on for several slightly different test distributions once in place correction cascade can create an improvement deadlock as improving the accuracy of any individual component actually leads to detriments mitigation strategies are to augment ma to learn the corrections directly within the same model by adding features to distinguish among the cases or to accept the cost of creating separate model for undeclared consumers oftentimes prediction from machine learning model ma is made widely accessible either at runtime or by writing to files or logs that may later be consumed by other systems without access controls some of these consumers may be undeclared silently using the output of given model as an input to another system in more classical software engineering these issues are referred to as visibility debt undeclared consumers are expensive at best and dangerous at worst because they create hidden tight coupling of model ma to other parts of the stack changes to ma will very likely impact these 
other parts, potentially in ways that are unintended, poorly understood, and detrimental. In practice, this tight coupling can radically increase the cost and difficulty of making any changes to m_a at all, even if they are improvements. Furthermore, undeclared consumers may create hidden feedback loops, which are described in more detail below. Undeclared consumers may be difficult to detect unless the system is specifically designed to guard against this case, for example with access restrictions or strict service-level agreements (SLAs). In the absence of barriers, engineers will naturally use the most convenient signal at hand, especially when working against deadline pressures.

Data dependencies cost more than code dependencies. Dependency debt is noted as a key contributor to code complexity and technical debt in classical software engineering settings. We have found that data dependencies in ML systems carry a similar capacity for building debt, but may be more difficult to detect. Code dependencies can be identified via static analysis by compilers and linkers; without similar tooling for data dependencies, it can be inappropriately easy to build large data-dependency chains that are difficult to untangle.

Unstable data dependencies. To move quickly, it is often convenient to consume signals as input features that are produced by other systems. However, some input signals are unstable, meaning that they qualitatively or quantitatively change behavior over time. This can happen implicitly, when the input signal comes from another machine learning model that itself updates over time, or from a data-dependent lookup table, such as one used for computing scores or semantic mappings. It can also happen explicitly, when the engineering ownership of the input signal is separate from the engineering ownership of the model that consumes it; in such cases, updates to the input signal may be made at any time. This is dangerous because even improvements to input signals may have arbitrary, detrimental effects in the consuming system that are costly to diagnose and address. For example, consider the case in which an input signal was previously miscalibrated: the model consuming it likely fit to these miscalibrations, and a silent update that corrects the signal will have sudden ramifications for the model.

One common mitigation strategy for unstable data dependencies is to create a versioned copy of a given signal. For example, rather than allowing a semantic mapping of words to topic clusters to change over time, it might be reasonable to create a frozen version of this mapping and use it until an updated version has been fully vetted. Versioning carries its own costs, however, such as potential staleness and the cost of maintaining multiple versions of the same signal over time. A minimal sketch of this pattern appears below.
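Here is a minimal sketch of that versioned-copy mitigation, assuming a word-to-topic-cluster mapping as the upstream signal; the class and method names are illustrative rather than an existing API.

```python
import json
import hashlib

class VersionedSignal:
    """Freeze versioned snapshots of an upstream signal and keep serving a
    vetted version until a newer one is explicitly promoted."""

    def __init__(self):
        self._versions = {}      # version id -> frozen mapping
        self._active = None

    def freeze(self, mapping):
        """Store an immutable snapshot and return a content-derived version id."""
        blob = json.dumps(mapping, sort_keys=True).encode()
        version = hashlib.sha1(blob).hexdigest()[:8]
        self._versions[version] = dict(mapping)
        return version

    def promote(self, version):
        """Switch consumers to a new version only after it has been vetted."""
        self._active = version

    def lookup(self, word):
        return self._versions[self._active].get(word)

signals = VersionedSignal()
v1 = signals.freeze({"jaguar": "animals", "python": "animals"})
signals.promote(v1)
# Upstream later "improves" the mapping; consumers keep v1 until promotion.
v2 = signals.freeze({"jaguar": "cars", "python": "programming"})
print(signals.lookup("python"))   # still the frozen v1 behaviour
```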
Underutilized data dependencies. In code, underutilized dependencies are packages that are mostly unneeded. Similarly, underutilized data dependencies are input signals that provide little incremental modeling benefit. These can make an ML system unnecessarily vulnerable to change, sometimes catastrophically so, even though they could be removed with no detriment. As an example, suppose that to ease the transition from an old product numbering scheme to new product numbers, both schemes are left in the system as features. New products get only a new number, but old products may have both, and the model continues to rely on the old numbers for some products. A year later, the code that populates the database with the old numbers is deleted. This will not be a good day for the maintainers of the ML system.

Underutilized data dependencies can creep into a model in several ways. Legacy features: the most common case is that a feature is included in a model early in its development; over time it is made redundant by new features, but this goes undetected. Bundled features: sometimes a group of features is evaluated and found to be beneficial; because of deadline pressures or similar effects, all the features in the bundle are added to the model together, possibly including features that add little or no value. Epsilon features: as machine learning researchers, it is tempting to improve model accuracy even when the accuracy gain is very small or when the complexity overhead might be high. Correlated features: often two features are strongly correlated, but one is more directly causal. Many ML methods have difficulty detecting this and credit the two features equally, or may even pick the non-causal one; this results in brittleness if world behavior later changes the correlations. Underutilized dependencies can be detected via exhaustive leave-one-feature-out evaluations; these should be run regularly to identify and remove unnecessary features.

[Figure: Only a small fraction of a real-world ML system is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.]

Static analysis of data dependencies. In traditional code, compilers and build systems perform static analysis of dependency graphs. Tools for static analysis of data dependencies are far less common, but are essential for error checking, tracking down consumers, and enforcing migration and updates. One such tool is an automated feature management system, which enables data sources and features to be annotated. Automated checks can then be run to ensure that all dependencies have the appropriate annotations, and dependency trees can be fully resolved. This kind of tooling can make migration and deletion much safer in practice.

Feedback loops. One of the key features of live ML systems is that they often end up influencing their own behavior if they update over time. This leads to a form of analysis debt, in which it is difficult to predict the behavior of a given model before it is released. These feedback loops can take different forms, but they are all more difficult to detect and address if they occur gradually over time, as may be the case when models are updated infrequently.

Direct feedback loops. A model may directly influence the selection of its own future training data. It is common practice to use standard supervised algorithms, although the theoretically correct solution would be to use bandit algorithms. The problem here is that bandit algorithms (such as contextual bandits) do not necessarily scale well to the size of action spaces typically required for real-world problems. It is possible to mitigate these effects by using some amount of randomization, or by isolating certain parts of the data from being influenced by a given model.

Hidden feedback loops. Direct feedback loops are costly to analyze, but at least they pose a statistical challenge that ML researchers may find natural to investigate. A more difficult case is hidden feedback loops, in which two systems influence each other indirectly through the world. One example of this may be two systems that independently determine facets of a web page, such as one selecting products to show and another selecting related reviews. Improving one system may lead to changes in behavior in the other, as users begin clicking more or less on the other components in reaction to the changes. Note that these hidden loops may exist between completely disjoint systems. Consider the case of two prediction models
from two different investment companies improvements or more scarily bugs in one may influence the bidding and buying behavior of the other it may be surprising to the academic community to know that only tiny fraction of the code in many ml systems is actually devoted to learning or prediction see figure in the language of lin and ryaboy much of the remainder may be described as plumbing it is unfortunately common for systems that incorporate machine learning methods to end up with design patterns in this section we examine several that can surface in machine learning systems and which should be avoided or refactored where possible glue code ml researchers tend to develop general purpose solutions as packages wide variety of these are available as packages at places like or from code proprietary packages and platforms using generic packages often results in glue code system design pattern in which massive amount of supporting code is written to get data into and out of packages glue code is costly in the long term because it tends to freeze system to the peculiarities of specific package testing alternatives may become prohibitively expensive in this way using generic package can inhibit improvements because it makes it harder to take advantage of properties or to tweak the objective function to achieve goal because mature system might end up being at most machine learning code and at least glue code it may be less costly to create clean native solution rather than generic package an important strategy for combating is to wrap packages into common api this allows supporting infrastructure to be more reusable and reduces the cost of changing packages pipeline jungles as special case of glue code pipeline jungles often appear in data preparation these can evolve organically as new signals are identified and new information sources added incrementally without care the resulting system for preparing data in an format may become jungle of scrapes joins and sampling steps often with intermediate files output managing these pipelines detecting errors and recovering from failures are all difficult and costly testing such pipelines often requires expensive integration tests all of this adds to technical debt of system and makes further innovation more costly pipeline jungles can only be avoided by thinking holistically about data collection and feature extraction the approach of scrapping pipeline jungle and redesigning from the ground up is indeed major investment of engineering effort but one that can dramatically reduce ongoing costs and speed further innovation glue code and pipeline jungles are symptomatic of integration issues that may have root cause in overly separated research and engineering roles when ml packages are developed in an ivorytower setting the result may appear like black boxes to the teams that employ them in practice hybrid research approach where engineers and researchers are embedded together on the same teams and indeed are often the same people can help reduce this source of friction significantly dead experimental codepaths common consequence of glue code or pipeline jungles is that it becomes increasingly attractive in the short term to perform experiments with alternative methods by implementing experimental codepaths as conditional branches within the main production code for any individual change the cost of experimenting in this manner is relatively of the surrounding infrastructure needs to be reworked however over time these accumulated codepaths can create 
growing debt due to the increasing difficulties of maintaining backward compatibility and an exponential increase in cyclomatic complexity testing all possible interactions between codepaths becomes difficult or impossible famous example of the dangers here was knight capital system losing million in minutes apparently because of unexpected behavior from obsolete experimental codepaths as with the case of dead flags in traditional software it is often beneficial to periodically reexamine each experimental branch to see what can be ripped out often only small subset of the possible branches is actually used many others may have been tested once and abandoned abstraction debt the above issues highlight the fact that there is distinct lack of strong abstractions to support ml systems zheng recently made compelling comparison of the state ml abstractions to the state of database technology making the point that nothing in the machine learning literature comes close to the success of the relational database as basic abstraction what is the right interface to describe stream of data or model or prediction for distributed learning in particular there remains lack of widely accepted abstractions it could be argued that the widespread use of in machine learning was driven by the void of strong distributed learning abstractions indeed one of the few areas of broad agreement in recent years appears to be that is poor abstraction for iterative ml algorithms the abstraction seems much more robust but there are multiple competing specifications of this basic idea the lack of standard abstractions makes it all too easy to blur the lines between components common smells in software engineering design smell may indicate an underlying problem in component or system we identify few ml system smells not rules but as subjective indicators type smell the rich information used and produced by ml systems is all to often encoded with plain data types like raw floats and integers in robust system model parameter should know if it is multiplier or decision threshold and prediction should know various pieces of information about the model that produced it and how it should be consumed smell it is often tempting to write particular piece of system in given language especially when that language has convenient library or syntax for the task at hand however using multiple languages often increases the cost of effective testing and can increase the difficulty of transferring ownership to other individuals prototype smell it is convenient to test new ideas in small scale via prototypes however regularly relying on prototyping environment may be an indicator that the system is brittle difficult to change or could benefit from improved abstractions and interfaces maintaining prototyping environment carries its own cost and there is significant danger that time pressures may encourage prototyping system to be used as production solution additionally results found at small scale rarely reflect the reality at full scale configuration debt another potentially surprising area where debt can accumulate is in the configuration of machine learning systems any large system has wide range of configurable options including which features are used how data is selected wide variety of learning settings potential or verification methods etc we have observed that both researchers and engineers may treat configuration and extension of configuration as an afterthought indeed verification or testing of configurations may not even be seen as 
important. In a mature system that is being actively developed, the number of lines of configuration can far exceed the number of lines of traditional code, and each configuration line has the potential for mistakes. Consider the following examples. One feature was incorrectly logged for a period of time. Another feature is not available on data before a certain date. The code used to compute a given feature has to change for data before and after a cutoff because of changes to the logging format. Some features are not available in production, so substitute features must be used when querying the model in a live setting. If a particular feature is used, then jobs for training must be given extra memory due to lookup tables, or they will train inefficiently. One feature precludes the use of another because of latency constraints.

All this messiness makes configuration hard to modify correctly and hard to reason about, and mistakes in configuration can be costly, leading to serious loss of time, waste of computing resources, or production issues. This leads us to articulate the following principles of good configuration systems: it should be easy to specify a configuration as a small change from a previous configuration; it should be hard to make manual errors, omissions, or oversights; it should be easy to see, visually, the difference in configuration between two models; it should be easy to automatically assert and verify basic facts about the configuration (number of features used, transitive closure of data dependencies, etc.); it should be possible to detect unused or redundant settings; and configurations should undergo a full code review and be checked into a repository.

Dealing with changes in the external world. One of the things that makes ML systems so fascinating is that they often interact directly with the external world. Experience has shown that the external world is rarely stable, and this background rate of change creates an ongoing maintenance cost.

Fixed thresholds in dynamic systems. It is often necessary to pick a decision threshold for a given model to perform some action: to predict true or false, to mark an email as spam or not spam, to show or not show a given ad. One classic approach in machine learning is to choose a threshold from a set of possible thresholds in order to get good tradeoffs on certain metrics, such as precision and recall. However, such thresholds are often manually set; thus, if a model updates on new data, the old manually set threshold may be invalid. Manually updating many thresholds across many models is time-consuming and brittle. One mitigation strategy for this kind of problem is to learn thresholds via simple evaluation on held-out validation data, as sketched below.
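A minimal sketch of that held-out threshold selection, assuming validation scores and binary labels are at hand; the precision target and the function name are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def learn_threshold(val_scores, val_labels, min_precision=0.95):
    """Pick a decision threshold from held-out validation data: the lowest
    score cutoff whose running validation precision still meets a target
    (the 0.95 target is an illustrative choice, not a recommendation)."""
    scores = np.asarray(val_scores, dtype=float)
    labels = np.asarray(val_labels, dtype=float)
    order = np.argsort(-scores)                  # sort by score, descending
    scores, labels = scores[order], labels[order]
    tp = np.cumsum(labels)                       # true positives at each cutoff
    precision = tp / np.arange(1, len(labels) + 1)
    ok = np.flatnonzero(precision >= min_precision)
    if ok.size == 0:
        return None                              # nothing meets the target: alert
    return scores[ok[-1]]                        # lowest qualifying threshold

# Re-run after every model update so the threshold tracks the new score scale,
# e.g. threshold = learn_threshold(model.predict_proba(X_val)[:, 1], y_val)
```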
Monitoring and testing. Unit testing of individual components and end-to-end tests of running systems are valuable, but in the face of a changing world such tests are not sufficient to provide evidence that a system is working as intended. Comprehensive live monitoring of system behavior in real time, combined with automated response, is critical for long-term system reliability. The key question is what to monitor. Testable invariants are not always obvious, given that many ML systems are intended to adapt over time. We offer the following starting points.

Prediction bias. In a system that is working as intended, it should usually be the case that the distribution of predicted labels is equal to the distribution of observed labels. This is by no means a comprehensive test, as it can be met by a null model that simply predicts average values of label occurrences without regard to the input features. However, it is a surprisingly useful diagnostic, and changes in metrics such as this are often indicative of an issue that requires attention. For example, this method can help to detect cases in which the world's behavior suddenly changes, making training distributions drawn from historical data no longer reflective of current reality. Slicing prediction bias by various dimensions can isolate issues quickly and can also be used for automated alerting.

Action limits. In systems that are used to take actions in the real world, such as bidding on items or marking messages as spam, it can be useful to set and enforce action limits as a sanity check. These limits should be broad enough not to trigger spuriously. If the system hits a limit for a given action, automated alerts should fire and trigger manual intervention or investigation.

Upstream producers. Data is often fed through to a learning system from various upstream producers. These processes should be thoroughly monitored, tested, and routinely meet a service-level objective that takes the downstream ML system's needs into account. Further, any upstream alerts must be propagated to the control plane of the ML system to ensure its accuracy. Similarly, any failure of the ML system to meet established service-level objectives should also be propagated to all consumers, and directly to their control planes if at all possible. Because external changes occur in real time, responses must occur in real time as well. Relying on human intervention in response to alert pages is one strategy, but can be brittle for time-sensitive issues. Creating systems that allow automated response without direct human intervention is often well worth the investment.

Other areas of debt. We now briefly highlight some additional areas where technical debt may accrue.

Data testing debt. If data replaces code in ML systems, and code should be tested, then it seems clear that some amount of testing of input data is critical to the system. Basic sanity checks are useful, as are more sophisticated tests that monitor changes in input distributions.

Reproducibility debt. As scientists, it is important that we can re-run experiments and get similar results, but designing systems to allow for strict reproducibility is a task made difficult by randomized algorithms, non-determinism inherent in parallel learning, reliance on initial conditions, and interactions with the external world.

Process management debt. Most of the use cases described in this paper have talked about the cost of maintaining a single model, but mature systems may have dozens or hundreds of models running simultaneously. This raises a wide range of important problems, including the problem of updating many configurations for many similar models safely and automatically, how to manage and assign resources among models with different business priorities, and how to visualize and detect blockages in the flow of data in a production pipeline. Developing tooling to aid recovery from production incidents is also critical. An important system-level smell to avoid is a common process with many manual steps.

Cultural debt. There is sometimes a hard line between ML research and engineering, but this can be counterproductive for long-term system health. It is important to create team cultures that reward deletion of features, reduction of complexity, improvements in reproducibility, stability, and monitoring to the same degree that improvements in accuracy are valued. In our experience, this is most likely to occur within heterogeneous teams with strengths in both ML research and engineering.

Conclusions: measuring debt and paying it off. Technical debt is a useful metaphor, but it unfortunately does not provide a strict metric that can be tracked over time. How are we to measure technical debt in a system, or to assess the
full cost of this debt simply noting that team is still able to move quickly is not in itself evidence of low debt or good practices since the full cost of debt becomes apparent only over time indeed moving quickly often introduces technical debt few useful questions to consider are how easily can an entirely new algorithmic approach be tested at full scale what is the transitive closure of all data dependencies how precisely can the impact of new change to the system be measured does improving one model or signal degrade others how quickly can new members of the team be brought up to speed we hope that this paper may serve to encourage additional development in the areas of maintainable ml including better abstractions testing methodologies and design patterns perhaps the most important insight to be gained is that technical debt is an issue that engineers and researchers both need to be aware of research solutions that provide tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice even the addition of one or two seemingly innocuous data dependencies can slow further progress paying down technical debt requires specific commitment which can often only be achieved by shift in team culture recognizing prioritizing and rewarding this effort is important for the long term health of successful ml teams acknowledgments this paper owes much to the important lessons learned day to day in culture that values both innovative ml research and strong engineering practice many colleagues have helped shape our thoughts here and the benefit of accumulated folk wisdom can not be overstated we would like to specifically recognize the following roberto bayardo luis cobo sharat chikkerur jeff dean philip henderson arnar mar hrafnkelsson ankur jain joe kovac jeremy kubica brendan mcmahan satyaki mahalanabis lan nie michael pohl abdul salem sajid siddiqi ricky shan alan skelly cory williams and andrew young short version of this paper was presented at the workshop in in montreal canada references ananthanarayanan basker das gupta jiang qiu reznichenko ryabkov singh and venkataraman photon and scalable joining of continuous data streams in sigmod proceedings of the international conference on management of data pages new york ny usa anonymous machine learning the credit card of technical debt software engineering for machine learning nips workshop bottou peters candela charles chickering portugaly ray simard and snelson counterfactual reasoning and learning systems the example of computational advertising journal of machine learning research nov brown mccormick mowbray and malveau antipatterns refactoring software architectures and projects in crisis chilimbi suzue apacible and kalyanaraman project adam building an efficient and scalable deep learning training system in usenix symposium on operating systems design and implementation osdi broomfield co usa october pages dalessandro chen raeder perlich han williams and provost scalable handsfree transfer learning for online advertising in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages acm fowler code smells http fowler refactoring improving the design of existing code pearson education india langford and zhang the algorithm for bandits with side information in advances in neural information processing systems pages li andersen park smola ahmed josifovski long shekita and su scaling distributed machine learning with the parameter server in usenix symposium on 
operating systems design and implementation osdi broomfield co usa october pages lin and ryaboy scaling big data mining infrastructure the twitter experience acm sigkdd explorations newsletter mcmahan holt sculley young ebner grady nie phillips davydov golovin chikkerur liu wattenberg hrafnkelsson boulos and kubica ad click prediction view from the trenches in the acm sigkdd international conference on knowledge discovery and data mining kdd chicago il usa august morgenthaler gridnev sauciuc and bhansali searching for build debt experiences managing technical debt at google in proceedings of the third international workshop on managing technical debt sculley otey pohl spitznagel hainsworth and zhou detecting adversarial advertisements in the wild in proceedings of the acm sigkdd international conference on knowledge discovery and data mining san diego ca usa august securities and commission sec charges knight capital with violations of market access rule spector norvig and petrov google hybrid approach to research communications of the acm issue zheng the challenges of building machine learning tools for the masses software engineering for machine learning nips workshop 
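As a companion to the monitoring starting points above, the following is a minimal sketch of the prediction-bias check, sliced along a single dimension; the argument names and the alert tolerance are assumptions for illustration, not part of the paper.

```python
import numpy as np

def prediction_bias_alerts(predicted_rates, observed_labels, slice_ids, tol=0.05):
    """Compare the mean predicted label rate with the observed rate per slice
    and return the slices whose gap exceeds the tolerance; in a live system
    these would feed an automated alert rather than a return value."""
    predicted = np.asarray(predicted_rates, dtype=float)
    observed = np.asarray(observed_labels, dtype=float)
    slice_ids = np.asarray(slice_ids)
    alerts = []
    for s in np.unique(slice_ids):
        mask = slice_ids == s
        gap = predicted[mask].mean() - observed[mask].mean()
        if abs(gap) > tol:
            alerts.append((s, float(gap)))
    return alerts

# e.g. prediction_bias_alerts(scores, clicks, country_codes)
```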
local causal discovery of direct causes and effects tian gao qiang ji department of ecse rensselaer polytechnic institute troy ny gaot jiq abstract we focus on the discovery and identification of direct causes and effects of target variable in causal network causal learning algorithms generally need to find the global causal structures in the form of complete partial directed acyclic graphs cpdag in order to identify direct causes and effects of target variable while these algorithms are effective it is often unnecessary and wasteful to find the global structures when we are only interested in the local structure of one target variable such as class labels we propose new local causal discovery algorithm called causal markov blanket cmb to identify the direct causes and effects of target variable based on markov blanket discovery cmb is designed to conduct causal discovery among multiple variables but focuses only on finding causal relationships between specific target variable and other variables under standard assumptions we show both theoretically and experimentally that the proposed local causal discovery algorithm can obtain the comparable identification accuracy as global methods but significantly improve their efficiency often by more than one order of magnitude introduction causal discovery is the process to identify the causal relationships among set of random variables it not only can aid predictions and classifications like feature selection but can also help predict consequences of some given actions facilitate inference and help explain the underlying mechanisms of the data lot of research efforts have been focused on predicting causality from observational data they can be roughly divided into two causal discovery between pair of variables and among multiple variables we focus on multivariate causal discovery which searches for correlations and dependencies among variables in causal networks causal networks can be used for local or global causal prediction and thus they can be learned locally and globally many causal discovery algorithms for causal networks have been proposed and the majority of them belong to global learning algorithms as they seek to learn global causal structures the sgs and algorithm test for the existence of edges between every pair of nodes in order to first find the skeleton or undirected edges of causal networks and then discover all the resulting in partially directed acyclic graph pdag the last step of these algorithms is then to orient the rest of edges as much as possible using meek rules while maintaining consistency with the existing edges given causal network causal relationships among variables can be directly read off the structure due to the complexity of the algorithm and unreliable high order conditional independence tests several works have incorporated the markov blanket mb discovery into the causal discovery with approach growth and shrink gs algorithm uses the mbs of each node to build the skeleton of causal network discover all the and then use the meek rules to complete the global causal structure the hill climbing mmhc algorithm also finds mbs of each variable first but then uses the mbs as constraints to reduce the search space for the standard hill climbing structure learning methods in authors use markov blanket with collider sets cs to improve the efficiency of the gs algorithm by combining the spouse and discovery all these methods rely on the global structure to find the causal relationships and require finding the mbs for all 
nodes in graph even if the interest is the causal relationships between one target variable and other variables different mb discovery algorithms can be used and they can be divided into two different approaches and methods used by cs and gs algorithms greedily test the independence between each variable and the target by directly using the definition of markov blanket in contrast more recent methods aim to improve the data efficiency while maintaining reasonable time complexity by finding the parents and children pc set first and then the spouses to complete the mb local learning of causal networks generally aims to identify subset of causal edges in causal network local causal discovery lcd algorithm and its variants aim to find causal edges by testing the relationships among every set in causal network bayesian local causal discovery blcd explores the among mb nodes to infer causal edges while algorithms aim to identify subset of causal edges via special structures among all variables we focus on finding all the causal edges adjacent to one target variable in other words we want to find the causal identities of each node in terms of direct causes and effects with respect to one target node we first use markov blankets to find the direct causes and effects and then propose new causal markov blanket cmb discovery algorithm which determines the exact causal identities of mb nodes of target node by tracking their conditional independence changes without finding the global causal structure of causal network the proposed cmb algorithm is complete local discovery algorithm and can identify the same direct causes and effects for target variable as global methods under standard assumptions cmb is more scalable than global methods more efficient than methods and is complete in identifying direct causes and effects of one target while other local methods are not backgrounds we use to represent the variable space capital letters such as to represent variables bold letters such as mb to represent variable sets and use to represent the size of set and represent independence and dependence between and respectively we assume readers are familar with related concepts in causal network learning and only review few major ones here in causal network or causal bayesian network nodes correspond to the random variables in variable set two nodes are adjacent if they are connected by an edge directed edge from node to node indicates is parent or direct cause of and is child or direct effect of moreover if there is directed path from to then is an ancestor of and is descendant of if nonadjacent and have common child and are spouses three nodes and form if has two incoming edges from and forming and is not adjacent to is collider in path if has two incoming edges in this path with nonadjacent parents and is an unshielded collider path from node and is blocked by set of nodes if any of following holds true there is node in belonging to there is collider node on such that neither nor any of its descendants belong to otherwise is unblocked or active pdag is graph which may have both undirected and directed edges and has at most one edge between any pair of nodes cpdags represent markov equivalence classes of dags capturing the same conditional independence relationships with the same skeleton but potentially different edge orientations cpdags contain directed edges that has the same orientation for every dag in the equivalent class and undirected edges that have reversible orientations in the equivalent class let be 
the causal dag of causal network with variable set and be the joint probability distribution over variables in and satisfy causal markov condition if and only if is independent of of given its direct causes the causal faithfulness condition states that and are faithful to each other if all and every independence and conditional independence entailed by is present in it enables the recovery of from sampled data of another assumption by existing causal discovery algorithms is causal sufficiency set of variables is causally sufficient if no set of two or more variables in shares common cause variable outside without causal sufficiency assumption latent confounders between adjacent nodes would be modeled by edges we also assume no selection bias and we can capture the same independence relationships among variables from the sampled data as the ones from the entire population many concepts and properties of dag hold in causal networks such as and mb markov blanket of target variable mbt in causal network is the minimal set of nodes conditioned on which all other nodes are independent of denoted as mbt given an unknown distribution that satisfied the markov condition with respect to an unknown dag markov blanket discovery is the process used to estimate the mb of target node in from independently and identically distributed data of under the causal faithfulness assumption between and the mb of target node is unique and is the set of parents children and spouses of other parents of children of in addition the parents and children set of pct is also unique intuitively the mb can directly facilitate causal discovery if conditioning on the mb of target variable renders variable independent of then can not be direct cause or effect of from the local causal discovery point of view although mb may contain nodes with different causal relationships with the target it is reasonable to believe that we can identify their relationships exactly up to the markov equivalence with further tests lastly exiting causal network learning algorithms all use three meek rules which we assume the readers are familiar with to orient as many edges as possible given all in pdags to obtain cpdag the basic idea is to orient the edges so that the edge directions do not introduce new preserve the property of dag and enforce local causal discovery of direct causes and effects existing mb discovery algorithms do not directly offer the exact causal identities of the learned mb nodes of target although the methods can find the pc set of the target within the mb set they can only provide the causal identities of some children and spouses that form vstructures nevertheless following existing works under standard assumptions every pc variable of target can only be its direct cause or effect theorem causality within mb under the causal faithfulness sufficiency correct independence tests and no selection bias assumptions the parent and child nodes within target mb set in causal network contains all and only the direct causes and effects of the target variable the proof can be directly derived from the pc set definition of causal network therefore using the mb discovery methods if we can discover the exact causal identities of the pc nodes within the mb causal discovery of direct causes and effects of the target can therefore be successfully accomplished building on mb discovery we propose new local causal discovery algorithm causal markov blanket cmb discovery as shown in algorithm it identifies the direct causes and effects of target 
variable without the need of finding the global structure or the mbs of all other variables in causal network cmb has three major steps to find the mb set of the target and to identify some direct causes and effects by tracking the independence relationship changes among target pc nodes before and after conditioning on the target node to repeat step but conditioned on one pc node mb set and to repeat step and with unidentified neighboring nodes as new targets to identify more direct causes and effects of the original target step initial identification cmb first finds the mb nodes of target mbt using topologybased mb discovery algorithm that also finds pct cmb then uses the causalsearch subroutine shown in algorithm to get an initial causal identities of variables in pct by checking every variable pair in pct according to lemma lemma let pct the pc set of the target in causal dag the independence relationships between and can be divided into the following four conditions and this condition can not happen and and are both the parents of and at least one of and is child of and their identities are inconclusive and need further tests algorithm causal markov blanket discovery algorithm input data target variable output idt the causal identities of all nodes with respect to step establish initial id idt zeros mbt pct indm idt causalsearch pct idt step further test variables with idt for one in each pair with idt do mbx indm mbx idt causalsearch pct idt if no element of idt is equal to break for every pair of parents of do if and are idt pairs then idt idt that idt step resolve variable set with idt for each with idt do recursively find idx without going back to the already queried variables update idt according to idx if idx then idt for every in idt variable pairs do idt if no element of idt is equal to break return idt algorithm causalsearch subroutine input data target variable pct the pc set of the conditioned variable set id current id output idt the new causal identities of all nodes with respect to step single pc if then idt pct step check for every pct do if and then idt idt else if and then if idt then idt else if idt then idt if idt then idt else if idt then idt add to pairs with idt else if idt idt or then idt idt add to pairs with idt step identify idt pairs with known parents for every such that idt do for every in idt variable pairs do idt return idt does not happen because the path is unblocked either not given or given and the unblocked path makes and dependent on each other implies that and form with as the corresponding collider such as node in figure which has two parents and indicates that the paths between and are blocked conditioned on which means that either one of is child of and the other is parent or both of are children of for example node and in figure satisfy this condition with respect to shows that there may be another unblocked path from and besides for example in figure node and have multiple paths between them besides further tests are needed to resolve this case we use idt to represent the causal identities for all the nodes with respect to idt as variable causal identity to and the small case idt as the individual id of node to we also use idx to represent the causal identities of nodes with respect to node to avoid changing the already identified pcs cmb establishes priority we use the idt to represent nodes as the parents of idt children of idt to represent pair of nodes that can not be both parents ambiguous pairs from markov equivalent structures to be 
discussed at step and idt to represent the inconclusiveness lower number id can not be changed note that the identification number is slightly different from the condition number in lemma figure sample causal network sample network with nodes the only active path between and conditioned on mbc is into higher number shown by line of algorithm if variable pair satisfies they will both be labeled as parents line of algorithm if variable pair satisfies one of them is labeled as idt only if the other variable within the pair is already identified as parent otherwise they are both labeled as idt line and of algorithm if pc node remains inconclusive with idt it is labeled as idt in line of algorithm note that if has only one pc node it is labeled as idt line of algorithm nodes always have idt step resolve idt lemma alone can not identify the variable pairs in pct with idt due to other possible unblocked paths and we have to seek other information fortunately by definition the mb set of one of the target pc node can block all paths to that pc node lemma let pct the pc set of the target in causal dag the independence relationships between and conditioned on the mb of minus mbx can be divided into the following four conditions and mbx this condition can not happen and mbx and are both the parents of and mbx at least one of and is child of and mbx then and is directly connected are very similar to those in lemma is true because conditioned on and the mb of minus the only potentially unblocked paths between and are if happens then the path has no impact on the relationship between and and hence must be directly connected if and are not directly connected and the only potentially unblocked path between and is and and will be identified by line of algorithm with idt for example in figure conditioned on mbc the only path between and is through however if and are directly connected they will remain with idt such as node and from figure in this case and form fully connected clique and edges among the variables that form fully connected clique can have many different orientation combinations without affecting the conditional independence relationships therefore this case needs further tests to ensure meek rules are satisfied the third meek rule enforcing is first enforced by line of algorithm then the rest of idt nodes are changed to have idt by line of algorithm and to be further processed even though they could be both parents at the same time with neighbor nodes causal identities therefore step of algorithm makes all variable pairs with idt to become identified either as parents children or with idt after taking some neighbors mbs into consideration note that step of cmb only needs to find the mb for small subset of the pc variables in fact only one mb for each variable pair with idt step resolve idt after step some pc variables may still have idt this could happen because of the existence of markov equivalence structures below we show the condition under which the cmb can resolve the causal identities of all pc nodes lemma the identifiability condition for algorithm to fully identify all the causal relationships within the pc set of target must have at least two nonadjacent parents one of single ancestors must contain at least two nonadjacent parents or has parents that form pattern as defined in meeks rules we use single ancestors to represent ancestor nodes that do not have spouse with mutual child that is also an ancestor of if the target does not meet any of the conditions in lemma will never be 
satisfied and all pc variables within mb will have idt without single parent identified it is impossible to infer the identities of children nodes using therefore all the identities of the pc nodes are uncertain even though the resulting structure could be cpdag step of cmb searches for ancestor of to infer the causal directions for each node with idt cmb tries to identify its local causal structure recursively if pc nodes are all identified it would return to the target with the resolved identities otherwise it will continue to search for ancestor of note that cmb will not go back to variables with unresolved pc nodes without providing new information step of cmb checks the identifiability condition for all the ancestors of the target if graph structure does not meet the conditions of lemma the final idt will contain some idt which indicates reversible edges in cpdags the found causal graph using cmb will be pdag after step of algorithm and it will be cpdag after step of algorithm case study the procedure using cmb to identify the direct causes and effects of in figure has the following steps step cmb finds the mb and pc set of the pc set contains node and then ide and ide step to resolve the variable pair and with ide cmb finds the pc set of containing and their idd are all since contains only one parent to resolve idd cmb checks causal identities of node and without going back to the pc set of contains and cmb identifies idc idc and idc since resolves all its pc nodes cmb returns to node with idd with the new parent idd idd and cmb returns to node with ide step the ide and after resolving the pair with ide ide theorem the soundness and completeness of cmb algorithm if the identifiability condition is satisfied using sound and complete mb discovery algorithm cmb will identify the direct causes and effects of the target under the causal faithfulness sufficiency correct independence tests and no selection bias assumptions proof sound and complete mb discovery algorithm find all and only the mb nodes of target using it and under the causal sufficiency assumption the learned pc set contains all and only the variables by theorem when lemma is satisfied all parent nodes are identifiable through independence changes either by lemma or by lemma also since children can not be conditionally independent of another pc node given its mb minus the target node all parents identified by lemma and will be the true positive direct causes therefore all and only the true positive direct causes will be correctly identified by cmb since pc variables can only be direct causes or direct effects all and only the direct effects are identified correctly by cmb in the cases where cmb fails to identify all the pc nodes global causal discovery methods can not identify them either specifically structures failing to satisfy lemma can have different orientations on some edges while preserving the skeleton and hence leading to markov equivalent structures for the cases where has all single ancestors the edge directions among all single ancestors can always be reversed without introducing new and dag violations in which cases the meek rules can not identify the causal directions either for the cases with fully connected cliques these fully connected cliques do not meet the requirement for the first meek rule no new and the second meek rule preserving dags can always be satisfied within clique by changing the direction of one edge since cmb orients the in the third meek rule correctly by line of algorithm cmb can identify 
the same structure as the global methods that use the meek rules theorem consistency between cmb and global causal discovery methods for the same dag algorithm will correctly identify all the direct causes and effects of target variable as the global and causal discovery that use the meek rules up to cpdag under the causal faithfulness sufficiency correct independence tests and no selection bias assumptions proof it has been shown that causal methods using meek rules can identify up to graph cpdag since meek rules can not identify the structures that fail lemma the global and methods can only identify the same structures as cmb since cmb is sound and complete in identifying these structures by theorem cmb will identify all direct causes and effects up to cpdag complexity the complexity of cmb algorithm is dominated by the step of finding the mb which can have an exponential complexity all other steps of cmb are trivial in comparison if we assume uniform distribution on the neighbor sizes in network with nodes then the expected time compn plexity of step of cmb is while methods are in later steps cmb also needs to find mbs for small subset of nodes that include one node between every pair of nodes that meet and subset of the target neighboring nodes that provide additional clues for the target let be the total size of these nodes then cmb reduces the cost by nl times asymptotically experiments we use benchmark causal learning datasets to evaluate the accuracy and efficiency of cmb with four other causal discovery algorithms discussed gs mmhc cs and the local causal discovery algorithm due to page limit we show the results of the causal algorithms on four datasets alarm and they contain to nodes we use data samples for all datasets for each global or algorithm we find the global structure of dataset and then extract causal identities of all nodes to target node cmb finds causal identities of every variable with respect to the target directly we repeat the discovery process for each node in the datasets and compare the discovered causal identities of all the algorithms to all the markov equivalent structures with the known ground truth structure we use the edge scores to measure the number of missing edges extra edges and reversed in each node local causal structure and report average values along with its standard deviation for all the nodes in dataset we use the existing implementation of discovery algorithm to find the mb of target variable for all the algorithms we also use the existing implementations for mmhc and algorithms we implement gs cs and the proposed cmb algorithms in matlab on machine with cpu and memory following the existing protocol we use the number of conditional independence tests needed or scores computed for the search method mmhc to find the causal structures given the and the number of times that mb discovery algorithms are invoked to measure the efficiency of various algorithms we also use conditional independence tests with standard significance level of for all the datasets without worrying about parameter tuning as shown in table cmb consistently outperforms the global discovery algorithms on benchmark causal networks and has comparable edge accuracy with algorithms although cmb makes slightly more total edge errors in alarm and datasets than cs cmb is the best method on and since is an incomplete algorithm it never finds extra or reversed edges but misses the most amount of edges cmb can achieve more than one order of magnitude speedup sometimes two orders of 
magnitude as shown in and than the global methods compared to methods cmb also can achieve more than one order of speedup on and in addition on these datasets cmb only invokes mb discovery algorithms between to times drastically reducing the mb calls of algorithms since independence test comparison is unfair to who does not use mb discovery or find moral graphs we also compared time efficiency between and cmb cmb is times faster on alarm times faster on and and times faster on than in practice the performance of cmb depends on two factors the accuracy of independence tests and mb discovery algorithms first independence tests may not always be accurate and could introduce errors while checking the four conditions of lemma and especially under insufficient data samples secondly causal discovery performance heavily depends on the performance of the mb discovery step as the error could propagate to later steps of cmb improvements on both areas could further improve cmb accuracy cmb complexity can still be exponential and is dominated by the mb discovery phase and thus its worst case complexity could be the same as approaches for some special structures

footnotes we specify the global and causal methods to be gs and cs if an edge is reversible in the equivalent class of the original graph but not in the equivalent class of the learned graph it is counted as a reversed edge as well for global methods it is the number of tests needed or scores computed given the moral graph of the global structure for it would be the total number of tests since it does not use a moral graph or mbs

table performance of various causal discovery algorithms on benchmark networks for each of the four datasets including alarm the methods mmhc gs cs and cmb are compared on error counts extra edges missing reversed total and on efficiency number of independence tests and number of mb discovery calls

conclusion we propose a new local causal discovery algorithm cmb we show that cmb can identify the same causal structure as the global and causal discovery algorithms with the same identification condition but uses a fraction of the cost of the global approaches we further prove the soundness and completeness of cmb experiments on benchmark datasets show the comparable accuracy and greatly improved efficiency of cmb for local causal discovery possible future work could study assumption relaxations especially without the causal sufficiency assumption such as by using a similar procedure as the fci algorithm and the improved cs algorithm to handle latent variables in cmb

references constantin aliferis ioannis tsamardinos alexander statnikov aliferis ph tsamardinos ph and er statnikov hiton novel markov blanket algorithm for optimal variable selection david maxwell chickering optimal structure identification with greedy search journal of machine learning research gregory cooper simple algorithm for efficiently mining observational databases for causal relationships data mining and knowledge discovery isabelle guyon andre elisseeff and constantin aliferis causal feature selection daphne koller and mehran sahami toward optimal feature selection in icml pages morgan kaufmann subramani mani constantin aliferis alexander statnikov and med nyu bayesian algorithms for causal data mining in nips causality objectives and assessment pages subramani mani and gregory cooper study in causal discovery from infant birth and death records in proceedings of the amia symposium page american medical informatics association subramani mani and gregory cooper causal discovery using bayesian local causal discovery algorithm medinfo
pt dimitris margaritis and sebastian thrun bayesian network induction via local neighborhoods in advances in neural information processing systems pages mit press christopher meek causal inference and causal explanation with background knowledge in proceedings of the eleventh conference on uncertainty in artificial intelligence pages morgan kaufmann publishers teppo niinimaki and pekka parviainen local structure disocvery in bayesian network in proceedings of uncertainy in artifical intelligence workshop on causal structure learning pages judea pearl probabilistic reasoning in intelligent systems networks of plausible inference morgan kaufmann publishers edition judea pearl causality models reasoning and inference volume cambridge univ press pellet and elisseeff finding latent causes in causal networks an efficient approach based on markov blankets in advances in neural information processing systems pages pellet and andre ellisseeff using markov blankets for causal structure learning journal of machine learning jose roland nilsson johan and jesper towards scalable and data efficient learning of markov boundaries int approx reasoning july craig silverstein sergey brin rajeev motwani and jeff ullman scalable techniques for mining causal structures data mining and knowledge discovery spirtes glymour and scheines causation prediction and search the mit press edition peter spirtes clark glymour richard scheines stuart kauffman valerio aimale and frank wimberly constructing bayesian network models of gene expression networks from microarray data peter spirtes christopher meek and thomas richardson causal inference in the presence of latent variables and selection bias in proceedings of the eleventh conference on uncertainty in artificial intelligence pages morgan kaufmann publishers alexander statnikov ioannis tsamardinos laura brown and constatin aliferis causal explorer matlab library for algorithms for causal discovery and variable selection for classification in causation and prediction challenge at wcci ioannis tsamardinos constantin aliferis and alexander statnikov time and sample efficient discovery of markov blankets and direct causal relations in proceedings of the ninth acm sigkdd international conference on knowledge discovery and data mining kdd pages new york ny usa acm ioannis tsamardinos laurae brown and constantinf aliferis the bayesian network structure learning algorithm machine learning jiji zhang on the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias artificial intelligence 
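To make the parent-identification step above concrete, the following minimal sketch shows the kind of independence-change check that orients edges around a target inside its Markov blanket: two PC nodes of the target that are independent given some subset of MB(target) but become dependent once the target is added to the conditioning set form a v-structure into the target, so both are identified as direct causes. This is a generic collider test written in the spirit of the lemmas above, not the paper's implementation; the Fisher-z partial-correlation test (which assumes Gaussian data), the significance level, and all function names are illustrative assumptions of the sketch.

```python
# Minimal sketch: orient parents of a target via independence changes inside its
# Markov blanket.  The CI test below is a stand-in for whatever test the MB
# discovery step already uses; names and thresholds are illustrative assumptions.
import itertools
import numpy as np
from scipy import stats


def partial_corr_independent(data, i, j, cond, alpha=0.05):
    """Fisher-z test of X_i independent of X_j given X_cond on Gaussian data (True = independent)."""
    idx = [i, j] + list(cond)
    cov = np.cov(data[:, idx], rowvar=False)
    prec = np.linalg.pinv(cov)
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = float(np.clip(r, -0.999999, 0.999999))
    n = data.shape[0]
    z = 0.5 * np.log((1.0 + r) / (1.0 - r)) * np.sqrt(n - len(cond) - 3)
    p_value = 2.0 * (1.0 - stats.norm.cdf(abs(z)))
    return p_value > alpha


def collider_evidence(data, x, y, target, mb):
    """True if x, y are independent given some subset S of MB(target) but dependent given S + {target}."""
    rest = [v for v in mb if v not in (x, y, target)]
    for k in range(len(rest) + 1):
        for s in itertools.combinations(rest, k):
            if (partial_corr_independent(data, x, y, list(s))
                    and not partial_corr_independent(data, x, y, list(s) + [target])):
                return True
    return False


def identify_parents(data, target, pc, mb):
    """Tag PC nodes of `target` as direct causes when an independence change reveals a collider."""
    parents = set()
    for x, y in itertools.combinations(pc, 2):
        if collider_evidence(data, x, y, target, mb):
            parents.update([x, y])  # v-structure x -> target <- y, so both are parents
    return parents
```

PC nodes that never participate in such an independence change keep an unresolved identity, which is exactly the situation in which the procedure above recurses on neighbouring or ancestor nodes for further evidence.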
high dimensional em algorithm statistical optimization and asymptotic zhaoran wang princeton university quanquan gu university of virginia yang ning princeton university han liu princeton university abstract we provide general theory of the em algorithm for inferring high dimensional latent variable models in particular we make two contributions for parameter estimation we propose novel high dimensional em algorithm which naturally incorporates sparsity structure into parameter estimation with an appropriate initialization this algorithm converges at geometric rate and attains an estimator with the optimal statistical rate of convergence ii based on the obtained estimator we propose new inferential procedure for testing hypotheses for low dimensional components of high dimensional parameters for broad family of statistical models our framework establishes the first computationally feasible approach for optimal estimation and asymptotic inference in high dimensions introduction the em algorithm is the most popular approach for calculating the maximum likelihood estimator of latent variable models nevertheless due to the nonconcavity of the likelihood function of latent variable models the em algorithm generally only converges to local maximum rather than the global one on the other hand existing statistical guarantees for latent variable models are only established for global optima therefore there exists gap between computation and statistics significant progress has been made toward closing the gap between the local maximum attained by the em algorithm and the maximum likelihood estimator in particular first establish general sufficient conditions for the convergence of the em algorithm further improve this result by viewing the em algorithm as proximal point method applied to the divergence see for detailed survey more recently establish the first result that characterizes explicit statistical and computational rates of convergence for the em algorithm they prove that given suitable initialization the em algorithm converges at geometric rate to local maximum close to the maximum likelihood estimator all these results are established in the low dimensional regime where the dimension is much smaller than the sample size in high dimensional regimes where the dimension is much larger than the sample size there exists no theoretical guarantee for the em algorithm in fact when the maximum likelihood estimator is in general not well defined unless the models are carefully regularized by assumptions furthermore even if regularized maximum likelihood estimator can be obtained in computationally tractable manner establishing the corresponding statistical properties especially asymptotic normality can still be challenging because of the existence of high dimensional nuisance parameters to address such challenge we develop general inferential theory of the em algorithm for parameter estimation and uncertainty assessment of high dimensional latent variable models in particular we make two contributions in this paper for high dimensional parameter estimation we propose novel high dimensional em algorithm by attaching truncation step to the expectation step and maximization step such research supported by nsf nsf nsf nsf nsf nih nih nih and fda truncation step effectively enforces the sparsity of the attained estimator and allows us to establish significantly improved statistical rate of convergence based upon the estimator attained by the high dimensional em algorithm we propose decorrelated score 
statistic for testing hypotheses related to low dimensional components of the high dimensional parameter under unified analytic framework we establish simultaneous statistical and computational guarantees for the proposed high dimensional em algorithm and the respective uncertainty assessment procedure let rd be the true parameter be its sparsity level and be the iterative solution sequence of the high dimensional em algorithm with being the total number of iterations in particular we prove that given an appropriate initialization init with relative error upper bounded by constant init the iterative solution sequence satisfies log optimization error statistical error optimal rate with high probability here and are quantities that possibly depend on and as the optimization error term in pdecreases to zero at geometric rate with respect to the overall estimation error achieves the log statistical rate of convergence up to an extra factor of log which is see theorem for details the proposed decorrelated score statistic is asymptotically normal moreover its limiting variance is optimal in the sense that it attains the semiparametric information bound for the low dimensional components of interest in the presence of high dimensional nuisance parameters see theorem for details our framework allows two implementations of the the exact maximization versus approximate maximization the former one calculates the maximizer exactly while the latter one conducts an approximate maximization through gradient ascent step our framework is quite general we illustrate its effectiveness by applying it to two high dimensional latent variable models that is gaussian mixture model and mixture of regression model comparison with related work closely related work is by which considers the low dimensional regime where is much smaller than under certain initialization conditions theyp prove that the em algorithm converges at geometric rate to some local optimum that attains the statistical rate of convergence they cover both maximization and gradient ascent implementations of the and establish the consequences for the two latent variable models considered in our paper under low dimensional settings our framework adopts their view of treating the em algorithm as perturbed version of gradient methods however to handle the challenge of high dimensionality the key ingredient of our framework is the truncation step that enforces the sparsity structure along the solution path such truncation operation poses significant challenges for both computational and statistical analysis in detail for computational analysis we need to carefully characterize the evolution of each intermediate solution support and its effects on the evolution of the entire iterative solution sequence for statistical analysis we need to establish characterization of the entrywise statistical error which is technically more challenging than just establishing the error employed by in high dimensional regimes we need to establish the log statistical rate of convergence which is much sharper than their rate when in addition to point estimation we further construct hypothesis tests for latent variable models in the high dimensional regime which have not been established before high dimensionality poses significant challenges for assessing the uncertainty testing hypotheses of the constructed estimators for example show that the limiting distribution of the lasso estimator is not gaussian even in the low dimensional regime variety of approaches have been 
proposed to correct the lasso estimator to attain asymptotic normality including the debiasing method the desparsification methods as well as instrumental methods meanwhile propose the procedures for exact inference in addition several authors propose methods based on data splitting stability selection and sets however these approaches mainly focus on generalized linear models rather than latent variable models in addition their results heavily rely on the fact that the estimator is global optimum of convex program in comparison our approach applies to much broader family of statistical models with latent structures for these latent variable models it is computationally infeasible to obtain the global maximum of the penalized likelihood due to the nonconcavity of the likelihood function unlike existing approaches our inferential theory is developed for the estimator attained by the proposed high dimensional em algorithm which is not necessarily global optimum to any optimization formulation another line of research for the estimation of latent variable models is the tensor method which exploits the structures of third or higher order moments see and the references therein however existing tensor methods primarily focus on the low dimensional regime where in addition since the high order sample moments generally have slow statistical rate of convergence the estimators obtained by the tensor methods usually have suboptimal statistical rate even for for example establish the statistical rate of convergence for mixture of regression model which is suboptimal compared with the minimax lower bound similarly in high dimensional settings the statistical rates of convergence attained by tensor methods are significantly slower than the statistical rate obtained in this paper the latent variable models considered in this paper have been well studied nevertheless only few works establish theoretical guarantees for the em algorithm in particular for gaussian mixture model establish parameter estimation guarantees for the em algorithm and its extensions for mixture of regression model establish exact parameter recovery guarantees for the em algorithm under noiseless setting for high dimensional mixture of regression model analyze the gradient em algorithm for the they establish support recovery guarantees for the attained local optimum but have no parameter estimation guarantees in comparison with existing works this paper establishes general inferential framework for simultaneous parameter estimation and uncertainty assessment based on novel high dimensional em algorithm our analysis provides the first theoretical guarantee of parameter estimation and asymptotic inference in high dimensional regimes for the em algorithm and its applications to broad family of latent variable models notation the matrix kp is obtained by taking the of each row and then taking the of the obtained row norms we use to denote generic constants their values may vary from line to line we will introduce more notations in methodology we first introduce the high dimensional em algorithm and then the respective inferential procedure as examples we consider their applications to gaussian mixture model and mixture of regression model for compactness we defer the details to of the appendix more models are included in the longer version of this paper algorithm high dimensional em algorithm parameter sparsity parameter sb maximum number of iterations initialization sbinit supp init sb trunc init sbinit supp and trunc are defined in and 
for to evaluate qn mn mn is implemented as in algorithm or sb supp sb trunc sb end for output algorithm maximization implementation of the input qn output mn argmax qn algorithm gradient ascent implementation of the input qn output mn parameter stepsize rqn high dimensional em algorithm before we introduce the proposed high dimensional em algorithm algorithm we briefly review the classical em algorithm let be the probability density function of where rd is the model parameter for latent variable models we assume that is obtained by marginalizing over an unobserved latent variable dz let be the density of conditioning on the observed variable we define qn yi log yi dz see of the appendix for detailed derivation at the iteration of the classical em algorithm we evaluate qn at the and then perform max qn at the the proposed high dimensional em algorithm algorithm is built upon the and lines and of the classical em algorithm in addition to the exact maximization implementation of the algorithm we allow the gradient ascent implementation of the algorithm which performs an approximate maximization via gradient ascent step to handle the challenge of high dimensionality in line of algorithm we perform truncation step to enforce the sparsity structure in detail we define supp the set of index corresponding to the top largest also for an index set we define the trunc function in line as trunc note that is the output of the line at the iteration of the high dimensional em algorithm to obtain the line preserves the entries of with the top sb large magnitudes and sets the rest to zero here sb is tuning parameter that controls the sparsity level line by iteratively performing the and the high dimensional em algorithm attains an estimator line here is the total number of iterations asymptotic inference notation let be the gradient with respect to and be the gradient with respect to if there is no confusion we simply denote rq as in the previous sections we define the higher order derivatives in the same manner is calculated by first taking derivative with respect to and then with respect to for rd with and we use notations such as and to denote the corresponding subvector of rd and the submatrix of we aim to conduct asymptotic inference for low dimensional components of the high dimensional parameter without loss of generality we consider single entry of in particular we assume where is the entry of interest while rd is treated as the nuisance parameter in the following we construct high dimensional score test named decorrelated score test it is worth noting that our method and theory can be easily generalized to perform statistical inference for an arbitrary low dimensional subvector of decorrelated score test for score test we are primarily interested in testing since this null hypothesis characterizes the uncertainty in variable selection our method easily generalizes to with for notational simplicity we define the following key quantity let tn qn qn we define the decorrelated score function sn as sn qn qn here rd is obtained using the following dantzig selector argmin subject to tn tn where is tuning parameter let where is the estimator attained by the high dimensional em algorithm algorithm we define the decorrelated score statistic as sn tn where and tn tn here we use instead of since we are interested in the null hypothesis we can also replace with and the theoretical results will remain the same in we will prove the proposed decorrelated score statistic in is asymptotically consequently the 
decorrelated score test with significance level takes the form sn tn where is the inverse function of the gaussian cumulative distribution function if we reject the null hypothesis the intuition of this decorrelated score test is explained in of the appendix the key theoretical observation is theorem which connects qn in and tn in with the score function and fisher information in the presence of latent structures let be its score function is and the fisher information is where is the expectation under the model with parameter theorem for the true parameter and any qn and tn rd it holds that proof see of the appendix for detailed proof based on the decorrelated score test it is easy to establish the decorrelated wald test which allows us to construct confidence intervals for compactness we defer it to the longer version of this paper theory of computation and estimation before we present the main results we introduce three technical conditions which will significantly ease our presentation they will be verified for specific latent variable models in of the appendix the first two conditions proposed by characterize the properties of the population version lower bound function the expectation of qn defined in we define the respective population version as follows for the in algorithm we define argmax for the in algorithm we define where is the stepsize in algorithm we use to denote the basin of attraction the local region where the high dimensional em algorithm enjoys desired guarantees condition we define two versions of this condition for the true parameter and any we have where is the population version maximization implementation defined in for the true parameter and any we have condition defines variant of lipschitz continuity for in the sequel we will use and in the analysis of the two implementations of the respectively condition for any is and concave this condition indicates that when the second variable of is fixed to be the function is sandwiched between two quadratic functions the third condition characterizes the statistical error between the sample version and population version mn defined in algorithms and and in and recall denotes the total number of nonzero entries in vector condition for any fixed with we have that mn holds with probability at least here possibly depends on sparsity level sample size dimension as well as the basin of attraction in the statistical error quantifies the of the difference between the population version and sample version particularly we constrain the input of and mn to be such condition is different from the one used by in detail they quantify the statistical error with the and do not constrain the input of and mn to be sparse consequently our subsequent statistical analysis is different from theirs the reason we use the is that it characterizes the more refined entrywise statistical error which converges at fast rate of log possibly with extra factors depending on specific models in comparison the statistical error converges at slow rate of which does not decrease to zero as increases with furthermore the entrywise statistical error is crucial to our key proof for quantifying the effects of the truncation step line of algorithm on the iterative solution sequence main results to simplify the technical analysis of the high dimensional em algorithm we focus on its resampling version which is illustrated in algorithm in of the appendix theorem we define where for some we assume condition holds and init for the maximization implementation of the 
algorithm we suppose that condition holds with and sb max sb min here and are constants under condition sb we have that for sb optimization error statistical error holds with probability at least where is the same constant as in for the gradient ascent implementation of the algorithm we suppose that condition holds with and the stepsize in algorithm is set to meanwhile we assume and hold with replaced by under condition sb we have that for holds with probability at least in which is replaced with proof see of the appendix for detailed proof the assumption in states that the sparsity parameter sb is chosen to be sufficiently large and also of the same order as the true sparsity level this assumption ensures that the error incurred by the truncation step can be upper bounded in addition as is shown for specific latent variable models in of the appendix the error term in condition psb decreases as sample size increases by the assumption in sb is of the same order as therefore the assumption in suggests the sample size is sufficiently large such that is sufficiently small these assumptions guarantee that the entire iterative solution sequence remains within the basin of attraction in the presence of statistical error theorem illustrates that the upper bound of the overall estimation error can be decomposed into two terms the first term is the upper bound of optimization error which decreases to zero at geometric rate of convergence because we have meanwhile term is the upper pthe second bound of statistical error which does not depend on since sb is of the same order as this term is proportional to where is the entrywise statistical error between and mn in of the appendix we prove that for each specific latent variable model is roughly of the order log there may be extra factors attached pto depending on each specific model therefore the statistical error term is roughly of the order log consequently for sufficiently large such that the optimization and statistical error terms in are of the same order the final estimator attains optimal log possibly with extra factors statistical rate for compactness we give the following example and defer the details to implications for gaussian mixture model we assume yn are the realizations of here is rademacher random variable and id is independent of where is the standard deviation suppose that we have where is sufficiently large constant that denotes the minimum ratio in of the appendix we prove that there exists some constant such that conditions and hold with exp with for sufficiently large we have that condition holds with log log then the first part of theorem implies log log for sufficiently large which is with respect to the minimax lower bound log theory of inference to simplify the presentation of the unified framework we lay out several technical conditions which will be verified for each model let em and be four quantities that scale with and these conditions will be verified for specific latent variable models in of the appendix condition em we have em condition we have condition tn op we have tn tn condition tn for any we have tn tn op op in the sequel we lay out an assumption on several population quantities and the sample size recall that where the entry of interest while is the nuisance parameter by the notations in and denote the submatrices of the fisher information matrix we define and sw as sw kw and sw supp we define and as the largest and smallest eigenvalues of and according to and we can easily verify that the following assumption 
ensures that hence in is invertible also according to and the fact that we have assumption we impose the following assumptions for positive constants and we assume the tuning parameter of the dantzig selector in is set to em where is sufficiently large constant the sample size is sufficiently large such that max em em max em the assumption on guarantees that the fisher information matrix is positive definite the other assumptions in guarantee the existence of the asymptotic variance of sn in the score statistic defined in similar assumptions are standard in existing asymptotic inference results for example for mixture of regression model impose variants of these assumptions for specific models we will show that em and all decrease with while increases with at slow rate therefore the assumptions in ensure that the sample size is sufficiently large we will make these assumptions more explicit after we specify em and for each model note the assumptions in imply that needs to be small for instance for specified in max in implies in the following we will prove that is of the order log hence we require that log rd is sparse such sparsity can be as follows according to assumption understood the definition of in we have therefore such sparsity assumption suggests lies within the span of few columns of such sparsity assumption on is necessary because otherwise it is difficult to accurately estimate in high dimensional regimes in the context of high dimensional generalized linear models impose similar sparsity assumptions main results decorrelated score test the next theorem establishes the asymptotic normality of the decorrelated score statistic defined in theorem we consider with under assumption and conditions we have that for sn tn where and tn are defined in the limiting variance of the decorrelated score function sn is which is defined in proof see of the appendix for detailed proof optimality prove that for inferring in the presence of nuisance parameter is the semiparametric efficient information the minimum limiting variance of the rescaled score function our proposed decorrelated score function achieves such semiparametric information lower bound and is therefore in this sense optimal in the following we use gaussian mixture model to illustrate the effectiveness of theorem we defer the details and the implications for mixture of regression to of the appendix implications for gaussian mixture model under the same model considered in if we assume all quantities except and are constant then we have that conditions hold with em log log log log and log log thus under assumption holds when also we can verify that in assumption holds if max log log conclusion we propose novel high dimensional em algorithm which naturally incorporates sparsity structure our theory shows that with suitable initialization the proposed algorithm converges at geometric rate and achieves an estimator with the optimal statistical rate of convergence beyond point estimation we further propose the decorrelated score and wald statistics for testing hypotheses and constructing confidence intervals for low dimensional components of high dimensional parameters we apply the proposed algorithmic framework to broad family of high dimensional latent variable models for these models our framework establishes the first computationally feasible approach for optimal parameter estimation and asymptotic inference under high dimensional settings references and tensor decompositions for learning latent variable models journal of machine 
learning research and statistical guarantees for the em algorithm from population to analysis arxiv preprint and ta latent variable models and factor analysis unified approach vol wiley and sparse models and methods for optimal instruments with an application to eminent domain econometrica and simultaneous analysis of lasso and dantzig selector annals of statistics and concentration inequalities nonasymptotic theory of independence oxford university press and constrained minimization approach to sparse precision matrix estimation journal of the american statistical association and the dantzig selector statistical estimation when is much larger than annals of statistics and spectral experts for estimating mixtures of linear regressions arxiv preprint ta and at ta learning mixtures of gaussians using the algorithm arxiv preprint ta and probabilistic analysis of em for mixtures of separated spherical gaussians journal of machine learning research and maximum likelihood from incomplete data via the em algorithm journal of the royal statistical society series statistical methodology ava and ta confidence intervals and hypothesis testing for regression journal of machine learning research and variables selection in finite mixture of regression models journal of the american statistical association and asymptotics for estimators annals of statistics and ay exact inference after model selection via the lasso arxiv preprint ay and significance test for the lasso annals of statistics and the em algorithm and extensions vol wiley and stability selection journal of the royal statistical society series statistical methodology and for regression journal of the american statistical association introductory lectures on convex optimization basic course vol springer and va confidence sets in sparse regression annals of statistics and regression models test va for mixture ay and adaptive inference for least angle regression and the lasso arxiv preprint an analysis of the em algorithm and proximal point methods mathematics of operations research va and on asymptotically optimal confidence regions and tests for models annals of statistics va va asymptotic statistics vol cambridge university press introduction to the analysis of random matrices arxiv preprint and variable selection annals of statistics on the convergence properties of the em algorithm annals of statistics and av alternating minimization for mixed linear regression arxiv preprint and confidence intervals for low dimensional parameters in high dimensional linear models journal of the royal statistical society series statistical methodology 
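To make the truncation-augmented E-step and M-step concrete, the following minimal sketch instantiates the algorithm for the symmetric two-component Gaussian mixture discussed above (y = z·β* + noise, with Rademacher z and isotropic noise): the E-step computes posterior membership weights, the exact-maximization M-step takes the correspondingly signed sample average, and the added truncation step keeps only the ŝ coordinates of largest magnitude. The particular model instance, the initialization near β*, and names such as s_hat and sigma are illustrative assumptions of this sketch; the paper's framework covers a broader family of latent variable models and also allows a gradient-ascent M-step.

```python
# Minimal sketch of the truncated (sparsity-enforcing) EM update for the symmetric
# two-component Gaussian mixture y = z * beta_star + noise, z in {-1, +1}.
import numpy as np


def trunc(beta, s_hat):
    """Keep the s_hat entries of largest magnitude and set the rest to zero."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-s_hat:]
    out[keep] = beta[keep]
    return out


def high_dim_em_gmm(y, beta_init, s_hat, sigma=1.0, n_iter=50):
    """Truncated EM for the symmetric Gaussian mixture; y has shape (n, d)."""
    beta = trunc(beta_init, s_hat)
    for _ in range(n_iter):
        # E-step: posterior probability that z_i = +1 under the current beta.
        logits = np.clip(2.0 * (y @ beta) / sigma**2, -30.0, 30.0)
        w = 1.0 / (1.0 + np.exp(-logits))
        # M-step (exact maximization): signed weighted average of the samples.
        beta = ((2.0 * w - 1.0)[:, None] * y).mean(axis=0)
        # T-step: truncate to the top s_hat coordinates to enforce sparsity.
        beta = trunc(beta, s_hat)
    return beta


if __name__ == "__main__":
    # Synthetic check: sparse beta_star, initialization within a neighbourhood of
    # beta_star to mimic the "appropriate initialization" assumption.
    rng = np.random.default_rng(0)
    n, d, s = 500, 200, 5
    beta_star = np.zeros(d)
    beta_star[:s] = 2.0
    z = rng.choice([-1.0, 1.0], size=n)
    y = z[:, None] * beta_star + rng.normal(size=(n, d))
    beta_init = beta_star + 0.5 * rng.normal(size=d)
    beta_hat = high_dim_em_gmm(y, beta_init, s_hat=2 * s)
    print("estimation error:", np.linalg.norm(beta_hat - beta_star))
```

The key departure from the classical EM iteration is the final trunc call inside the loop, which keeps every iterate ŝ-sparse and is what allows the entrywise statistical error, rather than the full d-dimensional error, to drive the rate of convergence.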
revenue optimization against strategic buyers mehryar mohri courant institute of mathematical sciences mercer street new york ny google research avenue new york ny abstract we present revenue optimization algorithm for auctions when facing buyer with random valuations who seeks to optimize his surplus in order to analyze this problem we introduce the notion of buyer more natural notion of strategic behavior than what has been considered in the past we improve upon the previous and achieve an optimal regret bound in log log pwhen the seller selects prices from finite set log when the prices offered and provide regret bound in are selected out of the interval introduction online advertisement is currently the fastest growing form of advertising this growth has been motivated among other reasons by the existence of well defined metrics of effectiveness such as and conversion rates moreover online advertisement enables the design of better targeted campaigns by allowing advertisers to decide which type of consumers should see their advertisement these advantages have promoted the fast pace development of large number of advertising platforms among them adexchanges have increased in popularity in recent years in contrast to traditional advertising adexchanges do not involve contracts between publishers and advertisers instead advertisers are allowed to bid in for the right to display their ad an adexchange works as follows when user visits publisher website the publisher sends this information to the adexchange which runs auction with reserve vickrey milgrom among all interested advertisers finally the winner of the auction gets the right to display his ad on the publisher website and pays the maximum of the second highest bid and the reserve price in practice this process is performed in milliseconds resulting in millions of transactions recorded daily by the adexchange thus one might expect that the adexchange could benefit from this information by learning how much an advertiser values the right to display his ad and setting an optimal reserve price this idea has recently motivated research in the learning community on revenue optimization in auctions with reserve mohri and medina cui et et the algorithms proposed by these authors heavily rely on the assumption that the advertisers bids are drawn from some underlying distribution however if an advertiser is aware of the fact that the adexchange or publisher are using revenue optimization algorithm then most likely he would adjust his behavior to trick the publisher into offering more beneficial price in the future under this scenario the assumptions of mohri and medina and et would be violated in fact empirical evidence of strategic behavior by advertisers has been documented by edelman and ostrovsky it is therefore critical to analyze the interactions between publishers and strategic advertisers this work was partially done at the courant institute of mathematical sciences in this paper we consider the simpler scenario of revenue optimization in auctions with strategic buyers first analyzed by amin et al as pointed out by amin et al the study of this simplified problem is truly relevant since large number of auctions run by adexchanges consist of only one buyer or one buyer with large bid and several buyers with negligible bids in this scenario auction in fact reduces to auction where the seller sets reserve price and the buyer decides to accept it bid above it or reject it bid below to analyze the sequential nature of this problem we can 
cast it as repeated game between buyer and seller where strategic buyer seeks to optimize his surplus while the seller seeks to collect the largest possible revenue from the buyer this can be viewed as an instance of repeated nonzero sum game with incomplete information which is problem that has been well studied in the economics and game theory community nachbar however such previous work has mostly concentrated on the characterization of different types of achievable equilibria as opposed to the design of an algorithm for the seller furthermore the problem we consider admits particular structure that can be exploited to derive learning algorithms with more favorable guarantees for the specific task of revenue optimization the problem can also be viewed as an instance of bandit problem auer et lai and robbins more specifically particular type of continuous bandit problem previously studied by kleinberg and leighton indeed at every time the buyer can only observe the revenue of the price he offered and his goal is to find as fast as possible the price that would yield the largest expected revenue unlike bandit problem however here the performance of an algorithm can not be measured in terms of the external regret indeed as observed by bubeck and and arora et al the notion of external regret becomes meaningless when facing an adversary that reacts to the learner actions in short instead of comparing to the best achievable revenue by fixed price over the sequence of rewards seen one should compare against the simulated sequence of rewards that would have been seen had the seller played fixed price this notion of regret is known as strategic regret and regret minimization algorithms have been proposed before under different scenarios amin et mohri and medina in this paper we provide regret minimization algorithm for the stochastic scenario where at each round the buyer receives an valuation from an underlying distribution while this random valuation might seems surprising it is in fact standard assumption in the study of auctions milgrom and weber milgrom cole and roughgarden moreover in practice advertisers rarely interact directly with an adexchange instead several advertisers are part of an ad network and it is that ad network that bids on their behalf therefore the valuation of the ad network is not likely to remain fixed our model is also motivated by the fact that the valuation of an advertiser depends on the user visiting the publisher website since these visits can be considered random it follows that the buyer valuation is in fact random variable crucial component of our analysis is the definition of strategic buyer we consider buyer who seeks to optimize his cumulative discounted surplus however we show that buyer who exactly maximizes his surplus must have unlimited computational power which is not realistic assumption in practice instead we define the notion of an buyer who seeks only to approximately optimize his surplus our main contribution is to show that when facing an buyer seller can achieve log regret when the set of possible prices to offer is finite and an regret bound when the set of prices is remarkably these bounds on the regret match those given by kleinberg and leighton in truthful scenario where the buyer does not behave strategically the rest of this paper is organized as follows in section we discuss in more detail related previous work next we define more formally the problem setup section in particular we give precise definition of the notion of buyer section 
our main algorithm for finite set of prices is described in section where we also provide regret analysis in section we extend our algorithm to the continuous case where we show that regret in can be achieved previous work the problem of revenue optimization in auctions goes back to the seminal work of myerson who showed that under some regularity assumptions over the distribution the revenue optimal mechanism is auction with reserve this result applies to singleshot auctions where buyers and the seller interact only once and the underlying value distribution is known to the seller in practice however it is not realistic to assume that the seller has access to this distribution instead in cases such as advertisement the seller interacts with the buyer large number of times and can therefore infer his behavior from historical data this fact has motivated the design of several learning algorithms such as that of et who proposed bandit algorithm for revenue optimization in auctions and the work of mohri and medina who provided learning guarantees and an algorithm for revenue optimization where each auction is associated with feature vector the aforementioned algorithms are formulated under the assumption of buyers bidding in an fashion and do not take into account the fact that buyers can in fact react to the use of revenue optimization algorithms by the seller this has motivated series of publications focusing on this particular problem bikhchandani and mccardle analyzed the same problem proposed here when the buyer and seller interact for only two rounds kanoria and nazerzadeh considered repeated game of auctions where the seller knows that the value distribution can be either high meaning it is concentrated around high values or low and his goal is to find out from which distribution the valuations are drawn under the assumption that buyers can behave strategically finally the scenario considered here was first introduced by amin et al where the authors solve the problem of optimizing revenue against strategic buyer with fixed valuation and showed that buyer can achieve regret in mohri and medina later showed that one can in fact achieve regret in log closing the gap with the lower bound to factor of log the scenario of random valuations we consider here was also analyzed by amin et al where an algorithm achieving regret in was proposed when prices are offered from finite set with pd and free parameter finally an extension of this algorithm to the contextual setting was presented by the same authors in amin et where they provide an algorithm achieving regret the algorithms proposed by amin et al consist of alternating exploration and exploitation that is there exist rounds where the seller only tries to estimate the value of the buyer and other rounds where he uses this information to try to extract the largest possible revenue it is well known in the bandit literature dani and hayes abernethy et that algorithms that ignore information obtained on exploitation rounds tend to be indeed even in truthful scenario where the ucb algorithm auer et achieves regret in log the algorithm prop posed by amin et al achieves regret in log log for the optimal choice of which incidentally requires also access to the unknown value we propose instead an algorithm inspired by the ucb strategy using exploration and exploitation simultaneously we show that our algorithm admits regret that is in log log which matches the ucb bound in the truthful scenario and which depends on only through the additive term log 
known to be unavoidable amin et our results can not be directly compared with those of amin et al since they consider fully strategic adversary whereas we consider an adversary as we will see in the next section however the notion of adversary is in fact more natural than that of buyer who exactly optimizes his discounted surplus moreover it is not hard to show that when applied to our scenario perhaps modulo constant the algorithm of amin et al can not achieve better regret than in the fully strategic adversary setup we consider the following scenario similar to the one introduced by amin et al scenario buyer and seller interact for rounds at each round the seller attempts to sell some good to the buyer such as the right to display an ad the buyer receives valuation vt which is unknown to the seller and is sampled from distribution the seller offers price pt in response to which the buyer selects an action at with at indicating that he accepts the price and at otherwise we will say the buyer lies if he accepts the price at time at while the price offered is above his valuation vt pt or when he rejects the price at while his valuation is above the price offered vt pt the seller seeks to optimize his expected revenue over the rounds of interaction that is rev pt notice that when facing truthful buyer for any price the expected revenue of the seller is given by pd therefore with knowledge of the seller could set all prices pt to where pd since the actions of the buyer do not affect the choice of future prices by the seller the buyer has no incentive to lie and the seller will obtain an expected revenue of it is therefore natural to measure the performance of any revenue optimization algorithm in terms of the following notion of strategic regret regt rev max pd pt the objective of the seller coincides with the one assumed by kleinberg and leighton in the study of repeated interactions with buyers with random valuation however here we will allow the buyer to behave strategically which results in harder problem nevertheless the buyer is not assumed to be fully adversarial as in kleinberg and leighton instead we will assume as discussed in detail in the next section that the buyer seeks to approximately optimize his surplus which can be viewed as more natural assumption buyers here we define the family of buyers considered throughout this paper we denote by rt the vector xt and define the history of the game up to time by ht before the first round the seller decides on an algorithm for setting prices and this algorithm is announced to the buyer the buyer then selects strategy ht vt pt at for any value and strategy we define the buyer discounted expected surplus by sur at vt pt buyer minimizing this discounted surplus wishes to acquire the item as inexpensively as possible but does not wish to wait too long to obtain favorable price in order to optimize his surplus buyer must then solve markov decision process mdp indeed consider the scenario where at time the seller offers prices from distribution dt where is family of probability distributions over the interval the seller updates his beliefs as follows the current distribution dt is selected as function of the distribution at the previous round as well as the history ht which is all the information available to the seller more formally we let ft dt ht be transition function for the seller let st dt ht vt pt denote the state of the environment at time that is all the information available at time to the buyer finally let st st denote the maximum 
attainable expected surplus of buyer that is in state st at time it is clear that st will satisfy the following bellman equations st st max at vt pt dt ht ft dt ht ht at with the boundary condition st st vt pt definition buyer is said to be strategic if his action at time is solution of the bellman equation notice that depending on the choice of the family the number of states of the mdp solved by strategic buyer may be infinite even for deterministic algorithm that offers prices from finite set the number of states of this mdp would be in which quickly becomes intractable thus in view of the prohibitive cost of computing his actions the model of fully strategic buyer does not seem to be realistic we introduce instead the concept of buyers definition buyer is said to be if he behaves strategically except when no sequence of actions can improve upon the future surplus of the truthful sequence by more than or except for the first rounds for some depending only on the seller algorithm in which cases he acts truthfully we show in section that this definition implies the existence of such that an buyer only solves an mdp over the interval which becomes tractable problem for the parameter used in the definition is introduced to consider the unlikely scenario where buyer algorithm deliberately ignores all information observed during the rounds in which case it is optimal for the buyer to behave truthfully our definition is motivated by the fact that for buyer with bounded computational power there is no incentive in acting if the gain in surplus over truthful behavior is negligible regret analysis we now turn our attention to the problem faced by the seller the seller goal is to maximize his revenue when the buyer is truthful kleinberg and leighton have shown that this problem can be cast as continuous bandit problem in that scenario the strategic regret in fact coincides with the which is the quantity commonly minimized in stochastic bandit setting auer et bubeck and thus if the set of possible prices is finite the seller can use the ucb algorithm auer et al to minimize his in the presence of an buyer the rewards are no longer stochastic therefore we need to analyze the regret of seller in the presence of lies let denote finite set of prices offered by the seller define µp pd and µp for every price define also tp to be the number of times price has been offered up to time we will denote by and the corresponding quantities associated with the optimal price lemma let denote the number of times buyer lies for any the strategic regret of seller can be bounded as follows regt tp proof let lt denote the event that the buyer lies at round then the expected revenue of seller is given by at pt at pt where the last equality follows from the fact that when the buyer is truthful at moreover pt using the fact that we have µp tp µp tp pt pt since the regret of offering prices pfor which the seller is bounded by is bounded by it follows that the regret of tp we now define robust ucb algorithm for which we can bound the expectations tp for every price define bp pt pt tp to be the truep empirical mean of the reward that seller would obtain when facing truthful buyer let lt at denote the revenue obtained by the seller in rounds where the buyer lied notice that lt can be positive or negative finally let lt tp µp bp be the empirical mean obtained when offering price that is observed by the seller for the definition of our algorithm we will make use of the following upper confidence bound lp log bp tp tp we will use 
as shorthand for our algorithm selects the price pt that maximizes the quantity max µp bp we proceed to bound the expected number of times price is offered proposition let pt inequality holds tp log proof for any and define then µp bp lt tp log tp bp µp tp then the following pt and let if at time price is offered lt lt lt lp tp tp lt tp bp bp therefore if price is selected then at least one of the four terms in inequality must be positive log let notice that if tp then thus we can write tp hx pr pt tp this combined with the positivity of at least one of the four terms in yields tp pr bp µp pr bp µp pr pr pr pt we can now bound the probabilities appearing in as follows log pr bp µp pr tp lp tp lt tp µp log where the last inequality follows from an application of hoeffding inequality as well as the union bound similar argument can be made to bound the other term in using the definition of we then have tp log which completes the proof pt log pt corollary let denote the number of times buyer lies then the strategic regret of can be bounded as follows log regt pt notice that the choice of parameter of is subject to on the one hand should be small to minimize the first term of this regret bound on the other hand function pt pt is decreasing in therefore the term pt is beneficial for larger values of we now show that an buyer can only lie finite number of times which will imply the existence of an appropriate choice of for which we can ensure that pt thereby recovering the standard logarithmic regret of ucb proposition factor satisfies an buyer stops lying if the discounting after log rounds log proof after rounds for any sequence of actions at the surplus that can be achieved by the buyer in the remaining rounds is bounded by at vt pt for any sequence of actions thus by definition an buyer does not lie after rounds corollary if the ldiscounting factor satisfies and the seller uses the algorithm with log then the strategic regret of the seller is bounded by log log log log proof follows trivially from corollary and the previous proposition which implies that pt let us compare those of amin et al the regret bound given in amin et our results with is in where is parameter controlling the fraction of rounds used for exploration and in particular notice that the dependency of this bound on the cardinality of is quadratic instead of linear as in our case moreover the dependency on is in therefore even in truthful scenario where the dependency on remains polynomial whereas we recover the standard logarithmic regret only when the seller has access is strong requirement can he set the optimal value of to achieve regret in to which log log of course the algorithm proposed by amin et al assumes that the buyer is fully strategic whereas we only require the buyer to be however the authors assume that the distribution satisfies lipchitz condition which technically allows them to bound the number of lies in the same way as in proposition therefore the regret bound achieved by their algorithm remains the same in our scenario continuous pricing strategy thus far we have assumed that the prices offered by the buyer are selected out of discrete set in practice however the optimal price may not be within and therefore the algorithm described in the previous section might accumulate large regret when compared against the best price in in order to solve this problem we propose to discretize the interval and run our algorithm on the resulting discretization this induces since better discretization implies larger 
regret term in to find the optimal size of the discretization we follow the ideas of kleinberg and leighton and consider distributions that satisfy the condition that the function pd admits unique maximizer such that throughout this section we let and we consider the following finite set of prices pk ki we also let pk be an optimal price in pk that is pk and we let finally we denote by pk the gap with respect to price pk and by the corresponding gap with respect to the following theorem can be proven following similar ideas to those of kleinberg and leighton we defer its proof to the appendix theorem let log if the discounting factor satisfies and the seller and log then the strategic log uses the algorithm with the set of prices pk regret of the seller can be bounded as follows log max at pt log log log conclusion we introduced revenue optimization algorithm for auctions that is robust against buyers moreover we showed that our notion of strategic behavior is more natural than what has been previously studied our algorithm benefits from the optimal log regret bound for finite set of prices and admits regret in when the buyer is offered prices in scenario that had not been considered previously in the literature of revenue optimization against strategic buyers it is known that regret in is unattainable even in truthful setting but it remains an open problem to verify that the dependency on can not be improved our algorithm admits simple analysis and we believe that the idea of making truthful algorithms robust is general and can be extended to more complex auction mechanisms such as auctions with reserve acknowledgments we thank afshin rostamizadeh and umar syed for useful discussions about the topic of this paper and the nips reviewers for their insightful comments this work was partly funded by nsf and nsf references abernethy hazan and rakhlin competing in the dark an efficient algorithm for bandit linear optimization in proceedings of colt pp amin rostamizadeh and syed learning prices for repeated auctions with strategic buyers in proceedings of nips pp amin rostamizadeh and syed repeated contextual auctions with strategic buyers in proceedings of nips pp arora dekel and tewari online bandit learning against an adaptive adversary from regret to policy regret in proceedings of icml auer and fischer analysis of the multiarmed bandit problem machine learning bikhchandani and mccardle price discrimination by patient seller the journal of theoretical economics bubeck and regret analysis of stochastic and nonstochastic bandit problems foundations and trends in machine learning gentile and mansour regret minimization for reserve prices in auctions ieee transactions on information theory cole and roughgarden the sample complexity of revenue maximization in proceedings of stoc pp cui zhang li and mao bid landscape forecasting in online ad exchange marketplace in proceedings of sigkdd pp dani and hayes robbing the bandit less regret in online geometric optimization against an adaptive adversary in proceedings of soda pp edelman and ostrovsky strategic bidder behavior in sponsored search auctions decision support systems kanoria and nazerzadeh dynamic reserve prices for repeated auctions learning from bids in proceedings of wine pp kleinberg and leighton the value of knowing demand curve bounds on regret for online auctions in proceedings of focs pp lai and robbins asymptotically efficient adaptive allocation rules advances in applied mathematics milgrom and weber theory of auctions and 
competitive bidding. Econometrica: Journal of the Econometric Society. Milgrom. Putting auction theory to work. Cambridge University Press. Mohri and Medina. Learning theory and algorithms for revenue optimization in second-price auctions with reserve. In Proceedings of ICML. Mohri and Medina. Optimal regret minimization in auctions with strategic buyers. In Proceedings of NIPS. Myerson. Optimal auction design. Mathematics of Operations Research. Nachbar. Bayesian learning in repeated games of incomplete information. Social Choice and Welfare. Nachbar. Prediction, optimization, and learning in repeated games. Econometrica: Journal of the Econometric Society. Vickrey. Counterspeculation, auctions, and competitive sealed tenders. The Journal of Finance.
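To make the posted-price bandit loop from the regret analysis above concrete, the following Python sketch implements a standard UCB1-style rule over a finite price grid. It is a minimal illustration rather than the paper's algorithm: the robust variant analyzed above additionally widens the confidence term so that a bounded number of buyer lies cannot flip which price looks best. The price grid, the toy buyer-response function, and the exploration constant c are illustrative assumptions.

import math

def ucb_posted_price(prices, buyer_responds, T, c=1.0):
    """Posted-price revenue maximization over a finite price grid.

    Each price p is treated as an arm whose reward at round t is
    p * 1{buyer accepts p}.  This is the plain UCB1 rule; the robust
    variant discussed above uses a wider confidence term to absorb a
    bounded number of strategic lies.
    """
    n = {p: 0 for p in prices}        # times price p has been offered
    rev = {p: 0.0 for p in prices}    # cumulative revenue from price p
    history = []
    for t in range(1, T + 1):
        if t <= len(prices):
            p = prices[t - 1]         # offer every price once to initialize
        else:
            p = max(prices,
                    key=lambda q: rev[q] / n[q] + c * math.sqrt(math.log(t) / n[q]))
        accepted = buyer_responds(p, t)   # buyer may accept or (strategically) reject
        n[p] += 1
        rev[p] += p if accepted else 0.0
        history.append((p, accepted))
    return history

# toy run against a truthful buyer with private valuation 0.55
offers = ucb_posted_price(prices=[0.2, 0.4, 0.6, 0.8],
                          buyer_responds=lambda p, t: p <= 0.55,
                          T=1000)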
deep convolutional inverse graphics network tejas kulkarni william whitney pushmeet joshua massachusetts institute of technology cambridge usa microsoft research cambridge uk tejask wwhitney pkohli jbt first two authors contributed equally and are listed alphabetically abstract this paper presents the deep convolution inverse graphics network dcign model that aims to learn an interpretable representation of images disentangled with respect to scene structure and viewing transformations such as depth rotations and lighting variations the dcign model is composed of multiple layers of convolution and operators and is trained using the stochastic gradient variational bayes sgvb algorithm we propose training procedure to encourage neurons in the graphics code layer to represent specific transformation pose or light given single input image our model can generate new images of the same object with variations in pose and lighting we present qualitative and quantitative tests of the model efficacy at learning rendering engine for varied object classes including faces and chairs introduction deep learning has led to remarkable breakthroughs in learning hierarchical representations from images models such as convolutional neural networks cnns restricted boltzmann machines and have been successfully applied to produce multiple layers of increasingly abstract visual representations however there is relatively little work on characterizing the optimal representation of the data while cohen et al have considered this problem by proposing theoretical framework to learn irreducible representations with both invariances and equivariances coming up with the best representation for any given task is an open question various work has been done on the theory and practice of representation learning and from this work consistent set of desiderata for representations has emerged invariance interpretability abstraction and disentanglement in particular bengio et al propose that disentangled representation is one for which changes in the encoded data are sparse over transformations that is changes in only few latents at time should be able to represent sequences which are likely to happen in the real world the vision as inverse graphics paradigm suggests representation for images which provides these features computer graphics consists of function to go from compact descriptions of scenes the graphics code to images and this graphics code is typically disentangled to allow for rendering scenes with control over transformations such as object location pose lighting texture and shape this encoding is designed to easily and interpretably represent sequences of real data so that common transformations may be compactly represented in software code this criterion is conceptually identical to that of bengio et and graphics codes conveniently align with the properties of an ideal representation zi graphics code unpooling nearest neighbor convolution convolution pooling pose observed image filters kernel size ks filters ks filters ks shape light filters ks filters ks filters ks encoder decoder renderer figure model architecture deep convolutional inverse graphics network has an encoder and decoder we follow the variational autoencoder architecture with variations the encoder consists of several layers of convolutions followed by maxpooling and the decoder has several layers of unpooling upsampling using nearest neighbors followed by convolution during training data is passed through the encoder to produce the posterior 
approximation zi where zi consists of scene latent variables such as pose light texture or shape in order to learn parameters in gradients are using stochastic gradient descent using the following variational object function kl zi zi for every zi we can force to learn disentangled representation by showing with set of inactive and active transformations face rotating light sweeping in some direction etc during test data can be passed through the encoder to get latents zi images can be to different viewpoints lighting conditions shape variations etc by setting the appropriate graphics code group zi which is how one would manipulate an graphics engine recent work in inverse graphics follows general strategy of defining probabilistic with latent parameters then using an inference algorithm to find the most appropriate set of latent parameters given the observations recently tieleman et al moved beyond this pipeline by using generic encoder network and decoder network to approximate rendering function however none of these approaches have been shown to automatically produce graphics code and to learn rendering engine to reproduce images in this paper we present an approach which attempts to learn interpretable graphics codes for complex transformations such as rotations and lighting variations given set of images we use hybrid model to learn representation that is disentangled with respect to various transformations such as object rotations and lighting variations we employ deep directed graphical model with many layers of convolution and operators that is trained using the stochastic gradient variational bayes sgvb algorithm we propose training procedure to encourage each group of neurons in the graphics code layer to distinctly represent specific transformation to learn disentangled representation we train using data where each has set of active and inactive transformations but we do not provide target values as in supervised learning the objective function remains reconstruction quality for example nodding face would have the elevation transformation active but its shape texture and other transformations would be inactive we exploit this type of training data to force chosen neurons in the graphics code layer to specifically represent active transformations thereby automatically creating disentangled representation given single face image our model can the input image with different pose and lighting we present qualitative and quantitative results of the model efficacy at learning rendering engine corresponds to 𝜙l intrinsic properties shape texture etc figure structure of the representation vector is the azimuth of the face is the elevation of the face with respect to the camera and φl is the azimuth of the light source related work as mentioned previously a𝜙number of generative models have been proposed in the ture to obtain abstract visual representations unlike most models our approach is trained using with objective function consisting of data reconstruction and the variational bound relatively recently kingma et al proposed the sgvb algorithm to learn generative models with continuous latent variables in this work neural network encoder is used to approximate the posterior distribution and decoder network serves to enable stochastic reconstruction of observations in order to handle geometry of faces we work with relatively large scale images pixels our approach extends and applies the sgvb algorithm to jointly train and utilize many layers of convolution and operators for the encoder 
and decoder network respectively the decoder network is function that transform compact graphics code dimensions to image we propose using unpooling nearest neighbor sampling followed by convolution to handle massive increase in dimensionality with manageabletonumber of fromthe encoder encoder parameters output backpropagation recently proposed using cnns to generate images given parameters in supervised setting as their approach requires labels for the graphics code layer it can not be directly applied to image interpretation tasks our work is similar to ple in batch ranzato et al whosez work was amongst the first to use generic architecture for feature learning however in comparison to our proposal their model was trained the intermediate representations were not disentangled like graphics code and their approach does not use the variational loss to approximate thesignal zero error unique for each posterior distribution our work also similar in spirit to but in comparison our for clamped outputs same as output for xi in batch model does not assume lambertian reflectance model and implicitly constructs the representations another piece of related work is desjardins et al who used spike and slab prior to factorize representations in generative deep network les in batch xi in comparison to existing approaches it is important to note that our encoder network produces the interpretable and disentangled representations necessary to learn meaningful of inspired zmethods engine number have recently beenz proposed in the literature however most such methods rely on rendering engines the exception to this is work by hinton et al and tieleman on transforming autoencoders which use decoder to reconstruct input images zero ourerror worksignal is similar in spirit to these works but has some key differences it uses very generic for clamped convolutional architecture in the encoder and decoder networks to enable efficient learningoutputs on large datasets and image sizes it can handle single static frames as opposed to pair of images required in and it is generative model caption training on minibatch in which only the azimuth angle of the face as shown in figure the basic structure of the deep convolutional inverse graphics netchanges work consists of two parts an encoder network which captures distribution over graphics codes given datastep and the decoder network conditional during the forward output from which each learns component distribution of the to produce an approximation given can be disentangled representation containing encoder to be thez same for each sample in the batch this reflects the fact factored setisofforced latent variables such as pose light and shape this is important that the generating variables of the image which correspond to the desired values of these latents are unchanged throughout the batch by holding these outputs constant throughout the batch is forced to explain all the variance within the forward backward decoder decoder outi mean zki batch clamped unclamped encoder grad zki mean zki batch encoder figure training on minibatch in which only the azimuth angle of the face changes during the forward step the output from each component zi of the encoder is altered to be the same for each sample in the batch this reflects the fact that the generating variables of the image the identity of the face which correspond to the desired values of these latents are unchanged throughout the batch by holding these outputs constant throughout the batch the single neuron is forced 
to explain all the variance within the batch the full range of changes to the image caused by changing during the backward step is the only neuron which receives gradient signal from the attempted reconstruction and all zi receive signal which nudges them to be closer to their respective averages over the batch during the complete training process after this batch another batch is selected at random it likewise contains variations of only one of φl intrinsic all neurons which do not correspond to the selected latent are clamped and the training proceeds in learning meaningful approximation of graphics engine and helps tease apart the generalization capability of the model with respect to different types of transformations let us denote the encoder output of to be ye encoder the encoder output is used to parametrize the variational approximation zi where is chosen to be multivariate normal distribution there are two reasons for using this parametrization gradients of samples with respect to parameters of can be easily obtained using the reparametrization trick proposed in and various statistical shape models trained on scanner data such as faces have the same multivariate normal latent distribution given that model parameters we connect ye and zi the distribution parameters µzi σzi and latents can then be expressed as µz we ye σz diag exp we ye zi µzi σzi we present novel training procedure which allows networks to be trained to have disentangled and interpretable representations training with specific transformations the main goal of this work is to learn representation of the data which consists of disentangled and semantically interpretable latent variables we would like only small subset of the latent variables to change for sequences of inputs corresponding to events one natural choice of target representation for information about scenes is that already designed for use in graphics engines if we can deconstruct face image by splitting it into variables for pose light and shape we can trivially represent the same transformations that these variables are used for in graphics applications figure depicts the representation which we will attempt to learn with this goal in mind we perform training procedure which directly targets this definition of disentanglement we organize our data into corresponding to changes in only single scene variable azimuth angle elevation angle azimuth angle of the light figure manipulating light and elevation variables qualitative results showing the generalization capability of the learned decoder to single input image with different pose directions we change the latent zlight smoothly leaving all other latents unchanged we change the latent zelevation smoothly leaving all other latents unchanged source these are transformations which might occur in the real world we will term these the extrinsic variables and they are represented by the components of the encoding we also generate in which the three extrinsic scene variables are held fixed but all other properties of the face change that is these batches consist of many different faces under the same viewing conditions and pose these intrinsic properties of the model which describe identity shape expression are represented by the remainder of the latent variables these varying intrinsic properties are interspersed stochastically with those varying the extrinsic properties we train this representation using sgvb but we make some key adjustments to the outputs of the encoder and the gradients which train it the 
procedure figure is as follows select at random latent variable ztrain which we wish to correspond to one of azimuth angle elevation angle azimuth of light source intrinsic properties select at random in which that only that variable changes show the network each example in the minibatch and capture its latent representation for that example calculate the average of those representation vectors over the entire batch before putting the encoder output into the decoder replace the values zi ztrain with their averages over the entire batch these outputs are clamped calculate reconstruction error and backpropagate as per sgvb in the decoder replace the gradients for the latents zi ztrain the clamped neurons with their difference from the mean see section the gradient at ztrain is passed through unchanged continue backpropagation through the encoder using the modified gradient since the intrinsic representation is much than the extrinsic ones it requires more training accordingly we select the type of batch to use in ratio of about azimuth elevation lighting intrinsic we arrived at this ratio after extensive testing and it works well for both of our datasets this training procedure works to train both the encoder and decoder to represent certain properties of the data in specific neuron by clamping the output of all but one of the neurons we force the decoder to recreate all the variation in that batch using only the changes in that one neuron value by clamping the gradients we train the encoder to put all the information about the variations in the batch into one output neuron this training method leads to networks whose latent variables have strong equivariance with the corresponding generating parameters as shown in figure this allows the value of the true generating parameter the true angle of the face to be trivially extracted from the encoder invariance targeting by training with only one transformation at time we are encouraging certain neurons to contain specific information this is equivariance but we also wish to explicitly discourage them from having other information that is we want them to be invariant to other transformations since our of training data consist of only one transformation per batch then this goal corresponds to having all but one of the output neurons of the encoder give the same output for every image in the batch to encourage this property of the we train all the neurons which correspond to the inactive transformations with an error gradient equal to their difference from the mean it is simplest to think about this gradient as acting on the set of subvectors zinactive from the encoder for each input in the batch each of these zinactive will be pointing to but not identical point in space the invariance training signal will push them all closer together we don care where they are the network can represent the face shown in this batch however it likes we only care that the network always represents it as still being the same face no matter which way it facing this regularizing force needs to be scaled to be much smaller than the true training signal otherwise it can overwhelm the reconstruction goal empirically factor of works well experiments we trained our model on about batches of faces generated from face model obtained from paysan et al where each batch consists of faces with random variations on face identity variables pose or lighting we used the rmsprop learning algorithm during training and set the meta learning rate equal to the momentum decay to and weight 
decay to to ensure that these techniques work on other types of data we also trained networks to perform reconstruction on images of widely varied chairs from many perspectives derived from the pascal visual object classes dataset as extracted by aubry et al this task tests the ability of the to learn rendering function for dataset with high variation between the elements of the set the chairs vary from office chairs to wicker to modern designs and viewpoints span degrees and two elevations these networks were trained with the same methods and parameters as the ones above figure manipulating azimuth pose variables qualitative results showing the generalization capability of the learnt decoder to render original static image with different azimuth pose directions the latent neuron zazimuth is changed to random values but all other latents are clamped face dataset the decoder network learns an approximate rendering engine as shown in figures given static test image the encoder network produces the latents depicting scene variables such as light pose shape etc similar to an rendering engine we can independently control these to generate new images with the decoder for example as shown in figure given the original test image we can vary the lighting of an image by keeping all the other latents constant and varying zlight it is perhaps surprising that the decoder network is able to function as rendering engine figure generalization of decoder to render images in novel viewpoints and lighting conditions we generated several datasets by varying light azimuth and elevation and tested the invariance properties of representation we show quantitative performance on three network configurations as described in section all encoder networks reasonably predicts transformations from static test images interestingly as seen in the encoder network seems to have learnt switch node to separately process azimuth on left and right profile side of the face we also quantitatively illustrate the network ability to represent pose and light on smooth linear manifold as shown in figure which directly demonstrates our training algorithm ability to disentangle complex transformations in these plots the inferred and transformation values are plotted for random subset of the test set interestingly as shown in figure the encoder network representation of azimuth has discontinuity at facing straight forward comparison with entangled representations to explore how much of difference the training procedure makes we compare the reconstruction performance of networks with entangled representations baseline versus disentangled representations the baseline network is identical in every way to the but was trained with sgvb without using our proposed training procedure as in figure we feed each network single input image then attempt to use the decoder to this image at different azimuth angles to do this we first must figure out which latent of the entangled representation most closely corresponds to the azimuth this we do rather simply first we encode all images in an batch using the baseline encoder then we calculate the variance of each of the latents over this batch the latent with the largest variance is then the one most closely associated with the azimuth of the face and we will call it zazimuth once that is found the latent zazimuth is varied for both the models to render novel view of the face given single image of that face figure shows that explicit disentanglement is critical for reconstruction figure entangled versus 
disentangled representations first column original images second column transformed image using third column transformed image using network chair dataset we performed similar set of experiments on the chairs dataset described above this dataset contains still images rendered from cad models of different chairs each model skinned with the photographic texture of the real chair each of these models is rendered in different poses at each of two elevations there are images taken from degrees around the model we used approximately of these chairs in the training set and the remaining in the test set as such the networks had never seen the chairs in the test set from any angle so the tests explore the networks ability to generalize to arbitrary figure manipulating rotation each row was generated by encoding the input image leftmost with the encoder then changing the value of single latent and putting this modified encoding through the decoder the network has never seen these chairs before at any orientation some positive examples note that the is making conjecture about any components of the chair it can not see in particular it guesses that the chair in the top row has arms because it can see that it doesn examples in which the network extrapolates to new viewpoints less accurately chairs we resized the images to pixels and made them grayscale to match our face dataset we trained these networks with the azimuth flat rotation of the chair as disentangled variable represented by single node all other variation between images is undifferentiated and represented by the network succeeded in achieving error mse of reconstruction of on the test set each image has grayscale values in the range and is pixels in figure we have included examples of the network ability to previouslyunseen chairs at different angles given single image for some chairs it is able to render fairly smooth transitions showing the chair at many intermediate poses while for others it seems to only capture sort of keyframes representation only having distinct outputs for few angles interestingly the task of rotating chair seen only from one angle requires speculation about unseen components the chair might have arms or not curved seat or flat one etc discussion we have shown that it is possible to train deep convolutional inverse graphics network with fairly disentangled interpretable graphics code layer representation from static images by utilizing deep convolution and architecture within variational autoencoder formulation our model can be trained using on the stochastic variational objective function we proposed training procedure to force the network to learn disentangled and interpretable representations using face and chair analysis as working example we have demonstrated the invariant and equivariant characteristics of the learned representations acknowledgements we thank thomas vetter for access to the basel face model we are grateful for support from the mit center for brains minds and machines cbmm we also thank geoffrey hinton and ilker yildrim for helpful feedback and discussions references aubry maturana efros russell and sivic seeing chairs exemplar partbased alignment using large dataset of cad models in cvpr in machine learning bengio learning deep architectures for ai foundations and trends bengio courville and vincent representation learning review and new perspectives pattern analysis and machine intelligence ieee transactions on cohen and welling learning the irreducible representations of commutative lie 
groups. arXiv preprint. Desjardins, Courville, and Bengio. Disentangling factors of variation via generative entangling. arXiv preprint. Dosovitskiy, Springenberg, and Brox. Learning to generate chairs with convolutional neural networks. Goodfellow, Lee, Le, Saxe, and Ng. Measuring invariances in deep networks. In Advances in Neural Information Processing Systems. Hinton, Osindero, and Teh. A fast learning algorithm for deep belief nets. Neural Computation. Hinton, Krizhevsky, and Wang. Transforming auto-encoders. In Artificial Neural Networks and Machine Learning. Springer. Kingma and Welling. Auto-encoding variational Bayes. arXiv preprint. Kulkarni, Kohli, Tenenbaum, and Mansinghka. Picture: a probabilistic programming language for scene perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Kulkarni, Mansinghka, Kohli, and Tenenbaum. Inverse graphics with probabilistic CAD models. arXiv preprint. LeCun and Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks. Lee, Grosse, Ranganath, and Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the Annual International Conference on Machine Learning. ACM. Mansinghka, Kulkarni, Perov, and Tenenbaum. Approximate Bayesian image interpretation using generative probabilistic graphics programs. In Advances in Neural Information Processing Systems. Mottaghi, Chen, Liu, Cho, Lee, Fidler, Urtasun, and Yuille. The role of context for object detection and semantic segmentation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Paysan, Knothe, Amberg, Romdhani, and Vetter. A 3D face model for pose and illumination invariant face recognition. Genova, Italy. IEEE. Ranzato, Huang, Boureau, and LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. IEEE. Salakhutdinov and Hinton. Deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics. Tang, Salakhutdinov, and Hinton. Deep Lambertian networks. arXiv preprint. Tieleman. Optimizing neural networks that generate images. PhD thesis, University of Toronto. Tieleman and Hinton. Lecture: RMSProp. Coursera: Neural Networks for Machine Learning. Vincent, Larochelle, Lajoie, Bengio, and Manzagol. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research.
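The clamping scheme in the training procedure described above can be summarized in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: z is the matrix of encoder outputs for a minibatch in which only one scene variable changes, active_idx marks the latent unit designated for that variable, and invariance_scale is an assumed small constant standing in for the (unspecified here) down-weighting factor that keeps the invariance signal much weaker than the reconstruction signal.

import numpy as np

def clamp_latents_forward(z, active_idx):
    """Replace every latent except the active one by its minibatch mean,
    so the decoder must explain all within-batch variation through the
    single active unit."""
    z_clamped = np.repeat(z.mean(axis=0, keepdims=True), z.shape[0], axis=0)
    z_clamped[:, active_idx] = z[:, active_idx]
    return z_clamped

def clamp_latents_backward(z, grad_z, active_idx, invariance_scale=0.01):
    """Pass the reconstruction gradient through the active latent only;
    every clamped latent instead receives its difference from the batch
    mean, scaled down so this regularizing force does not overwhelm the
    reconstruction goal."""
    grad = invariance_scale * (z - z.mean(axis=0, keepdims=True))
    grad[:, active_idx] = grad_z[:, active_idx]
    return grad

# toy batch: 8 encodings of the same face under different azimuths,
# with latent 0 designated as the azimuth unit
z = np.random.randn(8, 5)
grad_from_decoder = np.random.randn(8, 5)
z_in = clamp_latents_forward(z, active_idx=0)
g_out = clamp_latents_backward(z, grad_from_decoder, active_idx=0)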
sparse and low-rank tensor decomposition
Parikshit Shah, Nikhil Rao, Gongguo Tang

abstract
Motivated by the problem of robust factorization of a low-rank tensor, we study the question of sparse and low-rank tensor decomposition. We present an efficient computational algorithm that modifies Leurgans' algorithm for tensor factorization. Our method relies on a reduction of the problem to sparse and low-rank matrix decomposition via the notion of tensor contraction. We use convex techniques for solving the reduced matrix problems, which then allows us to perform the full decomposition of the tensor. We delineate situations where the problem is recoverable and provide theoretical guarantees for our algorithm. We validate our algorithm with numerical experiments.

introduction
Tensors are useful representational objects for modeling a variety of problems, such as graphical models with latent variables, audio classification, psychometrics, and neuroscience. One concrete example involves topic modeling in an exchangeable bag-of-words model, wherein, given a corpus of documents, one wishes to estimate parameters related to the topics of the documents, with each document having a unique topic associated to it. By computing the empirical moments associated with exchangeable pairs and triples of words in the documents, this problem reduces to that of low-rank tensor decomposition. A number of other machine learning tasks, such as independent component analysis and learning Gaussian mixtures, are likewise reducible to tensor decomposition. While most tensor problems are computationally intractable, there has been renewed interest in developing tractable and principled approaches for them.

In this paper we consider the problem of performing tensor decompositions when a subset of the entries of a low-rank tensor is corrupted adversarially, so that the observed tensor is the sum of the low-rank tensor and the corruption. One may view this as the tensor version of the sparse and low-rank matrix decomposition problem studied in the matrix setting. We develop an algorithm for performing such a decomposition and provide theoretical guarantees as to when the decomposition is possible. Our work draws on two sets of tools: the line of work addressing the robust PCA problem in the matrix case, and the application of Leurgans' algorithm to tensor decomposition and tensor inverse problems.

Our algorithm is computationally efficient and scalable. It relies on the key notion of tensor contraction, which effectively reduces the tensor problem to four sparse and low-rank decomposition problems for matrices whose sizes are governed by the individual modes of the tensor. One can then apply convex methods for sparse and low-rank matrix decomposition, followed by certain linear algebraic operations, to recover the constituent tensors. Our algorithm not only produces the correct decomposition into the low-rank and sparse components, but also produces the low-rank factorization of the low-rank component. We are thereby able to avoid tensor-unfolding-based approaches, which are expensive and would lead to solving convex problems that are larger by orders of magnitude (in the third-order case, the unfolded matrix has one dimension equal to the product of two of the tensor's dimensions). Furthermore, our method is conceptually simple to implement as well as to analyze theoretically. Finally, our method is modular: it can be extended to the higher-order case, as well as to settings where the corrupted tensor has missing entries, as described in the extensions section.

problem setup
In this paper, vectors are denoted using lower-case characters, matrices by upper-case characters, and tensors by bold characters. We work with tensors of third order, representationally to be thought of as three-dimensional arrays, and the term "mode" refers to one of the axes of the tensor. A slice of a tensor refers to the
two dimensional matrix generated from the tensor by varying indices along two modes while keeping the third mode fixed for tensor we will refer to the indices of the ith slice the slice corresponding to the indices by si where and is defined similarly we denote the matrix corresponding to si by similarly the indices of the th slice will be denoted by sk and the matrix by given tensor of interest consider its decomposition into rank one tensors vi where ui vi and wi are unit vectors here denotes the tensor product so that is tensor of order and dimension without loss of generality throughout this paper we assume that we will present our results for third order tensors and analogous results for higher orders follow in transparent manner we will be dealing with tensors those tensors with tensors can have rank larger than the dimension indeed is an interesting regime but far more challenging and is topic left for future work kruskal theorem guarantees that tensors satisfying assumption below have unique minimal decomposition into rank one terms of the form the number of terms is called the kruskal rank assumption ui vi and wi are sets of linearly independent vectors while rank decomposition of tensors in the worst case is known to be computationally intractable it is known that the mild assumption stated in assumption above suffices for an algorithm known as leurgans algorithm to correctly identify the factors in this unique decomposition in this paper we will make this assumption about our tensor throughout this assumption may be viewed as genericity or smoothness assumption in is the rank λi are scalars and ui vi wi are the tensor factors let denote the matrix whose columns are ui and correspondingly define and let be sparse tensor to be viewed as corruption or adversarial noise added to so that one observes the problem of interest is that of decomposition recovering xand from for tensor we define its contraction with respect to contraction vector denoted by as the following matrix xa ij xijk ak so that the resulting matrix is weighted sum of the slices of the tensor under this notation the th slice matrix is contraction with respect to the th canonical basis vector we similarly define the contraction with respect to vector as xc jk xijk ci in the subsequent discussion we will also use the following notation for matrix km refers to the spectral norm km the nuclear norm km the elementwise norm and km maxi the elementwise norm incoherence the problem of sparse and decomposition for matrices has been studied in and it is well understood that exact decomposition is not always possible in order for the problem to be identifiable two situations must be avoided the component must not be sparse and the sparse component must not be in fact something stronger is both necessary and sufficient the tangent spaces of the matrix with respect to the rank variety and the sparse matrix with respect to the variety of sparse matrices must have transverse intersection for the problem to be amenable to recovery using comptationally tractable convex methods somewhat stronger incoherence assumptions are standard in the matrix case we will make similar assumptions for the tensor case which we now describe given the decomposition of we define the following subspaces of matrices tu at bv tv dw thus tu is the set of rank matrices whose column spaces are contained in span or row spaces are contained in span respectively and similar definition holds for tv and matrices if is rank matrix with column space span and row 
space span tu is the tangent space at with respect to the variety of rank matrices for tensor the support of refers to the indices corresponding to the entries of let denote the support of further for slice let ωi denote the corresponding sparsity pattern of the slice more generally ωi can be defined as th the sparsity of the matrix resulting from the mode slice when tensor contraction of is computed along mode the sparsity of the resulting matrix is the union of the sparsity patterns of snk each matrix slice ωi let denote the set of sparse matrices with support we define the following incoherence parameters max km km max max km km kn kn the quantities and being small implies that for contractions of the tensor all matrices in the tangent space of those contractions with respect to the of rank matrices are diffuse do not have sparse elements similarly being small implies that all matrices with the contracted sparsity pattern are such that their spectrum is diffuse they do not have low rank we will see specific settings where these forms of incoherence hold for tensors in section algorithm for sparse and low rank tensor decomposition we now introduce our algorithm to perform sparse and low rank tensor decompositions we begin with lemma lemma let with be tensor of rank then the rank of is at most similarly the rank of is at most pr proof consider tensor λi ui vi wi the reader may verify in straightforward manner that enjoys the decomposition λi hwi aiui vit the proof for the rank of is analogous note that while is matrix decomposition of the contraction it is not singular value decomposition the components need not be orthogonal for instance recovering the factors needs an application of simultaneous diagonalization which we describe next pr lemma suppose we are given an order tensor λi ui vi wi of size satisfying the conditions of assumption suppose the contractions and are computed with respect to unit vectors distributed independently and uniformly on the unit sphere and consider the matrices and formed as then the eigenvectors of corresponding to the eigenvalues are ui and the eigenvectors of are vi remark note that while the eigenvectors ui vj are thus determined source of ambiguity remains for fixed ordering of ui one needs to determine the order in which vj are to be arranged this can be generically achieved by using the common eigenvalues of and for pairing the contractions are computed with respect to random vectors the eigenvalues are distinct almost surely since the eigenvalues of are distinct they can be used to pair the columns of and lemma is essentially simultaneous diagonalization result that facilitates tensor decomposition given tensor one can compute two contractions for mode and apply simultaneous diagonalization as described in lemma this would yield the factors vi wi up to sign and reordering one can then repeat the same process with mode contractions to obtain ui vi in the final step one can then obtain λi by solving system of linear equations the full algorithm is described in algorithm in the supplementary material for contraction zvk of tensor with respect to vector along mode consider solving the convex problem minimize kx νk subject to zvk our algorithm stated in algorithm proceeds as follows given tensor we perform two random contractions vectors of the tensor along mode to obtain matrices za zb since is sum of sparse and components by lemma so are the matrices za zb we thus use to decompose them into constituent sparse and components which are the contractions 
of the matrices xa xb ya yb we then use xa xb and lemma to obtain the factors we perform the same operations along mode to obtain factors in the last step we solve for the scale factors λi system of linear equations algorithm in the supplementary material which we adopt for our decomposition problem in algorithm essentially relies on the idea of simultaneous diagonalization of matrices sharing common row and column spaces in this paper we do not analyze the situation where random noise is added to all the entries but only the sparse adversarial noise setting we note however that the key algorithmic insight of using contractions to perform tensor recovery is numerically stable and robust with respect to noise as has been studied in parameters that need to be picked to implement our algorithm are the regularization coefficients in the theoretical guarantees we will see that this can be picked in stable manner and that range of values guarantee exact decomposition when the suitable incoherence conditions hold in practice these coefficents would need to be determined by method note also that under suitable random sparsity assumptions the regularization coefficient may be picked to be the inverse of the of the dimension computational complexity the computational complexity of our algorithm is dominated by the complexity of perfoming the sparse and matrix decomposition of the contractions via for simplicity let us consider algorithm algorithm for sparse and low rank tensor decomposition input tensor parameters generate contraction vectors independently and uniformly distributed on unit sphere compute mode contractions and respectively solve the convex problem with call the resulting solution matrices and regularization parameter solve the convex problem with call the resulting solution matrices and regularization parameter compute of and let and denote the matrices whose columns are the eigenvectors of and respectively corresponding to the eigenvalues in sorted order let be the common rank of and the eigenvectors thus arranged are denoted as ui and vi generate contraction vectors independently and uniformly distributed on unit sphere solve the convex problem with call the resulting solution matrices and regularization parameter solve the convex problem with call the resulting solution matrices and regularization parameter compute of and let and denote the matrices whose columns are the eigenvectors of and respectively corresponding to the eigenvalues in sorted order let be the common rank of and simultaneously reorder the columns of also performing simultaneous sign reversals as necessary so that the columns of and are equal call the resulting matrix with columns wi solve for λi in the linear system λi ui vit hwi ai output decomposition pr λi ui vi wi the case where the target tensor has equal dimensions in different modes using standard first order method the solution of has per iteration complexity of and to achieve an accuracy of iterations since only four such steps need be performed the complexity of the method is where is the accuracy to which is solved another alternative is to reformulate such that it is amenable to greedy atomic approaches which yields an order of magnitude improvement we note that in contrast tensor unfolding for this problem results in the need to solve much larger convex programs for instance for the resulting flattened matrix be of size and the resulting convex lem would then have complexity of for higher order tensors the gap in computational complexity would 
increase by further orders of numerical experiments we now present numerical results to validate our approach we perform experiments for tensors of size tensor is generated as the sum of low rank tensor and sparse tensor the component is generated as follows three sets of unit vecots ui vi wi are generated randomly independently and uniformly distributed on the unit sphere also random positive scale factor uniformly distributed on is chosen and the tensor λi ui vi wi the tensor is generated by bernoulli randomly sampling its entries with probability for each such we perform trials and apply our algorithm in all our experiments the regularization parameter was picked to be the optimization problem is solved using cvx in matlab we report success if the mse is smaller than separately for both the and components we plot the empirical probability of success as function of in fig for multiple values of the true rank in fig we test the scalability recovery sparsity sparsity inexact recoveries recovery inexact recoveries corruption sparsity corruption sparsity low rank sparse component low rank sparse component nent nent figure recovery of the low rank and sparse components from our proposed methods in figures and we see that the probability of recovery is high when both the rank and sparsity are low in figures and we study the recovery error for tensor of dimensions and rank of our method by generating random tensor of rank and corrupting it with sparse tensor of varying sparsity level we run independent trials and see that for low levels of corruption both the low rank and sparse components are accurately recovered by our method main results we now present the main rigorous guarantees related to the performance of our algorithm due to space constraints the proofs are deferred to the supplementary materials pr theorem suppose where λi ui vi wi has rank and such that the factors satisfy assumption suppose has support and the following condition is satisfied then algoritm succeeds in exactly recovering the component tensors whenever νk are picked so that and specifically choice of and for any in these respective intervals guarantees exact recovery for matrix the degree of denoted by deg is the maximum number of in any row or column of for tensor we define the degree along mode denoted by degk to be the maximum number of entries in any row or column of matrix supported on defined in section the degree of is denoted by deg degk lemma we have deg for all for subspace rn let us define the incoherence of the subspace as maxkps ei where ps denotes the projection operator onto ei is standard unit vector and is the euclidean norm of vector let us define inc max span span span max span span max span span note that inc always for many random ensembles qof interest we have that the incoherence scales gracefully with the dimension inc max log lemma we have inc inc pr corollary let with λi ui vi wi and rank the factors satisfy assumption and incoherence inc suppose is sparse and has degree deg if the condition inc deg holds then algorithm successfully recovers the true solution when the parameters specifically choice of that guarantees exact recovery for any is valid choice remark note that corollary presents deterministic guarantee on the recoverability of sparse corruption of low rank tensor and can be viewed as tensor extension of corollary we now consider for the sake of simplicity tensors of uniform dimension we show that when the and sparse components are suitably random the approach outlined in algorithm 
achieves exact recovery we define the random sparsity model to be one where each entry of the tensor is independently and with identical probability we make no assumption about the mangitude of the entries of only that its entries are thus sampled pr lemma let λi ui vi wi where ui vi wi are uniformly randomly distributed on the unit sphere then the incoherence of the tensor satisifies max log inc with probability exceeding log for some constants suppose the are sampled according to the random sparsity model and then the tensor satisfies deg max log max log with probability exceeding exp max log for some constant corollary let where is low rank with random factors as per the conditions of lemma and is sparse with random support as per the conditions in lemma provided algorithm successfully recovers the correct decomposition with probability exceeding for some remarks under this sampling model the cardinality of the support of is allowed to be as large as when the rank is constant independent of we could equivalently have looked at uniformly random sampling model one where support set of size is chosen uniformly randomly from the set of all possible support sets of cardinality at most and our results for exact recovery would have gone through this follows from the equivalence principle for successful recovery between bernoulli sampling and uniform sampling see appendix note that for the random sparsity ensemble shows that choice of ensures exact recovery an additional condition regarding the magnitudes of the factors is needed however by extension the same choice can be shown to work for our setting extensions the approach described in algorithm and the analysis is quite modular and can be adapted to various settings to account for different forms of measurements and robustness models we do not present an analysis of these situations due to space constraints but outline how these extensions follow from the current development in straightforward manner higher order tensors algorithm can be extended naturally to the higher order setting recall that in the third order case one needs to recover two contractions along the third mode to discover factors and then two contractions along the first mode to discover factors for an order tensor of the form which is the sum of low rank component pr nk λi ui and sparse component one needs to compute higher order contractions of along different modes for each of these modes the resulting contraction is the sum of sparse and matrix and thus pairs of matrix problems of the form reveal the sparse and components of the contractions the factors can then be recovered via application of lemma and the full decomposition can thus be recovered the same guarantees as in theorem and corollary hold verbatim the notions of incoherence inc and degree deg of tensors need to be extended to the higher order case in the natural way block sparsity situations where entire slices of the tensor are corrupted may happen in recommender systems with adversarial ratings natural approach in this case is to use convex relaxation of the form minimize νk zvk in place of in algorithm in the above km kmi where mi is the ith column of since exact recovery of the and components of the contractions are guaranteed via this relaxation under suitable assumptions the algorithm would inherit associated provable guarantees subject to tensor completion in applications such as recommendation systems it may be desirable to perform tensor completion in the presence of sparse corruptions in an 
adaptation of leurgans algorithm was presented for performing completion from measurements restricted to only four slices of the tensor with sample complexity under suitable genericity assumptions about the tensor we note that it is straightforward to blend algorithm with this method to achieve completion with sparse corruptions recalling that and therefore the th mode slice of sum of sparse and low rank slices of and if only subset of elements of say pλ is observed for some index set we can replace in algorithm with minimize νk subject to pλ zvk pλ under suitable incoherence assumptions theorem the above will achieve exact recovery of the slices once four slices are accurately recovered one can then use leurgans algorithm to recover the full tensor theorem indeed the above idea can be extended more generally to the concept of deconvolving sum of sparse and tensors from separable measurements approaches basic primitive for sparse and tensor decomposition used in this paper is that of using for matrix decomposition more efficient approaches such as the ones described in may be used instead to speed up algorithm these alternative nonconvex methods requre rn steps per iterations and log iterations resulting in total complexity of rn log for solving the decomposition of the contractions to an accuracy of references nandkumar su and akade tensor approach to learning mixed membership community models the journal of machine learning research pp nandkumar su akade and elgarsky tensor decompositions for learning latent variable models tech eckmann and mith tensorial extensions of independent component analysis for multisubject fmri analysis neuroimage pp haskara harikar oitra and ijayaraghavan smoothed analysis of tensor decompositions in proceedings of the annual acm symposium on theory of computing acm pp hojanapalli and anghavi new sampling technique for tensors arxiv preprint and acm pp and right robust principal component analysis journal of the and and echt exact matrix completion via convex optimization foundations of computational mathematics pp attell parallel proportional profiles and other principles for determining the choice of factors by rotation psychometrika pp handrasekaran anghavi parrilo and illsky incoherence for matrix decomposition siam journal on optimization pp hen aramanis and anghavi robust matrix completion and corrupted columns in proceedings of the international conference on machine learning getoor and scheffer new york ny usa acm pp oyal empala and iao fourier pca and robust tensor decomposition in proceedings of the annual acm symposium on theory of computing acm pp illar and im most tensor problems are journal of the acm pp su akade and hang robust matrix decomposition with sparse corruptions information theory ieee transactions on pp uang oldfarb and right provable models for robust tensor completion pacific journal of optimization pp rishnamurthy and ingh matrix and tensor completion via adaptive sampling in advances in neural information processing systems ruskal arrays rank and uniqueness of trilinear decompositions with application to arithmetic complexity and statistics linear algebra uleshov haganty and iang tensor factorization via matrix factorization eurgans ross and bel decomposition for arrays siam journal on matrix analysis and applications pp rater hen and tang overcomplete tensor decomposition via convex optimization in ieee international workshop on computational advances in adaptive processing camsap cancun mexico esgarani laney and hamma 
Discrimination of speech from non-speech based on multiscale spectro-temporal modulations. Audio, Speech, and Language Processing, IEEE Transactions on. Huang, Wright, and Goldfarb. Square deal: lower bounds and improved relaxations for tensor recovery. Preprint. Netrapalli, Niranjan, Sanghavi, Anandkumar, and Jain. Non-convex robust PCA. In Advances in Neural Information Processing Systems. Rao, Shah, and Wright. Greedy algorithms for signal demixing. In Signals, Systems and Computers, Asilomar Conference on. IEEE. Shah, Rao, and Tang. Optimal low-rank tensor recovery from separable measurements: four contractions suffice. Preprint. Tang and Shah. Guaranteed tensor decomposition: a moment approach. International Conference on Machine Learning (ICML). Tomioka, Hayashi, and Kashima. Estimation of low-rank tensors via convex optimization. Preprint. Yuan and Zhang. On tensor completion via nuclear norm minimization. Preprint.
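As an illustration of the contraction-and-simultaneous-diagonalization primitive at the heart of the algorithm above, here is a minimal NumPy sketch. It recovers the mode-1 factors of an uncorrupted low-rank tensor from two random contractions along the third mode; in the full algorithm each contraction would first be split into its sparse and low-rank parts by the nuclear-norm-plus-l1 convex program, and the procedure would be repeated along another mode to obtain the remaining factors and scale factors. The dimensions, rank, and random seed below are illustrative assumptions.

import numpy as np

def contract(T, a, mode=2):
    # <T, a> along the given mode: a weighted sum of the slices of T
    return np.tensordot(T, a, axes=([mode], [0]))

def mode1_factors(T, rank, rng):
    """Leurgans-style recovery of the mode-1 factors of a low-rank tensor.

    Contract T along mode 3 with two random unit vectors a and b, form
    Xa Xb^+, and read the factors off its leading eigenvectors; they are
    recovered up to sign, scale, and ordering."""
    a, b = rng.normal(size=(2, T.shape[2]))
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    Xa, Xb = contract(T, a), contract(T, b)
    eigvals, eigvecs = np.linalg.eig(Xa @ np.linalg.pinv(Xb))
    top = np.argsort(-np.abs(eigvals))[:rank]   # keep the r nonzero eigenvalues
    return np.real(eigvecs[:, top])

# toy check: build a rank-2 tensor from random factors and recover the mode-1 factors
rng = np.random.default_rng(0)
U, V, W = (rng.normal(size=(10, 2)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', U, V, W)
U_hat = mode1_factors(T, rank=2, rng=rng)   # columns match U's columns up to sign/scale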
Minimax Time Series Prediction

Alan Malek (UC Berkeley), Wouter M. Koolen (Centrum Wiskunde & Informatica), Peter Bartlett (UC Berkeley and QUT), Yasin Abbasi-Yadkori (Queensland University of Technology)

Abstract. We consider an adversarial formulation of the problem of predicting a time series with square loss. The aim is to predict an arbitrary sequence of vectors almost as well as the best smooth comparator sequence in retrospect. Our approach allows natural measures of smoothness, such as the squared norm of increments. More generally, we consider a fixed linear time series model and penalize the comparator sequence through the energy of the implied driving noise terms. We derive the minimax strategy for all problems of this type and show that it can be implemented efficiently. The optimal predictions are linear in the previous observations. We obtain an explicit expression for the regret in terms of the parameters defining the problem. For typical, simple definitions of smoothness, the computation of the optimal predictions involves only sparse matrices. In the case where smoothness is defined in terms of the squared norm of the comparator's increments, we show that the regret grows with the game length T at a rate governed by λ_T, an increasing limit on comparator smoothness.

Introduction. In time series prediction, tracking, and filtering problems, a learner sees a stream of (possibly noisy) data and needs to predict the future path; one may think of robot poses, meteorological measurements, stock prices, and so on. Popular stochastic models for such tasks include the autoregressive moving average (ARMA) model in time series analysis, Brownian motion models in finance, and state space models in signal processing. In this paper we study the time series prediction problem in the regret framework: instead of making assumptions on the data generating process, we ask whether we can predict the data sequence online almost as well as the best offline prediction method in some comparison class. Here, offline means that the comparator only needs to model the data sequence after seeing all of it. Our main contribution is computing the exact minimax strategy for a range of time series prediction problems. As a concrete motivating example, let us pose the simplest nontrivial such minimax problem:
\[
\min_{a_1}\max_{x_1}\cdots\min_{a_T}\max_{x_T}\;
\underbrace{\sum_{t=1}^{T} \|a_t - x_t\|^2}_{\text{loss of learner}}
\;-\;\min_{\hat c_1,\dots,\hat c_T}\Big(
\underbrace{\sum_{t=1}^{T} \|x_t - \hat c_t\|^2}_{\text{loss of comparator}}
\;+\;\underbrace{\lambda_T \sum_{t} \|\hat c_t - \hat c_{t-1}\|^2}_{\text{comparator complexity}}\Big).
\]
This notion of regret is standard in online learning, going back at least to early work on tracking, which views it as the natural generalization of regularization to comparators that change over time. We offer two motivations for this regularization. First, one can interpret the complexity term as the magnitude of the noise required to generate the comparator using a multivariate Gaussian random walk and, generalizing slightly, as the energy of the innovations required to model the comparator using a single fixed linear time series model (specific ARMA coefficients). Second, we can view the comparator term in the expression above as akin to the Lagrangian of a constrained optimization problem: rather than competing with the comparator sequence that minimizes the cumulative loss subject to a hard constraint on the complexity term, the learner must compete with the comparator sequence that best trades off the cumulative loss and the smoothness; the Lagrange multiplier λ_T controls this trade-off. Notice that it is natural to allow λ_T to grow with T, since that penalizes the comparator change per round more than the loss per round. For this particular problem we obtain an efficient algorithm whose amortized time per round scales only with d, the dimension of the data; there is no
nasty dependence on as often happens with minimax algorithms our general minimax analysis extends to more advanced complexity terms for example we may regularize instead by smoothness magnitude of increments of increments etc or more generally we may consider fixed linear process and regularize the comparator by the energy of its implied driving noise terms innovations we also deal with arbitrary sequences of quadratic constraints on the data we show that the minimax algorithm is of familiar nature it is linear filter with twist its coefficients are not but instead arise from the intricate interplay between the regularization and the range of the data combined with shrinkage fortunately they may be computed in step by simple recurrence an unexpected detail of the analysis is the following as we will show the regret objective in is convex quadratic function of all data and the objectives that arise from the backward induction steps in the minimax analysis remain quadratic functions of the past however they may be either concave or convex changing direction of curvature is typically source of technical difficulty the minimax solution is different in either case quite remarkably we show that one can determine priori which rounds are convex and which are concave and apply the appropriate solution method in each we also consider what happens when the assumptions we need to make for the minimax analysis to go through are violated we will show that the obtained minimax algorithm is in fact highly robust simply applying it unlicensed anyway results in adaptive regret bounds that scale naturally with the realized data magnitude or more generally its energy related work there is rich history of tracking problems in the expert setting in this setting the learner has some finite number of actions to play and must select distribution over actions to play each round in such way as to guarantee that the loss is almost as small as the best single action in hindsight the problem of tracking the best expert forces the learner to compare with sequences of experts usually with some fixed number of switches the algorithm was an early solution but there has been more recent work tracking experts has been applied to other areas see for an application to sequential allocation an extension to linear combinations of experts where the expert class is penalized by the of the sequence was considered in minimax algorithms for squared euclidean loss have been studied in several contexts such as gaussian density estimation and linear regression in the authors showed that the minimax algorithm for quadratic loss is follow the leader predicting the previous data mean when the player is constrained to play in ball around the previous data mean additionally moroshko and krammer propose weak notion of that allows them to apply the minimax approach to framework the tracking problem in the regret setting has been considered previously where the authors studied the best linear predictor with comparison class of all sequences with bounded smoothness kat and proposed general method for converting regret bounds in the static setting to ones in the shifting setting where the best expert is allowed to change outline we start by presenting the formal setup in section and derive the optimal offline predictions in section we zoom in to quadratic games and solve these both in the convex and concave case with this in hand we derive the minimax solution to the time series prediction problem by backward induction in section in section we 
focus on the motivating problem for which we give faster implementation and tightly sandwich the minimax regret section concludes with discussion conjectures and open problems protocol and offline problem the game protocol is described in figure and is the usual online prediction game with squared euclidean loss the goal of the learner is to incur small regret that is to predict the data almost as well as the best sequence chosen in hindsight our motivating problem gauged complexity by the sum of squared norms of the increments thus encouraging smoothness here we generalize to complexitypterms defined by complexity matrix and charge the comparator by ks we recover the smoothness penalty of by taking to be the tridiagonal matrix for learner predicts at rd environment reveals xt learner suffers loss kat xt but we may also regularize by the sum of figure protocol squared norms the sum of norms of higher order increments or more generally we may consider fixed linear process and take to be the matrix that recovers the driving noise terms from the signal and then our penalty is exactly the energy of the implied noise for that linear process we now turn to computing the identity and quality of the best competitor sequence in hindsight theorem for any complexity matrix regularization scalar λt and data matrix xt xt the problem min xt λt ks has linear minimizer and quadratic value given by xt λt and tr xt λt proof writing we can compactly express the offline problem as min tr xt xt λt the derivative of the objective is xt setting this to zero yields the minimizer xt and simplification result in value tr xt λt note that for the choice of in computing the optimal can be performed in dt time by solving the linear system λt kt xt directly this system decomposes into one per dimension independent tridiagonal systems each in one per time step variables which can each be solved in linear time using gaussian elimination this theorem shows that the objective of our minimax problem is quadratic function of the data in order to solve round minimax problem with quadratic regret objective we first solve simple single round quadratic games minimax squared loss games one crucial tool in the minimax analysis of our tracking problem will be solving particular singleshot games in such games the player and adversary play prediction and data resulting in payoff given by the following square loss plus quadratic in ka xk kxk the quadratic and linear terms in have coefficients and rd note that is convex in and either convex or concave in as decided by the sign of the following result proved in appendix and illustrated for kbk by the figure to the right gives the minimax analysis for both cases theorem let be as in if kbk then the minimax problem min max has value kbk kbk and minimizer if if if if we also want to look at the performance of this strategy when we do not impose the norm bound kxk nor make the assumption kbk by evaluating we obtain an adaptive expression that scales with the actual norm kxk of the data theorem let be the strategy from then for any data rd and any rd kbk kbk if and kbk αkxk if these two theorems point out that the strategy in is amazingly versatile the former theorem establishes minimax optimality under data constraint kxk assuming that kbk yet the latter theorem tells us that even without constraints and assumptions this strategy is still an extremely useful heuristic for its actual regret is bounded by the minimax regret we would have incurred if we would have known the scale of the data 
kxk and kbk in advance the norm bound we imposed in the derivation induces the complexity measure for the data to which the strategy adapts this robustness property will extend to the minimax strategy for time series prediction finally it remains to note that we present the theorems in the canonical case problems with constraint of the form kx ck may be canonized by by and and scaling the objective by we find corollary fix and rd let denote the minimax value from with parameters if bk then min max kck with this machinery in place we continue the minimax analysis of time series prediction problems minimax time series prediction in this section we give the minimax solution to the online prediction problem recall that the evaluation criterion the regret is defined by kat xt min xt λt tr where is fixed matrix measuring the complexity of the comparator sequence since all the derivations ahead will be for fixed we drop the subscript on the we study the minimax problem min max min max at xt under the constraint on the data that kxt vt in each round for some fixed sequence vt such that vt rt this constraint generalizes the norm bound constraint from the motivating problem which is recovered by taking vt et this natural generalization allows us to also consider bounded norms of increments bounded higher order discrete derivative norms etc we compute the minimax regret and get an expression for the minimax algorithm we show that at any point in the game the value is quadratic function of the past samples and the minimax algorithm is linear it always predicts with weighted sum of all past samples most intriguingly the value function can either be convex or concave quadratic in the last data point depending on the regularization we saw in the previous section that these two cases require different minimax solution it is therefore an extremely fortunate fact that the particular case we find ourselves in at each round is not function of the past data but just property of the problem parameters and vt we are going to solve the sequential minimax problem one round at time to do so it is convenient to define the of the game from any state xt xt recursively by xt and min max at xt kxt vt kat xt xt we are interested in the minimax algorithm and minimax regret we will show that the minimax value and strategy are quadratic and linear function of the observations to express the value and strategy and state the necessary condition on the problem we will need series of scalars dt and matrices rt for which as we will explain below arises naturally from the minimax analysis the matrices which depend on the regularization parameter comparator complexity matrix and data constraints vt are defined recursively base case at bt ut is rt λt using the convenient abbreviations vt wt and rt ct we then recursively define and set dt by ct at bt ct ut bt ct ut ct ut dt if ct wt bt at dt if ct ct using this recursion for dt and rt we can perform the exact minimax analysis under certain condition on the interplay between the data constraint and the regularization we then show below that the obtained algorithm has regret bound theorem assume that and vt are such that any data sequence satisfying the constraint kxt vt for all rounds also satisfies ct ut bt for all rounds then the minimax value of and strategy for problem are given by bt if ct ds and at xt tr xt rt xt bt ct ut if ct in particular this shows that the minimax regret is given by pt dt proof by induction the base case xt is theorem for any we apply the definition of and 
the induction hypothesis to get min max at xt kxt vt kat xt tr xt rt ds tr at dt where we abbreviated min max at xt kxt vt kat xt ct xt bt without loss of generality assume wt now as kxt vt iff ut xt application of corollary with ct bt and ut followed by theorem results in optimal strategy if ct at ut bt if ct and value ct ut ut ct ut bt ct if ct ct ut bt ct if ct expanding all squares and rearranging cycling under the trace completes the proof on the one hand from technical perspective the condition of theorem is rather natural it guarantees that the prediction of the algorithm will fall within the constraint imposed on the data if it would not we could benefit by clipping the prediction this would be guaranteed to reduce the loss and it would wreck the backwards induction similar clipping conditions arise in the minimax analyses for linear regression and square loss prediction with mahalanobis losses in practice we typically do not have hard bound on the data sill by running the above minimax algorithm obtained for data complexity bounds kxt vt we get an adaptive regret bound that scales with the actual data complexity kxt vt as can be derived by replacing the application of theorem in the proof of theorem by an invocation of theorem theorem let and vt be arbitrary the minimax algorithm obtained in theorem keeps pt the regret bounded by dt kxt vt for any data sequence xt computation sparsity in the important special case typical application where the regularization and data constraint vt are encoding some order of smoothness we find that is banded diagonal and vt only has few tail entries it hence is the case that λk is sparse we now argue that the recursive updates preserve sparsity of the inverse in appendix we derive an update for in terms of for computation it hence makes sense to tabulate directly we now argue proof in appendix that all are sparse theorem say the vt are all but their tail entries are zero and say that is all but the the main and adjacent diagonals to either side are zero then each is the sum of the matrix and matrix all but the block of size is zero so what does this sparsity argument buy us we only need to maintain the original matrix and the entries of the block perturbation these entries can be updated backwards from in time per round using block matrix inverses this means that the of the entire step is linear in for updates and prediction we need ct and bt which we can compute using gaussian elimination from in time in the next section we will see special case in which we can update and predict in constant time data with increment squared regularization we return to our motivating problem with complexity matrix kt given by and norm constrained data vt et we show that the rt matrices are very simple their inverse is λkt with its entry perturbed using this we show that the prediction is linear combination of the past observations with weights decaying exponentially backward in time we derive update equation for the minimax prediction and tightly sandwich the regret here we will calculate few quantities that will be useful throughout this section the inverse λkt can be computed in closed form as direct application of the results in ex and cosh for any cosh cosh λkt sinh sinh where lemma recall that sinh we need some control on this inverse we will use the abbreviations zt λkt et ht λkt et zt and we now show that these quantities are easily computable see appendix for proofs lemma let be as in lemma then we can write ht λh λh and ht from below exponentially fast 
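The offline-comparator theorem and the quantities z_t and h_t above lend themselves to a quick numerical check. The following is a minimal sketch (ours, not the authors' code), assuming K is the T×T first-difference matrix — 1 on the diagonal, −1 on the first subdiagonal — so that ‖Ks‖² is the squared norm of the comparator increments; all variable names are illustrative.

```python
import numpy as np

def first_difference(T):
    """T x T first-difference matrix K: (Ks)_t = s_t - s_{t-1} (with s_0 = 0)."""
    return np.eye(T) - np.diag(np.ones(T - 1), k=-1)

T, d, lam = 200, 3, 10.0
rng = np.random.default_rng(0)
X = np.cumsum(rng.normal(size=(T, d)), axis=0)   # an arbitrary data sequence x_1..x_T

K = first_difference(T)
A = np.eye(T) + lam * K.T @ K                    # I + lambda * K^T K (tridiagonal)

# Offline comparator: s* = (I + lambda K^T K)^{-1} X.  Because A is tridiagonal,
# a banded solver would give the O(dT) cost claimed in the text; np.linalg.solve
# is used here only for brevity.
S_star = np.linalg.solve(A, X)
offline_value = np.sum((X - S_star) ** 2) + lam * np.sum((K @ S_star) ** 2)

# The quantities z_t = (I + lambda K^T K)^{-1} e_t and h_t = e_t^T z_t from the
# lemmas above, evaluated at the last coordinate; h_t converges to its limit fast.
e_T = np.zeros(T)
e_T[-1] = 1.0
z = np.linalg.solve(A, e_T)
h = z[-1]
print("offline value:", offline_value, "  h_T:", h)
```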
direct application of block matrix inversion lemma results in lemma we have ht zt ht and intriguingly following the optimal algorithm for all rounds can be done in computation and memory these resource requirements are surprising as playing weighted averages typically requires we found that the weighted averages are similar between rounds and can be updated cheaply we are now ready to state the main result of this section proved in appendix theorem let zt and ht be as in and kt as in for the minimax problem we have λkt γt et and the minimax prediction in round is given by at λct where γt ct ht and ct satisfy the recurrence ct ht and ct ct implementation theorem states that the minimax prediction is at λct using lemma we can derive an incremental update for at by defining and xt zt xt ht ht xt at xt ht ct this means we can predict in constant time per round lower bound pt by theorem using that wt so that dt ct the minimax regret equals ct for convenience we define rt λt and rt so that ht hrt we can obtain lower bound on ct from the expression given in theorem by ignoring the positive term to obtain ct by unpacking this lower bound recursively we arrive at ct λt rk rt which leads to since ri is decreasing function in for every we have ri ht rt rt λt ct dkdt λt log λt where we have exploited the fact that the integrand is monotonic and concave in and monotonic and convex in to lower bound the sums with an integral see claim in the appendix for more pt details since log λt λt and we have that ct matching the upper bound below upper bound as ht the alternative recursion and satisfies ct simple induction shows that is increasing with decreasing and it must hence have limit this limit is of this results in quadratic equation which has two solutions our starting point lies below the point so the sought limit is the smaller solution this is monotonic in plugging in the definition of we find series expansion around results in so all in all the bound is λt where we have written the explicit dependence of as discussed in the introduction allowing λt to grow with is natural and necessary for regret if λt were constant the regret term and complexity term would grow with at the same rate effectively forcing the learner to compete with sequences that could track the xt sequence arbitrarily well discussion we looked at obtaining the minimax solution to simple series prediction problems with square loss square norm regularization and square norm data constraints we obtained computational method to get the minimax result surprisingly the problem turns out to be mixture of quadratic minimax problems that can be either concave or convex these two problems have different solutions since the type of problem that is faced in each round is not function of the past data but only of the regularization the coefficients of the function can still be computed recursively however extending the analysis beyond quadratic loss and constraints is difficult the property of the is central to the calculations several open problems arise the stability of the coefficient recursion is so far elusive for the case of norm bounded data we found that the ct are positive and essentially constant however for higher order smoothness constraints on the data norm bounded increments increments of increments the situation is more intricate we find negative ct and oscillating ct both diminishing and increasing understanding the behavior of the minimax regret and algorithm as function of the regularization so that we can tune appropriately is 
an intriguing and elusive open problem.

Acknowledgments. We gratefully acknowledge the support of the NSF, and of the Australian Research Council through an Australian Laureate Fellowship and through the ARC Centre of Excellence for Mathematical and Statistical Frontiers. Thanks also to the Simons Institute for the Theory of Computing spring Information Theory program.

References
Mark Herbster and Manfred K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research.
Mark Herbster and Manfred K. Warmuth. Tracking the best expert. Machine Learning.
Claire Monteleoni. Online learning of non-stationary sequences. Master's thesis, MIT; MIT Artificial Intelligence report.
Kamalika Chaudhuri, Yoav Freund, and Daniel Hsu. An online learning-based framework for tracking. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).
Olivier Bousquet and Manfred K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research.
Nicolò Cesa-Bianchi, Pierre Gaillard, Gábor Lugosi, and Gilles Stoltz. Mirror descent meets fixed share (and feels no regret). In Advances in Neural Information Processing Systems. Curran Associates.
Avrim Blum and Carl Burch. On-line learning and the metrical task system problem. Machine Learning.
Eiji Takimoto and Manfred K. Warmuth. The minimax strategy for Gaussian density estimation. In COLT.
Peter Bartlett, Wouter Koolen, Alan Malek, Manfred Warmuth, and Eiji Takimoto. Minimax fixed-design linear regression. In Proceedings of the Annual Conference on Learning Theory (COLT).
Jacob Abernethy, Peter Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Annual Conference on Learning Theory (COLT).
Edward Moroshko and Koby Crammer. Weighted last-step min-max algorithm with improved sub-logarithmic regret. In Algorithmic Learning Theory (ALT), Lyon, France. Lecture Notes in Computer Science, Springer.
Edward Moroshko and Koby Crammer. A last-step regression algorithm for non-stationary online learning. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), Scottsdale, AZ, USA. JMLR Proceedings.
Wouter Koolen, Alan Malek, and Peter Bartlett. Efficient minimax strategies for square loss games. In Advances in Neural Information Processing Systems (NIPS).
Hu and Robert O'Connell. Analytical inversion of symmetric tridiagonal matrices. Journal of Physics A: Mathematical and General.
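For reference, the protocol and the regret criterion analyzed in this paper can be written compactly in our own notation (with \(\hat s_t\) the comparator rows and \(K\) the complexity matrix, symbol choices that are assumptions on our part):
\[
\text{for } t = 1, \dots, T:\ \text{the learner predicts } a_t \in \mathbb{R}^d,\ \text{the environment reveals } x_t \in \mathbb{R}^d,\ \text{and the learner suffers } \|a_t - x_t\|^2,
\]
\[
R_T \;=\; \sum_{t=1}^{T} \|a_t - x_t\|^2 \;-\; \min_{\hat S \in \mathbb{R}^{T \times d}} \Big( \sum_{t=1}^{T} \|x_t - \hat s_t\|^2 \;+\; \lambda_T\, \|K \hat S\|_F^2 \Big),
\]
where choosing \(K\) to be the first-difference matrix recovers the increment penalty \(\lambda_T \sum_t \|\hat s_t - \hat s_{t-1}\|^2\) of the motivating problem.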
differentially private learning of structured discrete distributions ilias university of edinburgh moritz hardt google research ludwig schmidt mit abstract we investigate the problem of learning an unknown probability distribution over discrete population from random samples our goal is to design efficient algorithms that simultaneously achieve low error in total variation norm while guaranteeing differential privacy to the individuals of the population we describe general approach that yields near and computationally efficient differentially private estimators for wide range of and natural distribution families our theoretical results show that for wide variety of structured distributions there exist private estimation algorithms that are nearly as in terms of sample size and running their counterparts we complement our theoretical guarantees with an experimental evaluation our experiments illustrate the speed and accuracy of our private estimators on both synthetic mixture models and large public data set introduction the majority of available data in modern machine learning applications come in raw and unlabeled form an important class of unlabeled data is naturally modeled as samples from probability distribution over very large discrete domain such data occurs in almost every setting financial transactions seismic measurements neurobiological data sensor networks and network traffic records to name few classical problem in this context is that of density estimation or distribution learning given number of iid samples from an unknown target distribution we want to compute an accurate approximation of the distribution statistical and computational efficiency are the primary performance criteria for distribution learning algorithm more specifically we would like to design an algorithm whose sample size requirements are optimal and whose running time is nearly linear in its sample size beyond computational and statistical efficiency however data analysts typically have variety of additional criteria they must balance in particular data providers often need to maintain the anonymity and privacy of those individuals whose information was collected how can we reveal useful statistics about population while still preserving the privacy of individuals in this paper we study the problem of density estimation in the presence of privacy constraints focusing on the notion of differential privacy our contributions our main findings suggest that the marginal cost of ensuring differential privacy in the context of distribution learning is only moderate in particular for broad class of density estimation problems we give private estimation algorithms that are nearly as in terms of sample size and running nearly optimal baseline as our learning algorithm approximates the underlying distribution up to small error in total variation norm all crucial properties of the underlying distribution are preserved in particular the analyst is free to compose our learning algorithm with an arbitrary analysis the authors are listed in alphabetical order our strong positive results apply to all distribution families that can be by piecewise polynomial distributions extending recent line of work to the differentially private setting this is rich class of distributions including several natural mixture models distributions and monotone distributions amongst many other examples our algorithm is agnostic so that even if the unknown distribution does not conform exactly to any of these distribution families it continues 
to find good approximation these surprising positive results stand in sharp contrast with long line of hardness results and lower bounds in differential privacy which show separations between private and nonprivate learning in various settings complementing our theoretical guarantees we present novel heuristic method to achieve empirically strong performance our heuristic always guarantees privacy and typically converges rapidly we show on various data sets that our method scales easily to input sizes that were previously prohibitive for any implemented differentially private algorithm at the same time the algorithm approaches the estimation error of the best known method for sufficiently large number of samples technical overview we briefly introduce standard model of learning an unknown probability distribution from samples namely that of which is essentially equivalent to the minimax rate of convergence in distribution learning problem is defined by class of distributions the algorithm has access to independent samples from an unknown distribution and its goal is to output hypothesis distribution that is close to we measure the closeness between distributions in total variation distance which is equivalent to the and sometimes also called statistical distance in the noiseless setting we are promised that and the goal is to construct hypothesis such that with high probability the total variation distance dtv between and is at most where is the accuracy parameter the more challenging noisy or agnostic model captures the situation of having arbitrary or even adversarial noise in the data in this setting we do not make any assumptions about the target distribution and the goal is to find hypothesis that is almost as accurate as the best approximation of by any distribution in formally given sample access to potentially arbitrary target distribution and the goal of an agnostic learning algorithm for is to compute hypothesis distribution such that dtv optc where optc is the total variation distance between and the closest distribution to it in and is universal constant it is folklore fact that learning an arbitrary discrete distribution over domain of size to constant accuracy requires samples and running time the underlying algorithm is straightforward output the empirical distribution for distributions over very large domains linear dependence on is of course impractical and one might hope that drastically better results can be obtained for most natural settings indeed there are many natural and fundamental distribution estimation problems where significant improvements are possible consider for example the class of all unimodal distributions over in sharp contrast to the lower bound for the unrestricted case an algorithm of birgé is known to learn any unimodal distribution over with running time and sample complexity of log the starting point of our work is recent technique for learning univariate distributions via piecewise polynomial approximation our first main contribution is generalization of this technique to the setting of approximate differential privacy to achieve this result we exploit connection between structured distribution learning and private kolmogorov approximations more specifically we show in section that for the class of structured distributions we consider private algorithm that approximates an input histogram in the kolmogorov distance combined with the algorithmic framework of yields sample and computationally efficient private learners under the total variation 
distance. Our approach crucially exploits the structure of the underlying distributions, as the Kolmogorov distance is a much weaker metric than the total variation distance. Combined with a recent private algorithm, we obtain differentially private learners for a wide range of structured distributions over [N]. The sample complexity of our algorithms matches that of their non-private analogues up to a standard dependence on the privacy parameters and a multiplicative factor expressed in terms of the iterated logarithm of the domain size. The running time of our algorithm is nearly linear in the sample size and logarithmic in the domain size.

Related work. There is a long history of research in statistics on estimating structured families of distributions, going back many decades, and it is still a very active research area. Theoretical computer scientists have also studied these problems, with an explicit focus on computational efficiency. In statistics, the study of inference questions under privacy constraints goes back to the classical work of Warner. Recently, Duchi et al. study the trade-off between statistical efficiency and privacy in a local model of privacy, obtaining sample complexity bounds for basic inference problems; we instead work in the standard central model, and our focus is on both statistical and computational efficiency. There is a large literature on answering range queries (or threshold queries) over an ordered domain subject to differential privacy; see, for example, the works cited in the references, as well as recent work and the many references therein. If the output of the algorithm represents a histogram over the domain that is accurate on all such queries, then this task is equivalent to approximating the sample in Kolmogorov distance, which is the task we consider. Apart from the work of Beimel et al. and Bun et al., to the best of our knowledge all algorithms in this literature have running time that depends polynomially on the domain size; moreover, except for the aforementioned works, we know of no other algorithm that achieves a comparably mild dependence on the domain size in its approximation guarantee. Of all algorithms in this area, we believe that ours is the first implemented algorithm that scales to very large domains with strong empirical performance, as we demonstrate in the experiments section.

Preliminaries: notation and basic definitions. For N, we write [N] to denote the set {1, ..., N}. The ℓ1-norm of a vector v in R^N (equivalently, of a function from [N] to R) is \(\|v\|_1 = \sum_i |v_i|\). For a discrete probability distribution p we write p(i) to denote the probability of element i under p, and for a subset S of the domain we write \(p(S) = \sum_{i \in S} p(i)\). The total variation distance between two distributions p and q over [N] is \(d_{\mathrm{TV}}(p, q) := \frac{1}{2}\|p - q\|_1\). The Kolmogorov distance between p and q is defined as \(d_{\mathrm{K}}(p, q) := \max_{j \in [N]} \big|\sum_{i \le j} p(i) - \sum_{i \le j} q(i)\big|\). Note that \(d_{\mathrm{K}}(p, q) \le d_{\mathrm{TV}}(p, q)\). Given a set of independent samples s_1, ..., s_n drawn from a distribution p, the empirical distribution \(\hat p_n\) is defined as follows: for all i, \(\hat p_n(i) = |\{j \in [n] : s_j = i\}| / n\).

Definition (distribution learning). Let C be a family of distributions over a domain Ω. Given sample access to an unknown distribution p over Ω and an accuracy parameter ε, the goal of an agnostic learning algorithm for C is to compute a hypothesis distribution h such that, with high probability, \(d_{\mathrm{TV}}(h, p) \le \alpha \cdot \mathrm{opt}_C + \varepsilon\), where \(\mathrm{opt}_C := \inf_{q \in C} d_{\mathrm{TV}}(q, p)\) and α is a universal constant.

Differential privacy. A database D is an n-tuple of items from [N]. Given a database D in [N]^n, we let hist(D) denote the normalized histogram corresponding to D, that is, \(\mathrm{hist}(D) = \frac{1}{n}\sum_i e_{D_i}\), where e_j denotes the j-th standard basis vector in R^N.

Definition (differential privacy). A randomized algorithm A from [N]^n to R, where R is some arbitrary range, is (ε, δ)-differentially private if for all pairs of inputs D, D' differing in only one entry, we have that for all subsets S of the range the algorithm satisfies \(\Pr[A(D) \in S] \le \exp(\varepsilon) \cdot \Pr[A(D') \in S] + \delta\).

In the context of private distribution learning, the database is the sample set drawn from the
unknown target distribution in this case the normalized histogram corresponding to is the same as the empirical distribution corresponding to hist pbn basic tools from probability we recall some probabilistic inequalities that will be crucial for our analysis our first tool is the vc inequality given family of subsets over define kpka the of is the maximum size of subset that is shattered by set is shattered by if for every some satisfies theorem vc inequality let pbn be an empirical distribution of samples from let be family of subsets of then kp pbn ka we note that the rhs above is best possible up to constant factors and independent of the domain size the dkw inequality can be obtained as consequence of the vc inequality by taking to be the class of all intervals the dkw inequality implies that for with probability at least over the draw of samples from the empirical distribution pbn will be to in kolmogorov distance we will also use the following uniform convergence bound theorem let be family of subsets over and pbn be an empirical distribution of samples from let be the random variable kp then we have pr connection to synthetic data distribution learning is closely related to the problem of generating synthetic data any dataset of size over universe can be interpreted as distribution over the domain the weight of item corresponds to the fraction of elements in that are equal to in fact this histogram view is convenient in number of algorithms in differential privacy if we manage to learn this unknown distribution then we can take samples from it obtain another synthetic dataset hence the quality of the distribution learner dictates the statistical properties of the synthetic dataset learning in total variation distance is particularly appealing from this point of view if two datasets represented as distributions satisfy dtv then for every test function we must have that put in different terminology this means that the answer to any statistical differs by at most between the two distributions differentially private learning framework in this section we describe our private distribution learning upper bounds we start with the simple case of privately learning an arbitrary discrete distribution over we then extend this bound to the case of privately and agnostically learning histogram distribution over an arbitrary but known partition of finally we generalize the recent framework of to obtain private agnostic learners for histogram distributions over an arbitrary unknown partition and more generally piecewise polynomial distributions our first theorem gives differentially private algorithm for arbitrary distributions over that essentially matches the best algorithm let cn be the family of all probability distributions over we have the following theorem there is computationally efficient private algorithm for cn that uses log log samples the algorithm proceeds as follows given dataset of samples from the unknown target distribution over it outputs the hypothesis hist pbn where rn is sampled from the laplace distribution with standard deviation the simple analysis is deferred to appendix over is function that is piecewise constant with at most interval pieces there is partition of into intervals it such that is constant on each ii let ht be the family of all distributions over with respect to partition it given sample access to distribution over our goal is to output hypothesis that satisfies dtv optt where optt inf dtv we show the following theorem there is computationally efficient 
private learning algorithm for ht that uses log log samples the main idea of the proof is that the differentially private learning problem for ht can be reduced to the same problem over distributions of support the theorem then follows by an statistical query asks for the average of predicate over the dataset application of theorem see appendix for further details theorem gives differentially private learners for any family of distributions over that can be by histograms over fixed partition including monotone distributions and distributions with known mode in the remainder of this section we focus on approximate privacy privacy for and show that for wide range of natural and distribution families there exists computationally efficient and differentially private algorithm whose sample size is at most factor of log worse than its counterpart in particular we give differentially private version of the algorithm in for wide range of distributions our algorithm has sample complexity and runs in time that is in the sample size and logarithmic in the domain size we can view our overall private learning algorithm as reduction for the sake of concreteness we state our approach for the case of histograms the generalization to piecewise polynomials being essentially identical let ht be the family of all distributions over with unknown partition in the setting combination of theorems and see appendix implies that ht is learnable using log samples the algorithm of starts with the empirical distribution pbn and it to obtain an hypothesis let ak be the collection of subsets of that can be expressed as unions of at most disjoint intervals the important property of the empirical distribution pbn is that with high probability pbn is to the target distribution in ak for any the crucial observation that enables our generalization is that the algorithm of achieves the same performance guarantees starting from any hypothesis such that kp qkao this observation motivates the following differentially private algorithm starting from the empirical distribution pbn efficiently construct an private hypothesis such that with probability at least it holds kq pbn kao pass as input to the learning algorithm of with parameters and return its output hypothesis we now proceed to sketch correctness since is private and the algorithm of step is only function of the composition theorem implies that is also private recall that with probability at least we have kp pbn kao by the properties of in step union bound and an application of the triangle inequality imply that with probability at least we have kp qkao hence the output of step is an agnostic hypothesis we have thus sketched proof of the following lemma lemma suppose there is an private synthetic data algorithm under the ao metric that is on databases of size where log then there exists an agnostic learning algorithm for ht with sample complexity recent work of bun et al gives an efficient differentially private synthetic data algorithm under the kolmogorov distance metric proposition there is an private synthetic data algorithm with respect to dk on databases of size over assuming log ln the algorithm runs in time log note that the kolmogorov distance is equivalent to the up to factor of hence by applying the above proposition for one obtains an synthetic data algorithm with respect to the at combining the above we obtain the following theorem there is an private learning algorithm for ht that uses ln log ln samples and runs in time log as an immediate corollary of 
theorem we obtain optimal and computationally efficient differentially private estimators for all the structured discrete distribution families studied we remark that potential difference is in the running time of the algorithm which depends on the support and structure of the distribution in these include classes of shape restricted densities including mixtures of unimodal and multimodal densities with unknown mode locations monotone hazard rate mhr and distributions and others due to space constraints we do not enumerate the full descriptions of these classes or statements of these results here but instead refer the interested reader to maximum error rule for private kolmogorov distance approximation in this section we describe simple and fast algorithm for privately approximating an input histogram with respect to the kolmogorov distance our private algorithm relies on simple nonprivate iterative greedy algorithm to approximate given histogram empirical distribution in kolmogorov distance which we term aximum rror rule this algorithm performs set of basic operations on the data and can be effectively implemented in the private setting to describe the version of aximum rror rule we point out connection of the kolmogorov distance approximation problem to the problem of approximating monotone univariate function with by piecewise linear function let pbn be the empirical probability distribution over and let pbn denote the corresponding empirical cdf note that pbn is monotone and piecewise constant with at most pieces we would like to approximate pbn by piecewise uniform distribution with corresponding piecewise linear cdf it is easy to see that this is exactly the problem of approximating monotone function by piecewise linear function in the aximum rror rule works as follows starting with the trivial linear approximation that interpolates between and the algorithm iteratively refines its approximation to the target empirical cdf using greedy criterion in each iteration it finds the point of the true curve empirical cdf pbn at which the current piecewise linear approximation disagrees most strongly with the target cdf in it then refines the previous approximation by adding the point and interpolating linearly between the new point and the previous two adjacent points of the approximation see figure for graphical illustration of our algorithm the aximum ror rule is popular method for monotone curve approximation whose convergence rate has been analyzed under certain assumptions on the structure of the input curve for example if the monotone input curve satisfies lipschitz condition it is known that the after iterations scales as see and references therein there are number of challenges towards making this algorithm differentially private the first is that we can not exactly select the maximum error point instead we can only choose an approximate maximizer using differentially private the standard algorithm for choosing such point would be the exponential mechanism of mcsherry and talwar unfortunately this algorithm falls short of our goals in two respects first it introduces linear dependence on the domain size in the running time making the algorithm prohibitively inefficient over large domains second it introduces logarithmic dependence on the domain size in the error of our approximation in place of the exponential mechanism we design using the choosing mechanism of beimel nissim and stemmer our runs in logarithmic time in the domain size and achieves dependence in the error see figure 
for pseudocode of our algorithm in the description of the algorithm we think of at as cdf defined by sequence of points xk yk specifying the value of the cdf at various discrete points of the domain we denote by weight at the weight of the interval according to the cdf at where the value at missing points in the domain is achieved by linear interpolation in other words at represents cdf corresponding to piecewise constant histogram similarly we let weight denote the weight of interval according to the sample that is we show that our algorithm satisfies privacy see appendix lemma for every maximumerrorrule satisfies privacy next we provide two performance guarantees for our algorithm the first shows that the running time per iteration is at most log the second shows that if at any step there is bad interval in that has large error then our algorithm finds such bad interval where the quantitative figure cdf approximation after iterations aximum rror rule privacy parameters for to ind bad nterval at pdate ind bad nterval let be the collection of all dyadic intervals of the domain for each let weight output an sampled from the choosing mechanism with score function over the collection with privacy parameters pdate let be the input interval compute wl weight laplace and wr weight laplace output the cdf obtained from by adding the points wl and wl wr to the graph of figure maximum error rule merr loss depends only on the domain size see appendix for the proof of the following theorem proposition merr runs in time log furthermore for every step with probability we have that the interval selected at step satisfies weight opt log log log recall that opt weight experiments in addition to our theoretical results from the previous sections we also investigate the empirical performance of our private distribution learning algorithm based on the maximum error rule the focus of our experiments is the learning error achieved by the private algorithm for various distributions for this we employ two types of data sets multiple synthetic data sets derived from mixtures of distributions see appendix and data set from higgs experiments the synthetic data sets allow us to vary single parameter in particular the domain size while keeping the remaining problem parameters constant we have chosen distribution from the higgs data set because it gives rise to large domain size our results show that the maximum error rule finds good approximation of the underlying distribution matching the learning error of baseline when the number of samples is sufficiently large moreover our algorithm is very efficient and runs in less than seconds for samples on domain of size we implemented our algorithm in the julia programming language and ran the experiments on an intel core cpu ghz mb cache in all experiments involving our private learning algorithm we set the privacy parameters to and since the noise magnitude depends on varying has the same effect as varying the the sample size similarly changes in are related to changes in and therefore we only consider this setting of privacy parameters higgs data in addition to the synthetic data mentioned above we use the lepton pt transverse momentum feature of the higgs data set see figure of the data set contains roughly million samples which we use as unknown distribution since the values are specified with digits of accuracy we interpret them as discrete values in for we then generate sample from this data set by taking the first samples and pass this subset as input to our private 
distribution learning algorithm this time we measure the error as kolmogorov distance between the hypothesis returned by our algorithm and the cdf given by the full set of million samples in this experiment figure we again see that the rule achieves good learning error moreover we investigate the following two aspects of the algorithm the number of steps taken by the maximum error rule influences the learning error in particular smaller number of steps leads to better approximation for small values of while more samples allow us to achieve better error with larger number of steps ii our algorithm is very efficient even for the largest sample size and the largest number of merr steps our algorithm runs in less than seconds note that on the same machine simply sorting floating point numbers takes about seconds since our algorithm involves sorting step this shows that the overhead added by the maximum error rule is only about compared to sorting in particular this implies that no algorithm that relies on sorted samples can outperform our algorithm by large margin limitations and future work as we previously saw the performance of the algorithm varies with the number of iterations currently this is parameter that must be optimized over separately for example by choosing the best run privately from the exponential mechanism this is standard practice in the privacy literature but it would be more appealing to find an adaptive method of choosing this parameter on the fly as the algorithm obtains more information about the data there remains gap in sample complexity between the private and the algorithm one reason for this are the relatively large constants in the privacy analysis of the choosing mechanism with tighter privacy analysis one could hope to reduce the sample size requirements of our algorithm by up to an order of magnitude it is likely that our algorithm could also benefit from certain steps such as smoothing the output histogram we did not evaluate such techniques here for simplicity and clarity of the experiments but this is promising direction higgs data higgs data running time seconds sample size sample size figure evaluation of our private learning algorithm on the higgs data set the left plot shows the kolmogorov error achieved for various sample sizes and number of steps taken by the maximum error rule the right plot displays the corresponding running times of our algorithm acknowledgments ilias diakonikolas was supported by epsrc grant and marie curie career integration grant ludwig schmidt was supported by madalgo and grant from the initiative references dwork the differential privacy frontier extended abstract in tcc pages chan diakonikolas servedio and sun learning mixtures of structured distributions over discrete domains in soda chan diakonikolas servedio and sun efficient density estimation via piecewise polynomial approximation in stoc pages acharya diakonikolas li and schmidt density estimation in time available at http kearns mansour ron rubinfeld schapire and sellie on the learnability of discrete distributions in proc stoc pages devroye and lugosi combinatorial methods in density estimation springer series in statistics springer birgé estimation of unimodal densities without smoothness assumptions annals of statistics chan diakonikolas servedio and sun density estimation in time using histograms in nips pages bun nissim stemmer and vadhan differentially private release and learning of threshold functions corr grenander on the theory of mortality measurement skand 
prakasa rao estimation of unimodal density sankhya ser groeneboom estimating monotone density in proc of the berkeley conference in honor of jerzy neyman and jack kiefer pages birgé estimating density under order restrictions nonasymptotic minimax risk ann of pages balabdaoui and wellner estimation of density limit distribution theory and the spline connection the annals of statistics pp umbgen and rufibach maximum likelihood estimation of density and its distribution function basic properties and uniform consistency bernoulli walther inference and modeling with distributions stat science freund and mansour estimating mixture of two product distributions in colt feldman donnell and servedio learning mixtures of product distributions over discrete domains in focs pages daskalakis diakonikolas and servedio learning distributions via testing in soda pages warner randomized response survey technique for eliminating evasive answer bias journal of the american statistical association duchi jordan and wainwright local privacy and statistical minimax rates in focs pages duchi wainwright and jordan local privacy and minimax bounds sharp rates for probability estimation in nips pages hardt ligett and mcsherry simple and practical algorithm for data release in nips li hay miklau and wang and query answering algorithm for range queries under differential privacy pvldb beimel nissim and stemmer private learning and sanitization pure approximate differential privacy in random pages dvoretzky kiefer and wolfowitz asymptotic minimax character of the sample distribution function and of the classical multinomial estimator ann mathematical statistics rote the convergence rate of the sandwich algorithm for approximating convex functions computing mcsherry and talwar mechanism design via differential privacy in focs pages pierre baldi peter sadowski and daniel whiteson searching for exotic particles in physics with deep learning nature communications dwork rothblum and vadhan boosting and differential privacy in focs 
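To make the maximum error rule described above concrete, here is a minimal non-private sketch in Python (ours, not the authors' implementation): it builds the empirical CDF, greedily inserts the point of largest deviation for a fixed number of steps, and reports the resulting Kolmogorov error. The private variant would additionally select the split point via the choosing mechanism and perturb interval weights with Laplace noise, both omitted here; the function names and toy data are illustrative assumptions.

```python
import numpy as np

def empirical_cdf(samples, domain_size):
    """CDF of the empirical distribution over {0, ..., domain_size - 1}."""
    counts = np.bincount(samples, minlength=domain_size)
    return np.cumsum(counts) / len(samples)

def max_error_rule(cdf, num_steps):
    """Greedy piecewise-linear approximation of a monotone CDF.

    Starts from the straight line interpolating the endpoints and repeatedly adds
    the domain point where the current approximation deviates most from the target
    CDF (the non-private maximum error rule)."""
    N = len(cdf)
    knots_x = [0, N - 1]
    knots_y = [cdf[0], cdf[-1]]
    xs = np.arange(N)
    for _ in range(num_steps):
        approx = np.interp(xs, knots_x, knots_y)   # current piecewise-linear CDF
        j = int(np.argmax(np.abs(approx - cdf)))   # point of maximum error
        if j in knots_x:
            break
        idx = int(np.searchsorted(knots_x, j))
        knots_x.insert(idx, j)
        knots_y.insert(idx, cdf[j])
    return np.array(knots_x), np.array(knots_y)

# Toy usage: approximate the CDF of a discretized distribution and report the
# Kolmogorov error max_j |F_hat(j) - F(j)|.
rng = np.random.default_rng(0)
N = 10_000
samples = np.clip(rng.normal(3000, 400, size=50_000).astype(int), 0, N - 1)
F = empirical_cdf(samples, N)
kx, ky = max_error_rule(F, num_steps=20)
F_hat = np.interp(np.arange(N), kx, ky)
print("Kolmogorov error:", np.abs(F_hat - F).max())
```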
sample complexity of learning mahalanobis distance metrics kristin branson janelia research campus hhmi bransonk nakul verma janelia research campus hhmi verman abstract metric learning seeks transformation of the feature space that enhances prediction quality for given task in this work we provide sample complexity rates for supervised metric learning we give matching and showing that sample complexity scales with the representation dimension when no assumptions are made about the underlying data distribution in addition by leveraging the structure of the data distribution we provide rates to specific notion of the intrinsic complexity of given dataset allowing us to relax the dependence on representation dimension we show both theoretically and empirically that augmenting the metric learning optimization criterion with simple regularization is important and can help adapt to dataset intrinsic complexity yielding better generalization thus partly explaining the empirical success of similar regularizations reported in previous works introduction in many machine learning tasks data is represented in euclidean space the distance in this space is then used to compare observations in methods such as clustering and classification often this distance is not ideal for the task at hand for example the presence of uninformative or mutually correlated measurements arbitrarily inflates the distances between pairs of observations metric learning has emerged as powerful technique to learn metric in the representation space that emphasizes feature combinations that improve prediction while suppressing spurious measurements this has been done by exploiting class labels or other forms of supervision to find mahalanobis distance metric that respects these annotations despite the popularity of metric learning methods few works have studied how problem complexity scales with key attributes of the dataset in particular how do we expect generalization error to theoretically and one varies the number of informative and uninformative measurements or changes the noise levels in this work we develop two general frameworks for analysis of supervised metric learning the metric learning framework uses class label information to derive distance constraints the objective is to learn metric that yields smaller distances between examples from the same class than those from different classes algorithms that optimize such objectives include mahalanobis metric for clustering mmc large margin nearest neighbor lmnn and information theoretic metric learning itml instead of using distance comparisons as proxy however one can also optimize for specific prediction task directly the second framework the metric learning framework explicitly incorporates the hypotheses associated with the prediction task to learn effective distance metrics examples in this regime include and our analysis shows that in both frameworks the sample complexity scales with dataset representation dimension theorems and and this dependence is necessary in the absence of assumptions about the underlying data distribution theorems and by considering any lipschitz loss our results improve upon previous sample complexity results see section and for the first time provide matching lower bounds in light of our observation that data measurements often include uninformative or weakly informative features we expect metric that yields good generalization performance to such features and accentuate the relevant ones we thus formalize the metric learning complexity of 
given dataset in terms of the intrinsic complexity of the optimal metric for mahalanobis metrics we characterize intrinsic complexity by the norm of the matrix representation of the metric we refine our sample complexity results and show bound for both frameworks that relaxes the dependence on representation dimension and instead scales with the dataset intrinsic metric learning complexity theorem based on our result we propose simple variation on the empirical risk minimizing erm algorithm that returns metric of complexity that jointly minimizes the observed sample bias and the expected variance for metrics of fixed complexity this balancing criterion can be viewed as structural risk minimizing algorithm that provides better generalization performance than an erm algorithm and justifies of weighting metrics in the optimization criteria for metric learning partly explaining empirical success of similar objectives we experimentally validate how the basic principle of normregularization can help enhance the prediction quality even for existing metric learning algorithms on benchmark datasets section our experiments highlight that indeed helps learn weighting metrics that better adapt to the signal in data in regimes preliminaries in this section we define our notation and explicitly define the and learning frameworks given representation space rd we want to learn weighting or on that minimizes some notion of error on data drawn from fixed unknown distribution on argminm err where is the class of weighting metrics σmax we constrain the maximum singular value σmax to remove arbitrary scalings for supervised metric learning this error is typically and can be defined in two intuitive ways the framework prefers metrics that bring data from the same class closer together than those from opposite classes the corresponding error then measures how the distances amongst data violate class labels errλdist φλ ρm where φλ ρm is generic loss function that computes the degree of violation between weighted distance ρm km and the label agreement and penalizes it by factor for example could penalize distances that are more than some upper limit and distances that are less than some lower limit min ρm if φl ρm min ρm otherwise note that we are looking at the linear form of the metric usually the corresponding quadratic form is discussed in the literature which is necessarily positive where max mmc optimizes an efficiently computable variant of eq by constraining the aggregate distances while maximizing the aggregate distances itml explicitly includes the upper and lower limits with an added regularization on the learned to be close to metric of interest while we will discuss that handle distances between pairs of observations it is easy to extend to relative distances among triplets min ρm ρm if φλtriple ρm ρm otherwise lmnn is popular variant in which instead of looking at all triplets it focuses on triplets in local neighborhoods improving the quality of local distance comparisons the framework prefers metrics that directly improve the prediction quality for downstream task let represent hypothesis class associated with the prediction task of interest each then the corresponding error becomes errhypoth inf example methods include which minimizes ranking errors for information retrieval and which incorporates network topology constraints for predicting network connectivity structure metric learning sample complexity general case in any practical setting we estimate the ideal weighting metric by minimizing the 
empirical version of the error criterion from finite size sample from let sm denote sample of size and err sm denote the corresponding empirical error we can then define the empirical risk minimizing metric based on samples as mm argminm err sm and compare its generalization performance to that of the theoretically optimal that is err mm err error analysis given an sequence of observations from pair zi xi yi we can pair the observations together to form paired sm of size and define the distance error induced by metric as pair errλdist sm ρm then for any distribution that is each kxk we have the theorem let φλ be loss function that is in the first argument then with probability at least over an draw of samples from an unknown support distribution paired as sm we have pair sup errdist errλdist sm λb ln while we pair samples into independent pairs it is common to consider all possibly dependent pairs by exploiting independence we provide simpler analysis yielding sample complexity rates which is similar to the dependent case we only present the results for paired comparisons the results are easily extended to triplet comparisons all the supporting proofs are provided in appendix this implies bound on our key quantity of interest eq to achieve estimation error rate λb ln samples are sufficient showing that one never needs more than number proportional to examples to achieve the desired level of accuracy with high probability since many applications involve data we next study if such strong dependency on is necessary it turns out that even for simple loss functions like φλl eq there are data distributions for which one can not ensure good estimation error with fewer than linear in samples theorem let be any algorithm that given an sample sm of size from fixed unknown bounded support distribution returns weighting metric from that minimizes the empirical error with respect to loss function φλl there exist indep of for all there exists bounded support distribution such that if psm errdist sm errdist while this strong dependence on may seem discouraging note that here we made no assumptions about the underlying structure of the data distribution one may be able to achieve more relaxed dependence on in settings in which individual features contain varying amounts of useful information this is explored in section error analysis in this setting we consider an set of observations from to obtain the unpaired sample sm zi of size to analyze the of weighting metrics optimized underlying hypothesis class we must measure the classification complexity of the version of the dimension of hypothesis class denoted fatγ encodes the right notion of classification complexity and provides way to relate generalization error to the empirical error at margin in the context of metric learning with respect to ap fixed hypothesis class define the empirical error at margin as errγhypoth sm inf xi yi margin xi yi where margin theorem let be base hypothesis class pick any and let then with probability at least over an draw of samples sm from an unknown distribution min sup errhypoth errhypoth sm ln ln ln as before this implies bound on eq to achieve estimation error rate ln ln samples suffices note that the task of finding an optimal metric only additively increases sample complexity over that of finding the optimal hypothesis from the underlying hypothesis class in contrast to the framework theorem here we get quadratic dependence on the following shows that strong dependence on is necessary in the absence of assumptions 
on the data distribution and base hypothesis class theorem pick any let be base hypothesis class of functions that is closed under addition of constants where for all each maps into the interval after applying an appropriate theshold then for any metric learning algorithm and for any there exists for all there exists distribution if ln psm errhypoth errγhypoth sm where is the dimension of at margin sample complexity for data with and weakly informative features we introduce the concept of the metric learning complexity of given dataset our key observation is that metric that yields good generalization performance should emphasize relevant features while suppressing the contribution of spurious features thus good metric reflects the quality of individual feature measurements of data and their relative value for the learning task we can leverage this and define the metric learning complexity of given dataset as the intrinsic complexity of the weighting metric that yields the best generalization performance for that dataset if multiple metrics yield best performance we select the one with minimum natural way to characterize the intrinsic complexity of weighting metric is via the norm of the matrix using metric learning complexity as our gauge for richness we now refine our analysis in both canonical frameworks we will first analyze sample complexity for metrics then show how to automatically adapt to the intrinsic complexity of the unknown underlying data distribution refinement we start with the following refinement of the metric learning sample complexity for class of frobenius weighting metrics lemma let be any class of weighting metrics on the feature space rd and define supm km let φλ be any loss function that is in the first argument then with probability at least over an draw of samples from an unknown pair distribution paired as sm we have pair sup errdist errdist sm λb ln observe that if our dataset has low metric learning complexity then considering an appropriate class of weighting metrics can help sharpen the sample complexity result yielding bound of course priori we do not know which class of metrics is appropriate we discuss how to automatically adapt to the right complexity class in section refinement effective analysis of metric learning requires accounting for potentially complex interactions between an arbitrary base hypothesis class and the distortion induced by weighting metric to the unknown underlying data distribution to make the analysis tractable while still keeping our base hypothesis class general we assume that is class of recall that for any smooth target function neural network with appropriate number of hidden units and connection weights can approximate arbitrarily well so this class is flexible enough to include most reasonable target hypotheses more formally define the base hypothesis class of neural network with pk hidden units as wi vi kvi where is smooth strictly monotonic activation function with then for generalization error any loss function φλ errλhypoth inf φλ σγ we have the we only present the results for networks in lemma the results are easily extended to multilayer networks since we know the functional form of the base hypothesis class two layer neural net we can provide more precise bound than leaving it as fat lemma let be any class of weighting metrics on the feature space rd and define be two layer neural network supm km for any let base hypothesis class as defined above and φλ be loss function that in its first argument then with probability 
at least over an draw of samples sm from an unknown support distribution we have sup errλhypoth errλhypoth sm bλγ ln automatically adapting to intrinsic complexity while lemmas and provide sample complexity bound tuned to the metric learning complexity of given dataset these results are not directly useful since one can not select the correct normbounded class priori as the underlying distribution is unknown fortunately by considering an appropriate sequence of classes of weighting metrics we can provide uniform bound that automatically adapts to the intrinsic complexity of the unknown underlying data distribution theorem define md km and consider the nested sequence of weighting metric classes let µd be any measure across the sequence such that µd for then for any with probability at least over an draw of sample sm from an unknown distribution for all and all md err errλ sm bλ ln where for error or ln for error for in particular for data distribution that has metric learning complexity at most if there are cbλ ln samples then with probability at least reg err mm errλ reg for mm argmin errλ sm λm dm λm ln δµdm dm km the measure µd above encodes our prior belief on the complexity class md from which target metric is selected by metric learning algorithm given the training sample sm in absence of any prior beliefs µd can be set to for for scale constrained weighting metrics σmax thus for an unknown underlying data distribution with metric learning complexity with number of samples just proportional to we can find good weighting metric this result also highlights that the generalization error of any weighting metric returned by an algorithm is proportional to the smallest class to which it belongs cf eq if two metrics and have similar empirical errors on given sample but have different intrinsic complexities then the expected risk of the two metrics can be considerably different we expect the metric with lower intrinsic complexity to yield better generalization error this partly explains the observed empirical success of optimization for metric learning using this as guiding principle we can design an improved optimization criteria for metric learning that jointly minimizes the sample error and frobenius norm regularization penalty in particular min err sm km for any error criteria err used in downstream prediction task and regularization parameter similar optimizations have been studied before here we explore the practical efficacy of this augmented optimization on existing metric learning algorithms in high noise regimes where dataset intrinsic dimension is much smaller than its representation dimension uci wine dataset uci iris dataset random id metric lmnn itml avg test error avg test error avg test error uci ionosphere dataset random id metric lmnn itml random id metric lmnn itml ambient noise dimension ambient noise dimension ambient noise dimension figure classification performance of lmnn and itml metric learning algorithms without regularization dashed red lines and with regularization solid blue lines on benchmark uci datasets the horizontal dotted line is the classification error of random label assignment drawn according to the class proportions and solid gray line shows classification error of performance with respect to identity metric no metric learning for baseline reference empirical evaluation our analysis shows that the generalization error of metric learning can scale with the representation dimension and regularization can help mitigate this by adapting to the intrinsic 
metric learning complexity of the given dataset we want to explore to what degree these effects manifest in practice we select two popular metric learning algorithms lmnn and itml that are used to find metrics that improve classification quality these algorithms have varying degrees of regularization built into their optimization criteria lmnn implicitly regularizes the metric via its large margin criterion while itml allows for explicit regularization by letting the practitioners specify prior weighting metric we modified the lmnn optimization criteria as per eq to also allow for an explicit controlled by the parameter we can evaluate how the unregularized criteria unmodified lmnn or itml with the prior set to the identity matrix compares to the regularized criteria modified lmnn with best or itml with the prior set to matrix datasets we use the uci benchmark datasets for our experiments ris samples ine samples and onosphere samples datasets each dataset has fixed unknown but low intrinsic dimension we can vary the representation dimension by augmenting each dataset with synthetic correlated noise of varying dimensions simulating regimes where datasets contain large numbers of uninformative features each uci dataset is augmented with synthetic correlated noise as detailed in appendix experimental setup each dataset was randomly split between training validation and test samples we used the default settings for each algorithm for regularized lmnn we picked the best performing parameter from on the validation set for regularized itml we seeded with the discriminating metric we set the prior as the matrix with all zeros except the diagonal entry corresponding to the most discriminating coordinate set to one all the reported results were averaged over runs results figure shows the performance with of lmnn and itml on uci datasets notice that the unregularized versions of both algorithms dashed red lines scale poorly when noisy features are introduced as the number of uninformative features grows the performance of both algorithms quickly degrades to that of classification performance in the original unweighted space with no metric learning solid gray line showing poor adaptability to the signal in the data the regularized versions of both algorithms solid blue lines significantly improve the classification performance remarkably regularized itml shows almost no degradation in classification mance even in very high noise regimes demonstrating strong robustness to noise these results underscore the value of regularization in metric learning showing that regularization encourages adaptability to the intrinsic complexity and improved robustness to noise discussion and related work previous theoretical work on metric learning has focused almost exclusively on analyzing upperbounds on the sample complexity in the framework without exploring any intrinsic properties of the input data our work improves these results and additionally analyzes the classifierbased framework it is to best of our knowledge the first to provide lower bounds showing that the dependence on is necessary importantly it is also the first to provide an analysis of sample rates based on notion of intrinsic complexity of dataset which is particularly important in metric learning where we expect the representation dimension to be much higher than intrinsic complexity studied the convex losses for stable algorithms and showed an sublinear in which can be relaxed by applying techniques from we analyze the erm criterion directly thus 
no assumptions are made about the optimization algorithm and provide precise characterization of when the problem complexity is independent of lm our lowerbound thm shows that the dependence on is necessary for erm in the case and analyzed the erm criterion and are most similar to our results providing an upperbound for the framework shows rate for thresholds on bounded convex losses for metric learning without explicitly studying the dependence on our thm improves this result by considering arbitrary possibly lipschitz losses and explicitly revealing the dependence on provides an alternate erm analysis of metrics and parallels our analysis in lemma while they focus on analyzing specific optimization criterion thresholds on the hinge loss with our result holds for general lipschitz losses our theorem extends it further by explicitly showing when we can expect good generalization performance from given dataset provides an interesting analysis for robust algorithms by relying upon the existence of partition of the input space where each cell has similar training and test losses their sample complexity bound scales with the partition size which in general can be exponential in it is worth emphasizing that none of these closely related works discuss the importance of or leverage the intrinsic structure in data for the metric learning problem our results in section formalize an intuitive notion of dataset intrinsic complexity for metric learning and show sample complexity rates that are finely tuned to this metric learning complexity our lower bounds indicate that exploiting the structure is necessary to get rates that don scale with representation dimension the framework we discuss has parallels with the kernel learning and similarity learning literature the typical focus in kernel learning is to analyze the generalization ability of linear separators in hilbert spaces similarity learning on the other hand is concerned about finding similarity function that does not necessarily has positive semidefinite structure that can best assist in linear classification our work provides complementary analysis for learning explicit linear transformations of the given representation space for arbitrary hypotheses classes our theoretical analysis partly justifies the empirical success of regularization as well our empirical results show that such regularization not only helps in designing new metric learning algorithms but can even benefit existing metric learning algorithms in regimes acknowledgments we would like to thank aditya menon for insightful discussions and the anonymous reviewers for their detailed comments that helped improve the final version of this manuscript references weinberger and saul distance metric learning for large margin nearest neighbor classification journal of machine learning research jmlr davis kulis jain sra and dhillon metric learning international conference on machine learning icml pages schultz and joachims learning distance metric from relative comparisons neural information processing systems nips xing ng jordan and russell distance metric learning with application to clustering with neural information processing systems nips pages mcfee and lanckriet metric learning to rank international conference on machine learning icml shaw huang and jebara learning distance metric from network neural information processing systems nips lim mcfee and lanckriet robust structural metric learning international conference on machine learning icml law thome and cord fantope 
regularization in metric learning computer vision and pattern recognition cvpr anthony and bartlett neural network learning theoretical foundations cambridge university press hornik stinchcombe and white multilayer feedforward networks are universal approximators neural networks bache and lichman uci machine learning repository jin wang and zhou regularized distance metric learning theory and algorithm neural information processing systems nips pages bousquet and elisseeff stability and generalization journal of machine learning research jmlr bian and tao learning distance metric by empirical loss minimization international joint conference on artificial intelligence ijcai pages cao guo and ying generalization bounds for metric and similarity learning corr bellet and habrard robustness and generalization for metric learning corr ying and campbell generalization bounds for learning the kernel conference on computational learning theory colt cortes mohri and rostamizadeh new generalization bounds for learning kernels international conference on machine learning icml balcan blum and srebro improved guarantees for learning via similarity functions conference on computational learning theory colt bellet habrard and sebban similarity learning for provably accurate sparse linear classification international conference on machine learning icml guo and ying generalization classification via regularized similarity learning neural computation bellet habrard and sebban survey on metric learning for feature vectors and structured data corr bartlett and mendelson rademacher and gaussian complexities risk bounds and structural results journal of machine learning research jmlr vershynin introduction to the analysis of random matrices in compressed sensing theory and applications 
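As a concrete illustration of the norm-regularized metric-learning criterion analyzed above (an empirical pairwise distance loss plus a Frobenius-norm penalty), the sketch below evaluates a hinge-style reading of the distance-based loss with upper and lower limits U and L under a linear metric M. This is a minimal sketch, not the paper's algorithm: the hinge form, the function names, and the default constants U, L, and lam are illustrative assumptions.

```python
import numpy as np

def pairwise_distance_loss(M, X, y, U=1.0, L=2.0):
    """Hinge-style version of the phi_{U,L} pairwise loss.

    Uses the linear form of the metric: rho_M(x, x') = ||M (x - x')||^2.
    Same-class pairs are penalized for exceeding the upper limit U;
    different-class pairs are penalized for falling below the lower limit L.
    """
    n = len(y)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = M @ (X[i] - X[j])
            dist = float(diff @ diff)
            if y[i] == y[j]:
                total += max(0.0, dist - U)
            else:
                total += max(0.0, L - dist)
            count += 1
    return total / max(count, 1)

def regularized_objective(M, X, y, lam=0.1, U=1.0, L=2.0):
    """Empirical loss plus a Frobenius-norm penalty lam * ||M||_F,
    i.e., the joint sample-bias / complexity criterion discussed above."""
    return pairwise_distance_loss(M, X, y, U, L) + lam * float(np.linalg.norm(M, "fro"))

# usage sketch on random data
X = np.random.randn(20, 5)
y = np.random.randint(0, 2, size=20)
print(regularized_objective(np.eye(5), X, y))
```

A metric with many small or zero rows has small Frobenius norm, so this criterion prefers metrics that suppress uninformative coordinates unless the data provide evidence otherwise, which is the behavior the analysis above predicts should generalize better.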
learning recurrent attention models jimmy ba university of toronto roger grosse university of toronto jimmy rgrosse ruslan salakhutdinov university of toronto brendan frey university of toronto rsalskhu frey abstract despite their success convolutional neural networks are computationally expensive because they must examine all image locations stochastic models have been shown to improve computational efficiency at test time but they remain difficult to train because of intractable posterior inference and high variance in the stochastic gradient estimates borrowing techniques from the literature on training deep generative models we present the recurrent attention model method for training stochastic attention networks which improves posterior inference and which reduces the variability in the stochastic gradients we show that our method can greatly speed up the training time for stochastic attention networks in the domains of image classification and caption generation introduction convolutional neural networks trained have been shown to substantially outperform previous approaches to various supervised learning tasks in computer vision despite their wide success convolutional nets are computationally expensive when processing input images because they must examine all image locations at fine scale this has motivated recent work on visual models which reduce the number of parameters and computational operations by selecting informative regions of an image to focus on in addition to computational speedups models can also add degree of interpretability as one can understand what signals the algorithm is using by seeing where it is looking one such approach was recently used by to automatically generate image captions and highlight which image region was relevant to each word in the caption there are two general approaches to image understanding hard and soft attention soft attention based models obtain features from weighted average of all image locations where locations are weighted based on model saliency map by contrast hard attention model chooses typically stochastically series of discrete glimpse locations soft attention models are computationally expensive as they have to examine every image location we believe that the computational gains of attention require hard attention model unfortunately this comes at cost while soft attention models can be trained with standard backpropagation this does not work for hard attention models whose glimpse selections are typically discrete training stochastic hard attention models is difficult because the loss gradient involves intractable posterior expectations and because the stochastic gradient estimates can have high variance the latter problem was also observed by in the context of memory networks in this work we propose the recurrent attention model method for training stochastic recurrent attention models which deals with the problems of intractable inference and gradients by taking advantage of several advances from the literature on training deep generative models inference networks the reweighted algorithm and control variates during training the approximates posterior expectations using importance sampling with proposal distribution computed by an inference network unlike the prediction network the inference network has access to the object category label which helps it choose better glimpse locations as the name suggests we train both networks using the reweighted algorithm in addition we reduce the variance of the stochastic gradient 
estimates using carefully chosen control variates in combination these techniques constitute an improved training procedure for stochastic attention models the main contributions of our work are the following first we present new learning algorithm for stochastic attention models and compare it with training method based on variational inference second we develop novel control variate technique for gradient estimation which further speeds up training finally we demonstrate that our stochastic attention model can learn to classify translated and scaled mnist digits and generate image captions by attending to the relevant objects in images and their corresponding scale our model achieves similar performance to the variational method but with much faster training times related work in recent years there has been flurry of work on neural networks such models have been applied successfully in image classification object tracking machine translation caption generation and image generation attention has been shown both to improve computational efficiency and to yield insight into the network behavior our work is most closely related to stochastic hard attention models major difficulty of training such models is that computing the gradient requires taking expectations with respect to the posterior distribution over saccades which is typically intractable this difficulty is closely related to the problem of posterior inference in training deep generative models such as sigmoid belief networks since our proposed method draws heavily from the literature on training deep generative models we overview various approaches here one of the challenges of training deep or recurrent generative model is that posterior inference is typically intractable due to the explaining away effect one way to deal with intractable inference is to train separate inference network whose job it is to predict the posterior distribution classic example was the helmholtz machine where the inference network predicts mean field approximation to the the generative and inference networks are trained with the algorithm in the wake phase the generative model is updated to increase variational lower bound on the data likelihood in the sleep phase data are generated from the model and the inference network is trained to predict the latent variables used to generate the observations the approach was limited by the fact that the wake and sleep phases were minimizing two unrelated objective functions more recently various methods have been proposed which unify the training of the generative and inference networks into single objective function neural variational inference and learning nvil trains both networks to maximize variational lower bound on the since the stochastic gradient estimates in nvil are very noisy the method of control variates is used to reduce the variance in particular one uses an algortihm from reinforcement learning called reinforce which attempts to infer reward baseline for each instance the choice of baseline is crucial to good performance nvil uses separate neural network to compute the baseline an approach also used by in the context of attention networks control variates are discussed in more detail in section the reweighted approach is similar to traditional but uses importance sampling in place of mean field inference to approximate the posterior reweighted is described more formally in section another method based on inference networks is variational autoencoders which exploit clever reparameterization of the 
probabilistic model in order to improve the signal in the stochastic gradients. NVIL, reweighted wake-sleep, and variational autoencoders have all been shown to achieve considerably higher test log-likelihoods compared to the original wake-sleep algorithm. (In the literature, the inference network is often called a recognition network; we avoid this terminology to prevent confusion with the task of image classification.)

[Figure: the recurrent attention model, showing the inference network and the prediction network, with observations x_n and actions a_n.]

(The term Helmholtz machine is often used loosely to refer to the entire collection of techniques which simultaneously learn a generative network and an inference network.)

Recurrent attention model. We now describe our recurrent attention model. Given an image x, the network first chooses a sequence of glimpses a = (a_1, ..., a_N), and after each glimpse receives an observation x_n computed by a mapping x_n = f(x, a_n). This mapping might, for instance, extract an image patch at a given scale. The first glimpse is based on a coarse version of the input, while subsequent glimpses are chosen based on information acquired from previous glimpses. The glimpses are chosen stochastically according to a distribution p(a_n | a_1, ..., a_{n-1}, x, θ), where θ denotes the parameters of the network. This is in contrast with soft attention models, which deterministically allocate attention across all image locations. After the last glimpse, the network predicts a distribution over the target y, for instance the caption or the image category.

As shown in the figure, the core of the attention network is a recurrent network, which we term the prediction network, whose output at each time step is an action (saccade) used to compute the input at the next time step. A coarse version of the input image is fed to the network at the first time step, and the network predicts the class label at the final time step. Importantly, the input is fed to the second layer, while the class label prediction is made by the first layer, preventing information from propagating directly from the image to the output. This prevents local optima in which the network learns to predict directly from the input, disregarding attention completely.

On top of the prediction network is an inference network, which receives both the class label and the attention network's top-layer representation as inputs. It tries to predict the posterior distribution over the next saccade, conditioned on the image category being correctly predicted; its job is to guide the posterior sampler during training, thereby acting as a teacher for the attention network. The inference network is described further in a later section.

One of the benefits of stochastic attention models is that the mapping f can be localized to a small image region or a coarse granularity, which means it can potentially be made very efficient. Furthermore, f need not be differentiable, which allows for operations, such as choosing a scale, that would be difficult to implement in a soft attention network. The cost of this flexibility is that standard backpropagation cannot be applied, so instead we use the novel algorithms described in the next section.

Learning. In this work, we assume that we have a dataset with labels y for the supervised prediction task (e.g., the object category). In contrast to the supervised saliency prediction task, there are no labels for where to attend. Instead, we learn an attention policy based on the idea that the best locations to attend to are the ones which most robustly lead the model to predict the correct category. In particular, we aim to maximize the probability of the class label, or equivalently minimize its negative log-probability, marginalizing over the actions at each glimpse:

log p(y | x, θ) = log Σ_a p(a | x, θ) p(y | a, x, θ).
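To make this objective concrete, here is a minimal Monte Carlo sketch of estimating the marginal log-likelihood by sampling glimpse sequences from the attention policy. The callables sample_glimpses and class_prob, the sample count, and the smoothing constant are illustrative placeholders standing in for the recurrent prediction network, not part of the paper.

```python
import numpy as np

def log_marginal_likelihood(sample_glimpses, class_prob, x, y, num_samples=5):
    """Monte Carlo estimate of log p(y | x) = log sum_a p(a | x) p(y | a, x).

    sample_glimpses(x): draws one glimpse sequence a ~ p(a | x, theta)
        from the stochastic attention policy.
    class_prob(x, a, y): returns p(y | a, x, theta), the probability the
        prediction network assigns to the true label after glimpses a.
    Both callables stand in for the recurrent prediction network.
    """
    probs = []
    for _ in range(num_samples):
        a = sample_glimpses(x)             # a^(m) ~ p(a | x, theta)
        probs.append(class_prob(x, a, y))  # p(y | a^(m), x, theta)
    # Log of the sample average of p(y | a^(m), x): a consistent estimator
    # whose expectation lower-bounds log p(y | x) by Jensen's inequality.
    return float(np.log(np.mean(probs) + 1e-12))
```

Training requires gradients of this quantity, which is what the estimators derived next provide.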
We train the attention model by maximizing a lower bound on log p(y | x, θ). We first describe a previous approach, which maximized a variational lower bound; we then introduce our proposed method, which directly estimates the gradient of log p(y | x, θ) and, as we show, can be seen as maximizing a tighter lower bound.

Variational lower bound. We first outline the approach of earlier work, which trained the model to maximize a variational lower bound on log p(y | x, θ). Let q(a) be an approximating distribution over glimpse sequences. The lower bound is then

F = Σ_a q(a) log [ p(a | x, θ) p(y | a, x, θ) / q(a) ] ≤ log p(y | x, θ).

In the case where q(a) is the prior p(a | x, θ), as considered in that work, this reduces to

F = Σ_a p(a | x, θ) log p(y | a, x, θ).

The learning rules can be derived by taking derivatives of this objective with respect to the model parameters:

∂F/∂θ = Σ_a p(a | x, θ) [ ∂/∂θ log p(y | a, x, θ) + log p(y | a, x, θ) ∂/∂θ log p(a | x, θ) ].

The summation can be approximated using M Monte Carlo samples a^(1), ..., a^(M) drawn from p(a | x, θ):

∂F/∂θ ≈ (1/M) Σ_m [ ∂/∂θ log p(y | a^(m), x, θ) + log p(y | a^(m), x, θ) ∂/∂θ log p(a^(m) | x, θ) ].

The partial derivative terms can each be computed using standard backpropagation. This suggests a simple training algorithm: for each image, one first draws the samples from the prior and then updates the parameters according to the estimate above. As observed previously, one must carefully use control variates in order to make this technique practical; we defer the discussion of control variates to a later section.

An improved lower bound on the log-likelihood. The variational method described above has some counterintuitive properties early in training. First, because it averages the log-likelihood over actions, it greatly amplifies the differences in probabilities assigned to the true category by different bad glances; for instance, a glimpse sequence which leads to a very small probability being assigned to the correct class is considered much worse than one which leads to a slightly larger but still small probability, even though in practice they may be equally bad, since they have both missed the relevant information. Second, another odd behavior is that all glimpse sequences are weighted equally in the gradient; it would be better if the training procedure focused its effort on those glances which contain the relevant information. Both of these effects contribute noise to the training procedure, especially in its early stages.

Instead, we adopt an approach based on reweighted wake-sleep, where we attempt to maximize the marginal log-likelihood directly. We differentiate the marginal log-likelihood objective with respect to the model parameters:

∂/∂θ log p(y | x, θ) = Σ_a p(a | x, y, θ) ∂/∂θ log p(a, y | x, θ).

The summation and the normalizing constant are both intractable to evaluate, so we estimate them using importance sampling. We must define a proposal distribution q(a), which ideally should be close to the posterior p(a | x, y, θ); one reasonable choice is the prior p(a | x, θ), but another choice is described in the next section. Normalized importance sampling gives a biased but consistent estimator of the gradient. Given samples a^(1), ..., a^(M) drawn from q, the unnormalized importance weights are computed as w_m = p(a^(m), y | x, θ) / q(a^(m)), and the Monte Carlo estimate of the gradient is

∂/∂θ log p(y | x, θ) ≈ Σ_m w̃_m ∂/∂θ log p(a^(m), y | x, θ),

where w̃_m = w_m / Σ_{m'} w_{m'} are the normalized importance weights. When q is chosen to be the prior, this approach is equivalent to the estimator used in reweighted wake-sleep for learning generative networks. Our importance-sampling-based estimator can also be viewed as a gradient ascent update on the objective function log (1/M) Σ_m w_m. Combining Jensen's inequality with the unbiasedness of (1/M) Σ_m w_m shows that this is a lower bound on the marginal log-likelihood:

E_q [ log (1/M) Σ_m w_m ] ≤ log E_q [ (1/M) Σ_m w_m ] = log p(y | x, θ).

We relate this to the previous section by noting that, with q taken to be the prior, F = E [ log w ]. Another application of Jensen's inequality shows that our proposed bound is at least as accurate:

F = E [ log w ] ≤ E [ log (1/M) Σ_m w_m ] ≤ log p(y | x, θ).

Burda et al. further analyzed a closely related importance-sampling-based estimator in the context of generative models, bounding the mean absolute deviation and showing that the bias decreases monotonically with the number of samples.

Training an inference network. Late in
training once the attention model has learned an effective policy the prior distribution is reasonable choice for the proposal distribution as it puts significant probability mass on good actions but early in training the model may have only small probability of choosing good set of glimpses and the prior may have little overlap with the posterior to deal with this we train an inference network to predict given the observations as well as the class label where the network should look to correctly predict that class see figure with this additional information the inference network can act as teacher for the attention policy the inference network predicts sequence of glimpses stochastically an this distribution is analogous to the prior except that each decision also takes into account the class label we denote the parameters for the inference network as during training the prediction network is learnt by following the gradient of the estimator in eqn with samples drawn from the inference network output our training procedure for the inference network parallels the step of reweighted wakesleep intuitively the inference network is most useful if it puts large probability density over locations in an image that are most informative for predicting class labels we therefore train the inference weights to minimize the divergence between the recognition model prediction and posterior distribution from the attention model min dkl min log the gradient update for the recognition weights can be obtained by taking the derivatives of eq with respect to the recognition weights log ep since the posterior expectation is intractable we estimate it with importance sampling in fact we reuse the importance weights computed for the prediction network update see eqn to obtain the following gradient estimate for the recognition network log wm control variates the speed of convergence of gradient ascent with the gradients defined in eqns and suffers from high variance of the stochastic gradient estimates past work using similar gradient updates has found significant benefit from the use of control variates or reward baselines to reduce the variance choosing effective control variates for the stochastic gradient estimators amounts to finding function that is highly correlated with the gradient vectors and whose expectation is known or tractable to compute unfortunately good choice of control variate is highly we first note that log eq eq log the terms inside the expectation are very similar to the gradients in eqns and suggesting that stochastic estimates of these expectations would make good control variates to increase the correlation between the gradients and the control variates we reuse the same set of samples and importance weights for the gradients and control variates using these control variates results in the gradient estimates for the prediction and recognition networks we obtain log log wm log our use of control variates does not bias the gradient estimates beyond the bias which is present due to importance sampling however as we show in the experiments the resulting estimates have much lower variance than those of eqns and following the analogy with reinforcement learning highlighted by these control variates can also be viewed as reward baselines bp bq eq eq eq ep ep pm where is the number of samples drawn for proposal var effective sample size var variance of estimated gradient training error figure left training error as function of the number of updates middle variance of the gradient estimates 
right effective sample size max horizontal axis thousands of updates var variational baseline our proposed method uses the inference networks for the proposal distribution uses control variates encouraging exploration similarly to other methods based on reinforcement learning stochastic attention networks face the problem of encouraging the method to explore different actions since the gradient in eqn only rewards or punishes glimpse sequences which are actually performed any part of the space which is never visited will receive no reward signal introduced several heuristics to encourage exploration including raising the temperature of the proposal distribution regularizing the attention policy to encourage viewing all image locations and adding regularization term to encourage high entropy in the action distribution we have implemented all three heuristics for the and for the baselines while these heuristics are important for good performance of the baselines we found that they made little difference to the because the basic method already explores adequately experimental results to measure the effectiveness of the proposed method we first investigated toy classification task involving variant of the mnist handwritten digits dataset where transformations were applied to the images we then evaluated the proposed method on substantially more difficult image caption generation task using the dataset translated scaled mnist we generated dataset of randomly translated and scaled handwritten digits from the mnist dataset each digit was placed in black background image at random location and scale the task was to identify the digit class the attention models were allowed four glimpses before making classification prediction the goal of this experiment was to evaluate the effectiveness of our proposed model compared with the variational approach of for both the and the baseline the architecture was stochastic attention model which used relu units in all recurrent layers the actions included both continuous and discrete latent variables corresponding to glimpse scale and location respectively the distribution over actions was represented as gaussian random variable for the location and an independent multinomial random variable for the scale all networks were trained using adam with the learning rate set to the highest value that allowed the model to successfully converge to sensible attention policy the classification performance results are shown in table in figure the is compared with the variational baseline each using the same number of samples in order to make computation time roughly equivalent we also show comparisons against ablated versions of the where the control variates and inference network were removed when the inference network was removed the prior was used for the proposal distribution in addition to the classification results we measured the effective sample size ess of our method with and without control variates and the inference network ess is standard metric for evaluating importance samplers and is defined as wm where wm denotes the normalized importance weights results are shown in figure using the inference network reduced the variances in var training error test err no no exploration exploration no exploration exploration table classification error rate comparison for the figure the effect of the exploration heuristics on var the variational baseline and the training negative loglikelihood attention models trained using different algorithms on translated scaled mnist 
the numbers are reported after million updates using samples table bleu score performance on the figure training negative on for the first updates see figure for the labels dataset for our and the variational method gradient estimation although this improvement did not reflect itself in the ess control variates improved both metrics in section we described heuristics which encourage the models to explore the action space figure compares the training with and without these heuristics without the heuristics the variational method quickly fell into local minimum where the model predicted only one glimpse scale over all images the exploration heuristics fixed this problem by contrast the did not appear to have this problem so the heuristics were not necessary generating captions using attention we also applied the method to learn stochastic attention model similar to for generating image captions we report results on the dataset the split followed the same protocol as used in previous work the goal of this experiment was to examine the improvement of the over the variational method for learning with realistic imgaes similarly to we first ran convolutional network and the attention network then determined which part of the convolutional net representation to attend to the attention network predicted both which layer to attend to and location within the layer in contrast with where the scale was held fixed because convolutional net shrinks the representation with choosing layer is analogous to choosing scale at each glimpse the inference network was given the immediate preceding word in the target sentences we compare the bleu scores of our and the variational method in in table figure shows training curves for both models we observe that obtained similar performance to the variatinoal method but trained more efficiently conclusions in this paper we introduced the recurrent attention model an efficient method for training stochastic attention models this method improves upon prior work by using the reweighted algorithm to approximate expectations from the posterior over glimpses we also introduced control variates to reduce the variability of the stochastic gradients our method reduces the variance in the gradient estimates and accelerates training of attention networks for both invariant handwritten digit recognition and image caption generation acknowledgments this work was supported by the fields institute samsung onr grant and the hardware donation of nvidia corporation references krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in neural information processing systems ba mnih and kavukcuoglu multiple object recognition with visual attention in international conference on learning representations mnih heess graves and kavukcuoglu recurrent models of visual attention in neural information processing systems tang srivastava and salakhutdinov learning generative models with visual attention in neural information processing systems xu ba kiros cho courville salakhutdinov zemel and bengio show attend and tell neural image caption generation with visual attention in international conference on machine learning bahdanau cho and bengio neural machine translation by jointly learning to align and translate in international conference on learning representations zaremba and sutskever reinforcement learning neural turing machines dayan hinton neal and zemel the helmholtz machine neural computation bornschein and bengio reweighted paisley blei and jordan 
variational bayesian inference with stochastic search in international conference on machine learning mnih and gregor neural variational inference and learning in belief networks in international conference on machine learning larochelle and hinton learning to combine foveal glimpses with boltzmann machine in neural information processing systems denil bazzani larochelle and de freitas learning where to attend with deep architectures for image tracking neural computation april graves generating sequences with recurrent neural networks gregor danihelka graves and wierstra draw recurrent neural network for image generation radford neal connectionist learning of belief networks artificial intelligence williams simple statistical algorithms for connectionist reinforcement learning machine learning kingma and welling variational bayes in international conference on learning representations rezende mohamed and wierstra stochastic backpropagation and approximate inference in deep generative models in international conference on machine learning itti koch and niebur model of visual attention for rapid scene analysis ieee transactions of pattern analysis and machine intelligence november judd ehinger durand and torralba learning to predict where humans look in international conference on computer vision tang and salakhutdinov learning stochastic feedforward neural networks in neural information processing systems burda grosse and salakhutdinov importance weighted autoencoders lex weaver and nigel tao the optimal reward baseline for reinforcement learning in proceedings of the seventeenth conference on uncertainty in artificial intelligence pages morgan kaufmann publishers lecun bottou bengio and haffner learning applied to document recognition proceedings of the ieee micah hodosh peter young and julia hockenmaier framing image description as ranking task data models and evaluation metrics journal of artificial intelligence research pages kingma and ba adam method for stochastic optimization andrej karpathy and li deep alignments for generating image descriptions arxiv preprint 
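Returning to the importance-weighted training procedure described above, the sketch below shows how the normalized importance weights and the effective-sample-size diagnostic reported in the experiments might be computed. Working in log space is an implementation choice for numerical stability, and the ESS formula used here is the standard one (the reciprocal of the sum of squared normalized weights); both are assumptions rather than quotations from the paper.

```python
import numpy as np

def normalized_importance_weights(log_p_joint, log_q):
    """Normalized weights w_m proportional to p(a^(m), y | x) / q(a^(m)).

    log_p_joint[m] = log p(a^(m), y | x, theta) for the m-th proposal sample.
    log_q[m]       = log q(a^(m)), the proposal (inference network or prior).
    """
    log_w = np.asarray(log_p_joint, dtype=float) - np.asarray(log_q, dtype=float)
    log_w -= log_w.max()          # shift for stability; cancels after normalizing
    w = np.exp(log_w)
    return w / w.sum()

def effective_sample_size(w_tilde):
    """ESS = 1 / sum_m w_tilde_m^2: equals M when all weights are equal and
    approaches 1 when a single sample dominates the estimate."""
    w_tilde = np.asarray(w_tilde, dtype=float)
    return 1.0 / float(np.sum(w_tilde ** 2))

# usage sketch: the gradient estimate combines per-sample gradients as
#   sum_m w_tilde_m * d/dtheta log p(a^(m), y | x, theta)
w = normalized_importance_weights([-2.3, -1.1, -4.0], [-1.5, -1.5, -1.5])
print(w, effective_sample_size(w))
```

A collapsing ESS is the symptom that motivates both the inference network (a better proposal keeps the weights balanced) and the control variates described earlier.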
robust gaussian graphical modeling with the trimmed graphical lasso lozano ibm watson research center aclozano eunho yang ibm watson research center eunhyang abstract gaussian graphical models ggms are popular tools for studying network structures however many modern applications such as gene network discovery and social interactions analysis often involve noisy data with outliers or heavier tails than the gaussian distribution in this paper we propose the trimmed graphical lasso for robust estimation of sparse ggms our method guards against outliers by an implicit trimming mechanism akin to the popular least trimmed squares method used for linear regression we provide rigorous statistical analysis of our estimator in the setting in contrast existing approaches for robust sparse ggms estimation lack statistical guarantees our theoretical results are complemented by experiments on simulated and real gene expression data which further demonstrate the value of our approach introduction gaussian graphical models ggms form powerful class of statistical models for representing distributions over set of variables these models employ undirected graphs to encode conditional independence assumptions among the variables which is particularly convenient for exploring network structures ggms are widely used in variety of domains including computational biology natural language processing image processing statistical physics and spatial statistics in many modern applications the number of variables can exceed the number of observations for instance the number of genes in microarray data is typically larger than the sample size in such settings sparsity constraints are particularly pertinent for estimating ggms as they encourage only few parameters to be and induce graphs with few edges the most widely used estimator among others see minimizes the gaussian negative regularized by the norm of the entries or the entries of the precision matrix see this estimator enjoys strong statistical guarantees see the corresponding optimization problem is program that can be solved with interior point methods or by descent algorithms alternatively neighborhood selection can be employed to estimate conditional independence relationships separately for each node in the graph via lasso linear regression under certain assumptions the sparse ggm structure can still be recovered even under settings the aforementioned approaches rest on fundamental assumption the multivariate normality of the observations however outliers and corruption are frequently encountered in data see for gene expression data contamination of few observations can drastically affect the quality of model estimation it is therefore imperative to devise procedures that can cope with observations deviating from the model assumption despite this fact little attention has been paid to robust estimation of graphical models relevant work includes which leverages multivariate for robustified inference and the em algorithm they also propose an alternative which adds flexibility to the classical but requires the use of monte carlo em or variational approximation as the likelihood function is not available explicitly another tinent work is that of which introduces robustified likelihood function procedure is proposed for model estimation where the graphical structure is first obtained via coordinate gradient descent and the concentration matrix coefficients are subsequently using iterative proportional fitting so as to guarantee positive definiteness of the final 
estimate in this paper we propose the trimmed graphical lasso method for robust gaussian graphical modeling in the sparse setting our approach is inspired by the classical least trimmed squares method used for robust linear regression in the sense that it disregards the observations that are judged less reliable more specifically the trimmed graphical lasso seeks to minimize weighted version of the negative regularized by the penalty on the concentration matrix for the ggm and under some simple constraints on the weights these weights implicitly induce the trimming of certain observations our key contributions can be summarized as follows we introduce the trimmed graphical lasso formulation along with two strategies for solving the objective one involves solving series of graphical lasso problems the other is more efficient and leverages composite gradient descent in conjunction with partial optimization as our key theoretical contribution we provide statistical guarantees on the consistency of our estimator to the best of our knowledge this is in stark contrast with prior work on robust sparse ggm estimation which do not provide any statistical analysis experimental results under various data corruption scenarios further demonstrate the value of our approach problem setup and robust gaussian graphical models notation for matrices and hhu ii denotes the trace inner product tr for matrix and parameter ku ka denotes the norm andpku ka off does the norm only for entries for example ku off finally we use ku kf and to denote the frobenius and spectral norms respectively setup let xp be gaussian random field parameterized by concentration matrix exp xx ii where is the function of gaussian random field here the probability density function in is associated with gaussian distribution where given samples from gaussian random field the standard way to estimate the inverse covariance matrix is to solve the regularized maximum likelihood estimator mle that can be written as the following regularized program dd ee minimize log det off where is the space of the symmetric positive definite matrices and is regularization parameter that encourages sparse graph model structure in this paper we consider the case where the number of random variables may be substantially larger than the number of sample size however the concentration parameter of the underlying distribution is sparse the number of entries of is at most that is for now suppose that samples are drawn from this underlying distribution with true parameter we further allow some samples are corrupted and not drawn from specifically the set of sample indices is separated into two disjoint subsets if sample is in the set of good samples which we name then it is genuine sample from with the parameter on algorithm trimmed graphical lasso in initialize λi repeat compute given by assigning weight of one to the observations with lowest negative and weight of zero to the remaining ones wi line search choose see nesterov for discussion of how the stepsize may be chosen checking that the following update maintains positive definiteness this can be verified via cholesky factorization as in update sη where is the operator sν sign ui max and is only applied to the elements of matrix compute reusing the cholesky factor until stopping criterion is satisfied the other hand if the sample is in the set of bad samples the sample is corrupted the identifications of and are hidden to us however we naturally assume that only small number of samples are corrupted let be 
the number of good samples and hence then we assume that larger portion of samples are genuine and uncorrupted so that where if we assume that of samples are corrupted then in later sections we will derive robust estimator for corrupted samples of sparse gaussian graphical models and provide statistical guarantees of our estimator under the conditions and trimmed graphical lasso we now propose trimmed graphical lasso for robust estimation of sparse ggms dd ee minimize wi log det off where is regularization parameter to decide the sparsity of our estimation and is another parameter which decides the number of samples or sum of weights used in the training is ideally set as the number of uncorrupted samples in but practically we can tune the parameter by here the constraint is required to analyze this optimization problem as discussed in for another tuning parameter any positive real value would be sufficient for as long as finally note that when is fixed as and is set as infinity the optimization problem will be simply reduced to the vanilla regularized mle for sparse ggm without concerning outliers the optimization problem is convex in as well as in however this is not the case jointly nevertheless we will show later that any local optimum of is guaranteed to be strongly consistent under some fairly mild conditions optimization as we briefly discussed above the problem is not jointly convex but biconvex one possible approach to solve the objective of thus is to alternate between solving for with fixed and solving for with fixed given solving for is straightforward and boils down to assigning weight of one to the observations with lowest negative and weight of zero to the remaining ones given solving for can be accomplished by any algorithm solving the vanilla graphical lasso program each step solves convex problem hence the objective is guaranteed to decrease at each iteration and will converge to local minima more efficient optimization approach can be obtained by adopting partial minimization strategy for rather than solving to completion for each time is updated one performs single step update this approach stems from considering the following equivalent reformulation of our objective minimize dd ee wi log det off dd ee wi argmin wi on can then leverage standard methods such as projected and composite gradient descent that will converge to local optima the overall procedure is depicted in algorithm therein we assume that we pick sufficiently large so one does not need to enforce the constraint explicitly if needed the constraint can be enforced by an additional projection step statistical guarantees of trimmed graphical lasso one of the main contributions of this paper is to provide the statistical guarantees of our trimmed graphical lasso estimator for ggms the optimization problem is and therefore the methods solving will find estimators by local minima hence our theory in this section provides the statistical error bounds on any local minimum measured by kf and off norms simultaneously suppose that we have some local optimum of by arbitrary method while is fixed unconditionally we define as follows for sample index is simply set to ei so that ei otherwise for sample index we set hence is dependent on in order to derive the upper bound on the frobenius norm error we first need to assume the standard restricted strong convexity condition of with respective to the parameter restricted strong convexity condition let be an arbitrary error of parameter that is then for any possible error 
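The alternating (biconvex) strategy described above is also easy to prototype: fix the weights, fit a graphical lasso on the retained samples, then re-select the h samples with lowest negative log-likelihood under the new estimate. The sketch below uses scikit-learn's GraphicalLasso as the inner solver, which is an assumption about the reader's toolkit rather than the authors' code; since neither half-step increases the objective, the loop settles at a local minimum.

```python
# Hedged sketch of the alternating strategy for the trimmed graphical lasso.
import numpy as np
from sklearn.covariance import GraphicalLasso

def alternating_trimmed_glasso(X, h, alpha, n_outer=20):
    n, p = X.shape
    keep = np.arange(n)                          # start from all samples
    Theta = np.eye(p)
    for _ in range(n_outer):
        gl = GraphicalLasso(alpha=alpha, max_iter=200).fit(X[keep])
        Theta = gl.precision_
        _, logdet = np.linalg.slogdet(Theta)
        nll = np.einsum('ij,jk,ik->i', X, Theta, X) - logdet
        new_keep = np.sort(np.argsort(nll)[:h])  # h samples with lowest NLL
        if np.array_equal(new_keep, keep):       # weights unchanged: local minimum
            break
        keep = new_keep
    return Theta, keep
```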
such that dd ee κl where κl is curvature parameter note that in order to guarantee the frobenius error bounds is required even for the vanilla gaussian graphical models without outliers which has been well studied by several works such as the following lemma lemma section of for any such that dd ee thus holds with κl while is standard condition that is also imposed for the conventional estimators under clean set of of samples we additionally require the following condition for successful estimation of on corrupted samples and consider arbitrary local optimum let then dd ee with some positive quantities and on and these will be specified below for some concrete examples can be understood as structural incoherence condition between the model parameter and the weight parameter such condition is usually imposed when analyzing estimators with multiple parameters for example see for robust linear regression estimator since is defined depending on each local optimum has its own condition we will see in the sequel that under some reasonable cases this condition for any local optimum holds with high probability also for note that for the case with clean samples the condition is trivially satisfied since all and hence the lhs becomes armed with these conditions we now state our main theorem on the error bounds of our estimator theorem consider corrupted gaussian graphical models let be an any local optie mum of suppose that satisfies the condition suppose also that the regularization parameter in is set such that κl max then this local optimum is guaranteed to be consistent as follows kf and kθ κl off kθ κl the statement in theorem holds deterministically and the probabilistic statement comes where we show and for given are satisfied note that defining pn log det it is standard way of choosing based on see for details also it is important to note that the term captures the relation between norm and the error norm kf including diagonal entries due to the space limit the proof of theorem and all other proofs are provided in the supplements now it is natural to ask how easily we can satisfy the conditions in theorem intuitively it is impossible to recover true parameter by weighting approach as in when the amount of corruptions exceeds that of normal observation errors to this end suppose that we have some upper bound on the corruptions for some function we have log where denotes the matrix in corresponding to outliers under this assumption we can properly choose the regularization parameter satisfying as follows corollary consider corrupted gaussian graphical models with conditions and suppose that we choose the regularization parameter log log log max max σii kς then any local optimum of is guaranteed to satisfy and have the error bounds in with probability at least exp for some universal positive constants and if we further assume the number of corrupted samples scales with at most for some constant then we can derive the following result as another corollary of theorem corollary consider corrupted gaussian graphical models suppose that the conditions and hold also suppose that the regularization parameter is set as logn where max maxi then if the sample size is lower log bounded as max log then any local optimum of is guaranteed to satisfy and have the following error bound log log kθ kf κl with probability at least exp for some universal positive constants and note that the off norm based error bound also can be easily derived using the selection of from corollary reveals an interesting result 
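For orientation, the standard scaling behind these guarantees can be written out as follows. This is a hedged sketch with constants omitted, and the corruption-dependent term is left abstract because it depends on quantities defined above; the precise statement should be read from the theorem and its corollaries.

```latex
% Hedged sketch of the usual scaling (constants omitted, corruption term abstract):
\lambda_n \;\asymp\; \sqrt{\frac{\log p}{n}},
\qquad
\|\widehat{\Theta} - \Theta^{*}\|_{F}
  \;\lesssim\; \frac{1}{\kappa_l}\,\sqrt{\frac{(s + p)\,\log p}{n}}
  \;+\; (\text{term due to the corrupted samples}).
```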
even when samples out of total samples are corrupted our estimator can successfully recover the true parameter with guaranteed error log which exactly recovers the frobenius error in the first term in this bound is bound for the case without outliers see for example due to the outliers we have the log performance degrade with the second term which is to the best of our knowledge this is the first statistical error bounds on the parameter estimation for gaussian graphical models with outliers also note that corollary only concerns on any local optimal point derived by an arbitrary optimization algorithm for the guarantees of multiple local optima simultaneously we may use union bound from the corollary when outliers follow gaussian graphical model now let us provide concrete example and show how in is precisely specified in this case outliers in the set are drawn from another gaussian graphical model with parameter σb this can be understood as the gaussian mixture model where the most of the samples are drawn from that we want to estimate and small portion of samples are drawn from σb in this case corollary can be further shaped as follows corollary suppose that the conditions and hold then the statement in log corollary holds with log experiments in this section we corroborate the performance of our trimmed graphical lasso algorithm on simulated data we compare against glasso the vanilla graphical lasso the and methods and the approach of simulated data our simulation setup is similar to and is akin to gene regulatory networks namely we consider four different scenarios where the outliers are generated from models with different graphical structures specifically each sample is generated from the following mixture distribution yk np np np where po and four different outlier distributions are considered θo θo θo ip θo ip we also consider the scenario where the outliers are not symmetric about the mean and simulate data from the following model sensitivity sensitivity sensitivity sensitivity glasso best best figure average roc curves for the comparison methods for contamination scenarios yk po np po np ip for each simulation run is randomly generated precision matrix corresponding to network with hub nodes simulated as follows let be the adjacency of the network for all we set aij with probability and zero otherwise we set aji aij we then randomly select hub nodes and set the elements of the corresponding rows and columns of to one with probability and zero otherwise using the simulated nonzero coefficients of the precision matrix are sampled as follows first we create matrix so that ei if ai and ei is sampled uniformly from if ai then we set finally we set λmin ip where λmin is the smallest eigenvalue of is randomly generated precision matrix in the same way is generated for the robustness parameter of the method we consider as recommended in for the method we consider since all the robust comparison methods converge to stationary point we tested various initialization strategies for the concentration matrix including ip λip and the estimate from glasso we did not observe any noticeable impact on the results figure presents the average roc curves of the comparison methods over simulation data sets for scenarios as the tuning parameter varies in the figure for and methods we depict the best curves with respect to parameter and respectively due to space constraints the detailed results for all the values of and considered as well as the results for model are provided in the supplements from 
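The hub-network generator used in this simulation setup can be sketched as follows. The edge probabilities, entry magnitudes, and the spectral shift used to enforce positive definiteness are placeholder values, since the exact constants are specific to the experiments above.

```python
# Hedged sketch of the hub-network simulation: a sparse random adjacency with a
# few dense hub rows/columns, turned into a positive definite precision matrix
# by shifting the spectrum. All numeric constants are illustrative placeholders.
import numpy as np

def hub_precision_matrix(p=100, n_hubs=3, p_edge=0.02, p_hub=0.7, seed=0):
    rng = np.random.default_rng(seed)
    A = (rng.random((p, p)) < p_edge).astype(float)
    hubs = rng.choice(p, size=n_hubs, replace=False)
    A[hubs, :] = (rng.random((n_hubs, p)) < p_hub).astype(float)
    A[:, hubs] = A[hubs, :].T
    A = np.triu(A, k=1); A = A + A.T                  # symmetric adjacency, zero diagonal
    E = A * rng.uniform(0.2, 0.5, size=(p, p)) * rng.choice([-1, 1], size=(p, p))
    E = (E + E.T) / 2.0
    lam_min = np.linalg.eigvalsh(E).min()
    return E + (abs(lam_min) + 0.1) * np.eye(p)       # shift to ensure positive definiteness

Sigma = np.linalg.inv(hub_precision_matrix())
X = np.random.default_rng(1).multivariate_normal(np.zeros(100), Sigma, size=80)
```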
the roc curves we can see that our proposed approach is competitive compared the alternative robust approaches and the edge over glasso is even more pronounced for frequency rescaled gene expression figure histogram of standardized gene expression levels for gene network estimated by scenarios and surprisingly with achieves superior sensitivity for nearly any specificity computationally the method is also competitive compared to alternatives the average over the path of tuning parameters is for for for for trimmed lasso for glasso experiments were run on in single computing node with intel core cpu and memory for and robustll we used the implementations provided by the methods authors for glasso we used the glassopath package application to the analysis of yeast gene expression data we analyze yeast microarray dataset generated by the dataset concerns yeast segregants instances we focused on genes variables belonging to pathway as provided by the kegg database for each of these genes we standardize the gene expression data to and unit standard deviation we observed that the expression levels of some genes are clearly not symmetric about their means and might include outliers for example the histogram of gene is presented in figure for the method we set and for trimglasso we use we use to choose the tuning parameters for each method after is chosen for each method we rerun the methods using the full dataset to obtain the final precision matrix estimates figure shows the pathway estimated by our proposed method for comparison the pathway from the kegg is provided in the supplements it is important to note that the kegg graph corresponds to what is currently known about the pathway it should not be treated as the ground truth certain discrepancies between kegg and estimated graphs may also be caused by inherent limitations in the dataset used for modeling for instance some edges in pathway may not be observable from gene expression data additionally the perturbation of cellular systems might not be strong enough to enable accurate inference of some of the links glasso tends to estimate more links than the robust methods we postulate that the lack of robustness might result in inaccurate network reconstruction and the identification of spurious links robust methods tend to estimate networks that are more consistent with that from the kegg of for glasso for for and for where the score is the harmonic mean between precision and recall for instance our approach recovers several characteristics of the kegg pathway for instance genes key regulator of dna replication playing important roles in the activation and maintenance of the checkpoint mechanisms coordinating phase and mitosis and essential gene for meiotic progression and mitotic cell cycle arrest are identified as hub genes while genes are unconnected to any other genes references lauritzen graphical models oxford university press usa jung hun oh and joseph deasy inference of gene regulatory networks using the graphical lasso algorithm bmc bioinformatics manning and schutze foundations of statistical natural language processing mit press woods markov image modeling ieee transactions on automatic control october hassner and sklansky markov random field models of digitized image texture in pages cross and jain markov random field texture models ieee trans pami ising beitrag zur theorie der ferromagnetismus zeitschrift physik ripley spatial statistics wiley new york yang lozano and ravikumar elementary estimators for graphical models in neur 
info proc sys nips yuan and lin model selection and estimation in the gaussian graphical model biometrika friedman hastie and tibshirani sparse inverse covariance estimation with the graphical lasso biostatistics bannerjee el ghaoui and aspremont model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data jour mach lear march ravikumar wainwright raskutti and yu covariance estimation by minimizing divergence electronic journal of statistics boyd and vandenberghe convex optimization cambridge university press cambridge uk meinshausen and graphs and variable selection with the lasso annals of statistics yang ravikumar allen and liu graphical models via generalized linear models in neur info proc sys nips tibshirani regression shrinkage and selection via the lasso journal of the royal statistical society series daye chen and li heteroscedastic regression with an application to eqtl data analysis biometrics michael finegold and mathias drton robust graphical modeling of gene networks using classical and alternative the annals of applied statistics sun and li robust gaussian graphical modeling via penalization biometrics alfons croux and gelper sparse least trimmed squares regression for analyzing highdimensional large data sets ann appl loh and wainwright regularized with nonconvexity statistical and algorithmic theory for local optima arxiv preprint hsieh sustik dhillon and ravikumar sparse inverse covariance matrix estimation using quadratic approximation in neur info proc sys nips nesterov gradient methods for minimizing composite objective function technical report center for operations research and econometrics core catholic univ louvain ucl nguyen and tran robust lasso with missing and grossly corrupted observations ieee trans info theory negahban ravikumar wainwright and yu unified framework for analysis of with decomposable regularizers statistical science yang and lozano robust gaussian graphical modeling with the trimmed graphical lasso rachel brem and leonid kruglyak the landscape of genetic complexity across gene expression traits in yeast proceedings of the national academy of sciences of the united states of america kanehisa goto sato kawashima furumichi and tanabe data information knowledge and principle back to metabolism in kegg nucleic acids 
testing closeness with unequal sized samples gregory department of computer science stanford university california ca valiant bhaswar bhattacharya department of statistics stanford university california ca bhaswar abstract we consider the problem of testing whether two samples were drawn from identical distributions versus distributions that differ significantly specifically given target error parameter independent draws from an unknown distribution with discrete support and draws from an unknown distribution of discrete support we describe test for distinguishing the case that from the case that if and are supported on at most elements then our with high probability provided test is successful and max we show that this tradeoff is information oretically optimal throughout this range in the dependencies on all parameters and to constant factors for distributions as consequence we obtain an algorithm for estimating the mixing time of markov chain on states up to log factor that uses τmix queries to next node oracle the core of our testing algorithm is relatively simple statistic that seems to perform well in practice both on synthetic and on natural language data we believe that this statistic might prove to be useful primitive within larger machine learning and natural language processing systems introduction one of the most basic problems in statistical hypothesis testing is the question of distinguishing whether two unknown distributions are very similar or significantly different classical tests like the test or the statistic are optimal in the asymptotic regime for fixed distributions as the sample sizes tend towards infinity nevertheless in many modern as the analysis of customer web logs natural language processing and genomics despite the quantity of available support sizes and complexity of the underlying distributions are far larger than the datasets as evidenced by the fact that many phenomena are observed only single time in the datasets and the empirical distributions of the samples are poor representations of the true underlying in such settings we must understand these statistical tasks not only in the asymptotic regime in which the amount of available data goes to infinity but in the undersampled regime in which the dataset is significantly smaller than the size or complexity of the distribution in question surprisingly despite an intense history of study by the statistics information theory and computer science communities aspects of basic hypothesis testing and estimation in the undersampled unresolved and require both new algorithms and new analysis techniques supported in part by nsf career award to give some specific examples two recent independent studies each considered the genetic sequences of over individuals and found that rare variants are extremely abundant with over of mutations observed just once in the sample separate recent paper found that the discrepancy in rare mutation abundance cited in different demographic modeling studies can largely be explained by discrepancies in the sample sizes of the respective studies as opposed to differences in the actual distributions of rare mutations across demographics highlighting the importance of improved statistical tests in this undersampled regime in this work we examine the basic hypothesis testing question of deciding whether two unknown distributions over discrete supports are identical or extremely similar versus have total variation distance at least for some specified parameter we consider and largely resolve 
this question in the extremely practically relevant setting of unequal sample sizes informally taking to be small constant we show that provided and are supported on at most elements for any the hypothesis test can be successfully performed with high probability over the random samples given samples of size from and from where is the size of the supports of the distributions and furthermore for every in this range this tradeoff between and is necessary up to constant factors thus our results smoothly interpolate between the known bounds of on the sample size necessary in the setting where one is given two samples and the bound of on the sample size in the setting in which the sample is drawn from one distribution and the other distribution is known to the algorithm throughout most of the regime of parameters when our algorithm is natural extension of the algorithm proposed in and is similar to the algorithm proposed in except with the addition of normalization term that seems crucial to obtaining our information theoretic optimality in the extreme regime when and our algorithm introduces an additional statistic which we believe is new our algorithm is relatively simple and practically viable in section we illustrate the efficacy of our approach on both synthetic data and on the problem of deducing whether two words are synonyms based on small sample of the in which they occur we also note that as pointed out in several related work this hypothesis testing question has applications to other problems such as estimating or testing the mixing time of markov chains and our results yield improved algorithms in these settings related work the general question of how to estimate or test properties of distributions using fewer samples than would be necessary to actually learn the distribution has been studied extensively since the late most of the work has focussed on symmetric properties properties whose value is invariant to relabeling domain elements such as entropy support size and distance metrics between distributions such as distance this has included both algorithmic work and results on developing techniques and tools for establishing lower bounds see the recent survey by rubinfeld for more thorough summary of the developments in this area the specific problem of closeness testing or identity testing that is deciding whether two distributions and are similar versus have significant distance has two main variants the setting in which is known and sample is drawn from and the settings in which both and are unknown and samples are drawn from both we briefly summarize the previous results for these two settings in the setting which can be thought of as the limiting setting in the case that we have an arbitrarily large sample drawn from distribution and relatively modest sized sample from initial work of goldreich and ron considered the problem of testing whether is the uniform distribution over versus has distance at least the tight bounds of were later shown by paninski essentially leveraging the birthday paradox and the intuition that among distributions supported on elements the uniform distribution maximizes the number of domain elements that will be observed once batu et al showed that up to polylogarithmic factors of and polynomial factors of this dependence was optimal for distributions over recently an algorithm and matching lower bound was shown for any max distribution up to constant factors max samples from are both necessary max and sufficient to test versus where is the norm of the 
vector of probabilities of distribution after the maximum element has been removed and the smallest elements up to total mass have been removed this immediately implies the tight bounds that if is any distribution supported on samples are sufficient to test its identity the setting was introduced to this community by batu et al the optimal sample complexity of this problem was recently determined by chan et al they showed that samples are necessary and sufficient in slightly different vein acharya et al recently considered the question of closeness testing with two unknown distributions from the standpoint of competitive analysis they proposed an algorithm that performs the desired task using polylog samples and established lower bound of where represents the number of samples required to determine whether set of samples were drawn from versus in the setting where and are explicitly known natural generalization of this hypothesis testing problem which interpolates between the setting and the setting is to consider unequal sized samples from the two distributions more formally given samples from the distribution the asymmetric closeness testing problem is to determine how many samples are required from the distribution such that the hypothesis versus can be distinguished with large constant probability say note that the results of chan et al imply that it is sufficient to consider this problem was studied recently by acharya et al they gave log an algorithm that given samples from the distribution uses max nεlog samples from to distinguish the two distributions with high probability they also proved lower bound of max there is polynomial gap in these upper and lower bounds in the dependence on and as corollary to our main hypothesis testing result we obtain an improved algorithm for testing the mixing time of markov chain the idea of testing mixing properties of markov chain goes back to the work of goldreich and ron which conjectured an algorithm for testing expansion of graphs their test is based on picking random node and testing whether random walks from this node reach distribution that is close uniform distribution on the nodes of the graph they conjectured that their algorithm had query complexity later czumaj and sohler kale and seshadhri and nachmias and shapira have independently concluded that the algorithm of goldreich and ron is provably test for expansion property of graphs rapid mixing of chain can also be tested using eigenvalue computations mixing is related to the separation between the two largest eigenvalues and eigenvalues of dense matrix can be approximated in time and space however for sparse symmetric matrix with nonzero entries the same task can be achieved in log operations and space batu et al used their distance test on the distributions to test mixing properties of markov chains given finite markov chain with state space and transition matrix they essentially show that one can estimate the mixing time τmix up to factor of log using τmix queries to next node oracle which takes state and outputs state drawn from the distribution such an oracle can often be simulated significantly more easily than actually computing the transition matrix we conclude this related work section with comment on robust hypothesis testing and distance estimation natural hope would be to simply estimate to within some additive which is strictly more difficult task than distinguishing from the results of valiant and valiant show that this problem is significantly more difficult than 
hypothesis testing the distance can be estimated to additive error for distributions supported on elements using samples of size log in both the setting where either one or both distributions are unknown moreover log samples are information theoretically necessary even if is the uniform from the case that distribution over and one wants to distinguish the case that recall that the test of distinguishing versus requires sample of size only the exact sample complexity of distinguishing whether nc versus is not well understood though in the case of constant up to logarithmic factors the required sample size seems to scale linearly in the exponent between and as goes from to our results our main result resolves the minimax sample complexity of the closeness testing problem in the unequal sample setting to constant factors in terms of the support sizes of the distributions in question theorem given and and sample access to distributions and over there is time algorithm which takes independent draws from and max independent draws from and with probability at least distinguishes whether versus moreover given samples from max samples from are theoretically necessary to distinguish from with any constant probability bounded below by the lower bound in the above theorem is proved using the machinery developed in valiant and interpolates between the lower bound in the setting of testing uniformity and the lower bound in the setting of equal sample sizes from two unknown distributions the algorithm establishing the upper bound involves version of statistic proposed in and is similar to the algorithm proposed in modulo the addition of normalizing term which seems crucial to obtaining our tight results in the extreme regime when and we incorporate an additional statistic that has not appeared before in the literature as an application of theorem in the extreme regime when we obtain an improved algorithm for estimating the mixing time of markov chain corollary consider finite markov chain with state space and next node oracle there is an algorithm that estimates the mixing time τmix up to multiplicative factor of log that uses τmix time and queries to the next node oracle concurrently to our work hsu et al considered the question of estimating the mixing time based on single sample path as opposed to our model of sampling oracle in contrast to our approach via hypothesis testing they considered the natural spectral approach and showed that the mixing time can be approximated up to logarithmic factors given path of length τmix where πmin is the minimum probability of state under the stationary distribution hence if the stationary distribution is uniform over states this becomes nτmix it remains an intriguing open question whether one can simultaneously achieve both the linear dependence on τmix of our results and the linear dependence on or the size of the state space as in their results outline we begin by stating our testing algorithm and describe the intuition behind the algorithm the formal proof of the performance guarantees of the algorithm require rather involved bounds on the moments of various parameters and are provided in the supplementary material we also defer the entirety of the matching information theoretic lower bounds to the supplementary material as the techniques may not appeal to as wide an audience as the algorithmic portion of our work the application of our testing results to the problem of testing or estimating the mixing time of markov chain is discussed in section finally 
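Writing n for the support size, m1 for the number of draws from p, m2 for the number of draws from q, and epsilon for the distance parameter (symbols introduced here only for readability), the tradeoff in the theorem can be sketched as below. This is a hedged reconstruction with constants omitted, and the precise statement should be read from the theorem itself; its form is, however, consistent with the two endpoints quoted earlier, since m1 of order n recovers the sqrt(n)/epsilon^2 rate of the known-distribution setting and m1 = m2 recovers the n^{2/3}/epsilon^{4/3} rate of the equal-sample setting.

```latex
% Hedged reconstruction of the sample-size tradeoff (constants omitted):
m_2 \;=\; O\!\left(\max\left\{\frac{n}{\sqrt{m_1}\,\epsilon^{2}},\;
                              \frac{\sqrt{n}}{\epsilon^{2}}\right\}\right),
\qquad \text{with a matching lower bound up to constant factors.}
```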
section contains some empirical results suggesting that the statistic at the core of our testing algorithm performs very well in practice this section contains both results on synthetic data as well as an illustration of how to apply these ideas to the problem of estimating the semantic similarity of two words based on samples of the that contain the words in corpus of text algorithms for testing in this section we describe our algorithm for testing with unequal samples this gives the upper bound in theorem on the sample sizes necessary to distinguish from for clarity and ease of exposition in this section we consider to be some absolute constant and supress the dependency on the slightly more involved algorithm that also obtains the optimal dependency on the parameter is given in the supplementary material we begin by presenting the algorithm and then discuss the intuition for the various steps algorithm the closeness testing algorithm suppose and for some let denote two independent sets of samples drawn from and let denote two independent sets of samples drawn from we wish to test versus let log for an absolute constant and define the set mi where denotes the number of occurrences of in and denotes the number of occurrences of in let xi denote the number of occurrences of element in and yi denote the number of occurrences of element in check if xi yi check if xi yi xi yi cγ xi yi for an appropriately chosen constant cγ depending on if if and hold then accept otherwise reject otherwise if check if yi xi where is an appropriately chosen absolute constant reject if there exists such that yi and xi where is an appropriately chosen absolute constant if and hold then accept otherwise reject the intuition behind the above algorithm is as follows with high probability all elements in the set satisfy either pi or qi or both given that these elements are heavy their contribution to the distance will be accurately captured by the distance of their empirical frequencies where these empirical frequencies are based on the second set of samples for the elements that are not in set light empirical frequencies will in general not accurately reflect their true probabilities and hence the distance between the empirical distributions of the light elements will be misleading the statistic of equation is designed specifically for this regime if the denominator of this statistic were omitted then this would give an estimator for the squared distance between the distributions scaled by factor of to see this note that if pi and qi are small then binomial pi oisson pi and binomial qi oisson qi furthermore simple calculation yields that if oisson pi and yi oisson qi then xi yi xi yi the normalization by xi yi linearizes the statistic essentially turning the squared distance into an estimate of the distance between light elements of the two distributions similar results can possibly be obtained using other linear functions of xi and yi in the denominator though we note that the obvious normalizing factor of xi yi does not seem to work theoretically and seems to have extremely poor performance in practice for the extreme case corresponding to where and the statistic might have prohibitively large variance this is essentially due to the birthday paradox which might cause of rare elements having probability to occur twice in sample of size each such element will contribute to the statistic and hence the variance can be the statistic of equation is tailored to deal with these cases and captures the intuition that we 
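The light-element statistic at the heart of the algorithm can be computed in a few lines. The sketch below writes it in the usual normalized chi-square form, for counts Xi from a sample of size n1 drawn from p and Yi from a sample of size n2 drawn from q; under Poissonization the numerator has expectation proportional to (pi - qi)^2, as discussed above. The heavy/light partition, the Poissonization, and the acceptance thresholds of the algorithm are deliberately omitted, and the exact scaling should be checked against the equations of the algorithm.

```python
# Hedged sketch of the core statistic; large positive values are evidence that p != q.
from collections import Counter

def closeness_statistic(sample_p, sample_q):
    n1, n2 = len(sample_p), len(sample_q)
    X, Y = Counter(sample_p), Counter(sample_q)
    z = 0.0
    for i in set(X) | set(Y):
        xi, yi = X[i], Y[i]
        # squared-difference term, bias correction, and the x_i + y_i normalization
        z += ((n2 * xi - n1 * yi) ** 2 - (n2 ** 2 * xi + n1 ** 2 * yi)) / (xi + yi)
    return z
```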
are more tolerant of indices for which yi if the corresponding xi is larger it is worth noting that one can also define natural analog of the statistic corresponding to the indices for which yi using which the robustness parameter of the test can be improved the final that in this regime with there are no elements for which yi but xi is out the remaining sets of distributions for which the variance of the statistic is intolerably large finally we should emphasize that the crude step of using two independent batches of the first to obtain the partition of the domain into heavy and light elements and the second to actually compute the statistics is for ease of analysis as our empirical results of section suggest for practical applications one may want to use only the of and one certainly should not waste half the samples to perform the heavy light partition estimating mixing times in markov chains the basic hypothesis testing question of distinguishing identical distributions from those with significant distance can be employed for several other practically relevant tasks one example is the problem of estimating the mixing time of markov chains consider finite markov chain with state space transition matrix with stationary distribution the distribution starting at the point pxt is the probability distribution on obtained by running the chain for steps starting from definition time of markov chain with transition matrix is defined as tmix inf definition the average distribution of markov chain with states is the distribution pxt that is the distribution obtained by choosing uniformly from and walking steps from the state the connection between closeness testing and testing whether markov chain is close to mixing was first observed by batu et al who proposed testing the difference between distributions and for every the algorithm leveraged their equal hypothesis testing results drawing log samples from both the distributions and this yields an overall running time of here we note that our unequal hypothesis testing algorithm can yield an improved runtime since the distribution is independent of the starting state it suffices to take samples from once and samples from pxt for every this results in query and runtime complexity of we sketch this algorithm below algorithm testing for mixing times in markov chains given and finite markov chain with state space and transition matrix we wish to test tmix versus tmix draw log samples so log each of size pois from the average distribution for each state we will distinguish whether versus with probability of error we do this by running log runs of algorithm with the run using si and fresh set of pois samples from pxt if all of the closeness testing problems are accepted then we accept the above testing algorithm can be leveraged to estimate the mixing time of markov chain via the log basic observation that if tmix then for any tmix log and thus tmix log tmix because tmix and tmix differ by at most factor of log by applying algorithm for geometrically increasing sequence of and repeating each test log log times one obtains corollary restated below corollary for finite markov chain with state space and next node oracle there is an algorithm that estimates the mixing time τmix up to multiplicative factor of log that uses τmix time and queries to the next node oracle empirical results both our formal algorithms and the corresponding theorems involve some unwieldy constant factors that can likely be reduced significantly nevertheless in this section we 
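A sketch of the doubling search wrapped around the closeness tester is given below: walk t steps from a uniformly random start to sample the average distribution, walk t steps from each fixed state to sample its t-step distribution, and double t until every per-state test accepts. The helper names, the omission of Poissonized sample sizes, and the omission of the logarithmic amplification are simplifications for illustration; the tester argument stands for a closeness test such as the one sketched earlier, with sample sizes chosen per the theorem.

```python
# Hedged sketch of mixing-time estimation by doubling t and closeness testing.
import random

def sample_t_step(oracle, x, t):
    for _ in range(t):
        x = oracle(x)            # next-node oracle: one step of the chain
    return x

def estimate_mixing_time(oracle, states, n_samples, m_samples, tester, t_max=2**20):
    t = 1
    while t <= t_max:
        # a large sample from the average distribution (uniform start, t steps)
        avg = [sample_t_step(oracle, random.choice(states), t) for _ in range(n_samples)]
        mixed = True
        for x in states:
            from_x = [sample_t_step(oracle, x, t) for _ in range(m_samples)]
            if not tester(avg, from_x):   # reject: state x has not mixed yet
                mixed = False
                break
        if mixed:
            return t                       # mixed up to the tester's tolerance
        t *= 2
    return None
```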
provide some evidence that the statistic at the core of our algorithm can be fruitfully used in practice even for surprisingly small sample sizes testing similarity of words an extremely important primitive in natural language processing is the ability to estimate the semantic similarity of two words here we show that the statistic xi yi xi yi which is the core of our testing algorithm can accurately xi tinguish whether two words are very similar based on surprisingly small samples of the contexts in which they occur specifically for each pair of words that we consider we select random occurrences of and random occurrences of word from the google books corpus using the google books ngram we then compare the sample of words that follow with the sample of words that follow henceforth we refer to these as samples of the set of involving each word figure illustrates the statistic for various pairs of words that range from rather similar words like smart and intelligent to essentially identical word pairs such as grey and gray whose usage differs mainly as result of historical variation in the preference for one spelling over the other the sample size of containing the first word is fixed at and the sample size corresponding to the second word varies from through to provide frame of reference we also compute the value of the statistic for independent samples corresponding to the same word two different samples of words that follow wolf these are depicted in red for comparison we also plot the total variation distance between the empirical distributions of the pair of samples which does not clearly differentiate between pairs of identical words versus different words particularly for the smaller sample sizes one subtle point is that the issue with using the empirical distance between the distributions goes beyond simply not having consistent reference point for example let denote large sample of size from distribution denote small sample of size from and denote small sample of size from different distribution it is tempting to hope that the empirical distance between and will be smaller than the empirical distance between and as figure illustrates this is not always the case even for natural distributions for the specific example illustrated in the figure over much of the range of the empirical distance between and is indistinguishable from that of and though the statistic easily discerns that these distributions are very different this point is further emphasized in figure which depicts this phenomena in the synthetic setting where unif is the uniform distribution over elements and is the distribution whose elements have probabilities for the second and fourth plots represent the probability that the distance between two empirical distributions of samples from is smaller than the distance between the empirical distributions of the samples from and the first and third plots represent the analogous probability involving the statistic the first two plots correspond to and the last two correspond to in we consider pair of samples of respective sizes and as and range between and the google books ngram dataset is freely available here http figure two measures of the similarity between words based on samples of the containing each word each line represents pair of words and is obtained by taking sample of containing the first word and containing the second word where is depicted along the in logarithmic scale in both plots the red lines represent pairs of identical words the blue lines represent pairs 
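As a toy illustration of the word-similarity experiment, the closeness_statistic sketch from earlier can be applied directly to two small samples of following-word contexts; the words and contexts below are invented for illustration and are not drawn from the Google Books corpus used in the experiments.

```python
# Illustrative use only: compare two tiny samples of following-word contexts.
contexts_grey = ["area", "hair", "sky", "matter", "hair", "area", "wolf"]
contexts_gray = ["hair", "area", "matter", "sky", "whale", "hair"]
print(closeness_statistic(contexts_grey, contexts_gray))  # near zero for near-identical usage
```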
of similar words and the black line represents the pair whose distribution of differ because of historical variations in preference for each spelling solid lines indicate the average over trials for each word pair and choice of with error bars of one standard deviation depicted the left plot depicts our statistic which clearly distinguishes identical words and demonstrates some intuitive sense of semantic distance the right plot depicts the total variation distance between the empirical does not successfully distinguish the identical words given the range of sample sizes considered the plot would not be significantly different if other distance metrics between the empirical distributions such as were used in place of total variation distance finally note the extremely uniform magnitudes of the error bars in the left plot as increases which is an added benefit of the xi yi normalization term in the statistic illustration of how the empirical distance can be misleading here the empirical distance between the distributions of samples of for is indistinguishable from that for the pair over much of the range of nevertheless our statistic clearly discerns that these are significantly different distributions here fox denotes the distribution of whose first word is fox restricted to only the most common pr pr pr pr figure the first and third plot depicts the probability that the statistic applied to samples of sizes drawn from nif is smaller than the statistic applied to sample of size drawn from and drawn from where is perturbed version of in which all elements have probability the second and fourth plots depict the probability that empirical distance between pair of samples of respective sizes drawn from is less than the empirical distribution between sample of size drawn from and drawn from the first two plots correspond to and the last two correspond to in all plots and range between and on logarithmic scale in all plots the colors depict the average probability based on trials references acharya das jafarpour orlitsky and pan competitive closeness testing colt acharya das jafarpour orlitsky and pan competitive classification and closeness testing colt acharya jafarpour orlitsky and suresh sublinear algorithms for outlier detection and generalized closeness testing isit acharya daskalakis and kamath optimal testing for properties of distributions nips kumar and sivakumar sampling algorithms lower bounds and applications stoc batu fortnow rubinfeld smith and white testing that distributions are close focs batu dasgupta kumar and rubinfeld the complexity of approximating the entropy siam journal on computing batu fischer fortnow kumar rubinfeld and white testing random variables for independence and identity focs chan diakonikolas valiant valiant optimal algorithms for testing closeness of discrete distributions symposium on discrete algorithms soda charikar chaudhuri motwani and narasayya towards estimation error guarantees for distinct values symposium on principles of database systems pods czumaj and sohler testing expansion in graphs focs goldreich and ron on testing expansion in graphs eccc guha mcgregor and venkatasubramanian streaming and sublinear approximation of entropy and information distances symposium on discrete algorithms soda hsu kontorovich and mixing time estimation in reversible markov chains from single sample path nips kale and seshadhri an expansion tester for bounded degree graphs icalp lncs vol keinan and clark recent explosive human population growth has resulted in an 
excess of rare genetic variants science levin peres and wilmer markov chains and mixing times amer math nachmias and shapira testing the expansion of graph electronic colloquium on computational complexity eccc vol nelson and wegmann et an abundance of rare functional variants in drug target genes sequenced in people science paninski estimation of entropy and mutual information neural vol paninski estimating entropy on bins given fewer than samples ieee transactions on information theory vol paninski test for uniformity given very discrete data ieee transactions on information theory vol raskhodnikova ron shpilka and smith strong lower bounds for approximating distribution support size and the distinct elements problem siam journal on computing vol rubinfeld taming big probability distributions xrds vol sinclair and jerrum approximate counting uniform generation and rapidly mixing markov chains information and computation vol tennessen bigham and connor et al evolution and functional impact of rare coding variation from deep sequencing of human exomes science valiant and valiant estimating the unseen an log estimator for entropy and support size shown optimal via new clts stoc valiant and valiant estimating the unseen improved estimators for entropy and other properties nips valiant and valiant an automatic inequality prover and instance optimal identity testing focs valiant testing symmetric properties of distributions stoc valiant testing symmetric properties of distributions phd thesis 
estimating jaccard index with missing observations matrix calibration approach wenye li macao polytechnic institute macao sar china wyli abstract the jaccard index is standard statistics for comparing the pairwise similarity between data samples this paper investigates the problem of estimating jaccard index matrix when there are missing observations in data samples starting from jaccard index matrix approximated from the incomplete data our method calibrates the matrix to meet the requirement of positive and other constraints through simple alternating projection algorithm compared with conventional approaches that estimate the similarity matrix based on the imputed data our method has strong advantage in that the calibrated matrix is guaranteed to be closer to the unknown ground truth in the frobenius norm than the matrix except in special cases they are identical we carried out series of empirical experiments and the results confirmed our theoretical justification the evaluation also reported significantly improved results in real learning tasks on benchmark datasets introduction critical task in data analysis is to determine how similar two data samples are the applications arise in many science and engineering disciplines for example in statistical and computing sciences similarity analysis lays foundation for cluster analysis pattern classification image analysis and recommender systems variety of similarity models have been established for different types of data when data samples can be represented as algebraic vectors popular choices include cosine similarity model linear kernel model and so on when each vector element takes value of zero or one the jaccard index model is routinely applied which measures the similarity by the ratio of the number of unique elements common to two samples against the total number of unique elements in either of them despite the wide applications the jaccard index model faces challenge when data samples are not fully observed as treatment imputation approaches may be applied which replace the missing observations with substituted values and then calculate the jaccard index based on the imputed data unfortunately with large portion of missing observations imputing data samples often becomes or even infeasible as evidenced in our evaluation instead of trying to fill in the missing values this paper investigates completely different approach based on matrix calibration starting from an approximate jaccard index matrix that is estimated from incomplete samples the proposed method calibrates the matrix to meet the requirement of positive and other constraints the calibration procedure is carried out with simple yet flexible alternating projection algorithm the proposed method has strong theoretical advantage the calibrated matrix is guaranteed to be better than or at least identical to in special cases the matrix in terms of shorter frobenius distance to the true jaccard index matrix which was verified empirically as well besides our evaluation of the method also reported improved results in learning applications and the improvement was especially significant with high portion of missing values note on notation throughout the discussion data sample ai is treated as set of features let fd be the set of all possible features without causing ambiguity ai also represents vector if the element of vector ai is one it means fj ai feature fj belongs to sample ai if the element is zero fj ai if the element is marked as missing it remains unknown whether feature fj 
belongs to sample ai or not background the jaccard index the jaccard index is commonly used statistical indicator for measuring the pairwise similarity for two nonempty and finite sets ai and aj it is defined to be the ratio of the number of elements in their intersection against the number of elements in their union jij aj aj where denotes the cardinality of set the jaccard index has value of when the two sets have no elements in common when they have exactly the same elements and strictly between and otherwise the two sets are more similar have more common elements when the value gets closer to for sets an the jaccard index matrix is defined as an matrix jij the matrix is symmetric and all diagonal elements of the matrix are handling missing observations when data samples are fully observed the accurate jaccard index can be obtained trivially by enumerating the intersection and the union between each pair of samples if both the number of samples and the number of features are small for samples with large number of features the index can often be approximated by minhash and related methods which avoid the explicit counting of the intersection and the union of the two sets when data samples are not fully observed however obtaining the accurate jaccard index generally becomes infeasible one approximation is to ignore the features with missing values only those features that have no missing values in all samples are used to calculate the jaccard index obviously for large dataset with features it is very likely that this method will throw away all features and therefore does not work at all the mainstream work tries to replace the missing observations with substituted values and then calculates the jaccard index based on the imputed data several simple approaches including zero median and neighbors knn methods are popularly used missing element is set to zero often implying the corresponding feature does not exist in sample it can also be set to the median value or the mean value of the feature over all samples or sometimes over number of nearest neighboring instances more systematical imputation framework is based on the classical expectation maximization em algorithm which generalizes maximum likelihood estimation to the case of incomplete data assuming the existence of latent variables the algorithm alternates between the expectation step and the maximization step and finds maximum likelihood or maximum posterior estimates of the variables in practice the imputation is often carried out through iterating between learning mixture of clusters of the filled data and missing values using cluster means weighted by the posterior probability that cluster generates the samples solution our work investigates the jaccard index matrix estimation problem for incomplete data instead of throwing away the features or imputing the missing values completely different solution based on matrix calibration is designed initial approximation for sample ai denote by the set of features that are known to be in ai and denote by the set of features that are known to be not in ai let oi if oi ai is fully observed without missing values otherwise ai is not fully observed with missing values the complement of oi with respect to denoted by oi gives ai unknown features and missing values for two samples ai and aj with missing values we approximate their jaccard index by oj oi jij oj oi oj oi here we assume that each sample has at least one observed feature it is obvious that jij is equal to the ground truth jij if the 
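A minimal sketch of the quantities involved is given below: an approximate Jaccard index computed from the coordinates observed in both samples, with missing observations encoded as NaN. This is in the spirit of the initial estimate described above, but it is not a literal transcription of the paper's formula, which is defined through the per-sample sets of observed-present and observed-absent features.

```python
# Hedged sketch: Jaccard index restricted to jointly observed coordinates.
import numpy as np

def jaccard(a, b):
    """a, b: 0/1 vectors with np.nan marking missing observations."""
    both_observed = ~np.isnan(a) & ~np.isnan(b)
    ai, bi = a[both_observed] == 1, b[both_observed] == 1
    union = np.sum(ai | bi)
    return np.sum(ai & bi) / union if union > 0 else 0.0

a = np.array([1, 0, 1, np.nan, 1, 0])
b = np.array([1, 1, np.nan, 0, 1, 0])
print(jaccard(a, b))   # approximate index; exact when both samples are fully observed
```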
samples are fully observed there exists an interval ℓij µij that the true value jij lies in where if ℓij otherwise oi and µij if oi otherwise the lower bound ℓij is obtained from the extreme case of setting the missing values in way that the two sets have the fewest features in their intersection while having the most features in their union on the contrary the upper bound µij is obtained from the other extreme when the samples are fully observed the interval shrinks to single point ℓij µij jij matrix calibration denote by jij the true jaccard index matrix for set of data samples an we have theorem for given set of data samples its jaccard index matrix is positive for data samples with missing values the matrix jij often loses positive definiteness nevertheless it can be calibrated to ensure the property by seeking an matrix jij to minimize subject to the constraints and ℓij jij µij where requires to be positive and denotes the frobenius norm of matrix and kjkf ij jij let mn be the set of symmetric matrices the feasible region defined by the constraints denoted by is nonempty closed and convex subset of mn following standard results in optimization theory the problem of minimizing is convex denote by the projection onto its unique solution is given by the projection of onto jr pr for jr we have the equality holds iff theorem jr proof define an inner product on mn that induces the frobenius norm hx trace for mn then jr jr jr jr jr jr jr jr jr jr the second holds due to the kolmogrov criterion which states that the projection of onto jr is unique and characterized by for all jr and jr jr jr jr and jr the equality holds iff jr this key observation shows that projecting onto the feasible region will produce an improved estimate towards although this ground truth matrix remains unknown to us projection onto subsets based on the results in section we are to seek minimizer to to improve the estimate define two nonempty closed and convex subsets of mn mn and mn ℓij xij µij obviously now our minimization problem becomes finding the projection of onto the intersection of two sets and with respect to the frobenius norm this can be done by studying the projection onto the two sets individually denote by ps the projection onto and pt the projection onto for projection onto straightforward result based on the kolmogrov criterion is theorem for given matrix mn its projection onto xt pt is given by if ℓij xij µij xt ij ℓij if xij ℓij µij if xij µij for projection onto well known result is the following theorem for mn and its singular value decomposition σv where diag λn the projection of onto is given by xs ps where diag and λi if λi λi otherwise the matrix xs ps gives the positive matrix that most closely approximates with respect to the frobenius norm dykstra algorithm to study the orthogonal projection onto the intersection of subspaces classical result is von neumann alternating projection algorithm let be hilbert space with two closed subspaces and the orthogonal projection onto the intersection can be obtained by the product of the two projections when the two projections commute when they do not commute the work shows that for each the projection of onto the intersection can be obtained by the limit point of sequence of projections onto each subspace respectively the algorithm generalizes to any finite number of subspaces and projections onto them unfortunately different from the application in in our problem both and are not subspaces but subsets and von neumann convergence result does not apply 
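In display form, the calibration problem and its key guarantee read as follows (restating the constraints and the theorem above, with notation as in the text).

```latex
\widehat{J} \;=\; \operatorname*{argmin}_{X \in \mathcal{S} \cap \mathcal{T}}
  \ \|X - \widetilde{J}\|_{F}^{2},
\qquad
\mathcal{S} = \{X \in \mathbb{M}^{n} : X \succeq 0\},\quad
\mathcal{T} = \{X \in \mathbb{M}^{n} : \ell_{ij} \le x_{ij} \le \mu_{ij}\},
```

and, because the true matrix J lies in the intersection of S and T, the projection satisfies ||J_hat - J||_F <= ||J_tilde - J||_F, with equality if and only if the initial estimate already lies in the feasible region.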
the limit point of the generated sequence may converge to points to handle the difficulty dykstra extended von neumann work and proposed an algorithm that tr works with subsets consider the case of ci where is nonempty and each ci is closed and convex subset in assume that for any obtaining pc is hard while obtaining each dykstra algorithm produces two sequences from kp ci is easy starting the iterates xi and the increments ii the two sequences are generated by xki pci iik xki where and the initial values are given by the sequence of xki converges to the optimal solution with theoretical guarantee theorem let cr be closed and convex subsets of hilbert space such that ck for any and any the sequence xki converges strongly to pc xki as the convergent rate of dykstra algorithm for polyhedral sets is linear which coincides with the convergence rate of von neumann alternating projection method an iterative method based on the discussion in section we have simple approach shown in algorithm that finds the projection of an initial matrix onto the nonempty set here the projections onto and are given by the two theorems in section the algorithm stops when falls into the feasible region or when maximal number of iterations is achieved for practical implementation more robust stopping criterion can be adopted related work it is known study in mathematical optimization field to find positive matrix that is closest to given matrix number of methods have been proposed recently the idea of alternating projection method was firstly applied in financial application the problem can also be phrased as programming sdp model and be solved via the method in the work of and the method and the projected gradient method to the lagrangian dual of the original problem were applied which reported faster results than the sdp formulation an even faster newton method was developed in by investigating the dual problem which is unconstrained with twice continuously differentiable objective function and has quadratically convergent solution algorithm projection onto require initial matrix while not convergent do ps jtk isk jtk isk pt itk itk end while return jtk evaluation to evaluate the performance of the proposed method four benchmark datasets were used in our experiments mnist grayscale image database of handwritten digits to after binarization each image is represented as vector usps another grayscale image database of handwritten digits after binarization each image is represented as vector protein bioinformatics database with three classes of instances each instance is represented as sparse vector webspam dataset with both spam and web pages each page is represented as vector the data are highly sparse on average one vector has about values out of more than million features our experiments have two objectives one is to verify the effectiveness of the proposed method in estimating the jaccard index matrix by measuring the derivation of the calibrated matrix from the ground truth in frobenius norm the other is to evaluate the performance of the calibrated matrix in general learning applications the comparison is made against the popular imputation approaches listed in section including the zero knn and em approaches as the median approach gave very similar performance as the zero approach its results were not reported separately jaccard index matrix estimation the experiment was carried out under various settings for each dataset we experimented with and samples respectively for each sample different portions from to 
of the feature values were marked as missing. The missingness was assumed to be completely at random, with all features having the same probability of being masked. As described above, for the proposed calibration approach an initial Jaccard index matrix was first built from the incomplete data, and the matrix was then calibrated to satisfy the positive semidefiniteness requirement and the elementwise lower and upper bound constraints. For the imputation approaches, the Jaccard index matrix was calculated directly from the imputed data. Note that for the kNN approach we iterated over different values of k and collected the best result, which in fact overestimates its performance under some settings. The results of the EM approach were not available due to its prohibitive computational requirements on our platform.

The results are presented as mean square deviations from the ground-truth Jaccard index matrix. For an estimated matrix, its mean square deviation from the ground truth is defined as the squared Frobenius distance between the two matrices divided by the number of matrix elements. In addition to the comparison with the popular imputation approaches, the mean square deviation between the initial (uncalibrated) matrix and the ground truth, shown as "no calibration", is also reported as a baseline.

(Figure: mean square deviations from the ground truth on the benchmark datasets (MNIST, USPS, Protein, Webspam) by different methods. Horizontal axis: percentage of observed values; vertical axis: mean square deviation; panels correspond to different datasets and sample sizes. For a better view of the results, which are shown in color, the reader is referred to the soft copy of this paper.)

The figure shows the results. The calibrated matrices exhibit the smallest deviation from the ground truth in nearly all experiments, and the improvement is especially significant when the ratio of observed features is low, that is, when the missing ratio is high. The calibrated matrix is guaranteed to be no worse than the initial matrix, as evidenced by the results; for the imputation approaches there is no such guarantee.
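For concreteness, the sketch below illustrates how the calibration step evaluated in these experiments could be implemented: a Dykstra-style alternating projection between the positive semidefinite cone and the box defined by the elementwise lower and upper bounds, following the algorithm described above. It is a minimal sketch assuming NumPy; the function names, the convergence tolerance, and the iteration cap are illustrative choices rather than details taken from the original implementation, and the initial estimate J0 and the bound matrices L and U are assumed to be given.

```python
import numpy as np

def project_box(X, L, U):
    # Projection onto {X : L <= X <= U}: clip each entry into its interval.
    return np.clip(X, L, U)

def project_psd(X):
    # Projection onto the positive semidefinite cone (Frobenius norm):
    # symmetrize, then zero out the negative eigenvalues.
    X = (X + X.T) / 2.0
    lam, V = np.linalg.eigh(X)
    lam = np.maximum(lam, 0.0)
    return (V * lam) @ V.T

def calibrate_jaccard(J0, L, U, max_iter=500, tol=1e-9):
    """Dykstra's alternating projections of the initial estimate J0
    onto the intersection of the PSD cone and the box [L, U]."""
    X = J0.copy()
    p = np.zeros_like(X)  # increment carried for the PSD projection
    q = np.zeros_like(X)  # increment carried for the box projection
    for _ in range(max_iter):
        Y = project_psd(X + p)
        p = X + p - Y
        X_new = project_box(Y + q, L, U)
        q = Y + q - X_new
        if np.linalg.norm(X_new - X, "fro") < tol:
            return X_new
        X = X_new
    return X
```

In the experiments above, the matrix returned by such a routine is what would then be used in place of the initial estimate, for example in the nearest-neighbour classification described next.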
Supervised learning. Given the improved results in reducing the deviation from the ground-truth matrix, we further investigated whether this improvement benefits practical applications, specifically supervised learning. We applied the calibrated results in nearest-neighbour classification tasks: given a training set of labeled samples, we predicted the labels of the samples in the testing set, where the label of each testing sample was determined by the label of the training sample that had the largest Jaccard index value with it. As before, the experiment was carried out with different sample sizes and different portions of missing values. In each run, part of the samples were randomly chosen as the training set and the remaining samples were used as the testing set, and the mean and standard deviation of the classification errors over the repeated runs were reported. As a reference, the results obtained from the ground-truth matrix, shown as "fully observed", were also included.

(Figure: classification errors on the benchmark datasets (MNIST, USPS, Protein, Webspam) by different methods. Horizontal axis: percentage of observed values; vertical axis: classification error; panels correspond to different datasets and sample sizes. For a better view of the results, which are shown in color, the reader is referred to the soft copy of this paper.)

The figure shows the results. Again, the matrix calibration method gives evidently improved results over the imputation approaches in most experiments. The improvement verifies the benefit brought by the reduced deviation from the true Jaccard index matrix and therefore justifies the usefulness of the proposed method in learning applications.

Discussion and conclusion. The Jaccard index measures the pairwise similarity between data samples and is routinely used in real applications. Unfortunately, in practice it is difficult to estimate the Jaccard index matrix for incomplete data samples. This paper investigates the problem and proposes a matrix calibration approach that differs completely from the existing methods: instead of throwing away the unknown features or imputing the missing values, the proposed approach calibrates any approximate Jaccard index matrix by enforcing the positive semidefiniteness requirement on the matrix. It is theoretically shown and empirically verified that the approach indeed brings improvement in practical problems. One point that is not particularly addressed in this paper is computational complexity. We adopted a simple alternating projection procedure based on Dykstra's algorithm; its cost depends heavily on the successive matrix decompositions and becomes expensive when the matrix is large. Calibrating a Jaccard index matrix for a modest number of samples can be finished in seconds on our platform, while calibrating a matrix for a larger number of samples quickly increases to more than an hour. Further investigation of faster solutions is thus necessary for scalability. There is, in fact, a simple heuristic for calibrating a large matrix: first divide the matrix into small blocks, then calibrate each block to meet the constraints, and finally merge the results. Although the heuristic may not give the optimal result, it is still guaranteed to produce a matrix better than, or identical to, the initial matrix, it runs with high parallel efficiency, and it easily scales to very large matrices; the detailed discussion is omitted here due to the space limit.

Acknowledgments. The work is supported by the Science and Technology Development Fund project of Macao SAR, China.

References: birgin and raydan robust stopping criteria for dykstra algorithm siam journal on scientific computing; bouchard jousselme and proof for the positive definiteness of the jaccard index matrix international
journal of approximate reasoning boyd and vandenberghe convex optimization cambridge university press new york ny usa boyd and xiao covariance matrix adjustment siam journal on matrix analysis and applications broder charikar frieze and mitzenmacher independent permutations in proceedings of the thirtieth annual acm symposium on theory of computing pages acm dempster laird and rubin maximum likelihood from incomplete data via the em algorithm journal of the royal statistical society series deutsch best approximation in inner product spaces springer new york ny usa duda and hart pattern classification john wiley and sons hoboken nj usa dykstra an algorithm for restricted least squares regression journal of the american statistical association escalante and raydan alternating projection methods siam philadelphia pa usa ghahramani and jordan supervised learning from incomplete data via an em approach in advances in neural information processing systems volume pages morgan kaufmann golub and van loan matrix computations johns hopkins university press baltimore md usa higham computing the nearest correlation matrix problem from finance ima journal of numerical analysis jaccard the distribution of the flora in the alpine zone new phytologist jain murty and flynn data clustering review acm computing surveys knol and ten berge approximation of an improper correlation matrix by proper one psychometrika leskovec rajaraman and ullman mining of massive datasets cambridge university press new york ny usa li and theory and applications of minwise hashing communications of the acm li lee and leung rlsc learning without agony in proceedings of the international conference on machine learning pages acm luenberger optimization by vector space methods john wiley sons new york ny usa malick dual approach to semidefinite problems siam journal on matrix analysis and applications qi and sun quadratically convergent newton method for computing the nearest correlation matrix siam journal on matrix analysis and applications rogers and tanimoto computer program for classifying plants science salton wong and yang vector space model for automatic indexing communications of the acm and smola learning with kernels support vector machines regularization optimization and beyond the mit press cambridge ma usa 
neural adaptive sequential monte carlo shixiang zoubin richard university of cambridge department of engineering cambridge uk mpi for intelligent systems germany zoubin abstract sequential monte carlo smc or particle filtering is popular class of methods for sampling from an intractable target distribution using sequence of simpler intermediate distributions like other importance methods performance is critically dependent on the proposal distribution bad proposal can lead to arbitrarily inaccurate estimates of the target distribution this paper presents new method for automatically adapting the proposal using an approximation of the divergence between the true posterior and the proposal distribution the method is very flexible applicable to any parameterized proposal distribution and it supports online and batch variants we use the new framework to adapt powerful proposal distributions with rich parameterizations based upon neural networks leading to neural adaptive sequential monte carlo nasmc experiments indicate that nasmc significantly improves inference in state space model outperforming adaptive proposal methods including the extended kalman and unscented particle filters experiments also indicate that improved inference translates into improved parameter learning when nasmc is used as subroutine of particle marginal metropolis hastings finally we show that nasmc is able to train latent variable recurrent neural network achieving results that compete with the for polymorphic music modelling nasmc can be seen as bridging the gap between adaptive smc methods and the recent work in scalable variational inference introduction sequential monte carlo smc is class of algorithms that draw samples from target distribution of interest by sampling from series of simpler intermediate distributions more specifically the sequence constructs proposal for importance sampling is smc is particularly for performing inference in dynamical models with hidden variables since filtering naturally decomposes into sequence and in many such cases it is the inference method generally speaking inference methods can be used as modules in parameter learning systems smc has been used in such way for both approximate parameter learning and in bayesian approaches such as the recently developed particle mcmc methods critically in common with any importance sampling method the performance of smc is strongly dependent on the choice of the proposal distribution if the proposal is not to the target distribution then the method can produce samples that have low effective sample size and this leads to monte carlo estimates that have pathologically high variance the smc community has developed approaches to mitigate these limitations such as resampling to improve particle diversity when the effective sample size is low and applying mcmc transition kernels to improve particle diversity complementary line of research leverages distributional approximate inference methods such as the extended kalman filter and unscented kalman filter to construct better proposals leading to the extended kalman particle filter ekpf and unscented particle ter upf in general however the construction of good proposal distributions is still an open question that severely limits the applicability of smc methods this paper proposes new adaptive smc method that automatically tunes flexible proposal distributions the quality of proposal distribution can be assessed using the intractable kl divergence between the target distribution and the parametrized 
proposal distribution we approximate the derivatives of this objective using samples derived from smc the framework is very general and tractably handles complex parametric proposal distributions for example here we use neural networks to carry out the parameterization thereby leveraging the large literature and efficient computational tools developed by this community we demonstrate that the method can efficiently learn good proposal distributions that significantly outperform existing adaptive proposal methods including the ekpf and upf on standard benchmark models used in the particle filter community we show that improved performance of the smc algorithm translates into improved mixing of the particle marginal pmmh finally we show that the method allows and more complicated models to be accurately handled using smc such as those parametrized using neural networks nn that are challenging for traditional particle filtering methods the focus of this work is on improving smc but many of the ideas are inspired by the burgeoning literature on approximate inference for unsupervised neural network models these connections are explored in section sequential monte carlo we begin by briefly reviewing two fundamental smc algorithms sequential importance sampling sis and sequential importance resampling sir consider probabilistic model comprising possibly hidden and observed states and respectively whose joint disqt tribution factorizes as zt xt this general form subsumes common models such as hidden markov models hmms as well as models for the hidden state such as gaussian processes the goal of the sequential importance sampler is to approximate the posterior distribution over pn the hidden state sequence through weighted set of sampled trajectories drawn from simpler proposal distribution any form of proposal distribution can be used in principle but particularly convenient one takes qt the same factorisation as the true posterior zt with filtering dependence on short derivation see supplementary material then shows that the normalized importance weights are defined by recursion zt xt zt sis is elegant as the samples and weights can be computed in sequential fashion using single forward pass however implementation suffers from severe pathology the distribution of importance weights often become highly skewed as increases with many samples attaining very low weight to alleviate the problem the sequential importance resampling sir algorithm adds an additional step that resamples zt at time from multinomial distribution given by and gives the new particles equal this replaces degenerated particles that have low weight with samples that have more substantial importance weights without violating the validity of the method sir requires knowledge of the full trajectory of previous samples at each stage to draw the samples and compute the importance weights for this reason when carrying out resampling each new particle needs to update its ancestry information letting aτ represent the ancestral index of particle at time for state zτ where and collecting these into the set at zt where tt at where aτ aτ the resampled trajectory can be denoted zt finally to lighten notation we use the shorthand more advanced implementations resample only when the effective sample size falls below threshold wt for the weights note that when employing resampling these do not depend on the previous weights since resampling has given the previous particles uniform weight the implementation of smc is given by algorithm in the 
supplementary material the critical role of proposal distributions in sequential monte carlo the choice of the proposal distribution in smc is critical even when employing the resampling step poor proposal distribution will produce trajectories that when traced backwards quickly collapse onto single ancestor clearly this represents poor approximation to the true posterior these effects can be mitigated by increasing the number of particles applying more complex additional mcmc moves but these strategies increase the computational cost the conclusion is that the proposal should be chosen with care the optimal choice for an unconstrained proposal that has access to all of the observed data at all times is the intractable posterior distribution qφ pθ given the restrictions imposed by the factorization this becomes zt zt which is still typically intractable the bootstrap filter instead uses the prior zt zt which is often tractable but fails to incorporate information from the current observation xt employs distributional approximate inference techniques to approximate zt examples include the ekpf and upf however these methods suffer from three main problems first the extended and unscented kalman filter from which these methods are derived are known to be inaccurate and poorly behaved for many problems outside of the smc setting second these approximations must be applied on sample by sample basis leading to significant additional computational overhead third neither approximation is tuned using an criterion in the next section we introduce new method for adapting the proposal that addresses these limitations adapting proposals by descending the inclusive kl divergence in this work the quality of the proposal distribution will be optimized using the inclusive between the true posterior distribution and the proposal kl pθ parameters are made explicit since we will shortly be interested in both adapting the proposal and learning the model this objective is chosen for four main reasons first this is direct measure of the quality of the proposal unlike those typically used such as effective sample size second if the true posterior lies in the class of distributions attainable by the proposal family then the objective has global optimum at this point third if the true posterior does not lie within this class then this kl divergence tends to find proposal distributions that have higher entropy than the original which is advantageous for importance sampling the exclusive kl is unsuitable for this reason fourth the derivative of the objective can be approximated efficiently using sample based approximation that will now be described the gradient of the negative kl divergence with respect to the parameters of the proposal distribution takes simple form kl pθ pθ log qφ the expectation over the posterior can be approximated using samples from smc one option would use the weighted sample trajectories at the final of smc but although asymptotically unbiased such an estimator would have high variance due to the collapse of the trajectories an alternative that reduces variance at the cost of introducing some bias uses the intermediate ancestral trees filtering approximation see the supplementary material for details kl pθ log qφ zt the simplicity of the proposed approach brings with it several advantages and opportunities online and batch variants since the derivatives distribute over time it is trivial to apply this update in an online way updating the proposal distribution every alternatively when learning 
parameters in batch setting it might be more appropriate to update the proposal parameters after making full forward pass of smc conveniently when performing approximate learning the gradient update for the model parameters can be efficiently approximated using the same sample particles from smc see supplementary material and algorithm similar derivation for maximum likelihood learning is also discussed in log pθ log pθ xt zt algorithm stochastic gradient adaptive smc batch inference and learning variants require proposal qφ model pθ observations tj number of particles repeat tj nextminibatch tj smc tj ptj log ptj log pθ xt zt optional optimize optimize optional until convergence efficiency of the adaptive proposal in contrast to the epf and upf the new method employs an analytic function for propagation and does not require costly distributional approximation as an similarly although the method bears similarity to the filter adf which minimizes local inclusive kl the new method has the advantage of minimizing global cost and does not require moment matching training complex proposal models the adaptation method described above can be applied to any parametric proposal distribution special cases have been previously treated by we propose related but arguably more straightforward and general approach to proposal adaptation in the next section we describe rich family of proposal distributions that go beyond previous work based upon neural networks this approach enables adaptive smc methods to make use of the rich literature and optimization tools available from supervised learning flexibility of training one option is to train the proposal distribution using samples from smc derived from the observed data however this is not the only approach for example the proposal could be trained using data sampled from the generative model instead which might mitigate overfitting effects for small datasets similarly the trained proposal does not need to be the one used to generate the samples in the first place the bootstrap filter or more complex variants can be used flexible and trainable proposal distributions using neural networks the proposed adaption method can be applied to any parametric proposal distribution here we briefly describe how to utilize this flexibility to employ powerful neural parameterizations that have recently shown excellent performance in supervised sequence learning tasks generally speaking applications of these techniques to unsupervised sequence modeling settings is an active research area that is still in its infancy and this work opens new avenue in this wider research effort in nutshell the goal is to parameterize qφ zt the proposal stochastic mapping from all previous hidden states and all observations up to and including the current observation to the current hidden state zt in flexible computationally efficient and trainable way here we use class of functions called long memory lstm that define deterministic mapping from an input sequence to an output sequence using recurrent dynamics and alleviate the common vanishing gradient problem in recurrent neural networks the distributions qφ zt can be mixture of gaussians mixture density network mdn in which the mixing proportions means and covariances are parameterised through another neural network see the supplementary for details on lstm mdn and neural network architectures experiments the goal of the experiments is three fold first to evaluate the performance of the adaptive method for inference on standard benchmarks 
used by the smc community with known ground truth second to evaluate the performance when smc is used as an inner loop of learning algorithm again we use an example with known ground truth third to apply smc learning to complex models that would normally be challenging for smc comparing to the in approximate inference one way of assessing the success of the proposed method would be to evaluate kl however this quantity is hard to accurately compute instead we use number of other metrics for the experiments where ground truth states are known we can evaluate the root mean square error rmse between the approximate posterior mean of the latent variables and the true value rmse zt more generally the estimate of the likelihood lml log log xt log wt and its variance is also indicative of performance finally we also employ common metric called the effective sample size ess to measure the effectiveness of our smc method ess of particles at time is given by esst if expected ess is maximized and equals the number of particles equivalently the normalized importance weights are uniform note that ess alone is not sufficient metric since it does not measure the absolute quality of samples but rather the relative quality inference in benchmark nonlinear model in order to evaluate the effectiveness of our adaptive smc method we tested our method on standard nonlinear model often used to benchmark smc algorithms the model is given by eq where σv σw the posterior distribution pθ is highly due to uncertainty about the signs of the latent states zt zt xt xt σw cos zt the experiments investigated how the new proposal adaptation method performed in comparison to standard methods including the bootstrap filter ekpf and ukpf in particular we were interested in the following questions do rich proposals improve inference for this we compared gaussian proposal with diagonal gaussian to mixture density network with three components does recurrent parameterization of the proposal help for this we compared neural network with hidden units to recurrent neural network with lstm units can injecting information about the prior dynamics into the proposal improve performance similar in spirit to for variational methods to assess this we parameterized proposals for vt process noise instead of zt and let the proposal have access to the prior dynamics for all experiments the parameters in the model were fixed to σv σw adaptation of the proposal was performed on samples from the generative process at each iteration results are summarized in fig and table see supplementary material for additional results average run times for the algorithms over sequence of length were bootstrap ekpf upf and where ekpf and upf implementations are provided by although these numbers should only be taken as guide as the implementations had differing levels of acceleration the new adaptive proposal methods significantly outperform the bootstrap ekpf and upf methods in terms of ess rmse and the variance in the lml estimates the proposal outperforms simple gaussian proposal compare to indicating proposals can improve performance moreover the rnn outperforms the nn compare rnn to nn although the proposal models can effectively learn the transition function injecting information about the prior dynamics into the proposal does help compare to rnn interestingly there is no clear cut winner between the ekpf and upf although the upf does return lml estimates that have lower variance all methods converged to similar lmls that were close to the values computed 
using large numbers of particles, indicating that the implementations are correct.

(Figure: left, box plots of the LML estimates across iterations; right, average ESS over the first iterations, for the prior (bootstrap), EKPF, UPF and the proposed RNN-based proposals.)

(Table: left and middle, average ESS and log marginal likelihood estimates (mean and standard deviation) over the last iterations; right, the RMSE over new sequences with no further adaptation.)

Inference in the cart and pole system. As a second and more physically meaningful system, we considered a system consisting of an inverted pendulum that rests on a movable base. The system was driven by a white noise input, and an ODE solver was used to simulate the system from its equations of motion. We considered the problem of inferring the true position of the cart and the orientation of the pendulum, along with their derivatives and the input noise, from noisy measurements of the location of the tip of the pole. The results are presented in the figure. The system is significantly more intricate than the model of the previous section and does not directly admit the use of the EKPF or UPF, yet our proposal model successfully learns good proposals without any direct access to the prior dynamics.

(Figure: left, normalized ESS over iterations; middle and right, posterior mean for the horizontal location of the cart and the change in relative angle of the pole. The adapted proposal learns to attain higher ESS than the prior and estimates the latent states more accurately.)

(Figure: PMMH samples of σw for different numbers of particles. For small numbers of particles, PMMH is very slow to burn in and mix when proposing from the prior distribution, due to the large variance of the marginal likelihood estimates it returns.)

Bayesian learning in a nonlinear SSM. SMC is often employed as an inner loop of a more complex algorithm; one prominent example is particle Markov chain Monte Carlo, a class of methods that sample from the joint posterior over model parameters and latent state trajectories. Here we consider the particle marginal Metropolis-Hastings (PMMH) sampler. In this context, SMC is used to construct a proposal distribution for the MH step: a proposed set of parameters is first formed by perturbing the current parameters with a Gaussian random walk, then SMC is used to sample a proposed set of latent state variables, resulting in a joint proposal, and the MH step uses the SMC marginal likelihood estimate to determine acceptance. Full details are given in the supplementary material. In this experiment we evaluate our method within a PMMH sampler on the same model from the previous section. A random walk proposal with diagonal covariance is used to sample (σv, σw), the prior over these parameters is inverse-Gamma, the parameters are initialized to fixed values, and the PMMH is run for a fixed number of iterations. Two of the adaptive models considered in the previous section are used for comparison, with the proposal models first adapted for a number of iterations using samples from the initial model. The results are shown in the figure and were typical for a range of parameter settings. Given a sufficient number of particles there is almost no difference between the prior proposal and our method; however, when the number of particles gets smaller, NASMC enables significantly faster convergence to the posterior, particularly for the measurement noise σw, and for similar reasons NASMC mixes more quickly. The limitation of the adaptive approach is that the proposal model needs to continuously adapt as the global parameters are sampled, but note that this is still not as costly as adapting on a per-sample basis, as is the case for the EKPF and UPF.

Polyphonic music generation. Finally, the new method is used
to train latent variable recurrent neural network for modelling four polymorphic music datasets of varying complexity these datasets are often used to benchmark rnn models because of their high dimensionality and the complex temporal dependencies involved at different time scales each dataset contains at least hours of polyphonic music with an average polyphony number of simultaneous notes of out of lvrnn contains recurrent neural network with lstm layers that is driven by stochastic latent variables zt at each and stochastic outputs xt that are fed back into the dynamics full details in the supplementary material both the lstm layers in the generative and proposal models are set as units and adam is used as the optimizer the bootstrap filter is compared to the new adaptive method nasmc particles are used in the training the hyperparameters are tuned using the validation set diagonal gaussian output is used in the proposal model with an additional hidden layer of size the log likelihood on the test set standard metric for comparison in generative models is approximated using smc with particles only the prior proposal is compared since sec shows the advantage of our method over the results are reported in table the adaptive method significantly outperforms the bootstrap filter on three of the four datasets on the piano dataset the bootstrap method performs marginally better in general the nlls for the new methods are comparable to the although detailed comparison is difficult as the methods with stochastic latent states require approximate marginalization using importance sampling or smc dataset nottingham musedata jsbchorales nasmc bootstrap storn sgvb srnn table estimated negative log likelihood on test data and storn are from and srnn and are results from comparison of variational inference to the nasmc approach there are several similarities between nasmc and variational methods that employ recognition models variational methods refine an approximation qφ to the posterior distribution pθ by optimising the exclusive or variational kl qφ it is common to approximate this integral using samples from the approximate posterior this general approach is similar in spirit to the way that the proposal is adapted in nasmc except that the inclusive is employed kl pθ and this entails that sample based approximation requires simulation from the true posterior critically nasmc uses the approximate posterior as proposal distribution to construct more accurate posterior approximation the smc algorithm therefore can be seen as correcting for the deficiencies in the proposal approximation we believe that this can lead to significant advantages over variational methods especially in the setting where variational methods are known to have severe biases moreover using the inclusive kl avoids having to compute the entropy of the approximating distribution which can prove problematic when using complex approximating distributions mixtures and heavy tailed distributions in the variational framework there is close connection between nasmc and the algorithm the algorithm also employs the inclusive kl divergence to refine posterior approximation and recent generalizations have shown how to incorporate this idea into importance sampling in this context the nasmc algorithm extends this work to smc conclusion this paper developed powerful method for adapting proposal distributions within general smc algorithms the method parameterises proposal distribution using recurrent neural network to model contextual 
information allows flexible distributional forms including mixture density networks and enables efficient training by stochastic gradient descent the method was found to outperform existing adaptive proposal mechanisms including the ekpf and upf on standard smc benchmark it improves burn in and mixing of the pmmh sampler and allows effective training of latent variable recurrent neural networks using smc we hope that the connection between smc and neural network technologies will inspire further research into adaptive smc methods in particular application of the methods developed in this paper to adaptive particle smoothing latent models and adaptive pmcmc for probabilistic programming are particular exciting avenues acknowledgments sg is generously supported by fellowship the alta institute and jesus college cambridge ret thanks the epsrc grants and we thank theano developers for their toolkit the authors of for releasing the source code and roger frigola sumeet singh fredrik lindsten and thomas for helpful suggestions on experiments results for are separately provided for reference since this is different model class references gordon salmond and smith novel approach to bayesian state estimation in iee proceedings radar and signal processing vol pp iet doucet de freitas and gordon sequential monte carlo methods in practice andrieu doucet and holenstein particle markov chain monte carlo methods journal of the royal statistical society series statistical methodology vol no pp poyiadjis doucet and singh particle approximations of the score and observed information matrix in state space models with application to parameter estimation biometrika vol no pp van der merwe doucet de freitas and wan the unscented particle filter in advances in neural information processing systems pp frigola chen and rasmussen variational gaussian process models in advances in neural information processing systems pp mackay information theory inference and learning algorithms vol cambridge university press cambridge minka expectation propagation for approximate bayesian inference in proceedings of the seventeenth conference on uncertainty in artificial intelligence pp morgan kaufmann publishers cornebise adaptive sequential monte carlo methods phd thesis ph thesis university pierre and marie graves supervised sequence labelling with recurrent neural networks vol springer sutskever vinyals and le sequence to sequence learning with neural networks in advances in neural information processing systems pp graves generating sequences with recurrent neural networks corr vol hochreiter and schmidhuber long memory neural computation vol no pp bishop mixture density networks gregor danihelka graves rezende and wierstra draw recurrent neural network for image generation in proceedings of the international conference on machine learning icml lille france july pp mchutchon nonlinear modelling and control using gaussian processes phd thesis university of cambridge uk department of engineering bengio and vincent modeling temporal dependencies in highdimensional sequences application to polyphonic music generation and transcription in international conference on machine learning icml bengio and pascanu advances in optimizing recurrent networks in acoustics speech and signal processing icassp ieee international conference on pp ieee bayer and osendorfer learning stochastic recurrent networks arxiv preprint kingma and ba adam method for stochastic optimization the international conference on learning representations iclr kingma and 
welling variational bayes the international conference on learning representations iclr rezende mohamed and wierstra stochastic backpropagation and approximate inference in deep generative models international conference on machine learning icml mnih and gregor neural variational inference and learning in belief networks international conference on machine learning icml turner and sahani two problems with variational expectation maximisation for models in bayesian time series models barber cemgil and chiappa eds ch pp cambridge university press hinton dayan frey and neal the algorithm for unsupervised neural networks science vol no pp bornschein and bengio reweighted the international conference on learning representations iclr 
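To make the NASMC procedure described above concrete, the following minimal sketch shows one sweep of sequential importance resampling with a parameterized proposal, followed by a stochastic gradient step on the weighted proposal log-likelihood, which is the sample-based approximation to the inclusive KL gradient discussed earlier. The sketch assumes NumPy and deliberately simplifies: it uses a one-step (Markov) linear-Gaussian proposal rather than the LSTM mixture density network of the paper, the function names and learning rate are illustrative assumptions, and the model densities are supplied by the caller.

```python
import numpy as np

def smc_adapt_step(xs, log_f, log_g, phi, n_particles=100, lr=1e-2, rng=None):
    """One SMC sweep plus one gradient step on the proposal parameters.

    xs          : observed sequence x_1..x_T (1-D array)
    log_f(z, zp): log transition density log p(z_t = z | z_{t-1} = zp), vectorized
    log_g(x, z) : log observation density log p(x_t = x | z_t = z), vectorized
    phi         : dict {'w_z', 'w_x', 'b', 'log_s'} defining the proposal
                  q_phi(z_t | z_{t-1}, x_t) = N(w_z*z_{t-1} + w_x*x_t + b, s^2)
    """
    rng = np.random.default_rng() if rng is None else rng
    N = n_particles
    z_prev = np.zeros(N)
    grad = {k: 0.0 for k in phi}
    log_ml = 0.0  # running log marginal likelihood estimate

    for x in xs:
        # propose from q_phi
        mu = phi['w_z'] * z_prev + phi['w_x'] * x + phi['b']
        s = np.exp(phi['log_s'])
        z = mu + s * rng.standard_normal(N)

        # incremental importance weights (previous weights are uniform
        # because we resample at every step)
        log_q = -0.5 * ((z - mu) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)
        log_w = log_f(z, z_prev) + log_g(x, z) - log_q
        m = log_w.max()
        log_ml += m + np.log(np.mean(np.exp(log_w - m)))
        w = np.exp(log_w - m)
        w /= w.sum()

        # accumulate the weighted gradient of log q_phi (inclusive-KL ascent)
        d_mu = (z - mu) / s ** 2
        grad['w_z'] += np.sum(w * d_mu * z_prev)
        grad['w_x'] += np.sum(w * d_mu * x)
        grad['b'] += np.sum(w * d_mu)
        grad['log_s'] += np.sum(w * (((z - mu) / s) ** 2 - 1.0))

        # multinomial resampling
        idx = rng.choice(N, size=N, p=w)
        z_prev = z[idx]

    # stochastic gradient step on the proposal parameters
    for k in phi:
        phi[k] += lr * grad[k]
    return phi, log_ml
```

Calling such a routine repeatedly, on sequences drawn either from the data or from the generative model, would correspond to the online variant discussed above; a batch variant would instead accumulate gradients over a minibatch of sequences before updating the parameters.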
local expectation gradients for black box variational inference michalis titsias athens university of economics and business mtitsias miguel vicarious miguel abstract we introduce local expectation gradients which is general purpose stochastic variational inference algorithm for constructing stochastic gradients by sampling from the variational distribution this algorithm divides the problem of estimating the stochastic gradients over multiple variational parameters into smaller so that each explores intelligently the most relevant part of the variational distribution this is achieved by performing an exact expectation over the single random variable that most correlates with the variational parameter of interest resulting in estimate that has low variance our method works efficiently for both continuous and discrete random variables furthermore the proposed algorithm has interesting similarities with gibbs sampling but at the same time unlike gibbs sampling can be trivially parallelized introduction stochastic variational inference has emerged as promising and flexible framework for performing large scale approximate inference in complex probabilistic models it significantly extends the traditional variational inference framework by incorporating stochastic approximation into the optimization of the variational lower bound currently there exist two major research directions in stochastic variational inference the first one data stochasticity attempts to deal with massive datasets by constructing stochastic gradients by using of training examples the second direction expectation stochasticity aims at dealing with the intractable expectations under the variational distribution that are encountered in probabilistic models unifying these two ideas it is possible to use stochastic gradients to address both massive datasets and intractable expectations this results in doubly stochastic estimation approach where the source of stochasticity can be combined with the stochasticity associated with sampling from the variational distribution in this paper we are interested to further investigate the expectation stochasticity that in practice is dealt with by drawing samples from the variational distribution challenging issue here is concerned with the variance reduction of the stochastic gradients specifically while the method based on the log derivative trick is currently the most general one it has been observed to severely suffer from high variance problems and thus it is only applicable together with sophisticated variance reduction techniques based on control variates however the construction of efficient control variates can be strongly dependent on the form of the probabilistic model therefore it would be highly desirable to introduce more black box procedures where simple stochastic gradients can work well for any model thus allowing the not to worry about having to design variance reduction techniques notice that for continuous random variables and differentiable functions the reparametrization approach offers simple black box procedure which does not require further variance reduction however reparametrization is neither applicable for discrete spaces nor for models and this greatly limits its scope of applicability in this paper we introduce simple black box algorithm for stochastic optimization in variational inference which provides stochastic gradients having low variance and without needing any extra variance reduction this method is based on new trick referred to as local expectation 
or integration the key idea here is that stochastic gradient estimation over multiple variational parameters can be divided into smaller where each requires different amounts of information about different parts of the variational distribution more precisely each aims at exploiting the conditional independence structure of the variational distribution based on this intuitive idea we introduce the local expectation gradients algorithm that provides stochastic gradient over variational parameter vi by performing an exact expectation over the associated latent variable xi while using single sample from the remaining latent variables essentially this consists of raoblackwellized estimate that allows to dramatically reduce the variance of the stochastic gradient so that for instance for continuous spaces the new stochastic gradient is guaranteed to have lower variance than the stochastic gradient corresponding to the reparametrization method where the latter utilizes single sample furthermore the local expectation algorithm has interesting similarities with gibbs sampling with the important difference that unlike gibbs sampling it can be trivially parallelized stochastic variational inference here we discuss the main ideas behind current algorithms on stochastic variational inference and particularly methods that sample from the variational distribution in order to approximate intractable expectations using monte carlo given joint probability distribution where are observations and are latent variables possibly including model parameters that consist of random variables and variational distribution qv the objective is to maximize the lower bound eqv log log qv eqv log eqv log qv with respect to the variational parameters ideally in order to tune we would like to have expression for the lower bound so that we could subsequently maximize it by using standard optimization routines such as algorithms however for many probabilistic models and forms of the variational distribution at least one of the two expectations in is intractable therefore in general we are faced with the following intractable expectation eqv where can be either log log qv or log log qv from which we would like to efficiently estimate the gradient over in order to apply optimization the most general method for estimating the gradient is based on the log derivative trick also called likelihood ratio or reinforce that has been invented in control theory and reinforcement learning and used recently for variational inference specifically this makes use of the property qv qv log qv which allows to write the gradient as eqv log qv and then obtain an unbiased estimate according to log qv where each is an independent draw from qv while this estimate is unbiased it has been observed to severely suffer from high variance so that in practice it is necessary to consider variance reduction techniques such as those based on control variates the second approach is suitable for continuous spaces where is differentiable function of it is based on simple transformation of which allows to move the variational parameters inside so that eventually the expectation is taken over base distribution that does not depend on the variational parameters any more for example if the variational distribution is the gaussian ll where the expectation in can be as lz dz and subsequently the gradient over can be approximated by the following unbiased monte carlo estimate lz where each is an independent sample from this estimate makes efficient use of the slope of 
which allows to perform informative moves in the space of furthermore it has been shown experimentally in several studies that the estimate in has relatively low variance and can lead to efficient optimization even when single sample is used at each iteration nevertheless limitation of the approach is that it is only applicable to models where is continuous and is differentiable even within this subset of models we are also additionally restricted to using certain classes of variational distributions for which reparametrization is possible next we introduce an approach that is applicable to broad class of models both discrete and continuous has favourable scaling properties and provides stochastic gradients local expectation gradients suppose that the latent vector in the probabilistic model takes values in some space sn where each set si can be continuous or discrete we consider variational distribution over that is represented as directed graphical model having the following joint density qv qvi xi where qvi xi is the conditional factor over xi given the set of the parents denoted by pai we assume that each conditional factor has its own separate set of variational parameters vi and vi vn the objective is then to obtain stochastic approximation for the gradient of the lower bound over each variational parameter vi our method is motivated by the observation that each vi is influenced mostly by its corresponding latent variable xi since vi determines the factor qvi xi therefore to get information about the gradient of vi we should be exploring multiple possible values of xi and rather smaller set of values from the remaining latent variables next we take this idea into the extreme where we will be using infinite draws from xi essentially an exact expectation together with just single sample of more precisely we factorize the variational distribution as qv xi where mbi denotes the markov blanket of xi the gradient over vi can be written as eq log qvi xi eq eq xi log qvi xi where in the second expression we used the law of iterated expectations then an unbiased stochastic gradient say at the iteration of an optimization algorithm can be obtained by drawing single sample from so that qe xi xi qvi xi eq xi xi log qvi xi xi xi is the same as xi but with xi denotes summation or integration and qvi xi removed from the the above is the expression for the proposed stochastic gradient for the parameter vi notice that this estimate does not rely on the log derivative trick since we never draw samples from xi instead the trick here is to perform local expectation integration or summation to get an independent sample from we can simply simulate full latent vector from qv by applying the standard ancestral sampling procedure for directed graphical models then the is by construction an independent draw from the where notice that xi xi mbi qvi xi for some function algorithm stochastic variational inference using local expectation gradients input qv initialize repeat set draw pivot sample qv for to do dvi eq xi xi log qvi xi vi vi ηt dvi end for until convergence criterion is met marginal furthermore the sample can be thought of as global or pivot sample that is needed to be drawn once and then it can be multiple times in order to compute all stochastic gradients for all variational parameters vn according to eq when the variable xi takes discrete values the expectation in eq reduces to sum of terms associated with all possible values of xi on the other hand when xi is continuous variable the expectation 
in corresponds to an univariate integral that in general may not be analytically tractable in this case we shall use fast numerical integration methods we shall refer to the above algorithm for providing stochastic gradients over variational parameters as local expectation gradients and of stochastic variational inference scheme that internally uses this algorithm is given in algorithm notice that algorithm corresponds to the case where log log qv while other cases can be expressed similarly in the next two sections we discuss some theoretical properties of local expectation gradients section and draw interesting connections with gibbs sampling section properties of local expectation gradients we first derive the variance of the stochastic estimates obtained by local expectation gradients in our analysis we will focus on the case of fitting fully factorized variational distribution and leave the more general case for future work having the form qvi xi qv for such case the local expectation gradient for each parameter vi from eq simplifies to eqvi xi xi log qvi xi qvi xi xi xi where also for notational simplicity we write as it would be useful to define the following mean and covariance functions xi eq xi cov xi eq xi xi that characterize the variability of xi as varies according to notice that based on eq the exact gradient of the variational lower bound over vi can also be written as xi qvi xi xi which has an analogous form to the local expectation gradient from with the difference that xi is now replaced by its mean value xi we can now characterize the variance of the stochastic gradient and describe some additional properties all proofs for the following statements are given in the supplementary material proposition the variance of the stochastic gradient in can be written as qvi xi qvi cov xi xi this gives us some intuition about when we expect the variance of the estimate to be small for instance two simple cases are when the covariance function cov xi takes small values which can occur when has low entropy or ii when cov xi is approximately constant in fact when cov xi is exactly constant then the variance is zero so that the stochastic gradient is exact as the following proposition states proposition if cov xi for all xi and then the variance in is equal to zero case for which the condition cov xi holds exactly is when the function factorizes as xi fi xi see supplementary material for proof such factorization essentially implies that xi is independent from the remaining random variables which results the local expectation gradient to be exact in contrast in order to get exactness by using the standard monte carlo stochastic gradient from eq and any of its improvements that apply variance reduction we will typically need to draw infinite number of samples to further analyze local expectation gradients we can contrast them with stochastic gradients obtained by the reparametrization trick suppose that we can reparametrize the random variable xi qvi xi according to xi vi zi where zi qi zi and qi zi is suitable base distribution we further assume that the function xi is differentiable with respect to xi and vi zi is differentiable with respect to vi then the exact gradient with respect to the variational parameter vi can be reparametrized as qvi xi xi dx qi zi vi zi dzi while stochastic estimate that follows from this expression is vi zi zi qi zi the following statement gives us clear understanding about how this estimate compares with the corresponding local expectation gradient 
proposition given that we can reparametrize xi as described above and all differentiability conditions mentioned above hold the gradient from can be equivalently written as qi zi vi zi dzi clearly the above expression is an expectation of the reparametrization gradient from eq and therefore based on the standard argument the variance of the local expectation gradient will always be lower or equal than the variance of estimate based on the reparametrization method notice that the reparametrization method is only applicable to continuous random variables and differentiable functions however for such cases reparametrization could be computationally more efficient than local expectation gradients since the latter approach will require to apply numerical integration to estimate the integral in or the integral in which could be computationally more expensive connection with gibbs sampling there are interesting similarities between local expectation gradients and gibbs sampling firstly notice that carrying out gibbs sampling in the variational distribution in eq requires iteratively sampling from each conditional xi for and clearly the same conditional appears also in local expectation gradients with the obvious difference that instead of sampling from xi we now average under this distribution of course in practice we never perform gibbs sampling on variational distribution but instead on the true posterior distribution which is proportional to ef where we assumed that log qv is not part of specifically at each gibbs step we simulate new value for some xi from the posterior conditional distribution that is propor tional to and where are the fixed values for the remaining random variables we can observe that an update in local expectation gradients is quite similar because now we also condi tion on some fixed remaining values in order to update the parameter vi towards the direction the exact value of the two integrals is the same however approximation of these two integrals based on numerical integration will typically not give the same value where xi gets closer to the corresponding true posterior conditional distribution despite these similarities there is crucial computational difference between the two procedures while in local expectation gradients it is perfectly valid to perform all updates of the variational parameters in parallel given the pivot sample in gibbs sampling all updates need to be executed in serial manner this difference is essentially consequence of the fundamental difference between variational inference and gibbs sampling where the former relies on optimization while the latter on convergence of markov chain experiments in this section we apply local expectation gradients legrad to two different types of stochastic variational inference problems and we compare it against the standard stochastic gradient based on the log derivative trick ldgrad that incorporates also variance as well as the gradient regrad given by eq in section we consider classification problem using two digits from the mnist database and we approximate bayesian logistic regression model using stochastic variational inference then in section we consider sigmoid belief networks and we fit them to the binarized version of the mnist digits bayesian logistic regression in this section we compare the three approaches in challenging binary classification problem using bayesian logistic regression specifically given dataset zj yj where zj is the input and yj the class label we model the joint distribution over 
the observed labels and the parameters by ym zm where is the sigmoid function and denotes gaussian prior on the weights we wish to apply the three algorithms in order to approximate the posterior overqthe regression parameters by factorized variational gaussian distribution of the form qv wi in the following we consider subset of the mnist dataset that includes all training examples from the digit classes and each with pixels so that by including the bias the number of weights is to obtain the local expectation gradient for each µi we need to apply numerical integration we used the quadrature rule having so that legrad was using function evaluations per gradient estimation for ldgrad we also set the number of samples to so that legrad and ldgrad match exactly in the number of function evaluations and roughly in computational cost when using the regrad approach based on we construct the stochastic gradient using target function gradient samples this matches the computational cost but regrad still has the unmatched advantage of having access to the gradient of the target function the variance of the stochastic gradient for parameter is shown in figure it is much smaller for legrad than for ldgrad despite having almost similar computational cost and use the same amount of information about the target function the evolution of the bound in figure clearly shows the advantage of using less noisy gradients ldgrad will need huge number of iterations to find the global optimum despite having optimized the step size of its stochastic updates sigmoid belief networks in the second example we consider sigmoid belief networks sbns and compare our approach with ldgrad in terms of variance and optimization efficiency and then ii we perform density estimation experiments by training sigmoid belief nets with fully connected hidden units using legrad note that regrad can not be used on discrete models as discussed in there are multiple unbiased estimators of and using directly tends to have large variance we use instead the estimator given by eq in though other estimators with even lower variance exist we restrict ourselves to those with the same scalability as the proposed legrad requiring at most computation per gradient estimation gaussian quadrature with grid points integrates exactly polynomials up to degree lower bound variance variance iterations iterations iterations figure variance of the gradient for the variational parameter for legrad red line and regrad blue line variance of the gradient for the variational parameter for ldgrad green line evolution of the stochastic value of the lower bound for the variance reduction comparison we consider network with an unstructured hidden layer where binary observed vectors yi are generated independently according to xy wd wd where is vector of hidden variables that follows uniform distribution the matrix which includes bias terms contains the parameters to be estimated by fitting the model to the data in theory we could use the em algorithm to learn the parameters however such an approach is not feasible because at the step we need to compute the posterior distribution xi over each hidden variable which clearly is intractable since each xi takes values therefore we need to apply approximate inference and next we consider stochastic variational inference using the local expectation gradients algorithm and compare this with the method in eq which has the same scalability properties and have been denoting as ldgrad more precisely we shall consider variational 
distribution that consists of recognition model which is parametrized by reverse sigmoid network that predicts the latent vector qk xi from the associated observation yi qv xi the yi yi variational parameters are contained in matrix also the bias terms the application of stochastic variational inference boils down to constructing separate lower bound for each pair yi xi so that the full bound is the sum of these individual terms see supplementary material for explicit expressions then the maximization of the bound proceeds by performing stochastic gradient updates for the model weights and the variational parameters the update for reduces to logistic regression type of update based upon drawing single sample from the full variational distribution on the other hand obtaining effective and low variance stochastic gradients for the variational parameters is considered to be very highly challenging task and current advanced methods are based on covariates that employ neural networks as auxiliary models in contrast the local expectation gradient for each variational parameter vk only requires evaluating yid wd xik ik log yi σik σik log yid wd ik ik yi where σik and yeid is the encoding of yid this expression is weighted sum across data terms where each term is difference induced by the directions xik and xik for all hidden units xik associated with the variational factors that depend on vk based on the above model we compare the performance of legrad and ldgrad when simultaneously optimizing and for small set of random binarized mnist digits the evolution of the instantaneous bound for hidden units can be seen in figure where once again legrad shows superior performance and increased stability in the second series of experiments we consider more complex sigmoid belief network where the prior over the hidden units becomes fully connected distribution parametrized by an table nll scores in the test data for the binarized mnist dataset the left part of the table shows results based on sigmoid belief nets sbn constructed and trained based on the approach from denoted as nvil or by using the legrad algorithm the right part of the table gives the performance of alternative state of the art models reported in table in sbn nvil nvil nvil legrad legrad legrad dim test nll model fdarn nade darn rbm rbm mob dim test nll additional set of model weights see supplementary material such model can better capture the dependence structure of the hidden units and provide good density estimator for high dimensional data we trained this model using the training examples of the binarized mnist by using of size and assuming different numbers of hidden units table provides negative log likelihood nll scores for leggrad and several other methods reported in notice that for legrad the nlls are essentially variational upper bounds of the exact nlls obtained by monte carlo approximation of the variational bound an estimate also considered in from table we can observe that legrad outperforms the advanced nvil technique proposed in finally figure and displays model weights and few examples of digits generated after having trained the model with units respectively lower bound iterations figure legrad red and ldgrad green convergence for the sbn model on single minibatch of mnist digits weights filters learned by legrad when training an sbn with units in the full mnist training set new digits generated from the trained model discussion local expectation gradients is generic black box stochastic optimization algorithm that can 
be used to maximize objective functions of the form eqv problem that arises in variational inference the idea behind this algorithm is to exploit the conditional independence structure of the variational distribution qv also this algorithm is mostly related to stochastic optimization schemes that make use of the log derivative trick that has been invented in reinforcement learning and has been recently used for variational inference the approaches in can be thought of as following global sampling strategy where multiple samples are drawn from qv and then variance reduction is built posteriori in subsequent stage through the use of control variates in contrast local expectation gradients reduce variance by directly changing the sampling strategy so that instead of working with global set of samples drawn from qv the strategy now is to exactly marginalize out the random variable that has the largest influence on specific gradient of interest while using single sample for the remaining random variables we believe that local expectation gradients can be applied to great range of stochastic optimization problems that arise in variational inference and in other domains here we have demonstrated its use for variational inference in logistic regression and sigmoid belief networks references christopher bishop pattern recognition and machine learning springer jrg bornschein and yoshua bengio reweighted corr pages peter glynn likelihood ratio gradient estimation for stochastic systems commun acm october geoffrey hinton peter dayan brendan frey and radford neal the algorithm for unsupervised neural networks science matthew hoffman david blei and francis bach online learning for latent dirichlet allocation in nips pages matthew hoffman david blei chong wang and john william paisley stochastic variational inference journal of machine learning research michael jordan zoubin ghahramani tommi jaakkola and lawrence saul an introduction to variational methods for graphical models mach november diederik kingma and max welling variational bayes arxiv preprint kucukelbir ranganath gelman and blei automatic variational inference in stan in advances in neural information processing systems andriy mnih and karol gregor neural variational inference and learning in belief networks in the international conference on machine learning icml radford neal connectionist learning of belief networks artif july john william paisley david blei and michael jordan variational bayesian inference with stochastic search in icml peters and schaal policy gradient methods for robotics in proceedings of the ieee international conference on intelligent robotics systems iros rajesh ranganath sean gerrish and david blei black box variational inference in proceedings of the seventeenth international conference on artificial intelligence and statistics aistats page danilo jimenez rezende shakir mohamed and daan wierstra stochastic backpropagation and approximate inference in deep generative models in the international conference on machine learning icml herbert robbins and sutton monro stochastic approximation method the annals of mathematical statistics ruslan salakhutdinov and iain murray on the quantitative analysis of deep belief networks in andrew mccallum and sam roweis editors proceedings of the annual international conference on machine learning icml pages omnipress tim salimans and david knowles variational posterior approximation through stochastic linear regression bayesian tim salimans and david knowles on using control 
variates with stochastic approximation for variational Bayes and its connection to stochastic linear regression, January.
Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In the International Conference on Machine Learning (ICML).
Ronald Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, May.
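As an illustrative footnote to the estimator discussed above, the following minimal sketch shows a local expectation gradient for a fully factorized Bernoulli variational distribution: a single pivot sample is drawn, and for each variational parameter the corresponding variable is averaged out exactly while the remaining coordinates stay fixed at the pivot. The Bernoulli parameterization and all names are assumptions made for this example, not the authors' code.

```python
import numpy as np

def legrad_bernoulli(f, theta, rng=None):
    """One local-expectation-gradient estimate for a fully factorized Bernoulli
    variational distribution q(x) = prod_i theta_i^x_i (1 - theta_i)^(1 - x_i).

    f     : callable mapping a binary vector x to the scalar target value
            (for variational inference, f(x) = log p(y, x) - log q(x)).
    theta : current Bernoulli means, one per latent variable.
    """
    rng = rng or np.random.default_rng()
    x = (rng.random(theta.shape) < theta).astype(float)   # single pivot sample
    grad = np.empty_like(theta)
    for i in range(theta.size):
        x1, x0 = x.copy(), x.copy()
        x1[i], x0[i] = 1.0, 0.0
        # Exact average over x_i with the other coordinates fixed at the pivot:
        # E_q[ f(x) * d/dtheta_i log q(x_i) ] = f(x with x_i = 1) - f(x with x_i = 0)
        grad[i] = f(x1) - f(x0)
    return grad
```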
on variance reduction in stochastic gradient descent and its asynchronous variants sashank reddi carnegie mellon university sjakkamr ahmed hefny carnegie mellon university ahefny suvrit sra massachusetts institute of technology suvrit carnegie mellon university bapoczos alex smola carnegie mellon university alex abstract we study optimization algorithms based on variance reduction for stochastic gradient descent sgd remarkable recent progress has been made in this direction through development of algorithms like sag svrg saga these algorithms have been shown to outperform sgd both theoretically and empirically however asynchronous versions of these crucial requirement for modern not been studied we bridge this gap by presenting unifying framework for many variance reduction techniques subsequently we propose an asynchronous algorithm grounded in our framework and prove its fast convergence an important consequence of our general approach is that it yields asynchronous versions of variance reduction algorithms such as svrg and saga as byproduct our method achieves near linear speedup in sparse settings common to machine learning we demonstrate the empirical performance of our method through concrete realization of asynchronous svrg introduction there has been steep rise in recent work on variance reduced stochastic gradient algorithms for convex problems of the form xn min fi under strong convexity assumptions such variance reduced vr stochastic algorithms attain better convergence rates in expectation than stochastic gradient descent sgd both in theory and the key property of these vr algorithms is that by exploiting problem structure and by making suitable tradeoffs they reduce the variance incurred due to stochastic gradients this variance reduction has powerful consequences it helps vr stochastic methods attain linear convergence rates and thereby circumvents slowdowns that usually hit sgd though we should note that sgd also applies to the harder stochastic optimization problem min which need not be although these advances have great value in general for problems we still require parallel or distributed processing and in this setting asynchronous variants of sgd remain indispensable therefore key question is how to extend the synchronous vr algorithms to asynchronous parallel and distributed settings we answer one part of this question by developing new asynchronous parallel stochastic gradient methods that provably converge at linear rate for smooth strongly convex problems our methods are inspired by the influential vrg gd ag and aga family of algorithms we list our contributions more precisely below contributions our paper makes two core contributions formal general framework for variance reduced stochastic methods based on discussions in and ii asynchronous parallel vr algorithms within this framework our general framework presents formal unifying view of several vr methods it includes saga and svrg as special cases while expressing key algorithmic and practical tradeoffs concisely thus it yields broader understanding of vr methods which helps us obtain asynchronous parallel variants of vr methods under settings common to machine learning problems our parallel algorithms attain speedups that scale near linearly with the number of processors as concrete illustration we present specialization to an asynchronous method we compare this specialization with reduced asynchronous sgd methods and observe strong empirical speedups that agree with the theory related work as already mentioned 
our work is closest to and generalizes ag aga vrg and gd which are primal methods also closely related are dual methods such as sdca and finito and in its convex incarnation iso more precise relation between these dual methods and vr stochastic methods is described in defazio thesis by their algorithmic structure these vr methods trace back to classical incremental gradient algorithms but by now it is that randomization helps obtain much sharper convergence results in expectation proximal and accelerated vr methods have also been proposed we leave study of such variants of our framework as future work finally there is recent work on for problems within asynchronous sgd algorithms both parallel and distributed variants are known in this paper we focus our attention on the parallel setting different line of methods is that of primal coordinate descent methods and their parallel and distributed variants our asynchronous methods share some structural assumptions with these methods finally the recent work generalizes to the setting thereby also permitting parallel processing albeit with more synchronization and allowing only small general framework for vr stochastic methods we focus on instances of where the cost function has an gradient so that krf rf lkx yk and it is convex for all rd hrf yi kx while our analysis focuses on strongly convex functions we can extend it to just smooth convex functions along the lines of inspired by the discussion on general view of variance reduced techniques in we now describe formal general framework for variance reduction in stochastic gradient descent we denote the collection fi of functions that make up in by for our algorithm we maintain an additional parameter rd for each fi we use at to denote the general iterative framework for updating the parameters is presented as algorithm observe that the algorithm is still abstract since it does not specify the subroutine chedule pdate this subroutine determines the crucial update mechanism of and thereby of at as we will see different schedules give rise to different fast methods proposed in the literature the part of the update based on at is the key for these approaches and is responsible for variance reduction next we provide different instantiations of the framework and construct new algorithm derived from it in particular we consider incremental methods ag vrg and aga and classic gradient descent radient escent for demonstrating our framework algorithm eneric tochastic variance eduction lgorithm data rd step size randomly pick it it where it for to do update iterate as xt rfit xt rfit rfi chedule pdate xi it end return xt figure shows the schedules for the aforementioned algorithms in case of vrg chedule pdate is triggered every iterations here denotes precisely the number of inner iterations used in so at remains unchanged for the iterations and all are updated to the current iterate at the mth iteration for aga unlike vrg at changes at the tth iteration for all this change is only to single element of at and is determined by the index it the function chosen at iteration the update of ag is similar to aga insofar that only one of the is updated at each iteration however the update for is based on rather than it this results in biased estimate of the gradient unlike vrg and aga finally the schedule for gradient descent is similar to ag except that all the are updated at each iteration due to the full update we end up with the exact gradient at each iteration this discussion highlights how the scheduler 
determines the resulting gradient method. To motivate the design of another schedule, let us consider the computational and storage costs of each of these algorithms. For SVRG, since we update the anchors only after every m iterations, it is enough to store a single full gradient, and hence the storage cost is O(d); however, the running time is O(d) at each iteration and O(nd) at the end of each epoch, for calculating the full gradient. In contrast, both SAG and SAGA have a high storage cost of O(nd) and a running time of O(d) per iteration. Finally, GradientDescent has a low storage cost, since it only needs to store the current gradient at a cost of O(d), but a very high computational cost of O(nd) at each iteration. SVRG has an additional computational overhead at the end of each epoch due to the calculation of the whole gradient; this is avoided in SAG and SAGA at the cost of additional storage. When n is very large, the additional computational overhead of SVRG, amortized over all the iterations, is small; however, as we will later see, this comes at the expense of slower convergence to the optimal solution. The tradeoffs between the epoch size m, the additional storage, the frequency of updates, and the convergence to the optimal solution are still not completely resolved.

[Figure: the ScheduleUpdate functions for SVRG (top left), SAGA (top right), SAG (bottom left) and GradientDescent (bottom right). While SVRG is epoch-based, the rest of the algorithms perform updates at each iteration.]

A straightforward approach to designing a new scheduler is to combine the schedules of the above algorithms. This allows us to trade off between the various aforementioned parameters of interest. We call this schedule hybrid stochastic average gradient (HSAG). Here we use the schedules of SVRG and SAGA to develop HSAG; however, in general, the schedules of any of these algorithms can be combined to obtain a hybrid algorithm.

[Figure: ScheduleUpdate for HSAG. The algorithm assumes access to an index set S and a schedule frequency vector {s_i}.]

Consider an index set S of the indices that follow the SAGA schedule; we assume that the rest of the indices follow an SVRG-like schedule with schedule frequency s_i for each i not in S. The figure above shows the corresponding update schedule of HSAG. If S contains all indices, then HSAG is equivalent to SAGA, while at the other extreme, for an empty S and s_i = m for all i, it corresponds to SVRG. HSAG exhibits interesting storage, computational, and convergence tradeoffs that depend on S. In general, while a large cardinality of S likely incurs a high storage cost, the computational cost per iteration is relatively low. On the other hand, when the cardinality of S is small and the s_i are large, the storage cost is low but convergence typically slows down.

Before concluding our discussion of the general framework, we would like to draw the reader's attention to the advantages of studying the generic algorithm. First, note that it provides a unifying framework for many variance-reduced gradient methods proposed in the literature. Second, and more importantly, it provides a generic platform for analyzing this class of algorithms; as we will see later, this helps us develop and analyze asynchronous versions of different algorithms under a common umbrella. Finally, it provides a mechanism for deriving new algorithms by designing more sophisticated schedules; as noted above, one such construction gives rise to HSAG.

Convergence analysis. In this section we provide a convergence analysis for the generic algorithm with HSAG schedules.
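Before turning to the analysis, the following minimal Python sketch may help fix ideas about the generic loop and the ScheduleUpdate variants just described (SVRG, SAGA, and an HSAG-style hybrid). All function and variable names are illustrative rather than taken from the paper or any released code, and the average anchor gradient is recomputed naively for clarity, whereas a practical implementation would maintain it incrementally.

```python
import numpy as np

def generic_vr(grads, x0, eta, n_iters, schedule_update, seed=0):
    # grads: list of per-component gradient functions, one per f_i
    # schedule_update: callable (alpha, x, t, i) -> new list of anchor points
    rng = np.random.default_rng(seed)
    n, x = len(grads), x0.copy()
    alpha = [x0.copy() for _ in range(n)]                    # anchor points alpha_i
    for t in range(n_iters):
        i = rng.integers(n)                                  # sample a component uniformly
        avg_anchor = np.mean([grads[j](alpha[j]) for j in range(n)], axis=0)
        v = grads[i](x) - grads[i](alpha[i]) + avg_anchor    # variance-reduced gradient
        x = x - eta * v
        alpha = schedule_update(alpha, x, t, i)              # schedule-specific anchor update
    return x

def svrg_schedule(alpha, x, t, i, m=10):
    # SVRG: refresh every anchor to the current iterate once per epoch of m steps
    return [x.copy() for _ in alpha] if t % m == 0 else alpha

def saga_schedule(alpha, x, t, i):
    # SAGA: only the anchor of the sampled component moves to the current iterate
    alpha = list(alpha)
    alpha[i] = x.copy()
    return alpha

def hsag_schedule(alpha, x, t, i, S=frozenset(), s=None):
    # HSAG-style hybrid: indices in S follow the SAGA rule, the rest follow
    # an SVRG-like rule with per-index frequency s[j]
    alpha = list(alpha)
    if i in S:
        alpha[i] = x.copy()
    for j in range(len(alpha)):
        if j not in S and s is not None and t % s[j] == 0:
            alpha[j] = x.copy()
    return alpha
```

For instance, generic_vr(grads, x0, 0.1, 1000, saga_schedule) runs a SAGA-style pass, and binding m, S, and s with functools.partial yields the other schedules.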
as observed earlier vrg and aga are special cases of this setup our analysis assumes unbiasedness of the gradient estimates at each iteration so it does not encompass ag for ease of exposition we assume that all si for all since sag is our analysis focuses on the iterates obtained after each epoch similar to see option ii of vrg in our analysis will be for the case where the iterate at the end of st epoch is replaced with an element chosen randomly from xkm with probability pm for brevity we use to denote the iterate chosen at the th epoch we also need the following quantity for our analysis fi fi hrfi theorem for any positive parameters step size and epoch size we define the following quantities max suppose the probabilities pi and that step size and epoch size are chosen such that the following conditions are satisfied then for iterates of algorithm under the sag schedule we have as corollary we immediately obtain an expected linear rate of convergence for sag corollary note that and therefore under the conditions specified in theorem and with we have we emphasize that there exist values of the parameters for which the conditions in theorem and corollary are easily satisfied for instance setting and the conditions in theorem are satisfied for sufficiently large additionally in the high condition number regime of we can obtain constant say with epoch size similar to this leads to computational complexity of log for sag to achieve accuracy in the objective function as opposed to log for batch gradient descent method please refer to the appendix for more details on the parameters in theorem asynchronous stochastic variance reduction we are now ready to present asynchronous versions of the algorithms captured by our general framework we first describe our setup before delving into the details of these algorithms our model of computation is similar to the ones used in hogwild and asyscd we assume multicore architecture where each core makes stochastic gradient updates to centrally stored vector in an asynchronous manner there are four key components in our asynchronous algorithm these are briefly described below read read the iterate and compute the gradient rfit for randomly chosen it read schedule iterate read the schedule iterate and compute the gradients required for update in algorithm update update the iterate with the computed incremental update in algorithm schedule update run scheduler update for updating each processor repeatedly runs these procedures concurrently without any synchronization hence may change in between step and step similarly may change in between steps and in fact the states of iterates and can correspond to different we maintain global counter to track the number of updates successfully executed we use and to denote the particular and used for evaluating the update at the tth iteration we assume that the delay in between the time of evaluation and updating is bounded by integer and the bound on the staleness captures the degree of parallelism in the method such parameters are typical in asynchronous systems see furthermore we also assume that the system is synchronized after every epoch km for km we would like to emphasize that the assumption is not strong since such synchronization needs to be done only once per epoch for the purpose of our analysis we assume consistent read model in particular our analysis assumes that the vector used for evaluation of gradients is valid iterate that existed at some point in time such an assumption typically amounts to using 
locks in practice this problem can be avoided by using random coordinate updates as in see section of but such procedure is computationally wasteful in practice we leave the analysis of inconsistent read model as future work nonetheless we report results for both locked and implementations see section convergence analysis the key ingredients to the success of asynchronous algorithms for multicore stochastic gradient descent are sparsity and disjointness of the data matrix more formally suppose fi only depends on xp ei where ei fi acts only on the components of indexed by the set ei let denote kxj then the convergence depends on the smallest constant such that ei intuitively denotes the average frequency with which feature appears in the data matrix we are interested in situations where as warm up let us first discuss convergence analysis for asynchronous vrg the general case is similar but much more involved hence it is instructive to first go through the analysis of asynchronous vrg theorem suppose step size epoch size are chosen such that the following condition holds then for the iterates of an asynchronous variant of algorithm with vrg schedule and probabilities pi for all we have the bound obtained in theorem is useful when is small to see this as earlier consider the indicative case where the synchronous version of vrg obtains convergence rate of for step size and epoch size for the asynchronous variant of vrg by setting max we obtain similar rate with to obtain this set where max and then simple calculation gives the following max where is some constant this follows from the fact that max suppose then we can achieve nearly the same guarantees as the synchronous version but times faster since we are running the algorithm asynchronously for example consider the sparse setting where then it is possible to get near linear speedup when on the other hand when we can obtain theoretical speedup of we finally provide the convergence result for the asynchronous algorithm in the general case the proof is complicated by the fact that set unlike in vrg changes during the epoch the key idea is that only single element of changes at each iteration furthermore it can only change to one of the iterates in the epoch this control provides handle on the error obtained due to the staleness due to space constraints the proof is relegated to the appendix theorem for any positive parameters step size and epoch size we define the following quantities cl max suppose probabilities pi parameters and epoch size are chosen such that the following conditions are satisfied then for the iterates of asynchronous variant of algorithm with sag schedule we have corollary note that and therefore under the conditions specified in theorem and with we have speedup threads threads svrg locked svrg svrg locked svrg svrg locked svrg speedup svrg locked svrg speedup speedup threads threads figure logistic regression speedup curves for vrg and locked vrg on left left center right center and url right datasets we report the speedup achieved by increasing the number of threads by using step size normalized by similar to theorem and parameters similar to the ones specified after theorem we can show speedups similar to the ones obtained in theorem please refer to the appendix for more details on the parameters in theorem before ending our discussion on the theoretical analysis we would like to highlight an important point our emphasis throughout the paper was on generality while the results are presented here in full generality one 
can obtain stronger results in specific cases for example in the case of aga one can obtain per iteration convergence guarantees see rather than those corresponding to per epoch presented in the paper also aga can be analyzed without any additional synchronization per epoch however there is no qualitative difference in these guarantees accumulated over the epoch furthermore in this case our analysis for both synchronous and asynchronous cases can be easily modified to obtain convergence properties similar to those in experiments we present our empirical results in this section for our experiments we study the problem of binary classification via logistic regression more formally we are interested in the following optimization problem min log exp yi zi where zi rd and yi is the corresponding label for each in all our experiments we set note that such choice leads to high condition number careful implementation of vrg is required for sparse gradients since the implementation as stated in algorithm will lead to dense updates at each iteration for an efficient implementation scheme like the update scheme as suggested in is required due to lack of space we provide the implementation details in the appendix we evaluate the following algorithms for our experiments vrg this is the asynchronous variant of algorithm using vrg schedule all threads can read and update the parameters with any synchronization parameter updates are performed through atomic instruction constant step size that gives the best convergence is chosen for the dataset locked vrg this is the locked version of the asynchronous variant of algorithm using vrg schedule in particular we use concurrent read exclusive write locking model where all threads can read the parameters but only one threads can update the parameters at given time the step size is chosen similar to vrg gd this is the asynchronous variant of the gd algorithm see we compare two different versions of this algorithm gd with constant step size referred to as cs gd ii gd with decaying step size referred to as ds gd where constants and specify the scale and speed of decay for each of these versions step size is tuned for each dataset to give the best convergence progress time seconds svrg dsgd csgd time seconds svrg dsgd csgd time seconds objective value optimal svrg dsgd csgd objective objective value optimal objective value optimal svrg dsgd csgd time seconds figure logistic regression training loss residual versus time plot of vrg ds gd and cs gd on left left center right center and url right datasets the experiments are parallelized over cores all the algorithms were implemented in we run our experiments on datasets from libsvm similar to we normalize each example in the dataset so that kzi for all such normalization leads to an upper bound of on the lipschitz constant of the gradient of fi the epoch size is chosen as as recommended in in all our experiments in the first experiment we compare the speedup achieved by our asynchronous algorithm to this end for each dataset we first measure the time required for the algorithm to each an accuracy of the speedup with threads is defined as the ratio of the runtime with single thread to the runtime with threads results in figure show the speedup on various datasets as seen in the figure we achieve significant speedups for all the datasets not surprisingly the speedup achieved by vrg is much higher than ones obtained by locking furthermore the lowest speedup is achieved for dataset similar speedup behavior was reported for 
this dataset in it should be noted that this dataset is not sparse and hence is bad case for the algorithm similar to for the second set of experiments we compare the performance of vrg with stochastic gradient descent in particular we compare with the variants of stochastic gradient descent ds gd and cs gd described earlier in this section it is well established that the performance of variance reduced stochastic methods is better than that of gd we would like to empirically verify that such benefits carry over to the asynchronous variants of these algorithms figure shows the performance of vrg ds gd and cs gd since the computation complexity of each epoch of these algorithms is different we directly plot the objective value versus the runtime for each of these algorithms we use cores for comparing the algorithms in this experiment as seen in the figure vrg outperforms both ds gd and cs gd the performance gains are qualitatively similar to those reported in for the synchronous versions of these algorithms it can also be seen that the ds gd not surprisingly outperforms cs gd in all the cases in our experiments we observed that vrg in comparison to gd is relatively much less sensitive to the step size and more robust to increasing threads discussion future work in this paper we presented unifying framework based on that captures many popular variance reduction techniques for stochastic gradient descent we use this framework to develop simple hybrid variance reduction method the primary purpose of the framework however was to provide common platform to analyze various variance reduction techniques to this end we provided convergence analysis for the framework under certain conditions more importantly we propose an asynchronous algorithm for the framework with provable convergence guarantees the key consequence of our approach is that we obtain asynchronous variants of several algorithms like vrg aga and gd our asynchronous algorithms exploits sparsity in the data to obtain near linear speedup in settings that are typically encountered in machine learning for future work it would be interesting to perform an empirical comparison of various schedules in particular it would be worth exploring the tradeoffs of these schedules we would also like to analyze the effect of these tradeoffs on the asynchronous variants acknowledgments ss was partially supported by nsf all experiments were conducted on google compute engine machine with processors and gb ram http bibliography agarwal and bottou lower bound for the optimization of finite sums agarwal and duchi distributed delayed stochastic optimization in advances in neural information processing systems pages bertsekas and tsitsiklis parallel and distributed computation numerical methods bertsekas incremental gradient subgradient and proximal methods for convex optimization survey optimization for machine learning defazio new optimization methods for machine learning phd thesis australian national university defazio bach and saga fast incremental gradient method with support for convex composite objectives in nips pages defazio caetano and domke finito faster permutable incremental gradient method for big data problems dekel shamir and xiao optimal distributed online prediction using minibatches the journal of machine learning research ozdaglar and parrilo globally convergent incremental newton method mathematical programming johnson and zhang accelerating stochastic gradient descent using predictive variance reduction in nips pages liu and gradient 
descent in the proximal setting and gradient descent methods li andersen smola and yu communication efficient distributed machine learning with the parameter server in nips pages liu wright bittorf and sridhar an asynchronous parallel stochastic coordinate descent algorithm in icml pages liu and wright asynchronous stochastic coordinate descent parallelism and convergence properties siam journal on optimization mairal optimization with surrogate functions bertsekas and borkar distributed asynchronous incremental subgradient methods studies in computational mathematics nemirovski juditsky lan and shapiro robust stochastic approximation approach to stochastic programming siam journal on optimization nesterov efficiency of coordinate descent methods on optimization problems siam journal on optimization nitanda stochastic proximal gradient descent with acceleration techniques in nips pages recht re wright and niu hogwild approach to parallelizing stochastic gradient descent in nips pages reddi hefny downey dubey and sra descent methods with linear constraints in uai and iteration complexity of randomized descent methods for minimizing composite function mathematical programming robbins and monro stochastic approximation method annals of mathematical statistics schmidt roux and bach minimizing finite sums with the stochastic average gradient and zhang accelerated stochastic dual coordinate ascent in nips pages and zhang stochastic dual coordinate ascent methods for regularized loss the journal of machine learning research shamir and srebro on distributed stochastic optimization and learning in proceedings of the annual allerton conference on communication control and computing xiao and zhang proximal stochastic gradient method with progressive variance reduction siam journal on optimization zinkevich weimer li and smola parallelized stochastic gradient descent in nips pages 
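As a closing illustration of the four asynchronous steps described earlier (read the iterate, read the schedule iterate, update, schedule update), the sketch below shows one worker thread of an asynchronous SVRG-style epoch. The threading and the lock are illustrative stand-ins for the atomic hardware updates used in the actual implementation, and every name here is hypothetical.

```python
import threading
import numpy as np

def async_svrg_worker(x, anchor_x, avg_anchor_grad, grads, eta, steps, lock, seed):
    # One worker: read a (possibly stale) copy of the shared iterate, form the
    # variance-reduced gradient against the epoch anchor, and write the update back.
    rng = np.random.default_rng(seed)
    n = len(grads)
    for _ in range(steps):
        i = rng.integers(n)
        x_stale = x.copy()                                            # read iterate
        g = grads[i](x_stale) - grads[i](anchor_x) + avg_anchor_grad  # read schedule iterate
        with lock:                                                    # update shared iterate
            x -= eta * g                                              # (atomic in the real system)
    # the schedule update (refreshing anchor_x and avg_anchor_grad) runs
    # synchronously once per epoch, outside this worker loop

def run_epoch(x, grads, eta, steps_per_worker, n_threads=4):
    # Synchronize once per epoch: fix the anchor, then let workers run concurrently.
    anchor_x = x.copy()
    avg_anchor_grad = np.mean([g(anchor_x) for g in grads], axis=0)
    lock = threading.Lock()
    threads = [threading.Thread(target=async_svrg_worker,
                                args=(x, anchor_x, avg_anchor_grad, grads,
                                      eta, steps_per_worker, lock, s))
               for s in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return x
```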
next system for development evaluation and application of active learning kevin jamieson uc berkeley lalit jain chris fernandez nick glattard robert nowak university of wisconsin madison kjamieson ljain crfernandez glattard rdnowak abstract active learning methods automatically adapt data collection by selecting the most informative samples in order to accelerate machine learning because of this testing and comparing active learning algorithms requires collecting new datasets adaptively rather than simply applying algorithms to benchmark datasets as is the norm in passive machine learning research to facilitate the development testing and deployment of active learning for real applications we have built an software system for active learning research and experimentation the system called next provides unique platform for reproducible active learning research this paper details the challenges of building the system and demonstrates its capabilities with several experiments the results show how experimentation can help expose strengths and weaknesses of active learning algorithms in sometimes unexpected and enlightening ways introduction we use the term active learning to refer to algorithms that employ adaptive data collection in order to accelerate machine learning by adaptive data collection we mean processes that automatically adjust based on previously collected data to collect the most useful data as quickly as possible this broad notion of active learning includes bandits adaptive data collection in unsupervised learning clustering embedding etc classification regression and sequential experimental design perhaps the most familiar example of active learning arises in the context of classification there active learning algorithms select examples for labeling in sequential dataadaptive fashion as opposed to passive learning algorithms based on preselected training data the key to active learning is adaptive data collection because of this testing and comparing active learning algorithms requires collecting new datasets adaptively rather than simply applying algorithms to benchmark datasets as is the norm in passive machine learning research in this adaptive paradigm algorithm and network response time human fatigue the differing label quality of humans and the lack of responses are all concerns of implementing active learning algorithms due to many of these conditions being impossible to faithfully simulate active learning algorithms must be evaluated on real human participants adaptively collecting datasets can be difficult and as result active learning has remained largely theoretical research area and practical algorithms and experiments are few and far between most experimental work in active learning with data is simulated by letting the algorithm adaptively select small number of labeled examples from large labeled dataset this requires large labeled data set to begin with which limits the scope and scale of such experimental work also it does not address the practical issue of deploying active learning algorithms and adaptive data collection for real applications to address these issues we have built software system called next which provides unique platform for reproducible active learning research enabling web api web request queue algorithm manager getquery def getquery context get suff statistics stats from database use stats context to create query return query processanswer updatemodel thanks web api web request queue algorithm manager request query display from web api 
submit job to queue asynchronous worker accepts job request routed to proper algorithm query generated sent back to api query display returned to client answer reported to web api def processanswer query answer set query answer in database update model or enqueue job for later return thanks getquery processanswer updatemodel model update queue submit job to queue asynchronous worker accepts job answer routed to proper algorithm answer acknowledged to api answer acknowledged to client submit job to queue synchronous worker accepts job figure next active learning data flow def updatemodel get all query answer pairs use to update model to create stats set suff statistics stats in database figure example algorithm prototype that active learning researcher implements each algorithm has access to the database and the ability to enqueue jobs that are executed in order one at time machine learning researchers to easily deploy and test new active learning algorithms applied researchers to employ active learning methods for applications many of today machine learning tools such as kernel methods and deep learning have developed and witnessed adoption in practice because of both theoretical work and extensive experimentation testing and evaluation on datasets arguably some of the deepest insights and greatest innovations have come through experimentation the goal of next is to enable the same sort of experimentation for active learning we anticipate this will lead to new understandings and breakthroughs just as it has for passive learning this paper details the challenges of building active learning systems and our solution the next we also demonstrate the system capabilities for real active learning experimentation the results show how experimentation can help expose strengths and weaknesses of active learning algorithms in sometimes unexpected and enlightening ways what next at the heart of active learning is process for sequentially and adaptively gathering data most informative to the learning task at hand as quickly as possible at each step an algorithm must decide what data to collect next the data collection itself is often from human helpers who are asked to answer queries label instances or inspect data crowdsourcing platforms such as amazon mechanical turk or crowd flower provide access to potentially thousands of users answering queries parallel data collection at large scale imposes design and engineering challenges unique to active learning due to the continuous interaction between data collection and learning here we describe the main features and contributions of the next system system functionality data flow diagram for next is presented in figure consider an individual client among the crowd tasked with answering series of classification questions the client interacts with next through website which requests new query to be presented from the next web api tasked with potentially handling thousands of such requests simultaneously the api will enqueue the query request to be processed by worker pool workers can be thought of as processes living on one or many machines pulling from the same queue once the job is accepted by worker it is routed through the worker algorithm manager described in the extensibility section below and the algorithm then chooses query based on previously collected data and sufficient statistics the query is then sent back to the client through the api to be displayed after the client answers the query the same initial process as above is repeated but this time 
the answer is routed to the processanswer endpoint of the algorithm since multiple users are getting the next system is open source and available https queries and reporting answers at the same time there is potential for two different workers to attempt to update the model or the statistics used to generate new queries at the same time and potentially overwrite each other work simple way next avoids this race condition is to provide locking queue to each algorithm so that when worker accepts job from this queue the queue is locked until that job is finished hence when the answer is reported to the algorithm the processanswer code block may either update the model asynchronously itself or submit modelupdate job to this locking model update queue to process the answer later synchronously see section for details after processing the answer the worker returns an acknowledgement response to the client of this data flow next handles the api enqueueing and scheduling of jobs and algorithm management the researcher interested in deploying their algorithm is responsible for implementing getquery processanswer and updatemodel figure shows for the functions that must be implemented for each algorithm in the next system see the supplementary materials for an explicit example involving an active svm classifier key challenge here is latency getquery request uses the current learned model to decide what queries to serve next humans will notice delays greater than roughly ms therefore it is imperative that the system can receive and process response update the model and select the next query within accounting for ms in communication latency each way the system must perform all necessary processing within ms while in some applications one can compute good queries offline and serve them as needed without further computation other applications such as contextual bandits for personalized content recommendation require that the query depend on the context provided by the user their cookies and consequently must be computed in real time realtime computing research in active learning focuses on reducing the sample complexity of the learning process minimizing number of labeled and unlabeled examples needed to learn an accurate model and sometimes addresses the issue of computational complexity in the latter case the focus is usually on algorithms but not necessarily realtime algorithms practical active learning systems face tradeoff between how frequently models are updated and how carefully new queries are selected if the model is updated less frequently then time can be spent on carefully selecting batch of new queries however selecting in large batches may potentially reduce some of the gains afforded by active learning since later queries will be based on old stale information updating the model frequently may be possible but then the time available for selecting queries may be very short resulting in suboptimal selections and again potentially defeating the aim of active learning managing this tradeoff is the chief responsibility of the algorithm designer but to make these design choices the algorithm designer must be able to easily gauge the effects of different algorithmic choices in the next system the tradeoff is explicitly managed by modifying when and how often the updatemodel command is run and what it does the system helps with making these decisions by providing extensive dashboards describing both the statistical and computational performance of the algorithms reproducible research publishing 
data and software needed to reproduce experimental results is essential to scientific progress in all fields. Due to the adaptive nature of data collection in active learning experiments, it is not enough to simply publish data gathered in a previous experiment: for other researchers to recreate the experiment, they must also be able to reconstruct the exact adaptive process that was used to collect the data. This means that the complete system, including any web-facing crowdsourcing tools, not just the algorithm code and data, must be made publicly available and easy to use. By leveraging cloud computing, NEXT abstracts away the difficulties of building a data collection system and lets the researcher focus on active learning algorithm design. Any other researcher can replicate an experiment with just a few keystrokes in under one hour by simply using the same experiment initialization parameters.

Expert data collection for the non-expert. NEXT puts active learning algorithms in the hands of non-experts interested in collecting data in more efficient ways. This includes psychologists, social scientists, biologists, security analysts, and researchers in any other field in which large amounts of data are collected, sometimes at large dollar cost and time expense. Choosing an appropriate active learning algorithm is perhaps an easier step for such users compared to data collection: while there exist excellent tools, such as PsiTurk or AutoMan, to help researchers perform relatively simple experiments on Mechanical Turk, implementing active learning to collect data requires building a sophisticated system like the one described in this paper. To determine the needs of potential users, the NEXT system was built in close collaboration with cognitive scientists at our home institution; they helped inform design decisions and provided us with participants to test the system in a real-world environment. Indeed, the examples used in this paper were motivated by related studies developed by our collaborators in psychology. NEXT is accessible through a REST web API and can be easily deployed in the cloud with minimal knowledge and expertise using automated scripts. NEXT provides researchers with a set of example templates and widgets that can be used as graphical user interfaces to collect data from participants (see the supplementary materials for examples).

Multiple algorithms and extensibility. NEXT provides a platform for applications and algorithms. Applications are general active learning tasks, such as linear classification, and algorithms are particular implementations for that application, for example random sampling or uncertainty sampling with an SVM. Experiments involve one application type, but they may involve several different algorithms, enabling the evaluation and comparison of different algorithms. The algorithm manager shown in the data-flow figure is responsible for routing each query and reported answer to the algorithms involved in an experiment. For experiments involving multiple algorithms, this routing could be randomized or optimized in a more sophisticated manner; for example, it is possible to implement a bandit algorithm inside the algorithm manager in order to select algorithms adaptively so as to minimize some notion of regret. Each application defines an algorithm management module and a contract for the three functions of active learning, getQuery, processAnswer, and updateModel, as described in the algorithm prototype figure. Each algorithm implemented in NEXT gains access to a locking synchronous queue for model updates, logging functionality, automated dashboards for performance statistics and timing, load balancing, and graphical user interfaces for participants. A minimal sketch of what such an algorithm implementation can look like is given below.
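The sketch that follows makes the contract concrete with a toy algorithm implementing getQuery, processAnswer, and updateModel for a pairwise comparison (dueling bandit) task, using Thompson sampling on Borda-score estimates. The in-memory dictionary stands in for the NEXT database interface, and every name apart from the three contract functions (including initExp) is a hypothetical choice for this example rather than part of the NEXT codebase.

```python
import numpy as np

class BordaThompsonSampling:
    """Toy dueling-bandit algorithm: choose caption pairs by Thompson sampling
    on Borda scores estimated from pairwise 'which is funnier?' answers."""

    def initExp(self, db, n_arms):
        # sufficient statistics stored in a plain dict standing in for the database
        db['wins'] = np.ones(n_arms)      # Beta(1, 1) prior on each Borda score
        db['losses'] = np.ones(n_arms)

    def getQuery(self, db, context=None):
        # Thompson-sample a Borda score per arm, then duel the sampled best arm
        # against a uniformly random opponent (the Borda reduction)
        scores = np.random.beta(db['wins'], db['losses'])
        left = int(np.argmax(scores))
        others = [a for a in range(len(scores)) if a != left]
        right = int(np.random.choice(others))
        return left, right

    def processAnswer(self, db, query, answer):
        # answer is the index of the caption the participant judged funnier;
        # only the left arm's Borda estimate is updated under the reduction
        left, right = query
        if answer == left:
            db['wins'][left] += 1
        else:
            db['losses'][left] += 1
        # a heavier algorithm could instead enqueue a job for updateModel
        # on the locking model-update queue

    def updateModel(self, db):
        # nothing to recompute for this simple algorithm; present to satisfy the contract
        pass
```

Queuing, routing of queries and answers, dashboards, and participant-facing interfaces would all be handled by the system around such a class.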
implement new algorithm developer must write the associated getquery processanswer and updatemodel functions in python see examples in supplementary materials the rest is handled automatically by next we hope this ease of use will encourage researchers to experiment with and compare new active learning algorithms next is hosted on github and we urge users to push their local application and algorithm implementations to the repository example applications next is capable of hosting any active or passive learning application to demonstrate the capabilities of the system we look at two applications motivated by cognitive science studies the collected raw data along with instructions to easily reproduce these examples which can be used as templates to extend are available on the next project page pure exploration for dueling bandits the first experiment type we consider is problem in the dueling bandits framework based on the new yorker caption each week new yorker readers are invited to submit captions for cartoon and winner is picked from among these entries we used dataset from the contest for our experiments participants in our experiment are shown cartoon along with two captions each participant task is to pick the caption they think is the funnier of the two this is repeated with many caption pairs and different participants the objective of the learning algorithm is to determine which caption participants think is the funniest overall as quickly as possible using as few comparative judgments as possible in our experiments we chose an arbitrary cartoon and arbitrary captions from curated set from the new yorker dataset the cartoon and all captions can be found in the supplementary materials the number of captions was limited to primarily to keep the experimental dollar cost reasonable but the next system is capable of handling arbitrarily large numbers of captions arms and duels dueling bandit algorithms there are several notions of best arm in the dueling bandit framework including the condorcet copeland and borda criteria we focus on the borda criterion in this experiment for two reasons first algorithms based on the condorcet or copeland criterion we thank bob mankoff cartoon editor of the new yorker for sharing the cartoon and caption data used in our experiments caption plurality vote thompson ucb successive elim random beat the mean my last of the last the women do you eve drowni think of they promi want to ge table dueling bandit results for identifying the funniest caption for new yorker cartoon darker shading corresponds to an algorithm of its predictions for the winner generally require sampling all possible pairs of multiple times algorithms based on the borda criterion do not necessarily require such exhaustive sampling making them more attractive for problems second one can reduce dueling bandits with the borda criterion to the standard bandit problem using scheme known as the borda reduction br allowing one to use number of and tested bandit algorithms the algorithms considered in our experiment are random uniform sampling with br successive elimination with br ucb with br thompson sampling with br and beat the mean which was originally designed for identifying the condorcet winner see the supplementary materials for more implementation details experimental setup and results we posted next tasks to mechanical turk each of which asked unique participant to make comparison judgements for for each comparative judgment one of the five algorithms was chosen uniformly at random 
to select the caption pair and the participant decision was used to update that algorithm only each algorithm ranked the captions in order of the empirical borda score estimates except the beat the mean algorithm which used its modified borda score compare the quality of these results we collected data in two different ways first we took union of the captions from each algorithm resulting in top captions and asked different set of participants to vote for the funniest of these one vote per participant we denote this the plurality vote ranking the number of captions shown to each participant was limited to for practical reasons display voting ease the results of the experiment are summarized in table each row corresponds to one of the top captions and the columns correspond to different algorithms each table entry is the borda score estimated by the corresponding algorithm followed by bound on its standard deviation the bound is based on bernoulli model for the responses and is simply where is the number judgments collected for the corresponding caption which depends on the algorithm the relative ranking of the scores is what is relevant here but the uncertainties given an indication of each algorithm certainty of these scores in the table each algorithm best guess at the funniest captions are highlighted in decreasing order with darker to lighter shades overall the predicted captions of the algorithms which generally optimize for the borda criterion appear to be in agreement with the result of the plurality vote one thing that should be emphasized is that the uncertainty standard deviation of the top arm scores of thompson sampling and ucb is about half the uncertainty observed for the top three arms of the other methods which suggests that these algorithms can provide confident answers with of the samples needed by other algorithms this is the result of thompson sampling and ucb being more aggressive and adaptive early on compared to the other methods and therefore we recommend them for applications of this sort we conclude that thompson sampling and ucb perform best for this application and require significantly fewer samples than random sampling or other bandit algorithms the results of replication of this study can be found in the supplementary materials from which the same conclusions can be made active multidimensional scaling finding representations is fundamental problem in machine learning multidimensional scaling nmds is classic technique that embeds set of items into dimensional metric space so that distances in that space predict given set of human judgments of the form is closer to than this learning task is more formidable than the dueling bandits problem in number of ways providing an interesting contrast in terms of demands and tradeoffs in the system first nmds involves triples of items rather than pairs posing greater challenges to scalability second updating the embedding as new data are collected is much more computationally intensive which makes managing the tradeoff between updating the embedding and carefully selecting new queries highly formally the nmds problem is defined as follows given set comparative judgments and an embedding dimension the ideal goal is to identify set of points xn rd such that xk xk if is closer to than is one of the given comparative judgments in situations where no embedding exists that agrees with all of the judgments in the goal is to find an embedding that agrees on as many judgements as possible active learning can be used to accelerate 
this learning process as follows once an embedding is found based on subset of judgments the relative locations of the objects at least at coarse level are constrained consequently many other judgments not yet collected can be predicted from the coarse embedding while others are still highly uncertain the goal of active learning algorithms in this setting is to adaptively select as few triplet queries is closer to or as possible in order to identify the structure of the embedding active sampling algorithms nmds is usually posed as an optimization problem note that xk xk xti xi xk xtj xj xk hxx hi where xn and hi is an matrix except for the defined on the indices this suggests an optimizah tion hxx hi where in the literature has taken the form of or general negative of probabilistic model one may also recognize the similarity of this optimization problem with that of learning linear classifier here xx plays the role of hyperplane and the hi matrices are labeled examples indeed we apply active learning approaches developed for linear classifiers like uncertainty sampling to nmds two active learning algorithms have been proposed in the past for this specific application here we consider four data collection methods inspired by these past works passive uniform random sampling uncertainty sampling based off an embedding discovered by minimizing objective approximate maximum information gain sampling using the crowd kernel approach in and approximate maximum information gain sampling using the distribution in care was taken in the implementations to make these algorithms perform as well as possible in realtime environment and we point the interested reader to the supplementary materials and source code for details experimental setup and results we have used next for nmds in many applications including embedding faces words numbers and images here we focus on particular set of synthetic shape images that can be found in the supplementary materials each shape can be represented by single parameter reflecting it smoothness so an accurate embedding should recover one dimensional manifold the dataset consists of shapes selected uniformly from the manifold and so in total there are unique triplet queries that could be asked for all algorithms we set for embedding although we hope to see the intrinsic manifold in the result each participant was asked to answer triplet queries and total participants contributed to the experiment for each query an algorithm was chosen uniformly at random from the union of the set of algorithms plus an additional random algorithm whose queries were used for the set consequently each algorithm makes approximately queries we consider three different algorithms for generating embeddings from triplets embedding that minimizes hinge loss that we call hinge the crowd kernel embedding with that we call ck and and the embedding we wish to evaluate the sampling strategies not the embedding strategies so we apply each embedding strategy to each sampling procedure described above to evaluate the algorithms we sort the collected triplets for each algorithm by timestamp and then every triplets we compute an embedding using that algorithm answers and each strategy for figure top stimuli used for the experiment left triplet prediction error right nearest neighbor prediction accuracy identifying an embedding in figure the left panel evaluates the triplet prediction performance of the embeddings on the entire collected set of triplets while the right panel evaluates the nearest neighbor 
prediction of the algorithms for each point in the embedding we look at its true nearest neighbor on the manifold in which the data was generated and we say an embedding accurately predicts the nearest neighbor if the true nearest neighbor is within the top three nearest neighbors of the point in the embedding we then average over all points because this is just single trial the evaluation curves are quite rough the results of replication of this experiment can be found in the supplementary materials from which the same conclusions are made we conclude across both metrics and all sampling algorithms the embeddings produced by minimizing hinge loss do not perform as well as those from crowd kernel or in terms of predicting triplets the experiment provides no evidence that the active selection of triplets provides any improvement over random selection in terms of nearest neighbor prediction uncertaintysampling may have slight edge but it is very difficult to make any conclusions with certainty as in all deployed active learning studies we can not rule out the possibility that it is our implementation that is responsible for the disappointment and not the algorithms themselves however we note that when simulating human responses with bernoulli noise under similar load conditions uncertainty sampling outperformed all other algorithms by measurable margin leading us to believe that these active learning algorithms may not be robust to human feedback to sum up in this application there is no evidence for gains from adaptive sampling but crowd kernel and do appear to provide slightly better embeddings than the hinge loss optimization but we caution that this is but single possibly unrepresentative datapoint implementation details of next the entire next system was designed with machine learning researchers and practitioners in mind rather than engineers with deep systems background next is almost completely written in python but algorithms can be implemented in any programming language and wrapped in python wrapper we elected to use variety of startup scripts and dockerfor deployment to automate the provisioning process and minimize configuration issues details on specific software packages used can be found in the supplementary materials many components of next can be scaled to work in distributed environment for example serving many near simultaneous getquery requests is straightforward one can simply enlarge the pool of workers by launching additional slave machines and point them towards the web request queue just like typical are scaled this approach to scaling active learning has been studied rigorously processing tasks such as data fitting and selection can also be accelerated using standard distributed platforms see next section more challenging scaling issue arises in the learning process active learning algorithms update models sequentially as data are collected and the models guide the selection of new data recall that this serial process is handled by model update queue when worker accepts job from the queue the queue is locked until that job is finished the processing times required for model fitting and data selection introduce latencies that may reduce possible speedups afforded by active learning compared to passive learning since the rate of getquery requests could exceed the processing rate of the learning algorithm if the number of algorithms running in parallel outnumber the number of workers dedicated to serving the synchronous locking model update queues performance can be 
improved by adding more slave machines and thus workers to process the queues simulating load with and inspecting the provided dashboards on next of cpu memory queue size etc makes deciding the number of machines for an expected load straightforward task an algorithm in next could also bypass the locking synchronous queue by employing asynchronous schemes like directly in processanswer this could speed up processing through parallelization but could reduce active learning speedups since workers may overwrite the previous work of others related work and discussion there have been some examples of deployed active learning with human feedback for human perception interactive search and citation screening and in particular by research groups from industry for web content personalization and contextual advertising however these remain special purpose implementations while the proposed next system provides flexible and active learning platform that is versatile enough to develop test and field any of these specific applications moreover previous deployments have been difficult to replicate next could have profound effect on research reproducibility it allows anyone to easily replicate past and future algorithm implementations experiments and applications there exist many sophisticated libraries and systems for performing machine learning at scale vowpal wabbit mllib oryx and graphlab are all excellent examples of software systems designed to perform inference tasks like classification regression or clustering at enormous scale many of these systems are optimized for operating on fixed static dataset making them incomparable to next but some like vowpal wabbit have some active learning support the difference between these systems and next is that their goal was to design and implement the best possible algorithms for very specific tasks that will take the fullest advantage of each system own capabilities these systems provide great libraries of machine learning tools whereas next is an experimental platform to develop test and compare active learning algorithms and to allow practitioners to easily use active learning methods for data collection in the space there exist excellent tools like psiturk automan and crowdflower that provide functionality to simplify various aspects of crowdsourcing including automated task management and quality assurance controls while successful in this aim these crowd programming libraries do not incorporate the necessary infrastructure for deploying active learning across participants or adaptive data acquisition strategies next provides unique platform for developing active crowdsourcing capabilities and may play role in optimizing the use of humancomputational resources like those discussed in finally while systems like oryx and velox that leverage apache spark are made for deployment on the web and model serving they were designed for very specific types of models that limit their versatility and applicability they were also built for an audience with greater familiarity with systems and understandably prioritize computational performance over for example the it might take cognitive scientist or active learning theorist to figure out how to actively crowdsource large study using amazon mechanical turk at the time of this submission next has been used to ask humans hundreds of thousands of actively selected queries in ongoing cognitive science studies working closely with cognitive scientists who relied on the system for their research helped us make next 
predictable reliable easy to use and we believe ready for everyone references gureckis martin mcdonnell et al psiturk an framework for conducting replicable behavioral experiments online in press daniel barowy charlie curtsinger emery berger and andrew mcgregor automan platform for integrating and digital computation acm sigplan notices yisong yue josef broder robert kleinberg and thorsten joachims the dueling bandits problem journal of computer and system sciences tanguy urvoy fabrice clerot raphael and sami naamane generic exploration and voting bandits in proceedings of the international conference on machine learning kevin jamieson sumeet katariya atul deshpande and robert nowak sparse dueling bandits aistats eyal shie mannor and yishay mansour action elimination and stopping conditions for the bandit and reinforcement learning problems the journal of machine learning research kevin jamieson matthew malloy robert nowak and bubeck lil ucb an optimal exploration algorithm for bandits in proceedings of the conference on learning theory olivier chapelle and lihong li an empirical evaluation of thompson sampling in advances in neural information processing systems pages yisong yue and thorsten joachims beat the mean bandit in proceedings of the international conference on machine learning pages sameer agarwal josh wills lawrence cayton et al generalized multidimensional scaling in international conference on artificial intelligence and statistics pages omer tamuz ce liu ohad shamir adam kalai and serge belongie adaptively learning the crowd kernel in proceedings of the international conference on machine learning laurens van der maaten and kilian weinberger stochastic triplet embedding in machine learning for signal processing mlsp ieee international workshop on pages ieee burr settles active learning literature survey university of wisconsin madison kevin jamieson and robert nowak embedding using adaptively selected ordinal data in allerton conference on communication control and computing alekh agarwal leon bottou miroslav dudik and john langford learning arxiv preprint benjamin recht christopher re stephen wright and feng niu hogwild approach to parallelizing stochastic gradient descent in advances in neural information processing systems all our ideas http accessed byron wallace kevin small carla brodley joseph lau and thomas trikalinos deploying an interactive machine learning system in an practice center abstrackr in proceedings of the acm sighit international health informatics symposium pages acm kyle ambert aaron cohen gully apc burns eilis boudreau and kemal sonmez virk an active system for bootstrapping knowledge base development in the neurosciences frontiers in neuroinformatics lihong li wei chu john langford and robert schapire approach to personalized news article recommendation in international conference on world wide web acm deepak agarwal chen and pradheep elango schemes for web content optimization in data mining icdm ninth ieee international conference on ieee alekh agarwal olivier chapelle miroslav and john langford reliable effective terascale linear learning system the journal of machine learning research xiangrui meng et al mllib machine learning in apache spark oryx https accessed yucheng low danny bickson joseph gonzalez et al distributed graphlab framework for machine learning and data mining in the cloud proceedings of the vldb endowment crowdflower http accessed seungwhan moon and jaime carbonell proactive learning with multiple labelers in data science and 
Advanced Analytics (DSAA), International Conference on. IEEE.
Daniel Crankshaw, Peter Bailis, Joseph Gonzalez, et al. The missing piece in complex analytics: low latency, scalable model management and serving with Velox. CIDR.
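As a concrete illustration of the triplet objective discussed in the active multidimensional scaling section above, the following sketch fits an embedding by gradient descent on a hinge loss over triplet judgments of the form "item i is closer to j than to k". This is a minimal sketch only: the function names, the margin, the step size, and the plain batch-gradient loop are assumptions made for the example, not the implementation used in NEXT or in the hinge/crowd-kernel embedding code referenced in the supplementary materials.

import numpy as np

def hinge_triplet_gradient(X, triplets, margin=1.0):
    # Sub-gradient of the hinge loss over triplets (i, j, k), read as
    # "item i is closer to item j than to item k".  The margin is an
    # illustrative choice for this sketch.
    grad = np.zeros_like(X)
    loss = 0.0
    for i, j, k in triplets:
        d_ij = np.sum((X[i] - X[j]) ** 2)
        d_ik = np.sum((X[i] - X[k]) ** 2)
        violation = margin + d_ij - d_ik
        if violation > 0:                   # triplet violated or inside the margin
            loss += violation
            grad[i] += 2.0 * (X[k] - X[j])  # d/dX[i] of (d_ij - d_ik)
            grad[j] += 2.0 * (X[j] - X[i])  # d/dX[j] of d_ij
            grad[k] += 2.0 * (X[i] - X[k])  # d/dX[k] of -d_ik
    return loss, grad

def fit_embedding(n_items, triplets, d=2, step=0.01, iters=300, seed=0):
    # Plain (batch) gradient descent; a deployed system would instead
    # update the embedding incrementally as new triplets arrive.
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=0.1, size=(n_items, d))
    for _ in range(iters):
        _, grad = hinge_triplet_gradient(X, triplets)
        X -= step * grad
    return X

In an active scheme such as uncertainty sampling, the same violation quantity computed above would be evaluated on candidate (not yet asked) triplets, and the query whose margin is closest to zero under the current embedding would be posed to the next participant.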
off the grid qingqing huang mit eecs lids qqh sham kakade university of washington department of statistics computer science engineering sham abstract is the problem of recovering superposition of point sources using bandlimited measurements which may be corrupted with noise this signal processing problem arises in numerous imaging problems ranging from astronomy to biology to spectroscopy where it is common to take coarse fourier measurements of an object of particular interest is in obtaining estimation procedures which are robust to noise with the following desirable statistical and computational properties we seek to use coarse fourier measurements bounded by some cutoff frequency we hope to take quantifiably small number of measurements we desire our algorithm to run quickly suppose we have point sources in dimensions where the points are separated by at least from each other in euclidean distance this work provides an algorithm with the following favorable guarantees the algorithm uses fourier measurements whose frequencies are bounded by up to log factors previous algorithms require cutoff frequency which may be as large as the number of measurements taken by and the computational complexity of our algorithm are bounded by polynomial in both the number of points and the dimension with no dependence on the separation in contrast previous algorithms depended inverse polynomially on the minimal separation and exponentially on the dimension for both of these quantities our estimation procedure itself is simple we take random bandlimited measurements as opposed to taking an exponential number of measurements on the hypergrid furthermore our analysis and algorithm are elementary based on concentration bounds for sampling and the singular value decomposition introduction we follow the standard mathematical abstraction of this problem candes consider signal modeled as weighted sum of dirac measures in rd wj where the point sources the are in rd assume that the weights wj are complex valued whose absolute values are lower and upper bounded by some positive constant assume that we are given the number of point an upper bound of the number of point sources suffices define the measurement function rd to be the convolution of the point source with point spread function as below dt wj in the noisy setting the measurements are corrupted by uniformly bounded perturbation fe suppose that we are only allowed to measure the signal by evaluating the measurement function fe at any rd and we want to recover the parameters of the point source signal wj we follow the standard normalization to assume that let wmin minj denote the minimal weight and let sources defined as follows kµ be the minimal separation of the point where we use the euclidean distance between the point sources for ease of these quantities are key parameters in our algorithm and analysis intuitively the recovery problem is harder if the minimal separation is small and the minimal weight wmin is small the first question is that given exact measurements namely where and how many measurements should we take so that the original signal can be exactly recovered definition exact recovery in the exact case we say that an algorithm achieves exact recovery with measurements of the signal if upon input of these measurements the algorithm returns the exact set of parameters wj moreover we want the algorithm to be measurement noise tolerant in the sense that in the presence of measurement noise we can still recover good estimates of the point 
sources definition stable recovery in the noisy case we say that an algorithm achieves stable recovery with measurements of the signal if upon input of these measurements the algorithm returns estimates bj such that min max kb poly where the min is over permutations on and poly is polynomial function in and by definition if an algorithm achieves stable recovery with measurements it also achieves exact recovery with these measurements the terminology of is appropriate due to the following remarkable result in the noiseless case of donoho suppose we want to accurately recover the point sources to an error of where naively we may expect to require measurements whose frequency depends inversely on the desired the accuracy donoho showed that it suffices to obtain finite number of measurements whose frequencies are bounded by in order to achieve exact recovery thus resolving the point sources far more accurately than that which is naively implied by using frequencies of furthermore the work of candes showed that stable recovery in the univariate case is achievable with cutoff frequency of using convex program and number of measurements whose size is polynomial in the relevant quantities our claims hold withut using the wrap around metric as in due to our random sampling also it is possible to extend these results for the case cutoff freq measurements runtime cutoff freq sdp log log poly cd mp ours log measurements log log kd poly log runtime log table see section for description see lemma for details about the cutoff frequency here we are implicitly using notation this work we are interested in stable recovery procedures with the following desirable statistical and computational properties we seek to use coarse low frequency measurements we hope to take quantifiably small number of measurements we desire our algorithm run quickly informally our main result is as follows theorem informal statement of theorem for fixed probability of error the proposed algorithm achieves stable recovery with number of measurements and with computational runtime that are both on the order of log furthermore the algorithm makes measurements which are bounded in frequency by ignoring log factors notably our algorithm and analysis directly deal with the multivariate case with the univariate case as special case importantly the number of measurements and the computational runtime do not depend on the minimal separation of the point sources this may be important even in certain low dimensional imaging applications where taking physical measurements are costly indeed superresolution is important in settings where is small furthermore our technical contribution of how to decompose certain tensor constructed with fourier measurements may be of broader interest to related questions in statistics signal processing and machine learning comparison to related work table summarizes the comparisons between our algorithm and the existing results the multidimensional cutoff frequency we refer to in the table is the maximal entry of any measurement frequency sdp refers to the semidefinite programming sdp based algorithms of candes in the univariate case the number of measurements can be reduced by the method in tang et al this is reflected in the table mp refers to the matrix pencil type of methods studied in and for the univariate case here we are defining the infinity norm separation as kµ which is understood as the wrap around distance on the unit circle cd is problem dependent constant discussed below observe the following 
differences between our algorithm and prior work our minimal separation is measured under the instead of the infinity norm as in the sdp based algorithm note that depends on the coordinate system in the worst case it can underestimate the separation by factor namely the computation complexity and number of measurements are polynomial in dimension and the number of point sources and surprisingly do not depend on the minimal separation of the point sources intuitively when the minimal separation between the point sources is small the problem should be harder this is only reflected in the sampling range and the cutoff frequency of the measurements in our algorithm furthermore one could project the multivariate signal to the coordinates and solve multiple univariate problems such as in which provided only exact recovery results naive random projections would lead to cutoff frequency of sdp approaches the work in formulates the recovery problem as minimization problem they then show the dual problem can be formulated as an sdp they focused on the analysis of and only explicitly extend the proofs for for theorems see suggest that cd the number of measurements can be reduced by the method in for the case which is noted in the table their method uses sampling off the grid technically their sampling scheme is actually sampling random points from the grid though with far fewer measurements matrix pencil approaches the matrix pencil method music and prony method are essentially the same underlying idea executed in different ways the original prony method directly attempts to find roots of high degree polynomial where the root stability has few guarantees other methods aim to robustify the algorithm recently for the univariate matrix pencil method liao fannjiang and moitra provide stability analysis of the music algorithm moitra studied the optimal relationship between the cutoff frequency and showing that if the cutoff frequency is less than then stable recovery is not possible with matrix pencil method with high probability notation let and to denote real complex and natural numbers for denotes the set for set denotes its cardinality we use to denote the direct sum of sets namely let en to denote the standard basis vector in rd for let rd to denote the of radius in the standard euclidean space denote the condition number of matrix as max max and min are the maximal and minimal singular values of min where we use to denote tensor product given matrices the tensor product pk is equivalent to another view of tensor is that it defines mapping for given dimension ma mb mc the mapping cma is defined as xa xb xc xa xb xc in particular for cm we use to denote the projection of tensor along the dimension note that if the tensor admits decomposition it is straightforward to verify that adiag it is that if the factors have full column rank then the rank decomposition is unique up to and common column permutation moreover if the condition number of the factors are upper bounded by positive constant then one can compute the unique tensor decomposition with stability guarantees see for review lemma herein provides an explicit main results the algorithm we briefly describe the steps of algorithm below take measurements given positive numbers and randomly draw sampling set of samples of the gaussian distribution form the set ed rd denote take another independent random sample from the unit sphere and define input noisy measurement function fe output estimates bj take measurements let be samples from the gaussian 
distribution set en for all and denote take another random samples from the unit sphere and set and construct tensor fe cm fen fe tensordecomp fe tensor decomposition set vbs for set vs vs vbs read of estimates for set real log vbs arg minw kfb set vbs vbs vbd dw kf algorithm general algorithm construct the order tensor fe cm with noise corrupted measurements fe evaluated at the points in arranged in the following way fe tensor decomposition define the characteristic matrix vs to be vs and define matrix cm to be vs vs vd where vd is defined in define note that in the exact case the tensor constructed in admits decomposition vs vs dw assume that has full column rank then this tensor decomposition is unique up to column permutation and rescaling with very high probability over the randomness of the random unit vector since each element of vs has unit norm and we know that the last row of vs and the last row of are all ones there exists proper scaling so that we can uniquely recover wj and columns of vs up to common permutation in this paper we adopt jennrich algorithm see algorithm for tensor decomposition other algorithms for example tensor power method and recursive projection which are possibly more stable than jennrich algorithm can also be applied here read off estimates let log vd denote the logarithm of vd the estimates of the point sources are given by log input tensor fe rank output factor vb pb with the leading singular values compute the truncated svd of fe fe pb pb set and set be the eigenvectors of corresponding to the eigenvalues let the columns of with the largest absolute value set vb mpbu algorithm tensordecomp remark in the toy example the simple algorithm corresponds to using the sampling set ed the conventional univariate matrix pencil method corresponds to using the sampling set and the set of measurements corresponds to the grid guarantees in this section we discuss how to pick the two parameters and and prove that the proposed algorithm indeed achieves stable recovery in the presence of measurement noise theorem stable recovery there exists universal constant such that the following holds fix pick such that for pick max log log for pick log assume the bounded measurement noise model as in and that wmin with probability at least over the random sampling of and with probability at least over the random projections in algorithm the proposed algorithm returns an estimation of the pk point source signal bj bµ with accuracy min max kb dk wmax wmin where the min is over permutations on moreover the proposed algorithm has time complexity in the order of the next lemma shows that essentially with overwhelming probability all the frequencies taken concentrate within the with cutoff frequency on each coordinate where is comparable to lemma the cutoff frequency for with high probability all of the sampling frequencies in satisfy that ks kp where the cutoff frequency is given by log md for case the cutoff frequency can be made to be in the order of remark failure probability overall the failure probability consists of two pieces for random projection of and for random sampling to ensure the bounded condition number of vs this may be boosed to arbitrarily high probability through repetition key lemmas stability of tensor decomposition in this paragraph we give brief description and the stability guarantee of the jennrich algorithm for low rank order tensor decomposition we only state it for the symmetric tensors as appeared in the proposed algorithm consider tensor dw where the 
factor has full column rank then the decomposition is unique up to column permutation and rescaling and algorithm finds the factors efficiently moreover the is stable if the factor is and the eigenvalues of fa are well separated lemma stability of jennrich algorithm consider the order tensor dw of rank constructed as in step in algorithm given tensor fe that is close to namely for all wmin fn and assume that the noise is small use fe as the input dkwmax to algorithm with probability at least and we can over the random projections bound the distance between columns of the output vb and that of by dk wmax min max kvj wmin where is universal constant condition number of vs the following lemma is helpful lemma let vs be the factor as defined in recall that vs vs vd where vd is defined in and vs is the characteristic matrix defined in we can bound the condition number of vs by vs condition number of the characteristic matrix vs therefore the stability analysis of the proposed algorithm boils down to understanding the relation between the random sampling set and the condition number of the characteristic matrix vs this is analyzed in lemma main technical lemma lemma for any fixed number consider apgaussian vector with distribution log log where for and for define the hermitian random matrix xs cherm to be xs we can bound the spectrum of es xs by es xs lemma main technical lemma in the same setting of lemma let be independent samples of the gaussian vector for log ks with probability at least over the random sampling the condition number of the factor vs is bounded by vs acknowledgments the authors thank rong ge and ankur moitra for very helpful discussions sham kakade acknowledges funding from the washington research foundation for innovation in discovery references anandkumar ge hsu kakade and telgarsky tensor decompositions for learning latent variable models the journal of machine learning research anandkumar hsu and kakade method of moments for mixture models and hidden markov models arxiv preprint and from noisy data journal of fourier analysis and applications and towards mathematical theory of communications on pure and applied mathematics chen and chi robust spectral compressed sensing via structured matrix completion information theory ieee transactions on dasgupta learning mixtures of gaussians in foundations of computer science annual symposium on pages ieee dasgupta and gupta an elementary proof of theorem of johnson and lindenstrauss random structures and algorithms dasgupta and schulman variant of em for gaussian mixtures in proceedings of the sixteenth conference on uncertainty in artificial intelligence pages morgan kaufmann publishers donoho superresolution via sparsity constraints siam journal on mathematical analysis framework for phd thesis stanford university harshman foundations of the parafac procedure models and conditions for an explanatory factor analysis komornik and loreti fourier series in control theory springer science business media leurgans ross and abel decomposition for arrays siam journal on matrix analysis and applications liao and fannjiang music for spectral estimation stability and superresolution applied and computational harmonic analysis moitra the threshold for via extremal functions arxiv preprint mossel and roch learning nonsingular phylogenies and hidden markov models in proceedings of the annual acm symposium on theory of computing pages acm nandi kundu and srivastava noise space decomposition method for twodimensional sinusoidal model 
computational statistics data analysis pearson contributions to the mathematical theory of evolution philosophical transactions of the royal society of london pages potts and tasche parameter estimation for nonincreasing exponential sums by pronylike methods linear algebra and its applications russell controllability and stabilizability theory for linear partial differential equations recent progress and open questions siam review sanjeev and kannan learning mixtures of arbitrary gaussians in proceedings of the annual acm symposium on theory of computing pages acm schiebinger robeva and recht superresolution without separation arxiv preprint tang bhaskar shah and recht compressed sensing off the grid information theory ieee transactions on vempala and xiao max vs min independent component analysis with nearly linear sample complexity arxiv preprint 
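To make the tensor decomposition step concrete, the following is a minimal sketch of a Jennrich-style simultaneous-diagonalization routine for a symmetric third-order tensor of the form T = sum_j w_j v_j (x) v_j (x) v_j, the structure exploited by the algorithm above. The truncated-SVD whitening and the rescaling conventions of the paper's TensorDecomp routine are omitted; this sketch only recovers the factors up to permutation and per-column scaling, and it assumes the factors are linearly independent with generically distinct projected eigenvalues. The function name is illustrative.

import numpy as np

def jennrich_decompose(T, k, seed=0):
    # T is an m x m x m (possibly complex) tensor assumed to admit a
    # symmetric rank-k decomposition T = sum_j w_j v_j (x) v_j (x) v_j
    # with linearly independent factors v_j.
    m = T.shape[0]
    rng = np.random.default_rng(seed)
    a = rng.normal(size=m)
    b = rng.normal(size=m)
    # Contract the third mode of T against two independent random vectors.
    T_a = np.einsum('ijk,k->ij', T, a)
    T_b = np.einsum('ijk,k->ij', T, b)
    # Generically, T_a pinv(T_b) = V diag(r) pinv(V), so its top-k
    # eigenvectors (ranked by eigenvalue magnitude) are the factors v_j,
    # up to permutation and per-column scaling.
    M = T_a @ np.linalg.pinv(T_b)
    eigvals, eigvecs = np.linalg.eig(M)
    top = np.argsort(-np.abs(eigvals))[:k]
    return eigvecs[:, top]

In the noisy setting analyzed above, the stability of this step is governed by the condition number of the factor matrix and the separation of the projected eigenvalues, which is exactly what the random projections and the bound on the condition number of the characteristic matrix are designed to control.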
Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms

Christopher De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré
{cdesa, czhang, kunle, chrismre}@stanford.edu
Departments of Electrical Engineering and Computer Science, Stanford University, Stanford, CA

Abstract

Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is an analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: we derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem; we analyze asynchronous SGD algorithms for non-convex matrix problems, including matrix completion; and we design and analyze an asynchronous SGD algorithm, called Buckwild!, that uses lower-precision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.

Introduction

Many problems in machine learning can be written as a stochastic optimization problem: minimize $f(x) = \mathbb{E}[\tilde{f}(x)]$ over $x \in \mathbb{R}^n$, where $\tilde{f}$ is a random objective function. One popular method to solve this is stochastic gradient descent (SGD), an iterative method which at each timestep chooses a random objective sample $\tilde{f}_t$ and updates

x_{t+1} = x_t - \alpha \nabla \tilde{f}_t(x_t),   (1)

where $\alpha$ is the step size. For most problems this update step is easy to compute, and perhaps because of this, SGD is a ubiquitous algorithm with a wide range of applications in machine learning, including neural network backpropagation, recommendation systems, and optimization for non-convex problems; in particular, it is widely used in deep learning, where its success is poorly understood theoretically. Given SGD's success in industry, practitioners have developed methods to speed up its computation. One popular way to speed up SGD and related algorithms is asynchronous execution: in an asynchronous algorithm such as Hogwild!, multiple threads run an update rule such as equation (1) in parallel, without locks. Hogwild! and other lock-free algorithms have been applied to a variety of uses, including PageRank approximations (FrogWild!), deep learning (DogWild!), and recommender systems. Many asynchronous versions of other stochastic algorithms have been individually analyzed, such as stochastic coordinate descent and accelerated parallel proximal coordinate descent (APPROX), producing rate results that are similar to those of Hogwild!. Recently, Gupta et al. gave an empirical analysis of the effects of a low-precision variant of SGD on neural network training. Other variants of stochastic algorithms have been proposed, but only a fraction of these algorithms have been analyzed in the asynchronous case. Unfortunately, a new variant of SGD or a related algorithm may violate the assumptions of existing analyses, and hence there are gaps in our understanding of these techniques. One approach to filling this gap is to analyze each extension from scratch: an entirely new model for each type of asynchrony, each type of precision, and so on. In a practical sense this may be unavoidable, but ideally there would be a single technique that could analyze many models. In this vein, we prove a result that enables us to treat many different extensions as different forms of noise within a unified model. We demonstrate our technique with three results. (1) For the convex case, Hogwild! requires strict sparsity assumptions; using our techniques, we are able to relax these assumptions and still derive convergence rates. Moreover, under Hogwild!'s stricter assumptions, we recover the previous convergence rates. (2) We
derive convergence results for an asynchronous sgd algorithm for matrix completion problem we derive the first rates for asynchronous sgd following the recent synchronous sgd work of de sa et al we derive convergence rates in the presence of quantization errors such as those introduced by arithmetic we validate our results experimentally and show that uckwild can achieve speedups of up to over ogwild algorithms for logistic regression one can combine these different methods both theoretically and empirically we begin with our main result which describes our approach and our model main result analyzing asynchronous algorithms is challenging because unlike in the sequential case where there is single copy of the iterate in the asynchronous case each core has separate copy of in its own cache writes from one core may take some time to be propagated to another core copy of which results in race conditions where stale data is used to compute the gradient updates this difficulty is compounded in the case where series of unlucky random initialization inauspicious steps and race cause the algorithm to get stuck near saddle point or in local minimum broadly we analyze algorithms that repeatedly update by running an update step xt xt for some update function for example for sgd we would have the goal of the algorithm must be to produce an iterate in some success region example ball centered at the optimum for any after running the algorithm for timesteps we say that the algorithm has succeeded if xt for some otherwise we say that the algorithm has failed and we denote this failure event as ft our main result is technique that allows us to bound the convergence rates of asynchronous sgd and related algorithms even for some problems we use martingale methods which have produced elegant convergence rate results for both convex and some algorithms martingales enable us to model multiple forms of example from stochastic sampling random initialization and asynchronous single statistical model compared to standard techniques they also allow us to analyze algorithms that sometimes get stuck which is useful for problems our core contribution is that proof for the convergence of sequential stochastic algorithm can be easily modified to give convergence rate for an asynchronous version supermartingale is stochastic process wt such that wt that is the expected value is over time proof of convergence for the sequential version of this algorithm must construct supermartingale wt xt that is function of both the time and the current and past iterates this function informally represents how unhappy we are with the current state of the algorithm typically it will have the following properties definition for stochastic algorithm as described above process wt is rate supermartingale with horizon if the following conditions are true first it must be supermartingale that is for any sequence xt and any xt xt xt wt xt second for all times and for any sequence xt if the algorithm has not succeeded by time that is xt for all it must hold that wt xt xt this represents the fact that we are unhappy with running for many iterations without success using this we can easily bound the convergence rate of the sequential version of the algorithm statement assume that we run sequential stochastic algorithm for which is rate supermartingale for any the probability that the algorithm has not succeeded by time is ft proof in what follows we let wt denote the actual value taken on by the function in process defined by that is wt wt xt by 
applying recursively for any wt by the law of total expectation applied to the failure event ft wt ft wt wt applying wt and recalling that is nonnegative results in ft rearranging terms produces the result in statement this technique is very general in subsequent sections we show that rate supermartingales can be constructed for sgd on all convex problems and for some algorithms for problems modeling asynchronicity the behavior of an asynchronous sgd algorithm depends both on the problem it is trying to solve and on the hardware it is running on for ease of analysis we assume that the hardware has the following characteristics these are basically the same assumptions used to prove the original og wild result there are multiple threads running iterations of each with their own cache at any point in time these caches may hold different values for the variable and they communicate via some cache coherency protocol there exists central store typically ram at which all writes are serialized this provides consistent value for the state of the system at any point in real time if thread performs read of previously written value and then writes another value dependent on then the write that produced will be committed to before the write that produced each write from an iteration of is to only single entry of and is done using an atomic instruction that is there are no races handling these is possible but complicates the analysis notice that if we let xt denote the value of the vector in the central store after writes have occurred then since the writes are atomic the value of is solely dependent on the single thread that produces the write that is serialized next in if we let denote the update function sample that is used by that thread for that write and vt denote the cached value of used by that write then xt our hardware model further constrains the value of all the read elements of must have been written to at some time before therefore for some nonnegative variable eti eti where ei is the ith standard basis vector we can think of as the delay in the ith coordinate caused by the parallel updates we can conceive of this system as stochastic process with two sources of randomness the noisy update function samples and the delays we assume that the are independent and identically is reasonable because they are sampled independently by the updating threads it would be unreasonable though to assume the same for the since delays may very well be correlated in the system instead we assume that the delays are bounded from above by some random variable specifically if ft the filtration denotes all random events that occurred before timestep then for any and we let and call the expected delay convergence rates for asynchronous sgd now that we are equipped with stochastic model for the asynchronous sgd algorithm we show how we can use rate supermartingale to give convergence rate for asynchronous algorithms to do this we need some continuity and boundedness assumptions we collect these into definition and then state the theorem definition an algorithm with rate supermartingale is if the following conditions hold first must be lipschitz continuous in the current iterate with parameter that is for any and sequence xt kwt wt hku vk second must be lipschitz continuous in expectation with parameter that is for any and rku third the expected magnitude of the update must be bounded by that is for any theorem assume that we run an asynchronous stochastic algorithm with the above hardware model for which is rate 
supermartingale with horizon further assume that hrξτ for any the probability that the algorithm has not succeeded by time is ft hrξτ note that this rate depends only on the expected delay and not on any other properties of the hardware model compared to the result of statement the probability of failure has only increased by factor of hrξτ in most practical cases hrξτ so this increase in probability is negligible since the proof of this theorem is simple but uses techniques we outline it here first notice that the process wt which was supermartingale in the sequential case is not in the asynchronous case because of the delayed updates our strategy is to use to produce new process vt that is supermartingale in this case for any and if xu for all we define vt xt wt xt hrξτ hr compared with there are two additional terms here the first term is negative and cancels out some of the unhappiness from that we ascribed to running for many iterations we can interpret this as us accepting that we may need to run for more iterations than in the sequential case the second term measures the distance between recent iterates we would be unhappy if this becomes large because then the noise from the delayed updates would also be large on the other hand if xu for some then we define vt xt xu vu xu we call vt stopped process because its value doesn change after success occurs it is straightforward to show that vt is supermartingale for the asynchronous algorithm once we know this the same logic used in the proof of statement can be used to prove theorem theorem gives us straightforward way of bounding the convergence time of any asynchronous stochastic algorithm first we find rate supermartingale for the problem this is typically no harder than proving sequential convergence second we find parameters such that the problem is typically this is easily done for problems by using differentiation to bound the lipschitz constants third we apply theorem to get rate for asynchronous sgd using this method analyzing an asynchronous algorithm is really no more difficult than analyzing its sequential analog applications now that we have proved our main result we turn our attention to applications we show for couple of algorithms how to construct rate supermartingale we demonstrate that doing this allows us to recover known rates for ogwild algorithms as well as analyze cases where no known rates exist convex case high precision arithmetic first we consider the simple case of using asynchronous sgd to minimize convex function using unbiased gradient samples that is we run the update rule xt we make the standard assumption that is strongly convex with parameter that is for all and ckx we also assume continuous differentiability of with lipschitz constant lkx we require that the second moment of the gradient sample is also bounded for some by for some we let the success region be under these conditions we can construct rate supermartingale for this algorithm lemma there exists wt where if the algorithm hasn succeeded by timestep wt xt log kx such that wt is rate submartingale for the above algorithm with horizon furthermore it is with parameters αl and αm using this and theorem gives us direct bound on the failure rate of convex ogwild sgd corollary assume that we run an asynchronous version of the above sgd algorithm where for some constant we choose step size then for any the probability that the algorithm has not succeeded by time is ft log kx this result is more general than the result in niu et al the main differences 
are that we make no assumptions about the sparsity structure of the gradient samples and that our rate depends only on the second moment of and the expected value of as opposed to requiring absolute bounds on their magnitude under their stricter assumptions the result of corollary recovers their rate convex case low precision arithmetic one of the ways uckwild achieves high performance is by using arithmetic this introduces additional noise to the system in the form of error we consider this error to be part of the uckwild hardware model we assume that the error can be modeled by an unbiased rounding function operating on the update samples that is for some chosen precision factor there is random quantization function such that for any it holds that and the error is bounded by ακm using this function we can write asynchronous update rule for convex sgd as xt where operates only on the single nonzero entry of in the same way as we did in the case we can use these properties to construct rate supermartingale for the lowprecision version of the convex sgd algorithm and then use theorem to bound the failure rate of convex uckwild corollary assume that we run asynchronous convex sgd and for some we choose step size lm then for any the probability that the algorithm has not succeeded by time is lm log kx ft typically we choose precision such that in this case the increased error compared to the result of corollary will be negligible and we will converge in number of samples that is very similar to the sequential case since each uckwild update runs in less time than an equivalent ogwild update this result means that an execution of uckwild will produce output in less time compared with ogwild case high precision arithmetic many machine learning problems are but are still solved in practice with sgd in this section we show that our technique can be adapted to analyze problems unfortunately there are no general convergence results that provide rates for sgd on problems so it would be unreasonable to expect general proof of convergence for ogwild instead we focus on particular problem matrix completion minimize xxt subject to rn for which there exists sequential sgd algorithm with rate that has already been proven this problem arises in general data analysis subspace tracking principle component analysis recommendation systems and other applications in what follows we let we assume that is symmetric and has unit eigenvectors un with corresponding eigenvalues λn we let the eigengap denote de sa et al provide rate of convergence for particular sgd algorithm alecton running on this problem for simplicity we focus on only the version of the problem and we assume that at each timestep single entry of is used as sample under these conditions alecton uses the update rule xt where and are indices in it initializes uniformly on the sphere of some radius centered at the origin we can equivalently think of this as stochastic power iteration algorithm for any we define the success set to be kxk that is we are only concerned with the direction of not its magnitude this algorithm only recovers the dominant eigenvector of not its eigenvalue in order to show convergence for this entrywise sampling scheme de sa et al require that the matrix satisfy coherence bound table training loss of sgd as function of arithmetic precision for logistic regression dataset reuters forest music rows columns size float int int definition matrix is incoherent with parameter if for every standard basis vector ej and for all unit 
eigenvectors ui of the matrix etj ui they also require that the step size be set for some constants and as kakf for ease of analysis we add the additional assumptions that our algorithm runs in some bounded space that is for some constant at all times kxt and kxt as in the convex case by following the approach of de sa et al we are able to generate rate supermartinagle for this save space we only state its initial value and not the full expression lemma for the problem above choose any horizon such that then there exists function wt such that wt is rate supermartingale for the above sgd algorithm with parameters ηµ kakf and ηµ kakf and log enγ note that the analysis parameter allows us to trade off between which determines how long we can run the algorithm and the initial value of the supermartingale we can now produce corollary about the convergence rate by applying theorem and setting and appropriately corollary assume that we run ogwild alecton under these conditions for timesteps as defined below then the probability of failure ft will be bounded as below kakf log en ft the fact that we are able to use our technique to analyze algorithm illustrates its generality note that it is possible to combine our results to analyze asynchronous sgd but the resulting formulas are complex so we do not include them here experiments we validate our theoretical results for both asynchronous matrix completion and uck wild ogwild implementation with arithmetic like ogwild uck wild algorithm has multiple threads running an update rule in parallel without locking compared with ogwild which uses floating point numbers to represent input data uck wild uses arithmetic by rounding the input data to or integers this not only decreases the memory usage but also allows us to take advantage of simd instructions for integers on modern cpus we verified our main claims by running ogwild and uckwild algorithms on the discussed applications table shows how the training loss of sgd for logistic regression convex problem varies as the precision is changed we ran sgd with step size however results are similar across range of step sizes we analyzed all four datasets reported in dimmwitted that favored ogwild reuters and which are text classification datasets forest which arises from remote sensing and music which is music classification dataset we implemented all glm models reported in dimmwitted including svm linear regression and logistic regression and float int int threads hogwild sequential alecton for speedup over best ogwild speedup over sequential performance of uckwild for logistic regression speedup of uckwild for dense dataset sequential hogwild iterations billions convergence trajectories for sequential versus ogwild alecton figure experiments compare the training loss performance and convergence of ogwild and uckwild algorithms with sequential versions report logistic regression because other models have similar performance the results illustrate that there is almost no increase in training loss as the precision is decreased for these problems we also investigated and computation the former was slower than due to lack of simd instructions and the latter discarded too much information to produce good quality results figure displays the speedup of uckwild running on the of the dataset compared to both sequential sgd left axis and ogwild right axis experiments ran on machine with two xeon cpus each with six hyperthreaded cores and of ram this plot illustrates that incorporating arithmetic into our algorithm 
allows us to achieve significant speedups over both sequential and ogwild sgd note that we don get full linear speedup because we are bound by the available memory bandwidth beyond this limit adding additional threads provides no benefits while increasing conflicts and thrashing the and caches this result combined with the data in table suggest that by doing lowprecision asynchronous updates we can get speedups of up to on these sorts of datasets without significant increase in error figure compares the convergence trajectories of ogwild and sequential versions of the nonconvex alecton matrix completion algorithm on synthetic data matrix with ten random eigenvalues λi each plotted series represents different run of alecton the trajectories differ somewhat because of the randomness of the algorithm the plot shows that the sequential and asynchronous versions behave qualitatively similarly and converge to the same noise floor for this dataset sequential alecton took seconds to run while ogwild alecton took seconds speedup conclusion this paper presented unified theoretical framework for producing results about the convergence rates of asynchronous and random algorithms such as stochastic gradient descent we showed how rate of convergence for sequential algorithm can be easily leveraged to give rate for an asynchronous version we also introduced uckwild strategy for sgd that is able to take advantage of modern hardware resources for both task and data parallelism and showed that it achieves near linear parallel speedup over sequential algorithms acknowledgments the uckwild name arose out of conversations with benjamin recht thanks also to madeleine udell for helpful conversations the authors acknowledge the support of darpa nsf nsf doe nsf darpa nsf onr and nih oracle nvidia huawei sap labs sloan research fellowship moore foundation american family insurance google and toshiba references bottou machine learning with stochastic gradient descent in compstat pages springer bottou stochastic gradient descent tricks in neural networks tricks of the trade pages springer bottou and olivier bousquet the tradeoffs of large scale learning in platt koller singer and roweis editors nips volume pages nips foundation christopher de sa kunle olukotun and christopher global convergence of stochastic gradient descent for some nonconvex matrix problems icml john duchi peter bartlett and martin wainwright randomized smoothing for stochastic optimization siam journal on optimization olivier fercoq and peter accelerated parallel and proximal coordinate descent arxiv preprint thomas fleming and david harrington counting processes and survival analysis volume pages john wiley sons pankaj gupta ashish goel jimmy lin aneesh sharma dong wang and reza zadeh wtf the who to follow service at twitter www pages suyog gupta ankur agrawal kailash gopalakrishnan and pritish narayanan deep learning with limited numerical precision icml prateek jain praneeth netrapalli and sujay sanghavi matrix completion using alternating minimization in stoc pages acm johansson maben rabi and mikael johansson randomized incremental subgradient method for distributed optimization in networked systems siam journal on optimization jakub zheng qu and peter coordinate descent in nips optimization in machine learning workshop yann le cun bottou genevieve orr and efficient backprop in neural networks tricks of the trade ji liu and stephen wright asynchronous stochastic coordinate descent parallelism and convergence properties siopt ji liu stephen 
wright christopher victor bittorf and srikrishna sridhar an asynchronous parallel stochastic coordinate descent algorithm jmlr ioannis mitliagkas michael borokhovich alexandros dimakis and constantine caramanis frogwild fast pagerank approximations on graph engines pvldb feng niu benjamin recht christopher re and stephen wright hogwild approach to parallelizing stochastic gradient descent in nips pages cyprien noel and simon osindero dogwild hogwild for cpu gpu shameem ahamed puthiya parambath matrix factorization methods for recommender systems alexander rakhlin ohad shamir and karthik sridharan making gradient descent optimal for strongly convex stochastic optimization icml peter and martin parallel coordinate descent methods for big data optimization mathematical programming pages qing tao kang kong dejun chu and gaowei wu stochastic coordinate descent methods for regularized smooth and nonsmooth losses in machine learning and knowledge discovery in databases pages springer rachael tappenden martin and peter on the complexity of parallel coordinate descent arxiv preprint yu hsieh si si and inderjit dhillon scalable coordinate descent approaches to parallel matrix factorization for recommender systems in icdm pages ce zhang and christopher re dimmwitted study of statistical analytics pvldb 
the return of the gating network combining generative models and discriminative training in natural image priors yair weiss school of computer science and engineering hebrew university of jerusalem dan rosenbaum school of computer science and engineering hebrew university of jerusalem abstract in recent years approaches based on machine learning have achieved performance on image restoration problems successful approaches include both generative models of natural images as well as discriminative training of deep neural networks discriminative training of feed forward architectures allows explicit control over the computational cost of performing restoration and therefore often leads to better performance at the same cost at run time in contrast generative models have the advantage that they can be trained once and then adapted to any image restoration task by simple use of bayes rule in this paper we show how to combine the strengths of both approaches by training discriminative architecture to predict the state of latent variables in generative model of natural images we apply this idea to the very successful gaussian mixture model gmm of natural images we show that it is possible to achieve comparable performance as the original gmm but with two orders of magnitude improvement in run time while maintaining the advantage of generative models introduction figure shows an example of an image restoration problem we are given degraded image in this case degraded with gaussian noise and seek to estimate the clean image image restoration is an extremely well studied problem and successful systems for specific scenarios have been built without any explicit use of machine learning for example approaches based on coring can be used to successfully remove noise from an image by transforming to wavelet basis and zeroing out coefficients that are close to zero more recently the very successful method removes noise from patches by finding similar patches in the noisy image and combining all similar patches in nonlinear way in recent years machine learning based approaches are starting to outperform the hand engineered systems for image restoration as in other areas of machine learning these approaches can be divided into generative approaches which seek to learn probabilistic models of clean images versus discriminative approaches which seek to learn models that map noisy images to clean images while minimizing the training loss between the predicted clean image and the true one two influential generative approaches are the fields of experts foe approach and ksvd which assume that filter responses to natural images should be sparse and learn set of filters under this assumption while very good performance can be obtained using these methods when they are trained generatively they do not give performance that is as good as perhaps the most successful generative approach to image restoration is based on gaussian mixture models gmms in this approach image patches are modeled as dimensional vectors and noisy image full model gating per patch fast gating per patch figure image restoration with gaussian mixture model middle the most probable component of every patch calculated using full posterior calculation fast gating network color coded by embedding in space bottom the restored image the gating network achieves almost identical results but in orders of magnitude faster simple gmm with components is used to model the density in this space despite its simplicity this model remains among the top performing 
models in terms of likelihood given to left out patches and also gives excellent performance in image restoration in particular it outperforms on image denoising and has been successfully used for other image restoration problems such as deblurring the performance of generative models in denoising can be much improved by using an empirical bayes approach where the parameters are estimated from the noisy image discriminative approaches for image restoration typically assume particular feed forward structure and use training to optimize the parameters of the structure and shaked used discriminative training to optimize the parameters of coring chen et al discriminatively learn the parameters of generative model to minimize its denoising error they show that even though the model was trained for specific noise level it acheives similar results as the gmm for different noise levels jain and seung trained convolutional deep neural network to perform image denoising using the same training set as was used by the foe and gmm papers they obtained better results than foe but not as good as or gmm burger et al trained deep nonconvolutional multi layer perceptron to perform denoising by increasing the size of the training set by two orders of magnitude relative to previous approaches they obtained what is perhaps the best method for image denoising fanello et al trained random forest architecture to optimize denoising performance they obtained results similar to the gmm but at much smaller computational cost which approach is better discriminative or generative first it should be said that the best performing methods in both categories give excellent performance indeed even the approach which can be outperformed by both types of methods has been said to be close to optimal for image denoising the primary advantage of the discriminative approach is its efficiency at by defining particular architecture we are effectively constraining the computational cost at and during learning we seek the best performing parameters for fixed computational cost the primary advantage of the generative approach on the other hand is its modularity learning only requires access to clean images and after learning density model for clean images bayes rule can be used to peform restoration on any image degradation and can support different loss functions at test time in contrast discriminative training requires separate training and usually separate architectures for every possible image degradation given that there are literally an infinite number of ways to degrade images not just gaussian noise with different noise levels but also compression artifacts blur etc one would like to have method that maintains the modularity of generative models but with the computational cost of discriminative models in this paper we propose such an approach our method is based on the observation that the most costly part of inference with many generative models for natural images is in estimating latent variables these latent variables can be abstract representations of local image covariance or simply discrete variable that indicates which gaussian most likely generated the data in gmm we therefore discriminatively train architecture or gating network to predict these latent variables using far less computation the gating network need only be trained on clean images and we show how to combine it during inference with bayes rule to perform image restoration for any type of image degradation our results show that we can maintain the accuracy 
and the modularity of generative models but with speedup of two orders of magnitude in run time in the rest of the paper we focus on the gaussian mixture model although this approach can be used for other generative models with latent variables like the one proposed by karklin and lewicki code implementing our proposed algorithms for the gmm prior and karklin and lewicki prior is available online at image restoration with gaussian mixture priors modeling image patches with gaussian mixtures has proven to be very effective for image restoration in this model the prior probability of an image patch is modeled by pr πh µh σh during image restoration this prior is combined with likelihood function pr and restoration is based on the posterior probability pr which is computed using bayes rule typically map estimators are used although for some problems the more expensive bls estimator has been shown to give an advantage in order to maximize the posterior probability different numerical optimizations can be used typically they require computing the assignment probabilities πh µh σh pr πk µk σk these assignment probabilities play central role in optimizing the posterior for example it is easy to see that the gradient of the log of the posterior involves weighted sum of gradients where the assignment probabilities give the weights log pr log pr log pr log pr log pr pr µh similarly one can use version of the em algorithm to iteratively maximize the posterior probability by solving sequence of reweighted least squares problems here the assignment probabilities define the weights for the least squares problems finally in auxiliary samplers for performing bls estimation each iteration requires sampling the hidden variables according to the current guess of the image for reasons of computational efficiency the assignment probabilities are often used to calculate hard assignment of patch to component arg max pr following the literature on mixtures of experts we call this process gating as we now show this process is often the most expensive part of performing image restoration with gmm prior running time of inference the successful epll algorithm for image restoration with patch priors defines cost function based on the simplifying assumption that the patches of an image are independent log pr xi log pr where xi are the image patches is the full image and is parameter that compensates for the simplifying assumption minimizing this cost when the prior is gmm is done by alternating between three steps we give here only short representation of each step but the full algorithm is given in the supplementary material the three steps are gating for each patch the current guess xi is assigned to one of the components xi filtering for each patch depending on the assignments xi least squares problem is solved mixing overlapping patches are averaged together with the noisy image it can be shown that after each iteration of the three steps the epll splitting cost function relaxation of equation is decreased in terms of computation time the gating step is by far the most expensive one the filtering step multiplies each dimensional patch by single matrix which is equivalent to or flops per patch assuming local noise model the mixing step involves summing up all patches back to the image and solving local cost on the image equivalent to or flops per patch in the gating step however we compute the probability of all the gaussian components for every patch each computation performs and so for components we get total of 
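As a concrete illustration of the gating step discussed above, the following numpy sketch computes the hard assignment of each patch to its most likely component for zero-mean Gaussians, evaluating all K components per patch. Variable names are illustrative; this is the naive full-posterior computation whose per-patch cost motivates the learned gating network described next.

```python
import numpy as np

def full_gating(patches, weights, covariances):
    """Hard assignment of each patch to its most likely GMM component.

    patches:      (N, d) array of vectorized, mean-subtracted image patches
    weights:      (K,)   mixing weights pi_h
    covariances:  (K, d, d) zero-mean component covariances Sigma_h

    Every patch is scored against all K Gaussians, which is the expensive
    step that dominates EPLL inference as discussed above.
    """
    N, d = patches.shape
    K = len(weights)
    log_post = np.empty((N, K))
    for h in range(K):
        # log N(x; 0, Sigma_h) via the eigendecomposition of Sigma_h:
        # a weighted sum of squared projections onto the eigenvectors
        evals, evecs = np.linalg.eigh(covariances[h])
        proj = patches @ evecs                       # dot products with templates
        log_post[:, h] = (np.log(weights[h])
                          - 0.5 * np.sum(np.log(2 * np.pi * evals))
                          - 0.5 * np.sum(proj ** 2 / evals, axis=1))
    return np.argmax(log_post, axis=1)               # gating decision per patch
```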
or flops per patch for gmm with components like the one used in this results in gating step which is times slower than the filtering and mixing steps the gating network figure architecture of the gating step in gmm inference left more efficient gating network for noise models like in image deblurring there is an additional factor of the square of the kernel dimension if the kernel dimension is in the order of the mixing step performs or flops the left side of figure shows the computation involved in naive computing of the gating in the gmm used in the gaussians are zero mean so computing the most likely component involves multiplying each patch with all the eigenvectors of the covariance matrix and squaring the results log consth consth where σih and vih are the eigenvalues and eigenvectors of the covariance matrix the eigenvectors can be viewed as templates and therefore the gating is performed according to weighted sums of dotproducts with different templates every component has different set of templates and different weighting of their importance the eigenvalues framing this process as network starting with patch of dimension and using gaussian components the first layer computes followed by squaring and the second layer performs viewed this way it is clear that the naive computation of the gating is inefficient there is no sharing of between different components and the number of that are required for deciding about the appropriate component may be much smaller than is done with this naive computation discriminative training of the gating network in order to obtain more efficient gating network we use discriminative training we rewrite equation as log pr wih vit consth note that the vectors vi are required to be shared and do not depend on only the weights wih depend on given set of vectors vi and the weights the posterior probability of patch assignment is approximated by exp wih vit consth pr exp wi vi constk we minimize the cross entropy between the approximate posterior probability and the exact posterior probability given by equation the training is done on of clean image patches each taken randomly from the images in the bsds training set we minimize the training loss for each using iterations of before moving to the next results of the training are shown in figure unlike the eigenvectors of the gmm covariance matrices which are often global fourier patterns or edge filters the learned vectors are more localized in space and resemble gabor filters generatively trained discriminatively trained psnr generative discriminative log of figure left subset of the eigenvectors used for the full posterior calculation center the first layer of the discriminatively trained gating network which serves as shared pool of eigenvectors right the number of versus the resulting psnr for patch denoising using different models discrimintively training smaller gating networks is better than generatively training smaller gmms with less components figure compares the gating performed by the full network and the discriminatively trained one each pixel shows the predicted component for patch centered around that pixel components are color coded so that dark pixels correspond to components with low variance and bright pixels to high variance the colors denote the preferred orientation of the covariance although the gating network requires far less it gives similar although not identical gating figure shows sample patches arranged according to the gating with either the full model top or the gating network 
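A minimal sketch of the discriminatively trained gating network follows: a shared filter bank, a squaring nonlinearity, per-component weights, and a softmax, trained by cross entropy against the exact posterior. The array shapes, the learning rate, and the omission of the per-component constants const_h are simplifying assumptions for illustration, not the exact training setup used in the paper.

```python
import numpy as np

def gating_network(patches, V, W):
    """Fast gating: shared filters, squaring, per-component weights, softmax.

    patches: (N, d) image patches
    V:       (d, m) shared filter bank (first layer), with m small
    W:       (m, K) per-component weights on the squared filter responses
    Returns the approximate posterior over the K components for every patch.
    """
    responses = (patches @ V) ** 2                   # squared dot products
    scores = responses @ W                           # one score per component
    scores -= scores.max(axis=1, keepdims=True)
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)

def train_step(patches, exact_posterior, V, W, lr=1e-3):
    """One SGD step on the cross entropy between the network posterior and
    the exact posterior computed with the full GMM (the per-component
    constants const_h are omitted in this sketch)."""
    q = gating_network(patches, V, W)                # approximate posterior
    err = q - exact_posterior                        # d(cross entropy)/d(scores)
    responses = (patches @ V) ** 2
    grad_W = responses.T @ err / len(patches)
    grad_resp = err @ W.T                            # back through second layer
    grad_V = patches.T @ (2 * (patches @ V) * grad_resp) / len(patches)
    V -= lr * grad_V
    W -= lr * grad_W
    return V, W
```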
bottom we classify set of patches by their assignment probabilities for of the components we display patches that are classified to that component it can be seen that when the classification is done using the gating network or the full posterior the results are visually similar the right side of figure compares between two different ways to reduce computation time the green curve shows gating networks with different sizes containing to vectors trained on top of the component gmm the blue curve shows gmms with different number of components from to each of the models is used to perform patch denoising using map inference with noise level of it is clearly shown that in terms of the number of versus the resulting psnr discriminatively training small gating network on top of gmm with components is much better than pure generative training of smaller gmms gating with the full model gating with the learned network figure gating with the full posterior computation the learned gating network top patches from clean images arranged according to the component with maximum probability every column represents different component showing out of bottom patches arranged according to the component with maximum gating score both gating methods have very similar behavior results we compare the image restoration performance of our proposed method to several other methods proposed in the literature the first class of methods used for denoising are internal methods that do not require any learning but are specific to image denoising prime example is the second class of methods are generative models which are only trained on clean images the original epll algorithm is in this class finally the third class of models are discriminative which are trained these typically have the best performance but need to be trained in advance for any image restoration problem in the right hand side of table we show the denoising results of our implementation of epll with gmm of components it can be seen that the difference between doing the full inference and using learned gating network with vectors is about to which is comparable to the difference between different published values of performance for single algorithm even with the learned gating network the epll performance is among the top performing methods for all noise levels the fully discriminative mlp method is the best performing method for each noise level but it is trained explicitly and separately for each noise level the right hand side of table also shows the run times of our matlab implementation of epll on standard cpu although the number of in the gating has been decreased by factor of internal lssc lssc ksvd generative foe ksvdg epll epll epll discriminative mlp ff epll with different gating methods sec full gating full naive posterior computation gating the learned gating network the learned network calculated with stride of table average psnr db for image denoising left values for different denoising methods as reported by different papers right comparing different gating methods for our epll implementation computed over test images of bsds using fast gating method results in psnr difference comparable to the difference between different published values of the same algorithm noisy mlp full gating noisy mlp full gating figure image denoising examples using the fast gating network or the full inference computation is visually indistinguishable the effect on the actual run times is more complex still by only switching to the new gating network we obtain 
speedup factor of more than on small images we also show that further speedup can be achieved by simply working with less overlapping patches stride the results show that using stride of working on every th patch leads to almost no loss in psnr although the stride speedup can be achieved by any patch based method it emphasizes another important between accuracy and in total we see that speedup factor of more than lead to very similar results than the full inference we expect even more dramatic speedups are possible with more optimized and parallel code figures gives visual comparison of denoised images as can be expected from the psnr values the results with full epll and the gating network epll are visually indistinguishable to highlight the modularity advantage of generative models figure shows results of image deblurring using the same prior even though all the training of the epll and the gating was done on clean sharp images the prior can be combined with likelihood for deblurring to obtain deblurring results again the full and the gating results are visually indistinguishable blur full gating blur full gating figure image deblurring examples using the learned gating network maintains the modularity property allowing it to be used for different restoration tasks once again results are very similar to the full inference computation noisy epllgating psnr figure denoising of image using the learned gating network and stride of we get very fast inference with comparable results to discriminatively trained models finally figure shows the result of performing resotration on an image epll with gating network achieves comparable results to discriminatively trained method csf but is even more efficient while maintaining the modularity of the generative approach discussion image restoration is widely studied problem with immediate practical applications in recent years approaches based on machine learning have started to outperform handcrafted methods this is true both for generative approaches and discriminative approaches while discriminative approaches often give the best performance for fixed computational budget the generative approaches have the advantage of modularity they are only trained on clean images and can be used to perform one of an infinite number of possible resotration tasks by using bayes rule in this paper we have shown how to combine the best aspects of both approaches we discriminatively train architecture to perform the most expensive part of inference using generative models our results indicate that we can still obtain performance with two orders of magnitude improvement in run times while maintaining the modularity advantage of generative models acknowledgements support by the isf intel and the gatsby foundation is greatfully acknowledged references harold christopher burger christian schuler and stefan harmeling learning how to combine internal and external denoising methods in pattern recognition pages springer harold christopher burger christian schuler and stefan harmeling image denoising with multilayer perceptrons part comparison with existing algorithms and with bounds arxiv preprint yunjin chen thomas pock ranftl and horst bischof revisiting training of filterbased mrfs for image restoration in pattern recognition pages springer kostadin dabov alessandro foi vladimir katkovnik and karen egiazarian image denoising by sparse collaborative filtering image processing ieee transactions on michael elad and michal aharon image denoising via sparse and redundant 
representations over learned dictionaries image processing ieee transactions on sean ryan fanello cem keskin pushmeet kohli shahram izadi jamie shotton antonio criminisi ugo pattacini and tim paek filter forests for learning convolutional kernels in computer vision and pattern recognition cvpr ieee conference on pages ieee yacov and doron shaked discriminative approach for wavelet denoising image processing ieee transactions on robert jacobs michael jordan steven nowlan and geoffrey hinton adaptive mixtures of local experts neural computation viren jain and sebastian seung natural image denoising with convolutional networks in advances in neural information processing systems pages yan karklin and michael lewicki emergence of complex cell properties by learning to generalize in natural scenes nature effi levi using natural image or sampling phd thesis the hebrew university of jerusalem anat levin and boaz nadler natural image denoising optimality and inherent bounds in computer vision and pattern recognition cvpr ieee conference on pages ieee siwei lyu and eero simoncelli statistical modeling of images with fields of gaussian scale mixtures in advances in neural information processing systems pages julien mairal francis bach jean ponce guillermo sapiro and andrew zisserman sparse models for image restoration in computer vision ieee international conference on pages ieee carl rassmusen http stefan roth and michael black fields of experts framework for learning image priors in computer vision and pattern recognition cvpr ieee computer society conference on volume pages ieee uwe schmidt qi gao and stefan roth generative perspective on mrfs in vision in computer vision and pattern recognition cvpr ieee conference on pages ieee uwe schmidt and stefan roth shrinkage fields for effective image restoration in computer vision and pattern recognition cvpr ieee conference on pages ieee libin sun sunghyun cho jue wang and james hays blur kernel estimation using patch priors in computational photography iccp ieee international conference on pages ieee benigno uria iain murray and hugo larochelle rnade the neural autoregressive densityestimator in advances in neural information processing systems pages guoshen yu guillermo sapiro and mallat solving inverse problems with piecewise linear estimators from gaussian mixture models to structured sparsity image processing ieee transactions on daniel zoran and yair weiss from learning models of natural image patches to whole image restoration in computer vision iccv ieee international conference on pages ieee daniel zoran and yair weiss natural images gaussian mixtures and dead leaves in nips pages 
pointer networks oriol google brain meire department of mathematics uc berkeley navdeep jaitly google brain abstract we introduce new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence such problems can not be trivially addressed by existent approaches such as and neural turing machines because the number of target classes in each step of the output depends on the length of the input which is variable problems such as sorting variable sized sequences and various combinatorial optimization problems belong to this class our model solves the problem of variable size output dictionaries using recently proposed mechanism of neural attention it differs from the previous attention attempts in that instead of using attention to blend hidden units of an encoder to context vector at each decoder step it uses attention as pointer to select member of the input sequence as the output we call this architecture pointer net we show can be used to learn approximate solutions to three challenging geometric problems finding planar convex hulls computing delaunay triangulations and the planar travelling salesman problem using training examples alone not only improve over with input attention but also allow us to generalize to variable size output dictionaries we show that the learnt models generalize beyond the maximum lengths they were trained on we hope our results on these tasks will encourage broader exploration of neural learning for discrete problems introduction recurrent neural networks rnns have been used for learning functions over sequences from examples for more than three decades however their architecture limited them to settings where the inputs and outputs were available at fixed frame rate the recently introduced paradigm removed these constraints by using one rnn to map an input sequence to an embedding and another possibly the same rnn to map the embedding to an output sequence bahdanau et al augmented the decoder by propagating extra contextual information from the input using attentional mechanism these developments have made it possible to apply rnns to new domains achieving results in core problems in natural language processing such as translation and parsing image and video captioning and even learning to execute small programs nonetheless these methods still require the size of the output dictionary to be fixed priori because of this constraint we can not directly apply this framework to combinatorial problems where the size of the output dictionary depends on the length of the input sequence in this paper we address this limitation by repurposing the attention mechanism of to create pointers to input elements we show that the resulting architecture which we name pointer networks can be trained to output satisfactory solutions to three combinatorial optimization problems computing planar convex hulls delaunay triangulations and the symmetric planar travelling salesman problem tsp the resulting models produce approximate solutions to these problems in purely data driven equal contribution figure an rnn blue processes the input sequence to create code vector that is used to generate the output sequence purple using the probability chain rule and another rnn the output dimensionality is fixed by the dimensionality of the problem and it is the same during training and inference an encoding rnn converts the input sequence to code blue that is fed to the generating network 
purple at each step the generating network produces vector that modulates attention mechanism over inputs the output of the attention mechanism is softmax distribution with dictionary size equal to the length of the input ion when we only have examples of inputs and desired outputs the proposed approach is depicted in figure the main contributions of our work are as follows we propose new architecture that we call pointer net which is simple and effective it deals with the fundamental problem of representing variable length dictionaries by using softmax probability distribution as pointer we apply the pointer net model to three distinct algorithmic problems involving geometry we show that the learned model generalizes to test problems with more points than the training problems our pointer net model learns competitive small scale tsp approximate solver our results demonstrate that purely data driven approach can learn approximate solutions to problems that are computationally intractable models we review the and models that are the baselines for this work in sections and we then describe our model in section model given training pair the model computes the conditional probability using parametric model an rnn with parameters to estimate the terms of the probability chain rule also see figure ci here pn is sequence of vectors and cm is sequence of indices each between and we note that the target sequence length is in general function of the parameters of the model are learnt by maximizing the conditional probabilities for the training set arg max log where the sum is over training examples as in we use an long short term memory lstm to model ci the rnn is fed pi at each time step until the end of the input sequence is reached at which time special symbol is input to the model the model then switches to the generation mode until the network encounters the special symbol which represents termination of the output sequence note that this model makes no statistical independence assumptions we use two separate rnns one to encode the sequence of vectors pj and another one to produce or decode the output symbols ci we call the former rnn the encoder and the latter the decoder or the generative rnn during inference given sequence the learnt parameters are used to select the sequence cˆp with the highest probability cˆp arg max finding the optimal sequence cˆ cp is computationally impractical because of the combinatorial number of possible output sequences instead we use beam search procedure to find the best possible sequence given beam size in this model the output dictionary size for all symbols ci is fixed and equal to since the outputs are chosen from the input thus we need to train separate model for each this prevents us from learning solutions to problems that have an output dictionary with size that depends on the input sequence length under the assumption that the number of outputs is this model has computational complexity of however exact algorithms for the problems we are dealing with are more costly for example the convex hull problem has complexity log the attention mechanism see section adds more computational capacity to this model content based input attention the vanilla model produces the entire output sequence using the fixed dimensional state of the recognition rnn at the end of the input sequence this constrains the amount of information and computation that can flow through to the generative model the attention model of ameliorates this problem by augmenting the encoder and 
decoder rnns with an additional neural network that uses an attention mechanism over the entire sequence of encoder rnn states for notation purposes let us define the encoder and decoder hidden states as en and dm respectively for the lstm rnns we use the state after the output gate has been multiplied by the cell activations we compute the attention vector at each output time as follows uij tanh ej di aij softmax uij aij ej where softmax normalizes the vector ui of length to be the attention mask over the inputs and and are learnable parameters of the model in all our experiments we use the same hidden dimensionality at the encoder and decoder typically so is vector and and are square matrices lastly and di are concatenated and used as the hidden states from which we make predictions and which we feed to the next time step in the recurrent model note that for each output we have to perform operations so the computational complexity at inference time becomes this model performs significantly better than the model on the convex hull problem but it is not applicable to problems where the output dictionary size depends on the input nevertheless very simple extension or rather reduction of the model allows us to do this easily we now describe very simple modification of the attention model that allows us to apply the method to solve combinatorial optimization problems where the output dictionary size depends on the number of elements in the input sequence the model of section uses softmax distribution over fixed sized output dictionary to compute ci in equation thus it can not be used for our problems where the size of the output dictionary is equal to the length of the input sequence to solve this problem we model ci using the attention mechanism of equation as follows uij ci tanh ej di softmax ui where softmax normalizes the vector ui of length to be an output distribution over the dictionary of inputs and and are learnable parameters of the output model here we do not blend the encoder state ej to propagate extra information to the decoder but instead use uij as pointers to the input elements in similar way to condition on as in equation we simply copy the corresponding as the input both our method and the attention model can be seen as an application of attention mechanisms proposed in we also note that our approach specifically targets problems whose outputs are discrete and correspond to positions in the input such problems may be addressed artificially for example we could learn to output the coordinates of the target point directly using an rnn however at inference this solution does not respect the constraint that the outputs map back to the inputs exactly without the constraints the predictions are bound to become blurry over longer sequences as shown in models for videos motivation and datasets structure in the following sections we review each of the three problems we considered as well as our data generation in the training data the inputs are planar point sets pn with elements each where pj xj yj are the cartesian coordinates of the points over which we find the convex hull the delaunay triangulation or the solution to the corresponding travelling salesman problem in all cases we sample from uniform distribution in the outputs cm are sequences representing the solution associated to the point set in figure we find an illustration of an pair for the convex hull and the delaunay problems convex hull we used this example as baseline to develop our models and to understand the 
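To make the pointer mechanism of the preceding equations concrete, here is a minimal numpy sketch of a single decoding step; the shapes and names are illustrative assumptions, and no claim is made about the surrounding LSTM details.

```python
import numpy as np

def pointer_attention(E, d_i, W1, W2, v):
    """One decoding step of the pointer mechanism.

    E:      (n, h) encoder hidden states e_1..e_n
    d_i:    (h,)   decoder hidden state at output step i
    W1, W2: (h, h) learnable matrices, v: (h,) learnable vector
    Returns a distribution over the n input positions (the "pointer").
    """
    u = np.tanh(E @ W1.T + d_i @ W2.T) @ v   # u_ij = v^T tanh(W1 e_j + W2 d_i)
    u -= u.max()
    return np.exp(u) / np.exp(u).sum()       # softmax over input positions

# In the content-based attention baseline the same scores would instead be
# used to blend the encoder states into a context vector,
#   context = softmax(u) @ E,
# whereas the Ptr-Net uses softmax(u) directly as p(C_i | C_1..C_{i-1}, P).
```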
difficulty of solving combinatorial problems with data driven approaches finding the convex hull of finite number of points is well understood task in computational geometry and there are several exact solutions available see in general finding the generally unique solution has complexity log where is the number of points considered the vectors pj are uniformly sampled from the elements ci are indices between and corresponding to positions in the sequence or special tokens representing beginning or end of sequence see figure for an illustration to represent the output as sequence we start from the point with the lowest index and go this is an arbitrary choice but helps reducing ambiguities during training we will release all the datasets at hidden for reference input and the output sequence representing its convex hull input and the output representing its delaunay triangulation figure representation for convex hull and delaunay triangulation the tokens and represent beginning and end of sequence respectively delaunay triangulation delaunay triangulation for set of points in plane is triangulation such that each circumcircle of every triangle is empty that is there is no point from in its interior exact log solutions are available where is the number of points in in this example the outputs cm are the corresponding sequences representing the triangulation of the point set each ci is triple of integers from to corresponding to the position of triangle vertices in or the of sequence tokens see figure we note that any permutation of the sequence represents the same triangulation for additionally each triangle representation ci of three integers can also be permuted without loss of generality and similarly to what we did for convex hulls at training time we order the triangles ci by their incenter coordinates lexicographic order and choose the increasing triangle without ordering the models learned were not as good and finding better ordering that the could better exploit is part of future work travelling salesman problem tsp tsp arises in many areas of theoretical computer science and is an important algorithm used for microchip design or dna sequencing in our work we focused on the planar symmetric tsp given list of cities we wish to find the shortest possible route that visits each city exactly once and returns to the starting point additionally we assume the distance between two cities is the same in each opposite direction this is an problem which allows us to test the capabilities and limitations of our model the pairs have similar format as in the convex hull problem described in section will be the cartesian coordinates representing the cities which are chosen randomly in the square cn will be permutation of integers from to representing the optimal path or tour for consistency in the training dataset we always start in the first city without loss of generality to generate exact data we implemented the algorithm which finds the optimal solution in we used it up to for larger producing exact solutions is extremely costly therefore we also considered algorithms that produce approximated solutions and which are both and which implements the christofides algorithm the latter algorithm is guaranteed to find solution within factor of from the optimal length table shows how they performed in our test sets we choose ci instead of or any other permutation empirical results architecture and hyperparameters no extensive architecture or hyperparameter search of the was done in the work presented 
here and we used virtually the same architecture throughout all the experiments and datasets even though there are likely some gains to be obtained by tuning the model we felt that having the same model hyperparameters operate on all the problems makes the main message of the paper stronger as result all our models used single layer lstm with either or hidden units trained with stochastic gradient descent with learning rate of batch size of random uniform weight initialization from to and gradient clipping of we generated training example pairs and we did observe overfitting in some cases where the task was simpler for small training generally converged after to epochs convex hull we used the convex hull as the guiding task which allowed us to understand the deficiencies of standard models such as the approach and also setting up our expectations on what purely data driven model would be able to achieve with respect to an exact solution we reported two metrics accuracy and area covered of the true convex hull note that any simple polygon will have full intersection with the true convex hull to compute the accuracy we considered two output sequences and to be the same if they represent the same polygon for simplicity we only computed the area coverage for the test examples in which the output represents simple polygon without if an algorithm fails to produce simple polygon in more than of the cases we simply reported fail the results are presented in table we note that the area coverage achieved with the is close to looking at examples of mistakes we see that most problems come from points that are aligned see figure for mistake for this is common source of errors in most algorithms to solve the convex hull it was seen that the order in which the inputs are presented to the encoder during inference affects its performance when the points on the true convex hull are seen late in the input sequence the accuracy is lower this is possibly the network does not have enough processing steps to update the convex hull it computed until the latest points were seen in order to overcome this problem we used the attention mechanism described in section which allows the decoder to look at the whole input at any time this modification boosted the model performance significantly we inspected what attention was focusing on and we observed that it was pointing at the correct answer on the input side this inspired us to create the model described in section more than outperforming both the lstm and the lstm with attention our model has the key advantage of being inherently variable length the bottom half of table shows that when training our model on variety of lengths ranging from to uniformly sampled as we found other forms of curriculum learning to not be effective single model is able to perform quite well on all lengths it has been trained on but some degradation for can be observed the model trained only on length instances more impressive is the fact that the model does extrapolate to lengths that it has never seen during training even for our results are satisfactory and indirectly indicate that the model has learned more than simple lookup neither lstm or lstm with attention can be used for any given without training new model on delaunay triangulation the delaunay triangulation test case is connected to our first problem of finding the convex hull in fact the delaunay triangulation for given set of points triangulates the convex hull of these points we reported two metrics accuracy and triangle 
coverage in percentage the percentage of triangles the model predicted correctly note that in this case for an input point set the output sequence is in fact set as consequence any permutation of its elements will represent the same triangulation table comparison between lstm lstm with attention and our model on the convex hull problem note that the baselines must be trained on the same that they are tested on means the dataset had uniform distribution over lengths from to ethod lstm attention tr et lstm tr et lstm tr et tr et tr et tr et tr et ground truth predictions lstm ground truth predictions trained accuracy rea fail fail ground truth ground truth tour length is truth truth predictions predictions tour length is figure examples of our model on convex hulls left delaunay center and tsp right trained on points and tested on points failure of the lstm model for convex hulls is shown in note that the baselines can not be applied to different length from training using the model for we obtained an accuracy of and triangle coverage of for the accuracy was and the triangle coverage for we did not produce any precisely correct triangulation but obtained triangle coverage see the middle column of figure for an example for travelling salesman problem we considered the planar symmetric travelling salesman problem tsp which is as the third problem similarly to finding convex hulls it also has sequential outputs given that the ptrnet implements an algorithm it was unclear if it would have enough capacity to learn useful algorithm solely from data as discussed in section it is feasible to generate exact solutions for relatively small values of to be used as training data for larger due to the importance of tsp good and efficient algorithms providing reasonable approximate solutions exist we used three different algorithms in our experiments and see section for references table tour length of the and collection of algorithms on small scale tsp problem trained trained trained trained trained trained trained trained trained ptimal tr et table shows all of our results on tsp the number reported is the length of the proposed tour unlike the convex hull and delaunay triangulation cases where the decoder was unconstrained in this example we set the beam search procedure to only consider valid tours otherwise the model would sometimes output an invalid tour for instance it would repeat two cities or decided to ignore destination this procedure was relevant for for the unconstrained decoding failed less than of the cases and thus was not necessary for which goes beyond the longest sequence seen in training failure rate went up to and for it went up to the first group of rows in the table show the trained on optimal data except for since that is not feasible computationally we trained separate model for each interestingly when using the worst algorithm data to train the our model outperforms the algorithm that is trying to imitate the second group of rows in the table show how the trained on optimal data with to cities can generalize beyond that the results are virtually perfect for and good for but it seems to break for and beyond still the results are far better than chance this contrasts with the convex hull case where we were able to generalize by factor of however the underlying algorithms have greater complexity than log which could explain this conclusions in this paper we described new architecture that allows us to learn conditional probability of one sequence given another sequence where is 
sequence of discrete tokens corresponding to positions in we show that can be used to learn solutions to three different combinatorial optimization problems our method works on variable sized inputs yielding variable sized output dictionaries something the baseline models with or without attention can not do directly even more impressively they outperform the baselines on fixed input size problems to which both the models can be applied previous methods such as rnnsearch memory networks and neural turing machines have used attention mechanisms to process inputs however these methods do not directly address problems that arise with variable output dictionaries we have shown that an attention mechanism can be applied to the output to solve such problems in so doing we have opened up new class of problems to which neural networks can be applied without artificial assumptions in this paper we have applied this extension to rnnsearch but the methods are equally applicable to memory networks and neural turing machines future work will try and show its applicability to other problems such as sorting where the outputs are chosen from the inputs we are also excited about the possibility of using this approach to other combinatorial optimization problems acknowledgments we would like to thank rafal jozefowicz ilya sutskever quoc le and samy bengio for useful discussions we would also like to thank daniel gillick for his help with the final manuscript references ilya sutskever oriol vinyals and quoc le sequence to sequence learning with neural networks in advances in neural information processing systems pages alex graves greg wayne and ivo danihelka neural turing machines arxiv preprint david rumelhart geoffrey hinton and ronald williams learning internal representations by error propagation technical report dtic document anthony robinson an application of recurrent nets to phone probability estimation neural networks ieee transactions on dzmitry bahdanau kyunghyun cho and yoshua bengio neural machine translation by jointly learning to align and translate in iclr arxiv preprint jason weston sumit chopra and antoine bordes memory networks in iclear arxiv preprint alex graves generating sequences with recurrent neural networks arxiv preprint oriol vinyals lukasz kaiser terry koo slav petrov ilya sutskever and geoffrey hinton grammar as foreign language arxiv preprint oriol vinyals alexander toshev samy bengio and dumitru erhan show and tell neural image caption generator in cvpr arxiv preprint jeff donahue lisa anne hendricks sergio guadarrama marcus rohrbach subhashini venugopalan kate saenko and trevor darrell recurrent convolutional networks for visual recognition and description in cvpr arxiv preprint wojciech zaremba and ilya sutskever learning to execute arxiv preprint sepp hochreiter and schmidhuber long memory neural computation nitish srivastava elman mansimov and ruslan salakhutdinov unsupervised learning of video representations using lstms in icml arxiv preprint ray jarvis on the identification of the convex hull of finite set of points in the plane information processing letters ronald graham an efficient algorith for determining the convex hull of finite planar set information processing letters franco preparata and se june hong convex hulls of finite sets of points in two and three dimensions communications of the acm rebay efficient unstructured mesh generation by means of delaunay triangulation and algorithm journal of computational physics richard bellman dynamic programming 
treatment of the travelling salesman problem journal of the acm jacm suboptimal travelling salesman problem tsp solver available at https traveling salesman problem implementation available at https implementation of traveling salesman problem using christofides available at https 
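As an illustration of the dataset construction described in the convex hull section above, the following sketch generates one input/output training pair. The use of scipy's exact hull solver and the handling of the begin/end-of-sequence tokens and traversal orientation are assumptions made for this example, not the released data-generation code.

```python
import numpy as np
from scipy.spatial import ConvexHull

def make_convex_hull_example(n, rng):
    """Generate one (P, C^P) training pair for the convex hull task.

    Points are sampled uniformly from [0, 1] x [0, 1]; the target is the
    sequence of (1-based) indices of the hull vertices, rotated so that the
    hull point with the lowest index comes first, as described above.
    """
    P = rng.uniform(size=(n, 2))
    hull = ConvexHull(P).vertices           # hull vertex indices (0-based)
    start = int(np.argmin(hull))            # rotate: lowest index first
    C = np.roll(hull, -start) + 1           # 1-based positions in P
    return P, C.tolist()

rng = np.random.default_rng(0)
P, C = make_convex_hull_example(50, rng)
```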
associative memory via sparse recovery model arya mazumdar department of ece university of minnesota twin cities arya ankit singh computer science department carnegie mellon university asrawat abstract an associative memory is structure learned from dataset of vectors signals in way such that given noisy version of one of the vectors as input the nearest valid vector from nearest neighbor is provided as output preferably via fast iterative algorithm traditionally binary or hopfield neural networks are used to model the above structure in this paper for the first time we propose model of associative memory based on sparse recovery of signals our basic premise is simple for dataset we learn set of linear constraints that every vector in the dataset must satisfy provided these linear constraints possess some special properties it is possible to cast the task of finding nearest neighbor as sparse recovery problem assuming generic random models for the dataset we show that it is possible to store or exponential number of vectors in neural network of size furthermore given noisy version of one of the stored vectors corrupted in number of coordinates the vector can be correctly recalled using neurally feasible algorithm introduction neural associative memories with exponential storage capacity and large potentially linear fraction of guarantee have been the topic of extensive research for the past three decades networked associative memory model must have the ability to learn and remember an arbitrary but specific set of messages at the same time when presented with noisy query an vector close to one of the messages the system must be able to recall the correct message while the first task is called the learning phase the second one is referred to as the recall phase associative memories are traditionally modeled by what is called binary hopfield networks where weighted graph of size is considered with each vertex representing binary state neuron the of the network are learned from the set of binary vectors to be stored by the hebbian learning rule it has been shown in that to recover the correct vector in the presence of linear in number of errors it is not possible to store more than logn arbitrary binary vectors in the above model of learning in the pursuit of networks that can store exponential in number of messages some works do show the existence of hopfield networks that can store messages however for such hopfield networks only small number of errors in the query render the recall phase unsuccessful the hopfield networks that store message vectors are studied in where the storage capacity of such networks against large fraction of errors is again shown to be linear in there have been multiple efforts to increase the storage capacity of the associative memories to exponential by moving away from the framework of the hopfield networks in term of both the learning and the recall phases these efforts also involve relaxing the requirement of storing the collections of arbitrary messages in gripon and berrou stored number of sparse message vectors in the form of neural cliques another setting where neurons have been assumed to have large albeit constant number this work was done when the author was with the dept of ece university of texas at austin tx usa of states and at the same time the message set or the dataset is assumed to form linear subspace is considered in the most basic premise of the works on neural associative memory is to design graph dynamic xn system such that the vectors to be 
stored are the steady states of the system one way to attain this is to learn set of constraints that every vector in the dataset must satisfy the inclusion relation between the variables in the vectors and the constraints can be represented by bipartite graph with message nodes and constraint nodes in rm cf fig for the recall phase noise removal can be done by running belief propagation on this bipartite graph it can be shown that the correct message is recovered successfully under conditions such as sparsity and expansion properties of the graph this is the main idea that has been explored in in particular under the assumption that the messages belong to linear subspace propose associative memories that can store exponential number of messages while tolerating at most constant number of errors this approach is further refined in where each message vector from the dataset is assumed to comprise overlapping which belong to different linear subspaces the learning phase finds the sparse linear constraints for the subspaces associated with these for the recall phase then belief propagation decoding ideas of codes have been used in karbasi et al show that the associative memories obtained in this manner can store exponential in messages they further show that the recall phase can correct linear in number of random errors provided that the bipartite graph associated with learnt linear constraints during learning phase has certain structural properties figure the complete bipartite graph corresponding to the associative memory here we depict only small fraction of edges the edge weights of the bipartite graph are obtained from the linear constraints satisfied by the messages information can flow in both directions in the graph from message node to constraint node and from constraint node to message node in the steady state message nodes store coordinates of valid message and all the constraint nodes are satisfied the weighted sum of the values stored on the neighboring message nodes according to the associated edge weights is equal to zero note that an edge is relevant for the information flow iff the corresponding edge weight is nonzero our work is very closely related to the above principle instead of finding sparse set of constraints we aim to find set of linear constraints that satisfy the coherence property the property or the restricted isometry property rip indeed for large class of random signal models we show that such constraints must exist and can be found in polynomial time any of the three above mentioned properties provide sufficient condition for recovery of sparse signals or vectors under the assumption that the noise in the query vector is sparse denoising can be done very efficiently via iterative sparse recovery algorithms that are neurally feasible neurally feasible algorithm for our model employs only local computations at the vertices of the corresponding bipartite graph based on the information obtained from their neighboring nodes our techniques and results our main provable results pertain to two different models of datasets and are given below theorem associative memory with dataset model it is possible to store dataset of size exp of vectors in neural network of size such that neurally feasible algorithm can output the correct vector from the dataset given noisy version of the vector corrupted in coordinates theorem associative memory with dataset spanned by random rows of fixed orthonormal basis it is possible to store dataset of size exp of vectors in neural network of size such that neurally 
feasible algorithm can output the correct vector from the dataset given noisy version of the vector corrupted in log coordinates theorem follows from prop and theorem while theorem follows from prop and and by also noting the fact that all over any finite alphabet can be linearly mapped to exp number of points in space of dimensionality the neural feasibility of the recovery follows from the discussion of sec in contrast with our sparse recovery based approach provides associative memories that are robust against stronger error model which comprises adversarial error patterns as opposed to random error patterns even though we demonstrate the associative memories which have storage capacity and can tolerate but polynomial number of errors neurally feasible recall phase is guaranteed to recover the message vector from adversarial errors on the other hand the recovery guarantees in theorem and hold if the bipartite graph obtained during learning phase possesses certain structures degree sequence however it is not apparent in their work if the learnt bipartite graph indeed has these structural properties similar to the aforementioned papers our operations are performed over real numbers we show the dimensionality of the dataset to be large enough as referenced in theorem and as in previous works such as we can therefore find large number of points exponential in the dimensionality with finite integer alphabet that can be treated as the message vectors or dataset our main contribution is to bring in the model of sparse recovery in the domain of associative memory very natural connection the main techniques that we employ are as follows in sec we present two models of ensembles for the dataset the dataset belongs to subspaces that have associated orthogonal subspace with good basis these good basis for the orthogonal subspaces satisfy one or more of the conditions introduced in sec section that provides some background material on sparse recovery and various sufficient conditions relevant to the problem in sec we briefly describe way to obtain good null basis for the dataset the found bases serve as measurement matrices that allow for sparse recovery sec focus on the recall phases of the proposed associative memories the algorithms are for sparse recovery but stated in way that are implementable in neural network in sec we present some experimental results showcasing the performance of the proposed associative memories in appendix we describe another approach to construct associative memory based on the dictionary learning problem definition and mathematical preliminaries notation we use lowercase boldface letters to denote vectors uppercase boldface letters represent matrices for matrix bt denotes the transpose of vector is called if it has only nonzero entries for vector rn and any set of coordinates xi denotes the projection of on to the coordinates of for any set of coordinates similarly for matrix bi denotes the obtained by the rows of that are indexed by the set we use span to denote the subspace spanned by the columns of given an matrix denote the columns of the matrix as bj and assume for all the matrices in this section that the columns are all unit norm kbj definition coherence the mutual coherence of the matrix is defined to be max bj definition property the matrix is supposed to satisfy the property with parameters if khi for every vector rn with bh and any set definition rip matrix is said to satisfy the restricted isometry property with parameters and or the the if for all vectors 
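As a concrete companion to the coherence definition above, here is a small numpy sketch that computes the mutual coherence of a matrix after normalizing its columns; the function name and the random example are illustrative choices, not from the paper.

```python
import numpy as np

def mutual_coherence(B):
    """Mutual coherence: the largest |<b_i, b_j>| over distinct columns,
    after normalizing every column to unit l2 norm."""
    Bn = B / np.linalg.norm(B, axis=0, keepdims=True)
    G = np.abs(Bn.T @ Bn)
    np.fill_diagonal(G, 0.0)   # ignore the trivial <b_j, b_j> = 1 terms
    return G.max()

# A random Gaussian matrix typically has small coherence, which is what the
# random dataset models discussed later rely on.
rng = np.random.default_rng(0)
print(mutual_coherence(rng.standard_normal((64, 256))))
```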
rn next we list some results pertaining to sparse signal recovery guarantee based on these aforementioned parameters the sparse recovery problem seeks the solution that has the smallest number of nonzero entries of the underdetermined system of equation bx where and rn the basis pursuit algorithm for sparse recovery provides the following estimate arg min let xk denote the projection of on its largest coordinates proposition if has property with parameters then we have kx xk the proof of this is quite standard and delegated to the appendix proposition the of the sampling matrix implies for constant kx furthermore it can be easily seen that any matrix is coherence of the sampling matrix xk where is the mutual properties of the datasets in this section we show that under reasonable random models that represent quite general assumptions on the datasets it is possible to learn linear constraints on the messages that satisfy one of the sufficient properties of sparse recovery incoherence property or rip we mainly consider two models for the dataset model span of random set from an orthonormal basis model for the dataset and the property in this section we consider the message sets that are spanned by basis matrix which has its entries distributed according to distribution the distributions are prevalent in machine learning literature and provide broad class of random models to analyze and validate various learning algorithms we refer the readers to for the background on these distribution let be an random matrix that has independent zero mean random variables as its entries we assume that the subspace spanned by the columns of the matrix represents our dataset the main result of this section is the following theorem the dataset above satisfies set of linear constraints that has the property that is for any span the following holds with high probability khi for all such that for and constant the rest of this section is dedicated to the proof of this theorem but before we present the proof we state result from which we utilize to prove theorem proposition theorem let be an matrix whose rows ai are independent isotropic random vectors in then for every with probability at least exp ct one has smin min smax max here and depends on the norms of the rows of the matrix proof of theorem consider an matrix which has independent isotropic random vectors as its rows now for given set we can focus on two disjoint of ai and ai using proposition with we know that with probability at least exp we have smax max ka xk since we know that using the following holds with probability at least exp ax rr we now consider it follows from proposition with that with probability at least exp smin min combining with the observation that at least exp ax the following holds with probability for all rr note that we are interested in showing that for all we have khi for all such that this is equivalent to showing that the following holds for all ax ax for all such that for given we utilize and to guarantee that holds with probability at least exp as long as now given that satisfies holds for all with probability at least en exp exp let consider the following set of parameters and log this set of parameters ensures that holds with overwhelming probability cf remark in theorem we specify one particular set of parameters for which the property holds using and it can be shown that the property inpgeneral holds for the following set of parameters log and log therefore it possible to the number of correctable errors during the recall 
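The basis pursuit estimate used in these recovery guarantees can be checked on small instances by casting it as a linear program. Below is a minimal scipy sketch (a generic solver, not the neurally feasible algorithms of the later sections); the splitting into variables (x, t) is the standard LP reformulation of the l1 objective, and the problem sizes in the example are arbitrary.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(B, y):
    """Solve min ||x||_1 subject to B x = y via the standard LP reformulation:
    variables z = [x, t], constraints -t <= x <= t, objective sum(t)."""
    m, n = B.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])
    I = np.eye(n)
    A_ub = np.block([[I, -I], [-I, -I]])        #  x - t <= 0  and  -x - t <= 0
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([B, np.zeros((m, n))])     #  B x = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * n + [(0, None)] * n, method="highs")
    return res.x[:n]

# Recover a 5-sparse vector from 40 random measurements of a length-128 signal.
rng = np.random.default_rng(1)
B = rng.standard_normal((40, 128)) / np.sqrt(40)
x_true = np.zeros(128)
x_true[rng.choice(128, 5, replace=False)] = rng.standard_normal(5)
print(np.linalg.norm(basis_pursuit(B, B @ x_true) - x_true))
```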
phase denoted by with the dimension of the dataset represented by span of random set of columns of an orthonormal basis next in this section we consider the ensemble of signals spanned by random subset of rows from fixed orthonormal basis assume to be an matrix with orthonormal rows let be random index set such that the vectors in the dataset have form bt for some in other words the dataset span bt in this case constitutes basis matrix for the null space of the dataset since we have selected the set randomly set is also random set with proposition assume that be an orthonormal basis for with the property that maxi consider random matrix obtained by selecting random set of rows of indexed by the set such that then the matrix obeys with probability at least for some fixed constant where therefore we can invoke proposition to conclude that the matrix obeys with with being the largest absolute value among the entries of learning the constraints null space with small coherence in the previous section we described some random ensemble of datasets that can be stored on an associative memory based on sparse recovery this approach involves finding basis for the orthogonal subspace to the message or the signal subspace dataset indeed our learning algorithm simply finds null space of the dataset while obtaining the basis vectors of null we require them to satisfy property rip or small mutual coherence so that the signal can be recovered from its noisy version via the basis pursuit algorithm that can be neurally implemented see sec however for given set of message vectors it is computationally intractable to check if the obtained learnt orthogonal basis has property or rip with suitable parameters associated with these properties mutual coherence of the orthogonal basis on the other hand can indeed be verified in tractable manner further the more iterative soft thresholding algorithm will be successful if null has low coherence this will also lead to fast convergence of the recovery algorithm see sec towards this we describe one approach that ensures the selection of orthogonal basis that has smallest possible mutual coherence subsequently using the mutual coherence based recovery guarantees for sparse recovery this basis enables an efficient recovery phase for the associative memory one underlying assumption on the dataset that we make is its less than full dimensionality that is the dataset must belong to low dimensional subspace so that its is not trivial in practical cases is approximately we use preprocessing step employing principal component analysis pca to make sure that the dataset is we do not indulge in to more detailed description of this phase as it seems to be quite standard see algorithm find with low coherence input the dataset with dimensional vectors an initial coherence and output orthogonal matrix and coherence preprocessing perform pca on find the basis matrix of for do find feasible point of the quadratically constrained quadratic problem qcqp below interior point method ba kbi bj µl where is if no feasible point found then break else µl µl end if end for recall via neurally feasible algorithms we now focus on the second aspect of an associative memory namely the recovery phase for the signal model that we consider in this paper the recovery phase is equivalent to solving sparse signal recovery problem given noisy vector from the dataset we can use the basis of the associated to our dataset that we constructed during the learning phase to obtain by be now given that is 
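Before the coherence-reducing QCQP search in the algorithm above, the basic learning step is simply to compute an orthonormal basis for the orthogonal complement of the span of the dataset. The sketch below does this with an SVD and makes no attempt to minimize coherence, so it should be read as the starting point that the search then refines; function name and sizes are illustrative.

```python
import numpy as np

def learn_null_basis(X, tol=1e-10):
    """Given a dataset X whose columns are the stored vectors (assumed to span
    a low-dimensional subspace of R^n), return an orthonormal basis B of the
    orthogonal complement, so that B^T x = 0 for every column x of X.
    The rows of B^T then serve as the learned linear constraints."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    r = int((s > tol * s[0]).sum())   # numerical rank of the dataset
    return U[:, r:]                   # n x (n - r), orthonormal columns

rng = np.random.default_rng(2)
G = rng.standard_normal((100, 20))        # basis of a 20-dimensional message subspace
X = G @ rng.standard_normal((20, 500))    # 500 messages lying in that subspace
B = learn_null_basis(X)
print(B.shape, np.abs(B.T @ X).max())     # (100, 80), residual close to machine precision
```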
sufficiently sparse enough and the matrix obeys the properties of sec we can solve for using sparse recovery algorithm subsequently we can remove the error vector from the noisy signal to construct the underlying message vector there is plethora of algorithms available in the literature to solve this problem however we note that for the purpose of an associative memory the recovery phase should be neurally feasible and computationally simple in other words each node or storage unit should be able to recover the coordinated associated to it locally by applying simple computation on the information received from its neighboring nodes potentially in an iterative manner recovery via iterative soft thresholding ist algorithm among the various sparse recovery algorithms in the literature iterative soft thresholding ist algorithm is natural candidate for implementing the recovery phase of the associative memories with underlying setting the ist algorithm tries to solve the following unconstrained least square problem which is closely related to the basis pursuit problem described in and arg min kbe for the ist algorithm its iteration is described as follows ist et bt bet here is constant and sgn sgn sgn xn xn denotes the soft thresholding or shrinkage operator note that the ist algorithm as described in is neurally feasible as it involves only performing matrix vector multiplications and soft thresholding coordinate in vector independent of the values of other coordinates in the vector in appendix we describe in details how the ist algorithm can be performed over bipartite neural network with as its edge weight matrix under suitable assumption on the coherence of the measurement matrix the ist algorithm is also known to converge to the correct vector in particular maleki allows the thresholding parameter to be varied in every iteration such that all but at most the largest coordinates in terms of their absolute values are mapped to zero by the soft thresholding operation in this setting maleki shows that the solution of the ist algorithm recovers the correct support of the optimal solution in finite steps and subsequently converges to the true solution very fast however we are interested in analysis of the ist algorithm in setting where thresholding parameter is kept suitable constant depending on other system parameters so that the algorithm remains neurally feasible towards this we note that there exists general analysis of the ist algorithm even without the coherence assumption proposition theorem let et be as defined in with max then for any et function defined in here kr is the objective recovery via bregman iterative algorithm recall that the basis pursuit algorithm refers to the following optimization problem arg min be even though the ist algorithm as described in the previous subsection solves the problem in the parameter value needs to be set small enough so that the recovered solution nearly satisfies the constraint be in however if we insist on recovering the solution which exactly meets the constraint one can employ the bregman iterative algorithm from the bregman iterative algorithm relies on the bregman distance based on which is defined as follows hp where is of the at the point the iteration of the bregman iterative algorithm is then defined as follows pt arg min et kbe arg min kbe ket pt et pt bt note that for the iteration the objective function in is essentially equivalent to the objective function in therefore each iteration of the bergman iterative algorithm can be solved 
using the IST algorithm. It has been shown that after a finite number of Bregman iterations one recovers an exact solution of the basis pursuit problem. From the convergence discussion above we know that the IST algorithm is neurally feasible; furthermore, the update of p^t is neurally feasible as well, since it only involves matrix-vector multiplications in the spirit of the IST step. Since each iteration of the Bregman iterative algorithm relies only on these two operations, the Bregman iterative algorithm is neurally feasible too. The neural feasibility of Bregman-type iterations has been noted in earlier work, although the neural structures employed there differ from ours. Note that the maximum eigenvalue of the matrix B^T B serves as a Lipschitz constant for the gradient of the quadratic data-fit term, which is how the step size of the IST subroutine is chosen. A compact sketch of both iterations is given below.
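A minimal numpy sketch covering both the IST inner solver and the Bregman outer loop; the function names, iteration counts, and the step size (the largest eigenvalue of B^T B) are illustrative choices, and no convergence test is included.

```python
import numpy as np

def soft_threshold(v, tau):
    # Entrywise shrinkage: sign(v) * max(|v| - tau, 0).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ist(B, z, lam, n_iter=300, e0=None):
    """Iterative soft thresholding for min_e 0.5*||B e - z||^2 + lam*||e||_1.
    Each step is a matrix-vector product followed by an entrywise shrinkage,
    which is what makes the iteration neurally feasible."""
    c = np.linalg.eigvalsh(B.T @ B).max()      # Lipschitz constant of the gradient
    e = np.zeros(B.shape[1]) if e0 is None else e0.copy()
    for _ in range(n_iter):
        e = soft_threshold(e - (B.T @ (B @ e - z)) / c, lam / c)
    return e

def bregman(B, z, lam, n_outer=20, n_inner=300):
    """Bregman iterations with IST as the inner solver: repeatedly solve the
    regularized problem and add the unexplained residual back to the data,
    driving the iterates toward a solution of min ||e||_1 s.t. B e = z."""
    e = np.zeros(B.shape[1])
    zk = z.copy()
    for _ in range(n_outer):
        e = ist(B, zk, lam, n_iter=n_inner, e0=e)
        zk = zk + (z - B @ e)                  # add back the residual
    return e
```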
[Figure: probability of failure versus sparsity of the error vector during the recall phase, for four settings: Gaussian or Bernoulli basis matrix combined with Gaussian or discrete noise. The PD curves refer to a primal-dual algorithm for the linear program associated with basis pursuit; the BI curves refer to the Bregman iterative algorithm described above.]
Experimental results. In this section we demonstrate the feasibility of the associative memory framework on computer-generated data, along the lines of the discussion in the dataset-model section. We first sample a basis matrix with i.i.d. entries, under two distributions: Gaussian and Bernoulli. The message vectors to be stored are assumed to be spanned by the columns of this sampled matrix. For the learning phase we find a good basis for the subspace orthogonal to the span of these columns. For the noise added during the recall phase we again consider two models: Gaussian noise, and discrete noise whose nonzero elements take values in a fixed finite set. For the recall phase we employ the Bregman iterative (BI) algorithm with the IST algorithm as a subroutine, and we also plot the performance of a primal-dual (PD) linear-programming solver for the same recovery problem; this lets us gauge the loss incurred by restricting ourselves to a neurally feasible recovery algorithm (the BI algorithm in our case). We consider message sets of two different dimensions and run repeated trials of the recovery algorithms for each parameter setting to estimate the probability of failure of exact recovery of the error vector. With a Gaussian basis matrix and unit-variance, zero-mean Gaussian noise, the proposed associative memory allows exact recovery of error vectors up to a certain sparsity level, corroborating the analysis of the dataset-model section, and the BI algorithm performs very close to the PD algorithm. With a Gaussian basis and the discrete noise model, the BI algorithm still recovers the noise vector exactly up to a particular sparsity level, but its performance is worse than that of the PD algorithm. The performance of the recall phase with Bernoulli basis matrices is similar to that with Gaussian basis matrices.
References
Agarwal, Anandkumar, Jain, Netrapalli, and Tandon. Learning sparsely used overcomplete dictionaries via alternating minimization. CoRR.
Arora, Ge, Ma, and Moitra. Simple, efficient, and neural algorithms for sparse coding. CoRR.
Arora, Ge, and Moitra. New algorithms for learning incoherent and overcomplete dictionaries. arXiv preprint.
Beck and Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences.
Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathématique.
Candès, Romberg, and Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Inf. Theory.
Candès and Tao. Signal recovery from random projections: universal encoding strategies. IEEE Trans. on Inf. Theory.
Donoho. Compressed sensing. IEEE Trans. on Inf. Theory.
Donoho, Maleki, and Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences.
Foucart and Rauhut. A Mathematical Introduction to Compressive Sensing. Birkhäuser, Basel.
Gripon and Berrou. Sparse neural networks with large learning diversity. IEEE Transactions on Neural Networks.
Gross and Mézard. The simplest spin glass. Nuclear Physics.
Hebb. The Organization of Behavior: A Neuropsychological Theory. Psychology Press.
Hillar and Tran. Robust exponential memory in Hopfield networks. arXiv preprint.
Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences.
Hu, Genkin, and Chklovskii. A network of spiking neurons for computing sparse representations in an energy-efficient way. Neural Computation.
Jankowski, Lozowski, and Zurada. Complex-valued multistate neural associative memory. IEEE Transactions on Neural Networks.
Karbasi, Salavati, and Shokrollahi. Convolutional neural associative memories: massive capacity with noise tolerance. CoRR.
Kumar, Salavati, and Shokrollahi. Exponential pattern retrieval capacity with non-binary associative memory. IEEE Information Theory Workshop (ITW).
Maleki. Coherence analysis of iterative thresholding algorithms. Annual Allerton Conference on Communication, Control, and Computing.
McEliece and Posner. The number of stable points of an infinite-range spin glass memory. Telecommunications and Data Acquisition Progress Report.
McEliece, Posner, Rodemich, and Venkatesh. The capacity of the Hopfield associative memory. IEEE Transactions on Information Theory.
Muezzinoglu, Guzelis, and Zurada. A new design method for the complex-valued multistate Hopfield associative memory. IEEE Transactions on Neural Networks.
Olshausen and Field. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research.
Salavati and Karbasi. Multi-level error-resilient neural networks. IEEE International Symposium on Information Theory (ISIT).
Tanaka and Edwards. Analytic theory of the ground state properties of a spin glass: Ising spin glass. Journal of Physics F: Metal Physics.
Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint.
Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory.
Yin, Osher, Goldfarb, and Darbon. Bregman iterative algorithms for l1-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences.
robust spectral inference for joint stochastic matrix factorization david mimno dept of information science cornell university ithaca ny mimno moontae lee david bindel dept of computer science cornell university ithaca ny moontae bindel abstract spectral inference provides fast algorithms and provable optimality for latent topic analysis but for real data these algorithms require additional heuristics and even then often produce unusable results we explain this poor performance by casting the problem of topic inference in the framework of joint stochastic matrix factorization jsmf and showing that previous methods violate the theoretical conditions necessary for good solution to exist we then propose novel rectification method that learns high quality topics and their interactions even on small noisy data this method achieves results comparable to probabilistic techniques in several domains while maintaining scalability and provable optimality introduction summarizing large data sets using pairwise frequencies is powerful tool for data mining objects can often be better described by their relationships than their inherent characteristics communities can be discovered from friendships song genres can be identified from in playlists and neural word embeddings are factorizations of pairwise cooccurrence information recent anchor word algorithms perform spectral inference on statistics for inferring topic models statistics can be calculated using single parallel pass through training corpus while these algorithms are fast deterministic and provably guaranteed they are sensitive to observation noise and small samples often producing effectively useless results on real documents that present no problems for probabilistic algorithms we cast this general problem area area area of learning overlapping latent clusters as matrix factorization jsmf subset of matrix factorization that contains topic modeling as special case we explore the conditions necessary for inference from figure visualizations show the convex hull occurrence statistics and show found by anchor words left and better convex hull middle that the anchor words found by discovering anchor words on rectified space right rithms necessarily violate such conditions then we propose rectified algorithm that matches the performance of probabilistic on small and noisy losing efficiency and provable guarantees validating on both real and synthetic data we demonstrate that our rectification not only produces better clusters but also unlike previous work learns meaningful cluster interactions let the matrix represent the of pairs drawn from objects cij is the joint probability for pair of objects and our goal is to discover latent clusters by approximately decomposing bab is the matrix in which each column corresponds to cluster and bik is the probability of drawing an object conditioned on the object belonging to the cluster and is the matrix in which akl represents the joint probability of pairs of clusters we call the matrices and due to their correspondence to joint distributions is example applications are shown in table table jsmf applications with equivalents anchor word algorithms solve jsmf problems usdomain object cluster basis ing separability assumption document word topic anchor word each topic contains at least image pixel segment pure pixel one anchor word that has network user community representative probability exlegislature member partisan playlist song genre signature song clusively in that topic the algorithm uses the 
patterns of the anchor words as summary basis for the patterns of all other words the initial algorithm is theoretically sound but unable to produce matrix due to unstable matrix inversions subsequent algorithm fixes negative entries in but still produces large negative entries in the estimated matrix as shown in figure the proposed algorithm infers valid interactions requirements for factorization in this section we review the probabilistic and statistical structures of jsmf and then define geometric structures of matrices required for successful factorization rn is matrix constructed from training examples each of which contain some subset of objects we wish to find latent clusters by factorizing into matrix rn and matrix satisfying bab probabilistic structure figure shows the event space of our model the distribution over pairs of clusters is generated first from stochastic process with hyperparameter if the training example contains total of nm objects our model views the example as consisting of all possible nm nm pairs of for each of these pairs cluster assignments are sampled from the selected distribution then an actual object pair is drawn with respect to the corresponding cluster assignments note that this process does not explain how each training example is generated from model but shows how our model understands the objects in the training examples bk nm nm figure the jsmf event space differs from lda jsmf deals only with pairwise events and does not generate following our model views as set of parameters rather than random the primary learning task is to estimate we then estimate to recover the hyperparameter due to the conditional independence or the factorization bab is equivalent to xx under the separability assumption each cluster has basis object sk such that sk and sk in matrix terms we assume the submatrix of comprised of due to the assumption every object can pair with any other object in that example except itself one implication of our work is better understanding the the diagonal entries in the matrix in lda each column of is generated from known distribution bk dir the rows with indices sk is diagonal as these rows form basis for the row space of the assumption implies rank providing identifiability to the factorization this assumption becomes crucial for inference of both and note that jsmf factorization is unique up to column permutation meaning that no specific ordering exists among the discovered clusters equivalent to probabilistic topic models see the appendix statistical structure let be known distribution of distributions from which cluster distribution is sampled for each training example saying wm we have samples wm which are not directly observable defining the posterior matrix pm wm wm and the expectation wm wm lemma in showed that as cm denote the posterior for the training example by and all examples by then cm bwm wm and cm thus wm wm denote the noisy observation for the training example by cm and all examples by let be matrix of topics we will construct cm so that is an unbiased estimator of thus as geometric structure though the separability assumption allows us to identify even from the noisy observation we need to throughly investigate the structure of cluster interactions this is because it will eventually be related to how much useful information the between corresponding anchor bases contains enabling us to best use our training data say dn is the set of doubly matrices entrywise and positive semidefinite psd claim dn and dn proof take any 
vector rk as is defined as sum of wm wm wm wm thus psdk in addition kl for all proving dn is analogous by the linearity of expectation relying on double of equation implies not only the structure of but also double of by similar proof see the appendix the anchor word algorithms in consider neither double of cluster interactions nor its implication on statistics indeed the empirical matrices collected from limited data are generally indefinite and whereas the posterior must be positive semidefinite and our new approach will efficiently enforce double nonnegativity and of the matrix based on the geometric property of its posterior behavior we will later clarify how this process substantially improves the quality of the clusters and their interactions by eliminating noises and restoring missing information rectified anchor words algorithm in this section we describe how to estimate the matrix from the training data and how to rectify so that it is and doubly we then decompose the rectified in way that preserves the doubly structure in the cluster interaction matrix means the rank of the matrix whereas rank means the usual rank pm this convergence is not trivial while wm wm as by the central limit theorem generating let hm be the vector of object counts for the training example and let pm bwm where wm is the document latent topic distribution then hm is assumed pn to be sample from multinomial distribution hm multi nm pm where nm hm and recall hm nm pm nm bwm and cov hm nm diag pm pm ptm as in we generate the for the example by cm hm hm diag hm nm nm the diagonal penalty in eq cancels out the diagonal matrix term in the matrix making the estimator unbiased putting dm nm nm that is cm hm hm diag cov diag dm dm thus by the linearity of expectation rectifying while is an unbiased estimator for in our model in reality the two matrices often differ due to mismatch between our model assumptions and the or due to error in estimation from limited data the computed is generally with many negative eigenvalues causing large approximation error as the posterior must be lowrank doubly and we propose two rectification methods diagonal completion dc and alternating projection ap dc modifies only diagonal entries so that becomes and while ap enforces modifies every entry and enforces the same properties as well as positive as our empirical results strongly favor alternating projection we defer the details of diagonal completion to the appendix based on the desired property of the posterior we seek to project our estimator onto the set of doubly low rank matrices alternating projection methods like dykstra algorithm allow us to project onto an intersection of finitely many convex sets using projections onto each individual set in turn in our setting we consider the intersection of three sets of symmetric matrices the elementwise matrices the normalized matrices orn whose entry sum is equal to and the positive matrices with rank psdn we project onto these three sets as follows cij πpsdn λk πn orn πn max where λu is an eigendecomposition and is the matrix modified so that all negative eigenvalues and any but the largest positive eigenvalues are set to zero truncated eigendecompositions can be computed efficiently and the other projections are likewise efficient while and orn are convex psdn is not however show that alternating projection with set still works under certain conditions guaranteeing local convergence thus iterating three projections in turn until the convergence rectifies to be in the desired space 
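A compact numpy sketch of this pipeline fragment is given below: it forms the unbiased co-occurrence estimator from per-document count vectors and then cycles through the three projections (PSD with rank at most K, entry sum equal to one, entrywise nonnegativity). This is plain cyclic projection as written above, with illustrative names and a fixed iteration count; it omits Dykstra-style correction terms and any convergence check.

```python
import numpy as np

def cooccurrence(counts):
    """Unbiased co-occurrence estimator: for each document with count vector h
    and length n, use (h h^T - diag(h)) / (n (n - 1)), then average over documents."""
    C = np.zeros((counts.shape[1],) * 2)
    for h in counts:
        n = h.sum()
        C += (np.outer(h, h) - np.diag(h)) / (n * (n - 1))
    return C / counts.shape[0]

def rectify(C, K, n_iter=150):
    """Cycle through projections onto (i) PSD matrices of rank <= K,
    (ii) matrices whose entries sum to one, (iii) entrywise nonnegative matrices."""
    C = C.copy()
    for _ in range(n_iter):
        # (i) keep the K largest nonnegative eigenvalues (truncated eigendecomposition)
        lam, U = np.linalg.eigh((C + C.T) / 2)
        lam[:-K] = 0.0                        # eigh returns eigenvalues in ascending order
        lam = np.maximum(lam, 0.0)
        C = (U * lam) @ U.T
        # (ii) shift every entry so the total sums to one
        C += (1.0 - C.sum()) / C.size
        # (iii) clip negative entries
        C = np.maximum(C, 0.0)
    return C

rng = np.random.default_rng(3)
docs = 1 + rng.poisson(0.5, size=(200, 50))   # +1 keeps every document length positive
C_rect = rectify(cooccurrence(docs), K=5)
```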
we will show how to satisfy such conditions and the convergence behavior in section selecting basis the first step of the factorization is to select the subset of objects that satisfy the separability assumption we want the best rows of the matrix so that all other rows lie nearly in the convex hull of the selected rows use the gramschmidt process to select anchors which computes pivoted qr decomposition but did not utilize the sparsity of to scale beyond small vocabularies they use random projections that approximately preserve distances between rows of for all experiments we use new pivoted qr algorithm see the appendix that exploits sparsity instead of using random projections and thus preserves deterministic recovering after finding the set of basis objects we can infer each entry of by bayes rule as in let be the coefficients that reconstruct the row of in terms of the basis rows corresponding to since bik there is no reason to expect real data to be generated from topics much less exactly latent topics to effectively use random projections it is necessary to either find proper dimensions based on multiple trials or perform random projection multiple times and merge the resulting anchors we can use the corpus frequencies cij to estimate bik thus the main task for this step is to solve qps to infer set of such coefficients for each object we use an exponentiated gradient algorithm to solve the problem similar to note that this step can be efficiently done in parallel for each object recovering recovered by minimizing kc bab kf but the inferred generally has many negative entries failing to model the probabilistic interaction between topics while we can further project onto the matrices this produces large figure the algorithm of first panel produces negative cluster probabilities probabilistic reconstruction alone this approximation error paper second panel removes negative entries but has no we consider an alternate recovery diagonals and does not sum to one trying after rectification this method that again leverages the paper third panel produces valid joint stochastic matrix separability assumption let css be the submatrix whose rows and columns correspond to the selected objects and let be the diagonal submatrix of rows of corresponding to then css dadt dad css this approach efficiently recovers matrix mostly based on the information between corresponding anchor basis and produces no negative entries due to the stability of diagonal matrix inversion note that the principle submatrices of psd matrix are also psd hence if psdn then css psdk thus not only is the recovered an unbiased estimator for but also it is now doubly as dn after the experimental results our rectified anchor words algorithm with alternating projection fixes many problems in the baseline anchor words algorithm while matching the performance of gibbs sampling and maintaining spectral inference determinism and independence from corpus size we evaluate direct measurement of matrix quality as well as indicators of topic utility we use two text datasets nips full papers and new york times news we eliminate minimal list of english stop words and prune rare words based on scores and remove documents with fewer than five tokens after vocabulary curation we also prepare two datasets users movie reviews from the movielens and music playlists from the complete we perform similar vocabulary curation and document tailoring with the exception of frequent elimination playlists often contain the same songs multiple times but 
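The two linear-algebra steps just described are easy to prototype: column-pivoted QR on the row-normalized co-occurrence matrix selects the anchor rows, and once B is available, A can be read off the anchor-row submatrix C_SS by undoing the diagonal matrix D of anchor entries of B. The sketch below uses a dense pivoted QR from scipy rather than the sparsity-aware variant described above, and it assumes the columns of the recovered B are ordered so that topic k corresponds to anchors[k].

```python
import numpy as np
from scipy.linalg import qr

def select_anchors(C, K):
    """Greedy anchor selection: column-pivoted QR on the transposed, row-normalized
    co-occurrence matrix; keep the first K pivot rows. (Dense stand-in for the
    sparsity-exploiting pivoted QR used in the paper.)"""
    C_bar = C / C.sum(axis=1, keepdims=True)
    _, _, piv = qr(C_bar.T, pivoting=True)
    return piv[:K]

def recover_A(C, B, anchors):
    """Given C ~ B A B^T and a recovered B, use the anchor-row submatrix:
    C_SS = D A D with D = diag of anchor entries of B, so A = D^{-1} C_SS D^{-1}.
    Assumes anchors[k] is the anchor object of topic k."""
    d = np.array([B[s, k] for k, s in enumerate(anchors)])
    C_SS = C[np.ix_(anchors, anchors)]
    A = C_SS / np.outer(d, d)
    return A / A.sum()   # renormalize the entry sum to one to absorb estimation noise
```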
users are unlikely to review the same movies more than once so we augment the movie dataset so that each review contains stars number of movies based on the rating information that varies from stars to stars statistics of our datasets are shown in table we run dc times for each experiment randomly permuting the order of objects and using the median dataset avg len results to minimize the effect of different orderings nips we also run iterations of ap alternating psdn nytimes orn and in turn for probabilistic gibbs movies sampling we use the mallet with the standard option songs doing iterations all metrics are evaluated against the original not against the rectified whereas we use and inferred from the rectified table statistics of four datasets we later realized that essentially same approach was previously tried in but it was not able to generate valid matrix as shown in the middle panel of figure https http http qualitative results although report comparable results to probabilistic algorithms for lda the algorithm fails under many circumstances the algorithm prefers rare and unusual anchor words that form poor basis so topic clusters consist of the same terms repeatedly as shown in the upper third of table in contrast our algorithm with ap rectification successfully learns themes similar to the probabilistic algorithm one can also verify that cluster interactions given in the third panel of figure explain how the five topics correlate with each other similar to we visualize the table each line is topic from nips previous work five anchor words in the simply repeats the most frequent words in the corpus five times occurrence space after pca arora et al baseline of each panel in figure neuron layer hidden recognition signal cell noise neuron layer hidden cell signal representation noise shows embedding of the neuron layer cell hidden signal noise dynamic nips vocabulary as blue dots and neuron layer cell hidden control signal noise five selected anchor words in red neuron layer hidden cell signal recognition noise the first plot shows standard anthis paper ap chor words and the original coneuron circuit cell synaptic signal layer activity occurrence space the second plot control action dynamic optimal policy controller reinforcement shows anchor words selected from recognition layer hidden word speech image net the rectified space overlaid on the cell field visual direction image motion object orientation original space the gaussian noise hidden approximation matrix bound examples third plot shows the same anchor probabilistic lda gibbs words as the second plot overlaid neuron cell visual signal response field activity on the space the reccontrol action policy optimal reinforcement dynamic robot recognition image object feature word speech features tified anchor words provide better hidden net layer dynamic neuron recurrent noise coverage on both spaces explaingaussian approximation matrix bound component variables ing why we are able to achieve reasonable topics even with rectification also produces better clusters in the movie dataset each cluster is notably more and than the clusters from the original algorithm when for example we verify cluster of walt disney animations mostly from the and cluster of fantasy movies represented by lord of the rings films similar to clusters found by probabilistic gibbs sampling the baseline algorithm repeats pulp fiction and silence of the lambs times quantitative results we measure the intrinsic quality of inference and summarization with respect to 
the jsmf objectives as well as the extrinsic quality of resulting topics lines correspond to four methods baseline for the algorithm in the previous work without any rectification dc for diagonal completion ap for alternating projection and gibbs for gibbs sampling anchor objects should form good basis for objects we measure recovery error pn pk kc sk with respect to the original matrix not the rectified matrix ap reduces error in almost all cases and is more effective than dc although we expect error to decrease as we increase the number of clusters reducing recovery error for fixed by choosing better anchors is extremely difficult no other subset selection algorithm decreased error by more than good matrix factorization should have small elementwise approximation error kc bab kf dc and ap preserve more of the information in the original matrix than the baseline method especially when is we expect nontrivial interactions between clusters even when not explicitly model them as in greater pk diagonal dominancy indicates lower correlation between clusters ap and gibbs results are similar we do not report probability because we find that relative results are determined by smoothing parameters pk specificity kl kp measures how much each cluster is distinct from the corpus distribution when anchors produce poor basis the conditional distribution of in the nytimes corpus is large error each element is around due to the number of normalized entries dominancy in songs corpus lacks any baseline results at because dominancy is undefined if an algorithm picks song that occurs at most once in each playlist as basis object in this case the original construction of css and hence of has zero diagonal element making dominancy nan nips recovery approximation dominancy specificity dissimilarity coherence category ap baseline dc gibbs nytimes recovery approximation specificity dominancy baseline dc ap category coherence dissimilarity gibbs movies recovery approximation dominancy specificity dissimilarity coherence gibbs baseline dc ap category songs recovery approximation dominancy specificity category ap dc gibbs baseline coherence dissimilarity figure experimental results on real dataset the indicates logk where varies by up to topics and by up to or topics whereas the baseline algorithm largely fails with small and does not infer quality and even with large alternating projection ap not only finds better basis vectors recovery but also shows stable and comparable behaviors to probabilistic inference gibbs in every metric ters given objects becomes uniform making similar to dissimilarity counts the average number of objects in each cluster that do not occur in any other cluster top objects our experiments validate that ap and gibbs yield comparably specific and distinct topics while baseline and dc simply repeat the corpus distribution as in table coherence pk opk penalizes topics that assign high probability rank to log words that do not occur together frequently ap produces results close to gibbs sampling and far from the baseline and dc while this metric correlates with human evaluation of clusters worse coherence can actually be better because the metric does not penalize repetition in experiments ap matches gibbs sampling and outperforms the baseline but the discrepancies in topic quality metrics are smaller than in the real experiments see appendix we speculate that data is more than real data explaining why issues were not recognized previously analysis of algorithm why does ap work before 
rectification diagonals of the empirical matrix may be far from correct bursty objects yield diagonal entries that are too large extremely rare objects that occur at most once per document yield zero diagonals rare objects are problematic in general the corresponding rows in the matrix are sparse and noisy and these rows are likely to be selected by the pivoted qr because rare objects are likely to be anchors the matrix css is likely to be highly diagonally dominant and provides an uninformative picture of topic correlations these problems are exacerbated when is small relative to the effective rank of so that an early choice of poor anchor precludes better choice later on and when the number of documents is small in which case the empirical is relatively sparse and is strongly affected by noise to mitigate this issue run exhaustive grid search to find document frequency cutoffs to get informative anchors as model performance is inconsistent for different cutoffs and search requires for each case it is nearly impossible to find good heuristics for each dataset and number of topics fortunately psd matrix can not have too many rows since this violates the low rank property nor can it have diagonal entries that are small relative to since this violates positive because the anchor word assumption implies that rank and ordinary rank are the same the ap algorithm ideally does not remove the information we wish to learn rather the projection in ap suppresses the influence of small numbers of noisy rows associated with rare words which may not be well correlated with the others and the psd projection in ap recovers missing information in diagonals as illustrated in the dominancy panel of the songs corpus in figure ap shows valid dominancies even after in contrast to the baseline algorithm why does ap converge ap enjoys local linear convergence if the initial is near the convergence point psdn is at and strong regularity holds at for the first condition recall that we rectified by pushing toward which is the ideal convergence point inside the intersection since as shown in is close to as proxregular are subsets of sets so of psdn at is sufficient for the second condition for permutation invariant rn the spectral set of symmetric matrices is defined as sn λn and is if and only if is th let be since each element in has exactly positive components and all others are zero psdn by the definition of and pm is locally unique almost everywhere satisfying the second condition almost surely as the intersection of the convex set psdn and the smooth manifold of rank matrices psdn is smooth manifold almost everywhere checking the third condition priori is challenging but we expect noise in the empirical to prevent an irregular solution following the argument of numerical example in we expect ap to converge locally linearly and we can verify local convergence of ap in practice empirically the ratio of average distances between two iterations are always on the nytimes dataset see the appendix and other datasets were similar note again that our rectified is result of pushing the empirical toward the ideal because approximation factors of are all computed based on how far and its shape could be distant from all provable guarantees of hold better with our rectified related and future work jsmf is specific matrix factorization nmf performing spectral inference exploit similar separable structure for nmf problmes to tackle hyperspectral unmixing problems assume pure pixels in computer vision in more general nmf 
without such structures rescal studies tensorial extension of similar factorization and symnmf infers bb rather than bab for topic modeling performs spectral inference on third moment tensor assuming topics are uncorrelated as the core of our algorithm is to rectify the input matrix it can be combined with several recent developments proposes two regularization methods for recovering better nonlinearly projects to space via and achieves better anchors by finding the exact anchors in that space performs multiple random projections to spaces and recovers approximate anchors efficiently by strategy in addition our work also opens several promising research directions how exactly do anchors found in the rectified form better bases than ones found in the original space since now the matrix is again doubly and can we learn in hierarchical model by recursively applying jsmf to acknowledgments this research is supported by nsf grant hcc we thank adrian lewis for valuable discussions on ap convergence set is if pm is locally unique references alan mislove bimal viswanath krishna gummadi and peter druschel you are who you know inferring user profiles in online social networks in proceedings of the acm international conference of web search and data mining wsdm new york ny february shuo chen moore turnbull and joachims playlist prediction via metric embedding in acm sigkdd conference on knowledge discovery and data mining kdd pages jeffrey pennington richard socher and christopher manning glove global vectors for word representation in emnlp omer levy and yoav goldberg neural word embedding as implicit matrix factorization in nips arora ge and moitra learning topic models going beyond svd in focs sanjeev arora rong ge yonatan halpern david mimno ankur moitra david sontag yichen wu and michael zhu practical algorithm for topic modeling with provable guarantees in icml hofmann probabilistic latent semantic analysis in uai pages blei ng and jordan latent dirichlet allocation journal of machine learning research pages preliminary version in nips jamesp boyle and richardl dykstra method for finding projections onto the intersection of convex sets in hilbert spaces in advances in order restricted statistical inference volume of lecture notes in statistics pages springer new york adrian lewis luke and jrme malick local linear convergence for alternating and averaged nonconvex projections foundations of computational mathematics griffiths and steyvers finding scientific topics proceedings of the national academy of sciences moontae lee and david mimno embeddings for interpretable topic inference in proceedings of the conference on empirical methods in natural language processing emnlp pages association for computational linguistics mary broadbent martin brown kevin penner ipsen and rehman subset selection algorithms randomized deterministic siam undergraduate research online blei and lafferty correlated topic model of science annals of applied statistics pages david mimno hanna wallach edmund talley miriam leenders and andrew mccallum optimizing semantic coherence in topic models in emnlp daniilidis lewis malick and sendov of spectral functions and spectral sets journal of convex analysis christian thurau kristian kersting and christian bauckhage yes we can simplex volume maximization for descriptive matrix factorization in cikm pages abhishek kumar vikas sindhwani and prabhanjan kambadur fast conical hull algorithms for nearseparable matrix factorization corr pages jos nascimento student member and jos 
bioucas dias vertex component analysis fast algorithm to unmix hyperspectral data ieee transactions on geoscience and remote sensing pages gomez le borgne pascal allemand christophe delacourt and patrick ledru method versus independent component analysis for lithological identification in hyperspectral imagery international journal of remote sensing maximilian nickel volker tresp and kriegel model for collective learning on data in proceedings of the international conference on machine learning icml pages acm da kuang haesun park and chris ding symmetric nonnegative matrix factorization for graph clustering in sdm siam omnipress anima anandkumar dean foster daniel hsu sham kakade and liu spectral algorithm for latent dirichlet allocation in advances in neural information processing systems annual conference on neural information processing systems proceedings of meeting held december lake tahoe nevada united pages thang nguyen yuening hu and jordan anchors regularized adding robustness and extensibility to scalable algorithms in association for computational linguistics tianyi zhou jeff bilmes and carlos guestrin learning by anchoring conical hull in advances in neural information processing systems pages 
fast provable algorithms for isotonic regression in all rasmus kyng dept of computer science yale university anup school of computer science georgia tech sushant sachdeva dept of computer science yale university sachdeva abstract given directed acyclic graph and set of values on the vertices the isotonic regression of is vector that respects the partial order described by and minimizes kx yk for specified norm this paper gives improved algorithms for computing the isotonic regression for all weighted with rigorous performance guarantees our algorithms are quite practical and variants of them can be implemented to run fast in practice introduction directed acyclic graph dag defines partial order on where precedes if there is directed path from to we say that vector rv is isotonic with respect to if it is weakly mapping of into let ig denote the set of all that are isotonic with respect to it is immediate that ig can be equivalently defined as follows ig rv xu xv for all given dag and norm on the isotonic regression of observations is given by ig that minimizes kx yk such monotonic relationships are fairly common in data they allow one to impose only weak assumptions on the data the typical height of young girl child is an increasing function of her age and the heights of her parents rather than more constrained parametric model isotonic regression is an important nonparametric regression method that has been studied since the it has applications in diverse fields such as operations research and signal processing in statistics it has several applications and the statistical properties of isotonic regression under the have been well studied particularly over linear orderings see and references therein more recently isotonic regression has found several applications in learning it was used by kalai and sastry to provably learn generalized linear models and single index models and by zadrozny and elkan and narasimhan and agarwal towards constructing binary class probability estimation models the most common norms of interest are weighted defined as wvp kzkw wv where wv is the weight of vertex in this paper we focus on algorithms for isotonic regression under weighted such algorithms have been applied to large from microarrays and from the web code from this work is available at https part of this work was done when this author was graduate student at yale university given dag and observations rv our regression problem can be expressed as the following convex program min kx ykw such that xu xv for all our results let and we ll assume that is connected and hence we give unified framework for algorithms that provably solve the isotonic regression problem for the following is an informal statement of our main theorem theorem in this regard assuming wv are bounded by poly theorem informal there is an algorithm that given dag observations and runs in time log and computes an isotonic xa lg ig such that kxa lg ykw min kx ykw the previous best time bounds were nm log for and nm log for for unlike for the isotonic regression problem need not have unique solution there are several specific solutions that have been studied in the literature see for detailed discussion in this paper we show that some of them ax in and avg to be precise can be computed in time linear in the size of theorem there is an algorithm that given dag set of observations rv and weights runs in expected time and computes an isotonic xi nf ig such that kxi nf ykw min kx ykw our algorithm achieves the best possible running time this 
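For small instances, the convex program above (a weighted norm objective under the edge constraints x_u <= x_v) can be handed to a generic solver, which is a useful correctness reference even though it comes nowhere near the running times claimed in the theorems. A minimal cvxpy sketch of the weighted l2 case on an arbitrary DAG follows; cvxpy, the function name, and the toy diamond DAG are illustrative choices, not the interior point method developed in this paper.

```python
import numpy as np
import cvxpy as cp

def isotonic_l2(edges, y, w):
    """Weighted l2 isotonic regression on a DAG: minimize sum_v w_v (x_v - y_v)^2
    subject to x_u <= x_v for every directed edge (u, v)."""
    x = cp.Variable(len(y))
    objective = cp.Minimize(cp.sum(cp.multiply(w, cp.square(x - y))))
    constraints = [x[u] <= x[v] for (u, v) in edges]
    cp.Problem(objective, constraints).solve()
    return x.value

# Toy example: a 4-vertex "diamond" DAG with observations violating the order.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
y = np.array([2.0, 1.0, 3.0, 0.5])
print(isotonic_l2(edges, y, np.ones(4)))
```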
was not known even for linear or tree orders the previous best running time was log strict isotonic regression we also give improved algorithms for strict isotonic regression given observations and weights its strict isotonic regression xs trict is defined to be the limit of as goes to where is the isotonic regression for under the norm it is immediate that xstrict is an isotonic regression for in addition it is unique and satisfies several desirable properties see theorem there is an algorithm that given dag set of observation rv and weights runs in expected time mn and computes xs trict the strict isotonic regression of the previous best running time was min mn nω log detailed comparison to previous results there has been lot of work for fast algorithms for special graph families mostly for see for references for some cases where is very simple directed path corresponding to linear orders or rooted directed tree corresponding to tree orders several works give algorithms with running times of or log see for references theorem not only improves on the previously best known algorithms for general dags but also on several algorithms for special graph families see table one such setting is where is point set in and whenever ui vi for all this setting has applications to data analysis as in the example given earlier and has been studied extensively see for references for this case it was proved by stout see prop that these partial orders can be embedded in dag with vertices and edges and that this dag can be computed in time linear in its size the bounds then follow by combining this result with our theorem above we obtain improved running times for all norms for dags with and for point sets for for stout gives an time algorithm table comparison to previous best results for vertex set arbitrary dag previous best logd nm log nm log nm this paper for sake of brevity we have ignored the notation implicit in the bounds and log terms the results are reported assuming an error parameter and that wv are bounded by poly for weighted on arbitrary dags the previous best result was log due to kaufman and tamir manuscript by stout improves it to log these algorithms are based on parametric search and are impractical our algorithm is simple achieves the best possible running time and only requires random sampling and topological sort in parallel independent work stout gives algorithms for linear order trees and and an algorithm for point sets in theorem implies the algorithms immediately the result for point sets follows after embedding the point sets into dags of size as for strict isotonic regression strict isotonic regression was introduced and studied in it also gave the only previous algorithm for computing it that runs in time min mn nω log theorem is an improvement when log overview of the techniques and contribution it is immediate that isotonic regression as formulated in equation is convex programming problem for weighted with applying generic convexprogramming algorithms such as interior point methods to this formulation leads to algorithms that are quite slow we obtain faster algorithms for isotonic regression by replacing the computationally intensive component of interior point methods solving systems of linear equations with approximate solves this approach has been used to design fast algorithms for generalized flow problems we present complete proof of an interior point method for large class of convex programs that only requires approximate solves daitch and spielman had proved such 
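For intuition about why l-infinity isotonic regression admits linear-time algorithms, one classical construction produces an optimal unweighted l-infinity fit with two sweeps over the DAG: let a_v be the maximum observation over v and its ancestors, b_v the minimum over v and its descendants, and set x_v = (a_v + b_v)/2. This is a generic construction shown for illustration only; it is not the weighted algorithm behind the O(m + n) theorem above, nor the specific max/min/avg solutions it computes. A sketch using Python's standard-library topological sort:

```python
import numpy as np
from graphlib import TopologicalSorter

def isotonic_linf(n, edges, y):
    """Unweighted l_inf isotonic fit on a DAG in two sweeps:
    a_v = max of y over v and its ancestors (forward sweep),
    b_v = min of y over v and its descendants (backward sweep),
    x_v = (a_v + b_v) / 2, which is isotonic and attains the optimal l_inf error."""
    preds = {v: set() for v in range(n)}
    succs = {v: set() for v in range(n)}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)
    order = list(TopologicalSorter(preds).static_order())
    a = np.array(y, dtype=float)
    for v in order:                 # predecessors appear before v, so a[u] is final
        for u in preds[v]:
            a[v] = max(a[v], a[u])
    b = np.array(y, dtype=float)
    for v in reversed(order):       # successors appear before v in reversed order
        for s in succs[v]:
            b[v] = min(b[v], b[s])
    return (a + b) / 2

print(isotonic_linf(4, [(0, 1), (0, 2), (1, 3), (2, 3)], [2.0, 1.0, 3.0, 0.5]))
```

The attained error equals the obvious lower bound max over comparable pairs u preceding v of (y_u - y_v)/2, which is what makes the two-sweep construction optimal.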
result for linear programs we extend this to and provide an improved analysis that only requires linear solvers with constant factor relative error bound whereas the method from daitch and spielman required polynomially small error bounds the linear systems in are symmetric diagonally dominant sdd matrices the seminal work of spielman and teng gives time approximate solvers for such systems and later research has improved these solvers further daitch and spielman extended these solvers to generalizations of sdd the systems we need to solve are neither sdd nor we develop fast solvers for this new class of matrices using fast sdd solvers we stress that standard techniques for approximate inverse computation conjugate gradient are not sufficient for approximately solving our systems in time these methods have at least square root dependence on the condition number which inevitably becomes huge in ipms and strict isotonic regression algorithms for and strict isotonic regression are based on techniques presented in recent paper of kyng et al we reduce isotonic regression to the following problem referred to as lipschitz learning on directed graphs in see section for details we have directed graph with edgenlengths givenoby len given rv for every define max len now given that assigns real values to subset of the goal is to determine rv that agrees with and minimizes max the above problem is solved in log time for general directed graphs in we give simple reduction to the above problem with the additional property that is dag for dags their algorithm can be implemented to run in time it is proved in that computing the strict isotonic regression is equivalent to computing the isotonic vector that minimizes the error under the lexicographic ordering see section under the same reduction as in the we show that this is equivalent to minimizing under the lexicographic ordering it is proved in that the can be computed with basically calls to immediately implying our result further applications the ipm framework that we introduce to design our algorithm for isotonic regression ir and the associated results are very general and can be applied to other problems as concrete application the algorithm of kakade et al for provably learning generalized linear models and single index models learns monotone functions on linear orders in time procedure lpav the structure of the associated convex program resembles ir our ipm results and solvers immediately imply an time algorithm up to log factors improved algorithms for ir or for learning lipschitz functions on point sets could be applied towards learning models where the is nondecreasing the natural ordering on extending they could also be applied towards constructing class probability estimation cpe models from multiple classifiers by finding mapping from multiple classifier scores to probabilistic estimate extending organization we report experimental results in section an outline of the algorithms and analysis for are presented in section in section we define the lipschitz regression problem on dags and give the reduction from isotonic regression we defer detailed description of the algorithms and most proofs to the accompanying supplementary material experiments an important advantage of our algorithms is that they can be implemented quite efficiently our algorithms are based on what is known as method see chapter that leads to an bound on the number of iterations each iteration corresponds to one linear solve in the hessian matrix variant known as the method 
see typically require much fewer iterations about log even though the only provable bound known is running time in practice for the important special case of regression we have implemented our algorithm in matlab with long step barrier method combined with our approximate solver for the linear systems involved number of heuristics recommended in that greatly improve the running time in practice have also been incorporated despite the changes our implementation is theoretically correct and also outputs an upper bound on the error by giving feasible point to the dual program our implementation is available at https time in secs number of vertices in the figure we plot average running times with error bars denoting standard deviation for regression on dags where the underlying graphs are grid graphs and random regular graphs of constant degree the edges for grid graphs are all oriented towards one of the corners for random regular graphs the edges are oriented according to random permutation the vector of initial observations is chosen to be random permutation of to obeying the partial order perturbed by adding gaussian noise to each coordinate for each graph size and two different noise levels standard deviation for the noise on each coordinate being or the experiment is repeated multiple time the relative error in the objective was ascertained to be less than algorithms for without loss of generality we assume given let denote the following isotonic regression problem and denote its optimum kx ykw min let denote the pth power of we assume the minimum entry of is and the maximum entry is wmax exp we also assume the additive error parameter is lower bounded notation to hide poly log log factors by exp and that exp we use the theorem given dag set of observations weights and an error log npwmax parameter the algorithm sotonic ipm runs in time and with probability at least outputs vector xa lg ig with kxa lg ykw the algorithm sotonic ipm is obtained by an appropriate instantiation of general interior point method ipm framework which we call pprox ipm to state the general ipm result we need to introduce two important concepts these concepts are defined formally in supplementary material section the first concept is barrier functions we denote the class of these functions by scb barrier function is special convex function defined on some convex domain set the function approaches infinity at the boundary of we associate with each complexity parameter which measures how is the second important concept is the symmetry of point nonnegative scalar quantity sym large symmetry value guarantees that point is not too close to the boundary of the set for our algorithms to work we need starting point whose symmetry is not too small we later show that such starting point can be constructed for the problem pprox ipm is primal path following ipm given vector domain and barrier function scb for we seek to compute hc xi to find minimizer we consider function fc hc xi and attempt to minimize fc for changing values of by alternately updating and as approaches the boundary of the term grows to infinity and with some care we can use this to ensure we never move to point outside the feasible domain as we increase the objective term hc xi contributes more to fc eventually for large enough the objective value hc xi of the current point will be close to the optimum of the program to stay near the optimum for each new value of we use method newton steps to update when is changed this means that we minimize local 
quadratic approximation to our objective this requires solving linear system hz where and are the gradient and hessian of at respectively solving this system to find is the most computationally intensive aspect of the algorithm crucially we ensure that crude approximate solutions to the linear system suffices allowing the algorithm to use fast approximate solvers for this step pprox ipm is described in detail in supplementary material section and in this section we prove the following theorem theorem given convex bounded domain irn and vector irn consider the program min hc xi let opt denote the optimum of the program let scb be barrier function for given initial point value upper bound sup hc xi symmetry lower bound sym and an error parameter the algorithm pprox ipm runs for tapx log iterations and returns point xapx which satisfies hc xapx the algorithm requires tapx multiplications of vectors by matrix satisfying where is the hessian of at various points specified by the algorithm we now reformulate the program to state version which can be solved using the pprox ipm framework consider points irn irn and define set dg for all to ensure boundedness as required by pprox ipm we add the constraint hw ti definition we define the domain dk ig irn dg hw ti the domain dk is convex and allows us to reformulate program with linear objective min hw ti such that dk our next lemma determines choice of which suffices to ensure that programs and have the same optimum the lemma is proven in supplementary material section lemma for all dk is and bounded and the optimum of program is the following result shows that for program we can compute good starting point for the path following ipm efficiently the algorithm ood tart computes starting point in linear time by running topological sort on the vertices of the dag and assigning values to according to the vertex order of the sort combined with an appropriate choice of this suffices to give starting point with good symmetry the algorithm ood tart is specified in more detail in supplementary material section together with proof of the following lemma lemma the algorithm ood tart runs in time and returns an initial point that is feasible and for satisfies sym dk combining standard results on barrier functions with barrier for developed by hertog et al we can show the following properties of function fk whose exact definition is given in supplementary material section corollary the function fk is barrier for dk and it has complexity parameter fk its gradient gfk is computable in time and an implicit representation of the hessian hfk can be computed in time as well the key reason we can use pprox ipm to give fast algorithm for isotonic regression is that we develop an efficient solver for linear equations in the hessian of fk the algorithm essian olve solves linear systems in hessian matrices of the barrier function fk the hessian is composed of structured main component plus rank one matrix we develop solver for the main component by doing change of variables to simplify its structure and then factoring the matrix by blockwise ldl we can solve straightforwardly in the and and we show that the factor consists of blocks that are either diagonal or sdd so we can solve in this factor approximately using time sdd solver the algorithm essian olve is given in full in supplementary material section along with proof of the following result theorem for any instance of program given by some at any point dk for any vector essian olve returns vector for symmetric linear 
operator satisfying hfk hfk the algorithm fails with probability log log essian olve runs in time these are the ingredients we need to prove our main result on solving the algorithm so tonic ipm is simply pprox ipm instantiated to solve program with an appropriate choice of parameters we state sotonic ipm informally as algorithm below sotonic ipm is given in full as algorithm in supplementary material section proof of theorem sotonic ipm uses the symmetry lower bound the value upper bound and the error parameter when calling pprox ipm by corollary the barrier function fk used by sotonic ipm has complexity parameter fk by lemma the starting point computed by ood tart and used by sotonic ipm is feasible and has symmetry sym dk hw apx by theorem the point xapx tapx output by sotonic ipm satisfies where opt is the optimum of program and is the value used by sotonic ipm for the constraint hw ti which is an upper bound on the supremum of objective values of feasible points of program by lemma opt hence ky xapx kp hw tapx opt again by theorem the number of calls to essian olve by sotonic ipm is bounded by fk log fk log each call to essian olve fails with probability thus probability by unionpbound the that some call to essian olve fails is upper bounded by log npwmax the algorithm uses log calls to essian olve that each take time log log as thus the total running time is algorithm sketch of algorithm sotonic ipm pick starting point using the ood tart algorithm for if then gradient of at else for log let be the hessian and gradient of fc at call essian olve to compute update return algorithms for and strict isotonic regression we now reduce isotonic regression and strict isotonic regression to the lipschitz learning problem as defined in let len be any dag with edge lengths len and partial labeling we think of partial labeling as function that assigns real values to subset of the vertex set we call such pair dag for complete labelingnx define the gradient on an edge due to to be max len if len then gradg unless in which case it is defined as given dag we say that complete assignment is an if it extends and for all other complete assignments that extends we have max max gradg note that when len then max if and only if is isotonic on suppose we are interested in isotonic regression on dag under to reduce this problem to that of finding an we add some auxiliary nodes and edges to let vl vr be two copies of that is for every vertex add vertex ul to vl and vertex ur to vr let el ul and er ur we then let ul and ur all other edge lengths are set to finally let the partial assignment takes real values only on the the vertices in vl vr for all ul ur and is our dag observe that has vertices and edges lemma given dag set of observations rv and weights construct and as above let be an for the dag then is the isotonic regression of with respect to under the norm proof we note that since the vertices corresponding to in are connected to each other by zero length edges max iff is isotonic on those edges since is dag we know that there are isotonic labelings on when is isotonic on vertices corresponding to gradient is zero on all the edges going in between vertices in also note that every vertex corresponding to in is attached to two auxiliary nodes xl vl xr vr we also have xl xr thus for any that extends and is isotonic on the only entries in correspond to edges in er and el and thus max max wu kx ykw algorithm omp nf in from is proved to compute the and is claimed to work for directed graphs section we exploit 
the fact that dijkstra algorithm in omp nf in can be implemented in time on dags using topological sorting of the vertices giving linear time algorithm for computing the combining it with the reduction given by the lemma above and observing that the size of is we obtain theorem complete description of the modified omp nf in is given in section we remark that the solution to the regression that we obtain has been referred to as avg isotonic regression in the literature it is easy to modify the algorithm to compute the ax in isotonic regressions details are given in section for strict isotonic regression we define the lexicographic ordering given rm let πr denote permutation that sorts in order by absolute value πr πr given two vectors rm we write to indicate that is smaller than in the lexicographic ordering on sorted absolute values πr πs and πr πs or πr πs note that it is possible that and while it is total relation for every and at least one of or is true given dag we say that complete assignment is if it extends and for all other complete assignments that extend we have gradg stout proves that computing the strict isotonic regression is equivalent to finding an isotonic that minimizes zu wu xu yu in the lexicographic ordering with the same reduction as above it is immediate that this is equivalent to minimizing in the lemma given dag set of observations rv and weights construct and as above let be the for the dag then is the strict isotonic regression of with respect to with weights as for we give modification of the algorithm omp ex in from that computes the in mn time the algorithm is described in section combining this algorithm with the reduction from lemma we can compute the strict isotonic regression in mn time thus proving theorem acknowledgements we thank sabyasachi chatterjee for introducing the problem to us and daniel spielman for his advice and comments we would also like to thank quentin stout and anonymous reviewers for their suggestions this research was partially supported by afosr award nsf grant and simons investigator award to daniel spielman references ayer brunk ewing reid and silverman an empirical distribution function for sampling with incomplete information the annals of mathematical statistics pp barlow bartholomew bremner and brunk statistical inference under order restrictions the theory and application of isotonic regression wiley new york gebhardt an algorithm for monotone regression with one or more independent variables biometrika maxwell and muckstadt establishing consistent and realistic reorder intervals in productiondistribution systems operations research roundy rule for production inventory system mathematics of operations research pp acton and bovik nonlinear image estimation using piecewise and local image models image processing ieee transactions on jul lee the algorithm and isotonic regression the annals of statistics pp dykstra and robertson an algorithm for isotonic regression for two or more independent variables the annals of statistics pp chatterjee guntuboyina and on risk bounds in isotonic and other shape restricted regression problems the annals of statistics to appear kalai and sastry the isotron algorithm isotonic regression in colt moon smola chang and zheng intervalrank isotonic regression with listwise and pairwise constraints in wsdm pages acm kakade kanade shamir and kalai efficient learning of generalized linear and single index models with isotonic regression in nips zadrozny and elkan transforming classifier scores into 
accurate multiclass probability estimates kdd pages narasimhan and agarwal on the relationship between binary classification bipartite ranking and binary class probability estimation in nips angelov harb kannan and wang weighted isotonic regression under the norm in soda punera and ghosh enhanced hierarchical classification via isotonic smoothing in www zheng zha and sun learning to rank using isotonic regression in communication control and computing allerton conference on hochbaum and queyranne minimizing convex cost closure set siam journal on discrete mathematics stout isotonic regression via partitioning algorithmica stout weighted isotonic regression manuscript stout strict isotonic regression journal of optimization theory and applications stout fastest isotonic regression algorithms http stout isotonic regression for multiple independent variables algorithmica kaufman and tamir locating service centers with precedence constraints discrete applied mathematics stout infinity isotonic regression for linear multidimensional and tree orders corr daitch and spielman faster approximate lossy generalized flow via interior point algorithms stoc pages acm madry navigating central path with electrical flows in focs lee and sidford path finding methods for linear programming solving linear programs in rank iterations and faster algorithms for maximum flow in focs spielman and teng time algorithms for graph partitioning graph sparsification and solving linear systems stoc pages acm koutis miller and peng log time solver for sdd linear systems focs pages washington dc usa ieee computer society cohen kyng miller pachocki peng rao and xu solving sdd linear systems in nearly time stoc kyng rao sachdeva and spielman algorithms for lipschitz learning on graphs in proceedings of colt pages boyd and vandenberghe convex optimization cambridge university press den hertog jarre roos and terlaky sufficient condition for math july renegar mathematical view of methods in convex optimization siam nemirovski lecure notes interior point polynomial time methods in convex programming mcshane extension of range of functions bull amer math whitney analytic extensions of differentiable functions defined in closed sets transactions of the american mathematical society pp 
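Before moving on, it may help to see the linear-time flavor of the ℓ∞ case concretely. The sketch below implements the classical unweighted ℓ∞ isotonic construction on a DAG, x_v = (max of y over v and its predecessors + min of y over v and its successors) / 2, using two topological-sort passes in O(n + m) time. This is not the paper's weighted AVG or Strict algorithm (those go through the Lipschitz-learning reduction described above); it is only meant to illustrate the linear-time flavor of ℓ∞ regression on DAGs, and the graph representation and function names are our own illustrative choices.

```python
from collections import deque

def linf_isotonic_unweighted(n, edges, y):
    """Unweighted l_inf isotonic regression on a DAG.

    n      : number of vertices 0..n-1
    edges  : list of directed edges (u, v) encoding the constraint x[u] <= x[v]
    y      : list of n observed values

    Returns an isotonic vector x minimizing max_v |x[v] - y[v]| using the
    classical construction x[v] = (max_{u <= v} y[u] + min_{w >= v} y[w]) / 2.
    """
    succ = [[] for _ in range(n)]
    pred = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        indeg[v] += 1

    # Topological order (Kahn's algorithm).
    order, q = [], deque(i for i in range(n) if indeg[i] == 0)
    while q:
        u = q.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)

    # hi[v] = max of y over v and all its predecessors (forward pass).
    hi = list(y)
    for v in order:
        for u in pred[v]:
            hi[v] = max(hi[v], hi[u])

    # lo[v] = min of y over v and all its successors (backward pass).
    lo = list(y)
    for v in reversed(order):
        for w in succ[v]:
            lo[v] = min(lo[v], lo[w])

    return [(hi[v] + lo[v]) / 2.0 for v in range(n)]


if __name__ == "__main__":
    # A small chain 0 -> 1 -> 2 with out-of-order observations.
    print(linf_isotonic_unweighted(3, [(0, 1), (1, 2)], [3.0, 1.0, 2.0]))
```

The forward pass is valid because, in topological order, every predecessor of v is finalized before v is visited; the backward pass is symmetric.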
adversarial prediction games for multivariate losses hong wang wei xing kaiser asif brian ziebart department of computer science university of illinois at chicago chicago il bziebart abstract multivariate loss functions are used to assess performance in many modern prediction tasks including information retrieval and ranking applications convex approximations are typically optimized in their place to avoid empirical risk minimization problems we propose to approximate the training data instead of the loss function by posing multivariate prediction as an adversarial game between prediction player and evaluation player constrained to match specified properties of training data this avoids the of empirical risk minimization but game sizes are exponential in the number of predicted variables we overcome this intractability using the double oracle constraint generation method we demonstrate the efficiency and predictive performance of our approach on tasks evaluated using the precision at the and the discounted cumulative gain introduction for many problems in information retrieval and learning to rank the performance of predictor is evaluated based on the combination of predictions it makes for multiple variables examples include the precision when limited to positive predictions the harmonic mean of precision and recall and the discounted cumulative gain dcg for assessing ranking quality these stand in contrast to measures like the accuracy and log likelihood which are additive over independently predicted variables many multivariate performance measures are not concave functions of predictor parameters so maximizing them over empirical training data or equivalently empirical risk minimization over corresponding multivariate loss function is computationally intractable and can only be accomplished approximately using local optimization methods instead convex surrogates for the empirical risk are optimized using either an additive or multivariate approximation of the loss function for both types of approximations the gap between the application performance measure and the surrogate loss measure can lead to substantial of the resulting predictions rather than optimizing an approximation of the multivariate loss for available training data we take an alternate approach that robustly minimizes the exact multivariate loss function using approximations of the training data we formalize this using game between predictor player and an adversarial evaluator player learned weights parameterize this game payoffs and enable generalization from training data to new predictive settings the key computational challenge this approach poses is that the size of multivariate prediction games grows exponentially in the number of variables we leverage constraint generation methods developed for solving large zerosum games and efficient methods for computing best responses to tame this complexity in many cases the structure of the multivariate loss function enables the game nash equilibrium to be efficiently computed we formulate parameter estimation as convex optimization problem and solve it using standard convex optimization methods we demonstrate the benefits of this approach on prediction tasks with and dcg multivariate evaluation measures background and related work notation and multivariate performance functions we consider the general task of making multivariate prediction for variables yn with random variables denoted as yn given some contextual information xn xn with random variable each xi is the 
information relevant to predicted variable yi we denote the estimator predicted values as the multivariate performance measure when predicting when the true multivariate value is actually is represented as scoring function score equivalently complementary loss function for any score function based on the maximal score can be defined as loss score score for information retrieval vector of retrieved items from the pool of items can be represented as and vector of relevant items as with xn denoting side contextual information search terms and document contents precision and recall are important measures for information retrieval systems however maximizing either leads to degenerate solutions predict all to maximize recall or predict none to maximize precision the precision when limited to exactly positive predictions where is one popular multivariate performance measure that avoids these extremes another is the fscore which is the harmonic mean of the precision and recall often used in information retrieval tasks using this notation the for set of items can be simply represented as and in other information retrieval tasks ranked list of retrieved items is desired this can be represented as permutation where denotes the ith item and denotes the rank of the th item evaluation measures that emphasize the items are used to produce search engine results attuned to actual usage the discounted cumulative gain dcg measures the performance of item rankings with relevancy scores yi as pn pn dcg log log or dcg multivariate empirical risk minimization empirical risk minimization is common supervised learning approach that seeks predictor from set of predictors that minimizes the loss under the empirical distribution of training data denoted loss multivariate losses are often not convex and finding the optimal solution is computationally intractable for expressive classes of predictors typically specified by some set of parameters linear discriminant functions if given these difficulties convex surrogates to the multivariate loss are instead employed that are additive over and yi loss loss yi employing the logarithmic loss loss yi log yi yields the logistic regression model using the hinge loss yields support vector machines structured support vector machines employ convex approximation of the multivariate loss over training dataset using the hinge loss function min ξi such that ξi ξi in other words linear parameters for feature functions are desired that make the example label have potential value that is better than all alternative labels by at least the multivariate loss between and denoted when this is not possible for particular example hinge loss penalty ξi is incurred that grows linearly with the difference in potentials parameter controls between obtaining predictor with lower hinge loss or better discrimination between training examples the margin the size of set is often too large for explicit construction of the constraint set to be computationally tractable instead constraint generation methods are employed to find smaller set of active constraints this can be viewed as either finding the constraint or as inference problem our approach employs similar constraint generation the inference procedure rather than the parameter learning improve its efficiency multivariate prediction games we formulate minimax game for multivariate loss optimization describe our approach for limiting the computational complexity of solving this game and describe algorithms for estimating parameters of the game and 
making predictions using this framework game formulation following recent adversarial formulation for classification we view multivariate prediction as game between player making predictions and player determining the evaluation distribution player first stochastically chooses predictive distribution of variable assignments to maximize multivariate performance measure then player stochastically chooses an evaluation distribution that minimizes the performance measure further player must choose the relevant items in way that approximately matches in expectation with set of statistics measured from labeled data we denote this set as definition the multivariate prediction game mpg for predicted variables is max min score where and are distributions over of the predicted variables and the set corresponds to the constraint since the set constrains the adversary multivariate label distribution over the entire distribution of inputs solving this game directly is impractical when the number of training examples is large instead we employ the method of lagrange multipliers in theorem which allows the set of games to be independently solved given lagrange multipliers theorem the multivariate prediction game value definition can be equivalently obtained by solving set of unconstrained maximin games parameterized by lagrange multipliers max min score max min max score min max where is vector of features characterizing the set of prediction variables yi and provided contextual variables xi each related to predicted variable yi proof sketch equality is consequence of duality in games equality is obtained by writing the lagrangian and taking the dual strong lagrangian duality is guaranteed when feasible solution exists on the relative interior of the convex constraint set small amount of slack corresponds to regularization of the parameter in the dual and guarantees the strong duality feasibility requirement is satisfied in practice the resulting game payoff matrix can be expressed as the original game scores of eq augmented with lagrangian potentials the combination defines new payoff matrix with entries score as shown in eq example multivariate prediction games and solutions examples of the lagrangian payoff matrices for the and dcg games are shown in tan ble for three variables we employ additive feature functions xi table the payoff matrices for the games between player choosing columns and player choosing rows with three variables for precision at top middle and dcg with binary relevance values and we let lg bottom dcg lg lg lg lg lg lg lg lg lg lg lg lg lg lg lg lg lg lg lg in these examples with indicator function we compactly represent the lagrangian potential terms for each game with potential variables ψi xi xi when and otherwise games such as these can be solved using pair of linear programs that have constraint for each pure action set of variable assignments in the game and max such that min such that and where is the payoff and is the value of the game the second player to act in game can using pure strategy single value assignment to all variables thus these lps consider only the set of pure strategies of the opponent to find the first player mixed equilibrium strategy the equilibrium strategy for the predictor is distribution over rows and the equilibrium strategy for the adversary is distribution over columns the size of each game payoff matrix grows exponentially with the number of variables nk for the precision at game for the game and for the dcg game with possible relevance levels 
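As a concrete illustration of the pair of linear programs just described, the following sketch computes the predictor (row) player's maximin mixed strategy for an explicitly given payoff matrix with scipy.optimize.linprog. The payoff entries in the example are illustrative placeholders rather than the Lagrangian-augmented payoffs of the actual games, and the function name is ours.

```python
import numpy as np
from scipy.optimize import linprog

def maximin_strategy(C):
    """Row player's maximin mixed strategy for a zero-sum game.

    C[i, j] is the payoff to the row player when the row player plays pure
    strategy i and the column player plays pure strategy j.  Returns (p, v):
    the mixture over rows and the game value, solving
        max_p min_j  sum_i p_i * C[i, j].
    """
    n_rows, n_cols = C.shape
    # Decision variables z = (p_1, ..., p_{n_rows}, v); linprog minimizes, so use -v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0
    # For every column j:  v - sum_i p_i C[i, j] <= 0.
    A_ub = np.hstack([-C.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # Probabilities sum to one (the value variable is excluded).
    A_eq = np.ones((1, n_rows + 1))
    A_eq[0, -1] = 0.0
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:-1], res.x[-1]

if __name__ == "__main__":
    # Matching pennies as a sanity check: value 0, uniform mixture.
    C = np.array([[1.0, -1.0], [-1.0, 1.0]])
    p, v = maximin_strategy(C)
    print(p, v)
```

Of course, for the games above the rows and columns of such a matrix correspond to all joint variable assignments, so the matrix is exponentially large; the explicit LP is therefore only illustrative, which is exactly the issue taken up next.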
these sizes make explicit construction of the game matrix impractical for all but the smallest of problems strategy inference more efficient methods for obtaining nash equilibria are needed to scale our mpg approach to large prediction tasks with payoff matrices though much attention has focused on efficiently computing equilibria in time or ln time which guarantee each player payoff within of optimal we employ an approach for finding an exact equilibrium that works well in practice despite not having as strong theoretical guarantees consider the reduced game matrices of table the nash equilibrium for the precision at game with potentials is and with game value of the nash equilibrium for the reduced with no is and with game value of the reduced game equilibrium is also an equilibrium of the original game though the exact size of the subgame and its specific actions depends on the values of often compact with identical equilibrium or close approximation exists motivated by the compactness of the reduced game we employ constraint generation approach known as the double oracle algorithm to iteratively construct an appropriate reduced game that provides the correct equilibrium but avoids the computational complexity of the original exponentially sized game table the reduced precision at game with top and fscore game with bottom algorithm constraint generation game solver input lagrange potentials for ψn initial action sets and each variable output nash equilibrium initialize player action set and player action set buildpayoffmatrix using eq for the matrix of repeat using the lp of eq findbestresponseaction denotes the best response action if then check if best response provides improvement buildpayoffmatrix add new row to game matrix end if using the lp of eq findbestresponseaction if then buildpayoffmatrix add new column to game matrix end if until stop if neither best response provides improvement return neither player can improve upon their strategy with additional pure strategies when algorithm terminates thus the mixed strategies it returns are nash equilibrium pair additionally the algorithm is efficient in practice so long as each player strategy is compact the number of actions with probability is polynomial subset of the label combinations and best responses to opponents strategies can be obtained efficiently in polynomial time for each player additionally this algorithm can be modified to find approximate equilibria by limiting the number of actions for each player set and efficiently computing best responses the tractability of our approach largely rests on our ability to efficiently find best responses to oppoc and for some combinations nent strategies of loss functions and features finding the best response is trivial using greedy selection algorithm other loss combinations require specialized algorithms or are we illustrate each situation precision at best response many best responses can be obtained using greedy algorithms that are based on marginal probabilities of the opponent strategy for example the expected payoff in the precision at game for the estimator player setting is thus the set of top variables with the largest marginal label probability provides the best response for the adversary best response the lagrangian terms must also be included since is known variable as long as the value of each included term kψi is negative the sum is the smallest and the corresponding response is the best for the adversary game best response we leverage recently developed method for 
efficiently maximizing the when distribution over relevant documents is given the key insight is that the problem can be separated into an inner greedy maximization over item sets of certain size and an outer maximization to select the best set size from this method can be directly applied to find the best response of the estimator player since the lagrangian terms of the cost matrix are invariant to the choice of algorithm obtains the best response for the adversary player using slight modifications to incorporate the lagrangian potentials into the objective function algorithm maximizer for adversary player input vector of estimator probabilities and lagrange potentials ψn define matrix with element construct matrix is the all ones vector for to do solve the inner optimization pn problem pn ai fik ak ai by setting ai for the column of smallest elements and ai for the rest store value of fik end for for take and solve the outer optimization problem return and order inversion best response another common loss measure when comparing two rankings is the number of pairs of items with inverted order across rankings one variable may occur before another in one ranking but not in the other ranking only the marginal probabilities of pairwise orderings are needed to construct the portion of the payoff received for ranking item over item ψi where ψi is lagrangian potential based on features for ranking item over item one could construct fully connected directed graph with edges weighted by these portions of the payoff for ranking pairs of items the best response for corresponds to set of acyclic edges with the smallest sum of edge weights unfortunately this problem is in general because the minimum feedback arc set problem which seeks to form an acyclic graph by removing the set of edges with the minimal sum of edge weights can be reduced to it dcg best response although we can not find an efficient algorithm to get the best response using order inversion solving best response of dcg has known efficient algorithm in this problem the maximizer is permutation of the documents while the minimizer is the relevance score of each document pair the estimator best response maximizes log log where is constant that has pno relationship with since is monotonically decreasing computing and sorting with descending order and greedily assign the order to is optimal the adversary best response using additive features minimizes θi φi xi θi φi xi thus by using the expectation of function of each variable rank which is easily computed from each variable relevancy score can be independently chosen parameter estimation predictive model parameters must be chosen to ensure that the adversarial distribution is similar to training data though adversarial prediction can be posed as convex optimization problem the objective function is not smooth general subgradient methods require iterations to provide an approximation to the optima we instead employ which has been empirically shown to converge at faster rate in many cases despite lacking theoretical guarantees for objectives we also employ regularization to avoid overfitting to the training data sample the addition of the smooth regularizer often helps to improve the rate of convergence the gradient in these optimizations with regularization for training dataset is the difference between feature moments with additional regularization term λθ the adversarial strategies needed for calculating this gradient are computed via alg experiments we evaluate our approach multivariate 
prediction games mpg on the three performance measures of interest in this work precision at and dcg our primary point of comparison is with structured support vector machines ssvm to better understand the between convexly approximating the loss function with the hinge loss versus adversarially approximating the training data using our approach we employ an optical recognition of handwritten digits optdigits dataset classes features training examples test examples an income prediction dataset two classes features training examples test examples and pairs from the million query trec dataset of queries documents on average per query features per document following the same evaluation method used in for optdigits the dataset is converted into multiple binary datasets and we report the of the performance of all classes on test data for we use random of the training data as holdout validation data to select the regularization parameter we evaluate the performance of our approach and comparison methods ssvm and logistic regression table precision at top and lr using precision at where is half the number of performance bottom positive examples os and for precision at we restrict the pure strategies of the adverprecision optdigits adult mpg sary to select positive labels this prevents adversary ssvm strategies with no positive labels from the results in tassvm ble we see that our approach mpg works better than optdigits adult ssvm on the optdgits datasets slightly better on prempg cision at and more significantly better on ssvm for the adult dataset mpg provides equivalent perlr formance for precision at and better performance on fmeasure the nature of the running time required for validation and testing is very different for ssvm which must find the maximizing set of variable assignments and mpg which must interactively construct game and its equilibrium model validation and testing require seconds for ssvm on the optdigits dataset and seconds on the adult dataset while requiring seconds and seconds for mpg precision at and seconds and seconds for mpg optimization respectively for precision at mpg is within an order of http for precision at the original ssvm implementation uses the restriction during training but not during testing we modified the code by ordering ssvm prediction value for each test example and select the top predictions as positives the rest are considered as negatives we denote the original implementation as ssvm and the modified version as ssvm tude better for optdigits worse for adult for the more difficult problem of maximizing the of adult over test examples the mpg game becomes quite large and requires significantly more computational time though our mpg method is not as finely optimized as existing ssvm implementations this difference in run times will remain as the game formulation is inherently more computationally demanding for difficult prediction tasks we compare the performance of our approach and comparison methods using cross validation on the dataset we measure performance using normalized dcg ndcg which divides the realized dcg by the maximum possible dcg for the dataset based on slightly different variant of dcg employed by pn log the son methods are part of svmstruct which uses structured svm to predict the rank listnet ranking algorithm employing cross entropy loss boosting method using weak rankers and data reweighing to achieve good ndcg performance uses mean average figure ndcg as increases precision map rather than ndcg and rankboost which reduces ranking 
to binary classification problems on instance pairs table ndcg results method mpg ranksvm listnet rankboost mean ndcg table reports the ndcg averaged over all values of between and on average while figure reports the results for each value of between and from this we can see that our mpg approach provides better rankings on average than the baseline methods except when is very small in other words the adversary focuses most of its effort in reducing the score received from the first item in the ranking but at the expense of providing better overall ndcg score for the ranking as whole discussion we have extended adversarial prediction games to settings with multivariate performance measures in this paper we believe that this is an important step in demonstrating the benefits of this approach in settings where structured support vector machines are widely employed our future work will investigate improving the computational efficiency of adversarial methods and also incorporating structured statistical relationships amongst variables in the constraint set in addition to multivariate performance measures acknowledgments this material is based upon work supported by the national science foundation under grant no robust optimization of loss functions with application to active learning references kaiser asif wei xing sima behpour and brian ziebart adversarial classification in proceedings of the conference on uncertainty in artificial intelligence stephen boyd and lieven vandenberghe convex optimization cambridge university press zhe cao tao qin liu tsai and hang li learning to rank from pairwise approach to listwise approach in proceedings of the international conference on machine learning pages acm corinna cortes and mehryar mohri auc optimization error rate minimization in advances in neural information processing systems pages corinna cortes and vladimir vapnik networks machine learning krzysztof dembczynski willem waegeman weiwei cheng and eyke an exact algorithm for maximization in advances in neural information processing systems pages yoav freund raj iyer robert schapire and yoram singer an efficient boosting algorithm for combining preferences the journal of machine learning research andrew gilpin javier and tuomas sandholm algorithm with ln convergence for in games in aaai conference on artificial intelligence pages peter and phillip dawid game theory maximum entropy minimum discrepancy and robust bayesian decision theory annals of statistics tamir hazan joseph keshet and david mcallester direct loss minimization for structured prediction in advances in neural information processing systems pages hoffgen simon and kevin vanhorn robust trainability of single neurons journal of computer and system sciences martin jansche maximum expected training of logistic regression models in proceedings of the conference on human language technology and empirical methods in natural language processing pages association for computational linguistics thorsten joachims optimizing search engines using clickthrough data in proceedings of the international conference on knowledge discovery and data mining pages acm thorsten joachims support vector method for multivariate performance measures in proceedings of the international conference on machine learning pages acm richard karp reducibility among combinatorial problems springer adrian lewis and michael overton nonsmooth optimization via bfgs lichman uci machine learning repository richard lipton and neal young simple strategies for large games with 
applications to complexity theory in proc of the acm symposium on theory of computing pages acm dong liu and jorge nocedal on the limited memory bfgs method for large scale optimization mathematical programming brendan mcmahan geoffrey gordon and avrim blum planning in the presence of cost functions controlled by an adversary in proceedings of the international conference on machine learning pages david musicant vipin kumar and aysel ozgur optimizing with support vector machines in flairs conference pages shameem puthiya parambath nicolas usunier and yves grandvalet optimizing by costsensitive classification in advances in neural information processing systems pages tao qin and liu introducing letor datasets arxiv preprint mani ranjbar greg mori and yang wang optimizing complex loss functions in structured prediction in proceedings of the european conference on computer vision pages springer ben taskar vassil chatalbashev daphne koller and carlos guestrin learning structured prediction models large margin approach in proceedings of the international conference on machine learning pages acm flemming topsøe information theoretical optimization techniques kybernetika ioannis tsochantaridis thomas hofmann thorsten joachims and yasemin altun support vector machine learning for interdependent and structured output spaces in proceedings of the international conference on machine learning page acm vladimir vapnik principles of risk minimization for learning theory in advances in neural information processing systems pages john von neumann and oskar morgenstern theory of games and economic behavior princeton university press jun xu and hang li adarank boosting algorithm for information retrieval in proc of the international conference on research and development in information retrieval pages acm 
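To close the adversarial-games discussion, here is a compact sketch of the constraint-generation (double oracle) loop described earlier: maintain small action sets for both players, solve the reduced zero-sum game, and ask each player's best-response oracle whether a new pure strategy improves on the current reduced-game value. The payoff function and the two best-response oracles are passed in as callables; in the paper they encode the Lagrangian-augmented score and the loss-specific greedy responses (precision at k, F-score, DCG), which are not reproduced here, so this is only a generic skeleton with made-up interfaces.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(C):
    """Equilibrium (p over rows, q over columns, value) of payoff matrix C,
    where the row player maximizes and the column player minimizes."""
    def maximin(M):
        r, c = M.shape
        cost = np.zeros(r + 1); cost[-1] = -1.0
        A_ub = np.hstack([-M.T, np.ones((c, 1))])
        A_eq = np.ones((1, r + 1)); A_eq[0, -1] = 0.0
        res = linprog(cost, A_ub=A_ub, b_ub=np.zeros(c), A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0.0, 1.0)] * r + [(None, None)])
        return res.x[:-1], res.x[-1]
    p, v = maximin(C)        # row player's maximin mixture
    q, _ = maximin(-C.T)     # column player's minimax mixture
    return p, q, v

def double_oracle(payoff, row_br, col_br, row_init, col_init, tol=1e-8, max_iter=1000):
    """Double oracle / constraint generation for a zero-sum game.

    payoff(a, b)    -> payoff to the row player for pure strategies a, b
    row_br(q, cols) -> pure row strategy best-responding to mixture q over cols
    col_br(p, rows) -> pure column strategy best-responding to mixture p over rows
    """
    rows, cols = [row_init], [col_init]
    for _ in range(max_iter):
        C = np.array([[payoff(a, b) for b in cols] for a in rows])
        p, q, v = solve_zero_sum(C)
        grew = False
        # Add the row best response only if it strictly beats the reduced-game value.
        a_new = row_br(q, cols)
        if sum(qj * payoff(a_new, b) for qj, b in zip(q, cols)) > v + tol:
            rows.append(a_new); grew = True
        # Add the column best response only if it strictly lowers the value.
        b_new = col_br(p, rows)
        if sum(pi * payoff(a, b_new) for pi, a in zip(p, rows)) < v - tol:
            cols.append(b_new); grew = True
        if not grew:
            return rows, p, cols, q, v
    return rows, p, cols, q, v
```

Note that no explicit duplicate check is needed: a pure strategy already in the reduced game cannot beat the reduced-game value by more than the tolerance, so the improvement test alone prevents re-adding it, and termination certifies an equilibrium of the full game.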
asynchronous parallel stochastic gradient for nonconvex optimization xiangru lian yijun huang yuncheng li and ji liu department of computer science university of rochester lianxiangru raingomm abstract asynchronous parallel implementations of stochastic gradient sg have been broadly used in solving deep neural network and received many successes in practice recently however existing theories can not explain their convergence and speedup properties mainly due to the nonconvexity of most deep learning formulations and the asynchronous parallel mechanism to fill the gaps in theory and provide theoretical supports this paper studies two asynchronous parallel implementations of sg one is over computer network and the other is on shared memory system we establish an ergodic convergence rate for both algorithms and prove that the linear speedup is achievable if the number of workers is bounded by is the total number of iterations our results generalize and improve existing analysis for convex minimization introduction the asynchronous parallel optimization recently received many successes and broad attention in machine learning and optimization niu et li et yun et fercoq and zhang and kwok marecek et tappenden et hong it is mainly due to that the asynchronous parallelism largely reduces the system overhead comparing to the synchronous parallelism the key idea of the asynchronous parallelism is to allow all workers work independently and have no need of synchronization or coordination the asynchronous parallelism has been successfully applied to speedup many optimization algorithms including stochastic gradient niu et agarwal and duchi zhang et feyzmahdavian et paine et mania et stochastic coordinate descent avron et liu et sridhar et dual stochastic coordinate ascent tran et and randomized kaczmarz algorithm liu et in this paper we are particularly interested in the asynchronous parallel stochastic gradient algorithm sy sg for nonconvex optimization mainly due to its recent successes and popularity in deep neural network bengio et dean et paine et zhang et li et and matrix completion niu et petroni and querzoni yun et while some research efforts have been made to study the convergence and speedup properties of sy sg for convex optimization people still know very little about its properties in nonconvex optimization existing theories can not explain its convergence and excellent speedup property in practice mainly due to the nonconvexity of most deep learning formulations and the asynchronous parallel mechanism people even have no idea if its convergence is certified for nonconvex optimization although it has been used widely in solving deep neural network and implemented on different platforms such as computer network and shared memory for example multicore and multigpu system to fill these gaps in theory this paper tries to make the first attempt to study sy sg for the following nonconvex optimization problem eξ where is random variable and is smooth but not necessarily convex function the most common specification is that is an index set of all training samples and is the loss function with respect to the training sample indexed by we consider two popular asynchronous parallel implementations of sg one is for the computer network originally proposed in agarwal and duchi and the other one is for the shared memory including system originally proposed in niu et note that due to the architecture diversity it leads to two different algorithms the key difference lies on that the computer network can 
naturally also efficiently ensure the atomicity of reading and writing the whole vector of while the shared memory system is unable to do that efficiently and usually only ensures efficiency for atomic reading and writing on single coordinate of parameter the implementation on computer cluster is described by the consistent asynchronous parallel sg algorithm sy con because the value of parameter used for stochastic gradient evaluation is consistent an existing value of parameter at some time point contrarily we use the inconsistent asynchronous parallel sg algorithm sy incon to describe the implementation on the shared memory platform because the value of parameter used is inconconsistent that is it might not be the real state of at any time point this paper studies the theoretical convergence speedup properties for both algorithms we establish an asymptotic convergence rate of km for sy con where is the total iteration number and is the size of minibatch the linear speedup is proved to be achievable while the number of workers is bounded by for sy incon we establish an asymptotic convergence and speedup properties similar to sy con the intuition of the linear speedup of asynchronous parallelism for sg can be explained in the following recall that the serial sg essentially uses the stochastic gradient to surrogate the accurate gradient sy sg brings additional deviation from the accurate gradient due to using stale or delayed information if the additional deviation is relatively minor to the deviation caused by the stochastic in sg the total iteration complexity or convergence rate of sy sg would be comparable to the serial sg which implies nearly linear speedup this is the key reason why sy sg works the main contributions of this paper are highlighted as follows our result for sy con generalizes and improves earlier analysis of sy con for convex optimization in agarwal and duchi particularly we improve the upper bound of the maximal number of workers to ensure the linear speedup from to by factor the proposed sy incon algorithm provides more accurate description than ogwild niu et for the implementation of sy sg on the shared memory system although our result does not strictly dominate the result for ogwild due to different problem settings our result can be applied to more scenarios nonconvex optimization our analysis provides theoretical convergence and speedup guarantees for many recent successes of sy sg in deep learning to the best of our knowledge this is the first work that offers such theoretical support notation denotes the global optimal solution to denotes the norm of vector that is the number of nonzeros in ei rn denotes the ith natural unit basis vector we use eξk to denote the expectation with respect to set of variables ξk means taking the expectation in terms of all random variables is used to denote for short we use and to denote the ith element of and respectively assumption throughout this paper we make the following assumption for the objective function all of them are quite common in the analysis of stochastic gradient algorithms assumption we assume that the following holds unbiased gradient the stochastic gradient is unbiased that is to say eξ the speedup for workers is defined as the ratio between the total work load using one worker and the average work load using workers to obtain solution at the same precision the linear speedup is achieved means that the speedup with workers greater than ct for any values of is constant independent to bounded variance the 
variance of stochastic gradient is bounded eξ kg lipschitzian gradient the gradient function is lipschitzian that is to say lkx yk under the lipschitzian gradient assumption we can define two more constants ls and lmax let be any positive integer define ls to be the minimal constant satisfying the following inequality ls αi ei αi ei and define lmax as the minimum constant that satisfies αei lmax it can be seen that lmax ls related work this section mainly reviews asynchronous parallel gradient algorithms and asynchronous parallel stochastic gradient algorithms and refer readers to the long version of this for review of stochastic gradient algorithms and synchronous parallel stochastic gradient algorithms the asynchronous parallel algorithms received broad attention in optimization recently although pioneer studies started from bertsekas and tsitsiklis due to the rapid development of hardware resources the asynchronous parallelism recently received many successes when applied to parallel stochastic gradient niu et agarwal and duchi zhang et feyzmahdavian et paine et stochastic coordinate descent avron et liu et dual stochastic coordinate ascent tran et randomized kaczmarz algorithm liu et and admm zhang and kwok liu et al and liu and wright studied the asynchronous parallel stochastic coordinate descent algorithm with consistent read and inconsistent read respectively and prove the linear speedup is achievable if for smooth convex functions and for functions with smooth convex loss nonsmooth convex separable regularization avron et al studied this asynchronous parallel stochastic coordinate descent algorithm in solving ax where is symmetric positive definite matrix and showed that the linear speedup is achievable if for consistent read and for inconsistent read tran et al studied parallel version of stochastic dual coordinate ascent algorithm which periodically enforces synchronization in separate thread we review the asynchronous parallel stochastic gradient algorithms in the last agarwal and duchi analyzed the sy con algorithm on computer cluster for convex smooth optimization and proved convergence rate of which implies that linear speedup is achieved when is bounded by in comparison our analysis for the more general nonconvex smooth optimization improves the upper bound by factor very recent work feyzmahdavian et extended the analysis in agarwal and duchi to minimize functions in the form smooth convex loss nonsmooth convex regularization and obtained similar results niu et al proposed lock free asynchronous parallel implementation of sg on the shared memory system and described this implementation as ogwild algorithm they proved sublinear convergence rate for strongly convex smooth objectives another recent work mania et al analyzed asynchronous stochastic optimization algorithms for convex functions by viewing it as serial algorithm with the input perturbed by bounded noise and proved the convergences rates no worse than using traditional point of view for several algorithms asynchronous parallel stochastic gradient for computer network this section considers the asynchronous parallel implementation of sg on computer network proposed by agarwal and duchi it has been successfully applied to the distributed neural network dean et and the parameter server li et to solve deep neural network http algorithm description sy con algorithm sy con require γk ensure xk for do randomly select training samples indexed by ξk pm xk γk ξk end for the star in the network is master which maintains the 
parameter. The other machines in the computer network serve as workers, which communicate only with the master. All workers exchange information with the master independently and simultaneously, essentially repeating the following steps:

- Select: randomly select a subset of training samples;
- Pull: pull the current parameter x from the master;
- Compute: compute the stochastic gradient G(x; ξ);
- Push: push G(x; ξ) to the master.

The master essentially repeats the following steps:

- Aggregate: aggregate a predefined amount (M, the minibatch size) of stochastic gradients from the workers;
- Sum: sum all M stochastic gradients into a single vector;
- Update: update the parameter by x_{k+1} ← x_k - γ_k Σ_{m=1}^{M} G(x_{k-τ_{k,m}}; ξ_{k,m}).

While the master is aggregating stochastic gradients from the workers, it does not care about their sources; as soon as the total amount reaches the predefined quantity, the master computes the sum and performs the update on x. The update step is performed as an atomic operation (workers cannot read the value of x during this step), which can be implemented efficiently in the network, especially in the parameter server (Li et al.). The key difference between this asynchronous parallel implementation of SG and the serial (or synchronous parallel) SG algorithm lies in the update step: some of the aggregated stochastic gradients might have been computed from an earlier value of x instead of the current one, whereas in serial SG all gradients are guaranteed to use the current value. The asynchronous parallel implementation substantially reduces the system overhead and overcomes possibly large network delays, but the cost is that old values of x are used in the stochastic gradient evaluation. We will show in the analysis below that the negative effect of this cost vanishes asymptotically.

To characterize this asynchronous parallel implementation mathematically, we monitor the parameter x in the master and use the subscript k to indicate the kth iteration on the master; for example, x_k denotes the value of the parameter after k updates, and so on. We introduce the variable τ_{k,m} to denote the delay of the parameter value used in evaluating the mth stochastic gradient at the kth iteration. This asynchronous parallel implementation of SG on a computer network is summarized by the AsySG-con algorithm (see the algorithm listing above). The suffix "con" is short for consistent read: the value of x used to compute a stochastic gradient is a real state of x at some time point. Consistent read is ensured by the atomicity of the update step; when atomicity fails, it leads to inconsistent read, which is discussed in the section on the shared memory architecture. It is worth noting that on some network structures the asynchronous implementation can also be described as AsySG-con, for example the cyclic delayed architecture and the locally averaged delayed architecture of Agarwal and Duchi.

Analysis for AsySG-con. To analyze the algorithm, besides the assumption stated earlier we make the following additional assumptions:

- Independence: all random variables ξ_{k,m} are independent of each other;
- Bounded age: all delay variables are bounded, max_{k,m} τ_{k,m} ≤ T.

The independence assumption holds strictly if all workers select samples with replacement; although it might not be satisfied strictly in practice, it is a common assumption made for the analysis. (There could be more than one master machine in some networks, but they all serve the same purpose and can be treated as a single machine.) The bounded delay assumption is much more important: as pointed out before, the asynchronous implementation may use old values of the parameter to evaluate the stochastic gradient, and intuitively the age (or "oldness") should not be too large, to ensure convergence.
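Before formalizing the delay bound, the following is a minimal single-process simulation of the consistent-read scheme just described; the function names, step size, the toy least-squares objective, and all constants are illustrative assumptions rather than part of the original implementation.

```python
import numpy as np

def asysg_con_toy(grad, x0, gamma=0.005, minibatch=8, max_delay=4,
                  iterations=2000, seed=0):
    """Toy simulation of consistent-read asynchronous parallel SG.

    The 'master' keeps the parameter x; every stochastic gradient in a
    minibatch may be evaluated at a stale but *consistent* copy of x,
    i.e. a value the master actually held at some earlier iteration.
    max_delay plays the role of the delay bound T, which is roughly
    proportional to the number of workers.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    history = [x.copy()]                      # past consistent states of x
    for _ in range(iterations):
        g = np.zeros_like(x)
        for _ in range(minibatch):
            delay = rng.integers(0, min(max_delay, len(history)))
            g += grad(history[-1 - delay], rng)   # gradient at a stale state
        x -= gamma * g                        # atomic update on the master
        history.append(x.copy())
    return x

# Toy problem: noisy least squares with hypothetical data.
d = 10
x_star = np.ones(d)

def grad(x, rng):
    a = rng.normal(size=d)
    b = a @ x_star + 0.1 * rng.normal()
    return 2.0 * (a @ x - b) * a

print(np.linalg.norm(asysg_con_toy(grad, np.zeros(d)) - x_star))
```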
therefore it is natural and reasonable idea to assume an upper bound for ages this assumption is commonly used in the analysis for asynchronous algorithms for example niu et avron et liu and wright liu et feyzmahdavian et liu et it is worth noting that the upper bound is roughly proportional to the number of workers under assumptions and we have the following convergence rate for nonconvex optimization theorem assume that assumptions and hold and the steplength sequence γk in algorithm satisfies pt lm γk γk for all we have the following ergodic convergence rate for the iteration of algorithm pk γk γk γj pk where denotes taking expectation in terms of all random variables in algorithm to evaluate the convergence rate the commonly used metrics in convex optimization are not eligible for example xk and kxk for nonsmooth optimization we use the ergodic convergence as the metric that is the weighted average of the norm of all gradients xk which is used in the analysis for nonconvex optimization ghadimi and lan although the metric used in nonconvex optimization is not exactly comparable to xk or kxk used in the analysis for convex optimization it is not totally unreasonable to think that they are roughly in the same order the ergodic convergence directly indicates the following convergence if ranpk domly select an index from with probability γk γk then is bounded by the right hand side of and all bounds we will show in the following taking close look at theorem we can properly choose the steplength γk as constant value and obtain the following convergence rate corollary assume that assumptions and hold set the steplength γk to be constant lkσ if the delay parameter is bounded by then the output of algorithm satisfies the following ergodic convergence rate pk xk xk this corollary basically claims that when the total iteration number is greater than the convergence rate achieves since this rate does not depend on the delay parameter after sufficient number of iterations the negative effect of using old values of for stochastic gradient evaluation vanishes asymptoticly in other words if the total number of workers is bounded by the linear speedup is achieved note that our convergence rate is consistent with the serial sg with for convex optimization nemirovski et the synchronous parallel or sg for convex optimization dekel et and nonconvex smooth optimization ghadimi and lan therefore an important observation is that as long as the number of workers which is proportional to is bounded by the iteration complexity to achieve the same accuracy level will be roughly the same in other words the average work load for each worker is reduced by the factor comparing to the serial sg therefore the linear speedup is achievable if since our convergence rate meets several special cases it is tight next we compare with the analysis of sy con for convex smooth optimization in agarwal and duchi corollary they proved an asymptotic convergence rate which is consistent with ours but their results require to guarantee linear speedup our result improves it by factor asynchronous parallel stochastic gradient for shared memory architecture this section considers widely used asynchronous implementation of sg on the shared memory system proposed in niu et al its advantages have been witnessed in solving svm graph cuts niu et linear equations liu et and matrix completion petroni and querzoni while the computer network always involves multiple machines the shared memory platform usually only includes single machine with 
multiple cores gpus sharing the same memory algorithm sy incon algorithm description sy incon require ensure xk for the shared memory platform one can for do actly follow sy con on the computer randomly select training samples indexed network using software locks which is exby ξk therefore in practice the lock free randomly select ik with uniasynchronous parallel implementation of sg form distribution is preferred this section considers the same ik xk ik ξk ik implementation as niu et al but end for vides more precise algorithm description sy incon than ogwild proposed in niu et al in this lock free implementation the shared memory stores the parameter and allows all workers reading and modifying parameter simultaneously without using locks all workers repeat the following steps independently concurrently and simultaneously read read the parameter from the shared memory to the local memory without software locks we use to denote its value compute sample training data and use to compute the stochastic gradient locally update update parameter in the shared memory without software locks γg since we do not use locks in both read and update steps it means that multiple workers may manipulate the shared memory simultaneously it causes the inconsistent read at the read step that is the value of read from the shared memory might not be any state of in the shared memory at any time point for example at time the original value of in the shared memory is two dimensional vector at time worker is running the read step and first reads from the shared memory at time worker updates the first component of in the shared memory from to at time worker updates the second component of in the shared memory from to at time worker reads the value of the second component of in the shared memory as in this case worker eventually obtains the value of as which is not real state of in the shared memory at any time point recall that in sy con the parameter value obtained by any worker is guaranteed to be some real value of parameter at some time point to precisely characterize this implementation and especially represent we monitor the value of parameter in the shared memory we define one iteration as modification on any single component of in the shared memory since the update on single component can be considered to be atomic on gpus and dsps niu et we use xk to denote the value of parameter in the shared memory after iterations and to denote the value read from the shared memory and used for computing stochastic gradient at the kth iteration can be represented by xk with few earlier updates missing xk xj where is subset of index numbers of previous iterations this way is also used in analyzing asynchronous parallel coordinate descent algorithms in avron et liu and wright the kth update happened in the shared memory can be described as ik xk ik ξk ik where ξk denotes the index of the selected data and ik denotes the index of the component being updated at kth iteration in the original analysis for the ogwild implementation niu et is assumed to be some earlier state of in the shared memory that is the consistent read for simpler analysis although it is not true in practice the time consumed by locks is roughly equal to the time of computation the additional cost for using locks is the waiting time during which multiple worker access the same memory address one more complication is to apply the strategy like before since the update step needs physical modification in the shared memory it is usually much more time 
consuming than both read and compute steps are if many workers run the update step simultaneously the memory contention will seriously harm the performance to reduce the risk of memory contention common trick is to ask each worker to gather multiple say stochastic gradients and write the shared memory only once that is in each cycle run both update and compute steps for times before you run the update step thus the updates happen in the shared memory can be written as pm ik xk ik ξk ik where ik denotes the coordinate index updated at the kth iteration and ξk is the mth stochastic gradient computed from the data sample indexed by ξk and the parameter value denoted by at the kth iteration can be expressed by xk xj where is subset of index numbers of previous iterations the algorithm is summarized in algorithm from the view of the shared memory analysis for sy incon to analyze the sy incon we need to make few assumptions similar to niu et al liu et al avron et al liu and wright assumption we assume that the following holds for algorithm independence all groups of variables ik ξk at different iterations from to are independent to each other bounded age let be the global bound for delay so the independence assumption might not be true in practice but it is probably the best assumption one can make in order to analyze the asynchronous parallel sg algorithm this assumption was also used in the analysis for ogwild niu et and asynchronous randomized kaczmarz algorithm liu et the bounded delay assumption basically restricts the age of all missing components in the upper bound here serves similar purpose as in assumption thus we abuse this notation in this section the value of is proportional to the number of workers and does not depend on the size of the bounded age assumption is used in the analysis for asynchronous stochastic coordinate descent with inconsistent read avron et liu and wright under assumptions and we have the following results theorem assume that assumptions and hold and the constant steplength satisfies lmax we have the following ergodic convergence rate for algorithm pk km xt lmax γσ taking close look at theorem we can choose the steplength properly and obtain the following error bound corollary assume that assumptions and hold set the steplength to be constant klt if the total iterations is greater than lt nσ then the output of algorithm satisfies the following ergodic convergence rate pk lt km xk this corollary indicates the asymptotic convergence rate achieves when the total iteration number exceeds threshold in the order of if is considered as constant we can see that this rate and the threshold are consistent with the result in corollary for sy con one may argue that why there is an additional factor in the numerator of that is due to the way we count iterations one iteration is defined as updating single component of if we take into account this factor in the comparison to sy con the convergence rates for sy con and sy incon are essentially consistent this comparison implies that the inconsistent read would not make big difference from the consistent read next we compare our result with the analysis of ogwild by niu et in principle our analysis and their analysis consider the same implementation of asynchronous parallel sg but differ in the following aspects our analysis considers the smooth nonconvex optimization which includes the smooth strongly convex optimization considered in their analysis our analysis considers the inconsistent read model which meets the practice 
while their analysis assumes the impractical consistent read model although the two results are not absolutely comparable it is still interesting to see the difference niu et al proved that the linear speedup is achievable if the maximal number of nonzeros in stochastic gradients is bounded by and the number of workers is bounded by our analysis does not need this and guarantees the linear speedup as long as the number of workers is bounded by although it is hard to say that our result strictly dominates ogwild in niu et al our asymptotic result is eligible for more scenarios experiments the successes of sy con and sy incon and their advantages over synchronous parallel algorithms have been widely witnessed in many applications such as deep neural network dean et paine et zhang et li et matrix completion niu et petroni and querzoni yun et svm niu et and linear equations liu et we refer readers to these literatures for more comphrehensive comparison and empirical studies this section mainly provides the empirical study to validate the speedup properties for completeness due to the space limit please find it in supplemental materials conclusion this paper studied two popular asynchronous parallel implementations for sg on computer cluster and shared memory system respectively two algorithms sy con and sy incon are used to describe two implementations an asymptotic sublinear convergence rate is proven for both algorithms on nonconvex smooth optimization this rate is consistent with the result of sg for convex optimization the linear speedup is proven to achievable when the number of workers is bounded by which improves the earlier analysis of sy con for convex optimization in agarwal and duchi the proposed sy incon algorithm provides more precise description for lock free implementation on shared memory system than ogwild niu et our result for sy incon can be applied to more scenarios acknowledgements this project is supported by the nsf grant the nec fellowship and the startup funding at university of rochester we thank professor daniel gildea and professor sandhya dwarkadas at university of rochester professor stephen wright at university of and anonymous reviewers for their constructive comments and helpful advices references agarwal and duchi distributed delayed stochastic optimization nips avron druinsky and gupta revisiting asynchronous linear solvers provable convergence rate through randomization ipdps bengio ducharme vincent and janvin neural probabilistic language model the journal of machine learning research bertsekas and tsitsiklis parallel and distributed computation numerical methods volume prentice hall englewood cliffs nj dean corrado monga chen devin mao senior tucker yang le et al large scale distributed deep networks nips dekel shamir and xiao optimal distributed online prediction using journal of machine learning research fercoq and accelerated parallel and proximal coordinate descent arxiv preprint feyzmahdavian aytekin and johansson an asynchronous algorithm for regularized stochastic optimization arxiv may ghadimi and lan stochastic methods for nonconvex stochastic programming siam journal on optimization hong distributed asynchronous and incremental algorithm for nonconvex optimization an admm based approach arxiv preprint jia shelhamer donahue karayev long girshick guadarrama and darrell caffe convolutional architecture for fast feature embedding arxiv preprint krizhevsky and hinton learning multiple layers of features from tiny images computer science department 
University of Toronto, Tech. Rep.
Krizhevsky, Sutskever, and Hinton. ImageNet classification with deep convolutional neural networks. NIPS.
Li, Zhou, Yang, Li, Xia, Andersen, and Smola. Parameter server for distributed machine learning. Big Learning NIPS Workshop.
Li, Andersen, Park, Smola, Ahmed, Josifovski, Long, Shekita, and Su. Scaling distributed machine learning with the parameter server. OSDI.
Li, Andersen, Smola, and Yu. Communication efficient distributed machine learning with the parameter server. NIPS.
Liu and Wright. Asynchronous stochastic coordinate descent: parallelism and convergence properties. arXiv preprint.
Liu, Wright, Bittorf, and Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. ICML.
Liu, Wright, and Sridhar. An asynchronous parallel randomized Kaczmarz algorithm. arXiv preprint.
Mania, Pan, Papailiopoulos, Recht, Ramchandran, and Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint.
Marecek and. Distributed block coordinate descent for minimizing partially separable functions. arXiv preprint.
Nemirovski, Juditsky, Lan, and Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization.
Niu, Recht, Re, and Wright. Hogwild!: approach to parallelizing stochastic gradient descent. NIPS.
Paine, Jin, Yang, Lin, and Huang. GPU asynchronous stochastic gradient descent to speed up neural network training. NIPS.
Petroni and Querzoni. GASGD: stochastic gradient descent for distributed asynchronous matrix completion via graph partitioning. ACM Conference on Recommender Systems.
Sridhar, Wright, Re, Liu, Bittorf, and Zhang. An approximate, efficient LP solver for LP rounding. NIPS.
Tappenden and. On the complexity of parallel coordinate descent. arXiv preprint.
Tran, Hosseini, Xiao, Finley, and Bilenko. Scaling up stochastic dual coordinate ascent. ICML.
Yun, Yu, Hsieh, Vishwanathan, and Dhillon. NOMAD: stochastic algorithm for asynchronous and decentralized matrix completion. arXiv preprint.
Zhang and Kwok. Asynchronous distributed ADMM for consensus optimization. ICML.
Zhang, Choromanska, and LeCun. Deep learning with elastic averaging SGD. CoRR.
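As a concrete companion to the lock-free, shared-memory (inconsistent-read) scheme analyzed in the sections above, the following toy sketch assembles the parameter read coordinate by coordinate from different past states, so the vector used for the gradient need not equal any state the memory ever held. The function names, the per-coordinate update, and the toy least-squares problem are illustrative assumptions, not the Hogwild! implementation.

```python
import numpy as np

def asysg_incon_toy(grad, x0, gamma=0.005, max_delay=4,
                    iterations=4000, seed=1):
    """Toy simulation of the lock-free, inconsistent-read scheme.

    One 'iteration' updates a single coordinate of x (the unit that is
    atomic in the shared memory).  The parameter value used to compute a
    stochastic gradient is assembled coordinate by coordinate, and each
    coordinate may come from a different past state, so the assembled
    vector x_hat need not equal any state x ever actually held.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    history = [x.copy()]
    d = x.size
    for _ in range(iterations):
        # Inconsistent read: every coordinate may have its own (bounded) age.
        delays = rng.integers(0, min(max_delay, len(history)), size=d)
        x_hat = np.array([history[-1 - delays[j]][j] for j in range(d)])
        g = grad(x_hat, rng)           # full stochastic gradient at x_hat
        i = rng.integers(d)            # coordinate chosen for this update
        x[i] -= gamma * g[i]           # lock-free single-coordinate write
        history.append(x.copy())
    return x

# Toy problem: noisy least squares with hypothetical data.
d = 10
x_star = np.ones(d)

def grad(x, rng):
    a = rng.normal(size=d)
    b = a @ x_star + 0.1 * rng.normal()
    return 2.0 * (a @ x - b) * a

print(np.linalg.norm(asysg_incon_toy(grad, np.zeros(d)) - x_star))
```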
embed to control locally linear latent dynamics model for control from raw images manuel jost tobias joschka boedecker university of freiburg germany watterm springj jboedeck martin riedmiller google deepmind london uk riedmiller abstract we introduce embed to control method for model learning and control of dynamical systems from raw pixel images consists of deep generative model belonging to the family of variational autoencoders that learns to generate image trajectories from latent space in which the dynamics is constrained to be locally linear our model is derived directly from an optimal control formulation in latent space supports prediction of image sequences and exhibits strong performance on variety of complex control problems introduction control of dynamical systems with continuous state and action spaces is one of the key problems in robotics and in broader context in reinforcement learning for autonomous agents prominent class of algorithms that aim to solve this problem are locally optimal stochastic control algorithms such as ilqg control which approximate the general nonlinear control problem via local linearization when combined with receding horizon control and machine learning methods for learning approximate system models such algorithms are powerful tools for solving complicated control problems however they either rely on known system model or require the design of relatively state representations for real autonomous agents to succeed we ultimately need algorithms that are capable of controlling complex dynamical systems from raw sensory input images only in this paper we tackle this difficult problem if stochastic optimal control soc methods were applied directly to control from raw image data they would face two major obstacles first sensory data is usually images with thousands of pixels rendering naive soc solution computationally infeasible second the image content is typically highly function of the system dynamics underlying the observations thus model identification and control of this dynamics are while both problems could in principle be addressed by designing more advanced soc algorithms we approach the optimal control from raw images problem differently turning the problem of locally optimal control in systems into one of identifying latent state space in which locally optimal control can be performed robustly and easily to learn such latent space we propose new deep generative model belonging to the class of variational autoencoders that is derived from an ilqg formulation in latent space the resulting embed to control system is probabilistic generative model that holds belief over viable trajectories in sensory space allows for accurate planning in latent space and is trained fully unsupervised we demonstrate the success of our approach on four challenging tasks for control from raw images and compare it to range of methods for unsupervised representation learning as an aside we also validate that deep networks are powerful generative models for large images authors contributed equally the embed to control model we briefly review the problem of soc for dynamical systems introduce approximate locally optimal control in latent space and finish with the derivation of our model problem formulation we consider the control of unknown dynamical systems of the form st ut σξ ns nu where denotes the time steps st the system state ut the applied control and the system noise the function st ut is an arbitrary smooth system dynamics we equivalently refer to equation 
using the notation ut which we assume to be multivariate normal distribution st ut σξ we further assume that we are only given access to visual depictions xt rnx of state st this restriction requires solving joint state identification and control problem for simplicity we will in the following assume that xt is fully observed depiction of st but relax this assumption later our goal then is to infer latent state space model in which optimal control can be performed that is we seek to learn function mapping from images xt to vectors zt rnz with nz nx such that the control problem can be solved using zt instead of xt zt xt σω where accounts for system noise or equivalently zt xt σω assuming for the moment that such function can be learned or approximated we will first define soc in latent space and introduce our model thereafter stochastic locally optimal control in latent spaces let zt rnz be the inferred latent state from image xt of state st and lat zt ut the transition dynamics in latent space lat zt ut thus lat models the changes that occur in zt when control ut is applied to the underlying system as latent space analogue to st ut assuming lat is known optimal controls for trajectory of length in the dynamical system can be derived by minimizing the function which gives the expected future costs when following ez ct zt ut zt ut where zt ut are instantaneous costs ct zt ut denotes terminal costs and zt and ut are state and action sequences respectively if zt contains sufficient information about st st can be inferred from zt alone and lat is differentiable the controls can be computed from via soc algorithms these optimal control algorithms approximate the global dynamics with locally linear dynamics at each time step locally optimal actions can then be found in closed form formally given reference trajectory the current estimate for the optimal trajectory together with corresponding controls the system is linearized as zt σω lat lat where δf δf are local jacobians and is an offset to enable efficient computation of the local controls we assume the costs to be quadratic function of the latent representation zt ut zt zgoal rz zt zgoal utt ru ut nz where rz and ru rnu are cost weighting matrices and zgoal is the inferred representation of the goal state we also assume ct zt ut zt ut throughout this paper in combination with equation this gives us local formulation at each time step which can be solved by soc algorithms such as iterative regulation ilqr or approximate inference control aico the result of this trajectory optimization step is locally optimal trajectory with corresponding control sequence arg min henc at htrans bt ot µt σt xt qφ enc zt pt hdec kl pt hdec ut encode decode transition figure the information flow in the model from left to right we encode and decode an dec image xt with the networks henc and hθ where we use the latent code zt for the transition step trans the hψ network computes the local matrices at bt ot with which we can predict from zt and ut similarity to the encoding is enforced by kl divergence on their distributions and reconstruction is again performed by hdec locally linear latent state space model for dynamical systems starting from the soc formulation we now turn to the problem of learning an appropriate lowdimensional latent representation zt zt xt σω of xt the representation zt has to fulfill three properties it must capture sufficient information about xt enough to enable reconstruction ii it must allow for accurate prediction of the next latent state 
and thus implicitly of the next observation iii the prediction lat of the next latent state must be locally linearizable for all valid control magnitudes ut given some representation zt properties ii and iii in particular require us to capture possibly highly changes of the latent representation due to transformations of the observed scene induced by control commands crucially these are particularly hard to model and subsequently linearize we circumvent this problem by taking more direct approach instead of learning latent space and transition model lat which are then linearized and combined with soc algorithms we directly impose desired transformation properties on the representation zt during learning we will select these properties such that prediction in the latent space as well as locally linear inference of the next observation according to equation are easy the transformation properties that we desire from latent representation can be formalized directly from the ilqg formulation given in section formally following equation let the latent representation be gaussian xt σω to infer zt from xt we first require method for sampling latent states ideally we would generate samples directly from the unknown true posterior which we however have no access to following the variational bayes approach see jordan et al for an overview we resort to sampling zt from an approximate posterior distribution qφ with parameters inference model for qφ in our work this is always diagonal gaussian distribution qφ µt diag whose mean µt rnz and covariance σt diag rnz are computed by an encoding neural network with outputs µt wµ henc xt bµ wσ henc xt log bσ ne where henc is the activation of the last hidden layer and where is given by the set of all learnable parameters of the encoding network including the weight matrices wµ wσ and biases bµ bσ parameterizing the mean and variance of gaussian distribution based on neural network gives us natural and very expressive model for our latent space it additionally comes with the benefit that we can use the reparameterization trick to backpropagate gradients of loss function based on samples through the latent distribution generative model for pθ using the approximate posterior distribution qφ we generate observed samples images and from latent samples zt and by enforcing locally linear relationship in latent space according to equation yielding the following generative model zt qφ µt σt at µt bt ut ot ct pθ bernoulli pt where is the next latent state posterior distribution which exactly follows the linear form required for stochastic optimal control with ht as an estimate of the system noise can be decomposed as ct at σt att ht note that while the transition dynamics in our generative model operates on the inferred latent space it takes untransformed controls into account that is we aim to learn latent space such that the transition dynamics in linearizes the observed dynamics in and is locally linear in the applied controls reconstruction of an image from zt is performed by passing the sample through multiple hidden layers of decoding neural network which computes the mean pt of the generative bernoulli pθ as pt wp hdec zt bp nd where hdec is the response of the last hidden layer in the decoding network the set of zt parameters for the decoding network including weight matrix wp and bias bp then make up the learned generative parameters transition model for what remains is to specify how the linearization matrices at rnz bt rnz and offset ot rnz are predicted 
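The inference, generative, and transition components described above can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' architecture: the fully connected layers, layer widths, and the image, latent, and control dimensions are assumptions made for the example, and the way A_t, B_t, o_t are produced here is only a placeholder (the paper's own parametrization of the transformation network is described next).

```python
import torch
import torch.nn as nn

class E2CStyleLatentModel(nn.Module):
    """Minimal sketch of the three parts described above: an encoder
    q_phi(z|x), a Bernoulli decoder p_theta(x|z), and a transition network
    producing locally linear dynamics (A_t, B_t, o_t)."""

    def __init__(self, n_x=1024, n_z=3, n_u=2, hidden=128):
        super().__init__()
        self.n_z, self.n_u = n_z, n_u
        self.enc = nn.Sequential(nn.Linear(n_x, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, n_z)
        self.enc_logvar = nn.Linear(hidden, n_z)
        self.dec = nn.Sequential(nn.Linear(n_z, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_x))      # Bernoulli logits
        self.trans = nn.Sequential(nn.Linear(n_z, hidden), nn.ReLU())
        self.trans_A = nn.Linear(hidden, n_z * n_z)
        self.trans_B = nn.Linear(hidden, n_z * n_u)
        self.trans_o = nn.Linear(hidden, n_z)

    def encode(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: the sample stays differentiable in phi.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar

    def predict_next(self, z, u):
        h = self.trans(z)
        A = self.trans_A(h).view(-1, self.n_z, self.n_z)
        B = self.trans_B(h).view(-1, self.n_z, self.n_u)
        o = self.trans_o(h)
        # Locally linear latent step; in the full model this is applied to the
        # posterior mean and the covariance is propagated as A Sigma A^T + H.
        z_next = (torch.bmm(A, z.unsqueeze(-1)).squeeze(-1)
                  + torch.bmm(B, u.unsqueeze(-1)).squeeze(-1) + o)
        return z_next, (A, B, o)

    def decode(self, z):
        return torch.sigmoid(self.dec(z))     # mean of the Bernoulli p(x|z)

# Example usage on a toy batch of flattened 32x32 images (sizes are assumptions).
model = E2CStyleLatentModel()
x = torch.rand(4, 1024)
u = torch.randn(4, 2)
z, mu, logvar = model.encode(x)
z_next, _ = model.predict_next(z, u)
x_next_mean = model.decode(z_next)
```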
following the same approach as for distribution means and covariance matrices we predict all local transformation parameters from samples zt nt based on the hidden representation htrans of third neural network with parameters zt to which we refer as the transformation network specifically we parametrize the transformation matrices and offset as vec at wa htrans zt ba vec bt wb htrans zt bb ot wo htrans where vec denotes vectorization and therefore vec at nz and vec bt nz to circumvent estimating the full matrix at of size nz nz we can choose it to be perturbation of the identity matrix at vt rtt which reduces the parameters to be estimated for at to sketch of the complete architecture is shown in figure it also visualizes an additional constraint that is essential for learning representation for predictions we require samples from the state transition distribution to be similar to the encoding of through qφ while it might seem that just learning perfect reconstruction of from is enough we require multistep predictions for planning in which must correspond to valid trajectories in the observed space without enforcing similarity between samples from and qφ following transition in latent space from zt with action ut may lead to point from which reconstruction of is possible but that is not valid encoding the model will never encode any image as executing another action in then does not result in valid latent state since the transition model is conditional on samples coming from the inference network and thus predictions fail in nutshell such divergence between encodings and the transition model results in generative model that does not accurately model the markov chain formed by the observations learning via stochastic gradient variational bayes for training the model we use data set xt ut xt containing observation tuples with corresponding controls obtained from interactions with the dynamical system using this data set we learn the parameters of the inference transition and generative model by minimizing variational bound on the true data negative log xt ut plus an additional constraint on the latent representation the complete loss is given as lbound xt ut kl µt ut qφ xt ut the first part of this loss is the variational bound on the lbound xt ut zt log pθ xt log pθ kl qφ where qφ pθ and are the parametric inference generative and transition distributions from section and zt is prior on the approximate posterior qφ which we always chose to be bernoulli distribution for pθ is common choice when modeling images note that this is the loss for the latent state space model and distinct from the soc costs an isotropic gaussian distribution with mean zero and unit variance the second kl divergence in equation is an additional contraction term with weight that enforces agreement between the transition and inference models this term is essential for establishing markov chain in latent space that corresponds to the real system dynamics see section above for an in depth discussion this kl divergence can also be seen as prior on the latent transition model note that all kl terms can be computed analytically for our model see supplementary for details during training we approximate the expectation in via sampling specifically we take one sample zt for each input xt and transform that sample using equation to give valid sample from we then jointly learn all parameters of our model by minimizing using sgd experimental results we evaluate our model on four visual tasks an agent in plane with obstacles 
visual version of the classic inverted pendulum task balancing system and control of arm with larger images these are described in detail below experimental setup model training we consider two different network types for our model standard fully connected neural networks with up to three layers which work well for moderately sized images are used for the planar and experiments deep convolutional network for the encoder in combination with an network as the decoder which in accordance with recent findings from the literature we found to be an adequate model for larger images training was performed using adam throughout all experiments the training data set for all tasks was generated by randomly sampling state observations and actions with corresponding successor states for the plane we used samples for the inverted pendulum and system we used and for the arm complete list of architecture parameters and hyperparameter choices as well as an explanation of the network are specified in the supplementary material we will make our code and video containing controlled trajectories for all systems available under http model variants in addition to the embed to control dynamics model derived above we also consider two variants by removing the latent dynamics network htrans setting its output to one in equation we obtain variant in which at bt and ot are estimated as globally linear matrices global if we instead replace the transition model with network estimating the dynamics as function fˆlat and only linearize during planning estimating at bt ot as jacobians to fˆlat as described in section we obtain variant with nonlinear latent dynamics baseline models for thorough comparison and to exhibit the complicated nature of the tasks we also test set of baseline models on the plane and the inverted pendulum task using the same architecture as the model standard variational autoencoder vae and deep autoencoder ae are trained on the autoencoding subtask for visual problems that is given data set used for training our model we remove all actions from the tuples in and disregard temporal context between images after autoencoder training we learn dynamics model in latent space approximating lat from section we also consider vae variant with slowness term on the latent representation full description of this variant is given in the supplementary material optimal control algorithms to perform optimal control in the latent space of different models we employ two trajectory optimization algorithms iterative linear quadratic regulation ilqr for the plane and inverted pendulum and approximate inference control aico all other experiments for all vaes both methods operate on the mean of distributions qφ and aico additionally makes use of the local gaussian covariances σt and ct except for the experiments on the planar system control was performed in model predictive control fashion using the receding horizon scheme introduced in to obtain closed loop control given an image xt it is first passed through the encoder to obtain the latent state zt locally optimal trajectory is subsequently found by optimizing arg min zt zt ut with fixed small horizon ut with unless noted otherwise controls are applied to the system and transition to is observed by encoding the next image then new control sequence with horizon ae vae with slowness vae global figure the true state space of the planar system left with examples obstacles encoded as circles and the inferred spaces right of different models the spaces are spanned by 
generating images for every valid position of the agent and embedding them with the respective encoders starting in is found using the last estimated trajectory as bootstrap note that planning is performed entirely in the latent state without access to any observations except for the depiction of the current state to compute the cost function zt ut required for trajectory optimization in we assume knowledge of the observation xgoal of the goal state sgoal this observation is then transformed into latent space and costs are computed according to equation control in planar system the agent in the planar system can move in bounded plane by choosing continuous offset in and the representation of state is image obstructed by six circular obstacles the task is to move to the bottom right of the image starting from random position at the top of the image the encodings of obstacles are obtained prior to planning and an additional quadratic cost term is penalizing proximity to them depiction of the observations on which control is performed together with their corresponding state values and embeddings into latent space is shown in figure the figure also clearly shows fundamental advantage the model has over its competitors while the separately trained autoencoders make for aesthetically pleasing pictures the models failed to discover the underlying structure of the state space complicating dynamics estimation and largely invalidating costs based on distances in said space including the latent dynamics constraints in these models on the other hand yields latent spaces approaching the optimal planar embedding we test the accuracy by accumulating latent and real trajectory costs to quantify whether the imagined trajectory reflects reality the results for all models when starting from random positions at the top and executing actions are summarized in table using seperate test set for evaluating reconstructions while all methods achieve low reconstruction loss the difference in accumulated real costs per trajectory show the superiority of the model using the globally or locally linear model trajectories planned in latent space are as good as trajectories planned on the real state all models besides fail to give predictions that result in good performance learning for an inverted pendulum we next turn to the task of controlling the classical inverted pendulum system from images we create depictions of the state by rendering fixed length line starting from the center of the image at an angle corresponding to the pendulum position the goal in this task is to and balance an underactuated pendulum from resting position pendulum hanging down exemplary observations and reconstructions for this system are given in figure in the visual inverted pendulum task our algorithm faces two additional difficulties the observed space is as the angular velocity can not be inferred from single image and second discretization errors due to rendering pendulum angles as small pixel images make exact control difficult to restore the markov property we stack two images as input channels thus observing history figure shows the topology of the latent space for our model as well as one sample trajectory in true state and latent space the fact that the model can learn meaningful embedding separating table comparison between different approaches to model learning from raw pixels for the planar and pendulum system we compare all models with respect to their prediction quality on test set of sampled transitions and with respect to 
their performance when combined with soc trajectory cost for control from different start states note that trajectory costs in latent space are not necessarily comparable the real trajectory cost was computed on the dynamics of the simulator while executing planned actions for the true models for st real trajectory costs were for the planar system and for the pendulum success was defined as reaching the goal state and staying to it for the rest of the trajectory if non terminating all statistics quantify over different starting positions marks separately trained dynamics networks algorithm vae global vae no latent kl global state loss log xt next state loss trajectory cost log ut latent real planar system inverted pendulum success percent velocities and positions from this data is remarkable no other model recovered this shape table again compares the different models quantitatively while the model is not the best in terms of reconstruction performance it is the only model resulting in stable and balance behavior we explain the failure of the other models with the fact that the latent dynamics model can not be guaranteed to be linearizable for all control magnitudes resulting in undesired behavior around unstable fixpoints of the real system dynamics and that for this task globally linear dynamics model is inadequate balancing and controlling simulated robot arm finally we consider control of two more complex dynamical systems from images using six layer convolutional inference and six layer generative network resulting in deep path from input to reconstruction specifically we control visual version of the classical cartpole system from history of two pixel images as well as planar robot arm based on history of two pixel images the latent space was set to be in both experiments the real state dimensionality for the is four and is controlled using one angular velocity angle figure the true state space of the inverted pendulum task overlaid with successful trajectory taken by the agent the learned latent space the trajectory from traced out in the latent space images and reconstructions showing current positions right and history left observed predicted figure left trajectory from the domain only the first image green is real all other images are dreamed up by our model notice discretization artifacts present in the real image right exemplary observed with history image omitted and predicted images including the history image for trajectory in the visual robot arm domain with the goal marked in red action while for the arm the real state can be described in dimensions joint angles and velocities and controlled using action vector corresponding to motor torques as in previous experiments the model seems to have no problem finding locally linear embedding of images into latent space in which control can be performed figure depicts exemplary images for both problems from trajectory executed by our system the costs for these trajectories for the for the arm are only slightly worse than trajectories obtained by aico operating on the real system dynamics starting from the same and respectively the supplementary material contains additional experiments using these domains comparison to recent work in the context of representation learning for control see et al for review deep autoencoders ignoring state transitions similar to our baseline models have been applied previously by lange and riedmiller more direct route to control based on image streams is taken by recent work on model free deep for 
atari games by mnih et al as well as kernel based and deep policy learning for robot control close to our approach is recent paper by et al where autoencoders are used to extract latent representation for control from images on which model of the forward dynamics is learned their model is trained jointly and is thus similar to the variant in our comparison in contrast to our model their formulation requires pca and does neither ensure that predictions in latent space do not diverge nor that they are linearizable as stated above our system belongs to the family of vaes and is generally similar to recent work such as kingma and welling rezende et al gregor et al bayer and osendorfer two additional parallels between our work and recent advances for training deep neural networks can be observed first the idea of enforcing desired transformations in latent space during learning such that the data becomes easy to model has appeared several times already in the literature this includes the development of transforming and recent probabilistic models for images second learning relations between pairs of images although without control has received considerable attention from the community during the last years in broader context our model is related to work on state estimation in markov decision processes see langford et al for discussion through hidden markov models and kalman filters conclusion we presented embed to control system for stochastic optimal control on image streams key to the approach is the extraction of latent dynamics model which is constrained to be locally linear in its state transitions an evaluation on four challenging benchmarks revealed that can find embeddings on which control can be performed with ease reaching performance close to that achievable by optimal control on the real system model acknowledgments we thank radford metz and dewolf for sharing code as well as dosovitskiy for useful discussions this work was partly funded by dfg grant within the priority program autonomous learning and the cluster of excellence grant number exc watter is funded through the state graduate funding program of references jacobson and mayne differential dynamic programming american elsevier todorov and li generalized iterative lqg method for feedback control of constrained nonlinear stochastic systems in acc ieee tassa erez and smart receding horizon differential dynamic programming in proc of nips pan and theodorou probabilistic differential dynamic programming in proc of nips levine and koltun variational policy search via trajectory optimization in proc of nips kingma and welling variational bayes in proc of iclr rezende mohamed and wierstra stochastic backpropagation and approximate inference in deep generative models in proc of icml zeiler krishnan taylor and fergus deconvolutional networks in cvpr dosovitskiy springenberg and brox learning to generate chairs with convolutional neural networks in proc of cvpr stengel optimal control and estimation dover publications li and todorov iterative linear quadratic regulator design for nonlinear biological movement systems in proc of icinco toussaint robot trajectory optimization using approximate inference in proc of icml jordan ghahramani jaakkola and saul an introduction to variational methods for graphical models in machine learning kingma and ba adam method for stochastic optimization in proc of iclr wang tanaka and griffin an approach to fuzzy control of nonlinear systems stability and design issues ieee trans on fuzzy systems sutton 
and Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA.
Springenberg, Boedecker, Riedmiller, and Obermayer. Autonomous learning of state representations for control. KI (Künstliche Intelligenz).
Lange and Riedmiller. Deep neural networks in reinforcement learning. In Proc. of IJCNN.
Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski, Petersen, Beattie, Sadik, Antonoglou, King, Kumaran, Wierstra, Legg, and Hassabis. Control through deep reinforcement learning. Nature.
van Hoof, Peters, and Neumann. Learning of control policies with high-dimensional state features. In Proc. of AISTATS.
Levine, Finn, Darrell, and Abbeel. Training of deep visuomotor policies. CoRR, URL http.
and Deisenroth. From pixels to torques: policy learning with deep dynamical models. CoRR, URL http.
Gregor, Danihelka, Graves, Rezende, and Wierstra. DRAW: recurrent neural network for image generation. In Proc. of ICML.
Bayer and Osendorfer. Learning stochastic recurrent networks. In NIPS Workshop on Advances in Variational Inference.
Hinton, Krizhevsky, and Wang. Transforming auto-encoders. In Proc. of ICANN.
Dinh, Krueger, and Bengio. NICE: independent components estimation. CoRR, URL http.
Cohen and Welling. Transformation properties of learned visual representations. In ICLR.
Taylor, Sigal, Fleet, and Hinton. Dynamical binary latent variable models for human pose tracking. In Proc. of CVPR.
Memisevic. Learning to relate images. IEEE Trans. on PAMI.
Langford, Salakhutdinov, and Zhang. Learning nonlinear dynamic models. In ICML.
West and Harrison. Bayesian Forecasting and Dynamic Models. Springer Series in Statistics.
Matsubara and Kappen. Latent Kullback-Leibler control for systems using probabilistic graphical models. UAI.
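Returning to the receding-horizon scheme described in the experimental setup above, the closed control loop can be summarized in a short sketch. Here encode, plan, and env_step are hypothetical callables standing in for the learned encoder, the latent trajectory optimizer (iLQR or AICO), and the controlled system; the horizon and step counts are arbitrary choices.

```python
def receding_horizon_control(encode, plan, env_step, x0, x_goal,
                             n_steps=40, horizon=10):
    """Schematic closed-loop (receding-horizon) control with a learned model.

    encode(x)                        -> latent state z       (hypothetical)
    plan(z, z_goal, H, warm_start)   -> control sequence of length H, e.g.
                                        obtained with iLQR or AICO in latent space
    env_step(u)                      -> next image from the real system
    """
    z_goal = encode(x_goal)          # costs are computed in latent space only
    x, warm_start = x0, None
    for _ in range(n_steps):
        z = encode(x)                # no other access to observations
        u_seq = plan(z, z_goal, horizon, warm_start)
        x = env_step(u_seq[0])       # apply only the first planned control
        warm_start = u_seq           # bootstrap the next optimization
    return x
```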
efficient and parsimonious agnostic active learning huang microsoft research nyc alekh agarwal microsoft research nyc daniel hsu columbia university tkhuang alekha djhsu john langford microsoft research nyc robert schapire microsoft research nyc jcl schapire abstract we develop new active learning algorithm for the streaming setting satisfying three important properties it provably works for any classifier representation and classification problem including those with severe noise it is efficiently implementable with an erm oracle it is more aggressive than all previous approaches satisfying and to do this we create an algorithm based on newly defined optimization problem and analyze it we also conduct the first experimental analysis of all efficient agnostic active learning algorithms evaluating their strengths and weaknesses in different settings introduction given label budget what is the best way to learn classifier active learning approaches to this question are known to yield exponential improvements over supervised learning under strong assumptions under much weaker assumptions agnostic active learning is particularly appealing since it is known to work for any classifier representation and any label distribution with an data here learning algorithm decides for each unlabeled example in sequence whether or not to request label never revisiting this decision restated then what is the best possible active learning algorithm which works for any classifier representation any label distribution and is computationally tractable computational tractability is critical concern because most known algorithms for this setting require explicit enumeration of classifiers implying computational complexity compared to typical supervised learning algorithms active learning algorithms based on empirical risk minimization erm oracles can overcome this intractability by using passive classification algorithms as the oracle to achieve computationally acceptable solution achieving generality robustness and acceptable computation has cost for the above methods label is requested on nearly every unlabeled example where two empirically good classifiers disagree this results in poor label complexity well short of limits even for general robust solutions until now in section we design new algorithm called active over ac for constructing query probability functions that minimize the probability of querying inside the disagreement set of points where good classifiers never query otherwise this requires new algorithm that maintains parsimonious cover of the set of empirically good classifiers the cover is result of solving an optimization problem in section specifying the properties of desirable see the monograph of hanneke for an overview of the existing literature including alternative settings where additional assumptions are placed on the data source separability query probability function the cover size provides practical knob between computation and label complexity as demonstrated by the complexity analysis we present in section also in section we prove that ac effectively maintains set of good classifiers achieves good generalization error and has label complexity bound tighter than previous approaches the label complexity bound depends on the disagreement coefficient which does not completely capture the advantage of the algorithm in the end of section we provide an example of hard active learning problem where ac is substantially superior to previous tractable approaches together these results show 
that ac is better and sometimes substantially better in theory do agnostic active learning algorithms work in practice no previous works have addressed this question empirically doing so is important because analysis can not reveal the degree to which existing classification algorithms effectively provide an erm oracle we conduct an extensive study in section by simulating the interaction of the active learning algorithm with streaming supervised dataset results on wide array of datasets show that agnostic active learning typically outperforms passive learning and the magnitude of improvement depends on how carefully the active learning are chosen more details theory proofs and empirical evaluation are in the long version of this paper preliminaries let be distribution over and let be set of binary classifiers which we assume is finite for let ex denote expectation with respect to px the marginal of over the expected error of classifier is err pr and the error minimizer is denoted by arg err the importance weighted empirical error of on multiset of importance weighted and labeled examples drawn from is err the disagreement region for subset of classifiers is dis such that the regret of classifier relative to another is reg err err and the analogous empirical regret on is reg err err when the second classifier in empirical regret is omitted it is taken to be the empirical error minimizer in active learner receives labeled examples from one at time each label yi is hidden unless the learner decides on the spot to query it the goal is to produce classifier with low error err while querying as few labels as possible in the iwal framework decision whether or not to query label is made randomly the learner picks probability and queries the label with that probability whenever an unbiased error estimate can be produced using inverse probability weighting specifically for any classifier an unbiased estimator of err based on and is as follows if is queried then else it is easy to check that err thus when the label is queried we produce the importance weighted labeled example algorithm and statistical guarantees our new algorithm shown as algorithm breaks the example stream into epochs the algorithm admits any epoch schedule so long as the epoch lengths satisfy for technical reasons we always query the first labels to the algorithm at the start of epoch ac computes query probability function pm which will be used for sampling the data points to query during the epoch this is done by maintaining few objects of interest during each epoch in step the best classifier on the sample collected so far where has mix of queried and predicted labels radius which is based on the level of concentration we want various empirical quantities to satisfy and the set consisting of all the classifiers with empirical regret at most on within the epoch pm determines the probability of querying an example in the disagreement region for this set am of good classifiers examples outside this the assumption that is finite can be relaxed to using standard arguments if the label is not queried we produce an ignored example of weight zero its only purpose is to maintain the correct count of querying opportunities this ensures that is the correct normalization in err algorithm active over ac input constants confidence error radius parameters for op epoch schedule τm satisfying for initialize epoch log where log query the labels yi of the first three unlabeled examples xi and set pmin and xj yj for do if τm then set and let arg 
min err err log τm and err err compute the solution to the problem op and increment end if if next unlabeled point xi dm dis am then toss coin with bias pm xi add example xi yi xi to if outcome is heads otherwise add xi to see footnote else add example with predicted label xi hm xi to end if end for return hm arg err region are not queried but given labels predicted by hm so error estimates are not unbiased ac computes pm by solving the optimization problem op which is further discussed below the objective function of op encourages small query probabilities in order to minimize the label complexity the constraints in op bound the variance in our regret estimates for every this is key to ensuring good generalization as we will later use bernsteinstyle bounds which rely on our random variables having small variance more specifically the lhs of the constraints measures the variance in our empirical regret estimates for measured only on the examples in the disagreement region dm this is because the importance weights in the form of are only applied to these examples outside this region we use the predicted labels with an importance weight of the rhs of the constraint consists of three terms the first term ensures the feasibility of the problem as for dm will always satisfy the constraints the second empirical regret term makes the constraints easy to satisfy for bad is crucial to rule out large label complexities in case there are bad hypotheses that disagree very often with hm benefit of this is easily seen when which might have terrible regret but would force query probability on the disagreement region if finally the third term will be on the same order as the second one for hypotheses in am and is only included to capture the allowed level of slack in our constraints which will be exploited for the efficient implementation in section in addition to controlled variance good concentration also requires the random variables of interest to be appropriately bounded this is ensured through the constraints which impose minimum query probability on the disagreement region outside the disagreement region we use the predicted label with an importance weight of so that our estimates will always be bounded albeit biased in this region note that this optimization problem is written with respect to the marginal distribution of the data points px meaning that we might have infinitely many of the latter constraints in section we describe how to solve this optimization problem efficiently and using access to only unlabeled examples drawn from px algorithm requires several input parameters which must satisfy log log the first three parameters and control the tightness of the variance constraints the next three parameters and control the threshold that defines the set of empirically good classifiers is used in the minimum probability and can be simply set to optimization problem op to compute pm min where hm dm ex bm and dm pmin ex ihm hm dm bm ex ihm γreg hm pmin min err hm log epoch schedules the algorithm takes an arbitrary epoch schedule subject to τm two natural extremes are epochs τm and doubling epochs the main difference lies in the number of times op is solved which is substantial computational consideration unless otherwise stated we assume the doubling epoch schedule where the query probability and erm classifier are recomputed only log times generalization and label complexity we present guarantees on the generalization error and label complexity of algorithm assuming solver for op which we 
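To make the inverse-probability-weighting estimate described above concrete, here is a minimal sketch in Python. The threshold classifier, the query-probability function p_of, and the synthetic data are illustrative stand-ins, not part of the algorithm as specified; the point is only that weighting each queried label by 1/p yields an unbiased estimate of err(h).

import random

def iw_error_estimate(h, examples):
    # examples: list of (x, y, p, queried), where p is the probability with which
    # the label of x was queried and queried records the coin flip. Queried labels
    # contribute an importance weight of 1/p; unqueried examples contribute 0, so
    # the estimate is unbiased for err(h). Dividing by len(examples) keeps the
    # normalization over all querying opportunities, queried or not.
    total = 0.0
    for x, y, p, queried in examples:
        if queried:
            total += (1.0 / p) * (1 if h(x) != y else 0)
    return total / len(examples)

# toy check of unbiasedness with a threshold classifier on [0, 1] (all names illustrative)
h = lambda x: int(x > 0.5)
data = [(random.random(), random.randint(0, 1)) for _ in range(20000)]
p_of = lambda x: 0.2 + 0.6 * abs(x - 0.5)            # an arbitrary query probability
sample = [(x, y, p_of(x), random.random() < p_of(x)) for x, y in data]
full_err = sum(h(x) != y for x, y in data) / len(data)
print(full_err, iw_error_estimate(h, sample))         # the two numbers should be close

The two printed numbers should agree up to sampling noise, which is exactly the unbiasedness property the generalization analysis relies on.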
provide in the next section our first theorem provides bound on generalization error define τj dis aj τm and errm log τm for errm essentially is population counterpart of the quantity used in algorithm and crucially relies on errm the true error of restricted to the disagreement region at epoch this tity captures the inherent noisiness of the problem and modulates the transition between to type error bounds as we see next theorem pick any such that then recalling that arg err we have for all epochs with probability at least reg reg for all and the proof is in section of since we use the bound implies that am for all epochs this also maintains that all the predicted labels used by our algorithm are identical to those of since no disagreement amongst classifiers in am was observed on those examples this observation will be critical to our proofs where we will exploit the fact that using labels predicted by instead of observed labels on certain examples only introduces bias in favor of thereby ensuring that we never mistakenly drop the optimal classifier from am the bound shows that every classifier in has small regret to since the erm classifier is always in this yields our main generalization error bound on the classifier hτm output by algorithm additionally it also clarifies the definition of the sets am as the set of good classifiers these are classifiers which indeed have small population regret relative to in realizable setting where has zero error leading to regret after unlabeled examples are presented to the algorithm on the other extreme if errm is constant then the regret is there are also interesting regimes in between where err might be constant but errm measured over the disagreement region decreases rapidly more specifically we show in appendix of that the expected regret of the classifier returned by algorithm achieves the optimal rate under the tsybakov noise condition next we provide label complexity guarantee in terms of the disagreement coefficient supr px px theorem with probability at least the number of label queries made by algorithm after examples over epochs is errm nerrm log log the theorem is proved in appendix of the first term of the label complexity bound is linear in the number of unlabeled examples but can be quite small if is small or if errm is indeed in the realizable setting the second term grows at most as but also becomes constant for realizable problems consequently we attain logarithmic label complexity in the realizable setting in noisy settings our label complexity improves upon that of predecessors such as beygelzimer et al obtain label complexity of exponentially worse for realizable problems related algorithm oracular cal has label complexity scaling with nerr but worse dependence on in all comparisons the use of errm provides qualitatively superior analysis to all previous results depending on err since this captures the fact that noisy labels outside the disagreement region do not affect the label complexity finally as in our regret analysis we show in appendix of that the label complexity of algorithm achieves the lower bound under tsybakov condition section of gives an example where the label complexity of algorithm is significantly smaller than both iwal and oracular cal by virtue of rarely querying in the disagreement region the example considers distribution and classifier space with the following structure for most examples single good classifier predicts differently from the remaining classifiers ii on few examples half the classifiers 
predict one way and half the other in the first case little advantage is gained from label because it provides evidence against only single classifier active over queries over the disagreement region with probability close to pmin in case and probability in case ii while others query with probability everywhere implying times more queries efficient implementation the computation of hm is an erm operation which can be performed efficiently whenever an efficient passive learner is available however several other hurdles remain testing for dis am in the algorithm as well as finding solution to op are considerably more challenging the epoch schedule helps but op is still solved log times necessitating an extremely efficient solver starting with the first issue we follow dasgupta et al who cleverly observed that dm dis am can be efficiently determined using single call to an erm oracle specifically to apply their method we use the oracle to arg min err hm it can then be argued that dm dis am if and only if the regret of that is reg hm is at most solving op efficiently is much bigger challenge because it is enormous there is one variable for every point one constraint for each classifier and bound constraints on for every this leads to infinitely many variables and constraints with an erm oracle being the only computational primitive available we eliminate the bound constraints using barrier functions notice that the objective ex is already barrier at to enforce the lower bound we modify the objective to dm ex ex where is parameter chosen momentarily to ensure pmin for all dm thus the modified goal is to minimize over subject only to we solve the problem in the dual where we have large but finite number of optimization variables and efficiently maximize the dual using coordinate ascent with access to an erm oracle over let λh denote the see appendix of for how to deal with one constraint with an unconstrained oracle algorithm coordinate ascent algorithm to solve op input accuracy parameter initialize loop rescale where ms arg find arg max ex bm pλ if ex bm then return else ex bm update as ex end if end loop lagrange multiplier for the constraint for classifier then for any we can minimize the lagrangian over each primal variable yielding the solution dm qλ pλ where qλ λh ihm qλ ihm and hm dm clearly pλ for all dm so all the bound constraints in op are satisfied if we choose plugging the solution pλ into the lagrangian we obtain the dual problem of maximizing the dual objective ex dm qλ λh bm over the constant is equal to dm where pr dm pr dm an algorithm to approximately solve this problem is presented in algorithm the algorithm takes parameter specifying the degree to which all of the constraints are to be approximated since is concave the rescaling step can be solved using straightforward numerical line search the main implementation challenge is in finding the most violated constraint step fortunately this step can be reduced to single call to an erm oracle to see this note that the constraint violation on classifier mh can written as bm ex dm hm ex err err hm the second term of the expression is simply the scaled risk classification error of with respect to the actual labels the first term is the risk of in predicting samples which have been labeled according to hm with importance weights of if dm and otherwise note that these weights may be positive or negative the last two terms do not depend on thus given access to px or samples approximating it discussed shortly the most violated 
constraint can be found by solving an erm problem defined on the labeled samples in and samples drawn from px labeled by hm with appropriate importance weights detailed in appendix of when all primal constraints are approximately satisfied the algorithm stops we have the following guarantee on the convergence of the algorithm theorem when run on the epoch algorithm halts in at most pr dm iterations and outputs solution such that satisfies the simple bound constraints in exactly the variance constraints in up to an additive factor of and ex ex pr dm where is the solution to op furthermore pr dm if is set to an amount of constraint violation tolerable in our analysis the number of iterations hence the number of erm oracle calls in theorem is at most the proof is in appendix of oac table summary of performance metrics ora ora oac ora passive solving op with expectation over samples so far we considered solving op defined on the unlabeled data distribution px which is unavailable in practice natural substitute for px is an sample drawn from it in appendix of we show that solving sample variant of op leads to solution to the original op with similar guarantees as in theorem experiments with agnostic active learning while ac is efficient in the number of erm oracle calls it needs to store all past examples resulting in large space complexity as theorem suggests the query probability function may need as many as classifiers further increasing storage demand aiming at scalable implementation we consider an online approximation of ac given in section of the main differences from ac are instead of batch erm oracle it invokes an online oracle and instead of repeatedly solving op from scratch it maintains set of classifiers and hence dual variables called the cover for representing the query probability and updates the cover with every new example in manner similar to the coordinate ascent algorithm for solving op we conduct an empirical comparison of the following efficient agnostic active learning algorithms oac online approximation of active over algorithm in section of and the algorithm of and variant that uses tighter threshold ora oac ora and ora versions of oac passive passive learning on labeled drawn uniformly at random and details about these algorithms are in section of the differences among these algorithms are best explained in the context of the disagreement region oac does importanceweighted querying of labels with an optimized query probability in the disagreement region while using predicted labels outside and maintain minimum query probability everywhere ora oac ora and ora query labels in their respective disagreement regions with probability using predicted labels otherwise we implemented these algorithms in vowpal wabbit http fast learning system based on online convex optimization using logistic regression as the erm oracle we performed experiments on binary classification datasets with varying sizes to and diverse feature characteristics details about the datasets are in appendix of our goal is to evaluate the test error improvement per label query achieved by different algorithms to simulate the streaming setting we randomly permuted the datasets ran the active learning algorithms through the first of data and evaluated the learned classifiers on the remaining we repeated this process times to reduce variance due to random permutation for each active learning algorithm we obtain the test error rates of classifiers trained at doubling numbers of label queries starting from to 
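To summarize the querying behaviors just contrasted, the following schematic sketch shows the per-example logic shared by the algorithm above and its online approximation oac: query inside the disagreement region with the optimized probability and importance-weight the label by 1/p, use predicted labels with weight 1 outside. All names (process_example, query_prob, p_min, the threshold classifiers in the demo) are illustrative and not the implementation used in the experiments.

import random

def in_disagreement_region(classifiers, x):
    # x lies in DIS(A) iff at least two classifiers in A disagree on it
    return len({h(x) for h in classifiers}) > 1

def process_example(x, get_label, good_set, h_best, query_prob, p_min, labeled):
    # one step of the streaming loop: inside the disagreement region, query the
    # true label with the optimized probability (at least p_min) and weight it by
    # 1/p; outside, trust the predicted label with weight 1 (biased, but only in
    # ways that favor the optimal classifier, as discussed above)
    if in_disagreement_region(good_set, x):
        p = max(p_min, query_prob(x))
        if random.random() < p:
            labeled.append((x, get_label(x), 1.0 / p))
        else:
            labeled.append((x, None, 0.0))   # keeps the count of querying opportunities
    else:
        labeled.append((x, h_best(x), 1.0))

# tiny demo with threshold classifiers on [0, 1]
good_set = [lambda x, t=t: int(x > t) for t in (0.45, 0.5, 0.55)]
labeled = []
for _ in range(1000):
    x = random.random()
    process_example(x, lambda z: int(z > 0.5), good_set, good_set[1],
                    lambda z: 0.5, 0.1, labeled)
print(len(labeled), "examples;",
      sum(1 for _, _, w in labeled if w > 1.0), "true labels queried")   # weight 1/p > 1 marks a query in this demo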
formally let errora denote the test error of the classifier returned by algorithm using setting on the permutation of dataset immediately after hitting the label budget let querya be the actual number of label queries made which can be smaller than when algorithm reaches the end of the training data before hitting that label budget to evaluate an algorithm we consider the area under its curve of test error against log number of label queries querya auca errora errora querya good active learning algorithm has small value of auc which indicates that the test error decreases quickly as the number of label queries increases we use logarithmic scale for the number of label queries to focus on the performance with few label queries where active learning is the most relevant more details about are in appendix of relative improvement in test error relative improvement in test error oac passive baseline number of label queries oac passive baseline number of label queries best per dataset best fixed figure average relative improvement in test error number of label queries we measure the performance of each algorithm by the following two aggregated metrics aucbase auca mean max median aucbase aucbase auca max mean median aucbase where aucbase denotes the auc of passive using default setting learning rate of see appendix of the first metric shows the maximal gain each algorithm achieves with the best setting for each dataset while the second shows the gain by using the single setting that performs the best on average across datasets results and discussions table gives summary of the performances of different algorithms when using optimized on basis top row in table oac achieves the largest improvement over the passive baseline with achieving almost the same improvement and improving slightly less variants perform worse but still do better than passive with the best learning rate for each dataset which leads to an average of improvement in auc over the default learning rate when using the best fixed setting across all datasets bottom row in table all active learning algorithms achieve less improvement compared with passive improvement with the best fixed learning rate in particular oac gets only improvement this suggests that careful tuning of is critical for oac and an important direction for future work figure describes the behaviors of different algorithms in more detail for each algorithm we identify the best fixed setting aucbase auca arg max mean median aucbase and plot the relative test error improvement by using averaged across all datasets at the label budgets errorbase errora mean median errorbase all algorithms including passive perform similarly during the first few hundred label queries performs the best at label budgets larger than while does almost as well ora oac is the next best followed by ora and ora oac performs worse than passive except at label budgets between and in figure we plot results obtained by each algorithm using the best setting for each dataset aucbase auca pd arg max median aucbase as expected all algorithms perform better but oac benefits the most from using the best hyperparameter setting per dataset appendix of gives more detailed results including test error rates obtained by all algorithms at different label query budgets for individual datasets in sum when using the best fixed setting outperforms other algorithms when using the best setting tuned for each dataset oac and perform equally well and better than other algorithms references balcan and phil long active 
and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory.
Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the International Conference on Machine Learning. ACM.
Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In Proceedings of the Annual Conference on Learning Theory.
Beygelzimer, Dasgupta, and Langford. Importance weighted active learning. In ICML.
Beygelzimer, Hsu, Langford, and Zhang. Agnostic active learning without constraints. In NIPS.
Castro and Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory.
Cohn, Atlas, and Ladner. Improving generalization with active learning. Machine Learning.
Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems.
Dasgupta, Hsu, and Monteleoni. A general agnostic active learning algorithm. In NIPS.
Hanneke. Theoretical foundations of active learning. PhD thesis, Carnegie Mellon University.
Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning.
Horvitz and Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association.
Daniel Hsu. Algorithms for active learning. PhD thesis, University of California at San Diego.
Tzu-Kuo Huang, Alekh Agarwal, Daniel Hsu, John Langford, and Robert Schapire. Efficient and parsimonious agnostic active learning. arXiv preprint.
Nikos Karampatziakis and John Langford. Online importance weight aware updates. In UAI: Proceedings of the Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain.
Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. J. Mach. Learn. Res.
Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics.
Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems.
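As a footnote to the experimental protocol above, here is one plausible reading of the auc metric (area under the curve of test error against the log number of label queries), computed with the trapezoid rule; the error rates and label budgets in the demo are made up for illustration.

import math

def auc_log_queries(errors, queries):
    # errors[i] is the test error after queries[i] label queries (the doubling
    # budgets described above); the area is accumulated on a log query axis so
    # that the small-budget regime, where active learning matters most, dominates
    area = 0.0
    for i in range(len(queries) - 1):
        width = math.log(queries[i + 1]) - math.log(queries[i])
        area += 0.5 * (errors[i] + errors[i + 1]) * width
    return area

print(auc_log_queries([0.30, 0.22, 0.18, 0.17], [10, 20, 40, 80]))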
softstar probabilistic inference mathew monfort computer science department university of illinois at chicago chicago il brenden lake center for data science new york university new york ny brenden patrick lucey disney research pittsburgh pittsburgh pa brian ziebart computer science department university of illinois at chicago chicago il bziebart joshua tenenbaum brain and cognitive sciences department massachusetts institute of technology cambridge ma jbt abstract recent machine learning methods for sequential behavior prediction estimate the motives of behavior rather than the behavior itself this abstraction improves generalization in different prediction settings but computing predictions often becomes intractable in large decision spaces we propose the softstar algorithm softened search technique for the maximum entropy inverse optimal control model of sequential behavior this approach supports probabilistic search with bounded approximation error at significantly reduced computational cost when compared to sampling based methods we present the algorithm analyze approximation guarantees and compare performance with inference on two distinct complex decision tasks introduction inverse optimal control ioc also known as inverse reinforcement learning and inverse planning has become powerful technique for learning to control or make decisions based on expert demonstrations ioc estimates the utilities of decision process that rationalizes an expert demonstrated control sequences those estimated utilities can then be used in an optimal controller to solve new decision problems producing behavior that is similar to demonstrations predictive extensions to ioc recognize the inconsistencies and inherent suboptimality of repeated behavior by incorporating uncertainty they provide probabilistic forecasts of future decisions in which stochasticity is due to this uncertainty rather than the stochasticity of the decision process dynamics these models distributions over plans and policies can typically be defined as softened versions of optimal sequential decision criteria key challenge for predictive ioc is that many decision sequences are embedded within large decision processes symmetries in the decision process can be exploited to improve efficiency but decision processes are not guaranteed to be close to symmetric approximation approaches to probabilistic structured prediction include approximate maxent ioc sampling and ioc however few guarantees are provided by these approaches they are not complete and the set of variable assignments uncovered may not be representative of the model distribution seeking to provide stronger guarantees and improve efficiency over previous methods we present softstar probabilistic search algorithm for inverse optimal control our approach generalizes the search algorithm to calculate distributions over decision sequences in predictive ioc settings allowing for efficient bounded approximations of the path distribution through decision space this distribution can then be used to update set of trainable parameters that motivate the behavior of the decision process via function we establish theoretical guarantees of this approach and demonstrate its effectiveness in two settings learning stroke trajectories for latin characters and modeling the decision process of professional soccer background graphs and search in this work we restrict our consideration to deterministic planning tasks with discrete state spaces the space of plans and their costs can be succinctly 
represented using graph cost with vertices representing states of the planning task and directed edges eab representing available transitions between states sa and sb the neighbor set of state is the set of states to which has directed edge and cost function cost represents the relative desirability of transitioning between states and the optimal plan from state to goal state st is sequence of states st forming path through the graph minimizing cumulative penalty letting represent the cost of the optimal path from state to state st the or value of and defining st the optimal path corresponds to solution of the next state selection criterion min cost argmin cost st st the optimal path distance to the start state can be similarly defined with as min cost dynamic programming algorithms such as dijkstra search the space of paths through the graph in order of increasing to find the optimal path doing so implicitly considers all paths up to the length of the optimal path to the goal additional knowledge can significantly reduce the portion of the state space needed to be explored to obtain an optimal plan for example search explores partial state sequences by expanding states that minimize an estimate combining the minimal cost to reach state with heuristic estimate of the remaining priority queue is used to keep track of expanded states and their respective estimates search then expands the state at the top of the queue lowest and adds its neighboring states to the queue when the heuristic estimate is admissible the algorithm terminates with guaranteed optimal solution once the best unexpanded state estimate is worse than the best discovered path to the goal predictive inverse optimal control maximum entropy ioc algorithms estimate stochastic action policy that is most uncertain while still guaranteeing the same expected cost as demonstrated behavior on an unknown cost function for planning settings with deterministic dynamics this yields probability distribution over state sequences that are consistent with paths through the graph where costθ θt st is linearly weighted vector of features combined using the feature function st and learned parameter vector calculating the marginal state probabilities of this distribution is important for estimating model parameters the algorithm can be employed but for large it may not be practical approach motivated by the efficiency of search algorithms for optimal planning we define an analogous approximation task in the predictive inference setting and present an algorithm that leverages heuristic functions to accomplish this task efficiently with bounded approximation error the problem being addressed is the inefficiency of existing inference methods for probabilistic models of behavior we present method using ideas from search for estimating path distributions through large scale deterministic graphs with bounded approximation guarantees this is an improvement over previous methods as it results in more accurate distribution estimations without the concerns of path sampling and is suitable for any problem that can be represented as such graph additionally since the proposed method does not sample paths but instead searches the space as in it does not need to retrace its steps along previously searched trajectory to find new path to the goal it will instead create new branch from an already explored state sampling would require retracing an entire sequence until this branching state was reached this allows for improvements in efficiency in addition to the 
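A compact sketch of the best-first expansion just described: states are expanded in order of f = g + h, where g is the cheapest known cost from the start and h is a heuristic estimate of the remaining cost; with a consistent heuristic the first pop of the goal yields the optimal cost. Function names are illustrative; this is the optimal-planning baseline that the softened inference below relaxes.

import heapq

def a_star(neighbors, cost, heuristic, start, goal):
    # g: best discovered cost from start; the priority queue is ordered by g + h
    g = {start: 0.0}
    queue = [(heuristic(start), start)]
    done = set()
    while queue:
        f, s = heapq.heappop(queue)
        if s == goal:
            return g[s]
        if s in done:
            continue
        done.add(s)
        for t in neighbors(s):
            new_g = g[s] + cost(s, t)
            if new_g < g.get(t, float("inf")):
                g[t] = new_g
                heapq.heappush(queue, (new_g + heuristic(t), t))
    return float("inf")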
distribution estimation improvements inference as softened planning we begin our investigation by recasting the inference task from the perspective of softened planning where the predictive ioc distribution over state sequences factors into stochastic policy ehsoft st st according to softened hsoft recurrence that is relaxation of the bellman equation hsoft st log st softmin hsoft θt st st st st where ξst st is the set of all paths from st to st the softmin softmin log is smoothed relaxation of the min and the goal state value is initially and for others similar softened minimum distance exists in the forward direction from the start state dsoft st log softmin dsoft θt st st st by combining forward and backward soft distances important marginal expectations are obtained and used to predict state visitation probabilities and fit the maximum entropy ioc model parameters efficient search and learning require accurate estimates of dsoft and hsoft values since the expected number of occurrences of the transition from sa to sb under the soft path distribution is sa sb sa sb st these and distance functions can be computed in using geometric series where ai si sj for any states si and sj the th entry of is related to the softmin of all the paths from si to sj specifically the softened can be written as hsof si log bsi st unfortunately the required matrix inversion operation is computationally expensive preventing its use in typical inverse optimal control applications in fact power iteration methods used for sparse matrix inversion closely resemble the softened bellman updates of equation that have instead been used for ioc equivalently min softmin min is employed to avoid in pracx tice challenges and approximation desiderata in contrast with optimal control and planning tasks softened distance functions dsoft and functions hsoft in predictive ioc are based on many paths rather than single best one thus unlike in search each path can not simply be ignored its influence must instead be incorporated into the softened distance calculation this key distinction poses significantly different objective for probabilistic search find subset of paths for which the softmin distances closely approximate the softmin of the entire path set while we would hope that small subset of paths exists that provides close approximation the cost function weights and the structure of the graph ultimately determine if this is the case with this in mind we aim to construct method with the following desiderata for an algorithm that seeks small approximation set and analyze its guarantees known bounds on approximation guarantees convergence to any desired approximation guarantee efficienct finding small approximation sets of paths regimes of convergence in search theoretical results are based on the assumption that all infinite length paths have infinite cost any cycle has positive cost this avoids negative cost cycle regime of leading to stronger requirement for our predictive setting are three regimes of convergence for the predictive ioc distribution characterized by an most likely plan most likely plan with expected plans and finite expected plan length the first regime results from the same situation described for optimal planning reachable cycles of negative cost the second regime arises when the number of paths grows faster than the penalization of the weights from the additional cost of longer paths without negative cycles and is the final regime is convergent an additional assumption is needed in the predictive ioc 
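The softened recurrence above can be sketched directly: softmin(x_1, ..., x_n) = -log sum_i exp(-x_i), and h_soft is obtained by repeated soft Bellman backups with h_soft(goal) = 0. This is a minimal illustration assuming a finite state set and the convergent regime discussed below; it corresponds to the power-iteration view of the computation, not the search-based method this paper proposes.

import math

def softmin(values):
    # softmin(x1..xn) = -log sum_i exp(-xi), a smoothed relaxation of min
    m = min(values)
    return m - math.log(sum(math.exp(m - v) for v in values))

def softened_values(states, neighbors, cost, goal, sweeps=100):
    # soft Bellman backups: h_soft(goal) = 0 and
    # h_soft(s) = softmin over successors s' of [cost(s, s') + h_soft(s')]
    h = {s: (0.0 if s == goal else float("inf")) for s in states}
    for _ in range(sweeps):
        for s in states:
            if s == goal:
                continue
            vals = [cost(s, t) + h[t] for t in neighbors(s)]
            vals = [v for v in vals if v < float("inf")]
            if vals:
                h[s] = softmin(vals)
    return h

# tiny demo: chain a -> b -> goal plus a direct edge a -> goal
states = ["a", "b", "goal"]
nbrs = {"a": ["b", "goal"], "b": ["goal"], "goal": []}
c = {("a", "b"): 1.0, ("b", "goal"): 1.0, ("a", "goal"): 3.0}
h = softened_values(states, lambda s: nbrs[s], lambda s, t: c[(s, t)], "goal")
print(h["a"])   # about 1.69, below the best single-path cost of 2.0 because both paths contribute

The printed value is smaller than the optimal path cost precisely because the softmin aggregates every path rather than ignoring all but the best one, which is the key distinction from A* emphasized above.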
setting to avoid the second regime of nonconvergence we assume that fixed bound on the entropy of the distribution of paths log hmax is known theorem expected costs under the predictive ioc distribution are related to entropy and softmin path costs by costθ dsoft st together bounds on the entropy and softmin distance function constrain expected costs under the predictive ioc distribution theorem computing approximation error bounds search with heuristic function guarantees optimality when the priority queue minimal element has an estimate dsoft exceeding the best path cost dsoft st though optimality is no longer guaranteed in the softmin search setting approximations to the softmin distance are obtained by considering subset of paths lemma lemma let represent the entire set potentially infinite in size of paths from state to st we can partition the set into two sets ξa and ξb such that ξa ξb and ξa ξb and define dξ soft as the softmin over all paths in set then given lower bound estimate for the distance ξa ˆξb dˆsoft dsoft we have we establish bound on the error introduced by considering the set of paths through set of states in the following theorem theorem given an approximation state subset with neighbors of the approximation set defined as the approximation loss of exact search for paths through this approximation set paths with vertices from and terminal vertices from is bounded by the softmin of the set neighbors estimates st st dsoft where is the softmin of all paths with terminal state and soft all previous states within thus for dynamic construction of the approximation set bound on approximation error can be maintained by tracking the weights of all states in the neighborhood of that set in practice even computing the exact softened distance function for paths through small subset of states may be computationally impractical theorem establishes the approximate search bounds when only subset of paths in the approximation set are employed to compute the soft distance theorem if subset of paths and represents set of paths that are prefixes for all of the remaining paths within through the approximation set is employed to compute the soft distance the error of the resulting estimate is bounded by st st softmin dsoft dsoft softstar greedy forward path exploration and backward estimation our algorithm greedily expands nodes by considering the state contributing the most to the approximation bound theorem this is accomplished by extending search in the following algorithm algorithm softstar greedy forward and approximate backward search with fixed ordering input graph initial state goal st heuristic and approximation bound output approximate soft distance to goal hssoft set hsoft dsoft fsoft hsoft st dsoft and fsoft insert fsoft into priority queue and initialize empty stack while softmin fsoft dsoft st do set min element popped from push onto for do fsoft softmin fsoft dsoft dsoft softmin dsoft dsoft insert fsoft into end end while not empty do set top element popped from for do hsoft softmin hsoft hsoft cost end end return hsoft for insertions to the priority queue if already exists in the queue its estimate is updated to the softmin of its previous estimate and the new insertion estimate additionally the softmin of all of the estimates of elements on the queue can be dynamically updated as elements are added and removed the queue contains some states that have never been explored and some that have the former correspond to the neighbors of the approximation state set and the latter 
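A simplified sketch of the forward/backward structure of the algorithm above: a priority queue ordered by f_soft = d_soft + heuristic drives greedy expansion, the exploration order is recorded, and a backward pass over the reversed order produces the h_soft values. To stay short, each state is expanded at most once and the queue-softmin stopping rule is replaced by an expansion cap, so this illustrates the bookkeeping rather than the full algorithm with its approximation guarantees; all names are illustrative.

import heapq, math

def softmin(vals):
    m = min(vals)
    return m - math.log(sum(math.exp(m - v) for v in vals))

def softstar_sketch(neighbors, cost, heuristic, start, goal, max_expansions=100000):
    # forward pass: greedily expand states by f_soft = d_soft + heuristic,
    # accumulating soft distances from the start via softmin updates
    d = {start: 0.0}
    queue = [(heuristic(start), start)]
    order, expanded = [], set()
    while queue and len(order) < max_expansions:
        f, s = heapq.heappop(queue)
        if s in expanded or s == goal:
            continue
        expanded.add(s)
        order.append(s)
        for t in neighbors(s):
            new_d = d[s] + cost(s, t)
            d[t] = softmin([d[t], new_d]) if t in d else new_d
            heapq.heappush(queue, (d[t] + heuristic(t), t))
    # backward pass over the reversed exploration order gives the soft values h_soft,
    # from which transition visitation expectations can be computed as in the text
    h = {goal: 0.0}
    for s in reversed(order):
        vals = [cost(s, t) + h[t] for t in neighbors(s) if t in h]
        if vals:
            h[s] = softmin(vals)
    return d, h, order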
correspond to the search approximation error within the approximation state set theorem the softmin over all elements of the priority queue thus provides bound on the approximation error of the returned distance measure the exploration order is stack containing the order that each state is loop through the reverse of the node exploration ordering stack generated by the forward search computes complementary backward values hsoft the expected number of rences of state transitions can then be calculated for the approximate distribution the bound on the difference between the expected path cost of this approximate distribution and the actual distribution over the entire state set is established in theorem theorem the cost expectation inaccuracy introduced by employing state set is bounded by costθ costθ st fsoft ep costθ costθ where is the expectation under the approximate state set produced by the algorithm softmin fsoft is the softmin of fsoft for all the states remaining on the priority queue after the first while loop of algorithm and ep is the expectation over all paths not considered in the second while loop remaining on the queue ep is unknown but can be bounded using theorem completeness guarantee the notion of monotonicity extends to the probabilistic setting guaranteeing that the expansion of state provides no looser bounds than the unexpanded state definition definition heuristic function is monotonic if and only if softmin cost assuming this the completeness of the proposed algorithm can be established theorem theorem for monotonic heuristic functions and finite softmin distances convergence to any level of softmin approximation is guaranteed by algorithm experimental validation we demonstrate the effectiveness of our approach on datasets for latin character construction using sequences of pen strokes and decisions of professional soccer players in both cases we learn the parameters of cost function that motivates the behavior in the demonstrated data and using the softstar algorithm to estimate the feature distributions needed to update the parameters of the cost function we refer to the appendix for more information we focus our experimental results on estimating feature distributions through large state spaces for inverse optimal control as there is lot of room for improvement over standard approaches which typically use sampling based methods to estimate the distributions providing few if any approximation guarantees softstar directly estimates this distribution with bounded approximation error allowing for more accurate estimation and more informed parameter updates comparison approaches we compare our approach to heuristic guided maximum entropy sampling approximate maximum entropy sampling reversible jump markov chain monte carlo mcmc and search that is not guided by heuristics comparable to dijkstra algorithm for planning for consistency we use the softmin distance to generate the values of each state in mcmc results were collected on an intel cpu at character drawing we apply our approach to the task of predicting the sequential pen strokes used to draw characters from the latin alphabet the task is to learn the behavior of how person draws character given some nodal skeleton despite the apparent simplicity applying standard ioc methods is challenging due to the large planning graph corresponding to representation of the task we demonstrate the effectiveness of our method against other commonly employed techniques demonstrated data the data consists of randomly separated 
training set of drawn characters each with unique demonstrated trajectory and separate test set of examples where the handwritten characters are converted into skeletons of nodes within unit character frame for example the character in figure was drawn using two strokes red and green respectively the numbering indicates the start of each stroke state and feature representation the state consists of two node history previous and current node and bitmap signifying which edges are the state space size is with edges and figure character nodes the number of nodes is increased by one to account for the skeleton with two tial state for example character with nodes and edges with has pen strokes corresponding state space of about million states the initial state has no nodal history and bitmap with all uncovered edges the goal state will have two node history as defined above and fully set bitmap representing all edges as covered any transition between nodes is allowed with transitions between neighbors defined as edge draws and all others as pen lifts the appendix provides additional details on the feature representation heuristic we consider heuristic function that combines the soft minimum costs of covering each remaining uncovered edge in character assuming all moves that do not cross an uncovered edge have zero cost formally it is expressed using the set ofx uncovered edges eu and the set of all possible costs of traversing edge cost ei as softmin cost ei ei ei professional soccer in addition we apply our approach to the task of modeling the discrete spatial decision process of the for single possession open plays in professional soccer as in the character drawing task we demonstrate the effectiveness of our approach against other commonly employed techniques demonstrated data tracking information from games consisting of player locations and time steps of significant were into sets of sequential actions in single possessions each possession may include multiple different handling the ball at different times resulting in team decision process on actions rather than single player discretizing the soccer field into cells leads to very large decision process when considering actions to each cell at each step we increase generalization by reformatting the field coordinates so that the origin lies in the center of the team goal and all playing fields are normalized to by and discretized into cells formatting the field coordinates based on the distances from the goal of the team in possession doubles the amount of training data for similar coordinates the positive and negative half planes of the axis capture which side of the goal the ball is located on we train spatial decision model on of the games and evaluate the learned ball trajectories on single test game the data contains training possession sequences and test sequences state and feature representation the state consists of two action history where an action is designated as tuple where the type is the action pass shot clear dribble or cross and the cell is the destination cell with the most recent action containing the ball current location there are possible actions at each step in trajectory resulting in about million possible states there are euclidean features for each action type and that apply to all action types resulting in total use the same features as the character drawing model and include different set of features for each action type to learn unique action based cost functions heuristic we use the softmin cost over all 
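As a small illustration, the character-drawing heuristic described above, the softmin of the possible traversal costs of each uncovered edge summed over all uncovered edges, can be written as follows; the edge names and candidate costs in the demo are made up.

import math

def softmin(vals):
    m = min(vals)
    return m - math.log(sum(math.exp(m - v) for v in vals))

def character_heuristic(uncovered_edges, traversal_costs):
    # for every uncovered edge, take the softmin over all candidate costs of
    # traversing that edge (moves that do not cross an uncovered edge are treated
    # as free), then sum over the uncovered edges; traversal_costs maps an edge
    # to its list of candidate costs (an illustrative encoding)
    return sum(softmin(traversal_costs[e]) for e in uncovered_edges)

print(character_heuristic({("a", "b"), ("b", "c")},
                          {("a", "b"): [1.0, 1.4], ("b", "c"): [0.8]}))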
possible actions from the current state as heuristic it is admissible if the next state is assumed to always be the goal softmin cost comparison of learning efficiency we compare softstar to other inference procedures for large scale ioc and measure the average test set equivalent to the difference between the cost of the demonstrated path cost and the softmin distance to the goal dsoft goal log path cost dsoft goal approximate max ent heuristic max ent softstar average test average test after each training epoch approximate max ent heuristic max ent softstar training epoch training epoch figure training efficiency on the character left and soccer domains right figure shows the decrease of the test set after each training epoch the proposed method learns the models far more efficiently than both approximate max ent ioc and heuristic guided sampling this is likely due to the more accurate estimation of the feature expectations that results from searching the graph rather than sampling trajectories the improved efficiency of the proposed method is also evident if we analyze the respective time taken to train each model softstar took hours to train epochs for the character model and hours to train epochs for the soccer model to compare heuristic sampling took hours for the character model and hours for the soccer model and approximate max ent took hours for the character model and hours for the soccer model analysis of inference efficiency in addition to evaluating learning efficiency we compare the average time efficiency for generating lower bounds on the estimated softmin distance to the goal for each model in figure mcmc approximate max ent heuristic max ent soft star estimated softmin distance estimated softmin distance softmin distance estimation as function of time mcmc approximate max ent heuristic max ent softstar seconds seconds figure inference efficiency evaluations for the character left and soccer domains right the mcmc approach has trouble with local optima while the unguided algorithm does not experience this problem it instead explores large number of improbable paths to the goal the proposed method avoids low probability paths and converges much faster than the comparison methods mcmc fails to converge on both examples even after seconds matching past experience with the character data where mcmc proved incapable of efficient inference conclusions in this work we extended search techniques for optimal planning to the predictive inverse optimal control setting probabilistic search in these settings is significantly more computationally demanding than search both in theory and practice primarily due to key differences between the min and softmin functions however despite this we found significant performance improvements compared to other ioc inference methods by employing search ideas acknowledgements this material is based upon work supported by the national science foundation under grant no purposeful prediction interaction via understanding intent and goals references peter abbeel and andrew ng apprenticeship learning via inverse reinforcement learning in proceedings international conference on machine learning pages monica babes vukosi marivate kaushik subramanian and michael littman apprenticeship learning about multiple intentions in international conference on machine learning chris baker joshua tenenbaum and rebecca saxe goal inference as inverse planning in conference of the cognitive science society leonard baum an equality and associated maximization technique in 
statistical estimation for probabilistic functions of Markov processes. Inequalities.
Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics.
Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
Arunkumar Byravan, Mathew Monfort, Brian Ziebart, Byron Boots, and Dieter Fox. Graph-based inverse optimal control for robot manipulation. In Proceedings of the International Joint Conference on Artificial Intelligence.
Rina Dechter and Judea Pearl. Generalized best-first search strategies and the optimality of A*. Journal of the ACM.
Edsger Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik.
Peter Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika.
Peter Hart, Nils Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics.
Huang, Amir-massoud Farahmand, Kris Kitani, and Andrew Bagnell. Approximate MaxEnt inverse optimal control and its application for mental simulation of human interactions. In AAAI.
Rudolf Kalman. When is a linear control system optimal? Transactions of the ASME, Journal of Basic Engineering.
Brenden Lake, Ruslan Salakhutdinov, and Josh Tenenbaum. One-shot learning by inverting a compositional causal process. In NIPS.
Mathew Monfort, Brenden Lake, Brian Ziebart, and Joshua Tenenbaum. Predictive inverse optimal control in large decision processes via heuristic-based search. In ICML Workshop on Robot Learning.
Mathew Monfort, Anqi Liu, and Brian Ziebart. Intent prediction and trajectory forecasting via predictive inverse linear-quadratic regulation. In AAAI.
Gergely Neu and Csaba Szepesvári. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings UAI.
Andrew Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning.
Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In Proceedings of the International Joint Conferences on Artificial Intelligence.
Nathan Ratliff, Andrew Bagnell, and Martin Zinkevich. Maximum margin planning. In Proceedings of the International Conference on Machine Learning.
Paul Vernaza and Drew Bagnell. Efficient high dimensional maximum entropy modeling via symmetric partition functions. In Advances in Neural Information Processing Systems.
Brian Ziebart, Andrew Bagnell, and Anind Dey. Modeling interaction via the principle of maximum causal entropy. In International Conference on Machine Learning.
Brian Ziebart, Andrew Maas, Andrew Bagnell, and Anind Dey. Maximum entropy inverse reinforcement learning. In Association for the Advancement of Artificial Intelligence.
grammar as foreign language oriol google vinyals terry koo google terrykoo lukasz google lukaszkaiser slav petrov google slav ilya sutskever google ilyasu geoffrey hinton google geoffhinton abstract syntactic constituency parsing is fundamental problem in natural language processing and has been the subject of intensive research and engineering for decades as result the most accurate parsers are domain specific complex and inefficient in this paper we show that the domain agnostic model achieves results on the most widely used syntactic constituency parsing dataset when trained on large synthetic corpus that was annotated using existing parsers it also matches the performance of standard parsers when trained only on small dataset which shows that this model is highly in contrast to models without the attention mechanism our parser is also fast processing over hundred sentences per second with an unoptimized cpu implementation introduction syntactic constituency parsing is fundamental problem in linguistics and natural language processing that has wide range of applications this problem has been the subject of intense research for decades and as result there exist highly accurate parsers the computational requirements of traditional parsers are cubic in sentence length and while constituency parsers improved in accuracy in recent years they never matched furthermore standard parsers have been designed with parsing in mind the concept of parse tree is deeply ingrained into these systems which makes these methods inapplicable to other problems recently sutskever et al introduced neural network model for solving the general problem and bahdanau et al proposed related model with an attention mechanism that makes it capable of handling long sequences well both models achieve results on large scale machine translation tasks syntactic constituency parsing can be formulated as problem if we linearize the parse tree cf figure so we can apply these models to parsing as well our early experiments focused on the model of sutskever et al we found this model to work poorly when we trained it on standard parsing datasets tokens so we constructed an artificial dataset by labelling large corpus with the berkeleyparser equal contribution vp xx go end vp end vp vp xx figure schematic outline of run of our model on the sentence see text for details to our surprise the model matched the berkeleyparser that produced the annotation having achieved an score of on the test set section of the wsj we suspected that the attention model of bahdanau et al might be more data efficient and we found that it is indeed the case we trained model with attention on the small parsing dataset and were able to achieve an score of on section of the wsj without the use of an ensemble and with an ensemble which matches the performance of the berkeleyparser when trained on the same data finally we constructed second artificial dataset consisting of only parse trees as measured by the agreement of two parsers we trained model with attention on this data and achieved an score of on section of the wsj matching the this result did not require an ensemble and as result the parser is also very fast parsing model let us first recall the lstm model the long memory model of is defined as follows let xt ht and mt be the input control state and memory state at timestep given sequence of inputs xt the lstm computes the ht and the mt as follows it ft ot mt ht sigm xt tanh xt sigm xt sigm xt ft it ot the operator denotes multiplication the matrices 
and the vector are the parameters of the model and all the nonlinearities are computed in deep lstm each subsequent layer uses the of the previous layer for its input sequence the deep lstm defines distribution over output sequences given an input sequence tb bt ata tb softmax wo hta δbt the above equation assumes deep lstm whose input sequence is ata btb so ht denotes element of the of topmost lstm the matrix wo consists of the vector representations of each output symbol and the symbol δb john has dog np nnp vp vbz np dt john has dog nn np nnp np vp vbz np dt nn np vp figure example parsing task and its linearization is kronecker delta with dimension for each output symbol so softmax wo hta δbt is precisely the bt th element of the distribution defined by the softmax every output sequence terminates with special token which is necessary in order to define distribution over sequences of variable lengths we use two different sets of lstm parameters one for the input sequence and one for the output sequence as shown in figure stochastic gradient descent is used to maximize the training objective which is the average over the training set of the log probability of the correct output sequence given the input sequence attention mechanism an important extension of the model is by adding an attention mechanism we adapted the attention model from which to produce each output symbol bt uses an attention mechanism over the encoder lstm states similar to our model described in the previous section we use two separate lstms one to encode the sequence of input words ai and another one to produce or decode the output symbols bi recall that the encoder hidden states are denoted hta and we denote the hidden states of the decoder by dtb hta hta to compute the attention vector at each output time over the input words ta we define uti ati tanh hi dt softmax uti ta ati hi the vector and matrices are learnable parameters of the model the vector ut has length ta and its item contains score of how much attention should be put on the hidden encoder state hi these scores are normalized by softmax to create the attention mask at over encoder hidden states in all our experiments we use the same hidden dimensionality at the encoder and the decoder so is vector and and are square matrices lastly we concatenate with dt which becomes the new hidden state from which we make predictions and which is fed to the next time step in our recurrent model in section we provide an analysis of what the attention mechanism learned and we visualize the normalized attention vector at for all in figure linearizing parsing trees to apply the model described above to parsing we need to design an invertible way of converting the parse tree into sequence linearization we do this in very simple way following traversal order as depicted in figure we use the above model for parsing in the following way first the network consumes the sentence in sweep creating vectors in memory then it outputs the linearized parse tree using information in these vectors as described below we use lstm layers reverse the input sentence and normalize tags an example run of our model on the sentence is depicted in figure top gray edges illustrate attention parameters and initialization sizes in our experiments we used model with lstm layers and units in each layer which we call our input vocabulary size was and we output symbols dropout training on small dataset we additionally used dropout layers one between and and one between and we call this model 
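A minimal sketch of the depth-first linearization used here, with POS tags normalized to XX as described below; the (label, children) tree encoding is an assumption made for illustration, and the bracketing of "john has a dog" follows the example in the figure above.

def linearize(tree):
    # depth-first linearization of a constituency tree into a symbol sequence;
    # tree is (label, children) for internal nodes and (tag, word) for leaves;
    # POS tags are replaced by XX, and words are dropped because the input
    # sentence is consumed separately by the encoder
    label, children = tree
    if isinstance(children, str):          # preterminal: (tag, word)
        return ["XX"]
    out = ["(" + label]
    for child in children:
        out += linearize(child)
    out.append(")" + label)
    return out

# hypothetical bracketing of "john has a dog"
tree = ("S", [("NP", [("NNP", "john")]),
              ("VP", [("VBZ", "has"),
                      ("NP", [("DT", "a"), ("NN", "dog")])])])
print(" ".join(linearize(tree)))
# (S (NP XX )NP (VP XX (NP XX XX )NP )VP )S

Because every opening symbol carries its label and every closing symbol repeats it, the mapping is invertible, which is what lets the decoder's output sequence be read back as a parse tree.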
normalization since pos tags are not evaluated in the syntactic parsing score we replaced all of them by xx in the training data this improved our score by about point which is surprising for standard parsers including pos tags in training data helps significantly all experiments reported below are performed with normalized pos tags input reversing we also found it useful to reverse the input sentences but not their parse trees similarly to not reversing the input had small negative impact on the score on our development set about absolute all experiments reported below are performed with input reversing word vectors the embedding layer for our vocabulary can be initialized randomly or using embeddings we embeddings of size using on corpus these embeddings were used to initialize our network but not fixed they were later modified during training we discuss the impact of in the experimental section we do not apply any other special preprocessing to the data in particular we do not binarize the parse trees or handle unaries in any specific way we also treat unknown words in naive way we map all words beyond our vocabulary to single unk token this potentially underestimates our final results but keeps our framework experiments training data we trained the model described above on different datasets for one we trained on the standard wsj training dataset this is very small training set by neural network standards as it contains only sentences compared to examples even in mnist still even training on this set we managed to get results that match those obtained by parsers to match we created another larger training set of parsed sentences tokens first we collected all publicly available treebanks we used the ontonotes corpus version the english web treebank and the updated and corrected question treebank note that the popular wall street journal section of the penn treebank is part of the ontonotes corpus in total these corpora give us training sentences we held out certain sections for evaluation as described below in addition to this gold standard data we use corpus parsed with existing parsers using the approach of in this approach two parsers our reimplementation of berkeleyparser and reimplementation of zpar are used to process unlabeled sentences sampled from news appearing on the web we select only sentences for which both parsers produced the same parse tree and to match the distribution of sentence lengths of the wsj training corpus is useful because parsers agree much more often on short sentences we call the set of million sentences selected in this way together with the golden sentences described above the corpus after creating this corpus we made sure that no sentence from the development or test set appears in the corpus also after replacing rare words with unknown tokens this operation guarantees that we never see any test sentence during training but it also lowers our score by about points we are not sure if such strict was performed in previous works but even with this we still match all treebanks are available through the linguistic data consortium ldc ontonotes english web treebank and question treebank parser baseline ensemble baseline lstm petrov et al zhu et al petrov et al ensemble zhu et al huang harper mcclosky et al training set wsj only wsj only wsj only berkeleyparser corpus corpus wsj only wsj only wsj only wsj wsj table scores of various parsers on the development and test set see text for discussion in earlier experiments we only used one parser our 
reimplementation of berkeleyparser to create corpus of parsed sentences in that case we just parsed million senteces from news appearing on the web and combined these parsed sentences with the golden corpus described above we call this the berkeleyparser corpus evaluation we use the standard evalb for evaluation and report scores on our developments set section of the penn treebank and the final test set section in table first let us remark that our training setup differs from those reported in previous works to the best of our knowledge no standard parsers have ever been trained on datasets numbering in the hundreds of millions of tokens and it would be hard to do due to efficiency problems we therefore cite the results which are analogous in spirit but use less data table shows performance of our models on the top and results from other papers at the bottom we compare to variants of the berkeleyparser that use on unlabeled data or built an ensemble of multiple parsers or combine both techniques we also include the best parser in the literature the parser of it can be seen that when training on wsj only baseline lstm does not achieve any reasonable score even with dropout and early stopping but single attention model gets to and an ensemble of models achieves matching berkeleyparser on wsj when trained on the large corpus single model achieves and so matches the best previous single model result generating trees the model trained on wsj dataset only produced malformed trees for of the sentences in our development set of all cases and the model trained on full dataset did this for sentences in these few cases where outputs malformed tree we simply add brackets to either the beginning or the end of the tree in order to make it balanced it is worth noting that all cases where produced unbalanced trees were sentences or sentence fragments that did not end with proper punctuation there were very few such sentences in the training data so it is not surprise that our model can not deal with them very well score by sentence length an important concern with the lstm was that it may not be able to handle long sentences well we determine the extent of this problem by partitioning the development set by length and evaluating berkeleyparser baseline lstm model without attention and on sentences of each length the results presented in figure are surprising the difference between the score on sentences of length upto and that upto is for the berkeleyparser for the baseline lstm and for so already the baseline lstm has similar performance to the berkeleyparser it degrades with length only slightly surprisingly shows less degradation with length than berkeleyparser full chart parser that uses lot more memory http berkeleyparser baseline lstm score sentence length figure effect of sentence length on the score on wsj beam size influence our decoder uses beam of fixed size to calculate the output sequence of labels we experimented with different settings for the beam size it turns out that it is almost irrelevant we report report results that use beam size but using beam size only lowers the score of on the development set by and using beam size lowers it by beam sizes above do not give any additional improvements dropout influence we only used dropout when training on the small wsj dataset and its influence was significant single model only achieved an score of on our development set that is over points lower than the of model influence as described in the previous section we initialized the embedding with 
word vectors obtained from to test the influence of this initialization we trained model on the corpus and model on the wsj corpus starting with randomly initialized embeddings the score on our development set was lower for the model and lower for the model vs so the effect of is consistent but small performance on other datasets the wsj evaluation set has been in use for years and is commonly used to compare syntactic parsers but it is not representative for text encountered on the web even though our model was trained on news corpus we wanted to check how well it generalizes to other forms of text to this end we evaluated it on two additional datasets qtb sentences from the question treebank web the first half of each domain from the english web treebank sentences trained on the corpus which only includes text from news achieved an score of on qtb and on web our score on web is higher both than the best score reported in and the best score we achieved with an reimplementation of berkeleyparser trained on data we managed to achieve slightly higher score with the berkeleyparser trained on large corpus on qtb the score of is also lower than the best score of our berkeleyparser still taking into account that there were only few questions in the training data these scores show that managed to generalize well beyond the news language it was trained on parsing speed our model running on cpu using batches of sentences on generic unoptimized decoder can parse over sentences from wsj per second for sentences of all lengths using this is better than the speed reported for this batch size in figure of at sentences per second even though they run on gpu and only on sentences of under words note that they achieve score on this subset of sentences of section while our model at achieves score of on this subset figure attention matrix shown on top is the attention matrix where each column is the attention vector over the inputs on the bottom we show outputs for four consecutive time steps where the attention mask moves to the right as can be seen every time terminal node is consumed the attention pointer moves to the right analysis as shown in this paper the attention mechanism was key component especially when learning from relatively small dataset we found that the model did not overfit and learned the parsing function from scratch much faster which resulted in model which generalized much better than the plain lstm without attention one of the most interesting aspects of attention is that it allows us to visualize to interpret what the model has learned from the data for example in it is shown that for translation attention learns an alignment function which certainly should help translating from english to french figure shows an example of the attention model trained only on the wsj dataset from the attention matrix where each column is the attention vector over the inputs it is clear that the model focuses quite sharply on one word as it produces the parse tree it is also clear that the focus moves from the first word to the last monotonically and steps to the right deterministically when word is consumed on the bottom of figure we see where the model attends black arrow and the current output being decoded in the tree black circle this stack procedure is learned from data as all the parameters are randomly initialized but is not quite simple stack decoding indeed at the input side if the model focuses on position that state has information for all words after since we also reverse the inputs it is 
worth noting that in some examples not shown here the model does skip words related work the task of syntactic constituency parsing has received tremendous amount of attention in the last years traditional approaches to constituency parsing rely on probabilistic grammars cfgs the focus in these approaches is on devising appropriate smoothing techniques for highly lexicalized and thus rare events or carefully crafting the model structure partially alleviate the heavy reliance on manual modeling of linguistic structure by using latent variables to learn more articulated model however their model still depends on cfg backbone and is thereby potentially restricted in its capacity early neural network approaches to parsing for example by also relied on strong linguistic insights introduced incremental sigmoid belief networks for syntactic parsing by constructing the model structure incrementally they are able to avoid making strong independence assumptions but inference becomes intractable to avoid complex inference methods propose recurrent neural network where parse trees are decomposed into stack of independent levels unfortunately this decomposition breaks for long sentences and their accuracy on longer sentences falls quite significantly behind the used neural network to score candidate parse trees their model however relies again on the cfg assumption and furthermore can only be used to score candidate trees rather than for full inference our lstm model significantly differs from all these models as it makes no assumptions about the task as prediction model it is somewhat related to the incremental parsing models pioneered by and extended by such linear time parsers however typically need some constraints and might build up the parse in multiple passes relatedly present excellent parsing results with single pass but require stack to explicitly delay making decisions and transition strategy in order to achieve good parsing accuracies the lstm in contrast uses its short term memory to model the complex underlying structure that connects the pairs recently researchers have developed number of neural network models that can be applied to general problems was the first to propose differentiable attention mechanism for the general problem of handwritten text synthesis although his approach assumed monotonic alignment between the input and output sequences later introduced more general attention model that does not assume monotonic alignment and applied it to machine translation and applied the same model to speech recognition used convolutional neural network to encode input sentence into vector of fixed dimension and used rnn to produce the output sentence essentially the same model has been used by to successfully learn to generate image captions finally already in experimented with applying recurrent neural networks to the problem of syntactic parsing conclusions in this work we have shown that generic approaches can achieve excellent results on syntactic constituency parsing with relatively little effort or tuning in addition while we found the model of sutskever et al to not be particularly data efficient the attention model of bahdanau et al was found to be highly data efficient as it has matched the performance of the berkeleyparser when trained on small parsing dataset finally we showed that synthetic datasets with imperfect labels can be highly useful as our models have substantially outperformed the models that have been used to create their training data we suspect it is the case due 
to the different natures of the teacher model and the student model the student model has likely viewed the teacher errors as noise which it has been able to ignore this approach was so successful that we obtained new result in syntactic constituency parsing with single attention model which also means that the model is exceedingly fast this work shows that domain independent models with excellent learning algorithms can match and even outperform domain specific models acknowledgement we would like to thank amin ahmad dan bikel and jonni kanerva references ilya sutskever oriol vinyals and quoc vv le sequence to sequence learning with neural networks in advances in neural information processing systems pages dzmitry bahdanau kyunghyun cho and yoshua bengio neural machine translation by jointly learning to align and translate arxiv preprint thang luong ilya sutskever quoc le oriol vinyals and wojciech zaremba addressing the rare word problem in neural machine translation arxiv preprint jean kyunghyun cho roland memisevic and yoshua bengio on using very large target vocabulary for neural machine translation arxiv preprint sepp hochreiter and schmidhuber long memory neural computation tomas mikolov kai chen greg corrado and jeffrey dean efficient estimation of word representations in vector space arxiv preprint eduard hovy mitchell marcus martha palmer lance ramshaw and ralph weischedel ontonotes the solution in naacl acl june slav petrov and ryan mcdonald overview of the shared task on parsing the web notes of the first workshop on syntactic analysis of language sancl john judge aoife cahill and josef van genabith questionbank creating corpus of questions in proceedings of iccl acl pages acl july mitchell marcus beatrice santorini and mary ann marcinkiewicz building large annotated corpus of english the penn treebank computational linguistics zhenghua li min zhang and wenliang chen ensemble training for dependency parsing in proceedings of acl pages acl slav petrov leon barrett romain thibaux and dan klein learning accurate compact and interpretable tree annotation in acl acl july muhua zhu yue zhang wenliang chen min zhang and jingbo zhu fast and accurate constituent parsing in acl acl august slav petrov products of random latent variable grammars in human language technologies the annual conference of the north american chapter of the acl pages acl june zhongqiang huang and mary harper pcfg grammars with latent annotations across languages in emnlp acl august david mcclosky eugene charniak and mark johnson effective for parsing in naacl acl june david hall taylor john canny and dan klein sparser better faster gpu parsing in acl michael collins three generative lexicalised models for statistical parsing in proceedings of the annual meeting of the acl pages acl july dan klein and christopher manning accurate unlexicalized parsing in proceedings of the annual meeting of the acl pages acl july james henderson inducing history representations for broad coverage statistical parsing in naacl may james henderson discriminative training of neural network statistical parser in proceedings of the meeting of the acl acl main volume pages july ivan titov and james henderson constituent parsing with incremental sigmoid belief networks in acl acl june ronan collobert deep learning for efficient discriminative parsing in international conference on artificial intelligence and statistics richard socher cliff lin chris manning and andrew ng parsing natural scenes and natural language with recursive neural 
networks in icml adwait ratnaparkhi linear observed time statistical parser based on maximum entropy models in second conference on empirical methods in natural language processing michael collins and brian roark incremental parsing with the perceptron algorithm in proceedings of the meeting of the acl acl main volume pages july alex graves generating sequences with recurrent neural networks arxiv preprint jan chorowski dzmitry bahdanau kyunghyun cho and yoshua bengio continuous speech recognition using recurrent nn first results arxiv preprint nal kalchbrenner and phil blunsom recurrent continuous translation models in emnlp pages oriol vinyals alexander toshev samy bengio and dumitru erhan show and tell neural image caption generator arxiv preprint zoubin ghahramani neural network for learning how to parse tree adjoining grammar thesis university of pennsylvania 
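A minimal sketch of the depth-first linearization used in the parsing model above to turn a parse tree into the decoder's target sequence, assuming trees are given as nested (label, children) tuples. The bracket symbols follow the example figure; the tuple representation and the exact symbol inventory are illustrative choices rather than the authors' code, and leaf POS tags can additionally be normalized to XX as in the training setup described above.

```python
def linearize(node, out=None):
    """Depth-first linearization of a parse tree into a flat symbol sequence.

    Internal nodes emit an opening symbol "(LABEL" before their children and a
    closing symbol ")LABEL" after them; leaves emit only their POS tag, since
    the words themselves are read by the encoder.
    """
    if out is None:
        out = []
    label, children = node
    if isinstance(children, str):      # leaf: (pos_tag, word)
        out.append(label)              # POS tag (normalized to "XX" in the setup above)
    else:
        out.append("(" + label)
        for child in children:
            linearize(child, out)
        out.append(")" + label)
    return out

# "John has a dog ." from the example figure:
tree = ("S",
        [("NP", [("NNP", "John")]),
         ("VP", [("VBZ", "has"),
                 ("NP", [("DT", "a"), ("NN", "dog")])]),
         (".", ".")])
print(" ".join(linearize(tree)))
# (S (NP NNP )NP (VP VBZ (NP DT NN )NP )VP . )S
```

Because every opening symbol carries its label and is matched by a labelled closing symbol, the mapping is invertible: the original tree can be reconstructed from the output sequence by a single left-to-right pass with a stack.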
estimation in trace regression with symmetric positive semidefinite matrices matthias hein department of computer science department of mathematics saarland university germany hein martin slawski ping li department of statistics biostatistics department of computer science rutgers university piscataway nj usa pingli abstract trace regression models have received considerable attention in the context of matrix completion quantum state tomography and compressed sensing estimation of the underlying matrix from approaches promoting notably nuclear norm regularization have enjoyed great popularity in this paper we argue that such regularization may no longer be necessary if the underlying matrix is symmetric positive semidefinite spd and the design satisfies certain conditions in this situation simple least squares estimation subject to an spd constraint may perform as well as approaches with proper choice of regularization parameter which entails knowledge of the noise level tuning by contrast constrained least squares estimation comes without any tuning parameter and may hence be preferred due to its simplicity introduction trace regression models of the form yi tr εi is the parameter of interest to be estimated given measurement matrices xi where and observations yi contaminated by errors εi have attracted considerable interest in statistical inference machine learning and signal processing over the past few years research in these areas has focused on setting with few measurements and being approximately of low rank min such setting is relevant to problems such as matrix completion compressed sensing quantum state tomography and phase retrieval common thread in these works is the use of the nuclear norm of matrix as convex surrogate for its rank in regularized estimation amenable to modern optimization techniques this approach can be seen as natural generalization of aka lasso regularization for the linear regression model that arises as special case of model in which both and xi are diagonal it is inarguable that in general regularization is essential if the situation is less clear if is known to satisfy additional constraints that can be incorporated in estimation specifically in the present paper we consider the case in which and is known to be symmetric positive semidefinite spd sm with denoting the positive semidefinite cone in the space of symmetric real matrices the set sm deserves specific interest as it includes covariance matrices and gram matrices in learning it is rather common for these matrices to be of low rank at least approximately given the widespread use of principal components analysis and kernel approximations in the present paper we focus on the usefulness of the spd constraint for estimation we argue that if is spd and the measurement matrices xi obey certain conditions constrained least squares estimation minm yi tr may perform similarly well in prediction and parameter estimation as approaches employing nuclear norm regularization with proper choice of the regularization parameter including the interesting regime δm where δm dim sm note that the objective in only consists of data fitting term and is hence convenient to work with in practice since there is no free parameter our findings can be seen as extension of recent results on least squares estimation for linear regression related work model with sm has been studied in several recent papers good deal of these papers consider the setup of compressed sensing in which the xi can be chosen by the user with the goal 
to minimize the number of observations required to approximately recover for example in recovery of being from noiseless observations εi by solving feasibility problem over sm is considered which is equivalent to the constrained least squares problem in noiseless setting in recovery from measurements is considered for xi rm yi xi εi tr xi εi with xi xi xi as opposed to where estimation based on nuclear norm regularization is proposed the present work is devoted to estimation while measurements as in are also in the center of interest herein our framework is not limited to this case in an application of to covariance matrix estimation given only projections zi of the data points is discussed where the zi are from distribution with zero mean and covariance matrix in fact this fits the model under study with observations yi zi xi zi zi xi xi xi εi εi xi zi zi xi specializing to the case in which one obtains the quadratic model yi εi which with is relevant to the problem of phase retrieval the approach of treats as an instance of and uses nuclear norm regularization to enforce solutions in work the authors show refined recovery result stating that imposing an spd constraint without regularization suffices similar result has been proven independently by however the results in and only concern model after posting an extended version of the present paper generalization of to general spd matrices has been achieved in since consider bounded noise whereas the analysis herein assumes gaussian noise our results are not direclty comparable to those in notation md denotes the space of real matrices with inner product hm tr the subspace of symmetric matrices sd has dimension δd sd has an λu λj uj where λmax λd λmin diag λd and ud for and sd km kq denotes the nuclear norm frobenius norm km kf spectral norm km let sd km and the symbols refer to the semidefinite ordering for set and αa αa it is convenient to model as where yi εi and mm rn is linear map defined by tr referred to as sampling operator its adjoint rn mm is given by the map vi xi supplement the appendix contains all proofs additional experiments and figures analysis preliminaries throughout this section we consider special instance of model in which yi tr xi εi where sm xi and εi εi are gaussian is made for convenience as it simplifies the the assumption that the errors stochastic part of our analysis which could be extended to errors note that we may assume that xi sm in fact since sm for any mm we have that tr tr sym where sym in the sequel we study the statistical performance of the constrained least squares estimator argmin ky under model more specifically under certain conditions on we shall derive bounds on and kς kx where will be referred to as prediction error below the most basic method for estimating is ordinary least squares ols estimation ols argmin ky which is computationally simpler than while requires convex programming boils down to solving linear system of equations in δm variables on the other hand the prediction error of ols scales as op dim range where dim range can be as large as min δm in which case the prediction error vanishes only if δm as moreover ols is unbounded unless δm research conducted over the past the estimation error kς few years has thus focused on methods dealing successfully with the case δm as long as the target has additional structure notably indeed if has rank the intrinsic dimension of the problem becomes roughly mr δm in large body of work nuclear norm regularization which serves as convex surrogate 
of rank regularization is considered as computationally convenient alternative for which series of adaptivity properties to underlying lowrankedness has been established complementing with nuclear norm regularization yields the estimator argmin ky where is regularization parameter in case an spd constraint is imposed becomes argmin ky tr our analysis aims at elucidating potential advantages of the spd constraint in the constrained least squares problem from statistical point of view it turns out that depending on properties of can range from performance similar to the least squares estimator ols on the behaviour of with properly the one hand to performance similar to the nuclear norm regularized estimator on the other hand the latter case appears to be remarkable may enjoy similar is obtained from pure adaptivity properties as nuclear norm regularized estimators even though data fitting problem without explicit regularization negative result does not imwe first discuss negative example of for which the estimator ols prove substantially over the unconstrained estimator at the same time this example provides clues on conditions to be imposed on to achieve substantially better performance random gaussian design consider the gaussian orthogonal ensemble goe goe xjk xjj xjk xkj gaussian measurements are common in compressed sensing it is hence of interest to study surements xi goe in the context of the constrained least squares problem the following statement points to serious limitation associated with such measurements proposition consider xi goe for any if δm with probability at least exp δm there exists sm such that proposition implies that if the number of measurements drops below of the ambient dimension is unbounded δm estimating based on becomes the estimation error kς irrespective of the rank of geometrically the consequence of proposition is that the convex cone cx rn sm contains unless is contained in the boundary of cx we conjecture that this event has measure zero this means that cx rn the spd constraint becomes vacuous slow rate bound on the prediction error under an additional we present positive result on the least squares estimator condition on the sampling operator specifically the prediction error will be bounded as where kx kx with typically being of the order up to log factors the rate in can be sigb ols if tr is small if nificant improvement of what is achieved by that rate coincides with those of the nuclear norm regularized estimators with regularization parameter cf theorem in for nuclear norm regularized estimators the rate is achieved for any choice of and is slow in the sense that the squared prediction error only decays at the rate instead of condition on in order to arrive at suitable condition to be imposed on so that can be achieved it makes sense to the negative example of proposition which states that as long as is bounded away from δm from above there is sm such that equivalently dist px kx where px and sm tr in this situation it is impossible to derive bound on the prediction error as dist px to rule this out the condition may imply cx rn so that kx dist px is natural more strongly one may ask for the following there exists constant such that min kx an analogous condition is sufficient for slow rate bound in the vector case cf however the condition for the slow rate bound in theorem below is somewhat stronger than condition there exist constants where for rpx px min kx the following condition is sufficient for condition and in some cases much easier to check 
proposition suppose there exists rn and constants φmin φmax λmin φmin and λmax φmax then for any satisfies condition with φmax and the condition of proposition can be phrased as having positive definite matrix in the image of the unit ball under which after scaling by has its smallest eigenvalue bounded away from zero and bounded condition number as simple example suppose that ni invoking proposition with and we find that condition is satisfied with and more interesting example is random design where the xi are sample covariance matrices where the underlying random vectors satisfy appropriate tail or moment conditions corollary let πm be probability distribution on rm with second moment matrix zz satisfying λmin consider the random matrix ensemble πm zk zk πm pn xi and ǫn λmin under the suppose that xi πm and let ǫn satisfies condition with event kγ λmax ǫn λmin ǫn and λmax ǫn it is instructive to spell out corollary with πm as the standard gaussian distribution on rm the equals the sample covariance matrix computed from samples it is matrix that for large λmax γn and λmin γn concentrate sharply around and respectively where ηn hence for any there exists cγ so that if cγ exist for the it holds that similar though weaker concentration results for kγ broader class of distributions πm with finite fourth moments specialized to corollary yields statement about made up from random measurements xi zi cf the preceding discussion indicates that condition tends to be satisfied in this case theorem suppose that model holds with satisfying condition with constants and we then have max kx where for any with probability at least log nn where remark under the scalings and the bound of theorem is of the order as announced at the beginning of this section for given the quantity can be evaluated by solving least squares problem with spd constraints hence it is feasible to check in practice whether condition holds for later reference we evaluate the term for πm with πm as standard gaussian distribution as shown in the supplement with high probability log holds as long as nq bound on the estimation error in the previous subsection we did not make any assumptions about apart from sm henceforth we suppose that is of low rank and study the performance of the constrained least squares estimator for prediction and estimation in such setting preliminaries let λu be the eigenvalue decomposition of where λr uk where λr is diagonal with positive diagonal entries consider the linear subspace sm from it follows that is contained in the orthogonal complement sm uk of dimension mr δm if the image of under is denoted by conditions on we introduce the key quantities the bound in this subsection depends on separability constant px min px rn kx restricted eigenvalue min kx it is as indicated by the following statement concerning the noiseless case for bounding kς inevitable to have lower bounds on the above two quantities proposition consider the trace regression model with εi then argmin kx for all sm if and only if it holds that and correlation constant moreover we use of the following the quantity it is not clear to us if it is intrinsically required or if its appearance in our bound is for merely technical reasons max hx we are now in position to provide bound on kς theorem suppose that model holds with as considered throughout this subsection and let be defined as in theorem we then have kς max remark given theorem an improved bound on the prediction error scaling with in place of can be derived cf in appendix the 
quality of the bound of theorem depends on how the quantities and scale with and which is accordingly the estimation error in nuclear norm can be in the worst case and in the best case which matches existing bounds for nuclear norm regularization cf theorem in the quantity is specific to the geometry of the constrained least squares problem and hence of critical importance for instance it follows from proposition that for standard gaussian measurements with high probability once δm the situation can be much better for random spd measurements as exemplified for surements xi zi with zi in appendix specifically it turns out that as long as it is not restrictive to assume is positive indeed without that assumption even an oracle estimator based on knowledge of the subspace would fail reasonable sampling operators have rank min δm so that the nullspace of only has trivial intersection with the subspace as long as dim mr for fixed computing entails solving biconvex albeit optimization problem in and block coordinate descent is practical approach to such optimization problems for which globally optimal solution is out of reach in this manner we explore the scaling of numerically as done for we find that δm so that apart from the regime without ruling out the possibility of undersampling δm numerical results in particular its performance in this section we empirically study properties of the estimator relative to methods is explored we also present an application to spiked covariance estimation for the cbcl face image data set and stock prices from nasdaq comparison with approaches we here empirically evaluate kς relative to methods setup we consider wishart measurement matrices xi zi zi fix and let and vary each configuration of is run with replications in each of these we generate data yi tr xi σεi where is generated randomly as rank wishart matrices and the εi are constrained ls regularized ls regularized ls chen et al chen et al oracle constrained ls regularized ls regularized ls chen et al chen et al oracle constrained ls regularized ls regularized ls chen et al chen et al oracle sigma sigma constrained ls regularized ls regularized ls chen et al chen et al oracle sigma constrained ls regularized ls regularized ls chen et al chen et al oracle sigma sigma sigma constrained ls regularized ls regularized ls chen et al chen et al oracle figure average estimation error over replications in nuclear norm for fixed and certain choices of and in the legend ls is used as shortcut for least squares chen et al refersp to indicates an oracular choice of the tuning parameter oracle refers to the ideal error σr best seen in color to the corresponding nuclear norm regularized approaches we compare estimator in regarding the choice of the regularization parameter we consider the grid where as recommended in and pick so that the prediction error on separate validation data set of size generated from is minimized note that in general neither is known nor an extra validation data set is available our goal here is to ensure that the regularization parameter is properly tuned in addition we consider an oracular choice of where is picked from the above grid such that the performance measure of interest the distance to the target in the nuclear norm is minimized we also compare to the constrained nuclear norm minimization approach of min tr subject to and ky for we consider the grid nσ this specific choice is motivated by the fact that ky nσ apart from that tuning of is performed as for the nuclear norm regularized 
estimator in addition we have assessed the performance of the approach in which does not impose an spd constraint but adds another constraint to that additional constraint significantly complicates optimization and yields second tuning parameter thus instead of doing search we use fixed values given in for known the results are similar or worse than those of note in particular that positive semidefiniteness is not taken advantage of in and are hence not reported discussion of the results we conclude from figure that in most cases the performance of the constrained least squares estimator does not differ much from that of the methods with careful tuning for larger values of the constrained least squares estimator seems to require slightly more measurements to achieve competitive performance real data examples we now present an application to recovery of spiked covariance matrices which are of the form λj uj where and λj this model appears frequently in connection with principal components analysis pca extension to the spiked case so far we have assumed that is of low rank but it is straightforward to extend the proposed approach to the case in which is spiked as long as is known or where an estimate is available constrained least squares estimator of takes the form argmin ky the case of unknown or general unknown diagonal perturbation is left for future research sigma sigma sample nasdaq all samples cbcl oracle all samples oracle sample kf in dependence of mr and the paramfigure average reconstruction errors kς eter oracle refers to the best rank σr data sets the cbcl facial image data set consist of images of pixels we take as the sample covariance matrix of this data set it turns out that can be well approximated by σr where σr is the best rank approximation to obtained from computing its eigendecomposition and setting to zero all but the top eigenvalues ii we construct second data set from the daily end prices of stocks from the technology sector in nasdaq starting from the beginning of the year to the end of the year in total days retrieved from we take as the resulting sample correlation matrix and choose experimental setup as in preceding measurements we consider random wishart measurements for the operator where mr where ranges from to since kσr kf kf for both data sets we work with in for simplicity to make recovery of more difficult we make the problem noisy by using observations yi tr xi si where si is an approximation to obtained from the sample covariance respectively sample correlation matrix of βn data points randomly sampled with replacement from the entire data set where ranges from to si is computed from single data point for each choice of and the reported results are averages over replications accurately approximates once the results for the cbcl data set as shown in figure number of measurements crosses performance degrades once additional noise is introduced to the problem by using measurements even under significant perturbations reasonable reconstruction of remains possible albeit the number of required measurements increases accordingly in the extreme case the error is still decreasing with but millions of samples seems to be required to achieve reasonable reconstruction error the general picture is similar for the nasdaq data set but the difference between using measurements based on the full sample correlation matrix on the one hand and approximations based on random subsampling on the other hand are more pronounced conclusion we have investigated trace regression 
in the situation that the underlying matrix is symmetric positive semidefinite under restrictions on the design constrained least squares enjoys similar statistical properties as methods employing nuclear norm regularization this may come as surprise as regularization is widely regarded as necessary in small sample settings acknowledgments the work of martin slawski and ping li is partially supported by and references cbcl face dataset http amelunxen lotz mccoy and tropp living on the edge phase transitions in convex programs with random data information and inference cai and zhang rop matrix recovery via projections the annals of statistics candes and li solving quadratic equations via phaselift when there are about as many equations as unknowns foundation of computational mathematics candes and plan tight oracle bounds for matrix recovery from minimal number of noisy measurements ieee transactions on information theory candes and recht exact matrix completion via convex optimization foundation of computational mathematics candes strohmer and voroninski phaselift exact and stable signal recovery from magnitude measurements via convex programming communications on pure and applied mathematics chen chi and goldsmith exact and stable covariance estimation from quadratic sampling via convex programming ieee transactions on information theory davidson and szarek handbook of the geometry of banach spaces volume chapter local operator theory random matrices and banach spaces pages demanet and hand stable optimizationless recovery from phaseless measurements journal of fourier analysis and its applications gross liu flammia becker and eisert quantum state tomography via compressed sensing physical review letters horn and johnson matrix analysis cambridge university press kabanva kueng and rauhut und terstiege stable low rank matrix recovery via null space properties klibanov sacks and tikhonarov the phase retrieval problem inverse problems koltchinskii lounici and tsybakov penalization and optimal rates for noisy matrix completion the annals of statistics meinshausen least squares estimation for regression the electronic journal of statistics negahban and wainwright estimation of near matrices with noise and scaling the annals of statistics recht fazel and parillo guaranteed solutions of linear matrix equations via nuclear norm minimization siam review rohde and tsybakov estimation of matrices the annals of statistics and smola learning with kernels mit press cambridge massachussets slawski and hein least squares for linear models consistency and sparse recovery without regularization the electronic journal of statistics slawski li and hein estimation in trace regression with positive semidefinite matrices srebro rennie and jaakola maximum margin matrix factorization in advances in neural information processing systems pages tibshirani regression shrinkage and variable selection via the lasso journal of the royal statistical society series tropp tools for random matrices an introduction http vershynin how close is the sample covariance matrix to the actual covariance matrix journal of theoretical probability wang xu and tang unique nonnegative solution to an underdetermined system from vectors to matrices ieee transactions on signal processing williams and seeger using the method to speed up kernel machines in advances in neural information processing systems pages 
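As an illustration of the constrained least squares estimator studied above, the following NumPy sketch minimizes the squared residuals over the positive semidefinite cone by projected gradient descent, where the projection simply clips negative eigenvalues to zero; the synthetic measurements are rank-one matrices X_i = z_i z_i^T as in the random design discussed in the paper. The solver, the step-size rule, and all problem sizes are illustrative assumptions: the paper defines the estimator but does not prescribe this particular algorithm.

```python
import numpy as np

def project_psd(S):
    """Project a symmetric matrix onto the positive semidefinite cone
    by clipping negative eigenvalues to zero."""
    S = (S + S.T) / 2.0
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, 0.0, None)) @ V.T

def spd_constrained_ls(X, y, n_iter=2000):
    """Constrained least squares  min_{S >= 0} (1/n) sum_i (y_i - tr(X_i S))^2
    solved by projected gradient descent (an illustrative solver).

    X : (n, m, m) array of symmetric measurement matrices X_i
    y : (n,) observations
    """
    n, m, _ = X.shape
    # conservative step size 1/L from a crude Lipschitz bound on the gradient
    step = n / (2.0 * np.sum(X * X))
    S = np.zeros((m, m))
    for _ in range(n_iter):
        resid = np.einsum('ijk,jk->i', X, S) - y            # tr(X_i S) - y_i
        grad = 2.0 * np.einsum('i,ijk->jk', resid, X) / n   # gradient of the average squared error
        S = project_psd(S - step * grad)
    return S

# synthetic example with rank-one (Wishart-type) measurements X_i = z_i z_i^T;
# sizes, rank, and noise level are arbitrary illustrative choices
rng = np.random.default_rng(0)
m, n, r = 10, 300, 2
U = rng.standard_normal((m, r))
Sigma_star = U @ U.T                                   # low-rank SPD target
Z = rng.standard_normal((n, m))
X = np.einsum('ij,ik->ijk', Z, Z)                      # X_i = z_i z_i^T
y = np.einsum('ijk,jk->i', X, Sigma_star) + 0.1 * rng.standard_normal(n)
Sigma_hat = spd_constrained_ls(X, y)
# relative estimation error in Frobenius norm (improves with more measurements/iterations)
print(np.linalg.norm(Sigma_hat - Sigma_star) / np.linalg.norm(Sigma_star))
```

Note that the procedure has no tuning parameter beyond the optimizer settings, which is the practical point made above: no noise-dependent regularization parameter has to be chosen.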
autoencoders alireza makhzani brendan frey university of toronto makhzani frey abstract in this paper we propose method for learning hierarchical sparse representations in an unsupervised fashion we first introduce autoencoders which use statistics to directly enforce lifetime sparsity in the activations of the hidden units we then propose the convolutional autoencoder which combines the benefits of convolutional architectures and autoencoders for learning sparse representations we describe way to train convolutional autoencoders layer by layer where in addition to lifetime sparsity spatial sparsity within each feature map is achieved using activation functions we will show that autoencoders can be used to to learn deep sparse representations from the mnist imagenet street view house numbers and toronto face datasets and achieve competitive classification performance introduction recently supervised learning has been developed and used successfully to produce representations that have enabled leaps forward in classification accuracy for several tasks however the question that has remained unanswered is whether it is possible to learn as powerful representations from unlabeled data without any supervision it is still widely recognized that unsupervised learning algorithms that can extract useful features are needed for solving problems with limited label information in this work we exploit sparsity as generic prior on the representations for unsupervised feature learning we first introduce the autoencoders that learn to do sparse coding by directly enforcing lifetime sparsity constraint we then introduce convolutional autoencoders that learn to do sparse coding by directly enforcing spatial and lifetime sparsity constraints autoencoders training sparse autoencoders has been well studied in the literature for example in lifetime sparsity penalty function proportional to the kl divergence between the hidden unit marginals and the target sparsity probability is added to the cost function λkl major drawback of this approach is that it only works for certain target sparsities and is often very difficult to find the right parameter that results in properly trained sparse autoencoder also kl divergence was originally proposed for sigmoidal autoencoders and it is not clear how it can be applied to relu autoencoders where could be larger than one in which case the kl divergence can not be evaluated in this paper we propose autoencoders to address these concerns autoencoders can aim for any target sparsity rate train very fast marginally slower than standard autoencoder have no to be tuned except the target sparsity rate and efficiently train all the dictionary atoms even when very aggressive sparsity rates are enforced mnist mnist mnist figure learnt dictionary decoder of with hidden units trained on mnist sparse coding algorithms typically comprise two steps highly sparse encoding operation that finds the right atoms in the dictionary and linear decoding stage that reconstructs the input with the selected atoms and update the dictionary the autoencoder is nonsymmetric autoencoder where the encoding stage is typically stack of several relu layers and the decoder is just linear layer in the feedforward phase after computing the hidden codes of the last layer of the encoder rather than reconstructing the input from all of the hidden units for each hidden unit we impose lifetime sparsity by keeping the percent largest activation of that hidden unit across the samples and setting the rest of activations 
of that hidden unit to zero in the backpropagation phase we only backpropagate the error through the percent activations in other words we are using the statistics to approximate the statistics of the activation of particular hidden unit across all the samples and finding hard threshold value for which we can achieve lifetime sparsity rate in this setting the highly nonlinear encoder of the network relus followed by sparsity learns to do sparse encoding and the decoder of the network reconstructs the input linearly at test time we turn off the sparsity constraint and the output of the deep relu network will be the final representation of the input in order to train stacked autoencoder we fix the weights and train another autoencoder on top of the fixed representation of the previous network the learnt dictionary of autoencoder trained on mnist and toronto face datasets are visualized in fig and fig for large sparsity levels the algorithm tends to learn very local features that are too primitive to be used for classification fig as we decrease the sparsity level the network learns more useful features longer digit strokes and achieves better classification fig nevertheless forcing too much sparsity results in features that are too global and do not factor the input into parts fig section reports the classification results rbms besides autoencoders wta activations can also be used in restricted boltzmann machines rbm to learn sparse representations suppose and denote the hidden and visible units of rbms for training in the positive phase of the contrastive divergence instead of sampling from hi we first keep the largest hi for each hi across the dimension and set the rest of hi values to zero and then sample hi according to the sparsified hi filters of trained on mnist are visualized in fig we can see learn longer digit strokes on mnist which as will be shown in section improves the classification rate note that the sparsity rate of should not be as aggressive as wta autoencoders since rbms are already being regularized by having binary hidden states toronto face dataset patches figure dictionaries decoder of autoencoder with hidden units and sparsity of standard rbm sparsity of figure features learned on mnist by hidden unit rbms convolutional autoencoders there are several problems with applying conventional sparse coding methods on large images first it is not practical to directly apply sparse coding algorithm on images second even if we could do that we would learn very redundant dictionary whose atoms are just shifted copies of each other for example in fig the fcwta autoencoder has allocated different filters for the same patterns borders occurring at different locations one way to address this problem is to extract random image patches from input images and then train an unsupervised learning algorithm on these patches in isolation once training is complete the filters can be used in convolutional fashion to obtain representations of images as discussed in the main problem with this approach is that if the receptive field is small this method will not capture relevant features imagine the extreme of patches increasing the receptive field size is problematic because then very large number of features are needed to account for all the variations within the receptive field for example we see that in fig the autoencoder allocates different filters to represent the same horizontal edge appearing at different locations within the receptive field as result the learnt features are 
essentially shifted versions of each other which results in redundancy between filters unsupervised methods that make use of convolutional architectures can be used to address this problem including convolutional rbms convolutional dbns deconvolutional networks and convolutional predictive sparse decomposition psd these methods learn features from the entire image in convolutional fashion in this setting the filters can focus on learning the shapes what because the location information where is encoded into feature maps and thus the redundancy among the filters is reduced in this section we propose convolutional autoencoders that learn to do sparse coding by directly enforcing spatial and lifetime sparsity constraints our work is similar in spirit to deconvolutional networks and convolutional psd but whereas the approach in that work is to break apart the recognition pathway and data generation pathway but learn them so that they are consistent we describe technique for directly learning sparse convolutional autoencoder shallow convolutional autoencoder maps an input vector to set of feature maps in convolutional fashion we assume that the boundaries of the input image are so that each feature map has the same size as the input the hidden representation is then mapped linearly to the output using deconvolution operation appendix the parameters are optimized to minimize the mean square error convolutional autoencoder learns useless delta function filters that copy the input image to the feature maps and copy back the feature maps to the output interestingly we have observed that even in the presence of denoising regularizations convolutional autoencoders still learn useless delta functions fig depicts the filters of convolutional autoencoder with maps input and hidden unit dropout trained on street view house numbers dataset we see that the learnt delta functions make copies of the input pixels so even if half of the hidden units get dropped during training the network can still rely on the copies to reconstruct the input this highlights the need for new and more aggressive regularization techniques for convolutional autoencoders the proposed architecture for autoencoder is depicted in fig the autoencoder is autoencoder where the encoder typically consists of stack of several relu convolutional layers filters and the decoder is linear deconvolutional layer of larger size filters we chose to use deep encoder with smaller filters instead of shallow one with larger filters because the former introduces more dropout conv autoencoder autoencoder figure filters and feature maps of convolutional autoencoder which learns useless delta functions proposed architecture for autoencoder with spatial sparsity and regularizes the network by forcing it to have decomposition over large receptive fields through smaller filters the autoencoder is trained under two sparsity constraints spatial sparsity and lifetime sparsity spatial sparsity in the feedforward phase after computing the last feature maps of the encoder rather than reconstructing the input from all of the hidden units of the feature maps we identify the single largest hidden activity within each feature map and set the rest of the activities as well as their derivatives to zero this results in sparse representation whose sparsity level is the number of feature maps the decoder then reconstructs the output using only the active hidden units in the feature maps and the reconstruction error is only backpropagated through these hidden units as well 
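A minimal NumPy sketch of the two winner-take-all operations: the spatial sparsity step just described (keep only the single largest hidden activity within each feature map and zero the rest) and the lifetime sparsity constraint introduced earlier for the fully-connected autoencoder (keep only the largest fraction of a unit's activations across the mini-batch), which, as discussed below, is also applied to the per-map winners in the convolutional model. Only the forward masking is shown; in an autodiff framework, backpropagating the reconstruction error only through the surviving activations follows automatically from this masking. Function names, array shapes, and the example sparsity rate are illustrative assumptions.

```python
import numpy as np

def spatial_sparsity(h):
    """Spatial winner-take-all: within every feature map, keep only the
    single largest hidden activity and zero out the rest.

    h : (batch, channels, height, width) feature maps of the encoder's last layer
    """
    b, c, H, W = h.shape
    flat = h.reshape(b, c, -1)
    winners = flat.max(axis=2, keepdims=True)
    mask = (flat == winners).astype(h.dtype)   # one winner per (sample, map); exact ties are harmless
    return (flat * mask).reshape(b, c, H, W)

def lifetime_sparsity(h, rate):
    """Lifetime winner-take-all: for each hidden unit (column), keep only the
    `rate` fraction of largest activations across the mini-batch and zero the
    rest, so every filter is still updated on the examples where it wins.

    h    : (batch, units) activations, e.g. the per-map spatial winners
    rate : fraction of the batch to keep, e.g. 0.05 for 5% lifetime sparsity
    """
    b = h.shape[0]
    k = max(1, int(round(rate * b)))
    thresh = np.sort(h, axis=0)[-k, :]          # k-th largest activation per unit
    return h * (h >= thresh)

# toy usage on random "feature maps"; shapes and the 10% rate are arbitrary
rng = np.random.default_rng(0)
maps = rng.standard_normal((64, 16, 8, 8))
sparse_maps = spatial_sparsity(maps)
winners = sparse_maps.reshape(64, 16, -1).max(axis=2)   # one winner per sample and map
winners = lifetime_sparsity(winners, rate=0.1)
```

At test time both masks are simply switched off, so the representation passed to the classifier uses the full (unsparsified) activations, as described above.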
consistent with other representation learning approaches such as triangle and deconvolutional networks we observed that using softer sparsity constraint at test time results in better classification performance so in the autoencoder in order to find the final representation of the input image we simply turn off the sparsity regularizer and use relu convolutions to compute the last layer feature maps of the encoder after that we apply over regions on these feature maps and use this representation for classification tasks or in training stacked as will be discussed in section fig shows autoencoder that was trained on mnist figure the autoencoder with first layer filters and second layer filters trained on mnist input image learnt dictionary deconvolution filters feature maps while training spatial sparsity applied feature maps after training spatial sparsity turned off feature maps of the first layer after applying local out of feature maps of the second layer after turning off the sparsity and applying local final representation spatial sparsity only spatial lifetime sparsity spatial lifetime sparsity figure learnt dictionary deconvolution filters of autoencoder trained on mnist lifetime sparsity although spatial sparsity is very effective in regularizing the autoencoder it requires all the dictionary atoms to contribute in the reconstruction of every image we can further increase the sparsity by exploiting the lifetime sparsity as follows suppose we have feature maps and the size is after applying spatial sparsity for each filter we will have winner hidden units corresponding to the images during feedforward phase for each filter we only keep the largest of these values and set the rest of activations to zero note that despite this aggressive sparsity every filter is forced to get updated upon visiting every which is crucial for avoiding the dead filter problem that often occurs in sparse coding fig and fig show the effect of the lifetime sparsity on the dictionaries trained on mnist and toronto face dataset we see that similar to the autoencoders by tuning the lifetime sparsity of autoencoders we can aim for different sparsity rates if no lifetime sparsity is enforced we learn local filters that contribute to every training point fig and as we increase the lifetime sparsity we can learn rare but useful features that result in better classification fig nevertheless forcing too much lifetime sparsity will result in features that are too diverse and rare and do not properly factor the input into parts fig and stacked autoencoders the autoencoder can be used as building block to form hierarchy in order to train the hierarchical model we first train autoencoder on the input images then we pass all the training examples through the network and obtain their representations last layer of the encoder after turning off sparsity and applying local now we treat these representations as new dataset and train another autoencoder to obtain the stacked representations fig shows the deep feature maps of stacked that was trained on mnist scaling autoencoders to large images the goal of convolutional sparse coding is to learn dictionary atoms and encoding filters once the filters are learnt they can be applied convolutionally to any image of any size and produce spatial map corresponding to different locations at the input we can use this idea to efficiently train autoencoders on datasets containing large images suppose we want to train an alexnet architecture in an unsupervised fashion on imagenet spatial 
sparsity only spatial and lifetime sparsity of figure learnt dictionary deconvolution filters of autoencoder trained on the toronto face dataset spatial sparsity spatial and lifetime sparsity of figure learnt dictionary deconvolution filters of autoencoder trained on imagenet whitened patches in order to learn the first layer filters we can extract mediumsize image patches of size and train autoencoder with dictionary atoms of size on these patches this will result in filters of size that can efficiently capture the statistics of patches once the filters are learnt we can apply them in convolutional fashion with the stride of to the entire images and after we will have representation of the images now we can train another autoencoder on top of these feature maps to capture the statistics of larger receptive field at different location of the input image this process could be repeated for multiple layers fig shows the dictionary learnt on the imagenet using this approach we can see that by imposing lifetime sparsity we could learn very diverse filters such as corner circular and blob detectors experiments in all the experiments of this section we evaluate the quality of unsupervised features of wta autoencoders by training naive linear classifier svm on top them we did not the filters in any of the experiments the implementation details of all the experiments are provided in appendix in the supplementary materials an ipython demo for reproducing important results of this paper is publicly available at http autoencoders on mnist the mnist dataset has training points and test points table compares the performance of autoencoder and with other architectures table compares the performance of autoencoder with other convolutional architectures in these experiments we have used all the available training labels points to train linear svm on top of the unsupervised features an advantage of unsupervised learning algorithms is the ability to use them in scenarios where labeled data is limited table shows the performance of convwta where we have assumed only labels are available in this case the unsupervised features are still trained on the whole dataset points but the svm is trained only on the labeled points where varies from to we compare this with the performance of supervised deep convnet cnn trained only on the labeled training points we can see supervised deep learning techniques fail to learn good representations when labeled data is limited whereas our wta algorithm can extract useful features from the unlabeled data and achieve better classification we also compare our method with some of the best learning results recently obtained by shallow autoencoder input and hidden units dropout stacked denoising autoencoder layers deep boltzmann machines autoencoder shallow autoencoder units sparsity stacked autoencoder and sparsity restricted boltzmann machines restricted boltzmann machines sparsity error rate table classification performance of autoencoder features svm on mnist deep deconvolutional network convolutional deep belief network scattering convolution network convolutional kernel network autoencoder maps autoencoder maps stacked maps error unsupervised features svm trained on labels no cnn ckn sc unsupervised features svm trained on few labels table classification performance of autoencoder trained on mnist convolutional kernel networks ckn and convolutional scattering networks sc we see outperforms both these methods when very few labels are available autoencoder on street view house 
numbers the svhn dataset has about training points and test points table reports the classification results of autoencoder on this dataset we first trained shallow and stacked on all training cases to learn the unsupervised features and then performed two sets of experiments in the first experiment we used all the available labels to train an svm on top of the features and compared the result with convolutional we see that the stacked achieves dramatic improvement over the shallow as well as in the second experiment we trained an svm by using only labeled data points and compared the result with deep variational autoencoders trained in same fashion fig shows the learnt dictionary of on this dataset convolutional triangle autoencoder maps stacked autoencoder and maps deep variational autoencoders stacked autoencoder and maps supervised maxout network accuracy table unsupervised features svm trained on labeled points of svhn dataset contrast normalized svhn learnt dictionary figure autoencoder trained on the street view house numbers svhn dataset autoencoder on fig reports the classification results of on we see when small number of feature maps are used considerable improvements over can be achieved this is because our method can learn dictionary as opposed to the redundant dictionaries learnt by methods such as in the largest deep network that we trained we used maps and achieved the classification rate of without using finetuning model averaging or data augmentation fig shows the learnt dictionary on the dataset we can see that the network has learnt diverse filters such as detectors as opposed to fig that shows the filters of methods accuracy shallow convolutional triangle maps shallow autoencoder maps shallow convolutional triangle maps shallow autoencoder maps shallow convolutional triangle maps deep triangle maps convolutional deep belief net layers exemplar cnn data augmentation nomp maps averaging models stacked maps stacked maps supervised maxout network unsupervised features svm without learnt dictionary figure autoencoder trained on the dataset discussion relationship of to autoencoders autoencoders impose sparsity across different channels population sparsity whereas autoencoder imposes sparsity across training examples lifetime sparsity when aiming for low sparsity levels autoencoders use scheduling technique to avoid the dead dictionary atom problem wta autoencoders however do not have this problem since all the hidden units get updated upon visiting every no matter how aggressive the sparsity rate is no scheduling required as result we can train larger networks and achieve better classification rates relationship of to deconvolutional networks and convolutional psd deconvolutional networks are top down models with no direct link from the image to the feature maps the inference of the sparse maps requires solving the iterative ista algorithm which is costly convolutional psd addresses this problem by training parameterized encoder separately to explicitly predict the sparse codes using soft thresholding operator deconvolutional networks and convolutional psd can be viewed as the generative decoder and encoder paths of convolutional autoencoder our contribution is to propose specific approach for training convolutional autoencoder in which both paths are trained jointly using direct backpropagation yielding an algorithm that is much faster easier to implement and can train much larger networks relationship to maxout networks maxout networks take the max across different 
channels whereas our method takes the max across space and dimensions also the feature maps retain the location information of the winners within each feature map and different locations have different connectivity on the subsequent layers whereas the maxout activity is passed to the next layer using weights that are the same regardless of which unit gave the maximum conclusion we proposed the spatial and lifetime sparsity methods to train autoencoders that learn to do and convolutional sparse coding we observed that autoencoders learn and diverse dictionary atoms as opposed to atoms that are typically learnt by conventional sparse coding methods unlike related approaches such as deconvolutional networks and convolutional psd our method jointly trains the encoder and decoder paths by direct and does not require an iterative optimization technique during training we described how our method can be scaled to large datasets such as imagenet and showed the necessity of the deep architecture to achieve better results we performed experiments on the mnist svhn and datasets and showed that the classification rates of autoencoders are competitive with the acknowledgments we would like to thank ruslan salakhutdinov and andrew delong for the valuable comments we also acknowledge the support of nvidia with the donation of the gpus used for this research references krizhevsky sutskever and hinton imagenet classification with deep convolutional neural in nips vol ng sparse autoencoder lecture notes vol coates ng and lee an analysis of networks in unsupervised feature learning in international conference on artificial intelligence and statistics kavukcuoglu sermanet boureau gregor mathieu and lecun learning convolutional feature hierarchies for visual in nips vol lee grosse ranganath and ng convolutional deep belief networks for scalable unsupervised learning of hierarchical representations in proceedings of the annual international conference on machine learning pp acm krizhevsky convolutional deep belief networks on unpublished zeiler krishnan taylor and fergus deconvolutional networks in computer vision and pattern recognition cvpr ieee conference on pp ieee sermanet kavukcuoglu chintala and lecun pedestrian detection with unsupervised feature learning in computer vision and pattern recognition cvpr ieee conference on pp ieee vincent larochelle lajoie bengio and manzagol stacked denoising autoencoders learning useful representations in deep network with local denoising criterion the journal of machine learning research vol pp hinton srivastava krizhevsky sutskever and salakhutdinov improving neural networks by preventing of feature detectors arxiv preprint netzer wang coates bissacco wu and ng reading digits in natural images with unsupervised feature learning in nips workshop on deep learning and unsupervised feature learning vol granada spain zeiler and fergus differentiable pooling for hierarchical feature learning arxiv preprint salakhutdinov and hinton deep boltzmann machines in international conference on artificial intelligence and statistics pp makhzani and frey autoencoders international conference on learning representations iclr bruna and mallat invariant scattering convolution networks pattern analysis and machine intelligence ieee transactions on vol no pp mairal koniusz harchaoui and schmid convolutional kernel networks in advances in neural information processing systems pp ranzato huang boureau and lecun unsupervised learning of invariant feature hierarchies with applications to 
object recognition in computer vision and pattern recognition cvpr ieee conference on pp ieee kingma mohamed rezende and welling learning with deep generative models in advances in neural information processing systems pp goodfellow mirza courville and bengio maxout networks icml coates and ng selecting receptive fields in deep in nips dosovitskiy springenberg riedmiller and brox discriminative unsupervised feature learning with convolutional neural networks in advances in neural information processing systems pp lin and kung stable and efficient representation learning with nonnegativity constraints in proceedings of the international conference on machine learning pp 
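Before the next paper begins, here is a brief sketch of the large-image scaling strategy described earlier: learn encoder filters on medium-size patches, then apply them convolutionally with a stride to whole images, producing feature maps on which a further WTA layer can be trained. The patch size, stride, and use of scipy's fftconvolve are illustrative assumptions; the exact sizes are missing from the extracted text.

```python
import numpy as np
from scipy.signal import fftconvolve

def extract_patches(images, patch_size=48, patches_per_image=10, rng=None):
    """Sample medium-size patches from large images for patch-level WTA training."""
    rng = rng or np.random.default_rng(0)
    patches = []
    for img in images:                                   # img: (H, W) grayscale array
        h, w = img.shape
        for _ in range(patches_per_image):
            y = rng.integers(0, h - patch_size + 1)
            x = rng.integers(0, w - patch_size + 1)
            patches.append(img[y:y + patch_size, x:x + patch_size])
    return np.stack(patches)

def apply_filters_convolutionally(image, filters, stride=4):
    """Apply patch-trained encoder filters to a whole image with a stride,
    producing one spatial feature map per filter."""
    maps = []
    for f in filters:                                    # f: (fh, fw) filter
        full = fftconvolve(image, f[::-1, ::-1], mode='valid')   # correlation via flipped kernel
        maps.append(np.maximum(full[::stride, ::stride], 0.0))   # ReLU, strided subsampling
    return np.stack(maps)                                # (K, H', W')
```

Computing the full correlation and then subsampling every stride-th location gives the same result as a strided convolution, just less efficiently, which keeps the sketch short.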
deep poisson factor modeling ricardo henao zhe gan james lu and lawrence carin department of electrical and computer engineering duke university durham nc lcarin abstract we propose new deep architecture for topic modeling based on poisson factor analysis pfa modules the model is composed of poisson distribution to model observed vectors of counts as well as deep hierarchy of hidden binary units rather than using logistic functions to characterize the probability that latent binary unit is on we employ link which allows pfa modules to be used repeatedly in the deep architecture we also describe an approach to build discriminative topic models by adapting pfa modules we derive efficient inference via mcmc and stochastic variational methods that scale with the number of in the data and binary units yielding significant efficiency relative to models based on logistic links experiments on several corpora demonstrate the advantages of our model when compared to related deep models introduction deep models understood as multilayer modular networks have been gaining significant interest from the machine learning community in part because of their ability to obtain performance in wide variety of tasks their modular nature is another reason for their popularity commonly used modules include but are not limited to restricted boltzmann machines rbms sigmoid belief networks sbns convolutional networks feedforward neural networks and dirichlet dps perhaps the two most deep model architectures are the deep belief network dbn and the deep boltzmann machine dbm the former composed of rbm and sbn modules whereas the latter is purely built using rbms deep models are often employed in topic modeling specifically hierarchical models have been widely studied over the last decade often composed of dp modules examples of these include the nested chinese restaurant process ncrp the hierarchical dp hdp and the nested hdp nhdp alternatively topic models built using modules other than dps have been proposed recently for instance the replicated softmax model rsm based on rbms the neural autoregressive density estimator nade based on neural networks the overreplicated softmax model osm based on dbms and deep poisson factor analysis dpfa based on sbns models have attractive characteristics from the standpoint of interpretability in the sense that their generative mechanism is parameterized in terms of distributions over topics with each topic characterized by distribution over words alternatively models in which modules are parameterized by deep hierarchy of binary units do not have parameters that are as readily interpretable in terms of topics of this type although model performance is often excellent the dpfa model in is one of the first representations that characterizes documents based on distributions over topics and words while simultaneously employing deep architecture based on binary units specifically integrates the capabilities of poisson factor analysis pfa deep models based on dp priors are usually called hierarchical models with deep architecture composed of sbns pfa is nonnegative matrix factorization framework closely related to models results in show that dpfa outperforms other deep topic models building upon the success of dpfa this paper proposes new deep architecture for topic modeling based entirely on pfa modules our model fundamentally merges two key aspects of dp and architectures namely its fully nonnegative formulation relies on dirichlet distributions and is thus readily interpretable 
throughout all its layers not just at the base layer as in dpfa ii it adopts the rationale of traditional models such as dbns and dbms by connecting layers via binary units to enable learning of statistics and structured correlations the probability of binary unit being on is controlled by link rather than logistic link as in the sbn allowing repeated application of pfa modules at all layers of the deep architecture the main contributions of this paper are deep architecture for topic models based entirely on pfa modules ii unlike dpfa which is based on sbns our model has inherent shrinkage in all its layers thanks to the formulation of pfa iii dpfa requires sequential updates for its binary units while in our formulation these are updated in block greatly improving mixing iv we show how pfa modules can be used to easily build discriminative topic models an efficient mcmc inference procedure is developed that scales as function of the number of in the data and binary units in contrast models based on rbms and sbns scale with the size of the data and binary units vi we also employ scalable bayesian inference algorithm based on the recently proposed stochastic variational inference svi framework model poisson factor analysis as module we present the model in terms of document modeling and word counts but the basic setup is applicable to other problems characterized by vectors of counts and we consider such application when presenting results assume xn is an vector containing word counts for the of documents where is the vocabulary size we impose the model xn poisson θn hn where rm is the factor loadings matrix with factors θn are factor intensities hn is vector of binary units indicating which factors are active for observation and represents the hadamard product one possible prior specification for this model recently introduced in is pk xmn xmkn xmkn poisson λmkn λmkn ψmk θkn hkn ψk dirichlet θkn gamma rk hkn bernoulli πkn where is an vector of ones and we have used the additive property of the poisson distribution to decompose the observed count of xn as latent counts xmkn here ψk is column of xmn is component of xn θkn is component of θn and hkn is component of hn furthermore we let and rk gamma note that controls for the sparsity of while rk accommodates for in xn via θn see for details there is one parameter in for which we have not specified prior distribution specifically hkn πkn in hkn is provided with process prior by letting πkn πk beta cǫ meaning that every document has on average the same probability of seeing particular topic as active based on popularity it further assumes topics are independent of each other these two assumptions are restrictive because in practice documents belong to rather heterogeneous population in which themes naturally occur within corpus letting documents have individual topic activation probabilities will allow the model to better accommodate for heterogeneity in the data ii some topics are likely to systematically so being able to harness such correlation structures can improve the ability of the model for fitting the data the hierarchical model in which in the following we denote as xn pfa θn hn rk short for poisson factor analysis pfa represents documents xn as purely additive combinations of up to topics distributions over words where hn indicates what topics are active and θn is the intensity of each one of the active topics that is manifested in document xn it is also worth noting that the model in is closely related to other widely known topic 
model approaches such as latent dirichlet allocation lda hdp and focused topic modeling ftm connections between these models are discussed in section deep representations with pfa modules several models have been proposed recently to address the limitations described above in particular proposed using multilayer sbns to impose correlation structure across topics while providing each document with the ability to control its topic activation probabilities without the need of global process here we follow the same rationale as but without sbns we start by noting that for binary vector hn with elements hkn we can write hkn zkn zkn poisson where zkn is latent count for variable hkn parameterized by poisson distribution with rate if the argument is true and otherwise the model in recently proposed in is known as the link bpl and is denoted hn bpl for rk after marginalizing out the latent count zkn the model in has the interesting property that hkn bernoulli πkn where πkn exp hence rather than using the logistic function to represent binary unit probabilities we employ πkn exp in and we have represented the poisson rates as λmkn and respectively to distinguish between the two however the fact that the count vector in and the binary variable in are both represented in terms of poisson distributions suggests the following deep model based on pfa modules see graphical model in supplementary material xn pfa θn rk zn θn rk pfa hn θn hn rk zn pfa where is the number of layers in the model and is vector operation in which each component imposes the left operation in in this deep poisson factor model dpfm the binary units at layer are drawn hn bpl λn for λn θn hn the form of the model in introduces latent variables zn and the function rather than explicitly drawing hn from the bpl distribution concerning the top layer we let zkn poisson λk and λk gamma model interpretation consider layer of from which xn is drawn assuming hn is known this corresponds to focused topic model the columns of correspond to topics with the column ψk defining the probability with which words are manifested for topic each ψk is drawn from dirichlet distribution as in generalizing the notation from λkn ψk θkn hkn rm is the rate vector associated with topic and document and it is active when hkn the count vector for document manifested from topic is xkn poisson λkn and xn xkn where is the number of topics in the model the columns of define correlation among the words associated with the topics for given topic column of some words with high probability and other words are likely jointly absent we now consider model with hn assumed known to generate hn we first draw zn zkn with zkn poisson λkn and which analogous to above may be expressed as zn λkn ψk θkn hkn column of corresponds to with ψk probability vector denoting the probability with which each of the topics are on when is on when hkn the columns of define correlation among the topics for given column of some topics with high probability and other topics are likely jointly absent as one moves up the hierarchy to layers the become increasingly more abstract and sophisticated manifested in terms of probabilisitic combinations of topics and at the layers below because of the properties of the dirichlet distribution each column of particular is encouraged to be sparse implying that column of encourages use of small subset of columns of with this repeated all the way down to the data layer and the topics reflected in the columns of this deep architecture imposes correlation across 
the topics and it does it through use of pfa modules at all layers of the deep architecture unlike which uses an sbn for layers through and pfa at the bottom layer in addition to the elegance of using single class of modules at each layer the proposed deep model has important computational benefits as later discussed in section pfa modules for discriminative tasks assume that there is label yn associated with document we seek to learn the model for mapping xn yn simultaneously with learning the above deep topic representation in fact the mapping xn yn is based on the deep generative process for xn in we represent yn bn which has all elements equal to zero except one with the via the vector value which is set to one located at the position of the label we impose the model bn bcn λcn pc λcn bn multinomial bcn is element of λn θn is matrix of nonnegative where and classification weights with prior distribution bk dirichlet where bk is column of combining with allows us to learn the mapping xn yn via the shared local representation θn hn that encodes topic usage for document this sharing mechanism allows the model to learn topics and biased towards discrimination as opposed to just explaining the data xn we call this construction discriminative deep poisson factor modeling it is worth noting that this is the first time that pfa and classification have been combined into joint model although other discriminative topic models have been proposed they rely on approximations in order to combine the topic model usually lda with classification approaches inference very convenient feature of the model in is that all its conditional posterior distributions can be written in closed form due to local conjugacy in this section we focus on markov chain monte carlo mcmc via gibbs sampling as reference implementation and stochastic variational inference approach for large datasets where the fully bayesian treatment becomes prohibitive other alternatives for scaling up inference in bayesian models such as the parameter server conditional density filtering and stochastic approaches are left as interesting future work mcmc due to local conjugacy gibbs sampling for the model in amounts to sampling in sequence from the conditional posterior of all the parameters of the model namely θn hn rk the remaining parameters of the model are set to fixed and values and we note that priors for and exist that result in updates and can be easily incorporated into the model if desired however we opted to keep the model as simple as possible without compromising flexibility the most unique conditional posteriors are shown below without layer index for clarity ψk dirichlet xm θkn gamma rk hkn hkn bernoulli πkn pn pm rk where xmkn xmkn and πkn omitted details including those for the discriminative dpfm in section are given in the supplementary material initialization is done at random from prior distributions followed by fitting in the experiments we run gibbs sampling cycles per layer in preliminary trials we observed that cycles are usually enough to obtain good initial values of the global parameters of the model namely rk and stochastic variational inference svi svi is scalable algorithm for approximating posterior distributions consisting of updates in which subsets of dataset minibatches are used to update in the variational parameters controlling both the local and global structure of the model in an iterative fashion this is done by using stochastic optimization with noisy natural gradients to optimize the variational 
objective function additional details and theoretical foundations of svi can be found in in practice the algorithm proceeds as follows where again we have omitted the layer index for clarity let rk be the global variables at iteration ii sample from the full dataset iii compute updates for the variational parameters of the local variables using φmkn exp log ψmk log θkn pm θkn gamma rk hkn φmkn hkn bernoulli πkn where xmkn φmkn and πkn rk in practice expectations for θkn and hkn are computed in iv compute local update for the variational parameters of the global variables only is shown using pn ψbmk φmkn where and nb are sizes of the corpus and respectively finally we update the global bk where ρt the forgetting rate variables as ψk ρt ψk ρt controls how fast previous information is forgotten and the delay early iterations these conditions for and guarantee that the iterative algorithm converges to local optimum of the variational objective function in the experiments we set and additional details of the svi algorithm for the model in are given in the supplementary material importance of computations scaling as function of number of from practical standpoint the most important feature of the model in is that inference does not scale as function of the size of the corpus but as function of its number of elements which is advantageous in cases where the input data is sparse often the case for instance of the entries in the widely studied newsgroup corpus are similar proportions are also observed in the reuters and wikipedia data furthermore this feature also extends to all the layers of the model regardless of hn being latent similarly for the discriminative dpfm in section inference bn has single entry this is particularly scales with not cn because the binary vector appealing in cases where is large in order to show that this scaling behavior holds it is enough to see that by construction from pk if xmn xmkn or zmn for thus xmkn with probability besides from we see that if hkn then zkn with probability as result update equations for all parameters of the model except for hn depend only on elements of xn and zn updates for the binary variables can be cheaply obtained in block from hkn bernoulli πkn via as previously described it is worth mentioning that models based on multinomial or poisson likelihoods such as lda hdp ftm and pfa also enjoy this property however the recently proposed deep pfa does not use pfa modules on layers other than the first one it uses sbns or rbms that are known to scale with the number of binary variables as opposed to their elements related work connections to other topic models pfa is nonnegative matrix factorization model with poisson link that is closely related to other models specifically showed that by making hkn and letting θkn have dirichlet instead of gamma distribution as in we can recover lda by using the equivalence between poisson and multinomial distributions by looking at we see that pfa and lda have the same blocked gibbs and svi updates respectively when dirichlet distributions for θkn are used in the authors showed that using the representation of the negative binomial distribution and specification for hkn in we can recover the ftm formulation and inference in more recently showed that pfa is comparable to hdp in that the former builds dps with normalized gamma processes more direct relationship between hdp and version of can be established by grouping documents by categories in the hdp three dps are set for topics topic usage and topic usage in 
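The claim above that inference scales with the number of nonzero elements is easiest to see in the augmentation step: each observed count is split into per-topic latent counts, and zero counts contribute nothing to any sufficient statistic. A rough numpy/scipy sketch of that allocation for a single PFA layer follows; the variable names and the epsilon guard are assumptions, and the full Gibbs or SVI sweep built on these statistics is omitted.

```python
import numpy as np
import scipy.sparse as sp

def allocate_latent_counts(X, Psi, Theta, H, rng=None):
    """Multinomial augmentation for one PFA layer: each observed count x_{mn}
    is split into per-topic latent counts x_{mkn} with probabilities proportional
    to psi_{mk} * theta_{kn} * h_{kn}.  Only the nonzero entries of X are visited,
    which is why the updates scale with the number of nonzeros."""
    rng = rng or np.random.default_rng(0)
    X = sp.coo_matrix(X)                        # words x documents count matrix
    M, K = Psi.shape
    word_topic = np.zeros((M, K))               # sufficient statistics for each psi_k
    topic_doc = np.zeros((K, X.shape[1]))       # sufficient statistics for each theta_kn
    for m, n, x in zip(X.row, X.col, X.data):
        p = Psi[m] * Theta[:, n] * H[:, n] + 1e-12    # small guard against an all-zero row
        counts = rng.multinomial(int(x), p / p.sum())
        word_topic[m] += counts
        topic_doc[:, n] += counts
    return word_topic, topic_doc
```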
our model represent topics θn hn encodes topic usage and encodes topic usage for categories in hdp documents are assigned to categories priori but in our model soft assignments are estimated and encoded via θn hn as result the model in is more flexible alternative to hdp in that it groups documents into categories in an unsupervised manner similar models deep models for topic modeling employed in the deep learning literature typically utilize rbms or sbns as building blocks for instance and extended rbms via dbns to topic modeling and proposed the softmax model deep version of rsm that generalizes rbms recently proposed framework for generative deep models using exponential family modules although they consider and factorization modules akin to our pfa modules their model lacks the explicit binary unit linking between layers commonly found in traditional deep models besides their inference approach variational inference is not as conceptually simple but it scales with the number of as our model dpfa proposed in is the model closest to ours nevertheless our proposed model has number of key differentiating features both of them learn topic correlations by building multilayer modular representation on top of pfa our model uses pfa modules throughout all layers in conceptually simple and easy to interpret way dpfa uses gaussian distributed weight matrices within sbn modules these are hard to interpret in the context of topic modeling ii sbn architectures have the shortcoming of not having block conditional posteriors for their binary variables making them difficult to estimate especially as the number of variables increases iii factor loading matrices in pfas have natural shrinkage to counter overfitting thanks to the dirichlet prior used for their columns in models shrinkage has to be added via variable augmentation at the cost of increasing inference complexity iv inference for sbn modules scales with the number of hidden variables in the model not with the number of elements as in our case experiments benchmark corpora we present experiments on three corpora newsgroups news reuters corpus volume and wikipedia wiki news is composed of documents and words partitioned into training set and test set has newswire articles containing words random subset of documents is used for testing for wiki we obtained random documents from which subset of is set aside for testing following we keep vocabulary consisting of words taken from the top words in the project gutenberg library as performance measure we use perplexity defined as the geometric mean of the inverse marginal likelihood of every word in the set we can not evaluate the intractable marginal for our model thus we compute the predictive perplexity on subset of the set the remaining is used to learn variables of the model the training set is used to estimate the global parameters of the model further details on perplexity evaluation for pfa models can be found in we compare our model denoted dpfm against lda ftm rsm nhdp and dpfa with sbns and rbms for all these models we use the settings described in inference methods for rsm and dpfa are contrastive divergence with table perplexities for news and wiki size indicates number of topics binary units accordingly model dpfm dpfm nhdp lda ftm rsm method svi mcmc sgnht sgnht sgnht svi gibbs gibbs size news wiki step size and stochastic gradient thermostats sgnht respectively for our model we run samples first as burnin for mcmc and iterations with for svi for the wiki corpus dpfm is run on random 
subset of documents the code used implemented in matlab will be made publicly available table show results for the corpora being considered figures for methods other than dpfm were taken from we see that multilayer models dpfm dpfa and nhdp consistently outperform single layer ones lda ftm and rsm and that dpfm has the best performance across all corpora for models of comparable size osm result not shown are about units better than rsm in news and see we also see that mcmc yields better perplexities when compared to svi the difference in performance between these two inference methods is likely due to the approximation and the online nature of svi we verified empirically results not shown that doubling the number of hidden units adding third layer or increasing the number of for dpfm does not significantly change the results in table as note on computational complexity one iteration of the model on the news corpus takes approximately and seconds for mcmc and svi respectively for comparison we also ran the model in using model of the same size in their case it takes about and seconds to run one iteration using mcmc conditional density filtering cdf and sgnht respectively runtimes for are similar to those of lda and rsm are faster than dpfm ftm is comparable to the latter and nhdp is slower than dpfm figure shows representative ψk from the model for news for the five largest weights in ψk which correspond to topic indices we also show the top five words in their topic ψk we observe that this is loaded with religion specific topics judging by the words in them additional graphs and tables showing the top words in each topic for news and are provided in the supplementary material albuterol salmeterol ipratropium tiotropium prednisone cetirizine amoxicillin montelukast diltiazemamitriptyline clavulanate olopatadine cefdinir rizatriptan desloratadine fluticasone fexofenadine propranolol carbamazepine methimazole na rabeprazole alcaftadine lactobacillus rhamnosus gg multivitamin preparation ψk god true religion christians fact christianity wrong christian people point ψk point thing people idea writes god jesus christ christians bible god exist existence exists universe first layer topic index first layer topic index figure representative obtained from left news and right medical records weights ψk topics indices with word lists corresponding to the top five words in topics ψk classification we use news for document classification to evaluate the discriminative dpfm model described in section we use test set accuracy on the task as performance measure and compare our model against lda docnade rsm and osm results for these four models were obtained from where multinomial logistic regression with loss table test accuracy on news subscript accompanying model names indicate their size model accuracy tion was used as classification module test accuracies in table show that our model significantly outperforms the others being considered note as well that our model still improves upon the four times larger osm by more than we verified that our model outperforms well known supervised methods like multinomial logistic regression svm supervised lda and feedforward neural networks for which test accuracies ranged from to using term document frequency features we could not improve results by increasing the size of our model however we may be able to do so by following the approach of where single classification module svm is shared by topic models ldas exploration of more sophisticated deep model 
architectures for discriminative dpfms is left as future work medical records the duke university health system medical records database used here is year dataset generated within large health system including three hospitals and an extensive network of outpatient clinics for this analysis we utilized medication usage from over patients that had over million patient visits these patients reported over different types of medications which were then mapped to one of pharmaceutical active ingredients ai taken from rxnorm depository of medication information maintained by the national library of medicine that includes trade names brand names dosage information and active ingredients counts for usage reflected the number of times an ai appears in patients record compound medications that include multiple active ingredients incremented counts for all ai in that medication removing ais with less than overall occurrences and patients lacking medication information results in matrix of ais patients results for dpfm of size with the same setting used for the first experiment indicate that pharmaceutical topics derived from this analysis form clinically reasonable clusters of pharmaceuticals that may be prescribed to patients for various ailments in particular we found that topic includes cluster of insulin products insulin glargine insulin lispro insulin aspart nph insulin and regular insulin insulin dependent diabetes patients often rely on tailored mixtures of insulin products with different pharmacokinetic profiles to ensure glycemic control in another example we found in topic an angiotensin receptor blocker arb losartan with hmgcoa reductase inhibitor atorvastatin and heart specific beta blocker carvedilol this combination of medications is commonly used to control hypertension and hyperlipidemia in patients with cardiovascular risk the second layer correlation structure between topics of drug products also provide interesting composites of patient types based on the pharmaceutical topics specifically factor in figure reveals correlation between drug factors that would be used to treat types of respiratory patients that had chronic obstructive respiratory disease asthma albuterol montelukast and seasonal allergies additional graphs including top medications for all pharmaceutical topics found by our model are provided in the supplementary material conclusion we presented new deep model for topic modeling based on pfa modules we have combined the interpretability of specifications found in traditional topic models with deep hierarchies of hidden binary units our model is elegant in that single class of modules is used at each layer but at the same time enjoys the computational benefit of scaling as function of the number of zeros in the data and binary units we described discriminative extension for our deep architecture and two inference methods mcmc and svi the latter for large datasets compelling experimental results on several corpora and on new medical records database demonstrated the advantages of our model future directions include working towards alternatives for scaling up inference algorithms based on approaches extending the use of pfa modules in deep architectures to more sophisticated discriminative models tasks with mixed data types and time series modeling using ideas similar to acknowledgements this research was supported in part by aro darpa doe nga and onr references blei griffiths jordan and tenenbaum hierarchical topic models and the nested chinese restaurant process in nips 
blei and lafferty correlated topic model of science aoas blei ng and jordan latent dirichlet allocation jmlr chen fox and guestrin stochastic gradient hamiltonian monte carlo in icml ding fang babbush chen skeel and neven bayesian sampling using stochastic gradient thermostats in nips gan chen henao carlson and carin scalable deep poisson factor analysis for topic modeling in icml gan henao carlson and carin learning deep sigmoid belief networks with data augmentation in aistats gan li henao carlson and carin deep temporal sigmoid belief networks for sequence modeling in nips guhaniyogi qamar and dunson bayesian conditional density filtering hinton training products of experts by minimizing contrastive divergence neural computation hinton osindero and teh fast learning algorithm for deep belief nets neural computation hinton and salakhutdinov replicated softmax an undirected topic model in nips ho cipar cui lee kim gibbons gibson ganger and xing more effective distributed ml via stale synchronous parallel parameter server in nips hoffman bach and blei online learning for latent dirichlet allocation in nips hoffman blei wang and paisley stochastic variational inference jmlr sha and jordan disclda discriminative learning for dimensionality reduction and classification in nips larochelle and lauly neural autoregressive topic model in nips lecun bottou bengio and haffner learning applied to document recognition proceedings of the ieee li andersen smola and yu communication efficient distributed machine learning with the parameter server in nips maaloe arngren and winther deep belief nets for topic modeling mcauliffe and blei supervised topic models in nips neal connectionist learning of belief networks artificial intelligence paisley wang blei and jordan nested hierarchical dirichlet processes pami ranganath tang charlin and blei deep exponential families in aistats salakhutdinov and hinton deep boltzmann machines in aistats srivastava nitish and hinton modeling documents with deep boltzmann machines in uai teh jordan beal and blei hierarchical dirichlet processes jasa welling and teh bayesian learning via stochastic gradient langevin dynamics in icml williamson wang heller and blei the ibp compound dirichlet process and its application to focused topic modeling in icml zhou infinite edge partition models for overlapping community detection and link prediction in aistats zhou and carin negative binomial process count and mixture modeling pami zhou hannah dunson and carin binomial process and poisson factor analysis in aistats zhu ahmed and xing medlda maximum margin supervised topic models jmlr 
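For reference before the next paper, the display below is a cleaned-up rendering of the model equations from the sections above, whose symbols were garbled in extraction. The structure follows standard Poisson factor analysis and the Bernoulli-Poisson link; the hyperparameter symbols (a_psi, gamma_0, c), the Gamma shape-scale parameterization, and the layer indexing are reasonable readings of the damaged text rather than verbatim reconstructions.

```latex
% One PFA module (base layer): x_n ~ Poisson( \Phi (\theta_n \circ h_n) ),
% with the observed count decomposed additively into latent per-topic counts.
\begin{align*}
x_{mn} &= \sum_{k=1}^{K} x_{mkn}, &
x_{mkn} &\sim \mathrm{Poisson}(\lambda_{mkn}), &
\lambda_{mkn} &= \psi_{mk}\,\theta_{kn}\,h_{kn},\\
\psi_k &\sim \mathrm{Dirichlet}(a_\psi \mathbf{1}_M), &
\theta_{kn} &\sim \mathrm{Gamma}\bigl(r_k, \tfrac{p_n}{1-p_n}\bigr), &
h_{kn} &\sim \mathrm{Bernoulli}(\pi_{kn}).
% Gamma(shape, scale) parameterization assumed.
\end{align*}

% Bernoulli-Poisson link (BPL): a binary unit as a thresholded latent count,
%   h_{kn} = \mathbf{1}(z_{kn} \ge 1),  z_{kn} \sim \mathrm{Poisson}(\lambda_{kn}),
% so that after marginalizing z_{kn},
%   h_{kn} \sim \mathrm{Bernoulli}(1 - e^{-\lambda_{kn}}).

% Deep Poisson factor model with L layers of PFA modules (indexing assumed):
%   x_n        ~ PFA(\Phi^{(1)}, \theta_n^{(1)}, h_n^{(1)}, r^{(1)}),
%   h_n^{(l)}  ~ BPL(\lambda_n^{(l)}),  \lambda_n^{(l)} = \Phi^{(l+1)}(\theta_n^{(l+1)} \circ h_n^{(l+1)}),
%   top layer: z_{kn} ~ Poisson(\lambda_k),  \lambda_k ~ Gamma(\gamma_0, 1/c).
```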
bayesian optimization with exponential convergence kenji kawaguchi mit cambridge ma kawaguch leslie pack kaelbling mit cambridge ma lpk mit cambridge ma tlp abstract this paper presents bayesian optimization method with exponential convergence without the need of auxiliary optimization and without the sampling most bayesian optimization methods require auxiliary optimization an additional global optimization problem which can be and hard to implement in practice also the existing bayesian optimization method with exponential convergence requires access to the sampling which was considered to be impractical our approach eliminates both requirements and achieves an exponential convergence rate introduction we consider general global optimization problem maximize subject to rd where is deterministic function such problem arises in many realworld applications such as parameter tuning in machine learning engineering design problems and model parameter fitting in biology for this problem one performance measure of an algorithm is the simple regret rn which is given by rn where is the best input vector found by the algorithm for brevity we use the term regret to mean simple regret the general global optimization problem is known to be intractable if we make no further assumptions the simplest additional assumption to restore tractability is to assume the existence of bound on the slope of variant of this assumption is lipschitz continuity with known lipschitz constant and many algorithms have been proposed in this setting these algorithms successfully guaranteed certain bounds on the regret however appealing from theoretical point of view practical concern was soon raised regarding the assumption that tight lipschitz constant is known some researchers relaxed this somewhat strong assumption by proposing procedures to estimate lipschitz constant during the optimization process bayesian optimization is an efficient way to relax this assumption of complete knowledge of the lipschitz constant and has become method for solving global optimization problems with functions in the machine learning community bayesian especially by means of gaussian process gp an active research area with the requirement of the access to the sampling procedure it samples the function uniformly such that the density of samples doubles in the feasible regions at each iteration de freitas et al recently proposed theoretical procedure that maintains an exponential convergence rate exponential regret however as pointed out by wang et al one remaining problem is to derive optimization method with an exponential convergence rate without the sampling procedure which is computationally too demanding in many cases in this paper we propose novel global optimization algorithm which maintains an exponential convergence rate and converges rapidly without the sampling procedure gaussian process optimization in gaussian process optimization we estimate the distribution over function and use this information to decide which point of should be evaluated next in parametric approach we consider parameterized function with being distributed according to some prior in contrast the nonparametric gp approach directly puts the gp prior over as gp where is the mean function and is the covariance function or the kernel that is and for finite set of points the gp model is simply joint gaussian where ki xi xj and is the number of data points to predict the value of at new data point we first consider the joint distribution over of the old data points 
and the new data point xn xn kt xn xn where xn rn then after factorizing the joint distribution using the schur complement for the joint gaussian we obtain the conditional distribution conditioned on observed entities dn and xn as xn xn xn xn where xn xn kt and xn xn xn kt one advantage of gp is that this solution simplifies both its analysis and implementation to use gp we must specify the mean function and the covariance function the mean function is usually set to be zero with this zero mean function the conditional mean xn can still be flexibly specified by the covariance function as shown in the above equation for for the covariance function there are several common choices including the matern kernel and the gaussian kernel for example the gaussian kernel is defined as exp where is the kernel parameter matrix the kernel parameters or hyperparameters can be estimated by empirical bayesian methods see for more information about gp the flexibility and simplicity of the gp prior make it common choice for continuous objective functions in the bayesian optimization literature bayesian optimization with gp selects the next query point that optimizes the acquisition function generated by gp commonly used acquisition functions include the upper confidence bound ucb and expected improvement ei for brevity we consider bayesian optimization with ucb which works as follows at each iteration the ucb function is maintained as ςσ where is parameter of the algorithm to find the next query for the objective function solves an additional optimization problem with as xn arg maxx this is often carried out by other global optimization methods such as direct and the justification for introducing new optimization problem lies in the assumption that the cost of evaluating the objective function dominates that of solving additional optimization problem for deterministic function de freitas et al recently presented theoretical procedure that maintains exponential convergence rate however their own paper and the research point out that this result relies on an impractical sampling procedure the sampling to overcome this issue wang et al combined with hierarchical partitioning optimization method the soo algorithm providing regret bound with polynomial dependence on the number of function evaluations they concluded that creating algorithm with an exponential convergence rate without the impractical sampling procedure remained an open problem gp optimization overview the algorithm can be seen as member of the class of search methods which includes lipschitz optimization search and algorithms with optimism in the face of uncertainty search methods have common property the tightness of the bound determines its effectiveness the tighter the bound is the better the performance becomes however it is often difficult to obtain tight bound while maintaining correctness for example in search admissible heuristics maintain the correctness of the bound but the estimated bound with admissibility is often too loose in practice resulting in long period of global search the algorithm has the same problem the bound in is represented by ucb which has the following property with some probability we formalize this property in the analysis of our algorithm the problem is essentially due to the difficulty of obtaining tight bound such that and with some probability our solution strategy is to first admit that the bound encoded in gp prior may not be tight enough to be useful by itself instead of relying on single bound given by the 
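The predictive equations above amount to a few lines of linear algebra: mu(x) = k(x)^T K^{-1} y and sigma^2(x) = k(x, x) - k(x)^T K^{-1} k(x) follow from the Schur complement of the joint Gaussian, and the acquisition is U(x) = mu(x) + varsigma * sigma(x). A minimal numpy sketch is given below; the Gaussian kernel, its lengthscale, and the jitter term are illustrative choices, and varsigma stands in for the confidence parameter defined later in the algorithm section.

```python
import numpy as np

def gaussian_kernel(A, B, lengthscale=0.2):
    """Isotropic squared-exponential kernel k(a, b) = exp(-||a - b||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, x_star, kernel=gaussian_kernel, jitter=1e-8):
    """Zero-mean GP posterior mean and variance at a query point x_star (shape (d,)),
    given observed inputs X (shape (N, d)) and values y (shape (N,))."""
    K = kernel(X, X) + jitter * np.eye(len(X))
    k_star = kernel(X, x_star[None, :]).ravel()          # k(x_*, x_i), shape (N,)
    alpha = np.linalg.solve(K, y)
    v = np.linalg.solve(K, k_star)
    mu = k_star @ alpha
    var = kernel(x_star[None, :], x_star[None, :])[0, 0] - k_star @ v
    return mu, max(var, 0.0)

def ucb(X, y, x_star, varsigma):
    """Upper confidence bound U(x) = mu(x) + varsigma * sigma(x)."""
    mu, var = gp_posterior(X, y, x_star)
    return mu + varsigma * np.sqrt(var)
```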
gp we leverage the existence of an unknown bound encoded in the continuity at global optimizer assumption unknown bound there exists global optimizer and an unknown such that for all and in other words we do not expect the known upper bound due to gp to be tight but instead expect that there exists some unknown bound that might be tighter notice that in the case where the bound by gp is as tight as the unknown bound by in assumption our method still maintains an exponential convergence rate and an advantage over no need for auxiliary optimization our method is expected to become relatively much better when the known bound due to gp is less tight compared to the unknown bound by as the is unknown there are infinitely many possible candidates that we can think of for accordingly we simultaneously conduct global and local searches based on all the candidates of the bounds the bound estimated by gp is used to reduce the number of candidates since the bound estimated by gp is known we can ignore the candidates of the bounds that are looser than the bound estimated by gp the source code of the proposed algorithm is publicly available at http description of algorithm figure illustrates how the algorithm works with simple objective function we employ hierarchical partitioning to maintain hyperintervals as illustrated by the line segments in the figure we consider hyperrectangle as our hyperinterval with its center being the evaluation point of blue points in each line segment in figure for each iteration the algorithm performs the following procedure for each interval size select the interval with the maximum center value among the intervals of the same size ii keep the interval selected by if it has center value greater than that of any larger interval iii keep the interval accepted by ii if it contains ucb greater than the center value of any smaller interval iv if an interval is accepted by iii divide it along with the longest coordinate into three new intervals for each new interval if the ucb of the evaluation point is less than the best function value found so far skip the evaluation and use the ucb value as the center value until the interval is accepted in step ii on some future iteration otherwise evaluate the center value vi repeat steps until every size of intervals are considered then at the end of each iteration the algorithm updates the gp hyperparameters here the purpose of steps iii is to select an interval that might contain the global optimizer steps and ii select the possible intervals based on the unknown bound by while step iii does so based on the bound by gp we now explain the procedure using the example in figure let be the number of divisions of intervals and let be the number of function evaluations is the number of iterations initially there is only one interval the center of the input region and thus this interval is divided resulting in the first diagram of figure at the beginning of iteration step selects the third interval from the left side in the first diagram as its center value is the maximum because there are no intervals of different size at this point steps ii and iii are skipped step iv divides the third interval and then the gp hyperparameters are updated resulting in the second figure an illustration of imgpo is the number of iteration is the number of divisions or splits is the number of function evaluations diagram at the beginning of iteration it starts conducting steps for the largest intervals step selects the second interval from the left side and 
step ii is skipped step iii accepts the second interval because the ucb within this interval is no less than the center value of the smaller intervals resulting in the third diagram iteration continues by conducting steps for the smaller intervals step selects the second interval from the left side step ii accepts it and step iii is skipped resulting in the forth diagram the effect of the step can be seen in the diagrams for iteration at the far right interval is divided but no function evaluation occurs instead ucb values given by gp are placed in the new intervals indicated by the red asterisks one of the temporary dummy values is resolved at when the interval is queried for division as shown by the green asterisk the effect of step iii for the rejection case is illustrated in the last diagram for iteration at is increased to from meaning that the largest intervals are first considered for division however the three largest intervals are all rejected in step iii resulting in the division of very small interval near the global optimum at technical detail of algorithm we define to be the depth of the hierarchical partitioning tree and ch to be the center point of the ith hyperrectangle at depth ngp is the number of the gp evaluations define depth to be the largest integer such that the set th is not empty to compute ucb we use ςm log where is the number of the calls made so far for each time we use we increment by one this particular form of ςm is to maintain the property of during an execution of our algorithm with probability at least here is the parameter of imgpo ξmax is another parameter but it is only used to limit the possibly long computation of step iii in the worst case step iii computes ucbs times although it would rarely happen the pseudocode is shown in algorithm lines to correspond to steps iii these lines compute the index of the candidate of the rectangle that may contain global optimizer for each depth for each depth index at line indicates the remaining candidate of rectangle that we want to divide lines to correspond to steps iv where the remaining candidates of the rectangles for all are divided to provide simple executable division scheme line we assume to be hyperrectangle see the last paragraph of section for general case lines to correspond to steps ii specifically line implements step where single candidate is selected for each depth and lines to conduct step ii where some candidates are screened out lines to resolve the the temporary dummy values computed by gp lines ch to correspond to step iii where the candidates are further screened out at line indicates the set of all center points of fully expanded tree until depth within the region ch contains the nodes of covered by the hyperrectangle centered at ch in other words the fully expanded tree rooted at ch with depth and can be computed by dividing the current rectangle at ch and recursively divide all the resulting new rectangles until depth depth from ch which is depth in the whole tree algorithm gp optimization imgpo input an objective function the search domain the gp kernel ξmax and initialize the set th set to be the center point of and evaluate at ngp for do υmax for to depth do for steps ii while true do arg maxi ch ch if ch υmax then break else if ch is not labeled as then υmax ch break else ch ch and remove the label from ch ngp ngp ch ch for to depth do for step iii if then the smallest positive integer and min ξmax if exists and otherwise maxk ch if and then break υmax for to depth do for steps iv if 
and ch υmax then divide the hyperrectangle centered at ch along with the longest coordinate into three new hyperrectangles with the following centers lef center right th th ch center ch for inew lef right do if inew then inew inew inew inew max inew υmax max υmax inew else inew inew and label inew as ngp ngp update if was updated and otherwise max update gp hyperparameters by an empirical bayesian method relationship to previous algorithms the most closely related algorithm is the bamsoo algorithm which combines soo with gpucb however it only achieves polynomial regret bound while imgpo achieves exponential regret bound imgpo can achieve exponential regret because it utilizes the information encoded in the gp to reduce the degree of the unknownness of the the idea of considering set of infinitely many bounds was first proposed by jones et al their direct algorithm has been successfully applied to problems but it only maintains the consistency property convergence in the limit from theoretical viewpoint direct takes an input parameter to balance the global and local search efforts this idea was generalized to the case of an unknown and strengthened with theoretical support finite regret bound by munos in the soo algorithm by limiting the depth of the search tree with parameter hmax the soo algorithm achieves finite regret bound that depends on the dimension analysis in this section we prove an exponential convergence rate of imgpo and theoretically discuss the reason why the novel idea underling imgpo is beneficial the proofs are provided in the supplementary material to examine the effect of considering infinitely many possible candidates of the bounds we introduce the following term definition exploration loss the exploration loss ρt is the number of intervals to be divided during iteration pdepth at line the exploration loss ρτ can be computed as ρt it is the cost in terms of the number of function evaluations incurred by not committing to any particular upper bound if we were to rely on specific bound ρτ would be minimized to for example the doo algorithm has ρt even if we know particular upper bound relying on this knowledge and thus minimizing ρτ is not good option unless the known bound is tight enough compared to the unknown bound leveraged in our algorithm this will be clarified in our analysis let ρˉt be the maximum of the averages of for ρˉt max ρτ assumption there exist and in such that for all in theorem we show that the exponential convergence rate λn with is achieved we define ξn ξmax to be the largest used so far with total node expansions for simplicity we assume that is square which we satisfied in our experiments by scaling original theorem assume assumptions and let supx kx let then with probability at least the regret of imgpo is bounded as ngp rn exp ξn ln λn ρˉt importantly our bound holds for the best values of the unknown and even though these values are not given the closest result in previous work is that of bamsoo which obtained with probability for as can be seen we have improved the regret bound additionally in our analysis we can see how and affect the bound allowing us to view the inherent difficulty of an objective function in theoretical perspective here is constant in and is used in previous work for example if we conduct or function evaluations per and if we have that we note that can get close to one as input dimension increases which suggests that there is remaining challenge in scalability for higher dimensionality one strategy for addressing this 
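To make the partitioning loop concrete, below is a much-simplified Python sketch of the optimistic search the algorithm builds on: cells are grouped by depth, at most one cell per depth is expanded per sweep, an expanded cell is trisected along its longest coordinate, and a new centre can be screened by a supplied UCB before an expensive evaluation is paid for. It omits the interplay between the running maximum υmax and the UCB check, the resolution of dummy values, the ξ parameter, and the GP hyperparameter updates of the pseudocode above; all names here are illustrative.

```python
import numpy as np
from collections import defaultdict

def optimistic_partition_search(f, lo, hi, n_evals=100, ucb=None):
    """Simplified sketch: per sweep, for each depth, expand the cell with the best
    centre value provided it beats every larger cell; split it into three along its
    longest side; optionally skip evaluating a new centre whose UCB falls below the
    best value found so far."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    c0 = (lo + hi) / 2.0
    tree = defaultdict(list)                       # depth -> list of cell dicts
    tree[0].append({"lo": lo, "hi": hi, "c": c0, "v": f(c0)})
    best_x, best_v, evals = c0, tree[0][0]["v"], 1

    for _ in range(n_evals):                       # bounded number of sweeps
        vmax = -np.inf
        for depth in sorted(tree):
            if not tree[depth]:
                continue
            i = max(range(len(tree[depth])), key=lambda j: tree[depth][j]["v"])
            cell = tree[depth][i]
            if cell["v"] < vmax:                   # a larger cell already looks better
                continue
            vmax = cell["v"]
            tree[depth].pop(i)
            axis = int(np.argmax(cell["hi"] - cell["lo"]))
            w = (cell["hi"][axis] - cell["lo"][axis]) / 3.0
            for j in range(3):                     # trisect along the longest coordinate
                nlo, nhi = cell["lo"].copy(), cell["hi"].copy()
                nlo[axis] = cell["lo"][axis] + j * w
                nhi[axis] = nlo[axis] + w
                nc = (nlo + nhi) / 2.0
                if ucb is not None and ucb(nc) < best_v:
                    nv = ucb(nc)                   # dummy value; evaluation skipped
                else:
                    nv, evals = f(nc), evals + 1
                    if nv > best_v:
                        best_x, best_v = nc.copy(), nv
                tree[depth + 1].append({"lo": nlo, "hi": nhi, "c": nc, "v": nv})
                if evals >= n_evals:
                    return best_x, best_v
    return best_x, best_v
```

Passing ucb=None reduces this to a plain SOO-style search; supplying a GP-based bound is what allows cheap screening to stand in for some evaluations, which is the role the GP plays in the full algorithm.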
problem would be to leverage additional assumptions such as those in remark the effect of the tightness of ucb by gp ucb computed by if gp is useful such that ρt then our regret bound becomes exp ln if the bound due to pt up to ucb by gp is too loose and thus useless increase due to ρˉt ln which can be bounded resulting in the regret bound of exp by exp max nt ln this is still better than the known results remark the effect of gp without the use of gp our regret bound would be as follows rn exp ˉt is the exploration loss without ln where our proof works with this this can be done by limiting the depth of search tree as depth additional mechanism but results in the regret bound with being replaced by thus if we assume to have at least not useless ucbs such that ρt this additional mechanism can be disadvantageous accordingly we do not adopt it in our experiments gp therefore the use of gp reduces the regret bound by increasing ngp and decreasing ρˉt but may potentially increase the bound by increasing ξn remark the effect of optimization to understand the effect of considering all the possible upper bounds we consider the case without gp if we consider all the possible bounds ln for the best unknown and we have the regret bound exp for standard optimization with estimated bound we have exp ln for an estimated and by algebraic manipulation considering all the possible bounds has better regret when ln ln ln ln ll for an intuitive ld insight we can simplify the above by assuming and as ln ld because and are the ones that achieve the lowest bound the logarithm on the side is always hence always satisfies the condition when and are not tight enough the logarithmic term increases in magnitude allowing to increase for example if the second term on the side has magnitude of greater than then satisfies the inequality therefore even if we know the upper bound of the function we can see that it may be better not to rely on this but rather take the infinite many possibilities into account one may improve the algorithm with different division procedures than one presented in algorithm accordingly in the supplementary material we derive an abstract version of the regret bound for imgpo with family of division procedures that satisfy some assumptions this information could be used to design new division procedure experiments in this section we compare the imgpo algorithm with the soo bamsoo and algorithms in previous work bamsoo and were tested with pair of handpicked good kernel and hyperparameters for each function in our experiments we assume that the knowledge of good kernel and hyperparameters is unavailable which is usually the case in practice thus for imgpo bamsoo and we simply used one of the most popular kernels the isotropic matern kernel with this is given by where exp then we blindly initialized the hyperparameters to peaks branin figure performance comparison in the order the digits inside of the parentheses indicate the dimensionality of each function and the variables ρˉt and ξn at the end of computation for imgpo table average cpu time in seconds for the experiment with each test function algorithm soo bamsoo imgpo peaks branin and for all the experiments these values were updated with an empirical bayesian method after each iteration to compute the ucb by gp we used for imgpo and bamsoo for imgpo ξmax was fixed to be the effect of selecting different values is discussed later for bamsoo and soo the parameter hmax was set to according to corollary in for and we used the soo algorithm and 
local optimization method using gradients to solve the auxiliary optimization for soo bamsoo and imgpo we used the corresponding deterministic division procedure given the initial point is fixed and no randomness exists for and we randomly initialized the first evaluation point and report the mean and one standard deviation for runs the experimental results for eight different objective functions are shown in figure the vertical axis is where is the global optima and is the best value found by the algorithm hence the lower the plotted value on the vertical axis the better the algorithm performance the last five functions are standard benchmarks for global optimization the first two were used in to test soo and can be written as sin sin for and for the form of the third function is given in equation and figure in the last function is embedded in dimension in the same manner described in section in which is used here to illustrate possibility of using imgpo as main subroutine to scale up to higher dimensions with additional assumptions for this function we used rembo with imgpo and bamsoo as its bayesian optimization subroutine all of these functions are multimodal except for with dimensionality from to as we can see from figure imgpo outperformed the other algorithms in general soo produced the competitive results for because our gp prior was misleading it did not model the objective function well and thus the property did not hold many times as can be seen in table imgpo is much faster than traditional gp optimization methods although it is slower than soo for sin branin and increasing ξmax does not affect imgpo because ξn did not reach ξmax figure for the rest of the test functions we would be able to improve the performance of imgpo by increasing ξmax at the cost of extra cpu time conclusion we have presented the first optimization method with an exponential convergence rate λn without the need of auxiliary optimization and the sampling perhaps more importantly in the viewpoint of broader global optimization community we have provided practically oriented analysis framework enabling us to see why not relying on particular bound is advantageous and how bound can still be useful in remarks and following the advent of the direct algorithm the literature diverged along two paths one with particular bound and one without can be categorized into the former our approach illustrates the benefits of combining these two paths as stated in section our solution idea was to use method but rely less on the estimated bound by considering all the possible bounds it would be interesting to see if similar principle can be applicable to other types of methods such as planning algorithms search and the uct or fsss algorithm and learning algorithms algorithms acknowledgments the authors would like to thank remi munos for his thoughtful comments and suggestions we gratefully acknowledge support from nsf grant from onr grant and from aro grant kenji kawaguchi was supported in part by the funai overseas scholarship any opinions findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors references de freitas smola and zoghi exponential regret bounds for gaussian process bandits with deterministic observations in proceedings of the international conference on machine learning icml wang shakibi jin and de freitas bayesian optimistic optimization in proceedings of the international conference on artificial intelligence and 
statistics aistat pages snoek larochelle and adams practical bayesian optimization of machine learning algorithms in proceedings of advances in neural information processing systems nips pages carter gablonsky patrick kelley and eslinger algorithms for noisy problems in gas transmission pipeline optimization optimization and engineering zwolak tyson and watson globally optimised parameters for model of mitotic control in frog egg extracts ieee biology dixon global optima without convexity numerical optimisation centre hatfield polytechnic shubert sequential method seeking the global maximum of function siam journal on numerical analysis mayne and polak outer approximation algorithm for nondifferentiable optimization problems journal of optimization theory and applications mladineo an algorithm for finding the global maximum of multimodal multivariate function mathematical programming strongin convergence of an algorithm for finding global extremum engineering cybernetics kvasov pizzuti and sergeyev local tuning and partition strategies for diagonal go methods numerische mathematik bubeck stoltz and yu lipschitz bandits without the lipschitz constant in algorithmic learning theory pages springer gardner kusner weinberger and cunningham bayesian optimization with inequality constraints in proceedings of the international conference on machine learning icml pages wang zoghi hutter matheson and de freitas bayesian optimization in high dimensions via random embeddings in proceedings of the international joint conference on artificial intelligence pages aaai press srinivas krause seeger and kakade gaussian process optimization in the bandit setting no regret and experimental design in proceedings of the international conference on machine learning icml pages murphy machine learning probabilistic perspective mit press page rasmussen and williams gaussian processes for machine learning mit press munos optimistic optimization of deterministic functions without the knowledge of its smoothness in proceedings of advances in neural information processing systems nips jones perttunen and stuckman lipschitzian optimization without the lipschitz constant journal of optimization theory and applications kandasamy schneider and poczos high dimensional bayesian optimisation and bandits via additive models arxiv preprint surjanovic and bingham virtual library of simulation experiments test functions and datasets retrieved november from http mcdonald grantham tabor and murphy global and local optimization using radial basis function response surface models applied mathematical modelling walsh goschin and littman integrating planning and reinforcement learning in proceedings of the aaai conference on artificial intelligence aaai strehl li and littman reinforcement learning in finite mdps pac analysis the journal of machine learning research jmlr 
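As a concrete illustration of the division procedure described above, the following sketch trisects the selected hyperrectangle along its longest coordinate, keeping the parent's centre as the middle child so that its already-computed function value can be reused. The centre/width array representation and the function name are choices made for this sketch, not the paper's exact data structures.

```python
import numpy as np

def trisect(center, widths):
    """Divide an axis-aligned cell into three children along its longest
    coordinate; the middle child keeps the parent's centre."""
    i = int(np.argmax(widths))       # index of the longest coordinate
    step = widths[i] / 3.0

    left, right = center.copy(), center.copy()
    left[i] -= step
    right[i] += step

    child_widths = widths.copy()
    child_widths[i] = step           # each child is one third as wide along axis i

    return [(left, child_widths.copy()),
            (center.copy(), child_widths.copy()),
            (right, child_widths.copy())]
```

Roughly speaking, IMGPO would additionally screen the two new centres with the GP-based UCB before spending a true function evaluation on them, which is how the algorithm saves evaluations relative to SOO.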
sample complexity of episodic reinforcement learning emma brunskill computer science department carnegie mellon university ebrun christoph dann machine learning department carnegie mellon university cdann abstract recently there has been significant progress in understanding reinforcement learning in discounted markov decision processes mdps by deriving tight sample complexity bounds however in many applications an interactive learning agent operates for fixed or bounded period of time for example tutoring students for exams or handling customer service requests such scenarios can often be better treated as episodic mdps for which only looser bounds on the sample complexity exist natural notion of sample complexity in this setting is the number of episodes required to guarantee certain performance with high probability pac guarantee in this paper we derive an upper pac bound ln and lower pac bound ln that match up to and an additional linear dependency on the number of states the lower bound is the first of its kind for this setting our upper bound leverages bernstein inequality to improve on previous bounds for episodic finitehorizon mdps which have dependency of at least introduction and motivation consider test preparation software that tutors students for national advanced placement exam taken at the end of year or maximizing business revenue by the end of each quarter each individual task instance requires making sequence of decisions for fixed number of steps tutoring one student to take an exam in spring or maximizing revenue for the end of the second quarter of therefore they can be viewed as sequential decision making under uncertainty problem in contrast to an infinite horizon setting in which the number of time steps is infinite when the domain parameters markov decision process parameters are not known in advance and there is the opportunity to repeat the task many times teaching new student for each year exam maximizing revenue for each new quarter this can be treated as episodic reinforcement learning rl one important question is to understand how much experience is required to act well in this setting we formalize this as the sample complexity of reinforcement learning which is the number of time steps on which the algorithm may select an action whose value is not rl algorithms with sample complexity that is polynomial function of the domain parameters are referred to as probably approximately correct pac though there has been significant work on pac rl algorithms for the infinite horizon setting there has been relatively little work on the finite horizon scenario in this paper we present the first to our knowledge lower bound and new upper bound on the sample complexity of episodic finite horizon pac reinforcement learning in discrete spaces our bounds are tight up to in the time horizon the accuracy the number of actions and up to an additive constant in the failure probability these bounds improve upon existing results by factor of at least our results also apply when the reward model is function of the time step in addition to the state and action space while we assume stationary transition model our results can be extended readily to transitions our proposed ucfh rl algorithm that achieves our upper pac guarantee can be applied directly to wide range of episodic mdps with known it does not require additional structure such as assuming access to generative model or that the state transitions are sparse or acyclic the limited prior research on upper bound pac results for 
finite horizon mdps has focused on different settings such as partitioning longer trajectory into fixed length segments or considering sliding time window the tightest dependence on the horizon in terms of the number of episodes presented in these approaches is at least whereas our dependence is only more importantly such alternative settings require the optimal policy to be stationary whereas in general in finite horizon settings the optimal policy is nonstationary is function of both the state and the within episode fiechter and reveliotis and bountourelis do tackle closely related setting but find dependence that is at least our work builds on recent work on pac infinite horizon discounted rl that offers much tighter upper and lower sample complexity bounds than was previously known to use an infinite horizon algorithm in finite horizon setting simple change is to augment the state space by the time step ranging over which enables the learned policy to be in the original state space or equivalently stationary in the newly augmented space unfortunately since these recent bounds are in general quadratic function of the state space size the proposed state space expansion would introduce at least an additional factor in the sample complexity term yielding at least dependence in the number of episodes for the sample complexity somewhat surprisingly we prove an upper bound on the sample complexity for the finite horizon case that only scales quadratically with the horizon key part of our proof is that the variance of the value function in the finite horizon setting satisfies bellman equation we also leverage recent insights that pairs can be estimated to different precisions depending on the frequency to which they are visited under policy extending these ideas to also handle when the policy followed is nonstationary our lower bound analysis is quite different than some prior results and involves construction of parallel bandits where it is required that the best arm in certain portion of the bandits is identified with high probability to achieve problem setting and notation we consider episodic mdps which can be formalized as tuple both the statespace and the actionspace are finite sets the learning agent interacts with the mdp in episodes of time steps at time the agent observes state st and choses an action at based on policy that potentially depends on the time step at πt st for the next state is sampled from the stationary transition kernel at and the initial state from in addition the agent receives reward drawn from with mean rt st determined by the reward function the reward function is possibly and takes values in the quality of policy is evaluated by hp the total expected reward of an episode rm rt st for simplicity we assume that the reward function is known to the agent but the transition kernel is unknown the question we study is how many episodes does learning agent follow policy that is not rm rm with probability at least for any chosen accuracy and failure probability notation in the following sections we reason about the true mdp an empirical mdp and an optimistic mdp which are identical except for their transition probabilities and we will provide more details about these mdps later we introduce the notation explicitly only for but the quantities carry over to and with additional tildes or hats by replacing with or previous works have shown that the complexity of learning state transitions usually dominates learning reward functions we therefore follow existing sample 
complexity analyses and assume known rewards for simplicity the algorithm and pac bound can be extended readily to the case of unknown reward functions the best action will generally depend on the state and the number of remaining time steps in the tutoring example even if the student has the same state of knowledge the optimal tutor decision may be to space practice if there is many days till the test and provide intensive practice if the test is tomorrow it is straightforward to have the reward depend on the state or or state the linear operator piπ πi takes any function and returns the expected value of with respect to the next time for convenience we define the version as pi piπ the value function from time to hp pj time is defined as vi rt st pi rt pi ri is the optimal when the policy is clear we omit the superscript and vi we denote by the set of possible successor states of state and action the maximum number of them is denoted by maxs in general without making further assumptions we have though in many practical domains robotics user modeling each state can only transition to subset of the full set of states robot can teleport across the building but can only take local moves the notation is similar to the usual but ignores more precisely if there are constants such that ln and analogously for the natural logarithm is ln and log is the logarithm upper we now introduce new algorithm ucfh for rl in finite horizon episodic domains we will later prove ucfh is pac with an upper bound on its sample complexity that is smaller than prior approaches like many other pac rl algorithms ucfh uses an optimism under uncertainty approach to balance exploration and exploitation the algorithm generally works in phases comprised of optimistic planning policy execution and model updating that take several episodes each phases are indexed by as the agent acts in the environment and observes tuples ucfh maintains confidence set over the possible transition parameters for each pair that are consistent with the observed transitions defining such confidence set that holds with high probability can be be achieved using concentration inequalities like the hoeffding inequality one innovation in our work is to use particular new set of conditions to define the confidence set that enables us to obtain our tighter bounds we will discuss the confidence sets further below the collection of these confidence sets together form class of mdps mk that are consistent with the observed data we define as the maximum likelihood estimate of the mdp given the previous observations given mk ucfh computes policy by performing optimistic planning specifically we use finite horizon variant of extended value iteration evi evi performs modified bellman backups that are optimistic with respect to given set of parameters that is given confidence set of possible transition model parameters it selects in each time step the model within that set that maximizes the expected sum of future rewards appendix provides more details about fixed horizon evi ucfh then executes until there is pair that has been visited often enough since its last update defined precisely in the in ucfh after updating the model statistics for this new policy is obtained by optimistic planning again we refer to each such iteration of as phase with index if there is no ambiguity we omit the phase indices to avoid cluttered notation ucfh is inspired by the algorithm by lattimore and hutter but has several important differences first the policy can only be updated at 
the end of an episode so there is no need for explicit delay phases as in second the policies in ucfh are finally ucfh can directly deal with transition probabilities whereas only directly allows two possible successor states for each confidence sets the class of mdps mk consists of mdps with the known true reward function and where the transition probability from any to at any time is in the confidence set induced by of the empirical mdp solely for the purpose of computationally more efficient optimistic planning we allow transitions allows choosing different transition models in different time steps to maximize reward but this does not affect the theoretical guarantees as the true stationary mdp is still in mk with high the definition also works for transition probabilities algorithm ucfh episodic reinforcement learning algorithm input desired accuracy failure tolerance mdp result with probability at least policy wmin umax min ch log ln while do optimistic planning execute policy for all with and mk mnonst confidenceset fixedhorizonevi mk repeat sampleepisode from using until there is with max mwmin and update model statistics for one with condition above procedure sampleepisode for to do at st and at st at st at and st at st at function confidenceset ln if ln min ln ln return probability unlike the confidence intervals used by lattimore and hutter we not only include conditions based on hoeffding and bernstein inequality eq but also require that the variance of the bernoulli random variable associated with this transition is close to the empirical one eq this additional condition eq is key for making the algorithm directly applicable to generic mdps in which states can transition to any number of next states while only having linear dependency on in the pac bound pac analysis for simplicity we assume that each episode starts in fixed start state this assumption is not crucial and can easily be removed by additional notational effort theorem for any the following holds with probability at least ucfh produces sequence of policies that yield at most ln episodes with rπ the maximum number of possible successor states is denoted by the first condition in the min in equation is actually not necessary for the theoretical results to hold it can be removed and all can be replaced by similarities to other analyses the proof of theorem is quite long and involved but builds on similar techniques for bounds in reinforcement learning see brafman and tennenholtz strehl and littman the general proof strategy is closest to the one of and the obtained bounds are similar if we replace the time horizon with the equivalent in the discounted case however there are important differences that we highlight now briefly central quantity in the analysis by lattimore and hutter is the local variance of the value function the exact definition for the case will be given below the key insight for the almost tight bounds of lattimore and hutter and azar et al is to leverage the fact that these local variances satisfy bellman equation and so the discounted sum of local variances can be bounded by instead of we prove in lemma that local value function variances σi also satisfy bellman equation for mdps even if transition probabilities and rewards are this allows us to bound the total sum of local variances by and obtain similarly strong results in this setting lattimore and hutter assumed there are only two possible successor states which allows them to easily relate the local variances σi to the difference of the 
expected value of successor states in the true and optimistic mdp pi for the relation is less clear but we address this by proving bound with tight dependencies on lemma to avoid dependency on in the final pac bound we add the additional condition in equation to the confidence set we show that this allows us to the total reward difference rπ of policy with terms that either depend on σi or decrease linearly in the number of samples this gives the desired linear dependency on in the final bound we therefore avoid assuming which makes ucfh directly applicable to generic mdps with without the impractical transformation argument used by lattimore and hutter we will now introduce the notion of knownness and importance of pairs that is essential for the analysis of ucfh and subsequently present several lemmas necessary for the proof of theorem we only sketch proofs here but detailed proofs for all results are available in the appendix categorization of many pac rl sample complexity proofs only have binary notion of knownness distinguishing between known transition probability estimated sufficiently accurately and unknown however as recently shown by lattimore and hutter for the infinite horizon setting it is possible to obtain much tighter sample complexity results by using more fine grained categorization in particular key idea is that in order to obtain accurate estimates of the value function of policy from starting state it is sufficient to have only loose estimate of the parameters of that are unlikely to be visited under this policy let the weight of given policy be its expected frequency in an episode wk st πtk st πtk the importance ιk of is its relative weight compared to wmin on wk ιk min zi zi where and zi wmin note that ιk is an integer indicating the influence of the pair on the value function of similarly we define the knownness nk κk max zi zi mwk which indicates how often has been observed relative to its importance the constant is defined in algorithm we can now categorize into subsets xk xk κk ιk and xk where xk ιk is the active set and the set of pairs that are very unlikely under the current policy intuitively the model of ucfh is accurate if only few are in categories with low knownness that is important under the current policy but have not been observed often so far recall that over time observations are generated under many policies as the policy is recomputed so this condition does not always hold we will therefore distinguish between phases where for all and and phases where this condition is violated the condition essentially allows for only few in categories that are less known and more and more in categories that are more well known in fact we will show that the policy is with high probability in phases that satisfy this condition we first show the validity of the confidence sets mk lemma capturing the true mdp mk for all with probability at least proof sketch by combining hoeffding inequality bernstein inequality and the concentration result on empirical variances by maurer and pontil with the union bound we get that with probability at least for single phase fixed and fixed we then show that the number of model updates is bounded by umax and apply the union bound the following lemma bounds the number of episodes in which is violated with high probability lemma let be the number of episodes for which there are and with then and assume that ln emax where and emax wmin proof sketch we first bound the total number of times fixed pair can be observed while being in 
particular category xk in all phases for we then show that for particular the number of episodes where is bounded with high probability as the value of implies minimum probability of observing each pair in xk in an episode since the observations are not independent we use martingale concentration results to show the statement for fixed the desired result follows with the union bound over all relevant and the next lemma states that in episodes where the condition is satisfied and the true mdp is in the confidence set the expected optimistic policy value is close to the true value this lemma is the technically most involved part of the proof lemma bound mismatch in total reward assume mk if for all and ch πk πk and ln then proof sketch using basic algebraic transformations we show that ln for each in the confidence set as defined ln in eq since we assume mk we know that and satisfy this bound with for all and we use that to bound the difference of the expected value function ch of the successor state in and proving that pi ln ln where the local variance of the value function is defined as σi piπ si ai and σi σi πi this bound then is applied to pt the basic idea is to split the bound into sum of two parts by partitioning of the space by knownness that is st at for all and and st at using the fact that st at and st at are tightly coupled for each we can bound the expression eventually by the final ph key ingredient in the remainder of the proof is to bound σt by instead of the trivial bound to this end we show the lemma below lemma the variance of the value function defined as vπi pj satisfies bellman equation vi pi σi rt st vi si pj which gives vi since rmax it follows that pi σt pj pi σt rmax for all figure class of finite horizon mdps the function is defined as and otherwise where is an unknown action per state and is parameter proof sketch the proof works by induction and uses fact that the value function satisfies the bellman equation and the of conditional expectations proof sketch for theorem the proof of theorem consists of the following major parts the true mdp is in the set of mdps mk for all phases with probability at least lemma the fixedhorizonevi algorithm computes value function whose optimistic value is higher than the optimal reward in the true mdp with probability at least lemma the number of episodes with for some and are bounded with probability at least by if ln lemma if for all relevant pairs are sufficiently known and ch ln then the optimistic value computed is to the true mdp value together with part we get that with high probability the policy is in this case from parts and with probability there are at most ln episodes that are not lower pac bound theorem there exist positive constants such that for every and and for every algorithm that satisfies pac guarantee for and outputs deterministic policy there is episodic mdp mhard with ln ln na where na is the number of episodes until the algorithm policy is the constants can be set to and the ranges of possible and are of similar order than in other lower bounds for bandits and discounted mdps they are mostly determined by the bandit result by mannor and tsitsiklis we build on increasing the parameter limits and for bandits would immediately result in larger ranges in our lower bound but this was not the focus of our analysis proof sketch the basic idea is to show that the class of mdps shown in figure require at least number of observed episodes of the order of equation from the start state the agent ends up in states to 
with equal probability independent of the action from each such state the agent transitions to either good state with reward or bad state with reward and stays there for the rest of the episode therefore each state is essentially bandit with binary rewards of either or for each bandit the probability of ending up in or is equal except for the first action with at and possibly an unknown optimal action different for each state with at in the episodic setting we are considering taking suboptimal action in one of the bandits does not necessarily yield suboptimal episode we have to consider the average over all bandits instead in an episode the agent therefore needs to follow policy that would solve at least certain portion of all bandits with probability at least we show that the best strategy for the agent to achieve this is to try to solve all bandits with equal probability the number of samples required to do so then results in the lower bound in equation similar mdps that essentially solve multiple of such bandits have been used to prove lower bounds for discounted mdps however the analysis in the infinite horizon case as well as for the optimality criterion considered by kakade is significantly simpler for these criteria every time step the agent follows policy that is not counts as mistake therefore every time the agent does not pick the optimal arm in any of the bandits counts as mistake this contrasts with our setting where we must instead consider taking an average over all bandits related work on sample complexity bounds we are not aware of any lower sample complexity bounds beyond bandit results that directly apply to our setting our upper bound in theorem improves upon existing results by at least factor of we briefly review those existing results in the following timestep bounds kakade chapter proves upper and lower pac bounds for similar setting where the agent interacts indefinitely with the environment but the interactions are divided in segments of equal length and the agent is evaluated by the expected of rewards sum until the end of each segment the bound states that there are not more than time steps in ln which the agents acts strehl improves the of these bounds for their delayed algorithm to ln however in episodic mdp it is more natural to consider performance on the entire episode since suboptimality near the end of the episode is no issue as long as the total reward on the entire episode is sufficiently high kolter and ng use an interesting criterion but prove bounds for bayesian setting instead of pac bounds can be applied to the episodic case by augmenting the original statespace with per episode to allow resets after steps this adds dependencies for each in the original bound which results in of at least of these existing bounds translating the regret bounds of in corollary jaksch et al yields ln even if one ignores the reset after time on the number of episodes of at least steps lower can not be applied directly to the episodic reward criterion episode bounds similar to us fiechter uses the value of initial states as but defines the value the infinite horizon his results of order ln episodes of length are therefore not directly applicable to our setting auer and ortner investigate the same setting as we and propose algorithm that has ln regret which translates into basic pac bound of order episodes we improve on this bound substantially in terms of its dependency on and reveliotis and bountourelis also consider the episodic undiscounted setting and present an 
efficient algorithm in cases where the transition graph is acyclic and the agent knows for each state policy that visits this state with known minimum probability assumptions are quite limiting and rarely these ln explicitly depends on hold in practice and their bound of order conclusion we have shown upper and lower bounds on the sample complexity of episodic rl that are tight up to in the time horizon the accuracy the number of actions and up to an additive constant in the failure probability these bounds improve upon existing results by factor of at least one might hope to reduce the dependency of the upper bound on to be linear by an analysis similar to mormax for discounted mdps which has sample complexity linear in at the penalty of additional dependencies on our proposed ucfh algorithm that achieves our pac bound can be applied to directly to wide range of episodic mdps with known rewards and does not require additional structure such as sparse or acyclic state transitions assumed in previous work the empirical evaluation of ucfh is an interesting direction for future work acknowledgments we thank tor lattimore for the helpful suggestions and comments this work was supported by an nsf career award and the onr young investigator program for comparison we adapt existing bounds to our setting while the original bound stated by kakade only has an additional comes in through due to different normalization of rewards references alexander strehl lihong li eric wiewiora john langford and michael littman pac reinforcement learning in international conference on machine learning michael kearns and satinder singh convergence rates for and indirect algorithms in advances in neural information processing systems ronen brafman and moshe tennenholtz general polynomail time algorithm for reinforcement learning journal of machine learning research sham kakade on the sample complexity of reinforcement learning phd thesis university college london peter auer and ronald ortner online regret bounds for new reinforcement learning algorithm in proceedings austrian cognitive vision workshop tor lattimore and marcus hutter pac bounds for discounted mdps in international conference on algorithmic learning theory szita and csaba reinforcement learning with nearly tight exploration complexity bounds in international conference on machine learning mohammad gheshlaghi azar munos and hilbert kappen on the sample complexity of reinforcement learning with generative model in international conference on machine learning zico kolter and andrew ng exploration in polynomial time in international conference on machine learning fiechter efficient reinforcement learning in conference on learning theory fiechter expected mistake bound model for reinforcement learning in international conference on machine learning spyros reveliotis and theologos bountourelis efficient pac learning for episodic tasks with acyclic state spaces discrete event dynamic systems theory and applications alexander strehl lihong li and michael littman incremental learners with formal guarantees in conference on uncertainty in artificial intelligence alexander strehl lihong li and michael littman reinforcement learning in finite mdps pac analysis journal of machine learning research thomas jaksch ronald ortner and peter auer regret bounds for reinforcement learning in advances in neural information processing systems alexander strehl and michael littman an analysis of interval estimation for markov decision processes journal of computer and system 
sciences dec matthew sobel the variance of markov decision processes journal of applied probability andreas maurer and massimiliano pontil empirical bernstein bounds and penalization in conference on learning theory shie mannor and john tsitsiklis the sample complexity of exploration in the bandit problem journal of machine learning research thomas jaksch ronald ortner and peter auer regret bounds for reinforcement learning journal of machine learning research fan chung and linyuan lu concentration inequalities and martingale inequalities survey internet mathematics 
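To make the confidence sets used by UCFH concrete, the following sketch tests whether a candidate transition probability is consistent with the observed counts using the three kinds of conditions described above: a Hoeffding bound, a Bernstein bound, and the requirement that the true Bernoulli variance be close to the empirical one. The constants and the failure-probability parameter are illustrative placeholders rather than the exact thresholds of the algorithm box.

```python
import math

def in_confidence_set(p, p_hat, n, delta1):
    """Test whether a candidate transition probability p is consistent with
    n observations whose empirical frequency is p_hat; the constants are
    illustrative, not the paper's exact thresholds."""
    if n == 0:
        return True  # no data yet: every probability is plausible
    log_term = math.log(6.0 / delta1)

    hoeffding = abs(p - p_hat) <= math.sqrt(log_term / (2.0 * n))
    bernstein = abs(p - p_hat) <= (
        math.sqrt(2.0 * p_hat * (1.0 - p_hat) * log_term / n) + 2.0 * log_term / (3.0 * n)
    )
    variances_close = abs(
        math.sqrt(p * (1.0 - p)) - math.sqrt(p_hat * (1.0 - p_hat))
    ) <= math.sqrt(2.0 * log_term / n)

    return hoeffding and bernstein and variances_close
```

In the algorithm, the confidence set for a state-action pair would then consist of all transition vectors every coordinate of which passes such a test, and the class of plausible MDPs ranges over these sets.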
learning with relaxed supervision percy liang stanford university pliang jacob steinhardt stanford university jsteinhardt abstract for problems with deterministic constraints between the latent variables and observed output learning necessitates performing inference over latent variables conditioned on the output which can be intractable no matter how simple the model family is even finding a single latent variable setting that satisfies the constraints could be difficult for instance the observed output may be the result of a latent database query or graphics program which must be inferred here the difficulty lies not in the model but in the supervision and poor approximations at this stage could lead to following the wrong learning signal entirely in this paper we develop a rigorous approach to relaxing the supervision which yields asymptotically consistent parameter estimates despite altering the supervision our approach parameterizes a family of increasingly accurate relaxations and jointly optimizes both the model and relaxation parameters while formulating constraints between these parameters to ensure efficient inference these efficiency constraints allow us to learn in otherwise intractable settings while asymptotic consistency ensures that we always follow a valid learning signal introduction we are interested in the problem of learning from intractable supervision for example for a question answering application we might want to learn a semantic parser that maps a question which president is from arkansas to a logical form uspresident placeofbirth arkansas that executes to the answer billclinton if we are only given pairs as training data then even if the model pθ is tractable it is still intractable to incorporate the hard supervision constraint since and live in a large space and can be complex iff executes to on a database in addition to semantic parsing intractable supervision also shows up in inverse graphics relation extraction program induction and planning tasks with complex goals as we scale to weaker supervision and richer output spaces such intractabilities will become the norm one can handle the intractable constraints in various ways by relaxing them by applying them in expectation or by using approximate inference however as these constraints are part of the supervision rather than the model altering them can fundamentally change the learning process this raises the question of when such approximations are faithful enough to learn a good model in this paper we propose a framework that addresses these questions formally by constructing a relaxed supervision function with statistical and computational properties our approach is sketched in figure we start with an intractable supervision function given by the constraint together with a model family pθ we then replace by a family of functions qβ which contains giving rise to a joint model pθ we ensure tractability of inference by constraining pθ and pθ to stay close together so that the supervision is never too surprising to the model finally we optimize and subject to this tractability constraint when qβ is properly normalized there is always pressure to use the true figure sketch of our approach the supervision ranges from less exact to more exact and the model from less accurate to more accurate we define a family of relaxations qβ of the supervision and then jointly optimize both and if the supervision qβ is too harsh relative to the accuracy of the current model pθ inference becomes intractable in section we formulate constraints to avoid this intractable region
and learn within the tractable region more accurate supervision and we can prove that the global optimum of pθ is an asymptotically consistent estimate of the true model section introduces the relaxed supervision model qβ exp where iff the constraint is satisfied the original supervision is then obtained when section studies the statistical properties of this relaxation establishing asymptotic consistency as well as characterizing the properties for any fixed we show roughly that both the loss and statistical efficiency degrade by factor of βmin the inverse of the smallest coordinate of in section we introduce novel tractability constraints show that inference is efficient if the constraints are satisfied and present an algorithm for constrained optimization of the likelihood finally in section we explore the empirical properties of this algorithm on two illustrative examples framework we assume that we are given partially supervised problem where are observed and is unobserved we model given as an exponential family pθ exp and assume that is known deterministic function of hence pθ exp where encodes the constraint in general could have complicated structure rendering inference computing pθ which is needed for learning intractable to alleviate this we consider projections πj mapping to some smaller set yj we then obtain the def hopefully simpler constraint that and match under πj sj πj πj we vk assume πk is injective which implies that equals the conjunction sj we also assume that some part of call it can be imposed tractably we can always take but it is better to include as much of as possible because will be handled exactly while will be approximated we record our assumptions below definition let encode the constraint we say that πk logically decomposes if implies and πk is injective before continuing we give three examples to illustrate the definitions above example translation from unordered supervision suppose that given an input sentence each word is passed through the same unknown substitution cipher to obtain an enciphered sentence and then ordering is removed to obtain an output multiset for example we might have abaa dcdd and suppose the vocabulary is our constraint is multiset which logically decomposes as πj multiset zi for all count count sj where count counts the number of occurrences of the word the constraint is useful because it lets us restrict attention to words in rather than all of which dramatically reduces the search space if each sentence has length then yj πj example conjunctive semantic parsing suppose again that is an input sentence and that each input word xi maps to predicate set zi qm and the meaning of the sentence is the intersection of the predicates for instance if the sentence is brown dog and is the set of all brown objects and is the set of all dogs then and def is the set of all brown dogs in general we define jzk zl this is simplified form of learning semantic parsers from denotations we let be every set that is obtainable as an intersection of predicates and define πj qj for so yj note that for all we have πj qj so πm is injective we then have the following logical decomposition πj jzk zi for all jzk qj qj sj the first constraint factors across so it can be handled tractably example predicate abstraction next we consider program induction task here the input might be smallest square divisible by six larger than would be argmin mod and and and would be hence if evaluates to suppose that we have collection of predicates πj such as mod isprime etc these 
predicates are useful for giving partial credit for instance it is easier to satisfy mod than but many programs that satisfy the former will have pieces that are also in the correct using the πj to decompose will therefore provide more tractable learning signal that still yields useful information relaxing the supervision returning to the general framework let us now use sj and to relax and thus also pθ first define penalty features ψj sj and also define qβ exp for any vector then log qβ measures how far is from being satisfied for each violated sj we incur penalty βj or infinite penalty if is violated note that the original corresponds to βk normalization constant the constant for qβ is equal to log exp this is in general difficult to compute since could have arbitrary structure fortunately we can uniformly by tractable quantity proposition for any we have the following bound def log exp see the supplement for proof the intuition is that by injectivity of πk we can bound qk by the product set yj we now define our joint model which is relaxation of qβ exp pθ exp ex log pθ where is the true distribution the relaxation parameter provides between faithfulness to the original objective large and tractability small importantly pθ produces valid probabilities which can be meaningfully compared across different this will be important later in allowing us to optimize note that while pθ if the bound is not tight this gap vanishes as analysis we now analyze the effects of relaxing supervision taking proofs may be found in the supplement we will analyze the following properties effect on loss how does the value of the relaxation parameter affect the unrelaxed loss of the learned parameters assuming we had infinite data and perfect optimization amount of data needed to learn how does affect the amount of data needed in order to identify the optimal parameters optimizing and consistency what happens if we optimize jointly with is there natural pressure to increase and do we eventually recover the unrelaxed solution notation let denote the expectation under and let denote the unrelaxed loss see let inf be the optimal unrelaxed loss and be the minimizing argument finally let eθ and covθ denote the expectation and covariance respectively under pθ to simplify expressions we will often omit the arguments from and and use and for the events and for simplicity assume that effect on loss suppose we set to some fixed value βk and let be the minimizer of since is optimized for rather than it is possible that is very large indeed if is zero for even single outlier then will be infinite however we can bound under an alternative loss that is less sensitive to outliers proposition let βmin βj then the key idea in the proof is that replacing with exp in pθ does not change the loss too much in the sense that exp exp exp when βmin if βlmin hence the error increases roughly linearly with βmin min βmin is large and the original loss is small then is good surrogate of particular interest is the case perfect predictions in this case the relaxed loss also yields perfect predictor for any note conversely that proposition is vacuous when we show in the supplement that proposition is essentially tight lemma for any βmin there exists model with loss and relaxation parameter βmin such that amount of data needed to learn to estimate how much data is needed to learn we compute the def fisher information iβ which measures the statistical efficiency of the maximum likelihood estimator all of the equations below follow from standard 
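To make the relaxation concrete for the unordered-translation example, the sketch below computes the unnormalized weight exp(−Σ_j β_j ψ_j(z, y)), where ψ_j indicates a mismatch between the count of word j in the latent sentence z and in the observed multiset y, and the tractable part of the constraint (z may only use words occurring in y) is kept exact. The per-word dictionary for β and the function name are choices made for this illustration, not notation from the paper.

```python
import math
from collections import Counter

def relaxed_weight(z, y, beta):
    """Unnormalized weight exp(-sum_j beta_j * psi_j(z, y)) for the
    unordered-translation example: y is the observed multiset of output
    words, z a candidate latent sentence, and psi_j = 1 exactly when the
    count of word j in z disagrees with its count in y."""
    y_counts, z_counts = Counter(y), Counter(z)

    # the exactly enforced part s_0: z may only use words that occur in y
    if any(word not in y_counts for word in z_counts):
        return 0.0

    penalty = sum(beta[word] for word in y_counts if z_counts[word] != y_counts[word])
    return math.exp(-penalty)

# scrambling one word makes two counts disagree, so this candidate is
# down-weighted by exp(-(beta["c"] + beta["d"])) relative to an exact match
weight = relaxed_weight(["d", "c", "d", "c"], ["d", "c", "d", "d"], beta={"c": 2.0, "d": 2.0})
```

As β grows, only latent sentences whose word counts exactly match y retain non-negligible weight, recovering the original hard supervision in the limit.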
properties of exponential families with calculations in the supplement for the unrelaxed loss the fisher information is hence is easy to estimate if the features have high variance when and low variance when this should be true if all with have similar feature values while the with have varying feature values in the relaxed case the fisher information can be written to first order as iβ in other words iβ to first order is the covariance of the penalty with the statistics of to interpret this we will make the simplifying assumptions that βj βmin for all and the events are all disjoint in this case βmin and the covariance in simplifies to βmin relative to we pick up factor if we further assume that we see that the amount of data required to learn under the relaxation increases by factor of roughly βmin optimizing we now study the effects of optimizing both and jointly importantly joint optimization recovers the true distribution in the infinite data limit proposition suppose the model is for all then all global optima of satisfy pθ one such optimum is there is thus always pressure to send to and to the key fact in the proof is that the is never smaller than the conditional entropy with equality iff pθ summary based on our analyses above we can conclude that relaxation has the following impact loss the loss increases by factor of βmin in the worst case amount of data in at least one regime the amount of data needed to learn is βmin times larger the general theme is that the larger is the better the statistical properties of the maximumlikelihood estimator however larger also makes the distribution pθ less tractable as qβ becomes concentrated on smaller set of this creates between computational efficiency small and statistical accuracy large we explore this in more detail in the next section and show that in some cases we can get the best of both worlds constraints for efficient inference in light of the previous section we would like to make as large as possible on the other hand if is too large we are back to imposing exactly and inference becomes intractable we would therefore like to optimize subject to tractability constraint ensuring that we can still perform efficient inference as sketched earlier in figure we will use rejection sampling as the inference procedure with the acceptance rate as measure of tractability to formalize our approach we assume that the model pθ and the constraint are jointly tractable so that we can efficiently draw exact samples from def pθ exp at where at log exp most learning algorithms require the conditional expectations of and given and we therefore need to sample the distribution pθ exp where def log exp since we can draw exact samples from pθ using rejection sampling sample from pθ and accept with probability exp if the acceptance rate is high this algorithm lets us tractably sample from intuitively when is far from the optimum the model pθ and constraints sj will clash necessitating small value of to stay tractable as improves more of the constraints sj will be satisfied automatically under pθ allowing us to increase formally the expected number of samples is the inverse of the acceptance probability and can be expressed as see the supplement for details pθ exp exp at we can then minimize the loss see and subject to the tractability constraint ex exp at where is our computational budget while one might have initially worried that rejection sampling will perform poorly this constraint guarantees that it will perform well by bounding the number of 
rejections implementation details to minimize subject to constraint on we will develop an algorithm the algorithm maintains an inner approximation to the constraint set as well as an upper bound on the loss both of which will be updated with each iteration of the algorithm these bounds are obtained by linearizing more precisely for any we have by convexity def def def where we thus obtain bound on the loss as well as tractability constraint which are both convex minimize subject to exp at we will iteratively solve the above minimization and then update and using the minimizing from the previous step note that the minimization itself can be done without inference we only need to do inference when updating and since inference is tractable at by design we can obtain unbiased estimates of and using the rejection sampler described earlier we can also estimate at the same time by using samples from and the relation practical issue is that becomes overly stringent when is far away from it is therefore difficult to make large moves in parameter space which is especially bad for getting started initially we can solve this using the trivial constraint exp βj which will also ensure tractability we use for several initial iterations then optimize the rest of the way using to avoid degeneracies at we also constrain in all iterations we will typically take which is feasible for assuming exp to summarize we have obtained an iterative algorithm for jointly minimizing such that pθ always admits efficient rejection sampling pseudocode is provided in algorithm note that all population expectations should now be replaced with sample averages algorithm minimizing while guaranteeing tractable inference input training data initialize for while not converged do estimate and for by sampling estimate the functions using the output from the preceding step let be the solution to minimize subject to βj for update end while repeat the same loop as above with the constraint replaced by output experiments we now empirically explore our method behavior all of our code data and experiments may be found on the codalab worksheet for this paper at https which also contains more detailed plots beyond those shown here we would like to answer the following questions fixed for fixed how does the relaxation parameter affect the learned parameters what is the between accuracy and computation as we vary if only some of the constraints sj are active for each for translation we only have to worry about the words that actually appear in the output sentence then we need only include those βj in the sum for this can lead to substantial gains since now is effectively the sentence length rather than the vocabulary size adaptfull adapttied fixed fixed fixed fixed accuracy accuracy adaptfull adapttied adaptfull adaptfull fixed fixed fixed number of samples number of samples figure accuracy versus computation measured by number of samples drawn by the rejection sampler for the unordered translation task corresponding plot for the conjunctive semantic parsing task for both tasks the ixed method needs an order of magnitude more samples to achieve comparable accuracy to either adaptive method adapting does optimizing affect performance is the adaptivity of our relaxation advantageous or can we set all coordinates of to be equal how does the computational budget from and impact the optimization to answer these questions we considered using fixed ixed optimizing with computational constraint dapt ull and performing the same optimization with all 
coordinates of constrained to be equal dapt ied for optimization we used algorithm using samples to approximate each and and using the solver snopt for the inner optimization we ran algorithm for iterations when is not fixed we apply the constraint for the first iterations and for the remaining iterations when it is fixed we do not apply any constraint unordered translation we first consider the translation task from example recall that we def are given vocabulary and wish to recover an unknown substitution cipher given an input sentence the latent is the result of applying where zi is xi with probability and uniform over with probability to model this we define feature φu that counts the number of times that xi and zi hence pθ pl exp θxi zi recall also that the output multiset in our experiments we generated sentences of length with vocabulary size for each pair of adjacent words we set with drawn from power law distribution on with exponent we then set to or with equal probability this ensures that there are pairs of words that often without which the constraint would already solve the problem we set and which produces moderate range of word frequencies as well as moderate noise level we also considered setting either or to but omitted these results because essentially all methods achieved ceiling accuracy the interested reader may find them in our codalab worksheet we set the computational budget for the constraints and and as the lower bound on to measure accuracy we look at the fraction of words whose modal prediction under the model corresponds to the correct mapping we plot accuracy versus computation cumulative number of samples drawn by the rejection sampler up through the current iteration in figure note that the number of samples is plotted on for the ixed methods there is clear between computation and accuracy with multiplicative increases in computation needed to obtain additive increases in accuracy the adaptive methods completely surpass this curve achieving higher accuracy than ixed while using an order of magnitude less computation the dapt ull and dapt ied methods achieve similar results to each other in both cases all coordinates of eventually obtained their maximum value of which we set as cap for numerical reasons and which corresponds closely to imposing the exact supervision signal conjunctive semantic parsing we also ran experiments on the semantic parsing task from example we used vocabulary size and represented each predicate as subset of where the five most common words in mapped to the empty predicate and the remaining words mapped to random subset of of we used and sentence length each word in the input was drawn independently from power law with word was mapped to its correct predicate with probability and to uniformly random predicate with probability with we constrained the denotation jzk to have size by each examples until this constraint held we used the same model pθ as before and again measured accuracy based on the fraction of the vocabulary for which the modal prediction was correct we set to compare the effect of different computational budgets results are shown in figure once again the adaptive methods substantially outperform the ixed methods we also see that the accuracy of the algorithm is relatively invariant to the computational budget indeed for all of the adaptive methods all coordinates of eventually obtained their maximum value meaning that we were always using the exact supervision signal by the end of the optimization these results are 
broadly similar to the translation task suggesting that our method generalizes across tasks related work and discussion for fixed relaxation our loss is similar to the jensen risk bound defined by gimpel and smith for varying our framework is similar in spirit to annealing where the entire objective is relaxed by exponentiation and the relaxation is reduced over time an advantage of our method is that we do not have to pick fixed annealing schedule it falls out of learning and moreover each constraint can be annealed at its own pace under model optimizing the relaxed likelihood recovers the same distribution as optimizing the original likelihood in this sense our approach is similar in spirit to approaches such as pseudolikelihood and more distantly reward shaping in reinforcement learning there has in the past been considerable interest in specifying and learning under constraints on model predictions leading to family of ideas including learning generalized expectation criteria bayesian measurements and posterior regularization these ideas are nicely summarized in section of and involve relaxing the constraint either by using variational approximation or by applying the constraint in expectation rather than pointwise replacing the constraint with this leads to tractable inference when the function can be tractably incorporated as factor in the model which is the case for many problems of interest including the translation task in this paper in general however inference will be intractable even under the relaxation or the relaxation could lead to different learned parameters this motivates our framework which handles more general class of problems and has asymptotic consistency of the learned parameters the idea of learning with explicit constraints on computation appears in the context of prioritized search mcmc and dynamic feature selection these methods focus on keeping the model tractable in contrast we assume tractable model and focus on the supervision while the parameters of the model can be informed by the supervision relaxing the supervision as we do could fundamentally alter the learning process and requires careful analysis to ensure that we stay grounded to the data as an analogy consider driving car with damaged steering wheel approximate model versus not being able to see the road approximate supervision intuitively the latter appears to pose more fundamental challenge intractable supervision is key bottleneck in many applications and will only become more so as we incorporate more sophisticated logical constraints into our statistical models while we have laid down framework that grapples with this issue there is much to be deriving stochastic updates for optimization as well as tractability constraints for more sophisticated inference methods acknowledgments the first author was supported by fannie john hertz fellowship and an nsf graduate research fellowship the second author was supported by microsoft research faculty fellowship we are also grateful to the referees for their valuable comments references clarke goldwasser chang and roth driving semantic parsing from the world response in computational natural language learning conll pages liang jordan and klein learning compositional semantics in association for computational linguistics acl pages artzi and zettlemoyer weakly supervised learning of semantic parsers for mapping instructions to actions transactions of the association for computational linguistics tacl fisher ritchie savva funkhouser and hanrahan synthesis 
of object arrangements acm siggraph asia mansinghka kulkarni perov and tenenbaum approximate bayesian image interpretation using generative probabilistic graphics programs in advances in neural information processing systems nips pages chang savva and manning learning spatial knowledge for text to scene generation in empirical methods in natural language processing emnlp mintz bills snow and jurafsky distant supervision for relation extraction without labeled data in association for computational linguistics acl pages riedel yao and mccallum modeling relations and their mentions without labeled text in machine learning and knowledge discovery in databases ecml pkdd pages gulwani automating string processing in spreadsheets using examples acm sigplan notices mnih kavukcuoglu silver rusu veness bellemare graves riedmiller fidjeland ostrovski et al control through deep reinforcement learning nature chang ratinov and roth guiding with learning in association for computational linguistics acl pages ganchev and taskar expectation maximization and posterior constraints in nips van der vaart asymptotic statistics cambridge university press nielsen and garcia statistical exponential families digest with flash cards arxiv preprint gill murray and saunders snopt an sqp algorithm for constrained optimization siam journal on optimization gimpel and smith crfs training models with cost functions in north american association for computational linguistics naacl pages besag the analysis of data the statistician liang and jordan an asymptotic analysis of generative discriminative and pseudolikelihood estimators in international conference on machine learning icml pages ng harada and russell policy invariance under reward transformations theory and application to reward shaping in international conference on machine learning icml mann and mccallum generalized expectation criteria for learning of conditional random fields in pages druck mann and mccallum learning from labeled features using generalized expectation criteria in acm special interest group on information retreival sigir pages liang jordan and klein learning from measurements in exponential families in international conference on machine learning icml ganchev gillenwater and taskar posterior regularization for structured latent variable models journal of machine learning research jmlr jiang teichert eisner and daume learned prioritization for trading off accuracy and speed in advances in neural information processing systems nips shi steinhardt and liang learning where to sample in structured prediction in aistats steinhardt and liang learning models for structured prediction in icml he daume and eisner dynamic feature selection in icml inferning workshop he daume and eisner dynamic feature selection for dependency parsing in emnlp weiss and taskar learning adaptive value of information for structured prediction in advances in neural information processing systems nips pages 
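To make the unordered-translation experiment described above more concrete, the following is a minimal sketch of its data-generation process and evaluation metric: sentences drawn from a power-law word distribution, a latent sequence that keeps each word with some probability and otherwise replaces it uniformly, and an observed output given by the multiset of enciphered words. All numeric constants (vocabulary size, sentence length, noise level, power-law exponent) are illustrative placeholders, since the original values did not survive extraction, and the pointwise-mutual-information heuristic at the end is only a crude baseline to illustrate the accuracy metric, not the learning method of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical constants; the paper's actual values are not reproduced here.
VOCAB, SENT_LEN, NUM_SENTS = 20, 8, 500
NOISE, ALPHA = 0.1, 1.5

# Power-law unigram distribution over the vocabulary.
freqs = np.arange(1, VOCAB + 1, dtype=float) ** (-ALPHA)
freqs /= freqs.sum()

# Unknown substitution cipher: a random permutation of the vocabulary.
cipher = rng.permutation(VOCAB)

def make_example():
    """Sample an input sentence x and the observed output multiset y."""
    x = rng.choice(VOCAB, size=SENT_LEN, p=freqs)
    # Latent z: each word is kept with probability 1 - NOISE, else resampled uniformly.
    corrupt = rng.random(SENT_LEN) < NOISE
    z = np.where(corrupt, rng.integers(0, VOCAB, SENT_LEN), x)
    # Supervision is unordered: only the multiset of enciphered words is observed.
    y = np.sort(cipher[z])
    return x, y

data = [make_example() for _ in range(NUM_SENTS)]

# Accuracy metric from the experiment: the fraction of source words whose modal
# predicted translation equals the true cipher entry.  As a placeholder predictor
# we use sentence-level pointwise mutual information between source and output words.
x_pres = np.zeros((NUM_SENTS, VOCAB))
y_pres = np.zeros((NUM_SENTS, VOCAB))
for s, (x, y) in enumerate(data):
    x_pres[s, x] = 1.0
    y_pres[s, y] = 1.0
joint = x_pres.T @ y_pres / NUM_SENTS
pmi = joint / (x_pres.mean(0)[:, None] * y_pres.mean(0)[None, :] + 1e-9)
pred = pmi.argmax(axis=1)
print(f"modal-prediction accuracy of the PMI baseline: {(pred == cipher).mean():.2f}")
```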
subsampled power iteration unified algorithm for block models and planted csp will perkins university of birmingham vitaly feldman ibm research almaden vitaly santosh vempala georgia tech vempala abstract we present an algorithm for recovering planted solutions in two models the stochastic block model and planted constraint satisfaction problems csp via common generalization in terms of random bipartite graphs our algorithm matches up to constant factor the bounds for the number of edges or constraints needed for perfect recovery and its running time is linear in the number of edges used the time complexity is significantly better than both spectral and approaches the main contribution of the algorithm is in the case of unequal sizes in the bipartition that arises in our reduction from the planted csp here our algorithm succeeds at significantly lower density than the spectral approaches surpassing barrier based on the spectral norm of random matrix other significant features of the algorithm and analysis include the critical use of power iteration with subsampling which might be of independent interest its analysis requires keeping track of multiple norms of an evolving solution ii the algorithm can be implemented statistically with very limited access to the input distribution iii the algorithm is extremely simple to implement and runs in linear time and thus is practical even for very large instances introduction broad class of learning problems fits into the framework of obtaining sequence of independent random samples from unknown distribution and then approximately recovering this distribution using as few samples as possible we consider two natural instances of this framework the stochastic block model in which random graph is formed by choosing edges independently at random with probabilities that depend on whether an edge crosses planted partition and planted or planted in which boolean constraints are chosen independently at random with probabilities that depend on their evaluation on planted assignment to set of boolean variables we propose natural bipartite generalization of the stochastic block model and then show that planted can be reduced to this model thus unifying graph partitioning and planted csp into one problem we then give an algorithm for solving random instances of the model our algorithm is optimal up to constant factor in terms of number of sampled edges and running time for the bipartite block model for planted csp the algorithm matches up to log factors the best possible sample complexity in several restricted computational models and the bounds for any algorithm key feature of the algorithm is that when one side of the bipartition is much larger than the other then our algorithm succeeds at significantly lower edge densities than using singular value decomposition svd on the rectangular adjacency matrix details are in sec the bipartite block model begins with two vertex sets and of possibly unequal size each with balanced partition and respectively edges are added independently at random between and with probabilities that depend on which parts the endpoints are in edges between and or and are added with probability δp while the other edges are added with probability where and is the overall edge density to obtain the stochastic block model we can identify and to reduce planted csp to this model we first reduce the problem to an instance of noisy where is the complexity parameter of the planted csp distribution defined in see sec for details we then identify 
with literals and with of literals and add an edge between literal and tuple when the consisting of their union appears in the formula the reduction leads to bipartition with much larger than our algorithm is based on applying power iteration with sequence of matrices subsampled from the original adjacency matrix this is in contrast to previous algorithms that compute the eigenvectors or singular vectors of the full adjacency matrix our algorithm has several advantages such an algorithm for the special case of square matrices was previously proposed and analyzed in different context by korada et al up to constant factor the algorithm matches the and in some cases the bestpossible edge or constraint density needed for complete recovery of the planted partition or assignment the algorithm for planted csp finds the planted assignment using log clauses for clause distribution of complexity see sec for the formal definition nearly matching computational lower bounds for sdp hierarchies and the class of statistical algorithms the algorithm is fast running in time linear in the number of edges or constraints used unlike other approaches that require computing eigenvectors or solving programs the algorithm is conceptually simple and easy to implement in fact it can be implemented in the statistical query model with very limited access to the input graph it is based on the idea of iteration with subsampling which may have further applications in the design and analysis of algorithms most notably the algorithm succeeds where generic spectral approaches fail for the case of the planted csp when our algorithm succeeds at polynomial factor sparser density than the approaches of mcsherry and vu the algorithm succeeds despite the fact that the energy of the planted vector with respect to the random adjacency matrix is far below the spectral norm of the matrix in previous analyses this was believed to indicate failure of the spectral approach see sec related work the algorithm of mossel neeman and sly for the standard stochastic block model also runs in near linear time while other known algorithmic approaches for planted partitioning that succeed near the optimal edge density perform eigenvector or singular vector computations and thus require superlinear time though careful randomized implementation of approximations can reduce the running time of mcsherry algorithm substantially for planted satisfiability the algorithm of flaxman for planted works for subset of planted distributions those with distribution complexity at most in our definition below using constraints while the algorithm of cooper and frieze works for planted distributions that exclude unsatisfied clauses and uses constraints the only previous algorithm that finds the planted assignment for all distributions of planted kcsp is the algorithm of bogdanov and qiao with the folklore generalization to independent predicates cf similar to our algorithm it uses constraints this algorithm effectively solves the noisy instance and therefore can be also used to solve our general version of planted satisfiability using clauses via the reduction in sec notably for both this algorithm and ours having completely satisfying planted assignment plays no special role the number of constraints required depends only on the distribution the best of our knowledge our algorithm is the first for the planted problem that runs in linear time in the number of constraints used it is important to note that in planted the planted assignment becomes recoverable with 
high probability after at most log random clauses yet the best known efficient algorithms require nω clauses problems exhibiting this type of behavior have attracted significant interest in learning theory and some of the recent hardness results are based on the conjectured computational hardness of the refutation problem our algorithm is arguably simpler than the approach in and substantially improves the running time even for small another advantage of our approach is that it can be implemented using restricted access to the distribution of constraints referred to as statistical queries roughly speaking for the planted sat problem this access allows an algorithm to evaluate functions of single clause on randomly drawn clauses or to estimate expectations of such functions without direct access to the clauses themselves recently in lower bounds on the number of clauses necessary for statistical algorithm to solve planted were proved it is therefore important to understand the power of such algorithms for solving planted statistical implementation of our algorithm gives an upper bound that nearly matches the lower bound for the problem see for the formal details of the model and statistical implementation of our algorithm korada montanari and oh analyzed the gossip pca algorithm which for the special case of an equal bipartition is the same as our subsampled power iteration the assumptions model and motivation in the two papers are different and the results incomparable in particular while our focus and motivation are on general nonsquare matrices their work considers extracting planting of rank greater than in the square setting their results also assume an initial vector with correlation with the planted vector the nature of the guarantees is also different model and results bipartite stochastic block model definition for even and bipartitions of vertex sets of size respectively we define the bipartite stochastic block model to be the random graph in which edges between vertices in and and and are added independently with probability δp and edges between vertices in and and and with probability here is fixed constant while will tend to as note that setting and identifying and and and gives the usual stochastic block model with loops allowed for edge probabilities and we have and the overall edge density for our application to it will be crucial to allow vertex sets of very different sizes the algorithmic task for the bipartite block model is to recover one or both partitions completely or partially using as few edges and as little computational time as possible in this work we will assume that and we will be concerned with the algorithmic task of recovering the partition completely as this will allow us to solve the planted problems described below we define complete recovery of as finding the exact partition with high probability over the randomness in the graph and in the algorithm theorem assume there is constant so that the subsampled power iteration algorithm described below completely recovers the partition in the bipartite stochastic block log model with probability as when its log running time is note that for the usual stochastic block model this gives an algorithm using log edges and log time which is the best possible for complete recovery since that many edges are needed for every vertex to appear in at least edge with edge probabilities log and log our results require for some absolute constant matching the dependence on and in see for discussion of the best possible 
threshold for complete recovery for any at least edges are necessary for even partial recovery as below that threshold the graph consists only of small components and even if correct partition is found on each component correlating the partitions of different components is impossible similarly at least log are needed for complete recover of since below that density there are vertices in joined only to vertices of degree in for very lopsided graphs with the running time is sublinear in the size of this requires careful implementation and is essential to achieving the running time bounds for planted csp described below planted we now describe general model for planted satisfiability problems introduced in for an integer let ck be the set of all ordered of literals from xn xn with no repetition of variables for of literals and an assignment denotes the vector of values that assigns to the literals in planting distribution is probability distribution over definition given planting distribution and an assignment we define the random constraint satisfaction problem fq by drawing from ck independently according to the distribution qσ where is the vector of values that assigns to the of literals comprising definition the distribution complexity of the planting distribution is the smallest integer so that there is some so that the discrete fourier coefficient is in other words the distribution complexity of is if is an independent distribution on but not an independent distribution the uniform distribution over all clauses has for all and so we define its complexity to be the uniform distribution does not reveal any information about and so inference is impossible for any that is not the uniform distribution over clauses we have note that the uniform distribution on clauses with at least one satisfied literal under has distribution complexity means that there is bias towards either true or false literals in this case very simple algorithm is effective for each variable count the number of times it appears negated and not negated and take the majority vote for distributions with complexity the expected number of true and false literals in the random formula are equal and so this simple algorithm fails theorem for any planting distribution there exists an algorithm that for any assignment given an instance of fq completely recovers the planted assignment for log using log time where is the distribution complexity of for distribution complexity there is an algorithm that gives partial recovery with constraints and complete recovery with log constraints the algorithm we now present our algorithm for the bipartite stochastic block model we define vectors and of dimension and respectively indexed by and with ui for ui for and similarly for to recover the partition it suffices to find either or we will find this vector by multiplying random initial vector by sequence of centered adjacency matrices and their transposes we form these matrices as follows let gp be the random bipartite graph drawn from the model and positive integer then form different bipartite graphs gt on the same vertex sets by placing each edge from gp uniformly and independently at random into one of the graphs the resulting graphs have the same marginal distribution next we form the adjacency matrices at for gt with rows indexed by and columns by with in entry if vertex is joined to vertex finally we center the matrices by defining mi ai tp where is the all ones matrix the basic iterative steps are the multiplications and algorithm 
subsampled power iteration form log matrices mt by uniformly and independently assigning each edge of the bipartite block model to graph gt then forming the matrices mi ai tp where ai is the adjacency matrix of gi and is the all ones matrix sample uniformly at random and let for to let yi xi sgn xi for each coordinate take the majority vote of the signs of zji for all and call this vector sgn zji return the partition indicated by the analysis of the resampled power iteration algorithm proceeds in four phases during which we track the progress of two vectors xi and as measured by their inner product with and respectively we define ui xi and vi here we give an overview of each phase phase within log iterations reaches log we show that conditioned on the value of ui there is at least chance that that ui never gets too small and that in log steps run of log log doublings pushes the magnitude of ui above log phase after reaching makes steady predictable progress doubling at each step whp until it reaches at which point we say xi has strong correlation with phase once xi is strongly correlated with we show that agrees with either or on large fraction of coordinates phase we show that taking the majority vote of the signs of over log additional iterations gives complete recovery whp running time if then straightforward implementation of the algorithm runs in time linear in the number of edges used each entry of xi resp can be computed as sum over the edges in the graph associated with the rounding and majority vote are both linear in however if then simply initializing the vector will take too much time in this case we have to implement the algorithm more carefully say we have vector and want to compute xi without storing the vector instead of computing we create set of all vertices with degree at least in the current graph corresponding to the matrix the size of is bounded by the number of edges in and checking membership can be done in constant time with data structure of size that requires expected time to create recall that qj then we can write qj where is on coordinates si then to compute we write and is the all ones vector of length xi qj qj qj qj we bound the running time of the computation as follows we can compute in linear time in the number of edges of using given computing is linear in the number of edges of and computing qj is linear in the numberpof entries of which is bounded by is linear in and gives the number of edges of computing computing is linear in the number of edges of all together this gives our linear time implementation reduction of planted to the block model here we describe how solving the bipartite block model suffices to solve the planted problems consider planted problem fq with distribution complexity let be such that such an exists from the definition of the distribution complexity we assume that we know both and this set as trying all possibilities smallest first requires only constant factor more time we will restrict each in the formula to an by taking the literals specified by the set if the distribution is known to be symmetric with respect to the order of the in each clause or if clauses are given as unordered sets of literals then we can simply sample random set of literals without replacement from each clause we will show that restricting to these literals from each induces distribution on defined by qδ of the form qδ for even qδ for odd for some where is the number of true literals in under this reduction allows us to focus on algorithms for the 
specific case of distribution on with distribution complexity recall that for function its fourier coefficients are defined for each subset as fˆ χs where χs are the walsh basis functions of with respect to the uniform probability measure χs xi lemma if the function defines distribution qσ on with distribution complexity and planted assignment then for some and choosing literals with indices in from clause drawn randomly from qσ yields random from qδσ proof from definition we have that there exists an with such that note that by definition χs χs xs even xs odd pr xs even pr xs odd where xs is restricted to the coordinates in and so if we take the distribution induced by restricting to the specified by is qδσ note that by the definition of the distribution complexity for any and so the original and induced distributions are uniform over any set of coordinates first consider the case restricting each clause to for induces noisy distribution in which random true literal appears with probability and random false literal appears with probability the simple majority vote algorithm described above suffices set each variable to if it appears more often positively than negated in the restricted clauses of the formula pto if it appears more often negated and choose randomly if it appears equally often using log clauses this algorithm will give an assignment that agrees with or on variables with probability at least using cn log clauses it will recover exactly with probability now assume that we describe how the parity distribution qδσ on induces bipartite block model let be the set of literals of the given variable set and the collection of all of literals we have and we partition each set into two parts as follows is the set of false literals under and the set of true literals is the set of with an even number of true literals under and the set of with an odd number of true literals for each lr we add an edge in the block model between the tuples and lr constraint drawn according to qδσ induces random edge between and or and with probability and between and or and with probability exactly the distribution of single edge in the bipartite block model recovering the partition in this bipartite block model partitions the literals into true and false sets giving up to sign now the model in defn is that of clauses selected independently with replacement according to given distribution while in defn each edge is present independently with given probability reducing from the first to the second can be done by poissonization details given in the full version the key feature of our bipartite block model algorithm is that it uses edges corresponding to clauses in the planted csp comparison with spectral approach as noted above many approaches to graph partitioning problems and planted satisfiability problems use eigenvectors or singular vectors these algorithms are essentially based on the signs of the top eigenvector of the centered adjacency matrix being correlated with the planted vector this is fairly straightforward to establish when the average degree of the random graph is large enough however in the stochastic block model for example when the average degree is constant vertices of large degree dominate the spectrum and the straightforward spectral approach fails see for discussion and references in the case of the usual block model while our approach has fast running time it does not save on the number of edges required as compared to the standard spectral approach both require log edges however when 
eg as in the case of the planted for odd this is no longer the case consider the partitioning algorithm of let be the matrix of edge probabilities gij is the probability that the edge between vertices and is present let gu gv denote columns of corresponding to vertices let be an upper bound of the variance of an entry in the adjacency matrix sm the size of the smallest part in the planted partition the number of parts the failure probability of the algorithm and universal constant then the condition for the success of mcsherry partitioning algorithm is min in different parts kgu gv cqσ log in our case we have sm and kgu when log the condition requires while our algorithm succeeds when log in our application to planted csp with odd and this gives polynomial factor improvement in fact previous spectral approaches to planted csp or random refutation worked for even using constraints while algorithms for odd only worked for and used considerably more complicated constructions and techniques in contrast to previous approaches our algorithm unifies the algorithm for planted for odd and even works for odd and is particularly simple and fast we now describe why previous approaches faced spectral barrier for odd and how our algorithm surmounts it the previous spectral algorithms for even constructed similar graph to the one in the reduction above vertices are of literals and with edges between two tuples if their union appears as the distribution induced in this case is the stochastic block model for odd such reduction is not possible and one might try bipartite graph with either the reduction described above or with and our analysis works for this reduction as well however with clauses the spectral approach of computing the largest or second largest singular vector of the adjacency matrix does not work consider from the distribution let be the dimensional vector indexed as the rows of whose entries are if the corresponding vertex is in and otherwise define the dimensional vector analogously the next propositions summarize properties of proposition puv proposition let be the approximation of drawn from then the above propositions suffice to show high correlation between the top singular vector and the vector when and log this is because the norm of is this is higher than the norm of for this range of therefore the top singular vector of will be correlated with the top singular vector of the latter is matrix with as its left singular vector however when eg odd and the norm of the matrix is in fact much larger than the norm of letting be the vector of length with in the ith coordinate and zeroes elsewhere we see that km and so km while the former is while the latter is in other words the top singular value of is much larger than the value obtained by the vector corresponding to the planted assignment the picture is in fact richer the straightforward spectral approach succeeds for while for the top left singular vector of the centered adjacency matrix is asymptotically uncorrelated with the planted vector in spite of this one can exploit correlations to recover the planted vector below this threshold with our resampling algorithm which in this case provably outperforms the spectral algorithm acknowledgements vempala supported in part by nsf award references abbe bandeira and hall exact recovery in the stochastic block model arxiv preprint achlioptas and mcsherry fast computation of low rank matrix approximations in stoc pages berthet and rigollet complexity theoretic lower bounds for sparse principal 
component detection in colt pages blum learning boolean functions in an infinite attribute space machine learning bogdanov and qiao on the security of goldreich function in approximation randomization and combinatorial optimization algorithms and techniques pages boppana eigenvalues and graph bisection an analysis in focs pages graph partitioning via adaptive spectral techniques combinatorics probability computing cooper and frieze an efficient sparse regularity concept siam journal on discrete mathematics goerdt lanka and certifying unsatisfiability of random formulas using approximation techniques in fundamentals of computation theory pages springer daniely linial and more data speeds up training time in learning halfspaces over sparse vectors in nips pages daniely and complexity theoretic limitations on learning dnf corr decatur goldreich and ron computational sample complexity siam journal on computing feige and ofek easily refutable subformulas of large random formulas in automata languages and programming pages springer feige and ofek spectral techniques applied to sparse random graphs random structures algorithms feldman attribute efficient and learning of parities and dnf expressions journal of machine learning research feldman open problem the statistical query complexity of learning sparse halfspaces in colt pages feldman grigorescu reyzin vempala and xiao statistical algorithms and lower bound for planted clique in stoc pages feldman perkins and vempala subsampled power iteration unified algorithm for block models and planted csp corr feldman perkins and vempala on the complexity of random satisfiability problems with planted solutions in stoc pages florescu and perkins spectral thresholds in the bipartite stochastic block model arxiv preprint fredman and storing sparse table with worst case access time journal of the acm jacm friedman goerdt and krivelevich recognizing more unsatisfiable random instances efficiently siam journal on computing goerdt and krivelevich efficient recognition of random unsatisfiable instances by spectral methods in stacs pages springer kearns efficient learning from statistical queries jacm korada montanari and oh gossip pca in sigmetrics pages krzakala moore mossel neeman sly and zhang spectral redemption in clustering sparse networks pnas community detection thresholds and the weak ramanujan property in stoc pages mcsherry spectral partitioning of random graphs in focs pages mossel neeman and sly proof of the block model threshold conjecture arxiv preprint donnell and witmer goldreich prg evidence for polynomial stretch in conference on computational complexity servedio computational sample complexity and learning journal of computer and system sciences shamir and tromer using more data to training time in aistats pages vu simple svd algorithm for finding hidden partitions arxiv preprint 
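The subsampled power iteration described above can be sketched in a few lines. The sketch below generates a bipartite stochastic block model with deliberately unequal sides, splits its edges uniformly into subsampled graphs, centers each subgraph by subtracting the expected edge weight without ever forming the all-ones matrix (as in the linear-time implementation discussed earlier), and then alternates multiplication and sign rounding before taking a coordinatewise majority vote. The edge-probability parameterization, all numeric constants, and the burn-in/voting schedule are assumptions made for illustration; the paper's precise constants and normalization were lost in extraction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters (not the values used in the paper).
n1, n2 = 1500, 500       # sizes of the two vertex sets, deliberately unequal
p, delta = 0.2, 0.9      # overall edge density and bias toward within-part edges
rounds = 10              # power-iteration rounds; two fresh subsamples per round
G = 2 * rounds           # number of subsampled graphs

# Planted balanced partitions, encoded as +/-1 labels on each side.
u = np.repeat([1.0, -1.0], n1 // 2)
v = np.repeat([1.0, -1.0], n2 // 2)

# Assumed parameterization: edge probability (1 + delta) * p when labels agree,
# (1 - delta) * p when they disagree, so the overall density is p.
probs = p * (1.0 + delta * np.outer(u, v))
A = (rng.random((n1, n2)) < probs).astype(float)

# Subsampling: every potential edge is assigned independently to one of G subgraphs.
assign = rng.integers(0, G, size=A.shape)

def centered_mul(g, vec, transpose=False):
    """Multiply by M_g = A_g - (p/G) * J without materializing the all-ones matrix J."""
    Ag = A * (assign == g)
    if transpose:
        return Ag.T @ vec - (p / G) * vec.sum()
    return Ag @ vec - (p / G) * vec.sum()

# Alternate multiplication with sign rounding, then majority-vote the V1-side signs.
x = rng.choice([-1.0, 1.0], size=n2)
votes = np.zeros(n1)
for t in range(rounds):
    y = centered_mul(2 * t, x)                                   # V1-side iterate
    x = np.sign(centered_mul(2 * t + 1, np.sign(y), transpose=True))
    x[x == 0] = 1.0
    if t >= rounds // 2:                                         # vote over later rounds
        votes += np.sign(y)

u_hat = np.sign(votes)
u_hat[u_hat == 0] = 1.0
agreement = max(np.mean(u_hat == u), np.mean(u_hat == -u))
print(f"fraction of V1 recovered (up to a global sign flip): {agreement:.3f}")
```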
accelerated mirror descent in continuous and discrete time walid krichene uc berkeley alexandre bayen uc berkeley peter bartlett uc berkeley and qut walid bayen bartlett abstract we study accelerated mirror descent dynamics in continuous and discrete time combining the original motivation of mirror descent with recent ode interpretation of nesterov accelerated method we propose family of descent dynamics for convex functions with lipschitz gradients such that the solution trajectories converge to the optimum at rate we then show that large family of accelerated methods can be obtained as discretization of the ode and these methods converge at rate this connection between accelerated mirror descent and the ode provides an intuitive approach to the design and analysis of accelerated algorithms introduction we consider convex optimization problem where rn is convex and closed is convex function and is assumed to be lf let be the minimum of on many convex optimization methods can be interpreted as the discretization of an ordinary differential equation the solutions of which are guaranteed to converge to the set of minimizers perhaps the simplest such method is gradient descent given by the iteration for some step size which can be interpreted as the discretization of the ode with discretization step the theory of ordinary differential equations can provide guidance in the design and analysis of optimization algorithms and has been used for unconstrained optimization constrained optimization and stochastic optimization in particular proving convergence of the solution trajectories of an ode can often be achieved using simple and elegant lyapunov arguments the ode can then be carefully discretized to obtain an optimization algorithm for which the convergence rate can be analyzed by using an analogous lyapunov argument in discrete time in this article we focus on two families of methods nesterov accelerated method and nemirovski mirror descent method methods have become increasingly important for optimization problems that arise in machine learning applications nesterov accelerated method has been applied to many problems and extended in number of ways see for example the mirror descent method also provides an important generalization of the gradient descent method to geometries as discussed in and has many applications in convex optimization as well as online learning an intuitive understanding of these methods is of particular importance for the design and analysis of new algorithms although nesterov method has been notoriously hard to explain intuitively progress has been made recently in su et al give an ode interpretation of nesterov method however this interpretation is restricted to the original method and does not apply to its extensions to geometries in and orecchia give another interpretation of nesterov method as performing at each iteration convex combination of mirror step and gradient step although it covers broader family of algorithms including geometries this interpretation still requires an involved analysis and lacks the simplicity and elegance of odes we provide new interpretation which has the benefits of both approaches we show that broad family of accelerated methods which includes those studied in and can be obtained as discretization of simple ode which converges at rate this provides unified interpretation which could potentially simplify the design and analysis of accelerated methods the interpretation of nesterov method and the motivation of mirror descent both rely 
on lyapunov argument they are reviewed in section by combining these ideas we propose in section candidate lyapunov function that depends on two state variables which evolves in the primal space rn and which evolves in the dual space and we design coupled dynamics of to guarantee that dt such function is said to be lyapunov function in reference to see also this leads to new family of ode systems given in equation we prove the existence and uniqueness of the solution to in theorem then we prove in thereom using the lyapunov function that the solution trajectories are such that in section we give discretization of these dynamics and obtain family of accelerated mirror descent methods for which we prove the same convergence rate theorem using lyapunov argument analogous to though more involved than the case we give as an example new accelerated method on the simplex which can be viewed as performing at each step convex combination of two entropic projections with different step sizes this ode interpretation of accelerated mirror descent gives new insights and allows us to extend recent results such as the adaptive restarting heuristics proposed by donoghue and in which are known to empirically improve the convergence rate we test these methods on numerical examples in section and comment on their performance ode interpretations of nemirovski mirror descent method and nesterov accelerated method proving convergence of the solution trajectories of an ode often involves lyapunov argument for example to prove convergence of the solutions to the gradient descent ode consider the lyapunov function kx for some minimizer then the time derivative of is given by dt where the last inequality is by convexity of we tf have rt dτ thus by jensen inequality dτ dτ which proves that dτ converges to at rate mirror descent ode the previous argument was extended by nemirovski and yudin in to family of methods called mirror descent the idea is to start from function then to design dynamics for which is lyapunov function nemirovski and yudin argue that one can replace the lyapunov function kx by function on the dual space where is dual variable for which we will design the dynamics is the value of at equilibrium and the corresponding trajectory in the primal space is here is convex function defined on such that maps to and is the bregman divergence associated with defined as yi the function is said to be convex reference norm if dψ kz for all and it is said to be if kz for review of properties of bregman divergences see chapter in or appendix in by definition of the bregman divergence we have dt dt dt therefore if the dual variable obeys the dynamics then dt and the same by argument as in the gradient descent ode is lyapunov function and dτ converges to at rate the mirror descent ode system can be summarized by with note that since maps into remains in finally the unconstrained gradient descent ode can be obtained as special case of the mirror descent ode by taking for which is the identity in which case and coincide ode interpretation of nesterov accelerated method in su et al show that nesterov accelerated method can be interpreted as discretization of differential equation given by the argument uses the following lyapunov function up to reparameterization tr kx rt which is proved to be lyapunov function for the ode whenever since is decreasing along trajectories of the system it follows that for all therefore which proves that converges to at rate one should note in particular that the squared euclidean norm is 
used in the definition of and as consequence discretizing the ode leads to family of unconstrained euclidean accelerated methods in the next section we show that by combining this argument with nemirovski idea of using general bregman divergence as lyapunov function we can construct much more general family of ode systems which have the same convergence guarantee and by discretizing the resulting dynamics we obtain general family of accelerated methods that are not restricted to the unconstrained euclidean geometry accelerated mirror descent derivation of the accelerated mirror descent ode we consider pair of dual convex functions defined on and defined on such that we assume that is with respect to reference norm on the dual space consider the function where is dual variable for which we will design the dynamics and is its value at equilibrium taking the of we have dt assume that rt then the of becomes dt therefore if is such that rt and then dt and it follows that is lyapunov function whenever the proposed ode system is then with in the unconstrained euclidean case taking kzk we have thus and the ode system is equivalent to dt which is equivalent to the ode studied in which we recover as special case we also give another interpretation of ode ther first equation is equivalent to tr or in integral form tr dτ which can be written as rt dτ rt dτ with therefore the coupled dynamics of can be interpreted as follows the dual variable accumulates gradients with rt rate while the primal variable is weighted average of the mirrored dual trajectory with weights proportional to this also gives an interpretation of as parameter controlling the weight distribution it is also interesting to observe that the weights are increasing if and only if finally with this averaging interpretation it becomes clear that the primal trajectory remains in since maps into and is convex solution of the proposed dynamics first we prove existence and uniqueness of solution to the ode system defined for all by assumption is which is equivalent see to is unfortunately due to the rt term in the expression of the function is not lipschitz at and we can not directly apply the existence and uniqueness theorem however one can work around it by considering sequence of approximating odes similarly to the argument used in theorem suppose is and that is lf and let such that then the accelerated mirror descent ode system with initial condition has unique solution in rn we will show existence of solution on any given interval uniqueness is proved in the supplementary material let and consider the smoothed ode system max with since the functions rt and max are lipschitz for all by the theorem theorem in the system has unique solution xδ zδ in in order to show the existence of solution to the original ode we use the following lemma proved in the supplementary material lemma let then the family of solutions xδ zδ is continuous and uniformly bounded proof of existence consider the family of solutions xδi zδi δi restricted to by lemma this family is and uniformly bounded thus by the theorem there exists subsequence xδi zδi that converges uniformly on where is an infinite set of indices let be its limit then we prove that is solution to the original ode on first since for all xδi and zδi it follows that xδi and zδi thus satisfies the initial conditions next let and let be the solution of the ode on with initial condition since xδi zδi as then by continuity of the solution initial conditions theorem in we have that for some xδi uniformly 
on but we also have xδi uniformly on therefore and coincide on therefore satisfies the ode on and since is arbitrary in this concludes the proof of existence convergence rate it is now straightforward to establish the convergence rate of the solution theorem suppose that has lipschitz gradient and that is smooth distance generating function let be the solution to the accelerated mirror descent ode with then for all proof by construction of the ode tr is lyapunov function it follows that for all tr discretization next we show that with careful discretization of this dynamics we can obtain general family of accelerated mirror descent methods for constrained optimization using mixed system scheme see chapter in we can discretize the ode of the ode let and using step size as follows given solution tk approximating tk with tk we propose the discretization the first equation can be rewritten as kr kr note the independence on due to the invariance of the first ode in other words is convex combination of and with coefficients λk and λk to summarize our first discrete scheme can be written as λk λk λk ks since maps into the feasible set starting from guarantees that remains in for all by convexity of note that by duality we have arg hx and if we additionally assume that is differentiable on the image of then theorem in thus if we write the second equation can be written as ks ks arg min ks arg min dψ we will eventually modify this scheme in order to be able to prove the desired convergence rate however we start by analyzing this motivated by the lyapunov function and using the correspondence we consider the potential function kr then we have and through simple algebraic manipulation the last term can be bounded as follows by definition of the bregman divergence ks by the discretization ks by convexity of therefore we have comr in the case paring this expression with the expression of dt we see that we obtain an analogous expression except for the additional bregman divergence term and we can not immediately conclude that is lyapunov function this can be remedied by the following modification of the discretization scheme family of accelerated mirror descent methods in the expression of λk λk we propose to with ob replace tained as solution to minimization problem arg γs where is regularization function that satisfies the following assumptions there exist lr such that for all kx kx in the euclidean case one can take in which case lr and the update becomes in the general case one can take dφ for some distance generating function which is convex and lr in which case the update becomes mirror update the resulting method is summarized in algorithm this algorithm is generalization of and orecchia interpretation of nesterov method in where is convex combination of mirror descent update and gradient descent update algorithm accelerated mirror descent with distance generating function regularizer step size and parameter initialize or for do λk λk with λk ks dψ arg and if is kr arg γs consistency of the modified scheme one can show that given our assumptions on indeed we have γs therefore which proves the claim using this observation we can show that the modified discretization scheme is consistent with the original ode that is the difference equations defining and converge as tends to to the ordinary differential equations of the system the difference equations of algorithm are equivalent to in which is replaced by now suppose there exist functions defined on such that tk and tk for tk then the fact that we 
have tk tk and simis larly tk therefore the difference equation system can be written as tk trk tk tk tk trk tk which converges to the ode as convergence rate to prove convergence of the algorithm consider the modified potential function kr lemma if lr and then for all kr as consequence if is lyapunov function for this lemma is proved in the supplementary material theorem the accelerated mirror descent algorithm with parameter and step sizes lr rf guarantees that for all sk sk proof the first inequality follows immediately from lemma the second inequality follows from simple bound on proved in the supplementary material example accelerated entropic descent we give an instance of algorithm for problems suppose that xi is the taking to be the negative entropy on we have for xi ln xi ln ezi ezi ln xi pn ezj where is the indicator function of the simplex if and otherwise and rn is normal vector to the affine hull of the simplex the resulting mirror descent update is simple entropy projection and can be computed exactly in operations and can be shown to be see for example for the second update we take dφ where pn is smoothed negative entropy function defined as follows let and let xi ln xi although no simple expression is known for it can be computed efficiently in log time using deterministic algorithm or expected time using randomized algorithm see additionally satisfies our assumptions it is convex and the resulting accelerated mirror descent method on the simplex can then be implemented efficiently and by theorem it is guaranteed to converge in whenever and fγ numerical experiments we test the accelerated mirror descent method in algorithm on problems in rn with two different objective functions simple quadratic hx for random positive matrix and function mirror descent accelerated mirror descent speed restart gradient restart mirror descent accelerated mirror descent speed restart gradient restart weakly convex quadratic rank effect of the parameter figure evolution of on problems using different accelerated mirror descent methods with entropy distance generating functions algorithm accelerated mirror descent with restart initialize for do λl λl with λl ks arg dψ arg γs if restart condition then given by ln hai xi bi where each entry in ai and bi is iid normal we implement the accelerated entropic descent algorithm proposed in section and include the entropic descent for reference we also adapt the gradient restarting heuristic proposed by donoghue and in as well as the speed restart heuristic proposed by su et al in the generic restart the restart condi method is given in algorithm tions are the following gradient restart and ii speed restart kx kx the results are given in figure the accelerated mirror descent method exhibits polynomial convergence rate which is empirically faster than the rate predicted by theorem the method also exhibits oscillations around the set of minimizers and increasing the parameter seems to reduce the period of the oscillations and results in trajectory that is initially slower but faster for large see figure the restarting heuristics alleviate the oscillation and empirically speed up the convergence we also visualized for each experiment the trajectory of the iterates for each method projected on hyperplane the corresponding videos are included in the supplementary material conclusion by combining the lyapunov argument that motivated mirror descent and the recent ode interpretation of nesterov method we proposed family of ode systems for minimizing convex 
functions with lipschitz gradient which are guaranteed to converge at rate and proved existence and uniqueness of solution then by discretizing the ode we proposed family of accelerated mirror descent methods for constrained optimization and proved an analogous rate when the step size is small enough the connection with the dynamics motivates more detailed study of the ode such as studying the oscillatory behavior of its solution trajectories its convergence rates under additional assumptions such as strong convexity and rigorous study of the restart heuristics acknowledgments we gratefully acknowledge the nsf the arc and acems and the simons institute fall algorithmic spectral graph theory program references zeyuan and lorenzo orecchia linear coupling an ultimate unification of gradient and mirror descent in arxiv arindam banerjee srujana merugu inderjit dhillon and joydeep ghosh clustering with bregman divergences mach learn december amir beck and marc teboulle mirror descent and nonlinear projected subgradient methods for convex optimization oper res may amir beck and marc teboulle fast iterative algorithm for linear inverse problems siam journal on imaging sciences and nemirovski lectures on modern convex optimization siam aharon tamar margalit and arkadi nemirovski the ordered subsets mirror descent optimization method with applications to tomography siam on optimization january anthony bloch editor hamiltonian and gradient flows algorithms and control american mathematical society brown and some effective methods for unconstrained optimization based on the solution of systems of ordinary differential equations journal of optimization theory and applications bubeck and regret analysis of stochastic and nonstochastic bandit problems foundations and trends in machine learning butcher numerical methods for ordinary differential equations john wiley sons ltd and lugosi prediction learning and games cambridge ofer dekel ran ohad shamir and lin xiao optimal distributed online prediction in proceedings of the international conference on machine learning icml june helmke and moore optimization and dynamical systems communications and control engineering series anatoli juditsky convex optimization ii algorithms lecture notes anatoli juditsky arkadi nemirovski and claire tauvel solving variational inequalities with stochastic algorithm stoch khalil nonlinear systems macmillan pub walid krichene syrine krichene and alexandre bayen efficient bregman projections onto the simplex in ieee conference on decision and control lyapunov general problem of the stability of motion control theory and applications series taylor francis nemirovsky and yudin problem complexity and method efficiency in optimization wileyinterscience series in discrete mathematics wiley yu nesterov smooth minimization of functions mathematical programming yu nesterov gradient methods for minimizing composite functions mathematical programming yurii nesterov method of solving convex programming problem with convergence rate soviet mathematics doklady yurii nesterov introductory lectures on convex optimization volume springer science business media brendan donoghue and emmanuel adaptive restart for accelerated gradient schemes foundations of computational mathematics raginsky and bouvrie stochastic mirror descent on network variance reduction consensus convergence in cdc pages rockafellar convex analysis princeton university press schropp and singer dynamical systems approach to constrained minimization numerical functional 
analysis and optimization weijie su stephen boyd and emmanuel candes a differential equation for modeling nesterov accelerated gradient method theory and insights in nips gerald teschl ordinary differential equations and dynamical systems american mathematical society
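The accelerated entropic descent method used in the numerical experiments above can be made concrete with a short sketch. The code below runs the accelerated mirror descent iteration on the probability simplex: a primal averaging step with weight lambda_k = r/(r + k), a mirror step on the dual-side variable whose gradient weight grows with k, and a second prox step producing the averaged iterate. Two simplifications are worth flagging: the paper's second update uses a smoothed entropy regularizer computed by a specialized projection, whereas this sketch substitutes a plain entropic (exponentiated-gradient) prox, and the objective, step size, and parameters r and gamma are illustrative placeholders rather than the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative problem: minimize f(x) = 0.5 * ||Ax - b||^2 over the probability simplex.
n = 50
A = rng.standard_normal((n, n)) / np.sqrt(n)
b = 0.1 * rng.standard_normal(n)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A.T @ A, 2)             # Lipschitz constant of the gradient

# Assumed parameters; the convergence theorem requires r large enough and s small enough.
r, gamma = 3.0, 1.0
s = 0.5 / (gamma * L)

def entropy_prox(x, g):
    """argmin_z <g, z> + KL(z, x) over the simplex, i.e. x * exp(-g) renormalized."""
    w = np.log(x + 1e-300) - g
    w -= w.max()                           # subtract the max for numerical stability
    w = np.exp(w)
    return w / w.sum()

x_tilde = np.full(n, 1.0 / n)
z = np.full(n, 1.0 / n)
print(f"initial objective: {f(x_tilde):.6f}")
for k in range(1, 2001):
    lam = r / (r + k)
    x = lam * z + (1 - lam) * x_tilde      # primal averaging (convex combination)
    g = grad(x)
    z = entropy_prox(z, (k * s / r) * g)   # mirror step with linearly growing weight
    x_tilde = entropy_prox(x, gamma * s * g)   # plain entropy prox, standing in for the
                                               # smoothed-entropy update of the paper
print(f"objective after 2000 iterations: {f(x_tilde):.6f}")
```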
the human kernel andrew gordon wilson cmu christoph dann cmu christopher lucas university of edinburgh eric xing cmu abstract bayesian nonparametric models such as gaussian processes provide compelling framework for automatic statistical modelling these models have high degree of flexibility and automatically calibrated complexity however automating human expertise remains elusive for example gaussian processes with standard kernels struggle on function extrapolation problems that are trivial for human learners in this paper we create function extrapolation problems and acquire human responses and then design kernel learning framework to reverse engineer the inductive biases of human learners across set of behavioral experiments we use the learned kernels to gain psychological insights and to extrapolate in humanlike ways that go beyond traditional stationary and polynomial kernels finally we investigate occam razor in human and gaussian process based function learning introduction truly intelligent systems can learn and make decisions without human intervention therefore it is not surprising that early machine learning efforts such as the perceptron have been neurally inspired in recent years probabilistic modelling has become cornerstone of machine learning approaches with applications in neural processing and human learning from probabilistic perspective the ability for model to automatically discover patterns and perform extrapolation is determined by its support which solutions are priori possible and inductive biases which solutions are priori likely ideally we want model to be able to represent many possible solutions to given problem with inductive biases which can extract intricate structure from limited data for example if we are performing character recognition we would want our support to contain large collection of potential characters accounting even for rare writing styles and our inductive biases to reasonably reflect the probability of encountering each character the support and inductive biases of wide range of probabilistic models and thus the ability for these models to learn and generalise is implicitly controlled by covariance kernel which determines the similarities between pairs of datapoints for example bayesian basis function regression including all polynomial models splines and infinite neural networks can all exactly be represented as gaussian process with particular kernel function moreover the fisher kernel provides mechanism to reformulate probabilistic generative models as kernel methods in this paper we wish to reverse engineer support and inductive biases for function learning using gaussian process gp based kernel learning formalism in particular we create new human function learning datasets including novel function extrapolation problems and questions that explore human intuitions about simplicity and explanatory power available at http we develop statistical framework for kernel learning from the predictions of model conditioned on the training information that model is given the ability to sample multiple sets of posterior predictions from model at any input locations of our choice given any dataset of our choice provides unprecedented statistical strength for kernel learning by contrast standard kernel learning involves fitting kernel to fixed dataset that can only be viewed as single realisation from stochastic process our framework leverages spectral mixture kernels and estimates we exploit this framework to directly learn kernels from human 
responses which contrasts with all prior work on human function learning where one compares fixed model to human responses further we consider individual rather than averaged human extrapolations we interpret the learned kernels to gain scientific insights into human inductive biases including the ability to adapt to new information for function learning we also use the learned human kernels to inspire new types of covariance functions which can enable extrapolation on problems which are difficult for conventional gp models we study occam razor in human function learning and compare to gp marginal likelihood based model selection which we show is biased towards we provide an expressive quantitative means to compare existing machine learning algorithms with human learning and mechanism to directly infer human prior representations our work is intended as preliminary step towards building probabilistic kernel machines that encapsulate support and inductive biases since state of the art machine learning methods perform conspicuously poorly on number of extrapolation problems which would be easy for humans such efforts have the potential to help automate machine learning and improve performance on wide range of tasks including settings which are difficult for humans to process big data and high dimensional problems finally the presented framework can be considered in more general context where one wishes to efficiently reverse engineer interpretable properties of any model deep neural network from its predictions we further describe related work in section in section we introduce framework for learning kernels from human responses and employ this framework in section in the supplement we provide background on gaussian processes which we recommend as review related work historically efforts to understand human function learning have focused on relationships polynomial or functions or interpolation based on similarity learning griffiths et al were the first to note that gaussian process framework can be used to unify these two perspectives they introduced gp model with mixture of rbf and polynomial kernels to reflect the human ability to learn arbitrary smooth functions while still identifying simple parametric functions they applied this model to standard set of evaluation tasks comparing predictions on simple functions to averaged human judgments and interpolation performance to human error rates lucas et al extended this model to accommodate wider range of phenomena and to shed light on human predictions given sparse data our work complements these pioneering gaussian process models and prior work on human function learning but has many features that distinguish it from previous contributions rather than iteratively building models and comparing them to human predictions based on fixed assumptions about the regularities humans can recognize we are directly learning the properties of the human model through advanced kernel learning techniques essentially all models of function learning including past gp models are evaluated on averaged human responses setting aside individual differences and erasing critical statistical structure in the by contrast our approach uses individual responses many recent model evaluations rely on relatively small and heterogeneous sets of experimental data the evaluation corpora using recent reviews are limited to small set of parametric forms and more detailed analyses tend to involve only linear quadratic and logistic functions other projects have collected richer 
data but we are only aware of qualitative analyses using these data moreover experiments that depart from simple parametric functions tend to use very noisy data thus it is unsurprising that participants tend to revert to the prior mode that arises in almost all function learning experiments linear functions especially with and but see in departure from prior work we create original function learning problems with no simple parametric description and no noise where it is obvious that human learners can not resort to simple rules and acquire the human data ourselves we hope these novel datasets will inspire more detailed findings on function learning we learn kernels from human responses which provide insights into the biases driving human function learning and the human ability to progressively adapt to new information and ii enable extrapolations on problems that are difficult for conventional gp models and we investigate occam razor in human function learning and nonparametric model selection for example averaging prior draws from gaussian process would remove the structure necessary for kernel learning leaving us simply with an approximation of the prior mean function the human kernel the and associative theories for human function learning can be unified as part of gaussian process framework indeed gaussian processes contain large array of probabilistic models and have the flexibility to produce infinitely many consistent zero training error fits to any dataset moreover the support and inductive biases of gp are encaspulated by covariance kernel our goal is to learn gp covariance kernels from predictions made by humans on function learning experiments to gain better understanding of human learning and to inspire new machine learning models with improved extrapolation performance and minimal human intervention problem setup human learner is given access to data at training inputs and makes predictions at testing inputs we assume the predictions are samples from the learner posterior distribution over possible functions following results showing that human inferences and judgments resemble posterior samples across wide range of perceptual and tasks we assume we can obtain multiple draws of for given and kernel learning in standard gp applications one has access to single realisation of data and performs kernel learning by optimizing the marginal likelihood of the data with respect to covariance function hyperparameters supplement however with only single realisation of data we are highly constrained in our ability to learn an expressive kernel function requiring us to make strong assumptions such as rbf covariances to extract useful information from the data one can see this by simulating datapoints from gp with known kernel and then visualising the empirical estimate yy of the known covariance matrix the empirical estimate in most cases will look nothing like however perhaps surprisingly if we have even small number of multiple draws from gp we can recover wide array of covariance matrices using the empirical estimator where is an data matrix for draws and is vector of empirical means the typical goal in choosing kernels is to use training data to find one that minimizes some loss function generalisation error but here we want to reverse engineer the kernel of model here whatever model human learners are tacitly using that has been applied to training data based on both training data and predictions of the model if we have single sample extrapolation at test inputs based on training 
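The single-draw versus multi-draw contrast described above is easy to reproduce numerically. Below is a minimal numpy sketch of the empirical estimator from the text, (1/W) Y Yᵀ − ȳ ȳᵀ for an n × W matrix Y holding W draws; the rbf_kernel helper, the input grid, and the number of draws are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0, variance=1.0):
    # Squared-exponential covariance matrix on a 1-D input grid x.
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
K_true = rbf_kernel(x, lengthscale=1.5) + 1e-8 * np.eye(len(x))   # jitter for sampling

# A single realisation: the outer product y y^T is a very poor estimate of K_true.
y = rng.multivariate_normal(np.zeros(len(x)), K_true)
K_single = np.outer(y, y)

# W independent draws: (1/W) Y Y^T - y_bar y_bar^T recovers the covariance structure.
W = 20
Y = rng.multivariate_normal(np.zeros(len(x)), K_true, size=W).T    # n x W data matrix
y_bar = Y.mean(axis=1)
K_multi = (Y @ Y.T) / W - np.outer(y_bar, y_bar)

for name, K in [("single draw", K_single), ("20 draws", K_multi)]:
    err = np.linalg.norm(K - K_true) / np.linalg.norm(K_true)
    print(f"relative error, {name}: {err:.2f}")
```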
points and gaussian noise the probability kθ is given by the posterior predictive distribution of gaussian process with one can use this probability as utility function for kernel learning much like the marginal likelihood see the supplement for details of these distributions our problem setup affords unprecedented opportunities for flexible kernel learning if we have mul tiple sample extrapolations from given set of training data then the predictive qw conditional marginal likelihood becomes kθ one could apply this new objective for instance if we were to view different human extrapolations as multiple draws from common generative model clearly this assumption is not entirely correct since different people will have different biases but it naturally suits our purposes we are not as interested in the differences between people as the shared inductive biases and assuming multiple draws from common generative model provides extraordinary statistical strength for learning these shared biases ultimately we will study both the differences and similarities between the responses one option for kernel learning is to specify flexible parametric form for and then learn by optimizing our chosen objective functions for this approach we choose the recent spectral mixture kernels of wilson and adams which can model wide range of stationary covariances and are intended to help automate kernel selection however we note that our objective function can readily be applied to other parametric forms we also consider empirical kernel estimation since kernel estimators can have the flexibility to converge to any positive definite kernel and thus become appealing when we have the signal strength provided by multiple draws from stochastic process human experiments we wish to discover kernels that capture human inductive biases for learning functions and extrapolating from complex or ambiguous training data we start by testing the consistency of our kernel learning procedure in section in section we study progressive function learning indeed prediction kernel data kernel learned kernel posterior draw posterior draws posterior draws figure reconstructing kernel used for predictions training data were generated with an rbf kernel green and multiple independent posterior predictions were drawn from gp with prediction kernel blue as the number of posterior draws increases the learned kernel red converges to the prediction kernel humans participants will have different representation learned kernel for different observed data and examining how these representations progressively adapt with new information can shed light on our prior biases in section we learn human kernels to extrapolate on tasks which are difficult for gaussian processes with standard kernels in section we study model selection in human function learning all human participants were recruited using amazon mechanical turk and saw experimental materials provided at http when we are considering stationary ground truth kernels we use spectral mixture for kernel learning otherwise we use empirical estimate reconstructing ground truth kernels we use simulations with known ground truth to test the consistency of our kernel learning procedure and the effects of multiple posterior draws in converging to kernel which has been used to make predictions we sample datapoints from gp with rbf kernel the supplement describes gps krbf exp at random input locations conditioned on these data we then sample multiple posterior draws each containing datapoints from gp with 
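As a concrete, necessarily simplified reading of the objective just described, the sketch below scores a set of extrapolation draws under the GP posterior predictive at the test inputs, summing the log density over draws; maximising this quantity over the kernel hyperparameters is the kernel-learning step. An RBF kernel stands in for the spectral mixture kernel actually used, and the function names, noise handling, and log-parametrisation are assumptions.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.stats import multivariate_normal

def rbf(a, b, lengthscale, variance):
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def predictive_conditional_loglik(theta, x_train, y_train, x_test, Y_draws, noise=1e-2):
    """Sum over extrapolation draws (rows of Y_draws) of the log density under the
    GP posterior predictive at x_test, given (x_train, y_train) and hyperparameters theta."""
    lengthscale, variance = np.exp(theta)                    # keep hyperparameters positive
    K = rbf(x_train, x_train, lengthscale, variance) + noise * np.eye(len(x_train))
    Ks = rbf(x_test, x_train, lengthscale, variance)
    Kss = rbf(x_test, x_test, lengthscale, variance)
    c = cho_factor(K, lower=True)
    mu = Ks @ cho_solve(c, y_train)
    cov = Kss - Ks @ cho_solve(c, Ks.T) + noise * np.eye(len(x_test))
    return sum(multivariate_normal.logpdf(y_w, mean=mu, cov=cov) for y_w in Y_draws)

# Kernel learning: maximise this objective over theta, e.g. with scipy.optimize.minimize
# applied to lambda th: -predictive_conditional_loglik(th, x_tr, y_tr, x_te, Y_draws).
```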
spectral mixture kernel with two components the prediction kernel the prediction kernel has deliberately not been trained to fit the data kernel to reconstruct the prediction kernel we learn the parameters of randomly initialized spectral mixture kernel with five components qw by optimizing the predictive conditional marginal likelihood kθ wrt figure compares the learned kernels for different numbers of posterior draws against the data kernel rbf and the prediction kernel spectral mixture for single posterior draw the learned kernel captures the component of the prediction kernel but fails at reconstructing the component only with multiple draws does the learned kernel capture the longerrange dependencies the fact that the learned kernel converges to the prediction kernel which is different from the data kernel shows the consistency of our procedure which could be used to infer aspects of human inductive biases progressive function learning we asked humans to extrapolate beyond training data in two sets of functions each drawn from gps with known kernels the learners extrapolated on these problems in sequence and thus had an opportunity to progressively learn about the underlying kernel in each set to further test progressive function learning we repeated the first function at the end of the experiment for six functions in each set we asked for extrapolation judgments because they provide more information about inductive biases than interpolation and pose difficulties for conventional gp kernels the observed functions are shown in black in figure the human responses in blue and the true extrapolation in dashed black in the first two rows the black functions are drawn from gp with rational quadratic rq kernel for heavy tailed correlations there are participants we show the learned human kernel the data generating kernel the human kernel learned from spectral mixture and an rbf kernel trained only on the data in figures and respectively corresponding to figures and initially both the human learners and rq kernel show heavy tailed behaviour and bias for decreasing correlations with distance in the input space but the human learners have high degree of variance by the time they have seen figure they are human kernel data kernel rbf kernel figure progressive function learning humans are shown functions in sequence and asked to make extrapolations observed data are in black human predictions in blue and true extrapolations in dashed black observed data are drawn from rational quadratic kernel with identical data in and learned human and rbf kernels on alone and on after seeing the data in the true data generating rational quadratic kernel is shown in red observed data are drawn from product of spectral mixture and linear kernels with identical data in and the empirical estimate of the human posterior covariance matrix from all responses in the true posterior covariance matrix for more confident in their predictions and more accurately able to estimate the true signal variance of the function visually the extrapolations look more confident and reasonable indeed the human learners will adapt their representations learned kernels to different datasets however although the human learners will adapt their representations learned kernels to observed data we can see in figure that the human learners are still the tails of the kernel perhaps suggesting strong prior bias for correlations the learned rbf kernel by contrast can not capture the heavy tailed nature of the training data long range 
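For reference, the stationary spectral mixture covariance used as the prediction kernel above has a simple closed form; the sketch below follows the one-dimensional form of Wilson and Adams, with weights, spectral means, and scales as the free hyperparameters (the two- and five-component settings in the experiment correspond to the lengths of these parameter vectors). Parameter values and names here are illustrative assumptions.

```python
import numpy as np

def spectral_mixture_kernel(a, b, weights, means, scales):
    """1-D spectral mixture kernel:
    k(tau) = sum_q w_q * exp(-2 * pi^2 * tau^2 * s_q^2) * cos(2 * pi * tau * mu_q)."""
    tau = a[:, None] - b[None, :]
    k = np.zeros_like(tau, dtype=float)
    for w, mu, s in zip(weights, means, scales):
        k += w * np.exp(-2.0 * np.pi**2 * tau**2 * s**2) * np.cos(2.0 * np.pi * tau * mu)
    return k

# Example: a two-component kernel evaluated on a grid (weights/means/scales are arbitrary).
x = np.linspace(0, 10, 100)
K_pred = spectral_mixture_kernel(x, x, weights=[1.0, 0.5], means=[0.1, 0.4], scales=[0.2, 0.05])
```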
correlations due to its gaussian parametrization moreover the learned rbf kernel underestimates the signal variance of the data because it overestimates the noise variance not shown to explain away the heavy tailed properties of the data its model misspecification in the second two rows we consider problem with highly complex structure and only participants here the functions are drawn from product of spectral mixture and linear kernels as the participants see more functions they appear to expect linear trends and become more similar in their predictions in figures and we show the learned and true predictive correlation matrices using empirical estimators which indicate similar correlation structure discovering unconventional kernels the experiments reported in this section follow the same general procedure described in section in this case human participants were asked to extrapolate from two single training sets in counterbalanced order sawtooth function figure and step function figure with traing data showing as dashed black lines figure learning unconventional kernels sawtooth function dashed black and three clusters of human extrapolations empirically estimated human covariance matrix for corresponding posterior draws for from empirically estimated human covariance matrices posterior predictive draws from gp with spectral mixture kernel learned from the dashed black data step function dashed black and two clusters of human extrapolations and are the empirically estimated human covariance matrices for and and and are posterior samples using these matrices and are respectively spectral mixture and rbf kernel extrapolations from the data in black these types of functions are notoriously difficult for standard gaussian process kernels due to sharp discontinuities and behaviour in figures we used agglomerative clustering to process the human responses into three categories shown in purple green and blue the empirical covariance matrix of the first cluster figure shows the dependencies of the sawtooth form that characterize this cluster in figures we sample from the learned human kernels following the same colour scheme the samples appear to replicate the human behaviour and the purple samples provide reasonable extrapolations by contrast posterior samples from gp with spectral mixture kernel trained on the black data in this case quickly revert to prior mean as shown in fig the data are sufficiently sparse and that the spectral mixture kernel is less inclined to produce long range extrapolation than human learners who attempt to generalise from very small amount of information for the step function we clustered the human extrapolations based on response time and total variation of the predicted function responses that took between and seconds and did not vary by more than units shown in figure appeared reasonable the other responses are shown in figure the empirical covariance matrices of both sets of predictions in figures and show the characteristics of the responses while the first matrix exhibits block structure indicating the second matrix shows fast changes between positive and negative dependencies characteristic for the responses posterior sample extrapolations using the empirical human kernels are shown in figures and in figures and we show posterior samples from gps with spectral mixture and rbf kernels trained on the black data given the same information as the human learners the spectral mixture kernel is able to extract some structure some horizontal and vertical movement but 
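The per-cluster analysis described above, agglomerative clustering of the raw extrapolations followed by an empirical covariance for each cluster, together with the response-time and total-variation screening used for the step function, can be sketched as follows. The Ward linkage, the screening thresholds, and all names are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_extrapolations(responses, n_clusters=3):
    # Agglomerative (Ward) clustering of human extrapolations;
    # responses holds one participant's predicted curve per row.
    Z = linkage(responses, method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

def empirical_kernel(responses):
    # Empirical posterior covariance of one cluster of extrapolations.
    Yc = responses - responses.mean(axis=0, keepdims=True)
    return Yc.T @ Yc / len(responses)

def screen_step_responses(responses, times, t_min, t_max, tv_max):
    # Screening used for the step-function task: keep responses whose response time
    # falls in [t_min, t_max] and whose total variation is at most tv_max.
    tv = np.abs(np.diff(responses, axis=1)).sum(axis=1)
    keep = (times >= t_min) & (times <= t_max) & (tv <= tv_max)
    return responses[keep], responses[~keep]
```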
is overconfident and unconvincing compared to the human kernel extrapolations the rbf kernel is unable to learn much structure in the data human occam razor if you were asked to predict the next number in the sequence you are likely more inclined to guess than however we can produce either answer using different hypotheses that are entirely consistent with the data occam razor describes our natural tendency to favour the simplest hypothesis that fits the data and is of foundational importance in statistical model selection for example mackay argues that occam razor is automatically embodied by the marginal likelihood in performing bayesian inference indeed in our number sequence example marginal likelihood computations show that is millions of times more probable than even if the prior odds are equal occam razor is vitally important in nonparametric models such as gaussian processes which have the flexibility to represent infinitely many consistent solutions to any given problem but avoid overfitting through bayesian inference for example the marginal likelihood of gaussian process supplement separates into automatically calibrated model fit and model complexity terms sometimes referred to as automatic occam razor complex simple appropriate output all possible datasets input figure bayesian occam razor the marginal likelihood evidence all possible datasets the dashed vertical line corresponds to an example dataset posterior mean functions of gp with rbf kernel and too short too large and maximum marginal likelihood data are denoted by crosses the marginal likelihood is the probability that if we were to randomly sample parameters from that we would create dataset simple models can only generate small number of datasets but because the marginal likelihood must normalise it will generate these datasets with high probability complex models can generate wide range of datasets but each with typically low probability for given dataset the marginal likelihood will favour model of more appropriate complexity this argument is illustrated in fig fig illustrates this principle with gps function label first choice ranking average human ranking first place votes average human ranking here we examine occam razor in human learning and compare the gaussian process marginal likelihood ranking of functions all consistent with the data to human preferences we generated dataset sampled from gp with an rbf kernel and presented users with subsample of points as well as seven possible gp function fits internally labelled as follows the predictive mean of gp after maximum marginal likelihood hyperparameter estimation the generating function the predictive means of gps with larger to smaller simpler to more complex fits we repeated this procedure four times to create four datasets in total and acquired human rankings on each for total rankings each participant was shown the same unlabelled functions but with different random orderings truth ml gp marginal likelihood ranking figure human occam razor number of first place highest ranking votes for each function average human ranking with standard deviations of functions compared to first place ranking defined by average human ranking average gp marginal likelihood ranking of functions ml marginal likelihood optimum truth true extrapolation blue numbers are offsets to the log from the ml optimum positive offsets correspond to simpler solutions figure shows the number of times each function was voted as the best fit to the data which follows the internal latent 
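The "automatic Occam's razor" referred to above is the standard decomposition of the GP log marginal likelihood into a data-fit term and a complexity penalty. A minimal sketch, with an assumed noise level and helper names, is:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_log_marginal_likelihood(K, y, noise=1e-2):
    """log p(y) = -1/2 y^T Ky^{-1} y  -  1/2 log|Ky|  -  n/2 log(2*pi), with Ky = K + noise*I.
    The first term rewards data fit; the second penalises model complexity
    (the automatic Occam's razor); the third is a constant."""
    n = len(y)
    c, lower = cho_factor(K + noise * np.eye(n), lower=True)
    fit = -0.5 * y @ cho_solve((c, lower), y)
    complexity = -np.log(np.diag(c)).sum()       # equals -0.5 * log|Ky| via the Cholesky factor
    return fit + complexity - 0.5 * n * np.log(2 * np.pi), fit, complexity
```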
ordering defined above the maximum marginal likelihood solution receives the most first place votes functions and received similar numbers between and of first place votes the solutions which have smaller greater complexity than the marginal likelihood best fit represented by functions and received relatively small number of first place votes these findings suggest that on average humans prefer overly simple explanations of the data moreover participants generally agree with the gp marginal likelihood first choice preference even over the true generating function however these data also suggest that participants have wide array of prior biases leading to variability in first choice preferences furthermore of participants responded that their first ranked choice was likely to have generated the data and looks very similar to imagined it possible for highly probable solutions to be underrepresented in figure we might imagine for example that particular solution is never ranked first but always second we show the average rankings with standard deviations the standard errors are compared to the first choice rankings for each function there is general correspondence between rankings suggesting that although human distributions over functions have different modes these distributions have similar allocation of probability mass the standard deviations suggest that there is relatively more agreement that the complex small functions labels are improbable than about specific preferences for functions and finally in figure we compare the average human rankings with the average gp marginal likelihood rankings there are clear trends humans agree with the gp marginal likelihood about the best fit and that empirically decreasing the below the best fit value monotonically decreases solution probability humans penalize simple solutions less than the marginal likelihood with function receiving last place ranking from the marginal likelihood despite the observed human tendency to favour simplicity more than the gp marginal likelihood gaussian process marginal likelihood optimisation is surprisingly biased towards in function space if we generate data from gp with known the mode of the marginal likelihood on average will the true figures and in the supplement if we are unconstrained in estimating the gp covariance matrix we will converge to the maximum likelihood estimator which is degenerate and therefore biased parametrizing covariance matrix by for example by using an rbf kernel restricts this matrix to manifold on the full space of covariance matrices biased estimator will remain biased when constrained to lower dimensional manifold as long as the manifold allows movement in the direction of the bias increasing moves covariance matrix towards the degeneracy of the unconstrained maximum likelihood estimator with more data the manifold becomes more constrained and less influenced by this bias discussion we have shown that human learners have systematic expectations about smooth functions that deviate from the inductive biases inherent in the rbf kernels that have been used in past models of function learning it is possible to extract kernels that reproduce qualitative features of human inductive biases including the variable sawtooth and step patterns that human learners favour smoother or simpler functions even in comparison to gp models that tend to complexity and that is it possible to build models that extrapolate in ways which go beyond traditional stationary and polynomial kernels we have focused on 
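The bias experiment referred to above (generate data from a GP with a known length-scale, then look at where the marginal likelihood puts its mode on average) can be reproduced with a small simulation. The grid, sample sizes, and noise level below are arbitrary choices, not the settings from the paper's supplement.

```python
import numpy as np

def rbf(x, ell):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def log_marginal(K, y, noise=0.01):
    Ky = K + noise * np.eye(len(y))
    _, logdet = np.linalg.slogdet(Ky)
    return -0.5 * y @ np.linalg.solve(Ky, y) - 0.5 * logdet

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
true_ell = 1.0
grid = np.linspace(0.2, 5.0, 80)

modes = []
for _ in range(200):
    y = rng.multivariate_normal(np.zeros(len(x)), rbf(x, true_ell) + 0.01 * np.eye(len(x)))
    modes.append(grid[np.argmax([log_marginal(rbf(x, ell), y) for ell in grid])])

print(f"true length-scale: {true_ell},  average marginal-likelihood mode: {np.mean(modes):.2f}")
```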
human extrapolation from nonparametric relationships this approach complements past work emphasizing simple parametric functions and the role of noise but kernel learning might also be applied in these other settings in particular iterated learning il experiments provide way to draw samples that reflect human learners priori expectations like most function learning experiments past il experiments have presented learners with sequential data our approach following little and shiffrin instead presents learners with plots of functions this method is useful in reducing the effects of memory limitations and other sources of noise in perception it is possible that people show different inductive biases across these two presentation modes future work using multiple presentation formats with the same underlying relationships will help resolve these questions finally the ideas discussed in this paper could be applied more generally to discover interpretable properties of unknown models from their predictions here one encounters fascinating questions at the intersection of active learning experimental design and information theory references mcculloch and pitts logical calculus of the ideas immanent in nervous activity bulletin of mathematical biology christopher bishop pattern recognition and machine learning springer doya ishii pouget and rao bayesian brain probabilistic approaches to neural coding mit press zoubin ghahramani probabilistic machine learning and artificial intelligence nature daniel wolpert zoubin ghahramani and michael jordan an internal model for sensorimotor integration science david knill and whitman richards perception as bayesian inference cambridge university press sophie deneve bayesian spiking neurons inference neural computation thomas griffiths and joshua tenenbaum optimal predictions in everyday cognition psychological science tenenbaum kemp griffiths and goodman how to grow mind statistics structure and abstraction science neal bayesian learning for neural networks springer verlag isbn rasmussen and williams gaussian processes for machine learning mit press andrew gordon wilson covariance kernels for fast automatic pattern discovery and extrapolation with gaussian processes phd thesis university of cambridge http tommi jaakkola david haussler et al exploiting generative models in discriminative classifiers advances in neural information processing systems pages andrew gordon wilson and ryan prescott adams gaussian process kernels for pattern discovery and extrapolation international conference on machine learning icml douglas carroll functional learning the learning of continuous functional mappings relating stimulus and response continua ets research bulletin series kyunghee koh and david meyer function learning induction of continuous relations journal of experimental psychology learning memory and cognition edward delosh jerome busemeyer and mark mcdaniel extrapolation the sine qua non for abstraction in function learning journal of experimental psychology learning memory and cognition jerome busemeyer eunhee byun edward delosh and mark mcdaniel learning functional relations based on experience with pairs by humans and artificial neural networks concepts and categories thomas griffiths chris lucas joseph williams and michael kalish modeling human function learning with gaussian processes in neural information processing systems christopher lucas thomas griffiths joseph williams and michael kalish rational model of function learning psychonomic bulletin review pages 
Christopher Lucas, Douglas Sterling, and Charles Kemp. Superspace extrapolation reveals inductive biases in function learning. In Cognitive Science Society.
Mark McDaniel and Jerome Busemeyer. The conceptual basis of function learning and extrapolation: comparison of and models. Psychonomic Bulletin & Review.
Michael Kalish, Thomas Griffiths, and Stephan Lewandowsky. Iterated learning: intergenerational knowledge transmission reveals inductive biases. Psychonomic Bulletin & Review.
Daniel Little and Richard Shiffrin. Simplicity bias in the estimation of causal functions. In Proceedings of the Annual Conference of the Cognitive Science Society.
Samuel G. B. Johnson, Andy Jin, and Frank Keil. Simplicity and in explanation: the case of intuitive. In Proceedings of the Annual Conference of the Cognitive Science Society.
Samuel Gershman, Edward Vul, and Joshua Tenenbaum. Multistability and perceptual inference. Neural Computation.
Thomas Griffiths, Edward Vul, and Adam Sanborn. Bridging levels of analysis for probabilistic models of cognition. Current Directions in Psychological Science.
Edward Vul, Noah Goodman, Thomas Griffiths, and Joshua Tenenbaum. One and done? Optimal decisions from very few samples. Cognitive Science.
Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai, and John Cunningham. Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems.
David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
Carl Edward Rasmussen and Zoubin Ghahramani. Occam's razor. In Neural Information Processing Systems (NIPS).
Andrew Gordon Wilson. Process over all stationary kernels. Technical report, University of Cambridge.
video prediction using deep networks in atari games junhyuk oh xiaoxiao guo honglak lee richard lewis satinder singh university of michigan ann arbor mi usa junhyuk guoxiao honglak rickl baveja abstract motivated by reinforcement learning rl problems in particular atari games from the recent benchmark aracade learning environment ale we consider prediction problems where future depend on control variables or actions as well as previous frames while not composed of natural scenes frames in atari games are in size can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly can involve entry and departure of objects and can involve deep partial observability we propose and evaluate two deep neural network architectures that consist of encoding actionconditional transformation and decoding layers based on convolutional neural networks and recurrent neural networks experimental results show that the proposed architectures are able to generate frames that are also useful for control over approximately futures in some games to the best of our knowledge this paper is the first to make and evaluate predictions on video conditioned by control inputs introduction over the years deep learning approaches see for survey have shown great success in many visual perception problems however modeling videos building generative model is still very challenging problem because it often involves data with complex temporal dynamics thus recent studies have mostly focused on modeling simple video data such as bouncing balls or small patches where the next frame is given the previous frames in many applications however future frames depend not only on previous frames but also on control or action variables for example the in vehicle is affected by and acceleration the camera observation of robot is similarly dependent on its movement and changes of its camera angle more generally in reinforcement learning rl problems learning to predict future images conditioned on actions amounts to learning model of the dynamics of the interaction an essential component of approaches to rl in this paper we focus on atari games from the arcade learning environment ale as source of challenging video modeling problems while not composed of natural scenes frames in atari games are can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly can involve entry and departure of objects and can involve deep partial observability to the best of our knowledge this paper is the first to make and evaluate predictions on images conditioned by control inputs this paper proposes evaluates and contrasts two prediction architectures based on deep networks that incorporate action variables see figure our experimental results show that our architectures are able to generate realistic frames over future frames without diverging in some atari games we show that the representations learned by our architectures approximately capture natural similarity among actions and discover which objects are directly controlled by the agent actions and which are only indirectly influenced or not controlled we evaluated the usefulness of our architectures for control in two ways by replacing emulator frames with predicted frames in controller dqn deepmind state action encoding action transformation decoding encoding feedforward encoding transformation decoding recurrent encoding figure proposed 
network architectures of the art for atari games and by using the predicted frames to drive more informed than random exploration strategy to improve controller also dqn related work video prediction using deep networks the problem of video prediction has led to variety of architectures in deep learning recurrent temporal restricted boltzmann machine rtrbm was proposed to learn temporal correlations from sequential data by introducing recurrent connections in rbm structured rtrbm srtrbm scaled up rtrbm by learning dependency structures between observations and hidden variables from data more recently michalski et al proposed gated autoencoder that defines multiplicative interactions between consecutive frames and mapping units and showed that temporal prediction problem can be viewed as learning and inferring interactions between consecutive images srivastava et al applied learning framework to video domain and showed that long memory lstm networks are capable of generating video of bouncing handwritten digits in contrast to these previous studies this paper tackles problems where control variables affect temporal dynamics and in addition scales up prediction to images ale combining deep learning and rl atari games provide challenging environments for rl because of visual observations partial observability and delayed rewards approaches that combine deep learning and rl have made significant advances specifically dqn combined with convolutional neural network cnn and achieved performance on many atari games guo et al used the for making predictions with slow uct tree search method to generate training data for cnn which outperformed dqn on several domains throughout this paper we will use dqn to refer to the architecture used in more recent work used deeper cnn with more data to produce the currently atari game players predictive model for rl the idea of building predictive model for rl problems was introduced by schmidhuber and huber they proposed neural network that predicts the attention region given the previous frame and an action more recently lenz et al proposed recurrent neural network with multiplicative interactions that predicts the physical coordinate of robot compared to this previous work our work is evaluated on much data with complex dependencies among observations there have been few attempts to learn from ale data that makes predictions of future frames one line of work divides game images into patches and applies bayesian framework to predict observations however this approach assumes that neighboring patches are enough to predict the center patch which is not true in atari games because of many complex interactions the evaluation in this prior work is prediction loss in contrast here we make and evaluate predictions both for quality of pixels generated and for usefulness to control proposed architectures and training method the goal of our architectures is to learn function at where xt and at are the frame and action variables at time and are the frames from time to time figure shows our two architectures that are each composed of encoding layers that extract features from the input frames transformation layers that transform the encoded features into prediction of the next frame in feature space by introducing action variables as additional input and finally decoding layers that map the predicted features into pixels our contributions are in the novel deep convolutional architectures for prediction as well as in the novel use of the architectures in visionbased rl 
domains two variants feedforward encoding and recurrent encoding feedforward encoding takes fixed history of previous frames as an input which is concatenated through channels figure and stacked convolution layers extract features directly from the concatenated frames the encoded feature vector henc rn at time is henc cnn where denotes frames of pixel images with color channels cnn is mapping from raw pixels to feature vector using multiple convolution layers and layer at the end each of which is followed by this encoding can be viewed as other types of fusions or convolution can also be applied to this architecture recurrent encoding takes one frame as an input for each and extracts features using an rnn in which the temporal dynamics is modeled by the recurrent layer on top of the feature vector extracted by convolution layers figure in this paper lstm without peephole connection is used for the recurrent layer as follows enc henc ct lstm cnn xt where ct rn is memory cell that retains information from deep history of inputs intuitively cnn xt is given as input to the lstm so that the lstm captures temporal correlations from spatial features multiplicative transformation we use multiplicative interactions between the encoded feature vector and the control variables hdec wijl henc at bi henc where is an encoded feature hdec rn is an feature at ra is the at time is tensor weight and rn is bias when the action is represented using vector using tensor is equivalent to using different weight matrices for each action this enables the architecture to model different transformations for different actions the advantages of multiplicative interactions have been explored in image and text processing in practice the tensor is not scalable because of its large number of parameters thus we approximate the tensor by factorizing into three matrices as follows hdec wdec wenc henc wa at dec enc where and is the number of factors unlike the tensor the above factorization shares the weights between different actions by mapping them to the factors this sharing may be desirable relative to the tensor when there are common temporal dynamics in the data across different actions discussed further in convolutional decoding it has been recently shown that cnn is capable of generating an image effectively using upsampling followed by convolution with stride of similarly we use the inverse operation of convolution called deconvolution which maps spatial region of the input to using deconvolution kernels the effect of upsampling can be achieved without explicitly upsampling the feature map by using stride of we found that this operation is more efficient than upsampling followed by convolution because of the smaller number of convolutions with larger stride in the proposed architecture the transformed feature vector hdec is decoded into pixels as follows deconv reshape hdec where reshape is layer where hidden units form feature map and deconv consists of multiple deconvolution layers each of which is followed by except for the last deconvolution layer curriculum learning with prediction it is almost inevitable for predictive model to make noisy predictions of images when the model is trained on prediction objective small prediction errors can compound through time to alleviate this effect we use prediction objective more specifically given the training data xti ati the model is trained to minimize the average squared error over predictions as follows lk xxx where is future prediction intuitively the network is 
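The factored transformation just written out is a small piece of linear algebra on its own, so a standalone sketch is given below: the encoded features and the (one-hot) action are each mapped into a shared factor space, multiplied elementwise, and mapped back, with the factor weights shared across actions. The sizes used in the usage example are illustrative assumptions, and in the real model these layers sit inside the convolutional encoder/decoder and are learned by backpropagation rather than fixed at random.

```python
import numpy as np

class FactoredActionTransform:
    """Action-conditional transformation  h_dec = W_dec (W_enc h_enc  *  W_a a_t) + b,
    where * is the elementwise product.  Factorizing the 3-way tensor into three matrices
    shares the factor weights across actions."""
    def __init__(self, n_hidden, n_actions, n_factors, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = 0.1 * rng.standard_normal((n_factors, n_hidden))
        self.W_a   = 0.1 * rng.standard_normal((n_factors, n_actions))
        self.W_dec = 0.1 * rng.standard_normal((n_hidden, n_factors))
        self.b     = np.zeros(n_hidden)

    def __call__(self, h_enc, a_onehot):
        factors = (self.W_enc @ h_enc) * (self.W_a @ a_onehot)
        return self.W_dec @ factors + self.b

# Usage with illustrative sizes: a 1024-d encoding, 18 Atari actions, 1024 factors.
transform = FactoredActionTransform(n_hidden=1024, n_actions=18, n_factors=1024)
h_dec = transform(np.random.randn(1024), np.eye(18)[3])
```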
repeatedly unrolled through time steps by using its prediction as an input for the next the model is trained in multiple phases based on increasing as suggested by michalski et al in other words the model is trained to predict future frames and to predict future frames after the previous phase converges we found that this curriculum learning approach is necessary to stabilize the training stochastic gradient descent with backpropagation through time bptt is used to optimize the parameters of the network experiments in the experiments that follow we have the following goals for our two architectures to evaluate the predicted frames in two ways qualitatively evaluating the generated video and quantitatively evaluating the squared error to evaluate the usefulness of predicted frames for control in two ways by replacing the emulator frames with predicted frames for use by dqn and by using the predictions to improve exploration in dqn and to analyze the representations learned by our architectures we begin by describing the details of the data and model architecture and baselines data and preprocessing we used our replication of dqn to generate video datasets using an policy with dqn is forced to choose random action with probability for each game the dataset consists of about training frames and test frames with actions chosen by dqn following dqn actions are chosen once every frames which reduces the video from to the number of actions available in games varies from to and they are represented as vectors we used rgb images and preprocessed the images by subtracting mean pixel values and dividing each pixel value by network architecture across all game domains we use the same network architecture as follows the encoding layers consist of convolution layers and one layer with hidden units the convolution layers use and filters with stride of every layer is followed by rectified linear function in the recurrent encoding network an lstm layer with hidden units is added on top of the layer the number of factors in the transformation layer is the decoding layers consists of one layer with hidden units followed by deconvolution layers the deconvolution layers use and filters with stride of for the feedforward encoding network the last frames are given as an input for each the recurrent encoding network takes one frame for each but it is unrolled through the last frames to initialize the lstm hidden units before making prediction our implementation is based on caffe toolbox details of training we use the curriculum learning scheme above with three phases of increasing prediction step objectives of and steps and learning rates of and respectively rmsprop is used with momentum of squared gradient momentum of and min squared gradient of the batch size for each training phase is and for the feedforward encoding network and and for the recurrent encoding network respectively when the recurrent encoding network is trained on prediction objective the network is unrolled through steps and predicts the last frames by taking images as input gradients are clipped at before of each gate of lstm as suggested by two baselines for comparison the first baseline is perceptron mlp that takes the last frame as input and has hidden layers with and units the action input is concatenated to the second hidden layer this baseline uses approximately the same number of parameters as the recurrent encoding model the second baseline feedforward or naff is the same as the feedforward encoding model figure except that the 
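A schematic version of the multi-step objective and curriculum just described is given below: the model's own predictions are fed back in as inputs for k steps and the squared errors are averaged, and training then proceeds in phases with increasing k. The predict callable, frame layout, and indexing conventions are assumptions for illustration.

```python
import numpy as np

def k_step_loss(predict, frames, actions, k, history=4):
    """Average squared error of k-step rollouts that feed predictions back in as inputs.
    predict(last_frames, action) -> next_frame;  frames: (T, H, W, C);  actions: (T,)."""
    losses = []
    for t in range(history, len(frames) - k + 1):
        window = list(frames[t - history:t])                      # seed with true frames
        for step in range(k):
            pred = predict(np.stack(window[-history:]), actions[t + step - 1])
            losses.append(np.mean((pred - frames[t + step]) ** 2))
            window.append(pred)                                   # prediction becomes the next input
    return float(np.mean(losses))

# Curriculum: train to convergence on short rollouts before moving to longer ones,
# e.g. for k in (1, 3, 5): minimise k_step_loss(...) with SGD and backprop through time.
```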
transformation layer consists of one layer that does not get the action as input step mlp naff feedforward recurrent ground truth action figure example of predictions over steps in freeway the step and action columns show the number of prediction steps and the actions taken respectively the white boxes indicate the object controlled by the agent from prediction step to the controlled object crosses the top boundary and reappears at the bottom this shift is predicted by our architectures and is not predicted by mlp and naff the horizontal movements of the uncontrolled objects are predicted by our architectures and naff but not by mlp mlp naff feedforward recurrent seaquest space invaders freeway qbert ms pacman figure mean squared error over predictions evaluation of predicted frames qualitative evaluation prediction video the prediction videos of our models and baselines are available in the supplementary material and at the following website https as seen in the videos the proposed models make qualitatively reasonable predictions over steps depending on the game in all games the mlp baseline quickly diverges and the naff baseline fails to predict the controlled object an example of predictions is illustrated in figure we observed that both of our models predict complex local translations well such as the movement of vehicles and the controlled object they can predict interactions between objects such as collision of two objects since our architectures effectively extract hierarchical features using cnn they are able to make prediction that requires global context for example in figure the model predicts the sudden change of the location of the controlled object from the top to the bottom at however both of our models have difficulty in accurately predicting small objects such as bullets in space invaders the reason is that the squared error signal is small when the model fails to predict small objects during training another difficulty is in handling stochasticity in seaquest new objects appear from the left side or right side randomly and so are hard to predict although our models do generate new objects with reasonable shapes and movements after appearing they move as in the true frames the generated frames do not necessarily match the quantitative evaluation squared prediction error mean squared error over predictions is reported in figure our predictive models outperform the two baselines for all domains however the gap between our predictive models and naff baseline is not large except for seaquest this is due to the fact that the object controlled by the action occupies only small part of the image feedforward feedforward recurrent recurrent true true ms pacman cropped space invaders cropped figure comparison between two encoding models feedforward and recurrent controlled object is moving along horizontal corridor as the recurrent encoding model makes small translation error at frame the true position of the object is in the crossroad while the predicted position is still in the corridor the true object then moves upward which is not possible in the predicted position and so the predicted object keeps moving right this is less likely to happen in feedforward encoding because its position prediction is more accurate the objects move down after staying at the same location for the first five steps the feedforward encoding model fails to predict this movement because it only gets the last four frames as input while the recurrent model predicts this downwards movement more correctly 
emulator rand mlp naff feedforward recurrent seaquest space invaders freeway qbert ms pacman figure game play performance using the predictive model as an emulator emulator and rand correspond to the performance of dqn with true frames and random play respectively the is the number of steps of prediction before the is the average game score measured from plays qualitative analysis of relative strengths and weaknesses of feedforward and recurrent encoding we hypothesize that feedforward encoding can model more precise spatial transformations because its convolutional filters can learn temporal correlations directly from pixels in the concatenated frames in contrast convolutional filters in recurrent encoding can learn only spatial features from the input and the temporal context has to be captured by the recurrent layer on top of the cnn features without localized information on the other hand recurrent encoding is potentially better for modeling arbitrarily dependencies whereas feedforward encoding is not suitable for dependencies because it requires more memory and parameters as more frames are concatenated into the input as evidence in figure we show case where feedforward encoding is better at predicting the precise movement of the controlled object while recurrent encoding makes pixel translation error this small error leads to entirely different predicted frames after few steps since the feedforward and recurrent architectures are identical except for the encoding part we conjecture that this result is due to the failure of precise encoding in recurrent encoding on the other hand recurrent encoding is better at predicting when the enemies move in space invaders figure this is due to the fact that the enemies move after steps which is hard for feedforward encoding to predict because it takes only the last four frames as input we observed similar results showing that feedforward encoding can not handle dependencies in other games evaluating the usefulness of predictions for control replacing real frames with predicted frames as input to dqn to evaluate how useful the predictions are for playing the games we implement an evaluation method that uses the predictive model to replace the game emulator more specifically dqn controller that takes the last four frames is first using real frames and then used to play the games based on policy where the input frames are generated by our predictive model instead of the game emulator to evaluate how the depth of predictions influence the quality of control we the predictions using the true last frames after every of prediction for note that the dqn controller never takes true frame just the outputs of our predictive models the results are shown in figure unsurprisingly replacing real frames with predicted frames reduces the score however in all the games using the model to repeatedly predict only few time table average game score of dqn over plays with standard error the first row and the second row show the performance of our dqn replication with different exploration strategies model seaquest dqn random exploration dqn informed exploration invaders freeway qbert ms pacman informed exploration figure comparison between two exploration methods on ms pacman each heat map shows the trajectories of the controlled object measured over steps for the corresponding method random exploration figure cosine similarity between every pair of action factors see text for details steps yields score very close to that of using real frames our two architectures 
produce much better scores than the two baselines for deep predictions than would be suggested based on the much smaller differences in squared error the likely cause of this is that our models are better able to predict the movement of the controlled object relative to the baselines even though such an ability may not always lead to better squared error in three out of the five games the score remains much better than the score of random play even when using steps of prediction improving dqn via informed exploration to learn control in an rl domain exploration of actions and states is necessary because without it the agent can get stuck in bad policy in dqn the agent was trained using an policy in which the agent chooses either greedy action or random action by flipping coin with probability of such random exploration is basic strategy that produces sufficient exploration but can be slower than more informed exploration strategies thus we propose an informed exploration strategy that follows the policy but chooses exploratory actions that lead to frame that has been visited least often in the last time steps rather than random actions implementing this strategy requires predictive model because the next frame for each possible action has to be considered the method works as follows the most recent frames are stored in trajectory memory denoted the predictive model is used to get the next frame for every action we estimate the for every predicted frame by summing the similarity between the predicted frame and the most recent frames stored in the trajectory memory using gaussian kernel as follows nd exp min max xj yj where is threshold and is kernel bandwidth the trajectory memory size is for qbert and for the other games for freeway and for the others and for all games for computational efficiency we trained new feedforward encoding network on images as they are used as input for dqn the details of the network architecture are provided in the supplementary material table summarizes the results the informed exploration improves dqn performance using our predictive model in three of five games with the most significant improvement in qbert figure shows how the informed exploration strategy improves the initial experience of dqn analysis of learned representations similarity among action representations in the factored multiplicative interactions every action is linearly transformed to factors wa in equation in figure we present the cosine similarity between every pair of after training in seaquest and corresponds to and fire arrows correspond to movements with black or without white fire there are positive correlations between actions that have the same movement directions up and and negative correlations between actions that have opposing directions these results are reasonable and discovered automatically in learning good predictions distinguishing controlled and uncontrolled objects is itself hard and interesting problem bellemare et al proposed framework to learn contingent regions of an image affected by agent action suggesting that contingency awareness is useful for agents we show that our architectures implicitly learn contingent regions as they learn to predict the entire image in our architectures factor fi wi with higher variance measured over all possible actions var fi ea fi ea fi is more likely to transform an image differently depending on actions and so we assume such factors are responsible for transforming the parts of the prev frame next frame prediction image related to 
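The informed-exploration rule described above can be sketched in a few lines: on an exploratory step, the predictive model produces a candidate next frame for every action, each candidate is scored by its Gaussian-kernel similarity to the frames in the trajectory memory, and the least-visited candidate's action is chosen. The per-pixel thresholding and clipping inside the paper's kernel, and all names below, are simplifications and assumptions.

```python
import numpy as np

def visit_estimate(pred_frame, memory, bandwidth=1.0):
    # Gaussian-kernel similarity between a predicted next frame and the frames held in
    # the trajectory memory (the thresholding/clipping of pixel differences is omitted).
    d2 = np.array([np.mean((pred_frame - m) ** 2) for m in memory])
    return float(np.exp(-d2 / bandwidth).sum())

def informed_exploratory_action(predict, last_frames, actions, memory):
    # Predict the next frame for every action and pick the action whose predicted
    # frame has the smallest estimated visit count.
    scores = [visit_estimate(predict(last_frames, a), memory) for a in actions]
    return actions[int(np.argmin(scores))]
```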
actions we therefore collected the high variance referred to as highvar factors from the model trained on seaquest around of factors and collected the remaining factors into low variance lowvar subset given an image and an action we did two controlled forward propagations giving only highvar factors by setting the other factors to zeros and vice versa the results are visualized as action and in figure interestingly given only action the model action predicts sharply the movement of the object controlled by figure distinguishing controlled and actions while the other parts are mean pixel values in uncontrolled objects action image shows trast given only the model prediction given only learned actionpredicts the movement of the other objects and the factors with high variance ground oxygen and the controlled object stays at its image given only factors previous location this result implies that our model learns to distinguish between controlled objects and uncontrolled objects and transform them using disentangled representations see for related work on disentangling factors of variation conclusion this paper introduced two different novel deep architectures that predict future frames that are dependent on actions and showed qualitatively and quantitatively that they are able to predict visuallyrealistic and frames over futures on several atari game domains to our knowledge this is the first paper to show good deep predictions in atari games since our architectures were domain independent we expect that they will generalize to many rl problems in future work we will learn models that predict future reward in addition to predicting future frames and evaluate the performance of our architectures in rl acknowledgments this work was supported by nsf grant bosch research and onr grant any opinions findings conclusions or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors references bellemare naddaf veness and bowling the arcade learning environment an evaluation platform for general agents journal of artificial intelligence research bellemare veness and bowling investigating contingency awareness using atari games in aaai bellemare veness and bowling bayesian learning of recursively factored environments in icml bellemare veness and talvitie skip context tree switching in icml bengio learning deep architectures for ai foundations and trends in machine learning bengio louradour collobert and weston curriculum learning in icml ciresan meier and schmidhuber deep neural networks for image classification in cvpr dosovitskiy springenberg and brox learning to generate chairs with convolutional neural networks in cvpr girshick donahue darrell and malik rich feature hierarchies for accurate object detection and semantic segmentation in cvpr graves generating sequences with recurrent neural networks arxiv preprint guo singh lee lewis and wang deep learning for atari game play using offline tree search planning in nips hochreiter and schmidhuber long memory neural computation jia shelhamer donahue karayev long girshick guadarrama and darrell caffe convolutional architecture for fast feature embedding in acm multimedia karpathy toderici shetty leung sukthankar and video classification with convolutional neural networks in cvpr kocsis and bandit based planning in ecml krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips lenz knepper and saxena deepmpc learning deep latent features for model 
predictive control. In RSS.
Memisevic. Learning to relate images. IEEE TPAMI.
Michalski, Memisevic, and Konda. Modeling deep temporal dependencies with recurrent grammar cells. In NIPS.
Mittelman, Kuipers, Savarese, and Lee. Structured recurrent temporal restricted Boltzmann machines. In ICML.
Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra, and Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint.
Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski, et al. Control through deep reinforcement learning. Nature.
Nair and Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML.
Reed, Sohn, Zhang, and Lee. Learning to disentangle factors of variation with manifold interaction. In ICML.
Rifai, Bengio, Courville, Vincent, and Mirza. Disentangling factors of variation for facial expression recognition. In ECCV.
Schmidhuber. Deep learning in neural networks: an overview. Neural Networks.
Schmidhuber and Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems.
Srivastava, Mansimov, and Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML.
Sutskever, Hinton, and Taylor. The recurrent temporal restricted Boltzmann machine. In NIPS.
Sutskever, Martens, and Hinton. Generating text with recurrent neural networks. In ICML.
Sutskever, Vinyals, and Le. Sequence to sequence learning with neural networks. In NIPS.
Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich. Going deeper with convolutions. arXiv preprint.
Taylor and Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In ICML.
Tieleman and Hinton. Lecture: RMSProp, divide the gradient by a running average of its recent magnitude. Coursera.
Tran, Bourdev, Fergus, Torresani, and Paluri. Learning spatiotemporal features with convolutional networks. In ICCV.
Watkins and Dayan. Machine Learning.
Yang, Reed, Yang, and Lee. Disentangling with recurrent transformations for view synthesis. In NIPS.
iteration for optimal recovery in noisy ica james voss the ohio state university vossj mikhail belkin the ohio state university mbelkin luis rademacher the ohio state university lrademac abstract independent component analysis ica is popular model for blind signal separation the ica model assumes that number of independent source signals are linearly mixed to form the observed signals we propose new algorithm pegi for gradient iteration for provable model recovery for ica with gaussian noise the main technical innovation of the algorithm is to use fixed point iteration in indefinite inner product space the use of this indefinite inner product resolves technical issues common to several existing algorithms for noisy ica this leads to an algorithm which is conceptually simple efficient and accurate in testing our second contribution is combining pegi with the analysis of objectives for optimal recovery in the noisy ica model it has been observed that the direct approach of demixing with the inverse of the mixing matrix is suboptimal for signal recovery in terms of the natural signal to interference plus noise ratio sinr criterion there have been several partial solutions proposed in the ica literature it turns out that any solution to the mixing matrix reconstruction problem can be used to construct an ica demixing despite the fact that sinr itself can not be computed from data that allows us to obtain practical and provably recovery method for ica with arbitrary gaussian noise introduction independent component analysis refers to class of methods aiming at recovering statistically independent signals by observing their unknown linear combination there is an extensive literature on this and number of related problems in the ica model we observe realizations of latent variable model sk ak as where ak denotes the th column of the mixing matrix and sm is the unseen latent random vector of signals it is assumed that sm are independent and the source signals and entries of may be either or for simplicity we will assume throughout that has zero mean as this may be achieved in practice by centering the observed data many ica algorithms use the preprocessing whitening step whose goal is to orthogonalize the independent components in the noiseless case this is commonly done by computing the square root of the covariance matrix of consider now the noisy ica model as with additive noise independent of it turns out that the introduction of noise makes accurate recovery of the signals significantly more involved specifically whitening using the covariance matrix does not work in the noisy ica model as the covariance matrix combines both signal and noise for the case when the noise is gaussian matrices constructed from higher order statistics specifically cumulants can be used instead of the covariance matrix however these matrices are not in general positive definite and thus the square root can not always be extracted this limits the applicability of several previous methods such as the algorithm proposed in addresses this issue by using complicated step followed by an iterative method in this paper section we develop simple and practical algorithm pegi for pseudoeuclidean gradient iteration for provably recovering up to the unavoidable ambiguities of the model in the case when the noise is gaussian with an arbitrary unknown covariance matrix the main technical innovation of our approach is to formulate the recovery problem as fixed point method in an indefinite inner product space the second 
contribution of the paper is combining pegi with the analysis of objectives for optimal recovery in the noisy ica model in most applications of ica speech separation artifact removal and others one cares about recovering the signals this is known as the source recovery problem this is typically done by first recovering the matrix up to an appropriate scaling of the column directions at first source recovery and recovering the mixing matrix appear to be essentially equivalent in the noiseless ica model if in then recovers the sources on the other hand in the noisy model the exact recovery of the latent sources becomes impossible even if is known exactly part of the noise can be incorporated into the signal preserving the form of the model even worse neither nor are defined uniquely as there is an inherent ambiguity in the setting there could be many equivalent decompositions of the observed signal as see the discussion in section we consider recovered signals of the form bx for choice of demixing matrix signal recovery is considered optimal if the coordinates of maximize signal to interference plus noise ratio sinr within any fixed model as note that the value of sinr depends on the decomposition of the observed data into noise and signal surprisingly the sinr optimal demixing matrix does not depend on the decomposition of data into signal plus noise as such sinr optimal ica recovery is well defined given access to data despite the inherent ambiguity in the model further it will be seen that the sinr optimal demixing can be constructed from cov and the directions of the columns of which are also across decompositions our demixing approach combined with the pegi algorithm provides complete recovery algorithm in the ica model with arbitrary gaussian noise we note that the ica papers of which we are aware that discuss optimal demixing do not observe that sinr optimal demixing is invariant to the choice of decomposition instead they propose more limited strategies for improving the demixing quality within fixed ica model for instance joho et al show how demixing can be approximated with extra sensors when assuming white additive noise and and discuss how to achieve asymptotically low bias ica demixing assuming white noise within fixed ica model however the invariance of the demixing matrix appears in the array sensor systems literature finally in section we demonstrate experimentally that our proposed algorithm for ica outperforms existing practical algorithms at the task of noisy signal recovery including those specifically designed for beamforming when given sufficiently many samples moreover most existing practical algorithms for noisy source recovery have bias and can not recover the optimal demixing matrix even with infinite samples we also show that pegi requires significantly fewer samples than to perform ica accurately the indeterminacies of ica notation we use to denote the complex conjugate of matrix to denote its transpose to denote its conjugate transpose and to denote its pseudoinverse before proceeding with our results we discuss the somewhat subtle issue of indeterminacies in ica these ambiguities arise from the fact that the observed may have multiple decompositions into ica models as and can be replaced with pseudoinverse in the discussion below for ica ica has two natural indeterminacies for any nonzero constant the contribution of the th component ak sk to the model can equivalently be obtained by replacing ak with αak and sk with the rescaled signal sk to lessen this scaling 
indeterminacy we use the that cov throughout this paper as such each source sk or equivalently each ak is defined up to choice of sign unit modulus factor in the complex case in addition there is an ambiguity in the order of thep latent signals for any of where the ppermutation ica models sk ak and sπ aπ are indistinguishable in the noise free setting is said to be recovered if we recover each column of up to choice of sign or up to unit modulus factor in the complex case and an unknown permutation as the sources sm are only defined up to the same indeterminacies inverting the recovered matrix to obtain demixing matrix works for signal recovery in the noisy ica setting there is an additional indeterminacy in the definition of the sources consider gaussian random vector then the noisy ica model in which is considered part of the latent source signal and the model as aξ in which is part of the noise are indistinguishable in particular the latent source and its covariance are due to this extra indeterminacy the lengths of the columns of no longer have fully defined meaning even when we assume cov in the noisy setting is said to be recovered if we obtain the columns of up to scalar multiplicative factors and an arbitrary permutation the last indeterminacy is the most troubling as it suggests that the power of each source signal is itself in the noisy setting despite this indeterminacy it is possible to perform an demixing without additional assumptions about what portion of the signal is source and what portion is noise in section we will see that source recovery takes on simple form given any solution which recovers up to the inherent ambiguities of noisy ica then cov is an demixing matrix related work and contributions independent component analysis is probably the most used model for blind signal separation it has seen numerous applications and has generated vast literature including in the noisy and underdetermined settings we refer the reader to the books for broad overview of the subject it was observed early on by cardoso that ica algorithms based soley on higher order cumulant statistics are invariant to additive gaussian noise this observation has allowed the creation of many algorithms for recovering the ica mixing matrix in the noisy and often underdetermined settings despite the significant work on noisy ica algorithms they remain less efficient more specialized or less practical than the most popular noise free ica algorithms research on noisy ica can largely be split into several lines of work which we only highlight here some algorithms such as foobi and biome directly use the tensor structure of higher order cumulants in another line of work de lathauwer et al and yeredor have suggested algorithms which jointly diagonalize cumulant matrices in manner reminiscent of the jade algorithm in addition yeredor and goyal et al have proposed ica algorithms based on random directional derivatives of the second characteristic function each line of work has its advantages and disadvantages the joint diagonalization algorithms and the tensor based algorithms tend to be practical in the sense that they use redundant cumulant information in order to achieve more accurate results however they have higher memory complexity than popular noise free ica algorithms such as fastica while the tensor methods foobi and biome can be used when there are more sources than the dimensionality of the space the underdetermined ica setting they require all the latent source signals to have positive order 
cumulants predetermined fixed integer as they rely on taking matrix square root finally the methods based on random directional derivatives of the second characteristic function rely heavily upon randomness in manner not required by the most popular noise free ica algorithms we continue line of research started by arora et al and voss et al on fully determined noisy ica which addresses some of these practical issues by using deflationary approach reminiscent of fastica their algorithms thus have lower memory complexity and are more scalable to high dimensional data than the joint diagonalization and tensor methods however both works require alternatively one may place the scaling information in the signals by setting kak for each preprocessing step to orthogonalize the latent signals which is based on taking matrix square root arora et al require each latent signal to have positive fourth cumulant in order to carry out this preprocessing step in contrast voss et al are able to perform with source signals of mixed sign fourth cumulants but their quaseorthogonalization step is more complicated and can run into numerical issues under sampling error we demonstrate that is unnecessary we introduce the pegi algorithm to work within not necessarily positive definite inner product space instead experimentally this leads to improved demixing performance in addition we handle the case of complex signals finally another line of work attempts to perform source recovery in the noisy ica setting it was noted by and that for noisy ica traditional ica algorithms such as fastica and jade actually outperform algorithms which first recover in the noisy setting and then use the resulting approximation of to perform demixing it was further observed that is not the optimal demixing matrix for source recovery later and proposed an algorithm based on fastica which performs low beamforming gradient iteration ica in this section we introduce the pegi algorithm for recovering in the fully determined noisy ica setting where pegi relies on the idea of gradient iteration introduced voss et al however unlike voss et al pegi does not require the source signals to be orthogonalized as such pegi does not require the complicated preprocessing step of which can be inaccurate to compute in practice we sketch the gradient iteration algorithm in section and then introduce pegi in section for simplicity we limit this discussion to the case of signals mild variation of our pegi algorithm works for signals and its construction is provided in the supplementary material in this section we assume noisy ica model as such that is arbitrary gaussian and independent of we also assume that that is known and that the columns of are linearly independent gradient iteration with orthogonality the gradient iteration relies on the properties of cumulants we will focus on the fourth cumulant though similar constructions may be given using other even order cumulants of higher order for random variable the fourth order cumulant may be defined as see chapter section higher order cumulants have nice algebraic properties which make them useful for ica in particular has the following properties independence if and are independent then homogeneity if is scalar then αx vanishing gaussians if is normally distributed then we consider the following function defined on the unit sphere hx ui expanding using the above properties we obtain xm xm hak uisk hu ηi hak sk taking derivatives we obtain xm hf xm hak sk ak hak sk ak atk ad at where is diagonal matrix 
with entries kk sk we also note that and hf have natural sample estimates see voss et al introduced as fixed point algorithm under the assumption that the columns of are orthogonal but not necessarily unit vectors the main idea is that the update is form of generalized power iteration from equation each ak may be considered as direction in hidden orthogonal basis of the space during each iteration the ak coordinate of is raised to the power and multiplied by constant treating this iteration as fixed point update it was shown that given random starting point this iterative procedure converges rapidly to one of the columns of up to choice of sign the rate of convergence is cubic however the algorithm requires somewhat complicated preprocessing step called to linearly transform the data to make columns of orthogonal quasiorthogonalization makes use of evaluations of hessians of the fourth cumulant function to construct matrix of the form adat where has all positive diagonal task which is complicated by the possibility that the latent signals si may have fourth order cumulants of differing requires taking the matrix square root of positive definite matrix of this form however the algorithm used for constructing under sampling error is not always positive definite in practice which can make the preprocessing step fail we will show how our pegi algorithm makes unnecessary in particular resolving this issue gradient iteration in space we now show that the gradient iteration can be performed using in space in which the columns of are orthogonal the natural candidate for the inner product space would be to use defined as hu ut aat clearly hai aj δij gives the desired orthogonality property however there are two issues with this inner product space first it is only an inner product space when is invertible this turns out not to be major issue and we move forward largely ignoring this point the second issue is more fundamental we only have access to aat in the noise free setting where cov aat in the noisy setting we have access to matrices of the form hf ad at from equation instead product dealgorithm recovers column of up to we consider inner fined as follows let ada where is scaling factor if is generically chosen diagonal matrix with diagonal entries and inputs unit vector define by hu vic ut when tains negative entries this is not proper inner prodrepeat uct since is not positive definite in particular uk hak ak ic at adat ak may be negk kk ative nevertheless when hak aj ic until convergence up to sign atk adat aj gives that the columns of return uk are orthogonal in this space we definep functions αk rn by αk such that for any span am then αi ai is the expansion of in its ai basis continuing pn pn from equation for any we see hak sk ak hak sk ak is the gradient iteration recast in the space expanding in its ak basis we obtain xm xm αk hak ak ic sk ak αk kk sk ak which is power iteration in the unseen ak coordinate system as no assumptions are made upon the sk values the kk scalings which were not present in eq cause no issues using this update we obtain alg fixed point method for recovering single column of up to an unknown scaling before proceeding we should clarify the notion of fixed point convergence in algorithm we say that the sequence uk converges to up to sign if there exists sequence ck such that each ck and ck uk as we have the following convergence guarantee theorem if is chosen uniformly at random from then with probability there exists such that the sequence uk defined as in 
algorithm converges to up to sign further the rate of convergence is cubic due to limited space we omit the proof of theorem it is similar to the proof of theorem in practice we test near convergence by checking if we are still making significant progress in particular for some predefined if there exists sign value ck such that kuk ck then we declare convergence achieved and return the result as there are only two choices for ck this is easily checked and we exit the loop if this condition is met full ica recovery via the we are able to recover single column of up to its unknown scale however for full recovery of we would like given recovered columns to be able to recover column ak such that on demand the idea behind the simultaneous recovery of all columns of is first instead of just finding columns of using algorithm we simultaneously find rows of then using the recovered columns of and rows of we project onto the orthogonal complement of the recovered columns of within the product space recovering rows of suppose we have access to column ak which may be achieved using algorithm let denote the th row of then we note that ak adat ak kk dkk recovers up to an arbitrary unknown constant dkk however the constant dkk may be recovered by noting that hak ak ic ak ak dkk as such we may estimate as ak ak ak algorithm full ica matrix recovery algorithm enforcing orthogonality during the gi given access to vector pm returns two matrices is the recovered update where is the ing matrix for the noisy ica model as projection onto the orthogonal complements and is running estimate of of the range of some recovered columns and corresponding rows of we may zero out the components of corresponding to the recovered columns of letpr ting then αk ak in particular is orthogonal in the space to the previously recovered columns of this allows the gradient iteration algorithm to recover new column of inputs for to do draw uniformly at random from repeat until convergence up to sign aj aj aj end for return sinr optimal recovery in noisy ica using these ideas we obtain algorithm which is the pegi algorithm for recovery of the mixing matrix in noisy ica up to the inherent ambiguities of the problem within this algorithm step enforces orthogonality with previously found columns of guaranteeing that convergence to new column ofp practical construction of in our implementation we set hf ek as it can be pn shown from equation that hf ek ada with dkk kak sk this deterministically guarantees that each latent signal has significant contribution to in this section we demonstrate how to perform sinr optimal ica within the noisy ica framework given access to an algorithm such as pegi to recover the directions of the columns of to this end we first discuss the sinr optimal demixing solution within any decomposition of the ica model into signal and noise as as we then demonstrate that the sinr optimal demixing matrix is actually the same across all possible model decompositions and that it can be recovered the results in this section hold in greater generality than in section they hold even if the underdetermined setting and even if the additive noise is consider an demixing matrix and define bx the resulting approximation to it will also be convenient to estimate the source signal one coordinate at time given row vector we define bx if the th row of then is our estimate to the th latent signal sk within specific ica model as signal to ratio sinr is defined by the following equation var bak sk var bak sk sinrk var bas bak sk var 
bη var bax var bak sk sinrk is the variance of the contribution of th source divided by the variance of the noise and interference contributions within the signal given access to the mixing matrix we define bopt ah aah cov since cov aah cov it follows that bopt ah cov here cov may be estimated from data but due to the ambiguities of the noisy ica model and specifically its column norms can not be and observed that when is white gaussian noise bopt jointly maximizes sinrk for each sinrk takes on its maximal value at bopt we generalize this result in proposition below to include arbitrary potentially noise accuracy under additive gaussian noise bias under additive gaussian noise figure sinr performance comparison of ica algorithms it is interesting to note that even after the data is whitened cov the optimal sinr solution is different from the optimal solution in the noiseless case unless is an orthogonal matrix ah this is generally not the case even if is white gaussian noise proposition for each bopt is maximizer of sinrk the proof of proposition can be found in the supplementary material since sinr is scale invariant proposition implies that any matrix of the form dbopt dah cov where is diagonal scaling matrix with diagonal entries is an sinroptimal demixing matrix more formally we have the following result theorem let be an matrix containing the columns of up to scale and an arbitrary permutation then cov is maximizer of sinrk by theorem given access to matrix which recovers the directions of the columns of then cov is the demixing matrix for ica in the presence of gaussian noise the directions of the columns of are well defined simply from that is the directions of the columns of do not depend on the decomposition of into signal and noise see the discussion in section on ica indeterminacies the problem of sinr optimal demixing is thus well defined for ica in the presence of gaussian noise and the sinr optimal demixing matrix can be estimated from data without any additional assumptions on the magnitude of the noise in the data finally we note that in the case the source recovery simplifies to be corollary suppose that as is noise free possibly underdetermined ica model suppose that contains the columns of up to scale and permutation there exists diagonal matrix with entries and permutation matrix such that adπ then is an demixing matrix corollary is consistent with known beamforming results in particular it is known that is optimal in terms of minimum mean squared error for underdetermined ica section experimental results we compare the proposed pegi algorithm with existing ica algorithms in addition to with for preprocessing we use the following baselines jade is popular fourth cumulant based ica algorithm designed for the noise free setting we use the implementation of cardoso and souloumiac fastica is popular ica algorithm designed for the noise free setting based on deflationary approach of recovering one component at time we use the implementation of et al is variation of fastica with the tanh contrast function designed to have low bias for performing beamforming in the presence of gaussian noise ainv performs oracle demixing algorithm which uses as the demixing matrix performs oracle demixing using ah cov to achieve demixing we compare these algorithms on simulated data with we constructed mixing matrices with condition number via reverse singular value decomposition λv the matrices and were random orthogonal matrices and was chosen to have as its minimum and as its maximum 
singular values with the intermediate singular values chosen uniformly at random we drew data from noisy ica model as where cov was chosen to be malaligned with cov as aat we set aat where is constant defining the noise power maxv var vt it can be shown that max is the ratio of the maximum directional noise variance to var as the maximum directional signal variance we generated matrices for our experiments with corresponding ica data sets for each sample size and noise power when reporting results we apply each algorithm to each of the data sets for the corresponding sample size and noise power and we report the mean performance the source distributions used in our ica experiments were the laplace and bernoulli distribution with parameters and respectively the with and degrees of freedom respectively the exponential distribution and the uniform distribution each distribution was normalized to have unit variance and the distributions were each used twice to create data we compare the algorithms using either sinr or the sinr loss from the optimal demixing matrix defined by sinr loss optimal sinr achieved sinr in figure we compare our proprosed ica algorithm with various ica algorithms for signal recovery in the algorithm we use to estimate and then perform demixing using the resulting estimate of ah cov the formula for demixing it is apparent that when given sufficient samples provides the best sinr demixing jade and each have bias in the presence of additive gaussian noise which keeps them from being even when given many samples in figure we compare algorithms at various sample sizes the algorithm relies more heavily on accurate estimates of fourth order statistics than jade and the and algorithms do not require the estimation of fourth order statistics for this reason requires more samples than the other algorithms in order to be run accurately however once sufficient samples are taken outperforms the other algorithms including which is designed to have low sinr bias we also note that while not reported in order to avoid clutter the fastica performed very similarly to in our experiments figure accuracy comparison of pegi using product spaces and using in order to avoid clutter we did not include the sinr optimal demixing estimate constructed using to estimate in the figures and it is also assymptotically unbiased in estimating the directions of the columns of and similar conclusions could be drawn using in place of however in figure we see that requires fewer samples than to achieve good performance this is particularly highlighted in the medium sample regime on the performance of traditional ica algorithms for noisy ica an interesting observation first made in is that the popular noise free ica algorithms jade and fastica perform reasonably well in the noisy setting in figures and they significantly outperform demixing using for source recovery it turns out that this may be explained by shared preprocessing step both jade and fastica rely on whitening preprocessing step in which the data are linearly transformed to have identity covariance it can be shown in the noise free setting that after whitening the mixing matrix is rotation matrix these algorithms proceed by recovering an orthogonal matrix to approximate the true mixing matrix demixing is performed using since the data is white has identity covariance then the demixing matrix cov is an estimate of the demixing matrix nevertheless the traditional ica algorithms give biased estimate of under additive gaussian noise references albera 
comon and chevalier blind identification of overcomplete mixtures of sources biome linear algebra and its applications arora ge moitra and sachdeva provable ica with unknown gaussian noise with implications for gaussian mixtures and autoencoders in nips pages cardoso and souloumiac blind beamforming for signals in radar and signal processing iee proceedings volume pages iet cardoso decomposition of the cumulant tensor blind identification of more sources than sensors in icassp pages ieee cardoso and souloumiac matlab jade for data http online accessed chevalier optimal separation of independent sources concept and performance signal processing issn comon and jutten editors handbook of blind source separation academic press de lathauwer de moor and vandewalle independent component analysis based on statistics only in statistical signal and array processing ieee signal processing workshop on pages ieee de lathauwer castaing and cardoso blind identification of underdetermined mixtures signal processing ieee transactions on june issn doi hurri and matlab fastica http online accessed goyal vempala and xiao fourier pca and robust tensor decomposition in stoc pages and oja independent component analysis algorithms and applications neural networks karhunen and oja independent component analysis john wiley sons joho mathis and lambert overdetermined blind source separation using more sensors than source signals in noisy mixture in proc international conference on independent component analysis and blind signal separation helsinki finland pages and methods of fair comparison of performance of linear ica techniques in presence of additive noise in icassp pages and asymptotic analysis of bias of algorithms in presence of additive noise technical report technical report and blind instantaneous noisy mixture separation with best rejection in independent component analysis and signal separation pages springer makino lee and sawada blind speech separation springer van veen and buckley beamforming versatile approach to spatial filtering ieee assp magazine sarela jousmiki hamalainen and oja independent component approach to the analysis of eeg and meg recordings biomedical engineering ieee transactions on voss rademacher and belkin fast algorithms for gaussian noise invariant independent component analysis in advances in neural information processing systems pages yeredor blind source separation via the second characteristic function signal processing yeredor joint diagonalization in the sense with application in blind source separation signal processing ieee transactions on 
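As a compact recap of the demixing step advocated above, the sketch below forms the SINR-optimal demixing matrix B = Ã^H cov(X)^† from any estimate Ã of the mixing matrix that is correct up to column scaling and permutation. In the paper this estimate would come from PEGI; here it is simply taken as given, and the toy data generation is illustrative only.

```python
import numpy as np

def sinr_optimal_demixing(A_est, X):
    """B = A_est^H cov(X)^+ ; column scaling/permutation of A_est only rescales
    and permutes the rows of B, which leaves SINR unaffected."""
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.conj().T / X.shape[1]
    return A_est.conj().T @ np.linalg.pinv(cov)

# toy usage with a known mixing matrix and additive Gaussian noise (illustrative only)
rng = np.random.default_rng(0)
n, m, T = 6, 4, 20000
A = rng.normal(size=(n, m))
S = rng.laplace(size=(m, T))              # non-Gaussian latent sources
noise = rng.normal(scale=0.3, size=(n, T))
X = A @ S + noise
S_hat = sinr_optimal_demixing(A, X) @ X   # rows estimate the sources up to scale
```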
distributed submodular cover succinctly summarizing massive data baharan mirzasoleiman eth zurich amin karbasi yale university ashwinkumar badanidiyuru google andreas krause eth zurich abstract how can one find subset ideally as small as possible that well represents massive dataset its corresponding utility measured according to suitable utility function should be comparable to that of the whole dataset in this paper we formalize this challenge as submodular cover problem here the utility is assumed to exhibit submodularity natural diminishing returns condition prevalent in many data summarization applications the classical greedy algorithm is known to provide solutions with logarithmic approximation guarantees compared to the optimum solution however this sequential centralized approach is impractical for truly problems in this work we develop the first distributed algorithm is over for submodular set cover that is easily implementable using computations we theoretically analyze our approach and present approximation guarantees for the solutions returned by is over we also study natural between the communication cost and the number of rounds required to obtain such solution in our extensive experiments we demonstrate the effectiveness of our approach on several applications including active set selection exemplar based clustering and vertex cover on tens of millions of data points using spark introduction central challenge in machine learning is to extract useful information from massive data concretely we are often interested in selecting small subset of data points such that they maximize particular quality criterion for example in nonparametric learning we often seek to select small subset of points along with associated basis functions that well approximate the hypothesis space more abstractly in data summarization problems we often seek small subset of images news articles scientific papers that are representative an entire corpus in many such applications the utility function that measures the quality of the selected data points satisfies submodularity adding an element from the dataset helps more in the context of few selected elements than if we have already selected many elements our focus in this paper is to find succinct summary of the data subset ideally as small as possible which achieves desired large fraction of the utility provided by the full dataset hereby utility is measured according to an appropriate submodular function we formalize this problem as submodular cover problem and seek efficient algorithms for solving it in face of massive data the celebrated result of wolsey shows that greedy approach that selects elements sequentially in order to maximize the gain over the items selected so far yields logarithmic factor approximation it is also known that improving upon this approximation ratio is hard under natural complexity theoretic assumptions even though such greedy algorithm produces solutions it is impractical for massive datasets as sequential procedures that require centralized access to the full data are highly constrained in terms of speed and memory in this paper we develop the first distributed algorithm is over for solving the submodular cover problem it can be easily implemented in parallel computation models and provides solution that is competitive with the impractical centralized solution we also study natural between the communication cost for each round of mapreduce and the number of rounds the lets us choose between small communication cost 
between machines while having more rounds to perform or large communication cost with the benefit of running fewer rounds our experimental results demonstrate the effectiveness of our approach on variety of submodular cover instances vertex cover clustering and active set selection in learning we also implemented is over on spark and approximately solved vertex cover on social graph containing more than million nodes and billion edges background and related work recently submodular optimization has attracted lot of interest in machine learning and data mining where it has been applied to variety of problems including viral marketing information gathering and active learning to name few like convexity in continuous optimization submodularity allows many discrete problems to become efficiently approximable constrained submodular maximization in the submodular cover problem the main objective is to find the smallest subset of data points such that its utility reaches desirable fraction of the entire dataset as stated earlier the sequential centralized greedy method fails to appropriately scale once faced with massive data mapreduce and modern implementations like spark offer arguably one of the most successful programming models for reliable parallel computing distributed solutions for some special cases of the submodular cover problem have been recently proposed in particular for the set cover problem find the smallest subcollection of sets that covers all the data points berger et al provided the first distributed solution with an approximation guarantee similar to that of the greedy procedure blelloch et al improved their result in terms of the number of rounds required by implementation very recently stergiou et al introduced an efficient distributed algorithm for set cover instances of massive size another variant of the set cover problem that has received some attention is maximum cover as many elements as possible from the ground set by choosing at most subsets for which chierichetti et al introduced distributed solution with approximation guarantee going beyond the special case of coverage functions distributed constrained submodular maximization has also been the subject of recent research in the machine learning and data mining communities in particular mirzasoleiman et al provided simple distributed algorithm called ree for submodular maximization under cardinality constraints contemporarily kumar et al developed algorithm for submodular maximzation subject to cardinality and matroid constraints there have also been very recent efforts to either make use of randomization methods or treat data in streaming fashion to the best of our knowledge we are the first to address the general distributed submodular cover problem and propose an algorithm is over for approximately solving it the distributed submodular cover problem the goal of data summarization is to select small subset out of large dataset indexed by called the ground set such that achieves certain quality to this end we first need to define utility function that measures the quality of any subset quantifies how well represents according to some objective in many data summarization applications the utility function satisfies submodularity stating that the gain in utility of an element in context of summary decreases as grows formally is submodular if for any and note that the meaning of utility is application specific and submodular functions provide wide range of possibilities to define appropriate utility functions in 
section we discuss concrete instances of functions that we consider in our experiments let us denote the marginal utility of an element subset as the utility function is called monotone if for any and throughout this paper we assume that the utility function is monotone submodular the focus of this paper is on the submodular cover problem finding the smallest set ac such that it achieves utility for some more precisely ac arg such that we call the optimum centralized solution with size unfortunately finding ac is for many classes of submodular functions however simple greedy algorithm is known to be very effective this greedy algorithm starts with the empty set and at each iteration it chooses an element that maximizes ai arg let us denote this centralized greedy solution by ag when is integral it is known that the size of the solution returned by the greedy algorithm is at most maxe where is the harmonic number and is bounded by ln thus we have ln maxe and obtaining better solution is hard under natural complexity theoretic assumptions as it is standard practice for our theoretical analysis to hold we assume that is an integral monotone submodular function scaling up distributed computation in mapreduce in many data summarization applications where the ground set is large the sequential greedy algorithm is impractical either the data can not be stored on single computer or the centralized solution is too expensive in terms of computation time instead we seek an algorithm for solving the submodular cover problem in distributed manner preferably amenable to mapreduce implementations in this model at high level the data is first distributed to machines in cluster then each part is processed by the corresponding machine in parallel without communication and finally the outputs are either merged or used for the next round of mapreduce computation while in principle multiple rounds of computation can be realized in practice expensive synchronization is required after each round hence we are interested in distributed algorithms that require few rounds of computation naive approaches towards distributed submodular cover one way of solving the distributed submodular cover problem in multiple rounds is as follows in each round all machines in parallel compute the marginal gains for the data points assigned to them then they communicate their best candidate to central processor who then identifies the globally best element and sends it back to all the machines this element is then taken into account when selecting the next element with highest marginal gain and so on unfortunately this approach requires synchronization after each round and we have exactly many rounds in many applications and hence is quite large which renders this approach impractical for mapreduce style computations an alternative approach would be for each machine to select greedily enough elements from its partition vi until it reaches at least utility then all machines merge their solution this approach is much more communication efficient and can be easily implemented using single mapreduce round unfortunately many machines may select redundant elements and the merged solution may suffer from diminishing returns and never reach instead of aiming for one could aim for larger fraction but it is not clear how to select this target value in section we introduce our solution is over which requires few rounds of communication while at the same time yielding solution competitive with the centralized one before that let us briefly 
discuss the specific utility functions that we use in our experiments described in section example applications of the distributed submodular cover problem in this part we briefly discuss three concrete utility functions that have been extensively used in previous work for finding diverse subset of data points and ultimately leading to good data summaries truncated vertex cover let be graph with the vertex set and edge set let denote the neighbours of in the graph one way to measure the influence of set is to look at its cover it is easy to see that is monotone submodular function the truncated vertex cover is the problem of choosing small subset of nodes such that it covers desired fraction of active set selection in kernel machines in many application such as feature selections determinantal point processes and gp regression where the data is described in terms of kernel matrix we want to select small subset of elements while maintaining certain diversity very often the utility function boils down to log det αks where and ks is the principal of indexed by it is known that is monotone submodular clustering another natural application is to select small number of exemplars from the data representing the clusters present in it pnatural utility function see and is where is the loss function and is an appropriately chosen reference element the utility function is monotone submodular the goal of distributed submodular cover here is to select the smallest set of exemplars that satisfies specified bound on the loss the is over algorithm for distributed submodular cover on high level our main approach is to reduce the submodular cover to sequence of cardinality constrained submodular maximization problem for which good distributed algorithms ree are known concretely our reduction is based on combination of the following three ideas to get an intuition we will first assume that we have access to an optimum algorithm which can solve cardinality constrained submodular maximization exactly solve for some specified aoc arg max we will then consider how to solve the problem when instead of aoc we only have access to an approximation algorithm for cardinality constrained maximization lastly we will illustrate how we can parametrize our algorithm to the number of rounds of the distributed algorithm versus communication cost per round estimating size of the optimal solution momentarily assume that we have access to an optimum algorithm pt ard for computing aoc on the ground set then one simple way to solve the submodular cover problem would be to incrementally check for each if aoc but this is very inefficient since it will take rounds of running the distributed algorithm for computing aoc simple fix that we will follow is to instead start with and double it until we find an such that aoc this way we are guaranteed to find solution of size at most in at most rounds of running aoc the pseudocode is given in algorithm however in practice we can not run algorithm in particular there is no efficient way to identify the optimum subset aoc in set unless hence we need to rely on approximation algorithms handling approximation algorithms for submodular maximization assume that there is distributed algorithm is ard for cardinality constrained submodular maximization that runs on the dataset with machines and provides set agd with guarantee to the optimal solution aoc agd λf aoc let us assume that we could run is ard with the unknown value then the solution we get satisfies agd λq thus we are not guaranteed to get 
anymore now what we can do still under the assumption that we know is to repeatedly run is ard in order to augment our solution set until we get the desired value note that for each invocation of is ard to find set of size we have to take into account the solutions that we have accumulated so far so note that while reduction from submodular coverage to submodular maximization has been used the straightforward application to the distributed setting incurs large communication cost algorithm approximate submodular cover algorithm approximate pt ard input set constraint output set aoc pt ard while aoc do aoc pt ard input set of partitions constraint output set adc agd while agd do agd agd is ard if agd then adc agd else break return adc aoc return by overloading the notation is ard returns set of size given that has already been selected in previous rounds is ard computes the marginal gains note that at every invocation to is ard increases the value of the solution by at least therefore by running is ard at most dlog times we get unfortunately we do not know the optimum value so we can feed an estimate of the size of the optimum solution to is ard now again thanks to submodularity is ard can check whether this is good enough or not if the improvement in the value of the solution is not at least during the augmentation process we can infer that is too small estimate of and we can not get the desired value by using so we apply the doubling strategy again theorem let is ard be distributed algorithm for submodular maximization with approximation guarantee then algorithm where pt ard is replaced with approximate pt ard algorithm runs in at most dlog log rounds and produces solution of size at most log trading off communication cost and number of rounds while algorithm successfully finds distributed solution adc with adc the intermediate problem instances invocations of is ard are required to select sets of size up to twice the size of the optimal solution and these solutions are communicated between all machines oftentimes is quite large and we do not want to have such large communication cost per round now instead of finding an what we can do is to find smaller αk for and augment these smaller sets in each round of algorithm this way the communication cost reduces to an fraction per round while the improvement in the value of the solution is at least αλ agd consequently we can the communication cost per round with the total number of rounds as positive side effect for since in each invocation of is ard it returns smaller sets the final solution set size can potentially get closer to the optimum solution size for instance for the extreme case of we recover the solution of the sequential greedy algorithm up to we see this effect in our experimental results is over the is over algorithm is shown in algorithm the algorithm proceeds in rounds with communication between machines taking place only between successive rounds in particular is over takes the ground set the number of partitions and the parameter it starts with and adc it then augments the set adc with set agd of at most new elements using an arbitrary distributed algorithm for submodular maximization under cardinality constraint is ard if the gain from adding agd to adc is at least αλ agd then we continue augmenting agd with another set of at most elements otherwise we double and restart the process with we repeat this process until we get theorem let is ard be distributed algorithm for submodular maximization with approximation guarantee 
then is over runs in at most dlog αk log λα rounds and produces solution of size log algorithm is over input set of partitions constraint trade off parameter output set adc adc while adc do agd is ard adc if adc agd adc αλ adc then adc adc agd else return adc ree as subroutine so far we have assumed that distributed algorithm is ard that runs on machines is given to us as black box which can be used to find sets of cardinality and obtain of the optimal solution more concretely we can use ree recently proposed distributed algorithm for maximizing submodular functions under cardinality constraint outlined in algorithm it first distributes the ground set to machines then each machine separately runs the standard greedy algorithm to produce set agc of size finally the solutions are merged and another round of greedy selection is performed over the merged results in order to return the solution agd of size it was proven that ree provides min to the optimal solution here we prove tight improved bound on the performance of ree more formally we have the following theorem theorem let be monotone submodular function and let then ree produces ac solution agd where agd min algorithm greedy distributed submodular maximization ree input set of partitions constraint output set agd partition into sets vm gc run the standard greedy algorithm on each set vi find solution ai gc merge the resulting sets ai run the standard greedy algorithm on until elements are selected return agd we illustrate the resulting algorithm is over using ree as subroutine in figure by combining theorems and we will have the following corollary produces solution of size by using ree we get that is over log min αk and runs in at most dlog αk min αk log rounds note that for constant number of machines and large solution size αk the above result simply implies that in at most log kq rounds is over produces solution of size log in contrast the greedy solution with log rounds which is much larger than log kq produces solution of the same quality very recently guarantee was proven for the randomized version of ree this suggests that if it is possible to reshuffle randomly among the machines the ground set each time that we revoke ree we can benefit from these stronger approximation guarantees which are independent of and note that theorem does not directly apply here since it requires deterministic subroutine for constrained submodular maximization we defer the analysis to longer version of this paper as final technical remark for our theoretical results to hold we have assumed that the utility function is integral in some applications like active set selection this assumption may not hold in these cases either we can appropriately discretize and rescale the function or instead of achieving cover cluster nodes data greedi greedi figure illustration of our algorithm is over assuming it terminates in two rounds without doubling search for the utility try to reach for some in the latter case we can simply replace with in theorem experiments in our experiments we wish to address the following questions how well does is over perform compare to the centralized greedy solution how is the between the solution size and the number of rounds affected by parameter and how well does is over scale to massive data sets to this end we run is over on three scenarios exemplar based clustering active set selection in gps and vertex cover problem for vertex cover we report experiments on large social graph with more than millionp vertices and billion edges 
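To make the control flow of the doubling strategy concrete, here is a minimal Python sketch of the loop described above, with the distributed maximizer treated as a black box. The helper name `greedi`, its signature, and the exact form of the progress test are assumptions made for illustration (the test below checks that one augmentation closes at least an αλ fraction of the remaining gap to Q, which is how we read the pseudocode above), so this should be taken as a sketch rather than the authors' implementation.

```python
def discover(V, f, Q, greedi, alpha=1.0, lam=0.5):
    """Sketch of the doubling strategy.

    V      : ground set (any iterable of elements)
    f      : monotone submodular utility, f(frozenset) -> number
    Q      : target utility, assumed <= f(V)
    greedi : hypothetical black-box distributed maximizer; greedi(V, k, base)
             returns a set S with |S| <= k approximately maximizing the
             marginal gain of S given `base`
    alpha  : trades communication per round against number of rounds
    lam    : approximation factor guaranteed by the subroutine
    """
    A, ell = frozenset(), 1
    while f(A) < Q:
        k = max(1, int(alpha * ell))
        S = greedi(V, k, A)                               # one distributed round
        if f(A | S) - f(A) >= alpha * lam * (Q - f(A)):   # enough progress: keep this ell
            A = A | S
        else:                                             # ell underestimates the optimum size
            ell *= 2
    return A
```

Plugging a single-machine greedy routine in for `greedi` gives a slow but easy-to-test stand-in for the distributed subroutine.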
since the constant in theorem is not optimized we used min in all the experiments exemplar based clustering our exemplar based clustering experiments involve is over applied to the clustering utility described in section with kx we perform our experiments on a set of tiny images each by rgb pixel image is represented as a dimensional vector we subtract from each vector the mean value then normalize it to have unit norm we use the origin as the auxiliary exemplar for this experiment fig compares the performance of our approach to the centralized benchmark with the number of machines set to and varying coverage percentage here we have it can be seen that is over provides a solution which is very close to the centralized solution with a number of rounds much smaller than the solution size varying results in a tradeoff between solution size and number of rounds active set selection our active set selection experiments involve is over applied to the function described in section using an exponential kernel ei ej exp ej we use the parkinsons telemonitoring dataset comprised of biomedical voice measurements with attributes from people with parkinson disease fig compares the performance of our approach to the benchmark with the number of machines set to and varying coverage percentage again is over performs close to the centralized greedy solution even with very few rounds and again we see a tradeoff by varying large scale vertex cover with spark as our large scale experiment we applied is over to the friendster network which consists of nodes and edges the average outdegree is while the maximum is the disk footprint of the graph is stored in part files on hdfs our experimental infrastructure was a cluster of machines with of memory each running spark we set the number of reducers to each machine carried out a set of tasks in sequence where each stage corresponds to running ree with specific values of on the whole data set we first distributed the data uniformly at random to the machines where each machine received vertices ram then we start with and perform a task to extract one element we then communicate back the results to each machine and based on the improvement in the value of the solution we perform another round of calculation with either the same value for or we continue performing tasks until we get the desired value we examine the performance of is over by obtaining covers for and of the whole graph the total running time of the algorithm for the above coverage percentages with was about and hours respectively
figure performance of is over discover compared to the centralized greedy solution showing the solution set size and the number of rounds for various for the set of tiny images and the parkinsons telemonitoring dataset and the same quantities for the friendster network with vertices
for comparison we ran the centralized
constraints we find that by decreasing is over solutions quickly converge in size to those obtained by the centralized solution conclusion we have developed the first efficient distributed algorithm is over for the submodular cover problem we have theoretically analyzed its performance and showed that it can perform arbitrary close to the centralized albeit impractical in context of large data sets greedy solution we also demonstrated the effectiveness of our approach through extensive experiments including vertex cover on graph with million vertices using spark we believe our results provide an important step towards solving submodular optimization problems in very large scale real applications acknowledgments this research was supported by erc stg microsoft faculty fellowship and an eth fellowship references ryan gomes and andreas krause budgeted nonparametric learning from data streams in icml sebastian tschiatschek rishabh iyer haochen wei and jeff bilmes learning mixtures of submodular functions for image collection summarization in nips khalid gaurav veda dafna shahaf and carlos guestrin turning down the noise in the blogosphere in kdd khalid and carlos guestrin beyond keyword search discovering relevant scientific literature in kdd andreas krause and daniel golovin submodular function maximization in tractability practical approaches to hard problems cambridge university press laurence wolsey an analysis of the greedy algorithm for the submodular set covering problem combinatorica uriel feige threshold of ln for approximating set cover journal of the acm dean and ghemawat mapreduce simplified data processing on large clusters in osdi matei zaharia mosharaf chowdhury michael franklin scott shenker and ion stoica in spark cluster computing with working sets pages springer david kempe jon kleinberg and tardos maximizing the spread of influence through social network in proceedings of the ninth acm sigkdd andreas krause and carlos guestrin intelligent information gathering and submodular function optimization tutorial at the international joint conference in artificial intelligence daniel golovin and andreas krause adaptive submodularity theory and applications in active learning and stochastic optimization journal of artificial intelligence research bonnie berger john rompel and peter shor efficient nc algorithms for set cover with applications to learning and geometry journal of computer and system sciences guy blelloch richard peng and kanat tangwongsan greedy parallel approximate set cover and variants in spaa stergios stergiou and kostas tsioutsiouliklis set cover at web scale in sigkdd acm flavio chierichetti ravi kumar and andrew tomkins in in www baharan mirzasoleiman amin karbasi rik sarkar and andreas krause distributed submodular maximization identifying representative elements in massive data in nips ravi kumar benjamin moseley sergei vassilvitskii and andrea vattani fast greedy algorithms in mapreduce and streaming in spaa baharan mirzasoleiman ashwinkumar badanidiyuru amin karbasi jan vondrak and andreas krause lazier than lazy greedy in aaai ashwinkumar badanidiyuru baharan mirzasoleiman amin karbasi and andreas krause streaming submodular maximization massive data summarization on the fly in sigkdd acm silvio lattanzi benjamin moseley siddharth suri and sergei vassilvitskii filtering method for solving graph problems in mapreduce in spaa roberto battiti using mutual information for selecting features in supervised neural net learning neural networks ieee transactions on 
carl edward rasmussen and christopher williams gaussian processes for machine learning adaptive computation and machine learning
alex kulesza and ben taskar determinantal point processes for machine learning mach learn
rafael barbosa alina ene huy nguyen and justin ward the power of randomization distributed submodular maximization on massive datasets in arxiv
vahab mirrokni and morteza zadimoghaddam randomized composable for distributed submodular maximization in stoc
rishabh iyer and jeff bilmes submodular optimization with submodular cover and submodular knapsack constraints in nips pages
antonio torralba rob fergus and william freeman million tiny images large data set for nonparametric object and scene recognition tpami
athanasios tsanas max little patrick mcsharry and lorraine ramig enhanced classical dysphonia measures and sparse regression for telemonitoring of parkinson disease progression in icassp
jaewon yang and jure leskovec defining and evaluating network communities based on knowledge and information systems
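the centralized greedy routine that the conclusion above compares against is simple to state. as a point of reference, here is a minimal python sketch of classical greedy set cover (the special case of submodular cover discussed above), with a toy universe chosen purely for illustration; it is not the paper's distributed algorithm.

```python
def greedy_set_cover(universe, subsets):
    """classical greedy cover: repeatedly pick the subset covering the most
    still-uncovered elements; this is the ln(n)-approximate baseline."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(subsets)), key=lambda i: len(uncovered & subsets[i]))
        if not uncovered & subsets[best]:
            raise ValueError("the given subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

universe = range(1, 11)
subsets = [set(range(1, 6)), set(range(4, 9)), {7, 8, 9, 10}, {1, 10}]
print(greedy_set_cover(universe, subsets))   # [0, 2, 1]
```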
community detection via measure space embedding shie mannor the technion haifa israel shie mark kozdoba the technion haifa israel markk abstract we present new algorithm for community detection the algorithm uses random walks to embed the graph in space of measures after which modification of in that space is applied the algorithm is therefore fast and easily parallelizable we evaluate the algorithm on standard random graph benchmarks including some overlapping community benchmarks and find its performance to be better or at least as good as previously known algorithms we also prove linear time in number of edges guarantee for the algorithm on block model with where and pn log introduction community detection in graphs also known as graph clustering is problem where one wishes to identify subsets of the vertices of graph such that the connectivity inside the subset is in some way denser than the connectivity of the subset with the rest of the graph such subsets are referred to as communities and it often happens in applications that if two vertices belong to the same community they have similar qualities this in turn may allow for higher level analysis of the graph in terms of communities instead of individual nodes community detection finds applications in diversity of fields such as social networks analysis communication and traffic design in biological networks and generally in most fields where meaningful graphs can arise see for instance for survey in addition to direct applications to graphs community detection can for instance be also applied to general euclidean space clustering problems by transforming the metric to weighted graph structure see for survey community detection problems come in different flavours depending on whether the graph in question is simple or weighted directed another important distinction is whether the communities are allowed to overlap or not in the overlapping communities case each vertex can belong to several subsets difficulty with community detection is that the notion of community is not well defined different algorithms may employ different formal notions of community and can sometimes produce different results nevertheless there exist several widely adopted benchmarks synthetic models and graphs where the ground truth communities are known and algorithms are evaluated based on the similarity of the produced output to the ground truth and based on the amount of required computations on the theoretical side most of the effort is concentrated on developing algorithms with guaranteed recovery of clusters for graphs generated from variants of the stochastic block model referred to as sbm in what follows in this paper we present new algorithm der diffusion entropy reducer for reasons to be clarified later for community detection the algorithm is an adaptation of the algorithm to space of measures which are generated by short random walks from the nodes of the graph the adaptation is done by introducing certain natural cost on the space of the measures as detailed below we evaluate the der on several benchmarks and find its performance to be as good or better than the best alternative method in addition we establish some theoretical guarantees on its performance while the main purpose of the theoretical analysis in this paper is to provide some insight into why der works our result is also one of few results in the literature that show reconstruction in linear time on the empirical side we first evaluate our algorithm on set of random graph benchmarks 
known as the lfr models in other algorithms were evaluated on these benchmarks and three algorithms described in and were identified that exhibited significantly better performance than the others and similar performance among themselves we evaluate our algorithm on random graphs with the same parameters as those used in and find its performance to be as good as these three best methods several well known methods including spectral clustering exhaustive modularity optimization see for details and clique percolation have worse performance on the above benchmarks next while our algorithm is designed for communities we introduce simple modification that enables it to detect overlapping communities in some cases using this modification we compare the performance of our algorithm to the performance of overlapping community algorithms on set of benchmarks that were considered in we find that in all cases der performs better than all algorithms none of the algorithms evaluated in and has theoretical guarantees on the theoretical side we show that der reconstructs with high probability the partition of the block model such that roughly where is the number of vertices and pn log this holds in particular when pq for some constant we show that for this reconstruction only one iteration of the is sufficient in fact three passages over the set of edges suffice while the cost function we introduce for der will appear at first to have purely probabilistic motivation for the purposes of the proof we provide an alternative interpretation of this cost in terms of the graph and the arguments show which properties of the graph are useful for the convergence of the algorithm finally although this is not the emphasis of the present paper it is worth noting here that as will be evident later our algorithm can be trivially parallelalized this seems to be particularly nice feature since most other algorithms including spectral clustering are not easy to parallelalize and do not seem to have parallel implementations at present the rest of the paper is organized as follows section overviews related work and discusses relations to our results in section we provide the motivation for the definition of the algorithm derive the cost function and establish some basic properties section we present the results on the empirical evaluation of the algorithm and section describes the theoretical guarantees and the general proof scheme some proofs and additional material are provided in the supplementary material literature review community detection in graphs has been an active research topic for the last two decades and generated huge literature we refer to for an extensive survey throughout the paper let be graph and let pk be partition of loosely speaking partition is good community structure on if for each pi more edges stay within pi than leave pi this is usually quantified via some cost function that assigns larger scalars to partitions that are in some sense better separated perhaps the most well known cost function is the modularity which was introduced in and served as basis of large number of community detection algorithms the popular spectral clustering methods can also be viewed as relaxed optimization of certain cost see yet another group of algorithms is based on fitting generative model of graph with communities to given graph references are two among the many examples perhaps the simplest generative model for communities is the stochastic block model see which we now define let pk be partition of into subsets 
is distribution over the graphs on vertex set such that all edges are independent and for the edge exists with probability if belong to the same ps and it exists with probability otherwise if the components pi will be well separated in this model we denote the number of nodes by throughout the paper graphs generated from sbms can serve as benchmark for community detection algorithms however such graphs lack certain desirable properties such as degree and community size distributions some of these issues were fixed in the benchmark models in and these models are referred to as lfr models in the literature more details on these models are given in section we now turn to the discussion of the theoretical guarantees typically results in this direction provide algorithms that can reconstruct with high probability the ground partition of graph drawn from variant of model with some possibly large number of components recent results include the works and in this paper however we only analytically analyse the case and such that in addition for this case the best known reconstruction result was obtained already in and was only improved in terms of runtime since then namely bopanna result states that if lognn and lognn then with high probability the partition is reconstructible similar bound can be obtained for instance from the approaches in to name few the methods in this group are generally based on the spectral properties of adjacency or related matrices the run time of these algorithms is in the size of the graph and it is not known how these algorithms behave on graphs not generated by the probabilistic models that they assume it is generally known that when the graphs are dense of order of constant simple linear time reconstruction algorithms exist see the first and to the best of our knowledge the only previous linear time algorithm for non dense graphs was proposed in this algorithm works for for any fixed the approach of was further extended in to handle more general cluster sizes these approaches approaches differ significantly from the spectrum based methods and provide equally important theoretical insight however their empirical behaviour was never studied and it is likely that even for graphs generated from the sbm extremely high values of would be required for the algorithms to work due to large constants in the concentration inequalities see the concluding remarks in algorithm let be finite undirected graph with vertex set denote by aij the symmetric adjacency matrix of where aij are edge weights and for vertex set di aij to be the degree of let be an diagonal matrix such that dii di and set to be the transition matrix of the random walk on set also pij tij finally denote by pdidj the stationary measure of the random walk number of community detection algorithms are based on the intuition that distinct communities should be relatively closed under the random walk see and employ different notions of closedness our approach also takes this point of view for fixed consider the following sampling process on the graph choose vertex randomly from and perform steps of random walk on starting from this results in length sequence of vertices repeat the process times independently to obtain also suppose now that we would like to model the sequences xs as multinomial mixture model with single component since each coordinate xst is distributed according to the single component of the mixture should be itself when grows now suppose that we would like to model the same sequences with mixture of 
two components because the sequences are sampled from random walk rather then independently from each other the components need no longer be itself as in any mixture where some elements appear more often together then others the mixture as above can be found using the em algorithm and this in principle summarizes our approach the only additional step as discussed above is to replace the sampled random walks with their true distributions which simplifies the analysis and also leads to somewhat improved empirical performance we now present the der algorithm for detecting the communities its input is the number of components to detect the length of the walks an initialization partition algorithm der input graph walk length number of components compute the measures wi initialize pk to be random partition such that for all repeat for all construct µs µps for all set ps argmax wi µl until the sets ps do not change pk of into disjoint subsets would be usually taken to be random partition of into equally sized subsets for and vertex denote by wit the row of the matrix then wit is the distribution of the random walk on started at after steps set wi wil which is the distribution corresponding to the average of the empirical measures of sequences that start at for two probability measures on set log although is not metric will act as distance function in our algorithm note that if was an empirical measure then up to constant would be just the of observing from independent samples of for subset set πs to be the restriction of the measure to and also set ds di to be the full degree of let di wi µs ds denote the distribution of the random walk started from πs the complete der algorithm is described in algorithm the algorithm is essentially algorithm in space where the points are the measures wi each occurring with multiplicity di step is the means step and is the maximization step let di wi µl be the associated cost as with the usual we have the following lemma either is unchanged by steps and or both steps and strictly increase the value of the proof is by direct computation and is deferred to the supplementary material since the number of configurations is finite it follows that der always terminates and provides local maximum of the cost the cost can be rewritten in somewhat more informative form to do so we introduce some notation first let be random variable on distributed according to measure let step of random walk started at so that the distribution of given is wi finally for partition let be the indicator variable of partition iff ps with this notation one can write dv political blogs karate club where are the full and conditional shannon entropies therefore der algorithm can be interpreted as seeking partition that maximizes the information between current known state and the next step from it this interpretation gives rise to the name of the algorithm der since every iteration reduces the entropy of the random walk or diffusion with respect to the partition the second equality in has another interesting interpretation suppose for simplicity that with partition in general clustering algorithm aims to minimize the cut the number of edges between and however minimizing the number of edges directly will lead to situations where is single node connected with one edge to the rest of the graph in to avoid such situation relative normalized version of cut needs to be introduced which takes into account the sizes of every clustering algorithms has way to resolve this issue implicitly or explicitly for 
der this is shown in second equality of is maximized when the components are of equal sizes with respect to while is minimized when the measures µps are as disjointly supported as possible as any algorithm der results depend somewhat on its random initialization all schemes are usually restarted several times and the solution with the best cost is chosen in all cases which we evaluated we observed empirically that the dependence of der on the initial parameters is rather weak after two or three restarts it usually found partition nearly as good as after restarts for clustering problems however there is another simple way to aggregate the results of multiple runs into single partition which slightly improves the quality of the final results we use this technique in all our experiments and we provide the details in the supplementary material section we conclude by mentioning two algorithms that use some of the concepts that we use the walktrap similarly to der constructs the random walks the measures wi possibly for as part of its computation however walktrap uses wi in completely different way both the optimization procedure and the cost function are different from ours the infomap has cost that is related to the notion of information it aims to minimize to the information required to transmit random walk on through channel the source coding is constructed using the clusters and best clusters are those that yield the best compression this does not seem to be directly connected to the maximum likelyhood motivated approach that we use as with walktrap the optimization procedure of infomap also completely differs from ours evaluation in this section results of the evaluation of der algorithm are presented in section we illustrate der on two classical graphs sections and contain the evaluation on the lfr benchmarks basic examples when new clustering algorithm is introduced it is useful to get general feel of it with some simple examples figure shows the classical zachary karate club this graph has ground partition into two subsets the partition shown in figure is partition obtained from typical run of der algorithm with and wide range of were tested as is the case with many other clustering algorithms the shown partition differs from the ground partition in one element node see figure shows the political blogs graph the nodes are political blogs and the graph has an undirected edge if one of the blogs had link to the other there are nodes in the graph the ground truth partition of this graph has two components the right wing and left wing blogs the labeling of the ground truth was partially automatic and partially manual and both processes could introduce some errors the run of der reconstructs the ground truth partition with only nodes missclassifed the nmi see the next section eq to the ground truth partition is the political blogs graphs is particularly interesting since it is an example of graph for which fitting an sbm model to reconstruct the clusters produces results very different from the ground truth to overcome the problem with sbm fitting on this graph degree sensitive version of sbm dcbm was introduced in that algorithm produces partition with nmi another approach to dcbm can be found in lfr benchmarks the lfr benchmark model is widely used extension of the stochastic block model where node degrees and community sizes have power law distribution as often observed in real graphs an important parameter of this model is the mixing parameter that controls the fraction of the edges of 
node that go outside the node community or outside all of node communities in the overlapping case for small there will be small number of edges going outside the communities leading to disjoint easily separable graphs and the boundaries between communities will become less pronounced as grows given set of communities on graph and the ground truth set of communities there are several ways to measure how close is to one standard measure is the normalized mutual information nmi given by where is the shannon entropy of partition and is the mutual information see for details nmi is equal if and only if the partitions and coincide and it takes values between and otherwise when computed with nmi the sets inside can not overlap to deal with overlapping communities an extension of nmi was proposed in we refer to the original paper for the definition as the definition is somewhat lengthy this extension which we denote here as enmi was subsequently used in the literature as measure of closeness of two sets of communities event in the cases of disjoint communities note that most papers use the notation nmi while the metric that they really use is enmi figure shows the results of evaluation of der for four cases the size of graph was either or nodes and the size of the communities was restricted to be either between to denoted in the figures or between to denoted for each combination of these parameters varied between and for each combination of graph size community size restrictions as above and value we generated graphs from that model and run der to provide some basic intuition about these graphs we note that the number of communities in the graphs is strongly concentrated around and in and graphs it is around and respectively each point in figure is the average enmi on the corresponding graphs with standard deviation as the error bar these experiments correspond precisely to the ones performed in see supplementary material section cfor more details in all runs on der we have set and set to be the true number of communities for each graph as was done in for the methods that required it therefore our figure can be compared directly with figure in from this comparison we see that der and the two of the best algorithms identified in infomap and rn reconstruct the partition perfectly for for der reconstruction scores are between infomap and rn with values for all of the algorithms above and for enmi enmi der lfr benchmarks spectral lfr benchmarks der has the best performance in two of the four cases for all algorithms have score we have also performed the same experiments with the standard version of spectral clustering because this version was not evaluated in the results are shown in fig although the performance is generally good the scores are mostly lower than those of der infomap and rn overlapping lfr benchmarks we now describe how der can be applied to overlapping community detection observe that der internally operates on measures µps rather then subsets of the vertex set recall that µps is the probability that random walk started from ps will hit node we can therefore consider each to be member of those communities from which the probability to hit it is high enough to define this formally we first note that for any partition the following decomposition holds ps µps this follows from the invariance of under the random walk now given the out put of der the sets ps and measures µps set µps ps µp ps mi pk µpt pt where we used in the second equality then mi is the probability that the walks 
started at ps given that it finished in for each set si argmaxl mi to be the most likely community given then define the overlapping communities ck via ct mi mi si the paper introduces new algorithm for overlapping communities detection and contains also an evaluation of that algorithm as well as of several other algorithms on set of overlapping lfr benchmarks the overlapping communities lfr model was defined in in table we present the enmi results of der runs on the graphs with same parameters as in and also show the values obtained on these benchmarks in figure in for four other algorithms the der algorithm was run with and was set to the true number of communities each number is an average over enmis on instances of graphs with given set of parameters as in the standard deviation around this average for der was less then in all cases variances for other algorithms are provided in for all algorithms yield enmi of less then as we see in table der performs better than all other algorithms in all the cases we believe this indicates that der together with equation is good choice for overlapping community detection in situations where community overlap between each two communities is sparse as is the case in the lfr models considered above further discussion is provided in the supplementary material section table evaluation for overlapping lfr all values except der are from alg der svi poi inf cop we conclude this section by noting that while in the case the models generated with result in trivial community detection problems because in these cases communities are simply the connected components of the graph this is no longer true in the overlapping case as point of reference the well known clique percolation method was also evaluated in in the case the average enmi for this algorithm was table in analytic bounds in this section we restrict our attention to the case of the der algorithm recall that the model was defined in section we shall consider the model with and such that we assume that the initial partition for the der denoted in what follows is chosen as in step of der algorithm random partition of into two equal sized subsets in this setting we have the following theorem for every there exists and such that if and pn log then der recovers the partition after one iteration with probability such that when note that the probability in the conclusion of the theorem refers to joint probability of draw from the sbm and of an independent draw from the random initialization the proof of the theorem has essentially three steps first we observe that the random initialization is necessarily somewhat biased in the sense that and never divide exactly into two halves specifically with high probability assume that has the bigger half in the second step by an appropriate linearization argument we show that for node deciding whether wi wi or vice versa amounts to counting paths of length two between and in the third step we estimate the number of these length two paths in the model the fact that will imply more paths to from and we will conclude that wi wi for all and wi wi for all the full proof is provided in the supplementary material references santo fortunato community detection in graphs physics reports ulrike luxburg tutorial on spectral clustering statistics and computing andrea lancichinetti and santo fortunato benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities phys rev santo fortunato and andrea lancichinetti community detection 
algorithms comparative analysis in fourth international icst conference rosvall and bergstrom maps of random walks on complex networks reveal community structure proc natl acad sci usa page peter ronhovde and zohar nussinov multiresolution community detection for megascale networks by replica correlations phys rev vincent blondel guillaume renaud lambiotte and etienne lefebvre fast unfolding of communities in large networks journal of statistical mechanics theory and experiment andrew ng michael jordan and yair weiss on spectral clustering analysis and an algorithm in advances in neural information processing systems gergely palla imre farkas and vicsek uncovering the overlapping community structure of complex networks in nature and society nature prem gopalan and david blei efficient discovery of overlapping communities in massive networks proceedings of the national academy of sciences girvan and newman community structure in social and biological networks proceedings of the national academy of sciences mej newman and ea leicht mixture models and exploratory analysis in networks proceedings of the national academy of sciences paul holland kathryn laskey and samuel leinhardt stochastic blockmodels first steps social networks andrea lancichinetti santo fortunato and filippo radicchi benchmark graphs for testing community detection algorithms phys rev animashree anandkumar rong ge daniel hsu and sham kakade tensor spectral approach to learning mixed membership community models in colt volume of jmlr proceedings yudong chen sanghavi and huan xu improved graph clustering information theory ieee transactions on oct ravi boppana eigenvalues and graph bisection an analysis in foundations of computer science annual symposium on pages oct anne condon and richard karp algorithms for graph partitioning on the planted partition model random struct algorithms ron shamir and dekel tsur improved algorithms for the random cluster graph model random struct algorithms pascal pons and matthieu latapy computing communities in large networks using random walks of graph alg and alcides viamontes esquivel and martin rosvall compression of flow can reveal overlappingmodule organization in networks phys rev dec zachary an information flow model for conflict and fission in small groups journal of anthropological research lada adamic and natalie glance the political blogosphere and the election divided they blog linkkdd brian karrer and newman stochastic blockmodels and community structure in networks phys rev arash amini aiyou chen peter bickel and elizaveta levina methods for community detection in large sparse networks andrea lancichinetti santo fortunato and jnos kertsz detecting the overlapping and hierarchical community structure in complex networks new journal of physics brian ball brian karrer and newman efficient and principled method for detecting communities in networks phys rev sep steve gregory finding overlapping communities in networks by label propagation new journal of physics 
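since the construction above is phrased entirely in terms of the walk measures w_i and a k-means-style alternation in measure space, the whole pipeline fits in a short sketch. the following python/numpy code is a hedged reading of the der algorithm described above: it samples a two-block sbm graph, forms each w_i as the average of the 1..L step random-walk distributions started at node i, and alternates the centroid step (a degree-weighted average of member measures) with the assignment step argmax_s sum_j w_ij log mu_sj. graph sizes, edge probabilities, the empty-cluster handling and the single random start are illustrative choices, not taken from the paper.

```python
import numpy as np

def sbm_adjacency(n, p, q, rng):
    """two equal blocks: within-block edge probability p, across-block q."""
    labels = np.repeat([0, 1], n // 2)
    probs = np.where(labels[:, None] == labels[None, :], p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(float)          # symmetric, no self-loops
    return A, labels

def walk_measures(A, L):
    """row i of W is w_i, the average of the 1..L step random-walk
    distributions started at node i (assumes no isolated nodes)."""
    deg = A.sum(axis=1)
    P = A / deg[:, None]                         # transition matrix D^{-1} A
    W, Pt = np.zeros_like(P), np.eye(len(A))
    for _ in range(L):
        Pt = Pt @ P
        W += Pt
    return W / L, deg

def der(W, deg, K, n_iter=100, rng=None):
    """k-means-style alternation in the space of measures."""
    rng = rng or np.random.default_rng(0)
    n, eps = W.shape[0], 1e-12                   # eps only guards log(0)
    assign = rng.integers(0, K, size=n)
    for _ in range(n_iter):
        mu = np.empty((K, n))
        for s in range(K):
            members = assign == s
            if not members.any():                # re-seed an empty cluster
                members = np.zeros(n, bool); members[rng.integers(n)] = True
            mu[s] = deg[members] @ W[members] / deg[members].sum()
        # the d_i factor does not change the per-node argmax, so it is dropped here
        new_assign = (W @ np.log(mu + eps).T).argmax(axis=1)
        if (new_assign == assign).all():
            break
        assign = new_assign
    return assign

rng = np.random.default_rng(0)
A, labels = sbm_adjacency(200, 0.10, 0.01, rng)
W, deg = walk_measures(A, L=3)
found = der(W, deg, K=2, rng=rng)
print(max((found == labels).mean(), (found != labels).mean()))  # agreement up to label swap
```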
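for the evaluation protocol above, the plain (non-overlapping) nmi is available off the shelf, e.g. in scikit-learn; the extended nmi of lancichinetti et al. used for the overlapping benchmarks needs a separate implementation and is not shown. the second function is a hedged reading of the overlap rule sketched above: node i joins every community whose posterior mass m_s(i) is within a factor tau of its best community, where tau is an illustrative threshold not specified in the text.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# standard nmi between two disjoint partitions, encoded as label vectors;
# it equals 1.0 iff the partitions coincide (here one node is misplaced)
truth = np.array([0, 0, 0, 0, 1, 1, 1, 1])
found = np.array([1, 1, 1, 0, 0, 0, 0, 0])
print(normalized_mutual_info_score(truth, found))

def overlapping_memberships(mu, pi_P, pi, tau=0.5):
    """mu: (K, n) walk distributions mu_{P_s}; pi_P: (K,) stationary mass of each
    community; pi: (n,) stationary distribution over nodes.  m[s, i] is proportional
    to the probability that a stationary walk observed at node i started in community s."""
    m = pi_P[:, None] * mu / pi[None, :]
    best = m.max(axis=0)
    return [np.flatnonzero(m[:, i] >= tau * best[i]).tolist() for i in range(len(pi))]
```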
basis refinement strategies for linear value function approximation in mdps gheorghe comanici school of computer science mcgill university montreal canada gcoman doina precup school of computer science mcgill university montreal canada dprecup prakash panangaden school of computer science mcgill university montreal canada prakash abstract we provide theoretical framework for analyzing basis function construction for linear value function approximation in markov decision processes mdps we show that important existing methods such as krylov bases and methods are special case of the general framework we develop we provide general algorithmic framework for computing basis function refinements which respect the dynamics of the environment and we derive approximation error bounds that apply for any algorithm respecting this general framework we also show how using ideas related to bisimulation metrics one can translate basis refinement into process of finding prototypes that are diverse enough to represent the given mdp introduction finding optimal or policies in large markov decision processes mdps requires the use of approximation very popular approach is to use linear function approximation over set of features sutton and barto szepesvari an important problem is that of determining automatically this set of features in such way as to obtain good approximation of the problem at hand many approaches have been explored including adaptive discretizations bertsekas and castanon munos and moore functions mahadevan bellman error basis functions bebfs keller et parr et fourier basis konidaris et feature dependency discovery geramifard et etc while many of these approaches have nice theoretical guarantees when constructing features for fixed policy evaluation this problem is significantly more difficult in the case of optimal control where multiple policies have to be evaluated using the same representation we analyze this problem by introducing the concept of basis refinement which can be used as general framework that encompasses large class of iterative algorithms for automatic feature extraction the main idea is to start with set of basis which are consistent with the reward function which allow only states with similar immediate reward to be grouped together is then used to find parts of the state space in which the current basis representation is inconsistent with the environment dynamics and the basis functions are adjusted to fix this problem the process continues iteratively we show that bebfs keller et parr et can be viewed as special case of this iterative framework these methods iteratively expand an existing set of basis functions in order to capture the residual bellman error the relationship between such features and augmented krylov bases allows us to show that every additional feature in these sets is consistently refining intermediate bases based on similar arguments it can be shown that other methods such as those based on the concept of mdp homomorphisms ravindran and barto bisimulation metrics ferns et and partition refinement algorithms ruan et are also special cases of the framework we provide approximation bounds for sequences of refinements as well as basis convergence criterion using mathematical tools rooted in bisimulation relations and metrics givan et ferns et final contribution of this paper is new approach for computing alternative representations based on selection of prototypes that incorporate all the necessary information to approximate values over the entire state 
space this is closely related to approaches ormoneit and sen jong and stone barreto et but we do not assume that metric over the state space is provided which allows one to determine similarity between states instead we use an iterative approach in which prototypes are selected to properly distinguish dynamics according to the current basis functions then new metric is estimated and the set of prototypes is refined again this process relies on using pseudometrics which in the limit converge to bisimulation metrics background and notation we will use the framework of markov decision processes consisting of finite state space finite action space transition function where is probability distribution over the state space reward for notational convenience ra will be used to denote and respectively one of the main objectives of mdp solvers is to determine good action choice also known as policy from every state that the system would visit appolicy determines the probability of choosing each action given the state with the value of policy given state is defined as ai si si ai si note that is real valued function the space of all such functions will be denoted by fs we will also call such functions features let rπ and denote the reward and transition probabilities corresponding to choosing actions according to note rπ and fs fs rπ ra and ep let fs fs denote the bellman operator rπ γp this operator is linear and is its fixed point most algorithms for solving mdps will either use the model rπ to find if this model is available can be estimated efficiently or they will estimate directly using samples of the model si ai ri the value associated with the best policy is the fixed point of the bellman optimality operator not linear operator defined as ra γp the main problem we address in this paper is that of finding alternative representations for given mdp in particular we look for finite linearly independent subsets of fs these are bases for subspaces that will be used to speed up the search for by limiting it to span we say that basis is partition if there exists an equivalence relation on such that where is the characteristic function if and otherwise given any equivalence relation we will use the notation for the set of characteristic functions on the equivalence classes of our goal will be to find subsets fs which allow value function approximation with strong quality guarantees more precisely for any policy we would like to approximate with pk vφπ wi φi for some choice of wi which amounts to finding the best candidate inside the space spanned by φk sufficient condition for to be an element of span and therefore representable exactly using the chosen set of bases is for to span the reward function and be an invariant subspace of the transition function rπ span and span linear fixed point methods like td lstd lspe sutton bradtke and barto yu and bertsekas can be used to find the least squares fixed point approximation vφπ of for representation these constitute proper approximation schemes as we will use to denote the set of probability distributions on given set for simplicity we assume wlog that the reward is deterministic and independent of the state into which the system arrives we will use eµ of function wrt distribution if the to mean the expectation function is multivariate we will use to denote expectation of when is fixed the equivalence class of an element is is used for the quotient set of all equivalence classes of one can determine the number of iterations required to achieve desired 
approximation error given representation the approximate value function vφπ is the fixed point of the operator tφπ defined as tφπ πφ rπ γp where πφ is the orthogonal projection operator on using the linearity of πφ it directly follows that tφπ πφ rπ γπφ and vφπ is the fixed point of the bellman operator over the transformed linear model rφ pφπ πφ rπ πφ for more details see parr et the analysis tools that we will use to establish our results are based on probabilistic bisimulation and its quantitative analogues strong probabilistic bisimulation is notion of behavioral equivalence between the states of probabilistic system due to larsen and skou and applied to mdps with rewards by givan et the metric analog is due to desharnais et and the extension of the metric to include rewards is due to ferns et an equivalence relation is bisimulation relation on the state space if for every pair if and only if ra ra and we use here to denote the probability of transitioning into under transition is bisimulation metric if there exists some bisimulation relation such that the bisimulation metrics described by ferns et are constructed using the kantorovich metric for comparing two probability distributions given ground metric over the kantorovich metric over takes the largest difference in the expected value of functions with respect to fs the distance between two probabilities and is computed as eµ eν for more details on the kantorovich metric see villani the following approximation scheme converges to bisimulation metric starting with the metric that associates to all pairs dk max ra ra γk dk the operator has fixed point which is bisimulation metric and dk as ferns et provide bounds which allow one to assess the quality of general state aggregations using this metric given relation and its corresponding partition one can define an mdp model over as ra and the approximation error between the true mdp optimal value function and its approximation using this reduced mdp model denoted by is bounded above by max where is average distance from state to its class defined as an expectation over the uniform distribution similar bounds for representations that are not partitions can be found in comanici and precup note that these bounds are minimized by aggregating states which are close in terms of the bisimulation distance basis refinement in this section we describe the proposed basis refinement framework which relies on detecting and fixing inconsistencies in the dynamics induced by given set of features intuitively states are dynamically consistent with respect to set of basis functions if transitions out of these states are evaluated the same way by the model inconsistencies are fixed by augmenting basis with features that are able to distinguish inconsistent states relative to the initial basis we are now ready to formalize these ideas definition given subset fs two states are consistent with respect to denoted if and ep ep definition given two subspaces fs refines in an mdp and write if and using the linearity of expectation one can prove that given two probability distributions and finite subset if span then eµ eν for the special case of dirac distributions δs and for which eδs it also holds that therefore def gives relation between two subspaces but the refinement conditions could be checked on any basis choice it is the subspace itself rather than particular basis that matters if span span to fix inconsistencies on pair for which we can find and such that either or ep ep one should construct new function 
with and add it to to guarantee that all inconsistencies have been addressed if for some must contain feature such that for some either or ep ep in sec we present an algorithmic framework consisting of sequential improvement steps in which current basis is refined into new one with span span def guarantees that following such strategies expands span and that the approximation error for any policy will be decreased as result we now discuss bounds that can be obtained based on these definitions value function approximation results one simple way to create refinement is to add to single element that would address all inconsistencies feature that is valued differently for every element np of given on the other hand such construction provides no approximation guarantee for the optimal value function unless we make additional assumptions on the problem we will discuss this further in section although it addresses inconsistencies in the dynamics over the set of features spanned by it does not necessarily provide the representation power required to properly approximate the value of the optimal policy the main theoretical result in this section provides conditions for describing refining sequences of bases which are not necessarily accurate but have approximation errors bounded by an exponentially decreasing function these results are based on the largest basis refining subspace any feature that is constant over equivalence classes of will be spanned by for any refinement span these subsets are convenient as they can be analyzed using the bisimulation metric introduced in ferns et lemma the bisimulation operator in eq is contraction with constant that is for any metric over sups sups the proof relies on the duality see villani to check that satisfies sufficient conditions to be contraction operator an operator is contraction with constant if whenever and if γc for any constant blackwell one could easily check these conditions on the operator in equation theorem let represent reward consistency ra ra and additionally assume γn is sequence of bases such that for all γn and is as large as the partition corresponding to consistency over γn optimal value function computed with respect to representa γn is the tion γn then vγn sups ra proof we will use the bisimulation metric defined in eq and eq applied to the special case of reduced models over bases γn first note that duality is crucial in this proof it basically states that the kantorovich metric is solution to the problem when its cost function is equal to the base metric for the kantorovich metric specifically for two measures and and cost function the problem computes inf eξ are the marginals corresponding to and the set of measures with marginals and is also known as the set of couplings of and for any metric over for proof see villani next we describe relation between the metric and γn since and span it must be the case that span span it is not hard to see that for the special case of partitions refinement can be determined based on transitions into equivalence classes given two equivalence relations the refinement holds and only if and in particular with and this equality is crucial in defining the following coupling for let ξc be any coupling of and the restrictions of and to theplatter is possible as the two distributions are equal next define the coupling of and as ξc for any cost function if then eξc using an inductive argument we will now show that the base case is clear from the definition now assume the former holds for that is but ξc is 
zero everywhere except on the set so eξc combining the last two results we get the following upper bound eξc since is metric it also holds that moreover as and are consistent over γn this pair of states agree on the reward function therefore maxa ra γj finally for any and with and any other state with it must be the case that and therefore as span γn span is the optimal value function for the mdp model over based on and we can conclude that vγ but we already know from lemma that defined in eq is the fixed point of contraction operator with constant as the following holds for all sup ra the final result is easily obtained by putting together equations and the result of the theorem provides strategy for constructing refining sequences with strong approximation guarantees still it might be inconvenient to generate refinements as large as as this might be although faithful to the assumptions of the theorem it might generate features that distinguish states that are not often visited or pairs of states which are only slightly different to address this issue we provide variation on the concept of refinement that can be used to derive more flexible refining algorithms refinements that concentrate on local properties definition given subset fs and subset two states are consistent on with respect to denoted if and ep ep ep ep definition given two subspaces fs refines locally with respect to denoted nζ if and definition is the special case of definition corresponding to refinement with respect to the whole state space ns when the subset is not important we will use the notation to say that refines locally with respect to some subset of the result below states that even if one provides local refinements one will eventually generate pair of subspaces which are related through global refinement property proposition let γi be set of bases over with nζi γi for some ζi assume that γn is the maximal refinement ζn let ζi then span γn proof assume ζn we will check below all conditions necessary to conclude that first let it is immediate from the definition of local refinements that γj so that ζn it follows that next fix and if ζn then ep ep ep ep by the assumption above on the pair otherwise such that ζj and nζj γj but we already know that γj as γj we can use this result in the definition of local refinement nζj γj to conclude that ζj moreover as ζj ep ep ep ep this completes the definition of consistency on and it becomes clear that ζn or span ζn finally both γn and are bases of the same size and both refine it must be that span γn span ζn examples of basis refinement for feature extraction the concept of basis refinement is not only applicable to the feature extraction methods we will present later but to methods that have been studied in the past in particular methods based on bellman error basis functions state aggregation strategies and spectral analysis using bisimulation metrics are all special cases of basis refinement we briefly describe the refinement property for the first two cases and in the next section we elaborate on the connection between refinement and bisimulation metrics to provide new condition for convergence to bases krylov bases consider the uncontrolled policy evaluation case in which one would like to find set of features that is suited to evaluating single policy of interest common approach to automatic feature generation in this context computes bellman error basis functions bebfs which have been shown to generate sequence of representations known as krylov bases given policy 
krylov basis φn of size is built using the model rπ defined in section as elements of fs and fs fs respectively φn span rπ rπ rπ rπ it is not hard to check that φn where is the refinement relational property in def since the initial feature rπ the result in theorem holds for the krylov bases under the assumption of mdp γχ is basis for fs therefore this set of features is finite dimensional it follows that one can find such that one of the krylov bases is φn φn this would by no means be the only basis in fact this property holds for the basis of characteristic functions γχ γχ the purpose our framework is to determine other bases which are suited for function approximation methods in the context of controlled systems state aggregation one popular strategy used for solving mdps is that of computing state aggregation maps instead of working with alternative subspaces these methods first compute equivalence relations on the state space an model is then derived and the solution to this model is translated to one for the original problem the resulting policy provides the same action choice for states that have originally been related given any equivalence relation on state aggregation map is function from to any set such that in order to obtain significant computational gain one would like to work with aggregation maps that reduce the size of the space for which one looks to provide action choices as discussed in section one could work with features that are defined on an aggregate state space instead of the original state space that is instead of computing set of state features fs we could work instead with an aggregation map and set of features over fx if is the relation such that then span using bisimulation metrics for convergence of bases in section we provide two examples of subspaces the krylov bases and the characteristic functions on single states the latter is the largest and sparsest basis it spans the entire state space and the features share no information the former is potentially smaller and it spans the value of the fixed policy for which it was designed in this section we will present third construction which is designed to capture bisimulation properties based on the results presented in section it can be shown that given bisimulation relation the partition it generates is desirable bases might be be computationally demanding too complex to use or represent we propose iterative schemes which ultimately provide result albeit we would have the flexibility of stopping the iterative process before reaching the final result at the same time we need criterion to describe convergence of sequences of bases that is we would want to know how close an iterative process is to obtaining basis inspired by the fixed point theory used to study bisimulation metrics desharnais et instead of using metric over the set of all bases to characterize convergence of such sequences we will use corresponding metrics over the original state space this choice is better suited for generalizing previously existing methods that compare pairs of states for bisimilarity through their associated reward models and expected realizations of features over the next state distribution model associated with these states we will study metric construction strategies based on map defined below which takes an element of the powerset fs of fs and returns an element of all over maxa ra ep ep is set of features whose expectation over distributions should be matched it is not hard to see that bases for which is bisimulation 
metric are by definition for example consider the largest bisimulation relation on given mdp it is not hard to see that is bisimulation more elaborate example involves the set of continuous functions on recall definition and computation details from section define to be the fixed point of the operator has the same property as the bisimulation metric defined in equation moreover given any bisimulation metric is bisimulation metric definition we say sequence γn is bisimulation sequence of bases if γn converges uniformly from below to bisimulation metric if one has the sequence of refining bases with γn then γn is an increasing sequence but not necessarily bisimulation sequence bisimulation sequence of bases provide an approximation scheme for bases that satisfy two important properties studied in the past and bisimilarity one could show that the approximation schemes presented in ferns et comanici and precup and ruan et are all examples of bisimulation sequences we will present in the next section framework that generalizes all these examples but which can be easily extended to broader set of approximation schemes that incorporate both refining and bisimulation principles prototype based refinements in this section we propose strategy that iteratively builds sequences of refineing sets of features based on the concepts described in the previous sections this generates layered sets of features where the nth layer in the construction will be dependent only on the th layer additionally each feature will be associated with prototype elements of associating to each action reward and probability distribution prototypes can be viewed as abstract or representative states such as used in kbrl methods ormoneit and sen in the layered structure the similarity between prototypes at the nth layer is based on measure of consistency with respect to features at the th layer the same measure of similarity is used to determine whether the entire state space is covered by the set of chosen for the nth layer we say that space is covered if every state of the space is close to at least one prototype generated by the construction with respect to predefined measure of similarity this measure is designed to make sure that consecutive layers represent refining sets of features note that for any given mdp the state space is embedded into as ra for every state additionally the metric generator as defined in equation can be generalized to map from fs to the algorithmic strategy will look for sequence jn ιn where jn is set of covering prototypes and ιn jn fs is function that associates feature to every prototype in jn starting with and the strategy needs to find at step cover jˆn for based on the distance metric that is it has to guarantee that jˆn with with jn jˆn and using strictly decreasing function the gibbs measure exp for some the framework constructs ιn jn fs map that associates prototypes to features as ιn algorithm prototype refinement and for to do choose representative subset ζn and cover approximation error find an jˆn for ζn define jn jˆn choose strictly decreasing function if ζn such that define ιn otherwise define γn ιn jn note that γn is local refinement nζn γn it is not hard to see that the refinement property holds at every step γn first every equivalence class of is represented by some prototype in jn second ιn is purposely defined to make sure that distinction is made between each prototype in moreover γn is bisimulation sequence of bases as the metric generator is the main tool used in covering the 
state space with the set of prototypes jn two states will be represented by the same prototype they will be equivalent with respect to if and only if the distance between their corresponding models is algorithm provides for the framework described in this section note that it also contains two additional modifications used to illustrate the flexibility of this feature extraction process through the first modification one could use the intermediate results at time step to determine subset ζn of states which are likely to have model with significantly distinct dynamics over as such the prototypes can be specialized to cover only the significant subset ζn moreover theorem guarantees that if every state in is picked in ζn infinitely often as then the approximation power of the final result is not be compromised the second modification is based on using the values in the metric for more than just choosing feature activations one could set at every step constants and then find jn such that ζn is covered using for every state in ζn there exists prototype jn with one can easily show that the refinement property can be maintained using the modified defition of ιn described in algorithm discussion we proposed general framework for basis refinement for linear function approximation the theoretical results show that any algorithmic scheme of this type satisfies strong bounds on the quality of the value function that can be obtained in other words this approach provides blueprint for designing algorithms with good approximation guarantees as discussed some existing value function construction schemes fall into this category such as state aggregation refinement for example other methods like bebfs can be interpreted in this way in the case of policy evaluation however the traditional bebf approach in the case of control does not exactly fit this framework however we suspect that it could be adapted to exactly follow this blueprint something we leave for future work we provided ideas for new algorithmic approach to this problem which would provide strong guarantees while being significantly cheaper than other existing methods with similar bounds which rely on bisimulation metrics we plan to experiment with this approach in the future the focus of this paper was to establish the theoretical underpinnings of the algorithm the algorithm structure we propose is close in spirit to barreto et which selects prototype states in order to represent well the dynamics of the system by means of stochastic factorization however their approach assumes given metric which measures state similarity and selects representative states using clustering based on this metric instead we iterate between computing the metric and choosing prototypes we believe that the theory presented in this paper opens up the possibility of further development of algorithms for constructive function approximation that have quality guarantees in the control case and which can be effective also in practice references sutton and barto reinforcement learning an introduction mit press cs szepesvari algorithms for reinforcement learning morgan claypool bertsekas and castanon adaptive aggregation methods for infinite horizon dynamic programming ieee transactions on automatic control munos and moore variable resolution discretization in optimal control machine learning mahadevan functions developmental reinforcement learning in icml pages keller mannor and precup automatic basis function construction for approximate dynamic programming and reinforcement 
learning in icml pages parr li and littman analyzing feature generation for value function approximation in icml pages konidaris osentoski and thomas value function approximation in reinforcement learning using the fourier basis in aaai pages geramifard doshi redding roy and how online discovery of feature dependencies in icml pages ravindran and barto model minimization in hierarchical reinforcement learning in symposium on abstraction reformulation and approximation sara pages ferns panangaden and precup metrics for finite markov decision processes in uai pages ruan comanici panangaden and precup representation discovery for mdps using bisimulation metrics in aaai pages givan dean and greig equivalence notions and model minimization in markov decision processes artificial intelligence ormoneit and reinforcement learning machine learning jong and stone models for reinforcement learning in icml workshop on kernel machines and reinforcement learning barreto precup and pineau reinforcement learning using stochastic factorization in nips pages sutton learning to predict by the methods of temporal differences machine learning bradtke and barto linear algorithms for temporal difference learning machine learning yu and bertsekas convergence results for some temporal difference methods based on least squares technical report lids mit parr li taylor and littman an analysis of linear models linear approximation and feature selection for reinforcement learning in icml pages larsen and skou bisimulation through probabilistic testing information and computation desharnais gupta jagadeesan and panangaden metrics for labeled markov systems in concur desharnais gupta jagadeesan and panangaden metric for labelled markov processes theoretical computer science villani topics in optimal transportation american mathematical society comanici and precup basis function discovery using spectral clustering and bisimulation metrics in aaai blackwell discounted dynamic programming annals of mathematical statistics 
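To make the prototype-based aggregation idea above concrete, here is a minimal sketch (our illustration, not the paper's algorithm): a simple one-step model distance stands in for the bisimulation-style metric, states within eps of an existing prototype reuse its feature, and the resulting one-hot aggregation features are used for an LSTD-style linear fixed point. The function names, the toy Markov chain, and the choice of metric are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the paper's algorithm) of prototype-based state
# aggregation for policy evaluation.  A simple one-step model distance stands
# in for the bisimulation-style metric discussed above: two states share a
# prototype when their rewards and transition rows are within eps.
# Shrinking eps refines the partition and improves the approximation.

def one_step_distance(P, r, s, t):
    """Distance between the one-step models of states s and t."""
    return abs(r[s] - r[t]) + 0.5 * np.abs(P[s] - P[t]).sum()

def select_prototypes(P, r, eps):
    """Greedily cover the state space with prototypes at resolution eps."""
    n = len(r)
    prototypes, assignment = [], np.full(n, -1)
    for s in range(n):
        for j, p in enumerate(prototypes):
            if one_step_distance(P, r, s, p) <= eps:
                assignment[s] = j
                break
        else:
            prototypes.append(s)
            assignment[s] = len(prototypes) - 1
    return prototypes, assignment

def aggregated_values(P, r, gamma, assignment, k):
    """Linear fixed point with one-hot aggregation features (LSTD-style)."""
    n = len(r)
    Phi = np.zeros((n, k))
    Phi[np.arange(n), assignment] = 1.0
    w = np.linalg.solve(Phi.T @ (Phi - gamma * P @ Phi), Phi.T @ r)
    return Phi @ w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, gamma = 30, 0.9
    P = rng.dirichlet(np.ones(n), size=n)                 # random Markov chain
    r = rng.uniform(size=n)
    v_true = np.linalg.solve(np.eye(n) - gamma * P, r)
    for eps in (1.0, 0.5, 0.0):                           # coarse -> fine
        protos, assign = select_prototypes(P, r, eps)
        v_hat = aggregated_values(P, r, gamma, assign, len(protos))
        print(f"eps={eps:3.1f}: {len(protos):2d} prototypes, "
              f"max |V_hat - V| = {np.abs(v_hat - v_true).max():.4f}")
```

Shrinking eps refines the partition, mirroring the refinement property discussed above: the coarse partition gives a crude value estimate, while eps = 0 recovers the exact values.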
structured estimation with atomic norms general bounds and applications sheng chen arindam banerjee dept of computer science university of minnesota twin cities shengc banerjee abstract for structured estimation problems with atomic norms recent advances in the literature express sample complexity and estimation error bounds in terms of certain geometric measures in particular gaussian width of the unit norm ball gaussian width of spherical cap induced by tangent cone and restricted norm compatibility constant however given an atomic norm bounding these geometric measures can be difficult in this paper we present general upper bounds for such geometric measures which only require simple information of the atomic norm under consideration and we establish tightness of these bounds by providing the corresponding lower bounds we show applications of our analysis to certain atomic norms especially norm for which existing result is incomplete introduction accurate recovery of structured sparse vectors from noisy linear measurements has been extensively studied in the field of compressed sensing statistics etc the goal is to recover signal parameter rp which is sparse only has few nonzero entries possibly with additional structure such as group sparsity typically one assume linear models xθ in which is the design matrix consisting of samples rn is the observed response vector and rn is an unknown noise vector by leveraging the sparsity of previous work has shown that certain based estimators can find good approximation of using sample size recent work has extended the notion of unstructured sparsity to other structures in which can be captured or approximated by some norm other than non overlapping group sparsity with norm etc in general two broad classes of estimators are considered in recovery analysis estimators which solve the regularized optimization problem argmin kxθ λn and ii estimators which solve the constrained problem argmin xt xθ λn where is the dual norm of variants of these estimators exist but the recovery analysis proceeds along similar lines as these two classes of estimators to establish recovery guarantees focused on estimators and from the class of decomposable norm norm the upper bound for the estimation error for any decomposable norm is characterized in terms of three geometric measures dual norm bound as an upper bound for xt ii sample complexity the minimal sample size needed for certain restricted eigenvalue re condition to be true and iii restricted norm compatibility constant between and norms the estimation error bound typically has the form where depends on product of dual norm bound and restricted norm compatibility whereas the sample complexity characterizes the minimum number of samples after which the error bound starts to be valid in recent work extended the analysis of estimator for decomposable norm to any norm and gave more succinct characterization of the dual norm bound for xt and the sample complexity for the re condition in terms of gaussian widths of suitable sets where for any set rp the gaussian width is defined as sup hu gi where is standard gaussian random vector for estimators obtained similar extensions to be specific assume entries in and are normal and define the tangent cone tr cone rp then one can get upper bound for xt as nw ωr where ωr rp is the unit norm ball and the re condition is satisfied with tr samples in which is the unit sphere for convenience we denote by cr the spherical cap tr throughout the paper further the restricted norm 
compatibility is given by ψr see section for details thus for any given norm it suffices to get characterization of ωr the width of the unit norm ball ii cr the width of the spherical cap induced by the tangent cone tr and iii ψr the restricted norm compatibility in the tangent cone for the special case of norm accurate characterization of all three measures exist however for more general norms the literature is rather limited for ωr the characterization is often reduced to comparison with either cr or known results on other norm balls while cr has been investigated for certain decomposable norms little is known about general nondecomposable norms one general approach for upper bounding cr is via the statistical dimension which computes the expected squared distance between gaussian random vector and the polar cone of tr to specify the polar one need full information of the subdifferential which could be difficult to obtain for norms notable bound for overlapping norms is presented in which yields tight bounds for mildly cases but is loose for highly overlapping ones for ψr the restricted norm compatibility results are only available for decomposable norms in this paper we present general set of bounds for the width ωr of the norm ball the width cr of the spherical cap and the restricted norm compatibility ψr for the analysis we consider the class of atomic norms that are invariant under the norm of vector stays unchanged if any entry changes only by flipping its sign the class is quite general and covers most of the popular norms used in practical applications norm ordered weighted owl norm and norm specifically we show that sharp bounds on ωr can be obtained using simple calculation based on decomposition inequality from to upper bound cr and ψr instead of full specification of tr we only require some information regarding the subgradient of which is often readily accessible the key insight is that bounding statistical dimension often ends up computing the expected distance from gaussian vector to single point rather than to the whole polar cone thus the full information on is unnecessary in addition we derive the corresponding lower bounds to show the tightness of our results as examples we illustrate the bounds for and owl norms finally we give sharp bounds for the recently proposed norm for which existing analysis is incomplete the rest of the paper is organized as follows we first review the relevant background for dantzigtype estimator and atomic norm in section in section we introduce the general bounds for the geometric measures in section we discuss the tightness of our bounds section is dedicated to the example of norm and we conclude in section background in this section we briefly review the recovery guarantee for the generalized dantzig selector in and the basics on atomic norms the following lemma originally theorem provides an error bound for related results have appeared for other estimators lemma assume that xθ where entries of and are copies of standard gaussian random variable if λn nw ωr and tr cr for some constant with high probability the estimate given by satisfies ωr ψr in this lemma there are three geometric ωr cr and ψr need to be determined for specific and in this work we focus on general atomic norms given set of atomic vectors rp the corresponding atomic norm of any rp is given by kθka inf ca ca ca in order for ka to be valid norm atomic vectors in has to span rp and iff the unit ball of atomic norm ka is given by ωa conv in addition we assume that the atomic 
set contains for any if belongs to where denotes the elementwise hadamard product for vectors this assumption guarantees that both ka and its dual norm are invariant under which is satisfied by many widely used norms such as norm owl norm and norm for the rest of the paper we will use ωa ta ca and ψa with replaced by appropriate subscript for specific norms for any vector and coordinate set we define us by zeroing out all the coordinates outside general analysis for atomic norms in this section we present detailed analysis of the general bounds for the geometric measures ωa ca and ψa in general knowing the atomic set is sufficient for bounding ωa for ca and ψa we only need single subgradient of kθ ka and some simple additional calculations gaussian width of unit norm ball although the atomic set may contain uncountably many vectors we assume that can be decomposed as union of simple sets am by simple we mean the gaussian width of each ai is easy to such decomposition assumption is often satisfied by commonly used atomic norms owl norm the gaussian width of the unit norm ball of ka can be easily obtained using the following lemma which is essentially the lemma in related results appear in lemma let am rp and am the gaussian width of unit norm ball of ka satisfies ωa conv max am sup log next we illustrate application of this result to bounding the width of the unit norm ball of and owl norm example norm recall that the norm can be viewed as the atomic norm induced by the set where ei is the canonical basis of rp since the gaussian width of singleton is if we treat as the union of individual and we have log log example owl norm recent variant of norm is the ordered weighted pp owl norm defined as kθkowl wi where wp are ordered weights and is the permutation of with entries sorted in decreasing order in the owl norm is proved to be an atomic norm with atomic set vs aowl ai us us pi wj supp we first apply lemma to each set ai and note that each ai contains pi atomic vectors ai log pi log log pi wj wj where is the average of wp then we apply the lemma again to aowl and obtain log ωowl aowl log log which matches the result in gaussian width of the intersection of tangent cone and unit sphere in this subsection we consider the computation of general ca using the definition of dual norm we can write kθ ka as kθ ka hu where denotes the dual norm of ka the for which kθ ka is subgradient of kθ ka one can obtain by simply solving the polar operator for the dual norm argmax hu based on polar operator we start with the lemma which plays key role in our analysis lemma let bepa solution to the polar operator and define the weighted as then the following relation holds ta where cone rp kθ kθ the proof of this lemma is in supplementary material note that the solution to may not be unique good criterion for choosing is to avoid zeros in as any will lead to the unboundedness of unit ball of which could potentially increase the size of next we present the upper bound for ca theorem suppose that is one of the solutions to and define the following sets the gaussian width ca is upper bounded by if is empty ca log if is nonempty where κmin and κmax proof by lemma we have ca hence we can focus on bounding we first analyze the structure of that satisfies kθ kθ for the coordinates the corresponding entries vi can be arbitrary since it does not affect the value of kθ thus all possible vq form subspace where for we define and and needs to satisfy which is similar to the tangent cone except that coordinates are weighted 
by therefore we use the techniques for proving the proposition in based on the structure of the normal cone at for is given by hz vi kv kθ zi for zi for for for any given standard gaussian random vector using the relation between gaussian width and statistical dimension proposition and in we have inf kz inf zj gj zk gk inf gj zk gk inf zk gk gk exp dgk exp the details for the derivation above can be found in appendix of if is empty by taking we have if rris nonempty we denote κmin and κmax taking log we obtain κmin exp min log κmax log κmin log substituting and into the last inequality completes the proof suppose that is vector we illustrate the above bound on the gaussian width of the spherical cap using norm and owl norm as examples example norm the dual norm of is norm and its easy to verify that rp is solution to applying theorem to we have log log example owl norm for owl its dual norm is given by hb ui we assume and solution to is given by ws in which is the average of wp if all wi are nonzero the gaussian width satisfies cowl log restricted norm compatibility the next theorem gives general upper bounds for the restricted norm compatibility ψa theorem assume that kuka max for all rp under the setting of theorem the restricted norm compatibility ψa is upper bounded by if isnempty ψa φq max κκmax if is nonempty min where kuka and φq supsupp kuka proof as analyzed in the proof of theorem vq for can be arbitrary and the vqc satisfies kvqc θq vi kθqc κmin kvr κmax kvs if is empty by lemma we obtain ψa kvka kvka sup sup if is nonempty we have ψa kvq ka kvqc ka kvka ka sup kv supp supp sup κmin kvr kvs kvka supp sup φq max sup supp κmin kvr kvs κmax κmin max sup supp φq max κmax κmin in which the last inequality in the first line uses the property of remark we call the unrestricted norm compatibility and φq the subspace norm compatibility both of which are often easier to compute than ψa the and in the assumption of ka can have multiple choices and one has the flexibility to choose the one that yields the tightest bound example norm to apply the theorem to norm we can choose and we recall the for norm whose is empty while is nonempty so we have for max example owl norm for owl note that kowl hence we choose and as result we similarly have for ψowl max tightness of the general bounds so far we have shown that the geometric measures can be upper bounded for general atomic norms one might wonder how tight the bounds in section are for these measures for ωa as the result from depends on the decomposition of for the ease of computation it might be tricky to discuss its tightness in general hence we will focus on the other two ca and ψa to characterize the tightness we need to compare the lower bounds of ca and ψa with their upper bounds determined by while there can be multiple it is easy to see that any convex combination of them is also solution to therefore we can always find that has the largest support supp supp for any other solution we will use such to generate the lower bounds first we need the following lemma for the cone ta lemma consider solution to which satisfies supp supp for any other solution under the setting of notations in theorem we define an additional set of coordinates then the tangent cone ta satisfies cl ta where denotes the direct minkowski sum operation cl denotes the closure rp vi for is subspace and rp sign vi θi for supp vi for supp is supp orthant the proof of lemma is given in supplementary material the following theorem gives us the lower bound for ca and ψa 
theorem under the setting of theorem and lemma the following lower bounds hold ca ψa proof to lower bound ca we use lemma and the relation between gaussian width and statistical dimension proposition in ta inf kz where the normal cone of is given by zi for sign zi sign for supp hence we have supp gj gj gi where the last equality follows the fact that supp this completes proof of to prove we again use lemma and the fact supp noting that ka is invariant under we get kvka kvka kvka ψa sup sup sup kvk kvk supp remark we compare the lower bounds with the upper bounds if is empty and the lower bounds actually match the upper bounds up to constant factor for both ca and ψa if is nonempty the lower and upper bounds of ca differ by multiplicative factor log which can be small in practice for ψa as φq min we usually have at most an additive term in upper bound since the assumption on ka often holds with constant and for most norms application to the norm in this section we apply our general results on geometric measures to example ksupport norm which has been proved effective for sparse recovery the norm can be viewed as an atomic norm for which rp the norm can be explicitly expressed as an infimum convolution given by nx kθksp kui kui inf ui and its dual norm is the symmetric gauge norm defined as kθksp kθk it is straightforward to see that the dual norm is simply the norm of the largest entries in suppose that all the sets of coordinates with cardinality can be listed as then can be written as where each ai rp supp si it is not difficult to see that ai ha gi ekgsi ekgsi using lemma we know the gaussian width of the unit ball of norm sp ωk log log log which matches that in now we turn to the calculation of cksp and ψsp as we have seen in the general analysis the solution to the polar operator is important in characterizing the two quantities we first present simple procedure in algorithm for solving the polar operator for ksp procedure can be utilized to compute the time complexity is only log this the norm or be applied to estimation with ksp using generalized conditional gradient method which requires solving the polar operator in each iteration algorithm solving polar operator for ksp input rp positive integer output solution to the polar operator for to do kzi if and then is vector with all ones end if end for change the sign and order of to conform with return theorem for given algorithm returns solution to polar operator for ksp the proof of this theorem is provided in supplementary material now we consider cksp and ψsp for here means supp in three scenarios overspecified where ii exactly specified where and iii where the bounds are given in theorem and the proof is also in supplementary material theorem for given rp the gaussian width cksp and the restricted norm compatibility ψsp for specified are given by if if sp θmax if cksp log if ψk min min κmax if max log if κmin and θmin where θmax sp remark previously ψsp is unknown and the bound on ck given in is loose as it used the result in based on theorem we note that the choice of can affect the recovery guarantees leads to direct dependence on the dimensionality for cksp and ψsp resulting in weak error bound the bounds are sharp for exactly specified or underspecified thus it is better to in practice where the estimation error satifies log conclusions in this work we study the problem of structured estimation with general atomic norms that are invariant under based on estimators we provide the general bounds for the geometric measures in 
terms of ωa instead of comparison with other results or direct calculation we demonstrate third way to compute it based on decomposition of atomic set for ca and ψa we derive general upper bounds which only require the knowledge of single subgradient of kθ ka we also show that these upper bounds are close to the lower bounds which makes them practical in general to illustrate our results we discuss the application to norm in details and shed light on the choice of in practice acknowledgements the research was supported by nsf grants and by nasa grant references amelunxen lotz mccoy and tropp living on the edge phase transitions in convex programs with random data inform inference argyriou foygel and srebro sparse prediction with the norm in advances in neural information processing systems nips banerjee chen fazayeli and sivakumar estimation with norm regularization in advances in neural information processing systems nips bickel ritov and tsybakov simultaneous analysis of lasso and dantzig selector the annals of statistics bogdan van den berg su and candes statistical estimation and testing via the sorted norm cai liang and rakhlin geometrizing local rates of convergence for linear inverse problems candes and tao the dantzig selector statistical estimation when is much larger than the annals of statistics romberg and tao stable signal recovery from incomplete and inaccurate measurements communications on pure and applied mathematics cands and recht simple bounds for recovering models math chandrasekaran recht parrilo and willsky the convex geometry of linear inverse problems foundations of computational mathematics chatterjee chen and banerjee generalized dantzig selector application to the norm in advances in neural information processing systems nips chen and banerjee compressed sensing with the norm in international conference on artificial intelligence and statistics aistats figueiredo and nowak sparse estimation with strongly correlated variables using ordered weighted regularization gordon some inequalities for gaussian processes and applications israel journal of mathematics jacob obozinski and vert group lasso with overlap and graph lasso in international conference on machine learning icml maurer pontil and an inequality with applications to structured sparsity and multitask dictionary learning in conference on learning theory colt mcdonald pontil and stamos spectral norm regularization in advances in neural information processing systems nips negahban ravikumar wainwright and yu unified framework for the analysis of regularized statistical science oymak thrampoulidis and hassibi the of generalized lasso precise analysis plan and vershynin robust compressed sensing and sparse logistic regression convex programming approach ieee transactions on information theory rao recht and nowak universal measurement bounds for structured sparse signal recovery in international conference on artificial intelligence and statistics aistats tibshirani regression shrinkage and selection via the lasso journal of the royal statistical society series tropp convex recovery of structured signal from independent random linear measurements in sampling theory renaissance yuan and lin model selection and estimation in regression with grouped variables journal of the royal statistical society series zeng and figueiredo the ordered weighted norm atomic formulation projections and algorithms zhang yu and schuurmans polar operators for structured sparse estimation in advances in neural information processing 
Systems (NIPS).
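As a small numerical illustration of the Gaussian width w(A) = E[sup_{u in A} <u, g>] that drives the bounds in the preceding paper, the following sketch (ours, not from the paper) estimates the width of the unit l1-norm ball, whose atomic set is {±e_1, ..., ±e_p}, by Monte Carlo; the sqrt(2 log 2p) value printed alongside is the standard maximal-inequality bound, shown only as a reference point.

```python
import numpy as np

# Illustration only (not from the paper): Monte Carlo estimate of the Gaussian
# width w(A) = E[ sup_{u in A} <u, g> ], g ~ N(0, I_p), for the unit l1-norm
# ball.  Its atomic set is {±e_1, ..., ±e_p}, the supremum over the convex
# hull is attained at an atom, so sup_{u} <u, g> = ||g||_inf, and the width
# grows like sqrt(log p).  The sqrt(2 log 2p) comparison below is the standard
# maximal-inequality bound, used here only as a reference point.

def gaussian_width_l1_ball(p, n_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n_samples, p))
    return np.abs(G).max(axis=1).mean()      # Monte Carlo estimate of E ||g||_inf

if __name__ == "__main__":
    for p in (10, 100, 1000):
        w_hat = gaussian_width_l1_ball(p)
        print(f"p={p:5d}  estimated width {w_hat:.3f}  "
              f"sqrt(2 log 2p) = {np.sqrt(2 * np.log(2 * p)):.3f}")
```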
A Complete Recipe for Stochastic Gradient MCMC
Yi-An Ma, Tianqi Chen, and Emily Fox (University of Washington)

Abstract. Many recent Markov chain Monte Carlo (MCMC) samplers leverage continuous dynamics to define a transition kernel that efficiently explores a target distribution. In tandem, a focus has been on devising scalable variants that subsample the data and use stochastic gradients in place of full-data gradients in the dynamic simulations. However, such stochastic gradient MCMC samplers have lagged behind their full-data counterparts in terms of the complexity of dynamics considered, since proving convergence in the presence of the stochastic gradient noise is nontrivial; even with simple dynamics, significant physical intuition is often required to modify the dynamical system to account for the stochastic gradient noise. In this paper, we provide a general recipe for constructing MCMC samplers, including stochastic gradient versions, based on continuous Markov processes specified via two matrices. We constructively prove that the framework is complete; that is, any continuous Markov process that provides samples from the target distribution can be written in our framework. We show how previous samplers can be trivially reinvented in our framework, avoiding the complicated sampler-specific proofs. We likewise use our recipe to straightforwardly propose a new sampler, stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC). Our experiments on simulated data and a streaming Wikipedia analysis demonstrate that the proposed SGRHMC sampler inherits the benefits of Riemann HMC with the scalability of stochastic gradient methods.

Introduction. Markov chain Monte Carlo (MCMC) has become a de facto tool for Bayesian posterior inference. However, these methods notoriously mix slowly in complex models and scale poorly to large datasets. The past decades have seen a rise in MCMC methods that provide more efficient exploration of the posterior, such as Hamiltonian Monte Carlo (HMC) and its Riemann manifold variant. This class of samplers defines a potential energy function in terms of the target posterior distribution and then devises various continuous dynamics to explore the energy landscape, enabling proposals of distant states. The gain in efficiency of exploration, however, often comes at the cost of a significant computational burden on large datasets.

Recently, stochastic gradient variants of such samplers have proven useful in scaling these methods to large datasets: at each iteration, the samplers use only a subset of the data rather than the full dataset. Stochastic gradient Langevin dynamics (SGLD) innovated in this area by connecting stochastic optimization with a Langevin-dynamics MCMC technique, showing that adding the right amount of noise to stochastic gradient ascent iterates leads to samples from the target posterior as the step size is annealed. Stochastic gradient Hamiltonian Monte Carlo (SGHMC) builds on this idea but, importantly, incorporates the efficient exploration provided by the HMC momentum term. A key insight in that paper was that a naive stochastic gradient variant of HMC actually leads to an incorrect stationary distribution; instead, a modification to the dynamics underlying HMC is needed to account for the stochastic gradient noise. Variants of both SGLD and SGHMC with further modifications to improve efficiency have also recently been proposed.

In the plethora of past MCMC methods that explicitly leverage continuous dynamics (HMC, Riemann manifold HMC, and their stochastic gradient counterparts), the focus has been on showing that the intricate dynamics leave the target posterior distribution invariant. Innovating in this arena
requires constructing novel dynamics and simultaneously ensuring that the target distribution is the stationary distribution this can be quite challenging and often requires significant physical and geometrical intuition natural question then is whether there exists general recipe for devising such mcmc methods that naturally lead to invariance of the target distribution in this paper we answer this question to the affirmative furthermore and quite importantly our proposed recipe is complete that is any continuous markov process with no jumps with the desired invariant distribution can be cast within our framework including hmc riemann manifold hmc sgld sghmc their recent variants and any future developments in this area that is our method provides unifying framework of past algorithms as well as practical tool for devising new samplers and testing the correctness of proposed samplers the recipe involves defining stochastic system parameterized by two matrices positive semidefinite diffusion matrix and curl matrix where with our model parameters of interest and set of auxiliary variables the dynamics are then written explicitly in terms of the target stationary distribution and these two matrices by varying the choices of and we explore the space of mcmc methods that maintain the correct invariant distribution we constructively prove the completeness of this framework by converting general continuous markov process into the proposed dynamic structure for any given and target distribution we provide practical algorithms for implementing either or variants of the sampler in sec we cast many previous samplers in our framework finding their and we then show how these existing and building blocks can be used to devise new samplers we leave the question of exploring the space of and to the structure of the target distribution as an interesting direction for future research in sec we demonstrate our ability to construct new and relevant samplers by proposing stochastic gradient riemann hamiltonian monte carlo the existence of which was previously only speculated we demonstrate the utility of this sampler on synthetic data and in streaming wikipedia analysis using latent dirichlet allocation complete stochastic gradient mcmc framework we start with the standard mcmc goal of drawing samples from target distribution which we take to be the posterior of model parameters rd given an observed dataset throughout we assumep data we write exp with potential function log log algorithms like hmc further augment the space of interest with auxiliary variables and sample from exp with hamiltonian such that exp dr constant marginalizing the auxiliary variables gives us the desired distribution on in this paper we generically consider as the samples we seek to draw could represent itself or an augmented state space in which case we simply discard the auxiliary variables to perform the desired marginalization as in hmc the idea is to translate the task of sampling from the posterior distribution to simulating from continuous dynamical system which is used to define markov transition kernel that is over any interval the differential equation defines mapping from the state at time to the state at time one can then discuss the evolution of the distribution under the dynamics as characterized by the equation for stochastic dynamics or the liouville equation for deterministic dynamics this evolution can be used to analyze the invariant distribution of the dynamics ps when considering deterministic dynamics as in hmc 
jump process must be added to ensure ergodicity if the resulting stationary distribution is equal to the target posterior then simulating from the process can be equated with drawing samples from the posterior if the stationary distribution is not the target distribution mh correction can often be applied unfortunately such correction steps require costly computation on the entire dataset even if one can compute the mh correction if the dynamics do not nearly lead to the correct stationary distribution then the rejection rate can be high even for short simulation periods furthermore for many stochastic gradient mcmc samplers computing the probability of the reverse path is infeasible obviating the use of mh as such focus in the literature is on defining dynamics with the right target distribution especially in scenarios where mh corrections are computationally burdensome or infeasible devising sdes with specified target stationary distribution generically all continuous markov processes that one might consider for sampling can be written as stochastic differential equation sde of the form dz dt dw where denotes the deterministic drift and often relates to the gradient of is ddimensional wiener process and is positive semidefinite diffusion matrix clearly however not all choices of and yield the stationary distribution ps exp when as in hmc the dynamics of eq become deterministic our exposition focuses on sdes but our analysis applies to deterministic dynamics as well in this case our using the liouville equation in place of that the deterministic dynamics leave the target distribution invariant for ergodicity jump process must be added which is not considered in our recipe but tends to be straightforward momentum resampling in hmc to devise recipe for constructing sdes with the correct stationary distribution we propose writing directly in terms of the target distribution γi dij qij here is curl matrix representing the deterministic traversing effects seen in hmc procedures in contrast the diffusion matrix determines the strength of the diffusion matrices and can be adjusted to attain faster convergence to the posterior distribution more detailed discussion on the interpretation of and and the influence of specific choices of these matrices is provided in the supplement importantly as we show in theorem sampling the stochastic dynamics of eq according to integral with as in eq leads to the desired posterior distribution as the stationary distribution ps exp that is for any choice of positive semidefinite and skewsymmetric parameterizing we know that simulating from eq will provide samples from discarding any sampled auxiliary variables assuming the process is ergodic theorem ps exp is stationary distribution of the dynamics of eq if is restricted to the form of eq with positive semidefinite and if is positive definite or if ergodicity can be shown then the stationary distribution is unique proof the equivalence of ps and the target exp can be shown using the description of the probability density evolution under the dynamics of eq fi dij eq can be further transformed into more compact form we can verify that is invariant under eq by calculating if the process is ergodic this invariant distribution is unique the equivalence of the compact form was originally proved in we include detailed proof in the supplement for completeness processes with ps defined by all continuous markov processes figure the red space represents the set of all continuous markov processes point in the black space 
represents continuous markov process defined by eqs based on specific choice of by theorem each such point has stationary distribution ps the blue space represents all continuous markov processes with ps theorem states that these blue and black spaces are equivalent there is no gap and any point in the blue space has corresponding in our framework completeness of the framework an important question is what portion of samplers defined by continuous markov processes with the target invariant distribution can we define by iterating over all possible and in theorem we show that for any continuous markov process with the desired stationary distribution ps there exists an sde as in eq with defined as in eq we know from the chapmankolmogorov equation that any continuous markov process with stationary distribution ps can be written as in eq which gives us the diffusion matrix theorem then constructively defines the curl matrix this result implies that our recipe is complete that is we cover all possible continuous markov process samplers in our framework see fig theorem for the sde density function ps of eq suppose probability pd niquely exists and that fi ps dij ps is integrable with respect to the lebesgue measure then there exists such that eq holds the integrability condition is usually satisfied when the probability density function uniquely exists constructive proof for the existence of is provided in the supplement practical algorithm in practice simulation relies on an of the sde leading to update rule zt zt zt zt zt zt calculating the gradient of involves evaluating the gradient of for stochastic gradient method the assumption is that is too computationally intensive to compute as it relies on sum over all data points see sec instead such stochastic gradient algorithms examine independently sampled data subsets se and the corresponding potential for these data log log se is an unbiased estimator of as such gradient the specific form of eq implies that stochastic gradient noisy but unbiased estimator computed based on of the gradient the key question in many of the existing stochastic gradient mcmc algorithms is whether the noise injected by the stochastic gradient adversely affects the stationary in place of one way to analyze the distribution of the modified dynamics using impact of the stochastic gradient is to make use of the central limit theorem and assume resulting in noisy hamiltonian gradient simply plugging in in place of in eq results in dynamics with an additional noise term zt zt to counteract this assume we have an estimate of the variance of this additional noise satisfying zt positive semidefinite with small this is always true since the stochastic gradient noise scales down faster than the added noise then we can attempt to account for the stochastic gradient noise by simulating zt zt zt zt zt this provides our stochastic variant of the sampler in eq the noise introduced by the stochastic gradient is multiplied by and the compensation by implying that the discrepancy between these dynamics and those of eq approaches zero as goes to zero as such in this infinitesimal step size limit since eq yields the correct invariant distribution so does eq this avoids the need for costly or potentially intractable mh correction however having to decrease to zero comes at the cost of increasingly small updates we can also use finite small step size in practice resulting in biased but faster sampler similar tradeoff was used in to construct mh samplers in addition to being used in sgld 
and sghmc applying the theory to construct samplers casting previous mcmc algorithms within the proposed framework we explicitly state how some recently developed mcmc methods fall within the proposed framework based on specific choices of and in eq and for the stochastic gradient methods we show how our framework can be used to reinvent the samplers by guiding their construction and avoiding potential mistakes or inefficiencies caused by implementations hamiltonian monte carlo hmc the key ingredient in hmc is hamiltonian dynamics which simulate the physical motion of an object with position momentum and mass on an frictionless surface as follows typically leapfrog simulation is used instead θt rt rt θt eq is case of the proposed framework with and stochastic gradient hamiltonian monte carlo sghmc as discussed in simply replace in eq results in the following updates ing by the stochastic gradient θt rt naive θt rt θt θt rt where the arises from the approximation of eq careful study shows that eq can not be rewritten into our proposed framework which hints that such stochastic gradient version of hmc is not correct interestingly the authors of proved that this version indeed does not have the correct stationary distribution in our framework we see that the noise term is paired with term hinting that such term should be added to eq here which means we need to add interestingly this is the correction strategy proposed in but through physical interpretation of the dynamics in particular the term or generically where has an interpretation as friction and leads to second order langevin dynamics θt rt θt rt rt here is an estimate of θt this method now fits into our framework with and as in hmc but with this example shows how our theory can be used to identify invalid samplers and provide guidance on how to effortlessly correct the mistakes this is crucial when physical intuition is not available once the proposed sampler is cast in our framework with specific and there is no need for proofs such as those of stochastic gradient langevin dynamics sgld sgld proposes to use the following first order no momentum langevin dynamics to generate samples θt θt this algorithm corresponds to taking with and as motivated by eq of our framework the variance of the stochastic gradient can be subtracted from the sampler injected noise to make the finite stepsize simulation more accurate this variant of sgld leads to the stochastic gradient fisher scoring algorithm stochastic gradient riemannian langevin dynamics sgrld sgld can be generalized to use an adaptive diffusion matrix specifically it is interesting to take where is the fisher information metric the sampler dynamics are given by θt θt θt θt θt taking and this sgrld method falls into our framep work with correction term γi it is interesting to note that in earlier literature more recently it was found that γi was taken to be ij this correction term corresponds to the distribution function with respect to measure for the lebesgue measure the revised γi was as determined by our framework again we have an example of our theory providing guidance in devising correct samplers stochastic gradient thermostat sgnht finally the sgnht method incorporates ideas from thermodynamics to further increase adaptivity by augmenting the sghmc system with an additional scalar auxiliary variable the algorithm uses the following dynamics θt rt rt θt rt rt rt ξt we can take and to place these dynamics within our framework summary in our framework sgld and sgrld take and instead 
stress the design of the diffusion matrix with sgld using constant and sgrld an adaptive diffusion matrix to better account for the geometry of the space being explored on the other hand hmc takes and focuses on the curl matrix sghmc combines sgld with hmc through and matrices sgnht then extends sghmc by taking to be state dependent the relationships between these methods are depicted in the supplement which likewise contains discussion of the tradeoffs between these two matrices in short can guide escaping from local modes while can enable rapid traversing of regions especially when state adaptation is incorporated we readily see that most of the product space defining the space of all possible samplers has yet to be filled stochastic gradient riemann hamiltonian monte carlo in sec we have shown how our framework unifies existing samplers in this section we now use our framework to guide the development of new sampler while sghmc inherits the momentum term of hmc making it easier to traverse the space of parameters the underlying geometry of the target distribution is still not utilized such information can usually be represented by the fisher information metric denoted as which can be used to precondition the dynamics for our proposed system we consider rt as in methods and modify the and of sghmc to account for the geometry as follows we refer to this algorithm as stochastic gradient riemann hamiltonian monte carlo sgrhmc our theory holds for any positive definite yielding generalized sgrhmc gsgrhmc algorithm which can be helpful when the fisher information metric is hard to compute implementation of sghmc algorithm might simply precondition the and iii add friction term on the hmc update ii replace by order of the diffusion matrix to counterbalance the noise as in sghmc resulting in θt θt rt naive rt θt θt θt rt θt algorithm generalized stochastic gradient riemann hamiltonian monte carlo initialize for do optionally periodically resample momentum as θt θt rt σt θt θt θt θt rt σt rt θt end divergence divergence naive gsgrhmc sgld sghmc gsgrhmc sgld sghmc gsgrhmc figure left for two simulated distributions defined by one peak and two peaks we compare the kl divergence of methods sgld sghmc the sgrhmc of eq and the gsgrhmc of eq relative to the true distribution in each scenario left and right bars labeled by and right for correlated distribution with we see that our gsgrhmc most rapidly explores the space relative to sghmc and sgld contour plots of the distribution along with paths of the first sampled points are shown for each method however as we show in sec samples from these dynamics do not converge to the desired distribution indeed this system can not be written within our framework instead we can simply follow our framework and as indicated by eq consider the following update rule θt θt rt θt θt θt rt θt rt which includes correction term with component ij the practical implementation of gsgrhmc is outlined in algorithm experiments in sec we show that gsgrhmc can excel at rapidly exploring distributions with complex landscapes we then apply sgrhmc to sampling in latent dirichlet allocation lda model on large wikipedia dataset in sec the supplement contains details on the specific samplers considered and the parameter settings used in these experiments synthetic experiments in this section we aim to empirically validate the correctness of our recipe and ii assess the effectiveness of gsgrhmc in fig left we consider two univariate distributions shown in the supplement and compare 
sgld sghmc the sghmc of eq and our proposed gsgrhmc of eq see the supplement for the form of as expected the implementation does not converge to the target distribution in contrast the gsgrhmc algorithm obtained via our recipe indeed has the correct invariant distribution and efficiently explores the distributions in the second experiment we sample bivariate distribution with strong correlation the results are shown in fig right the comparison between sgld sghmc and our gsgrhmc method shows that both preconditioner and hamiltonian dynamics help to make the sampler more efficient than either element on its own original lda expanded mean parameter βkw θkw βkw pθkw θkw prior θk dir θkw perplexity method sgld sghmc sgrld sgrhmc average runtime per docs sgld sghmc sgrld sgrhmc number of documents figure upper left expanded mean parameterization of the lda model lower left average runtime per wikipedia entries for all methods right perplexity versus number of wikipedia entries processed online latent dirichlet allocation we also applied sgrhmc with diag the fisher information metric to an online latent dirichlet allocation lda analysis of topics present in wikipedia entries in lda each topic is associated with distribution over words with βkw the probability of word under topic each document is comprised of mixture of topics with πk the probability of topic in document documents are generated by first selecting topic zj for the jth word and then drawing the specific word from the topic as xj βz typically and βk are given dirichlet priors the goal of our analysis here is inference of the topic distributions βk since the wikipedia dataset is large and continually growing with new articles it is not practical to carry out this task over the whole dataset instead we scrape the corpus from wikipedia in streaming manner and sample parameters based on minibatches of data following the approach in we first analytically marginalize the document distributions and to resolve the boundary issue posed by the dirichlet posterior of βk defined on the probability simplex use an expanded mean parameterization shown in figure upper left under this parameterization we then compute log and in our implementation use boundary reflection to ensure the positivity of parameters θkw the necessary expectation over topic indicators zj is approximated using gibbs sampling separately on each document as in the supplement contains further details for all the methods we report results of three random runs when sampling distributions with mass concentrated over small regions as in this application it is important to incorporate geometric information via riemannian sampler the results in fig right indeed demonstrate the importance of riemannian variants of the stochastic gradient samplers however there also appears to be some benefits gained from the incorporation of the hmc term for both the riemmannian and nonreimannian samplers the average runtime for the different methods are similar see fig lower left since the main computational bottleneck is the gradient evaluation overall this application serves as an important example of where our newly proposed sampler can have impact conclusion we presented general recipe for devising mcmc samplers based on continuous markov processes our framework constructs an sde specified by two matrices positive semidefinite and we prove that for any and we can devise continuous markov process with specified stationary distribution we also prove that for any continuous markov process with the 
target stationary distribution there exists and that cast the process in our framework our recipe is particularly useful in the more challenging case of devising stochastic gradient mcmc samplers we demonstrate the utility of our recipe in reinventing previous stochastic gradient mcmc samplers and in proposing our sgrhmc method the efficiency and scalability of the sgrhmc method was shown on simulated data and streaming wikipedia analysis acknowledgments this work was supported in part by onr grant nsf career award and the terraswarm research center sponsored by marco and darpa we also thank lei wu for helping with the proof of theorem and professors ping ao and hong qian for many discussions references ahn korattikara and welling bayesian posterior sampling via stochastic gradient fisher scoring in proceedings of the international conference on machine learning icml ahn shahbaba and welling distributed stochastic gradient mcmc in proceeding of international conference on machine learning icml bardenet doucet and holmes towards scaling up markov chain monte carlo an adaptive subsampling approach in proceedings of the international conference on machine learning icml betancourt the fundamental incompatibility of scalable hamiltonian monte carlo and naive data subsampling in proceedings of the international conference on machine learning icml blei ng and jordan latent dirichlet allocation journal of machine learning research march chen fox and guestrin stochastic gradient hamiltonian monte carlo in proceeding of international conference on machine learning icml ding fang babbush chen skeel and neven bayesian sampling using stochastic gradient thermostats in advances in neural information processing systems nips duane kennedy pendleton and roweth hybrid monte carlo physics letters feller introduction to probability theory and its applications john wiley sons girolami and calderhead riemann manifold langevin and hamiltonian monte carlo methods journal of the royal statistical society series korattikara chen and welling austerity in mcmc land cutting the metropolishastings budget in proceedings of the international conference on machine learning icml neal mcmc using hamiltonian dynamics handbook of markov chain monte carlo patterson and teh stochastic gradient riemannian langevin dynamics on the probability simplex in advances in neural information processing systems nips risken and frank the equation methods of solutions and applications springer robbins and monro stochastic approximation method the annals of mathematical statistics shi chen yuan yuan and ao relation of new interpretation of stochastic differential equations to process journal of statistical physics welling and teh bayesian learning via stochastic gradient langevin dynamics in proceedings of the international conference on machine learning icml pages june xifara sherlock livingstone byrne and girolami langevin diffusions and the langevin algorithm statistics probability letters yin and ao existence and construction of dynamical potential in nonequilibrium processes without detailed balance journal of physics mathematical and general zwanzig nonequilibrium statistical mechanics oxford university press 
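To illustrate the simplest instance of the recipe above, the following sketch (ours, not the paper's experiments) runs SGLD, i.e., the choice D(z) = I and Q(z) = 0 with no auxiliary variables, on a toy conjugate Gaussian model where the exact posterior is available for comparison. The model, prior, step size, and batch size are illustrative assumptions; as discussed above, a fixed finite step size yields a biased but fast sampler, and annealing the step size removes the bias.

```python
import numpy as np

# Minimal sketch of the recipe's simplest instance, SGLD (D = I, Q = 0):
#   theta_{t+1} = theta_t - eps * grad Utilde(theta_t) + N(0, 2*eps*I),
# where grad Utilde is a minibatch (stochastic) estimate of grad U.
# Toy model (illustrative, not from the paper): x_i ~ N(theta, 1) with a
# N(0, sigma^2 = 10) prior, so the exact posterior is Gaussian and easy to
# check.  A fixed step size gives a biased but fast sampler; annealing
# eps -> 0 removes the bias.

def stoch_grad_U(theta, data, batch, prior_var=10.0):
    """Unbiased minibatch estimate of grad U = -grad log p(theta, data)."""
    n, m = len(data), len(batch)
    grad_prior = theta / prior_var
    grad_lik = (n / m) * np.sum(theta - data[batch])   # rescaled minibatch sum
    return grad_prior + grad_lik

def sgld(data, n_iter=20000, eps=5e-5, batch_size=50, seed=0):
    rng = np.random.default_rng(seed)
    theta, samples = 0.0, []
    for _ in range(n_iter):
        batch = rng.integers(0, len(data), size=batch_size)
        g = stoch_grad_U(theta, data, batch)
        theta = theta - eps * g + rng.normal(scale=np.sqrt(2 * eps))
        samples.append(theta)
    return np.array(samples)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(loc=1.0, scale=1.0, size=500)
    samples = sgld(data)[5000:]                        # drop burn-in
    # Exact Gaussian posterior (unit observation noise) for comparison.
    post_var = 1.0 / (1.0 / 10.0 + len(data))
    post_mean = post_var * data.sum()
    print(f"SGLD mean/var : {samples.mean():.3f} / {samples.var():.5f}")
    print(f"exact mean/var: {post_mean:.3f} / {post_var:.5f}")
```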
bandit smooth convex optimization improving the tradeoff ofer dekel microsoft research redmond wa oferd ronen eldan weizmann institute rehovot israel roneneldan tomer koren technion haifa israel tomerk abstract bandit convex optimization is one of the fundamental problems in the field of online learning the best algorithm for the general bandit convex optimizae while the best known lower bound tion problem guarantees regret of is many attempts have been made to bridge the huge gap between these bounds particularly interesting special case of this problem assumes that the loss functions are smooth in this case the best known algorithm guarantees ree we present an efficient algorithm for the bandit smooth convex gret of our result rules out optimization problem that guarantees regret of an lower bound and takes significant step towards the resolution of this open problem introduction bandit convex optimization is the following online learning problem first an adversary privately chooses sequence of bounded and convex loss functions ft defined over convex domain in euclidean space then randomized decision maker iteratively chooses sequence of points xt where each xt on iteration after choosing the point xt the decision maker incurs loss of ft xt and receives bandit feedback he observes the value of his loss but he does not receive any other information about the function ft the decision maker uses the feedback to make better choices on subsequent rounds his goal is to minimize regret which is the difference between his loss and the loss incurred by the best fixed point in if the regret grows sublinearly with it indicates that the decision maker performance improves as the length of the sequence increases and therefore we say that he is learning finding an optimal algorithm for bandit convex optimization is an elusive open problem the first algorithm for this problem was presented in flaxman et al and guarantees regret of for any sequence of loss functions here and throughout the asymptotic notation hides polynomial dependence on the dimension as well as logarithmic factors despite the ongoing effort to improve on this rate it remains the state of the art on the other hand dani et al proves that for any algorithm there exists sequence of loss functions for which and the gap between the upper and lower bounds is huge while no progress has been made on the general form of the problem some progress has been made in interesting special cases specifically if the bounded convex loss functions are also assumed to be if the loss funclipschitz flaxman et al improves their regret guarantee to tions are smooth namely their gradients are lipschitz saha and tewari present an algorithm similarly if the loss functions are bounded lipschitz and with guaranteed regret of if even stronger assumptions are made an strongly convex the guaranteed regret is optimal regret rate of can be guaranteed namely when the loss functions are both smooth and when they are lipschitz and linear and when lipschitz loss functions are not generated adversarially but drawn from fixed and unknown distribution recently bubeck et al made progress that did not rely on additional assumptions such as lipschitz smoothness or strong convexity but instead considered the general problem in the onee regret dimensional case that result proves that there exists an algorithm with optimal for arbitrary univariate convex functions ft subsequently and after the current paper was written bubeck and eldan generalized this result to bandit convex 
optimization in general euclidean spaces albeit requiring lipschitz assumption however the proofs in both papers are and do not give any hint on how to construct concrete algorithm nor any indication that an efficient algorithm exists the current state of the bandit convex optimization problem has given rise to two competing conjectures some believe that there exists an efficient algorithm that matches the current lower bound meanwhile others are trying to prove larger lower bounds in the spirit of even under the assumption that the loss functions are smooth if the lower bound is loose natural guess in this paper we take an important step towards the of the true regret rate would be against resolution of this problem by presenting an algorithm that guarantees regret of any sequence of bounded convex smooth loss functions compare this result to the previous statee noting that and this result rules out result of the possibility of proving lower bound of with smooth functions while there remains sizable gap with the lower bound our result brings us closer to finding the elusive optimal algorithm for bandit convex optimization at least in the case of smooth functions our algorithm is variation on the algorithms presented in with one new idea these algorithms all follow the same template on each round the algorithm computes an estimate of rft xt the gradient of the current loss function at the current point by applying random perturbation to xt the sequence of gradient estimates is then plugged into online optimization technique the technical challenge in the analysis of these algorithms is to bound the bias and the variance of these gradient estimates our idea is take window of consecutive gradient estimates and average them producing new gradient estimate with lower variance and higher bias overall the new tradeoff works in our favor and allows us to improve the regret averaging uncorrelated random vectors to reduce variance is technique but applying it in the context of bandit convex optimization algorithm is easier said than done and requires us to overcome number of technical difficulties for example the gradient estimates in our window are taken at different points which introduces new type of bias another example is the difficulty that arrises when the sequence xs xt travels adjacent to the boundary of the convex set imagine transitioning from one face of hypercube to another the random perturbation applied to xs and xt could be supported on orthogonal directions yet we average the resulting gradient estimates and expect to get meaningful gradient estimate while the basic idea is simple our technical analysis is not and may be of independent interest preliminaries we begin by defining smooth bandit convex optimization more formally and recalling several basic results from previous work on the problem flaxman et al abernethy et al saha and tewari that we use in our analysis we also review the necessary background on barrier functions smooth bandit convex optimization in the bandit convex optimization problem an adversary first chooses sequence of convex functions ft where is closed and convex domain in rd then on each round randomized decision maker has to choose point xt and after committing to his decision he incurs loss of ft xt and observes this loss as feedback the decision maker expected loss where expectation is taken with respect to his random choices is ft xt and in fact we are aware of at least two separate research groups that invested time trying to prove such an lower bound 
his regret is ft xt min ft throughout we use the notation et to indicate expectations conditioned on all randomness up to and including round we make the following assumptions first we assume that each of the functions ft is llipschitz with respect to the euclidean norm namely that ft lkx for all we further assume that ft is with respect to which is to say that krft rft kx in particular this implies that ft is continuously differentiable over finally we assume that the euclidean diameter of the decision domain is bounded by first order algorithms with estimated gradients the online convex optimization problem becomes much easier in the full information setting where the decision maker feedback includes the vector gt rft xt the gradient or subgradient of ft at the point xt in this setting the decision maker can use online algorithm such as the projected online gradient descent algorithm or dual averaging sometimes known as follow the regularized leader and guarantee regret of the dual averaging approach sets xt to be the solution to the following optimization problem xt arg min gs where is suitably chosen regularizer and for all and we define weight typically all of the weights are set to constant value called the learning rate parameter however since we are not in the full information setting and the decision maker does not observe gt the algorithms mentioned above can not be used directly the key observation of flaxman et al which is later reused in all of the work is that gt can be estimated by randomly perturbing the point xt specifically on round the algorithm chooses the point yt xt at ut instead of the original point xt where is parameter that controls the magnitude of the perturbation at is positive definite matrix and ut is drawn from the uniform distribution on the unit sphere in flaxman et al at is simply set to the identity matrix whereas in saha and tewari at is more carefully tailored to the point xt see details below in any case care should be taken to ensure that the perturbed point yt remains in the convex set the observed value ft yt is then used to compute the gradient estimate ft yt at ut and this estimate is fed to the optimization algorithm while is not an unbiased estimator of rft xt it is an unbiased estimator for the gradient of different function fˆt defined by fˆt et ft at where rd is uniformly drawn from the unit ball the function fˆt is smoothed version of ft which plays key role in our analysis and in many of the previous results on this topic the main property of fˆt is summarized in the following lemma lemma flaxman et al saha and tewari lemma for any differentiable function rd positive definite matrix rd and define au where is uniform on the unit sphere also let fˆ av where is uniform on the unit ball then rfˆ the difference between rft xt and rfˆt xt is the bias of the gradient estimator the analysis in flaxman et al abernethy et al saha and tewari focuses on bounding the bias and the variance of and their effect on the optimization algorithm barriers following our algorithm and analysis rely on the properties of barrier functions intuitively barrier is function defined on the interior of the convex body which is rather flat in most of the interior of and explodes to as we approach its boundary additionally barrier has some technical properties that are useful in our setting before giving the formal definition of barrier we define the local norm defined by barrier definition local norm induced by barrier let int be local norm induced by at the pointpx int 
is denoted by kzkx and defined as kzkx zt its dual norm is kzkx zt in words the local norm at is the mahalanobis norm defined by the hessian of at the point namely we now give formal definition of barrier definition barrier let rd be convex body function int is barrier for if is three times continuously differentiable ii as and iii for all int and rd satisfies and kykx this definition is given for completeness and is not directly used in our analysis instead we rely on some useful properties of barriers first and foremost there exists barrier for any convex body barriers are only known for specific classes of convex bodies such as polytopes yet we make the standard assumption that we have an efficiently computable barrier for the set another key feature of barrier is the set of dikin ellipsoids that it defines the dikin ellipsoid at int is simply the unit ball with respect to the local norm at key feature of the dikin ellipsoid is that it is entirely contained in the convex body for any see theorem another technical property of barriers is that its hessian changes slowly with respect to its local norm theorem nesterov and nemirovskii theorem let be convex body with selfconcordant barrier for any int and rd such that kzkx it holds that kzkx kzkx while the barrier explodes to infinity at the boundary of it is quite flat at points that are far from the boundary to make this statement formal we define an operation that multiplicatively shrinks the set toward the minimizer of called the analytic center of let arg min and assume without loss of generality that for any let ky denote the set the next theorem states that the barrier is flat in ky and explodes to in the thin shell between ky and theorem nesterov and nemirovskii propositions let be convex body with barrier let arg min and assume that for any it holds that ky log our assumptions on the loss functions as the lipschitz assumption or the smoothness assumption are stated in terms of the standard euclidean norm which we denote by therefore we will need to relate the euclidean norm to the local norms defined by the barrier this is accomplished by the following lemma whose proof appears in the supplementary material lemma let be convex body with barrier and let be the euclidean diameter of for any it holds that kzkx dkzkx for all rd barrier as regularizer looking back at the dual averaging strategy defined in eq we can now fill in some of the details that were left unspecified set the regularization in eq to be barrier for the set we use the following useful lemma from abernethy and rakhlin in our analysis algorithm bandit smooth convex optimization parameters perturbation parameter dual averaging weights barrier int initialize arbitrarily for at xt draw ut uniformly from the unit sphere yt xt at ut choose yt receive feedback ft yt ft yt at ut pt arg lemma abernethy and rakhlin let be convex body with barrier let gt be vectors in rd and let be such that kxt for all define pt xt arg then for all it holds that kxt kxt kxt pt pt ii for any it holds that gt xt kgt algorithms for bandit convex optimization that use regularizer also use the same barrier to obtain gradient estimates namely these algorithms perturb the dual averaging solution xt as in eq with the perturbation matrix at set to xt the root of the inverse hessian of at the point xt in other words the distribution of yt is supported on the dikin ellipsoid centered at xt scaled by since this form of perturbation guarantees that yt moreover if yt is generated in this way and used to 
construct the gradient estimator then the local norm of is bounded as specified in the following lemma lemma saha and tewari lemma let rd be convex body with barrier for any differentiable function and int define where au and is drawn uniformly from the unit sphere then main result our algorithm for the bandit smooth convex optimization problem is variant of the algorithm in saha and tewari and appears in algorithm following abernethy and rakhlin saha and tewari we use function as the dual averaging regularizer and we use its dikin ellipsoids to perturb the points xt the difference between our algorithm and previous ones is the introduction of dual averaging weights for and which allow us to vary the weight of each gradient in the dual averaging objective function in addition to the parameters and we introduce new buffering parameter which takes integer values we set the dual averaging weights in algorithm to be if if where is global learning rate parameter this choice of effectively decreases the influence of the feedback received on the most recent rounds if all of the become equal to and algorithm reduces to the algorithm in saha and tewari the surprising result is that there exists different setting of that gives better regret bound we introduce slight abuse of notation which helps us simplify the presentation of our regret bound we will eventually achieve the desired regret bound by setting the parameters and to be some functions of therefore from now on we treat the notation and as an abbreviation for the functional forms and respectively the benefit is that we can now use asymptotic notation to sweep meaningless terms under the rug we prove the following regret bound for this algorithm theorem let ft be sequence of loss functions where each ft is differentiable convex and and where rd is convex body of diameter with barrier for any and assume that algorithm is run with these parameters and with the weights defined in eq using and to generate the sequences xt and yt if and for any it holds that log dl kt specifically if wepset that log and we get bound in saha and tewari up note that if we set in our theorem we recover the to small numerical constant namely the dependence on and is the same analysis pt using the notation arg ft the decision maker regret becomes pt ft following flaxman et al saha and tewari we rewrite the ft yt regret as ft yt xt xt ft this decomposition essentially adds layer of hallucination to the analysis we pretend that the loss functions are fˆt instead of ft and we also pretend that we chose the points xt rather than yt we then analyze the regret in this pretend world this regret is the expression in eq finally we tie our analysis back to the real world by bounding the difference between that which we analyzed and the regret of the actual problem this difference is the sum of eq and eq the advantage of our pretend world over the real world is that we have unbiased gradient estimates that can plug into the dual averaging algorithm the algorithm in saha and tewari sets all of the dual averaging weights equal to the constant learning rate it decomposes the regret as in eq and their main technical result is the following bound for the individual terms theorem saha and tewari let ft be sequence of loss functions where each ft is differentiable convex and and where rd is convex body of diameter and barrier assume that algorithm is run with perturbation parameter and generates the sequences xt and yt then for any it holds that if additionally the dual averaging 
weights are all set to the constant learning rate then log by choosing the analysis in saha and tewari goes on to obtain regret bound of optimal values for the parameters and and plugging those values into theorem our analysis uses the first part of theorem to bound and shows that our careful choice of the dual averaging weights results in the following improved bound on we begin our analysis by defining moving average of the functions fˆt as follows xˆ where for soundness we let for also define moving average of gradient estimates again with for in section below we show how each can be used as biased estimate of xt also note that the choice of the dual averaging weights in eq is such pt pt that for all therefore the last step in algorithm basically performs dual averaging with the gradient estimates uniformly weighted by we use the functions to rewrite eq as fˆt xt xt xt fˆt this decomposition essentially adds yet another layer of hallucination to the analysis we pretend that the loss functions are instead of fˆt which are themselves pretend loss functions as described above eq is the regret in our new pretend scenario while eq eq is the difference between this regret and the regret in eq the following lemma bounds each of the terms in eq separately and summarizes the main technical contribution of our paper lemma under the conditions of theorem for any it holds that kt and log kt proof sketch of lemma as mentioned above the basic intuition of our technique is quite simple average the gradients to decrease their variance yet applying this idea in the analysis is tricky we begin by describing the main source of difficulty in proving lemma recall that our strategy is to pretend that the loss functions are and to use the random vector as biased estimator of xt naturally one of our goals is to show that this bias is small recall that each is an unbiased estimator of rfˆs xs conditioned on the history up to round specifically note that each vector in the sequence is gradient estimate at different point yet we average these vectors and claim that they accurately estimate at the current point xt luckily fˆt is so rfˆt xt should not be much different than rfˆt xt provided that we show that xt and xt are close to each other in euclidean distance to show that xt and xt are close we exploit the stability of the dual averaging algorithm particularly the first claim in lemma states that kxs kxs is controlled by kxs for all so now we need to show that kxs is small however is the average of gradient estimates taken at different points each is designed to have small norm with respect to its own local norm kxt for all we know it may be very large with respect to the current local norm kxt so now we need to show that the local norms at xt and xt are similar we could prove this if we knew that xt and xt are close to each is exactly what we set out to prove in the beginning this situation complicates our analysis considerably another component of our proof is the variance reduction analysis the motivation to average is to generate new gradient estimates with smaller variance while the random vectors are not independent we show that their randomness is uncorrelated therefore the variance of is times smaller than the variance of each however to make this argument formal we again require the local norms at xt and xt to be similar to make things more complicated there is the recurring need to move back and forth between local norms and the euclidean norm since the latter is used in the definition of lipschitz 
and smoothness all of this has to do with bounding eq the regret with respect to the pretend loss functions an additional bias term appears in the analysis of eq we conclude the paper by stating our main lemmas and sketching the proof lemma the full technical proofs are all deferred to the supplementary material and replaced with some high level commentary to break the situation described above we begin with crude bound on kxt which does not benefit at all from the averaging operation we simultaneously prove that the local norms at xt and xt are similar lemma if the parameters and are chosen such that kxt ii for any such that it holds that kzkxt then for all kzkxt lemma itself has aspect which we resolve using an inductive proof technique armed with the knowledge that local norms at xt and xt are similar we go on to prove the more refined bound on et which does benefit from averaging lemma if the parameters and are chosen such that et then the proof constructs martingale difference sequence and uses the fact that its increments are uncorrelated compare the above to lemma which proves that and note the extra in our of our hard work was aimed at getting this factor next we set out to bound the expected euclidean distance between xt and xt this bound is later needed to exploit the and assumptions the crude bound on kxs from lemma is enough to satisfy the conditions of lemma which then tells us that kxs kxs is controlled by kxs the latter enjoys the improved bound due to lemma integrating the resulting bound over time we obtain the following lemma lemma if the parameters and such that we have kxt are chosen such that then for all and any notice that xt and xt may be rounds apart but the bound scales only with the work of the averaging technique again this is finally we have all the tools in place to prove our main result lemma pk proof sketch the first term eq is bounded by rewriting xt xt and then proving that fˆt xt is not very far from fˆt xt this follows from the fact that fˆt is llipschitz and from lemma to bound the second term eq we use the convexity of each to write xt xt xt we relate the side above to xt using the fact that fˆt is and again using lemma then we upper bound the above using lemma theorem and lemma acknowledgments we thank jian ding for several critical contributions during the early stages of this research parts of this work were done while the second and third authors were at microsoft research the support of which is gratefully acknowledged references abernethy and rakhlin beating the adaptive bandit with high probability in information theory and applications workshop pages ieee abernethy hazan and rakhlin competing in the dark an efficient algorithm for bandit linear optimization in proceedings of the annual conference on learning theory colt agarwal dekel and xiao optimal algorithms for online convex optimization with bandit feedback in proceedings of the annual conference on learning theory colt agarwal foster hsu kakade and rakhlin stochastic convex optimization with bandit feedback in advances in neural information processing systems nips bubeck and regret analysis of stochastic and nonstochastic bandit problems foundations and trends in machine learning bubeck and eldan the entropic barrier simple and optimal universal barrier arxiv preprint bubeck and eldan exploration of convex functions and bandit convex optimization arxiv preprint bubeck dekel koren and peres bandit convex optimization regret in one dimension in in proceedings of the annual conference on 
learning theory colt dani hayes and kakade the price of bandit information for online optimization in advances in neural information processing systems nips dekel ding koren and peres bandits with switching costs regret in proceedings of the annual symposium on the theory of computing flaxman kalai and mcmahan online convex optimization in the bandit setting gradient descent without gradient in proceedings of the sixteenth annual acmsiam symposium on discrete algorithms pages society for industrial and applied mathematics hazan and levy bandit convex optimization towards tight bounds in advances in neural information processing systems nips nesterov subgradient methods for convex problems mathematical programming nesterov and nemirovskii polynomial algorithms in convex programming volume siam saha and tewari improved regret guarantees for online smooth convex optimization with bandit feedback in international conference on artificial intelligence and statistics aistat pages online learning and online convex optimization foundations and trends in machine learning zinkevich online convex programming and generalized infinitesimal gradient ascent in proceedings of the international conference on machine learning icml pages 
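To make the algorithmic template of the preceding paper concrete, the following sketch shows one-point gradient estimation with a window of averaged estimates feeding a dual-averaging update. It is a simplified illustration under stated assumptions, not the authors' algorithm: the perturbation matrix A_t is taken to be the identity rather than the Dikin-ellipsoid scaling of a self-concordant barrier, the domain is treated as unconstrained with a Euclidean regularizer, and the uniform window average only approximates the paper's dual-averaging weights that de-emphasize the k most recent gradient estimates. All function and variable names are made up for the example.

```python
import numpy as np

def one_point_gradient(loss_value, u, d, delta):
    """One-point estimate g_t = (d / delta) * f_t(y_t) * u_t for A_t = I,
    where y_t = x_t + delta * u_t and u_t is uniform on the unit sphere."""
    return (d / delta) * loss_value * u

def bandit_smooth_sketch(loss, x0, T, delta=0.1, eta=1e-4, k=10):
    """Bare-bones bandit optimization with a window of k averaged one-point
    gradient estimates (lower variance, extra bias) plugged into a
    dual-averaging / follow-the-regularized-leader update.
    Parameters are illustrative and untuned; the paper chooses eta, delta
    and k as explicit functions of T."""
    d = x0.size
    x = x0.copy()
    recent = []                      # window of recent one-point estimates
    grad_sum = np.zeros(d)           # running sum fed to dual averaging
    for _ in range(T):
        u = np.random.randn(d)
        u /= np.linalg.norm(u)       # uniform direction on the unit sphere
        y = x + delta * u            # perturbed query point
        g = one_point_gradient(loss(y), u, d, delta)
        recent.append(g)
        if len(recent) > k:
            recent.pop(0)
        grad_sum += np.mean(recent, axis=0)   # averaged (buffered) estimate
        x = x0 - eta * grad_sum      # dual averaging with Euclidean regularizer
    return x
```

The averaging step is the bias-variance tradeoff discussed in the paper: each one-point estimate is nearly unbiased but has variance on the order of (d/δ)², while averaging a window of k estimates taken at nearby points divides the variance by roughly k at the cost of an additional bias that the smoothness assumption keeps under control.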
online prediction at the limit of zero temperature mark herbster stephen pasteris department of computer science university college london london england uk shaona ghosh ecs university of southampton southampton uk abstract we design an online algorithm to classify the vertices of graph underpinning the algorithm is the probability distribution of an ising model isomorphic to the graph each classification is based on predicting the label with maximum marginal probability in the limit of with respect to the labels and vertices seen so far computing these classifications is unfortunately based on complete problem this motivates us to develop an algorithm for which we give sequential guarantee in the online mistake bound framework our algorithm is optimal when the graph is tree matching the prior results in for general graph the algorithm exploits the additional connectivity over tree to provide bound the algorithm is efficient as the cumulative time to sequentially predict all of the vertices of the graph is quadratic in the size of the graph introduction learning is now standard methodology in machine learning common approach in learning is to build graph from given set of labeled and unlabeled data with each datum represented as vertex the hope is that the constructed graph will capture either the cluster or manifold structure of the data typically an edge in this graph indicates the expectation that the joined data points are more likely to have the same label one method to exploit this representation is to use the induced by the laplacian of the graph shared idea of the laplacian based approaches is that the smoothness of boolean labeling of the graph is measured via the cut which is just the number of edges that connect disagreeing labels in practice the is then used as regularizer in which the optimization problem is relaxed from boolean to real values our approach also uses the cut but unrelaxed to define an ising distribution over the vertices of the graph predicting with the vertex marginals of an ising distribution in the limit of zero temperature was shown to be optimal in the mistake bound model section when the graph is tree the exact computation of marginal probabilities in the ising model is intractable on however in the limit of zero temperature rich combinatorial structure called the graph emerges we exploit this structure to give an algorithm which is optimal on trees has quadratic cumulative computational complexity and has mistake bound on generic graphs that is stronger than previous bounds in many natural cases the paper is organized as follows in the remainder of this section we introduce the ising model and lightly review previous work in the online mistake bound model for predicting the labeling of graph in section we review our key technical tool the graph and explain the required notation in the body of section we provide mistake bound analysis of our algorithm as well as the intractable algorithm and then conclude with detailed comparison to the state of the art in the appendices we provide proofs as well as preliminary experimental results ising model in the limit zero temperature in our setting the parameters of the ising model are an graph and temperature parameter where denotes the vertex set and denotes the edge set each vertex of this graph may be labeled with one of two states and thus labeling of graph may be denoted by vector where ui denotes the label of vertex the cutsize of labeling is defined as φg ising probability distribution over labelings of is 
then uj the defined as pgτ exp φg where is the temperature parameter in our online setting at the beginning of trial we will have already received an example sequence st of pairs it yt where pair we use pgτ uv pgτ uv uit yt to denote the marginal probability that vertex has label given the previously labeled vertices of st for convenience we also define the marginalized cutsize φg to be equal to φg if uit yt and equal to undefined otherwise our prediction of vertex is then the label with maximal marginal probability in the limit of zero temperature thus argmax lim pgτ uit yt note the prediction is undefined if the labels are equally probable in low temperatures the mass of the marginal is dominated by the labelings consistent with st and the proposed label of vertex of minimal cut as we approach zero is the label consistent with the maximum number of labelings of minimal cut thus if min nφg then we have that φg φg φg φg the problem of counting minimum cuts was shown to be in and further computing is also see appendix in section we introduce the graph which captures the combinatorial structure of the set of we then use this simplifying structure as basis to design heuristic approximation to with mistake bound guarantee predicting the labelling of graph in the mistake bound model we prove performance guarantees for our method in the mistake bound model introduced by littlestone on the graph this model corresponds to the following game nature presents graph nature queries vertex inn the learner predicts the label of the vertex nature presents label nature queries vertex the learner predicts and so forth the learner goal is to minimize the total number of mistakes yt if nature is adversarial the learner will always make mistake but if nature is regular or simple there is hope that learner may incur only few mistakes thus central goal of online learning is to design algorithms whose total mistakes can be bounded relative to the complexity of nature labeling the graph labeling problem has been studied extensively in the online literature here we provide rough discussion of the two main approaches for graph label prediction and in section we provide more detailed comparison the first approach is based on the graph laplacian it provides bounds that utilize the additional connectivity of graphs which are particularly strong when the graph contains clusters of small resistance diameter the drawbacks of this approach are that the bounds are weaker on graphs with large diameter and that the computation times are slower the second approach is to estimate the original graph with an appropriately selected tree or path graph this leads to faster computation times and bounds that are better on graphs with large diameters the algorithm treeopt is optimal on trees these algorithms may be extended to graphs by first selecting spanning tree uniformly at random and then applying the algorithm to the sampled tree this randomized approach enables expected mistake bounds which exploit the cluster structure in the graph the bounds we prove for the prediction and our heuristic are most similar to the small bounds proven for the interpolation algorithm although these bounds are not strictly comparable key strength of our approach is that the new bounds often improve when the graph contains clusters of varying diameters furthermore when the graph is tree we match the optimal bounds of finally the cumulative time required to compute the complete labeling of graph is quadratic in the size of the graph for our algorithm 
while requires the minimization of convex function on every trial which is not differentiable when preliminaries an undirected graph is pair of sets such that is set of unordered pairs of distinct elements from we say that is subgraph iff and given any subgraph we define its boundary or inner border its neighbourhood or exterior border respectively as and and its exterior edge border the length of subgraph is denoted by and we denote the diameter of graph by pair of vertices are if there exist paths connecting them the connectivity of graph is the maximal value of such that every pair of points in is the atomic number nκ of graph at connectivity level is the minimum cardinality of partition of into subgraphs rc such that ri for all our results also require the use of and graphs every undirected graph also defines directed graph where each undirected edge is represented by directed edges and an orientation of an undirected graph is an assignment of direction to each edge turning the initial graph into directed graph in the edge set is now and thus there may be multiple edges between two vertices is defined from graph and partition of its vertex set vi so that vi we often call these vertices supervertices to emphasize that they are sets and the multiset we commonly construct by merging collection of for example in figure from to where and are merged to and also the five merges that transforms to the set of all in graph with respect to an example sequence is φg the minimum is typically for example in figure the vertex sets correspond to one and to another the cutsize is the uncapacitated maximum flow is the number of paths between source and target vertex thus in figure between vertex and vertex there are at most simultaneously paths these are also not unique as one path must pass through either vertices or vertices figure illustrates one such flow just the directed edges for convenience it is natural to view the maximum flow or the as being with respect to only two vertices as in figure transformed to figure so that merge the flow and the cut are related by menger theorem which states that the with respect to source and target vertex is equal to the max flow between them given connected graph and source and target vertices the algorithm can find paths from to in time where is the value of the max flow the graph given set of labels there may be multiple as well as multiple maximum flows in graph the pq graph reduces this multiplicity as far as is possible with respect to the indeterminacy of the maximum flow the vertices of the are defined as set on partition of the original graph vertex set two vertices are contained in the same iff they have the same label in every an edge between two vertices defines an analogous edge between two iff that edge is conserved in every maximum flow furthermore the edges between strictly orient the labels in any as may be seen in the formal definition that follows first we introduce the following useful notations let kg min φg denote the of with respect to let denote an equivalence relation between vertices in where iff ui uj and then we define definition the graph is derived from graph and example sequence the graph is an orientation of the quotient graph derived from the partition in of induced by the edge set of is constructed of kg paths starting at source vertex and terminating at target vertex labeling is in iff implies ui and implies ui implies ui uj and ui implies uj where and are the source and target vertices and as is dag it naturally defines partial 
order on the vertex set where if there exists path starting at and ending at the least and greatest elements of the partial order are and the notation and denote the up set and down set of given the set of all then if there exists an antichain such that ui when otherwise ui furthermore for every antichain there exists labelconsistent the simple structure of was utilized by to enable the efficient algorithmic enumeration of however the cardinality of this set of all is potentially exponential in the size of the and the exact computation of the cardinality was later shown in in figure we give the algorithm from picardqueyrannegraph graph example sequence vk yk sourcetargetmerge maxflow where and quotientgraph stronglyconnectedcomponents except vk int yk and vk int yk return directed graph figure computing the graph graph and graph step in figure graph step in figure pq graph step in figure figure building graph to compute we illustrate the computation in figure the algorithm operates first on step by merging all vertices which share the same label in to create in step max flow graph is computed by the algorithm it is in the case of unweighted graphs that max flow graph may be output as dag of paths where is the value of the flow in step all edges in the flow become directed edges creating the graph is then created in step from where the strongly connected components become the of and the correspond to subset of flow edges from finally in step we create the by fixing the source and target vertices so that they also have as elements the original labeled vertices from which were merged in step the correctness of the algorithm follows from arguments in we provide an independent proof in appendix theorem the algorithm in figure computes the unique graph derived from graph and example sequence mistake bounds analysis in this section we analyze the mistakes incurred by the intractable strategy see and the strategy see figure our analysis splits into two parts firstly we show section theorem for sufficiently regular graph label prediction algorithm that we may analyze independently the mistake bound of each cluster connected subgraph secondly the analysis then separates into three cases the result of which is summarized in theorem for given cluster when its internal connectivity is larger than the number of edges in the boundary we will incur no more than one mistake in that cluster on the other hand for smaller connectivity clusters we incur up to quadratically in mistakes via the edge boundary size when is tree we incur log mistakes the analysis of smaller connectivity clusters separates into two parts first sequence of trials in which the does not increase we call section as in essence it is played on we give mistake bound for for the intractable prediction and comparable bound for the strategy in theorem second when the increases the current ends and new one begins leading to sequence of the mistakes incurred over sequence of is addressed in the aforementioned theorem and finally section concludes with discussion of the combined bounds of theorems and with respect to other graph label prediction algorithms mistake bounds for regular graph label prediction algorithms an algorithm is called regular if it is and markov an algorithm is if the prediction at any time does not depend on the order of the examples up to time if for every example sequence if we insert an example between examples and with label then the prediction at time is unchanged or changed to and markov with respect to graph if for 
any disjoint vertex sets and and separating set then the predictions in are independent of the labels in given the labels of subgraph is with respect to an example sequence iff the label of each vertex is the same and these labels are consistent with the example sequence the following definition characterizes the example sequences for regular algorithms with respect to clusters definition given an online algorithm and subgraph then ba denotes the maximal mistakes made only in for the presentation of any permutation of examples in each with label followed by any permutation of examples in each with label the following theorem enables us to analyze the mistakes incurred in each subgraph independently of each other and independently of the remaining graph structure excepting the subgraph exterior border theorem proof in appendix given an online markov algorithm and graph which is covered by subgraphs cc the mistakes incurred by the algorithm may be bounded by ba ci the above theorem paired with theorem completes the mistake bound analysis of our algorithms given the derived online is played between player and an adversary the aim of the player is to minimize their mistaken predictions for the adversary it is to maximize the player mistaken predictions thus to play the adversary proposes vertex the player then predicts label then the adversary returns label and either mistake is incurred or not the only restriction on the adversary is to not return label which increases the as long as the adversary does not give an example or the does not increase no matter the value of which also implies the player has trivial strategy to predict the label of after the example is given we have an updated with new source and target as seen in the proposition below proposition if is and is an example with and then let then hs merge thus given the the is independent of and since play induces play with mistake bounds for given single in the following we will discuss the three strategies and that the player may adopt for which we prove online mistake bounds the first strategy is merely motivational it can be used to play single but not sequence the second strategy is computationally infeasible finally the strategy is dynamically similar to but is also common to all our analyses is cover of which is partitioning of the of into directed paths pk from to note that the cover is not necessarily unique for example in figure we have the two unique path covers and we denote the set of all path covers as and thus we have for figure that this cover motivates simple mistake bound and strategy suppose we had single path of length where the first and last vertex are the source and target vertices so the minimum is and natural strategy is simply to predict with the revealed label and trivially our mistake bound is log generalizing to multiple paths we have the following strategy given choose path cover strategy if the path cover is also except for the source and target vertex we may directly use the strategy detailed above achieving the mistake upper bound log unsurprisingly in the case it is optimal algorithm if however is not and we need to predict vertex we may select path in containing and predict with the nearest neighbour and also obtain the bound above in this case however the bound may not be essentially the same technique was used in in related setting for learning directed limitation of the strategy is that it does not seem possible to extend into strategy that can play sequence of and still meet the regularity 
properties particularly as required by theorem strategy the prediction of the ising model in the limit of zero temperature cf is equivalent to those of the halving algorithm where the hypothesis class is the set of the mistake upper bound of the halving algorithm is just log where this bound follows from the observation that whenever mistake is made at least half of concepts in are no longer consistent we observe that we may upper bound argminp since the product of path lengths from any path cover is an upper bound on the cardinality of and hence we have the bound in and in fact this bound may be significant improvement over the strategy bound as seen in the following proposition proposition proof in appendix for every there exists gc with path cover gc and example sequence such that the mistakes while for all example sequences on gc the mistakes unfortunately the strategy has the drawback that counting is and computing the prediction see is see appendix strategy in our search for an efficient and regular prediction strategy it seems natural to attempt to dynamize the approach and predict with nearest neighbor along dynamic path two such methods are the and strategies the strategy predicts the label of in as iff the shortest directed path is shorter than the shortest directed path the strategy predicts the label of in as iff the longest directed path is shorter than the longest directed path the strategy seems to be intuitively favored over as it is just input graph example sequence inn initialization picardqueyrannegraph for do receive it it gt with it gt it gt it predict otherwise predict it receive yt if it or yt and it or yt then merge gt it yt merge gt it yt else picardqueyrannegraph st as per equation cut unchanged cut increases end figure and online prediction the prediction with respect to the geodesic distance however the following proposition shows that it is strictly worse than any strategy in the worst case proposition proof in appendix for every there exists gc and example sequence such that the mistakes log while for every path cover gc and for all example sequences on gc the mistakes in contrast for the strategy in the proof of theorem we show that there alp ways exists some retrospective path cover plp such that log computing the has time complexity linear in the number of edges in dag summarizing the mistake bounds for the three strategies for single we have the following theorem theorem proof in appendix the mistakes of an online for player strategies and on and cover is bounded by log pk argminp log pk log global analysis of prediction at zero temperature in figure we summarize the prediction protocol for and we claim the regularity properties of our strategies in the following theorem theorem proof in appendix the strategies and are and markov the technical hurdle here is to prove that holds over sequence of for this we need an analog of proposition to describe how the changes when the labelconsistent increases see proposition the application of the following theorem along with theorem implies we may bound the mistakes of each cluster in potentially three ways theorem proof in appendix given either the or strategy the mistakes on subgraph are bounded by ba log log is tree with the atomic number first if the internal connectivity of the cluster is high we will only make single mistake in that cluster second if the cluster is tree then we pay the external connectivity of the cluster times the log of the cluster diameter finally in the remaining case we pay quadratically 
in the external connectivity and logarithmically in the atomic number of the cluster the atomic number captures the fact that even poorly connected cluster may have of high internal connectivity computational complexity if is graph and an example sequence with of then we may implement the strategy so that it has cumulative computational complexity of max this follows because if on trial the cut does not increase we may implement prediction and update in time on the other hand if the cut increases by we pay time to do so we implement an online algorithm which starts from the previous residual graph to which it then adds the additional flow paths with steps of size discussion there are essentially five dominating mistake bounds for the online graph labeling problem the bound of treeopt on trees ii the bound in expectation of treeopt on random spanning tree sampled from graph iii the bound of interpolation tuned for sparsity iv the bound of interpolation as tuned to be equivalent to online label propagation this paper strategy the algorithm treeopt was shown to be optimal on trees in appendix we show that also obtains the same optimal bound on trees algorithm ii applies to generic graphs and is obtained from by sampling random spanning tree rst it is not directly comparable to the other algorithms as its bound holds only in expectation with respect to the rst we use corollary to compare to iii and iv we introduce the following simplifying notation to compare bounds let cc denote clusters connected subgraphs which cover the graph and set κr cr and φr cr we define dr to be the wide diameter at connectivity level of cluster cr the wide diameter dr is the minimum value such that for all pairs of vertices cr there exists of paths from to of length at least dr in cr and if κr then dr thus dr is the diameter of cluster cr and dr dr let denote the minimum cutsize and observe that if the cardinality of the cover cc is minimized then we have that φr thus using corollary we have the following upper bounds of iii log and iv pc where minr κr and maxr dr in comparison we have max φr κr φr log nr with atomic numbers nr nφr cr to contrast the bounds consider double lollipop first create lollipop which is path of vertices attached to clique of vertices label these vertices second clone the lollipop except with labels finally join the two cliques with edges arbitrarily for iii and iv the bounds are independent of the choice of clusters whereas an upper bound for is the exponentially smaller log which is obtained by choosing four cluster cover consisting of the two paths and the two cliques this emphasizes the generic problem of iii and iv parameters and are defined by the worst clusters whereas is truly bound we consider the previous constructed example to be representative of generic case where the graph contains clusters of many resistance diameters as well as sparse interconnecting background vertices on the other hand there are cases in which iii iv improve on for graph with only small diameter clusters and if the cutsize exceeds the cluster connectivity then iv improves on iii given the linear versus quadratic dependence on the cutsize the may be arbitrarily smaller than iii improves on and also other subtleties not accounted for in the above comparison include the fact the wide diameter is crude upper bound for resistance diameter cf theorem and the clusters of iii iv are not required to be regarding replacing wide with resistance does not change the fact the bound now holds with respect to the worst 
resistance diameter and the example above is still problematic regarding it is nice property but we do not know how to exploit this to give an example that significantly improves iii or iv over slightly more detailed analysis of finally iii iv depend on correct choice of tunable parameter thus in summary matches the optimal bound of on trees and can often improve on iii iv when graph is naturally covered by clusters of different diameters however iii iv may improve on in number of cases including when the is significantly smaller than of the clusters references claudio gentile and fabio vitale fast and optimal prediction on labeled tree in proceedings of the annual conference on learning omnipress avrim blum and shuchi chawla learning from labeled and unlabeled data using graph mincuts in proceedings of the eighteenth international conference on machine learning icml pages san francisco ca usa morgan kaufmann publishers olivier chapelle jason weston and bernhard cluster kernels for learning in becker thrun and obermayer editors advances in neural information processing systems pages mit press mikhail belkin and partha niyogi learning on riemannian manifolds mach xiaojin zhu zoubin ghahramani and john lafferty learning using gaussian fields and harmonic functions in icml pages dengyong zhou olivier bousquet thomas navin lal jason weston and bernhard learning with local and global consistency in nips martin szummer and tommi jaakkola partially labeled classification with markov random walks in nips pages leslie ann goldberg and mark jerrum the complexity of ferromagnetic ising with local fields combinatorics probability computing picard and maurice queyranne on the structure of all minimum cuts in network and applications in editor combinatorial optimization ii volume of mathematical programming studies pages springer berlin heidelberg scott provan and michael ball the complexity of counting cuts and of computing the probability that graph is connected siam journal on computing nick littlestone learning quickly when irrelevant attributes abound new algorithm machine learning april mark herbster massimiliano pontil and lisa wainer online learning over graphs in icml proceedings of the international conference on machine learning pages new york ny usa acm mark herbster exploiting to predict the labeling of graph in proceedings of the international conference on algorithmic learning theory pages mark herbster and guy lever predicting the labelling of graph via minimum interpolation in proceedings of the annual conference on learning theory colt mark herbster guy lever and massimiliano pontil online prediction on large diameter graphs in advances in neural information processing systems nips pages mit press claudio gentile fabio vitale and giovanni zappella random spanning trees and the prediction of weighted graphs in proceedings of the international conference on machine learning icml pages fabio vitale claudio gentile and giovanni zappella see the tree through the lines the shazoo algorithm in john richard zemel peter bartlett fernando pereira and kilian weinberger editors nips pages ford and fulkerson maximal flow through network canadian journal of mathematics michael ball and scott provan calculating bounds on reachability and connectedness in stochastic networks networks thomas and gemma garriga the cost of learning directed cuts in proceedings of the european conference on machine learning barzdin and frievald on the prediction of general recursive functions soviet math doklady nick 
Littlestone and Manfred Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
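The central data structure of the preceding paper is the PQ-graph computed by the picardqueyrannegraph routine of its Figure: labelled vertices are merged into a source s and a target t, a maximum flow is computed, and the strongly connected components of the resulting residual structure are contracted into super-vertices. The sketch below is an approximate prototype of that construction using networkx, under stated assumptions rather than the authors' implementation; in particular it contracts the SCCs of the residual graph of one maximum flow (the classical Picard–Queyranne construction, whose closed sets containing s correspond to the label-consistent minimum cuts), whereas the paper's routine additionally works with an explicit decomposition of the flow into paths. All names are chosen for the example.

```python
import networkx as nx

def pq_graph(edges, pos_vertices, neg_vertices):
    """Prototype PQ-graph for an undirected, unit-capacity graph with some
    vertices labelled +1 (merged into 's') and some labelled -1 (merged into 't').
    Returns the condensation (a DAG of super-vertices) of the residual graph
    of one maximum flow, together with the minimum-cut value."""
    relabel = {v: 's' for v in pos_vertices}
    relabel.update({v: 't' for v in neg_vertices})

    D = nx.DiGraph()
    for u, v in edges:
        u, v = relabel.get(u, u), relabel.get(v, v)
        if u == v:
            continue
        # Each undirected unit edge becomes two directed unit-capacity edges.
        for a, b in ((u, v), (v, u)):
            if D.has_edge(a, b):
                D[a][b]['capacity'] += 1
            else:
                D.add_edge(a, b, capacity=1)

    cut_value, flow = nx.maximum_flow(D, 's', 't')

    # Residual graph of the chosen maximum flow.
    R = nx.DiGraph()
    R.add_nodes_from(D.nodes)
    for u, v, data in D.edges(data=True):
        f = flow[u][v]
        if f < data['capacity']:
            R.add_edge(u, v)        # remaining forward residual capacity
        if f > 0:
            R.add_edge(v, u)        # backward residual edge
    # Contracting the SCCs of the residual graph yields the PQ DAG.
    return nx.condensation(R), cut_value

# Tiny usage example: a 4-cycle with one vertex labelled per class.
dag, k = pq_graph([(1, 2), (2, 3), (3, 4), (4, 1)], pos_vertices=[1], neg_vertices=[3])
print(k, [dag.nodes[n]['members'] for n in dag.nodes])
```

On the 4-cycle the minimum cut has value 2 and every unlabelled vertex may take either label in some label-consistent minimum cut, so the resulting DAG has four singleton super-vertices, matching the multiplicity that the PQ-graph is designed to organize.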
Learning Continuous Control Policies by Stochastic Value Gradients

Nicolas Heess*, Greg Wayne*, David Silver, Timothy Lillicrap, Yuval Tassa, Tom Erez (Google DeepMind; * these authors contributed equally)

Abstract

We present a unified framework for learning continuous control policies using backpropagation. It supports stochastic control by treating stochasticity in the Bellman equation as a deterministic function of exogenous noise. The product is a spectrum of general policy gradient algorithms that range from model-free methods with value functions to model-based methods without value functions. We use learned models but only require observations from the environment instead of observations from model-predicted trajectories, minimizing the impact of compounded model errors. We apply these algorithms first to a toy stochastic control problem and then to several physics-based control problems in simulation. One of these variants, SVG(1), shows the effectiveness of learning models, value functions, and policies simultaneously in continuous domains.

Introduction

Policy gradient algorithms maximize the expectation of cumulative reward by following the gradient of this expectation with respect to the policy parameters. Most existing algorithms estimate this gradient in a model-free manner by sampling returns from the real environment and rely on a likelihood ratio estimator. Such estimates tend to have high variance and require large numbers of samples or, conversely, low-dimensional policy parameterizations. A second approach to estimating the policy gradient relies on backpropagation instead of likelihood ratio methods. If a differentiable environment model is available, one can link together the policy, model, and reward function to compute an analytic policy gradient by backpropagation of reward along a trajectory. Instead of using entire trajectories, one can estimate future rewards using a learned value function (a critic) and compute policy gradients from subsequences of trajectories. It is also possible to backpropagate analytic action derivatives from a Q-function to compute the policy gradient without a model. Following Fairbank, we refer to methods that compute the policy gradient through backpropagation as value gradient methods.

In this paper we address two limitations of prior value gradient algorithms. The first is that, in contrast to likelihood ratio methods, value gradient algorithms are only suitable for training deterministic policies. Stochastic policies have several advantages: for example, they can be beneficial for partially observed problems; they permit on-policy exploration; and because stochastic policies can assign probability mass to off-policy trajectories, we can train a stochastic policy on samples from an experience database in a principled manner. When an environment model is used, value gradient algorithms have also been critically limited to operation in deterministic environments. By exploiting a mathematical tool known as re-parameterization, which has found recent use for generative models, we extend the scope of value gradient algorithms to include the optimization of stochastic policies in stochastic environments (a minimal numerical sketch of this device is given below). We thus describe our framework as Stochastic Value Gradient (SVG) methods. Secondly, we show that an environment dynamics model, value function, and policy can be learned jointly with neural networks based only on environment interaction. Learned dynamics models are often inaccurate, which we mitigate by computing value gradients along real system trajectories instead of planned ones, a feature shared by model-free methods. This substantially reduces the impact of model error because we only use models to compute policy gradients, not for prediction.
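To make the re-parameterization device concrete, the following sketch contrasts a likelihood-ratio gradient estimate with a re-parameterized gradient estimate for the one-dimensional toy objective E_{a ~ N(mu, sigma^2)}[g(a)]. It is a generic illustration only, not code from the paper: the quadratic g, the Gaussian "policy", and all names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(a):
    # Stand-in for a differentiable return; chosen only for this illustration.
    return -(a - 2.0) ** 2

def g_prime(a):
    return -2.0 * (a - 2.0)

mu, sigma, n = 0.0, 1.0, 10_000

# Likelihood-ratio (score-function) estimate of d/d_mu E[g(a)], a ~ N(mu, sigma^2):
#   E[ g(a) * d/d_mu log N(a; mu, sigma^2) ] = E[ g(a) * (a - mu) / sigma^2 ]
a = rng.normal(mu, sigma, size=n)
lr_grad = np.mean(g(a) * (a - mu) / sigma**2)

# Re-parameterized estimate: write a = mu + sigma * eta with eta ~ N(0, 1),
# so the noise is exogenous and the gradient passes through g deterministically:
#   d/d_mu E[g(mu + sigma * eta)] = E[ g'(mu + sigma * eta) ]
eta = rng.standard_normal(n)
rp_grad = np.mean(g_prime(mu + sigma * eta))

print(f"likelihood-ratio estimate: {lr_grad:.3f}")
print(f"re-parameterized estimate: {rp_grad:.3f}")
print(f"exact gradient:            {-2.0 * (mu - 2.0):.3f}")
```

Both estimators target the same gradient (exactly 4 in this example), but the re-parameterized one differentiates through g and typically has far lower variance. SVG methods apply the same substitution inside the Bellman equation, writing the action as a = π(s, η; θ) and the next state as s' = f(s, a, ξ) for exogenous noise variables η and ξ, so that value information can be backpropagated through stochastic policies and stochastic dynamics.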
combining advantages of and modelfree methods with fewer of their drawbacks we present several algorithms that range from to methods flexibly combining models of environment dynamics with value functions to optimize policies in stochastic or deterministic environments experimentally we demonstrate that svg methods can be applied using generic neural networks with tens of thousands of parameters while making minimal assumptions about plants or environments by examining simple stochastic control problem we show that svg algorithms can optimize policies where planning and likelihood ratio methods can not we provide evidence that value function approximation can compensate for degraded models demonstrating the increased robustness of svg methods over planning finally we use svg algorithms to solve variety of challenging physical control problems including swimming of snakes reaching tracking and grabbing with robot arm for monoped and locomotion for planar cheetah and biped background we consider markov decision processes mdps with continuous states and actions and denote the state and action at time step by st rns and at rna respectively the mdp has an initial state distribution transition distribution at and potentially reward function rt st at we consider stochastic policies parameterized by the goal of policy optimization is to find policy parameters that maximize the expectedh sum of futurei rewards we optimize either or pt sums or where is discount when possible we represent variable at the next time step using the tick notation in what follows we make extensive use of the and for problems the value functions are and for problems the value functions are stationary the relevant meaning should be clear from the context the function can be expressed recursively using the stochastic bellman equation rt da we abbreviate partial differentiation using subscripts gx deterministic value gradients the deterministic bellman equation takes the form for deterministic model and deterministic policy differentiating the equation with respect to the state and policy yields an expression for the value gradient vs rs ra fs fa fa in eq the term arises because the total derivative includes policy gradient contributions from subsequent time steps full derivation in appendix for purely formalism these equations are used as pair of coupled recursions that starting from the termination of trajectory proceed backward in time to compute the gradient of the value function with respect to the state and policy parameters returns the total policy gradient when function is used we make use of reward function only in one problem to encode terminal reward for the case after one step in the recursion ra fa directly expresses the contribution of the current time step to the policy gradient summing these gradients over the trajectory gives the total policy gradient when is used the step contribution to the policy gradient takes the form qa stochastic value gradients one limitation of the gradient computation in eqs and is that the model and policy must be deterministic additionally the accuracy of the policy gradient is highly sensitive to modeling errors we introduce two critical changes first in section we transform the stochastic bellman equation eq to permit backpropagating value information in stochastic setting this also enables us to compute gradients along real trajectories not ones sampled from model making the approach robust to model error leading to our first algorithm svg described in section second in section 
we show how value function critics can be integrated into this framework leading to the algorithms svg and svg which expand the bellman recursion for and steps respectively value functions further increase robustness to model error and extend our framework to control differentiating the stochastic bellman equation of distributions our goal is to backpropagate through the stochastic bellman equation to do so we make use of concept called which permits us to compute derivatives of deterministic and stochastic models in the same way very simple example of is to write conditional gaussian density as the function where from this point of view one produces samples procedurally by first sampling then deterministically constructing here we consider conditional densities whose samples are generated by deterministic function of an input noise variable and other conditioning variables where fixed noise distribution rich density models rcan be expressed in this form expectations of function become ep the advantage of working with distributions is that we can now obtain simple estimator of the derivative of an expectation with respect to rx ep gy fx gy in contrast to likelihood monte carlo estimators rx log this formula makes direct use of the jacobian of of the bellman equation we now the bellman equation when the stochastic policy takes the form and the stochastic environment the form for noise variables and respectively inserting these functions into eq yields differentiating eq with respect to the current state and policy parameters gives vs rs ra fs fa ra fa we are interested in controlling systems with priori unknown dynamics consequently in the following we replace instances of or its derivatives with learned model gradient evaluation by planning planning method to compute gradient estimate is to compute trajectory by running the policy in loop with model while sampling the associated noise variables yielding trajectory on this sampled trajectory estimate of the policy gradient can be computed by the backward recursions vs rs ra ra where have written to emphasize that the quantities are and means evaluated at gradient evaluation on real trajectories an important advantage of stochastic over deterministic models is that they can assign probability mass to observations produced by the real environment in deterministic formulation there is no principled way to account for mismatch between model predictions and observed trajectories in this case the policy and environment noise that produced the observed trajectory are considered unknown by an application of bayes rule which we explain in appendix we can rewrite the expectations in equations and given the observations as vs ep ep ep rs ra ep ep ep ra where we can now replace the two outer expectations with samples derived from interaction with the real environment in the special case of additive noise it is possible to use deterministic model to compute the derivatives the noise influence is restricted to the gradient of the value of the next state and does not affect the model jacobian if we consider it desirable to capture more complicated environment noise we can use generative model and infer the missing noise variables possibly by sampling from svg svg computes value gradients by backward recursions on trajectories after every episode we train the model followed by the policy we provide pseudocode for this in algorithm but discuss further implementation details in section and in the experiments algorithm svg algorithm svg with replay given 
empty experience database for trajectory to do for to do apply control insert into end for train generative model using for down to do infer and ra given empty experience database for to do apply control observe insert into model and critic updates train generative model using train value function using alg policy update sample sk ak rk from vs rs ra end for apply update using end for ak ak infer and sk ak ra apply update using end for svg and svg in our framework we may learn parametric estimate of the expected value critic with parameters the derivative of the critic value with respect to the state can be used in place of the sample gradient estimate given in eq the critic can reduce the variance of the gradient estimates because approximates the expectation of future rewards while eq provides only in the formulation the gradient calculation starts at the end of the trajectory for which the only terms remaining in eq are vst rst rat after the recursion the total derivative of the value function with respect to the policy parameters is given by which is estimate of estimate additionally the value function can be used at the end of an episode to approximate the policy gradient finally eq involves the repeated multiplication of jacobians of the approximate model just as model error can compound in forward planning model gradient error can compound during backpropagation furthermore svg is that is after each episode single update is made to the policy and the policy optimization does not revisit those trajectory data again to increase we construct an experience replay algorithm that uses models and value functions svg with experience replay svg this algorithm also has the advantage that it can perform an computation to construct an estimator we perform of the current policy distribution with respect to proposal distribution eq ep ep ra specifically we maintain database with tuples of past state transitions sk ak rk each proposal drawn from is sample of tuple from the database at time the ak where comprise the policy parameters in use at the historical time step we do not the marginal distribution over states generated by policy this is widely considered to be intractable similarly we use experience replay for value function learning details can be found in appendix pseudocode for the svg algorithm with experience replay is in algorithm we also provide stochastic value gradient algorithm svg algorithm in the appendix this algorithm is very similar to svg and is the stochastic analogue of the recently introduced deterministic policy gradient algorithm dpg unlike dpg instead of assuming deterministic policy svg estimates the derivative around the policy noise ep qa this for example permits learning policy noise variance the relative merit of svg versus svg depends on whether the model or value function is easier to learn and is we expect that algorithms such as svg will show the strongest advantages in multitask settings where the system dynamics are fixed but the reward function is variable svg performed well across all experiments including ones introducing capacity constraints on the value function and model svg demonstrated significant advantage over all other tested algorithms model and value learning we can use almost any kind of differentiable generative model in our work we have parameterized the models as neural networks our framework supports nonlinear and noise notable properties of biological actuators for example this can be described by the parametric form model learning amounts 
to purely supervised problem based on observed state transitions our model and policy training occur jointly there is no motorbabbling period used to identify the model as new transitions are observed the model is trained first followed by the value function for svg followed by the policy to ensure that the model does not forget information about state transitions we maintain an experience database and cull batches of examples from the database for every model update additionally we model the statechange by and have found that constructing models as separate per predicted state dimension improved model quality significantly our framework also permits variety of means to learn the value function models we can use temporal difference learning or regression to empirical episode returns since svg is modelbased we can also use bellman residual minimization in practice we used version of fitted policy evaluation pseudocode is available in appendix algorithm experiments we tested the svg algorithms in two sets of experiments in the first set of experiments section we test whether evaluating gradients on real environment trajectories and value function note that is function of the state and noise variable figure from left to right swimmer reacher gripper monoped walker proximation can reduce the impact of model error in our second set section we show that svg can be applied to several complicated multidimensional physics environments involving contact dynamics figure in the mujoco simulator below we only briefly summarize the main properties of each environment further details of the simulations can be found in appendix and supplement in all cases we use generic neural networks with tanh activation functions to represent models value functions and policies video montage is available at https analyzing svg gradient evaluation on real trajectories planning to demonstrate the difficulty of planning with stochastic model we first present very simple control problem for which svg easily learns control policy but for which an otherwise identical planner fails entirely our example is based on problem due to the policy directly controls the velocity of hand on plane by means of the hand exerts force on ball mass the ball additionally experiences gravitational force and random forces gaussian noise the goal is to bring hand and ball into one of two randomly chosen target configurations with relevant reward being provided only at the final time step with simulation time step this demands controlling and backpropagating the distal reward along trajectory of steps because this experiment has value function this problem also favors value gradients over methods using value functions svg easily learns this task but the planner which uses trajectories from the model shows little improvement the planner simulates trajectories using the learned stochastic model and backpropagates along those simulated trajectories eqs and the extremely long lets prediction error accumulate and thus renders highly inaccurate leading to much worse final performance fig left robustness to degraded models and value functions we investigated the sensitivity of svg and svg to the quality of the learned model on swimmer swimmer is chain body with multiple links immersed in fluid environment with drag forces that allow the body to propel itself we build chains of or links corresponding to or state spaces with or action spaces the body is initialized in random configurations with respect to central goal location thus to solve the task the 
body must turn to and then produce an undulation to move to the goal to assess the impact of model quality we learned to control swimmer with svg and svg while varying the capacity of the network used to model the environment or hidden units for each state dimension subnetwork appendix in this task we intentionally shrink the neural network model to investigate the sensitivity of our methods to model inaccuracy while with high capacity model hidden units per state dimension both svg and svg successfully learn to solve the task the performance of svg drops significantly as model capacity is reduced fig middle svg still works well for models with only hidden units and it also scales up to and versions of the swimmer figs right and left to compare svg to conventional approaches we also tested algorithm that learns and updates the policy using the as an estimate of the advantage yielding the policy gradient log svg and the ac algorithm used the same code for learning svg outperformed the approach in the and swimmer tasks fig left right fig top left in figure panels middle right and left column we show that experience replay for the policy can improve the data efficiency and performance of svg we also tested reinforce on this problem but achieved very poor results due to the long horizon hand cartpole cartpole figure left backpropagation through model along observed stochastic trajectories is able to optimize stochastic policy in stochastic environment but an otherwise equivalent planning algorithm that simulates the transitions with learned stochastic model makes little progress due to compounding model error middle svg and dpg algorithms on svg learns the fastest right when the value function capacity is reduced from hidden units in the first layer to and then again to svg exhibits less performance degradation than the dpg presumably because the dynamics model contains auxiliary information about the function figure left for swimmer with relatively simple dynamics the compared methods yield similar results and possibly slight advantage to the purely svg middle however as the environment model capacity is reduced from to then to hidden units per subnetwork svg dramatically deteriorates whereas svg shows undisturbed performance right for swimmer svg learns faster and asymptotes at higher performance than the other tested algorithms similarly we tested the impact of varying the capacity of the value function approximator fig right on the svg degrades less severely than the dpg presumably because it computes the policy gradient with the aid of the dynamics model svg in complex environments in second set of experiments we demonstrated that svg can be applied to several challenging physical control problems with stochastic and discontinuous dynamics due to contacts reacher is an arm stationed within walled box with state dimensions and action dimensions and the coordinates of target site giving state dimensions in total in reacher the site was randomly placed at one of the four corners of the box and the arm in random configuration at the beginning of each trial in reacher the site moved at randomized speed and heading in the box with reflections at the walls solving this latter problem implies that the policy has generalized over the entire work space gripper augments the reacher arm with manipulator that can grab ball in randomized position and return it to specified site monoped has state dimensions action dimensions and ground contact dynamics the monoped begins falling from height and must 
remain standing additionally we apply gaussian random noise to the torques controlling the joints with standard deviation of of the total possible actuator strength at all points in time reducing the stability of upright postures is planar cat robot designed to run based on with state dimensions and action dimensions has version with springs to aid balanced standing and version without them walker is planar biped based on the environment from results figure shows learning curves for several repeats for each of the tasks we found that in all cases svg solved the problem well we provide videos of the learned policies in the supplemental material the reacher reliably finished at the target site and in the tracking task followed the moving target successfully svg has clear advantage on this task as also borne out in the and swimmer experiments the cheetah gaits varied slightly from experiment to experiment but in all cases made good forward progress for the monoped the policies were able to balance well beyond the time steps of training episodes and were able to resist significantly avg reward arbitrary units avg reward arbitrary units monoped gripper reacher figure across several different domains svg reliably optimizes policies clearly settling into similar local optima on the reacher svg shows noticeable efficiency and performance gain relative to the other algorithms higher adversarial noise levels than used during training up to noise we were able to learn gripping and walking behavior although walking policies that achieved similar reward levels did not always exhibit equally good walking phenotypes related work writing the noise variables as exogenous inputs to the system to allow direct differentiation with respect to the system state equation is known device in control theory where the model is given analytically the idea of using model to optimize parametric policy around real trajectories is presented heuristically in and for deterministic policies and models also in the limit of deterministic policies and models the recursions we have derived in algorithm reduce to those of werbos defines an algorithm called heuristic dynamic programming that uses deterministic model to one step to produce state prediction that is evaluated by value function deisenroth et al have used gaussian process models to compute policy gradients that are sensitive to and levine et al have optimized impressive policies with the aid of trajectory optimizer and models our work in contrast has focused on using global neural network models conjoined to value function approximators discussion we have shown that two potential problems with value gradient methods their reliance on planning and restriction to deterministic models can be exorcised broadening their relevance to reinforcement learning we have shown experimentally that the svg framework can train neural network policies in robust manner to solve interesting continuous control problems the framework includes algorithm variants beyond the ones tested in this paper for example ones that combine value function with steps of through model svg augmenting svg with experience replay led to the best results and similar extension could be applied to any svg furthermore we did not harness sophisticated generative models of stochastic dynamics but one could readily do so presenting great room for growth acknowledgements we thank arthur guez danilo rezende hado van hasselt john schulman jonathan hunt nando de freitas martin riedmiller remi munos shakir mohamed and 
theophane weber for helpful discussions and john schulman for sharing his walker model references abbeel quigley and ng using inaccurate models in reinforcement learning in icml atkeson efficient robust policy optimization in acc baird residual algorithms reinforcement learning with function approximation in icml balduzzi and ghifary compatible value gradients for reinforcement learning of continuous deep policies arxiv preprint coulom reinforcement learning using neural networks with applications to motor control phd thesis institut national polytechnique de deisenroth and rasmussen pilco and approach to policy search in icml fairbank learning phd thesis city university london fairbank and alonso learning in ijcnn grondman online model learning algorithms for control phd thesis tu delft delft university of technology jacobson and mayne differential dynamic programming jordan and rumelhart forward models supervised learning with distal teacher cognitive science kingma and welling variational bayes arxiv preprint levine and abbeel learning neural network policies with guided policy search under unknown dynamics in nips lillicrap hunt pritzel heess erez tassa silver and wierstra continuous control with deep reinforcement learning arxiv preprint lin reactive agents based on reinforcement learning planning and teaching machine learning munos policy gradient in continuous time journal of machine learning research narendra and parthasarathy identification and control of dynamical systems using neural networks ieee transactions on neural networks nguyen and widrow neural networks for control systems ieee control systems magazine pascanu mikolov and bengio on the difficulty of training recurrent neural networks in icml rezende mohamed and wierstra stochastic backpropagation and approximate inference in deep generative models in icml martin riedmiller neural fitted experiences with data efficient neural reinforcement learning method in machine learning ecml pages springer schulman levine moritz jordan and abbeel trust region policy optimization corr silver lever heess degris wierstra and riedmiller deterministic policy gradient algorithms in icml singh learning without in partially observable markovian decision processes in icml richard sutton learning to predict by the methods of temporal differences machine learning sutton mcallester singh and mansour policy gradient methods for reinforcement learning with function approximation in nips tassa erez and smart receding horizon differential dynamic programming in nips todorov erez and tassa mujoco physics engine for control in iros robot learning to run in adaptive and natural computing algorithms pages springer reinforcement learning by sequential and experience replay neural networks werbos menu of designs for reinforcement learning over time neural networks for control pages williams simple statistical algorithms for connectionist reinforcement learning machine learning 
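To make the value-gradient recursions discussed above concrete, the following is a minimal, self-contained sketch of an SVG(infinity)-style backward pass on a hypothetical one-dimensional problem. The linear dynamics, quadratic reward, horizon, noise scales, and learning rate are all illustrative assumptions rather than anything from the paper; the true dynamics Jacobians stand in for a learned model, and no value function critic is used. The backward loop accumulates dV/ds and dV/dtheta along an observed trajectory, and the outer loop performs plain gradient ascent on the return.

import numpy as np

def rollout(theta, T=20, sigma=0.1, seed=0):
    """Run the stochastic policy a = theta*s + sigma*xi in the toy system s' = s + a + eta."""
    rng = np.random.default_rng(seed)
    s, traj = 1.0, []
    for _ in range(T):
        a = theta * s + sigma * rng.normal()      # reparameterized stochastic policy
        r = -s ** 2 - 0.1 * a ** 2                # quadratic cost expressed as reward
        s_next = s + a + 0.01 * rng.normal()      # transition with additive noise
        traj.append((s, a, r))
        s = s_next
    return traj

def svg_inf_gradient(theta, traj, gamma=0.99):
    """Backward recursion: accumulate v_s = dV/ds and v_theta = dV/dtheta along the trajectory."""
    v_s, v_theta = 0.0, 0.0
    for s, a, _ in reversed(traj):
        r_s, r_a = -2.0 * s, -0.2 * a             # analytic partials of the toy reward
        f_s, f_a = 1.0, 1.0                       # partials of the (here, known) toy dynamics
        pi_s, pi_theta = theta, s                 # partials of the policy mean
        # v_theta is updated first: it uses v_s of the *next* time step
        v_theta = r_a * pi_theta + gamma * (v_s * f_a * pi_theta + v_theta)
        v_s = r_s + r_a * pi_s + gamma * v_s * (f_s + f_a * pi_s)
    return v_theta

theta = -0.5
for i in range(200):                               # plain gradient ascent on the return
    theta += 1e-3 * svg_inf_gradient(theta, rollout(theta, seed=i))
print("feedback gain after training:", round(theta, 3))

In the full algorithm, the partial derivatives of the dynamics would come from a learned generative model fitted to observed transitions, and the policy and environment noise realizations would be retained (or inferred) so that the gradients are evaluated along real trajectories rather than model rollouts.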
exploring models and data for image question answering mengye ryan richard university of canadian institute for advanced mren rkiros zemel abstract this work aims to address the problem of qa with new models and datasets in our work we propose to use neural networks and visual semantic embeddings without intermediate stages such as object detection and image segmentation to predict answers to simple questions about images our model performs times better than the only published results on an existing image qa dataset we also present question generation algorithm that converts image descriptions which are widely available into qa form we used this algorithm to produce an larger dataset with more evenly distributed answers suite of baseline results on this new dataset are also presented introduction combining image understanding and natural language interaction is one of the grand dreams of artificial intelligence we are interested in the problem of jointly learning image and text through task recently researchers studying image caption generation have developed powerful methods of jointly learning from image and text inputs to form higher level representations from models such as convolutional neural networks cnns trained on object recognition and word embeddings trained on large scale text corpora image qa involves an extra layer of interaction between human and computers here the model needs to pay attention to details of the image instead of describing it in vague sense the problem also combines many computer vision such as image labeling and object detection in this paper we present our contributions to the problem generic qa model using visual semantic embeddings to connect cnn and recurrent neural net rnn as well as comparisons to suite of other models an automatic question generation algorithm that converts description sentences into questions and new qa dataset that was generated using the algorithm and number of baseline results on this new dataset in this work we assume that the answers consist of only single word which allows us to treat the problem as classification problem this also makes the evaluation of the models easier and more robust avoiding the thorny evaluation issues that plague generation problems related work malinowski and fritz released dataset with images and pairs the dataset for question answering on images daquar all images are from the nyu depth dataset and are taken from indoor scenes human segmentation image depth values and object labeling are available in the dataset the qa data has two sets of configurations which differ by the daquar what is there in front of the sofa ground truth table table table lstm chair cocoqa how many leftover donuts is the red bicycle holding ground truth three two three bow one cocoqa what is the color of the teeshirt ground truth blue blue orange bow green cocoqa where is the gray cat sitting ground truth window window window bow suitcase figure sample questions and responses of variety of models correct answers are in green and incorrect in red the numbers in parentheses are the probabilities assigned to the answer by the given model the leftmost example is from the daquar dataset and the others are from our new dataset number of object classes appearing in the questions and there are mainly three types of questions in this dataset object type object color and number of objects some questions are easy but many questions are very hard to answer even for humans since daquar is the only publicly available qa dataset it is one of 
our benchmarks to evaluate our models together with the release of the daquar dataset malinowski and fritz presented an approach which combines semantic parsing and image segmentation their approach is notable as one of the first attempts at image qa but it has number of limitations first possible set of predicates are very to obtain the predicates their algorithm also depends on the accuracy of the image segmentation algorithm and image depth information second their model needs to compute all possible spatial relations in the training images even though the model limits this to the nearest neighbors of the test images it could still be an expensive operation in larger datasets lastly the accuracy of their model is not very strong we show below that some simple baselines perform better very recently there has been number of parallel efforts on both creating datasets and proposing new models both antol et al and gao et al used images and created an open domain dataset with human generated questions and answers in anto et work the authors also included cartoon pictures besides real images some questions require logical reasoning in order to answer correctly both malinowski et al and gao et al use recurrent networks to encode the sentence and output the answer whereas malinowski et al use single network to handle both encoding and decoding gao et al used two networks separate encoder and decoder lastly bilingual chinese and english versions of the qa dataset are available in gao et work ma et al use cnns to both extract image features and sentence features and fuse the features together with another cnn our approach is developed independently from the work above similar to the work of malinowski et al and gao et we also experimented with recurrent networks to consume the sequential question input unlike gao et we formulate the task as classification problem as there is no single accepted metric to evaluate answer accuracy thus we place more focus on limited domain of questions that can be answered with one word we also formulate and evaluate range of other algorithms that utilize various representations drawn from the question and image on these datasets proposed methodology the methodology presented here is on the model side we develop and apply various forms of neural networks and embeddings on this task and on the dataset side we propose new ways of synthesizing qa pairs from currently available image description datasets one two red bird softmax lstm image word embedding linear cnn how many books figure model models in recent years recurrent neural networks rnns have enjoyed some successes in the field of natural language processing nlp long memory lstm is form of rnn which is easier to train than standard rnns because of its linear error propagation and multiplicative gatings our model builds directly on top of the lstm sentence model and is called the model it treats the image as one word of the question we borrowed this idea of treating the image as word from caption generation work done by vinyals et al we compare this newly proposed model with suite of simpler models in the experimental results section we use the last hidden layer of the oxford vgg conv net trained on imagenet challenge as our visual embeddings the cnn part of our model is kept frozen during training we experimented with several different word embedding models randomly initialized embedding embedding and embedding model the word embeddings are trained with the rest of the model we then treat the image as if it is 
the first word of the sentence similar to devise we use linear or affine transformation to map dimension image feature vectors to or dimensional vector that matches the dimension of the word embeddings we can optionally treat the image as the last word of the question as well through different weight matrix and optionally add reverse lstm which gets the same content but operates in backward sequential fashion the lstm outputs are fed into softmax layer at the last timestep to generate answers generation the currently available daquar dataset contains approximately images and questions on common object classes which might be not enough for training large complex models another problem with the current dataset is that simply guessing the modes can yield very good accuracy we aim to create another dataset to produce much larger number of qa pairs and more even distribution of answers while collecting human generated qa pairs is one possible approach and another is to synthesize questions based on image labeling we instead propose to automatically convert descriptions into qa form in general objects mentioned in image descriptions are easier to detect than the ones in daquar human generated questions and than the ones in synthetic qas based on ground truth labeling this allows the model to rely more on rough image understanding without any logical reasoning lastly the conversion process preserves the language variability in the original description and results in more questions than questions generated from image labeling as starting point we used the dataset but the same method can be applied to any other image description dataset such as flickr sbu or even the internet common strategies we used the stanford parser to obtain the syntatic structure of the original image description we also utilized these strategies for forming the questions compound sentences to simple sentences here we only consider simple case where two sentences are joined together with conjunctive word we split the orginial sentences into two independent sentences indefinite determiners to definite determiners the constraints in english questions tend to start with interrogative words such as what the algorithm needs to move the verb as well as the constituent to the front of the sentence for example man is riding horse becomes what is the man riding in this work we consider the following two simple constraints principle which restricts the movement of whword inside noun phrase np our algorithm does not move any that is contained in clause constituent question generation question generation is still an topic overall we adopt conservative approach to generating questions in an attempt to create questions we consider generating four types of questions below object questions first we consider asking about an object using what this involves replacing the actual object with what in the sentence and then transforming the sentence structure so that the what appears in the front of the sentence the entire algorithm has the following stages split long sentences into simple sentences change indefinite determiners to definite determiners traverse the sentence and identify potential answers and replace with what during the traversal of question generation we currently ignore all the prepositional phrase pp constituents perform in order to identify possible answer word we used wordnet and the nltk software package to get noun categories number questions we follow similar procedure as the previous algorithm except for different way to 
identify potential answers we extract numbers from original sentences splitting compound sentences changing determiners and parts remain the same color questions color questions are much easier to generate this only requires locating the color adjective and the noun to which the adjective attaches then it simply forms sentence what is the color of the object with the object replaced by the actual noun location questions these are similar to generating object questions except that now the answer traversal will only search within pp constituents that start with the preposition in we also added rules to filter out clothing so that the answers will mostly be places scenes or large objects that contain smaller objects we rejected the answers that appear too rarely or too often in our generated dataset after this qa rejection process the frequency of the most common answer words was reduced from down to in the test set of experimental results datasets table summarizes the statistics of it should be noted that since we applied the qa pair rejection process performs very poorly on however questions are actually easier to answer than daquar from human point of view this encourages the model to exploit salient object relations instead of exhaustively searching all possible relations dataset can be downloaded at http table question type ategory bject umber olor ocation otal rain est here we provide some brief statistics of the new dataset the maximum question length is and average is the most common answers are two white and red the least common are eagle tram and sofa the median answer is bed across the entire test set qas overlap in training questions and overlap in training pairs model details the first model is the cnn and lstm with weight matrix in the middle we call this in our tables and figures the second model has two image feature inputs at the start and the end of the sentence with different learned linear transformations and also has lstms going in both the forward and backward directions both lstms output to the softmax layer at the last timestep we call the second model this simple model performs multinomial logistic regression based on the image features without dimensionality reduction dimension and bow vector obtained by summing all the learned word vectors of the question full lastly the full model is simple average of the three models above we release the complete details of the models at https baselines to evaluate the effectiveness of our models we designed few baselines guess one very simple baseline is to predict the mode based on the question type for example if the question contains how many then the model will output in daquar the modes are table two and white and in the modes are cat two white and room bow we designed set of blind models which are given only the questions without the images one of the simplest blind models performs logistic regression on the bow vector to classify answers lstm another blind model we experimented with simply inputs the question words into the lstm alone img we also trained counterpart deaf model for each type of question we train separate cnn classification layer with all lower layers frozen during training note that this model knows the type of question in order to make its performance somewhat comparable to models that can take into account the words to narrow down the answer space however the model does not know anything about the question except the type this baseline combines the prior knowledge of an object and the image understanding 
from the deaf model for example question asking the color of white bird flying in the blue sky may output white rather than blue simply because the prior probability of the bird being blue is lower we denote as the color as the class of the object of interest and as the image assuming and are conditionally independent given the color this can be computed if is the output of logistic regression given the cnn features alone and we simply estimate empirically count count we use laplace smoothing on this empirical distribution in the task of image caption generation devlin et al showed that nearest neighbors baseline approach actually performs very well to see whether our model memorizes the training data for answering new question we include baseline in the results unlike image caption generation here the similarity measure includes both image and text we use the representation learned from and append it to the cnn image features we use euclidean distance as the similarity metric it is possible to improve the nearest neighbor result by learning similarity metric performance metrics to evaluate model performance we used the plain answer accuracy as well as the similarity wups measure the wups calculates the similarity between two words based on their longest common subsequence in the taxonomy tree if the similarity between two words is less than threshold then score of zero will be given to the candidate answer following malinowski and fritz we measure all models in terms of accuracy wups and wups results and analysis table summarizes the learning results on daquar and for daquar we compare our results with and it should be noted that our daquar results are for the portion of the dataset with answers after the release of our paper ma et al claimed to achieve better results on both datasets table daquar and results guess bow lstm img full human acc daquar wups wups acc wups wups from the above results we observe that our model outperforms the baselines and the existing approach in terms of answer accuracy and wups our and malinkowski et recurrent neural network model achieved somewhat similar performance on daquar simple average of all three models further boosts the performance by outperforming other models it is surprising to see that the model is very strong on both datasets one limitation of our model is that we are not able to consume image features as large as dimensions at one time step so the dimensionality reduction may lose some useful information we tried to give dim image vector and it does worse than table accuracy per category guess bow lstm img full bject umber olor ocation by comparing the blind versions of the bow and lstm models we hypothesize that in image qa tasks and in particular on the simple questions studied here sequential word interaction may not be as important as in other natural language tasks it is also interesting that the blind model does not lose much on the daquar dataset we speculate that it is likely that the imagenet images are very different from the indoor scene images which are mostly composed of furniture however the models outperform the blind models by large margin on there are three possible reasons the objects in resemble the ones in imagenet more images have fewer objects whereas the indoor scenes have considerable clutter and has more data to train complex models there are many interesting examples but due to space limitations we can only show few in figure and figure full results are available at http for some of the images we added some extra 
questions the ones have an in the question id these provide more insight into model representation of the image and question information and help elucidate questions that our models may accidentally get correct the parentheses in the figures represent the confidence score given by the softmax layer of the respective model model selection we did not find that using different word embedding has significant impact on the final classification results we observed that the word embedding results in better performance and normalizing the cnn hidden image features into and helps achieve faster training time the bidirectional lstm model can further boost the result by little object questions as the original cnn was trained for the imagenet challenge the benefited significantly from its single object recognition ability however the challenging part is to consider spatial relations between multiple objects and to focus on details of the image our models only did moderately acceptable job on this see for instance the first picture of figure and the fourth picture of figure sometimes model fails to make correct decision but outputs the most salient object while sometimes the blind model can equally guess the most probable objects based on the question alone chairs should be around the dining table nonetheless the full model improves accuracy by compared to img model which shows the difference between pure object classification and image question answering counting in daquar we could not observe any advantage in the counting ability of the and the model compared to the blind baselines in there is some observable counting ability in very clean images with single object type the models can sometimes count up to five or six however as shown in the second picture of figure the ability is fairly weak as they do not count correctly when different object types are present there is lot of room for improvement in the counting task and in fact this could be separate computer vision problem on its own color in there is significant win for the and the against the blind ones on questions we further discovered that these models are not only able to recognize the dominant color of the image but sometimes associate different colors to different objects as shown in the first picture of figure however they still fail on number of easy cocoqa what is the color of the cat ground truth black black black bow gray daquar how many chairs are there ground truth two four one lstm four cocoqa where are the ripe bananas sitting ground truth basket basket basket bow bowl daquar what is the object on the chair ground truth pillow clothes pillow lstm clothes cocoqa what is the color of the couch ground truth red red black bow red daquar how many shelves are there ground truth three three two lstm two cocoqa what are in the basket ground truth bananas bananas bananas bow bananas daquar where is the pillow found ground truth chair bed chair lstm cabinet figure sample questions and responses of our system examples adding prior knowledge provides an immediate gain on the img model in terms of accuracy on color and number questions the gap between the and shows some localized color association ability in the cnn image representation conclusion and current directions in this paper we consider the image qa problem and present our neural network models our model shows reasonable understanding of the question and some coarse image understanding but it is still very in many situations while recurrent networks are becoming popular choice for 
learning image and text we showed that simple can perform equally well compared to recurrent network that is borrowed from an image caption generation framework we proposed more complete set of baselines which can provide potential insight for developing more sophisticated image question answering systems as the currently available dataset is not large enough we developed an algorithm that helps us collect large scale image qa dataset from image descriptions our question generation algorithm is extensible to many image description datasets and can be automated without requiring extensive human effort we hope that the release of the new dataset will encourage more approaches to this problem in the future image question answering is fairly new research topic and the approach we present here has number of limitations first our models are just answer classifiers ideally we would like to permit longer answers which will involve some sophisticated text generation model or structured output but this will require an automatic answer evaluation metric second we are only focusing on limited domain of questions however this limited range of questions allow us to study the results more in depth lastly it is also hard to interpret why the models output certain answer by comparing our models with some baselines we can roughly infer whether they understood the image visual attention is another future direction which could both improve the results based on recent successes in image captioning as well as help explain the model prediction by examining the attention output at every timestep acknowledgments we would like to thank nitish srivastava for the support of toronto conv net from which we extracted the cnn image features we would also like to thank anonymous reviewers for their valuable and helpful comments references vinyals toshev bengio and erhan show and tell neural image caption generator in cvpr kiros salakhutdinov and zemel unifying embeddings with multimodal neural language models tacl karpathy joulin and deep fragment embeddings for bidirectional image sentence mapping in nips mao xu yang wang and yuille explain images with multimodal recurrent neural networks nips deep learning workshop donahue hendricks guadarrama rohrbach venugopalan saenko and darrell recurrent convolutional networks for visual recognition and description in cvpr chen and zitnick learning recurrent visual representation for image caption generation corr vol fang gupta iandola srivastava deng gao he mitchell platt zitnick and zweig from captions to visual concepts and back in cvpr xu ba kiros cho courville salakhutdinov zemel and bengio show attend and tell neural image caption generation with visual attention in icml lebret pinheiro and collobert image captioning in icml klein lev lev and wolf fisher vectors derived from hybrid mixture models for image annotations in cvpr malinowski and fritz towards visual turing challenge in nips workshop on learning semantics silberman hoiem kohli and fergus indoor segmentation and support inference from rgbd images in eccv antol agrawal lu mitchell batra zitnick and parikh vqa visual question answering corr vol malinowski rohrbach and fritz ask your neurons approach to answering questions about images corr vol gao mao zhou huang wang and xu are you talking to machine dataset and methods for multilingual image question answering corr vol ma lu and li learning to answer questions from image using convolutional neural network corr vol lin maire belongie hays perona ramanan and zitnick 
microsoft coco common objects in context in eccv chen fang lin vedantam gupta dollar and zitnick microsoft coco captions data collection and evaluation server corr vol hochreiter and schmidhuber long memory neural computation vol no pp simonyan and zisserman very deep convolutional networks for image recognition in iclr russakovsky deng su krause satheesh ma huang karpathy khosla bernstein berg and imagenet large scale visual recognition challenge ijcv mikolov chen corrado and dean efficient estimation of word representations in vector space in iclr frome corrado shlens bengio dean ranzato and mikolov devise deep embedding model in nips hodosh young and hockenmaier framing image description as ranking task data models and evaluation metrics artif intell res jair vol pp ordonez kulkarni and berg describing images using million captioned photographs in nips klein and manning accurate unlexicalized parsing in acl chomsky conditions on transformations new york academic press fellbaum wordnet an electronic lexical database cambridge ma london the mit press may bird nltk the natural language toolkit in acl devlin gupta girshick mitchell and zitnick exploring nearest neighbor approaches for image captioning corr vol wu and palmer verb semantics and lexical selection in acl malinowski and fritz approach to question answering about scenes based on uncertain input in nips 
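As a concrete reference point for the VIS+LSTM model described above — the frozen CNN feature vector is linearly mapped into the word-embedding space and treated as the first word of the question, and the LSTM's final state feeds a softmax over one-word answers — here is a minimal sketch in PyTorch. The vocabulary size, embedding and hidden dimensions, CNN feature dimension, and number of answer classes are illustrative placeholders, not the authors' settings.

import torch
import torch.nn as nn

class VisLSTM(nn.Module):
    def __init__(self, vocab_size, n_answers, img_dim=4096, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.img_proj = nn.Linear(img_dim, emb_dim)   # map CNN features into word space
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, n_answers)      # softmax over one-word answers

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_dim) precomputed, frozen CNN features
        # question_ids: (B, T) word indices of the question
        img_tok = self.img_proj(img_feat).unsqueeze(1)   # (B, 1, emb_dim)
        words = self.embed(question_ids)                 # (B, T, emb_dim)
        seq = torch.cat([img_tok, words], dim=1)         # image acts as the first word
        _, (h_n, _) = self.lstm(seq)
        return self.out(h_n[-1])                         # logits at the last time step

# toy usage with placeholder sizes
model = VisLSTM(vocab_size=5000, n_answers=500)
logits = model(torch.randn(2, 4096), torch.randint(0, 5000, (2, 8)))
print(logits.shape)  # torch.Size([2, 500])

The 2-VIS+BLSTM variant described above would additionally map the image through a second, separately learned linear transformation, append it as the last token, and run a backward LSTM, feeding both final hidden states into the softmax layer.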
efficient and robust automated machine learning matthias feurer aaron klein katharina eggensperger jost tobias springenberg manuel blum frank hutter department of computer science university of freiburg germany feurerm kleinaa eggenspk springj mblum fh abstract the success of machine learning in broad range of applications has led to an demand for machine learning systems that can be used off the shelf by to be effective in practice such systems need to automatically choose good algorithm and feature preprocessing steps for new dataset at hand and also set their respective hyperparameters recent work has started to tackle this automated machine learning automl problem with the help of efficient bayesian optimization methods building on this we introduce robust new automl system based on using classifiers feature preprocessing methods and data preprocessing methods giving rise to structured hypothesis space with hyperparameters this system which we dub auto sklearn improves on existing automl methods by automatically taking into account past performance on similar datasets and by constructing ensembles from the models evaluated during the optimization our system won the first phase of the ongoing chalearn automl challenge and our comprehensive analysis on over diverse datasets shows that it substantially outperforms the previous state of the art in automl we also demonstrate the performance gains due to each of our contributions and derive insights into the effectiveness of the individual components of auto sklearn introduction machine learning has recently made great strides in many application areas fueling growing demand for machine learning systems that can be used effectively by novices in machine learning correspondingly growing number of commercial enterprises aim to satisfy this demand microsoft azure machine learning google prediction api and amazon machine learning at its core every effective machine learning service needs to solve the fundamental problems of deciding which machine learning algorithm to use on given dataset whether and how to preprocess its features and how to set all hyperparameters this is the problem we address in this work more specifically we investigate automated machine learning automl the problem of automatically without human input producing test set predictions for new dataset within fixed computational budget formally this automl problem can be stated as follows definition automl problem for let xi rd denote feature vector and yi the corresponding target value given training dataset dtrain xn yn and the feature vectors of test dataset dtest drawn from the same underlying data distribution as well as resource budget and loss metric the automl problem is to automatically produce test set predictions pm the loss of solution to the automl problem is given by in practice the budget would comprise computational resources such as cpu wallclock time and memory usage this problem definition reflects the setting of the ongoing chalearn automl challenge the automl system we describe here won the first phase of that challenge here we follow and extend the automl approach first introduced by auto see http at its core this approach combines highly parametric machine learning framework with bayesian optimization method for instantiating well for given dataset the contribution of this paper is to extend this automl approach in various ways that considerably improve its efficiency and robustness based on principles that apply to wide range of machine learning frameworks 
such as those used by the machine learning service providers mentioned above. First, following successful previous work on low-dimensional optimization problems, we reason across datasets to identify instantiations of machine learning frameworks that perform well on a new dataset and warmstart Bayesian optimization with them. Second, we automatically construct ensembles of the models considered by Bayesian optimization. Third, we carefully design a highly parameterized machine learning framework from classifiers and preprocessors implemented in the popular machine learning framework scikit-learn. Finally, we perform an extensive empirical analysis using a diverse collection of datasets to demonstrate that the resulting auto-sklearn system outperforms previous AutoML methods, to show that each of our contributions leads to substantial performance improvements, and to gain insights into the performance of the individual classifiers and preprocessors used in auto-sklearn.

AutoML as a CASH problem. We first review the formalization of AutoML as a combined algorithm selection and hyperparameter optimization (CASH) problem, as used by Auto-WEKA's AutoML approach. Two important problems in AutoML are that no single machine learning method performs best on all datasets, and that some machine learning methods (e.g., non-linear SVMs) crucially rely on hyperparameter optimization. The latter problem has been successfully attacked using Bayesian optimization, which nowadays forms a core component of an AutoML system. The former problem is intertwined with the latter, since the rankings of algorithms depend on whether their hyperparameters are tuned properly. Fortunately, the two problems can efficiently be tackled as a single, structured, joint optimization problem.

Definition (CASH). Let $\mathcal{A} = \{A^{(1)}, \ldots, A^{(R)}\}$ be a set of algorithms, and let the hyperparameters of each algorithm $A^{(j)}$ have domain $\Lambda^{(j)}$. Further, let $D_{\text{train}} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be a training set which is split into $K$ cross-validation folds $\{D_{\text{valid}}^{(1)}, \ldots, D_{\text{valid}}^{(K)}\}$ and $\{D_{\text{train}}^{(1)}, \ldots, D_{\text{train}}^{(K)}\}$ such that $D_{\text{train}}^{(i)} = D_{\text{train}} \setminus D_{\text{valid}}^{(i)}$ for $i = 1, \ldots, K$. Finally, let $\mathcal{L}(A_{\lambda}^{(j)}, D_{\text{train}}^{(i)}, D_{\text{valid}}^{(i)})$ denote the loss that algorithm $A^{(j)}$ achieves on $D_{\text{valid}}^{(i)}$ when trained on $D_{\text{train}}^{(i)}$ with hyperparameters $\lambda$. Then the combined algorithm selection and hyperparameter optimization (CASH) problem is to find the joint algorithm and hyperparameter setting that minimizes this loss:
$$A^{\star}, \lambda_{\star} \in \operatorname*{argmin}_{A^{(j)} \in \mathcal{A},\, \lambda \in \Lambda^{(j)}} \; \frac{1}{K} \sum_{i=1}^{K} \mathcal{L}\big(A_{\lambda}^{(j)}, D_{\text{train}}^{(i)}, D_{\text{valid}}^{(i)}\big).$$

This CASH problem was first tackled by Thornton et al. in the Auto-WEKA system, which combines the machine learning framework WEKA with tree-based Bayesian optimization methods. In a nutshell, Bayesian optimization fits a probabilistic model to capture the relationship between hyperparameter settings and their measured performance; it then uses this model to select the most promising hyperparameter setting (trading off exploration of new parts of the space against exploitation in known good regions), evaluates that hyperparameter setting, updates the model with the result, and iterates. While Bayesian optimization based on Gaussian process models (e.g., Snoek et al.) performs best in low-dimensional problems with numerical hyperparameters, tree-based models have been shown to be more successful in high-dimensional, structured, and partly discrete problems such as the CASH problem, and are also used in the AutoML system Hyperopt-sklearn. Among the tree-based Bayesian optimization methods, Thornton et al. found the random-forest-based SMAC to outperform the tree Parzen estimator (TPE), and we therefore use SMAC to solve the CASH problem in this paper. Next to its use of random forests, SMAC's main distinguishing feature is that it allows fast cross-validation by evaluating one fold at a time and discarding poorly performing hyperparameter settings early.

Figure 1 (our improved AutoML approach): we add two components to Bayesian hyperparameter optimization of an ML framework: meta-learning for initializing the Bayesian optimizer, and automated ensemble construction from configurations evaluated during optimization. (The diagram shows the AutoML system taking $\{X_{\text{train}}, Y_{\text{train}}, X_{\text{test}}\}$, a budget, and a loss metric as input; a meta-learning stage feeding the Bayesian optimizer; an ML framework consisting of a data preprocessor, feature preprocessor, and classifier; and a final build-ensemble stage.)

New methods for increasing efficiency and robustness of AutoML. We now discuss our two improvements of the AutoML approach. First, we include a meta-learning step to warmstart the Bayesian optimization procedure, which results in a considerable boost in efficiency. Second, we include an automated ensemble construction step, allowing us to use all classifiers that were found by Bayesian optimization. Figure 1 summarizes the overall AutoML workflow, including both of our improvements. We note that we expect their effectiveness to be greater for flexible ML frameworks that offer many degrees of freedom (many algorithms, hyperparameters, and preprocessing methods).

Meta-learning for finding good instantiations of machine learning frameworks. Domain experts derive knowledge from previous tasks: they learn about the performance of machine learning algorithms. The area of meta-learning mimics this strategy by reasoning about the performance of learning algorithms across datasets. In this work, we apply meta-learning to select instantiations of our given machine learning framework that are likely to perform well on a new dataset. More specifically, for a large number of datasets, we collect both performance data and a set of meta-features, i.e., characteristics of the dataset that can be computed efficiently and that help to determine which algorithm to use on a new dataset.

This meta-learning approach is complementary to Bayesian optimization for optimizing an ML framework: meta-learning can quickly suggest some instantiations of the ML framework that are likely to perform quite well, but it is unable to provide fine-grained information on performance. In contrast, Bayesian optimization is slow to start for hyperparameter spaces as large as those of entire ML frameworks, but can fine-tune performance over time. We exploit this complementarity by selecting configurations based on meta-learning and using their result to seed Bayesian optimization. This approach of warmstarting optimization by meta-learning has already been successfully applied before, but never to an optimization problem as complex as that of searching the space of instantiations of a full ML framework. Likewise, learning across datasets has also been applied in collaborative Bayesian optimization methods; while these approaches are promising, they are so far limited to very few meta-features and cannot yet cope with the high-dimensional, partially discrete configuration spaces faced in AutoML.

More precisely, our meta-learning approach works as follows. In an offline phase, for each machine learning dataset in a dataset repository (in our case, datasets from the OpenML repository), we evaluated a set of meta-features (described below) and used Bayesian optimization to determine and store an instantiation of the given ML framework with strong empirical performance for that dataset. (In detail, we ran SMAC for several hours with cross-validation on two thirds of the data and stored the resulting ML framework instantiation that exhibited the best performance on the remaining third.) Then, given a new dataset, we compute its meta-features, rank all datasets by their distance to the new dataset in meta-feature space, and select the stored ML framework instantiations for the nearest datasets for evaluation before starting Bayesian optimization with their results.

To characterize datasets, we implemented a set of meta-features from the literature, including simple, information-theoretic, and statistical meta-features, such as statistics about the number of data points, features, and classes, as well as data skewness and the entropy of the targets.
targets all are listed in table of the supplementary material notably we had to exclude the prominent and effective category of landmarking which measure the performance of simple base learners because they were computationally too expensive to be helpful in the online evaluation phase we note that this approach draws its power from the availability of repository of datasets due to recent initiatives such as openml we expect the number of available datasets to grow ever larger over time increasing the importance of automated ensemble construction of models evaluated during optimization while bayesian hyperparameter optimization is in finding the hyperparameter setting we note that it is very wasteful procedure when the goal is simply to make good predictions all the models it trains during the course of the search are lost usually including some that perform almost as well as the best rather than discarding these models we propose to store them and to use an efficient method which can be run in second process to construct an ensemble out of them this automatic ensemble construction avoids to commit itself to single hyperparameter setting and is thus more robust and less prone to overfitting than using the point estimate that standard hyperparameter optimization yields to our best knowledge we are the first to make this simple observation which can be applied to improve any bayesian hyperparameter optimization method it is well known that ensembles often outperform individual models and that effective ensembles can be created from library of models ensembles perform particularly well if the models they are based on are individually strong and make uncorrelated errors since this is much more likely when the individual models are different in nature ensemble building is particularly well suited for combining strong instantiations of flexible ml framework however simply building uniformly weighted ensemble of the models found by bayesian optimization does not work well rather we found it crucial to adjust these weights using the predictions of all individual models on set we experimented with different approaches to optimize these weights stacking numerical optimization and the method ensemble selection while we found both numerical optimization and stacking to overfit to the validation set and to be computationally costly ensemble selection was fast and robust in nutshell ensemble selection introduced by caruana et al is greedy procedure that starts from an empty ensemble and then iteratively adds the model that maximizes ensemble validation performance with uniform weight but allowing for repetitions procedure in the supplementary material describes it in detail we used this technique in all our experiments building an ensemble of size practical automated machine learning system to design robust automl system as feature estimator preprocessing classifier our underlying ml framework we chose preprocessor one of the best known adaboost rf knn pca none fast ica and most widely used machine learning learning rate estimators max depth data libraries it offers wide range of well espreprocessor tablished and ml imputation balancing rescaling one hot enc algorithms and is easy to use for both experts and beginners since our automl mean median weighting standard none system closely resembles auto but like sklearn is based figure structured configuration space squared boxes on we dub it auto sklearn denote parent hyperparameters whereas boxes with rounded figure depicts auto sklearn overall edges are 
leaf hyperparameters grey colored boxes mark components it comprises classification active hyperparameters which form an example configuration and machine learning pipeline each pipeline comprises one algorithms preprocessing methods and feature preprocessor classifier and up to three data data preprocessing methods we cessor methods plus respective hyperparameters eterized each of them which resulted in space of hyperparameters most of these are conditional hyperparameters that are only active if their respective component is selected we note that smac can handle this conditionality natively all classification algorithms in auto sklearn are listed in table and described in detail in section of the supplementary material they fall into different categories such as general linear models algorithms support vector machines discriminant analysis nearest neighbors bayes decision trees and ensembles in contrast to auto we name name cat cond cont cond adaboost ab bernoulli bayes decision tree dt extreml rand trees gaussian bayes gradient boosting gb knn lda linear svm kernel svm multinomial bayes passive aggressive qda random forest rf linear class sgd classification algorithms cat cond cont cond extreml rand trees prepr fast ica feature agglomeration kernel pca rand kitchen sinks linear svm prepr no preprocessing nystroem sampler pca polynomial random trees embed select percentile select rates encoding imputation balancing rescaling preprocessing methods table number of hyperparameters for each possible classifier left and feature preprocessing method right for binary classification dataset in dense representation tables for sparse binary classification and multiclass classification datasets can be found in the section of the supplementary material tables and we distinguish between categorical cat hyperparameters with discrete values and continuous cont numerical hyperparameters numbers in brackets are conditional hyperparameters which are only relevant when another parameter has certain value focused our configuration space on base classifiers and excluded and ensembles that are themselves parameterized by one or more base classifiers while such ensembles increased auto number of hyperparameters by almost factor of five to auto sklearn only features hyperparameters we instead construct complex ensembles using our method from section compared to auto this is much more in auto weka evaluating the performance of an ensemble with components requires the construction and evaluation of models in contrast in auto sklearn ensembles come largely for free and it is possible to mix and match models evaluated at arbitrary times during the optimization the preprocessing methods for datasets in dense representation in auto sklearn are listed in table and described in detail in section of the supplementary material they comprise data preprocessors which change the feature values and are always used when they apply and feature preprocessors which change the actual set of features and only one of which or none is used data preprocessing includes rescaling of the inputs imputation of missing values encoding and balancing of the target classes the possible feature preprocessing methods can be categorized into feature selection kernel approximation matrix decomposition embeddings feature clustering polynomial feature expansion and methods that use classifier for feature selection for example linear svms fitted to the data can be used for feature selection by eliminating features corresponding to model coefficients 
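to make the greedy ensemble selection procedure described in the previous section concrete, the following is a minimal python/numpy sketch. it is an illustration under simple assumptions (per-model class-probability predictions on a validation set, plain classification error as the validation metric, an arbitrary example ensemble size) rather than the implementation used in auto-sklearn, and all names here are hypothetical.

import numpy as np

def ensemble_selection(val_predictions, y_val, ensemble_size=50):
    # greedy ensemble selection (caruana et al.): start from an empty
    # ensemble and repeatedly add, with replacement, the model whose
    # inclusion minimizes the error of the uniformly weighted average.
    # val_predictions: list of (n_samples, n_classes) probability arrays,
    #                  one per stored model; y_val: integer class labels.
    n_models = len(val_predictions)
    counts = np.zeros(n_models, dtype=int)
    running_sum = np.zeros_like(val_predictions[0], dtype=float)

    def error(avg_probs):
        # classification error of the averaged predictions
        return np.mean(np.argmax(avg_probs, axis=1) != y_val)

    for it in range(ensemble_size):
        scores = np.empty(n_models)
        for j in range(n_models):
            candidate = (running_sum + val_predictions[j]) / (it + 1)
            scores[j] = error(candidate)
        best = int(np.argmin(scores))   # model that helps the current ensemble most
        counts[best] += 1               # repetitions are allowed
        running_sum += val_predictions[best]

    return counts / counts.sum()       # uniform-with-repetition weights

the returned weights are then used to average the corresponding models' predictions on new data; in practice the metric inside the loop would be whatever validation measure the search itself optimizes (for example balanced error rate).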
as with every robust system we had to handle many more important details in auto sklearn we describe these in section of the supplementary material comparing auto sklearn to auto and sklearn as baseline experiment we compared the performance of vanilla auto sklearn without our improvements to auto and sklearn reproducing the experimental setup with datasets of the paper introducing auto we describe this setup in detail in section in the supplementary material table shows that auto sklearn performed statistically significantly better than auto in cases tied it in cases and lost against it in for the three datasets where auto weka performed best we found that in more than of its runs the best classifier it chose is not implemented in trees with pruning component so far sklearn is more of inviting the user to adapt the configuration space to her own needs than full automl system the current version crashes when presented with sparse data and missing values it also crashes on due to memory limit which we set for all optimizers to enable yeast waveform wine quality shuttle semeion secom mrbi madelon mnist basic appetency gisette german credit dexter convex small dorothea car amazon abalone as aw hs table test set classification error of auto aw vanilla auto sklearn as and yperoptsklearn hs as in the original evaluation of auto we show median percent error across bootstrap samples based on runs simulating parallel runs bold numbers indicate the best result underlined results are not statistically significantly different from the best according to bootstrap test with average rank vanilla ensemble ensemble time sec figure average rank of all four auto sklearn variants ranked by balanced test error rate ber across datasets note that ranks are relative measure of performance here the rank of all methods has to add up to and hence an improvement in ber of one method can worsen the rank of another the supplementary material shows the same plot on to show the time overhead of and ensemble computation fair comparison on the datasets on which it ran it statistically tied the best optimizer in cases and lost against it in evaluation of the proposed automl improvements in order to evaluate the robustness and general applicability of our proposed automl system on broad range of datasets we gathered binary and multiclass classification datasets from the openml repository only selecting datasets with at least data points to allow robust performance evaluations these datasets cover diverse range of applications such as text classification digit and letter recognition gene sequence and rna classification advertisement particle classification for telescope data and cancer detection in tissue samples we list all datasets in table and in the supplementary material and provide their unique openml identifiers for reproducibility since the class distribution in many of these datasets is quite imbalanced we evaluated all automl methods using measure called balanced classification error rate ber we define balanced error rate as the average of the proportion of wrong classifications in each class in comparison to standard classification error the average overall error this measure the average of the error assigns equal weight to all classes we note that balanced error or accuracy measures are often used in machine learning competitions the automl challenge uses balanced accuracy we performed runs of auto sklearn both with and without and with and without ensemble prediction on each of the datasets to study their 
performance under rigid time constraints and also due to computational resource constraints we limited the cpu time for each run to hour we also limited the runtime for single model to tenth of this minutes to not evaluate performance on data sets already used for we performed validation when evaluating on dataset we only used from the other datasets figure shows the average ranks over time of the four auto sklearn versions we tested we observe that both of our new methods yielded substantial improvements over vanilla auto sklearn the most striking result is that yielded drastic improvements starting with the first openml dataset id sklearn adaboost bernoulli bayes decision tree extreml rand trees gaussian bayes gradient boosting knn lda linear svm kernel svm multinomial bayes passive aggresive qda random forest linear class sgd table median balanced test error rate ber of optimizing auto sklearn subspaces for each classification openml dataset id sklearn densifier extreml rand trees prepr fast ica feature agglomeration kernel pca rand kitchen sinks linear svm prepr no preproc nystroem sampler pca polynomial random trees embed select percentile classification select rates truncatedsvd method and all preprocessors as well as the whole configuration space of auto sklearn on datasets all optimization runs were allowed to run for hours except for auto sklearn which ran for hours bold numbers indicate the best result underlined results are not statistically significantly different from the best according to bootstrap test using the same setup as for table table like table but instead optimizing subspaces for each preprocessing method and all classifiers configuration it selected and lasting until the end of the experiment we note that the improvement was most pronounced in the beginning and that over time vanilla auto sklearn also found good solutions without letting it catch up on some datasets thus improving its overall rank moreover both of our methods complement each other our automated ensemble construction improved both vanilla auto sklearn and auto sklearn with interestingly the ensemble influence on the performance started earlier for the version we believe that this is because produces better machine learning models earlier which can be directly combined into strong ensemble but when run longer vanilla auto sklearn without also benefits from automated ensemble construction detailed analysis of auto sklearn components we now study auto sklearn individual classifiers and preprocessors compared to jointly optimizing all methods in order to obtain insights into their peak performance and robustness ideally we would have liked to study all combinations of single classifier and single preprocessor in isolation but with classifiers and preprocessors this was infeasible rather when studying the performance of single classifier we still optimized over all preprocessors and vice versa to obtain more detailed analysis we focused on subset of datasets but extended the configuration budget for optimizing all methods from one hour to one day and to two days for auto sklearn specifically we clustered our datasets with based on the dataset and used one dataset from each of the resulting clusters see table in the supplementary material for the list of datasets we note that in total these extensive experiments required cpu years table compares the results of the various classification methods against auto sklearn overall as expected random forests extremely randomized trees adaboost and gradient 
boosting showed gradient boosting kernel svm random forest gradient boosting kernel svm random forest balanced error rate balanced error rate time sec mnist openml dataset id time sec promise openml dataset id figure performance of subset of classifiers compared to auto sklearn over time we show median test error rate and the fifth and percentile over time for optimizing three classifiers separately with optimizing the joint space plot with all classifiers can be found in figure in the supplementary material while auto sklearn is inferior in the beginning in the end its performance is close to the best method the most robust performance and svms showed strong peak performance for some datasets besides variety of strong classifiers there are also several models which could not compete the decision tree passive aggressive knn gaussian nb lda and qda were statistically significantly inferior to the best classifier on most datasets finally the table indicates that no single method was the best choice for all datasets as shown in the table and also visualized for two example datasets in figure optimizing the joint configuration space of auto sklearn led to the most robust performance plot of ranks over time figure and in the supplementary material quantifies this across all datasets showing that auto sklearn starts with reasonable but not optimal performance and effectively searches its more general configuration space to converge to the best overall performance over time table compares the results of the various preprocessors against auto sklearn as for the comparison of classifiers above auto sklearn showed the most robust performance it performed best on three of the datasets and was not statistically significantly worse than the best preprocessor on another of discussion and conclusion we demonstrated that our new automl system auto sklearn performs favorably against the previous state of the art in automl and that our and ensemble improvements for automl yield further efficiency and robustness this finding is backed by the fact that auto sklearn won the in the first phase of chalearn ongoing automl challenge in this paper we did not evaluate the use of auto sklearn for interactive machine learning with an expert in the loop and weeks of cpu power but we note that that mode has also led to third place in the human track of the same challenge as such we believe that auto sklearn is promising system for use by both machine learning novices and experts the source code of auto sklearn is available under an open source license at https our system also has some shortcomings which we would like to remove in future work as one example we have not yet tackled regression or problems most importantly though the focus on implied focus on small to datasets and an obvious direction for future work will be to apply our methods to modern deep learning systems that yield performance on large datasets we expect that in that domain especially automated ensemble construction will lead to tangible performance improvements over bayesian optimization acknowledgments this work was supported by the german research foundation dfg under priority programme autonomous learning spp grant hu under emmy noether grant hu and under the brainlinksbraintools cluster of excellence grant number exc references guyon bennett cawley escalante escalera ho ray saeed statnikov and viegas design of the chalearn automl challenge in proc of ijcnn thornton hutter hoos and combined selection and hyperparameter optimization of 
classification algorithms in proc of kdd pages brochu cora and de freitas tutorial on bayesian optimization of expensive cost functions with application to active user modeling and hierarchical reinforcement learning corr feurer springenberg and hutter initializing bayesian hyperparameter optimization via metalearning in proc of aaai pages reif shafait and dengel for evolutionary parameter optimization of classifiers machine learning gomes soares rossi and carvalho combining and search techniques to select parameters for support vector machines neurocomputing pedregosa varoquaux gramfort michel thirion grisel blondel prettenhofer weiss dubourg vanderplas passos cournapeau brucher perrot and duchesnay machine learning in python jmlr hall frank holmes pfahringer reutemann and witten the weka data mining software an update sigkdd hutter hoos and sequential optimization for general algorithm configuration in proc of lion pages bergstra bardenet bengio and algorithms for optimization in proc of nips pages snoek larochelle and adams practical bayesian optimization of machine learning algorithms in proc of nips pages eggensperger feurer hutter bergstra snoek hoos and towards an empirical foundation for assessing bayesian optimization of hyperparameters in nips workshop on bayesian optimization in theory and practice komer bergstra and eliasmith automatic hyperparameter configuration for in icml workshop on automl breiman random forests mlj brazdil soares and vilalta metalearning applications to data mining springer bardenet brendel and sebag collaborative hyperparameter tuning in proc of icml pages yogatama and mann efficient transfer learning method for automatic hyperparameter tuning in proc of aistats pages vanschoren van rijn bischl and torgo openml networked science in machine learning sigkdd explorations michie spiegelhalter taylor and campbell machine learning neural and statistical classification ellis horwood kalousis algorithm selection via phd thesis university of geneve pfahringer bensusan and by landmarking various learning algorithms in proc of icml pages guyon saffari dror and cawley model selection beyond the divide jmlr lacoste marchand laviolette and larochelle agnostic bayesian learning of ensembles in proc of icml pages caruana crew and ksikes ensemble selection from libraries of models in proc of icml page caruana munson and getting the most out of ensemble selection in proc of icdm pages wolpert stacked generalization neural networks hamerly and elkan learning the in in proc of nips pages proc of icml 
preconditioned spectral descent for deep learning david edo lawrence volkan department of statistics columbia university laboratory for information and inference systems lions epfl department of electrical and computer engineering duke university abstract deep learning presents notorious computational challenges these challenges include but are not limited to the of learning objectives and estimating the quantities needed for optimization algorithms such as gradients while we do not address the we present an optimization solution that exploits the so far unused geometry in the objective function in order to best make use of the estimated gradients previous work attempted similar goals with preconditioned methods in the euclidean space such as rmsprop and adagrad in stark contrast our approach combines gradient method with preconditioning we provide evidence that this combination more accurately captures the geometry of the objective function compared to prior work we theoretically formalize our arguments and derive novel preconditioned algorithms the results are promising in both computational time and quality when applied to restricted boltzmann machines feedforward neural nets and convolutional neural nets introduction in spite of the many great successes of deep learning efficient optimization of deep networks remains challenging open problem due to the complexity of the model calculations the nature of the implied objective functions and their inhomogeneous curvature it is established both theoretically and empirically that finding local optimum in many tasks often gives comparable performance to the global optima so the primary goal is to find local optimum quickly it is speculated that an increase in computational power and training efficiency will drive performance of deep networks further by utilizing more complicated networks and additional data stochastic gradient descent sgd is the most widespread algorithm of choice for practitioners of machine learning however the objective functions typically found in deep learning problems such as neural networks and restricted boltzmann machines rbms have inhomogeneous curvature rendering sgd ineffective common technique for improving efficiency is to use adaptive methods for sgd where each layer in deep model has an independent methods have shown promising results in networks with sparse penalties and factorized second order approximations have also shown improved performance popular alternative to these methods is to use an adaptive learning rate which has shown improved performance in adagrad adadelta and rmsprop the foundation of all of the above methods lies in the hope that the objective function can be wellapproximated by euclidean frobenius or norms however recent work demonstrated that the matrix of connection weights in an rbm has tighter majorization bound on the objective function with respect to the norm compared to the frobenius norm majorizationminimization approach with the majorization bound leads to an algorithm denoted as stochastic spectral descent ssd which sped up the learning of rbms and other probabilistic models however this approach does not directly generalize to other deep models as it can suffer from loose majorization bounds in this paper we combine recent gradient methods with adaptive learning rates and show their applicability to variety of models specifically our contributions are we demonstrate that the objective function in feedforward neural nets is naturally bounded by the norm this motivates the 
application of the ssd algorithm developed in which explicitly treats the matrix parameters with matrix norms as opposed to vector norms ii we develop natural generalization of adaptive methods adagrad rmsprop to the noneuclidean gradient setting that combines adaptive methods with gradient methods these algorithms have robust tuning parameters and greatly improve the convergence and the solution quality of ssd algorithm via local adaptation we denote these new algorithms as rmsspectral and adaspectral to mark the relationships to stochastic spectral descent and rmsprop and adagrad iii we develop fast approximation to our algorithm iterates based on the randomized svd algorithm this greatly reduces the overhead when using the norm iv we empirically validate these ideas by applying them to rbms deep belief nets feedforward neural nets and convolutional neural nets we demonstrate major speedups on all models and demonstrate improved fit for the rbm and the deep belief net we denote vectors as bold letters and matrices letters operations and denote multiplication and division and the square root of denotes the matrix with all entries denotes the standard norm of denotes the norm of which is with the singular values of is the largest singular value of which is also known as the matrix or the spectral norm preconditioned algorithms we first review gradient descent algorithms in section section motivates and discusses preconditioned gradient descent dynamic preconditioners are discussed in section and fast approximations are discussed in section gradient descent unless otherwise mentioned proofs for this section may be found in consider the minimization of closed proper convex function with lipschitz gradient lp where and are dual to each other and lp is the smoothness constant this lipschitz gradient implies the following majorization bound which is useful in optimization xi lp natural strategy to minimize is to iteratively minimize the side of defining the as arg maxx hs xi this approach yields the algorithm xk lp xk where is the iteration count for is simply gradient descent and in general can be viewed as gradient descent in norm to explore which norm leads to the fastest convergence we note the convergence rate of lp where is minimizer of if we have an lp such that is xk holds and lp then can lead to superior convergence one such example is presented in where the authors proved that improves dimensiondependent factor over gradient descent for class of problems in computer science moreover they showed that the algorithm in demands very little computational overhead for their problems and hence is favored over nor shape pr econditioned adient figure updates from parameters wk for multivariate logistic regression left order approximation error at parameter wk with singular vectors of middle order approximation error at parameter wk with singular vectors of with preconditioner matrix right shape of the error implied by frobenius norm and the norm after preconditioning the error surface matches the shape implied by the norm and not the frobenius norm pn as noted in for the function lse log ωi exp αi the constant is and log whereas the constant is if are possibly dependent zero mean random variables the convergence for the objective function is improved by at least see supplemental section for details as well gradient descent can be adapted to the stochastic setting the function reoccurs frequently in the cost function of deep learning models analyzing the majorization bounds that are dependent 
on the function with respect to the model parameters in deep learning reveals majorization functions dependent on the norm this was shown previously for the rbm in and we show general approach in supplemental section and specific results for neural nets in section hence we propose to optimize these deep networks with the norm preconditioned gradient descent it has been established that the loss functions of neural networks exhibit pathological curvature the loss function is essentially flat in some directions while it is highly curved in others the regions of high curvature dominate the in gradient descent solution to the above problem is to rescale the parameters so that the loss function has similar curvature along all directions the basis of recent adative methods adagrad rmsprop is in preconditioned gradient descent with iterates xk dk xk we restrict without loss of generality the preconditioner dk to positive definite diagonal matrix and is chosen letting hy xid hy dxi and hx xid we note that the iteration in corresponds to the minimizer of xk xk xk xk consequently for to perform well has to either be good approximation or tight upper bound of the true function value this is equivalent to saying that the first order approximation error xk xk is better approximated by the scaled euclidean norm the preconditioner dk controls the scaling and the choice of dk depends on the objective function as we are motivated to use norms for our models the above reasoning leads us to consider variable metric approximation for matrix let us denote to be an preconditioner note that is not diagonal matrix in this case because the operations here are this would correspond to the case above with vectorized form of and preconditioner of diag vec let we consider the following surrogate of xk xk xk xk using the from section the minimizer of takes the form see supplementary section for the proof xk xk dk dk we note that classification with softmax link naturally operates on the norm as an illustrative example of the applicability of this norm we show the first order approximation error for the objective function in this model where the distribution on the class depends on covariates categorical softmax wx figure left shows the error surfaces on without the preconditioner where the uneven curvature will lead to poor updates the jacobi diagonal of the hessian preconditioned error surface is shown in figure middle where the curvature has been made homogeneous however the shape of the error does not follow the euclidean frobenius norm but instead the geometry from the norm shown in figure right since many deep networks use the softmax and to define probability distribution over possible classes adapting to the the inherent geometry of this function can benefit learning in deeper layers dynamic learning of the preconditioner our algorithms amount to choosing an and preconditioner dk we propose to use the preconditioner from adagrad and rmsprop these preconditioners are given below αvk xk xk rmsprop vk xk xk adagrad the term is tuning parameter controlling the extremes of the curvature in the preconditioner the updates in adagrad have provably improved regret bound guarantees for convex problems over gradient descent with the iterates in adagrad and adadelta have been applied successfully to neural nets the updates in rmsprop were shown in to approximate the equilibration preconditioner and have also been successfully applied in autoencoders and supervised neural nets both methods require tuning parameter and 
rmsprop also requires term that controls historical smoothing we propose two novel algorithms that both use the iterate in the first uses the adagrad preconditioner which we call adaspectral the second uses the rmsprop preconditioner which we call rmsspectral the and fast approximations letting udiag vt be the svd of the for the norm also known as the spectral norm can be computed as follows uvt depending on the cost of the gradient estimation this computation may be relatively cheap or quite expensive in situations where the gradient estimate is relatively cheap the exact demands significant overhead instead of calculating the full svd we utilize randomize svd algorithm for this reduces the cost from to log with the number of projections used in the algorithm letting represent the approximate svd then the approximate corresponds to the approximation and the reweighted remainder diag we note that the is also defined for the norm however for notational clarity we will denote this as and leave the notation for the case this solution was given in as pseudocode for these operations is in the supplementary materials applicability of bounds to models restricted boltzmann machines rbm rbms are bipartite markov random field models that form probabilistic generative models over collection of data they are useful both as generative models and for deep networks in the binary case the observations are binary with connections to latent hidden binary units the probability for each state is defined by parameters with the energy ct wh ht and probability pθ the maximum likelihood pestimator implies the objective function minθ log exp vn log exp vn this objective function is generally intractable although an accurate but computationally intensive esalgorithm rmsspectral for rbms timator is given via annealed importance sampling inputs nb ais the gradient can be comparatively parameters quickly estimated by taking small number of gibbs history terms vw vb vc sampling steps in monte carlo integration scheme for do contrastive divergence due to the noisy sample minibatch of size nb nature of the gradient estimation and the intractable estimate gradient dw db dc objective function second order methods and line update matrix parameter search methods are inappropriate and sgd has travw αv ditionally been used proposed an upper dw dw dw vw bound on perturbations to of dw dw dw update bias term ui vb αvb db db this majorization motivated the stochastic specdb vb tral descent ssd algorithm which uses the db db db operator in section in addition bias parameters same for and were bound on the norm and use the upend for dates from section in their experiments this method showed significantly improved performance over competing algorithm for of and number of gibbs sweeps where the computational cost of the is insignificant this motivates using the preconditioned spectral descent methods and we show our proposed rmsspectral method in algorithm when the rbm is used to deep models is typically used gibbs sweep one such model is the deep belief net where parameters are effectively learned by repeatedly learning rbm models in this case the svd operation adds significant overhead therefore the fast approximation of section and the adaptive methods result in vast improvements these enhancements naturally extend to the deep belief net and results are detailed in section algorithm rmsspectral for fnn inputs nb parameters wl history terms vl for do sample minibatch of size nb estimate gradient by backprop dw for do αv dw dw 
supervised feedforward neural nets feedforward neural nets are widely used models for classification problems we consider layers of hidden variables with deterministic nonlinear link functions with softmax classifier at the final layer ignoring bias terms for clarity an input is mapped through linear transformation and nonlinear link function to give the first layer of hidden nodes this process continues with at the last layer we dw set wl αl and an class vector end for is drawn categorical softmax the stanend for dard approach for parameter learning is to minimize the objective function that corresponds to the penalized maximum likelihood objective function over the parameters wl and data examples xn which is given by pn pj arg minθ hn log exp hn while there have been numerous recent papers detailing different optimization approaches to this objective we are unaware of any approaches that attempt to derive bounds as result we explore the properties of this objective function we show the key results here and provide further details on the general framework in supplemental section and the specific derivation in supplemental section by using properties of the function mnist training training sgd adagrad rmsprop adaspectral rmsspectral ssd normalized time thousands sgd adagrad rmsprop ssd adaspectral rmsspectral log mnist training log reconstruction error normalized time thousands sgd adagrad rmsprop ssd adaspectral rmsspectral normalized time thousands figure normalized time unit is sgd iteration left this shows the reconstruction error from training the mnist dataset using middle of training silhouettes using persistent right of training mnist using persistent from the objective function from has an upper bound pn θi maxj hn hn hn hn we note that this implicitly requires the link function to have lipschitz continuous gradient many commonly used links including logistic hyperbolic tangent and smoothed rectified linear units have lipschitz continuous gradients but rectified linear units do not in this case we will just proceed with the subgradient strict upper bound on these parameters is highly pessimistic so instead we propose to take local approximation around the parameter in each layer individually considering perturbation around the terms in have the following upper bounds hθ hj maxx hφ hθ hj hj maxx hθ and dt where dt because both and hj can easily be calculated during the standard backpropagation procedure for gradient estimation this can be calculated without significant overhead since these equations are bounded on the norm this motivates using the stochastic spectral descent algorithm with the is applied to the weight matrix for each layer individually however the proposed updates require the calculation of many additional terms as well they are pessimistic and do not consider the inhomogenous curvature instead of attempting to derive the both rmsspectral and adaspectral will learn appropriate by using the gradient history then the preconditioned is applied to the weights from each layer individually the rmsspectral method for neural nets is shown in algorithm it is unclear how to use geometry for convolution layers as the pooling and convolution create alternative geometries however the adaspectral and rmsspectral algorithms can be applied to convolutional neural nets by using the steps on the dense layers and linear updates from adagrad and rmsprop on the convolutional filters the benefits from the dense layers then propagate down to the convolutional layers experiments 
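the experiments below all apply updates of the following form to the dense weight matrices. as a concrete reference, here is a minimal numpy sketch of one rmsspectral-style step; it is an illustration under stated assumptions (the placement of the damping term, a fixed step size, and an exponential moving average of squared gradients), not the authors' reference implementation, and an adagrad-style variant would simply accumulate the squared gradients instead of averaging them.

import numpy as np

def sharp_spectral(s):
    # #-operator for the spectral (schatten-infinity) norm:
    # for s = u diag(sigma) v^T it returns (sum_i sigma_i) * u v^T
    u, sigma, vt = np.linalg.svd(s, full_matrices=False)
    return sigma.sum() * (u @ vt)

def rmsspectral_step(w, grad, v, step_size=1e-3, rho=0.9, lam=1e-4):
    # w, grad: parameter matrix and its stochastic gradient (same shape)
    # v: running elementwise average of squared gradients (optimizer state)
    v = rho * v + (1.0 - rho) * grad ** 2      # rmsprop-style gradient history
    d = lam + np.sqrt(v)                       # elementwise preconditioner
    precond_grad = grad / np.sqrt(d)           # d^(-1/2) o gradient
    w = w - step_size * sharp_spectral(precond_grad) / np.sqrt(d)
    return w, v

applied layer by layer, this keeps the adaptive elementwise scaling of rmsprop while taking the step in the geometry suggested by the spectral norm; for convolutional layers the same gradient history can instead drive a plain rmsprop update, as described above.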
restricted boltzmann machines to show the use of the approximate from section as well as rmsspec and adaspec we first perform experiments on the mnist dataset the dataset was binarized as in we detail the algorithmic setting used in these experiments in supplemental table which are chosen to match previous literature on the topic the batch size was chosen to be data points which matches this is larger than is typical in the rbm literature but we found that all algorithms improved their final results with larger due to reduction in sampling noise the analysis supporting the ssd algorithm does not directly apply to the learning procedure so it is of interest to examine how well it generalizes to this framework to examine the effect of learning we used reconstruction error with hidden latent variables reconstruction error is standard heuristic for analyzing convergence and is defined by taking where is an observation and is the mean value for pass from that sample this result is shown in figure left with all algorithms normalized to the amount of time it takes for single sgd iteration the full in the ssd algorithm adds significant overhead to each iteration so the ssd algorithm does not provide competitive performance in this situation the adaspectral and rmsspectral algorithms use the approximate combining the adaptive nature of rmsprop with optimization provides dramatically improved performance seemingly converging faster and to better optimum high cd orders are necessary to fit the ml estimator of an rbm to this end we use the persistent cd method of with gibbs sweeps per iteration we show the of the training data as function of time in figure middle the is estimated using ais with the parameters and code from there is clear divide with improved performance from the based methods there is further improved performance by including preconditioners as well as showing improved training the test set has an improved of for rmsspec and for ssd for further exploration we trained deep belief net with two hidden layers of size to match we trained the first hidden layer with and rmsspectral and the second layer with and rmsspectral we used the same model sizes tuning parameters and evaluation parameters and code from so the only change is due to the optimization methods our estimated on the performance of this model is on the test set this compares to from and for deep boltzmann machine from however we caution that these numbers no longer reflect true performance on the test set due to bias from ais and repeated overfitting however this is fair comparison because we use the same settings and the evaluation code for further evidence we performed the same experiment on the silhouettes dataset this dataset was previously used to demonstrate the effectiveness of an adaptive gradient and enhanced gradient method for restricted boltzmann machines the training curves for the are shown in figure right here the methods based on the norm give results in under iterations and thoroughly dominate the learning furthermore both adaspectral and rmsspectral saturate to higher value on the training set and give improved testing performance on the test set the best result from the noneuclidean methods gives testing of for rmsspectral and value of for rmsprop these values all improve over the best reported value from sgd of standard and convolutional neural networks compared to rbms and other popular machine learning models standard neural nets are cheap to train and evaluate the following experiments show that even in 
this case where the computation of the gradient is efficient our proposed algorithms produce major speed up in convergence in spite of the cost associated with approximating the svd of the gradient we demonstrate this claim using the mnist and image datasets both datasets are similar in that they pose classification task over possible classes however consisting of rgb images of vehicles and animals with an additional images reserved for testing poses considerably more difficult problem than mnist with its greyscale images of digits plus test samples this fact is indicated by the accuracy on the mnist test set reaching with the same architecture achieving only accuracy on to obtain the performance on these datasets it is necessary to use various types of data methods regularization schemes and data augmentation all of which have big impact of model generalization in our experiments we only employ zca whitening on the data since these methods are not the focus of this paper instead we focus on the comparative performance of the various algorithms on variety of models we trained neural networks with zero one and two hidden layers with various hidden layer sizes and with both logistic and rectified linear units relu algorithm parameters cifar cnn sgd adagrad rmsprop ssd adaspectral rmsspectral seconds log log cnn sgd adagrad rmsprop ssd adaspectral rmsspectral rmsprop rmsspectral accuracy mnist nn seconds seconds figure left of current training batch on the mnist dataset middle loglikelihood of the current training batch on right accuracy on the test set can be found in supplemental table we observed fairly consistent performance across the various configurations with spectral methods yielding greatly improved performance over their euclidean counterparts figure shows convergence curves in terms of on the training data as learning proceeds for both mnist and ssd with estimated lipschitz steps outperforms sgd also clearly visible is the big impact of using local preconditioning to fit the local geometry of the objective amplified by using the spectral methods spectral methods also improve convergence of convolutional neural nets cnn in this setting we apply the only to fully connected linear layers preconditioning is performed for all layers when using rmsspectral for linear layers the convolutional layers are updated via rmsprop we applied our algorithms to cnns with one two and three convolutional layers followed by two layers each convolutional layer was followed by max pooling and relu we used filters ranging from to filters per layer we evaluated the mnist test set using convolutional net with kernels the best generalization performance on the test set after epochs was achieved by both rmsprop and rmsspectral with an accuracy of rmsspectral obtained this level of accuracy after only epochs less that half of what rmsprop required to further demonstarte the speed up we trained on using deeper net with three convolutional layers following the architecture used in in figure right the test set accuracy is shown as training proceeds with both rmsprop and rmsspectral while they eventually achieve similar accuracy rates rmsspectral reaches that rate four times faster discussion in this paper we have demonstrated that many deep models naturally operate with geometry and exploiting this gives remarkable improvements in training efficiency as well as finding improved local optima also by using adaptive methods algorithms can use the same tuning parameters across different model sizes configurations 
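to keep the overhead of the #-operator small, the experiments above rely on a fast approximation that replaces the full svd by a randomized low-rank factorization. the sketch below is a simplified variant for illustration only: it uses a plain randomized svd in the spirit of halko et al. and omits the reweighted remainder term described earlier, and the rank and oversampling values are arbitrary examples.

import numpy as np

def randomized_svd(a, rank, n_oversample=10, rng=None):
    # randomized svd: project onto a random low-dimensional subspace,
    # orthonormalize, and take the exact svd of the small projected matrix
    rng = np.random.default_rng(rng)
    omega = rng.standard_normal((a.shape[1], rank + n_oversample))
    q, _ = np.linalg.qr(a @ omega)       # approximate orthonormal range of a
    b = q.T @ a                          # small (rank + oversample) x n matrix
    u_small, sigma, vt = np.linalg.svd(b, full_matrices=False)
    return (q @ u_small)[:, :rank], sigma[:rank], vt[:rank]

def approx_sharp_spectral(s, rank=20, rng=None):
    # low-rank approximation of the #-operator for the spectral norm,
    # keeping only the leading singular directions
    u, sigma, vt = randomized_svd(s, rank, rng=rng)
    return sigma.sum() * (u @ vt)

for the large dense layers used here this reduces the per-step cost of the # computation from that of a full svd to a few matrix multiplications plus the svd of a small matrix, which is what keeps the wall-clock overhead of the spectral updates modest.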
we find that in the rbm and dbn improving the optimization can give dramatic performance improvements on both the training and the test set for feedforward neural nets the training efficiency of the propose methods give staggering improvements to the training performance while the training performance is drastically better via the methods the performance on the test set is improved for rbms and dbns but not in feedforward neural networks however because our proposed algorithms fit the model significantly faster they can help improve bayesian optimization schemes to learn appropriate penalization strategies and model configurations furthermore these methods can be adapted to dropout and other recently proposed regularization schemes to help achieve performance acknowledgements the research reported here was funded in part by aro darpa doe nga and onr and in part by the european commission under grants and erc future proof by the swiss science foundation under grants snf snf and the nccr marvel we thank the reviewers for their helpful comments references carlson cevher and carin stochastic spectral descent for restricted boltzmann machines aistats carlson hsieh collins carin and cevher stochastic spectral descent for discrete graphical models ieee special topics in signal processing cho raiko and ilin enhanced gradient for training restricted boltzmann machines neural computation choromanska henaff mathieu arous and lecun the loss surfaces of multilayer networks aistats dauphin de vries chung and bengio rmsprop and equilibrated adaptive learning rates for optimization dauphin pascanu gulcehre cho ganguli and bengio identifying and attacking the saddle point problem in optimization in nips duchi hazan and singer adaptive subgradient methods for online learning and stochastic optimization jmlr erhan bengio courville manzagol vincent and bengio why does unsupervised help deep learning jmlr halko martinsson and tropp finding structure with randomness probabilistic algorithms for constructing approximate matrix decompositions siam review hinton practical guide to training restricted boltzmann machines toronto technical report hinton osindero and teh fast learning algorithm for deep belief nets neural computation hinton training products of experts by minimizing contrastive divergence neural computation kelner lee orecchia and sidford an algorithm for approximate max flow in undirected graphs and its multicommodity generalizations krizhevsky and hinton imagenet classification with deep convolutional neural networks nips krizhevsky and hinton learning multiple layers of features from tiny images university of toronto tech rep le coates prochnow and ng on optimization methods for deep learning icml marlin and swersky inductive principles for restricted boltzmann machine learning icml martens and grosse optimizing neural networks with approximate curvature martens and sutskever parallelizable sampling of markov random fields aistats nair and hinton rectified linear units improve restricted boltzmann machines in icml neal annealed importance sampling toronto technical report rokhlin szlam and tygert randomized algorithm for principal component analysis siam journal on matrix analysis and applications salakhutdinov and hinton deep boltzmann machines aistats salakhutdinov and murray on the quantitative analysis of deep belief networks icml schaul zhang and lecun no more pesky learning rates arxiv smolensky information processing in dynamical systems foundations of harmony theory snoek larochelle and 
adams practical bayesian optimization of machine learning algorithms in nips tieleman and hinton using fast weights to improve persistent contrastive divergence icml wan zeiler zhang cun and fergus regularization of neural networks using dropconnect in icml zeiler adadelta an adaptive learning rate method arxiv 
recurrent latent variable model for sequential data junyoung chung kyle kastner laurent dinh kratarth goel aaron courville yoshua department of computer science and operations research de cifar senior fellow abstract in this paper we explore the inclusion of latent random variables into the hidden state of recurrent neural network rnn by combining the elements of the variational autoencoder we argue that through the use of latent random variables the variational rnn vrnn can model the kind of variability observed in highly structured sequential data such as natural speech we empirically evaluate the proposed model against other related sequential models on four speech datasets and one handwriting dataset our results show the important roles that latent random variables can play in the rnn dynamics introduction learning generative models of sequences is machine learning challenge and historically the domain of dynamic bayesian networks dbns such as hidden markov models hmms and kalman filters the dominance of approaches has been recently overturned by resurgence of interest in recurrent neural network rnn based approaches an rnn is special type of neural network that is able to handle both input and output by training an rnn to predict the next output in sequence given all previous outputs it can be used to model joint probability distribution over sequences both rnns and dbns consist of two parts transition function that determines the evolution of the internal hidden state and mapping from the state to the output there are however few important differences between rnns and dbns dbns have typically been limited either to relatively simple state transition structures linear models in the case of the kalman filter or to relatively simple internal state structure the hmm state space consists of single set of mutually exclusive states rnns on the other hand typically possess both richly distributed internal state representation and flexible transition functions these differences give rnns extra expressive power in comparison to dbns this expressive power and the ability to train via error backpropagation are the key reasons why rnns have gained popularity as generative models for highly structured sequential data in this paper we focus on another important difference between dbns and rnns while the hidden state in dbns is expressed in terms of random variables the internal transition structure of the standard rnn is entirely deterministic the only source of randomness or variability in the rnn is found in the conditional output probability model we suggest that this can be an inappropriate way to model the kind of variability observed in highly structured data such as natural speech which is characterized by strong and complex dependencies among the output variables at different code is available at http timesteps we argue as have others that these complex dependencies can not be modelled efficiently by the output probability models used in standard rnns which include either simple unimodal distribution or mixture of unimodal distributions we propose the use of latent random variables to model the variability observed in the data in the context of standard neural network models for data the variational autoencoder vae offers an interesting combination of highly flexible mapping between the latent random state and the observed output and effective approximate inference in this paper we propose to extend the vae into recurrent framework for modelling sequences the vae can model complex 
multimodal distributions which will help when the underlying true data distribution consists of multimodal conditional distributions we call this model variational rnn vrnn natural question to ask is how do we encode observed variability via latent random variables the answer to this question depends on the nature of the data itself in this work we are mainly interested in highly structured data that often arises in ai applications by highly structured we mean that the data is characterized by two properties firstly there is relatively high ratio meaning that the vast majority of the variability observed in the data is due to the signal itself and can not reasonably be considered as noise secondly there exists complex relationship between the underlying factors of variation and the observed data for example in speech the vocal qualities of the speaker have strong but complicated influence on the audio waveform affecting the waveform in consistent manner across frames with these considerations in mind we suggest that our model variability should induce temporal dependencies across timesteps thus like dbn models such as hmms and kalman filters we model the dependencies between the latent random variables across timesteps while we are not the first to propose integrating random variables into the rnn hidden state we believe we are the first to integrate the dependencies between the latent random variables at neighboring timesteps we evaluate the proposed vrnn model against other models including vrnn model without introducing temporal dependencies between the latent random variables on two challenging sequential data types natural speech and handwriting we demonstrate that for the speech modelling tasks the models significantly outperform the models and the vrnn model that does not integrate temporal dependencies between latent random variables background sequence modelling with recurrent neural networks an rnn can take as input sequence xt by recursively processing each symbol while maintaining its internal hidden state at each timestep the rnn reads the symbol xt rd and updates its hidden state ht rp by ht xt where is deterministic transition function and is the parameter set of the transition function can be implemented with gated activation functions such as long memory lstm or gated recurrent unit gru rnns model sequences by parameterizing factorization of the joint sequence probability distribution as product of conditional probabilities such that xt xt xt gτ where is function that maps the rnn hidden state to probability distribution over possible outputs and is the parameter set of one of the main factors that determines the representational power of an rnn is the output function in eq with deterministic transition function the choice of effectively defines the family of joint probability distributions xt that can be expressed by the rnn we can express the output function in eq as being composed of two parts the first part ϕτ is function that returns the parameter set φt given the hidden state φt ϕτ while the second part of returns the density of xt pφt xt when modelling and sequences reasonable choice of an observation model is gaussian mixture model gmm as used in for gmm ϕτ returns set of mixture coefficients αt means and covariances of the corresponding mixture components the probability of xt under the mixture distribution is pαt xt αj xt µj σj with the notable exception of there has been little work investigating the structured output density model for rnns with sequences there 
is potentially significant issue in the way the rnn models output variability given deterministic transition function the only source of variability is in the conditional output probability density this can present problems when modelling sequences that are at once highly variable and highly structured with high ratio to effectively model these types of sequences the rnn must be capable of mapping very small variations in xt the only source of randomness to potentially very large variations in the hidden state ht limiting the capacity of the network as must be done to guard against overfitting will force compromise between the generation of clean signal and encoding sufficient input variability to capture the variability both within single observed sequence and across data examples the need for highly structured output functions in an rnn has been previously noted boulangerlewandowski et al extensively tested nade and output densities for modelling sequences of binary vector representations of music bayer and osendorfer introduced sequence of independent latent variables corresponding to the states of the rnn their model called storn first generates sequence of samples zt from the sequence of independent latent random variables at each timestep the transition function from eq computes the next hidden state ht based on the previous state the previous output and the sampled latent random variables zt they proposed to train this model based on the vae principle see sec similarly pachitariu and sahani earlier proposed both sequence of independent latent random variables and stochastic hidden state for the rnn these approaches are closely related to the approach proposed in this paper however there is major difference in how the prior distribution over the latent random variable is modelled unlike the aforementioned approaches our approach makes the prior distribution of the latent random variable at timestep dependent on all the preceding inputs via the rnn hidden state see eq the introduction of temporal structure into the prior distribution is expected to improve the representational power of the model which we empirically observe in the experiments see table however it is important to note that any approach based on having stochastic latent state is orthogonal to having structured output function and that these two can be used together to form single model variational autoencoder for data vaes have recently been shown to be an effective modelling paradigm to recover complex multimodal distributions over the data space vae introduces set of latent random variables designed to capture the variations in the observed variables as an example of directed graphical model the joint distribution is defined as the prior over the latent random variables is generally chosen to be simple gaussian distribution and the conditional is an arbitrary observation model whose parameters are computed by parametric function of importantly the vae typically parameterizes with highly flexible function approximator such as neural network while latent random variable models of the form given in eq are not uncommon endowing the conditional as potentially highly mapping from to is rather unique feature of the vae however introducing highly mapping from to results in intractable inference of the posterior instead the vae uses variational approximation of the posterior that enables the use of the lower bound log kp eq log where kl qkp is divergence between two distributions and in the approximate posterior is gaussian 
N(µ, diag(σ²)), whose mean µ and variance σ² are the output of a highly non-linear function of x, once again typically a neural network. the generative model p(x | z) and inference model q(z | x) are then trained jointly by maximizing the variational lower bound with respect to their parameters, where the integral with respect to q(z | x) is approximated stochastically. the gradient of this estimate can be given a low-variance estimate by reparameterizing z = µ + σ ⊙ ε and rewriting the expectation as E_{p(ε)}[ log p(x | z = µ + σ ⊙ ε) ], where ε is a vector of standard gaussian variables. the inference model can then be trained through the standard backpropagation technique for stochastic gradient descent. variational recurrent neural network. in this section we introduce a recurrent version of the vae for the purpose of modelling sequences. drawing inspiration from simpler dynamic bayesian networks (dbns) such as hmms and kalman filters, the proposed variational recurrent neural network (vrnn) explicitly models the dependencies between latent random variables across subsequent timesteps; however, unlike these simpler dbn models, the vrnn retains the flexibility to model highly non-linear dynamics. generation. the vrnn contains a vae at every timestep; however, these vaes are conditioned on the state variable h_{t-1} of an rnn. this addition will help the vae to take into account the temporal structure of the sequential data. unlike a standard vae, the prior on the latent random variable is no longer a standard gaussian distribution but follows the distribution z_t ∼ N(µ_{0,t}, diag(σ²_{0,t})), where [µ_{0,t}, σ_{0,t}] = ϕ_τ^prior(h_{t-1}), and µ_{0,t}, σ_{0,t} denote the parameters of the conditional prior distribution. moreover, the generating distribution will not only be conditioned on z_t but also on h_{t-1}, such that x_t | z_t ∼ N(µ_{x,t}, diag(σ²_{x,t})), where [µ_{x,t}, σ_{x,t}] = ϕ_τ^dec(ϕ_τ^z(z_t), h_{t-1}), and µ_{x,t}, σ_{x,t} denote the parameters of the generating distribution. ϕ_τ^prior and ϕ_τ^dec can be any highly flexible function such as neural networks. ϕ_τ^x and ϕ_τ^z can also be neural networks, which extract features from x_t and z_t respectively; we found that these feature extractors are crucial for learning complex sequences. the rnn updates its hidden state using the recurrence equation h_t = f_θ(ϕ_τ^x(x_t), ϕ_τ^z(z_t), h_{t-1}), where f_θ was originally the transition function of the rnn recurrence in the background section. from this recurrence we find that h_t is a function of x_{≤t} and z_{≤t}; therefore, the prior and generating equations above define the distributions p(z_t | x_{<t}, z_{<t}) and p(x_t | z_{≤t}, x_{<t}) respectively. the parameterization of the generative model results in, and was motivated by, the factorization p(x_{≤T}, z_{≤T}) = ∏_{t=1}^{T} p(x_t | z_{≤t}, x_{<t}) p(z_t | x_{<t}, z_{<t}). inference. in a similar fashion, the approximate posterior will not only be a function of x_t but also of h_{t-1}, following the equation z_t | x_t ∼ N(µ_{z,t}, diag(σ²_{z,t})), where [µ_{z,t}, σ_{z,t}] = ϕ_τ^enc(ϕ_τ^x(x_t), h_{t-1}); similarly, µ_{z,t} and σ_{z,t} denote the parameters of the approximate posterior. we note that the encoding of the approximate posterior and the decoding for generation are tied through the rnn hidden state h_{t-1}. we also observe that this conditioning on h_{t-1} results in the factorization q(z_{≤T} | x_{≤T}) = ∏_{t=1}^{T} q(z_t | x_{≤t}, z_{<t}). (figure: graphical illustrations of each operation of the vrnn: (prior) computing the conditional prior; (generation) the generating function; (recurrence) updating the rnn hidden state; (inference) inference of the approximate posterior; (overall) the overall computational paths of the vrnn.) learning. using the two factorizations above, the objective function becomes a timestep-wise variational lower bound E_{q(z_{≤T} | x_{≤T})}[ Σ_{t=1}^{T} ( -KL( q(z_t | x_{≤t}, z_{<t}) || p(z_t | x_{<t}, z_{<t}) ) + log p(x_t | z_{≤t}, x_{<t}) ) ]. as in the standard vae, we learn the generative and inference models jointly by maximizing the variational lower bound with respect to their parameters. the schematic view of the vrnn is shown in the figure above, whose panels correspond to the operations just described. the vrnn applies the conditional-prior operation ϕ_τ^prior(h_{t-1}) when computing the prior; if a variant of the vrnn does not apply this operation, then the prior becomes independent across timesteps. storn can be
considered as an instance of the model family in fact storn puts further restrictions on the dependency structure of the approximate inference model we include this version of the model in our experimental evaluation in order to directly study the impact of including the temporal dependency structure in the prior conditional prior over the latent random variables experiment settings we evaluate the proposed vrnn model on two tasks modelling natural speech directly from the raw audio waveforms modelling handwriting generation speech modelling we train the models to directly model raw audio signals represented as sequence of frames each frame corresponds to the amplitudes of consecutive raw acoustic samples note that this is unlike the conventional approach for modelling speech often used in speech synthesis where models are expressed over representations such as spectral features see we evaluate the models on the following four speech datasets blizzard this dataset made available by the blizzard challenge contains hours of english spoken by single female speaker timit this widely used dataset for benchmarking speech recognition systems contains english sentences read by speakers this is set of sounds such as coughing screaming laughing and shouting recorded from voice actors accent this dataset contains english paragraphs read by different native and nonnative english speakers this dataset has been provided by ubisoft table average on the test or validation set of each task speech modelling handwriting models blizzard timit onomatopoeia accent for the blizzard and accent datasets we process the data so that each sample duration is the sampling frequency used is except the timit dataset the rest of the datasets do not have predefined splits we shuffle and divide the data into splits using ratio of handwriting generation we let each model learn sequence of coordinates together with binary indicators of using the dataset which consists of handwritten lines written by writers we preprocess and split the dataset as done in preprocessing and training the only preprocessing used in our experiments is normalizing each sequence using the global mean and standard deviation computed from the entire training set we train each model with stochastic gradient descent on the negative using the adam optimizer with learning rate of for timit and accent and for the rest we use minibatch size of for blizzard and accent and for the rest the final model was chosen with based on the validation performance models we compare the vrnn models with the standard rnn models using two different output functions simple gaussian distribution gauss and gaussian mixture model gmm for each dataset we conduct an additional set of experiments for vrnn model without the conditional prior we fix each model to have single recurrent hidden layer with lstm units in the case of blizzard and for all of ϕτ shown in eqs have four hidden layers using rectified linear units for we use single hidden layer the standard prior enc rnn models only have ϕxτ and ϕdec while the vrnn models also have ϕτ ϕτ and ϕτ for the dec standard rnn models ϕτ is the feature extractor and ϕτ is the generating function for the rnngmm and vrnn models we match the total number of parameters of the deep neural networks dnns ϕx enc dec prior as close to the model having hidden units for every layer that belongs to either ϕxτ or ϕdec we consider hidden units in the case of blizzard note that we use mixture components for models using gmm as the output function for 
qualitative analysis of speech generation we train larger models to generate audio sequences we stack three recurrent hidden layers each layer contains lstm units again for the rnngmm and vrnn models we match the total number of parameters of the dnns to be equal to the model having hidden units for each layer that belongs to either ϕxτ or ϕdec results and analysis we report the average of test examples assigned by each model in table for and we report the exact while in the case of vrnns we report the variational lower bound given with sign see eq and approximated marginal given with sign based on importance sampling using samples as in in general higher numbers are better our results show that the vrnn models have higher loglikelihood which support our claim that latent random variables are helpful when modelling figure the top row represents the difference δt between µz and µz the middle row shows the dominant kl divergence values in temporal order the bottom row shows the input waveforms plex sequences the vrnn models perform well even with unimodal output function vrnngauss which is not the case for the standard rnn models latent space analysis in fig we show an analysis of the latent random variables we let vrnn model read some unseen examples and observe the transitions in the latent space we compute δt µjz µjz at every timestep and plot the results on the top row of fig the middle row shows the kl divergence computed between the approximate posterior and the conditional prior when there is transition in the waveform the kl divergence tends to grow white is high and we can clearly observe peak in δt that can affect the rnn dynamics to change modality ground truth figure examples from the training set and generated samples from and vrnngauss top three rows show the global waveforms while the bottom three rows show more zoomedin waveforms samples from contain noise and samples from have less noise we exclude because the samples are almost close to pure noise speech generation we generate waveforms with duration from the models that were trained on blizzard from fig we can clearly see that the waveforms from the are much less noisy and have less spurious peaks than those from the we suggest that the large amount of noise apparent in the waveforms from the model is consequence of the compromise these models must make between representing clean signal consistent with the training data and encoding sufficient input variability to capture the variations across data examples the latent random variable models can avoid this compromise by adding variability in the latent space which can always be mapped to point close to relatively clean sample handwriting generation visual inspection of the generated handwriting as shown in fig from the trained models reveals that the vrnn model is able to generate more diverse writing style while maintaining consistency within samples ground truth figure handwriting samples training examples and unconditionally generated handwriting from and the retains the writing style from beginning to end while and tend to change the writing style during the generation process this is possibly because the sequential latent random variables can guide the model to generate each sample with consistent writing style conclusion we propose novel model that can address sequence modelling problems by incorporating latent random variables into recurrent neural network rnn our experiments focus on unconditional natural speech generation as well as handwriting generation we show 
that the introduction of latent random variables can provide significant improvements in modelling highly structured sequences such as natural speech sequences we empirically show that the inclusion of randomness into latent space can enable the vrnn to model natural speech sequences with simple gaussian distribution as the output function however the standard rnn model using the same output function fails to generate reasonable samples an model using more powerful output function such as gmm can generate much better samples but they contain large amount of noise compared to the samples generated by the models we also show the importance of temporal conditioning of the latent random variables by reporting higher numbers on modelling natural speech sequences in handwriting generation the vrnn model is able to model the diversity across examples while maintaining consistent writing style over the course of generation acknowledgments the authors would like to thank the developers of theano also the authors thank kyunghyun cho kelvin xu and sungjin ahn for insightful comments and discussion we acknowledge the support of the following agencies for research funding and computing support ubisoft nserc calcul compute canada the canada research chairs and cifar references bastien lamblin pascanu bergstra goodfellow bergeron bouchard and bengio theano new features and speed improvements deep learning and unsupervised feature learning nips workshop bayer and osendorfer learning stochastic recurrent networks arxiv preprint bertrand demuynck stouten and hamme unsupervised learning of auditory filter banks using matrix factorisation in ieee international conference on acoustics speech and signal processing icassp pages ieee bengio and vincent modeling temporal dependencies in highdimensional sequences application to polyphonic music generation and transcription in proceedings of the international conference on machine learning icml pages cho van merrienboer gulcehre bahdanau bougares schwenk and bengio learning phrase representations using rnn for statistical machine translation in proceedings of the conference on empirical methods in natural language processing emnlp pages fabius van amersfoort and kingma variational recurrent arxiv preprint graves generating sequences with recurrent neural networks arxiv preprint gregor danihelka graves and wierstra draw recurrent neural network for image generation in proceedings of the international conference on machine learning icml hochreiter and schmidhuber long memory neural computation king and karaiskos the blizzard challenge in the ninth annual blizzard challenge kingma and welling variational bayes in proceedings of the international conference on learning representations iclr kingma and welling adam method for stochastic optimization in proceedings of the international conference on learning representations iclr lee pham largman and ng unsupervised feature learning for audio classification using convolutional deep belief networks in advances in neural information processing systems nips pages liwicki and bunke english sentence database acquired from handwritten text on whiteboard in proceedings of eighth international conference on document analysis and recognition pages ieee nair and hinton rectified linear units improve restricted boltzmann machines in proceedings of the international conference on machine learning icml pages pachitariu and sahani learning visual motion in recurrent neural networks in advances in neural information processing systems 
nips pages rezende mohamed and wierstra stochastic backpropagation and approximate inference in deep generative models in proceedings of the international conference on machine learning icml pages tokuda nankaku toda zen yamagishi and oura speech synthesis based on hidden markov models proceedings of the ieee weinberger the speech accent archive http
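To make the generative and inference paths of the VRNN described above concrete, the following is a minimal NumPy sketch of a single timestep. It is not the authors' implementation: the feature extractors and the prior/encoder/decoder networks are single random (untrained) layers, a plain tanh recurrence stands in for the LSTM transition, and the frame, latent and hidden sizes are arbitrary. The step returns the new hidden state and the per-timestep term of the variational lower bound (reconstruction log-likelihood minus the KL between the approximate posterior and the conditional prior).

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim, h_dim = 200, 16, 64            # frame, latent, hidden sizes (illustrative)

def dense(d_in, d_out):
    """Random affine map standing in for a trained layer."""
    return rng.normal(0, 0.1, (d_out, d_in)), np.zeros(d_out)

def feat(params, v):                          # phi_x / phi_z feature extractors
    W, b = params
    return np.tanh(W @ v + b)

def gaussian_params(params, v):               # map a vector to (mu, sigma)
    W, b = params
    mu, log_sigma = np.split(W @ v + b, 2)
    return mu, np.exp(log_sigma)

def kl_diag_gaussians(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, diag sig_q^2) || N(mu_p, diag sig_p^2) )."""
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5)

phi_x     = dense(x_dim, h_dim)
phi_z     = dense(z_dim, h_dim)
prior_net = dense(h_dim, 2 * z_dim)           # phi_prior(h_{t-1}) -> (mu_0, sigma_0)
enc_net   = dense(2 * h_dim, 2 * z_dim)       # phi_enc(phi_x(x_t), h_{t-1})
dec_net   = dense(2 * h_dim, 2 * x_dim)       # phi_dec(phi_z(z_t), h_{t-1})
W_h = rng.normal(0, 0.1, (h_dim, 3 * h_dim))  # simple tanh recurrence in place of the LSTM

def vrnn_step(x_t, h_prev):
    # (1) conditional prior from the previous hidden state
    mu0, sig0 = gaussian_params(prior_net, h_prev)
    # (2) approximate posterior from x_t and h_{t-1}, with a reparameterized sample
    fx = feat(phi_x, x_t)
    mu_z, sig_z = gaussian_params(enc_net, np.concatenate([fx, h_prev]))
    z_t = mu_z + sig_z * rng.standard_normal(z_dim)
    # (3) decoder: Gaussian over x_t conditioned on z_t and h_{t-1}
    fz = feat(phi_z, z_t)
    mu_x, sig_x = gaussian_params(dec_net, np.concatenate([fz, h_prev]))
    # (4) recurrence: h_t = f(phi_x(x_t), phi_z(z_t), h_{t-1})
    h_t = np.tanh(W_h @ np.concatenate([fx, fz, h_prev]))
    # per-step term of the variational lower bound
    log_px = -0.5 * np.sum(((x_t - mu_x) / sig_x)**2 + 2 * np.log(sig_x) + np.log(2 * np.pi))
    return h_t, log_px - kl_diag_gaussians(mu_z, sig_z, mu0, sig0)

h, elbo = np.zeros(h_dim), 0.0
for x_t in rng.standard_normal((10, x_dim)):  # ten dummy frames
    h, term = vrnn_step(x_t, h)
    elbo += term
print("per-sequence lower bound (untrained, dummy data):", elbo)
```

In a trained model the same four computations are run per timestep, with the lower bound summed over the sequence and maximized by gradient ascent in the network parameters.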
fast convergence of regularized learning in games vasilis syrgkanis microsoft research new york ny vasy alekh agarwal microsoft research new york ny alekha haipeng luo princeton university princeton nj haipengl robert schapire microsoft research new york ny schapire abstract we show that natural classes of regularized learning algorithms with form of recency bias achieve faster convergence rates to approximate efficiency and to coarse correlated equilibria in multiplayer normal form games when each player in game uses an algorithm from our class their individual regret decays at while the sum of utilities converges to an approximate optimum at improvement upon the worst case rates we show blackbox reduction for any algorithm in the class to achieve rates against an adversary while maintaining the faster rates against algorithms in the class our results extend those of rakhlin and shridharan and daskalakis et al who only analyzed games for specific algorithms introduction what happens when players in game interact with one another all of them acting independently and selfishly to maximize their own utilities if they are smart we intuitively expect their utilities both individually and as group to grow perhaps even to approach the best possible we also expect the dynamics of their behavior to eventually reach some kind of equilibrium understanding these dynamics is central to game theory as well as its various application areas including economics network routing auction design and evolutionary biology it is natural in this setting for the players to each make use of learning algorithm for making their decisions an approach known as decentralized dynamics algorithms are strong match for playing games because their regret bounds hold even in adversarial environments as benefit these bounds ensure that each player utility approaches optimality when played against one another it can also be shown that the sum of utilities approaches an approximate optimum and the player strategies converge to an equilibrium under appropriate conditions at rates governed by the regret bounds families of algorithms include mirror descent and follow the leader see for excellent overviews for all of these the average regret vanishes at the rate of which is unimprovable in fully adversarial scenarios however the players in our setting are facing other similar predictable learning algorithms chink that hints at the possibility of improved convergence rates for such dynamics this was first observed and exploited by daskalakis et al for games they developed decentralized variant of nesterov accelerated saddle point algorithm and showed that each player average regret converges at the remarkable rate of although the resulting dynamics are somewhat unnatural in later work rakhlin and sridharan showed surprisingly that the same convergence rate holds for simple variant of mirror descent with the seemingly minor modification that the last utility observation is counted twice although major steps forward both these works are limited to games the very simplest case as such they do not cover many practically important settings such as auctions or routing games which are decidedly not and which involve many independent actors in this paper we vastly generalize these techniques to the practically important but far more challenging case of arbitrary games giving natural dynamics whose convergence rates are much faster than previously possible for this general setting contributions we show that the average welfare of the game 
that is the sum of player utilities converges top approximately optimal welfare at the rate rather than the previously known rate of concretely we show natural class of regularized algorithms with recency bias that achieve welfare at least pt where and are parameters in smoothness condition on the game introduced by roughgarden for the same class of algorithms we show that each individual player average regret converges to zero at the rate thus our results entail an algorithm for computing coarse correlated equilibria in decentralized manner with significantly faster convergence than existing methods we additionally give that preserves the fast rates in favorable environments while robustly maintaining regret against any opponent in the worst case even for games our results for general games expose hidden generality and modularity underlying the previous results first our analysis identifies stability and recency bias as key structural ingredients of an algorithm with fast rates this covers the optimistic mirror descent of rakhlin and sridharan as an example but also applies to optimistic variants of follow the regularized leader ftrl including dependence on arbitrary weighted windows in the history as opposed to just the utility from the last round recency bias is behavioral pattern commonly observed in environments as such our results can be viewed as partial theoretical justification second previous approaches in on achieving both faster convergence against similar algorithms while at the same time regret rates against adversaries were shown via modifications of specific algorithms we give modification which is not algorithm specific and works for all these optimistic algorithms finally we simulate simultaneous auction game and compare our optimistic algorithms against hedge in terms of utilities regrets and convergence to equilibria repeated game model and dynamics consider static game among set of players each player has strategy space si and utility function ui sn that maps strategy profile sn to utility ui we assume that the strategy space of each player is finite and has cardinality we denote with wn profile of mixed strategies where wi si and wi is the probability of strategy si finally let ui ui the expected utility of player we consider the setting where the game is played repeatedly for time steps at each time step each player picks mixed strategy wit si at the end of the iteration each player observes the expected utility he would have received had he played any possible strategy si more formally let uti es ui where is the set of strategies of all but the ith player and let uti uti at the end of each iteration each player observes uti observe that the expected utility of player at iteration is simply the inner product hwit uti dynamics we assume that the players each decide their strategy wit based on vanishing regret algorithm formally for each player the regret after time steps is equal to the maximum gain he could have achieved by switching to any other fixed strategy ri sup si wit uti the algorithm has vanishing regret if ri approximate efficiency of dynamics we are interested in analyzing the average welfare of such vanishing regret sequences for given strategy profile the social welfare is defined as the sum of the player utilities ui we overload notation to denote we want to lower bound how far the average welfare of the sequence is with respect to the optimal welfare of the static game pt max this is the optimal welfare achievable in the absence of player incentives 
and if central coordinator could dictate each player strategy we next define class of games first identified by roughgarden on which we can approximate the optimal welfare using decoupled dynamics definition smooth game is if there exists strategy profile such that for any strategy profile ui si pt µw in words any player using his optimal strategy continues to do well irrespective of other players strategies this condition directly implies of dynamics as we show below proposition in game if each player suffers regret at most ri then wt pt ri pt ri where the factor is called the price of anarchy oa this proposition is essentially more explicit version of roughgarden result we provide proof in the appendix for completeness the result shows that the convergence to oa is driven by the quantity ri there are many algorithms which achieve regret rate of ri log in which case the latter theorem would imply that the average welfare converges to oa at rate of log as we will show for some natural classes of algorithms the average welfare converges at the much faster rate of log fast convergence to approximate efficiency in this section we present our main theoretical results characterizing class of dynamics which lead to faster convergence in smooth games we begin by describing this class definition rvu property we say that vanishing regret algorithm satisfies the regret bounded by variation in utilities rvu property with parameters and and pair of dual norms if its regret on any sequence of utilities ut is bounded as ku ut kwt wt typical online learning algorithms such as mirror descent and ftrl do not satisfy the rvu property pt in their vanilla form as the middle term grows as kut for these methods however rakhlin and sridharan give modification of mirror descent with this property and we will present similar variant of ftrl in the sequel we now present two sets of results when each player uses an algorithm with this property the first discusses the convergence of social welfare while the second governs the convergence of the individual players utilities at fast rate the dual to norm is defined as hu vi fast convergence of social welfare given proposition we only need to understand the evolution of the sum of players regrets pt ri in order to obtain convergence rates of the social welfare our main result in this section bounds this sum when each player uses dynamics with the rvu property theorem suppose that the algorithm of each player satisfies the property rvu with parameters and such that and then ri proof since ui definitions imply kuti uti wj sj wj sj the latter is the total variation distance of two product distributions by known properties of total variation see this is bounded by the sum of the total variations of each marginal distribution wj wj kwjt wjt by jensen inequality kw kwjt wjt so that xx kuti uti kwjt wjt kwit wit the theorem follows by summing up the rvu property for each player and observing that the summation of the second terms is smaller than that of the third terms and thereby can be dropped remark the rates from the theorem depend on which will be in the sequel the above theorem extends to the case where is any norm equivalent to the norm the resulting requirement on in terms of can however be more stringent also the theorem does not require that all players use the same algorithm unlike previous results as long as each player algorithm satisfies the rvu property with common bound on the constants we now instantiate the result with examples that satisfy the rvu property 
with different constants optimistic mirror descent the optimistic mirror descent omd algorithm of rakhlin and sridharan is parameterized by an adaptive predictor sequence mti and which is with respect to norm let dr denote the bregman divergence associated with then the update rule is defined as follows let si and argmax hw ui si dr then wit mti git and git uti git then the following proposition can be obtained for this method proposition the omd algorithm using stepsize and mti uti satisfies the rvu property with constants where maxi supf dr the proposition follows by further crystallizing the arguments of rakhlin and sridaran and we provide proof in the appendix for completeness the above proposition along with theorem immediately yields the following corollary which had been proved by rakhlin and sridharan for games and which we here extend to general games corollary if each player runs omd with mti uti and stepsize then we have ri the corollary follows by noting that the condition is met with our choice of here and in the sequel we can use different regularizer ri for each player without qualitatively affecting any of the results ku is convex if optimistic follow the regularized leader we next consider different class of algorithms denoted as optimistic follow the regularized leader oftrl this algorithm is similar but not equivalent to omd and is an analogous extension of standard ftrl this algorithm takes the same parameters as for omd and is defined as follows let si and wit argmax uti mti si we consider three variants of oftrl with different choices of the sequence mti incorporating the recency bias in different forms recency bias the simplest form of oftrl uses ui result where maxi supf si inf si and obtains the following proposition the oftrl algorithm using stepsize and mti uti with constants and satisfies the rvu property combined with theorem this yields the following constant bound on the total regret of all players corollary if each player runs oftrl with mti uti and then we have ri rakhlin and sridharan also analyze an ftrl variant but require barrier for the constraint set as opposed to an arbitrary strongly convex regularizer and their bound is missing the crucial negative terms of the rvu property which are essential for obtaining theorem recency bias more generally given window size one can define mti pt ui we have the following proposition pt proposition the oftrl algorithm using stepsize and mti ui satisfies the rvu property with constants and setting we obtain the analogue of corollary with an extra factor of geometrically discounted recency bias the next proposition considers an alternative form of recency bias which includes all the previous utilities but with geometric discounting pt proposition the oftrl algorithm using stepsize and mti pt satisfies the rvu property with constants and note that these choices for mti can also be used in omd with qualitatively similar results fast convergence of individual utilities the previous section shows implications of the rvu property on the social welfare this section complements these with similar result for each player individual utility theorem suppose that the players use algorithms satisfying the rvu property with parameters if we further have the stability property kwit then for any pt player hwi wit uti similar reasoning as in theorem yields kuti uti kwjt wjt and summing the terms gives the theorem noting that oftrl satisfies the rvu property with constants given in proposition and stability property with see lemma in 
the appendix we have the following corollary corollary if all players use the oftrl algorithm with mti uti pt then we have wit uti and similar results hold for the other forms of recency bias as well as for omd corollary gives fast convergence rate of the players strategies to the set of coarse correlated equilibria cce of the game this improves the previously known convergence rate to cce using natural decoupled dynamics defined in robustness to adversarial opponent so far we have shown simple dynamics with rapid convergence properties in favorable environments when each player in the game uses an algorithm with the rvu property it is natural to wonder if this comes at the cost of guarantees when some players do not use algorithms with this property rakhlin and sridharan address this concern by modifying the omd algorithm with additional smoothing and adaptive so as to preserve the fast rates in the favorable case while still guaranteeing regret for each player no matter how the opponents play it is not so obvious how this modification might extend to other procedures and it seems undesirable to abandon the regret transformations we used to obtain theorem in this section we present generic way of transforming an algorithm which satisfies the rvu property so that it retains the fast convergence in favorable settings but always guarantees regret of in order to present our modification we need parametric form of the rvu property which will also involve tunable parameter of the algorithm for most online learning algorithms this will correspond to the parameter used by the algorithm definition rvu property we say that parametric algorithm satisfies the regret bounded by variation in utilities rvu property with parameters and pair of dual norms if its regret on any sequence of utilities ut is bounded as ut kut ut kwt wt in both omd and oftrl algorithms from section the parameter is precisely the stepsize we now show an adaptive choice of according to an doubling schedule reduction given parametric algorithm as we construct wrapper based on the doubling trick the algorithm of each player proceeds in epochs at each epoch pt the player has an upper bound of br on the quantity kuti uti we start with parameter and and for repeat play according to and receive if uti br update br min br with as in equation start new run of with parameter theorem algorithm achieves regret at most the minimum of the following two terms log kuti uti wi ui log kuti wit uti uti kwit wit that is the algorithm satisfies the rvu property and also has regret that can never exceed the theorem thus yields the following corollary which illustrates the stated robustness of corollary algorithm with log achieves regret against any adversarial sequence while at the same time satisfying the conditions of theorem thereby if all players use such an algorithm then ri log sum of regrets max of regrets hedge optimistic hedge cumulative regret cumulative regret number of rounds hedge optimistic hedge number of rounds figure maximum and sum of individual regrets over time under the hedge blue and optimistic hedge red dynamics proof observe that for such we have that log log therefore algorithm satisfies the sufficient conditions of theorem if is the oftrl algorithm then we know by proposition that the above result applies with maxw and setting the resulting algorithm will have regret at most against an arbitrary adversary while if all players use algorithm then ri log an analogue of theorem can also be established for this algorithm 
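Before stating that analogue, it is useful to make the OFTRL update with one-round recency bias (m_t = u_{t-1}) concrete. Under the entropy regularizer this is the "optimistic Hedge" variant used in the experiments below: ordinary Hedge with the most recent utility vector counted twice. The sketch below is a toy illustration; the step size and utility sequence are placeholders, not values from the paper.

```python
import numpy as np

def optimistic_hedge(utilities, n_actions, eta=0.1):
    """Yield the mixed strategy played at each round by OFTRL with entropy
    regularizer and m_t = u_{t-1}: weights proportional to
    exp(eta * (sum of past utilities + the last utility counted once more))."""
    cum = np.zeros(n_actions)
    last = np.zeros(n_actions)               # m_1 = 0: no prediction before feedback
    for u in [None] + list(utilities):       # the first strategy precedes any feedback
        if u is not None:
            cum += u
            last = u
        scores = eta * (cum + last)
        scores -= scores.max()               # numerical stability
        w = np.exp(scores)
        yield w / w.sum()

# toy usage: two actions and a short, made-up utility sequence
us = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
for t, w in enumerate(optimistic_hedge(us, n_actions=2, eta=0.5)):
    print(f"round {t + 1}: play {np.round(w, 3)}")
```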
corollary if satisfies the rvu property and also kwit pwit then with achieves regret if played against itself and against any opponent once again oftrl satisfies the above conditions with implying robust convergence experimental evaluation we analyzed the performance of optimistic follow the regularized leader with the entropy regularizer which corresponds to the hedge algorithm modified so that the last iteration utility for each strategy is double counted we refer to it as optimistic hedge of more the probability player playing strategy at iteration is proportional to exp rather uij pt than exp uij as is standard for hedge we studied simple auction where players are bidding for items each player has value for getting at least one item and no extra value for more items the utility of player is the value for the allocation he derived minus the payment he has to make the game is defined as follows simultaneously each player picks one of the items and submits bid on that item we assume bids to be discretized for each item the highest bidder wins and pays his bid we let players play this game repeatedly with each player invoking either hedge or optimistic hedge this game and generalizations of it are known to be if we also view the auctioneer as player whose utility is the revenue the welfare of the game is the value of the resulting allocation hence not game the welfare maximization problem corresponds to the unweighted bipartite matching problem the oa captures how far from the optimal matching is the average allocation of the dynamics by smoothness we know it converges to at least of the optimal fast convergence of individual and average regret we run the game for bidders and items and valuation the bids are discretized to be any integer in we find that the sum of the regrets and the maximum individual regret of each player are remarkably lower under optimistic hedge as opposed to hedge in figure we plot the maximum individual regret as well as the sum of the regrets under the two algorithms using for both methods thus convergence to the set of coarse correlated equilibria is substantially faster under optimistic hedge expected bids of player utility of player hedge optimistic hedge utility expected bid hedge optimistic hedge number of rounds number of rounds figure expected bid and utility of player on one of the four items over time under hedge blue and optimistic hedge red dynamics confirming our results in section we also observe similar behavior when each player only has value on randomly picked subset of items or uses other step sizes more stable dynamics we observe that the behavior under optimistic hedge is more stable than under hedge in figure we plot the expected bid of player on one of the items and his expected utility under the two dynamics hedge exhibits the sawtooth behavior that was observed in generalized first price auction run by overture see in stunning contrast optimistic hedge leads to more stable expected bids over time this stability property of optimistic hedge is one of the main intuitive reasons for the fast convergence of its regret welfare in this class of games we did not observe any significant difference between the average welfare of the methods the key reason is the following the proof that dynamics are approximately efficient proposition only relies on the fact that each player does not have regret against the strategy used in the definition of smooth game in this game regret against these strategies is experimentally comparable under both algorithms 
even though regret against the best fixed strategy is remarkably different this indicates possibility for faster rates for hedge in terms of welfare in appendix we show fast convergence of the efficiency of hedge for costminimization games though with worse oa discussion this work extends and generalizes growing body of work on decentralized dynamics in many ways we demonstrate class of algorithms which enjoy rapid convergence when played against each other while being robust to adversarial opponents this has implications in computation of correlated equilibria as well as understanding the behavior of agents in complex games there are number of interesting questions and directions for future research which are suggested by our results including the following convergence rates for vanilla hedge the fast rates of our paper do not apply to algorithms such as hedge without modification is this modification to satisfy rvu only sufficient or also necessary if not are there counterexamples in the supplement we include sketch hinting at such counterexample but also showing fast rates to worse equilibrium than our optimistic algorithms convergence of players strategies the oftrl algorithm often produces much more stable trajectories empirically as the players converge to an equilibrium as opposed to say hedge precise quantification of this desirable behavior would be of great interest better rates with partial information if the players do not observe the expected utility function but only the moves of the other players at each round can we still obtain faster rates references blum and mansour learning regret minimization and equilibria in noam nisan tim roughgarden tardos and vijay vazirani editors algorithmic game theory chapter pages cambridge university press avrim blum mohammadtaghi hajiaghayi katrina ligett and aaron roth regret minimization and the price of total anarchy in proceedings of the fortieth annual acm symposium on theory of computing stoc pages new york ny usa acm nicolo and gabor lugosi prediction learning and games cambridge university press new york ny usa constantinos daskalakis alan deckelbaum and anthony kim algorithms for games games and economic behavior benjamin edelman michael ostrovsky and michael schwarz internet advertising and the generalized second price auction selling billions of dollars worth of keywords working paper national bureau of economic research november dean foster and rakesh vohra calibrated learning and correlated equilibrium games and economic behavior yoav freund and robert schapire generalization of learning and an application to boosting journal of computer and system sciences yoav freund and robert schapire adaptive game playing using multiplicative weights games and economic behavior drew fudenberg and alexander peysakhovich recency records and recaps learning and nonequilibrium behavior in simple decision problem in proceedings of the fifteenth acm conference on economics and computation ec pages new york ny usa acm sergiu hart and andreu simple adaptive procedure leading to correlated equilibrium econometrica wassily hoeffding and wolfowitz distinguishability of sets of distributions ann math adam kalai and santosh vempala efficient algorithms for online decision problems journal of computer and system sciences learning theory learning theory nick littlestone and manfred warmuth the weighted majority algorithm information and computation as nemirovsky and db yudin problem complexity and method efficiency in optimization yu nesterov smooth 
minimization of nonsmooth functions mathematical programming alexander rakhlin and karthik sridharan online learning with predictable sequences in colt pages alexander rakhlin and karthik sridharan optimization learning and games with predictable sequences in advances in neural information processing systems pages tim roughgarden intrinsic robustness of the price of anarchy in proceedings of the annual acm symposium on theory of computing pages new york ny usa acm shai shalev-shwartz online learning and online convex optimization foundations and trends in machine learning february vasilis syrgkanis and eva tardos composable and efficient mechanisms in proceedings of the fortyfifth annual acm symposium on theory of computing stoc pages new york ny usa acm
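The simultaneous first-price auction experiment reported above can be re-created in miniature with the following sketch, which pits ordinary Hedge against optimistic Hedge (last utility counted twice). The parameters here, four unit-demand bidders with value 5, four items, integer bids in 0..5, ties resolved against the bidder, eta = 0.1 and 2000 rounds, are placeholders rather than the paper's exact configuration, and counterfactual feedback is computed against the opponents' realized (sampled) bids rather than their mixed strategies.

```python
import numpy as np

rng = np.random.default_rng(1)
n_players, n_items, max_bid, T = 4, 4, 5, 2000
values = np.full(n_players, 5.0)                 # unit-demand value for winning any one item
strategies = [(j, b) for j in range(n_items) for b in range(max_bid + 1)]
n_strat = len(strategies)

def play(optimistic, eta=0.1):
    cum = np.zeros((n_players, n_strat))          # cumulative counterfactual utilities
    last = np.zeros((n_players, n_strat))
    realized = np.zeros(n_players)
    for _ in range(T):
        scores = eta * (cum + (last if optimistic else 0.0))
        scores -= scores.max(axis=1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=1, keepdims=True)
        choice = [rng.choice(n_strat, p=w[i]) for i in range(n_players)]
        # counterfactual utility of every own strategy against the others' realized bids
        u = np.zeros((n_players, n_strat))
        for i in range(n_players):
            for s, (j, b) in enumerate(strategies):
                rival = max((strategies[choice[k]][1] for k in range(n_players)
                             if k != i and strategies[choice[k]][0] == j), default=-1)
                if b > rival:                      # ties are lost, a simplification
                    u[i, s] = values[i] - b
        realized += np.array([u[i, choice[i]] for i in range(n_players)])
        cum += u
        last = u
    return cum.max(axis=1) - realized              # regret against best fixed strategy

print("avg regret, hedge           :", play(optimistic=False).mean())
print("avg regret, optimistic hedge:", play(optimistic=True).mean())
```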
parallel lstm with application to fast biomedical volumetric image segmentation marijn stollenga wonmin byeon marcus and juergen shared first authors both authors contribruted equally to this work corresponding authors marijn istituto dalle molle di studi sull intelligenza artificiale the swiss ai lab idsia scuola universitaria professionale della svizzera italiana supsi switzerland della svizzera italiana usi switzerland university of kaiserslautern germany german research center for artificial intelligence dfki germany abstract convolutional neural networks cnns can be shifted across images or videos to segment them they have fixed input size and typically perceive only small local contexts of the pixels to be classified as foreground or background in contrast recurrent nns can perceive the entire context of each pixel in few sweeps through all pixels especially when the rnn is long memory lstm despite these theoretical advantages however unlike cnns previous variants were hard to parallelise on gpus here we the traditional cuboid order of computations in in pyramidal fashion the resulting is easy to parallelise especially for data such as stacks of brain slice images achieved best known brain image segmentation results on and competitive results on introduction long memory lstm networks are recurrent neural networks rnns initially designed for sequence processing they achieved results on challenging tasks such as handwriting recognition large vocabulary speech recognition and machine translation their architecture contains gates to store and read out information from linear units called error carousels that retain information over long time intervals which is hard for traditional rnns lstm networks connect hidden lstm units in two dimensional is applicable to image segmentation where each pixel is assigned to class such as background or foreground each lstm unit sees pixel and receives input from predecessing lstm units thus recursively gathering information about all other pixels in the image there are many biomedical volumetric data sources such as computed tomography ct magnetic resonance mr and electron microscopy em most previous approaches process each slice separately using image segmentation algorithms such as snakes random forests and convolutional neural networks however can process the full context of each pixel in such volume through sweeps over all pixels by different lstms each sweep in the general direction of one of the directed volume diagonals for example in two dimensions this yields directions up down left and right due to the sequential nature of rnns however parallelisation was difficult especially for volumetric data the novel pyramidal lstm networks introduced in this paper use rather different topology and update strategy they are easier to parallelise need fewer computations overall and scale well on gpu architectures is applied to two challenging tasks involving segmentation of biological volumetric images competitive results are achieved on best known results are achieved on method we will first describe standard lstm then we introduce the and topology changes to construct the which is formally described and discussed lstm unit consists of an input gate forget output gate and memory cell which control what should be remembered or forgotten over potentially long periods of time the input and all gates and activations are vectors rt where is the length of the input the gates and activations at discrete time are computed as follows it xt θxi θhi θibias ft xt θxf 
θhf θfbias tanh xt ct it ft ot xt θxo θho θobias ht ot tanh ct where is matrix multiplication an multiplication and denotes the weights is the input to the cell which is gated by the input gate and is the output the functions and tanh are applied where equations determine gate activations equation cell inputs equation the new cell states here memories are stored or forgotten equation output gate activations which appear in equation the final output pyramidal connection topology standard pyramid lstm figure the standard topology evaluates the context of each pixel recursively from neighbouring pixel contexts along the axes that is pixels on simplex can be processed in parallel turning this order by causes the simplex to become plane column vector in the case here the resulting gaps are filled by adding extra connections to process more than elements of the context the lstm aligns in grid and connects them over the axis multiple grids are needed to process information from all directions adds the outputs of lstms one scanning the image pixel by pixel from to one from to one from to and one from to figure shows one of these directions although the forget gate output is inverted and actually remembers when it is on and forgets when it is off the traditional nomenclature is kept however small change in connections can greatly facilitate parallelisations if the connections are rotated by all inputs to all units come from either left right up or down left in case of figure and all elements of row in the grid row can be computed independently however this introduces context gaps as in figure by adding an extra input these gaps are filled as in figure extending this approach in dimensions results in pyramidal connection topology meaning the context of pixel is formed by pyramid in each direction figure on the left we see the context scanned so far by one of the lstms of cube in general given dimensions lstms are needed on the right we see the context scanned so far by one of the lstms of pyramid in general lstms are needed one of the striking differences between and is the shape of the scanned contexts each lstm of an scans contexts in or cuboids in each lstm of scans triangles in and pyramids in see figure an needs lstms to scan volume while needs only since it takes cubes or pyramids to fill volume given dimension the number of lstms grows as for an exponentially and for linearly similar connection strategy has been previously used to speed up distance computations on surfaces there are however important differences we can exploit efficient cuda convolution operations but in way unlike what is done in cnns as will be explained below as result of these operations input filters that are bigger than the necessary filters arise naturally creating overlapping contexts such redundancy turns out to be beneficial and is used in our experiments we apply several layers of complex processing with outputs and several for each pixel instead of having single value per pixel as in distance computations our application is focused on volumetric data fully connected layer tanh fully connected layer softmax input data figure network architecture randomly rotated and flipped inputs are sampled from random locations then fed to six over three axes the outputs from all are combined and sent to the layer tanh is used as squashing function in the hidden layer several layers can be applied the last layer is and uses softmax function to compute probabilities for each class for each pixel here we explain the network 
architecture for volumes see figure the working horses are six convolutional lstms layers one for each direction to create the full context of each pixel note that each of these is entire lstm rnn processing the entire volume in one direction the directions are formally defined over the three axes they essentially choose which axis is the time direction with the positive direction of the represents the time each performs computations in plane moving in the defined direction the input is rw where is the width the height the depth and the number of channels of the input or hidden units in the case of and higher layers similarly we define the volumes id od cd hd rw where is direction and is the number of hidden units per pixel since each direction needs separate volume we denote volumes with the time index selects slice in direction for instance for direction vtd refers to the plane for and for negative direction the plane is the same but moves in the opposite direction special case is the first plane in each direction which does not have previous plane hence we omit the corresponding computation equations θidbias idt xdt θxi θhi ftd xdt θxf θhf θfdbias tanh xdt cdt idt ftd θodbias odt xdt θxo θho hdt odt tanh cdt where is and is the output of the layer all biases are the same for all lstm units no positional biases are used the outputs hd for all directions are summed together layer the output of our layer is connected to layer which output is squashed by the hyperbolic tangent tanh function this step is used to increase the number of channels for the next layer the final classification is done using softmax function pe giving probabilities for each class experiments we evaluate our approach on two biomedical image segmentation datasets electron microscopy em and mr brain images em dataset the em dataset is provided by the isbi workshop on segmentation of neuronal structures in em stacks two stacks consist of slices of pixels obtained from microcube with resolution of and binary labels one stack is used for training the other for testing target data consists of binary labels membrane and mr brain dataset the mr brain images are provided by the isbi workshop on neonatal and adult mr brain image segmentation isbi the dataset consists of twenty fully annotated scan inversion recovery scan ir and inversion recovery scan flair the dataset is divided into training set with five volumes and test set with fifteen volumes all scans are and aligned each volume includes slices with pixels slice thickness the slices in volumes convolutions are performed in in general an volume requires convolutions all convolutions have stride and their filter sizes should at least be in each dimension to create the full context are manually segmented through nine labels cortical gray matter basal ganglia white matter white matter lesions cerebrospinal fluid in the extracerebral space ventricles cerebellum brainstem and background following the isbi workshop procedure all labels are grouped into four classes and background cortical gray matter and basal ganglia gm white matter and white matter lesions wm cerebrospinal fluid and ventricles csf and cerebellum and brainstem class is ignored for the final evaluation as required and augmentation the full dataset requires more than the gb of memory provided by our gpu hence we train and test on we randomly pick position in the full data and extract smaller cube see the details in bootstrapping this cube is possibly rotated at random angle over some axis and can be flipped 
over any axis for em images we rotate over the and flipped with chance along and axes for mr brain images rotation is disabled only flipping along the direction is considered since brains are mostly symmetric in this direction during rotations and flipping are disabled and the results of all are stitched together using gaussian kernel providing the final result we normalise each input slice towards mean of zero and variance of one since the imaging methods sometimes yield large variability in contrast and brightness we do not apply the complex common in biomedical image segmentation we apply simple on the three datatypes of the mr brain dataset since they contain large brightness changes under the same label even within one slice see figure from all slices we subtract the gaussian smoothed images filter size then adaptive histogram equalisation clahe is applied to enhance the local contrast tile size contrast limit an example of the images after is shown in figure the original and images are all used except the original ir images figure which have high variability to be an ρan bn training we apply with momentum we define where rn the following equations hold for every epoch ρmse mse mse ρm λlr where is the target is the output from the networks is the squared loss mse running average of the variance of the gradient is the squared gradient the normalised gradient the smoothed gradient and the weights the squared loss was chosen as it produced better results than using the as an error function this algorithm normalises the gradient of each weight such that even weights with small gradients get updated this also helps to deal with vanishing gradients epoch we use decaying learning rate λlr which starts at λlr and halves every epochs asymptotically towards λlr other used are ρmse and ρm bootstrapping to speed up training we run three learning procedures with increasing sizes first epochs with size then epochs with size finally for the we train epochs with size and for the mr brain dataset epochs with size after each epoch the learning rate λlr is reset table performance comparison on em images some of the competing methods reported in the isbi website are not yet published comparison details can be found under http rand err warping err pixel err human simple thresholding idsia dive group experimental setup all experiments are performed on desktop computer with an nvidia gtx titan gpu due to the pyramidal topology all major computations can be done using convolutions with nvidia cudnn library which has reported speedup over an optimised implementation on modern core cpu on the mr brain dataset training took around three days and testing per volume took around minutes we use exactly the same and architecture for both datasets our networks contain three layers the first layer has hidden units followed by layer with hidden units in the next layer hidden units are connected to layer with hidden units in the last layer hidden units are connected to the output layer whose size equals the number of classes the convolutional filter size for all layers is set to the total number of weights is and all weights are initialised according to uniform distribution neuronal membrane segmentation input figure segmentation results on em dataset slice membrane segmentation is evaluated through an online system provided by the isbi organisers the measures used are the rand error warping error and pixel error comparisons to other methods are reported in table the teams idsia and dive provide membrane probability 
maps for each pixel the idsia team uses deep convolutional network the method of dive was not provided these maps are adapted by the technique of the teams sci which directly optimises the rand error and this is most important in this particular segmentation task table the performance comparison on mr brain images structure metric ksom ghmf gm wm csf dc md avd dc md avd dc md avd rank mm mm mm without networks outperform other methods in rand error and are competitive in wrapping and pixel errors of course performance could be further improved by applying techniques figure shows an example segmentation result mr brain segmentation the results are compared using the dice overlap dc the modified hausdorff distance md and the absolute volume difference avd mr brain image segmentation results are evaluated by the isbi organisers who provided the extensive comparison to other approaches on http table compares our results to those of the top five teams the organisers compute nine measures in total and rank all teams for each of them separately these ranks are then summed per team determining the final ranking ties are broken using the standard deviation leads the final ranking with new result and outperforms other methods for csf in all metrics we also tried regularisation through dropout following earlier work the dropout operator is applied only to connections dropout on fully connected layers on input layer however this did not improve performance conclusion since cnns have dominated classification contests and segmentation contests however may pose serious challenge to such cnns at least for segmentation tasks unlike cnns has an elegant recursive way of taking each pixel entire context into account in both images and videos previous implementations however could not exploit the parallelism of modern gpu hardware this has changed through our work presented here although our novel highly parallel has already achieved segmentation results in challenging benchmarks we feel we have only scratched the surface of what will become possible with such and other acknowledgements we would like to thank klaus greff and alessandro giusti for their valuable discussions and jan koutnik and dan ciresan for their useful feedback we also thank the isbi organisers and the isbi organisers in particular mendrik and ignacio argandacarreras lastly we thank nvidia for generously providing us with hardware to perform our research this research was funded by the nascence eu project ir flair ir flair segmentation result from figure slice of the test image are examples of three scan methods used in the mr brain dataset and show the corresponding images after our procedure see in section input is omitted due to strong artefacts in the data the other datatypes are all used as input to the the segmentation result is shown in references hochreiter and schmidhuber long memory in neural computation based on tr tum pp gers schmidhuber and cummins learning to forget continual prediction with lstm in icann graves liwicki fernandez bertolami bunke and schmidhuber novel connectionist system for improved unconstrained handwriting recognition in pami sak senior and beaufays long memory recurrent neural network architectures for large scale acoustic modeling in proc interspeech sutskever vinyals and le sequence to sequence learning with neural networks tech nips graves and schmidhuber recurrent neural networks in icann graves and schmidhuber offline handwriting recognition with multidimensional recurrent neural networks in nips byeon 
breuel raue and liwicki scene labeling with lstm recurrent neural networks in cvpr kass witkin and terzopoulos snakes active contour models in international journal of computer vision wang gao shi li gilmore lin and shen links integration framework for segmentation of infant brain images in neuroimage ciresan giusti gambardella and schmidhuber deep neural networks segment neuronal membranes in electron microscopy images in nips cardona saalfeld preibisch schmid cheng pulokas tomancak and hartenstein an integrated macroarchitectural analysis of the drosophila brain by serial section electron microscopy in plos biology mendrik vincken kuijf biessels and viergever organizers mrbrains challenge online evaluation framework for brain image segmentation in mri scans http weber devir bronstein bronstein and kimmel parallel algorithms for approximation of distance maps on parametric surfaces in acm transactions on graphics segmentation of neuronal structures in em stacks challenge ieee international symposium on biomedical imaging isbi http pizer amburn austin cromartie geselowitz greer romeny and zimmerman adaptive histogram equalization and its variations in comput vision graph image process tieleman and hinton lecture divide the gradient by running average of its recent magnitude in coursera neural networks for machine learning hochreiter untersuchungen zu dynamischen neuronalen netzen diploma thesis institut informatik lehrstuhl brauer technische advisor schmidhuber chetlur woolley vandermersch cohen tran catanzaro and shelhamer cudnn efficient primitives for deep learning in corr liu jones seyedhosseini and tasdizen modular hierarchical approach to electron microscopy image segmentation in journal of neuroscience methods srivastava hinton krizhevsky sutskever and salakhutdinov dropout simple way to prevent neural networks from overfitting in journal of machine learning research pham bluche kermorvant and louradour dropout improves recurrent neural networks for handwriting recognition in icfhr ciresan meier masci and schmidhuber committee of neural networks for traffic sign classification in ijcnn krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips zeiler and fergus visualizing and understanding convolutional networks tech nyu 
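As a concrete illustration of the pyramidal topology described above, here is a minimal NumPy/SciPy sketch of one of the six C-LSTMs of a PyraMiD-LSTM scanning a volume plane by plane along a single direction. The weights are random and untrained, the 3x3 filters, four hidden channels and the 16x16x8 toy volume are arbitrary choices, and scipy's convolve2d stands in for the cuDNN convolutions used in the paper; a full layer runs six such scans (one per direction) and sums their outputs.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)

def conv(plane, kernels):
    """Multi-channel 'same' 2-D convolution: plane (C, W, H) with kernels (O, C, 3, 3)."""
    O, C = kernels.shape[:2]
    out = np.zeros((O,) + plane.shape[1:])
    for o in range(O):
        for c in range(C):
            out[o] += convolve2d(plane[c], kernels[o, c], mode="same")
    return out

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def clstm_scan(x, n_hidden=4, seed=0):
    """One C-LSTM direction: scan a volume x of shape (C, W, H, D) plane by plane
    along the last axis, computing the gates with 3x3 convolutions over the current
    input plane and the previous hidden plane."""
    r = np.random.default_rng(seed)
    C, W, Hh, D = x.shape
    th_x = {g: 0.1 * r.standard_normal((n_hidden, C, 3, 3)) for g in "ifco"}
    th_h = {g: 0.1 * r.standard_normal((n_hidden, n_hidden, 3, 3)) for g in "ifco"}
    bias = {g: np.zeros((n_hidden, 1, 1)) for g in "ifco"}
    h = np.zeros((n_hidden, W, Hh))
    c = np.zeros((n_hidden, W, Hh))
    out = np.zeros((n_hidden, W, Hh, D))
    for t in range(D):                         # the scan axis plays the role of time
        xt = x[:, :, :, t]
        i = sigmoid(conv(xt, th_x["i"]) + conv(h, th_h["i"]) + bias["i"])
        f = sigmoid(conv(xt, th_x["f"]) + conv(h, th_h["f"]) + bias["f"])
        g = np.tanh(conv(xt, th_x["c"]) + conv(h, th_h["c"]) + bias["c"])
        o = sigmoid(conv(xt, th_x["o"]) + conv(h, th_h["o"]) + bias["o"])
        c = g * i + c * f                      # cell update
        h = o * np.tanh(c)                     # plane of hidden activations
        out[:, :, :, t] = h
    return out

# toy volume: 1 channel, 16x16x8 voxels (untrained weights, for shape illustration only)
vol = rng.standard_normal((1, 16, 16, 8))
print(clstm_scan(vol).shape)
```

Because each plane is computed from the previous plane by a convolution, all pixels of a plane are processed in parallel, which is exactly the property that makes this topology map well onto GPU hardware.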
reflection refraction and hamiltonian monte carlo justin domke national ict australia nicta australian national university canberra act hadi mohasel afshar research school of computer science australian national university canberra act abstract hamiltonian monte carlo hmc is successful approach for sampling from continuous densities however it has difficulty simulating hamiltonian dynamics with functions leading to poor performance this paper is motivated by the behavior of hamiltonian dynamics in physical systems like optics we introduce modification of the leapfrog discretization of hamiltonian dynamics on piecewise continuous energies where intersections of the trajectory with discontinuities are detected and the momentum is reflected or refracted to compensate for the change in energy we prove that this method preserves the correct stationary distribution when boundaries are affine experiments show that by reducing the number of rejected samples this method improves on traditional hmc introduction markov chain monte carlo sampling is among the most general methods for probabilistic inference when the probability distribution is smooth hamiltonian monte carlo hmc originally called hybrid monte carlo uses the gradient to simulate hamiltonian dynamics and reduce random walk behavior this often leads to rapid exploration of the distribution hmc has recently become popular in bayesian statistical inference and is often the algorithm of choice some problems display piecewise smoothness where the density is differentiable except at certain boundaries probabilistic models may intrinsically have finite support being constrained to some region in bayesian inference it might be convenient to state piecewise prior more complex and highly piecewise distributions emerge in applications where the distributions are derived from other distributions the distribution of the product of two continuous random variables as well as applications such as preference learning or probabilistic programming while hmc is motivated by smooth distributions the inclusion of an acceptance probability means hmc does asymptotically sample correctly from piecewise however since leapfrog numerical integration of hamiltonian dynamics see relies on the assumption that the corresponding potential energy is smooth such cases lead to high rejection probabilities and poor performance hence traditional hmc is rarely used for piecewise distributions in physical systems that follow hamiltonian dynamics discontinuity in the energy can result in two possible behaviors if the energy decreases across discontinuity or the momentum is large enough to overcome an increase the system will cross the boundary with an instantaneous change in momentum known as refraction in the context of optics if the change in energy is too large to be overcome by the momentum the system will reflect off the boundary again with an instantaneous change in momentum technically here we assume the total measure of the points is zero so that with probability one none is ever encountered reflection refraction figure example trajectories of baseline and reflective hmc contours of the target distribution in two dimensions as defined in eq trajectories of the rejected red crosses and accepted blue dots proposals using baseline hmc the same with rhmc both use leapfrog parameters and in rhmc the trajectory reflects or refracts on the boundaries of the internal and external polytope boundaries and thus has far fewer rejected samples than hmc leading to faster mixing in 
practice more examples in supplementary material recently pakman and paninski proposed methods for sampling from piecewise gaussian distributions by exactly solving the hamiltonian equations and accounting for what we refer to as refraction and reflection above however since hamiltonian equations of motion can rarely be solved exactly the applications of this method are restricted to distributions whose logdensity is piecewise quadratic in this paper we generalize this work to arbitrary piecewise continuous distributions where each region is polytope is determined by set of affine boundaries we introduce modification to the leapfrog numerical simulation of hamiltonian dynamics called reflective hamiltonian monte carlo rhmc by incorporating reflection and refraction via detecting the first intersection of linear trajectory with boundary we prove that our method has the correct stationary distribution where the main technical difficulty is proving volume preservation of our dynamics to establish detailed balance numerical experiments confirm that our method is more efficient than baseline hmc due to having fewer rejected proposal trajectories particularly in high dimensions as mentioned the main advantage of this method over and is that it can be applied to arbitrary piecewise densities without the need for solution to the hamiltonian dynamics greatly increasing the scope of applicability exact hamiltonian dynamics consider distribution exp over rn where is the potential energy hmc is based on considering joint distribution on momentum and position space exp where and is quadratic meaning that exp is normal distribution if one could exactly simulate the dynamics hmc would proceed by iteratively sampling simulating the hamiltonian dynamics dqi pi dt dpi dt for some period of time and reversing the final value only needed for the proof of correctness since this will be immediately discarded at the start of the next iteration in practice since steps and both leave the distribution invariant so does markov chain that alternates between the two steps hence the dynamics have as stationary distribution of course the above differential equations are not when has discontinuities and are typically difficult to solve in reflection and refraction with exact hamiltonian dynamics take potential function which is differentiable in all points except at some boundaries of partitions suppose that when simulating the hamiltonian dynamics evolves over time as in the above equations whenever these equations are differentiable however when the state reaches boundary decompose the momentum vector into component perpendicular to the boundary and component pk parallel to the boundary let be the signed difference in potential energy on the two sides of the discontinuity if then is instantaneously replaced by that is the discontinuity is passed but the momentum is changed in the direction perpendicular to the boundary refraction if is positive the momentum will decrease and if it is negative the momentum will increase on the other hand if then is instantaneously replaced by that is if the particle momentum is insufficient to climb the potential boundary it bounces back by reversing the momentum component which is perpendicular to the boundary pakman and paninski present an algorithm to exactly solve these dynamics for quadratic however for the hamiltonian dynamics rarely have solution and one must resort to numerical integration the most successful method for which is known as the leapfrog dynamics reflection and 
refraction with leapfrog dynamics informally hmc with leapfrog dynamics iterates three steps sample perform leapfrog simulation by discretizing the hamiltonian equations into steps using some small stepsize here one interleaves position step between two half momentum steps reverse the sign of if is the starting point of the leapfrog dynamics and is the final point accept the move with probability min exp see algorithm it can be shown that this baseline hmc method has detailed balance with respect to even if is discontinuous however discontinuities mean that large changes in the hamiltonian may occur meaning many steps can be rejected we propose modification of the dynamics namely reflective hamiltonian monte carlo rhmc which is also shown in algorithm the only modification is applied to the position steps in rhmc the first intersection of the trajectory with the boundaries of the polytope that contains must be detected the position step is only taken up to this boundary and occurs depending on the momentum and change of energy at the boundary this process continues until the entire amount of time has been simulated note that if there is no boundary in the trajectory to time this is equivalent to baseline hmc also note that several boundaries might be visited in one position step as with baseline hmc there are two alternating steps namely drawing new momentum variable from exp and proposing move and accepting or rejecting it with probability determined by ratio we can show that both of these steps leave the joint distribution invariant and hence markov chain that also alternates between these steps will also leave invariant as it is easy to see drawing from will leave invariant we concentrate on the second step where move is proposed according to the piecewise leapfrog dynamics shown in alg firstly it is clear that these dynamics are meaning that if the simulation takes state to it will also take state to secondly we will show that these dynamics are volume preserving formally if denotes the leapfrog dynamics we will show that the absolute value of the determinant of the jacobian of is one these two properties together show that the probability density of proposing move from to is the same of that proposing move from to thus if the move is accepted according to the standard ratio min exp then detailed balance will be satisfied to see this let denote the proposal distribution then the usual proof of correctness for applies namely that min exp min exp the final equality is easy to establish considering the cases where and separately this means that detailed balance holds and so is stationary distribution the major difference in the analysis of rhmc relap tive to traditional hmc is that showing conservation of volume is more difficult with standard hmc and leapfrog steps volume conservation is easy to show by observing that each part of leapfrog step is shear transformation this is not the case with rhmc and so we must resort to full analysis of the determinant of the jacobian as explored in the figure transformation hq pi following section described by lemma refraction algorithm baseline eflective hmc lgorithms input current sample potential function leapfrog steps leapfrog step size output next sample begin for to do evolution of momentum if baseline hmc then evolution of position if eflective hmc else while hx tx φi irst iscontinuity do tx pk decompose parallel to boundary plane if then refraction else reflection pk evolution of momentum not required in practice for reversibility proof if 
return else return end note irst iscontinuity returns the position of the first intersection of boundary plain with line segment tx the time it is visited the change in energy at the discontinuity and the visited partition boundary if no such point exists is returned volume conservation refraction in our first result we assume without loss of generality that there is boundary located at the hyperplane this lemma shows that in the refractive case volume is conserved the setting is visualized in figure lemma let hq pi be transformation in rn that takes unit mass located at qn and moves it with constant momentum pn till it reaches plane at some xn where is constant subsequently the momentum is changed to pn where is function of the move is carried on for the total time period till it ends in for all satisfies the volume preservation property proof since for the momentum is not affected by the collision qi pi and pi thus therefore if we explicitly write out the jacobian determinant of the transformation it is now using standard properties of the determinant we have that we will now explicitly calculate these four derivatives due to the significance of the result we carry out the computations in detail nonetheless as this is largely mechanical process for brevity we do not comment on the derivation let be the time to reach and be the period between reaching and the last point then def def def by pn pn pn substituting the appropriate terms into we get that reflection now we turn to the reflective case and again show that volume is conserved again we assume without loss of generality that there is boundary located at the hyperplane lemma let hq pi be transformation in rn that takes unit mass located at qn and moves it with the constant momentum pn till it reaches plane at some point xn where is constant subsequently the mass is bounced back reflected with momentum pn the move is carried on for total time period till it ends in for all satisfies the volume preservation property proof similar to lemma for qi and pi therefore for any consequently by equation and since as before let be the time to reach and be the period between reaching and the last point def def def that is and it follows that is equal to hence reflective leapfrog dynamics theorem in rhmc algorithm for sampling from continuous and piecewise distribution which has affine partitioning boundaries leapfrog simulation preserves volume in space proof we split the algorithm into several atomic transformations ti each transformation is either momentum step full position step with no or full or partial position step where exactly one reflection or refraction occurs to prove that the total algorithm preserves volume it is sufficient to show that the volume is preserved under each ti since transformations of kind and are shear mappings and therefore they preserve the volume now consider full or partial position step where single refraction occurs if the reflective plane is in form by lemma the volume preservation property holds otherwise as long as the reflective plane is affine via rotation of basis vectors the problem is reduced to the former case since volume is conserved under rotation in this case the volume is also conserved with similar reasoning by lemma reflection on affine reflective boundary preserves volume thus since all component transformations of rhmc leapfrog simulation preserve volume the proof is complete along with the fact that the leapfrog dynamics are this shows that the algorithm satisfies detailed balance and so has 
the correct stationary distribution.

Experiments. Compared to baseline HMC, we expect RHMC to simulate Hamiltonian dynamics more accurately and therefore lead to fewer rejected samples. On the other hand, this comes at the expense of slower leapfrog position steps, since intersections, reflections, and refractions must be computed. To test this trade-off, we compare RHMC to baseline HMC and to a tuned Metropolis-Hastings (MH) sampler with a simple isotropic normal proposal distribution. MH is tuned automatically by testing a set of equidistant proposal variances in an interval and accepting the variance for which the acceptance rate is closest to a target value. The number of leapfrog steps and the step size for baseline HMC and RHMC are fixed (many other choices are reported in the appendix); while HMC performance is quite sensitive to these parameters, RHMC is consistently faster.

The comparison takes place on a heavy-tailed piecewise model whose negative log-probability is defined by cases, switching between expressions involving a positive definite matrix $A$ depending on the region in which the state lies. We carry out the experiment on three choices of the position-space dimensionality. Due to the symmetry of the model, the ground-truth expected value of the position is known to be zero; therefore, the absolute error of the expected value estimated by a chain of MCMC samples in each dimension is simply the absolute value of the mean of that element of the sample vectors. The worst mean absolute error over all dimensions is taken as the error measurement of a chain: $\mathrm{WMAE} = \max_{d} \bigl| \tfrac{1}{N} \sum_{i=1}^{N} q^{(d)}_i \bigr|$, where $q^{(d)}_i$ is the $d$-th coordinate of the $i$-th sample. For each algorithm, several Markov chains are run, and the mean WMAE (with confidence intervals shown as error bars) is plotted against the number of iterations (Markov chain sizes) and against wall-clock time in milliseconds. All algorithms are implemented in Java and run on a single CPU thread. For each repetition, a random starting point is chosen uniformly and used for all three algorithms. We use a diagonal matrix for $A$, where for each repetition each entry on the main diagonal is chosen with equal probability from two fixed values. As the results show, even in low dimensions the extra cost of the position step is more or less compensated by the higher effective sample size, but as the dimensionality increases, RHMC significantly outperforms both baseline HMC and tuned MH.

Figure: worst mean absolute error per dimension versus iterations and versus time (ms). MH is Metropolis-Hastings with a tuned isotropic Gaussian proposal distribution. Many more examples appear in the supplementary material.

Conclusion. We have presented a modification of the leapfrog dynamics for Hamiltonian Monte Carlo on piecewise-smooth energy functions with affine boundaries (each region is a polytope), inspired by physical systems. Though traditional Hamiltonian Monte Carlo can in principle be used on such functions, the fact that the Hamiltonian will often be dramatically changed by the dynamics can result in a very low acceptance ratio, particularly in high dimensions. By better preserving the Hamiltonian, reflective Hamiltonian Monte Carlo (RHMC) accepts more moves and thus has a higher effective sample size, leading to much more efficient probabilistic inference. To use this method, one must be able to detect the first intersection of a position trajectory with the polytope boundaries.
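The position step is the only place where RHMC differs from standard leapfrog integration. As a concrete companion to the algorithm described above, the following Python sketch implements a single reflective/refractive position step, assuming each affine boundary is supplied as a triple (normal vector $a$, offset $c$, energy jump $\Delta U$ when crossing in the direction of $a$); the function names and this boundary representation are illustrative choices, not the authors' implementation.

```python
import numpy as np

def first_crossing(x, p, t_left, boundaries):
    """Earliest boundary hit within time t_left along x + t*p, or (None, None)."""
    best_t, best_b = None, None
    for (a, c, dU) in boundaries:
        speed = a @ p
        if abs(speed) < 1e-12:              # moving parallel to this hyperplane
            continue
        t = (c - a @ x) / speed             # solve a.(x + t*p) = c
        if 1e-12 < t <= t_left and (best_t is None or t < best_t):
            best_t, best_b = t, (a, c, dU)
    return best_t, best_b

def reflective_position_step(x, p, eps, boundaries):
    """Advance (x, p) for total time eps, refracting or reflecting at affine boundaries."""
    x, p, t_left = x.astype(float), p.astype(float), eps
    while True:
        t, hit = first_crossing(x, p, t_left, boundaries)
        if t is None:                        # no further boundary before the step ends
            return x + t_left * p, p
        a, c, dU = hit
        x, t_left = x + t * p, t_left - t    # move exactly onto the boundary
        a_hat = a / np.linalg.norm(a)
        p_perp = (p @ a_hat) * a_hat         # momentum component normal to the boundary
        dU_signed = dU if (p @ a_hat) > 0 else -dU   # jump seen in the travel direction
        kin_perp = 0.5 * p_perp @ p_perp
        if kin_perp > dU_signed:             # refraction: cross, rescaling normal momentum
            scale = np.sqrt(2.0 * (kin_perp - dU_signed)) / np.linalg.norm(p_perp)
            p = (p - p_perp) + scale * p_perp
        else:                                # reflection: bounce off the boundary
            p = p - 2.0 * p_perp

# Example: one boundary x1 = 1 with an energy increase of 0.4 beyond it.
x0, p0 = np.zeros(2), np.array([1.0, 0.2])
b = [(np.array([1.0, 0.0]), 1.0, 0.4)]
print(reflective_position_step(x0, p0, eps=2.0, boundaries=b))
```

Interleaving this step between the two half momentum updates, in place of the plain position update, gives the RHMC proposal; when no boundary is crossed, it reduces to the ordinary leapfrog position step.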
Acknowledgements. NICTA is funded by the Australian Government through the Department of Communications and by the Australian Research Council through the ICT Centre of Excellence program.

References

Hadi Mohasel Afshar, Scott Sanner, and Ehsan Abbasnejad. Gibbs sampling in piecewise graphical models. In Association for the Advancement of Artificial Intelligence (AAAI).
Steve Brooks, Andrew Gelman, Galin Jones, and Meng. Handbook of Markov Chain Monte Carlo. CRC Press.
Hans Adolph Buchdahl. An Introduction to Hamiltonian Optics. Courier Corporation.
Simon Duane, Anthony Kennedy, Brian Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters.
Andrew Glen, Lawrence Leemis, and John Drew. Computing the distribution of the product of two continuous random variables. Computational Statistics and Data Analysis.
Donald Greenwood. Principles of Dynamics. Englewood Cliffs, NJ.
Matthew Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research.
David Lunn, David Spiegelhalter, Andrew Thomas, and Nicky Best. The BUGS project: evolution, critique and future directions. Statistics in Medicine.
Radford Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo.
Ari Pakman and Liam Paninski. Exact Hamiltonian Monte Carlo samplers for binary distributions. In Advances in Neural Information Processing Systems.
Ari Pakman and Liam Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics.
Gareth Roberts, Andrew Gelman, Walter Gilks, et al. Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability.
Stan Development Team. Stan Modeling Language User's Guide and Reference Manual.
The Consistency of Common Neighbors for Link Prediction in Stochastic Blockmodels

Deepayan Chakrabarti (IROM, McCombs School of Business, University of Texas at Austin), Purnamrita Sarkar (Department of Statistics, University of Texas at Austin), and Peter Bickel (Department of Statistics, University of California, Berkeley)

Abstract. Link prediction and clustering are key problems for network data. While spectral clustering has strong theoretical guarantees under the popular stochastic blockmodel formulation of networks, it can be expensive for large graphs. On the other hand, the heuristic of predicting links to nodes that share the most common neighbors with the query node is much faster and works very well in practice. We show theoretically that the common neighbors heuristic can extract clusters with high probability when the graph is dense enough, and can do so even in sparser graphs with the addition of a cleaning step. Empirical results on simulated and real-world data support our conclusions.

Introduction. Networks are the simplest representation of relationships between entities, and as such have attracted significant attention recently. Their applicability ranges from social networks such as Facebook, to collaboration networks of researchers, citation networks of papers, trust networks such as Epinions, and so on. Common applications on such data include ranking, recommendation, and user segmentation, which have seen wide use in industry. Most of these applications can be framed in terms of two problems: (a) link prediction, where the goal is to find a few nodes similar to a given query node, and (b) clustering, where we want to find groups of similar individuals, either around a given seed node or as a full partitioning of all nodes in the network.

An appealing model of networks is the stochastic blockmodel, which posits the existence of a latent cluster for each node, with link probabilities between nodes being simply functions of their clusters. Inference of the latent clusters allows one to solve both the link prediction problem and the clustering problem (predict all nodes in the query node's cluster). Strong theoretical and empirical results have been achieved by spectral clustering, which uses the singular value decomposition of the network followed by a clustering step on the eigenvectors to determine the latent clusters. However, singular value decomposition can be expensive, particularly for large graphs and when many eigenvectors are desired; unfortunately, both of these are common requirements.

Instead, many fast heuristic methods are often used and are empirically observed to yield good results. One particularly common and effective method is to predict links to nodes that share many common neighbors with the query node $q$: rank nodes $v$ by $\mathrm{CN}(q,v) = |\{u : q \sim u \text{ and } u \sim v\}|$, where $u \sim v$ represents an edge between $u$ and $v$. The intuition is that $q$ probably has many links to others in its cluster, and hence probably also shares many common friends with others in its cluster. Counting common neighbors is also particularly fast: it is a join operation that is supported by standard databases and data-processing systems.

In this paper we study the theoretical properties of the common neighbors heuristic. Our contributions are the following. We present, to our knowledge, the first theoretical analysis of common neighbors for the stochastic blockmodel. We demarcate two regimes of graph density under which common neighbors can be successfully used for both link prediction and clustering. In particular, in the denser regime, the number of common neighbors between the query node $q$ and another node within its cluster is significantly higher than that with a node
outside its cluster hence simple threshold on the number of common neighbors suffices for both link prediction and clustering however in the regime there are too few common neighbors with any node and hence the heuristic does not work however we show that with simple additional cleaning step we regain the theoretical properties shown for the case we empirically demonstrate the effectiveness of counting common neighbors followed by the cleaning on variety of simulated and datasets related work link prediction has recently attracted lot of attention because of its relevance to important practical problems like recommendation systems predicting future connections in friendship networks better understanding of evolution of complex networks study of missing or partial information in networks etc algorithms for link prediction fall into two main groups and methods these methods use similarity measures based on network topology for link prediction some methods look at nodes two hops away from the query node counting common neighbors the jaccard index the score etc more complex methods include nodes farther away such as the katz score and methods based on random walks these are often intuitive easily implemented and fast but they typically lack theoretical guarantees methods the second approach estimates parametric models for predicting links many popular network models fall in the latent variable model category these models assign latent random variables zn to nodes in network these variables take values in general space the probability of linkage between two nodes is specified via symmetric map these zi can be uniform or positions in some latent space in mixture of multivariate gaussian distributions is used each for separate cluster stochastic blockmodel is special class of these models where zi is binary length vector encoding membership of node in cluster in well known special case the planted partition model all nodes in the same cluster connect to each other with probability whereas all pairs in different clusters connect with probability in fact under broad parameter regimes the blockmodel approximation of networks has recently been shown to be analogous to the use of histograms as summaries of an unknown probability distribution varying the number of bins or the bandwidth corresponds to varying the number or size of communities thus blockmodels can be used to approximate more complex models under broad smoothness conditions if the number of blocks are allowed to increase with empirical results as the models become more complex they also become computationally demanding it has been commonly observed that simple and easily computable measures like common neighbors often have competitive performance with more complex methods this behavior has been empirically established across variety of networks starting from networks to router level internet connections protein protein interaction networks and electrical power grid network theoretical results spectral clustering has been shown to asymptotically recover cluster memberships for variations of stochastic blockmodels however apart from there is little understanding of why simple methods such as common neighbors perform so well empirically given their empirical success and computational tractability the common neighbors heuristic is widely applied for large networks understanding the reasons for the accuracy of common neighbors under the popular stochastic blockmodel setting is the goal of our work proposed work many link prediction methods 
ultimately make two assumptions each node belongs to latent cluster where nodes in the same cluster have similar behavior and each node is very likely to connect to others in its cluster so link prediction is equivalent to finding other nodes in the cluster these assumptions can be relaxed instead of belonging to the same cluster nodes could have topic distributions with links being more likely between pairs of nodes with similar topical interests however we will focus on the assumptions stated above since they are clean and the relaxations appear to be fundamentally similar model specifically consider stochastic blockmodel where each node belongs to an unknown cluster ci ck we assume that the number of clusters is fixed as the number of nodes increases we also assume that each cluster has members though this can be relaxed easily the probability of link between nodes and depends only on the clusters of and bci cj ci cj ci cj for some in other words the probability of link is between nodes in the same cluster and otherwise by definition if the nodes were arranged so that all nodes in cluster are contiguous then the corresponding matrix when plotted attains structure with the diagonal blocks corresponding to links within cluster being denser than blocks since under these assumptions we ask the following two questions problem link prediction and recommendation given node how can we identify at least constant number of nodes from ci problem local cluster detection given node how can we identify all nodes in ci problem can be considered as the problem of finding good recommendations for given node here the goal is to find few good nodes that could connect to recommending few possible friends on facebook or few movies to watch next on netflix since withincluster links have higher probability than links predicting nodes from ci gives the optimal answer crucially it is unnecessary to find all good nodes as against that problem requires us to find everyone in the given node cluster this is the problem of detecting the entire cluster corresponding to given node note that problem is clearly harder than problem we next present summary of our results and the underlying intuition before delving into the details intuition and result summary current approaches standard approaches to inference for the stochastic blockmodel attempt to solve an even harder problem problem full cluster detection how can we identify the latent clusters ci for all popular solution is via spectral clustering involving two steps computing the eigenvectors of the graph laplacian and clustering the projections of each node on the corresponding eigenspace via an algorithm slight variation of this has like been shown to work as long as log and the average degree grows faster than powers of however spectral clustering solves harder problem than problems and and can be expensive particularly for very large graphs our claim is that simpler operation counting common neighbors between nodes can yield results that are almost as good in broad parameter regime common neighbors given node link prediction via common neighbors follows simple prescription predict link to node such that and have the maximum number of shared friends cn the usefulness of common neighbors have been observed in practice and justified theoretically for the latent distance model however its properties under the stochastic blockmodel remained unknown intuitively we would expect pair of nodes and from the same cluster to have many common neighbors from the same cluster 
since both the links and occur with probability whereas for ci cj at least one of the edges and must have the lower probability cn ci cj cu ci ci cj cu ci ci cj cn ci cj αγp cu ci or cu cj ci cj cu ci cu cj ci cj cn ci cj cn ci cj thus the expected number of common neighbors is higher when ci cj if we can show that the random variable cn concentrates around its expectation node pairs with the most common neighbors would belong to the same cluster thus common neighbors would offer good solution to problem we show conditions under which this is indeed the case there are three key points regarding our method handling dependencies between common neighbor counts defining the graph density regime under which common neighbors is consistent and proposing variant of common neighbors which significantly broadens this region of consistency dependence cn and cn are dependent hence distinguishing between and nodes can be complicated even if each cn concentrates around its expectation we handle this via careful conditioning step dense versus sparse graphs in general the parameters and can be functions of and we can try to characterize parameter settings when common neighbors consistently returns nodes from the same cluster as the input node we show that when the graph is sufficiently dense average degree is growing faster than log common neighbors is powerful enough to answer problem also can go to zero at suitable rate on the other hand the expected number of common neighbors between nodes tends to zero for sparser graphs irrespective of whether the nodes are in the same cluster or not further the standard deviation is of higher order than the expectation so there is no concentration in this case counting common neighbors fails even for problem variant with better consistency properties however we show that the addition of an extra step henceforth the cleaning step still enables common neighbors to identify nodes from its own cluster while reducing the number of nodes to zero with probability tending to one as this requires stronger separation condition between and however such strong consistency is only possible when the average degree grows faster than log thus the cleaning step extends the consistency of common neighbors beyond the range main results we first split the edge set of the complete graph on nodes into two sets and its complement independent of the given graph we compute common neighbors on and perform cleaning process on the adjacency matrices of and are denoted by and we will fix reference node which belongs to class without loss of generality recall that there are clusters ck each of size nπ let xi denote the number of common neighbors between and algorithm computes the set xi tn of nodes who have at least tn common neighbors with on whereas algorithm does further degree thresholding on to refine into algorithm common neighbors screening algorithm procedure scan tn for xi xq xi tn return algorithm post selection cleaning algorithm procedure clean sn sn return to analyze the algorithms we must specify conditions on graph densities recall that and represent and link probabilities we assume that is constant while equivalently assume that both and are both some constant times where the analysis of graphs has typically been divided into two regimes the dense regime consists of graphs with nρ where the expected degree nρ is fraction of as grows in the sparse regime nρ so degree is roughly constant our work explores finer gradation which we call and defined next definition graph sequence of 
graphs is called if log as definition graph sequence of graphs is called if but log as our first result is that common neighbors is enough to solve not only the problem problem but also the local clustering problem problem in the case this is because even though both nodes within and outside the query node cluster have growing number of common neighbors with there is clear distinction in the expected number of common neighbors between the two classes also since the standard deviation is of smaller order than the expectation the random variables concentrate thus we can pick threshold tn such that scan tn yields just the nodes in the same cluster as with high probability note that the cleaning step algorithm is not necessary in this case theorem solves problem in graphs let tn let be the set of nodes returned by scan tn let nw and no denote the number of nodes in and respectively if the graph is log then nw nπ and no and if nα proof sketch we ponly sketch the proof here deferring details to the supplementary material let dqa be the number of links from the query node to nodes in cluster ca let dq qqk and dqa we first show that nπα ψn dq good dqa nπγ ψn qp ψn log nπγ log log conditioned on dq xi is the sum of binomial dqa independent random variables representing the number of common neighbors between and via nodes in each of the clusters xi dq ca dqa dqa we have xi dq good ψn ψn ηa xi dq good ca ψn un ψn note that tn un tn and nπ log log where we applied condition on noted in the theorem statement we show xi tn dq good xi tn dq good ca conditioned on dq both nw and no are sums of conditionally independent and identically distributed bernoullis nw nπ dq good nw nπ dq good nπ no dq good no dq good there are two major differences between the and cases first in the case both expectations and ηa are of the order which tends to zero second standard deviations on the number of common neighbors are of larger order than expectations together this means that the number of common neighbors to and nodes can no longer be separated hence algorithm by itself can not work however after cleaning the entire cluster of the query node can still be recovered theorem algorithm followed by algorithm solves problem in graphs let tn and sn πα let scan tn denote the number of nodes in if the graph is and πα then nw nπ and no and clean sn let nw no proof sketch we only sketch the proof here with details being deferred to the supplementary material the degree bounds of eq and the equations for xi good hold even in the case we can also bound the variances of xi which are sums of conditionally independent bernoullis var xi dq good xi dq good since the expected number of common neighbors vanishes and the standard deviation is an order larger than the expectation there is no hope for concentration however there are slight differences in the probability of having at least one common neighbor first by an application of the inequality we find xi dq good xi dq good var xi dq good xi dq good ψn since for markov inequality gives pa xi dq good ca xi dq good ca ηa even though pa nπpa so we can use concentration inequalities like the chernoff bound again to bound nw and no nw log no pa log pa unlike the denser regime nw and no can be of the same order here hence the candidate set returned by thresholding the common neighbors has fraction of nodes from outside community however this fraction is relatively small which is what we would exploit in the cleaning step let θw and θo denote the expected number of edges in from node to the 
separation condition in the theorem statement gives θw θo θw log setting the degree threshold sn θw θo we bound the probability of mistakes in the cleaning step sn dq good sn dq good removing the conditioning on dq good as in theorem yields the desired result experiments we present our experimental results in two parts first we use simulations to support our theoretical claims next we present link prediction accuracies on real world collaborative networks to show that common neighbors indeed perform close to gold standard algorithms like spectral clustering and the katz score implementation details recall that our algorithms are based on thresholding when there is large gap between common neighbors between node and nodes in its cluster in the regime this is equivalent to using the algorithm with to find in algorithm the same holds for finding in algorithm when the number of nodes with more than two common neighbors is less than ten we define the set by finding all neighbors with at least one common neighbor as in the regime on the other hand since the cleaning step works only when is sufficiently large so that degrees concentrate we do not perform any cleaning when while we used the split sample graph in the cleaning step for ease of analysis we did the cleaning using the same network in the experiments experimental setup for simulations we use stochastic blockmodel of nodes split into clusters for each value of we pick query nodes at random and calculate the precision and recall of the result against nodes from the query node cluster for any subset and true cluster precision and recall we report mean precision and recall over random generated graph instances accuracy on simulated data figure shows the precision and recall as degree grows with the parameters satisfying the condition πα of thm we see that cleaning helps both precision and recall particularly in the range the regime as reference we also plot the precision of spectral clustering when it was given the correct number of clusters above average degree of spectral clustering gives perfect precision whereas common neighbors can identify large fraction of the true cluster once average degree is above on the other hand for average degree less than seven spectral clustering performs poorly whereas the precision of common neighbors is remarkably higher precision is relatively higher than recall for broad degree regime and this explains why common neighbors are popular choice for link prediction on side figure recall and precision versus average degree when degree is very small none of the methods work well in the range regime we see that common neighbors gets increasingly better precision and recall and cleaning helps with high enough degrees regime just common neighbors is sufficient and gets excellent accuracy table auc scores for networks dataset hepth citeseer nips mean degree auc cn spec katz random note it is not surprising that in very sparse graph common neighbors can not identify the whole cluster since not everyone can be reached in two hops accuracy on data we used publicly available datasets over time where nodes represent authors and an edge represents collaboration between two authors in particular we used subgraphs of the high energy physics hepth coauthorship dataset timesteps the nips dataset timesteps and the citeseer dataset timesteps we obtain the training graph by merging the first networks use the step for and use the last timestep as the test graph the number of nodes and average degrees are reported in table we 
merged several years of papers to create one timestep, so that the median degree of the test graph is not too small. We compare our algorithm (CN, with the cleaning step where applicable) with the Katz score, which is widely used in link prediction, and with spectral clustering of the network. Spectral clustering is carried out on the giant component of the network; furthermore, we select the number of clusters using the held-out graph. Our setup is very similar to link prediction experiments in the related literature. Since these datasets are unlabeled, we cannot calculate precision or recall as before. Instead, for any score or affinity measure, we perform link prediction experiments as follows: for a randomly picked node, we calculate the score from that node to everyone else and compute the AUC of this score vector against the edges in the test graph. We report the AUC averaged over the randomly picked nodes. The table of AUC scores shows that, even in sparse regimes, common neighbors performs comparably to the benchmark algorithms.

Conclusions. Counting common neighbors is a particularly useful heuristic: it is fast and also works well empirically. We prove the effectiveness of common neighbors for link prediction as well as local clustering around a query node under the stochastic blockmodel setting. In particular, we show the existence of a regime where common neighbors alone yields the right cluster, and a regime where an additional cleaning step is required. Experiments with simulated as well as real-world datasets show the efficacy of our approach, including the importance of the cleaning step.

References

Adamic and Adar. Friends and neighbors on the web. Social Networks.
Backstrom and Leskovec. Supervised random walks: predicting and recommending links in social networks. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. ACM, New York, NY, USA.
Bickel and Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences of the United States of America.
Chaudhuri, Graham, and Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. Journal of Machine Learning Research, Proceedings Track.
Handcock, Raftery, and Tantrum. Clustering for social networks. Journal of the Royal Statistical Society (Statistics in Society).
Holland, Laskey, and Leinhardt. Stochastic blockmodels: first steps. Social Networks.
Katz. A new status index derived from sociometric analysis. Psychometrika.
Liben-Nowell and Kleinberg. The link prediction problem for social networks. In Conference on Information and Knowledge Management. ACM.
Lü and Zhou. Link prediction in complex networks: a survey. Physica A.
McSherry. Spectral partitioning of random graphs. In FOCS.
Olhede and Wolfe. Network histograms and universality of blockmodel approximation. Proceedings of the National Academy of Sciences of the United States of America.
Raftery, Handcock, and Hoff. Latent space approaches to social network analysis. Journal of the American Statistical Association.
Rohe, Chatterjee, and Yu. Spectral clustering and the stochastic blockmodel. Annals of Statistics.
Sarkar and Bickel. Role of normalization in spectral clustering for stochastic blockmodels. To appear in the Annals of Statistics.
Sarkar, Chakrabarti, and Moore. Theoretical justification of popular link prediction heuristics. In Conference on Learning Theory. ACM.
Sarkar and Moore. A tractable approach to finding closest neighbors in large graphs. In Proc. UAI.
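To make the screening-and-cleaning idea concrete, here is a small self-contained simulation in Python under a planted-partition blockmodel. The `scan` and `clean` functions mirror the roles of the paper's SCAN and CLEAN procedures, but the thresholds below (a quantile of the common-neighbor counts for $t_n$ and a simple degree cutoff for $s_n$) are illustrative placeholders rather than the constants from the theorems, and the sketch reuses one graph instead of the edge split used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def planted_partition(n, k, p, q, rng):
    """Adjacency matrix of an SBM with k equal clusters, within-prob p, across-prob q."""
    labels = np.repeat(np.arange(k), n // k)
    probs = np.where(labels[:, None] == labels[None, :], p, q)
    A = np.triu((rng.random((n, n)) < probs).astype(int), 1)
    return A + A.T, labels

def scan(A, q_node, t_n):
    """Return nodes sharing more than t_n common neighbors with the query node."""
    cn = A @ A[q_node]          # cn[i] = number of common neighbors of i and q_node
    cn[q_node] = -1             # exclude the query node itself
    return np.flatnonzero(cn > t_n)

def clean(A, candidates, s_n):
    """Keep candidates with more than s_n edges into the candidate set."""
    deg_into = A[np.ix_(candidates, candidates)].sum(axis=1)
    return candidates[deg_into > s_n]

n, k, p, q = 2000, 4, 0.08, 0.01
A, labels = planted_partition(n, k, p, q, rng)
q_node = 0
cand = scan(A, q_node, t_n=np.quantile(A @ A[q_node], 0.75))
kept = clean(A, cand, s_n=0.5 * cand.size * (p + q) / 2)
print("precision before cleaning:", np.mean(labels[cand] == labels[q_node]))
print("precision after cleaning: ", np.mean(labels[kept] == labels[q_node]))
```

Computing all common-neighbor counts at once as `A @ A[q_node]` is exactly the join operation mentioned in the introduction, which is why the heuristic scales so easily to large graphs.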
Nearly Optimal Private LASSO

Kunal Talwar (Google Research), Abhradeep Thakurta (previously Yahoo Labs), and Li Zhang (Google Research)

Abstract. We present a nearly optimal differentially private version of the well-known LASSO estimator. Our algorithm provides privacy protection with respect to each training example. The excess risk of our algorithm, compared to the non-private version, is small assuming only that all the input data have bounded norm; this is the first differentially private algorithm that achieves such a bound without a polynomial dependence on the dimension and under no additional assumptions on the design matrix. In addition, we show that this error bound is nearly optimal amongst all differentially private algorithms.

Introduction. A common task in supervised learning is to select the model that best fits the data. This is frequently achieved by selecting a loss function that associates a loss with each datapoint and model, and then selecting, from a class of admissible models, the model that minimizes the average loss over all data points in the training set. This procedure is commonly referred to as empirical risk minimization (ERM). The availability of large datasets containing sensitive information from individuals has motivated the study of learning algorithms that guarantee the privacy of the individuals contributing to the database. A rigorous and standard privacy guarantee is given by the notion of differential privacy. In this work we study the design of differentially private algorithms for empirical risk minimization, continuing a long line of work. In particular, we study adding privacy protection to the classical LASSO estimator, which has been widely used and analyzed.

We first present a differentially private optimization algorithm for the LASSO estimator. The algorithm is a combination of the classical Frank-Wolfe algorithm and the exponential mechanism for guaranteeing privacy. We then show that our algorithm achieves nearly optimal risk among all differentially private algorithms; the lower bound proof relies on recently developed techniques with roots in cryptography.

Consider a training dataset consisting of pairs of data $d_i = (x_i, y_i)$, where $x_i \in \mathbb{R}^p$ is usually called the feature vector and $y_i$ the prediction. The LASSO estimator, or sparse linear regression, solves for $\hat{\theta} = \arg\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n} (\langle x_i, \theta \rangle - y_i)^2$ subject to an $\ell_1$ constraint on $\theta$. To simplify the presentation we take the constraint radius to be one, but our results directly extend to the general case. The $\ell_1$ constraint tends to induce a sparse $\hat{\theta}$, so it is widely used in the high-dimensional setting where $p$ can be much larger than $n$. Here we study approximating the LASSO estimate with the minimum possible error while protecting the privacy of each individual $d_i$. Below we define the setting more formally. (Part of this work was done at the Microsoft Research Silicon Valley campus.)

Problem definition. Given a data set $D = \{d_1, \ldots, d_n\}$ of $n$ samples from a domain $\mathcal{D}$, a constraint set $\mathcal{C} \subseteq \mathbb{R}^p$, and a loss function $\ell$, for any model $\theta \in \mathcal{C}$ define its excess empirical risk as $\mathrm{risk}(\theta; D) \stackrel{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; d_i) - \min_{\theta' \in \mathcal{C}} \frac{1}{n}\sum_{i=1}^{n} \ell(\theta'; d_i)$. For the LASSO, the constraint set is the $\ell_1$ ball and the loss is the quadratic loss function. We define the risk of a mechanism $\mathcal{A}$ on a data set $D$ as the expected excess empirical risk of its output, where the expectation is over the internal randomness of $\mathcal{A}$, and the risk of $\mathcal{A}$ as the maximum of this quantity over all possible data sets. Our objective is then to design a mechanism $\mathcal{A}$ which preserves privacy (in the sense of the definition below) and achieves as low a risk as possible. We call the minimum achievable risk the privacy risk, defined as $\min_{\mathcal{A}} \mathrm{risk}(\mathcal{A})$, where the minimum is over all private mechanisms.

There has been much work on studying the privacy risk for the LASSO estimator. However, all the previous results either need to make strong assumptions about the input data or have a polynomial dependence on the dimension. Prior works have
studied the lasso estimator with differential privacy guarantee they showed that one can avoid the polynomial dependence on in the excess empirical risk if the data matrix satisfy the restricted strong convexity and mutual incoherence properities while such assumptions seem necessary to prove that lasso recovers the exact support in the worst case they are often violated in practice where lasso still leads to useful models it is therefore desirable to design and analyze private versions of lasso in the absence of such assumptions in this work we do so by analyzing the loss achieved by the private optimizer compared to the true optimizer we make primarily two contributions in this paper first we present an algorithm that achieves the privacy risk of for the lasso compared to the previous work we only assume that the input data has bounded norm in addition the above risk bound only has logarithmic dependence on which fits particularly well for lasso as we usually assume when applying lasso this bound is achieved by private version of the algorithm assuming that each data point di satisfies that kdi we have theorem there exists an private algorithm for lasso such that log np log our second contribution is to show that surprisingly this simple algorithm gives nearly tight bound we show that this rather unusual dependence is not an artifact of the algorithm or the analysis but is in fact the right dependence for the lasso problem no differentially private algorithm can do better we prove lower bound by employing fingerprinting codes based techniques developed in theorem for the sparse linear regression problem where kxi for and any private algorithm must have log our improved privacy risk crucially depends on the fact that the constraint set is polytope with few polynomial in dimensions vertices this allows us to use private version of the algorithm where at each step we use the exponential mechanism to select one of the vertices of the polytope we also present variant of that uses objective perturbation instead of the exponential mechanism we show that theorem we can obtain risk bound dependent on the gaussian width of the constraint set which often results in tighter bounds compared to bounds based on diameter while more general this variant adds much more noise than the frankwolfe based algorithm as it is effectively publishing the whole gradient at each step when is not polytope with small number of vertices one can still use the exponential mechanism as long as one has small list of candidate points which contains an approximate optimizer for every direction for many simple cases for example the ball with the bounds attained in this way have to hide logarithmic factors throughout the paper we use an additional polynomial dependence on the dimension instead of the logarithmic dependence in the above result for example when the upper bound from this variant has an extra factor of whereas such dependence is provably needed for the upper bound jump rather abruptly from the logarithmic dependence for to polynomial dependence on for we leave open the question of resolving this discontinuity and interpolating more smoothly between the case and the case our results enlarge the set of problems for which privacy comes for free given samples from distribution suppose that is the empirical risk minimizer and θpriv is the differentially private approximate minimizer then the erm algorithm outputs and incurs expected on the distribution loss equal to the loss where the generalization error term depends 
on the loss function and on the number of samples the differentially private algorithm incurs an additional loss of the privacy risk if the privacy risk is asymptotically no larger than the generalization error we can think of privacy as coming for free since under the assumption of being large enough to make the generalization error small we are also making large enough to make the privacy risk small in the case when is the and the loss function is the squared loss with and the best known generalization error bounds dominate the privacy risk when theorem related work there have been much work on private lasso or more generally private erm algorithms the error bounds mainly depend on the shape of the constraint set and the lipschitz condition of the loss function here we will summarize these related results related to our results we distinguish two settings the constraint set is bounded in the and the the loss function is in the call it the this is directly related to our bounds on lasso and ii the constraint set has bounded norm and the loss function is in the norm the which is related to our bounds using gaussian width the the results in this setting include the first two works make certain assumptions about the instance restricted strong convexity rsc and mutual incoherence under these assumptions they obtain privacy risk guarantees that depend logarithmically in the dimensions and thus allowing the guarantees to be meaningful even when in fact their bound of polylog can be better than our tight bound of polylog however these assumptions on the data are strong and may not hold in practice our guarantees do not require any such data dependent assumptions the result of captures the scenario when the constraint set is the probability simplex and the loss function is generalized linear model but provides worse bound of polylog for the special case of linear loss functions which are interesting primarily in the online prediction setting the techniques of provide bound of polylog the in all the works on private convex optimization that we are aware of either the excess risk guarantees depend polynomially on the dimensionality of the problem or assumes special structure to the loss generalized linear model or linear losses similar dependence is also present in the online version of the problem recently show that in the private erm setting in general this polynomial dependence on is unavoidable in our work we show that one can replace this dependence on with the gaussian width of the constraint set which can be much smaller effect of gaussian width in risk minimization our result on general has an dependence on the gaussian width of this geometric concept has previously appeared in other contexts for example bounds the the excess generalization error by the gaussian width of the constraint set recently show that the gaussian width of constraint set is very closely related to the number of generic linear measurements one needs to perform to recover an underlying model the notion of gaussian width has also been used by in the context of differentially private query release mechanisms but in the very different context of answering multiple linear queries over database background differential privacy the notion of differential privacy definition is by now defacto standard for statistical data privacy one of the reasons why differential privacy has become so popular is because it provides meaningful guarantees even in the presence of arbitrary auxiliary information at semantic level the privacy 
guarantee ensures that an adversary learns almost the same thing about an individual independent of his presence or absence in the data set the parameters quantify the amount of information leakage for reasons beyond the scope of this work and are good choice of parameters here refers to the number of samples in the data set definition randomized algorithm is private if for all neighboring data sets and they differ in one record or equivalently dh and for all events in the output space of we have pr pr here dh refers to the hamming distance for the for any vector is defined as where is the coordinate of the vector continuity norm function is within set norm if the following holds gaussian width of set let ip be gaussian random vector in rp the gaussian def width of set is defined as gc eb sup private convex optimization by algorithm in this section we analyze differentially private variant of the classical algorithm we show that for the setting where the constraint set is polytope with vertices and the loss function is lipschitz the one can obtain an excess privacy risk of roughly log this in particular captures the linear regression setting one such example is the classical lasso algorithm which computes argminθ kxθ in the usual case of kxθ is with respect to we show that one can achieve the nearly optimal privacy risk of the algorithm can be regarded as greedy algorithm which moves towards the optimum solution in the first order approximation see algorithm for the description how fast algorithm converges depends on curvature defined as follows according to we remark that function on has curvature constant bounded by definition curvature constant for define γl as below γl sup remark useful bound can be derived for quadratic loss θat aθ hb θi in this case by γl maxa ka when is centrally symmetric we have the bound γl for lasso define argmin the following theorem bounds the convergence rate of algorithm algorithm algorithm input rp µt choose an arbitrary from for to do compute θet θt θt set θt µt θet θt return θt theorem if we set µt then θt γl while the algorithm does not necessarily provide faster convergence compared to the based method it has two major advantages first on line it reduces the problem to solving minimization of linear function when is defined by small number of vertices when is an ball the minimization can be done by checking θt xi for each vertex of this can be done efficiently secondly each step in takes convex combination of θt and θet which is on the boundary of hence each intermediate solution is always inside sometimes called projection free and the final outcome θt is the convex combination of up to points on the boundary of or vertices of when is polytope such outcome might be desired for example when is polytope as it corresponds to sparse solution due to these reasons algorithm has found many applications in machine learning as we shall see below these properties are also useful for obtaining low risk bounds for their private version private algorithm we now present private version of the algorithm the algorithm accesses the private data only through the loss function in step of the algorithm thus to achieve privacy it suffices to replace this step by private version to do so we apply the exponential mechanism to select an approximate optimizer in the case when the set is polytope it suffices to optimize over the vertices of due to the following basic fact fact let rp be the convex hull of compact set rp for any vector rp arg minhθ vi thus it suffices to run the 
exponential mechanism to select from amongst the vertices of this leads to differentially private algorithm with risk logarithmically dependent on when is polynomial in it leads to an error bound with log dependence we can bound the error in terms of the constant which can be much smaller than the constant in particular as we show in the next section the private algorithm is nearly optimal for the important sparse linear regression problem algorithm polytope differentially private algorithm polytope case input data set dn loss function di with constant for privacy parameters convex set conv with denoting choose an arbitrary from for to do log αs hs θt lap where lap θet arg min αs µt θt µt θet where µt output priv θt theorem privacy guarantee algorithm is private since each data item is assumed to have bounded norm for two neighboring databases and and any we have that hs the proof of privacy then follows from application of the exponential mechanism or its noisy maximum version theorem and the strong composition theorem in theorem we prove the utility guarantee for the private algorithm for the convex polytope case define γl max cl over all the possible data sets in theorem utility guarantee let and be defined as in algorithms algorithm polytope let γl be an upper bound on the curvature constant defined in definition for the loss function that holds for all in algorithm polytope if we set γl then log log θpriv min γl here the expectation is over the randomness of the algorithm the proof of utility uses known bounds on noisy along with error bounds for the exponential mechanism the details can be found in the full version general while variant of this mechanism can be applied to the case when is not polytope its error would depend on the size of cover of the boundary of which can be exponential in leading to an error bound with polynomial dependence on in the full version we analyze another variant of private that uses objective perturbation to ensure privacy this variant is for general convex set and the following result proven in the appendix bounds its excess risk in terms of the gaussian width of for this mechanism we only need to be bounded in diameter but our error now depends on the constant of the loss functions theorem suppose that each loss function is with respect to the norm and that has diameter at most let gc the gaussian width of the convex set rp and let γl be the curvature constant defined in definition for the loss function for all and then there is an private algorithm with excess empirical risk γl gc priv min here the expectation is over the randomness of the algorithm private lasso algorithm we now apply the private algorithm polytope to the important case of the sparse linear regression or lasso problem problem definition given data set xn yn of from the domain rp and the convex set define the mean squared loss hxi θi yi the objective is to compute θpriv to minimize while preserving privacy with respect to any change of individual xi yi pair the setting of the above problem is variant of the least squares problem with regularization which was started by the work of lasso and intensively studied in the past years since the ball is the convex hull of vertices we can apply the private algorithm polytope for the above setting it is easy to check that the constant is bounded by further by applying the bound on quadratic programming remark we have that cl since is the unit ball and hence now applying theorem we have corollary let xn yn of samples from the domain and the 
convex set equal to the the output θpriv of algorithm polytope ensures the following log θpriv min remark compared to the previous work the above upper bound makes no assumption of restricted strong convexity or mutual incoherence which might be too strong for realistic settings also our results significantly improve bounds of from to which considered the case of the set being the probability simplex and the loss being generalized linear model optimality of private lasso in the following we shall show that to ensure privacy the error bound in corollary is nearly optimal in terms of the dominant factor of theorem optimality of private let be the and be the mean squared loss in equation for every sufficiently large for every private algorithm with and there exists data set xn yn of samples from the domain such that min we prove the lower bound by following the fingerprinting codes argument of for lowerbounding the error of private algorithms similar to and we start with the following lemma which is implicit in matrix in theorem is the padded tardos code used in section for any matrix denote by the matrix obtained by removing the row of call column of matrix consensus column if the entries in the column are either all or all the sign of consensus column is simply the consensus value of the column write log and the following theorem follows immediately from the proof of corollary in theorem corollary from restated let be sufficiently large positive integer there exists matrix with the following property for each there are at least consensus columns wi in each in addition for algorithm on input matrix where if with probability at least produces sign vector which agrees with at least columns in wi then is not differentially private with respect to single row change to some other row in write let wp we first form an matrix where the column vectors of are mutually orthogonal vectors this is possible as now we construct databases di for as follows for all the databases they contain the common set of examples zj vector zj with label for where zj yjp is the row vector of in addition each di contains examples xj for xj xjk for then di is defined as follows for the ease of notation in this proof we work with the loss this does not affect the generality of the arguments in any way di xj yj xj the last equality is due the columns of are mutually orthogonal vectors for each to that di consider such that the sign of the coordinates of matches the sign for the consensus columns of plugging in we have the following since the number of consensus columns is at least we now prove the crucial lemma which states that if is such that and di is small then has to agree with the sign of most of the consensus columns of lemma suppose that and di for wi denote by sj the sign of the consensus column then we have wi sign θj sj proof for any denote by the projection of to the coordinate subset consider three subsets where wi sign θj sj wi sign θj sj wi the proof is by contradiction assume that further θi for now we will bound and using the inequality for any vector hence but so that similarly by the assumption of again using we have that now we have hxi θi βi where by we have θi hence we have that di this leads to contradiction hence we must have with theorem and lemma we can now prove theorem proof suppose that is private and for the datasets we constructed above di di min di cw for sufficiently small constant by markov inequality we have with probability at least di di minθ di by we have min di hence if we choose 
constant small enough we have with probability di di consensus columns in however by lemma implies that di agrees with at least by theorem this violates the privacy of hence we have that there exists such that di di min di cw recall that log and wp log hence we have that di di min di the proof is completed by converting the above bound to the normalized version of log references bartlett and mendelson rademacher and gaussian complexities risk bounds and structural results the journal of machine learning research bassily smith and thakurta private empirical risk minimization revisited in focs bhaskar laxman smith and thakurta discovering frequent patterns in sensitive data in kdd new york ny usa bun ullman and vadhan fingerprinting codes and the price of approximate differential privacy in stoc chandrasekaran recht parrilo and willsky the convex geometry of linear inverse problems foundations of computational mathematics chaudhuri and monteleoni logistic regression in nips chaudhuri monteleoni and sarwate differentially private empirical risk minimization jmlr clarkson coresets sparse greedy approximation and the algorithm acm transations on algorithms duchi jordan and wainwright local privacy and statistical minimax rates in focs dwork mcsherry nissim and smith calibrating noise to sensitivity in private data analysis in theory of cryptography conference pages springer dwork nikolov and talwar efficient algorithms for privately releasing marginals via convex relaxations arxiv preprint dwork and roth the algorithmic foundations of differential privacy foundations and trends in theoretical computer science now publishers dwork rothblum and vadhan boosting and differential privacy in focs dwork talwar thakurta and zhang analyze gauss optimal bounds for principal component analysis in stoc frank and wolfe an algorithm for quadratic programming naval research logistics quarterly hazan and kale online learning in icml jaggi revisiting sparse convex optimization in icml jain kothari and thakurta differentially private online learning in colt pages jain and thakurta near dimension independent risk bounds for differentially private learning in international conference on machine learning icml kifer smith and thakurta private convex empirical risk minimization and regression in colt pages mcsherry and talwar mechanism design via differential privacy in focs pages ieee nikolov talwar and zhang the geometry of differential privacy the sparse and approximate cases in stoc srebro and zhang trading accuracy for sparsity in optimization problems with sparsity constraints siam journal on optimization smith and thakurta differentially private feature selection via stability arguments and the robustness of the lasso in colt smith and thakurta follow the perturbed leader is differentially private with optimal regret guarantees manuscript in preparation smith and thakurta nearly optimal algorithms for private online learning in and bandit settings in nips tibshirani regression shrinkage and selection via the lasso journal of the royal statistical society series methodological tibshirani et al the lasso method for variable selection in the cox model statistics in medicine ullman private multiplicative weights beyond linear queries corr 
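To make the structure of the private Frank-Wolfe update analyzed above concrete, the following is a minimal sketch of the polytope variant for the lasso setting, where the constraint set is the unit ℓ1 ball and linear minimization reduces to scoring the 2p vertices. The data, the step-size schedule, and the Laplace noise scale `lam` are illustrative placeholders; the paper calibrates the noise from the privacy parameters, the Lipschitz constant, and the number of iterations, so this should be read as a sketch of the update's structure rather than the exact privacy calibration.

```python
import numpy as np

rng = np.random.default_rng(0)

def lasso_gradient(theta, X, y):
    # gradient of the mean-squared loss (1/n) * sum_i (<x_i, theta> - y_i)^2
    n = X.shape[0]
    return (2.0 / n) * X.T @ (X @ theta - y)

def private_frank_wolfe(X, y, T, lam):
    """Sketch of the polytope (L1-ball) private Frank-Wolfe iteration.

    `lam` is a stand-in Laplace scale; the actual scale is calibrated from
    (epsilon, delta), the Lipschitz constant and T as in the algorithm above.
    """
    p = X.shape[1]
    vertices = np.vstack([np.eye(p), -np.eye(p)])  # vertices of the unit L1 ball: +/- e_1, ..., +/- e_p
    theta = vertices[0].copy()                     # arbitrary starting vertex
    for t in range(1, T + 1):
        g = lasso_gradient(theta, X, y)
        scores = vertices @ g                      # <s, grad L(theta)> for each vertex s
        noisy = scores + rng.laplace(scale=lam, size=scores.shape)
        s = vertices[np.argmin(noisy)]             # noisy-minimum selection over the vertices
        mu = 2.0 / (t + 2)                         # a standard Frank-Wolfe step size, standing in for mu_t
        theta = (1 - mu) * theta + mu * s          # convex combination: projection-free, stays in the ball
    return theta

# toy usage with synthetic data
n, p = 200, 50
X = rng.standard_normal((n, p))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(n)
theta_priv = private_frank_wolfe(X, y, T=50, lam=0.1)
```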
Convergence Analysis of Prediction Markets via Randomized Subspace Descent

Rafael Frongillo, Department of Computer Science, University of Colorado, Boulder
Mark Reid, Research School of Computer Science, The Australian National University, and NICTA

Abstract. Prediction markets are economic mechanisms for aggregating information about future events through sequential interactions with traders. The pricing mechanisms in these markets are known to be related to optimization algorithms in machine learning, and through these connections we have some understanding of how equilibrium market prices relate to the beliefs of the traders in a market. However, little is known about rates and guarantees for the convergence of these sequential mechanisms, and two recent papers cite this as an important open question. In this paper we show how some previously studied prediction market trading models can be understood as a natural generalization of randomized coordinate descent, which we call randomized subspace descent (RSD). We establish convergence rates for RSD and leverage them to prove rates for the two prediction market models above, answering the open questions. Our results extend beyond standard centralized markets to arbitrary trade networks.

Introduction. In recent years there has been an increasing appreciation of the shared mathematical foundations between prediction markets and a variety of techniques in machine learning. Prediction markets consist of agents who trade securities that pay out depending on the outcome of some uncertain future event. As trading takes place, the prices of these securities reflect an aggregation of the beliefs the traders have about the future event. A popular class of mechanisms for updating these prices as trading occurs has been shown to be closely related to techniques from online learning, convex optimization, probabilistic aggregation, and crowdsourcing. Building these connections serves several purposes; however, one important line of research has been to use insights from machine learning to better understand how to interpret prices in a prediction market as aggregations of trader beliefs, and moreover how the market together with the traders can be viewed as something akin to a distributed machine learning algorithm.

The analysis in this paper was motivated in part by two pieces of work that considered the equilibria of prediction markets with specific models of trader behavior: traders as risk minimizers, and traders who maximize expected exponential utility using beliefs from exponential families. In both cases the focus was on understanding the properties of the market at convergence, and questions concerning whether and how convergence happened were left as future work. In one, the authors note that "we have not considered the dynamics by which such an equilibrium would be reached, nor the rate of convergence, yet we think such questions provide fruitful directions for future [work]"; in the other, "one area of future work would be conducting a detailed analysis of this framework using the tools of convex optimisation", and a "particularly interesting topic is to find the conditions under which the market will [...]".

The main contribution of this paper is to answer these questions of convergence. We do so by first proposing a new and very general model of trading networks and dynamics that subsumes the models used in the two works above, and by providing a key structural result for what we call efficient trades in these networks (Theorem). As an aside, this structural result provides an immediate generalization of an existing aggregation result to trade networks of
compatible agents theorem in we argue that efficient trades in our networks model can be viewed as steps of what we call random subspace descent rsd algorithm algorithm this novel generalization of coordinate descent allows an objective to be minimized by taking steps along affinely constrained subspaces and maybe be of independent interest beyond prediction market analysis we provide convergence analysis of rsd under two sets of regularity constraints theorems and show how these can be used to derive slow fast convergence rates in trade networks theorems before introducing our general trading networks and convergence rate results we first introduce the now standard presentation of prediction markets and the recent variant in which all agents determine their trades using risk measures we will then state informal versions of our main results so as to highlight how we address issues of convergence in existing frameworks background and informal results prediction markets are mechanisms for eliciting and aggregating distributed information or beliefs about uncertain future events the set of events or outcomes under consideration in the market will be denoted and may be finite or infinite for example each outcome might represent certain presidential candidate winning an election the location of missing submarine or an unknown label for an item in data set following the goods that are traded in prediction market are securities that pay dollars should the outcome occur we denote the set of distributions over by and note for any that the expected pay off for the securities under is and the set of all expected pay offs is just the convex hull denoted conv simple and commonly studied case is when when there are exactly outcomes and the securities are the securities that pay out should specific outcome occur and nothing otherwise if and for here the securities are just basis vectors for rk and traders in prediction market hold portfolios of securities rk called positions that pay out pk total of ri dollars should outcome occur we denote the set of positions by rk we will assume that always contains position that returns dollar regardless of which outcome occurs meaning for all we therefore interpret as cash within the market in the sense that buying or selling guarantees fixed change in wealth in order to address the questions about convergence in we will consider common form of prediction market that is run through market maker this is an automated agent that is willing to buy or sell securities in return for cash the specific and prediction market mechanism we consider is the market maker here traders interact with the market maker sequentially and the cost for each trade is determined by convex potential function applied to the market maker state specifically the cost for trade dr when the market maker has state is given by cost dr the change in potential value of the market maker position due to the market maker accepting the trade after trade the market maker updates the state to as noted in the next section the usual axiomatic requirements for cost function in specify function that is effectively risk measure commonly studied in mathematical finance see risk measures as in agents in our framework will each quantify their uncertainty in positions using what is known as risk measure this is function that assigns dollar values to positions as example below shows this assumption will also cover the case of agents maximizing exponential utility as considered in it is more common in the prediction 
market literature for to be liability vector tracking what the market maker stands to lose instead of gain here we adopt positive positions to match the convention for risk measures convex monetary risk measure is function satisfying for all monotonicity cash invariance for all convexity λr λρ for all normalization the reasonableness of these properties is usually argued as follows see monotonicity ensures that positions that result in strictly smaller payoffs regardless of the outcome are considered more risky cash invariance captures the idea that if guaranteed payment of is added to the payment on each outcome then the risk will decrease by convexity states that merging positions results in lower risk finally normalization requires that holding no securities should carry no risk this last condition is only for convenience since any risk without this condition can trivially have its argument translated so it holds without affecting the other three properties key result concerning convex risk measures is the following representation theorem cf theorem theorem risk representation functional is convex risk measure if and only if there is closed convex function such that hπ here relint denotes the relative interior of the interior relative to the affine hull of notice that if denotes the convex conjugate supx hy xi then this theorem states that that is and are dual in the same way prices and positions are dual this suggests that the function can be interpreted as penalty function assigning measure of unlikeliness to each expected value of the securities defined above equivalently ep measures the unlikeliness of distribution over the outcomes we can then see that the risk is the greatest expected loss under each distribution taking into account the penalties assigned by example risk measure is the entropic risk relative to reference distribution this is defined on positions by ρβ log exp the cost function ρβ associated with this risk exactly corresponds to the logarithmic market scoring rule lmsr its associated convex function αβ over distributions is the scaled relative entropy αβ kl as discussed in the entropic risk is closely related to exponential utility uβ exp indeed ρβ uβ which is just the negative certainty equivalent of the position the amount of cash an agent with utility uβ and belief would be willing to trade for the uncertain position due to the monotonicity of it follows that trader maximizing expected utility uβ of holding position is equivalent to minimizing the entropic risk ρβ for technical reasons in addition to the standard assumptions for convex risk measures we will also make two weak regularity assumptions these are similar to properties required of cost functions in the prediction market literature cf theorem expressiveness is everywhere differentiable and closure strict risk aversion the convexity inequality is strict unless for some as discussed in expressiveness is related to the dual formulation given above roughly it says that the agent must take into account every possible expected value of the securities when calculating the risk strict risk aversion says that an agent should strictly prefer mixture of positions unless of course the difference is under these assumptions the representation result of theorem and similar result for cost functions theorem coincide and we are able to show that cost functions and risk measures are exactly the same object we write ρc when we think of as risk measure unfolding the definition of cost now using cash invariance we have ρc 
dr cost dr ρc dr cost dr dr dr ρc thus we may view market maker as agent trading dynamics and aggregation as described above we consider traders who approach the market maker sequentially and at random and select the optimal trade based on their current position the market state and the cost function as we just observed we may think of the market maker as agent with ρc let us examine the optimization problem faced by the trader with position when the current market state is this trader will choose portfolio from the market maker so as to minimise her risk arg min dr cost dr arg min dr ρc dr since by the cash invariance of and the definition of cost the objective is dr ρc dr ρc and ρc does not depend on dr thus if we think of ρc as kind of social risk we can define the surplus as simply the net risk taken away by an optimal trade namely we can now state our central question if set of such traders arrive at random and execute optimal or perhaps trades with the market maker will the market state converge to the optimal risk and if so how fast as discussed in the introduction this is precisely the question asked in that we set out to answer to do so we will draw close connection to the literature on distributed optimization algorithms for machine learning specifically if we encode the entire state of our system in the positions rn of the market maker and each of the traders we may view the optimal trade in eq as performing coordinate descent step by optimizing only with respect to coordinates and we build on this connection in section and leverage generalization of coordinate descent methods to show the following in theorem if set of traders is sampled at random to sequentially trade in the market the market state and prices converge to within of the optimal total risk in rounds in fact under mild smoothness assumptions on the cost potential function we can improve this rate to log we can also relax the optimality of the trader behavior as long as traders find trade dr which extracts at least constant fraction of the surplus the rate remains intact with convergence rates in hand the next natural question might be to what does the market converge abernethy et al show that when traders minimize expected exponential utility and have exponential family beliefs the market equilibrium price can be thought of as weighted average of the parameters of the traders with the weights being measure of their risk tolerance even though our setting is far more general than exponential utility and exponential families the framework we develop can also be used to show that their results can be extended to interactions between traders who have what we call compatible risks and beliefs specifically for any trader possessing risk with dual we can think of that trader belief as the least surprising distribution according to this view induces family of distributions which happen to be generalized exponential families that are parameterized by the initial positions of the traders furthermore the risk tolerance is given by how sensitive this belief is to small changes of an agent position the results of are then special case of our theorem for agents with being entropic risk cf example if each trader has risk tolerance bi and belief parameterized by θi and the initial market state is then the equilibrium state of the market to which the market converges is given bi θi by bi as the focus of this paper is on the convergence the details for this result are given in appendix the main insight that drives the above analysis 
of the interaction between trader and market maker is that each trade minimizes global objective for the market that is the infimal convolution of the traders and market maker risks in fact this observation naturally generalizes to trades between three or more agents and the same convergence analysis applies in other words our analysis also holds when bilateral trade with fixed market maker is replaced by multilateral trade among arbitrarily overlapping subsets of agents viewed as graph with agents as nodes the standard prediction market framework is represented by the star graph where the central market market interacts with traders sequentially and individually more generally we have what we call trading network in which the structure of trades can form arbitrary connected graphs or even hypergraphs an obvious choice is the complete graph which can model decentralized market and in fact we can even compare the convergence rate of our dynamics between the centralized and decentralized models see appendix and the discussion in general trading dynamics the previous section described the two agent case of what is more generally known as the optimal risk allocation problem where two or more agents express their preferences for positions via risk measures this is formalized by considering agents with risk measures ρi for who are asked to split position in to positions ri satisfying ri so as to minimise the total risk ρi ri they note that the value of the total risk is given by the infimal convolution ρi of the individual agent risks that is ρi inf ρi ri ri ri key property of the infimal convolution which will underly much of our analysis is that its convex conjugate is the sum of the conjugates of its constituent functions see for proof ρi one can think of ρi as the market risk which captures the risk of the entire market as if it were single agent as function of the net position ri of its constituents by definition eq says that the market is trying to reallocate the risk so as to minimize this net risk this interpretation is confirmed by eq when we interpret the duals as penalty functions as above the penalty of is the sum of the penalties of the market participants as alluded to above we allow our agents to interact round by round by conducting trades which are simply the exchange of securities since by assumption our position space is closed under linear combinations trade between two agents is simply position which is added to one agent and subtracted from another generalizing from this two agent interaction trade among set of agents is just collection of trade vectors one for each agent which sum to formally let be subsetp of agents trade on is then vector of positions dr rn matrix in such that dri and dri for all this last condition specifies that agents not in do not change their position key quantity in our analysis is measure of how much the total risk of collection of traders drops due to trading given some subset of ptraders the is function φs defined by φs ρi ri ρi ri which measures the maximum achievable drop in risk since ρi is an infimum in particular is the surplus function the trades that achieve this optimal drop in risk are called efficient given current state rn trade dr rn on is efficient if φs dr our following key result shows that efficient trades have remarkable structure once the state and subset is specified there is unique efficient trade up to cash transfers in other words the surplus is removed from the position vectors and then redistributed as cash to the traders 
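Before stating the formal result, the following is a minimal numerical sketch of an efficient bilateral trade and its surplus, assuming both agents carry entropic risk with a uniform reference distribution and using scipy's general-purpose optimizer as a stand-in for the traders' closed-form best responses; the positions, risk tolerances, and outcome count are arbitrary illustrative values.

```python
import numpy as np
from scipy.optimize import minimize

def entropic_risk(r, b, p):
    # rho_b(r) = b * log E_p[ exp(-r / b) ]  (the entropic risk of the example above)
    return b * np.log(np.dot(p, np.exp(-r / b)))

def price(r, b, p):
    # -grad rho_b(r): a probability vector over outcomes, the agent's fair prices at position r
    w = p * np.exp(-r / b)
    return w / w.sum()

K = 3                                    # number of outcomes / securities
p = np.full(K, 1.0 / K)                  # reference distribution (assumed uniform for this sketch)
r1 = np.array([2.0, -1.0, 0.0])          # agent 1's current position
r2 = np.array([-1.0, 3.0, 1.0])          # agent 2's current position
b1, b2 = 1.0, 2.0                        # risk tolerances

def total_risk(dr):
    # agent 1 receives dr and agent 2 receives -dr, so the net position r1 + r2 is unchanged
    return entropic_risk(r1 + dr, b1, p) + entropic_risk(r2 - dr, b2, p)

res = minimize(total_risk, x0=np.zeros(K))        # an (approximately) efficient trade on S = {1, 2}
surplus = total_risk(np.zeros(K)) - res.fun       # the risk removed by trading, i.e. the surplus

print("surplus:", surplus)
print("agent 1 prices:", price(r1 + res.x, b1, p))   # the two price vectors approximately agree,
print("agent 2 prices:", price(r2 - res.x, b2, p))   # as required by the theorem stated next
```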
the choice of trade is merely in how this redistribution takes place the fact that the derivatives match has strong intuition from prediction markets agents must agree on the the proof is in appendix theorem let rn and be given the surplus is always finite φs ii the set of efficient trades on is nonempty iii efficient trades are unique up to cash transfers given efficient trades dr dr on we have dr zn for some rn with zi iv traders agree on prices trade dr on is efficient if and only if for all ri dri rj drj there is unique efficient price if dr ispan efficient trade on for all we have ri dri where arg min αi ri as intuition for the term price consider that the highest agent would be willing to pay for an infinitesimal quantity of position dri is dri ri and likewise the lowest to sell thus the entries of ri act as the fair prices for their corresponding basis the above properties of efficient trades drive the remainder of our convergence analysis of network dynamics it also allows us to write simple closed form for the market price when traders share common risk profile theorem details are in appendix beyond our current focus on rates theorem has implications for variety of other economic properties of trade networks for example in appendix we show that efficient trades correspond to fixed points for more general dynamics market clearing equilibria and equilibria of natural bargaining games among the traders recall that in the prediction market framework of each round has single trader say interacting with the market maker who we will assume has index in the notation just defined this corresponds to choosing we now wish to consider richer dynamics where groups of two or more agents trade efficiently each round to this end will we call collection sj of groups of traders trading network and assume there is some fixed distribution over with full support trade dynamic over is process that begins at with some initial positions rn for the traders and at each round draws random group of traders according to selects some efficient trade drt on then updates the trader positions using rt drt for the purposes of proving the convergence of trade dynamics crucial property is whether all traders can directly or indirectly affect the others to capture this we will say trade network is connected if the hypergraph on with edges given by is connected information can propagate throughout the entire network dynamics over classical prediction markets are always connected since any pair of groups from its network will always contain the market maker convergence analysis of randomized subspace descent before briefly reviewing the literature on coordinate descent let us see why this might be useful way to think of our dynamics recall that we have set of subsets of agents and that in each step an efficient trade dr is chosen which only modifies the positions of agents in the sampled thinking of rn as vector of dimension vector recall rk changing rt to rt dr thus only modifies blocks of entries moreover efficiency ensures that dr minimizes the sum of the risks of agents in hence ignoring for now the constraint that the sum of the positions must remain constant the trade dynamic seems to be performing kind of block coordinate descent of the surplus function randomized subspace descent several randomized coordinate descent methods have appeared in the literature recently with increasing levels of sophistication while earlier methods focused on updates which only modified disjoint blocks of coordinates more recent 
methods allow for more general configurations such as overlapping blocks in fact these last three methods are closest to what we study here the authors consider an objective which decomposes as the sum of convex functions on each coordinate and study coordinate updates which follow graph structure all under the constraint that coordinates sum to despite the similarity of these methods to our trade dynamics we require even more general updates as we allow coordinate to correspond to arbitrary subsets si instead we establish unification of these methods which we call randomized subspace descent rsd listed in algorithm rather than blocks of coordinates or specific linear constraints rsd abstracts away these constructs by simply specifying coordinate subspaces in which the optimization is to be performed specifically the algorithm takes list of projection matrices πi which define the subspaces and at each step selects πi at random and tries to optimize the objective under the constraint that it may only move within the image space of πi that is if the current point is xt then xt im πi before stating our convergence results for algorithm we will need notion of smoothness relative to our subspaces specifically we say is li if for all there are constants li such that for all im πi yi li finally let min im πi be the global minimizer of subject to the constraints from the πi then we have the following result for constant which increases in algorithm randomized subspace descent input smooth convex function rn initial point rn matrices πi smoothness parameters li distribution for iteration in do sample from xt πi xt end the distance from the point to furthest minimizer of the lipschitz constants of the πi and the connectivity of the hypergraph induced by the projections theorem let πi li and be as in algorithm with the condition that is li for all then xt min the proof is in appendix additionally when is strongly convex meaning it has uniform local quadratic lower bound rsd enjoys faster linear convergence formally this condition requires to be convex for some constant that is for all dom we require ky the statement and details of this stronger result is given in appendix importantly for our setting these results only track the progress per iteration thus they apply to more sophisticated update steps than simple gradient step as long as they improve the objective by at least as much for example if in each step the algorithm computed the exact minimizer arg πi xt both theorems would still hold convergence rates for trade dynamics to apply theorem to the convergence of trading dynamics we let and rn rn rn be the joint position of all agents for each subset of agents we havepa subspace of rn consisting of all possible trades on namely dr rn dri for dri with corresponding projection matrix πs for the special case of prediction markets with centralized market maker we have subspaces and projects onto dr rn dri drj for the intuition of coordinate descent is clear now the subset of agents seek to minimize the total surplus within the subspace of trades on and thus the coordinate descent steps of algorithm will correspond to roughly efficient trades we now apply theorem to show that trade dynamics achieve surplus in time note that we will have to assume the risk measure ρi of agent is li for some li this is very loose restriction as our risk measures are all differentiable by the expressiveness condition theorem let ρi be an li risk measure for all then for any connected trade dynamic we have rt proof taking 
ls li one can check that is ls for all by eq since algorithm has no state aside from xt and the proof of theorem depends only the drop in per step any algorithm selecting the sets with the same distribution and satisfying xt πi xt will yield the same convergence rate as trade dynamics satisfy xt πi this property trivially holds and so theorem applies if we assume slightly more that our risk measures have local quadratic lower bounds then we can obtain linear convergence note that this is also relatively weak assumption and holds whenever the risk measure has hessian with only one zero eigenvalue for at each point this is satisfied for example by all the variants of entropic risk we discuss in the paper the proof is in appendix theorem suppose for each we have continuous function µi such that for all risk ρi is µi convex with respect to in neighborhood of in other words eq holds for ρi µi and all in neighborhood of such that then for all connected trade dynamics rt graph kn pn cn bk nπ table algebraic connectivities for common graphs figure average in bold of market simulations for the complete and star graphs the empirical gap in iteration complexity is just under cf fig amazingly the convergence rates in theorem and theorem hold for all connected trade dynamics the constant hidden in the does depend on the structure of the network but can be explicitly determined in terms its algebraic connectivity this is discussed further in appendix the intuition behind these convergence rates given here is that agents in whichever group is chosen always trade to fully minimize their surplus because the proofs in appendix of these methods merely track the reduction in surplus per trading round the bounds apply as long as the update is at least as good as gradient step in fact we can say even more if only an fraction of the surplus is taken at each round the rates are still and respectively this suggests that our convergence results are robust with respect to the model of rationality one employs if agents have bounded rationality and can only compute positions which approximately minimize their risk the rates remain intact up to constant factors as long as the inefficiency is bounded conclusions future work using the tools of convex analysis to analyse the behavior of markets allows us to make precise quantitative statements about their global behavior in this paper we have seen that with appropriate assumptions on trader behaviour we can determine the rate at which the market will converge to equilibrium prices thereby closing some open questions raised in and in addition our newly proposed trading networks model allow us to consider variety of prediction market structures as discussed in the usual prediction market setting is centralized and corresponds to star graph with the market maker at the center decentralized market where any trader can trade with any other corresponds to complete graph over the traders we can also model more exotic networks such as two or more market prediction markets with risk minimizing arbitrageur or networks where agents only trade with limited number of neighbours furthermore because these arrangements are all instances of trade networks we can immediately compare the convergence rates across various constraints on how traders may interact for example in appendix we show that market that trades through centralized market maker incurs an quantifiable efficiency overhead convergence takes twice as long see figure more generally we show that the rates scale as allowing us 
to make similar comparisons between arbitrary networks see table this raises an interesting question for future work given some constraints such as bound on how many traders single agent can trade with the total number of edges etc which network optimizes the convergence rate of the market these new models and the analysis of their convergence may provide new principles for building and analyzing distributed systems of heterogeneous and learning agents acknowledgments we would like to thank matus telgarsky for his generous help as well as the lively discussions with and helpful comments of lahaie miro jenn wortman vaughan yiling chen david parkes and nageeb ali mdr is supported by an arc discovery early career research award part of this work was developed while he was visiting microsoft research references jacob abernethy yiling chen and jennifer wortman vaughan efficient market making via convex optimization and connection to online learning acm transactions on economics and computation jacob abernethy sindhu kutty lahaie and rahul sami information aggregation in exponential family markets in proceedings of the fifteenth acm conference on economics and computation pages acm jacob abernethy and rafael frongillo collaborative mechanism for crowdsourcing prediction problems in advances in neural information processing systems pages aharon and marc teboulle an concept of convex risk measures the optimized certainty equivalent mathematical finance stephen boyd and lieven vandenberghe convex optimization cambridge university press christian burgert and ludger on the optimal risk allocation problem statistics decisions yiling chen and jennifer wortman vaughan new understanding of prediction markets via learning in proceedings of the acm conference on electronic commerce pages acm nair maria maia de abreu old and new results on algebraic connectivity of graphs linear algebra and its applications hans and alexander schied stochastic finance an introduction in discrete time volume of de gruyter studies in mathematics walter de gruyter berlin edition rafael frongillo della penna and mark reid interpreting prediction markets stochastic approach in proceedings of neural information processing systems and dawid game theory maximum entropy minimum discrepancy and robust bayesian decision theory the annals of statistics jb and grundlehren der mathematischen wissenschaften convex analysis and minimization algorithms ii jinli hu and amos storkey trading prediction markets with connections to machine learning in proceedings of the international conference on machine learning icml jono millin krzysztof geras and amos storkey isoelastic agents and wealth updates in machine learning markets in proceedings of the international conference on machine learning pages bojan mohar the laplacian spectrum of graphs in graph theory combinatorics and applications necoara nesterov and glineur random coordinate descent method on optimization problems with linear constraints technical report ion necoara random coordinate descent algorithms for convex optimization over networks automatic control ieee transactions on yurii nesterov efficiency of coordinate descent methods on optimization problems siam journal on optimization mindika premachandra and mark reid aggregating predictions via sequential in asian conference on machine learning pages sashank reddi ahmed hefny carlton downey avinava dubey and suvrit sra randomizedcoordinate descent methods with linear constraints arxiv preprint mark reid rafael frongillo robert 
williamson and nishant mehta generalized mixability via entropic duality in proc of conference on learning theory colt peter and martin iteration complexity of randomized descent methods for minimizing composite function mathematical programming rockafellar convex analysis princeton university press amos storkey machine learning markets in international conference on artificial intelligence and statistics pages 
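As a companion to the randomized subspace descent algorithm analyzed above, the following is a minimal sketch specialized to trade-network subspaces: coordinates within a sampled group may move subject to summing to zero, while coordinates outside the group are frozen. The update shown is the basic gradient form of the step (the analysis above also covers exact minimization within the sampled subspace); the quadratic objective, group structure, starting point, and smoothness constant are illustrative stand-ins, not the quantities used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def trade_projector(n, group):
    # Orthogonal projector onto "trades on S": zero outside the group, zero-sum within it.
    P = np.zeros((n, n))
    k = len(group)
    for i in group:
        for j in group:
            P[i, j] = (1.0 if i == j else 0.0) - 1.0 / k
    return P

def rsd(grad_f, x0, projectors, lipschitz, n_rounds):
    # Randomized subspace descent: sample a projector, step along -grad restricted to its image.
    x = x0.copy()
    for _ in range(n_rounds):
        i = rng.integers(len(projectors))
        x = x - (1.0 / lipschitz[i]) * projectors[i] @ grad_f(x)
    return x

# Toy usage: scalar positions for 6 agents and a connected network of three overlapping groups.
n = 6
A = np.diag(np.arange(1.0, n + 1))       # f(x) = 0.5 * x' A x, a smooth strongly convex surrogate objective
grad_f = lambda x: A @ x
groups = [[0, 1, 2], [2, 3, 4], [4, 5, 0]]
projectors = [trade_projector(n, g) for g in groups]
L = float(np.max(np.diag(A)))            # a valid smoothness constant for every subspace
x0 = np.array([3.0, -1.0, 2.0, 0.5, -2.0, 1.0])
x = rsd(grad_f, x0, projectors, [L] * len(groups), n_rounds=500)
print("net position is conserved:", np.isclose(x.sum(), x0.sum()))
```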
The Poisson Gamma Belief Network

Mingyuan Zhou, McCombs School of Business, The University of Texas at Austin, Austin, TX, USA
Yulai Cong, National Laboratory of RSP, Xidian University, Xi'an, Shaanxi, China
Bo Chen, National Laboratory of RSP, Xidian University, Xi'an, Shaanxi, China

Abstract. To infer a multilayer representation of count vectors, we propose the Poisson gamma belief network (PGBN), which factorizes each of its layers into the product of a connection weight matrix and the nonnegative real hidden units of the next layer. The PGBN's hidden layers are jointly trained with an upward-downward Gibbs sampler, each iteration of which upward samples Dirichlet distributed connection weight vectors starting from the first layer (the bottom data layer), and then downward samples gamma distributed hidden units starting from the top hidden layer. The gamma-negative binomial process combined with a layer-wise training strategy allows the PGBN to infer the width of each layer given a fixed budget on the width of the first layer. With a single hidden layer, the PGBN reduces to Poisson factor analysis. Example results on text analysis illustrate interesting relationships between the width of the first layer and the inferred network structure, and demonstrate that the PGBN, whose hidden units are imposed with correlated gamma priors, can add more layers to increase its performance gains over Poisson factor analysis, given the same limit on the width of the first layer.

Introduction. There has been significant recent interest in deep learning. Despite its tremendous success in supervised learning, inferring a multilayer data representation in an unsupervised manner remains a challenging problem. The sigmoid belief network (SBN), which connects the binary units of adjacent layers via sigmoid functions, infers a deep representation of multivariate binary vectors. The deep belief network (DBN) is an SBN whose top hidden layer is replaced by a restricted Boltzmann machine (RBM), which is undirected. The deep Boltzmann machine (DBM) is an undirected deep network that connects the binary units of adjacent layers using RBMs. All these deep networks are designed to model binary observations; although one may modify the bottom layer to model Gaussian and multinomial observations, the hidden units of these networks are still typically restricted to be binary. One may further consider the exponential family harmoniums to construct more general networks with non-binary hidden units, but often at the expense of noticeably increased complexity in training and data fitting.

Moving beyond conventional deep networks that use binary hidden units, we construct a deep directed network with gamma distributed nonnegative real hidden units to unsupervisedly infer a multilayer representation of multivariate count vectors, with a simple but powerful mechanism to capture the correlations among the features across all layers and to handle highly overdispersed counts. The proposed model is called the Poisson gamma belief network (PGBN), which factorizes the observed count vectors under the Poisson likelihood into the product of a factor loading matrix and the gamma distributed hidden units (factor scores) of layer one, and further factorizes the shape parameters of the gamma hidden units of each layer into the product of a connection weight matrix and the gamma hidden units of the next layer. Distinct from previous deep networks that often utilize binary units for tractable inference and require tuning both the width (number of hidden units) of each layer and the network depth (number of layers), the PGBN employs nonnegative real hidden units and automatically infers the
widths of subsequent layers given fixed budget on the width of its first layer note that the budget could be infinite and hence the whole network can grow without bound as more data are being observed when the budget is finite and hence the ultimate capacity of the network is limited we find that the pgbn equipped with narrower first layer could increase its depth to match or even outperform shallower network with substantially wider first layer the gamma distribution density function has the highly desired strong for deep learning but the existence of neither conjugate prior nor maximum likelihood estimate for its shape parameter makes deep network with gamma hidden units appear unattractive despite seemingly difficult we discover that by generalizing the data augmentation and marginalization techniques for discrete data one may propagate latent counts one layer at time from the bottom data layer to the top hidden layer with which one may derive an efficient gibbs sampler that one layer at time in each iteration upward samples dirichlet distributed connection weight vectors and then downward samples gamma distributed hidden units in addition to constructing new deep network that well fits multivariate count data and developing an efficient gibbs sampler other contributions of the paper include combining the binomial process with training strategy to automatically infer the network structure revealing the relationship between the upper bound imposed on the width of the first layer and the inferred widths of subsequent layers revealing the relationship between the network depth and the model ability to model overdispersed counts and generating multivariate random count vector whose distribution is governed by the pgbn by propagating the gamma hidden units of the top hidden layer back to the bottom data layer useful count distributions and their relationships let the chinese restaurant table crt pn distribution crt represent the distribution of random count generated as bi bi bernoulli its probability mass function pmf can be expressed as where and are unsigned stirling numbers of the first kind let log denote the logarithmic pu distribution with pmf ln where let nb denote the negative binomial nb distribution with pmf where the nb distribution nb can be generated as gamma mixed poisson distribution as pois gam where is the gamma scale parameter as shown in the joint distribution of and given and in crt nb where pl and is the same as that in ut ut log pois ln which is called the bivariate distribution with pmf the poisson gamma belief network assuming the observations are multivariate count vectors xj the generative model of the poisson gamma belief network pgbn with hidden layers from top to bottom is expressed as gam cj gam cj gam pj pj xj pois the pgbn factorizes the count observation xj into the product of the factor loading and hidden units rk of layer one under the poisson likelihood where and for factorizes the shape parameters of the gamma distributed hidden units rk of layer into the product of the connection weight matrix and the hidden units of layer the top layer hidden units share the same vector rkt as their gamma shape parameters and the pj are probability parameters and are gamma scale parameters with cj pj pj for scale identifiabilty and ease of inference each column of is restricted to have unit norm to complete the hierarchical model for we let φk dir rk gam and impose gam and gam and for we let pj beta cj gam xj we expect the correlations between the rows features of to 
be captured by the columns of and the correlations between the rows latent features of to be captured by the columns of even if all for are identity matrices indicating no correlations between latent features our analysis will show that deep structure with could still benefit data fitting by better modeling the variability of the latent features sigmoid and deep belief networks under the hierarchical model in given the connection weight matrices the joint distribution of the count observations and gamma hidden units of the pgbn can be expressed similar to those of the sigmoid and deep belief networks as xj xj θj θj with φv representing the vth row for the gamma hidden units θvj we have θvj φv φv φv θj θvj θj θvj which are highly nonlinear functions that are strongly desired in deep learning by contrast with the belief network would sigmoid function and bias terms bv connect the binary hidden units θvj of layer for deep belief networks to the product of the connection weights and binary hidden units of the next layer with θj bv θj θvj comparing with clearly shows the differences between the gamma nonnegative hidden units and the sigmoid link based binary hidden units note that the rectified linear units have emerged as powerful alternatives of sigmoid units to introduce nonlinearity it would be interesting to use the gamma units to introduce nonlinearity in the positive region of the rectified linear units deep poisson factor analysis with the pgbn specified by reduces to poisson factor analysis pfa using the truncated binomial process which is also related to latent dirichlet allocation if the dirichlet priors are imposed on both φk and with the pgbn is related to the gamma markov chain hinted by corollary of and realized in the deep exponential family of and the deep pfa of different from the pgbn in it is the gamma scale but not shape parameters that are chained and factorized in it is the correlations between binary topic usage indicators but not the full connection weights that are captured and neither nor provide principled way to learn the network structure below we break the pgbn of layers into related submodels that are solved with the same subroutine the propagation of latent counts and model properties lemma the pgbn with pj and pj ln pj ln pj cj for one may connect the observed if or some latent if counts xj to the product at layer under the poisson likelihood as xj pois ln pj proof by definition is true for layer suppose that is true for layer then we can augment each count xvj into the summation of kt latent counts that are smaller or equal as pkt xvjk xvjk pois θkj ln pj xvj where with mkj xvjk representing the num ber of times that factor kt of layer appears in observation and mj φvk we can marginalize out as in leading to since pois ln pj mj further marginalizing out the gamma distributed from the above poisson likelihood leads to mj nb pj the kth element of mj mkj can be augmented under its compound poisson representation as px kj log pj xkj pois ln pj thus if is true for layer then it is also true for layer corollary propagate the latent counts upward using lemma of on and theorem of on we can propagate the latent counts xvj of layer upward to layer as φvk θk mult xvjkt xvj vj φvk θkj φvk θkj xkj mkj φk crt mkj φk as and xkj is in the same order as ln mkj the total count of layer expressed as would often be much smaller than that of layer expressed as thus the pgbn may use as simple criterion to decide whether to add more layers modeling overdispersed counts in comparison 
to shallow model with that assumes the hidden units of layer one to be independent in the prior the multilayer deep model with captures the correlations between them note that for the extreme case that ikt for are all identity matrices which indicates that there are no correlations between the features of left to be captured the deep structure could still provide benefits as it helps model latent counts mj that may be highly overdispersed for example supposing for all then from and we have mkj nb θkj pj θkj gam θkj θkj gam rk for simplicity let us further assume cj for all using the laws of total expectation and total variance we have θkj rk rk and var θkj rk rk and hence mkj rk rk pj pj var mkj rk rk pj pj pj in comparison to pfa with mkj rk nb rk pj with ratio of pj the pgbn with hidden layers which mixes the shape of mkj nb θkj pj with chain of gamma random variables increases the ratio of the latent count mkj given rk by factor of pj and hence could better model highly overdispersed counts gibbs sampling with lemma and corollary and the width of the first layer being bounded by max we develop an gibbs sampler for the pgbn each iteration of which proceeds as follows sample xvjk we can sample xvjk for all layers using but for the first hidden layer we may treat each observed count xvj as sequence of word tokens at the vth term in vocabulary of size in the jth document and assign the words vji one after another to the latent factors topics with both the topics and topic weights marginalized out as zji ji φk max where zji is the topic index for vji and xvjk vji zji counts the number of times that term appears in document we use the symbol to represent summing over the correspondp ing index xvjk and use to denote the count calculated without considering word in document the collapsed gibbs sampling update equation shown above is related to the one developed in for latent dirichlet allocation and the one developed in for pfa using the binomial process when we would replace the terms φk with rk for pfa built on the binomial process or with απk for the hierarchical dirichlet process latent dirichlet allocation see and for details and add an additional term to account for the possibility of creating an additional topic for simplicity in this paper we truncate the nonparametric bayesian model with max factors and let rk gam max if sample φk given these latent counts we sample the φk as φk dir sample xvj we sample xj using replacing with rkt sample using and the conjugacy we sample as gamma mj cj ln pj sample both and are sampled using related equations in we sample as rv gam ln pj pkt sample cj with θkj for and we sample pj and cj as pj beta cj gamma and calculate cj and pj with learning the network structure with training as jointly training all layers together is often difficult existing deep networks are typically trained using greedy unsupervised training algorithm such as the one proposed in to train the deep belief networks the effectiveness of this training strategy is further analyzed in by contrast the pgbn has simple gibbs sampler to jointly train all its hidden layers as described in section and hence does not require greedy training yet the same as commonly used deep learning algorithms it still needs to specify the number of layers and the width of each layer in this paper we adopt the idea of training for the pgbn not because of the lack of an effective algorithm but for the purpose of learning the width of each hidden layer in greedy manner given fixed budget on the width of the 
first layer the proposed training strategy is summarized in algorithm with pgbn of layers that has already been trained the key idea is to use truncated binomial process to model the latent count matrix for the newly added top layer as mkj nb rk pj rk algorithm the pgbn gibbs sampler that uses training strategy to train set of networks each of which adds an additional hidden layer on top of the previously inferred network retrains all its layers jointly and prunes inactive factors from the last layer inputs observed counts xvj upper bound of the width of the first layer max upper bound of the number of layers tmax and outputs total of tmax jointly trained pgbns with depths and tmax for tmax do jointly train all the layers of the network set kt the inferred width of layer as kt max the upper bound of layer width for iter bt ct do gibbs sampling sample zji using collapsed inference calculate xvjk sample xvj for do sample xvjk sample φk sample xvj end for sample pj and calculate cj sample cj and calculate pj for for do sample if sample end for if iter bt then prune layer inactive factors φk let kt and update end if end for output the posterior means according to the last mcmc sample of all remaining factors φk as kt the inferred network of layers and rk as the gamma shape parameters of layer hidden units end for gam max and rely on that stochastic process shrinkage mechanism to prune inactive factors connection weight vectors of layer and hence the inferred kt would be smaller than kt max if kt max is sufficiently large the newly added layer and the layers below it would be jointly trained but with the structure below the newly added layer kept unchanged note that when the pgbn would infer the number of active factors if max is set large enough otherwise it would still assign the factors with different weights rk but may not be able to prune any of them experimental results we apply the pgbns for topic modeling of text corpora each document of which is represented as count vector note that the pgbn with single hidden layer is identical to the truncated binomial process pfa of which is nonparametric bayesian algorithm that performs similarly to the hierarchical dirichlet process latent dirichlet allocation for text analysis and is considered as strong baseline that outperforms large number of topic modeling algorithms thus we will focus on making comparison to the pgbn with single layer with its layer width set to be large to approximate the performance of the binomial process pfa we evaluate the pgbns performance by examining both how well they unsupervisedly extract features for document classification and how well they predict heldout word tokens matlab code will be available in http we use algorithm to learn in manner from the training data the weight matrices tmax and the hidden units gamma shape parameters to add layer to previously trained network with layers we use bt iterations to jointly train and together with prune the inactive factors of layer and continue the joint training with another ct iterations we set the as and given the trained network we apply the gibbs sampler to collect mcmc samples after burnins to estimate the posterior mean of the feature usage proportion vector at the first hidden layer for every document in both the training and testing sets feature learning for binary classification we consider the newsgroups dataset http that consists of documents from different news groups with vocabulary of size it is partitioned into training set of documents and testing set 
of ones we first consider two binary classification tasks that distinguish between the and and between the and news groups for each binary classification task we remove standard list of stop words and only consider the terms that appear at least five times and report the classification accuracies based on independent random trials with the upper bound of the first layer vs number of layers number of layers vs classification accuracy vs classification accuracy classification accuracy classification accuracy vs number of layers number of layers figure classification accuracy as function of the network depth for two binary classification tasks with for all layers the boxplots of the accuracies of independent runs with max the average accuracies of these runs for various max and note that max is large enough to cover all active topics inferred to be around for both binary classification tasks whereas all the topics would be used if max or classification accuracy classification accuracy number of layers figure classification accuracy of the pgbns for classification as function of the depth with various max and as function of max with various depths with for all layers the widths of hidden layers are automatically inferred with max or note that max is large enough to cover all active topics whereas all the topics would be used if max or width set as max and bt ct and for all we use algorithm to train network with layers denote as the estimated dimensional feature vector for document where max is the inferred number of active factors of the first layer that is bounded by the truncation level max we use the regularized logistic regression provided by the liblinear package to train linear classifier on in the training set and use it to classify in the test set where the regularization parameter is on the training set from as shown in fig modifying the pgbn from shallow network to multilayer deep one clearly improves the qualities of the unsupervisedly extracted feature vectors in random trial with max we infer network structure of for the first binary classification task and for the second one figs also show that increasing the network depth in general improves the performance but the width clearly plays an important role in controlling the ultimate network capacity this insight is further illustrated below feature learning for classification we test the pgbns for classification on after removing standard list of stopwords and the terms that appear less than five times we obtain vocabulary with we set ct and for all if max we set bt for all otherwise we set and bt for we use all training documents to infer set of networks with tmax and max and mimic the same testing procedure used for binary classification to extract feature vectors with which each testing document is classified to one of the news groups using the regularized logistic regression fig shows clear trend of improvement in classification accuracy by increasing the network depth with limited width or by increasing the upper bound of the width of the first layer with the depth fixed for example pgbn with max could add one or more layers to slightly outperform pgbn with max and pgbn with max could add layers to clearly outperform pgbn with max as large as we also note that each iteration of jointly training multiple layers costs moderately more than that of training single layer with max training iteration on single core of an intel xeon ghz cpu on average takes about seconds for the pgbn with and layers respectively examining the inferred 
network structure also reveals interesting details for example in random trial with algorithm the inferred network widths are perplexity perplexity figure perplexity the lower the better for the corpus using the most frequent terms as function of the upper bound of the first layer width max and network depth with of the word tokens in each document used for training and for all for visualization each curve in is reproduced by subtracting its values from the average perplexity of the network and for max and respectively this indicates that for network with an insufficient budget on its width as the network depth increases its inferred layer widths decay more slowly than network with sufficient or surplus budget on its width and network with surplus budget on its width may only need relatively small widths for its higher hidden layers in the appendix we provide comparisons of accuracies between the pgbn and other related algorithms including these of and on similar document classification tasks perplexities for holdout words in addition to examining the performance of the pgbn for unsupervised feature learning we also consider more direct approach that we randomly choose of the word tokens in each document as training and use the remaining ones to calculate perplexity we consider the http corpus limiting the vocabulary to the most frequent terms we set and ct for all set and bt for and consider five random trials among the bt ct gibbs sampling iterations used to train layer we collect one sample per five iterations during the last iterations for each of which we draw the topics φk and topics weights to compute the perplexity using equation of as shown in fig we observe clear trend of improvement by increasing both max and qualitative analysis and document simulation in addition to these quantitative experiments we have also examined the topics learned at each layer we use φk to project topic of layer as word probability vector generally speaking the topics at lower layers are more specific whereas those at higher layers are more general examining the results used to produce fig with max and the pgbn infers network with the ranks by popularity and top five words of three example topics for layer are network units input learning training data model learning set image and network learning model input neural while these of five example topics of layer are likelihood em mixture parameters data bayesian posterior prior log evidence variables belief networks conditional inference boltzmann binary machine energy hinton and speech speaker acoustic vowel we have also tried drawing gam and downward passing it through the network to generate synthetic documents which are found to be quite interpretable and reflect various general aspects of the corpus used to train the network we provide in the appendix number of synthetic documents generated from pgbn trained on the corpus whose inferred structure is conclusions the poisson gamma belief network is proposed to extract multilayer deep representation for highdimensional count vectors with an efficient gibbs sampler to jointly train all its layers and training strategy to automatically infer the network structure example results clearly demonstrate the advantages of deep topic models for big data problems in practice one may rarely has sufficient budget to allow the width to grow without bound thus it is natural to consider belief network that can use deep representation to not only enhance its representation power but also better allocate its 
computational resource our algorithm achieves good compromise between the widths of hidden layers and the depth of the network acknowledgements zhou thanks tacc for computational support chen thanks the support of the thousand young talent program of china and references bengio and lecun scaling learning algorithms towards ai in bottou olivier chapelle decoste and weston editors large scale kernel machines mit press ranzato huang boureau and lecun unsupervised learning of invariant feature hierarchies with applications to object recognition in cvpr bengio goodfellow and courville deep learning book in preparation for mit press neal connectionist learning of belief networks artificial intelligence pages saul jaakkola and jordan mean field theory for sigmoid belief networks journal of artificial intelligence research pages hinton osindero and teh fast learning algorithm for deep belief nets neural computation pages hinton training products of experts by minimizing contrastive divergence neural computation pages salakhutdinov and hinton deep boltzmann machines in aistats larochelle and lauly neural autoregressive topic model in nips salakhutdinov tenenbaum and torralba learning with models ieee trans pattern anal mach pages welling and hinton exponential family harmoniums with an application to information retrieval in nips pages xing yan and hauptmann mining associated text and images with harmoniums in uai zhou and carin negative binomial process count and mixture modeling ieee trans pattern anal mach zhou padilla and scott priors for random count matrices derived from family of negative binomial processes to appear in amer statist nair and hinton rectified linear units improve restricted boltzmann machines in icml blei ng and jordan latent dirichlet allocation mach learn acharya ghosh and zhou nonparametric bayesian factor analysis for dynamic count matrices in aistats ranganath tang charlin and blei deep exponential families in aistats gan chen henao carlson and carin scalable deep poisson factor analysis for topic modeling in icml zhou hannah dunson and carin binomial process and poisson factor analysis in aistats griffiths and steyvers finding scientific topics pnas zhou binomial process and exchangeable random partitions for mixedmembership modeling in nips teh jordan beal and blei hierarchical dirichlet processes amer statist bengio lamblin popovici and larochelle greedy training of deep networks in nips fan chang hsieh wang and lin liblinear library for large linear classification jmlr pages srivastava salakhutdinov and hinton modeling documents with deep boltzmann machine in uai 
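As a brief editorial illustration of the shrinkage mechanism that the greedy training strategy above relies on, the following self-contained Python sketch (not the authors' Matlab code) shows why the width upper bound K_max can be set generously and the effective width left to pruning. It assumes the top-layer weights follow a truncated gamma-process prior of the form r_k ~ Gam(gamma0/K_max, 1/c0), a reading suggested by the Gam(.../K_max) fragment above; the hyperparameter values and the pruning threshold are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch only: the shrinkage behind "prune inactive factors".
# Assumption (not verified against the paper): r_k ~ Gam(gamma0/K_max, 1/c0).
import numpy as np

rng = np.random.default_rng(0)
gamma0, c0 = 1.0, 1.0          # illustrative hyperparameters

for K_max in (10, 100, 1000):
    # Shape gamma0/K_max keeps the expected total weight at gamma0/c0 regardless of K_max.
    r = rng.gamma(shape=gamma0 / K_max, scale=1.0 / c0, size=K_max)
    active = int(np.sum(r > 1e-3 * r.sum()))   # factors above a small fraction of the total weight
    print(f"K_max={K_max:5d}  total weight={r.sum():.3f}  active factors={active}")
```

Under this prior the total weight stays roughly constant as K_max grows while the number of factors carrying non-negligible weight does not, which is what allows each newly added top layer to start at its width upper bound and shrink to the inferred K_T after the B_T burn-in iterations.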
convergence rates of newton methods murat erdogdu department of statistics stanford university erdogdu andrea montanari department of statistics and electrical engineering stanford university montanari abstract we consider the problem of minimizing sum of functions via projected iterations onto convex parameter set rp where in this regime algorithms which utilize techniques are known to be effective in this paper we use techniques together with approximation to design new randomized batch algorithm which possesses comparable convergence rate to newton method yet has much smaller cost the proposed algorithm is robust in terms of starting point and step size and enjoys composite convergence rate namely quadratic convergence at start and linear convergence when the iterate is close to the minimizer we develop its theoretical analysis which also allows us to select algorithm parameters our theoretical results can be used to obtain convergence rates of previously proposed based algorithms as well we demonstrate how our results apply to machine learning problems lastly we evaluate the performance of our algorithm on several datasets under various scenarios introduction we focus on the following minimization problem minimize fi where fi rp most machine learning models can be expressed as above where each function fi corresponds to an observation examples include logistic regression support vector machines neural networks and graphical models many optimization algorithms have been developed to solve the above minimization problem for given convex set rp we denote the euclidean projection onto this set by pc we consider the updates of the form pc qt where is the step size and qt is suitable scaling matrix that provides curvature information updates of the form eq have been extensively studied in the optimization literature for simplicity we assume rp throughout the introduction the case where qt is equal to identity matrix corresponds to gradient descent gd which under smoothness assumptions achieves linear convergence rate with np cost more precisely gd with ideal step size yields ˆt where as limt and is the largest gd gd eigenvalue of the hessian of at minimizer second order methods such as newton method nm and natural gradient descent ngd can be recovered by taking qt to be the inverse hessian and the fisher information evaluated at the current iterate respectively such methods may achieve quadratic convergence rates with cost in particular for large enough newton method yields nm and it is insensitive to the condition number of the hessian however when the number of samples grows large computing qt becomes extremely expensive popular line of research tries to construct the matrix qt in way that the update is computationally feasible yet still provides sufficient second order information such attempts resulted in methods in which only gradients and iterates are utilized resulting in an efficient update on qt celebrated method is the bfgs algorithm which requires np cost an alternative approach is to use techniques where scaling matrix qt is based on randomly selected set of data points is widely used in the first order methods but is not as well studied for approximating the scaling matrix in particular theoretical guarantees are still missing key challenge is that the hessian is close to the actual hessian along the directions corresponding to large eigenvalues large curvature directions in but is poor approximation in the directions corresponding to small eigenvalues flatter directions in in 
order to overcome this problem we use approximation more precisely we treat all the eigenvalues below the as if they were equal to the this yields the desired stability with respect to the we call our algorithm newsamp in this paper we establish the following newsamp has composite convergence rate quadratic at start and linear near the minimizer as illustrated in figure formally we prove bound of the form with coefficient that are explicitly given and are computable from data the asymptiotic behavior of the linear convergence coefficient is limt for small the condition number which controls the convergence of gd has been replaced by the milder for datasets with strong spectral features this can be large improvement as shown in figure the above results are achived without tuning the in particular by setting the complexity per iteration of newsamp is np with the sample size our theoretical results can be used to obtain convergence rates of previously proposed subsampling algorithms the rest of the paper is organized as follows section surveys the related work in section we describe the proposed algorithm and provide the intuition behind it next we present our theoretical results in section convergence rates corresponding to different schemes followed by discussion on how to choose the algorithm parameters two applications of the algorithm are discussed in section we compare our algorithm with several existing methods on various datasets in section finally in section we conclude with brief discussion related work even synthetic review of optimization algorithms for machine learning would go beyond the page limits of this paper here we emphasize that the method of choice depends crucially on the amount of data to be used and their dimensionality respectively on the parameters and in this paper we focus on regime in which and are large but not so large as to make gradient computations of order np and matrix manipulations of order prohibitive online algorithms are the option of choice for very large since the computation per update is independent of in the case of stochastic gradient descent sgd the descent direction is formed by randomly selected gradient improvements to sgd have been developed by incorporating the previous gradient directions in the current update equation batch algorithms on the other hand can achieve faster convergence and exploit second order information they are competitive for intermediate several methods in this category aim at quadratic or at least convergence rates in particular methods have proven effective another approach towards the same goal is to utilize to form an approximate hessian if the hessian is close to the true hessian these methods can approach nm in terms of convergence rate nevertheless they enjoy algorithm newsamp input define pc is the euclidean projection onto uk truncatedsvdk is truncated svd of with while do set of indices st let hst fi and hst qt ip ir pc qt end while output much smaller complexity per update no convergence rate analysis is available for these methods this analysis is the main contribution of our paper to the best of our knowledge the best result in this direction is proven in that estabilishes asymptotic convergence without quantitative bounds exploiting general theory from on the further improvements of the algorithms common approach is to use conjugate gradient cg methods krylov lastly there are various hybrid algorithms that combine two or more techniques to increase the performance examples include and sgd and gd ngd and nm 
ngd and approximation newsamp method via rank thresholding in the regime we consider there are two main drawbacks associated with the classical second order methods such as newton method the dominant issue is the computation of the hessian matrix which requires operations and the other issue is inverting the hessian which requires computation is an effective and efficient way of tackling the first issue recent empirical studies show that the hessian provides significant improvement in terms of computational cost yet preserves the fast convergence rate of second order methods if uniform is used the hessian will be random matrix with expected value at the true hessian which can be considered as sample estimator to the mean recent advances in statistics have shown that the performance of various estimators can be significantly improved by simple procedures such as shrinkage thresholding to this extent we use approximation as the important second order information is generally contained in the largest few of the hessian newsamp is presented as algorithm at iteration step the set of indices its size and the corresponding hessian is denoted by st and hst respectively assuming that the functions fi are convex eigenvalues of the symmetric matrix hst are therefore svd and eigenvalue decomposition coincide the operation truncatedsvdk hst uk is the best approximation takes hst as input and returns the largest eigenvalues with the corresponding eigenvectors uk this procedure requires computation operator pc projects the current iterate to the feasible set using euclidean projection we assume that this projection can be done efficiently to construct the curvature matrix qt instead of using the basic approximation we fill its eigenvalues with the eigenvalue of the hessian which is the largest eigenvalue below the threshold if we compute truncated svd with and the described operation results in qt ip ir ur which is simply the sum of scaled identity matrix and matrix note that the approximation that is suggested to improve the curvature estimation has been further utilized to reduce the cost of computing the inverse matrix final cost of newsamp will be np np newsamp takes the parameters and as inputs we discuss in section how to choose them optimally based on the theory in section convergence rate convergence coefficients value log error size newsamp st newsamp st newsamp st iterations coefficient linear quadratic rank figure left plot demonstrates convergence rate of newsamp which starts with quadratic rate and transitions into linear convergence near the true minimizer the right plot shows the effect of eigenvalue thresholding on the convergence coefficients up to scaling constant shows the number of kept eigenvalues plots are obtained using covertype dataset by the construction of qt newsamp will always be descent algorithm it enjoys quadratic convergence rate at start which transitions into linear rate in the neighborhood of the minimizer this behavior can be observed in figure the left plot in figure shows the convergence behavior of newsamp over different sizes we observe that large result in better convergence rates as expected as the size increases slope of the linear phase decreases getting closer to that of quadratic phase we will explain this phenomenon in section by theorems and the right plot in figure demonstrates how the coefficients of two phases depend on the thresholded rank coefficient of the quadratic phase increases with the rank threshold whereas for the linear phase relation is 
reversed theoretical results in this section we provide the convergence analysis of newsamp based on two different subsampling schemes independent at each iteration st is uniformly sampled from independently from the sets with or without replacement sequentially dependent at each iteration st is sampled from based on distribution which might depend on the previous sets but not on any randomness in the data the first scheme is simple and commonly used in optimization one drawback is that the set at the current iteration is independent of the previous hence does not consider which of the samples were previously used to form the approximate curvature information in order to prevent cycles and obtain better performance near the optimum one might want to increase the sample size as the iteration advances including previously unused samples this process results in sequence of dependent which falls into the subsampling scheme in our theoretical analysis we make the following assumptions assumption lipschitz continuity for any subset depending on the size of such that khs hs assumption bounded hessian fi is upper bounded by constant max fi independent in this section we assume that st is sampled according to the scheme in fact many stochastic algorithms assume that st is uniform subset of because in this case the hessian is an unbiased estimator of the full hessian that is hst where the expectation is over the randomness in st we next show that for any scaling matrix qt that is formed by the st iterations of the form eq will have composite convergence rate combination of linear and quadratic phases lemma assume that the parameter set is convex and st is based on scheme and sufficiently large further let the assumptions and hold and then for an absolute constant with probability at least the updates of the form eq satisfy for coefficients and defined as hst ck log mn qt remark if the initial point is close to the algorithm will start with quadratic rate of convergence which will transform into linear rate later in the close neighborhood of the optimum the above lemma holds for any matrix qt in particular if we choose qt we obtain bound for the simple hessian method in this case the coefficients and depend on kqt tp where tp is the smallest eigenvalue of the hessian note that tp can be arbitrarily small which might blow up both of the coefficients in the following we will see how newsamp remedies this issue theorem let the assumptions in lemma hold denote by ti the eigenvalue of hst where is given by newsamp at iteration step if the step size satisfies tp then we have with probability at least for an absolute constant for the coefficients and are defined as ck log mn newsamp has composite convergence rate where and are the coefficients of the linear and the quadratic terms respectively see the right plot in figure we observe that the size has significant effect on the linear term whereas the quadratic term is governed by the lipschitz constant we emphasize that the case is feasible for the conditions of theorem sequentially dependent here we assume that the scheme is used to generate distribution of sets may depend on each other but not on any randomness in the dataset examples include fixed as well as of increasing size sequentially covering unused data in addition to assumptions we assume the following assumption observations let zn be observations from distribution for fixed rp and we assume that the functions fi satisfy fi zi for some function rp most statistical learning algorithms can be 
formulated as above in classification problems one has access to samples yi xi where yi and xi denote the class label and the covariate and measures the classification error see section for examples for scheme an analogue of lemma is stated in appendix as lemma which leads to the following result theorem assume that the parameter set is convex and st is based on the scheme further let the assumptions and hold almost surely conditioned on the event if the step size satisfies eq then for given by newsamp at iteration with probability at least ce for ce we have for the coefficients and defined as diam mn log where are absolute constants and denotes the eigenvalue of hst mn compared to the theorem we observe that the coefficient of the quadratic term does not change this is due to assumption however the bound on the linear term is worse since we use the uniform bound over the convex parameter set dependence of coefficients on and convergence guarantees the coefficients and depend on the iteration step which is an undesirable aspect of the above results however these constants can be well approximated by their analogues and evaluated at the optimum which are defined by simply replacing tj with in their definition where the latter is the eigenvalue of at for the sake of simplicity we only consider the case where the functions fi are quadratic theorem assume that the functions fi are quadratic st is based on scheme and let the full hessian at be lower bounded by then for sufficiently large and absolute constants with probability log log theorem implies that when the size is sufficiently large will concentrate around generalizing the above theorem to functions is straightforward in which case one would get additional terms involving the difference in the case of scheme if one uses fixed then the coefficient does not depend on the following corollary gives sufficient condition for convergence detailed discussion on the number of iterations until convergence and further local convergence properties can be found in corollary assume that and are by and with an error bound of for as in theorem for the initial point sufficient condition for convergence is choosing the algorithm parameters step size let log we suggest the following step size for newsamp at iteration note that is the upper bound in theorems and and it minimizes the first component of the other terms in and linearly depend on to compensate for that we shrink towards contrary to most algorithms optimal step size of newsamp is larger than rigorous derivation of eq can be found in sample size by theorem of size log should be sufficient to obtain small coefficient for the linear phase also note that size scales quadratically with the condition number rank threshold for with effective rank trace divided by the largest eigenvalue it suffices to use log samples effective rank is upper bounded by the dimension hence one can use log samples to approximate the and choose rank threshold which retains the important curvature information examples generalized linear models glm maximum likelihood estimation in glm setting is equivalent to minimizing the negative loglikelihood minimize hxi yi hxi where is the cumulant generating function xi rp denote the rows of design matrix and rp is the coefficient vector here hx denotes the inner product between the vectors the function defines the type of glm gives ordinary least squares ols and log ez gives logistic regression lr using the results from section we perform convergence analysis of our algorithm on glm 
problem corollary let st be uniform and rp be the parameter set assume that the second derivative of the cumulant generating function is bounded by and it is lipschitz continuous with lipschitz constant assume that the covariates are contained in ball of radius rx kxi rx then for given by newsamp with constant step size at iteration with probability at least we have for constants and defined as where is an absolute constant and crx log lrx is the ith eigenvalue of hst support vector machines svm linear svm provides separating hyperplane which maximizes the margin the distance between the hyperplane and the support vectors although the vast majority of the literature focuses on the dual problem svms can be trained using the primal as well since the dual problem does not scale well with the number of data points some approaches get complexity the primal might be for optimization of linear svms the primal problem for the linear svm can be written as minimize yi xi where yi xi denote the data samples defines the separating hyperplane and could be any loss function the most commonly used loss functions include loss huber loss and their smoothed versions smoothing or approximating such losses with more stable functions is sometimes crucial in optimization in the case of newsamp which requires the loss function to be twice differentiable almost everywhere we suggest either smoothed huber loss or loss in the case of loss xi max xi by combining the offset and the normal vector of the hyperplane into single parameter vector and denoting by svt the set of indices of all the support vectors at iteration we may write the hessian xi xti where svt yi xi when is large the problem falls into our setup and can be solved efficiently using newsamp note that unlike the glm setting lipschitz condition of our theorems do not apply here however we empirically demonstrate that newsamp works regardless of such assumptions experiments in this section we validate the performance of newsamp through numerical studies we experimented on two optimization problems namely logistic regression lr and svm lr minimizes eq for the logistic function whereas svm minimizes eq for the loss in the following we briefly describe the algorithms that are used in the experiments gradient descent gd at each iteration takes step proportional to negative of the full gradient evaluated at the current iterate under certain regularity conditions gd exhibits linear convergence rate accelerated gradient descent agd is proposed by nesterov which improves over the gradient descent by using momentum term newton method nm achieves quadratic convergence rate by utilizing the inverse hessian evaluated at the current iterate bfgs is the most popular and stable method qt is formed by accumulating the information from iterates and gradients limited memory bfgs is variant of bfgs which uses only the recent iterates and gradients to construct qt providing improvement in terms of memory usage stochastic gradient descent sgd is simplified version of gd where at each iteration randomly selected gradient is used we follow the guidelines of for the step size dataset synthe logistic regression msd logistic regression log error method newsamp bfgs lbfgs newton gd agd sgd adagrad time sec log error log error ct slices logistic regression method newsamp bfgs lbfgs newton gd agd sgd adagrad svm time sec method newsamp bfgs lbfgs newton gd agd sgd adagrad svm time sec svm method newsamp bfgs lbfgs newton gd agd sgd adagrad time sec log error log error log error 
method newsamp bfgs lbfgs newton gd agd sgd adagrad time sec method newsamp bfgs lbfgs newton gd agd sgd adagrad time sec figure performance of several algorithms on different datasets newsamp is represented with red color adaptive gradient scaling adagrad uses an adaptive learning rate based on the previous gradients adagrad significantly improves the performance and stability of sgd for batch algorithms we used constant step size and for all the algorithms the step size that provides the fastest convergence is chosen for stochastic algorithms we optimized over the parameters that define the step size parameters of newsamp are selected following the guidelines in section we experimented over various datasets that are given in table each dataset consists of design matrix and the corresponding observations classes rn synthetic data is generated through multivariate gaussian distribution as methodological choice we selected moderate values of for which newton method can still be implemented and nevertheless we can demonstrate an improvement for larger values of comparison is even more favorable to our approach the effects of size and rank threshold are demonstrated in figure thorough comparison of the aforementioned optimization techniques is presented in figure in the case of lr we observe that stochastic methods enjoy fast convergence at start but slows down after several epochs the algorithm that comes close to newsamp in terms of performance is bfgs in the case of svm nm is the closest algorithm to newsamp note that the global convergence of bfgs is not better than that of gd the condition for rate is for which an initial point close to the optimum is required this condition can be rarely satisfied in practice which also affects the performance of other second order methods for newsamp even though rank thresholding provides level of robustness we found that initial point is still an important factor details about figure and additional experiments can be found in appendix dataset ct slices covertype msd synthetic reference mewl table datasets used in the experiments conclusion in this paper we proposed based second order method utilizing hessian estimation the proposed method has the target regime and has np complexity we showed that the convergence rate of newsamp is composite for two widely used schemes starts as quadratic convergence and transforms to linear convergence near the optimum convergence behavior under other schemes is an interesting line of research numerical experiments demonstrate the performance of the proposed algorithm which we compared to the classical optimization methods references amari natural gradient works efficiently in learning neural computation richard byrd gillian chin will neveitt and jorge nocedal on the use of stochastic hessian information in optimization methods for machine learning siam journal on optimization jock blackard and denis dean comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables compag richard byrd sl hansen jorge nocedal and yoram singer stochastic method for optimization arxiv preprint christopher bishop neural networks for pattern recognition oxford university press bottou machine learning with stochastic gradient descent compstat stephen boyd and lieven vandenberghe convex optimization cambridge university press cai emmanuel and zuowei shen singular value thresholding algorithm for matrix completion siam journal on optimization no olivier chapelle 
training support vector machine in the primal neural computation lee dicker and murat erdogdu flexible results for quadratic forms with applications to variance components estimation arxiv preprint david donoho matan gavish and iain johnstone optimal shrinkage of eigenvalues in the spiked covariance model arxiv preprint john duchi elad hazan and yoram singer adaptive subgradient methods for online learning and stochastic optimization mach learn res john dennis jr and jorge methods motivation and theory siam review murat erdogdu and andrea montanari convergence rates of newton methods arxiv preprint murat erdogdu method second order method for glms via stein lemma nips michael friedlander and mark schmidt hybrid methods for data fitting siam journal on scientific computing no franz graf kriegel matthias schubert sebastian and alexander cavallaro image registration in ct images using radial image descriptors miccai springer david gross and vincent nesme note on sampling without replacing from finite collection of matrices arxiv preprint igor griva stephen nash and ariela sofer linear and nonlinear optimization siam nathan halko martinsson and joel tropp finding structure with randomness probabilistic algorithms for constructing approximate matrix decompositions no lichman uci machine learning repository nicolas le roux and andrew fitzgibbon fast natural newton method icml nicolas le roux manzagol and yoshua bengio topmoumoute online natural gradient algorithm nips james martens deep learning via optimization icml pp mewl thierry mahieux daniel ellis brian whitman and paul lamere the million song dataset yurii nesterov method for unconstrained convex minimization problem with the rate of convergence doklady an sssr vol pp introductory lectures on convex optimization basic course vol springer mark schmidt nicolas le roux and francis bach minimizing finite sums with the stochastic average gradient arxiv preprint bernhard and alexander smola learning with kernels support vector machines regularization optimization and beyond mit press joel tropp tail bounds for sums of random matrices foundations of computational mathematics roman vershynin introduction to the analysis of random matrices oriol vinyals and daniel povey krylov subspace descent for deep learning aistats 
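To make the NewSamp iteration described above concrete, here is a hedged, self-contained sketch (not the authors' released code) applied to logistic regression. It follows one consistent reading of the construction in the text: estimate the Hessian on a random sub-sample, keep its top r eigenpairs, replace every smaller eigenvalue by the (r+1)-th largest one, and scale the full gradient by the inverse of the thresholded matrix, Q_t = U_r diag(1/lambda_1, ..., 1/lambda_r) U_r^T + (1/lambda_{r+1})(I_p - U_r U_r^T). The data, sample size, rank, and step size are illustrative choices, and the projection P_C is omitted (C = R^p here).

```python
# Hedged sketch of a NewSamp-style iteration on logistic regression (illustrative only).
import numpy as np

def grad_and_subsampled_hessian(theta, X, y, idx=None):
    """Full-data gradient of the average logistic loss, Hessian over rows `idx`."""
    p_hat = 1.0 / (1.0 + np.exp(-(X @ theta)))
    grad = X.T @ (p_hat - y) / len(y)
    rows = np.arange(len(y)) if idx is None else idx
    w = p_hat[rows] * (1.0 - p_hat[rows])
    H_sub = (X[rows] * w[:, None]).T @ X[rows] / len(rows)
    return grad, H_sub

def newsamp_step(theta, X, y, sample_size, rank, step=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    idx = rng.choice(len(y), size=sample_size, replace=False)
    grad, H_sub = grad_and_subsampled_hessian(theta, X, y, idx)
    lam, U = np.linalg.eigh(H_sub)                 # ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]                 # reorder to descending
    U_r, lam_r, lam_thresh = U[:, :rank], lam[:rank], lam[rank]
    # Q_t = (1/lam_{r+1})(I - U_r U_r^T) + U_r diag(1/lam_i) U_r^T  (assumed form)
    p_dim = len(theta)
    Q = (np.eye(p_dim) - U_r @ U_r.T) / lam_thresh + U_r @ np.diag(1.0 / lam_r) @ U_r.T
    return theta - step * Q @ grad                 # projection onto C omitted (C = R^p)

# Toy run on synthetic data (illustrative parameters).
rng = np.random.default_rng(1)
n, p = 2000, 20
X = rng.standard_normal((n, p))
theta_true = rng.standard_normal(p)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ theta_true)))).astype(float)

theta = np.zeros(p)
for _ in range(20):
    theta = newsamp_step(theta, X, y, sample_size=500, rank=5, rng=rng)
print(f"gradient norm after 20 steps: {np.linalg.norm(grad_and_subsampled_hessian(theta, X, y)[0]):.2e}")
```

The eigenvalue thresholding is what gives the robustness discussed above: flat directions are scaled by 1/lambda_{r+1} rather than by the reciprocal of a tiny, poorly estimated eigenvalue.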
No-regret learning in Bayesian games

Vasilis Syrgkanis (Microsoft Research, New York, NY), Jason Hartline (Northwestern University, Evanston, IL), and Eva Tardos (Cornell University, Ithaca, NY).

Abstract. Recent price-of-anarchy analyses of games of complete information suggest that coarse correlated equilibria, which characterize the outcomes resulting from no-regret learning dynamics, have near-optimal welfare. This work provides two main technical results that lift this conclusion to games of incomplete information, i.e., Bayesian games. First, near-optimal welfare in Bayesian games follows directly from the proof of near-optimal welfare in the same game when the private information is public. Second, no-regret learning dynamics converge to Bayesian coarse correlated equilibrium in these incomplete-information games. These results are enabled by an interpretation of a Bayesian game as a stochastic game of complete information.

Introduction. A recent confluence of results from game theory and learning theory gives a simple explanation for why good outcomes can be expected in large families of games. The advance comes from a relaxation of the classical notion of equilibrium in games to one that corresponds to the outcome attained when players' behavior ensures asymptotic no-regret, e.g., via standard online learning algorithms such as weighted majority, and from an extension theorem showing that the standard approach for bounding the quality of classical equilibria automatically implies the same bounds on the quality of these relaxed equilibria. This paper generalizes these results from static games to Bayesian games, for example auctions.

Our motivation for considering learning outcomes in Bayesian games is the following. Many important games model repeated interactions between an uncertain set of participants. Sponsored search, and more generally online marketplaces, are important examples of such games: platforms are running millions of auctions, with each individual auction slightly different and of only very small value, yet such marketplaces have high enough volume to be the financial basis of large industries. This online auction environment is best modeled by a repeated Bayesian game: the auction game is repeated over time, with the set of participants slightly different each time, depending on many factors, from the budgets of the players to subtle differences in the opportunities.

A canonical example to which our methods apply is a first-price auction with players' values for the item drawn from a product distribution. In such an auction, players simultaneously submit sealed bids; the player with the highest bid wins and pays her bid. The utility of the winner is her value minus her bid; the utilities of the losers are zero. When the values are drawn from continuous distributions, the equilibrium is given by a differential equation that is not generally analytically tractable, and generalizations of this model are computationally hard. Though their equilibria are complex, we show that good outcomes can be expected in these kinds of auctions.

Our approach to proving that good equilibria can be expected in repeated Bayesian games is to extend an analogous result for static games, the setting where the same game with the same payoffs and the same players is repeated. Nash equilibrium is the classical model of equilibrium for each stage of the static game; in such an equilibrium the strategies of players may be randomized, but the randomizations of the players are independent. To measure the quality of outcomes in games, Koutsoupias and Papadimitriou introduced the price of anarchy: the ratio of the quality of the worst Nash equilibrium to that of the socially optimal solution. Price-of-anarchy
results have been shown for large families of games with focus on those relevant for computer networks roughgarden identified the canonical approach for bounding the price of anarchy of game as showing that it satisfies natural smoothness condition there are two fundamental flaws with nash equilibrium as description of strategic behavior first computing nash equilibrium can be ppad hard and thus neither should efficient algorithms for computing nash equilibrium be expected nor should any dynamics of players with bounded computational capabilities converge to nash equilibrium second natural behavior tends to introduce correlations in strategies and therefore does not converge to nash equilibrium even in the limit both of these issues can be resolved for large families of games first there are relaxations of nash equilibrium which allow for correlation in the players strategies of these this paper will focus on coarse correlated equilibrium which requires the expected payoff of player for the correlated strategy be no worse than the expected payoff of any action at the player disposal second it was proven by blum et al that the asymptotic property of many online learning algorithms implies convergence to the set of coarse correlated blum et al extended the definition of the price of anarchy to outcomes obtained when each player follows learning as coarse correlated equilibrium generalize nash equilibrium it could be that the worst case equilibrium under the former is worse than the latter roughgarden however observed that there is often no degradation specifically the very same smoothness property that he identified as implying good welfare in nash equilibrium also proves good welfare of coarse correlated equilibrium equivalently for outcomes from learners thus for large family of static games we can expect strategic behavior to lead to good outcomes this paper extends this theory to bayesian games our contribution is we show an analog of the convergence of learning to coarse correlated equilibria in bayesian games which is of interest independently of our price of anarchy analysis and ii we show that the coarse correlated equilibria of the bayesian version of any smooth static game have good welfare combining these results we conclude that learning in smooth bayesian games achieves good welfare these results are obtained as follows it is possible to view bayesian game as stochastic game where the payoff structure is fixed but there is random action on the part of nature this viewpoint applied to the above auction example considers population of bidders associated for each player and in each stage nature uniformly at random selects one bidder from each population to participate in the auction we and strengthen result of syrgkanis and tardos by showing that the smoothness property of the static game for any fixed profile of bidder values implies smoothness of this stochastic game from the perspective of coarse correlated equilibrium there is no difference between stochastic game and the game with each random variable replaced with its expected value thus the smoothness framework of roughgarden extends this result to imply that the coarse correlated equilibria of the stochastic game are good to show that we can expect good outcomes in bayesian games it suffices to show that learning converges to the coarse correlated equilibrium of this stochastic game importantly when we consider learning algorithms there is distinction between the stochastic game where players payoffs are random variables 
and the game where players payoffs are the expectation in the standard terms of the game theory literature we extend results for learning in games of complete information to games of incomplete information this result is generalization of one of foster and vohra they referred to this price of anarchy for learners as the price of total anarchy of these variables our analysis addressed this distinction and in particular shows that in the stochastic game on populations learning converges almost surely to the set of coarse correlated equilibrium this result implies that the average welfare of dynamics will be good almost surely and not only in expectation over the random draws of nature preliminaries this section describes general game theoretic environment which includes auctions and resource allocation mechanisms for this general environment we review the results from the literature for analyzing the social welfare that arises from learning dynamics in repeated game play the subsequent sections of the paper will generalize this model and these results to bayesian games games of incomplete information general game form general game is specified by mapping from profile an of allowable actions of players to an outcome behavior in game may result in possibly correlated randomized actions player utility in this game is determined by profile of individual values vn and the implicit outcome of the game it is denoted ui vi ui vi in games with social planner or principal who does not take an action in the game the utility of the principal is in many games of interest such as auctions or allocation mechanisms the utility of the principal is the revenue from payments from the players we will use the term mechanism and game interchangeably in static game the payoffs of the players given by are fixed subsequent sections will consider bayesian games in the independent private value model where player value vi is drawn independently from the other players values and is known only privately to player classical game theory assumes complete information for static games that is known and incomplete information in bayesian games that the distribution over is known for our study of learning in games no assumptions of knowledge are made however to connect to the classical literature we will use its terminology of complete and incomplete information to refer to static and bayesian games respectively social welfare we will be interested in analyzing the quality of the outcome of the game as defined by the social welfare pwhich is the sum of the utilities of the players and the principal we will denote by sw ui vi the expected social welfare of mechanism under randomized action profile for any valuation profile we will denote the optimal social welfare the maximum over outcomes of the game of the sum of utilities by pt learning and coarse correlated equilibria for complete information games fixed valuation profile blum et al analyzed repeated play of players using learning algorithms and showed that this play converges to relaxation of nash equilibrium namely coarse correlated equilibrium definition no regret player achieves no regret in sequence of play at if his regret against any fixed strategy vanishes to zero pt limt ui vi ui at vi definition coarse correlated equilibrium cce randomized action profile is coarse correlated equilibrium of complete information game with valuation profile if for every player and ai ea ui vi ea ui vi theorem blum et al the empirical distribution of actions of any sequence in 
repeated game converges to the set of cce of the static game price of anarchy of cce roughgarden gave unifying framework for comparing the social welfare under various equilibrium notions including coarse correlated equilibrium to the optimal social welfare by defining the notion of smooth game this framework was extended to games like auctions and allocation mechanisms by syrgkanis and tardos symbols denote random variables simultaneous first price auction with submodular bidders first price auction first price position auction auction greedy combinatorial auction with proportional bandwitdth allocation mechanism submodular welfare games congestion games with linear delays oa de reference figure examples of smooth games and mechanisms definition smooth mechanism mechanism is for some there exists an independent randomized action profile an for each valuation profile such that for any action profile and valuation profile ui ai vi pt many important games and mechanisms satisfy this smoothness definition for various parameters of and see figure the following theorem shows that the welfare of any coarse correlated equilibrium in any of these games is nearly optimal theorem efficiency of cce if mechanism is then the social welfare of any course correlated equilibrium at least max of the optimal welfare the price of anarchy satisfies oa max price of anarchy of learning following blum et al theorem and theorem imply that learning dynamics have social welfare corollary efficiency of dyhamics if mechanism is then the average welfare of any dynamics of the repeated game with fixed player set and valuation profile achieves average social welfare at least max of the optimal welfare the price of anarchy satisfies oa max importantly corollary holds the valuation profile fixed throughout the repeated game play the main contribution of this paper is in extending this theory to games of incomplete information where the values of the players are drawn at random in each round of game play population interpretation of bayesian games in the standard independent private value model of bayesian game there are players player has type vi drawn uniformly from the set of type vi and this distribution is denoted fi we will restrict attention to the case when the type space vi is finite player strategy in this bayesian game is mapping si vi ai from valuation vi vi to an action ai ai we will denote with σi av the strategy space of each player and with σn in the game each player realizes his type vi from the distribution and then makes action si vi in the game in the population interpretation of the bayesian game also called the agent normal form representation there are finite populations of players each player in population has type vi which we assume to be distinct for each player in each population and across the set of players in the population is denoted vi and the player in population with type vi is called player vi in the population game each player vi chooses an action si vi nature uniformly draws one player from the restriction to the uniform distribution is without loss of generality for any finite type space and for any distribution over the type space that involves only rational probabilities the restriction to distinct types is without of loss of generality as we can always augment type space with an index that does not affect player utilities each population and the game is played with those players actions in other words the utility of player vi from population is ag ui ev ui vi vi vi notice that the 
population interpretation of the bayesian game is in fact stochastic game of complete information there are multiple generalizations of coarse correlated equilibria from games of complete information to games of incomplete information one of the canonical definitions is simply the coarse correlated equilibrium of the stochastic game of complete information that is defined by the population interpretation definition bayesian coarse correlated equilibrium bayes randomized strategy profile is bayesian coarse correlated equilibrium if for every ai and for every vi vi es ev ui vi vi vi es ev ui vi vi vi in game of incomplete information the welfare in equilibrium will be compared to the expected optimal social welfare ev pt we will refer to the ratio of the expected optimal social welfare over the expected social welfare of any bayes as bayes oa learning in repeated bayesian game consider repeated version of the population interpretation of bayesian game at each iteration one player vi from each population is sampled uniformly and independently from other populations the set of chosen players then participate in an instance of mechanism we assume that each player vi vi uses some learning rule to play in this repeated in definition we describe the structure of the game and our notation more elaborately definition the repeated bayesian game of proceeds as follows in stage each player vi vi in each population picks an action sti vi ai we denote with sti ai the function that maps player vi vi to his action from each population one player vit vi is selected uniformly at random let vnt be the chosen profile of players and st stn vnt be the profile of chosen actions each player vit participates in an instance of game in the role of player with action sti vit and experiences utility of ui st vit all players not selected in step experience zero utility remark we point out that for each player in population to achieve he does not need to know the distribution of values in other populations there exist algorithms that can achieve the property and simply require an oracle that returns the utility of player at each iteration thus all we need to assume is that each player receives as feedback his utility at each iteration remark we also note that our results would extend to the case where at each period multiple matchings are sampled independently and players potentially participate in more than one instance of the mechanism and potentially with different players from the remaining population the only thing that the players need to observe in such setting is their average utility that resulted from their action sti vi ai from all the instances that they participated at the given period such scenario seems an appealing model in online ad auction marketplaces where players receive only average utility feedback from their bids this notion is the coarse analog of the agent normal form bayes correlated equilibrium defined in section of forges an equivalent and standard way to view bayesian game is that each player draws his value independently from his distribution each time the game is played in this interpretation the player plays by choosing strategy that maps his value to an action or distribution over actions in this interpretation our condition requires that the player not regret his actions for each possible value bayesian price of anarchy for learners in this repeated game setting we want to compare the average social welfare of any sequence of play where each player uses vanishing regret algorithm versus 
the average optimal welfare moreover we want to quantify the such average welfare over all possible valuation distributions within each population sup fn lim sup pt pt st sw pt we will refer to this quantity as the bayesian price of anarchy for learners the numerator of this term is simply the average optimal welfare when players from each population are drawn independently in each stage it converges almost surely to the expected optimal welfare ev pt of the stage game our main theorem is that if the mechanism is smooth and players follow strategies then the expected welfare is guaranteed to be close to the optimal welfare theorem main theorem if mechanism is then the average over time welfare of any dynamics of the repeated bayesian game achieves average social welfare at least max almost surely max of the average optimal welfare oa roadmap of the proof in section we show that any vanishing regret sequence of play of the repeated bayesian game will converge almost surely to the bayesian version of coarse correlated equilibrium of the incomplete information stage game therefore the bayesian price of total anarchy will be upper bounded by the efficiency of guarantee of any bayesian coarse correlated equilibrium finally in section we show that the price of anarchy bound of smooth mechanisms directly extends to bayesian coarse correlated equilibria thereby providing an upper bound on the bayesian price of total anarchy of the repeated game remark we point out that our definition of bayes is inherently different and more restricted than the one defined in caragiannis et al there bayes is defined as joint distribution over such that if then for any vi vi and vi ai ui vi ui vi vi the main difference is that the product distribution defined by distribution in and the distribution of values can not produce any possible joint distribution over but the type of joint distributions are restricted to satisfy conditional independence property described by namely that player action is conditionally independent of some other player value given player type such conditional independence property is essential for the guarantees that we will present in this work to extend to bayes and hence do not seem to extend to the notion given in however as we will show in section the dynamics that we analyze which are mathematically equivalent to the dynamics in do converge to this smaller set of bayes that we define and for which our efficiency guarantees will extend this extra convergence property is not needed when the mechanism satisfies the stronger property defined in and thereby was not needed to show efficiency bounds in their setting convergence of bayesian to bayes in this section we show that learning in the repeated bayesian game converges almost surely to the set of bayesian coarse correlated equilibria any given sequence of play of the repeated bayesian game which we defined in definition gives rise to sequence of pairs st where st stn and sti av captures the actions that each player vi in population would have chosen had they been picked then observe that all that matters to compute the average social welfare of the game for any given time step is the empirical distribution of pairs up till time step denoted as dt if st vt is random sample from dt pt sw st vt sw lemma almost sure convergence to bayes consider sequence of play of the random matching game where each player uses vanishing regret algorithm and let dt be the empirical distribution of strategy valuation profile pairs up till time step consider 
any subsequence of dt that converges in distribution to some distribution then almost surely is product distribution ds dv with ds and dv such that dv and ds bayes of the static incomplete information game with distributional beliefs proof we will denote with ri vi ui vi ui vi the regret of player vi from population for action at action profile for vi vi let xti vi vit vi since the sequence has vanishing regret for each player vi in population pi it must be that for any σi pt xi vi ri si vi vi for any fixed let dst denote the empirical distribution of st and let be random sample from dst for each let ts denote the time steps such that st for each ts then we can equation as es xti vi ri vi st vi for any and let ts ts then we can equation as hp es wi vi ri si vi vi now we observe that is the empirical frequency of the valuation vector when filtered at time steps where the strategy vector was since at each time step the valuation vector is picked independently from the distribution of valuation profiles this is the empirical frequency of ts independent samples from by standard arguments from empirical processes theory if ts then this empirical distribution converges almost surely to the distribution on the other hand if ts doesn go to then the empirical frequency of strategy vanishes to as and therefore has measure zero in the above expectation as thus for any convergent subsequence of dt if is the limit distribution then if is in the support of then almost surely the distribution of conditional on strategy is thus we can write as product distribution ds moreover if we denote with the random variable that follows distribution then the limit of equation for any convergent will give that wi vi ri vi vi equivalently we get that ds will satisfy that for all vi vi and for all ri wi wi wi vi the latter is exactly the bayes condition from definition thus ds is in the set of bayes of the static incomplete incomplete information game among players where the type profile is drawn from given the latter convergence theorem we can easily conclude the following the following theorem whose proof is given in the supplementary material theorem the price of anarchy for bayesian dynamics is upper bounded by the price of anarchy of bayesian coarse correlated equilibria almost surely efficiency of smooth mechanisms at bayes coarse correlated equilibria in this section we show that smoothness of mechanism implies that any bayes of the incomplete information setting achieves at least max of the expected optimal welfare to show this we will adopt the interpretation of bayes that we used in the previous section as coarse correlated equilibria of more complex normal form game the stochastic agent normal form representation of the bayesian game we can interpret this complexp normal form game as the game that arises from complete information mechanism mag among players which randomly samples one player from each of the population and where the utility of player in the complete information mechanism mag is given by equation the set of possible outcomes in this agent game corresponds to the set of mappings from profile of chosen players to an outcome in the underlying mechanism the optimal welfare of this game is then the expected optimal welfare ptag ev pt the main theorem that we will show is that whenever mechanism is then also mechanism mag is then we will invoke theorem of which shows that any coarse correlated equilibrium of complete information mechanism achieves at least max of the optimal welfare by the 
equivalence between bayes and cce of this complete inforλ mation game we get that every bayes of the bayesian game achieves at least max of the expected optimal welfare theorem from complete information to bayesian smoothness if mechanism is smooth then for any vector of independent valuation distributions fn the complete information mechanism mag is also proof consider the following randomized deviation for each player vi vi in population he random samples valuation profile then he plays according to the randomized action vi the player deviates using the randomized action guaranteed by the smoothness property of mechanism for his type vi and the random sample of the types of the others consider an arbitrary action profile sn for all players in all populations in this context it is better to think of each si as dimensional vector in ai and to view as dimensional vector then with we will denote all the components of this large vector except the ones corresponding to player vi vi moreover we will be denoting with sample from drawn by mechanism mag we now argue about the expected utility of player vi from this deviation which is ag ew ui si vi ew ev ui vi vi vi vi summing the latter over all players vi vi in population ag ew ui si vi ew vi ui si vi vi vi vi vi ev ui vi vi ev ui wi wi ev ui wi where the second to last equation is an exchange of variable names and regrouping using independence summing over populations and using smoothness of we get smoothness of mag hp ag ew ui vi ev λo pt µr λew pt µrag corollary every bayes of the incomplete information setting of smooth mechanism achieves expected welfare at least max of the expected optimal welfare finite time analysis and convergence rates in the previous section we argued about the limit average efficiency of the game as time goes to infinity in this section we analyze the convergence rate to bayes and we show approximate efficiency results even for finite time when players are allowed to have some theorem consider the repeated matching game with mechanism suppose that for any each player in each of the populations has regret at most then for every and there exists such that for any min with probability pt sw max ev pt moreover log references dirk bergemann and stephen morris correlated equilibrium in games with incomplete information cowles foundation discussion papers cowles foundation for research in economics yale university october avrim blum mohammadtaghi hajiaghayi katrina ligett and aaron roth regret minimization and the price of total anarchy in proceedings of the fortieth annual acm symposium on theory of computing stoc pages new york ny usa acm yang cai and christos papadimitriou simultaneous bayesian auctions and computational complexity in proceedings of the fifteenth acm conference on economics and computation ec pages new york ny usa acm ioannis caragiannis christos kaklamanis panagiotis kanellopoulos maria kyropoulou brendan lucier renato paes leme and tardos bounding the inefficiency of outcomes in generalized second price auctions journal of economic theory bart de keijzer evangelos markakis guido schfer and orestis telelis inefficiency of standard auctions in hansl bodlaender and giuseppef italiano editors algorithms esa volume of lecture notes in computer science pages springer berlin heidelberg franoise forges five legitimate definitions of correlated equilibrium in games with incomplete information theory and decision dean foster and rakesh vohra asymptotic calibration biometrika toddr kaplan and shmuel zamir 
asymmetric auctions with uniform distributions analytic solutions to the general case economic theory elias koutsoupias and christos papadimitriou equilibria in proceedings of the annual conference on theoretical aspects of computer science stacs pages berlin heidelberg lucier and borodin price of anarchy for greedy auctions in proceedings of the twentyfirst annual symposium on discrete algorithms soda pages philadelphia pa usa society for industrial and applied mathematics roughgarden intrinsic robustness of the price of anarchy in proceedings of the annual acm symposium on theory of computing stoc pages new york ny usa acm vasilis syrgkanis and tardos composable and efficient mechanisms in proceedings of the annual acm symposium on theory of computing stoc pages new york ny usa acm vetta nash equilibria in competitive societies with applications to facility location traffic routing and auctions in foundations of computer science proceedings the annual ieee symposium on pages 
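To make the vanishing-regret dynamics analyzed above concrete, the following is a minimal toy sketch (our own illustration under stated assumptions, not code from the paper) of a repeated Bayesian game: one player per population is sampled with an i.i.d. value each stage, and every (population, value) pair runs the Hedge algorithm, a standard vanishing-regret procedure, over a discrete bid grid. The mechanism is a single-item first-price auction, chosen only because it is a well-known example of a smooth mechanism; the value grid, bid grid, learning rate, and horizon are illustrative placeholders. By the main theorem, the time-average welfare printed at the end should settle near a constant fraction of the average optimal welfare as the horizon grows.

```python
# Toy sketch: Hedge (vanishing-regret) dynamics in a repeated Bayesian
# first-price auction with two populations.  All numeric choices are
# illustrative assumptions, not values from the paper.
import math
import random
from collections import defaultdict

VALUES = [0.25, 0.5, 0.75, 1.0]       # possible types, drawn uniformly each stage
BIDS = [i / 10 for i in range(11)]    # discrete action space
ETA = 0.1                             # Hedge learning rate
T = 20000                             # number of stages

# one Hedge weight vector per (population, type): the per-type no-regret learners
weights = defaultdict(lambda: [1.0] * len(BIDS))

def wins(pop, bid, other_bid):
    # first-price allocation; ties broken in favour of population 0 for simplicity
    return bid > other_bid or (bid == other_bid and pop == 0)

realized, optimal = 0.0, 0.0
for _ in range(T):
    vals = [random.choice(VALUES) for _ in range(2)]   # sampled player per population
    bids = []
    for p in range(2):
        w = weights[(p, vals[p])]
        idx = random.choices(range(len(BIDS)), weights=w)[0]
        bids.append(BIDS[idx])

    winner = 0 if wins(0, bids[0], bids[1]) else 1
    realized += vals[winner]          # welfare of the realized outcome
    optimal += max(vals)              # optimal welfare of this stage

    # full-information Hedge update: each sampled type scores every counterfactual
    # bid against the realized opponent bid, then renormalizes to avoid overflow
    for p in range(2):
        other = bids[1 - p]
        w = weights[(p, vals[p])]
        for idx, b in enumerate(BIDS):
            u = (vals[p] - b) if wins(p, b, other) else 0.0
            w[idx] *= math.exp(ETA * u)
        total = sum(w)
        for idx in range(len(BIDS)):
            w[idx] /= total

print("time-average welfare / optimal welfare:", realized / optimal)
```

The per-type learners are exactly the objects whose empirical play distribution the convergence lemma studies: as T grows, the joint empirical distribution of (strategy, value) profiles generated by this loop approximates a Bayes coarse correlated equilibrium, so the welfare ratio it prints is governed by the smoothness-based efficiency bound.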
statistical topological data analysis kernel perspective stefan huber ist austria roland kwitt department of computer science university of salzburg rkwitt marc niethammer department of computer science and bric unc chapel hill mn weili lin department of radiology and bric unc chapel hill ulrich bauer department of mathematics technische universität münchen tum ulrich abstract we consider the problem of statistical computations with persistence diagrams summary representation of topological features in data these diagrams encode persistent homology widely used invariant in topological data analysis while several avenues towards statistical treatment of the diagrams have been explored recently we follow an alternative route that is motivated by the success of methods based on the embedding of probability measures into reproducing kernel hilbert spaces in fact positive definite kernel on persistence diagrams has recently been proposed connecting persistent homology to popular learning techniques such as support vector machines however important properties of that kernel enabling principled use in the context of probability measure embeddings remain to be explored our contribution is to close this gap by proving universality of variant of the original kernel and to demonstrate its effective use in twosample hypothesis testing on synthetic as well as data introduction over the past years advances in adopting methods from algebraic topology to study the shape of data point clouds images shapes have given birth to the field of topological data analysis tda in particular persistent homology has been widely established as tool for capturing relevant topological features at multiple scales the output is summary representation in the form of so called barcodes or persistence diagrams which roughly speaking encode the life span of the features these topological summaries have been successfully used in variety of different fields including but not limited to computer vision and medical imaging applications range from the analysis of cortical surface thickness to the structure of brain networks brain artery trees or histology images for breast cancer analysis despite the success of tda in these areas statistical treatment of persistence diagrams computing means or variances turns out to be difficult not least because of the unusual structure of the barcodes as intervals rather than numerical quantities while substantial advancements in the direction of statistical tda have been made by studying the structure of the space of persistence diagrams endowed with metrics or variants thereof it is technically and computationally challenging to work in this space in machine learning context we would rather work with hilbert spaces primarily due to the highly regular structure and the abundance of readily available and methods for statistics and learning one way to circumvent issues such as of the fréchet mean or computationally intensive algorithmic strategies is to consider mappings of persistence barcodes into linear function spaces statistical computations can then be performed based on probability theory on banach spaces however the methods proposed in can not guarantee that different probability distributions can always be distinguished by statistical test contribution in this work we consider the task of statistical computations with persistence diagrams our contribution is to approach this problem by leveraging the theory of embedding probability measures into reproducing kernel hilbert spaces in our 
case probability measures on the space of persistence diagrams in particular we start with recently introduced kernel on persistence diagrams by reininghaus et al and identify missing properties that are essential for use in the aforementioned framework by enforcing mild restrictions on the underlying space we can in fact close the remaining gaps and prove that minor modification of the kernel is universal in the sense of steinwart see section our experiments demonstrate on couple of synthetic and data samples how this universal kernel enables principled solution to the selected problem of hypothesis testing related work in the following we focus our attention on work related to statistical treatment of persistent homology since this is rather new field several avenues are pursued in parallel mileyko et al study properties of the set of persistence diagrams when endowed with the metric they show for instance that under this metric the space is polish and the fréchet mean exists however it is not unique and no algorithmic solution is provided turner et al later show that the metric on the set of persistence diagrams yields geodesic space and that the additional structure can be leveraged to construct an algorithm for computing the fréchet mean and to prove law of large numbers in munch et al take different approach and introduce probabilistic variant of the fréchet mean as probability measure on persistence diagrams while this yields unique mean the solution itself is not persistence diagram anymore techniques for computing confidence sets for persistence diagrams are investigated by fasy et al the authors focus on the bottleneck metric special case of the metric when remarking that similar results could potentially be obtained for the case of the metric under stronger assumptions on the underlying topological space while the aforementioned results concern properties of the set of persistence diagrams equipped with metrics different strategy is advocated by bubenik in the key idea is to circumvent the peculiarities of the metric by mapping persistence diagrams into function spaces one such representation is the persistence landscape sequence of functions in banach space while it is in general not possible to go back and forth between landscapes and persistence diagrams the banach space structure enables theoretical treatment of statistical concepts such as averages or confidence intervals chazal et al establish additional convergence results and propose bootstrap procedure for obtaining confidence sets another less statistically oriented approach towards convenient summary of persistence barcodes is followed by adcock et al the idea is to attach numerical quantities to persistence barcodes which can then be used as input to any machine learning algorithm in the form of feature vectors this strategy is rooted in study of algebraic functions on barcodes however it does not necessarily guarantee stability of the persistence summary representation which is typically desired property of feature map our proposed approach to statistical tda is also closely related to work in the field of learning techniques or to be more specific to the embedding of probability measures into rkhs and the study of suitable kernel functions in that context in fact the idea of mapping probability measures into rkhs has led to many developments generalizing statistical concepts such as testing testing for conditional independence or statistical inference form euclidean spaces to other domains equipped with kernel in 
the context of supervised learning with tda reininghaus et al recently established first connection to learning techniques via the definition of positive definite kernel on persistence diagrams while positive definiteness is sufficient for many techniques such as support vector machines or kernel pca additional properties are required in the context of embedding probability measures organization section briefly reviews some background material and introduces some notation in section we show how slight modification of the kernel in fits into the framework of embedding probability measures into rkhs section presents set of experiments on synthetic and real data highlighting the advantages of the kernel finally section summarizes the main contributions and discusses future directions background since our discussion of statistical tda from kernel perspective is largely decoupled from how the topological summaries are obtained we only review two important notions for the theory of persistent homology filtrations and persistence diagrams for thorough treatment of the topic we refer the reader to we also briefly review the concept of embedding probability measures into rkhs following filtrations standard approach to tda assigns to some metric space growing sequence of simplicial complexes indexed by parameter typically referred to as filtration recall that an abstract simplicial complex is collection of nonempty sets that is closed under taking nonempty subsets persistent homology then studies the evolution of the homology of these complexes for growing parameter some widely used constructions particularly for point cloud data are the and the complex the complex is simplicial complex with vertex set such that is an iff maxi for point set rd in euclidean space the complex is simplicial complex with vertex set rd such that is an iff the closed balls of radius centered at the have common intersection more general way of obtaining filtration is to consider the sublevel sets for of function on topological space for instance in the case of surfaces meshes commonly used function is the heat kernel signature hks the and filtrations appear as special cases both being sublevel set filtrations of an appropriate function on the subsets abstract simplices of the vertex set for the filtration the function assigns to each subset the radius of its smallest enclosing sphere while for the filtration the function assigns to each subset its diameter equivalently the length of its longest edge death persistence diagrams studying the evolution of the topology of filtration allows us to capture interesting properties of the metric or function used to generate the filtration persistence diagrams provide concise description of the changes in homology that occur during this process existing connected components may merge cycles may appear etc this leads to the appearance and disappearance of homological features of different dimension persistent homology tracks the birth and death of such topological features the multiset of points where each point corresponds to birth time pair is called the persistence fig function and its persistence diagram diagram of the filtration an example of persistence diagram for features connected components of function with is shown in fig we use the identifiers to denote persistence diagrams in the remainder of the paper since all points lie in the above the diagonal rkhs embedding of probability measures an important concept for our work is the embedding of probability measures into 
reproducing kernel hilbert spaces consider borel probability measure defined on compact metric space which we observe through the sample with furthermore let be positive definite kernel function which realizes an inner product hφ with in some hilbert space for some possibly unknown map see definition also let be the associated rkhs generated by functions induced by the kernel span span hφ with the scalar product hk the linear structure on the rkhs admits the construction of means the embedding of probability measure on is now accomplished via the mean map if this map is injective the kernel is called characteristic this is true in particular if is dense in the space of continuous functions with the supremum norm in which case we refer to the kernel as universal while universal kernel is always characteristic the converse is not true since it has been shown that the empirical estimate of the mean is good proxy for the injectivity of can be used to define distances between distributions and and specifically this can be done via the maximum observed via samples mean discrepancy mmd sup where denotes suitable class of functions and denotes the expectation of which can be written as hµ by virtue of the reproducing property of gretton et al restrict to functions on unit ball in and show that eq can be expressed as the rhks distance between the means and µq of the measures empirical estimates of this quantity are given in and as µq this connection is of particular importance to us since it allows for hypothesis testing in principled manner given suitable kernel prominent examples of universal kernels for rd are the gaussian rbf kernel and the kernel ehx yi however without kernel mmd does not imply example of kernel is the scalar product kernel hx yi with rd even if if the variances of the distributions differ the mmd will still be zero if the means are equal in the context of statistical treatment of persistent homology the ability to embed probability measures on the space of persistence diagrams into rkhs is appealing specifically the problem of testing whether two different samples exhibit significantly different homological features as captured in the persistence diagram boils down to test with null hypothesis µq general alternative µq where and are probability measures on the set of persistence diagrams the computation of this test only involves evaluations of the kernel enabling this procedure via suitable universal kernel will be discussed next the universal persistence scale space kernel in the following for we let dq dw denote the metric space of persistence diagrams with the metric dw where is the empty diagram in theorem mileyko et al show that dq dw is complete metric space when the subscript is omitted we do not refer to any specific instance of metric let us fix the numbers and we denote by the subset of consisting of those persistence diagrams that are bounded by for every the time of its points is less or equal to see definition and whose total multiplicities the sum of multiplicities of all points in diagram are bounded by while this might appear restrictive at first sight it does not really pose limitation in practice in fact for data generated by some finite process meshes have finite number of images have limited resolution etc establishing and is typically not problem we remark that the aforementioned restriction is similar to enforcing boundedness of the support of persistence landscapes in section in reininghaus et al introduce the persistence scale space pss kernel as 
stable kernel on the set of persistence diagrams of finite total multiplicity each diagram contains only finitely many points let denote point in diagram and let denote its mirror image across the diagonal further let the feature map φσ is given as the solution of heat diffusion problem with dirichlet boundary condition on the diagonal by metric is defined as dw infγ where ranges over all bijections from to with denoting the multiset of diagonal points each with countably infinite multiplicity φσ the kernel kσ is then given in closed form as kσ hφσ φσ for and by construction positive definiteness of kσ is guaranteed the kernel is stable in the sense that the distance dσ is bounded up to constant by theorem we have the following property proposition restricting the kernel in eq to the mean map sends probability measure on to an element proof the claim immediately follows from lemma and proposition since kσ is measurable and bounded on and hence while positive definiteness enables the use of kσ in many learning techniques we are interested in assessing whether it is universal or if we can construct universal kernel from kσ see section the following theorem of christmann and steinwart is particularly relevant to this question theorem cf theorem of let be compact metric space and separable hilbert space such that there exists continuous and injective map furthermore let be function that is analytic on some neighborhood of it can locally be expressed by its taylor series an if an for all then hφ an hφ ing is universal kernel kernels of the form eq are typically referred to as taylor kernels note that universality of kernel on refers to specific choice of metric on by using the same argument as for the linear kernel in rd see above the pss kernel kσ can not be universal with respect to the metric which is induced by the scalar product defining kσ on the other hand it is unclear whether kσ is universal with respect to the metric dw however we do have the following result proposition the kernel kσ kσ exp kσ is universal with respect to the metric proof we prove this proposition by means of theorem we set which is separable hilbert space as shown in reininghaus et al the feature map φσ is injective furthermore it is continuous by construction as the metric on is induced by the norm on and so is φσ restricted to the function is defined as exp and hence is analytic on its taylor coefficients an are and thus are positive for any it remains to show that is compact metric space first define which is bounded closed and therefore compact subspace of now consider the function φσ φσ φσ φσ fn corresponds to the holes average fig visualization of the mean pss function right taken over samples from cf that maps to the persistence diagram pi if pi we note that for all pn with there exists an pn such that this implies next we show that is continuous the distance on persistence diagrams qn where we defined as inf kpi pi with ranging over all bijections between and qn in other words corresponds to the distance without allowing matches to the diagonal now by definition because all bijections considered by are also admissible for since thus is compact and is continuous we have that is compact as well we refer to the kernel of eq as the universal persistence kernel since remark while we prove prop for the pss kernel in eq it obviously also holds for kσ exponentiation does neither invalidate measurability nor boundedness relation to persistence landscapes as the feature map φσ of eq defines summary of persistent 
homology in the hilbert space the results on probability in banach spaces used in for persistence landscapes naturally apply to φσ as well this includes for instance the law of large numbers or the central limit theorem theorems conversely considering persistence landscape as function in or yields positive definite kernel hλ on persistence diagrams however it is unclear whether universal in kernel can be constructed from persistence landscapes in way similar to the definition of kσ particular we are not aware of proof that the construction of persistence landscapes considered as functions in is continuous with respect to dw for some for more detailed treatment of the differences between φσ and persistence landscapes we refer the reader to experiments we first describe set of experiments on synthetic data appearing in previous work to illustrate the use of the pss feature map φσ and the universal persistence kernel on two different tasks we then present two applications on data where we assess differences in the persistent homology of functions on surfaces of lateral ventricles and corpora callosa with respect to different group assignments age in all experiments filtrations and the persistence diagrams are obtained using which can directly handle our types of input data source code to reproduce the experiments is available at https synthetic data computation of the mean pss function we repeat the experiment from of sampling from the union of two overlapping annuli in particular we repeatedly times draw samples of size out of and then compute persistence diagrams fn for features by considering sublevel sets of the distance function from the points finally we compute the mean of the pss functions φσ fi defined by the feature map from eq this simply amounts to computing φσ fn visualization of the pointwise average for fixed choice of is shown in fig we remind the reader that the convergence results used in equally hold for this feature map as explained in section in particular the above process of taking means converges to the expected value of the pss function as can be seen in fig the two holes manifest themselves as two bumps at different positions in the mean pss function online https significance level torus sphere fig left illustration of one random sample of size on sphere and torus in with equal surface area to generate noisy sample we add gaussian noise to each point in sample indicated by the vectors right hypothesis testing results for and features the box plots show the variation in over selection of values for as function of increasing sample size sample sizes for which the median is less than the chosen significance level here are marked green and red otherwise torus sphere in this slightly more involved example we repeat an experiment from section on the problem of discriminating between sphere and torus in based on random samples drawn from both objects in particular we repeatedly times draw samples from the torus and the sphere corresponding to measures and and then compute persistence diagrams eventually we test the that samples were drawn from the same object cf for thorough description of the full setup we remark that our setup uses the delaunay triangulation of the point samples instead of the triangulation of regular grid as in conceptually the important difference is in the testing strategy in two factors influence the test the choice of functional to map the persistence landscape to scalar and the choice of test statistic bubenik chooses to test for equality between 
the mean persistence landscapes in contrast we can test for true equality in distribution this is possible since universality of the kernel ensures that the mmd of eq is metric for the space of probability measures on persistence diagrams all are obtained by bootstrapping the test statistic under over random permutations we further vary the number of used to compute the mmd statistic from to and add gaussian noise in one experiment results are shown in fig over selection of scales for features and no noise we can always reject at significance for features and no noise we need at least samples to reliably reject at the same level of data we use two datasets in our experiments surfaces of the corpus callosum and surfaces of the lateral ventricles from neotates the corpus callosum surfaces were obtained from the longitudinal dataset of the oasis brain we use all subject data from the first visit and the grouping criteria is disease state dementia note that the demented group is comprised of individuals with very mild to mild ad this discrimination is based on the clinical dementia rating cdr score marcus et al explain this dataset in detail the lateral ventricle dataset is an extended version of it contains data from neonates all subjects were repeatedly imaged approximately every months starting from weeks in the first year and every months in the second year according to bompard et al the ventricle growth is the dominant effect and occurs in manner most significantly during the first months this raises the question whether age also has an impact on the shape of these brain structures that can be detected by persistent homology of the hks see setup below or section function hence we set our grouping criteria to be developmental age months months it is important to note that the heat kernel signature is not for that reason we normalize the configuration matrices containing the vertex coordinates of each mesh by their euclidean norm cf this ensures that our analysis is not biased by growth scaling effects online http in kσu hks time hks time right lateral ventricles grouping subjects in kσu hks time hks time corpora callosa grouping demented subjects fig left effect of increasing hks time illustrated on one exemplary surface mesh of both datasets right contour plots of estimated via random permutations shown as function of the kernel scale and the hks time setup we follow an experimental setup similar to and and compute the heat kernel signature for various times as function defined on the surface meshes in all experiments of eq and vary the hks time in we use the proposed kernel kernel kσ regarding the kernel scale σi we sweep from and alternative hypotheses are defined as in section with two samples of the test statistic under is bootstrapped using persistence diagrams fi random permutations this is also the setup recommended in for low samples sizes results figure shows the estimated for both datasets as function of the kernel scale and the hks time for features the false discovery rate is controlled by the benjaminihochberg procedure on the lateral ventricle data we observe for the right ventricles especially around hks times to cf fig since the results for left and right lateral ventricles are similar only the plots for the right lateral ventricle are shown in general the results indicate that at specific settings of the hks function captures salient shape features of the surface which lead to statistically significant differences in the persistent homology we do however point out that 
there is no clear guideline on how to choose the hks time in fact setting too low might emphasize noise while setting too high tends to details as can be seen in the illustration of the hks time on the side of fig on the corpus callosum data cf fig no significant differences in the persistent homology of the two groups again for features can be identified with ranging from to this does not allow to reject at any reasonable level discussion with the introduction of universal kernel for persistence diagrams in section we enable the use of this topological summary representation in the framework of embedding probability measures into reproducing kernel hilbert spaces while our experiments are mainly limited to hypothesis testing our kernel allows to use wide variety of statistical techniques and learning methods which are situated in that framework it is important to note that our construction via theorem essentially depends on restriction of the set to compact metric space we remark that similar conditions are required in in order to enable statistical computations constraining the support of the persistence landscapes however it will be interesting to investigate which properties of the kernel remain valid when lifting these restrictions from an application point of view we have shown that we can test for statistical difference in the distribution of persistence diagrams this is in contrast to previous work where hypothesis testing is typically limited to test for specific properties of the distributions such as equality in mean acknowledgements this work has been partially supported by the austrian science fund project no kli we also thank the anonymous reviewers for their valuable references adcock carlsson and carlsson the ring of algebraic functions on persistence bar codes arxiv available at http bendich marron miller pieloch and skwerer persistent homology analysis of brain artery trees arxiv available at http bompard xu styner paniagua ahn yuan jewells gao shen zhu and lin multivariate longitudinal shape analysis of human lateral ventricles during the first months of life plos one bubenik statistical topological data analysis using persistence landscapes jmlr carlsson topology and data bull amer math chazal fasy lecci rinaldo and wasserman stochastic convergence of persistence landscapes and silhouettes in socg christmann and steinwart universal kernels on input spaces in nips chung bubenik and kim persistence diagrams of cortical surface data in ipmi dryden and mardia statistical shape analysis wiley series in probability and statistics wiley edelsbrunner and harer computational topology an introduction ams fasy lecci rinaldo wasserman balakrishnan and singh confidence sets for persistence diagrams ann fukumizu song and gretton kernel bayes rule bayesian inference with positive definite kernels jmlr gretton borgwardt rasch schölkopf and smola kernel test jmlr ledoux and talagrand probability in banach spaces classics in mathematics springer lee chung kang and lee hole detection in metabolic connectivity of alzheimer disease using in miccai li ovsjanikov and chazal structural recognition in cvpr marcus fotenos csernansky morris and buckner open access series of imaging studies longitudinal mri data in nondemented and demented older adults cognitive mileyko mukherjee and harer probability measures on the space of persistence diagrams inverse munch bendich mukherjee mattingly and harer probabilistic fréchet means and statistics on vineyards corr http reininghaus bauer huber and kwitt 
stable kernel for topological machine learning in cvpr schölkopf and smola learning with kernels support vector machines regularization optimization and beyond mit press cambridge ma usa singh couture marron perou and niethammer topological descriptors of histology images in mlmi smola gretton song and schölkopf hilbert space embedding for distributions in alt sriperumbudur gretton fukumizu schölkopf and lanckriet hilbert space embeddings and metrics on probability measures jmlr steinwart on the influence of the kernel on the consistency of support vector machines jmlr steinwart and christmann support vector machines springer sun ovsjanikov and guibas concise and probably informative signature based on heat diffusion in sgp turner mileyko mukherjee and harer fréchet means for distributions of persistence diagrams discrete comput 
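As a concrete illustration of the kernel two-sample test used in the experiments above, the sketch below (our own, not the authors' implementation) evaluates the persistence scale space kernel in closed form, exponentiates it to obtain the universal variant k_U = exp(k_sigma), and bootstraps an MMD permutation test between two sets of persistence diagrams. The closed-form expression follows the standard statement of the PSS kernel in Reininghaus et al.; treat the exact constants as our reconstruction. Diagrams are represented as NumPy arrays of (birth, death) points, and the example diagrams, the scale sigma, and the number of permutations are placeholder assumptions.

```python
# Minimal sketch: two-sample MMD permutation test with the universal
# persistence scale space kernel on persistence diagrams, each diagram
# given as an (n, 2) array of (birth, death) points.
import numpy as np

def pss_kernel(F, G, sigma):
    """Closed-form persistence scale space kernel k_sigma(F, G) (our reconstruction)."""
    F = np.asarray(F, dtype=float)
    G = np.asarray(G, dtype=float)
    Gbar = G[:, ::-1]                                   # mirror images across the diagonal
    d  = ((F[:, None, :] - G[None, :, :]) ** 2).sum(-1)
    db = ((F[:, None, :] - Gbar[None, :, :]) ** 2).sum(-1)
    return (np.exp(-d / (8 * sigma)) - np.exp(-db / (8 * sigma))).sum() / (8 * np.pi * sigma)

def universal_kernel(F, G, sigma):
    # k_U = exp(k_sigma), the universal variant discussed above
    return np.exp(pss_kernel(F, G, sigma))

def gram(diagrams, sigma):
    n = len(diagrams)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = universal_kernel(diagrams[i], diagrams[j], sigma)
    return K

def mmd2(K, n, m):
    """Biased MMD^2 estimate from a pooled Gram matrix (first n samples vs last m)."""
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

def permutation_pvalue(sample_p, sample_q, sigma=1.0, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    pooled = list(sample_p) + list(sample_q)
    n, m = len(sample_p), len(sample_q)
    K = gram(pooled, sigma)
    observed = mmd2(K, n, m)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(n + m)
        Kp = K[np.ix_(perm, perm)]                      # relabel samples by permuting the Gram matrix
        if mmd2(Kp, n, m) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # toy diagrams: births uniform in (0, 1), persistences exponential with
    # different means for the two groups (a stand-in for real filtrations)
    def toy(mean_pers):
        b = rng.uniform(0, 1, size=10)
        return np.column_stack([b, b + rng.exponential(mean_pers, size=10)])
    P = [toy(0.3) for _ in range(15)]
    Q = [toy(0.6) for _ in range(15)]
    print("approximate p-value:", permutation_pvalue(P, Q))
```

In the experiments above the diagrams would instead come from sublevel-set filtrations of a function such as the heat kernel signature on a mesh, and the p-value would be swept over kernel scales and HKS times as in the reported contour plots; the testing machinery itself is unchanged.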
sequence learning andrew dai google adai quoc le google qvl abstract we present two approaches to use unlabeled data to improve sequence learning with recurrent networks the first approach is to predict what comes next in sequence which is language model in nlp the second approach is to use sequence autoencoder which reads the input sequence into vector and predicts the input sequence again these two algorithms can be used as pretraining algorithm for later supervised sequence learning algorithm in other words the parameters obtained from the pretraining step can then be used as starting point for other supervised training models in our experiments we find that long short term memory recurrent networks after pretrained with the two approaches become more stable to train and generalize better with pretraining we were able to achieve strong performance in many classification tasks such as text classification with imdb dbpedia or image recognition in introduction recurrent neural networks rnns are powerful tools for modeling sequential data yet training them by through time can be difficult for that reason rnns have rarely been used for natural language processing tasks such as text classification despite their ability to preserve word ordering on variety of document classification tasks we find that it is possible to train an lstm rnn to achieve good performance with careful tuning of hyperparameters we also find that simple pretraining step can significantly stabilize the training of lstms simple pretraining method is to use recurrent language model as starting point of the supervised network slightly better method is to use sequence autoencoder which uses rnn to read long input sequence into single vector this vector will then be used to reconstruct the original sequence the weights obtained from pretraining can then be used as an initialization for the standard lstm rnns we believe that this approach is superior to other unsupervised sequence learning methods paragraph vectors because it can allow for easy in our experiments with document classification tasks with newsgroups and dbpedia and sentiment analysis with imdb and rotten tomatoes lstms pretrained by recurrent language models or sequence autoencoders are usually better than lstms initialized randomly another important result from our experiments is that it is possible to use unlabeled data from related tasks to improve the generalization of subsequent supervised model for example using unlabeled data from amazon reviews to pretrain the sequence autoencoders can improve classification accuracy on rotten tomatoes from to an equivalence of adding substantially more labeled data this evidence supports the thesis that it is possible to use unsupervised learning with more unlabeled data to improve supervised learning with sequence autoencoders and outside unlabeled data lstms are able to match or surpass previously reported results our learning approach is related to vectors with two differences the first difference is that is harder objective because it predicts adjacent sentences the second is that is pure unsupervised learning algorithm without sequence autoencoders and recurrent language models our approach to sequence autoencoding is inspired by the work in sequence to sequence learning also known as by sutskever et al which has been successfully used for machine translation text parsing image captioning video analysis speech recognition and conversational modeling key to their approach is the use of recurrent network as an encoder 
to read in an input sequence into hidden state which is the input to decoder recurrent network that predicts the output sequence the sequence autoencoder is similar to the above concept except that it is an unsupervised learning model the objective is to reconstruct the input sequence itself that means we replace the output sequence in the framework with the input sequence in our sequence autoencoders the weights for the decoder network and the encoder network are the same see figure figure the sequence autoencoder for the sequence wxyz the sequence autoencoder uses recurrent network to read the input sequence in to the hidden state which can then be used to reconstruct the original sequence we find that the weights obtained from the sequence autoencoder can be used as an initialization of another supervised network one which tries to classify the sequence we hypothesize that this is because the network can already memorize the input sequence this reason and the fact that the gradients have shortcuts are our hypothesis of why the sequence autoencoder is good and stable approach in initializing recurrent networks significant property of the sequence autoencoder is that it is unsupervised and thus can be trained with large quantities of unlabeled data to improve its quality our result is that additional unlabeled data can improve the generalization ability of recurrent networks this is especially useful for tasks that have limited labeled data we also find that recurrent language models can be used as pretraining method for lstms this is equivalent to removing the encoder part of the sequence autoencoder in figure our experimental results show that this approach works better than lstms with random initialization overview of baselines in our experiments we use lstm recurrent networks because they are generally better than rnns our lstm implementation is standard and has input gates forget gates and output gates we compare this basic lstm against lstm initialized with the sequence autoencoder method when the lstm is initialized with sequence autoencoder the method is called in our experiments when lstm is initialized with language model the method is called lmlstm we also compare our method to other baselines methods or paragraph vectors previously reported on the same datasets in most of our experiments our output layer predicts the document label from the lstm output at the last timestep we also experiment with the approach of putting the label at every timestep and linearly increasing the weights of the prediction objectives from to this way we can inject gradients to earlier steps in the recurrent networks we call this approach linear label gain lastly we also experiment with the method of jointly training the supervised learning task with the sequence autoencoder and call this method joint training experiments in our experiments with lstms we follow the basic recipes as described in by clipping the cell outputs and gradients the benchmarks of focus are text understanding tasks with all datasets being publicly available the tasks are sentiment analysis imdb and rotten tomatoes and text classification newsgroups and dbpedia commonly used methods on these datasets such as or typically ignore ordering information modifiers and their objects may be separated by many unrelated words so one would expect recurrent methods which preserve ordering information to perform well nevertheless due to the difficulty in optimizing these networks recurrent models are not the method of choice for document 
classification in our experiments with the sequence autoencoder we train it to reproduce the full document after reading all the input words in other words we do not perform any truncation or windowing we add an end of sentence marker to the end of each input sequence and train the network to start reproducing the sequence after that marker to speed up performance and reduce gpu memory usage we perform truncated backpropagation up to timesteps from the end of the sequence we preprocess the text so that punctuation is treated as separate tokens and we ignore any characters and words in the dbpedia text we also remove words that only appear once in each dataset and do not perform any term weighting or stemming after training the recurrent language model or the sequence autoencoder for roughly steps with batch size of we use both the word embedding parameters and the lstm weights to initialize the lstm for the supervised task we then train on that task while fine tuning both the embedding parameters and the weights and use early stopping when the validation error starts to increase we choose the dropout parameters based on validation set using we are able to match or surpass reported results for all datasets it is important to emphasize that previous best results come from various different methods so it is significant that one method achieves strong results for all datasets presumably because such method can be used as general model for any similar task summary of results in the experiments are shown in table more details of the experiments are as follows table summary of the error rates of and previous best reported results dataset previous best result imdb rotten tomatoes newsgroups dbpedia sentiment analysis experiments with imdb in this first set of experiments we benchmark our methods on the imdb movie sentiment dataset proposed by maas et al there are labeled and unlabeled documents in the training set and in the test set we use of the labeled training documents as validation set the average length of each document is words and the maximum length of document is words the previous baselines are convnets or paragraph vectors since the documents are long one might expect that it is difficult for recurrent networks to learn we however find that with tuning it is possible to train lstm recurrent networks to fit the training set for example if we set the size of hidden state to be units and truncate the backprop to be an lstm can do fairly well with random embedding dimension dropout and random word dropout not published previously we are able to reach performance of around accuracy in the test set which is approximately worse than most baselines http fundamentally the main problem with this approach is that it is unstable if we were to increase the number of hidden units or to increase the number of backprop steps the training breaks down very quickly the objective function explodes even with careful tuning of the gradient clipping this is because lstms are sensitive to the hyperparameters for long documents in contrast we find that the works better and is more stable if we use the sequence autoencoders changing the size of the hidden state or the number of backprop steps hardly affects the training of lstms this is important because the models become more practical to train using sequence autoencoders we overcome the optimization instability in lstms in such way that it is fast and easy to achieve perfect classification on the training set to avoid overfitting we again use input dimension 
dropout with the dropout rate chosen on validation set we find that dropping out of the input embedding dimensions works well for this dataset the results of our experiments are shown in table together with previous baselines we also add an additional baseline where we initialize lstm with embeddings on the training set table performance of models on the imdb sentiment classification task model test error rate lstm with tuning and dropout lstm initialized with embeddings see section see figure with linear gain see section with joint training see section wrrbm bow bnc bayes svm with bigrams convnet with dynamic pooling paragraph vectors the results confirm that with input embedding dropout can be as good as previous best results on this dataset in contrast lstms without sequence autoencoders have trouble in optimizing the objective because of long range dependencies in the documents using language modeling as an initialization works well achieving but less well compared to the this is perhaps because language modeling is objective so that the hidden state only captures the ability to predict the next few words in the above table we use units for memory cells units for the input embedding layer in the and we also use hidden layer units with dropout of between the last hidden state and the classifier we continue to use these settings in the following experiments in table we present some examples from the imdb dataset that are correctly classified by salstm but not by bigram nbsvm model these examples often have dependencies or have sarcasm that is difficult to detect by solely looking at short phrases sentiment analysis experiments with rotten tomatoes and the positive effects of additional unlabeled data the success on the imdb dataset convinces us to test our methods on another sentiment analysis task to see if similar gains can be obtained the benchmark of focus in this experiment is the rotten tomatoes dataset the dataset has documents which are randomly split into for training for validation and for test the average length of each document is words and the maximum length is words thus compared to imdb this dataset is smaller both in terms of the number of documents and the number of words per document http table imdb sentiment classification examples that are correctly classified by and incorrectly by text sentiment looking for real super bad movie if you wan na have great fun don hesitate and check this one ferrigno is incredibly bad but is also the best of this mediocrity negative professional production with quality actors that simply never touched the heart or the funny bone no matter how hard it tried the quality cast stark setting and excellent cinemetography made you hope for fargo or high plains drifter but sorry the soup had no seasoning or meat for that matter of for effort negative the is very bad but there are some action sequences that really liked think the image is good better than other romanian movies liked also how the actors did their jobs negative our first observation is that it is easier to train lstms on this dataset than on the imdb dataset and the gaps between lstms and are smaller than before this is because movie reviews in rotten tomatoes are sentences whereas reviews in imdb are paragraphs as this dataset is small our methods tend to severely overfit the training set combining with input embedding and word dropout improves generalization and allows the model to achieve test set the further on the validation set can improve the result to error rate on the test 
set to better the performance we add unlabeled data from the imdb dataset in the previous experiment and amazon movie reviews to the autoencoder training we also run control experiment where we use the pretrained word vectors trained by from google news table performance of models on the rotten tomatoes sentiment classification task model test error rate lstm with tuning and dropout lstm with linear gain lstm with word vectors from google news with unlabeled data from imdb with unlabeled data from amazon reviews convnet with word vectors from google news the results for this set of experiments are shown in table our observation is that if we use the word vectors from there is only small gain of this is perhaps because the recurrent weights play an important role in our model and are not initialized properly in this experiment however if we use imdb to pretrain the sequence autoencoders the error decreases from to nearly gain in accuracy if we use amazon reviews larger unlabeled dataset million movie reviews to pretrain the sequence autoencoders the error goes down to which is another gain in accuracy the dataset is available at http which has million general product reviews but we only use million movie reviews in our experiments this brings us to the question of how well this method of using unlabeled data fares compared to adding more labeled data as argued by socher et al reason of why the methods are not perfect yet is the lack of labeled training data they proposed to use more labeled data by labeling an addition of phrases created by the stanford parser the use of more labeled data allowed their method to achieve around error in the test set an improvement of approximately over older methods with less labeled data we compare our method to their reported results on classification as our method does not have access to valuable labeled data one might expect that our method is severely disadvantaged and should not perform on the same level however with unlabeled data and sequence autoencoders we are able to obtain ranking second amongst many other methods that have access to much larger corpus of labeled data the fact that unlabeled data can compensate for the lack of labeled data is very significant as unlabeled data are much cheaper than labeled data the results are shown in table table more unlabeled data more labeled data performance of with additional unlabeled data and previous models with additional labeled data on the rotten tomatoes task model test error rate lstm initialized with embeddings trained on amazon reviews with unlabeled data from amazon reviews nb svm binb vecavg rnn rntn text classification experiments with newsgroups the experiments so far have been done on datasets where the number of tokens in document is relatively small few hundred words our question becomes whether it is possible to use salstms for tasks that have substantial number of words such as web articles or emails and where the content consists of many different topics for that purpose we carry out the next experiments on the newsgroups dataset there are documents in the training set and in the test set we use of the training documents as validation set each document is an email with an average length of words and maximum length of words attachments pgp keys duplicates and empty messages are removed as the newsgroup documents are long it was previously considered improbable for recurrent networks to learn anything from the dataset the best methods are often simple we repeat the same experiments with 
lstms and on this dataset similar to observations made in previous experiments are generally more stable to train than lstms to improve generalization of the models we again use input embedding dropout and word dropout chosen on the validation set with input embedding dropout and word dropout achieves test set error which is much better than previous classifiers in this dataset results are shown in table document classification experiments with dbpedia in this set of experiments we turn our attention to another challenging task of categorizing wikipedia pages by reading inputs the dataset of attention is the dbpedia dataset which was also used to benchmark convolutional neural nets in zhang and lecun http table performance of models on the newsgroups classification task model test error rate lstm lstm with linear gain hybrid class rbm svm bayes note that unlike other datasets in zhang and lecun dbpedia has no duplication or tainting issues so we assume that their experimental results are valid on this dataset dbpedia is crowdsourced effort to extract information from wikipedia and categorize it into an ontology for this experiment we follow the same procedure suggested in zhang and lecun the task is to classify dbpedia abstracts into one of categories after reading the input the dataset is split into training examples and test examples dbpedia document has an average of characters while the maximum length of all documents is characters as this dataset is large overfitting is not an issue and thus we do not perform any dropout on the input or recurrent layers for this dataset we use lstm each layer has hidden units and and the input embedding has units table performance of models on the dbpedia character level classification task model test error rate lstm lstm with linear gain with linear gain with layers and linear gain small convnet large convnet in this dataset we find that the linear label gain as described in section is an effective mechanism to inject gradients to earlier steps in lstms this linear gain method works well and achieves test set error which is better than combining and the linear gain method achieves test set error significant improvement from the results of convolutional networks as shown in table object classification experiments with in these experiments we attempt to see if our methods extend to data to do this we train lstm to read the image dataset where the input at each timestep is an entire row of pixels and output the class of the image at the end we use the same method as in to perform data augmentation we also trained lstm to do next row prediction given the current row we denote this as and lstm to predict the image by rows after reading all its rows we then these on the classification task we present the results in table while we do not achieve the results attained by state of the art convolutional networks our pretrained is able to exceed the results of the baseline convolutional dbn model despite not using any convolutions and outperforms the non lstm table performance of models on the object classification task model test error rate lstm lstm convolution dbns discussion in this paper we found that it is possible to use lstm recurrent networks for nlp tasks such as document classification we also find that language model or sequence autoencoder can help stabilize the learning in recurrent networks on five benchmarks that we tried lstms can become general classifier that reaches or surpasses the performance levels of all previous baselines 
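To make the two-phase recipe concrete, the following is a minimal PyTorch sketch (our own illustration under stated assumptions, not the authors' implementation). An LSTM is first trained as a sequence autoencoder that reads a token sequence into its final state and reconstructs the sequence with a decoder sharing the encoder's embedding and LSTM weights; the pretrained weights are then copied into a classifier that predicts the document label from the last timestep. Vocabulary size, hidden size, optimizer settings, and the fake data are placeholders, and details such as the end-of-sequence marker, truncated backpropagation, and dropout are omitted.

```python
# Minimal sketch of sequence-autoencoder pretraining followed by supervised
# fine-tuning, in the spirit of SA-LSTM.  Batching, padding and real data
# loading are omitted; sequences are LongTensors of token ids.
import torch
import torch.nn as nn

VOCAB, EMB, HID, NUM_CLASSES = 10000, 128, 512, 2   # placeholder sizes

class SeqAutoencoder(nn.Module):
    """Reads the input into the final LSTM state, then reconstructs the input.
    Encoder and decoder share the same embedding and LSTM weights."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, x):                     # x: (batch, time) token ids
        _, state = self.lstm(self.embed(x))   # encode: keep only the final (h, c)
        # decode: re-read the shifted input from the encoded state (teacher forcing,
        # with token id 0 as a start symbol) and predict the sequence itself
        dec_in = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        dec_out, _ = self.lstm(self.embed(dec_in), state)
        return self.out(dec_out)              # (batch, time, vocab) logits

class Classifier(nn.Module):
    """Supervised model initialized from the pretrained autoencoder weights."""
    def __init__(self, pretrained: SeqAutoencoder):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)
        self.embed.load_state_dict(pretrained.embed.state_dict())
        self.lstm.load_state_dict(pretrained.lstm.state_dict())
        self.head = nn.Linear(HID, NUM_CLASSES)

    def forward(self, x):
        out, _ = self.lstm(self.embed(x))
        return self.head(out[:, -1, :])       # label predicted from the last timestep

def pretrain_step(model, optimizer, batch):
    logits = model(batch)                     # reconstruct the batch itself
    loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), batch.reshape(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def finetune_step(model, optimizer, batch, labels):
    loss = nn.functional.cross_entropy(model(batch), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

if __name__ == "__main__":
    ae = SeqAutoencoder()
    opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
    fake_docs = torch.randint(1, VOCAB, (8, 50))          # stand-in for real token ids
    for _ in range(3):
        pretrain_step(ae, opt, fake_docs)
    clf = Classifier(ae)
    clf_opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    fake_labels = torch.randint(0, NUM_CLASSES, (8,))
    finetune_step(clf, clf_opt, fake_docs, fake_labels)
```

Dropping the encoder pass and keeping only the next-token prediction in `SeqAutoencoder.forward` gives the LM-LSTM variant of pretraining. The linear label gain variant described above would instead apply the classification head at every timestep and weight the per-step losses linearly from 0 to 1, which amounts to a small change in the classifier's forward pass and loss.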
acknowledgements we thank oriol vinyals ilya sutskever greg corrado vijay vasudevan manjunath kudlur rajat monga matthieu devin and the google brain team for their help references ando and zhang framework for learning predictive structures from multiple tasks and unlabeled data mach learn december bengio ducharme vincent and jauvin neural probabilistic language model in jmlr datasets for text categorization http online accessed william chan navdeep jaitly quoc le and oriol vinyals listen attend and spell arxiv preprint dauphin and bengio stochastic ratio matching of rbms for sparse inputs in nips gers schmidhuber and cummins learning to forget continual prediction with lstm neural computation graves generating sequences with recurrent neural networks in arxiv greff srivastava steunebrink and schmidhuber lstm search space odyssey in icml hochreiter bengio frasconi and schmidhuber gradient flow in recurrent nets the difficulty of learning dependencies field guide to dynamical recurrent neural networks hochreiter and schmidhuber long memory neural computation jean cho memisevic and bengio on using very large target vocabulary for neural machine translation in icml johnson and zhang effective use of word order for text categorization with convolutional neural networks in naacl kim convolutional neural networks for sentence classification kiros zhu salakhutdinov zemel torralba urtasun and fidler skipthought vectors in nips krizhevsky convolutional deep belief networks on technical report university of toronto krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips lang newsweeder learning to filter netnews in icml larochelle mandel pascanu and bengio learning algorithms for the classification restricted boltzmann machine jmlr le and mikolov distributed representations of sentences and documents in icml lehmann isele jakob jentzsch kontokostas mendes hellmann morsey van kleef auer et al dbpedia multilingual knowledge base extracted from wikipedia semantic web luong sutskever le vinyals and zaremba addressing the rare word problem in neural machine translation arxiv preprint maas daly pham huang ng and potts learning word vectors for sentiment analysis in acl mcauley and leskovec hidden factors and hidden topics understanding rating dimensions with review text in recsys pages acm mikolov burget and khudanpur recurrent neural network based language model in interspeech ng hausknecht vijayanarasimhan vinyals monga and toderici beyond short snippets deep networks for video classification in cvpr pang and lee seeing stars exploiting class relationships for sentiment categorization with respect to rating scales in acl rumelhart hinton and williams learning representations by errors nature shang lu and li neural responding machine for conversation in emnlp socher huval manning and ng semantic compositionality through recursive spaces in emnlp socher perelygin wu chuang manning ng and potts recursive deep models for semantic compositionality over sentiment treebank in emnlp srivastava mansimov and salakhutdinov unsupervised learning of video representations using lstms in icml sutskever vinyals and le sequence to sequence learning with neural networks in nips vinyals kaiser koo petrov sutskever and hinton grammar as foreign language in nips vinyals and le neural conversational model in icml deep learning workshop vinyals toshev bengio and erhan show and tell neural image caption generator in cvpr wang and manning baselines and bigrams simple good sentiment 
and topic classification in acl werbos beyond regression new tools for prediction and analysis in the behavioral sciences phd thesis harvard zaremba sutskever and vinyals recurrent neural network regularization arxiv preprint zhang and lecun convolutional networks for text classification in nips 
structured transforms for deep learning vikas sindhwani tara sainath sanjiv kumar google new york sindhwani tsainath sanjivk abstract we consider the task of building compact deep learning pipelines suitable for deployment on storage and power constrained mobile devices we propose unified framework to learn broad family of structured parameter matrices that are characterized by the notion of low displacement rank our structured transforms admit fast function and gradient evaluation and span rich range of parameter sharing configurations whose statistical modeling capacity can be explicitly tuned along continuum from structured to unstructured experimental results show that these transforms can significantly accelerate inference and passes during training and offer superior tradeoffs in comparison to number of existing techniques in keyword spotting applications in mobile speech recognition our methods are much more effective than standard linear bottleneck layers and nearly retain the performance of state of the art models while providing more than compression introduction transforms of the form mx where is an elementwise nonlinearity is an input vector and is an matrix of parameters are building blocks of complex deep learning pipelines and function estimators arising in randomized kernel methods when is large general dense matrix the cost of storing mn parameters and computing products in mn time can make it prohibitive to deploy such models on lightweight mobile devices and wearables where battery life is precious and storage is limited this is particularly relevant for mobile applications such as continuously looking for specific keywords spoken by the user or processing live video stream onboard mobile robot in such settings the models may need to be hosted on specialized digital signal processing components which are even more resource constrained than the device cpu parsimonious structure typically imposed on parameter matrices is that of if is rank matrix with min then it has product representation of the form ght where have only columns clearly this representation reduces the storage requirements to mr nr parameters and accelerates the multiplication time to via mx ht another popular structure is that of sparsity typically imposed during optimization via or regularizers other techniques include freezing to be random matrix as motivated via approximations to kernel functions storing in low formats using specific parameter sharing mechanisms or training smaller models on outputs of larger models distillation structured matrices an matrix which can be described in much fewer than mn parameters is referred to as structured matrix typically the structure should not only reduce memory requirements but also dramatically accelerate inference and training via fast products and gradient computations below are classes of structured matrices arising pervasively in many contexts with different types of parameter sharing indicated by the color iii cauchy toeplitz ii vandermonde toeplitz matrices have constant values along each of their diagonals when the same property holds for the resulting class of matrices are called hankel matrices toeplitz and hankel matrices are intimately related to discrete convolutions and arise naturally in time series analysis and dynamical systems vandermonde matrix is determined by taking elementwise powers of its second column very important special case is the complex matrix associated with the discrete fourier transform dft which has vandermonde structure with 
vj ωnj where ωn exp is the primitive nth root of unity similarly the entries of cauchy matrices are completely defined by two length vectors vandermonde and cauchy matrices arise naturally in polynomial and rational interpolation problems superfast numerical linear algebra the structure in these matrices can be exploited for faster linear algebraic operations such as multiplication inversion and factorization in particular the product can be computed in time log for toeplitz and hankel matrices and in time for vandermonde and cauchy matrices displacement operators at first glance these matrices appear to have very different kinds of parameter sharing and consequently very different algorithms to support fast linear algebra it turns out however that each structured matrix class described above can be associated with specific displacement operator which transforms each matrix say in that class into an matrix that has very rank min this displacement rank approach which can be traced back to seminal paper greatly unifies algorithm design and complexity analysis for structured matrices generalizations of structured matrices consider deriving matrix by taking arbitrary linear combinations of products of structured matrices and their inverses where each ti is toeplitz matrix the parameter sharing structure in such derived matrix is by no means apparent anymore yet it turns out that the associated displacement operator remarkably continues to expose the underlying parsimony structure such derived matrices are still mapped to relatively matrices the displacement rank approach allows fast linear algebra algorithms to be seamlessly extended to these broader classes of matrices the displacement rank parameter controls the degree of structure in these generalized matrices technical preview contributions and outline we propose building deep learning pipelines where parameter matrices belong to the class of generalized structured matrices characterized by low displacement rank in section we attempt to give overview of the displacement rank approach drawing key results from the relevant literature on structured matrix computations proved in our supplementary material for completeness in section we show that the proposed structured transforms for deep learning admit fast matrix multiplication and gradient computations and have rich statistical modeling capacity that can be explicitly controlled by the displacement rank hyperparameter covering along continuum an entire spectrum of configurations from highly structured to unstructured matrices while our focus in this paper is on transforms our proposal extends to other structured matrix generalizations in section we study inference and acceleration with structured transforms as function of displacement rank and dimensionality we find that our approach compares highly favorably with numerous other techniques for learning models on several benchmark datasets finally we demonstrate our approach on mobile speech recognition applications where we are able to match the performance of much bigger state of the art models with fraction of parameters notation let en denote the canonical basis elements of rn viewed as column vectors in denote identity and zero matrices respectively jn en is the reflection matrix whose action on vector is to reverse its entries when the dimension is obvious we may drop the subscript for rectangular matrices we may specify both the dimensions explicitly we use for and for all ones column vector of length denotes hadamard elementwise 
product between two vectors for complex vector will denote the vector of complex conjugate of its entries the discrete fourier transform dft matrix will be denoted by or ωn we will also use fft to denote ωx and ifft to denote for vector diag denotes diagonal matrix given by diag ii vi displacement operators associated with structured matrices we begin by providing brisk background on the displacement rank approach unless otherwise specified for notational convenience we will henceforth assume squared transforms and discuss rectangular transforms later proofs of various assertions can be found in our selfcontained supplementary material or in the sylvester displacement operator denoted as is defined by am mb where are fixed matrices referred to as operator matrices closely related is the stein displacement operator denoted as and defined by amb by carefully choosing and one can instantiate sylvester and stein displacement operators with desirable properties in particular for several important classes of displacement operators are chosen to be an matrix defined as follows definition matrix for scalar the matrix denoted by zf is defined as follows zf en the matrix is associated with basic downward transformation the product zf shifts the elements of the column vector downwards and scales and brings the last element vn to the top resulting in vn it has several basic algebraic properties see proposition that are crucial for the results stated in this section figure lists the rank of the sylvester displacement operator in eqn when applied to matrices belonging to various structured matrix classes where the operator matrices in eqn are chosen to be diagonal it can be seen that despite the difference in their structures all these classes are characterized by very low displacement rank figure shows how this transformation happens in the case of toeplitz matrix also see section lemma embedded in the toeplitz matrix are two copies of toeplitz matrix shown in black and red boxes the shift and scale action of and aligns these by taking the difference the sylvester displacement operator nullifies the aligned submatrix leaving rank matrix with elements only along its first row and last column note that the negative sign introduced by term prevents the complete zeroing out of the value of marked by red star and is hence critical for invertibility of the displacement action figure displacement action on toeplitz matrix figure below is rank sh wn do ft ift zt zt diag diag diag diag tsh zt diag zt diag diag lef structured matrix toeplitz hankel vandermonde cauchy each class of structured matrices listed in figure can be naturally generalized by allowing the rank of the displacement operator to be higher specifically given displacement operator and displacement rank parameter one may consider the class of matrices that satisfies rank clearly then ght for rank matrices we refer to rank as the displacement rank of under and to the factors as the associated generators for the operators listed in table these broader classes of structured matrices are correspondingly called and fast numerical linear algebra algorithms extend to such matrices in order to express structured matrices with rank directly as function of its lowdisplacement generators we need to invert and obtain learnable parameterization for stein type displacement operator the following elegant result is known see proof in theorem krylov decomposition if an matrix is such that ght where gr hr and the operator matrices satisfy an ai bi for some scalars 
then can be expressed as krylov gj krylov bt hj ab where krylov is defined by krylov av henceforth our focus in this paper will be on matrices for which the displacement operator of interest see table is of sylvester type in order to apply theorem one can switch between sylvester and stein operators setting and which both satisfy the conditions of theorem see property proposition the resulting expressions involve krylov matrices generated by matrices which are called matrices in the literature definition matrix given vector the matrix zf is defined as follows zf krylov zf fv two special cases are of interest corresponds to circulant matrices and corresponds to matrices finally one can obtain an explicit parameterization for matrices which turns out to involve taking sums of products of circulant and matrices theorem if an matrix satisfies ght where gr hr then can be written as gj jhj learning structured transforms motivated by theorem we propose learning parameter matrices of the form in eqn by optimizing the displacement factors first from the properties of displacement operators it follows that this class of matrices is very rich from statistical modeling perspective theorem richness the set of all matrices that can be written as gi hi for some gr hr contains all circulant and matrices for all toeplitz matrices for inverses of toeplitz matrices for all products of the form at for pp all linear combinations of the form βi at where all matrices for where each ai above is toeplitz matrix or the inverse of toeplitz matrix when we learn parameter matrix structured as eqn with displacement rank equal to or we also search over convolutional transforms in this sense structured transforms with higher displacement rank generalize convolutional layers the displacement rank provides knob on modeling capacity low displacement matrices are highly structured and compact while high displacement matrices start to contain increasingly unstructured dense matrices next we show that associated structured transforms of the form admit fast evaluation and gradient computations with respect to first we recall the following wellknown result concerning the diagonalization of matrices theorem diagonalization of matrices theorem for any let cn and df diag then zf diag ωdf this result implies that for the special cases of and corresponding to circulant and matrices respectively the multiplication can be computed in log time via the fast fourier transform ifft fft fft ifft fft fft ifft fft fft ifft fft fft where where exp nπ the root of negative unity in particular single product for circulant and matrices has the computational cost of ffts therefore for matrices of the form in eqn comprising of products of circulant and matrices naively computing product for batch of input vectors would take ffts however this cost can be significantly lowered to that of rb ffts by making the following observation gi hi diag ωgi diag diag hi where diag here the fft of the parameters ωgi and hi is computed once and shared across multiple input vectors in the minibatch the scaled fft of the input diag is computed once and shared across the sum in eqn and the final inverse fft is also shared thus the following result is immediate theorem fast multiplication given an matrix the product pr gi hi can be computed at the cost of rb ffts using the following algorithm set where exp nπ initialize set fft diag set fft and fft diag for to hi diag ifft diag diag fft set ifft return we now show that when our structured transforms are embedded in 
deep learning pipeline the gradient computation can also be accelerated first we note that the jacobian structure of matrices has the following pleasing form proposition jacobian of transforms the jacobian of the map zf with respect to the parameters is zf this leads to the following expressions for the jacobians of the structured transforms of interest proposition jacobians with respect to displacement generators consider parameterized transforms of the form gi hi the jacobians of with respect to the th column of gj hj at are as follows jgj jhj hj gj pb based on eqns the gradient over minibatch of size requires computing jgj δi pb and jhj δi where xi and δi are batches of forward and backward inputs during backpropagation these can be naively computed with ffts however as before by sharing fft of the forward and backward inputs and the fft of the parameters this can be lowered to ffts below we give matricized implementation proposition fast gradients let be matrices whose columns are forward and backward inputs respectively of minibatch size during backpropagation the gradient with respect to gj hj can be computed at the cost of ffts as follows compute fft fft diag fft fft diag gradient wrt gj ffts return ifft fft diag ifft diag gradient wrt hj ffts return diag ifft fft diag ifft diag rectangular transforms variants of theorems exist for rectangular transforms see alternatively for we can subsample the outputs of square transforms at the cost of extra computations while for assuming is multiple of we can stack output vectors of square transforms empirical studies acceleration with structured transforms in figure we analyze the speedup obtained in practice using circulant and matrices relative to dense unstructured matrix fully connected layer as function of displacement rank and dimension three scenarios are considered inference speed per test instance training speed as implicitly dictated by forward passes on minibatch and gradient computations on minibatch factors such as differences in cache optimization simd vectorization and multithreading between blas multiplication blas multiplication and fft implementations we use fftw http influence the speedup observed in practice speedup gains start to show for dimensions as small as for circulant matrices the gains become dramatic with acceleration of the order of to times for several thousand dimensions even for higher displacement rank transforms speedup unstructured structured inference gradient minibatch forward pass minibatch displacement rank displacement rank displacement rank figure acceleration with structured transforms intel xeon machine random datasets in the plot displacement rank corresponds to circulant transform effectiveness for learning compact neural networks next we compare the proposed structured transforms with several existing techniques for learning compact feedforward neural networks we exactly replicate the experimental setting from the recent paper on ashed ets which uses several image classification datasets first prepared by mnist is the original mnist digit classification dataset with training examples and test examples bg img rot refers to challenging version of mnist where digits are randomly rotated and placed against random black and white background rect training images test images and convex training images test images are binary image datasets where the task is to distinguish between tall and wide rectangles and whether the on pixels form convex region or not respectively in all datasets input images are of 
size several existing techniques are benchmarked in for compressing reference single hidden layer model with hidden nodes random edge removal rer where fraction of weights are randomly frozen to be decomposition lrd neural network nn where the hidden layer size is reduced to satisfy parameter budget dark knowledge dk small neural network is trained with respect to both the original labeled data as well as soft targets generated by full uncompressed neural network hashednets hn this approach uses hash function to randomly group connection weights which share the same value hashednets with dark knowledge hndk trains hashednet with respect to both the original labeled data as well as soft targets generated by full uncompressed neural network we consider learning models of comparable size with the weights in the hidden layer structured as matrix we also compare with the fastfood approach of where the weight matrix is product of diagonal parameter matrices and fixed permutation and matrices also admitting log multiplication and gradient computation time the irculant neural network approach proposed in is special case of our framework theorem results in table show that structured transforms outperform all competing approaches on all datasets sometimes by very significant margin with similar or drastically lesser number of parameters it should also be noted that while random weight tying in ashed ets reduces the number of parameters the lack of structure in the resulting weight matrix can not be exploited for log multiplication time we note in passing that for ashed ets weight matrices whose entries assume only one of distinct values the mailman algorithm can be used for faster multiplication with complexity log log which still is much slower than multiplication time for matrices also note that the distillation ideas of are complementary to our approach and can further improve our results mnist bg img rot convex rect rer lrd nn dk hn hndk fastfood irculant oeplitz oeplitz table error rate and number of parameters italicized best results in blue oeplitz mobile speech recognition we now demonstrate the techniques developed in this paper on speech recognition application meant for mobile deployment specifically we consider keyword spotting kws task where deep neural network is trained to detect specific phrase such as ok google the data used for these experiments consists of utterances of selected phrases such as and larger set of utterances to serve as negative training examples the utterances were randomly split into training development and evaluation sets in the ratio of we created noisy evaluation set by artificially adding cafeteria noise at snr to the clean data set we will refer to this noisy data set as cafe we refer the reader to for more details about the datasets we consider the task of shrinking large model for this task whose architecture is as follows the input layer consists of dimensional filterbanks stacked with temporal context of to produce an input of whose dimensions are in time and frequency respectively this input is fed to convolutional layer with filter size frequency stride and filters the output of the convolutional layer is of size the output of this layer is fed to fully connected layer followed by softmax layer for predicting classes constituting the phrase playmusic the full training set contains about million samples we use asynchronous distributed stochastic gradient descent sgd in parameter server framework with worker nodes for optimizing various models the global 
learning rate is set to while our structured transform layers use learning rate of both are decayed by an exponential factor of accuracy fullyconnected reference fastfood accuracy false rejects circulant fullyconnected reference circulant fastfood false alarms per hour time hours figure detection performance left keyword spotting performance in terms of false reject fr rate per false alarm fa rate lower is better right classification accuracy as function of training time displacement rank is in parenthesis for models results with different models are reported in figure left including the state of the art keyword spotting model developed in at an operating point of false alarm per hour the following observations can be made with just parameters displacement oeplitz like structured transform outperforms standard bottleneck model with containing times more parameters it also lowers false reject rates from with irculant and with fastfood transforms to about with displacement rank the false reject rate is in comparison to with the times larger standard bottleneck model our best model comes within of the performance of the larger and times larger reference models in terms of raw classification accuracy as function of training time figure right shows that our models with displacement ranks and come within accuracy of the and reference models and easily provide much better tradeoffs in comparison to standard bottleneck models circulant and fastfood baselines the conclusions are similar for other noise conditions see supplementary material perspective we have introduced and shown the effectiveness of new notions of parsimony rooted in the theory of structured matrices our proposal can be extended to various other structured matrix classes including block and matrices related to multidimensional convolution we hope that such ideas might lead to new generalizations of convolutional neural networks acknowledgements we thank chen carolina parada rohit prabhavalkar alex gruenstein rajat monga baris sumengen kilian weinberger and wenlin chen for their contributions references supplementary material structured transforms for small footprint deep learning http chen parada and heigold keyword spotting using deep neural networks in icassp chen wilson tyree weinberger and chen compressing neural networks with the hashing trick in icml cheng xu feris kumar choudhary and chang fast neural networks with circulant projections in ciresan meier masci gambardella and schmidhuber neural networks for visual object classification in collins and kohli deep convolutional neural networks in icassp courbariaux david and bengio storage for deep learning in iclr dean corrado monga chen devin le mao ranzato senior tucker yang and ng distributed deep networks in nips denil shakibi dinh and de freitas predicting parameters in deep learning in nips gray toeplitz and circulant matrices review foundations and trends in communications and information theory hinton vinyals and dean distilling the knowledge in neural network in nips workshop kailath and chun generalized displacement structure for block toeplitz toeplitz block and matrices siam matrix anal kailath kung and morf displacement ranks of matrices and linear equations journal of mathematical analysis and applications pages kailath and sayed displacement structure theory and applications siam review larochelle erhan courville bergstra and bengio an empirical evaluation of deep architectures on problems with many factors of variation in icml le sarlos and smola fastfood 
approximating kernel expansions in loglinear time in icml liberty and zucker the mailman algorithm note on matrix vector multiplication in information processing letters pan structured matrices and polynomials unified superfast algorithms springer pan inversion of displacement operators siam journal of matrix analysis and applications pages rahimi and recht random features for kernel machines in nips rakhuba and oseledets fast multidimensional convolution in tensor formats via cross approximation siam sci sainath kingsbury sindhwani arisoy and ramabhadran matrix factorization for deep neural network training with output targets in icassp sainath and parada convolutional neural networks for keyword spotting in proc interspeech vanhoucke senior and mao improving the speed of neural networks on cpus in nips workshop on deep learning and unsupervised feature learning yang moczulski denil de freitas smola song and wang deep fried convnets in 
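as a concrete illustration of the fast multiplication result described in the structured transforms paper above, the following NumPy sketch applies a Toeplitz-like matrix, given by its displacement generators G and H, to a vector using only FFTs. the parameterization M = sum_j Z_1(g_j) Z_{-1}(J h_j) and the diagonal-modulation identity for skew-circulant matrices are our reading of the (partly garbled) theorem statements; the function names are ours and this is not the authors' FFTW-based implementation.

```python
import numpy as np

def f_circulant(v, f):
    # Dense n x n f-unit-circulant matrix Z_f(v): columns are v, Z_f v, Z_f^2 v, ...
    # (f = 1 gives a circulant, f = -1 a skew-circulant). Used only for checking.
    n = len(v)
    M = np.empty((n, n), dtype=complex)
    col = np.array(v, dtype=complex)
    for j in range(n):
        M[:, j] = col
        col = np.roll(col, 1)
        col[0] *= f  # the entry that wraps around the bottom is scaled by f
    return M

def circulant_mv(g, x):
    # Z_1(g) @ x in O(n log n) time via the convolution theorem.
    return np.fft.ifft(np.fft.fft(g) * np.fft.fft(x))

def skew_circulant_mv(h, x):
    # Z_{-1}(h) @ x via the identity Z_{-1}(h) = D^{-1} Z_1(D h) D,
    # where D = diag(eta^0, ..., eta^{n-1}) and eta = exp(i*pi/n).
    n = len(h)
    d = np.exp(1j * np.pi * np.arange(n) / n)
    return np.conj(d) * circulant_mv(d * h, d * x)

def toeplitz_like_mv(G, H, x):
    # Product M @ x for the Toeplitz-like matrix M = sum_j Z_1(G[:, j]) Z_{-1}(J H[:, j]),
    # where G, H are the n x r displacement generators and J reverses vector entries.
    y = np.zeros(len(x), dtype=complex)
    for j in range(G.shape[1]):
        y += circulant_mv(G[:, j], skew_circulant_mv(H[::-1, j], x))
    return y

# Sanity check against the explicit dense matrix for a small example.
rng = np.random.default_rng(0)
n, r = 8, 3
G, H, x = rng.standard_normal((n, r)), rng.standard_normal((n, r)), rng.standard_normal(n)
M = sum(f_circulant(G[:, j], 1) @ f_circulant(H[::-1, j], -1) for j in range(r))
assert np.allclose(M @ x, toeplitz_like_mv(G, H, x))
```

the sketch makes the cost structure visible: each term of the sum costs a constant number of length-n FFTs, so the full product costs O(r n log n) instead of the O(n^2) of a dense layer.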
rapidly mixing gibbs sampling for class of factor graphs using hierarchy width christopher de sa ce zhang kunle olukotun and christopher cdesa czhang kunle chrismre departments of electrical engineering and computer science stanford university stanford ca abstract gibbs sampling on factor graphs is widely used inference technique which often produces good empirical results theoretical guarantees for its performance are weak even for tree structured graphs the mixing time of gibbs may be exponential in the number of variables to help understand the behavior of gibbs sampling we introduce new hyper graph property called hierarchy width we show that under suitable conditions on the weights bounded hierarchy width ensures polynomial mixing time our study of hierarchy width is in part motivated by class of factor graph templates hierarchical templates which have bounded hierarchy of the data used to instantiate them we demonstrate rich application from natural language processing in which gibbs sampling provably mixes rapidly and achieves accuracy that exceeds human volunteers introduction we study inference on factor graphs using gibbs sampling the de facto markov chain monte carlo mcmc method specifically our goal is to compute the marginal distribution of some query variables using gibbs sampling given evidence about some other variables and set of factor weights we focus on the case where all variables are discrete in this situation gibbs sampler randomly updates single variable at each iteration by sampling from its conditional distribution given the values of all the other variables in the model many as factorie openbugs pgibbs dimmwitted and others gibbs sampling for inference because it is fast to run simple to implement and often produces high quality empirical results however theoretical guarantees about gibbs are lacking the aim of the technical result of this paper is to provide new cases in which one can guarantee that gibbs gives accurate results for an mcmc sampler like gibbs sampling the standard measure of efficiency is the mixing time of the underlying markov chain we say that gibbs sampler mixes rapidly over class of models if its mixing time is at most polynomial in the number of variables in the model gibbs sampling is known to mix rapidly for some models for example gibbs sampling on the ising model on graph with bounded degree is known to mix in quasilinear time for high temperatures recent work has outlined conditions under which gibbs sampling of markov random fields mixes rapidly gibbs sampling over models with distributions is also known to mix rapidly each of these celebrated results still leaves gap there are many classes of factor graphs on which gibbs sampling seems to work very as part of systems that have won quality competitions which there are no theoretical guarantees of rapid mixing many graph algorithms that take exponential time in general can be shown to run in polynomial time as long as some graph property is bounded for inference on factor graphs the most commonly used property is hypertree width which bounds the complexity of dynamic programming algorithms on the graph many problems including variable elimination for exact inference can be solved in polynomial time on graphs with bounded hypertree width in some sense bounded hypertree width is necessary and sufficient condition for tractability of inference in graphical models unfortunately it is not hard to construct examples of factor graphs with bounded weights and hypertree width for which gibbs 
sampling takes exponential time to mix therefore bounding hypertree width is insufficient to ensure rapid mixing of gibbs sampling to analyze the behavior of gibbs sampling we define new graph property called the hierarchy width this is stronger condition than hypertree width the hierarchy width of graph will always be larger than its hypertree width we show that for graphs with bounded hierarchy width and bounded weights gibbs sampling mixes rapidly our interest in hierarchy width is motivated by factor graph templates which are common in practice several types of models such as markov logic networks mln and relational markov networks rmn can be represented as factor graph templates many systems use gibbs sampling on factor graph templates and achieve better results than competitors using other algorithms we exhibit class of factor graph templates called hierarchical templates which when instantiated have hierarchy width that is bounded independently of the dataset used gibbs sampling on models instantiated from these factor graph templates will mix in polynomial time this is kind of sampling analog to tractable markov logic or safe plans in probabilistic databases we exhibit templated program that outperforms human annotators at complex text extraction provably mixes in polynomial time in summary this work makes the following contributions we introduce new notion of width hierarchy width and show that gibbs sampling mixes in polynomial time for all factor graphs with bounded hierarchy width and factor weight we describe new class of factor graph templates hierarchical factor graph templates such that gibbs sampling on instantiations of these templates mixes in polynomial time we validate our results experimentally and exhibit factor graph templates that achieve high quality on tasks but for which our new theory is able to provide mixing time guarantees related work gibbs sampling is just one of several algorithms proposed for use in factor graph inference the variable elimination algorithm is an exact inference method that runs in polynomial time for graphs of bounded hypertree width belief propagation is another inference algorithm that produces an exact result for trees and although it does not converge in all cases converges to good approximation under known conditions lifted inference is one way to take advantage of the structural symmetry of factor graphs that are instantiated from template there are lifted versions of many common algorithms such as variable elimination belief propagation and gibbs sampling it is also possible to leverage template for fast computation venugopal et al achieve orders of magnitude of speedup of gibbs sampling on mlns compared with gibbs sampling these inference algorithms typically have better theoretical results despite this gibbs sampling is ubiquitous algorithm that performs practically outstripping its guarantees our approach of characterizing runtime in terms of graph property is typical for the analysis of graph algorithms many algorithms are known to run in polynomial time on graphs of bounded treewidth despite being otherwise sometimes using stronger or weaker property than treewidth will produce better result for example the submodular width used for constraint satisfaction problems main result in this section we describe our main contribution we analyze some simple example graphs and use them to show that bounded hypertree width is not sufficient to guarantee rapid mixing of gibbs sampling drawing intuition from this we define the hierarchy 
width graph property and prove that Gibbs sampling mixes in polynomial time for graphs with bounded hierarchy width.

Figure: factor graph diagrams for the voting model, one for linear semantics (each voter connected to the query by its own factor) and one for logical/ratio semantics (all t_i connected to the query through a single factor φ_T, and all f_i through φ_F); prior factors are omitted.

First, we state some basic definitions. A factor graph G = (V, Φ) is a graphical model that consists of a set of variables V and factors Φ, and determines a distribution over those variables. If I is a world for G, i.e. an assignment of a value to each variable in V, then the energy of the world is defined as the sum of the factor values, E(I) = Σ_{φ∈Φ} φ(I), and the probability of the world is π(I) = Z^{-1} exp(E(I)), where Z is the normalization constant necessary for this to be a distribution. Typically each factor φ depends only on a subset of the variables, so we can draw G as a bipartite graph where a variable is connected to a factor if the factor depends on it.

Definition (mixing time). The mixing time of a Markov chain is the first time t at which the estimated distribution µ_t is within a fixed statistical distance (conventionally 1/4) of the true distribution π, that is, t_mix = min{ t : ||µ_t − π||_TV ≤ 1/4 }.

Voting example. We start by considering a simple example model, called the voting model, which models the sign of a particular query variable q in the presence of other voter variables t_i and f_i (for i = 1, ..., n) that suggest that q is positive and negative (true and false), respectively. We consider three versions of this model. The first, the voting model with linear semantics, has energy function
E(q, t, f) = Σ_{i=1}^n w_q q t_i − Σ_{i=1}^n w_q q f_i + Σ_{i=1}^n w_{t_i} t_i + Σ_{i=1}^n w_{f_i} f_i,
where w_q, w_{t_i}, and w_{f_i} are constant weights. This model has a factor connecting each voter variable to the query, which represents the value of that vote, and an additional factor that gives a prior for each voter; it corresponds to the first factor graph diagram in the figure above. The second version, the voting model with logical semantics, has energy function
E(q, t, f) = w_q q max_i t_i − w_q q max_i f_i + Σ_{i=1}^n w_{t_i} t_i + Σ_{i=1}^n w_{f_i} f_i.
Here, in addition to the prior factors, there are only two other factors, one of which (which we call φ_T) connects all the t_i to the query, and the other of which (φ_F) connects all the f_i to the query. The third version, the voting model with ratio semantics, is an intermediate between these two models, and has energy function
E(q, t, f) = w_q q log(1 + Σ_{i=1}^n t_i) − w_q q log(1 + Σ_{i=1}^n f_i) + Σ_{i=1}^n w_{t_i} t_i + Σ_{i=1}^n w_{f_i} f_i.
With either logical or ratio semantics, this model can be drawn as the second factor graph diagram in the figure above. These three cases model different distributions, and therefore different ways of representing the power of a vote; the choice of names is motivated by considering the marginal odds of q given the other variables. For linear semantics, the odds of q depend linearly on the difference between the number of nonzero t_i and nonzero f_i; for ratio semantics, the odds of q depend roughly on their ratio; for logical semantics, only the presence of nonzero voters matters, not the number of voters.

We instantiated this model with random weights w_{t_i} and w_{f_i}, ran Gibbs sampling on it, and computed the variance of the estimated marginal probability of q for the different models (see the figure below). The results show that the models with logical and ratio semantics produce much lower-variance estimates than the model with linear semantics. This experiment motivates us to try to prove a bound on the mixing time of Gibbs sampling on this model.

Figure: convergence for the voting model with random prior weights; the panels plot the variance of the marginal estimate of q against the number of iterations (thousands in one panel, millions in the other) for linear, ratio, and logical semantics.

Theorem. Fix any constant M, and run Gibbs sampling on the voting model with bounded factor weights |w_q| ≤ M, |w_{t_i}| ≤ M, and |w_{f_i}| ≤ M. For the voting model with linear semantics, the largest possible mixing time of any such model is exponential in n; for the voting
model with either logical or ratio semantics the largest possible mixing time is tmix log this result validates our observation that linear semantics mix poorly compared to logical and ratio semantics intuitively the reason why linear semantics performs worse is that the gibbs sampler will switch the state of only very fact exponentially so this is because the energy roughly depends linearly on the number of voters and therefore the probability of switching depends exponentially on this does not happen in either the logical or ratio models hypertree width in this section we describe the graph property of hypertree width and show using the voting example that bounding it is insufficient to ensure rapid gibbs sampling hypertree width is typically used to bound the complexity of dynamic programming algorithms on graph in particular variable elimination for exact inference runs in polynomial time on factor graphs with bounded hypertree width the hypertree width of hypergraph which we denote tw is generalization of the notion of acyclicity since the definition of hypertree width is technical we instead state the definition of an acyclic hypergraph which is sufficient for our analysis in order to apply these notions to factor graphs we can represent factor graph as hypergraph that has one vertex for each node of the factor graph and one hyperedge for each factor where that hyperedge contains all variables the factor depends on definition acyclic factor graph join tree also called junction tree of factor graph is tree such that the nodes of are the factors of and if two factors and both depend on the same variable in then every factor on the unique path between and in also depends on factor graph is acyclic if it has join tree all acyclic graphs have hypertree width tw note that all trees are acyclic in particular the voting model with any semantics has hypertree width since the voting model with linear semantics and bounded weights mixes in exponential time theorem this means that bounding the hypertree width and the factor weights is insufficient to ensure rapid mixing of gibbs sampling hierarchy width since the hypertree width is insufficient we define new graph property the hierarchy width which when bounded ensures rapid mixing of gibbs sampling this result is our main contribution definition hierarchy width the hierarchy width hw of factor graph is defined recursively such that for any connected factor graph hv φi hw min hw hv and for any disconnected factor graph with connected components hw max hw gi as base case all factor graphs with no factors have hw hv to develop some intuition about how to use the definition of hierarchy width we derive the hierarchy width of the path graph drawn in figure vn figure factor graph diagram for an path graph lemma the path graph model has hierarchy width hw ne proof let gn denote the path graph with variables for the lemma follows from for gn is connected so we must compute its hierarchy width by applying it turns out that the factor that minimizes this expression is the factor in the middle and so applying followed by shows that hw gn hw gd applying this inductively proves the lemma similarly we are able to compute the hierarchy width of the voting model factor graphs lemma the voting model with logical or ratio semantics has hierarchy width hw lemma the voting model with linear semantics has hierarchy width hw these results are promising since they separate our examples from our examples however the hierarchy width of factor graph says nothing about 
the factors themselves and the functions they compute this means that it alone tells us nothing about the model for example any distribution can be represented by trivial factor graph with single factor that contains all the variables therefore in order to use hierarchy width to produce result about the mixing time of gibbs sampling we constrain the maximum weight of the factors definition maximum factor weight factor graph has maximum factor weight where max max min for example the maximum factor weight of the voting example with linear semantics is with logical semantics it is and with ratio semantics it is log we now show that graphs with bounded hierarchy width and maximum factor weight mix rapidly theorem polynomial mixing time if is factor graph with variables at most states per variable factors maximum factor weight and hierarchy width then tmix log log em exp in particular if is polynomial in the number of values for each variable is bounded and hm log then tmix no to show why bounding the hierarchy width is necessary for this result we outline the proof of theorem our technique involves bounding the absolute spectral gap of the transition matrix of gibbs sampling on graph there are standard results that use the absolute spectral gap to bound the mixing time of process our proof proceeds via induction using the definition of hierarchy width and the following three lemmas lemma connected case let and be two factor graphs with maximum factor weight which differ only inasmuch as contains single additional factor then exp lemma disconnected case let be disconnected factor graph with variables and connected components gm with nm variables respectively then ni min gi lemma base case let be factor graph with one variable and no factors the absolute spectral gap of gibbs sampling running on will be using these lemmas inductively it is not hard to show that under the conditions of theorem exp converting this to bound on the mixing time produces the result of theorem to gain more intuition about the hierarchy width we compare its properties to those of the hypertree width first we note that when the hierarchy width is bounded the hypertree width is also bounded statement for any factor graph tw hw one of the useful properties of the hypertree width is that for any fixed computing whether graph has hypertree width tw can be done in polynomial time in the size of we show the same is true for the hierarchy width statement for any fixed computing whether hw can be done in time polynomial in the number of factors of finally we note that we can also bound the hierarchy width using the degree of the factor graph notice that graph with unbounded node degree contains the voting program with linear semantics as subgraph this statement shows that bounding the hierarchy width disallows such graphs statement let be the maximum degree of variable in factor graph then hw factor graph templates our study of hierarchy width is in part motivated by the desire to analyze the behavior of gibbs sampling on factor graph templates which are common in practice and used by many systems factor graph template is an abstract model that can be instantiated on dataset to produce factor graph the dataset consists of objects each of which represents thing we want to reason about which are divided into classes for example the object bart could have class person and the object twilight could have class movie there are many ways to define templates here we follow the formulation in koller and friedman factor graph template 
consists of set of template variables and template factors template variable represents property of tuple of zero or more objects of particular classes for example we could have an ispopular template which takes single argument of class movie in the instantiated graph this would take the form of multiple variables like ispopular twilight or ispopular avengers template factors are replicated similarly to produce multiple factors in the instantiated graph for example we can have template factor tweetedabout ispopular for some factor function this would be instantiated to factors like tweetedabout avengers bart ispopular avengers we call the and in template factor object symbols for an instantiated factor graph with template factors if we let aφ denote the set of possible assignments to the object symbols in template factor and let denote the value of its factor function in world under the object symbol assignment then the standard way to define the energy function is with wφ where wφ is the weight of template factor this energy function results from the creation of single factor φa for each object symbol assignment of unfortunately this standard energy definition is not suitable for all applications to deal with this shin et al introduce the notion of semantic function which counts the of energy of instances of the factor template in way in order to do this they first divide the object symbols of each template factor into two groups the head symbols and the body symbols when writing out factor templates we distinguish head symbols by writing them with hat like if we let hφ denote the set of possible assignments to the head symbols let bφ denote the set of possible assignments voting linear bounded factor weight bounded hypertree width polynomial mixing time bounded hierarchy width hierarchical templates voting voting logical ratio figure subset relationships among classes of factor graphs and locations of examples to the body symbols and let denote the value of its factor function in world under the assignment then the energy of world is defined as wφ this results in the creation of single factor φh for each assignment of the template head symbols we focus on three semantic functions in particular for the first linear semantics this is identical to the standard semantics in for the second logical semantics sgn for the third ratio semantics sgn log these semantics are analogous to the different semantics used in our voting example shin et al exhibit several classification problems where using logical or ratio semantics gives better scores hierarchical factor graphs in this section we outline class of templates hierarchical templates that have bounded hierarchy width we focus on models that have hierarchical structure in their template factors for example should have hierarchical structure while should not armed with this intuition we give the following definitions definition hierarchy depth template factor has hierarchy depth if the first object symbols that appear in each of its terms are the same we call these symbols hierarchical symbols for example has hierarchy depth and and are hierarchical symbols also has hierarchy depth and no hierarchical symbols definition hierarchical we say that template factor is hierarchical if all of its head symbols are hierarchical symbols for example is hierarchical while is not we say that factor graph template is hierarchical if all its template factors are hierarchical we can explicitly bound the hierarchy width of instances of hierarchical factor graphs 
Lemma. If G is an instance of a hierarchical template with a fixed number of template factors, then hw(G) is bounded independently of the size of the dataset used to instantiate it.

We would now like to use the theorem above to prove a bound on the mixing time. This requires us to bound the maximum factor weight of the graph. Unfortunately, for linear semantics the maximum factor weight of the graph can grow with the size of the dataset, so applying the theorem won't get us useful results. Fortunately, for logical or ratio semantics, hierarchical factor graphs do mix in polynomial time.

Statement. For any fixed hierarchical factor graph template, if G is an instance of that template with bounded weights using either logical or ratio semantics, then the mixing time of Gibbs sampling on G is polynomial in the number of objects n in its dataset, that is, t_mix = n^{O(1)}.

So, if we want to construct models with Gibbs samplers that mix rapidly, one way to do it is with hierarchical factor graph templates using logical or ratio semantics.

Experiments: synthetic data. We constructed a synthetic dataset by using an ensemble of Ising model graphs, each a tree with the same number of nodes and edges but with a different hierarchy width. These graphs ranged from the star graph (like the voting model diagram above) to the path graph. For each graph we were able to calculate the exact true marginal of each variable, because the graphs are small trees for which exact inference is tractable. We then ran Gibbs sampling on each graph and calculated the error of the marginal estimate of a single query variable. The results (figure below) show that, even for tree graphs with the same number of nodes and edges, the mixing time can still vary depending on the hierarchy width of the model.

Figure: experiments illustrating how convergence is affected by hierarchy width and semantics; the panels show the error of the marginal estimates for the synthetic Ising models as a function of hierarchy width and of the number of iterations per variable (linear, ratio, and logical semantics), and the maximum error of the marginal estimates for the KBP dataset as a function of the number of samples.

Applications. We observed that the hierarchical templates that we focus on in this work appear frequently in real applications. For example, all five knowledge base population (KBP) systems illustrated by Shin et al. contain subgraphs that are grounded by hierarchical templates. Moreover, sometimes a factor graph is solely grounded by hierarchical templates, and thus provably mixes rapidly by our theorem while achieving high quality. To validate this, we constructed a hierarchical template for the paleontology application used by Peters et al. We found that, when using the ratio semantics, we were able to obtain an F1 score and precision on this task that are actually higher than those of professional human volunteers; for comparison, the linear and logical semantics achieved lower F1 scores on the same task. The factor graph we used in this paleontology application is large enough that estimating the true marginals with exact inference is intractable. To investigate the mixing behavior, we therefore chose a subgraph of the KBP system used by Shin et al. that can be grounded by a hierarchical template, and chose a setting of the weights such that the true marginal was the same known value for all variables. We then ran Gibbs sampling on this subgraph and report the average error of the marginal estimates in the figure above. Our results illustrate the effect of changing the semantics on a more complicated model from a real application, and show behavior similar to our simple voting example.

Conclusion. This paper showed that, for a class of factor graph templates, hierarchical templates, Gibbs
sampling mixes in polynomial time it also introduced the graph property hierarchy width and showed that for graphs of bounded factor weight and hierarchy width gibbs sampling converges rapidly these results may aid in better understanding the behavior of gibbs sampling for both template and general factor graphs acknowledgments thanks to stefano ermon and percy liang for helpful conversations the authors acknowledge the support of darpa nsf nsf doe nsf darpa nsf onr and nih oracle nvidia huawei sap labs sloan research fellowship moore foundation american family insurance google and toshiba references venkat chandrasekaran nathan srebro and prahladh harsha complexity of inference in graphical models arxiv preprint persi diaconis kshitij khare and laurent gibbs sampling exponential families and orthogonal polynomials statist may persi diaconis kshitij khare and laurent gibbs sampling conjugate priors and coupling sankhya pedro domingos and william austin webb tractable probabilistic logic in aaai joseph gonzalez yucheng low arthur gretton and carlos guestrin parallel gibbs sampling from colored fields to thin junction trees in aistats pages georg gottlob gianluigi greco and francesco scarcello treewidth and hypertree width tractability practical approaches to hard problems page alexander ihler john iii and alan willsky loopy belief propagation convergence and effects of message errors in journal of machine learning research pages daphne koller and nir friedman probabilistic graphical models principles and techniques mit press johan kwisthout hans bodlaender and linda van der gaag the necessity of bounded treewidth for efficient inference in bayesian networks in ecai pages david asher levin yuval peres and elizabeth lee wilmer markov chains and mixing times american mathematical xianghang liu and justin domke projecting markov random field parameters for fast mixing in ghahramani welling cortes lawrence and weinberger editors advances in neural information processing systems pages curran associates david lunn david spiegelhalter andrew thomas and nicky best the bugs project evolution critique and future directions statistics in medicine marx tractable hypergraph properties for constraint satisfaction and conjunctive queries journal of the acm jacm andrew mccallum karl schultz and sameer singh factorie probabilistic programming via imperatively defined factor graphs in nips pages david newman padhraic smyth max welling and arthur asuncion distributed inference for latent dirichlet allocation in nips pages kee siong ng john lloyd and william tb uther probabilistic modelling inference and learning using logical theories annals of mathematics and artificial intelligence shanan peters ce zhang miron livny and christopher machine reading system for assembling synthetic paleontological databases plos one david poole probabilistic inference in ijcai pages citeseer neil robertson and paul seymour graph minors ii algorithmic aspects of journal of algorithms jaeho shin sen wu feiran wang christopher de sa ce zhang feiran wang and christopher incremental knowledge base construction using deepdive pvldb parag singla and pedro domingos lifted belief propagation in aaai pages alexander smola and shravan narayanamurthy an architecture for parallel topic models pvldb dan suciu dan olteanu christopher and christoph koch probabilistic databases synthesis lectures on data management mihai surdeanu and heng ji overview of the english slot filling track at the knowledge base population evaluation lucas theis 
jascha and matthias bethge training sparse natural image models with fast gibbs sampler of an extended state space in nips pages deepak venugopal and vibhav gogate on lifting the gibbs sampling algorithm in pereira burges bottou and weinberger editors nips pages curran associates deepak venugopal somdeb sarkhel and vibhav gogate just count the satisfied groundings scalable and sampling based inference in mlns in aaai conference on artificial intelligence ce zhang and christopher dimmwitted study of statistical analytics pvldb 
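To make the marginal-estimation experiment above concrete, the following is a minimal Python sketch of the measurement it describes: run Gibbs sampling on a small factor graph (here a chain-structured Ising model with a single uniform edge weight, a simpler stand-in for the ensemble of graphs with varying hierarchy width used in the experiments) and compare the empirical marginal of one query variable against the exact marginal obtained by brute-force enumeration. The graph size, edge weight, number of sweeps, and helper names are illustrative assumptions, not settings taken from the paper.

```python
import itertools
import numpy as np

def ising_chain_logp(x, w):
    # Unnormalized log-probability of a chain Ising model with uniform edge weight w.
    return w * sum(x[i] * x[i + 1] for i in range(len(x) - 1))

def exact_marginal(n, w, query=0):
    # Brute-force P(x_query = +1) by enumerating all 2^n configurations (feasible for small n).
    num, den = 0.0, 0.0
    for x in itertools.product([-1, 1], repeat=n):
        p = np.exp(ising_chain_logp(x, w))
        den += p
        if x[query] == 1:
            num += p
    return num / den

def gibbs_marginal(n, w, query=0, sweeps=2000, seed=0):
    # Gibbs sampling: resample each variable from its conditional, track how often x_query = +1.
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=n)
    hits = 0
    for _ in range(sweeps):
        for i in range(n):
            # Conditional of x_i given its neighbours in the chain.
            field = w * ((x[i - 1] if i > 0 else 0) + (x[i + 1] if i < n - 1 else 0))
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
            x[i] = 1 if rng.random() < p_plus else -1
        hits += (x[query] == 1)
    return hits / sweeps

n, w = 10, 0.5
print("exact :", exact_marginal(n, w))
print("gibbs :", gibbs_marginal(n, w))
```

Running the script prints the exact and sampled marginals; increasing the number of sweeps should tighten the agreement, which is exactly the error-versus-iterations quantity plotted in the synthetic experiments above.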
interpolating convex and tensor decompositions via the subspace norm ryota tomioka toyota technological institute at chicago tomioka qinqing zheng university of chicago qinqing abstract we consider the problem of recovering tensor from its noisy observation previous work has shown recovery guarantee with signal to noise ratio for recovering kth order rank one tensor of size by recursive unfolding in this paper we first improve this bound to by much simpler approach but with more careful analysis then we propose new norm called the subspace norm which is based on the kronecker products of factors obtained by the proposed simple estimator the imposed kronecker structure allows us to show nearly ideal bound in which the parameter controls the blend from the estimator to nuclear norm minimization furthermore we empirically demonstrate that the subspace norm achieves the nearly ideal denoising performance even with introduction tensor is natural way to express higher order interactions for variety of data and tensor decomposition has been successfully applied to wide areas ranging from chemometrics signal processing to neuroimaging see for survey moreover recently it has become an active area in the context of learning latent variable models many problems related to tensors such as finding the rank or best approaximation of tensor is known to be np hard nevertheless we can address statistical problems such as how well we can recover tensor from its randomly corrupted version tensor denoising or from partial observations tensor completion since we can convert tensor into matrix by an operation known as unfolding recent work has shown that we do get nontrivial guarantees by using some norms or singular value decompositions more specifically richard montanari has shown that when kth order tensor of size is corrupted by standard gaussian noise nontrivial bound can be shown with high probability if the noise ratio by method called the recursive note that is sufficient for matrices and also for tensors if we use the best approximation which is known to be np hard as an estimator on the other hand jain oh analyzed the tensor completion problem and proposed an algorithm that requires samples for while information theoretically we need at least samples and the intractable maximum likelihood estimator would require polylog samples therefore in both settings there is wide gap between the ideal estimator and current polynomial time algorithms subtle question that we will address in this paper is whether we need to unfold the tensor so that the resulting matrix become as square as possible which was the reasoning underlying both as parallel development estimators based on alternating minimization or nonlinear optimization have been widely applied and have performed very well when appropriately we say an bn if there is constant such that an bn table comparison of required ratio of different algorithms for recovering kth order rank one tensor of size contaminated by gaussian noise with standard deviation see model the bound for the ordinary unfolding is shown in corollary the bound for the subspace norm is shown in theorem the ideal estimator is proven in appendix latent nuclear norm recursive unfolding square norm ordinary unfolding subspace norm proposed ideal nk log set up therefore it would be of fundamental importance to connect the wisdom of estimators with the more theoretically motivated estimators that recently emerged in this paper we explore such connection by defining new norm based on kronecker 
products of factors that can be obtained by simple singular value decomposition svd of unfoldings see notation section below also known as the singular value decomposition hosvd we first study the behavior of the leading singular vector from the ordinary rectangular unfolding and show nontrivial bound for signal to noise ratio thus the result also applies to odd order tensors confirming conjecture in furthermore this motivates us to use the solution of truncated svds to construct new norm we propose the subspace norm which predicts an unknown tensor as mixture of tensors in which each term takes the form foldk where foldk is the inverse of unfolding denotes the kronecker product and is orthonormal matrix estimated from the unfolding of the observed tensor for is parameter and our theory tells us that with sufficiently high ratio the estimated spans the true factors we highlight our contributions below we prove that the required ratio for recovering kth order rank one tensor from the ordinary unfolding is our analysis shows curious two phase behavior with high probability when the error shows fast decay as for the error decays slowly as we confirm this in numerical simulation the proposed subspace norm is an interpolation between the intractable estimators that directly control the rank hosvd and the tractable estimators it becomes equivalent to the latent trace norm when at the cost of increased ratio threshold see table the proposed estimator is more efficient than previously proposed norm based estimators because the size of the svd required in the algorithm is reduced from to we also empirically demonstrate that the proposed subspace norm performs nearly optimally for constant order notation let be kth order tensor we will often use nk to simplify the notation but all the results in this paper generalizes to general dimensions the inner product between pair of tensors is defined as the inner products of them as vectors hx wi hvec vec for denotes the tensor whose entry is ui vj wk the rank of is the minimum number of tensors required to write as linear combination of them fiber of tensor is an nk dimensional vector that is obtained by fixing all but the kth index of the unfolding of tensor is an nk matrix constructed by concatenating all the fibers along columns we denote the spectral and frobenius norms for matrices by and kf respectively the power of ordinary unfolding perturbation bound for the left singular vector we first establish bound on recovering the left singular vector of matrix with perturbed by random gaussian noise consider the following model known as the information plus noise model βuv σe where and are unit vectors is the signal strength is the noise standard deviation and the noise matrix is assumed to be random with entries sampled from the standard normal distribution our goal is to the correlation between and the top left singular vector of for ratio mn with high probability direct application of the classic wedin perturbation theorem to the rectangular matrix does not provide us the desired result this is because requires the signal to noise ratio since the spectral norm of scales as op this would mean that we require the threshold is dominated by the number of columns if alternatively we can view as the leading eigenvector of square matrix our key insight is that we can decompose as follows uu mσ ee mσ βσ uv evu note that is the leading eigenvector of the first term because adding an identity matrix does not change the eigenvectors moreover we notice that there 
are two noise terms the first term is centered wishart matrix and it is independent of the signal the second term is gaussian distributed and depends on the signal this implies behavior corresponding to either the wishart or the gaussian noise term being dominant depending on the value of interestingly we get different speed of convergence for each of these phases as we show in the next theorem the proof is given in appendix theorem there exists constant such that with probability at least if cnm if cnm cn if cn otherwise cn if in other words if has sufficiently many more columns than rows as the signal to noise ratio increases first converges to as and then as figure illustrates these results we randomly generate matrix perturbed by gaussian noise and measure the distance between and the phase transition happens at nm and there are two regimes of different convergence rates as theorem predicts tensor unfolding now let apply the above result to the tensor version of information plus noise model studied by we consider rank one tensor signal contaminated by gaussian noise as follows σe βu σe where factors rn are unit vectors which are not necessarily identical and the entries of are samples from the normal distribution note that this is slightly more general and easier to analyze than the symmetric setting studied by correlation log log log log nm log log synthetic experiment showing phase transition at nm and regimes with different rates of convergence see theorem ni synthetic experiment showing phase transition at nk for odd order tensors see corollary figure numerical demonstration of theorem and corollary several estimators for recovering from its noisy version have been proposed see table both the overlapped nuclear norm and latent nuclear norm discussed in achives the relative performance guarantee op where is the estimator this bound implies that if we want to obtain relative error smaller than we need the signal to noise ratio to scale as mu et al proposed the square norm defined as the nuclear norm of the matrix obtained by grouping the first indices along the rows and the last indices along the columns this norm improves the right hand side of inequality to which translates to requiring for obtaining relative error the intuition here is the more square the unfolding is the better the bound becomes however there is no improvement for richard and montanari studied the symmetric version of model and proved recursive unfolding algorithm achieves the factor recovery error dist with with high probability where dist min ku ku they also showed that the randomly initialized tensor can achieve the same error with slightly worse power method threshold max log also with high probability the reasoning underlying both and is that square unfolding is better however if we take the ordinary unfolding βu σe we can see as an instance of information plus noise model where thus the ordinary unfolding satisfies the condition of theorem for or large enough corollary consider th order rank one tensor contaminated by gaussian noise as in there exists constant such that if with probability at least we have if dist for if where is the leading left singular vector of the rectangular unfolding this proves that as conjectured by the threshold applies not only to the even order case but also to the odd order case note that hopkins et al have shown similar result without the sharp rate of convergence the above corollary easily extends to more general qq qk and nk tensor by replacing the conditions by nk qq the 
result also holds when has rank higher than see appendix we demonstrate this result in figure the models behind the experiment are slightly more general ones in which or and the signal is rank two with and the plot shows the inner products and as measure of the quality of estimating the two factors the horizontal axis is the normalized noise qk standard deviation nk we can clearly see that the inner product decays symmetrically around and as predicted by corollary for both tensors subspace norm for tensors suppose the true tensor admits minimum tucker decomposition of rank pr pr ik ik uik if the core tensor ik is superdiagonal the above decomposition reduces to the canonical polyadic cp decomposition the unfolding of the true tensor can be written as follows where is the unfolding of the core tensor is matrix ur for note that is not necessarily orthogonal let be the svd of we will observe that span because of and span corollary shows that the left singular vectors can be recovered under mild conditions thus the span of the right singular vectors can also be recovered inspired by this we define norm that models tensor as mixture of tensors we require that the unfolding of has low rank factorization where is variable and is fixed arbitrary orthonormal basis of some subspace which we choose later to have the kronecker structure in in the following we define the subspace norm suggest an approach to construct the right factor and prove the denoising bound in the end the subspace norm consider kth order tensor of size definition let be matrices such that rn with the subspace norm for kth order tensor associated with is defined as pk inf km if span otherwise where is the nuclear norm and span foldk in the next lemma proven in appendix we show the norm of the subspace norm has simple appealing form as we see in theorem it avoids the scaling see the first column of table by restricting the influence of the noise term in the subspace defined by lemma the dual norm of is max kx where is the spectral norm choosing the subspace natural question that arises is how to choose the matrices lemma let the be the svd of where is and is assume that and has full column rank it holds that for all span ii span proof we prove the lemma in appendix corollary shows that when the signal to noise ratio is high enough we can recover with high probability hence we suggest the following approach for tensor denoising for each unfold the observation tensor in mode and compute the top left singular vectors concatenate these vectors to obtain matrix ii construct as iii solve the subspace norm regularized minimization problem min where the subspace norm is associated with the above defined see appendix for details analysis let be tensor corrupted by gaussian noise with standard deviation as follows σe we define slightly modified estimator as follows arg min foldk where is restriction of the set of matrices defined as follows kfold this restriction makes sure that are incoherent each has spectral norm that is as low as random matrix when unfolded at different mode similar assumptions were used in plus sparse matrix decomposition and for the denoising bound for the latent nuclear norm then we have the following statement we prove this in appendix theorem let xp be any tensor that can be expressed as xp foldk which satisfies the above incoherence condition and let rk be the rank of for in addition we assume that each is constructed as with then there are universal constants and such that any solution of the minimization problem 
with log satisfies the following bound xk rk with probability at least note that the side of the bound consists of two terms the first term is the approximation error this term will be zero if lies in span this is the case if we choose as in the latent nuclear norm or if the condition of corollary is satisfied for the smallest βr when we use the kronecker product construction we proposed note that the regularization constant should also scale with the dual subspace norm of the residual xp the second term is the estimation error with respect to xp if we take xp to be the orthogonal projection of to the span we can ignore the contribution of the residual to because xp then the estimation error scales mildly with the dimensions and with the sum of the ranks note that if we take we have and we recover the guarantee experiments in this section we conduct tensor denoising experiments on synthetic and real datasets to numerically confirm our analysis in previous sections synthetic data we randomly generated the true rank two tensor of size with singular values and the true factors are generated as random matrices with orthonormal columns the observation tensor is then generated by adding gaussian noise with standard deviation to our approach is compared to the cp decomposition the overlapped approach and the latent approach the cp decomposition is computed by the tensorlab with random initializations we assume cp knows the true rank is for the subspace norm we use algorithm described we computed in section we also select the top singular vectors when constructing the solutions for values of regularization parameter logarithmically spaced between and for the overlapped and the latent norm we use admm described in we also computed solutions with the same used for the subspace norm we measure the performance in the relative error defined as we report the minimum error obtained by choosing the optimal regularization parameter or the optimal initialization although the regularization parameter could be selected by leaving out some entries and measuring the error on these entries we will not go into tensor completion here for the sake of simplicity figure and show the result of this experiment the left panel shows the relative error for representative values of for the subspace norm the black line shows the minimum error across all the the magenta dashed the error corresponding to the theoretically line showsp motivated choice maxk nk log for each the two vertical lines are thresholds of from corollary corresponding to and namely nk and nk it confirms that there is rather sharp increase in the error around the theoretically predicted places see also figure we can also see that the optimal should grow linearly with for large small snr the best relative error is since the optimal choice of the regularization parameter leads to predicting with xb figure compares the performance of the subspace norm to other approaches for each method the smallest error corresponding to the optimal choice of the regularization parameter is shown relative error relative error relative error min error suggested cp subspace latent overlap optimistic cp subspace latent overlap suggested optimistic figure tensor denoising the subspace approach with three representative on synthetic data comparison of different methods on synthetic data comparison on amino acids data in addition to place the numbers in context we plot the line corresponding to nk log relative error which we call optimistic this can be motivated from considering 
the maximum likelihood estimator for cp decomposition see appendix clearly the error of cp the subspace norm and optimistic grows at the same rate much slower than overlap and latent the error of cp increases beyond as no regularization is imposed see appendix for more experiments we can see that both cp and the subspace norm are behaving near optimally in this setting although such behavior is guaranteed for the subspace norm whereas it is hard to give any such guarantee for the cp decomposition based on nonlinear optimization amino acids data the amino acid dataset is dataset commonly used as benchmark for low rank tensor modeling it consists of five samples each one contains different amounts of tyrosine tryptophan and phenylalanine the spectrum of their excitation wavelength nm and emission nm are measured by fluorescence which gives tensor as the true factors are known to be these three acids this data perfectly suits the cp model the true rank is fed into cp and the proposed approach as we computed the solutions of cp for different random initializations and the solutions of other approaches with different values of for the subspace and the overlapped approach are logarithmically spaced between and for the latent approach are logarithmically spaced between and again we include the optimistic scaling to put the numbers in context figure shows the smallest relative error achieved by all methods we compare similar to the synthetic data both cp and the subspace norm behaves near ideally though the relative error of cp can be larger than due to the lack of regularization interestingly the theoretically suggested scaling of the regularization parameter is almost optimal conclusion we have settled conjecture posed by and showed that indeed ratio is sufficient also for odd order tensors moreover our analysis shows an interesting behavior of the error this finding lead us to the development of the proposed subspace norm the which are proposed norm is defined with respect to set of orthonormal matrices estimated by singular value decompositions we have analyzed the denoising performance of the proposed norm and shown that the error can be bounded by the sum of two terms which can be interpreted as an approximation error term coming from the first step and an estimation error term coming from the second convex step 
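As a concrete illustration of the ordinary-unfolding estimator analysed above, the sketch below generates a third-order rank-one signal beta * u ⊗ v ⊗ w corrupted by standard Gaussian noise, unfolds it into an n x n^2 matrix, takes the top left singular vector, and reports the correlation |<u, u_hat>| for a few signal strengths. The tensor size, the particular beta values, and the helper names are assumptions chosen for illustration; they are not the settings used in the experiments reported above.

```python
import numpy as np

def rank_one_tensor(u, v, w):
    # K = 3 rank-one tensor u (x) v (x) w as an (n, n, n) array.
    return np.einsum("i,j,k->ijk", u, v, w)

def unfold_mode1(X):
    # Ordinary mode-1 unfolding: an n x n^2 matrix whose columns are the mode-1 fibers X[:, j, k].
    n = X.shape[0]
    return X.reshape(n, -1)

def recover_u(Y):
    # Top left singular vector of the rectangular unfolding.
    U, _, _ = np.linalg.svd(unfold_mode1(Y), full_matrices=False)
    return U[:, 0]

rng = np.random.default_rng(0)
n, sigma = 50, 1.0
u, v, w = (x / np.linalg.norm(x) for x in rng.standard_normal((3, n)))

for beta in [0.5 * n ** 0.5, 2 * n ** 0.75, 10 * n ** 0.75]:
    Y = beta * rank_one_tensor(u, v, w) + sigma * rng.standard_normal((n, n, n))
    u_hat = recover_u(Y)
    print(f"beta = {beta:8.1f}   |<u, u_hat>| = {abs(u @ u_hat):.3f}")
```

Sweeping beta more finely should reproduce the qualitative behaviour discussed above: near-chance correlation at low signal-to-noise ratio and correlation approaching one once the signal strength clears the recovery threshold.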
sample complexity bounds for iterative stochastic policy optimization marin kobilarov department of mechanical engineering johns hopkins university baltimore md marin abstract this paper is concerned with robustness analysis of decision making under uncertainty we consider class of iterative stochastic policy optimization problems and analyze the resulting expected performance for each newly updated policy at each iteration in particular we employ inequalities to compute future expected cost and probability of constraint violation using empirical runs novel inequality bound is derived that accounts for the possibly unbounded likelihood ratio resulting from iterative policy adaptation the bound serves as certificate for providing future performance or safety guarantees the approach is illustrated with simple robot control scenario and initial steps towards applications to challenging aerial vehicle navigation problems are presented introduction we consider general class of stochastic optimization problems formulated as arg min eτ where defines vector of decision variables represents the system response defined through the density and defines positive cost function which can be and nonconvex it is assumed that is either known or can be sampled from in manner the objective is to obtain sample complexity bounds on the expected cost for given decision strategy by observing past realizations of possibly different strategies such bounds are useful for two reasons for providing robustness guarantees for future executions and for designing new algorithms that directly minimize the bound and therefore are expected to have robustness our primary motivation arises from applications in robotics for instance when robot executes control policies to achieve given task such as navigating to desired state while perceiving the environment and avoiding obstacles such problems are traditionally considered in the framework of reinforcement learning and addressed using policy search algorithms see also for comprehensive overview with focus on robotic applications when an uncertain system model is available the problem is equivalent to robust control mpc our specific focus is on providing formal guarantees on future executions of control algorithms in terms of maximum expected cost quantifying performance and maximum probability of constraint violation quantifying safety such bounds determine the reliability of control in the presence of process measurement and parameter uncertainties and contextual changes in the task in this work we make no assumptions about nature of the system structure such as linearity convexity or gaussianity in addition the proposed approach applies either to physical system without an available model to an analytical stochastic model or to model from opensource physics engine in this context pac bounds have been rarely considered but could prove essential for system certification by providing guarantees for future performance and safety for instance with chance the robot will reach the goal within minutes or with chance the robot will not collide with obstacles approach to cope with such general conditions we study robustness through statistical learning viewpoint using sample complexity bounds on performance based on empirical runs this is accomplished using inequalities which provide only probabilistic bounds they certify the algorithm execution in terms of statements such as in future executions with chance the expected cost will be less than and the probability of collision will be 
less than while such bounds are generally applicable to any stochastic decision making process our focus and initial evaluation is on stochastic control problems randomized methods in control analysis our approach is also inspired by existing work on randomized algorithms in control theory originally motivated by robust linear control design for example early work focused on probabilistic design and later applied to constraint satisfaction and general cost functions bounds for decidability of linear stability were refined in these are closely related to the concepts of randomized stability robustness analysis rsra and randomized performance robustness analysis rpra probabilistic bounds for system identification problems have also been obtained through statistical learning viewpoint iterative stochastic policy optimization instead of directly searching for the optimal to solve common strategy in direct policy search and global optimization is to iteratively construct surrogate stochastic model with such as gaussian mixture model gmm where is vector space the model induces joint density encoding natural stochasticity and artificial stochasticity the problem is then to find to minimize the expected cost iteratively until convergence which in many cases also corresponds to shrinking close to delta function around the optimal or to multiple peaks when multiple disparate optima exist as long as is the typical flow of the iterative policy optimization algorithms considered in this work is iterative stochastic policy optimization ispo start with initial prior set sample trajectories ξj τj for compute new policy using observed costs τj compute bound on expected cost and stop if below threshold else set and goto the purpose of computing bounds is to analyze the performance of such standard policy search algorithms to design new algorithms by not directly minimizing an estimate of the expected cost but by minimizing an upper confidence bound on the expected cost instead the computed policy will thus have robustness in the sense that with highprobability the resulting cost will not exceed an known value the present paper develops bounds applicable to both and but only explores their application to to the analysis of existing iterative policy search methods cost functions we consider two classes of cost functions the first class encodes system performance and is defined as bounded function such that for any the second are indicator functions representing constraint violation assume that the variable must satisfy the condition the cost is then defined as and its expectation can be regarded as the probability of constraint violation eτ in this work we will be obtain bounds for both classes of cost functions specific application stochastic control we next illustrate the general stochastic optimization setting using classical nonlinear optimal control problem specific instances of such control problems will later be used for numerical evaluation consider dynamical model with state xk where is an manifold and control inputs uk rm at time tk where denotes the time stage assume that the system dynamics are given by fk xk uk wk subject to gk xk uk gn xn where fk and gk correspond either to the physical plant to an analytical model or to update step the terms wk denotes process noise equivalently such formulation induces the process model density uk in addition consider the cost lk xk uk ln xn where xn denotes the complete trajectory and lk are given nonlinear functions our goal is to design feedback control 
policies to optimize the expected value of for simplicity we will assume perfect measurements although this does not impose limitation on the approach assume that any decision variables in the problem such as feedforward or feedback gains obstacle avoidance terms mode switching variables are encoded using vector rnξ and define the control law uk φk xk using basis functions φk for all this representation captures both static feedback control laws as well as optimal control laws of the form uk kklqr xk where tk is an optimized feedforward control parametrized using basis functions such as kklqr is the optimal feedback gain matrix of the lqr problem based on the linearized dynamics and cost expansion around the optimized nominal reference trajectory such that fk the complete trajectory of the system is denoted by the random variable and has density πn φk xk where is the dirac delta uk uk the trajectory constraint takes the form gk xk uk gn xn simple example as an example consider robot modeled as system with state where rd denotes position and rd denotes velocity with for planar workspaces and for workspaces the dynamics is given for by pk uk wk vk uk wk where uk are the applied controls and wk is white noise imagine that the constraint gk defines circular obstacles rd and control norm bounds defined as ro kp po kuk umax where ro is the radius of an obstacle at position po rd the cost could be arbitrary but typical choice is where is given matrix and is nonlinear function defining task the final cost could force the system towards goal state xf rn or region xf rn and could be defined according to ln kx xf for some given matrix qf for such simple systems one can choose smooth feedback control law uk φk with static positive gains kp kd ko and basis function pf vf where is an force defined as the gradient of potential field or as gyroscopic steering force that effectively rotates the velocity vector alternatively one could employ optimal control law as described in pac bounds for iterative policy adaptation we next compute probabilistic bounds on the expected cost resulting from the execution of new stochastic policy with using observed samples from previous policies the bound is agnostic to how the policy is updated step in the ispo algorithm inequality for policy adaptation the stochastic optimization setting naturally allows the use of prior belief on what good control laws could be for some known after observing executions based on such prior we wish to find new improved policy which optimizes the cost eτ eτ which can be approximated using samples ξj and τj by the empirical cost ξj τj ξj the goal is to compute the parameters using the sampled decision variables ξj and the corresponding observed costs τj obtaining practical bounds for becomes challenging since the can be unbounded or have very large values and likelihood ratio standard bound such as hoeffding or bernstein becomes impractical or impossible to apply to cope with this we will employ recently proposed robust estimation technique stipulating that instead of estimating the expectation of random variable using its pm empirical mean xj more robust estimate can be obtained by truncating its higher pm moments using αm αxj for some where log what makes this possible is the key assumption that the possibly unbounded random variable must have bounded second moment we employ this idea to deal with the unboundedness of the policy adaptation ratio by showing that in fact its second moment can be bounded and corresponds to an information 
distance between the current and previous stochastic policies to obtain sharp bounds though it is useful to employ samples over multiple iterations of the ispo algorithm from policies computed in previous iterations to simplify notation let and define the cost of executing can now be equivalently expressed as using the computed policies in previous iterations we next state the main result proposition with probability the expected cost of executing stochastic policy with parameters is bounded according to inf jbα log αlm where jbα denotes robust estimator defined by jbα xx zij αlm computed after iterations with samples zim obtained at iterations where dβ denotes the renyii divergence between and defined by pβ dβ log dx the constants bi are such that bi at each iteration proof the bound is obtained by relating the mean to its robust estimate according to lm jbα eαlm eαt eαlm pm zij yy zij yy yy αj yy using markov inequality to obtain the identities log in and ex in here we adapted the technique proposed by catoni for general unbounded losses to our policy adaptation setting in order to handle the possibly unbounded likelihood ratio these results are then combined with bi eπ where the relationship between the likelihood ratio variance and the renyii divergence was established in note that the renyii divergence can be regarded as distance between two distribution and can be computed in closed bounded form for various distributions such as the exponential families it is also closely related to the kl divergence kl illustration using simple robot navigation we next illustrate the application of these bounds using the simple scenario introduced in the stochasticity is modeled using gaussian density on the initial state on the disturbances wk and on the goal state xf iterative policy optimization is performed using stochastic model encoding multivariate gaussian which is updated through rwr in step of the ispo algorithm we take samples observe their costs and update the parameters according to τj ξj τj ξj ξj using the tilting weights for some adaptively chosen and where τj pm τj are the normalized weights at each iteration one can compute bound on the expected cost using the previously computed we have computed such bounds using for both the expected cost and probability of obstacles goal obstacles sampled start states iteration iteration iteration iteration expected cost probability of collision empirical jb robust jbα pac bound empirical robust pα pac bound iterations iterations figure robot navigation scenario based on iterative policy improvement and resulting predicted performance evolution of the density over the decision variables in this case the control gains cost function and its computed upper bound for future executions analogous bounds on snapshots of sampled trajectories top note that the initial policy results in collisions surprisingly the standard empirical and robust estimates are nearly identical collision denoted respectively by and using samples figure at each iteration we used window of maximum previous iterations to compute the bounds to compute all samples from densities νi were used remarkably using our robust statistics approach the resulting bound eventually becomes close to the standard empirical estimate jb the collision probability bound decreses to less than which could be further improved by employing more samples and more iterations the significance of these bounds is that one can stop the optimization regarded as training at any time and be able to predict 
expected performance in future executions using the newly updated policy before actually executing the policy using the samples from the previous iteration finally the renyii divergence term used in these computations takes the simple form dβ kn log where σβ policy optimization methods we do not impose any restrictions on the specific method used for optimizing the policy when complex constraints are present such computation will involve global motion planning step combined with local feedback control laws we show such an example in the approach can be used to either analyze such policies computed using any method of choice or to derive new algorithms based on minimizing the side of the bound the method also applies to modelfree learning for instance related to recent methods in robotics one could use rwr or policy learning by weighted samples with returns power stochastic optimization methods such as or using the related optimization application to aerial vehicle navigation consider an aerial vehicle such as quadrotor navigating at high speed through cluttered environment we are interested in minimizing cost metric related to the total time taken and control effort required to reach desired goal state while maintaining low probability of collision we employ an experimentally identified model of an asctec quadrotor figure with state space se with state where is the position so is the rotation matrix and is the angular velocity the vehicle is controlled with inputs including the lift force and torque moments the dynamics is mg rb jω where is the mass inertia tensor and the matrix is such that bη for any the system is subject to initial localization errors and also to random disturbances due to wind gusts and wall effects defined as stochastic forces each component in is and has standard deviation of newtons for vehicle with mass kg the objective is to navigate through given urban environment at high speed to desired goal state we employ approach consisting of an global planner which produces sequence of local that the vehicle must pass through standard nonlinear feedback backstepping controller based on slow position control loop and fast attitude control is employed for local control in addition and obstacle avoidance controller is added to avoid collisions since the vehicle is not expected to exactly follow the path at each iteration samples are taken with confidence level window of past iterations were used for the bounds the control density is single gaussian as specified in the most sensitive gains in the controller are the position proporitional and derivative terms and the obstacle gains denoted by kp kd and ko which we examine in the following scenarios fixed goal wind gusts disturbances virtual environment the system is first tested in cluttered simulated environment figure the simulated vehicle travels at an average velocity of see video in supplement and initially experiences more than collisions after few iterations the total cost stabilizes and the probability of collision reduces to around the bound is close to the empirical estimate which indicates that it can be tight if more samples are taken the collision probability bound is still too high to be practical but our goal was only to illustrate the bound behavior it is also likely that our chosen control strategy is in fact not suitable for traversal of such tight environments sparser environment randomly sampled goals more general evaluation was performed by adding the goal location to the stochastic problem parameters so 
that the bound will apply to any future desired goal in that environment figure the algorithm converges to similar values as before but this time the collision probability is smaller due to more expansive environment in both cases the bounds could be reduced further by employing more than samples or by reusing more samples from previous runs according to proposition conclusion this paper considered stochastic decision problems and focused on bounds on robustness of the computed decision variables we showed how to derive bounds for fixed policies in order to predict future performance constraint violation these results could then be employed for obtaining generalization pac bounds through approach which could be consistent with the proposed notion of policy priors and policy adaptation future work will develop concrete algorithms by directly optimizing such pac bounds which are expected to have robustness properties references richard sutton david mcallester satinder singh and yishay mansour policy gradient methods for reinforcement learning with function approximation in nips pages csaba szepesvari algorithms for reinforcement learning morgan and claypool publishers deisenroth neumann and peters survey on policy search for robotics pages simulated quadrotor motion asctec pelican waypoint path iteration iteration iteration expected cost probability of collision empirical robust pα pac bound empirical jb robust jbα pac bound iterations iterations figure aerial vehicle navigation using simulated nonlinear quadrotor model top iterative stochastic policy optimization iterations analogous to those given in figure note that the initial policy results in over collisions which is reduced to less than after few policy iterations random goals start campus map iteration iteration iteration expected cost probability of collision empirical robust pα pac bound empirical jb robust jbα pac bound iterations iterations figure analogous plot to figure but for typical campus environment using uniformly at random sampled goal states along the northern boundary the vehicle must fly below feet and is not allowed to fly above buildings this is larger less constrained environment resulting in less collisions schaal and atkeson learning control in robotics robotics automation magazine ieee june alberto bemporad and manfred morari robust model predictive control survey in garulli and tesi editors robustness in identification and control volume of lecture notes in control and information sciences pages springer london vladimir vapnik the nature of statistical learning theory new york new york ny usa david mcallester stochastic model selection mach april langford tutorial on practical prediction theory for classification journal of machine learning research stphane boucheron gbor lugosi pascal massart and michel ledoux concentration inequalities nonasymptotic theory of independence oxford university press oxford vidyasagar randomized algorithms for robust controller synthesis using statistical learning theory automatica october laura ryan ray and robert stengel monte carlo approach to the analysis of control system robustness automatica january qian wang and robertf stengel probabilistic control of nonlinear uncertain systems in giuseppe calafiore and fabrizio dabbene editors probabilistic and randomized methods for design under uncertainty pages springer london tempo calafiore and dabbene randomized algorithms for analysis and control of uncertain systems springer koltchinskii abdallah ariola and dorato statistical 
learning control of uncertain systems theory and algorithms applied mathematics and computation bellman vidyasagar and rajeeva karandikar learning theory approach to system identification and stochastic adaptive control journal of process control festschrift honouring professor dale seborg reuven rubinstein and dirk kroese the method unified approach to combinatorial optimization springer anatoly zhigljavsky and antanasz zilinskas stochastic global optimization spri philipp hennig and christian schuler entropy search for global optimization mach learn june christian igel nikolaus hansen and stefan roth covariance matrix adaptation for optimization evol march pedro larraaga and jose lozano editors estimation of distribution algorithms new tool for evolutionary computation kluwer academic publishers martin pelikan david goldberg and fernando lobo survey of optimization by building and using probabilistic models comput optim january howie choset kevin lynch seth hutchinson george kantor wolfram burgard lydia kavraki and sebastian thrun principles of robot motion theory algorithms and implementations mit press june corinna cortes yishay mansour and mehryar mohri learning bounds for importance weighting in advances in neural information processing systems olivier catoni challenging the empirical mean and empirical variance deviation study ann inst poincar probab theodorou buchli and schaal generalized path integral control approach to reinforcement learning journal of machine learning research sergey levine and pieter abbeel learning neural network policies with guided policy search under unknown dynamics in neural information processing systems nips kobilarov motion planning international journal of robotics research robert mahony and tarek hamel robust trajectory tracking for scale model autonomous helicopter international journal of robust and nonlinear control marin kobilarov trajectory tracking of class of underactuated systems with external disturbances in american control conference pages 
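The core computation behind the bound in the proposition above can be sketched in a few lines: importance-weight each observed cost by the likelihood ratio of the candidate policy against the policy that generated the sample, then average through the Catoni-style truncation psi(x) = log(1 + x + x^2 / 2) rather than taking a raw empirical mean, so that heavy-tailed likelihood ratios cannot dominate the estimate. The Gaussian policies, cost function, sample size, and choice of alpha below are illustrative assumptions; the full bound additionally adds a confidence term of order log(1/delta) / (alpha * M) and ties the choice of alpha to the Renyi divergence between the two policies, both of which are omitted here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def psi(x):
    # Truncation used by the robust (Catoni-style) estimator: log(1 + x + x^2 / 2).
    return np.log1p(x + 0.5 * x * x)

def robust_cost(costs, xis, new_policy, old_policy, alpha):
    # Importance-weighted costs z_j = ell(xi_j) * pi_new(xi_j) / pi_old(xi_j),
    # averaged through psi instead of the plain empirical mean.
    z = costs * new_policy.pdf(xis) / old_policy.pdf(xis)
    return psi(alpha * z).mean() / alpha

d, M = 2, 500
old_policy = multivariate_normal(mean=np.zeros(d), cov=np.eye(d))       # sampling policy nu_i
new_policy = multivariate_normal(mean=0.3 * np.ones(d), cov=0.5 * np.eye(d))  # candidate policy nu

xis = old_policy.rvs(size=M, random_state=0)           # decision variables sampled from the old policy
costs = np.linalg.norm(xis - 1.0, axis=1)              # observed costs ell(xi_j) for some task

alpha = 0.1
iw_mean = float((costs * new_policy.pdf(xis) / old_policy.pdf(xis)).mean())
print("empirical IW mean :", iw_mean)
print("robust estimate   :", robust_cost(costs, xis, new_policy, old_policy, alpha))
```

The point of the truncation is visible when the candidate policy drifts far from the sampling policy: a handful of samples with very large likelihood ratios can inflate the plain importance-weighted mean, while the psi-averaged estimate degrades gracefully, which is what allows the bound to hold with only a second-moment (Renyi-divergence) condition on the ratio.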
binaryconnect training deep neural networks with binary weights during propagations matthieu courbariaux polytechnique de yoshua bengio de cifar senior fellow david polytechnique de abstract deep neural networks dnn have achieved results in wide range of tasks with the best results obtained with large training sets and large models in the past gpus enabled these breakthroughs because of their greater computational speed in the future faster computation at both training and test time is likely to be crucial for further progress and for consumer applications on devices as result there is much interest in research and development of dedicated hardware for deep learning dl binary weights weights which are constrained to only two possible values or would bring great benefits to specialized dl hardware by replacing many operations by simple accumulations as multipliers are the most space and powerhungry components of the digital implementation of neural networks we introduce binaryconnect method which consists in training dnn with binary weights during the forward and backward propagations while retaining precision of the stored weights in which gradients are accumulated like other dropout schemes we show that binaryconnect acts as regularizer and we obtain near results with binaryconnect on the mnist and svhn introduction deep neural networks dnn have substantially pushed the in wide range of tasks especially in speech recognition and computer vision notably object recognition from images more recently deep learning is making important strides in natural language processing especially statistical machine translation interestingly one of the key factors that enabled this major progress has been the advent of graphics processing units gpus with on the order of to starting with and similar improvements with distributed training indeed the ability to train larger models on more data has enabled the kind of breakthroughs observed in the last few years today researchers and developers designing new deep learning algorithms and applications often find themselves limited by computational capability this along with the drive to put deep learning systems on devices unlike gpus is greatly increasing the interest in research and development of specialized hardware for deep networks most of the computation performed during training and application of deep networks regards the multiplication of weight by activation in the recognition or forward propagation phase of the algorithm or gradient in the backward propagation phase of the algorithm this paper proposes an approach called binaryconnect to eliminate the need for these multiplications by forcing the weights used in these forward and backward propagations to be binary constrained to only two values not necessarily and we show that results can be achieved with binaryconnect on the mnist and svhn what makes this workable are two ingredients sufficient precision is necessary to accumulate and average large number of stochastic gradients but noisy weights and we can view discretization into small number of values as form of noise especially if we make this discretization stochastic are quite compatible with stochastic gradient descent sgd the main type of optimization algorithm for deep learning sgd explores the space of parameters by making small and noisy steps and that noise is averaged out by the stochastic gradient contributions accumulated in each weight therefore it is important to keep sufficient resolution for these accumulators which at first sight 
suggests that high precision is absolutely required and show that randomized or stochastic rounding can be used to provide unbiased discretization have shown that sgd requires weights with precision of at least to bits and successfully train dnns with bits dynamic computation besides the estimated precision of the brain synapses varies between and bits noisy weights actually provide form of regularization which can help to generalize better as previously shown with variational weight noise dropout and dropconnect which add noise to the activations or to the weights for instance dropconnect which is closest to binaryconnect is very efficient regularizer that randomly substitutes half of the weights with zeros during propagations what these previous works show is that only the expected value of the weight needs to have high precision and that noise can actually be beneficial the main contributions of this article are the following we introduce binaryconnect method which consists in training dnn with binary weights during the forward and backward propagations section we show that binaryconnect is regularizer and we obtain near results on the mnist and svhn section we make the code for binaryconnect available binaryconnect in this section we give more detailed view of binaryconnect considering which two values to choose how to discretize how to train and how to perform inference or applying dnn mainly consists in convolutions and matrix multiplications the key arithmetic operation of dl is thus the operation artificial neurons are basically multiplyaccumulators computing weighted sums of their inputs binaryconnect constraints the weights to either or during propagations as result many operations are replaced by simple additions and subtractions this is huge gain as adders are much less expensive both in terms of area and energy than deterministic vs stochastic binarization the binarization operation transforms the weights into the two possible values very straightforward binarization operation would be based on the sign function if wb otherwise https where wb is the binarized weight and the weight although this is deterministic operation averaging this discretization over the many input weights of hidden unit could compensate for the loss of information an alternative that allows finer and more correct averaging process to take place is to binarize stochastically with probability wb with probability where is the hard sigmoid function clip max min we use such hard sigmoid rather than the soft version because it is far less computationally expensive both in software and specialized hardware implementations and yielded excellent results in our experiments it is similar to the hard tanh introduced by it is also linear and corresponds to bounded form of the rectifier propagations vs updates let us consider the different steps of with sgd udpates and whether it makes sense or not to discretize the weights at each of these steps given the dnn input compute the unit activations layer by layer leading to the top layer which is the output of the dnn given its input this step is referred as the forward propagation given the dnn target compute the training objective gradient each layer activations starting from the top layer and going down layer by layer until the first hidden layer this step is referred to as the backward propagation or backward phase of backpropagation compute the gradient each layer parameters and then update the parameters using their computed gradients and their previous values this 
step is referred to as the parameter update algorithm sgd training with binaryconnect is the cost function for minibatch and the functions binarize and clip specify how to binarize and clip weights is the number of layers require minibatch of inputs targets previous parameters weights and biases and learning rate ensure updated parameters wt and bt forward propagation wb binarize for to compute ak knowing wb and backward propagation initialize output layer activations gradient for to compute knowing and wb parameter update compute and knowing and wt clip bt key point to understand with binaryconnect is that we only binarize the weights during the forward and backward propagations steps and but not during the parameter update step as illustrated in algorithm keeping good precision weights during the updates is necessary for sgd to work at all these parameter changes are tiny by virtue of being obtained by gradient descent sgd performs large number of almost infinitesimal changes in the direction that most improves the training objective plus noise one way to picture all this is to hypothesize that what matters most at the end of training is the sign of the weights but that in order to figure it out we perform lot of small changes to quantity and only at the end consider its sign sign gt yt where gt is noisy estimator of xt where xt yt is the value of the objective function on input target example xt yt when are the previous weights and is its final discretized value of the weights another way to conceive of this discretization is as form of corruption and hence as regularizer and our empirical results confirm this hypothesis in addition we can make the discretization errors on different weights approximately cancel each other while keeping lot of precision by randomizing the discretization appropriately we propose form of randomized discretization that preserves the expected value of the discretized weight hence at training time binaryconnect randomly picks one of two values for each weight for each minibatch for both the forward and backward propagation phases of backprop however the sgd update is accumulated in variable storing the parameter an interesting analogy to understand binaryconnect is the dropconnect algorithm just like binaryconnect dropconnect only injects noise to the weights during the propagations whereas dropconnect noise is added gaussian noise binaryconnect noise is binary sampling process in both cases the corrupted value has as expected value the clean original value clipping since the binarization operation is not influenced by variations of the weights when its magnitude is beyond the binary values and since it is common practice to bound weights usually the weight vector in order to regularize them we have chosen to clip the weights within the interval right after the weight updates as per algorithm the weights would otherwise grow very large without any impact on the binary weights few more tricks optimization no learning rate scaling learning rate scaling sgd nesterov momentum adam table test error rates of small cnn trained on depending on optimization method and on whether the learning rate is scaled with the weights initialization coefficients from we use batch normalization bn in all of our experiments not only because it accelerates the training by reducing internal covariate shift but also because it reduces the overall impact of the weights scale moreover we use the adam learning rule in all of our cnn experiments last but not least we scale the weights 
learning rates respectively with the weights initialization coefficients from when optimizing with adam and with the squares of those coefficients when optimizing with sgd or nesterov momentum table illustrates the effectiveness of those tricks inference up to now we have introduced different ways of training dnn with weight binarization what are reasonable ways of using such trained network performing inference on new examples we have considered three reasonable alternatives use the resulting binary weights wb this makes most sense with the deterministic form of binaryconnect use the weights the binarization only helps to achieve faster training but not faster performance in the stochastic case many different networks can be sampled by sampling wb for each weight according to eq the ensemble output of these networks can then be obtained by averaging the outputs from individual networks we use the first method with the deterministic form of binaryconnect as for the stochastic form of binaryconnect we focused on the training advantage and used the second method in the experiments inference using the weights this follows the practice of dropout methods where at the noise is removed method mnist svhn no regularizer binaryconnect det binaryconnect stoch dropout maxout networks deep network in network dropconnect nets table test error rates of dnns trained on the mnist no convolution and no unsupervised pretraining no data augmentation and svhn depending on the method we see that in spite of using only single bit per weight during propagation performance is not worse than ordinary no regularizer dnns it is actually better especially with the stochastic version suggesting that binaryconnect acts as regularizer figure features of the first layer of an mlp trained on mnist depending on the regularizer from left to right no regularizer deterministic binaryconnect stochastic binaryconnect and dropout benchmark results in this section we show that binaryconnect acts as regularizer and we obtain near results with binaryconnect on the mnist and svhn mnist mnist is benchmark image classification dataset it consists in training set of and test set of images representing digits ranging from to permutationinvariance means that the model must be unaware of the image structure of the data in other words cnns are forbidden besides we do not use any preprocessing or unsupervised pretraining the mlp we train on mnist consists in hidden layers of rectifier linear units relu and output layer has been shown to perform better than softmax on several classification benchmarks the square hinge loss is minimized with sgd without momentum we use an exponentially decaying learning rate we use batch figure histogram of the weights of the first layer of an mlp trained on mnist depending on the regularizer in both cases it seems that the weights are trying to become deterministic to reduce the training error it also seems that some of the weights of deterministic binaryconnect are stuck around hesitating between and figure training curves of cnn on depending on the regularizer the dotted lines represent the training costs square hinge losses and the continuous lines the corresponding validation error rates both versions of binaryconnect significantly augment the training cost slow down the training and lower the validation error rate which is what we would expect from dropout scheme normalization with minibatch of size to speed up the training as typically done we use the last samples of the training set as validation set 
for early stopping and model selection we report the test error rate associated with the best validation error rate after epochs we do not retrain on the validation set we repeat each experiment times with different initializations the results are in table they suggest that the stochastic version of binaryconnect can be considered regularizer although slightly less powerful one than dropout in this context is benchmark image classification dataset it consists in training set of and test set of color images representing airplanes automobiles birds cats deers dogs frogs horses ships and trucks we preprocess the data using global contrast normalization and zca whitening we do not use any which can really be game changer for this dataset the architecture of our cnn is where is relu convolution layer is layer fully connected layer and svm output layer this architecture is greatly inspired from vgg the square hinge loss is minimized with adam we use an exponentially decaying learning rate we use batch normalization with minibatch of size to speed up the training we use the last samples of the training set as validation set we report the test error rate associated with the best validation error rate after training epochs we do not retrain on the validation set the results are in table and figure svhn svhn is benchmark image classification dataset it consists in training set of and test set of color images representing digits ranging from to we follow the same procedure that we used for with few notable exceptions we use half the number of hidden units and we train for epochs instead of because svhn is quite big dataset the results are in table related works training dnns with binary weights has been the subject of very recent works even though we share the same objective our approaches are quite different do not train their dnn with backpropagation bp but with variant called expectation backpropagation ebp ebp is based on expectation propagation ep which is variational bayes method used to do inference in probabilistic graphical models let us compare their method to ours it optimizes the weights posterior distribution which is not binary in this regard our method is quite similar as we keep version of the weights it binarizes both the neurons outputs and weights which is more hardware friendly than just binarizing the weights it yields good classification accuracy for fully connected networks on mnist but not yet for convnets retrain neural networks with ternary weights during forward and backward propagations they train neural network with after training they ternarize the weights to three possible values and and adjust to minimize the output error and eventually they retrain with ternary weights during propagations and weights during updates by comparison we train all the way with binary weights during propagations our training procedure could be implemented with efficient specialized hardware avoiding the forward and backward propagations multiplications which amounts to about of the multiplications cf algorithm conclusion and future works we have introduced novel binarization scheme for weights during forward and backward propagations called binaryconnect we have shown that it is possible to train dnns with binaryconnect on the permutation invariant mnist and svhn datasets and achieve nearly results the impact of such method on specialized hardware implementations of deep networks could be major by removing the need for about of the multiplications and thus potentially allowing to by factor 
of at training time with the deterministic version of binaryconnect the impact at test time could be even more important getting rid of the multiplications altogether and reducing by factor of at least from bits precision to single bit precision the memory requirement of deep networks which has an impact on the memory to computation bandwidth and on the size of the models that can be run future works should extend those results to other models and datasets and explore getting rid of the multiplications altogether during training by removing their need from the weight update computation acknowledgments we thank the reviewers for their many constructive comments we also thank roland memisevic for helpful discussions we thank the developers of theano python library which allowed us to easily develop fast and optimized code for gpu we also thank the developers of and lasagne two deep learning libraries built on the top of theano we are also grateful for funding from nserc the canada research chairs compute canada and cifar references geoffrey hinton li deng george dahl mohamed navdeep jaitly andrew senior vincent vanhoucke patrick nguyen tara sainath and brian kingsbury deep neural networks for acoustic modeling in speech recognition ieee signal processing magazine tara sainath abdel rahman mohamed brian kingsbury and bhuvana ramabhadran deep convolutional neural networks for lvcsr in icassp krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips christian szegedy wei liu yangqing jia pierre sermanet scott reed dragomir anguelov dumitru erhan vincent vanhoucke and andrew rabinovich going deeper with convolutions technical report jacob devlin rabih zbib zhongqiang huang thomas lamar richard schwartz and john makhoul fast and robust neural network joint models for statistical machine translation in proc acl ilya sutskever oriol vinyals and quoc le sequence to sequence learning with neural networks in nips dzmitry bahdanau kyunghyun cho and yoshua bengio neural machine translation by jointly learning to align and translate in iclr rajat raina anand madhavan and andrew ng deep unsupervised learning using graphics processors in icml yoshua bengio ducharme pascal vincent and christian jauvin neural probabilistic language model journal of machine learning research dean corrado monga chen devin le mao ranzato senior tucker yang and ng large scale distributed deep networks in nips sang kyun kim lawrence mcafee peter leonard mcmahon and kunle olukotun highly scalable restricted boltzmann machine fpga implementation in field programmable logic and applications fpl international conference on pages ieee tianshi chen zidong du ninghui sun jia wang chengyong wu yunji chen and olivier temam diannao accelerator for ubiquitous in proceedings of the international conference on architectural support for programming languages and operating systems pages acm yunji chen tao luo shaoli liu shijin zhang liqiang he jia wang ling li tianshi chen zhiwei xu ninghui sun et al dadiannao supercomputer in microarchitecture micro annual international symposium on pages ieee lorenz muller and giacomo indiveri rounding methods for neural networks with low resolution synaptic weights arxiv preprint suyog gupta ankur agrawal kailash gopalakrishnan and pritish narayanan deep learning with limited numerical precision in icml matthieu courbariaux yoshua bengio and david low precision arithmetic for deep learning in iclr workshop thomas bartol cailey bromer justin kinney michael 
chirillo jennifer bourne kristen harris and terrence sejnowski hippocampal spine head sizes are highly precise biorxiv alex graves practical variational inference for neural networks in zemel bartlett pereira and weinberger editors advances in neural information processing systems pages curran associates nitish srivastava improving neural networks with dropout master thesis toronto nitish srivastava geoffrey hinton alex krizhevsky ilya sutskever and ruslan salakhutdinov dropout simple way to prevent neural networks from overfitting journal of machine learning research li wan matthew zeiler sixin zhang yann lecun and rob fergus regularization of neural networks using dropconnect in icml david kalach and tittley hardware complexity of modular multiplication and exponentiation computers ieee transactions on oct collobert large scale machine learning phd thesis de paris vi glorot bordes and bengio deep sparse rectifier neural networks in aistats xavier glorot and yoshua bengio understanding the difficulty of training deep feedforward neural networks in aistats sergey ioffe and christian szegedy batch normalization accelerating deep network training by reducing internal covariate shift diederik kingma and jimmy ba adam method for stochastic optimization arxiv preprint yu nesterov method for unconstrained convex minimization problem with the rate of convergence doklady an sssr translated as soviet math docl ian goodfellow david mehdi mirza aaron courville and yoshua bengio maxout networks technical report arxiv report de february yichuan tang deep learning using linear support vector machines workshop on challenges in representation learning icml min lin qiang chen and shuicheng yan network in network arxiv preprint lee saining xie patrick gallagher zhengyou zhang and zhuowen tu nets arxiv preprint yann lecun leon bottou yoshua bengio and patrick haffner learning applied to document recognition proceedings of the ieee november nair and hinton rectified linear units improve restricted boltzmann machines in icml benjamin graham convolutional neural networks arxiv preprint karen simonyan and andrew zisserman very deep convolutional networks for image recognition in iclr daniel soudry itay hubara and ron meir expectation backpropagation training of multilayer neural networks with continuous or discrete weights in nips zhiyong cheng daniel soudry zexi mao and zhenzhong lan training binary multilayer neural networks for image classification using expectation backpropgation arxiv preprint kyuyeon hwang and wonyong sung feedforward deep neural network design using in signal processing systems sips ieee workshop on pages ieee jonghong kim kyuyeon hwang and wonyong sung phoneme recognition vlsi using deep neural networks in acoustics speech and signal processing icassp ieee international conference on pages ieee thomas minka expectation propagation for approximate bayesian inference in uai james bergstra olivier breuleux bastien pascal lamblin razvan pascanu guillaume desjardins joseph turian david and yoshua bengio theano cpu and gpu math expression compiler in proceedings of the python for scientific computing conference scipy june oral presentation bastien pascal lamblin razvan pascanu james bergstra ian goodfellow arnaud bergeron nicolas bouchard and yoshua bengio theano new features and speed improvements deep learning and unsupervised feature learning nips workshop ian goodfellow david pascal lamblin vincent dumoulin mehdi mirza razvan pascanu james bergstra bastien and yoshua bengio machine 
learning research library arxiv preprint 
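The inference and training discussion above can be made concrete with a short sketch. The following NumPy code illustrates the deterministic and stochastic binarizations (the stochastic one drawing +1 with the usual hard-sigmoid probability), a BinaryConnect-style update that applies the gradient to the real-valued weights and clips them to [-1, 1], and the ensemble form of stochastic inference. `grad_fn` and `forward` are placeholder callables; this is an illustrative sketch, not the paper's Theano/Pylearn2 implementation.

```python
import numpy as np

def hard_sigmoid(w):
    """clip((w + 1) / 2, 0, 1): probability of drawing the binary weight +1."""
    return np.clip((w + 1.0) / 2.0, 0.0, 1.0)

def binarize(w, stochastic=False, rng=None):
    """Map real-valued weights to {-1, +1}: deterministic sign, or stochastic
    sampling with probability hard_sigmoid(w)."""
    if not stochastic:
        return np.where(w >= 0.0, 1.0, -1.0)
    rng = rng if rng is not None else np.random.default_rng()
    return np.where(rng.random(w.shape) < hard_sigmoid(w), 1.0, -1.0)

def train_step(w_real, grad_fn, lr, stochastic=True, rng=None):
    """One BinaryConnect-style update: binarize for the forward/backward pass,
    apply the resulting gradient to the real-valued weights, then clip them to
    [-1, 1] so they stay in the range where binarization remains informative.
    grad_fn(wb) is a placeholder returning the gradient w.r.t. the binary weights."""
    wb = binarize(w_real, stochastic, rng)
    g = grad_fn(wb)
    return np.clip(w_real - lr * g, -1.0, 1.0)

def ensemble_predict(forward, w_real, n_samples=10, rng=None):
    """Third inference alternative for the stochastic form: average the outputs
    of several networks whose binary weights are resampled for each pass."""
    rng = rng if rng is not None else np.random.default_rng()
    return np.mean([forward(binarize(w_real, True, rng)) for _ in range(n_samples)],
                   axis=0)
```

At test time the deterministic form simply uses `binarize(w_real)` once, while the stochastic form either keeps the real-valued weights or averages sampled binary networks as above.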
interactive control of diverse complex characters with neural networks igor mordatch kendall lowrey galen andrew zoran popovic emanuel todorov department of computer science university of washington mordatch lowrey galen zoran todorov abstract we present method for training recurrent neural networks to act as feedback controllers it is able to generate stable and realistic behaviors for range of dynamical systems and tasks swimming flying biped and quadruped walking with different body morphologies it does not require motion capture or features or state machines the controller is neural network having large number of units that learn elaborate mappings and small number of recurrent units that implement memory states beyond the physical system state the action generated by the network is defined as velocity thus the network is not learning control policy but rather the dynamics under an implicit policy essential features of the method include interleaving supervised learning with trajectory optimization injecting noise during training training for unexpected changes in the task specification and using the trajectory optimizer to obtain optimal feedback gains in addition to optimal actions figure illustration of the dynamical systems and tasks we have been able to control using the same method and architecture see the video accompanying the submission introduction interactive controllers that are capable of generating complex stable and realistic movements have many potential applications including robotic control animation and gaming they can also serve as computational models in biomechanics and neuroscience traditional methods for designing such controllers are and largely manual relying on motion capture datasets or state machines our goal is to automate this process by developing universal synthesis methods applicable to arbitrary behaviors body morphologies online changes in task objectives perturbations due to noise and modeling errors this is also the ambitious goal of much work in reinforcement learning and stochastic optimal control however the goal has rarely been achieved in continuous spaces involving complex dynamics deep learning techniques on modern computers have produced remarkable results on wide range of tasks using methods that are not significantly different from what was used decades ago the objective of the present paper is to design training methods that scale to larger and harder control problems even if most of the components were already known specifically we combine supervised learning with trajectory optimization namely optimization cio which has given rise to some of the most elaborate motor behaviors synthesized automatically trajectory optimization however is an offline method so the rationale here is to use neural network to learn from the optimizer and eventually generate similar behaviors online there is closely related recent work along these lines but the method presented here solves substantially harder problems in particular it yields stable and realistic locomotion in space where previous work was applied to only characters that this is possible is due to number of technical improvements whose effects are analyzed below control was historically among the earliest applications of neural networks but the recent surge in performance has been in computer vision speech recognition and other classification problems that arise in artificial intelligence and machine learning where large datasets are available in contrast the data needed to learn neural 
network controllers is much harder to obtain and in the case of imaginary characters and novel robots we have to synthesize the training data ourselves via trajectory optimization at the same time the learning task for the network is harder this is because we need precise outputs as opposed to categorical outputs and also because our network must operate not on samples but in closed loop where errors can amplify over time and cause instabilities this necessitates specialized training procedures where the dataset of trajectories and the network parameters are optimized together another challenge caused by limited datasets is the potential for and poor generalization our solution is to inject different forms of noise during training the scale of our problem requires cloud computing and gpu implementation and training that takes on the order of hours interestingly we invest more computing resources in generating the data than in learning from it thus the heavy lifting is done by the trajectory optimizer and yet the neural network complements it in way that yields interactive control neural network controllers can also be trained with more traditional methods which do not involve trajectory optimization this has been done in discrete action settings as well as in continuous control settings systematic comparison of these more direct methods with the present methods remains to be done nevertheless our impression is that networks trained with direct methods give rise to successful yet somewhat chaotic behaviors while the present class of methods yield more realistic and purposeful behaviors using physics based controllers allows for interaction but these controllers need specially designed architectures for each range of tasks or characters for example for biped location common approaches include state machines and use of simplified models such as the inverted pendulum and concepts such as zero moment or capture points for quadrupedal characters different set of state machines contact schedules and simplified models is used for flying and swimming yet another set of control architectures commonly making use of explicit cyclic encodings have been used it is our aim to unity these disparate approaches overview let the state of the character be defined as where is the physical pose of the character root position orientation and joint angles are the contact forces being applied on the character by the ground and is the recurrent memory state of the character the motion of the character is state trajectory of length defined by qt rt let xn be collection of trajectories each starting with different initial conditions and executing different task such as moving the character to particular location we introduce neural network control policy parametrized by neural network weights that maps sensory state of the character at each point in time to an optimal action that controls the character in general the sensory state can be designed by the user to include arbitrary informative features but in this preliminary work we use the following simple and representation st qt rt at where qt denotes the instantaneous rate of change of at time with this representation of the action the policy directly commands the desired velocity of the character and applied contact forces and determines the evolution of the recurrent state thus our network learns both optimal controls and model of dynamics simultaneously let ci be the total cost of the trajectory which rewards accurate execution of task and physical realism of 
the character motion we want to jointly find collection of optimal trajectories that each complete particular task along with policy that is able to reconstruct the sense and action pairs st and at of all trajectories at all timesteps minimize ci xi subject to at xi st xi the optimized policy parameters can then be used to execute policy in and interactively control the character by the user stochastic policy and sensory inputs injecting noise has been shown to produce more robust movement strategies in graphics and optimal control reduce overfitting and prevent feature in neural network training and stabilize recurrent behaviour of neural networks we inject noise in principled way to aid in learning policies that do not diverge when rolled out at execution time in particular we inject additive gaussian noise into the sensory inputs given to the neural network let the sensory noise be denoted so the resulting noisy policy inputs are this is similar to denoising autoencoders with one important difference the change in input in our setting also induces change in the optimal action to output if the noise is small enough the optimal action at nearby noisy states is given by the first order expansion as where as alternatively da ds is the matrix of optimal feedback gains around these gains can be calculated as byproduct of trajectory optimization as described in section intuitively such feedback helps the neural network trainer to learn policy that can automatically correct for small deviations from the optimal trajectory and allows us to use much less training data distributed stochastic optimization the resulting constrained optimization problem is nonconvex and too large to solve directly we replace the hard equality constraint with quadratic penalty with weight as leading to the relaxed unconstrained objective minimize ci xi st xi at xi εi xn we then proceed to solve the problem in optimization fashion optimizing for one set of variables while holding others fixed in particular we independently optimize for each xi trajectory optimization and for neural network regression as the target action as depends on the optimal feedback gains as the noise is resampled after optimizing each policy training in principle the noisy sensory state and corresponding action could be recomputed within the neural network training procedure but we found it expedient to freeze the noise during nn optimization so that the optimal feedback gains need not be passed to the nn training process similar to recent stochastic optimization approaches we introduce quadratic proximal regularization terms weighted by rate that keep the solution of the current iteration close to its previous optimal value the resulting algorithm is algorithm distributed stochastic optimization sample sensor noise for each and optimize trajectories sec argminx ci si ai solve neural network regression sec argminθ repeat thus we have reduced complex policy search problem in to an alternating sequence of independent trajectory optimization and neural network regression problems each of which are wellstudied and allow the use of existing implementations while previous work used admm or dual gradient descent to solve similar optimization problems it is to adapt them to asynchronous and stochastic setting we have despite potentially slower rate we still observe convergence as shown in section trajectory optimization we wish to find trajectories that start with particular initial conditions and execute the task while satisfying physical realism of 
the character motion the existing approach we use is contactinvariant optimization cio which is direct trajectory optimization method based on inverse dynamics define the total cost for trajectory φt where is function that extracts vector of features such as root forces contact distances control torques etc from the trajectory at time and is the state cost over these features physical realism is achieved by satisfying equations of motion and force complementarity conditions at every point in the trajectory where is the distance of the contact to the ground and is the contact friction cone these constraints are implemented as soft constraints as in and are included in initial conditions are also implemented as soft constraints in additionally we want to make sure the task is satisfied such as moving to particular location while minimizing effort these task costs are the same for all our experiments and are described in section importantly cio is able to find solutions with trivial initializations which makes it possible to have broad range of characters and behaviors without requiring controllers or motion capture for initialization optimal trajectory the trajectory optimization problem consists of finding the optimal trajectory parameters that minimize the total cost with objective now folded into for simplicity argmin we solve the above optimization problem using newton method which requires the gradient and hessian of the total cost function using the chain rule these quantities are cx ctφ φtx cxx φtx ctφφ φtx ctφ φtxx φtx ctφφ φtx where the truncation of the last term in cxx is the common hessian approximation we choose cost functions for which cφ and cφφ can be calculated analytically on the other hand φx is calculated by finite differencing the optimum can then be found by the following recursion cxx cx because this optimization is only step in algorithm we don run it to convergence and instead take between one and ten iterations optimal feedback gains in addition to the optimal trajectory we also need to find optimal feedback gains as necessary to generate optimal actions for noisy inputs in while these feedback gains are byproduct of indirect trajectory optimization methods such as lqg they are not an obvious result of direct trajectory optimization methods like cio while we can use linear quadratic gaussian lqg pass around our optimal solution to compute these gains this is inefficient as it does not make use of computation already performed during direct trajectory optimization moreover we found the resulting process can produce very large and feedback gains one could change the objective function for the lqg pass when calculating feedback gains to make them smoother for example by adding explicit trajectory smoothness cost but then the optimal actions would be using feedback gains from different objective instead we describe perturbation method that reuses computation done during direct trajectory optimization also producing gains this is general method for producing feedback gains that stabilize resulting optimal trajectories and can be useful for other applications suppose we perturb certain aspect of optimal trajectory such that the sensory state changes we wish to find how the optimal action will change given this perturbation we can enforce the perturbation with soft constraint of weight resulting in an augmented total cost ks let be the optimum of the augmented total cost for near as is the case with local feedback control the minimizer of augmented cost is the minimizer of 
quadratic around optimal trajectory cxx λs cx λs sx where all derivatives are calculated around differentiating the above cxx λs sx cxx sx sx cxx sx sx where the last equality follows from woodbury identity and has the benefit of reusing cxx which is already computed as part of trajectory optimization the optimal feedback gains for are ax note that sx and ax are subsets of φx and are already calculated as part of trajectory optimization thus computing optimal feedback gains comes at very little additional cost our approach produces softer feedback gains according to parameter without modifying the cost function the intuition is that instead of holding perturbed initial state fixed as lqg does for example we make matching the initial state soft constraint by weakening this constraint we can modify initial state to better achieve the master cost function without using very aggressive feedback neural network policy regression after performing trajectory optimization we perform standard regression to fit neural network to the noisy fixed input and output pairs as for each timestep and trajectory our neural network policy has total of layers hidden layer activation function tanh in the present work and hidden units hk for layer to learn model that is robust to small changes in neural state we add independent gaussian noise with variance to the neural activations at each layer during training wager et al observed this noise model makes hidden units tend toward saturated regions and less sensitive to precise values of individual units as with the trajectory optimization we do not run the neural network trainer to convergence but rather perform only single pass of batched stochastic gradient descent over the dataset before updating the parameters in step of algorithm all our experiments use hidden layer neural networks with hidden units in each layer other network sizes are evaluated in section the neural network weight matrices are initialized with spectral radius of just above similar to this helps to make sure initial network dynamics are stable and do not vanish or explode training trajectory generation to train neural network for interactive use we required data set that includes dynamically changing task goal state the task in this case is the locomotion of character to movable goal position controlled by the user our character goal position was always set to be the origin which encodes the characters state position in the goal position coordinate frame thus the origin may shift relative to the character but this keeps behavior invariant to the global frame of reference our trajectory generation creates dataset consisting of trials and segments each trial starts with init reference physical pose and null recurrent memory and must reach goal location after generating an optimal trajectory according to section random timestep is chosen to branch new segment with used as the initial state new goal location is also chosen randomly for optimal trajectory this process represents the character changing direction at some point along its original trajectory plan interaction in this case is simply new change in goal position this technique allows for our initial states and goals to come from the distribution that reflects the character typical motion in all our experiments we use between to trials each with branched segments distributed training architecture our training algorithm was implemented in asynchronous distributed architecture utilizing gpu for neural network training simple parallelism was 
achieved by distributing the trajectory optimization processes to multiple node machines while the resulting data was used to train the nn policy on single gpu node amazon web service instances provided the nodes for optimization while instance provided the gpu utilizing with the gpu instance at the center network file system server distributes the training data and network parameters to necessary processes within the cluster each optimization node is assigned subset of the total trials and segments for the given task this simple usage of files for data storage meant no supporting infrastructure other than standard file locking for concurrency we used custom gpu implementation of stochastic gradient descent sgd to train the neural network control policy for the first training epoch all trajectories and action sequences are loaded onto the gpu randomly shuffling the order of the frames then the neural network parameters are updated using batched sgd in single pass over the data to reduce the objective in at the start of subsequent training epochs trajectories which have been updated by one of the trajectory optimization processes and injected with new sensor noise are reloaded although this architecture is asynchronous the proximal regularization terms in the objective prevent the training data and policy results from changing too quickly and keep the optimization from diverging as result we can increase our training performance linearly for the size of cluster we are using to about optimization nodes per gpu machine we run the overall optimization process until the average of trajectory optimization iterations has been reached across all machines this usually results in about neural network training epochs and takes about hours to complete depending on task parameters and number of nodes policy execution once we find the optimal policy parameters offline we can execute the resulting policy in realtime under user control unlike methods like motion graphs or gaussian processes we do not need to keep any trajectory data at execution time starting with an initial we desstate des compute sensory state and query the policy without noise for the desired action to evolve the physical state of the system we directly optimize the next state to match while satisfying equations of motion argmin des subject to note that this is simply the optimization problem with horizon which can be solved at rates and does not require any additional implementation this approach is reminiscent of control in computer graphics and robotics because our physical state evolution is result of optimization similar to an implicit integrator it does not suffer from instabilities or divergence as euler integration would and allows the use of larger timesteps we use of in all our experiments in the current work the dynamics constraints are enforced softly and thus may include some root forces in simulation results this algorithm was applied to learn policy that allows interactive locomotion for range of very different characters we used single network architecture and parameters to create all controllers without any specialized initializations while the task is locomotion different character types exhibit very different behaviors the experiments include swimming and flying characters as well as biped and quadruped walking tasks unlike in scenarios it is much easier for characters to fall or go into unstable regions yet our method manages to learn successful controllers we strongly suggest viewing the supplementary video for 
examples of resulting behaviors the swimming creature featured four fins with two degrees of freedom each it is propelled by lift and drag forces for simulated water density of to move orient or maintain position controller learned to sweep down opposite fins in cyclical patter as in treading water the bird creature was modification of the swimmer with opposing wings and the medium density changed changed to that of air the learned behavior that emerged is cyclical flapping motion more vigorous now because of the lower medium density as well as utilization of lift forces to coast to distant goal positions and modulation of flapping speed to change altitude three bipedal creatures were created to explore the controller function with respect to contact forces two creatures were akin to humanoid one large and one small both with arms while the other had very wide torso compared to its height all characters learned to walk to the target location and orientation with regular cyclic gait the same algorithm also learned stereotypical trot gait for and quadrupeds this alternating footstep cyclic behavior for bipeds or trot gaits for quadrupeds emerged without any user input or the costs in the trajectory optimization were to reach goal position and orientation while minimizing torque usage and contact force magnitudes we used the mujoco physics simulator engine for our dynamics calculations the values of the algorithmic constants used in all experiments are σε σγ comparative evaluation we show the performance of our method on biped walking task in figure under full method case to test the contribution of our proposed joint optimization technique we compared our algorithm to naive neural network training on static optimal trajectory dataset we disabled the neural network and generated optimal trajectories as according to then we performed our regression on this static data set with no trajectories being the results are shown in no joint case we see that at test time our full method performs two orders of magnitude better than static training to test the contribution of noise injection we used our full method but disabled sensory and hidden unit noise sections and the results are under no noise case we observe typical overfitting with good training performance but very poor test performance in practice both ablations above lead to policy rollouts that quickly diverge from expected behavior additionally we have compared the performance of different policy network architectures on the biped walking task by varying the number of layers and hidden units the results are shown in table we see that hidden layers of units gives the best tradeoff control mpc is another potential choice of controller for character behavior in fact the trajectory costs for both mpc and our method are very similar the resulting trajectories however end up being different mpc creates effective trajectories that are not cyclical both are shown in figure for bird character this suggests significant nullspace of task solutions but from all these solutions our joint optimization through the cost terms of matching the neural network output act to regularize trajectory optimization to predictable and less chaotic behaviors figure performance of our full method and two ablated configurations as training progresses over neutral network updates mean and variance of the error is over training and test trials neurons neurons neurons neurons neurons layer layers layers layers increasing layers with neurons per layer increasing neurons per 
layer with layers table mean and variance of joint position error on test rollouts with our method after training with different neural network configurations conclusions and future work we have presented an automatic way of generating neural network parameters that represent control policy for physically consistent interactive character control only requiring dynamical character model and task description using both trajectory optimization and stochastic neural networks together combines correct behavior with interactive use furthermore the same algorithm and controller architecture can provide interactive control for multiple creature morphologies while the behavior of the characters reflected efficient task completion in this work additional modifications could be made to affect the style of behavior costs during trajectory optimization can affect how task is completed incorporation of muscle actuation effects into our character models may result in more biomechanically plausible actions for that biologically based character in addition to changing the character physical characteristics we could explore different neural network architectures and how they compare to biological systems with this work we have networks that enable diverse physical action which could be augmented to further reflect biological sensorimotor systems this model could be used to experiment with the effects of sensor delays and the resulting motions for example this work focused on locomotion of different creatures with the same algorithm previous work has demonstrated behaviors such as getting up climbing and reaching with the same trajectory optimization method policies using this algorithm could allow interactive use of these behaviors as well extending beyond character animation this work could be used to develop controllers for robotics applications that are robust to sensor noise and perturbations if the trained character model accurately reflects the robot physical parameters figure typical joint angle trajectories that result from mpc and our method while both trajectories successfully maintain position for bird character our method generates trajectories that are cyclic and regular references chen hessian matrix hessian matrix siam numerical analysis geyer and herr model that encodes principles of legged mechanics produces human walking dynamics and muscle activities neural systems and rehabilitation engineering ieee transactions on grzeszczuk terzopoulos and hinton neuroanimator fast neural network emulation and control of models in proceedings of the annual conference on computer graphics and interactive techniques siggraph pages new york ny usa acm hinton srivastava krizhevsky sutskever and salakhutdinov improving neural networks by preventing of feature detectors arxiv preprint hoerzer legenstein and maass emergence of complex computational structures from chaotic neural networks through hebbian learning cerebral cortex huh and todorov motor control using recurrent neural networks in adaptive dynamic programming and reinforcement learning adprl ieee symposium on pages march ijspeert central pattern generators for locomotion control in animals and robots review ju won lee choi noh and choi control of flapping flight acm trans levine and koltun learning complex neural network policies with trajectory optimization in icml proceedings of the international conference on machine learning mnih kavukcuoglu silver graves antonoglou wierstra and riedmiller playing atari with deep reinforcement learning corr 
mordatch and todorov combining the benefits of function approximation and trajectory optimization in robotics science and systems rss mordatch todorov and discovery of complex behaviors through contactinvariant optimization acm transactions on graphics tog rebula neuhaus bonnlander johnson and pratt controller for the littledog quadruped walking on rough terrain in robotics and automation ieee international conference on pages ieee schulman levine moritz jordan and abbeel trust region policy optimization corr sutskever martens dahl and hinton on the importance of initialization and momentum in deep learning in proceedings of the international conference on machine learning volume pages may todorov erez and tassa mujoco physics engine for control in iros pages vincent larochelle bengio and manzagol extracting and composing robust features with denoising autoencoders pages vukobratovic and borovac point thirty five years of its life humanoid robotics wager wang and liang dropout training as adaptive regularization in advances in neural information processing systems nips wang fleet and hertzmann optimizing walking controllers for uncertain inputs and environments acm trans july yin loken and van de panne simbicon simple biped locomotion control acm trans article 
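Collecting the training components described above — sensory-noise injection, trajectory optimization that also returns feedback gains, first-order correction of the regression targets, and a single SGD pass over the data — one synchronous iteration of the alternating scheme can be sketched as follows. All interfaces here are assumptions standing in for the paper's CIO optimizer and GPU trainer, and the asynchronous/distributed aspects and proximal terms are omitted.

```python
import numpy as np

def alternating_iteration(trajs, policy, optimize_trajectory, fit_policy,
                          noise_std, rng=None):
    """One synchronous iteration of the alternating training scheme:
      1. re-optimize every trajectory against the current policy (the CIO step,
         which also yields per-timestep feedback gains G_t ~ da/ds),
      2. sample fresh additive Gaussian sensory noise for every timestep,
      3. regress the policy onto (noisy state, first-order-corrected action) pairs.

    Assumed placeholder interfaces (not the paper's code):
      optimize_trajectory(traj, policy) -> (S, A, G) with S: (T, ds) states,
          A: (T, da) actions, G: (T, da, ds) optimal feedback gains
      fit_policy(policy, inputs, targets) -> updated policy (one SGD pass)
    """
    rng = rng if rng is not None else np.random.default_rng()
    inputs, targets = [], []
    for traj in trajs:
        S, A, G = optimize_trajectory(traj, policy)
        eps = noise_std * rng.standard_normal(S.shape)   # additive sensory noise
        for t in range(S.shape[0]):
            inputs.append(S[t] + eps[t])                 # noisy network input
            targets.append(A[t] + G[t] @ eps[t])         # a(s + eps) ~ a + (da/ds) eps
    return fit_policy(policy, np.stack(inputs), np.stack(targets))
```

The noise is frozen while the network is fit, exactly so that the corrected targets can be precomputed from the gains and need not be re-evaluated inside the regression step.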
submodular hamming metrics jennifer rishabh bethany rahul jeff university of washington dept of ee seattle university of washington dept of applied math seattle jengi rkiyer herwaldt rkidambi bilmes abstract we show that there is largely unexplored class of functions positive polymatroids that can define proper discrete metrics over pairs of binary vectors and that are fairly tractable to optimize over by exploiting submodularity we are able to give hardness results and approximation algorithms for optimizing over such metrics additionally we demonstrate empirically the effectiveness of these metrics and associated algorithms on both metric minimization task form of clustering and also metric maximization task generating diverse lists introduction good distance metric is often the key to an effective machine learning algorithm for instance when clustering the distance metric largely defines which points end up in which clusters similarly in learning the distance between different labelings can contribute as much to the definition of the margin as the objective function itself likewise when constructing diverse lists the measure of diversity is key to ensuring meaningful differences between list elements we consider distance metrics over binary vectors if we define the set then each can seen as the characteristic vector of set where if and otherwise for sets with representing the symmetricp difference the hamming distance is then dh hamming distance between two vectors assumes that each entry difference contributes value one weighted hamming distance generalizes this slightly allowing each entry unique weight the mahalanobis distance further extends this for many practical applications however it is desirable to have entries interact with each other in more complex and ways than hamming or mahalanobis allow yet arbitrary interactions would result in functions whose optimization would be intractable in this work therefore we consider an alternative class of functions that goes beyond pairwise interactions yet is computationally feasible is natural for many applications and preserves metricity given set function we can define distortion between two binary vectors as follows df by asking to satisfy certain properties we will arrive at class of discrete metrics that is feasible to optimize and preserves metricity we say that is positive if whenever is normalized if is monotone if for all is subadditive if for all is modular if for all and is submodular if for all if we assume that is positive normalized monotone and subadditive then df is metric see theorem but without useful computational properties if is positive normalized monotone and modular then we recover the weighted hamming distance in this paper we assume that is positive normalized monotone and submodular and hence also subadditive these conditions are sufficient to ensure the metricity of df but allow for significant generalization over the weighted hamming distance also thanks to the properties of submodularity this class yields efficient optimization algorithms with guarantees table hardness for and uc stands for unconstrained and card stands for the entry open implies that the problem is potentially solvable uc card homogeneous heterogeneous open homogeneous heterogeneous table approximation guarantees of algorithms for and implies that no guarantee holds for the corresponding pair only works for the homogeneous case while all other algorithms work in both cases nion plit uc card uc ajor in card and et uc for practical machine 
learning problems in what follows we will refer to normalized monotone submodular functions as polymatroid functions all of our results will be concerned with positive polymatroids we note here that despite the restrictions described above the polymatroid class is in fact quite broad it contains number of natural choices of diversity and coverage functions such as set cover facility location saturated coverage and functions given positive polymatroid function we refer to df as submodular hamming sh distance we study two optimization problems involving these metrics each fi is positive polymatroid each bi and denotes combinatorial constraint min fi and max fi we will use as shorthand for the fm for the sequence bm and psequence for the objective function fi we will also make distinction between the homogeneous case where all fi are the same function and the more general heterogeneous case where each fi may be distinct in terms of constraints in this paper theory we consider only the unconstrained and the settings in general though could express more complex concepts such as knapsack constraints or that solutions must be an independent set of matroid or cut or spanning tree path or matching in graph intuitively the problem can be thought of as problem the minimizing should be as similar to the bi as possible since penalty of fi is paid for each difference analogously the problem can be thought of as diversification problem the maximizing should be as distinct from all bi as possible as fi is awarded for each difference given modular fi the weighted hamming distance case these optimization problems can be solved exactly and efficiently for many constraint types for the more general case of submodular fi we establish several hardness results and offer new approximation algorithms as summarized in tables and our main contribution is to provide to our knowledge the first systematic study of the properties of submodular hamming sh metrics by showing metricity describing potential machine learning applications and providing optimization algorithms for and the outline of this paper is as follows in section we offer further motivation by describing several applications of and to machine learning in section we prove that for positive polymatroid function the distance df is metric then in sections and we give hardness results and approximation algorithms and in section we demonstrate the practical advantage that submodular metrics have over modular metrics for several applications applications we motivate and by showing how they occur naturally in several applications clustering many clustering algorithms including for example use distance functions in their optimization if each item to be clustered is represented by binary feature vector bi then counting the disagreements between bi and bj is one natural distance function defining sets bi bi this count is equivalent to the hamming distance consider document clustering application where is the set of all features and bi is the set of features for document hamming distance has value both when bi submodular synapse and when bi submodular modular intuitively however smaller distance seems warranted in the latter case since the difference is only in one rather than two distinct concepts the submodular hamming distances we propose in this work can easily capture this type of pbehavior given feature clusters one can define submodular function as applying this with bi if the documents differences are confined to one cluster the distance is smaller than if 
the differences occur across several word clusters in the case discussed above the distances are and if this submodular hamming distance is used for clustering then the step becomes an instance of the shmin problem that is if cluster contains documents cj then its mean takes exactly the following form µj structured prediction structured support vector machines svms typically rely on hamming distance to compare candidate structures to the true one the margin required between the correct structure score and candidate score is then proportional to their hamming distance consider the problem of segmenting an image into foreground and background let bi be image true set of foreground pixels then hamming distance between bi and candidate segmentation with foreground pixels counts the number of pixels however both and observe poor performance with hamming distance and recent work by shows improved performance with richer distances that are supermodular functions of one potential direction for further enriching image segmentation distance functions is thus to consider functions from within our submodular hamming metrics class these functions have the ability to correct for the that the current distance functions may suffer from when the same kind of difference happens repeatedly for instance if bi differs from only in the pixels local to particular block of the image then current distance functions could be seen as the difference using submodular hamming function the inference step in svm optimization becomes an problem more concretely if the segmentation model is defined by submodular graph cut then we have note that in fact observes superior results with this type of inference using special case of submodular hamming metric for the task of image classification diverse for some machine learning tasks rather than finding model single highestscoring prediction it is helpful to find diverse set of predictions for instance showed that for image segmentation and pose tracking diverse set of solutions tended to contain better predictor than the top solutions additionally finding diverse solutions can be beneficial for accommodating user interaction for example consider the task of selecting photos to summarize the photos that person took while on vacation if the model best prediction set of images is rejected by the user then the system should probably present substantially different prediction on its second try submodular functions are natural model for several summarization problems thus given submodular summarization model and set of existing diverse summaries one could find kth summary to present to the user by solving ak if and are both positive polymatroids then this constitutes an instance of the problem properties of the submodular hamming metric we next show several interesting properties of the submodular hamming distance proofs for all theorems and lemmas can be found in the supplementary material we begin by showing that any positive polymatroid function of is metric in fact we show the more general result that any positive normalized monotone subadditive function of is metric this result is known see for instance chapter of but we provide proof in the supplementary material for completeness theorem let be positive normalized monotone subadditive function then df is metric on while these subadditive functions are metrics their optimization is known to be very difficult the simple subadditive function example in the introduction of shows that subadditive minimization is inapproximable and 
theorem of states that no algorithm exists for subadditive maximization that has an approximation factor better than by contrast submodular minimization is in the unconstrained setting and simple greedy algorithm from gives for maximization of positive polymatroids subject to cardinality constraint many other approximation results are also known for submodular function optimization subject to various other types of constraints thus in this work we restrict ourselves to positive polymatroids corollary let be positive polymatroid function then df is metric on this restriction does not entirely resolve the question of optimization hardness though recall that the optimization in and is with respect to but that the fi are applied to the sets unfortunately the function gb for fixed set is neither necessarily submodular nor supermodular in the next example demonstrates this violation of submodularity example to be submodular the function gb must satisfy the following gb gb gb consider condition for all sets gb the positive polymatroid function and let consist of two elements then for and with gb gb gb gb although gb can be we are interestingly still able to make use of the fact that is submodular in to develop approximation algorithms for and minimization of the submodular hamming metric in this section we focus on the problem we consider the four cases from table the constrained and unconstrained settings as well as the homogeneous case where all fi are the same function and the heterogeneous case before diving in we note that in all cases we assume not only the natural oracle access to the objective function fi the ability to evaluate for any but also knowledge of the bi the sequence theorem shows that without knowledge of is inapproximable in practice requiring knowledge of is not significant limitation for all of the applications described in section is naturally known theorem let be positive polymatroid function suppose that the subset is fixed but unknown and gb if we only have an oracle for gb then there is no approximation algorithm for minimizing gb up to any polynomial approximation factor unconstrained setting submodular minimization is in the unconstrained setting since sum of submodular functions is itself submodular at first glance it might then seem that the sum of fi in can be minimized in however recall from example that the fi are not necessarily submodular in the optimization variable this means that the question of hardness even in the unconstrained setting is an open question theorem resolves this question for the heterogeneous case showing that it is and that no algorithm can do better than guarantee the question of hardness in the homogeneous case remains open theorem the unconstrained and heterogeneous version of is moreover no algorithm can achieve an approximation factor better than since unconstrained is it makes sense to consider approximation algorithms for this problem we first provide simple nion plit see algorithm this algorithm splits into then applies standard submodular minimization see to the split function theorem shows that this algorithm is for it relies on lemma which we state first lemma let be positive monotone subadditive function then for any algorithm nion plit algorithm ajor in input define fp bi fi bi define output ubmodular pt input repeat set as in equation odular in until output algorithm input for do if bi bi output theorem nion plit is for unconstrained restricting to the homogeneous setting we can provide different algorithm that has better 
approximation guarantee than nion plit this algorithm simply checks the value of pm for each bi and returns the minimizing bi we call this algorithm algorithm theorem gives the approximation guarantee for this result is known as the proof of the guarantee only makes use of metricity and homogeneity not submodularity and these properties are common to much other work we provide the proof in our notation for completeness though theorem for exactly solves unconstrained for is for unconstrained homogeneous constrained setting in the constrained setting the problem becomes more difficult essentially all of the hardness results established in existing work on constrained submodular minimization applies to the constrained problem as well theorem shows that even for simple cardinality constraint and identical fi not only is but also it is hard to approximate with factor better than theorem homogeneous is under cardinality constraints moreover no algorithm can achieve an approximation factor better than where κf denotes the curvature of this holds even when we can also show similar hardness results for several other combinatorial constraints including matroid constraints shortest paths spanning trees cuts etc note that the hardness established in theorem depends on quantity κf which is also called the curvature of submodular function intuitively this factor measures how close submodular function is to modular function the result suggests that the closer the function is being modular the easier it is to optimize this makes sense since with modular function can be exactly minimized under several combinatorial constraints to see this for the case first note that for modular fi the corresponding is also modular lemma formalizes this pm lemma if the fi in are modular then fi is also modular given lemma from the definition of modularity we know that there exists some constant and vector wf rn such that wf from this representation it is clear that can be minimized subject to the constraint by choosing as the set the items corresponding to the smallest entries in wf thus for modular fi or fi with small curvature κfi such constrained minimization is relatively easy having established the hardness of constrained we now turn to considering approximation algorithms for this problem unfortunately the nion plit algorithm from the previous section requires an efficient algorithm for submodular function minimization and no such algorithm exists in the constrained setting submodular minimization is even under simple cardinality constraints similarly the algorithm breaks down in the constrained setting its guarantees carry over only if all the bi are within the constraint set thus for the constrained problem we instead propose algorithm theorem shows that this algorithm has an approximation guarantee and algorithm formally defines the algorithm essentially ajor in proceeds by iterating the following two steps constructing modular upper bound for at the current solution then minimizing to get new consists of superdifferentials of component submodular functions we use the superdifferentials defined as grow and shrink in defining sets as for grow and for shrink the vector that represents the modular can be written fi if fi otherwise where is the gain in when adding to we now state the main theorem characterizing algorithm ajor in performance on pm theorem ajor in is guaranteed to improve the objective value fi at every iteration for any constraint over which modular function can be exactly optimized it has maxi 
approximation guarantee where is the optimal solution of while ajor in does not have guarantee which is possible only in the unconstrained setting the bounds are not too far from the hardness of the constrained setting for example in the cardinality case the guarantee of ajor in is while the hardness shown in theorem is maximization of the submodular hamming metric we next characterize the hardness of the diversification problem and describe approximation algorithms for it we first show that all versions of even the unconstrained homogeneous one are note that this is result maximization of monotone function such as polymatroid is not the maximizer is always the full set but for despite the fact that the fi are monotone with respect to their argument they are not monotone with respect to itself this makes significantly harder after establishing that is we show that no algorithm can obtain an approximation factor better in the unconstrained setting and factor of in the constrained setting finally we provide simple approximation algorithm which achieves factor of for all settings theorem all versions of constrained or unconstrained heterogeneous or homogeneous are moreover no algorithm can obtain factor better than for the unconstrained versions or better than for the versions we turn now to approximation algorithms for the unconstrained setting lemma shows that simply choosing random subset provides in expectation lemma random subset is for in the unconstrained homogeneous or heterogeneous setting an improved approximation guarantee of can be shown for variant of nion plit algorithm if the call to ubmodular pt is call to ubmodular ax algorithm theorem makes this precise for both the unconstrained case and case it might also be of interest to consider more complex constraints such as matroid independence and base constraints but we leave the investigation of such settings to future work pm theorem maximizing fi bi fi bi with greedy algorithm algorithm is for maximizing fi in the unconstrained setting under the cardinality constraint using the randomized greedy algorithm algorithm provides table averaged over the datasets standard deviation hm sp table of wins out of datasets tp hm sp tp experiments to demonstrate the effectiveness of the submodular hamming metrics proposed here we apply them to metric minimization task clustering and metric maximization task diverse application clustering we explore the document clustering problem described in section where the groundset is all unigram features and bi contains the unigrams of document we run clustering and each iteration find the mean for cluster cj by solving µj argmina the constraint requires the mean to contain at least unigrams which helps to create richer and more meaningful cluster centers we compare using the submodular function sm to using hamming distance hm the problem of finding µj above can be solved exactly for hm since it is modular function in the sm case we apply ajor in algorithm as an initial test we generate synthetic data consisting of documents assigned to true clusters we set the number of word features to and partition the features into word classes the in the submodular function ten word classes are associated with each true document cluster and each document contains one word from each of these word classes that is each word is contained in only one document but documents in the same true cluster have words from the same word classes we set the minimum cluster center size to we use initialization and average over 
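For the Hamming-distance baseline, the cluster-center subproblem, minimizing the summed Hamming distance to the cluster's documents subject to the center containing at least a given number of unigrams, can be solved exactly because the objective is modular. The sketch below (function and variable names are ours) keeps every word occurring in more than half of the cluster's documents and pads with the most frequent remaining words to meet the size constraint; for the submodular distance the same subproblem is instead handled approximately by the modular upper-bound (majorize-minimize) algorithm described earlier.

```python
from collections import Counter

def hamming_center(cluster_docs, min_size):
    """Exact minimizer of sum_i |A symmetric-difference B_i| over A with |A| >= min_size.

    Because the Hamming distance is modular, each word can be scored independently:
    including word w costs (n - count(w)) and excluding it costs count(w), so the
    unconstrained optimum keeps exactly the words appearing in more than half of the
    documents; the cardinality constraint is then met by adding the most frequent
    remaining words, which have the smallest marginal cost."""
    n = len(cluster_docs)
    counts = Counter(w for doc in cluster_docs for w in doc)
    center = {w for w, c in counts.items() if 2 * c > n}
    if len(center) < min_size:
        leftovers = sorted((w for w in counts if w not in center),
                           key=lambda w: counts[w], reverse=True)
        center.update(leftovers[:min_size - len(center)])
    return center

docs = [{"graph", "cut", "energy"}, {"graph", "cut", "loss"}, {"graph", "kernel"}]
print(sorted(hamming_center(docs, min_size=3)))
```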
trials within the optimization we enforce that all clusters are of equal size by assigning document to the closest center whose current size is with this setup the average accuracy of hm is while sm is the hm accuracy is essentially the accuracy of random assignment of documents to clusters this makes sense as no documents share words rendering the hamming distance useless in data there would likely be some word overlap though to better model this we let each document contain random sampling of words from the word clusters associated with its document cluster in this case the average accuracy of hm is while sm is the results for sm are even better if randomization is removed from the initialization we simply choose the next center to be one with greatest distance from the current centers in this case the average accuracy of hm is while sm is this indicates that as long as the starting point for sm contains one document from each cluster the sm optimization will recover the true clusters moving beyond synthetic data we applied the same method to the problem of clustering nips papers the initial set of documents that we consider consists of all nips from to we filter the words of given paper by first removing stopwords and any words that don appear at least times in the paper we further filter by removing words that have small value and words that occur in only one paper or in more than of papers we then filter the papers themselves discarding any that have fewer than remaining words and for each other paper retaining only its top by score words each of the remaining papers defines bi set among the bi there are unique words to get the word clusters we first run the word vec code of which generates vector of features for each word and then run clustering with euclidean distance on these vectors to define word clusters we set the center size cardinality constraint to and set the number of document clusters to to initialize we again use with results are averaged over trials while we do not have groundtruth labels for nips paper clusters we can use distances as proxy for cluster goodness lower values indicating tighter clusters are better specifically we compute pk µj with hamming for the average ratio of hm to sm is this indicates that as expected hm does better job of optimizing the hamming loss however with the submodular function for the average ratio of hm to sm is thus sm does significantly better job optimizing the submodular loss papers were downloaded from http application diverse in this section we explore diverse image collection summarization problem as described in section for this problem our goal is to obtain summaries each of size by selecting from set consisting of images the idea is that either the user could choose from among these summaries the one that they find most appealing or more computationally expensive model could be applied to these summaries and choose the best as is described in section we obtain the kth summary ak given the first summaries via ak for we use the facility location function sij where sij is similarity score for images and we compute sij by taking the dot product of the ith and jth feature vectors which are the same as those used by for we compare two different functions the hamming distance hm and the submodular facility location distance sm for hm we optimize via the standard greedy algorithm since the facility location function is monotone submodular this implies an approximation guarantee of for sm we experiment with two algorithms standard 
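To make the sequential summary-extraction step concrete, below is a small sketch of plain greedy maximization of the submodular-Hamming diversity (the sum over previous summaries of the facility-location value of the symmetric difference), with similarities given by dot products of feature vectors. The random features, the summary size, and the choice to seed the very first summary by maximizing facility-location quality alone are our own assumptions; as stated above, plain greedy carries no formal guarantee for this objective but is used because it tends to perform well in practice.

```python
import numpy as np

def facility_location(Y, S):
    """f(Y) = sum_i max_{j in Y} S[i, j]; defined as 0 for the empty set."""
    if not Y:
        return 0.0
    return float(S[:, sorted(Y)].max(axis=1).sum())

def diversity(A, prev_summaries, S):
    """Submodular-Hamming diversity: sum over previous summaries of f(A symmetric-difference A_k)."""
    return sum(facility_location(set(A) ^ set(Ak), S) for Ak in prev_summaries)

def greedy_max(objective, ground, size):
    """Plain greedy: repeatedly add the element with the largest marginal improvement."""
    A = set()
    while len(A) < size:
        A.add(max((j for j in ground if j not in A), key=lambda j: objective(A | {j})))
    return A

# Toy demo on 20 "images" with nonnegative dot-product similarities (stand-in features).
rng = np.random.default_rng(0)
X = rng.random((20, 5))
S = X @ X.T
# The excerpt does not say how the very first summary is chosen; seeding it by maximizing
# facility-location quality alone is an assumption on our part.
summaries = [greedy_max(lambda A: facility_location(A, S), range(20), size=4)]
for _ in range(2):
    summaries.append(greedy_max(lambda A: diversity(A, summaries, S), range(20), size=4))
print([sorted(A) for A in summaries])
```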
greedy and figure an example photo montage zoom in to nion plit algorithm with standard see detail showing summaries of size one greedy as the ubmodular pt function we per row from the hm approach left and the tp will refer to these two cases as single part sp approach right for image collection and two part tp note that neither of these optimization techniques has formal approximation guarantee though the latter would if instead of standard greedy we used the greedy algorithm of we opt to use standard greedy though as it typically performs much better in practice we employ the image summarization dataset from which consists of image collections each of which contains images for each image collection we seek summaries of size for evaluation we employ the score developed by the mean of the summaries provides quantitative measure of their goodness scores are normalized such that score of corresponds to randomly generated summaries while score of is on par with summaries table shows that sp and tp outperform hm in terms of mean providing support for the idea of using submodular hamming distances in place of modular hamming for diverse applications tp also outperforms sp suggesting that the used in nion plit is of practical significance table provides additional evidence of tp superiority indicating that for out of the image collections tp has the best score of the three approaches figure provides some qualitative evidence of tp goodness notice that the images in the green rectangle tend to be more redundant with images from the previous summaries in the hm case than in the tp case the hm solution contains many images with sky theme while tp contains more images with other themes this shows that the hm solution lacks diversity across summaries the quality of the individual summaries also tends to become poorer for the later hm sets considering the images in the red rectangles overlaid on the montage the hm sets contain many images of tree branches here by contrast the tp summary quality remains good even for the last few summaries conclusion in this work we defined new class of distance functions submodular hamming metrics we established hardness results for the associated and problems and provided approximation algorithms further we demonstrated the practicality of these metrics for several applications there remain several open theoretical questions the tightness of the hardness results and the of as well as many opportunities for applying submodular hamming metrics to other machine learning problems the prediction application from section references lloyd least squares quantization in pcm ieee transactions on information theory hazan maji keshet and jaakkola learning efficient random maximum predictors with loss functions in nips szummer kohli and hoiem learning crfs using graph cuts in eccv osokin and kohli perceptually inspired losses for image segmentation in eccv yu and blaschko learning submodular losses with the lovasz hinge in icml batra yadollahpour guzman and shakhnarovich diverse solutions in markov random fields in eccv lin and bilmes class of submodular functions for document summarization in acl tschiatschek iyer wei and bilmes learning mixtures of submodular functions for image collection summarization in nips halmos measure theory springer jegelka and bilmes approximation bounds for inference using cooperative cuts in icml bateni hajiaghayi and zadimoghaddam submodular secretary problem and extensions technical report mit cunningham on submodular function minimization 
combinatorica nemhauser wolsey and fisher an analysis of approximations for maximizing submodular set functions satoru fujishige submodular functions and optimization elsevier edition gusfield algorithms on strings trees and sequences computer science and computational biology cambridge university press iyer jegelka and bilmes curvature and efficient approximation algorithms for approximation and minimization of submodular functions in nips goel karande tripathi and wang approximability of combinatorial problems with submodular cost functions in focs submodularity and curvature the optimal algorithm rims kokyuroku bessatsu svitkina and fleischer submodular approximation algorithms and lower bounds in focs jegelka and bilmes submodularity beyond submodular energies coupling edges in graph cuts in cvpr iyer and bilmes the submodular bregman and divergences with applications in nips iyer jegelka and bilmes fast submodular function optimization in icml buchbinder feldman naor and schwartz tight linear time for unconstrained submodular maximization in focs buchbinder feldman naor and schwartz submodular maximization with cardinality constraints in soda arthur and vassilvitskii the advantages of careful seeding in soda mikolov sutskever chen corrado and dean distributed representations of words and phrases and their compositionality in nips 
universal convex optimization framework alp yurtsever quoc volkan cevher laboratory for information and inference systems epfl switzerland department of statistics and operations research unc usa quoctd abstract we propose new algorithmic framework for prototypical constrained convex optimization template the algorithmic instances of our framework are universal since they can automatically adapt to the unknown continuity degree and constant within the dual formulation they are also guaranteed to have optimal convergence rates in the objective residual and the feasibility gap for each smoothness degree in contrast to existing algorithms our framework avoids the proximity operator of the objective function we instead leverage computationally cheaper operators which are the main workhorses of the generalized conditional gradient gcg methods in contrast to the methods our framework does not require the objective function to be differentiable and can also process additional general linear inclusion constraints while guarantees the convergence rate on the primal problem introduction this paper constructs an algorithmic framework for the following convex optimization template min tf pxq ax ku xpx where rp is convex function rnˆp rn and and are nonempty closed and convex sets in rp and rn respectively the constrained optimization formulation is quite flexible capturing many important learning problems in unified fashion including matrix completion sparse regularization support vector machines and submodular optimization processing the inclusion ax in requires significant computational effort in the largescale setting hence the majority of the scalable numerical solution methods for are of the including decomposition augmented lagrangian and alternating direction methods the efficiency guarantees of these methods mainly depend on three properties of lipschitz gradient strong convexity and the tractability of its proximal operator for instance the proximal operator of proxf pxq arg minz pzq is key in handling while obtaining the convergence rates as if it had lipschitz gradient when the set is absent in other methods can be preferable to algorithms for instance if has lipschitz gradient then we can use the accelerated proximal gradient methods by applying the proximal operator for the indicator function of the set however as the problem dimensions become increasingly larger the proximal tractability assumption can be restrictive this fact increased the popularity of the generalized conditional gradient gcg methods or algorithms which instead leverage the following oracles arg max txx sy gpsqu spx where is convex function when we obtain the linear minimization oracle when rp then the sub gradient of the fenchel conjugate of is in the set the in is often much cheaper to process as compared to the prox operator while the algorithms require to guarantee an objective gap they can not converge when their objective is nonsmooth to this end we propose new algorithmic framework that can exploit the of in lieu of its proximal operator our aim is to combine the flexibility of proximal methods in addressing the general template while leveraging the computational advantages of the methods as result we trade off the computational difficulty per iteration with the overall rate of convergence while we obtain optimal rates based on the oracles we note that the rates reduce to with the sharp operator with the proximal operator when is completely cf definition intriguingly the convergence rates are the same when is 
strongly convex unlike methods our approach can now handle nonsmooth objectives in addition to complex constraint structures as in our framework is universal in the sense the convergence of our algorithms can optimally adapt to the continuity of the dual objective in section without having to know its parameters by continuity we mean the sub gradient of convex function satisfies mν with parameters mν and for all rn the case models the bounded subgradient whereas captures the lipschitz gradient the continuity has recently resurfaced in unconstrained optimization by with universal gradient methods that obtain optimal rates without having to know mν and unfortunately these methods can not directly handle the general constrained template after our initial draft appeared presented new methods for composite minimization minxprp pxq ψpxq relying on smoothness of and the of the methods in do not apply when is in addition they can not process the additional inclusion ax in which is major drawback for machine learning applications our algorithmic framework features gradient method and its accelerated variant that operates on the dual formulation of for the accelerated variant we study an alternative to the universal accelerated method of based on fista since it requires less proximal operators in the dual while the fista scheme is classical our analysis of it with the continuous assumption is new given the dual iterates we then use new averaging scheme to construct the for the constrained template in contrast to the weighting schemes of algorithms our weights explicitly depend on the local estimates of the constants mν at each iteration finally we derive the complexity results our results are optimal since they match the computational lowerbounds in the sense of methods paper organization section briefly recalls formulation of problem with some standard assumptions section defines the universal gradient mapping and its properties section presents the universal gradient methods both the standard and accelerated variants and analyzes their convergence section provides numerical illustrations followed by our conclusions the supplementary material includes the technical proofs and additional implementation details notation and terminology for notational simplicity we work on the rp rn spaces with the euclidean norms we denote the euclidean distance of the vector to closed convex set by dist pu throughout the paper represents the euclidean norm for vectors and the spectral norm for the matrices for convex function we use both for its subgradient and gradient and for its fenchel conjugate our goal is to approximately solve to obtain in the following sense definition given an accuracy level point is said to be an of if and dist kq here we call the primal objective residual and dist kq the feasibility gap preliminaries in this section we briefly summarize the formulation with some standard assumptions for the ease of presentation we reformulate by introducing slack variable as follows min tf pxq ax bu xpx rpk let rx rs and ˆk then we have tz bu as the feasible set of the dual problem the lagrange function associated with the linear constraint ax is defined as lpx λq pxq xλ ax by and the dual function of can be defined and decomposed as follows dpλq min tf pxq xλ ax byu min tf pxq xλ ax byu min xλ xpx xpx rpk loooooooooooooooomoooooooooooooooon loooooomoooooon rpk dx pλq dr pλq where is the dual variable then we define the dual problem of as follows maxn dpλq maxn dx pλq dr pλq λpr λpr fundamental 
assumptions to characterize the relation between and we require the following assumptions assumption the function is proper closed and convex but not necessarily smooth the constraint sets and are nonempty closed and convex the solution set of is nonempty either is polyhedral or the slater condition holds by the slater condition we mean ripzq tpx rq ax bu where ripzq stands for the relative interior of strong duality under assumption the solution set of the dual problem is also nonempty and bounded moreover the strong duality holds universal gradient mappings this section defines the universal gradient mapping and its properties dual reformulation we first adopt the composite convex minimization formulation of for better interpretability as minn tgpλq gpλq hpλqu λpr where and the correspondence between pg hq and pdx dr is as follows gpλq max txλ axy pxqu pλq xpx hpλq max xλ ry pλq rpk since and are generally fista and its analysis are not directly applicable recall the sharp operator defined in then can be expressed as gpλq max xy pxq xλ by xpx and we define the optimal solution to the subproblem above as follows pλq arg max xy pxq xpx the second term depends on the structure of we consider three special cases paq if tr rn κu for given and given norm then hpλq the scaled dual norm of for instance if tr rn κu then hpλq while the induces the sparsity of computing requires the max absolute elements of if tr κu the nuclear norm then hpλq the spectral norm the nuclear norm induces the of computing in this case leads to finding the of which is efficient pbq cone constraints if is cone then becomes the indicator function of its dual cone hence we can handle the inequality constraints and positive semidefinite constraints in for instance if rn then hpλq pλq the indicator function of tλ rn if then hpλq pλq the indicator function of the negative semidefinite matrix cone śp řp pcq separable structures if and are separable xi and pxq fi pxi then the evaluation of and its derivatives can be decomposed into subproblems continuity of the dual universal gradient let be subgradient of which can be computed as pλq next we define mν mν pgq sup where is the smoothness order note that the parameter mν explicitly depends on we are interested in the case and especially the two extremal cases where we either have the lipschitz gradient that corresponds to or the bounded subgradient that corresponds to we require the following condition in the sequel assumption pgq inf mν pgq assumption is reasonable we explain this claim with the following two examples first if is subdifferentiable and is bounded then is also bounded indeed we have pλq dx supt ax hence we can choose and pgq second if is uniformly convex with the convexity parameter µf and the degree pxq µf for all rp then defined by satisfies with and pgq as shown in in particular if is µf convex then and mν pgq which is the lipschitz constant of the gradient the step for the dual problem given rn and mk we define mk as an approximate quadratic surrogate of then we consider the following update rule λk arg minn qmk pλ hpλq proxm qmk pλ λpr for given accuracy we define we need to choose the parameter mk such that qmk is an approximate upper surrogate of gpλq qmk pλ λk δk for some rn and δk if and mν are known then we can defined by in this case qm set mk is an upper surrogate of in general we do not know and mν hence mk can be determined via backtracking procedure universal gradient methods we apply the universal gradient mappings to the dual problem and propose 
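The two norm-ball cases in (a) admit one-line sharp operators, which is what makes them computationally attractive: for a scaled l1-ball only the largest-magnitude entry of the dual vector is needed, and for a nuclear-norm ball only a top singular-vector pair. The sketch below is our own illustration of these two oracles (a dense SVD stands in for the power or Lanczos methods one would use at scale).

```python
import numpy as np

def sharp_l1_ball(z, kappa):
    """argmax over { ||r||_1 <= kappa } of <z, r>: put all the budget on a largest-magnitude entry.

    Only the maximum absolute element of z is needed, which is why this oracle is so cheap."""
    i = int(np.argmax(np.abs(z)))
    r = np.zeros_like(z, dtype=float)
    r[i] = kappa * np.sign(z[i]) if z[i] != 0 else kappa
    return r

def sharp_nuclear_ball(Z, kappa):
    """argmax over { ||R||_* <= kappa } of <Z, R>: kappa times a top singular-vector pair of Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # a power method would suffice in practice
    return kappa * np.outer(U[:, 0], Vt[0, :])

rng = np.random.default_rng(0)
z, Z = rng.normal(size=6), rng.normal(size=(4, 5))
print(sharp_l1_ball(z, kappa=2.0))
print(np.round(sharp_nuclear_ball(Z, kappa=1.0), 3))
```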
an averaging scheme to construct for approximating then we develop an accelerated variant based on the fista for approximating scheme and construct another primal sequence universal gradient algorithm our algorithm is shown in algorithm the dual steps are simply the universal gradient method in while the new primal step allows to approximate the solution of first computing pλk at step requires the solution pλk λk for many and we can compute pλk efficiently and often in closed form algorithm universal gradient method punipdgradq initialization choose an initial point rn and desired accuracy level set and estimate value such that for to kmax compute primal solution pλk λk form pλk set for to imax perform the following steps compute the trial point λk proxm λk mk if the following condition holds gpλk qmk pλk λk then set ik and terminate the loop otherwise set mk end of set λk λk ik and mk mk ik compute wk sk wk and γk sk compute γk γk pλk end for output return the primal approximation for second in the procedure we require the solution λk at step and the evaluation of gpλk the total computational cost depends on the proximal operator of and the evaluations of we prove below that our algorithm requires two oracle queries of on average theorem the primal sequence generated by the algorithm satisfies dist kq dist kq is defined by is an arbitrary dual solution and is the desired accuracy where the analytical complexity we establish the total number of iterations kmax to achieve an of the supplementary material proves that ffi ffi ffi ffi fl inf kmax fl where max this complexity is optimal for but not for at each iteration the linesearch procedure at step requires the evaluations of the supplementary material bounds the total number pkq of oracle queries including the function and its gradient evaluations up to the kth iteration as follows pkq inf mν hence we have pkq we require approximately two oracle queries at each iteration on the average accelerated universal gradient method we now develop an accelerated scheme for solving our scheme is different from in two key aspects first we adopt the fista scheme to obtain the dual sequence since it requires less prox operators compared to the fast scheme in second we perform the after computing which can reduce the number of the computations of and note that the application of fista to the dual function is not novel per se however we claim that our theoretical characterization of this classical scheme based on the continuity assumption in the composite minimization setting is new algorithm accelerated universal gradient method paccunipdgradq initialization choose an initial point rn and an accuracy level set and estimate value such that for to kmax compute primal solution form set for to imax perform the following steps compute the trial point λk proxm mk if the following condition holds gpλk qmk pλk then ik and terminate the loop otherwise set mk end of tk set λk λk ik and mk mk ik compute wk wk and γk wk compute tk and update λk ttkk λk λk γk γk compute end for for output return the primal approximation complexity the complexity of algorithm remains essentially the same as that of algorithm generated by the algorithm satisfies theorem the primal sequence pk kq dist pk pk kq dist is defined by is an arbitrary dual solution and is the desired accuracy where the analytical complexity the supplementary material proves the following complexity of algorithm to achieve an ffi fi ffi ffi ffi fl inf kmax fl this complexity is optimal in the sense of black box 
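The essence of the line-search steps of Algorithm 1 is a backtracking search for a surrogate constant M_k that is accepted under an inexactness slack tied to the target accuracy, which is what lets the method adapt to the unknown Holder continuity of the dual gradient. The following generic sketch illustrates that idea on a toy unconstrained problem; it omits the dual prox step and the primal weighted averaging of the actual method, and the specific acceptance test (including the epsilon/2 slack) follows the universal gradient framework as we recall it, so constants may differ from the paper's.

```python
import numpy as np

def universal_gradient(f, subgrad, x0, eps, iters=200, M0=1.0):
    """Generic universal-gradient loop with inexact backtracking.

    At each step, M is doubled until the smoothed descent condition
        f(x_new) <= f(x) + <v, x_new - x> + (M/2)||x_new - x||^2 + eps/2
    holds for x_new = x - v/M, where v is a (sub)gradient at x."""
    x, M = np.asarray(x0, dtype=float), M0
    for _ in range(iters):
        v, fx = subgrad(x), f(x)
        M = max(M / 2.0, 1e-12)          # warm-start the line search from half the last accepted M
        while True:
            x_new = x - v / M
            if f(x_new) <= fx + v @ (x_new - x) + 0.5 * M * np.sum((x_new - x) ** 2) + eps / 2:
                break
            M *= 2.0
        x = x_new
    return x

# Demo on a nonsmooth objective f(x) = ||Ax - b||_1 (bounded subgradients, i.e. nu = 0).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 3)), rng.normal(size=20)
f = lambda x: np.abs(A @ x - b).sum()
subgrad = lambda x: A.T @ np.sign(A @ x - b)
x_hat = universal_gradient(f, subgrad, np.zeros(3), eps=1e-2)
print(round(f(x_hat), 3))
```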
models the procedure at step of algorithm also terminates after finite number of iterations similar to algorithm algorithm requires gradient query and ik function evaluations of at each iteration the supplementary material proves that the number of oracle queries in algorithm is upperbounded as follows pkq pk pmν roughly speaking algorithm requires approximately two oracle query per iteration on average numerical experiments this section illustrates the scalability and the flexibility of our framework using some applications in the quantum tomography qt and the matrix completion mc quantum tomography with pauli operators we consider the qt problem which aims to extract information from physical quantum system quantum system is mathematically characterized by its density matrix which is complex positive semidefinite hermitian matrix where surprisingly we can provably deduce the state from performing compressive linear measurements apxq based on pauli operators while the size of the density matrix grows exponentially in significantly fewer compressive measurements opp log pq suffices to recover pure state density matrix as result of the following convex optimization problem minp ϕpxq trpxq xps where the constraint ensures that is density matrix the recovery is also robust to noise since the objective function has lipschitz gradient and the constraint the spectrahedron is the qt problem provides an ideal scalability test for both our framework and algorithms to verify the performance of the algorithms with respect to the optimal solution in largescale we remain within the noiseless setting however the timing and the convergence behavior of the algorithms remain qualitatively the same under polarization and additive gaussian noise relative solution error iteration relative solution error unipdgrad accunipdgrad frankwolfe objective residual xk objective residual xk computational time iteration computational time figure the convergence behavior of algorithms for the qubits qt problem the solid lines correspond to the theoretical weighting scheme and the dashed lines correspond to the in the weighting step variants to this end we generate random pure quantum state and we take log random pauli measurements for qubits system this corresponds to dimensional problem with measurements we recast into by introducing the slack variable apxq we compare our algorithms the method which has optimal convergence rate guarantees for this problem and its variant computing the requires of pλq while evaluating corresponds to just computing the of pλq via power method all methods use the same subroutine to compute the sharpoperator which is based on matlab eigs function we set for our methods and have in order to stop the algorithms however our algorithms seems insensitive to the choice of for the qt problem figure illustrates the iteration and the timing complexities of the algorithms unipdgrad algorithm with an average of steps per iteration has similar iteration and timing performance as compared to the standard scheme with γk pk the variant of improves over the standard one however our accelerated variant with an average of steps is the clear winner in terms of both iterations and time we can empirically improve the performance of our algorithms even further by adapting similar strategy in the weighting step as by choosing the weights wk in greedy fashion to minimize the objective function the practical improvements due to appear quite significant matrix completion with movielens dataset to demonstrate the 
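For the quantum-tomography experiment the sharp operator reduces, as stated above, to a top-eigenvector computation over the spectrahedron of unit-trace positive semidefinite matrices. A minimal sketch of that oracle follows (ours; a dense eigendecomposition stands in for the power method or eigs routine used in the experiments).

```python
import numpy as np

def sharp_spectrahedron(Z):
    """argmax over { X psd, tr X = 1 } of <Z, X>: the rank-one projector onto a top eigenvector.

    Only the Hermitian part of Z matters, since X is Hermitian; in practice a few power-method
    (or Lanczos) iterations replace the dense eigendecomposition below."""
    H = (Z + Z.conj().T) / 2
    w, V = np.linalg.eigh(H)
    v = V[:, -1]                      # eigenvector of the largest eigenvalue
    return np.outer(v, v.conj())

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))
X = sharp_spectrahedron(Z)
print(round(np.trace(X).real, 6), np.linalg.matrix_rank(X))   # trace 1, rank 1
```

The analogous oracle for the nuclear-norm-constrained matrix-completion experiment below is the rank-one top singular pair sketched earlier.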
flexibility of our framework we consider the popular matrix completion mc application in mc we seek to estimate matrix rpˆl from its subsampled entries rn where is the sampling operator apxq unipdgrad accunipdgrad frankwolfe iteration rmse rmse rmse iteration iteration computational time min figure the performance of the algorithms for the mc problems the dashed lines correspond to the in the weighting step variants and the empty and the filled markers correspond to the formulation and respectively convex formulations involving the nuclear norm have been shown to be quite effective in estimating matrices from limited number of measurements for instance we can solve min ϕpxq apxq xprpˆl with methods where is tuning parameter which may not be available priori we can also solve the following version min ψpxq apxq xprpˆl while the nonsmooth objective of prevents the tuning parameter it clearly burdens the computational efficiency of the convex optimization algorithms we apply our algorithms to and using the movielens dataset algorithms can not handle and only solve for this experiment we did not the data and took the default ub test and training data partition we start out algorithms form we set the target accuracy and we choose the tuning parameter as in we use lansvd function matlab version from propack to compute the top singular vectors and simple implementation of the power method to find the top singular value in the both with relative error tolerance the first two plots in figure show the performance of the algorithms for our metrics are the normalized objective residual and the root mean squared error rmse calculated for the test data since we do not have access to the optimal solutions we approximated the optimal values and by iterations of accunipdgrad other two plots in figure compare the performance of the formulations and which are represented by the empty and the filled markers respectively note that the dashed line for accunipdgrad corresponds to the variant where the weights wk are chosen to minimize the feasibility gap additional details about the numerical experiments can be found in the supplementary material conclusions this paper proposes new algorithmic framework that combines the flexibility of proximal methods in addressing the general template while leveraging the computational advantages of the methods the algorithmic instances of our framework are universal since they can automatically adapt to the unknown continuity properties implied by the template our analysis technique unifies nesterov universal gradient methods and methods to address the more broadly applicable setting the hallmarks of our approach includes the optimal complexity and its flexibility to handle nonsmooth objectives and complex constraints compared to existing algorithm as well as algorithms while essentially preserving their low cost iteration complexity acknowledgments this work was supported in part by erc future proof snf and snf we would like to thank stephen becker of university of colorado at boulder for his support in preparing the numerical experiments references jaggi revisiting sparse convex optimization mach learn res workshop conf vol pp cevher becker and schmidt convex optimization for big data scalable randomized and parallel algorithms for big data analytics ieee signal process vol pp wainwright structured regularizers for problems statistical and computational issues annu review stat and vol pp lan and monteiro of augmented lagrangian methods for convex programming math pp boyd 
parikh chu peleato and eckstein distributed optimization and statistical learning via the alternating direction method of multipliers found and trends in machine learning vol pp combettes and pesquet proximal decomposition method for solving convex variational inverse problems inverse problems vol goldstein esser and baraniuk adaptive hybrid gradient methods for saddle point problems http shefi and teboulle rate of convergence analysis of decomposition methods based on the proximal method of multipliers for convex minimization siam vol pp and cevher constrained convex minimization via excessive gap in advances neural inform process syst montreal canada beck and teboulle fast iterative algorithm for linear inverse problems siam imaging vol pp mar nesterov smooth minimization of functions math vol pp may juditsky and nemirovski solving variational inequalities with monotone operators on domains given by linear minimization oracles math pp mar yu fast gradient algorithms for structured sparsity phd dissertation univ alberta edmonton canada nesterov complexity bounds for methods minimizing the model of objective function core univ catholique louvain belgium tech nesterov universal gradient methods for convex optimization problems math vol pp nemirovskii and yudin problem complexity and method efficiency in optimization hoboken nj wiley interscience rockafellar convex analysis princeton math series princeton nj princeton univ press gross liu flammia becker and eisert quantum state tomography via compressed sensing phys rev vol pp and recht exact matrix completion via convex optimization commun acm vol pp june jaggi and simple algorithm for nuclear norm regularized problems in proc int conf machine learning haifa israel pp larsen propack software for large and sparse svd calculations available http 
learning from small samples an analysis of simple decision heuristics and marcus buckmann center for adaptive behavior and cognition max planck institute for human development lentzeallee berlin germany ozgur buckmann abstract simple decision heuristics are models of human and animal behavior that use few pieces of only single piece of integrate the pieces in simple ways for example by considering them sequentially one at time or by giving them equal weight we focus on three families of heuristics decision making lexicographic decision making and tallying it is unknown how quickly these heuristics can be learned from experience we show analytically and empirically that substantial progress in learning can be made with just few training samples when training samples are very few tallying performs substantially better than the alternative methods tested our empirical analysis is the most extensive to date employing natural data sets on diverse subjects introduction you may remember that on january in new york city commercial passenger plane struck flock of geese within two minutes of taking off from laguardia airport the plane immediately and completely lost thrust from both engines leaving the crew facing number of critical decisions one of which was whether they could safely return to laguardia the answer depended on many factors including the weight velocity and altitude of the aircraft as well as wind speed and direction none of these factors however are directly involved in how pilots make such decisions as copilot jeffrey skiles discussed in later interview pilots instead use single piece of visual information whether the desired destination is staying stationary in the windshield if the destination is rising or descending the plane will undershoot or overshoot the destination respectively using this visual cue the flight crew concluded that laguardia was out of reach deciding instead to land on the hudson river skiles reported that subsequent simulation experiments consistently showed that the plane would indeed have crashed before reaching the airport simple decision heuristics such as the one employed by the flight crew can provide effective solutions to complex problems some of these heuristics use single piece of information others use multiple pieces of information but combine them in simple ways for example by considering them sequentially one at time or by giving them equal weight our work is concerned with two questions how effective are simple decision heuristics and how quickly can they be learned from experience we focus on problems of comparison where the objective is to decide which of given set of objects has the highest value on an unobserved criterion these problems are of fundamental importance in intelligent behavior humans and animals spend much of their time choosing an object to act on with respect to some criterion whose value is unobserved at the time choosing mate prey to chase an investment strategy for retirement fund or publisher for book are just few examples earlier studies on this problem have shown that simple heuristics are surprisingly accurate in natural environments especially when learning from small samples we present analytical and empirical results on three families of heuristics lexicographic decision making tallying and decision making our empirical analysis is the most extensive to date employing natural environments on diverse subjects our main contributions are as follows we present analytical results on the rate of learning heuristics from 
experience we show that very few learning instances can yield effective heuristics we empirically investigate decision making and find that its performance is remarkable we find that the most robust decision heuristic for small sample sizes is tallying collectively our results have important implications for developing more successful heuristics and for studying how well simple heuristics capture human and animal decision making background the comparison problem asks which of given set of objects has the highest value on an unobserved criterion given number of attributes of the objects we focus on pairwise comparisons where exactly two objects are being compared we consider decision to be accurate if it selects the object with the higher criterion value or either object if they are equal in criterion value in the heuristics literature attributes are called cues we will follow this custom when discussing heuristics the heuristics we consider decide by comparing the objects on one or more cues asking which object has the higher cue value importantly they do not require the difference in cue value to be quantified for example if we use height of person as cue we need to be able to determine which of two people is taller but we do not need to know the height of either person or the magnitude of the difference each cue is associated with direction of inference also known as cue direction which can be positive or negative favoring the object with the higher or lower cue value respectively cue directions and other components of heuristics can be learned in number of ways including social learning in our analysis we learn them from training examples decision making is perhaps the simplest decision method one can imagine it compares the objects on single cue breaking ties randomly we learn the identity of the cue and its direction from training sample among the possible models where is the number of cues we choose the hcue directioni combination that has the highest accuracy in the training sample breaking ties randomly lexicographic heuristics consider the cues one at time in specified order until they find cue that discriminates between the objects that is one whose value differs on the two objects the heuristic then decides based on that cue alone an example is which orders cues with respect to decreasing validity on the training sample where validity is the accuracy of the cue among pairwise comparisons on which the cue discriminates between the objects tallying is voting model it determines how each cue votes on its own selecting one or the other object or abstaining from voting and selects the object with the highest number of votes breaking ties randomly we set cue directions to the direction with highest validity in the training set paired comparison can also be formulated as classification problem let ya denote the criterion value of object xa the vector of attribute values of object and ab ya yb the difference in criterion values of objects and we can define the class of pair of objects as function of the difference in their criterion values if ab if ab ab if ab class value of denotes that object has the higher criterion value that object has the higher criterion value and that the objects are equal in criterion value the comparison problem is intrinsically symmetrical comparing to should give us the same decision as comparing to that is ab should equal ba because the latter equals ab we have the following symmetry constraint for all we can expect better classification accuracy if we 
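The three families of heuristics just described are easy to state as code. The sketch below (attribute names and the toy cue profiles are ours) decides a single pairwise comparison given cue directions and, for the lexicographic case, a cue order; learning those components from training data is treated next.

```python
def single_cue(a, b, cue, direction):
    """Compare on one cue; +1 favors object a, -1 favors b, 0 means the cue does not discriminate."""
    diff = a[cue] - b[cue]
    return direction * (1 if diff > 0 else -1 if diff < 0 else 0)

def lexicographic(a, b, order, directions):
    """Go through the cues in the given order and decide on the first one that discriminates."""
    for cue in order:
        vote = single_cue(a, b, cue, directions[cue])
        if vote != 0:
            return vote
    return 0                                   # no cue discriminates: tie, broken randomly in practice

def tallying(a, b, directions):
    """Each cue votes for one object (or abstains); the object with more votes wins."""
    votes = sum(single_cue(a, b, cue, d) for cue, d in directions.items())
    return 1 if votes > 0 else -1 if votes < 0 else 0

# Toy comparison: which of two cities is larger, given three binary cues with positive directions.
a = {"capital": 1, "airport": 1, "university": 0}
b = {"capital": 0, "airport": 1, "university": 1}
directions = {"capital": +1, "airport": +1, "university": +1}
print(lexicographic(a, b, ["capital", "airport", "university"], directions))  # decides on 'capital'
print(tallying(a, b, directions))                                             # +1 - 1 = 0: a tie
```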
impose this symmetry constraint on our classifier building blocks of decision heuristics we first examine two building blocks of learning heuristics from experience assigning cue direction and determining which of two cues has the higher predictive accuracy the former is important for all three families of heuristics whereas the latter is important for lexicographic heuristics when determining which cue should be placed first both components are building blocks of heuristics in broader use is not limited to the three families of heuristics considered here let and be the objects being compared xa and xb denote their cue values ya and yb denote their criterion values and sgn denote the mathematical sign function sgn is if if and if single training instance is the tuple hsgn xa xb sgn ya yb corresponding to single pairwise comparison indicating whether the cue and the criterion change from one object to the other along with the direction of change for example if xa ya xb yb the training instance is learning cue direction we assume without loss of generality that cue direction in the population is positive we ignore the case where the cue direction in the population is neutral let denote the success rate of the cue in the population where success is the event that the cue decides correctly we examine two probabilities and the former is the probability of correctly inferring the cue direction from set of training instances the latter is the probability of deciding correctly on new unseen instance using the direction inferred from the training instances we define an informative instance to be one in which the objects differ both in their cue values and in their criterion values positive instance to be one in which the cue and the criterion change in the same direction or and negative instance to be one in which the cue and the criterion change in the opposite direction or let be the number of training instances the number of positive training instances and the number of negative training instances our estimate of cue direction is positive if negative if and random choice between positive and negative if given set of independent informative training instances follows the binomial distribution with trials and success probability allowing us to write as follows is even where is the indicator function after one training instance equals after one more instance remains the same this is general property after an odd number of training instances an additional instance does not increase the probability of inferring the direction correctly on new test instance the cue decides correctly with probability if cue direction is inferred correctly and with probability otherwise consequently simple algebra yields the following expected learning rates after training instances with two additional instances the increase in the probability of inferring cue direction correctly is and the increase in the probability of deciding correctly is figure shows and as function of size and success rate the more predictive the cue is the smaller the sample needs to be for desired level of accuracy in both and this is of course desirable property the more useful the cue is the faster we learn how to use it correctly the figure also shows that there are highly diminishing returns from one odd size to the next as the size of the training set increases in fact just few instances make great progress toward the maximum possible the third plot in the figure reveals this property more clearly it shows divided by its maximum possible 
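The curves in the figure can be reproduced directly from the binomial expressions above. A short sketch using scipy (function names are ours):

```python
from scipy.stats import binom

def p_correct_direction(n, p):
    """Probability of inferring a positive cue direction from n informative training instances.

    The number of positive instances is Binomial(n, p); the direction estimate is positive
    if it exceeds n/2 and a fair coin flip if it equals n/2 (possible only for even n)."""
    return binom.sf(n / 2, n, p) + 0.5 * binom.pmf(n / 2, n, p) * (n % 2 == 0)

def expected_accuracy(n, p):
    """Probability of deciding a new instance correctly with the inferred direction:
    p if the direction was inferred correctly, 1 - p otherwise."""
    P = p_correct_direction(n, p)
    return P * p + (1 - P) * (1 - p)

for n in [1, 3, 5, 11, 25]:
    print(n, round(p_correct_direction(n, 0.7), 3), round(expected_accuracy(n, 0.7), 3))
```

For n = 1 and p = 0.7 this gives the expected accuracy p^2 + (1 - p)^2 = 0.58, consistent with the expression after a single training instance.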
value showing how quickly we reach the maximum possible accuracy for cues of various predictive ability the minimum value depicted in this figure is observed at this means that even after single training instance our expected accuracy is at least of the maximum accuracy we can reach and this value rises quickly with each additional pair of training instances probability of correctly deciding probability of correctly inferring cue direction figure learning cue direction learning to order two cues assume we have two cues with success rates and in the population with we expand the definition of informative instance to require that the objects differ on the second cue as well we examine two probabilities and the former is the probability of ordering the two cues correctly which means placing the cue with higher success rate above the other one the latter is the probability of deciding correctly with the inferred order we chose to examine learning to order cues independently of learning cue directions one reason is that people do not necessarily learn the cue directions from experience in many cases they can guess the cue direction correctly through causal reasoning social learning past experience in similar problems or other means in the analysis below we assume that the directions are assigned correctly let and be the success rates of the two cues in the training set if instances are informative and independent and follow the binomial distribution with parameters and allowing us to write as follows after one training instance is which is linear function of the difference between the two success rates if we order cues correctly decision on test instance is correct with probability otherwise with probability thus after training instances max after training instances probability of correctly deciding after training instances probability of correctly ordering figure shows and as function of and after three training instances in general larger values of as well as larger differences between and require smaller training sets for desired level of accuracy in other words learning progresses faster where it is more useful the third plot in the figure shows relative to the maximum value it can take the maximum of and the minimum value depicted in this figure is if we examine the same figure after only single training instance we see that this minimum value is figure not shown figure learning cue order empirical analysis we next present an empirical analysis of natural data sets most from two earlier studies our primary objective is to examine the empirical learning rates of heuristics from the analytical results of the preceding section we expect learning to progress rapidly secondary objective is to examine the effectiveness of different ways cues can be ordered in lexicographic heuristic the data sets were gathered from wide variety of sources including online data repositories textbooks packages for statistical software statistics and data mining competitions research publications and individual scientists collecting field data the subjects were diverse including biology business computer science ecology economics education engineering environmental science medicine political science psychology sociology sports and transportation the data sets varied in size ranging from to objects many of the smaller data sets contained the entirety of the population of objects for example all islands in the archipelago the data sets are described in detail in the supplementary material we present results on 
lexicographic heuristics tallying decision making logistic regression and decision trees trained by cart we used the cart implementation in rpart with the default splitting criterion gini and pruning there is no explicit way to implement the symmetry constraint for decision trees we simply augmented the training set with its mirror image with respect to the direction of comparison for logistic regression we used the glm function of setting the intercept to zero to implement the symmetry constraint to the glm function we input the cues in the order of decreasing correlation with the criterion so that the weakest cues were dropped first when the number of training instances was smaller than the number of cues ordering cues in lexicographic heuristics we first examine the different ways lexicographic heuristics can order the cues with cues there are possible cue orders combined with the possibility of using each cue with positive or negative direction there are possible lexicographic models number that increases very rapidly with how should we choose one if our top criterion is accuracy but we also want to pay attention to computational cost and memory requirements we consider three methods the first is greedy search where we start by deciding on the first cue to be used along with its direction then the second and so on until we have fully specified lexicographic model when deciding on the first cue we select the one that has the highest validity in the training examples when deciding on the mth cue we select the cue that has the highest validity in the examples left over after using the first cues that is those examples where the first cues did not discriminate between the two objects the second method is to order cues with respect to their validity in the training examples as does evaluating cues independently of each other substantially reduces computational and memory requirements but perhaps at the expense of accuracy the third method is to use the lexicographic the gives the highest accuracy in the training examples identifying this rule is and it is unlikely to generalize well but it will be informative to examine it the three methods have been compared earlier on data set consisting of german cities where the fitting accuracy of the best greedy validity and random ordering was and respectively figure top panel shows the fitting accuracy of each method in each of the data sets when all possible pairwise comparisons were conducted among all objects because of the long simulation time required we show an approximation of the best ordering in data sets with seven or more cues in these data sets we started with the two lexicographic rules generated by the greedy and the validity ordering kept intact the cues that were placed seventh or later in the sequence and tested all possible permutations of their first six cues trying out both possible cue directions the figure also shows the mean accuracy of random ordering where cues were used in the direction of higher validity in all data sets greedy ordering was identical or very close in accuracy to the best ordering in addition validity ordering was very close to greedy ordering except in handful of data sets one explanation is that continuous cue that is placed first in lexicographic model makes all or almost all decisions and therefore the order of the remaining cues does not matter we therefore also examine the binary version of each data set where numerical cues were dichotomizing around the median figure bottom panel there was little 
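The greedy and validity orderings compared above differ only in whether cue validity is re-estimated on the leftover pairs that earlier cues fail to discriminate. A minimal sketch of both follows; tie-handling and the fallback for cues that are never reached are our own simplifications.

```python
def validity(pairs, cue):
    """Validity of a cue with its better direction: accuracy on the training pairs it discriminates.

    Each pair is ((x_a, y_a), (x_b, y_b)) with attribute dicts x and criterion values y.
    Returns (validity, direction); returns (0.5, +1) if the cue never discriminates."""
    correct = total = 0
    for (xa, ya), (xb, yb) in pairs:
        d_cue, d_crit = xa[cue] - xb[cue], ya - yb
        if d_cue != 0 and d_crit != 0:
            total += 1
            correct += (d_cue > 0) == (d_crit > 0)
    if total == 0:
        return 0.5, +1
    v = correct / total
    return (v, +1) if v >= 0.5 else (1 - v, -1)

def validity_order(pairs, cues):
    """Order all cues once, by validity on the full training set."""
    scored = {c: validity(pairs, c) for c in cues}
    order = sorted(cues, key=lambda c: scored[c][0], reverse=True)
    return order, {c: scored[c][1] for c in cues}

def greedy_order(pairs, cues):
    """Pick the best cue, drop the pairs it discriminates, and repeat on the leftovers."""
    remaining, left, order, directions = list(cues), list(pairs), [], {}
    while remaining and left:
        best = max(remaining, key=lambda c: validity(left, c)[0])
        order.append(best)
        directions[best] = validity(left, best)[1]
        remaining.remove(best)
        left = [p for p in left if p[0][0][best] == p[1][0][best]]
    for c in remaining:                          # cues never reached: fall back to full-sample validity
        directions[c] = validity(pairs, c)[1]
    return order + remaining, directions

pairs = [(({"c1": 1, "c2": 0}, 10), ({"c1": 0, "c2": 1}, 3)),
         (({"c1": 1, "c2": 1}, 8),  ({"c1": 1, "c2": 0}, 5))]
print(validity_order(pairs, ["c1", "c2"]))
print(greedy_order(pairs, ["c1", "c2"]))
```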
difference in the relative positions of greedy and optimal ordering except in one data set there was more of drop in the relative accuracy of fitting accuracy best approximate best greedy ordering validity ordering random ordering number of cues fitting accuracy data sets dichotomized data sets number of cues figure fitting accuracy of lexicographic models with and without dichotomizing the cues the validity ordering but this method still achieved accuracy close to that of the best ordering in the majority of the data sets we next examine predictive accuracy figure shows accuracies when the models were trained on of the objects and tested on the remaining conducting all possible pairwise comparisons within each group mean accuracy across data sets was for logistic regression for cart for greedy lexicographic and for and for tallying figure shows learning curves where we grew the training set one pairwise comparison at time two individual objects provided single instance for training or testing and were never used again neither in training nor in testing consequently the training instances were independent of each other but they were not always informative as defined in section the figure shows the mean learning curve across all data sets as well as individual learning curves on data sets we present the graphs without error bars for legibility the highest standard error of the data points displayed is in figure and in figure few observations are noteworthy heuristics were indeed learned rapidly in the early part of the learning curve tallying generally had the highest accuracy the performance of was remarkable when trained on of the objects its mean performance was better than tallying percentage points behind and percentage points behind logistic regression performed better than or as well as greedy lexicographic in most data sets detailed comparison of the two methods is provided below validity versus greedy ordering in lexicographic decision making the learning curves on individual data sets took one of four forms there was no difference in any part of the learning curve this is the case when continuous cue is placed first this cue almost always discriminates between the objects and cues further down in the sequence are seldom if ever used because greedy and validity ordering always agree on the first cue the learning curves are identical or nearly so data sets were in this first category validity ordering was better than greedy ordering in some parts of the learning curve and never worse this category included data sets learning curves crossed validity ordering generally started with higher accuracy than greedy ordering the difference diminished with increasing size and eventually greedy ordering exceeded validity ordering in accuracy data sets greedy ordering was better than validity accuracy greedy lexicographic tallying logistic regression cart data sets number of cues figure predictive accuracy when models are trained with of the objects in each data set and tested on the remaining dering in some parts of the learning curve and never worse data sets to draw these conclusions we considered difference to be present if the error bars se did not overlap discussion we isolated two building blocks of decision heuristics and showed analytically that they require very few training instances to learn under conditions that matter the most when they add value to the ultimate predictive ability of the heuristic our empirical analysis confirmed that heuristics typically make substantial progress 
early in learning among the algorithms we considered the most robust method for very small training sets is tallying earlier work concluded that with undichotomized cues is the most robust model for training sets with to objects but tallying with undichotomized cues was absent from this earlier study in addition we found that the performance of decision making is truly remarkable this heuristic has been analyzed by assuming that the cues and the criterion follow the normal distribution we are not aware of an earlier analysis of its empirical performance on natural data sets our analysis of learning curves differs from earlier studies most earlier studies examined performance as function of number of objects in the training set where training instances are all possible pairwise comparisons among those objects others increased the training set one pairwise comparison at time but did not keep the pairwise comparisons independent of each other in contrast we increased the training set one pairwise comparison at time and kept all pairwise comparisons independent of each other this makes it possible to examine the incremental value of each training instance there is criticism of decision heuristics because of their computational requirements for instance it has been argued that can be described as simple algorithm but its successful execution relies on large amount of precomputation and that the computation of cue validity in the german city task would require pairwise comparisons just to establish the cue validity hierarchy for predicting city size our results clearly show that the actual computational needs of heuristics can be very low if independent pairwise comparisons are used for training similar just few samples may within the context of bayesian inference acknowledgments thanks to gerd gigerenzer konstantinos katsikopoulos malte lichtenberg laura martignon perke jacobs and the abc research group for their comments on earlier drafts of this article this work was supported by grant si to from the deutsche forschungsgemeinschaft dfg as part of the priority program new frameworks of rationality spp greedy lexicographic tallying logistic regression cart number of data sets mean accuracy training set size city lake figure learning curves hitter bodyfat contraception obesity athlete car accuracy infant cpu pitcher salary land fish training set size mileage accuracy diamond greedy lexi tallying logistic reg cart training set size references rose charlie rose television program aired on february gigerenzer todd and the abc research group simple heuristics that make us smart oxford university press new york gigerenzer hertwig and pachur editors heuristics the foundations of adaptive behavior oxford university press new york czerlinski gigerenzer and goldstein how good are simple heuristics pages in martignon and laskey bayesian benchmarks for fast and frugal heuristics pages in martignon katsikopoulos and woike categorization with limited resources family of simple heuristics journal of mathematical psychology luan schooler and gigerenzer from perception to preference and on to inference an analysis of thresholds psychological review katsikopoulos psychological heuristics for making inferences definition performance and the emerging theory and practice decision analysis laskey and martignon comparing fast and frugal trees and bayesian networks for risk assessment in makar editor proceedings of the international conference on teaching statistics flagstaff arizona brighton robust inference with 
simple cognitive models in lebiere and wray editors aaai spring symposium cognitive science principles meet problems pages american association for artificial intelligence menlo park ca katsikopoulos schooler and hertwig the robust beauty of ordinary information psychological review gigerenzer and goldstein reasoning the fast and frugal way models of bounded rationality psychological review linear decision rule as aspiration for simple decision heuristics in burges bottou welling ghahramani and weinberger editors advances in neural information processing systems pages curran associates red hook ny breiman friedman stone and olshen classification and regression trees crc press boca raton fl therneau atkinson and ripley rpart recursive partitioning and regression trees package version schmitt and martignon on the accuracy of bounded rationality how far from optimal is fast and frugal in weiss and platt editors advances in neural information processing systems pages mit press cambridge ma schmitt and martignon on the complexity of learning lexicographic strategies journal of machine learning research martignon and hoffrage fast frugal and fit simple heuristics for paired comparison theory and decision hogarth and karelaia ignoring information in binary choice with continuous variables when is less more journal of mathematical psychology chater oaksford nakisa and redington fast frugal and rational how rational norms explain behavior organizational behavior and human decision processes brighton and gigerenzer bayesian brains and cognitive mechanisms harmony or dissonance in chater and oaksford editors the probabilistic mind prospects for bayesian cognitive science pages oxford university press new york brighton and gigerenzer are rational actor models rational outside small worlds in okasha and binmore editors evolution and rationality decisions and strategic behaviour pages cambridge university press cambridge todd and dieckmann heuristics for ordering cue search in decision making in saul weiss and bottou editors advances in neural information processing systems pages mit press cambridge ma newell of rationality trends in cognitive sciences dougherty and thomas psychological plausibility of the theory of probabilistic mental models and the fast and frugal heuristics psychological review vul goodman griffiths and tenenbaum one and done optimal decisions from very few samples cognitive science 
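To make the two building blocks examined in the heuristics study above concrete, the following Python sketch implements tallying and a lexicographic rule for paired comparison, together with validity-based and greedy cue orderings estimated from training pairs. It is a minimal illustration under simplifying assumptions, not the authors' implementation: cues are assumed to be oriented so that larger values tend to go with larger criterion values, ties are resolved by guessing, and the greedy ordering simply maximizes correct decisions on the still-undecided training pairs.

```python
import numpy as np

def dichotomize(cues, medians=None):
    """Split each continuous cue at its median (cues assumed oriented so that
    larger cue values tend to go with larger criterion values)."""
    if medians is None:
        medians = np.median(cues, axis=0)
    return (cues > medians).astype(int), medians

def cue_validities(cues, criterion):
    """Validity of a cue: fraction of discriminating training pairs on which
    the cue points to the object with the larger criterion value."""
    n, m = cues.shape
    correct, discr = np.zeros(m), np.zeros(m)
    for i in range(n):
        for j in range(i + 1, n):
            target = np.sign(criterion[i] - criterion[j])
            if target == 0:
                continue
            diff = cues[i] - cues[j]
            discr += diff != 0
            correct += (np.sign(diff) == target) & (diff != 0)
    return np.divide(correct, discr, out=np.full(m, 0.5), where=discr > 0)

def greedy_ordering(cues, criterion):
    """Greedy ordering: repeatedly add the cue that decides the largest number
    of still-undecided training pairs correctly."""
    n, m = cues.shape
    undecided = [(i, j) for i in range(n) for j in range(i + 1, n)
                 if criterion[i] != criterion[j]]
    order, remaining = [], list(range(m))
    while remaining and undecided:
        best, best_score = remaining[0], -1
        for k in remaining:
            score = sum(np.sign(cues[i, k] - cues[j, k])
                        == np.sign(criterion[i] - criterion[j])
                        for i, j in undecided if cues[i, k] != cues[j, k])
            if score > best_score:
                best, best_score = k, score
        order.append(best)
        remaining.remove(best)
        undecided = [(i, j) for i, j in undecided if cues[i, best] == cues[j, best]]
    return order + remaining          # cues that never discriminated go last

def tallying_predict(cue_a, cue_b):
    """Tallying: count cues favoring each object; no ordering, no weights."""
    return int(np.sign(np.sum(cue_a) - np.sum(cue_b)))   # +1: a larger, -1: b larger, 0: guess

def lexicographic_predict(cue_a, cue_b, order):
    """Lexicographic rule: decide on the first cue in `order` that discriminates."""
    for k in order:
        if cue_a[k] != cue_b[k]:
            return 1 if cue_a[k] > cue_b[k] else -1
    return 0
```

A validity ordering is then simply `np.argsort(-cue_validities(train_cues, train_criterion))`, and either ordering can be passed to `lexicographic_predict` on held-out pairs to reproduce the kind of comparison summarized in the learning curves discussed above.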
explore no more improved regret bounds for bandits gergely sequel team inria lille nord europe abstract this work addresses the problem of regret minimization in multiarmed bandit problems focusing on performance guarantees that hold with high probability such results are rather scarce in the literature since proving them requires large deal of technical effort and significant modifications to the standard more intuitive algorithms that come only with guarantees that hold on expectation one of these modifications is forcing the learner to sample arms from the uniform distribution at least times over rounds which can adversely affect performance if many of the arms are suboptimal while it is widely conjectured that this property is essential for proving regret bounds we show in this paper that it is possible to achieve such strong results without this undesirable exploration component our result relies on simple and intuitive strategy called implicit exploration ix that allows remarkably clean analysis to demonstrate the flexibility of our technique we derive several improved bounds for various extensions of the standard bandit framework finally we conduct simple experiment that illustrates the robustness of our implicit exploration technique introduction consider the problem of regret minimization in bandits as defined in the classic paper of auer freund and schapire this sequential problem can be formalized as repeated game between learner and an environment sometimes called the adversary in each round the two players interact as follows the learner picks an arm also called an action it and the environment selects loss function where the loss associated with arm is denoted as subsequently the learner incurs and observes the loss it based solely on these observations the goal of the learner is to choose its actions so as to accumulate as little loss as possible during the course of the game as traditional in the online learning literature we measure the performance of the learner in terms of the regret defined as rt it min we say that the environment is oblivious if it selects the sequence of loss vectors irrespective of the past actions taken by the learner and adaptive or if it is allowed to choose as function of the past actions an equivalent formulation of the bandit game uses the concept of rewards also called gains or payoffs instead of losses in this version the author is currently with the department of information and communication technologies pompeu fabra university barcelona spain the adversary chooses the sequence of reward functions rt with rt denoting the reward given to the learner for choosing action in round in this game the learner aims at maximizing its total rewards we will refer to the above two formulations as the loss game and the reward game respectively our goal in this paper is to construct algorithms for the learner that guarantee that the regret grows sublinearly since it is well known that no deterministic learning algorithm can achieve this goal we are interested in randomized algorithms accordingly the regret rt then becomes random variable that we need to bound in some probabilistic sense most of the existing literature on bandits is concerned with bounding the or weak regret defined as rt max where the expectation integrates over the randomness injected by the learner proving bounds on the actual regret that hold with high probability is considered to be significantly harder task that can be achieved by serious changes made to the learning algorithms and 
much more complicated analyses one particular common belief is that in order to guarantee performance guarantees the learner can not avoid repeatedly sampling arms from uniform distribution typically kt times it is easy to see that such explicit exploration can impact the empirical performance of learning algorithms in very negative way if there are many arms with high losses even if the base learning algorithm quickly learns to focus on good arms explicit exploration still forces the regret to grow at steady rate as result algorithms with performance guarantees tend to perform poorly even in very simple problems in the current paper we propose an algorithm that guarantees strong regret bounds that hold with high probability without the explicit exploration component one component that we preserve from the classical recipe for such algorithms is the biased estimation of losses although our bias is of much more delicate nature and arguably more elegant than previous approaches in particular we adopt the implicit exploration ix strategy first proposed by neu valko and munos for the problem of online learning with as we show in the current paper this simple strategy allows proving bounds for range of nonstochastic bandit problems including bandits with expert advice tracking the best arm and bandits with our proofs are arguably cleaner and less involved than previous ones and very elementary in the sense that they do not rely on advanced results from probability theory like freedman inequality the resulting bounds are tighter than all previously known bounds and hold simultaneously for all confidence levels unlike most previously known bounds for the first time in the literature we also provide bounds for anytime algorithms that do not require prior knowledge of the time horizon minor conceptual improvement in our analysis is direct treatment of the loss game as opposed to previous analyses that focused on the reward game making our treatment more coherent with other results in the online learning the rest of the paper is organized as follows in section we review the known techniques for proving regret bounds for bandits and describe our implicit exploration strategy in precise terms section states our main result concerning the concentration of the ix loss estimates and shows applications of this result to several problem settings finally we conduct set of simple experiments to illustrate the benefits of implicit exploration over previous techniques in section explicit and implicit exploration most principled learning algorithms for the bandit problem are constructed by using standard online learning algorithm such as the exponentially weighted forecaster or follow the perturbed leader as black box with the true unobserved losses replaced by some appropriate estimates one of the key challenges is constructing reliable estimates of the losses for all based on the single observation it following auer et al this is in fact studying the loss game is colloquially known to allow better constant factors in the bounds in many settings see bubeck and our result further reinforces these observations traditionally achieved by using estimates of the form bt pt or rbt rt pt where pt it is the probability that the learner picks action in round conditioned on the observation history of the learner up to the beginning of round it is easy to show that these estimates are unbiased for all with pt in the sense that bt for all such for concreteness consider the xp algorithm of auer et al as described in bubeck 
and cesabianchi section in every round this the loss estimates defined in equation to compute the weights wt exp for all and some positive parameter that is often called the learning rate having computed these weights xp draws arm it with probability proportional to wt relying on the unbiasedness of the estimates and an optimized setting of one can prove that xp enjoys bound of log however the fluctuations of the loss estimates around the true losses are too large to permit bounding the true regret with high probability to keep these fluctuations under control auer et al propose to use the biased ret rbt pt with an appropriately chosen given these estimates the xp algorithm of auer et al computes the weights wt exp res for all arms and then samples it according to the distribution wt pt pk wt where is the exploration parameter the argument for this explicit exploration is that it helps to keep the range and thus the variance of the above reward estimates bounded thus enabling the use of more or less standard concentration in particular the key element in the analysis of xp is showing that the inequality rt ret log holds simultaneously for all with probability at least in other words this shows that the pt pt cumulative estimates ret are upper confidence bounds for the true rewards rt in the current paper we propose to use the loss estimates defined as et pt γt for all and an appropriately chosen γt and then use the resulting estimates in an exponentialweights algorithm scheme without any explicit exploration loss estimates of this form were first used by et al them we refer to this technique as implicit exploration or in short ix in what follows we argue that that ix as defined above achieves similar variancereducing effect as the one achieved by the combination of explicit exploration and the biased reward estimates of equation in particular we show that the ix estimates constitute lower confidence bound for the true losses which allows proving bounds for number of variants of the bandit problem regret bounds via implicit exploration in this section we present concentration result concerning the ix loss estimates of equation and apply this result to prove performance guarantees for number of nonstochastic bandit problems the following lemma states our concentration result in its most general form explicit exploration is believed to be inevitable for proving bounds in the reward game for various other reasons bubeck and for discussion lemma let γt be fixed sequence with γt and let αt be nonnegative random variables satisfying αt for all and then with probability at least αt et log particularly important special case of the above lemma is the following corollary let γt for all with probability at least log et simultaneously holds for all this corollary follows from applying lemma to the functions αt for all and applying the union bound the full proof of lemma is presented in the appendix for didactic purposes we now present direct proof for corollary which is essentially simpler version of lemma proof of corollary for convenience we will use the notation first observe that it it it log bt et pt pt where the first step follows from and last one from the elementary inequality log that holds for all using the above inequality we get that exp et bt exp where the second and third steps are obtained by using bt that holds by definition of bt and the inequality ez that holds for all as result the process zt pt exp es is supermartingale with respect to ft zt observe that since this implies zt zt 
and thus by markov inequality et exp et exp exp holds for any the statement of the lemma follows from solving exp for and using the union bound over all arms in what follows we put lemma to use and prove improved performance guarantees for several variants of the bandit problem namely the bandit problem with expert advice tracking the best arm for bandits and bandits with the general form of lemma will allow us to prove bounds for anytime algorithms that can operate without prior knowledge of for clarity we will only provide such bounds for the standard bandit setting extending the derivations to other settings is left as an easy exercise for all algorithms we prove bounds that scale linearly with log and hold simultaneously for all levels note that this dependence can be improved to log for fixed confidence level if the algorithm can use this to tune its parameters this is the way that table presents our new bounds with the best previously known ones setting bandits bandits with expert advice tracking the best arm bandits with best known regret bound log log kt log kt mt ourpnew regret bound log log log kt αt table our results compared to the best previously known results in the four settings considered in sections see the respective sections for references and notation bandits in this section we propose variant of the xp algorithm of auer et al that uses the ix loss estimates xp the algorithm in its most general form uses two nonincreasing sequences of nonnegative parameters ηt and γt in every round xp chooses action it with probability proportional to pt wt exp algorithm xp parameters initialization for repeat pt pk wt draw it pt pt observe loss it et without mixing any explicit exploration term into the distribution version of xp is presented as algorithm pt it for all wt for all our theorem below states on the regret of xp notably our bound exhibits the best known constant factor of in the leading term improving on the factor of due to bubeck and the best known leading constant for the bound of xp is also proved in bubeck and theorem fix an arbitrary with ηt log kt for all xp guarantees log rt log log with probability at least furthermore setting ηt log kt for all the bound becomes kt rt kt log log log proof let us fix an arbitrary following the standard analysis of xp in the loss game and nonincreasing learning rates we can obtain the bound log ηt pt et et pt et ηt for any now observe that pt γt pt et it γt it it γt et pt γt pt γt pk pk similarly pt et holds by the boundedness of the losses thus we get that log ηt it et γt et ηt log log ηt γt log holds with probability at least where the last line follows from an application of lemma with αt ηt γt for all and taking the union bound by taking arg mini lt and and using the boundedness of the losses we obtain rt ηt log log γt log ηt the statements of the theorem then follow immediately noting that pt bandits with expert advice we now turn to the setting of bandits with expert advice as defined in auer et al and later revisited by mcmahan and streeter and beygelzimer et al in this setting we assume that in every round the learner observes set of probability distributions pk ξt ξt ξt over the arms such that ξt for all we assume that the sequences ξt are measurable with respect to ft the nth of these vectors represent the probabilistic advice of the corresponding nth expert the goal of the learner in this setting is to pick sequence of arms so as to minimize the regret against the best expert rtξ it min ξt min to tackle this problem we 
propose modification of the xp algorithm of auer et al that uses the ix loss estimates and also drops the explicit exploration component of the original algorithm specifically xp uses the loss estimates defined in equation to compute the weights ξs es wt exp pn for every expert and then draw arm with probability pt wt ξt we now state the performance guarantee of xp our bound improves the best known leading constant of due to beygelzimer et al to and is factor of worse than the best known constant in the bound for xp the proof of the theorem is presented in the appendix theorem fix an arbitrary and set log kt for all then with probability at least the regret of xp satisfies rt log log log tracking the best sequence of arms in this section we consider the problem of competing with sequences of actions similarly to herbster and warmuth we consider the class of sequences that switch at most times between actions we measure the performance of the learner in this setting in terms of the regret against the best sequence from this class defined as rts it min jt jt similarly to auer et al we now propose to adapt the fixed share algorithm of herbster and warmuth to our setting our algorithm called xp updates set of weights wt over the arms in recursive fashion in the first round xp sets for all in the following rounds the weights are updated for every arm as wt wt in round the algorithm draws arm it with probability pt below we give the performance guarantees of xp note that our leading factor of again improves over the best previously known leading factor of shown by audibert and bubeck the proof of the theorem is given in the appendix log theorem fix an arbitrary and set and where then with probability at least the regret of xp satisfies ekt rt log log log bandits with let us now turn to the problem of online learning in bandit problems in the presence of side observations as defined by mannor and shamir and later elaborated by alon et al in this setting the learner and the environment interact exactly as in the bandit problem the main difference being that in every round the learner observes the losses of some arms other than its actually chosen arm it the structure of the side observations is described by the directed graph nodes of correspond to individual arms and the presence of arc implies that the learner will observe upon selecting it implicit exploration and xp was first proposed by et al for this precise setting to describe this variant let us introduce the notations ot it it and ot ot ot then the ix loss estimates in this setting are defined for all as et ot with these estimates at hand xp draws arm it from the exponentially weighted distribution defined in equation the following theorem provides the regret bound concerning this algorithm theorem fix an arbitrary assume that and set log kt where is the independence number of with probability at least xp guarantees αt log kt log log rt log log kt log the proof of the theorem is given in the appendix while the proof of this statement is significantly more involved than the other proofs presented in this paper it provides fundamentally new result in particular our bound is in terms of the independence number and thus matches the minimax regret bound proved by alon et al for this setting up to logarithmic factors in contrast the only regret bound for this setting due to alon et al scales with the size of the maximal acyclic subgraph of which can be much larger than in general may be for some graphs empirical evaluation we conduct simple 
experiment to demonstrate the robustness of xp as compared to xp and its superior performance as compared to xp our setting is bandit problem where all losses are independent draws of bernoulli random variables the mean losses of arms through are and the mean loss of arm is for all rounds the mean losses of arm are changing over time for rounds the mean is and afterwards this choice ensures that up to at least round arm is clearly better than other arms in the second half of the game arm starts to outperform arm and eventually becomes the leader we have evaluated the performance of xp xp and xp in the above setting with and for fairness of comparison we evaluate all three algorithms for wide range of parameters in particular for all three algorithms we set base learning rate according to the best known theoretical results theorems and and varied the multiplier of the respective base parameters between and other parameters are set as for xp and for xp we studied the regret up to two interesting rounds in the game up to where the losses are and up to where the algorithms have to notice the shift in the regret at regret at multiplier multiplier figure regret of xp xp and xp respectively in the problem described in section loss distributions figure shows the empirical means and standard deviations over runs of the regrets of the three algorithms as function of the multipliers the results clearly show that xp largely improves on the empirical performance of xp and is also much more robust in the regime than vanilla xp discussion in this paper we have shown that contrary to popular belief explicit exploration is not necessary to achieve regret bounds for bandit problems interestingly however we observed in several of our experiments that our algorithms still draw every arm roughly times even though this is not explicitly enforced by the algorithm this suggests need more complete study of the role of exploration to find out whether pulling every single arm times is necessary for achieving guarantees one can argue that tuning the ix parameter that we introduce may actually be just as difficult in practice as tuning the parameters of xp however every aspect of our analysis suggests that γt ηt is the most natural choice for these parameters and thus this is the choice that we recommend one limitation of our current analysis is that it only permits deterministic and ix parameters see the conditions of lemma that is proving adaptive regret bounds in the vein of that hold with high probability is still an open challenge another interesting direction for future work is whether the implicit exploration approach can help in advancing the state of the art in the more general setting of linear bandits all known algorithms for this setting rely on explicit exploration techniques and the strength of the obtained results depend crucially on the choice of the exploration distribution see for recent advances interestingly ix has natural extension to the linear bandit problem to see this consider the vector vt eit and the matrix pt vt vtt then the ix loss estimates can be written as et pt γi vt vtt whether or not this estimate is the right choice for linear bandits remains to be seen finally we note that our estimates are certainly not the only ones that allow avoiding explicit exploration in fact the careful reader might deduce from the proof of lemma that the same concentration can be shown to hold for the alternative loss estimates it pt and log it actually variant of the latter estimate was used previously 
for proving regret bounds in the reward game by audibert and bubeck their proof still relied on explicit exploration it is not hard to verify that all the results we presented in this paper except theorem can be shown to hold for the above two estimates too acknowledgments this work was supported by inria the french ministry of higher education and research and by fui project the author wishes to thank haipeng luo for catching bug in an earlier version of the paper and the anonymous reviewers for their helpful suggestions references alon gentile and mansour from bandits to experts tale of domination and independence in pages alon gentile mannor mansour and shamir nonstochastic bandits with feedback arxiv preprint audibert and bubeck minimax policies for adversarial and stochastic bandits in proceedings of the annual conference on learning theory colt audibert and bubeck regret bounds and minimax policies under partial monitoring journal of machine learning research auer freund and schapire the nonstochastic multiarmed bandit problem siam issn bartlett dani hayes kakade rakhlin and tewari regret bounds for bandit online linear optimization in colt pages beygelzimer langford li reyzin and schapire contextual bandit algorithms with supervised learning guarantees in aistats pages bubeck and kakade towards minimax policies for online linear optimization with bandit feedback bubeck and regret analysis of stochastic and nonstochastic bandit problems now publishers inc and lugosi prediction learning and games cambridge university press new york ny usa gaillard lugosi and stoltz mirror descent meets fixed share and feels no regret in pages freedman on tail probabilities for martingales the annals of probability freund and schapire generalization of learning and an application to boosting journal of computer and system sciences hannan approximation to bayes risk in repeated play contributions to the theory of games hazan and kale better algorithms for benign bandits the journal of machine learning research hazan karnin and meka volumetric spanners an efficient exploration basis for learning in colt pages herbster and warmuth tracking the best expert machine learning kalai and vempala efficient algorithms for online decision problems journal of computer and system sciences neu valko and munos efficient learning by implicit exploration in bandit problems with side observations in pages littlestone and warmuth the weighted majority algorithm information and computation mannor and shamir from bandits to experts on the value of in neural information processing systems mcmahan and streeter tighter bounds for bandits with expert advice in colt neu regret bounds for combinatorial in colt pages rakhlin and sridharan online learning with predictable sequences in colt pages seldin auer laviolette and inequality for martingales and its application to multiarmed bandits in proceedings of the workshop on trading of exploration and exploitation vovk aggregating strategies in proceedings of the third annual workshop on computational learning theory colt pages 
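As a concrete illustration of the implicit-exploration recipe analyzed above, the following Python sketch runs an Exp3-style learner in the loss game with the IX loss estimates, which divide the observed loss by p_{t,i} + gamma on the chosen arm only, and with no explicit uniform exploration mixed into the sampling distribution. It is a sketch rather than a reference implementation: the fixed-horizon tuning of the form eta = sqrt(2 log K / (K T)) with gamma = eta / 2 follows the shape of the bound stated earlier, the cumulative estimates are shifted before exponentiation purely for numerical stability, and the stationary Bernoulli losses in the demo are only a stand-in for the shifting-loss experiment of the empirical section.

```python
import numpy as np

def exp3_ix(losses, eta=None, gamma=None, rng=None):
    """Exponential weights on implicit-exploration (IX) loss estimates.

    losses: array of shape (T, K) with entries in [0, 1]; row t is the
    adversary's loss vector, of which only the chosen entry is revealed.
    Returns the learner's total loss.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, K = losses.shape
    eta = np.sqrt(2.0 * np.log(K) / (K * T)) if eta is None else eta
    gamma = eta / 2.0 if gamma is None else gamma
    cum_est = np.zeros(K)                    # cumulative IX loss estimates
    total_loss = 0.0
    for t in range(T):
        shifted = cum_est - cum_est.min()    # shift for numerical stability only
        w = np.exp(-eta * shifted)
        p = w / w.sum()                      # no explicit exploration term mixed in
        arm = rng.choice(K, p=p)
        loss = losses[t, arm]
        total_loss += loss
        cum_est[arm] += loss / (p[arm] + gamma)   # biased (IX) loss estimate
    return total_loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, K = 20000, 10
    means = np.full(K, 0.6)
    means[0] = 0.5                           # one slightly better arm
    losses = rng.binomial(1, means, size=(T, K)).astype(float)
    regret = exp3_ix(losses, rng=rng) - losses.sum(axis=0).min()
    print("empirical regret against the best arm:", regret)
```

By contrast, obtaining a high-probability guarantee for the classical recipe requires mixing a uniform-exploration term into p and biasing the reward estimates, which is exactly the component the IX estimates make unnecessary.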
fast and memory optimal matrix approximation yun msr cambridge marc lelarge inria ens alexandre proutiere kth ee school acl alepro abstract in this paper we revisit the problem of constructing rank approximation of matrix under the streaming data model where the columns of are revealed sequentially we present sla streaming approximation an algorithm that is asymptotically accurate when mn where is the largest singular value of this means that its average error converges to as and grow large mn with high probability where and denote the output of sla and the optimal rank approximation of respectively our algorithm makes one pass on the data if the columns of are revealed in random order and two passes if the columns of arrive in an arbitrary order to reduce its memory footprint and complexity sla uses random sparsification and samples each entry of with small probability in turn sla is memory optimal as its required memory space scales as the dimension of its output furthermore sla is computationally efficient as it runs in δkmn time constant number of operations is made for each observed entry of which can be as small as log for an appropriate choice of and if introduction we investigate the problem of constructing in memory and computationally efficient manner an accurate estimate of the optimal rank approximation of large matrix this problem is fundamental in machine learning and has naturally found numerous applications in computer science the optimal rank approximation minimizes over all rank matrices the frobenius norm km zkf and any norm that is invariant under rotation and can be computed by singular value decomposition svd of in time if we assume that for massive matrices when and are very large this becomes unacceptably slow in addition storing and manipulating in memory may become difficult in this paper we design memory and computationally efficient algorithm referred to as streaming approximation sla that computes rank approximation under mild assumptions on the sla algorithm is asymptotically accurate in the sense that as and grow large its average error converges to mn with high probability we interpret as the signal that we aim to recover form noisy observation to reduce its memory footprint and running time the proposed algorithm combines random sparsification and the idea of the streaming data model more precisely each entry of is revealed to the algorithm with probability called the sampling rate moreover sla observes and treats the work performed as part of joint research centre acknowledges the support of the french agence nationale de la recherche anr under reference gap project proutiere research is supported by the erc fsa grant and the ssf project columns of one after the other in sequential manner the sequence of observed columns may be chosen uniformly at random in which case the algorithm requires one pass on only or can be arbitrary in which case the algorithm needs two passes sla first stores log randomly selected columns and extracts via spectral decomposition an estimator of parts of the top right singular vectors of it then completes the estimator of these vectors by receiving and treating the remain columns sequentially sla finally builds from the estimated top right singular vectors the linear projection onto the subspace generated by these vectors and deduces an estimator of the analysis of the performance of sla is presented in theorems and in summary when logm with probability kδ the output of sla satisfies log km mn mn δm where is the singular value of 
sla requires kn memory space and if logm and its time is δkmn to ensure the asymptotic accuracy of sla the in needs to converge to which is true as soon as mn in the case where is seen as noisy version of this condition quantifies the maximum amount of noise allowed for our algorithm to be asymptotically accurate sla is memory optimal since any rank approximation algorithm needs to at least store its output right and left singular vectors and hence needs at least kn memory space further observe that among the class of algorithms sampling each entry of at given rate sla is computational optimal since it runs in δkmn time it does constant number of operations per observed entry if in turn to the best of our knowledge sla is both faster and more memory efficient than existing algorithms sla is the first memory optimal and asymptotically accurate low rank approximation algorithm the approach used to design sla can be readily extended to devise memory and computationally efficient matrix completion algorithms we present this extension in the supplementary material notations throughout the paper we use the following notations for any matrix we denote by its transpose and by its we denote by the singular values of when matrices and have the same number of rows to denote the matrix whose first columns are those of followed by those of denotes an orthonormal basis of the subspace perpendicular to the linear span of the columns of aj ai and aij denote the column of the row of and the entry of on the line and column respectively for ah resp ah is the matrix obtained by extracting the columns resp lines of for any ordered set bp refers to the matrix composed by the ordered set of columns of is defined similarly but for lines for real numbers we define the matrix with entry equal to ij min max aij finally for any vector kvk denotes its euclidean norm whereas for any matrix kakf denotes its frobenius norm its operator norm and its maxi related work approximation algorithms have received lot of attention over the last decade there are two types of error estimate for these algorithms either the error is additive or relative to translate our bound in an additive error is easy km kf km kf mn δm mn sparsifying to the computation of approximation has been proposed in the literature and the best additive error bounds have been obtained in when the sampling rate satisfies logm the authors show that with probability exp km kf km kf km kf this performance guarantee is derived from lemma and theorem in to compare and note that our assumptions on the bounded entries of ensures that and mn in particular we see that the worst case km kf mn bound for is δm nm which is always lower than the worst case bound for log nm when our bound is only larger by logarithmic term in δm compared to however the algorithm proposed in requires to store δmn entries of whereas sla needs memory space recall that δm so that our algorithm makes significant improvement on the memory requirement at low price in the error guarantee bounds although biased sampling algorithms can reduce the error the algorithm have to run leverage scores with multiple passes over data in recent work proposes time efficient algorithm to compute approximation of sparse matrix combined with we obtain an algorithm running in time δmn nk but with an increased additive error term we can also compare our result to papers providing an estimate of the optimal approximation of with relative error such that km kf km kf to the best of our knowledge provides the best result 
in this setting theorem in shows that provided the rank of is at least their algorithm outputs with probability matrix with relative error using memory space log note that in the authors use as unit of memory bit whereas we use as unit of memory an entry of the matrix so we removed log mn factor in their expression to make fair comparisons to compare with our result we can translate our bound in relative error and we need to take log mn δm km kf first note that since is assumed to be of rank at least we have km kf and is clearly for our to tend to zero we need km kf to be not too small for the scenario we have in mind is noisy version of the signal so that is the noise matrix when every entry at random is generated independently with constant variance km kf while in such case we have and we improve the memory requirement of by factor log kδ also considers model where the full columns of are revealed one after the other in an arbitrary order and proposes algorithm to derive the approximation of with the same memory requirement in this general setting our algorithm is required to make two passes on the data and only one pass if the order of arrival of the column is random instead of arbitrary the running time of the algorithm scales as log kδ to project onto log kδ dimensional random space thus sla improves the time again by factor of log kδ we could also think of using sketching and streaming pca algorithms to estimate when the columns arrive sequentially these algorithms identify the left singular vectors using on the matrix and then need second pass on the data to estimate the right singular vectors for example proposes sketching algorithm that updates the most frequent directions as columns are observed shows that with memory space for this sketching algorithm finds matrix such that km kf km kf where denotes the projection matrix to the linear span of the columns of the running time of the algorithm is roughly which is much greater than that of sla note also that to identify such matrix in one pass on it is shown in that we have to use memory space this result does not contradict the performance analysis of sla since the latter needs two passes on if the columns of are observed in an arbitrary manner finally note that the streaming pca algorithm proposed in does not apply to our problem as this paper investigates very specific problem the spiked covariance model where column is randomly generated in an manner streaming approximation algorithm algorithm streaming approximation sla input and log independently sample entries of at rate pca for the first columns spca trimming the rows and columns of set the entries of rows of having more than two entries to set the entries of the columns of having more than entries to iˆ remove and from the memory space for to do at sample entries of mt at rate at iˆ iˆ at remove at from the memory space end for find using the process such that is an orthonormal matrix output algorithm spectral pca spca input gaussian random matrix trimming set the entries of the rows of with more than entries to diag power iteration qr qr decomposition of log output in this section we present the streaming approximation sla algorithm and analyze its performance sla makes one pass on the matrix and is provided with the columns of one after the other in streaming manner the svd of is σv where and are and unitary matrices and is the matrix diag we assume or impose by design of sla that the specified below first observed columns of are chosen uniformly at random among all 
columns an extension of sla to scenarios where columns are observed in an arbitrary order is presented in but this extension requires two passes on to be memory efficient sla uses sampling each observed entry of is erased set equal to with probability where is referred to as the sampling rate the algorithm whose is presented in algorithm proceeds in three steps in the first step we observe log columns of chosen uniformly at random these columns form the matrix where denotes the ordered set of the indexes of the first observed columns is sampled at rate more precisely we apply two independent sampling procedures where in each of them every entry of is sampled at rate the two resulting independent random matrices and are stored in memory referred to as to simplify the notations is used in this first step whereas will be used in subsequent steps next through spectral decomposition of we derive orthonormal matrix such that the span of its column vectors approximates that of the column vectors of the first step corresponds to lines and in the of sla in the second step we complete the construction of our estimator of the top right singular vectors of denote by the matrix formed by these estimated vectors we first compute the components of these vectors corresponding to the set of indexes as with then for after receiving the column mt of we set where at is obtained by sampling entries of mt at rate hence after one pass on we get where an as it turns out multiplying by amplifies the useful signal contained in and yields an accurate approximation of the span of the top right singular vectors of the second step is presented in lines and in sla in the last step we deduce from set of column vectors gathered in matrix such that provides an accurate approximation of first using the process we find such that is an orthonormal matrix and compute in streaming manner as in step then where approximates the projection matrix onto the linear span of the top right singular vectors of thus is close to this last step is described in lines and in sla in the next subsections we present in more details the rationale behind the three steps of sla and provide performance analysis of the algorithm step estimating vectors of the first batch of columns the objective of the first step is to estimate those components of the top right singular vectors of whose indexes are in the set remember that is the set of indexes of the first observed columns this estimator denoted by is obtained by applying the power method to extract the top right singular vector of as described in algorithm in the design of this algorithm and its performance analysis we face two challenges we only have access to sampled version of and ii is not the svd of since the column vectors of are not orthonormal in general we keep the components of these vectors corresponding to the set of indexes hence the top right singular vectors of that we extract in algorithm do not necessarily correspond to to address in algorithm we do not directly extract the top right singular vectors of we first remove the rows of with too many entries too many observed entries from since these rows would perturb the svd of let us denote by the obtained trimmed matrix we then form the covariance matrix and remove its diagonal entries to obtain the matrix diag removing the diagonal entries is needed because of the sampling procedure indeed the diagonal entries of scale as whereas its entries scale as hence when is small the diagonal entries would clearly become dominant in the 
spectral decomposition we finally apply the power method to to obtain in the analysis of the performance of algorithm the following lemma will be instrumental and provides an upper bound of the gap between and using the matrix bernstein inequality theorem all proofs are detailed in appendix lemma if with probability some constant kφ log for to address ii we first establish in lemma that for an appropriate choice of the column vectors of are approximately orthonormal this lemma is of independent interest and relates the svd of truncated matrix here to that of the initial matrix more precisely lemma if there exists matrix such that its column vectors are orthonorn mal and with probability exp for all satisfying that log note that as suggested by the above lemma it might be impossible to recover pvi when the corren sponding singular value si is small more precisely when log however the singular vectors corresponding to such small singular values generate very little error for lowrank approximation thus interested in singular vectors whose singular values are we are only above the threshold log let max log now to analyze the performance of algorithm when applied to we decompose as where is noise matrix the following lemma quantifies how noise may affect the performance of the power method it provides an upper bound of the gap between and as function of the operator norm of the noise matrix the si lemma with probability output of spca when applied to satisfies for all in the proof we analyze the power iteration algorithm from results in to complete the performance analysis of algorithm it remains to upper bound ky to this aim we decompose into three terms the first term can be controlled using lemma and the last term is upper bounded using lemma finally the second term corresponds to the error made by ignoring the singular vectors which are not within the top to estimate this term we use the matrix chernoff bound theorem in and prove that lemma with probability exp log in summary combining the four above lemmas we can establish that accurately estimates theorem if with probability satisfies for all is the constant from lemma the output of algorithm applied to log where step estimating the principal right singular vectors of in this step we aim at estimating the top right singular vectors or at least at producing vectors whose linear span approximates that of towards this objective we start from derived in the previous step and define the matrix is stored and kept in memory for the remaining of the algorithm it is tempting to directly read from the top left singular vectors indeed we know that and δu and hence however the level of the noise in is too important so as to accurately extract in turn can be written as δu where δu partly captures the noise in it is then easy to see that the level of the noise satisfies kzkf δm pm pk indeed first observe that is of rank then zij mkδ this is due to the facts that and δu are independent since and are independent ii kqj for all and iii the entries of are independent with variance however for all the singular value scales as qof δu δm log since sj mn and sj sj when from lemma instead from and the subsequent sampled arriving columns at we produce matrix whose linear span approximates that of more precisely we first let then for all we define at where at is obtained from the observed column of after sampling each of its entries at rate multiplying by an amplifies the useful signal in so that constitutes good approximation of to understand why we can rewrite 
as follows δm δm δm in the above equation the first term corresponds to the useful signal and the two remaining terms constitute noise matrices from theorem the linear span of columns of that of qapproximates the columns of and thus for sj sj mn log the spectral norms of the noise matrices are bounded using random matrix arguments and the fact that δm and δm are random matrices with independent entries we can show see lemma given in the supplementary material using the independence of and that with high probability kδm δm mn we may also establish that with high probability δm this is consequence of result derived in quoted pin lemma in the supplementary material stating that with high probability δm and of the fact that due to the trimming process presented in line in algorithm kw δm in summary as soon as scales at least as the noise level becomes negligible and the span of provides an accurate approximation of that of the above arguments are made precise and rigorous in the supplementary material the following theorem summarizes the accuracy of our estimator of for all there exists constant such that with log log probability kδ kvi theorem with step estimating the principal left singular vectors of in the last step we estimate the principal left singular vectors of to finally derive an estimator of the optimal approximation of the construction of this estimator is based on the observation that where is an matrix representing the projection onto the linear span of the top right singular vectors of hence to estimate we try to approximate the matrix to this aim we construct matrix so that the column vectors of form an orthonormal basis whose span corresponds to that of the column vectors of this construction is achieved using process we then approximate by and finally our estimator of is the construction of can be made in memory efficient way accommodating for our streaming model where the columns of arrive one after the other as described in the of sla first after constructing in step we build the matrix iˆ then for after constructing the line of we update iˆ by adding to it the matrix at so that after all columns of are observed iˆ hence we can build an estimator of the principal left singular vectors of as and finally obtain to quantify the estimation error of we decompose as the first term of the of the above equation can be bounded for we have si kvi using theorem log log and hence we can conclude that for all si ui the second term can be easily bounded observing that the matrix is of rank kk kkm the last term in the can be controlled as in the performance analysis of step and observing that is of rank kδ it is then easy to remark that for the range of the parameter we are interested in the upper bound of the first term dominates the upper bound of the two other terms finally we obtain the following result see the supplementary material for complete proof with probability kδ the output of algorithm km kf log log satisfies with constant mn mn δn δm theorem when note that if then log δm an asymptotically accurate estimate of hence if the sla algorithm provides as soon as mn required memory and running time required memory lines in sla and have δm entries and we need δm log bits to store the id of these entries similarly the memory required to store is log storing further requires memory finally and iˆ computed in line require and km memory space respectively thus when log this first part of the algorithm requires memory lines before we treat the remaining columns and are removed from the 
memˆ ory using this released memory when the column arrives we can store it compute and and remove the column to save memory therefore we do not need additional memory to treat the remaining columns lines and from iˆ and we compute to this aim the memory required is running time from line to the spca algorithm requires log operations to compute and iˆ are inner products and their computations require δkm operations with log the number of operations to treat the first columns is log kδm km kδ from line to to compute and iˆ when the column arrives we need δkm operations since there are remaining columns the total number of operations is δkmn lines and is computed from using the process which requires operations we then compute using operations hence we conclude that in summary we have shown that theorem the memory required to run the sla algorithm is its running time is δkmn kδ observe that when max log log and log we have δkmn and therefore the running time of sla is δkmn general streaming model sla is approximation algorithm but the set of the first observed columns of needs to be chosen uniformly at random we can readily extend sla to deal with scenarios where the columns of can be observed in an arbitrary order this extension requires two passes on but otherwise performs exactly the same operations as sla in the first pass we extract set of columns chosen uniformly at random and in the second pass we deal with all other columns to extract randomly selected columns in the first pass we proceed as follows assume that when the column of arrives we have already extracted columns then the column is extracted with probability this version of sla enjoys the same performance guarantees as those of sla conclusion this paper revisited the low rank approximation problem we proposed streaming algorithm that samples the data and produces near optimal solution with vanishing mean square error the algorithm uses memory space scaling linearly with the ambient dimension of the matrix the memory required to store the output alone its running time scales as the number of sampled entries of the input matrix the algorithm is relatively simple and in particular does exploit elaborated techniques such as sparse embedding techniques recently developed to reduce the memory requirement and complexity of algorithms addressing various problems in linear algebra references dimitris achlioptas and frank mcsherry fast computation of matrix approximations journal of the acm jacm srinadh bhojanapalli prateek jain and sujay sanghavi tighter approximation via sampling the leveraged element in proceedings of the annual acmsiam symposium on discrete algorithms pages siam kenneth clarkson and david woodruff numerical linear algebra in the streaming model in proceedings of the annual acm symposium on theory of computing pages acm kenneth clarkson and david woodruff low rank approximation and regression in input sparsity time in proceedings of the annual acm symposium on theory of computing pages acm mina ghashami and jeff phillips relative errors for deterministic matrix approximations in soda pages siam nathan halko martinsson and joel tropp finding structure with randomness probabilistic algorithms for constructing approximate matrix decompositions siam review edo liberty simple and deterministic matrix sketching in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages acm ioannis mitliagkas constantine caramanis and prateek jain memory limited streaming pca in advances in 
Neural Information Processing Systems.
Joel Tropp. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis.
Joel Tropp. Tail bounds for sums of random matrices. Foundations of Computational Mathematics.
David Woodruff. Low rank approximation lower bounds in streams. In Advances in Neural Information Processing Systems, pages.
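To make the three steps of SLA described above easier to follow, here is a simplified NumPy sketch of a one-pass variant operating on a stream of columns arriving in random order. It is an illustration under stated simplifications, not the authors' implementation: the trimming of rows and columns with too many sampled entries is omitted, the first-batch size is an ad-hoc multiple of k log(n)/δ rather than the constant prescribed by the analysis, and the output is returned as explicit factors L (m×k) and R (k×n) whose product approximates the best rank-k approximation.

```python
import numpy as np

def sparsify(col, delta, rng):
    """Keep each entry independently with probability delta and rescale by
    1/delta, so the sparsified column is an unbiased estimate of the original."""
    mask = rng.random(col.shape) < delta
    return np.where(mask, col / delta, 0.0)

def sla_sketch(column_stream, n, k, delta, batch=None, iters=50, seed=0):
    """Simplified one-pass streaming rank-k approximation in the spirit of SLA.

    column_stream yields the n columns of M (in random order), one at a time.
    Returns factors L (m x k) and R (k x n) with L @ R approximating the best
    rank-k approximation of M. Trimming of heavy rows/columns is omitted.
    """
    rng = np.random.default_rng(seed)
    if batch is None:
        batch = min(n, int(np.ceil(4 * k * np.log(n) / max(delta, 1e-12))))

    # Step 1: spectral step on the first batch, using two independent
    # sparsifications A1 (for the covariance) and A2 (reused in later steps).
    cols = [next(column_stream) for _ in range(batch)]
    A1 = np.column_stack([sparsify(c, delta, rng) for c in cols])
    A2 = np.column_stack([sparsify(c, delta, rng) for c in cols])
    C = A1.T @ A1
    np.fill_diagonal(C, 0.0)              # the diagonal is inflated by the sampling
    Q = np.linalg.qr(rng.standard_normal((batch, k)))[0]
    for _ in range(iters):                # power iterations -> top right singular directions
        Q = np.linalg.qr(C @ Q)[0]

    # Step 2: stream the remaining columns; each new row of V is W^T a_t.
    W = A2 @ Q                            # m x k; also the first-batch part of Sigma = A V
    V = np.zeros((n, k))
    V[:batch] = Q
    Sigma = W.copy()
    for t in range(batch, n):
        a_t = sparsify(next(column_stream), delta, rng)
        V[t] = W.T @ a_t
        Sigma += np.outer(a_t, V[t])      # running estimate of M @ V

    # Step 3: orthonormalize V (V = Vq Tri) and assemble the rank-k factors:
    # M Vq Vq^T  ~  (M V) Tri^{-1} Vq^T  ~  Sigma Tri^{-1} Vq^T.
    Vq, Tri = np.linalg.qr(V)
    L = np.linalg.solve(Tri.T, Sigma.T).T     # Sigma @ inv(Tri) without forming the inverse
    return L, Vq.T
```

One can exercise this by wrapping the shuffled columns of a noisy low-rank matrix in a generator, calling `sla_sketch`, and comparing `L @ R` against the truncated SVD; only the first batch and the two k-dimensional factors ever need to be held simultaneously, which is the point of the construction.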
learnability of influence in networks harikrishna narasimhan david parkes yaron singer harvard university cambridge ma hnarasimhan parkes yaron abstract we show pac learnability of influence functions for three common influence models namely the linear threshold lt independent cascade ic and voter models and present concrete sample complexity results in each case our results for the lt model are based on interesting connections with neural networks those for the ic model are based an interpretation of the influence function as an expectation over random draw of subgraph and use covering number arguments and those for the voter model are based on reduction to linear regression we show these results for the case in which the cascades are only partially observed and we do not see the time steps in which node has been influenced we also provide efficient polynomial time learning algorithms for setting with full observation where the cascades also contain the time steps in which nodes are influenced introduction for several decades there has been much interest in understanding the manner in which ideas language and information cascades spread through society with the advent of social networking technologies in recent years digital traces of human interactions are becoming available and the problem of predicting information cascades from these traces has gained enormous practical value for example this is critical in applications like viral marketing where one needs to maximize awareness about product by selecting small set of influential users to this end the spread of information in networks is modeled as an influence function which maps set of seed nodes who initiate the cascade to distribution on the set of individuals who will be influenced as result these models are parametrized by variables that are unknown and need to be estimated from data there has been much work on estimating the parameters of influence models or the structure of the underlying social graph from observed cascades of influence spread and on using the estimated parameters to predict influence for given seed set these parameter estimation techniques make use of local influence information at each node and there has been recent line of work devoted to providing sample complexity guarantees for these local estimation techniques however one can not locally estimate the influence parameters when the cascades are not completely observed when the cascades do not contain the time at which the nodes are influenced moreover influence functions can be sensitive to errors in model parameters and existing results do not tell us to what accuracy the individual parameters need to be estimated to obtain accurate influence predictions if the primary goal in an application is to predict influence accurately it is natural to ask for algorithms that have learnability guarantees on the influence function itself benchmark for studying such questions is the probably approximately correct pac learning framework are influence functions pac learnable while many influence models have been popularized due to their approximation guarantees for influence maximization learnability of influence is an equally fundamental property part of this work was done when hn was phd student at the indian institute of science bangalore in this paper we show pac learnability for three influence models the linear threshold the independent cascade and the voter models we primarily consider setting where the cascades are partially observed where only the nodes 
influenced and not the time steps at which they were influenced are observed this is setting where existing local estimation techniques can not be applied to obtain parameter estimates additionally for fully observed setting where the time of influence is also observed we show polynomial time learnability our methods here are akin to using local estimation techniques but come with guarantees on the global influence function main results our learnability results are summarized below linear threshold lt model our result here is based on an interesting observation that lt influence functions can be seen as neural network classifiers and proceed by bounding their the method analyzed here picks function with zero training error while this can be computationally hard to implement under partial observation we provide polynomial time algorithm for the full observation case using local computations independent cascade ic model our result uses an interpretation of the influence function as an expectation over random draw of subgraph this allows us to show that the function is lipschitz and invoke covering number arguments the algorithm analyzed for partial observation is based on global maximum likelihood estimation under full observation and additional assumptions we show polynomial time learnability using local estimation technique voter model our result follows from reduction of the learning problem to linear regression problem the resulting learning algorithm can be implemented in polynomial time for both the full and partial observation settings related work related problem to ours is that of inferring the structure of the underlying social graph from cascades there has been series of results on polynomial sample complexity guarantees for this problem under variants of the ic model most of these results make specific assumptions on the structure and assume full observation setting on the other hand in our problem the structure of the social graph is assumed to be known and the goal is to provably learn the underlying influence function our results do not depend on assumptions on the network structure and primarily apply to the more challenging partial observation setting the work that is most related to ours is that of du et al who show polynomial sample complexity results for learning influence in the lt and ic models under partial observation however their approach uses approximations to influence functions and consequently requires strong technical condition to hold which is not necessarily satisfied in general our results for the lt and ic models are some what orthogonal while the authors in assumptions on learnability and gain efficient algorithms that work well in practice our goal is to show unconditional sample complexity for learning influence we do this at the expense of the efficiency of the learning algorithms in the partial observation setting moreover the technical approach we take is substantially different there has also been work on learnability of families of discrete functions such as submodular and coverage functions under the pac and the variant pmac frameworks these results assume availability of training sample containing exact values of the target function on the given input sets while ic influence functions can be seen as coverage functions the previous results do not directly apply to the ic class as in practice the true expected value of an ic influence function on seed set is never observed and only random realization is seen in contrast our learnability result for 
ic functions do not require the exact function values to be known moreover the previous results require strict assumptions on the input distribution since we focus on learnability of specific function classes rather than large families of discrete functions we are able to handle general seed distributions for most part other results relevant to our work include learnability of linear influence games where the techniques used bear some similarity to our analysis for the lt model preliminaries influence models we represent social network as finite graph where the nodes represent set of individuals and edges represent their social links let the graph is assumed to be directed unless otherwise specified each edge is associated with weight wuv that indicates the strength of influence of node on node we consider setting where each node in the network holds an opinion in and opinions disseminate in the network this dissemination process begins with small subset of nodes called the seed which have opinion while the rest have opinion and continues in discrete time steps in every time step node may change its opinion from to based on the opinion of its neighbors and according to some local model of influence if this happens we say that the node is influenced we will use to denote the set of neighbors of node and at to denote the set of nodes that are influenced at time step we consider three models linear threshold lt model each node holds threshold ru and is influenced at time if the total incoming weight from its neighbors that were influenced at the previous time step exceeds the threshold wuv ku once influenced node can then influence its neighbors for one time step and never changes its opinion to independent cascade ic model restricting edge weights wuv to be in node is influenced at time independently by each neighbor who was influenced at time the node can then influence its neighbors for one time step and never changes its opinion to voter model the graph is assumed to be undirected with at time step node adopts the opinion of its neighbor with probability wuv unlike the lt and ic models here node may change its opinion from to or to at every step we stress that node is influenced at time if it changes its opinion from to exactly at also in both the lt and ic models no node gets influenced more than once and hence an influence cascade can last for at most time steps for simplicity we shall consider in all our definitions only cascades of length while revisiting the voter model in section we will look at more general cascades definition influence function given an influence model global influence function maps an initial set of nodes seeded with opinion to vector of probabilities fn where the uth coordinate indicates the probability of node being influenced during any time step of the corresponding influence cascades note that for the lt model the influence process is deterministic and the influence function simply outputs binary vector in let fg denote the class of all influence functions under an influence model over obtained for different choices of parameters edge in the model we will be interested in learning the influence function for given parametrization of this influence model we shall assume that the initial set of nodes that are seeded with opinion at the start of the influence process or the seed set is chosen according to distribution over all subsets of nodes we are given training sample consisting of draws of initial seed sets from along with observations of nodes influenced 
in the corresponding influence process. Our goal is then to learn from F_G an influence function that best captures the observed influence process.

Measuring loss. To measure the quality of a learned influence function, we define a loss function ℓ that, for any subset of influenced nodes y and predicted influence probabilities q, assigns a value ℓ(y, q) measuring the discrepancy between y and q. We define the error of a learned function f ∈ F_G, for a given seed distribution D and model parametrization, as the expected loss incurred by f: err[f] = E[ℓ(y, f(X))], where the expectation is over a random draw of the seed set X from the distribution D and over the corresponding subsets y of nodes influenced during the cascade. We will be particularly interested in the difference between the error of an influence function f_S ∈ F_G learned from a training sample S and the minimum possible error achievable over all influence functions in F_G, namely err[f_S] − inf_{f ∈ F_G} err[f], and we would like to learn influence functions for which this difference is guaranteed to be small using only polynomially many training examples.

Full and partial observation. We primarily work in a setting in which we observe the nodes influenced in a cascade, but not the time step at which they were influenced. In other words, we assume availability of a partially observed training sample {(X_i, y_i)}, where X_i denotes the seed set of cascade i and y_i is the set of nodes influenced in that cascade. We will also consider a refined notion of full observation, in which we are provided a training sample {(X_i, y_i^1, ..., y_i^n)}, where y_i^t is the set of nodes in cascade i who were influenced precisely at time step t. (In settings where the node thresholds are unknown, it is common to assume that they are chosen randomly by each node; in our setup, the thresholds are parameters that need to be learned from cascades.) Notice that here the complete set of nodes influenced in cascade i is given by the union of the y_i^t. This setting is particularly of interest when discussing learnability in polynomial time. The structure of the social graph is always assumed to be known.

PAC learnability of influence functions. Let F_G be the class of all influence functions under an influence model over a social network G. We say F_G is probably approximately correct (PAC) learnable with respect to a loss ℓ if there exists an algorithm such that the following holds: for all ε, δ ∈ (0, 1), for all parametrizations of the model, and for all (or a subset of) distributions D over seed sets, when the algorithm is given a partially observed training sample S with poly(n, 1/ε, 1/δ) examples, it outputs an influence function f_S ∈ F_G for which P_S(err[f_S] − inf_{f ∈ F_G} err[f] ≤ ε) ≥ 1 − δ, where the probability is over the randomness in S. Moreover, F_G is efficiently PAC learnable under this setting if the running time of the algorithm in the above definition is polynomial in 1/ε, 1/δ and in the size of G. We say F_G is efficiently PAC learnable under full observation if the above definition holds with a fully observed training sample.

Sensitivity of influence functions to parameter errors. A common approach to predicting influence under full observation is to estimate the model parameters using local influence information at each node. However, an influence function can be highly sensitive to errors in estimated parameters. Consider an IC model on a chain of n nodes where all edge parameters are p. If the parameters have all been underestimated with a constant error of ε, the estimated probability of the last node being influenced is (p − ε)^(n−1), which is exponentially smaller than the true value p^(n−1) for large n (a numerical sketch of this effect is included after the references below). Our results for full observation provide concrete sample complexity guarantees for learning influence functions using local estimation to any desired accuracy. In particular, for the above example, our results
prescribe that be driven below for accurate predictions see section on ic model of course under partial observation we do not see enough information to locally estimate the individual model parameters and the influence function needs to be learned directly from cascades the linear threshold model we start with learnability in the linear threshold lt model given that the influence process is deterministic and the influence function outputs binary values we use the loss for evaluation for any subset of nodes and predicted boolean this is the fraction of nodes on pvector which the prediction is wrong χu qu where χu theorem pac learnability under lt model the class of influence functions under lt furmodel is pac learnable and the corresponding sample complexity is thermore in the full observation setting the influence functions can be learned in polynomial time the proof is in appendix and we give an outline here let denote lt influence function with parameters edge weights and thresholds and let us focus on the partial observation setting only node and not its time of influence is observed consider simple algorithm that outputs an influence function with zero error on training sample xx χu fuw mn such function always exists as the training cascades are generated using the lt model we will shortly look at computational issues in implementing this algorithm we now explain our pac learnability result for this algorithm the main idea is in interpreting lt influence functions as neural networks with linear threshold activations the proof follows by bounding the of the class of all functions fuw for node and using standard arguments in showing learnability under finite we sketch the neural network nn construction in two steps local influence as nn and the global influence as multilayer network see figure where crucial part is in ensuring that no node gets influenced more than once during the influence process local influence as recall that the local influence at node for previously influenced nodes is given by this can be modeled as linear binary uv classifier or equivalently as nn with linear threshold activations here the input layer contains unit for each node in the network and takes binary value indicating whether the node figure modeling single time step of the influence process ft as neural network the portion in black computes whether or not node is influenced in the current time step while that in enforces the constraint that does not get influenced more than once during the influence process here ξt is when node has been influenced previously and otherwise the dotted red edges represent strong negative signals has large negative weight and the dotted blue edges represent strong positive signals the initial input to each node in the input layer is while that for the auxiliary nodes in red is is present in the output layer contains binary unit indicating whether is influenced after one time step the connections between the two layers correspond to the edges between and other nodes and the threshold term on the output unit is the threshold parameter ku thus the first step of the influence process can be modeled using nn with two layers the input layer takes information about the seed set and the binary output indicates which nodes got influenced from local to global the multilayer network the nn can be extended to multiple time steps by replicating the output layer once for each step however the resulting nn will allow node to get influenced more than once during the influence process to 
avoid this we introduce an additional binary unit for each node in layer which will record whether node was influenced in previous time steps in particular whenever node is influenced in layer strong positive signal is sent to activate in the next layer which in turn will send out strong negative signals to ensure is never activated in subsequent we use additional connections to ensure that remains active there after note that node in layer is whenever is influenced at time step denote this function computed at for given seed set the lt influence let ft is whenever is influenced in any one of the time steps is function fuw which for seed pn set clearly fuw can be modeled as nn with layers then given by fuw ft naive application of classic results for nn will give us that the of the class of functions fu is counting parameters for each layer since the the remaining proof same parameters are repeated across layers this can be tightened to involves standard uniform convergence arguments and union bound over all nodes efficient computation having shown pac learnability we turn to efficient implementation of the prescribed algorithm partial observation in the case where the training set does not specify the time at which each node was infected finding an influence function with zero training error is computationally hard in general as this is similar to learning recurrent neural network in practice however we can leverage the neural network construction and solve the problem approximately by replacing linear threshold activation functions with sigmoidal activations and the loss with suitable continuous surrogate loss and apply based methods used for neural network learning full observation here it turns out that the algorithm can be implemented in polynomial time using local computations given fully observed sample the loss of an influence function for any is given by yt and as before measures the fraction of mispredicted nodes the prescribed algorithm then seeks to find parameters for which the corresponding training error is given that the time of influence is observed this problem can be decoupled into set of linear programs lps at each node this is akin to locally estimating the parameters at each node in particular let wupdenote the to node incoming weights and threshold and let fu wu wuv ku denote the local influence pm at for set of previously influence nodes let χu fu wu and bt wu χu yt fu wu that given the set of nodes influenced at time measures the local prediction error at time since the training sample was by strong signal we mean large connection weight which will outweigh signals from other connections indeed such connections can be created when the weights are all bounded generated by lt model there always exists parameters such that bt wu for each and which also implies that the overall training error is such set of parameters can be obtained by formulating suitable lp that can be solved in polynomial time the details are in appendix the independent cascade model we now address the question of learnability in the independent cascade ic model since the influence functions here have probabilistic outputs the proof techniques we shall use will be different from the previous section and will rely on arguments based on covering numbers in this case and is given by sq pnwe use the loss which for any χu qu χu qu we shall make mild assumption that the edge probn abilities are bounded away from and for some theorem pac learnability under ic model the class of influence functions under 
the ic furthermore model is pac learnable sq and the sample complexity is in the full observation setting under additional assumptions see assumption the influence functions can be learned in polynomial time with sample complexity the proof is given in appendix as noted earlier an ic influence function can be sensitive to errors in estimated parameters hence before discussing our algorithms and analysis we seek to understand the extent to which changes in the ic parameters can produce changes in the influence function and in particular check if the function is lipschitz for this we use the interpretation of the ic function as an expectation of an indicator term over randomly drawn subset of edges from the network see more specifically the ic cascade process can be seen as activating subset of edges in the network since each edge can be activated at most once the active edges can be seen as having been chosen apriori using independent bernoulli draws consider random subgraph of active edges obtained by choosing each edge independently with probability wuv for given subset of such edges and seed set let σu be an indicator function that evaluates to if is reachable from node in via edges in and otherwise then the ic influence function can be written as an expectation of over random draw of the subgraph fuw wab wab σu while the above definition involves an exponential number of terms it can be verified that the corresponding gradient is bounded thus implying that the ic function is lemma fix for any rr with kw fuw fuw this result tells us how small the parameter errors need to be to obtain accurate influence predictions and will be crucially used in our learnability results note that for the chain example in section this tells us that the errors need to be less than for meaningful influence predictions we are now ready to provide the pac learning algorithm for the partial observation setting with sample we shall sketch the proof here the full observation case is outlined in section where we shall make use of the different approach based on local estimation let denote the ic influence function with parameters the algorithm that we consider for partial observation resorts to maximum likelihood ml estimation of the global ic function let χu define the global for cascade as χu ln fuw χu ln fuw the prescribed algorithm then solves the following optimization problem and outputs an ic influence function from the solution obtained max in practice ic influence functions can be computed through suitable sampling approaches also note that function class can be pac learnable even if the individual functions can not be computed efficiently to provide learnability guarantees for the above ml based procedure we construct finite over the space of ic influence functions show that the class can be approximated to factor of in the infinity norm sense by finite set of ic influence functions we first construct an of size over the space of parameters and use lipschitzness to translate this to an of same size over the ic class following this standard uniform convergence arguments can be used to derive sample complexity guarantee on the expected likelihood with logarithmic dependence on the cover size this then implies the desired learnability result sq lemma sample complexity guarantee on the objective fix and let be the parameters obtained from ml estimation then sup compared to results for the lt model the sample complexity in theorem has square dependence on this is not surprising as unlike the lt model where the 
optimal error is zero the optimal squared error here is in general in fact there are standard sample complexity lower bound results that show that for similar settings one can not obtain tighter bound in terms of we wish to also note that the approach of du et al for learning influence under partial observation uses the same interpretation of the ic influence function as in eq but rather than learning the parameters of the model they seek to learn the weights on the individual indicator functions since there are exponentially many indicator terms they resort to constructing approximations to the influence function for which strong technical condition needs to be satisfied this condition need not however hold in most settings in contrast our result applies to general settings efficient computation partial observation the optimization problem in eq that we need to solve for the partial observation case is in general of course in practice this can be solved approximately using techniques using gradient computations to deal with the exponential number of terms in the definition of in the objective see appendix full observation on the other hand when training sample contains fully observed cascades we are able to show polynomial time learnability for the lt model we were assured of set of parameters that would yield zero error on the training sample and hence the same procedure prescribed for partial information could be implemented under the full observation in polynomial time by reduction to local computations this is not the case with the ic model where we resort to the common approach of learning influence by estimating the model parameters through local maximum likelihood ml estimation technique this method is similar to the maximum likelihood procedure used in for solving different problem of recovering the structure of an unknown network from cascades for the purpose of showing learnability we find it sufficient to apply this procedure to only the first time step of the cascade our analysis first provides guarantees on the estimated parameters and uses the lipschitz property in lemma to translate them to guarantees on the influence function since we now wish to give guarantees in the parameter space we will require that there exists unique set of parameters that explains the ic cascade process for this we will need stricter assumptions we assume that all edges have minimum influence strength and that even when all neighbors of node are influenced in time step there is small probability of not being influenced in the next step we consider specific seed distribution where each node has probability of not being seed node assumption let denote the parameters of the underlying ic model then there exists such that wuv for all and wuv for all also each node in is chosen independently in the initial seed set with probability we first define the local for given seed set and nodes influenced at χu ln exp βuv χu βuv where we have used parameters βuv ln wuv so that the objective is concave in the prescribed algorithm then solves the following maximization problem over all parameters that satisfy assumption and constructs an ic influence function from the parameters maxr βuv ln βuv ln this problem breaks down into smaller convex problems and can be solved efficiently see proposition pac learnability under ic model with full observation under full observation and assumption the class of ic influence functions is pac learnable in polynomial time through local ml estimation the corresponding sample 
complexity is the proof is provided in appendix and proceeds through the following steps we use covering number arguments to show that the local for the estimated parameters is close to the optimal value we then show that under assumption the expected is strongly concave which gives us that closeness to the true model parameters in terms of the likelihood also implies closeness to the true parameters in the parameter space we finally use the lipschitz property in lemma to translate this to guarantees on the global influence function note that the sample complexity here has worse dependence on the number of edges compared to the partial observation case this is due to the approach of requiring guarantees on the individual parameters and then transferring them to the influence function the better dependence on the number of nodes is consequence of estimating parameters locally it would be interesting to see if tighter results can be obtained by using influence information from all time steps and making different assumptions on the model parameters correlation decay assumption in the voter model before closing we sketch of our learnability results for the voter model where unlike previous models the graph is undirected with here we shall be interested in learning influence for fixed number of time steps as the cascades can be longer than with the squared loss again as the loss function this problem almost immediately reduces to linear least squares regression let be matrix of normalized edge weights with wuv wuv wuv if and otherwise note that can be seen as probability transition matrix then for an initial seed set the probability of node being influenced under this model after one time step can be verified to be where is column vector containing in entries corresponding to nodes in and everywhere else similarly for calculating the probability of node being influenced after time steps one can use the transition matrix fu now setting we have fu which is essentially linear function parametrized by weights thus learning influence in the voter model for fixed cascade length can be posed as independent linear regression one per node with coefficients each this can be solved in polynomial time even with partially observed data we then have the following from standard results theorem pac learnability under voter model the class of influence functions under voter model is pac learnable sq in polynomial time and the sample complexity is conclusion we have established pac learnability of some of the most celebrated models of influence in social networks our results point towards interesting connections between learning theory and the literature on influence in networks beyond the practical implications of the ability to learn influence functions from cascades the fact that the main models of influence are pac learnable serves as further evidence of their potent modeling capabilities it would be interesting to see if our results extend to generalizations of the lt and ic models and to investigate sample complexity lower bounds acknowledgements part of this work was carried out while hn was visiting harvard as part of student visit under the joint center for advanced research in machine learning game theory optimization supported by the science technology forum hn thanks kevin murphy shivani agarwal and harish ramaswamy for helpful discussions ys and dp were supported by nsf grant and ys by career and google faculty research award references pedro domingos and matthew richardson mining the network value 
of customers in kdd david kempe jon kleinberg and tardos maximizing the spread of influence through social network in kdd amit goyal francesco bonchi and laks vs lakshmanan learning influence probabilities in social networks in kdd manuel david balduzzi and bernhard uncovering the temporal dynamics of diffusion networks in icml nan du le song alexander smola and ming yuan learning networks of heterogeneous influence in nips manuel jure leskovec and andreas krause inferring networks of diffusion and influence acm transactions on knowledge discovery from data nan du le song manuel and hongyuan zha scalable influence estimation in diffusion networks in nips abir de sourangshu bhattacharya parantapa bhattacharya niloy ganguly and soumen chakrabarti learning linear influence model from transient opinion dynamics in cikm praneeth netrapalli and sujay sanghavi learning the graph of epidemic cascades in sigmetrics hadi daneshmand manuel le song and bernhard estimating diffusion network structures recovery conditions sample complexity algorithm in icml jean and thibaut horel inferring graphs from cascades sparse recovery framework icml bruno abrahao flavio chierichetti robert kleinberg and alessandro panconesi trace complexity of network inference in kdd nan du yingyu liang balcan and le song influence function learning in information diffusion networks in icml leslie valiant theory of the learnable commununications of the acm elchanan mossel and roch on the submodularity of influence in social networks in stoc eyal and asaf shapira note on maximizing the spread of influence in social networks information processing letters balcan and nicholas harvey learning submodular functions in stoc vitaly feldman and pravesh kothari learning coverage functions and private release of marginals in colt jean honorio and luis ortiz learning the structure and parameters of graphical games from behavioral data journal of machine learning research martin anthony and peter bartlett neural network learning theoretical foundations cambridge university press peter bartlett and wolfgang maass vapnik chervonenkis dimension of neural nets handbook of brain theory and neural networks pages tong zhang statistical behaviour and consistency of classification methods based on convex risk minimization annals of mathematical statistics 
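As a concrete companion to the live-edge interpretation of the IC influence function used in the Lipschitz and covering-number argument above, the following Python sketch estimates an IC influence vector by Monte Carlo sampling of live-edge subgraphs, and numerically reproduces the chain-sensitivity example from the preliminaries. This is our own illustration, not code from the paper; the function names (ic_influence, reachable), the graph encoding, and the sample sizes are arbitrary choices made for the example.

```python
# Illustrative sketch (not from the paper): Monte Carlo estimate of an IC
# influence function via the live-edge interpretation f_u(S) = E_H[sigma_u(H, S)],
# where H is a random subgraph keeping each edge (v, u) independently with
# probability w_vu.
import random
from collections import deque

def reachable(live_edges, seeds, n):
    """Return the set of nodes reachable from `seeds` via `live_edges`."""
    adj = [[] for _ in range(n)]
    for (v, u) in live_edges:
        adj[v].append(u)
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return seen

def ic_influence(weights, n, seeds, num_samples=2000, rng=random):
    """Monte Carlo estimate of the IC influence vector (probability that each
    node ever holds opinion 1) for seed set `seeds`. `weights` maps directed
    edges (v, u) to activation probabilities in [0, 1]."""
    counts = [0.0] * n
    for _ in range(num_samples):
        live = [e for e, w in weights.items() if rng.random() < w]
        for u in reachable(live, seeds, n):
            counts[u] += 1.0
    return [c / num_samples for c in counts]

if __name__ == "__main__":
    # Sensitivity example from the text: a directed chain 0 -> 1 -> ... -> n-1
    # with every edge probability p. If each parameter is underestimated by a
    # constant eps, the estimated probability that the last node is influenced
    # is (p - eps)^(n-1), exponentially smaller than the true p^(n-1).
    n, p, eps = 20, 0.9, 0.1
    chain = {(i, i + 1): p for i in range(n - 1)}
    chain_under = {(i, i + 1): p - eps for i in range(n - 1)}
    true_last = ic_influence(chain, n, [0], num_samples=5000)[n - 1]
    est_last = ic_influence(chain_under, n, [0], num_samples=5000)[n - 1]
    print(f"true value ~ {p ** (n - 1):.4f}, Monte Carlo ~ {true_last:.4f}")
    print(f"underestimated ~ {(p - eps) ** (n - 1):.4f}, Monte Carlo ~ {est_last:.4f}")
```

Sampling live-edge subgraphs rather than simulating the cascade step by step is a standard equivalent view of the IC process, and it is exactly the expectation form on which the covering-number argument in the section above is built.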
learning causal graphs with small interventions karthikeyan murat alexandros sriram department of electrical and computer engineering the university of texas at austin usa karthiksh mkocaoglu dimakis sriram abstract we consider the problem of learning causal networks with interventions when each intervention is limited in size under pearl structural equation model with independent errors the objective is to minimize the number of experiments to discover the causal directions of all the edges in causal graph previous work has focused on the use of separating systems for complete graphs for this task we prove that any deterministic adaptive algorithm needs to be separating system in order to learn complete graphs in the worst case in addition we present novel separating system construction whose size is close to optimal and is arguably simpler than previous work in combinatorics we also develop novel information theoretic lower bound on the number of interventions that applies in full generality including for randomized adaptive learning algorithms for general chordal graphs we derive worst case lower bounds on the number of interventions building on observations about induced trees we give new deterministic adaptive algorithm to learn directions on any chordal skeleton completely in the worst case our achievable scheme is an algorithm where is the independence number of the graph we also show that there exist graph classes for which the sufficient number of experiments is close to the lower bound in the other extreme there are graph classes for which the required number of experiments is multiplicatively away from our lower bound in simulations our algorithm almost always performs very close to the lower bound while the approach based on separating systems for complete graphs is significantly worse for random chordal graphs introduction causality is fundamental concept in sciences and philosophy the mathematical formulation of theory of causality in probabilistic sense has received significant attention recently formulation advocated by pearl considers the structural equation models in this framework is cause of if can be written as for some deterministic function and some latent random variable given two causally related variables and it is not possible to infer whether causes or causes from random samples unless certain assumptions are made on the distribution of on for more than two random variables directed acyclic graphs dags are the most common tool used for representing causal relations for given dag the directed edge shows that is cause of if we make no assumptions on the data generating process the standard way of inferring the causal directions is by performing experiments the interventions an intervention requires modifying the process that generates the random variables the experimenter has to enforce values on the random variables this process is different than conditioning as explained in detail in the natural problem to consider is therefore minimizing the number of interventions required to learn causal dag hauser et al developed an efficient algorithm that minimizes this number in the worst case the algorithm is based on optimal coloring of chordal graphs and requires at most log interventions to learn any causal graph where is the chromatic number of the chordal skeleton however one important open problem appears when one also considers the size of the used interventions each intervention is an experiment where the scientist must force set of variables to take random 
values unfortunately the interventions obtained in can involve up to variables the simultaneous enforcing of many variables can be quite challenging in many applications for example in biology some variables may not be enforceable at all or may require complicated genomic interventions for each parameter in this paper we consider the problem of learning causal graph when intervention sizes are bounded by some parameter the first work we are aware of for this problem is by eberhardt et al where he provided an achievable scheme furthermore shows that the set of interventions to fully identify causal dag must satisfy specific set of combinatorial conditions called separating when the intervention size is not constrained or is in with the assumption that the same holds true for any intervention size hyttinen et al draw connections between causality and known separating system constructions one open problem is if the learning algorithm is adaptive after each intervention is separating system still needed or can one do better it was believed that adaptivity does not help in the worst case and that one still needs separating system our contributions we obtain several novel results for learning causal graphs with interventions bounded by size the problem can be separated for the special case where the underlying undirected graph the skeleton is the complete graph and the more general case where the underlying undirected graph is chordal for complete graph skeletons we show that any adaptive deterministic algorithm needs separating system this implies that lower bounds for separating systems also hold for adaptive algorithms and resolves the previously mentioned open problem we present novel combinatorial construction of separating system that is close to the previous lower bound this simple construction may be of more general interest in combinatorics recently showed that randomized adaptive algorithms need only log log interventions with high probability for the unbounded case we extend this result and show that nk log log interventions of size bounded by suffice with high probability we present more general information theoretic lower bound of to capture the performance of such randomized algorithms we extend the lower bound for adaptive algorithms for general chordal graphs we show that over all orientations the number of experiments from separating system is needed where is the chromatic number of the skeleton graph we show two extremal classes of graphs for one of them the interventions through separating system is sufficient for the other class we need experiments in the worst case we exploit the structural properties of chordal graphs to design new deterministic adaptive algorithm that uses the idea of separating systems together with adaptability to meek rules we simulate our new algorithm and empirically observe that it performs quite close to the separating system our algorithm requires much fewer interventions compared to separating systems background and terminology essential graphs causal dag is directed acyclic graph where xn is set of random variables and is directed edge if and only if is direct cause of we adopt pearl structural equation model with independent errors in this work see for more details separating system is matrix with distinct columns and each row has at most ones variables in cause xi if xi xj ey where ey is random variable independent of all other variables the causal relations of imply set of conditional independence ci relations between the variables conditional 
independence relation is of the following form given the set and the set are conditionally independent for some disjoint subsets of variables due to this causal dags are also called causal bayesian networks set of variables is bayesian with respect to dag if the joint probability distribution of can be factorized as product of marginals of every variable conditioned on its parents all the ci relations that are learned statistically through observations can also be inferred from the bayesian network using graphical criterion called the assuming that the distribution is faithful to the graph two causal dags are said to be markov equivalent if they encode the same set of cis two causal dags are markov equivalent if and only if they have the same and the same the class of causal dags that encode the same set of cis is called the markov equivalence class we denote the markov equivalence class of dag by the graph of all dags in is called the essential graph of it is denoted is always chain graph with chain components the criterion can be used to identify the skeleton and all the immoralities of the underlying causal dag additional edges can be identified using the fact that the underlying dag is acyclic and there are no more immoralities meek derived local rules meek rules introduced in to be recursively applied to identify every such additional edge see theorem of the repeated application of meek rules on this partially directed graph with identified immoralities until they can no longer be used yields the essential graph interventions and active learning given set of variables xn an intervention on set of the variables is an experiment where the performer forces each variable to take the value of another independent from other variables variable this operation and how it affects the joint distribution is formalized by the do operator by pearl an intervention modifies the causal dag as follows the post intervention dag is obtained by removing the connections of nodes in to their parents the size of an intervention is the number of intervened variables let denote the complement of the set learning algorithms can be applied to to identify the set of removed edges parents of and the remaining adjacent edges in the original skeleton are declared to be the children hence the orientations of the edges of the cut between and in the original dag can be inferred then local meek rules introduced in are repeatedly applied to the original dag with the new directions learnt from the cut to learn more till no more directed edges can be identified further application of algorithms on will reveal no more information the meek rules are given below is oriented as if and is oriented as if and is oriented as if and given bayesian network any ci relation implied by holds true all the cis implied by the distribution can be found using if the distribution is faithful faithfulness is widely accepted assumption since it is known that only measure zero set of distributions are not faithful skeleton of dag is the undirected graph obtained when directed edges are converted to undirected edges an induced subgraph on is an immorality if and are disconnected and graph union of two dags and with the same skeleton is partially directed graph where va vb is undirected if the edges va vb in and have different directions and directed as va vb if the edges va vb in and are both directed as va vb an undirected graph is chordal if it has no induced cycle of length greater than this means that can be decomposed as sequence of 
undirected chordal graphs gm chain components such that there is directed edge from vertex in gi to vertex in gj only if is oriented as if and the concepts of essential graphs and markov equivalence classes are extended in to incorporate the role of interventions let im be set of interventions and let the above process be followed after each intervention interventional markov equivalence class equivalence of dag is the set of dags that represent the same set of probability distributions obtained when the above process is applied after every intervention in it is denoted by similar to the observational case essential graph of dag is the graph union of all dags in the same equivalence class it is denoted by ei we have the following sequence ci learning meek rules learn by meek rules therefore after set of interventions has been performed the essential graph ei is graph with some oriented edges that captures all the causal relations we have discovered so far using before any interventions happened captures the initially known causal directions it is known that ei is chain graph with chordal chain components therefore when all the directed edges are removed the graph becomes set of disjoint chordal graphs problem definition we are interested in the following question problem given that all interventions in are of size at most variables for each intervention minimize the number of interventions such that the partially directed graph with all directions learned so far ei the question is the design of an algorithm that computes the small set of interventions given note of course that the unknown directions of the edges are not available to the algorithm one can view the design of as an active learning process to find from the essential graph is chain graph with undirected chordal components and it is known that interventions on one chain components do not affect the discovery process of directed edges in the other components so we will assume that is undirected and chordal graph to start with our notion of algorithm does not consider the time complexity of statistical algorithms involved of steps and in given interventions we only consider efficiently computing using possibly the graph im we consider the following three classes of algorithms algorithm the choice of is fixed prior to the discovery process adaptive algorithm at every step the choice of is deterministic function of im randomized adaptive algorithm at every step the choice of is random function of im the problem is different for complete graphs versus more general chordal graphs since rule becomes applicable when the graph is not complete thus we give separate treatment for each case first we provide algorithms for all three cases for learning the directions of complete graphs kn undirected complete graph on vertices then we generalize to chordal graph skeletons and provide novel adaptive algorithm with upper and lower bounds on its performance the missing proofs of the results that follow can be found in the appendix complete graphs in this section we consider the case where the skeleton we start with is an undirected complete graph denoted kn it is known that at any stage in starting from rules and do not apply further the underlying dag is directed clique the directed clique is characterized by an ordering on such that in the subgraph induced by for some ordering let has no incoming edges let be denoted by denote the set we need the following results on separating system for our first result regarding adaptive and algorithms for 
complete graph separating system definition an system on an element set is set of subsets sm such that and for every pair there is subset such that either or if pair satisfies the above condition with respect to then is said to separate the pair here we consider the case when in katona gave an system together with lower bound on in wegener gave simpler argument for the lower bound and also provided tighter upper bound than the one in in this work we give different construction below where the separating system size is at ne larger than the construction of wegener however our construction has simpler description lemma there is labeling procedure that produces distinct length labels for all elements in using letters from the integer alphabet where dloga ne further in every digit or position any integer letter is used at most times once we have set of string labels as in lemma our separating system construction is straightforward theorem consider an alphabet nk of size nk where label every element of an element set using distinct string of letters from of length dlogd nk ne using the procedure in lemma with nk for every and nk choose the subset si of vertices whose string letter is the set of all such subsets si is system on elements and nk dlogd nk ne adaptive algorithms equivalence to separating system consider any algorithm that designs set of interventions each of size at most has to be separating system in the worst case over all this is already to discover known now we prove the necessity of separating system for deterministic adaptive algorithms in the worst case theorem let there be an adaptive deterministic algorithm that designs the set of interventions for any ground truth ordering starting from the such that the final graph learnt ei initial skeleton kn then there exists such that designs an which is separating system the theorem above is independent of the individual intervention sizes therefore we have the following theorem which is direct corollary of theorem theorem in the worst case over any adaptive or deterministic algorithm on has to be such that log ne there is feasible with the dag dlogd nk ne proof by theorem we need separating system in the worst case and the lower and upper bounds are from randomized adaptive algorithms in this section we show that that total number of variable accesses to fully identify the complete causal dag is on variables using interventheorem to fully identify complete causal dag tions interventions are necessary also the total number of variables accessed is at least the lower bound in theorem is information theoretic we now give randomized algorithm that requires nk log log experiments in expectation we provide straightforward generalization of where the authors gave randomized algorithm for unbounded intervention size theorem let be kn and the experiment size nr for some then there exists randomized adaptive algorithm which designs an such that ei with probability polynomial in and nk log log in expectation general chordal graphs in this section we turn to interventions on general dag after the initial stages in is chain graph with chordal chain components there are no further immoralities throughout the graph in this work we focus on one of the chordal chain components thus the dag we work on is assumed to be directed graph with no immoralities and whose skeleton is chordal we are interested in recovering from using interventions of size at most following bounds for chordal skeletons we provide lower bound for both adaptive and deterministic 
schemes for chordal skeleton let be the coloring number of the given chordal graph since chordal graphs are perfect it is the same as the clique number theorem given chordal in the worst case over all dags which has skeleton and no immoralities if every intervention is of size at most then log for any adaptive and algorithm with ei upper bound clearly the separating system based algorithm of section can be applied to the vertices in the chordal skeleton and it is possible to find all the directions thus logd nk this with the lower bound implies an approximation logd algorithm since logd nk log under mild assumption ne remark the separating system on nodes gives an approximation however the new algorithm in section exploits chordality and performs much better empirically it is possible to show that our heuristic also has an approximation guarantee but we skip that two extreme counter examples we provide two classes of chordal skeletons one for which the number of interventions close to the lower bound is sufficient and the other for which the number of interventions needed is very close to the upper bound theorem there exists chordal skeletons such that for any algorithm with intervention size constraint the number of interventions required is at least where and are the independence number and chromatic numbers respectively there exists chordal graph classes such that edlogd is sufficient an improved algorithm using meek rules in this section we design an adaptive deterministic algorithm that anticipates meek rule usage along with the idea of separating system we evaluate this experimentally on random chordal graphs first we make few observations on learning connected directed trees from the skeleton undirected trees are chordal that do not have immoralities using meek rule where every intervention is of size because the tree has no cycle meek rules do not apply lemma every node in directed tree with no immoralities has at most one incoming edge there is root node with no incoming edges and intervening on that node alone identifies the whole tree using repeated application of rule lemma if every intervention in is of size at most learning all directions on directed tree with no immoralities can be done adaptively with at most where is the number of vertices in the tree the algorithm runs in time poly lemma given any chordal graph and valid coloring the graph induced by any two color classes is forest in the next section we combine the above single intervention adaptive algorithm on directed trees which uses meek rules with that of the separating system approach description of the algorithm the key motivation behind the algorithm is that pair of color classes is forest lemma choosing the right node to intervene leaves only small subtree unlearnt as in the proof of lemma in subsequent steps suitable nodes in the remaining subtrees could be chosen until all edges are learnt we give brief description of the algorithm below let denote the initial undirected chordal skeleton and let be its coloring number consider separating system si to intervene on the actual graph an intervention set ii corresponding to si is chosen we would like to intervene on node of color si consider node of color now we attach score as follows for any color si consider the induced forest on the color classes and in consider the tree containing node in let be the degree of in let td be the resulting disjoint trees after node is removed from if is intervened on according to the proof of lemma all edge directions in all 
trees ti except one of them would be learnt when applying meek rules and rule all the directions from to all its neighbors would be found the score is taken to be the total number of edge directions guaranteed to be learnt in the worst case therefore the score is max the node with the highest score among the color class is used for the intervention ii after intervening on ii all the edges whose directions are known through meek rules by repeated application till nothing more can be learnt and are deleted from once is processed we recolor the sparser graph we find new with the new chromatic number on and the above procedure is repeated the exact hybrid algorithm is described in algorithm theorem given an undirected choral skeleton of an underlying directed graph with no immoralities algorithm ends in finite time and it returns the correct underlying directed graph the algorithm has runtime complexity polynomial in algorithm hybrid algorithm using meek rules with separating system input chordal graph skeleton with no immoralities initialize ed with nodes and no directed edges initialize time while do color the chordal graph with colors standard algorithms exist to do it in linear time initialize color set form min separating system such that for until do initialize intervention it for si and every node in color class do consider cp and tj as per definitions in sec compute max sic end for if then it it argmax else it it first nodes with largest nonzero end if apply and meek rules using ed and after intervention it add newly learnt directed edges to ed and delete them from end for remove all nodes which have degree in end while return simulations information theoretic lb max clique sys entropic lb max clique sys achievable lb our construction clique sys lb our heuristic algorithm naive sys based algorithm seperating system ub number of experiments number of experiments chromatic number information theoretic lb max clique sys entropic lb max clique sys achievable lb our construction clique sys lb our heuristic algorithm naive sys based algorithm seperating system ub chromatic number figure no of vertices intervention size bound the number of experiments is compared between our heuristic and the naive algorithm based on the separating system on random chordal graphs the red markers represent the sizes of separating system green circle markers and the cyan square markers for the same value correspond to the number of experiments required by our heuristic and the algorithm based on an separating system theorem respectively on the same set of chordal graphs note that when and the naive algorithm requires on average about and close to experiments respectively while our algorithm requires at most orderwise close to when we simulate our new heuristic namely algorithm on randomly generated chordal graphs and compare it with naive algorithm that follows the intervention sets given by our separating system as in theorem both algorithms apply and meek rules after each intervention according to we plot the following lower bounds information theoretic lb of max clique sys entropic lb which is the chromatic number based lower bound of theorem moreover we use two known separating system constructions for the maximum clique size as references the best known separating system is shown by the label max clique sys achievable lb and our new simpler separating system construction theorem is shown by our construction clique sys lb as an upper bound we use the size of the best known separating system without any meek 
rules and is denoted separating system ub random generation of chordal graphs start with random ordering on the vertices consider every vertex starting from for each vertex with probability inversely proportional to for every si where si the proportionality constant is changed to adjust sparsity of the graph after all such are considered make si ne clique by adding edges respecting the ordering where ne is the neighborhood of the resultant graph is dag and the corresponding skeleton is chordal also is perfect elimination ordering results we are interested in comparing our algorithm and the naive one which depends on the separating system to the size of the separating system the size of the separating system is roughly consider values around on the for the plots with and note that our algorithm performs very close to the size of the separating system in fact it is always in both cases while the average performance of naive algorithm goes from close to to close to the result points to this for random chordal graphs the structured tree search allows us to learn the edges in number of experiments quite close to the lower bound based only on the maximum clique size and not the plots for and are given in appendix acknowledgments authors acknowledge the support from grants nsf ccf and aro yip award we also thank frederick eberhardt for helpful discussions references pearl causality models reasoning and inference cambridge university press hauser and two optimal strategies for active learning of causal models from interventional data international journal of approximate reasoning vol no pp eberhardt glymour and scheines on the number of experiments sufficient and in the worst case necessary to identify all causal relations among variables in proceedings of the conference on uncertainty in artificial intelligence uai pp hyttinen eberhardt and hoyer experiment selection for causal discovery journal of machine learning research vol pp hu li and vetta randomized experimental design for causal graph discovery in proceedings of nips montreal ca december shimizu hoyer hyvarinen and kerminen linear acyclic model for causal discovery journal of machine learning research vol pp hoyer janzing mooij peters and nonlinear causal discovery with additive noise models in proceedings of nips eberhardt causation and intervention thesis spirtes glymour and scheines causation prediction and search book bradford meek strong completeness and faithfulness in bayesian networks in proceedings of the eleventh international conference on uncertainty in artificial intelligence andersson madigan and perlman characterization of markov equivalence classes for acyclic digraphs the annals of statistics vol no pp verma and pearl an algorithm for deciding if set of observed independencies has causal explanation in proceedings of the eighth international conference on uncertainty in artificial intelligence meek causal inference and causal explanation with background knowledge in proceedings of the eleventh international conference on uncertainty in artificial intelligence hauser and characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs journal of machine learning research vol no pp two optimal strategies for active learning of causal networks from interventional data in proceedings of sixth european workshop on probabilistic graphical models katona on separating systems of finite set journal of combinatorial theory vol pp wegener on separating systems whose elements are sets of 
at most k elements. Discrete Mathematics. Lipton and Tarjan. A separator theorem for planar graphs. SIAM Journal on Applied Mathematics.
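The statements of the three Meek rules in the preliminaries above were lost in extraction; the sketch below restates them in their standard form from Meek (1995) and shows the intervene-then-propagate step on which the adaptive algorithms in this paper build: an intervention on a set I reveals the directions of all cut edges between I and its complement, and the rules are then applied to a fixpoint. This is our own illustrative code, not the authors' implementation; the function names and the toy directed-path example are ours, and rule R4 is omitted since only three rules are listed above.

```python
# Illustrative sketch (ours, not the paper's code): one round of the
# intervene-then-propagate step. Rule statements follow Meek (1995).
from itertools import combinations, permutations

def orient_cut(skeleton, true_dag, I):
    """Orient the cut edges revealed by intervening on I.
    skeleton: set of frozenset({u, v}); true_dag: set of directed pairs (u, v)."""
    oriented = set()
    for e in skeleton:
        u, v = tuple(e)
        if (u in I) != (v in I):            # exactly one endpoint intervened
            oriented.add((u, v) if (u, v) in true_dag else (v, u))
    return oriented

def apply_meek_rules(skeleton, oriented):
    """Propagate Meek rules R1-R3 until no new edge can be oriented."""
    adj = lambda a, b: frozenset({a, b}) in skeleton
    directed = set(oriented)
    undirected = lambda a, b: adj(a, b) and (a, b) not in directed and (b, a) not in directed
    nodes = {x for e in skeleton for x in e}
    changed = True
    while changed:
        changed = False
        for a, b in permutations(nodes, 2):
            if not undirected(a, b):
                continue
            # R1: c -> a, a - b, and c, b non-adjacent  =>  orient a -> b
            if any((c, a) in directed and not adj(c, b) for c in nodes if c not in (a, b)):
                directed.add((a, b)); changed = True; continue
            # R2: a -> c -> b and a - b  =>  orient a -> b
            if any((a, c) in directed and (c, b) in directed for c in nodes if c not in (a, b)):
                directed.add((a, b)); changed = True; continue
            # R3: a - c, a - d, c -> b, d -> b, and c, d non-adjacent  =>  orient a -> b
            if any(undirected(a, c) and undirected(a, d) and (c, b) in directed
                   and (d, b) in directed and not adj(c, d)
                   for c, d in combinations(nodes, 2) if c not in (a, b) and d not in (a, b)):
                directed.add((a, b)); changed = True
    return directed

if __name__ == "__main__":
    # Directed path 0 -> 1 -> 2 -> 3 (a tree, hence no immoralities): intervening
    # on the root alone orients the whole tree via repeated use of R1.
    true_dag = {(0, 1), (1, 2), (2, 3)}
    skeleton = {frozenset(e) for e in true_dag}
    learned = apply_meek_rules(skeleton, orient_cut(skeleton, true_dag, I={0}))
    print(sorted(learned))   # -> [(0, 1), (1, 2), (2, 3)]
```

The toy example mirrors the observation made for trees above: a single size-one intervention on the root suffices, with R1 carrying the orientation down the path.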
Lower Bounds for Convex Optimization with Erroneous Oracles. Jan Vondrák (IBM Almaden Research Center, San Jose, CA; jvondrak), Yaron Singer (Harvard University, Cambridge, MA; yaron).

Abstract. We consider the problem of optimizing convex and concave functions with access to an erroneous oracle. In particular, for a given function f, we consider optimization when one is given access to absolute error oracles that return values in [f(x) − ε, f(x) + ε], or relative error oracles that return a value in [(1 − ε)f(x), (1 + ε)f(x)], for some ε > 0. We show stark information-theoretic impossibility results for minimizing convex functions and maximizing concave functions over polytopes in this model.

Introduction. Consider the problem of minimizing a convex function over some convex domain. It is well known that this problem is solvable, in the sense that there are algorithms which make polynomially many calls to an oracle that evaluates the function at every given point and return a point which is arbitrarily close to the true minimum of the function. But suppose that instead of the true value of the function, the oracle has some small error: would it still be possible to optimize the function efficiently? To formalize the notion of error, we can consider two types of erroneous oracles. For a given function f, we say that f̃ is an absolute ε-error oracle if for every x we have f̃(x) = f(x) + ξ(x), where |ξ(x)| ≤ ε. For a given function f, we say that f̃ is a relative ε-error oracle if for every x we have f̃(x) = ξ(x) f(x), where ξ(x) ∈ [1 − ε, 1 + ε]. Note that we intentionally do not make distributional assumptions about the errors. This is in contrast to noise, where the errors are assumed to be random and independently generated from some distribution; in such cases, under reasonable conditions on the distribution, one can obtain arbitrarily good approximations of the true function value by averaging polynomially many points in some ball around the point of interest. Stated in terms of noise, in this paper we consider oracles that have some small adversarial noise, and we wish to understand whether desirable optimization guarantees are obtainable. To avoid ambiguity, we refrain from using the term noise altogether and refer to such inaccuracies in evaluation as error. While distributional assumptions are often reasonable models, evaluating our dependency on these assumptions seems necessary from a practical perspective: there are cases in which noise can be correlated, or where the data we use to estimate the function is corrupted in some arbitrary way. Furthermore, since we often optimize over functions that we learn from data, the process of fitting a model to a function may also introduce some bias that does not necessarily vanish, or may vanish only slowly. More generally, it seems that we should understand the consequences that modest inaccuracies may have.

Figure 1: an illustration of an erroneous oracle to a convex function that fools a gradient descent algorithm.

Benign cases. In the special case of a linear function f(x) = c·x for some c ∈ R^n, relative error has little effect on the optimization: by querying f̃(e_i) for every i, we can extract values c̃_i with (1 − ε)c_i ≤ c̃_i ≤ (1 + ε)c_i and then optimize over these; this results in an approximation whose loss depends only on ε. Alternatively, if the erroneous oracle f̃ happens to be a convex function, optimizing f̃ directly retains desirable optimization guarantees, up to either additive or multiplicative error. We are therefore interested in scenarios where the error does not necessarily have such nice properties.

Gradient descent fails with error. For a simple example, consider the function illustrated in Figure 1. The figure illustrates a convex function, depicted in blue, and an erroneous version of it, dotted red; on every point, the oracle is at most some additive ε away from the true function value (a small numerical sketch of this phenomenon is included after the related work below). The margins of the
function are depicted in grey if we assume that gradient descent algorithm is given access to the erroneous version dotted red instead of the true function blue the algorithm will be trapped in local minimum that can be arbitrarily far from the true minimum but the fact that naive gradient descent algorithm fails does not necessarily mean that there isn an algorithm that can overcome small errors this narrates the main question in this paper is convex optimization robust to error main results our results are largely spoilers we present stark lower bounds for both relative and absolute oracles for any constant and even in particular we show that for minimizing convex function or maximizing concave function over we show that for any fixed no algorithm can achieve an additive approximation within of the optimum using subexponential number of calls to an absolute oracle for minimizing convex function over polytope for any fixed no algorithm can achieve finite multiplicative factor using subexponential number of calls to relative oracle for maximizing concave function over polytope for any fixed no algorithm can achieve multiplicative factor better than using subexponential number of calls to relative oracle for maximizing concave function over for any fixed no algorithm can obtain multiplicative factor better than using subexponential number of calls to relative oracle and there is trivial without asking any queries somewhat surprisingly many of the impossibility results listed above are shown for class of extremely simple convex and concave functions namely affine functions this is in sharp contrast to the case of linear functions without the constant term with relative erroneous oracles as discussed above in addition we note that our results extend to strongly convex functions related work the oracle models we study here fall in the category of or derivative free derivativefree methods have rich history in convex optimization and were among the earliest to numerically solve unconstrained optimization problems recently these approaches have enjoyed increasing interest as they are useful in scenarios where access is given to the function or cases in which gradient information is difficult to compute or does not exist there has been rich line of work for noisy oracles where the oracles return some erroneous version of the function value which is random in stochastic framework these settings correspond to repeatedly choosing points in some convex domain obtaining noisy realizations of some underlying convex function value most frequently the assumption is that one is given noisy oracle with some assumptions about the distribution that generates the error in the learning theory community optimization with stochastic noisy oracles is often motivated by bandits settings and regret minimization with feedback all these models consider the case in which the error is drawn from distribution the model of adversarial noise in zeroth order oracles has been mentioned in which considers related model of erroneous oracles and informally argues that exponentially many queries are required to approximately minimize convex function in this model under an constraint in recent work belloni et al study convex optimization with erroneous oracles interestingly belloni et al show positive results in their work they develop novel algorithm that is based on sampling from an approximately distribution using the method and show that their method has polynomial query complexity in contrast to the negative results we show 
in this work the work of belloni et al assumes the absolute erroneous oracle returns ξx with ξx that is the error is not constant term but rather is inversely proportional to the dimension our lower bounds for additive approximation hold when the oracle error is not necessarily constant but ξx for constant preliminaries optimization and convexity for minimization problem given nonnegative objective function and polytope we will say that an algorithm provides multiplicative if it finds point for maximization problem an algorithm provides an if it finds point for absolute erroneous oracles given an objective function and polytope we will aim to find point which is within an additive error of from the optimum with as small as possible that is for we aim to find point in the case of minimization function is convex on if tx tf or concave if tx tf for every and chernoff bounds throughout the paper we appeal to the chernoff bounds we note that while typically stated for independent random variables xm chernoff bounds also hold for negatively associated random variables definition definition random variables xn are negatively associated if for every and every ri ri xi xj xi xj claim theorem pn let xn be negatively associated random variables that take values in and xi then for any we have that pr xi pr xi we apply this to random variables that are formed by selecting random subset of fixed size in particular we use the following claim let xn be fixed for let be uniformly random subset of elements out of let xi xi if and xi otherwise then xn are negatively associated proof for xn the statement holds by corollary of which refers to this distribution as the model the generalization to arbitrary xi follows from proposition of with ij and hj xj optimization over the unit cube we start with optimization over arguably the simplest possible polytope we show that already in this setting the presence of adversarial noise prevents us from achieving much more than trivial results convex minimization first let us consider convex minimization over in this setting we show that errors as small as prevent us from optimizing within constant additive error theorem let be constant there are instances of convex function accessible through an absolute oracle such that possibly randomized algorithm that makes eo queries can not find solution of value better than within additive of the optimum with probability more than we remark that the proof of this theorem is inspired by the proof of hardness of approximation for unconstrained submodular maximization in particular it can be viewed as simple application of the symmetry gap argument see for more general exposition proof let we can assume that otherwise is constant and the statement is trivial we will construct an oracle both in the relative and absolute sense for convex function consider partition of into two subsets of size which will be eventually chosen randomly we define the following function xi xj this is convex in fact linear function next we define the following modification of which could be the function returned by an oracle if xi xj then if xi xj then xi xj note that and differ only in the region where xi xj in particular the value of in this region is within while so an oracle for both in the relative and absolute sense could very well return instead now assume that is random partition unknown to the algorithm we argue pthat with high probability fixed query issued by the algorithm will have the property that xi xj more precisely since is chosen at random 
subject to we have that xi is sum of negatively associated random variables in by claim pn the expectation of this quantity is xi xi by claim we have pr xi pr xi since xi xi xi we get pr xi xi pr xi by symmetry pr xi xj we emphasize that this holds for fixed query recall that we assumed the algorithm to be deterministic hence as long as its queries satisfy the property above the answers will be and the algorithm will follow the same path of computation no matter what the choice of is effectively we will not learn anything about and considering the sequence of queries on this computation path if the number of queries is then with probability at least the queries will indeed fall in the region where and the algorithm will follow this path if this happens with probability at least in this case all the points queried by the algorithm as well as the returned solution xout by the same argument satisfies xout and hence xout in contrast the actual optimum is recall that hence xout and the bounds on the number of queries and probability of success are as in the statement of the theorem finally consider randomized algorithm denote by the random variables used by the algorithm in its decisions we can condition on fixed choice of which makes the algorithm deterministic by our proof the algorithm conditioned on this choice can not succeed with probability more than since this is true for each particular choice of by averaging it is also true for random choice of hence we obtain the same result for randomized algorithms as well concave maximization here we consider the problem of maximizing concave function one can obtain result for concave maximization analogous to theorem which we do not state in terms of additive errors there is really no difference between convex minimization and concave maximization however in the case of concave maximization we can also formulate the following hardness result for multiplicative approximation theorem if concave function is accessible through oracle then for any an algorithm that makes less than queries can not find solution of value greater than op with probability more than proof this result follows from the same construction as theorem recall that is linear function hence also concave as we mentioned in the proof of theorem could be the values returned by relative oracle now we consider an arbitrary note that for it still holds that is relative oracle by the same proof an algorithm querying less than points can not find solution of value in contrast the optimum of the maximization better than with probability more than problem is therefore the algorithm can not achieve multiplicative approximation better than we note that this hardness result is optimal due to the following easy observation theorem for any concave function let op then op proof by compactness the optimum is attained at point let op let also we have and hence by concavity we obtain op in other words multiplicative for this problem is trivial to obtain even without asking any queries about we just return the point thus we can conclude that for concave maximization relative oracle is not useful at all optimization over polytopes in this section we consider optimization of convex and concave functions over polytope ax we will show inappoximability results for the relative error model note that for the absolute error case the lower bound on convex minimization from the previous section holds and can be applied to show lower bound for concave maximization with absolute errors theorem let be some 
constants there are convex functions for which no algorithm can obtain finite approximation ratio to using en queries to relative oracle of the function proof we will prove our theorem for the case in which let xi be subset of indices chosen uniformly at random from all subsets of size exactly we construct two functions xi xi observe that both these functions are convex and also observe that the minimizer of is and while the minimizer of is any vector xi and therefore the ratio between these two functions is unbounded we will now construct the erroneous oracle in the following manner if otherwise by definition fe is an oracle to the claim will follow from the fact that given access to fe one can not distinguish between and using subexponential number of queries this implies the inapproximability result since an approximation algorithm which guarantees finite approximation ratio using subexponential number of queries could be used to distinguish between the two functions if the algorithm returns an answer strictly greater than then we know the underlying function is and otherwise it is given query to the oracle we will consider two cases in case the query is such that xi then we have that since for any there is large enough nδ this implies that for any query for which xi then we have that and thus the oracle returns in then we can interpret the value of xi case the query is such that which determines value of as sum of negatively associated random variables xn where xi realizes with probability and takes value xi if realized see claim wepcan then apply the chernoff bound claim using the fact that xi and get that for any constant we have that with probability xi by using this implies that with probability at least we get that since the likelihood of distinguishing between and on single query is exponentially small in nδ the same arguments used throughout the paper imply that it takes an exponential number of queries to distinguish between and to conclude for any query it takes en queries to distinguish between and as discussed above due to the fact that the ratio between the optima of these two functions is unbounded this concludes the proof theorem constants there is concave function for which no algorithm can obtain an approximation strictly better than to using en queries to relative oracle of the function proof we follow similar methodology as in the proof of theorem we again we select set of size and construct two functions xi and as in the proof of theorem the noisy oracle when xi and otherwise fe note that both functions are concave and and by its definition the oracle is for the function for it is easy to see that the optimal value when the objective is is while the optimal value is when the objective is which implies that one can not obtain an approximation better than with subexponential number of queries in case the query to the oracle is point then by chernoff bound arguments similar to the ones we used above with probability we get thus for any query in at which xi the likelihood of the oracle returning is exponentially small in nδ in case the query is point xi standard concentration bound arguments as before imply that with probability at least we get since the likelihood of distinguishing between and on single query is exponentially small in nδ we can conclude that it takes an exponential number of queries to distinguish between and optimization over assignments in this section we consider the concave maximization problem over more specific polytope pn xij this can be viewed 
as the matroid polytope for partition matroid on blocks of elements or alternatively the convex hull of assignments of items to agents in this case there is trivial similar to the in the case of unit cube theorem for any and concave function pn let op then op proof by compactness the optimum is attained at point let op let xij mod is cyclic shift of the coordinates of by in each block we have pk pn and xij by concavity and nonnegativity of we obtain op we show that this approximation is best possible if we have access only to oracle theorem if and concave function pn is accessible through erroneous oracle then for any an algorithm that makes less than queries can not find solution of value greater than op with probability more than note that this result is nontrivial only for in other words the hardness factor of is never worse than square root of the dimension of the problem therefore this result can be viewed as interpolating between the hardness of over the unit cube theorem and the hardness of over general polytope theorem proof given we construct function pn where describes the intended optimal solution pn xi next we define modified function as follows if then if then by definition and differ only if and then while therefore is valid relative oracle for similarly to the proofs above we argue that if is chosen uniformly at random then with high probability for any fixed query pn this holds again by chernoff bound for pk pn fixed xij such that xij we have that xi where is sum of independent random variables with expectation xij nk the random variables attain values in by the chernoff bound pr nk nk this gives pr by the same arguments as before if the algorithm asks less than queries then it will not detect point such that with probability more than then the query answers will all be and the value of the returned solution will be at most meanwhile the optimum solution is for all which gives acknowledgements ys was supported by nsf grant career and google faculty research award references alekh agarwal ofer dekel and lin xiao optimal algorithms for online convex optimization with bandit feedback in colt the conference on learning theory haifa israel june pages alekh agarwal dean foster daniel hsu sham kakade and alexander rakhlin stochastic convex optimization with bandit feedback siam journal on optimization alexandre belloni tengyuan liang hariharan narayanan and alexander rakhlin escaping the local minima via simulated annealing optimization of approximately convex functions colt bubeck and regret analysis of stochastic and nonstochastic bandit problems foundations and trends in machine learning devdatt dubhashi volker priebe and desh ranjan negative dependence through the fkg inequality in research report institut informatik john duchi michael jordan martin wainwright and andre wibisono optimal rates for zeroorder convex optimization the power of two function evaluations ieee transactions on information theory uriel feige vahab mirrokni and jan maximizing submodular functions siam abraham flaxman adam tauman kalai and brendan mcmahan online convex optimization in the bandit setting gradient descent without gradient in proceedings of the sixteenth annual symposium on discrete algorithms soda vancouver british columbia canada january pages kevin jamieson robert nowak and benjamin recht query complexity of optimization in advances in neural information processing systems annual conference on neural information processing systems proceedings of meeting held december lake tahoe nevada 
United States, pages. Nemirovsky and Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, New York. Yurii Nesterov. Random minimization of convex functions. CORE Discussion Papers, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE). Aaditya Ramdas, Aarti Singh, and Larry Wasserman. An analysis of active learning with uniform feature noise. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS), Reykjavik, Iceland, April, pages. Aaditya Ramdas and Aarti Singh. Optimal rates for stochastic convex optimization under Tsybakov noise condition. In Proceedings of the International Conference on Machine Learning (ICML), Atlanta, GA, USA, June, pages. Ohad Shamir. On the complexity of bandit and stochastic convex optimization. In COLT, the Annual Conference on Learning Theory, June, Princeton University, NJ, USA, pages. Sebastian Stich, Christian, and Bernd. Optimization of convex functions with random pursuit. CoRR. Jan. Symmetry and approximability of submodular maximization problems. SIAM.
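As a companion to the lower-bound argument above, the following toy simulation illustrates the hidden-partition construction: a linear (hence convex) function whose value depends on a random split of the coordinates, together with an erroneous oracle that returns 0 whenever the query is nearly balanced across the split. The constants (n, eps, the balance threshold) and the function names are illustrative choices of this sketch, not the paper's exact parameters.

```python
import numpy as np

# Toy simulation of the hidden-partition construction (illustrative constants).
rng = np.random.default_rng(0)
n, eps = 1000, 0.1

# Hidden random partition of the coordinates into two equal halves A and B.
perm = rng.permutation(n)
A, B = perm[: n // 2], perm[n // 2:]

def f(x):
    # True objective: linear, hence convex, and determined by the hidden split.
    return (eps / n) * (x[A].sum() - x[B].sum())

def f_err(x, thresh=0.05):
    # Erroneous oracle: returns 0 whenever the query is nearly balanced across
    # the split, so |f_err - f| <= eps * thresh everywhere it deviates from f.
    if abs(x[A].sum() - x[B].sum()) <= thresh * n:
        return 0.0
    return f(x)

# Queries chosen without knowledge of (A, B) are almost surely balanced,
# so the oracle answers 0 and reveals nothing about the partition ...
queries = [rng.random(n) for _ in range(1000)]
print({f_err(x) for x in queries})   # expected: {0.0}

# ... yet the true minimum (x = 0 on A, x = 1 on B) sits eps/2 lower.
x_opt = np.zeros(n)
x_opt[B] = 1.0
print(f(x_opt))                      # -eps/2 = -0.05
```

Because a partition-agnostic query is nearly balanced with overwhelming probability (the Chernoff argument in the proof above), every such query receives the answer 0 while the true optimum is eps/2 below it, which is why subexponentially many queries cannot separate such instances.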
poisson mrf adding dependencies to the multinomial david inouye pradeep ravikumar inderjit dhillon department of computer science university of texas at austin dinouye pradeepr inderjit abstract we propose novel distribution that generalizes the multinomial distribution to enable dependencies between dimensions our novel distribution is based on the parametric form of the poisson mrf model but is fundamentally different because of the domain restriction to vector like in multinomial where the number of trials is fixed or known thus we propose the poisson mrf lpmrf distribution we develop ais sampling methods to estimate the likelihood and log partition function the log normalizing constant which was not developed for the poisson mrf model in addition we propose novel mixture and topic models that use lpmrf as base distribution and discuss the similarities and differences with previous topic models such as the recently proposed admixture of poisson mrfs we show the effectiveness of our lpmrf distribution over multinomial models by evaluating the test set perplexity on dataset of abstracts and wikipedia qualitatively we show that the positive dependencies discovered by lpmrf are interesting and intuitive finally we show that our algorithms are fast and have good scaling code available online introduction related work the multinomial distribution seems to be natural distribution for modeling data such as text documents indeed most topic models such as plsa lda and numerous for survey of probabilistic topic the multinomial as the fundamental base distribution while adding complexity using other latent variables this is most likely due to the extreme simplicity of multinomial parameter frequency is usually smoothed by the simple dirichlet conjugate prior in addition because the multinomial requires the length of document to be fixed or usually poisson distribution on document length is assumed this yields by results is merely an independent poisson however the multinomial assumes independence between the words because the multinomial is merely the sum of independent categorical variables this restriction does not seem to fit with text for example words like neural and network will tend to quite frequently together in nips papers thus we seek to relax the word independence assumption of the multinomial the poisson mrf distribution pmrf seems to be potential replacement for the poissonmultinomial because it allows some dependencies between words the poisson mrf is developed by assuming that every conditional distribution is poisson however the original formulation in only allowed for negative dependencies thus several modifications were proposed in to allow for positive dependencies one proposal the truncated poisson mrf tpmrf simply truncated the pmrf by setting max count for every word while this formulation may provide the assumption of poisson document length is not important for most topic models interesting parameter estimates tpmrf with positive dependencies may be almost entirely concentrated at the corners of the joint distribution because of the quadratic term in the log probability see the bottom left of fig in addition the log partition function of the tpmrf is intractable to estimate even for small number of dimensions because the sum is over an exponential number of terms thus we seek different distribution than tpmrf that allows positive dependencies but is more appropriately normalized we observe that the multinomial is proportional to an independent poisson model with the domain 
restricted to fixed length thus in similar way we propose poisson mrf lpmrf that is proportional to pmrf but is restricted to domain with fixed vector where this distribution is quite different from previous pmrf variants because the normalization is very different as will be described in later sections for motivating example in fig we show the marginal distributions of the empirical distribution and fitted models using only three words from the dataset that contains documents regarding library sciences and aerospace engineering see sec clearly text has positive dependencies as evidenced by the empirical marginals of boundary and layer referring to the boundary layer in fluid dynamics and lpmrf does the best at fitting this empirical distribution in addition the log partition hence the lpmrf can be approximated using samplingempirical as described in later sections under the or tpmrf models distribution both the log partition function and likelihood were computationally intractable to compute thus approximating the log partition function of an lpmrf opens up the door for hyperparameter estimation and model evaluation that was not possible with pmrf poisson ind poissons multinomial boundarymarginal distributions empirical library library layer library layer layer boundary library layer boundary pmrf poisson truncated poisson mrf boundary boundary library library layer library layer layer boundary library layer boundary figure marginal distributions from dataset top left empirical distribution top right estimated multinomial poisson joint independent poissons bottom left truncated poisson mrf bottom right pmrf poisson joint distribution the simple empirical distribution clearly shows strong dependency between boundary and layer but strong negative dependency of boundary with library clearly the distribution underfits the data while the truncated pmrf can model dependencies it obviously has normalization problems because the normalization is dominated by the edge case the distribution much more appropriately fits the empirical data in the topic modeling literature many researchers have realized the issue with using the multinomial distribution as the base distribution for example the interpretability of multinomial can be difficult since it only gives an ordering of words thus multiple metrics have been proposed to evaluate topic models based on the perceived dependencies between words within topic in particular showed that the multinomial assumption was often violated in real world data in another paper the lda topic assignments for each word are used to train separate ising bernoulli each topic in heuristic procedure instead of modeling dependencies posteriori we formulate generalization of topic models that allows the lpmrf distribution to directly replace the multinomial this allows us to compute topic model and word dependencies jointly under unified model as opposed to the heuristic procedure in the example in fig was computed by exhaustively computing the log partition function this model has some connection to the admixture of poisson mrfs model apm which was the first topic model to consider word dependencies however the lpmrf topic model directly relaxes the lda assumption the independent case is the same as lda whereas apm is only an indirect relaxation of lda because apm mixes in the exponential family canonical parameter space while lda mixes in the standard multinomial parameter space another difference with apm is that our proposed lpmrf topic model can actually produce topic 
assignments for each word similar to lda with gibbs sampling finally the lpmrf topic model does not fall into the same generalization of topic models as apm because the distribution is not an described more fully in later sections the follow up apm paper gives fast algorithm for estimating the pmrf parameters we use this algorithm as the basis for estimating the topic lpmrf parameters for estimating the topic vectors for each document we give simple coordinate descent algorithm for estimation of the lpmrf topic model this estimation of topic vectors can be seen as direct relaxation of lda and could even provide different estimation algorithm for lda poisson mrf notation let and denote the number of words documents and topics respectively we will generally use uppercase letters for matrices boldface lowercase letters or indices of matrices for column vectors xi φs and lowercase letters for scalar values xi θs poisson mrf definition first we will briefly describe the poisson mrf distribution and refer the reader to for more details pmrf can be parameterized by node vector and an edge matrix whosep encode the direct dependencies between words prpmrf exp log xs where is the log partition function needed for normalization note that without loss of generality we can assume is symmetric because it only shows up in the symmetric quadratic term the conditional distribution of one word given all the pr xs poisson distribution with natural parameter ηs θs φs by construction one primary issue with the pmrf is that the log partition function is over all vectors in and thus with even one positive dependency the log partition function is infinite because of the quadratic term in the formulation yang et al tried to address this issue but as illustrated in the introduction their proposed modifications to the pmrf can yield unusual models for data lpmrf with different parameters lpmrf with different parameters negative dependency independent binomial positive dependency negative dependency independent binomial positive dependency probability probability domain of binomial at domain of binomial at figure lpmrf distribution for left and right with negative zero and positive dependencies the distribution of lpmrf can be quite different than multinomial zero dependency and thus provides much more flexible parametric distribution for count data lpmrf definition the poisson mrf lpmrf distribution is simple yet fundamentally different distribution than the pmrf letting be the length of document we define the lpmrf distribution as follows pr exp xt φx log xs al lpmrf al log exp xt φx log xs xl the only difference from the pmrf parametric form is the log partition function al which is conditioned on the set xl unlike the unbounded set for pmrf this domain restriction is critical to formulating tractable and reasonable distribution combined with poisson distribution on vector length the lpmrf distribution can be much more suitable distribution for documents than multinomial the lpmrf distribution reduces to the standard multinomial if there are no dependencies however if there are dependencies then the distribution can be quite different than multinomial as illustrated in fig for an lpmrf with and fixed at either or words after the original submission we realized that for the lpmrf model is the same as the multiplicative binomial generalization in thus the lpmrf model can be seen as multinomial generalization of the multiplicative binomial in lpmrf parameter estimation because the parametric form of the lpmrf model 
is the same as the form of the pmrf model and we primarily care about finding the correct dependencies we decide to use the pmrf estimation algorithm described in to estimate and the algorithm in uses an approximation to the likelihood by using the and performing regularized nodewise poisson regressions the regularization is important both for the sparsity of the dependencies and the computational efficiency of the algorithm while the pmrf and lpmrf are different distributions the approximation for estimation provides good results as shown in the results section we present timing results to show the scalability of this algorithm in sec other parameter estimation methods would be an interesting area of future work likelihood and log partition estimation unlike previous work on the pmrf or tpmrf distributions we develop tractable approximation to the lpmrf log partition function eq so that we can compute approximate likelihood values the likelihood of model can be fundamentally important for hyperparameter optimization and model evaluation lpmrf annealed importance sampling first we develop an lpmrf gibbs sampler by considering the most common form of multinomial sampling namely by taking the sum of sequence of categorical variables from this intuition we sample one word at time while holding all other words fixed the probability of one word in the sequence given all the other words is proportional to exp θs where is the sum of all other words see the appendix for the details of gibbs sampling then we derive an annealed importance sampler using the gibbs sampling by scaling the matrix for each successive distribution by the linear sequence starting with and ending with thus we start with simple multinomial sample from pr prmult and then gibbs sample from each successive distribution prlpmrf γφ updating the sample weight as defined in until we reach the final distribution when from these weighted samples we can compute an estimate of the log partition function upper bound using inequality simple convex relaxation and the partition function of multinomial an upper bound for the log partition function can be computed al llog exp θs log where is the maximum eigenvalue ofpφ see the appendix for the full derivation we simplify this upper bound by subtracting log exp θs from which does not change the distribution so that the second term becomes then neglecting the constant term that does not interact with the parameters the log partition function is upper bounded by simple quadratic function weighting for different for datasets in which is observed for every sample but is not as document collections the log partition function will grow quadratically in if there are any positive dependencies as suggested by the upper bound this causes long documents to have extremely small likelihood thus we must modify as gets larger to conteract this effect we propose simple modification that scales the for each in particular we propose to use the sigmoidal function using the log logistic cumulative distribution function cdf loglogisticcdf αll βll we set the βll parameter to so that thep tail is which will eventually cause the upper bound to approach constant letting li be the mean instance length we choose αll for some small constant this choice of αll helps the weighting function to appropriately scale for corpuses of different average lengths final approximation method for all for our experiments we approximate the log partition function value for all in the range of the corpus we use ais samples for different 
test values of linearly spaced between the and so that we cover both small and large values of this gives total of annealed importance samples we use the quadratic form of the upper bound ua ignoring constants with respect to and find constant that upper bounds all estimates amax maxl exp θs where is an ais estimate of the log partition function for the test values of this gives smooth approximation for all that are greater than or equal to all individual estimates figure of example approximation in appendix mixtures of lpmrf with an approximation to the likelihood we can easily formulate an estimation algorithm for mixture of lpmrfs using simple alternating procedure first given the cluster assignments the lpmrf parameters can be estimated as explained above then the best cluster assignments can be computed by assigning each instance to the highest likelihood cluster extending the lpmrf to topic models requires more careful analysis as described next generalizing topic models using distributions in standard topic models like lda the distribution contains unique topic variable for every word in the corpus essentially this means that every word is actually drawn from categorical distribution however this does not allow us to capture dependencies between words because there is only one word being drawn at time therefore we need to reformulate lda in way that the words from topic are sampled jointly from multinomial from this reformulation we can then simply replace the multinomial with an lpmrf to obtain topic model with lpmrf as the base distribution our reformulation of lda groups the topic indicator variables for each word into vectors corresponding to the different topics these topic indicator vectors are then assumed to be drawn from multinomial with fixed length kz this grouping of topic vectors yields an equivalent distribution because the topic indicators are exchangeable and independent of one another given the observed word and the distribution this leads to the following generalization of topic models in which an observation xi is the summation of hidden variables zij generic topic model wi simplexprior novel lpmrf topic model wi dirichlet li lengthdistribution mi partitiondistribution wi li li poisson mi multinomial wi li zij fixedlengthdist φj kzij mji pk xi zij zij lpmrf φj mji pk xi zij note that this generalization of topic models does not require the partition distribution and the fixedlength distribution to be the same in addition other distributions could be substituted for the dirichlet prior distribution on distributions like the logistic normal prior finally this generalization allows for topic models for other types of data although exploration of this is outside the scope of this paper this generalization is distinctive from the topic model generalization termed admixtures in admixtures assume that each observation is drawn from an base distribution whose parameters are convex combination of previous parameters thus an admixture of lpmrfs could be formulated by assuming given the weights wi is drawn thatj each document from lpmrf kxi though this may be an interesting ij ij model in its own right and useful for further exploration in future work this is not the same as the above proposed model because the distribution of xi is not an lpmrf but rather sum of independent lpmrfs one the only these two generalizations of topic models intersect is when the distribution is multinomial lpmrf with as another distinction from apm the lpmrf topic model directly generalizes lda 
because the lpmrf in the above model reduces to multinomial if fully exploring the differences between this topic model generalization and the admixture generalization are quite interesting but outside the scope of this paper with this formulation of lpmrf topic models we can create joint optimization problem to solve for the topic matrix zi zik for each document and to solve for the shared lpmrf parameters the optimization is based on minimizing the negative log posterior arg min xx pr zij φj mji log pr log pr φj prior prior lpmrf zi xi zi where is the all ones vector notice that the observations xi only show up in the constraints the prior distribution on can be related to the dirichlet distribution as in lda by taking pr prprior dir also notice that the documents are all independent if the lpmrf parameters are known so this optimization can be trivially parallelized connection to collapsed gibbs sampling this optimization is very similar to the collapsed gibbs sampling for lda essentially the key part to estimating the topic models is estimating the topic indicators for each word in the corpus the model parameters can then be estimated directly from these topic indicators in the case of lda the multinomial parameters are trivial to estimate by merely keeping track of counts and thus the parameters can be updated in constant time for every topic resampled this also suggests that an interesting area of future work would be to understand the connections between collapsed gibbs sampling and this optimization problem it may be possible to use this optimization problem to speed up gibbs sampling convergence or provide map phase after gibbs sampling to get estimates estimating topic matrices for lpmrf topic models the estimation of the lpmrf parameters given the topic assignments requires solving another complex optimization problem thus we pursue an alternating scheme as in lpmrf mixtures first we estimate lpmrf parameters with the pmrf algorithm from and then we optimize the topic matrix zi for each document because of the constraints on zi we pursue simple dual coordinate descent procedure we select two coordinates in row of zi and determine if the optimization problem can be improved by moving words from topic to topic thus we only need to solve series of simple univariate problems each univariate problem only has xis number of possible solutions and thus if the max count of words in document is bounded by constant the univariate subproblems can be solved bi zi aer et aer et gives efficiently more formally we are seeking step size such that better optimization value than zi if we remove constant terms we arrive at the following univariate optimization problem suppressing dependence on because each of the subproblems are independent arg min θr θrq φqr log zr log zrq am φq log pr prior where is the new distribution of length based on the step size the first term is the linear and quadratic term from the sufficient statistics the second term is the change in base measure of word is moved the third term is the difference in log partition function if the length of the topic vectors changes note that the log partition function can be precomputed so it merely costs table lookup the prior also only requires simple calculation to update thus the main computation comes in the inner product however this inner product can be maintained very efficiently and updated efficiently so that it does not significantly affect the running time perplexity experiments we evaluated our novel lpmrf model using 
perplexity on test set of documents from corpus composed of research paper denoted and collection of wikipedia documents the dataset has three distinct topic areas medical medline library information sciences cisi and aerospace engineering cran http experimental setup we train all the models using training split of the documents and compute the perplexity on the remaining where perplexity is equal to exp test where is the log likelihood and ntest is the total number of words in the test set we evaluate single mixture and topic models with both the multinomial as the base distribution and lpmrf as the base distribution at the topic indicator matrices zi for the test set are estimated by fitting estimate while holding the topic parameters for single multinomial or lpmrf we set the smoothing parameter to we select the lpmrf models using all combinations of log spaced between and and linearly spaced weighting function constants between and for the weighting function described in sec in order to compare our algorithms with lda we also provide perplexity results using an lda gibbs sampler for matlab to estimate the model parameters for lda we used iterations and optimized the hyperparameters and using the likelihood of tuning set we do not seek to compare with many topic models because many of them use the multinomial as base distribution which could be replaced by lpmrf but rather we simply focus on simple representative wikipedia perplexity perplexity test set perplexity for results the perplexity results for all models can be seen in fig clearly single lpmrf significantly outperforms single multinomial on the test dataset both for the and wikipedia datasets the lpmrf model outperforms the simple multinomial mixtures and topic models in all cases this suggests that the lpmrf model could be an interesting replacement for the multinomial in more complex models for small number of topics lpmrf topic models also outperforms gibbs sampling lda but does not perform as well for larger number of topics this is likely due to the sampling methods for learning lda exploring the possibility of incorporating sampling into the fitting of the lpmrf topic model is an excellent area of future work we believe lpmrf shows significant promise for replacing the multinomial in various probabilistic models mult lpmrf mult gibbs lda mult lpmrf mixture lpmrf topic model figure left the lpmrf models quite significantly outperforms the multinomial for both datasets right the lpmrf model outperforms the simple multinomial model in all cases for small number of topics lpmrf topic models also outperforms gibbs sampling lda but does not perform as well for larger number of topics qualitative analysis of lpmrf parameters in addition to perplexity analysis we present the top words top positive dependencies and the top negative dependencies for the lpmrf topic model in table notice that in lda only the top words are available for analysis but an lpmrf topic model can produce intuitive dependencies for example the positive dependency is composed of two words that often in the library sciences but each word independently does not occur very often in comparison to information and library the positive dependency suggests that some of the documents in the medline dataset likely refer inducing stress on subject and measuring the reaction or in the aerospace topic the positive dependency suggests that equations are important in aerospace notice that these concepts could not be discovered with standard topic model for topic models the 
likelihood computation is intractable if averaging over all possible zi thus we use map simplification primarily for computational reasons to compare models without computationally expensive likelihood estimation for the lpmrf this merely means adding to of the nodewise poisson regressions http we could not compare to apm because it is not computationally tractable to calculate the likelihood of test instance in apm and thus we can not compute perplexity table top words and dependencies for lpmrf topic model topic topic topic top words top pos edges top neg edges top words top pos edges top neg edges top words top pos edges information patients flow library cases pressure normal boundary cells results research system libraries book systems data treatment top neg edges theory children method found layer results given use blood number scientific disease presented timing and scalability finally we explore the practical performance of our algorithms in we implemented the three core algorithms fitting poisson regressions fitting the topic matrices for each document and sampling ais samples the timing for each of these components respectively can be seen in fig for the wikipedia dataset we set in the first two experiments which yields roughly and varied for the third experiment each of the components is trivially parallelized using openmp http all timing experiments were conducted on the tacc maverick system with intel xeon ivy bridge cpus ghz cpus per node and gb memory per cpu https the scaling is generally linear in the parameters except for fitting topic matrices which is for the ais sampling the scaling is linear in the number of in irrespective of overall we believe our implementations provide both good scaling and practical performance code available online timing for poisson regressions timing for fitting topic matrices number of unique words log scale timing for ais sampling time time time in seconds log scale number of nonzeros in edge matrix figure left the timing for fitting poisson regressions shows an empirical scaling of np middle the timing for fitting topic matrices empirically shows scaling that is npk right the timing for ais sampling shows that the sampling is approximately linearly scaled with the number of in irrespective of conclusion we motivated the need for more flexible distribution than the multinomial such as the poisson mrf however the pmrf distribution has several complications due to its normalization that hinder it from being model for count data we overcome these difficulties by restricting the domain to fixed length as in multinomial while retaining the parametric form of the poisson mrf by parameterizing by the length of the document we can then efficiently compute estimates of the log partition function and hence the were not tractable to compute under the pmrf model we extend the lpmrf distribution to both mixtures and topic models by generalizing topic models using distributions and develop parameter estimation methods using dual coordinate descent we evaluate the perplexity of the proposed lpmrf models on datasets and show that they offer good performance when compared to models finally we show that our algorithms are fast and have good scaling potential new areas could be explored such as the relation between the topic matrix optimization method and gibbs sampling it may be possible to develop methods for the lpmrf topic model similar to gibbs sampling for lda in general we suggest that the lpmrf model could open up new avenues of research where the 
multinomial distribution is currently used acknowledgments this work was supported by nsf and aro references yang ravikumar allen and liu graphical models via generalized linear models in nips pp inouye ravikumar and dhillon admixture of poisson mrfs topic model with word dependencies in international conference on machine learning icml pp hofmann probabilistic latent semantic analysis in uncertainty in artificial intelligence uai pp morgan kaufmann publishers blei ng and jordan latent dirichlet allocation jmlr vol pp blei probabilistic topic models communications of the acm vol pp yang ravikumar allen and on poisson graphical models in nips pp chang gerrish wang and blei reading tea leaves how humans interpret topic models in nips mimno wallach talley leenders and mccallum optimizing semantic coherence in topic models in emnlp pp newman noh talley karimi and baldwin evaluating topic models for digital libraries in joint conference on digital libraries jcdl pp aletras and court evaluating topic coherence using distributional semantics in international conference on computational semantics iwcs long papers pp mimno and blei bayesian checking for topic models in emnlp pp nallapati ahmed cohen and xing sparse word graphs scalable algorithm for capturing word correlations in topic models in icdm pp steyvers and griffiths probabilistic topic models in latent semantic analysis road to meaning pp inouye ravikumar and dhillon capturing semantically meaningful word dependencies with an admixture of poisson mrfs in nips pp altham two generalizations of the binomial distribution journal of the royal statistical society series applied statistics vol no pp neal annealed importance sampling statistics and computing vol no pp 
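To make the LPMRF definition above concrete, the following sketch brute-forces the distribution over a three-word vocabulary with a fixed document length, in the spirit of the exhaustive log-partition computation used for the three-word figure. The parameter values (theta, Phi, L) are made up for illustration and are not the fitted values from the paper; exhaustive enumeration is only feasible for tiny vocabularies and lengths.

```python
import numpy as np
from itertools import product
from math import lgamma

# Minimal brute-force LPMRF over a 3-word vocabulary; parameters are illustrative.
words = ["boundary", "layer", "library"]
theta = np.array([0.0, 0.0, 0.5])     # node parameters
Phi = np.zeros((3, 3))
Phi[0, 1] = Phi[1, 0] = 0.8           # positive boundary-layer edge
Phi[0, 2] = Phi[2, 0] = -0.8          # negative boundary-library edge
L = 5                                 # fixed document length

def unnorm_logp(x):
    # Unnormalized log-probability: theta'x + x'Phi x - sum_s log(x_s!)
    x = np.asarray(x, dtype=float)
    return theta @ x + x @ Phi @ x - sum(lgamma(xi + 1) for xi in x)

# Enumerate the restricted domain {x in Z_+^3 : sum(x) = L} and compute the
# exact log-partition function A_L by brute force.
domain = [x for x in product(range(L + 1), repeat=3) if sum(x) == L]
logps = np.array([unnorm_logp(x) for x in domain])
A_L = np.logaddexp.reduce(logps)
probs = np.exp(logps - A_L)

# Most probable count vectors under this LPMRF.
for x, pr in sorted(zip(domain, probs), key=lambda t: -t[1])[:5]:
    print(dict(zip(words, x)), round(float(pr), 3))
```

Setting Phi to zero recovers a multinomial over the same domain, so the printout makes visible how a single positive edge (here between "boundary" and "layer") shifts probability mass toward co-occurring counts.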
bayesian learning via label embeddings piyush changwei ricardo lawrence cse dept iit kanpur ece dept duke university piyush lcarin abstract we present scalable bayesian learning model based on learning lowdimensional label embeddings our model assumes that each label vector is generated as weighted combination of set of topics each topic being distribution over labels where the combination weights the embeddings for each label vector are conditioned on the observed feature vector this construction coupled with link function for each label of the binary label vector leads to model with computational cost that scales in the number of positive labels in the label matrix this makes the model particularly appealing for learning problems where the label matrix is usually very massive but highly sparse using strategy leads to full local conjugacy in our model facilitating simple and very efficient gibbs sampling as well as an expectation maximization algorithm for inference also predicting the label vector at test time does not require doing an inference for the label embeddings and can be done in closed form we report results on several benchmark data sets comparing our model with various art methods introduction learning refers to the problem setting in which the goal is to assign to an object video image or webpage subset of labels tags from possibly very large set of labels the label assignments of each example can be represented using binary label vector indicating the of each label despite significant amount of prior work learning continues to be an active area of research with recent surge of interest in designing scalable learning methods to address the challenges posed by problems such as annotation computational advertising medical coding where not only the number of examples and data dimensionality are large but the number of labels can also be massive several thousands to even millions often in learning problems many of the labels tend to be correlated with each other to leverage the label correlations and also handle the possibly massive number of labels common approach is to reduce the dimensionality of the label space by projecting the label vectors to subspace learning prediction model in that space and then projecting back to the original space however as the label space dimensionality increases the sparsity in the label matrix becomes more pronounced very few ones if the label matrix is only partially observed such methods tend to suffer and can also become computationally prohibitive to address these issues we present scalable fully bayesian framework for learning our framework is similar in spirit to the label embedding methods based on reducing the label space dimensionality however our framework offers the following key advantages computational cost of training our model scales in the number of ones in the label matrix which makes our framework easily scale in cases where the label matrix is massive but sparse our likelihood model for the binary labels based on link more realistically models the extreme sparsity of the label matrix as compared to the commonly employed link and our model is more interpretable embeddings naturally correspond to topics where each topic is distribution over labels moreover at test time unlike other bayesian methods we do not need to infer the label embeddings of the test example thereby leading to faster predictions in addition to the modeling flexibility that leads to robust interpretrable and scalable model our framework enjoys full local 
conjugacy which allows us to develop simple gibbs sampling as well as an expectation maximization em algorithm for the proposed model both of which are simple to implement in practice and amenable for parallelization the model we assume that the training data are given in the form of examples represented by feature matrix along with their labels in possibly incomplete label matrix the goal is to learn model that can predict the label vector for test example rd we model the binary label vector yn of the nth example by thresholding vector mn yn mn which for each individual binary label yln yn can also be written as yln mln in eq mn mln zl denotes latent count vector of size and is assumed drawn from poisson mn poisson λn eq denotes drawing each component of mn independently from poisson distribution with rate equal to the corresponding component of λn rl which is defined as λn vun here and un rk typically note that the columns of can be thought of as atoms of label dictionary or topics over labels and un can be thought of as the atom weights or embedding of the label vector yn or topic proportions how active each of the topics is for example also note that eq can be combined as yn λn vun where jointly denotes drawing the latent counts mn from poisson eq with rate λn vun followed by thresholding mn at eq in particular note that marginalizing out mn from eq leads to yn bernoulli exp this link function termed as the link has also been used recently in modeling relational data with binary observations in eq expressing the label vector yn in terms of vun is equivalent to assumption on the label matrix yn vu where vk and un which are modeled as follows vk ukn pkn wk dirichlet gamma rk pkn pkn wk xn nor exp diag τd and hyperparameters rk τd are given improper gamma priors since columns of are dirichlet drawn they correspond to distributions topics over the labels it is important to note here that the dependence of the label embedding un ukn on the feature vector xn is achieved by making the scale parameter of the gamma prior on ukn depend on pkn which in turn depends on the features xn via regression weight wk eq and figure graphical model for the generative process of the label vector hyperpriors omitted for brevity computational scalability in the number of positive labels for the likelihood model for binary labels we can write the conditional posterior of the latent count vector mn as mn un yn vun where denotes the poisson distribution with support only on the positive integers and denotes the product eq suggests that the zeros in yn will result in the corresponding elements of the latent count vector mn being zero almost surely with probability one as shown in section the sufficient statistics of the model parameters do not depend on latent counts that are equal to zero such latent counts can be simply ignored during the inference this aspect leads to substantial computational savings in our model making it scale only in the number of positive labels in the label matrix in the rest of the exposition we will refer to our model as bmlpl to denote bayesian learning via positive labels asymmetric link function in addition to the computational advantage scaling in the number of in the label matrix another appealing aspect of our learning framework is that the likelihood is also more realistic model for highly sparse binary data as compared to the commonly used likelihood to see this note that the model defines the probability of an observation being one as exp where is the positive rate parameter 
for positive on the axis the rate of growth of the plot of on the axis from to is much slower than the rate it drops from to this benavior of the bernoullipoisson link will encourage much fewer number of nonzeros in the observed data as compared to the number of zeros on the other hand logistic and probit approach both and at the same rate and therefore can not model the of the label matrix like the link therefore in contrast to multilabel learning models based on likelihood function or standard loss functions such as the for the binary labels our proposed model provides better robustness against label imbalance inference key aspect of our framework is that the conditional posteriors of all the model parameters are available in closed form using data augmentation strategies that we will describe below in particular since we model binary label matrix as thresholded counts we are also able to leverage some of the inference methods proposed for bayesian matrix factorization of data to derive an efficient gibbs sampler for our model inference in our model requires estimating and the hyperparameters of the model as we will see below the latent count vectors mn which are functions of and provide sufficient statistics for the model parameters each element of mn if the corresponding element in yn is one is drawn from truncated poisson distribution mln vl un λln pk pk th vl denotes the row of and λln λkln vlk ukn thus we can also write pk mln mlkn where mlkn λkln vlk ukn on the other hand if yln then mln with probability one eq and therefore need not be sampled because it does not affect the sufficient statistics of the model parameters using the equivalence of poisson and multinomial distribution we can express the decomposipk tion mln mlkn as draw from multinomial mlkn mult mln ζlkn where ζlkn pkvlk uvknu this allows us to exploit the conjugacy and lk kn helps designing efficient gibbs sampling and em algorithms for doing inference in our model as discussed before the computational cost of both algorithms scales in the number of ones in the label matrix which males our model especially appealing for dealing with multilabel learning problems where the label matrix is massive but highly sparse gibbs sampling gibbs sampling for our model proceeds as follows sampling using eq and the conjugacy each column of can be sampled as vk dirichlet mlk where mlk mlnk sampling using the conjugacy each entry of can be sampled as ukn gamma rk mkn pkn where mkn wk xn mlnk and pkn sampling since mkn mlnk and mlnk vlk ukn mkn is also poisson further since ukn pkn is gamma we can integrate out ukn from mkn which gives mkn negbin rk pkn where negbin denotes the negative binomial distribution although the negative binomial is not conjugate to the gaussian prior on wk we leverage the strategy data augmentation to gaussianify the negative binomial likelihood doing this we are able to derive closed form gibbs sampling updates wk the pg strategy is based on sampling set of auxiliary variables one for each observation which in the context of sampling wk are the latent counts mkn for sampling wk we draw random variables ωkn one for each training example as ωkn pg mkn rk wk xn where pg denotes the distribution given these pg variables the posterior distribution of wk is gaussian nor µwk σwk where σw µwk xωk σwk xκk where ωk diag ωkn and κk rk mkn rk sampling the hyperparameters the hyperparameter rk is given gamma prior and can be sampled easily the other hyperparameters τd are estimated using maximum likelihood estimation 
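The following sketch illustrates the augmentation step that gives the sampler its cost proportional to the number of positive labels: latent counts are drawn from a zero-truncated Poisson only for entries with y_ln = 1 and are then thinned across topics with multinomial probabilities proportional to v_lk u_kn, accumulating the sufficient statistics used in the conditional updates for V and U above. Variable names, array shapes, and the particular truncated-Poisson sampler are illustrative choices of this sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_truncated_poisson(lam):
    # Zero-truncated Poisson via inverse-CDF on a finite grid
    # (adequate for the moderate rates arising here).
    ks = np.arange(1, 60)
    logp = ks * np.log(lam) - lam - np.cumsum(np.log(ks))
    p = np.exp(logp)
    p /= p.sum()
    return rng.choice(ks, p=p)

def augment_counts(pos_entries, V, U):
    """pos_entries: iterable of (l, n) with y_ln = 1.
    Returns per-(label, topic) and per-(topic, document) count statistics."""
    L, K = V.shape
    _, N = U.shape
    M_label_topic = np.zeros((L, K))   # sums of m_lnk over documents n
    M_topic_doc = np.zeros((K, N))     # sums of m_lnk over labels l
    for l, n in pos_entries:           # cost scales with nnz(Y) only
        rates = V[l] * U[:, n]         # lambda_lkn for k = 1..K
        m_ln = sample_truncated_poisson(rates.sum())
        m_lkn = rng.multinomial(m_ln, rates / rates.sum())
        M_label_topic[l] += m_lkn
        M_topic_doc[:, n] += m_lkn
    return M_label_topic, M_topic_doc

# Tiny example with random parameters.
L, K, N = 6, 3, 4
V = rng.dirichlet(np.ones(L), size=K).T   # columns of V are topics over labels
U = rng.gamma(1.0, 1.0, size=(K, N))      # label embeddings per document
positives = [(0, 0), (2, 0), (1, 1), (4, 3)]
Ml, Md = augment_counts(positives, V, U)
print(Ml.sum(), Md.sum())                 # equal: total latent count mass
```

A full sweep would follow these counts with the Dirichlet update for the columns of V, the gamma update for U, and the Pólya-Gamma draws for the regression weights described above, none of which require visiting the zero entries of the label matrix.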
expectation maximization the gibbs sampler described in section is efficient and has computational complexity that scales in the number of ones in the label matrix to further scale up the inference we also develop an efficient em inference algorithm for our model in the we need to compute the expectations of the local variables the latent counts and the variables ωkn for these expectations are available in closed form and can thus easily be computed in particular the expectation of each variable ωkn is very efficient to compute and is available in closed form ωkn mkn rk tanh wk xn xn the involves maximization and which essentially involves solving for their map estimates which are available in closed form in particular as shown in estimating wk requires solving linear system which in our case is of the form sk wk dk where sk xωk dk xκk ωk and κk are defined as in section except that the random variables are replaced by their expectations given by eq note that eq can be straighforwardly solved as wk dk however convergence of the em algorithm does not require solving for wk exactly in each em iteration and running couple of iterations of any of the various iterative methods that solves linear system of equations can be used for this step we use the conjugate gradient method to solve this which also allows us to exploit the sparsity in and ωk to very efficiently solve this system of equations even when and are very large although in this paper we only use the batch em it is possible to speed it up even further using an online version of this em algorithm as shown in the online em processes data in small minibatches and in each em iteration updates the sufficient statistics of the global parameters in our case these sufficient statistics include sk and dk for and can be updated as γt sk γt ωk γt dk γt κk sk dk where denotes the set of examples in the current minibatch and ωk and κk denote quantities that are computed using the data from the current minibatch predicting labels for test examples predicting the label vector for new test example rd can be done as exp if using gibbs sampling the integral above can be approximated using samples from the posterior of it is also possible to integrate out details skipped for brevity and get closed form estimates of probability of each label in terms of the model parameters and and it is given by rk exp lk computational cost computing the latent count mln for each nonzero entry yln in requires computing mlkn which takes time therefore computing all the latent counts takes nnz time which is very efficient if has very few nonzeros which is true of most realworld learning problems estimating and the hyperparameters is relatively cheap and can be done very efficiently the variables when doing gibbs sampling can be efficiently sampled using methods described in and when doing em these can be even more cheaply computed because the expectations which are available in closed form as hyperbolic tan function can be very efficiently computed the most dominant step is estimating when doing gibbs sampling if done it would dk time if sampling and time if sampling however if using the em algorithm estimating can be done much more efficiently using conjugate gradient updates because it is not even required to solved for exactly in each iteration of the em algorithm also note that since most of the parameters updates for different are all independent of each other our gibbs sampler and the em algorithms can be easily connection topic models with as discussed earlier our 
learning framework is similar in spirit to topic model as the label embeddings naturally correspond to topics each column vk of the matrix can be seen as representing topic in fact our model interestingly can directly be seen as topic model where we have associated with each document document features for example if each document yn in representation with vocabulary of size may also have some xn rd associated with it our model can therefore also be used to perform topic modeling of text documents with such in robust and scalable manner related work despite significant number of methods proposed in the recent years learning from data continues to remain an active area of research especially due to the recent surge of interest in learning when the output space the number of labels is massive to handle the huge dimensionality of the label space common approach is to embed the labels in space using methods such as canonical correlation analysis or other methods for jointly embedding feature and label vectors compressed sensing or by assuming that the matrix consisting of the weight vectors of all the labels is matrix another interesting line of work on label embedding methods makes use of random projections to reduce the label space dimensionality or use methods such as multitask learning each label is task our proposed framework is most similar in spirit to the aforementioned class of label embedding based methods we compare with some of these in our experiments in contrast to these methods our framework reduces the dimensionality via nonlinear mapping section our framework has accompanying inference algorithms that scale in the number of positive labels has an underlying generative model that more realistically models the imbalanced nature of the labels in the label matrix section can deal with missing labels and is easily parallelizable also the connection to topic models provide nice interpretability to the results which is usually not possible with the other methods in our model the columns of the matrix can be seen as set of topics over the labels in section we show an experiment on this moreover although in this paper we have focused on the learning problem our framework can also be applied for multiclass problems via the reduction in which case the label matrix is usually very sparse each column of the label matrix represents the labels of single binary classification problem finally although not focus of this paper some other important aspects of the learning problem have also been looked at in recent work for example fast prediction at test time is an important concern when the label space is massive to deal with this some recent work focuses on methods that only incur logarithmic cost in the number of labels at test time by inferring and leveraging tree structure over the labels experiments we evaluate the proposed learning framework on four benchmark data sets bibtex delicious compphys eurlex with their statistics summarized in table the data sets we use in our experiments have both feature and label dimensions that range from few hundreds to several thousands in addition the feature label matrices are also quite sparse data set bibtex delicious compphys eurlex ntrain training set ntest test set table statistics of the data sets used in our experiments denotes average number of positive labels per example denotes the average number of nonzero features per example we compare the proposed model bmlpl with four methods all these methods just like our method are based on the assumption 
that the label vectors live in low dimensional space cplst conditional principal label space transformation cplst is based on embedding the label vectors conditioned on the features bcs bayesian compressed sensing for learning bcs is bayesian method that uses the idea of doing compressed sensing on the labels wsabie it assumes that the feature as well as the label vectors live in low dimensional space the model is based on optimizing weighted approximate ranking loss leml low rank empirical risk minimization for learning for leml we report the best results across the three loss functions squared logistic hinge they propose table shows the results where we report the area under the roc curve auc for each method on all the data sets for each method as done in we vary the label space dimensionality from of and report the best results for bmlpl both gibbs sampling and em based inference perform comparably though em runs much faster than gibbs here we report results obtained with em inference only section provides another comparison between these two inference methods the em algorithms were run for iterations and they converged in all the cases as shown in the results in table in almost all of the cases the proposed bmlpl model performs better than the other methods except for compphys data sets where the auc is slightly worse than leml the better performance of our model justifies the flexible bayesian formulation and also shows the evidence of the robustness provided by the asymmetric link function against sparsity and label imbalance in the label matrix note that the data sets we use have very sparse label matrices bibtex delicious compphys eurlex cplst bcs wsabie leml bmlpl table comparison of the various methods in terms of auc scores on all the data sets note cplst and bcs were not feasible to run on the eurlex data so we are unable to report those numbers here results with missing labels our generative model for the label matrix can also handle missing labels the missing labels may include both zeros or ones we perform an experiment on two of the data sets bibtex and compphys where only of the labels from the label matrix are revealed note that of all these revealed labels our model uses only the positive labels and compare our model with leml and bcs both are capable of handling missing labels the results are shown in table for each method we set as the results show our model yields better results as compared to the competing methods even in the presence of missing labels bibtex compphys bcs leml bmlpl table auc scores with only labels observed qualitative analysis topic modeling on eurlex data since in our model each column of the matrix represents distribution topic over the labels to assess its ability of discovering meaningful topics we run an experiment on the eurlex data with and look at each column of the eurlex data consists of labels each of which is tags document can have subset of the tags so each column in is of that size in table we show five of the topics and top five labels in each topic based on the magnitude of the entries in the corresponding column of as shown in table our model is able to discover clear and meaningful topics from the eurlex data which shows its usefulness as topic model when each document yn has features in form of meta data xn rd associated with it topic nuclear nuclear safety nuclear power station radioactive effluent radioactive waste radioactive pollution topic agreements ec agreement trade agreement ec interim agreement trade cooperation ec coop 
agree topic environment environmental protection waste management env monitoring dangerous substance pollution control measures topic stats data community statistics statistical method agri statistics statistics data transmission table most probable words in different topics topic fishing trade fishing regulations fishing agreement fishery management fishing area conservation of fish stocks scalability number of positive labels to demonstrate the linear scalability in the number of positive labels we run an experiment on the delicious data set by varying the number of positive labels used for training the model from to to simulate this we simply treat all the other labels as zeros so as to have constant label matrix size we run each experiment for iterations using em for the inference and report the running time for each case fig left shows the results which demonstrates the roughly linear scalability the number of positive labels this experiment is only meant for small illustration note than the actual scalability will also depend on the relative values of and and the sparsity of in any case the amount of computations the involve the labels both positive and negatives only depend on the positive labels and this part for our model is clearly linear in the number of positive labels in the label matrix gibbs auc time taken fraction of positive labels time figure left scalability number of positive labels right time vs accuracy comparison for gibbs and em with exact and with cg based steps gibbs sampling vs em we finally show another experiment comparing both gibbs sampling and em for our model in terms of accuracy vs running time we run each inference method only for iterations for em we try two settings em with an exact step for and em with an approximate step where we run steps of conjugate gradient cg fig right shows plot comparing each inference method in terms of the accuracy vs running time as fig right shows the em algorithms both exact as well as the one that uses cg attain reasonably high auc scores in short amount of time which the gibbs sampling takes much longer per iteration and seems to converge rather slowly moreover remarkably em with iterations cg in each steps seems to perform comparably to the em with an exact step while running considerably faster as for the gibbs sampler although it runs slower than the em based inference it should be noted that the gibbs sampler would still be considerably faster than other fully bayesian methods for prediction such as bcs because it only requires evaluating the likelihoods over the positive labels in the label matrix moreover the step involving sampling of the matrix can be made more efficient by using cholesky decompositions which can avoid matrix inversions needed for computing the covariance of the gaussian posterior on wk discussion and conclusion we have presented scalable bayesian framework for learning in addition to providing flexible model for sparse label matrices our framework is also computationally attractive and can scale to massive data sets the model is easy to implement and easy to parallelize both full bayesian inference via simple gibbs sampling and em based inference can be carried out in this model in computationally efficient way possible future work includes developing online gibbs and online em algorithms to further enhance the scalability of the proposed framework to handle even bigger data sets another possible extension could be to additionally impose label correlations more explicitly in addition to the 
structure already imposed by the current model by replacing the dirichlet distribution on the columns of with logistic normal distributions because our framework allows efficiently computing the predictive distribution of the labels as shown in section it can be easily extend for doing active learning on the labels finally although here we only focused on learning our framework can be readily used as robust and scalable alternative to methods that perform binary matrix factorization with acknowledgements this research was supported in part by aro darpa doe nga and onr references rahul agrawal archit gupta yashoteja prabhu and manik varma learning with millions of labels recommending advertiser bid phrases for web pages in www dimitri bertsekas nonlinear programming athena scientific belmont david blei andrew ng and michael jordan latent dirichlet allocation jmlr jianfei chen jun zhu zi wang xun zheng and bo zhang scalable inference for topic models in nips chen and lin label space dimension reduction for classification in nips eva gibaja and ventura multilabel learning review of the state of the art and ongoing research wiley interdisciplinary reviews data mining and knowledge discovery eva gibaja and ventura tutorial on multilabel learning acm comput daniel hsu sham kakade john langford and tong zhang prediction via compressed sensing in nips changwei hu piyush rai and lawrence carin poisson tensor factorization for massive binary tensors in uai ashish kapoor raajay viswanathan and prateek jain multilabel classification using bayesian compressed sensing in nips nikos karampatziakis and paul mineiro scalable multilabel prediction via randomized methods arxiv preprint dae kim and erik sudderth the doubly correlated nonparametric topic model in nips xiangnan kong zhaoming wu li ruofei zhang philip yu hang wu and wei fan learning with incomplete label assignments in sdm xin li feipeng zhao and yuhong guo conditional restricted boltzmann machines for learning with incomplete labels in aistats david mimno and andrew mccallum topic models conditioned on arbitrary features with dirichletmultinomial regression in uai paul mineiro and nikos karampatziakis fast label embeddings for extremely large output spaces in iclr workshop nicholas polson james scott and jesse windle bayesian inference for logistic models using gamma latent variables journal of the american statistical association yashoteja prabhu and manik varma fastxml fast accurate and stable for extreme learning in kdd maxim rabinovich and david blei the inverse regression topic model in icml james scott and liang sun for logistic regression arxiv preprint farbound tai and lin multilabel classification with principal label space transformation neural computation michael tipping bayesian inference an introduction to principles and practice in machine learning in advanced lectures on machine learning pages springer jason weston samy bengio and nicolas usunier wsabie scaling up to large vocabulary image annotation in ijcai yan yan glenn fung jennifer dy and romer rosales medical coding classification by leveraging relationships in kdd yu prateek jain purushottam kar and inderjit dhillon learning with missing labels in icml yi zhang and jeff schneider output codes using canonical correlation analysis in aistats zhou hannah dunson and carin binomial process and poisson factor analysis in aistats mingyuan zhou infinite edge partition models for overlapping community detection and link prediction in aistats jun zhu ni lao ning chen and eric xing 
Conditional topical coding: an efficient topic model conditioned on rich features. In KDD.
the estimator for counterfactual learning thorsten joachims department of computer science cornell university tj adith swaminathan department of computer science cornell university adith abstract this paper identifies severe problem of the counterfactual risk estimator typically used in batch learning from logged bandit feedback blbf and proposes the use of an alternative estimator that avoids this problem in the blbf setting the learner does not receive feedback like in supervised learning but observes feedback only for the actions taken by historical policy this makes blbf algorithms particularly attractive for training online systems ad placement web search recommendation using their historical logs the counterfactual risk minimization crm principle offers general recipe for designing blbf algorithms it requires counterfactual risk estimator and virtually all existing works on blbf have focused on particular unbiased estimator we show that this conventional estimator suffers from propensity overfitting problem when used for learning over complex hypothesis spaces we propose to replace the risk estimator with estimator showing that it neatly avoids this problem this naturally gives rise to new learning algorithm normalized policy optimizer for exponential models for structured output prediction using linear rules we evaluate the empirical effectiveness of normpoem on several classification problems finding that it consistently outperforms the conventional estimator introduction most interactive systems search engines recommender systems ad platforms record large quantities of log data which contain valuable information about the system performance and user experience for example the logs of an system record which ad was presented in given context and whether the user clicked on it while these logs contain information that should inform the design of future systems the log entries do not provide supervised training data in the conventional sense this prevents us from directly employing supervised learning algorithms to improve these systems in particular each entry only provides bandit feedback since the is only observed for the particular action chosen by the system the presented ad but not for all the other actions the system could have taken moreover the log entries are biased since actions that are systematically favored by the system will by in the logs learning from historical logs data can be formalized as batch learning from logged bandit feedback blbf unlike the problem of online learning from bandit feedback this setting does not require the learner to have interactive control over the system learning in such setting is closely related to the problem of evaluation in reinforcement learning we would like to know how well new system policy would perform if it had been used in the past this motivates the use of counterfactual estimators following an approach analogous to empirical risk minimization erm it was shown that such estimators can be used to design learning algorithms for batch learning from logged bandit feedback however the conventional counterfactual risk estimator used in prior works on blbf exhibits severe anomalies that can lead to degeneracies when used in erm in particular the estimator exhibits new form of propensity overfitting that causes severely biased risk estimates for the erm minimizer by introducing multiplicative control variates we propose to replace this risk estimator with risk estimator that provably avoids these degeneracies an extensive empirical 
evaluation confirms that the desirable theoretical properties of the risk estimator translate into improved generalization performance and robustness related work batch learning from logged bandit feedback is an instance of causal inference classic inference techniques like propensity score matching are hence immediately relevant blbf is closely related to the problem of learning under covariate shift also called domain adaptation or sample bias correction as well as evaluation in reinforcement learning lower bounds for domain adaptation and impossibility results for evaluation hence also apply to propensity score matching costing and other importance sampling approaches to blbf several counterfactual estimators have been developed for evaluation all these estimators are instances of importance sampling for monte carlo approximation and can be traced back to simulations learning upper bounds have been developed recently that show that these estimators can work for blbf we additionally show that importance sampling can overfit in hitherto unforeseen ways with the capacity of the hypothesis space during learning we call this new kind of overfitting propensity overfitting classic variance reduction techniques for importance sampling are also useful for counterfactual evaluation and learning for instance importance weights can be clipped to bias against variance in the estimators additive control variates give rise to regression estimators and doubly robust estimators our proposal uses multiplicative control variates these are widely used in financial applications see and references therein and policy iteration for reinforcement learning in particular we study the estimator which is superior to the vanilla estimator when fluctuations in the weights dominate the variance we additionally show that the estimator neatly addresses propensity overfitting batch learning from logged bandit feedback following we focus on the stochastic cardinal contextual bandit setting and recap the essence of the crm principle the inputs of structured prediction problem are drawn from fixed but unknown distribution pr the outputs are denoted by the hypothesis space contains stochastic hypotheses that define probability distribution over hypothesis makes predictions by sampling from the conditional distribution this definition of also captures deterministic hypotheses for notational convenience we denote the probability distribution by and the probability assigned by to as we use to refer to samples of pr and when clear from the context we will drop bandit feedback means we only observe the feedback for the specific that was predicted but not for any of the other possible predictions the feedback is just number called the loss smaller numbers are desirable in general the loss is the noisy realization of stochastic random variable the following exposition can be readily extended to the general case by setting the expected loss called risk of hypothesis is eh the aim of learning is to find hypothesis that has minimum risk counterfactual estimators we wish to use the logs of historical system to perform learning to ensure that learning will not be impossible we assume the historical algorithm whose predictions we record in our logged data is stationary policy with full support over for new hypothesis we can not use the empirical risk estimator used in supervised learning to directly approximate because the data contains samples drawn from while the risk from equation requires samples from importance sampling fixes this 
distribution mismatch eh so with data collected from the historical system xn yn δn pn where xi yi δi xi yi and pi yi xi we can derive an unbiased estimate of via monte carlo approximation yi xi δi pi this classic inverse propensity estimator has unbounded variance pi in can cause to be arbitrarily far away from the true risk to remedy this problem several thresholding schemes have been proposed and studied in the literature the straightforward option is to cap the propensity weights pick and set yi xi δi min pi smaller values of reduce the variance of but induce larger bias counterfactual risk minimization importance sampling also introduces variance in through the variability of this variance can be drastically different for different the crm principle is derived from generalization error bound that reasons about this variance using an empirical bernstein argument let and consider the random variable uh min note that contains observations uh theorem denote the empirical variance of uh by ˆar uh with probability at least in the random vector xi yi for stochastic hypothesis space with capacity and log ˆar uh log proof refer theorem of and the proof of theorem of following structural risk minimization this bound motivates the crm principle for designing algorithms for blbf learning algorithm should jointly optimize the estimate as well as its empirical standard deviation where the latter serves as regularizer ar uh argmin and are regularization the propensity overfitting problem the crm objective in equation penalizes those that are far from the logging policy as measured by their empirical variance ˆar uh this can be intuitively understood as safeguard against overfitting however overfitting in blbf is more nuanced than in conventional supervised learning in particular the unbiased risk estimator of equation has two anomalies even if the value of estimated on finite sample need not lie in that range furthermore if is translated by constant becomes by linearity of expectation but the unbiased estimator on finite sample need not equal in short this risk estimator is not equivariant the various thresholding schemes for importance sampling only exacerbate this effect these anomalies leave us vulnerable to peculiar kind of overfitting as we see in the following example example for the input space of integers and the output space define if otherwise the hypothesis space is the set of all deterministic functions if hf otherwise data is drawn uniformly and for all the hypothesis with minimum true risk is with which has risk when drawing training sample xn yn δn pn let us first consider the special case where all xi in the sample are distinct this is quite likely if is small relative to in this case contains hypothesis hoverf it which assigns xi yi for all this hypothesis has the following empirical risk as estimated by equation hoverf it yi xi hoverf it δi δi pi clearly this risk estimate shows severe overfitting since it can be arbitrarily lower than the true risk of the best hypothesis with appropriately chosen or more generally the choice of this is in stark contrast to overfitting in supervised learning where at least the overfitted risk is bounded by the lower range of the loss function note that the empirical risk of concentrates around erm will hence almost always select hoverf it over even if we are not in the special case of having sample with all distinct xi this type of overfitting still exists in particular if there are only distinct xi in then there still exists hoverf it with hoverf 
it nl finally note that this type of overfitting behavior is not an artifact of this example section shows that this is ubiquitous in all the datasets we explored maybe this problem could be avoided by transforming the loss for example let translate the loss by adding to so that now all loss values become this results in the new loss function taking values and in conventional supervised learning an additive translation of the loss does not change the empirical risk minimizer suppose we draw sample in which not all possible values for xi are observed for all xi in the sample again such sample is likely for sufficiently large now there are many hypotheses hoverf that predict one of the unobserved for each xi basically avoiding the training data hoverf yi xi δi δi hoverf pi again we are faced with overfitting since many overfit hypotheses are indistinguishable from the true risk minimizer with true risk and empirical risk these examples indicate that this overfitting occurs regardless of how the loss is transformed intuitively this type of overfitting occurs since the risk estimate according to equation can be minimized not only by putting large probability mass on the examples with low loss but by maximizing for negative losses or minimizing for positive losses the sum of the weights yi xi pi for this reason we call this type of overfitting propensity overfitting this is in stark contrast to overfitting in supervised learning which we call loss overfitting intuitively loss overfitting occurs because the capacity of fits spurious patterns of low in the data in propensity overfitting the capacity in allows overfitting of the propensity weights pi for positive hypotheses that avoid are selected for negative hypotheses that overrepresent are selected the variance regularization of crm combats both loss overfitting and propensity overfitting by optimizing more informed generalization error bound however the empirical variance estimate is also affected by propensity overfitting especially for positive losses can we avoid propensity overfitting more directly control variates and the estimator to avoid propensity overfitting we must first detect when and where it is occurring for this we draw on diagnostic tools used in importance sampling note that for any the sum of propensity weights from equation always has expected value under the conditions required for the unbiased estimator of equation yi xi pr xi dxi yi xi pr xi dyi dxi yi xi this means that we can identify hypotheses that suffer from propensity overfitting based on how far deviates from its expected value of since hh is likely correlated with hh large deviation in suggests large deviation in and consequently bad risk estimate how can we use the knowledge that to avoid degenerate risk estimates in principled way while one could use concentration inequalities to explicitly detect and eliminate overfit hypotheses based on we use control variates to derive an improved risk estimator that directly incorporates this knowledge control variates control variates random variables whose expectation is known are classic tool used to reduce the variance of monte carlo approximations let be control variate with known expectation ex and let ex be an expectation that we would like to estimate based on independent samples of employing as multiplicative control variate we can write ex this motivates the ratio estimator pn xi sn xi which is called the estimator in the importance sampling literature this estimator has substantially lower variance if and are 
correlated risk estimator let us use as control variate for yielding pn yi δi sn pn pi hesterberg reports that this estimator tends be more accurate than the unbiased estimator of equation when fluctuations in the sampling weights dominate the fluctuations in observe that the estimate is just convex combination of the δi observed in the sample if is now translated by constant both the true risk and the finite sample estimate get shifted by hence is equivariant unlike moreover is always bounded within the range of so the overfitted risk due to erm will now be bounded by the lower range of the loss analogous to supervised learning finally while the risk estimator is not unbiased er in general it is strongly consistent and approaches the desired expectation when is large theorem let be drawn xi yi from that has full support over then pr lim proof the numerator of in are observations with mean strong law pn of large numbers gives pr δi similarly the denominator has observations with mean so the strong law of large numbers implies pn pr hence pr in summary the risk estimator in equation resolves all the problems of the unbiased estimator from equation identified in section learning method we now derive learning algorithm called for structured output prediction the algorithm is analogous to poem in its choice of hypothesis space and its application of the crm principle but it replaces the conventional estimator with the estimator hypothesis space following learns stochastic linear rules hw hlin parametrized by that operate on joint feature map hw exp exp is the partition function variance estimator in order to instantiate the crm objective from equation we need an empirical variance estimate ˆar for the risk estimator following section we use an approximate variance estimate for the ratio estimator of equation using the normal approximation argument equation pn sn δi sn ar pn using the delta method to approximate the variance yields the same formula to invoke asymptotic normality of the estimator and indeed for reliable importance sampling estimates we require the true variance of the estimator ar to exist we can guarantee this by thresholding the importance weights analogous to the benefits of the estimator come at computational cost the risk estimator of poem had simpler variance estimate which could be approximated by taylor expansion and optimized using stochastic gradient descent the variance of equation does not admit stochastic optimization surprisingly in our experiments in section we find that the improved robustness of permits fast convergence during training even without stochastic optimization training objective of the objective is now derived by substituting the selfnormalized risk estimator of equation and its sample variance estimate from equation into the crm objective for the hypothesis space hlin by design hw lies in the exponential family of distributions so the gradient of the resulting objective can be tractably computed whenever the partition functions xi are tractable doing so yields objective in the parameters which we optimize using the choice of for and optimization is well supported analogous to poem the clipping to prevent unbounded variance and strength of variance regularization can be calibrated via counterfactual evaluation on held out validation set in summary the cost of optimizing the objective has the same complexity as the cost of poem with it requires the same set of and it can be done tractably whenever the corresponding supervised crf can be learnt 
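To make the contrast between the two estimators concrete, the sketch below computes, for a candidate hypothesis, the vanilla importance-sampling risk, the self-normalized risk, the weight-sum diagnostic S(h) (whose expectation is n), and a CRM-style training objective that adds an empirical-standard-deviation penalty derived from the normal approximation for the ratio estimator. It is a minimal NumPy sketch rather than the released software: the clipping constant M, the regularizer lam, and the exact scaling of the variance penalty are assumptions made for illustration.

import numpy as np

def ips_risk(h_probs, logged_props, losses):
    # conventional unbiased estimator: (1/n) * sum_i delta_i * h(y_i|x_i) / p_i
    return np.mean(losses * h_probs / logged_props)

def sn_risk(h_probs, logged_props, losses):
    # self-normalized estimator: a convex combination of the observed losses,
    # hence translation-equivariant and bounded by [min(losses), max(losses)]
    w = h_probs / logged_props
    return np.sum(losses * w) / np.sum(w)

def weight_sum_diagnostic(h_probs, logged_props):
    # S(h) = sum_i h(y_i|x_i) / p_i; its expectation is n, so values far from n
    # flag propensity overfitting
    return np.sum(h_probs / logged_props)

def crm_objective(h_probs, logged_props, losses, lam=1.0, M=100.0):
    # self-normalized risk plus a variance penalty (normal-approximation variance
    # of the ratio estimator); lam and M are hyperparameters to be validated counterfactually
    w = np.minimum(M, h_probs / logged_props)
    S = np.sum(w)
    r = np.sum(losses * w) / S
    var = np.sum(w ** 2 * (losses - r) ** 2) / S ** 2
    return r + lam * np.sqrt(var)

With h_w(y|x) an exponential-family model in w, all of these quantities are differentiable in w through h_probs, so the objective can be handed to any gradient-based optimizer.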
efficiently software implementing is available at http experiments we will now empirically verify if the estimator as used in can indeed guard against propensity overfitting and attain robust generalization performance we follow the supervised bandit methodology to test the limits of counterfactual learning in wellcontrolled environment as in prior work the experiment setup uses supervised datasets for classification from the libsvm repository in these datasets the inputs rp the predictions are bitvectors indicating the labels assigned to the datasets have range of features labels and instances name scene yeast tmc lyrl features labels ntrain ntest poem uses the crm principle instantiated with the unbiased estimator while uses the estimator both use hypothesis space isomorphic to conditional random field crf we therefore report the performance of crf essentially logistic regression for each of the labels independently as skyline for what we can possibly hope to reach by batch learning from logged bandit feedback the joint feature map for all approaches to simulate bandit feedback dataset we use crf with default trained on of the supervised dataset as and replay the training data times and collect sampled labels from this is inspired by the observation that supervised labels are typically hard to collect relative to bandit feedback the blbf algorithms only have access to the hamming loss between the supervised label and the sampled label for input generalization performance is measured by the expected hamming loss on the supervised test set lower is better were calibrated as recommended and validated on of in summary our experimental setup is identical to poem we report performance of blbf approaches without here we observed normpoem dominated poem even after since the choice of optimization method could be confounder we use for all methods and experiments what is the generalization performance of the key question is whether the appealing theoretical properties of the estimator actually lead to better generalization performance in table we report the test set loss for and poem averaged over runs on each run has varying performance trained on random subsets but consistently beats poem table test set hamming loss averaged over runs significantly outperforms poem on all four datasets paired difference at significance level of poem crf scene yeast tmc lyrl the plot below figure shows how generalization performance improves with more training data for single run of the experiment on the yeast dataset we achieve this by varying the number of times we replay the training set to collect samples from replaycount consistently outperforms poem for all training sample sizes crf poem replaycount figure test set hamming loss as on the yeast dataset all approaches will converge to crf performance in the limit but the rate of convergence is slow since is does avoid propensity overfitting while the previous results indicate that achieves better performance it remains to be verified that this improved performance is indeed due to improved control over propensity overfitting table left shows the average for the hypothesis selected by each approach indeed is close to its known expectation of for while it is severely biased for poem furthermore the value of depends heavily on how the losses are translated for poem as predicted by theory as anticipated by our earlier observation that the estimator is equivariant is unaffected by translations of table right shows that the same is true for the prediction error on 
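For readers who want to reproduce the supervised-to-bandit conversion described above, the following sketch shows the logging step: each training input is replayed several times, a label vector is sampled from the logging policy, and the Hamming loss and propensity of the sampled label are recorded. The interface is assumed (a logging_policy that returns per-label marginal probabilities, sampled independently per label, which matches a per-label logistic model); it illustrates the methodology, not the exact experimental code.

import numpy as np

def make_bandit_log(X, Y_true, logging_policy, replay=4, rng=np.random.default_rng(0)):
    # X: iterable of feature vectors, Y_true: matching 0/1 label vectors,
    # logging_policy(x) -> per-label probabilities q in [0, 1]^L (assumed interface)
    log = []
    for _ in range(replay):
        for x, y_star in zip(X, Y_true):
            q = logging_policy(x)
            y = (rng.random(q.shape) < q).astype(int)       # sampled label vector
            prop = np.prod(np.where(y == 1, q, 1.0 - q))    # propensity of the sampled labels
            delta = int(np.sum(y != y_star))                # Hamming loss feedback
            log.append((x, y, delta, prop))
    return log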
the test set is consistenly good while poem fails catastrophically for instance on the tmc dataset poem is worse than random guessing table mean of the unclipped weights left and test set hamming loss right averaged over runs and indicate whether the loss was translated to be positive or negative poem poem scene yeast tmc lyrl scene yeast tmc lyrl is crm variance regularization still necessary it may be possible that the improved selfnormalized estimator no longer requires variance regularization the loss of the unregularized estimator is reported in table we see that variance regularization still helps table test set hamming loss for and the variance agnostic averaged over the same runs as table on scene tmc and lyrl is significantly better than paired difference at significance level of scene yeast tmc lyrl how computationally efficient is the runtime of is surprisingly faster than poem even though normalization increases the computation cost optimization tends to converge in fewer iterations than for poem we find that poem picks hypothesis with large kwk attempting to assign probability of to all training points with negative losses however converges to much shorter kwk the loss of an instance relative to others in sample governs how tries to fit to it this is another nice consequence of the fact that the overfitted risk of is bounded and small overall the runtime of is on the same order of magnitude as those of crf and is competitive with the runtimes reported for poem with stochastic optimization and early stopping while providing substantially better generalization performance table time in seconds averaged across validation runs crf is implemented by time poem crf scene yeast tmc lyrl we observe the same trends for when different properties of are varied stochasticity and quality as reported for poem conclusions we identify the problem of propensity overfitting when using the conventional unbiased risk estimator for erm in batch learning from bandit feedback to remedy this problem we propose the use of multiplicative control variate that leads to the risk estimator this provably avoids the anomalies of the conventional estimator deriving new learning algorithm called based on the crm principle using the new estimator we show that the improved estimator leads to significantly improved generalization performance acknowledgement this research was funded in part through nsf awards the jtcii research fund and gift from bloomberg references adith swaminathan and thorsten joachims counterfactual risk minimization learning from logged bandit feedback in icml alina beygelzimer and john langford the offset tree for learning with partial labels in kdd pages nicolo and gabor lugosi prediction learning and games cambridge university press new york ny usa richard sutton and andrew barto reinforcement learning an introduction the mit press bottou jonas peters joaquin candela denis charles max chickering elon portugaly dipankar ray patrice simard and ed snelson counterfactual reasoning and learning systems the example of computational advertising journal of machine learning research miroslav john langford and lihong li doubly robust policy evaluation and learning in icml pages rosenbaum and rubin the central role of the propensity score in observational studies for causal effects biometrika cortes mansour and mohri learning bounds for importance weighting in nips pages john langford alexander strehl and jennifer wortman exploration scavenging in icml pages bianca zadrozny john langford and naoki 
abe learning by example weighting in icdm pages alexander strehl john langford lihong li and sham kakade learning from logged implicit exploration data in nips pages trotter and tukey conditional monte carlo for normal samples in symposium on monte carlo methods pages andreas maurer and massimiliano pontil empirical bernstein bounds and penalization in colt philip thomas georgios theocharous and mohammad ghavamzadeh evaluation in aaai pages edward ionides truncated importance sampling journal of computational and graphical statistics lihong li munos and szepesvari toward minimax value estimation in aistats phelim boyle mark broadie and paul glasserman monte carlo methods for security pricing journal of economic dynamics and control john schulman sergey levine pieter abbeel michael jordan and philipp moritz trust region policy optimization in icml pages tim hesterberg weighted average importance sampling and defensive mixture distributions technometrics vapnik statistical learning theory wiley chichester gb art owen monte carlo theory methods and examples augustine kong note on importance sampling using standardized weights technical report department of statistics university of chicago rubinstein and kroese simulation and the monte carlo method wiley edition john lafferty andrew mccallum and fernando pereira conditional random fields probabilistic models for segmenting and labeling sequence data in icml pages adrian lewis and michael overton nonsmooth optimization via methods mathematical programming jin yu vishwanathan and schraudolph approach to nonsmooth convex optimization problems in machine learning jmlr pedregosa varoquaux gramfort michel thirion grisel blondel prettenhofer weiss dubourg vanderplas passos cournapeau brucher perrot and duchesnay machine learning in python journal of machine learning research 
fast lifted map inference via partitioning somdeb sarkhel the university of texas at dallas parag singla delhi vibhav gogate the university of texas at dallas abstract recently there has been growing interest in lifting map inference algorithms for markov logic networks mlns key advantage of these lifted algorithms is that they have much smaller computational complexity than propositional algorithms when symmetries are present in the mln and these symmetries can be detected using lifted inference rules unfortunately lifted inference rules are sound but not complete and can often miss many symmetries this is problematic because when symmetries can not be exploited lifted inference algorithms ground the mln and search for solutions in the much larger propositional space in this paper we present novel approach which cleverly introduces new symmetries at the time of grounding our main idea is to partition the ground atoms and force the inference algorithm to treat all atoms in each part as indistinguishable we show that by systematically and carefully refining and growing the partitions we can build advanced and map inference algorithms our experiments on several datasets clearly show that our new algorithm is superior to previous approaches and often finds useful symmetries in the search space that existing lifted inference rules are unable to detect markov logic networks mlns allow application designers to compactly represent and reason about relational and probabilistic knowledge in large number of application domains including computer vision and natural language understanding using few weighted logic formulas these formulas act as templates for generating large markov networks the undirected probabilistic graphical model key reasoning task over mlns is maximum posteriori map inference which is defined as the task of finding an assignment of values to all random variables in the markov network that has the maximum probability this task can be solved using propositional graphical model inference techniques unfortunately these techniques are often impractical because the markov networks can be quite large having millions of variables and features recently there has been growing interest in developing lifted inference algorithms for solving the map inference task these algorithms work as much as possible on the much smaller specification grounding or propositionalizing only as necessary and can yield significant complexity reductions in practice at high level lifted algorithms can be understood as algorithms that identify symmetries in the specification using lifted inference rules and then use these symmetries to simultaneously infer over multiple symmetric objects unfortunately in vast majority of cases the inference rules are unable to identify several useful symmetries the rules are sound but not complete either because the symmetries are approximate or because the symmetries are and do not belong to known type in such cases lifted inference algorithms partially ground some atoms in the mln and search for solution in this much larger partially propositionalized space in this paper we propose the following yet principled approach for solving this partial grounding problem partition the ground atoms into groups and force the inference algorithm to treat all atoms in each group as indistinguishable symmetric for example consider atom and assume that can be instantiated to the following set of constants if the atom possesses the or symmetry then the lifted inference algorithm will search over 
only two assignments all five groundings of are either all true or all false in order to find the map solution when no identifiable symmetries exist the lifted algorithm will inefficiently search over all possible truth assignments to the ground atoms and will be equivalent in terms of complexity to propositional algorithm in our approach we would partition the domain say as and search over only the following assignments all groundings in each part can be either all true or all false thus if we are lucky and the map solution is one of the assignments our approach will yield significant reductions in complexity even though no identifiable symmetries exist in the problem our approach is quite general and includes the fully lifted and fully propositional approaches as special cases for instance setting the partition size to and respectively where is the number of constants will yield exactly the same solution as the one output by the fully lifted and fully propositional approach setting to values other than and yields family of inference schemes that systematically explores the regime between these two extremes moreover by controlling the size of each partition we can control the size of the ground theory and thus the space and time complexity of our algorithm we prove properties and improve upon our basic idea in several ways first we prove that our proposed approach yields consistent assignment that is on the map value second we show how to improve the lower bound and thus the quality of the map solution by systematically refining the partitions third we show how to further improve the complexity of our refinement procedure by exploiting the exchangeability property of successive refinements specifically we show that the exchangeable refinements can be arranged on lattice which can then be searched via heuristic search procedure to yield an efficient algorithm for map inference finally we demonstrate experimentally that our method is highly scalable and yields close to optimal solutions in fraction of the time as compared to existing approaches in particular our results show that for even small values of bounds the partition size our algorithm yields close to optimal map solutions clearly demonstrating the power of our approach notation and background partition of set collection of sets is partition of set if and only if each set in is nonempty pairwise disjoint and the union of all sets equals the sets in are called the cells or parts of the partition if two elements of the set appear in same cell of partition we denote them by the operator partition of set is refinement of partition of if every element of is subset of some element of informally this means that is further fragmentation of we say that is finer than or is coarser than and denote it as we will also use the notation to denote that either is finer than or is the same as for example let be partition of the set containing two cells and and let be another partition of then is refinement namely logic we will use strict subset of logic that has no function symbols equality constraints or existential quantifiers our subset consists of constants denoted by upper case letters etc which model objects in the domain logical variables denoted by lower case letters etc which can be substituted with objects logical operators such as disjunction conjunction implication and equivalence universal and existential quantifiers and predicates which model properties and relationships between objects predicate consists of predicate symbol denoted by 
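Returning to the motivating R(x) example above, the sketch below enumerates exactly the restricted search space induced by a partition: every grounding whose constant falls in the same cell is forced to take the same truth value, so only 2^k meta-assignments are visited instead of 2^n. This is a toy illustration of the search-space reduction; the function name and dictionary representation are ours, not the paper's.

from itertools import product

def partitioned_assignments(partition):
    # enumerate only assignments in which every constant inside a cell gets the
    # same truth value: 2^k candidates instead of 2^n
    for values in product([False, True], repeat=len(partition)):
        yield {c: v for cell, v in zip(partition, values) for c in sorted(cell)}

# e.g. domain {A, ..., E} split into {A, B, C} and {D, E}: only 4 candidate assignments
for assignment in partitioned_assignments([{"A", "B", "C"}, {"D", "E"}]):
    print(assignment)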
typewriter fonts friends etc followed by parenthesized list of arguments term is logical variable or constant literal is predicate or its negation formula in first order logic is an atom predicate or any complex sentence that can be constructed from atoms using logical operators and quantifiers for example smokes asthma is formula clause is disjunction of literals throughout we will assume that all formulas are clauses and their variables are standardized apart ground atom is an atom containing only constants ground formula is formula obtained by substituting all of its variables with constant namely formula containing only ground atoms for example the groundings of smokes asthma where ana bob are the two propositional formulas smokes ana asthma ana and smokes bob asthma bob markov logic markov logic network mln is set of weighted clauses in logic we will assume that all logical variables in all formulas are universally quantified and therefore we will drop the quantifiers from all formulas are typed and can be instantiated to finite set of constants for variable this set will be denoted by and there is mapping between the constants and objects in the domain herbrand interpretations note that the class of mlns we are assuming is not restrictive at all because almost all mlns used in application domains such as natural language processing and the web fall in this class given finite set of constants the mln represents ground markov network that has one random variable for each ground atom in its herbrand base and weighted feature for each ground clause in the herbrand base the weight of each feature is the weight of the corresponding clause given world which is truth assignment to all the ground patoms the markov network represents the following probability distribution exp wi fi where fi wi is weighted formula fi is the number of true groundings of fi in and is the partition function for simplicity we will assume that the mln is in normal form which is defined as an mln that satisfies the following two properties there are no constants in any formula and ii if two distinct atoms of predicate have variables and as the same argument of then because of the second condition in normal mlns we can associate domains with each argument of predicate let ir denote the argument of predicate and let ir denote the number of elements in the domain of ir we will also assume that all domains are of the form ir since domain size is finite any domain can be converted to this form common optimization inference task over mlns is finding the most probable state of the world that is finding complete assignment to all ground atoms which maximizes the probability formally arg max pm arg max exp wi fi arg max wi fi from eq we can see that the map problem reduces to finding truth assignment that maximizes the sum of weights of satisfied clauses therefore any weighted satisfiability solver such as maxwalksat can used to solve it however maxwalksat is propositional solver and is unable to exploit symmetries in the representation and as result can be quite inefficient alternatively the map problem can be solved in lifted manner by leveraging various lifted inference rules such as the decomposer the binomial rule and the recently proposed single occurrence rule schematic of such procedure is given in algorithm before presenting the algorithm we will describe some required definitions let ir denote the argument of predicate given an mln two arguments ir and js of its predicates and respectively are called unifiable if 
they share logical variable in an mln formula being symmetric and transitive the unifiable relation splits the arguments of all the predicates into set of domain equivalence classes example consider normal mln having two weighted formulas and here we have two sets of domain equivalence classes and algorithm has five recursive steps and returns algorithm lmap mln the optimal map value the first two lines are base case the base case and the simplification step in if is empty return which the mln is simplified by deleting redunsimplify dant formulas rewriting predicates by propositional decomposition ing constants so that lifted conditioning can be if has disjoint mlns mk then applied and assigning values to ground atoms pk return lmap whose values can be inferred using assignments lifted decomposition made so far the second step is the propositional if has liftable domain equivalence class then decomposition step in which the algorithm rereturn lmap curses over disjoint mlns if any and returns lifted conditioning their sum in the lifted decomposition step the if has singleton atom then algorithm finds domain equivalence class return lmap such that in the map solution all ground atoms partial grounding of the predicates that have elements of as arheuristically select domain equivalence class guments are either all true or all false to find and ground it yielding new mln such class rules given in can be return lmap used in the algorithm denotes the mln obtained by setting the domain of all elements of to and updating the formula weights accordingly in the lifted conditioning step if there is an atom having just one argument singleton atom then the algorithm partitions the possible truth assignments to groundings of such that in each part all truth assignments have the same number of true atoms in the algorithm denotes the mln obtained by setting groundings of to true and the remaining to false is the total weight of ground formulas satisfied by the assignment the final step in lmap is the partial grounding step and is executed only when the algorithm is unable to apply lifted inference rules in this step the algorithm heuristically selects domain equivalence class and grounds it completely for example example consider an mln with two formulas and let after grounding the equivalence class we get an mln having four formulas and scaling up the partial grounding step using set partitioning partial grounding often yields much bigalgorithm ger mln than the original mln and is the mln size and domain equivalence class chief reason for the inefficiency and poor scalability of algorithm lmap to address create partition of size of where ir this problem we propose novel approach foreach predicate such that ir do to speed up inference by adding additional foreach cell πj of do constraints to the existing lifted map foradd all possible hard formulas of the form mulation our idea is as follows reduce the xr yr number of ground atoms by partitioning them such that xi yi if ir and and treating all atoms in each part as indistinx if where xa xb πj guishable thus instead of introducing tn return new ground atoms where is the cardinality of the domain equivalence class and is the number of constants our approach will only introduce tk ground atoms where our new approximate partial grounding method which will replace the partial grounding step in algorithm is formally described in algorithm the algorithm takes as input an mln an integer and domain equivalence class as input and outputs new mln the algorithm first 
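As a small illustration of the constraint-generation step just outlined (see also the example that follows), the sketch below generates, for a unary predicate, the hard equivalence formulas that tie together all groundings whose constants share a partition cell; by transitivity a chain per cell would already suffice. The formula syntax and the function name are ours, introduced only for illustration.

from itertools import combinations

def equality_constraints(pred, partition):
    # hard formulas forcing all groundings of a unary predicate inside a cell
    # to take the same truth value
    formulas = []
    for cell in partition:
        for a, b in combinations(sorted(cell), 2):
            formulas.append(f"{pred}({a}) <=> {pred}({b})")
    return formulas

# partition {X1, ..., X5} into {X1, X2, X3} and {X4, X5}
print(equality_constraints("R", [{"X1", "X2", "X3"}, {"X4", "X5"}]))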
partitions the domain of the class into cells yielding partition then for each cell πj of and each predicate such that one or more of its arguments is in the algorithm adds all possible constraints of the form xr yr such that for each we add the equality constraint between the logical variables xi and yi if the argument of the predicate is not in and set xi xa and yi xb if argument of is in where xa xb πj since adding constraints restricts feasible solutions to the optimization problem it is easy to show that proposition let where is an mln and is an integer be the mln used in the partial grounding step of algorithm instead of the partial grounding step described in the algorithm then the map value returned by the modified algorithm will be smaller than or equal to the one returned by algorithm the following example demonstrates how algorithm constructs new mln example consider the mln in example let be of the domain of then after applying algorithm the new mln will have the following three hard formulas in addition to the formulas given in example and although adding constraints reduces the search space of the map problem algorithm still needs to ground the mln this can be time consuming alternatively we can group indistinguishable atoms together without grounding the mln using the following definition definition let be domain equivalence class and let be its partition two ground atoms xr and yr of predicate such that are equivalent if xi yi if ir and xi xa yi xb if ir where xa xb πj we denote this by xr yr notice that the relation is symmetric and reflexive thus we can group all the ground atoms corresponding to the transitive closure of this relation yielding meta ground atom such that if the meta atom is assigned to true false all the ground atoms in the transitive closure will be true false this yields the algorithm described as algorithm the algorithm starts the constants can be removed by renaming the predicates yielding normal mln for example we can rename as this renaming occurs in the simplification step by creating partition of the domain of it then updates the domain of so that it only contains values grounds all arguments of predicates that are in the set and updates the formula weights appropriately the formula weights should be updated because when the domain is compressed several ground formulas are replaced by just one ground formula intuitively if partially ground formulas having weight are replaced by one partially ground formula then should be equal to wt the two for loops in algorithm accomplish this we can show that proposition the map value output by replacing the partial grounding step in algorithm with algorithm is the same as the one output by replacing the the partial grounding step in algorithm with algorithm the key advantage using algorithm partitionalgorithm ground is that the lifted algorithm lmap mln size and domain equivalence class will have much smaller space complexity than the one using algorithm constrainedcreate partition of size of where ir ground specifically unlike the latter which update the domain to in yields ground atoms assuming ground all predicates such that ir each predicate has only one argument in foreach formula in such that where is the number of constants in contains an atom of where ir do the domain of the former generates only let be the formula in from which was derived ground atoms where the following example illustrates how algorithm constructs new mln foreach logical variable in that was substituted by the value in to yield do 
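The grouping idea behind this partition-based grounding step can be illustrated with a small, self-contained sketch: ground atoms whose class arguments fall in the same cell collapse into one meta ground atom, and a collapsed formula's weight is scaled by the number of original groundings it now stands for. The data structures and the round-robin partitioning heuristic below are assumptions made for illustration only.

```python
from itertools import product

def make_partition(domain, k):
    """One simple way to split `domain` into k cells (round-robin)."""
    cells = [[] for _ in range(k)]
    for i, const in enumerate(domain):
        cells[i % k].append(const)
    return cells

def meta_atoms(pred, arg_domains, class_positions, cells):
    """Meta ground atoms: argument positions drawn from the partially grounded
    class are replaced by a cell index; the rest keep their constants."""
    choices = [range(len(cells)) if pos in class_positions else dom
               for pos, dom in enumerate(arg_domains)]
    return [(pred,) + combo for combo in product(*choices)]

def collapsed_weight(w, chosen_cells, cells):
    """A collapsed formula represents one original grounding per element of each
    chosen cell, so its weight is scaled by the product of the cell sizes."""
    scale = 1
    for c in chosen_cells:
        scale *= len(cells[c])
    return w * scale

domain = ["A", "B", "C", "D"]
cells = make_partition(domain, 2)                      # [['A', 'C'], ['B', 'D']]
atoms = meta_atoms("R", [domain, domain], {0}, cells)  # R(cell, constant)
print(len(atoms), collapsed_weight(1.5, [0], cells))   # 8 meta atoms; weight 3.0
```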
where πj is the cell of return example consider an mln with two formulas and let and after grounding with respect to we get an mln having four formulas and the total weight of grounding in is which is the same as in the following example illustrates how the algorithm constructs new mln in presence of example consider an mln with the single formula let and after grounding and also on as they belong to the same domain equivalence class with respect to we get an mln having following four formulas and generalizing the partition grounding approach algorithm allows us to group the equivalent atoms with respect to partition and has much smaller space complexity and time complexity than the partial grounding strategy described in algorithm however it yields lower bound on the map value in this section we show how to improve the lower bound using refinements of the partition the basis of our generalization is the following theorem theorem given two partitions and of such that the map value of the partially ground mln with respect to is less than or equal to the map value of the partially ground mln with respect to proof sketch since the partition is finer refinement of any candidate map assignment corresponding to the mln obtained via already includes all the candidate assignments corresponding to the mln obtained via and since the map value of both of these mlns are lower bound of the original map value the theorem follows we can use theorem to devise new map algorithm which refines the partitions to get better estimate of map values our approach is presented in algorithm the algorithm begins by identifying all domains namely domains ui that will be partially grounded during the execution of algorithm and associating πi with each domain then until there is timeout it iterates through the following two steps first it runs the lmap algorithm which uses the pair ui πi in algorithm during the partial grounding step yielding map solution second it heuristically selects partition πj and refines it from theorem it is clear that as the number of iterations is increased the map solution will either improve or remain the same thus algorithm is an anytime algorithm alternatively we can also devise an algorithm using the following idea we will algorithm mln first determine the maximum size of partilet ui be the domains tion that we can fit in the memory as different set πi where jr ui for all ui partitions of size will give us different map values we can search through them to find the while timeout has not occurred do best possible map solution drawback of the approach is that it explores lmap uses the pair ui πi and algorithm prohibitively large search space in particular for its partial grounding step the number of possible partitions of size for heuristically select partition πj and refine it set of size denoted by is given by return the so called stirling numbers of the second kind which grows with the total number of partitions of set is given by the bell pnexponentially number bn nk clearly searching over all the possible partitions of size is not practical luckily we can exploit symmetries in the mln representation to substantially reduce the number of partitions we have to consider since many of them will give us the same map value formally theorem given two πk and φk of such that for all the map value of the partially ground mln with respect to is equal to the map value of the partially ground mln with respect to proof sketch formula when ground on an argument ir with respect to partition creates 
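The gap between searching over all set partitions and searching only over exchangeable ones can be made concrete with a quick count. The snippet below (an illustration, not taken from the paper) computes the Stirling and Bell numbers discussed above and compares them with the number of cell-size profiles, which by the exchangeability argument is all that matters; for a domain of ten constants this is 115,975 set partitions versus 42 size profiles.

```python
from functools import lru_cache

@lru_cache(None)
def stirling2(n, k):
    """Stirling number of the second kind: set partitions of n elements into k cells."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def bell(n):
    """Total number of set partitions of an n-element domain."""
    return sum(stirling2(n, k) for k in range(n + 1))

@lru_cache(None)
def size_profiles(n, k):
    """Partitions of the integer n into exactly k positive parts, i.e. the
    number of distinct cell-size profiles for a k-cell partition."""
    if k == 0:
        return 1 if n == 0 else 0
    if n < k:
        return 0
    return size_profiles(n - 1, k - 1) + size_profiles(n - k, k)

n = 10
print(bell(n))                                              # 115975 set partitions
print(sum(size_profiles(n, k) for k in range(1, n + 1)))    # 42 size profiles
```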
copies of the formula since grounding on ir with respect to also creates the same number of formulas which are identical upto renaming of constants furthermore since each of their parts have identical cardinality and as weight of ground formula is determined by the cell sizes see algorithm the ground formulas obtained using and will have same weights as well as result mlns obtained by grounding on any argument ir with respect to and are indistinguishable subject to renaming of variables and constants and the proof follows from theorem it follows that the number of elements in cells and the number of cells of partition is sufficient to define partially ground mln with respect to that partition consecutive refinements of such partitions will thus yield lattice which we will refer to as exchangeable partition lattice the term exchangeable refers to the fact that two parti tions containing same number of elements of cells and same number of cells are figure exchangeable partition lattice corresponding able with each other in terms of map to the domain lution quality figure shows the exchangeable partition lattice corresponding to the domain the number of partitions in the lattice would have been on the other hand the lattice has elements different traversal strategies of this exchangeable partition lattice will give rise to different lifted map algorithms for example greedy traversal of the lattice yields algorithm we can also explore the lattice using systematic search and return the maximum solution found for particular depth limit this yields an improved version of our approach described earlier we can even combine the two strategies by traversing the lattice in some heuristic order for our experiments we use greedy search because full search was very expensive note that although our algorithm assumes normal mlns which are we can easily extend it to use shattering as needed moreover by clustering evidence atoms together we can further reduce the size of the shattered theory experiments we implemented our algorithm on top of the lifted map algorithm of sarkhel et al which reduces lifted map inference to an integer polynomial program ipp we will call our algorithm which stands for ipp we performed two sets of experiments the first set measures the impact of increasing the partition size on the quality of the map solution output by our algorithm the second set compares the performance and scalability of our algorithm with several algorithms from literature all of our experiments were run on third generation machine having ram we used following five mlns in our experimental study an mln which we call equivalence that consists of following three formulas equals equals equals and equals equals equals the student mln from consisting of four formulas and three predicates the relationship mln from consisting of four formulas and three predicates webkb mln from the alchemy web page consisting of three predicates and seven formulas and citation ie mln from the alchemy web page consisting of five predicates and fourteen formulas we compared the solution quality and scalability of our approach with the following algorithms and systems alchemy aly tuffy tuffy ground inference based on integer linear programming ilp and the ipp algorithm of sarkhel et al alchemy and tuffy are two open source software packages for learning and inference in mlns both of them ground the mln and then use an approximate solver maxwalksat to compute the map solution unlike alchemy tuffy uses clever database tricks to speed up 
computation and in principle can be much more scalable than alchemy ilp is obtained by converting the map problem over the ground markov network to an integer linear program we ran each algorithm on the aforementioned mlns for varying and recorded the solution quality which is measured using the total weight of the false clauses in the approximate map solution also referred to as the cost smaller the cost better the map solution for fair comparison we used parallelized integer linear programming solver called gurobi to solve the integer linear programming problems generated by our algorithm as well as by other competing algorithms figure shows our experimental results note that if the curve for an algorithm is not present in plot then it means that the corresponding algorithm ran out of either memory or time on the mln and did not output any solution we observe that tuffy and alchemy are the worst performing systems both in terms of solution quality and scalability ilp scales slightly better than tuffy and alchemy however it is unable to handle mlns having more than clauses we can see that our new algorithm run as an anytime scheme by refining partitions not only finds higher quality map solutions but also scales better in terms of time complexity than ipp in particular ipp could not scale to the equivalence mln having roughly million ground clauses and the relation mln having roughly ground clauses the reason is that these mlns have same predicate appearing multiple times in formula which ipp is unable to lift on the other hand our new approach is able to find useful approximate symmetries in these hard mlns to measure the impact of varying the partition size on the map solution quality we conducted the following experiment we first ran the ipp algorithm until completion to compute the optimum map value then we ran our algorithm multiple times until completion as well and recorded the solution quality achieved in each run for different partition sizes figure plots average cost across various runs as function of the error bars show the standard deviation for brevity we only show results for the ie and equivalence mlns the optimum solutions for the three mlns were found in minutes hours and hours respectively on the other hand our new approach yields close to optimal solutions in fraction of the time and for relatively small values of summary and future work lifted inference techniques have gained popularity in recent years and have quickly become the approach of choice to scale up inference in mlns pressing issue with existing lifted inference technology is that most algorithms only exploit exact identifiable symmetries and resort to grounding or propositional inference when such symmetries are not present this is problematic because grounding can blow up the search space in this paper we proposed principled approximate approach to solve this grounding problem the main idea in our approach is to partition the ground atoms into small number of groups and then treat all ground atoms in group as indistinguishable tuffy aly ipp ilp ipp cost cost cost ipp time in seconds ie tuffy aly ipp ilp ie tuffy aly ipp ilp cost cost cost time in seconds ie time in seconds time in seconds time in seconds equivalence equivalence time in seconds equivalence ipp cost cost cost time in seconds time in seconds webkb time in seconds student relation figure cost vs time cost of unsatisfied clauses smaller is better vs time for different domain sizes notation used to label each figure mln numvariables numclauses 
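The anytime scheme referred to above, which repeatedly runs the lifted MAP solver and greedily refines a partition until a timeout, can be sketched as follows. The `lifted_map` callable and the split-the-largest-cell refinement heuristic are placeholders chosen for illustration; they are not the exact heuristics used in the experiments.

```python
import time

def refine(partition):
    """One greedy refinement step: split the largest cell in two (if possible)."""
    cells = sorted(partition, key=len, reverse=True)
    big, rest = cells[0], cells[1:]
    mid = len(big) // 2
    return [big[:mid], big[mid:]] + rest if mid > 0 else partition

def anytime_map(mln, domains, lifted_map, timeout_s=60.0):
    """Run the lifted solver with the current partitions, refine, repeat;
    keep the best solution found before the deadline."""
    partitions = {d: [list(domains[d])] for d in domains}   # start with one cell each
    best_value, best_assignment = float("-inf"), None
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        value, assignment = lifted_map(mln, partitions)      # partial grounding w.r.t. partitions
        if value > best_value:
            best_value, best_assignment = value, assignment
        # Heuristic: refine the partition that still contains the largest cell.
        d = max(partitions, key=lambda dd: max(len(c) for c in partitions[dd]))
        partitions[d] = refine(partitions[d])
    return best_value, best_assignment
```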
note the quantities reported are for ground markov network associated with the mln standard deviation is plotted as error bars optimum optimum optimum cost cost cost ie ie equivalence figure cost vs partition size notation used to label each figure mln numvariables numclauses from each other this simple idea introduces new approximate symmetries which can help the inference process although our proposed approach is inherently approximate we proved that it has nice theoretical properties in that it is guaranteed to yield consistent assignment that is lowerbound on the map value we further described an algorithm which can improve this lower bound through systematic refinement of the partitions finally based on the exchangeability property of the refined partitions we demonstrated method for organizing the partitions in lattice structure which can be traversed heuristically to yield efficient as well as lifted map inference algorithms our experiments on wide variety of benchmark mlns clearly demonstrate the power of our new approach future work includes connecting this work to the work on hierarchy deriving variational principle for our method and developing novel branch and bound as well as weight learning algorithms based on our partitioning approach acknowledgments this work was supported in part by the darpa probabilistic programming for advanced machine learning program under afrl prime contract number references apsel and braman exploiting uniform assignments in mpe in proceedings of the aaai conference on artificial intelligence pages apsel kersting and mladenov lifting relational using cluster signatures in proceedings of the aaai conference on artificial intelligence bui huynh and riedel automorphism groups of graphical models and lifted variational inference in proceedings of the conference on uncertainty in artificial intelligence de salvo braz lifted probabilistic inference phd thesis university of illinois urbanachampaign il domingos and lowd markov logic an interface layer for artificial intelligence morgan claypool gogate and domingos probabilistic theorem proving in proceedings of the conference on uncertainty in artificial intelligence pages auai press hadiji and kersting reduce and bootstrapped lifted likelihood maximization for map in proceedings of the aaai conference on artificial intelligence gurobi optimization gurobi optimizer reference manual jha gogate meliou and suciu lifted inference from the other side the tractable features in proceedings of the annual conference on neural information processing systems kisynski and poole constraint processing in lifted probabilistic inference in proceedings of the conference on uncertainty in artificial intelligence pages kok sumner richardson singla poon lowd wang and domingos the alchemy system for statistical relational ai technical report department of computer science and engineering university of washington seattle wa http marinescu and dechter search for combinatorial optimization in graphical models artificial intelligence mittal goyal gogate and singla new rules for domain independent lifted map inference in advances in neural information processing systems mladenov globerson and kersting efficient lifting of map lp relaxations using proceedings of the international conference on artificial intelligence and statistics niu doan and shavlik tuffy scaling up statistical inference in markov logic networks using an rdbms proceedings of the vldb endowment noessner niepert and stuckenschmidt rockit exploiting parallelism and 
symmetry for MAP inference in statistical relational models. In Proceedings of the AAAI Conference on Artificial Intelligence.
Poole. First-order probabilistic inference. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico. Morgan Kaufmann.
Sarkhel, Venugopal, Singla, and Gogate. An integer polynomial programming based framework for lifted MAP inference. In Advances in Neural Information Processing Systems.
Sarkhel, Venugopal, Singla, and Gogate. Lifted MAP inference for Markov logic networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
Selman, Kautz, and Cohen. Local search strategies for satisfiability testing. In Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge.
Van den Broeck and Darwiche. On the complexity and approximation of binary evidence in lifted inference. In Advances in Neural Information Processing Systems.
Van den Broeck, Taghipour, Meert, Davis, and De Raedt. Lifted probabilistic inference by knowledge compilation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence.
Venugopal and Gogate. Clustering for scalable inference in Markov logic. In Machine Learning and Knowledge Discovery in Databases.
data generation as sequential decision making philip bachman doina precup mcgill university school of computer science mcgill university school of computer science dprecup abstract we connect broad class of generative models through their shared reliance on sequential decision making motivated by this view we develop extensions to an existing model and then explore the idea further in the context of data imputation perhaps the simplest setting in which to investigate the relation between unconditional and conditional generative modelling we formulate data imputation as an mdp and develop models capable of representing effective policies for it we construct the models using neural networks and train them using form of guided policy search our models generate predictions through an iterative process of feedback and refinement we show that this approach can learn effective policies for imputation problems of varying difficulty and across multiple datasets introduction directed generative models are naturally interpreted as specifying sequential procedures for generating data we traditionally think of this process as sampling but one could also view it as making sequences of decisions for how to set the variables at each node in model conditioned on the settings of its parents thereby generating data from the model the large body of existing work on reinforcement learning provides powerful tools for addressing such sequential decision making problems we encourage the use of these tools to understand and improve the extended processes currently driving advances in generative modelling we show how sequential decision making can be applied to general prediction tasks by developing models which construct predictions by iteratively refining working hypothesis under guidance from exogenous input and endogenous feedback we begin this paper by reinterpreting several recent generative models as sequential decision making processes and then show how changes inspired by this point of view can improve the performance of the model introduced in next we explore the connections between directed generative models and reinforcement learning more fully by developing an approach to training policies for sequential data imputation we base our approach on formulating imputation as finitehorizon markov decision process which one can also interpret as deep directed graphical model we propose two policy representations for the imputation mdp one extends the model in by inserting an explicit feedback loop into the generative process and the other addresses the mdp more directly we train our using techniques motivated by guided policy pearch we examine their qualitative and quantitative performance across imputation problems covering range of difficulties different amounts of data to impute and different missingness mechanisms and across multiple datasets given the relative paucity of existing approaches to the general imputation problem we compare our models to each other and to two simple baselines we also test how our policies perform when they use steps to refine their predictions as imputation encompasses both classification and standard unconditional generative modelling our work suggests that further study of models for the general imputation problem is worthwhile the performance of our models suggests that sequential stochastic construction of predictions guided by both input and feedback should prove useful for wide range of problems training these models can be challenging but lessons from reinforcement 
learning may bring some relief directed generative models as sequential decision processes directed generative models have grown in popularity relative to their undirected reasons include the development of efficient methods for training them the ease of sampling from them and the tractability of bounds on their growth in available computing power compounds these benefits one can interpret the ancestral sampling process in directed model as repeatedly setting subsets of the latent variables to particular values in sequence of decisions conditioned on preceding decisions each subsequent decision restricts the set of potential outcomes for the overall sequence intuitively these models encode stochastic procedures for constructing plausible observations this section formally explores this perspective deep autoregressive networks the deep autoregressive networks investigated in define distributions of the following form with pt zt in which indicates generated observation and zt represent latent variables in the model the distribution may be factored similarly to the form of in eqn can represent arbitrary distributions over the latent variables and the work work in mainly concerned approaches to parameterizing the conditionals pt zt that restricted representational power in exchange for computational tractability to appreciate the generality of eqn consider using zt that are univariate multivariate structured etc one can interpret any model based on this sequential factorization of as policy pt zt for selecting each action zt in state st with each st determined by all for and train it using some form of policy search generalized guided policy search we adopt broader interpretation of guided policy search than one might initially take from we provide review of guided policy search in the supplementary material our expanded definition of guided policy search includes any optimization of the general form minimize iq ip div ip iq ip ip in which indicates the primary policy indicates the guide policy iq indicates distribution over information available only to ip indicates distribution over information available to both and iq ip computes the cost of trajectory in the context of iq and div ip measures dissimilarity between the trajectory distributions generated by as goes to infinity eqn enforces the constraint ip ip iq terms for controlling the entropy of can also be added the power of the objective in eq stems from two main points the guide policy can use information iq that is unavailable to the primary policy and the primary policy need only be trained to minimize the dissimilarity term div ip for example directed model structured as in eqn can be interpreted as specifying policy for mdp whose terminal state distribution encodes in this mdp the state at time is determined by the policy picks an action zt zt at time and picks an action at time the policy can be written as pt zt for and as zt for the initial state is drawn from executing the policy for single trial produces trajectory zt and the distribution over xs from these trajectories is just in the corresponding directed generative model the authors of train deep autoregressive networks by maximizing variational lower bound on the training set to do this they introduce variational distribution which provides and qt zt for with the final step zt given by at given these definitions the training in can be interpreted as guided policy search for the mdp described in the previous paragraph specifically the variational distribution provides guide 
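Because the displayed objective above did not survive extraction cleanly, the following is a hedged reconstruction of the generalized guided-policy-search objective it describes, using the notation of the surrounding prose (p is the primary policy, q the guide policy, i_q and i_p the guide-only and shared information, ℓ the trajectory cost, and λ the constraint weight); the exact placement of the expectations is an assumption.

```latex
\min_{p,\,q}\;
\mathbb{E}_{(i_q,\,i_p)}
\Big[
  \mathbb{E}_{\tau \sim q(\tau \mid i_q,\, i_p)}\!\big[\, \ell(\tau, i_q, i_p) \,\big]
  \;+\; \lambda \,\operatorname{div}\!\big(\, p(\tau \mid i_p),\; q(\tau \mid i_q, i_p) \,\big)
\Big]
```

As λ grows, the divergence term forces the primary policy's trajectory distribution to agree with the guide's, which is the constraint alluded to in the text.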
policy over trajectories zt zt qt zt the primary policy generates trajectories distributed according to zt pt zt which does not depend on in this case corresponds to the information iq iq in eqn we now rewrite the variational optimization as minimize kl where and dx indicates the target distribution for the terminal state of the primary policy when expanded the kl term in eqn becomes kl qt zt log log zt log thus the variational approach used in for training directed generative models can be interpreted as form of generalized guided policy search as the form in eqn can represent any finite directed generative model the preceding derivation extends to all models we discuss in this stochastic processes one can simplify eqn by assuming suitable forms for and zt the authors of proposed model in which zt for all and was gaussian we can write their model as xt pt xt xt ty pt xt where xt indicates the terminal state distribution of the markov process determined by pt xt note that throughout this paper we ab use sums over latent variables and trajectories which be written as integrals the authors of observed that for any reasonably smooth target distribution dx and sufficiently large one can define stochastic process qt with simple dynamics that transforms xt dx into the gaussian distribution this is given by dx xt xt qt next we define as the distribution over trajectories xt generated by the reversetime process determined by qt xt dx xt we define as the distribution over trajectories generated by the process in eqn the training in is equivalent to guided policy search using guide trajectories sampled from it uses the objective dx xt xt minimize log log log pt xt pt xt which corresponds to minimizing kl if the in eqn are tractable then this minimization can be done using basic if as in the processi is not pt trained then eqn simplifies to minimizep eq log log pt xt this trick for generating guide trajectories exhibiting particular distribution over terminal states xt running dynamics backwards in time starting from xt dx may prove useful in settings other than those considered in the lapgan model in learns to approximately invert fixed and information destroying process the supplementary material expands on the content of this subsection including derivation of eqn as bound on log we could pull the log zt term from the kl and put it in the cost but we prefer the kl formulation for its elegance we abuse notation using kl log this also includes all generative models implemented and executed on an actual computer learning generative stochastic processes with lstms the authors of introduced model for generative processes we interpret their model as primary policy which generates trajectories zt with distribution pt zt with zt in which indicates latent trajectory and sθ indicates state trajectory st computed recursively from using the update st fθ zt for the initial state is given by trainable constant each state st ht vt represents the joint state ht of an lstm and fθ state input computes standard lstm the authors of defined all pt zt as isotropic gaussians and defined the output distribution as where pt ct ωθ vt here is trainable constant and ωθ vt is an affine transform of vt intuitively ωθ vt transforms vt into refinement of the working hypothesis which gets updated to ct ωθ vt is governed by parameters which affect fθ ωθ and the supplementary material provides and an illustration for this model to train the authors of introduced guide policy with trajectory distribution qt zt with zt in which sφ 
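The sequential "write" process described above — sample a latent step, update a recurrent state, and add a refinement to a working canvas — can be mimicked with a toy numpy sketch. The linear maps and tanh update below stand in for the LSTM and the trainable write function ω_θ; the dimensions and the isotropic step distribution are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, z_dim, h_dim, x_dim = 16, 8, 32, 64

W_zh = rng.normal(scale=0.1, size=(h_dim, z_dim))    # latent step -> state
W_hh = rng.normal(scale=0.1, size=(h_dim, h_dim))    # state -> state
W_hx = rng.normal(scale=0.1, size=(x_dim, h_dim))    # state -> canvas refinement

c = np.zeros(x_dim)                                   # c_0: a trainable bias in the paper
h = np.zeros(h_dim)                                   # s_0: a trainable constant in the paper

for t in range(T):
    z = rng.normal(size=z_dim)                        # z_t ~ p_t(z_t | s_{t-1}) (isotropic here)
    h = np.tanh(W_hh @ h + W_zh @ z)                  # s_t = f_theta(s_{t-1}, z_t)
    c = c + W_hx @ h                                  # c_t = c_{t-1} + omega_theta(v_t)

x = 1.0 / (1.0 + np.exp(-c))                          # p(x | c_T), e.g. Bernoulli means
print(x.shape)                                        # (64,)
```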
indicates state trajectory computed recursively from using the guide policy state update fφ gφ sθ in this update is the previous guide state and gφ sθ is deterministic function of and the partial primary state trajectory sθ which is computed recursively from using the state update st fθ zt the output distribution is defined as at each qt zt is diagonal gaussian distribution with means and given by an affine function lφ of is defined as identical to is governed by parameters which affect the state updates fφ gφ sθ and the step distributions qt zt gφ sθ corresponds to the read operation of the encoder network in using our definitions for the training objective in is given by qt zt minimize log log pt zt which can be written more succinctly as kl this objective log where extending the generative model qt we propose changing in eqn to pt zt we define pt zt as diagonal gaussian distribution with means and given by an affine function lθ of remember that st ht vt and we define as an isotropic gaussian we set using fθ where fθ is trainable function neural network intuitively our changes make the model more like typical policy by conditioning its action zt on its state and upgrade the model to an infinite mixture by placing distribution over its initial state we also consider using ct lθ ht which transforms the hidden part of the lstm state st directly into an observation this makes ht working memory in which to construct an observation the supplementary material provides and an illustration for this model we train this model by optimizing the objective qt zt log log minimize log pt zt for those unfamiliar with lstms good introduction can be found in we use lstms including input gates forget gates output gates and peephole connections for all tests presented in this chapter it may be useful to relax this assumption where we now have to deal with pt zt and which could be treated as constants in the model from we define as diagonal gaussian distribution whose means and are given by trainable function gφ when trained for the binarized mnist benchmark used in our extended model scored negative of on the test for comparison the score reported in was after finetuning the variational distribution on the test set our model score improved to which is quite strong considering it is an upper bound for comparison see the best upper bound reported for this benchmark in which was when the model used the alternate ct lθ ht the test scores were fig shows samples from the model code is available at http figure the left block shows ct for for policy with ct pt lθ vt the right block is analogous for model using ct lθ ht developing models for sequential imputation the goal of imputation is to estimate xu where xu xk indicates complete observation with known values xk and missing values xu we define mask as disjoint partition of into xu by expanding xu to include all of one recovers standard generative modelling by shrinking xu to include single element of one recovers standard given distribution dm over and distribution dx over the objective for imputation is minimize log xu we now describe mdp for which guided policy search minimizes bound on the objective in eqn the mdp is defined by mask distribution dm complete observation distribution dx and the state spaces zt associated with each of steps together dm and dx define joint distribution over initial states and rewards in the mdp for the trial determined by dx and dm the initial state is selected by the policy based on the known values xk the cost xu xk suffered by 
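In symbols, the imputation objective and the bound being set up here can be written as follows (a reconstruction consistent with the surrounding prose; D_x and D_m denote the observation and mask distributions, and τ a trajectory of the policy):

```latex
\min_{p}\; \mathbb{E}_{x \sim \mathcal{D}_x,\, m \sim \mathcal{D}_m}\!\big[ -\log p(x^u \mid x^k) \big],
\quad\text{where}\quad
p(x^u \mid x^k) = \mathbb{E}_{\tau \sim p(\tau \mid x^k)}\!\big[\, p(x^u \mid \tau, x^k) \,\big],
\quad\text{and, by Jensen's inequality,}\quad
-\log p(x^u \mid x^k) \;\le\; \mathbb{E}_{\tau \sim p(\tau \mid x^k)}\!\big[ -\log p(x^u \mid \tau, x^k) \big].
```

The guide policy introduced next tightens this bound by adding a KL term between the guide and primary trajectory distributions.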
trajectory zt in the context is given by log xu xk the negative of guessing the missing values xu after following trajectory while seeing the known values xk qt we consider policy with trajectory distribution zt xk where xk is determined by for the current trial and can observe the missing values xu with these definitions we can find an approximately optimal imputation policy by solving minimize log xu xk the expected negative of making correct imputation on any given trial this is valid but loose upper bound on the imputation objective in eq from jensen inequality we can tighten the bound by introducing guide policy variational distribution as with the unconditional generative models in sec we train to imitate guide policy shaped by additional information here it xu this generates trajectories with distribution xk qt xk zt xu xk given this and guided policy search solves minimize log iq ip kl ip ip where we define iq xu ip xk and xu iq ip xu ip data splits from http the model in significantly improves its score to when using an architecture direct representation for sequential imputation policies we define an imputation trajectory as cτ ct where each partial imputation ct is computed from partial step trajectory zt partial imputation encodes the policy guess for the missing values xu immediately prior to selecting step zt and ct gives the policy final guess at each step of iterative refinement the policy selects zt based on and the known values xk and then updates its guesses to ct based on and zt by iteratively refining its guesses based on feedback from earlier guesses and the known values the policy can construct complexly structured distributions over its final guess ct after just few steps this happens naturally without any as in many approaches to structured prediction and without sampling values in ct one at time as required by existing models this property of our approach should prove useful for many tasks we consider two ways of updating the guesses in ct mirroring those described in sec the first way sets ct ωθ zt where ωθ zt is trainable function we set using trainable bias the second way sets ct ωθ zt we indicate models using the first type of update with the suffix and models using the second type of update with our primary policy pθ selects zt at each step using pθ zt xk which we restrict to be diagonal gaussian this is simple stationary policy together the step selector pθ zt xk and the imputation constructor ωθ zt fully determine the behaviour of the primary policy the supplementary material provides and an illustration for this model we construct guide policy similarly to the guide policy shares the imputation constructor ωθ zt with the primary policy the guide policy incorporates additional information xu xk the complete observation for which the primary policy must reconstruct some missing values the guide policy chooses steps using qφ zt which we restrict to be diagonal gaussian we train the policy components ωθ pθ and qφ simultaneously on the objective minimize log kl xk where xu xu we train our models using of and stochastic backpropagation as in full implementations and test code are available from http representing sequential imputation policies using lstms to make it useful for imputation which requires conditioning on the exogenous information xk we modify the model from sec to include read operation in its primary policy we incorporate read operation by spreading over two lstms pr and pw which respectively read and write an imputation trajectory cτ ct 
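A single imputation trial under the direct policy representation just described can be sketched in a few lines of numpy: sample a diagonal-Gaussian step from the current guess and the known values, then apply an additive guess update (a "-jump" variant would replace the guess instead). The toy linear maps, dimensions, and noise scale below are illustrative assumptions, not the trained networks.

```python
import numpy as np

rng = np.random.default_rng(1)
x_dim, z_dim, T = 64, 16, 6

x = rng.random(x_dim)                          # a complete observation (for the demo)
mask = rng.random(x_dim) < 0.5                 # True = known value, False = missing
x_k = np.where(mask, x, 0.0)                   # known values, zeros elsewhere

W_step = rng.normal(scale=0.1, size=(z_dim, 2 * x_dim))   # stands in for p_theta(z_t | c_{t-1}, x^k)
W_write = rng.normal(scale=0.1, size=(x_dim, z_dim))      # stands in for omega_theta(z_t)

c = np.zeros(x_dim)                            # c_0: initial guess (a trainable bias in the paper)
for t in range(T):
    mu = W_step @ np.concatenate([c, x_k])     # mean of the diagonal-Gaussian step
    z = mu + 0.1 * rng.normal(size=z_dim)      # z_t ~ p_theta(z_t | c_{t-1}, x^k)
    c = c + W_write @ z                        # additive guess update ("-add"); "-jump" would set c = W_write @ z

guess = np.where(mask, x, 1.0 / (1.0 + np.exp(-c)))   # keep known values, fill in the rest
print(np.round(guess[:6], 3))
```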
conveniently the guide policy for this model takes the same form as the primary policy reader pr this model also includes an infinite mixture initialization step as used in sec but modified to incorporate conditioning on and the supplementary material provides and an illustration for this model following the infinite mixture initialization step single full step of execution for involves several substeps first updates the reader state using srt fθr ωθr sw then selects step zt pθ zt then updates the writer state using st fθ zt and finally updates its guesses by setting ct ωθw vtw or ct ωθw hw hr vt in these updates st refer to the states of the reader and writer lstms the lstm updates fθ and the operations ωθr are governed by the policy parameters we train to imitate trajectories sampled from guide policy the guide policy shares the primary policy writer updates fθw and write operation ωθw but has its own reader updates fφq and read operation ωφq at each step the guide policy updates the guide state sqt fφq ωφq sw then selects zt qφ zt then updates the writer state sw and finally updates its guesses ct ωθw vtw or ct ωθw hw as in sec the guide policy read opt eration ωφq gets to see the complete observation while the primary policy only gets to see the known values xk we restrict the step distributions pθ to be diagonal gaussians whose means and are affine functions of vtr the training objective has the same form as eq imputation nll vs available information imputation nll vs available information the effect of increased refinement steps mask probability mask probability refinement steps figure comparing the performance of our imputation models against several baselines using mnist digits the indicates the of pixels which were dropped completely at random and the scores are normalized by the number of imputed pixels closer view of results from just for our models the effect of increased iterative refinement steps for our gpsi models experiments we tested the performance of our sequential imputation models on three datasets mnist svhn cropped and tfd we converted images to grayscale and them to be in the range prior to we measured the imputation log xu using the true missing values xu and the models guesses given by cut we report negative so lower scores are better in all of our tests we refer to variants of the model from sec as and and to variants of the model from sec as and except where noted the gpsi models used refinement steps and the lstm models used we tested imputation under two types of data masking missing completely at random mcar and missing at random mar in mcar we masked pixels uniformly at random from the source images and indicate removal of of the pixels by in mar we masked square regions with the occlusions located uniformly at random within the borders of the source image we indicate occlusion of square by on mnist we tested for corresponds to unconditional generation on tfd and svhn we tested on mnist we tested for on tfd we tested and on svhn we tested for test trials we sampled masks from the same distribution used in training and we sampled complete observations from test set fig and tab present quantitative results from these tests fig shows the behavior of our gpsi models when we allowed them refinement steps mnist tfd svhn table imputation performance in various settings details of the tests are provided in the main text lower scores are better due to time constraints we did not test on tfd or svhn these scores are normalized for the number of imputed pixels we 
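The two masking mechanisms used in these experiments can be reproduced with a short sketch (the image size and drop rate below are illustrative): MCAR drops each pixel independently, while MAR occludes a square patch placed uniformly at random inside the image.

```python
import numpy as np

def mcar_mask(h, w, drop_prob, rng):
    """True = known pixel, False = missing; each pixel dropped independently."""
    return rng.random((h, w)) >= drop_prob

def mar_mask(h, w, square, rng):
    """Occlude a square-by-square patch placed uniformly at random in the image."""
    mask = np.ones((h, w), dtype=bool)
    top = rng.integers(0, h - square + 1)
    left = rng.integers(0, w - square + 1)
    mask[top:top + square, left:left + square] = False
    return mask

rng = np.random.default_rng(0)
print(mcar_mask(28, 28, 0.8, rng).mean())   # ~0.2 of the pixels remain known
print(mar_mask(28, 28, 14, rng).sum())      # 28*28 - 14*14 = 588 known pixels
```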
tested our models against three baselines the baselines were variational imputation honest template matching and oracular template matching vae imputation ran multiple steps of vae reconstruction with the known values held fixed and the missing values with each reconstruction after refinement steps we scored the vae based on its best gpsi stands for guided policy search imputer the tag refers to additive guess updates and refers to updates that fully replace the guesses we discuss some deficiencies of vae imputation in the supplementary material figure this figure illustrates the policies learned by our models models trained for mnist from the models are models trained for tfd with models in the same order as but without lstmjump models trained for svhn with models arranged as for guesses honest template matching guessed the missing values based on the training image which best matched the test image known values oracular template matching was like honest template matching but matched directly on the missing values our models significantly outperformed the baselines in general the models outperformed the more direct gpsi models we evaluated the of imputations produced by our models using the lower bounds provided by the variational objectives with respect to which they were trained evaluating the imputations was straightforward for vae imputation we used the expected of the imputations sampled from multiple runs of the imputation process this provides valid but loose lower bound on their as shown in fig the imputations produced by our models appear promising the imputations are generally of high quality and the models are capable of capturing strongly reconstruction distributions see subfigure the behavior of gpsi models changed intriguingly when we swapped the imputation constructor using the imputation constructor the imputation policy learned by the direct model was rather inscrutable fig shows that additive guess updates extracted more value from using more refinement steps when trained on the binarized mnist benchmark discussed in sec with binarized images and subject to the lstmadd model produced scores of the model scored anecdotally on this task these models seemed more prone to overfitting than the models in sec the supplementary material provides further qualitative results discussion we presented point of view which links methods for training directed generative models with policy search in reinforcement learning we showed how our perspective can guide improvements to existing models the importance of these connections will only grow as generative models rapidly increase in structural complexity and effective decision depth we introduced the notion of imputation as natural generalization of standard unconditional generative modelling depending on the relation between the and the available information imputation spans from full unconditional generative modelling to we showed how to successfully train sequential imputation policies comprising millions of parameters using an approach based on guided policy search our approach outperforms the baselines quantitatively and appears qualitatively promising incorporating the local mechanisms from should provide further improvements references emily denton soumith chintala arthur szlam and robert fergus deep generative models using laplacian pyramid of adversarial networks alex graves generating sequences with recurrent neural networks karol gregor ivo danihelka alex graves and daan wierstra draw recurrent neural network for image 
generation. In International Conference on Machine Learning (ICML).
Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In International Conference on Machine Learning (ICML).
Diederik Kingma, Danilo Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems (NIPS).
Diederik Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR).
Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In International Conference on Machine Learning (ICML).
Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems (NIPS).
Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning (ICML).
Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems (NIPS).
Sergey Levine and Vladlen Koltun. Learning complex neural network policies with trajectory optimization. In International Conference on Machine Learning (ICML).
Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning (ICML).
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Ng. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Danilo Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning (ICML).
Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning (ICML).
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (ICML).
Joshua Susskind, Adam Anderson, and Geoffrey Hinton. The Toronto Face Database.
on elicitation complexity rafael frongillo university of colorado boulder ian kash microsoft research raf iankash abstract elicitation is the study of statistics or properties which are computable via empirical risk minimization while several recent papers have approached the general question of which properties are elicitable we suggest that this is the wrong properties are elicitable by first eliciting the entire distribution or data set and thus the important question is how elicitable specifically what is the minimum number of regression parameters needed to compute the property building on previous work we introduce new notion of elicitation complexity and lay the foundations for calculus of elicitation we establish several general results and techniques for proving upper and lower bounds on elicitation complexity these results provide tight bounds for eliciting the bayes risk of any loss large class of properties which includes spectral risk measures and several new properties of interest introduction empirical risk minimization erm is domininant framework for supervised machine learning and key component of many learning algorithms statistic or property is simply functional assigning vector of values to each distribution we say that such property is elicitable if for some loss function it can be represented as the unique minimizer of the expected loss under the distribution thus the study of which properties are elicitable can be viewed as the study of which statistics are computable via erm the study of property elicitation began in statistics and is gaining momentum in machine learning economics and most recently finance sequence of papers starting with savage has looked at the full characterization of losses which elicit the mean of distribution or more generally the expectation of random variable the case of properties is also now well in hand the general case is still generally open with recent progress in recently parallel thread of research has been underway in finance to understand which financial risk measures among several in use or proposed to help regulate the risks of financial institutions are computable via regression elicitable cf references above more often than not these papers have concluded that most risk measures under consideration are not elicitable notable exceptions being generalized quantiles expectiles and expected utility throughout the growing momentum of the study of elicitation one question has been central which properties are elicitable it is clear however that all properties are indirectly elicitable if one first elicits the distribution using standard proper scoring rule therefore in the present work we suggest replacing this question with more nuanced one how elicitable are various properties specifically heeding the suggestion of gneiting we adapt to our setting the notion of elicitation complexity introduced by lambert et al which captures how many parameters one needs to maintain in an erm procedure for the property in question indeed if property is found not to be elicitable such as the variance one should not abandon it but rather ask how many parameters are required to compute it via erm our work is heavily inspired by the recent progress along these lines of fissler and ziegel who show that spectral risk measures of support have elicitation complexity at most spectral risk measures are among those under consideration in the finance community and this result shows that while not elicitable in the classical sense their elicitation complexity 
is still low and hence one can develop reasonable regression procedures for them our results extend to these and many other risk measures see often providing matching lower bounds on the complexity as well our contributions are the following we first introduce an adapted definition of elicitation complexity which we believe to be the right notion to focus on going forward we establish few simple but useful results which allow for kind of calculus of elicitation for example conditions under which the complexity of eliciting two properties in tandem is the sum of their individual complexities in we derive several techniques for proving both upper and lower bounds on elicitation complexity which apply primarily to the bayes risks from decision theory or optimal expected loss functions the class includes spectral risk measures among several others see we conclude with brief remarks and open questions preliminaries and foundation let be set of outcomes and be convex set of probability measures the goal of elicitation is to learn something about the distribution specifically some function such as the mean or variance by minimizing loss function definition property is function rk for some which associates desired report value to each we let γr denote the set of distributions corresponding to report value given property we want to ensure that the best result is to reveal the value of the property using loss function that evaluates the report using sample from the distribution definition loss function rk elicits property rk if for all arginf where ep property is elicitable if some loss elicits it for example when the mean ep is elicitable via squared loss necessary condition for elicitability is convexity of the level sets of proposition osband if is elicitable the level sets γr are convex for all one can easily check that the mean ep has convex level sets yet the variance ep ep does not and hence is not elicitable it is often useful to work with stronger condition that not only is γr convex but it is the intersection of linear subspace with this condition is equivalent the existence of an identification function functional describing the level sets of definition function rk is an identification function for rk or identifies if for all it holds that γr rk where as with above we write ep is identifiable if there exists identifying it one can check for example that identifies the mean we can now define the classes of identifiable and elicitable properties along with the complexity of identifying or eliciting given property naturally property is if it is the link of identifiable property and if it is the link of elicitable property the elicitation complexity of property is then simply the minimum dimension needed for it to be definition let ik denote the class of all identifiable properties and ek denote the class of all elicitable properties we write ik and ek definition property is if there exists ik and such that the identification complexity of is defined as iden min is we will also consider rn definition property is if there exists ek and such that the elicitation complexity of is defined as elic min is to make the above definitions concrete recall that the variance ep ep is not elicitable as its level sets are not convex necessary condition by prop note however that we may write ep which can be obtained from the property ep ep it is that is both elicitable and identifiable as the expectation of random variable using for example kr and thus we can recover as link of the elicitable and 
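A quick numeric check of these definitions (purely illustrative, not from the paper): minimizing the expected squared loss over reports recovers the mean, and the identification function V(r, x) = r − x has expectation zero exactly at that report.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)            # samples from a distribution P

reports = np.linspace(0.0, 10.0, 201)                  # candidate reports r
expected_loss = ((reports[:, None] - x[None, :]) ** 2).mean(axis=1)
r_star = reports[expected_loss.argmin()]

print(r_star, x.mean())                                # both close to E_P[X] = 2.0
print((r_star - x).mean())                             # E_P[V(r*, X)] is close to 0
```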
identifiable and as no such exists we have iden elic in this example the variance has stronger property than merely being and namely that there is single that satisfies both of these simultaneously in fact this is quite common and identifiability provides geometric structure that we make use of in our lower bounds thus most of our results use this refined notion of elicitation complexity definition property has identifiable elicitation complexity elici min such that ek ik and note that restricting our attention to elici effectively requires elici iden specifically if is derived from some elicitable then must be identifiable as well this restriction is only relevant for our lower bounds as our upper bounds give losses note however that some restriction on ek is necessary as otherwise pathological constructions giving injective mappings from to rk would render all properties to alleviate this issue some authors require continuity while others like we do require identifiability which can be motivated by the fact that for any differentiable loss for will identify provided ep has no inflection points or local minima an important future direction is to relax this identifiability assumption as there are very natural properties with iden our definition of elicitation complexity differs from the notion proposed by lambert et al in that the components of above do not need to be individually elicitable this turns out to have large impact as under their definition the property for finite has elicitation complexity whereas under our definition elici see example fissler and ziegel propose closer but still different definition with the complexity being the smallest such that is component of elicitable property again this definition can lead to larger complexities than necessary take for example the squared mean ep when which has elici with ep and but is not elicitable and thus has complexity under we believe that modulo regularity assumptions on ek our definition is better suited to studying the difficulty of eliciting properties viewing as potentially dimensionreducing link function our definition captures the minimum number of parameters needed in an erm computation of the property in question followed by simple application of foundations of elicitation complexity in the remainder of this section we make some simple but useful observations about iden and elici we have already discussed one such observation after definition elici iden it is natural to start with some trivial upper bounds clearly whenever can be uniquely determined by some number of elicitable parameters then the elicitation complexity of every property is at most that number the following propositions give two notable applications of this proposition when every property has elici proof the probability distribution is determined by the probability of any outcomes and the probability associated with given outcome is both elicitable and identifiable our main lower bound thm merely requires to have convex level sets which is necessary by prop one may take for example argmaxi ai for finite measurable partition an of note that these restrictions on may easily be placed on instead finite is equivalent to having support on finite subset of or even being piecewise constant on some disjoint events proposition when every property has elici countable one class of properties are those where is linear the expectation of some vectorvalued random variable all such properties are elicitable and identifiable cf with elici but of course the complexity can 
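The two-parameter route to the variance can also be checked numerically: elicit the first two moments with a pair of squared losses, then apply the link g(r1, r2) = r2 − r1². The snippet below is an illustration under an arbitrary sample distribution, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=3.0, scale=1.5, size=50_000)   # samples from P (variance 6.75)

# ERM over the two-parameter report (r1, r2): the coordinates separate, so the
# minimizers are the empirical first and second moments.
r1 = x.mean()              # argmin_r E[(r - X)^2]
r2 = (x ** 2).mean()       # argmin_r E[(r - X^2)^2]

print(r2 - r1 ** 2, x.var())   # link applied to the elicited pair vs the direct variance
```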
be lower if the range of is not lemma let rk be and ep then elici dim affhull the dimension of the affine hull of the range of it is easy to create redundant properties in various ways for example given elicitable properties and the property clearly contains redundant information concrete case is mean squared variance moment which as we have seen has elici the following definitions and lemma capture various aspects of lack of such redundancy definition property rk in is of full rank if iden note that there are two ways for property to fail to be full rank first as the examples above suggest can be redundant so that it is link of identifiable property full rank can also be violated if more dimensions are needed to identify the property than to specify it this is the case with the variance which is dimensional property but has iden definition properties are independent if iden iden iden lemma if are full rank and independent then elici elici to illustrate the lemma elici variance yet mean variance has elici so clearly the mean and variance are not both independent and full rank as we have seen variance is not full rank however the mean and second moment satisfy both by lemma another important case is when consists of some number of distinct quantiles osband essentially showed that quantiles are independent and of full rank so their elicitation complexity is the number of quantiles being elicited lemma let and be class of probability measures with continuously differentiable and invertible cdfs which is sufficiently rich in the sense that for all xk span xk rk let qα denote the function then if αk are all distinct qαk has elici the quantile example in particular allows us to see that all complexity classes including are occupied in fact our results to follow will show something stronger even for properties all classes are occupied we give here the result that follows from our bounds on spectral risk measures in example but this holds for many other see example proposition let as in lemma then for all there exists with elici eliciting the bayes risk in this section we prove two theorems that provide our main tools for proving upper and lower bounds respectively on elicitation complexity of course many properties are known to be elicitable and the losses that elicit them provide such an upper bound for that case we provide such construction for properties that can be expressed as the pointwise minimum of an indexed set of functions interestingly our construction does not elicit the minimum directly but as joint elicitation of the value and the function that realizes this value the form is that of scoring rule for the linear property ep xa except that here the index itself is also theorem let xa be set of functions indexed by rk then if inf ep xa is attained the property mina ep xa is in particular xa elicits ep xa for any strictly decreasing with dr here and throughout when rk we assume the borel omitted proofs can be found in the appendix of the full version of this paper as we focus on elicitation complexity we have not tried to characterize all ways to elicit this joint property or other properties we give explicit losses for see for an example where additional losses are possible proof we will work with gains instead of losses and show that dgr xa elicits ep xa for maxa ep xa here is convex with strictly increasing and positive subgradient dg for any fixed we have by the subgradient inequality dgr ep xa ep xa ep xa and as dg is strictly increasing is strictly convex so ep xa is the unique 
maximizer now letting ep xa we have argmax argmax ep xa argmax ep xa because is strictly increasing we now have argmax ep xa argmax ep xa one natural way to get such an indexed set of functions is to take an arbitrary loss function in which case this pointwise minimum corresponds to the bayes risk which is simply the minimum possible expected loss under some distribution definition given loss function on some prediction set the bayes risk of is defined as inf one illustration of the power of theorem is that the bayes risk of loss eliciting property is itself corollary if rk is loss function eliciting rk then the loss elicits where is any positive strictly decreasing function and is any surrogate loss eliciting if ik elici rr dx we now turn to our second theorem which provides lower bounds for the elicitation complexity of the bayes risk first observation which follows from standard convex analysis is that is concave and thus it is unlikely to be elicitable directly as the level sets of are likely to be nonconvex to show lower bound greater than however we will need much stronger techniques in particular while must be concave it may not be strictly so thus enabling level sets which are potentially amenable to elicitation in fact must be flat between any two distributions which share minimizer crucial to our lower bound is the fact that whenever the minimizer of differs between two distributions is essentially strictly concave between them lemma suppose loss with bayes risk elicits rk then for any with we have λp λl for all with this lemma in hand we can prove our lower bound the crucial insight is that an identification function for the bayes risk of loss eliciting property can through link be used to identify that property corollary tells us that parameters suffice for the bayes risk of property and our lower bound shows this is often necessary only parameters suffice however when the property value itself provides all the information required to compute the bayes risk for example dropping the term from squared loss gives and giving elic thus the theorem splits the lower bound into two cases theorem if loss elicits some ek with elicitation complexity elici then its bayes risk has elici moreover if we can write for some function rk then we have elici otherwise elici proof let such that for some note that one could easily lift the requirement that be function and allow to be the set of minimizers of the loss cf we will use this additional power in example we show by contradiction that for all implies otherwise we have with and thus but lemma would then give us some pλ λp with pλ but as the level sets are convex by prop we would have pλ which would imply pλ we now can conclude that there exists rk such that but as this implies elici so clearly we need finally if we have the upper bounds follow from corollary examples and applications we now give several applications of our results several upper bounds are novel as well as all lower bounds greater than in the examples unless we refer to explicitly we will assume and write so that in each setting we also make several standard regularity assumptions which we suppress for ease of exposition for example for the variance and variantile we assume finite first and second moments which must span and whenever we discuss quantiles we will assume that is as in lemma though we will not require as much regularity for our upper bounds variance in section we showed that elici as warm up let us see how to recover this statement using our results on the bayes 
risk we can view as the bayes risk of squared loss which of course elicits the mean ep ep ep this gives us elici by corollary with matching lower bound by theorem as the variance is not simply function of the mean corollary gives losses such as which elict ep but in fact there are losses which can not be represented by the form showing that we do not have full characterization for example this was generated via squared loss with respect to the norm which elicits the first two moments and link function convex functions of means another simple example is ep for some strictly convex function rk and rk to avoid degeneracies we assume dim affhull ep is full rank letting dgp be selection of subgradients of the loss dgr elicits ep cf and moreover we have by lemma elici one easily checks that so now by theorem elici as well letting xk be family of such full rank random variables this gives us sequence of properties γk kep with elici γk proving proposition modal mass with consider the property γβ namely the maximum probability mass contained in an interval of width theorem easily shows elici γβ as is elicited by and γβ similarly in the case of finite is simply the expected score gain rather than loss of the mode which is elicitable for finite but not otherwise see heinrich in both cases one can easily check that the level sets of are not convex so elici alternatively theorem applies in the first case as mentioned following definition the result for finite differs from the definitions of lambert et al where the elicitation complexity of is expected shortfall and other spectral risk measures one important application of our results on the elicitation complexity of the bayes risk is the elicitability of various financial risk measures one of the most popular financial risk measures is expected shortfall esα also called conditional value at risk cvar or average value at risk avar which we define as follows cf eq eq esα inf ep inf ep despite the importance of elicitability to financial regulation esα is not elicitable it was recently shown by fissler and ziegel however that elici esα they also consider the broader class of spectral risk measures which can be represented as ρµ esα dµ where is probability measure on cf eq in the case where has finite support pk βi δαi for point distributions βi we can rewrite ρµ using the above as βi ρµ βi esαi inf ep zi αi they conclude elici ρµ unless in which case elici ρµ we show how to recover these results together with matching lower bounds it is that the infimum in eq is attained by any of the quantiles in qαk so we conclude elici ρµ by theorem and in particular the property ρµ qαk is elicitable the family of losses from corollary coincide with the characterization of fissler and ziegel see for lower bound as elici qαk whenever the αi are distinct by lemma theorem gives us elici ρµ whenever and of course elici ρµ if variantile the type of generalized quantile introduced by newey and powell is defined as the solution µτ to the equation ep this also shows µτ here we propose the an asymmetric measure with respect to the just as the mean is the solution to the equation ep and the variance is ep we define the by ep µτ it is that µτ can be expressed as the minimizer of asymmetric least squares problem the loss elicits µτ hence just as the variance turned out to be bayes risk for the mean so is the for the µτ argmin ep min ep we now see the pair µτ is elicitable by corollary and by theorem we have elici deviation and risk measures rockafellar and uryasev introduce risk 
quadrangles in which they relate risk deviation error and statistic all functions from random variables to the reals as follows min min argmin our results provide tight bounds for many of the risk and deviation measures in their paper the most immediate case is the expectation quadrangle case where for some in this case if theorem implies elici elici provided is nonconstant and this includes several of their examples truncated mean and beyond the expectation case the authors show mixing theorem where they consider min min λi ei bi λi bi min λi ei bi bk once again if the ei are all of expectation type and si theorem gives elici elici with matching lower bound from theorem provided the si are all independent the reverting theorem for pair can be seen as special case of the above where one replaces by consequently we have tight bounds for the elicitation complexity of several other examples including superquantiles the same as spectral risk measures the quadrangle and optimized certainty equivalents of and teboulle our results offer an explaination for the existence of regression procedures for some of these measures for example proceedure called superquantile regression was introduced in rockafellar et al which computes spectral risk measures in light of theorem one could interpret their procedure as simply performing regression on the different quantiles as well as the bayes risk in fact our results show that any generated by mixing several expectation quadrangles will have similar procedure in which the variables are simply computed along side the measure of interest even more broadly such regression procedures exist for any bayes risk discussion we have outlined theory of elicitation complexity which we believe is the right notion of complexity for erm and provided techniques and results for upper and lower bounds in particular we now have tight bounds for the large class of bayes risks including several applications of note such as spectral risk measures our results also offer an explanation for why procedures like superquantile regression are possible and extend this logic to all bayes risks there many natural open problems in elicitation complexity perhaps the most apparent are the characterizations of the complexity classes elic and in particular determining the elicitation complexity of properties which are known to be such as the mode and smallest confidence interval in this paper we have focused on elicitation complexity with respect to the class of identifiable properties which we denoted elici this choice of notation was deliberate one may define elicc min ek to be the complexity with respect to some arbitrary class of properties some examples of interest might be elice for expected values of interest to the prediction market literature and eliccvx for properties elicitable by loss which is convex in of interest for efficiently performing erm another interesting line of questioning follows from the notion of conditional elicitation properties which are elicitable as long as the value of some other elicitable property is known this notion was introduced by emmer et al who showed that the variance and expected shortfall are both conditionally elicitable on ep and qα respectively intuitively knowing that is elicitable conditional on an elicitable would suggest that perhaps the pair is elicitable fissler and ziegel note that it is an open question whether this joint elicitability holds in general the bayes risk for is elicitable conditioned on and as we saw above the pair is jointly 
elicitable as well we give in figure however which also illustrates the subtlety of characterizing all elicitable properties figure depictions of the level sets of two properties one elicitable and the other not the left is bayes risk together with its property and thus elicitable while the right is shown in not to be elicitable here the planes are shown to illustrate the fact that these are both conditionally elicitable the height of the plane the intersept for example is elicitable from the characterizations for scalar properties and conditioned on the plane the properties are both linear and thus links of expected values which are also elicitable references ingo steinwart chlo pasin robert williamson and siyu zhang elicitation and identification of properties in proceedings of the conference on learning theory pages agarwal and agrawal on consistent surrogate risk minimization and property elicitation in colt rafael frongillo and ian kash property elicitation in proceedings of the conference on learning theory pages savage elicitation of personal probabilities and expectations journal of the american statistical association pages kent harold osband providing incentives for better cost forecasting university of california berkeley gneiting and raftery strictly proper scoring rules prediction and estimation journal of the american statistical association gneiting making and evaluating point forecasts journal of the american statistical association abernethy and frongillo characterization of scoring rules for linear properties in proceedings of the conference on learning theory pages lambert elicitation and evaluation of statistical forecasts preprint lambert and shoham eliciting truthful answers to questions in proceedings of the acm conference on electronic commerce pages susanne emmer marie kratz and dirk tasche what is the best risk measure in practice comparison of standard measures december arxiv fabio bellini and valeria bignozzi elicitable risk measures this is preprint of an article accepted for publication in quantitative finance doi johanna ziegel coherence and elicitability mathematical finance arxiv ruodu wang and johanna ziegel elicitable distortion risk measures concise proof statistics probability letters may tobias fissler and johanna ziegel higher order elicitability and osband principle math stat march arxiv banerjee guo and wang on the optimality of conditional expectation as bregman predictor ieee transactions on information theory july lambert pennock and shoham eliciting properties of probability distributions in proceedings of the acm conference on electronic commerce pages rafael frongillo and ian kash general truthfulness characterizations via convex analysis in web and internet economics pages springer heinrich the mode functional is not elicitable biometrika page hans fllmer and stefan weber the axiomatic approach to risk measures for capital determination annual review of financial economics tyrrell rockafellar and stan uryasev the fundamental risk quadrangle in risk management optimization and statistical estimation surveys in operations research and management science tobias fissler johanna ziegel and tilmann gneiting expected shortfall is jointly elicitable with value at risk implications for backtesting july arxiv whitney newey and james powell asymmetric least squares estimation and testing econometrica journal of the econometric society pages aharon and marc teboulle an concept of convex risk measures the optimized certainty equivalent mathematical 
Finance. Rockafellar, Royset, and Miranda. Superquantile regression with applications to buffered reliability, uncertainty quantification, and conditional value-at-risk. European Journal of Operational Research.
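To make the quantile lemma above concrete: the k = 1 case is the familiar fact that a single quantile is elicited by the pinball (quantile) loss, so a single report suffices. The following minimal sketch (an illustration only, not code from the paper) checks numerically that empirical risk minimization of the pinball loss recovers the sample quantile.

```python
import numpy as np

def pinball_loss(r, y, alpha):
    # Pinball (quantile) loss: its expectation under P is minimized at the
    # alpha-quantile of P, so it elicits q_alpha (elicitation complexity 1).
    return np.where(y >= r, alpha * (y - r), (1.0 - alpha) * (r - y))

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # sample from P

alpha = 0.9
grid = np.linspace(0.1, 10.0, 1000)                    # candidate reports r
risk = np.array([pinball_loss(r, y, alpha).mean() for r in grid])

print("empirical risk minimizer:", grid[np.argmin(risk)])
print("sample 0.9-quantile:     ", np.quantile(y, alpha))
```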
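The warm-up example above views the variance as the Bayes risk of squared loss, which elicits the mean; the pair (mean, variance) is then jointly elicitable by the construction of the elicitation theorem, giving complexity 2. The sketch below (my own illustration under the obvious sampling assumptions, not the paper's code) checks numerically that the minimizer of the expected squared loss is the mean and that the attained minimum is the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=100_000)    # samples from P: mean 2, variance 4

grid = np.linspace(0.5, 4.0, 701)               # candidate reports for the mean
risk = np.array([((r - y) ** 2).mean() for r in grid])   # expected squared loss

r_star, bayes_risk = grid[np.argmin(risk)], risk.min()
print("minimizer of squared loss (mean):", r_star, "vs sample mean", y.mean())
print("attained minimum (variance):     ", bayes_risk, "vs sample variance", y.var())
```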
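The spectral-risk application treats expected shortfall as a Bayes risk: ES_alpha is the pointwise minimum, over reports z, of an expected loss whose minimizer is the alpha-quantile, which is exactly the structure the joint-elicitation theorem exploits (ES_alpha is not elicitable on its own, but the pair (q_alpha, ES_alpha) is). Sign and level conventions vary across papers; the sketch below uses the standard Rockafellar–Uryasev form for a loss variable Y, ES_alpha(Y) = min_z { z + E[(Y − z)_+]/(1 − alpha) }, which may differ from the convention used in the text and is offered only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.standard_t(df=4, size=100_000)          # heavy-tailed "loss" samples from P
alpha = 0.95

def ru_objective(z, y, alpha):
    # Rockafellar-Uryasev objective: its minimum over z equals ES_alpha,
    # and the minimizer z* is the alpha-quantile (value-at-risk).
    return z + np.maximum(y - z, 0.0).mean() / (1.0 - alpha)

grid = np.linspace(0.0, 6.0, 1201)
vals = np.array([ru_objective(z, y, alpha) for z in grid])
z_star, es_as_bayes_risk = grid[np.argmin(vals)], vals.min()

q = np.quantile(y, alpha)                       # direct estimates, for comparison
print("argmin z* (VaR):", z_star, "vs sample quantile", q)
print("min value (ES): ", es_as_bayes_risk, "vs tail mean", y[y >= q].mean())
```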
decomposition bounds for marginal map wei qiang alexander computer science uc irvine computer science dartmouth college wping ihler qliu abstract marginal map inference involves making map predictions in systems defined with latent variables or missing information it is significantly more difficult than pure marginalization and map tasks for which large class of efficient and convergent variational algorithms such as dual decomposition exist in this work we generalize dual decomposition to generic power sum inference task which includes marginal map along with pure marginalization and map as special cases our method is based on block coordinate descent algorithm on new convex decomposition bound that is guaranteed to converge monotonically and can be parallelized efficiently we demonstrate our approach on marginal map queries defined on problems from the uai approximate inference challenge showing that our framework is faster and more reliable than previous methods introduction probabilistic graphical models such as bayesian networks and markov random fields provide useful framework and powerful tools for machine learning given graphical model inference refers to answering probabilistic queries about the model there are three common types of inference tasks the first are or maximum posteriori map tasks which aim to find the most probable state of the joint probability exact and approximate map inference is widely used in structured prediction tasks include calculating marginal probabilities and the normalization constant of the distribution and play central role in many learning tasks maximum likelihood finally marginal map tasks are mixed inference problems which generalize the first two types by marginalizing subset of variables hidden variables before optimizing over the these tasks arise in latent variable models and many problems all three inference types are generally intractable as result approximate inference particularly convex relaxations or upper bounding methods are of great interest decomposition methods provide useful and computationally efficient class of bounds on inference problems for example dual decomposition methods for map give class of upper bounds which can be directly optimized using coordinate descent subgradient updates or other methods it is easy to ensure both convergence and that the objective is monotonically decreasing so that more computation always provides better bound the resulting bounds can be used either as approximation methods or as component of search in summation problems notable decomposition bound is bp trw which bounds the partition function with combination of trees these bounds are useful in joint inference and learning or inferning frameworks allowing learning with approximate inference to be framed as joint optimization over the model parameters and decomposition bound often leading to more efficient learning however far fewer methods have been developed for marginal map problems in some literature marginal map is simply called map and the joint map task is called mpe in this work we deveop decomposition bound that has number of desirable properties generality our bound is sufficiently general to be applied easily to marginal map it yields bound at any point during the optimization not just at convergence so it can be used in an anytime way monotonic and convergent more computational effort gives strictly tighter bounds note that and are particularly important for approximations which are expensive to represent and update allows 
optimization over all parameters including the weights or fractional counting numbers of the approximation these parameters often have significant effect on the tightness of the resulting bound compact representation within given class of bounds using fewer parameters to express the bound reduces memory and typically speeds up optimization we organize the rest of the paper as follows section gives some background and notation followed by connections to related work in section we derive our decomposed bound in section and present block coordinate descent algorithm for monotonically tightening it in section we report experimental results in section and conclude the paper in section background here we review some background on graphical models and inference tasks markov random field mrf on discrete random variables xn is probability distribution hx hx exp θα xα log exp θα xα where is set of subsets of the variables each associated with factor θα and is the log partition function we associate an undirected graph with by mapping each xi to node and adding an edge ij iff there exists such that we say node and are neighbors if ij then is the subset of cliques fully connected subgraphs of the use and evaluation of given mrf often involves different types of inference tasks marginalization or tasks perform sum over the configurations to calculate the log partition function in marginal probabilities or the probability of some observed evidence on the other hand the maximum posteriori map or tasks perform joint maximization to find configurations with the highest probability that is maxx θα xα generalization of and inference is marginal map or in which we are interested in first marginalizing subset of variables hidden variables and then maximizing the remaining variables whose values are of direct interest that is hx φab max xb max log exp θα xα xb xb xa where all the variables and obviously both and inference are special cases of marginal map when and respectively it will be useful to define an even more general inference task based on power sum operator τi xi xi xi xi where xi is any function and τi is temperature or weight parameter the power sum reduces to standard sum when τi and approaches maxx when τi so that we define the power sum with τi to equal the max operator the power sum is helpful for unifying and inference as well as marginal map specifically we can apply power sums with different weights τi to each variable xi along predefined elimination order xn to define the weighted log partition function τn exp log exp φτ log xn where we note that the value of depends on the elimination order unless all the weights are equal obviously includes marginal map as special case by setting weights τa and τb this representation provides useful tool for understanding and deriving new algorithms for general inference tasks especially marginal map for which relatively few efficient algorithms exist related work variational upper bounds on map and the partition function along with algorithms for providing fast convergent optimization have been widely studied in the last decade in map dual decomposition and linear programming methods have become dominating approach with numerous optimization techniques and methods to tighten the approximations for summation problems most upper bounds are derived from the trw family of convex bounds or more generally conditional entropy decompositions trw bounds can be framed as optimizing over convex combination of models or in dual representation as trw belief 
propagation algorithm this illustrates basic tension in the resulting bounds in its primal form combination of trees trw is inefficient it maintains weight and parameters for each tree and large number of trees may be required to obtain tight bound this uses memory and makes optimization slow on the other hand the dual or free energy form uses only parameters the trw messages to optimize over the set of all possible spanning trees but the resulting optimization is only guaranteed to be bound at convergence making it difficult to use in an anytime fashion similarly the gradient of the weights is only correct at convergence making it difficult to optimize over these parameters most implementations simply adopt fixed weights thus most algorithms do not satisfy all the desirable properties listed in the introduction for example many works have developed convergent algorithms for convex free energies however by optimizing the dual they do not provide bound until convergence and the representation and constraints on the counting numbers do not facilitate optimizing the bound over these parameters to optimize counting numbers adopt more restrictive free energy form requiring positive counting numbers on the entropies but this can not represent marginal map whose free energy involves conditional entropies equivalent to the difference between two entropy terms on the other hand working in the primal domain ensures bound but usually at the cost of enumerating large number of trees heuristically select small number of trees to avoid being too inefficient while focus on trying to speed up the updates on given collection of trees another primal bound is weighted wmb which can represent large collection of trees compactly and is easily applied to marginal map using the weighted log partition function viewpoint however existing optimization algorithms for wmb are and often fail to converge especially on marginal map tasks while our focus is on variational bounds there are many approaches for marginal map as well provide upper bounds on marginal map by reordering the order in which variables are eliminated and using exact inference in the reordered however this is exponential in the size of the unconstrained treewidth and can easily become intractable give an approximation closely related to to bound the marginal map however unlike weighted these bounds can not be improved iteratively the same is true for the algorithm of which also has strong dependence on treewidth other examples of marginal map algorithms include local search and markov chain monte carlo methods fully decomposed upper bound in this section we develop new general form of upper bound and provide an efficient monotonically convergent optimization algorithm our new bound is based on fully decomposing the graph into disconnected cliques allowing very efficient local computation but can still be as tight as wmb or the trw bound with large collection of spanning trees once the weights and shifting variables are chosen or optimized properly our bound reduces to dual decomposition for map inference but is applicable to more general settings our main result is based on the following generalization of the classical inequality despite the term dual decomposition used in map tasks in this work we refer to decomposition bounds as primal bounds since they can be viewed as directly bounding the result of variable elimination this is in contrast to for example the linear programming relaxation of map which bounds the result only after optimization see 
an example on ising model in supplement theorem for given graphical model in with cliques and set of nonnegative weights τi we definepa set of split weights wα wiα on each pair that satisfies wiα τi then we have wα yx exp θα xα exp θα xα xα where the side is the along order xn as defined in and the side is the product of the on subvector xα with weights wα along pwα pwkα the same elimination order that is xα exp θα xα exp θα xα where xk xα xkc should be ranked with increasing index consisting with the elimination order xn as used in the side proof details can be found in section of the supplement key advantage of the bound is that it decomposes the joint power sum on into product of independent power sums over smaller cliques xα which significantly reduces computational complexity and enables parallel computation including variables in order to increase the flexibility of the upper bound we introduce set of or reparameterization variables δiα xi on each pair which can be optimized to provide much tighter upper bound note that φτ can be rewritten as φτ log exp hx δiα xi θα xα δiα xi where ni is the set of cliques incident to applying inequality we have that φτ log wi xi exp wα def exp θα xα δi xi δiα xi log xα where the nodes are also treated as cliques within inequality and new weight wi is introduced on each variable the new weights wi wiα should satisfy wi wiα τi wi wiα the bound is convex the variables and weights enabling an efficient optimization algorithm that we present in section as we will discuss in section these shifting variables correspond to lagrange multipliers that enforce moment matching condition dual form and connection with existing bounds it is straightforward to see that our bound in reduces to dual decomposition when applied on map inference with all τi and hence wi wiα on the other hand its connection with bounds such as wmb and trw is seen more clearly via dual representation of theorem the tightest upper bound obtainable by that is xx min min min max hθ bi wi xi bi wiα xi bα where bi xi bα xα is set of or beliefs defined on the singleton variables and thep cliques and ispthe corresponding local consistency polytope defined by bi xi bα xα xi bi xi here are their corresponding marginal or conditional entropies and pai is the set of variables in that rank later than that is for the global elimination order xn paα the proof details can be found in section of the supplement it is useful to compare theorem with other dual representations as the sum of weighted conditional entropies the bound is clearly convex and within the general class of conditional entropy decompositions ced but unlike generic ced it has simple and efficient primal form comparing the primal form derived in geometric program is computationally infeasible grid wmb covering tree full decomposition trw figure illustrating wmb trw and our bound on grid wmb uses covering tree with minimal number of splits and our decomposition further splits the graph into small cliques here edges introducing additional variables but allowing for easier monotonic optimization primal trw splits the graph into many spanning trees requiring even more variables note that all three bounds attain the same tightness after optimization to the dual form of wmb in theorem of our bound is as tight as wmb and hence the class of trw ced bounds attainable by wmb most forms are expressed in terms of joint entropies hθ bi cβ bβ rather than conditional entropies while the two can be converted the resulting counting numbers cβ will be 
differences of weights wiα which obfuscates its convexity makes it harder to maintain the relative constraints on the counting numbers during optimization and makes some counting numbers negative rendering some methods inapplicable finally like most variational bounds in dual form the rhs of has inner maximization and hence guaranteed to bound φτ only at its optimum in contrast our eq is primal bound hence bound for any it is similar to the primal form of trw except that the individual regions are single cliques rather than spanning trees of the graph and the fraction weights wα associated with each region are vectors rather than single scalar the representation efficiency can be seen with an example in figure which shows grid model and three relaxations that achieve the same bound assuming states per variable and ignoring the equality constraints our decomposition in figure uses parameters and weights wmb figure is slightly more efficient with only parameters for and and weights but its lack of decomposition makes parallel and monotonic updates difficult on the other hand the equivalent primal trw uses spanning trees shown in figure for parameters and weights the increased dimensionality of the optimization slows convergence and updates are requiring full sweeps on the involved trees although this cost can be amortized in some cases monotonically tightening the bound in this section we propose block coordinate descent algorithm algorithm to minimize the upper bound in the shifting variables and weights our algorithm has monotonic convergence property and allows efficient distributable local computation due to the full decomposition of our bound our framework allows generic inference including or as special cases by setting different weights moment matching and entropy matching we start with deriving the gradient of and we show that the equation has simple form of moment matching that enforces consistency between the singleton beliefs with their related clique beliefs and that of weights enforces consistency of marginal and conditional entropies theorem for in its δiα xi is µi xi µα xα xi see more details of this connection in section of the supplement while subgraphs can be used in the primal trw form doing so leads to loose bounds in contrast our decomposition terms consist of individual cliques algorithm generalized gdd input weights τi elimination order output the optimal giving tightest upper bound for φτ in initialize and weights wi wiα repeat for node in parallel with node do if τi then update ni δiα ni with the update else if τi then update ni and wni with gradient descent and combined with line search end if end for until convergence and evaluate by remark gdd solves and by setting different values of weights τi where µi xi exp δiα xi can be interpreted as singleton belief on xi and µα xα can be viewed qc as clique belief on xα defined with chain rule assuming xαα xc µα xα µα xi µα xi xi where zi is the partial up to on the clique that is wiα exp θα xα δiα xi xα exp θα xα δiα xi zi xi where the summation order should be consistent with the global elimination order xn the gradients of the weights wi wiα are marginal and conditional entropies defined on the beliefs µi µα respectively xi µα µα xα log µα xi xi µi therefore the optimal weights should satisfy the following kkt condition wi xi µi wiα xi µα where wi xi µi wi xi µα is the weighted average entropy on node the proof details can be found in section of the supplement the matching condition enforces that µi µα belong to the 
local consistency polytope as defined in theorem similar moment matching results appear commonly in variational inference algorithms also derive gradient of the weights but it is based on the free energy form and is correct only after optimization our form holds at any point enabling efficient joint optimization of and block coordinate descent we derive block coordinate descent method in algorithm to minimize our bound in which we sweep through all the nodes and update each block ni δiα xi ni and wni wi wiα ni with the neighborhood parameters fixed our algorithm applies two update types depending on whether the variables have zero weight for nodes with τi corresponding to max nodes in marginal map we derive coordinate descent rule for the associated shifting variables ni these nodes do not require to optimize wni since it is fixed to be zero for nodes with τi sum nodes in marginal map we lack closed form update for ni and wni and optimize by local gradient descent combined with line search the lack of closed form coordinate update for nodes τi is mainly because the order of power sums with different weights can not be exchanged however the gradient descent inner loop is still efficient because each gradient evaluation only involves the local variables in clique update for any node with τi max nodes in marginal map and its associated ni δiα xi ni the following update gives closed form solution for the zero gradient equation in keeping the other δjα ni fixed δiα xi γiα xi γi xi where is the number of neighborhood cliques and γiα xi log exp θα xα δj xj note that the update in works regardless of the weights of nodes τj ni in the neighborhood cliques when all the neighboring nodes also have zero weight τj for ni it is analogous to the star update of dual decomposition for map the detailed derivation is shown in proposition in the supplement the update in can be calculated with cost of only where is the number of states of xi and is the clique size by computing and saving all the shared γiα xi before updating ni furthermore the updates of ni for different nodes are independent if they are not directly connected by some clique this makes it easy to parallelize the coordinate descent process by partitioning the graph into independent sets and parallelizing the updates within each set local gradient descent for nodes with τi or in marginal map there is no closedform solution for δiα xi and wi wiα to minimize the upper bound however because of the fully decomposed form the gradient ni and wni can be evaluated efficiently via local computation with and again can be parallelized between nonadjacent nodes to handle the constraint on wni we use exponential gradient let wi exp vi exp vi exp viα and wiα exp viα exp vi exp viα taking the gradient and viα and transforming update backα givesαthe following wi wi exp ηwi xi µi wi wi exp ηwi xi µα where is the step size and paα in our implementation we find that few gradient steps with backtracking line search using the armijo rule works well in practice other more advanced optimization methods such as and newton method are also applicable experiments in this section we demonstrate our algorithm on set of graphical models from recent uai inference challenges including two diagnostic bayesian networks with and variables and max domain sizes and respectively and several mrfs for pedigree analysis with up to variables max domain size of and clique size we construct marginal map problems on these models by randomly selecting half of the variables to be max nodes 
and the rest as sum nodes we implement several algorithms that optimize the same primal marginal map bound including our gdd algorithm the wmb algorithm in with ibound which uses the same cliques and fixed point heuristic for optimization and an implementation that directly optimizes our decomposed bound for comparison we also computed several related primal bounds including standard and elimination reordering limited to the same computational limits ibound we also tried mas but found its bounds extremely decoding finding configuration is more difficult in marginal map than in joint map we use the same local decoding procedure that is standard in dual decomposition however evaluating the objective involves potentially difficult sum over xa making it hard to score each decoding for this reason we evaluate the score of each decoding but show the most recent decoding rather than the best as is standard in map to simulate behavior in practice figure and figure compare the convergence of the different algorithms where we define the iteration of each algorithm to correspond to full sweep over the graph with the same order of time complexity one iteration for gdd is defined in algorithm for wmb is full forward and backward message pass as in algorithm of and for is joint step on all variables the elimination order that we use is obtained by heuristic constrained to eliminate the sum nodes first diagnostic bayesian networks figure shows that our gdd converges quickly and monotonically on both the networks while wmb does not converge without proper damping we see http the instances tested have many zero probabilities which make finding lower bounds difficult since mas bounds are symmetrized this likely contributes to its upper bounds being loose gdd mbe elimination reordering decoded value wmb decoded value gdd bound bound gdd mbe elimination reordering decoded value wmb decoded value gdd iterations nodes iterations nodes figure marginal map results on and with randomly selected additional plots are in the supplement we plot the upper bounds of different algorithms across iterations the objective function xb of the decoded solutions xb are also shown dashed lines at the beginning xb may equal to because of zero probabiliy gdd mbe elim reordering iterations nodes gdd mbe elim reordering upper bound gdd mbe elim reordering upper bound upper bound iterations nodes iterations nodes figure marginal map inference on three pedigree models additional plots are in the supplement we randomly select half the nodes as in these models we tune the damping rate of wmb from to experimented different damping ratios for wmb and found that it is slower than gdd even with the best damping ratio found in figure wmb works best with damping ratio but is still significantly slower than gdd our gdd also gives better decoded marginal map solution xb obtained by rounding the singleton beliefs both wmb and our gdd provide much tighter bound than the elimination mbe or reordered elimination methods genetic pedigree instances figure shows similar results on set of pedigree instances again gdd outperforms wmb even with the best possible damping and the bounds after only one iteration pass through the graph conclusion in this work we propose new class of decomposition bounds for general inference which is capable of representing large class of primal variational bounds but is much more computationally efficient unlike previous primal sum bounds our bound decomposes into computations on small local cliques increasing efficiency 
and enabling parallel and monotonic optimization we derive block coordinate descent algorithm for optimizing our bound over both the parameters reparameterization and weights fractional counting numbers which generalizes dual decomposition and enjoy similar monotonic convergence property taking the advantage of its monotonic convergence our new algorithm can be widely applied as building block for improved heuristic construction in search or more efficient learning algorithms acknowledgments this work is sponsored in part by nsf grants and alexander ihler is also funded in part by the united states air force under contract no under the darpa ppaml program references dechter reasoning with probabilistic and deterministic graphical models exact algorithms synthesis lectures on artificial intelligence and machine learning dechter and rish general scheme for bounded inference jacm domke dual decomposition for marginal inference in aaai doucet godsill and robert marginal maximum posteriori estimation using markov chain monte carlo statistics and computing globerson and jaakkola approximate inference using conditional entropy decompositions in aistats globerson and jaakkola fixing convergent message passing algorithms for map in nips hardy littlewood and polya inequalities cambridge university press hazan peng and shashua tightening fractional covering upper bounds on the partition function for region graphs in uai hazan and shashua convergent algorithms for inference over general graphs with convex free energies in uai hazan and shashua belief propagation for approximate inference ieee transactions on information theory ihler flerova dechter and otten based schemes in uai jancsary and matz convergent decomposition solvers for trw free energies in aistats kiselev and poupart policy optimization by marginal map probabilistic inference in generative models in aamas komodakis paragios and tziritas mrf energy minimization and beyond via dual decomposition tpami liu reasoning and decisions in probabilistic graphical unified framework phd thesis university of california irvine liu and ihler bounding the partition function using inequality in icml liu and ihler variational algorithms for marginal map jmlr marinescu dechter and ihler search for marginal map in uai maua and de campos anytime marginal maximum posteriori inference in icml meek and wexler approximating problems using multiplicative error bounds bayesian statistics meltzer globerson and weiss convergent message passing algorithms unifying view in uai meshi and globerson an alternating direction method for dual map lp relaxation in meshi sontag jaakkola and globerson learning efficiently with approximate inference via dual losses in icml mooij libdai free and open source library for discrete approximate inference in graphical models jmlr naradowsky riedel and smith improving nlp through marginalization of hidden syntactic structure in emnlp nowozin and lampert structured learning and prediction in computer vision foundations and trends in computer graphics and vision park and darwiche solving map exactly using systematic search in uai park and darwiche complexity results and approximation strategies for map explanations jair ping liu and ihler marginal structured svm with hidden variables in icml ruozzi and tatikonda algorithms reparameterizations and splittings ieee transactions on information theory sontag globerson and jaakkola introduction to dual decomposition for inference optimization for machine learning sontag and jaakkola tree block 
coordinate descent for map in graphical models aistats sontag meltzer globerson jaakkola and weiss tightening lp relaxations for map using message passing in uai wainwright jaakkola and willsky new class of upper bounds on the log partition function ieee transactions on information theory weiss yanover and meltzer map estimation linear programming and belief propagation with convex free energies in uai werner linear programming approach to problem review tpami yarkony fowlkes and ihler covering trees and on quadratic assignment in cvpr yuan and hansen efficient computation of jointree bounds for systematic map search ijcai yuan lu and druzdzel annealed map in uai 
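The weighted log-partition function introduced in the background section above underlies the whole construction: a power sum with weight tau_i is applied to each variable along the elimination order, with tau_i = 1 recovering ordinary summation and tau_i = 0 recovering max, so marginal MAP corresponds to weight 0 on the max variables and weight 1 on the sum variables, with the sum variables eliminated first. The brute-force sketch below is an illustration only — it materializes the full joint table, so it is limited to tiny models, and the function and variable names are mine rather than the paper's. It evaluates the weighted log-partition function on a three-variable chain and checks it against direct enumeration.

```python
import numpy as np
from itertools import product

def power_sum(vals, tau):
    # Power sum of a 1-D array: (sum_i vals_i**(1/tau))**tau; tau -> 0 gives max.
    vals = np.asarray(vals, dtype=float)
    if tau == 0.0:
        return vals.max()
    return (vals ** (1.0 / tau)).sum() ** tau

def weighted_log_partition(factors, card, taus, order):
    # Build the joint potential exp(sum_alpha theta_alpha(x_alpha)) as a full
    # table, then eliminate variables along `order`, applying a power sum with
    # weight taus[i] to variable i (first entry of `order` is eliminated first).
    table = np.zeros(tuple(card))
    for scope, theta in factors:
        for x in product(*[range(card[i]) for i in scope]):
            idx = [slice(None)] * len(card)
            for i, xi in zip(scope, x):
                idx[i] = xi
            table[tuple(idx)] += theta[x]
    table = np.exp(table)
    for i in order:
        table = np.expand_dims(np.apply_along_axis(power_sum, i, table, taus[i]), i)
    return float(np.log(table.squeeze()))

# Three-variable chain x0 - x1 - x2, all binary; marginal MAP with max over x2.
rng = np.random.default_rng(0)
f01, f12 = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
factors = [((0, 1), f01), ((1, 2), f12)]

taus = [1.0, 1.0, 0.0]            # sum over x0 and x1, max over x2
phi = weighted_log_partition(factors, card=[2, 2, 2], taus=taus, order=[0, 1, 2])

joint = np.exp(f01[:, :, None] + f12[None, :, :])
print(phi, "should equal", np.log(joint.sum(axis=(0, 1)).max()))
```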
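The fully decomposed bound discussed above rests on a Hölder-type inequality: if each variable's weight tau_i is split nonnegatively among the cliques that contain it (the split weights over cliques containing i sum to tau_i), the joint power sum is upper bounded by the product of per-clique power sums taken in the same elimination order. The sketch below is my own illustration under that reading of the theorem — it uses uniform weight splits and omits the cost-shifting variables delta and any weight optimization — and confirms on the same toy chain that the decomposed bound upper bounds the exact value computed in the previous sketch.

```python
import numpy as np

def power_sum(vals, tau):
    # (sum vals**(1/tau))**tau along the leading axis; tau -> 0 gives max.
    vals = np.asarray(vals, dtype=float)
    return vals.max(axis=0) if tau == 0.0 else (vals ** (1.0 / tau)).sum(axis=0) ** tau

def clique_bound(theta, ws):
    # Nested power sum over one clique table whose axes follow the global
    # elimination order; ws[k] is the split weight for the k-th axis.
    t = np.exp(theta)
    for w in ws:                      # eliminate the leading axis each time
        t = power_sum(t, w)
    return float(np.log(t))

# Same toy chain: cliques (x0, x1) and (x1, x2); marginal MAP weights tau = (1, 1, 0).
# x0 and x2 each appear in one clique, so they keep their full weight; x1's weight 1
# is split uniformly (0.5 / 0.5) between the two cliques containing it.
rng = np.random.default_rng(0)
f01, f12 = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))

bound = clique_bound(f01, ws=[1.0, 0.5]) + clique_bound(f12, ws=[0.5, 0.0])

joint = np.exp(f01[:, :, None] + f12[None, :, :])
exact = np.log(joint.sum(axis=(0, 1)).max())        # log max_{x2} sum_{x0,x1} exp(theta)
print("decomposed bound:", bound, ">= exact:", exact)
```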
discrete classifiers meisam meisamr farzan farnia david dntse abstract consider the binary classification problem of predicting target variable from discrete feature vector xd when the probability distribution is known the optimal classifier leading to the minimum misclassification rate is given by the maximum probability map decision rule however in practice estimating the complete joint distribution is computationally and statistically impossible for large values of therefore an alternative approach is to first estimate some low order marginals of the joint probability distribution and then design the classifier based on the estimated low order marginals this approach is also helpful when the complete training data instances are not available due to privacy concerns in this work we consider the problem of finding the optimum classifier based on some estimated low order marginals of we prove that for given set of marginals the minimum hgr correlation principle introduced in leads to randomized classification rule which is shown to have misclassification rate no larger than twice the misclassification rate of the optimal classifier then under separability condition it is shown that the proposed algorithm is equivalent to randomized linear regression approach in addition this method naturally results in robust feature selection method selecting subset of features having the maximum worst case hgr correlation with the target variable our theoretical is similar to the recent discrete chebyshev classifier dcc approach while the proposed algorithm has significant computational advantages since it only requires solving least square optimization problem finally we numerically compare our proposed algorithm with the dcc classifier and show that the proposed algorithm results in better misclassification rate over various uci data repository datasets introduction statistical classification core task in many modern data processing and prediction problems is the problem of predicting labels for given feature vector based on set of training data instances containing feature vectors and their corresponding labels from probabilistic point of view this problem can be formulated as follows given data samples xn from probability distribution predict the target label test for given test point xtest many modern classification problems are on high dimensional categorical features for example in the association studies gwas the classification task is to predict trait of interest based on observations of the snps in the genome in this problem the feature vector xd is categorical with xi what is the optimal classifier leading to the minimum misclassification rate for such classification problem with high dimensional categorical feature vectors when the joint probability distribution of the random vector is known the map decision rule defined by map argmaxy department of electrical engineering stanford university stanford ca achieves the minimum misclassification rate however in practice the joint probability distribution is not known moreover estimating the complete joint probability distribution is not possible due to the curse of dimensionality for example in the above gwas problem the dimension of the feature vector is which leads to the alphabet size of for the feature vector hence practical approach is to first estimate some low order marginals of and then use these low order marginals to build classifier with low misclassification rate this approach which is the sprit of various machine learning and 
statistical methods is also useful when the complete data instances are not available due to privacy concerns in applications such as medical informatics in this work we consider the above problem of building classifier for given set of low order marginals first we formally state the problem of finding the robust classifier with the minimum worst case misclassification rate our goal is to find possibly randomized decision rule which has the minimum worst case misclassification rate over all probability distributions satisfying the given low order marginals then surrogate objective function which is obtained by the minimum hgr correlation principle is used to propose randomized classification rule the proposed classification method has the worst case misclassification rate no more than twice the misclassification rate of the optimal classifier when only pairwise marginals are estimated it is shown that this classifier is indeed randomized linear regression classifier on indicator variables under separability condition then we formulate feature selection problem based on the knowledge of pairwise marginals which leads to the minimum misclassification rate our analysis provides theoretical justification for using group lasso objective function for feature selection over the discrete set of features finally we conclude by presenting numerical experiments comparing the proposed classifier with discrete chebyshev classifier tree augmented naive bayes and minimax probabilistic machine in short the contributions of this work is as follows providing rigorous theoretical justification for using the minimum hgr correlation principle for binary classification problem proposing randomized classifier with misclassification rate no larger than twice the misclassification rate of the optimal classifier introducing computationally efficient method for calculating the proposed randomized classifier when pairwise marginals are estimated and separability condition is satisfied providing mathematical justification based on maximal correlation for using group lasso problem for feature selection in categorical data related work the idea of learning structures in data through low order is popular in machine learning and statistics for example the maximum entropy principle which is the spirit of the variational method in graphical models and tree augmented naive bayes is based on the idea of fixing the marginal distributions and fitting probabilistic model which maximizes the shannon entropy although these methods fit probabilistic model satisfying the low order marginals they do not directly optimize the misclassification rate of the resulting classifier another related information theoretic approach is the minimum mutual information principle which finds the probability distribution with the minimum mutual information between the feature vector and the target variable this approach is closely related to the framework of this paper however unlike the minimum hgr principle there is no known computationally efficient approach for calculating the probability distribution with the minimum mutual information in the continuous setting the idea of minimizing the worst case misclassification rate leads to the minimax probability machine this algorithm and its analysis is not easily extendible to the discrete scenario the most related algorithm to this work is the recent discrete chebyshev classifier dcc algorithm the dcc is based on the minimization of the worst case misclassification rate over the class of probability 
distributions with the given marginals of the form xi xj similar to our framework the dcc method achieves the misclassification rate no larger than twice the misclassification rate of the optimum classifier however computation of the dcc classifier requires solving optimization problem which is computationally demanding while the proposed algorithm results in least squares optimization problem with closed form solution furthermore in contrast to which only considers deterministic decision rules in this work we consider the class of randomized decision rules finally it is worth noting that the algorithm in requires tree structure to be tight while our proposed algorithm works on structures as long as the separability condition is satisfied problem formulation consider the binary classification problem with discrete features xd and target variable without loss of generality let us assume that and the data points are coming from an underlying probability distribution if the joint probability distribution is known the optimal classifier is given by the maximum posteriori probability map estimator yb map however the joint probability distribution is often not known in practice therefore in order to utilize the map rule one should first estimate using the training data instances unfortunately estimating the joint probability distribution requires estimating the value of for all which is intractable for large values of therefore as mentioned earlier our approach is to first estimate some low order marginals of the joint probability distribution and then utilize the minimax criterion for classification let be the class of probability distributions satisfying the estimated marginals for example when only pairwise marginals of the distribution is estimated the set is the class of distributions satisfying the given pairwise marginals cpairwise px pxi xj xi xj xj xi xj pxi xi xi xj in general could be any class of probability distributions satisfying set of estimated low order marginals let us also define to be randomized classification rule with with probability qδx with probability qδx for some qδx given randomized decision rule and joint probability distribution px we can extend to include our randomized decision rule then the misclassification rate of the decision rule under the probability distribution is given by hence under minimax criterion we are looking for decision rule which minimizes the worst case misclassification rate in other words the robust decision rule is given by argmin max where is the set of all randomized decision rules notice that the optimal decision rule may not be unique in general worst case error minimization in this section we propose surrogate objective for which leads to decision rule with misclassification rate no larger than twice of the optimal decision rule later we show that the proposed surrogate objective is connected to the minimum hgr principle let us start by rewriting as an optimization problem over real valued variables notice that each probability distribution px can be represented by probability vector px with px and px similarly every randomized rule can be represented by vector qδ qδx rm adopting these notations the set can be rewritten in terms of the probability vector as ap where the system of linear equations ap represents all the low order marginal constraints in and the notation denotes the vector of all ones therefore problem can be reformulated as argmin max qδx qδx where and denote the elements of the vector corresponding to the probability 
values and respectively the simple application of the minimax theorem implies that the saddle point of the above optimization problem exists and moreover the optimal decision rule is map rule for certain probability distribution in other words there exists pair for which and although the above observation characterizes the optimal decision rule to some extent it does not provide computationally efficient approach for finding the optimal decision rule notice that it is to verify the existence of probability distribution satisfying given set of low order marginals based on this observation and the result in we conjecture that in general solving is in the number variables and the alphabet size even when the set is nonempty hence here we focus on developing framework to find an approximate solution of let us continue by utilizing the minimax theorem and obtain the worst case probability distribution in by qxδ qxδ or equivalently argmax min despite convexity of the above problem there are two sources of hardness which make the problem intractable for moderate and large values of firstly the objective function is secondly the number of optimization variables is and grows exponentially with the alphabet size to deal with the first issue notice that the function inside the summation is the fairness objective between the two quantities and replacing this objective with the harmonic average leads to the following smooth convex optimization problem argmax it is worth noting that the harmonic mean of the two quantities is intuitively reasonable surrogate for the original objective function since min although this inequality suggests that the objective functions in and are close to each other leads to any classification rule having low misclassification it is not clear whether the distribution the first naive approach rate for all distributions in in order to obtain classification rule from however the following result shows that this decision rule is to use map decision rule based on does not achieve the factor two misclassification rate obtained in with the worst case error probability theorem let us define δemap map map ee then ee map where is the worst case misclassification rate of the optimal decision rule that is proof the proof is similar to the proof of next theorem and hence omitted here next we show that surprisingly one can obtain randomized decision rule based on the solution of which has misclassification rate no larger than twice of the optimal decision rule as the optimal solution of define the random decision rule δe as given with probability with probability let be the worst case classification error of the decision rule ee max clearly ee according to the definition of the optimal decision rule the following theorem shows that ee is also by twice of the optimal misclassification rate theorem define max then ee in other words the worst case misclassification rate of the decision rule δe is at most twice the optimal decision rule proof the proof is relegated to the supplementary materials so far we have resolved the issue in solving by using surrogate objective function in the next section we resolve the second issue by establishing the connection between problem and the minimum hgr correlation principle then we use the existing result in to develop computationally efficient approach for calculating the decision rule for cpairwise connection to correlation commonplace approach to infer models from data is to employ the maximum entropy principle this principle states that given set of 
constraints on the distribution the distribution with the maximum shannon entropy under those constraints is proper representer of the class to extend this rule to the classification problem the authors in suggest to pick the distribution maximizing the target entropy conditioned to features or equivalently minimizing mutual information between target and features unfortunately this approach does not lead to computationally efficient approach for model fitting and there is no guarantee on the misclassification rate of the resulting classifier here we study an alternative approach of minimum hgr correlation principle this principle suggests to pick the distribution in minimizing hgr correlation between the target variable and features the hgr correlation coefficient between the two random objects and which was first introduced by hirschfeld and gebelein and then studied by is defined as supf where the maximization is taken over the class of all measurable functions and with and the hgr correlation coefficient has many desirable properties for example it is normalized to be between and furthermore this coefficient is zero if and only if the two random variables are independent and it is one if there is strict dependence between and for other properties of the hgr correlation coefficient see and the references therein lemma assume the random variable is binary and define then px px px px proof the proof is relegated to the supplementary material this lemma leads to the following observation observation assume the marginal distribution and is fixed for any distribution then the distribution in with the minimum hgr correlation between and obtained by solving in other words is the distribution where denotes the hgr correlation coefficient under the probability distribution in as the clasbased on the above observation from now on we call the classifier sifier in the next section we use the result of the recent work to compute the classifier for special class of marginals cpairwise computing classifier based on pairwise marginals in many practical problems the number of features is large and therefore it is only computationally tractable to estimate marginals of order at most two hence hereafter we restrict ourselves to the case where only the first and second order marginals of the distribution is estimated cpairwise in this scenario in order to predict the output of the classifier for given data and next we state result from which sheds point one needs to find the value of and to state the theorem we need the following definitions light on the computation of let the matrix and the vector be defined through their entries as for every and also define the function as pd max zmi then we have the following theorem theorem rephrased from assume cpairwise let then min zt qz dt where the inequality holds with equality if and only if there exists solution to such that and or equivalently if and only if the following separability condition is satisfied for some cpairwise ep ζi xi for some functions ζd moreover if the separability condition holds with equality then xd combining the above theorem with the equality implies that the decision rule δe and δe map can be computed in computationally efficient manner under the separability condition notice that when the separability condition is not satisfied the approach proposed in this section would provide classification rule whose error rate is still bounded by however this error rate does no longer provide approximation gap it is also worth mentioning that 
the separability condition is property of the class of distribution cpairwise and is independent of the classifier at hand moreover this condition is satisfied with positive measure over the simplex of the all probability distributions as discussed in two remarks are in order inexact knowledge of marginal distribution the optimization problem is equivalent to solving the stochastic optimization problem argmin wt where is random vector with wm if xi in the and wm otherwise also define the random variable with if the random variable and otherwise here the expectation could be calculated with respect to any distribution in hence in practice the above optimization problem can be estimated using sample average approximation saa method through the optimization problem argmin wi ci where wi ci corresponds to the training data point xi clearly this is least square problem with closed form solution notice that in order to bound the saa error and avoid overfitting one could restrict the search space for this could also be done using regularizers such as ridge regression by solving ridge argmin wi ci λridge beyond pairwise marginals when is small one might be interested in estimating higher order marginals for predicting in this scenario simple modification for the algorithm is to define the new set of feature random variables xij xi xj and apply the algorithm to the new set of feature variables it is not hard to see that this approach utilizes the marginal information xi xj xk and xi xj robust feature selection the task of feature selection for classification purposes is to preselect subset of features for use in model fitting in prediction shannon mutual information which is measure of dependence between two random variables is used in many recent works as an objective for feature selection in these works the idea is to select small subset of features with maximum dependence with the target variable in other words the task is to find subset of variables with based on the following optimization problem mi argmax xs where xs xi and xs denotes the mutual information between the random variable xs and almost all of the existing approaches for solving are based on heuristic approaches and of greedy nature which aim to find solution of here we suggest to replace mutual information with the maximal correlation furthermore since estimating the joint distribution of and is computationally and statistically impossible for large number of features we suggest to estimate some low order marginals of the groundtruth distribution and then solve the following robust feature selection problem rfs argmax min xs when only pairwise marginals are estimated from the training data cpairwise maximizing the instead of leads to the following optimization problem rfs argmax min zt qz dt or equivalently sb rfs argmin min zt qz dt pm where zs rmd this problem is of combinatorial ture howevre using the standard group lasso regularizer leads to the feature selection procedure in algorithm algorithm robust feature selection choose regularization parameter and define let zrfs argminz zt qz dt λh pm rfs set pd max zmi notice that when the pairwise marginals are estimated from set of training data points the above feature selection procedure is equivalent to applying the group lasso regularizer to the standard linear regression problem over the domain of indicator variables our framework provides justification for this approach based on the robust maximal correlation feature selection problem remark another natural approach to 
define the feature selection procedure is to select subset of features by minimizing the worst case classification error solving the following optimization problem min min max where ds is the set of randomized decision rules which only uses the feature variables in define it can be shown that zt qz dt therefore another justification for algorithm is to minimize an of instead of itself remark alternating direction method of multipliers admm algorithm can be used for solving the optimization problem in algorithm see the supplementary material for more details numerical results we evaluated the performance of the classifiers δe and δe map on five different binary classification datasets from the uci machine learning data repository the results are compared with five different benchmarks used in discrete chebyshev classifier greedy dcc tree augmented naive bayes minimax probabilistic machine and support vector machines svm in addition to the classifiers δe and δe map which only use pairwise marginals we also use higher order marginals in and these classifiers are obtained by defining the new feature variables eij xi xj as discussed in section since in this scenario the number of features is large we combine our classifier with the proposed group lasso feature selection in other words eij and then find the maximum correlation classifier for the selected we first select subset of ridge features the value of and is determined through cross validation the results are averaged over monte carlo runs each using of the data for training and the rest for testing the results are summarized in the table below where each number shows the percentage of the error of each method the boldface numbers denote the best performance on each dataset as can be seen in this table in four of the tested datasets at least one of the proposed methods outperforms the other benchmarks furthermore it can be seen that the classifier δemap on average performs better than this fact could be due to the specific properties of the underlying probability distribution in each dataset datasets adult credit promoters votes δemap δe map dcc gdcc mpm tan svm in order to evaluate the computational efficiency of the classifier we compare its running time with svm over the synthetic data set with features and data points each feature xi is generated by bernoulli distribution with xi the target variable is generated by sign αt with and rd is generated with nonzero elements each drawn from standard gaussian distribution the results are averaged over runs of generating the data set and use of the data points for training and for test the classifier is obtained by gradient descent method with regularizer λridge the numerical experiment shows average misclassification rate for svm and for classifier however the average training time of the classifier is seconds while the training time of svm with matlab svm command is seconds acknowledgments the authors are grateful to stanford university supporting stanford graduate fellowship and the center for science of information csoi an nsf science and technology center under grant agreement for the support during this research references farnia razaviyayn kannan and tse minimum hgr correlation principle from marginals to joint distribution arxiv preprint eban mezuman and globerson discrete chebyshev classifiers in proceedings of the international conference on machine learning pages friedman geiger and goldszmidt bayesian network classifiers machine learning lanckriet ande ghaoui bhattacharyya and 
Jordan. A robust minimax approach to classification. Journal of Machine Learning Research.
Jordan, Ghahramani, Jaakkola, and Saul. An introduction to variational methods for graphical models. Machine Learning.
Roughgarden and Kearns. Marginals-to-models reducibility. In Advances in Neural Information Processing Systems.
Jaynes. Information theory and statistical mechanics. Physical Review.
Globerson and Tishby. The minimum information principle for discriminative learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. AUAI Press.
Sion. On general minimax theorems. Pacific Journal of Mathematics.
De Loera and Onn. The complexity of statistical tables. SIAM Journal on Computing.
Bertsimas and Sethuraman. Moment problems and semidefinite optimization. In Handbook of Semidefinite Programming. Springer.
Hirschfeld. A connection between correlation and contingency. In Mathematical Proceedings of the Cambridge Philosophical Society. Cambridge University Press.
Gebelein. Das statistische Problem der Korrelation als Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. Zeitschrift für Angewandte Mathematik und Mechanik (Journal of Applied Mathematics and Mechanics).
Rényi. On measures of dependence. Acta Mathematica Hungarica.
Anantharam, Gohari, Kamath, and Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv preprint.
Shapiro, Dentcheva, and Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM.
Shapiro. Monte Carlo sampling methods. Handbooks in Operations Research and Management Science.
Kakade, Sridharan, and Tewari. On the complexity of linear prediction: risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems.
Peng, Long, and Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks.
Boyd, Parikh, Chu, Peleato, and Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning.
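To make the pairwise-marginal construction described in the preceding sections concrete, the following sketch (our own illustration under stated assumptions, not code from the paper) implements the practical pipeline: discrete features are expanded into indicator variables, the ridge-regularized least-squares problem over those indicators is solved in closed form as the SAA of the stochastic objective, and predictions are read off the fitted score. The mapping of the score to a probability by clipping to [0, 1] is our assumption about the form of the randomized rule; the deterministic sign rule plays the role of the MAP-style variant. All function and variable names are ours.

```python
import numpy as np

def one_hot_encode(X, n_values):
    """Map discrete features X (n x d, values in {0, ..., n_values-1})
    to stacked indicator variables (n x d*n_values)."""
    n, d = X.shape
    Z = np.zeros((n, d * n_values))
    for j in range(d):
        Z[np.arange(n), j * n_values + X[:, j]] = 1.0
    return Z

def fit_pairwise_marginal_classifier(X, y, n_values, lam=0.1):
    """Ridge-regularized least squares over indicator features.

    Targets are encoded as c_i in {-1, +1}; this is the sample-average
    approximation of the stochastic least-squares problem and has a
    closed-form solution."""
    Z = one_hot_encode(X, n_values)
    c = 2.0 * y - 1.0                                  # y in {0,1} -> c in {-1,+1}
    m = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ c)

def predict_randomized(w, X, n_values, rng=None):
    """Randomized rule: clip the fitted score to [0, 1] and use it as the
    probability of predicting class 1 (our assumed form of the rule)."""
    rng = np.random.default_rng() if rng is None else rng
    Z = one_hot_encode(X, n_values)
    q = np.clip(0.5 * (Z @ w + 1.0), 0.0, 1.0)         # map score in [-1,1] to [0,1]
    return (rng.random(len(q)) < q).astype(int)

def predict_map(w, X, n_values):
    """Deterministic (MAP-style) variant: predict the sign of the score."""
    Z = one_hot_encode(X, n_values)
    return (Z @ w >= 0.0).astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(500, 3))              # 3 binary features
    flip = rng.random(500) < 0.1
    y = np.where(flip, 1 - X[:, 0], X[:, 0])           # noisy copy of feature 0
    w = fit_pairwise_marginal_classifier(X, y, n_values=2, lam=0.1)
    print("training error (MAP rule):", np.mean(predict_map(w, X, 2) != y))
```

In practice the regularization weight would be tuned by cross-validation, as the paper does for its real-data experiments.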
class of network models recoverable by spectral clustering marina department of statistics university of washington seattle wa usa mmp yali wan department of statistics university of washington seattle wa usa yaliwan abstract finding communities in networks is problem that remains difficult in spite of the amount of attention it has recently received the stochastic sbm is generative model for graphs with communities for which because of its simplicity the theoretical understanding has advanced fast in recent years in particular there have been various results showing that simple versions of spectral clustering using the normalized laplacian of the graph can recover the communities almost perfectly with high probability here we show that essentially the same algorithm used for the sbm and for its extension called sbm works on wider class of which we call preference frame models with essentially the same guarantees moreover the parametrization we introduce clearly exhibits the free parameters needed to specify this class of models and results in bounds that expose with more clarity the parameters that control the recovery error in this model class introduction there have been many recent advances in the recovery of communities in networks under blockmodel assumptions in particular advances in recovering communities by spectral clustering algorithms these have been extended to models including propensities in this paper we argue that one can further expand the model class for which recovery by spectral clustering is possible and describe model that subsumes number of existing models which we call the pfm we show that under the pfm model the communities can be recovered with small error our results correspond to what termed the weak recovery regime in which the fraction of nodes that are mislabeled is when the preference frame model of graphs with communities this model embodies the assumption that interactions at the community level which we will also call macro level can be quantified by meaningful parameters this general assumption underlies the and the related parameterizations of the sbm as well we define preference frame to be graph with nodes one for each community that encodes the connectivity pattern at the community level by stochastic matrix formally given matrix det representing the transition matrix of reversible markov chain on the weighted graph with edge set supp edges correspond to entries in not being is called frame requiring reversibility is equivalent to requiring that there is set of symmetric weights on the edges from which can be derived we note that without the reversibility assumption we would be modeling directed graphs which we will leave for future work we denote by the left principal eigenvector of satisfying ρt ρt we can assume the eigenvalue or has multiplicity and therefore we call the stationary distribution of we say that deterministic weighted graph with weight matrix and edge set supp admits frame if and only if there exists partition of the nodes into clusters ck of sizes nk respectively so that the markov chain on with transition matrix determined by satisfies the linear constraints pij rlm for all cl and all cluster indices the matrix is obtained from by the standard where pn diag di sij random graph family over node set admits frame and is called preference frame model pfm if the edges are sampled independently from bernoulli distributions with parameters sij it is assumed that the edges obtained are undirected and that sij for all pairs we denote 
realization from this process by furthermore let dˆi aij and in general throughout this paper we will denote computable quantities derived from the observed with the same letter as their model counterparts decorated with the hat symbol thus diag and so on one question we will study is under what conditions the pfm model can be estimated from given by standard spectral clustering algorithms evidently the difficult part in this estimation problem is recovering the partition if this is obtained correctly the remaining parameters are easily estimated in maximum likelihood framework but another question we elucidate refers to the parametrization itself it is known that in the sbm and degree in spite of their simplicity there are dependencies between the community level intensive parameters and the graph level extensive parameters as we will show below in the parametrization of the pfm we can explicitly show which are the free parameters and which are the dependent ones several network models in wide use admit preference frame for example the sbm model which we briefly describe here this model has parameters the cluster sizes and the connectivity matrix for two nodes the probability of an edge is bkl iff ck and cl the matrix needs not be symmetric when bkk bkl for the model is denoted sbm it is easy to verify that the sbm admits preference frame for instance in the case of sbm we have di nl nl dcl for cl qnm nl if rl for dcl dcl in the above we have introduced the notation dcl di one particular realization of the pfm is the homogeneous frame model hpfm in hpfm each node is characterized by weight or propensity to form ties wi for each pair of communities with and for each cl cm we sample aij with probability sij given by rl sij rml wi wj ρl this formulation ensures detail balance in the edge expectations sij sji the hpfm is virtually equivalent to what is known as the degree model or up to proposition relates the node weights to the expected node degrees di we note that the main result we prove in this paper uses independent sampling of edges only to prove the concentration of the laplacian matrix the pfm model can be easily extended to other graph models otherwise the networks obtained would be disconnected here we follow the customary definition of this model which does not enforce sii even though this implies probability of with dependent edges if one could prove concentration and eigenvalue separation for example when has rational entries the subgraph induced by each block of can be represented by random graph with specified degree proposition in hpfm di wi pk rkl wcl ρl whenever ck and equivalent statements that the expected degrees in each cluster are proportional to the weights exist in and they are instrumental in analyzing this model this particular parametrization immediately implies in what case the degrees are globally proportional to the weights this is obviously the situation when wcl ρl for all as we see the node degrees in hpfm are not directly determined by the propensities wi but depend on those by multiplicative constant that varies with the cluster this type of interaction between parameters has been observed in practically all extensions of the stochastic that we are aware of making parameter interpretation more difficult our following result establishes what are the free parameters of the pfm and of their subclasses as it will turn out these parameters and their interactions are easily interpretable proposition let nk be partition of assumed to represent the cluster sizes 
of ck partition of node set stochastic matrix its left principal eigenvector and πck nk probability distributions over then there exists pfm consistent with with clustering and whose node degrees are given by di dtot ρk πck whenever ck where dtot assumption di is user parameter which is only restricted above by the proof of this result is constructive and can be found in the extended version the parametrization shows to what extent one can specify independently the degree distribution of network model and the connectivity parameters moreover it describes the pattern of connection of node as composition of pattern which gives the total probability of to form connections with cluster and the distribution of connections between and the members of cl these parameters are meaningful on their own and can be specified or estimated separately as they have no hidden dependence on each other or on the pfm enjoys number of other interesting properties as this paper will show almost all the properties that make sbm popular and easy to understand hold also for the much more flexible pfm in the remainder of this paper we derive recovery guarantees for the pfm as an additional goal we will show that in the frame we set with the pfm the recovery conditions become clearer more interpretable and occasionally less restrictive than for other models as already mentioned the pfm includes many models that have been found useful by previous authors yet the pfm class is much more flexible than those individual models in the sense that it allows other unexplored degrees of freedom or in other words achieves the same advantages as previously studied models with fewer constraints on the data note that there is an infinite number of possible random graphs with the same parameters satisfying the constraints and proposition yet for preliable community detection we do not need to control fully but only aggregate statistics like aij spectral clustering algorithm and main result now we address the community recovery problem from random graph sampled from the pfm defined as above we make the standard assumption that is known our analysis is based on very common spectral clustering algorithm used in and described also in input graph with and number of clusters output clustering compute diag dˆn and laplacian calculate the eigenvectors associated with the eigenvalues of normalize the eigenvectors to unit length we denote them as the first eigenvectors in the following text set form matrix treating each row of as point in dimensions cluster them by the algorithm to obtain the clustering algorithm spectral clustering note that the vectors are the first eigenvectors of the algorithm is assumed to find the global optimum for more details on good initializations for in step see we quantify the difference between cˆ and the true clusterings by the rate perr which is defined as max perr theorem rate bound for hpfm and pfm let the matrix admit pfm and have the usual meaning and let be the eigenvalues of with let dmin min be the minimum expected degree dˆmin min dˆi and dmax maxij nsij let be arbitrary numbers assume assumption admits hpfm model and holds assumption sij assumption dˆmin log assumption dmin log assumption dmax log assumption grow where grow is defined in proposition assumption are the eigenvalues of and we also assume that we run algorithm on and that finds the optimal solution then for sufficiently large the following statements hold with probability at least pfm assumptions imply log kdtot perr ndmin grow log dˆmin 
hpfm assumptions imply perr kdtot ndmin grow log λk log dˆmin where is constant depending on and note that perr decreases at least as log when dˆmin dmin log this is because dˆmin and dmin help with the concentration of using proposition the distances between rows of the true centers of the step are lower bounded by grow after plugging in the assumptions for dmin dˆmin dmax we obtain kκ perr grow log log when is small the first component on the right hand side dominates because of the constant while the second part dominates when is very large this shows that perr decreases almost as log of the remaining quantities controls the spread of the degrees di notice that λk and are eigengaps in hpfm model and pfm model respectively and depend only on the preference frame and likewise for grow the eigengaps ensure the stability of principal spaces and the separation from the spurious eigenvalues as shown in proposition the term containing log is designed to control the difference between di and dˆi with small positive constant proof outline techniques and main concepts the proof of theorem given in the extended version of the paper relies on three steps which are to be found in most results dealing with spectral clustering first concentration bounds of the empirical laplacian are obtained there are various conditions under which these can be obtained and ours are most similar to the recent result of the other tools we use are hoeffding bounds and tools from linear algebra second one needs to bound the perturbation of the eigenvectors as function of the perturbation in this is based on the pivotal results of davis and kahan see crucial ingredient in these type of theorems is the size of the eigengap between the invariant subspace and its orthogonal complement this is condition that is and therefore we discuss the techniques we introduce for solving this problem in the pfm in the next subsection the third step is to bound the error of the clustering algorithm this is done by counting argument the crux of this step is to ensure the separation of the distinct rows of this again is model dependent and we present our result below the details and proof are in the extended version all proofs are for the pfm to specialize to the hpfm one replaces with cluster separation and bounding the spurious eigenvalues in the pfm proposition cluster separation let have the usual meaning and define the cluster dck volume dck di and cmax cmin as maxk mink nρ let be nodes belonging respectively to clusters with then grow vj dtot cmax ρk ρm ρk ρm cmin cmax dtot where grow cmax moreover if the columns of are ρk ρm ρk ρm cmin cmax normalized to length the above result holds by replacing dtot cmax min with max mink nρkk in the square brackets cmax min depend on the degree distribution while all the other quantitities depend only of the preference frame hence this expression is invariant with and as long as it is strictly positive we have that the cluster separation is the next theorem is crucial in proving that has constant eigengap we express the eigengap of in terms of the preference frame and the mixing inside each of the clusters ck for this we resort to generalized stochastic matrices rectangular positive matrices with equal row sums and we relate their properties to the mixing of markov chains on bipartite graphs these tools are introduced here for the sake of intuition toghether with the main spectral result while the rest of the proofs are in the extended version given for any vector rn we denote by xk the block of 
indexed by elements of cluster of similarly for any square matrix we denote by akl aij the block with rows indexed by and columns indexed by denote by rk respectively the stationary distribution and eigenvectors of we are interested in block stochastic matrices for which the eigenvalues of are the principal eigenvalues we call λn spurious eigenvalues theorem below is sufficient condition that bounds whenever each of the blocks of is homogeneous in sense that will be defined below when we consider the matrix partitioned according to it will be convenient to consider the blocks in pairs this is why the next result describes the properties of matrices consisting of pair of blocks proposition eigenvalues for the blocks let be the square matrix and let be an eigenvector of where and with eigenvalue then ba ab moreover if is symmetric at then is singular value of is real and is also an eigenvalue of with eigenvector assuming and that is full rank one can write λu with orthogonal matrices and diagonal matrix of singular values theorem bounding the spurious eigenvalues of let be defined as above and let be an eigenvalue of assume that is with respect to are kk the eigenvalues of and is not an eigenvalue of denote by λkl the third second largest in magnitude eigenvalue of block mkl lkk and assume that λmax mkl lkk then the spurious eigenvalues of are bounded by times constant that depends only on max rkl rlk remarks the factor that multiplies can be further bounded denoting rlk uk ux rkk rkl rlk rkl rlk rkl uk ux lk in other words uk ux rlk max the maximum column sum of stochastic matrix is if the matrix is doubly stochastic and larger than otherwise and can be as large as however one must remember that the interesting matrices have large eigenvalues in particular we will be interested in λk it is expected that under these conditions the factor depending on to be close to here too eigenvalues will always be ordered in decreasing order of their magnitudes with positive values preceeding negatives one of the same magnitude consequently for any stochastic matrix always the second remark is on the condition that all blocks have small spurious eigenvalues this condition is not merely technical convenience if block had large eigenvalue near or times its λmax then that block could itself be broken into two distinct clusters in other words the clustering would not accurately capture the cluster structure of the matrix hence condition amounts to requiring that no other cluster structure is present in other words that within each block the markov chain induced by mixes well related work previous results we used the laplacian concentration results use technique introduced recently by and some of the basic matrix theoretic results are based on which studied the and matrix in the context of spectral clustering as any of the many works we cite we are indebted to the pioneering work on the perturbation of invariant subspaces of davis and kahan previous related models the configuration model for regular random graphs and for graphs with general fixed degrees is very well known it can be shown by simple calculation that the configuration model also admits frame in the particular case when the diagonal of the matrix is and the connections between clusters are given by bipartite configuration model with fixed degrees frames have been studied by under the name equitable graphs the object there was to provide way to calculate the spectrum of the graph since the pfm is itself an extension of the sbm many other 
extensions of the latter will bear resemblance to pfm here we review only subset of these series of strong relatively recent advances which exploit the spectral properties of the sbm and extend this to handle large range of degree distributions the pfm includes each of these models as in the authors study model that coincides up to some multiplicative constants with the hpfm the paper introduces an elegant algorithm that achieves partial recovery or better which is based on the spectral properties of random matrix and does not require knowledge of the partition size the pfm also coincides with the model of and called the expected degree model the distribution of edges but not the ambient edges so the hpfm is subclass of this model different approach to recovery the papers propose regularizing the normalized laplacian with respect to the influence of low degrees by adding the scaled unit matrix to the incidence matrix and thereby they achieve recovery for much more imbalanced degree distributions than us currently we do not see an application of this interesting technique to the pfm as the diagonal regularization destroys the separation of the intracluster and intercluster transitions which guarantee the clustering property of the eigenvectors therefore currently we can not break the log limit into the regime although we recognize that this is an important current direction of research recovery results like ours can be easily extended to weighted graphs and in this sense they are relevant to the spectral clustering of these graphs when they are assumed to be noisy versions of that admits pfm an empirical comparison of the recovery conditions as obtaining general results in comparing the various recovery conditions in the literature would be tedious task here we undertake to do numerical comparison while the conclusions drawn from this are not universal they illustrate well the stringency of various conditions as well as the gap between theory and actual recovery for this we construct hpfm models and verify numerically if they satisfy the various conditions we have also clustered random graphs sampled from this model with good results shown in the extended version in particular the models proposed in are variations of the and thus forms of the homogeneous pfm we generate from the hpfm model with each wi is uniformly generated from grow the matrix is given below note its last row in which the conditions we are verifying include besides ours those obtained by and since the original is perfect case for spectral clustering of weighted graphs we also verify the theoretical recovery conditions for spectral clustering in and our result theorem assumption and automatically hold from the construction of the data by simulating the data we find that dmin dˆmin both of which are bigger than log therefore assumption and hold dmax grow thus assumption and hold after running algorithm the rate is which satisfies the theoretical bound in conclusion the dataset fits into both the assumptions and conclusion of theorem qin and rohe this paper has an assumption on the lower bound on λk that is ln so that the concentration bound holds with probability we set and dmin obtain λk which is impossible to hold since λk is upper bounded by rohe chatterjee yu here one defines τn dmin and requires τn log to ensure the concentration of to meet this assumption with dmin while in our case dmin the assumption requires very dense graph and is not satisfied in this dataset balcan borgs braverman chayes their theorem is based 
on community structure it requires all the nodes to be more connected within their own cluster however in our graph out of nodes have more connections to outside nodes than to nodes in their own cluster ng jordan weiss require where dˆ kl dˆj dˆk dˆj dl on the given data we find that and which is impossible to hold since needs to be smaller than chaudhuri chung tsiatas the recovery theorem of this paper requires di ln so that when all the assumptions hold it recovers the clustering correctly with probability at least we set and obtain that di ln therefore the assumption fails as well for our method the hardest condition to satisfy and the most different from the others was assumption we repeated this experiment with the other weights distributions for which this assumption fails the assumptions in the related papers continued to be violated in qin and rohe we obtain λk in rohe chatterjee yu we still needs dmin in balcan borgs braverman chayes we get points more connected to the outside nodes of its cluster in balakrishnan xu krishnamurthy singh we get and needs to satisfy in ng jordan weiss we obtain therefore the assumptions in these papers are all violated as well conclusion in this paper we have introduced the preference frame model which is more flexible and subsumes many current models including sbm and it produces art recovery rates comparable to existing models to accomplish this we used parametrization that is clearer and more intuitive the theoretical results are based on the new geometric techniques which control the eigengaps of the matrices with piecewise constant eigenvectors we note that the main result theorem uses independent sampling of edges only to prove the concentration of the laplacian matrix the pfm model can be easily extended to other graph models with dependent edges if one could prove concentration and eigenvalue separation for example when has rational entries the subgraph induced by each block of can be represented by random graph with specified degree to make possible one needs dmin references sanjeev arora rong ge sushant sachdeva and grant schoenebeck finding overlapping communities in social networks toward rigorous approach in proceedings of the acm conference on electronic commerce pages acm sivaraman balakrishnan min xu akshay krishnamurthy and aarti singh noise thresholds for spectral clustering in advances in neural information processing systems pages balcan christian borgs mark braverman jennifer chayes and teng finding endogenously formed communities arxiv preprint bela bollobas random graphs cambridge university press second edition chaudhuri chung and tsiatas spectral clustering of graphs with general degrees in extended planted partition model journal of machine learning research pages yudong chen and jiaming xu tradeoffs in planted problems and submatrix localization with growing number of clusters and submatrices arxiv preprint amin and andre lanka finding planted partitions in random graphs with general degree distributions siam journal on discrete mathematics jackson social and economic networks princeton university press can le and roman vershynin concentration and regularization of random graphs brendan mckay asymptotics for symmetric matrices with prescribed row sums ars combinatoria brendan mckay and nicholas wormald uniform generation of random regular graphs of moderate degree journal of algorithms brendan mckay and nicholas wormald asymptotic enumeration by degree sequence of graphs with degrees combinatorica marina and jianbo shi 
Learning segmentation by random walks. In Leen, Dietterich, and Tresp, editors, Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA.
Marina Meilă and Jianbo Shi. A random walks view of spectral segmentation. In Jaakkola and Richardson, editors, Artificial Intelligence and Statistics (AISTATS).
Newman and Travis Martin. Equitable random graphs.
Andrew Ng, Michael Jordan, Yair Weiss, et al. On spectral clustering: analysis and an algorithm. Advances in Neural Information Processing Systems.
Norris. Markov Chains. Cambridge University Press.
Tai Qin and Karl Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems.
Karl Rohe, Sourav Chatterjee, Bin Yu, et al. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics.
Gilbert Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press (Harcourt Brace Jovanovich), New York.
Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing.
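As a concrete illustration of the sampling model and of the spectral clustering algorithm analyzed above, the sketch below (our own reconstruction with arbitrary toy parameters, not the authors' code) draws a graph from a two-block planted-partition model, which, as noted earlier, admits a preference frame, computes the normalized Laplacian, embeds the nodes with the eigenvectors of its K largest eigenvalues, clusters the embedded rows with k-means, and reports the misclassification rate p_err under the best matching of estimated to true labels. The function names, parameter values, and the use of scikit-learn's KMeans are our choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def sample_planted_partition(sizes, p_in, p_out, rng):
    """Sample an undirected graph whose edge probabilities S_ij are block
    constant: a simple SBM, hence a special case of a preference frame model."""
    z = np.repeat(np.arange(len(sizes)), sizes)            # true cluster labels
    S = np.where(z[:, None] == z[None, :], p_in, p_out)    # edge probabilities S_ij
    A = np.triu((rng.random(S.shape) < S).astype(float), k=1)
    return A + A.T, z                                      # undirected, no self-loops

def spectral_clustering(A, k, seed=0):
    """Normalized Laplacian, eigenvectors of the K largest eigenvalues, k-means.
    (Some variants additionally normalize each embedded row to unit length.)"""
    d = A.sum(axis=1)
    d[d == 0] = 1.0                                        # guard isolated nodes
    dinv_sqrt = 1.0 / np.sqrt(d)
    L = dinv_sqrt[:, None] * A * dinv_sqrt[None, :]        # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L)                            # ascending eigenvalues
    V = vecs[:, -k:]                                       # top-k eigenvectors (unit length)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(V)

def misclassification_rate(z_true, z_hat, k):
    """p_err: fraction of mislabeled nodes under the best matching of labels."""
    C = np.zeros((k, k))
    for a, b in zip(z_true, z_hat):
        C[a, b] += 1
    rows, cols = linear_sum_assignment(-C)                 # maximize agreement
    return 1.0 - C[rows, cols].sum() / len(z_true)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, z = sample_planted_partition(sizes=(300, 300), p_in=0.10, p_out=0.02, rng=rng)
    z_hat = spectral_clustering(A, k=2)
    print("p_err =", misclassification_rate(z, z_hat, k=2))
```

With these toy parameters the two blocks are well separated, so the reported p_err should be close to zero; shrinking the gap between p_in and p_out illustrates the weak-recovery regime discussed in the paper.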
vectors ryan kiros yukun zhu ruslan salakhutdinov richard zemel antonio torralba raquel urtasun sanja fidler university of toronto canadian institute for advanced research massachusetts institute of technology abstract we describe an approach for unsupervised learning of generic distributed sentence encoder using the continuity of text from books we train an encoderdecoder model that tries to reconstruct the surrounding sentences of an encoded passage sentences that share semantic and syntactic properties are thus mapped to similar vector representations we next introduce simple vocabulary expansion method to encode words that were not seen as part of training allowing us to expand our vocabulary to million words after training our model we extract and evaluate our vectors with linear models on tasks semantic relatedness paraphrase detection ranking classification and benchmark sentiment and subjectivity datasets the end result is an encoder that can produce highly generic sentence representations that are robust and perform well in practice introduction developing learning algorithms for distributed compositional semantics of words has been longstanding open problem at the intersection of language understanding and machine learning in recent years several approaches have been developed for learning composition operators that map word vectors to sentence vectors including recursive networks recurrent networks convolutional networks and methods among others all of these methods produce sentence representations that are passed to supervised task and depend on class label in order to backpropagate through the composition weights consequently these methods learn highquality sentence representations but are tuned only for their respective task the paragraph vector of is an alternative to the above models in that it can learn unsupervised sentence representations by introducing distributed sentence indicator as part of neural language model the downside is at test time inference needs to be performed to compute new vector in this paper we abstract away from the composition methods themselves and consider an alternative loss function that can be applied with any composition operator we consider the following question is there task and corresponding loss that will allow us to learn highly generic sentence representations we give evidence for this by proposing model for learning sentence vectors without particular supervised task in mind using word vector learning as inspiration we propose an objective function that abstracts the model of to the sentence level that is instead of using word to predict its surrounding context we instead encode sentence to predict the sentences around it thus any composition operator can be substituted as sentence encoder and only the objective function becomes modified figure illustrates the model we call our model and vectors induced by our model are called vectors our model depends on having training corpus of contiguous text we chose to use large collection of novels namely the bookcorpus dataset for training our models these are free books written by yet unpublished authors the dataset has books in different genres romance books fantasy science fiction teen etc table highlights the summary statistics of the book corpus along with narratives books contain dialogue emotion and wide range of interaction between characters furthermore with large enough collection the training set is not biased towards any particular domain or application table shows nearest 
neighbours figure the model given tuple si of contiguous sentences with si the sentence of book the sentence si is encoded and tries to reconstruct the previous sentence and next sentence in this example the input is the sentence triplet got back home could see the cat on the steps this was strange unattached arrows are connected to the encoder output colors indicate which components share parameters heosi is the end of sentence token of books of sentences of words of unique words mean of words per sentence table summary statistics of the bookcorpus dataset we use this corpus to training our model of sentences from model trained on the bookcorpus dataset these results show that vectors learn to accurately capture semantics and syntax of the sentences they encode we evaluate our vectors in newly proposed setting after learning freeze the model and use the encoder as generic feature extractor for arbitrary tasks in our experiments we consider tasks paraphrase detection ranking and standard classification benchmarks in these experiments we extract vectors and train linear models to evaluate the representations directly without any additional as it turns out yield generic representations that perform robustly across all tasks considered one difficulty that arises with such an experimental setup is being able to construct large enough word vocabulary to encode arbitrary sentences for example sentence from wikipedia article might contain nouns that are highly unlikely to appear in our book vocabulary we solve this problem by learning mapping that transfers word representations from one model to another using pretrained representations learned with continuous model we learn linear mapping from word in space to word in the encoder vocabulary space the mapping is learned using all words that are shared between vocabularies after training any word that appears in can then get vector in the encoder word embedding space approach inducing vectors we treat in the framework of models that is an encoder maps words to sentence vector and decoder is used to generate the surrounding sentences encoderdecoder models have gained lot of traction for neural machine translation in this setting an encoder is used to map an english sentence into vector the decoder then conditions on this vector to generate translation for the source english sentence several choices of pairs have been explored including and the source sentence representation can also dynamically change through the use of an attention mechanism to take into account only the relevant words for translation at any given time in our model we use an rnn encoder with gru activations and an rnn decoder with conditional gru this model combination is nearly identical to the rnn of used in neural machine translation gru has been shown to perform as well as lstm on sequence modelling tasks while being conceptually simpler gru units have only gates and do not require the use of cell while we use rnns for our model any encoder and decoder can be used so long as we can backpropagate through it assume we are given sentence tuple si let wit denote the word for sentence si and let xti denote its word embedding we describe the model in three parts the encoder decoder and objective function encoder let win be the words in sentence si where is the number of words in the sentence at each time step the encoder produces hidden state hti which can be interpreted as the representation of the sequence wit the hidden state hn thus represents the full sentence preliminary 
version of our model was developed in the context of computer vision application query and nearest sentence he ran his hand inside his coat that the unopened letter was still there he slipped his hand between his coat and his shirt where the folded copies lay in brown envelope im sure youll have glamorous evening she said giving an exaggerated wink im really glad you came to the party tonight he said turning to her although she could tell he had been too invested in any of their other chitchat he seemed genuinely curious about this although he had been following her career with microscope he definitely taken notice of her appearances an annoying buzz started to ring in my ears becoming louder and louder as my vision began to swim weighty pressure landed on my lungs and my vision blurred at the edges threatening my consciousness altogether if he had weapon he could maybe take out their last imp and then beat up errol and vanessa if he could ram them from behind send them sailing over the far side of the levee he had chance of stopping them then with stroke of luck they saw the pair head together towards the portaloos then from out back of the house they heard horse scream probably in answer to pair of sharp spurs digging deep into its flanks ll take care of it goodman said taking the phonebook ll do that julia said coming in he finished rolling up scrolls and placing them to one side began the more urgent task of finding ale and tankards he righted the table set the candle on piece of broken plate and reached for his flint steel and tinder table in each example the first sentence is query while the second sentence is its nearest neighbour nearest neighbours were scored by cosine similarity from random sample of sentences from our corpus to encode sentence we iterate the following sequence of equations dropping the subscript rt wr xt ur wz uz tanh wx ht zt zt where is the proposed state update at time zt is the update gate rt is the reset gate denotes product both update gates takes values between zero and one decoder the decoder is neural language model which conditions on the encoder output hi the computation is similar to that of the encoder except we introduce matrices cz cr and that are used to bias the update gate reset gate and hidden state computation by the sentence vector one decoder is used for the next sentence while second decoder is used for the previous sentence separate parameters are used for each decoder with the exception of the vocabulary matrix which is the weight matrix connecting the decoder hidden state for computing distribution over words in what follows we describe the decoder for the next sentence although an analogous computation is used for the previous sentence let denote the hidden state of the decoder at time decoding involves iterating through the following sequence of equations dropping the subscript rt wrd udr cr hi wzd udz tanh cz hi chi zt zt given the probability of word given the previous words and the encoder vector is hi exp where denotes the row of corresponding to the word of an analogous computation is performed for the previous sentence objective given tuple si the objective optimized is the sum of the for the forward and backward sentences conditioned on the encoder representation logp hi logp hi the total objective is the above summed over all such training tuples vocabulary expansion we now describe how to expand our encoder vocabulary to words it has not seen during training suppose we have model that was trained to induce word 
representations such as let denote the word embedding space of these word representations and let vrnn denote the rnn word embedding space we assume the vocabulary of is much larger than that of vrnn our goal is to construct mapping vrnn parameterized by matrix such that wv for and vrnn inspired by which learned linear mappings between translation word spaces we solve an linear regression loss for the matrix thus any word from can now be mapped into vrnn for encoding sentences experiments in our experiments we evaluate the capability of our encoder as generic feature extractor after training on the bookcorpus dataset our experimentation setup on each task is as follows using the learned encoder as feature extractor extract vectors for all sentences if the task involves computing scores between pairs of sentences compute features between pairs this is described in more detail specifically for each experiment train linear classifier on top of the extracted features with no additional or backpropagation through the model we restrict ourselves to linear classifiers for two reasons the first is to directly evaluate the representation quality of the computed vectors it is possible that additional performance gains can be made throughout our experiments with models but this falls out of scope of our goal furthermore it allows us to better analyze the strengths and weaknesses of the learned representations the second reason is that reproducibility now becomes very straightforward details of training to induce vectors we train two separate models on our book corpus one is unidirectional encoder with dimensions which we subsequently refer to as the other is bidirectional model with forward and backward encoders of dimensions each this model contains two encoders with different parameters one encoder is given the sentence in correct order while the other is given the sentence in reverse the outputs are then concatenated to form dimensional vector we refer to this model as for training we initialize all recurrent matricies with orthogonal initialization weights are initialized from uniform distribution in of size are used and gradients are clipped if the norm of the parameter vector exceeds we used the adam algorithm for optimization both models were trained for roughly two weeks as an additional experiment we also report experimental results using combined model consisting of the concatenation of the vectors from and resulting in dimensional vector we refer to this model throughout as after our models are trained we then employ vocabulary expansion to map word embeddings into the rnn encoder space the publically available cbow word vectors are used for this purpose the models are trained with vocabulary size of words after removing multiple word examples from the cbow model this results in vocabulary size of words thus even though our model was trained with only words after vocabulary expansion we can now successfully encode possible words since our goal is to evaluate as general feature extractor we keep text to minimum when encoding new sentences no additional preprocessing is done other than basic tokenization this is done to test the robustness of our vectors as an additional baseline we also consider the mean of the word vectors learned from the model we refer to this baseline as bow this is to determine the effectiveness of standard baseline trained on the bookcorpus semantic relatedness our first experiment is on the semeval task semantic relatedness sick dataset given two sentences our goal 
is to produce score of how semantically related these sentences are based on human generated scores each score is the average of different human annotators scores take values between and score of indicates that the sentence pair is not at all related while http mse method acc meaning factory ecnu feats mean vectors lstm bidirectional lstm dependency fhs pe wddp mtmetrics bow bow feats method table left test set results on the sick semantic relatedness subtask the evaluation metrics are pearson spearman and mean squared error the first group of results are semeval submissions while the second group are results reported by right test set results on the microsoft paraphrase corpus the evaluation metrics are classification accuracy and score top recursive autoencoder variants middle the best published results on this dataset score of indicates they are highly related the dataset comes with predefined split of training pairs development pairs and testing pairs all sentences are derived from existing image and video annotation datasets the evaluation metrics are pearson spearman and mean squared error given the difficulty of this task many existing systems employ large amount of feature engineering and additional resources thus we test how well our learned representations fair against heavily engineered pipelines recently showed that learning representations with lstm or for the task at hand is able to outperform these existing systems we take this one step further and see how well our vectors learned from completely different task are able to capture semantic relatedness when only linear model is used on top to predict scores to represent sentence pair we use two features given two vectors and we compute their product and their absolute difference and concatenate them together these two features were also used by to predict score we use the same setup as let be an integer vector from to we compute distribution as function of prediction scores given by pi byc if byc pi byc if byc and otherwise these then become our targets for logistic regression classifier at test time given new sentence pairs we first compute targets and then compute the related score as as an additional comparison we also explored appending features derived from an embedding model trained on coco see section given vectors and we obtain vectors and from the learned linear embedding model and compute features and these are then concatenated to the existing features table left presents our results first we observe that our models are able to outperform all previous systems from the semeval competition it highlights that vectors learn representations that are well suited for semantic relatedness our results are comparable to lstms whose representations are trained from scratch on this task only the dependency of performs better than our results we note that the dependency relies on parsers whose training data is very expensive to collect and does not exist for all languages we also observe using features learned from an embedding model on coco gives an additional performance boost resulting in model that performs on par with the dependency to get feel for the model outputs table shows example cases of test set pairs our model is able to accurately predict relatedness on many challenging cases on some examples it fails to pick up on small distinctions that drastically change sentence meaning such as tricks on motorcycle versus tricking person on motorcycle paraphrase detection the next task we consider is paraphrase detection on 
the microsoft research paraphrase corpus on this task two sentences are given and one must predict whether or not they are sentence sentence gt pred little girl is looking at woman in costume little girl is looking at woman in costume little girl is looking at woman in costume young girl is looking at woman in costume the little girl is looking at man in costume little girl in costume looks like woman sea turtle is hunting for fish sea turtle is not hunting for fish sea turtle is hunting for food sea turtle is hunting for fish man is driving car there is no man driving the car the car is being driven by man man is driving car large duck is flying over rocky stream large duck is flying over rocky stream duck which is large is flying over rocky stream large stream is full of rocks ducks and flies person is performing acrobatics on motorcycle person is performing tricks on motorcycle person is performing tricks on motorcycle the performer is tricking person on motorcycle someone is pouring ingredients into pot nobody is pouring ingredients into pot someone is pouring ingredients into pot someone is adding ingredients to pot someone is pouring ingredients into pot man is removing vegetables from pot table example predictions from the sick test set gt is the ground truth relatedness scored between and the last few results show examples where slight changes in sentence structure result in large changes in relatedness which our model was unable to score correctly model random ranking dvsa bow coco retrieval image annotation med image search med table coco results for retrieval experiments is recall high is good med is the median rank low is good paraphrases the training set consists of sentence pairs which are positive and the test set has pairs are positive we compute vector representing the pair of sentences in the same way as on the sick dataset using the product and their absolute difference which are then concatenated together we then train logistic regression on top to predict whether the sentences are paraphrases is used for tuning the penalty as in the semantic relatedness task paraphrase detection has largely been dominated by extensive feature engineering or combination of feature engineering with semantic spaces we report experiments in two settings one using the features as above and the other incorporating basic statistics between sentence pairs the same features used by these are referred to as feats in our results we isolate the results and baselines used in as well as the top published results on this task table right presents our results from which we can observe the following alone outperform recursive nets with dynamic pooling when no features are used when other features are used recursive nets with dynamic pooling works better and when skipthoughts are combined with basic pairwise statistics it becomes competitive with the which incorporate much more complicated features and this is promising result as many of the sentence pairs have very details that signal if they are paraphrases ranking we next consider the task of retrieving images and their sentence descriptions for this experiment we use the microsoft coco dataset which is the largest publicly available dataset of images with sentence descriptions each image is annotated with captions each from different annotators following previous work we consider two tasks image annotation and image search for image annotation an image is presented and sentences are ranked based on how well they describe the query image the image 
search task is the reverse given caption we retrieve images that are good fit to the query the training set comes with over images each with captions for development and testing we use the same splits as the development and test sets each contain images and captions evaluation is performed using recall namely the mean number of images for which the correct caption is ranked within the retrieved results and for sentences we also report the median rank of the closest ground truth result from the ranked list the best performing results on ranking have all used rnns for encoding sentences where the sentence representation is learned jointly recently showed that by using fisher vectors for representing sentences linear cca can be applied to obtain performance that is as strong as using rnns for this task thus the method of is strong baseline to compare our sentence representations with for our experiments we represent images using oxfordnet features from their model for sentences we simply extract vectors for each caption the training objective we use is pairwise ranking loss that has been previously used by many other methods the only difference is the scores are computed using only linear transformations of image and sentence inputs the loss is given by xx xx max ux vy ux vyk max vy ux vy uxk where is an image vector is the vector for the groundtruth sentence yk are vectors for constrastive incorrect sentences and is the score cosine similarity is used for scoring the model parameters are where is the image embedding matrix and is the sentence embedding matrix in our experiments we use dimensional embedding margin and contrastive terms we trained for epochs and saved our model anytime the performance improved on the development set table illustrates our results on this task using vectors for sentences we get performance that is on par with both and except for on image annotation where other methods perform much better our results indicate that vectors are representative enough to capture image descriptions without having to learn their representations from scratch combined with the results of it also highlights that simple scalable embedding techniques perform very well provided that image and sentence vectors are available classification benchmarks for our final quantitative experiments we report results on several classification benchmarks which are commonly used for evaluating sentence representation learning methods we use datasets movie review sentiment mr customer product reviews cr classification subj opinion polarity mpqa and classification trec on all datasets we simply extract vectors and train logistic regression classifier on top is used for evaluation on the first datasets while trec has split we tune the penality using and thus use nested for the first datasets method mr cr subj mpqa trec mnb cbow grconv rnn brnn cnn adasent bow nb on these tasks properly tuned models have been shown to perform exceptionally well in particular the of is fast and robust performer on these tasks skipthought vectors potentially give an alternative to these baselines being just as fast and easy to use for an additional comparison we also see to what effect augmenting with bigram naive bayes nb features improves performance table presents our results on most tasks performs about as well as the baselines but table classification accuracies on several standard fails to improve over methods whose marks results are grouped as follows sentence representations are learned diels supervised compositional 
models paragraph vector rectly for the task at hand this indicates unsupervised learning of sentence representations ours that for tasks like sentiment classificabest results overall are bold while best results outside of group tion tuning the representations even on are underlined small datasets are likely to perform better than learning generic unsupervised we use the code available at https trec subj sick figure embeddings of vectors on different datasets points are colored based on their labels question type for trec for subj on the sick dataset each point represents sentence pair and points are colored on gradient based on their relatedness labels results best seen in electronic form sentence vector on much bigger datasets finally we observe that the combination is effective particularly on mr this results in very strong new baseline for text classification combine with and train linear model visualizing as final experiment we applied to vectors extracted from trec subj and sick datasets and the visualizations are shown in figure for the sick visualization each point represents sentence pair computed using the concatenation of and absolute difference of features even without the use of relatedness labels vectors learn to accurately capture this property conclusion we evaluated the effectiveness of vectors as an sentence representation with linear classifiers across tasks many of the methods we compare against were only evaluated on task the fact that vectors perform well on all tasks considered highlight the robustness of our representations we believe our model for learning vectors only scratches the surface of possible objectives many variations have yet to be explored including deep encoders and decoders larger context windows encoding and decoding paragraphs other encoders such as convnets it is likely the case that more exploration of this space will result in even higher quality representations acknowledgments we thank geoffrey hinton for suggesting the name we also thank felix hill kelvin xu kyunghyun cho and ilya sutskever for valuable comments and discussion this work was supported by nserc samsung cifar google and onr grant references richard socher alex perelygin jean wu jason chuang christopher manning andrew ng and christopher potts recursive deep models for semantic compositionality over sentiment treebank in emnlp sepp hochreiter and jürgen schmidhuber long memory neural computation nal kalchbrenner edward grefenstette and phil blunsom convolutional neural network for modelling sentences acl yoon kim convolutional neural networks for sentence classification emnlp kyunghyun cho bart van merriënboer dzmitry bahdanau and yoshua bengio on the properties of neural machine translation approaches han zhao zhengdong lu and pascal poupart hierarchical sentence model ijcai quoc le and tomas mikolov distributed representations of sentences and documents icml tomas mikolov kai chen greg corrado and jeffrey dean efficient estimation of word representations in vector space iclr yukun zhu ryan kiros richard zemel ruslan salakhutdinov raquel urtasun antonio torralba and sanja fidler aligning books and movies towards visual explanations by watching movies and reading books in iccv nal kalchbrenner and phil blunsom recurrent continuous translation models in emnlp pages kyunghyun cho bart van merrienboer caglar gulcehre fethi bougares holger schwenk and yoshua bengio learning phrase representations using rnn for statistical machine translation emnlp ilya sutskever oriol vinyals and quoc 
vv le sequence to sequence learning with neural networks in nips dzmitry bahdanau kyunghyun cho and yoshua bengio neural machine translation by jointly learning to align and translate iclr junyoung chung caglar gulcehre kyunghyun cho and yoshua bengio empirical evaluation of gated recurrent neural networks on sequence modeling nips deep learning workshop tomas mikolov quoc le and ilya sutskever exploiting similarities among languages for machine translation arxiv preprint andrew saxe james mcclelland and surya ganguli exact solutions to the nonlinear dynamics of learning in deep linear neural networks iclr diederik kingma and jimmy ba adam method for stochastic optimization iclr alice lai and julia hockenmaier denotational and distributional approach to semantics semeval sergio jimenez george duenas julia baquero alexander gelbukh av juan dios bátiz and av mendizábal combining soft cardinality features for semantic textual similarity relatedness and entailment semeval johannes bjerva johan bos rob van der goot and malvina nissim the meaning factory formal semantics for recognizing textual entailment and determining semantic similarity semeval page jiang zhao tian tian zhu and man lan ecnu one stone two birds ensemble of heterogenous measures for semantic relatedness and textual entailment semeval kai sheng tai richard socher and christopher manning improved semantic representations from long memory networks acl richard socher andrej karpathy quoc le christopher manning and andrew ng grounded compositional semantics for finding and describing images with sentences tacl richard socher eric huang jeffrey pennin christopher manning and andrew ng dynamic pooling and unfolding recursive autoencoders for paraphrase detection in nips andrew finch hwang and eiichiro sumita using machine translation evaluation techniques to determine semantic equivalence in iwp dipanjan das and noah smith paraphrase identification as probabilistic recognition in acl stephen wan mark dras robert dale and cécile paris using features to take the out of paraphrase in proceedings of the australasian language technology workshop nitin madnani joel tetreault and martin chodorow machine translation metrics for paraphrase identification in naacl yangfeng ji and jacob eisenstein discriminative improvements to distributional sentence similarity in emnlp pages marco marelli luisa bentivogli marco baroni raffaella bernardi stefano menini and roberto zamparelli task evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment bill dolan chris quirk and chris brockett unsupervised construction of large paraphrase corpora exploiting massively parallel news sources in proceedings of the international conference on computational linguistics karpathy and deep alignments for generating image descriptions in cvpr benjamin klein guy lev gil sadeh and lior wolf associating neural word embeddings with deep image representations using fisher vectors in cvpr junhua mao wei xu yi yang jiang wang and alan yuille deep captioning with multimodal recurrent neural networks iclr lin michael maire serge belongie james hays pietro perona deva ramanan piotr dollár and lawrence zitnick microsoft coco common objects in context in eccv pages karen simonyan and andrew zisserman very deep convolutional networks for image recognition iclr sida wang and christopher manning baselines and bigrams simple good sentiment and topic classification in acl laurens van der maaten and geoffrey hinton 
visualizing data using t-sne jmlr 
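As a concrete illustration of the relatedness-scoring setup described in the evaluation above — componentwise product and absolute-difference features on a sentence pair, gold scores in [1, 5] encoded as a sparse target distribution, and prediction by taking the expected rating — the following is a minimal numpy sketch. The encoder that produces the sentence vectors u and v is assumed to exist elsewhere; the learning rate and epoch count are illustrative choices, not the paper's settings.

```python
import numpy as np

def pair_features(u, v):
    # componentwise product and absolute difference, concatenated
    return np.concatenate([u * v, np.abs(u - v)])

def score_to_target(y, n_classes=5):
    # sparse distribution p over integer ratings 1..n_classes with expectation y
    p = np.zeros(n_classes)
    f = int(np.floor(y))
    if f >= n_classes:
        p[-1] = 1.0
    else:
        p[f - 1] = f - y + 1.0      # mass on rating floor(y)
        p[f] = y - f                # mass on rating floor(y) + 1
    return p

def train_softmax(X, P, lr=0.1, epochs=200):
    # multinomial logistic regression fit to soft targets with cross-entropy
    W = np.zeros((X.shape[1], P.shape[1]))
    b = np.zeros(P.shape[1])
    for _ in range(epochs):
        Z = X @ W + b
        Z -= Z.max(axis=1, keepdims=True)
        Q = np.exp(Z)
        Q /= Q.sum(axis=1, keepdims=True)
        G = (Q - P) / len(X)        # gradient of mean cross-entropy w.r.t. Z
        W -= lr * X.T @ G
        b -= lr * G.sum(axis=0)
    return W, b

def predict_score(W, b, x, r=np.arange(1, 6)):
    z = x @ W + b
    q = np.exp(z - z.max())
    q /= q.sum()
    return float(r @ q)             # expected rating under the predicted distribution
```

At test time the predicted relatedness is simply the expectation of the rating vector under the predicted distribution, mirroring the description above; appending features from a linear image–sentence embedding model would just extend `pair_features`.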
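The pairwise ranking loss with cosine scoring used above for image–sentence retrieval can be sketched as follows. The margin value and the convention that matching image–sentence pairs sit on the diagonal of the score matrix are assumptions of this minimal example; in practice the contrastive terms are drawn from minibatches.

```python
import numpy as np

def l2norm(A, eps=1e-8):
    return A / (np.linalg.norm(A, axis=1, keepdims=True) + eps)

def ranking_loss(img_feats, sent_vecs, U, V, margin=0.2):
    x = l2norm(img_feats @ U)        # image features mapped into the joint space
    y = l2norm(sent_vecs @ V)        # sentence vectors mapped into the joint space
    scores = x @ y.T                 # cosine similarities, matching pairs on the diagonal
    pos = np.diag(scores)
    cost_s = np.maximum(0.0, margin - pos[:, None] + scores)   # contrastive sentences
    cost_im = np.maximum(0.0, margin - pos[None, :] + scores)  # contrastive images
    np.fill_diagonal(cost_s, 0.0)
    np.fill_diagonal(cost_im, 0.0)
    return cost_s.sum() + cost_im.sum()
```

The only trainable parameters are the two linear maps U and V, matching the point above that the scores are computed using only linear transformations of image and sentence inputs.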
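Recall@K and median rank, the retrieval metrics reported for COCO above, can be computed as below. For brevity this sketch assumes one index-aligned ground-truth candidate per query, whereas COCO supplies several captions per image.

```python
import numpy as np

def retrieval_metrics(scores, ks=(1, 5, 10)):
    # scores[i, j]: similarity of query i to candidate j; ground truth is j == i
    order = np.argsort(-scores, axis=1)
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1
                      for i in range(scores.shape[0])])
    recalls = {k: float(np.mean(ranks <= k)) for k in ks}
    return recalls, float(np.median(ranks))
```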
causal structure learning david danks university pittsburgh pa ddanks sergey plis the mind research network albuquerque nm vince calhoun the mind research network ece university of new mexico albuquerque nm vcalhoun cynthia freeman the mind research network cs university of new mexico albuquerque nm abstract causal structure learning from time series data is major scientific challenge extant algorithms assume that measurements occur sufficiently quickly more precisely they assume approximately equal system and measurement timescales in many domains however measurements occur at significantly slower rate than the underlying system changes but the size of the timescale mismatch is often unknown this paper develops three causal structure learning algorithms each of which discovers all dynamic causal graphs that explain the observed measurement data perhaps given undersampling that is these algorithms all learn causal structure in manner they do not assume any particular relation between the measurement and system timescales we apply these algorithms to data from simulations to gain insight into the challenge of undersampling introduction dynamic causal systems are major focus of scientific investigation in diverse domains including neuroscience economics meteorology and education one significant limitation in all of these sciences is the difficulty of measuring the relevant variables at an appropriate timescale for the particular scientific domain this challenge is particularly salient in neuroimaging standard fmri experiments sample the brain bloodflow approximately every one or two seconds though the underlying neural activity the major driver of bloodflow occurs much more rapidly moreover the precise timescale of the underlying causal system is unknown it is almost certainly faster than the fmri measurements but it is unknown how much faster in this paper we aim to learn the causal structure of system that evolves at timescale τs given measurements at timescale τm we focus on the case in which τs is faster than τm to an unknown degree we assume that the underlying causal structure can be modeled as directed graphical model without simultaneous influence there has been substantial work on modeling the statistics of time series but relatively less on learning causal structure and almost all of that assumes that the measurement and causal timescales match the problem of causal learning from undersampled time series data was explicitly addressed by but they assumed that the degree of the ratio of τs to τm both known and small in contrast we focus on the significantly harder challenge of causal learning when that ratio is unknown we provide formal specification of the problem and representational framework in section we then present three different structure learning rasl algorithms in section we finish in section by exploring their performance on synthetic data representation and formalism dynamic causal graphical model consists of graph over random variables at the current time as well as nodes for at all previous relative timesteps that contain direct cause of variable at the current the markov order of the system is the largest such that vjt where superscripts denote timestep we assume throughout that the true underlying causal system is markov order and that all causally relevant variables are finally we assume that there are no isochronal causal edges vit vjt causal influences inevitably take time to propagate and so any apparent isochronal edge will disappear when measured sufficiently 
finely since we do not assume that the causal timescale τs is known this is relatively innocuous assumption is thus over nodes where the only edges are vjt where possibly there is additionally conditional probability distribution or density vt which we assume to be we do not however assume stationarity of vt finally we assume appropriate versions of the markov variable is independent of given parents and the only independencies are those implied by markov assumptions such that the graph and probability mutually constrain each other let tk be the measurement timesteps we undersample at rate when we measure only timesteps tu tuk the causal timescale is thus undersampled at rate we denote the causal graph resulting from undersampling at rate by gu to obtain gu we unroll by introducing nodes for that bear the same graphical and parametric relations to as those variables bear to vt and iterate until we have included we then marginalize out all variables except those in vt and marginalization yields an acyclic directed mixed graph admg gu containing both directed and bidirected edges vjt in gu iff there is directed path from to vjt in the unrolled graph define trek to be pair of directed paths such that both have the same start variable vit vjt in gu iff there is trek between vit and vjt with length length clearly if bidirected edge occurs in gm then it occurs in gu for all can be computationally complex due to duplication of nodes and so we instead use compressed graphs that encode temporal relations in edges for an arbitrary dynamic causal graph is its compressed graph representation is over nodes for ii vi vj in iff vjt in and iii vi vj in iff vit vjt in compressed graphs can be cyclic vi vj for vjt and vit including there is clearly mapping between dynamic admgs and compressed graphs computationally the effects of undersampling at rate can be computed in compressed graph simply by finding directed paths of length in more precisely vjt in iff there is directed path of length in similarly vit vjt in iff there is trek with length length in we thus use compressed graphs going forward algorithms the core question of this paper is given for unknown what can be inferred about let jhk be the equivalence class of that could for some undersample rate yield we are thus trying to learn jhk from an obvious algorithm is for each possible compute the corresponding graphs for all and then output all equally obviously this algorithm will be computationally intractable for any reasonable as there are possible and can in theory be arbitrarily large instead we pursue three different constructive strategies that more efficiently build the members of jhk sections and because these algorithms make no assumptions about we refer to them each as agnostic structure use subscripts to distinguish between different types first though we provide some key theoretical results about forward inference that will be used by all three algorithms we use difference equations in our analyses the results and algorithms will be applicable to systems of differential equations to the extent that they can be approximated by system of difference equations more precisely we assume dynamic variant of the causal sufficiency assumption though it is more complicated than just no unmeasured common nonparametric forward inference for given and there is an efficient algorithm for calculating but it is only useful in learning if we have stopping rules that constrain which and should ever be considered these rules will depend on how changes as key 
notion is strongly connected component scc in maximal set of variables such that for every possibly there is directed path from to sccs are clearly cyclic and can provably be decomposed into set of possibly overlapping simple loops those in which no node is repeated σs let ls be the set of those simple loop lengths one stopping rule must specify for given which to consider for single scc the greatest common divisor of simple loop lengths where gcd ls for singleton is key gcd ls iff that is gcd determines whether an scc converges to fixedpoint graph as we can constrain if there is such graph and theorem generalizes theorem to provide an upper bound on interesting all proofs found in supplement theorem if gcd ls then stabilization occurs at nf where nf is the frobenius is the graph diameter and is the transit number see supplement this is theoretically useful bound but is not practically helpful since neither nor nf have known analytic expression moreover gcd ls is weak restriction but restriction nonetheless we instead use functional stopping rule for theorem that holds for all theorem if for then kw that is as increases if we find graph that we previously encountered then there can not be any new graphs as for given we can thus determine all possible corresponding undersampled graphs by computing until we encounter graph this stopping rule enables us to correctly constrain the that are considered for each we also require stopping rule for as we can not evaluate all possible graphs for any reasonable the key theoretical result is theorem if then let ge be the graph resulting from adding the edges in to since this is simply another graph it can be undersampled at rate denote the result ge since ge can always serve as in theorem we immediately have the following two corollaries corollary if then ge corollary if then ge we thus have stopping rule for some candidate if is not an of for all then do not consider any of this stopping rule fits very cleanly with constructive algorithms that iteratively add edge to candidate we now develop three such algorithms recursive edgewise inverse algorithm the two stopping rules naturally suggest recursive structure learning algorithm with as input and jhk as output start with an empty graph for each edge of possible edges construct containing only if for all then reject else if for some then add to jhk else recurse into graphs in order effectively this is depth first search dfs algorithm on the solution tree denote it as raslre for recursive figure provides and figure shows how one dfs path in the search tree unfolds we can prove theorem the raslre algorithm is correct and complete one significant drawback of raslre is that the same graph can be constructed in many different ways corresponding to different orders of edge addition the search tree is actually search latp for set of positive integers with gcd nf is the max integer with nf αi bi for αi this check requires at most min eu eh fast operations where eu eh are the number of edges in respectively this equality check occurs relatively rarely since and must be algorithm recursiveeqclass input output jhk initialize empty graph and set candidate edges begin edgeadder if has elements then forall the edges in do if edge creates conflict then remove it from if has elements then forall the edges in do add the edge to if then add to edgeadder the edge remove the edge from constructed graph pruned candidate edges next edge to add no graph constructed along this branch generates put all edges into list edgeadder 
return raslre algorithm ground truth no more candidates backtrack branch of the search tree figure raslre algorithm specification and search tree example tice the algorithm is thus unnecessarily inefficient even when we use dynamic programming via memoization of input graphs an iterative edgecentric inverse algorithm to minimize multiple constructions of the same graph we can use raslie iterative edgewise which generates at stage all with exactly edges more precisely at stage raslie starts with the empty graph if is also empty then it adds the empty graph to jhk otherwise it moves to stage in general at stage raslie considers each graph resulting from single edge addition to an acceptable graph at stage rejects if it conflicts for all with otherwise keeps as acceptable at and if then adds to jhk raslie continues until there are no more edges to add or it reaches stage figure provides the main loop figure and core function of raslie figure as well as an example of the number of graphs potentially considered at each stage figure raslie provides significant and memory gains over raslre see figure we optimize raslie by tracking the single edges that could possibly still be added for example if graph is rejected in stage then do not consider adding that edge at other stages additional conflicts can be derived analytically further reducing the graphs to consider in general absence of an edge in implies for the corresponding unknown absence of length paths in all jhk since we do not know we can not directly apply this constraint however lemmas and provide useful special case constraints for implied by single bidirected edge lemma if then can not contain any of the following paths lemma if then an iterative loopcentric inverse algorithm raslie yields results in reasonable time for with up to nodes though it is computationally demanding we can gain further computational advantages if we assume that is an scc this assumption is relatively innocuous as it requires only that our time series be generated by system with appropriate feedback loops as noted earlier any scc is composed of set of simple loops and so we modify raslie to iteratively add loops instead of edges call the resulting algorithm algorithm iterativeeqclass input output jhk initialize empty sets init as an empty graph and edges while do si extiterationgraphs si return raslie main algorithm rasl input run run iteration index number of graphs at the iteration run graphs histogram procedure nextiterationgraphs input graph edges structure and output dr and set jhk initialize empty structure dr and sets si forall the graphs in do forall the edges in do if then if conflicts with then continue add to if si then add to si if conflicts with then continue if then add to remove from add graphs edges to dr return dr core function of raslie three runs of the algorithm figure raslie algorithm main loop example of graphs considered and core function raslil for iterative more precisely raslil uses the same algorithm as in figure but successively attempts to add simple loops rather than edges raslil also incorporates the additional constraints due to lemmas and raslil is surprisingly faster than raslie even though for pn much nodes there are ni simple loops compared to edges the key is that introducing single simple loop induces multiple constraints simultaneously and so conflicting graphs are discovered at much earlier stage as result raslil checks many fewer graphs in practice for example consider the in figure with corresponding for raslre 
constructs not counting pruned single edges graphs raslie constructs only graphs and raslil considers only for these numbers are and respectively unsurprisingly these differences in numbers of examined graphs translate directly into wall clock time differences figure figure comparison results all three rasl algorithms take measurement timescale graph as input they are therefore compatible with any structure learning algorithm that outputs measurement timescale graph whether structural vector autoregression svar direct dynamic bayes net search or modifications of standard causal structure learning algorithms such as pc and ges the problem of learning measurement timescale graph is very hard one but is also not our primary focus here instead we focus on the performance of the novel rasl algorithms first we abstract away from learning measurement timescale structure and assume that the correct is provided as input for these simulated graphs we focus on sccs which are the most scientifically interesting cases for simplicity and because structure can be learned in parallel for complex we employ graphs to generate random sccs we build single simple loop over nodes and ii uniformly sample from the other possible edges until we reach the specified density proportion of the total possible edges we employ density in order to measure graph complexity in an approximately way we can improve the runtime speed of raslre using memoization though it is then for figure provides the running times for all three rasl algorithms applied to random graphs at each of three densities this graph substantiates our earlier claims that raslil is faster than raslie which is faster than raslre in fact each is at least an order of magnitude faster than the previous one raslre would take over year on the most difficult problems so we focus exclusively on raslil unsurprisingly complexity of all rasl algorithms depends on the density of for each of three density values we generated random sccs which were then undersampled at rates and before being provided as input to raslil figure summarizes wall clock computation time as function of density with different plots based on density of and undersampling rate we also show three examples of with range of computation runtime unsurprisingly the most difficult is quite dense with densities below typically require less than one minute figure behavior equivalence classes we first use raslil to determine jhk size and composition for varying that is we explore the degree of underdetermination produced by undersampling the underdetermination occurs if is with every possible edge any scc with gcd ls becomes as so jhk contains all such graphs for we thus note when is rather than computing the size of jhk graphs graphs density superclique density figure size of equivalence classes for random sccs at each density and figure size of equivalence classes for larger graphs for figures and plot equivalence class size as function of both density and the true undersampling rate for each and density we generated random ii undersampled each at indicated iii passed to raslil and iv computed the size of jhk interestingly jhk is typically quite small sometimes even singleton for example graphs at typically have singleton jhk up to density even graphs often have singleton jhk though with relatively sparse increased undersampling and density both clearly worsen underdetermination but often not intractably so particularly since even nonsingleton jhk can be valuable if they permit post hoc inspection or 
analysis equivalence class size edge density graphs undersampling rate figure effect of the undersampling rate on equivalence class size to focus on the impact of undersampling we generated random sccs with density each of which was undersampled for figure plots the size of jhk as function of for these graphs for singleton jhk still dominate interestingly even still yields some graphs graphs rate figure distribution of for for jhk for and graphs finally jhk iff but the appropriate need not be the same for all members of jhk figure plots the percentages of appropriate for each jhk for the from figure if actually utrue then almost all jhk are because of there are rarely jhk due to if actually utrue though then many jhk are due to where utrue as density and utrue increase there is increased underdetermination in both and synthetic data in practice we typically must learn structure from finite sample data as noted earlier there are many algorithms for learning as it is measurement timescale structure though small modifications are required to learn bidirected edges in pilot testing we found that structural vector autoregressive svar model optimization provided the most accurate and stable solutions for for our simulation regime we thus employ the svar procedure here though we note that other measurement timescale learning algorithms might figure the estimation and search errors on synwork better in different domains thetic data graphs per density to test the learning passed to raslil generated random sccs for each density in for each random graph we generated random transition matrix by sampling weights for the elements of the adjacency matrix and controlling system stability by keeping the maximal eigenvalue at or below we then generated time series data using vector var model with and random noise to simulate undersampling datapoints were removed to yield svar optimization on the resulting time series yielded candidate that was passed to raslil to obtain jhk the space of possible is factor of larger than the space of possible and so svar optimization can return an such that jhk if raslil returns then we rerun it on all that result from single edge addition or deletion on if raslil returns for all of those graphs then we consider the that result from two changes to then three changes this search through the hamming neighborhood of essentially always finds an with figure shows the results of the process where algorithm output is evaluated by two omission error the number of omitted edges normalized to the total number of edges in the ground truth comission error number of edges not present in the ground truth normalized to the total possible edges minus the number of those present in the ground truth we also plot the estimation errors of svar on the undersampled data to capture the dependence of raslil estimation errors on estimation errors for interestingly raslil does not significantly increase the error rates over those produced by the svar estimation in fact we find the contrary similarly to the requirement to use an that could be generated by some undersampled functions as regularization constraint that corrects for some svar estimation errors conclusion time series data are widespread in many scientific domains but if the measurement and system timescales differ then we can make significant causal inference errors despite this potential for numerous errors there have been only limited attempts to address this problem and even those methods required strong assumptions about the 
undersample rate we here provided the first causal inference algorithms that can reliably learn causal structure from time series data when the system and measurement timescales diverge to an unknown degree the rasl algorithms are complex but not restricted to toy problems we also showed that underdetermination of is sometimes minimal given the right methods jhk was often small substantial system timescale causal structure could be learned from undersampled measurement timescale data significant open problems remain such as more efficient methods when has jhk this paper has however expanded our causal inference toolbox to include cases of unknown undersampling acknowledgments sp dd contributed equally this work was supported by awards nih sp nsf sp nsf dd nih dd from the national human genome research institute through funds provided by the big data to knowledge initiative the content is solely the responsibility of the authors and does not necessarily represent the official views of the national institutes of health references moneta chlaß entner and hoyer causal search in structural vector autoregressive models in journal of machine learning research workshop and conference proceedings causality in time series proc on causality in time series volume pages granger investigating causal relations by econometric models and methods econometrica journal of the econometric society pages thiesson chickering heckerman and meek arma modeling with graphical models in proceedings of the twentieth conference annual conference on uncertainty in artificial intelligence pages arlington virginia auai press mark voortman denver dash and marek druzdzel learning why things change the causality learner in proceedings of the annual conference on uncertainty in artificial intelligence uai pages corvallis oregon auai press nir friedman kevin murphy and stuart russell learning the structure of dynamic probabilistic networks in annual conference on uncertainty in artificial intelligence pages san francisco morgan kaufmann sergey plis david danks and jianyu yang mesochronal structure learning in proceedings of the conference annual conference on uncertainty in artificial intelligence corvallis oregon auai press mingming gong kun zhang bernhard schoelkopf dacheng tao and philipp geiger discovering temporal causal relations from subsampled data in proc icml pages richardson and spirtes ancestral graph markov models the annals of statistics david danks and sergey plis learning causal structure from undersampled time series in jmlr workshop and conference proceedings volume pages donald johnson finding all the elementary circuits of directed graph siam journal on computing helmut new introduction to multiple time series analysis springer science business media murphy dynamic bayesian networks representation inference and learning phd thesis uc berkeley clark glymour peter spirtes and richard scheines causal inference in erkenntnis orientated centennial volume for rudolf carnap and hans reichenbach pages springer david maxwell chickering optimal structure identification with greedy search the journal of machine learning research anil seth paul chorley and lionel barnett granger causality analysis of fmri bold signals is invariant to hemodynamic convolution but not downsampling neuroimage 
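The forward-inference step that all three RASL algorithms rely on — computing the undersampled graph G^u of a compressed graph H by finding directed walks of length u, and bidirected edges from equal-length treks shorter than u — can be sketched with boolean adjacency matrices. This is a direct reading of the description above; variable names are ours.

```python
import numpy as np

def undersample(A, u):
    # A[i, j] = True iff i -> j in the compressed graph H; returns edges of G^u
    A = A.astype(bool)
    n = A.shape[0]
    P = A.copy()                               # walks of length l (starting at l = 1)
    bidir = np.zeros((n, n), dtype=bool)
    for _ in range(1, u):                      # l = 1 .. u - 1
        # a common ancestor k reaching i and j in exactly l steps induces i <-> j
        bidir |= (P.T.astype(int) @ P.astype(int)) > 0
        P = (P.astype(int) @ A.astype(int)) > 0
    np.fill_diagonal(bidir, False)
    return P, bidir                            # P now marks walks of length exactly u
```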
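The two stopping rules can then be combined into a simple consistency check on a candidate compressed graph: enumerate its undersamples until a previously seen graph recurs (no new graphs can appear after that), prune the candidate if the observed graph never contains one of them, and accept it into the equivalence class when some undersample matches exactly. This is a much-simplified sketch of the bookkeeping that the RASL algorithms perform incrementally; it reuses `undersample` from the previous sketch, and the exact pruning logic should be checked against the algorithm specification above.

```python
import numpy as np

def all_undersamples(A):
    # G^1, G^2, ... until a repeat; by the stopping theorem this set is then complete
    seen, out, u = set(), [], 1
    while True:
        d, b = undersample(A, u)
        key = (d.tobytes(), b.tobytes())
        if key in seen:
            return out
        seen.add(key)
        out.append((d, b))
        u += 1

def contains(big_d, big_b, small_d, small_b):
    # every edge of the smaller graph appears in the bigger one
    return bool(np.all(big_d[small_d]) and np.all(big_b[small_b]))

def no_conflict(A_candidate, obs_d, obs_b):
    # if no undersample stays inside the observed edges, no edge-superset can either
    return any(contains(obs_d, obs_b, d, b) for d, b in all_undersamples(A_candidate))

def generates(A_candidate, obs_d, obs_b):
    # membership in the equivalence class: some G^u equals the observed graph exactly
    return any(np.array_equal(d, obs_d) and np.array_equal(b, obs_b)
               for d, b in all_undersamples(A_candidate))
```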
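The synthetic-data protocol used above — a random transition matrix supported on the ground-truth graph, rescaled to keep the system stable, simulated as a VAR process and then subsampled — might look like this; the spectral-radius target and noise scale are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def simulate_undersampled_var(A, T=2000, u=2, radius=0.95, noise=1.0, seed=None):
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    W = A.astype(float) * rng.normal(size=(n, n))                 # weights only on true edges
    W *= radius / max(np.abs(np.linalg.eigvals(W)).max(), 1e-12)  # enforce stability
    X = np.zeros((T, n))
    for t in range(1, T):
        X[t] = X[t - 1] @ W + noise * rng.normal(size=n)
    return X[::u], W                                              # keep every u-th sample
```

An SVAR-style estimate fitted to the subsampled series then plays the role of the measurement-timescale input passed to RASL.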
principal geodesic analysis for probability measures under the optimal transport metric vivien seguy graduate school of informatics kyoto university marco cuturi graduate school of informatics kyoto university mcuturi abstract given family of probability measures in the space of probability measures on hilbert space our goal in this paper is to highlight one ore more curves in that summarize efficiently that family we propose to study this problem under the optimal transport wasserstein geometry using curves that are restricted to be geodesic segments under that metric we show that concepts that play key role in euclidean pca such as data centering or orthogonality of principal directions find natural equivalent in the optimal transport geometry using wasserstein means and differential geometry the implementation of these ideas is however computationally challenging to achieve scalable algorithms that can handle thousands of measures we propose to use relaxed definition for geodesics and regularized optimal transport distances the interest of our approach is demonstrated on images seen either as shapes or color histograms introduction optimal transport distances villani wasserstein or earth mover distances define powerful geometry to compare probability measures supported on metric space the wasserstein space space of probability measures on endowed with the wasserstein metric space which has received ample interest from theoretical perspective given the prominent role played by probability measures and feature histograms in machine learning the properties of can also have practical implications in data science this was shown by agueh and carlier who described first wasserstein means of probability measures wasserstein means have been recently used in bayesian inference srivastava et clustering cuturi and doucet graphics solomon et or brain imaging gramfort et when is not just metric but also hilbert space is an riemannian manifold ambrosio et al chap villani part ii three recent contributions by boissard et al bigot et al and wang et al exploit directly or indirectly this structure to extend principal component analysis pca to these important seminal papers are however limited in their applicability the type of curves they output our goal in this paper is to propose more general and scalable algorithms to carry out wasserstein principal geodesic analysis on probability measures and not simply dimensionality reduction as explained below principal geodesics in dimensionality reduction on we provide in fig simple example that illustrates the motivation of this paper and which also shows how our approach differentiates itself from existing dimensionality reduction algorithms linear and that draw inspiration from pca as shown in fig linear pca can not produce components that remain in even more advanced tools such as those proposed by hastie and stuetzle fall slightly short of that goal on the other hand wasserstein geodesic analysis yields geodesic components in that are easy to interpret and which can also be used to reduce dimensionality wasserstein principal geodesics euclidean principal components principal curve figure dataset images of single chinese character randomly translated scaled and slightly rotated images displayed out of used each image is handled as normalized histogram of intensities dataset schematically drawn on the wasserstein principal geodesics of this dataset are depicted in red its euclidean components in blue and its principal curve verbeek et in yellow right actual 
curves blue colors depict negative intensities green intensities neither the euclidean components nor the principal curve belong to nor can they be interpreted as meaningful axis of variation foundations of pca and riemannian extensions carrying out pca on family xn of points taken in space can be described in abstract terms as define mean element for that dataset ii define family of components in typically geodesic curves that contain iii fit component by making it follow the xi as closely as possible in the sense that the sum of the distances of each point xi to that component is minimized iv fit additional components by iterating step iii several times with the added constraint that each new component is different orthogonal enough to the previous components when is euclidean and the xi are vectors in rd the component can be computed iteratively by solving argmin def def min kxi tv where and vn span vn since pca is known to boil down to simple when is euclidean or hilbertian et eq looks artificially complicated this formulation is however extremely useful to generalize pca to riemannian manifolds fletcher et this generalization proceeds first by replacing vector means lines and orthogonality conditions using respectively means geodesics and orthogonality in tangent spaces riemannian pca builds then upon the knowledge of the exponential map at each point of the manifold each exponential map expx is locally bijective between the tangent space tx of and after computing the mean of the dataset the logarithmic map at the inverse of is used to map all data points xi onto because is euclidean space by definition of riemannian manifolds the dataset xi can be studied using euclidean pca principal geodesics in can then be recovered by applying the exponential map to principal component tv from riemannian pca to wasserstein pca related work as remarked by bigot et al fletcher et approach can not be used as it is to define wasserstein geodesic pca because is infinite dimensional and because there are no known ways to define exponential maps which are locally bijective between wasserstein tangent spaces and the manifold of probability measures to circumvent this problem boissard et al bigot et al have proposed to formulate the geodesic pca problem directly as an optimization problem over curves in boissard et al and bigot et al study the wasserstein pca problem in restricted scenarios bigot et al focus their attention on measures supported on which considerably simplifies their analysis since it is known in that case that the wasserstein space can be embedded isometrically in boissard et al assume that each input measure has been generated from single template density the mean measure which has been transformed according to one admissible deformation taken in parameterized family of deformation maps their approach to wasserstein pca boils down to functional pca on such maps wang et al proposed more general approach given family of input empirical measuresp µn they propose to compute first template measure using clustering on µi they consider next all optimal transport plans πi between that template and each of the measures µi and propose to compute the barycentric projection see eq of each optimal transport plan πi to recover monge maps ti on which standard pca can be used this approach is computationally attractive since it requires the computation of only one optimal transport per input measure its weakness lies however in the fact that the curves in obtained by displacing along each of these pca 
directions are not geodesics in general contributions and outline we propose new algorithm to compute wasserstein principal geodesics wpg in for arbitrary hilbert spaces we use several of the optimal transport metric and of its obtain tractable algorithms that can scale to thousands of measures we provide first in review of the key concepts used in this paper namely wasserstein distances and means geodesics and tangent spaces in the wasserstein space we propose in to parameterize wasserstein principal component pc using two velocity fields defined on the support of the wasserstein mean of all measures and formulate the wpg problem as that of optimizing these velocity fields so that the average distance of all measures to that pc is minimal this problem is and we propose to optimize smooth upperbounds of that objective using entropy regularized optimal transport in the practical interest of our approach is demonstrated in on toy samples datasets of shapes and histograms of colors notations we write ha for the frobenius of matrices and is the diagonal matrix of vector for mapping we say that acts on measure through the pushforward operator to define new measure this measure is characterized by the identity for any borel set we write and for the canonical projection operators defined as and background on optimal transport wasserstein distances we start this section with the main mathematical object of this paper definition villani def let the space of probability measures on hilbert space let be the set of probability measures on with marginals and and the squared distance between and in is defined as kx dπ inf wasserstein barycenters given family of probability measures µn in and weights rn agueh and carlier define the wasserstein barycenter of these measures argmin λi µi our paper relies on several algorithms which have been recently proposed benamou et bonneel et carlier et cuturi and doucet to compute such barycenters wasserstein geodesics given two measures and let be the set of optimal couplings for eq informally speaking it is well known that if either or are absolutely continuous measures then any optimal coupling is degenerated in the sense that assuming for instance that is absolutely continuous for all in the support of only one point is such that dπ in that case the optimal transport is said to have no mass splitting and there exists an optimal mapping such that can be written using pushforward as id when there is no mass splitting to transport to mccann interpolant gt id tt defines geodesic curve in the wasserstein space gt is locally the shortest path between any two measures located on the geodesic with respect to in the more general case where no optimal map exists and mass splitting occurs for some locations one may have dπ for several then geodesic can still be defined but it relies on the optimal plan instead gt ambrosio et both cases are shown in fig geodesic geodesic figure both plots display geodesic curves between two empirical measures and on an optimal map exists in the left plot no mass splitting occurs whereas some of the mass of needs to be split to be transported onto on the right plot tangent space and tangent vectors we briefly describe in this section the tangent spaces of and refer to ambrosio et chap for more details let be curve in for given time the tangent space of at µt is subset of µt the space of velocity fields supported on supp µt at any there exists tangent vectors vt in µt such that id hvt µt given geodesic curve in parameterized as eq its 
corresponding tangent vector at time zero is id wasserstein principal geodesics geodesic parameterization the goal of principal geodesic analysis is to define geodesic curves in that go through the mean and which pass close enough to all target measures µi to that end geodesic curves can be parameterized with two end points and however to avoid dealing with the constraint that principal geodesic needs to go through one can start instead from and consider velocity field which displaces all of the mass of in both directions def gt id tv lemma of ambrosio et al implies that any geodesic going through can be written as eq hence we do not lose any generality using this parameterization however given an arbitrary vector field the curve gt is not necessarily geodesic indeed the maps id are def not necessarily in the set id of maps that are optimal when moving mass away from ensuring thus at each step of our algorithm that is still such that gt is geodesic curve is particularly challenging to relax this strong assumption we propose to use generalized formulation of geodesics which builds upon not one but two velocity fields as introduced by ambrosio et al definition adapted from ambrosio et let and assume there is an optimal mapping from to and an optimal mapping from to generalized geodesic illustrated in fig between and with base is defined by gt tt choosing as the base measure in definition and two fields such that id id are optimal mappings in we can define the following generalized geodesic gt def gt id for generalized geodesics become true geodesics when and are positively proportional we can thus consider regularizer that controls the deviation from that property by defining which is minimal when and are indeed positively proportional we can now formulate the wpg problem as computing for the th principal generalized geodesic component of family of measures µi by solving with id id min λω min gt µi span this problem is not convex in we pose to find an approximation of that minimum by projected gradient descent with tion that is to be understood in terms of an ternative metric on the space of vector fields to preserve the optimality of the mappings id and id between iteraσ tions we introduce in the next paragraph able projection operator on gσ remark trivial way to ensure that gt gσ is geodesic is to impose that the vector field is translation namely that is uniformly equal to vector on all of supp one can show in that case that the wpg problem described in eq outputs an optimal vector which is the euclidean principal component of the figure generalized geodesic interpolation between two empirical measures and using the ily formed by the means of each measure µi base measure all defined on projection on the optimal mapping set we use projected gradient descent method to solve eq approximately we will compute the gradient of local of the objective of eq and update and accordingly we then need to ensure that and are such that id and id belong to the set of optimal mappings to do so we would ideally want to compute the projection of id in argmin id to update id westdickenberg has shown that the set of optimal mappings is convex closed cone in leading to the existence and the unicity of the solution of eq however there is to our knowledge no known method to compute the projection of id there is nevertheless well known and efficient approach to find mapping in which is close to id that approach known as the the barycentric projection requires to compute first an optimal coupling between and id to 
define then conditional expectation map def tπ ydπ ambrosio et al theorem or reich lemma have shown that tπ is indeed an optimal mapping between and tπ we can thus set the velocity field as tπ id to carry out an approximate projection we show in the supplementary material that this operator can be in fact interpreted as projection under on computing principal generalized geodesics in practice we show in this section that when rd the steps outlined above can be implemented efficiently input measures and their barycenter each input measure in the family µn is finite weighted sum of diracs described by ni points contained in matrix xi of size ni and nonnegative weight vector aip of dimension ni summing to the wasserstein mean of these measures is given and equal to bk δyk where the nonnegative vector bp sums to one and yp is the matrix containing locations of generalized geodesic two velocity vectors for each of the points in are needed to parameterize generalized geodesic these velocity fields will be represented by two matrices and in assuming that these velocity fields yield optimal mappings the points at time of that generalized geodesic are the measures parameterized by def gt bk δzkt with locations zt zpt the squared distance between datum µi and point gt on the geodesic is gt µi min hp mzt xi ai ai and mzt xi where ai is the transportation polytope stands for the ni matrix of distances between the and ni column vectors of zt and xi respectively writing zt ztt zt and xi xit xi we have that mzt xi zt xti xi which by taking into account the marginal conditions on ai leads to hp mzt xi bt zt ati xi ztt xi majorization of the distance of each µi to the principal geodesic using eq the distance between each µi and the pc gt can be cast as function fi of def fi min zt ai xi min xi ai where we have replaced zt above by its explicit form in to highlight that the objective above is quadratic convex plus piecewise linear concave as function of and thus neither convex nor concave assume that we are given and that are approximate for fi for any in we thus have that each distance fi appearing in eq is such that def fi hp mzt xi we can thus use procedure hunter and lange to minimize the sum of terms fi by iteratively creating majorization functions at each iterate all functions mvi are quadratic convex given that we need to ensure that these velocity fields yield optimal mappings and that they may also need to satisfy orthogonality constraints with respect to principal components we use gradient steps to update which can be recovered using cuturi and doucet and the chain rule as mvi zt xi zt xi efficient approximation of and as discussed above gradients for majorization functions mvi can be obtained using approximate minima and for each function fi because the objective of eq is not convex we propose to do an exhaustive grid search with values in this approach would still require in theory to solve optimal transport problems to solve eq for each of the input measures to carry out this step efficiently we propose to use entropy regularized transport cuturi which allows for much faster computations and efficient parallelizations to recover approximately optimal transports projected gradient update velocity fields are updated with gradient stepsize span followed by projection step to enforce that and lie in th in the sense when computing the pc we finally apply the barycentric projection operator defined in the end of we first need to compute two optimal transport plans argmin hp my argmin hp my to form 
the barycentric projections which then yield updated velocity vectors we repeat steps until convergence is given in the supplementary material experiments figure wasserstein mean and first pc computed on dataset of four left and three right empirical measures the second pc is also displayed in the right figure toy samples we first run our algorithm on two simple synthetic examples we consider respectively and empirical measures supported on small number of locations in so that we can compute their exact wasserstein means using the linear programming formulation given in agueh and carlier these measures and their mean red squares are shown in fig the first principal component on the left example is able to capture both the variability of average measure locations from left to right and also the variability in the spread of the measure locations on the right example the first principal component captures the overall elliptic shape of the supports of all considered measures the second principal component reflects the variability in the parameters of each ellipse on which measures are located the variability in the weights of each location is also captured through the wasserstein mean since each single line of generalized geodesic has corresponding location and weight in the wasserstein mean mnist for each of the digits ranging from to we sample images in the mnist database representing that digit each image originally grayscale image is converted into probability distribution on that grid by normalizing each intensity by the total intensity in the image we compute the wasserstein mean for each digit using the approach of benamou et al we then follow our approach to compute the first three principal geodesics for each digit geodesics for four of these digits are displayed in fig by showing intermediary rasterized measures on the curves while some deformations in these curves can be attributed to relatively simple rotations around the digit center more interesting deformations appear in some of the curves such as the the loop on the bottom left of digit our results are easy to interpret unlike those obtained with wang et approach on these datasets see supplementary material fig displays the first pc obtained on subset of mnist composed of images of and in equal proportions figure images for each of the digits were sampled from the mnist database we display above the first three pcs sampled at times tk for each of these digits color histograms we consider subset of the dataset composed of three image categories waterfalls tomatoes and tennis balls resulting in set of color images the pixels figure first pc on subset of mnist composed of one thousand and one thousand contained in each image can be seen as in the rgb color space we use quantization to reduce the size of these uniform into set of weighted points using cluster assignments to define the weights of each of the cluster centroids each image can be thus regarded as discrete probability measure of atoms in the tridimensional rgb space we then compute the wasserstein barycenter of these measures supported on locations using cuturi and doucet principal components are then computed as described in the computation for single pc is performed within minutes on an imac intel core fig displays color palettes sampled along each of the first three pcs the first pc suggests that the main source of color variability in the dataset is the illumination each pixel going from dark to light second and third pcs display the variation of colors induced by 
the typical images dominant colors blue red yellow fig displays the second pc along with three images projected on that curve the projection of given image on pc is obtained by finding first the optimal time such that the distance of that image to the pc at is minimum and then by computing an optimal color transfer et between the original image and the histogram at time figure each row represents pc displayed at regular time intervals from left to right from the first pc top to the third pc bottom figure color palettes from the second pc on the left on the right displayed at times images displayed in the top row are original their projection on the pc is displayed below using color transfer with the palette in the pc to which they are the closest conclusion we have proposed an approximate projected gradient descent method to compute generalized geodesic principal components for probability measures our experiments suggest that these principal geodesics may be useful to analyze shapes and distributions and that they do not require any parameterization of shapes or deformations to be used in practice aknowledgements mc acknowledges the support of jsps young researcher grant references martial agueh and guillaume carlier barycenters in the wasserstein space siam journal on mathematical analysis luigi ambrosio nicola gigli and giuseppe gradient flows in metric spaces and in the space of probability measures springer benamou guillaume carlier marco cuturi luca nenna and gabriel iterative bregman projections for regularized transportation problems siam journal on scientific computing bigot gouet thierry klein and alfredo geodesic pca in the wasserstein space by convex pca annales de institut henri probability and statistics emmanuel boissard thibaut le gouic loubes et al distributions template estimate with wasserstein metrics bernoulli nicolas bonneel julien rabin gabriel and hanspeter pfister sliced and radon wasserstein barycenters of measures journal of mathematical imaging and vision guillaume carlier adam oberman and edouard oudet numerical methods for matching for teams and wasserstein barycenters esaim mathematical modelling and numerical analysis to appear marco cuturi sinkhorn distances lightspeed computation of optimal transport in advances in neural information processing systems pages marco cuturi and arnaud doucet fast computation of wasserstein barycenters in proceedings of the international conference on machine learning pages thomas fletcher conglin lu stephen pizer and sarang joshi principal geodesic analysis for the study of nonlinear statistics of shape medical imaging ieee transactions on maurice les de nature quelconque dans un espace in annales de institut henri volume pages presses universitaires de france alexandre gramfort gabriel and marco cuturi fast optimal transport averaging of neuroimaging data in information processing in medical imaging ipmi springer trevor hastie and werner stuetzle principal curves journal of the american statistical association david hunter and kenneth lange quantile regression via an mm algorithm journal of computational and graphical statistics robert mccann convexity principle for interacting gases advances in mathematics anil kokaram and rozenn dahyot automated colour grading using colour distribution transfer computer vision and image understanding sebastian reich nonparametric ensemble transform method for bayesian inference siam journal on scientific computing bernhard alexander smola and kernel principal component analysis in 
artificial neural networks icann pages springer justin solomon fernando de goes gabriel marco cuturi adrian butscher andy nguyen tao du and leonidas guibas convolutional wasserstein distances efficient optimal transportation on geometric domains acm transactions on graphics proc siggraph sanvesh srivastava volkan cevher quoc and david dunson wasp scalable bayes via barycenters of subset posteriors in proceedings of the eighteenth international conference on artificial intelligence and statistics pages jakob verbeek nikos vlassis and algorithm for finding principal curves pattern recognition letters villani optimal transport old and new volume springer wei wang dejan saurav basu john ozolek and gustavo rohde linear optimal transportation framework for quantifying and visualizing variations in sets of images international journal of computer vision michael westdickenberg projections onto the cone of optimal transport maps and compressible fluid flows journal of hyperbolic differential equations 
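As a companion to the MNIST experiment described above, which computes the Wasserstein mean of each digit class with the iterative Bregman projections of Benamou et al., the following is a minimal numpy sketch of that fixed-point scheme for histograms supported on a shared grid. The function name, the entropic regularization strength `reg`, and the ground-cost matrix `cost` are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sinkhorn_barycenter(measures, cost, reg=1e-2, weights=None, n_iter=200):
    """Entropy-regularized Wasserstein barycenter of discrete measures on a
    shared support, via iterative Bregman projections (Benamou et al. style).
    measures: (K, N) array of histograms, cost: (N, N) ground-cost matrix,
    reg: entropic regularization strength.  Names are illustrative."""
    P = np.asarray(measures, dtype=float)
    K, N = P.shape
    w = np.full(K, 1.0 / K) if weights is None else np.asarray(weights, dtype=float)
    Kmat = np.exp(-cost / reg)          # Gibbs kernel associated with the cost
    V = np.ones((K, N))                 # scaling vectors, one row per input measure
    b = np.full(N, 1.0 / N)             # initial (uniform) barycenter estimate
    for _ in range(n_iter):
        U = P / (Kmat @ V.T).T          # enforce each input marginal p_k
        KtU = (Kmat.T @ U.T).T          # K^T u_k, one row per k
        # barycenter update: weighted geometric mean of the K^T u_k terms
        b = np.exp(np.sum(w[:, None] * np.log(np.maximum(KtU, 1e-300)), axis=0))
        V = b[None, :] / np.maximum(KtU, 1e-300)
    return b
```

For the color-histogram experiment, `cost[i, j]` would typically be a squared Euclidean distance between RGB cluster centroids; note, however, that the free-support barycenter of Cuturi and Doucet used in that experiment additionally updates the support locations, which this fixed-grid sketch does not do.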
Consistent Multilabel Classification

Oluwasanmi Koyejo, Department of Psychology, Stanford University (sanmi). Nagarajan Natarajan, Department of Computer Science, University of Texas at Austin. Pradeep Ravikumar, Department of Computer Science, University of Texas at Austin (pradeepr). Inderjit Dhillon, Department of Computer Science, University of Texas at Austin (inderjit). (Equal contribution.)

Abstract
Multilabel classification is rapidly developing as an important aspect of modern predictive modeling, motivating the study of its theoretical aspects. To this end, we propose a framework for constructing and analyzing multilabel classification metrics which reveals novel results on the parametric form of population optimal classifiers, and additional insight into the role of label correlations. In particular, we show that for multilabel metrics constructed as instance-, micro-, and macro-averages, the population optimal classifier can be decomposed into binary classifiers based on the marginal distribution of each label, with only a weak association between labels via the threshold. Thus our analysis extends the state of the art from a few known multilabel classification metrics, such as the Hamming loss, to a general framework applicable to many of the classification metrics in common use. Based on this characterization, we propose a computationally efficient plug-in classification algorithm and prove its consistency with respect to the metric of interest. Empirical results on synthetic and benchmark datasets are supportive of our theoretical findings.

Introduction
Modern classification problems often involve the prediction of multiple labels simultaneously associated with a single instance, e.g., image tagging by predicting multiple objects in an image. The growing importance of multilabel classification has motivated the development of several scalable algorithms and has led to a recent surge in theoretical analysis, which helps guide and understand practical advances. While recent results have advanced our knowledge of optimal population classifiers and consistent learning algorithms for particular metrics, such as the Hamming loss and the multilabel F-measure, a general understanding of learning with respect to multilabel classification metrics has remained an open problem. This is in contrast to the more traditional settings of binary and multiclass classification, where several recently established results have led to a rich understanding of optimal and consistent classification. This manuscript constitutes a step towards establishing results for multilabel classification at the level of generality currently enjoyed only in these traditional settings.

Towards a generalized analysis, we propose a framework for multilabel sample performance metrics and their corresponding population extensions. A classification metric is constructed to measure the performance, or utility, of a classifier as defined by the practitioner; equivalently, we may define the loss as the negative utility. The utility may be measured using the sample metric, given a finite dataset, and further generalized to the population metric with respect to a given data distribution, i.e., with respect to infinite samples. Two distinct approaches have been proposed for studying the population performance of a classifier in the classical settings of binary and multiclass classification, described by Ye et al. as decision theoretic analysis (DTA) and empirical utility maximization (EUM). DTA population utilities measure the expected performance of a classifier on a test set, while EUM population utilities are directly defined as a function of the population confusion matrix. However, the analysis of multilabel classification has
lacked such distinction the proposed framework defines both eum and dta multilabel population utility as generalizations of the aforementioned classic definitions using this framework we observe that existing work on multilabel classification have exclusively focused on optimizing the dta utility of specific multilabel metrics averaging of binary classification metrics remains one of the most widely used approaches for defining multilabel metrics given binary label representation such metrics are constructed via averaging with respect to labels with respect to examples separately for each label or with respect to both labels and examples we consider large of such metrics where the underlying binary metric can be constructed as fraction of linear combinations of true positives false positives false negatives and true negatives examples in this family include the ubiquitous hamming loss the averaged precision the multilabel averaged and the averaged jaccard measure among others our key result is that bayes optimal multilabel classifier for such metrics can be explicitly characterized in simple form the optimal classifier thresholds the conditional probability marginals and the label dependence in the underlying distribution is relevant to the optimal classifier only through the threshold parameter further the threshold is shared by all the labels when the metric is or this result is surprising and to our knowledge first result to be shown at this level of generality for multilabel classification the result also sheds additional insight into the role of label correlations in multilabel classification answering prior conjectures by et al and others we provide estimation based algorithm that is efficient as well as theoretically consistent the true utility of the empirical estimator approaches the optimal eum utility of the bayes classifier section we also present experimental evaluation on synthetic and benchmark multilabel datasets comparing different estimation algorithms section for representative multilabel performance metrics selected from the studied family the results observed in practice are supportive of what the theory predicts related work we briefly highlight closely related theoretical results in the multilabel learning literature gao and zhou consider the consistency of multilabel learning with respect to dta utility with focus on two specific losses hamming and rank loss the corresponding measures are defined in section surrogate losses are devised which result in consistent learning with respect to these metrics in contrast we propose estimation based algorithm which directly estimates the bayes optimal without going through surrogate losses dembczynski et al analyze the dta population optimal classifier for the multilabel rank loss showing that the bayes optimal is independent of label correlations in the unweighted case and construct certain weighted univariate losses which are dta consistent surrogates in the more general weighted case perhaps the work most closely related to ours is by dembczynski et al who propose novel dta consistent rule estimation based algorithm for multilabel cheng et al consider optimizing popular losses in multilabel learning such as hamming rank and subset loss which is the multilabel analog of the classical loss they propose probabilistic version of classifier chains first introduced by read et al for estimating the bayes optimal with respect to subset loss though without rigorous theoretical justification framework for multilabel classification 
metrics consider multilabel classification with labels where each instance is denoted by for convenience we will focus on the common binary encoding where the labels are represented by vector so ym iff the mth label is associated with the instance and ym otherwise the goal is to learn multilabel classifier that optimizes certain performance metric with respect to fixed data generating distribution over the domain using training set of pairs drawn typically assumed iid from let and denote the random variables for instances and labels respectively and let denote the performance utility metric of interest most classification metrics can be represented as functions of the entries of the confusion matrix in case of binary classification the confusion matrix is specified by four numbers true positives true negatives false positives and false negatives similarly we construct the following primitives for multilabel classification jfm ym jfm ym tp tn fp jfm ym fn jfm ym where jzk denotes the indicator function that is if the predicate is true or otherwise it is clear that most multilabel classification metrics considered in the literature can be written as function of the primitives defined in in the following we consider construction which is of sufficient generality to capture all multilabel fp tn fn metrics in common use let ak tp represent set of functions consider sample multilabel metrics constructed as functions ak we note that the metric need not decompose over individual instances equipped with this definition of sample performance metric consider the population utility of multilabel classifier defined as ak where the expectation is over iid draws from the joint distribution note that this can be seen as multilabel generalization of the empirical utility maximization eum style classifiers studied in binary and multiclass settings our goal is to learn multilabel classifier that maximizes for general performance metrics define the bayes optimal multilabel classifier as argmax let we say that is consistent estimator of if examples the averaged accuracy hamming loss used in multilabel classification correpm pn fn and ham sponds to simply choosing fp the measure corresponding to rank loss can be obtained by choosing ak pm pm for and rank fn fp ak note that the choice of ak and therefore is not unique remark existing results on multilabel classification have focused on analysis dta style classifiers where the utility is defined as udta ak and the expectation is over iid samples from furthermore there are no theoretical results for consistency with respect to general performance metrics in this setting see appendix for the remainder of this manuscript we refer to as the utility defined in we will also as tp when it is clear from the context drop the argument write tp framework for averaged binary multilabel classification metrics the most popular class of multilabel performance metrics consists of averaged binary performance metrics that correspond to particular settings of ak using certain averages as described in the following for the remainder of this subsection the metric will refer to binary classification metric as is typically applied to binary confusion matrix subtle but important aspect of the definition of rank loss in the existing literature including and is that the bayes optimal is allowed to be function and may not correspond to label decision multilabel performance metrics micro are defined by averaging over both labels and examples let tp tp fp fp and fn are defined similarly then 
the multilabel performance metrics are tn given by fp tn fn tp micro ak thus for one applies binary performance metric to the confusion matrix defined by the averaged quantities described in the metric macro measures average classification performance across labels define the averaged measures tp tp fp fp and fn are defined similarly the performance metric is given by tn macro ak fp tn fn tp the metric instance measures the average classification performance across examples define the averaged measures tp tp fp fp and fn are defined similarly the performance metric is given by tn instance ak tpn fpn tnn fnn characterizing the bayes optimal classifier for multilabel metrics we now characterize the optimal multilabel classifier for the large family of metrics outlined in section micro macro and instance with respect to the eum utility we begin by observing that while and seem quite different when viewed as sample averages they are in fact equivalent at the population level thus we need only focus on micro to characterize instance as well proposition for given binary classification metric consider the averaged multilabel metrics micro defined in and instance defined in for any micro instance in particular micro instance we further restrict our study to metrics selected from the metric family recently studied in the context of binary classification any in this family can be written as fp fn tn tp fp fn tn tp fp fn tn tp fp fn tn are defined as in section many where aij bij are fixed and tp popular multilabel metrics can be derived using some examples tp tp jaccard jacc fp fn tp fn fp tp tp precision hamming prec ham tp tn fp tp note that hamming is typically defined as the loss given by ham pm pm define the population quantities ym and fm let where the expectation is over iid draws from from it follows that tp tp tp tn fp fp tp and fn tp now the population utility corresponding to micro can be written succinctly as tp tp fp fn tn tp micro with the constants and we assume that the joint has density that satisfies dp µdx and define ym our first main result characterizes the bayes optimal multilabel classifier micro theorem given the constants and define micro micro the optimal bayes classifier micro defined in is given by when micro takes the form fm when micro takes the form fm for for the proof is provided in appendix and applies equivalently to theorem recovers existing results in binary settings see appendix for details and is sufficiently general to capture many of the multilabel metrics used in practice our proof is closely related to the binary classification case analyzed in theorem of but differs in the additional averaging across labels key observation from theorem is that the optimal multilabel classifier can be obtained by thresholding the marginal probability for each label ym and importantly that the optimal classifiers for all the labels share the same threshold thus the effect of the joint distribution is only in the threshold parameter we emphasize that while the presented results characterize the optimal population classifier incorporating label correlations into the prediction algorithm may have other benefits with finite samples such as statistical efficiency when there are known structural similarities between the marginal distributions further analysis is left for future work the bayes optimal for the population metric is straightforward to establish we observe that the threshold is not shared in this case proposition for given metric consider the multilabel metric macro 
defined in let macro and we have for fm where is constant that depends on the metric and the marginals of analogous results hold for macro remark it is clear that and averaging are equivalent at the population level when the metric is linear this is straightforward consequence of the observation that the corresponding sample utilities are the same more generally and instanceaveraging are equivalent whenever the optimal threshold is constant independent of such as for linear metrics where so cf corollary of koyejo et al thus our analysis recovers known results for hamming loss consistent estimation algorithm importantly the bayes optimal characterization points to simple estimation algorithm that enjoys consistency as follows first one obtains an estimate of the marginal instanceconditional probability ym for each label see reid and williamson using training sample then the given metric micro is maximized on validation sample for the remainder of this manuscript we assume wlog that note that in order to maximize over fm it suffices to optimize argmax micro where micro is the sample metric defined in similarly for instance though the threshold search is over continuous space the number of distinct micro values given training sample of size is at most thus can be solved efficiently on finite sample algorithm for micro and instance input training examples and metric micro or for do select the training data for label sm ym split the training data sm into two sets and estimate using define fˆm jˆ end for obtain by solving on return instance consistency of the proposed algorithm the following theorem shows that the procedure of algorithm results in consistent classifier theorem let micro be metric if the estimates satisfy then the output multilabel classifier of algorithm is consistent the proof is provided in appendix from proposition it follows that consistency holds for instance as well additionally in light of proposition we may apply the learning algorithms proposed by for binary classification independently for each label to obtain consistent estimator for macro experiments we present two sets of results the first is an experimental validation on synthetic data with known ground truth probabilities the results serve to verify our main result theorem characterizing the bayes optimal for averaged multilabel metrics the second is an experimental evaluation of the plugin estimator algorithms for and multilabel metrics on benchmark datasets synthetic data verification of bayes optimal we consider the metric in for multilabel classification with labels we sample set of five vectors from the standard gaussian the conditional probability for label is modeled using sigmoid function ym wt using vector wm sampled from the standard gaussian the bayes optimal that maximizes the population utility is then obtained by exhaustive search over all possible label vectors for each instance in figure we plot the conditional probabilities wrt the sample index for each label the corresponding fm for each and the optimal threshold using we observe that the optimal multilabel classifier indeed thresholds ym for each label and furthermore that the threshold is same for all the labels as stated in theorem figure bayes optimal classifier for multilabel measure on synthetic data with labels and distribution supported on instances plots from left to right show the bayes optimal classifier prediction for instances and for labels through note that the optimal at which the marginal is thresholded is shared conforming to 
theorem larger plots are included in appendix benchmark data evaluation of estimators we now evaluate the proposed algorithm that is consistent for and multilabel metrics we focus on two metrics and jaccard listed in we compare algorithm designed to optimize or multilabel rics to two related methods separate threshold tuned for each label individually this optimizes the utility corresponding to the metric but is not consistent for or metrics and is the most common approach in practice we refer to this as ii constant threshold for all the labels this is known to be optimal for averaged accuracy equiv hamming loss but not for or jaccard metrics we refer to this as binary relevance br we use four benchmark multilabel in our experiments cene an image dataset consisting of labels with training and test instances ii irds an audio dataset consisting of labels with training and test instances iii motions music dataset consisting of labels with training and test instances and iv al music dataset consisting of labels with training and test we perform logistic regression with regularization on separate subsample to obtain estimates of of ym for each label as described in section all the methods we evaluate rely on obtaining good estimator for the conditional probability so we exclude labels that are associated with very few instances in particular we train and evaluate using labels associated with at least instances in each dataset for all the methods in table we report the and jaccard metrics on the test set for algorithm and binary relevance we observe that estimating fixed threshold for all the labels algorithm consistently performs better than estimating thresholds for each label and than using threshold for all labels br this conforms to our main result in theorem and the consistency analysis of algorithm in theorem similar trend is observed for the instanceaveraged metrics computed on the test set shown in table proposition shows that maximizing the population utilities of and metrics are equivalent the result holds in practice as presented in table finally we report metrics computed on test set in table we observe that is competitive in out of datasets this conforms to proposition which shows that in the case of metrics it is optimal to tune threshold specific to each label independently beyond consistency we note that by using more samples joint threshold estimation enjoys additional statistical efficiency while separate threshold estimation enjoys greater flexibility this may explain why algorithm achieves the best performance in three out of four datasets in table though it is not consistent for metrics the datasets were obtained from http original al dataset does not provide splits we split the data randomly into train and test sets dataset cene irds motions al br algorithm br algorithm jaccard table comparison of methods on multilabel and jaccard metrics reported values correspond to metric and jaccard computed on test data with standard deviation over random validation sets for tuning thresholds algorithm is consistent for microaveraged metrics and performs the best consistently across datasets dataset br cene irds motions al algorithm br algorithm jaccard table comparison of methods on multilabel and jaccard metrics reported values correspond to metric and jaccard computed on test data with standard deviation over random validation sets for tuning thresholds algorithm is consistent for metrics and performs the best consistently across datasets dataset br cene irds motions al algorithm 
br algorithm jaccard table comparison of methods on multilabel and jaccard metrics reported values correspond to the metric computed on test data with standard deviation over random validation sets for tuning thresholds is consistent for metrics and is competitive in three out of four datasets though not consistent for metrics algorithm achieves the best performance in three out of four datasets conclusions and future work we have proposed framework for the construction and analysis of multilabel classification metrics and corresponding population optimal classifiers our main result is that for large family of averaged performance metrics the eum optimal multilabel classifier can be explicitly characterized by thresholding of marginal probabilities with weak label dependence via shared threshold we have also proposed efficient and consistent estimators for maximizing such multilabel performance metrics in practice our results are step forward in the direction of extending the understanding of learning with respect to general metrics in binary and multiclass settings our work opens up many interesting research directions including the potential for further generalization of our results beyond averaged metrics and generalized results for dta population optimal classification which is currently only for the acknowledgments we acknowledge the support of nsf via and and nih via as part of the joint initiative to support research at the interface of the biological and mathematical sciences references weiwei cheng eyke and krzysztof dembczynski bayes optimal multilabel classification via probabilistic classifier chains in proceedings of the international conference on machine learning pages krzysztof dembczynski wojciech kotlowski and eyke consistent multilabel ranking through univariate losses in proceedings of the international conference on machine learning pages krzysztof willem waegeman weiwei cheng and eyke on label dependence and loss minimization in classification machine learning krzysztof dembczynski arkadiusz jachnik wojciech kotlowski willem waegeman and eyke optimizing the in classification rule approach versus structured loss minimization in proceedings of the international conference on machine learning pages krzysztof dembczynski willem waegeman weiwei cheng and eyke an exact algorithm for maximization in advances in neural information processing systems pages luc devroye probabilistic theory of pattern recognition volume springer wei gao and zhou on the consistency of learning artificial intelligence ashish kapoor raajay viswanathan and prateek jain multilabel classification using bayesian compressed sensing in advances in neural information processing systems pages oluwasanmi koyejo nagarajan natarajan pradeep ravikumar and inderjit dhillon consistent binary classification with generalized performance metrics in advances in neural information processing systems pages harikrishna narasimhan rohit vaish and shivani agarwal on the statistical consistency of classifiers for performance measures in advances in neural information processing systems pages harikrishna narasimhan harish ramaswamy aadirupa saha and shivani agarwal consistent multiclass algorithms for complex performance measures in proceedings of the international conference on machine learning pages james petterson and caetano submodular learning in advances in neural information processing systems pages jesse read bernhard pfahringer geoff holmes and eibe frank classifier chains for multilabel classification machine 
learning mark reid and robert williamson composite binary losses the journal of machine learning research grigorios tsoumakas ioannis katakis and ioannis vlahavas mining data in data mining and knowledge discovery handbook pages springer willem waegeman krzysztof dembczynski arkadiusz jachnik weiwei cheng and eyke on the of maximizers journal of machine learning research nan ye kian ming chai wee sun lee and hai leong chieu optimizing tale of two approaches in proceedings of the international conference on machine learning yu prateek jain purushottam kar and inderjit dhillon learning with missing labels in proceedings of the international conference on machine learning pages 
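The consistent estimation procedure described above — estimate each label's marginal probability on one split of the data, then tune a single shared threshold on a held-out split by searching over the finitely many distinct values the sample metric can take — can be sketched as follows. This is a minimal illustration assuming the per-label probability estimates are already available (e.g., from one logistic regression per label); the function and variable names are hypothetical, and micro-averaged F1 stands in for a generic micro-averaged metric.

```python
import numpy as np

def micro_f1(Y, Yhat):
    """Micro-averaged F1: pool true positives, false positives and false
    negatives over all labels and examples before forming the ratio."""
    tp = np.sum((Yhat == 1) & (Y == 1))
    fp = np.sum((Yhat == 1) & (Y == 0))
    fn = np.sum((Yhat == 0) & (Y == 1))
    return 2.0 * tp / max(2.0 * tp + fp + fn, 1e-12)

def tune_shared_threshold(P_val, Y_val, metric=micro_f1):
    """Given validation-set marginal probability estimates P_val (n x m)
    and binary labels Y_val (n x m), search over the distinct thresholds
    induced by P_val and return the single shared threshold that
    maximizes the chosen micro-averaged metric."""
    best_t, best_val = 0.5, -np.inf
    for t in np.unique(P_val):
        val = metric(Y_val, (P_val >= t).astype(int))
        if val > best_val:
            best_t, best_val = t, val
    return best_t, best_val

# at test time, the classifier predicts (P_test >= best_t) for every label
```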
parallel predictive entropy search for batch global optimization of expensive objective functions amar shah department of engineering cambridge university zoubin ghahramani department of engineering university of cambridge zoubin abstract we develop parallel predictive entropy search ppes novel algorithm for bayesian optimization of expensive objective functions at each iteration ppes aims to select batch of points which will maximize the information gain about the global maximizer of the objective well known strategies exist for suggesting single evaluation point based on previous observations while far fewer are known for selecting batches of points to evaluate in parallel the few batch selection schemes that have been studied all resort to greedy methods to compute an optimal batch to the best of our knowledge ppes is the first nongreedy batch bayesian optimization strategy we demonstrate the benefit of this approach in optimization performance on both synthetic and real world applications including problems in machine learning rocket science and robotics introduction finding the global maximizer of objective function based on sequential noisy observations is fundamental problem in various real world domains engineering design finance and algorithm optimization we are interesed in objective functions which are unknown but may be evaluated pointwise at some expense be it computational economical or other the challenge is to find the maximizer of the expensive objective function in as few sequential queries as possible in order to minimize the total expense bayesian approach to this problem would probabilistically model the unknown objective function based on posterior belief about given evaluations of the the objective function you can decide where to evaluate next in order to maximize chosen utility function bayesian optimization has been successfully applied in range of difficult expensive global optimization tasks including optimizing robot controller to maximize gait speed and discovering chemical derivative of particular molecule which best treats particular disease two key choices need to be made when implementing bayesian optimization algorithm model choice for and ii strategy for deciding where to evaluate next common approach for modeling is to use gaussian process prior as it is highly flexible and amenable to analytic calculations however other models have shown to be useful in some bayesian optimization tasks process priors and deep neural networks most research in the bayesian optimization literature considers the problem of deciding how to choose single location where should be evaluated next however it is often possible to probe several points in parallel for example you may possess identical robots on which you can test different gait parameters in parallel or your computer may have multiple cores on which you can run algorithms in parallel with different hyperparameter settings whilst there are many established strategies to select single point to probe next expected improvement probability of improvement and upper confidence bound there are few well known strategies for selecting batches of points to the best of our knowledge every batch selection strategy proposed in the literature involves greedy algorithm which chooses individual points until the batch is filled greedy choice making can be severely detrimental for example greedy approach to the travelling salesman problem could potentially lead to the uniquely worst global solution in this work our key contribution 
is to provide what we believe is the first algorithm to choose batch of points to probe next in the task of parallel global optimization our approach is to choose set of points which in expectation maximally reduces our uncertainty about the location of the maximizer of the objective function the algorithm we develop parallel predictive entropy search extends the methods of to multiple point batch selection in section we formalize the problem and discuss previous approaches before developing parallel predictive entropy search in section finally we demonstrate the benefit of our strategy on synthetic as well as objective functions in section problem statement and background our aim is to maximize an objective function which is unknown but can be noisily evaluated pointwise at multiple locations in parallel in this work we assume is compact subset of rd at each decision we must select set of points st xt where the objective function would next be evaluated in parallel each evaluation leads to scalar observation yt xt where we assume we wish to minimize future regret rt where is an optimal decision assumed to exist and is our guess of where the maximizer of is after evaluating batches of input points it is highly intractable to make decisions steps ahead in the setting described therefore it is common to consider the regret of the very next decision in this work we shall assume is draw from gaussian process with constant mean and differentiable kernel function most bayesian optimization research focuses on choosing single point to query at each decision popular strategy in this setting is to choose the point with highest expected improvement over the current hbest evaluation the imaximizer of aei max xbest where is the set of obp servations xbest is the best evaluation point so far var xbest and and are the standard gaussian and aside from being an intuitive approach key advantage of using the expected improvement strategy is in the fact that it is computable analytically and is infinitely differentiable making the problem of finding aei amenable to plethora of gradient based optimization methods unfortunately the corresponding strategy for selecting points to evaluate in parallel does not lead to an analytic expression considered an approach which sequentially used the ei criterion to greedily choose batch of points to query next which formalized and utilized by defining aei dyq xq the expected gain in evaluating after evaluating which can be approximated using monte carlo samples hence the name choosing batch of points st using the eimcmc policy is doubly greedy the ei criterion is greedy as it inherently aims to minimize onestep regret rt and ii the approach starts with an empty set and populates it sequentially and hence greedily deciding the best single point to include until similar but different approach called simulated matching sm was introduced by let be baseline policy which chooses single point to evaluate next ei sm aims to select batch st of size which includes point close to the best point which would have chosen when applied sequentially times with high probability formally sm aims to maximize ii asm st ef min sπq where sπq is the set of points which policy would query if employed sequentially greedy based algorithm is proposed to approximately maximize the objective which the authors justify by the submodularity of the objective function the upper confidence bound ucb strategy is another method used by practitioners to decide where to evaluate an objective function next the 
ucb approach is to maximize aucb αt where αt is positive parameter which trades off exploration and exploitation in order to extend this approach to the parallel setting noted that the predictive variance of gaussian process depends only on where observations are made and not the observations themselves therefore they suggested the method which greedily populates the set st by maximizing ucb type equation times sequentially updating at each step whilst maintaining the same for each batch finally variant of the was proposed by the first point of the set st is chosen by optimizing the ucb objective thereafter relevant region rt which contains the maximizer of with high probability is defined points are greedily chosen from this region to maximize the information gain about measured by expected reduction in entropy until this method was named gaussian process upper confidence bound with pure exploration each approach discussed resorts to greedy batch selection process to the best of our knowledge no batch bayesian optimization method to date has avoided greedy algorithm we avoid greedy batch selection approach with ppes which we develop in the next section parallel predictive entropy search our approach is to maximize information about the location of the global maximizer which we measure in terms of the negative differential entropy of analogous to ppes aims to choose the set of points st xq which maximizes xq yq appes st yq st where log dx is the differential entropy of its argument and the expectation above is taken with respect to the posterior joint predictive distribution of yq given the previous evaluations and the set st evaluating exactly is typically infeasible the prohibitive aspects are that xq yq would have to be evaluated for many different combinations of xq yq and the entropy computations are not analytically tractable in themselves significant approximations need to be made to before it becomes practically useful convenient equivalent formulation of the quantity in can be written as the mutual information between and yq given by symmetry of the mutual information we can rewrite appes as appes st yq where yq is the joint posterior predictive distibution for yq st given the observed data and the location of the global maximizer of the key advantage of the formulation in is that the objective is based on entropies of predictive distributions of the observations which are much more easily approximated than the entropies of distributions on in fact the first term of can be computed analytically suppose fq st is variate gaussian with covariance then yq log det st we develop an approach to approximate the expectation of the predictive entropy in using an expectation propagation based method which we discuss in the following section approximating the predictive entropy in assuming sample of we discuss our approach to approximating yq st for set of query points st note that we can write yq st fq st yq dfq is the posterior distribution of the objective function at the locations where fq st xq st given previous evaluations and that is the global maximizer of recall that yq is gaussian for each our approach will be to derive gaussian approximation to fq st which would lead to an analytic approximation to the integral in the posterior predictive distribution of the gaussian process fq st is multivariate gaussian distributed however by further conditioning on the location the global maximizer of we impose the condition that for any imposing this constraint for highly inall is extremely 
difficult and makes the computation of fq st tractable we instead impose the following two conditions for each st and ii ymax where ymax is the largest observed noisy objective function value and constraint is equivalent to imposing that is larger than objective function values at current query locations whilst condition ii makes larger than previous objective function evaluations accounting for noise denoting the two conditions and the variables fq and where we incorporate the conditions as follows ymax st st fq df where is an indicator function the integral in can be approximated using expectation propagation the gaussian process predictive st is we approximate the integrand of with where each and are positive and for cq is vector of length with th entry entry and remaining entries whilst the approximation approximates the gaussian cdf and each indicator function with univariate scaled gaussian pdf the site parameters are learned using fast ep algorithm for which details are given in the supplementary material where we show that zn where and hence st since multivariate gaussians are consistent under marginalization convenient corollary is that st where is the vector containing the first elements of and is the matrix containing the first rows and columns of since gaussians are also gaussian distributed we see that sums of independent yq st yq the final convenient attribute of our gaussian approximation is that of multivariate gaussian can be computed the differential entropy analytically such that yq log det st sampling from the posterior over the global maximizer so far we have considered how to approximate yq given the global st mizer we in fact would like the expected value of this quantity over the posterior distribution of the global maximizer literally the posterior probability that is the global maximizer of computing the distribution is intractable but it is possible to approximately sample from it and compute monte carlo based approximation of the desired expectation we consider two approaches to sampling from the posterior of the global maximizer maximum posteriori map method and ii random feaure approach map sample from the map of is its posterior mode given by we may approximate the expected value of the predictive entropy by replacing the posterior distribution of with single point estimate at there are two key advantages to using the map estimate in this way firstly it is simple to compute as it is the global maximizer of the posterior mean of given the observations secondly choosing to use assists the ep algorithm developed in section to converge as desired this is because the condition for is easy to enforce when the global maximizer of the poserior mean of when is sampled such that the posterior mean at is significantly suboptimal the ep approximation may be poor whilst using the map estimate approximation is convenient it is after all point estimate and fails to characterize the full posterior distribution we therefore consider method to draw samples from using random features random feature samples from naive approach to sampling from would be to sample and choosing unfortunately this would require sampling over an uncountably infinite space which is infeasible slightly less naive method would be to sequentially construct whilst optimizing it instead of evaluating it everywhere in however this approach would have cost where is the number of function evaluations of necessary to find its optimum we propose as in to sample and optimize an analytic approximation to by 
bochner theorem stationary kernel function has fourier dual which is equal to the spectral density of setting normalized density we can write αep cos cos where let cos wx denote an feature mapping where and consist of stacked samples from then the kernel can be approximated by the inner product of these features the linear model where is an approximate sample from where is vector of objective function evaluations and xn in fact is true sample from the generative process above suggests the following approach to approximately sampling from sample random features and corresponding posterior weights using the process above ii construct and iii finally compute using gradient based methods computing and optimizing the ppes approximation let denote the set of kernel parameters and the observation noise variance our posterior belief about is summarized by the posterior distribution where is our prior belief about and is the gp marginal likelihood given the parameters for fully bayesian treatment of we must marginalize appes with respect to the expectation with respect to the posterior distribution of is approximated with monte carlo samples similar approach is taken in combining the ep based method to approximate the predictive entropy with either of the two methods discussed in the previous section to approximately sample from we can construct an approximation to defined by xh st log det log det where is constructed using the ith sample of from is constructed as in section assuming the global maximizer is the ppes approximation is simple and amenable to gradient based optimization our goal is to choose st xq which maximizes in since our kernel function is differentiable we may consider taking the derivative of with respect to xq the dth component of xq trace trace xq computing is simple directly from the definition of the chosen kernel function is and that each cq parameters vary with function of cq and and we know how to compute is constant vector hence our only concern is how the ep site xq rather remarkably we may invoke result from section of which says that converged site parameters have derivative with respect to parameters of st there is key distinction between explicit dependencies where actually depends on and implicit dependencies where site parameter might depend implicitly on similar approach is taken in and discussed in we therefore compute on first inspection it may seem computationally too expensive to compute derivatives with respect to each and however note that we may compute and store the matrices and once and that is symmetric with exactly one row and column which can be exploited for fast matrix multiplication and trace computations synthetic function appes figure assessing the our approximations to the parallel predictive entropy search egy synthetic objective function blue line defined on with noisy observations black squares ground truth appes defined on rejection sampling our proximation using expectation propagation dark regions correspond to pairs with high utility whilst faint regions correspond to pairs low utility empirical study in this section we study the performance of ppes in comparison to aforementioned methods we model as gaussian process with constant mean and covariance kernel observations of the objective function are considered to be independently drawn from our experiments we choose to use kernel of the form exp xd therefore the set of model hyperparameters is ld broad gaussian hyperprior is placed on and uninformative gamma priors are used for the other 
hyperparameters it is worth investigating how well is able to approximate appes in order to test the approximation in manner amenable to visualization we generate sample from gaussian process prior on with and and consider batches of size we set rejection sampling approach is used to compute the ground based truth appes defined on we first discretize and sample in by evaluating samples from the discrete points and choosing the input with highest function value given we compute using rejection sampling samples from are evaluted on discrete points in and rejected if the highest function value occurs not at we add independent gaussian noise with variance to the non rejected samples from the previous step and approximate using kernel density estimation figure includes illustrations of the objective function to be maximized with noisy observations the appes ground truth obtained using the rejection sampling method and finally using the ep method we develop in the previous section the black squares on the axes of figures and represent the locations in where has been noisily sampled and the darker the shade the larger the function value the lightly shaded horizontal and vertical lines in these figures along the points the figures representing appes and appear to be symmetric as is expected since the set st is not an ordered set since all points in the set are probed in parallel st the surface of is similar to that of appes in paticular the approximation often appeared to be an annealed version of the ground truth appes in the sense that peaks were more pronounced and areas were flatter since we are interested in argmax appes our key concern is that the peaks of occur at the same input locations as appes this appears to be the case in our experiment suggesting that the argmax is good approximation for argmax appes we now test the performance of ppes in the task of finding the optimum of various objective functions for each experiment we compare ppes to with mcmc samples simulated matching with ucb baseline policy and we use the random features method to sample from rejecting samples which lead to failed ep runs an experiment of an objective function consists of sampling input points uniformly at random and running each algorithm starting with these samples and their corresponding noisy function values we measure performance after batch evaluations using immediate regret rt where is the known optimizer of and is the recommendation of an algorithm after batch evaluations we perform experiments for each objective function and report the median of the regret regret regret regret branin cosines shekel hartmann regret regret figure median of the immediate regret of the ppes and other algorithms over experiments benchmark synthetic objective functions using batches of size regret regret regret regret regret regret regret ppesppes smucb smucb bucb bucb ucbpe ucbpe regret regret regret immediate regret for each algorithm the confidence bands represent one standard tion obtained from bootstrapping the empirical distribution of the immediate regret is heavy tailed making the median more representative of where most data points lie than the mean first set of experiments our is on set of synthetic benchmark objective functions including braninhoo mixture of cosines shekel function with modes each defined on and the function defined on we choose batches of size at each decision time the plots in figure illustrate the median immediate regrets found for each rithm the results suggest that the ppes algorithm 
performs close to best if not best for each problem considered does significantly better on the hartmann function which is relatively smooth function with very few modes where greedy search appears beneficial strategies are more exploratory in higher dimensions nevertheless ppes does significantly better than on of the problems suggesting that our batch selection procedure enhances performance versus greedy entropy based policy we now consider maximization of real world objective functions the first boston returns the negative of the prediction error of neural network trained on random split of the boston housing dataset the parameter and number of training iterations for the neural network are the parameters to be optimized over the next function hydrogen returns the amount of hydrogen produced by particular bacteria as function of ph and nitrogen levels of growth medium thirdly we consider function rocket which runs simulation of rocket being launched from the earth surface and returns the time taken for the rocket to land on the earth surface the variables to be optimized over are the launch height from the surface the mass of fuel to use and the angle of launch with respect to the earth surface if the rocket does not return the function returns finally we consider function robot which returns the walking speed of bipedal robot the function input parameters which live in are the robot controller we add gaussian noise with to the noiseless function note that all of the functions we consider are not available analytically boston trains neural network and returns test error whilst rocket and robot run physical simulations involving differential equations before returning desired quantity since the hydrogen dataset is available only for discrete points we define hydrogen to return the predictive mean of gaussian process trained on the dataset figure show the median values of immediate regret by each method over random initializations we consider batches of size and we find that ppes consistently outperforms competing methods on the functions considered the greediness and nonrequirement of mcmc sampling of the and algorithms make them amenable to large batch experiments for example consider optimization in with batches of size however these three algorithms all perform poorly when selecting batches of smaller size the performance on the hydrogen function illustrates an interesting phenemona whilst the immediate regret of ppes is mediocre initially it drops rapidly as more batches are evaluated this behaviour is likely due to the of the approach we have taken makes good initial progress but then fails to explore the input space as well as ppes is able to recall that after each batch evaluation an algorithm is required to output its best estimate for the maximizer of the objective function we observed that whilst competing algorithms tended to evaluate points which had high objective function values compared to ppes yet when it came to recommending regretregret regretregret boston hydrogen rocket regret regret regretregret regret regret regret regret regret regret regret regret regret regret regret regret regret regret regretregret regretregret regretregret regretregret regretregret ppes ppes ppes ppes smucb smucb bucb bucb smucb smucb ucbpe ucbpe bucb bucb ucbpe ucbpe robot figure median of the immediate regret of the ppes and other algorithms over experiments on real world objective functions figures in the top row use batches of size whilst figues on the bottom row use batches of size ppes 
tended to do better job our belief is that this occured exactly because the ppes objective aims to maximize information gain rather than objective function value improvement the rocket function has strong discontinuity making if difficult to maximize if the fuel mass launch height angle are too high the rocket would not return to the earth surface resulting in function value it can be argued that stationary kernel gaussian process is poor model for this function yet it is worth investigating the performance of gp based models since practitioner function is smooth apriori ppes seemed to handle this may not know whether or not function best and had fewer samples which resulted in function value than each of the competing methods and made fewer recommendations which led to function value relative increase in ppes performance from increasing batch size from to is small for the robot function compared to the other functions considered we believe this is consequence of using slightly naive optimization procedure to save computation time our optimization procedure first computes at points selected uniformly at random and performs gradient ascent from the best point since is defined on this method may miss global optimum other methods all select their batches greedily and hence only need to optimize in however this should easily be avoided by using more exhaustive gradient based optimizer conclusions we have developed parallel predictive entropy search an information theoretic approach to batch bayesian optimization our method is greedy in the sense that it aims to maximize the information gain about the location of but it is not greedy in how it selects set of points to evaluate next previous methods are doubly greedy in that they look one step ahead and also select batch of points greedily competing methods are prone to under exploring which hurts their perfomance on noisy objective functions as we demonstrate in our experiments references wang and shan review of metamodeling techniques in support of engineering design optimization journal of mechanical design ziemba vickson stochastic optimization models in finance world scientific singapore snoek larochelle and adams practical bayesian optimization of machine learning algorithms nips mockus bayesian approach to global optimization theory and applications kluwer lizotte wang bowling and schuurmans automatic gait optimization with gaussian process regression ijcai pages negoescu frazier and powell the algorithm for sequencing experiments in drug discovery informs journal on computing carl rasmussen and chris williams gaussian processes for machine learning mit press shah wilson and ghahramani processes as alternatives to gaussian processes aistats snoek rippel swersky kiros satish sundaram patwary mr prabat and adams scalable bayesian optimization using deep neural networks icml brochu cora and de freitas tutorial on bayesian optimization of expensive cost functions with applications to active user modeling and hierarchical reinforcement learning technical report university of british columbia gutin yeo and zverovich traveling salesman should not be greedy domination analysis of heuristics for the tsp discrete applied mathematics hennig and schuler entropy search for global optimization jmlr hoffman and ghahramani predictive entropy search for efficient global optimization of functions nips ginsbourger janusevskis and le riche dealing with asynchronicity in parallel gaussian process based optimization azimi fern and fern batch bayesian 
optimization via simulation matching nips srinivas krause kakade and seeger gaussian process optimization in the bandit setting no regret and experimental design icml desautels krause and burdick parallelizing tradeoffs with gaussian process bandit optimization icml contal buffoni robicquet and vayatis parallel gaussian process optimization with upper confidence bound and pure exploration in machine learning and knowledge discovery in databases pages springer berlin heidelberg mackay objective functions for active data selection neural computation houlsby huszar and ghahramani collaborative gaussian processes for preference learning nips minka family of algorithms for approximate bayesian inference phd thesis masachusetts institute of technology bochner lectures on fourier integrals princeton university press rahimi and recht random features for kernel machines nips neal bayesian learning for neural networks phd thesis university of toronto seeger expectation propagation for exponential families technical report berkeley cunningham hennig and gaussian probabilities and expectation propagation arxiv http ahmad and lin nonparametric estimation of the entropy for absolutely continuous distributions ieee trans on information theory lizotte practical bayesian optimization phd thesis university of alberta anderson moore and cohn nonparametric approach to noisy and costly optimization icml shekel test functions for multimodal search techniques information science and systems bache and lichman uci machine learning repository burrows wong fern chaplen and ely optimization of ph and nitrogen for enhanced hydrogen production by synechocystis sp pcc via statistical and machine learning methods biotechnology progress hasbun in classical mechanics with matlab applications jones bartlett learning westervelt and grizzle feedback control of dynamic bipedal robot locomotion control and automation series crc pressinc 
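To make the random-feature sampling of the global maximizer described above concrete, here is a minimal numpy sketch under simplifying assumptions: a squared-exponential kernel on the unit hypercube, a fixed number of random Fourier features, and maximization of the sampled function by candidate search rather than the gradient-based optimization used in the paper. All names and default values are illustrative, not the authors' implementation.

```python
import numpy as np

def sample_argmax_rff(X, y, lengthscale, amplitude, noise_var,
                      n_features=500, n_candidates=5000, rng=None):
    """Approximately sample argmax of f under a GP posterior with a
    squared-exponential kernel, via random Fourier features (Bochner's
    theorem).  X: (n, d) observed inputs assumed to lie in [0, 1]^d,
    y: (n,) noisy observations.  Illustrative defaults throughout."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # spectral density of the SE kernel: frequencies ~ N(0, I / lengthscale^2)
    W = rng.normal(scale=1.0 / lengthscale, size=(n_features, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def phi(Z):
        # feature map whose inner products approximate the kernel
        return np.sqrt(2.0 * amplitude / n_features) * np.cos(Z @ W.T + b)

    # Bayesian linear model y = phi(x)^T theta + eps, theta ~ N(0, I)
    Phi = phi(X)
    A = Phi.T @ Phi / noise_var + np.eye(n_features)
    chol = np.linalg.cholesky(A)
    mean = np.linalg.solve(chol.T, np.linalg.solve(chol, Phi.T @ y)) / noise_var
    z = rng.standard_normal(n_features)
    theta = mean + np.linalg.solve(chol.T, z)      # one sample from N(mean, A^{-1})

    # maximize the analytic posterior sample f~(x) = phi(x)^T theta
    cands = rng.uniform(size=(n_candidates, d))
    vals = phi(cands) @ theta
    return cands[np.argmax(vals)]
```

Since the sampled function is available in closed form and is differentiable, a gradient-ascent step started from the best candidate could refine the returned point, as the paper suggests doing.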
cornering stationary and restless mixing bandits with liva ralaivola arma lif cnrs aix marseille university marseille cedex france julien audiffren cmla ens cachan paris saclay university cachan france audiffren abstract we study the restless bandit problem where arms are associated with stationary processes and where rewards are therefore dependent the question that arises from this setting is that of carefully recovering some independence by ignoring the values of some rewards as we shall see the bandit problem we tackle requires us to address the which we do by considering the idea of waiting arm in the new algorithm generalization of for the problem at hand that we introduce we provide regret analysis for this bandit strategy two noticeable features of are that it reduces to the regular when the coefficients are all when the scenario is recovered ii when it is able to ensure controlled regret of order log where encodes the distance between the best arm and the best suboptimal arm even in the case when the case when the coefficients are not summable introduction bandit with mixing arms the bandit problem consists in an agent who has to choose at each step between arms stochastic process is associated to each arm and pulling an arm produces reward which is the realization of the corresponding stochastic process the objective of the agent is to maximize its long term reward in the abundant bandit literature it is often assumed that the stochastic process associated to each arm is sequence of independently and identically distributed random variables see in that case the challenge the agent has to address is the problem she has to simultaneously make sure that she collects information from all arms to try to identify the most rewarding is to maximize the rewards along the sequence of pulls she is exploitation many algorithms have been proposed to solve this between exploration and exploitation we propose to go step further than the setting and to work in the situation where the process associated with each arm is stationary process the rewards are thus dependent from one another with strength of dependence that weakens over time from an application point of view this is reasonable dependence structure if user clicks on some ad typical use of bandit algorithms at some point in time it is very likely that her choice will have an influence on what she will click in the close future while it may have lot weaker impact on what ad she will choose to view in more distant future as it shall appear in the sequel working with such dependent observations poses the question of how informative are some of the rewards with respect to the value of an arm since because of the dependencies and the strong correlation between in time rewards they might not reflect the true value of the arms however as the dependencies weaken over time some kind of independence might be recovered if some rewards are ignored in some sense this actually requires us to deal with new tradeoff the tradeoff where the usual compromise has to be balanced with the need for some independence dealing with this new tradeoff is the pivotal feature of our work non bandit closely related setup that addresses the bandit problem with dependent rewards is when they are distributed according to markov processes such as markov chains and markov decision process mdp where the dependences between rewards are of bounded range which is what distinguishes those works with ours contributions in this area study two settings the rested case where the 
process attached to an arm evolves only when the arm is pulled and the restless case where all processes simultaneously evolve at each time step in the present work we will focus on the restless setting the adversarial bandit setup see can be seen as non setup as the rewards chosen by the adversary might depend on the agent past actions however even if the algorithms developed for this framework can be used in our setting they might perform very poorly as they are not designed to take advantage of any mixing structure finally we may also mention the bandit scenario where the dependencies are between the arms instead being this is orthogonal to what we propose to study here mixing processes mixing process theory is hardly new one of the seminal works on the study of mixing processes was done by bernstein who introduced the block method central to prove results on mixing processes in statistical machine learning one of the first papers on estimators for mixing processes is more recent works include the contributions of mohri and rostamizadeh which address the problem of stability bound and rademacher stability for and processes kulkarni et al establish the consistency of regularized boosting algorithms learning from processes steinwart et al prove the consistency of support vector machines learning from processes and steinwart and christmann establish general oracle inequality for generic regularized learning algorithms and observations as far as we know it is the first time that mixing processes are studied in bandit framework contribution our main result states that strategy based on the improved upper confidence bound or in the sequel proposed by auer and ortner allows us to achieve controlled regret in the restless mixing scenario namely our algorithm which stands for restless mixing ucb achieves regret of the form where encodes the distance between the best arm and the best suboptimal arm encodes the rate of decrease is notation that neglects logarithmic of the coefficients nα and dependencies see section it is worth noticing that all the results we give hold for when the dependencies are no longer summable when the mixing coefficients at hand are all zero in the case the regret of our algorithm naturally reduces to the classical uses the assumption about known convergence rates of coefficients which is classical standpoint that has been used by most of the papers studying the behavior of machine learning algorithms in the case of mixing processes see the estimation of the mixing coefficients poses learning problem on its own see for the estimation of coefficients and is beyond the scope of this paper structure of the paper section defines our setup processes are recalled together with relevant concentration inequality for such processes the notion of regret we focus on is given section is devoted to the presentation of our algorithm and to the statement of our main result regarding its regret finally section discusses the obtained results overview of the problem concentration of stationary processes let be probability space we recall the notions of stationarity and processes definition stationarity sequence of random variables xt is stationary if for any xt and are identically distributed definition process let xt be stationary sequence of random variables for any let σij denote the generated by xt then for any positive the coefficient of the stochastic process is defined as sup is if is algebraically mixing if so that as we recall later concentration inequalities are the pivotal tools 
to devise bandit strategies. Hoeffding's inequality, for instance, is at the root of a number of such methods. This inequality is, however, devoted to characterizing the deviation of a sum of independent variables from its expected value and cannot be used in the framework we are investigating. In the case of stationary mixing processes there is, however, the following concentration inequality due to Kontorovich and Ramanan.

Theorem. Let ψm be a function defined over a countable space and (Xt) be a stationary φ-mixing process. If ψm is c-Lipschitz with respect to the Hamming metric for some c > 0, then

P(|ψm(X1, ..., Xm) − E ψm(X1, ..., Xm)| ≥ ε) ≤ 2 exp(−2ε² / (m c² Λm²)), where Λm = 1 + 2 Σ_{k=1}^{m−1} φ(k).

Here we do not have to use this concentration inequality in its full generality, as we will restrict ourselves to the situation where ψm is the mean of its arguments, ψm(X_{t1}, ..., X_{tm}) = (1/m) Σ_{i=1}^{m} X_{ti}, which is obviously (1/m)-Lipschitz provided that the Xt have range bounded by 1 (this will be one of our working assumptions). If, with a slight abuse of notation, Λm is now used to denote Λm({ti}) = 1 + 2 Σ_{i=1}^{m−1} φ(t_{i+1} − t_i) for an increasing sequence (ti) of time steps, then the concentration inequality that will serve our purpose is given in the next corollary.

Corollary. Let (Xt) be a stationary φ-mixing process. The following holds for all ε > 0 and all t1 < ... < tm:

P(|(1/m) Σ_{i=1}^{m} X_{ti} − E X1| ≥ ε) ≤ 2 exp(−2 m ε² / Λm²),

thanks to the stationarity of (Xt) and the linearity of the expectation.

Remark. According to Kontorovich and Ramanan, the function Λm should in fact be max_j (1 + 2 Σ_{i>j} φ(ti − tj)). However, when the time lag between two consecutive time steps ti and t_{i+1} is non-decreasing, which will be imposed by the algorithm (see below), and the mixing coefficients are decreasing, which is a natural assumption that simply says that the amount of dependence between Xt and X_{t+k} reduces when k increases, then Λm reduces to the more compact expression given above. Note that when there is independence, then φ(k) = 0 for all k, Λm = 1, and as a consequence the corollary reduces to Hoeffding's inequality: the precise values of the time instants do not impact the value of the bound, and the length m of the sequence is the central parameter that matters. This is in contrast with what happens in the dependent setting, where the bound on the deviation of (1/m) Σ X_{ti} from its expectation directly depends on the timepoints ti through Λm: for two sequences (ti) and (t'i) of m timepoints, (1/m) Σ X_{ti} may be more sharply concentrated around its expectation than (1/m) Σ X_{t'i} provided Λm({ti}) ≤ Λm({t'i}), which can be a consequence of a more favorable spacing of the points in (ti) than in (t'i).

Problem: minimize the expected regret. We may now define the bandit problem we consider and the regret we want to control. Restless bandits: we study the problem of sampling from a multi-armed restless bandit. In our setting, pulling arm k at time t provides the agent with a realization of the random variable X_t^k, where the family (X_t^k) satisfies the following assumptions: (X_t^k) is a stationary φ-mixing process with decreasing mixing coefficients φk, and it takes its values in a discrete finite set; by stationarity, the same holds for any subsequence of (X_t^k). Regret: the regret we want to bound is the classical expected regret, which after T pulls is given by

R(T) = Σ_{t=1}^{T} (µ* − µ_{it}), where µ* = max_k µk

and it is the index of the arm selected at time t. We want to devise a strategy that is capable of selecting at each time t the arm it so that the obtained regret is minimal. Bottleneck: the setting we assume entails the possibility of dependencies between the rewards output by the arms. Hence, as evoked earlier, in order to choose which arm to pull, the agent is forced to address the trade-off evoked earlier, where independence may be partially recovered by taking advantage of the observation that some spacings of timepoints induce sharper concentration of the empirical rewards than others. As emphasized later, targeting good spacing in the bandit framework translates into the idea of ignoring the rewards provided by some pulls when computing the empirical averages.
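To make the role of Λm concrete, the short sketch below compares the confidence radius implied by the corollary for consecutive pulls of a single arm versus increasingly spaced pulls, under an algebraically decaying coefficient. The decay φ(k) = k^(−α), the polynomial spacing rule, and the constants in the radius are illustrative assumptions consistent with the reconstruction above, not values taken from the paper.

```python
import numpy as np

def phi(k, alpha):
    """Algebraic mixing coefficient phi(k) = k**(-alpha) (illustrative choice)."""
    return float(k) ** (-alpha)

def mixing_lambda(times, alpha):
    """Lambda_m = 1 + 2 * sum of phi over the gaps between consecutive pull times."""
    gaps = np.diff(np.asarray(times, dtype=float))
    return 1.0 + 2.0 * sum(phi(g, alpha) for g in gaps)

def confidence_radius(times, alpha, delta=0.05):
    """Radius from inverting the Hoeffding-type bound above (illustrative constants):
    2 * exp(-2 m eps**2 / Lambda_m**2) = delta  =>  eps = Lambda_m * sqrt(log(2/delta) / (2 m))."""
    m = len(times)
    return mixing_lambda(times, alpha) * np.sqrt(np.log(2.0 / delta) / (2.0 * m))

if __name__ == "__main__":
    alpha, m = 0.5, 200                    # alpha <= 1: non-summable coefficients
    consecutive = np.arange(1, m + 1)      # one pull at every time step
    spaced = np.cumsum(np.arange(1, m + 1) ** (1.0 / alpha))  # polynomially growing gaps
    # Only the relative comparison matters: spacing the pulls keeps Lambda_m small,
    # while consecutive pulls make it grow with m.
    print("consecutive pulls:", confidence_radius(consecutive, alpha))
    print("spaced pulls     :", confidence_radius(spaced, alpha))
```

With α ≤ 1 the coefficients are not summable, so consecutive pulls leave Λm growing linearly in m and the radius does not shrink with additional pulls, whereas polynomially growing gaps keep Λm of logarithmic order, which is why some pulls are better left out of the empirical averages.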
this idea is carried by the concept of waiting arm which is formally defined later on the questions raised by the waiting arm that we address with the algorithm are how often should the waiting arm be pulled so the concentration of the empirical means is high enough to be relied on so the usual tradeoff can be tackled and from the regret standpoint how hindering is it to pull the waiting arm analysis in the analysis of that we provide just as is the case for most if not and all analyses that exist for bandit algorithms we will focus in the order of the regret and we will not be concerned about the precise constants involved in the derived results we will therefore naturally notation that bears the following meaning heavily rely on the usual notation and on the notation for any two functions from to we say that definition if there exist so that logα and logβ ucb strategy for restless mixing bandits this section contains our main contributions the algorithm from now on we use resp for the maximum resp minimum of two elements and we consider that the processes attached to the arms are algebraically mixing and for arm the exponent is αk there exist such that ϕk assumption is not very restrictive as considering rates such as are to capture and characterize the decreasing behavior of the convergent sequence ϕk also we will sometimes say that arm is faster resp slower than arm for to convey the fact that αk resp αk for any and any increasing sequence of timepoints the empirical reward bτk of pt given is bk xτ the subscripted notation τk is used to denote the sequence of timepoints at which arm was selected finally we define λτk in similar way as in the difference with the former notation being the subscript as λk ϕk τk τk we feel important to discuss when may be robust to the mixing process scenario robustness of to restless bandits we will not recall the algorithm in its entirety as it will turn out to be special case of our algorithm but it is instructive to identify its distinctive features that make it relevant base algorithm for the handling of mixing processes first it is essential to keep in mind that is designed for the case and that it achieves an optimal log regret second it is an algorithm that works in successive at the end of each of which number of arms are eliminated because they are identified with high probability as being the least promising ones from regret point of view more precisely at each round the same number of consecutive pulls is planned for each arm this number is induced by hoeffding inequality and devised in such way that all remaining arms share the same confidence interval for their respective expected gains the µk for in the set of remaining arms at the current round from technical standpoint this is what makes it possible to draw conclusions on whether an arm is useless eliminated or not it is enlightening to understand what are the favorable and unfavorable setups for to keep working when facing restless mixing bandits the following proposition depicts the favorable case proposition if ϕk then the classical run on the restless bandit preserves its log regret proof straightforward given the assumption on the mixing coefficients it exists such that ϕk therefore from theorem for any arm and any sequence of consecutive timepoints bτk exp which is akin to ing inequality up to the multiplicative constant in the exponential this and the lines to prove the log regret of directly give the desired result in the general case where ϕk does not hold for every then nothing 
ensures for to keep working the idea of consecutive pulls being the essential culprit to illustrate the problem suppose that ϕk then after sequence of consecutive time instances where was selected simple calculations give that λτk and the concentration inequality from corollary for bτk reads as bτk exp where is some strictly positive constant the quality of the confidence interval that can be derived from this concentration inequality degrades when additional pulls are performed which counters the usual nature of concentration inequalities and prevents the obtention of reasonable regret for this is direct consequence of the dependency of the variables indeed if decreases slowly taking the average over multiple consecutive pulls may move the estimator away from the mean value of the stationary process another way of understanding the difference between the case and the restless mixing case is to look at the sizes of the confidence intervals around the true value of an arm when the time to the next pull increases given corollary run in the restless mixing scenario would advocate pulling strategy based on the lengths κk of the confidence intervals given by κk λk log where is the overall time index this shows that working in the case or in the mixing case can imply two different behaviors for the lengths of the confidence interval in the scenario κk has the same form as the classical ucb term as ϕk and λτk and is an increasing function of while in the scenario the behavior may be with decreasing confidence interval up to some point after which the confidence interval becomes increasingly larger as the purpose of exploration is to tighten the confidence interval as much as possible the mixing framework points to carefully designed strategies for instance when an arm is slow it is beneficial to wait between two successive pulls of this arm by alternating the pulls of the different arms it is possible to wait up to unit of time between two consecutive pulls of the same arm however it is not sufficient to recover enough independence between the two observed values for instance in the case described in after sequence tk simple calculations give that λτk kt and the concentration inequality from corollary for bτk reads as bτk exp which entails the same problem the problem exhibited above is that if the decrease of the ϕk is too slow pulling an arm in the traditional way with consecutive pulls and updating the value of the empirical estimator may lower the certainty with which the estimation of the expected gain is performed to solve this problem and reduce the confidence interval that are computed for each arm better independence between algorithm with parameter αi defined in αi bi for bg do select arm if then until total time ts dg pull each arm bs at time τi defined in if no arm is ready to be pulled pull the waiting arm instead update update the empirical mean bi and the number of pulls ni for each arm bs obtain by eliminating from bs each arm such that pni ϕi τi log ni max pnk ϕk τk log nk update min αi and argmax pni ϕi τi log ni end for the values observed from given arm is required this can only be achieved by waiting for the time to pass by since an arm must be pulled at each time simulating the time passing by may be implemented by the idea to pull an arm but not to update the empirical mean bk of this arm with the observed reward at the same time it is important to note that even if we do not update the empirical mean of the arm the resort to the waiting arm may impact the regret it is 
therefore crucial to ensure that we pull the best possible arm to limit the resulting regret whence the arm with the best optimistic value being used as the waiting arm note that this arm may change over time for the rest of the paper will only refer to significant pulls of an arm that is pulls that lead to an update of the empirical value of the arm algorithm and regret bound we may now introduce depicted in algorithm as works in epochs and eliminates at each epoch the significantly suboptimal arms view let θs be decreasing sequence of and δs rn the main idea promoted by is to divide the time available in epochs smax the outer loop of the algorithm such that at the end of each epoch for all the remaining arms the following holds µτk µk θs µτk µk θs δs where identifies the time instants up to current time when arm was selected using this means that for all with high probability µk µk nk λτk log δs thus at the end of epoch we have with high probability uniform control of the uncertainty with which the empirical rewards bτk approximate their corresponding rewards µk based on this the algorithm eliminates the arms that appear significantly suboptimal step of the update of just as in the process is with parameters δs and θs adjusted as δs and θs where is the time budget the modifications of the δs and θs values makes it possible to gain additional information through new pulls on the quality of the remaining arms so arms associated with rewards can be distinguished by the algorithm policy for pulling arms at epoch the objective of the policy is to obtain uniform control of the intervals of all the remaining arms for some arm and fixed time budget λτ tη such that such policy could be obtained as the solution of minηs ti where the times of pulls ti must be increasing and greater than the last element of tηs and the number of times this arm has already been pulled are given this conveys our aim to obtain as fast and efficiently the targetted confidence interval however this problem does not have solution and even if it could be solved efficiently we are more interested in assessing whether it is possible to devise relevant sequences of timepoints that induce controlled regret even if they do not solve the optimization problem to this end we only focus on the best sampling rate of the arms which is an approximation of the previous minimization problem for each we search for sampling schemes of the form τk tn nβ for for the case where the ϕk are not summable αk we have the following result proposition let αk recall that ϕk the optimal sampling rate τk for arm is τk proof the idea of the proof is that if the sampling is too frequent close to then the dependency between the values of the arm reduces the information obtained by taking the average in other words ϕk τk increases too quickly on the other hand if the sampling is too scarce is very large the information obtained at each pull is important but the total amount of pulls in given time is approximately and thus is too low the optimal solution to pthis is to take which directly comes from the fact that this is the point where ϕk τk becomes logarithmic the complete proof is available in the supplementary material if αk for all this result means that the best policy with sampling scheme of the form nβ should update the empirical means associated with each arm at rate contrary to the case it is therefore not relevant to try and update the empirical rewards at each time step there henceforth must be gaps between updates of the means this is 
precisely the role of the waiting arm to make this gaps possible as seen in the depiction of when pulled the waiting arm provides reward that will count for the cumulative gains of the agent and help her control her regret but that will not be used to update any empirical mean as for precise pulling strategy to implement given proposition it must be understood that it is the slowest arm that determines the best uniform control possible since it is the one which will be selected the least number of times it is unnecessary to pull the fastest arms more often than the slowest arm therefore if iks are the ks remaining arms at epoch and iks αi then an arm selection strategy based on the rate of the slowest arm suggests to pull arm im and update for the time at time instants ks if otherwise all arms are pulled at the same frequency and to pull the waiting arm while waiting time budget per epoch in the algorithm the function defines the size of the rounds the definition of is rather technical we have gk where gk inf λτk log tθs where the τk are defined above in other words gk encodes the minimum amount of time necessary to reach the aimed length of confidence interval by following the aforementioned policy log δs this is the key element but the most interesting property of is that which will be used in the proof of the regret bound which can be found in theorem below putting it all together at epoch the algorithm starts by selecting the best empirical arm and flags it as the waiting arm it then determines the speed of the slowest arm after which it computes time budget ts then until this time horizon is reached it pulls arms following the policy described above finally after the time budget is reached the algorithm eliminates the arms whose empirical mean is significantly lower than the best available empirical mean note that when all the ϕk are summable we have and thus the algorithm never pulls the waiting arm mainly differs from by its strategy of alternate pulls the result below provides an upper bound for the regret of the algorithm theorem for all arm let αk and ϕk let αk and if the regret of is bounded in order by log since encodes the rate of sampling it can not be greater than proof the proof follows the same line as the proof of the upper bound of the regret of the algorithm the important modification is the sizes of the blocks which depend in the mixing case of the mixing coefficient and might grow arbitrary large and the waiting arm which does not exist in the setting the dominant term in the regret mentioned in theorem is related to the pulls of the waiting arm indeed the waiting arm is pulled with an always increasing frequency but the quality of the waiting arm tends to increase over time as the arms with the smallest values are eliminated the complete proof is available in the supplementary material discussion and particular cases we here discuss theorem and some of its variations for special cases of processes first in the case the regret of is upper bounded by log observe that comes down to this bound when tends to also note that it is an upper bound of the regret in the algebraically mixing case it reflects the fact that in this particular case it is possible to ignore the dependency of the mixing process it also implies that even if even if the dependency can not be ignored by properly using the mixing property of the different stationary processes it is possible to obtain an upper bound of polynomial logarithmic order another question is to see what happens when αk which is 
an important threshold in our study indeed if αk the ϕk are not summable but from proposition we have that τk the arms should be sampled as often as possible theorem states that the regret is upper bounded log however it is not possible to know if this bound is comparable to in this case by still from the proof of theorem we get the following result that of the case due to the corollary for all arm let αk and ϕk let αk then if the regret for algorithm is upper bounded in order by gα log where and is solution of log although we do not have an explicit formula for the regret in the case it is interesting to note that is strictly negligible with respect to but strictly dominates log this comes from that while in the case the waiting arm is no longer used the time budget necessary to complete step is still higher that in the case when decreases at logarithmic speed for some it is still possible to apply the same reasoning as the one developed in this paper but in this case exp which is no longer logarithmic in in other will only achieve regret of words if the mixing coefficients decrease too slowly the information given by the concentration inequality in theorem is not sufficient to deduce interesting information about the mean value of the arms in this case the successive values of the processes are too dependent and the randomness in the sequence of values is almost negligible an adversarial bandit algorithm such as may give better results than conclusion we have studied an extension of the bandit problem to the stationary framework in the restless case by providing functional algorithm and an upper bound of the regret in general framework future work might include study of lower bound for the regret in the mixing process case our first findings on the issue are that the analysis of the scenario in the mixing framework bears significant challenges another interesting point would be the study of the more difficult case of processes rather different but very interesting question that we may address in the future is the possibility to exploit possible structure of the correlation between rewards over time for instance in the case wher the correlation of an arm with the close past is much higher than the correlation with the distant past it might be interesting to see if the analysis done in can be extended to exploit this correlation structure acknowledgments this work is partially supported by the projet greta greediness theory and algorithms and the nd project references audibert jy bubeck minimax policies for adversarial and stochastic bandits in annual conference on learning theory auer ortner ucb revisited improved regret bounds for the stochastic bandit problem periodica mathematica hungarica auer fischer analysis of the armed bandit problem machine learning journal auer freund schapire re the nonstochastic multiarmed bandit problem siam journal on computing bernstein sur extension du limite du calcul des aux sommes de mathematische annalen bubeck regret analysis of stochastic and nonstochastic bandit problems foundation and trends in machine learning vol now hoeffding class of statistics with asymptotically normal distribution annals of mathematical statistics hoeffding probability inequalities for sums of bounded random variables journal of the american statistical association doi url http karandikar rl vidyasagar rates of uniform convergence of empirical means with mixing processes statistics probability letters kontorovich ramanan concentration inequalities for dependent random 
variables via the martingale method the annals of probability kulkarni lozano schapire re convergence and consistency of regularized boosting algorithms with stationary observations in advances in neural information processing systems pp lai tl robbins asymptotically efficient adaptive allocation rules advances in applied mathematics mcdonald shalizi schervish estimating coefficients arxiv preprint mohri rostamizadeh rademacher complexity bounds for processes in koller schuurmans bengio bottou eds advances in neural information processing systems pp mohri rostamizadeh stability bounds for stationary and processes journal of machine learning research ortner ryabko auer munos regret bounds for restless markov bandits in proceeding of the int conf algorithmic learning theory pp pandey chakrabarti agarwal bandit problems with dependent arms in proceedings of the international conference on machine learning acm pp ralaivola szafranski stempfel chromatic bounds for data applications to ranking and stationary processes the journal of machine learning research seldin slivkins one practical algorithm for both stochastic and adversarial bandits in proceedings of the international conference on machine learning pp steinwart christmann fast learning from observations in advances in neural information processing systems pp steinwart hush scovel learning from dependent observations journal of multivariate analysis tekin liu online learning of rested and restless bandits ieee transactions on information theory url http yu rates of convergence for empirical processes of stationary mixing sequences annals of probability 
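As a companion to the description of Remix-UCB in the sections above, the following sketch illustrates the waiting-arm mechanism in code: significant pulls of each arm are spaced with growing gaps, and whenever no arm is due for a significant pull, the empirically best arm is played as the waiting arm without updating its statistics. The spacing rule n^(1/α), the simple elimination test, and the omission of the epoch bookkeeping are simplifying assumptions for illustration; this is not the authors' algorithm verbatim.

```python
import numpy as np

def remix_style_bandit(arms, alphas, horizon, delta=0.05):
    """Illustrative waiting-arm strategy for restless phi-mixing bandits.

    arms      : list of callables, arms[k](t) -> reward in [0, 1] at time t
    alphas[k] : assumed algebraic mixing exponent of arm k
    Significant pulls of arm k are spaced with growing gaps (one reading of the
    sampling-rate discussion above); between scheduled significant pulls the
    empirically best arm is played as the waiting arm, and its reward counts
    toward the cumulative gain but is NOT used to update any empirical mean."""
    K = len(arms)
    sums = np.zeros(K)                   # reward sums from significant pulls only
    counts = np.zeros(K, dtype=int)      # numbers of significant pulls
    last_sig = np.full(K, -np.inf)       # time of the last significant pull
    active = list(range(K))
    rewards = []

    def gap(k):                          # spacing required before the next significant pull
        return int(np.ceil(max(counts[k], 1) ** (1.0 / alphas[k])))

    def radius(k):                       # illustrative confidence width
        return np.sqrt(np.log(2.0 / delta) / (2.0 * max(counts[k], 1)))

    for t in range(1, horizon + 1):
        ready = [k for k in active if t - last_sig[k] >= gap(k)]
        if ready:                        # significant pull: reward updates the arm's mean
            k = min(ready, key=lambda j: counts[j])
            r = arms[k](t)
            sums[k] += r
            counts[k] += 1
            last_sig[k] = t
        else:                            # waiting arm: best empirical arm, no mean update
            k = max(active, key=lambda j: sums[j] / counts[j])
            r = arms[k](t)
        rewards.append(r)
        # eliminate arms whose optimistic value falls below the best pessimistic value
        if len(active) > 1 and all(counts[j] > 0 for j in active):
            best_lcb = max(sums[j] / counts[j] - radius(j) for j in active)
            active = [j for j in active if sums[j] / counts[j] + radius(j) >= best_lcb]
    return float(np.sum(rewards)), active

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    arms = [lambda t, m=m: float(np.clip(rng.normal(m, 0.1), 0.0, 1.0)) for m in (0.4, 0.6)]
    total, surviving = remix_style_bandit(arms, alphas=[0.5, 0.5], horizon=2000)
    print("cumulative reward:", total, "surviving arms:", surviving)
```

The sketch keeps the two properties stressed in the text: rewards gathered through the waiting arm still count toward the agent's cumulative gain, and only significant pulls enter the empirical means used for elimination.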
factored logistic regression for neuroimaging data danilo bzdok michael eickenberg olivier grisel bertrand thirion varoquaux inria parietal team saclay france cea neurospin france abstract imaging neuroscience links human behavior to aspects of brain biology in everincreasing datasets existing neuroimaging methods typically perform either discovery of unknown neural structure or testing of neural structure associated with mental tasks however testing hypotheses on the neural correlates underlying larger sets of mental tasks necessitates adequate representations for the observations we therefore propose to blend representation modelling and task classification into unified statistical learning problem multinomial logistic regression is introduced that is constrained by factored coefficients and coupled with an autoencoder we show that this approach yields more accurate and interpretable neural models of psychological tasks in reference dataset as well as better generalization to other datasets keywords brain imaging cognitive science learning systems biology introduction methods for neuroimaging research can be grouped by discovering neurobiological structure or assessing the neural correlates associated with mental tasks to discover on the one hand spatial distributions of neural activity structure across time independent component analysis ica is often used it decomposes the bold signals into the primary modes of variation the ensuing spatial activity patterns are believed to represent brain networks of functionally interacting regions similarly sparse principal component analysis spca has been used to separate bold signals into parsimonious network components the extracted brain networks are probably manifestations of electrophysiological oscillation frequencies their fundamental organizational role is further attested by continued covariation during sleep and anesthesia network discovery by applying ica or spca is typically performed on unlabeled data these capture brain dynamics during ongoing random thought without controlled environmental stimulation in fact large portion of the bold signal variation is known not to correlate with particular behavior stimulus or experimental task to test on the other hand the neural correlates underlying mental tasks the general linear model glm is the dominant approach the contribution of individual brain voxels is estimated according to design matrix of experimental tasks alternatively psychophysiological interactions ppi elucidate the influence of one brain region on another conditioned by experimental tasks as last example an increasing number of neuroimaging studies model experimental tasks by training classification algorithms on brain signals all these methods are applied to labeled data that capture brain dynamics during behavior two important conclusions can be drawn first the mentioned supervised neuroimaging analyses typically yield results in voxel space this ignores the fact that the bold signal exhibits spatially distributed patterns of coherent neural activity second existing supervised neuroimaging analyses can not exploit the abundance of easily acquired data these may allow better discovery of the manifold of brain states due to the high similarities of neural activity patterns as observed using ica and linear correlation both these neurobiological properties can be conjointly exploited in an approach that is mixed using rest and task data factored performing network decomposition and multitask capitalize on neural representations 
shared across mental operations the integration of discovery into supervised classification can yield learning framework the most relevant neurobiological structure should hence be identified for the prediction problem at hand autoencoders suggest themselves because they can emulate variants of most unsupervised learning algorithms including pca spca and ica autoencoders ae are layered learning models that condense the input data to local and global representations via reconstruction under compression prior they behave like truncated pca in case of one linear hidden layer and squared error loss autoencoders behave like spca if shrinkage terms are added to the model weights in the optimization objective moreover they have the characteristics of an ica in case of tied weights and adding nonlinear convex function at the first layer these authors further demonstrated that ica sparse autoencoders and sparse coding are mathematically equivalent under mild conditions thus autoencoders may flexibly project the neuroimaging data onto the main directions of variation in the present investigation linear autoencoder will be fit to unlabeled rest data and integrated as rankreducing bottleneck into multinomial logistic regression fit to labeled task data we can then solve the compound statistical problem of unsupervised data representation and supervised classification previously studied in isolation from the perspective of dictionary learning the figure model architecture linear first layer represents projectors to the discovered set of autoencoders find an optimized comsis functions which are linearly combined by the second pression of brain voxels into layer to perform predictions neurobiologically this unknown activity patterns by improving allows delineating manifold of brain reconstruction from them the decomnetwork patterns and then distinguishing mental tasks by position matrix equates with the bottletheir most discriminative linear combinations neck of factored logistic regression cally reduction in model variance should be achieved supervised learning on task by autoencoders that privilege the most data xtask can thus be guided by unrobiologically valid models in the hypothesis set supervised decomposition of rest data cally neuroimaging research frequently suffers from data xrest scarcity this limits the set of representations that can be extracted from glm analyses based on few participants we therefore contribute computational framework that analyzes many problems simultaneously thus finds shared representations by learning and exploits unlabeled data since they span space of meaningful configurations methods data as the currently biggest reference dataset we chose resources from the human connectome project hcp neuroimaging task data with labels of ongoing cognitive processes were drawn from healthy hcp participants cf appendix for details on datasets hcp tasks were selected that are known to elicit reliable neural activity across participants table in sum the hcp task data incorporated activity maps from diverse paradigms administered to participants removed due to incomplete data all maps were resampled to common space of isotropic voxels and masked at least tissue probability the supervised analyses were thus based on labeled hcp task maps with voxels of interest representing in gray matter cognitive task reward punish shapes faces random theory of mind mathematics language tongue movement food movement hand movement matching relations view bodies view faces view places view tools 
stimuli instruction for participants card game guess the number of mystery card for of money shape pictures face pictures decide which of two shapes matches another shape geometrically decide which of two faces matches another face emotionally videos with objects decide whether the objects act randomly or intentionally spoken numbers auditory stories complete addition and subtraction problems choose answer about the topic of the story move tongue squeezing of the left or right toe tapping of the left or right finger decide whether two objects match in shape or texture decide whether object pairs differ both along either shape or texture passive watching passive watching passive watching passive watching indicate whether current stimulus is the same as two items earlier visual cues shapes with textures pictures pictures pictures pictures various pictures table description of psychological tasks to predict these labeled data were complemented by unlabeled activity maps from hcp acquisitions of unconstrained activity these reflect brain activity in the absence of controlled thought in sum the hcp rest data concatenated unlabeled rest maps with brain maps from each of randomly selected participants we were further interested in the utility of the optimized projection in one task dataset for dimensionality reduction in another task dataset to this end the network decompositions were used as preliminary step in the classification problem of another large sample the archi dataset provides activity maps from diverse experimental tasks including auditory and visual perception motor action reading language comprehension and mental calculation analogous to hcp data the second task dataset thus incorporated labeled masked and activity maps from diverse tasks acquired in participants linear autoencoder the labeled and unlabeled data were fed into linear statistical model composed of an autoencoder and logistic regression the affine autoencoder takes the input projects it into coordinate system of latent representations and reconstructs it back to by where rd denotes the vector of voxel values from each rest map rn is the ndimensional hidden state distributed neural activity patterns and rd is the reconstruction vector of the original activity map from the hidden variables further denotes the weight matrix that transforms from input space into the hidden space encoder is the weight matrix for backprojection from the hidden variables to the output space decoder and are corresponding bias vectors the model parameters are found by minimizing the expected squared reconstruction error lae kx here we choose and to be tied consequently the learned weights are forced to take function that of signal analysis and that of signal synthesis the first layer analyzes the data to obtain the cleanest latent representation while the second layer represents building blocks from which to synthesize the data using the latent activations tying these processes together makes the analysis layer interpretable and pulls all singular values towards nonlinearities were not applied to the activations in the first layer factored logistic regression our factored logistic regression model is best described as variant of multinomial logistic regression specifically the weight matrix is replaced by the product of two weight matrices with common latent dimension the later is typically much lower than the dimension of the data alternatively this model can be viewed as feedforward neural network with linear activation function for the 
hidden layer and softmax function on the output layer as the dimension of the hidden layer is much lower than the input layer this architecture is sometimes referred to as linear bottleneck in the literature the probability of an input to belong to class is given by softmaxi flr where flr computes multinomial logits and softmaxi exp xi exp xj the matrix rdxn transforms the input rd into latent components and the matrix rnxl projects the latent components onto hyperplanes that reflect label probabilities and are bias vectors the loss function is given by llr nxtask nxtask log layer combination the optimization problem of the linear autoencoder and the factored logistic regression are linked in two ways first their transformation matrices mapping from input to the latent space are tied we hence search for compression of the voxel values into unknown components that represent latent code optimized for both rest and task activity data second the objectives of the autoencoder and the factored logistic regression are interpolated in the common loss function llr nxrest lae in so doing we search for the combined model parameters with respect to the unsupervised reconstruction error and the supervised task detection lae is devided by nxrest to equilibrate both loss terms to the same order of magnitude represents an regularization that combines and penalty terms optimization the common objective was optimized by gradient descent in the ssflogreg parameters the required gradients were obtained by using the chain rule to backpropagate error derivatives we chose the rmsprop solver refinement of stochastic gradient descent rmsprop dictates an adaptive learning rate for each model parameter by scaled gradients from running average the batch size was set to given much expected redundancy in xrest and xtask matrix parameters were initalized by gaussian random values multiplied by gain and bias parameters were initalized to the normalization factor and the update rule for are given by where is the loss function computed on minibatch sample at timestep is the learning rate global damping factor and the decay rate to deemphasize the magnitude of the gradient note that we have also experimented with other solvers stochastic gradient descent adadelta and adagrad but found that rmsprop converged faster and with similar or higher generalization performance implementation the analyses were performed in python we used nilearn to handle the large quantities of neuroimaging data and theano for automatic numerically stable differentiation of symbolic computation graphs all python scripts that generated the results are accessible online for reproducibility and reuse http experimental results serial versus parallel structure discovery and classification we first tested whether there is substantial advantage in combining unsupervised decomposition and supervised classification learning we benchmarked our approach against performing data reduction on the unlabeled first half of the hcp task data by pca spca ica and ae components and learning classification models in the labeled second half by ordinary logistic regression pca reduced the dimensionality of the task data by finding orthogonal network components whitening of the data spca separated the bold signals into network components with few regions by optimization problem constrained by penalty no orthogonality assumptions maximum iterations tolerance of ica performed iterative blind source separation by parallel fastica implementation maximum iterations tolerance of 
initialized by random mixing matrix whitening of the data ae found code of latent representations by optimizing projection into bottleneck iterations same implementation as below for rest data the second half of the task data was projected onto the latent components discovered in its first half only the ensuing component loadings were submitted to ordinary logistic regression no hidden layer iterations these serial twostep approaches were compared against parallel decomposition and classification by ssflogreg one hidden layers iterations importantly all trained classification models were tested on large unseen test set of data in the present analyses across choices for ssflogreg achieved more than accuracy whereas supervised learning based on pca spca ica and ae loadings ranged from to table this experiment establishes the advantage of directly searching for structure in the fmri data rather than solving the supervised and unsupervised problems independently this effect was particularly pronounced when assuming few hidden dimensions pca logreg spca logreg ica logreg ae logreg ssflogreg table serial versus parallel dimensionality reduction and classification chance is at model performance ssflogreg was subsequently trained epochs across parameter choices for the hidden components and the balance between autoencoder and logistic regression assuming latent directions of variation should yield models with higher bias and smaller variance than ssflogreg with latent directions given the problem of hcp setting to consistently yields generalization performance at chancelevel because only the unsupervised layer of the estimator is optimized at each epoch iteration over the data the performance of the trained classifier was assessed on of unseen hcp data additionally the performance of the learned decomposition was assessed by using it as dimensionality reduction of an independent labeled dataset archi and conducting ordinary logistic regression on the ensuing component loadings accuracy precision mean recall mean score mean reconstr error norm accuracy table performance of ssflogreg across model parameter choices chance is at we made three noteworthy observations table first the most supervised estimator achieved in no instance the best accuracy precision recall or scores on hcp data classification by ssflogreg is therefore facilitated by imposing structure from the unlabeled rest data confirmed by the normalized reconstruction error kx little weight on the supervised term is sufficient for good model performance while keeping low and decomposition figure effect of bottleneck in classificaton problem depicts the prediction scores for each of psychological tasks multinomial logistic regression operating in voxel space blue bars was compared to ssflogreg operating in left plot and right plot latent modes grey bars autoencoder or rest data were not used for these analyses ordinary logistic regression yielded accuracy out of sample while ssflogreg scored at and hence compressing the voxel data into component space for classification achieves higher task separability chance is at second the higher the number of latent components the higher the performance with small values of this suggests that the presence of more hidden components results in more effective feature representations in unrelated task data third for and but not the purely decomposition matrix resulted in noninferior performance of and respectively table this confirms that guiding model learning by structure extracts features of general 
relevance beyond the supervised problem at hand individual effects of dimensionality reduction and rest data we first quantified the impact of introducing bottleneck layer disregarding the autoencoder to this end ordinary logistic regression was juxtaposed with ssflogreg at for this experiment we increased the difficulty of the classification problem by including data from all hcp tasks indeed increased class separability in component space as compared to voxel space entails differences in generalization performance of figure notably the cognitive tasks on reward and punishment processing are among the least predicted with ordinary but well predicted with logistic regression tasks and in figure these experimental conditions have been reported to exhibit highly similar neural activity patterns in glm analyses of that dataset consequently also local activity differences in the striatum and visual cortex in this case can be successfully captured by modelling we then contemplated the impact of rest structure figure by modulating its influence in and settings at the beginning of every epoch task and rest maps were drawn with replacement from same amounts of task and rest maps in scenarios frequently encountered by neuroimaging practitioners the scores improve as we depart from the most supervised model in scenarios we observed the same trend to be apparent feature identification we finally examined whether the models were fit for purpose figure to this end we computed pearson correlation between the classifier weights and the averaged neural activity map for each of the tasks ordinary logistic regression thus yielded mean correlation of across tasks for ssflogreg map was computed by matrix multiplication of the two inner layers feature identification performance thus ranged between and for between and for and between and for consequently ssflogreg puts higher absolute weights on relevant structure this reflects an increased ratio in part explained by figure effect of rest structure model performance of ssflogreg for different choices of in task and rest maps hot color and task and rest maps cold color scenarios gradient descent was performed on task and rest maps at the begining of each epoch these were drawn with replacement from pool of or different task and rest maps respectively chance is at figure classification weight maps the voxel predictors corresponding to exemplary of total psychological tasks rows from the hcp dataset left column multinomial logistic regression same implementation but without bottleneck or autoencoder middle column ssflogreg latent components right column average across all samples of activity maps from each task ssflogreg puts higher absolute weights on relevant structure lower ones on irrelevant structure and yields local contiguity without enforcing an explicit spatial prior all values are and thresholded at the percentile the more local contiguity conversely ssflogreg puts lower probability mass on irrelevant structure despite lower interpretability of the results from ordinary logistic regression the weight maps were sufficient for good classification performance hence ssflogreg yielded class weights that were much more similar to features of the respective training samples for all choices of and ssflogreg therefore captures genuine properties of task activity patterns rather than or artefacts miscellaneous observations for the sake of completeness we informally report modifications of the statistical model that did not improve generalization performance 
introducing stochasticity into model learning by input corruption of xtask deteriorated model performance in all scenarios adding rectified linear units relu to or other commonly used nonlinearities sigmoid softplus hyperbolic tangent all led to decreased classification accuracies probably due to sample size limits further pretraining of the bottleneck initialization by either corresponding pca spca or ica loadings did not exhibit improved accuracies neither did autoencoder pretraining moreover introducing an additional overcomplete layer units after the bottleneck was not advantageous finally imposing either only or only penalty terms was disadvantageous in all tested cases this favored elasticnet regularization chosen in the above analyses discussion and conclusion using the flexibility of factored models we learn the representation from highdimensional voxel brain space that is most important for prediction of cognitive task sets from perspective factorization of the logistic regression weights can be viewed as transforming classification problem into learning problem the higher generalization accuracy and support recovery comparing to ordinary logistic regression hold potential for adoption in various neuroimaging analyses besides increased performance these models are more interpretable by automatically learning mapping to and from space this learning algorithm encourages departure from the artificial and statistically less attractive voxel space neurobiologically brain activity underlying defined mental operations can be explained by linear combinations of the main activity patterns that is fmri data probably concentrate near manifold of characteristic brain network combinations extracting fundamental building blocks of brain organization might facilitate the quest for the cognitive primitives of human thought we hope that these first steps stimulate development towards powerful representation extraction in systems neuroscience in the future automatic reduction of brain maps to their neurobiological essence may leverage dataintense neuroimaging investigations initiatives for data collection are rapidly increasing in neuroscience these promise structured integration of neuroscientific knowledge accumulating in databases tractability by condensed feature representations can avoid the problem of learning the full distribution of activity patterns this is not only relevant to the challenges spanning the human cognitive space but also the combination with models of brain anatomy and genomics the biggest socioeconomic potential may lie in clinical studies that predict disease trajectories and drug responses in psychiatric and neurological populations acknowledgment the research leading to these results has received funding from the european union seventh framework programme under grant agreement no human brain project data were provided by the human connectome project further support was received from the german national academic foundation and the metamri associated team references abraham pedregosa eickenberg gervais mueller kossaifi gramfort thirion varoquaux machine learning for neuroimaging with front neuroinform amunts lepage borgeat mohlberg dickscheid rousseau bludau bazin lewis et al bigbrain an human brain model science baldi hornik neural networks and principal component analysis learning from examples without local minima neural networks barch burgess harms petersen schlaggar corbetta glasser curtiss dixit feldt function in the human connectome and individual differences in 
behavior neuroimage bastien lamblin pascanu bergstra goodfellow bergeron bouchard wardefarley bengio theano new features and speed improvements arxiv preprint beckmann deluca devlin smith investigations into connectivity using independent component analysis philos trans soc lond biol sci bergstra breuleux bastien lamblin pascanu desjardins turian bengio theano cpu and gpu math expression compiler proceedings of the python for scientific computing conference scipy biswal mennes zuo gohel kelly et al toward discovery science of human brain function proc natl acad sci cole bassettf power braver petersen intrinsic and network architectures of the human brain neuron fox raichle spontaneous fluctuations in brain activity observed with functional magnetic resonance imaging nat rev neurosci frackowiak markram the future of human cerebral cartography novel approach philosophical transactions of the royal society of london biological sciences friston buechel fink morris rolls dolan psychophysiological and modulatory interactions in neuroimaging neuroimage friston holmes worsley poline frith frackowiak statistical parametric maps in functional imaging general linear approach hum brain mapp gorgolewski burns madison clark halchenko waskom ghosh nipype flexible lightweight and extensible neuroimaging data processing framework in python front neuroinform hertz krogh palmer introduction to the theory of neural computation vol basic books hinton salakhutdinov reducing the dimensionality of data with neural networks science hipp siegel bold fmri correlation reflects neuronal correlation curr biol le karpenko ngiam ng ica with reconstruction cost for efficient overcomplete feature learning pp need goldstein whole genome association studies in complex diseases where do we stand dialogues in clinical neuroscience olshausen et al emergence of receptive field properties by learning sparse code for natural images nature pinel thirion meriaux jobert serres le bihan poline dehaene fast reproducible identification and databasing of individual functional cognitive networks bmc neurosci poldrack gorgolewski making big data open data sharing in neuroimaging nature neuroscience poldrack halchenko hanson decoding the structure of brain function by classifying mental states across individuals psychol sci schwartz thirion varoquaux mapping cognitive ontologies to and from the brain advances in neural information processing systems smith beckmann andersson auerbach bijsterbosch douaud duff feinberg griffanti harms et al fmri in the human connectome project neuroimage smith fox miller glahn fox mackay filippini watkins toro laird beckmann correspondence of the brain functional architecture during activation and rest proc natl acad sci tieleman hinton lecture divide the gradient by running average of its recent magnitude coursera neural networks for machine learning varoquaux gramfort pedregosa michel thirion dictionary learning to segment an atlas of brain spontaneous activity information processing in medical imaging pp 
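To summarize the model described above in code, the following numpy sketch evaluates the combined objective of the factored logistic regression: a multinomial cross-entropy whose weight matrix is the low-rank product W V, a tied-weight linear autoencoder reconstruction term on rest data sharing the same W, and an elastic-net penalty. The shapes, the interpolation weight, and the function names are hypothetical; the implementation described in the paper uses Theano, with gradients obtained by backpropagation and RMSprop updates.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def combined_loss(W, V, b, X_task, y_task, X_rest, lam=0.3, l1=1e-4, l2=1e-4):
    """Loss of the factored ("bottleneck") logistic regression coupled with a
    tied linear autoencoder, as described above (biases of the autoencoder
    omitted for brevity).

    W : (d, n) shared encoder / first factor of the classifier weights
    V : (n, L) second factor, mapping latent components to class logits
    b : (L,)   class biases
    X_task, y_task : labeled task maps and integer labels in {0, ..., L-1}
    X_rest : unlabeled rest maps, used only in the reconstruction term
    lam    : interpolation weight between the two objectives (hypothetical value)"""
    # supervised term: multinomial cross-entropy with low-rank weights W @ V
    logits = X_task @ W @ V + b
    p = softmax(logits)
    nll = -np.mean(np.log(p[np.arange(len(y_task)), y_task] + 1e-12))
    # unsupervised term: tied-weight linear autoencoder reconstruction error,
    # normalized by the number of rest maps to balance the two terms
    recon = X_rest @ W @ W.T
    ae = np.sum((X_rest - recon) ** 2) / len(X_rest)
    # elastic-net penalty on the model weights
    penalty = l1 * (np.abs(W).sum() + np.abs(V).sum()) + l2 * ((W ** 2).sum() + (V ** 2).sum())
    return nll + lam * ae + penalty

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n, L = 200, 20, 18                  # voxels, latent components, task classes
    W = 0.01 * rng.normal(size=(d, n))
    V = 0.01 * rng.normal(size=(n, L))
    b = np.zeros(L)
    X_task, y_task = rng.normal(size=(64, d)), rng.integers(0, L, size=64)
    X_rest = rng.normal(size=(128, d))
    print(combined_loss(W, V, b, X_task, y_task, X_rest))
```

In a training loop one would differentiate this objective with respect to W, V, and b and apply the RMSprop update described above; the sketch only shows how the supervised and unsupervised terms share the bottleneck matrix W.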
gaussian process random fields david moore and stuart russell computer science division university of california berkeley berkeley ca dmoore russell abstract gaussian processes have been successful in both supervised and unsupervised machine learning tasks but their computational complexity has constrained practical applications we introduce new approximation for gaussian processes the gaussian process random field gprf in which local gps are coupled via pairwise potentials the gprf likelihood is simple tractable and parallelizeable approximation to the full gp marginal likelihood enabling latent variable modeling and hyperparameter selection on large datasets we demonstrate its effectiveness on synthetic spatial data as well as application to seismic event location introduction many machine learning tasks can be framed as learning function given noisy information about its inputs and outputs in regression and classification we are given inputs and asked to predict the outputs by contrast in latent variable modeling we are given set of outputs and asked to reconstruct the inputs that could have produced them gaussian processes gps are flexible class of probability distributions on functions that allow us to approach problems from an appealingly principled and clean bayesian perspective unfortunately the time complexity of exact gp inference is where is the number of data points this makes exact gp calculations infeasible for data sets with many approximations have been proposed to escape this limitation one particularly simple approximation is to partition the input space into smaller blocks replacing single large gp with multitude of local ones this gains tractability at the price of potentially severe independence assumption in this paper we relax the strong independence assumptions of independent local gps proposing instead markov random field mrf of local gps which we call gaussian process random field gprf gprf couples local models via pairwise potentials that incorporate covariance information this yields surrogate for the full gp marginal likelihood that is simple to implement and can be tractably evaluated and optimized on large datasets while still enforcing smooth covariance structure the task of approximating the marginal likelihood is motivated by unsupervised applications such as the gp latent variable model but examining the predictions made by our model also yields novel interpretation of the bayesian committee machine we begin by reviewing gps and mrfs and some existing approximation methods for gps in section we present the gprf objective and examine its properties as an approximation to the full gp marginal likelihood we then evaluate it on synthetic data as well as an application to seismic event location full gp local gps bayesian committee machine figure predictive distributions on toy regression problem background gaussian processes gaussian processes are distributions on functions gps are parameterized by mean function µθ typically assumed without loss of generality to be and covariance function sometimes called kernel kθ with hyperparameters common choice is the squared exponential covariance kse exp kx with hyperparameters and specifying respectively prior variance and correlation lengthscale we say that random function is gaussian process distributed if for any input points the vector of function values is multivariate gaussian kθ in many applications we have access only to noisy observations for some noise process if the noise is iid gaussian then the 
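Since the covariance formulas in the passage above lost their symbols in extraction, the following minimal sketch restates the squared-exponential kernel in code and carries out the standard GP regression computation, conditioning on noisy observations to obtain a predictive mean and variance at test points. The hyperparameter values are assumptions for illustration only.

```python
# Minimal GP regression sketch: predictive mean and variance at test inputs
# under a squared-exponential kernel.  Hyperparameter values are assumptions.
import numpy as np

def se_kernel(A, B, sigma_f=1.0, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X_train, y_train, X_test, noise_var=0.01, **kern):
    """Posterior mean and marginal variance of f(X_test) given noisy y_train."""
    K = se_kernel(X_train, X_train, **kern) + noise_var * np.eye(len(X_train))
    K_star = se_kernel(X_test, X_train, **kern)
    K_ss = se_kernel(X_test, X_test, **kern)
    alpha = np.linalg.solve(K, y_train)
    mean = K_star @ alpha
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.diag(cov)

# toy usage on a 1-d function observed with noise
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(30)
X_test = np.linspace(-3, 3, 5)[:, None]
mu, var = gp_predict(X, y, X_test, lengthscale=1.0)
print(np.round(mu, 2), np.round(var, 3))
```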
observations are themselves gaussian ky where ky kθ the most common application of gps is to bayesian regression in which we attempt to predict the function values at test points via the conditional distribution given the training data sometimes however we do not observe the training inputs or we observe them only partially or noisily this setting is known as the gaussian process latent variable model gplvm it uses gps as model for unsupervised learning and nonlinear dimensionality reduction the setting typically involves observations with each output dimension modeled as an independent gaussian process the input locations hyperparameters are typically sought via maximization of the marginal likelihood log log yit yi log tr though some recent work attempts to recover an approximate posterior on by maximizing variational bound given differentiable covariance function this maximization is typically performed by methods although local maxima can be significant concern as the marginal likelihood is generally scalability and approximate inference the main computational difficulty in gp methods is the need to invert or factor the kernel matrix ky which requires time cubic in in inference this must be done at every optimization step to evaluate and its derivatives this complexity has inspired number of approximations the most commonly studied are inducingpoint methods in which the unknown function is represented by its values at set of inducing points where these points can be chosen by maximizing the marginal likelihood in surrogate model or by minimizing the kl divergence between the approximate and exact gp posteriors inference in such models can typically be done in time but this comes at the price of reduced representational capacity while smooth functions with long lengthscales may be compactly represented by small number of inducing points for functions with significant local structure it may be difficult to find any faithful representation more compact than the complete set of training observations separate class of approximations local gp methods involves partitioning the inputs into blocks of points each then modeling each block with an independent gaussian process if the partition is spatially local this corresponds to covariance function that imposes independence between function values in different regions of the input space computationally each block requires only time the total time is linear in the number of blocks local approximations preserve structure within each block but their harsh independence assumptions can lead to predictive discontinuities and inaccurate uncertainties figure these assumptions are problematic for inference because the marginal likelihood becomes discontinuous at block boundaries nonetheless local gps sometimes work very well in practice achieving results comparable to more sophisticated methods in fraction of the time the bayesian committee machine bcm attempts to improve on independent local gps by averaging the predictions of multiple gp experts the model is formally equivalent to an inducingpoint model in which the test points are the inducing points it assumes that the training blocks are conditionally independent given the test data the bcm can yield predictions that avoid the pitfalls of local gps figure while maintaining scalability to very large datasets however as purely predictive approximation it is unhelpful in the setting where we are interested in the likelihood of our training set irrespective of any particular test data the desire for 
approximation to the marginal likelihood was part of the motivation for this current work in section we show that the gprf proposed in this paper can be viewed as such model models extend the local gp concept in different direction instead of deterministically assigning points to gp models based on their spatial locations they treat the assignments as unobserved random variables and do inference over them this allows the model to adapt to different functional characteristics in different regions of the space at the price of more difficult inference task we are not aware of models being applied in the setting though this should in principle be possible simple building blocks are often combined to create more complex approximations the pic approximation blends global model with local covariances thus capturing mix of global and local structure though with the same boundary discontinuities as in vanilla local gps related approach is the use of covariance functions with compact support to capture local variation in concert with global inducing points surveys and compares several approximate gp regression methods on synthetic and datasets finally we note here the similar title of which is in fact orthogonal to the present work they use random field as prior on input locations whereas this paper defines random field decomposition of the gp model itself which may be combined with any prior on markov random fields we recall some basic theory regarding markov random fields mrfs also known as undirected graphical models pairwise mrf consists of an undirected graph along with node potentials ψi and edge potentials ψij which define an energy function on random vector ψi yi ψij yi yj where is partitioned into components yi identified with nodes in the graph this energy in turn defines probability density the gibbs distribution given by exp where exp dz is normalizing constant gaussian random fields are the special case of pairwise mrfs in which the gibbs distribution is multivariate gaussian given partition of into ym gaussian distribution with covariance and precision matrix can be expressed by potentials ψi yi yit jii yi ψij yi yj jij yj where jij is the submatrix of corresponding to the yi yj the normalizing constant involves the determinant of the covariance matrix since edges whose potentials are zero can be dropped without effect the nonzero entries of the precision matrix can be seen as specifying the edges present in the graph gaussian process random fields we consider vector of observations ky modeled by gp where ky is implicitly function of input locations and hyperparameters unless otherwise specified all probabilities yi yi yj refer to marginals of this full gp we would like to perform optimization on the marginal likelihood with respect to but suppose that the cost of doing so directly is prohibitive in order to proceed we assume partition ym of the observations into blocks of size at most with an implied corresponding partition xm of the perhaps unobserved inputs the source of this partition is not focus of the current work we might imagine that the blocks correspond to spatially local clusters of input points assuming that we have noisy observations of the values or at least reasonable guess at an initialization we let kij covθ yi yj denote the appropriate submatrix of ky and jij denote the corresponding submatrix of the precision matrix jy note that jij kij in general the gprf objective given the precision matrix jy we could use to represent the full gp distribution in factored form as 
an mrf this is not directly useful since computing jy requires cubic time instead we propose approximating the marginal likelihood via random field in which local gps are connected by pairwise potentials given an edge set which we will initially take to be the complete graph our approximate objective is qgp rf yi yi yi yj yi yj yi yj where ei denotes the neighbors of in the graph and yi and yi yj are marginal probabilities under the full gp equivalently they are the likelihoods of local gps defined on the points xi and xi xj respectively note that these local likelihoods depend implicitly on and taking the log we obtain the energy function of an unnormalized mrf log qgp rf log yi log yi yj with potentials ψigp rf yi log yi gp rf ψij yi yj log yi yj we refer to the approximate objective as qgp rf rather than pgp rf to emphasize that it is not in general normalized probability density it can be interpreted as approximation in which joint density is approximated via overlapping pairwise marginals in the special case that the full precision matrix jy induces tree structure on the blocks of our partition qgp rf recovers the exact marginal likelihood this is shown in the supplementary material in general this will not be the case but in the spirit of loopy belief propagation we consider the case as an approximation for the general setting before further analyzing the nature of the approximation we first observe that as sum of local gaussian the objective is straightforward to implement and fast to evaluate each of the pairwise densities requires time for an overall complexity of the extension to multiple independent outputs is straightforward when the quadratic dependence on can not be avoided by any algorithm that computes similarities between all pairs of training points however in practice we will consider local modifications in which is something smaller than all pairs of blocks for example if each block is connected only to fixed number of spatial neighbors the complexity reduces to linear in in the special case where is the empty set we recover the exact likelihood of independent local gps it is also straightforward to obtain the gradient of with respect to hyperparameters and inputs by summing the gradients of the local densities the likelihood and gradient for each term in the sum can be evaluated independently using only local subsets of the training data enabling simple parallel implementation having seen that qgp rf can be optimized efficiently it remains for us to argue its validity as proxy for the full gp marginal likelihood due to space constraints we defer proofs to the supplementary material though our results are not difficult we first show that like the full marginal likelihood qgp rf has the form of gaussian distribution but with different precision matrix theorem the objective qgp rf has the form of an unnormalized gaussian density with precision with blocks given by matrix ij ij if jii kii kii jij otherwise where ij is the local precision matrix ij defined as the inverse of the marginal covariance ij ij kii kij ij ij ij kji kjj although the gaussian density represented by qgp rf is not in general normalized we show that it is approximately normalized in certain sense theorem the objective qgp rf is approximately normalized in the sense that the optimal value of the bethe free energy ln bi yi bij yi yj fb log bi yi bij yi yj ln ln ψi yi ψij yi yj yi yi yj the approximation to the normalizing constant found by loopy belief propagation is precisely zero furthermore this 
optimum is obtained when the pseudomarginals bi bij are taken to be the true gp marginals pi pij this implies that loopy belief propagation run on gprf would recover the marginals of the true gp predictive equivalence to the bcm we have introduced qgp rf as surrogate model for the training set however it is natural to extend the gprf to make predictions at set of test points by including the function values as an block with an edge to each of the training blocks the resulting predictive distribution yi yj yi pgp rf qgp rf corresponds exactly to the prediction of the bayesian committee machine bcm this motivates the gprf as natural extension of the bcm as model for the training set providing an alternative to noisy observed locations mean error full gp showing grid cells with inducing points note contraction figure inferred locations on synthetic data colored by the first output dimension the standard transductive interpretation of the similar derivation shows that the conditional distribution of any block yi given all other blocks also takes the form of bcm prediction suggesting the possibility of pseudolikelihood training directly optimizing the quality of bcm predictions on blocks not explored in this paper experiments uniform input distribution we first consider synthetic dataset intended to simulate spatial location tasks such as wifislam or seismic event location below in which we observe measurements but have only noisy information regarding the locations at which those measurements were taken we sample points uniformly from the square of side length to generate the true inputs then sample output from independent gps with se kernel exp for and noise standard deviation σn the observed points obs σobs arise by corrupting with isotropic gaussian noise of standard deviation σobs the parameters σn and σobs were chosen to generate problems with interesting structure for which optimization could nontrivially improve the initial noisy locations figure shows typical sample from this model for local gps and gprfs we take the spatial partition to be grid with cells where is the desired number of points per cell the gprf edge set connects each cell to its eight neighbors figure yielding linear time complexity during optimization practical choice is necessary do we use fixed partition of the points or points to cells as they cross spatial boundaries the latter corresponds to coherent spatial covariance function but introduces discontinuities to the marginal likelihood in our experiments the gprf was not sensitive to this choice but local gps performed more reliably with fixed spatial boundaries in spite of the discontinuities so we used this approach for all experiments for comparison we also evaluate the sparse implemented in gpy which uses the fitc approximation to the marginal likelihood we also considered the bayesian but found it to be more with no meaningful difference in results on this problem here the approximation parameter is the number of inducing points we ran optimization to recover maximum posteriori map locations or local optima thereof figure shows mean location error euclidean distance for points at this size it is tractable to compare directly to the full the gprf with large block size corresponding to grid nearly matches the solution quality of the full gp while requiring less time while the local methods are quite fast to converge but become stuck at inferior optima the fitc optimization exhibits an interesting pathology it initially moves towards good solution but then 
diverges towards what turns out to correspond to contraction of the space figure we conjecture this is because there are not enough inducing points to faithfully represent the full gp the gprf is still transductive in the sense that adding test block will change the marginal distribution on the training observations as can be seen explicitly in the precision matrix the contribution of the gprf is that it provides reasonable model for the likelihood even in the absence of test points mean location error over time for mean error at convergence as mean location error over time including comparison function of with learned for to full gp scale figure results on synthetic data distribution over the entire space partial fix is to allow fitc to jointly optimize over locations and the correlation lengthscale this yielded biased lengthscale estimate but more accurate locations in figure to evaluate scaling behavior we next considered problems of increasing size up to out of generosity to fitc we allowed each method to learn its own preferred lengthscale figure reports the solution quality at convergence showing that even with an adaptive lengthscale fitc requires increasingly many inducing points to compete in large spatial domains this is intractable for larger problems due to scaling indeed attempts to run at with inducing points exceeded of available memory recently more sophisticated methods have claimed scalability to very large datasets but they do so with we expect that they would hit the same fundamental scaling constraints for problems that inherently require many inducing points on our largest synthetic problem approximations are intractable as is the full local gps converge more quickly than gprfs of equal block size but the gprfs find solutions figure after short initial period the best performance always belongs to gprf and at the conclusion of hours the best gprf solution achieves mean error lower than the best local solution vs seismic event location we next consider an application to seismic event location which formed the motivation for this work seismic waves can be viewed as vectors generated from an underlying threedimension manifold namely the earth crust nearby events tend to generate similar waveforms we can model this spatial correlation as gaussian process prior information regarding the event locations is available from traditional location systems which produce an independent gaussian uncertainty ellipse for each event full probability model of seismic waveforms accounting for background noise and performing joint alignment of arrival times is beyond the scope of this paper to focus specifically on the ability to approximate inference we used real event locations but generated synthetic waveforms by sampling from gp using kernel with and lengthscale of we also generated observed location estimates obs by corrupting the true locations with the astute reader will wonder how we generated synthetic data on problems that are clearly too large for an exact gp for these synthetic problems as well as the seismic example below the covariance matrix is relatively sparse with only of entries corresponding to points within six kernel lengthscales of each other by considering only these entries we were able to draw samples using sparse cholesky factorization although this required approximately of ram unfortunately this approach does not straightforwardly extend to inference under the exact gp as the standard expression for the marginal likelihood derivatives log tr involves the full 
precision matrix which is not sparse in general bypassing this expression via automatic differentiation through the sparse cholesky decomposition could perhaps allow exact inference to scale to somewhat larger problems event map for seismic dataset mean location error over time figure seismic event location task gaussian noise of standard deviation in each dimension given the observed waveforms and noisy locations we are interested in recovering the latitude longitude and depth of each event our dataset consists of events detected at the mankachi array station in kazakstan between and figure shows the event locations colored to reflect principle axis tree partition into blocks of points tree construction time was negligible the gprf edge set contains all pairs of blocks for which any two points had initial locations within one kernel lengthscale of each other we also evaluated connections but found that this relatively local edge set had the best tradeoffs eliminating edges not only speeds up each optimization step but in some cases actually yielded faster convergence perhaps because denser edge sets tended to create large cliques for which the pairwise gprf objective is poor approximation figure shows the quality of recovered locations as function of computation time we jointly optimized over event locations as well as two lengthscale parameters surface distance and depth and the noise variance local gps perform quite well on this task but the best gprf achieves lower mean error than the best local gp model vs respectively given equal time an even better result can be obtained by using the results of local gp optimization to initialize gprf using the same partition for both local gps and the gprf this hybrid method gives the lowest final error and is dominant across wide range of wall clock times suggesting it as promising practical approach for large optimizations conclusions and future work the gaussian process random field is tractable and effective surrogate for the gp marginal likelihood it has the flavor of approximate inference methods such as loopy belief propagation but can be analyzed precisely in terms of deterministic approximation to the inverse covariance and provides new interpretation of the bayesian committee machine it is easy to implement and can be straightforwardly parallelized one direction for future work involves finding partitions for which gprf performs well partitions that induce block structure perhaps related question is identifying when the gprf objective defines normalizable probability distribution beyond the case of an exact tree structure and under what circumstances it is good approximation to the exact gp likelihood this evaluation in this paper focuses on spatial data however both local gps and the bcm have been successfully applied to regression problems so exploring the effectiveness of the gprf for dimensionality reduction tasks would also be interesting another useful avenue is to integrate the gprf framework with other approximations since the gprf and methods have complementary strengths the gprf is useful for modeling function over large space while inducing points are useful when the density of available data in some region of the space exceeds what is necessary to represent the function an integrated method might enable new applications for which neither approach alone would be sufficient acknowledgements we thank the anonymous reviewers for their helpful suggestions this work was supported by dtra grant and by computing resources donated by 
microsoft research under an azure for research grant references neil lawrence gaussian process latent variable models for visualisation of high dimensional data advances in neural information processing systems nips volker tresp bayesian committee machine neural computation carl rasmussen and chris williams gaussian processes for machine learning mit press michalis titsias and neil lawrence bayesian gaussian process latent variable model in international conference on artificial intelligence and statistics aistats andreas damianou michalis titsias and neil lawrence variational inference for latent variables and uncertain inputs in gaussian processes journal of machine learning research jmlr joaquin and carl edward rasmussen unifying view of sparse approximate gaussian process regression journal of machine learning research jmlr neil lawrence learning for larger datasets with the gaussian process latent variable model in international conference on artificial intelligence and statistics aistats michalis titsias variational learning of inducing variables in sparse gaussian processes in international conference on artificial intelligence and statistics aistats duy matthias seeger and jan peters model learning with local gaussian process regression advanced robotics chiwoo park jianhua huang and yu ding domain decomposition approach for fast gaussian process regression of large spatial data sets journal of machine learning research jmlr krzysztof chalupka christopher ki williams and iain murray framework for evaluating approximation methods for gaussian process regression journal of machine learning research jmlr marc peter deisenroth and jun wei ng distributed gaussian processes in international conference on machine learning icml carl edward rasmussen and zoubin ghahramani infinite mixtures of gaussian process experts advances in neural information processing systems nips pages trung nguyen and edwin bonilla fast allocation of gaussian process experts in international conference on machine learning icml pages edward snelson and zoubin ghahramani local and global sparse gaussian process approximations in artificial intelligence and statistics aistats jarno vanhatalo and aki vehtari modelling local and global phenomena with sparse gaussian processes in uncertainty in artificial intelligence uai guoqiang zhong li yeung xinwen hou and liu gaussian process latent random field in aaai conference on artificial intelligence daphne koller and nir friedman probabilistic graphical models principles and techniques mit press jonathan yedidia william freeman and yair weiss bethe free energy kikuchi approximations and belief propagation algorithms advances in neural information processing systems nips kevin murphy yair weiss and michael jordan loopy belief propagation for approximate inference an empirical study in uncertainty in artificial intelligence uai pages julian besag statistical analysis of data the statistician pages brian ferris dieter fox and neil lawrence using gaussian process latent variable models in international joint conference on artificial intelligence ijcai pages the gpy authors gpy gaussian process framework in python http james hensman nicolo fusi and neil lawrence gaussian processes for big data in uncertainty in artificial intelligence uai page yarin gal mark van der wilk and carl rasmussen distributed variational inference in sparse gaussian process regression and latent variable models in advances in neural information processing systems nips international seismological centre 
bulletin int seis thatcham united kingdom http james mcnames a fast nearest-neighbor algorithm based on a principal axis search tree ieee transactions on pattern analysis and machine intelligence pami 
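To make the GPRF construction above concrete, here is a minimal self-contained sketch, not the authors' implementation: each block contributes its local GP log marginal likelihood, and each edge contributes a pairwise correction log p(y_i, y_j) - log p(y_i) - log p(y_j). The squared-exponential kernel, its hyperparameters, the noise level, and the grid partition are illustrative assumptions.

```python
# Sketch of the GPRF surrogate objective:
#   log q_GPRF(y) = sum_i log p(y_i)
#                 + sum_{(i,j) in E} [log p(y_i, y_j) - log p(y_i) - log p(y_j)],
# where each p(.) is the marginal of the full GP restricted to a block.
import numpy as np
from scipy.stats import multivariate_normal

def se_kernel(X1, X2, sigma_f=1.0, lengthscale=1.0):
    """Squared-exponential covariance between two sets of input points."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-0.5 * d2 / lengthscale ** 2)

def block_logpdf(y, X, noise_var=0.1, **kern):
    """Log marginal likelihood of one block (or two concatenated blocks)."""
    K = se_kernel(X, X, **kern) + noise_var * np.eye(len(X))
    return multivariate_normal(mean=np.zeros(len(X)), cov=K).logpdf(y)

def gprf_objective(y, X, blocks, edges, **kern):
    """blocks: list of index arrays; edges: list of (i, j) block pairs."""
    obj = sum(block_logpdf(y[b], X[b], **kern) for b in blocks)
    for i, j in edges:
        ij = np.concatenate([blocks[i], blocks[j]])
        obj += (block_logpdf(y[ij], X[ij], **kern)
                - block_logpdf(y[blocks[i]], X[blocks[i]], **kern)
                - block_logpdf(y[blocks[j]], X[blocks[j]], **kern))
    return obj

# toy usage: 2-d inputs split into four spatial blocks on a 2x2 grid,
# with each block connected to every other block (the complete graph)
rng = np.random.RandomState(0)
X = rng.uniform(0, 2, size=(80, 2))
y = rng.randn(80)  # stand-in for GP-distributed observations
cell = (X[:, 0] > 1).astype(int) * 2 + (X[:, 1] > 1).astype(int)
blocks = [np.where(cell == c)[0] for c in range(4)]
edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
print("log q_GPRF =", gprf_objective(y, X, blocks, edges, lengthscale=0.5))
```

As the paper notes, replacing the complete graph over blocks with a sparse edge set of spatial neighbors reduces the cost from quadratic to linear in the number of blocks, and each term in the sum can be evaluated and differentiated in parallel.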
m-statistic for kernel change-point detection shuang li yao xie milton stewart school of industrial and systems engineering georgia institute of technology hanjun dai le song computational science and engineering college of computing georgia institute of technology hanjundai lsong abstract detecting the emergence of an abrupt change-point is a classic problem in statistics and machine learning nonparametric statistics have been proposed for this task which make fewer assumptions on the distributions than the traditional parametric approach however none of the existing kernel change-point statistics has provided a computationally efficient way to characterize the extremal behavior of the statistic such characterization is crucial for setting the detection threshold to control the significance level in the offline case as well as the average run length in the online case in this paper we propose two related computationally efficient m-statistics for change-point detection when the amount of background data is large a novel theoretical result of the paper is the characterization of the tail probability of these statistics using a new technique based on change-of-measure such characterization provides us accurate detection thresholds for both offline and online cases in a computationally efficient manner without the need to resort to the more expensive simulations such as bootstrapping we show that our methods perform well in both synthetic and real world data introduction detecting the emergence of an abrupt change-point is a classic problem in statistics and machine learning given a sequence of samples xt from a domain we are interested in detecting a possible change-point such that before the change-point the samples xi follow the background distribution and after the change-point the samples follow a different post-change distribution here the time horizon can be either a fixed number called an offline problem or not fixed where we keep getting new samples called a sequential or online problem our goal is to detect the existence of the change-point in the offline setting or detect the emergence of a change-point as soon as possible after it occurs in the online setting we will restrict our attention to detecting one change-point which arises often in monitoring problems one such example is seismic event detection where we would like to detect the onset of the event precisely in retrospect to better understand earthquakes or as quickly as possible from the streaming data ideally the detection algorithm should also be robust to distributional assumptions as we wish to detect all kinds of seismic events that are different from the background typically we have a large amount of background data since seismic events are rare and we want the algorithm to exploit these data while being computationally efficient classical approaches for change-point detection are usually parametric meaning that they rely on strong assumptions on the distribution nonparametric and kernel approaches are distribution free and more robust as they provide consistent results over larger classes of data distributions they can possibly be less powerful in settings where a clear distributional assumption can be made in particular many kernel based statistics have been proposed in the machine learning literature which typically work better on real data with few assumptions however none of these existing kernel statistics has provided a computationally efficient way to characterize the tail probability of the extremal value of these statistics characterization of such tail probability is crucial for setting the correct detection thresholds for both the offline and online cases furthermore computational efficiency is also an important consideration since typically the amount of 
background data is very large in this case one has the freedom to restructure and sample the background data during the statistical design to gain computational efficiency on the other hand detection problems are related to the statistical test problems however they are usually more difficult in that for detection we need to search for the unknown location for instance in the offline case this corresponds to taking maximum of series of statistics each corresponding to one putative location similar idea was used in for the offline case and in the online case we have to characterize the average run length of the test statistic hitting the threshold which necessarily results in taking maximum of the statistics over time moreover the statistics being maxed over are usually highly correlated hence analyzing the tail probabilities of the test statistic for detection typically requires more sophisticated probabilistic tools in this paper we design two related for detection based on kernel maximum mean discrepancy mmd for test although mmd has nice unbiased and minimum variance estimator mmdu it can not be directly applied since mmdu costs to compute based on sample of data points in the detection case this translates to complexity quadratically grows with the number of background observations and the detection time horizon therefore we adopt strategy inspired by the recently developed statistic and design statistic for detection at high level our methods sample blocks of background data of size compute mmdu of each reference block with the block and then average the results however different from the simple test case in order to provide an accurate detection threshold the background block needs to be designed in novel structured way in the offline setting and updated recursively in the online setting besides presenting the new our contributions also include deriving accurate approximations to the significance level in the offline case and average run length in the online case for our which enable us to determine thresholds efficiently without recurring to the onerous simulations repeated bootstrapping obtaining variance estimator which allows us to form the easily developing novel structured ways to design background blocks in the offline setting and rules for update in the online setting which also leads to desired correlation structures of our statistics that enable accurate approximations for tail probability to approximate the asymptotic tail probabilities we adopt highly sophisticated technique based on recently developed in series of paper by yakir and siegmund et al the numerical accuracy of our approximations are validated by numerical examples we demonstrate the good performance of our method using real speech and human activity data we also find that in the testing scenario it is always beneficial to increase the block size as the distribution for the statistic under the null and the alternative will be better separated however this is no longer the case in online detection because larger block size inevitably causes larger detection delay finally we point to future directions to relax our gaussian approximation and correct for the skewness of the statistics background and related work we briefly review methods and the maximum mean discrepancy reproducing kernel hilbert space rkhs on with kernel is hilbert space of functions with inner product its element satisfies the reproducing property hf if and consequently hk if meaning that we can view the evaluation of function at any point as 
an inner product assume there are two sets with observations from domain where xn are drawn from distribution and yn are drawn from distribution the maximum mean discrepancy mmd is defined as supf ex ey an unbiased estimate of can be obtained using xi xj yi yj where is the kernel of the defined as xi xj yi yj xi xj yi yj xi yj xj yi intuitively the empirical test statistic is expected to be small close to zero if and large if and are far apart the complexity for evaluating is since we have to form the gram matrix for the data under the is degenerate and distributed the same as an infinite sum of variables to improve the computational efficiency and obtain an threshold for hypothesis testing recently proposed an alternative statistic for called the key idea of the approach is to partition the samples from and into blocks xn and yn each of constant size then xi yi is computed for each pair of pn blocks and averaged over the blocks to result in xi yi since is constant and the computational complexity of is significant reduction compared to furthermore by averaging xi yi over independent blocks the is asymptotically normal leveraging over the central limit theorem this latter property also allows simple threshold to be derived for the test rather than resorting to more expensive bootstrapping approach our later statistics are inspired by however the detection setting requires significant new derivations to obtain the test threshold since one cares about the maximum of computed at different point in time moreover the detection case consists of sum of highly correlated mmd statistics because these are formed with common test block of data this is inevitable in our detection problems because test data is much less than the reference data hence we can not use the central limit theorem even martingale version but have to adopt the aforementioned approach related work other nonparametric detection approach has been proposed in the literature in the offline setting designs test statistic based on running maximum partition strategy to test for the presence of studies related problem in which there are anomalous sequences out of sequences to be detected and they construct test statistic using mmd in the online setting presents that compares data in some reference window to the data in the current window using some empirical distance measures not detects abrupt changes by comparing two sets of descriptors extracted online from the signal at each time instant the immediate past set and the immediate future set based on soft margin support vector machine svm they build dissimilarity measure which is asymptotically equivalent to the fisher ratio in the gaussian case in the feature space between those sets without estimating densities as an intermediate step uses estimation to detect and models the using gaussian kernel model whose parameters are updated online through stochastic gradient decent the above work lack theoretical analysis for the extremal behavior of the statistics or average run length for offline and online detection give sequence of observations xt xi with denoting the sequence of background or reference data assume large amount of reference data is available our goal is to detect the existence of such that before the samples are with distribution and after the samples are with different distribution the location where the occurs is unknown we may formulate this problem as hypothesis test where the null hypothesis states that there is no and the alternative hypothesis is that there exists 
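Written out, the kernel h underlying the unbiased MMD estimate combines within-sample and cross-sample kernel evaluations. The sketch below, with an assumed Gaussian kernel and median-heuristic bandwidth, implements the standard quadratic-time unbiased MMD-squared estimator on a pair of samples and the block-averaged (B-test style) version described above.

```python
# Sketch of the unbiased quadratic-time MMD^2 between two samples and the
# block-averaged (B-test style) estimator described above.  The Gaussian
# kernel and the median-heuristic bandwidth are assumptions.
import numpy as np

def gaussian_kernel(X, Y, bandwidth):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def median_bandwidth(X):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.sqrt(np.median(d2[d2 > 0]) / 2)

def mmd2_unbiased(X, Y, bandwidth):
    """Unbiased MMD^2 estimate; diagonal (i = j) terms are excluded."""
    n, m = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2 * Kxy.mean()

def block_averaged_mmd2(X, Y, block_size, bandwidth):
    """B-test style statistic: average MMD^2_u over disjoint blocks."""
    n_blocks = min(len(X), len(Y)) // block_size
    vals = [mmd2_unbiased(X[i * block_size:(i + 1) * block_size],
                          Y[i * block_size:(i + 1) * block_size], bandwidth)
            for i in range(n_blocks)]
    return np.mean(vals)

# toy usage: same distribution (statistic near zero) vs. shifted mean
rng = np.random.RandomState(0)
X = rng.randn(500, 3)
bw = median_bandwidth(X[:100])
print(block_averaged_mmd2(X, rng.randn(500, 3), 50, bw))
print(block_averaged_mmd2(X, rng.randn(500, 3) + 1.0, 50, bw))
```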
at some time we will construct our using the maximum mean discrepancy mmd to measure the difference between distributions of the reference and the test data we denote by the block of data which potentially contains also referred to as the block or test block in the offline setting we assume the size of can be up to bmax and we want to search for location of the within such that observations after are from different distribution inspired by the idea of we sample reference blocks of size bmax independently from the reference pool and index them as xibmax since we search for location bmax within for we construct from by taking contiguous data points and denote them as to form the statistic we correspondingly construct from each reference block by taking contiguous data points out of that block and index these as xi illustrated in fig we then compute block containing potential change point pool of reference data xn max bmax xn max bmax bmax bmax bmax bmax bmax bmax 𝑋𝑖 bmax bmax block containing potential change point pool of reference data sample time pool of reference data sample time xi xi offline sequential figure illustration of offline case data are split into blocks of size bmax indexed backwards from time and we consider blocks of size bmax online case assuming we have large amount of reference or background data that follows the null distribution between xi zb and average over blocks xi xi xi yj yl where xi denotes the jth sample in xi and yj denotes the th sample in due to the property of under the null hypothesis zb let var zb denote the variance of zb under the null the expression of zb is given by in the following section we see the variance depends on the block size and the number of blocks as increases var zb decreases also illustrated in figure in the appendix considering this we standardize the statistic maximize over all values of to define the offline and detect whenever the exceeds the threshold max zb var zb offline detection bmax zb where varying the from to bmax corresponds to searching for the unknown location in the online setting suppose the block has size and we construct it using sliding window in this case the potential is declared as the end of each block to form the statistic we take samples without replacement since we assume the reference data are distribution from the reference pool to form reference blocks compute the quadratic statistics between each reference block and the block and then average them when there is new sample time moves from to we append the new sample in the reference block remove the oldest sample from the block and move it to the reference pool the reference blocks are also updated accordingly the end point of each reference block is moved to the reference pool and new point is sampled and appended to the front of each reference block as shown in fig using the sliding window scheme described above similarly we may define an online by forming standardized average of the between the block in sliding window and the reference block xi where is the fixed xi is the ith reference block of size at time and is the the block of size at time in the online case we have to characterize the average run length of the test statistic hitting the threshold which necessarily results in taking maximum of the statistics over time the online detection procedure is stopping time where we detect whenever the normalized exceeds threshold inf var online detection mt note in the online case we actually take maximum of the standardized statistics over time there is 
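The offline and online statistics above are standardized by their null variance, and as the next passage explains, the moments required for that standardization can be estimated directly from the reference pool by Monte Carlo, drawing four points without replacement and evaluating the h-kernel. The following is a hedged sketch of that estimate; the Gaussian kernel and bandwidth are assumptions.

```python
# Sketch of the Monte Carlo moment estimate used to standardize the scan
# statistic: draw (x, x', y, y') without replacement from the reference pool,
# evaluate the MMD h-kernel, and average h^2.  Kernel and bandwidth assumed.
import numpy as np

def k(a, b, bw=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * bw ** 2))

def h_kernel(x, xp, y, yp, bw=1.0):
    """Two-sample U-statistic kernel underlying MMD^2_u."""
    return k(x, xp, bw) + k(y, yp, bw) - k(x, yp, bw) - k(xp, y, bw)

def estimate_Eh2(reference, n_draws=10000, bw=1.0, seed=0):
    """Monte Carlo estimate of E[h^2] under the null, from reference data."""
    rng = np.random.RandomState(seed)
    vals = np.empty(n_draws)
    for t in range(n_draws):
        i = rng.choice(len(reference), size=4, replace=False)
        x, xp, y, yp = reference[i]
        vals[t] = h_kernel(x, xp, y, yp, bw) ** 2
    return vals.mean()

# toy usage with a Gaussian reference pool
reference = np.random.RandomState(0).randn(1000, 2)
print("E[h^2] estimate:", estimate_Eh2(reference))
```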
recursive way to calculate the online efficiently explained in section in the appendix at the stopping time we claim that there exists there is tradeoff in choosing the block size in online setting small block size will incur smaller computational cost which may be important for the online case and it also enables smaller detection delay for strong point magnitude however the disadvantage of small is lower power which corresponds to longer detection delay when the magnitude is weak for example the amplitude of the mean shift is small examples of offline and online are demonstrated in fig based on synthetic data and segment of the real seismic signal we see that the proposed offline powerfully detects the existence of and accurately pinpoints where the change occurs the online quickly hits the threshold as soon as the change happens normal laplace seismic signal normal signal laplace time time offline null time peak statistic statistic offline time statistic statistic normal signal signal time online time seismic signal figure examples of offline and online with and offline case without and with bmax and the maximum is obtained when online case with at detection delay is and we use real seismic signal and with different kernel bandwidth all thresholds are theoretical values and are marked in red theoretical performance analysis we obtain an analytical expression for the variance var zb in and by leveraging the correspondence between the statistics and since zb is some form of and exploiting the known properties of we also derive the covariance structure for the online and offline standardized zb statistics which is crucial for proving theorems and lemma variance of zb under the null given any fixed block size and number of blocks under the null hypothesis var zb cov where and are with the null distribution lemma suggests an easy way to estimate the variance var zb from the reference data to estimate we need to first estimate by each time drawing four samples without replacement from the reference data use them for evaluate the sampled function value and then form monte carlo average similarly we may estimate cov lemma covariance structure of the standardized zb statistics under the null hypothesis given and in bmax for the offline case ru cov zu zv where max and for the online case ru cov mu for in the offline setting the choice of the threshold involves tradeoff between two standard performance metrics the significant level sl which is the probability that the exceeds the threshold under the null hypothesis when there is no and ii power which is the probability of the statistic exceeds the threshold under the alternative hypothesis in the online setting there are two analogous performance metrics commonly used for analyzing detection procedures the expected value of the stopping time when there is no change the average run length arl ii the expected detection delay edd defined to be the expected stopping time in the extreme case where change occurs immediately at we focus on analyzing sl and arl of our methods since they play key roles in setting thresholds we derive accurate approximations to these quantities as functions of the threshold so that given prescribed sl or arl we can solve for the corresponding analytically let and denote respectively the probability measure and expectation under the null theorem sl in offline case when and bmax for some constant the significant level of the offline defined in is given by bx max max bmax var zb where the special function is the probability 
density function and is the cumulative distribution function of the standard normal distribution respectively the proof of theorem uses argument which is based on the likelihood ratio identity see the likelihood ratio identity relates computing of the tail probability under the null to computing sum of expectations each under an alternative distribution indexed by particular parameter value to illustrate assume the probability density function pdf under the null is given function with in some index set we may introduce family of alternative distributions with pdf where log du is the log moment generating function and is the parameter that we may assign an arbitrary value it can be easily verified that is pdf using this family of alternative we may calculate the probability of an event under the original distribution by calculating sum of expectations where the indicator function is one when event is true and zero otherwise is the expectation using pdf log is the ratio and we have the freedom to choose different value for each the basic idea of in our setting is to treat zb zb zb as random field indexed by then to characterize sl we need to study the tail probability of the maximum of this random field relate this to the setting above zb corresponds to corresponds to and corresponds to the threshold crossing event to compute the expectations under the alternative measures we will take few steps first we choose parameter value for each pdf associated with parameter value such that this is equivalent to setting the mean under each alternative probability to the threshold eb zb and it allows us to use the local central limit theorem since under the alternative measure boundary cross has much larger probability second we will express the random quantities involved in the expectations as functions of the local field terms as well as the ratios we show that they are asymptotically independent as and grows on the order of and this further simplifies our calculation the last step is to analyze the covariance structure of the random field lemma in the following and approximate it using gaussian random field note that the terms and have correlation due to our construction they share the same block we then apply the localization theorem theorem in to obtain the final result theorem arl in online case when and for some constant the average run length arl of the stopping time defined in is given by eb proof for theorem is similar to that for theorem due to the fact that for given max mt hence pwe also need to study the tail probability of the maximum of random field mt for fixed block size similar approach can be used except that the covariance structure of mt in the online case is slightly different from the offline case this tail probability turns out to be in form of using similar ments as those in we may see that is asymptotically exponentially distributed hence exp consequently which leads to theorem shows that arl eb and hence log arl on the other hand the edd is typically on the order of using the wald identity although more careful analysis should be carried out in the future work where is the kl divergence between the null and alternative distributions on order of constant hence given desired arl typically on the order of or the error made in the estimated threshold will only be translated linearly to edd this is blessing to us and it means typically reasonably accurate will cause little performance loss in edd similarly theorem shows that sl and similar argument can be made for the offline 
case numerical examples we test the performance of the using simulation and real world data here we only highlight the main results more details can be found in appendix in the following examples we use gaussian kernel exp ky where is the kernel bandwidth and we use the median trick to get the bandwidth which is estimated using the background data accuracy of lemma for estimating var zb fig in the appendix shows the empirical distributions of zb when and when in both cases we generate random instances which are computed from data following to represent the null distribution moreover we also plot the gaussian pdf with sample mean and sample variance which matches well with the empirical distribution note the approximation works better when the block size decreases the skewness of the statistic can be corrected see discussions in section accuracy of theoretical results for estimating threshold for the offline case we compare the thresholds obtained from numerical simulations bootstrapping and using our approximation in theorem for various sl values we choose the maximum block size to be bmax in the appendix fig demonstrates how threshold is obtained by simulation for the threshold corresponds to the quantile of the empirical distribution of the offline statistic for range of values fig compares the empirical sl value from simulation with that predicted by theorem and shows that theory is quite accurate for small which is desirable as we usually care of small to obtain thresholds table shows that our approximation works quite well to determine thresholds given thresholds obtained by our theory matches quite well with that obtained from monte carlo simulation the null distribution is and even from bootstrapping for real data scenario here the bootstrap thresholds are for speech signal from the dataset in this case the null distribution is unknown and we only have samples speech signals thus we generate bootstrap samples to estimate the threshold as shown in fig in the appendix these obtained from theoretical approximations have little performance degradation and we will discuss how to improve in section table comparison of thresholds for offline case determined by simulation bootstrapping and theory respectively for various sl value sim bmax boot the sim bmax boot the sim bmax boot the for the online case we also compare the thresholds obtained from simulation using instances for various arl and from theorem respectively as predicated by theory the threshold is consistently accurate for various null distributions shown in fig also note from fig that the precision improves as increases the null distributions we consider include exponential distribution with mean random graph with nodes and probability of of forming random edges and laplace distribution expected detection delays edd in the online setting we compare edd with the assumption of detecting when the signal is dimensional and the transition happens gaussian exp random graph laplace theory gaussian exp random graph laplace theory arl arl gaussian exp random graph laplace theory arl figure in online case for range of arl values comparison obtained from simulation and from theorem under various null distributions from gaussian to mean gaussian where the postchange mean vector is equal to constant mean shift in this setting fig demonstrates the tradeoff in choosing block size when block size is too small the statistical power of the is weak and hence edd is large on the other hand when block size is too large although statistical power is 
good edd is also large because the way we update the test block therefore there is an optimal block size for each case fig shows the optimal block size decreases as the mean shift increases as expected we test the performance of our using real data our datasets include speech dataset in the speech resource consortium src corpora provided by national institute of informatics nii human activity sensing consortium hasc challenge we compare our with algorithm the relative densityratio rdr estimate one limitation of the rdr algorithm however is that it is not suitable for data because estimating density ratio in the setting is illposed to achieve reasonable performance for the rdr algorithm we adjust the bandwidth and the regularization parameter at each time step and hence the rdr algorithm is computationally more expensive than the method we use the area under curve auc the larger the better as performance metric our have competitive performance compared with the baseline rdr algorithm in the real data testing here we report the main results and the details can be found in appendix for the speech data our goal is to detect the onset of speech signal emergent from the background noise the background noises are taken from real acoustic signals such as background noise in highway airport and subway stations the overall auc for the statistic is and for the baseline algorithm is for human activity detection data we aim at detection the onset of transitioning from one activity to another each data consists of human activity information collected by portable accelerometers the overall auc for the is and for the baseline algorithm is discussions we may be able to improve the precision of the tail probability approximation in theorems and to account for skewness of zb in the argument we need to choose parameter values such that currently we use gaussian assumption zb and hence and we may improve the precision if we are able to estimate skewness zb for zb in particular we can include the skewness in the log moment generating function approximation zb when we estimate the parameter setting the derivative of this to and solving quadratic equation zb for this will change the leading exponent term in from to be similar improvement can be done for the arl approximation in theorem acknowledgments this research was supported in part by and to bigdata onr nsf nsf career to available from http available from http references desobry davy and doncarli an online kernel change detection algorithm ieee trans sig enikeeva and harchaoui detection with sparse alternatives arthur gretton karsten borgwardt malte rasch bernhard and alexander smola kernel test the journal of machine learning research harchaoui bach cappe and moulines methods for hypothesis testing ieee sig proc magazine pages harchaoui bach and moulines kernel analysis in adv in neural information processing systems nips kifer and gehrke detecting change in data streams in proc of the vldb song liu makoto yamada nigel collier and masashi sugiyama detection in data by direct estimation neural networks aaditya ramdas sashank jakkam reddi aarti singh and larry wasserman on the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions in aaai conference on artificial intelligence ross and automatic picking of direct seismic phases and fault zone head waves geophys bernhard scholkopf and alexander smola learning with kernels support vector machines regularization optimization and beyond mit press serfling approximation 
Theorems of Mathematical Statistics. John Wiley & Sons.
Siegmund. Sequential Analysis: Tests and Confidence Intervals. Springer.
Siegmund and Venkatraman. Using the generalized likelihood ratio statistic for sequential detection of a change-point. Annals of Statistics.
Siegmund and Yakir. Detecting the emergence of a signal in a noisy image. Statistics and Its Interface.
Xie and Siegmund. Sequential multi-sensor change-point detection. Annals of Statistics.
Yakir. Extremes in Random Fields: A Theory and Its Applications. Wiley.
Zaremba, Gretton, and Blaschko. A low variance kernel two-sample test. In Advances in Neural Information Processing Systems (NIPS).
Zou, Liang, Poor, and Shi. Nonparametric detection of anomalous data via kernel mean embedding.
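Before moving on to the next paper, here is a minimal sketch of the median-trick bandwidth selection and the Gaussian kernel used in the numerical-examples section above. It is a sketch under assumptions: the function names, the subsampling step, and the factor of 2 in the exponent are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist


def median_bandwidth(background, max_points=1000, seed=0):
    """Median trick: use the median pairwise distance of (a subsample of)
    the background data as the Gaussian-kernel bandwidth."""
    rng = np.random.default_rng(seed)
    if len(background) > max_points:
        idx = rng.choice(len(background), size=max_points, replace=False)
        background = background[idx]
    return np.median(pdist(background))


def gaussian_kernel(X, Y, sigma):
    """Gaussian kernel matrix with entries exp(-||x - y||^2 / (2 * sigma^2))."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))
```

The kernel matrix produced this way is the ingredient from which the block-averaged detection statistic of the paper would be built; that aggregation step is omitted here.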
adaptive online learning dylan foster cornell university alexander rakhlin university of pennsylvania karthik sridharan cornell university abstract we propose general framework for studying adaptive regret bounds in the online learning setting subsuming model selection and bounds given or bound we ask does there exist some algorithm achieving this bound we show that modifications to recently introduced sequential complexity measures can be used to answer this question by providing sufficient conditions under which adaptive rates can be achieved in particular each adaptive rate induces set of offset complexity measures and obtaining small upper bounds on these quantities is sufficient to demonstrate achievability cornerstone of our analysis technique is the use of tail inequalities to bound suprema of offset random processes our framework recovers and improves wide variety of adaptive bounds including quantile bounds second order bounds and small loss bounds in addition we derive new type of adaptive bound for online linear optimization based on the spectral norm as well as new online theorem introduction some of the recent progress on the theoretical foundations of online learning has been motivated by the parallel developments in the realm of statistical learning in particular this motivation has led to martingale extensions of empirical process theory which were shown to be the right notions for online learnability two topics however have remained elusive thus far obtaining bounds and establishing model selection or inequalities for online learning problems in this paper we develop new techniques for addressing both these questions oracle inequalities and model selection have been topics of intense research in statistics in the last two decades given sequence of models whose union is one aims to derive procedure that selects given an sample of size an estimator fˆ from model that trades off bias and variance roughly speaking the desired oracle bound takes the form err fˆ inf inf err penn where penn is penalty for the model such oracle inequalities are attractive because they can be shown to hold even if the overall model is too large central idea in the proofs of such statements and an idea that will appear throughout the present paper is that penn should be slightly larger than the fluctuations of the empirical process for the model it is therefore not surprising that concentration particularly talagrand celebrated inequality for the supremum of the empirical played an important role in attaining oracle bounds in order to select good model in manner one establishes bounds on the fluctuations of an empirical process indexed by elements in each model deptartment of computer science deptartment of statistics lifting the ideas of oracle inequalities and bounds from statistical to online learning is not an obvious task for one there is no concentration inequality available even for the simple case of sequential rademacher complexity for the reader already familiar with this complexity change of the value of one rademacher variable results in change of the remaining path and hence an attempt to use version of bounded difference inequality grossly fails luckily as we show in this paper the concentration machinery is not needed and one only requires tail inequality this realization is motivated by the recent work of at high level our approach will be to develop inequalities for the suprema of certain offset processes where the offset is chosen to be slightly larger than the complexity of the 
corresponding model we then show that these offset processes determine which adaptive rates are achievable for online learning problems drawing strong connections to the ideas of statistical learning described earlier framework let be the set of observations the space of decisions and the set of outcomes let denote the set of distributions on set let be loss function the online learning framework is defined by the following process for nature provides input instance xt learner selects prediction distribution qt nature provides label yt while the learner draws prediction qt and suffers loss yt two important settings are supervised learning and online linear optimization is singleton set and are balls in dual banach spaces and for class dx we define the learner cumulative regret to as yt inf xt yt uniform regret bound bn is achievable if there exists randomized algorithm selecting such that yt inf xt yt bn where stands for an achievable rates bn depend on complexity of the function class for example sequential rademacher complexity of is one of the tightest achievable uniform rates for variety of loss functions an adaptive regret bound has the form bn and is said to be achievable if there exists randomized algorithm for selecting such that yt xt yt bn we distinguish three types of adaptive bounds according to whether bn depends only on only on or on both quantities whenever bn depends on an adaptive regret can be viewed as an oracle inequality which penalizes each according to measure of its complexity the complexity of the smallest model to which it belongs as in statistical learning an oracle inequality may be proved for certain functions bn even if uniform bound can not hold for any nontrivial bn related work the case when bn bn does not depend on has received most of the attention in the literature the focus is on bounds that can be tighter for nice sequences yet maintain guarantees an incomplete list of prior work includes couched in the setting of online optimization and in the experts setting bound of type bn was studied in which presented an algorithm that competes with all experts simultaneously but with varied regret with respect to each of them depending on the quantile of the expert another bound of this type was given by who consider online linear optimization with an unbounded set and provide oracle inequalities with an appropriately chosen function bn finally the third category of adaptive bounds are those that depend on both the hypothesis and the data the bounds that depend on the loss of the best function bounds sec fall in this category trivially since one may overbound the loss of the best function by the performance of we draw attention to the recent result of who show an adaptive bound in terms of both the loss of comparator and the kl divergence between the comparator and some prior distribution over experts an bound in terms of the variance of the loss of the comparator under the distribution induced by the algorithm was recently given in our study was also partly inspired by cover who characterized necessary and sufficient conditions for achievable bounds in prediction of binary sequences the methods in however rely on the structure of the binary prediction problem and do not readily generalize to other settings the framework we propose recovers the vast majority of known adaptive rates in literature including variance bounds quantile bounds bounds and fast rates for small losses it should be noted that while existing literature on adaptive online learning has 
focused on simple hypothesis classes such as finite experts and balls our results extend to general hypothesis classes including large nonparametric ones discussed in adaptive rates and achievability general setup the first step in building general theory for adaptive online learning is to identify what adaptive regret bounds are possible to achieve recall that an adaptive regret bound of bn is said to be achievable if there exists an online learning algorithm such that holds in the rest of this work we use the notation to denote the interleaved application of the operators inside the brackets repeated over rounds see achievability of an adaptive rate can be formalized by the following minimax quantity definition given an adaptive rate bn we define the offset minimax value an bn sup inf sup xt qt yt yt inf xt yt bn an bn quantifies how yt inf xt yt bn behaves when the optimal learning algorithm that minimizes this difference is used against nature trying to maximize it directly from this definition an adaptive rate bn is achievable if and only if an bn if bn is uniform rate bn bn achievability reduces to the minimax analysis explored in the uniform rate bn is achievable if and only if bn vn where vn is the minimax value of the online learning game we now focus on understanding the minimax value an bn for general adaptive rates we first show that the minimax value is bounded by an offset version of the sequential rademacher complexity studied in the symmetrization lemma below provides us with the first step towards probabilistic analysis of achievable rates before stating the lemma we need to define the notion of tree and the notion of sequential rademacher complexity given set tree of depth is sequence zt of functions zt one may view as complete binary tree decorated by elements of let be sequence of independent rademacher random variables then zt may be viewed as predictable process with respect to the filtration for tree the sequential rademacher complexity of function class rz on is defined as rn sup zt and rn sup rn lemma for any lower loss and any adaptive rate bn that only depends on outcomes bn bn we have that an sup xt yt bn further for any general adaptive rate bn an sup xt yt bn finally if one considers the supervised learning problem where and is loss that is convex and in its first argument then for any adaptive rate bn an sup xt bn the above lemma tells us that to check whether an adaptive rate is achievable it is sufficient to check that the corresponding adaptive sequential complexity measures are we remark that if the above complexities are bounded by some positive quantity of smaller order one can form new achievable rate by adding the positive quantity to bn probabilistic tools as mentioned in the introduction our technique rests on certain probabilistic inequalities we now state the first building block rather straightforward maximal inequality proposition let be set of indices and let xi be sequence of random variables satisfying the following tail condition for any xi bi exp exp si for some positive sequence bi nonnegative sequence and nonnegative sequence si of numbers and for constants then for any and max log log bi si log bi it holds that sup xi bi we remark that bi need not be the expected value of xi as we are not interested in deviations around the mean one of the approaches to obtaining inequalities is to split large class into smaller ones according to complexity radius and control certain stochastic process separately on each subset also known as the peeling 
technique in the applications below xi will often stand for the random supremum of this process on subset and bi will be an upper bound on its typical size given deviation bounds for xi above bi the dilated size bi then allows one to pass to maximal inequalities and thus verify achievability in lemma the same strategy works for obtaining bounds where we first prove tail bounds for the given size of the quantity then appeal to simple yet powerful example for the control of the supremum of stochastic process is an inequality due to pinelis for the norm which is supremum over the dual ball of martingale in banach space here we state version of this result that can be found in appendix lemma let be unit ball in separable banach space for any tree and any zt exp when the class of functions is not linear we may no longer appeal to the above lemma instead we make use of result from that extends lemma at price of factor before stating this lemma we briefly define the relevant complexity measures see for more details first set of trees is called an of rz on with respect to if zt vt the size of the smallest is denoted by np and np supz np the set is an of on with respect to if zt vt we let be the smallest such cover and set supz lemma let suppose rn with and that the following mild assumptions hold rn and there exists constant such that then for any for any tree of depth zt rn zt inf log the above lemma yields control on the size of the supremum of the sequential rademacher process as required for our inequalities next we turn our attention to an offset rademacher process where the supremum is taken over collection of random variables the behavior of this offset process was shown to govern the optimal rates of convergence for online nonparametric regression such control of the supremum will be necessary for some of the upper bounds we develop lemma let be tree of depth and let rz for any zt zt where exp exp log and log log and log we observe that the probability of deviation has both subgaussian and subexponential components using the above result and proposition leads to useful bounds on the quantities in lemma for specific types of adaptive rates given tree we obtain bound on the expected size of the sequential rademacher process when we subtract off the of the function on the tree adjusted by logarithmic terms corollary suppose and let be any tree of depth assume log for some then sup zt log log zt log log log the next corollary yields slightly faster rates than corollary when corollary suppose with and let be any tree of depth then zt achievable bounds in this section we use lemma along with the probabilistic tools from the previous section to obtain an array of achievable adaptive bounds for various online learning problems we subdivide the section into one subsection for each category of adaptive bound described in section adapting to data here we consider adaptive rates of the form bn uniform over we show the power of the developed tools on the following example example online linear optimization in rd consider the problem of online linear optimization where rd and the following adaptive rate is achievable bn log log yt yt where is the spectral norm let us deduce this result from corollary first observe that yt yt sup yt yt sup yt yt yt sup the linear function class can be covered at any scale with balls and thus for any tree we apply corollary with and the integral term in the corollary vanishes yielding the claimed statement model adaptation in this subsection we focus on achievable rates 
for oracle inequalities and model selection but without dependence on data the form of the rate is therefore bn assume we have class with the property that for any if we are told by an oracle that regret will be measured with respect to those hypotheses with inf then using the minimax algorithm one can guarantee regret bound of at most the sequential rademacher complexity rn on the other hand given the optimality of the sequential rademacher complexity for online learning problems for commonly encountered losses we can argue that for any chosen in hindsight one can not expect regret better than order rn in this section we show that simultaneously for all one can attain an adaptive upper bound of log rn that is we may predict as if we knew the optimal radius at the price of logarithmic factor this is the price of adaptation corollary for any class of predictors with if one considers the supervised learning problem with loss the following rate is achievable bn log log rn rn rn rn for absolute constants and defined in lemma in fact this statement is true more generally with replaced by it is tempting to attempt to prove the above statement with the exponential weights algorithm running as an aggregation procedure over the solutions for each in general this approach will fail for two reasons first if function values grow with the exponential weights bound will scale linearly with this value second an experts bound yields only slower rate as special case of the above lemma we obtain an online theorem we postpone this example to the next where we get version of this result we now provide bound for online linear optimization in banach spaces that automatically adapts to the norm of the comparator to prove it we use the concentration bound from lemma within the proof of the above corollary to remove the extra logarithmic factors example unconstrained linear optimization consider linear optimization with being the unit ball of some reflexive banach space with norm let be the dual space and the loss where we are using to represent the linear functional in the first argument to the second argument define where is the norm dual to if the unit ball of is then the following rate is achievable for all with log log log for the case of hilbert space the above bound was achieved by adapting to data and model simultaneously we now study achievable bounds that perform online model selection in way of specific interest is our online optimistic bound this bound should be compared to with the reader noting that it is independent of the number of experts is algorithmindependent and depends quadratically on the expected loss of the expert we compare against example generalized predictable sequences supervised learning consider an online supervised learning problem with convex loss let mt be any predictable sequence that the learner can compute at round based on information provided so far including xt one can think of the predictable sequence mt as prior guess for the hypothesis we would compare with in hindsight then the following adaptive rate is achievable bn inf log xt mt log log log for constants from corollary the achievability is direct consequence of eq in lemma followed by corollary one can include any predictable sequence in the rademacher average part because mt is zero mean particularly if we assume that the sequential covering of class grows as log for some we get that bn xt mt as gets closer to we get full adaptivity and replace by xt mt on the other hand as gets closer to more complex function 
classes we do not adapt and get uniform bound in terms of for we attain natural interpolation example regret to fixed vs regret to best supervised learning consider an online supervised learning problem with convex loss and let let be fixed expert chosen in advance the following bound is achievable bn xt xt xt xt in particular against we have bn and against an arbitrary expert we have bn log log log this bound follows from eq in lemma followed by corollary this extends the study of to supervised learning and general class of experts example optimistic assume that we have countable set of experts and that the loss for each expert on any round is and bounded by the function class is the set of all distributions over these experts and this setting can be formulated as online linear optimization where the loss of mixture over experts given instance is the expected loss under the mixture the following adaptive bound is achievable bn kl log yt kl log this adaptive bound is an online bound the rate adapts not only to the kl vergence of with fixed prior but also replaces with yt note that we have yt yt yielding the type bound described earlier this is an improvement over the bound in in that the bound is independent of number of experts and so holds even for countably infinite sets of experts the kl term in our bound may be compared to the term in the bound of if we have large but finite number of experts and take to be uniform the above bound provides an improvement over both and evaluating the above bound with distribution that places all its weight on any one expert appears to address the open question posed by of obtaining variance bounds for experts the proof of achievability of the above rate is shown in the appendix because it requires slight variation on the symmetrization lemma specific to the problem see for comparison of bounds and quantile bounds relaxations for adaptive learning to design algorithms for achievable rates we extend the framework of online relaxations from relaxation reln that satisfies the initial condition reln inf xt yt bn and the recursive condition reln sup sup yt reln inf xt qt yt is said to be admissible for the adaptive rate bn the relaxation corresponding strategy is arg minqt supyt yt reln which enjoys the adaptive bound yt inf xt yt bn reln it follows immediately that the strategy achieves the rate bn reln our goal is then to find relaxations for which the strategy is computationally tractable and reln or at least has smaller order than bn similar to conditional versions of the offset minimax values an yield admissible relaxations but solving these relaxations may not be computationally tractable example online consider the experts setting in example with bn max kl let ri and let qtr denote the exponential weights distribution with learning rate yt the following is an admissible relaxation achieving bn reln inf ri ys nri ri let be distribution with ys nri we predict by drawing according to then drawing an expert according to ri while in general the problem of obtaining an efficient adaptive relaxation might be hard one can ask the question if and efficient relaxation relr is available for each can one obtain an adaptive model selection algorithm for all of to this end for supervised learning problem with convex lipschitz loss we delineate meta approach which utilizes existing relaxations for each lemma let qtr be the randomized strategy corresponding to relr obtained after observing outcomes and let be nonnegative the following relaxation is admissible for the 
rate bn relr reln adan sup ys reln reln playing according to the strategy for adan will guarantee regret bound of bn adan and adan can be bounded using proposition when the form of is as in that proposition we remark that the above strategy is not necessarily obtained by running experts algorithm over the discretized values of it is an interesting question to determine the cases when such strategy is optimal more generally when the adaptive rate bn depends on data it is not possible to obtain the rates we show in this paper using the exponential weights algorithm with as the required weighting over experts would be data dependent and hence is not prior over experts further the bounds from algorithms are akin to having tails in proposition but for many problems we may have tails obtaining computationally efficient methods from the proposed framework is an interesting research direction proposition provides useful tool to establish achievable adaptive bounds and natural question to ask is if one can obtain constructive counterpart for the proposition references lucien pascal massart et al minimum contrast estimators on sieves exponential bounds and rates of convergence bernoulli lugosi and andrew nobel adaptive model selection using empirical complexities annals of statistics pages peter bartlett boucheron and lugosi model selection and error estimation machine learning pascal massart concentration inequalities and model selection volume springer shahar mendelson learning without concentration in conference on learning theory tengyuan liang alexander rakhlin and karthik sridharan learning with square loss localization through offset rademacher complexity proceedings of the conference on learning theory alexander rakhlin and karthik sridharan online nonparametric regression proceedings of the conference on learning theory alexander rakhlin karthik sridharan and ambuj tewari online learning random averages combinatorial parameters and learnability in advances in neural information processing systems elad hazan and satyen kale extracting certainty from uncertainty regret bounded by variation in costs machine learning chiang tianbao yang lee mehrdad mahdavi lu rong jin and shenghuo zhu online optimization with gradual variations in colt alexander rakhlin and karthik sridharan online learning with predictable sequences in proceedings of the annual conference on learning theory colt john duchi elad hazan and yoram singer adaptive subgradient methods for online learning and stochastic optimization the journal of machine learning research nicolo yishay mansour and gilles stoltz improved bounds for prediction with expert advice machine learning kamalika chaudhuri yoav freund and daniel hsu hedging algorithm in advances in neural information processing systems pages brendan mcmahan and francesco orabona unconstrained online linear learning in hilbert spaces minimax algorithms and normal approximations proceedings of the conference on learning theory nicolo and lugosi prediction learning and games cambridge university press nathan srebro karthik sridharan and ambuj tewari smoothness low noise and fast rates in advances in neural information processing systems pages haipeng luo and robert schapire achieving all with no parameters adaptive normalhedge corr wouter koolen and tim van erven quantile methods for experts and combinatorial games in proceedings of the annual conference on learning theory colt pages thomas cover behavior of sequential predictors of binary sequences in in trans prague conference 
on information theory statistical decision functions random processes pages publishing house of the czechoslovak academy of sciences alexander rakhlin and karthik sridharan statistical learning theory and sequential prediction available at http iosif pinelis optimum bounds for the distributions of martingales in banach spaces the annals of probability alexander rakhlin karthik sridharan and ambuj tewari online learning beyond regret arxiv preprint alexander rakhlin karthik sridharan and ambuj tewari sequential complexities and uniform martingale laws of large numbers probability theory and related fields eyal michael kearns yishay mansour and jennifer wortman regret to the best regret to the average machine learning alexander rakhlin ohad shamir and karthik sridharan relax and randomize from value to algorithms advances in neural information processing systems pages 
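As a concrete companion to the experts-setting example above, in which q_t^r denotes the exponential-weights distribution with learning rate eta_r, here is a minimal sketch of exponential weights with a prior. It is only the basic building block: the paper's adaptive relaxation additionally mixes over several learning rates, and the function name and interface here are illustrative assumptions.

```python
import numpy as np


def exponential_weights_with_prior(losses, prior, eta):
    """Play q_t proportional to prior * exp(-eta * cumulative loss).

    losses: (n_rounds, n_experts) array of per-round expert losses in [0, 1].
    prior:  (n_experts,) prior distribution over experts (pi in the KL bound).
    eta:    a single learning rate; the adaptive strategy above mixes over several.
    """
    cum_loss = np.zeros(losses.shape[1])
    total = 0.0
    for loss_t in losses:
        w = prior * np.exp(-eta * cum_loss)
        q_t = w / w.sum()          # distribution played at round t
        total += q_t @ loss_t      # expected loss of the randomized forecaster
        cum_loss += loss_t
    return total, cum_loss
```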
universal catalyst for optimization hongzhou julien and zaid inria nyu abstract we introduce generic scheme for accelerating optimization methods in the sense of nesterov which builds upon new analysis of the accelerated proximal point algorithm our approach consists of minimizing convex objective by approximately solving sequence of auxiliary problems leading to faster convergence this strategy applies to large class of algorithms including gradient descent block coordinate descent sag saga sdca svrg and their proximal variants for all of these methods we provide acceleration and explicit support for convex objectives in addition to theoretical we also show that acceleration is useful in practice especially for problems where we measure significant improvements introduction large number of machine learning and signal processing problems are formulated as the minimization of composite objective function rp minp where is convex and has lipschitz continuous derivatives with constant and is convex but may not be differentiable the variable represents model parameters and the role of is to ensure that the estimated parameters fit some observed data specifically is often large sum of functions fi and each term fi measures the fit between and data point indexed by the function in acts as regularizer it is typically chosen to be the squared which is smooth or to be penalty such as the or another norm composite minimization also encompasses constrained minimization if we consider indicator functions that may take the value outside of convex set and inside see our goal is to accelerate or methods that are designed to solve with particular focus on large sums of functions by accelerating we mean generalizing mechanism invented by nesterov that improves the convergence rate of the gradient descent algorithm more precisely when gradient descent steps produce iterates xk such that xk where denotes the minimum value of furthermore when the objective is strongly convex with constant the rate of convergence becomes linear in these rates were shown by nesterov to be suboptimal for thepclass of methods and instead optimal for the convex case and for the convex be obtained by taking gradient steps at points later this acceleration technique was extended to deal with regularization functions for modern machine learning problems involving large sum of functions recent effort has been devoted to developing fast incremental algorithms that can exploit the particular structure of unlike pn full gradient approaches which require computing and averaging gradients at every iteration incremental techniques have cost that is independent of the price to pay is the need to store moderate amount of information regarding past iterates but the benefit is significant in terms of computational complexity main contributions our main achievement is generic acceleration scheme that applies to large class of optimization methods by analogy with substances that increase chemical reaction rates we call our approach catalyst method may be accelerated if it has linear convergence rate for strongly convex problems this is the case for full gradient and block coordinate descent methods which already have accelerated variants more importantly it also applies to incremental algorithms such as sag saga sdca and svrg whether or not these methods could be accelerated was an important open question it was only known to be the case for dual coordinate ascent approaches such as sdca or sdpc for strongly convex objectives our work provides 
universal positive answer regardless of the strong convexity of the objective which brings us to our second achievement some approaches such as sdca or svrg are only defined for strongly convex objectives classical trick to apply them to general convex functions is to add small regularization the drawback of this strategy is that it requires choosing in advance the parameter which is related to the target accuracy consequence of our work is to automatically provide direct support for convex objectives thus removing the need of selecting beforehand other contribution proximal miso the approach which was proposed in and is an incremental technique for solving smooth unconstrained convex problems when is larger than constant with in in addition to providing acceleration and support for convex objectives we also make the following specific contributions we extend the method and its convergence proof to deal with the composite problem we fix the method to remove the big data condition the resulting algorithm can be interpreted as variant of proximal sdca with different step size and more practical optimality is checking the optimality condition does not require evaluating dual objective our construction is indeed purely primal neither our proof of convergence nor the algorithm use duality while sdca is originally dual ascent technique related work the catalyst acceleration can be interpreted as variant of the proximal point algorithm which is central concept in convex optimization underlying augmented lagrangian approaches and composite minimization schemes the proximal point algorithm consists of solving by minimizing sequence of auxiliary problems involving quadratic regularization term in general these auxiliary problems can not be solved with perfect accuracy and several notations of inexactness were proposed including the catalyst approach hinges upon an acceleration technique for the proximal point algorithm originally introduced in the pioneer work ii more practical inexactness criterion than those proposed in the as result we are able to control the rate of convergence for approximately solving the auxiliary problems with an optimization method in turn we are also able to obtain the computational complexity of the global procedure for solving which was not possible with previous analysis when instantiated in different optimization settings our analysis yields systematic acceleration beyond several works have inspired this paper in particular accelerated sdca is an instance of an inexact accelerated proximal point algorithm even though this was not explicitly stated in their proof of convergence relies on different tools than ours specifically we use the concept of estimate sequence from nesterov whereas the direct proof of in the context of sdca does not extend to convex objectives nevertheless part of their analysis proves to be helpful to obtain our main results another useful methodological contribution was the convergence analysis of inexact proximal gradient methods of finally similar ideas appear in the independent work their results overlap in part with ours but both papers adopt different directions our analysis is for instance more general and provides support for convex objectives another independent work with related results is which introduce an accelerated method for the minimization of finite sums which is not based on the proximal point algorithm note that our inexact criterion was also studied among others in but the analysis of led to the conjecture that this criterion 
was too weak to warrant acceleration our analysis refutes this conjecture the catalyst acceleration we present here our generic acceleration scheme which can operate on any or gradientbased optimization algorithm with linear convergence rate for strongly convex objectives linear convergence and acceleration consider the problem with convex function where the strong convexity is defined with respect to the minimization algorithm generating the sequence of iterates xk has linear convergence rate if there exists τm in and constant cm in such that xk cm τm where denotes the minimum value of the quantity τm controls the convergence rate the larger is τm the faster is convergence to however for given algorithm the quantity τm depends usually on the ratio which is often called the condition number of the catalyst acceleration is general approach that allows to wrap algorithm into an accelerated algorithm which enjoys faster linear convergence rate with τa τm as we will also see the catalyst acceleration may also be useful when is not strongly is when in that case we may even consider method that requires strong convexity to operate and obtain an accelerated algorithm that can minimize with convergence rate our approach can accelerate wide range of optimization algorithms starting from classical gradient descent it also applies to randomized algorithms such as sag saga sdca svrg and whose rates of convergence are given in expectation such methods should be contrasted with stochastic gradient methods which minimize different function acceleration of stochastic gradient methods is beyond the scope of this work catalyst action we now highlight the mechanics of the catalyst algorithm which is presented in algorithm it consists of replacing at iteration the original objective function by an auxiliary objective gk close to up to quadratic term gk kx where will be specified later and yk is obtained by an extrapolation step described in then at iteration the accelerated algorithm minimizes gk up to accuracy εk substituting to has two consequences on the one hand minimizing only provides an approximation of the solution of unless on the other hand the auxiliary objective gk enjoys better condition number than the original objective which makes it easier to minimize for instance when is the regular gradient descent algorithm with has the rate of convergence for minimizing with τm however owing to the additional quadratic term gk can be minimized by with the rate where τm gk τm in practice there exists an optimal choice for which controls the time required by for solving the auxiliary problems and the quality of approximation of by the functions gk this choice will be driven by the convergence analysis in sec see also sec for special cases acceleration via extrapolation and inexact minimization similar to the classical gradient descent scheme of nesterov algorithm involves an extrapolation step as consequence the solution of the auxiliary problem at iteration is driven towards the extrapolated variable yk as shown in this step is in fact sufficient to reduce the number of iterations of algorithm to solve when εk is for running the exact accelerated proximal point algorithm nevertheless to control the total computational complexity of an accelerated algorithm it is necessary to take into account the complexity of solving the auxiliary problems using this is where our approach differs from the classical proximal point algorithm of essentially both algorithms are the same but we use the weaker inexactness 
criterion gk xk εk where the sequence εk is fixed beforehand and only depends on the initial point this subtle difference has important consequences in practice this condition can often be checked by computing duality gaps ii in theory the methods we consider have linear convergence rates which allows us to control the complexity of step and then to provide the computational complexity of in this paper we use the notation to hide constants the notation also hides logarithmic factors algorithm catalyst input initial estimate rp parameters and sequence εk optimization method initialize and while the desired stopping criterion is not satisfied do find an approximate solution of the following problem using xk arg min gk kx such that gk xk εk compute αk from equation αk qαk compute yk xk βk xk with βk αk end while output xk final estimate convergence analysis in this section we present the theoretical properties of algorithm for optimization methods with deterministic convergence rates of the form when the rate is given as an expectation simple extension of our analysis described in section is needed for space limitation reasons we shall sketch the proof mechanics here and defer the full proofs to appendix analysis for convex objective functions we first analyze the convergence rate of algorithm for solving problem regardless of the complexity required to solve the subproblems we start with the convex case theorem convergence of algorithm convex case choose with and εk with then algorithm generates iterates xk such that xk with this theorem characterizes the linear convergence rate of algorithm it is worth noting that the choice of is the discretion of the user but it can safely be set to in practice the choice was made for convenience purposes since it leads to simplified analysis but larger values are also acceptable both from theoretical and practical point of views following an advice from nesterov page originally dedicated to his classical gradient descent algorithm we may for instance recommend choosing such that the choice of the sequence εk is also subject to discussion since the quantity is unknown beforehand nevertheless an upper bound may be used instead which will only affects the corresponding constant in such upper bounds can typically be obtained by computing duality gap at or by using additional knowledge about the objective for instance when is we may simply choose εk the proof of convergence uses the concept of estimate sequence invented by nesterov and introduces an extension to deal with the errors εk to control the accumulation of errors we borrow the methodology of for inexact proximal gradient algorithms our construction yields convergence result that encompasses both strongly convex and convex cases note that estimate sequences were also used in but as noted by the proof of only applies when using an extrapolation step that involves the true minimizer of which is unknown in practice to obtain rigorous convergence result like different approach was needed theorem is important but it does not provide yet the global computational complexity of the full algorithm which includes the number of iterations performed by for approximately solving the auxiliary problems the next proposition characterizes the complexity of this proposition complexity convex case under the assumptions of theorem let us consider method generating iterates zt for minimizing the function gk with linear convergence rate of the form gk zt τm gk when the precision εk is reached with number of iterations tm 
where the notation hides some universal constants and some logarithmic dependencies in and this proposition is generic since the assumption is relatively standard for methods it may now be used to obtain the global rate of convergence of an accelerated algorithm by calling fs the objective function value obtained after performing ktm iterations of the method the true convergence rate of the accelerated algorithm is fs tm tm as result algorithm has global linear rate of convergence with parameter τa τm where τm typically depends on the greater the faster is consequently will be chosen to maximize the ratio τm note that for other algorithms that do not satisfy additional analysis and possibly different initialization may be necessary see appendix for example convergence analysis for convex but convex objective functions we now state the convergence rate when the objective is not strongly convex that is when theorem algorithm convex but convex case when choose and εk with then algorithm generates iterates xk such that xk this theorem is the of theorem when the choice of is left to the discretion of the user it empirically seem to have very low influence on the global convergence speed as long as it is chosen small enough we use in practice it shows that algorithm achieves the optimal rate of convergence of methods but it does not take into account the complexity of solving the subproblems therefore we need the following proposition proposition complexity convex case assume that has bounded level sets under the assumptions of theorem let us consider method generating iterates zt for minimizing the function gk with linear convergence rate of the form then there exists tm such that for any solving gk with initial point requires at most tm log iterations of we can now draw up the global complexity of an accelerated algorithm when has linear convergence rate for convex objectives to produce xk is called at most ktm log times using the global iteration counter ktm log we get fs if is method this rate is up to logarithmic factor when compared to the optimal rate which may be the price to pay for using generic acceleration scheme acceleration in practice we show here how to accelerate existing algorithms and compare the convergence rates obtained before and after catalyst acceleration for all the algorithms we consider we study rates of convergence in terms of total number of iterations in expectation when necessary to reach accuracy we first show how to accelerate full gradient and randomized coordinate descent algorithms then we discuss other approaches such as sag saga or svrg finally we present new proximal version of the incremental gradient approaches along with its accelerated version table summarizes the acceleration obtained for the algorithms considered deriving the global rate of convergence the convergence rate of an accelerated algorithm is driven by the parameter in the strongly convex case the best choice is the one that maximizes the ratio τm gk as discussed in appendix this rule also holds when is given in expectation and in many cases where the constant cm gk is different than gk from when the choice of only affects the complexity by multiplicative constant rule of thumb is to maximize the ratio τm gk see appendix for more details after choosing the global is given by comp kin kout where kin is an upperbound on the number of iterations performed by per and kout is the on the number of iterations following from theorems note that for simplicity we always consider that such that we 
may write simply as in the convergence rates acceleration of existing algorithms composite minimization most of the algorithms we consider here namely the proximal gradient method saga prox can handle composite objectives with regularization penalty that admits proximal operator proxψ defined for any as proxψ arg min ky table presents convergence rates that are valid for proximal and settings since most methods we consider are able to deal with such penalties the exception is sag for which proximal variants are not analyzed the incremental method has also been limited to settings so far in section we actually introduce the extension of miso to composite minimization and establish its theoretical convergence rates full gradient method first illustration is the algorithm obtained when accelerating the regular full gradient descent fg and how it contrasts with nesterov accelerated variant afg here the optimal choice for is in the strongly convex case we get an accelerated rate of convergence in log which is the same as afg up to logarithmic terms similar result can also be obtained for randomized coordinate descent methods randomized incremental gradient we now consider randomized incremental gradient methods resp sag and saga when we focus on the setting where these methods have the complexity log otherwise their complexity becomes log which is independent of the condition number and seems theoretically optimal for these methods the best choice for has the form with for sag for saga similar formula with constant in place of holds for svrg we omit it here for brevity sdca and are actually related to incremental gradient methods and the choice for has similar form with proximal miso and its acceleration was proposed in and for solving the problem when and when is sum of convex functions fi as in which are also differentiable with derivatives the algorithm maintains list of quadratic lower dki at iteration of the functions fi and randomly updates one of them at each iteration by using fg comp log sag saga log lε catalyst log nl log log catalyst not avail sdca svrg comp lµ log log nl log no acceleration not avail table comparison of rates of convergence before and after the catalyst acceleration resp in the and non cases to simplify we only present the case where when for all incremental algorithms there is indeed no acceleration otherwise the quantity for svrg is the average lipschitz constant of the functions fi see inequalities the current iterate xk is then obtained by minimizing the of the objective xk arg min dk interestingly since dk is of we also have dk xk and thus the quantity xk dk xk can be used as an optimality certificate that xk furthermore this certificate was shown to converge to zero with rate similar to under the condition in this section we show how to remove this condition and how to provide support to functions whose proximal operator can be easily computed we shall briefly sketch the main ideas and we refer to appendix for thorough presentation the first idea to deal with nonsmooth regularizer is to change the definition of dk dk which was also proposed in without convergence proof then because the dki are quadratic functions the minimizer xk of dk can be obtained by computing the proximal operator of at particular point the second idea to remove the condition is to modify the update of the lower bounds dki assume that index ik is selected among at iteration then fi kx if ik di otherwise whereas the original uses our new variant uses min the resulting algorithm turns out 
to be very close to variant of proximal sdca which corresponds to using different value for the main difference between sdca and misoprox is that the latter does not use duality it also provides different simpler optimality certificate xk dk xk which is guaranteed to converge linearly as stated in the next theorem theorem convergence of let xk be obtained by then xk with min furthermore we also have fast convergence of the certificate xk dk xk the proof of convergence is given in appendix finally we conclude this section by noting that enjoys the catalyst acceleration leading to the presented in table since the convergence rate does not have exactly the same form as propositions and can not be used and additional analysis given in appendix is needed practical forms of the algorithm are also presented there along with discussions on how to initialize it experiments we evaluate the catalyst acceleration on three methods that have never been accelerated in the past sag saga and we focus on logistic regression where the regularization parameter yields lower bound on the strong convexity parameter of the problem we use three datasets used in namely and ocr which are relatively large with up to points for ocr and variables for we consider three regimes no regularization and which leads significantly larger condition numbers than those used in other studies in we compare miso sag and saga with their default parameters which are recommended by their theoretical analysis for sag and for saga and study several accelerated variants the values of and and the sequences εk are those suggested in the previous sections with in other implementation details are presented in appendix the restarting strategy for is key to achieve acceleration in practice all of the methods we compare store gradients evaluated at previous iterates of the algorithm we always use the gradients from the previous run of to initialize new one we detail in appendix the initialization for each method finally we evaluated heuristic that constrain to always perform at most iterations one pass over the data we call this variant for miso whereas refers to the regular vanilla accelerated variant and we also use this heuristic to accelerate sag the results are reported in table we always obtain huge for miso which suffers from numerical stability issues when the condition number is very large for instance for ocr here not only does the catalyst algorithm accelerate miso but it also stabilizes it whereas miso is slower than sag and saga in this small regime is almost systematically the best performer we are also able to accelerate sag and saga in general even though the improvement is less significant than for miso in particular saga without acceleration proves to be the best method on ocr one reason may be its ability to adapt to the unknown strong convexity parameter of the objective near the solution when we indeed obtain regime where acceleration does not occur see sec therefore this experiment suggests that adaptivity to unknown strong convexity is of high interest for incremental optimization relative duality gap relative duality gap objective function passes dataset relative duality gap relative duality gap objective function passes dataset passes dataset ocr passes dataset passes dataset relative duality gap relative duality gap objective function passes dataset passes dataset miso sag asag saga asaga passes dataset ocr passes dataset ocr figure objective function value or duality gap for different number of passes performed over 
each dataset the legend for all curves is on the top right amiso asaga asag refer to the accelerated variants of miso saga and sag respectively acknowledgments this work was supported by anr macaron joint centre program titan and nyu data science environment references agarwal and bottou lower bound for the optimization of finite sums in proc international conference on machine learning icml bach jenatton mairal and obozinski optimization with penalties foundations and trends in machine learning bauschke and combettes convex analysis and monotone operator theory in hilbert spaces springer beck and teboulle fast iterative algorithm for linear inverse problems siam journal on imaging sciences bertsekas convex optimization algorithms athena scientific defazio bach and saga fast incremental gradient method with support for convex composite objectives in adv neural information processing systems nips defazio caetano and domke finito faster permutable incremental gradient method for big data problems in proc international conference on machine learning icml frostig ge kakade and sidford approximate proximal point algorithms for empirical risk minimization in proc international conference on machine learning icml new proximal point algorithms for convex minimization siam journal on optimization he and yuan an accelerated inexact proximal point algorithm for convex minimization journal of optimization theory and applications and convex analysis and minimization algorithms springer juditsky and nemirovski first order methods for nonsmooth convex optimization optimization for machine learning mit press lan an optimal randomized incremental gradient method mairal incremental optimization with application to machine learning siam journal on optimization nemirovski juditsky lan and shapiro robust stochastic approximation approach to stochastic programming siam journal on optimization nesterov method of solving convex programming problem with convergence rate soviet mathematics doklady nesterov introductory lectures on convex optimization basic course springer nesterov efficiency of coordinate descent methods on optimization problems siam journal on optimization nesterov gradient methods for minimizing composite functions mathematical programming parikh and boyd proximal algorithms foundations and trends in optimization and iteration complexity of randomized descent methods for minimizing composite function mathematical programming salzo and villa inexact and accelerated proximal point algorithms journal of convex analysis schmidt le roux and bach convergence rates of inexact methods for convex optimization in adv neural information processing systems nips schmidt le roux and bach minimizing finite sums with the stochastic average gradient and zhang proximal stochastic dual coordinate ascent and zhang accelerated proximal stochastic dual coordinate ascent for regularized loss minimization mathematical programming xiao and zhang proximal stochastic gradient method with progressive variance reduction siam journal on optimization zhang and xiao stochastic coordinate method for regularized empirical risk minimization in proc international conference on machine learning icml 
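To make the Catalyst scheme concrete, here is a minimal sketch of the outer loop of Algorithm 1 above. The inner method M is abstracted as a function handle whose name and signature are assumptions for illustration; the alpha and beta recursions are filled in following the standard accelerated proximal-point form that the algorithm statement refers to, and the initialization of alpha_0 is one common choice rather than the paper's exact prescription.

```python
import numpy as np


def catalyst(inner_solver, x0, kappa, mu, n_outer, eps_schedule):
    """Sketch of the Catalyst outer loop.

    inner_solver(y, eps, warm_start) is assumed to run the method M on
    g(x) = f(x) + (kappa / 2) * ||x - y||^2 until accuracy eps and return x.
    """
    q = mu / (mu + kappa)
    x_prev = np.array(x0, dtype=float)
    y = x_prev.copy()
    alpha_prev = np.sqrt(q) if mu > 0 else 1.0  # one common initialization
    for k in range(n_outer):
        # Step 1: approximately solve the kappa-regularized auxiliary problem,
        # warm-started at the previous iterate.
        x = inner_solver(y, eps_schedule[k], warm_start=x_prev)
        # Step 2: alpha_k solves alpha^2 = (1 - alpha) * alpha_prev^2 + q * alpha.
        b = alpha_prev ** 2 - q
        alpha = (-b + np.sqrt(b ** 2 + 4.0 * alpha_prev ** 2)) / 2.0
        # Step 3: Nesterov-style extrapolation.
        beta = alpha_prev * (1.0 - alpha_prev) / (alpha_prev ** 2 + alpha)
        y = x + beta * (x - x_prev)
        x_prev, alpha_prev = x, alpha
    return x_prev
```

In practice, kappa is set by the rule of thumb discussed in the acceleration-in-practice section, and eps_schedule is the decreasing sequence of accuracies epsilon_k prescribed by the convergence theorems.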
inference for determinantal point processes without spectral knowledge cnrs cristal umr univ lille france michalis department of informatics athens univ of economics and business greece mtitsias both authors contributed equally to this work abstract determinantal point processes dpps are point process models that naturally encode diversity between the points of given realization through positive definite kernel dpps possess desirable properties such as exact sampling or analyticity of the moments but learning the parameters of kernel through inference is not straightforward first the kernel that appears in the likelihood is not but another kernel related to through an often intractable spectral decomposition this issue is typically bypassed in machine learning by directly parametrizing the kernel at the price of some interpretability of the model parameters we follow this approach here second the likelihood has an intractable normalizing constant which takes the form of large determinant in the case of dpp over finite set of objects and the form of fredholm determinant in the case of dpp over continuous domain our main contribution is to derive bounds on the likelihood of dpp both for finite and continuous domains unlike previous work our bounds are cheap to evaluate since they do not rely on approximating the spectrum of large matrix or an operator through usual arguments these bounds thus yield cheap variational inference and moderately expensive exact markov chain monte carlo inference methods for dpps introduction determinantal point processes dpps are point processes that encode repulsiveness using algebraic arguments they first appeared in and have since then received much attention as they arise in many fields random matrix theory combinatorics quantum physics we refer the reader to for detailed tutorial reviews respectively aimed at audiences of machine learners statisticians and probabilists more recently dpps have been considered as modelling tool see dpps appear to be natural alternative to poisson processes when realizations should exhibit repulsiveness in for example dpps are used to model diversity among summary timelines in large news corpus in dpps model diversity among the results of search engine for given query in dpps model the spatial repartition of trees in forest as similar trees compete for nutrients in the ground and thus tend to grow away from each other with these modelling applications comes the question of learning dpp from data either through parametrized form or we focus in this paper on parametric inference similarly to the correlation between the function values in gaussian process gps the repulsiveness in dpp is defined through kernel which measures how much two points in realization repel each other the likelihood of dpp involves the evaluation and the spectral decomposition of an operator defined through kernel that is related to there are two main issues that arise when performing inference for dpp first the likelihood involves evaluating the kernel while it is more natural to parametrize instead and there is no easy link between the parameters of these two kernels the second issue is that the spectral decomposition of the operator required in the likelihood evaluation is rarely available in practice for computational or analytical reasons for example in the case of large finite set of objects as in the news corpus application evaluating the likelihood once requires the eigendecomposition of large matrix similarly in the case of continuous domain as for 
the forest application the spectral decomposition of the operator may not be analytically tractable for nontrivial choices of kernel in this paper we focus on the second issue we provide likelihoodbased inference methods that assume the kernel is parametrized but that do not require any eigendecomposition unlike more specifically our main contribution is to provide bounds on the likelihood of dpp that do not depend on the spectral decomposition of the operator for the finite case we draw inspiration from bounds used for variational inference of gps and we extend these bounds to dpps over continuous domains for ease of presentation we first consider dpps over finite sets of objects in section and we derive bounds on the likelihood in section we plug these bounds into known inference paradigms variational inference and markov chain monte carlo inference in section we extend our results to the case of dpp over continuous domain readers who are only interested in the finite case or who are unfamiliar with operator theory can safely skip section without missing our main points in section we experimentally validate our results before discussing their breadth in section dpps over finite sets definition and likelihood consider discrete set of items xn where xi rd is vector of attributes that describes item let be symmetric positive definite kernel on rd and let xi xj be the gram matrix of the dpp of kernel is defined as the probability distribution over all possible subsets such that det ka where ka denotes the of indexed by the elements of this distribution exists and is unique if and only if the eigenvalues of are in intuitively we can think of as encoding the amount of negative correlation or repulsiveness between and indeed as remarked in first yields that diagonal elements of are marginal probabilities xi kii equation then entails that xi and xj are likely to in realization of if and only if det xi xj xi xi yi yi xi xj xi xj kij is large terms in indicate whether points tend to providing the eigenvalues of are further restricted to be in the dpp of kernel has likelihood more specifically writing for realization of det det where is the identity matrix and denotes the of indexed by the elements of now given realization we would like to infer the parameters of kernel say the parameters θk ak σk of squared exponential kernel kx ak exp since the trace of is the expected number of points in one can estimate ak by the number of points in the data divided by but σk the repulsive lengthscale has to be fitted if the number of items is large methods such as maximum likelihood are too costly each evaluation of requires storage and time furthermore valid choices of θk are constrained since one needs to make sure the eigenvalues of remain in partial is to note that given any symmetric positive definite kernel the likelihood with matrix xi xj corresponds to valid choice of since the corresponding matrix necessarily has eigenvalues in which makes sure the dpp exists the consists in directly parametrizing and inferring the kernel instead of so that the numerator of is cheap to evaluate and parameters are less constrained note that this step favours tractability over interpretability of the inferred parameters if we assume to take the squared exponential form instead of with parameters al and σl the number of points and the repulsiveness of the points in do not decouple as nicely for example the expected number of items in depends on al and σl now and both parameters also significantly affect repulsiveness 
there is some work investigating approximations to to retain the more interpretable parametrization but the machine learning literature almost exclusively adopts the more tractable parametrization of in this paper we also make this choice of parametrizing directly now the computational bottleneck in the evaluation of is computing det while this still prevents the application of maximum likelihood bounds on this determinant can be used in variational approach or an mcmc algorithm in bounds on det are proposed requiring only the first eigenvalues of where is chosen adaptively at each mcmc iteration to make the acceptance decision possible this still requires applying power iteration methods which are limited to finite domains require storing the whole matrix and are prohibitively slow when the number of required eigenvalues is large nonspectral bounds on the likelihood let us denote by lab the submatrix of where row indices correspond to the elements of and column indices to those of when we simply write la for laa and we drop the subscript when drawing inspiration from sparse approximations to gaussian processes using inducing variables we let zm be an arbitrary set of points in rd and we approximate by lyz lz lzy note that we do not constrain to belong to so that our bounds do not rely on approximation we term or inducing inputs proposition det det det the proof relies on nontrivial inequality on determinants theorem and is provided in the supplementary material learning dpp using bounds in this section we explain how to run variational inference and markov chain monte carlo methods using the bounds in proposition in this section we also make connections with variational sparse gaussian processes more explicit variational inference the lower bound in proposition can be used for variational inference assume we have point process realizations yt and we fit dpp with kernel lθ the log likelihood can be expressed using log det lyt log det let be an arbitrary set of points in rd proposition then yields lower bound log det lyt log det tr the lower bound can be computed efficiently in time which is considerably lower than power iteration in if instead of maximizing we thus maximize jointly the kernel parameters and the variational parameters to maximize one can implement an scheme alternately optimizing in and kernels are often differentiable with respect to and sometimes will also be differentiable with respect to the so that methods can help in the general case optimizers such as can also be employed markov chain monte carlo inference if approximate inference is not suitable we can use the bounds in proposition to build more expensive markov chain monte carlo sampler given prior distribution on the parameters of bayesian inference relies on the posterior distribution exp where the log likelihood is defined in standard approach to sample approximately from is the algorithm mh chapter mh consists in building an ergodic markov chain of invariant distribution given proposal the mh algorithm starts its chain at then at iteration it proposes candidate state and sets to with probability θk θk min θk while is otherwise set to θk the core of the algorithm is thus to draw bernoulli variable with parameter for rd this is typically implemented by drawing uniform and checking whether in our dpp application we can not evaluate but we can use proposition to build lower and an upper bound which can be arbitrarily refined by increasing the cardinality of and optimizing over we can thus build lower and upper 
bound for log log log now another way to draw bernoulli variable with parameter is to first draw and then refine the bounds in by augmenting the numbers of inducing variables and optimizing over log is out of the interval formed by the bounds in then one can decide whether this bernoulli trick is sometimes named retrospective sampling and has been suggested as early as it has been used within mh for inference on dpps with spectral bounds in we simply adapt it to our bounds the case of continuous dpps dpps can be defined over very general spaces we limit ourselves here to point processes on rd such that one can extend the notion of likelihood in particular we define here dpp on as in example by defining its janossy density for definitions of traces and determinants of operators we follow section vii note that this necessarily happens under fairly weak assumptions saying that the upper and lower bounds in match when goes to infinity is saying that the integral of the posterior variance of gaussian process with no evaluation error goes to zero as we add more distinct training points definition let be measure on rd rd that is continuous with respect to the lebesgue measure with density let be symmetric positive definite kernel defines operator on through dµ assume is and tr dµ we assume to avoid technicalities proving can be done by requiring various assumptions on and under the assumptions of mercer theorem for instance will be satisfied section vii theorem more generally the assumptions of theorem apply to kernels over noncompact domains in particular the gaussian kernel with gaussian base measure that is often used in practice we denote by λi the eigenvalues of the compact operator there exists example point process on rd such that det xi xj there are particles one in each of xn the infinitesimal balls xi dxi det where is the open ball of center and radius and where det λi is the fredholm determinant of operator section vii such process is called the determinantal point process associated to kernel and base measure equation is the continuous equivalent of our bounds require to be computable this is the case for the popular gaussian kernel with gaussian base measure nonspectral bounds on the likelihood in this section we derive bounds on the likelihood that do not require to compute the fredholm determinant det proposition let zm rd then det lz det lz dµ lz det lz det det lz where lz zi zj and ψij zi zj dµ as for proposition the proof relies on nontrivial inequality on determinants theorem and is provided in the supplementary material we also detail in the supplementary material why is the continuous equivalent to experiments toy gaussian continuous experiment in this section we consider dpp on so that the bounds derived in section apply as in section we take the base measure to be proportional to gaussian its density is κn we consider squared exponential kernel exp kx yk in this particular case the spectral decomposition of operator is known the eigenfunctions of are scaled hermite polynomials while the eigenvalues are geometrically decreasing sequence this example is interesting for two reasons first the spectral decomposition of is known so that we can sample exactly from the corresponding dpp and thus generate synthetic datasets second the fredholm determinant det in this special case is symbol and for which all points in realization are distinct there is notion of kernel for general dpps but we define directly here for the sake of simplicity the interpretability issues of using instead 
of are the same as for the finite case see sections and we follow the parametrization of for ease of reference can thus be efficiently computed see the sympy library in python this allows for comparison with ideal methods to check the validity of our mcmc sampler for instance we emphasize that these special properties are not needed for the inference methods in section they are simply useful to demonstrate their correctness we sample synthetic dataset using resulting in points shown in red in figure applying the variational inference method of section jointly optimizing in and using the optimizer yields poorly consistent results varies over several orders of magnitude from one run to the other and relative errors for and go up to not shown we thus investigate the identifiability of the parameters with the retrospective mh of section to limit the range of we choose for log log log wide uniform prior over we use gaussian proposal the covariance matrix of which is adapted so as to reach of acceptance we start each iteration with and increase it by and when the acceptance decision can not be made most iterations could be made with and the maximum number of inducing inputs required in our run was we show the results of run of length in figure removing sample of size we show the resulting marginal histograms in figures and retrospective mh and the ideal mh agree the prior pdf is in green the posterior marginals of and are centered around the values used for simulation and are very different from the prior showing that the likelihood contains information about and however as expected almost nothing is learnt about as posterior and prior roughly coincide this is an example of the issues that come with parametrizing directly as mentioned in section it is also an example when mcmc is preferable to the variational approach in section note that this can be detected through the variability of the results of the variational approach across independent runs to conclude we show set of optimized in black in figure we also superimpose the marginal of any single point in the realization which is available through the spectral decomposition of here in this particular case this marginal is gaussian interestingly the look like evenly spread samples from this marginal intuitively they are likely to make the denominator in the likelihood small as they represent an ideal sample of the dpp diabetic neuropathy dataset here we consider real dataset of spatial patterns of nerve fibers in diabetic patients these fibers become more clustered as diabetes progresses the dataset consists of samples collected from diabetic patients at different stages of diabetic neuropathy and one healthy subject we follow the experimental setup used in and we split the total samples into two classes diabetic and diabetic the first class contains three samples and the second one the remaining four figure displays the point process data which contain on average points per sample in the class and for the class we investigate the differences between these classes by fitting separate dpp to each class and then quantify the differences of the repulsion or overdispersion of the point process data through the inferred kernel parameters paraphrasing we consider continuous dpp on with kernel function xi xj xi xj exp and base measure proportional to gaussian xd as in we quantify the overdispersion of realizations of such dpp through the quantities γd σd which are invariant to the scaling of note however that strictly speaking also mildly influences 
repulsion we investigate the ability of the variational method in section to perform approximate maximum likelihood training over the kernel parameters specifically we wish to fit separate continuous dpp to each class by jointly maximizing the prior pdf ideal mh mh bounds prior pdf ideal mh mh bounds prior pdf ideal mh mh bounds figure results of adaptive in the continuous experiment of section figure shows data in red set of optimized in black for set to the value used in the generation of the synthetic dataset and the marginal of one point in the realization in blue figures and show marginal histograms of variational lower bound over and the inducing inputs using optimization given that the number of inducing variables determines the amount of the approximation or compression of the dpp model we examine different settings for this number and see whether the corresponding trained models provide similar estimates for the overdispersion measures thus we train the dpps under different approximations having inducing variables and display the estimated overdispersion measures in figures and these estimated measures converge to coherent values as increases they show clear separation between the two classes as also found in this also suggests tuning in practice by increasing it until inference results stop varying furthermore figures and show the values of the upper and lower bounds on the log likelihood which as expected converge to the same limit as increases we point out that the overall optimization of the variational lower bound is relatively fast in our matlab implementation for instance it takes minutes for the most expensive run where to perform iterations until convergence smaller values of yield significantly smaller times finally as in section we comment on the optimized figure displays the inducing points at the end of converged run of variational inference for various values of similarly to figure these are placed in remarkably neat grids and depart significantly from their initial locations discussion we have proposed novel nonspectral bounds on the determinants arising in the likelihoods of dpps both finite and continuous we have shown how to use these bounds to infer the parameters of dpp and demonstrated their use for mcmc and variational inference in particular these bounds have some degree of freedom the which we optimize so as to tighten the bounds this optimization step is crucial for inference of parametric dpp models where bounds have to adapt to the point where the likelihood is evaluated to yield normal figure six out of the seven nerve fiber samples the first three samples from left to right correspond to normal subject and two mildly diabetic subjects respectively the remaining three samples correspond to moderately diabetic subject and two severely diabetic subjects bounds bounds ratio ratio number of inducing variables number of inducing variables number of inducing variables number of inducing variables figure figures and show the evolution of the estimated overdispersion measures and as functions of the number of inducing variables used the dotted black lines correspond to the diabetic class while the solid lines to the diabetic class figure shows the upper bound red and the lower bound blue on the log likelihood as functions of the number of inducing variables for the diabetic class while the diabetic case is shown in figure figure we illustrate the optimization over the inducing inputs for several values of in the dpp of section we consider the diabetic class the 
panels in the top row show the initial inducing input locations for various values of while the corresponding panels in the bottom row show the optimized locations decisions which are consistent with the ideal underlying algorithms in future work we plan to investigate connections of our bounds with the bounds for fredholm determinants of we also plan to consider variants of dpps that condition on the number of points in the realization to put joint priors over the distributions of the features in classification problems in manner related to in the long term we will investigate connections between kernels and that could be made without spectral knowledge to address the issue of replacing by acknowledgments we would like to thank adrien hardy for useful discussions and emily fox for kindly providing access to the diabetic neuropathy dataset rb was funded by research fellowship through the science programme funded by epsrc grant number and by anr project references daley and an introduction to the theory of point processes springer edition macchi the coincidence approach to stochastic point processes advances in applied probability kulesza and taskar determinantal point processes for machine learning foundations and trends in machine learning lavancier møller and rubak determinantal point process models and statistical inference journal of the royal statistical society hough krishnapur peres and determinantal processes and independence probability surveys zou and adams priors for diversity in generative latent variable models in advances in neural information processing systems nips affandi fox adams and taskar learning the parameters of determinantal point processes in proceedings of the international conference on machine learning icml gillenwater kulesza fox and taskar for learning determinantal point processes in advances in neural information proccessing systems nips mariet and sra algorithms for learning determinantal point processes in advances in neural information systems nips rasmussen and williams gaussian processes for machine learning mit press michalis titsias variational learning of inducing variables in sparse gaussian processes in aistats volume cristianini and kernel methods for pattern recognition cambridge university press affandi kulesza fox and taskar approximation for largescale determinantal processes in proceedings of the conference on artificial intelligence and statistics aistats seiler and simon an inequality among determinants proceedings of the national academy of sciences hansen the cma evolution strategy comparing review in towards new evolutionary computation advances on estimation of distribution algorithms springer robert and casella monte carlo statistical methods springer devroye random variate generation gohberg goldberg and kaashoek classes of linear operators volume springer simon trace ideals and their applications american mathematical society edition fasshauer and mccourt stable evaluation of gaussian radial basis function interpolants siam journal on scientific computing haario saksman and tamminen an adaptive metropolis algorithm bernoulli waller olsbo panoutsopoulou kennedy and spatial analysis of epidermal nerve fibers statistics in medicine bornemann on the numerical evaluation of fredholm determinants mathematics of computation 
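To make the finite-case bound concrete, the short sketch below builds a Nyström-style approximation Q = L_XZ L_Z^{-1} L_ZX from m inducing inputs Z and checks numerically the sandwich det(I + Q) ≤ det(I + L) ≤ det(I + Q) exp(tr(L − Q)), which is what the variational lower bound on the DPP log-likelihood discussed above relies on. This is a minimal illustration under assumed choices (squared-exponential kernel, evenly spaced inducing inputs, jitter constant), not the authors' implementation; Q is formed explicitly only for readability even though the bound itself only needs m × m matrices.

```python
# Minimal numerical check of the non-spectral sandwich
#   det(I + Q) <= det(I + L) <= det(I + Q) * exp(tr(L - Q)),
# with Q = L_XZ L_Z^{-1} L_ZX a Nystrom-style approximation built from inducing inputs Z.
# Kernel, data and inducing-input choices below are illustrative assumptions.
import numpy as np

def se_kernel(A, B, a=1.0, s=0.5):
    """Squared-exponential kernel a * exp(-||x - y||^2 / (2 s^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return a * np.exp(-0.5 * d2 / s ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))        # the N items
Z = np.linspace(-1.0, 1.0, 10)[:, None]          # m = 10 inducing inputs (a hypothetical grid)

L   = se_kernel(X, X)
Lxz = se_kernel(X, Z)
Lz  = se_kernel(Z, Z) + 1e-10 * np.eye(len(Z))   # small jitter for numerical stability
Q   = Lxz @ np.linalg.solve(Lz, Lxz.T)           # formed explicitly only for readability

def logdet_I_plus(M):
    return np.linalg.slogdet(np.eye(len(M)) + M)[1]

lower = logdet_I_plus(Q)
upper = logdet_I_plus(Q) + np.trace(L - Q)
exact = logdet_I_plus(L)
print(f"lower {lower:.3f} <= exact {exact:.3f} <= upper {upper:.3f}")

# The resulting log-likelihood lower bound for a realization Y reads
#   log det L_Y - logdet_I_plus(Q) - trace(L - Q),
# and both bound terms can be obtained from the m x m matrix L_Z + L_ZX L_XZ and the
# diagonal of L, without ever storing the full N x N matrix L.
```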
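Similarly, the retrospective-sampling acceptance step used in the Metropolis–Hastings sampler above can be written in a few lines: draw log u once, then tighten cheap lower and upper bounds on the log acceptance ratio (for instance by adding inducing inputs and re-optimizing them) until log u falls outside the bracket. The callback name bounds(m), the doubling schedule, and the fallback rule below are hypothetical choices made for illustration.

```python
# Retrospective Bernoulli trick for an MH acceptance decision made from bounds only.
# `bounds(m)` is a user-supplied (hypothetical) callback returning a pair
# (lower, upper) of bounds on the log acceptance ratio at refinement level m.
import math
import random

def retrospective_accept(bounds, m0=10, m_max=1000, grow=2):
    log_u = math.log(random.random())       # draw u ~ Uniform(0, 1) once
    lo, up = bounds(m0)                     # initial bracket (also guards the fallback below)
    m = m0
    while m <= m_max:
        lo, up = bounds(m)                  # refine, e.g. with more inducing inputs
        if log_u <= lo:
            return True                     # u <= alpha for sure: accept
        if log_u >= up:
            return False                    # u >= alpha for sure: reject
        m *= grow                           # bracket still too wide: refine further
    # Budget exhausted: fall back to the bracket midpoint (a pragmatic choice,
    # not part of the algorithm described above).
    return log_u <= 0.5 * (lo + up)
```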
Kullback-Leibler proximal variational inference. Mohammad Emtiyaz Khan, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Pierre, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Pascal Fua, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Fleuret, Idiap Research Institute, Martigny, Switzerland. Abstract: We propose a new variational inference method based on a proximal framework that uses the KL divergence as the proximal term. We make two contributions towards exploiting the geometry and structure of the variational bound. First, we propose a KL-proximal algorithm and show its equivalence to variational inference with natural gradients (e.g., stochastic variational inference). Second, we use the proximal framework to derive efficient variational algorithms for non-conjugate models: we propose a splitting procedure to separate non-conjugate terms from conjugate ones, and we linearize the non-conjugate terms to obtain subproblems that admit a closed-form solution. Overall, our approach converts inference in a non-conjugate model into subproblems that involve inference in conjugate models. We show that our method is applicable to a wide variety of models and can result in computationally efficient algorithms. Applications to several datasets show performance comparable to existing methods. Introduction: Variational methods are a popular alternative to Markov chain Monte Carlo (MCMC) methods for Bayesian inference, and they have been used extensively for their speed and ease of use. In particular, methods based on evidence lower-bound optimization (ELBO) are quite popular because they convert a difficult integration problem into an optimization problem; this reformulation enables the application of optimization techniques for Bayesian inference. Recently, an approach called stochastic variational inference (SVI) has gained popularity for inference in exponential-family models. SVI exploits the geometry of the posterior distribution by using natural gradients and uses a stochastic method to improve scalability; the resulting updates are simple and easy to implement. Several generalizations of SVI have been proposed for general models where the lower bound might be intractable. These generalizations, although important, do not take the geometry of the posterior distribution into account. In addition, none of these approaches exploits the structure of the lower bound: in practice, not all factors of the joint distribution introduce difficulty in the optimization, and it is therefore desirable to treat difficult terms differently from easy terms. (Note on contributions: Pierre proposed the use of the KL proximal term and showed that the resulting proximal steps have closed-form solutions; the rest of the work was carried out by Khan.) In this context, we propose a splitting method for variational inference that exploits both the structure and the geometry of the lower bound. Our approach is based on the proximal-gradient framework, and we make two important contributions. First, we propose a proximal-point algorithm that uses the KL divergence as the proximal term. We show that the addition of this term incorporates the geometry of the posterior distribution, and we establish the equivalence of our approach to variational methods that use natural gradients. Second, following the proximal-gradient framework, we propose a splitting approach for variational inference: we linearize the difficult non-conjugate terms such that the resulting optimization problem is easy to solve. We apply this approach to variational inference on non-conjugate models and show that linearizing the non-conjugate terms leads to subproblems that have closed-form solutions. Our approach therefore converts inference in a non-conjugate model into subproblems that involve inference in conjugate models and for which efficient implementations exist.
latent variable models and evidence optimization consider general model with data vector of length and the latent vector of length following joint distribution we drop the parameters of the distribution from the notation elbo approximates the posterior by distribution that maximizes lower bound to the marginal likelihood here is the vector of parameters of the distribution as shown in the lower bound is obtained by first multiplying and then dividing by and then applying jensen inequality by using concavity of log the approximate posterior is obtained by maximizing the lower bound with respect to log log dz max eq log unfortunately the lower bound may not always be easy to optimize some terms in the lower bound might be intractable or might admit form that is not easy to optimize in addition the optimization can be slow when and are large the kl algorithm for conjugate models in this section we introduce method based on kl proximal function and establish its relation to the existing approaches based on natural gradients in particular for models we show that each iteration of our approach is equivalent to step along the natural gradient the kl divergence between two distributions and is defined as follows dkl eq log log using the kl divergence as the proximal term we introduce algorithm that generates sequence of λk by solving the following subproblems kl arg max βk kl given an initial value and bounded sequence of βk one benefit of using the kl term is that it takes the geometry of the posterior distribution into account this fact has lead to their extensive use in both the optimization and statistics literature for speeding up the algorithm for convex optimization for in graphical models and for approximate bayesian inference relationship to the methods that use natural gradients an alternative approach to incorporate the geometry of the posterior distribution is to use natural gradients we now establish its relationship to our approach the natural gradient can be interpreted as finding descent direction that ensures fixed amount of change in the distribution for variational inference this is equivalent to the following arg max λk dsym kl where dsym kl is the symmetric kl divergence it appears that the subproblem is related to lagrangian of the above optimization in fact as we show below the two problems are equivalent for conditionally conjugate models we consider the described in which is bit more general than that of consider bayesian network with nodes zi and joint distribution zi where pai are the parents of zi we assume that each factor is an distribution defined as follows zi hi zi exp ti pai ti zi ai where is the natural parameter ti zi is the sufficient statistics ai is the partition function and hi zi is the base measure we seek factorized approximation shown in where each zi belongs to the same distribution as the joint distribution the parameters of this distribution are denoted by λi to differentiate them from the parameters also note that the subscript refers to the factor not to the iteration qi zi where qi zi hi exp λti ti zi ai λi for this model we show the following equivalence between method based on natural gradients and our approach the proof is given in the supplementary material theorem for the model shown in and the posterior approximation shown in the sequence λk generated by the algorithm of is equal to the one obtained using gradientdescent along the natural gradient with step lengths βk βk proof of convergence convergence of the algorithm shown in is proved in 
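To see the equivalence claimed above in action (the step-length expression is garbled in the source; β/(1+β) is assumed here), the toy script below runs the corresponding damped natural-parameter updates for a fully conjugate Gaussian model, where the coordinate-ascent optimum, and hence the natural gradient, is available in closed form and the exact posterior serves as a reference. The model and all constants are illustrative assumptions, not taken from the paper.

```python
# Damped natural-parameter updates for a fully conjugate toy model:
#   y_i ~ N(z, sig2),  z ~ N(0, tau2),  q(z) = N(m, v) with natural parameters (m/v, -1/(2v)).
# For such a model the natural gradient of the ELBO is eta_star - eta, so a natural-gradient
# step of length rho is the damped update below; rho = beta / (1 + beta) is the assumed
# correspondence with a KL-proximal step of weight beta.
import numpy as np

rng = np.random.default_rng(1)
sig2, tau2 = 1.0, 4.0
y = rng.normal(2.0, np.sqrt(sig2), size=50)

# Exact posterior natural parameters (available because the model is conjugate).
post_prec = 1.0 / tau2 + len(y) / sig2
post_mean = (y.sum() / sig2) / post_prec
eta_star = np.array([post_mean * post_prec, -0.5 * post_prec])

beta = 0.25
rho = beta / (1.0 + beta)                 # assumed step length
eta = np.array([0.0, -0.5])               # start from q = N(0, 1)
for _ in range(200):
    eta = (1.0 - rho) * eta + rho * eta_star

m_hat, v_hat = -0.5 * eta[0] / eta[1], -0.5 / eta[1]
print(m_hat, post_mean)                   # both ~ the exact posterior mean
print(v_hat, 1.0 / post_prec)             # both ~ the exact posterior variance
```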
we give summary of the results here we assume βk however the proof holds for any bounded sequence of βk let the space of all be denoted by define the set then λk under the following conditions maximum of exist and the gradient of is continuous and defined in the kl divergence and its gradient are continuous and defined in dkl only when in our case the conditions and are either assumed or satisfied and the condition can be ensured by choosing an appropriate parameterization of the kl algorithm for models the algorithm of might be difficult to optimize for models due to the factors in this section we present an algorithm based on the proximalgradient framework where we first split the objective function into difficult and easy terms and then to simplify the optimization linearize the difficult term see for good review of proximal methods for machine learning we split the ratio where contains all factors that make the optimization difficult and contains the rest is constant this results in the following split eq log eq log log eq log note that and can be factors in the distribution in the worst case we can set and take the rest as we give an example of the split in the next section the main idea is to linearize the difficult term such that the resulting problem admits simple form specifically we use algorithm that solves the following sequence of subproblems to maximize as shown below here λk is the gradient of at λk kl arg max λt λk βk kl note that our linear approximation is equivalent to the one used in gradient descent also the approximation is tight at λk therefore it does not introduce any error in the optimization rather it only acts as surrogate to take the next step existing variational methods have used approximations such as ours see most of these methods first approximate the log term by using linear or quadratic approximation and then compute the expectation as result the approximation is not tight and can result in bad performance in contrast our approximation is applied directly to log and therefore is tight at λk the convergence of our approach is covered under the results shown in they prove convergence of an algorithm more general algorithm than ours below we summarize the results as before we assume that the maximum exists and is continuous we make three additional assumptions first the gradient of is continuous in second the function is concave third there exists an such that λk dkl λk where denotes the gradient with respect to the first argument under these conditions λk when βk the choice of constant is also discussed in note that even though is required to be concave could be the lower bound usually contains concave terms in the entropy term in the worst case when there are no concave terms we can simply choose examples of kl variational inference in this section we show few examples where the subproblem has solution generalized linear model we consider the generalized linear model shown in here is the output vector of length whose th entry is equal to yn whereas is an feature matrix that contains feature vectors xtn as rows the weight vector is gaussian with mean and covariance to obtain the probability of yn the linear predictor xtn is passed through yn yn we restrict the posterior distribution to be gaussian with mean and covariance therefore for this posterior family the terms yn are difficult to handle while the gaussian term is easy because it is conjugate to therefore we set and let the rest of the terms go in by substituting in and using the definition of the kl 
divergence we get the lower bound shown below in the first term is the function that will be linearized and the second term is the function eq log yn eq log for linearization we compute the gradient of using the chain rule denote fn ven eq log yn where xtn and ven xtn vxn gradients of and can then be expressed in terms of gradients of fn and ven xn ven fn xn xtn fn ven for notational simplicity we denote the gradient of fn at nk xtn mk and venk xtn vk xn by αnk nk venk γnk fn nk venk fn using and we get the following linear approximation of λt λk mt mk vk tr mk vk αnk xtn γnk xtn vxn substituting the above in we get the following subproblem in the th iteration arg max αnk xtn γnk xtn vxn eq dkl vk βk taking the gradient and and setting it to zero we get the following solutions details are given in the supplementary material diag rk rk rk xt αk rk mk where rk βk and αk and are vectors of αnk and γnk respectively for all computationally efficient updates even though the updates are available in closed form they are not efficient when dimensionality is large in such case an explicit computation of is costly because the resulting matrix is extremely large we now derive efficient updates that avoids an explicit computation of our derivation involves two key steps the first step is to show that can be parameterized by specifically if we initialize then we can show that rk rk xt diag where detailed derivation is given in the supplementary material with the second key step is to express the updates in terms of and ven for this purpose we define be vector whose th entry is be the vector of ven some new quantities let similarly let and respectively finally for all denote the corresponding vectors in the th iteration by xσxt xµ and define xm and diag xvxt and by applying the woodbury matrix now by using the fact that and as shown below detailed derivation is identity we can express the updates in terms of given in the supplementary material where bk diag rk rk σb σα diag σa where ak diag diag αk and whose size only depends on and is indee note that these updates depend on pendent of most importantly these updates avoid an explicit computation of and only require and both of which scale linearly with storing also note that the matrix ak and bk differ only slightly and we can reduce computation by using ak in place of bk in our experiments this does not create any convergence issues to assess convergence we can use the optimality condition by taking the norm of the derivative of σα at and and simplifying we get the following criteria ke tr diag for some derivation is in the supplementary material function model and gaussian process the algorithm presented above can be extended to function models by using the view presented in consider basis function that maps feature vector into an feature space the generalized linear model of is extended to linear basis function model by replacing xtn with the latent function the gaussian prior on then translates to kernel function σφ and mean function in the latent whose th entry is equal function space given input vectors xn we define the kernel matrix whose th entry is to xi xj and the mean vector xi assuming gaussian posterior distribution over the latent function we can compute its mean to be the vector of and variance ve using the algorithm we define algorithm algorithm for linear basis function models and gaussian process sequence rk covariance given training data test data kernel mean and threshold and diag initialize repeat for all in parallel αnk nk venk and 
γnk fn nk venk fn and using update rk rk tr diag σα until ke predict test inputs using to be the vector of all ve xn following the same derivation as the for all and similarly and variance previous section we can show that the updates of give us the posterior mean these updates are the kernalized version of and for prediction we only need the converged value of αk and denoted by and respectively given new input define and to be vector whose th entry is equal to xn the predictive mean and variance can be computed as shown below diag ve to small constant otherwise is given in algorithm here we initialize solving the first equation might be these updates also work for the gaussian process gp models with kernel and mean function and for many other latent gaussian models such as matrix factorization models experiments and results we now present some results on the real data our goal is to show that our approach gives comparable results to existing methods and is easy to implement we also show that in some cases our method is significantly faster than the alternatives due to the kernel trick we show results on three models bayesian logistic regression gp classification with logistic likelihood and gp regression with laplace likelihood for these likelihoods expectations can be computed almost exactly for which we used the methods described in we use fixed of βk and for logistic and laplace likelihoods respectively we consider three datasets for each model summary is given in table these datasets can be found at the data of libsvm and uci bayesian logistic regression results for bayesian logistic regression are shown in table we consider two datasets for and for colon we compare our proximal method to three other existing methods the map method which finds the mode of the penalized the method where the distribution is factorized across dimensions and the cholesky method of we implemented these methods using minfunc software by mark we used for optimization all algorithms are stopped when optimality condition is below we set the gaussian prior to δi and to set the hyperparameter we use for map and maximum estimate for the rest of the methods as we compare running times as well we use common range of hyperparameter values for all methods these values are shown in table for bayesian methods we report the negative of the marginal likelihood approximation this is the negative of the value of the lower bound at the maximum we also report the computed as follows log where are the predictive probabilities of the test data and is the total number of lower value is better and value of is equivalent to random in addition we report the total time taken for hyperparameter selection https and http available at https model logreg gp class gp reg dataset colon ionosphere sonar housing triazines space ga train splits hyperparameter range logspace logspace for all datasets log linspace log linspace log linspace log linspace log linspace table list of models and datasets train is the of training data the last column shows the hyperparameters values linspace and logspace refer to matlab commands dataset colon methods map cholesky proximal map proximal log loss time table summary of the results obtained on bayesian logistic regression in all columns lower values implies better performance for map this is the total time whereas for bayesian methods it is the time taken to compute for all hyperparameters values over the whole range we summarize these results in table for all columns lower value is better we see that for 
fully bayesian methods perform slightly better than map more importantly the proximal method is faster than the cholesky method but obtains the same error and marginal likelihood estimate for the proximal method we use updates of and because but even in this scenario the cholesky method is slow due to expensive for large number of parameters for the colon dataset we use the update for the proximal method we do not compare to the cholesky method because it is too slow for the large datasets in table we see that our implementation is as fast as the method but performs significantly better overall with the proximal method we achieve the same results as the cholesky method but take less time in some cases we can also match the running time of the method note that the method does not give bad predictions and the minimum value of are comparable to our approach however as values for the method are inaccurate it ends up choosing bad hyperparameter value this is expected as the method makes an extreme approximation therefore is more appropriate for the method gaussian process classification and regression we compare the proximal method to expectation propagation ep and laplace approximation we use the gpml toolbox for this comparison we used kernel for the gaussian process with two scale parameters and as defined in gpml toolbox we do grid search over these hyperparameters the grid values are given in table we report the and running time for each method the left plot in figure shows the for gp classification on usps dataset where the proximal method shows very similar behaviour to ep these results are summarized in table we see that our method performs similar to ep sometimes bit better the running times of ep and the proximal method are also comparable the advantage of our approach is that it is easier to implement compared to ep and it is also numerically robust the predictive probabilities obtained with ep and the proximal method for usps dataset are shown in the right plot of figure the horizontal axis shows the test examples in an ascending order the examples are sorted according to their predictive probabilities obtained with ep the probabilities themselves are shown in the higher value implies better performance therefore the proximal method gives log log log log sigma ep proximal predictive prob log sigma ep vs proximal log log test examples log figure in the left figure the top row shows the and the bottom row shows the running time in seconds for the usps dataset in each plot the minimum value of the is shown with black circle the right figure shows the predictive probabilities obtained with ep and the proximal method the horizontal axis shows the test examples in an ascending order the examples are sorted according to their predictive probabilities obtained with ep the probabilities themselves are shown in the higher value implies better performance therefore the proximal method gives estimates better than ep data ionosphere sonar housing triazines space ga laplace log loss ep proximal time is sec is min is hr laplace ep proximal table results for the gp classification using logistic likelihood and the gp regression using laplace likelihood for all rows lower value is better estimates better than ep the improvement in the performance is due to the numerical error in the likelihood implementation for the proximal method we use the method of which is quite accurate designing such accurate likelihood approximations for ep is challenging discussion and future work in this paper we have 
proposed proximal framework that uses the kl proximal term to take the geometry of the posterior distribution into account we established the equivalence between our algorithm and methods we proposed algorithm that exploits the structure of the bound to simplify the optimization an important future direction is to apply stochastic approximations to approximate gradients this extension is discussed in it is also important to design method to set the step sizes in addition our proximal framework can also be used for distributed optimization in variational inference acknowledgments mohammad emtiyaz khan would like to thank masashi sugiyama and akiko takeda from university of tokyo matthias grossglauser and vincent etter from epfl and hannes nickisch from philips research hamburg for useful discussions and feedback pierre was supported in part by the swiss national science foundation under the grant tracking in the wild references matthew hoffman david blei chong wang and john paisley stochastic variational inference the journal of machine learning research tim salimans david knowles et al variational posterior approximation through stochastic linear regression bayesian analysis rajesh ranganath sean gerrish and david blei black box variational inference arxiv preprint michalis titsias and miguel doubly stochastic variational bayes for inference in international conference on machine learning sato online model selection based on the variational bayes neural computation honkela raiko kuusela tornio and karhunen approximate riemannian conjugate gradient learning for variational bayes the journal of machine learning research and alfred oiii hero kullback proximal algorithms for estimation information theory ieee transactions on paul tseng an analysis of the em algorithm and proximal point methods mathematics of operations research teboulle convergence of algorithms siam jon optimization pradeep ravikumar alekh agarwal and martin wainwright for linear programs proximal projections convergence and rounding schemes in international conference on machine learning behnam sejong yoon and vladimir pavlovic distributed mean field variational inference using bregman admm arxiv preprint bo dai niao he hanjun dai and le song scalable bayesian inference via particle mirror descent computing research repository lucas theis and matthew hoffman method for stochastic variational inference with applications to streaming data international conference on machine learning razvan pascanu and yoshua bengio revisiting natural gradient for deep networks arxiv preprint ulrich paquet on the convergence of stochastic variational inference in bayesian networks nips workshop on variational inference nicholas polson james scott and brandon willard proximal algorithms in statistics and machine learning arxiv preprint harri lappalainen and antti honkela bayesian independent component analysis by multilayer perceptrons in advances in independent component analysis pages springer chong wang and david blei variational inference in nonconjugate models mach learn april seeger and nickisch large scale bayesian inference and experimental design for sparse linear models siam journal of imaging sciences antti honkela and harri valpola unsupervised variational bayesian learning of nonlinear models in advances in neural information processing systems pages mohammad emtiyaz khan reza babanezhad wu lin mark schmidt and masashi sugiyama convergence of stochastic variational inference under sequence arxiv preprint carl edward rasmussen and 
Christopher Williams. Gaussian Processes for Machine Learning. MIT Press.
Marlin, Khan, and Murphy. Piecewise bounds for estimating latent Gaussian models. In International Conference on Machine Learning.
Mohammad Emtiyaz Khan. Decoupled variational inference. In Advances in Neural Information Processing Systems.
Challis and Barber. Concave Gaussian variational approximations for inference in Bayesian linear models. In International Conference on Artificial Intelligence and Statistics.
Huahua Wang and Arindam Banerjee. Bregman alternating direction method of multipliers. In Advances in Neural Information Processing Systems.
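As an end-to-end illustration of the split-and-linearize scheme described in the paper above (keep the conjugate Gaussian prior exactly, linearize the expected log-likelihood at the current iterate, add a KL proximal term), the sketch below runs the iteration for one-dimensional Bayesian logistic regression. It solves each KL-proximal subproblem with a generic optimizer and computes expectations by Gauss–Hermite quadrature, so it demonstrates the scheme itself rather than the closed-form updates derived above; the model, the step size beta, and all other constants are illustrative assumptions.

```python
# Split-and-linearize KL-proximal iteration for 1-D Bayesian logistic regression.
# The non-conjugate term f(m, v) = sum_n E_q[log sigmoid(y_n x_n z)] is linearized at the
# current (m_k, v_k); the Gaussian prior and the KL proximal term are kept exactly.
# Each subproblem is solved numerically here (not in closed form); all names are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, tau2, beta = 80, 25.0, 0.5
x = rng.normal(size=n)
z_true = 1.5
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-z_true * x)), 1.0, -1.0)

t, w = np.polynomial.hermite.hermgauss(30)            # Gauss-Hermite nodes and weights

def expect(g, m, v):                                  # E_{N(m, v)}[g(z)] by quadrature
    return (w * g(m + np.sqrt(2.0 * v) * t)).sum() / np.sqrt(np.pi)

def grad_f(m, v):
    # Gradients of the non-conjugate term via Bonnet/Price identities:
    #   d/dm E[g(z)] = E[g'(z)],   d/dv E[g(z)] = 0.5 E[g''(z)].
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    gm = sum(expect(lambda z, xn=xn, yn=yn: yn * xn * sig(-yn * xn * z), m, v)
             for xn, yn in zip(x, y))
    gv = sum(expect(lambda z, xn=xn, yn=yn: -0.5 * xn ** 2 * sig(yn * xn * z) * sig(-yn * xn * z), m, v)
             for xn, yn in zip(x, y))
    return gm, gv

def kl_gauss(m, v, m0, v0):                           # KL(N(m, v) || N(m0, v0))
    return 0.5 * (np.log(v0 / v) + (v + (m - m0) ** 2) / v0 - 1.0)

m, v = 0.0, 1.0
for k in range(30):
    gm, gv = grad_f(m, v)                             # linearize the difficult term at (m_k, v_k)
    def neg_obj(th, m_k=m, v_k=v):
        mm, vv = th[0], np.exp(th[1])
        val = (gm * mm + gv * vv                      # linearized non-conjugate term
               - kl_gauss(mm, vv, 0.0, tau2)          # exact conjugate (prior) part
               - kl_gauss(mm, vv, m_k, v_k) / beta)   # KL proximal term
        return -val
    res = minimize(neg_obj, np.array([m, np.log(v)]), method="L-BFGS-B")
    m, v = res.x[0], np.exp(res.x[1])
print("posterior approximation: mean %.3f, variance %.3f" % (m, v))
```

At a fixed point the proximal term's gradient vanishes, so the iterates satisfy the stationarity condition of the original variational bound; this tiny example only illustrates the mechanics, while the closed-form updates and the kernelized extension described above are what make the approach efficient in practice.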
for nonsmooth composite minimization niao he georgia institute of technology zaid harchaoui nyu inria abstract we propose new optimization algorithm to solve composite minimization problems typical examples of such problems have an objective that decomposes into empirical risk part and regularization penalty the proposed algorithm called semiproximal leverages the saddle point representation of one part of the objective while handling the other part of the objective via linear minimization over the domain the algorithm stands in contrast with more classical proximal gradient algorithms with smoothing which require the computation of proximal operators at each iteration and can therefore be impractical for problems we establish the theoretical convergence rate of mirrorprox which exhibits the optimal complexity bounds for the number of calls to linear minimization oracle we present promising experimental results showing the interest of the approach in comparison to competing methods introduction wide range of machine learning and signal processing problems can be formulated as the minimization of composite objective min kbxk where is closed and convex is convex and can be either smooth or nonsmooth yet enjoys particular structure the term kbxk defines regularization penalty through norm and bx linear mapping on closed convex set in many situations the objective function of interest enjoys favorable structure namely socalled saddle point representation max hx azi where is convex compact subset of euclidean space and is convex function sec will give several examples of such situations saddle point representations can then be leveraged to use optimization algorithms the simple first option to minimize is using the nesterov smoothing technique along with proximal gradient algorithm assuming that the proximal operator associated with is computationally tractable and cheap to compute however this is certainly not the case when considering problems with norms acting in the spectral domain of matrices such as the matrix and structured extensions thereof in the latter situation another option is to use smoothing technique now with conditional gradient or algorithm to minimize assuming that linear minimization oracle associated with is cheaper to compute than the proximal operator neither option takes advantage of the composite structure of the objective or handles the case when the linear mapping is nontrivial contributions our goal is to propose new optimization algorithm called semiproximal designed to solve the difficult composite optimization problem which does not require the exact computation of proximal operators instead the semiproximal relies upon saddle point representability of less restricted role than representation ii linear minimization oracle associated with in the domain while the saddle point representability of allows to cure the of the linear minimization over the domain allows to tackle the regularization penalty we establish the theoretical convergence rate of which exhibits the optimal complexity bounds for the number of calls to linear minimization oracle furthermore generalizes previously proposed approaches and improves upon them in special cases case does not require assumptions on favorable geometry of dual domain or simplicity of in case is competitive with previously proposed approaches based on smoothing techniques case of is the first or optimization algorithm for related work the algorithm belongs to the family of conditional gradient algorithms whose most basic 
instance is the algorithm for constrained smooth optimization using linear minimization oracle see recently in the authors consider constrained optimization when the domain has favorable geometry the domain is amenable to proximal setups favorable geometry and establish complexity bound with calls to the linear minimization oracle recently in method called conditional gradient sliding is proposed to solve similar problems using smoothing technique with complexity bound in for the calls to the linear minimization oracle lmo and additionally bound for the linear operator evaluations actually this bound for the lmo complexity can be shown to be indeed optimal for or lmobased algorithms when solving convex problems however these previous approaches are appropriate for objective with structure when applied to our problem the smoothing would be applied to the objective taken as whole ignoring its composite structure algorithms were recently proposed for composite objectives but can not be applied for our problem in is smooth and is identity matrix whereas in is and is also the identity matrix the proposed can be seen as blend of the successful components resp of the composite conditional gradient algorithm and the composite that enjoys the optimal complexity bound on the total number of lmo calls yet solves broader class of convex problems than previously considered framework and assumptions we present here our theoretical framework which hinges upon smooth saddle point reformulation of the minimization we shall use the following notations throughout the paper for given norm kp we define the dual norm as pn hs xi for any kxkf problem we consider the composite minimization problem opt min kbxk where is closed convex set in the euclidean space ex bx is linear mapping from to bx where is closed convex set in the euclidean space ey we make two important assumptions on the function and the norm defining the regularization penalty explained below related research extended such approaches to stochastic or online settings such settings are beyond the scope of this work saddle point representation the of can be challenging to tackle however in many cases of interest the function enjoys favorable structure that allows to tackle it with smoothing techniques we assume that is convex function given by max where is smooth function and is convex and compact set in the euclidean space ez such representation was introduced and developed in for the purpose of optimization saddle point representability can be interpreted as general form of the structure of functions used in the nesterov smoothing technique representations of this type are readily available for wide family of nonsmooth functions see sec for examples and actually for all empirical risk functions with convex loss in machine learning up to our knowledge composite linear minimization oracle algorithms require the tation of proximal operator at each iteration hη yi αkyk for several cases of interest described below the computation of the proximal operator can be expensive or intractable classical example is the nuclear norm whose proximal operator boils down to singular value thresholding therefore requiring full singular value decomposition in contrast to the proximal operator the linear minimization oracle can be much cheaper the linear minimization oracle lmo is routine which given an input and ey returns point lmo argmin hη yi αkyk in the case of the lmo only requires the computation of the leading pair of singular vectors which is an order of 
magnitude faster in saddle point reformulation the crux of our approach is smooth saddle point reformulation of after massaging the reformulation we consider the associated variational inequality which provides the sufficient and necessary condition for an optimal solution to the saddle point problem for any optimization problem with convex structure including convex minimization saddle point problem convex nash equilibrium the corresponding variational inequality is directly related to the accuracy certificate used to guarantee the accuracy of solution to the optimization problem see sec in and we shall present then an algorithm to solve the variational inequality established below that exploits its particular structure assuming that admits saddle point representation we write in epigraph form opt min max bx where bx is convex set we can approximate opt by opt min max ρhy bx wi opt see details in by introducing the variables for properly selected one has opt and the variational inequality associated with the above saddle point problem is fully described by the domain kyk and the monotone vector field fu fv where ρb ρw fu bx fv in the next section we present an efficient algorithm to solve this type of variational inequality which enjoys particular structure we call such an inequality for variational inequalities variational inequalities enjoy particular mixed structure that allows to get the best of two worlds namely the proximal setup where the proximal operator can be computed and the lmo setup where the linear minimization oracle can be computed basically the domain is decomposed as cartesian product over two sets such that admits while admits linear minimization oracle we now describe the main theoretical and algorithmic components of the algorithm resp in sec and in sec and finally describe the overall algorithm in sec composite with inexact we first present new algorithm which can be seen as an extension of the composite mirror prox algorithm denoted cmp for brevity that allows inexact computation of and can solve broad class of variational inequalites the original mirror prox algorithm was introduced in and was extended to composite settings in assuming exact computations of structured variational inequalities we consider the variational inequality vi find hf with domain and operator that satisfy the assumptions below set eu ev is closed convex and its projection where is convex and closed eu ev are euclidean spaces the function is continuously differentiable and also convex some this defines the bregman distance vu hω ui the operator eu ev is monotone and of form fu fv with fv ev being constant and fu eu satisfying the condition kfu fu lku for some the linear form hfv vi of eu ev is bounded from below on and is coercive on whenever ut is sequence such that ut is bounded and kv as we have hfv the quality of an iterate in the course of the algorithm is measured through the dual gap function ǫvi sup hf yi we give in appendix refresher on dual gap functions for the reader convenience we shall establish the complexity bounds in terms of this dual gap function for our algorithm which directly provides an accuracy certificate along the iterations however we first need to define what we mean by an inexact inexact proximal mappings were recently considered in the context of accelerated proximal gradient algorithms the definition we give below is more general allowing for we introduce here the notion of for for eu ev and let us define the subset pxǫ of as pxǫ vb hη si hζ vb wi when this 
reduces to the exact prox-mapping in the usual setting, that is, P_x(η, ζ) = argmin over (s, w) in the domain of {⟨η, s⟩ + ⟨ζ, w⟩ + V_x(s)} (there is a slight abuse of notation here: the norm involved is not the same as the one in the original problem). When ε > 0, this yields our definition of an inexact prox-mapping with inexactness parameter ε. Note that for any ε ≥ 0 the set P^ε_x(γ F_v) is well defined whenever γ > 0. The composite mirror prox with inexact prox-mappings is outlined below.
Algorithm (composite mirror prox, CMP, for the variational inequality). Input: stepsizes γ_t > 0 and inexactness levels ε_t ≥ 0; initialize x_1 in the domain. For t = 1, 2, ..., T: (i) compute ŷ_t = (û_t, v̂_t) ∈ P^{ε_t}_{x_t}(γ_t F(x_t)); (ii) compute x_{t+1} ∈ P^{ε_t}_{x_t}(γ_t (F_u(û_t), F_v)). Output the weighted average x̄_T = (Σ_{t≤T} γ_t ŷ_t) / (Σ_{t≤T} γ_t).
The proposed algorithm is an extension of the composite mirror prox with exact prox-mappings, from both the theoretical and the algorithmic point of view. We establish below the theoretical convergence rate; see the appendix for the proof.
Theorem (convergence of CMP with inexact prox-mappings). Assume that the stepsizes γ_t in the CMP algorithm satisfy the usual mirror prox stepsize condition, namely that the quantity σ_t combining γ_t ⟨F_u(u_t) − F_u(û_t), û_t − u_{t+1}⟩ with the Bregman distances V_{u_t}(û_t) and V_{û_t}(u_{t+1}) is nonpositive. Then, denoting by Θ the supremum of V_{x_1}(·) over the domain, for a sequence of inexact prox-mappings with inexactness ε_t we have ε_VI(x̄_T) = sup_x ⟨F(x), x̄_T − x⟩ ≤ (Θ + 2 Σ_{t≤T} ε_t) / (Σ_{t≤T} γ_t).
Remarks. Note that the assumption on the stepsizes is clearly satisfied when γ_t is of order 1/L and F_u is Lipschitz with constant L, which is essentially the case for the problem described earlier; it suffices, in particular, to take γ_t constant of that order. When the ε_t are summable we achieve the same convergence rate as when there is no error; if ε_t decays at a rate of order 1/t, then the overall convergence is only affected by a log factor. A convergence result on the sequence of projections of the iterates onto the feasible set, relevant when the variational inequality stems from a saddle point problem, is established in the appendix. The convergence rate established in this theorem and its corollary generalizes the previous result for CMP with exact prox-mappings: when exact prox-mappings are used we recover that result, and when inexact prox-mappings are used the errors due to the inexactness accumulate, which is reflected in the additive term involving the ε_t.
Composite conditional gradient. We now turn to a variant of the composite conditional gradient algorithm, denoted CCG, tailored to a particular class of problems which we call (L, κ)-smooth composite problems. The composite conditional gradient algorithm was first introduced in the literature on norm-regularized smooth convex optimization; we present an extension here which turns out to be well suited to the subproblems that will be solved later.
Minimizing (L, κ)-smooth composite functions. We consider the composite problem of minimizing φ(u) + ⟨θ, v⟩ over x = (u, v), represented by the pair (φ, θ), such that the following assumptions are satisfied: (i) the feasible set X ⊂ E_u × E_v is closed and convex, and its projection onto E_u belongs to a convex and compact set U; (ii) φ is a convex, continuously differentiable function, and there exist L > 0 and κ ∈ (1, 2] such that φ(u′) ≤ φ(u) + ⟨∇φ(u), u′ − u⟩ + (L/κ) ‖u′ − u‖^κ for all u, u′ ∈ U; (iii) the v-component is such that every linear function on E_u × E_v of the form ⟨η, u⟩ + ⟨θ, v⟩ with η ∈ E_u attains its minimum on X at some point. We have at our disposal a composite linear minimization oracle (LMO) which, given an input η ∈ E_u, returns such a minimizer.
Algorithm (composite conditional gradient, CCG). Input: accuracy ε > 0 and stepsizes γ_t; initialize x_1 = (u_1, v_1) ∈ X. For t = 1, 2, ...: compute (û_t, v̂_t) = LMO(∇φ(u_t)) and the gap δ_t = ⟨∇φ(u_t), u_t − û_t⟩ + ⟨θ, v_t − v̂_t⟩; if δ_t ≤ ε then return x_t = (u_t, v_t); else find a stepsize γ_t (for instance by line search) and set x_{t+1} = x_t + γ_t ((û_t, v̂_t) − x_t).
Note that CCG works essentially as if there were no v-component at all. The CCG algorithm enjoys a convergence rate of order t^{1−κ} in the number of evaluations of the objective, and the accuracy certificates δ_t enjoy the same rate as well.
Proposition (accuracy of CCG). Denote by D the diameter of U with respect to the norm above. When solving problems of this type, the iterates x_t of CCG satisfy φ(u_t) + ⟨θ, v_t⟩ − min_X (φ + ⟨θ, ·⟩) ≤ O(1) L D^κ t^{1−κ}; in addition, the accuracy certificates satisfy min_{s ≤ t} δ_s ≤ O(1) L D^κ t^{1−κ}.
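To make the CCG routine concrete, here is a minimal Python sketch. It is not the authors' implementation: the v-component is omitted (as noted above, CCG essentially ignores it), the step size follows the standard open-loop 2/(t+2) rule instead of the line search in the pseudocode, and the nuclear-norm oracle illustrates the earlier remark that the LMO only needs the leading singular pair rather than a full SVD. All function and variable names are ours.

```python
import numpy as np
from scipy.sparse.linalg import svds

def lmo_nuclear_ball(G, tau):
    """LMO for the nuclear-norm ball {X : ||X||_* <= tau}: the minimizer of
    <G, X> over the ball is -tau * u1 v1^T, where (u1, v1) is the leading
    singular pair of G, so a single singular pair suffices (no full SVD)."""
    u, _, vt = svds(np.asarray(G, dtype=float), k=1)
    return -tau * np.outer(u[:, 0], vt[0, :])

def ccg(grad_phi, lmo, x0, eps=1e-3, max_iter=1000):
    """Conditional-gradient loop with a duality-gap certificate, in the spirit
    of the CCG routine above (v-component dropped; open-loop 2/(t+2) steps)."""
    x = np.array(x0, dtype=float)
    gap = np.inf
    for t in range(1, max_iter + 1):
        g = grad_phi(x)                       # gradient of the smooth term phi
        s = lmo(g)                            # linear minimization oracle call
        gap = float(np.vdot(g, x - s))        # accuracy certificate delta_t
        if gap <= eps:                        # certified eps-accurate point
            break
        x = x + (2.0 / (t + 2.0)) * (s - x)   # convex-combination update
    return x, gap
```

For instance, with grad_phi the gradient of a least-squares fit and lmo set to lambda g: lmo_nuclear_ball(g, tau), this loop approximately solves a nuclear-norm-constrained problem. In the semi-proximal scheme presented next, a loop of exactly this shape, stopped once the certificate drops below ε_t, plays the role of the inexact prox-mapping on the LMO-friendly component.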
Semi-structured variational inequalities. We now give the full description of a special class of variational inequalities, which we call semi-structured. This family encompasses both cases discussed so far but, most importantly, it also covers many other problems that do not fall into these two regimes, and in particular our essential problem of interest.
The class of semi-structured variational inequalities allows one to go beyond the previous assumptions by assuming more structure. This structure is consistent with what we call the semi-proximal setup, which encompasses both the regular proximal setup (full proximal setup) and the regular linear minimization setup (full LMO setup) as special cases. Indeed, we consider a variational inequality that satisfies, in addition to the earlier assumptions, the following. (a) Partition of the domain: the domain decomposes as a Cartesian product X = X_1 × X_2, where X_1 is convex and closed and X_2 is convex and compact; each factor lives in its own product of Euclidean spaces E_{u_i} × E_{v_i} and is equipped with a norm and a continuously differentiable distance-generating function compatible with that norm; furthermore, the corresponding diameter of X_2 is assumed bounded. (b) Partition of the operator: induced by the above partition of the domain, the operator can be written componentwise, with the v-components constant, as before. (c) Proximal mapping on X_1: we assume that for any input and any γ > 0 we have at our disposal an exact prox-mapping on X_1 of the usual argmin form. (d) Linear minimization oracle for X_2: we assume that we have at our disposal a composite linear minimization oracle (LMO) which, given any input η and α > 0, returns an optimal solution to the minimization of the corresponding composite linear form over X_2.
We denote such problems as semi-structured. On the one hand, when X_2 is a singleton we recover the full proximal setup; on the other hand, when X_1 is a singleton we recover the full LMO setup. The semi-proximal setup therefore covers both setups and all the ones in between as well.
The semi-proximal mirror prox algorithm. We finally present our main contribution, the semi-proximal mirror prox algorithm (Semi-MP), which solves variational inequalities under the semi-proximal setup. The algorithm blends CMP and CCG: for the component handled by an LMO, instead of computing the prox-mapping exactly, we mimic it inexactly via the conditional gradient algorithm inside the composite mirror prox scheme, while for the other component we compute the prox-mapping exactly as it is.
Algorithm (Semi-MP). Input: stepsizes γ_t > 0 and accuracies ε_t ≥ 0; initialize x_1 ∈ X. For t = 1, 2, ..., T: (i) compute the intermediate point ŷ_t by taking the exact prox-mapping on the X_1-component and by running CCG on the X_2-component until the accuracy certificate drops below ε_t; (ii) compute x_{t+1} in the same way, except this time taking the value of the operator at the point ŷ_t. Output the weighted average x̄_T = (Σ_{t≤T} γ_t ŷ_t)/(Σ_{t≤T} γ_t).
At step t we thus first update the X_1-component by computing the exact prox-mapping, and build the X_2-component by applying the composite conditional gradient algorithm to the corresponding subproblem, with the linear term given by the current operator value and the prox term scaled by γ_t, until the certificate satisfies δ ≤ ε_t; we then build x_{t+1} similarly, except this time taking the value of the operator at the point ŷ_t. Combining the convergence theorem for CMP with the accuracy proposition for CCG, we arrive at the following complexity bound.
Proposition (complexity of Semi-MP). Under the above assumptions and with the stated choice of stepsizes γ_t, for the outlined algorithm to return an ε-solution to the variational inequality, the total number of mirror prox steps required does not exceed O(LΘ/ε), and the total number of calls to the linear minimization oracle does not exceed a bound of order (LΘ/ε)·(L D^κ/ε)^{1/(κ−1)}; in particular, if we use the Euclidean proximal setup on the LMO-component, which leads to κ = 2, the number of LMO calls does not exceed O(1/ε²).
[Figure: robust collaborative filtering and link prediction; objective function vs. elapsed time (sec) and vs. number of LMO calls; from left to right: ..., Wikivote, Wikivote (full).]
Discussion. The proposed algorithm enjoys the optimal complexity bound in the number of calls to the LMO (see the cited complexity results for general nonsmooth optimization with an LMO). Consequently, when applying the algorithm to the variational reformulation of the problem of interest, we are able to get an ε-solution within at most O(1/ε²) LMO calls. Semi-MP thus generalizes previously proposed approaches and improves upon them in special cases of the problem; see the appendix.
Experiments. We report the experimental results obtained with the proposed algorithm, denoted here Semi-MP,
and competing algorithms we consider two different applications robust collaborative filtering for movie recommendation ii link prediction for social network analysis for we compare to two competing approaches smoothing conditional gradient proposed in denoted smoothing proximal gradient equipped with setup for ii we compare to using equipped with setup additional experiments and implementation details are given in appendix robust collaborative filtering we consider the collaborative filtering problem with nuclearnorm regularization penalty and an function we run the above three algorithms on the the small and medium movielens datasets the dataset consists of users and movies with about ratings while the dataset consists of users and movies with about ratings we follow to set the regularization parameters in fig we can see that clearly outperforms while it is competitive with link prediction we consider now the link prediction problem where the objective consists for the empirical risk part and multiple regularization penalties namely the and the for this example applying the or would require two smooth approximations one for hinge loss term and one for norm term therefore we consider an alternative approach where we apply the linearized preconditioned admm algorithm by solving proximal mapping through conditional gradient routines up to our knowledge admm with early stopping is not fully theoretically analyzed in literature however intuitively as long as the error is controlled sufficiently such variant of admm should converge we conduct experiments on binary social graph data set called wikivote which consists of nodes and edges since the computation cost of these two algorithms mainly come from the lmo calls we present in below the performance in terms of number of lmo calls for the first set of experiments we select top highest degree users from wikivote and run the two algorithms on this small dataset with different strategies for the inner lmo calls in fig we observe that the is less sensitive to the inner accuracies of compared to the admm variant which sometimes stops progressing if the of early iterations are not solved with sufficient accuracy the results on the full dataset corroborate the fact that outperforms the variant of the admm algorithm acknowledgments the authors would like to thank juditsky and nemirovski for fruitful discussions this work was supported by nsf grant labex project titan project macaron the joint centre and the data science environment at nyu references francis bach duality between subgradient and conditional gradient methods siam journal on optimization francis bach rodolphe jenatton julien mairal and guillaume obozinski optimization with sparsityinducing penalties found trends mach heinz bauschke and patrick combettes convex analysis and monotone operator theory in hilbert spaces springer bertsekas convex optimization algorithms athena scientific xi chen qihang lin seyoung kim jaime carbonell and eric xing smoothing proximal gradient method for general structured sparse regression the annals of applied statistics bruce cox anatoli juditsky and arkadi nemirovski dual subgradient algorithms for nonsmooth learning problems mathematical programming pages dudik harchaoui and malick lifted coordinate descent for learning with regularization proceedings of the international conference on artificial intelligence and statistics aistats dan garber and elad hazan linearly convergent conditional gradient algorithm with applications to online and stochastic 
optimization arxiv preprint zaid harchaoui anatoli juditsky and arkadi nemirovski conditional gradient algorithms for normregularized smooth convex optimization mathematical programming pages hazan and kale online learning in icml niao he anatoli juditsky and arkadi nemirovski mirror prox algorithm for composite minimization and problems arxiv preprint martin jaggi revisiting sparse convex optimization in icml pages anatoli juditsky and arkadi nemirovski solving variational inequalities with monotone operators on domains given by linear minimization oracles arxiv preprint guanghui lan the complexity of convex programming under linear optimization oracle arxiv guanghui lan and yi zhou conditional gradient sliding for convex optimization arxiv cun mu yuqian zhang john wright and donald goldfarb scalable robust matrix recovery frankwolfe meets proximal methods arxiv preprint arkadi nemirovski with rate of convergence for variational inequalities with lipschitz continuous monotone operators and smooth saddle point problems siam journal on optimization arkadi nemirovski shmuel onn and uriel rothblum accuracy certificates for computational problems with convex structure mathematics of operations research yurii nesterov smooth minimization of functions mathematical programming yurii nesterov smoothing technique and its applications in semidefinite optimization math yurii nesterov complexity bounds for methods minimizing the model of objective function technical report catholique de louvain center for operations research and econometrics core yuyuan ouyang yunmei chen guanghui lan and eduardo pasiliao an accelerated linearized alternating direction method of multipliers http neal parikh and stephen boyd proximal algorithms foundations and trends in optimization pages federico pierucci zaid harchaoui and malick smoothing approach for composite conditional gradient with nonsmooth loss in dapprentissage mark schmidt nicolas roux and francis bach convergence rates of inexact methods for convex optimization in adv nips zhang yu and schuurmans accelerated training for regularization boosting approach in nips 
lasso with measurements is equivalent to one with linear measurements ehsan abbasi department of electrical engineering caltech eabbasi christos thrampoulidis department of electrical engineering caltech cthrampo babak hassibi department of electrical engineering caltech hassibi abstract consider estimating an unknown but structured sparse etc signal rn from vector rm of measurements of the form yi gi ai where the ai are the rows of known measurement matrix and is potentially unknown nonlinear and random such measurement functions could arise in applications where the measurement device has nonlinearities and uncertainties it could also arise by design gi sign zi corresponds to noisy quantized measurements motivated by the classical work of brillinger and more recent work of plan and vershynin we estimate via solving the arg minx ky λf for some regularization parameter and some typically convex regularizer that promotes the structure of etc while this approach seems to naively ignore the nonlinear function both brillinger in the case and plan and vershynin have shown that when the entries of are iid standard normal this is good estimator of up to constant of proportionality which only depends on in this work we considerably strengthen these results by obtaining explicit expressions for the regularized that are asymptotically precise when and grow large main result is that the estimation performance of the generalized lasso with measurements is asymptotically the same as one whose measurements are linear yi µai σzi with eγg and µγ and standard normal to the best of our knowledge the derived expressions on the estimation performance are the precise results in this context one interesting consequence of our result is that the optimal quantizer of the measurements that minimizes the estimation error of the generalized lasso is the celebrated quantizer introduction measurements consider the problem of estimating an unknown signal vector rn from vector ym of measurements taking the following form yi gi ati here each ai represents known measurement vector the gi are independent copies of generically random link function for instance gi zi with say zi being normally this work was supported in part by the national science foundation under grants and by grant from qualcomm by nasa jet propulsion laboratory through the president and directors fund by king abdulaziz university and by king abdullah university of science and technology distributed recovers the standard linear regression setup with gaussian noise in this paper we are particularly interested in scenarios where is notable examples include sign or gi sign and corresponding to quantized noisy measurements and to the censored tobit model respectively depending on the situation might be known or unspecified in the statistics and econometrics literature the measurement model in is popular under the name model and several aspects of it have been structured signals it is typical that the unknown signal obeys some sort of structure for instance it might be sparse few of its entries are or it might be that vec where is matrix of to exploit this information it is typical to associate with the structure of properly chosen function rn which we refer to as the regularizer of particular interest are convex and such regularizers the for sparse signals the for ones etc please refer to for further discussions an algorithm for linear measurements the generalized lasso when the link function is linear gi zi perhaps the most popular way of estimating is via 
solving the generalized lasso algorithm arg min ky λf here am is the known measurement matrix and is regularizer parameter this is often referred to as the or the to distinguish from the one solving minx ky λf instead our results can be accustomed to this latter version but for concreteness we restrict attention to throughout the acronym lasso for was introduced in for the special case of is natural generalization to other kinds of structures and includes the the as special cases we often drop the term generalized and refer to simply as the lasso one popular measure of estimation performance of is the recently there have been significant advances on establishing tight bounds and even precise characterizations of this quantity in the presence of linear measurements such precise results have been core to building better understanding of the behavior of the lasso and in particular on the exact role played by the choice of the regularizer in accordance with the structure of by the number of measurements by the value of etc in certain cases they even provide us with useful insights into practical matters such as the tuning of the regularizer parameter using the lasso for measurements the lasso is by nature tailored to linear model for the measurements indeed the first term of the objective function in tries to fit ax to the observed vector presuming that this is of the form yi ati noise of course no one stops us from continuing to use it even in cases where yi ati with being but the question then becomes can there be any guarantees that the solution of the generalized lasso is still good estimate of the question just posed was first studied back in the early by brillinger who provided answers in the case of solving without regularizer term this of course corresponds to standard least squares ls interestingly he showed that when the measurement vectors are gaussian then the ls solution is consistent estimate of up to constant of proportionality which only depends on the the result is sharp but only under the assumption that the number of measurements grows large while the signal dimension stays fixed which was the typical setting of interest at the time in the world of structured signals and measurements the problem was only very recently revisited by plan and vershynin they consider constrained version of the generalized lasso in which the regularizer is essentially replaced by constraint and derive upper bounds on its performance the bounds are not tight they involve absolute constants but they demonstrate some key features the solution to the constrained lasso is good estimate of up to the same constant of proportionality that appears in brillinger result ii thus is natural measure of performance iii estimation is possible even with measurements by taking advantage of the structure of the model is classical topic and can also be regarded as special case of what is known as sufficient dimension reduction problem there is extensive literature on both subjects unavoidably we only refer to the directly relevant works here note that the generalized lasso in does not assume knowledge of all that is assumed is the availability of the measurements yi thus the might as well be unknown or unspecified linear prediction figure squared error of the lasso with measurements and with corresponding linear ones as function of the regularizer parameter both compared to the asymptotic prediction here gi sign with zi the unknown signal is of dimension and has entries see sec for details the different curves 
correspond to and number of measurements respectively simulation points are averages over problem realizations summary of contributions inspired by the work of plan and vershynin and motivated by recent advances on the precise analysis of the generalized lasso with linear measurements this paper extends these latter results to the case of mesaurements when the measurement matrix has entries gaussian henceforth we assume this to be the case without further reference and the estimation performance is measured in sense we are able to precisely predict the asymptotic behavior of the error the derived expression accurately captures the role of the link function the particular structure of the role of the regularizer and the value of the regularizer parameter further it holds for all values of and for wide class of functions and interestingly our result shows in very precise manner that in large dimensions modulo the information about the magnitude of the lasso treats measurements exactly as if they were scaled and noisy linear measurements with scaling factor and noise variance defined as γg and µγ for where the expecation is with respect to both and in particular when is such that then the estimation performance of the generalized lasso with measurements of the form yi gi ati is asymptotically the same as if the measurements were rather of the form yi µati σzi with as in and zi standard gaussian noise recent analysis of the of the lasso when used to recover structured signals from noisy linear observations provides us with either precise predictions or in other cases with tight upper bounds owing to the established relation between and corresponding linear measurements such results also characterize the performance of the lasso in the presence of nonlinearities we remark that some of the error formulae derived here in the general context of measurements have not been previously known even under the prism of linear measurements figure serves as an illustration the error with measurements matches well with the error of the corresponding linear ones and both are accurately predicted by our analytic expression under the generic model in which allows for to even be unspecified can in principle be estimated only up to constant of proportionality for example if is uknown then any information about the norm could be absorbed in the definition of the same is true when sign eventhough might be known here in these cases what becomes important is the direction of motivated by this and in order to simplify the presentation we have assumed throughout that has unit euclidean this excludes for example link functions that are even but also some other not so obvious cases sec for few special cases sparse recovery with binary measurements yi different methodologies than the lasso have been recently proposed that do not require in remark they note that their results can be easily generalized to the case when by simply redifining and accordingly adjusting the values of the parameters and in the very same argument is also true in our case discussion of relevant literature extending an old result brillinger identified the asymptotic behavior of the estimation error of the ls solution at at by showing that when the dimension of is fixed lim where and are same as in our result can be viewed as generalization of the above in several directions first we extend to the regime where and both grow large by showing that lim second and most importantly we consider solving the generalized lasso instead to which ls is only very 
special case this allows versions of where the error is finite even when see note the additional challenges faced when considering the lasso no longer has expression ii the result needs to additionally capture the role of and motivated by recent work plan and vershynin consider constrained generalized lasso arg min ky with as in and some known set not necessarily convex in its simplest form their result shows that when dk then with highp probability dk here µx is the gaussian width specific measure ofmcomplexity of the constrained set when viewed from for our purposes it suffices to remark that if is properly chosen and if is on the boundary of then dk is less than thus estimation is in principle is possible with measurements the parameters and that appear in are the same as in and µγ observe that in contrast to and to the setting of this paper the result in is also it suggests the critical role played by and on the other hand is only an upper bound on the error and also it suffers from unknown absolute proportionality constants hidden in moving the analysis into an asymptotic setting our work expands upon the result of first we consider the regularized lasso instead which is more commonly used in practice most importantly we improve the loose upper bounds into precise expressions in turn this proves in an exact manner the role played by and to which is only indicative for direct comparison with we mention the following result which follows from our analysis we omit the proof for brevity assume is convex dk and also then yields an upper bound cσ to the error for some constant instead we show precise analysis of the lasso with linear measurements the first precise error formulae were established in for the with the analysis was based on the the approximate message passing amp framework more general line of work studies the problem using recently developed framework termed the convex gaussian theorem cgmt which is tight version of classical gaussian comparison inequality by gordon the cgmt framework was initially used by stojnic to derive tight upper bounds on the constrained lasso with generalized those to general convex regularizers and also to the the was studied in those bounds hold for all values of snr but they become tight only in the regime precise error expression for all values of snr was derived in for the with under gaussianity assumption on the distribution of the entries of when measurements are linear our theorem generalizes this assumption moreover our theorem provides error predictions for regularizers going beyond the nuclear norm which appear to be novel when it comes to measurements to the best of our knowledge this paper is the first to derive asymptotically precise results on the performance of any program results modeling assumptions unknown structured signal we let rn represent the unknown signal vector we assume that with sampled from probability density in rn thus is deterministically of unit this is mostly to simplify the presentation see footnote information about the structure of and correspondingly of is encoded in to study an which is sparse it is typical to assume that its entries are ρqx where becomes the normalized sparsity level qx is scalar and is the dirac delta regularizer we consider convex regularizers rn measurement matrix the entries of are measurements and we observe where is possibly random map from rm to rm and gm um each gi is from real valued random function for which and are defined in we assume that and are nonzero and bounded asymptotics we 
study a linear asymptotic regime. In particular, we consider a sequence of problem instances {x_0, A, f} indexed by n such that A has i.i.d. Gaussian entries, f: R^n → R is proper convex, and m = δn for a fixed δ > 0. We further require that the following conditions hold: (i) x_0 is sampled from a probability density in R^n with marginals that are independent of n and have bounded second moments; (ii) for any n and any subgradient s ∈ ∂f(x_0), suitably normalized quantities involving f(x_0) and s converge in probability, as n → ∞, to constants independent of n (here →_P denotes convergence in probability). The first assumption holds without loss of generality and is only necessary to simplify the presentation; the second condition is no more than a normalization condition on the sequence of regularizers. Every such sequence generates a sequence of measurements and estimates; when clear from the context we drop the superscript n.
Precise error prediction. Let {x_0, A, f} be a sequence of problem instances satisfying all the conditions above. These define the sequence of solutions x̂ to the corresponding LASSO problems, for fixed λ > 0: x̂ = argmin_x ‖y − Ax‖_2 + λ f(x). The main contribution of this paper is a precise evaluation of the squared error of x̂ (measured, as discussed above, against μ x_0), with high probability over the randomness of A, of x_0, and of g.
A general result. To state the result in a general framework we require a further assumption on f and x_0; later in this section we illustrate how this assumption can be naturally met. We write f*(s) = sup_x {x^T s − f(x)} for the Fenchel conjugate of f; also, we denote the Moreau envelope of f at v with index τ by e_f(v; τ) = min_x {f(x) + (1/(2τ))‖v − x‖_2^2}.
Assumption. We say the assumption holds if, for all positive constants, the limit as n → ∞ of the suitably normalized Moreau envelope of the regularizer, evaluated at a Gaussian perturbation of x_0, exists with probability one. This limiting quantity is the object through which the structure of x_0 and the choice of the regularizer enter the error formula below.
Theorem (equivalence). Consider the asymptotic setup above and let the assumption hold. Recall μ and σ as defined earlier, and let x̂ be the minimizer of the generalized LASSO for fixed λ > 0 and for nonlinear measurements y_i = g_i(a_i^T x_0). Further, let x̂_lin be the solution of the generalized LASSO when used with linear measurements of the form y_lin = μ A x_0 + σ z, where z has i.i.d. standard normal entries. Then, in the limit as n grows with m = δn, with probability one, the squared errors of x̂ and of x̂_lin (both measured against μ x_0) converge to the same limit. (Such models have been widely used in the relevant literature; in fact, the results here continue to hold as long as the marginal distribution of the entries of x_0 converges to a given distribution as above.)
The equivalence theorem relates, in a very precise manner, the error of the generalized LASSO under nonlinear measurements to the error of the same algorithm when used under appropriately scaled noisy linear measurements. The theorem below derives an asymptotically exact expression for this error.
Theorem (precise error formula). Under the same assumptions as the equivalence theorem, it holds with probability one that the limit of the squared error exists and is determined by the unique optimal solution of a deterministic max-min (convex-concave) program in three scalar variables, whose objective involves δ, λ, μ, σ and the limiting Moreau envelope of the assumption above. Also, the optimal cost of the LASSO converges to the optimal cost of this program.
Under the stated conditions, the theorem proves that the limit of the squared error exists and is equal to the value determined by the unique solution of the program. Notice that this is a deterministic and convex optimization problem which only involves three scalar optimization variables; thus the optimal solution can in principle be efficiently numerically computed. In many specific cases of interest, with some extra effort, it is possible to obtain simpler expressions (see the sparse-recovery theorem below). The roles of the normalized number of measurements δ, of the regularizer parameter λ, and of g (through μ and σ) are explicit in the program; the structure of x_0 and the choice of the regularizer f are implicit in it through the limiting Moreau envelope. The figures illustrate the accuracy of the prediction of the theorem in a number of different settings. The proofs of both theorems are deferred to the appendix. In the next sections we specialize the precise error formula to the cases of sparse, group-sparse, and low-rank signal recovery.
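As a sanity check of the equivalence just stated, the following minimal simulation sketch (ours, not the authors' code) compares the LASSO error under one-bit measurements y = sign(A x_0) with the error under the matched linear model y = μ A x_0 + σ z. For simplicity it uses the squared-loss variant of the LASSO mentioned earlier, solved by plain ISTA; all dimensions and the value of λ are arbitrary illustrative choices.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, iters=500):
    """Proximal gradient (ISTA) for min_x 0.5*||y - A x||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x + A.T @ (y - A @ x) / L, lam / L)
    return x

rng = np.random.default_rng(0)
n, m, k = 400, 300, 20                       # illustrative sizes only
x0 = np.zeros(n)
x0[:k] = rng.standard_normal(k)
x0 /= np.linalg.norm(x0)                     # unit-norm signal, as assumed above
A = rng.standard_normal((m, n))

mu = np.sqrt(2 / np.pi)                      # mu = E[gamma * sign(gamma)]
sigma = np.sqrt(1 - 2 / np.pi)               # sigma^2 = E[sign(gamma)^2] - mu^2

y_nonlin = np.sign(A @ x0)                   # one-bit (nonlinear) measurements
y_lin = mu * (A @ x0) + sigma * rng.standard_normal(m)

lam = 0.05                                   # arbitrary; sweep it to mimic the figure
err_nonlin = np.linalg.norm(lasso_ista(A, y_nonlin, lam) - mu * x0) ** 2
err_lin = np.linalg.norm(lasso_ista(A, y_lin, lam) - mu * x0) ** 2
print(err_nonlin, err_lin)                   # expected to be close for large m, n
```

The two printed errors should be approximately equal for large problem sizes, in the spirit of the equivalence theorem; the exact agreement is an asymptotic statement.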
Sparse recovery. Assume each entry of x_0 is sampled i.i.d. from the distribution p_X = (1 − ρ) δ_0 + ρ q_X, where ρ is the normalized sparsity level, δ_0 is the Dirac delta, and q_X is a probability density function with second moment normalized so that the condition of the previous section is satisfied; then x_0 has on average ρn nonzero entries and unit Euclidean norm. Letting f be the l1-norm, the normalization condition on the regularizer is also satisfied. Let us now check the Moreau-envelope assumption: the Fenchel conjugate of the l1-norm is simply the indicator function of the unit l-infinity ball; hence, without much effort, the relevant Moreau envelope reduces to a separable minimization whose solution is expressed through the soft-thresholding operator. An application of the weak law of large numbers shows that the limit exists and equals an expectation over a standard normal variable and over p_X. With all these, the general error theorem is applicable. We have put extra effort in order to obtain the following equivalent but more insightful characterization of the error, as stated below and proved in the appendix.
Theorem (sparse recovery). If δ ≥ 1 then define λ_crit = 0; otherwise let (λ_crit, κ_crit) be the unique pair of solutions to a system of two equations involving expectations, over a standard normal variable h independent of X ~ p_X, of soft-threshold-type expressions in κ h + μ X, together with δ and ρ (the system is stated in the appendix). Then, for any λ > 0, with probability one: in the regime λ ≤ λ_crit the limiting squared error is given in closed form in terms of δ, κ_crit and λ_crit, while in the regime λ > λ_crit it is given as the unique solution of a corresponding scalar equation.
[Figure: squared error of the LASSO as a function of the regularizer parameter λ, compared to the asymptotic predictions; simulation points are averages over independent realizations. (a) Sparse signal recovery: illustration of the sparse-recovery theorem for g = sign and a sparse p_X, for two values of δ. (b) Group-sparse signal recovery: illustration of the theorem for x_0 as in the group-sparse section and g_i = sign; in particular, x_0 is composed of blocks of constant size, each block is zero with some probability and otherwise its entries are i.i.d.]
Finally, the figures validate the prediction of the theorem for different signal distributions, namely q_X being Gaussian and Bernoulli respectively, for the case of compressed (δ < 1) measurements. Observe the two different regimes of operation, one for λ ≤ λ_crit and the other for λ > λ_crit, precisely as they are predicted by the theorem. The special case of the theorem in which q_X is Gaussian has been previously studied; otherwise, to the best of our knowledge, this is the first precise analysis result stated in that generality. An analogous result, but via different analysis tools, has only been known for the constrained version, as appears in prior work.
Group-sparse recovery. Let x_0 ∈ R^n be composed of blocks of constant size b each, such that each block is sampled from a probability density in R^b of the form p_X = (1 − ρ) δ_0 + ρ q_X with q_X a density on R^b; thus, on average, a fraction ρ of the blocks of x_0 are nonzero. We operate in the regime of linear measurements m = δn. As is common, we use the l1/l2 group norm (the sum of the l2-norms of the blocks) to induce block-sparsity; this is often referred to as the group LASSO in the literature. It is not hard to show that the Moreau-envelope assumption holds, with the limit expressed through the vector soft-thresholding operator on R^b applied to a combination of h ~ N(0, I_b) and X ~ p_X, with h and X independent. Thus the general error theorem is applicable in this setting; the corresponding figure illustrates the accuracy of the prediction.
Low-rank matrix recovery. Let X_0 be an unknown matrix of rank r, in which case x_0 = vec(X_0). In this setting we consider nuclear-norm regularization; in particular, we choose f to be an appropriately scaled nuclear norm, so that each subgradient S ∈ ∂f(X_0) satisfies the Frobenius-norm normalization required by the condition of the previous section. Furthermore, for this choice of regularizer the Moreau envelope decomposes over singular values: e_f(V; τ) is expressed through the soft-thresholded singular values s_i(V) of its argument. If conditions are met such that the empirical distribution of the singular values of the corresponding sequence of random matrices converges to a limiting distribution, then the assumption and the general error theorem apply; for instance, this will be the case if X_0 = U S V^T, where U and V are unitary matrices and S is a diagonal matrix whose entries have a given marginal distribution with bounded moments, in particular independent of n. We leave the details, and the problem of numerically evaluating the resulting error expression, for future work.
An application to q-bit compressive sensing. Setup. Consider recovering a sparse unknown signal x_0 ∈ R^n from scalar
quantized linear measurements let tl represent symmetric with respect to set of decision thresholds and the corresponding representation points such that then quantization of real number into can be represented as qq sign where is the indicator function of set for example quantization with level corresponds to sign the measurement vector ym takes the form yi qq ati where ati are the rows of measurement matrix which is henceforth assumed standard gaussian we use the lasso to obtain an estimate of as arg min ky henceforth we assume for simplicity that also in our case is known since qq is known thus is reasonable to scale the solution of as and consider the error quantity as measure of estimation performance clearly the error depends besides others on the number of bits on the choice of the decision thresholds and on the quantization levels an interesting question of practical importance becomes how to optimally choose these to achieve less error as running example for this section we seek optimal quantization thresholds and corresponding levels arg min while keeping all other parameters such as the number of bits and of measurements fixed consequences of precise error prediction theorem shows that where is the solution to but only this time with measurement vector ylin σµ where as in and has entries standard normal thus lower values of the ratio correspond to lower values of the error and the design problem posed in is equivalent to the following simplified one arg min to be explicit µrand above can be easily expressed from after setting qq as follows and where ti and exp du an algorithm for finding optimal quantization levels and thresholds in contrast to the initial problem in the optimization involved in is explicit in terms of the variables and but is still hard to solve in general interestingly we show in appendix that the popular lm algorithm can be an effective algorithm for solving since the values to which it converges are stationary points of the objective in note that this is not directly obvious result since the classical objective of the lm algorithm is minimizing the quantity ky rather than references francis bach structured norms through submodular functions in advances in neural information processing systems pages mohsen bayati and andrea montanari the lasso risk for gaussian matrices information theory ieee transactions on alexandre belloni victor chernozhukov and lie wang lasso pivotal recovery of sparse signals via conic programming biometrika david brillinger the identification of particular nonlinear time series system biometrika david brillinger generalized linear model with gaussian regressor variables festschrift for erich lehmann page venkat chandrasekaran benjamin recht pablo parrilo and alan willsky the convex geometry of linear inverse problems foundations of computational mathematics david donoho and iain johnstone minimax risk overl forl probability theory and related fields david donoho lain johnstone and andrea montanari accurate prediction of phase transitions in compressed sensing via connection to minimax denoising ieee transactions on information theory david donoho arian maleki and andrea montanari algorithms for compressed sensing proceedings of the national academy of sciences david donoho arian maleki and andrea montanari the phase transition in compressed sensing information theory ieee transactions on alexandra garnham and luke prendergast note on least squares sensitivity in model estimation and the benefits of response transformations electronic 
of statistics yehoram gordon on milman inequality and random subspaces which escape through mesh in rn springer marwa el halabi and volkan cevher totally unimodular view of structured sparsity arxiv preprint hidehiko ichimura semiparametric least squares sls and weighted sls estimation of models journal of econometrics li and naihua duan regression analysis under link violation the annals of statistics pages samet oymak christos thrampoulidis and babak hassibi the of generalized lasso precise analysis arxiv preprint yaniv plan and roman vershynin the generalized lasso with observations arxiv preprint mihailo stojnic framework to characterize performance of lasso algorithms arxiv preprint christos thrampoulidis samet oymak and babak hassibi regularized linear regression precise analysis of the estimation error in proceedings of the conference on learning theory pages christos thrampoulidis ashkan panahi daniel guo and babak hassibi precise error analysis of the lasso in acoustics speech and signal processing icassp ieee international conference on pages christos thrampoulidis ashkan panahi and babak hassibi asymptotically exact error analysis for the generalized in information theory isit ieee international symposium on pages ieee robert tibshirani regression shrinkage and selection via the lasso journal of the royal statistical society series methodological pages robert tibshirani michael saunders saharon rosset ji zhu and keith knight sparsity and smoothness via the fused lasso journal of the royal statistical society series statistical methodology xinyang yi zhaoran wang constantine caramanis and han liu optimal linear estimation under unknown nonlinear transform arxiv preprint ming yuan and yi lin model selection and estimation in regression with grouped variables journal of the royal statistical society series statistical methodology 
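Looking back at the quantizer design discussion above: the Lloyd-Max iteration mentioned there, specialized to a standard normal source, is sketched below. This is a generic textbook version of the iteration, not the authors' code, and the function names are ours; the point made in the text is that its fixed points are also stationary points of the σ/μ objective that actually governs the LASSO error.

```python
import numpy as np
from scipy.stats import norm

def lloyd_max_gaussian(n_levels, iters=200):
    """Classical Lloyd-Max iteration for a standard normal source: alternately
    set each level to the conditional mean of its cell and each threshold to
    the midpoint of adjacent levels."""
    # initial levels: equally spaced quantiles of N(0, 1)
    c = norm.ppf((np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        t = np.concatenate(([-np.inf], (c[:-1] + c[1:]) / 2, [np.inf]))
        # conditional mean of N(0,1) on (t_k, t_{k+1}]:
        # (phi(t_k) - phi(t_{k+1})) / (Phi(t_{k+1}) - Phi(t_k))
        num = norm.pdf(t[:-1]) - norm.pdf(t[1:])
        den = norm.cdf(t[1:]) - norm.cdf(t[:-1])
        c = num / den
    return t, c

def quantize(x, t, c):
    """Map each sample to the representation level of its decision cell."""
    return c[np.searchsorted(t[1:-1], x)]

thresholds, levels = lloyd_max_gaussian(4)   # a 2-bit quantizer, for illustration
```

Whether these MSE-optimal thresholds and levels also minimize the estimation error of the generalized LASSO is exactly the question analyzed in the section above; the claim there is that they are stationary points of the relevant objective, not that the two problems coincide in general.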
from random walks to distances on unweighted graphs tatsunori hashimoto mit eecs thashim yi sun mit mathematics yisun tommi jaakkola mit eecs tommi abstract large unweighted directed graphs are commonly used to capture relations between entities fundamental problem in the analysis of such networks is to properly define the similarity or dissimilarity between any two vertices despite the significance of this problem statistical characterization of the proposed metrics has been limited we introduce and develop class of techniques for analyzing random walks on graphs using stochastic calculus using these techniques we generalize results on the degeneracy of hitting times and analyze metric based on the laplace transformed hitting time ltht the metric serves as natural provably alternative to the expected hitting time we establish general correspondence between hitting times of the brownian motion and analogous hitting times on the graph we show that the ltht is consistent with respect to the underlying metric of geometric graph preserves clustering tendency and remains robust against random addition of edges tests on simulated and data show that the ltht matches theoretical predictions and outperforms alternatives introduction many network metrics have been introduced to measure the similarity between any two vertices such metrics can be used for variety of purposes including uncovering missing edges or pruning spurious ones since the metrics tacitly assume that vertices lie in latent metric space one could expect that they also recover the underlying metric in some limit surprisingly there are nearly no known results on this type of consistency indeed it was recently shown that the expected hitting time degenerates and does not measure any notion of distance we analyze an improved metric laplace transformed hitting time ltht and rigorously evaluate its consistency and robustness under general network model which encapsulates the latent space assumption this network model specified in section posits that vertices lie in latent metric space and edges are drawn between nearby vertices in that space to analyze the ltht we develop two key technical tools we establish correspondence between functionals of hitting time for random walks on graphs on the one hand and limiting itô processes corollary on the other moreover we construct weighted random walk on the graph whose limit is brownian motion corollary we apply these tools to obtain three main results first our theorem recapitulates and generalizes the result of pertaining to degeneration of expected hitting time in the limit our proof is direct and demonstrates the broader applicability of the techniques to general random walk based algorithms second we analyze the laplace transformed hitting time as family of improved distance estimators based on random walks on the graph we prove that there exists scaling limit for the parameter such that the ltht can become the shortest path distance theorem or consistent metric estimator averaging over many paths theorem finally we prove that the ltht captures the advantages of based metrics by respecting the cluster structure theorem and robustly recovering similarity queries when the majority of edges carry no geometric information theorem we now discuss the relation of our work to prior work on similarity estimation metrics there is growing literature on graph metrics that attempts to correct the degeneracy of expected hitting time by interpolating between expected hitting time and shortest path distance 
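To make the LTHT introduced above concrete before the formal development: on a finite graph it can be computed exactly with one linear solve per target vertex (the closed-form matrix computation alluded to later in the paper). The sketch below is ours; the toy adjacency matrix and the choice of β are arbitrary, and the final conversion of the negative log into a distance estimate follows the scaling defined later in the paper.

```python
import numpy as np

def ltht(P, j, beta):
    """Laplace transformed hitting time E[exp(-beta * T_{i -> j})] for every
    start vertex i, for a random walk with row-stochastic transition matrix P.
    Computed exactly by solving the absorbing-chain linear system."""
    n = P.shape[0]
    keep = np.arange(n) != j                     # all vertices except the target
    Q = P[np.ix_(keep, keep)]                    # transitions among non-target vertices
    r = P[keep, j]                               # one-step probabilities into the target
    u = np.linalg.solve(np.eye(n - 1) - np.exp(-beta) * Q, np.exp(-beta) * r)
    out = np.ones(n)                             # at the target itself, T = 0
    out[keep] = u
    return out

# toy usage on a small undirected graph (illustrative adjacency matrix, ours)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)             # simple random walk
u = ltht(P, j=3, beta=0.5)
d = -np.log(u)                                   # the paper's estimator applies a -log and a scaling
```

The same quantity can also be estimated by sampling random walks, which is the route taken when a full linear solve is too expensive.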
the work closest to ours is the analysis of the phase transition of the metric in which proves that are nondegenerate for some however their work did not address consistency or bias of other approaches to metrics such as distributed routing distances truncated hitting times and randomized shortest paths exist but their statistical properties are unknown our paper is the first to prove consistency properties of metric nonparametric statistics in the nonparametric statistics literature the behavior of neighbor and graphs has been the focus of extensive study for undirected graphs techniques have yielded consistency for clusters and shortest paths as well as the degeneracy of expected hitting time algorithms for exactly embedding neighbor graphs are similar and generate metric estimates but require knowledge of the graph construction method and their consistency properties are unknown stochastic differential equation techniques similar to ours were applied to prove laplacian convergence results in while the convergence was exploited in our work advances the techniques of by extracting more robust estimators from information network analysis the task of predicting missing links in graph known as link prediction is one of the most popular uses of similarity estimation the survey compares several common link prediction methods on synthetic benchmarks the consistency of some local similarity metrics such as the number of shared neighbors was analyzed under single generative model for graphs in our results extend this analysis to global metric under weaker model assumptions continuum limits of random walks on networks definition of spatial graph we take generative approach to defining similarity between vertices we suppose that each vertex of graph is associated with latent coordinate xi rd and that the probability of finding an edge between two vertices depends solely on their latent coordinates in this model given only the unweighted edge connectivity of graph we define natural distances between vertices as the distances between the latent coordinates xi formally let rd be an infinite sequence of points drawn from differentiable density with bounded log gradient with compact support spatial graph is defined by the following definition spatial graph let εn xn be local scale function and piecewise continuous function with for and at the spatial graph gn corresponding to εn and is the random graph with vertex set xn and directed edge from xi to xj with probability pij xj xi this graph was proposed in as the generalization of neighbors to isotropic kernels to make inference tractable we focus on the limit as and εn in particular we will suppose that there exist scaling constants gn and deterministic continuous function so that gn gn log εn for xn where the final convergence is uniform in and in the draw of the scaling constant gn represents bound on the asymptotic sparsity of the graph we give few concrete examples to make the quantities gn and εn clear the directed neighbor graph is defined by setting the indicator function of the unit interval εn the distance to the th nearest neighbor and gn the rate at which εn approaches zero gaussian kernel graph is approximated by setting exp the truncation of the gaussian tails at is an analytic convenience rather than fundamental limitation and the bandwidth can be varied by rescaling εn continuum limit of the random walk our techniques rely on analysis of the limiting behavior of the simple random walk xtn on spatial graph gn viewed as markov process with 
domain the increment at step of xtn is jump to random point in xn which lies within the ball of radius εn xtn around xtn we observe three effects the random walk jumps more frequently towards regions of high density the random walk moves more quickly whenever εn xtn is large for εn small and large step count the random variable xtn is the sum of many small independent but not necessarily identically distributed increments in the limit we may identify xtn with stochastic process satisfying and via the following result which is slight strengthening of theorem obtained by applying theorem in place of the original result of theorem the simple random walk xtn converges uniformly in skorokhod space after time scaling to the itô process ybt valued in the space of continuous functions with reflecting boundary conditions on defined by dybt log ybt ybt ybt effects and may be seen in the stochastic differential equation as follows the direction of the drift is controlled by log ybt the rate of drift is controlled by and the noise is driven by brownian motion wbt with scaling ybt we view theorem as method to understand the simple random walk xtn through the continuous walk ybt attributes of stochastic processes such as stationary distribution or hitting time may be defined for both ybt and xtn and in many cases theorem implies that an version of the discrete attribute will converge to the continuous one because attributes of the continuous process ybt can reveal information about proximity between points this provides general framework for inference in spatial graphs we use hitting times of the continuous process to domain to prove properties of the hitting time of simple random walk on graph via the limit arguments of theorem degeneracy of expected hitting times in networks the hitting time commute time and resistance distance are popular measures of distance based upon the random walk which are believed to be robust and capture the cluster structure of the network however it was shown in surprising result in that on undirected geometric graphs the scaled expected hitting time from xi to xj converges to inverse of the degree of xj in theorem we give an intuitive explanation and generalization of this result by showing that if the random walk on graph converges to any limiting itô process in dimension the scaled expected hitting time to any point converges to the inverse of the stationary distribution this answers the open problem in on the degeneracy of hitting times for directed graphs and graphs with general degree distributions such as directed neighbor graphs lattices and graphs with convergent random walks our proof can be understood as first extending the transience or neighborhood recurrence of brownian motion for to more general itô processes and then connecting hitting times on graphs to their itô process equivalents typical hitting times are large we will prove the following lemma that hitting given vertex quickly is unlikely let txxji be the hitting time to xj of xtn started at xi and texi be the continuous equivalent for ybt to hit both the variance εn and expected value log εn of single step in the simple random walk are the time scaling in theorem was chosen so that as there are discrete steps taken per unit time meaning the total drift and variance per unit time tend to limit lemma typical hitting times are large for any and for large enough we have txxji to prove lemma we require the following tail bound following from the theorem theorem exercise for the laplace transform the laplace 
transform of the hitting time ltht exp is the solution to the boundary value problem with boundary condition tr βu this will allow us to bound the hitting time to the ball xj of radius centered at xj lemma for and any there exists such that proof we compare the laplace transformed hitting time of the general itô process to that of brownian motion via and handle the latter case directly details are in section we now use lemma to prove lemma xi for proof of lemma our proof proceeds in two steps first we have txxji tb any because xj xj so by theorem we have xi lim gn lim xi for some for large enough this applying lemma we have xi combined with implies txj cgn and hence txxji expected hitting times degenerate to the stationary distribution to translate results from itô processes to directed graphs we require regularity condition let qt xj xi denote the probability that xtn xj conditioned on xi we make the following technical conjecture which we assume holds for all spatial graphs for the rescaled marginal nqt xi is eventually uniformly let πx denote the stationary distribution of xtn the following was shown in theorem under conditions implied by our condition corollary theorem assuming for dx we have the limit lim nπx we may now express the limit of expected hitting time in terms of this result theorem for and any we have txxji xj proof we give sketch by lemma the random walk started at xi does not hit xj within steps with high probability by theorem the simple random walk xtn mixes at exponential rate implying in lemma that the probability of first hitting at step is approximately the stationary distribution at xj expected hitting time is then shown to approximate the expectation of geometric random variable see section for full proof theorem is illustrated in figures and which show with only points expected hitting times on neighbor graph degenerates to the stationary distribution assumption is related to smoothing properties of the graph laplacian and is known to hold for undirected graphs no directed analogue is known and conjectured weaker property for all spatial graphs see section for further details surprisingly proved that hitting times diverge despite convergence of the continuous equivalent this occurs because the discrete walk can jump past the target point in section we consider hitting figure estimated distance from orange starting point on neighbor graph constructed on two clusters and show degeneracy of hitting times theorem and show that interpolate between hitting time and shortest path the laplace transformed hitting time ltht in theorem we showed that expected hitting time is degenerate because simple random walk mixes before hitting its target to correct this we penalize longer paths more precisely consider for the laplace transforms and te of and βb and βn βg these laplace transformed hitting times ltht have three advantages first while the expected hitting time of brownian motion to domain is dominated by long paths the ltht is dominated by direct paths second the ltht for the itô process can be derived in closed form via the feynmankac theorem allowing us to make use of techniques from continuous stochastic processes to control the continuum ltht lastly the ltht can be computed both by sampling and in closed form as matrix inversion section now define the scaled as xi log txj gn taking different scalings for βn with interpolates between expected hitting time βn on fixed graph and shortest path distance βn figures and in theorem we show yields consistent distance measure 
retaining the unique that the intermediate scaling βn βg properties of hitting times most of our results on the ltht are novel for any metric while considering the laplace transform of the hitting time is novel to our work this metric has been used in the literature in an manner in various forms as similarity metric for collaboration networks hidden subgraph detection and robust shortest path distance however these papers only considered the elementary properties of the limits βn and βn our consistency proof demonstrates the advantage of the stochastic process approach consistency it was shown previously that for fixed and βn log txxji gn converges to shortest path distance from xi to xj we investigate more precise behavior in terms of the scaling of βn there are two regimes if βn log gnd then the shortest path dominates and the ltht the graph converges to shortest path distance see theorem if βn βg converges to its continuous equivalent which for large averages over random walks concentrated we proceed in three steps we around the geodesic to show consistency for βn βg reweight the random walk on the graph so the limiting process is brownian motion we show that for brownian motion recovers latent distance we show that for the reweighted walk converges to its continuous limit we conclude that of the reweighted walk recovers latent distance reweighting the random walk to converge to brownian motion we define weights using the estimators pb and εb for and from times to small out neighbors which corrects this problem and derive closed form solutions theorem this hitting time is but highly biased due to boundary terms corollary theorem let pb and εb be consistent estimators of the density and local scale and be the defined below converges to brownian motion adjacency matrix then the random walk xj pai xi xk xj xt xi εb xi proof reweighting by pb and εb is designed to cancel the drift and diffusion terms in theorem by ensuring that as grows large jumps have means approaching and variances which are asymptotically equal but decaying with see theorem for brownian motion let wt be brownian motion with xi and let xi xj be the hitting time of wt to xj we show that converges to distance lemma for any if βb sα as we have xi log exp xj xj proof we consider hitting time of brownian motion started at distance xj from the origin to distance of the origin which is controlled by bessel process see subsection for details to compare continuous and discrete we convergence of ltht for βn βg will first define the of vertex xi on gn as the graph equivalent of the ball xi definition let εb be the consistent estimate of the local scale from so that εb uniformly as the of path xil is the sum xim of vertex weights εb xi for and gn the of is nbsn there is path of xi for xi xj gn let tbb be the hitting time of the transformed walk on gn from xi to nbsn xj we now verify that hitting times to the on graphs and the ball coincide xi xi corollary for we have tbnb xj proof we verify that the ball and the neighborhood have nearly identical sets of points and apply theorem see subsection for details proving consistency of properly accounting for boundary effects we obtain consistency result for the for small neighborhood hitting times theorem let xi xj gn be connected by geodesic not intersecting for any for large we have with high probability there exists choice of βb and so that if βn βg xj log exp tbxi nbn xj proof of theorem the proof has three steps first we convert to the continuous setting via corollary second we show 
the contribution of the boundary is negligible the conclusion follows from the explicit computation of lemma full details are in section the stochastic process limit based proof of theorem implies that the is consistent and robust to small perturbations to the graph which preserve the same limit supp section this is special case of more general theorem for transforming limits of graph random walks theorem figure shows that this modification is highly effective in practice bias random walk based metrics are often motivated as recovering cluster preserving metric we now show that the of the simple random walk preserves the underlying cluster structure in the case we provide complete characterization be theorem suppose the spatial graph has and let xi gn nbn xj the hitting time of simple random walk from xi to the of xj it converges to xj log xi dx log gn nbn where xj log xi log defines metric proof apply the wkbj approximation for schrodinger equations to the pde from theorem see corollary and corollary for full proof the leading order terms of the metric appropriately penalize crossing regions of large changes to the log density this is not the case for the expected hitting time theorem robustness while shortest path distance is consistent measure of the underlying metric it breaks down catastrophically with the addition of single edge and does not meaningfully rank vertices that share an edge in contrast we show that ltht breaks ties between vertices via the resource allocation ra index robust local similarity metric under noise definition the noisy spatial graph gn over xn with noise terms qn is constructed by drawing an edge from xi to xj with probability pij xj xi qj qj define the directed ra index in terms of the set nbn xi and the ts set nbin xk and two step by mij xi as rij xk xi xj xi xi log exp txj we show two step and ra index give equivalent methods for testing if vertices are within distance εn theorem if log gnd and xi and xj have at least one common neighbor then ts mij log rij log xi proof let pij be the probability of going from xi to xj in steps and hij the probability of not hitting before time factoring the hitting time yields pij ts hij mij log pij log ij let kmax be the maximal in gn the contribution of paths of length greater than vanishes because hij and pij kmax which is dominated by for rij log noting that pij xi concludes for full details see theorem for edge identification within distance εn the ra index is robust even at noise level gn modifying the graph by changing fewer than edges does not affect the continuum limit of the random graph and therefore preserve the ltht with parameter while this weak bound allows on average noise edges per vertex it does show that the ltht is substantially more robust than shortest paths without modification see section for proofs the conditioning txxji is natural in tasks where only pairs of disconnected vertices are queried empirically we observe it is critical to performance figure figure the ltht recovered deleted edges most consistently on citation network figure the ltht defined above theorem outperforms others at word similarity estimation including the basic theorem if qi gn for all for any there are and hn so that for any with probability at least we have xj min εn xi εn xj if rij hn xj max εn xi εn xj if rij hn proof the minimal ra index follows from standard concentration arguments see link prediction tasks we compare the ltht against other baseline measures of vertex similarity shortest path distance expected hitting 
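Since the resource allocation (RA) index defined above is used both in the tie-breaking theorem and as a baseline in the experiments that follow, a small sketch of it is included here. Normalizing by the out-degree of the common neighbor is the usual RA convention and is an assumption on our part, since the precise directed variant is hard to read from the extracted display.

def directed_ra_index(out_nbrs, i, j):
    # Directed resource-allocation score between vertices i and j: sum over
    # common neighbors k (an edge i -> k and an edge k -> j) of 1 / out-degree(k).
    score = 0.0
    for k in out_nbrs[i]:
        if j in out_nbrs[k]:
            score += 1.0 / len(out_nbrs[k])
    return score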
time number of common neighbors and the ra index comprehensive evaluation of these metrics was performed in who showed that metric equivalent to the ltht performed best we consider two separate link prediction tasks on the largest connected component of vertices of degree at least five fixing the degree constraint is to ensure that local methods using number of common neighbors such as the resource allocation index do not have an excessive number of ties code to generate figures in this paper are contained in the supplement citation network the kdd challenge dataset includes directed unweighted network of arxiv citations whose dense connected component has vertices and edges we use the same benchmark method as where we delete single edge and compare the similarity of the deleted edge against the set of control pair of vertices which do not share an edge we count the fraction of pairs on which each method rank the deleted edge higher than all other methods we find that ltht is consistently best at this task figure associative thesaurus network the edinburgh associative thesaurus is network with dense connected component of vertices and edges in which subjects were shown set of ten words and for each word was asked to respond with the first word to occur to them each vertex represents word and each edge is weighted directed edge where the weight from xi to xj is the number of subjects who responded with word xj given word xi we measure performance by whether strong associations with more than ten responses can be distinguished from weak ones with only one response we find that the ltht performs best and that preventing jumps is critical to performance as predicted by theorem figure conclusion our work has developed an asymptotic equivalence between hitting times for random walks on graphs and those for diffusion processes using this we have provided short extension of the proof for the divergence of expected hitting times and derived new consistent graph metric that is theoretically principled computationally tractable and empirically successful at link prediction benchmarks these results open the way for the development of other principled metrics that can provably recover underlying latent similarities for spatial graphs results are qualitatively identical when varying from to see supplement for details the ltht is not shown since it is equivalent to the ltht in missing link prediction references alamgir and von luxburg phase transition in the family of in advances in neural information processing systems pages alamgir and von luxburg shortest path distance in random neighbor graphs in proceedings of the international conference on machine learning pages chebotarev class of distances generalizing the and the resistance distances discrete applied mathematics croydon and hambly local limit theorems for sequences of simple random walks on graphs potential analysis gehrke ginsparg and kleinberg overview of the kdd cup acm sigkdd explorations newsletter hashimoto sun and jaakkola metric recovery from directed unweighted graphs in proceedings of the eighteenth international conference on artificial intelligence and statistics pages kiss armstrong milroy and piper an associative thesaurus of english and its computer analysis the computer and literary studies pages kivimäki shimbo and saerens developments in the theory of randomized shortest paths with comparison of graph node distances physica statistical mechanics and its applications lü and zhou link prediction in complex networks survey 
physica statistical mechanics and its applications øksendal stochastic differential equations an introduction with applications universitext springerverlag berlin sixth edition sarkar chakrabarti and moore theoretical justification of popular link prediction heuristics in ijcai joint conference on artificial intelligence volume page sarkar and moore tractable approach to finding closest neighbors in large graphs in in proc uai shaw and jebara structure preserving embedding in proceedings of the annual international conference on machine learning pages acm smith kao senne bernstein and philips bayesian discovery of threat networks ieee transactions on signal processing stroock and varadhan multidimensional diffussion processes volume springer science business media and jadbabaie family of distributed consensus algorithms with boundary from shortest paths to mean hitting times in decision and control ieee conference on pages ieee ting huang and jordan an analysis of the convergence of graph laplacians in proceedings of the international conference on machine learning pages von luxburg belkin and bousquet consistency of spectral clustering the annals of statistics pages von luxburg radl and hein hitting and commute times in large random neighborhood graphs journal of machine learning research yazdani similarity learning over large collaborative networks phd thesis école polytechnique fédérale de lausanne yen saerens mantrach and shimbo family of dissimilarity measures between nodes generalizing both the and the distances in proceedings of the acm sigkdd international conference on knowledge discovery and data mining pages acm 
bayesian dark knowledge anoop korattikara vivek rathod kevin murphy google research kbanoop rathodv kpmurphy max welling university of amsterdam abstract we consider the problem of bayesian parameter estimation for deep neural networks which is important in problem settings where we may have little data or where we need accurate posterior predictive densities for applications involving bandits or active learning one simple approach to this is to use online monte carlo methods such as sgld stochastic gradient langevin dynamics unfortunately such method needs to store many copies of the parameters which wastes memory and needs to make predictions using many versions of the model which wastes time we describe method for distilling monte carlo approximation to the posterior predictive density into more compact form namely single deep neural network we compare to two very recent approaches to bayesian neural networks namely an approach based on expectation propagation and an approach based on variational bayes our method performs better than both of these is much simpler to implement and uses less computation at test time introduction deep neural networks dnns have recently been achieving state of the art results in many fields however their predictions are often over confident which is problem in applications such as active learning reinforcement learning including bandits and classifier fusion which all rely on good estimates of uncertainty principled way to tackle this problem is to use bayesian inference specifically we first comqn pute the posterior distribution over the model parameters yi where dn xi yi xi is the th input where is the number of features and ryi is the th output then we compute the posterior predictive distribution dn dθ for each test point for reasons of computational speed it is common to approximate the posterior distribution by point estimate such as the map estimate argmax when is large we often use stochastic gradient descent sgd to compute finally we make approximation to the predictive distribution dn unfortunately this loses most of the benefits of the bayesian approach since uncertainty in the parameters which induces uncertainty in the predictions is ignored various ways of more accurately approximating and hence dn have been developed recently proposed method called probabilistic backpropagation pbp based on an online version of expectation propagation ep using repeated assumed density filtering adf where the posterior is approximated as product of univariate gaussians one per parameter θi vi an alternative to ep is variational bayes vb where we optimize lower bound on the marginal likelihood presented biased monte carlo estimate of this lower bound and applies his method called variational inference vi to infer the neural network weights more recently proposed an approach called bayes by backprop bbb which extends the vi method with an unbiased mc estimate of the lower bound based on the reparameterization trick of in both and the posterior is approximated by product of univariate gaussians although ep and vb scale well with data size since they use online learning there are several problems with these methods they can give poor approximations when the posterior does not factorize or if it has or skew at test time computing the predictive density dn can be much slower than using the approximation because of the need to integrate out the parameters they need to use double the memory of standard method to store the mean and variance of each parameter which can 
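For readability, the quantities discussed above are, in standard notation (a reconstruction of the garbled displays in this extracted text, not a new derivation):

\[
p(\theta \mid \mathcal{D}_N) \;\propto\; p(\theta)\prod_{i=1}^{N} p(y_i \mid x_i, \theta),
\qquad
p(y \mid x, \mathcal{D}_N) \;=\; \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D}_N)\, d\theta ,
\]

with the plug-in approximation replacing the integral by \(p(y \mid x, \hat{\theta})\), where \(\hat{\theta} = \arg\max_{\theta} p(\theta \mid \mathcal{D}_N)\).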
be problematic in settings such as mobile phones they can be quite complicated to derive and implement common alternative to ep and vb is to use mcmc methods to approximate traditional mcmc methods are batch algorithms that scale poorly with dataset size however recently method called stochastic gradient langevin dynamics sgld has been devised that can draw samples approximately from the posterior in an online fashion just as sgd updates point estimate of the parameters online furthermore various extensions of sgld have been proposed including stochastic gradient hybrid monte carlo sghmc stochastic gradient thermostat which improves upon sghmc stochastic gradient fisher scoring sgfs which uses second order information stochastic gradient riemannian langevin dynamics distributed sgld etc however in this paper we will just use vanilla sgld all these mcmc methods whether batch or online produce monte carlo approximation to the ps posterior θs where is the number of samples such an approximation can be more accurate than that produced by ep or vb and the method is much easier to implement for sgld you essentially just add gaussian noise to your sgd updates however at test time things are times slower than using estimate since we need to compute ps θs and the memory requirements are times bigger since we need to store the θs for our largest experiment our dnn has parameters so we can only afford to store single sample in this paper we propose to train parametric model to approximate the monte carlo posterior predictive distribution in order to gain the benefits of the bayesian approach while only using the same run time cost as the plugin method following we call the teacher and the student we use to estimate and hence online we simultaneously train the student online to minimize kl we give the details in section similar ideas have been proposed in the past in particular also trained parametric student model to approximate monte carlo teacher however they used batch training and they used mixture models for the student by contrast we use online training and can thus handle larger datasets and use deep neural networks for the student also trained student neural network to emulate the predictions of larger teacher network process they call distillation extending earlier work of which approximated an ensemble of classifiers by single one the key difference from our work is that our teacher is generated using mcmc and our goal is not just to improve classification accuracy but also to get reliable probabilistic predictions especially away from the training data coined the term dark knowledge to represent the information which is hidden inside the teacher network and which can then be distilled into the student we therefore call our approach bayesian dark knowledge we did some preliminary experiments with for fitting an mlp to mnist data but the results were not much better than sgld note that sgld is an approximate sampling algorithm and introduces slight bias in the predictions of the teacher and student network if required we can replace sgld with an exact mcmc method hmc to get more accurate results at the expense of more training time in summary our contributions are as follows first we show how to combine online mcmc methods with model distillation in order to get simple scalable approach to bayesian inference of the parameters of neural networks and other kinds of models second we show that our probabilistic predictions lead to improved log likelihood scores on the test set compared to sgd 
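Because the text stresses that SGLD amounts to adding Gaussian noise to ordinary SGD updates, a minimal sketch of one vanilla SGLD step is given below; the gradient arguments are callables standing in for backpropagation through the network, and their signatures are our own illustrative choice rather than the authors' code.

import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik_minibatch, N, m, eta):
    # One vanilla SGLD update: a stochastic-gradient step on the log posterior,
    # with the minibatch likelihood gradient rescaled by N/m, plus Gaussian
    # noise of variance eta (the step size).
    grad = grad_log_prior(theta) + (N / m) * grad_log_lik_minibatch(theta)
    noise = np.random.normal(0.0, np.sqrt(eta), size=theta.shape)
    return theta + 0.5 * eta * grad + noise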
and the recently proposed ep and vb approaches methods our goal is to train student neural network snn to approximate the bayesian predictive distribution of the teacher which is monte carlo ensemble of teacher neural networks tnn if we denote the predictions of the teacher by dn and the parameters of the student network by our objective becomes kl dn dn log const dθ log dy log dy dθ ep log dθ unfortunately computing this integral is not analytically tractable however we can approximate this by monte carlo ep θs log where is set of samples from to make this function just of we need to integrate out for this we need dataset to train the student network on which we will denote by note that points in this dataset do not need ground truth labels instead the labels which will be probability distributions will be provided by the teacher the choice of student data controls the domain over which the student will make accurate predictions for low dimensional problems such as in section we can uniformly sample the input domain for higher dimensional problems we can sample near the training data for example by perturbing the inputs slightly in any case we will compute monte carlo approximation to the loss as follows dx ep θs log it can take lot of memory to and store the set of parameter samples and the set of data samples so in practice we use the stochastic algorithm shown in algorithm which uses single posterior sample θs and minibatch of at each step the and from algorithm control the strength of the priors for the teacher and student networks we use simple spherical gaussian priors equivalent to regularization we set the precision strength of these gaussian priors by typically since the student gets to see more data than the teacher this is true for two reasons first the teacher is trained to predict single label per input whereas the student is trained to predict distribution which contains more information as argued in second the teacher makes multiple passes over the same training data whereas the student sees fresh randomly generated data at each step classification for classification problems each teacher network θs models the observations using standard softmax model θs we want to approximate this using student network which also has algorithm distilled sgld input dn xi yi minibatch size number of iterations teacher learning schedule ηt student learning schedule ρt teacher prior student prior for do train teacher sgld step sample minibatch indices of size sample zt ηt update θt log log yi zt train student sgd step sample of from student data generator wt ρt γwt softmax output hence from eqn our loss function estimate is the standard cross entropy loss θs log the student network outputs βk log to estimate the gradient we just have to compute the gradients and through the network these gradients are given by θs regression in regression the observations are modeled as yi yi xi where is the prediction of the tnn and λn is the noise precision we want to approximate the predictive distribution as dn eα we will train student network to output the parameters of the approximating distribution and note that this is twice the number of outputs of the teacher network since we want to capture the data dependent we use eα instead of directly predicting the variance to avoid dealing with positivity constraints during training to train the snn we will minimize the objective defined in eqn θs log eα ep θs λn now to estimate θs we just have to compute through the network these gradients are and and back 
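As a self-contained toy illustration of the distilled SGLD loop just described, the sketch below substitutes logistic regression for both the teacher and the student networks; the data, all hyperparameter values, and the Gaussian perturbation used as the student data generator are illustrative assumptions, and the real method uses MLPs as described in the experiments.

import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data in 2 dimensions (a stand-in for the image data in the text).
N, D = 500, 2
X = rng.normal(size=(N, D))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=N) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(D)          # teacher parameters (current SGLD posterior sample)
w = np.zeros(D)              # student parameters
lam, gamma = 1.0, 1e-3       # teacher prior precision, student weight decay
M, T = 32, 5000              # minibatch size, number of iterations

for t in range(T):
    eta = 1e-3 / (1 + t) ** 0.55                     # teacher step size schedule
    rho = 0.05                                       # student step size
    # teacher: one SGLD step on a labeled minibatch
    idx = rng.integers(0, N, size=M)
    p = sigmoid(X[idx] @ theta)
    grad = -lam * theta + (N / M) * X[idx].T @ (y[idx] - p)
    theta = theta + 0.5 * eta * grad + rng.normal(0.0, np.sqrt(eta), size=D)
    # student: one SGD step towards the teacher's soft predictions at perturbed inputs
    xs = X[rng.integers(0, N, size=M)] + 0.1 * rng.normal(size=(M, D))
    soft = sigmoid(xs @ theta)                       # teacher's predictive probability
    q = sigmoid(xs @ w)
    w = w - rho * (xs.T @ (q - soft) / M + gamma * w)

print("distilled student weights:", w)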
propagate θs θs λn experimental results in this section we compare sgld and distilled sgld with other approximate inference methods including the plugin approximation using sgd the pbp approach of the bbb approach of this is not necessary in the classification case since the softmax distribution already captures uncertainty dataset toyclass mnist toyreg boston housing pbp bbb hmc table summary of our experimental configurations figure posterior predictive density for various methods on the toy dataset sgd plugin using the network hmc using samples sgld using samples distilled sgld using student network with the following architectures and and hamiltonian monte carlo hmc which is considered the gold standard for mcmc for neural nets we implemented sgd and sgld using the torch library for hmc we used stan we perform this comparison for various classification and regression problems as summarized in table toy classification problem we start with toy binary classification problem in order to visually illustrate the performance of different methods we generate synthetic dataset in dimensions with classes points per class we then fit multi layer perceptron mlp with one hidden layer of relu units and softmax outputs denoted using sgd the resulting predictions are shown in figure we see the expected sigmoidal probability ramp orthogonal to the linear decision boundary unfortunately this method predicts label of or with very high confidence even for points that are far from the training data in the top left and bottom right corners in figure we show the result of hmc using samples this is the true posterior predictive density which we wish to approximate in figure we show the result of sgld using about samples specifically we generate samples discard the first for burnin and then keep every th sample we see that this is good approximation to the hmc distribution in figures we show the results of approximating the sgld monte carlo predictive distribution with single student mlp of various sizes to train this student network we sampled points at random from the domain of the input this encourages the student to predict accurately at all locations including those far from the training data in the student has the same ideally we would apply all methods to all datasets to enable proper comparison unfortunately this was not possible for various reasons first the open source code for the ep approach only supports regression so we could not evaluate this on classification problems second we were not able to run the bbb code so we just quote performance numbers from their paper third hmc is too slow to run on large problems so we just applied it to the small toy problems nevertheless our experiments show that our methods compare favorably to these other methods model sgd sgld distilled distilled distilled num params kl table kl divergence on the classification dataset sgd dropout bbb sgd our impl sgld dist sgld table test set misclassification rate on mnist for different methods using mlp sgd first column dropout and bbb numbers are quoted from for our implmentation of sgd fourth column sgld and distilled sgld we report the mean misclassification rate over runs and its standard error size as the teacher but this is too simple model to capture the complexity of the predictive distribution which is an average over models in the student has larger hidden layer this works better however we get best results using two hidden layer model as shown in in table we show the kl divergence between the hmc distribution 
which we consider as ground truth and the various approximations mentioned above we computed this by comparing the probability distributions pointwise on grid the numbers match the qualitative results shown in figure mnist classification now we consider the mnist digit classification problem which has examples classes and features the only preprocessing we do is divide the pixel values by as in we train only on datapoints and use the remaining for tuning hyperparameters this means our results are not strictly comparable to lot of published work which uses the whole dataset for training however the difference is likely to be small following we use an mlp with hidden layers with hidden units per layer relu activations and softmax outputs we denote this by this model has parameters we first fit this model by sgd using these hyper parameters fixed learning rate of ηt prior precision minibatch size number of iterations as shown in table our final error rate on the test set is which is bit lower than the sgd number reported in perhaps due to the slightly different validation configuration next we fit this model by sgld using these hyper parameters fixed learning rate of ηt thinning interval burn in iterations prior precision minibatch size as shown in table our final error rate on the test set is about which is better than the sgd dropout and bbb results from finally we consider using distillation where the teacher is an sgld mc approximation of the posterior predictive we use the same architecture for the student as well as the teacher we generate data for the student by adding gaussian noise with standard deviation of to randomly sampled training we use constant learning rate of batch size of prior precision of for the student and train for iterations we obtain test error of which is very close to that obtained with sgld see table we only show the bbb results with the same gaussian prior that we use performance of bbb can be improved using other priors such as scale mixture of gaussians as shown in our approach could probably also benefit from such prior but we did not try this in the future we would like to consider more sophisticated data perturbations such as elastic distortions sgd sgld distilled sgld table log likelihood per test example on mnist we report the mean over trials one standard error method pbp as reported in vi as reported in sgd sgld sgld distilled avg test log likelihood table log likelihood per test example on the boston housing dataset we report the mean over trials one standard error we also report the average test of sgd sgld and distilled sgld in table the is equivalent to the logarithmic scoring rule used in assessing the calibration of probabilistic models the logarithmic rule is strictly proper scoring rule meaning that the score is uniquely maximized by predicting the true probabilities from table we see that both sgld and distilled sgld acheive higher scores than sgd and therefore produce better calibrated predictions note that the sgld results were obtained by averaging predictions from models sampled from the posterior whereas distillation produces single neural network that approximates the average prediction of these models distillation reduces both storage and test time costs of sgld by factor of without sacrificing much accuracy in terms of training time sgd took ms sgld took ms and distilled sgld took ms per iteration in terms of memory distilled sgld requires only twice as much as sgd or sgld during training and the same as sgd during testing toy 
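The two prediction paths being compared at test time, and the logarithmic scoring rule reported in the tables, can be sketched as follows; function names and input conventions are our own illustrative assumptions. The teacher averages class probabilities over the S stored SGLD samples, whereas the student is a single forward pass.

import numpy as np

def ensemble_predictive(prob_fns, x):
    # Teacher at test time: average the class-probability vectors produced by
    # the S stored posterior samples (prob_fns is a list of S callables, one
    # per sampled parameter vector theta_s).
    probs = [f(x) for f in prob_fns]
    return sum(probs) / len(probs)

def avg_test_log_likelihood(predict_proba, X_test, y_test):
    # Mean of log p(y | x) over the test set, i.e. the logarithmic scoring rule
    # used in the tables; predict_proba returns an (n_test, n_classes) array
    # and y_test holds integer class labels.
    p = predict_proba(X_test)
    return float(np.mean(np.log(p[np.arange(len(y_test)), y_test])))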
regression we start with toy regression problem in order to visually illustrate the performance of different methods we use the same data and model as in particular we use points in dimensions sampled from the function where we fit this data with an mlp with hidden units and relu activations for sgld we use samples for distillation the teacher uses the same architecture as the student the results are shown in figure we see that sgld is better approximation to the true hmc posterior predictive density than the plugin sgd approximation which has no predictive uncertainty and the vi approximation of finally we see that distilling sgld incurs little loss in accuracy but saves lot computationally boston housing finally we consider larger regression problem namely the boston housing dataset which was also used in this has data points training testing with dimensions since this data set is so small we repeated all experiments times using different test splits following we use an mlp with layer of hidden units and relu activations first we use sgd with these hyper minibatch size noise precision λn prior precision number of trials constant learning rate ηt number of iterations as shown in table we get an average log likelihood of next we fit the model using sgld we use an initial learning rate of which we reduce by factor of every iterations we use iterations burnin of and thinning we choose all using whereas performs posterior inference on the noise and prior precisions and uses bayesian optimization to choose the remaining figure predictive distribution for different methods on toy regression problem pbp of hmc vi method of sgd sgld distilled sgld error bars denote standard deviations figures kindly provided by the authors of we replace their term bp backprop with sgd to avoid confusion interval of as shown in table we get an average log likelihood of which is better than sgd finally we distill our sgld model the student architecture is the same as the teacher we use the following teacher hyper parameters prior precision initial learning rate of which we reduce by factor of every iterations for the student we use generated training data with gaussian noise with standard deviation we use prior precision of an initial learning rate of which we reduce by after every iterations as shown in table we get an average log likelihood of which is only slightly worse than sgld and much better than sgd furthermore both sgld and distilled sgld are better than the pbp method of and the vi method of conclusions and future work we have shown very simple method for being bayesian about neural networks and other kinds of models that seems to work better than recently proposed alternatives based on ep and vb there are various things we would like to do in the future show the utility of our model in an task where predictive uncertainty is useful such as with contextual bandits or active learning consider ways to reduce the variance of the algorithm perhaps by keeping running minibatch of parameters uniformly sampled from the posterior which can be done online using reservoir sampling exploring more intelligent data generation methods for training the student investigating if our method is able to reduce the prevalence of confident false predictions on adversarially generated examples such as those discussed in acknowledgements we thank miguel julien cornebise jonathan huang george papandreou sergio guadarrama and nick johnston references ahn korattikara and welling bayesian posterior sampling via stochastic gradient 
fisher scoring in icml sungjin ahn babak shahbaba and max welling distributed stochastic gradient mcmc in icml blundell cornebise kavukcuoglu and wierstra weight uncertainty in neural networks in icml cristian bucila rich caruana and alexandru model compression in kdd eric bickel some comparisons among quadratic spherical and logarithmic scoring rules decision analysis tianqi chen emily fox and carlos guestrin stochastic gradient hamiltonian monte carlo in icml ding fang babbush chen skeel and neven bayesian sampling using stochastic gradient thermostats in nips yarin gal and zoubin ghahramani dropout as bayesian approximation representing model uncertainty in deep learning june alex graves practical variational inference for neural networks in nips and adams probabilistic backpropagation for scalable learning of bayesian neural networks in icml geoffrey hinton oriol vinyals and jeff dean distilling the knowledge in neural network in nips deep learning workshop diederik kingma and max welling stochastic gradient vb and the variational in iclr radford neal mcmc using hamiltonian dynamics in handbook of markov chain monte carlo chapman and hall sam patterson and yee whye teh stochastic gradient riemannian langevin dynamics on the probability simplex in nips adriana romero nicolas ballas samira ebrahimi kahou antoine chassang carlo gatta and yoshua bengio fitnets hints for thin deep nets arxiv rezende mohamed and wierstra stochastic backpropagation and approximate inference in deep generative models in icml edward snelson and zoubin ghahramani compact approximations to bayesian predictive distributions in icml christian szegedy wojciech zaremba ilya sutskever joan bruna dumitru erhan ian goodfellow and rob fergus intriguing properties of neural networks in iclr max welling and yee teh bayesian learning via stochastic gradient langevin dynamics in icml 
matrix completion with noisy side information hsieh inderjit dhillon university of texas at austin university of california at davis kychiang inderjit chohsieh abstract we study the matrix completion problem with side information side information has been considered in several matrix completion applications and has been empirically shown to be useful in many cases recently researchers studied the effect of side information for matrix completion from theoretical viewpoint showing that sample complexity can be significantly reduced given completely clean features however since in reality most given features are noisy or only weakly informative the development of model to handle general feature set and investigation of how much noisy features can help matrix recovery remains an important issue in this paper we propose novel model that balances between features and observations simultaneously in order to leverage feature information yet be robust to feature noise moreover we study the effect of general features in theory and show that by using our model the sample complexity can be lower than matrix completion as long as features are sufficiently informative this result provides theoretical insight into the usefulness of general side information finally we consider synthetic data and two applications relationship prediction and semisupervised clustering and show that our model outperforms other methods for matrix completion that use features both in theory and practice introduction low rank matrix completion is an important topic in machine learning and has been successfully applied to many practical applications one promising direction in this area is to exploit the side information or features to help matrix completion tasks for example in the famous netflix problem besides rating history profile of users genre of movies might also be given and one could possibly leverage such side information for better prediction observing the fact that such additional features are usually available in real applications how to better incorporate features into matrix completion becomes an important problem with both theoretical and practical aspects several approaches have been proposed for matrix completion with side information and most of them empirically show that features are useful for certain applications however there is surprisingly little analysis on the effect of features for general matrix completion more recently jain and dhillon and xu et al provided guarantees on matrix completion with side information they showed that if perfect features are given under certain conditions one can substantially reduce the sample complexity by solving objective this result suggests that completely informative features are extremely powerful for matrix completion and the algorithm has been successfully applied in many applications however this model is still quite restrictive since if features are not perfect it fails to guarantee recoverability and could even suffer poor performance in practice more general model with recovery analysis to handle noisy features is thus desired in this paper we study the matrix completion problem with general side information we propose dirty statistical model which balances between feature and observation information simultaneously to complete matrix as result our model can leverage feature information yet is robust to noisy features furthermore we provide theoretical foundation to show the effectiveness of our model we formally quantify the quality of features and show that 
the sample complexity of our model depends on feature quality two noticeable results could thus be inferred first unlike given any feature set our model is guaranteed to achieve recovery with at most samples in manner where is the dimensionality of the matrix second if features are reasonably good we can improve the sample complexity to we emphasize that since is the lower bound of sample complexity for regularized matrix completion our result suggests that even noisy features could asymptotically reduce the number of observations needed in matrix completion in addition we empirically show that our model outperforms other completion methods on synthetic data as well as in two applications relationship prediction and clustering our contribution can be summarized as follows we propose dirty statistical model for matrix completion with general side information where the matrix is learned by balancing features and pure observations simultaneously we quantify the effectiveness of features in matrix completion problem we show that our model is guaranteed to recover the matrix with any feature set and moreover the sample complexity can be lower than standard matrix completion given informative features the paper is organized as follows section states some related research in section we introduce our proposed model for matrix completion with general side information we theoretically analyze the effectiveness of features in our model in section and show experimental results in section related work matrix completion has been widely applied to many machine learning tasks such as recommender systems social network analysis and clustering several theoretical foundations have also been established one remarkable milestone is the strong guarantee provided by et al who proves that npolylogn observations are sufficient for exact recovery provided entries are uniformly sampled at random several work also studies recovery under distributional assumptions setting and noisy observations several works also consider side information in matrix completion although most of them found that features are helpful for certain applications and setting from their experimental supports their proposed methods focus on the matrix factorization formulation without any theoretical guarantees compared to them our model mainly focuses on convex regularized objective and on theoretical insight on the effect of features on the other hand jain and dhillon also see studied an inductive matrix completion objective to incorporate side information and followup work also considers similar formulation with trace norm regularized objective both of them show that recovery guarantees could be attained with lower sample complexity when features are perfect however if features are imperfect such models can not recover the underlying matrix and could suffer poor performance in practice we will have detailed discussion on inductive matrix completion model in section our proposed model is also related to the family of dirty statistical models where the model parameter is expressed as the sum of number of parameter components each of which has its own structure dirty statistical models have been proposed mostly for robust matrix completion graphical model estimation and learning to decompose the sparse component noise and component model parameters our proposed algorithm is completely different we aim to decompose the model into two parts the part that can be described by side information and the part that has to be recovered purely by 
observations dirty statistical model for matrix completion with features let be the underlying matrix that aims to be recovered where min so that is let be the set of observed entries sampled from with cardinality furthermore let and be the feature set where each row xi or yi denotes the feature of the row or column entity of both min but can be either smaller or larger than thus given set of observations and the feature set and as side information the goal is to recover the underlying low rank matrix to begin with consider an ideal case where the given features are perfect in the following sense col col and row col such feature set can be thought as perfect since it fully describes the true latent feature space of then instead of recovering the low rank matrix directly one can recover smaller matrix such that xm the resulting formulation called inductive matrix completion or imc in brief is shown to be both theoretically preferred and useful in real applications details of this model can be found in however in practice most given features and will not be perfect in fact they could be quite noisy or only weakly correlated to the latent feature space of though in some cases applying imc with imperfect might still yield decent performance in many other cases the performance drastically drops when features become noisy this weakness of imc can also be empirically seen in section therefore more robust model is desired to better handle noisy features we now introduce dirty statistical model for matrix completion with possibly noisy features the core concept of our model is to learn the underlying matrix by balancing feature information and observations specifically we propose to learn jointly from two parts one is the low rank estimate from feature space xm and the other part is the part outside the feature space thus can be used to capture the information that noisy features fail to describe which is then estimated by pure observations naturally both xm and are preferred to be low rank since they are aggregated to estimate low rank matrix this further leads preference on to be low rank as well since one could expect only small subspace of and subspace of are jointly effective to form the low rank space xm putting all of above together we consider to solve the following problem min xm ij rij λm λn where and are regularized with trace norm because of the low rank prior the underlying matrix can thus be estimated by xm we refer our model as dirtyimc for convenience to solve the convex problem we propose an alternative minimization scheme to solve and iteratively our algorithm is stated in details in appendix one remark of this algorithm is that it is guaranteed to converge to global optimal since the problem is jointly convex with and the parameters λm and λn are crucial for controlling the importance between features and residual when λm will be enforced to so features are disregarded and becomes standard matrix completion objective another special case is λn in which will be enforced to and the objective becomes imc intuitively with an appropriate ratio λm the proposed model can incorporate useful part of features yet be robust to noisy part by compensating from pure observations some natural questions arise from here how to quantify the quality of features what is the right λm and λn given feature set and beyond intuition how much can we benefit from features using our model in theory we will formally answer these questions in section theoretical analysis now we analyze the usefulness of 
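A minimal solver sketch for the dirtyIMC objective just introduced is given below, using alternating proximal-gradient steps with singular-value soft-thresholding under a squared loss; the squared loss, the fixed step size, and the iteration count are simplifying assumptions for illustration, and the alternating minimization scheme referred to in the appendix of the text may differ in detail. Unobserved entries of R may hold arbitrary values (for example zeros), since the mask removes them from the loss.

import numpy as np

def svt(Z, tau):
    # Singular-value soft-thresholding: the proximal operator of tau * ||.||_*.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def dirty_imc(R, mask, X, Y, lam_M, lam_N, step=1e-3, iters=500):
    # Illustrative proximal-gradient solver for
    #   min_{M,N}  0.5 * sum_{(i,j) in Omega} (x_i^T M y_j + N_ij - R_ij)^2
    #              + lam_M ||M||_*  +  lam_N ||N||_*
    # R, mask: observed matrix and 0/1 observation mask (Omega);
    # X: row-entity features, Y: column-entity features.
    M = np.zeros((X.shape[1], Y.shape[1]))
    N = np.zeros(R.shape)
    for _ in range(iters):
        residual = mask * (X @ M @ Y.T + N - R)       # gradient of the squared loss
        M = svt(M - step * X.T @ residual @ Y, step * lam_M)
        N = svt(N - step * residual, step * lam_N)
    return M, N        # the completed matrix is X @ M @ Y.T + N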
features in our model under theoretical perspective we first quantify the quality of features and show that with reasonably good features our model achieves recovery with lower sample complexity finally we compare our results to matrix completion and imc due to space limitations detailed proofs of theorems and lemmas are left in appendix preliminaries recall that our goal is to recover matrix given observed entry set feature set and described in section to recover the matrix with our model equation it is equivalent to solve the problem min xm ij rij subject to for simplicity we will consider max so that feature dimensions do not grow as function of we assume each entry is sampled under an unknown distribution with index set iα jα also each entry of is assumed to be upper bounded maxij so that trace norm of is in such circumstance is consistent with real scenarios like the netflix problem where users can rate movies with scale from to for convenience let be any feasible solution and be the feasible solution set also let fθ xti yj nij be the estimation function for rij parameterized by and fθ fθ be the set of feasible functions we are interested in the following two quantities expected rℓ rij empirical rij thus our model is to solve for that parameterizes arg minf and it is sufficient to show that recovery can be attained if rℓ approaches to zero with large enough and measuring the quality of features we now link the quality of features to rademacher complexity learning theoretic tool to measure the complexity of function class we will show that quality features result in lower model complexity and thus smaller error bound under such viewpoint the upper bound of rademacher complexity could be used for measuring the quality of features to begin with we apply the following lemma to bound the expected lemma bound on expected let be loss function with lipschitz constant lℓ bounded by with respect to its first argument and be constant where let fθ be the rademacher complexity of the function class fθ and associated with defined as σα iα jα riα jα fθ eσ sup where each σα takes values with equal probability then with probability at least for all fθ we have log rℓ fθ apparently to guarantee small enough rℓ both and model complexity eω have to be bounded the next key lemma shows that the model complexity term eω fθ is related to the feature quality in matrix completion context before diving into the details we first provide an intuition on the meaning of good features consider any imperfect feature set which violates one can imagine such feature set is perturbed by some misleading noise which is not correlated to the true latent features however features should still be effective if such noise does not weaken the true latent feature information too much thus if large portion of true latent features lies on the informative part of the feature spaces and they should still be somewhat informative and helpful for recovering the matrix more formally the model complexity can be bounded in terms of and by the following lemma lemma let maxi maxi and max then the model complexity of function class fθ is upper bounded by log log eω fθ mx min then by lemma and one could construct feasible solution set by setting and carefully such that both and eω fθ are controlled to be reasonably small we now suggest witness pair of and constructed as follows let be defined as mini mini min let tµ be the thresholding operator where tµ if and tµ otherwise in addition let σi ui vi be the reduced svd of and define xµ tµ σi ui vi 
to be the part of the part of denoted as yν can also be defined similarly now consider setting and xµ yνt where arg min yνt xµt xµ xµt ryν yνt yν is the optimal solution for approximating under the informative feature space xµ and yν then the following lemma shows that the trace norm of will not grow as increases lemma fix and let dˆ min rank xµ rank yν then with some universal constant dˆ xy moreover by combining lemma we can upper bound rℓ of dirtyimc as follows theorem consider problem with and xµ yνt then with probability at least the expected of an optimal solution will be bounded by log log log rℓ min cµ sample complexity analysis from theorem we can derive the following sample complexity guarantee of our model for simplicity we assume so it will not grow as increases in the following corollary suppose we aim to where nij xm yijt rij given an arbitrarily small then for dirtyimc model min log observations are sufficient for provided sufficiently large corollary suggests that the sample complexity of our model only depends on the trace norm of residual this matches the intuition of good features stated in section because will cover most part of if features are good and as result will be small and one can enjoy small sample complexity by exploiting quality features we also compare our sample complexity result with other models first suppose features are perfect so that our result suggests that only log samples are required for recovery this matches the result of in which the authors show that given perfect features log observations are enough for exact recovery by solving the imc objective however imc does not guarantee recovery when features are not perfect while our result shows that recovery is still attainable by dirtyimc with min log samples we will also empirically justify this result in section on the other hand for standard matrix completion no features are considered the most wellknown guarantee is that under certain conditions one can achieve poly log sample complexity for both and exact recovery however these bounds only hold with distributional assumptions on observed entries for sample complexity without any distributional assumptions shamir et al recently showed that entries are sufficient for and this bound is tight if no further distribution of observed entries is assumed compared to those results our analysis also requires no assumptions on distribution of observed entries and our sample complexity yields as well in the worst case by the fact that notice that it is reasonable to meet the lower bound even given features since in an extreme case could be random matrices and have no correlation to and thus the given information is as same as that in standard matrix completion however in many applications features will be far from random and our result provides theoretical insight to show that features can be useful even if they are imperfect indeed as long as features are informative enough such that our sample complexity will be asymptotically lower than here we provide two concrete instances for such scenario in the first scenario we consider the matrix to be generated from random orthogonal model as follows theorem let be generated from random orthogonal model where ui vi are random orthogonal bases and σk are singular values with arbitrary magnitude let σt be the largest singular value such that σt then given the noisy features where ui and vi if and and be any basis orthogonal to and if samples are sufficient for dirtyimc to achieve theorem suggests that under random 
orthogonal model if features are not too noisy in the sense that noise only corrupts the true subspace associated with smaller singular values we can approximately recover with only observations an empirical justification for this result is presented in appendix another scenario is to consider to be the product of two gaussian matrices theorem let be matrix where are true latent features with each uij vij suppose now we are given feature set where row items and column items have corrupted features moreover each corrupted item has perturbed feature xi ui and yi vi where and sparsity ρs svdfeature mc imc dirtyimc feature noise level ρf svdfeature mc imc dirtyimc ρs feature noise level ρf relative error relative error sparsity ρs feature noise level ρf ρf svdfeature mc imc dirtyimc sparsity ρs ρf feature noise level ρf feature noise level ρf ρs feature noise level ρf svdfeature mc imc dirtyimc svdfeature mc imc dirtyimc ρs relative error sparsity ρs relative error relative error relative error sparsity ρs svdfeature mc imc dirtyimc sparsity ρs ρf figure performance of various methods for matrix completion under different sparsity and feature quality compared to other completion methods the top figures show that dirtyimc is less sensitive to noisy features with each ρs and the bottom figures show that error of dirtyimc always decreases to with more observations given any feature quality with some constants and then for dirtyimc model with high probability max log observations are sufficient for theorem suggests that if features have good quality in the sense that items with corrupted features are not for example log then sample complexity of dirtyimc can be log log as well thus both theorem and provide concrete examples showing that given imperfect yet informative features the sample complexity of our model can be asymptotically lower than the lower bound of pure matrix completion which is experimental results in this section we show the effectiveness of the dirtyimc model for matrix completion with features on both synthetic datasets and applications for synthetic datasets we show that dirtyimc model better recovers low rank matrices under various quality of features for real applications we consider relationship prediction and clustering where the current methods are based on matrix completion and imc respectively we show that by applying dirtyimc model to these two problems we can further improve performance by making better use of features synthetic experiments we consider matrix recovery with features on synthetic data generated as follows we create low rank matrix as the true latent space uij vij we then randomly sample ρs percent of entries from as observations and construct perfect feature set which satisfies to examine performance under different quality of features we generate features with noise parameter ρf where and will be derived by replacing ρf percent of bases of and with bases orthogonal to and we then consider recovering the underlying matrix given and subset of we compare our dirtyimc model with standard regularized matrix completion mc and two other completion methods imc and svdfeature the standard relative error is used to evaluate recovered matrix for each method we select parameters from the set and report the one with the best recovery all results are averaged over random trials figure shows the recovery of each method under each sparsity level ρs and each feature noise level ρf and we first observe that in the top figures imc and method accuracy auc dirtyimc imc 
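The synthetic setup used in the experiments above can be reproduced schematically as follows; the corruption mechanism (replacing a fraction rho_f of feature directions by directions orthogonal to the true latent space) follows the description in the text, while the matrix sizes and rank are illustrative choices of ours.

import numpy as np

def synthetic_instance(n=200, k=10, rho_s=0.2, rho_f=0.3, seed=0):
    # Rank-k ground truth R = U V^T; features equal the true factors except that
    # a rho_f fraction of columns are replaced by orthogonal (uninformative)
    # directions; a rho_s fraction of entries is marked observed.
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n, k))
    V = rng.normal(size=(n, k))
    R = U @ V.T

    def corrupt(B):
        F = B.copy()
        m = int(round(rho_f * k))
        if m > 0:
            idx = rng.choice(k, size=m, replace=False)
            proj = np.eye(n) - B @ np.linalg.pinv(B)     # projector onto span(B)-perp
            Q, _ = np.linalg.qr(proj @ rng.normal(size=(n, m)))
            F[:, idx] = Q[:, :m]
        return F

    X, Y = corrupt(U), corrupt(V)
    mask = rng.random((n, n)) < rho_s                    # observed-entry indicator
    return R, mask, X, Y

A recovered matrix R_hat would then be scored by the relative error ||R_hat - R||_F / ||R||_F used in the figures.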
table relationship prediction on epinions compared with other approaches dirtyimc model gives the best performance in terms of both accuracy and auc svdfeature perform similarly under different ρs this suggests that with sufficient observations performance of imc and svdfeature mainly depend on feature quality and will not be affected much by the number of observations as result given good features they achieve smaller error compared to mc with few observations but as features become noisy they suffer poor performance by trying to learn the underlying matrix under biased feature spaces another interesting finding is that when good features are given imc and svdfeature still fails to achieve relative error as the number of observations increases which reconfirms that imc can not guarantee recoverability when features are not perfect on the other hand we see that performance of dirtyimc can be improved by both better features or more observations in particular it makes use of informative features to achieve lower error compared to mc and is also less sensitive to noisy features compared to imc and svdfeature some finer recovery results on ρs and ρf can be found in appendix applications relationship prediction in signed networks as the first application we consider relationship prediction problem in an online review website epinions where people can write reviews and trust or distrust others based on their reviews such social network can be modeled as signed network where are modeled as edges between entities and the problem is to predict unknown relationship between any two users given the network approach is the low rank model where one can first conduct matrix completion on adjacency matrix and then use the sign of completed matrix for relationship prediction therefore if features of users are available we can also consider low rank model by using our model for matrix completion step this approach can be regarded as an improvement over by incorporating feature information in this dataset there are about users and observed relationship pairs where relationships are distrust in addition to information we also have user feature matrix where for each user feature is collected based on the user review history such as number of reviews the user we then consider the model in where matrix completion is conducted by dirtyimc with relaxation dirtyimc imc imc and matrix factorization proposed in along with another two prediction methods and note that both row and column entities are users so is set for both dirtyimc and imc model we conduct the experiment using cross validation on observed edges where the parameters are chosen from the set the averaged accuracy and auc of each method are reported in table we first observe that imc performs worse than even though imc takes features into account this is because features are only weakly related to relationship matrix and as result imc is misled by such noisy features on the other hand dirtyimc performs the best among all prediction methods in particular it performs slightly better than in terms of accuracy and much better in terms of auc this shows dirtyimc can still exploit weakly informative features without being trapped by noisy features clustering we now consider clustering problem as another application given items the item feature matrix and pairwise constraints specifying whether item and are similar or dissimilar the goal is to find clustering of items such that most similar items are within the same cluster we notice that the problem can 
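For the relationship-prediction evaluation just described, accuracy and AUC over held-out signed pairs can be computed from the completed matrix as sketched below; this is not the authors' evaluation code, ties in scores are ignored, and both positive and negative pairs are assumed present.

import numpy as np

def sign_prediction_metrics(scores, signs):
    # scores: values of the completed matrix at held-out user pairs;
    # signs: true relationships coded as +1 (trust) or -1 (distrust).
    # Accuracy thresholds the score at zero; AUC uses the rank-sum formula.
    scores = np.asarray(scores, dtype=float)
    signs = np.asarray(signs)
    acc = float(np.mean(np.sign(scores) == signs))
    pos = scores[signs > 0]
    neg = scores[signs < 0]
    ranks = np.argsort(np.argsort(np.concatenate([pos, neg]))) + 1
    auc = (ranks[: len(pos)].sum() - len(pos) * (len(pos) + 1) / 2.0) / (len(pos) * len(neg))
    return acc, float(auc)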
indeed be solved by matrix completion consider to be the signed similarity matrix defined as sij or if item and are similar or dissimilar and if similarity is unknown then solving clustering becomes equivalent to finding clustering of the symmetric signed graph where the goal is to cluster nodes so that most edges within the same group are positive and most edges between groups are negative as result matrix completion approach can be applied to solve the signed graph clustering problem on apparently the above solution is not optimal for clustering as it disregards features many clustering algorithms are thus proposed by taking both item features covtype signmc mccc dirtyimc pairwise error pairwise error segment signmc mccc dirtyimc number of observed pairs signmc mccc dirtyimc pairwise error mushroom number of observed pairs number of observed pairs figure clustering on datasets for mushroom dataset where features are almost ideal both mccc and dirtyimc achieve error rate for segment and covtype where features are more noisy our model outperforms mccc as its error decreases given more constraints mushrooms segment covtype number of items feature dimension number of clusters table statistics of clustering datasets and constraints into consideration the current method is the mccc algorithm which essentially solves clustering with imc objective in the authors show that by running on the eigenvectors of the completed matrix zm mccc outperforms other algorithms we now consider solving clustering with our dirtyimc model our algorithm summarized in algorithm in appendix first completes the pairwise matrix with dirtyimc objective instead of imc with both are set as and then runs on the eigenvectors of the completed matrix to obtain clustering this algorithm can be viewed as an improved version of mccc to handle noisy features we now compare our algorithm with signed graph clustering with matrix completion signmc and mccc note that since mccc has been shown to outperform most other clustering algorithms in comparing with mccc is sufficient to demonstrate the effectiveness of our algorithm we perform each method on three datasets mushrooms segment and covtype all of them are classification benchmarks where features and class of items are both available and their statistics are summarized in table for each dataset we randomly sample pairwise constraints and perform each algorithm to derive clustering where πi is the cluster index of item we then evaluate by the following pairwise error to πi πj πi πj πi πi where is the class of item figure shows the result of each method on all three datasets we first see that for mushrooms dataset where features are perfect training accuracy can be attained by for classification both mccc and dirtyimc can obtain perfect clustering which shows that mccc is indeed effective with perfect features for segment and covtype datasets we observe that the performance of and mccc are dominated by feature quality although mccc still benefits from constraint information as it outperforms it clearly does not make the best use of constraints as its performance does not improves even if number of constraints increases on the other hand the error rate of signmc can always decrease down to by increasing however since it disregards features it suffers from much higher error rate than methods with features when constraints are few we again see dirtyimc combines advantage from mccc and signmc as it makes use of features when few constraints are observed yet leverages constraint 
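The clustering procedure outlined above (complete the partially observed plus/minus-one similarity matrix, then run k-means on the top eigenvectors of the completed matrix) and the pairwise error used in the figures can be sketched as follows. The completed matrix Z, for example X M X^T + N from a dirtyIMC solve, is taken as input, and SciPy's k-means is used purely for illustration.

import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_cluster(Z, k):
    # Symmetrize the completed similarity matrix, take the eigenvectors of the
    # k largest eigenvalues, and cluster the rows with k-means.
    Z = 0.5 * (Z + Z.T)
    _, vecs = np.linalg.eigh(Z)          # eigenvalues in ascending order
    _, labels = kmeans2(vecs[:, -k:], k, minit="++")
    return labels

def pairwise_error(labels, classes):
    # Fraction of item pairs on which the clustering disagrees with the true
    # classes: pairs grouped together that belong to different classes, plus
    # pairs split apart that belong to the same class.
    n = len(labels)
    bad, total = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            bad += int((labels[i] == labels[j]) != (classes[i] == classes[j]))
    return bad / total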
information simultaneously to avoid being trapped by feature noise this experiment shows that our model outperforms approaches for clustering acknowledgement we thank david inouye and yu for helpful comments and discussions this research was supported by nsf grants and all datasets are available at http for covtype we subsample from the entire dataset to make each cluster has balanced size references abernethy bach evgeniou and vert new approach to collaborative filtering operator estimation with spectral regularization jmlr bartlett and mendelson rademacher and gaussian complexities risk bounds and structural results jmlr bertsekas nonlinear programming athena scientific belmont ma and plan matrix completion with noise proceedings of the ieee and recht exact matrix completion via convex optimization commun acm li ma and wright robust principal component analysis acm and tao the power of convex relaxation matrix completion ieee trans inf chandrasekaran parrilo and willsky latent variable graphical model selection via convex optimization the annals of statistics chen zhang lu chen zheng and yu svdfeature toolkit for collaborative filtering jmlr chen bhojanapalli sanghavi and ward coherent matrix completion in icml chen jalali sanghavi and xu clustering partially observed graphs via convex optimization jmlr chiang hsieh natarajan dhillon and tewari prediction and clustering in signed networks local to global perspective jmlr davis kulis jain sra and dhillon metric learning in icml pages feige and schechtman on the optimality of the random hyperplane rounding technique for max cut random struct algorithms grippo and sciandrone globally convergent techniques for unconstrained optimization optimization methods and software hsieh chiang and dhillon low rank modeling of signed networks in kdd hsieh and olsan nuclear norm minimization via active subspace selection in icml jain and dhillon provable inductive matrix completion corr jalali ravikumar sanghavi and ruan dirty model for learning in nips kakade sridharan and tewari on the complexity of linear prediction risk bounds margin bounds and regularization in nips pages keshavan montanari and oh matrix completion from noisy entries jmlr koren bell and volinsky matrix factorization techniques for recommender systems ieee computer laurent and massart adaptive estimation of quadratic functional by model selection the annals of statistics leskovec huttenlocher and kleinberg predicting positive and negative links in online social networks in www li and liu constrained clustering by spectral kernel learning in iccv massa and avesani bootstrapping of recommender systems in proceedings of ecai workshop on recommender systems pages meir and zhang generalization error bounds for bayesian mixture algorithms jmlr menon chitrapura garg agarwal and kota response prediction using collaborative filtering with hierarchies and in kdd pages natarajan and dhillon inductive matrix completion for predicting associations bioinformatics negahban and wainwright restricted strong convexity and weighted matrix completion optimal bounds with noise jmlr rudelson and vershynin smallest singular value of random rectangular matrix comm pure appl math pages shamir and matrix completion with the trace norm learning bounding and transducing jmlr shin cetintas lee and dhillon tumblr blog recommendation with boosted inductive matrix completion in cikm pages srebro and shraibman rank and in colt pages xu jin and zhou speedup matrix completion with side information application to 
multilabel learning in nips yang and ravikumar dirty statistical models in nips yi zhang jin qian and jain clustering by input pattern assisted pairwise similarity matrix completion in icml zhong jain and dhillon efficient matrix sensing using gaussian measurements in international conference on algorithmic learning theory alt 
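To make the clustering application described above concrete, the following is a minimal sketch (not the authors' released code) of the pipeline: complete the partially observed signed similarity matrix with a dirtyIMC-style objective Z ≈ X M X^T + N, run k-means on the top eigenvectors of the completed matrix, and score the result with the pairwise error. For brevity the sketch replaces the nuclear-norm penalties with fixed-rank factorizations fit by plain gradient descent; all function names, the rank and step-size choices, and the Frobenius regularization are illustrative assumptions rather than the paper's actual solver.

```python
# Illustrative sketch of clustering via completion of a signed similarity matrix.
# Simplification: the dirtyIMC nuclear-norm penalties are replaced by fixed-rank
# factorizations M = U V^T and N = A B^T with Frobenius regularization, fit by
# gradient descent on the observed entries only.
import numpy as np
from sklearn.cluster import KMeans

def complete_dirty(S_obs, mask, X, rank_m=5, rank_n=5, lam=1e-2, lr=1e-2, iters=500):
    """Approximate Z = X M X^T + N on the observed entries of the signed matrix S_obs."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    U = 0.01 * rng.standard_normal((d, rank_m)); V = 0.01 * rng.standard_normal((d, rank_m))
    A = 0.01 * rng.standard_normal((n, rank_n)); B = 0.01 * rng.standard_normal((n, rank_n))
    for _ in range(iters):
        Z = X @ U @ V.T @ X.T + A @ B.T
        R = mask * (Z - S_obs)                 # residual on observed pairs only
        gU = X.T @ R @ X @ V + lam * U
        gV = X.T @ R.T @ X @ U + lam * V
        gA = R @ B + lam * A
        gB = R.T @ A + lam * B
        U -= lr * gU; V -= lr * gV; A -= lr * gA; B -= lr * gB
    return X @ U @ V.T @ X.T + A @ B.T

def cluster_from_completion(Z, k):
    """k-means on the top-k eigenvectors of the (symmetrized) completed matrix."""
    w, Q = np.linalg.eigh((Z + Z.T) / 2)
    top = Q[:, np.argsort(w)[-k:]]
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(top)

def pairwise_error(pred, truth):
    """Fraction of item pairs whose same/different-cluster relation disagrees with the classes."""
    same_pred = pred[:, None] == pred[None, :]
    same_true = truth[:, None] == truth[None, :]
    iu = np.triu_indices(len(pred), k=1)
    return np.mean(same_pred[iu] != same_true[iu])
```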
Dependent multinomial models made easy: stick-breaking with the Pólya-gamma augmentation

Scott W. Linderman (Harvard University, Cambridge, MA), Matthew J. Johnson (Harvard University, Cambridge, MA), and Ryan P. Adams (Twitter and Harvard University, Cambridge, MA). (These authors contributed equally.)

Abstract. Many practical modeling problems involve discrete data that are best represented as draws from multinomial or categorical distributions. For example, nucleotides in a DNA sequence, children's names in a given state and year, and text documents are all commonly modeled with multinomial distributions. In all of these cases, we expect some form of dependency between the draws: the nucleotide at one position in the DNA strand may depend on the preceding nucleotides, children's names are highly correlated from year to year, and topics in text may be correlated and dynamic. These dependencies are not naturally captured by the typical multinomial formulation. Here, we leverage a logistic stick-breaking representation and recent innovations in Pólya-gamma augmentation to reformulate the multinomial distribution in terms of latent variables with jointly Gaussian likelihoods, enabling us to take advantage of a host of Bayesian inference techniques for Gaussian models with minimal overhead.

Introduction. It is often desirable to model discrete data in terms of continuous latent structure. In applications involving text corpora, time series, or polling and purchasing decisions, we may want to learn correlations or spatiotemporal dynamics and leverage these structures to improve inferences and predictions. However, adding these continuous latent dependence structures often comes at the cost of significantly complicating inference: such models may require specialized inference algorithms, such as variational optimization, or they may only admit very general inference tools like particle MCMC or elliptical slice sampling, which can be inefficient and difficult to scale. Developing, extending, and applying these models has remained a challenge.

In this paper we aim to provide a class of such models that are easy and efficient. We develop models for categorical and multinomial data in which dependencies among the multinomial parameters are modeled via latent Gaussian distributions or Gaussian processes, and we show that this flexible class of models admits a simple auxiliary variable method that makes inference easy, fast, and modular. This construction not only makes these models simple to develop and apply, but also allows the resulting inference methods to use existing algorithms and software for Gaussian processes and linear Gaussian dynamical systems.

The paper is organized as follows. After providing background material and defining our general models and inference methods, we demonstrate the utility of this class of models by applying it to three domains as case studies. First, we develop a correlated topic model for text corpora. Second, we study an application to modeling the spatial and temporal patterns in birth names given only sparse data. Finally, we provide a new continuous state-space model for discrete sequences, including text and human DNA. In each case, given our model construction and auxiliary variable method, inference algorithms are easy to develop and very effective in experiments. Code to use these models, write new models that leverage these inference methods, and reproduce the figures in this paper is available online.

Modeling correlations in multinomial parameters. In this section we discuss an auxiliary variable scheme that allows multinomial observations to appear as Gaussian likelihoods within a larger probabilistic model. The key trick described in the following
sections is to introduce random variables into the joint distribution over data and parameters in such way that the resulting marginal leaves the original model intact the integral identity underlying the augmentation scheme is eψ κψ dω where and is the density of the distribution pg which does not depend on consider likelihood function of the form eψ eψ for some functions and such likelihoods arise in logistic regression and in binomial and negative binomial regression using along with prior we can write the joint density of as eψ eκ dω eψ the integrand of defines joint density on which admits as marginal density conditioned on these auxiliary variables we have eκ which is gaussian when is gaussian furthermore by the exponential tilting property of the distribution we have pg thus the identity gives rise to conditionally conjugate augmentation scheme for gaussian priors and likelihoods of the form this augmentation scheme has been used to develop gibbs sampling and variational inference algorithms for bernoulli binomial and negative binomial regression models with logit link functions and to the multinomial distribution with logistic link function the logistic softmax function ln maps vector rk to probapk bility vector by setting πk eψk eψj it is commonly used in regression and correlated topic modeling correlated multinomial parameters can be modeled with gaussian prior on the vector though the resulting models are not conjugate the augmentation can be applied to such models but it only provides gibbs updating of this paper develops joint augmentation in the sense that given the auxiliary variables the entire vector is resampled as block in single gibbs update new augmentation for the multinomial distribution first rewrite the multinomial recursively in terms of binomial densities mult bin xk nk ek nk xj ek pk πj figure correlated gaussian priors on and their implied densities on sb see text for details where xk and for convenience we define this decomposition of the multinomial density is representation where each ek represents the fraction of the remaining probability mass assigned to the component we let ek ψk where denotes the logistic function and define the function sb which maps vector to normalized probability vector next we rewrite the density into the form required by by substituting ψk for ek mult nk bin xk nk ψk ψk xk ψk nk xk eψk xk xk eψk nk choosing ak xk and bk nk for each we can then introduce auxiliary variables ωk corresponding to each coordinate ψk dropping terms that do not depend on and completing the square yields xk ψk ψk where diag and that is conditioned on the likelihood of under the augmented multinomial model is proportional to diagonal gaussian distribution figure shows how several gaussian densities map to probability densities on the simplex correlated gaussians left put most probability mass near the axis of the simplex and anticorrelated gaussians center put mass along the sides where is large when is small and finally nearly isotropic gaussian approximates symmetric dirichlet appendix gives expression for the density on induced by gaussian distribution on and also an expression for diagonal gaussian that approximates dirichlet by matching moments correlated topic models the latent dirichlet allocation lda is popular model for learning topics from text corpora the correlated topic model ctm extends lda by including gaussian correlation structure among topics this correlation model is powerful not only because it reveals correlations among figure 
comparison of correlated topic model performance the left panel shows subset of the inferred topic correlations for the ap news corpus two examples are highlighted positive correlation between topics house committee congress law and bush dukakis president campaign and anticorrelation between percent year billion rate and court case attorney judge the middle and right panels demonstrate the efficacy of our relative to competing models on the ap news corpus and the newsgroup corpus respectively topics but also because inferring such correlations can significantly improve predictions especially when inferring the remaining words in document after only few have been revealed however the addition of this gaussian correlation structure breaks the conjugacy of lda making estimation and particularly bayesian inference and predictions more challenging an approximate maximum likelihood approach using variational em is often effective but fully bayesian approach which integrates out parameters may be preferable especially when making predictions based on small number of revealed words in document recent bayesian approach based on augmentation to the logistic normal ctm provides gibbs sampling algorithm with conjugate updates but the gibbs updates are limited to resampling of one scalar at time which can lead to slow mixing in correlated models in this section we show that mcmc sampling in correlated topic model based on the stick breaking construction can be significantly more efficient than sampling in the while maintaining the same integration advantage over em in the standard lda model each topic is distribution over vocabulary of possible words and each document has distribution over topics the word in document is denoted wn for nd when each and is given symmetric dirichlet prior with parameters αβ and αθ respectively the generative model is dir αβ dir αθ zn cat wn zn cat zn the ctm replaces the dirichlet prior on each with correlated prior induced by first sampling correlated gaussian vector and then applying the logistic normal map ln analogously our generates the correlation structure by instead applying the logistic map sb the goal is then to infer the posterior distribution over the topics the documents topic allocations and their mean and correlation structure where the parameters are given conjugate wishart niw prior modeling correlation structure within the topics can be done analogously for fully bayesian inference in the we develop gibbs sampler that exploits the block conditional gaussian structure provided by the construction the gibbs sampler iteratively samples and as well as the auxiliary variables the first two are standard updates for lda models so we focus on the latter three using the identities derived in section the conditional density of each can be written cd where we have defined cd ωd cd zn ωd diag and so it is resampled as joint gaussian the correlation structure parameters and are sampled from their conditional niw distribution finally the auxiliary variables are sampled as random variables with pg cd feature of the construction is that the the auxiliary variable update is embarrassingly parallel we compare the performance of this gibbs sampling algorithm for the to the gibbs sampling algorithm of the which uses different augmentation as well as the original variational em algorithm for the ctm and collapsed gibbs sampling in standard lda figure shows results on both the ap news dataset and the newsgroups dataset where models were trained on random subset of of the 
complete documents and tested on the remaining by estimating likelihoods of half the words given the other half the collapsed gibbs sampler for lda is fast but because it does not model correlations its ability to predict is significantly constrained the variational em algorithm for the ctm is reasonably fast but its point estimate doesn quite match the performance from integrating out parameters via mcmc in this setting the gibbs sampler continues to improve slowly but is limited by its updates while the sampler seems to both mix effectively and execute efficiently due to its block gaussian updating the demonstrates that the construction and corresponding augmentation makes inference in correlated topic models both easy to implement and computationally efficient the block conditional gaussianity also makes inference algorithms modular and compositional the construction immediately extends to dynamic topic models dtms in which the latent evolve according to linear gaussian dynamics and inference can be implemented simply by applying code for gaussian linear dynamical systems see section finally because lda is so commonly used as component of other models for images easy effective modular inference for ctms and dtms is promising general tool gaussian processes with multinomial observations consider the united states census data which lists the first names of children born in each state for the years suppose we wish to predict the probability of particular name in new york state in the years and given observed names in earlier years we might reasonably expect that name probabilities vary smoothly over time as names rise and fall in popularity and that name probability would be similar in neighboring states gaussian process naturally captures these prior intuitions about spatiotemporal correlations but the observed name counts are most naturally modeled as multinomial draws from latent probability distributions over names for each combination of state and year we show how efficient inference can be performed in this otherwise difficult model by leveraging the augmentation let rm denote the matrix of dimensional inputs and nm denote the observed dimensional count vectors for each input in our example each row of corresponds to the year latitude and longitude of an observation and is the number of names underlying these observations we introduce set of latent variables ψm such that the probability vector at input is sb the auxiliary variables for the name are linked via gaussian process with covariance matrix whose entry ci is the covariance between input and under the gp prior and mean vector µk the covariance matrix is shared by all names and the mean is empirically set to match the measured name probability the full model is then gp µk xm mult nm sb to perform inference introduce auxiliary variables ωm for each ψm conditioned on these variables the conditional distribution of is ek ek ωk ωk µk ωk µk ek model static raw gp lnm gp sbm gp top bot top bot average number of names correctly predicted figure spatiotemporal gaussian process applied to the names of children born in the united states from with limited dataset of only observations per the stick breaking and logistic normal multinomial gps sbm gp and lnm gp outperform approaches in predicting the top and bottom names bottom left parentheses std error our sbm gp which leverages the augmentation is considerably more efficient than the lnm gp bottom right where ωk diag the auxiliary variables are updated according to their conditional 
distribution ωm xm ψm pg nm ψm where nm nm xm figure illustrates the power of this approach on census data the top two plots show the inferred probabilities under our multinomial gp model for the full dataset interesting spatiotemporal correlations in name probability are uncovered in this regime the posterior uncertainty is negligible since we observe thousands of names per state and year and simply modeling the transformed empirical probabilities with gp works well however in the sparse data regime with only nm observations per input it greatly improves performance to model uncertainty in the latent probabilities using gaussian process with multinomial observations the bottom panels compare four methods of predicting future names in the years and for dataset with nm predicting based on the empirical probability measured in standard gp to the empirical probabilities transformed by sb raw gp gp whose outputs are transformed by the logistic normal function ln to obtain multinomial probabilities lnm gp fit using elliptical slice sampling and our multinomial gp sbm gp in terms of ability to predict the top and bottom names the multinomial models are both comparable and vastly superior to the naive approaches the sbm gp model is considerably faster than the logistic normal version as shown in the bottom right panel the augmented gibbs sampler is more efficient than the elliptical slice sampling algorithm used to handle the in the lnm gp moreover we are able to make collapsed predictions in which we compute the predictive distribution test given integrating out the training in contrast the lnm gp must condition on the training gp values in order to make predictions and effectively integrate over training samples using mcmc appendix goes into greater detail on how marginal predictions are computed and why they are more efficient than predicting conditioned on single value of figure predictive log likelihood comparison of time series models with multinomial observations multinomial linear dynamical systems while hidden markov models hmms are ubiquitous for modeling time series and sequence data it can be preferable to use continuous state space model in particular while discrete states have no intrinsic geometry continuous states can correspond to natural euclidean embeddings these considerations are particularly relevant to text where word embeddings have proven to be powerful tool gaussian linear dynamical systems lds provide very efficient learning and inference algorithms but they can typically only be applied when the observations are themselves linear with gaussian noise while it is possible to apply gaussian lds to count vectors the resulting model is misspecified in the sense that as continuous density the model assigns zero probability to training and test data however belanger and kakade show that this model can still be used for several machine learning tasks with compelling performance and that the efficient algorithms afforded by the misspecified gaussian assumptions confer significant computational advantage indeed the authors have observed that such gaussian model is worth exploring since multinomial models with softmax link functions prevent step updates and require expensive computations this paper aims to bridge precisely this gap and enable efficient gaussian lds computational methods to be applied while maintaining multinomial emissions and an asymptotically unbiased representation of the posterior while there are other approximation schemes that effectively extend some of the 
benefits of ldss to nonlinear settings such as the extended kalman filter ekf and unscented kalman filter ukf these methods do not allow for asymptotically unbiased bayesian inference can have complex behavior and can make model learning challenge alternatively particle mcmc pmcmc is very powerful algorithm that provides unbiased bayesian inference for very general state space models but it does not enjoy the efficient block updates or conjugacy of ldss or hmms the multinomial linear dynamical system generates states via linear gaussian dynamical system but generates multinomial observations via the map az xt mult nt sb cz where is the system state at time and xt are the multinomial observations we suppress notation for conditioning on and which are system parameters of appropriate sizes that are given conjugate priors the logistic normal multinomial lds is defined analogously but uses ln in place of sb to produce gibbs sampler with fully conjugate updates we augment the observations with random variables ωt as result the conditional state sequence is jointly distributed according to gaussian lds in which the diagonal observation potential at time is xt ωt thus the state sequence can be jointly sampled using lds software and the system parameters can similarly be updated using standard algorithms the only remaining update is to the auxiliary variables which are sampled according to pg xt cz we compare the and the gibbs sampling inference algorithm to three baseline methods an using pmcmc and ancestor resampling for inference an hmm using gibbs sampling and raw lds which treats the multinomial observation vectors as observations in rk as in we examine each method performance on each of three experiments in modeling sequence of amino acids from human dna with dimensional observations set of random ap news articles with an average of words per article and vocabulary size of words and an excerpt of words from lewis carroll alice adventures in wonderland with vocabulary of words we reserved the final amino acids words per news article and words from alice for computing predictive likelihoods each linear dynamical model had state space while the hmm had discrete states hmms with and states all performed worse on these tasks figure left panels shows the predictive log likelihood for each method on each experiment normalized by the number of counts in the test dataset and relative to the likelihood under multinomial model fit to the training data mean for the dna data which has the smallest vocabulary size the hmm achieves the highest predictive likelihood but the edges out the other lds methods on the two text datasets the outperforms the other methods particularly in alice where the vocabulary is larger and the document is longer in terms of run time the is orders of magnitude faster than the with pmcmc right panel because it mixes much more efficiently over the latent trajectories related work the transformation used herein was applied to categorical models by khan et al but they used local variational bound instead of the augmentation their promising results corroborate our findings of improved performance using this transformation their generalized algorithm is not fully bayesian and does not integrate into existing gaussian modeling and inference code as easily as our augmentation conversely chen et al used the augmentation in conjunction with the logistic normal transformation for correlated topic modeling exploiting the conditional conjugacy of single entry ψk ωk with gaussian prior unlike 
our transformation which admits block gibbs sampling over the entire vector simultaneously their approach is limited to singlesite gibbs sampling as shown in our correlated topic model experiments this has dramatic effects on inferential performance moreover it precludes analytical marginalization and integration with existing gaussian modeling algorithms for example it is not immediately applicable to inference in linear dynamical systems with multinomial observations conclusion these case studies demonstrate that the multinomial model construction paired with the augmentation yields flexible class of models with easy efficient and compositional inference in addition to making these models easy the methods developed here can also enable new models for multinomial and mixed data the latent continuous structures used here to model correlations and structure can be leveraged to explore new models for interpretable feature embeddings interacting time series and dependence with other covariates acknowledgements is supported by siebel scholarship and the center for brains minds and machines cbmm funded by nsf stc award is supported by the joint research grants program is supported by nsf as well as the alfred sloan foundation references christophe andrieu arnaud doucet and roman holenstein particle markov chain monte carlo methods journal of the royal statistical society series statistical methodology iain murray ryan adams and david mackay elliptical slice sampling journal of machine learning research workshop and conference proceedings aistats nicholas polson james scott and jesse windle bayesian inference for logistic models using latent variables journal of the american statistical association mingyuan zhou lingbo li david dunson and lawrence carin lognormal and gamma mixed negative binomial regression in proceedings of the international conference on machine learning volume page jianfei chen jun zhu zi wang xun zheng and bo zhang scalable inference for logisticnormal topic models in advances in neural information processing systems pages chris holmes leonhard held et al bayesian auxiliary variable models for binary and multinomial regression bayesian analysis david blei and john lafferty correlated topic models advances in neural information processing systems david blei andrew ng and michael jordan latent dirichlet allocation the journal of machine learning research david blei and john lafferty dynamic topic models in proceedings of the international conference on machine learning pages acm xiaogang wang and eric grimson spatial latent dirichlet allocation in advances in neural information processing systems pages david belanger and sham kakade linear dynamical system model for text in proceedings of the international conference on machine learning ronan collobert and jason weston unified architecture for natural language processing deep neural networks with multitask learning in proceedings of the international conference on machine learning pages acm david belanger and sham kakade embedding word tokens using linear dynamical system in nips modern workshop eric wan and rudolph van der merwe the unscented kalman filter for nonlinear estimation in adaptive systems for signal processing communications and control symposium the ieee pages ieee sebastian thrun wolfram burgard and dieter fox probabilistic robotics mit press fredrik lindsten thomas and michael jordan ancestor sampling for particle gibbs in advances in neural information processing systems pages mohammad khan shakir mohamed 
benjamin marlin and kevin murphy stickbreaking likelihood for categorical data analysis with latent gaussian models in international conference on artificial intelligence and statistics pages 
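Since the stick-breaking map and the binomial factorization of the multinomial are the core of the construction above, here is a small self-contained numerical sketch (not taken from the paper's released code) that implements the map pi_SB and checks the factorization; the variable names and test values are illustrative. Given this factorization, each coordinate psi_k sees a likelihood proportional to exp(kappa_k * psi_k) / (1 + exp(psi_k))^{N_k} with kappa_k = x_k - N_k/2, which is exactly the form that the Pólya-gamma identity turns into a conditionally Gaussian term.

```python
# Numerical check of the stick-breaking reparameterization: pi_SB maps a real vector
# psi in R^{K-1} to a K-dimensional probability vector, and the multinomial density
# factors into K-1 binomial densities along the "stick".
import numpy as np
from scipy.stats import multinomial, binom

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pi_sb(psi):
    """Stick-breaking logistic map: R^{K-1} -> K-dimensional probability vector."""
    K = len(psi) + 1
    pi = np.zeros(K)
    stick = 1.0
    for k in range(K - 1):
        pi[k] = sigmoid(psi[k]) * stick
        stick *= 1.0 - sigmoid(psi[k])
    pi[K - 1] = stick
    return pi

def multinomial_as_binomials(x, psi):
    """Mult(x | N, pi_SB(psi)) written as a product of binomials over the stick."""
    N_rem = int(np.sum(x))                      # N_k = N - sum_{j<k} x_j
    logp = 0.0
    for k in range(len(psi)):
        logp += binom.logpmf(x[k], N_rem, sigmoid(psi[k]))
        N_rem -= x[k]
    return logp

psi = np.array([0.3, -1.2, 0.8])                # K = 4 classes
x = np.array([5, 2, 6, 3])                      # N = 16 counts
lhs = multinomial.logpmf(x, n=x.sum(), p=pi_sb(psi))
rhs = multinomial_as_binomials(x, psi)
print(lhs, rhs)                                 # the two log-densities agree
```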
On-the-job learning with Bayesian decision theory

Keenon Werling (Department of Computer Science, Stanford University), Arun Chaganty (Department of Computer Science, Stanford University), Percy Liang (Department of Computer Science, Stanford University), and Christopher Manning (Department of Computer Science, Stanford University).

Abstract. Our goal is to deploy a high-quality system starting with zero training examples. We consider an on-the-job setting where, as inputs arrive, we use crowdsourcing to resolve uncertainty where needed and output our prediction when confident. As the model improves over time, the reliance on crowdsourcing queries decreases. We cast our setting as a stochastic game based on Bayesian decision theory, which allows us to balance latency, cost, and accuracy objectives in a principled way. Computing the optimal policy is intractable, so we develop an approximation based on Monte Carlo tree search. We tested our approach on three tasks: named-entity recognition, sentiment classification, and image classification. On the NER task, we obtained more than an order of magnitude reduction in cost compared to full human annotation, while boosting performance relative to the expert-provided labels. We also achieve an improvement over having a single human label the whole set, and an improvement over online learning.

"Poor is the pupil who does not surpass his master." (Leonardo da Vinci)

Introduction. There are two roads to an accurate AI system today: (i) gather a huge amount of labeled training data and do supervised learning, or (ii) use crowdsourcing to directly perform the task. However, both solutions require substantial amounts of time and money. In many situations, one wishes to build a new system, e.g., to do Twitter information extraction to aid in disaster relief efforts or monitor public opinion, but one simply lacks the resources to follow either the pure ML or pure crowdsourcing road.

In this paper, we propose a framework called on-the-job learning, formalizing and extending ideas first implemented in earlier crowd-powered systems, in which we produce high-quality results from the start without requiring a trained model. When a new input arrives, the system can choose to asynchronously query the crowd on parts of the input it is uncertain about (e.g., query about the label of a single token in a sentence). After collecting enough evidence, the system makes a prediction. The goal is to maintain high accuracy by initially using the crowd as a crutch, but gradually becoming more self-sufficient as the model improves. Online learning and online active learning are different in that they do not actively seek new information prior to making a prediction, and they cannot maintain high accuracy independent of the number of data instances seen so far. Active classification, like us, strategically seeks information (by querying a subset of labels) prior to prediction, but it is based on a static policy, whereas we improve the model during test time based on observed data.

Figure: Named-entity recognition on tweets in on-the-job learning. The system gets beliefs under the learned model, decides whether to ask the crowd about uncertain tokens (e.g., "George" in "Soup on George str. #Katrina"), incorporates worker feedback in real time, and returns a prediction with labels such as person, location, resource, or none.

To determine which queries to make, we model on-the-job learning as a stochastic game based on a CRF prediction model. We use Bayesian decision theory to trade off latency, cost, and accuracy in a principled manner. Our framework naturally gives rise to intuitive strategies: to achieve high accuracy, we should ask for redundant labels to offset the noisy responses; to achieve low latency, we should issue
queries in parallel whereas if latency is unimportant we should issue queries sequentially in order to be more adaptive computing the optimal policy is intractable so we develop an approximation based on monte carlo tree search and progressive widening to reason about continuous time we implemented and evaluated our system on three different tasks recognition sentiment classification and image classification on the ner task we obtained more than an order of magnitude reduction in cost compared to full human annotation while boosting performance relative to the expert provided labels we also achieve improvement over having single human label the whole set and improvement over online learning an implementation of our system dubbed lense for learning from expensive noisy slow experts is available at http problem formulation consider structured prediction problem from input xn to output yn for example for recognition ner on tweets is sequence of words in the tweet on george and is the corresponding sequence of labels none location location the full set of labels of person location resource and none in the learning setting inputs arrive in stream on each input we make zero or more queries on the crowd to obtain labels potentially more than once for any positions in the responses come back asynchronously which are incorporated into our current prediction model pθ figure left shows one possible outcome we query positions george and the first query returns location upon which we make another query on the the same position george and so on when we have sufficient confidence about the entire output we return the most likely prediction under the model each query qi is issued at time si and the response comes back at time ti assume that each query costs cents our goal is to choose queries to maximize accuracy minimize latency and cost we make several remarks about this setting first we must make prediction on each input in the stream unlike in active learning where we are only interested in the pool or stream of examples for the purposes of building good model second the responses are used to update the prediction ogs ogs ogs legend none res loc per res loc ogs ogs ogs system crowd tnow loc ogs loc loc res per loc incorporating information from responses the bar graphs represent the marginals over the labels for each token indicated by the first character at different points in time the two timelines show how the system updates its confidence over labels based on the crowd responses the system continues to issue queries until it has sufficient confidence on its labels see the paragraph on behavior in section for more information game tree an example of partial game tree constructed by the system when deciding which action to take in the state the query has already been issued and the system must decide whether to issue another query or wait for response to figure example behavior while running structure prediction on the tweet soup on george we omit the resource from the game tree for visual clarity model like in online learning this allows the number of queries needed and thus cost and latency to decrease over time without compromising accuracy model we model learning as stochastic game with two players the system and the crowd the game starts with the system receiving input and ends when the system turns in set of labels yn during the system turn the system may choose query action to ask the crowd to label yq the system may also choose the wait action to wait for the crowd to respond to pending 
query or the return action to terminate the game and return its prediction given responses received thus far the system can make as many queries in row simultaneously as it wants before deciding to wait or turn when the wait action is chosen the turn switches to the crowd which provides response to one pending query and advances the game clock by the time taken for the crowd to respond the turn then immediately reverts back to the system when the game ends the system chooses the return action the system evaluates utility that depends on the accuracy of its prediction the number of queries issued and the total time taken the system should choose query and wait actions to maximize the utility of the prediction eventually returned in the rest of this section we describe the details of the game tree our choice of utility and specify models for crowd responses followed by brief exploration of behavior admitted by our model game tree let us now formalize the game tree in terms of its states actions transitions and rewards see figure for an example the game state tnow consists of the current time tnow the actions that have been issued at times and the responses that have been received at times let rj and tj iff qj is not query action or its responses have not been received by time tnow during the system turn when the system chooses an action qk the state is updated to tnow where qk tnow and if qk then the system chooses another action from the new state if qk the crowd makes stochastic move from finally if qk the game ends this rules out the possibility of launching query midway through waiting for the next response however we feel like this is reasonable limitation that significantly simplifies the search space and the system returns its best estimate of the labels using the responses it has received and obtains utility defined later let qj rj be the set of requests during the crowd turn after the system chooses the next response from the crowd is chosen arg where is sampled from the model pt sj tnow for each finally response is sampled using response model and the state is updated to tj where rk and tk utility under bayesian decision theory the optimal choice for an action in state tnow is the one that attains the maximum expected utility value for the game starting at recall that the system can return at any time at which point it receives utility that trades off two things the first is the accuracy of the map estimate according to the model best guess of incorporating all responses received by time the second is the cost of making queries monetary cost wm per query made and penalty of wt per unit of time taken formally we define the utility to be expacc nq wm tnow wt expacc ep accuracy arg max where nq qj is the number of queries made is prediction model that incorporates the crowd responses the utility of wait and return actions is computed by taking expectations over subsequent trajectories in the game tree this is intractable to compute exactly so we propose an approximate algorithm in section environment model the final component is model of the environment crowd given input and queries qk issued at times sk we define distribution over the output responses rk and response times tk as follows pθ pr ri yqi pt ti si the three components are as follows pθ is the prediction model standard crf pr yq is the response model which describes the distribution of the crowd response for given query when the true answer is yq and pt ti si specifies the latency of query qi the crf model pθ is learned 
based on all actual responses not simulated ones using adagrad to model annotation errors we set pr yq iff yq and distribute the remaining probability for uniformly given this full model we can compute simply by marginalizing out and from equation when conditioning on we ignore responses that have not yet been received when rj for some behavior let look at typical behavior that we expect the model and utility to capture figure shows how the marginals over the labels change as the crowd provides responses for our running example named entity recognition for the sentence soup on george in the both timelines the system issues queries on soup and george because it is not confident about its predictions for these tokens in the first timeline the crowd correctly responds that soup is resource and that george is location integrating these responses the system is also more confident about its prediction on and turns in the correct sequence of labels in the second timeline crowd worker makes an error and labels george to be person the system still has uncertainty on george and issues an additional query which receives correct response following which the system turns in the correct sequence of labels while the answer is still correct the system could have taken less time to respond by making an additional query on george at the very beginning we found the humans we hired were roughly accurate in our experiments game playing in section we modeled learning as stochastic game played between the system and the crowd we now turn to the problem of actually finding policy that maximizes the expected utility which is of course intractable because of the large state space our algorithm algorithm combines ideas from monte carlo tree search to systematically explore the state space and progressive widening to deal with the challenge of continuous variables time some intuition about the algorithm is provided below when simulating the system turn the next state and hence action is chosen using the upper confidence tree uct decision rule that trades off maximizing the value of the next state exploitation with the number of visits exploration the crowd turn is simulated based on transitions defined in section to handle the unbounded fanout during the crowd turn we use progressive widening that maintains current set of active or explored states which is gradually grown with time let be the number of times state has been visited and be all successor states that the algorithm has sampled algorithm approximating expected utility with mcts and progressive widening for all function monte arlovalue state increment if system turn then log arg initialize visits utility sum and children choose next state using uct arlovalue record observed utility return else if crowd spturn then restrict continuous samples using pw if max then is sampled from set of already visited based on else is drawn based on end if return monte arlovalue else if game terminated then return utility of according to end if end function experiments in this section we empirically evaluate our approach on three tasks while the setting we propose is targeted at scenarios where there is no data to begin with we use existing labeled datasets table to have gold standard baselines we evaluated the following four methods on each dataset human the majority vote of human crowd workers was used as prediction online learning uses classifier that trains on the gold output for all examples seen so far and then returns the mle as prediction this is the best possible 
offline system it sees perfect information about all the data seen so far but can not query the crowd while making prediction threshold baseline uses the following heuristic for each label yi we ask for queries such that yi instead of computing the expected marginals over the responses to queries in flight we simply count the requests for given variable and reduces the uncertainty on that variable by factor of the system continues launching requests until the threshold adjusted by number of queries in flight is crossed dataset examples ner task and notes we evaluate on the ner sequence labeling problem over english sentences we only consider the four tags corresponding to persons locations organizations or sentiment we evaluate on subset of the imdb sentiment dataset that consists of polar movie reviews the goal is binary classification of documents into classes pos and neg we evaluate on celebrity face classification task each image must be labeled as one of the following four choices andersen cooper daniel craig scarlet johansson or miley cyrus face features we used standard features the current word current lemma previous and next lemmas lemmas in window of size three to the left and right word shape and word prefix and suffixes as well as word embeddings we used two feature sets the first unigrams containing only word unigrams and the second rnn that also contains sentence vector embeddings from we used the last layer of alexnet trained on imagenet as input feature embeddings though we leave into the net to future work table datasets used in this paper and number of examples we evaluate on system online threshold lense ms ms ms ms ms named entity recognition per loc org face identification latency acc ms ms ms ms ms table results on ner and face tasks comparing latencies queries per token and performance metrics for ner and accuracy for face predictions are made using mle on the model given responses the baseline does not reason about time and makes all its queries at the very beginning lense our full system as described in section implementation and crowdsourcing setup we implemented the retainer model of on amazon mechanical turk to create pool of crowd workers that could respond to queries in the workers were given short tutorial on each task before joining the pool to minimize systematic errors caused by misunderstanding the task we paid workers to join the retainer pool and an additional per query for ner since response times were much faster we paid per query worker response times were generally in the range of seconds for ner seconds for sentiment and seconds for faces when running experiments we found that the results varied based on the current worker quality to control for variance in worker quality across our evaluations of the different methods we collected worker responses and their delays on each label ahead of during simulation we sample the worker responses and delays without replacement from this frozen pool of worker responses summary of results table and table summarize the performance of the methods on the three tasks on all three datasets we found that learning outperforms machine and http the original also includes fifth tag for miscellaneous however the definition for miscellaneos is complex making it very difficult for crowd workers to provide accurate labels these datasets are available in the code repository for this paper system unigrams unigrams rnn embeddings queries per example acc unigrams online threshold lense rnn latency online threshold lense time 
figure queries per example for lense on sentiment with simple unigram features the model quickly learns it does not have the capacity to answer confidently and must query the crowd with more complex rnn features the model learns to be more confident and queries the crowd less over time table results on the sentiment task comparing latency queries per example and accuracy queries per token lense vote baseline lense online learning time time figure comparing and queries per token on the ner task over time the left graph compares lense to online learning which can not query humans at test time this highlights that lense maintains high scores even with very small training set sizes by falling back the crowd when it is unsure the right graph compares query rate over time to this clearly shows that as the model learns it needs to query the crowd less comparisons on both quality and cost on ner we achieve an of at more than an order of magnitude reduction on the cost of achieving comporable quality result using the approach on sentiment and faces we reduce costs for comparable accuracy by factor of around for the latter two tasks both learning methods perform less well than in ner we suspect this is due to the presence of dominant class none in ner that the model can very quickly learn to expend almost no effort on lense outperforms the threshold baseline supporting the importance of bayesian decision theory figure tracks the performance and cost of lense over time on the ner task lense is not only able to consistently outperform other baselines but the cost of the system steadily reduces over time on the ner task we find that lense is able to trade off time to produce more accurate results than the baseline with fewer queries by waiting for responses before making another query while learning allows us to deploy quickly and ensure good results we would like to eventually operate without crowd supervision figure we show the number of queries per example on sentiment with two different features sets unigrams and rnn as described in table with simpler features unigrams the model saturates early and we will continue to need to query to the crowd to achieve our accuracy target as specified by the loss function on the other hand using richer features rnn the model is able to learn from the crowd and the amount of supervision needed reduces over time note that even when the model capacity is limited lense is able to guarantee consistent high level of performance reproducibility all code data and experiments for this paper are available on codalab at https related work learning draws ideas from many areas online learning active learning active classification crowdsourcing and structured prediction online learning the fundamental premise of online learning is that algorithms should improve with time and there is rich body of work in this area in our setting algorithms not only improve over time but maintain high accuracy from the beginning whereas regret bounds only achieve this asymptotically active learning active learning see for survey algorithms strategically select most informative examples to build classifier online active learning performs active learning in the online setting several authors have also considered using crowd workers as noisy oracle it differs from our setup in that it assumes that labels can only be observed after classification which makes it nearly impossible to maintain high accuracy in the beginning active classification active classification asks what are the most 
informative features to measure at test time existing active classification algorithms rely on having fully labeled dataset which is used to learn static policy for when certain features should be queried which does not change at test time learning differs from active classification in two respects true labels are never observed and our system improves itself at test time by learning stronger model notable exception is legion ar which like us operates in learning setting to for activity classification however they do not explore the machine learning foundations associated with operating in this setting which is the aim of this paper crowdsourcing burgenoning subset of the crowdsourcing community overlaps with machine learning one example is flock which first crowdsources the identification of features for an image classification task and then asks the crowd to annotate these features so it can learn decision tree in another line of work turkontrol models individual crowd worker reliability to optimize the number of human votes needed to achieve confident consensus using pomdp structured prediction an important aspect our prediction tasks is that the output is structured which leads to much richer setting for learning since tags are correlated the importance of coherent framework for optimizing querying resources is increased making active partial observations on structures and has been explored in the measurements framework of and in the distant supervision setting conclusion we have introduced new framework that learns from noisy crowds to maintain high accuracy and reducing cost significantly over time the technical core of our approach is modeling the setting as stochastic game and using ideas from game playing to approximate the optimal policy we have built system lense which obtains significant cost reductions over pure crowd approach and significant accuracy improvements over pure ml approach acknowledgments we are grateful to kelvin guu and volodymyr kuleshov for useful feedback regarding the calibration of our models and amy bearman for providing the image embeddings for the face classification experiments we would also like to thank our anonymous reviewers for their helpful feedback finally our work was sponsored by sloan fellowship to the third author references deng dong socher li li and imagenet hierarchical image database in computer vision and pattern recognition cvpr pages krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in advances in neural information processing systems nips pages bernstein little miller hartmann ackerman karger crowell and panovich soylent word processor with crowd inside in symposium on user interface software and technology pages kokkalis pfeiffer chornyi bernstein and klemmer emailvalet managing email overload through private accountable crowdsourcing in conference on computer supported cooperative work pages li weng he yao datta sun and lee twiner named entity recognition in targeted twitter stream in acm special interest group on information retreival sigir pages walter lasecki young chol song henry kautz and jeffrey bigham crowd labeling for deployable activity recognition in proceedings of the conference on computer supported cooperative work and lugosi prediction learning and games cambridge university press helmbold and panizza some label efficient learning results in conference on learning theory colt pages sculley online active learning methods for fast spam filtering in conference on email and 
ceas chu zinkevich li thomas and tseng unbiased online active learning in data streams in international conference on knowledge discovery and data mining kdd pages gao and koller active classification based on value of classifier in advances in neural information processing systems nips pages kocsis and bandit based planning in european conference on machine learning ecml pages coulom computing elo ratings of move patterns in the game of go computer games workshop finkel grenager and manning incorporating information into information extraction systems by gibbs sampling in association for computational linguistics acl pages andrew maas raymond daly peter pham dan huang andrew ng and christopher potts learning word vectors for sentiment analysis in acl hlt pages portland oregon usa june association for computational linguistics socher perelygin wu chuang manning ng and potts recursive deep models for semantic compositionality over sentiment treebank in empirical methods in natural language processing emnlp kumar berg belhumeur and nayar attribute and simile classifiers for face verification in iccv oct bernstein brandt miller and karger crowds in two seconds enabling realtime interfaces in user interface software and technology pages settles active learning literature survey technical report university of wisconsin madison donmez and carbonell proactive learning active learning with multiple imperfect oracles in conference on information and knowledge management cikm pages golovin krause and ray bayesian active learning with noisy observations in advances in neural information processing systems nips pages greiner grove and roth learning active classifiers artificial intelligence chai deng yang and ling sensitive naive bayes classification in international conference on data mining pages esmeir and markovitch anytime induction of trees in advances in neural information processing systems nips pages cheng and bernstein flock hybrid learning classifiers in proceedings of the acm conference on computer supported cooperative work social computing pages dai mausam and weld control of workflows in association for the advancement of artificial intelligence aaai liang jordan and klein learning from measurements in exponential families in international conference on machine learning icml angeli tibshirani wu and manning combining distant and partial supervision for relation extraction in empirical methods in natural language processing emnlp 
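To ground the utility computation at the heart of the decision-theoretic formulation above, here is a small illustrative sketch (not the authors' LENSE implementation). It folds noisy crowd responses into per-token beliefs with Bayes' rule under the uniform-error response model, and scores the utility of returning a prediction now as expected accuracy minus money and latency penalties. For clarity it treats token marginals as independent rather than using the full CRF, and the weights, accuracy constant, and marginal values below are made-up example numbers.

```python
# Illustrative sketch: belief updates from noisy crowd responses and the utility of
# returning a prediction, under independent per-token marginals (a simplification).
import numpy as np

LABELS = ["PER", "LOC", "RES", "NONE"]
W_MONEY, W_TIME, CROWD_ACC = 0.05, 0.001, 0.9   # hypothetical trade-off weights

def posterior_marginals(prior, responses):
    """prior: (T, K) model marginals; responses: list of (token_idx, reported_label_idx).
    Noisy-annotator model: a worker reports the true label with prob CROWD_ACC and
    any other label uniformly at random otherwise."""
    post = prior.copy()
    K = prior.shape[1]
    for t, r in responses:
        lik = np.full(K, (1.0 - CROWD_ACC) / (K - 1))
        lik[r] = CROWD_ACC
        post[t] *= lik
        post[t] /= post[t].sum()
    return post

def utility_of_returning(post, n_queries, elapsed):
    """Expected token accuracy of the MAP guess minus money and latency penalties."""
    exp_acc = post.max(axis=1).mean()            # E[accuracy] under independent marginals
    return exp_acc - W_MONEY * n_queries - W_TIME * elapsed

# "soup on george str": the model is unsure about tokens 0 and 2.
prior = np.array([[0.10, 0.10, 0.50, 0.30],      # soup
                  [0.05, 0.05, 0.05, 0.85],      # on
                  [0.45, 0.40, 0.05, 0.10],      # george
                  [0.10, 0.60, 0.10, 0.20]])     # str
post = posterior_marginals(prior, responses=[(2, 1), (0, 2)])  # crowd: george=LOC, soup=RES
print([LABELS[i] for i in post.argmax(axis=1)])
print(utility_of_returning(prior, 0, 0.0), utility_of_returning(post, 2, 4.2))
```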
Calibrated structured prediction

Percy Liang (Department of Computer Science, Stanford University, Stanford, CA) and Volodymyr Kuleshov (Department of Computer Science, Stanford University, Stanford, CA).

Abstract. In many applications, displaying calibrated confidence measures (probabilities that correspond to true frequencies) can be as important as obtaining high accuracy. We are interested in calibration for structured prediction problems such as speech recognition, optical character recognition, and medical diagnosis. Structured prediction presents new challenges for calibration: the output space is large, and users may issue many types of probability queries (e.g., marginals) on the structured output. We extend the notion of calibration so as to handle various subtleties pertaining to the structured setting, and then provide a simple recalibration method that trains a binary classifier to predict probabilities of interest. We explore a range of features appropriate for structured recalibration and demonstrate their efficacy on three datasets.

Introduction. Applications such as speech recognition, medical diagnosis, optical character recognition, machine translation, and scene labeling have two properties: (i) they are instances of structured prediction, where the predicted output is a complex structured object, and (ii) they are applications for which it is important to provide accurate estimates of confidence. This paper explores confidence estimation for structured prediction.

Central to this paper is the idea of probability calibration, which is prominent in the meteorology and econometrics literature. Calibration requires that the probability a system outputs for an event reflects the true frequency of that event: if a system says that it will rain with probability p, then it should rain a fraction p of the time. In the context of structured prediction, we do not have a single event or a fixed set of events, but rather a multitude of events that depend on the input, corresponding to different conditional and marginal probabilities that one could ask of a structured prediction model. We must therefore extend the definition of calibration in a way that deals with the complexities that arise in the structured setting.

We also consider the practical question of building a system that outputs calibrated probabilities. We introduce a new framework for calibration in structured prediction, which involves defining probabilities of interest and then training binary classifiers to predict these probabilities based on a set of features. Our framework generalizes current methods for binary and multiclass classification, which predict class probabilities based on a single feature, the uncalibrated prediction score. In structured prediction, the space of interesting probabilities and useful features is considerably richer. This motivates us to introduce a new concept of events of interest, as well as a range of new features that have varying computational demands. We perform a thorough study of which features yield good calibration, and find that simple features are quite good for calibrating MAP and marginal estimates over three tasks, among them optical character recognition and scene understanding. Interestingly, features based on MAP inference alone can achieve good calibration on marginal probabilities, which can be more difficult to compute.

Figure: In the context of an OCR system, our framework augments the structured predictor with calibrated confidence measures for a set of events, e.g., whether the first letter of the predicted word (here, "land") is correct; the structured prediction model's output is passed to a forecaster, which produces an event probability.

Background: structured prediction. In structured prediction, we want to assign a structured
label $y = (y_1, \ldots, y_L)$ to an input $x$. For example, in optical character recognition (OCR), $x$ is a sequence of images and $y$ is the sequence of associated characters (see figure). Note that the number of possible outputs for a given $x$ may be exponentially large. A common approach to structured prediction is conditional random fields (CRFs), where we posit a probabilistic model $p_\theta(y \mid x)$. We train $p_\theta$ by optimizing a likelihood-based or related objective over a training set assumed to be drawn from an unknown data-generating distribution. The promise of a probabilistic model is that, in addition to computing the most likely output $\arg\max_y p_\theta(y \mid x)$, we can also get its probability $p_\theta(y \mid x)$, or even marginal probabilities $p_\theta(y_j \mid x)$.

Probabilistic forecasting. Probabilities from a CRF $p_\theta$ are just numbers that sum to one. In order for these probabilities to be useful as confidence measures, we would ideally like them to be calibrated. Calibration intuitively means that whenever a forecaster assigns a probability to an event, the event should actually hold about that fraction of the time. In the case of binary classification, $y \in \{0, 1\}$, we say that a forecaster $F$ is perfectly calibrated if $\mathbb{P}(y = 1 \mid F(x) = p) = p$ for all possible probabilities $p$. Calibration by itself does not guarantee a useful confidence measure: a forecaster that always outputs the marginal class probability $\mathbb{P}(y = 1)$ is calibrated but useless for accurate prediction. Good forecasts must also be sharp: their probabilities should be close to 0 or 1.

Calibration and sharpness. Given a forecaster $F$, define $T(p) = \mathbb{P}(y = 1 \mid F(x) = p)$ to be the true probability of $y = 1$ given that $x$ received forecast $p$. We can use $T$ to decompose the $\ell_2$ prediction loss as follows:
$$
\mathbb{E}\big[(y - F(x))^2\big]
= \mathbb{E}\big[(y - T(F(x)))^2\big] + \mathbb{E}\big[(T(F(x)) - F(x))^2\big]
= \underbrace{\mathrm{Var}[y]}_{\text{uncertainty}} \;-\; \underbrace{\mathrm{Var}[T(F(x))]}_{\text{sharpness}} \;+\; \underbrace{\mathbb{E}\big[(T(F(x)) - F(x))^2\big]}_{\text{calibration error}} .
$$
The first equality follows because $y$ has expectation $T(F(x))$ conditioned on $F(x)$, and the second equality follows from the variance decomposition of $y$ onto $F(x)$. The three terms formalize our intuitions about calibration and sharpness. The calibration term measures how close the predicted probability is to the true probability over that region of forecasts, and is a natural generalization of perfect calibration, which corresponds to zero calibration error. The sharpness term measures how much variation there is in the true probability across forecasts; it does not depend on the numerical value of the forecaster, but only on the induced grouping of points, and it is maximized by making $T(F(x))$ closer to 0 and 1. Uncertainty does not depend on the forecaster and can be mostly ignored; note that it is always greater than sharpness, which ensures that the loss stays positive.

[Table: examples illustrating the difference between calibration and sharpness — for each input, the true probability, a calibrated but unsharp forecaster, an uncalibrated but sharp forecaster, and a balanced forecaster; calibration error (lower is better) and sharpness (higher is better) are reported.]

Consider the following binary classification example. We have a uniform distribution over inputs; for some inputs $y = 1$ with a fixed probability, and for the others $y$ is either 0 or 1, each with equal probability. Setting $F(x)$ to the marginal probability $\mathbb{P}(y = 1)$ for every $x$ would achieve perfect calibration but no sharpness. We can get excellent sharpness but suffer in calibration by always predicting probabilities near 0 or 1. We can trade off some sharpness for perfect calibration by predicting, for each input, its true conditional probability of $y = 1$.

Discretized probabilities. We have assumed so far that the forecaster might return arbitrary probabilities in $[0, 1]$; in this case we might need an infinite amount of data to estimate $T(p)$ accurately for each value of $p$. In order to estimate calibration and sharpness from finite data, we use a discretized version of calibration and sharpness. Let $\mathcal{B}$ be a partitioning of the interval $[0, 1]$ into buckets, and let $B(p)$ map a probability $p$ to the interval of $\mathcal{B}$ containing it. In this case we simply redefine $T(p)$ to be the true probability of $y = 1$ given that $F(x)$ lies in bucket $B(p)$. It is not hard to see that discretized calibration estimates form an upper
bound on the calibration error calibration in the context of structured prediction we have so far presented calibration in the context of binary classification in this section we extend these definitions to structured prediction our ultimate motivation is to construct forecasters that augment structured models pθ with confidence estimates unlike in the multiclass setting we can not learn forecaster fy that targets for each because the cardinality of is too large in fact the user will probably not be interested in every events of interest instead we assume that for given and associated prediction the user is interested in set of events concerning and an event is subset we would like to determine the probability for each here are two useful types of events that will serve as running examples map which encodes whether map arg maxy pθ is correct yj map which encodes whether the label at position in map is correct in the ocr example figure suppose we predict map land define the events of interest to be the map and the marginals map map yl map then we have land note that the events of interest depend on through map event pooling we now define calibration in analogy with we will construct forecaster that tries to predict as we remarked earlier we can not make statement that holds uniformly for all events we can only make guarantee in expectation thus let be drawn uniformly from so that is extended to be joint distribution over we say that forecaster is perfectly calibrated if in other words averaged over all and events of interest whenever the forecaster outputs probability then the event actually holds with probability note that this definition corresponds to perfect binary calibration for the transformed pair of variables as an example if map then says that of all the map predictions with confidence fraction will be correct if yj map then states that out of all the marginals pooled together across all samples and all positions with confidence fraction will be correct algorithm recalibration procedure for calibrated structured prediction input features from trained model pθ event set recalibration set xi yi output forecaster construct the events dataset sbinary train the forecaster or decision trees on sbinary the second example hints at an important subtlety inherent to having multiple events in structured prediction the confidence scores for marginals are only calibrated when averaged over all positions if user only looked at the marginals for the first position she might be sorely disappointed as an extreme example suppose and is or with probability while then forecaster that outputs confidence of for both events and will be perfectly calibrated however neither event is calibrated in isolation and finally perfect calibration can be relaxed following we may define the def calibration error to be where constructing calibrated forecasters having discussed the aspects of calibration specific to structured prediction let us now turn to the problem of constructing calibrated and sharp forecasters from finite data recalibration framework we propose framework that generalizes existing recalibration strategies to structured prediction models pθ first the user specifies set of events of interest as well as features which will in general depend on the trained model pθ we then train forecaster to predict whether the event holds given features we train by minimizing the pempirical loss over recalibration set disjoint from the training examples minf algorithm outlines our procedure as an example consider 
again the ocr setting in figure the margin feature log pθ map log pθ map where map and map are the first and second highest scoring labels for according to pθ respectively will typically correlate with the event that the map prediction is correct we can perform isotonic regression using this feature on the recalibration set to produce probabilities in the limit of infinite data algorithm minimizes the expected loss where the expectation is over by the calibration error will also be small if there are not too many features we can drive the loss close to zero with nonparametric method such as this is also why isotonic regression is sensible for binary recalibration we first project the data into highly informative feature space then we predict labels from that space to obtain small loss note also that standard multiclass recalibration is special case of this framework where we use the raw uncalibrated score from pθ as single feature in the structured setting one must invest careful thought in the choice of classifier and features we discuss these choices below features calibration is possible even with single constant feature but sharpness depends strongly on the features quality if collapses points of opposite labels no forecaster will be able to separate them and be sharp while we want informative features we can only afford to have few since our recalibration set is typically small compared to calibration for binary classification our choice of features must also be informed by their computational requirements the most informative features might require performing full inference in an intractable model it is therefore useful to think of features as belonging to one of three types depending on whether they are derived from unstructured classifiers an svm trained individually on each label map inference or marginal inference in section we will show that marginal inference produces the sharpest features but clever features can do almost as well in table we propose several features that follow our guiding principles and that illustrate the computational tradeoffs inherent to structured prediction type none map marg map recalibration on name definition yj φno minj mrgyj ssvm svm margin mp map label length φmp map admissibility φmp mrgy pθ margin φmg margin minj mrgyj pθ yj marginal recalibration on yj name definition yj φno mrgyj ssvm svm margin mp label freq positions labeled yjmap φmp neighbors neighbors labeled yjmap mp label type yjmap mp map pseudomargin mrgyj pθ yj mg margin mrgyj pθ yj φmg yjmg yjmap concordance table features for map recalibration map and marginal recalibration yj map we consider three types of features requiring either unstructured map or marginal inference for generic function define mrga where and are the top yj be the score of an svm two inputs to ordered by let yjmg arg maxyj pθ yj let ssvm mp require knowledge defining admissible and classifier predicting label yj features φmp sets in ocr are all english words and are letters percentages in φmp and map φmp are relative to all the labels in forecasters recall from that calibration examines the true probability of an event conditioned on the forecaster prediction by limiting the number of different probabilities that can output we can more accurately estimate the true probability for each to this end let us partition the feature space the range of into regions and output probability fr for each region formally we consider forecasters of the form fr where fr is the fraction of points in region that is for which for 
which the event holds note that the partitioning could itself depend on the recalibration set two examples of forecasters are neighbors and decision trees let us obtain additional insight into the performance of forecasters as function of recalibration set size let denote here recalibration set of size which is used to derive partitioning and probability estimates fr for each region let tr be the true event probability for region and wr be the probability mass of region we may rewrite the expected calibration error of fr trained on random of size drawn from as calibrationerrorn er wr es fr tr we see that there is classic tradeoff between having smaller regions lower bias increased sharpness and having more data points per region lower variance better calibration fr tr fr tr fr fr bias variance if is fixed partitioning independent of then the bias will be zero and the variance is due to an empirical average falling off as however both and decision trees produce biased estimates fr of tr because the regions are chosen adaptively which is important for achieving sharpness in this case we can still ensure that the calibration error vanishes to zero if we let the regions grow uniformly larger experiments we test our proposed recalibrators and features on three tasks multiclass image classification the task is to predict an image label given an image this setting is special case of structured prediction in which we show that our framework improves over existing multiclass recalibration strategies we perform our experiments on the dataset image classification map recal accuracy on raw uncalibrated svm raw cal fraction of positives mean predicted value ocr chain crf map recalibration accuracy using viterbi decoding scene understanding graph crf marginal recal accuracy using marg decoding raw cal mean predicted value raw cal mean predicted value figure map recalibration in the multiclass and chain crf settings left and middle and marginal recalibration of the graph crf right the legend includes the loss before and after calibration the radius of the black balls reflects the number of points having the given forecasted and true probabilities which consists of color images of different types of animals and vehicles ten classes in total we train linear svm on features derived from clustering and that produce high accuracies on this dataset we use out of the features having the highest mutual information with the label the drop in performance is negligible images were used for training for calibration and for testing optical character recognition the task is to predict the word sequence of characters given sequence of images figure calibrated ocr systems can be useful for automatic sorting of mail this setting demonstrates calibration on tractable crf we used dataset consisting of words from human subjects each character is rasterized into binary image we chose words for training and another for testing the remaining words are subsampled in various ways to produce recalibration sets scene understanding given an image divided into set of regions the task is to label each region with its type person tree calibrated scene understanding is important for building autonomous agents that try to take optimal actions in the environment integrating over uncertainty this is structured prediction setting in which inference is intractable we conduct experiments on voc pascal dataset in brief we train graph crf to predict the joint labeling yi of superpixels yij in an image superpixels per image possible labels the 
input xi consists of node features crf edges connect adjacent superpixels we use examples for training for testing and subsample the remaining examples to produce calibration sets we perform map inference using dual composition algorithm we use mean field approximation to compute marginals experimental setup we perform both map and marginal calibration as described in section we use decision trees and as our recalibration algorithms and examine the quality of our forecasts based on calibration and sharpness section we further discretize probabilities into buckets of size we report results using calibration curves for each test point xi ei yi let fi xi ei be the forecasted probability and ti yi be the true outcome for each bucket we compute averages fb fi fi and tb fi ti where nb fi is the number of points in bucket calibration curve plots the tb as function of fb perfect calibration corresponds to straight line see figure for an example recalibration we would first like to demonstrate that our approach works well out of the box with very simple parameters single feature with and calibration set we report results in three settings multiclass and ii chain crf map recalibration with the margin feature φmg figure left middle as well as iii graph crf marginal recalibration with the margin feature φmg figure right we use calibration sets of and respectively and compare to the raw crf probabilities pθ uncalibrated unstructured svm scores φno character indicators mp marginal probabilities mg marginal probabilities agreement mg mg all features ocr chain crf marginal recalibration accuracy using viterbi decoding uncalibrated unstructured svm scores φno length presence in dict mp mp margin between and best mp lowest marginal probability mg all features ocr chain crf map recalibration accuracy using viterbi decoding uncalibrated unstructured svm scores φno pseudomargins mp mp mg pseudomargins other map features mp mp marginals concordance all features scene understanding graph crf marginal recalibration accuracy using marg decoding figure feature analysis for map and marginal recalibration of the chain crf left and middle resp and marginal recalibration of the graph crf right subplots show calibration curves for various groups of features from table as well as losses dot sizes indicate relative bucket size figure shows that our predictions green line are in every setting in the multiclass setting we outperform an existing approach which individually recalibrates classifiers and normalizes their probability estimates this suggests that recalibrating for specific event the highest scoring class is better than first estimating all the multiclass probabilities feature analysis next we investigate the role of features in figure we consider three structured settings and in each setting evaluate performance using different sets of features from table from top to bottom the subplots describe progressively more computationally demanding features our main takeaways are that clever inexpensive features do as well as naive expensive ones that features may be complementary and help each other and that recalibration allows us to add global features to chain crf we also see that features affect only sharpness in the intractable graph crf setting figure right we observe that pseudomarginals φmp which require only map inference fare almost as well as true marginals φmg although they lack resolution mp augmenting with additional features φmp that capture whether label is similar to its neighbors and whether it occurs 
elsewhere in the image resolves this this synergistic interaction of features appears elsewhere on marginal chain crf recalibration figure left the margin φmg between the two best classes yields calibrated forecasts that slightly lack sharpness near zero points with and confidences will have similarly small margins adding the concordance feature φmg improves calibration since we can further differentiate between low and very low confidence estimates similarly individual svm and mp mp features φno the are binary indicators one per character are calibrated but not very sharp they accurately identify and confidence sets which may be sufficient in practice given that they take no additional time to compute adding features based on mg marginals φmg improves sharpness mp on map crf recalibration figure middle we see that simple features φmp can fare better mp mp than more sophisticated ones like the margin recall that is the length of word in φmp encodes whether the word map is in the dictionary this demonstrates that recalibration lets us introduce new global features beyond what in the original crf which can dramatically improve calibration at no additional inferential cost knn cal sha decision tree recalibration set size recalibration set size figure calibration error blue and sharpness green of left and decision trees right as function of calibration set size chain crf marginal recalibration effects of recalibration set size and recalibration technique lastly in figure we compare and decision trees on chain crf marginal prediction using feature φmg we subsample calibration sets of various sizes for each and each algorithm we choose hyperparameter minimum leaf size for decision trees in by crossvalidation on we tried values between and in increments of figure shows that for both methods sharpness remains constant while the calibration error decreases with and quickly stabilizes below this confirms that we can always recalibrate with enough data the decrease in calibration error also indicates that successfully finds good model for each finally we found that fared better when using continuous features see also right columns of figures and decision trees performed much better on categorical features previous work and discussion calibration and sharpness provide the conceptual basis for this work these ideas and their connection to losses have been explored extensively in the statistics literature in connection to forecast evaluation there exist generalizations to other losses as well calibration in the online setting is field in itself see for starting point finally calibration has been explored extensively from bayesian viewpoint starting with the seminal work of dawid recalibration has been mostly studied in the binary classification setting with platt scaling and isotonic regression being two popular and effective methods methods typically involve training predictors and include extensions to ranking losses and combinations of estimators our generalization to structured prediction required us to develop the notion of events of interest which even in the multiclass setting works better than estimating every class probability and this might be useful beyond typical structured prediction problems confidence estimation methods play key role in speech recognition but they require domain specific acoustic features our approach is more general as it applies in any graphical model including ones where inference is intractable uses features and guarantees calibrated probabilities rather than simple 
scores that correlate with accuracy the issue of calibration arises any time one needs to assess the confidence of prediction its importance has been discussed and emphasized in medicine natural language processing speech recognition meteorology econometrics and psychology unlike uncalibrated confidence measures calibrated probabilities are formally tied to objective frequencies they are easy to understand by users patients undergoing diagnosis or researchers querying probabilistic database moreover modern ai systems typically consist of pipeline of modules in this setting calibrated probabilities are important to express uncertainty meaningfully across different potentially modules we hope our extension to the structured prediction setting can help make calibration more accessible and easier to apply to more complex and diverse settings acknowledgements this research is supported by an nserc canada graduate scholarship to the first author and sloan research fellowship to the second author reproducibility all code data and experiments for this paper are available on codalab at https references seigel confidence estimation for automatic speech recognition hypotheses phd thesis university of cambridge heckerman and nathwani towards normative expert systems representations for efficient knowledge acquisition and inference methods archive kassel comparison of approaches to handwritten character recognition phd thesis massachusetts institute of technology liang klein and taskar an discriminative approach to machine translation in international conference on computational linguistics and association for computational linguistics mueller methods for learning structured prediction in semantic segmentation of natural images phd thesis university of bonn brier verification of forecasts expressed in terms of probability monthly weather review murphy new vector partition of the probability score journal of applied meteorology foster and vohra asymptotic calibration gneiting balabdaoui and raftery probabilistic forecasts calibration and sharpness journal of the royal statistical society series statistical methodology brocker reliability sufficiency and the decomposition of proper scores quarterly journal of the royal meteorological society platt probabilistic outputs for support vector machines and comparisons to regularized likelihood methods advances in large margin classifiers zadrozny and elkan transforming classifier scores into accurate multiclass probability estimates in international conference on knowledge discovery and data mining kdd pages and caruana predicting good probabilities with supervised learning in proceedings of the international conference on machine learning pages stephenson coelho and jolliffe two extra components in the brier score decomposition weather forecasting krizhevsky learning multiple layers of features from tiny images technical report university of toronto coates and ng learning feature representations with neural networks tricks of the trade second edition buja stuetzle and shen loss functions for binary class probability estimation and classification structure and applications philip the bayesian journal of the american statistical association jasa menon jiang vembu elkan and predicting accurate probabilities with ranking loss in international conference on machine learning icml zhong and kwok accurate probability calibration for multiple classifiers in international joint conference on artificial intelligence ijcai pages yu li and deng calibration of confidence 
measures in speech recognition trans audio speech and lang jiang osl kim and calibrating predictive model estimates to support personalized medicine journal of the american medical informatics association nguyen and connor posterior calibration and exploratory analysis for natural language processing models in empirical methods in natural language processing emnlp pages lichtenstein fischhoff and phillips judgement under uncertainty heuristics and biases cambridge university press 
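To make the recalibration recipe above concrete, here is a minimal, self-contained Python (NumPy) sketch of a region-based forecaster in the spirit described earlier: partition a single scalar feature (e.g., a margin between the two highest-scoring labels) into regions, record the empirical event frequency per region on a recalibration set, and then measure discretized calibration error and sharpness with equal-width buckets. This is only an illustration under simplifying assumptions — the paper itself uses k-NN and decision-tree forecasters, and all function and variable names below are invented, not taken from any released code.

```python
import numpy as np

def fit_histogram_forecaster(feature, event, n_regions=10):
    """Partition the feature range into equal-mass regions and store the
    empirical event frequency f_r for each region (the forecaster F)."""
    edges = np.quantile(feature, np.linspace(0.0, 1.0, n_regions + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the whole real line
    region = np.clip(np.searchsorted(edges, feature, side="right") - 1, 0, n_regions - 1)
    freq = np.array([event[region == r].mean() if np.any(region == r) else event.mean()
                     for r in range(n_regions)])
    return edges, freq

def predict_forecaster(edges, freq, feature):
    """Map each new feature value to its region and return that region's frequency."""
    region = np.clip(np.searchsorted(edges, feature, side="right") - 1, 0, len(freq) - 1)
    return freq[region]

def calibration_and_sharpness(pred, event, n_buckets=10):
    """Discretized calibration error and sharpness (cf. the l2-loss decomposition)."""
    buckets = np.clip((pred * n_buckets).astype(int), 0, n_buckets - 1)
    cal, sharp, mean_event = 0.0, 0.0, event.mean()
    for b in range(n_buckets):
        mask = buckets == b
        if not np.any(mask):
            continue
        w = mask.mean()           # fraction of points in bucket b
        t_b = event[mask].mean()  # true event frequency in bucket b
        f_b = pred[mask].mean()   # average forecast in bucket b
        cal += w * (f_b - t_b) ** 2
        sharp += w * (t_b - mean_event) ** 2
    return cal, sharp

# Toy usage: recalibrate a synthetic "margin" feature for the event "the MAP prediction is correct".
rng = np.random.default_rng(0)
margin = rng.normal(size=5000)  # stand-in for log p(y_MAP) - log p(y_2nd)
event = (rng.random(5000) < 1.0 / (1.0 + np.exp(-margin))).astype(float)
edges, freq = fit_histogram_forecaster(margin[:4000], event[:4000])
pred = predict_forecaster(edges, freq, margin[4000:])
print(calibration_and_sharpness(pred, event[4000:]))
```

Equal-mass regions keep the per-region sample counts roughly balanced, which is one simple way to manage the bias-variance tradeoff discussed above; adaptive forecasters such as k-NN or decision trees instead choose the regions from the recalibration data, trading some calibration for sharpness.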
learning structured output representation using deep conditional generative models kihyuk xinchen honglak nec laboratories america university of michigan ann arbor ksohn xcyan honglak abstract supervised deep learning has been successfully applied to many recognition problems although it can approximate complex function well when large amount of training data is provided it is still challenging to model complex structured output representations that effectively perform probabilistic inference and make diverse predictions in this work we develop deep conditional generative model for structured output prediction using gaussian latent variables the model is trained efficiently in the framework of stochastic gradient variational bayes and allows for fast prediction using stochastic inference in addition we provide novel strategies to build robust structured prediction algorithms such as input and prediction objective at training in experiments we demonstrate the effectiveness of our proposed algorithm in comparison to the deterministic deep neural network counterparts in generating diverse but realistic structured output predictions using stochastic inference furthermore the proposed training methods are complimentary which leads to strong object segmentation and semantic labeling performance on birds and the subset of labeled faces in the wild dataset introduction in structured output prediction it is important to learn model that can perform probabilistic inference and make diverse predictions this is because we are not simply modeling function as in classification tasks but we may need to model mapping from single input to many possible outputs recently the convolutional neural networks cnns have been greatly successful for image classification tasks and have also demonstrated promising results for structured prediction tasks however the cnns are not suitable in modeling distribution with multiple modes to address this problem we propose novel deep conditional generative models cgms for output representation learning and structured prediction in other words we model the distribution of highdimensional output space as generative model conditioned on the input observation building upon recent development in variational inference and learning of directed graphical models we propose conditional variational cvae the cvae is conditional directed graphical model whose input observations modulate the prior on gaussian latent variables that generate the outputs it is trained to maximize the conditional and we formulate the variational learning objective of the cvae in the framework of stochastic gradient variational bayes sgvb in addition we introduce several strategies such as input and prediction training methods to build more robust prediction model in experiments we demonstrate the effectiveness of our proposed algorithm in comparison to the deterministic neural network counterparts in generating diverse but realistic output predictions using stochastic inference we demonstrate the importance of stochastic neurons in modeling the structured output when the input data is partially provided furthermore we show that the proposed training schemes are complimentary leading to strong object segmentation and labeling performance on birds and the subset of labeled faces in the wild dataset in summary the contribution of the paper is as follows we propose cvae and its variants that are trainable efficiently in the sgvb framework and introduce novel strategies to enhance robustness of the models for 
structured prediction we demonstrate the effectiveness of our proposed algorithm with gaussian stochastic neurons in modeling distribution of structured output variables we achieve strong semantic object segmentation performance on cub and lfw datasets the paper is organized as follows we first review related work in section we provide preliminaries in section and develop our deep conditional generative model in section in section we evaluate our proposed models and report experimental results section concludes the paper related work since the recent success of supervised deep learning on visual recognition there have been many approaches to tackle computer vision tasks such as object detection and semantic segmentation using supervised deep learning techniques our work falls into this category of research in developing advanced algorithms for structured output prediction but we incorporate the stochastic neurons to model the conditional distributions of complex output representation whose distribution possibly has multiple modes in this sense our work shares similar motivation to the recent work on image segmentation tasks using hybrid models of crf and boltzmann machine compared to these our proposed model is an system for segmentation using convolutional architecture and achieves significantly improved performance on challenging benchmark tasks along with the recent breakthroughs in supervised deep learning methods there has been progress in deep generative models such as deep belief networks and deep boltzmann machines recently the advances in inference and learning algorithms for various deep generative models significantly enhanced this line of research in particular the variational learning framework of deep directed graphical model with gaussian latent variables variational autoencoder and deep latent gaussian models has been recently developed using the variational lower bound of the as the training objective and the reparameterization trick these models can be easily trained via stochastic optimization our model builds upon this framework but we focus on modeling the conditional distribution of output variables for structured prediction problems here the main goal is not only to model the complex output representation but also to make discriminative prediction in addition our model can effectively handle images by exploiting the convolutional architecture the stochastic neural network sfnn is conditional directed graphical model with combination of deterministic neurons and the binary stochastic neurons the sfnn is trained using the monte carlo variant of generalized em by drawing multiple samples from the proposal distribution and weighing them differently with importance weights although our proposed gaussian stochastic neural network which will be described in section looks similar on surface there are practical advantages in optimization of using gaussian latent variables over the binary stochastic neurons in addition thanks to the recognition model used in our framework it is sufficient to draw only few samples during training which is critical in training very deep convolutional networks preliminary variational the variational vae is directed graphical model with certain types of latent variables such as gaussian latent variables generative process of the vae is as follows set of latent variable is generated from the prior distribution pθ and the data is generated by the generative distribution pθ conditioned on pθ pθ in general parameter estimation of directed graphical 
models is often challenging due to intractable posterior inference. However, the parameters of the VAE can be estimated efficiently in the stochastic gradient variational Bayes (SGVB) framework, where the variational lower bound of the log-likelihood is used as a surrogate objective function. The variational lower bound is written as
$$
\log p_\theta(x) = \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) + \mathbb{E}_{q_\phi(z \mid x)}\big[-\log q_\phi(z \mid x) + \log p_\theta(x, z)\big]
\;\geq\; -\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big].
$$
In this framework, a proposal distribution $q_\phi(z \mid x)$, also known as a recognition model, is introduced to approximate the true posterior $p_\theta(z \mid x)$. Multilayer perceptrons (MLPs) are used to model the recognition and generation models. Assuming Gaussian latent variables, the first term of the bound can be marginalized (computed in closed form), while the second term cannot; instead, the second term is approximated by drawing samples from the recognition distribution $q_\phi(z \mid x)$, and the empirical objective of the VAE with Gaussian latent variables is written as follows:
$$
\widetilde{\mathcal{L}}_{\mathrm{VAE}}(x; \theta, \phi) = -\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x \mid z^{(l)}\big), \quad \text{where } z^{(l)} = g_\phi\big(x, \epsilon^{(l)}\big),\ \epsilon^{(l)} \sim \mathcal{N}(0, I).
$$
Note that the recognition distribution $q_\phi(z \mid x)$ is reparameterized with a deterministic, differentiable function $g_\phi(\cdot, \cdot)$ whose arguments are the data $x$ and the noise variable $\epsilon$. This trick allows error backpropagation through the Gaussian latent variables, which is essential in VAE training since the model is composed of multiple MLPs for the recognition and generation models. As a result, the VAE can be trained efficiently using stochastic gradient descent (SGD).

Deep conditional generative models for structured output prediction. As illustrated in the figure, there are three types of variables in a deep conditional generative model (CGM): input variables $x$, output variables $y$, and latent variables $z$. The conditional generative process of the model is as follows: for a given observation $x$, $z$ is drawn from the prior distribution $p_\theta(z \mid x)$, and the output $y$ is generated from the distribution $p_\theta(y \mid x, z)$. Compared to the baseline CNN, the latent variables allow for modeling multiple modes in the conditional distribution of output variables given the input, making the proposed CGM suitable for modeling one-to-many mappings. The prior of the latent variables is modulated by the input $x$ in our formulation; however, this constraint can be easily relaxed to make the latent variables statistically independent of the input variables, i.e., $p_\theta(z \mid x) = p_\theta(z)$.

Deep CGMs are trained to maximize the conditional log-likelihood. Often the objective function is intractable, and we apply the SGVB framework to train the model. The variational lower bound of the model is written as follows (a complete derivation can be found in the supplementary material):
$$
\log p_\theta(y \mid x) \;\geq\; -\mathrm{KL}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\big) + \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big],
$$
and the empirical lower bound is written as
$$
\widetilde{\mathcal{L}}_{\mathrm{CVAE}}(x, y; \theta, \phi) = -\mathrm{KL}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(y \mid x, z^{(l)}\big),
$$
where $z^{(l)} = g_\phi(x, y, \epsilon^{(l)})$, $\epsilon^{(l)} \sim \mathcal{N}(0, I)$, and $L$ is the number of samples. We call this model the conditional variational autoencoder (CVAE). The CVAE is composed of multiple MLPs, such as the recognition network $q_\phi(z \mid x, y)$, the (conditional) prior network $p_\theta(z \mid x)$, and the generation network $p_\theta(y \mid x, z)$. In designing the network architecture, we build the network components of the CVAE on top of the baseline CNN. Specifically, as shown in the figure, not only the direct input $x$ but also the initial guess made by the CNN is fed into the prior network. Such a recurrent connection has been applied to structured output prediction problems to sequentially update the prediction by revising the previous guess while effectively deepening the convolutional network. We also found that the recurrent connection, even with one iteration, showed significant performance improvement. Details about the network architectures can be found in the supplementary material.

Output inference and estimation of the conditional likelihood. Once the model parameters are learned, we can make a prediction of an output $y$ from an input $x$ by following the generative
process of the cgm to evaluate the model on structured output prediction tasks in testing time we can measure prediction accuracy by deterministic inference without sampling arg maxy pθ although the model is not trained to reconstruct the input our model can be viewed as type of vae that performs of the output variables conditioned on the input at training time alternatively we can draw multiple from the prior distribution and use the average of the posteriors to make prediction arg maxy pθ pθ cnn cgm generation cgm recognition recurrent connection figure illustration of the conditional graphical models cgms the predictive process of output for the baseline cnn the generative process of cgms an approximate inference of also known as recognition process the generative process with recurrent connection another way to evaluate the cgms is to compare the conditional likelihoods of the test data straightforward approach is to draw samples using the prior network and take the average of the likelihoods we call this method the monte carlo mc sampling pθ pθ pθ it usually requires large number of samples for the monte carlo estimation to be accurate alternatively we use the importance sampling to estimate the conditional likelihoods pθ pθ pθ qφ qφ learning to predict structured output although the sgvb learning framework has shown to be effective in training deep generative models the conditional of output variables at training may not be optimal to make prediction at testing in deep cgms in other words the cvae uses the recognition network qφ at training but it uses the prior network pθ at testing to draw samples and make an output prediction since is given as an input for the recognition network the objective at training can be viewed as reconstruction of which is an easier task than prediction the negative kl divergence term in equation tries to close the gap between two pipelines and one could consider allocating more weights on the negative kl term of an objective function to mitigate the discrepancy in encoding of latent variables at training and testing kl qφ kpθ with however we found this approach ineffective in our experiments instead we propose to train the networks in way that the prediction pipelines at training and testing are consistent this can be done by setting the recognition network the same as the prior network qφ pθ and we get the following objective function legsnn log pθ where gθ we call this model gaussian stochastic neural network gsnn note that the gsnn can be derived from the cvae by setting the recognition network and the prior network equal therefore the learning tricks such as reparameterization trick of the cvae can be used to train the gsnn similarly the inference at testing and the conditional likelihood estimation are the same as those of cvae finally we combine the objective functions of two models to obtain hybrid objective lehybrid αlecvae legsnn where balances the two objectives note that when we recover the cvae objective when the trained model will be simply gsnn without the recognition network cvae for image segmentation and labeling semantic segmentation is an important structured output prediction task in this section we provide strategies to train robust prediction model for semantic segmentation problems specifically to learn neural network that can be generalized well to unseen data we propose to train the network with prediction objective and structured input noise if we assume covariance matrix of auxiliary gaussian latent variables to we have 
deterministic counterpart of gsnn which we call gaussian deterministic neural network gdnn training with prediction objective as the image size gets larger it becomes more challenging to make prediction image reconstruction semantic label prediction the approaches have been used in the sense of forming image pyramid for an input but not much for output prediction here we propose to train the network to predict outputs at different scales by doloss loss loss ing so we can make figure prediction prediction of semantic labels figure describes the prediction at different scales and original for the training training with input omission noise adding noise to neurons is widely used technique to regularize deep neural networks during the training similarly we propose simple regularization technique for semantic segmentation corrupt the input data into according to noise process and optimize the network with the the noise process could be arbitrary but for semantic image segmenfollowing objective tation we consider random block omission noise specifically we randomly generate squared mask of width and height less than of the image width and height respectively at random position and set pixel values of the input image inside the mask to this can be viewed as providing more challenging output prediction task during training that simulates block occlusion or missing input the proposed training strategy also is related to the denoising training methods but in our case we inject noise to the input data only and do not reconstruct the missing input experiments we demonstrate the effectiveness of our approach in modeling the distribution of the structured output variables for the proof of concept we create an artificial experimental setting for structured output prediction using mnist database then we evaluate the proposed cvae models on several benchmark datasets for visual object segmentation and labeling such as birds cub and labeled faces in the wild lfw our implementation is based on matconvnet matlab toolbox for convolutional neural networks and adam for adaptive learning rate scheduling algorithm of sgd optimization toy example mnist to highlight the importance of probabilistic inference through stochastic neurons for structured output variables we perform an experiment using mnist database specifically we divide each digit image into four quadrants and take one two or three quadrant as an input and the remaining quadrants as an as we increase the number of quadrants for an output the input to output mapping becomes more diverse in terms of mapping we trained the proposed models cvae gsnn and the baseline deep neural network and compare their performance the same network architecture the mlp with of relus for recognition conditional prior or generation networks followed by gaussian latent variables was used for all the models in various experimental settings the early stopping is used during the training based on the estimation of the conditional likelihoods on the validation set negative cll nn baseline gsnn monte carlo cvae monte carlo cvae importance sampling performance gap per pixel quadrant validation test quadrants validation test quadrants validation test table the negative cll on mnist database we increase the number of quadrants for an input from to the performance gap between cvae importance sampling and nn is reported similar experimental setting has been used in the multimodal learning framework where the and right halves of the digit images are used as two data modalities ground 
ground nn nn cvae cvae figure visualization of generated samples with left quadrant and right quadrants for an input we show in each row the input and the ground truth output overlaid with gray color first samples generated by the baseline nns second and samples drawn from the cvaes rest for qualitative analysis we visualize the generated output samples in figure as we can see the baseline nns can only make single deterministic prediction and as result the output looks blurry and doesn look realistic in many cases in contrast the samples generated by the cvae models are more realistic and diverse in shape sometimes they can even change their identity digit labels such as from to or from to and vice versa we also provide quantitative evidence by estimating the conditional clls in table the clls of the proposed models are estimated in two ways as described in section for the mc estimation we draw samples per example to get an accurate estimate for the importance sampling however samples per example were enough to obtain an accurate estimation of the cll we observed that the estimated clls of the cvae significantly outperforms the baseline nn moreover as measured by the per pixel performance gap the performance improvement becomes more significant as we use smaller number of quadrants for an input which is expected as the mapping becomes more diverse visual object segmentation and labeling birds cub database includes images of birds from species with annotations such as bounding box of birds and segmentation mask later yang et al annotated these images with more segmentation masks by cropping the bird patches using the bounding boxes and resized them into pixels the split proposed in was used in our experiment and for validation purpose we partition the training set into folds and with the mean intersection over union iou score over the folds the final prediction on the test set was made by averaging the posterior from ensemble of networks that are trained on each of the folds separately we increase the number of training examples via data augmentation by horizontally flipping the input and output images we extensively evaluate the variations of our proposed methods such as cvae gsnn and the hybrid model and provide summary results on segmentation mask prediction task in table specifically we report the performance of the models with different network architectures and training methods prediction or training first we note that the baseline cnn already beat the previous that is obtained by the boltzmann machine mmbm pixel accuracy iou with graphcut for even without on top of that we observed significant performance improvement with our proposed deep in terms of prediction accuracy the gsnn performed the best among our proposed models and performed even better when it is trained with hybrid objective function in addition the training section further improves the performance compared to the baseline cnn the proposed deep cgms significantly reduce the prediction error reduction in test accuracy at the expense of more time for finally the performance of our two winning entries gsnn and hybrid on the validation sets are both significantly better than their deterministic counterparts gdnn with less than which suggests the benefit of stochastic latent variables as in the case of baseline cnns we found that using the prediction was consistently better than the counterpart for all our models so we used the prediction by default mean inference time per image ms for cnn and ms for deep cgms measured 
using geforce gtx titan card with matconvnet we provide more information in the supplementary material model training mmbm gloc cnn baseline cnn msc gdnn msc gsnn msc cvae msc hybrid msc gdnn msc ni gsnn msc ni cvae msc ni hybrid msc ni cub val pixel iou cub test pixel iou lfw pixel val pixel test table mean and standard error of labeling accuracy on cub and lfw database the performance of the best or statistically similar to the best performing model models are msc refers prediction training and ni refers the training models cnn baseline gdnn msc ni gsnn msc ni cvae msc ni hybrid msc ni cub val cub test lfw val lfw test table mean and standard error of negative cll on cub and lfw database the performance of the best and statistically similar models are we also evaluate the negative cll and summarize the results in table as expected the proposed cgms significantly outperform the baseline cnn while the cvae showed the highest cll labeled faces in the wild lfw database has been widely used for face recognition and verification benchmark as mentioned in the face images that are segmented and labeled into semantically meaningful region labels hair skin clothes can greatly help understanding of the image through the visual attributes which can be easily obtained from the face shape following region labeling protocols we evaluate the performance of face parts labeling on the subset of lfw database which contains images that are labeled into semantic categories such as hair skin clothes and background we resized images into and used the same network architecture to the one used in the cub experiment we provide summary results of segmentation accuracy in table and the negative cll in table we observe similar trend as previously shown for the cub database the proposed deep cgms outperform the baseline cnn in terms of segmentation accuracy as well as cll however although the accuracies of the cgm variants are higher the performance of gdnn was not significantly behind than those of gsnn and hybrid models this may be because the level of variations in the output space of lfw database is less than that of cub database as the face shapes are more similar and better aligned across examples finally our methods significantly outperform other existing methods which report in or in setting the performance on the lfw segmentation benchmark object segmentation with partial observations we experimented on object segmentation under uncertainties partial input and output observations to highlight the importance of recognition network in cvae and the stochastic neurons for missing value imputation we randomly omit the input pixels at different levels of omission noise and different block sizes and the task is to predict the output segmentation labels for the omitted pixel locations while given the partial labels for the observed input pixels this can also be viewed as segmentation task with noisy or partial observations occlusions to make prediction for cvae with partial output observation yo we perform iterative inference of unobserved output yu and the latent variables in similar fashion to yu pθ yu qφ yo yu input input ground ground cnn cnn cvae cvae figure visualization of the conditionally generated samples first row input image with omission noise noise level block size second row ground truth segmentation third prediction by gdnn and fourth to sixth the generated samples by cvae on cub left and lfw right we report the summary results in table dataset cub iou lfw pixel the cvae performs well even when the 
noise block gdnn cvae gdnn cvae noise level is high where the level size gdnn significantly fails this is because the cvae utilizes the partial segmentation information to iteratively refine the tion of the rest we visualize the ated samples at noise level of in ure the prediction made by the gdnn is blurry but the samples generated by the cvae are sharper while maintaining reasonable shapes this suggests that the cvae can also be potentially useful for table segmentation results with omission noise on teractive segmentation by iteratively cub and lfw database we report the accuracy on the first validation set incorporating partial output labels conclusion modeling distribution of the structured output variables is an important research question to achieve good performance on structured output prediction problems in this work we proposed stochastic neural networks for structured output prediction based on the conditional deep generative model with gaussian latent variables the proposed model is scalable and efficient in inference and learning we demonstrated the importance of probabilistic inference when the distribution of output space has multiple modes and showed strong performance in terms of segmentation accuracy estimation of conditional and visualization of generated samples acknowledgments this work was supported in part by onr grant and nsf career grant we thank nvidia for donating tesla gpu references andrew arora bilmes and livescu deep canonical correlation analysis in icml bengio alain and yosinski deep generative stochastic networks trainable by backprop in icml ciresan giusti gambardella and schmidhuber deep neural networks segment neuronal membranes in electron microscopy images in nips farabet couprie najman and lecun scene parsing with multiscale feature learning purity trees and optimal covers in icml farabet couprie najman and lecun learning hierarchical features for scene labeling pami girshick donahue darrell and malik convolutional networks for accurate object detection and segmentation pami pp goodfellow mirza courville and bengio deep boltzmann machines in nips goodfellow mirza xu ozair courville and bengio generative adversarial nets in nips he zhang ren and sun spatial pyramid pooling in deep convolutional networks for visual recognition in eccv hinton osindero and teh fast learning algorithm for deep belief nets neural computation huang narayana and towards unconstrained face recognition in cvpr workshop on perceptual organization in computer vision huang ramesh berg and labeled faces in the wild database for studying face recognition in unconstrained environments technical report university of massachusetts amherst kae sohn lee and augmenting crfs with boltzmann machine shape priors for image labeling in cvpr kingma and ba adam method for stochastic optimization in iclr kingma mohamed rezende and welling learning with deep generative models in nips kingma and welling variational bayes in iclr krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips larochelle and murray the neural autoregressive distribution estimator jmlr lecun bottou bengio and haffner learning applied to document recognition proceedings of the ieee lee grosse ranganath and ng unsupervised learning of hierarchical representations with convolutional deep belief networks communications of the acm li tarlow and zemel exploring compositional high order pattern potentials for structured output learning in cvpr long shelhamer and darrell fully 
convolutional networks for semantic segmentation in cvpr pinheiro and collobert recurrent convolutional neural networks for scene parsing in icml rezende mohamed and wierstra stochastic backpropagation and approximate inference in deep generative models in icml salakhutdinov and hinton deep boltzmann machines in aistats sermanet eigen zhang mathieu fergus and lecun overfeat integrated recognition localization and detection using convolutional networks in iclr simonyan and zisserman very deep convolutional networks for image recognition in iclr sohn shang and lee improved multimodal deep learning with variation of information in nips srivastava hinton krizhevsky sutskever and salakhutdinov dropout simple way to prevent neural networks from overfitting jmlr szegedy liu jia sermanet reed anguelov erhan vanhoucke and rabinovich going deeper with convolutions in cvpr szegedy toshev and erhan deep neural networks for object detection in nips tang and salakhutdinov learning stochastic feedforward neural networks in nips vedaldi and lenc matconvnet convolutional neural networks for matlab in acmmm vincent larochelle bengio and manzagol extracting and composing robust features with denoising autoencoders in icml wang ai and tang what are good parts for hair shape modeling in cvpr welinder branson mita wah schroff belongie and perona birds technical report california institute of technology yang and yang boltzmann machines for object segmentation in cvpr 
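The CVAE objective described above — a recognition network $q_\phi(z \mid x, y)$, a conditional prior $p_\theta(z \mid x)$, a generation network $p_\theta(y \mid x, z)$, and a Monte Carlo estimate of the variational lower bound trained with the reparameterization trick — can be sketched in a few dozen lines. The following is a minimal illustrative PyTorch implementation for vector-valued inputs and binary outputs; it is not the authors' implementation (which the paper states uses MatConvNet and convolutional architectures). The MLP architecture, the Bernoulli output likelihood, the single-sample ($L = 1$) estimate, and all names and dimensions are assumptions made only for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    """Minimal conditional VAE: q(z|x,y) recognition, p(z|x) prior, p(y|x,z) generation."""
    def __init__(self, x_dim, y_dim, z_dim=16, h_dim=128):
        super().__init__()
        self.recognition = nn.Sequential(nn.Linear(x_dim + y_dim, h_dim), nn.ReLU(),
                                         nn.Linear(h_dim, 2 * z_dim))
        self.prior = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                   nn.Linear(h_dim, 2 * z_dim))
        self.generation = nn.Sequential(nn.Linear(x_dim + z_dim, h_dim), nn.ReLU(),
                                        nn.Linear(h_dim, y_dim))

    @staticmethod
    def gaussian_params(h):
        mu, log_var = h.chunk(2, dim=-1)
        return mu, log_var

    def forward(self, x, y):
        # Recognition q_phi(z | x, y) and conditional prior p_theta(z | x).
        mu_q, logvar_q = self.gaussian_params(self.recognition(torch.cat([x, y], dim=-1)))
        mu_p, logvar_p = self.gaussian_params(self.prior(x))
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I), single sample (L = 1).
        z = mu_q + torch.exp(0.5 * logvar_q) * torch.randn_like(mu_q)
        logits = self.generation(torch.cat([x, z], dim=-1))
        # Reconstruction term: log p_theta(y | x, z) for Bernoulli outputs.
        recon = -F.binary_cross_entropy_with_logits(logits, y, reduction="none").sum(-1)
        # Analytic KL( q_phi(z|x,y) || p_theta(z|x) ) between two diagonal Gaussians.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1.0).sum(-1)
        elbo = recon - kl
        return -elbo.mean()  # negative lower bound, to be minimized

    @torch.no_grad()
    def predict(self, x):
        # One way to realize the deterministic inference described above: decode from the prior mean.
        mu_p, _ = self.gaussian_params(self.prior(x))
        return torch.sigmoid(self.generation(torch.cat([x, mu_p], dim=-1)))

# Toy usage on random data.
x = torch.randn(32, 20)
y = (torch.rand(32, 5) > 0.5).float()
model = CVAE(x_dim=20, y_dim=5)
loss = model(x, y)
loss.backward()
print(float(loss), model.predict(x[:2]).shape)
```

Tying the recognition network to the prior network (i.e., replacing `mu_q, logvar_q` with `mu_p, logvar_p` in this sketch) corresponds to the GSNN variant discussed in the paper, and a weighted sum of the two losses gives the hybrid objective.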
recommendation from recurrent user activities nan yichen niao le college of computing georgia tech milton stewart school of industrial system engineering georgia tech dunan lsong abstract by making personalized suggestions recommender system is playing crucial role in improving the engagement of users in modern however most recommendation algorithms do not explicitly take into account the temporal behavior and the recurrent activities of users two central but less explored questions are how to recommend the most desirable item at the right moment and how to predict the next returning time of user to service to address these questions we propose novel framework which connects point processes and models to capture the recurrent temporal patterns in large collection of consumption pairs we show that the parameters of the model can be estimated via convex optimization and furthermore we develop an efficient algorithm that maintains convergence rate scales up to problems with millions of pairs and hundreds of millions of temporal events compared to other in both synthetic and real datasets our model achieves superb predictive performance in the two recommendation tasks finally we point out that our formulation can incorporate other extra context information of users such as profile textual and spatial features introduction delivering personalized user experiences is believed to play crucial role in the engagement of users to modern for example making recommendations on proper items at the right moment can make personal assistant services on mainstream mobile platforms more competitive and usable since people tend to have different activities depending on the contexts such as morning evening weekdays weekend see for example figure unfortunately most existing recommendation techniques are mainly optimized at predicting users onetime preference often denoted by integer ratings on items while users continuously preferences remain largely under explored besides traditional user feedback signals ratings etc have been increasingly argued to be ineffective to represent real engagement of users due to the sparseness and nosiness of the data the temporal patterns at which users return to the services items thus becomes more relevant metric to evaluate their satisfactions furthermore successful predictions of the returning time not only allows service to keep track of the evolving user preferences but also helps service provider to improve their marketing strategies for most web companies if we can predict when users will come back next we could make ads bidding more economic allowing marketers to bid on time slots after all marketers need not blindly bid all time slots indiscriminately in the context of modern electronic health record data patients may have several diseases that have complicated dependencies on each other shown at the bottom of figure the occurrence of one disease could trigger the progression of another predicting the returning time on certain disease can effectively help doctors to take proactive steps to reduce the potential risks however since most models in literature are particularly optimized for predicting ratings user church time predict the next activity at time grocery patient disease tim time time next event prediction disease time predictions from recurrent events model figure recommendation in the top figure one wants to predict the most desirable activity at given time for user in the bottom figure one wants to predict the returning time to particular disease of patient 
the sequence of events induced from each pair is modeled as temporal point process along time exploring the recurrent temporal dynamics of users returning behaviors over time becomes more imperative and meaningful than ever before although the aforementioned applications come from different domains we seek to capture them in unified framework by addressing the following two related questions how to recommend the most relevant item at the right moment and how to accurately predict the next returningtime of users to existing services more specifically we propose novel convex formulation of the problems by establishing an under explored connection between point processes and models we also develop new optimization algorithm to solve the low rank point process estimation problem efficiently our algorithm blends proximal gradient and conditional gradient methods and achieves the optimal convergence rate as further demonstrated by our numerical experiments the algorithm scales up to millions of pairs and hundreds of millions of temporal events and achieves superb predictive performance on the two problems on both synthetic and real datasets furthermore our model can be readily generalized to incorporate other contextual information by making the intensity function explicitly depend on the additional spatial textual categorical and user profile information related work the very recent work of kapoor et al is most related to our approach they attempt to predict the returning time for music streaming service based on survival analysis and hidden model although these methods explicitly consider the temporal dynamics of pairs major limitation is that the models can not generalize to recommend any new item in future time which is crucial difference compared to our approach moreover survival analysis is often suitable for modeling single terminal event such as infection and death by assuming that the time to be independent however in many cases this assumption might not hold background on temporal point processes this section introduces necessary concepts from the theory of temporal point processes temporal point process is random process of which the realization is sequence of events ti with ti and abstracted as points on the time line let the history be the list of event time tn up to but not including the current time an important way to characterize temporal point processes is via the conditional intensity function which is the stochastic model for the next event time given all previous events within small window dt dt event in dt is the probability for the occurrence of new event given the history the functional form of the intensity is designed to capture the phenomena of interests for instance homogeneous poisson process has constant intensity over time which is independent of the history the gap thus conforms to the exponential distribution with the mean being alternatively for an inhomogeneous poisson process its intensity function is also assumed to be independent of the history but can be simple function of time given sequence of events tn for any tn we characterize the conditional probability that no event happens during tn and the conditional density that an event occurs at time as exp tn dτ and then given sequence of events tn we express its likelihood by tn ti exp dτ ti low rank hawkes processes in this section we present our model in terms of hawkes processes discuss its possible extensions and provide solutions to our proposed recommendation problems modeling recurrent user activities 
with hawkes processes figure highlights the basic setting of our model for each observed pair we model the occurrences of user past consumption events on item as hawkes process with the intensity ti ti where ti is the triggering kernel capturing temporal dependencies scales the magnitude of the influence of each past event is baseline intensity and the summation of the kernel terms is history dependent and thus stochastic process by itself we have twofold rationale behind this modeling choice first the baseline intensity captures users inherent and preferences to items regardless of the history second the triggering kernel ti quantifies how the influence from each past event evolves over time which makes the intensity function depend on the history thus hawkes process is essentially conditional poisson process in the sense that conditioned on the history the hawkes process is poisson process formed by the superposition of background homogeneous poisson process with the intensity and set of inhomogeneous poisson processes with the intensity ti however because the events in the past can affect the occurrence of the events in future the hawkes process in general is more expressive than poisson process which makes it particularly useful for modeling repeated activities by keeping balance between the long and the short term aspects of users preferences transferring knowledge with low rank models so far we have shown modeling sequence of events from single pair since we can not observe the events from all pairs the next step is to transfer the learned knowledge to unobserved pairs given users and items we represent the intensity function between user and item as λu λu and αu are the entry tj where tu of the base intensity matrix and the matrix respectively however the two matrices of coefficients and contain too many parameters since it is often believed that users behaviors and items attributes can be categorized into limited number of prototypical types we assume that and have structures that is the nuclear norms of these parameter matrices are small some researchers also explicitly assume that the two matrices factorize into products of low rank factors here we assume the above nuclear norm constraints in order to obtain convex parameter estimation procedures later triggering kernel parametrization and extensions because it is only required that the triggering kernel should be nonnegative and bounded feature in often has analytic forms when tu belongs to many flexible parametric families such as the weibull and distributions for the simplest case tu takes the exponential form tj exp tj alternatively we can make the intensity function λu depend on other additional context information associated with each event for instance we can make the base intensity depend on and we might also extend and into tensors to incorporate the location information furthermore we can even learn the triggering kernel directly using nonparametric methods without loss of generality we stick with the exponential form in later sections recommendation once we have learned and we are ready to solve our proposed problems as follows item recommendation at any given time for each pair because the intensity function λu indicates the tendency that user will consume item at time for each user we recommend the proper items by the following procedures calculate λu for each item sort the items by the descending order of λu return the items prediction for each pair the intensity function λu dominates the point patterns along time 
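As a concrete illustration of the two procedures, a minimal sketch follows (a set of assumptions, not the authors' implementation): items are ranked by the intensity λ_{u,i}(t), and the returning time, discussed next, is predicted as the expectation of the next event, approximated by averaging samples drawn with Ogata's thinning algorithm. The matrices `eta` (base intensities) and `alpha` (self-excitation coefficients) are assumed to be already estimated, nonnegative, and with strictly positive base intensities; the exponential triggering kernel, the bandwidth, and all names are illustrative.

```python
import numpy as np

def exp_kernel(dt, bandwidth=1.0):
    """Exponential triggering kernel gamma(dt) = exp(-dt / bandwidth), for dt >= 0."""
    return np.exp(-np.asarray(dt, dtype=float) / bandwidth)

def intensity(t, history, eta_ui, alpha_ui, bandwidth=1.0):
    """lambda_{u,i}(t) = eta_{u,i} + alpha_{u,i} * sum_{t_j < t} gamma(t - t_j)."""
    past = np.asarray(history, dtype=float)
    past = past[past < t]
    return eta_ui + alpha_ui * exp_kernel(t - past, bandwidth).sum()

def recommend(u, t, eta, alpha, histories, top_k=5, bandwidth=1.0):
    """Rank items for user u at time t by descending intensity and return the top ones."""
    n_items = eta.shape[1]
    scores = np.array([intensity(t, histories.get((u, i), []), eta[u, i], alpha[u, i], bandwidth)
                       for i in range(n_items)])
    return np.argsort(-scores)[:top_k]

def predict_return_time(t_now, history, eta_ui, alpha_ui, bandwidth=1.0,
                        n_samples=200, rng=None):
    """Approximate E[next event time] by a sample average over Ogata-thinning draws."""
    rng = np.random.default_rng() if rng is None else rng
    past = np.asarray(history, dtype=float)
    # With an exponential kernel and a fixed history, the intensity decays after t_now,
    # so its value at t_now (counting all past events) upper-bounds lambda(t) for t >= t_now.
    lam_bound = eta_ui + alpha_ui * exp_kernel(np.maximum(t_now - past, 0.0), bandwidth).sum()
    samples = []
    for _ in range(n_samples):
        t = t_now
        while True:
            t += rng.exponential(1.0 / lam_bound)   # candidate from the bounding Poisson process
            if rng.random() <= intensity(t, past, eta_ui, alpha_ui, bandwidth) / lam_bound:
                samples.append(t)                   # accept with probability lambda(t) / lam_bound
                break
    return float(np.mean(samples))
```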
given the history we calculate the rt density of the next event time by λu exp tn λu dt so we can use the expectation to predict the next event unfortunately this expectation often does not have analytic forms due to the complexity of λu for hawkes process so we approximate the as following draw samples tm by ogata thinning algorithm pm estimate the by the sample average parameter estimation having presented our model in this section we develop new algorithm which blends proximal gradient and conditional gradient methods to learn the model efficiently convex formulation let be the set of events induced between and we express the of observing each sequence based on equation as log wu φu wu tj where wu φu tu tu tu and tk rt tu tu tu can be exj dt when tj is the exponential kernel pressed as tu exp tj then the of observing all event sequences is simply summation of each individual term by finally we can have the following convex formulation subject to opt min where the matrix nuclear norm which is summation of all singular values is commonly used as convex surrogate for the matrix rank function one solution to is proposed in based on admm however the algorithm in requires at each iteration full svd for computing the proximal operator which is often prohibitive with large matrices alternatively we might turn to more efficient conditional gradient algorithms which require instead the much cheaper linear minimization oracles however the constraints in our problem prevent the linear minimization from having simple analytical solution alternative formulation the difficulty of directly solving the original formulation is caused by the fact that the nonnegative constraints are entangled with the nuclear norm penalty to address this challenge we approximate using simple penalty method specifically given we arrive at the next formulation by introducing two auxiliary variables and with some penalty function such as the squared frobenius norm opt min ρka subject to algorithm learning algorithm proxu ηk input output choose to initialize and set for do proxu ηk lmoψ end ηk algorithm lmoψ top singular vector pairs of and find and by solving we show in theorem that when is properly chosen these two formulations lead to the same optimum see appendix for the complete proof more importantly the new formulation allows us to handle the constraints and nuclear norm regularization terms separately of the problem coincides with the theorem with the condition the optimal value opt optimal value opt in the problem of interest where is problem dependent threshold max efficient optimization proximal method meets conditional gradient now we are ready to present algorithm for solving efficiently denote and we use the bracket notation to represent the respective part for simplicity let ρka ρkλ the course of our action is straightforward at each iteration we apply cheap projection gradient for block and for block and maintain three interdependent sequences cheap linear minimization and based on the accelerated scheme in to be more specific the algorithm consists of two main subroutines proximal gradient when updating we compute directly the associated proximal operator which in our case reduces to the simple projection ηk where simply sets the negative coordinates to zero conditional gradient when updating instead of computing the proximal operator we call the linear minimization oracle lmoψ argmin hpk where pk is the partial derivative with respect to and we do similar updates for the overall performance clearly depends 
on the efficiency of this lmo which can be solved efficiently in our case as illustrated in algorithm following the linear minimization for our situation requires only computing hpk where the minimizer is readily given by and are the top singular vectors of and ii conducting that produces scaling factor λδ where the quadratic problem admits solution and thus can be computed efficiently we repeat the same process for updating accordingly convergence analysis denote as the objective in formulation where we establish the following convergence results for algorithm described above when solving formulation please refer to appendix for complete proof theorem let be the sequence generated by algorithm by setting and then for we have opt where corresponds to the lipschitz constant of and and are some problem dependent constants remark let denote the objective in formulation which is the original problem of our interest by invoking theorem we further have opt the analysis builds upon the recursions from proximal gradient and conditional gradient methods as result the overall convergence rate comes from two parts as reflected in interestingly one can easily see that for both the proximal and the conditional gradient parts we achieve the respective optimal convergence rates when there is no nuclear norm regularization term the results recover the optimal rate achieved by proximal gradient method for smooth convex optimization when there is no nonnegative constraint the results recover the rate attained by conditional gradient method for smooth convex minimization when both nuclear norm and are in present the proposed algorithm up to our knowledge is first of its kind that achieves the best of both worlds which could be of independent interest experiments we evaluate our algorithm by comparing with competitors on both synthetic and real datasets for each user we randomly pick of all the items she has consumed and hold out the entire sequence of events besides for each sequence of the other items we further split it into pair of subsequences for each testing event we evaluate the predictive accuracy on two tasks item recommendation suppose the testing event belongs to the pair ideally item should rank top at the testing moment we record its predicted rank among all items smaller value indicates better performance prediction we predict the from the learned intensity function and compute the absolute error with respect to the true time we repeat these two evaluations on all testing events because the predictive tasks on those entirely sequences are much more challenging we report the total mean absolute error mae and that specific to the set of entirely heldout sequences separately competitors poisson process is relaxation of our model by assuming each pair has only constant base intensity regardless of the history for task it gives static ranks regardless of the time for task it produces an estimate of the average gaps in many cases the poisson process is hard baseline in that the most popular items often have large base intensity and recommending popular items is often strong heuristic stic fits markov model to each observed pair since it can only make recommendations specific to the few observed items visited before instead of the large number of new items we only evaluate its performance on the returning time prediction task for the set of entirely sequences we use the average predicted time from each observed item as the final prediction svd is the classic matrix factorization model the implicit 
user feedback is converted into an explicit rating using the frequency of item consumptions since it is not designed for predicting the returning time we report its performance on the recommendation task as reference tensor factorization generalizes matrix factorization to include time we compare with the method which considers poisson regression as the loss function to fit the number of events in each discretized time slot and shows better performance compared to other alternatives with the squared loss we report the performance by using the parameters fitted only in the last interval and using the average parameters over all time intervals we denote these two variants with varying number of intervals as and parameters parameters parameters mae mae mae iterations convergence by iterations methods hawkes poisson svd mae convergence by events methods events convergence by time events mae hawkes poisson stic entries scalability heldout groups total item recommendation heldout groups total prediction figure estimation error by iterations by entries events per entry and by events per entry entries scalability by entries events per entry iterations mae of the predicted ranking and mae of the predicted returning time results synthetic data we generate two matrices and with rank five as the for each pair we simulate events by ogata thinning algorithm with an exponential triggering kernel and get million events in total the bandwidth for the triggering kernel is fixed to one by theorem it is inefficient to directly estimate the exact value of the threshold value for instead we tune and to give the best performance how does our algorithm converge figure shows that it only requires few hundred iterations to descend to decent error for both and indicating algorithm converges very fast since the true parameters are figure verify that it only requires modest number of observed entries each of which induces small number of events to achieve good estimation performance figure further illustrates that algorithm scales linearly as the training set grows what is the predictive performance figure confirm that algorithm achieves the best predictive performance compared to other baselines in figure all temporal methods outperform the static svd since this classic baseline does not consider the underlying temporal dynamics of the observed sequences in contrast although the poisson regression also produces static rankings of the items it is equivalent to recommending the most popular items over time this simple heuristic can still give competitive performance in figure since the occurrence of new event depends on the whole past history instead of the last one the performance of stic deteriorates vastly the other tensor methods predict the returning time with the information from different time intervals however because our method automatically adapts different contributions of each past event to the prediction of the next event it can achieve the best prediction performance overall real data we also evaluate the proposed method on real datasets consists of the music streaming logs between users and artists there are around observed pairs with more than one million events in total contains around shopping events between users and stores the unit time for both dataset is hour mimic ii medical dataset is collection of clinical visit records of intensive care unit patients for seven years we filtered out patients and diseases each event records the time when patient was diagnosed with specific disease the time unit 
is a week. All model parameters, the kernel bandwidth, and the latent rank of the other baselines are tuned to give the best performance.

Does the history help? Because the true temporal dynamics governing the event patterns are unobserved, we first investigate whether our model assumption is reasonable. Our Hawkes model considers the effects of past user activities, while the survival analysis applied in prior work assumes independent inter-event gaps, which might conform to an exponential (Poisson process) or Rayleigh distribution. According to the time-change theorem, given a sequence of events T = {t_1, ..., t_n} and a particular point process with intensity λ(t), the set of samples {∫_{t_{i-1}}^{t_i} λ(t) dt} should conform to a unit-rate exponential distribution if T is truly sampled from that process. Therefore, we compare the theoretical quantiles of the exponential distribution with the fittings of the different models to a real sequence of events: the closer the slope is to one, the better the model matches the event patterns.

[Figure: quantile plots of the fitted processes (Hawkes, Poisson, STIC, Rayleigh; theoretical quantiles vs. quantiles of real data), together with the MAE of the predicted rankings and of the predicted returning times, reported for the held-out and total groups on the two real-world datasets (top, middle) and MIMIC II (bottom).]

The quantile plots clearly show that our Hawkes model explains the observed data better than the other survival-analysis models.

What is the predictive performance? Finally, we evaluate the prediction accuracy in the second and third columns of the figure. Since an entirely held-out testing sequence is more challenging, the performance on the held-out group is a little lower than that on the total group. However, across all cases, because the proposed model better captures the temporal dynamics of the observed event sequences, it achieves better performance on both tasks.

Conclusions. We propose a novel convex formulation and an efficient learning algorithm to recommend relevant services at any given moment and to predict the next returning time of users to existing services. Empirical evaluations on large synthetic and real data demonstrate superior scalability and predictive performance. Moreover, our optimization algorithm can be used to solve general nonnegative matrix rank minimization problems with other convex losses under mild assumptions, which may be of independent interest.

Acknowledgements. The research was supported in part by NSF BIGDATA and NSF CAREER grants.

References
Aalen, Borgan, and Gjessing. Survival and Event History Analysis: A Process Point of View. Springer.
Baltrunas and Amatriain. Towards recommendation based on implicit feedback.
Chi and Kolda. On tensors, sparsity, and nonnegative factorizations.
Cox and Isham. Point Processes. Chapman & Hall.
Cox and Lewis. Multivariate point processes. In Selected Statistical Papers of Sir David Cox: Volume, Design of Investigations, Statistical Methods and Applications.
Daley and Vere-Jones. An Introduction to the Theory of Point Processes, Volume II: General Theory and Structure. Springer.
Du, Farajtabar, Ahmed, Smola, and Song. Dirichlet-Hawkes processes with applications to clustering continuous-time document streams. In KDD.
Du, Song, Smola, and Yuan. Learning networks of heterogeneous influence. In Advances in Neural Information Processing Systems, pages
du song woo and zha uncover information diffusion networks in artificial intelligence and statistics aistats hawkes spectra of some and mutually exciting point processes biometrika kapoor subbian srivastava and schrater just in time recommendations modeling the dynamics of boredom in activity streams wsdm pages kapoor sun srivastava and ye hazard based approach to user return time prediction in kdd pages karatzoglou amatriain baltrunas and oliver multiverse recommendation tensor factorization for collaborative filtering in proceeedings of the acm conference on recommender systems recsys kingman on doubly stochastic poisson processes mathematical proceedings of the cambridge philosophical society pages koenigstein dror and koren yahoo music recommendations modeling music ratings with temporal dynamics and item taxonomy in proceedings of the fifth acm conference on recommender systems recsys pages koren collaborative filtering with temporal dynamics in knowledge discovery and data mining kdd pages lan an optimal method for stochastic composite optimization mathematical programming lan the complexity of convex programming under linear optimization oracle arxiv preprint ogata on lewis simulation method for point processes information theory ieee transactions on ouyang he tran and gray stochastic alternating direction method of multipliers in icml preeti bhargava thomas phan who what when and where collaborative recommendations using tensor factorization on sparse data in www wang chen ghosh denny kho chen malin and sun rubik knowledge guided tensor factorization and completion for health data analytics in kdd rendle factorization models ranking with factorization models volume of studies in computational intelligence chapter pages sastry some problems in linear algebra honors projects xiong chen huang schneider and carbonell temporal collaborative filtering with bayesian probabilistic tensor factorization in sdm pages siam yi hong zhong liu and rajan beyond clicks dwell time for personalization in proceedings of the acm conference on recommender systems recsys pages yu ma yu carbonell and sra efficient structured matrix rank minimization in nips zaid harchaoui anatoli juditsky conditional gradient algorithms for smooth convex optimization mathematical programming zhou zha and song learning social infectivity in sparse networks using multidimensional hawkes processes in artificial intelligence and statistics aistats zhou zha and song learning triggering kernels for hawkes processes in international conference on machine learning icml 
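As an addendum to the estimation section above, here is a schematic sketch (under stated assumptions, not the authors' implementation) of one hybrid update of the kind described there: a projected-gradient step for the nonnegatively constrained block, and a Frank-Wolfe (conditional-gradient) step whose linear minimization oracle over a nuclear-norm ball needs only the top singular-vector pair, far cheaper than a full SVD. The quadratic toy loss stands in for the Hawkes log-likelihood; the step size, the radius `beta`, and all names are assumptions.

```python
import numpy as np

def top_singular_pair(G, n_iter=50):
    """Approximate the leading singular vectors of G by power iteration on G^T G."""
    v = np.random.default_rng(0).standard_normal(G.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = G.T @ (G @ v)
        v /= np.linalg.norm(v)
    u = G @ v
    u /= np.linalg.norm(u)
    return u, v

def lmo_nuclear_ball(G, beta):
    """argmin_{||S||_* <= beta} <G, S> = -beta * u v^T, with (u, v) the top singular pair of G."""
    u, v = top_singular_pair(G)
    return -beta * np.outer(u, v)

def hybrid_step(X, Y, grad_X, grad_Y, step, beta, k):
    """One iteration: projection step on the block X >= 0, conditional-gradient step on Y."""
    X_new = np.maximum(X - step * grad_X, 0.0)   # prox of the nonnegativity constraint
    S = lmo_nuclear_ball(grad_Y, beta)           # cheap rank-one atom
    gamma = 2.0 / (k + 2.0)                      # standard Frank-Wolfe step size
    Y_new = (1.0 - gamma) * Y + gamma * S
    return X_new, Y_new

# Toy usage: pull both blocks toward a target matrix under their respective constraints.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    target = rng.standard_normal((20, 30))
    X = np.zeros_like(target)
    Y = np.zeros_like(target)
    for k in range(100):
        X, Y = hybrid_step(X, Y, grad_X=X - target, grad_Y=Y - target,
                           step=0.5, beta=10.0, k=k)
```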
learning stationary time series using gaussian processes with nonparametric kernels felipe tobar ftobar center for mathematical modeling universidad de chile thang bui department of engineering university of cambridge richard turner department of engineering university of cambridge abstract we introduce the gaussian process convolution model gpcm nonparametric generative procedure to model stationary signals as the convolution between process and linear filter drawn from gaussian process the gpcm is nonparametricwindow moving average process and conditionally is itself gaussian process with nonparametric kernel defined in probabilistic fashion the generative model can be equivalently considered in the frequency domain where the power spectral density of the signal is specified using gaussian process one of the main contributions of the paper is to develop novel variational freeenergy approach based on inducing variables that efficiently learns the linear filter and infers the driving process in turn this scheme provides probabilistic estimates of the covariance kernel and the signal both in denoising and prediction scenarios additionally the variational inference procedure provides expressions for the approximate posterior of the spectral density given the observed data leading to new bayesian nonparametric approaches to spectrum estimation the proposed gpcm is validated using synthetic and signals introduction gaussian process gp regression models have become standard tool in bayesian signal estimation due to their expressiveness robustness to overfitting and tractability gp regression begins with prior distribution over functions that encapsulates priori assumptions such as smoothness stationarity or periodicity the prior is then updated by incorporating information from observed data points via their likelihood functions the result is posterior distribution over functions that can be used for prediction critically for this work the posterior and therefore the resultant predictions is sensitive to the choice of prior distribution the form of the prior covariance function or kernel of the gp is arguably the central modelling choice employing simple form of covariance will limit the gp capacity to generalise the ubiquitous radial basis function or squared exponential kernel for example implies prediction is just local smoothing operation expressive kernels are needed but although kernel design is widely acknowledged as pivotal it typically proceeds via black art in which particular functional form is using intuitions about the application domain to build kernel using simpler primitive kernels as building blocks recently some sophisticated automated approaches to kernel design have been developed that construct kernel mixtures on the basis of incorporating different measures of similarity or more generally by both adding and multiplying kernels thus mimicking the way in which human would search for the best kernel alternatively flexible parametric kernel can be used as in the case of the spectral mixture kernels where the power spectral density psd of the gp is parametrised by mixture of gaussians we see two problems with this general approach the first is that computational tractability limits the complexity of the kernels that can be designed in this way such constraints are problematic when searching over kernel combinations and to lesser extent when fitting potentially large numbers of kernel hyperparameters indeed many naturally occurring signals contain more complex structure than can 
comfortably be entertained using current methods time series with complex spectra like sounds being case in point the second limitation is that hyperparameters of the kernel are typically fit by maximisation of the model marginal likelihood for complex kernels with large numbers of hyperparameters this can easily result in overfitting rearing its ugly head once more see sec this paper attempts to remedy the existing limitations of gps in the time series setting using the same rationale by which gps were originally developed that is kernels themselves are treated nonparametrically to enable flexible forms whose complexity can grow as more structure is revealed in the data moreover approximate bayesian inference is used for estimation thus problems with model structure search and protecting against overfitting these benefits are achieved by modelling time series as the output of linear and system defined by convolution between process and linear filter by considering the filter to be drawn from gp the expected statistics and as consequence the spectral density of the output signal are defined in nonparametric fashion the next section presents the proposed model its relationship to gps and how to sample from it in section we develop an analytic approximate inference method using variational approximations for performing inference and learning section shows simulations using both synthetic and datasets finally section presents discussion of our findings regression model convolving linear filter and process we introduce the gaussian process convolution model gpcm which can be viewed as constructing distribution over functions using generative model in the first stage continuous filter function is drawn from gp with covariance function kh in the second stage the function is produced by convolving the filter with continuous time whitenoise the can be treated informally as draw from gp with gp kh gp σx dτ this family of models can be motivated from several different perspectives due to the ubiquity of linear systems first the model relates to linear lti systems the process is the input to the lti system the function is the system impulse response which is modelled as draw from gp and is its output in this setting as an lti system is entirely characterised by its impulse response model design boils down to identifying suitable function second perspective views the model through the lens of differential equations in which case can be considered to be the green function of system defined by linear differential equation that is driven by in this way the prior over implicitly defines prior over the coefficients of linear differential equations of potentially infinite order third the gpcm can be thought of as generalisation of the moving average process in which the window is potentially infinite in extent and is produced by gp prior fourth perspective relates the gpcm to standard gp models consider the filter to be known in this case the process is distributed according to gp since is linear combination of gaussian random variables the mean function mf and covariance function kf rof the random variable are then stationary and given by mf dτ and kf kf ds here we use informal notation common in the gp literature more formal treatment would use stochastic integral notation which replaces the differential element dτ dw so that eq becomes stochastic integral equation the brownian motion that is the convolution between the filter and its mirrored version with respect to see sec of the supplementary material 
for the full derivation since is itself is drawn from nonparametric prior the presented model through the relationship above induces prior over nonparametric kernels particular case is obtained when is chosen as the basis expansion of reproducing kernel hilbert space with parametric kernel the squared exponential kernel whereby kf becomes such kernel fifth perspective considers the model in the frequency domain rather than the time domain here the linear filter shapes the spectral content of the input process as is it has positive psd at all frequencies which can potentially influence more precisely since the psd of is given by the fourier transform of the covariance function by the theorem the model places nonparametric prior over the psd given by kf kf dt where dt is the fourier transform of the filter armed with these different theoretical perspectives on the gpcm generative model we next focus on how to design appropriate covariance functions for the filter sensible and tractable priors over the filter function signals have finite power which relates to the stability of the system and potentially complex spectral content how can such knowledge be built into the filter covariance function kh to fulfil these conditions we model the linear filter as draw from squared exponential gp that is multiplied by gaussian window centred on zero in order to restrict its extent the resulting decaying squared exponential dse covariance function is given by squared exponential se covariance and by and respectively that is kh kdse σh with the gp priors for and is stationary and has variance consequently by chebyshev inequality is stochastically bounded that is pr σx σh hence the exponential decay of kdse controlled by plays key role in the finiteness of the integral in eq and consequently of additionally the dse model for the filter provides flexible prior distribution over linear tems where the hyperparameters have physical meaning σh controls the power of the output is the characteristic timescale over which the filter varies that in turn determines the typical frequency content of the system finally is the temporal extent of the filter which controls the length of time correlations in the output signal and equivalently the bandwidth characteristics in the frequency domain although the covariance function is flexible its gaussian form facilitates analytic computation that will be leveraged when approximately sampling from the and performing inference in principle it is also possible in the framework that follows to add causal structure into the covariance function so that only causal filters receive prior probability density but we leave that extension for future work sampling from the model exact sampling from the proposed model in eq is not possible since it requires computation of the convolution between infinite dimensional processes and it is possible to make some analytic progress by considering instead the gp formulation of the gpcm in eq and noting that sampling gp kf only requires knowledge of kf and therefore avoids explicit representation of the troublesome process further progress requires approximation the first key insight is that can be sampled at finite number of locations tnh using multivariate gaussian and then exact analytic inference can be performed to infer the entire function via noiseless gp regression moreover since the filter is drawn from the dse kernel gp kdse it is with high probability temporally limited in extent and smoothly varying therefore relatively small number of 
samples nh can potentially enable accurate estimates of the second key insight is that it is possible when using the dse kernel to analytically compute the expected value of the covariance of kf kf as well as the uncertainty in this quantity the more values the latent process we consider the lower the uncertainty in and as consequence kf kf almost surely this is an example of bayesian numerical integration method since the approach maintains knowledge of its own inaccuracy in more detail the kernel approximation kf is given by kf dτ dτ ng kdse dτ kdse tr kdse ts dτ mr where mr is the th entry of the matrix hht kdse the kernel approximation and its fourier transform the psd can be calculated in closed form see sec in the supplementary material fig illustrates the generative process of the proposed model ke ne ilt lat nt pr oc obs vat ions ppr ox tr ue ke ne signal ime ample im ample fr ue nc he ime ample figure sampling from the proposed regression model from left to right filter kernel power spectral density and sample of the output inference and learning using variational methods one of the main contributions of this paper is to devise computationally tractable method for learning the filter known as system identification in the control community and inferring the process from noisy dataset rn produced by their convolution and additive gaussian noise dτ performing inference and learning is challenging for three reasons first the convolution means that each observed datapoint depends on the entire unknown filter and process which are infinitedimensional functions second the model is in the unknown functions since the filter and the multiply one another in the convolution third must be handled with care since formally it is only inside integrals we propose variational approach that addresses these three problems first the convolution is made tractable by using variational inducing variables that summarise the infinite dimensional latent functions into finite dimensional inducing points this is the same approach that is used for scaling gp regression second the product is made tractable by using structured meanfield approximation and leveraging the fact that the posterior is conditionally gp when or is fixed third the direct representation of process is avoided by considering set of inducing variables instead which are related to via an integral transformation inducing variables we outline the approach below in order to form the variational approximation we first expand the model with additional variables we use to denote the set of all integral transformations of with members ux dτ which includes the original process when and identically define the set with members uh dτ the variational lower bound of the model evidence can be applied to this augmented using jensen inequality dhdx log log dhdx log this formulation can be made technically rigorous for latent functions but we do not elaborate on that here to simplify the exposition here is any variational distribution over the sets of processes and the bound can be written as the difference between the model evidence and the kl divergence between the variational distribution over all integral transformed processes and the true posterior kl the bound is therefore saturated when but this is intractable instead we choose simpler parameterised form similar in spirit to that used in the approximate sampling procedure that allows us to these difficulties in order to construct the variational distribution we first partition the set into the original 
process finite set of variables called inducing points ux that will be used to parameterise the approximation and the remaining variables ux so that ux ux the set is partitioned identically uh uh we then choose variational distribution that mirrors the form of the joint distribution ux uh ux uh ux uh ux uh this is structured approximation the approximating distribution over the inducing points ux uh is chosen to be multivariate gaussian the optimal parametric form given the assumed factorisation intuitively the variational approximation implicitly constructs surrogate gp regression problem whose posterior ux uh induces predictive distribution that best captures the true posterior distribution as measured by the kl divergence critically the resulting bound is now tractable as we will now show first note that the shared prior terms in the joint and approximation cancel leading to an elegant form uh ux uh ux log dhdxduh dux uh ux eq log kl uh uh kl ux ux the last two terms in the bound are simple to compute being kl divergences between multivariate gaussians the first term the average of the terms with respect to the variational distribution is more complex eq ti ti dτ eq log log computation of the variational bound therefore requires the first and second moments of the convolution under the variational approximation however these can be computed analytically for particular choices of covariance function such as the dse by taking the expectations inside the integral this is analogous to variational inference for the gaussian process latent variable model for example the first moment of the convolution is eq ti dτ eq uh ti eq ux dτ where the expectations take the form of the predictive mean in gp regression and eq ux kx ux eq uh ti kh uh ti ux ux uh uh where kh uh kuh uh kx ux kux ux are the covariance functions and µuh µux are the means of the approximate variational posterior crucially the integral is tractable if the covariance functions can be convolved analytically kh uh ti kx ux dτ which is the case for the se and dse covariances see sec of the supplementary material for the derivation of the variational lower bound the fact that it is possible to compute the first and second moments of the convolution under the approximate posterior tractable to compute the mean of the posterior distribution means is over the kernel eq kf eq dτ and the associated the method therefore supports full probabilistic inference and learning for nonparametric kernels in addition to extrapolation interpolation and denoising in tractable manner the next section discusses sensible choices for the integral transforms that define the inducing variables uh and ux choice of the inducing variables uh and ux in order to choose the domain of the inducing variables it is useful to consider inference for the process given fixed window typically we assume that the window is smoothly varying in which case the data are only determined by the content of the conversely in inference the data can only reveal the low frequencies in in fact since continuous time process contains power at all frequencies and infinite power in total most of the content will be undeterminable as it is suppressed by the filter or filtered out however for the same reason these components do not affect prediction of since we can only learn the content of the and this is all that is important for making predictions we consider inducing points formed by gaussian integral transform ux exp tx dτ these inducing variables represent local estimate of the 
process around the inducing location tx considering gaussian window and have squared exponential covariance by construction these covariances are shown in sec of the supplementary material in spectral terms the process ux is version of the true process the variational parameters and tx affect the approximate posterior and can be optimised using the although this was not investigated here to minimise computational overhead for the inducing variables uh we chose not to use the flexibility of the parameterisation and instead place the points in the same domain as the window experiments the was tested using synthetic data with known statistical properties and signals the aim of these experiments was to validate the new approach to learn covariance functions and psds while also providing error bars for the estimates and to compare it against alternative parametric and nonparametric approaches learning known parametric kernels we considered gaussian processes with standard parametric covariance kernels and verified that our method is able to infer such kernels gaussian processes with squared exponential and spectral mixture kernels both of unit variance were used to generate two time series on the region uniformly sampled at hz samples we then constructed the observation signal by adding the experiment then consisted of learning the underlying kernel ii estimating the latent process and iii performing imputation by removing observations in the region of the observations fig shows the results for the case we chose inducing points for ux that is of the samples to be recovered and for uh the hyperparameters in eq were set to and so as to allow for an uninformative prior on the variational objective was optimised with respect to the hyperparameter σh and the variational parameters µh µx means and the cholesky factors of ch cx covariances using conjugate gradients the true se kernel was reconstructed from the noisy data with an accuracy of while the estimation mean squared error mse was within of the unit noise variance for both the true and the proposed model fig shows the results for the time series along the lines of the case the reconstruction of the true kernel and spectrum is remarkably accurate and the estimate of the latent process has virtually the same mean square error mse as the true model these toy results indicate that the variational inference procedure can work well in spite of known biases learning the spectrum of signals the ability of the to provide bayesian estimates of the psd of signals was verified next this was achieved through comparison of the proposed model to the spectral mixture kernel ii tracking the fourier coefficients using kalman filter kalmanfourier iii the method and iv the periodogram we first analysed the mauna loa monthly concentration we considered the gpsm with and components with partition of points between zero and the nyquist frequency with lags and the raw periodogram all methods used all the data and each psd estimate was normalised its maximum shown in fig all methods identified the three main frequency peaks at however notice that the kalmanfourier method does not provide sharp peaks and that places gaussians on frequencies with ilt ior me an nduc ing oint ke ne ls nor malis dis panc process pos ior me an nduc ing oint tr ue se ke ne ke ne dse obs vat ions lat nt pr and ke ne imat lat nt pr oc obs vat ions se ke ne imat mse dse gp imat mse figure joint learning of an se kernel and data imputation using the proposed approach top filter and inducing 
points uh left filtered process ux centre and learnt kernel right bottom latent signal and its estimates using both the and the true model confidence intervals are shown in light blue and in between dashed red lines and they correspond to for the kernel and otherwise ke ne ls nor malis dis panc gr ound ut dse gp os ior dat imput at ion sd nor malis dis panc dse gp tr ue sm ke ne gr ound ut obs vat ions sm imat mse dse gp imat mse ime fr ue nc ime figure joint learning of an sm kernel and data imputation using nonparametric kernel true and learnt kernel left true and learnt spectra centre and data imputation region right negligible power this is known drawback of the approach it is sensitive to initialisation and gets trapped in noisy frequency peaks in this experiment the centres of the were initialised as multiples of one tenth of the nyquist frequency this example shows that the can overfit noise in training data conversely observe how the proposed approach with nh and nx not only captured the first three peaks but also the spectral floor and placed meaningful error bars where the raw periodogram laid sp al mix omp sp al mix omp ie yule pe iodogr am dse pe iodogr am fr ue nc ye ar fr ue nc ye ar figure spectral estimation of the mauna loa concentration with error bars is shown with the periodogram at the left and all other methods at the right for clarity the next experiment consisted of recovering the spectrum of an audio signal from the timit corpus composed of samples at only using an of the available data we compared the proposed method to again and components and we used the periodogram and the method as benchmarks since these methods can not handle data therefore they used all the data besides the psd we also computed the learnt kernel shown alongside the autocorrelation function in fig left due to its sensitivity to initial conditions the centres of the were initialised every the harmonics of the signal are approximately every however it was only with components that the was able to find the four main lobes of the psd notice also how the accurately finds the main lobes both in location and width together with the error bars pow al de ns ity ovar ianc ke ne pow al de ns ity dse gp sp al mix omp sp al mix omp ut or lat ion unc ion dse pe iodogr am ime milis onds sp al mix omp sp al mix omp ie yule pe iodogr am fr ue nc he fr ue nc he figure audio signal from timit induced kernel of and alongside autocorrelation function left psd estimate using and raw periodogram centre psd estimate using and raw periodogram right discussion the gaussian process convolution model gpcm has been proposed as generative model for stationary time series based on the convolution between filter function and process learning the model from data is achieved via novel variational approximation which in turn allows us to perform predictions and inference on both the covariance kernel and the spectrum in probabilistic analytically and computationally tractable manner the gpcm approach was validated in the recovery of spectral density from sampled time series to our knowledge this is the first probabilistic approach that places nonparametric prior over the spectral density itself and which recovers posterior distribution over that density directly from the time series the encouraging results for both synthetic and data shown in sec serve as proof of concept for the nonparametric design of covariance kernels and psds using convolution processes in this regard extensions of the presented model can be identified in 
the following directions first for the proposed gpcm to have desired performance the number of inducing points uh and ux needs to be increased with the high frequency content and ii range of correlations of the data therefore to avoid the computational overhead associated to large quantities of inducing points the filter prior or the transformation can be designed to have specific harmonic structure and therefore focus on target spectrum second the algorithm can be adapted to handle longer time series for instance through the use of approximations third the method can also be extended beyond time series to operate on input spaces this can be achieved by means of factorisation of the latent kernel whereby the number of inducing points for the filter only increases linearly with the dimension rather than exponentially acknowledgements part of this work was carried out when was with the university of cambridge thanks grant and center for mathematical modeling cmm thanks epsrc grants and thanks google we thank mark rowland shane gu and the anonymous reviewers for insightful feedback references rasmussen and williams gaussian processes for machine learning the mit press in machine learning vol bengio learning deep architectures for ai foundations and trends no pp mackay introduction to gaussian processes in neural networks and machine learning bishop ed nato asi series pp kluwer academic press wilson and adams gaussian process kernels for pattern discovery and extrapolation in proc of international conference on machine learning duvenaud lloyd grosse tenenbaum and ghahramani structure discovery in nonparametric regression through compositional kernel search in proc of international conference on machine learning pp duvenaud nickisch and rasmussen additive gaussian processes in advances in neural information processing systems pp and alpaydin multiple kernel learning algorithms the journal of machine learning research vol pp tobar kung and mandic multikernel least mean square algorithm ieee trans on neural networks and learning systems vol no pp turner statistical models for natural sounds phd thesis gatsby computational neuroscience unit ucl turner and sahani analysis as probabilistic inference ieee trans on signal processing vol no pp oksendal stochastic differential equations springer oppenheim and willsky signals and systems archambeau cornford opper and gaussian process approximations of stochastic differential equations journal of machine learning research workshop and conference proceedings vol pp gull developments in maximum entropy data analysis in maximum entropy and bayesian methods skilling ed vol pp springer netherlands and smola learning with kernels support vector machines regularization optimization and beyond mit press minka deriving quadrature rules from gaussian processes tech statistics department carnegie mellon university jazwinski stochastic processes and filtering theory new york academic titsias variational learning of inducing variables in sparse gaussian processes in proc of international conference on artificial intelligence and statistics pp and gaussian processes for sparse inference using inducing features in advances in neural information processing systems pp matthews hensman turner and ghahramani on sparse variational methods and the divergence between stochastic processes arxiv preprint mackay information theory inference and learning algorithms cambridge university press titsias and lawrence bayesian gaussian process latent variable model in proc of 
international conference on artificial intelligence and statistics pp turner and sahani two problems with variational expectation maximisation for models in bayesian time series models barber cemgil and chiappa eds ch pp cambridge university press qi minka and picara bayesian spectrum estimation of unevenly sampled nonstationary data in proc of ieee icassp vol pp percival and walden spectral analysis for physical applications cambridge university press cambridge books online bui and turner gaussian process approximations in advances in neural information processing systems pp 
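As an addendum to the GPCM section above, the following is a rough, discretised sketch (not the paper's analytic sampling scheme) of drawing from the generative model: a filter is sampled from a GP whose covariance is a squared-exponential kernel damped by Gaussian windows, one plausible reading of the decaying squared exponential, and is then convolved with discretised white noise. The grid spacing and the constants `sigma_h`, `alpha`, and `ell` are illustrative assumptions.

```python
import numpy as np

def dse_kernel(t, s, sigma_h=1.0, alpha=0.05, ell=1.0):
    """Assumed form: sigma_h^2 * exp(-alpha t^2) * exp(-(t - s)^2 / (2 ell^2)) * exp(-alpha s^2)."""
    return sigma_h**2 * np.exp(-alpha * t**2) * np.exp(-(t - s)**2 / (2 * ell**2)) * np.exp(-alpha * s**2)

def sample_gpcm(T=40.0, dt=0.05, sigma_x=1.0, seed=0):
    """Approximate draw f(t) ~ integral h(t - tau) x(tau) dtau on a regular grid."""
    rng = np.random.default_rng(seed)
    tau = np.arange(-10.0, 10.0, dt)                          # grid supporting the (windowed) filter
    K = dse_kernel(tau[:, None], tau[None, :]) + 1e-8 * np.eye(tau.size)
    h = rng.multivariate_normal(np.zeros(tau.size), K)        # filter draw h ~ GP(0, k_DSE)
    t = np.arange(0.0, T, dt)
    x = rng.standard_normal(t.size) * sigma_x / np.sqrt(dt)   # discretised white noise
    f = np.convolve(x, h, mode="same") * dt                   # Riemann-sum approximation of the convolution
    return t, f
```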
market framework for eliciting private data bo waggoner harvard seas bwaggoner rafael frongillo university of colorado raf jacob abernethy university of michigan jabernet abstract we propose mechanism for purchasing information from sequence of participants the participants may simply hold data points they wish to sell or may have more sophisticated information either way they are incentivized to participate as long as they believe their data points are representative or their information will improve the mechanism future prediction on test set the mechanism which draws on the principles of prediction markets has bounded budget and minimizes generalization error for bregman divergence loss functions we then show how to modify this mechanism to preserve the privacy of participants information at any given time the current prices and predictions of the mechanism reveal almost no information about any one participant yet in total over all participants information is accurately aggregated introduction firm that relies on the ability to make difficult predictions can gain lot from large collection of data the goal is often to estimate values given observations according to an appropriate class of hypotheses describing the relationship between and for example for linear regression in classic statistical learning theory the goal is formalized as attempting to approximately solve min loss where loss is an appropriate inutility function and is drawn from an unknown distribution in the present paper we are concerned with the case in which the data are not drawn or held by central authority but are instead inherently distributed by this we mean that the data is disjointly partitioned across set of agents with agent privately possessing some portion of the dataset si and agents have no obvious incentive to reveal this data to the firm seeking it the vast swaths of data available in our personal email accounts could provide massive benefits to range of companies for example but users are typically loathe to provide account credentials even when asked politely we will be concerned with the design of financial mechanisms that provide community of agents each holding private set of data an incentive to contribute to the solution of large learning or prediction task here we use the term mechanism to mean an algorithmic interface that can receive and answer queries as well as engage in monetary exchange deposits and payouts our aim will be to design such mechanism that satisfies the following three properties the mechanism is efficient in that it approaches solution to as the amount of data and participation grows while spending constant fixed total budget the mechanism is in the sense that agents are rewarded when their contributions provide marginal value in terms of improved hypotheses and are not rewarded for bad or misleading information the mechanism provides reasonable privacy guarantees so that no agent or outside observer can manipulate the mechanism in order to infer the contributions of agent ultimately we would like our mechanism to approach the performance of learning algorithm that had direct access to all the data while only spending constant budget to acquire data and improve predictions and while protecting participants privacy our construction relies on the recent surge in literature on prediction markets popular for some time in the field of economics and recently studied in great detail in computer science prediction market is mechanism designed for the purpose of information aggregation 
particularly when there is some underlying future event about which many members of the population may have private and useful information for instance it may elicit predictions about which team will win an upcoming sporting event or which candidate will win an election these predictions are eventually scored on the actual outcome of the event applying these prediction market techniques allows participants to essentially trade in market based on their data this approach is similar to prior work on crowdsourcing contests members of the population have private information just as with prediction markets in this case data points or beliefs and the goal is to incentivize them to reveal and aggregate that information into final hypothesis or prediction their final profits are tied to the outcome of test set of data with each participant being paid in accordance with how much their information improved the performance on the test set our techniques depart from the framework of in two significant aspects we focus on the particular problem of data aggregation and most of our results take advantage of kernel methods and our mechanisms are the first to combine differential privacy guarantees with data aggregation in framework this framework will provide efficiency and truthfulness we will also show how to achieve privacy in many scenarios we will give mechanisms where the prices and predictions published satisfy privacy with respect to each participant data the mechanism output can still give reasonable predictions while no observer can infer much about any participant input data mechanisms for eliciting and aggregating data we now give broad description of the mechanism we will study in brief we imagine central authority the mechanism or market maintaining hypothesis representing the current aggregation of all the contributions made thus far new or returning participant may query at no cost perhaps evaluating the quality of the predictions on dataset and can then propose an update df to that possibly requires an investment bet bets are evaluated at the close of the market when true data sample is generated analogous to test set and payouts are distributed according to the quality of the updates after describing this initial framework as mechanism which is based loosely on the setting of we turn our attention to the special case in which our hypotheses must lie in reproducing kernel hilbert space rkhs for given kernel this nonparametric mechanism is particularly for the problem of data aggregation as the betting space of the participants consists essentially of updates of the form df αt zt where zt is the data object offered by the participant and αt is the magnitude of the bet drawback of mechanism is the lack of privacy guarantees associated with the betting protocol utilizing one data to make bets or investments in the mechanism can lead to loss of privacy for the owner of that data when participant submits bet of the form df αt zt where zt could contain sensitive personal information another participant may be able to infer zt by querying the mechanism one of the primary contributions of the present work detailed in section is technique to allow for productive participation in the mechanism while maintaining guarantee on the privacy of the data submitted the general template there is space of examples where are features and are labels the mechanism designer chooses function space consisting of and assumed to have hilbert space structure one may view as either the hypothesis class or the 
associated loss class that is where fh measures the of hypothesis on observation and label in each case we will refer to as hypothesis eliding the distinction between fh and the pricing scheme of the mechanism relies on convex cost function cx which is parameterized by elements but whose domain is the set of hypotheses the cost function is publicly available and determined in advance the interaction with the mechanism is sequential process of querying and betting on round the mechanism publishes hypothesis the state of the market which participants may query each participant arrives sequentially and on round participant may place bet df also called trade or update modifying the hypothesis df finally participation ends and the mechanism samples or reveals test from the underlying distribution and pays or charges each participant according to the relative performance of their marginal contributions precisely the total reward for participant bet df is the value df minus the cost cx cx mechanism the market template arket announces for do participant may query functions cx and for examples participant may submit bet df to arket arket updates state df arket observes true sample for do participant receives payment df cx cx the design of prediction markets has been an area of active research over the past several years starting with and many further refinements and generalizations the general idea is that the mechanism can efficiently provide price quotes via function which acts as potential on the space of outstandings shares see for thorough review in the present work we have added an additional twist which is that the function cx is given an additional parameterization of the observation we will not dive too deeply into the theoretical aspects of this generalization but this is straightforward extension of existing theory key special case exponential family mechanism for those more familiar with statistics and machine learning there is natural and canonical family of problems that can be cast within the general framework of mechanism which we will call the exponential family prediction mechanism following assume that can be parameterized as fθ rd that we are given sufficient statistics summary function rrd and that function evaluation is given by fθ hθ we let cx log exp dy so that cx fθ log exp hθ idy in other words we have chosen our mechanism to encode particular exponential family model with cx chosen as the conditional log partition function over the distribution on given if the market has settled on function fθ then one may interpret that as the aggregate market belief on the distribution of is pθ exp hθ where log exp hθ dx dy how may we view this as market aggregate belief notice that if trader observes the market state of fθ and she is considering bet of the form df fθ the eventual profit will be fθ cx fθ cx log pθ the profit is precisely the conditional log likelihood ratio of the update example logistic regression let rk and take to be the set of functions fθ for rk then by our construction cx log exp exp log exp exp and we let the payoff of participant placing bet which moves the market state to fθ upon outcome is fθ cx cx fθ yθ log log exp exp log log exp this can easily be extended to test set by taking the average performance over the test set which is simply negative logistic loss of the parameter choice participant wishing to maximize profit under belief distribution should therefore choose via logistic regression arg min log exp properties of the market we next describe two nice 
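The logistic-regression instance of the exponential-family mechanism can be made concrete with a small numerical sketch. This is our illustration, not the authors' code: it assumes labels y ∈ {0, 1} and sufficient statistic φ(x, y) = y·x, and all function names are ours. It checks that the payoff for moving the market state from θ_old to θ_new equals the conditional log-likelihood ratio on the test point, so maximizing expected payoff over θ_new is exactly logistic regression under the trader's beliefs.

```python
import numpy as np

def f(theta, x, y):
    """Market hypothesis value; sufficient statistic is phi(x, y) = y * x, y in {0, 1}."""
    return y * np.dot(theta, x)

def C(theta, x):
    """Conditional log-partition function: log(exp(0) + exp(theta . x))."""
    return np.logaddexp(0.0, np.dot(theta, x))

def payoff(theta_old, theta_new, x, y):
    """Profit of the trade moving the market from theta_old to theta_new,
    settled on test observation (x, y)."""
    return (f(theta_new, x, y) - C(theta_new, x)) - (f(theta_old, x, y) - C(theta_old, x))

def log_p(theta, x, y):
    """Conditional log-likelihood under the market's implied logistic model."""
    return f(theta, x, y) - C(theta, x)

# Sanity check: the payoff equals the conditional log-likelihood ratio.
rng = np.random.default_rng(0)
theta_old, theta_new = rng.normal(size=3), rng.normal(size=3)
x, y = rng.normal(size=3), 1
assert np.isclose(payoff(theta_old, theta_new, x, y),
                  log_p(theta_new, x, y) - log_p(theta_old, x, y))
```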
properties of mechanism and bounded budget recall that for the exponential family markets discussed above trader moving the market hypothesis from to was compensated according to the conditional ratio of and on the test data point the implication is that traders are incentivized to minimize kl divergence between the market estimate of the distribution and the true underlying distribution we refer to this property as because traders interests are aligned with the mechanism designer this property indeed holds generally for mechanism where the kl divergence is replaced with general bregman divergence corresponding to the fenchel conjugate of cx see proposition in the appendix for details given that the mechanism must make sequence of possibly negative payments to traders natural question is whether there is the potential for large downside for the mechanism in terms of total payment budget in the context of the exponential family mechanism this question is easy to answer after sequence of bets moving the market state parameter θfinal the total loss to the mechanism corresponds to the total payouts made to traders pθ fθi cx fθi cx log final that is the loss is exactly the conditional ratio in the context of logistic regression this quantity can always be guaranteed to be no more than log as long as the initial parameter is set to for mechanism more generally one has tight bounds on the loss following from such results from prediction markets and we give more detailed statement in proposition in the appendix price sensitivity parameter λc in choosing the cost function family cx an important consideration is the scale of each cx or how quickly changes in the market hypothesis translate to changes in the instantaneous prices which give the marginal cost for an infinitesimal bet df formally this is captured by the price sensitivity λc defined as the upper bound on the operator norm with respect to the norm of the hessian of the cost function cx over all choice of small λc translates to small budget required by the mechanism however it means that the market prices are sensitive in that the same update df changes the prices much more quickly when we consider protecting the privacy of trader updates in section we will see that privacy imposes restrictions on the price sensitivity nonparametric mechanism via kernel methods the framework we have discussed thus far has involved general function space as the state of the mechanism and the contributions by participants are in the form of modifications to these functions one of the downsides of this generic template is that participants may not be able to reason about and they may have information about the optimal only through their own dataset more specific class of functions would be those parameterized by actual data this brings us to type of hypothesis class namely the reproducing kernel hilbert space rkhs we can design market based on an rkhs which we will refer to as kernel market that brings together number of ideas including recent work of as well as kernel exponential families we have positive semidefinite kernel and associated reproducing kernel hilbert space with basis fz the reproducing property is that for all hf now each hypothesis can be expressed as αs zs for some collection of points αs zs the kernel approach has several nice properties one is natural extension of the exponential family mechanism using an rkhs as building block of the class of exponential family distributions key assumption in the exponential family mechanism is that 
evaluating can be viewed as an inner product in some feature space this is precisely what one has given kernel framework specifically assume we have some psd kernel where then we can define the associated classification kernel according rto yy under certain conditions we again can take cx log exp dy and for any in the rkhs associated to we have an associated distribution of the form pf exp and again participant updating the market from to is rewarded by the conditional ratio of and on the test data the second nice property mirrors one of standard kernel learning methods namely that under certain conditions one need only search the subset of the rkhs spanned by the basis xi yi xi yk where is the set of available data this is direct result of the representer theorem in the context of the kernel market this suggests that participants need only interact with the mechanism by pushing updates that lie in the span of their own data in other words we only need to consider updates of the form df αk this naturally suggests the idea of directly purchasing data points from traders buying data points so far we have supposed that participant knows what trade df she prefers to make but what if she simply has data point drawn from the underlying distribution we would like to give this trader simple trading interface in which she can sell her data to the mechanism without having to reason about the correct df for this data point our proposal is to mimic the behavior of natural learning algorithms such as stochastic gradient descent when presented with the market can offer the trader the purchase bundle corresponding to the update of the learning algorithm on this data point in principle this approach can be used with any online learning algorithm in particular stochastic gradient descent gives clean update rule which now describe the expected profit which the negative of expected loss for trade df is ex cx df cx df given draw the loss function on which to take gradient step is cx df cx df whose gradient is cx δx where δx is the indicator on data point this suggests that the market offer the participant the trade df cx δx where can be chosen arbitrarily as learning rate this can be interpreted as buying unit of shares in the participant data point then hedging by selling small amount of all other shares in proportion to their current prices recall that the current prices are given by cx in the kernel setting the choice of stochastic gradient descent may be somewhat problematic because it can result in share purchases it may instead be desirable to use algorithms that guarantee sparse modern discussion of such approaches can be found in given this framework participants with access to private set of samples from the true underlying distribution can simply opt for this standard bundle corresponding to their data point which is precisely stochastic gradient descent update with small enough learning rate and assuming that the data point is truly independent of the current hypothesis has not been previously incorporated the trade is guaranteed to make at least some positive profit in expectation more sophisticated alternative strategies are also possible of course but even the proposed simple bet type has earning potential protecting participants privacy we now extend the mechanism to protect privacy of the participants an adversary observing the hypotheses and prices of the mechanism and even controlling the trades of other participants should not be able to infer too much about any one trader update df this is 
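Continuing the logistic parameterization from the earlier sketch, the "standard bundle" offered to a participant who simply holds a data point can be read as a stochastic-gradient step on the conditional log-likelihood: buy shares in the trader's own (x, y) and sell a small amount of all outcomes in proportion to current prices. This is a hedged illustration of the idea described above, not the paper's implementation; the function name and the learning rate are ours.

```python
import numpy as np

def sgd_bundle(theta, x, y, eta=0.1):
    """Trade offered for a single data point (x, y): one unit of shares in (x, y)
    minus shares in all outcomes weighted by current prices, scaled by eta.
    In the logistic parameterization this is a stochastic-gradient step on
    the conditional log-likelihood."""
    p1 = 1.0 / (1.0 + np.exp(-np.dot(theta, x)))   # current price of outcome y = 1
    grad_C = p1 * x                                 # E_{y' ~ p_theta(.|x)}[phi(x, y')]
    phi = y * x                                     # shares in the trader's own point
    return eta * (phi - grad_C)

theta = np.zeros(3)
x, y = np.array([1.0, -2.0, 0.5]), 1
theta = theta + sgd_bundle(theta, x, y)   # trader sells the point; market state moves
```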
especially relevant when participants sell data to the mechanism and this data can be sensitive medical data here privacy is formalized by privacy to be defined shortly one intuitive characterization is that for any prior distribution some adversary has about trader data the adversary posterior belief after observing the mechanism would be approximately the same even if the trader did not participate at all the idea is that rather than posting the exact prices and trades made in the market we will publish noisy versions with the random noise giving the above guarantee naive approach would be to add independent noise to each participant trade however this would require amount of noise the final market hypothesis would be determined by the random noise just as much as by the data and trades the central challenge is to add carefully correlated noise that is large enough to hide the effects of any one participant data point but not so large that the prices equivalently hypothesis become meaningless we show this is possible by adjusting the price sensitivity λc of the mechanism measure of how fast prices change in response to trades defined in it will turn out to suffice to set the price sensitivity to be when there are participants this can roughly be interpreted as saying that any one participant does not move the market price noticeably so their privacy is protected but just polylog traders together can move the prices completely we now formally define differential privacy and discuss two useful tools at our disposal differential privacy and tools differential privacy in our context is defined as follows consider randomized function operating on inputs of the form df df and having outputs of the form then is private if for any coordinate of the vector any two distinct and any measurable set of outputs we have pr pr the notation means the vector with the tth entry removed intuitively is private if modifying the tth entry in the vector to different entry does not change the distribution on outputs too much in our case the data to be protected will be the trade df of each participant and the space of outputs will be the entire sequence of published by the mechanism to preserve privacy each trade must have bounded size consist only of one data point to enforce this we define the following parameter chosen by the mechanism designer max hdf df allowed df where the maximum is over all trades df allowed by the mechanism that is is scalar capturing the maximum allowed size of any one trade for instance if all trades are restricted to be of the form df αk then we would have maxα we next describe the two tools we require tool private functions via gaussian processes given current market state df df where lies in rkhs we construct private version fˆt such that queries to fˆt are accurate close to the outputs of but also private with respect to each df in fact it will become convenient to privately output partial sums of trades so we wish to output that is private and approximates df this is accomplished by the following construction due to theorem corollary let be the sample path of gaussian process with mean zero and whose covariance is given by the kernel function then ln is private with respect to each df for in general may be an object and thus impossible to finitely represent in this case the theorem implies that releasing the results of any number of queries is differentially private of course the more queries that are released the larger the chance of high error on some query this is 
computationally feasible as each sample is simply sample from gaussian having known covariance with the previous samples drawn unfortunately it would not be sufficient to independently release at each time because the amount of noise required would be prohibitive this leads us to our next tool formally each is random variable and for any finite subset of the corresponding variables are distributed as multivariate normal with covariance given by df df df df df df df df df df df figure picturing the continual observation technique for preserving privacy each df is ap trade data point sold to the market the goal is to release at each time step noisy version of df to do so start at and follow the arrow back to take the partial sum of df for from to and add some random noise trace the next arrow from to to get another partial sum and add noise to that sum as well repeat until is reached then add together all the noisy partial sums to get the output at time which will equal plus noise the key point is that we can many of the noisy partial sums in many different time steps for instance the noisy partial sum from to can be when releasing all of meanwhile each df participates in few noisy partial sums the number of arrows passing above it tool continual observation technique the idea of this technique pioneered by is to pt construct fˆt df by adding together noisy partial sums of the form as constructed in equation the idea for choosing these partial sums is pictured in figure for function that returns an integer smaller than we take fˆt fˆs fˆs specifically is determined by writing in binary then flipping the rightmost one bit to zero this is pictured in figure the intuition behind why this technique helps is twofold first the total noise in fˆt is the sum of noises of its partial sums and it turns out that there are at most dlog terms second the total noise we need to add to protect privacy is governed by how many different partial sums each df participates in and it turns out that this number is also at most dlog this allows for much better privacy and accuracy guarantees than naively treating each step independently mechanism and results combining our market template in mechanism with the above privacy tools we obtain mechanism there are some key differences first we have bound on the total number of queries each query returns the instantaneous prices in the market for this is because each query reveals information about the participants so intuitively allowing too many queries must sacrifice either privacy or accuracy fortunately this bound can be an arbitrarily large polynomial in the number of traders without affecting the quality of the results second we have guarantees on accuracy with probability all price queries return values within of their true prices third it is no longer straightforward to compute and represent the market prices fˆt unless is finite we leave the more general analysis of mechanism to future work either exactly or approximately mechanism inherits the desirable properties of mechanism such as bounded budget and that is participants are incentivized to minimize the risk of the market hypothesis in addition we show that it preserves privacy while maintaining accuracy for an appropriate choice of the price sensitivity λc theorem consider mechanism where is the maximimum trade size equation and then mechanism is differentially private and with traders and price queries has the following accuracy guarantee with probability for each query the returned prices satisfy fˆt by 
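A minimal sketch of the continual-observation (noisy partial-sum) technique described above, with scalar trades standing in for the function-valued updates and plain Gaussian noise standing in for the Gaussian-process construction; all names and the noise scale are ours. Each released prefix sum combines at most roughly log2(t) noisy partial sums, and each trade enters at most roughly log2(T) of them, which is the source of the improved privacy/accuracy trade-off.

```python
import numpy as np

def noisy_prefix_sums(trades, sigma, rng):
    """Release noisy versions of g_t = d_1 + ... + d_t for t = 1..T.
    Partial sums are chosen by repeatedly flipping the rightmost one bit of t,
    and each noisy partial sum is memoized and reused across time steps."""
    T = len(trades)
    partial = {}                      # (start, end) -> noisy sum of trades[start..end]
    out = []
    for t in range(1, T + 1):
        total, hi = 0.0, t
        while hi > 0:
            lo = hi - (hi & -hi) + 1  # lowest set bit of hi gives the interval length
            if (lo, hi) not in partial:
                partial[(lo, hi)] = sum(trades[lo - 1:hi]) + rng.normal(0.0, sigma)
            total += partial[(lo, hi)]
            hi = lo - 1               # "flip the rightmost one bit to zero"
        out.append(total)
    return out

rng = np.random.default_rng(0)
trades = rng.normal(size=16)
released = noisy_prefix_sums(trades, sigma=0.1, rng=rng)
```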
setting λc ln qd ln log log if one for example takes exp then except for superpolynomially low failure probability mechanism answers all queries to within accuracy by setting the price sensitivity to be λc we note however that this is somewhat weaker guarantee than is usually desired in the differential privacy literature where ideally is exponentially small mechanism privacy protected market parameters privacy accuracy kernel trade size queries traders arket announces sets sets with λc λc theorem for do participant proposes bet df arket updates true position df arket instantiates fˆs as defined in equation while and some bserver wishes to make query do bserver submits pricing query on arket returns prices fˆt where fˆt fˆs fˆs arket sets arket observes true sample for do participant receives payment cx df cx computing fˆt we have already discussed limiting to finite in order to efficiently compute the marginal prices fˆt however it is still not immediately clear how to compute these prices and hence how to implement mechanism here we show that the problem can be solved when comes from an exponential family so that cx log exp dy in this case the marginal prices given by the gradient of have nice form namely the price of shares in is ptx cx ef thus evaluating the prices can be done by evaluating for each we also note that the bound used here could be greatly improved by taking into account the structure of the kernel for smooth cases such as the gaussian kernel querying second point very close to the first one requires very little additional randomness and builds up very little additional error we gave only bound that holds for all kernels adding transaction fee in the appendix we discuss the potential need for transaction fees adding small fee suffices to deter arbitrage opportunities introduced by noisy pricing discussion the main contribution of this work was to bring together several tools to construct mechanism for incentivized data aggregation with incentive properties privacy guarantees and limited downside for the mechanism our proposed mechanisms are also extensions of the prediction market literature building upon the work of abernethy et al we introduce the following innovations conditional markets our framework of mechanism can be interpreted as prediction market for conditional predictions rather than classic market which would elicit the joint distribution or just the marginals this is similar to decision markets but without out the associated incentive problems naturally then we couple conditional predictions with restricted hypothesis spaces allowing to capture linear relationship between and nonparametric securities we also extend to nonparametric hypothesis spaces using kernels following the scoring rules of privacy guarantees we provide the first private prediction market to our knowledge showing that information about individual trades is not revealed our approach for preserving privacy also holds in the classic prediction market setting with similar privacy and accuracy guarantees many directions remain for future work these mechanisms could be made more practical and perhaps even better privacy guarantees derived especially in nonparametric settings one could also explore the connections to similar settings such as when agents have costs for acquiring data acknoledgements abernethy acknowledges the generous support of the us national science foundation under career grant and grant references jacob abernethy yiling chen and jennifer wortman vaughan efficient market 
making via convex optimization and connection to online learning acm transactions on economics and computation may jacob abernethy sindhu kutty lahaie and rahul sami information aggregation in exponential family markets in proceedings of the fifteenth acm conference on economics and computation pages acm jacob abernethy and rafael frongillo collaborative mechanism for crowdsourcing prediction problems in advances in neural information processing systems pages canu and alex smola kernel methods and the exponential family neurocomputing hubert chan elaine shi and dawn song private and continual release of statistics acm transactions on information and system security tissec chen and vaughan new understanding of prediction markets via learning in proceedings of the acm conference on electronic commerce ec pages yiling chen ian kash mike ruberry and victor shnayder decision markets with good incentives in internet and network economics pages springer yiling chen and david pennock utility framework for market makers in in proceedings of the conference on uncertainty in artificial intelligence uai pages cynthia dwork moni naor toniann pitassi and guy rothblum differential privacy under continual observation in proceedings of the acm symposium on theory of computing pages acm cynthia dwork and aaron roth the algorithmic foundations of differential privacy foundations and trends in theoretical computer science rob hall alessandro rinaldo and larry wasserman differential privacy for functions and functional data the journal of machine learning research hanson decision markets entrepreneurial economics bright ideas from the dismal science pages hanson combinatorial information market design information systems frontiers hanson logarithmic market scoring rules for modular combinatorial information aggregation journal of prediction markets abraham othman and tuomas sandholm automated market makers that enable new settings extending cost functions in proceedings of the second conference on auctions market mechanisms and their applications amma pages david pennock and rahul sami computational aspects of prediction markets in noam nisan tim roughgarden eva tardos and vijay vazirani editors algorithmic game theory chapter cambridge university press bernhard and alexander smola learning with kernels support vector machines regularization optimization and beyond mit press amos storkey machine learning markets in proceedings of ai and statistics aistats pages wolfers and zitzewitz prediction markets journal of economic perspectives justin wolfers and eric zitzewitz interpreting prediction market prices as probabilities technical report national bureau of economic research erik zawadzki and lahaie nonparametric scoring rules in proceedings of the aaai conference on artificial intelligence lijun zhang rong jin chun chen jiajun bu and xiaofei he efficient online learning for sparse kernel logistic regression in aaai lijun zhang jinfeng yi rong jin ming lin and xiaofei he online kernel learning with near optimal sparsity bound in proceedings of the international conference on machine learning pages 
lifted inference rules with constraints happy mittal anuj mahajan dept of comp sci engg delhi hauz khas new delhi india vibhav gogate dept of comp sci univ of texas dallas richardson tx usa parag singla dept of comp sci engg delhi hauz khas new delhi india vgogate parags abstract lifted inference rules exploit symmetries for fast reasoning in statistical relational models computational complexity of these rules is highly dependent on the choice of the constraint language they operate on and therefore coming up with the right kind of representation is critical to the success of lifted inference in this paper we propose new constraint language called setineq which allows subset equality and inequality constraints to represent substitutions over the variables in the theory our constraint formulation is strictly more expressive than existing representations yet easy to operate on we reformulate the three main lifting rules decomposer generalized binomial and the recently proposed single occurrence for map inference to work with our constraint representation experiments on benchmark mlns for exact and sampling based inference demonstrate the effectiveness of our approach over several other existing techniques introduction statistical relational models such as markov logic have the power to represent the rich relational structure as well as the underlying uncertainty both of which are the characteristics of several real world application domains inference in these models can be carried out using existing probabilistic inference techniques over the propositionalized theory belief propagation mcmc sampling this approach can be since it ignores the rich underlying structure in the relational representation and as result does not scale to even moderately sized domains in practice lifted inference ameliorates the aforementioned problems by identifying indistinguishable atoms grouping them together and inferring directly over the groups instead of individual atoms starting with the work of poole number of lifted inference algorithms have been proposed these include lifted exact inference techniques such as lifted variable elimination ve lifted approximate inference algorithms based on message passing such as belief propagation lifted sampling based algorithms lifted search lifted variational inference and lifted knowledge compilation there also has been some recent work which examines the complexity of lifted inference independent of the specific algorithm used just as probabilistic inference algorithms use various rules such as conditioning and decomposition to exploit the problem structure lifted inference algorithms use lifted inference rules to exploit the symmetries all of them work with an underlying constraint representation that specifies the allowed set of substitutions over variables appearing in the theory examples of various constraint representations include weighted parfactors with constraints normal form parfactors hypercube based representations tree based constraints and the constraint free normal form these formalisms differ from each other not only in terms of the underlying constraint representation but also how these constraints are processed whether they require constraint solver splitting as needed versus shattering etc the choice of the underlying constraint language can have significant impact on the time as well as memory complexity of the inference procedure and coming up with the right kind of constraint representation is of prime importance for the success of lifted 
inference techniques although there approach lifted ve cfove gcfove approx lbp knowledge compilation kc lifted inference from other side ptp current work constraint type no subset no subset subset no inequality subset hypercube no inequality subset normal forms no constraints no subset subset constraint aggregation intersection no union intersection no union intersection union intersection union intersection no union none tractable solver no lifting algorithm lifted ve yes lifted ve yes lifted ve yes lifted message passing no intersection no union intersection union no knowledge compilation lifting rules decomposer binomial lifted search sampling decomposer binomial lifted search sampling decomposer binomial single occurrence yes yes table comparison of constraint languages proposed in literature across four dimensions the properties for each language have been highlighted in bold among the existing work only kc allows for full set of constraints gcfove and lbp hypercubes allow for subset constraints but they do not explicitly handle inequality ptp does not handle subset constraints for constraint aggregation most approaches allow only intersection of atomic constraints gcfove and lbp allow union of intersections dnf but only deal with subset constraints see footnote in broeck regarding kc lifted ve kc and ptp use general purpose constraint solver which may not be tractable our approach allows for all the features discussed above and uses tractable solver we propose constrained solution for lifted search and sampling among earlier work only ptp has looked at this problem both search and sampling however it only allows very restrictive set of constraints has been some work studying this problem in the context of lifted ve lifted bp and lifted knowledge compilation existing literature lacks any systematic treatment of this issue in the context of lifted search and sampling based algorithms this paper focuses on addressing this issue table presents detailed comparison of various constraint languages for lifted inference to date we make the following contributions first we propose new constraint language called setineq which allows for subset allowed values are constrained to be either inside subset or outside subset equality and inequality constraints called atomic constraints over substitutions of the variables the set of allowed constraints is expressed as union over individual constraint tuples which in turn are conjunctions over atomic constraints our constraint language strictly subsumes several of the existing constraint representations and yet allows for efficient constraint processing and more importantly does not require separate constraint solver second we extend the three main lifted inference rules decomposer and binomial and single occurrence for map inference to work with our proposed constraint language we provide detailed analysis of the lifted inference rules in our constraint formalism and formally prove that the normal form representation is strictly subsumed by our constraint formalism third we show that evidence can be efficiently represented in our constraint formulation and is key benefit of our approach specifically based on the earlier work of singla et al we provide an efficient greedy approach to convert the given evidence in the database tuple form to our constraint representation finally we demonstrate experimentally that our new approach is superior to normal forms as well as many other existing approaches on several benchmark mlns for both exact and approximate 
inference markov logic we will use strict subset of first order logic which is composed of constant variable and predicate symbols term is variable or constant predicate represents property of or relation between terms and takes finite number of terms as arguments literal is predicate or its negation formula is recursively defined as follows literal is formula negation of formula is formula if and are formulas then applying binary logical operators such as and to and yields formula and if is variable in formula then and are formulas first order theory knowledge base kb is set of quantified formulas we will restrict our attention to finite first order logic theory with herbrand interpretations as done by most earlier work in this domain we will also restrict our attention to the case of universally quantified variables ground atom is predicate whose terms do not contain any variable in them similarly ground formula is formula that has no variables during the grounding of theory each formula is replaced by conjunction over ground formulas obtained by substituting the universally quantified variables by constants appearing in the theory markov logic network mln or markov logic theory is defined as set of pairs fi wi where fi is formula and wi is its weight real number given finite set of constants markov logic theory represents markov network that has one node for every ground atom in the theory and feature for every ground formula pm the probability distribution represented by the markov network is given by exp wi ni where ni denotes the number th of true pgroundings pm of the under the assignment to the ground atoms world and exp wi ni is the normalization constant called the partition function it is well known that prototypical marginal inference task in mlns computing the marginal probability of ground atom given evidence can be reduced to computing the partition function another key inference task is map inference in which the goal is to find an assignment to ground atoms that has the maximum probability in its standard form markov logic theory is assumed to be constraint free all possible substitutions of variables by constants are considered during the grounding process in this paper we introduce the notion of constrained markov logic theory which is specified as set of triplets fi wi six where si specifies set union of constraints defined over the variables appearing in the formula during the grounding process we restrict to those constant substitutions which satisfy the constraint set associated with formula the probability distribution is now defined using the restricted set of groundings allowed by the respective constraint sets over the formulas in the theory although we focus on mlns in this paper our results can be easily generalized to other representations including weighted parfactors and probabilistic knowledge bases constraint language in this section we formally define our constraint language and its canonical form we also define two operators join and project for our language the various features operators and properties of the constraint language presented this section will be useful when we formally extend various lifted inference rules to the constrained markov logic theory in the next section sec language specification for simplicity of exposition we assume that all logical variables take values from the same domain let xk be set of logical variables our constraint language called setineq contains three types of atomic constraints subset constraints setct of the form 
xi setinct or xi setoutct equality constraints eqct of the form xi xj and inequality constraints ineqct of the form xi xj we will denote an atomic constraint over set by ax constraint tuple over denoted by is conjunction of atomic constraints over and constraint set over denoted by is disjunction of constraint tuples over an example of constraint set over pair of variables is where and an assignment to the variables in is solution of if all constraints in are satisfied by since is disjunction by definition is also solution of next we define canonical representation for our constraint language we require this definition because symmetries can be easily identified when constraints are expressed in this representation we begin with some required definitions the support of subset constraint is the set of values in that satisfies the constraint two subset constraints and are called value identical if and value disjoint if where and are supports of and respectively constraint tuple is transitive over equality if it contains the transitive closure of all its equality constraints constraint tuple is transitive over inequality if for every constraint of the form xi xj in whenever contains xi xk it also contains xj xk definition constraint tuple is in canonical form if the following three conditions are satisfied for each variable xi there is exactly one subset constraint in all equality and inequality constraints in are transitive and all pairs of variables that participate either in an equality or an inequality constraint have identical supports constraint set is in canonical form if all of its constituent constraint tuples are in canonical form we can easily express constraint set in an equivalent canonical form by enforcing the three conditions one by one on each of its tuples in our running example can be converted into canonical form by splitting it into four sets of constraint tuples where and similarly for we include the conversion algorithm in the supplement due to lack of space the following theorem summarizes its time complexity theorem given constraint set each constraint tuple in it can be converted to canonical form in time mk where is the total number of constants appearing in any of the subset constraints in and is the number of variables in we define following two operations in our constraint language join join operation lets us combine set of constraints possibly defined over different sets of variables into single constraint it will be useful when constructing formulas given constrained predicates refer section let and be constraints tuples over sets of variables and respectively and let the join operation written as results in constraint tuple which has the conjunction of all the constraints present in and given the constraint tuple in our running example and results in the complexity of join operation is linear in the size of constraint tuples being joined project project operation lets us eliminate variable from given constraint tuple this is key operation required in the application of binomial rule refer section let be constraint tuple given xi let xi the project operation written as results in constraint tuple which contains those constraints in not involving xi we refer to as the projected constraint for the variables given solution to the extension count for is defined as the number of unique assignments xi vi such that xi vi is solution for is said to be count preserving if each of its solutions has the same extension count we require tuple to be count preserving in order 
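A minimal sketch of the SETINEQ operations just defined, assuming canonical-form tuples over standardized-apart (disjoint) variable sets; the dictionary layout and function names are ours, not the paper's. `join` conjoins two tuples, and `project` eliminates a variable and reports its extension count in the simple count-preserving cases (a variable constrained only by a subset constraint, or tied by an equality); the inequality case requires the splitting step discussed in the text.

```python
def join(t1, t2):
    """Conjoin two constraint tuples defined over disjoint variable sets."""
    return {"support": {**t1["support"], **t2["support"]},
            "eq": t1["eq"] | t2["eq"],
            "neq": t1["neq"] | t2["neq"]}

def project(t, var):
    """Eliminate `var`; return the projected tuple and its extension count.
    Assumes `var` participates in no inequality constraint."""
    tied = any(var in pair for pair in t["eq"])
    count = 1 if tied else len(t["support"][var])
    out = {"support": {v: s for v, s in t["support"].items() if v != var},
           "eq": {p for p in t["eq"] if var not in p},
           "neq": {p for p in t["neq"] if var not in p}}
    return out, count

t1 = {"support": {"X": {1, 2, 3}}, "eq": set(), "neq": set()}
t2 = {"support": {"Y": {2, 3}},    "eq": set(), "neq": set()}
t, n = project(join(t1, t2), "X")   # n == 3: X ranges freely over its support
```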
to correctly maintain the count of the number of solutions during the project operation also refer section lemma let be constraint tuple in its canonical form if xi is variable which is either involved only in subset constraint or is involved in at least one equality constraint then the projected constraint is count preserving in the former case the extension count is given by the size of the support of xi in the latter case it is equal to when dealing with inequality constraints the extension count for each solution to the projected constraint may not be the same and we need to split the constraint first in order to apply the project operation for example consider the constraint then the extension count for the solution to the projected constraint is where extension count for the solution is in such cases we need to split the tuple into multiple constraints such that extension count property is preserved in each split let be set of variables over which constraint tuple needs to be projected let be the set of variables with which xi is involved in an inequality constraint in then tuple can be broken into an equivalent constraint set by considering each possible division of into set of equivalence classes where variables in the same equivalence class are constrained to be equal and variables in different equivalence classes are constrained to be not equal to each other the number of such divisions is given by the bell number the divisions inconsistent with the already existing constraints over variables in can be ignored projection operation has linear time complexity once the extension count property has been ensured using splitting as described above see the supplement for details extending lifted inference rules we extend three key lifted inference rules decomposer binomial and the single occurrence for map to work with our constraint formulation exposition for single occurrence has been moved to supplement due to lack of space we begin by describing some important definitions and assumptions let be constrained mln theory represented by set of triplets fi wi six we make three assumptions first we assume that each constraint set si is specified using setineq and is in canonical form second we assume that each formula in the mln is constant free this can be achieved by replacing the appearance of constant by variable and introducing appropriate constraint over the new variable replacing by variable and constraint third we assume that the variables have been standardized apart each formula has unique set of variables associated with it in the following will denote the set of all the logical variables appearing in xi will denote the set of variables in fi similar to the work done earlier we divide the variables in set of equivalence classes two variables are tied to each other if they appear as the same argument of predicate we take the transitive closure of the tied relation to obtain the variable equivalence classes for example given the theory the variable equivalence classes are and we will use the notation to denote the equivalence class to which belongs motivation and key operations the key intuition behind our approach is as follows let be variable appearing in formula fi let xi be an associated constraint tuple and denote the support for in xi then since constraints are in canonical form for any other variable xi involved in in equality constraint with with as the support we have therefore every pair of values vi vj behave identically with respect to the constraint tuple xi and hence 
are symmetric to each other now we could extend this notion to other constraints in which appears provided the support sets vl of in all such constraints are either identical or disjoint we could treat each support set vl for as symmetric group of constants which could be argued about in unison in an unconstrained theory there is single disjoint partition of constants the entire domain such that the constants behave identically our approach generalizes this idea to groups of constants which behave identically with each other towards this end we define following key operations over the theory which will be used over and again during application of lifted inference rules partitioning operation we require the support sets of variable or sets of variables over which lifted rule is being applied to be either identical or disjoint we say that theory defined over set of logical variables is partitioned with respect to the variables in the set if for every pair of subset constraints and appearing in tuples of the supports of and are either identical or disjoint but not both given partitioned theory with respect to variables we use vly to denote the set of various supports of variables in we refer to the set as the partition of values in our partitioning algorithm considers all the support sets for variables in and splits them such that all the splits are identical or disjoint the constraint tuples can then be split and represented in terms of these support sets we refer the reader to the supplement section for detailed description of our partitioning algorithm restriction operation once the values of set of variables have been partitioned into set while applying the lifted inference rules we will often need to argue about those formula groundings which are obtained by restricting values to those in particular set vly since values in each such support set behave identically to each other given let axl denote subset constraint over with vly as its support given formula fi we define its restriction to the set xi vly as the formula obtained xjby replacing its associated constraint tuple with new constraint xi tuple of the form al where the conjunction is taken over each variable xj which also appears in fi the restriction of an mln to the set vl denoted by mly is the mln obtained by restricting each formula in to the set vl restriction operation can be implemented in straightforward manner by taking conjunction with the subset constraints having the desired support set for variables in we next define the formulation of our lifting rules in constrained theory decomposer let be an mln theory let denote the set of variables appearing in let denotes the partition function for we say that an equivalence class is decomposer of if if occurs in formula then appears in every predicate in and if xi xj then xi xj do not appear as different arguments of any predicate let be decomposer for let be new theory in which the domain of all the variables belonging to equivalence class has been reduced to single constant the decomposer rule states that the partition function can be using as where in the proof follows from the fact that since is decomposer the theory can be decomposed into independent but identical up to the renaming of constant theories which do not share any random variables next we will extend the decomposer rule above to work with the constrained theories we will assume that the theory has been partitioned with respect to the set of variables appearing in the decomposer let the partition of values in 
be given by now we define the decomposer rule for constrained theory using the following theorem theorem let be partitioned theory with respect to the decomposer let denote the restriction of to the partition element let further restricts to singleton where is some element in the set then the partition function can be written as binomial let be an unconstrained mln theory and be unary predicate let xj denote the set of variables appearing as first argument of let dom xj ci xj let mkp be the theory obtained from as follows given formula fi with weight wi in which appears wlog let xj denote the argument of in fi then for every such formula fi we replace it by two new formulas fit and fif obtained by substituting true and alse for the occurrence of xj in fi respectively and when xj occurs in fit or fif reducing the domain of xj to ci in fit and ci in fif where xj the weight wit of fit is equal to wi if it has an occurrence of xj wi otherwise similarly forpfif rule states that the partition function can be written as the proof follows from the fact calculation of can be divided into cases where each case corresponds to considering nk equivalent possibilities for number of groundings being true and being false ranging from to next we extend the above rule for constrained theory let be singleton predicate and xj be set of variables appearing as first arguments of as before let be partitioned with respect to xj and xj vl denote the partition of xj values in let denote the set of formulas in which appears for every formula fi in which xj appears only in xj assume that the projections over the set are count preserving then we obtain new mln ml from in the following manner given formula fi with weight wi in which appears do the following steps restrict fi to the set of values vl for variable xj for the remaining xj tuples where xj takes the values from the set vl create two new formulas fit and fif obtained xj by restricting fit to the set vlk and fif to the set vlnj respectively here the subscript nl canonicalize the constraints in fit and fif substitute true and alse for in fit and fif respectively if xj appears in fit after the substitution its weight wit is equal to wi otherwise split fit into fitd such that projection over in each tuple of fid is count preserving with extension count given by eld the weight of each fid is wi eld similarly for fif we are now ready to define the binomial formulation for constrained theory theorem let be an mln theory partitioned with respect to variable xj let xj be singleton predicate let the projections of tuples associated with the formulas in which xj appears only in xj be count preserving let xj vl denote the partition of xj values xj in and let nl then the partition function can be computed using the recursive application of the following rule for each nl nl ml we apply theorem recursively for each partition component in turn to eliminate xj comqr pletely from the theory the binomial application as described above involves comp putations of whereas direct grounding method would involve nl computations two possibilities for each grounding of xj in turn see the supplement for an example normal forms and evidence processing normal forms normal form representation is an unconstrained representation which requires that there are no constants in any formula fl the domain of variables belonging to an equivalence class are identical to each other an unconstrained mln theory with evidence can be converted into normal form by series of mechanical operations in time 
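As a toy check of the unconstrained binomial rule stated above, consider an MLN containing only the formula Smokes(x) with weight w and n constants: conditioning on the number k of true groundings of the singleton predicate gives Z = sum_k C(n, k) exp(w k), which equals (1 + exp(w))^n. The sketch below is our illustration, not the authors' code.

```python
from math import comb, exp, isclose

def partition_binomial(n, w):
    """Binomial rule on the toy MLN { Smokes(x) : w } with n constants:
    condition on k, the number of true groundings of Smokes, weight each case
    by C(n, k), and note that the residual theory contributes exp(w * k)."""
    return sum(comb(n, k) * exp(w * k) for k in range(n + 1))

n, w = 10, 0.7
assert isclose(partition_binomial(n, w), (1.0 + exp(w)) ** n)   # closed form
```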
polynomial in the size domain source rules friends smokers fs webkb alchemy alchemy alchemy smokes cancer smokes friends smokes pageclass pageclass links director worksfor actor director movie worksfor movie imdb type of const person var page class person movie evidence smokes cancer pageclass actor director movie table dataset details var domain size varied separate weight learned for each grounding of the theory and the evidence any variable values appearing as constant in formula or in evidence is split apart from the rest of the domain and new variable with singleton domain created for them constrained theories can be normalized in similar manner by splitting apart those variables appearing any subset constraints simple variable substitution for equality and introducing explicit evidence predicates for inequality we can now state the following theorem theorem let be constrained mln theory the application of the modified lifting rules over this constrained theory can be exponentially more efficient than first converting the theory in the normal form and then applying the original formulation of the lifting rules evidence processing given predicate pj xk let ej denote its associated evidence further ejt ejf denote the set of ground atoms of pj which are assigned true alse in evidence let eju denote the set of groundings which are unknown neither true nor alse note that the set eju is implicitly specified the first step in processing evidence is to convert the sets ejt and ejf into the constraint representation form for every predicate pj this is done by using the hypercube representation over the set of variables appearing in predicate pj hypercube over set of variables can be seen as constraint tuple specifying subset constraint over each variable in the set union of hypercubes represents constraint set representing the union of corresponding constraint tuples finding minimal hypercube decomposition in and we employ the greedy hypercube construction algorithm as proposed singla et al algorithm the constraint representation for the implicit set eju can be obtained by eliminating the set ejt ejf from its bounding hypercube one which includes all the groundings in the set and then calling the hypercube construction algorithm over the remaining set once the constraint representation has been created for every set of evidence and atoms we join them together to obtain the constrained representation the join over constraints is implemented as described in section experiments in our experiments we compared the performance of our constrained formulation of lifting rules with the normal forms for the task of calculating the partition function we refer to our approach as setineq and normal forms as normal we also compared with ptp available in alchemy and gcfvoe system both our systems and gcfove are implemented in java ptp is implemented in we experimented on four benchmark mln domains for calculating the partition function using exact as well as approximate inference table shows the details of our datasets details for one of the domains professor and students ps are presented in supplement due to lack of space evidence was the only type of constraint considered in our experiments the experiments on all the datasets except webkb were carried on machine with intel core cpu and ram webkb is much larger dataset and we ran the experiments on ghz xeon server with cores and gb ram exact inference we compared the performance of the various algorithms using exact inference on two of the domains fs and ps 
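Before the experimental details, here is a simplified stand-in for the greedy hypercube construction used in the evidence-processing step above, restricted to a binary predicate. It is not the algorithm of Singla et al., only a sketch of the representation (names are ours): evidence pairs are grouped so that each group is an exact Cartesian product, i.e. one hypercube, and the union of the hypercubes equals the evidence set (though not necessarily with the minimum number of cubes).

```python
from collections import defaultdict

def hypercubes_2d(tuples):
    """Decompose evidence pairs (a, b) of a binary predicate into hypercubes.
    Group a-values by the exact set of b-values they appear with; each group
    (A, B) then satisfies A x B == the evidence restricted to A."""
    by_a = defaultdict(set)
    for a, b in tuples:
        by_a[a].add(b)
    groups = defaultdict(set)
    for a, bs in by_a.items():
        groups[frozenset(bs)].add(a)
    return [(sorted(A), sorted(B)) for B, A in groups.items()]

evidence = {("anna", "bob"), ("anna", "carl"), ("dan", "bob"), ("dan", "carl")}
print(hypercubes_2d(evidence))   # one cube: (['anna', 'dan'], ['bob', 'carl'])
```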
we do not compare the value of since we are dealing with exact inference in the following evidence on type means that of the constants of the type are randomly selected and evidence predicate groundings in which these constants appear are randomly set to true or false remaining evidence groundings are set to unknown is plotted on log scale in the following graphs figure shows the results as the domain size of person is varied from to with evidence in the fs domain we timed out an algorithm after hour ptp failed to gcfove https scale to even size and are not shown in the figure the time taken by normal grows very fast and it times out after size setineq and gcfove have much slower growth rate setineq is about an order of magnitude faster than gcfvoe on all domain sizes figure shows the time taken by the three algorithms as we vary the evidence on person with fixed domain size of for all the algorithms the time first increases with evidence and then drops setineq is up to an order of magnitude faster than gcfvoe and upto orders of magnitude faster than normal figure plots the number of nodes expanded by normal and setineq gcfove code did not provide any such equivalent value as expected we see much larger growth rate for normal compared to setineq fs size vs time sec setineq normal domain size setineq normal gcfove no of nodes setineq normal gcfove time in seconds time in seconds evidence domain size fs evidence vs time sec fs size vs nodes expanded figure results for exact inference on fs approximate inference time in seconds time in seconds for approximate inference we could only compare setineq mal with setineq gcfove does not have an approxinormal mate variant for computing marginals or partition tion ptp using importance sampling is not fully implemented in alchemy for approximate inference in both normal and setineq we used the unbiased tance sampling scheme as described by gogate domingos we collected total of samples for each estimate and averaged the values in all our domain size ments below the log values calculated by the two algorithms were within of each other hence the esti webkb size vs time sec mates are comparable with other we compared the setineq formance of the two algorithms on two real world datasets normal imdb and webkb see table for webkb we imented with most frequent page classes in univ of texas fold it had close to million ground clauses imdb has equal sized folds with close to ings in each the results presented are averaged over the folds figure on log scale shows the time taken by two algorithms as we vary the subset of pages in our evidence data from to the scaling behavior is similar to as observed earlier for datasets figure plots the timing of imdb evidence vs time sec the two algorithms as we vary the evidence on imdb figure results using approximate insetineq is able to exploit symmetries with increasing evference on webkb and imdb idence whereas normal performance degrades conclusion and future work in this paper we proposed new constraint language called setineq for relational probabilistic models our constraint formalism subsumes most existing formalisms we defined efficient operations over our language using canonical form representation and extended key lifting rules decomposer binomial and single occurrence to work with our constraint formalism experiments on benchmark mlns validate the efficacy of our approach directions for future work include exploiting our constraint formalism to facilitate approximate lifting of the theory acknowledgements happy 
mittal was supported by tcs research scholar program vibhav gogate was partially supported by the darpa probabilistic programming for advanced machine learning program under afrl prime contract number parag singla is being supported by google travel grant to attend the conference we thank somdeb sarkhel for helpful discussions references udi apsel kristian kersting and martin mladenov lifting relational using cluster signatures in proc of pages bui huynh and riedel automorphism groups of graphical models and lifted variational inference in proc of pages de salvo braz amir and roth lifted probabilistic inference in proc of pages de salvo braz amir and roth lifted probabilistic inference in getoor and taskar editors introduction to statistical relational learning mit press domingos and lowd markov logic an interface layer for artificial intelligence synthesis lectures on artificial intelligence and machine learning morgan claypool publishers van den broeck on the completeness of knowledge compilation for lifted probabilistic inference in proc of pages van den broeck lifted inference and learning in statistical relational models phd thesis ku leuven van den broeck on the complexity and approximation of binary evidence in lifted inference in proc of van den broeck and davis conditioning in knowledge compilation and lifted probabilistic inference in proc of van den broeck taghipour meert davis and de raedt lifted probabilistic inference by knowledge compilation in proc of gogate and domingos probabilisitic theorem proving in proc of pages gogate jha and venugopal advances in lifted importance sampling in proc of pages jha gogate meliou and suciu lifted inference seen from the other side the tractable features in proc of pages kersting ahmadi and natarajan counting belief propagation in proc of pages and poole constraint processing in lifted probabilistic inference in proc of mihalkova and mooney learning of markov logic network structure in proceedings of the international conference on machine learning pages milch zettlemoyer kersting haimes and kaebling lifted probabilistic inference with counting formulas in proc of mittal goyal gogate and singla new rules for domain independent lifted map inference in proc of pages mladenov globerson and kersting lifted message passing as reparametrization of graphical models in proc of pages mladenov and kersting equitable partitions of concave free energies in proc of poole probabilistic inference in proc of pages russell and norvig artificial intelligence modern approach edition pearson education singla and domingos lifted belief propagation in proc of pages singla nath and domingos approximate lifted belief propagation in proc of pages taghipour fierens davis and blockeel lifted variable elimination with arbitrary constraints in proc of canary islands spain venugopal and gogate on lifting the gibbs sampling algorithm in proc of pages 
gradient estimation using stochastic computation graphs john joschu nicolas heess theophane theophane pieter pabbeel google deepmind university of california berkeley eecs department abstract in variety of problems originating in supervised unsupervised and reinforcement learning the loss function is defined by an expectation over collection of random variables which might be part of probabilistic model or the external world estimating the gradient of this loss function using samples lies at the core of learning algorithms for these problems we introduce the formalism of stochastic computation acyclic graphs that include both deterministic functions and conditional probability describe how to easily and automatically derive an unbiased estimator of the loss function gradient the resulting algorithm for computing the gradient estimator is simple modification of the standard backpropagation algorithm the generic scheme we propose unifies estimators derived in variety of prior work along with techniques therein it could assist researchers in developing intricate models involving combination of stochastic and deterministic operations enabling for example attention memory and control actions introduction the great success of neural networks is due in part to the simplicity of the backpropagation algorithm which allows one to efficiently compute the gradient of any loss function defined as composition of differentiable functions this simplicity has allowed researchers to search in the space of architectures for those that are both highly expressive and conducive to optimization yielding for example convolutional neural networks in vision and lstms for sequence data however the backpropagation algorithm is only sufficient when the loss function is deterministic differentiable function of the parameter vector rich class of problems arising throughout machine learning requires optimizing loss functions that involve an expectation over random variables two broad categories of these problems are likelihood maximization in probabilistic models with latent variables and policy gradients in reinforcement learning combining ideas from from those two perennial topics recent models of attention and memory have used networks that involve combination of stochastic and deterministic operations in most of these problems from probabilistic modeling to reinforcement learning the loss functions and their gradients are intractable as they involve either sum over an exponential number of latent variable configurations or integrals that have no analytic solution prior work see section has provided derivations of gradient estimators however to our knowledge no previous work addresses the general case appendix recalls several classic and recent techniques in variational inference and reinforcement learning where the loss functions can be straightforwardly described using the formalism of stochastic computation graphs that we introduce for these examples the variancereduced gradient estimators derived in prior work are special cases of the results in sections and the contributions of this work are as follows we introduce formalism of stochastic computation graphs and in this general setting we derive unbiased estimators for the gradient of the expected loss we show how this estimator can be computed as the gradient of certain differentiable function which we call the surrogate loss hence it can be computed efficiently using the backpropagation algorithm this observation enables practitioner to write an efficient 
implementation using automatic differentiation software. We describe variance reduction techniques that can be applied to the setting of stochastic computation graphs, generalizing prior work from reinforcement learning and variational inference. We also briefly describe how to generalize some other optimization techniques to this setting: majorization-maximization algorithms, by constructing an expression that bounds the loss function; and quasi-Newton and Hessian-free methods, by computing estimates of Hessian-vector products. The main practical result of this article is that to compute the gradient estimator, one just needs to make a simple modification to the backpropagation algorithm, where extra gradient signals are introduced at the stochastic nodes. Equivalently, the resulting algorithm is just the backpropagation algorithm applied to the surrogate loss function, which has extra terms introduced at the stochastic nodes. The modified backpropagation algorithm is presented later in the paper.
Preliminaries: gradient estimators for a single random variable. This section discusses computing the gradient of an expectation taken over a single random variable; the estimators described here will be the building blocks for more complex cases with multiple variables. Suppose that $x$ is a random variable, $f$ is a function (say, the cost), and we are interested in computing $\frac{\partial}{\partial \theta}\mathbb{E}_x[f(x)]$. There are a few different ways that the process for generating $x$ could be parameterized in terms of $\theta$, which lead to different gradient estimators.
We might be given a parameterized probability distribution $x \sim p(\cdot;\theta)$. In this case we can use the score function (SF) estimator
$$\frac{\partial}{\partial \theta}\mathbb{E}_x[f(x)] = \mathbb{E}_x\!\left[f(x)\,\frac{\partial}{\partial \theta}\log p(x;\theta)\right].$$
This classic equation is derived as follows:
$$\frac{\partial}{\partial \theta}\mathbb{E}_x[f(x)] = \frac{\partial}{\partial \theta}\int \! dx\; p(x;\theta)\,f(x) = \int \! dx\; \frac{\partial}{\partial \theta} p(x;\theta)\,f(x) = \int \! dx\; p(x;\theta)\,\frac{\partial}{\partial \theta}\log p(x;\theta)\,f(x) = \mathbb{E}_x\!\left[f(x)\,\frac{\partial}{\partial \theta}\log p(x;\theta)\right].$$
This equation is valid if and only if $p(x;\theta)$ is a continuous function of $\theta$; however, $f$ does not need to be a continuous function of $x$.
Alternatively, $x$ may be a deterministic, differentiable function of $\theta$ and another random variable $z$, so we can write $x(z,\theta)$. Then we can use the pathwise derivative (PD) estimator, defined as follows:
$$\frac{\partial}{\partial \theta}\mathbb{E}_z\!\left[f(x(z,\theta))\right] = \mathbb{E}_z\!\left[\frac{\partial}{\partial \theta} f(x(z,\theta))\right].$$
This equation, which merely swaps the derivative and expectation, is valid if and only if $f(x(z,\theta))$ is a continuous function of $\theta$ for all $z$. That is not true if, for example, $f$ is a step function. Note that for the pathwise derivative estimator, $f(x(z,\theta))$ merely needs to be a continuous function of $\theta$; it is sufficient that this function is differentiable almost everywhere. A similar statement can be made about $p(x;\theta)$ and the score function estimator. See Glasserman for a detailed discussion of the technical requirements for these gradient estimators to be valid.
Finally, $\theta$ might appear both in the probability distribution and inside the expectation, as in $\frac{\partial}{\partial \theta}\mathbb{E}_{x\sim p(\cdot;\theta)}[f(x,\theta)]$. Then the gradient estimator has two terms:
$$\frac{\partial}{\partial \theta}\mathbb{E}_{x\sim p(\cdot;\theta)}[f(x,\theta)] = \mathbb{E}_x\!\left[\frac{\partial}{\partial \theta} f(x,\theta) + \left(\frac{\partial}{\partial \theta}\log p(x;\theta)\right) f(x,\theta)\right].$$
This formula can be derived by writing the expectation as an integral and differentiating, as in the derivation of the SF estimator above. In some cases it is possible to reparameterize a probabilistic model, moving $\theta$ from the distribution to inside the expectation or vice versa; see prior work for a general discussion and for a recent application of this idea to variational inference.
The SF and PD estimators are applicable in different scenarios and have different properties. SF is valid under more permissive mathematical conditions than PD: SF can be used if $f$ is discontinuous, or if $x$ is a discrete random variable. SF only requires sample values $f(x)$, whereas PD requires the derivatives $f'(x)$. In the context of control (reinforcement learning), SF can be used to obtain unbiased policy gradient estimators in the setting where we have no model of the dynamics and only have access to sample trajectories. SF tends to have higher variance than PD when both estimators are applicable; the variance of SF often grows linearly with the dimensionality of the sampled variables. Hence, PD is usually preferable when $x$ is high-dimensional.
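To make the two estimators concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) estimates the same gradient both ways for $x \sim \mathcal{N}(\theta, \sigma^2)$ and $f(x) = x^2$, where the exact gradient of $\mathbb{E}[x^2] = \theta^2 + \sigma^2$ is $2\theta$. The Gaussian example, the cost function, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

# Compare the score-function (SF) and pathwise-derivative (PD) estimators of
# d/dtheta E[f(x)] for x ~ N(theta, sigma^2) and f(x) = x^2 (exact gradient: 2*theta).
rng = np.random.default_rng(0)
theta, sigma, n = 1.5, 1.0, 100_000

f = lambda x: x ** 2        # cost function
df = lambda x: 2.0 * x      # its derivative, needed only by the PD estimator

# SF estimator: E[f(x) * d/dtheta log N(x; theta, sigma^2)], using samples of f only.
x = rng.normal(theta, sigma, size=n)
sf_terms = f(x) * (x - theta) / sigma ** 2
sf_grad = sf_terms.mean()

# PD estimator: write x = theta + sigma * z with z ~ N(0, 1) and differentiate
# through the sample path: E[f'(x) * dx/dtheta] = E[f'(theta + sigma * z)].
z = rng.normal(0.0, 1.0, size=n)
pd_terms = df(theta + sigma * z)
pd_grad = pd_terms.mean()

print(f"exact gradient : {2 * theta:.3f}")
print(f"SF estimate    : {sf_grad:.3f}  (per-sample variance {sf_terms.var():.2f})")
print(f"PD estimate    : {pd_grad:.3f}  (per-sample variance {pd_terms.var():.2f})")
```

In this toy setting the PD per-sample variance typically comes out smaller than the SF one, matching the comparison discussed above.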
on the other hand pd has high variance if the function is rough which occurs in many problems due to an exploding gradient problem butterfly effect pd allows for deterministic limit sf does not this idea is exploited by the deterministic policy gradient algorithm nomenclature the methods of estimating gradients of expectations have been independently proposed in several different fields which use differing terminology what we call the score function estimator via is alternatively called the likelihood ratio estimator and reinforce we chose this term because the score function is object in statistics what we call the pathwise derivative estimator from the mathematical finance literature and reinforcement learning is alternatively called infinitesimal perturbation analysis and stochastic backpropagation we chose this term because pathwise derivative is evocative of propagating derivative through sample path stochastic computation graphs the results of this article will apply to stochastic computation graphs which are defined as follows definition stochastic computation graph directed acyclic graph with three types of nodes input nodes which are set externally including the parameters we differentiate with respect to deterministic nodes which are functions of their parents stochastic nodes which are distributed conditionally on their parents each parent of node is connected to it by directed edge in the subsequent diagrams of this article we will use circles to denote stochastic nodes and squares to denote deterministic nodes as illustrated below the structure of the graph fully specifies what estimator we will use sf pd or combination thereof this graphical notation is shown below along with the estimators from section input node deterministic node gives sf estimator stochastic node gives pd estimator simple examples several simple examples that illustrate the stochastic computation graph formalism are shown below the gradient estimators can be described by writing the expectations as integrals and differentiating as with the simpler estimators from section however they are also implied by the general results that we will present in section stochastic computation graph objective gradient estimator ey log ex log ex log ex log log log figure simple stochastic computation graphs these simple examples illustrate several important motifs where stochastic and deterministic nodes are arranged in series or in parallel for example note that in the derivative of does not appear in the estimator since the path from to is blocked by similarly in does not appear this type of behavior is particularly useful if we only have access to simulator of system but not access to the actual likelihood function on the other hand has direct path from to which contributes term to the gradient estimator resembles parameterized markov reward process and it illustrates that we ll obtain score function terms of the form grad future costs the examples above all have one input but the formalism accommodates models with multiple inputs for ample stochastic neural network with multiple layers of weights and biases which may influence different subcrosssoftentropy sets of the stochastic and cost nodes see appendix max loss for nontrivial examples with stochastic nodes and multiple inputs the figure on the right shows deterministic computation graph representing classification loss for neural network which has four parameters weights and biases of course this deterministic computation graph is special type of stochastic 
computation graph main results on stochastic computation graphs gradient estimators this section will consider general stochastic computation graph in which certain set of nodes are designated as costs and we would like to compute the gradient of the sum of costs with respect to some input node in brief the main results of this section are as follows we derive gradient estimator for an expected sum of costs in stochastic computation graph this estimator contains two parts score function part which is sum of terms grad logprob of variable sum of costs influenced by variable and pathwise derivative term that propagates the dependence through differentiable functions this gradient estimator can be computed efficiently by differentiating an appropriate surrogate objective function let denote the set of input nodes the set of deterministic nodes and the set of stochastic nodes further we will designate set of cost nodes which are and deterministic note that there is no loss of generality in assuming that the costs are cost is stochastic we can simply append deterministic node that applies the identity function to it we will use to denote an input node that we differentiate with respect to in the context of machine learning we will usually be most concerned with differentiating with respect to parameter vector or tensor however the theory we present does not make any assumptions about what represents for the results that follow we need to define the notion of influence for which we will introduce two relations and the relation influences means that there exists sequence of nodes ak with such that ak ak ak are edges in the graph the relation deterministically influences is defined similarly except that now we require that each ak is deterministic node for example in figure diagram above influences but it only deterministically influences notation glossary input nodes deterministic nodes stochastic nodes cost nodes influences deterministically influences deps dependencies next we will establish condition that is sufficient sum of cost nodes influenced by for the existence of the gradient namely we will stipulate that every edge with lying in the denotes the sampled value of the node influenced set of corresponds to differentiable dependency if is deterministic then the jacobian must exist if is stochastic then the probability mass function must be differentiable with respect to more formally condition differentiability requirements given input node for all edges which satisfy and then the following condition holds if is deterministic jacobian exists and if is stochastic then the derivative of the probability mass function parents exists note that condition does not require that all the functions in the graph are differentiable if the path from an input to deterministic node is blocked by stochastic nodes then may be nondifferentiable function of its parents if path from input to stochastic node is blocked by other stochastic nodes the likelihood of given its parents need not be differentiable in fact it does not need to be this fact is particularly important for reinforcement learning allowing us to compute policy gradient estimates despite having discontinuous dynamics function or reward function we need few more definitions to state the main theorems let depsv the dependencies of node the set of nodes that deterministically influence it note the following if the probability mass function of is function of depsv we can write depsv if is deterministic function of depsv so we can write depsv let 
the sum of costs downstream of node these costs will be treated as constant fixed to the values obtained during sampling in general we will use the hat symbol to denote sample value of variable which will be treated as constant in the gradient formulae now we can write down general expression for the gradient of the expected sum of costs in stochastic computation graph theorem tions hold suppose that satisfies condition then the following two equivalent log depsw depsc dw log deps deps proof see appendix dc dc the estimator expressions above have two terms the first term is due to the influence of on probability distributions the second term is due to the influence of on the cost variables through chain of differentiable functions the distribution term involves sum of gradients times downstream costs the first term in equation involves sum of gradients times downstream costs whereas the first term in equation has sum of costs times upstream gradients surrogate loss functions surrogate loss computation graph the next corollary lets us write down surrogate objective which is function of the inputs that we can differentiate to obtain an unbiased gradient estimator corollary let log deps deps then differentiation of gives us an unbiased grac dient estimate one practical consequence of this result is that we can apply standard automatic differentiation procedure to to obtain an unbiased gradient estimator in other words we convert the stochastic computation graph into deterministic computation graph to which we can apply the backpropagation algorithm there are several alternative ways to define the surrogate objective function that give the same gradient as from corollary we could also write depsc log fˆ log fˆ log fˆ log fˆ log log figure deterministic where is the probability depsw obtained during sampling tation graphs obtained as surrowhich is viewed as constant gate loss functions of stochastic computation graphs from the surrogate objective from corollary is actually an upper bound ure on the true objective in the case that all costs are negative the the costs are not deterministically influenced by the parameters this construction allows from algorithms similar to em to be applied to general stochastic computation graphs see appendix for details derivatives the gradient estimator for stochastic computation graph is itself stochastic computation graph hence it is possible to compute the gradient yet again for each component of the gradient vector and get an estimator of the hessian for most problems of interest it is not efficient to compute this dense hessian on the other hand one can also differentiate the product to get computation is usually not much more expensive than the gradient computation itself the product can be used to implement algorithm via the conjugate gradient algorithm variant of this technique called optimization has been used to train large neural networks variance reduction consider estimating clearly this expectation is unaffected by subtracting con stant from the integrand which gives taking the score function estimator we get log taking ex generally leads to substantial variance is often called see for more thorough discussion of baselines and their variance reduction properties we can make general statement for the case of stochastic computation we can add baseline to every stochastic node which depends all of the nodes it doesn influence let on nfluenced theorem log parentsv on nfluenced proof see appendix algorithms as shown in section the gradient estimator 
can be obtained by differentiating surrogate objective function hence this derivative can be computed by performing the backpropagation algorithm on that is likely to be the most practical and efficient method and can be facilitated by automatic differentiation software algorithm shows explicitly how to compute the gradient estimator in backwards pass through the stochastic computation graph the algorithm will recursively compute gv at every deterministic and input node related work as discussed in section the score function and pathwise derivative estimators have been used in variety of different fields under different names see for review of gradient estimation mostly from the simulation optimization literature glasserman textbook provides an extensive treatment of various gradient estimators and monte carlo estimators in general griewank and walther textbook is comprehensive reference on computation graphs and automatic differentiation of deterministic programs the notation and nomenclature we use is inspired by bayes nets and influence diagrams in fact stochastic computation graph is type of bayes network where the deterministic nodes correspond to degenerate probability distributions the topic of gradient estimation has drawn significant recent interest in machine learning gradients for networks with stochastic units was investigated in bengio et al though they are concerned the optimal baseline for scalar is in fact the weighted expectation log ex ex where algorithm compute gradient estimator for stochastic computation graph for graph initialization at output nodes do if gv otherwise end for compute for all nodes graph for in everse opological ort on nputs do reverse traversal for parentsv do if not tochastic then if tochastic then gw log parentsv else gw gv end if end if end for end for return with differentiating through individual units and layers not how to deal with arbitrarily structured models and loss functions kingma and welling consider similar framework although only with continuous latent variables and point out that reparameterization can be used to to convert hierarchical bayesian models into neural networks which can then be trained by backpropagation the score function method is used to perform variational inference in general models in the context of probabilistic programming in wingate and weber and similarly in ranganath et al both papers mostly focus on approximations without amortized inference it is used to train generative models using neural networks with discrete stochastic units in mnih and gregor and gregor et al in both amortize inference by using an inference network generative models with continuous valued latent variables networks are trained again using an inference network with the reparametrization method by rezende mohamed and wierstra and by kingma and welling rezende et al also provide detailed discussion of reparameterization including discussion comparing the variance of the sf and pd estimators bengio leonard and courville have recently written paper about gradient estimation in neural networks with stochastic units or activation monte carlo estimators and heuristic approximations the notion that policy gradients can be computed in multiple ways was pointed out in early work on policy gradients by williams however all of this prior work deals with specific structures of the stochastic computation graph and does not address the general case conclusion acknowledgements we have developed framework for describing computation with stochastic and 
deterministic operations called stochastic computation graph given stochastic computation graph we can automatically obtain gradient estimator given that the graph satisfies the appropriate conditions on differentiability of the functions at its nodes the gradient can be computed efficiently in backwards traversal through the graph one approach is to apply the standard backpropagation algorithm to one of the surrogate loss functions from section another approach which is roughly equivalent is to apply modified backpropagation procedure shown in algorithm the results we have presented are sufficiently general to automatically reproduce variety of gradient estimators that have been derived in prior work in reinforcement learning and probabilistic modeling as we show in appendix we hope that this work will facilitate further development of interesting and expressive models we would like to thank shakir mohamed dave silver yuval tassa andriy mnih and others at deepmind for insightful comments references baxter and bartlett estimation journal of artificial intelligence research pages bengio and courville estimating or propagating gradients through stochastic neurons for conditional computation arxiv preprint fu gradient estimation handbooks in operations research and management science glasserman monte carlo methods in financial engineering volume springer science business media glynn likelihood ratio gradient estimation for stochastic systems communications of the acm greensmith bartlett and baxter variance reduction techniques for gradient estimates in reinforcement learning the journal of machine learning research gregor danihelka mnih blundell and wierstra deep autoregressive networks arxiv preprint griewank and walther evaluating derivatives principles and techniques of algorithmic differentiation siam hochreiter and schmidhuber long memory neural computation kingma and welling variational bayes kingma and welling efficient inference through transformations between bayes nets and neural nets arxiv preprint lecun bottou bengio and haffner learning applied to document recognition proceedings of the ieee martens deep learning via optimization in proceedings of the international conference on machine learning pages mnih and gregor neural variational inference and learning in belief networks mnih heess graves and kavukcuoglu recurrent models of visual attention in advances in neural information processing systems pages munos policy gradient in continuous time the journal of machine learning research neal learning stochastic feedforward networks department of computer science university of toronto neal and hinton view of the em algorithm that justifies incremental sparse and other variants in learning in graphical models pages springer pearl probabilistic reasoning in intelligent systems networks of plausible inference morgan kaufmann ranganath gerrish and blei black box variational inference arxiv preprint rezende mohamed and wierstra stochastic backpropagation and approximate inference in deep generative models silver lever heess degris wierstra and riedmiller deterministic policy gradient algorithms in icml sutton mcallester singh mansour et al policy gradient methods for reinforcement learning with function approximation in nips volume pages citeseer vlassis toussaint kontes and piperidis learning robot control by monte carlo em algorithm autonomous robots wierstra peters and schmidhuber recurrent policy gradients logic journal of igpl williams simple statistical algorithms for connectionist 
reinforcement learning machine learning wingate and weber automated variational inference in probabilistic programming arxiv preprint wright and nocedal numerical optimization volume springer new york zaremba and sutskever reinforcement learning neural turing machines arxiv preprint 
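The central practical message of the paper above is that an unbiased gradient estimator can be obtained by running ordinary automatic differentiation on a surrogate loss that attaches log-probability terms, multiplied by sampled downstream costs held constant, at the stochastic nodes. The following PyTorch sketch is our own minimal illustration of that idea on a one-stochastic-node graph; the distribution, the cost, and all variable names are assumptions, not the authors' code.

```python
import torch
from torch.distributions import Normal

# Graph: theta -> x (stochastic) -> cost, with x ~ N(theta, 1) and cost = (x - 3)^2.
# Exact gradient of E[cost] w.r.t. theta is 2 * (theta - 3).
torch.manual_seed(0)
theta = torch.tensor([1.0], requires_grad=True)
n = 50_000

dist = Normal(loc=theta, scale=1.0)
x = dist.sample((n,))                 # sampling blocks the pathwise route
cost = (x - 3.0) ** 2                 # deterministic downstream cost

# Surrogate loss: log p(x; theta) times the sampled downstream cost treated as a
# constant; backpropagating through it yields the score-function estimator.
surrogate = (dist.log_prob(x) * cost.detach()).mean()
surrogate.backward()
print("SF estimate :", theta.grad.item())
print("exact       :", 2 * (theta.item() - 3.0))

# Reparameterized (pathwise) version of the same graph, for comparison.
theta.grad = None
x_rep = Normal(0.0, 1.0).sample((n,)) + theta    # x = theta + z, differentiable in theta
((x_rep - 3.0) ** 2).mean().backward()
print("PD estimate :", theta.grad.item())
```

Detaching the sampled cost is what makes the log-probability term multiply a constant downstream-cost value, mirroring the surrogate-loss construction; the reparameterized variant at the end corresponds to the pathwise estimator.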
relative entropy stochastic search abbas rudolf nuno luis paulo jan and gerhard ieeta university of aveiro aveiro portugal dsi university of minho braga portugal liacc university of porto porto portugal ias clas tu darmstadt darmstadt germany max planck institute for intelligent systems stuttgart germany lioutikov peters neumann nunolau lpreis abstract stochastic search algorithms are general optimizers due to their ease of use and their generality they have recently also gained lot of attention in operations research machine learning and policy search yet these algorithms require lot of evaluations of the objective scale poorly with the problem dimension are affected by highly noisy objective functions and may converge prematurely to alleviate these problems we introduce new stochastic search approach we learn simple quadratic surrogate models of the objective function as the quality of such quadratic approximation is limited we do not greedily exploit the learned models the algorithm can be misled by an inaccurate optimum introduced by the surrogate instead we use information theoretic constraints to bound the distance between the new and old data distribution while maximizing the objective function additionally the new method is able to sustain the exploration of the search distribution to avoid premature convergence we compare our method with state of art optimization methods on standard and optimization functions on simulated planar robot tasks and complex robot ball throwing task the proposed method considerably outperforms the existing approaches introduction stochastic search algorithms are black box optimizers of an objective function that is either unknown or too complex to be modeled explicitly these algorithms only make weak assumption on the structure of underlying objective function they only use the objective values and don require gradients or higher derivatives of the objective function therefore they are well suited for black box optimization problems stochastic search algorithms typically maintain stochastic search distribution over parameters of the objective function which is typically multivariate gaussian distribution this policy is used to create samples from the objective function subsequently new stochastic search distribution is computed by either computing gradient based updates evolutionary strategies the method path integrals or policy updates policy updates bound the relative entropy also called kullback leibler or kl divergence between two subsequent policies using for the update of the search distribution is common approach in the stochastic search however such information theoretic bounds could so far only be approximately applied either by using of the resulting in natural evolutionary strategies nes or approximations resulting in the relative entropy policy search reps algorithm in this paper we present novel stochastic search algorithm which is called stochastic search more for the first time our algorithm bounds the kl divergence of the new and old search distribution in closed form without approximations we show that this exact bound performs considerably better than approximated kl bounds in order to do so we locally learn simple quadratic surrogate of the objective function the quadratic surrogate allows us to compute the new search distribution analytically where the kl divergence of the new and old distribution is bounded therefore we only exploit the surrogate model locally which prevents the algorithm to be misled by inaccurate optima introduced 
by an inaccurate surrogate model however learning quadratic reward models directly in parameter space comes with the burden of quadratically many parameters that need to be estimated we therefore investigate new methods that rely on dimensionality reduction for learning such surrogate models in order to avoid we use supervised bayesian dimensionality reduction approach this dimensionality reduction technique avoids over fitting which makes the algorithm applicable also to high dimensional problems in addition to solving the search distribution update in closed form we also the entropy of the new search distribution to ensure that exploration is sustained in the search distribution throughout the learning progress and hence premature convergence is avoided we will show that this method is more effective than commonly used heuristics that also enforce exploration for example adding small diagonal matrix to the estimated covariance matrix we provide comparison of stochastic search algorithms on standard objective functions used for benchmarking and in simulated robotics tasks the results show that more considerably outperforms methods problem statement we want to maximize an objective function rn the goal is to find one or more parameter vectors rn which have the highest possible objective value we maintain search distribution over the parameter space of the objective function the search distribution is implemented as multivariate gaussian distribution in each iteration the search distribution is used to create samples of the parameter vector subsequently the possibly noisy evaluation of is obtained by querying the objective function the samples are subsequently used to compute new search distribution this process will run iteratively until the algorithm converges to solution related work recent it policy search algorithms are based on the relative entropy policy search reps algorithm which was proposed in as policy search algorithm however in an version of reps that is equivalent to stochastic search was presented the key idea behind reps is to control the by bounding the relative entropy between the old data distribution and the newly estimated search distribution by factor due to the relative entropy bound the algorithm achieves smooth and stable learning process however the episodic reps algorithm uses sample based approximation of the which needs lot of samples in order to be accurate moreover typical problem of reps is that the entropy of the search distribution decreases too quickly resulting in premature convergence taylor approximations of the have also been used very successfully in the area of stochastic search resulting in natural evolutionary strategies nes nes uses the natural gradient to optimize the objective the natural gradient has been shown to outperform the standard gradient in many applications in machine learning the intuition of the natural gradient is that we want to obtain an update direction of the parameters of the search distribution that is most similar to the standard gradient while the between new and old search distributions is bounded to obtain this update direction second order approximation of the kl which is equivalent to the fisher information matrix is used surrogate based stochastic search algorithms have been shown to be more sample efficient than direct stochastic search methods and can also smooth out the noise of the objective function for example an individual optimization method is used on the surrogate that is stopped whenever the between the new 
and the old distribution exceeds certain bound for the first time our algorithm uses the surrogate model to compute the new search distribution analytically which bounds the kl divergence of the new and old search distribution in closed form quadratic models have been used successfully in trust region methods for local surrogate approximation these methods do not maintain stochastic search distribution but point estimate and trust region around this point they update the point estimate by optimizing the surrogate and staying in the trusted region subsequently heuristics are used to increase or decrease the trusted region in the more algorithm the trusted region is defined implicitly by the the covariance matrix strategy is considered as the state of the art in stochastic search optimization also maintains gaussian distribution over the problem parameter vector and uses heuristics to update the search distribution relative entropy stochastic search similar to information theoretic policy search algorithms we want to control the explorationexploitation by bounding the relative entropy of two subsequent search distribution however by bounding the kl the algorithm can adapt the mean and the variance of the algorithm in order to maximize the objective for the immediate iteration the shrinkage in the variance typically dominates the contribution to the which often leads to premature convergence of these algorithms hence in addition to control the of the update we also need to control the shrinkage of the covariance matrix such control mechanism can be implemented by the entropy of the new distribution in this paper we will set the bound always to certain percentage of the entropy of the old search distribution such that more converges asymptotically to point estimate the more framework similar as in we can formulate an optimization problem to obtain new search distribution that maximizes the expected objective value while the and lowerbounding the entropy of the distribution max rθ dθ kl dθ where rθ denotes the expected objective when evaluating parameter vector the term log dθ denotes the entropy of the distribution and is the old distribution the parameters and are parameters to control the of the algorithm we can obtain closed form solution for by optimizing the lagrangian for the optimization problem given above this solution is given as rθ exp where and are the lagrangian multipliers as we can see the new distribution is now geometric average between the old sampling distribution and the exponential transformation of the objective function note that by setting we obtain the standard episodic reps formulation the optimal value for and can be obtained by minimizing the dual function such that and see the dual function is given by rθ ωβ log exp dθ note that we are typically not able to obtain the expected reward but only noisy estimate of the underlying reward distribution as we are dealing with continuous distributions the entropy can also be negative we specify such that the relative difference of to minimum exploration policy is decreased for certain percentage we change the entropy constraint to throughout all our experiments we use the same value of and we set minimum entropy of search distribution to small enough value like we will show that using the additional entropy bound considerably alleviates the premature convergence problem analytic solution of the and the policy using quadratic surrogate model of the objective function we can compute the integrals in the dual function 
analytically, and hence we can satisfy the introduced bounds exactly in the MORE framework. At the same time we take advantage of surrogate models, such as a smoothed estimate in the case of noisy objective functions and a decrease in the sample complexity. We will for now assume that we are given a quadratic surrogate model
$$R_\theta \approx \theta^{T} R\, \theta + \theta^{T} r + r_0$$
of the objective function, which we will learn from data as described in the next section. Moreover, the search distribution is Gaussian, $\pi(\theta) = \mathcal{N}(\theta \mid b, Q)$. In this case the integrals in the dual function given above can be solved in closed form: the integral inside the log now represents an integral over an unnormalized Gaussian distribution, hence it evaluates to the inverse of the normalization factor of the corresponding Gaussian. After rearranging terms, the dual can be written as
$$g(\eta,\omega) = \eta\epsilon - \omega\beta + \tfrac{1}{2}\Big( f^{T} F f - \eta\, b^{T} Q^{-1} b - \eta \log|2\pi Q| + (\eta+\omega)\log|2\pi(\eta+\omega)F| \Big),$$
with $F = (\eta Q^{-1} - 2R)^{-1}$ and $f = \eta Q^{-1} b + r$. Hence the dual function can be efficiently evaluated by matrix inversions and matrix products; to optimize it, any constrained nonlinear optimization method can be used. Note that for a large enough value of $\eta$ the matrix $\eta Q^{-1} - 2R$ will be positive definite, and hence invertible, even if $-2R$ is not. In our optimization we always restrict the $\eta$ values such that this matrix stays positive definite; nevertheless, we could always find the $\eta$ value with the correct KL-divergence. In contrast to MORE, episodic REPS relies on a sample-based approximation of the integrals in the dual function: it uses the sampled rewards $R_{\theta}$ of the parameters $\theta$ to approximate this integral.
We can also obtain the update rule for the new policy. From the solution above we know that the new policy is the geometric average of the Gaussian sampling distribution and a squared exponential given by the exponentially transformed surrogate. After rearranging terms and completing the square, the new policy can be written as
$$\pi(\theta) = \mathcal{N}\big(\theta \mid F f,\; F(\eta+\omega)\big),$$
where $F$ and $f$ are given above.
Learning approximate quadratic models. In this section we show how to learn the quadratic surrogate. Note that we use the quadratic surrogate in each iteration to locally approximate the objective function, not globally. As the search distribution will shrink in each iteration, the model error will also vanish asymptotically. A quadratic surrogate is also a natural choice if a Gaussian search distribution is used, because the exponent of the Gaussian is also quadratic in the parameters; hence, even a more complex surrogate could not be exploited by a Gaussian distribution. A local quadratic surrogate model provides similar second-order information as the Hessian in standard second-order gradient updates. However, a quadratic surrogate model also has quadratically many parameters, which we have to estimate from an ideally very small data set; therefore, already learning a simple local quadratic surrogate is a challenging task. The regression performed for learning the quadratic surrogate model estimates the expectation of the objective function from the observed samples.
Figure: comparison of stochastic search methods (MORE, REPS, PoWER, xNES) for optimizing the Rosenbrock and the multi-modal Rastrigin functions, and a comparison on a noisy objective function (average return vs. episodes). All results show that MORE clearly outperforms the other methods.
In order to learn the local quadratic surrogate we can use linear regression to fit a function of the form $\hat{R}_\theta = \phi(\theta)^{T}\beta$, where $\phi(\theta)$ is a feature function that returns a bias term, all linear terms and all quadratic terms of $\theta$; hence the dimensionality of $\phi(\theta)$ is $1 + n + n(n+1)/2$, where $n$ is the dimensionality of the parameter space. To reduce the dimensionality of the regression problem, we project $\theta$ into a lower-dimensional space and solve the linear
regression problem in this reduced the quadratic form of the objective function can then be computed from and still the question remains how to choose the projection matrix we did not achieve good performance with standard pca as pca is unsupervised yet the matrix is typically quite high dimensional such that it is hard to obtain the matrix by supervised learning and simultaneously avoid inspired by where supervised bayesian dimensionality reduction are used for classification we also use supervised bayesian approach where we integrate out the projection matrix bayesian dimensionality reduction for quadratic functions in order to integrate out the parameters we use the following probabilistic dimensionality reduction model dw where is prediction of the objective at query point is the training data set consisting of parameters and their objective evaluations the posterior for is given by bayes rule the likelihood function is given by dβ where is the likelihood of the linear model and its prior for the likelihood of the linear model we use multiplicative noise model the higher the absolute value of the objective the higher the variance the intuition behind this choice is that we are mainly interested in minimizing the relative error instead of the absolute our likelihood and prior is therefore given by is projection matrix that projects vector from dimension manifold to dimension manifold we observed empirically that such relative error performs better if we have objective functions with large difference in the objective values for example an error of has huge influence for an objective value of while for value of such an error is negligible equation is weighted bayesian linear regression model in where the weight of each sample is scaled by the absolute value of therefore can be obtained efficiently in closed form however due to the feature transformation the output depends on the projection therefore the posterior can not be obtained in closed form any more we use simple approach in order to approximate the posterior we use samples from the prior to approximate the integrals in equation and in in this case the predictive model is given by where the prediction for single can again be obtained by standard bayesian linear regression our algorithm is only interested in the expectation rθ in the form of quadratic model given certain we can obtain single quadratic model from µβ where µβ is the mean of the posterior distribution obtained by bayesian linear regression the expected quadratic model is then obtained by weighted average over all quadratic models with weight note that with higher number of projection matrix samples the better the posterior can be approximated generating these samples is typically inexpensive as it just requires computation time but no evaluation of the objective function we also investigated using more sophisticated sampling techniques such as elliptical slice sampling which achieved similar performance but considerably increased computation time further optimization of the sampling technique is part of future work experiments we compare more with state of the art methods in stochastic search and policy search such as nes power and episodic reps in our first experiments we use standard optimization test functions such as the the rosenbrock uni modal and the rastrigin multi modal functions we use dimensional version of these functions furthermore we use planar robot that has to reach given point in task space as toy task for the comparisons the resulting policy has 
parameters but we also test the algorithms in highdimensional parameter spaces by scaling the robot up to links parameters we subsequently made the task more difficult by introducing hard obstacles which results in discontinuous objective function we denote this task task finally we evaluate our algorithm on physical simulation of robot playing beer pong the used parameters of the algorithms and detailed evaluation of the parameters of more can be found in the supplement standard optimization test functions we chose one functions xi also known as rosenbrock pn function and function which is known as the rastgirin function cos all these functions have global minimum equal in our experiments the mean of the initial distributions has been chosen randomly algorithmic comparison we compared our algorithm against nes power and reps in each iteration we generated new samples for more reps and power we always keep the last samples while for nes and only the current samples are as we can see in the figure more outperforms all the other methods in terms of learning speed and final performance in all test functions however in terms of the computation time more was times slower than the other algorithms yet more was sufficiently fast as one policy update took less than performance on noisy function we also conducted an experiment on optimizing the sphere function where we add multiplicative noise to the reward samples where and xm with randomly chosen matrix we use the heuristics introduced in for and nes nes and algorithms typically only use the new samples and discard the old samples we also tried keeping old samples or getting more new samples which decreased the performance considerably average return average return reps more xnes reps more xnes episodes reaching task average return episodes reaching task gamma gamma gamma episodes evaluation of figure algorithmic comparison for planar task joints parameters more outperforms all the other methods considerably algorithmic comparison for task joints parameters the performance of nes degraded while more could still outperform evaluation of the entropy bound for low the entropy bound is not active and the algorithm converges prematurely if is close to one the entropy is reduced too slowly and convergence takes long figure shows that more successfully smooths out the noise and converges while other methods diverge the result shows that more can learn highly noisy reward functions planar reaching and hole reaching we used planar robot with dmps as the underlying control policy each link had length of the robot is modeled as decoupled linear dynamical system the of the robot has to reach at time step and at the final time step the point with its end effector the reward was given by quadratic cost term for the two as well as quadratic costs for high accelerations note that this objective function is highly in the parameters as the are defined in end effector space we used basis functions per degree of freedom for the dmps while the goal attractor for reaching the final state was assumed to be known hence our parameter vector had dimensions the setup including the learned policy is shown in the supplement algorithmic comparison we generated new samples for more reps we always keep the last samples while for nes and only the current samples are kept we empirically optimized the open parameters of the algorithms by manually testing parameter sets for each algorithm the results shown in figure clearly show that more outperforms all other methods in terms of 
speed and the final performance entropy bound we also evaluated the entropy bound in figure we can see that the entropy constraint is crucial component of the algorithm to avoid the premature convergence parameter spaces we also evaluated the same task with planar robot resulting in dimensional parameter space we compared more cma reps and nes while nes considerably degraded in performance cma and more performed well where more found considerably better policies average reward of versus of see figure the setup with the learned policy from more is depicted in the supplement we use the same robot setup as in the planar reaching task for hole reaching task for completing the hole reaching task the robot end effector has to reach the bottom of hole wide and deep centering at without any collision with the ground or the walls see figure the reward was given by quadratic cost term for the desired final point quadratic costs for high accelerations and additional punishment for collisions with the walls note that this objective function is discontinuous due to the costs for collisions the goal attractor of the dmp for reaching the final state in this task is unknown and is also learned hence our parameter vector had dimensions algorithmic comparison we used the same learning parameters as for the planar reaching task the results shown in figure show that more clearly outperforms all other methods in this task nes could not find any reasonable solution while power reps and could only learn solutions more could also achieve the same learning speed as reps and but would then also converge to solution average return average return reps power more episodes reps power more xnes hole reaching task beer pong task hole reaching task posture figure algorithmic comparison for the hole reaching task more could find policies of much higher quality algorithmic comparison for the beer pong task only more could reliably learn policies while for the other methods even if some trials found good solutions other trials got stuck prematurely beer pong in this task seven dof simulated barrett wam robot arm had to play it had to throw ball such that it bounces once on the table and falls into cup the ball was placed in container mounted on the the ball could leave the container by strong deceleration of the robot we again used dmp as underlying policy representation where we used the shape parameters five beer pong task per dof and the goal attractor one per dof as parameters the mean of our search distribution was initialized with imitation learning the cup was placed at distance of from figure the beer pong task the robot has to throw ball such that it the robot and it had height of as reward function we bounces of the table and ends up in the computed the point of the ball trajectory after the bounce on cup the table where the ball is passing the plane of the entry of the cup the reward was set to be times the negative squared distance of that point to the center of the cup while punishing the acceleration of the joints we evaluated more cma power and reps on this task the setup is shown in figure and the learning curve is shown in figure more was able to accurately hit the ball into the cup while the other algorithms couldn find robust policy conclusion using to limit the update of the search distribution is idea in the stochastic search community but typically requires approximations in this paper we presented new stochastic search algorithm that computes the analytically by relying on gaussian search distribution 
and on locally learned quadratic models of the objective function we can obtain closed form of the information theoretic policy update we also introduced an additional entropy term in the formulation that is needed to avoid premature shrinkage of the variance of the search distribution our algorithm considerably outperforms competing methods in all the considered scenarios the main disadvantage of more is the number of parameters however based on our experiments these parameters are not problem specific acknowledgment this project has received funding from the european unions horizon research and innovation programme under grant agreement no romans and the first author is supported by fct under grant references hansen muller and koumoutsakos reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation evolutionary computation sun wierstra schaul and schmidhuber efficient natural evolution strategies in proceedings of the annual conference on genetic and evolutionary computation gecco stulp and sigaud path integral policy improvement with covariance matrix adaptation in international conference on machine learning icml felder and schmidhuber exploration for policy gradient methods in proceedings of the european conference on machine learning ecml furmston and barber unifying perspective of parametric policy search methods for markov decision processes in neural information processing systems nips loshchilov schoenauer and sebag intensive surrogate model exploitation in selfadaptive in gecco mannor rubinstein and gat the cross entropy method for fast policy search in proceedings of the international conference on machine learning icml theodorou buchli and schaal generalized path integral control approach to reinforcement learning the journal of machine learning research kupcsik deisenroth peters and neumann contextual policy search for robot movement skills in proceedings of the national conference on artificial intelligence aaai peters and altun relative entropy policy search in proceedings of the national conference on artificial intelligence aaai aaai press wierstra schaul glasmachers sun peters and schmidhuber natural evolution strategies journal of machine learning research amari natural gradient works efficiently in learning neural computation loshchilov schoenauer and sebag control of the learning schedule for surrogate optimization corr powell the newuoa software for unconstrained optimization without derivatives in report damtp university of cambridge powell the bobyqa algorithm for bound constrained optimization without derivatives in report damtp university of cambridge boyd and vandenberghe convex optimization cambridge university press jolliffe principal component analysis springer verlag mehmet bayesian supervised dimensionality reduction ieee cybernetics murray adams and mackay elliptical slice sampling jmlr cp kober and peters policy search for motor primitives in robotics machine learning pages molga and smutnicki test functions for optimization needs http in ijspeert and schaal learning attractor landscapes for learning motor primitives in advances in neural information processing systems nips 
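To make the closed-form search-distribution update from the analytic-solution section above concrete, here is a minimal NumPy sketch. It assumes a quadratic surrogate of the form R(theta) ~ theta^T R theta + theta^T r, an old Gaussian search distribution N(b, Q), and fixed multipliers eta and omega; in the actual algorithm eta and omega are obtained by minimizing the dual function, which this sketch omits. The function name and the toy objective are our own assumptions.

```python
import numpy as np

def more_update(b, Q, R, r, eta, omega):
    """Sketch of the MORE update for a quadratic surrogate and old policy N(b, Q),
    with the Lagrange multipliers eta (KL bound) and omega (entropy bound) given.
    Returns the new Gaussian N(F f, F (eta + omega))."""
    Q_inv = np.linalg.inv(Q)
    F = np.linalg.inv(eta * Q_inv - 2.0 * R)   # must be positive definite
    f = eta * Q_inv @ b + r
    return F @ f, F * (eta + omega)

# Toy illustration on a 2-D quadratic objective R(theta) = -theta^T theta
# (maximum at 0): the mean moves toward the optimum and the covariance shrinks.
b = np.array([2.0, -1.0])
Q = np.eye(2)
R_mat = -np.eye(2)     # surrogate curvature
r_vec = np.zeros(2)
for _ in range(5):
    b, Q = more_update(b, Q, R_mat, r_vec, eta=10.0, omega=1.0)
print("mean after 5 updates:", np.round(b, 3))
```

Because eta is held fixed here, each step makes a conservative move toward the surrogate optimum while keeping the covariance from collapsing, which is the qualitative behavior the KL and entropy bounds are meant to enforce.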
learning with ladder networks antti rasmus and harri valpola the curious ai company finland mikko honkala nokia labs finland mathias berglund and tapani raiko aalto university finland the curious ai company finland abstract we combine supervised learning with unsupervised learning in deep neural networks the proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation avoiding the need for our work builds on top of the ladder network proposed by valpola which we extend by combining the model with supervision we show that the resulting model reaches performance in mnist and classification in addition to permutationinvariant mnist classification with all labels introduction in this paper we introduce an unsupervised learning method that fits well with supervised learning combining an auxiliary task to help train neural network was proposed by suddarth and kergosien there are multiple choices for the unsupervised task for example reconstruction of the inputs at every level of the model or classification of each input sample into its own class although some methods have been able to simultaneously apply both supervised and unsupervised learning often these unsupervised auxiliary tasks are only applied as followed by normal supervised learning in complex tasks there is often much more structure in the inputs than can be represented and unsupervised learning can not by definition know what will be useful for the task at hand consider for instance the autoencoder approach applied to natural images an auxiliary decoder network tries to reconstruct the original input from the internal representation the autoencoder will try to preserve all the details needed for reconstructing the image at pixel level even though classification is typically invariant to all kinds of transformations which do not preserve pixel values our approach follows valpola who proposed ladder network where the auxiliary task is to denoise representations at every level of the model the model structure is an autoencoder with skip connections from the encoder to decoder and the learning task is similar to that in denoising autoencoders but applied at every layer not just the inputs the skip connections relieve the pressure to represent details at the higher layers of the model because through the skip connections the decoder can recover any details discarded by the encoder previously the ladder network has only been demonstrated in unsupervised learning but we now combine it with supervised learning the key aspects of the approach are as follows compatibility with supervised methods the unsupervised part focuses on relevant details found by supervised learning furthermore it can be added to existing feedforward neural networks for example perceptrons mlps or convolutional neural networks cnns scalability due to local learning in addition to supervised learning target at the top layer the model has local unsupervised learning targets on every layer making it suitable for very deep neural networks we demonstrate this with two deep supervised network architectures computational efficiency the encoder part of the model corresponds to normal supervised learning adding decoder as proposed in this paper approximately triples the computation during training but not necessarily the training time since the same result can be achieved faster due to better utilization of available information overall computation per update scales similarly to whichever supervised learning 
approach is used with small multiplicative factor as explained in section the skip connections and unsupervised targets effectively turn autoencoders into hierarchical latent variable models which are known to be well suited for semisupervised learning indeed we obtain results in learning in the mnist permutation invariant mnist and classification tasks section however the improvements are not limited to settings for the permutation invariant mnist task we also achieve new record with the normal longer version of this paper with more complete descriptions please see derivation and justification latent variable models are an attractive approach to learning because they can combine supervised and unsupervised learning in principled way the only difference is whether the class labels are observed or not this approach was taken for instance by goodfellow et al with their deep boltzmann machine particularly attractive property of hierarchical latent variable models is that they can in general leave the details for the lower levels to represent allowing higher levels to focus on more invariant abstract features that turn out to be relevant for the task at hand the training process of latent variable models can typically be split into inference and learning that is finding the posterior probability of the unobserved latent variables and then updating the underlying probability model to better fit the observations for instance in the em algorithm the corresponds to finding the expectation of the latent variables over the posterior distribution assuming the model fixed and then maximizes the underlying probability model assuming the expectation fixed the main problem with latent variable models is how to make inference and learning efficient suppose there are layers of latent variables latent variable models often represent the probability distribution of all the variables explicitly as product of terms such as in directed graphical models the inference process and model updates are then derived from bayes rule typically as some kind of approximation often the inference is iterative as it is generally impossible to solve the resulting equations in closed form as function of the observed variables there is close connection between denoising and probabilistic modeling on the one hand given probabilistic model you can compute the optimal denoising say you want to reconstruct latent using prior and an observation noise we first compute the posterior distribution and use its center of gravity as the reconstruction one can show that this minimizes the expected denoising cost on the other hand given denoising function one can draw samples from the corresponding distribution by creating markov chain that alternates between corruption and denoising valpola proposed the ladder network where the inference process itself can be learned by using the principle of denoising which has been used in supervised learning denoising autoencoders dae and denoising source separation dss for complementary tasks in dae an autoencoder is trained to reconstruct the original observation from corrupted version learning is based simply on minimizing the norm of the difference of the original and its reconstruction from the corrupted that is the cost is while daes are normally only trained to denoise the observations the dss framework is based on the idea of using denoising functions of latent variables to train mapping which models the likelihood of the latent variables as function of the observations the cost function is 
identical to that used in dae except that latent variables replace the observations clean cd cd cd corrupted figure left depiction of an optimal denoising function for bimodal distribution the input for the function is the corrupted value axis and the target is the clean value axis the denoising function moves values towards higher probabilities as show by the green arrows right conceptual illustration of the ladder network when the feedforward path shares the mappings with the corrupted feedforward path or encoder the decoder consists of denoising functions and has cost functions cd on each layer trying to minimize the difference between and the output of the encoder can also be trained to match available labels that is the cost is the only thing to keep in mind is that needs to be normalized somehow as otherwise the model has trivial solution at constant in dae this can not happen as the model can not change the input figure left depicts the optimal denoising function for bimodal distribution which could be the distribution of latent variable inside larger model the shape of the denoising function depends on the distribution of and the properties of the corruption noise with no noise at all the optimal denoising function would be the identity function in general the denoising function pushes the values towards higher probabilities as shown by the green arrows figure right shows the structure of the ladder network every layer contributes to the cost function term cd kz which trains the layers above both encoder and decoder to learn the denoising function which maps the corrupted onto the denoised estimate as the estimate incorporates all prior knowledge about the same cost function term also trains the encoder layers below to find cleaner features which better match the prior expectation since the cost function needs both the clean and corrupted during training the encoder is run twice clean pass for and corrupted pass for another feature which differentiates the ladder network from regular daes is that each layer has skip connection between the encoder and decoder this feature mimics the inference structure of latent variable models and makes it possible for the higher levels of the network to leave some of the details for lower levels to represent rasmus et al showed that such skip connections allow daes to focus on abstract invariant features on the higher levels making the ladder network good fit with supervised learning that can select which information is relevant for the task at hand one way to picture the ladder network is to consider it as collection of nested denoising autoencoders which share parts of the denoising machinery between each other from the viewpoint of the autoencoder at layer the representations on the higher layers can be treated as hidden neurons in other words there is no particular reason why produced by the decoder should resemble the corresponding representations produced by the encoder it is only the cost function cd that ties these together and forces the inference to proceed in reverse order in the decoder this sharing helps deep denoising autoencoder to learn the denoising process as it splits the task into meaningful of denoising intermediate representations algorithm calculation of the output and cost function of the ladder network require corrupted encoder and classifier noise for to do batchnorm noise activation end for clean encoder for denoising targets for to do zpre batchmean zpre batchstd zpre batchnorm zpre activation end for final 
classification decoder and denoising for to do if then batchnorm else batchnorm end if ui bn µi end for cost function for training if then log end if pl implementation of the model we implement the ladder network for fully connected mlp networks and for convolutional networks we used standard rectifier networks with batch normalization applied to each preactivation the feedforward pass of the full ladder network is listed in algorithm in the decoder we parametrize the denoising function such that it supports denoising of conditionally independent gaussian latent variables conditioned on the activations of the layer above the denoising function is therefore coupled into components gi ui µi ui ui µi ui where ui propagates information from by batchnorm the functions µi ui and ui are modeled as expressive nonlin earities µi ui sigmoid ui ui with the form of the nonlinearity similar for ters and ui the decoder has thus parameters compared to the two parame in the encoder it is worth noting that simple special case of the decoder is model where when this corresponds to denoising cost only on the top layer and means that most of the decoder can be omitted this model which we call the due to the shape of the graph is useful as it can easily be plugged into any feedforward network without decoder implementation further implementation details of the model can be found in the supplementary material or ref experiments we ran experiments both with the mnist and datasets where we attached the decoder both to mlp networks and to convolutional neural networks we also compared the performance of the simpler sec to the full ladder network with convolutional networks our focus was exclusively on learning we make claims neither about the optimality nor the statistical significance of the supervised baseline results we used the adam optimization algorithm the initial learning rate was and it was decreased linearly to zero during final annealing phase the minibatch size was the source code for all the experiments is available at https table collection of previously reported mnist test errors in the permutation invariant setting followed by the results with the ladder network svm standard deviation in parenthesis test error with of used labels embedding transductive svm from mtc atlasrbf dgn dbm dropout adversarial virtual adversarial baseline mlp bn gaussian noise ladder with only cost ladder only cost ladder full all mnist dataset for evaluating learning we randomly split the training samples into validation set and used samples as the training set from the training set we randomly chose or all labels for the supervised all the samples were used for the decoder which does not need the labels the validation set was used for evaluating the model structure and hyperparameters we also balanced the classes to ensure that no particular class was we repeated the training times varying the random seed for the splits after optimizing the hyperparameters we performed the final test runs using all the training samples with different random initializations of the weight matrices and data splits we trained all the models for epochs followed by epochs of annealing mlp useful test for general learning algorithms is the permutation invariant mnist classification task we chose the layer sizes of the baseline model to be the hyperparameters we tuned for each model are the noise level that is added to the inputs and to each layer and denoising cost multipliers we also ran the supervised baseline model with various noise levels 
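To make the decoder parametrization above concrete, the following is a minimal NumPy sketch of the per-layer denoising function g and of one layer's denoising cost term. The parameter names a1-a10, the dictionary layout, and the omission of the batch-normalization bookkeeping (the published cost compares against a normalized reconstruction) are simplifications for illustration, not the authors' code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def denoise(z_tilde, u, a):
    """Per-layer denoising function g(z_tilde, u) in the sigmoid-modulated form
    described above: z_hat = (z_tilde - mu(u)) * v(u) + mu(u), where mu and v
    are expressive elementwise nonlinearities of the top-down signal u."""
    mu = a['a1'] * sigmoid(a['a2'] * u + a['a3']) + a['a4'] * u + a['a5']
    v = a['a6'] * sigmoid(a['a7'] * u + a['a8']) + a['a9'] * u + a['a10']
    return (z_tilde - mu) * v + mu

def denoising_cost(z_clean, z_hat, lam):
    """One layer's unsupervised cost: a weighted squared error between the
    clean-encoder activations and the reconstruction (normalization omitted)."""
    return lam * np.mean(np.sum((z_clean - z_hat) ** 2, axis=1))

# illustrative usage: ten per-unit parameter vectors for a 50-unit layer
rng = np.random.default_rng(0)
units = 50
a = {'a%d' % i: rng.normal(size=units) for i in range(1, 11)}
z_tilde = rng.normal(size=(32, units))   # corrupted activations, batch of 32
u = rng.normal(size=(32, units))         # signal propagated from the layer above
z_hat = denoise(z_tilde, u, a)
print(denoising_cost(rng.normal(size=(32, units)), z_hat, lam=0.1))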
for models with just one cost multiplier we optimized them with search grid ladder networks with cost function on all layers have much larger search space and we explored it much more sparsely for the complete set of selected denoising cost multipliers and other hyperparameters please refer to the code the results presented in table show that the proposed method outperforms all the previously reported results encouraged by the good results we also tested with labels and got test error of the simple also performed surprisingly well particularly for labels with labels all models sometimes failed to converge properly with bottom level or full cost in ladder around of runs result in test error of over in order to be able to estimate the average test error reliably in the presence of such random outliers we ran instead of test runs with random initializations in all the experiments we were careful not to optimize any parameters hyperparameters or model choices based on the results on the test samples as is customary we used labeled validation samples even for those settings where we only used labeled samples for training obviously this is not something that could be done in real case with just labeled samples however mnist classification is such an easy task even in the permutation invariant case that labeled samples there correspond to far greater number of labeled samples in many other datasets table cnn results for mnist test error without data augmentation with of used labels embedcnn swwae baseline supervised only all convolutional networks we tested two convolutional networks for the general mnist classification task and focused on the case the first network was extension of the network tested in the permutation invariant case we turned the first fully connected layer into convolution with filters resulting in spatial map of features each of the spatial locations was processed independently by network with the same structure as in the previous section finally resulting in spatial map of features these were pooled with global meanpooling layer we used the same hyperparameters that were optimal for the permutation invariant task in table this model is referred to as with the second network which was inspired by from springenberg et al we only tested the the exact architecture of this network is detailed in the supplementary material or ref it is referred to as since it is smaller version of the network used for dataset the results in table confirm that even the single convolution on the bottom level improves the results over the fully connected network more convolutions improve the significantly although the variance is still high the ladder network with denoising targets on every level converges much more reliably taken together these results suggest that combining the generalization ability of convolutional and efficient unsupervised learning of the full ladder network would have resulted in even better performance but this was left for future work convolutional networks on the dataset consists of small rgb images from classes there are labeled samples for training and for testing we decided to test the simple with the convolutional architecture by springenberg et al the main differences to are the use of gaussian noise instead of dropout and the convolutional perchannel batch normalization following ioffe and szegedy for more detailed description of the model please refer to model in the supplementary material the hyperparameters noise level denoising cost multipliers and number of 
epochs for all models were optimized using samples for training and the remaining samples for validation after the best hyperparameters were selected the final model was trained with these settings on all the samples all experiments were run with with different random initializations of the weight matrices and data splits we applied global contrast normalization and whitening following goodfellow et al but no data augmentation was used the results are shown in table the supervised reference was obtained with model closer to the original in the sense that dropout rather than additive gaussian noise was used for we spent some time in tuning the regularization of our fully supervised baseline model for labels and indeed its results exceed the previous state of the art this tuning was important to make sure that the improvement offered by the denoising target of the is in general convolutional networks excel in the mnist classification task the performance of the fully supervised with all labels is in line with the literature and is provided as rough reference only only one run no attempts to optimize not available in the code package same caveats hold for this fully supervised reference result for all labels as with mnist only one run no attempts to optimize not available in the code package table test results for cnn on dataset without data augmentation test error with of used labels sparse coding baseline supervised only all not sign of poorly regularized baseline model although the improvement is not as dramatic as with mnist experiments it came with very simple addition to standard supervised training related work early works in learning proposed an approach where inputs are first assigned to clusters and each cluster has its class label unlabeled data would affect the shapes and sizes of the clusters and thus alter the classification result label propagation methods estimate but adjust probabilistic labels based on the assumption that nearest neighbors are likely to have the same label weston et al explored deep versions of label propagation there is an interesting connection between our and the contractive cost used by rifai et al linear denoising function ai bi where ai and bi are parameters turns the denoising cost into stochastic estimate of the contractive cost in other words our seems to combine clustering and label propagation with regularization by contractive cost recently miyato et al achieved impressive results with regularization method that is similar to the idea of contractive cost they required the output of the network to change as little as possible close to the input samples as this requires no labels they were able to use unlabeled samples for regularization the deep boltzmann machine is way to train dbm with backpropagation through variational inference the targets of the inference include both supervised targets classification and unsupervised targets reconstruction of missing inputs that are used in training simultaneously the connections through the inference network are somewhat analogous to our lateral connections specifically there are inference paths from observed inputs to reconstructed inputs that do not go all the way up to the highest layers compared to our approach requires an iterative inference with some initialization for the hidden activations whereas in our case the inference is simple feedforward procedure kingma et al proposed deep generative models for learning based on variational autoencoders their models can be trained with the variational em 
algorithm stochastic gradient variational bayes or stochastic backpropagation compared with the ladder network an interesting point is that the variational autoencoder computes the posterior estimate of the latent variables with the encoder alone while the ladder network uses the decoder too to compute an implicit posterior approximate the encoder provides the likelihood part which gets combined with the prior zeiler et al train deep convolutional autoencoders in manner comparable to ours they define operations in the encoder to feed the max function upwards to the next layer while the argmax function is fed laterally to the decoder the network is trained one layer at time using cost function that includes reconstruction error and regularization term to promote sparsity zhao et al use similar structure and call it the stacked autoencoder swwae their network is trained simultaneously to minimize combination of the supervised cost and reconstruction errors on each level just like ours discussion we showed how simultaneous unsupervised learning task improves cnn and mlp networks reaching the in various learning tasks particularly the performance obtained with very small numbers of labels is much better than previous published results which shows that the method is capable of making good use of unsupervised learning however the same model also achieves results and significant improvement over the baseline model with full labels in permutation invariant mnist classification which suggests that the unsupervised task does not disturb supervised learning the proposed model is simple and easy to implement with many existing feedforward architectures as the training is based on backpropagation from simple cost function it is quick to train and the convergence is fast thanks to batch normalization not surprisingly the largest improvements in performance were observed in models which have large number of parameters relative to the number of available labeled samples with we started with model which was originally developed for fully supervised task this has the benefit of building on existing experience but it may well be that the best results will be obtained with models which have far more parameters than fully supervised approaches could handle an obvious future line of research will therefore be to study what kind of encoders and decoders are best suited for the ladder network in this work we made very little modifications to the encoders whose structure has been optimized for supervised learning and we designed the parametrization of the vertical mappings of the decoder to mirror the encoder the flow of information is just reversed there is nothing preventing the decoder to have different structure than the encoder an interesting future line of research will be the extension of the ladder networks to the temporal domain while there exist datasets with millions of labeled samples for still images it is prohibitively costly to label thousands of hours of video streams the ladder networks can be scaled up easily and therefore offer an attractive approach for learning in such problems acknowledgements we have received comments and help from number of colleagues who would all deserve to be mentioned but we wish to thank especially yann lecun diederik kingma aaron courville ian goodfellow søren sønderby jim fan and hugo larochelle for their helpful comments and suggestions the software for the simulations for this paper was based on theano and blocks we also acknowledge the computational resources 
provided by the aalto project the academy of finland has supported tapani raiko references harri valpola from neural pca to deep unsupervised learning in adv in independent component analysis and learning machines pages elsevier steven suddarth and yl kergosien hints as means of improving network performance and learning time in proceedings of the eurasip workshop on neural networks pages springer marc aurelio ranzato and martin szummer learning of compact document representations with deep networks in proc of icml pages acm alexey dosovitskiy jost tobias springenberg martin riedmiller and thomas brox discriminative unsupervised feature learning with convolutional neural networks in advances in neural information processing systems nips pages ian goodfellow mehdi mirza aaron courville and yoshua bengio deep boltzmann machines in advances in neural information processing systems nips pages geoffrey hinton and ruslan salakhutdinov reducing the dimensionality of data with neural networks science antti rasmus tapani raiko and harri valpola denoising autoencoder with modulated lateral connections learns invariant representations of natural images antti rasmus harri valpola mikko honkala mathias berglund and tapani raiko learning with ladder networks arxiv preprint yoshua bengio li yao guillaume alain and pascal vincent generalized denoising as generative models in advances in neural information processing systems nips pages jocelyn sietsma and robert jf dow creating artificial neural networks that generalize neural networks pascal vincent hugo larochelle isabelle lajoie yoshua bengio and manzagol stacked denoising autoencoders learning useful representations in deep network with local denoising criterion jmlr jaakko and harri valpola denoising source separation jmlr sergey ioffe and christian szegedy batch normalization accelerating deep network training by reducing internal covariate shift in international conference on machine learning icml pages diederik kingma and jimmy ba adam method for stochastic optimization in the international conference on learning representations iclr san diego jason weston ratle hossein mobahi and ronan collobert deep learning via embedding in neural networks tricks of the trade pages springer salah rifai yann dauphin pascal vincent yoshua bengio and xavier muller the manifold tangent classifier in advances in neural information processing systems nips pages lee the simple and efficient learning method for deep neural networks in workshop on challenges in representation learning icml nikolaos pitelis chris russell and lourdes agapito learning using an unsupervised atlas in machine learning and knowledge discovery in databases ecml pkdd pages springer diederik kingma shakir mohamed danilo jimenez rezende and max welling learning with deep generative models in advances in neural information processing systems nips pages nitish srivastava geoffrey hinton alex krizhevsky ilya sutskever and ruslan salakhutdinov dropout simple way to prevent neural networks from overfitting jmlr ian goodfellow jonathon shlens and christian szegedy explaining and harnessing adversarial examples in the international conference on learning representations iclr takeru miyato shin ichi maeda masanori koyama ken nakae and shin ishii distributional smoothing by virtual adversarial examples jost tobias springenberg alexey dosovitskiy thomas brox and martin riedmiller striving for simplicity the all convolutional net junbo zhao michael mathieu ross goroshin and yann lecun stacked sergey ioffe and 
christian szegedy batch normalization accelerating deep network training by reducing internal covariate shift ian goodfellow david mehdi mirza aaron courville and yoshua bengio maxout networks in proc of icml ian goodfellow yoshua bengio and aaron courville feature learning with sparse coding in proc of icml pages mclachlan iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis american statistical association titterington smith and makov statistical analysis of finite mixture distributions in wiley series in probability and mathematical statistics wiley martin szummer and tommi jaakkola partially labeled classification with markov random walks advances in neural information processing systems nips matthew zeiler graham taylor and rob fergus adaptive deconvolutional networks for mid and high level feature learning in iccv pages ieee bastien pascal lamblin razvan pascanu james bergstra ian goodfellow arnaud bergeron nicolas bouchard and yoshua bengio theano new features and speed improvements deep learning and unsupervised feature learning nips workshop bart van dzmitry bahdanau vincent dumoulin dmitriy serdyuk david jan chorowski and yoshua bengio blocks and fuel frameworks for deep learning corr url http 
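Because the algorithm listing earlier in this paper arrives garbled in this copy, the following self-contained sketch reconstructs the overall shape of the training objective it describes: a corrupted encoder pass used for the supervised cost, a clean encoder pass that provides denoising targets, and a top-down pass that adds a weighted denoising cost on every layer. The layer sizes, the stand-in denoiser (a fixed convex combination rather than the learned function g), and all variable names are illustrative assumptions, not the published implementation.

import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def batchnorm(x):
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-6)

# illustrative layer sizes and weights (not the architecture used in the paper)
sizes = [20, 10, 5]
W = [rng.normal(0.0, 0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def encoder_pass(x, noise_std):
    """One pass through the encoder; returns the per-layer activations z
    (with the possibly noisy input as layer 0) and the output probabilities."""
    h = x + noise_std * rng.normal(size=x.shape)
    zs = [h]
    for i, w in enumerate(W):
        z = batchnorm(h @ w) + noise_std * rng.normal(size=(x.shape[0], w.shape[1]))
        zs.append(z)
        h = relu(z) if i < len(W) - 1 else softmax(z)
    return zs, h

def toy_denoiser(z_tilde, u):
    """Stand-in for the learned denoising function g: a fixed convex
    combination of the corrupted activation and the top-down signal."""
    return 0.5 * z_tilde + 0.5 * u

def ladder_cost(x, y_onehot, noise_std=0.3, lambdas=(1000.0, 10.0, 0.1)):
    z_clean, _ = encoder_pass(x, noise_std=0.0)   # clean pass: denoising targets
    z_tilde, y_prob = encoder_pass(x, noise_std)  # corrupted pass
    # supervised part: cross-entropy on the corrupted output (labeled data only)
    cost = -np.mean(np.sum(y_onehot * np.log(y_prob + 1e-9), axis=1))
    # unsupervised part: denoising cost on every layer, weighted by lambdas
    u = z_tilde[-1]
    for l in range(len(z_tilde) - 1, -1, -1):
        z_hat = toy_denoiser(z_tilde[l], u)
        cost += lambdas[l] * np.mean(np.sum((z_clean[l] - z_hat) ** 2, axis=1))
        if l > 0:
            u = z_hat @ W[l - 1].T  # crude projection down to the layer below
    return cost

x = rng.normal(size=(8, 20))
y = np.eye(5)[rng.integers(0, 5, size=8)]
print(ladder_cost(x, y))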
embedding inference for structured multilabel prediction farzaneh mirzazadeh siamak ravanbakhsh university of alberta nan ding google dale schuurmans university of alberta mirzazad mravanba dingnan daes abstract key bottleneck in structured output prediction is the need for inference during training and testing usually requiring some form of dynamic programming rather than using approximate inference or tailoring specialized inference method for particular responses to the scaling we propose to embed prediction constraints directly into the learned representation by eliminating the need for explicit inference more scalable approach to structured output prediction can be achieved particularly at test time we demonstrate the idea for prediction under subsumption and mutual exclusion constraints where relationship to maximum margin structured output prediction can be established experiments demonstrate that the benefits of structured output training can still be realized even after inference has been eliminated introduction structured output prediction has been an important topic in machine learning many prediction problems involve complex structures such as predicting parse trees for sentences predicting sequence labellings for language and genomic data or predicting multilabel taggings for documents and images initial breakthroughs in this area arose from tractable discriminative training random fields and structured large margin training compare complete output configurations against given target structures rather than simply learning to predict each component in isolation more recently search based approaches that exploit sequential prediction methods have also proved effective for structured prediction despite these improvements the need to conduct inference or search over complex outputs both during the training and testing phase proves to be significant bottleneck in practice in this paper we investigate an alternative approach that eliminates the need for inference or search at test time the idea is to shift the burden of coordinating predictions to the training phase by embedding constraints in the learned representation that ensure prediction relationships are satisfied the primary benefit of this approach is that prediction cost can be significantly reduced without sacrificing the desired coordination of structured output components we demonstrate the proposed approach for the problem of multilabel classification with hierarchical and mutual exclusion constraints on output labels multilabel classification is an important subfield of structured output prediction where multiple labels must be assigned that respect semantic relationships such as subsumption mutual exclusion or weak forms of correlation the problem is of growing importance as larger tag sets are being used to annotate images and documents on the web research in this area can be distinguished by whether the relationships between labels are assumed to be known beforehand or whether such relationships need to be inferred during training in the latter case many works have developed tailored training losses for multilabel prediction that penalize joint prediction behavior without assuming any specific form of prior knowledge more recently several works have focused on coping with large label spaces by using low dimensional projections to label subspaces other work has focused on exploiting weak forms of prior knowledge expressed as similarity information between labels that can be obtained from auxiliary sources 
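As a small, concrete illustration of the two kinds of logical relationships this paper works with, and of the gap between plain labelwise decoding and prediction over the feasible set, here is a self-contained Python sketch. The labels, scores, constraint lists, and function names are invented for illustration; the brute-force search stands in for the dynamic-programming inference that the paper's constrained score model makes unnecessary.

from itertools import product

# illustrative label set and prior logical relationships (hypothetical example)
labels = ["cat", "siamese", "dog"]
implications = [("siamese", "cat")]   # siamese => cat (subsumption)
exclusions = [("cat", "dog")]         # cat and dog cannot both be true

def feasible(y, implications, exclusions):
    """Check whether a boolean labelling respects every implication and
    mutual-exclusion constraint."""
    ok_imp = all((not y[a]) or y[b] for a, b in implications)
    ok_exc = all(not (y[a] and y[b]) for a, b in exclusions)
    return ok_imp and ok_exc

def labelwise_prediction(scores):
    """Inference-free decoding: switch label k on iff its score is nonnegative."""
    return {k: s >= 0.0 for k, s in scores.items()}

def constrained_prediction(scores):
    """Brute-force structured prediction over the feasible set, shown for
    contrast; the paper instead constrains the score model so that labelwise
    decoding already lands inside the feasible set."""
    best, best_val = None, -float("inf")
    for bits in product([False, True], repeat=len(labels)):
        y = dict(zip(labels, bits))
        if not feasible(y, implications, exclusions):
            continue
        val = sum(s for k, s in scores.items() if y[k])
        if val > best_val:
            best, best_val = y, val
    return best

# a score vector whose naive labelwise decoding violates the implication
scores = {"cat": -0.2, "siamese": 0.8, "dog": 0.3}
print(labelwise_prediction(scores))   # siamese without cat: infeasible
print(constrained_prediction(scores))  # {'cat': True, 'siamese': True, 'dog': False}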
unfortunately none of these approaches strictly enforce prior logical relationships between label predictions by contrast other research has sought to exploit known prior relationships between labels the most prominent such approaches have been to exploit generative or conditional graphical model structures over the label set unfortunately the graphical model structures are either limited to junction trees with small treewidth or require approximation other work using output kernels has also been shown able to model complex relationships between labels but is hampered by an intractable problem at test time in this paper we focus on tractable methods and consider the scenario where set of logical label relationships is given priori in particular implication and mutual exclusion relationships these relationships have been the subject of extensive work on multilabel prediction where it is known that if the relationships form tree or directed acyclic graph then efficient dynamic programming algorithms can be developed for tractable inference during training and testing while for general pairwise models approximate inference is required our main contribution is to show how these relationships can be enforced without the need for dynamic programming the idea is to embed label relationships as constraints on the underlying score model during training so that trivial labelling algorithm can be employed at test time process that can be viewed as inference during the training phase the literature on multivariate prediction has considered many other topics not addressed by this paper including learning from incomplete labellings exploiting hierarchies and embeddings for multiclass prediction exploiting multimodal data deriving generalization bounds for structured and multilabel prediction problems and investigating the consistency of multilabel losses background we consider standard prediction model where score function with parameters is used to determine the prediction for given input via arg max here is configuration of assignments over set of components that might depend on since is combinatorial set can not usually be solved by enumeration some structure required for efficient prediction for example might decompose as yc over set of cliques that form junction tree where yc denotes the portion of covered by clique might also encode constraints to aid tractability such as forming consistent matching in bipartite graph or consistent parse tree the key practical requirement is that and allow an efficient solution to the operation of maximizing or summing over all is referred to as inference and usually involves dynamic program tailored to the specific structure encoded by and for supervised learning one attempts to infer useful score function given set of training pairs xt yt that specify the correct output associated with each input conditional random fields and structured large margin training below with margin scaling can both be expressed as optimizations over the score model parameters respectively min min log exp sθ xi sθ xi yi max yi sθ xi sθ xi yi where is regularizer over equations and suggest that inference over is required at each stage of training and testing however we show this is not necessarily the case multilabel prediction to demonstrate how inference might be avoided consider the special case of multilabel prediction with label constraints multilabel prediction specializes the previous set up by assuming is boolean assignment to fixed set of variables where and yi each label is 
assigned true or false as noted an extensive literature that has investigated various structural assumptions on the score function to enable tractable prediction for simplicity we adopt thepfactored form that has been reconsidered in recent work and originally yk this form allows to be simplified to arg max yk arg max yk sk where sk yk yk gives the decision function associated with label yk that is based on if the constraints in were ignored one would have the relationship yˆk sk the constraints in play an important role however it has been shown in that imposing prior implications and mutual exclusions as constraints in yields state of the art accuracy results for image tagging on the ilsvrc corpus this result was achieved in by developing novel and rather sophisticated dynamic program that can efficiently solve under these constraints here we show how such dynamic program can be eliminated embedding label constraints consider the two common forms of logical relationships between labels implication and mutual exclusion for implication one would like to enforce relationships of the form meaning that whenever the label is set to true then the label must also be set to true for mutual exclusion one would like to enforce relationships of the form meaning that at least one of the labels and must be set to false not both can be simultaneously true these constraints arise naturally in multilabel classification where label sets are increasingly large and embody semantic relationships between categories for example images can be tagged with labels dog cat and siamese where siamese implies cat while dog and cat are mutually exclusive but an image could depict neither these implication and mutual exclusion constraints constitute the hex constraints considered in our goal is to express the logical relationships between label assignments as constraints on the score function that hold universally over all in particular using the decomposed representation the desired label relationships correspond to the following constraints implication mutual exclusion or where we have introduced the additional margin quantity for subsequent large margin training score model the first key consideration is representing the score function in manner that allows the desired relationships to be expressed unfortunately the standard linear form hθ can not allow the needed constraints to be enforced over all without further restricting the form of the feature representation constraint we would like to avoid more specifically consider standard set up where there is mapping yk that produces feature representation for an pair yk for clarity we additionally make the standard assumption that the inputs and outputs each have independent feature representations hence yk ψk for an input feature map and label feature representation ψk in this case score function has the form sk aψk ψk for parameters unfortunately such score function does not allow sk in condition to be expressed over all without either assuming and or special structure in to overcome this restriction we consider more general scoring model that extends the standard form to form that is linear in the parameters but quadratic in the feature representations ψk ψk for here and sk is linear in for each the benefit of quadratic form in the features is that it allows constraints over to be easily imposed on label scores via convex constraints on lemma if then ku ψk for some and proof first expand obtaining aψk ψk qψk ψk since there must exist and such that where and simple 
substitution and rearrangement shows the claim the representation generalizes both standard and models the standard model is achieved by and by lemma the semidefinite assumption also yields model that has interpretation the feature representations and ψk are both mapped linearly into common euclidean space where the score is determined by the squared distance between the embedded vectors with an additional offset to aid the presentation below we simplify this model bit further set and observe that reduces to sk γk ψk ψk where γk ψk in particular we modify the parameterization to γk θp aq such that θp aq denotes the matrix of parameters in importantly remains linear in the new parameterization lemma can then be modified accordingly for similar convex constraint on lemma if θp aq then there exist and such that for all labels and sk ψk qψk ψk qψl ψl qψk ψl qψl γk ku ψk kv ψk ψl proof similar to lemma since θp aq there exist and such that θp aq where and expanding and substituting gives for note ψk qψk ψk qψl ψl qψk ψl qψl ψk ψl ψk ψl expanding gives ψk ψl ψk ψl ψk ψl ψk ψl kv ψk ψl this representation now allows us to embed the desired label relationships as simple convex constraints on the score model parameters embedding implication constraints theorem assume the score model and θp aq then for any and the implication constraint in is implied for all by proof first since θp aq we have the relationship which implies that there must exist vectors and such that therefore the constraints and can be equivalently as with respect to these vectors next let which exists by and observe that kµ kµ kµ consider two cases case in this case by the cauchy schwarz inequality we have which implies kµ by constraint but this implies hence kµ that therefore it does not matter what value has case in this case assume that kµ otherwise it does not matter what value has then from it follows that kµ kµ by constraint but this implies that hence the implication is enforced embedding mutual exclusion constraints theorem assume the score model and θp aq then for any the mutual exclusion constraint in is implied for all by proof as before since θp aq we have the relationship which implies that there must exist vectors and such that observe that the constraint can then be equivalently expressed as and observe that kµ using as before which exists by therefore kµ kµ to prove the inequality observe that since ka we must have ha bi kbk hence bi kak kbk ha bi ka bk which establishes bi ka bk the inequality then follows simply by setting and now combining with the constraint implies that kµ kµ therefore one of kµ or kµ must hold hence at least one of or must hold therefore the mutual exclusion is enforced importantly once θp aq is imposed the other constraints in theorems and are all linear in the parameters and properties we now establish that the above constraints on the parameters in achieve the desired properties in particular we show that given the constraints inference can be removed both from the prediction problem and from structured large margin training prediction equivalence first note that the decision of whether label yk is associated with can be determined by yk yk max yk sk arg max yk sk yk yk consider joint assignments yl and let denote the set of joint assignments that are consistent with set of implication and mutual exclusion constraints it is assumed the constraints are satisfiable that is is not the empty set then the optimal joint assignment for pl given can be specified by arg yk sk proposition if the 
constraint set imposes the constraints in and and is nonempty and the score function satisfies the corresponding constraints for some then max yk sk max yk sk yk proof first observe that max yk sk max yk sk max yk sk yk so making local classifications for each label gives an upper bound however if the score function satisfies the constraints then the concatenation of the local label decisions yl must be jointly feasible that is in particular for the implication the score constraint ensures that if implying arg then it must follow that hence implying arg similarly for the mutual exclusion the score constraint ensures min hence if implying arg then it must follow that implying arg and vice versa therefore since the maximizer of is feasible we actually have that the leftmost term in is equal to the rightmost since the feasible set embodies constraints over assignment vectors in interchanging maximization with summation is not normally justified however proposition establishes that if the score model also satisfies its respective constraints as established in the previous section then maximization and summation can be interchanged and inference over predicted labellings can be replaced by greedy componentwise labelling while preserving equivalence large margin structured output training given target joint assignment over labels tl and using the score model the standard structured output large margin training loss can then be written as max ti xi yk xi tik max ti yk tik sk xi using the simplified score function representation such that tik denotes the label of the training example if we furthermore make the standard assumption that ti decomposes as pl ti δk yk tik the loss can be simplified to max δk yk tik yk tik sk xi note also that since yk and tik the margin functions δk typically have the form δk δk and δk and δk for constants and which for simplicity we will assume are equal for all although label specific margins might be possible this is the same used in the constraints and the difficulty in computing this loss is that it apparently requires an exponential search over when this exponential search can be avoided it is normally avoided by developing dynamic program instead we can now see that the search over can be eliminated proposition if the score function satisfies the constraints in and for then max yk tik yk tik sk xi xx max yk tik yk tik sk xi yk proof for given and let fk tk tk sk hence yk arg fk it is easy to show that arg max fk sk tk tk which can be verified by checking the two cases tk and tk when tk we have fk and fk therefore yk arg fk iff similarly when tk we have fk and fk therefore yk arg fk iff combining these two conditions yields next we verify that if the score constraints hold then the logical constraints over are automatically satisfied even by locally assigning yk which implies the optimal joint assignment is feasible establishing the claim in particular for the implication it is assumed that in the target labeling and also that score constraints hold ensuring consider the cases over possible assignments to and if and then by assumption if and then by assumption tight case the case and can not happen by the assumption that if and then by assumption similarly for the mutual exclusion it is assumed that in the target labeling and also that the score constraints hold ensuring min consider the cases over possible assignments to and if and then and implies that and which contradicts the constraint that min tight case if and then and implies that and which contradicts 
the same constraint if and then and implies that and which again contradicts the same constraint the case and can not happen by the assumption that therefore since the concatenation of the independent maximizers of is feasible we have that the rightmost term in equals the leftmost similar to section proposition demonstrates that if the constraints and are satisfied by the score model then structured large margin training reduces to independent labelwise training under the standard hinge loss while preserving equivalence efficient implementation even though section achieves the primary goal of demonstrating how desired label relationships can be embedded as convex constraints on score model parameters the representation unfortunately does not allow convenient scaling the number of parameters in θp aq is accounting for symmetry which is quadratic in the number of features in and the number of labels such large optimization variable is not practical for most applications where and can be quite large the semidefinite constraint θp aq can also be costly in practice therefore to obtain scalable training we require some further refinement in our experiments below we obtained scalable training procudure by exploiting trace norm regularization on θp aq to reduce its rank the key benefit of trace norm regularization is that efficient solution methods exist that work with low rank factorization of the matrix variable while automatically ensuring positive semidefiniteness and still guaranteeing global optimality therefore we conducted the main optimization in terms of smaller matrix variable such that bb θp aq second to cope with the constraints we employed an augmented lagrangian method that increasingly penalizes constraint violations but otherwise allows simple unconstrained optimization all optimizations for smooth problems were performed using lbfgs and nonsmooth problems were solved using bundle method dataset enron wipo reuters features labels depth training testing reference table data set properties test error unconstrained constrained inference enron wipo reuters test time unconstrained constrained inference enron wipo reuters table left test set prediction error percent right test set prediction time experimental evaluation to evaluate the proposed approach we conducted experiments on multilabel text classification data that has natural hierarchy defined over the label set in particular we investigated three multilabel text classification data sets enron wipo and reuters obtained from https see table for details some preprocessing was performed on the label relations to ensure consistency with our assumptions in particular all implications were added to each instance to ensure consistency with the hierarchy while mutual exclusions were defined between siblings whenever this did not create contradiction we conducted experiments to compare the effects of replacing inference with the constraints outlined in section using the score model for comparison we trained using the structured large margin formulation and trained under multilabel prediction loss without inference but both including then excluding the constraints for the multilabel training loss we used the smoothed calibrated separation ranking loss proposed in in each case the regularization parameter was simply set to for inference we implemented the inference algorithm outlined in the results are given in table showing both the test set prediction error using labelwise prediction error hamming loss and the test prediction times as 
expected one can see benefits from incorporating known relationships between the labels when training predictor in each case the addition of constraints leads to significant improvement in test prediction error versus training without any constraints or inference added training with inference classical structured large margin training still proves to be an effective training method overall in one case improving the results over the constrained approach but in two other cases falling behind the key difference between the approach using constraints versus that using inference is in terms of the time it takes to produce predictions on test examples using inference to make test set predictions clearly takes significantly longer than applying labelwise predictions from either constrained or unconstrained model as shown in the right subtable of table conclusion we have demonstrated novel approach to structured multilabel prediction where inference is replaced with constraints on the score model on multilabel text classification data the proposed approach does appear to be able to achieve competitive generalization results while reducing the time needed to make predictions at test time in cases where logical relationships are known to hold between the labels using either inference or imposing constraints on the score model appear to yield benefits over generic training approaches that ignore the prior knowledge for future work we are investigating extensions of the proposed approach to more general structured output settings by combining the method with search based prediction methods other interesting questions include exploiting learned label relations and coping with missing labels references bakir hofmann smola taskar and vishwanathan predicting structured data mit press bi and kwok mandatory leaf node prediction in hierarchical multilabel classification in neural information processing systems nips usunier artieres and gallinari robust bloom filters for large multilabel classification tasks in proceedings of advances in neural information processing systems nips daume and langford structured prediction machine learning cheng and bayes optimal multilabel classification via probabilistic classifier chains in proceedings icml waegeman cheng and on label dependence and loss minimization in classification machine learning deng berg li and li what does classifying more than image categories tell us in proceedings of the european conference on computer vision eccv deng ding jia frome murphy bengio li neven and adam object classification using label relation graphs in proceedings eccv guo and schuurmans adaptive large margin training for multilabel classification in aaai haeffele vidal and young structured matrix factorization optimality algorithm and applications to image processing in international conference on machine learning icml hariharan vishwanathan and varma efficient classification with applications to learning machine learning jancsary nowozin and rother learning convex qp relaxations for structured prediction in proceedings of the international conference on machine learning icml joachims transductive inference for text classification using support vector machines in icml bach absil and sepulchre optimization on the cone of positive semidefinite matrices siam journal on optimization kadri ghavamzadeh and preux generalized kernel approach to structured output learning in proceedings of the international conference on machine learning icml kae sohn lee and augmenting crfs with boltzmann 
machine shape priors for image labeling in proceedings cvpr kapoor jain and vishwanathan multilabel classification using bayesian compressed sensing in proceedings of advances in neural information processing systems nips klimt and yang the enron corpus new dataset for email classification in ecml lafferty mccallum and pereira conditional random fields probabilistic models for segmenting and labeling sequence data in international conference on machine learning icml lewis yang rose and li new benchmark collection for text categorization research journal of machine learning research li wang wipf and tu model for structured prediction in proceedings of the international conference on machine learning icml lin ding hu and wang classification via implicit label space encoding in proceedings of the international conference on machine learning icml multiobjective proximal bundle method for nonconvex nonsmooth optimization fortran subroutine mpbngc technical report of mirzazadeh guo and schuurmans convex in proceedings aaai rousu saunders szedmak and learning of hierarchical multilabel classification models journal of machine learning research srikumar and manning learning distributed representations for structured output prediction in proceedings of advances in neural information processing systems nips sun structure regularization for structured prediction in proceedings nips taskar learning structured prediction models large margin approach phd thesis stanford tsochantaridis hofmann joachims and altun large margin methods for structured and interdependent output variables journal of machine learning research tsoumakas katakis and vlahavas mining data in data mining and knowledge discovery handbook edition springer weinberger and chapelle large margin taxonomy embedding for document categorization in neural information processing systems nips weston bengio and usunier wsabie scaling up to large vocabulary image annotation in international joint conference on artificial intelligence ijcai 
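To recap this paper's main construction in executable form, the sketch below implements the distance-plus-offset score model described in the score-model section, the inference-free labelwise prediction rule, and the sum of independent labelwise hinge losses that structured large-margin training reduces to once the score constraints hold. The dimensions, random parameters, and function names are illustrative assumptions; the actual constrained training (the semidefinite and trace-norm-regularized optimization with an augmented Lagrangian) is not shown.

import numpy as np

rng = np.random.default_rng(0)

d_feat, d_embed, n_labels = 15, 8, 4      # illustrative dimensions
U = rng.normal(size=(d_embed, d_feat))    # maps input features into the shared space
V = rng.normal(size=(n_labels, d_embed))  # one embedding vector per label
gamma = rng.normal(size=n_labels)         # per-label offsets

def scores(x):
    """Distance-based score model: s_k(x) = gamma_k - ||U x - v_k||^2,
    a squared distance in the shared embedding space plus an offset."""
    z = U @ x
    return gamma - np.sum((V - z) ** 2, axis=1)

def predict(x):
    """Test-time prediction with no inference: label k is on iff s_k(x) >= 0.
    When the training-time constraints on the parameters hold, this labelwise
    rule already yields a labelling that satisfies the given implication and
    mutual-exclusion constraints."""
    return (scores(x) >= 0.0).astype(int)

def labelwise_hinge(x, t, margin=1.0):
    """Training loss after inference has been eliminated: a sum of independent
    hinge losses, one per label, with targets t_k in {-1, +1}."""
    return np.sum(np.maximum(0.0, margin - t * scores(x)))

x = rng.normal(size=d_feat)
t = np.array([1, -1, 1, -1])
print(predict(x), labelwise_hinge(x, t))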
copula variational inference dustin tran harvard university david blei columbia university edoardo airoldi harvard university abstract we develop general variational inference method that preserves dependency among the latent variables our method uses copulas to augment the families of distributions used in and structured approximations copulas model the dependency that is not captured by the original variational distribution and thus the augmented variational family guarantees better approximations to the posterior with stochastic optimization inference on the augmented distribution is scalable furthermore our strategy is generic it can be applied to any inference procedure that currently uses the or structured approach copula variational inference has many advantages it reduces bias it is less sensitive to local optima it is less sensitive to hyperparameters and it helps characterize and interpret the dependency among the latent variables introduction variational inference is computationally efficient approach for approximating posterior distributions the idea is to specify tractable family of distributions of the latent variables and then to minimize the divergence from it to the posterior combined with stochastic optimization variational inference can scale complex statistical models to massive data sets both the computational complexity and accuracy of variational inference are controlled by the factorization of the variational family to keep optimization tractable most algorithms use the fullyfactorized family also known as the family where each latent variable is assumed independent less common structured methods slightly relax this assumption by preserving some of the original structure among the latent variables factorized distributions enable efficient variational inference but they sacrifice accuracy in the exact posterior many latent variables are dependent and methods by construction fail to capture this dependency to this end we develop copula variational inference copula vi copula vi augments the traditional variational distribution with copula which is flexible construction for learning dependencies in factorized distributions this strategy has many advantages over traditional vi it reduces bias it is less sensitive to local optima it is less sensitive to hyperparameters and it helps characterize and interpret the dependency among the latent variables variational inference has previously been restricted to either generic inference on simple dependency does not make significant writing variational updates copula vi widens its applicability providing generic inference that finds meaningful dependencies between latent variables in more detail our contributions are the following generalization of the original procedure in variational inference copula vi generalizes variational inference for and structured factorizations traditional vi corresponds to running only one step of our method it uses coordinate descent which monotonically decreases the kl divergence to the posterior by alternating between fitting the parameters and the copula parameters figure illustrates copula vi on toy example of fitting bivariate gaussian improving generic inference copula vi can be applied to any inference procedure that currently uses the or structured approach further because it does not require specific knowledge figure approximations to an elliptical gaussian the red is restricted to fitting independent gaussians which is the first step in our algorithm the second step blue fits copula which 
models the dependency more iterations alternate the third refits the meanfield green and the fourth refits the copula cyan demonstrating convergence to the true posterior of the model it falls into the framework of black box variational inference an investigator need only write down function to evaluate the model the rest of the algorithm calculations such as sampling and evaluating gradients can be placed in library richer variational approximations in experiments we demonstrate copula vi on the standard example of gaussian mixture models we found it consistently estimates the parameters reduces sensitivity to local optima and reduces sensitivity to hyperparameters we also examine how well copula vi captures dependencies on the latent space model copula vi outperforms competing methods and significantly improves upon the approximation background variational inference let be set of observations be latent variables and be the free parameters of variational distribution we aim to find the best approximation of the posterior using the variational distribution where the quality of the approximation is measured by kl divergence this is equivalent to maximizing the quantity eq log eq log is the evidence lower bound elbo or the variational free energy for simpler computation standard choice of the variational family is approximation qi zi λi where zd note this is strong independence assumption more sophisticated approaches known as structured variational inference attempt to restore some of the dependencies among the latent variables in this work we restore dependencies using copulas structured vi is typically tailored to individual models and is difficult to work with mathematically copulas learn general posterior dependencies during inference and they do not require the investigator to know such structure in advance further copulas can augment structured factorization in order to introduce dependencies that were not considered before thus it generalizes the procedure we next review copulas copulas we will augment the distribution with copula we consider the variational family zi zd figure example of vine which factorizes copula density of four random variables into product of pair copulas edges in the tree tj are the nodes of the lower level tree and each edge determines bivariate copula which is conditioned on all random variables that its two connected nodes share here zi is the marginal cumulative distribution function cdf of the random variable zi and is joint distribution of random the distribution is called copula of it is joint multivariate density of zd with uniform marginal distributions for any distribution factorization into product of marginal densities and copula always exists and integrates to one intuitively the copula captures the information about the multivariate random variable after eliminating the marginal information by applying the probability integral transform on each variable the copula captures only and all of the dependencies among the zi recall that for all random variables zi is uniform distributed thus the marginals of the copula give no information for example the bivariate gaussian copula is defined as φρ if are independent uniform distributed the inverse cdf of the standard normal transforms to independent normals the cdf φρ of the bivariate gaussian distribution with mean zero and pearson correlation squashes the transformed values back to the unit square thus the gaussian copula directly correlates and with the pearson correlation parameter vine copulas it is 
difficult to specify copula we must find family of distributions that is easy to compute with and able to express broad range of dependencies much work focuses on copulas such as the clayton gumbel frank and joe copulas however their multivariate extensions do not flexibly model dependencies in higher dimensions rather successful approach in recent literature has been by combining sets of conditional bivariate copulas the resulting joint is called vine vine factorizes copula density ud into product of conditional bivariate copulas also called pair copulas this makes it easy to specify copula one need only express the dependence for each pair of random variables conditioned on subset of the others figure is an example of vine which factorizes copula into the product of pair copulas the first tree has nodes representing the random variables respectively an edge corresponds to pair copula symbolizes edges in collapse into nodes in the next tree and edges in correspond to conditional bivariate copulas symbolizes this proceeds to the last nested tree where symbolizes we overload the notation for the marginal cdf to depend on the names of the argument though we occasionally use qi zi when more clarity is needed this is analogous to the standard convention of overloading the probability density function the vine structure specifies complete factorization of the multivariate copula and each pair copula can be of different family with its own set of parameters ih ih formally vine is nested set of trees with the following properties tree tj nj ej has nodes and edges edges in the th tree ej are the nodes in the th tree two nodes in tree are joined by an edge only if the corresponding edges in tree tj share node each edge in the nested set of trees specifies different pair copula and the product of all edges comprise of factorization of the copula density since there are total of edges factorizes ud as the product of pair copulas each edge tj has conditioning set which is set of variable indices we define to be the bivariate copula density for ui and uk given its conditioning set ui ui uj both the copula and the cdf in its arguments are conditional on vine specifies factorization of the copula which is product over all edges in the levels ud we highlight that depends on the set of all parameters to the pair copulas the vine construction provides us with the flexibility to model dependencies in high dimensions using decomposition of pair copulas which are easier to estimate as we shall see the construction also leads to efficient stochastic gradients by taking individual and thus easy gradients on each pair copula copula variational inference we now introduce copula variational inference copula vi our method for performing accurate and scalable variational inference for simplicity consider the factorization augmented with copula we later extend to structured factorizations the variational family is zi zd copula where denotes the parameters and the copula parameters with this family we maximize the augmented elbo eq log eq log copula vi alternates between two steps fix the copula parameters and solve for the parameters and fix the parameters and solve for the copula parameters this generalizes the approximation which is the special case of initializing the copula to be uniform and stopping after the first step we apply stochastic approximations for each step with gradients derived the next section we set the learning rate ρt to satisfy schedule ρt summary is outlined in algorithm this alternating set of 
optimizations falls in the class of methods which includes many procedures such as the em algorithm the alternating least squares algorithm and the iterative procedure for the generalized method of moments each step of copula vi monotonically increases the objective function and therefore better approximates the posterior distribution algorithm copula variational inference copula vi input data model variational family initialize randomly so that is uniform while change in elbo is above some threshold do fix maximize over set iteration counter while not converged do draw sample unif update ρt increment end fix maximize over set iteration counter while not converged do draw sample unif update ρt increment end end output variational parameters copula vi has the same generic input requirements as variational inference user need only specify the joint model in order to perform inference further copula variational inference easily extends to the case when the original variational family uses structured factorization by the vine construction one simply fixes pair copulas corresponding to dependence in the factorization to be the independence copula this enables the copula to only model dependence where it does not already exist throughout the optimization we will assume that the tree structure and copula families are given and fixed we note however that these can be learned in our study we learn the tree structure using sequential tree selection and learn the families among choice of bivariate families through bayesian model selection see supplement in preliminary studies we ve found that of the tree structure and copula families do not significantly change in future iterations stochastic gradients of the elbo to perform stochastic optimization we require stochastic gradients of the elbo with respect to both the and copula parameters the copula vi objective leads to efficient stochastic gradients and with low variance we first derive the gradient with respect to the parameters in general we can apply the score function estimator which leads to the gradient eq log log log we follow noisy unbiased estimates of this gradient by sampling from and evaluating the inner expression we apply this gradient for discrete latent variables when the latent variables are differentiable we use the reparameterization trick to take advantage of information from the model log specifically we rewrite the expectation in terms of random variable such that its distribution does not depend on the variational parameters and such that the latent variables are deterministic function of and the parameters following this reparameterization the gradients propagate inside the expectation es log log this estimator reduces the variance of the stochastic gradients furthermore with copula variational family this type of reparameterization using uniform random variable and deterministic function is always possible see the supplement the reparameterized gradient requires calculation of the terms log and for each the latter is tractable and derived in the supplement the former decomposes as log log zi λi zi λi log zd λd zi λi log zi λi zi λi zi λi log ck the summation in is over all pair copulas which contain zi λi as an argument in other words the gradient of latent variable zi is evaluated over both the marginal zi and all pair copulas which model correlation between zi and any other latent variable zj similar derivation holds for calculating terms in the score function estimator we now turn to the gradient with respect to the 
copula parameters we consider copulas which are differentiable with respect to their parameters this enables an efficient reparameterized gradient es log log the requirements are the same as for the parameters finally we note that the only requirement on the model is the gradient log this can be calculated using automatic differentiation tools thus copula vi can be implemented in library and applied without requiring any manual derivations from the user computational complexity in the vine factorization of the copula there are pair copulas where is the number of latent variables thus stochastic gradients of the parameters and copula parameters require complexity more generally one can apply low rank approximation to the copula by truncating the number of levels in the vine see figure this reduces the number of pair copulas to be kd for some and leads to computational complexity of kd using sequential tree selection for learning the vine structure the most correlated variables are at the highest level of the vines thus truncated low rank copula only forgets the weakest correlations this generalizes low rank gaussian approximations which also have kd complexity it is the special case when the distribution is the product of independent gaussians and each pair copula is gaussian copula related work preserving structure in variational inference was first studied by saul and jordan in the case of probabilistic neural networks it has been revisited recently for the case of conditionally conjugate exponential families our work differs from this line in that we learn the dependency structure during inference and thus we do not require explicit knowledge of the model further our augmentation strategy works more broadly to any posterior distribution and any factorized variational family and thus it generalizes these approaches a similar augmentation strategy is used by methods which apply a taylor series correction based on the difference between the posterior and its approximation recently giordano et al consider covariance correction from the estimates all these methods assume the approximation is reliable for the taylor series expansion to make sense which is not true in general and thus is not robust in black box framework our approach alternates the estimation of the and copula which we find empirically leads to more robust estimates than estimating them simultaneously and which is less sensitive to the quality of the approximation figure covariance estimates from copula variational inference copula vi mf and linear response variational bayes lrvb against the ground truth gibbs samples panels plot the estimated sd and all covariances for the lambda and mf methods against the gibbs standard deviation copula vi and lrvb effectively capture dependence while mf underestimates variance and forgets covariances experiments we study copula vi with two models gaussian mixtures and the latent space model the gaussian mixture is classical example of model for which it is difficult to capture posterior dependencies the latent space model is modern bayesian model for which the approximation gives poor estimates of the posterior and where modeling posterior dependencies is crucial for uncovering patterns in the data there are several implementation details of copula vi at each iteration we form stochastic gradient by generating samples from the variational distribution and taking the average gradient we set and follow asynchronous updates we set the using adam mixture of gaussians we follow the goal of giordano
et al which is to estimate the posterior covariance for gaussian mixture the hidden variables are of mixture proportions and set of multivariate normals µk each with unknown mean µk and precision matrix λk in mixture of gaussians the joint probability is µk xn zn µzn zn zn with dirichlet prior and prior µk we first apply the approximation mf which assigns independent factors to and we then perform copula vi over the distribution one which includes pair copulas over the latent variables we also compare our results to linear response variational bayes lrvb which is posthoc correction technique for covariance estimation in variational inference methods demonstrate similar behavior as lrvb comparisons to structured approximations are omitted as they require explicit factorizations and are not black box standard black box variational inference corresponds to the mf approximation we simulate samples with components and dimensional gaussians figure displays estimates for the standard deviations of for simulations and plots them against the ground truth using effective gibb samples the second plot displays all covariance estimates estimates for and indicate the same pattern and are given in the supplement when initializing at the true parameters both copula vi and lrvb achieve consistent estimates of the posterior variance mf underestimates the variance which is limitation note that because the mf estimates are initialized at the truth copula vi converges to the true posterior upon one step of fitting the copula it does not require alternating more steps variational inference methods predictive likelihood runtime lrvb copula vi steps copula vi steps copula vi converged min min min hr min hr table predictive likelihood on the latent space model each copula vi step either refits the meanfield or the copula copula vi converges in roughly steps and already significantly outperforms both and lrvb upon fitting the copula once steps copula vi is more robust than lrvb as toy demonstration we analyze the mnist data set of handwritten digits using training examples and test examples of and we perform unsupervised classification classify without using training labels we apply mixture of gaussians to cluster and then classify digit based on its membership assignment copula vi reports test set error rate of whereas lrvb ranges between and depending on the estimates lrvb and similar higher order methods correct an existing mf is thus sensitive to local optima and the general quality of that solution on the other hand copula vi both the mf and copula parameters as it fits making it more robust to initialization latent space model we next study inference on the latent space model bernoulli latent factor model for network analysis each node in an network is associated with latent variable edges between pairs of nodes are observed with high probability if the nodes are close to each other in the latent space formally an edge for each pair is observed with probability logit zj where is model parameter we generate an node network with latent node attributes from dimensional gaussian we learn the posterior of the latent attributes in order to predict the likelihood of edges mf applies independent factors on and lrvb applies correction and copula vi uses the fully dependent variational distribution table displays the likelihood of edges and runtime we also attempted hamiltonian monte carlo but it did not converge after five hours copula vi dominates other methods in accuracy upon convergence and the copula estimation 
without refitting steps already dominates lrvb in both runtime and accuracy we note however that lrvb requires one to invert matrix we can better scale the method and achieve faster estimates than copula vi if we applied stochastic approximations for the inversion however copula vi always outperforms lrvb and is still fast on this node network conclusion we developed copula variational inference copula vi copula vi is new variational inference algorithm that augments the variational distribution with copula it captures posterior dependencies among the latent variables we derived scalable and generic algorithm for performing inference with this expressive variational distribution we found that copula vi significantly reduces the bias of the approximation better estimates the posterior variance and is more accurate than other forms of capturing posterior dependency in variational approximations acknowledgments we thank luke bornn robin gong and alp kucukelbir for their insightful comments this work is supported by nsf onr darpa facebook adobe amazon and the john templeton foundation references dempster laird and rubin maximum likelihood from incomplete data via the em algorithm journal of the royal statistical society series dissmann brechmann czado and kurowicka selecting and estimating regular vine copulae and application to financial returns arxiv preprint fréchet les tableaux dont les marges sont données trabajos de estadística genest gerber goovaerts and laeven editorial to the special issue on modeling and measurement of multivariate risk in insurance and finance insurance mathematics and economics giordano broderick and jordan linear response methods for accurate covariance estimates from mean field variational bayes in neural information processing systems gruber and czado sequential bayesian model selection of regular vine copulas international society for bayesian analysis hoff raftery and handcock latent space approaches to social network analysis journal of the american statistical association hoffman and blei structured stochastic variational inference in artificial intelligence and statistics hoffman blei wang and paisley stochastic variational inference journal of machine learning research joe families of distributions with given margins and bivariate dependence parameters pages institute of mathematical statistics kappen and wiegerinck second order approximations for probability models in neural information processing systems kingma and ba adam method for stochastic optimization in international conference on learning representations kurowicka and cooke uncertainty analysis with high dimensional dependence modelling wiley new york nelsen an introduction to copulas springer series in statistics new york ranganath gerrish and blei black box variational inference in artificial intelligence and statistics pages recht re wright and niu hogwild approach to parallelizing stochastic gradient descent in advances in neural information processing systems pages rezende mohamed and wierstra stochastic backpropagation and approximate inference in deep generative models in international conference on machine learning robbins and monro stochastic approximation method the annals of mathematical statistics saul and jordan exploiting tractable substructures in intractable networks in neural information processing systems pages seeger gaussian covariance and scalable variational inference in international conference on machine learning sklar fonstions de répartition dimensions et leurs marges 
publications de institut de statistique de université de paris stan development team stan library for probability and sampling version toulis and airoldi implicit stochastic gradient descent arxiv preprint tran toulis and airoldi stochastic gradient descent methods for estimation with large data sets arxiv preprint wainwright and jordan graphical models exponential families and variational inference foundations and trends in machine learning 
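As a concrete illustration of the copula-augmented variational family and the uniform-based reparameterization discussed above, here is a minimal sketch. It is not the authors' implementation; it assumes numpy and scipy are available and uses a single bivariate Gaussian pair copula with Gaussian marginals as a stand-in for a full vine.

```python
# Minimal sketch (illustrative only): a two-variable mean-field Gaussian family
# augmented with a bivariate Gaussian pair copula,
#   q(z) = q1(z1) q2(z2) c(Q1(z1), Q2(z2); rho),
# plus a reparameterized draw z_i = Q_i^{-1}(u_i) with (u1, u2) from the copula.
import numpy as np
from scipy.stats import norm, multivariate_normal

def copula_vi_density(z, mu, sigma, rho):
    """Evaluate the copula-augmented density at a point z (shape (2,))."""
    u = norm.cdf(z, loc=mu, scale=sigma)           # marginal CDFs Q_i(z_i)
    x = norm.ppf(u)                                # back to the standard-normal scale
    cov = np.array([[1.0, rho], [rho, 1.0]])
    copula = multivariate_normal([0.0, 0.0], cov).pdf(x) / np.prod(norm.pdf(x))
    return np.prod(norm.pdf(z, loc=mu, scale=sigma)) * copula

def copula_vi_sample(mu, sigma, rho, rng):
    """Reparameterized draw: eps ~ N(0, I) -> copula uniforms u -> marginal quantiles."""
    eps = rng.standard_normal(2)                   # noise distribution is parameter-free
    L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    u = norm.cdf(L @ eps)                          # uniforms with Gaussian-copula dependence
    return norm.ppf(u, loc=mu, scale=sigma)        # z_i = Q_i^{-1}(u_i)

rng = np.random.default_rng(0)
mu, sigma = np.zeros(2), np.ones(2)
print(copula_vi_density(np.array([0.3, -0.1]), mu, sigma, rho=0.6))
print(copula_vi_sample(mu, sigma, rho=0.6, rng=rng))
```

With rho = 0 the copula density is identically one and the family collapses to the usual mean-field approximation; in the full method each pair copula of the vine plays this role, and the copula parameters are fit by the alternating stochastic-gradient steps described in the algorithm.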
recursive training of convolutional networks for neuronal boundary detection kisuk lee aleksandar zlateski massachusetts institute of technology kisuklee zlateski ashwin vishwanathan sebastian seung princeton university ashwinv sseung abstract efforts to automate the reconstruction of neural circuits from electron microscopic em brain images are critical for the field of connectomics an important computation for reconstruction is the detection of neuronal boundaries images acquired by serial section em leading em technique are highly anisotropic with inferior quality along the third dimension for such images the maxpooling convolutional network has set the standard for performance at boundary detection here we achieve substantial gain in accuracy through three innovations following the trend towards deeper networks for object recognition we use much deeper network than previously employed for boundary detection second we incorporate as well as filters to enable computations that use context finally we adopt recursively trained architecture in which first network generates preliminary boundary map that is provided as input along with the original image to second network that generates final boundary map backpropagation training is accelerated by znn new implementation of convolutional networks that uses multicore cpu parallelism for speed our hybrid architecture could be more generally applicable to other types of anisotropic images including video and our recursive framework for any image labeling problem introduction neural circuits can be reconstructed by analyzing brain images from electron microscopy em image analysis has been accelerated by semiautomated systems that use computer vision to reduce the amount of human labor required however analysis of large image datasets is still laborious so it is critical to increase automation by improving the accuracy of computer vision algorithms variety of machine learning approaches have been explored for the reconstruction of neurons problem that can be formulated as image segmentation or boundary detection this paper focuses on neuronal boundary detection in images from serial section em the most widespread kind of em the technique starts by cutting and collecting ultrathin to nm sections of brain tissue image is acquired from each section and then the images are aligned the spatial resolution of the resulting image stack along the direction perpendicular to the cutting plane is set by the thickness of the sections this is generally much worse than the resolution that em yields in the xy plane in addition alignment errors may corrupt the image along the direction due to these issues with the direction of the image stack most existing analysis pipelines begin with processing and only later transition to the stages are neuronal boundary detection within each image segmentation of neuron cross sections within each image and reconstruction of individual neurons by linking across multiple images boundary detection in serial section em images is done by variety of algorithms many algorithms were compared in the isbi em segmentation challenge publicly available dataset and benchmark the winning submission was an ensemble of convolutional networks convnets created by idsia one of the convnet architectures shown in figure is the largest architecture from and serves as performance baseline for the research reported here we improve upon by adding several new elements fig increased depth our architecture is deeper than figure and borrows other 
nowstandard practices from the literature such as rectified linear units relus small filter sizes and multiple convolution layers between pooling layers already outperforms without any use of context is motivated by the principle the deeper the better which has become popular for convnets applied to object recognition as well as when human experts detect boundaries in em images they use context to disambiguate certain locations is also able to use context because it contains filters in its later layers convnets with filters were previously applied to block face em images block face em is another class of em techniques and produces nearly isotropic images unlike serial section em also contains filters in its earlier layers this novel hybrid use of and filters is suited for the highly anisotropic nature of serial section em images recursive training of convnets and are concatenated to create an extremely deep network the output of is preliminary boundary map which is provided as input to in addition to the original image fig based on these two inputs is trained to compute the final boundary map such recursive training has previously been applied to neural networks for boundary detection but not to convnets znn for deep learning very deep convnets with filters are computationally expensive so an efficient software implementation is critical we trained our networks with znn https which uses multicore cpu parallelism for speed znn is one of the few deep learning implementations that is for convnet input boundary prediction convnet input initialize with learned representations stage tanh tanh tanh input input input recursive input relu relu tanh tanh relu tanh stage tanh softmax relu relu output tanh relu tanh relu relu relu softmax output output softmax figure an overview of our proposed framework top and model architectures bottom the number of trainable parameters in each model is while we have applied the above elements to serial section em images they are likely to be generally useful for other types of images the hybrid use of and filters may be useful for video which can also be viewed as an anisotropic image previous convnets applied to video processing have used filters exclusively recursively trained convnets are potentially useful for any image labeling problem the approach is very similar to recurrent convnets which iterate the same convnet the recursive approach uses different convnets for the successive iterations the recursive approach has been justified in several ways in image labeling it is viewed as the sequential refinement of the posterior probability of pixel being assigned label given both an input image and recursive input from the previous step another viewpoint on recursive training is that statistical dependencies in label category space can be directly modeled from the recursive input from the neurobiological viewpoint using preliminary boundary map for an image to guide the computation of better boundary map for the image can be interpreted as employing or attentional mechanism we expect znn to have applications far beyond the one considered in this paper znn can train very large networks because cpus can access more memory than gpus task parallelism rather than the simd parallelism of gpus allows for efficient training of convnets with arbitrary topology capability automatically optimizes each layer by choosing between direct and convolution fft convolution may be more efficient for wider layers or larger filter size finally znn may incur less software development 
cost owing to the relative ease of the generalpurpose cpu programming model finally we applied our convnets to images from new serial section em dataset from the mouse piriform cortex this dataset is important to us because we are interested in conducting neuroscience research concerning this brain region even to those with no interest in piriform cortex the dataset could be useful for research on image segmentation algorithms therefore we make the annotated dataset publicly available http dataset and evaluation images of mouse piriform cortex the datasets described here were acquired from the piriform cortex of an adult mouse prepared with aldehyde fixation and reduced osmium staining the tissue was sectioned using the automatic tape collecting ultramicrotome atum and sections were imaged on zeiss field emission scanning electron microscope the images were assembled into stacks using custom matlab routines and and each stack was manually annotated using vast https figure then each stack was checked and corrected by another annotator the properties of the four image stacks are detailed in table it should be noted that image quality varies across the stacks due to aging of the field emission source in the microscope in all experiments we used for testing and for training and as an additional training data for recursive training figure example dataset table and results of each architecture on name resolution dimension samples usage table piriform cortex datasets test training training training extra pixel error we use softmax activation in the output layer of our networks to produce outputs between and each of which is interpreted as the probability of an output pixel being boundary or vice versa this boundary map can be thresholded to generate binary boundary map from which the classification error is computed we report the best classification error obtained by optimizing the binarization threshold with line search rand score we evaluate segmentation performance with the rand scoring system let nij denote the number of pixels simultaneously in the ith segment of the proposal segmentation and the th segment of the ground truth segmentation the rand merge score and the rand split score ij nij ij nij rand rand vmerge vsplit nij nij are closer to one when there are fewer merge and split errors respectively the rand is the rand rand and vsplit harmonic mean of vmerge to compute the rand scores we need to first obtain neuronal segmentation based on the realvalued boundary map to this end we apply two segmentation algorithms with different levels of sophistication simple thresholding followed by computing connected components and modified watershed algorithm we report the best rand obtained by optimizing parameters for each algorithm with line search as well as the curve for the rand scores training with znn znn was built for convnets convolution is regarded as special case of convolution in which one of the three filter dimensions has size for the details on how znn implements task parallelism on multicore cpus we refer interested readers to here we describe only aspects of znn that are helpful for understanding how it was used to implement the convnets of this paper dense output with maximum filtering in object recognition convnet is commonly applied to produce single output value for an entire input image however there are many applications in which dense output is required the convnet should produce an output image with the same resolution as the original input image such applications include 
boundary detection image labeling and object localization znn was built from the ground up for dense output and also for dense feature znn employs which slides window across the image and applies the maximum operation to the window figure is the dense variant of consequently all feature maps remain intact as dense volumes during both forward and backward passes making them straightforward for visualization and manipulation on the other hand all filtering operations are sparse in the sense that the sliding window samples sparsely from regularly spaced set of voxels in the image figure znn can control the of any filtering operation either convolution or znn can efficiently compute the dense output of sliding window convnet by making filter sparsity depend on the number of prior more specifically each feature maps with the same resolution as the original input image see figure for example note that the feature maps shown in figure keep the original resolution even after couple of layers convolution convolution convolution sparse convolution figure sliding window convnet left applied on three adjacent input windows producing three outputs equivalent outputs produced by convnet with sparse filters right applied on larger window computation is minimized by reusing the intermediate values for computing multiple outputs as color coded increases the sparsity of all subsequent filterings by factor equal to the size of the window this approach which we employ for the paper is also called or filter rarefaction and is equivalent in its results to note however that znn is more general as the sparsity of filters need not depend on but can be controlled independently output patch training training in znn is based on loss computed over dense output patch of arbitrary size the patch can be arbitrarily large limited only by memory this includes the case of patch that spans the entire image although large patch sizes reduce the computational cost per output pixel neighboring pixels in the patch may provide redundant information in practice we choose an intermediate output patch size network architecture as baseline for performance comparisons we adopted the largest convnet architecture named from et al figure the architecture of very deep is shown in figure multiple convolution layers are between each layer all convolution filters are except that uses filter to make the field of view or receptive field for single output pixel have an size and therefore centerable around the output pixel due to the use of smaller filters the number of trainable parameters in is roughly the same as in the shallower the architecture of very deep is initially identical to figure except that later convolution layers switch to filters this causes the number of trainable parameters to increase so we compensate by trimming the size of to just feature maps the filters in the later layers should enable the network to use context to detect neuronal boundaries the use of filters in the initial layers makes the network faster to run and train recursive training it is possible to apply by itself to boundary detection giving the raw image as the only input however we use recursive approach in which receives an extra input the output of as we will see below this produces significant improvement in performance it should be noted that instead of providing the recursive input directly to we added new layers dedicated to processing it this separate parallel processing stream for recursive input joins the main stream at allowing for more complex 
highly nonlinear interaction between the features and the contextual information in the recursive input these layers are identical to and training procedures networks were trained using backpropagation with the loss function we first trained and then trained the layers of were initialized using trained weights from this initialization meant that our recursive approach bore some similarity to recurrent convnets in which the first and second stage networks are constrained to be identical however we did not enforce exact weight sharing but the weights of output patch as mentioned earlier training with znn is done by dense output gradient update with loss during training an output patch of specified size is randomly drawn from the training stacks at the beginning of each forward pass class rebalancing in dense output training imbalance between the number of training samples in different classes can be handled by either sampling balanced number of pixels from an output patch or by differentially weighting the loss in our experiment we adopted the latter approach loss weighting to deal with the high imbalance between boundary and pixels data augmentation we used the same data augmentation method used in randomly rotating and flipping image patches hyperparameter we always used the fixed learning rate of with the momentum of when updating weights we divided the gradient by the total number of pixels in an output patch similar to the typical minibatch averaging we first trained with an output patch of size for gradient updates next we trained with output patches reflecting the increased size of model compared to after updates we evaluated the trained on the training stacks to obtain preliminary boundary maps and started training with output patches again reflecting the increased model complexity we trained for updates in this recursive training stage we additionally used to prevent from being overly dependent on the boundary maps for training stacks it should be noted that has slightly lower than other stacks table which we think is helpful in terms of learning representation our proposed recursive framework is different from the training of recurrent convnets in that recursive input is not dynamically produced by the first convnet during training but evaluated before and being fixed throughout the recursive training stage however it is also possible to further train the first convnet even after evaluating its preliminary output as recursive input to the second convnet we further trained for another updates while is being trained we report the final performance of after total of updates we also replaced the initial boundary map with the final one when evaluating results with znn it took two days to train both and for updates and three days to train for updates results in this section we show both quantitative and qualitative results obtained by the three architectures shown in figure namely and the classification error of each model on test set was and quantitative comparison figure compares the result of each architecture on test set both quantitatively and qualitatively the leftmost bar graph shows the best rand of each model obtained by segmentation with simpler connected component clustering and more sophisticated segmentation the middle and rightmost graphs show the curve of each model for the rand scores obtained with the connected component and segmentation respectively we observe that performs significantly better than and also outperforms by significant margin in terms of both best rand 
and overall curve qualitative comparison figure shows the visualization of boundary detection results of each model on test set along with the original em images and ground truth boundary map we observe that false detection of boundary on intracellular regions was significantly reduced which demonstrates the effectiveness of the proposed convnet combined with recursive approach figure quantitative top and qualitative middle and bottom evaluation of results the quantitative panels report the best rand score under connected component and watershed segmentation and plot rand merge score against rand split score the middle and bottom rows in figure show some example locations in test set where both models and failed to correctly detect the boundary or erroneously detected false boundaries whereas correctly predicted on those ambiguous locations visual analysis on the boundary detection results of each model again demonstrates the superior performance of the proposed recursively trained convnet over models discussion recursive framework our proposed recursive framework is greatly inspired by the work of chen et al in this work they examined the close interplay between neurons in the primary and higher visual cortical areas and respectively of monkeys performing contour detection tasks in this task monkeys were trained to detect global contour pattern that consists of multiple collinearly aligned bars in cluttered background the main discovery of their work is as follows initially neurons responded to the global contour pattern after short time delay ms the activity of neurons responding to each bar composing the global contour pattern was greatly enhanced whereas those responding to the background were largely suppressed despite the fact that those foreground and background neurons have similar response properties they referred to it as response mode of neurons between foreground and background which is attributable to the influence from the higher level neurons this process is also referred to as countercurrent disambiguating process this experimental result readily suggests mechanistic interpretation on the recursive training of deep convnets for neuronal boundary detection we can roughly think of responses as lower level feature maps in deep convnet and responses as higher level feature maps or output activations once the overall contour of neuronal boundaries is detected by the feedforward processing of this preliminary boundary map can then be recursively fed to this process figure visualization of the effect of recursive training left an example feature map from the lower layer in and its corresponding feature map in right an example feature map from the higher layer in and its corresponding feature map in note that recursive training greatly enhances the ratio of boundary representations can be thought of as corresponding to the initial detection of global contour patterns by and its influence on during recursive training will learn how to integrate the contextual information in the recursive input with the features presumably in such way that feature activations on the boundary location are enhanced whereas activations unrelated to the neuronal boundary intracellular space mitochondria etc are suppressed here the recursive input can also be viewed as the modulatory gate through which only the signals relevant to the given task of neuronal boundary detection can pass this is convincingly demonstrated by visualizing and comparing the feature maps of and in figure the
noisy representations of oriented boundary segments in first and third volumes are greatly enhanced in second and fourth volumes with signals near boundary being preserved or amplified and noises in the background being largely suppressed this is exactly what we expected from the proposed interpretation of our recursive framework potential of znn we have shown that znn can serve as viable alternative to the mainstream deep learning frameworks especially when processing volume data with convnets znn unique features including the large output training and the dense computation of feature maps can be further utilized for additional computations to better perform the given task in theory we can perform any kind of computation on the dense output prediction between each forward and backward passes for instance objective functions that consider topological constraints malis or sampling of topologically relevant locations led weighting can be applied to the dense output patch in addition to loss computation before each backward pass dense feature maps also enable the straighforward implementation of feature integration for segmentation long et al resorted to upsampling of the higher level features with lower resolution in order to integrate them with the lower level features with higher resolution since znn maintains every feature map at the original resolution of input it is straighforward enough to combine feature maps from any level removing the need for upsampling acknowledgments we thank juan tapia gloria choi and dan stettler for initial help with tissue handling and jeff lichtman and richard schalek with help in setting up tape collection kisuk lee was supported by samsung scholarship the recursive approach proposed in this paper was partially motivated by matthew greene preliminary experiments we are grateful for funding from the mathers foundation keating fund for innovation simons center for the social brain darpa and aro references takemura et al visual motion detection circuit suggested by drosophila connectomics nature helmstaedter briggman turaga jain seung and denk connectomic reconstruction of the inner plexiform layer in the mouse retina nature kim et al wiring specificity supports direction selectivity in the retina nature helmstaedter connectomics challenges of dense neural circuit reconstruction nature methods jain seung and turaga machines that learn to segment images crucial technology for connectomics current opinion in neurobiology tasdizen et al image segmentation for connectomics using machine learning in computational intelligence in biomedical imaging pp ed suzuki springer new york briggman and bock volume electron microscopy for neuronal circuit reconstruction current opinion in neurobiology jurrus et al detection of neuron membranes in electron microscopy images using serial neural network architecture medical image analysis liu jones seyedhosseini and tasdizen modular hierarchical approach to electron microscopy image segmentation journal of neuroscience methods segmentation of neuronal structures in em stacks challenge isbi http giusti gambardella and schmidhuber deep neural networks segment neuronal membranes in electron microscopy images in nips krizhevsky sutskever and hinton imagenet classification with deep convolutional neural networks in nips simonyan and zisserman very deep convolutional networks for image recognition in iclr turaga et al convolutional networks can learn to generate affinity graphs for image segmentation neural computation huang and jain 
deep and wide multiscale recursive networks for robust image labeling in iclr seyedhosseini and tasdizen series contextual model for image segmentation image processing ieee transactions on zlateski lee and seung znn fast and scalable algorithm for training convolutional networks on and shared memory machines tran bourdev fergus torresani and paluri learning spatiotemporal features with convolutional networks yao torabi cho ballas pal larochelle and courville describing videos by exploiting temporal structure pinheiro and collobert recurrent convolutional neural networks for scene labeling in icml tu and its application to vision tasks in cvpr mathieu henaff and lecun fast training of convolutional networks through ffts in iclr vasilache johnson mathieu chintala piantino and lecun fast convolutional nets with fbfft gpu performance evaluation in iclr tapia et al en bloc staining of neuronal tissue for field emission scanning electron microscopy nature protocols kasthuri et al saturated reconstruction of volume of neocortex cell hayworth et al imaging atum ultrathin section libraries with wafermapper approach to em reconstruction of neural circuits frontiers in neural circuits rand objective criteria for the evaluation of clustering methods journal of the american statistical association unnikrishnan pantofaru and hebert toward objective evaluation of image segmentation algorithms pattern analysis and machine intelligence ieee transactions on zlateski and seung image segmentation by single linkage clustering of watershed basin graph long shelhamer and darrell fully convolutional networks for semantic segmentation in cvpr sermanet eigen zhang mathieu fergus and lecun overfeat integrated recognition localization and detection using convolutional networks in iclr giusti masci gambardella and schmidhuber fast image scanning with deep convolutional neural networks in icip masci giusti fricout and schmidhuber fast learning algorithm for image segmentation with convolutional networks in icip chen yan gong gilbert liang and li incremental integration of global contours through interplay between visual cortical areas neuron turaga et al maximin affinity learning of image segmentation in nips 
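As a concrete reference for the Rand scoring used in the evaluation above, the following minimal numpy sketch computes the Rand merge score, Rand split score, and their harmonic mean. It is not the paper's evaluation code; the merge/split convention follows the usual one, with n_ij counting pixels shared by proposal segment i and ground-truth segment j.

```python
# Minimal sketch: Rand merge/split scores and their harmonic mean (Rand F-score),
#   V_merge = sum_ij n_ij^2 / sum_i (sum_j n_ij)^2,
#   V_split = sum_ij n_ij^2 / sum_j (sum_i n_ij)^2.
import numpy as np

def rand_scores(proposal, ground_truth):
    """proposal, ground_truth: integer label arrays of the same shape."""
    p, g = proposal.ravel(), ground_truth.ravel()
    n_ij = np.zeros((p.max() + 1, g.max() + 1))
    np.add.at(n_ij, (p, g), 1)                    # contingency counts n_ij
    sum_sq = (n_ij ** 2).sum()
    v_merge = sum_sq / (n_ij.sum(axis=1) ** 2).sum()   # penalizes merged segments
    v_split = sum_sq / (n_ij.sum(axis=0) ** 2).sum()   # penalizes split segments
    f_score = 2.0 / (1.0 / v_merge + 1.0 / v_split)
    return v_merge, v_split, f_score

labels = np.array([[0, 0, 1], [0, 1, 1]])
print(rand_scores(labels, labels))                # identical segmentations -> all ones
```

The hybrid 2D-3D filtering idea can likewise be sketched in a few lines. This is an illustrative stand-in only: it assumes PyTorch, whereas the paper's networks were trained with ZNN, and the layer widths and depth here are arbitrary rather than the paper's architecture.

```python
# Rough sketch of the 2D-then-3D filter idea for anisotropic serial-section stacks:
# early layers use 1x3x3 kernels that mix information only within each section,
# later layers use 3x3x3 kernels that add context along the z direction.
import torch
import torch.nn as nn

hybrid = nn.Sequential(
    nn.Conv3d(1, 24, kernel_size=(1, 3, 3), padding=(0, 1, 1)), nn.ReLU(),   # 2D stage
    nn.Conv3d(24, 24, kernel_size=(1, 3, 3), padding=(0, 1, 1)), nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),
    nn.Conv3d(24, 36, kernel_size=(3, 3, 3), padding=(1, 1, 1)), nn.ReLU(),   # 3D stage
    nn.Conv3d(36, 36, kernel_size=(3, 3, 3), padding=(1, 1, 1)), nn.ReLU(),
    nn.Conv3d(36, 1, kernel_size=1),                                          # boundary logit
)
boundary_logits = hybrid(torch.randn(1, 1, 8, 64, 64))   # (batch, channel, z, y, x)
print(boundary_logits.shape)
```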
block minimization framework for learning with limited memory ian yen lin lin university of texas at austin national taiwan university ianyen sdlin abstract in past few years several techniques have been proposed for training of linear support vector machine svm in setting where dual blockcoordinate descent method was used to balance cost spent on and computation in this paper we consider the more general setting of regularized empirical risk minimization erm when data can not fit into memory in particular we generalize the existing block minimization framework based on strong duality and augmented lagrangian technique to achieve global convergence for general convex erm the block minimization framework is flexible in the sense that given solver working under sufficient memory one can integrate it with the framework to obtain solver globally convergent under condition we conduct experiments on classification and regression problems to corroborate our convergence theory and compare the proposed framework to algorithms adopted from online and distributed settings which shows superiority of the proposed approach on data of size ten times larger than the memory capacity introduction nowadays data of huge scale are prevalent in many applications of statistical learning and data mining it has been argued that model performance can be boosted by increasing both number of samples and features and through crowdsourcing technology annotated samples of terabytes storage size can be generated as result the performance of model is no longer limited by the sample size but the amount of available computational resources in other words the data size can easily go beyond the size of physical memory of available machines under this setting most of learning algorithms become slow due to expensive from secondary storage device when it comes to data two settings are often considered online and distributed learning in the online setting each sample is processed only once without storage while in the distributed setting one has several machines that can jointly fit the data into memory however the real cases are often not as extreme as these two there are usually machines that can fit part of the data but not all of them in this setting an algorithm can only process block of data at time therefore balancing the time spent on and computation becomes the key issue although one can employ an learning algorithm in this setting it has been observed that online method requires large number of epoches to achieve comparable performance to batch method and at each epoch it spends most of time on instead of computation the situation for online method could become worse for problem of convex objective function where qualitatively slower convergence of online method is exhibited than that proved for problem like svm in the past few years several algorithms have been proposed to solve linear support vector machine svm in the limited memory setting these approaches are based on dual block coordinate descent algorithim which decomposes the original problem into series of block each of them requires only block of data loaded into memory the approach was proved linearly convergent to the global optimum and demonstrated fast convergence empirically however the convergence of the algorithm relies on the assumption of smooth dual problem which as we show does not hold generally for other regularized empirical risk minimizaton erm problem as result although the approach can be extended to the more general setting it is not globally 
convergent except for class of problems with in this paper we first show how to adapt the dual descnet method of to the general setting of regularized empirical risk mimization erm which subsumes most of supervised learning problems ranging from classification regression to ranking and recommendation then we discuss the convergence issue arises when the underlying erm is not primal proximal point or dual augmented lagrangian method is then proposed to address this issue which as we show results in block minimization algorithm with global convergence to optimum for convex regularized erm problems the framework is flexible in the sense that given solver working under condition it can be integrated into the block minimization framework to obtain solver globally convergent under condition we conduct experiments on classification and regression problems to corroborate our convergence theory which shows that the proposed simple technique changes the convergence behavior dramatically we also compare the proposed framework to algorithms adopted from online and distributed settings in particular we describe how to adapt distributed optimization framework alternating direction method of multiplier admm to the limitedmemory setting and show that although the adapted algorithm is effective it is not as efficient as the proposed framework specially designed for setting note our experiment does not adapt into comparison some recently proposed distributed learning algorithms cocoa etc that only apply to erm with or some other distributed method designed for some specific loss function problem setup in this work we consider the regularized empirical risk minimization problem which given data set φn estimates model through min ξn ln φn where rd is the model parameter to be estimated φn is by design matrix that encodes features of the data sample ln is convex loss function that penalizes the discrepancy between ground truth and prediction vector rp and is convex regularization term penalizing model complexity the formulation subsumes large class of statistical learning problems ranging from classification regression ranking and convex clustering for example in classification problem we have where ypconsists of the set of all possible labels and ln can be defined as the logistic loss ln log exp ξk ξyn as in logistic regression or the hinge loss ln δk yn ξk ξyn as used in support vector machine in regression problem the target variable consists of real values rk the prediction vector has dimensions and square loss ln is often used there are also variety of regularizers employed in different applications which includes the in ridge regression in lasso in matrix completion and family of structured group norms λkwkg although the specific form of ln does not affect the implementation of the training procedure two properties of the functions strong convexity and smoothness have key effects on the behavior of the block minimization algorithm definition strong convexity function is strongly convex iff it is lower bounded by simple quadratic function kx for some constant and dom definition smoothness function is smooth iff it is upper bounded by simple quadratic function kx for some constant and dom for instance the square loss and logistic loss are both smooth and strongly convex while the hingeloss satisfies neither of them on the other hand most of regularizers such as structured group norm and nuclear norm are neither smooth nor strongly convex except for the which satifies both in the following we will 
demonstrate the effects of these properties to block minimization algorithms throughout this paper we will assume that solver for that works in condition is given and our task is to design an algorithmic framework that integrates with the solver to efficiently solve when data can not fit into memory we will assume however that the parameter vector can be fit into memory dual block minimization in this section we extend the block minimization framework of from linear svm to the general setting of regularized erm dual of can be expressed as min αn αn φtn αn where is the convex conjugate of and αn is the convex conjugate of ln the block minimization algorithm of basically performs dual descent over by dividing the whole data set into blocks dbk and optimizing block of dual variables αbk at time where dbk φn and αbk αn bk in the dual problem is derived explicitly in order to perform the algorithm however for many regularizer such as and nuclear norm it is more efficient and convenient to solve in the primal therefore here instead of explicitly forming the dual problem we express it implicitly as min where is the lagrangian function of and maximize block of variables αbk from the primal instead of dual by strong duality max min min max αbk αbk αtbj with other dual variables αbj fixed the maximization of dual variables αbk in then enforces the primal equalities φn bk which results in the block minimization problem min ln µtt bk ξn bk the logistic loss is strongly convex when its input are within bounded range which is true as long as we have regularizer where µtbk have been dropped since they φn αn note that in variables are not relevant to the block of dual variables αbk and thus given the dimensional vector µtbk one can solve without accessing data φn outside the block bk throughout the pn algorithm we maintain vector µt φtn αtn and compute µtb via µtb µt φtn αtn in the beginning of solving each block subproblem since subproblem is of the same form to the original problem except for one additional linear augmented term µtbk one can adapt the solver of to solve easily by providing an augmented version of the gradient µtbk to the solver where denotes the function with augmented terms and denotes the function without augmented terms note the augmented term µtbk is constant and separable coordinates so it adds little overhead to the solver after obtaining solution from we can derive the corresponding optimal dual variables αbk for according to the kkt condition and maintain subsequently by ln bk µtbk φtn the procedure is summarized in algorithm which requires total memory capacity of the factor comes from the storage of µt wt factor comes from the storage of data block and the factor comes from the storage of αbk note this requires the same space complexity as that required in the original algorithm proposed for linear svm where for the binary classification setting block minimization the block minimization algorithm though can be applied to the general regularized erm problem it is not guaranteed that the sequence αt produced by algorithm converges to global optimum of in fact the global convergence of algorithm only happens for some special cases one sufficient condition for the global convergence of descent algorithm is that the terms in objective function that are not separable blocks must be smooth definition the dual objective function expressed using only comprises two terms pn pn φtn αn αn where second term is separable to αn and thus is also separable αbk while the first term couples 
variables αbk involving all the blocks as result if is smooth function according to definition then algorithm has global convergence to the optimum however the following theorem states this is true only when is strongly convex theorem duality assume is closed and convex then is smooth with parameter if and only if its convex conjugate is strongly convex with parameter proof of above theorem can be found in according to theorem the block minimization algorithm is not globally convergent if is not strongly convex which however is the case for most of regularizers other than the as discussed in section in this section we propose remedy to this problem which by lagrangian method or equivalently primal proximal point method creates dual objective function of desired property that iteratively approaches the original objective and results in fast global convergence of the approach algorithm dual block minimization split data into blocks bk initialize for do draw uniformly from load dbk and αtbk into memory compute µtbk from solve to obtain compute bk by maintain through save bk out of memory end for algorithm block minimization split data into blocks bk initialize for outer iteration do for do draw uniformly from load dbk αsbk into memory compute µsbk from solve to obtain compute bk by maintain through save bk out of memory end for αs end for algorithm the dual augmented lagrangian dal method or equivalently proximal point method modifies the original problem by introducing sequence of proximal maps arg min kw wt where denotes the erm problem under this simple modification instead of doing blockcoordinate descent in the dual of original problem we perform on the proximal subproblem as we show in next section the dual formulation of has the required property for global convergence of the dual bcd algorithm all terms involving more than one block of variables αbk are smooth given the current iterate wt the block minimization algorithm optimizes the dual of problem one block of variables αbk at time keeping others fixed αbj αbj max min min max αbk where is the lagrangian of αbk αtn φn kw wt once again the maximization αbk in enforces the equalities φn bk and thus leads to primal involving only data in block bk min ln µbk kw wt ξn φn bk where note that is almost the same as except that it has φn augmented term therefore one can follow the same procedure as in algorithm to pn maintain the vector φtn αn and computes µbk φtn µbk before solving each block subproblem after obtaining solution from we update dual variables αbk as ln bk and maintain subsequently as µbk φtn the is of similar form to the original erm problem since the augmented term is simple quadratic function separable each coordinate given solver for working in condition one can easily adapt it by modifying µtbk wt where denotes the function with augmented terms and denotes the function without augmented terms the block minimization procedure is repeated until every reaches tolerance then the proximal point method update is performed where is the solution of for the latest dual iterate the resulting algorithm is summarized in algorithm analysis in this section we analyze the convergence rate of algorithm to the optimum of first we show that the formulation has dual problem with desired property for the global convergence of descent in particular since the dual of takes the form minp αn φtn αn αn where is the convex conjugate of kw and since is strongly convex with parameter the convex conjugate is smooth with parameter ηt according to 
theorem therefore is in the composite form of convex smooth function plus convex function this type of function has been widely studied in the literature of descent in particular one can show that descent applied on has global convergence to optimum with fast rate by the following theorem theorem bcd convergence let the sequence αs be the iterates produced by block coordinate descent in the inner loop of algorithm and be the number of blocks denote the optimal value of then with probability as the dual objective function of and for some constant if ln is smooth or ii ln is polyhedral function and is also polyhedral or smooth otherwise for any convex ln we have αs for βk log αs for ck log for some constant note the above analysis in appendix does not assume exact solution of each block subproblem instead it only assumes each block minimization step leads to dual ascent amount proportional to that produced by single dual proximal gradient ascent step on the block of dual variables for the outer loop of primal or dual augmented lagrangian iterates we show the following convergence theorem theorem proximal point convergence let be objective of the regularized erm problem and maxv maxw kv wk be the radius of initial level set the sequence wt produced by the update with ηt has fopt for log for some constant if both ln and are strictly convex and smooth or ii polyhedral otherwise for any convex we have fopt the following theorem further shows that solving inexactly with tolerance suffices for convergence to overall precision where is the number of outer iterations required theorem inexact proximal map suppose for given dual iterate wt each is solved inexactly the solution has proxηt wt then let be the sequence of iterates produced by inexact proximal updates and as that generated by exact updates after iterations we have wt wt note for ln being strictly convex and smooth or polyhedral is of order log and thus it only requires log log overall number of block minimization steps to achieve suboptimality otherwise as long as ln is smooth for any convex regularizer is of order so it requires log log total number of block minimization steps practical issues solving inexactly while the analysis in section assumes exact solution of subproblems in practice the block minimization framework does not require solving subproblem exactly in our experiments it suffices for the fast convergence of update to solve subproblem for only single pass of all blocks of variables αbk and limit the number of iterations the designated solver spends on each subproblem to be no more than some parameter tmax random selection replacement in algorithm and the block to be optimized is chosen uniformly at random from which eases the analysis for proving better convergence rate however in practice to avoid unbalanced update frequency among blocks we do random sampling without replacement for both algorithm and that is for every iterations we generate random permutation πk of block index and optimize block subproblems according to the order πk this also eases the checking of stopping condition storage of dual variables both the algorithms and need to store the dual variables αbk into memory and them some secondary storage units which requires time linear to for some problems such as classification with large number of labels or structured prediction with large number of factors this can be very expensive in this situation one can instead maintain φn αn µbk directly note has and storage cost linear to which can be much smaller than 
in problem experiment in this section we compare the proposed dual augmented block minimization framework algorithm to the vanilla dual block coordinate descent algorithm and methods adopted from online and distributed learning the experiments are conducted on the problem of svm and the lasso regression problem in the setting with data size times larger than the available memory for both problems we use randomized coordinate descent method as the solver for solving and we set parameter ηt of for all experiments four public benchmark data sets are webspam for classification and for regression which can be obtained from the libsvm data set collections for and the features are generated from random fourier features that approximate the effect of gaussian rbf kernel table summarizes the data statistics the algorithms in comparison and their shorthands are listed below where all solvers are implemented in and run on machine with intel xeon cpu we constrained the process to use no more than of memory required to store the whole data onlinemd stochastic mirror descent method specially designed for problem proposed in with step size chosen from for best performance table data statistics summary of data statistics when stored using sparse format the last two columns specify memory consumption in mb of the whole data and that of block when data is split into partitions data train test dimension memory block webspam figure relative function value difference to the optimum and testing rmse accuracy on lasso top and bottom rmse best for for accuracy best for for webspam best for admm onlinemd admm onlinemd admm onlinemd admm onlinemd obj rmse rmse objective time time time admm onlinemd admm onlinemd time admm onlinemd admm onlinemd error obj obj error time time time time dual descent method algorithm block minimization algorithm admm admm for learning algorithm in admm that updates randomly chosen block of dual variables at time for learning algorithm in we use wall clock time that includes both and computation as measure for training time in all experiments in figure three measures are plotted versus the training time relative objective function difference to the optimum testing rmse and accuracy figure shows the results where as expected the dual block coordinate descent method without augmentation can not improve the objective after certain number of iterations however with extremely simple modification the block minimization algorithm becomes not only globally convergent but with rate several times faster than other approaches among all methods the convergence of online mirror descent smidas is significantly slower which is expected since the online mirror descent on convex function converges at rate qualitatively slower than the linear convergence rate of and admm and ii online method does not utilize the available memory capacity and thus spends unbalanced time on and computation for methods adopted from distributed optimization the experiment shows consistently but only slightly improves admm and both of them converge much slower than the approach presumably due to the conservative updates on the dual variables acknowledgement we thank to the support of telecommunication chunghwa telecom ltd via aoard via no ministry of science and technology national taiwan university and intel via the objective value obtained from fluctuates lot in figures we plot the lowest values achieved by from the beginning to time references boyd parikh chu peleato and eckstein distributed optimization and statistical 
learning via the alternating direction method of multipliers foundations and trends in machine learning chang and roth selective block minimization for faster convergence of limited memory linear models in sigkdd acm deng dong socher li li and fei imagenet hierarchical image database in cvpr hoffman on approximate solutions of systems of linear inequalities journal of research of the national bureau of standards hong and luo on the linear convergence of the alternating direction method of multipliers hsieh dhillon ravikumar becker and olsen quic dirty quadratic approximation approach for dirty statistical models in nips jaggi smith terhorst krishnan hofmann and jordan communicationefficient distributed dual coordinate ascent in nips joachims support vector method for multivariate performance measures in icml kakade and tewari applications of strong smoothness duality to learning with matrices corr ma smith jaggi jordan and adding averaging in distributed optimization icml obozinski jacob and vert group lasso with overlaps the latent group lasso approach arxiv preprint rahimi and recht random features for kernel machines in nips and iteration complexity of randomized descent methods for minimizing composite function mathematical programming singer srebro and cotter pegasos primal estimated solver for svm mathematical programming and tewari stochastic methods for loss minimization jmlr srebro sridharan and tewari on the universality of online mirror descent in nips tibshirani regression shrinkage and selection via the lasso journal of the royal statistical society tomioka suzuki and sugiyama convergence of dual augmented lagrangian algorithm for sparsity regularized estimation jmlr trofimov and genkin distributed coordinate descent for logistic regression arxiv preprint wang and lin iteration complexity of feasible descent methods for convex optimization jmlr yen chang lin and lin indexed block coordinate descent for linear classification with limited memory in sigkdd acm yen hsieh ravikumar and dhillon constant nullspace strong convexity and fast convergence of proximal methods under settings in nips yen lin lin ravikumar and dhillon sparse random feature algorithm as coordinate descent in hilbert space in nips yen lin zhong ravikumar and dhillon convex approach to madbayes dirichlet process mixture models in icml yen zhong hsieh ravikumar and dhillon sparse linear programming via primal and dual augmented coordinate descent in nips yu hsieh chang and lin large linear classification when data can not fit in memory sigkdd yuan chang hsieh and lin comparison of optimization methods and software for linear classification jmlr zhong yen dhillon and ravikumar proximal for computationally intensive in nips 
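For reference, the experiments above generate regression features with random Fourier features that approximate a Gaussian RBF kernel; below is a minimal sketch of that feature map, where the number of features and the bandwidth gamma are illustrative rather than the values used in the experiments.

```python
import numpy as np

def random_fourier_features(X, n_features=1000, gamma=1.0, seed=0):
    """Map rows of X to random Fourier features whose inner products
    approximate the Gaussian RBF kernel exp(-gamma * ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# usage (sizes and gamma are made up, not those of the benchmark data sets)
Z = random_fourier_features(np.random.default_rng(1).normal(size=(8, 3)), n_features=64)
print(Z.shape)  # (8, 64)
```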
optimal testing for properties of distributions jayadev acharya constantinos daskalakis gautam kamath eecs mit jayadev costis abstract given samples from an unknown discrete distribution is it possible to distinguish whether belongs to some class of distributions versus being far from every distribution in this fundamental question has received tremendous attention in statistics focusing primarily on asymptotic analysis as well as in information theory and theoretical computer science where the emphasis has been on small sample size and computational complexity nevertheless even for basic properties of discrete distributions such as monotonicity independence logconcavity unimodality and rate the optimal sample complexity is unknown we provide general approach via which we obtain and computationally efficient testers for all these distribution families at the core of our approach is an algorithm which solves the following problem given samples from an unknown distribution and known distribution are and close in or far in total variation distance the optimality of our testers is established by providing matching lower bounds up to constant factors finally necessary building block for our testers and an important byproduct of our work are the first known computationally efficient proper learners for discrete monotone hazard rate distributions introduction the quintessential scientific question is whether an unknown object has some property whether model from specific class fits the object observed behavior if the unknown object is probability distribution to which we have sample access we are typically asked to distinguish whether belongs to some class or whether it is sufficiently far from it this question has received tremendous attention in the field of statistics see where test statistics for important properties such as the ones we consider here have been proposed nevertheless the emphasis has been on asymptotic analysis characterizing the rates of convergence of test statistics under null hypotheses as the number of samples tends to infinity in contrast we wish to study the following problem in the small sample regime given family of distributions some and sample access to an unknown distribution over discrete support how many samples are required to distinguish between versus dtv the problem has been studied intensely in the literature on property testing and sublinear algorithms where the emphasis has been on characterizing the optimal tradeoff between support size and the accuracy in the number of samples several results have been obtained roughly we want success probability at least which can be boosted to times and taking the majority by repeating the test log clustering into three groups where is the class of monotone distributions over or more generally poset ii is the class of independent or independent distributions over hypergrid and iii contains and the problem becomes that of testing whether equals or is far from it with respect to iii exactly characterizes the number of samples required to test identity to each distribution providing single tester matching this bound simultaneously for all nevertheless this tester and its precursors are not applicable to the composite identity testing problem that we consider if our class were finite we could test against each element in the class albeit this would not necessarily be sample optimal if our class were continuum we would need tolerant identity testers which tend to be more expensive in terms of sample complexity and result in 
substantially suboptimal testers for the classes we consider or we could use approaches related to generalized likelihood ratio test but their behavior is not in our regime and optimizing likelihood over our classes becomes computationally intense our contributions we obtain and computationally efficient testers for for the most fundamental shape restrictions to distribution our contributions are the following for known distribution over and sample access to we show that distinguishing the cases whether the between and is at most versus the distance between and is at least requires samples as corollary we obtain an alternate argument that shows that identity testing requires samples previously shown in for the class mdn of monotone distributions over we require an optimal number of samples where prior work requires log samples for and nd poly for our results improve the exponent of with respect to shave all logarithmic factors in and improve the exponent of by at least factor of useful building block and interesting byproduct of our analysis is extending oblivious decomposition for monotone distributions to monotone distributions in and to the stronger notion of see section moreover we show that logd samples suffice to learn monotone distribution over in see lemma for the precise statement for the of product distributions over nd our algorithm requires samples we note that product distribution is one where all marginals are independent so this is equivalent to testing if collection of random variables are all independent in the case where are large then the first term dominates and the sample complexity is in particular when is constant and all are equal to we achieve the optimal sample complexity of to the best of our knowledge this is the first result for and when this improves the previously known complexity from polylog significantly improving the dependence on and shaving all logarithmic factors for the classes lcdn mhrn and un of and unimodal distributions over we require an optimal number of samples our testers for lcdn and mhrn are to our knowledge the first for these classes for the low sample regime we are and its references for statistics literature on the asymptotic regime our tester for un improves the dependence of the sample complexity on by at least factor of in the exponent and shaves all logarithmic factors in compared to testers based on testing monotonicity useful building block and important byproduct of our analysis are the first computationally efficient algorithms for properly learning and distributions to within in total variation distance from poly samples independent of the domain size see corollaries and again these are the first computationally efficient algorithms to our knowledge in the low sample regime provide algorithms for density estimation which are will approximate an unknown distribution from these classes with distribution that does not belong to these classes on the other hand the statistics literature focuses on estimation in the asymptotic and its references for all the above classes we obtain matching lower bounds showing that the sample complexity of our testers is optimal with respect to and when applicable see section our lower bounds are based on extending paninski lower bound for testing uniformity our techniques at the heart of our tester lies novel use of the statistic naturally the and its related statistic have been used in several of the results we propose new use of the statistic enabling our optimal sample complexity the essence of 
our approach is to first draw small number of samples independent of for and distributions and only logarithmic in for monotone and unimodal distributions to approximate the unknown distribution in distance if our learner is required to output distribution that is to in total variation and to in distance then some analysis reduces our testing problem to distinguishing the following cases and are in distance this case corresponds to and are in total variation distance this case corresponds to dtv we draw comparison with robust identity testing in which one must distinguish whether and are or in total variation distance for constants in valiant and valiant show that log samples are required for this problem sample complexity which may be prohibitively large in many settings in comparison the problem we study tests for closeness rather than total variation closeness relaxation of the previous problem however our tester demonstrates that this relaxation allows us to achieve substantially sublinear complexity of on the other hand this relaxation is still tight enough to be useful demonstrated by our application in obtaining testers we note that while the statistic for testing hypothesis is prevalent in statistics providing optimal error exponents in the regime to the best of our knowledge in the regime of the statistic have only been recently used for in and for testing uniformity of monotone distributions in in particular design an unbiased statistic for estimating the distance between two unknown distributions organization in section we show that version of the statistic appropriately excluding certain elements of the support is sufficiently to distinguish between the above cases moreover the sample complexity of our algorithm is optimal for most classes our base tester is combined with the extension of decomposition theorem to test monotone distributions in section see theorem and corollary and is also used to test independence of distributions in section see theorem in section we give our results on testing unimodal and monotone hazard rate distributions naturally there are several bells and whistles that we need to add to the above skeleton to accommodate all classes of distributions that we are considering in remark we mention the additional modifications for these classes related work for the problems that we study in this paper we have provided the related works in the previous section along with our contributions we can not do justice to the role of shape restrictions of probability distributions in probabilistic modeling and testing it suffices to say that the classes of distributions that we study are fundamental motivating extensive literature on their learning and testing in the recent times there has been work on shape restricted statistics pioneered by jon wellner and others study estimation of monotone and densities and study estimation of distributions due to the sheer volume of literature in statistics in this field we will restrict ourselves to those already referenced as we have mentioned statistics has focused on the asymptotic regime as the number of samples tends to infinity instead we are considering the low sample regime and are more stringent about the behavior of our testers requiring guarantees we want to accept if the unknown distribution is in our class of interest and also reject if it is far from the class for this problem as discussed above there are few results when is whole class of distributions closer related to our paper is the line of papers for 
monotonicity testing albeit these papers have sample complexity as discussed above testing independence of random variables has long history in statisics the theoretical computer science community has also considered the problem of testing independence of two random variables while our results sharpen the case where the variables are over domains of equal size they demonstrate an interesting asymmetric upper bound when this is not the case more recently acharya and daskalakis provide optimal testers for the family of poisson binomial distributions finally contemporaneous work of canonne et al provides generic algorithm and lower bounds for the families of distributions considered here we note that their algorithm has sample complexity which is suboptimal in both and while our algorithms are optimal their algorithm also extends to mixtures of these classes though some of these extensions are not computationally efficient they also provide framework for proving lower bounds giving the optimal bounds for many classes when is sufficiently large with respect to in comparison we provide these lower bounds unconditionally by modifying paninski construction to suit the classes we consider preliminaries we use the following probability distances in our paper def the total variation distance between distributions and is dtv supa def between and over is defined as pi qiqi the kp the kolmogorov distance between two probability measures and over an ordered set with def cumulative density functions fp and fq is dk fq our paper is primarily concerned with testing against classes of distributions defined formally as definition given and sample access to distribution an algorithm is said to test class if it has the following guarantees if the algorithm outputs accept with probability at least if dtv the algorithm outputs eject with probability at least we note the following useful relationships between these distances proposition dk dtv definition an support of distribution is any set such that the flattening of function over subset is the function such that definition let be distribution and support is partition of the domain the flattening of with respect to is the distribution which is the flattening of over the intervals poisson sampling throughout this paper we use the standard poissonization approach instead of drawing exactly samples from distribution we first draw poisson and then draw samples from as result the number of times different elements in the support of occur in the sample become independent giving much simpler analyses in particular the number of times we will observe domain element will be distributed as poisson mpi independently for each since poisson is tightly concentrated around this additional flexibility comes only at cost in the sample complexity with an inversely exponential in additive increase in the error probability the testing algorithm an overview our algorithm for testing class can be decomposed into three steps learning in our first step requires learning with very specific guarantees given sample access to we wish to output such that is close to in total variation distance and ii and are in on an effective of when we also require the algorithm to output description of an effective support for which this property holds this requirement can be slightly relaxed as we show in our results for testing unimodality is not in we do not guarantee anything about from an information theoretic standpoint this problem is harder than learning the distribution in total variation since 
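The three distances defined in the preliminaries can be computed directly for explicit distributions on a finite ordered support; a short sketch follows (function names are ours, and q is assumed strictly positive for the chi-square distance).

```python
import numpy as np

def total_variation(p, q):
    """d_TV(p, q) = (1/2) * sum_i |p_i - q_i|."""
    return 0.5 * np.abs(p - q).sum()

def chi_square(p, q):
    """Chi-square distance sum_i (p_i - q_i)^2 / q_i (q strictly positive)."""
    return ((p - q) ** 2 / q).sum()

def kolmogorov(p, q):
    """Kolmogorov distance: largest gap between the CDFs over the ordered support."""
    return np.abs(np.cumsum(p) - np.cumsum(q)).max()

# usage; by Cauchy-Schwarz, chi_square(p, q) >= (2 * total_variation(p, q))**2
p = np.array([0.5, 0.3, 0.2]); q = np.array([0.4, 0.4, 0.2])
print(total_variation(p, q), chi_square(p, q), kolmogorov(p, q))
```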
is more restrictive than total variation distance nonetheless for the structured classes we consider we are able to learn in by modifying the approaches to learn in total variation computation of distance to class the next step is to see if the hypothesis is close to the class or not since we have an explicit description of this step requires no further samples from it is purely computational if we find that is far from the class then it must be that as otherwise the guarantees from the previous step would imply that is close to thus if it is not we can terminate the algorithm at this point at this point the previous two steps guarantee that our distribution is such that if then and are close in distance on known effective support of if dtv then and are far in total variation distance we can distinguish between these two cases using samples with simple statistical test that we describe in section using the above approach our tester as described in the next section can directly test monotonicity and monotone hazard rate with an extra trick using kolmogorov max inequality it can also test unimodality robust identity test our main result in this section is theorem theorem given class of probability distributions sample access to distribution and an explicit description of distribution both over with the following properties property dtv property if then then there exists an algorithm such that if it outputs accept with probability at least if dtv it outputs eject with probability at least the time and sample complexity of this algorithm are proof algorithm describes testing procedure that gives the guarantee of the theorem algorithm testing algorithm input an explicit distribution poisson samples from distribution where ni denotes the number of occurrences of the ith domain element qi ni mqi ni mqi if return close else return far in section we compute the mean and variance of the statistic defined in algorithm as pi qi pi pi qi pa qa var qi qi where by pa and qa we denote respectively the vectors and restricted to the coordinates in and we slightly abuse notation when we write pa qa as these do not then correspond to probability distributions lemma demonstrates the separation in the means of the statistic in the two cases of interest versus dtv and lemma shows the separation in the variances in the two cases these two results are proved in section lemma if then if dtv then lemma let if then var if dtv var then assuming lemmas and theorem is now simple application of chebyshev inequality when we have that var thus chebyshev inequality gives pr pr pr var when dtv var pr pr pr therefore var this proves the correctness of algorithm for the running time we divide the summation in into the elements for which ni and ni when ni the contribution of the term to the summation is mqi and we can sum them up by subtracting the total probability of all elements appearing at least once from remark to apply theorem we need to learn distribution in and find that is in to for the class of monotone distributions we are able to efficiently obtain such which immediately implies learning algorithms for this class however for some classes we may not be able to learn with such strong guarantees and we must consider modifications to our base testing algorithm for example for and monotone hazard rate distributions we can obtain distribution and set with the following guarantees if then ps qs and if dtv then dtv in this scenario the tester will simply pretend that the support of and is ignoring any samples and support 
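A minimal sketch of the Poissonized chi-square test above: draw Poisson(m) samples, form Z = sum_i ((N_i - m q_i)^2 - N_i) / (m q_i) over the retained support, and accept when Z is small. The acceptance threshold and all numeric values below are placeholders; the paper's exact constants are not reproduced here.

```python
import numpy as np

def chi_square_statistic(counts, q, m, support=None):
    """Z = sum_{i in A} ((N_i - m*q_i)^2 - N_i) / (m*q_i): the bias-corrected
    chi-square statistic of the testing algorithm, summed over the index set A
    (all of [n] by default)."""
    idx = np.arange(len(q)) if support is None else np.asarray(support)
    n_i, q_i = counts[idx], q[idx]
    return float((((n_i - m * q_i) ** 2 - n_i) / (m * q_i)).sum())

def test_close_in_chi_square(p, q, m, threshold, rng):
    """Poissonized test sketch: sample Poisson(m) draws from the unknown p,
    compute Z against the explicit q, and accept iff Z <= threshold."""
    n_samples = rng.poisson(m)
    draws = rng.choice(len(p), size=n_samples, p=p)
    counts = np.bincount(draws, minlength=len(p))
    return chi_square_statistic(counts, q, m) <= threshold

# usage: q uniform over [100]; accept when p equals q, reject when p is far
n, m, eps = 100, 20000, 0.1
q = np.full(n, 1.0 / n)
far = q.copy(); far[:50] *= 1.5; far[50:] *= 0.5   # d_TV(far, q) = 0.25
thresh = m * eps ** 2 / 2.0                        # placeholder threshold
rng = np.random.default_rng(0)
print(test_close_in_chi_square(q, q, m, thresh, rng))    # expected: True
print(test_close_in_chi_square(far, q, m, thresh, rng))  # expected: False
```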
elements in analysis of this tester is extremely similar to theorem in particular we can still show that the statistic will be separated in the two cases when excluding will only reduce on the other hand when dtv since and must still be far on the remaining support and we can show that is still sufficiently large therefore small modification allows us to handle this case with the same sample complexity of for unimodal distributions we are even unable to identify large enough subset of the support where the approximation is guaranteed to be tight but we can show that there exists light enough piece of the support in terms of probability mass under that we can exclude to make the approximation tight given that we only use chebyshev inequality to prove the concentration of the test statistic it would seem that our lack of knowledge of the piece to exclude would involve union bound and corresponding increase in the required number of samples we avoid this through careful application of kolmogorov max inequality in our setting see theorem of section testing monotonicity as the first application of our testing framework we will demonstrate how to test for monotonicity let and id jd we say if il jl for distribution over is monotone decreasing if for all pi pj we follow the steps in the overview the learning result we show is as follows proved in section this definition describes monotone distributions by symmetry identical results hold for monotone distributions lemma let there is an algorithm that takes log samples from distribution over and outputs distribution such that if is monotone then with probability at least furthermore the distance of to monotone distributions can be computed in time poly this accomplishes the first two steps in the overview in particular if the distance of from monotone distributions is more than we declare that is not monotone therefore property in theorem is satisfied and the lemma states that property holds with probability at least we then proceed to the test at this point we have precisely the guarantees needed to apply theorem over directly implying our main result of this section theorem for any complexity there exists an algorithm for testing monotonicity over with sample log and time complexity poly log in particular this implies the following optimal algorithms for monotonicity testing for all corollary fix any and suppose log then there exists an algorithm for testing monotonicity over with sample complexity we note that the class of monotone distributions is the simplest of the classes we consider we now consider testing for monotone hazard rate and unimodality all of which are much more challenging to test in particular these classes require more sophisticated structural understanding more complex proper algorithms and modifications to our we have already given some details on the required adaptations to the tester in remark our algorithms for learning these classes use convex programming one of the main challenges is to enforce of the pdf when learning lcdn respectively of the cdf when learning mhrn while simultaneously enforcing closeness in total variation distance this involves careful choice of our variables and we exploit structural properties of the classes to ensure the soundness of particular taylor approximations we encourage the reader to refer to the proofs of theorems and for more details testing independence of random variables def let nd and let be the class of all product distributions over similar to learning monotone distributions in 
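The monotone-learning step above rests on flattening the distribution over a fixed partition of the domain into intervals; the sketch below implements that flattening, with a Birgé-style geometrically growing partition as an illustrative (assumed) choice of intervals rather than the exact decomposition used in the paper.

```python
import numpy as np

def oblivious_partition(n, eps):
    """Oblivious decomposition of [n]: interval lengths grow geometrically by a
    factor (1 + eps), independently of the distribution being learned."""
    intervals, start, length = [], 0, 1.0
    while start < n:
        end = min(n, start + int(np.ceil(length)))
        intervals.append((start, end))
        start, length = end, length * (1.0 + eps)
    return intervals

def flatten(p, intervals):
    """Flattening of p w.r.t. a partition: spread each interval's total mass
    uniformly over that interval (the operation defined in the preliminaries)."""
    q = np.empty_like(p)
    for (a, b) in intervals:
        q[a:b] = p[a:b].sum() / (b - a)
    return q

# usage: flatten a decreasing distribution over [50]
p = np.arange(50, 0, -1, dtype=float); p /= p.sum()
q = flatten(p, oblivious_partition(50, 0.1))
print(0.5 * np.abs(p - q).sum())  # small TV distance when p is monotone
```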
distance we prove the following result in section lemma there is an algorithm that takes samples from distribution and outputs such that if then with probability at least the distribution always satisfies property since it is in and by this lemma with probability at least satisfies property in theorem therefore we obtain the following result theorem for any there exists an algorithm testing independence of random variables qd pd over nd with sample and time complexity when and this improves the result of for testing independence of two random variables corollary testing if two distributions over are independent has sample complexity testing unimodality and monotone hazard rate unimodal distributions over denoted by un are all distributions for which there exists an such that pi is for and for distributions over denoted by lcdn is the of unimodal distributions for which pi monotone hazard rate mhr distributions over denoted by mhrn are distributions with cdf for which implies fifi jfj the following theorem bounds the complexity of testing these classes for moderate theorem suppose for each of the classes unimodal and mhr there exists an algorithm for testing the class over with sample complexity this result is corollary of the specific results for each class which is proved in the appendix in particular more complete statement for unimodality and rate with precise dependence on both and is given in theorems and respectively we mention some key points about each class and refer the reader to the respective appendix for further details testing unimodality using unionpbound argument one can use the results on testing monotonicity to give an algorithm with log samples however this is unsatisfactory since our lower bound and as we will demonstrate the true complexity of this problem is we overcome the logarithmic barrier introduced by the union bound by employing decomposition of the domain and using kolmogorov testing the key step is to design an algorithm to learn distribution in distance we formulate the problem as linear program in the logarithms of the distribution and show that using samples it is possible to output distribution that has distance at most from the underlying distribution testing monotone hazard rate for learning mhr distributions in distance we formulate linear program in the logarithms of the cdf and show that using log samples it is possible to output mhr distribution that has distance at most from the underlying mhr distribution lower bounds we now prove sharp lower bounds for the classes of distributions we consider we show that the example studied by paninski to prove lower bounds on testing uniformity can be used to prove lower bounds for the classes we consider they consider class consisting of distributions defined as follows without loss of generality assume that is even for each of the vectors define distribution over as follows for qi nz for each distribution in has total variation distance from un the uniform distribution over by choosing to be an appropriate constant paninski showed that apdistribution picked uniformly at random from can not be distinguished from un with fewer than samples with probability at least suppose is class of distributions such that the uniform distribution un is in ii for appropriately chosen dtv then testing is not easier than distinguishing un from invoking immediately implies that testing the class requires samples the lower bounds for all the one dimensional distributions will follow directly from this construction and for testing 
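The lower-bound construction described above can be generated directly: pair up the domain elements and perturb the two probabilities in each pair by plus or minus eps/n with a random sign. The constant multiplying eps is dropped here for simplicity, so the perturbation magnitude is an illustrative assumption.

```python
import numpy as np

def paninski_distribution(n, eps, rng):
    """One member of the Paninski family: each of the n/2 pairs gets
    probabilities (1 + eps*z)/n and (1 - eps*z)/n for a random sign z,
    giving total variation distance eps/2 from the uniform distribution."""
    assert n % 2 == 0
    signs = rng.choice([-1.0, 1.0], size=n // 2)
    p = np.empty(n)
    p[0::2] = (1.0 + eps * signs) / n
    p[1::2] = (1.0 - eps * signs) / n
    return p

# usage
rng = np.random.default_rng(0)
p = paninski_distribution(10, 0.4, rng)
print(p.sum(), 0.5 * np.abs(p - 1.0 / 10).sum())  # 1.0 and 0.2 (= eps / 2)
```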
monotonicity in higher dimensions we extend this construction to appropriately these arguments are proved in section leading to the following lower bounds for testing these classes theorem for any any algorithm for testing monotonicity over requires samples for testing independence over nd requires nd samples testing unimodality or monotone hazard rate over needs samples references fisher statistical methods for research workers edinburgh oliver and boyd lehmann and romano testing statistical hypotheses springer science business media fischer the art of uninformed decisions primer to property testing science rubinfeld algorithms in international congress of mathematicians canonne survey on distribution testing your data is big but is it blue eccc batu kumar and rubinfeld sublinear algorithms for testing monotone and unimodal distributions in proceedings of stoc bhattacharyya fischer rubinfeld and valiant testing monotonicity of distributions over general partial orders in ics pp batu fischer fortnow kumar rubinfeld and white testing random variables for independence and identity in proceedings of focs alon andoni kaufman matulef rubinfeld and xie testing and almost independence in proceedings of stoc paninski test for uniformity given very sparsely sampled discrete ieee transactions on information theory vol no huang and meyn generalized error exponents for small sample universal hypothesis testing ieee transactions on information theory vol no pp dec valiant and valiant estimating the unseen an log estimator for entropy and support size shown optimal via new clts in proceedings of stoc an automatic inequality prover and instance optimal identity testing in focs estimating density under order restrictions nonasymptotic minimax risk the annals of statistics vol no pp september levi ron and rubinfeld testing properties of collections of distributions theory of computing vol no pp hall and van keilegom testing for monotone increasing hazard rate annals of statistics pp chan diakonikolas servedio and sun learning mixtures of structured distributions over discrete domains in proceedings of soda cule and samworth theoretical properties of the maximum likelihood estimator of multidimensional density electronic journal of statistics vol pp acharya das jafarpour orlitsky pan and suresh competitive classification and closeness testing in colt pp chan diakonikolas valiant and valiant optimal algorithms for testing closeness of discrete distributions in soda pp bhattacharya and valiant testing closeness with unequal sized samples in nips acharya jafarpour orlitsky and theertha suresh competitive test for uniformity of monotone distributions in proceedings of aistats pp barlow bartholomew bremner and brunk statistical inference under order restrictions new york wiley jankowski and wellner estimation of discrete monotone density electronic journal of statistics vol pp balabdaoui and wellner estimation of density characterizations consistency and minimax lower bounds statistica neerlandica vol no pp balabdaoui jankowski and rufibach maximum likelihood estimation and confidence bands for discrete distribution online available http saumard and wellner and strong review statistics surveys vol pp adamaszek czumaj and sohler testing monotone continuous distributions on highdimensional real cubes in soda pp rao and scott the analysis of categorical data from complex sample surveys journal of the american statistical association vol no pp agresti and kateri categorical data analysis springer acharya and 
daskalakis testing poisson binomial distributions in soda pp canonne diakonikolas gouleakis and rubinfeld testing shape restrictions of discrete distributions in stacs gibbs and su on choosing and bounding probability metrics international statistical review vol no pp dec acharya jafarpour orlitsky and suresh efficient compression of monotone and distributions in isit kamath orlitsky pichapati and suresh on learning distributions from their samples in colt massart the tight constant in the inequality the annals of probability vol no pp 
efficient learning of hidden markov models for disease progression liu shuang li fuxin li le song and james rehg college of computing georgia institute of technology atlanta ga abstract the hidden markov model is an attractive approach to modeling disease progression due to its ability to describe noisy observations arriving irregularly in time however the lack of an efficient parameter learning algorithm for restricts its use to very small models or requires unrealistic constraints on the state transitions in this paper we present the first complete characterization of efficient learning methods for models we demonstrate that the learning problem consists of two challenges the estimation of posterior state probabilities and the computation of conditioned statistics we solve the first challenge by reformulating the estimation problem in terms of an equivalent discrete hidden markov model the second challenge is addressed by adapting three approaches from the continuous time markov chain literature to the domain we demonstrate the use of with more than states to visualize and predict disease progression using glaucoma dataset and an alzheimer disease dataset introduction the goal of disease progression modeling is to learn model for the temporal evolution of disease from sequences of clinical measurements obtained from longitudinal sample of patients by distilling population data into compact representation disease progression models can yield insights into the disease process through the visualization and analysis of disease trajectories in addition the models can be used to predict the future course of disease in an individual supporting the development of individualized treatment schedules and improved treatment efficiencies furthermore progression models can support phenotyping by providing natural similarity measure between trajectories which can be used to group patients based on their progression hidden variable models are particularly attractive for modeling disease progression for three reasons they support the abstraction of disease state via the latent variables they can deal with noisy measurements effectively and they can easily incorporate dynamical priors and constraints while conventional hidden markov models hmms have been used to model disease progression they are not suitable in general because they assume that measurement data is sampled regularly at discrete intervals however in reality patient visits are irregular in time as consequence of scheduling issues missed visits and changes in symptomatology hmm is an hmm in which both the transitions between hidden states and the arrival of observations can occur at arbitrary continuous times it is therefore suitable for temporal data such as clinical measurements unfortunately the additional modeling flexibility provided by comes at the cost of more complex inference procedure in not only are the hidden states unobserved but the transition times at which the hidden states are changing are also unobserved moreover multiple unobserved hidden state transitions can occur between two successive observations previous method addressed these challenges by directly maximizing the data likelihood but this approach is limited to very small model sizes general em framework for dynamic bayesian networks of which is special case was introduced in but that work did not address the question of efficient learning consequently there is need for efficient learning methods that can scale to large state spaces hundreds of states or more key 
aspect of our approach is to leverage the existing literature for continuous time markov chain ctmc models these models assume that states are directly observable but retain the irregular distribution of state transition times em approaches to ctmc learning compute the expected state durations and transition counts conditioned on each pair of successive observations the key computation is the evaluation of integrals of the matrix exponential eqs and prior work by wang et al used closed form estimator due to which assumes that the transition rate matrix can be diagonalized through an eigendecomposition unfortunately this is frequently not achievable in practice limiting the usefulness of the approach we explore two additional ctmc approaches which use an alternative matrix exponential on an auxillary matrix expm method and direct truncation of the infinite sum expansion of the exponential unif method neither of these approaches have been previously exploited for learning we present the first comprehensive framework for efficient parameter learning in cthmm which both extends and unifies prior work on ctmc models we show that can be conceptualized as hmm which yields posterior state distributions at the observation times coupled with ctmcs that govern the distribution of hidden state transitions between observations eqs and we explore both soft and hard viterbi decoding approaches to estimating the posterior state distributions in combination with three methods for calculating the conditional expectations we validate these methods in simulation and evaluate our approach on two datasets for glaucoma and alzheimer disease including visualizations of the progression model and predictions of future progression our approach outperforms method for glaucoma prediction which demonstrates the practical utility of for clinical data modeling markov chain markov chain ctmc is defined by finite and discrete state space state transition rate matrix and an initial state probability distribution the elements qij in describe the rate the process transitions from state to for and qii are specified such that each row of sums to zero qi qij qii in process in which the qij are independent of the sojourn time in each state is with parameter qi which is qi with mean the probability that the process next move from state is to state is qij when realization of the ctmc is fully observed meaning that one can observe every transition time and the corresponding state yv where denotes the state at time the complete likelihood cl of the data is cl vy vy qijij τi where is the time interval between two transitions nij is the number of transitions from state to and τi is the total amount of time the chain remains in state in general realization of the ctmc is observed only at discrete and irregular time points tv corresponding to state sequence which are distinct from the switching times as result the markov process between two consecutive observations is hidden with potentially many unobserved state transitions thus both nij and τi are unobserved in order to express the likelihood of the incomplete observations we can utilize discrete time hidden markov model by defining state transition probability matrix for each distinct time interval eqt where pij the entry in is the probability that the process is in state after time given that it is in state at time this quantity takes into account all possible intermediate state transitions and timing between and which are not observed then the likelihood of the data is vy pyv τv vy 
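A small simulator for the generative process just described, assuming a valid rate matrix whose rows sum to zero: hold in state i for an Exponential(q_i) time and then jump to state j with probability q_ij / q_i. The 3-state rate matrix in the usage is made up for illustration.

```python
import numpy as np

def simulate_ctmc(Q, x0, T, rng):
    """Simulate a CTMC with generator Q on [0, T]: exponential holding times
    with rate q_i = -Q[i, i], jump probabilities Q[i, j] / q_i for j != i."""
    times, states = [0.0], [x0]
    t, i = 0.0, x0
    while True:
        q_i = -Q[i, i]
        t += rng.exponential(1.0 / q_i)
        if t >= T:
            break
        probs = Q[i].copy(); probs[i] = 0.0; probs /= q_i
        i = int(rng.choice(len(Q), p=probs))
        times.append(t); states.append(i)
    return np.array(times), np.array(states)

# usage: a toy 3-state chain (rates are illustrative)
Q = np.array([[-1.0, 0.6, 0.4],
              [ 0.3, -0.8, 0.5],
              [ 0.2, 0.2, -0.4]])
times, states = simulate_ctmc(Q, x0=0, T=20.0, rng=np.random.default_rng(0))
print(len(states), states[:10])
```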
pij τv yv pij yv where τv tv is the time interval between two observations yv is an indicator function that is if the condition is true otherwise it is represents unique values among all time intervals τv and yv is the total counts from all successive visits when the condition is true note that there is no analytic maximizer of due to the structure of the matrix exponential and direct numerical maximization with respect to is computationally challenging this motivates the use of an approach an em algorithm for ctmc is described in based on eq the expected complete log hood takes the form log qij nij τi where is the current estimate for and nij and τi are the expected state transition count and total duration given the incomplete observation and the current transition rate matrix respectively once these two expectations are computed in the the updated parameters can be obtained via the as nij and τi now the main computational challenge is to evaluate nij and τi by exploiting the properties of the markov process the two expectations can be decomposed as nij nij τi yv nij τi yv τi where yv if the condition is true otherwise it is thus the computation reduces to computing the conditioned expectations nij and τi for all these expectations are also key step in learning and section presents our approach to computing them hidden markov model in this section we describe the hidden markov model for disease progression and the proposed framework for learning model description in contrast to ctmc where the states are directly observed none of the states are directly observed in instead the available observational data depends on the hidden states via the measurement model in contrast to conventional hmm the observations ov are only available at continuous points in time tv as consequence there are two levels of hidden information in first at observation time the state of the markov chain is hidden and can only be inferred from measurements second the state transitions in the markov chain between two consecutive observations are also hidden as result markov chain may visit multiple hidden states before reaching state that emits noisy observation this additional complexity makes more effective model for event data in comparison to hmm and ctmc but as consequence the parameter learning problem is more challenging we believe we are the first to present comprehensive and systematic treatment of efficient em algorithms to address these challenges fully observed contains four sequences of information the underlying state transition time the corresponding state yv of the hidden markov chain and the observed data ov at time tv their joint complete likelihood can be written as cl vy ov tv qijij τi ov tv we will focus our development on the estimation of the transition rate matrix estimates for the parameters of the emission model and the initial state distribution can be obtained from the standard discrete time hmm formulation but with transition probabilities described below parameter estimation given current estimate of the parameter the expected complete takes the form log qij nij qi τi log ov tv in the taking the derivative of with respect to qij we have nij and τi the challenge lies in the where we compute the expectations of nij and τi conditioned on the observation sequence the statistic for nij can be expressed in terms of the expectations between successive pairs of observations as follows nij tv nij tv tv tv nij tv tv tv nij tv in similar way we can obtain an expression for the expectation of τi τi tv 
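The two quantities at the heart of the EM updates above, the interval transition matrix P(t) = exp(Qt) and the M-step ratio q_ij = E[N_ij] / E[tau_i], can be written in a few lines with SciPy's matrix exponential; the expected counts and durations in the usage are fabricated for illustration.

```python
import numpy as np
from scipy.linalg import expm

def transition_matrix(Q, t):
    """P(t) = exp(Q t): entry (i, j) is the probability of being in state j a
    time t after being in state i, summing over all unobserved intermediate paths."""
    return expm(Q * t)

def m_step(N, tau):
    """M-step of the CTMC EM: q_ij = E[N_ij] / E[tau_i] for i != j, with the
    diagonal set so that each row of Q sums to zero."""
    Q = N / tau[:, None]
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

# usage with made-up expected counts / durations for a 3-state model
N = np.array([[0.0, 4.0, 1.0], [2.0, 0.0, 3.0], [1.0, 1.0, 0.0]])
tau = np.array([10.0, 8.0, 12.0])
Q_new = m_step(N, tau)
print(Q_new)
print(transition_matrix(Q_new, 2.0).sum(axis=1))  # each row sums to 1
```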
τi tv in section we present our approach to computing the conditioned statistics nij tv and τi tv the remaining step is to compute the posterior state distribution at two consecutive observation times tv computing the posterior state probabilities the challenge in efficiently computing tv is to avoid the explicit enumeration of all possible state transition sequences and the variable time intervals between intermediate state transitions from to the key is to note that the posterior state probabilities are only needed at the times where we have observation data we can exploit this insight to reformulate the estimation problem in terms of an equivalent discrete hidden markov model specifically given the current estimate and we will divide the time into intervals each with duration τv tv we then make use of the transition property of ctmc and associate each interval with state transition matrix τv τv together with the emission model we then have discrete hidden markov model with joint likelihood τv tv ov tv the formulation in eq allows us to reduce the computation of tv to familiar operations the algorithm can be used to compute the posterior distribution of the hidden states which we refer to as the soft method alternatively the map assignment of hidden states obtained from the viterbi algorithm can provide an approximate distribution which we refer to as the hard method em algorithms for pseudocode for the em algorithm for parameter learning is shown in algorithm multiple variants of the basic algorithm are possible depending on the choice of method for computing the conditioned expectations along with the choice of hard or soft decoding for obtaining the posterior state probabilities in eq note that in line of algorithm algorithm parameter learning input data ov and tv state set edge set initial guess of output transition rate matrix qij find all distinct time intervals from compute for each repeat compute tv for all and the data likelihood by using soft or viterbi hard create soft count table from by summing prob from visits of same use expm unif or eigen method to compute nij and τi update qij τiji and qii qij until likelihood converges we group probabilities from successive visits of same time interval and the same specified endstates in order to save computation time this is valid because in nij tv nij where so that the expectations only need to be evaluated for each distinct time interval rather than each different visiting time also see the discussion below eq computing the conditioned expectations the remaining step in finalizing the em algorithm is to discuss the computation of the conditioned expectations for nij and τi from eqs and respectively the first step is to express the expectations in integral form following qi pk pj dx pk pk pi dx τi pk nij rt rt from eq we define τk pk pj dx eqx eq dx while τk can be similarly defined for eq see for similar construction several methods for computing τk and τk have been proposed in the ctmc literature metzner et al observe that expressions can be obtained when is diagonalizable unfortunately this property is not guaranteed to exist and in practice we find that the intermediate matrices are frequently not diagonalizable during em iterations we refer to this approach as eigen an alternative is to leverage classic method of van loan for computing of maq trix exponentials in this approach an auxiliary matrix is constructed as where rt is matrix with identical dimensions to it is shown in that eqx beq dt eat where is the dimension of 
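The "soft" posterior computation reduces to standard forward-backward recursions in which the transition matrix for the v-th gap is exp(Q tau_v); the sketch below shows only the forward (likelihood) recursion under assumed emission likelihoods, with the rate matrix, gaps, and likelihood values all made up for illustration.

```python
import numpy as np
from scipy.linalg import expm

def forward_loglik(Q, pi0, obs_lik, gaps):
    """Forward pass of the equivalent discrete-time HMM: the transition matrix
    between visits v-1 and v is expm(Q * tau_v) for the irregular gap tau_v, and
    obs_lik[v, i] = p(o_v | state i) comes from the emission model."""
    alpha = pi0 * obs_lik[0]
    loglik = np.log(alpha.sum()); alpha /= alpha.sum()
    for v, tau in enumerate(gaps, start=1):
        alpha = (alpha @ expm(Q * tau)) * obs_lik[v]
        loglik += np.log(alpha.sum()); alpha /= alpha.sum()
    return loglik

# usage: 3 hidden states, 4 visits with irregular gaps (all values illustrative)
Q = np.array([[-1.0, 0.7, 0.3], [0.2, -0.6, 0.4], [0.1, 0.1, -0.2]])
pi0 = np.array([0.6, 0.3, 0.1])
obs_lik = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1],
                    [0.1, 0.6, 0.3], [0.1, 0.2, 0.7]])
print(forward_loglik(Q, pi0, obs_lik, gaps=[0.5, 2.0, 1.3]))
```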
following we set where is the matrix with in the th entry and elsewhere thus the left hand side reduces to τk for all in the corresponding matrix entries thus we can leverage the substantial literature on numerical computation of the matrix exponential we refer to this approach as expm after the popular matlab function third approach for computing the expectations introduced by hobolth and jensen for ctmcs is called uniformization unif and is described in the supplementary material along with additional details for expm expm based algorithm algorithm presents pseudocode for the expm method for computing conditioned statistics the algorithm exploits the fact that the matrix does not change with time therefore when using the scaling and squaring method for computing matrix exponentials one can easily cache and reuse the intermediate powers of to efficiently compute eta for different values of analysis of time complexity and comparisons we conducted asymptotic complexity analysis for all six combinations of hard and soft em with the methods expm unif and eigen for computing the conditional expectations for both hard and algorithm the expm algorithm for computing conditioned statistics for each state in do for to do τi end for end for for each edge in do for to do nij ij where pkl nij nij end for end for di pkl where soft variants the time complexity of expm is rs rls where is the number of distinct time intervals between observations is the number of states and is the number of edges the soft version of eigen has the same time complexity but since the eigendecomposition of matrices can be in any em iteration this method is not attractive unif is based on truncating an infinite sum and the truncation point varies with maxi qi with the result that the cost of unif varies significantly with both the data and the parameters in comparison expm is much less sensitive to these values log versus quadratic dependency see the supplemental material for the details we conclude that expm is the most robust method available for the soft em case when the state space is large hard em can be used to tradeoff accuracy with time in the hard em case unif can be more efficient than expm because unif can evaluate only the expectations specified by the required from the best decoded paths whereas expm must always produce results from all these asymptotic results are consistent with our experimental findings on the glaucoma dataset from section using model with states soft expm requires minutes per iteration on ghz machine with unoptimized matlab code while soft unif spends more than minutes per iteration hard unif spends minutes per iteration and eigen fails experimental results we evaluated our em algorithms in simulation sec and on two datasets glaucoma dataset sec in which we compare our prediction performance to method and dataset for alzheimer disease ad sec where we compare visualized progression trends to recent findings in the literature our disease progression models employ glaucoma and ad states representing significant advance in the ability to work with large models previous works employed fewer than states simulation on complete digraph we test the accuracy of all methods on complete digraph with synthetic data generated under different noise levels each pqi is randomly drawn from and then qij is drawn from and renormalized such that qij qi the state chains are generated from such that where qi is the largest mean holding time each chain has total duration around min qi the data emission model for 
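The Expm approach above relies on Van Loan's construction: placing B in the top-right block of the auxiliary matrix [[Q, B], [0, Q]] makes the top-right block of its matrix exponential equal to the integral of exp(Qx) B exp(Q(t-x)) over [0, t]. A minimal sketch with a made-up rate matrix, checked against numerical quadrature:

```python
import numpy as np
from scipy.linalg import expm

def van_loan_integral(Q, B, t):
    """Compute int_0^t expm(Q*x) @ B @ expm(Q*(t-x)) dx via Van Loan's trick:
    the top-right n-by-n block of expm([[Q, B], [0, Q]] * t) is the integral."""
    n = Q.shape[0]
    A = np.block([[Q, B], [np.zeros((n, n)), Q]])
    return expm(A * t)[:n, n:]

# usage: with B = e_i e_i^T this is the integral entering E[tau_i | start, end]
Q = np.array([[-1.0, 0.7, 0.3], [0.2, -0.6, 0.4], [0.1, 0.1, -0.2]])
B = np.zeros((3, 3)); B[1, 1] = 1.0            # picks out state i = 1
I_mat = van_loan_integral(Q, B, t=2.0)

# sanity check against a trapezoidal quadrature of the integrand
xs = np.linspace(0.0, 2.0, 2001)
vals = np.array([expm(Q * x) @ B @ expm(Q * (2.0 - x)) for x in xs])
num = 0.5 * (vals[:-1] + vals[1:]).sum(axis=0) * (xs[1] - xs[0])
print(np.allclose(I_mat, num, atol=1e-4))      # True
```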
state is set as where varies under different noise level settings the observations are then sampled from the state chains with rate max where qi qi is the smallest mean holding time which should be dense enough to make the chain identifiable total of observations are sampled the average relative error is used as the performance metric where is vector contains all learned qij parameters and is the ground truth the simulation results from random runs are listed in table expm and unif produce nearly identical results so they are combined in the table eigen fails at least once for each setting but when it works it produces similar results all soft methods achieve significantly better accuracy table the average relative error from random runs on complete digraph under varying noise levels the convergence threshold is on relative data likelihood change error expm unif expm unif functional deterioration structural deterioration structural deterioration functional deterioration figure the state structure for glaucoma progression modeling illustration of the prediction of future states from one fold of convergence behavior of soft expm on the glaucoma dataset than hard methods especially when the noise level becomes higher this can be attributed to the maintenance of the full hidden state distribution which makes it more robust to noise application of to predicting glaucoma progression in this experiment we used to visualize glaucoma dataset and predict glaucoma progression glaucoma is leading cause of blindness and visual morbidity worldwide this disease is characterized by slowly progressing optic neuropathy with associated irreversible structural and functional damage there are conflicting findings in the temporal ordering of detectable structural and functional changes which confound glaucoma clinical assessment and treatment plans here we use state space model with states defined by successive value bands of the two main glaucoma markers visual field index vfi functional marker and average rnfl retinal nerve fiber layer thickness structural marker with forwarding edges see fig more details of the dataset and model can be found in the supplementary material we utilize soft expm for the following experiments since it converges quickly see fig has an acceptable computational cost and exhibits the best performance to predict future continuous measurements we follow simple procedure illustrated in fig given testing patient viterbi decoding is used to decode the best hidden state path for the past visits then given future time the most probable future state is predicted by maxj pij blue node where is the current state black node to predict the continuous measurements we search for the future time and in desired resolution when the patient enters and leaves state having same value range as state for each disease marker separately the measurement at time can then be computed by linear interpolation between and and the two data bounds of state for the specified marker in fig the mean absolute error mae between the predicted values and the actual measurements was used for performance assessment the performance of cthmm was compared to both conventional linear regression and bayesian joint linear regression for the bayesian method the joint prior distribution of the four parameters two intercepts and two slopes computed from the training set is used alongside the data likelihood the results in table demonstrate the significantly improved performance of in fig we visualize the model trained using the 
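The future-state prediction step described above, choosing argmax_j P_ij(t) from the current decoded state i, is a one-liner once Q has been learned; the toy rate matrix below, with an absorbing final state, is purely illustrative.

```python
import numpy as np
from scipy.linalg import expm

def most_probable_future_state(Q, current_state, t):
    """Return argmax_j P_ij(t) with P(t) = expm(Q * t), i.e. the most probable
    state a time t after the current decoded state, as in the prediction
    procedure described above."""
    return int(np.argmax(expm(Q * t)[current_state]))

# usage: a toy 3-state progression model whose last state is absorbing
Q = np.array([[-0.5, 0.4, 0.1], [0.0, -0.3, 0.3], [0.0, 0.0, 0.0]])
print([most_probable_future_state(Q, 0, t) for t in (0.5, 2.0, 10.0)])
```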
entire dataset several dominant paths can be identified there is an early stage containing rnfl thinning with intact vision blue vertical path in the first column and at around rnfl range the transition trend reverses and vfi changes become more evident blue horizontal paths this shape in the disease progression supports the finding in that rnfl thickness of around microns is tipping point at which functional deterioration becomes clinically observable with structural deterioration our model can be used to visualize the relationship between structural and functional degeneration yielding insights into the progression process application of to exploratory analysis of alzheimer disease we now demonstrate the use of as an exploratory tool to visualize the temporal interaction of disease markers of alzheimer disease ad ad is an irreversible disease that results in loss of mental function due to the degeneration of brain tissues an estimated table the mean absolute error mae of predicting the two glaucoma measures represents that cthmm performs significantly better than the competing method under student mae bayesian joint linear regression linear regression vfi rnfl million americans have ad yet no prevention or cures have been found it could be beneficial to visualize the relationship between clinical imaging and biochemical markers as the pathology evolves in order to better understand ad progression and develop treatments state model was constructed from cohort of ad patients see the supplementary material for additional details the visualization result is shown in fig the state transition trends show that the abnormality of aβ level emerges first blue lines when cognition scores are still normal hippocampus atrophy happens more often green lines when aβ levels are already low and cognition has started to show abnormality most cognition degeneration happens red lines when both aβ levels and hippocampus volume are already in abnormal stages our quantitative visualization results supports recent findings that the decreasing of aβ level in csf is an early marker before detectable hippocampus atrophy in elderly the disease model with interactive visualization can be utilized as an exploratory tool to gain insights of the disease progression and generate hypotheses to be further investigated by medical researchers structural degeneration rnfl functional degeneration vfi functional cognition structural hippocampus biochemical beta glaucoma progression alzheimer disease progression figure visualization scheme the strongest transition among the three instantaneous links from each state are shown in blue while other transitions are drawn in dotted black the line width and the node size reflect the expected count the node color represents the average sojourn time red to green to years and above similar to but the strongest transition from each state is color coded as follows aβ direction blue hippo green cog red aβ cyan aβ magenta yellow aβ cog black the node color represents the average sojourn time red to green to years and above conclusion in this paper we present novel em algorithms for learning which leverage recent approaches for evaluating the conditioned expectations in ctmc models to our knowledge we are the first to develop and test the expm and unif methods for learning we also analyze their time complexity and provide experimental comparisons among the methods under soft and hard em frameworks we find that soft em is more accurate than hard em and expm works the best under soft em we 
evaluated our EM algorithms on two disease progression datasets, for glaucoma and AD. We show that CT-HMM outperforms the Bayesian joint linear regression method for glaucoma progression prediction. This demonstrates the practical value of CT-HMM for longitudinal disease modeling and prediction.

Acknowledgments. Portions of this work were supported in part by NIH and by a grant awarded by the National Institute of Biomedical Imaging and Bioengineering through funds provided by the Big Data to Knowledge initiative. Additionally, the collection and sharing of the Alzheimer's data was funded by ADNI under NIH and DoD awards. The research was also supported in part by BIGDATA ONR, NSF, and NSF CAREER grants.

References
Cox and Miller. The Theory of Stochastic Processes. London: Chapman and Hall.
Jackson. Multi-state models for panel data: the msm package for R. Journal of Statistical Software.
Bartolomeo, Trerotoli, and Serio. Progression of liver cirrhosis to HCC: an application of hidden Markov model. BMC Medical Research.
Liu, Ishikawa, Chen, et al. Longitudinal modeling of glaucoma progression using continuous-time hidden Markov model. Med. Image Comput. Comput. Assist. Interv.
Wang, Sontag, and Wang. Unsupervised learning of disease progression models. In Proceedings of KDD.
Nodelman, Shelton, and Koller. Expectation maximization and complex duration distributions for continuous time Bayesian networks. In Proc. Uncertainty in AI (UAI).
Visualization and prediction of disease interactions with hidden Markov models. In NIPS.
Metzner, Horenko, and Schütte. Generator estimation of Markov jump processes based on incomplete observations nonequidistant in time. Physical Review.
Hobolth and Jensen. Summary statistics for Markov chains. Journal of Applied Probability.
Tataru and Hobolth. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains. BMC Bioinformatics.
Medeiros, Zangwill, Girkin, et al. Combining structural and functional measurements to improve estimates of rates of glaucomatous progression. Am. Ophthalmol.
Bladt and Sørensen. Statistical inference for discretely observed Markov jump processes. Statist. Soc.
Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE.
Hobolth. Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Statistical Applications in Genetics and Molecular Biology.
Van Loan. Computing integrals involving the matrix exponential. IEEE Trans. Automatic Control.
Higham. Functions of Matrices: Theory and Computation. SIAM.
Metzner, Horenko, and Schütte. Generator estimation of Markov jump processes. Journal of Computational Physics.
Kingman. Glaucoma is second leading cause of blindness globally. Bulletin of the World Health Organization.
Wollstein, Schuman, Price, et al. Optical coherence tomography longitudinal evaluation of retinal nerve fiber layer thickness in glaucoma. Arch. Ophthalmol.
Wollstein, Kagemann, Bilonick, et al. Retinal nerve fibre layer and visual function loss in glaucoma: the tipping point. Br. Ophthalmol.
The Alzheimer's Disease Neuroimaging Initiative. http
Fagan, Head, Shah, et al. Decreased CSF Aβ correlates with brain atrophy in cognitively normal elderly. Ann.
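As a companion to the glaucoma prediction procedure described above (Viterbi decoding of the past visits, prediction of the most probable future state, and linear interpolation of the marker value within that state's value band), the following is a minimal sketch of how such a prediction could be computed. The function names, the time grid, and the value-band convention are illustrative assumptions, not the authors' code.

```python
# A minimal sketch of the future-state prediction step, assuming a CT-HMM whose generator
# (rate) matrix Q has already been learned and whose current state comes from Viterbi
# decoding of the past visits.  All names and defaults below are illustrative assumptions.
import numpy as np
from scipy.linalg import expm

def transition_matrix(Q, t):
    """P(t) = expm(Q t): state-transition probabilities over a horizon t."""
    return expm(Q * t)

def most_probable_future_state(Q, current_state, t):
    """argmax_j P_ij(t): the most likely state t time units after the current visit."""
    return int(np.argmax(transition_matrix(Q, t)[current_state]))

def predict_marker(Q, current_state, t_query, state_bands, resolution=0.05, horizon=20.0):
    """Predict a continuous marker value at time t_query by linear interpolation.

    state_bands[k] = (lower, upper) value band of the marker for state k.  We search, on a
    grid of future times, for when the predicted trajectory enters and leaves the band of
    the state predicted at t_query, then interpolate between the two band edges.
    """
    target = most_probable_future_state(Q, current_state, t_query)
    times = np.arange(0.0, horizon, resolution)
    states = np.array([most_probable_future_state(Q, current_state, t) for t in times])
    in_band = np.where(states == target)[0]
    lo, hi = state_bands[target]
    if in_band.size == 0:
        return 0.5 * (lo + hi)
    t_enter, t_leave = times[in_band[0]], times[in_band[-1]]
    if t_leave == t_enter:
        return 0.5 * (lo + hi)
    frac = np.clip((t_query - t_enter) / (t_leave - t_enter), 0.0, 1.0)
    # assumes the marker declines from the upper to the lower band edge as the disease progresses
    return hi + frac * (lo - hi)
```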
Expectation Particle Belief Propagation
Thibaut Lienart, Yee Whye Teh, Arnaud Doucet
Department of Statistics, University of Oxford, Oxford, UK
lienart, teh, doucet

Abstract. We propose an original particle-based implementation of the loopy belief propagation (LBP) algorithm for pairwise Markov random fields (MRF) on a continuous state space. The algorithm constructs adaptively efficient proposal distributions approximating the local beliefs at each node of the MRF. This is achieved by considering proposal distributions in the exponential family whose parameters are updated iteratively in an expectation propagation (EP) framework. The proposed particle scheme provides consistent estimation of the LBP marginals as the number of particles increases. We demonstrate that it provides more accurate results than the particle belief propagation (PBP) algorithm of Ihler and McAllester at a fraction of the computational cost and is additionally more robust. Empirically, the computational complexity of our algorithm at each iteration is quadratic in the number of particles. We also propose an accelerated O(N log N) implementation, which still provides consistent estimates of the loopy BP marginal distributions and performs almost as well as the original procedure.

Introduction. Undirected graphical models, also known as Markov random fields, provide a flexible framework to represent networks of random variables and have been used in a large variety of applications in machine learning, statistics, signal processing, and related fields. For many applications, such as tracking, sensor networks, or computer vision, it can be beneficial to define an MRF on a continuous state space. Given a pairwise MRF, we are here interested in computing the marginal distributions at the nodes of the graph. A popular approach to do this is to consider the loopy belief propagation (LBP) algorithm. LBP relies on the transmission of messages between nodes. However, when dealing with continuous random variables, computing these messages exactly is generally intractable. In practice, one must select a way to tractably represent these messages and a way to update these representations following the LBP algorithm. The nonparametric belief propagation (NBP) algorithm represents the messages with mixtures of Gaussians, while the particle belief propagation (PBP) algorithm uses an importance sampling approach. NBP relies on restrictive integrability conditions and does not offer consistent estimators of the LBP messages. PBP offers a way to circumvent these two issues, but the implementation suggested by its authors proposes sampling from the estimated beliefs, which need not be integrable. Moreover, even when they are integrable, sampling from the estimated beliefs is very expensive computationally. Practically, the authors of PBP only sample approximately from those beliefs using short MCMC runs, leading to biased estimators. In our method, we consider a sequence of proposal distributions at each node from which one can sample particles at a given iteration of the LBP algorithm; the messages are then computed using importance sampling. The novelty of the approach is to propose a principled and automated way of designing a sequence of proposals in a tractable exponential family, using the expectation propagation (EP) framework. The resulting algorithm, which we call expectation particle belief propagation (EPBP), does not suffer from restrictive integrability conditions, and sampling is done exactly, which implies that we obtain consistent estimators of the LBP messages. The method is empirically shown to yield better approximations to the LBP beliefs than the implementation suggested in the original PBP work, at a much reduced computational cost, and than EP.
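Before reviewing the continuous message-passing recursion in detail, the following is a minimal sketch of LBP on a discretized state space — the same device the experiments later use to compute ground-truth beliefs on an equally spaced mesh. The graph, potentials, grid, and all names below are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal sketch of discretized loopy BP on a small pairwise MRF, of the kind used later in
# the paper to compute ground-truth beliefs on an equally spaced mesh.  Toy setup only.
import numpy as np

grid = np.linspace(-10, 10, 200)                      # deterministic mesh for each node
edges = [(0, 1), (1, 2), (0, 2)]                      # a small loopy graph on 3 nodes
y = [0.5, -1.0, 2.0]                                  # toy observations

def node_pot(u, x):                                   # psi_u(x_u): e.g. a Gaussian likelihood
    return np.exp(-0.5 * (x - y[u]) ** 2)

def edge_pot(x_u, x_v):                               # psi_uv(x_u, x_v): smoothness potential
    return np.exp(-0.5 * np.abs(x_u - x_v))

neighbours = {u: [v for (a, b) in edges for v in (a, b) if u in (a, b) and v != u]
              for u in range(3)}
msgs = {(u, v): np.ones_like(grid) for (a, b) in edges for (u, v) in [(a, b), (b, a)]}

for _ in range(20):                                   # LBP iterations
    new = {}
    for (u, v) in msgs:
        # m_uv(x_v) = sum_{x_u} psi_uv(x_u, x_v) psi_u(x_u) prod_{w in N(u)\v} m_wu(x_u)
        pre = node_pot(u, grid)
        for w in neighbours[u]:
            if w != v:
                pre = pre * msgs[(w, u)]
        kernel = edge_pot(grid[:, None], grid[None, :])   # table over (x_u, x_v)
        m = kernel.T @ pre
        new[(u, v)] = m / m.sum()                         # normalise for numerical stability
    msgs = new

# b_u(x_u) ∝ psi_u(x_u) prod_{w in N(u)} m_wu(x_u)
beliefs = {u: node_pot(u, grid) * np.prod([msgs[(w, u)] for w in neighbours[u]], axis=0)
           for u in range(3)}
beliefs = {u: b / (b.sum() * (grid[1] - grid[0])) for u, b in beliefs.items()}
```

On a tree this recursion recovers the exact marginals at convergence; on the loopy toy graph above it only approximates them, which is the regime the paper targets.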
Background and notations. We consider a pairwise MRF distribution $p$ over a set of random variables $(X_u)_{u \in V}$ indexed by a set $V$, which factorizes according to an undirected graph $G = (V, E)$ with
$$p(x) \;\propto\; \prod_{u \in V} \psi_u(x_u) \prod_{(u,v) \in E} \psi_{uv}(x_u, x_v).$$
The random variables are assumed to take values on a continuous, possibly unbounded, space. The positive functions $\psi_u$ and $\psi_{uv}$ are respectively known as the node and edge potentials. The aim is to approximate the marginals $p_u(x_u)$ for all $u \in V$. A popular approach is the LBP algorithm discussed earlier. This algorithm is a fixed-point iteration scheme yielding approximations, called the beliefs, at each node. When the underlying graph is a tree, the resulting beliefs can be shown to be proportional to the exact marginals. This is not the case in the presence of loops in the graph; however, even in these cases, LBP has been shown to provide good approximations in a wide range of situations. The LBP iteration can be written as follows: at iteration $t$,
$$m^{t+1}_{uv}(x_v) \;=\; \int \psi_{uv}(x_u, x_v)\, \psi_u(x_u) \prod_{w \in \Gamma_u \setminus \{v\}} m^{t}_{wu}(x_u)\, dx_u, \qquad b^{t}_u(x_u) \;\propto\; \psi_u(x_u) \prod_{w \in \Gamma_u} m^{t}_{wu}(x_u),$$
where $\Gamma_u$ denotes the neighborhood of $u$, i.e., the set of nodes adjacent to $u$; $m_{uv}$ is known as the message from node $u$ to node $v$, and $b_u$ is the belief at node $u$.

Related work. The crux of any generic implementation of LBP for continuous state spaces is to select a way to represent the messages and design an appropriate method to compute the message update. In nonparametric BP (NBP), the messages are represented by mixtures of Gaussians. In theory, computing the product of such messages can be done analytically, but in practice this is impractical due to the exponential growth in the number of terms to consider. To circumvent this issue, the authors suggest an importance sampling approach targeting the beliefs and fitting mixtures of Gaussians to the resulting weighted particles; the computation of the update is then always done over a constant number of terms. A restriction of vanilla nonparametric BP is that the messages must be finitely integrable for the message representation to make sense. This is the case if the following two conditions hold:
$$\sup_{x_v} \int \psi_{uv}(x_u, x_v)\, dx_u \;<\; \infty \qquad \text{and} \qquad \int \psi_u(x_u)\, dx_u \;<\; \infty.$$
These conditions do, however, not hold in a number of important cases, as acknowledged in previous work. For instance, the potential $\psi_u(x_u)$ is usually proportional to a likelihood of the form $p(y_u \mid x_u)$, which need not be integrable in $x_u$. Similarly, in imaging applications for example, the edge potential can encode similarity between pixels, which also need not verify the integrability conditions above. Further, NBP does not offer consistent estimators of the LBP messages.

Particle BP (PBP) offers a way to overcome the shortcomings of NBP. The authors also consider importance sampling to tackle the update of the messages, but without fitting a mixture of Gaussians. For a chosen proposal distribution $q_u$ on node $u$ and a draw of $N$ particles $x_u^{(i)} \sim q_u(x_u)$, the messages are represented as mixtures
$$\hat m^{\text{PBP}}_{uv}(x_v) \;:=\; \sum_{i=1}^{N} \omega_{uv}^{(i)}\, \psi_{uv}(x_u^{(i)}, x_v), \qquad \text{with} \qquad \omega_{uv}^{(i)} \;\propto\; \frac{\psi_u(x_u^{(i)})}{q_u(x_u^{(i)})} \prod_{w \in \Gamma_u \setminus \{v\}} \hat m^{\text{PBP}}_{wu}(x_u^{(i)}).$$
This algorithm has the advantage that it does not require the integrability conditions above to hold. The authors suggest two possible choices of sampling distributions: sampling from the local potential $\psi_u$, or sampling from the current belief estimate. The first case is only valid if $\psi_u$ is integrable in $x_u$, which, as we have mentioned earlier, might not be the case in general; the second case implies sampling from a distribution of the form $\hat b^{\text{PBP}}_u(x_u) \propto \psi_u(x_u) \prod_{w \in \Gamma_u} \hat m^{\text{PBP}}_{wu}(x_u)$, which is a product of mixtures. As in NBP, exact sampling from this proposal is in general too expensive to consider. Alternatively, as the authors suggest, one can run a short MCMC simulation targeting it, which reduces the complexity to a lower order; since evaluating $\hat b^{\text{PBP}}_u$ at a single point already costs order $N$, this cost is paid at each
iteration which requires evaluating need iterations of the mcmc simulation the issue with this approach is that it is still computationally expensive and it is unclear how many iterations are necessary to get good samples our contribution in this paper we consider the general context where the edge and might be nonnormalizable and our proposed method is based on pbp as pbp is theoretically better suited than nbp since as discussed earlier it does not require the conditions to hold and provided that one samples from the proposals exactly it yields consistent estimators of the lbp messages while nbp does not further the development of our method also formally shows that considering proposals close to the beliefs as suggested by is good idea our core observation is that since sampling from proposal of the form using mcmc simulation is very expensive we should consider using more tractable proposal distribution instead however it is important that the proposal distribution is constructed adaptively taking into account evidence collected through the message passing itself and we propose to achieve this by using proposal distributions lying in tractable exponential family and adapted using the expectation propagation ep framework expectation particle belief propagation our aim is to address the issue of selecting the proposals in the pbp algorithm we suggest using exponential family distributions as the proposals on node for computational efficiency reasons with parameters chosen adaptively based on current estimates of beliefs and ep each step of our algorithm involves both projection onto the exponential family as in ep as well as particle approximation of the lbp message hence we will refer to our method as expectation particle belief propagation or epbp for short for each pair of adjacent nodes and we will use muv xv to denote the exact but unavailable lbp message from to uv xv to denote the particle approximation of muv and ηuv an exponential family projection of uv in addition let denote an exponential family projection of the node potential ψu we will consider approximations consisting of particles in the following we will derive the form of our particle approximated message uv xv along with the choice of the proposal distribution qu xu used to construct uv our starting point is the belief over xu and xv given the incoming particle approximated messages buv xu xv ψuv xu xv ψu xu ψv xv wu xu νv xv buv xv the exact lbp message muv xv can be derived by computing the marginal distribution and constructing muv xv such that buv xv muv xv cvu xv cvu xv ψv xv where νv xv is the particle approximated from to it is easy to see that the resulting message is as expected muv xv ψuv xu xv ψu xu wu xu dxu since the above exact lbp belief and message are intractable in our scenario of interest the idea buv xu xv instead consider proposal distribution is to use an importance sampler targeting of the form qu xu qv xv since xu and xv are independent under the proposal we can draw independent samples say xu and xv from qu and qv respectively we can then approximate the belief using cross product of the particles buv xv buv xu xv xu xv qu xu qv wu xu vu xv ψuv xu xv ψu xu xu xv qu xu qv xv buv xv marginalizing onto xv we have the following particle approximation to uv xv xm vu xv δx xv buv xv qv xv where the particle approximated message uv xv from to has the form of the message representation in the pbp algorithm buv to determine sensible proposal distributions we can find qu and qv that are close to the 
target buv kqu qv as the measure of closeness the optimal qu required for the using the kl divergence kl to message is the node belief buv xu ψu xu wu xu thus supporting the claim in that good proposal to use is the current estimate of the node belief as pointed out in section it is computationally inefficient to use the particle approximated node belief as the proposal distribution an idea is to use tractable exponential family distribution for qu instead say qu xu xu ηwu xu where and ηwu are exponential family approximations of ψu and wu respectively in section we use gaussian family but we are not limited to this using the framework of expectation propogation ep we can iteratively find good exponential family approximations as follows for each γu to update the ηwu we form the cavity distribution qu qu and the sponding tilted distribution wu qu the updated ηwu is the exponential family factor minimising the kl divergence ηwu arg min kl wu xu xu xu xu geometrically the update projects the tilted distribution onto the exponential family manifold the optimal solution requires computing the moments of the tilted distribution through cal quadrature and selecting ηwu so that ηwu qu matches the moments of the tilted distribution in our scenario the moment computation can be performed crudely on small number of evaluation points since it only concerns the updating of the importance sampling proposal if an optimal in the exponential family does not exist in the gaussian case that the optimal has negative variance we simply revert ηwu to its previous value an analogous update is used for in the above derivation the expectation propagation steps for each incoming message into and for the node potential are performed first to fit the proposal to the current estimated belief at before it is used to draw particles which can then be used to form the particle approximated messages from to each of its neighbours alternatively once each particle approximated message uv xv is formed we can update its exponential family projection ηuv xv immediately this alternative scheme is described in algorithm algorithm node update sample xu qu bu compute wu xu ψu xu for γu do cuv vu xu compute bu xu cuv compute the normalized weights wuv xu pn update the estimator of the outgoing message uv xv wuv ψuv xu xv compute the cavity distribution qv qv get in the exponential family such that qv approximates ψv qv update qv and let compute the cavity distribution qv qv get ηuv in the exponential family such that qv approximates uv qv update qv ηuv and let ηuv ηuv ηuv end for computational complexity and implementation each ep projection step costs computations since the message wu is mixture of components see drawing particles from the exponential family proposal qu costs the step with highest computational complexity is in evaluating the particle weights in indeed evaluating the mixture representation of message on single point is and we need to compute this for each of particles similarly evaluating the estimator of the belief on sampling points at node requires this can be reduced since the algorithm still provides consistent estimators if we consider the evaluation of unbiased estimators of the messages instead since the messages pn have the form uv xv wuv ψuv xv we can follow method presented in where one draws indices from multinomial with weights wuv and evaluates the correi sponding components ψuv this reduces the cost of the evaluation of the beliefs to which leads to an overall complexity if is we show in the next 
section how it compares to the quadratic implementation when log experiments we investigate the performance of our method on mrfs for two simple graphs this allows us to compare the performance of epbp to the performance of pbp in depth we also illustrate the behavior of the version of epbp finally we show that epbp provides good results in simple denoising application comparison with pbp we start by comparing epbp to pbp as implemented by ihler et al on grid figure with random variables taking values on the node and edge potentials are selected such that the marginals are multimodal and skewed with ψu xu xu yu xu yu ψuv xu xv xu xv where yu denotes the observation at node exp density of normal distribution exp density of gumbel distribution and exp density of laplace distribution the parameters and are respectively set to and we compare the two methods after lbp the scheduling used alternates between the classical orderings and one lbp iteration implies that all nodes have been updated once figure illustration of the grid left and tree right graphs used in the experiments pbp as presented in is implemented using the same parameters than those in an implementation code provided by the authors the proposal on each node is the last estimated belief and sampled with mcmc chain the mh proposal is normal distribution for epbp the approximation of the messages are gaussians the ground truth is approximated by running lbp on deterministic equally spaced mesh with points all simulations were run with julia on mac with ghz intel core processor our code is available figure compares the performances of both methods the error is computed as the mean error over all nodes between the estimated beliefs and the ground truth evaluated over the same deterministic mesh one can observe that not only does pbp perform worse than epbp but also that the error plateaus with increasing number of samples this is because the secondampling within pbp is done approximately and hence the consistency of the estimators is lost the offered by epbp is very substantial figure left hence although it would be possible to use more mcmc iterations within pbp to improve its performance it would make the method prohibitively expensive to use note that for epbp one observes the usual convergence of particle methods figure compares the estimator of the beliefs obtained by the two methods for three arbitrarily picked nodes node and as illustrated on figure the figure also illustrates the last proposals constructed with our approach and one notices that their supports match closely the support of the true beliefs figure left illustrates how the estimated beliefs converge as compared to the true beliefs with increasing number of iterations one can observe that pbp converges more slowly and that the results display more variability which might be due to the mcmc runs being too short we repeated the experiments on tree with nodes figure right where we know that at convergence the beliefs computed using bp are proportional to the true marginals the node and edge potentials are again picked such that the marginals are multimodal with ψu xu xu yu xu yu ψuv xu xv xu xv with and on this example we also show how pure ep with normal distributions performs we also try using the distributions obtained with ep as proposals for pbp referred to as pbp after ep in figures both methods underperform compared to epbp as illustrated visually in figure in particular one can observe in figure that pbp after ep converges slower than epbp with increasing 
number of samples implementation and denoising application as outlined in section in the implementation of epbp one can use an unbiased estimator of the edge weights based on draw of components from multinomial the complexity of the resulting algorithm is we apply this method to the grid example in the case where is picked to be roughly of order log for we pick the results are illustrated in figure where one can see that the log implementation compares very well to the original quadratic implementation at much reduced cost we apply this method on simple probabilistic model for an image denoising problem the aim of this example is to show that the method can be applied to larger graphs and still provide good results the model underlined is chosen to showcase the flexibility and applicability of our method in particular when the is it is not claimed to be an optimal approach to image the node and edge potentials are defined as follows ψu xu xu yu ψuv xu xv lλ xu xv https in this case in particular an method such as is likely to yield better results where lλ if and otherwise in this example we set the value assigned to each pixel of the reconstruction is the estimated mean obtained over the corresponding node figure the image has size and the simulation was run with particles per nodes and bp iterations taking under minutes to complete we compare it with the result obtained with ep on the same model epbp pbp after ep mean error mean error pbp epbp number of samples per node number of samples per node figure left comparison of the mean error for pbp and epbp for the grid example right comparison of the mean error for pbp after ep and epbp for the tree example in both cases epbp is more accurate for the same number of samples true belief estimated belief epbp estimated belief pbp proposal epbp figure comparison of the beliefs on node and as obtained by evaluating lbp on deterministic mesh true belief with pbp and with epbp for the grid example the proposal used by epbp at the last step is also illustrated the results are obtained with samples on each node and bp iterations one can observe visually that epbp outperforms pbp pbp epbp time mean error epbp pbp number of bp iterations number of samples per node figure left comparison of the convergence in error with increasing number of bp iterations for the grid example when using particles right comparison of the time needed to perform pbp and epbp on the grid example discussion we have presented an original way to design adaptively efficient and proposals for particle implementation of loopy belief propagation our proposal is inspired by the expectation propagation framework we have demonstrated empirically that the resulting algorithm is significantly faster and more accurate than an implementation of pbp using the estimated beliefs as proposals and sampling from them using mcmc as proposed in it is also more accurate than ep due to the nonparametric nature of the messages and offers consistent estimators of the lbp messages version of the method was also outlined and shown to perform almost as well as the original method on mildly models it was also applied successfully in simple image denoising example illustrating that the method can be applied on graphical models with several hundred nodes we believe that our method could be applied successfully to wide range of applications such as smoothing for hidden markov models tracking or computer vision in future work we will look at considering other divergences than the kl and the power ep 
framework we will also look at encapsulating the present algorithm within sequential monte carlo framework and the recent work of naesseth et al true belief est bel epbp est bel pbp est bel ep est bel pbp after ep figure comparison of the beliefs on node and as obtained by evaluating lbp on deterministic mesh using epbp pbp ep and pbp using the results of ep as proposals this is for the tree example with samples on each node and lbp iterations again one can observe visually that epbp outperforms the other methods nlogn implementation quadratic implementation time mean error nlogn implementation quadratic implementation number of samples number of samples per node figure comparison of the mean error for pbp and epbp on grid left for the same number of samples epbp is more accurate it is also faster by about two orders of magnitude right the simulations were run several times for the same observations to illustrate the variability of the results figure from left to right comparison of the original first noisy second and recovered image using the implementation of epbp third and with ep fourth acknowledgments we thank alexander ihler and drew frank for sharing their implementation of particle belief propagation tl gratefully acknowledges funding from epsrc grant and the scatcherd european scholarship scheme ywt research leading to these results has received funding from epsrc grant and erc under the eu programme grant agreement no ad research was supported by the epsrc grant and by grant references alexander ihler and david mcallester particle belief propagation in proc aistats pages martin wainwright and michael jordan graphical models exponential families and variational inference found and tr in mach erik sudderth alexander ihler michael isard william freeman and alan willsky nonparametric belief propagation commun acm jeremy schiff erik sudderth and ken goldberg nonparametric belief propagation for distributed tracking of robot networks with noisy measurements in iros pages alexander ihler john fisher randolph moses and alan willsky nonparametric belief propagation for of sensor networks in ieee sel ar volume pages christopher crick and avi pfeffer loopy belief propagation as basis for communication in sensor networks in proc uai pages jian sun zheng and shum stereo matching using belief propagation in ieee trans patt an mach volume pages andrea klaus mario sormann and konrad karner stereo matching using belief propagation and dissimilarity measure in proc icpr volume pages nima noorshams and martin wainwright belief propagation for continuous state spaces stochastic with quantitative guarantees jmlr judea pearl probabilistic reasoning in intelligent systems morgan kaufman jonathan yedidia william freeman and yair weiss constructing free energy approximations and generalized belief propagation algorithms merl technical report erik sudderth alexander ihler william freeman and alan willsky nonparametric belief propagation in procs ieee comp vis patt volume pages thomas minka expectation propagation for approximate bayesian inference in proc uai pages kevin murphy yair weiss and michael jordan loopy belief propagation for approximate inference an empirical study in proc uai pages mila nikolova thresholding implied by truncated quadratic regularization ieee trans sig mark briers arnaud doucet and sumeetpal singh sequential auxiliary particle belief propagation in proc icif volume pages leonid rudin stanley osher and emad fatemi nonlinear total variation based noise removal algorithms physica 
Briers, Doucet, and Maskell. Smoothing algorithms for state-space models. Ann. Inst. Stat. Math.
Erik Sudderth, Michael Mandel, William Freeman, and Alan Willsky. Visual hand tracking using nonparametric belief propagation. In Procs. IEEE Comp. Vis. Patt.
Pedro Felzenszwalb and Daniel Huttenlocher. Efficient graph-based image segmentation. Int. Journ. Comp. Vision.
Thomas Minka. Power EP. Technical report.
Christian Naesseth, Fredrik Lindsten, and Thomas Schön. Sequential Monte Carlo for graphical models. In Proc. NIPS.
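As a companion to the node update described in the EPBP algorithm above, the following is a highly simplified single-node sketch: the proposal is fitted to the current estimated belief by crude moment matching on a few evaluation points (the EP-style projection), particles are drawn exactly from it, and the outgoing messages keep the PBP mixture form. All names, the moment-matching grid, and the structure are illustrative assumptions, not the authors' code.

```python
# A highly simplified sketch of one EPBP node update with a Gaussian proposal.
# node_pot, edge_pot and the incoming messages are assumed to be vectorized callables.
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def project_to_gaussian(log_target, eval_points):
    """Crude moment matching: fit N(mean, var) to an unnormalised density on a small grid."""
    lt = log_target(eval_points)
    w = np.exp(lt - lt.max())
    w = w / w.sum()
    mean = np.sum(w * eval_points)
    var = max(np.sum(w * (eval_points - mean) ** 2), 1e-6)   # guard against degenerate fits
    return mean, var

def epbp_node_update(node_pot, edge_pot, incoming_msgs, eval_points, n_particles=100, rng=None):
    """incoming_msgs: dict {w: callable m_wu(x_u)} of particle-approximated incoming messages.
    Returns the particles, their base weights, and the outgoing message mixtures."""
    rng = np.random.default_rng() if rng is None else rng

    # EP-style proposal: Gaussian projection of psi_u(x) * prod_w m_wu(x), i.e. the estimated belief
    def log_belief(x):
        out = np.log(node_pot(x) + 1e-300)
        for m in incoming_msgs.values():
            out = out + np.log(m(x) + 1e-300)
        return out
    mean, var = project_to_gaussian(log_belief, eval_points)

    # sample exactly from the tractable proposal q_u and compute PBP-style weights
    particles = rng.normal(mean, np.sqrt(var), size=n_particles)
    q = gaussian_pdf(particles, mean, var)
    base = node_pot(particles) / q                       # psi_u(x^(j)) / q_u(x^(j))

    outgoing = {}
    for v in incoming_msgs:                              # message u -> v excludes m_vu
        w_uv = base.copy()
        for w, m in incoming_msgs.items():
            if w != v:
                w_uv = w_uv * m(particles)
        w_uv = w_uv / w_uv.sum()
        # hat m_uv(x_v) = sum_j w_uv^(j) psi_uv(x_u^(j), x_v), returned as a closure over an array x_v
        outgoing[v] = lambda x_v, p=particles, w=w_uv: np.sum(
            w[:, None] * edge_pot(p[:, None], x_v[None, :]), axis=0)
    return particles, base, outgoing
```

The quadratic cost per update comes from evaluating the N-component incoming messages at the N particles; the accelerated variant discussed in the paper subsamples mixture components to bring this down to roughly O(N log N) while keeping consistency.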
latent bayesian melding for integrating individual and population models mingjun zhong nigel goddard charles sutton school of informatics university of edinburgh united kingdom mzhong csutton abstract in many statistical problems more model may be suitable for behaviour whereas more detailed model is appropriate for accurate modelling of individual behaviour this raises the question of how to integrate both types of models methods such as posterior regularization follow the idea of generalized moment matching in that they allow matching expectations between two models but sometimes both models are most conveniently expressed as latent variable models we propose latent bayesian melding which is motivated by averaging the distributions over populations statistics of both the and the models under logarithmic opinion pool framework in case study on electricity disaggregation which is type of singlechannel blind source separation problem we show that latent bayesian melding leads to significantly more accurate predictions than an approach based solely on generalized moment matching introduction good statistical models of populations are often very different from good models of individuals as an illustration the population distribution over human height might be approximately normal but to model an individual height we might use more detailed discriminative model based on many features of the individual genotype as another example in social network analysis simple models like the preferential attachment model replicate aggregate network statistics such as degree distributions whereas to predict whether two individuals have link social networking web site might well use classifier with many features of each person previous history of course every model of an individual implies model of the population but models whose goal is to model individuals tend to be necessarily more detailed these two styles of modelling represent different types of information so it is natural to want to combine them recent line of research in machine learning has explored the idea of incorporating constraints into bayesian models that are difficult to encode in standard prior distributions these methods which include posterior regularization learning with measurements and the generalized expectation criterion tend to follow moment matching idea in which expectations of the distribution of one model are encouraged to match values based on prior information interestingly these ideas have precursors in the statistical literature on simulation models in particular bayesian melding considers applications in which there is computer simulation that maps from model parameters to quantity for example might summarize the output of deterministic simulation of population dynamics or some other physical phenomenon bayesian melding considers the case in which we can build meaningful prior distributions over both and these two prior distributions need to be merged because of the deterministic relationship this is done using logarithmic opinion pool we show that there is close connection between bayesian melding and the later work on posterior regularization which does not seem to have been recognized in the machine learning literature we also show that bayesian melding has the additional advantage that it can be conveniently applied when both and models contain latent variables as would commonly be the case if they were mixture models or hierarchical bayesian models we call this approach latent bayesian melding we present detailed case 
study of latent bayesian melding in the domain of energy disaggregation which is particular type of blind source separation bss problem the goal of the electricity disaggregation problem is to separate the total electricity usage of building into sum of source signals that describe the energy usage of individual appliances this problem is hard because the source signals are not identifiable which motivates work that adds additional prior information into the model we show that the latent bayesian melding approach allows incorporation of new types of constraints into standard models for this problem yielding strong improvement in performance in some cases amounting to error reduction over moment matching approach the bayesian melding approach we briefly describe the bayesian melding approach to integrating prior information in deterministic simulation models which has seen wide application in the bayesian modelling context denote as the observation data and suppose that the model includes unknown variables which could include model parameters and latent variables we are then interested in the posterior ps however in some situations the variables may be related to new random variable by deterministic simulation function such that we call and input and output variables for example in the energy disaggregation problem the total energy consumption variable pt stt where st are the state variables of hidden markov model encoding and is vector containing the mean energy consumption of each state see section both and are random variables and so in the bayesian context the modellers usually choose appropriate priors pτ and ps based on prior knowledge however given ps the map naturally introduces another prior for which is an induced prior denoted by therefore there are two different priors for the same variable from different sources which might not be consistent in the energy disaggregation example is induced by the state variables st of the hidden markov model which is the individual model of specific household and pτ could be modelled by using population information from national survey we can think of this as population model since it combines information from many households the bayesian melding approach combines the two priors into one by using the logarithmic pooling method so that the logarithmically pooled prior is peτ pτ where the prior peτ melds the prior information of both and in the model the prior ps does not include information about thus it is required to derive melded prior for if is invertible the prior for can be obtained by using the technique if is not invertible poole and raftery heuristically derived melded prior pτ pes cα ps where cα is constant given such that pes ds this gives new posterior pe pe ps note that it is interesting to infer however we use fixed value in this paper so far we have been assuming there are no latent variables in pτ we now consider the situation when is generated by some latent variables the latent bayesian melding approach it is common that the variable is modelled by latent variable see the examples in section so we could assume that we have conditional distribution and prior distribution pξ this defines marginal distribution pτ pξ dξ this could be used to produce the melded prior of the bayesian melding approach pτ pξ dξ pes cα ps the integration in is generally intractable we could employ the monte carlo method to approximate it for fixed however importantly we are also interested in inferring the latent variable which is meaningful for example 
in the energy disaggregation problem when we are interested in finding the maximum posteriori map rvalue of the posterior where pes was used as the prior we propose to use rough approximation pξ pτ dξ maxξ pξ pτ this leads to an approximate prior pτ pξ pes max pes max cα ps to obtain this approximate prior for the joint prior pes has to exist and so we show that it does exist under certain conditions by the following theorem we assume that and are continuous random variables and that both and pτ are positive and share the same support also eps denotes the expectation with respect to ps theorem if eps then constant cα exists such that pes dξds for any fixed the proof can be found in the supplementary materials in we heuristically derived an approximate joint prior pes interestingly if and are independent conditional on we can show as follows that pes is limit distribution derived from joint distribution of and induced by to see this we derive joint prior for and ps pτ dτ pτ dτ ps pξ dτ dτ pτ pτ for deterministic simulation the distribution is due to the borel paradox the distribution depends on the parameterization we assume that is uniform on conditional on and and the distribution is then denoted by pδ the marginal distribution is pδ pδ ps ds denote and gδ pδ then we have the following theorem theorem if pδ pτ and gδ has bounded derivatives in any order then pδ gδ dτ see the supplementary materials for the proof under this parameterization we denote ps pξ pδ gδ dτ ps pξ by applying the logarithmic ing method we have joint prior pτ pξ pes cα ps cα ps since the joint prior blends the variable and the latent variable we call this approximation the latent bayesian melding lbm approach which gives the posterior pe pe ps note that if there are no latent variables then latent bayesian melding collapses to the bayesian melding approach in section we will apply this method to an energy disaggregation problem for integrating population information with an individual model related methods we now discuss possible connections between bayesian melding bm and other related methods recently in machine learning moment matching methods have been proposed posterior regularization pr learning with measurements and the generalized expectation criterion these methods share the common idea that the bayesian models or posterior distributions are constrained by some observations or measurements to obtain distribution the idea is that the system we are modelling is too complex and unobservable and thus we have limited prior information to alleviate this problem we assume we can obtain some observations of the system in some way by experiments for example those observations could be the mean values of the functions of the variables those observations could then guide the modelling of the system interestingly very similar idea has been employed in the bias correction method in information theory and statistics where the distribution is obtained by optimizing the divergence subject to the moment constraints note that the bias correction method in is different to others where the bias of consistent estimator was corrected when the bias function could be estimated we now consider the posteriors derived by pr and bm in general given function and values bi pr solves the constrained problem minimize kl subject to epe mi bi δi where mi could be any function such as power function this gives an optimal posterior qi pep exp mi where is the normalizing constant bm has deterministic simulation where pτ the posterior is then 
pebm they have similar form and the key difference is the last factor which is derived from the constraints or the deterministic simulation pep and pebm pi are identical if λi mi log the difference between bm and lbm is the latent variable we could perform bm by integrating out in but this is computationally expensive instead lbm jointly models and allowing possibly joint inference which is an advantage over bm the energy disaggregation problem in energy disaggregation we are given time series of energy consumption readings from sensor we consider the energy measured in watt hours as read from household electricity meter which is denoted by yt where yt the recorded energy signal is assumed to be the aggregation of the consumption of individual appliances in the household suppose there are appliances and the energy consumption of each appliance is denoted by xi xit where xit the observed aggregate signal is assumed to be the sum of the component pi signals so that yt xit where given the task is to infer the unknown component signals xi this is essentially the bss problem for which there is no unique solution it can also be useful to add an extra component ut to model the unknown appliances to make more robust as the model proposed in the prior pt of ut is defined as exp ut the model then has new pi form yt xit ut natural way to represent this model is as an additive factorial hidden markov model afhmm where the appliances are treated as hmms this is now described the additive factorial hidden markov model in the afhmm each component signal xi is represented by hmm we suppose there are ki states for each xit and so the state variable is denoted by zit ki since xi is pki hmm the initial probabilities are πik ki where πik the mean values are µi µki such that xit µi the transition probabilities are pki pjk where pjk zit and pjk we denote all these parameters πi µi by we assume they are known and can be learned from the training data instead of using we could use binary vector sit sitki to represent the variable such that sitk when zit and for all sitj when then we are interested in inferring the states sit instead of inferring xit directly since xit sit µi therefore we want to make inference over the posterior distribution qi qki where the hmm defines the prior of the states πik qt qi si the inverse noise variance is assumed to be gamma pkj tribution exp and the data likelihood has the gaussian form µi ut exp yt sit to make the map inference over we relax the binary variable sitk to be continuous in the range as in it has been shown that incorporating domain knowledge into afhmm can help to reduce the identifiability problem the domain knowledge we will incorporate using lbm is the summary statistics population modelling of summary statistics in energy disaggregation it is useful to provide summaries of energy consumption to the users for example it would be useful to show the householders the total energy they had consumed in one day for their appliances the duration that each appliance was in use and the number of times that they had used these appliances since there already exists data about typical usage of different appliances we can employ these data to model the distributions of those summary statistics we denote those desired statistics by τi where denotes the appliances for appliance we assume we have measured some time series from different houses for many days this is always possible because we can collect them from public data sets the data reviewed in we can then empirically obtain 
the distributions of those statistics the distribution is represented by pm τim ηim where γim represents the empirical quantities of the statistic of the appliance which can be obtained from data and ηim are the latent variables which might not be known since ηim are variables we can employ prior distribution ηim we now give some examples of those statistics total energy consumption the total energy consumption of an appliance can be represented as function of the states of hmm such that τi pt duration of using the appliance can also be sit µi duration of appliance usage pthe ki represented as function of states τi sitk where represents the sampling duration for data point of the appliances and we assume that represents the off state which means the appliance was turned off number of cycles the number of cycles the number of times an appliance is used can be counted by computing the number of alterations from off state to on pt pki such that τi sitk si let the binary vector ξi ξic ξici represent the number of cycles where ξic pci means that the appliance had been used cycles and ξic note ξi is an example of ηi in this case to model these statistics in our lbm framework the latent variable that we use is the number of cycles the distributions of τi could be empirically modelled by using the observation pci data one approach is to assume gaussian mixture density such that τi ξic pci pc τi where ξic and pc is the gaussian component density using the mixture gaussian we basically assume that for an appliance given the number of cycles the total energy consumption is modelled by gaussian with mean µic and variance simpler model pci would be linear regression model such that τi ξic µic where this model assumes that given the number of cycles the total energy consumption is close to the mean µic the mixture model is more appropriate than the regression model but the inference is more difficult pci when τi represents the number of cycles for appliance we can use τi cic ξic where cic represents the number of cycles when the state variables si are relaxed to we can then pci employ noise model such that τi where we model ξi with qci icξicic discrete distribution such that ξi pic where pic represents the prior probability of the number of cycles for the appliance which can be obtained from the training data we now show that how to use the lbm to integrate the afhmm with these population distributions the latent bayesian melding approach to energy disaggregation we have shown that the summary statistics can be represented as deterministic function of the state variable of hmms such that which means that the itself can be represented as latent variable model we could then straightforwardly employ the lbm to produce joint prior over and such that pes cα ps pτ since in our model is not pτ invertible we need to generate proper density for pτ one possible way is to generate random samples from the prior ps which is hmm and then pτ can be modelled by using kernel density estimation however this will make the inference difficult instead we employ gaussian density τim where and are computed from the new posterior distribution of lbm thus has the form ps pτ cα ps where represents the collection of all the noise variances all the inverse noise variances employ the gamma distribution as the prior we are interested in inferring the map values since the variables and are binary we have to solve combinatorial optimization problem which is intractable so we solve relaxed problem as in since log ps is not convex we 
employ the relaxit ation method of so new ki variable matrix it hit jk is introduced such that hjk it when si and sitj and otherwise hjk under these constraints we then obtain pi log ps log log πi hit jk log pjk this is now linear we optimize the which is denoted by the np constraints np for those variables are repreo ki ci sented as sets qs itk itk ic ic np ki it it it qh and qu hl si sit hjk σim denote qs qξ qh qu the relaxed optimization problem is then maximize subject to we oberved that every term in is either quadratic or linear when are fixed and the solutions for are deterministic when the other variables are fixed the constraints are all linear therefore we optimize while fixing all the other variables and then optimize all the other variables simultaneously while fixing this optimization problem is then convex quadratic program cqp for which we use mosek we denote this method by experimental results we have incorporated population information into the afhmm by employing the latent bayesian melding approach in this section we apply the proposed model to the disaggregation problem we will compare the new approach with the using the set of statistics described in section the key difference between our method and is that models the statistics conditional on the number of cycles the hes data we apply afhmm and to the household electricity survey hes this data set was gathered in recent study commissioned by the uk department of food and rural affairs the study monitored households selected to be representative of the population across england from may to july individual appliances were monitored and in some households the overall electricity consumption was also monitored the data were monitored the hes dataset and information on how the raw data was cleaned can be found from https table normalized disaggregation error nde signal aggregate error sae duration aggregate error dae and cycle aggregate error cae by and on synthetic mains in hes data ethods afhmm nde sae dae cae ime table normalized disaggregation error nde signal aggregate error sae duration aggregate error dae and cycle aggregate error cae by and on mains in hes data ethods afhmm nde sae dae cae ime every or minutes for different houses we used only the data we then used the individual appliances to train the model parameters of the afhmm which will be used as the input to the models for disaggregation note that we assumed the hmms have states for all the appliances this number of states is widely applied in energy disaggregation problems though our method could easily be applied to larger state spaces in the hes data in some houses the overall electricity consumption the mains was monitored however in most houses only subset of individual appliances were monitored and the total electricity readings were not recorded generating the population information most of the houses in hes did not monitor the mains readings they all recorded the individual appliances consumption we used subset of the houses to generate the population information of the individual appliances we used the population information of total energy consumption duration of appliance usage and the number of cycles in time period in our experiments the time period was one day we modelled the distributions of these summary statistics by using the methods described in the section where the distributions were gaussian all the required quantities for modelling these distributions were generated by using the samples of the individual appliances houses without 
mains readings in this experiment we randomly selected one hundred households and one day usage was used as test data for each household since no mains readings were monitored in these houses we added up the appliance readings to generate synthetic mains readings we then applied the afhmm and to these synthetic mains to predict the individual appliance usage to compare these three methods we employed four error measures denote as the inferred signal for the appliance usage xi one measure is the normalp ized disaggregation error nde itp it it this measures how well the method predicts the it it energy consumption at every time point however the householders might be more interested in the summaries of the appliance usage for example in particular time period one day people are interested in the total energy consumption of the appliances the total time they have been using pi those appliances and how many times they have used them we thus employ as the signal aggregate error sae the duration aggregate error dae or the cycle aggregate error cae where ri represents the total energy consumption the duration or the number of cycles respectively and represents the predicted summary statistics all the methods were applied to the synthetic data table shows the overall error computed by these methods we see that both the methods using prior information improved over the base line method afhmm the and performed similarly in terms of nde and sae but improved over in terms of dae and cae houses with mains readings we also applied those methods to houses which have mains readings we used days data for each house and the recorded mains readings were used as the input to the models all the methods were used to predict the appliance consumption table shows the table normalized disaggregation error nde signal aggregate error sae duration aggregate error dae and cycle aggregate error cae by and on data ethods afhmm nde sae dae cae ime error of each house and also the overall errors this experiment is more realistic than the synthetic mains readings since the real mains readings were used as the input we see that both the methods incorporating prior information have improved over the afhmm in terms of nde sae and dae the and have the similar results for sae is improved over for nde dae and cae data in the previous section we have trained the model using the hes data and applied the models to different houses of the same data set more realistic situation is to train the model in one data set and apply the model to different data set because it is unrealistic to expect to obtain appliancelevel data from every household on which the system will be deployed in this section we use the hes data to train the model parameters of the afhmm and model the distribution of the summary statistics we then apply the models to the dataset which was also gathered from uk households to make the predictions there are five houses in and all of them have mains readings and as well as the individual appliance readings all the mains meters were sampled every seconds and some of them also sampled at higher rate details of the data and how to access it can be found in we employ three of the houses for analysis in our experiments houses in the data the other two houses were excluded because the correlation between the sum of submeters and mains is very low which suggests that there might be recording errors in the meters we selected appliances for disaggregation based on those that typically use the most energy since the sample rate of 
the submeters in the hes data is minutes we downsampled the signal from seconds to minutes for the data for each house we randomly selected month for analysis all the four methods were applied to the mains readings for comparison purposes we computed the nde sae dae and cae errors of all three methods averaged over days table shows the results the results are consistent with the results of the hes data both the and improve over the basic afhmm except that did not improve the cae as for hes testing data and have similar results on nde and sae and again improved over in dae and cae these results are consistent in suggesting that incorporating population information into the model can help to reduce the identifiability problem in bss problems conclusions we have proposed latent bayesian melding approach for incorporating population information with latent variables into individual models and have applied the approach to energy disaggregation problems the new approach has been evaluated by applying it to two electricity data sets the latent bayesian melding approach has been compared to the posterior regularization approach case of the bayesian melding approach and afhmm both the lbm and pr have significantly lower error than the base line method lbm improves over pr in predicting the duration and the number of cycles both methods were similar in nde and the sae errors acknowledgments this work is supported by the engineering and physical sciences research council uk grant numbers and references leontine alkema adrian raftery and samuel clark probabilistic projections of hiv prevalence using bayesian melding the annals of applied statistics pages mosek aps the mosek optimization toolbox for python manual version revision barabasi and reka albert emergence of scaling in random networks science batra et al nilmtk an open source toolkit for load monitoring in proceedings of the international conference on future energy systems pages new york ny usa robert bordley multiplicative formula for aggregating probability assessments management science grace chiu and joshua gould statistical inference for food webs with emphasis on ecological networks via bayesian melding environmetrics luiz max de carvalhoa daniel am villelaa flavio coelhoc and leonardo bastosa on the choice of the weights for the logarithmic pooling of probability distributions september elhamifar and sastry energy disaggregation via learning powerlets and sparse coding in proceedings of the conference on artificial intelligence aaai pages ganchev gillenwater and taskar posterior regularization for structured latent variable models journal of machine learning research giffin and caticha updating probabilities with data and moments the int workshop on bayesian inference and maximum entropy methods in science and engineering ny july hart nonintrusive appliance load monitoring proceedings of the ieee edwin jaynes information theory and statistical mechanics physical review jack kelly and william knottenbelt the dataset domestic electricity demand and demand from five uk homes kim marwah arlitt lyon and han unsupervised disaggregation of low frequency power measurements in proceedings of the siam conference on data mining pages kolter and jaakkola approximate inference in additive factorial hmms with application to energy disaggregation in proceedings of aistats volume pages liang jordan and klein learning from measurements in exponential families in the annual international conference on machine learning pages james mackinnon and anthony 
smith approximate bias correction in econometrics journal of econometrics mann and mccallum generalized expectation criteria for learning of conditional random fields in proceedings of acl pages columbus ohio june keith myerscough jason frank and benedict leimkuhler correction of extended dynamical systems using observational data arxiv preprint parson ghosh weal and rogers load monitoring using prior models of general appliance types in proceedings of aaai pages july david poole and adrian raftery inference for deterministic simulation models the bayesian melding approach journal of the american statistical association pages mj rufo cj et al pool to combine prior distributions suggestion for approach bayesian analysis raftery and waddell uncertain benefits application of bayesian melding to the alaskan way viaduct in seattle transportation research part policy and practice robert wolpert comment on inference from deterministic population dynamics model for bowhead whales journal of the american statistical association wytock and zico kolter contextually supervised source separation with application to energy disaggregation in proceedings of aaai pages zhong goddard and sutton signal aggregate constraints in additive factorial hmms with application to energy disaggregation in nips pages zimmermann evans griggs king harding roberts and evans household electricity survey 
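To make the error measures used in the disaggregation experiments above concrete, here is a minimal NumPy sketch of the normalized disaggregation error and of the aggregate errors (SAE, DAE, CAE). The exact normalizations are not recoverable from the text, so the denominators and the array shapes below are assumptions made purely for illustration.

```python
import numpy as np

def nde(x_true, x_pred):
    """Normalized disaggregation error, assumed form:
    sum over appliances i and times t of (x_it - xhat_it)^2,
    divided by the sum of x_it^2."""
    x_true = np.asarray(x_true, dtype=float)
    x_pred = np.asarray(x_pred, dtype=float)
    return np.sum((x_true - x_pred) ** 2) / np.sum(x_true ** 2)

def aggregate_error(r_true, r_pred):
    """Relative error on per-appliance summary statistics r_i: total energy (SAE),
    total on-duration (DAE), or number of usage cycles (CAE); assumed form:
    sum_i |r_i - rhat_i| / sum_i r_i."""
    r_true = np.asarray(r_true, dtype=float)
    r_pred = np.asarray(r_pred, dtype=float)
    return np.sum(np.abs(r_true - r_pred)) / np.sum(r_true)
```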
distributionally robust logistic regression soroosh peyman mohajerin esfahani daniel kuhn polytechnique de lausanne lausanne switzerland abstract this paper proposes distributionally robust approach to logistic regression we use the wasserstein distance to construct ball in the space of probability distributions centered at the uniform distribution on the training samples if the radius of this ball is chosen judiciously we can guarantee that it contains the unknown datagenerating distribution with high confidence we then formulate distributionally robust logistic regression model that minimizes expected logloss function where the worst case is taken over all distributions in the wasserstein ball we prove that this optimization problem admits tractable reformulation and encapsulates the classical as well as the popular regularized logistic regression problems as special cases we further propose distributionally robust approach based on wasserstein balls to compute upper and lower confidence bounds on the misclassification probability of the resulting classifier these bounds are given by the optimal values of two highly tractable linear programs we validate our theoretical guarantees through simulated and empirical experiments introduction logistic regression is one of the most frequently used classification methods its objective is to establish probabilistic relationship between continuous feature vector and binary explanatory variable however in spite of its overwhelming success in machine learning data analytics and medicine logistic regression models can display poor performance if training data is sparse in this case modelers often resort to ad hoc regularization techniques in order to combat overfitting effects this paper aims to develop new regularization techniques for logistic to provide intuitive probabilistic interpretations for existing using tools from modern distributionally robust optimization logistic regression let rn denote feature vector and the associated binary label to be predicted in logistic regression the conditional distribution of given is modeled as prob exp xi where the weight vector rn constitutes an unknown regression parameter suppose that training samples have been observed then the maximum likelihood estimator of classical logistic regression is found by solving the geometric program min lβ whose objective function is given by the sample average of the logloss function lβ log exp xi it has been observed however that the resulting maximum likelihood estimator may display poor performance indeed it is well documented that minimizing the average logloss function leads to overfitting and weak classification performance in order to overcome this deficiency it has been proposed to modify the objective function of problem an alternative approach is to add regularization term to the logloss function in order to mitigate overfitting these regularization techniques lead to modified optimization problem min lβ εr where and denote the regularization function and the associated coefficient respectively popular choice for the regularization term is kβk where denotes generic norm such as the or the the use of tends to induce sparsity in which in turn helps to combat overfitting effects moreover logistic regression serves as an effective means for feature selection it is further shown in that outperforms when the number of training samples is smaller than the number of features on the downside leads to optimization problems which are more challenging algorithms for large 
scale regularized logistic regression are discussed in distributionally robust optimization regression and classification problems are typically modeled as optimization problems under uncertainty to date optimization under uncertainty has been addressed by several complementary modeling paradigms that differ mainly in the representation of uncertainty for instance stochastic programming assumes that the uncertainty is governed by known probability distribution and aims to minimize probability functional such as the expected cost or quantile of the cost distribution in contrast robust optimization ignores all distributional information and aims to minimize the cost under all possible uncertainty realizations while stochastic programs may rely on distributional information that is not available or hard to acquire in practice robust optimization models may adopt an overly pessimistic view of the uncertainty and thereby promote decisions the emerging field of distributionally robust optimization aims to bridge the gap between the conservatism of robust optimization and the specificity of stochastic programming it seeks to minimize probability functional the expectation where the worst case is taken with respect to an ambiguity set that is family of distributions consistent with the given prior information on the uncertainty the vast majority of the existing literature focuses on ambiguity sets characterized through moment and support information see however ambiguity sets can also be constructed via distance measures in the space of probability distributions such as the prohorov metric or the divergence due to its attractive measure concentration properties we use here the wasserstein metric to construct ambiguity sets contribution in this paper we propose distributionally robust perspective on logistic regression our research is motivated by the observation that regularization techniques can improve the performance of many classifiers in the context of support vector machines and lasso there have been several recent attempts to give ad hoc regularization techniques robustness interpretation however to the best of our knowledge no such connection has been established for logistic regression in this paper we aim to close this gap by adopting new distributionally robust optimization paradigm based on wasserstein ambiguity sets starting from distributionally robust statistical learning setup we will derive family of regularized logistic regression models that admit an intuitive probabilistic interpretation and encapsulate the classical regularized logistic regression as special case moreover by invoking recent measure concentration results our proposed approach provides probabilistic guarantee for the emerging regularized classifiers which seems to be the first result of this type all proofs are relegated to the technical appendix we summarize our main contributions as follows distributionally robust logistic regression model and tractable reformulation we propose distributionally robust logistic regression model based on an ambiguity set induced by the wasserstein distance we prove that the resulting optimization problem admits an equivalent reformulation as tractable convex program risk estimation using similar distributionally robust optimization techniques based on the wasserstein ambiguity set we develop two highly tractable linear programs whose optimal values provide confidence bounds on the misclassification probability or risk of the emerging classifiers performance guarantees adopting 
distributionally robust framework allows us to invoke results from the measure concentration literature to derive probabilistic guarantees specifically we establish performance guarantees for the classifiers obtained from the proposed distributionally robust optimization model probabilistic interpretation of existing regularization techniques we show that the standard regularized logistic regression is special case of our framework in particular we show that the regularization coefficient in can be interpreted as the size of the ambiguity set underlying our distributionally robust optimization model distributionally robust perspective on statistical learning in the standard statistical learning setting all training and test samples are drawn independently from some distribution supported on rn if the distribution was known the best weight parameter could be found by solving the stochastic optimization problem inf lβ lβ rn in practice however is only indirectly observable through independent training samples thus the distribution is itself uncertain which motivates us to address problem from distributionally robust perspective this means that we use the training samples to construct an ambiguity set that is family of distributions that contains the unknown distribution with high confidence then we solve the distributionally robust optimization problem inf sup eq lβ which minimizes the expected logloss function the construction of the ambiguity set should be guided by the following principles tractability it must be possible to solve the distributionally robust optimization problem efficiently ii reliability the optimizer of should be in thus facilitating attractive guarantees iii asymptotic consistency for large training data sets the solution of should converge to the one of in this paper we propose to use the wasserstein metric to construct as ball in the space of probability distributions that satisfies iii definition wasserstein distance let denote the set of probability distributions on the wasserstein distance between two distributions and supported on is defined as inf dξ dξ dξ dξ dξ dξ where and is metric on the wasserstein distance represents the minimum cost of moving the distribution to the distribution where the cost of moving unit mass from to amounts to in the remainder we denote by bε the ball of radius centered at with respect to the wasserstein distance in this paper we propose to use wasserstein balls as ambiguity sets given the training data points natural candidate for the center of the pn stein ball is the empirical distribution where denotes the dirac point measure at thus we henceforth examine the distributionally robust optimization problem inf eq lβ sup equipped with wasserstein ambiguity set note that reduces to the average logloss minimization problem associated with classical logistic regression if we set tractable reformulation and probabilistic guarantees in this section we demonstrate that can be reformulated as tractable convex program and establish probabilistic guarantees for its optimal solutions tractable reformulation we first define metric on the space which will be used in the remainder definition metric on the space the distance between two data points is defined as kx where is any norm on rn and is positive weight the parameter in definition represents the relative emphasis between feature mismatch and label uncertainty the following theorem presents tractable reformulation of the distributionally robust optimization problem and thus constitutes the 
first main result of this paper theorem tractable reformulation the optimization problem is equivalent to min λε si lβ si jˆ inf sup lβ lβ λκ si note that constitutes tractable convex program for most commonly used norms remark regularized logistic regression as the parameter characterizing the metric tends to infinity the second constraint group in the convex program becomes redundant hence reduces to the celebrated regularized logistic regression problem inf lβ where the regularization function is determined by the dual norm on the feature space while the regularization coefficient coincides with the radius of the wasserstein ball note that for the wasserstein distance between two distributions is infinite if they assign different labels to fixed feature vector with positive probability any distribution in bε must then have nonoverlapping conditional supports for and thus setting reflects the belief that the label is deterministic function of the feature and that label measurements are exact as this belief is not tenable in most applications an approach with may be more satisfying performance guarantees we now exploit recent measure concentration result characterizing the speed at which converges to with respect to the wasserstein distance in order to derive performance guarantees for distributionally robust logistic regression in the following we let be set of independent training samples from and we denote by and the optimal solutions and jˆ the corresponding optimal value of note that these values are random objects as they depend on the random training data theorem performance assume that the distribution is there is with ep exp if the radius of the wasserstein ball is set to log log εn log log then we have pn bε implying that pn ep for all sample sizes and confidence levels moreover the positive constants and appearing in depend only on the parameters and the dimension of the feature space and the metric on the space remark loss denoting the empirical logloss function on the training set by the loss jˆ can be expressed as jˆ max note that the last term in can be viewed as complementary regularization term that does not appear in standard regularized logistic regression this term accounts for label uncertainty and decreases with thus can be interpreted as our trust in the labels of the training samples note that this regularization term vanishes for one can further prove that converges to for implying that reduces to the standard regularized logistic regression in this limit remark performance guarantees the following comments are in order assumption the assumption of theorem is restrictive but seems to be unavoidable for any priori guarantees of the type described in theorem note that this assumption is automatically satisfied if the features have bounded support or if they are known to follow for instance gaussian or exponential distribution ii asymptotic consistency for any fixed confidence level the radius εn defined in drops to zero as the sample size increases and thus the ambiguity set shrinks to singleton to be more precise with probability across all training datasets sequence of distributions in the ambiguity set converges in the wasserstein metric and thus weakly to the unknown data generating distribution see corollary for formal proof consequently the solution of can be shown to converge to the solution of as increases iii finite sample behavior the priori bound on the size of the wasserstein ball has two growth regimes for small the radius decreases as and for large it 
scales with where is the dimension of the feature space we refer to section for further details on the optimality of these rates and potential improvements for special cases note that when the support of the underlying distribution is bounded or has gaussian distribution the parameter can be effectively set to risk estimation and one of the main objectives in logistic regression is to control the classification performance specifically we are interested in predicting labels from features this be achieved via classifier function fβ rn whose risk fβ represents the misclassification probability in logistic regression natural choice for the classifier is fβ if prob otherwise the conditional probability prob is defined in the risk associated with this classifier can be expressed as ep yhβ as in section we can use and expectations over wasserstein balls to construct confidence bounds on the risk theorem risk estimation for any depending on the training dataset we have the risk rmax eq is given by λε si smin ri ti ri yˆi si rmax λκ ri ti si if the wasserstein radius is set to εn as defined in then rmax with probability across all training sets xi yi ii similarly the risk rmin inf eq xi is given by si min λε si ri ti ri yˆi si rmin λκ ri ti ri ti si average ccr average ccr average ccr training samples training samples training samples figure performance solid blue line and the average ccr dashed red line if the wasserstein radius is set to εn as defined in then rmin with probability across all training sets xi yi we emphasize that and constitute highly tractable linear programs moreover we have rmin rmax with probability numerical results we now showcase the power of distributionally robust logistic regression in simulated and empirical experiments all optimization problems are implemented in matlab via the modeling language yalmip and solved with the nonlinear programming solver ipopt all experiments were run on an intel xeon cpu for the largest instance studied the problems and were solved in and seconds respectively experiment performance we use simulation experiment to study the performance guarantees offered by distributionally robust logistic regression as in we assume that the features follow multivariate standard normal distribution and that the conditional distribution of the labels is of the form with the true distribution is uniquely determined by this information if we use the to measure distances in the feature space then satisfies the assumption of theorem for finally we set our experiment comprises simulation runs in each run we generate training samples and test samples from we calibrate the distributionally robust logistic regression model to the training data and use the test data to evaluate the average logloss as well as the correct classification rate ccr of the classifier associated with we then record the percentage moreover we calculate the of simulation runs in which the average logloss exceeds erage ccr across all simulation runs figure displays both and the average ccr as function of for different values of note that quantifies the probability with respect to the training data that belongs to the wasserstein ball of radius around the empirical distribution thus increases with the average ccr benefits from the regularization induced by the distributional robustness and increases with as long as the empirical confidence is smaller than as soon as the wasserstein ball is large enough to contain the distribution with high confidence however any further increase of is 
detrimental to the average ccr figure also indicates that the radius implied by fixed empirical confidence level scales inversely with the number of training samples specifically for the wasserstein radius implied by the confidence level is given by respectively this observation is consistent with the priori estimate of the wasserstein radius εn associated with given indeed as theorem implies that εn scales with for experiment the effect of the wasserstein ball in the second simulation experiment we study the statistical properties of the logloss as in we set and assume that the features follow multivariate standard normal distribution while the conditional distribution of the labels is of the form with sampled uniformly from the unit sphere we use the in the feature space and we set all results reported here are averaged over simulation runs in each trial we use training samples to calibrate problem and test samples to estimate the logloss distribution of the resulting classifier figure visualizes the conditional cvar of the logloss distribution for various confidence levels and for different values of the cvar of the logloss at level is defined as the conditional expectation of the logloss above its see in other words the cvar at level quantifies the average of the worst logloss realizations as expected using distributionally robust approach renders the logistic regression problem more which results in uniformly lower cvar values of the logloss particularly for smaller confidence levels thus increasing the radius of the wasserstein ball reduces the right tail of the logloss distribution figure confirms this observation by showing that the cumulative distribution function cdf of the logloss converges to step function for large moreover one can prove that the weight vector tends to zero as grows specifically for we have in which case the logloss approximates the deterministic value log zooming into the cvar graph of figure at the end of the high confidence levels we observe that the which coincides in fact with the expected logloss increases at every quantile level see figure experiment real world case studies and risk estimation next we validate the performance of the proposed distributionally robust logistic regression method on the mnist dataset and three popular datasets from the uci repository ionosphere thoracic surgery and breast cancer in this experiment we use the distance function of definition with the we examine three different models logistic regression lr regularized logistic regression rlr and distributionally robust logistic regression with drlr all results reported here are averaged over independent trials in each trial related to uci dataset we randomly select of data to train the models and the rest to test the performance similarly in each trial related to the mnist dataset we randomly select samples from the training dataset and test the performance on the complete test dataset the results in table top indicate that drlr outperforms rlr in terms of ccr by about the same amount by which rlr outperforms classical lr consistently across all experiments we also evaluated the cvar of logloss which is natural performance indicator for robust methods table bottom shows that drlr wins by large margin outperforming rlr by in the remainder we focus on the ionosphere case study the results of which are representative for the other case studies figures and depict the logloss and the ccr for different wasserstein radii drlr outperforms rlr consistently for all sufficiently small 
values of this observation can be explained by the fact that drlr accounts for uncertainty in the label whereas rlr does not thus there is wider range of wasserstein radii that result in cdf cvar cvar quantile percentage quantile percentage logloss cvar versus quantile of the cvar versus quantile of the cumulative distribution of the logloss function logloss function zoomed logloss function figure cvar and cdf of the logloss function for different wasserstein radii table the average and standard deviation of ccr and cvar evaluated on the test dataset lr rlr drlr ionosphere thoracic surgery breast cancer ccr mnist vs mnist vs mnist vs ionosphere thoracic surgery breast cancer cvar mnist vs mnist vs mnist vs rlr drlr rlr drlr true risk upper bound lower bound confidence risk average ccr average logloss confidence the average logloss for the average correct risk estimation and its confient tion rate for different dence level figure average logloss ccr and risk for different wasserstein radii ionosphere dataset an attractive logloss and ccr this effect facilitates the choice of and could be significant advantage in situations where it is difficult to determine priori in the experiment underlying figure we first fix to the optimal solution of for and figure shows the true risk and its confidence bounds as expected for the upper and lower bounds coincide with the empirical risk on the training data which is lower bound for the true risk on the test data due to effects as increases the confidence interval between the bounds widens and eventually covers the true risk for instance at the confidence interval is given by and contains the true risk with probability acknowledgments this research was supported by the swiss national science foundation under grant references hosmer and lemeshow applied logistic regression john wiley sons feng xu mannor and yan robust logistic regression and classification in advances in neural information processing systems pages plan and vershynin robust compressed sensing and sparse logistic regression convex programming approach ieee transactions on information theory ding vishwanathan warmuth and denchev regression for binary and multiclass classification the journal of machine learning research liu robit regression simple robust alternative to logistic and probit regression pages john wiley sons rousseeuw and christmann robustness against separation and outliers in logistic regression computational statistics data analysis tibshirani regression shrinkage and selection via the lasso journal of the royal statistical society series ng feature selection regularization and rotational invariance in proceedings of the international conference on machine learning pages koh kim and boyd an method for logistic regression the journal of machine learning research and tewari stochastic methods for loss minimization the journal of machine learning research shi yin osher and sajda fast hybrid algorithm for logistic regression the journal of machine learning research yun and toh coordinate gradient descent method for convex minimization computational optimization and applications shapiro dentcheva and ruszczyski lectures on stochastic programming siam birge and louveaux introduction to stochastic programming springer and nemirovski robust and applications mathematical programming bertsimas and sim the price of robustness operations research el ghaoui and nemirovski robust optimization princeton university press delage and ye distributionally robust optimization under moment 
uncertainty with application to problems operations research goh and sim distributionally robust optimization and its tractable approximations operations research wiesemann kuhn and sim distributionally robust convex optimization operations research and iyengar ambiguous chance constrained problems and robust optimization mathematical programming hu and hong divergence constrained distributionally robust optimization technical report available from optimization online xu caramanis and mannor robustness and regularization of support vector machines the journal of machine learning research xu caramanis and mannor robust regression and lasso ieee transactions on information theory mohajerin esfahani and kuhn distributionally robust optimization using the wasserstein metric performance guarantees and tractable reformulations http fournier and guillin on the rate of convergence in wasserstein distance of the empirical measure probability theory and related fields pages yalmip toolbox for modeling and optimization in matlab in ieee international symposium on computer aided control systems design pages and biegler on the implementation of an filter algorithm for nonlinear programming mathematical programming rockafellar and uryasev optimization of conditional journal of risk lecun the mnist database of handwritten digits http bache and lichman uci machine learning repository http 
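To make the tractable reformulation of the distributionally robust logistic regression problem concrete, here is a small sketch using cvxpy; cvxpy is a choice made for this illustration only, as the authors report solving their problems with YALMIP and IPOPT. The constraint structure is reconstructed from the theorem statement above: a log-loss term for the observed label, a label-flipped log-loss term discounted by lambda*kappa, and a dual-norm bound on beta. Measuring feature distances with the infinity-norm, whose dual is the 1-norm, is an assumption for this sketch.

```python
import cvxpy as cp
import numpy as np

def drlr_fit(X, y, eps, kappa):
    """Sketch of the convex reformulation:
        min   lam * eps + (1/N) * sum_i s_i
        s.t.  l(beta; x_i, +y_i) <= s_i
              l(beta; x_i, -y_i) - lam * kappa <= s_i
              ||beta||_* <= lam
    where l(beta; x, y) = log(1 + exp(-y * <beta, x>)).
    X: (N, n) feature matrix, y: (N,) labels in {-1, +1}."""
    N, n = X.shape
    beta = cp.Variable(n)
    lam = cp.Variable(nonneg=True)
    s = cp.Variable(N)
    margins = cp.multiply(y, X @ beta)               # y_i * <beta, x_i>
    constraints = [
        cp.logistic(-margins) <= s,                  # log-loss on the observed labels
        cp.logistic(margins) - lam * kappa <= s,     # log-loss under flipped labels
        cp.norm(beta, 1) <= lam,                     # dual of the infinity-norm (assumption)
    ]
    problem = cp.Problem(cp.Minimize(lam * eps + cp.sum(s) / N), constraints)
    problem.solve()
    return beta.value
```

As kappa grows, the flipped-label constraints become inactive and the problem reduces to the average log-loss plus eps times the dual norm of beta, i.e. the regularized logistic regression discussed in the remark above.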
Variational Dropout and the Local Reparameterization Trick

Diederik P. Kingma, Tim Salimans, and Max Welling. Machine Learning Group, University of Amsterdam; Algoritmica; University of California, Irvine; and the Canadian Institute for Advanced Research (CIFAR).

Abstract

We investigate a local reparameterization technique for greatly reducing the variance of stochastic gradients for variational Bayesian inference (SGVB) of a posterior over model parameters, while retaining parallelizability. This local reparameterization translates uncertainty about global parameters into local noise that is independent across datapoints in the minibatch. Such parameterizations can be trivially parallelized and have variance that is inversely proportional to the minibatch size, generally leading to much faster convergence. Additionally, we explore a connection with dropout: Gaussian dropout objectives correspond to SGVB with local reparameterization, a scale-invariant prior and a proportionally fixed posterior variance. Our method allows inference of more flexibly parameterized posteriors; specifically, we propose variational dropout, a generalization of Gaussian dropout where the dropout rates are learned, often leading to better models. The method is demonstrated through several experiments.

Introduction

Deep neural networks are a flexible family of models that easily scale to millions of parameters and datapoints, but are still tractable to optimize using stochastic gradient ascent. Due to their high flexibility, neural networks have the capacity to fit a wide diversity of nonlinear patterns in the data. This flexibility often leads to overfitting when left unchecked: spurious patterns are found that happen to fit well to the training data but are not predictive for new data. Various regularization techniques for controlling this overfitting are used in practice, a currently popular and empirically effective technique being dropout. Earlier work showed that regular (binary) dropout has a Gaussian approximation, called Gaussian dropout, with virtually identical regularization performance but much faster convergence, and that Gaussian dropout optimizes a lower bound on the marginal likelihood of the data. In this paper we show that this relationship between dropout and Bayesian inference can be extended and exploited to greatly improve the efficiency of variational Bayesian inference on the model parameters. This work has a direct interpretation as a generalization of Gaussian dropout, with the same fast convergence, but now with the freedom to specify more flexibly parameterized posterior distributions.

Bayesian posterior inference over the neural network parameters is a theoretically attractive method for controlling overfitting; exact inference is computationally intractable, but efficient approximate schemes can be designed. Markov chain Monte Carlo (MCMC) is a class of approximate inference methods with asymptotic guarantees, pioneered for the application of regularizing neural networks, with later useful refinements. An alternative to MCMC is variational inference, or the equivalent minimum description length (MDL) framework. Modern variants of stochastic variational inference have been applied to neural networks with some success, but have been limited by high variance in the gradients. Despite their theoretical attractiveness, Bayesian methods for inferring a posterior distribution over neural network weights have not yet been shown to outperform simpler methods such as dropout. Even a new crop of efficient variational inference algorithms based on stochastic gradients with minibatches of data have not yet been
shown to significantly improve upon such simpler regularization. Below we explore an as yet unexploited trick for improving the efficiency of stochastic gradient-based variational inference with minibatches of data, by translating uncertainty about global parameters into local noise that is independent across datapoints in the minibatch. The resulting method has an optimization speed on the same level as fast dropout, and indeed has the original Gaussian dropout method as a special case. An advantage of our method is that it allows for a full Bayesian analysis of the model, and that it is significantly more flexible than standard dropout. The approach presented here is closely related to several popular methods in the literature that regularize by adding random noise; these relationships are discussed in the related-work section below.

Efficient and practical Bayesian inference

We consider Bayesian analysis of a dataset $D$, containing a set of $N$ observed tuples $(\mathbf{x}, y)$, where the goal is to learn a model with parameters or weights $\mathbf{w}$ of the conditional probability $p(y \mid \mathbf{x}, \mathbf{w})$ (standard classification or regression). Bayesian inference in such a model consists of updating some initial belief over the parameters $\mathbf{w}$, in the form of a prior distribution $p(\mathbf{w})$, after observing data $D$, into an updated belief over these parameters in the form of (an approximation to) the posterior distribution $p(\mathbf{w} \mid D)$. Computing the true posterior distribution through Bayes' rule involves computationally intractable integrals, so good approximations are necessary. In variational inference, inference is cast as an optimization problem where we optimize the parameters $\phi$ of some parameterized model $q_\phi(\mathbf{w})$ such that $q_\phi(\mathbf{w})$ is a close approximation to $p(\mathbf{w} \mid D)$, as measured by the divergence $D_{KL}(q_\phi(\mathbf{w}) \,\|\, p(\mathbf{w} \mid D))$. This divergence of our posterior $q_\phi(\mathbf{w})$ to the true posterior is minimized in practice by maximizing the variational lower bound $\mathcal{L}(\phi)$ of the marginal likelihood of the data:

$\mathcal{L}(\phi) = -D_{KL}\!\big(q_\phi(\mathbf{w}) \,\|\, p(\mathbf{w})\big) + L_D(\phi)$, where $L_D(\phi) = \sum_{(\mathbf{x}, y) \in D} \mathbb{E}_{q_\phi(\mathbf{w})}\big[\log p(y \mid \mathbf{x}, \mathbf{w})\big]$.

We'll call $L_D(\phi)$ the expected log-likelihood. The bound $\mathcal{L}(\phi)$ plus $D_{KL}(q_\phi(\mathbf{w}) \,\|\, p(\mathbf{w} \mid D))$ equals the (conditional) marginal log-likelihood $\sum_{(\mathbf{x}, y) \in D} \log p(y \mid \mathbf{x})$. Since this marginal log-likelihood is constant w.r.t. $\phi$, maximizing the bound will minimize $D_{KL}(q_\phi(\mathbf{w}) \,\|\, p(\mathbf{w} \mid D))$.

Stochastic gradient variational Bayes (SGVB)

Various algorithms for gradient-based optimization of the variational bound with differentiable $q$ and $p$ exist; see the related-work section for an overview. A recently proposed efficient method for optimization with differentiable models is the stochastic gradient variational Bayes (SGVB) method. The basic trick in SGVB is to parameterize the random parameters $\mathbf{w} \sim q_\phi(\mathbf{w})$ as $\mathbf{w} = f(\boldsymbol{\epsilon}, \phi)$, where $f(\cdot)$ is a differentiable function and $\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})$ is a random noise variable. In this new parameterization, an unbiased differentiable Monte Carlo estimator of the expected log-likelihood can be formed:

$L_D(\phi) \simeq L_D^{\mathrm{SGVB}}(\phi) = \frac{N}{M} \sum_{i=1}^{M} \log p\big(y^i \mid \mathbf{x}^i, \mathbf{w} = f(\boldsymbol{\epsilon}, \phi)\big)$,

where $(\mathbf{x}^i, y^i)_{i=1}^{M}$ is a minibatch of data with $M$ random datapoints, and $\boldsymbol{\epsilon}$ is a noise vector drawn from the noise distribution $p(\boldsymbol{\epsilon})$. We'll assume that the remaining term in the variational lower bound, $-D_{KL}(q_\phi(\mathbf{w}) \,\|\, p(\mathbf{w}))$, can be computed deterministically, but otherwise it may be approximated similarly. The estimator is differentiable and unbiased, so its gradient is also unbiased: $\nabla_\phi L_D(\phi) \simeq \nabla_\phi L_D^{\mathrm{SGVB}}(\phi)$. (Note that the described method is not limited to classification or regression and is straightforward to apply to other modeling settings like unsupervised models and temporal models.) We can proceed with variational Bayesian inference by randomly initializing $\phi$ and performing stochastic gradient ascent on $\mathcal{L}(\phi)$.

Variance of the SGVB estimator

The theory of stochastic approximation tells us that stochastic gradient ascent using the above estimator will asymptotically converge to a local optimum for an appropriately declining step size and sufficient weight updates, but in practice the performance of stochastic gradient ascent crucially
depends on the variance of the gradients if this variance is too large stochastic gradient descent will fail to make much progress in any reasonable amount of time our objective function consists of an expected log likelihood term that we approximate using monte carlo and kl divergence term dkl that we assume can be calculated analytically and otherwise be approximated with monte carlo with similar reparameterization assume that we draw minibatches of datapoints with replacement see appendix for similar analysis for minibatches without replacement using li as shorthand for log yi the contribution to the likelihood for the datapoint in the minibatch the monte carlo estimator pm may be rewritten as lsgvb li whose variance is given by var lsgvb var cov var li cov li lj where the variances and the data distribution and distribution covariances are both var li xi yi log yi with xi yi drawn from the empirical distribution defined by the training set as can be seen from the total contribution to the variance by var li is inversely proportional to the minibatch size however the total contribution by the covariances does not decrease with in practice this means that the variance of lsgvb can be dominated by the covariances for even moderately large local reparameterization trick we therefore propose an alternative estimator for which we have cov li lj so that the variance of our stochastic gradients scales as we then make this new estimator computationally efficient by not sampling directly but only sampling the intermediate variables through which influences lsgvb by doing so the global uncertainty in the weights is translated into form of local uncertainty that is independent across examples and easier to sample we refer to such reparameterization from global noise to local noise as the local reparameterization trick whenever source of global noise can be translated to local noise in the intermediate states of computation local reparameterization can be applied to yield computationally and statistically efficient gradient estimator such local reparameterization applies to fairly large family of models but is best explained through simple example consider standard fully connected neural network containing hidden layer consisting of neurons this layer receives an input feature matrix from the layer below which is multiplied by weight matrix before nonlinearity is applied aw we then specify the posterior approximation on the weights to be fully ized gaussian wi µi which means the weights are sampled as wi µi with in this case we could make sure that cov li lj by sampling separate weight matrix for each example in the minibatch but this is not computationally efficient we would need to sample million random numbers for just single layer of the neural network even if this could be done efficiently the computation following this step would become much harder where we originally performed simple product of the form aw this now turns into separate local products the theoretical complexity of this computation is higher but more importantly such computation can usually not be performed in parallel using fast blas basic linear algebra subprograms this also happens with other neural network architectures such as convolutional neural networks where optimized libraries for convolution can not deal with separate filter matrices per example fortunately the weights and therefore only influence the expected log likelihood through the neuron activations which are of much lower dimension if we can therefore sample 
the random activations directly without sampling or we may obtain an efficient monte carlo estimator at much lower cost for factorized gaussian posterior on the weights the posterior for the activations conditional on the input is also factorized gaussian wi µi bm with am µi and rather than sampling the gaussian weights and then computing the resulting activations we may thus sample the activations from their implied gaussian distribution directly using bm with here is an matrix so we only need to sample thousand random variables instead of million thousand fold savings in addition to yielding gradient estimator that is more computationally efficient than drawing separate weight matrices for each training example the local reparameterization trick also leads to an estimator that has lower variance to see why consider the stochastic gradient estimate with respect to the posterior parameter for minibatch of size drawing random weights we get lsgvb lsgvb am bm if on the other hand we form the same gradient using the local reparameterization trick we get lsgvb lsgvb bm here there are two stochastic terms the first is the backpropagated gradient lsgvb bm and the second is the sampled random noise or estimating the gradient with respect to then basically comes down to estimating the covariance between these two terms this is much easier to do for as there are much fewer of these individually they have higher correlation with the backpropagated gradient lsgvb bm so the covariance is easier to estimate in other words measuring the effect of on lsgvb bm is easy as is the only random variable directly influencing this gradient via bm on the other hand when sampling random weights there are thousand influencing each gradient term so their individual effects get lost in the noise in appendix we make this argument more rigorous and in section we show that it holds experimentally variational dropout dropout is technique for regularization of neural network parameters which works by adding multiplicative noise to the input of each layer of the neural network during optimization using the notation of section for fully connected neural network dropout corresponds to with where is the matrix of input features for the current minibatch is weight matrix and is the output matrix for the current layer before nonlinearity is applied the symbol denotes the elementwise hadamard product of the input matrix with matrix of independent noise variables by adding noise to the input during training the weight parameters are less likely to overfit to the training data as shown empirically by previous publications originally proposed drawing the elements of from bernoulli distribution with probability with the dropout rate later it was shown that using continuous distribution with the same relative mean and variance such as gaussian with works as well or better here we dropout with continuous noise as variational method and propose generalization that we call variational dropout in developing variational dropout we provide firm bayesian justification for dropout training by deriving its implicit prior distribution and variational objective this new interpretation allows us to propose several useful extensions to dropout such as principled way of making the normally fixed dropout rates adaptive to the data variational dropout with independent weight noise if the elements of the noise matrix are drawn independently from gaussian the marginal distributions of the activations bm are gaussian as well bm with am and making use 
of this fact proposed gaussian dropout regularization method where instead of applying the activations are directly drawn from their approximate or exact marginal distributions as given by argued that these marginal distributions are exact for gaussian noise and for bernoulli noise still approximately gaussian because of the central limit theorem this ignores the dependencies between the different elements of as present using but report good results nonetheless as noted by and explained in appendix this gaussian dropout noise can also be interpreted as arising from bayesian treatment of neural network with weights that multiply the input to give aw where the posterior distribution of the weights is given by factorized gaussian with wi from this perspective the marginal distributions then arise through the application of the local reparameterization trick as introduced in section the variational objective corresponding to this interpretation is discussed in section variational dropout with correlated weight noise instead of ignoring the dependencies of the activation noise as in section we may retain the dependencies by interpreting dropout as form of correlated weight noise bm am with wk and wi si with si where am is row of the input matrix and bm row of the output the wi are the rows of the weight matrix each of which is constructed by multiplying parameter vector by stochastic scale variable si the distribution on these scale variables we interpret as bayesian posterior distribution the weight parameters and the biases are estimated using maximum likelihood the original gaussian dropout sampling procedure can then be interpreted as arising from local reparameterization of our posterior on the weights dropout prior and variational objective the posterior distributions proposed in sections and have in common that they can be decomposed into parameter vector that captures the mean and multiplicative noise term determined by parameters any posterior distribution on for which the noise enters this multiplicative way we will call dropout posterior note that many common distributions such as univariate gaussians with nonzero mean can be reparameterized to meet this requirement during dropout training is adapted to maximize the expected log likelihood ld for this to be consistent with the optimization of variational lower bound of the form in the prior on the weights has to be such that dkl does not depend on in appendix we show that the only prior that meets this requirement is the scale invariant prior log prior that is uniform on the of the weights or the si for section as explained in appendix this prior has an interesting connection with the floating point format for storing numbers from an mdl perspective the floating point format is optimal for communicating numbers drawn from this prior conversely the kl divergence dkl with this prior has natural interpretation as regularizing the number of significant digits our posterior stores for the weights wi in the format putting the expected log likelihood and penalty together we see that dropout training maximizes the following variatonal lower bound ld dkl where we have made the dependence on the and parameters explicit the noise parameters the dropout rates are commonly treated as hyperparameters that are kept fixed during training for the prior this then corresponds to fixed limit on the number of significant digits we can learn for each of the weights wi in section we discuss the possibility of making this limit adaptive by also maximizing 
the lower bound with respect to for the choice of factorized gaussian approximate posterior with wi as discussed in section the lower bound is analyzed in detail in appendix there it is shown that for this particular choice of posterior the negative dkl is not analytically tractable but can be approximated extremely accurately using dkl wi wi constant log with the same expression may be used to calculate the corresponding term posterior approximation of section dkl for the adaptive regularization through optimizing the dropout rate the noise parameters used in dropout training the dropout rates are usually treated as fixed hyperparameters but now that we have derived dropout variational objective making these parameters adaptive is trivial simply maximize the variational lower bound with respect to we can use this to learn separate dropout rate per layer per neuron of even per separate weight in section we look at the predictive performance obtained by making adaptive we found that very large values of correspond to local optima from which it is hard to escape due to gradients to avoid such local optima we found it beneficial to set constraint during training we maximize the posterior variance at the square of the posterior mean which corresponds to dropout rate of related work pioneering work in practical variational inference for neural networks was done in where biased variational lower bound estimator was introduced with good results on recurrent neural network models in later work it was shown that even more practical estimators can be formed for most types of continuous latent variables or parameters using reparameterization trick leading to efficient and unbiased stochastic variational inference these works focused on an application to inference extensive empirical results on inference of global model parameters were reported in including succesful application to reinforcement learning these earlier works used the relatively estimator upon which we improve variable reparameterizations have long history in the statistics literature but have only recently found use for efficient machine learning and inference related is also probabilistic backpropagation an algorithm for inferring marginal posterior probabilities however it requires certain tractabilities in the network making it insuitable for the type of models under consideration in this paper as we show here regularization by dropout can be interpreted as variational inference dropconnect is similar to dropout but with binary noise on the weights rather than hidden units dropconnect thus has similar interpretation as variational inference with uniform prior over the weights and mixture of two dirac peaks as posterior in standout was introduced variation of dropout where binary belief network is learned for producing dropout rates recently proposed another bayesian perspective on dropout in recent work similar reparameterization is described and used for variational inference their focus is on approximations of the variational bound rather than unbiased monte carlo estimators and also investigate bayesian perspective on dropout but focus on the binary variant reports various encouraging results on the utility of dropout implied prediction uncertainty experiments we compare our method to standard binary dropout and two popular versions of gaussian dropout which we ll denote with type and type with gaussian dropout type we denote the gaussian dropout from type denotes the gaussian dropout from this way the method names correspond to 
the matrix names in section or where noise is injected models were implemented in theano and optimization was performed using adam with default and temporal averaging two types of variational dropout were included type is correlated weight noise as introduced in section an adaptive version of gaussian dropout type variational dropout type has independent weight uncertainty as introduced in section and corresponds to gaussian dropout type de facto standard benchmark for regularization methods is the task of mnist digit classification we choose the same architecture as fully connected neural network with hidden layers and rectified linear units relus we follow the dropout recommendations from these earlier publications which is dropout rate of for the hidden layers and for the input layer we used early stopping with all methods where the amount of epochs to run was determined based on performance on validation set variance we start out by empirically comparing the variance of the different available stochastic estimators of the gradient of our variational objective to do this we train the neural network described above for either epochs test error or epochs test error using variational dropout with independent weight noise after training we calculate the gradients for the weights of the top and bottom level of our network on the full training set and compare against the gradient estimates per batch of training examples appendix contains the same analysis for the case of variational dropout with correlated weight noise table shows that the local reparameterization trick yields the lowest variance among all variational dropout estimators for all conditions although it is still substantially higher compared to not having any dropout regularization the variance scaling achieved by our estimator is especially important early on in the optimization when it makes the largest difference compare weight sample per minibatch and weight sample per data point the additional variance reduction obtained by our estimator through drawing fewer random numbers section is about factor of and this remains relatively stable as training progresses compare local reparameterization and weight sample per data point stochastic gradient estimator local reparameterization ours weight sample per data point slow weight sample per minibatch standard no dropout noise minimal var top layer epochs top layer epochs bottom layer epochs bottom layer epochs table average empirical variance of minibatch stochastic gradient estimates examples for fully connected neural network regularized by variational dropout with independent weight noise speed we compared the regular sgvb estimator with separate weight samples per datapoint with the efficient estimator based on local reparameterization in terms of time efficiency with our implementation on modern gpu optimization with the estimator took seconds per epoch while the efficient estimator took seconds an over fold speedup classification error figure shows classification error for the tested regularization methods for various choices of number of hidden units our adaptive variational versions of gaussian dropout perform equal or better than their counterparts and standard dropout under all tested conditions the difference is especially noticable for the smaller networks in these smaller networks we observe that variational dropout infers dropout rates that are on average far lower than the dropout rates for larger networks this adaptivity comes at negligable computational cost 
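As a concrete illustration of the local reparameterization trick used throughout these experiments, the following is a minimal NumPy sketch of a single fully connected layer with a factorized Gaussian posterior on the weights. The variable names are illustrative; the experiments above were implemented in Theano, and a real implementation would live inside an autodiff framework so that gradients with respect to the posterior parameters can be backpropagated.

```python
import numpy as np

def local_reparam_layer(A, mu, log_sigma2, rng=None):
    """Sample the layer activations B directly from their implied Gaussian,
    instead of sampling a separate weight matrix for every example.

    A:          (M, K) minibatch of input activations
    mu:         (K, D) posterior means of the weights
    log_sigma2: (K, D) log posterior variances of the weights
    returns:    (M, D) sampled activations
    """
    if rng is None:
        rng = np.random.default_rng()
    gamma = A @ mu                            # activation means
    delta = (A ** 2) @ np.exp(log_sigma2)     # activation variances
    zeta = rng.standard_normal(gamma.shape)   # one noise draw per activation
    return gamma + np.sqrt(delta) * zeta
```

Only M x D noise variables are drawn per layer, rather than one weight sample per example, which is what gives the estimator its lower cost and the lower gradient variance reported in the comparison above.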
classification error on the mnist dataset classification error on the dataset figure best viewed in color comparison of various dropout methods when applied to fullyconnected neural networks for classification on the mnist dataset shown is the classification error of networks with hidden layers averaged over runs he variational versions of gaussian dropout perform equal or better than their counterparts the difference is especially large with smaller models where regular dropout often results in severe underfitting comparison of dropout methods when applied to convolutional net trained on the dataset for different settings of network size the network has two convolutional layers with each and feature maps respectively each with stride and followed by softplus nonlinearity this is followed by two fully connected layers with each hidden units we found that slightly downscaling the kl divergence part of the variational objective can be beneficial variational in figure denotes performance of type variational dropout but with downscaled with factor of this small modification seems to prevent underfitting and beats all other dropout methods in the tested models conclusion efficiency of posterior inference using stochastic variational bayes sgvb can often be significantly improved through local reparameterization where global parameter uncertainty is translated into local uncertainty per datapoint by injecting noise locally instead of globally at the model parameters we obtain an efficient estimator that has low computational complexity can be trivially parallelized and has low variance we show how dropout is special case of sgvb with local reparameterization and suggest variational dropout straightforward extension of regular dropout where optimal dropout rates are inferred from the data rather than fixed in advance we report encouraging empirical results acknowledgments we thank the reviewers and yarin gal for valuable feedback diederik kingma is supported by the google european fellowship in deep learning max welling is supported by research grants from google and facebook and the nwo project in natural ai references ahn korattikara and welling bayesian posterior sampling via stochastic gradient fisher scoring arxiv preprint ba and frey adaptive dropout for training deep neural networks in advances in neural information processing systems pages bayer karol korhammer and van der smagt fast adaptive weight noise arxiv preprint bengio estimating or propagating gradients through stochastic neurons arxiv preprint bergstra breuleux bastien lamblin pascanu desjardins turian and bengio theano cpu and gpu math expression compiler in proceedings of the python for scientific computing conference scipy volume blundell cornebise kavukcuoglu and wierstra weight uncertainty in neural networks arxiv preprint gal and ghahramani dropout as bayesian approximation representing model uncertainty in deep learning arxiv preprint graves practical variational inference for neural networks in advances in neural information processing systems pages and adams probabilistic backpropagation for scalable learning of bayesian neural networks arxiv preprint hinton srivastava krizhevsky sutskever and salakhutdinov improving neural networks by preventing of feature detectors arxiv preprint hinton and van camp keeping the neural networks simple by minimizing the description length of the weights in proceedings of the sixth annual conference on computational learning theory pages acm kingma and ba adam method for stochastic 
optimization proceedings of the international conference on learning representations kingma fast inference with continuous latent variable models in auxiliary form arxiv preprint kingma and welling variational bayes proceedings of the international conference on learning representations maeda bayesian encourages dropout arxiv preprint neal bayesian learning for neural networks phd thesis university of toronto rezende mohamed and wierstra stochastic backpropagation and approximate inference in deep generative models in proceedings of the international conference on machine learning pages robbins and monro stochastic approximation method the annals of mathematical statistics salimans and knowles variational posterior approximation through stochastic linear regression bayesian analysis srivastava hinton krizhevsky sutskever and salakhutdinov dropout simple way to prevent neural networks from overfitting the journal of machine learning research wan zeiler zhang cun and fergus regularization of neural networks using dropconnect in proceedings of the international conference on machine learning pages wang and manning fast dropout training in proceedings of the international conference on machine learning pages welling and teh bayesian learning via stochastic gradient langevin dynamics in proceedings of the international conference on machine learning pages 
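A final sketch in the same style contrasts standard Gaussian dropout with the adaptive version proposed above. Gaussian dropout multiplies the inputs by noise drawn from a Gaussian with mean 1 and a fixed variance alpha, whereas variational dropout treats alpha as a trainable parameter (per layer, per neuron, or per weight), kept at or below 1 during training, which corresponds to a dropout rate of at most one half. The helpers below are an illustration under those assumptions, not the authors' implementation; the KL term and the training loop are omitted.

```python
import numpy as np

def gaussian_dropout(A, alpha, rng=None):
    """Multiplicative Gaussian noise on the inputs: B = A * xi with xi ~ N(1, alpha).
    In standard Gaussian dropout, alpha is a fixed hyperparameter
    (e.g. alpha = p / (1 - p) for a Bernoulli dropout rate p)."""
    if rng is None:
        rng = np.random.default_rng()
    xi = 1.0 + np.sqrt(alpha) * rng.standard_normal(A.shape)
    return A * xi

def variational_dropout_noise(A, log_alpha, rng=None):
    """Variational dropout: alpha is learned by maximizing the variational
    lower bound; here it is simply read from a trainable parameter and
    clipped at 1, matching the constraint used in the experiments above."""
    alpha = np.minimum(np.exp(log_alpha), 1.0)
    return gaussian_dropout(A, alpha, rng)
```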
